My question builds on a similar one by imposing an additional constraint that the name of each variable should appear only once.
Consider a data frame
library( tidyverse )
df <- tibble( potentially_long_name_i_dont_want_to_type_twice = 1:10,
another_annoyingly_long_name = 21:30 )
I would like to apply mean to the first column and sum to the second column, without unnecessarily typing each column name twice.
As the question I linked above shows, summarize allows you to do this, but requires that the name of each column appears twice. On the other hand, summarize_at allows you to succinctly apply multiple functions to multiple columns, but it does so by calling all specified functions on all specified columns, instead of doing it in a one-to-one fashion. Is there a way to combine these distinct features of summarize and summarize_at?
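To make the contrast concrete, here is a minimal sketch of the two behaviors on the data frame above. The result names `one_to_one` and `all_by_all` are introduced only for illustration.

```r
library(dplyr)

df <- tibble(potentially_long_name_i_dont_want_to_type_twice = 1:10,
             another_annoyingly_long_name = 21:30)

# summarize(): one function per column, but every name is typed twice
one_to_one <- df %>% summarize(
  potentially_long_name_i_dont_want_to_type_twice =
    mean(potentially_long_name_i_dont_want_to_type_twice),
  another_annoyingly_long_name = sum(another_annoyingly_long_name))

# summarize_at(): every name is typed once, but each function is applied
# to each column, so the result has four columns instead of two
all_by_all <- df %>% summarize_at(
  vars(potentially_long_name_i_dont_want_to_type_twice,
       another_annoyingly_long_name),
  list(mean = mean, sum = sum))
```

The first call gives the two values I want; the second also computes the sum of the first column and the mean of the second, which I would then have to drop.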
I was able to hack it with rlang, but I'm not sure if it's any cleaner than just typing each variable twice:
v <- c("potentially_long_name_i_dont_want_to_type_twice",
"another_annoyingly_long_name")
f <- list(mean,sum)
## Pair each column symbol with its function: mean(<column 1>), sum(<column 2>)
smrz <- set_names(v) %>% map(sym) %>% map2( f, ~rlang::call2(.y,.x) )
## Desired output
df %>% summarize( !!!smrz )
# # A tibble: 1 x 2
# potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
# <dbl> <int>
# 1 5.5 255
EDIT to address some philosophical points
I don’t think that wanting to avoid the x=f(x) idiom is unreasonable. I probably came across as a bit overzealous about typing long names, but the real issue is having (relatively) long names that are very similar to each other. Examples include nucleotide sequences (e.g., AGCCAGCGGAAACAGTAAGG) and TCGA barcodes. Not only is autocomplete of limited utility in such cases, but writing things like AGCCAGCGGAAACAGTAAGG = sum( AGCCAGCGGAAACAGTAAGG ) introduces unnecessary coupling and increases the risk that the two sides of the assignment will accidentally go out of sync as the code is developed and maintained.
I completely agree with @MrFlick about dplyr increasing code readability, but I don’t think that readability should come at the cost of correctness. Functions like summarize_at and mutate_at are brilliant, because they strike a perfect balance between placing operations next to their operands (clarity) and guaranteeing that the result is written to the correct column (correctness).
By the same token, I feel that the proposed solutions which remove variable mention altogether swing too far in the other direction. They are clever, and I certainly appreciate the extra typing they save, but by removing the association between functions and variable names, such solutions rely on the variables being in the proper order, which creates its own risk of accidental errors.
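The ordering risk can be sketched as follows. This is a hypothetical positional pairing written for illustration, not a reproduction of any particular proposed solution; `apply_by_position` is a made-up helper name.

```r
library(dplyr)
library(purrr)

df <- tibble(potentially_long_name_i_dont_want_to_type_twice = 1:10,
             another_annoyingly_long_name = 21:30)

f <- list(mean, sum)

# Hypothetical helper: the i-th function is applied to the i-th column,
# so no column name is mentioned at all
apply_by_position <- function(data, fns) {
  map2(data, fns, function(col, fn) fn(col))
}

original  <- apply_by_position(df, f)

# The same data with its columns in the opposite order: nothing fails,
# but mean and sum are now applied to the wrong columns
reordered <- apply_by_position(df[, 2:1], f)
```

With the original column order, the first column gets mean and the second gets sum. After the reordering, the call still runs without complaint but computes the mean of 21:30 and the sum of 1:10.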
In short, I believe that a self-mutating / self-summarizing operation should mention each variable name exactly once.
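As a rough sketch of the shape I have in mind, and not a claim that this is the cleanest way to do it, the rlang hack above could take a single named list in which each column name appears once, right next to its function. The names `ops` and `result` are introduced here for illustration.

```r
library(dplyr)
library(purrr)
library(rlang)

df <- tibble(potentially_long_name_i_dont_want_to_type_twice = 1:10,
             another_annoyingly_long_name = 21:30)

# Each column name is written exactly once, paired with its function
ops <- list(potentially_long_name_i_dont_want_to_type_twice = mean,
            another_annoyingly_long_name = sum)

# imap() passes the function as .x and the column name as .y;
# call2() builds the calls mean(<column>) and sum(<column>)
smrz <- imap(ops, ~ call2(.x, sym(.y)))

# Splice the named calls into summarize(); the names become output columns
result <- df %>% summarize(!!!smrz)
```

Because the function and the column name live in the same list element, reordering the entries of `ops` cannot pair a function with the wrong column.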