We can apply generic speed-up strategies:
- Do less
- Choose an appropriate back-end
- Use appropriate data structures
dplyr provides syntactic sugar for data manipulation, but may not be the most efficient when it comes to handling large data sets.
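For reference, the starting point is a plain dplyr aggregation along the lines of the sketch below. This is a guess at @www's approach, assuming df holds the columns Month, ID, Qty, Sales, Leads and Region; the actual answer may differ in details.

library(dplyr)

# group by Month and ID, sum the numeric columns, keep the first Region per group
df %>%
  group_by(Month, ID) %>%
  summarise(across(Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region),
            .groups = "drop")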
Solution 1
We could rewrite the code slightly to be more efficient using the collapse package, which provides fast C/C++-based analogues of the main dplyr verbs. These carry the dplyr names prefixed with f (fgroup_by, fsummarise, ...), with one exception: fsubset, which corresponds to dplyr::filter (or base R subset).
library(collapse)
df |>
  fgroup_by(Month, ID) |>
  fsummarise(Qty = fsum(Qty),
             Sales = fsum(Sales),
             Leads = fsum(Leads),
             Region = fsubset(Region, 1L),
             keep.group_vars = TRUE) |>
  as_tibble() # optional
#> # A tibble: 4 x 6
#>   Month    ID   Qty Sales Leads Region
#>   <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 April    11   230  2100    22 East
#> 2 June     11   260  2450    15 North
#> 3 May      10   110  1000     8 East
#> 4 May      12   110   900     9 North
Here |> (requires R >= 4.1.0) is a slightly faster pipe than magrittr's %>%. The result of the pipeline is ungrouped.
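The overhead difference between the two pipes can be checked directly; a minimal sketch, where f is just a placeholder function and %>% is assumed to come from the already-attached magrittr/dplyr:

f <- \(x) x # trivial placeholder function

bench::mark(native = 1 |> f(), magrittr = 1 %>% f(),
            iterations = 1e4)[c(1, 3)]

Because |> is expanded by the parser into an ordinary call f(1), it adds essentially no overhead, while %>% is itself a function call.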
Solution 2
data.table is often lauded for its speed, memory efficiency and utility. The easiest way to convert existing dplyr code to data.table is the dtplyr package, which is installed with the tidyverse (though not attached by default). The conversion requires adding only two lines of code.
library(dtplyr)
df1 <- lazy_dt(df)
df1 %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region)) %>%
  as_tibble() # or as.data.table()
Note that the result is an ungrouped data frame.
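If you are curious what code dtplyr generates, you can inspect the lazy query before collecting it with show_query(); a quick sketch:

df1 %>%
  group_by(Month, ID) %>%
  summarize(across(.cols = Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region)) %>%
  show_query() # prints the translated data.table expression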
Benchmarks
Each approach is wrapped in a function; dplyr here refers to @www's approach. All approaches return a tibble.
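The wrapper definitions are not shown in full; a minimal sketch of how they might look, wrapping the pipelines shown above and assuming df is a tibble (the dplyr body follows the baseline sketch near the top):

collapse <- \(x) x |>
  fgroup_by(Month, ID) |>
  fsummarise(Qty = fsum(Qty), Sales = fsum(Sales), Leads = fsum(Leads),
             Region = fsubset(Region, 1L), keep.group_vars = TRUE) |>
  as_tibble()

dplyr <- \(x) x %>%
  group_by(Month, ID) %>%
  summarise(across(Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region), .groups = "drop")

dtplyr <- \(x) lazy_dt(x) %>%
  group_by(Month, ID) %>%
  summarize(across(Qty:Leads, ~ sum(.x, na.rm = TRUE)),
            Region = first(Region)) %>%
  as_tibble()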
bench::mark(collapse = collapse(df), dplyr = dplyr(df), dtplyr = dtplyr(df),
            time_unit = "ms", iterations = 200)[c(1, 3, 5, 7)]
# A tibble: 3 x 4
  expression median mem_alloc n_itr
  <bch:expr>  <dbl> <bch:byt> <int>
1 collapse    0.316        0B   200
2 dplyr       5.42     8.73KB   195
3 dtplyr      6.67   120.21KB   196
We can see that collapse is more memory-efficient and significantly faster than dplyr. The dtplyr approach is included because its scaling with data size differs from dplyr's and because the rewrite is so convenient.
Per @www's request, a pure data.table approach is included below, with the wrapper functions rewritten for brevity. The input/output is a data.frame for collapse and a data.table for data.table, respectively.
data.table = \(x){
  setDT(x)
  cols = c("Qty", "Sales", "Leads")
  x[, c(lapply(.SD, sum, na.rm = TRUE), Region = first(Region)),
    .SDcols = cols, by = .(Month, ID)][]
}

# retaining the `|>` pipes for readability; impact is ~4 µs.
collapse = \(x) x |>
  fgroup_by(Month, ID) |>
  fsummarise(Qty = fsum(Qty), Sales = fsum(Sales), Leads = fsum(Leads),
             Region = fsubset(Region, 1L), keep.group_vars = TRUE)
dt <- as.data.table(df)
bench::mark(collapse(df), iterations = 10e3)[c(1, 3, 5, 7)]
bench::mark(data.table(dt), iterations = 10e3)[c(1, 3, 5, 7)]
  expression     median mem_alloc n_itr
  <bch:expr>   <bch:tm> <bch:byt> <int>
1 collapse(df)    150us        0B  9988
2 data.table(dt)  796us     146KB  9939
For such a small data set, the difference between collapse and pure data.table is negligible in absolute terms. The speed increase is likely due to the use of fsum instead of base R's sum.
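To test that claim in isolation, one could compare a grouped base R sum against fsum directly; x and g below are toy data made up for illustration:

g <- sample(letters[1:4], 1e5, replace = TRUE)
x <- rnorm(1e5)

bench::mark(base = tapply(x, g, sum),
            collapse = fsum(x, g = g),
            check = FALSE, iterations = 1e3)[c(1, 3, 5, 7)]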