Closed
Description
As far as I understand, weighted.mean is the first GForce-optimised function (introduced in #5246) that takes a second non-scalar argument. However, this introduces some complications with .SD. Writing out every column to be computed works, of course:
options(datatable.quiet = TRUE,
        datatable.verbose = TRUE)
library(data.table)
DT <- as.data.table(mtcars)
DT[, .(mpg = weighted.mean(mpg, disp),
       hp  = weighted.mean(hp, disp)),
   by = "cyl"]
#> Argument 'by' after substitute: "cyl"
#> Detected that j uses these columns: [mpg, disp, hp]
#> Finding groups using forderv ... forder.c received 32 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 3
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as 'list(weighted.mean(mpg, disp), weighted.mean(hp, disp))'
#> GForce optimized j to 'list(gweighted.mean(mpg, disp), gweighted.mean(hp, disp))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.001
#> gforce assign high and low took 0.001
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> gforce eval took 0.001
#> 0.020s elapsed (0.010s cpu)
#> cyl mpg hp
#> <num> <num> <num>
#> 1: 6 19.77198 119.86409
#> 2: 4 25.81985 84.75037
#> 3: 8 14.86285 210.28867

But what to do with .SD? Anonymous functions are not GForce-optimised, as the verbose output reports (GForce FALSE):
DT[, lapply(.SD, \(x) weighted.mean(x, disp)), .SDcols = c("mpg", "hp"), by = "cyl"]
#> Argument 'by' after substitute: "cyl"
#> Finding groups using forderv ... forder.c received 32 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 3
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization changed j from 'lapply(.SD, function(x) weighted.mean(x, disp))' to 'list(..FUN1(mpg), ..FUN1(hp))'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ...
#> collecting discontiguous groups took 0.000s for 3 groups
#> eval(j) took 0.000s for 3 calls
#> 0.000s elapsed (0.000s cpu)
#> cyl mpg hp
#> <num> <num> <num>
#> 1: 6 19.77198 119.86409
#> 2: 4 25.81985 84.75037
#> 3: 8 14.86285 210.28867

And we can't simply pass an argument in the form of lapply(.SD, weighted.mean, disp) because disp doesn't exist in the evaluating environment. One workaround could be to use get():
DT[, lapply(.SD, weighted.mean, get("disp")), .SDcols = c("mpg", "hp"), by = "cyl"]
#> Argument 'by' after substitute: "cyl"
#> '(m)get' found in j. ansvars being set to all columns. Use .SDcols or a single j=eval(macro) instead. Both will detect the columns used which is important for efficiency.
#> Old ansvars: [mpg, hp]
#> New ansvars: [mpg, hp, disp, drat, wt, qsec, vs, am, gear, carb]
#> Finding groups using forderv ... forder.c received 32 rows and 1 columns
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forder.c received a vector type 'integer' length 3
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization changed j from 'lapply(.SD, weighted.mean, get("disp"))' to 'list(weighted.mean(mpg, get("disp")), weighted.mean(hp, get("disp")))'
#> GForce optimized j to 'list(gweighted.mean(mpg, get("disp")), gweighted.mean(hp, get("disp")))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.001
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> gforce eval took 0.001
#> 0.000s elapsed (0.000s cpu)
#> cyl mpg hp
#> <num> <num> <num>
#> 1: 6 19.77198 119.86409
#> 2: 4 25.81985 84.75037
#> 3: 8 14.86285 210.28867

Is there a more straightforward solution? Should data.table add support for this use case? If not, an example in the docs would be useful.
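
One possible direction, offered as a rough sketch rather than a confirmed recommendation: build the j expression programmatically and pass it through eval(), which is what the verbose message above hints at ("a single j=eval(macro)"). The names cols and j_expr are purely illustrative. Whether GForce is actually applied to this form should be checked with datatable.verbose = TRUE; the sketch only shows that the result matches the written-out version.

```r
library(data.table)

DT <- as.data.table(mtcars)

# Columns to average; `disp` is used as the weight for each of them.
cols <- c("mpg", "hp")

# Build one weighted.mean(<col>, disp) call per column, named after the column.
calls <- setNames(
  lapply(cols, function(col) call("weighted.mean", as.name(col), quote(disp))),
  cols
)

# Assemble list(mpg = weighted.mean(mpg, disp), hp = weighted.mean(hp, disp)).
j_expr <- as.call(c(quote(list), calls))

# Evaluate the constructed expression as j.
res <- DT[, eval(j_expr), by = "cyl"]
```

Because the constructed expression is the same as the one typed out by hand in the first example, it avoids get() and the resulting "ansvars being set to all columns" message.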