Running GForce optimized weighted.mean with .SD #5628

@svraka

As far as I understand, weighted.mean is the first GForce-optimised function (introduced in #5246) that takes a second non-scalar argument, and this complicates its use with .SD. Writing out every column to be computed works, of course:

options(datatable.quiet = TRUE,
        datatable.verbose = TRUE)
library(data.table)

DT <- as.data.table(mtcars)

DT[, .(mpg = weighted.mean(mpg, disp),
       hp = weighted.mean(hp, disp)),
   by = "cyl"]
#> Argument 'by' after substitute: "cyl"
#> Detected that j uses these columns: [mpg, disp, hp]
#> Finding groups using forderv ... forder.c received 32 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 3
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(weighted.mean(mpg, disp), weighted.mean(hp, disp))'
#> GForce optimized j to 'list(gweighted.mean(mpg, disp), gweighted.mean(hp, disp))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.001
#> gforce assign high and low took 0.001
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> gforce eval took 0.001
#> 0.020s elapsed (0.010s cpu)
#>      cyl      mpg        hp
#>    <num>    <num>     <num>
#> 1:     6 19.77198 119.86409
#> 2:     4 25.81985  84.75037
#> 3:     8 14.86285 210.28867

But what to do with .SD? Anonymous functions are not GForce-optimised, so the log below reports GForce FALSE:

DT[, lapply(.SD, \(x) weighted.mean(x, disp)), .SDcols = c("mpg", "hp"), by = "cyl"]
#> Argument 'by' after substitute: "cyl"
#> Finding groups using forderv ... forder.c received 32 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 3
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization changed j from 'lapply(.SD, function(x) weighted.mean(x, disp))' to 'list(..FUN1(mpg), ..FUN1(hp))'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... 
#>   collecting discontiguous groups took 0.000s for 3 groups
#>   eval(j) took 0.000s for 3 calls
#> 0.000s elapsed (0.000s cpu)
#>      cyl      mpg        hp
#>    <num>    <num>     <num>
#> 1:     6 19.77198 119.86409
#> 2:     4 25.81985  84.75037
#> 3:     8 14.86285 210.28867
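
Reading the verbose log by eye gets tedious when comparing several variants. The following is a minimal sketch of a check that captures the log and searches it for the "GForce optimized j" message seen above. The helper name `gforce_used` is hypothetical, and the approach assumes the message text stays the same across data.table versions.

```r
library(data.table)

DT <- as.data.table(mtcars)

# Hypothetical helper: evaluates a data.table query while capturing its
# verbose output, then reports whether the GForce message appears in it.
gforce_used <- function(expr) {
  log <- capture.output(force(expr))
  any(grepl("GForce optimized j", log, fixed = TRUE))
}

# A plain grouped sum, which GForce is expected to optimise.
plain_sum <- gforce_used(
  DT[, .(mpg = sum(mpg)), by = "cyl", verbose = TRUE]
)

# The anonymous-function form from above.
anon_fun <- gforce_used(
  DT[, lapply(.SD, function(x) weighted.mean(x, disp)),
     .SDcols = c("mpg", "hp"), by = "cyl", verbose = TRUE]
)
```

Here `plain_sum` should come out TRUE and `anon_fun` FALSE, matching the logs in this report.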

And we can't simply pass the weights in the form of lapply(.SD, weighted.mean, disp) because disp doesn't exist in the evaluating environment. One workaround could be to use get(), although, as the log shows, this makes data.table set ansvars to all columns:

DT[, lapply(.SD, weighted.mean, get("disp")), .SDcols = c("mpg", "hp"), by = "cyl"]
#> Argument 'by' after substitute: "cyl"
#> '(m)get' found in j. ansvars being set to all columns. Use .SDcols or a single j=eval(macro) instead. Both will detect the columns used which is important for efficiency.
#> Old ansvars: [mpg, hp] 
#> New ansvars: [mpg, hp, disp, drat, wt, qsec, vs, am, gear, carb] 
#> Finding groups using forderv ... forder.c received 32 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 3
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization changed j from 'lapply(.SD, weighted.mean, get("disp"))' to 'list(weighted.mean(mpg, get("disp")), weighted.mean(hp, get("disp")))'
#> GForce optimized j to 'list(gweighted.mean(mpg, get("disp")), gweighted.mean(hp, get("disp")))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.001
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> gforce eval took 0.001
#> 0.000s elapsed (0.000s cpu)
#>      cyl      mpg        hp
#>    <num>    <num>     <num>
#> 1:     6 19.77198 119.86409
#> 2:     4 25.81985  84.75037
#> 3:     8 14.86285 210.28867
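
Another possible workaround, sketched here under assumptions, is to build the explicit `list(weighted.mean(mpg, disp), ...)` call programmatically and pass it as a single `j = eval(...)`, which is the form the verbose message above recommends. The names `cols`, `w`, `calls` and `j_call` are illustrative only. Whether GForce actually kicks in depends on the data.table version, so the verbose log should be checked rather than assumed.

```r
library(data.table)

DT <- as.data.table(mtcars)
cols <- c("mpg", "hp")  # columns to average (illustrative choice)
w <- "disp"             # weight column

# Build one weighted.mean(<col>, disp) call per column, named after the column.
calls <- lapply(cols, function(col) {
  call("weighted.mean", as.name(col), as.name(w))
})
names(calls) <- cols

# Assemble list(mpg = weighted.mean(mpg, disp), hp = weighted.mean(hp, disp)).
j_call <- as.call(c(list(as.name("list")), calls))

# data.table evaluates the constructed call as if it had been typed in j.
res <- DT[, eval(j_call), by = "cyl"]
```

Because `j` ends up as the same expression as the hand-written version at the top, only the columns actually used are detected, unlike with get().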

Is there a more straightforward solution? Should data.table add support for this use case? If not, an example in the docs would be useful.
