Running GForce optimized `weighted.mean` with `.SD` #5628

As far as I understand, `weighted.mean` is the first GForce-optimised function (introduced in https://github.com/Rdatatable/data.table/pull/5246) that takes a second non-scalar argument. This makes it awkward to use with `.SD`. Writing out every column to be computed works, of course:

``` r
options(datatable.quiet = TRUE,
        datatable.verbose = TRUE)
library(data.table)

DT <- as.data.table(mtcars)

DT[, .(mpg = weighted.mean(mpg, disp),
       hp = weighted.mean(hp, disp)),
   by = "cyl"]
#> Argument 'by' after substitute: "cyl"
#> Detected that j uses these columns: [mpg, disp, hp]
#> Finding groups using forderv ... forder.c received 32 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 3
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as 'list(weighted.mean(mpg, disp), weighted.mean(hp, disp))'
#> GForce optimized j to 'list(gweighted.mean(mpg, disp), gweighted.mean(hp, disp))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.001
#> gforce assign high and low took 0.001
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> gforce eval took 0.001
#> 0.020s elapsed (0.010s cpu)
#>      cyl      mpg        hp
#>    <num>    <num>     <num>
#> 1:     6 19.77198 119.86409
#> 2:     4 25.81985  84.75037
#> 3:     8 14.86285 210.28867
```
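As a sanity check on the numbers (not on the optimisation itself), the same result can be written as a ratio of two grouped sums. The four `gsum` lines in the verbose output above look consistent with each weighted mean being computed that way, but that is only my reading of the log; I have not confirmed it in the source. The names `by_sums` and `by_wmean` are just for illustration.

``` r
library(data.table)

DT <- as.data.table(mtcars)

# Weighted mean spelled out as sum(x * w) / sum(w) within each group.
by_sums <- DT[, .(mpg = sum(mpg * disp) / sum(disp),
                  hp  = sum(hp * disp) / sum(disp)),
              by = "cyl"]

# The same aggregation using weighted.mean() directly.
by_wmean <- DT[, .(mpg = weighted.mean(mpg, disp),
                   hp  = weighted.mean(hp, disp)),
               by = "cyl"]

# Compare the two results up to floating-point tolerance.
all.equal(by_sums, by_wmean)
```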

But what about `.SD`? An anonymous function is not GForce-optimised, as the `GForce FALSE` line in the verbose output shows:

``` r
DT[, lapply(.SD, \(x) weighted.mean(x, disp)), .SDcols = c("mpg", "hp"), by = "cyl"]
#> Argument 'by' after substitute: "cyl"
#> Finding groups using forderv ... forder.c received 32 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 3
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization changed j from 'lapply(.SD, function(x) weighted.mean(x, disp))' to 'list(..FUN1(mpg), ..FUN1(hp))'
#> GForce is on, left j unchanged
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... 
#>   collecting discontiguous groups took 0.000s for 3 groups
#>   eval(j) took 0.000s for 3 calls
#> 0.000s elapsed (0.000s cpu)
#>      cyl      mpg        hp
#>    <num>    <num>     <num>
#> 1:     6 19.77198 119.86409
#> 2:     4 25.81985  84.75037
#> 3:     8 14.86285 210.28867
```

And we can't simply pass an argument in the form of `lapply(.SD, weighted.mean, disp)` because `disp` doesn't exist in the evaluating environment. One workaround could be to use `get()`:

``` r
DT[, lapply(.SD, weighted.mean, get("disp")), .SDcols = c("mpg", "hp"), by = "cyl"]
#> Argument 'by' after substitute: "cyl"
#> '(m)get' found in j. ansvars being set to all columns. Use .SDcols or a single j=eval(macro) instead. Both will detect the columns used which is important for efficiency.
#> Old ansvars: [mpg, hp] 
#> New ansvars: [mpg, hp, disp, drat, wt, qsec, vs, am, gear, carb] 
#> Finding groups using forderv ... forder.c received 32 rows and 1 columns
#> 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 3
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization changed j from 'lapply(.SD, weighted.mean, get("disp"))' to 'list(weighted.mean(mpg, get("disp")), weighted.mean(hp, get("disp")))'
#> GForce optimized j to 'list(gweighted.mean(mpg, get("disp")), gweighted.mean(hp, get("disp")))'
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.001
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> This gsum (narm=FALSE) took ... gather took ... 0.000s
#> 0.000s
#> gforce eval took 0.001
#> 0.000s elapsed (0.000s cpu)
#>      cyl      mpg        hp
#>    <num>    <num>     <num>
#> 1:     6 19.77198 119.86409
#> 2:     4 25.81985  84.75037
#> 3:     8 14.86285 210.28867
```
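Another option, hinted at by the `j=eval(macro)` suggestion in the verbose message above, might be to build the `j` expression programmatically and pass it through `eval()`. This is only a sketch: `cols` and `j_call` are names I made up, and I am assuming that data.table substitutes the evaluated call for `j` before its optimisation step, so that it ends up in the same form as the hand-written version. Whether GForce actually kicks in should be checked with `datatable.verbose = TRUE`.

``` r
library(data.table)

DT <- as.data.table(mtcars)

# Columns to aggregate; `disp` is the weight column.
cols <- c("mpg", "hp")

# Build list(mpg = weighted.mean(mpg, disp), hp = weighted.mean(hp, disp))
# as an unevaluated call, one weighted.mean() call per column.
j_call <- as.call(c(
  quote(list),
  lapply(setNames(cols, cols),
         function(col) call("weighted.mean", as.name(col), quote(disp)))
))

# Inspect the generated expression.
print(j_call)

# Evaluate the constructed call as j.
res <- DT[, eval(j_call), by = "cyl"]
res
```

Compared with `get("disp")`, the generated call only mentions `mpg`, `hp` and `disp` by name, so I would not expect `ansvars` to be widened to all columns.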

Is there a more straightforward solution? Should data.table add support for this use case? If not, an example in the docs would be useful.
