ARROW-5718: [R] auto splice data frames in record_batch() and table() by romainfrancois · Pull Request #4704 · apache/arrow

romainfrancois · 2019-06-26T12:14:21Z

elements of ... that are unnamed in record_batch(...) or table(...) are automatically spliced:

library(arrow, warn.conflicts = FALSE)
library(tibble)

df <- tibble(x = 1:10, y = letters[1:10])

# auto splicing
batch <- record_batch(df, z = 1:10)
as_tibble(batch)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10

# same as explicit splicing
batch <- record_batch(!!!df, z = 1:10)
as_tibble(batch)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10

# auto splicing
tab <- table(df, z = 1:10)
as_tibble(tab)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10

# same as explicit splicing
tab <- table(!!!df, z = 1:10)
as_tibble(tab)
#> # A tibble: 10 x 3
#>        x y         z
#>    <int> <chr> <int>
#>  1     1 a         1
#>  2     2 b         2
#>  3     3 c         3
#>  4     4 d         4
#>  5     5 e         5
#>  6     6 f         6
#>  7     7 g         7
#>  8     8 h         8
#>  9     9 i         9
#> 10    10 j        10

In particular, this gives us back record_batch(<data.frame>) and table(<data.frame>) :

library(arrow, warn.conflicts = FALSE)
library(tibble)

df <- tibble(x = 1:10, y = letters[1:10])
record_batch(df)
#> arrow::RecordBatch
table(df)
#> arrow::Table

nealrichardson

A couple of notes. Generally looks good, might like to see a couple more tests.

r/R/RecordBatch.R

r/tests/testthat/test-RecordBatch.R

nealrichardson · 2019-06-26T13:37:00Z

r/src/table.cpp


-  R_xlen_t n = XLENGTH(lst);
-  std::vector<std::shared_ptr<arrow::Column>> columns(n);
+  int num_fields;


I'm curious why the record_batch and table code isn't more shared

It could be. record batch only have to handle arrays, where tables have to handle arrays, chunked arrays, columns. but I agree that the structure of the code is similar.

I think we should refactor the flatten-for loop into a small function if they're all the same.

nealrichardson · 2019-06-26T14:17:45Z

r/R/RecordBatch.R

 #' @export
 record_batch <- function(..., schema = NULL){
  arrays <- list2(...)
+  # making sure there are always names


Right but why? IIUC this has to do with how the cpp code distinguishes things to autosplice, switching behavior on nchar(name). That's a subtlety I'll probably forget by Monday so it seems worth explaining.

names(list(a = 1, b = 2)) #> [1] "a" "b" names(list(a = 1, 2)) #> [1] "a" "" names(list(1, 2)) #> NULL

otherwise we'd have to treat the NULL case differently internally as we did before, maybe I could do that instead.

fsaintjacques · 2019-06-26T17:47:31Z

r/src/recordbatch.cpp

+    arrays[j] = arrow::r::Array__from_vector(x, schema->field(j)->type(), false);
+  };
+
+  for (R_xlen_t i = 0, j = 0; j < num_fields; i++) {


Can you simplify the loop conditions? Make it only iterate over i and XLENGTH(lst) and use an external variable,e.g.

int out_idx = 0; for (R_xlen_t i = 0; i < XLENGHT(lst); i++) { ... if (..) { for (...) fill_array(out_idx++, VECTOR_ELT(x_i, k), STRING_ELT(names_x_i, k)); } else { fill_array(out_idx++, x_i, name_i); } } assert(out_idx == num_fields);

The loop is now invariant on j (out_idx in example) Easier to follow and review/refactor.

Bonus point, get rid of all of this and make a C++ iterator over lst that will behave as-if "flattened".

fsaintjacques · 2019-06-26T17:55:01Z

r/src/recordbatch.cpp

-  std::vector<std::shared_ptr<arrow::Field>> fields(n_arrays);
-  for (R_xlen_t i = 0; i < n_arrays; i++) {
-    fields[i] = std::make_shared<arrow::Field>(std::string(names[i]), arrays[i]->type());
+  std::vector<std::shared_ptr<arrow::Field>> fields(num_fields);


Seems like Schema::Make(const std::vector<Array>, const std::vector<string>) could be a convenient method to have globally.

fsaintjacques · 2019-06-26T19:12:47Z

r/src/table.cpp


-  R_xlen_t n = XLENGTH(lst);
-  std::vector<std::shared_ptr<arrow::Column>> columns(n);
+  int num_fields;


I think we should refactor the flatten-for loop into a small function if they're all the same.

wesm · 2019-06-27T17:03:11Z

What's the status on this? Since we're at the 0.14.0 end game we probably need to get this to merge readiness or defer to next release in the next 12-24 hours

nealrichardson · 2019-06-27T17:34:39Z

This should go in 0.14, otherwise we'll have to facilitate a sparklyr workaround. Plus it's the right interface to expose.

IMO the feedback from François could be addressed in a followup ticket.

fsaintjacques · 2019-06-27T17:37:56Z

I also think it's more important to get this in (seems like a very convenient feature), than have a simplified loop.

nealrichardson · 2019-06-27T19:06:41Z

The appveyor failure was the r-project.org DNS outage from yesterday. I just checked out this PR locally and ran R CMD check and it was clean. Created https://issues.apache.org/jira/browse/ARROW-5761 for the followup. @fsaintjacques would you please merge?

fsaintjacques · 2019-06-27T19:10:16Z

I will run the docker-compose R image locally, should be completed in 5-10 minutes and will merge if successful.

fsaintjacques · 2019-06-27T19:51:43Z

docker image failed, but unrelated to this change, I'll open a ticket, but merge.

romainfrancois added the Component: R label Jun 26, 2019

romainfrancois added 3 commits June 26, 2019 14:22

trim redundant code

5eabd20

record_batch() auto splicing of data frames

0a4892c

table() auto splicing

61902ab

romainfrancois force-pushed the ARROW-5718/df_auto_splice branch from 663730f to 61902ab Compare June 26, 2019 12:25

nealrichardson reviewed Jun 26, 2019

View reviewed changes

romainfrancois added 2 commits June 26, 2019 15:47

+ comment

28bc470

more tests about auto splicingh in record_batch()

79a7041

nealrichardson reviewed Jun 26, 2019

View reviewed changes

fsaintjacques reviewed Jun 26, 2019

View reviewed changes

fsaintjacques closed this in 4e6759c Jun 27, 2019

romainfrancois deleted the ARROW-5718/df_auto_splice branch June 28, 2019 13:05

asfimport mentioned this pull request Jun 27, 2019

[R] auto splice data frames in record_batch() and table() #22146

Closed

nealrichardson mentioned this pull request Apr 6, 2023

GH-34775: [R] arrow_table: as.data.frame() sometimes returns a tbl and sometimes a data.frame #34825

Closed

asfimport mentioned this pull request Jun 11, 2020

[R] Improve autosplice cpp code #22186

Closed

Comments

Conversation

romainfrancois commented Jun 26, 2019

Uh oh!

nealrichardson left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wesm commented Jun 27, 2019

Uh oh!

nealrichardson commented Jun 27, 2019

Uh oh!

fsaintjacques commented Jun 27, 2019

Uh oh!

nealrichardson commented Jun 27, 2019

Uh oh!

fsaintjacques commented Jun 27, 2019

Uh oh!

fsaintjacques commented Jun 27, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants