ARROW-5718: [R] auto splice data frames in record_batch() and table()#4704
ARROW-5718: [R] auto splice data frames in record_batch() and table()#4704romainfrancois wants to merge 5 commits intoapache:masterfrom
Conversation
663730f to
61902ab
Compare
nealrichardson
left a comment
There was a problem hiding this comment.
A couple of notes. Generally looks good, might like to see a couple more tests.
|
|
||
| R_xlen_t n = XLENGTH(lst); | ||
| std::vector<std::shared_ptr<arrow::Column>> columns(n); | ||
| int num_fields; |
There was a problem hiding this comment.
I'm curious why the record_batch and table code isn't more shared
There was a problem hiding this comment.
It could be. record batch only have to handle arrays, where tables have to handle arrays, chunked arrays, columns. but I agree that the structure of the code is similar.
There was a problem hiding this comment.
I think we should refactor the flatten-for loop into a small function if they're all the same.
| #' @export | ||
| record_batch <- function(..., schema = NULL){ | ||
| arrays <- list2(...) | ||
| # making sure there are always names |
There was a problem hiding this comment.
Right but why? IIUC this has to do with how the cpp code distinguishes things to autosplice, switching behavior on nchar(name). That's a subtlety I'll probably forget by Monday so it seems worth explaining.
There was a problem hiding this comment.
names(list(a = 1, b = 2))
#> [1] "a" "b"
names(list(a = 1, 2))
#> [1] "a" ""
names(list(1, 2))
#> NULLotherwise we'd have to treat the NULL case differently internally as we did before, maybe I could do that instead.
| arrays[j] = arrow::r::Array__from_vector(x, schema->field(j)->type(), false); | ||
| }; | ||
|
|
||
| for (R_xlen_t i = 0, j = 0; j < num_fields; i++) { |
There was a problem hiding this comment.
Can you simplify the loop conditions? Make it only iterate over i and XLENGTH(lst) and use an external variable,e.g.
int out_idx = 0;
for (R_xlen_t i = 0; i < XLENGHT(lst); i++) {
...
if (..) {
for (...)
fill_array(out_idx++, VECTOR_ELT(x_i, k), STRING_ELT(names_x_i, k));
} else {
fill_array(out_idx++, x_i, name_i);
}
}
assert(out_idx == num_fields);
The loop is now invariant on j (out_idx in example) Easier to follow and review/refactor.
There was a problem hiding this comment.
Bonus point, get rid of all of this and make a C++ iterator over lst that will behave as-if "flattened".
| std::vector<std::shared_ptr<arrow::Field>> fields(n_arrays); | ||
| for (R_xlen_t i = 0; i < n_arrays; i++) { | ||
| fields[i] = std::make_shared<arrow::Field>(std::string(names[i]), arrays[i]->type()); | ||
| std::vector<std::shared_ptr<arrow::Field>> fields(num_fields); |
There was a problem hiding this comment.
Seems like Schema::Make(const std::vector<Array>, const std::vector<string>) could be a convenient method to have globally.
|
|
||
| R_xlen_t n = XLENGTH(lst); | ||
| std::vector<std::shared_ptr<arrow::Column>> columns(n); | ||
| int num_fields; |
There was a problem hiding this comment.
I think we should refactor the flatten-for loop into a small function if they're all the same.
|
What's the status on this? Since we're at the 0.14.0 end game we probably need to get this to merge readiness or defer to next release in the next 12-24 hours |
|
This should go in 0.14, otherwise we'll have to facilitate a sparklyr workaround. Plus it's the right interface to expose. IMO the feedback from François could be addressed in a followup ticket. |
|
I also think it's more important to get this in (seems like a very convenient feature), than have a simplified loop. |
|
The appveyor failure was the r-project.org DNS outage from yesterday. I just checked out this PR locally and ran |
|
I will run the docker-compose R image locally, should be completed in 5-10 minutes and will merge if successful. |
|
docker image failed, but unrelated to this change, I'll open a ticket, but merge. |
elements of
...that are unnamed inrecord_batch(...)ortable(...)are automatically spliced:In particular, this gives us back
record_batch(<data.frame>)andtable(<data.frame>):