-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-28866: [Java] Java Dataset API ScanOptions expansion #41646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
Can you help review this PR? If the framework is OK, I will add more common config in this PR. Thanks! @westonpace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Insert all the options to a map because it is a easy implement, and now we don't have same option name in CPP parse_options and read_options, but to further extend, we may need to serialize more accurately. I'm open to here if you think we should serialize each option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe having a better serialize option for each would be better. But I see your point, maybe we could do it in a follow up PR.
|
CC @vibhatha |
cpp/thirdparty/versions.txt
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Occasional change, for my local environment, will remove it
cpp/src/arrow/dataset/file_csv.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
| return Status::Invalid("Not support this config " + it.first); | |
| return Status::Invalid("Config " + it.first + " is not supported."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
| return Status::Invalid("Literal does not have map"); | |
| return Status::Invalid("Literal does not have a map"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
| "illegal file format id: " + std::to_string(file_format_id); | |
| "Illegal file format id: " + std::to_string(file_format_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function just written to pass a Java Map to C++ via JNI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put it as a private static helper somewhere? No need to expose it publicly as an instance method
cpp/src/arrow/dataset/file_csv.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we check for possible -1 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it in Java side to not add invalid schema address
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looking at the functionality, I think what we have here is a util class which converts a particular map config to a particular Substrait protobuf message. Since this can be used in other cases, it could come under substrait.util package. And the toProtobuf could be mapToExpressionLiteral() ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also have doubts about having a separate class for this purpose though.
vibhatha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only added a few comments. But I am going to go through the content once more.
|
@github-actions crossbow submit java-jars |
|
Revision: 04e4390be2bb640c6b8579627b1990b277d98fcf Submitted crossbow builds: ursacomputing/crossbow @ actions-4da86e64a2
|
|
Sorry, do you mind rebasing again so we can validate the JNI job? |
|
@github-actions crossbow submit java-jars |
|
Revision: 98a3beb Submitted crossbow builds: ursacomputing/crossbow @ actions-f053a0e3dd
|
|
Sigh, well, another update broke that pipeline again.. |
Hmm, seems like that. I will take a look. |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit fd69e5e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 24 possible false positives for unstable benchmarks that are known to sometimes produce them. |
|
I find arrow-dataset tests don't run in github Action, is it right? only arrow-c-data test runs. |
|
Can you make a PR? |
|
Sure I can. |
|
Thanks @jinchengchenghh |
|
I have created a PR: #43503 |
…#41646) ### Rationale for this change ### What changes are included in this PR? Support to add ArrowSchema to specify C++ CsvFragmentScanOptions.convert_options.column_types And use Map to set the config, serialize in java and deserialize in C++ for CsvFragmentScanOptions ### Are these changes tested? new added UT. ### Are there any user-facing changes? No. * GitHub Issue: apache#28866 Authored-by: Chengcheng Jin <[email protected]> Signed-off-by: David Li <[email protected]>
Rationale for this change
What changes are included in this PR?
Support to add ArrowSchema to specify C++ CsvFragmentScanOptions.convert_options.column_types
And use Map to set the config, serialize in java and deserialize in C++ for CsvFragmentScanOptions
Are these changes tested?
new added UT.
Are there any user-facing changes?
No.