-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-38792: [C++][Gandiva] Optimize cache #38790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
GH-38792: [C++][Gandiva] Optimize cache #38790
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or In the case of PARQUET issues on JIRA the title also supports: See also: |
|
|
|
|
|
|
|
|
|
Thanks for your contribution! The proposal sounds good: caching at node level is indeed better. I have a couple of questions regarding the implementation:
|
|
2c292a6 to
9382a92
Compare
I add some check for invalid execution happen as you said. Thanks for the advice. |
|
@kou Could you help approve the CI workflows? Thanks! |
|
Approved! |
5292e9b to
52e5c41
Compare
|
I implemented type validation for calculations within Gandiva. All of my local tests passed successfully before I add the type validation check. However, it appears that some existing tests in Gandiva have fields with types that may not necessarily match the data types. In such cases, my type validation is blocking them. As a result, a few Gandiva tests did not pass on the CI. Could you please approve another round of CI testing? @kou |
|
Approved! |
52e5c41 to
196138b
Compare
196138b to
d2c9d70
Compare
|
Thank you for your contribution. Unfortunately, this |
|
@likun666661 Are you interested in finishing this still? If not Im happy to take it on. |
Rationale for this change
In Gandiva, it has been observed that the tool caches compiled expressions. However, when using the expression cache, Gandiva takes into account the schema, expr->ToString, and specific parameters. Nevertheless, in llvm_generator.cc, Gandiva still traverses the expressions to generate LLVM IR code and then produces specific JIT functions. Considering that llvm_generator traverses expressions and calculates the memory start positions of fields involved in the computation, we can further enhance Gandiva's caching mechanism.
Specifically, we can exclude the influence of the schema. Simultaneously, we can also exclude the field names' information involved in the expression computation. For example, for a schema {f0:int32, f1:int32} with an expression add(f0, f1), we can cache it initially. Then, for a schema {a0:int32, a1:int32, a2:int32} with an expression add(a1, a2), it should not require recompilation; it can directly reuse the JIT function generated by the expression {f0:int32, f1:int32}, add(f0, f1).
What changes are included in this PR?
This PR primarily alters the construction of the cache key and makes modifications to the schema validation in both the projector and filter. It also adjusts the validation of fields in expr_validator. Additionally, a ToCacheKeyString function has been added to each node. This PR mainly focuses on these areas.
Are these changes tested?
Yes,these changes are tested.
Are there any user-facing changes?
No.
#38792