Skip to content

Conversation

@likun666661
Copy link

@likun666661 likun666661 commented Nov 20, 2023

Rationale for this change

In Gandiva, it has been observed that the tool caches compiled expressions. However, when using the expression cache, Gandiva takes into account the schema, expr->ToString, and specific parameters. Nevertheless, in llvm_generator.cc, Gandiva still traverses the expressions to generate LLVM IR code and then produces specific JIT functions. Considering that llvm_generator traverses expressions and calculates the memory start positions of fields involved in the computation, we can further enhance Gandiva's caching mechanism.

Specifically, we can exclude the influence of the schema. Simultaneously, we can also exclude the field names' information involved in the expression computation. For example, for a schema {f0:int32, f1:int32} with an expression add(f0, f1), we can cache it initially. Then, for a schema {a0:int32, a1:int32, a2:int32} with an expression add(a1, a2), it should not require recompilation; it can directly reuse the JIT function generated by the expression {f0:int32, f1:int32}, add(f0, f1).

What changes are included in this PR?

This PR primarily alters the construction of the cache key and makes modifications to the schema validation in both the projector and filter. It also adjusts the validation of fields in expr_validator. Additionally, a ToCacheKeyString function has been added to each node. This PR mainly focuses on these areas.

Are these changes tested?

Yes,these changes are tested.

Are there any user-facing changes?

No.

#38792

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@likun666661 likun666661 changed the title optimize gandiva cache GH-38433: [C++][Gandiva] optimize gandiva cache Nov 20, 2023
@github-actions
Copy link

⚠️ GitHub issue #38433 has been automatically assigned in GitHub to PR creator.

@likun666661 likun666661 changed the title GH-38433: [C++][Gandiva] optimize gandiva cache GH-38434: [C++][Gandiva] optimize gandiva cache Nov 20, 2023
@github-actions
Copy link

⚠️ GitHub issue #38434 has been automatically assigned in GitHub to PR creator.

@likun666661 likun666661 changed the title GH-38434: [C++][Gandiva] optimize gandiva cache GH-38790: [C++][Gandiva] optimize gandiva cache Nov 20, 2023
@github-actions
Copy link

⚠️ GitHub issue #38790 has been automatically assigned in GitHub to PR creator.

@likun666661 likun666661 changed the title GH-38790: [C++][Gandiva] optimize gandiva cache GH-38792: [C++][Gandiva] optimize gandiva cache Nov 20, 2023
@github-actions
Copy link

⚠️ GitHub issue #38792 has been automatically assigned in GitHub to PR creator.

@pitrou
Copy link
Member

pitrou commented Nov 28, 2023

@js8544

@js8544
Copy link
Collaborator

js8544 commented Nov 29, 2023

Thanks for your contribution! The proposal sounds good: caching at node level is indeed better. I have a couple of questions regarding the implementation:

  1. Could you resolve the conflicts so that we can initiate the CI checks?
  2. I see that schema checks are completely removed. Will invalid execution happen, e.g. Projector of one type executed on a RecordBatch of another type? If so, is it possible to keep the checks on types and only skip the checks on names?

@likun666661
Copy link
Author

Thanks for your contribution! The proposal sounds good: caching at node level is indeed better. I have a couple of questions regarding the implementation:

  1. Could you resolve the conflicts so that we can initiate the CI checks?
  2. I see that schema checks are completely removed. Will invalid execution happen, e.g. Projector of one type executed on a RecordBatch of another type? If so, is it possible to keep the checks on types and only skip the checks on names?
  1. I will do this later.
  2. For second,gandiva expr validator will check the tree expr. So , when in function node ,it will check the fieldnode type and thus it will checks on types and only skip the checks on names.

@likun666661 likun666661 force-pushed the feature/optimize_gandiva_cache branch from 2c292a6 to 9382a92 Compare December 5, 2023 15:30
@likun666661
Copy link
Author

Thanks for your contribution! The proposal sounds good: caching at node level is indeed better. I have a couple of questions regarding the implementation:

  1. Could you resolve the conflicts so that we can initiate the CI checks?
  2. I see that schema checks are completely removed. Will invalid execution happen, e.g. Projector of one type executed on a RecordBatch of another type? If so, is it possible to keep the checks on types and only skip the checks on names?

I add some check for invalid execution happen as you said. Thanks for the advice.

@js8544
Copy link
Collaborator

js8544 commented Dec 6, 2023

@kou Could you help approve the CI workflows? Thanks!

@kou
Copy link
Member

kou commented Dec 6, 2023

Approved!

@likun666661 likun666661 force-pushed the feature/optimize_gandiva_cache branch 2 times, most recently from 5292e9b to 52e5c41 Compare December 6, 2023 14:24
@likun666661
Copy link
Author

likun666661 commented Dec 7, 2023

I implemented type validation for calculations within Gandiva. All of my local tests passed successfully before I add the type validation check. However, it appears that some existing tests in Gandiva have fields with types that may not necessarily match the data types. In such cases, my type validation is blocking them. As a result, a few Gandiva tests did not pass on the CI. Could you please approve another round of CI testing? @kou

@kou
Copy link
Member

kou commented Dec 7, 2023

Approved!
FYI: You can run CI on your fork by enabling GitHub Actions on your fork.

@likun666661 likun666661 force-pushed the feature/optimize_gandiva_cache branch from 52e5c41 to 196138b Compare January 3, 2024 05:25
@likun666661 likun666661 force-pushed the feature/optimize_gandiva_cache branch from 196138b to d2c9d70 Compare January 3, 2024 05:33
@kou kou changed the title GH-38792: [C++][Gandiva] optimize gandiva cache GH-38792: [C++][Gandiva] Optimize cache Jan 4, 2024
@github-actions github-actions bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025
@thisisnic
Copy link
Member

Thank you for your contribution. Unfortunately, this
pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label
or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you
do not have repository permissions to reopen the PR, please tag a maintainer.

@lriggs
Copy link
Contributor

lriggs commented Dec 19, 2025

@likun666661 Are you interested in finishing this still? If not Im happy to take it on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting review Awaiting review Component: C++ Component: Gandiva Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[C++] [Gandiva] Optimize Gandiva Cache

6 participants