GH-38792: [C++][Gandiva] Optimize cache #38790

likun666661 · 2023-11-20T06:22:41Z

Rationale for this change

In Gandiva, it has been observed that the tool caches compiled expressions. However, when using the expression cache, Gandiva takes into account the schema, expr->ToString, and specific parameters. Nevertheless, in llvm_generator.cc, Gandiva still traverses the expressions to generate LLVM IR code and then produces specific JIT functions. Considering that llvm_generator traverses expressions and calculates the memory start positions of fields involved in the computation, we can further enhance Gandiva's caching mechanism.

Specifically, we can exclude the influence of the schema. Simultaneously, we can also exclude the field names' information involved in the expression computation. For example, for a schema {f0:int32, f1:int32} with an expression add(f0, f1), we can cache it initially. Then, for a schema {a0:int32, a1:int32, a2:int32} with an expression add(a1, a2), it should not require recompilation; it can directly reuse the JIT function generated by the expression {f0:int32, f1:int32}, add(f0, f1).

What changes are included in this PR?

This PR primarily alters the construction of the cache key and makes modifications to the schema validation in both the projector and filter. It also adjusts the validation of fields in expr_validator. Additionally, a ToCacheKeyString function has been added to each node. This PR mainly focuses on these areas.

Are these changes tested?

Yes,these changes are tested.

Are there any user-facing changes?

No.

#38792

Closes: [C++] [Gandiva] Optimize Gandiva Cache #38792

github-actions · 2023-11-20T06:23:05Z

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

github-actions · 2023-11-20T06:23:56Z

⚠️ GitHub issue #38433 has been automatically assigned in GitHub to PR creator.

github-actions · 2023-11-20T06:24:56Z

⚠️ GitHub issue #38434 has been automatically assigned in GitHub to PR creator.

github-actions · 2023-11-20T06:25:46Z

⚠️ GitHub issue #38790 has been automatically assigned in GitHub to PR creator.

github-actions · 2023-11-20T06:32:51Z

⚠️ GitHub issue #38792 has been automatically assigned in GitHub to PR creator.

pitrou · 2023-11-28T17:43:32Z

@js8544

js8544 · 2023-11-29T03:39:22Z

Thanks for your contribution! The proposal sounds good: caching at node level is indeed better. I have a couple of questions regarding the implementation:

Could you resolve the conflicts so that we can initiate the CI checks?
I see that schema checks are completely removed. Will invalid execution happen, e.g. Projector of one type executed on a RecordBatch of another type? If so, is it possible to keep the checks on types and only skip the checks on names?

likun666661 · 2023-12-04T15:12:15Z

Thanks for your contribution! The proposal sounds good: caching at node level is indeed better. I have a couple of questions regarding the implementation:

Could you resolve the conflicts so that we can initiate the CI checks?

I see that schema checks are completely removed. Will invalid execution happen, e.g. Projector of one type executed on a RecordBatch of another type? If so, is it possible to keep the checks on types and only skip the checks on names?

I will do this later.
For second，gandiva expr validator will check the tree expr. So , when in function node ,it will check the fieldnode type and thus it will checks on types and only skip the checks on names.

likun666661 · 2023-12-05T15:33:11Z

Thanks for your contribution! The proposal sounds good: caching at node level is indeed better. I have a couple of questions regarding the implementation:

Could you resolve the conflicts so that we can initiate the CI checks?

I see that schema checks are completely removed. Will invalid execution happen, e.g. Projector of one type executed on a RecordBatch of another type? If so, is it possible to keep the checks on types and only skip the checks on names?

I add some check for invalid execution happen as you said. Thanks for the advice.

js8544 · 2023-12-06T03:13:42Z

@kou Could you help approve the CI workflows? Thanks!

kou · 2023-12-06T04:50:36Z

Approved!

likun666661 · 2023-12-07T01:47:40Z

I implemented type validation for calculations within Gandiva. All of my local tests passed successfully before I add the type validation check. However, it appears that some existing tests in Gandiva have fields with types that may not necessarily match the data types. In such cases, my type validation is blocking them. As a result, a few Gandiva tests did not pass on the CI. Could you please approve another round of CI testing? @kou

kou · 2023-12-07T01:52:08Z

Approved!
FYI: You can run CI on your fork by enabling GitHub Actions on your fork.

thisisnic · 2025-11-18T12:32:21Z

Thank you for your contribution. Unfortunately, this
pull request has been marked as stale because it has had no activity in the past 365 days. Please remove the stale label
or comment below, or this PR will be closed in 14 days. Feel free to re-open this if it has been closed in error. If you
do not have repository permissions to reopen the PR, please tag a maintainer.

lriggs · 2025-12-19T00:34:40Z

@likun666661 Are you interested in finishing this still? If not Im happy to take it on.

github-actions bot added Component: C++ Component: Gandiva awaiting review Awaiting review labels Nov 20, 2023

likun666661 changed the title ~~optimize gandiva cache~~ GH-38433: [C++][Gandiva] optimize gandiva cache Nov 20, 2023

likun666661 changed the title ~~GH-38433: [C++][Gandiva] optimize gandiva cache~~ GH-38434: [C++][Gandiva] optimize gandiva cache Nov 20, 2023

likun666661 changed the title ~~GH-38434: [C++][Gandiva] optimize gandiva cache~~ GH-38790: [C++][Gandiva] optimize gandiva cache Nov 20, 2023

github-actions bot assigned likun666661 Nov 20, 2023

likun666661 changed the title ~~GH-38790: [C++][Gandiva] optimize gandiva cache~~ GH-38792: [C++][Gandiva] optimize gandiva cache Nov 20, 2023

likun666661 force-pushed the feature/optimize_gandiva_cache branch from 2c292a6 to 9382a92 Compare December 5, 2023 15:30

likun666661 force-pushed the feature/optimize_gandiva_cache branch 2 times, most recently from 5292e9b to 52e5c41 Compare December 6, 2023 14:24

likun666661 force-pushed the feature/optimize_gandiva_cache branch from 52e5c41 to 196138b Compare January 3, 2024 05:25

optimize gandiva cache

d2c9d70

likun666661 force-pushed the feature/optimize_gandiva_cache branch from 196138b to d2c9d70 Compare January 3, 2024 05:33

kou changed the title ~~GH-38792: [C++][Gandiva] optimize gandiva cache~~ GH-38792: [C++][Gandiva] Optimize cache Jan 4, 2024

github-actions bot added the Status: stale-warning Issues and PRs flagged as stale which are due to be closed if no indication otherwise label Nov 18, 2025

lriggs mentioned this pull request Dec 19, 2025

GH-38792: [C++][Gandiva] Optimize cache #48597

Open

GH-38792: [C++][Gandiva] Optimize cache #38790

Are you sure you want to change the base?

GH-38792: [C++][Gandiva] Optimize cache #38790

Uh oh!

Conversation

likun666661 commented Nov 20, 2023 • edited by kou Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

github-actions bot commented Nov 20, 2023

Uh oh!

github-actions bot commented Nov 20, 2023

Uh oh!

github-actions bot commented Nov 20, 2023

Uh oh!

github-actions bot commented Nov 20, 2023

Uh oh!

github-actions bot commented Nov 20, 2023

Uh oh!

pitrou commented Nov 28, 2023

Uh oh!

js8544 commented Nov 29, 2023

Uh oh!

likun666661 commented Dec 4, 2023

Uh oh!

likun666661 commented Dec 5, 2023

Uh oh!

js8544 commented Dec 6, 2023

Uh oh!

kou commented Dec 6, 2023

Uh oh!

likun666661 commented Dec 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kou commented Dec 7, 2023

Uh oh!

thisisnic commented Nov 18, 2025

Uh oh!

lriggs commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

likun666661 commented Nov 20, 2023 •

edited by kou

Loading

likun666661 commented Dec 7, 2023 •

edited

Loading