Minor: Refactor memory size estimation for HashTable by marvinlanhenke · Pull Request #10748 · apache/datafusion

marvinlanhenke · 2024-06-01T15:14:47Z

Which issue does this PR close?

Closes #8764.

Rationale for this change

As stated in the issue, the goal is to consolidate the estimation logic into a single function. This allows for better documentation as well as better maintainability.
The existing implementation also had a problem: it could still overflow when the result of checked_mul was capped at usize::MAX. I tried to fix this by checking for overflows and returning usize::MAX as a 'max cap'.
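
To illustrate the overflow described above, here is a minimal sketch, not the code from this PR. The function names, the `u64` element type, and the split into "before" and "after" shapes are assumptions made for the example. It shows why capping only the multiplication is not enough: the following addition can overflow again.

```rust
use std::mem::size_of;

/// Hypothetical "before" shape: the product is capped at usize::MAX,
/// but the addition that follows can still overflow.
fn estimate_capped_mul_only(num_buckets: usize, fixed_size: usize) -> Option<usize> {
    let table_bytes = size_of::<u64>()
        .checked_mul(num_buckets)
        .unwrap_or(usize::MAX);
    // checked_add is used here only to make the overflow observable;
    // a plain `+` would panic in debug builds and wrap in release builds.
    table_bytes.checked_add(fixed_size)
}

/// Hypothetical "after" shape: every step is checked, and any overflow
/// collapses to usize::MAX as a maximum cap.
fn estimate_capped(num_buckets: usize, fixed_size: usize) -> usize {
    size_of::<u64>()
        .checked_mul(num_buckets)
        .and_then(|bytes| bytes.checked_add(fixed_size))
        .unwrap_or(usize::MAX)
}
```

With `num_buckets = usize::MAX` and a non-zero `fixed_size`, the first function reports an overflow (`None`), while the second returns `usize::MAX`.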

What changes are included in this PR?

extract logic into single function
fixed missing overflow checks
moved logic into datafusion common
used the new function in count_distinct/native.rs and hash_join.rs

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

marvinlanhenke · 2024-06-01T15:15:20Z

@alamb @yyy1000 PTAL

yyy1000

I see it's a good abstraction.

My only concern is that the original goal was kind of ideal: just pass the hashtable and get the estimated size. But now, before using this function, users have to calculate the fixed_size themselves, and the calculation is not always the same.

Though the abstraction is not pretty enough, I think that given the current code this PR is still an improvement. :)
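
The concern about callers computing fixed_size themselves can be sketched as follows. Everything here is hypothetical: the helper `estimate_table_bytes`, the `DistinctValues` struct, and the join-side function are illustrative stand-ins, and the formula is deliberately simplified. The point is only that each call site decides on its own what counts as the fixed part.

```rust
use std::collections::HashSet;
use std::mem::{size_of, size_of_val};

/// Simplified stand-in for a shared estimator: element bytes plus the
/// caller-supplied fixed overhead, or None on overflow.
fn estimate_table_bytes<T>(num_elements: usize, fixed_size: usize) -> Option<usize> {
    size_of::<T>()
        .checked_mul(num_elements)?
        .checked_add(fixed_size)
}

/// Hypothetical accumulator that owns a set of distinct values.
struct DistinctValues {
    values: HashSet<i64>,
}

impl DistinctValues {
    fn estimated_size(&self) -> Option<usize> {
        // This caller treats the size of the whole struct as the fixed part.
        let fixed_size = size_of_val(self);
        estimate_table_bytes::<i64>(self.values.len(), fixed_size)
    }
}

/// Hypothetical join-side caller that only knows an expected row count.
fn estimated_join_map_size(num_rows: usize) -> Option<usize> {
    // This caller picks a different fixed part: the size of the table type itself.
    let fixed_size = size_of::<HashSet<(u64, u64)>>();
    estimate_table_bytes::<(u64, u64)>(num_rows, fixed_size)
}
```

Passing the table itself would hide these differences, but it would require the estimator to know every table type in use.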

marvinlanhenke · 2024-06-01T16:56:27Z

Yes, I totally agree with you on this. However, as discussed in the issue, it still might be worth it due to the improved documentation.

datafusion/common/src/utils/memory.rs

yyy1000 · 2024-06-01T17:53:29Z

However, as discussed in the issue, it still might be worth it due to the improved documentation.

Agreed! I left a comment that may improve it, but it's not necessary. Thanks!

alamb

Thank you so much @marvinlanhenke -- this looks like a great step forward in readability 🙏

Thank you @yyy1000 for your review

I left some suggestions / comments. Let me know what you think.

datafusion/common/src/utils/memory.rs

datafusion/physical-plan/src/joins/hash_join.rs

marvinlanhenke · 2024-06-02T05:18:16Z

@alamb
I made the changes based on your comments. Thanks for the comprehensive review. 🚀

I changed fn estimate_memory_size(..) to return Result<usize> which should set us up for any possible future changes in the Accumulator trait; for now I simply unwrap in fn size() and panic.
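
A rough sketch of that shape is shown below, under stated assumptions. The error type `EstimateOverflow`, the trait `SizedState`, and the struct `Counts` are hypothetical stand-ins for the project's own error type and the Accumulator trait. The 8/7 over-allocation, the power-of-two rounding, and the one byte per bucket are assumptions modelled on common open-addressing hash tables, not details taken from this conversation.

```rust
use std::mem::size_of;

/// Hypothetical error type standing in for the project's own error enum.
#[derive(Debug, PartialEq)]
struct EstimateOverflow;

/// Sketch of a Result-returning estimator: any overflow becomes an error
/// instead of a silent cap.
fn estimate_memory_size<T>(
    num_elements: usize,
    fixed_size: usize,
) -> Result<usize, EstimateOverflow> {
    num_elements
        .checked_mul(8)
        .and_then(|overestimate| {
            // Assumed growth policy: about 8/7 of the elements, rounded up
            // to a power of two.
            let buckets = (overestimate / 7).checked_next_power_of_two()?;
            size_of::<T>()
                .checked_mul(buckets)?
                .checked_add(buckets)? // assumed: one control byte per bucket
                .checked_add(fixed_size)
        })
        .ok_or(EstimateOverflow)
}

/// Hypothetical trait whose `size` method cannot report errors.
trait SizedState {
    fn size(&self) -> usize;
}

struct Counts {
    num_elements: usize,
}

impl SizedState for Counts {
    fn size(&self) -> usize {
        // The signature has no way to return an error, so an overflow
        // panics here via unwrap.
        estimate_memory_size::<u64>(self.num_elements, size_of::<Self>()).unwrap()
    }
}
```

If the trait method later returns a Result as well, the unwrap can be replaced by propagating the error with `?`.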

alamb

Thank you @marvinlanhenke

datafusion/common/src/utils/memory.rs

datafusion/physical-expr/src/aggregate/count_distinct/native.rs

alamb · 2024-06-03T18:40:35Z

Thanks again @marvinlanhenke

* refactor: extract estimate_memory_size
* refactor: cap at usize::MAX
* refactor: use estimate_memory_size
* chore: add examples
* refactor: return Result<usize>; add testcase
* fix: docs
* fix: remove unneccessary checked_div
* fix: remove additional and_then

marvinlanhenke added 3 commits June 1, 2024 16:04

refactor: extract estimate_memory_size

176b257

refactor: cap at usize::MAX

b555adc

refactor: use estimate_memory_size

a719511

github-actions bot added the physical-expr label (Changes to the physical-expr crates) Jun 1, 2024

yyy1000 reviewed Jun 1, 2024

datafusion/common/src/utils/memory.rs

chore: add examples

93de4ac

alamb reviewed Jun 1, 2024

refactor: return Result<usize>; add testcase

97906ec

marvinlanhenke added 3 commits June 2, 2024 07:32

fix: docs

75b5892

fix: remove unneccessary checked_div

33d24d1

fix: remove additional and_then

eebf4cf

alamb approved these changes Jun 2, 2024

datafusion/common/src/utils/memory.rs

datafusion/physical-expr/src/aggregate/count_distinct/native.rs

alamb merged commit a92f803 into apache:main Jun 3, 2024

alamb merged 8 commits into apache:main from marvinlanhenke:refactor_size_estimation
