-
Notifications
You must be signed in to change notification settings - Fork 426
fix manifest cache #2951
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix manifest cache #2951
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Improves manifest-list caching to prevent quadratic memory growth by deduplicating cached ManifestFile objects by manifest_path, addressing the memory issue described in #2325.
Changes:
- Reworked manifest caching to store individual
ManifestFileinstances keyed bymanifest_path(instead of caching whole manifest-list tuples). - Updated/added tests to validate
ManifestFileidentity reuse across repeated reads and across overlapping manifest lists. - Added benchmark tests to measure cache memory growth and deduplication behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
pyiceberg/manifest.py |
Changes cache strategy to dedupe ManifestFile objects by manifest_path and adds a lock for cache access. |
tests/utils/test_manifest.py |
Updates the existing cache test and adds new unit tests for cross-manifest-list deduplication. |
tests/benchmark/test_memory_benchmark.py |
Adds benchmark tests intended to reproduce/guard the memory-growth behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
geruh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, great fix!!
| # We expect about 5-10 KB per iteration for typical workloads | ||
| # The key improvement is that growth is O(N) not O(N²) | ||
| # Threshold of 15KB/iteration based on observed behavior - O(N²) would show ~50KB+/iteration | ||
| assert growth_per_iteration < 15000, ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: non blocking can make a constant
| https://github.com/apache/iceberg-python/issues/2325 | ||
| The issue: When caching manifest lists as tuples, overlapping ManifestFile objects | ||
| are duplicated across cache entries, causing O(N²) memory growth instead of O(N). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
awesome!
Rationale for this change
Fix part of #2325
Context: #2325 (comment)
Cache Manifest File object instead of Manifest List (tuple of Manifest Files).
This PR fix the O(N²) cache inefficiency, into the expected O(N) linear growth pattern.
Are these changes tested?
Yes, with benchmark test (
tests/benchmark/test_memory_benchmark.py)Result running from main branch: https://gist.github.com/kevinjqliu/970f4b51a12aaa0318a2671173430736
Result running from this branch: https://gist.github.com/kevinjqliu/24990d18d2cea2fa468597c16bfa27fd
Benchmark Comparison: main vs kevinjqliu/fix-manifest-cache
test_manifest_cache_memory_growthtest_memory_after_gc_with_cache_clearedtest_manifest_cache_deduplication_efficiencyMemory Growth Benchmark (50 append operations)
Memory at Each Iteration
This fix reduces memory growth by ~67%, bringing per-iteration growth from ~27 KB down to ~9 KB.
The improvement comes from caching individual
ManifestFileobjects by theirmanifest_pathinstead of caching entire manifest list tuples. This deduplicatesManifestFileobjects that appear in multiple manifest lists (common after appends).Are there any user-facing changes?