Skip to content

Conversation

@kevinjqliu
Copy link
Contributor

@kevinjqliu kevinjqliu commented Jan 25, 2026

Rationale for this change

Fix part of #2325
Context: #2325 (comment)

Cache Manifest File object instead of Manifest List (tuple of Manifest Files).
This PR fix the O(N²) cache inefficiency, into the expected O(N) linear growth pattern.

Are these changes tested?

Yes, with benchmark test (tests/benchmark/test_memory_benchmark.py)
Result running from main branch: https://gist.github.com/kevinjqliu/970f4b51a12aaa0318a2671173430736
Result running from this branch: https://gist.github.com/kevinjqliu/24990d18d2cea2fa468597c16bfa27fd

Benchmark Comparison: main vs kevinjqliu/fix-manifest-cache

Test main fix branch
test_manifest_cache_memory_growth ❌ FAILED ✅ PASSED
test_memory_after_gc_with_cache_cleared ✅ PASSED ✅ PASSED
test_manifest_cache_deduplication_efficiency ✅ PASSED ✅ PASSED

Memory Growth Benchmark (50 append operations)

Metric main fix branch Improvement
Initial memory 3,233.4 KB 3,210.7 KB -0.7%
Final memory 4,280.6 KB 3,558.9 KB -16.9%
Total growth 1,047.2 KB 348.1 KB -66.8%
Growth per iteration 26,809 bytes 8,913 bytes -66.8%

Memory at Each Iteration

Iteration main fix branch Δ
10 3,233.4 KB 3,210.7 KB -22.7 KB
20 3,471.0 KB 3,371.4 KB -99.6 KB
30 3,719.3 KB 3,467.1 KB -252.2 KB
40 3,943.9 KB 3,483.2 KB -460.7 KB
50 4,280.6 KB 3,558.9 KB -721.7 KB

This fix reduces memory growth by ~67%, bringing per-iteration growth from ~27 KB down to ~9 KB.

The improvement comes from caching individual ManifestFile objects by their manifest_path instead of caching entire manifest list tuples. This deduplicates ManifestFile objects that appear in multiple manifest lists (common after appends).

Are there any user-facing changes?

@kevinjqliu kevinjqliu mentioned this pull request Jan 25, 2026
3 tasks
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves manifest-list caching to prevent quadratic memory growth by deduplicating cached ManifestFile objects by manifest_path, addressing the memory issue described in #2325.

Changes:

  • Reworked manifest caching to store individual ManifestFile instances keyed by manifest_path (instead of caching whole manifest-list tuples).
  • Updated/added tests to validate ManifestFile identity reuse across repeated reads and across overlapping manifest lists.
  • Added benchmark tests to measure cache memory growth and deduplication behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

File Description
pyiceberg/manifest.py Changes cache strategy to dedupe ManifestFile objects by manifest_path and adds a lock for cache access.
tests/utils/test_manifest.py Updates the existing cache test and adds new unit tests for cross-manifest-list deduplication.
tests/benchmark/test_memory_benchmark.py Adds benchmark tests intended to reproduce/guard the memory-growth behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

kevinjqliu and others added 4 commits January 25, 2026 16:14
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@kevinjqliu kevinjqliu requested review from Fokko and geruh January 25, 2026 21:28
@kevinjqliu kevinjqliu added this to the PyIceberg 0.11.0 milestone Jan 25, 2026
Copy link
Contributor

@geruh geruh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great fix!!

# We expect about 5-10 KB per iteration for typical workloads
# The key improvement is that growth is O(N) not O(N²)
# Threshold of 15KB/iteration based on observed behavior - O(N²) would show ~50KB+/iteration
assert growth_per_iteration < 15000, (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: non blocking can make a constant

https://github.com/apache/iceberg-python/issues/2325
The issue: When caching manifest lists as tuples, overlapping ManifestFile objects
are duplicated across cache entries, causing O(N²) memory growth instead of O(N).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants