fix manifest cache #2951

kevinjqliu · 2026-01-25T19:28:28Z

Rationale for this change

Fix part of #2325
Context: #2325 (comment)

Cache individual ManifestFile objects instead of whole manifest lists (tuples of ManifestFile objects).
This PR fixes the O(N²) cache inefficiency and restores the expected O(N) linear growth pattern.

Are these changes tested?

Yes, with a benchmark test (tests/benchmark/test_memory_benchmark.py).
Result running from main branch: https://gist.github.com/kevinjqliu/970f4b51a12aaa0318a2671173430736
Result running from this branch: https://gist.github.com/kevinjqliu/24990d18d2cea2fa468597c16bfa27fd

Benchmark Comparison: main vs kevinjqliu/fix-manifest-cache

| Test | main | fix branch |
| --- | --- | --- |
| `test_manifest_cache_memory_growth` | ❌ FAILED | ✅ PASSED |
| `test_memory_after_gc_with_cache_cleared` | ✅ PASSED | ✅ PASSED |
| `test_manifest_cache_deduplication_efficiency` | ✅ PASSED | ✅ PASSED |

Memory Growth Benchmark (50 append operations)

| Metric | main | fix branch | Improvement |
| --- | --- | --- | --- |
| Initial memory | 3,233.4 KB | 3,210.7 KB | -0.7% |
| Final memory | 4,280.6 KB | 3,558.9 KB | -16.9% |
| Total growth | 1,047.2 KB | 348.1 KB | -66.8% |
| Growth per iteration | 26,809 bytes | 8,913 bytes | -66.8% |

Memory at Each Iteration

| Iteration | main | fix branch | Δ |
| --- | --- | --- | --- |
| 10 | 3,233.4 KB | 3,210.7 KB | -22.7 KB |
| 20 | 3,471.0 KB | 3,371.4 KB | -99.6 KB |
| 30 | 3,719.3 KB | 3,467.1 KB | -252.2 KB |
| 40 | 3,943.9 KB | 3,483.2 KB | -460.7 KB |
| 50 | 4,280.6 KB | 3,558.9 KB | -721.7 KB |

This fix reduces memory growth by ~67%, bringing per-iteration growth from ~27 KB down to ~9 KB.

The improvement comes from caching individual ManifestFile objects by their manifest_path instead of caching entire manifest list tuples. This deduplicates ManifestFile objects that appear in multiple manifest lists (common after appends).

Are there any user-facing changes?

pyiceberg/manifest.py

Copilot

Pull request overview

Improves manifest-list caching to prevent quadratic memory growth by deduplicating cached ManifestFile objects by manifest_path, addressing the memory issue described in #2325.

Changes:

- Reworked manifest caching to store individual ManifestFile instances keyed by manifest_path (instead of caching whole manifest-list tuples).
- Updated/added tests to validate ManifestFile identity reuse across repeated reads and across overlapping manifest lists.
- Added benchmark tests to measure cache memory growth and deduplication behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `pyiceberg/manifest.py` | Changes cache strategy to dedupe `ManifestFile` objects by `manifest_path` and adds a lock for cache access. |
| `tests/utils/test_manifest.py` | Updates the existing cache test and adds new unit tests for cross-manifest-list deduplication. |
| `tests/benchmark/test_memory_benchmark.py` | Adds benchmark tests intended to reproduce/guard the memory-growth behavior. |
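
The lock mentioned for `pyiceberg/manifest.py`, together with the "dont lock during io" commit, suggests a pattern along the following lines. This is only a sketch under that assumption: the diff itself is not shown on this page, and `get_or_load`, `_cache`, and `_cache_lock` are hypothetical names.

```python
import threading
from typing import Callable

# Hypothetical module-level cache and the lock that guards it.
_cache: dict[str, object] = {}
_cache_lock = threading.Lock()


def get_or_load(key: str, load: Callable[[str], object]) -> object:
    """Return the cached value for key, loading it outside the lock on a miss."""
    with _cache_lock:
        if key in _cache:
            return _cache[key]

    # The slow load runs without holding the lock, so other threads
    # can keep using the cache in the meantime.
    value = load(key)

    with _cache_lock:
        # Another thread may have stored this key while we were loading;
        # keep the first stored object so all callers share one instance.
        return _cache.setdefault(key, value)
```

The trade-off is that two threads missing on the same key may both perform the load, but only one result is kept, which preserves the identity guarantee the tests check for.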

geruh

LGTM, great fix!!

geruh · 2026-01-25T21:37:01Z

tests/benchmark/test_memory_benchmark.py

+        # We expect about 5-10 KB per iteration for typical workloads
+        # The key improvement is that growth is O(N) not O(N²)
+        # Threshold of 15KB/iteration based on observed behavior - O(N²) would show ~50KB+/iteration
+        assert growth_per_iteration < 15000, (


nit (non-blocking): the threshold could be made a named constant.
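
One way the suggestion could look, as a sketch only. The constant name `MAX_GROWTH_PER_ITERATION_BYTES` and the helper `check_growth` are hypothetical; the 15,000-byte value is taken from the quoted diff, whose assertion message is truncated here and is not reproduced.

```python
# Hypothetical constant name; the value comes from the quoted threshold.
MAX_GROWTH_PER_ITERATION_BYTES = 15_000


def check_growth(growth_per_iteration: float) -> None:
    """Fail if per-iteration memory growth reaches the threshold."""
    assert growth_per_iteration < MAX_GROWTH_PER_ITERATION_BYTES, (
        f"Memory growth of {growth_per_iteration:.0f} bytes/iteration is not below "
        f"{MAX_GROWTH_PER_ITERATION_BYTES} bytes/iteration"
    )
```

With the figures reported in the PR description, 8,913 bytes per iteration passes this check and 26,809 bytes per iteration fails it.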

geruh · 2026-01-25T21:48:59Z

tests/benchmark/test_memory_benchmark.py

+https://github.com/apache/iceberg-python/issues/2325
+
+The issue: When caching manifest lists as tuples, overlapping ManifestFile objects
+are duplicated across cache entries, causing O(N²) memory growth instead of O(N).
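
To make the O(N²) versus O(N) claim concrete, here is a small counting sketch. It assumes that the i-th append produces a manifest list referencing i manifests and that no cache entries are evicted; the function names are illustrative only.

```python
def cached_objects_per_list(num_appends: int) -> int:
    """Objects held when every manifest list caches its own tuple of copies."""
    # List i holds i ManifestFile copies, so the total is 1 + 2 + ... + N.
    return sum(range(1, num_appends + 1))


def cached_objects_per_manifest(num_appends: int) -> int:
    """Objects held when each manifest is cached once, keyed by its path."""
    return num_appends


n = 50
print(cached_objects_per_list(n))      # 1275
print(cached_objects_per_manifest(n))  # 50
```

Under these assumptions, 50 appends keep N(N+1)/2 = 1,275 objects alive with tuple caching, against 50 with per-manifest caching.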


cache manifest, not tuple

cab3823

kevinjqliu mentioned this pull request Jan 25, 2026

Avro reader memory leak #2325

Open

3 tasks

geruh reviewed Jan 25, 2026

kevinjqliu added 6 commits January 25, 2026 15:22

thx drew

0f2bf0d

typo

fa2863f

add memory benchmark

a5b7544

dont lock during io

3c32b5d

fix benchmark to use cache

c2fbb9c

fix benchmark

76c71aa

kevinjqliu requested a review from Copilot January 25, 2026 20:53

Copilot started reviewing on behalf of kevinjqliu January 25, 2026 20:53

Copilot AI reviewed Jan 25, 2026

kevinjqliu and others added 4 commits January 25, 2026 16:14

update docs

d92accf

Update tests/utils/test_manifest.py

1721483

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix test

1f5861b

more docs

50a366c

kevinjqliu requested review from Fokko and geruh January 25, 2026 21:28

kevinjqliu added this to the PyIceberg 0.11.0 milestone Jan 25, 2026

geruh approved these changes Jan 25, 2026

kris-gaudel mentioned this pull request Jan 25, 2026

Make manifest cache size configurable and allow for disabling #2952

Open
