
[python] Add FollowUpScanner, IncrementalDiffScanner, sharding#7348

Merged
JingsongLi merged 5 commits into apache:master from tub:python-streaming-1b-scanners
Mar 11, 2026

tub commented Mar 5, 2026

Summary

  • Add FollowUpScanner hierarchy (base, delta, changelog) for streaming scan planning
  • Add IncrementalDiffScanner for diff-based streaming reads
  • Add sharding support to FileScanner
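As a rough illustration of the first and third bullets, the sketch below shows how a follow-up scanner hierarchy and file sharding might look. Every name in it (`Snapshot`, `should_scan`, `shard_files`, and the `commit_kind` and `changelog_manifest_list` fields) is a hypothetical stand-in rather than the API added in this PR, and the two predicates are assumptions about what each scanner filters on.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    """Hypothetical, minimal stand-in for a table snapshot."""
    snapshot_id: int
    commit_kind: str  # e.g. "APPEND" or "COMPACT"
    changelog_manifest_list: Optional[str] = None


class FollowUpScanner(ABC):
    """Decides whether a newly discovered snapshot should be read."""

    @abstractmethod
    def should_scan(self, snapshot: Snapshot) -> bool:
        ...


class DeltaFollowUpScanner(FollowUpScanner):
    """Assumption: only APPEND commits carry new rows worth streaming."""

    def should_scan(self, snapshot: Snapshot) -> bool:
        return snapshot.commit_kind == "APPEND"


class ChangelogFollowUpScanner(FollowUpScanner):
    """Assumption: scan only snapshots that produced changelog files."""

    def should_scan(self, snapshot: Snapshot) -> bool:
        return snapshot.changelog_manifest_list is not None


def shard_files(files: List[str], shard_index: int, shard_count: int) -> List[str]:
    """Keep one shard's files, assigning each file by position modulo shard_count."""
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    return [f for i, f in enumerate(files) if i % shard_count == shard_index]
```

With two shards, shard 0 would keep the files at positions 0, 2, 4 and shard 1 those at positions 1, 3, so each file is planned by exactly one reader.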

Stacked PR series

This is PR 1b/5 in the Python streaming read series:

  • PR 1a: Caching infrastructure + utilities
  • PR 1b (this): Scanners, sharding, row kind (~1096 lines)
  • PR 1b - part 2: Exposing RowKind
  • PR 1c: Consumer management
  • PR 2: Core streaming (AsyncStreamingTableScan)
  • PR 3: CLI (paimon tail)

Incremental diff (vs 1a): tub/paimon@python-streaming-1a-caching...tub:paimon:python-streaming-1b-scanners (or wait until 1a is merged, then compare)

Test plan

  • flake8 passes on all changed files
  • python -m pytest passes
  • New tests: follow_up_scanner_test.py, changelog_follow_up_scanner_test.py, incremental_diff_scanner_test.py

tub marked this pull request as ready for review March 5, 2026 16:30
tub added a commit to tub/paimon that referenced this pull request Mar 6, 2026
…rim docs, remove ChangelogProducer

- Upgrade cachetools to >=7,<8 for cachedmethod(info=True) support
- Remove ChangelogProducer enum (belongs in apache#7348 scanners branch)
- Replace manual cache hit/miss counters with @cachedmethod(info=True)
  decorator on ManifestFileManager, ManifestListManager, SnapshotManager
- Trim verbose docstrings across identifier, file_io, pyarrow_file_io,
  manifest_list_manager, and snapshot_manager
- Update cache tests to use cache_info() instead of manual counters

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
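
The move from manual hit/miss counters to `cache_info()` can be illustrated without cachetools. The sketch below uses the standard library's `functools.lru_cache` as a stand-in, not the `@cachedmethod(info=True)` decorator named in the commit, and `read_manifest` with its return value is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def read_manifest(path: str) -> str:
    """Hypothetical loader; a real one would read and parse the file."""
    return f"contents of {path}"


read_manifest("manifest-1")  # miss: computed and stored
read_manifest("manifest-1")  # hit: served from the cache
read_manifest("manifest-2")  # miss: a different key

# The decorator tracks statistics itself, so no hand-written counters are needed.
info = read_manifest.cache_info()
print(info.hits, info.misses, info.currsize)  # prints: 1 2 2
```

A test can then assert on `cache_info()` directly instead of inspecting counters maintained by the class under test.
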
tub added a commit to tub/paimon that referenced this pull request Mar 6, 2026
…olidate tests, fix parallelism

- Collapse repetitive module/class/method docstrings to one-liners in all
  scanner files (follow_up_scanner, delta, changelog, incremental_diff)
- Remove TDD process commentary from test docstrings
- Consolidate DeltaFollowUpScanner false-case tests into one parameterized test
- Remove misleading commit_kind from ChangelogFollowUpScanner test mocks
- Extract duplicated mock helpers to module-level functions
- Fix max(8, ...) parallelism bug: respect user-configured parallelism
- Remove obvious/redundant inline comments
- Standardize license headers to comment style, merge double docstrings
- Add clarifying docstring to ManifestListManager.read_all

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
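
The `max(8, ...)` fix listed above can be pictured as follows. The actual code is not shown in this conversation, so `buggy_parallelism`, `resolve_parallelism`, and the default of 8 are assumptions chosen only to show how a floor can override a user's setting.

```python
from typing import Optional

DEFAULT_PARALLELISM = 8  # assumed default, used only when nothing is configured


def buggy_parallelism(configured: Optional[int]) -> int:
    """A floor of 8: a user asking for 2 workers still gets 8."""
    return max(8, configured or 0)


def resolve_parallelism(configured: Optional[int]) -> int:
    """Honour an explicit setting; fall back to the default otherwise."""
    if configured is None:
        return DEFAULT_PARALLELISM
    if configured < 1:
        raise ValueError("parallelism must be at least 1")
    return configured
```

Under these assumptions, a configured value of 2 yields 8 workers in the first function and 2 in the second, which is the behaviour the commit message describes as respecting user-configured parallelism.
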
tub force-pushed the python-streaming-1b-scanners branch from a3f6a2b to fe599ca Compare March 9, 2026 17:11
tub and others added 3 commits March 10, 2026 11:04
- Add FollowUpScanner hierarchy (base, delta, changelog)
- Add IncrementalDiffScanner for diff-based streaming reads
- Add sharding support to FileScanner
- Add row kind support to TableRead for changelog streams

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…olidate tests, fix parallelism

- Collapse repetitive module/class/method docstrings to one-liners in all
  scanner files (follow_up_scanner, delta, changelog, incremental_diff)
- Remove TDD process commentary from test docstrings
- Consolidate DeltaFollowUpScanner false-case tests into one parameterized test
- Remove misleading commit_kind from ChangelogFollowUpScanner test mocks
- Extract duplicated mock helpers to module-level functions
- Fix max(8, ...) parallelism bug: respect user-configured parallelism
- Remove obvious/redundant inline comments
- Standardize license headers to comment style, merge double docstrings
- Add clarifying docstring to ManifestListManager.read_all

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub force-pushed the python-streaming-1b-scanners branch from 2da01d8 to fdaed61 Compare March 10, 2026 12:37
Move the include_row_kind feature out of this PR into a separate
branch (python-streaming-1b2-row-kind) to keep the scanners PR
focused on scanners and sharding only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
tub changed the title from "[python] Add scanners, sharding, and row kind support" to "[python] Add scanners, sharding" on Mar 10, 2026
tub changed the title from "[python] Add scanners, sharding" to "[python] Add FollowUpScanner, IncrementalDiffScanner, sharding" on Mar 10, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

JingsongLi left a comment

+1

JingsongLi merged commit 4a884fc into apache:master Mar 11, 2026
5 checks passed