Fix O(n²) performance in ReindentFilter for large queries by jonmmease · Pull Request #18 · hex-inc/sqlparse

jonmmease · 2026-02-23T23:50:08Z

After reviewing #16, I set Claude loose on the repo and asked it to stress-test other forms of large queries for similar cases. It found three; the issues and fixes are described below. I also had it add pytest-benchmark benchmarks for a range of cases and report the speedups.

Summary

Fix O(n²) performance bottlenecks in ReindentFilter that cause sqlparse.format(..., reindent=True) to hang on large SQL queries. After these fixes, the same queries format in seconds.

This is a companion to #16 which fixes parse-time O(n²) in group_tokens. This branch fixes the remaining format-time O(n²) in the reindent filter.

Changes

Fix 1: Replace `_get_offset` with backward tree walk

Old: _flatten_up_to_token concatenated all tokens from the start of the statement, then took splitlines()[-1] to get the current line — O(statement_length) per call
New: Walk backward from the target token through the tree until a newline is found — O(line_length) per call
Added _reverse_leaves_before and _reverse_flatten for the backward walk
Removed the now-unused _flatten_up_to_token method
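
To make the backward walk concrete, here is a minimal sketch on a toy token tree. The `Node` class and the functions `reverse_flatten`, `reverse_leaves_before`, and `offset_of` are illustrative stand-ins written for this example, not the code in sqlparse/filters/reindent.py; they only assume that each token knows its parent and that groups hold an ordered list of children.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Node:
    """Toy token: a leaf carries text, a group carries children (hypothetical)."""
    value: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, *nodes: "Node") -> "Node":
        for node in nodes:
            node.parent = self
            self.children.append(node)
        return self


def reverse_flatten(node: Node) -> Iterator[Node]:
    """Yield the leaves under node, last to first."""
    if not node.children:
        yield node
        return
    for child in reversed(node.children):
        yield from reverse_flatten(child)


def reverse_leaves_before(token: Node) -> Iterator[Node]:
    """Yield every leaf that precedes token, nearest first, climbing parents."""
    current = token
    while current.parent is not None:
        siblings = current.parent.children
        # Identity-based position lookup; this scan is itself O(len(siblings)).
        idx = next(i for i, s in enumerate(siblings) if s is current)
        for i in range(idx - 1, -1, -1):
            yield from reverse_flatten(siblings[i])
        current = current.parent


def offset_of(token: Node) -> int:
    """Column at which token starts: characters since the last newline."""
    width = 0
    for leaf in reverse_leaves_before(token):
        newline_at = leaf.value.rfind("\n")
        if newline_at != -1:
            # Stop at the first newline seen; nothing earlier is examined.
            return width + len(leaf.value) - newline_at - 1
        width += len(leaf.value)
    return width


# Represents "SELECT a,\n       b AS x"; the target is the leaf "x".
target = Node("x")
alias = Node().add(Node("b"), Node(" "), Node("AS"), Node(" "), target)
stmt = Node().add(Node("SELECT"), Node(" "), Node("a"), Node(",\n"),
                  Node(" " * 7), alias)
print(offset_of(target))  # 12
```

Because the generator is lazy, the walk stops at the first leaf containing a newline, so the cost tracks the length of the current line rather than the length of the statement.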

Fix 2: Eliminate `token_index` calls in `_process_identifierlist`

Old: insert_before(token, ...) called token_index(token) → list.index(token) = O(n) linear scan per insertion
New: Pre-compute {id(token): index} mapping once, pass integer indices directly to insert_before/insert_after, and track cumulative shift from insertions
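
The index-mapping idea can be sketched in a few lines. This is an illustration under assumptions, not the PR's code: `Tok` and `insert_before_each` are hypothetical names, the tokens live in a plain Python list, and the targets are visited in document order so that a single running shift is enough.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(eq=False)
class Tok:
    """Toy token; eq=False keeps comparisons identity-based (hypothetical)."""
    value: str


def insert_before_each(tokens: List[Tok], targets: List[Tok],
                       make_token: Callable[[], Tok]) -> List[Tok]:
    """Insert a fresh token before each target without calling list.index().

    Assumes targets appear in the same order as in tokens.
    """
    # One O(n) pass records every token's original position.
    position = {id(tok): i for i, tok in enumerate(tokens)}
    shift = 0  # number of insertions made so far
    for tok in targets:
        # Each earlier insertion moved this target one slot to the right.
        tokens.insert(position[id(tok)] + shift, make_token())
        shift += 1
    return tokens


tokens = [Tok(v) for v in ["a", ",", "b", ",", "c"]]
insert_before_each(tokens, [tokens[2], tokens[4]], lambda: Tok("\n"))
print([t.value for t in tokens])  # ['a', ',', '\n', 'b', ',', '\n', 'c']
```

Note that `list.insert` still shifts the tail of the list, so each insertion is not constant time; what this pattern removes is the Python-level linear search for the target on every insertion.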

Fix 3: Eliminate O(n²) in `_process_values` for large INSERT statements

Old: Same two O(n²) patterns as above in the VALUES processing loop
New: Pass integer indices from token_next_by directly. Added _parent_idx hint parameter so the backward walk skips the O(n) index() call on the hot parent. Hoisted loop-invariant offset in comma_first mode.
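
A small counting experiment shows why a per-iteration `index()` scan is quadratic and why an index hint removes it. The helper names `find_index` and `count_comparisons` and the `hint` parameter are invented for this sketch; they only mimic the role the description gives to the parent-index hint.

```python
from typing import List, Optional


def find_index(tokens: List[object], token: object,
               hint: Optional[int], counter: List[int]) -> int:
    """Locate token by identity, counting comparisons in counter[0]."""
    if hint is not None:
        counter[0] += 1
        if tokens[hint] is token:
            return hint  # the caller already knew the position
    for i, candidate in enumerate(tokens):
        counter[0] += 1
        if candidate is token:
            return i
    raise ValueError("token not found")


def count_comparisons(n: int, use_hint: bool) -> int:
    """Total comparisons needed to look up every token in an n-token list."""
    tokens = [object() for _ in range(n)]
    counter = [0]
    for idx, tok in enumerate(tokens):
        found = find_index(tokens, tok, idx if use_hint else None, counter)
        assert found == idx
    return counter[0]


print(count_comparisons(1000, use_hint=False))  # 500500, i.e. n(n+1)/2
print(count_comparisons(1000, use_hint=True))   # 1000, i.e. n
```

The loop that walks the rows already knows each index, so passing it along costs nothing and turns the lookup into a single check.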

Benchmark Results

Reproducible benchmark tests are included in the first commit (tests/test_benchmarks.py). Run with:

uv run --with pytest --with pytest-benchmark pytest tests/test_benchmarks.py -v

Reindent benchmarks

| Benchmark | Example SQL | Before | After | Speedup |
|---|---|---|---|---|
| wide_select | `SELECT col_0, col_1, ... col_4999 FROM t` | 0.66s | 0.16s | 4.1x |
| large_in_list | `SELECT * FROM t WHERE id IN (0, 1, ... 99999)` | DNF (>120s) | 6.39s | FIXED |
| large_insert | `INSERT INTO t VALUES (0, 1), (1, 2), ... (24999, 25000)` | DNF (>120s) | 2.56s | FIXED |
| deep_subqueries | `SELECT * FROM (SELECT * FROM (... FROM t ...) s0) s11` | 0.00s | 0.00s | — |
| many_joins | `SELECT * FROM t0 JOIN t1 ON ... JOIN t2 ON ... (×500)` | 0.07s | 0.07s | — |
| complex_where | `SELECT * FROM t WHERE ((col_0 = 0 AND ...) OR ...) (depth=8, breadth=3)` | 16.56s | 0.60s | 27.6x |
| mixed_batch | `CREATE TABLE ...; INSERT INTO ...; SELECT ...; UPDATE ... (×50)` | 0.20s | 0.13s | 1.5x |
| heavy_formatting | `WITH cte AS (SELECT CASE WHEN col_0 > 0 ... (×200)) SELECT * FROM cte` | 0.20s | 0.05s | 4.0x |

INSERT scaling (Fix 3)

| Benchmark | Example SQL | Before | After | Speedup |
|---|---|---|---|---|
| insert 5k rows | `INSERT INTO t VALUES (0, 1), ... (4999, 5000)` | 13.26s | 0.37s | 35.8x |
| insert 10k rows | `INSERT INTO t VALUES (0, 1), ... (9999, 10000)` | DNF (>120s) | 0.86s | FIXED |
| insert 25k rows | `INSERT INTO t VALUES (0, 1), ... (24999, 25000)` | DNF (>120s) | 2.47s | FIXED |

Testing

All 467 existing tests pass
11 pytest-benchmark tests included for reproducible performance measurement

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

sqlparse/filters/reindent.py

Adds 11 benchmark tests covering the query types from the PR performance tables. Each test generates a deterministic SQL query and measures sqlparse.format(sql, reindent=True).

Run with: uv run --with pytest --with pytest-benchmark pytest tests/test_benchmarks.py -v

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Two independent O(n²) bottlenecks in the reindent filter caused format operations to take >300s or time out entirely on 1MB SQL queries.

Replace _get_offset's _flatten_up_to_token (which flattened the entire statement from the beginning for every token) with a backward tree walk that only examines tokens on the current line: O(line_length) instead of O(statement_length).

Eliminate token_index/list.index calls in _process_identifierlist by pre-computing a token-to-index mapping and tracking insertion shifts, avoiding O(n) linear scans per insert_before call.

Results on 1MB queries vs prior branch:
- wide_select format: >300s timeout → 8s
- large_in_list format: 144s → 9s (16x)
- large_insert format: 69s → 5s (15x)
- deep_subqueries format: 36s → 14s (2.6x)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

_process_values had two per-iteration O(n) calls that made it O(n²) for large INSERT VALUES:

1. insert_before(ptoken, ...) / insert_after(ptoken, ...) passed token objects, triggering a token_index() linear scan. Fixed by passing the integer index (ptidx) already available from token_next_by().
2. _get_offset(token) called _reverse_leaves_before, which did parent.tokens.index(current) on the Values group. Fixed by adding an optional _parent_idx hint so the backward walk skips the O(n) index() call on the hot parent.

Also hoists _get_offset(first_token) out of the loop in comma_first mode since the value is loop-invariant.

Before: 5k rows 30s, 10k rows 87s (quadratic)
After: 5k rows 1s, 10k rows 3s, 25k rows 6s (linear)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cursor bot reviewed Feb 23, 2026

sqlparse/filters/reindent.py (resolved)

jonmmease force-pushed the fix/reindent-quadratic-perf branch from 38d7d6d to 6da15e1 on February 24, 2026 16:10

jonmmease changed the base branch from fix/group-tokens-quadratic-perf to hex February 24, 2026 16:11

jonmmease marked this pull request as draft February 24, 2026 16:37

jonmmease and others added 3 commits February 24, 2026 11:46

jonmmease force-pushed the fix/reindent-quadratic-perf branch from 6da15e1 to 3326190 on February 24, 2026 16:46

jonmmease marked this pull request as ready for review February 24, 2026 17:50

jonmmease requested review from glentakahashi and stevephodgson February 24, 2026 17:50

jonmmease mentioned this pull request Feb 24, 2026

perf: some python optimizations #21

Draft

stevephodgson requested changes Feb 25, 2026

sqlparse/filters/reindent.py (outdated, resolved)

sqlparse/filters/reindent.py (outdated, resolved)

Address PR review: rename _parent_idx, remove comment

d32aade

Rename _parent_idx to known_parent_and_idx for clarity and remove unnecessary comment in _process_values per reviewer feedback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jonmmease requested a review from stevephodgson February 25, 2026 14:55

stevephodgson approved these changes Feb 25, 2026

jonmmease wants to merge 4 commits into hex from fix/reindent-quadratic-perf
