Fix O(n²) performance in TokenList.group_tokens() by stevephodgson · Pull Request #16 · hex-inc/sqlparse

stevephodgson · 2026-02-23T12:09:47Z

Summary

Removes the grp.value = str(start) line in the extend path of TokenList.group_tokens(), which caused O(n²) complexity when grouping large token lists. This replaces the monkey-patch in sqlparse_fast.py (introduced in #39253) with a direct fix.

The bug

TokenList.group_tokens() has two code paths: one that creates a new group, and one that extends an existing group (extend=True). The extend path is used by group_identifier_list, group_aliased, align_comments, group_values, group_typed_literal, and group_identifier — any grouping function that incrementally grows a group by appending tokens one at a time.

On each extend call, the line grp.value = str(start) called str() on the group, which invokes TokenList.__str__() → flatten() → iterates all child tokens to rebuild the string. As the group grows, each call takes O(k) where k is the current group size. Over n extensions, this is O(1 + 2 + ... + n) = O(n²).

For a query with a 20k-element IN clause, this took ~50 seconds. With this fix, the same query takes ~0.5 seconds.

Why this fix is safe

`TokenList.value` is already a stale cache

TokenList inherits a value slot from Token, but overrides __str__() to always recompute from children:

class Token:
    __slots__ = ('value', 'ttype', 'parent', ...)
    def __str__(self):
        return self.value  # returns cached value

class TokenList(Token):
    def __init__(self, tokens=None):
        super().__init__(None, str(self))  # caches value at construction time
    def __str__(self):
        return ''.join(token.value for token in self.flatten())  # always recomputes

TokenList.value is set once at construction via super().__init__(None, str(self)) and then only updated in this one extend code path. It is never updated when tokens are added via insert_before(), insert_after(), or when child groups are further restructured by subsequent grouping passes. For example, group_identifier_list (step 27 in the pipeline) may extend an IdentifierList and update .value, but then group_values (step 28) can restructure its children — and never updates .value on the IdentifierList.

This means .value on any TokenList was already unreliable before this change. The removed line just made one specific path do expensive work to keep it slightly less stale, without ever making it fully correct.

This patch does not materially change the staleness

The removed line updated .value during the extend path of group_tokens(). But .value was already stale in other mutation paths (insert_before, insert_after, subsequent grouping passes). This patch simply makes the extend path consistent with every other mutation — none of them update .value.

Any code that needs the correct string representation of a TokenList should call str(token), which always recomputes correctly via __str__. Code that reads .value directly on leaf Token objects is unaffected (leaf tokens don't go through this code path, and their .value is always correct).

Places that read `.value` on potentially-grouped tokens

We audited all .value accesses in the codebase:

engine/grouping.py:361-365 (group_functions): Reads tmp_token.value comparing against 'CREATE', 'TABLE', 'AS'. These are DML/DDL keywords — always leaf Token objects, never TokenList groups.
filters/reindent.py: Reads .value on identifier tokens for line-wrapping width calculations. These are Identifier groups, but the values are used for approximate len() calculations where minor staleness has no functional impact (slightly off wrap points).
filters/*.py (output, others, right_margin, aligned_indent): All operate on tokens yielded by flatten(), which only yields leaf Token objects — .value is always correct.
sql.py (get_typecast, get_parent_name, etc.): Navigate to specific child tokens by type/match first, then read .value. These target leaf tokens (type names, punctuation).

Test coverage

The existing test suite (461 tests) passes in full. The tests cover:

Parsing: test_parse.py — validates token tree structure for various SQL constructs
Formatting: test_format.py — validates sqlparse.format() output, which calls str() on the formatted tree (exercises the correct __str__ path)
Grouping: test_grouping.py — validates that tokens are grouped into the correct TokenList subclasses
Splitting: test_split.py — validates statement boundary detection
Regressions: test_regressions.py — covers edge cases from past bugs

The one test that referenced .value on grouped tokens (test_configurable_keywords) has been updated to use str(t) instead, which is the correct API for getting the string representation of a TokenList.

Remove eager `grp.value = str(start)` recomputation that caused quadratic complexity when grouping large token lists (e.g. IN clauses with 20k elements). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

tests/test_parse.py

jonmmease · 2026-02-23T13:20:48Z

will review, but am I seeing this correctly that we're not running the tests in circleci? would that be straightforward (for claude) to wire up?

stevephodgson · 2026-02-23T14:27:47Z

will review, but am I seeing this correctly that we're not running the tests in circleci? would that be straightforward (for claude) to wire up?

I will do this. I did run all the test manually though (hence the test fixes)

jonmmease

reviewed, with Claude Code, the different logic paths that end up here, and everything checks out

…atic-perf

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.}

cursor · 2026-02-24T11:04:00Z

sqlparse/sql.py

            grp = start
            grp.tokens.extend(subtokens)
            del self.tokens[start_idx + 1 : end_idx]
-            grp.value = str(start)


Stale group value breaks wrapping

Medium Severity

Removing the grp.value refresh in the extend=True path makes TokenList.value stay at its construction-time cache even when the group grows. Some formatting logic relies on len(token.value) for grouped tokens (e.g., identifiers) to decide wrapping/indentation, so extended groups can be measured too short and produce different (potentially overlong) formatted output.

…atic-perf

Fix O(n²) performance in TokenList.group_tokens() extend path

824a6fe

Remove eager `grp.value = str(start)` recomputation that caused quadratic complexity when grouping large token lists (e.g. IN clauses with 20k elements). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

stevephodgson force-pushed the fix/group-tokens-quadratic-perf branch from 0570ed9 to 824a6fe Compare February 23, 2026 12:32

stevephodgson changed the base branch from master to hex February 23, 2026 12:32

stevephodgson requested review from gjoseph92, glentakahashi, jonmmease and sukwanlee February 23, 2026 13:12

stevephodgson marked this pull request as ready for review February 23, 2026 13:12

stevephodgson marked this pull request as draft February 23, 2026 13:13

cursor bot reviewed Feb 23, 2026

View reviewed changes

tests/test_parse.py Outdated Show resolved Hide resolved

merge conflict

dbd42b0

stevephodgson marked this pull request as ready for review February 23, 2026 14:29

jonmmease approved these changes Feb 23, 2026

View reviewed changes

glentakahashi approved these changes Feb 23, 2026

View reviewed changes

jonmmease mentioned this pull request Feb 23, 2026

Fix O(n²) performance in ReindentFilter for large queries #18

Merged

Merge remote-tracking branch 'origin/hex' into fix/group-tokens-quadr…

d5242a7

…atic-perf

cursor bot reviewed Feb 24, 2026

View reviewed changes

stevephodgson closed this Feb 24, 2026

stevephodgson reopened this Feb 24, 2026

Merge remote-tracking branch 'origin/hex' into fix/group-tokens-quadr…

e64afe4

…atic-perf

stevephodgson merged commit b0ea13e into hex Feb 24, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix O(n²) performance in TokenList.group_tokens()#16

Fix O(n²) performance in TokenList.group_tokens()#16
stevephodgson merged 4 commits intohexfrom
fix/group-tokens-quadratic-perf

stevephodgson commented Feb 23, 2026

Uh oh!

Uh oh!

jonmmease commented Feb 23, 2026

Uh oh!

stevephodgson commented Feb 23, 2026

Uh oh!

jonmmease left a comment

Uh oh!

cursor bot left a comment

Uh oh!

cursor bot Feb 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

stevephodgson commented Feb 23, 2026

Summary

The bug

Why this fix is safe

TokenList.value is already a stale cache

This patch does not materially change the staleness

Places that read .value on potentially-grouped tokens

Test coverage

Uh oh!

Uh oh!

jonmmease commented Feb 23, 2026

Uh oh!

stevephodgson commented Feb 23, 2026

Uh oh!

jonmmease left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor bot Feb 24, 2026

Choose a reason for hiding this comment

Stale group value breaks wrapping

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

`TokenList.value` is already a stale cache

Places that read `.value` on potentially-grouped tokens