Skip to content

Fix O(n²) performance in TokenList.group_tokens()#16

Merged
stevephodgson merged 4 commits intohexfrom
fix/group-tokens-quadratic-perf
Feb 24, 2026
Merged

Fix O(n²) performance in TokenList.group_tokens()#16
stevephodgson merged 4 commits intohexfrom
fix/group-tokens-quadratic-perf

Conversation

@stevephodgson
Copy link

Summary

Removes the grp.value = str(start) line in the extend path of TokenList.group_tokens(), which caused O(n²) complexity when grouping large token lists. This replaces the monkey-patch in sqlparse_fast.py (introduced in #39253) with a direct fix.

The bug

TokenList.group_tokens() has two code paths: one that creates a new group, and one that extends an existing group (extend=True). The extend path is used by group_identifier_list, group_aliased, align_comments, group_values, group_typed_literal, and group_identifier — any grouping function that incrementally grows a group by appending tokens one at a time.

On each extend call, the line grp.value = str(start) called str() on the group, which invokes TokenList.__str__()flatten() → iterates all child tokens to rebuild the string. As the group grows, each call takes O(k) where k is the current group size. Over n extensions, this is O(1 + 2 + ... + n) = O(n²).

For a query with a 20k-element IN clause, this took ~50 seconds. With this fix, the same query takes ~0.5 seconds.

Why this fix is safe

TokenList.value is already a stale cache

TokenList inherits a value slot from Token, but overrides __str__() to always recompute from children:

class Token:
    __slots__ = ('value', 'ttype', 'parent', ...)
    def __str__(self):
        return self.value  # returns cached value

class TokenList(Token):
    def __init__(self, tokens=None):
        super().__init__(None, str(self))  # caches value at construction time
    def __str__(self):
        return ''.join(token.value for token in self.flatten())  # always recomputes

TokenList.value is set once at construction via super().__init__(None, str(self)) and then only updated in this one extend code path. It is never updated when tokens are added via insert_before(), insert_after(), or when child groups are further restructured by subsequent grouping passes. For example, group_identifier_list (step 27 in the pipeline) may extend an IdentifierList and update .value, but then group_values (step 28) can restructure its children — and never updates .value on the IdentifierList.

This means .value on any TokenList was already unreliable before this change. The removed line just made one specific path do expensive work to keep it slightly less stale, without ever making it fully correct.

This patch does not materially change the staleness

The removed line updated .value during the extend path of group_tokens(). But .value was already stale in other mutation paths (insert_before, insert_after, subsequent grouping passes). This patch simply makes the extend path consistent with every other mutation — none of them update .value.

Any code that needs the correct string representation of a TokenList should call str(token), which always recomputes correctly via __str__. Code that reads .value directly on leaf Token objects is unaffected (leaf tokens don't go through this code path, and their .value is always correct).

Places that read .value on potentially-grouped tokens

We audited all .value accesses in the codebase:

  • engine/grouping.py:361-365 (group_functions): Reads tmp_token.value comparing against 'CREATE', 'TABLE', 'AS'. These are DML/DDL keywords — always leaf Token objects, never TokenList groups.
  • filters/reindent.py: Reads .value on identifier tokens for line-wrapping width calculations. These are Identifier groups, but the values are used for approximate len() calculations where minor staleness has no functional impact (slightly off wrap points).
  • filters/*.py (output, others, right_margin, aligned_indent): All operate on tokens yielded by flatten(), which only yields leaf Token objects — .value is always correct.
  • sql.py (get_typecast, get_parent_name, etc.): Navigate to specific child tokens by type/match first, then read .value. These target leaf tokens (type names, punctuation).

Test coverage

The existing test suite (461 tests) passes in full. The tests cover:

  • Parsing: test_parse.py — validates token tree structure for various SQL constructs
  • Formatting: test_format.py — validates sqlparse.format() output, which calls str() on the formatted tree (exercises the correct __str__ path)
  • Grouping: test_grouping.py — validates that tokens are grouped into the correct TokenList subclasses
  • Splitting: test_split.py — validates statement boundary detection
  • Regressions: test_regressions.py — covers edge cases from past bugs

The one test that referenced .value on grouped tokens (test_configurable_keywords) has been updated to use str(t) instead, which is the correct API for getting the string representation of a TokenList.

Remove eager `grp.value = str(start)` recomputation that caused
quadratic complexity when grouping large token lists (e.g. IN clauses
with 20k elements).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@stevephodgson stevephodgson force-pushed the fix/group-tokens-quadratic-perf branch from 0570ed9 to 824a6fe Compare February 23, 2026 12:32
@stevephodgson stevephodgson changed the base branch from master to hex February 23, 2026 12:32
@stevephodgson stevephodgson marked this pull request as ready for review February 23, 2026 13:12
@stevephodgson stevephodgson marked this pull request as draft February 23, 2026 13:13
@jonmmease
Copy link

will review, but am I seeing this correctly that we're not running the tests in circleci? would that be straightforward (for claude) to wire up?

@stevephodgson
Copy link
Author

will review, but am I seeing this correctly that we're not running the tests in circleci? would that be straightforward (for claude) to wire up?

I will do this. I did run all the test manually though (hence the test fixes)

@stevephodgson stevephodgson marked this pull request as ready for review February 23, 2026 14:29
Copy link

@jonmmease jonmmease left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed, with Claude Code, the different logic paths that end up here, and everything checks out

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

grp = start
grp.tokens.extend(subtokens)
del self.tokens[start_idx + 1 : end_idx]
grp.value = str(start)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale group value breaks wrapping

Medium Severity

Removing the grp.value refresh in the extend=True path makes TokenList.value stay at its construction-time cache even when the group grows. Some formatting logic relies on len(token.value) for grouped tokens (e.g., identifiers) to decide wrapping/indentation, so extended groups can be measured too short and produce different (potentially overlong) formatted output.

Fix in Cursor Fix in Web

@stevephodgson stevephodgson merged commit b0ea13e into hex Feb 24, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants