Fix O(n²) performance in TokenList.group_tokens()#16
Conversation
Remove eager `grp.value = str(start)` recomputation that caused quadratic complexity when grouping large token lists (e.g. IN clauses with 20k elements). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0570ed9 to
824a6fe
Compare
|
will review, but am I seeing this correctly that we're not running the tests in circleci? would that be straightforward (for claude) to wire up? |
I will do this. I did run all the test manually though (hence the test fixes) |
jonmmease
left a comment
There was a problem hiding this comment.
reviewed, with Claude Code, the different logic paths that end up here, and everything checks out
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| grp = start | ||
| grp.tokens.extend(subtokens) | ||
| del self.tokens[start_idx + 1 : end_idx] | ||
| grp.value = str(start) |
There was a problem hiding this comment.
Stale group value breaks wrapping
Medium Severity
Removing the grp.value refresh in the extend=True path makes TokenList.value stay at its construction-time cache even when the group grows. Some formatting logic relies on len(token.value) for grouped tokens (e.g., identifiers) to decide wrapping/indentation, so extended groups can be measured too short and produce different (potentially overlong) formatted output.


Summary
Removes the
grp.value = str(start)line in theextendpath ofTokenList.group_tokens(), which caused O(n²) complexity when grouping large token lists. This replaces the monkey-patch insqlparse_fast.py(introduced in #39253) with a direct fix.The bug
TokenList.group_tokens()has two code paths: one that creates a new group, and one that extends an existing group (extend=True). The extend path is used bygroup_identifier_list,group_aliased,align_comments,group_values,group_typed_literal, andgroup_identifier— any grouping function that incrementally grows a group by appending tokens one at a time.On each extend call, the line
grp.value = str(start)calledstr()on the group, which invokesTokenList.__str__()→flatten()→ iterates all child tokens to rebuild the string. As the group grows, each call takes O(k) where k is the current group size. Over n extensions, this is O(1 + 2 + ... + n) = O(n²).For a query with a 20k-element IN clause, this took ~50 seconds. With this fix, the same query takes ~0.5 seconds.
Why this fix is safe
TokenList.valueis already a stale cacheTokenListinherits avalueslot fromToken, but overrides__str__()to always recompute from children:TokenList.valueis set once at construction viasuper().__init__(None, str(self))and then only updated in this oneextendcode path. It is never updated when tokens are added viainsert_before(),insert_after(), or when child groups are further restructured by subsequent grouping passes. For example,group_identifier_list(step 27 in the pipeline) may extend anIdentifierListand update.value, but thengroup_values(step 28) can restructure its children — and never updates.valueon theIdentifierList.This means
.valueon anyTokenListwas already unreliable before this change. The removed line just made one specific path do expensive work to keep it slightly less stale, without ever making it fully correct.This patch does not materially change the staleness
The removed line updated
.valueduring the extend path ofgroup_tokens(). But.valuewas already stale in other mutation paths (insert_before,insert_after, subsequent grouping passes). This patch simply makes the extend path consistent with every other mutation — none of them update.value.Any code that needs the correct string representation of a
TokenListshould callstr(token), which always recomputes correctly via__str__. Code that reads.valuedirectly on leafTokenobjects is unaffected (leaf tokens don't go through this code path, and their.valueis always correct).Places that read
.valueon potentially-grouped tokensWe audited all
.valueaccesses in the codebase:engine/grouping.py:361-365(group_functions): Readstmp_token.valuecomparing against'CREATE','TABLE','AS'. These are DML/DDL keywords — always leafTokenobjects, neverTokenListgroups.filters/reindent.py: Reads.valueon identifier tokens for line-wrapping width calculations. These areIdentifiergroups, but the values are used for approximatelen()calculations where minor staleness has no functional impact (slightly off wrap points).filters/*.py(output, others, right_margin, aligned_indent): All operate on tokens yielded byflatten(), which only yields leafTokenobjects —.valueis always correct.sql.py(get_typecast,get_parent_name, etc.): Navigate to specific child tokens by type/match first, then read.value. These target leaf tokens (type names, punctuation).Test coverage
The existing test suite (461 tests) passes in full. The tests cover:
test_parse.py— validates token tree structure for various SQL constructstest_format.py— validatessqlparse.format()output, which callsstr()on the formatted tree (exercises the correct__str__path)test_grouping.py— validates that tokens are grouped into the correctTokenListsubclassestest_split.py— validates statement boundary detectiontest_regressions.py— covers edge cases from past bugsThe one test that referenced
.valueon grouped tokens (test_configurable_keywords) has been updated to usestr(t)instead, which is the correct API for getting the string representation of aTokenList.