
Cache parenthesized expression boundaries in the formatter#26344

Merged: charliermarsh merged 8 commits into main from charlie/codex-formatter-parentheses-index on Jun 26, 2026

@charliermarsh (Member) commented Jun 24, 2026

Summary

Formatting frequently asks whether an expression was parenthesized in the source. Prior to this change, each check re-tokenized the surrounding source and skipped trivia independently, even though the parser token stream is already available.

This PR builds TriviaRanges in the same token traversal that collects comment ranges. It stores those ranges alongside a ParenthesizedExpressions index backed by a single FxHashSet<TextRange>, so comment placement and PyFormatContext share the same source-boundary data. Hot-path checks become one lookup without an additional token traversal or source-scanning fallback.
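As an illustration of the idea described above, here is a minimal sketch of a single token pass that collects comment ranges and the content ranges of matched parentheses, followed by a hash lookup. It is not the PR's implementation: `Kind`, `Range`, `Token`, `tok`, and `Trivia` are simplified, hypothetical stand-ins for the parser's token types, `TextRange`, and `TriviaRanges`, and the standard-library `HashSet` replaces `FxHashSet`.

```rust
use std::collections::HashSet;

/// Simplified stand-in for the parser's token kinds (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Kind {
    LParen,
    RParen,
    Comment,
    /// A newline that does not end a statement; treated as trivia here.
    Newline,
    /// Any other token (names, operators, literals, ...).
    Other,
}

/// Half-open byte range `[start, end)`; stands in for `TextRange`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Range {
    start: u32,
    end: u32,
}

#[derive(Clone, Copy, Debug)]
struct Token {
    kind: Kind,
    range: Range,
}

/// Hypothetical helper that keeps token construction short.
fn tok(kind: Kind, start: u32, end: u32) -> Token {
    Token { kind, range: Range { start, end } }
}

/// Comment ranges plus the content ranges of matched parentheses.
#[derive(Debug, Default)]
struct Trivia {
    comments: Vec<Range>,
    parenthesized: HashSet<Range>,
}

impl Trivia {
    /// Builds both collections in one pass over the tokens.
    fn from_tokens(tokens: &[Token]) -> Self {
        let mut trivia = Trivia::default();
        // One slot per open parenthesis: the start of the first
        // non-trivia token seen inside it, once there is one.
        let mut stack: Vec<Option<u32>> = Vec::new();
        // End of the most recent non-trivia token.
        let mut previous_end: Option<u32> = None;

        for token in tokens {
            match token.kind {
                Kind::Comment => trivia.comments.push(token.range),
                Kind::Newline => {}
                Kind::RParen => {
                    // Record the content between a matched, non-empty pair.
                    if let (Some(Some(start)), Some(end)) = (stack.pop(), previous_end) {
                        trivia.parenthesized.insert(Range { start, end });
                    }
                    previous_end = Some(token.range.end);
                }
                Kind::LParen | Kind::Other => {
                    // This token is content of the innermost open parenthesis.
                    if let Some(slot) = stack.last_mut() {
                        if slot.is_none() {
                            *slot = Some(token.range.start);
                        }
                    }
                    previous_end = Some(token.range.end);
                    if token.kind == Kind::LParen {
                        stack.push(None);
                    }
                }
            }
        }
        trivia
    }

    /// A single hash lookup; no token traversal or source scanning.
    fn is_parenthesized(&self, range: Range) -> bool {
        self.parenthesized.contains(&range)
    }
}
```

In this sketch, an expression counts as parenthesized when its own range equals the trivia-free content range of some matched pair, which is why comments and non-logical newlines are skipped when tracking the start and end.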

Formatter paths that need the full parenthesized range continue to use the parsed-token parentheses_iterator.

@astral-sh-bot (bot) commented Jun 24, 2026

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@codspeed-hq (bot) commented Jun 24, 2026

Merging this PR will improve performance by 6.4%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 4 improved benchmarks
✅ 143 untouched benchmarks
⏩ 4 skipped benchmarks (see footnote 1)

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | formatter[large/dataset.py] | 9.2 ms | 8.4 ms | +9.92% |
| Simulation | formatter[numpy/ctypeslib.py] | 1.8 ms | 1.7 ms | +5.52% |
| Simulation | formatter[pydantic/types.py] | 3.5 ms | 3.3 ms | +5.37% |
| Simulation | formatter[unicode/pypinyin.py] | 669.8 µs | 638.9 µs | +4.85% |

Comparing charlie/codex-formatter-parentheses-index (cc9584b) with main (645dca3)

Footnotes

  1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@charliermarsh force-pushed the charlie/codex-formatter-parentheses-index branch from 6531bd4 to 3670e4a on June 25, 2026 17:50
@charliermarsh added the performance (Potential performance improvement) and formatter (Related to the formatter) labels on Jun 25, 2026
@charliermarsh marked this pull request as ready for review on June 25, 2026 18:49

@MichaReiser (Member) left a comment

Nice

Comment on lines +37 to +38
comments: &'a Comments<'a>,
context: &PyFormatContext,

Member

Do we need to pass both? The context already contains the comments.

/// this index once avoids re-tokenizing the source for every expression.
#[derive(Debug)]
pub(crate) struct ParenthesesIndex {
    ranges: FxHashSet<TextRange>,

Member

Have you compared this version with a Vec<TextRange> with binary search?
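For reference, the alternative raised in this question could look roughly like the sketch below: keep the ranges in a sorted vector and answer membership with a binary search. `Range` and `SortedRanges` are hypothetical names standing in for `TextRange` and the index type, and the thread does not record which variant benchmarked better.

```rust
/// Half-open byte range; stands in for `TextRange`. The derived `Ord`
/// compares `start` first, then `end`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Range {
    start: u32,
    end: u32,
}

/// Hypothetical alternative index: a sorted vector with binary-search lookup.
#[derive(Debug)]
struct SortedRanges {
    ranges: Vec<Range>,
}

impl SortedRanges {
    fn new(mut ranges: Vec<Range>) -> Self {
        // Ranges recorded at each closing parenthesis arrive ordered by
        // their end, not their start, so sort before searching.
        ranges.sort_unstable();
        ranges.dedup();
        Self { ranges }
    }

    /// O(log n) exact-match lookup, with no hashing involved.
    fn contains(&self, range: Range) -> bool {
        self.ranges.binary_search(&range).is_ok()
    }
}
```

The trade-off is the usual one: the vector is compact and cache-friendly but needs a sort and a logarithmic search, while the hash set gives expected constant-time lookups at the cost of hashing each queried range.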

Comment on lines +164 to +166
pub(crate) fn is_expression_parenthesized(expr: ExprRef, context: &PyFormatContext) -> bool {
    context.is_expression_parenthesized(expr)
}

Member

Two alternatives:

a) Make it a method on context. IMO, this reads slightly nicer: context.is_expression_parenthesized(expr)
b) Define a new trait with an is_parenthesized(context) method and implement it for ExprRef and Expr
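To make the two alternatives concrete, here is a small sketch of both shapes. All types are hypothetical stand-ins: `Context` for `PyFormatContext`, `Expr` and `ExprRef` for the AST types of the same names, and `IsParenthesized` is an assumed trait name that does not appear in the thread.

```rust
use std::collections::HashSet;

/// Half-open byte range; stands in for `TextRange`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Range {
    start: u32,
    end: u32,
}

/// Hypothetical stand-ins for the AST's `Expr` and `ExprRef`.
struct Expr {
    range: Range,
}

#[derive(Clone, Copy)]
struct ExprRef<'a>(&'a Expr);

/// Hypothetical stand-in for `PyFormatContext`.
struct Context {
    parenthesized: HashSet<Range>,
}

impl Context {
    /// Option (a): the check is a method on the context.
    fn is_expression_parenthesized(&self, expr: ExprRef<'_>) -> bool {
        self.parenthesized.contains(&expr.0.range)
    }
}

/// Option (b): a trait implemented for the expression types.
trait IsParenthesized {
    fn is_parenthesized(&self, context: &Context) -> bool;
}

impl IsParenthesized for ExprRef<'_> {
    fn is_parenthesized(&self, context: &Context) -> bool {
        context.is_expression_parenthesized(*self)
    }
}

impl IsParenthesized for Expr {
    fn is_parenthesized(&self, context: &Context) -> bool {
        ExprRef(self).is_parenthesized(context)
    }
}
```

With option (a) a call site reads `context.is_expression_parenthesized(expr)`; with option (b) it reads `expr.is_parenthesized(&context)`. Either way the free-function wrapper becomes unnecessary.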


/// Returns `true` if the [`ExprRef`] is enclosed by parentheses in the source code.
- pub(crate) fn is_expression_parenthesized(
+ pub(crate) fn is_expression_parenthesized(expr: ExprRef, context: &PyFormatContext) -> bool {

Member

Can we delete this function? It looks like Codex got tired of rewriting all the call sites, but I'd prefer updating them over keeping this wrapper.


/// Returns `true` if the [`ExprRef`] is enclosed by parentheses by re-tokenizing the surrounding
/// source. Prefer [`is_expression_parenthesized`] when a formatting context is available.
pub(crate) fn is_expression_parenthesized_in_source(

Member

It's unfortunate that we still need this function only for comment placement. Any chance we could compute the ParenthesizedExpressions index earlier and pass it to context instead? That would also eliminate the need for AssertEquivalent (which Codex claims isn't equivalent: the new version is stricter because it matches pairs where the old implementation did not).

let mut stack = Vec::<Option<TextSize>>::new();
let mut previous_end = None;

for token in tokens {

Member

It's a bit unfortunate that this requires a full tokens traversal.

I wonder if we could build it as part of Comments/CommentRanges (passed into format_module_ast), because building CommentRanges also requires a full pass over the tokens. Building the struct earlier would also allow us to use it in CommentsBuilder.

@charliermarsh marked this pull request as draft on June 26, 2026 12:31
@charliermarsh force-pushed the charlie/codex-formatter-parentheses-index branch from 74d960a to 045e89b on June 26, 2026 12:33
@astral-sh-bot (bot) commented Jun 26, 2026

Typing conformance results

No changes detected ✅

Current numbers
The percentage of diagnostics emitted that were expected errors held steady at 94.47%. The percentage of expected errors that received a diagnostic held steady at 89.19%. The number of fully passing files held steady at 95/134.

@astral-sh-bot (bot) commented Jun 26, 2026

Memory usage report

Memory usage unchanged ✅

@astral-sh-bot (bot) commented Jun 26, 2026

ecosystem-analyzer results

No diagnostic changes detected ✅

Flaky changes detected. This PR summary excludes flaky changes; see the HTML report for details.

Full report with detailed diff (timing results)

@charliermarsh force-pushed the charlie/codex-formatter-parentheses-index branch from 045e89b to 27fb258 on June 26, 2026 14:22
@charliermarsh marked this pull request as ready for review on June 26, 2026 14:59

@MichaReiser (Member) left a comment

wow, this is huge

}
}

impl From<&Tokens> for TriviaRanges {

Member

Sorry for expanding scope, it's completely fine not to do this in this PR. Should we also replace the function that we use in the linter to use the new trivia ranges?

Member Author

Good question, I think that should be a separate change.

parenthesized.insert(TextRange::new(start, end));
}
}
_ => {

@MichaReiser (Member) Jun 26, 2026

This opens an interesting question. Is it intentional that we set start if the current token is a comment?

Edit: We actually don't do this... I should not review code this late :)

Comment thread crates/ruff_python_formatter/src/comments/mod.rs Outdated
Comment thread crates/ruff_python_formatter/src/expression/binary_like.rs Outdated
@MichaReiser (Member) commented Jun 26, 2026

Oh, Codex is right here

[P3] TriviaRanges should not dereference to CommentRanges. Methods like trivia_ranges.is_empty() silently inspect only comments, despite the type also containing parenthesized-expression data. This ambiguity already appears in Comments::from_ast and could cause future misuse. Prefer explicit .comments() access and remove the Deref implementation.

I think it should be trivia.comments() and trivia.parenthesized()
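A sketch of the explicit-accessor shape suggested here, under the assumption that the trivia type holds one comment collection and one parenthesized-range collection. `Range`, `CommentRanges`, `ParenthesizedRanges`, and `TriviaRanges` below are simplified, hypothetical stand-ins rather than the crate's real definitions.

```rust
use std::collections::HashSet;

/// Half-open byte range; stands in for `TextRange`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Range {
    start: u32,
    end: u32,
}

/// Simplified stand-in for the comment range collection.
#[derive(Debug, Default)]
struct CommentRanges(Vec<Range>);

impl CommentRanges {
    fn is_empty(&self) -> bool {
        self.0.is_empty()
    }
}

/// Simplified stand-in for the parenthesized-expression index.
#[derive(Debug, Default)]
struct ParenthesizedRanges(HashSet<Range>);

impl ParenthesizedRanges {
    fn is_empty(&self) -> bool {
        self.0.is_empty()
    }

    fn contains(&self, range: Range) -> bool {
        self.0.contains(&range)
    }
}

/// Holds both kinds of trivia data. There is deliberately no
/// `Deref<Target = CommentRanges>` impl, so `trivia.is_empty()` does not
/// compile and callers must say which collection they mean.
#[derive(Debug, Default)]
struct TriviaRanges {
    comments: CommentRanges,
    parenthesized: ParenthesizedRanges,
}

impl TriviaRanges {
    fn comments(&self) -> &CommentRanges {
        &self.comments
    }

    fn parenthesized(&self) -> &ParenthesizedRanges {
        &self.parenthesized
    }
}
```

A file with parentheses but no comments shows the ambiguity: `trivia.comments().is_empty()` is true while `trivia.parenthesized().is_empty()` is false, and a dereferencing `is_empty()` would silently report only the first.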

@charliermarsh force-pushed the charlie/codex-formatter-parentheses-index branch from 2539a56 to cc9584b on June 26, 2026 18:53
@charliermarsh merged commit 2cd74f0 into main on Jun 26, 2026
61 of 62 checks passed
@charliermarsh (Member Author)

<3

@charliermarsh deleted the charlie/codex-formatter-parentheses-index branch on June 26, 2026 19:04