Cache parenthesized expression boundaries in the formatter#26344
Conversation
|
Merging this PR will improve performance by 6.4%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Simulation | formatter[large/dataset.py] |
9.2 ms | 8.4 ms | +9.92% |
| ⚡ | Simulation | formatter[numpy/ctypeslib.py] |
1.8 ms | 1.7 ms | +5.52% |
| ⚡ | Simulation | formatter[pydantic/types.py] |
3.5 ms | 3.3 ms | +5.37% |
| ⚡ | Simulation | formatter[unicode/pypinyin.py] |
669.8 µs | 638.9 µs | +4.85% |
Tip
Curious why this is faster? Use the CodSpeed MCP and ask your agent.
Comparing charlie/codex-formatter-parentheses-index (cc9584b) with main (645dca3)
Footnotes
-
4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
6531bd4 to
3670e4a
Compare
This comment was marked as spam.
This comment was marked as spam.
| comments: &'a Comments<'a>, | ||
| context: &PyFormatContext, |
There was a problem hiding this comment.
Do we need to pass both? context contains comments
| /// this index once avoids re-tokenizing the source for every expression. | ||
| #[derive(Debug)] | ||
| pub(crate) struct ParenthesesIndex { | ||
| ranges: FxHashSet<TextRange>, |
There was a problem hiding this comment.
Have you compared this version with a Vec<TextRange> with binary search?
| pub(crate) fn is_expression_parenthesized(expr: ExprRef, context: &PyFormatContext) -> bool { | ||
| context.is_expression_parenthesized(expr) | ||
| } |
There was a problem hiding this comment.
Two alternatives:
a) Make it a method on context. IMO, reads slightly nicer: context.is_expression_parenthesized(expr)
b) Define a new trait and implement it for ExprRef, Expr which has a is_parenthesized(context) method
|
|
||
| /// Returns `true` if the [`ExprRef`] is enclosed by parentheses in the source code. | ||
| pub(crate) fn is_expression_parenthesized( | ||
| pub(crate) fn is_expression_parenthesized(expr: ExprRef, context: &PyFormatContext) -> bool { |
There was a problem hiding this comment.
Can we delete this function. It looks like codex got tired of rewriting all call sites. But I rather prefer that over having this wrapper
|
|
||
| /// Returns `true` if the [`ExprRef`] is enclosed by parentheses by re-tokenizing the surrounding | ||
| /// source. Prefer [`is_expression_parenthesized`] when a formatting context is available. | ||
| pub(crate) fn is_expression_parenthesized_in_source( |
There was a problem hiding this comment.
It's unfortunate that we still need this function only for comment placement. Any chance we could compute the ParenthesizedExpressions index earlier and pass it to context instead? That would also eliminate the need for AssertEquivalent (which codex claims isn't equivalent, the new version is stricter because it matches pairs where the old implementation did not)
| let mut stack = Vec::<Option<TextSize>>::new(); | ||
| let mut previous_end = None; | ||
|
|
||
| for token in tokens { |
There was a problem hiding this comment.
It's a bit unfortunate that this requires a full tokens traversal.
I wonder if we could build it as part of Comments/CommentRanges (passed in to format_module_ast), because building CommentRanges also requires a full tokens pass. Building the struct earlier would also allow us to use it in CommentsBuilder
74d960a to
045e89b
Compare
Typing conformance resultsNo changes detected ✅Current numbersThe percentage of diagnostics emitted that were expected errors held steady at 94.47%. The percentage of expected errors that received a diagnostic held steady at 89.19%. The number of fully passing files held steady at 95/134. |
Memory usage reportMemory usage unchanged ✅ |
|
045e89b to
27fb258
Compare
| } | ||
| } | ||
|
|
||
| impl From<&Tokens> for TriviaRanges { |
There was a problem hiding this comment.
Sorry for expanding scope, it's completely fine not to do this in this PR. Should we also replace the function that we use in the linter to use the new trivia ranges?
There was a problem hiding this comment.
Good question, I think that should be a separate change.
| parenthesized.insert(TextRange::new(start, end)); | ||
| } | ||
| } | ||
| _ => { |
There was a problem hiding this comment.
This opens an interesting question. Is it intentional that we set start if the current token is a comment?
Edit: We actually don't do this... I should not review code this late :)
|
Oh, Codex is right here
I think it should be |
This reverts commit 27fb258.
2539a56 to
cc9584b
Compare
|
<3 |
Summary
Formatting frequently asks whether an expression was parenthesized in the source. Prior to this change, each check re-tokenized the surrounding source and skipped trivia independently, even though the parser token stream is already available.
This PR builds
TriviaRangesin the same token traversal that collects comment ranges. It stores those ranges alongside aParenthesizedExpressionsindex backed by a singleFxHashSet<TextRange>, so comment placement andPyFormatContextshare the same source-boundary data. Hot-path checks become one lookup without an additional token traversal or source-scanning fallback.Formatter paths that need the full parenthesized range continue to use the parsed-token
parentheses_iterator.