Spill stack to the heap to support arbitrarily deep nested programs#25464

Open
MichaReiser wants to merge 12 commits into main from micha/parser-stacker

@MichaReiser

@MichaReiser MichaReiser commented May 29, 2026

Member

Summary

This PR removes the hard nesting limits in the parser and instead spills the stack to the heap when it approaches the stack limit, using stacker. This is limited to supported platforms.

Why?

The primary goal of Ruff's parser is to accept every valid (C)Python program. This is currently not the case because Ruff's recursion limits are stricter than CPython's. CPython uses multiple limits:

  • 200: For nested parentheses
  • 6000: Increased whenever CPython's PEG parser enters a new rule. The limit is lower for WASM release and even lower for WASM debug builds.
  • When reaching CPython's stack limit

I don't see a way for Ruff to approximate the latter two limits other than by finding limits that are at least as large as CPython's. However, this is not enough to protect against all stack overflows:

  • Different platforms use different (default) stack sizes
  • Applications can change the default stack size
  • The stack frames are different between compilation profiles (debug > release)
  • The stack space used per frame depends on user code (how much stack does the program allocate when visiting each node?) and on where Ruff's parser is called: less stack remains if the call happens very deep in a call tree

We'd have to pick platform and compilation-target-specific limits. But even this won't be enough, because we don't control the stack size, or when the parser is called and how much stack size is remaining at this point.

The idea of this PR is to remove the stack limitation from Ruff's parser and let upstream applications enforce the restrictions relevant to them if they want to (this only applies to stack overflows; allocation failures are out of scope). This makes Ruff's parser conformant with the Python standard and CPython again.

In general, I advise upstream tools to use a uniform stack size on platforms that support changing the default thread stack size. A larger-than-default stack size should be more than enough to support all reasonable Python programs, and stacker gives them the tool to increase the stack size on demand.
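
As a rough sketch of the uniform stack size advice, assuming only the Rust standard library: the helper `run_with_stack`, the thread name, and the `nesting_depth` stand-in for a recursive parser are all hypothetical and not part of Ruff. Running the recursive work on a thread with an explicit stack size makes the available stack independent of the platform default.

```rust
use std::thread;

/// Runs `work` on a dedicated thread with an explicit stack size so the
/// available stack no longer depends on the platform default.
/// (`run_with_stack` is an illustrative helper, not a Ruff API.)
fn run_with_stack<T, F>(stack_size: usize, work: F) -> thread::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let handle = thread::Builder::new()
        .name("parser".to_string())
        .stack_size(stack_size)
        .spawn(work)
        .expect("failed to spawn parser thread");
    // `join` returns `Err` if `work` panicked. A real stack overflow still
    // aborts the whole process; it cannot be caught here.
    handle.join()
}

/// Stand-in for a recursive parser: recurses once per nesting level.
fn nesting_depth(input: &[u8]) -> usize {
    match input.first() {
        Some(b'(') => 1 + nesting_depth(&input[1..]),
        _ => 0,
    }
}
```

A caller could pick one size for every platform, for example `run_with_stack(16 * 1024 * 1024, move || nesting_depth(source.as_bytes()))`. The 16 MiB figure is an arbitrary example; it reduces the likelihood of an overflow but does not rule one out.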

Upstream tools can restore the previous behavior by:

  • Writing a custom visitor.
  • Make sure you use the same stack size on all platforms or use stacker
  • If you only care about giving the user an idea what went wrong: abort with a custom message when reaching a too deeply nested node. It's important that you abort and don't panic, because a panic still runs drop handlers unless you use panic=abort.
  • If you want nice diagnostics: Use the node's range and build your own Diagnostic. Now, dropping the subtree is a bit tricky because the Drop implementation can stack overflow also. That's why you need to
    • Replace the subtree with a placeholder node (invalid expr name)
    • Queue the subtree for drop but repeat the cut-subtree replacement whenever the sub-tree exceeds the limit again, then requeue the sub-sub-tree, and so on...
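
The cut-and-requeue steps above can be sketched on a toy tree. Everything here is an assumption for illustration: `Expr`, `take_children`, `drop_iteratively`, and `limit_depth` are hypothetical names, not Ruff's AST or visitor API. A real tool would also record a diagnostic with the node's range at each cut.

```rust
/// Toy expression tree (hypothetical, not Ruff's AST).
#[derive(Debug)]
enum Expr {
    Name(String),
    /// Placeholder left behind where a too-deep subtree was cut out.
    Invalid,
    Unary(Box<Expr>),
    Binary(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Moves the direct children into `out`, leaving `Invalid` placeholders.
    fn take_children(&mut self, out: &mut Vec<Expr>) {
        match self {
            Expr::Unary(operand) => out.push(std::mem::replace(&mut **operand, Expr::Invalid)),
            Expr::Binary(left, right) => {
                out.push(std::mem::replace(&mut **left, Expr::Invalid));
                out.push(std::mem::replace(&mut **right, Expr::Invalid));
            }
            Expr::Name(_) | Expr::Invalid => {}
        }
    }
}

/// Drops `root` through a worklist so that no drop recurses deeply.
fn drop_iteratively(root: Expr) {
    let mut pending = vec![root];
    while let Some(mut expr) = pending.pop() {
        expr.take_children(&mut pending);
        // `expr` now only holds placeholder children; dropping it is shallow.
    }
}

/// Replaces every non-leaf subtree at depth `max_depth` or deeper with
/// `Expr::Invalid` and returns how many subtrees were cut.
fn limit_depth(root: &mut Expr, max_depth: usize) -> usize {
    let mut cuts = 0;
    // Explicit stack of (node, depth) so the pass itself cannot overflow.
    let mut stack: Vec<(&mut Expr, usize)> = vec![(root, 0)];
    while let Some((expr, depth)) = stack.pop() {
        if depth >= max_depth && !matches!(*expr, Expr::Name(_) | Expr::Invalid) {
            let cut = std::mem::replace(expr, Expr::Invalid);
            drop_iteratively(cut);
            cuts += 1;
            continue;
        }
        match expr {
            Expr::Unary(operand) => stack.push((&mut **operand, depth + 1)),
            Expr::Binary(left, right) => {
                stack.push((&mut **left, depth + 1));
                stack.push((&mut **right, depth + 1));
            }
            Expr::Name(_) | Expr::Invalid => {}
        }
    }
    cuts
}
```

This sketch uses an explicit worklist for both the pass and the drop, so neither depends on the nesting depth of the input. Whether a real visitor can be written this way depends on the visitor API in use.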

Writing this in user code not only removes the complexity from Ruff. It also has the advantage that you get perfect error recovery. The AST has the exact range of the problematic expression. It's not the range of whatever comes after the too-deeply nested parentheses.

As said before, this only protects against stack overflows. Like the previous implementation, it doesn't protect against allocation errors. An attacker could construct a program that requires the parser to allocate more memory than is available (e.g. WASM32 has a 4GB limit). Examples are very long statement or expression lists, or binary expressions with the same precedence, etc. I'm sure there are ways to protect against this too, but this feels way beyond something the parser should be concerned about and not something that hard limits could solve.

Follow-up to #24810.

Closes #22930

Performance

Codspeed reports a perf regression of about 1% for many ty walltime benchmarks, which is very unfortunate. The only way I can think of to mitigate the regression is to reduce the call-sites where we grow the stack size if necessary, at the cost that the parser will overflow in some very specific constructed cases. I personally would be in favor of doing so from a practical standpoint: In my view it feels wrong to optimize for cases where someone intentionally tries to break the parser, but it comes at the cost that using a visitor is no longer fully sufficient to mitigate/protect against those cases. Having said that, I'm still not convinced that a stack overflow in the parser is a security concern. Even CPython aborts when exceeding the stack size! I really consider this a concern of the harness, which must implement appropriate measures to handle stack overflows, allocation failures and other crashes due to bugs accordingly.

Considered alternatives

Keep limits

Fix the limits that don't conform to CPython but otherwise keep the limits in place. The main benefit of this approach is that upstream applications are less likely to run into stack overflow issues, but it also doesn't protect them against it.

  • Some limits can only be approximations because they're CPython specific and not mentioned anywhere in the specification
  • Error recovery and the diagnostic ranges are worse than when handled post-parsing
  • Can still stack overflow on systems with smaller stack sizes, or depending on when or where the parser is called.
  • Makes assumptions about what the right limit is for upstream tools
  • Even CPython's ast module aborts when there's not enough remaining stack size
  • Limits are also required when the parser unrolls recursions into a loop or when the Rust compiler applies the tail call recursion optimization

Configurable limits

Don't set any limits by default. Very similar to the former with the same downsides except that Ruff makes no assumptions about what the right defaults are for downstream tools.

I think that's a reasonable approach but it requires users to pick limits, which seems non-trivial and the error recovery is worse than if Ruff parses out the full program. But more importantly, I don't see a strong benefit over moving the entire complexity to programs for which increasing the thread's stack size itself isn't a sufficient mitigation. A post-parse visitor gives them the same flexibility, and it even allows them to customize what to do when the limit is exceeded (diagnostic vs abort vs something else). This approach also gives them a more accurate parse tree if they decide not to abort.

What do other parsers do

  • Rustc uses stacker to support common cases where recursion overflows the stack. But there are cases where rustc can still overflow
  • CPython's AST module throws if there isn't enough stack space remaining.

Note

This PR is no guarantee that we'll maintain the invariant that Ruff's parser never stack overflows on any input. We'll do our best, but we may decide that stack overflows are rare enough and that mitigating them is too expensive (in terms of performance) for the trade-offs to feel right.

@astral-sh-bot

astral-sh-bot Bot commented May 29, 2026

Typing conformance results

No changes detected ✅

Current numbers
The percentage of diagnostics emitted that were expected errors held steady at 92.23%. The percentage of expected errors that received a diagnostic held steady at 87.42%. The number of fully passing files held steady at 92/134.

@astral-sh-bot

astral-sh-bot Bot commented May 29, 2026

Memory usage report

Memory usage unchanged ✅

@astral-sh-bot

astral-sh-bot Bot commented May 29, 2026

ecosystem-analyzer results

No diagnostic changes detected ✅

Flaky changes detected. This PR summary excludes flaky changes; see the HTML report for details.

@codspeed-hq

codspeed-hq Bot commented May 29, 2026

Merging this PR will not alter performance

✅ 127 untouched benchmarks


Comparing micha/parser-stacker (7ad9bb9) with main (f0fbe2b)

@MichaReiser

Member Author

Nice, this seems to be almost zero cost

@astral-sh-bot

astral-sh-bot Bot commented May 29, 2026

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@MichaReiser MichaReiser changed the title Use stacker for parser recursion Spill stack to the heap to support arbitrarily deep nested programs May 29, 2026
Comment on lines +740 to +741
TokenKind::Lpar => Expr::Call(self.parse_call_expression(lhs, start)),
TokenKind::Lsqb => Expr::Subscript(self.parse_subscript_expression(lhs, start)),

Member Author

Already handled by nesting

self.bump(TokenKind::Async);

// Consume repeated invalid `async` prefixes iteratively. This is the only
// invalid-async recovery shape that can recurse without bound.

@MichaReiser MichaReiser Jun 1, 2026

Member Author

Unroll the recursion in a loop, avoiding any stack size issues
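
A minimal sketch of the same idea on a toy token stream. `Tok` and `skip_async_prefixes` are hypothetical names and this is not Ruff's `TokenKind` or the code in this PR. Consuming the repeated prefix in a loop keeps stack usage constant no matter how often the prefix repeats.

```rust
/// Hypothetical token kinds for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Tok {
    Async,
    Def,
    Other,
}

/// Skips any number of leading `Async` tokens without recursing.
/// Returns how many were skipped and the remaining tokens.
fn skip_async_prefixes(mut tokens: &[Tok]) -> (usize, &[Tok]) {
    let mut skipped = 0;
    while let Some(Tok::Async) = tokens.first() {
        tokens = &tokens[1..];
        skipped += 1;
    }
    (skipped, tokens)
}
```

A recursive version would need one stack frame per repeated `async`, which is exactly the unbounded shape the comment describes.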

Comment on lines +349 to +351
if self.tokens.nesting() < STACK_GROWTH_NESTING_THRESHOLD
&& !Self::token_starts_unnested_recursive_lhs(token)
{

Member Author

I suspect that this is one of the cases where we don't guarantee to be 100% stack overflow free because the performance trade off would be too bad.

A simple example where this overflows is if the application calls Ruff's parser in a very deep call graph where only very little stack space remains. It's then not unlikely that Ruff stack overflows even before exceeding a nesting limit of 200.

I intentionally did not spend any time on refining this heuristic.

@MichaReiser

Member Author

I'll open this up for an initial round of feedback before fully polishing it. CC @samuelcolvin as you might have opinions on this / know what monty needs here.

@MichaReiser MichaReiser marked this pull request as ready for review June 1, 2026 13:46
@MichaReiser MichaReiser requested a review from dhruvmanila as a code owner June 1, 2026 13:46
@astral-sh-bot astral-sh-bot Bot requested a review from charliermarsh June 1, 2026 13:46
@samuelcolvin

Contributor

I've reviewed, but I don't see how I can return an Err on deeply nested code which is a blocker for us.

If you only care about giving the user an idea what went wrong: abort with a custom message when reaching a too deeply nested node.

The process needs to survive excessively nested code, so aborting or panicking is not an option for us.

Unless I'm missing something, this seems less good than what's currently on main + #25480

@MichaReiser

Member Author

I've reviewed, but I don't see how I can return an Err on deeply nested code which is a blocker for us.

The basic idea is that your Visitor does the same as the parser today. Once you reach the limit, replace the problematic sub-tree with a placeholder, e.g. replace the Expr subtree with an ExprName with context Invalid. If you want, also push a diagnostic for the replaced subtree's range. Then schedule the subtree for drop (you need to be careful to avoid stack overflows in drop).

Unless I'm missing something, this seems less good than what's currently on main + #25480

I tend to agree with this assessment from monty's perspective. I don't think this is the case for all projects that depend on Ruff that use the parser as a library.

@samuelcolvin

Contributor

Ye makes sense. I'll try, but sounds like we can probably make this work.

Since we'll limit the depth when building the ast, we should be safe on drop.

@MichaReiser

Member Author

Ye makes sense. I'll try, but sounds like we can probably make this work.

This sounds great

Since we'll limit the depth when building the ast, we should be safe on drop.

I'm not sure I fully understand this part. It's still the parser that builds the AST. But the visitor can "shrink" the trees by replacing sub-trees.

@MichaReiser

MichaReiser commented Jun 3, 2026

Member Author

We still feel a bit uncertain about whether the parser should try to guarantee against all stack overflows (which this PR almost does but not fully).

I think we should at least revert the limits on main, to make the parser conform to CPython and the Python parsing standard. I'm inclined to do so before we decide on how we continue with this PR.

I think we should land a version of this PR, but we have to decide whether we want to add stacker only in the more common code paths or everywhere the parser recurses, and at what performance cost.

@samuelcolvin could you say a little more about your use case.

  • How do you use the parser?
  • Do you use a custom thread stack size or do you use the defaults?
  • Why is it important that the parser does not stack overflow? You mentioned security, but what specifically?
  • How are stack overflows different from allocation failures or other crashes in Ruff's parser?

@samuelcolvin

samuelcolvin commented Jun 3, 2026

Contributor

Hi @MichaReiser, here we go:

How do you use the parser?

Monty is a minimal, secure Python interpreter written in Rust for use by AI.

In other words we aim to be able to take ANY untrusted Python code, and run it (or return an error).

The ruff parser is used to build an AST which we then process and use to compile bytecode. We also optionally use ty to typecheck the code before running it.

This means the ruff AST parser is the first line of defense against untrusted code. The first thing we do with the code you submit (here) is parse the code with ruff to get an ast.

In order of importance, the commitments we make:

  1. 100% guaranteeing the code can NEVER escape the sandbox (e.g. RCE on the host, reading host files or host environment variables)
  2. absolutely minimising the chances an attacker can craft code that OOMs the host or runs code indefinitely
  3. minimising the chances an attacker can craft code that causes a crash of the process as this can cause a DOS risk
  4. trying to avoid panics, although ultimately panics can be caught and recovered from
  5. being as compliant as possible with cpython

Do you use a custom thread stack size or do you use the defaults?

we use the defaults. Monty is distributed as PyPI and npm packages and we currently let users run Monty code on the main thread. We could change this if you think that's the easiest solution?

Why is it important that the parser does not stack overflow? You mentioned security, but what specifically?

It's a DOS risk - when running the code in the main process a stack overflow will kill the entire app. We don't want to spawn a new process for every invocation as it would increase latency. I guess we could add orchestration to create a persistent background process and run monty code there - then we can recover from a process that's killed, but it's not ideal.

How are stack overflows different from allocation failures or other crashes in Ruff's parser?

With the current architecture, any crash we can't catch (stack overflow, seg fault etc - i.e. things that aren't panics) is bad because it kills the entire main process.


BTW, monty is open source, you can look at the code at https://github.com/pydantic/monty.

You might also want to look at our bounty program - https://hackmonty.com where a lot of these vulnerabilities surfaced.

@MichaReiser

Member Author

Thank you. This helped me build a much better understanding of monty!

we use the defaults. Monty is distributed as PyPI and npm packages and we currently let users run Monty code on the main thread. We could change this if you think that's the easiest solution?

It makes the problem much less likely and ensures monty's behavior is more uniform across platforms. However, it does not prevent stack overflows.

With the current architecture, any crash we can't catch (stack overflow, seg fault etc - i.e. things that aren't panics) is bad because it kills the entire main process.

That makes sense and suggests to me that the isolation must happen higher up in monty, because what we have now is only sufficient to handle stack overflows. We could extend the parser to also handle allocations gracefully, but that would be a lot of work and doesn't really align with Ruff's goals. And even if we would handle allocation failures in the parser, this still doesn't protect against bugs. That's why I'm coming to the conclusion that isolating these errors at the monty level, e.g. by spawning a process for each invocation similar to nextest (where performance doesn't seem to be an issue), is the better outcome for monty and Ruff's parser. It would allow us to focus on gracefully avoiding the most common stack overflows without sacrificing performance too much.

@samuelcolvin

Contributor

We definitely can't spawn a new process per run/step. Imagine you're running the following code:

for step in range(big):
    result = external_call(...)

Monty's suspend and resume mechanism means that every call to external_call suspends code execution and returns control to the host. If we introduced a new process invocation for every step, it would increase the step overhead from ~1-5us to ~15ms - making the script 3000x slower to run.

But if we have a pool of processes and communicate with them over stdin/stdout using protobuf, the increased latency might be tolerable. It would also have other advantages for us - like making other language SDKs easier to build and maintain. @davidhewitt and I discussed this option earlier.

Even with a subprocess pool, processes that die are still a bad thing - adding new processes to the pool will be much slower than reusing an existing process.


All that said, I'm not really clear why we need the profound change here? #24810 had negligible performance overhead - in ruff it'll be insignificant, in ty it'll be unmeasurable (<0.0001%?).

You've closed #25480 but I still think it's a good solution to the remaining stack overflows.

The remaining issue AFAIK with main (including #24810) + #25480 seems to be:

6000: Increased whenever CPython's PEG parser enters a new rule.

Can you show an example of code that currently passes with cpython's parser, but fails with Ruff? I find it hard to believe that there's valid code someone actually wants to check or run that can't be parsed with ruff.

The primary goal of Ruff's parser is to accept every valid (C)Python program.

I think it's important not to be naively absolutist.

There's probably a "good enough" story that works for 5 nines of use cases without being perfect.

@MichaReiser

Member Author

I can see that spawning a process is expensive. However, it may just be necessary to protect against heap allocation failures, bugs, and stack overflows. But I'm not familiar enough with what alternatives exist here.

Codex was able to find a few relatively quickly

| Problem | Representative input | Ruff first rejects | CPython 3.14.5 first rejects | Other equivalent forms |
| --- | --- | --- | --- | --- |
| Ruff applies its 202 recursion budget to unary chains | `---...1` | 203 | 5967 | unary `+`, `~`, `not` |
| Ruff applies the same budget to recursive expression grammar paths | `1**1**...**1` | 203 | 2984 | ternary chain: Ruff 203, CPython 5967; lambda chain: Ruff 203, CPython 2984 |
| Ruff spends roughly two budget units per mixed nesting layer | `-(-(...1))` | 102 | 201 | parenthesized power, ternary, lambda; exact CPython boundary is 199 for power and lambda |
| Ruff spends roughly two units for recursively nested starred or keyword expressions | `[*[*...[]]]` | 101 | 200 | nested `await (...)`, `yield (...)`, `yield from (...)`; CPython boundary is 201 for keyword forms |

We could probably fix them, but the solution overfits monty. Even if the performance regression is only 1-2%, it's still considerable given that this now applies to every tool built on top of the Ruff parser, including those that don't need this feature.

You've closed #25480 but I still think it's a good solution to the remaining stack overflows.

I disagree. It just keeps patching Ruff without solving the root cause. And there remains an entire category of errors that we don't protect against yet but monty requires protection against. And as I said in my summary, these limits don't protect against stack overflows on all platforms, targets, or programs. They just happen to be sufficient for some.

My main concern really is that this solution overfits monty. It's patching specific instances over solving the root cause. And it's not a goal for Ruff's parser to be free of stack overflows or heap allocation failures.

@samuelcolvin

Contributor

I don't think any of those examples qualify as "valid code someone actually wants to check or run".

the solution overfits monty.

Well, only if monty is the only tool built on the ruff ast parser that cares about not crashing! That seems unlikely in the medium and long term.

Even if the performance regression is only 1-2%, it's still considerable given that this now applies to every tool built on top of the Ruff parser
And it's not a goal for Ruff's parser to be free of stack overflows or heap allocation failures.

I don't think these two statements make sense. What proportion of time do tools using Ruff's parser spend on parsing? I would guess most are <10%, many (e.g. ty) will be <1%. In that scenario I think most tools would accept a 1-2% performance degradation (equating to 0.01-0.2% of execution time) in return for significantly lower risk of a crash.


Ultimately it's your choice, but I think we might end up with a fork of ruff's parser if you really think 1-2% performance gains are preferable to avoiding memory errors.

@charliermarsh

Member

And there remains an entire category of errors that we don't protect against yet but monty requires protection against. And as I said in my summary, these limits don't protect against stack overflows on all platforms, targets, or programs. They just happen to be sufficient for some.

Can you help me understand how you think about this? Asking genuinely -- even if we merge these changes, we can probably come up with cases that could stack overflow (just as, e.g., rustc stack overflows in the same way), so why is patching these cases helpful at all if you still need to harden against it?

@charliermarsh

Member

Ultimately it's your choice, but I think we might end up with a fork of ruff's parser if you really think 1-2% performance gains are preferable to avoiding memory errors.

To be honest, this isn't really an accurate description of the tradeoff. We're adding complexity to the parser that we'll have to maintain and continue to extend into the future to uphold new guarantees. And the current hardenings are sort of arbitrary.

@samuelcolvin

Contributor

To be honest, this isn't really an accurate description of the tradeoff. We're adding complexity to the parser that we'll have to maintain and continue to extend into the future to uphold new guarantees. And the current hardenings are sort of arbitrary.

Fair, let's not conflate performance and complexity.

I think my point on performance stands, the complexity point is more complicated.

I think the main question is: what other errors (or categories of error) are we talking about? - like you asked above.

If it's just:

Then I think it's worth fixing in ruff, but I agree if there are lots of other errors it becomes unrealistic/infeasible to guard against them all.

@MichaReiser

MichaReiser commented Jun 3, 2026

Member Author

There are more cases that would need handling. codex spit out these few:

  • I reproduced an abort in IPython mode with x.a.a...a?. The recursive helper is here.
  • I reproduced an abort when running the standard AST visitor over a 100,000-node + chain. Its recursive binary traversal is here.
  • All other automatically derived impls require the same treatment as Drop (PartialEq, Clone)
  • Unrolling the Drop is probably enough, but it doesn't fully protect against stack overflows. Whether the drop succeeds still depends on how much stack size is left when drop is called.
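
To illustrate what unrolling the Drop could look like, here is a minimal sketch on a toy type. `Node` and `detach_children` are hypothetical and not Ruff's AST. The sketch trades extra allocations (one placeholder leaf per detached child) for a bounded recursion depth during drop.

```rust
/// Hypothetical left-leaning binary chain, e.g. `a + b + c + ...`.
enum Node {
    Leaf(u32),
    Add(Box<Node>, Box<Node>),
}

/// Moves every non-leaf child of `node` into `pending`, leaving a
/// placeholder leaf behind so `node` itself becomes shallow.
fn detach_children(node: &mut Node, pending: &mut Vec<Box<Node>>) {
    if let Node::Add(left, right) = node {
        for child in [left, right] {
            if matches!(**child, Node::Add(..)) {
                pending.push(std::mem::replace(child, Box::new(Node::Leaf(0))));
            }
        }
    }
}

impl Drop for Node {
    fn drop(&mut self) {
        // Worklist of subtrees that still need to be taken apart.
        let mut pending: Vec<Box<Node>> = Vec::new();
        detach_children(self, &mut pending);
        while let Some(mut node) = pending.pop() {
            detach_children(&mut node, &mut pending);
            // `node` is dropped here. Its children are leaves by now, so the
            // nested `drop` calls stay a constant number of levels deep.
        }
    }
}
```

As the last bullet notes, this bounds the depth of the drop but still needs a small, fixed amount of stack at the point where `drop` is called.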

@charliermarsh

Member

@samuelcolvin -- I think if you want to add specific guardrails and fix this urgently, forking isn't a bad outcome. There are just a lot of considerations for us here given that the parser is foundational, and the use-case isn't common / a priority right now so it's hard for us to justify spending the necessary time here to make/explore the right decisions.

@samuelcolvin

Contributor

@samuelcolvin -- I think if you want to add specific guardrails and fix this urgently, forking isn't a bad outcome. There are just a lot of considerations for us here given that the parser is foundational, and the use-case isn't common / a priority right now so it's hard for us to justify spending the necessary time here to make/explore the right decisions.

Thanks, we're going to discuss this at our offsite next week and decide on strategy.

Personally I would prefer you don't merge this. What is on main plus a solution to iterative drop might cover all cases, but of course it's up to you.

@MichaReiser

Member Author

I'm fine waiting two weeks to consider alternative solutions. I'd prefer a fork over keeping what's on main.

@samuelcolvin

Contributor

I'm fine waiting two weeks to consider alternative solutions. I'd prefer a fork over keeping what's on main.

Okay, understood.

@samuelcolvin

Contributor

We're working on a major rewrite of monty that moves all monty code execution (and ast parsing and type checking) into subprocesses that can be recreated on crash - pydantic/monty#500.

Feel free to go ahead with this PR, we can update to the head of main once our PR is merged.

Labels

parser Related to the parser

Development

Successfully merging this pull request may close these issues.

Handle parser stack overflows more gracefully

3 participants