Spill stack to the heap to support arbitrarily deep nested programs#25464
Spill stack to the heap to support arbitrarily deep nested programs#25464MichaReiser wants to merge 12 commits into
Conversation
Typing conformance resultsNo changes detected ✅Current numbersThe percentage of diagnostics emitted that were expected errors held steady at 92.23%. The percentage of expected errors that received a diagnostic held steady at 87.42%. The number of fully passing files held steady at 92/134. |
Memory usage reportMemory usage unchanged ✅ |
|
|
Nice, this seems to be almost zero cost |
|
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
| TokenKind::Lpar => Expr::Call(self.parse_call_expression(lhs, start)), | ||
| TokenKind::Lsqb => Expr::Subscript(self.parse_subscript_expression(lhs, start)), |
There was a problem hiding this comment.
Already handled by nesting
| self.bump(TokenKind::Async); | ||
|
|
||
| // Consume repeated invalid `async` prefixes iteratively. This is the only | ||
| // invalid-async recovery shape that can recurse without bound. |
There was a problem hiding this comment.
Unroll the recursion in a loop, avoiding any stack size issues
| if self.tokens.nesting() < STACK_GROWTH_NESTING_THRESHOLD | ||
| && !Self::token_starts_unnested_recursive_lhs(token) | ||
| { |
There was a problem hiding this comment.
I suspect that this is one of the cases where we don't guarantee to be 100% stack overflow free because the performance trade off would be too bad.
A simple example where this overflows is if the application calls Ruff's parser in a very deep call graph where only very little stack frame is remaining. It's then not unlikely that Ruff stack overflows even before exceeding a limit of 200 nesting.
I intentionally did not spend any time on refining this heuristic.
|
I'll open this up for an initial round of feedback before fully polishing it. CC @samuelcolvin as you might have opinions on this / know what monty needs here. |
|
I've reviewed, but I don't see how I can return an
The process needs to survive excessively nested code, so aborting or panicing are not an option for us. Unless I'm missing something, this seems less good than what's currently on main + #25480 |
The basic idea is that your Visitor does the same as the parser today. Once you reach the limit, replace the problematic sub-tree with a placeholder, e.g. replace the
I tend to agree with this assessment from monty's perspective. I don't think this is the case for all projects that depend on Ruff that use the parser as a library. |
|
Ye makes sense. I'll try, but sounds like we can probably make this work. Since we'll limit the depth when building the ast, we should be safe on drop. |
This sounds great
I'm not sure I fully understand this part. It's still the parser that builds the AST. But the visitor can "shrink" the trees by replacing sub-trees. |
|
We still feel a bit uncertain about whether the parser should try to guarantee against all stack overflows (which this PR almost does but not fully). I think we should at least revert the limits on main, to make the parser conform to CPython and the Python parsing standard. I'm inclined to do so before we decide on how we continue with this PR. I think we should land a version of this PR, but we have to decide if we only want to add stacker in the more common code paths or everywhere where the parser recurse and at what performance cost. @samuelcolvin could you say a little more about your use case.
|
|
Hi @MichaReiser, here we go:
Monty is a minimal, secure Python interpreter written in Rust for use by AI. In other words we aim to be able to take ANY untrusted Python code, and run it (or return an error). The ruff parser is used to build an AST which we then process and use to compile bytecode. We also optionally use This means the ruff AST parser is the first line of defense against untrusted code. The first think we do with the code you submit (here) is parse the the code with ruff to get an ast. In order of important in the commitments we make:
we use the defaults. Monty is distribute as PyPI and npm packages and we currently let users run Monty code on the main thread. We could change this if you think that's the easiest solution?
It's a DOS risk - when running the code in the main process a stackoverflow will kill the entire app. We don't want to spawn a new process for every invocation as it would increase latency. I guess we could add orchestration to create a persistent background process and run monty code there - they we can recover from a process that's killed, but it's not idea.
With the current architecture, any crash we can't catch (stack overflow, seg fault etc - e.g. things that aren't panics) are bad because they kill the entire main process. BTW, monty is open source, you can look at the code at https://github.com/pydantic/monty. You might also want to look at our bounty program - https://hackmonty.com where a lot of these vulnerabilities surfaced. |
|
Thank you. This helped me build a much better understanding of monty!
It makes the problem much less likely and ensures monty's behavior is more uniform across platforms. However, it does not prevent stack overflows.
That makes sense and suggests to me that the isolation must happen higher up in monty, because what we have now is only sufficient to handle stack overflows. We could extend the parser to also handle allocations gracefully, but that would be a lot of work and doesn't really align with Ruff's goals. And even if we would handle allocation failures in the parser, this still doesn't protect against bugs. That's why I'm coming to the conclusion that isolating these errors at the monty level, e.g. by spawning a process for each invocation similar to nextest (where performance doesn't seem to be an issue), is the better outcome for monty and Ruff's parser. It would allow us to focus on gracefully avoiding the most common stack overflows without sacrificing performance too much. |
|
We definitely can't spawn a new process per run/step. Imagine you're running the following code: for step in range(big):
result = external_call(...)Monty's suspend and resume mechanism means that every call to but if we have a pool of processes and communicate with them over stdout/stdin over protobuf, the increased latency might be tolerable. It would also have other advantages for us - like making other language SDKs easier to build and maintain. @davidhewitt and I discussed this option earlier. Even with a subprocess pool, processes that die are still a bad thing - adding new processes to the pool will be much slower than reusing an existing process. All that said, I'm not really clear why we need the profound change here? #24810 had negligible performance overhead - in You've closed #25480 but I still think it's a good solution to the remaining stack overflows. The remaining issue AFAIK with
Can you show an example of code that currently passes with cpython's parser, but fails with Ruff? I find it hard to believe that there's valid code someone actually wants to check or run that can't be passed with ruff.
I think it's important not to be naively absolutist. There's probably a "good enough" story that works for 5 nines of use cases without being perfect. |
|
I can see that spawning a process is expensive. However, it may just be necessary to protect against heap allocation failures, bugs, and stack overflows. But I'm not familiar enough with what alternatives exist here. Codex was able to find a few relatively quickly
We could probably fix them, but the solution overfits monty. Even if the performance regression is only 1-2%, it's still considerable given that this now applies to every tool built on top of the Ruff parser and don't need this feature.
I disagree. It just keeps patching Ruff without solving the root cause. And there remains an entire category of errors that we don't protect against yet but monty requires protection against. And as I said in my summary. These limits don't protect against stack overflows on all platforms, targets, or programs. They just happen to be sufficient for some. My main concern really is, this solution overfits monty. It's patching specific instances over solving the root cause. And it's not a goal for Ruff's parser to be stack overflow or heap allocation failures free. |
|
I don't think any of those examples qualify as "valid code someone actually wants to check or run".
Well, only if monty is the only tool built on the ruff ast parser that cases about not crashing! That seems unlikely in the medium and long term.
I don't think these two statements make sense. What proportion of time do tools using Ruff's parser spend on parsing? I would guess most are <10%, many (e.g. Ultimately it's your choice, but I think we might end up with a fork of ruff's parser if you really think 1-2% performance gains are preferable to avoiding memory errors. |
Can you help me understand how you think about this? Asking genuinely -- even if we merge these changes, we can probably come up with cases that could stack overflow (just as, e.g., rustc stack overflows in the same way), so why is patching these cases helpful at all if you still need to harden against it? |
To be honest, this isn't really an accurate description of the tradeoff. We're adding complexity to the parser that we'll have to maintain and continue to extend into the future to uphold new guarantees. And the current hardenings are sort of arbitrary. |
Fair, let's not conflate performance and complexity. I think my point on performance stands, the complexity point is more complicated. I think the main question is: what other errors (or categories of error) are we talking about? - like you asked above. If it's just:
Then I think it's work fixing in ruff, but I agree if there are lots of other errors it becomes unrealistic/infeasible to guard against them all. |
|
There are more cases that would need handling. codex spit out these few:
|
|
@samuelcolvin -- I think if you want to add specific guardrails and fix this urgently, forking isn't a bad outcome. There are just a lot of considerations for us here given that the parser is foundational, and the use-case isn't common / a priority right now so it's hard for us to justify spending the necessary time here to make/explore the right decisions. |
Thanks, we're going to discuss this at our offsite next week and decide on strategy. Personally I would prefer you don't merge this. What is on main + plus a solution to iterative drop might cover all cases, but of course up to you. |
|
I'm fine waiting two weeks to consider alternative solutions. I'd prefer a fork over keeping what's on main. |
Okay, understood. |
|
We're working on a major rewrite of monty that moves all monty code execution (and ast parsing and type checking) into subprocesses that can be recreated on crash - pydantic/monty#500. Feel free to go ahead with this PR, we can update to the head of main once our PR is merged. |
Summary
This PR removes the hard nesting limits in the parser and instead spills the stack to the heap when it approaches the stack limit, using
stacker. This is limited to supported platforms.Why?
The primary goal of Ruff's parser is to accept every valid (C)Python program. This is currently no longer the case because Ruff's recursion limits are stricter than CPython's. CPython uses multiple limits:
I don't see a way for Ruff to approximate the latter two limits other than by finding limits that are at least as large as CPython's. However, this is not enough to protect against all stack overflows:
We'd have to pick platform and compilation-target-specific limits. But even this won't be enough, because we don't control the stack size, or when the parser is called and how much stack size is remaining at this point.
The idea of this PR is to remove the stack limitation from Ruff's parser and enable upstream applications to enforce relevant restriction if they want to (only applies to stack overflows, allocation failures are out of scope). This makes Ruff's parser again conformant with the Python standard and CPython.
In general, I advise upstream tools from using a uniform stack size on platforms supporting changing the default thread stack size. In general, using a larger than default stack size should be more than enough to support all reasonable Python programs and stacker gives them the tool to increase the stack size on demand.
Upstream tools can restore the previous behavior by:
panicstill runs drop handler unless you usepanic=abort.Diagnostic. Now, dropping the subtree is a bit tricky because theDropimplementation can stack overflow also. That's why you need toWriting this in user code not only removes the complexity from Ruff. It also has the advantage that you get perfect error recovery. The AST has the exact range of the problematic expression. It's not the range of whatever comes after the too-deeply nested parentheses.
As said before, this only protects against stack overflows. As the previous implementation, it doesn't protect against allocation errors. An attacker could construct a program that requires the parser to allocate more memory than is available (e.g. WASM32 has a 4GB limit). Examples are very long statement or expression lists, or binary expressions with the same precedence, etc. I'm sure there are ways to protect against this too, but this feels way beyond something the parser should be concerned about and not something that hard limits could solve.
Follow-up to #24810.
Closes #22930
Performance
Codspeed reports a perf regression from about 1% for many ty walltime benchmarks, which is very unfortunate. The only way I can think of to mitigate the regression is to reduce the call-sites where we grow the stack size if necessary, at the cost that the parser will overflow in some very specific constructed cases. I personally would be in favor of doing so from a practical standpoint: In my view it feels wrong to optimize for cases where someone intentionally tries to break the parser, but it comes at the cost that using a visitor is no longer fully sufficient to mitigate/protect against those cases. Having said that. I'm still not convinced that a stack overflow in the parser is a security concern. Even CPython aborts when exceeding the stack size! I really consider this a concern of the harness that must implement appropriate measures to handle stack overflows, allocation failures and other crashes due to bugs accordingly.
Considered alternatives
Keep limits
Fix the non CPython conform limits but otherwise keep the limits in place. The main benefit of this approach is that upstream applications are less likely to run into stack overflow issues, but it also doesn't protect them against it.
Configurable limits
Don't set any limits by default. Very similar to the former with the same downsides except that Ruff makes no assumptions about what the right defaults are for downstream tools.
I think that's a reasonable approach but it requires users to pick limits, which seems non-trivial and the error recovery is worse than if Ruff parses out the full program. But more importantly, I don't see a strong benefit over moving the entire complexity to programs for which increasing the thread's stack size itself isn't a sufficient mitigation. A post-parse visitor gives them the same flexibility, and it even allows them to customize what to do when the limit is exceeded (diagnostic vs abort vs something else). This approach also gives them a more accurate parse tree if they decide not to abort.
What do other parsers do
Note
This PR is no guarantee that we'll maintain the invariant that Ruff's parser never stack overflows on any input. We'll give our best, but we may decide that the risk of stack overflowing is rare enough and mitigating against it is to expensive (in terms of performance) so that the trade offs don't feel right.