Move execution to a subprocess pool#500
Conversation
A monty process can never be made fully crash-proof against memory errors (stack overflow aborts, allocator aborts) triggered by adversarial input. This runs the interpreter in worker subprocesses instead: a crash kills only the worker, which the parent detects and replaces. - new `monty-proto` crate: protobuf wire protocol — schema with checked-in prost codegen (`make generate-proto` / `make check-proto` in CI), 4-byte LE length-prefixed framing, and fallible conversions so frames from a possibly-compromised worker are treated as untrusted - new `monty --subprocess` child mode: serves one REPL session per checkout over stdin/stdout with strict request/event alternation, child-local mounts, OS-call bubbling, streamed prints, and dump/load - new `monty-pool` crate: elastic worker pool with prewarming, crash detection and replacement, worker recycling, and a watchdog enforcing a hard per-request timeout - `pydantic_monty.MontyPool`: async-first Python API (`async with` pool and checkout sessions, sync/async external functions, os callbacks, print streaming, dump) plus `MontyCrashedError` - new `pydantic-monty-cli` package (maturin bin bindings, the uv/ruff pattern) ships the `monty` binary as a pydantic-monty dependency Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Merging this PR will not alter performance
Performance Changes
Comparing Footnotes
|
There was a problem hiding this comment.
12 issues found across 92 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/monty-proto/src/convert/exception.rs">
<violation number="1" location="crates/monty-proto/src/convert/exception.rs:55">
P1: Validate traceback coordinates before constructing `StackFrame`; untrusted wire data can otherwise produce invalid line 0 tracebacks or panic/DoS during formatting.</violation>
</file>
<file name="crates/monty-cli/src/subprocess.rs">
<violation number="1" location="crates/monty-cli/src/subprocess.rs:145">
P2: `Hello` is accepted multiple times; the handshake is not actually enforced as one-time, allowing invalid mid-session re-handshakes.</violation>
<violation number="2" location="crates/monty-cli/src/subprocess.rs:336">
P2: Dump/Load omits child-side type-check/session metadata, so restored sessions can silently lose type-check enforcement and stub context.</violation>
</file>
<file name="crates/monty-proto/src/frame.rs">
<violation number="1" location="crates/monty-proto/src/frame.rs:86">
P2: FrameWriter does not enforce MAX_FRAME_LEN, so it can produce frames that the default FrameReader rejects.</violation>
</file>
<file name="crates/monty-pool/src/lib.rs">
<violation number="1" location="crates/monty-pool/src/lib.rs:147">
P3: `PoolError` does not implement `Error::source`, so wrapped `Runtime(MontyException)` causes are hidden from error-chain tooling.</violation>
</file>
<file name="crates/monty-pool/src/checkout.rs">
<violation number="1" location="crates/monty-pool/src/checkout.rs:262">
P2: `resume_name_lookup` skips max-depth validation, so deeply nested lookup values can trigger protocol failure instead of a clean runtime error.</violation>
<violation number="2" location="crates/monty-pool/src/checkout.rs:354">
P2: `expect_turn` misclassifies protocol mismatches as crash/timeout by routing them through `poison` instead of returning `PoolError::Protocol`.</violation>
</file>
<file name="crates/monty-proto/src/convert/object.rs">
<violation number="1" location="crates/monty-proto/src/convert/object.rs:58">
P2: TimeDelta proto decoding does not validate normalized `seconds`/`microseconds` bounds.</violation>
<violation number="2" location="crates/monty-proto/src/convert/object.rs:130">
P2: Date/DateTime proto decoding misses semantic range checks (only checks u8 fit), allowing invalid temporal values through the wire boundary.</violation>
</file>
<file name="crates/monty-pool/src/watchdog.rs">
<violation number="1" location="crates/monty-pool/src/watchdog.rs:60">
P2: Watchdog thread creation panics on spawn failure instead of surfacing a `PoolError::Spawn`, so pool construction can abort the process under thread-resource exhaustion.</violation>
<violation number="2" location="crates/monty-pool/src/watchdog.rs:136">
P2: Recompute the sleep from a fresh `Instant::now()` after draining expired deadlines; otherwise time spent killing earlier workers can delay the next timeout.</violation>
</file>
<file name="crates/monty-cli/pyproject.toml">
<violation number="1" location="crates/monty-cli/pyproject.toml:15">
P3: Add a Windows OS classifier. This package is intended to work on Windows too, but the current metadata only advertises Unix/Linux/macOS support.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
There was a problem hiding this comment.
15 issues found across 75 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/monty-wasm/package.json">
<violation number="1" location="crates/monty-wasm/package.json:20">
P0: The npm allowlist omits the wasm binary and WASI helper files, so the published package will be missing the interpreter runtime and fail at load time.</violation>
</file>
<file name="crates/monty-js/src/binary.ts">
<violation number="1" location="crates/monty-js/src/binary.ts:91">
P2: Catching all `require.resolve` errors hides real package resolution failures and can silently fall back to an unintended binary.</violation>
</file>
<file name="crates/monty-js/src/worker.ts">
<violation number="1" location="crates/monty-js/src/worker.ts:133">
P3: Exit status can be dropped due to returning `this.exitStatus` even when `exitCode`/`signalCode` is already known.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
…n, JS client hardening Also includes the previously staged worker env isolation: workers spawn with an empty environment (Windows keeps only SystemRoot) so host secrets are never in a worker's memory; Linux-gated /proc tests pin it. Security/robustness: - validate traceback coordinates from the wire (underflow panic / multi-GiB caret allocation when rendering a hostile traceback) - pause the JS frame reader's stream when no read is pending, restoring pipe backpressure against a worker spewing unbounded output - enforce MAX_FRAME_LEN on frame writers (Rust + JS) before writing; the child reports an oversize response as a parseable FatalError - semantic range checks for wire dates/datetimes/timedeltas - dump/load now carries script name + type-check state so a loaded session keeps type-check enforcement and its accumulated stubs - JS sendableResult is total and handler throws poison the session, so a suspension can never be left wedged awaiting a resume Correctness/polish: - resume_name_lookup enforces the wire depth bound like other resumes - protocol violations on an intact stream classify as PoolError::Protocol (and discard the worker) instead of Crashed/Timeout - reject a repeated Hello after the handshake - watchdog: recompute sleep from a fresh Instant::now(); surface thread spawn failure as PoolError::Spawn instead of panicking - PoolError implements Error::source for Runtime - JS: half-open i64 bound (2^63 crosses as float), prewarm failure kills already-spawned workers, discard() checks pool closure before handing out a replacement, futures cleared on failed feeds, ConversionError for malformed dataclass markers, Object.hasOwn for mount modes, 10s handshake deadline, executability check in the PATH scan - pydantic-monty-cli: Windows OS classifier - monty-wasm: document MontyRepl in the README; stop emitting source maps the package does not ship Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
2 issues found across 26 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/monty-js/src/worker.ts">
<violation number="1" location="crates/monty-js/src/worker.ts:88">
P2: Handshake timeout kills the worker without setting `killedForTimeout`, so startup timeouts are misreported as generic crashes.</violation>
</file>
<file name="crates/monty-proto/tests/roundtrip.rs">
<violation number="1" location="crates/monty-proto/tests/roundtrip.rs:311">
P3: The `out_of_range_temporal_values_are_rejected` test validates boundary rejection for date fields and `DatetimeValue.hour`, but omits coverage for `DatetimeValue.minute` (0..=59), `DatetimeValue.second` (0..=59), and `DatetimeValue.microsecond` (<= 999_999), even though all four are validated by the same conversion code in `object.rs`. Adding assertions for the invalid cases (`minute: 60`, `second: 60`, `microsecond: 1_000_000`) would make the test's coverage match its documented scope.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
| timezone_name: None, | ||
| })), | ||
| }; | ||
| assert!(rejected_as(datetime_with_hour(24), "DateTimeValue.hour")); |
There was a problem hiding this comment.
P3: The out_of_range_temporal_values_are_rejected test validates boundary rejection for date fields and DatetimeValue.hour, but omits coverage for DatetimeValue.minute (0..=59), DatetimeValue.second (0..=59), and DatetimeValue.microsecond (<= 999_999), even though all four are validated by the same conversion code in object.rs. Adding assertions for the invalid cases (minute: 60, second: 60, microsecond: 1_000_000) would make the test's coverage match its documented scope.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At crates/monty-proto/tests/roundtrip.rs, line 311:
<comment>The `out_of_range_temporal_values_are_rejected` test validates boundary rejection for date fields and `DatetimeValue.hour`, but omits coverage for `DatetimeValue.minute` (0..=59), `DatetimeValue.second` (0..=59), and `DatetimeValue.microsecond` (<= 999_999), even though all four are validated by the same conversion code in `object.rs`. Adding assertions for the invalid cases (`minute: 60`, `second: 60`, `microsecond: 1_000_000`) would make the test's coverage match its documented scope.</comment>
<file context>
@@ -268,6 +268,102 @@ fn invalid_values_are_rejected() {
+ timezone_name: None,
+ })),
+ };
+ assert!(rejected_as(datetime_with_hour(24), "DateTimeValue.hour"));
+
+ let timedelta = |seconds, microseconds| pb::MontyValue {
</file context>
Codecov Results 📊❌ Patch coverage is 49.24%. Project has 37715 uncovered lines. Coverage diff@@ Coverage Diff @@
## main #PR +/-##
==========================================
- Coverage 66.22% 49.24% -16.98%
==========================================
Files 281 301 +20
Lines 72724 74302 +1578
Branches 156220 158988 +2768
==========================================
+ Hits 48158 36587 -11571
- Misses 24566 37715 +13149
- Partials 4115 2946 -1169Generated by Codecov Action |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
1 issue found across 22 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="crates/monty-proto/tests/roundtrip.rs">
<violation number="1" location="crates/monty-proto/tests/roundtrip.rs:311">
P3: The `out_of_range_temporal_values_are_rejected` test validates boundary rejection for date fields and `DatetimeValue.hour`, but omits coverage for `DatetimeValue.minute` (0..=59), `DatetimeValue.second` (0..=59), and `DatetimeValue.microsecond` (<= 999_999), even though all four are validated by the same conversion code in `object.rs`. Adding assertions for the invalid cases (`minute: 60`, `second: 60`, `microsecond: 1_000_000`) would make the test's coverage match its documented scope.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
Returning a cyclic structure (e.g. d = {}; d['self'] = d; d) from a
worker killed the session: the parent decoded the Complete payload via
the shared TryFrom<pb::MontyValue> impl, which rejected Kind::Cycle as
"output-only", treated it as a protocol violation, and discarded a
healthy worker.
The "output-only" guard was meant for inputs (parent -> child), but the
conversion is direction-agnostic. Fix by letting Cycle round-trip on the
wire like Repr does — using one as an execution input is still rejected
by MontyObject::to_value with a proper Python-level error.
- MontyObject::Cycle now carries a plain usize identity token instead of
HeapId; the id was only ever used for equality (same-object detection)
and never dereferenced, and this removes the constructor obstacle that
forced the proto layer to reject cycles
- rename the CycleValue wire field heap_id -> identity (same tag, wire
compatible) and regenerate the Rust + JS protobuf code
- drop the now-unused ProtoConvertError::OutputOnly variant
- regression test: a cyclic return value decodes and the worker survives
and is reused
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Values are the hot payload of the subprocess protocol — every external
function call ships its arguments, result, and completion value across
the process boundary, previously as MontyObject -> pb mirror struct
(a full deep clone of every string and container) -> prost encode, and
the reverse tree rebuild on decode.
Map the monty.v1.MontyObject schema message onto a new WireObject
newtype via prost-build extern_path, with a hand-written prost::Message
impl (monty-proto/src/wire.rs):
- encode walks a borrowed MontyObject and writes bytes directly — no
intermediate tree, no clones
- decode builds the MontyObject straight from the wire, running the
semantic validation (date ranges, timedelta normalization, enum
names) during the parse, so untrusted bytes never exist in memory as
an unvalidated value
- all other protocol messages stay prost-generated and now embed
WireObject directly, so the conversion layer disappears across the
whole protocol; the old convert/object.rs mirror conversions are
deleted
Correctness is enforced by a differential oracle: generate-proto also
emits tests/oracle/monty.v1.rs (same schema, fully prost-generated,
CI-checked via check-proto), and tests/differential.rs proves the hand
codec byte-compatible against it over a corpus tuned to protobuf's
presence rules (default skipping, optional Some(""), zigzag i64::MIN,
NaN), plus pinned rejection messages for hostile frames.
Behavioural consequences of decode-time validation, documented in
limitations/pool-architecture.md:
- a parent receiving an invalid value classifies it as a protocol
violation (worker discarded) rather than a crash
- a worker receiving a malformed request answers with a
session-preserving protocol-violation error instead of exiting
The one remaining clone: function-call args are cloned when emitting a
suspension event, because the suspension keeps its payload so Dump/Load
replay stays complete.
Also rename the wire message MontyValue -> MontyObject (it mirrors the
Rust type; same tags, wire compatible) and convert the remaining
free-function conversions that fit the crate's From/TryFrom convention:
CodeLoc <-> pb::CodeLoc and MountSpec -> pb::Mount.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…20 support - killed-worker pool test: Windows has no signals, expect 'exit code: 1' - smoke-test: remove committed platform-specific file: deps and install tarballs with --no-save so they can't be baked back in - smoke-test: compile with tsc and run the emitted JS instead of --experimental-strip-types, which node 20 lacks Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Frame writing is stateless (encode, size-check, write, flush — nothing carried between frames), so the FrameWriter wrapper struct earned its keep nowhere. Replace it with monty_proto::write_frame(&mut impl Write, &impl Message). The big win is in the subprocess child: the Rc<RefCell<FrameWriter<BufWriter<Stdout>>>> shared between Child and ProtoPrint existed only to satisfy the borrow checker. Writing each frame to a fresh io::stdout() handle is safe (Stdout handles share one global buffer — the panic hook already relied on this), so the shared writer, the SharedWriter alias, and both writer fields are gone. monty-pool's Worker and the CLI integration tests now hold ChildStdin directly and call write_frame on it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment |
…limit, checkout deadlock, wire depth check - ci: test-builds-arch installed only pydantic-monty and smoke-tested the deleted in-process API, failing every run and blocking release-python; it now installs the cli wheel too and runs a pool smoke test - monty: MontyRepl::call_function entered the interpreter without opening an execution window, freezing the max_duration clock so an infinite loop could never time out - pydantic_monty: session __enter__/__exit__/__aexit__ and worker_pid locked the checkout mutex while holding the GIL, deadlocking with a turn blocked in Python::attach for a print callback; all slot access now happens detached and worker_pid uses a non-blocking try_lock - monty-proto: the value depth check charged one budget unit per container, but dicts cost three prost recursion levels and dataclasses four, so deep dicts passed the sender check and then failed to decode, killing the worker as a protocol failure instead of raising the clean depth error; the check (and its monty-js mirror) now charges exact per-shape proto-level costs, with boundary tests proving check-pass matches frame decodability Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@pydantic/monty is no longer a pure-TypeScript protocol client: it is now a napi-rs binding over the monty-pool crate — the same Rust pool engine pydantic_monty uses — so pool elasticity, watchdogs, crash recovery, framing and value conversion are single-sourced in Rust. The public TypeScript API is unchanged. - crates/monty-wasm is merged back into crates/monty-js; the legacy in-process API ships in the same package under the @pydantic/monty/wasm subpath (the subprocess pool is cfg-gated off on wasm), and the separate @pydantic/monty-wasm package is gone - new src/pool.rs exposes turn-level primitives (NativePool, NativeSession.feed/resume*); the drive loop stays in TypeScript where async external functions are native — turns run on tokio's blocking pool, prints stream mid-turn through a threadsafe function that blocks the turn thread until the JS callback has run (preserving ordering and backpressure), and the checkout mutex is never locked on the event loop (workerPid uses try_lock) - the TS protocol stack (frame/worker/convert), the generated protobuf-es code, the buf toolchain, and check-proto-js are deleted - sandbox-controlled kwargs cross as [key, value] pairs and become null-prototype records; dataclass fields use defineProperties — closing the __proto__ prototype-injection hole in host callbacks - converter fixes: a number of exactly 2^63 silently saturated to i64::MAX (now crosses as float); the tuple __tuple__ marker is now non-enumerable - platform packages ship the napi .node next to the monty binary (napi create-npm-dirs + a patcher script); @pydantic/monty-wasm32-wasi joins optionalDependencies; release-wasm merges into release-js - 609 ava tests pass natively, 342 in-process tests pass against the wasm build, and the pack-and-install smoke test passes Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Deleted by the interim pure-TS rewrite and not restored when monty-js went back to being a napi crate. It matters: it statically links the CRT into the Windows .node, matching main's released binaries. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
b0a4e0b to
5506c00
Compare
extract_repl_inputs blanketed every key extraction failure as a raw TypeError, including the UnicodeEncodeError a lone-surrogate str key produces. Now invalid input surfaces uniformly as MontyRuntimeError: a non-str key as TypeError, a surrogate str key as ValueError — matching how input values are already handled. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
errors.ts re-implemented monty's `MontyException` Display (renderTraceback
& friends) over the structured frames, a second renderer that had to be
kept in lockstep with the Rust one. pool.rs already reconstructs the full
MontyException, so render the traceback string there (the single source of
truth) and carry it across as `NativeException.traceback`. The error classes
now return that verbatim for `display('traceback')`; structured frames stay
for `MontyRuntimeError.traceback()`. Net -47 lines, no behaviour change
(traceback output is byte-identical, confirmed by the ava exception suite).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@cubic-dev-ai review |
@samuelcolvin I have started the AI code review. It will take a few minutes to complete. |
Benchmark the monty-pool overheads subprocess isolation adds, with no CPython equivalent: - pool_create_session_run: cold start (worker spawn + checkout + run) - session_checkout_run: warm-pool checkout handshake + trivial feed - ext_calls_1000: 1000 external-call round trips over the framed protobuf wire protocol Wired into CodSpeed CI (unified onto the package target so both `main` and `pool` build/run). The workers run the `monty` CLI built in release and resolved via MONTY_TEST_BIN, both in CI and in `make bench-pool`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@cubic-dev-ai update the PR description |
@samuelcolvin I can help — what did you want me to do?
|
Summary by cubic
Runs Monty in crash‑isolated worker subprocesses via a protobuf protocol and a shared Rust pool for Python and Node.js, improving resilience to crashes/timeouts and tightening wire validation. Also standardizes invalid REPL input errors, guards against wire memory amplification, and renames proto types for clarity without breaking compatibility.
New Features
monty-proto: protobuf schema with checked‑inprostoutput (exactprostpin for reproducible codegen), 4‑byte LE framing, and a hand‑written zero‑copyMontyObjectwire codec; differential tests ensure byte‑compatibility and validation; cyclic values round‑trip; precise depth checks that mirror prost recursion; MAX_FRAME_LEN enforcement and memory‑amplification guards; no hello handshake; statelesswrite_frame; wire rename for clarity (MontyValue → MontyObject; cycle fieldheap_id→identity, wire‑compatible).monty --subprocess: child mode with strict turn‑based protocol, streamed prints, child‑local mounts, OS callback bubbling, dump/load (preserves script name and type‑check state), clean fatal errors, and an empty worker environment by default.monty-pool: elastic worker pool with prewarming, crash detection/replacement, recycling, and a watchdog enforcing per‑request timeouts; improved protocol‑violation classification, deadline handling, and guaranteed child reaping.checkout(...).feed_run(...)), coroutine callbacks, streamed prints,MontyCrashedError;pydantic-monty-cliinstalls themontybinary;_binary.pyresolves it; fixes session lifecycle deadlock and execution‑duration accounting.@pydantic/montyis anapibinding over the Rustmonty-pool; exposes aMontypool withcheckout().feedRun(...), streamed prints, async external functions, andMontyCrashedError. Platform npm packages ship both the.nodelibrary and themontybinary; the client auto‑resolves or acceptsbinaryPath(PATH executability check + 10s startup timeout). A wasm‑only in‑process API is available at@pydantic/monty/wasm. Hardening/bug fixes: non‑enumerable tuple markers, kwargs/dict objects use null‑prototype records anddefinePropertiesto block__proto__pollution, paused readers restore backpressure, improved binary/platform package handling, half‑open i64 bound fix, tracebacks rendered once in Rust and shipped as a string, and updated smoke tests (compile TS and run JS).make check-proto), builds a realmontybinary for protocol/pool tests, publishes platform npm packages (napi + binary + wasm), fixes Node 20/Windows CI, safer smoke‑test packaging (installs the CLI wheel and runs a pool smoke test), restores Windows static CRT linking for the.node, and cleans up Python tests.monty-benchsubprocess‑pool benches (pool_create_session_run,session_checkout_run,ext_calls_1000) and wires them into CodSpeed; CI builds themontyworker in release and runs all benches withMONTY_TEST_BINset so workers don’t build mid‑run.MontyRuntimeErrorconsistently (non‑str keys →TypeError, lone‑surrogate strings →ValueError).Migration
.run()/.run_async()with session calls.m = pydantic_monty.Monty(code, ...) ; m.run(...)with Monty() as pool: with pool.checkout(...) as s: s.feed_run(code, inputs=..., external_functions=...)async with AsyncMonty() as pool: async with pool.checkout() as s: await s.feed_run(...)napi) but the public API stays the same. For in‑process execution (e.g. browsers), use@pydantic/monty/wasm.new Monty(code, opts).run({ inputs, externalFunctions })const pool = await Monty.create(); const s = await pool.checkout({...}); const out = await s.feedRun(code, { inputs, externalFunctions, printCallback });montybinary is resolved via platform npm packages; override withMONTY_BIN/binaryPathif needed. Script metadata and type‑checking move tocheckout(...)args. Legacy REPL/snapshot APIs and old serialization modules were removed in favor of pool sessions.Written for commit 1442981. Summary will update on new commits.