feat(fss): add file-system snapshot support by djenriquez · Pull Request #137 · coreweave/cwsandbox-client

djenriquez · 2026-06-08T18:19:56Z

Summary

The CoreWeave Sandbox backend can snapshot a sandbox's working directory and restore it into new sandboxes. This PR adds file-system snapshot (FSS) support to the Python SDK so users can capture, restore, and fork a sandbox's filesystem from client code. FSS is gated per-organization on the backend; orgs that aren't enabled receive a clear SnapshotNotSupportedError.

The feature follows the SDK's existing sync/async hybrid style: a snapshot mount you configure at start, snapshot() to capture on demand, snapshot-on-stop, restore/fork by ID, and a management surface for listing and lifecycle.

What's included

Configure a snapshot mount at start with file_system_snapshot=FileSystemSnapshotOptions(mount_path=..., size=...) on run(), Session.sandbox(), @session.function(), or SandboxDefaults.
Capture mid-life with sb.snapshot() (returns the new snapshot ID), or on shutdown with sb.stop(snapshot_on_stop=True) (ID exposed via file_system_snapshot_id).
Restore / fork by passing a file_system_snapshot_id back into Sandbox.run(...).
Manage snapshots via Sandbox classmethods: get_snapshot, list_snapshots (auto-paginated, client-side filters), delete_snapshot, and admin get_snapshot_bucket_config / set_snapshot_bucket_config.
Typed errors: a SandboxSnapshotError hierarchy mapped from CWSANDBOX_FSS_* reasons (not-found / not-ready / not-supported / size / quota / bucket-mismatch, plus transient throttle and wait-timeout).
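The reason-to-exception mapping in the last item can be pictured roughly as below. This is a minimal sketch, not the SDK's code: the class names SandboxSnapshotError, SnapshotNotFoundError, and SnapshotNotSupportedError come from this PR, but the two reason strings, the `map_snapshot_error` helper, and the fallback to the base class are assumptions made for illustration.

```python
from typing import Optional


class SandboxSnapshotError(Exception):
    """Base class for snapshot failures (name taken from the PR)."""


class SnapshotNotFoundError(SandboxSnapshotError):
    pass


class SnapshotNotSupportedError(SandboxSnapshotError):
    pass


# Hypothetical reason strings: the PR only says they start with CWSANDBOX_FSS_.
REASON_TO_ERROR = {
    "CWSANDBOX_FSS_NOT_FOUND": SnapshotNotFoundError,
    "CWSANDBOX_FSS_NOT_SUPPORTED": SnapshotNotSupportedError,
}


def map_snapshot_error(
    reason: Optional[str],
    grpc_code: str,
    snapshot_id: Optional[str] = None,
) -> SandboxSnapshotError:
    """Pick a typed exception for a failed snapshot RPC (illustrative helper)."""
    if reason in REASON_TO_ERROR:
        return REASON_TO_ERROR[reason](reason)
    # A bare NOT_FOUND with a snapshot ID in context is treated as not-found.
    if grpc_code == "NOT_FOUND" and snapshot_id is not None:
        return SnapshotNotFoundError(f"snapshot {snapshot_id} not found")
    # Assumed fallback: anything unrecognized surfaces as the base class.
    return SandboxSnapshotError(f"{grpc_code}: {reason or 'no reason'}")
```

Because every typed error shares one base class, callers can catch SandboxSnapshotError broadly or a specific subclass when they want to react to one condition.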

Details

Wire contract

The FSS messages and RPCs (CreateFileSystemSnapshot, GetFileSystemSnapshot, ListFileSystemSnapshots, DeleteFileSystemSnapshot, and bucket config) plus the StartSandboxRequest.file_system mount and StopSandboxRequest.file_system_snapshot_on_stop fields are vendored into the gateway proto stubs.

Behavior

Config & start — FileSystemSnapshotOptions (mount_path, size, optional file_system_snapshot_id to restore) is accepted as a dataclass or plain dict and translated into the start request's file_system mount.
Capture — snapshot() returns OperationRef[str] (the new ID) and waits for the sandbox to be RUNNING before archiving, the same as exec/read_file/write_file. Both snapshot() and stop(snapshot_on_stop=True) auto-start the sandbox if needed and use a generous deadline (and bound the server-side wait) since the call blocks on the archive. An idempotency key is auto-generated when the caller omits one, so a retried create dedups rather than producing a duplicate.
Retries — the FSS RPCs retry transient errors (unavailability, request-deadline, resource-exhaustion, backend throttle) with a bounded wall-clock budget and AIP-193 backoff; non-transient errors fail fast. list_snapshots restarts pagination cleanly on a retry, and delete_snapshot treats a not-found-on-retry as success (a committed delete whose response was lost).
Error mapping — trusted CWSANDBOX_FSS_* reasons map to the typed exception hierarchy; a bare NOT_FOUND on a snapshot op maps to SnapshotNotFoundError.
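The delete-retry rule under Retries is easy to get backwards, so here is a small sketch of it. It is an illustration under stated assumptions: `delete_with_retry`, `TransientError`, and the stand-in `SnapshotNotFoundError` are hypothetical, and the loop uses a simple attempt cap with exponential backoff instead of the wall-clock budget and AIP-193 backoff the PR describes.

```python
import time


class TransientError(Exception):
    """Stand-in for unavailability, deadline, exhaustion, or throttle errors."""


class SnapshotNotFoundError(Exception):
    """Stand-in for the SDK's not-found error."""


def delete_with_retry(delete_rpc, snapshot_id, max_attempts=4, base_delay=0.01):
    """Delete a snapshot, treating NOT_FOUND on a retry as success."""
    for attempt in range(max_attempts):
        try:
            delete_rpc(snapshot_id)
            return
        except SnapshotNotFoundError:
            if attempt == 0:
                raise  # first attempt: the snapshot really does not exist
            return  # retry: an earlier attempt committed, its response was lost
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the transient error
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

The distinction matters only on attempts after the first: a not-found on the very first call is still reported to the caller.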

Test plan

Unit tests for the FSS types, proto conversions, error mapping, start/stop wiring, snapshot + management methods, and transient retry (incl. pagination-restart, delete-retry-as-success, and snapshot-waits-for-running)
mypy clean, ruff format + lint clean
FSS integration suite passes against an FSS-enabled org — test_snapshot_and_restore, test_snapshot_on_stop, test_list_get_delete_snapshot (3/3, run in parallel)
Update MANIFEST_GROUPS in the coreweave/docs API-ref generator for the new public FSS exports (separate repo)

Wire the backend file-system snapshot (FSS) feature into the SDK.

- FileSystemSnapshotOptions mount config on run()/Session.sandbox()/@session.function() and SandboxDefaults (mount_path, size, restore via file_system_snapshot_id)
- snapshot() captures a mid-life snapshot and returns the new ID; stop(snapshot_on_stop=True) snapshots on shutdown and exposes the ID via the file_system_snapshot_id property
- management classmethods: get_snapshot, list_snapshots, delete_snapshot, get_snapshot_bucket_config, set_snapshot_bucket_config
- FileSystemSnapshot record plus status/trigger/bucket enums and the FileSystemSnapshotBucketConfig type
- SandboxSnapshotError hierarchy mapped from CWSANDBOX_FSS_* reasons
- bounded client-side transient retry for the FSS RPCs, reusing the poll loop's retry classification and AIP-193 backoff; create auto-generates an idempotency key so a retried create dedups
- vendored gateway proto stubs regenerated for the FSS messages and RPCs
- unit and integration tests, an example, and docs
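To make the idempotency-key point concrete, the sketch below shows why the key has to be generated once, before the retry loop. Everything here is hypothetical: `FakeSnapshotBackend` is an in-memory stand-in that dedups by key, and `create_snapshot` is an illustrative helper, not the SDK's method.

```python
import uuid


class FakeSnapshotBackend:
    """In-memory stand-in that dedups creates by idempotency key."""

    def __init__(self):
        self.by_key = {}
        self.fail_next_response = True

    def create(self, sandbox_id, idempotency_key):
        # The create commits first; a repeated key returns the same snapshot.
        if idempotency_key not in self.by_key:
            self.by_key[idempotency_key] = f"snap-{len(self.by_key) + 1}"
        if self.fail_next_response:
            self.fail_next_response = False
            raise ConnectionError("response lost after commit")
        return self.by_key[idempotency_key]


def create_snapshot(backend, sandbox_id, idempotency_key=None, attempts=3):
    # Generate the key once, outside the loop, so every retry reuses it.
    key = idempotency_key or str(uuid.uuid4())
    for attempt in range(attempts):
        try:
            return backend.create(sandbox_id, key)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
```

If the key were regenerated inside the loop, the retried call would look like a new request and the backend would archive twice.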

- snapshot() waits for RUNNING before archiving, matching exec/read_file/write_file
- create-snapshot sets max_timeout_seconds when wait_for_ready (mirrors snapshot-on-stop)
- bare gRPC NOT_FOUND on a snapshot op maps to SnapshotNotFoundError when a snapshot ID is in context
- delete_snapshot treats NOT_FOUND on a retry as success (a committed delete whose response was lost)
- list_snapshots retry restarts pagination from page 1 instead of resuming at the last page_token
- doc/comment fixes (FileSystemSnapshot return type, keyword-only bucket-config args) and regression tests for each
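The pagination-restart behavior can be sketched as follows. This is an assumption-laden illustration: `list_all`, `fetch_page`, and `TransientError` are hypothetical names, and the restart cap stands in for the bounded retry budget.

```python
class TransientError(Exception):
    """Stand-in for a retryable RPC failure."""


def list_all(fetch_page, max_restarts=3):
    """Collect every page; on a transient error, restart from the first page."""
    for restart in range(max_restarts + 1):
        items, token = [], None  # fresh state: never resume a stale page_token
        try:
            while True:
                page, token = fetch_page(token)
                items.extend(page)
                if token is None:
                    return items
        except TransientError:
            if restart == max_restarts:
                raise
```

Discarding the partial results along with the token means a restarted listing never mixes items from two different passes.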

stop() coalesces concurrent callers onto a single shared _stop_task (first-caller-wins), discarding later callers' parameters. A stop(snapshot_on_stop=True) caller that joined an in-flight plain stop, or a sandbox already stopping or stopped, silently inherited "no snapshot": the sandbox was torn down with no archive, file_system_snapshot_id stayed None, and .result() returned success. The result was silent data loss.

Detect the incompatible join under _stop_lock and raise the new SnapshotOnStopConflictError instead of coalescing. Coalescing is kept for every compatible case (plain joining plain, plain joining snapshot, snapshot joining snapshot, user-stop joining a server-initiated drain), preserving the idempotent-teardown contract the context manager and cleanup handlers rely on. This mirrors the backend's FailedPrecondition for a snapshot-on-stop that arrives after the sandbox has begun terminating.

Cases that now raise when snapshot_on_stop=True:
- a plain stop already in flight
- a TERMINATING drain with no owned stop task (no snapshot RPC will be sent)
- already terminal with no snapshot captured

Already-terminal with a snapshot already captured stays a convergent no-op, and a never-started sandbox stays on the normal no-op path (no mount to archive).

Addresses PR #137 review finding #1.
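The join rules above amount to a small decision table. The sketch below encodes them as a pure function; it is not the SDK's implementation, which makes this decision under _stop_lock. The `State` enum, `check_stop_join`, and its parameters are hypothetical; only SnapshotOnStopConflictError is a name from this PR.

```python
from enum import Enum
from typing import Optional


class SnapshotOnStopConflictError(Exception):
    """Name from the PR; raised instead of silently dropping the snapshot."""


class State(Enum):
    NOT_STARTED = "not_started"
    RUNNING = "running"
    TERMINATING = "terminating"
    TERMINAL = "terminal"


def check_stop_join(
    want_snapshot: bool,
    state: State,
    in_flight_snapshot: Optional[bool],  # None means no owned stop task
    snapshot_captured: bool,
) -> str:
    """Return "start", "join", or "noop"; raise on an incompatible join."""
    if in_flight_snapshot is not None:
        if want_snapshot and not in_flight_snapshot:
            raise SnapshotOnStopConflictError("plain stop already in flight")
        return "join"  # plain/plain, plain/snapshot, snapshot/snapshot
    if state is State.NOT_STARTED:
        return "noop"  # no mount to archive
    if state is State.TERMINATING:
        if want_snapshot:
            raise SnapshotOnStopConflictError("draining, no snapshot RPC will be sent")
        return "join"  # user stop joining a server-initiated drain
    if state is State.TERMINAL:
        if want_snapshot and not snapshot_captured:
            raise SnapshotOnStopConflictError("already stopped without a snapshot")
        return "noop"  # convergent no-op
    return "start"
```

Only a caller that asked for a snapshot can hit the error; a plain stop() remains idempotent in every state.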

The backend runs a snapshot-on-stop in two sequential phases: it archives the mount (bounded by max_timeout_seconds) and then deletes the pod (bounded by graceful_shutdown_seconds). The client deadline was hard-set to the archive budget alone (DEFAULT_FSS_STOP_TIMEOUT_SECONDS + 5s ≈ 605s), ignoring the additive pod-delete grace, so a healthy stop whose archive runs long plus a real grace could exceed the client deadline and surface a spurious DEADLINE_EXCEEDED. The old 5s buffer was also smaller than the backend's ~30s gateway request slack, so a near-budget archive alone could trip it.

Decouple the proto field from the client deadline:
- proto max_timeout_seconds stays DEFAULT_FSS_STOP_TIMEOUT_SECONDS (600, the archive budget; the backend defaults to this and does not cap it).
- client deadline = archive budget + effective grace + slack, where effective grace substitutes the backend's 30s default when 0 is sent (sending 0 does not mean "no grace"), and slack (~35s) covers the backend's ~30s gateway request slack plus network round-trip.

This keeps the client deadline ~5s past the backend's worst-case wall-clock at every grace value.

Adds DEFAULT_FSS_STOP_GRACE_FALLBACK_SECONDS and DEFAULT_FSS_STOP_CLIENT_SLACK_SECONDS (named + commented) and documents the grace semantics on stop(). No client-side graceful>300 validation; the backend rejects it.

Backend behavior confirmed against coreweave/aviato: 600s is a default, not a cap; the only hard cap is graceful_shutdown_seconds <= 300.

Addresses PR #137 review finding #7.
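The deadline arithmetic can be checked with a few lines of Python. The three DEFAULT_FSS_* constant names come from this PR; the exact 35s slack value (the text says ~35s), the `BACKEND_GATEWAY_SLACK_SECONDS` name, and both helper functions are assumptions for illustration.

```python
DEFAULT_FSS_STOP_TIMEOUT_SECONDS = 600  # archive budget sent as max_timeout_seconds
DEFAULT_FSS_STOP_GRACE_FALLBACK_SECONDS = 30  # backend default when 0 is sent
DEFAULT_FSS_STOP_CLIENT_SLACK_SECONDS = 35  # assumed exact value for "~35s"
BACKEND_GATEWAY_SLACK_SECONDS = 30  # hypothetical name for the backend's ~30s slack


def effective_grace(graceful_shutdown_seconds: int) -> int:
    # Sending 0 does not mean "no grace": the backend substitutes its default.
    return graceful_shutdown_seconds or DEFAULT_FSS_STOP_GRACE_FALLBACK_SECONDS


def client_stop_deadline(graceful_shutdown_seconds: int) -> int:
    """Client deadline = archive budget + effective grace + client slack."""
    return (
        DEFAULT_FSS_STOP_TIMEOUT_SECONDS
        + effective_grace(graceful_shutdown_seconds)
        + DEFAULT_FSS_STOP_CLIENT_SLACK_SECONDS
    )


def backend_worst_case(graceful_shutdown_seconds: int) -> int:
    """Archive phase, then pod delete, plus the gateway request slack."""
    return (
        DEFAULT_FSS_STOP_TIMEOUT_SECONDS
        + effective_grace(graceful_shutdown_seconds)
        + BACKEND_GATEWAY_SLACK_SECONDS
    )
```

With these values, a grace of 0 gives a 665s client deadline against a 660s backend worst case, and the maximum grace of 300 gives 935s against 930s, so the 5s margin holds at both ends.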

snapshot(wait_for_ready=True) uses a ~605s client deadline (just past the 600s server-side wait bound) and wraps the create in _retry_transient_rpc (30s inter-attempt budget). A client DEADLINE_EXCEEDED maps to SandboxRequestTimeoutError, which is retryable, and the loop bounds only the gap between attempts, not attempt duration, so a wedged backend that blows past its own wait-timeout triggers a second full ~605s attempt: ~1210s wall-clock vs the ~605s the ceiling implies.

A client deadline on a wait-for-ready create is the ceiling being hit, not a transient blip (that's UNAVAILABLE / RESOURCE_EXHAUSTED / throttle, which still retry). Treat it as terminal: add an optional non_retryable override to _retry_transient_rpc and pass SandboxRequestTimeoutError on the create call. The wait now ends at ~605s, and the surfaced error is unchanged (SandboxRequestTimeoutError; the snapshot may still finish server-side, so poll get_snapshot()).

Scoped to snapshot(); the stop path has no retry loop.

Addresses PR #137 review finding #8.
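A rough sketch of the non_retryable override follows. It assumes a simplified loop: `retry_transient_rpc`, `SandboxUnavailableError`, and the budget handling are illustrative stand-ins rather than the SDK's _retry_transient_rpc; only SandboxRequestTimeoutError is a name from this PR.

```python
import time


class SandboxRequestTimeoutError(Exception):
    """Client deadline exceeded (name from the PR)."""


class SandboxUnavailableError(Exception):
    """Hypothetical stand-in for a transient UNAVAILABLE failure."""


RETRYABLE = (SandboxRequestTimeoutError, SandboxUnavailableError)


def retry_transient_rpc(call, budget_seconds=30.0, non_retryable=(), delay=0.01):
    """Retry transient errors until the budget elapses; never retry overrides."""
    deadline = time.monotonic() + budget_seconds
    while True:
        try:
            return call()
        except RETRYABLE as exc:
            # An error listed in non_retryable is terminal for this call site,
            # even though it is retryable by default.
            if isinstance(exc, non_retryable) or time.monotonic() >= deadline:
                raise
            time.sleep(delay)
```

Passing `non_retryable=(SandboxRequestTimeoutError,)` on the create call ends the wait after one attempt, while an unavailable error on the same call still retries.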

djenriquez force-pushed the worktree-djenriquez+fss-integration branch 3 times, most recently from 0746995 to af0baba on June 8, 2026 23:04

djenriquez marked this pull request as ready for review June 9, 2026 00:01

djenriquez requested a review from a team as a code owner June 9, 2026 00:01

djenriquez added 2 commits June 9, 2026 12:51

djenriquez force-pushed the worktree-djenriquez+fss-integration branch from af0baba to d80e6cc on June 9, 2026 19:52

djenriquez requested a review from brandonrjacobs as a code owner June 9, 2026 19:52

djenriquez added 3 commits June 10, 2026 09:39

brandonrjacobs approved these changes Jun 10, 2026


brandonrjacobs merged commit db055eb into main Jun 11, 2026
5 checks passed

brandonrjacobs merged 5 commits into main from worktree-djenriquez+fss-integration
