feat(fss): add file-system snapshot support #137
Merged
0746995 to
af0baba
Compare
Wire the backend file-system snapshot (FSS) feature into the SDK.

- FileSystemSnapshotOptions mount config on run()/Session.sandbox()/@session.function() and SandboxDefaults (mount_path, size, restore via file_system_snapshot_id)
- snapshot() captures a mid-life snapshot and returns the new ID; stop(snapshot_on_stop=True) snapshots on shutdown and exposes the ID via the file_system_snapshot_id property
- management classmethods: get_snapshot, list_snapshots, delete_snapshot, get_snapshot_bucket_config, set_snapshot_bucket_config
- FileSystemSnapshot record plus status/trigger/bucket enums and the FileSystemSnapshotBucketConfig type
- SandboxSnapshotError hierarchy mapped from CWSANDBOX_FSS_* reasons
- bounded client-side transient retry for the FSS RPCs, reusing the poll loop's retry classification and AIP-193 backoff; create auto-generates an idempotency key so a retried create dedups
- vendored gateway proto stubs regenerated for the FSS messages and RPCs
- unit and integration tests, an example, and docs
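The idempotency-key behavior mentioned in the commit message can be illustrated with a minimal sketch. It does not use the SDK: `FakeSnapshotBackend` and `create_with_retry` are hypothetical names, and the backend is an in-memory stand-in that is assumed to dedup creates by key.

```python
import uuid
from typing import Dict, Optional


class FakeSnapshotBackend:
    """Hypothetical in-memory backend that dedups creates by idempotency key."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}

    def create_snapshot(self, sandbox_id: str, idempotency_key: str) -> str:
        # A repeated key returns the snapshot created by the first request.
        if idempotency_key not in self._by_key:
            self._by_key[idempotency_key] = f"snap-{len(self._by_key) + 1}"
        return self._by_key[idempotency_key]

    @property
    def snapshot_count(self) -> int:
        return len(self._by_key)


def create_with_retry(
    backend: FakeSnapshotBackend,
    sandbox_id: str,
    idempotency_key: Optional[str] = None,
    attempts: int = 3,
) -> str:
    # Generate the key once, outside the loop, so every attempt reuses it.
    key = idempotency_key or uuid.uuid4().hex
    snapshot_id = ""
    for _ in range(attempts):
        # Each iteration models a retry after a lost response.
        snapshot_id = backend.create_snapshot(sandbox_id, key)
    return snapshot_id
```

The point of the sketch is the placement of key generation: if the key were generated inside the loop, each retry would look like a new create.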
- snapshot() waits for RUNNING before archiving, matching exec/read_file/write_file
- create-snapshot sets max_timeout_seconds when wait_for_ready (mirrors snapshot-on-stop)
- bare gRPC NOT_FOUND on a snapshot op maps to SnapshotNotFoundError when a snapshot ID is in context
- delete_snapshot treats NOT_FOUND on a retry as success (a committed delete whose response was lost)
- list_snapshots retry restarts pagination from page 1 instead of resuming at the last page_token
- doc/comment fixes (FileSystemSnapshot return type, keyword-only bucket-config args) and regression tests for each
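The two retry rules in this list (delete treats NOT_FOUND on a retry as success; list restarts from page 1) could be sketched roughly as follows. This is not the SDK's code: `NotFound`, `Transient`, `delete_with_retry`, and `list_all` are hypothetical stand-ins, and backoff between attempts is omitted for brevity.

```python
from typing import Callable, List, Optional, Tuple


class NotFound(Exception):
    """Hypothetical stand-in for a gRPC NOT_FOUND error."""


class Transient(Exception):
    """Hypothetical stand-in for a retryable error such as UNAVAILABLE."""


def delete_with_retry(delete: Callable[[], None], max_attempts: int = 3) -> None:
    for attempt in range(1, max_attempts + 1):
        try:
            delete()
            return
        except NotFound:
            if attempt > 1:
                # An earlier attempt likely committed and its response was lost.
                return
            raise  # First attempt: the snapshot really does not exist.
        except Transient:
            if attempt == max_attempts:
                raise


Page = Tuple[List[str], Optional[str]]  # (items, next_page_token)


def list_all(fetch_page: Callable[[Optional[str]], Page], max_attempts: int = 3) -> List[str]:
    for attempt in range(1, max_attempts + 1):
        items: List[str] = []  # Fresh accumulator: a retry restarts at page 1.
        token: Optional[str] = None
        try:
            while True:
                page_items, token = fetch_page(token)
                items.extend(page_items)
                if token is None:
                    return items
        except Transient:
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be >= 1")
```

Restarting from page 1 discards partial results, which avoids duplicates or gaps if the old page_token is no longer valid after the failure.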
af0baba to
d80e6cc
Compare
stop() coalesces concurrent callers onto a single shared _stop_task (first-caller-wins), discarding later callers' parameters. A stop(snapshot_on_stop=True) caller that joined an in-flight plain stop (or a sandbox already stopping or stopped) silently inherited "no snapshot": the sandbox was torn down with no archive, file_system_snapshot_id stayed None, and .result() returned success. Silent data loss.

Detect the incompatible join under _stop_lock and raise the new SnapshotOnStopConflictError instead of coalescing. Coalescing is kept for every compatible case (plain joining plain, plain joining snapshot, snapshot joining snapshot, user-stop joining a server-initiated drain), preserving the idempotent-teardown contract the context manager and cleanup handlers rely on. Mirrors the backend's FailedPrecondition for a snapshot-on-stop that arrives after the sandbox has begun terminating.

Cases that now raise when snapshot_on_stop=True:
- a plain stop already in flight
- a TERMINATING drain with no owned stop task (no snapshot RPC will be sent)
- already terminal with no snapshot captured

Already-terminal with a snapshot already captured stays a convergent no-op, and a never-started sandbox stays on the normal no-op path (no mount to archive).

Addresses PR #137 review finding #1.
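The join rules above can be written out as a small decision function. This is a sketch under assumptions, not the SDK's implementation: `StopState`, its fields, and `join_or_raise` are hypothetical, the exception is a local stand-in sharing the SDK name, and the caller is assumed to hold the equivalent of _stop_lock.

```python
from dataclasses import dataclass
from typing import Optional


class SnapshotOnStopConflictError(RuntimeError):
    """Local stand-in that shares its name with the SDK exception."""


@dataclass
class StopState:
    # All field names are hypothetical.
    started: bool = True
    in_flight_snapshots: Optional[bool] = None  # None: no owned stop task
    terminating: bool = False  # server-initiated drain
    terminal: bool = False
    captured_snapshot_id: Optional[str] = None


def join_or_raise(state: StopState, snapshot_on_stop: bool) -> str:
    """Return 'noop', 'coalesce', or 'start'; raise on an incompatible join."""
    if not state.started:
        return "noop"  # Never started: no mount to archive.
    if state.terminal:
        if snapshot_on_stop and state.captured_snapshot_id is None:
            raise SnapshotOnStopConflictError("already terminal, no snapshot captured")
        return "noop"
    if state.in_flight_snapshots is not None:
        if snapshot_on_stop and not state.in_flight_snapshots:
            raise SnapshotOnStopConflictError("plain stop already in flight")
        return "coalesce"
    if state.terminating:
        if snapshot_on_stop:
            raise SnapshotOnStopConflictError("draining; no snapshot RPC will be sent")
        return "coalesce"  # User stop joins the server-initiated drain.
    return "start"
```

Only the snapshot-requesting caller can hit a conflict; a plain stop remains safe to call from any state, which keeps teardown idempotent.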
The backend runs a snapshot-on-stop in two sequential phases: it archives the mount (bounded by max_timeout_seconds) and THEN deletes the pod (bounded by graceful_shutdown_seconds). The client deadline was hard-set to the archive budget alone (DEFAULT_FSS_STOP_TIMEOUT_SECONDS + 5s ≈ 605s), ignoring the additive pod-delete grace, so a healthy stop whose archive runs long plus a real grace could exceed the client deadline and surface a spurious DEADLINE_EXCEEDED. The old 5s buffer was also smaller than the backend's ~30s gateway request slack, so a near-budget archive alone could trip it.

Decouple the proto field from the client deadline:
- proto max_timeout_seconds stays DEFAULT_FSS_STOP_TIMEOUT_SECONDS (600, the archive budget; the backend defaults to this and does not cap it).
- client deadline = archive budget + effective grace + slack, where effective grace substitutes the backend's 30s default when 0 is sent (sending 0 does not mean "no grace"), and slack (~35s) covers the backend's ~30s gateway request slack plus network round-trip.

This keeps the client deadline ~5s past the backend's worst-case wall-clock at every grace value.

Adds DEFAULT_FSS_STOP_GRACE_FALLBACK_SECONDS and DEFAULT_FSS_STOP_CLIENT_SLACK_SECONDS (named + commented) and documents the grace semantics on stop(). No client-side graceful>300 validation; the backend rejects it.

Backend behavior confirmed against coreweave/aviato: 600s is a default, not a cap; the only hard cap is graceful_shutdown_seconds <= 300.

Addresses PR #137 review finding #7.
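The deadline arithmetic described above works out as in this sketch. The constant names come from the commit message; the values 600 and 30 are stated there, 35 is taken from the "~35s" figure, and the function name `stop_client_deadline` is hypothetical.

```python
DEFAULT_FSS_STOP_TIMEOUT_SECONDS = 600.0  # archive budget, sent as max_timeout_seconds
DEFAULT_FSS_STOP_GRACE_FALLBACK_SECONDS = 30.0  # backend default when 0 is sent
DEFAULT_FSS_STOP_CLIENT_SLACK_SECONDS = 35.0  # ~30s gateway slack plus round-trip (approximate)


def stop_client_deadline(graceful_shutdown_seconds: float = 0.0) -> float:
    """Client deadline = archive budget + effective grace + slack."""
    # Sending 0 does not mean "no grace": the backend substitutes its default.
    effective_grace = graceful_shutdown_seconds or DEFAULT_FSS_STOP_GRACE_FALLBACK_SECONDS
    return (
        DEFAULT_FSS_STOP_TIMEOUT_SECONDS
        + effective_grace
        + DEFAULT_FSS_STOP_CLIENT_SLACK_SECONDS
    )
```

With these values, a stop with no explicit grace gets 600 + 30 + 35 = 665s, and a stop at the 300s grace cap gets 600 + 300 + 35 = 935s, compared with the old fixed ~605s.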
snapshot(wait_for_ready=True) uses a ~605s client deadline (just past the 600s server-side wait bound) and wraps the create in _retry_transient_rpc (30s inter-attempt budget). A client DEADLINE_EXCEEDED maps to SandboxRequestTimeoutError, which is retryable, and the loop bounds only the gap between attempts, not attempt duration, so a wedged backend that blows past its own wait-timeout triggers a second full ~605s attempt: ~1210s wall-clock vs the ~605s the ceiling implies.

A client deadline on a wait-for-ready create is the ceiling being hit, not a transient blip (that's UNAVAILABLE / RESOURCE_EXHAUSTED / throttle, which still retry). Treat it as terminal: add an optional non_retryable override to _retry_transient_rpc and pass SandboxRequestTimeoutError on the create call. The wait now ends at ~605s, and the surfaced error is unchanged (SandboxRequestTimeoutError; the snapshot may still finish server-side, so poll get_snapshot()).

Scoped to snapshot(); the stop path has no retry loop.

Addresses PR #137 review finding #8.
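A rough sketch of a retry loop with a per-call non_retryable override is shown below. It is not the SDK's _retry_transient_rpc: `retry_transient`, `TransientError`, and `RequestTimeoutError` are hypothetical names, and the budget is modeled as a cap on cumulative sleep between attempts, which is one reading of "bounds only the gap between attempts".

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical base class for errors the loop normally retries."""


class RequestTimeoutError(TransientError):
    """Hypothetical stand-in for SandboxRequestTimeoutError."""


def retry_transient(
    call: Callable[[], T],
    *,
    budget_seconds: float = 30.0,
    base_delay: float = 0.5,
    non_retryable: Tuple[Type[BaseException], ...] = (),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    slept = 0.0
    delay = base_delay
    while True:
        try:
            return call()
        except TransientError as exc:
            if isinstance(exc, non_retryable):
                raise  # Terminal for this call site: no second attempt.
            if slept + delay > budget_seconds:
                raise  # Inter-attempt budget exhausted.
            sleep(delay)
            slept += delay
            delay *= 2  # Exponential backoff between attempts.
```

Passing `non_retryable=(RequestTimeoutError,)` on the create call leaves other transient errors retryable while ending the wait after a single timed-out attempt.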
brandonrjacobs
approved these changes
Jun 10, 2026
Summary
The CoreWeave Sandbox backend can snapshot a sandbox's working directory and restore it into new sandboxes. This PR adds file-system snapshot (FSS) support to the Python SDK so users can capture, restore, and fork a sandbox's filesystem from client code. FSS is gated per-organization on the backend; orgs that aren't enabled receive a clear `SnapshotNotSupportedError`.

The feature follows the SDK's existing sync/async hybrid style: a snapshot mount you configure at start, `snapshot()` to capture on demand, snapshot-on-stop, restore/fork by ID, and a management surface for listing and lifecycle.

What's included
- Mount config: `file_system_snapshot=FileSystemSnapshotOptions(mount_path=..., size=...)` on `run()`, `Session.sandbox()`, `@session.function()`, or `SandboxDefaults`.
- Capture: `sb.snapshot()` (returns the new snapshot ID), or on shutdown with `sb.stop(snapshot_on_stop=True)` (ID exposed via `file_system_snapshot_id`).
- Restore/fork: pass `file_system_snapshot_id` back into `Sandbox.run(...)`.
- Management: `Sandbox` classmethods `get_snapshot`, `list_snapshots` (auto-paginated, client-side filters), `delete_snapshot`, and admin `get_snapshot_bucket_config` / `set_snapshot_bucket_config`.
- Errors: `SandboxSnapshotError` hierarchy mapped from `CWSANDBOX_FSS_*` reasons (not-found / not-ready / not-supported / size / quota / bucket-mismatch, plus transient throttle and wait-timeout).

Details
Wire contract
The FSS messages and RPCs (`CreateFileSystemSnapshot`, `GetFileSystemSnapshot`, `ListFileSystemSnapshots`, `DeleteFileSystemSnapshot`, and bucket config) plus the `StartSandboxRequest.file_system` mount and `StopSandboxRequest.file_system_snapshot_on_stop` fields are vendored into the gateway proto stubs.

Behavior
- `FileSystemSnapshotOptions` (`mount_path`, `size`, optional `file_system_snapshot_id` to restore) is accepted as a dataclass or plain dict and translated into the start request's `file_system` mount.
- `snapshot()` returns `OperationRef[str]` (the new ID) and waits for the sandbox to be RUNNING before archiving, the same as `exec`/`read_file`/`write_file`. Both `snapshot()` and `stop(snapshot_on_stop=True)` auto-start the sandbox if needed and use a generous deadline (and bound the server-side wait) since the call blocks on the archive. An idempotency key is auto-generated when the caller omits one, so a retried create dedups rather than producing a duplicate.
- `list_snapshots` restarts pagination cleanly on a retry, and `delete_snapshot` treats a not-found-on-retry as success (a committed delete whose response was lost).
- `CWSANDBOX_FSS_*` reasons map to the typed exception hierarchy; a bare `NOT_FOUND` on a snapshot op maps to `SnapshotNotFoundError`.

Test plan
- `mypy` clean, `ruff` format + lint clean
- `test_snapshot_and_restore`, `test_snapshot_on_stop`, `test_list_get_delete_snapshot` (3/3, run in parallel)
- `MANIFEST_GROUPS` in the `coreweave/docs` API-ref generator for the new public FSS exports (separate repo)