feat(fss): add file-system snapshot support #137

Merged
brandonrjacobs merged 5 commits into main from worktree-djenriquez+fss-integration
Jun 11, 2026

@djenriquez commented Jun 8, 2026

Summary

The CoreWeave Sandbox backend can snapshot a sandbox's working directory and restore it into new sandboxes. This PR adds file-system snapshot (FSS) support to the Python SDK so users can capture, restore, and fork a sandbox's filesystem from client code. FSS is gated per-organization on the backend; orgs that aren't enabled receive a clear SnapshotNotSupportedError.

The feature follows the SDK's existing sync/async hybrid style: a snapshot mount you configure at start, snapshot() to capture on demand, snapshot-on-stop, restore/fork by ID, and a management surface for listing and lifecycle.

What's included

  • Configure a snapshot mount at start with file_system_snapshot=FileSystemSnapshotOptions(mount_path=..., size=...) on run(), Session.sandbox(), @session.function(), or SandboxDefaults.
  • Capture mid-life with sb.snapshot() (returns the new snapshot ID), or on shutdown with sb.stop(snapshot_on_stop=True) (ID exposed via file_system_snapshot_id).
  • Restore / fork by passing a file_system_snapshot_id back into Sandbox.run(...).
  • Manage snapshots via Sandbox classmethods: get_snapshot, list_snapshots (auto-paginated, client-side filters), delete_snapshot, and admin get_snapshot_bucket_config / set_snapshot_bucket_config.
  • Typed errors: a SandboxSnapshotError hierarchy mapped from CWSANDBOX_FSS_* reasons (not-found / not-ready / not-supported / size / quota / bucket-mismatch, plus transient throttle and wait-timeout).
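
A minimal sketch of how a reason-to-exception mapping of this kind could look. Only SandboxSnapshotError, SnapshotNotFoundError, SnapshotNotSupportedError, and the CWSANDBOX_FSS_ prefix come from this PR; SnapshotNotReadyError, the full reason strings, and map_snapshot_error are hypothetical names used for illustration.

```python
from typing import Optional


class SandboxSnapshotError(Exception):
    """Base class for file-system snapshot errors (name from the PR)."""


class SnapshotNotFoundError(SandboxSnapshotError):
    """The snapshot ID does not exist (name from the PR)."""


class SnapshotNotSupportedError(SandboxSnapshotError):
    """FSS is not enabled for the organization (name from the PR)."""


class SnapshotNotReadyError(SandboxSnapshotError):
    """Hypothetical name for the not-ready case."""


# Hypothetical reason strings: the PR only gives the CWSANDBOX_FSS_ prefix.
_REASON_TO_ERROR = {
    "CWSANDBOX_FSS_NOT_FOUND": SnapshotNotFoundError,
    "CWSANDBOX_FSS_NOT_SUPPORTED": SnapshotNotSupportedError,
    "CWSANDBOX_FSS_NOT_READY": SnapshotNotReadyError,
}


def map_snapshot_error(
    reason: Optional[str], message: str, *, bare_not_found: bool = False
) -> SandboxSnapshotError:
    """Translate a backend reason into a typed exception instance."""
    error_type = _REASON_TO_ERROR.get(reason or "")
    if error_type is None and bare_not_found:
        # A bare NOT_FOUND on a snapshot operation still maps to not-found.
        error_type = SnapshotNotFoundError
    # Unknown reasons fall back to the base class.
    return (error_type or SandboxSnapshotError)(message)
```

Because every class derives from SandboxSnapshotError, a caller can catch the base class for any snapshot failure or a specific subclass for one case.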

Details

Wire contract

The FSS messages and RPCs (CreateFileSystemSnapshot, GetFileSystemSnapshot, ListFileSystemSnapshots, DeleteFileSystemSnapshot, and bucket config) plus the StartSandboxRequest.file_system mount and StopSandboxRequest.file_system_snapshot_on_stop fields are vendored into the gateway proto stubs.

Behavior

  • Config & start: FileSystemSnapshotOptions (mount_path, size, optional file_system_snapshot_id to restore) is accepted as a dataclass or plain dict and translated into the start request's file_system mount.
  • Capture: snapshot() returns OperationRef[str] (the new ID) and waits for the sandbox to be RUNNING before archiving, the same as exec/read_file/write_file. Both snapshot() and stop(snapshot_on_stop=True) auto-start the sandbox if needed and use a generous deadline (and bound the server-side wait) since the call blocks on the archive. An idempotency key is auto-generated when the caller omits one, so a retried create dedups rather than producing a duplicate.
  • Retries: the FSS RPCs retry transient errors (unavailability, request-deadline, resource-exhaustion, backend throttle) with a bounded wall-clock budget and AIP-193 backoff; non-transient errors fail fast. list_snapshots restarts pagination cleanly on a retry, and delete_snapshot treats a not-found-on-retry as success (a committed delete whose response was lost).
  • Error mapping: trusted CWSANDBOX_FSS_* reasons map to the typed exception hierarchy; a bare NOT_FOUND on a snapshot op maps to SnapshotNotFoundError.
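
The retry and delete behavior can be sketched roughly as follows. This is an illustration under assumptions, not the SDK's code: TransientError, NotFoundError, retry_transient, and delete_snapshot_sketch are hypothetical stand-ins, and exponential backoff with jitter is assumed in place of whatever schedule the SDK actually uses.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for unavailability or throttle errors."""


class NotFoundError(Exception):
    """Hypothetical stand-in for a NOT_FOUND response."""


def retry_transient(
    call: Callable[[int], T],
    *,
    budget_seconds: float = 30.0,  # assumed wall-clock budget
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call(attempt)` on TransientError until the budget runs out."""
    deadline = time.monotonic() + budget_seconds
    attempt = 0
    while True:
        try:
            return call(attempt)
        except TransientError:
            # Assumed schedule: exponential backoff with full jitter.
            delay = random.uniform(0.0, min(max_delay, base_delay * 2 ** attempt))
            if time.monotonic() + delay > deadline:
                raise  # budget exhausted: surface the last transient error
            sleep(delay)
            attempt += 1


def delete_snapshot_sketch(
    delete_rpc: Callable[[str], None], snapshot_id: str, **retry_kwargs
) -> None:
    """Delete a snapshot, treating NOT_FOUND on a retry as success."""

    def attempt_delete(attempt: int) -> None:
        try:
            delete_rpc(snapshot_id)
        except NotFoundError:
            if attempt == 0:
                raise  # first attempt: the snapshot really is missing
            # On a retry, an earlier attempt most likely committed the delete.

    retry_transient(attempt_delete, **retry_kwargs)
```

Any exception other than TransientError propagates on the first occurrence, which matches the fail-fast behavior described for non-transient errors.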

Test plan

  • Unit tests for the FSS types, proto conversions, error mapping, start/stop wiring, snapshot + management methods, and transient retry (incl. pagination-restart, delete-retry-as-success, and snapshot-waits-for-running)
  • mypy clean, ruff format + lint clean
  • FSS integration suite passes against an FSS-enabled org — test_snapshot_and_restore, test_snapshot_on_stop, test_list_get_delete_snapshot (3/3, run in parallel)
  • Update MANIFEST_GROUPS in the coreweave/docs API-ref generator for the new public FSS exports (separate repo)

@djenriquez force-pushed the worktree-djenriquez+fss-integration branch 3 times, most recently from 0746995 to af0baba Compare June 8, 2026 23:04
@djenriquez marked this pull request as ready for review June 9, 2026 00:01
@djenriquez requested a review from a team as a code owner June 9, 2026 00:01
Wire the backend file-system snapshot (FSS) feature into the SDK.

- FileSystemSnapshotOptions mount config on run()/Session.sandbox()/
  @session.function() and SandboxDefaults (mount_path, size, restore via
  file_system_snapshot_id)
- snapshot() captures a mid-life snapshot and returns the new ID;
  stop(snapshot_on_stop=True) snapshots on shutdown and exposes the ID via
  the file_system_snapshot_id property
- management classmethods: get_snapshot, list_snapshots, delete_snapshot,
  get_snapshot_bucket_config, set_snapshot_bucket_config
- FileSystemSnapshot record plus status/trigger/bucket enums and the
  FileSystemSnapshotBucketConfig type
- SandboxSnapshotError hierarchy mapped from CWSANDBOX_FSS_* reasons
- bounded client-side transient retry for the FSS RPCs, reusing the poll
  loop's retry classification and AIP-193 backoff; create auto-generates an
  idempotency key so a retried create dedups
- vendored gateway proto stubs regenerated for the FSS messages and RPCs
- unit and integration tests, an example, and docs
- snapshot() waits for RUNNING before archiving, matching
  exec/read_file/write_file
- create-snapshot sets max_timeout_seconds when wait_for_ready
  (mirrors snapshot-on-stop)
- bare gRPC NOT_FOUND on a snapshot op maps to SnapshotNotFoundError
  when a snapshot ID is in context
- delete_snapshot treats NOT_FOUND on a retry as success (a committed
  delete whose response was lost)
- list_snapshots retry restarts pagination from page 1 instead of
  resuming at the last page_token
- doc/comment fixes (FileSystemSnapshot return type, keyword-only
  bucket-config args) and regression tests for each
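
A sketch of the restart-from-page-1 behavior for list_snapshots, assuming a hypothetical fetch_page(page_token) callable that returns (items, next_page_token). TransientError, list_all_snapshots, and the attempt limit are illustrative and not taken from the SDK.

```python
from typing import Callable, List, Tuple


class TransientError(Exception):
    """Hypothetical stand-in for a retryable RPC failure."""


def list_all_snapshots(
    fetch_page: Callable[[str], Tuple[List[str], str]],
    max_attempts: int = 3,
) -> List[str]:
    """Collect every page; on a transient failure, start over from page 1."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        items: List[str] = []  # discard partial results from a failed pass
        token = ""             # an empty token requests the first page
        try:
            while True:
                page, token = fetch_page(token)
                items.extend(page)
                if not token:
                    return items
        except TransientError:
            if attempt == max_attempts - 1:
                raise
    raise AssertionError("unreachable")
```

Resetting both the accumulated items and the token on each pass means a retried listing never mixes results from two different passes.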
@djenriquez force-pushed the worktree-djenriquez+fss-integration branch from af0baba to d80e6cc Compare June 9, 2026 19:52
stop() coalesces concurrent callers onto a single shared _stop_task
(first-caller-wins), discarding later callers' parameters. A
stop(snapshot_on_stop=True) caller that joined an in-flight plain stop
(or a sandbox already stopping/stopped) silently inherited "no
snapshot": the sandbox was torn down with no archive,
file_system_snapshot_id stayed None, and .result() returned success.
Silent data loss.

Detect the incompatible join under _stop_lock and raise the new
SnapshotOnStopConflictError instead of coalescing. Coalescing is kept
for every compatible case (plain joining plain, plain joining snapshot,
snapshot joining snapshot, user-stop joining a server-initiated drain),
preserving the idempotent-teardown contract the context manager and
cleanup handlers rely on. Mirrors the backend's FailedPrecondition for a
snapshot-on-stop that arrives after the sandbox has begun terminating.

Cases that now raise when snapshot_on_stop=True: a plain stop already in
flight; a TERMINATING drain with no owned stop task (no snapshot RPC will
be sent); already terminal with no snapshot captured. Already-terminal
with a snapshot already captured stays a convergent no-op, and a
never-started sandbox stays on the normal no-op path (no mount to
archive).

Addresses PR #137 review finding #1.
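
The in-flight case of the conflict check can be sketched with asyncio as below. This is a simplified illustration, not the SDK's implementation: SandboxSketch, _do_stop, and _stop_takes_snapshot are hypothetical, and the TERMINATING and already-terminal cases are left out.

```python
import asyncio
from typing import Optional


class SnapshotOnStopConflictError(Exception):
    """Raised when a snapshot-on-stop would join a stop that takes no snapshot."""


class SandboxSketch:
    """Hypothetical stand-in showing first-caller-wins coalescing."""

    def __init__(self) -> None:
        self._stop_lock = asyncio.Lock()
        self._stop_task: Optional["asyncio.Future[Optional[str]]"] = None
        self._stop_takes_snapshot = False

    async def _do_stop(self, snapshot_on_stop: bool) -> Optional[str]:
        await asyncio.sleep(0.01)  # placeholder for the real stop RPC
        return "snap-1" if snapshot_on_stop else None

    async def stop(self, snapshot_on_stop: bool = False) -> Optional[str]:
        async with self._stop_lock:
            if self._stop_task is None:
                # First caller wins and fixes the stop parameters.
                self._stop_takes_snapshot = snapshot_on_stop
                self._stop_task = asyncio.ensure_future(
                    self._do_stop(snapshot_on_stop)
                )
            elif snapshot_on_stop and not self._stop_takes_snapshot:
                # Incompatible join: the in-flight stop will not archive.
                raise SnapshotOnStopConflictError(
                    "a stop without snapshot is already in flight"
                )
            task = self._stop_task
        # Compatible callers share the result of the single stop task.
        return await task
```

The check runs while the lock is held, so a caller cannot observe a stop task whose snapshot flag has not been recorded yet.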
The backend runs a snapshot-on-stop in two sequential phases: it archives
the mount (bounded by max_timeout_seconds) and THEN deletes the pod
(bounded by graceful_shutdown_seconds). The client deadline was hard-set
to the archive budget alone (DEFAULT_FSS_STOP_TIMEOUT_SECONDS + 5s ≈
605s), ignoring the additive pod-delete grace — so a healthy stop whose
archive runs long plus a real grace could exceed the client deadline and
surface a spurious DEADLINE_EXCEEDED. The old 5s buffer was also smaller
than the backend's ~30s gateway request slack, so a near-budget archive
alone could trip it.

Decouple the proto field from the client deadline:
- proto max_timeout_seconds stays DEFAULT_FSS_STOP_TIMEOUT_SECONDS (600,
  the archive budget; the backend defaults to this and does not cap it).
- client deadline = archive budget + effective grace + slack, where
  effective grace substitutes the backend's 30s default when 0 is sent
  (sending 0 does not mean "no grace"), and slack (~35s) covers the
  backend's ~30s gateway request slack plus network round-trip. This
  keeps the client deadline ~5s past the backend's worst-case wall-clock
  at every grace value.

Adds DEFAULT_FSS_STOP_GRACE_FALLBACK_SECONDS and
DEFAULT_FSS_STOP_CLIENT_SLACK_SECONDS (named + commented) and documents
the grace semantics on stop(). No client-side graceful>300 validation;
the backend rejects it.

Backend behavior confirmed against coreweave/aviato: 600s is a default,
not a cap; the only hard cap is graceful_shutdown_seconds <= 300.

Addresses PR #137 review finding #7.
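
The deadline arithmetic above, as a small worked sketch. The constant names and the 600s and 30s values come from this commit message; the slack is given only as "~35s", so 35.0 is an approximation, and snapshot_stop_client_deadline is a hypothetical helper name.

```python
DEFAULT_FSS_STOP_TIMEOUT_SECONDS = 600.0        # archive budget
DEFAULT_FSS_STOP_GRACE_FALLBACK_SECONDS = 30.0  # backend default when 0 is sent
DEFAULT_FSS_STOP_CLIENT_SLACK_SECONDS = 35.0    # approximate: gateway slack + round-trip


def snapshot_stop_client_deadline(graceful_shutdown_seconds: float) -> float:
    """Client deadline = archive budget + effective grace + slack."""
    if graceful_shutdown_seconds > 0:
        effective_grace = graceful_shutdown_seconds
    else:
        # Sending 0 does not mean "no grace"; the backend applies its default.
        effective_grace = DEFAULT_FSS_STOP_GRACE_FALLBACK_SECONDS
    return (
        DEFAULT_FSS_STOP_TIMEOUT_SECONDS
        + effective_grace
        + DEFAULT_FSS_STOP_CLIENT_SLACK_SECONDS
    )
```

With grace 0 this gives 600 + 30 + 35 = 665s, against a backend worst case of roughly 600 + 30 + 30 = 660s; with the maximum grace of 300 it gives 935s against roughly 930s.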
snapshot(wait_for_ready=True) uses a ~605s client deadline (just past the
600s server-side wait bound) and wraps the create in _retry_transient_rpc
(30s inter-attempt budget). A client DEADLINE_EXCEEDED maps to
SandboxRequestTimeoutError, which is retryable, and the loop bounds only
the gap between attempts — not attempt duration — so a wedged backend
that blows past its own wait-timeout triggers a second full ~605s
attempt: ~1210s wall-clock vs the ~605s the ceiling implies.

A client deadline on a wait-for-ready create is the ceiling being hit,
not a transient blip (that's UNAVAILABLE / RESOURCE_EXHAUSTED / throttle,
which still retry). Treat it as terminal: add an optional non_retryable
override to _retry_transient_rpc and pass SandboxRequestTimeoutError on
the create call. The wait now ends at ~605s, and the surfaced error is
unchanged (SandboxRequestTimeoutError — the snapshot may still finish
server-side; poll get_snapshot()). Scoped to snapshot(); the stop path
has no retry loop.

Addresses PR #137 review finding #8.
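
A sketch of how a non_retryable override can make one normally retryable error terminal for a single call site. The real _retry_transient_rpc signature is not shown in this PR, so retry_transient_rpc_sketch, its parameters, and the attempt-count bound are assumptions; only SandboxRequestTimeoutError and the non_retryable idea come from the text above.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class SandboxRequestTimeoutError(Exception):
    """Client deadline exceeded (name from the PR); normally retryable."""


class UnavailableError(Exception):
    """Hypothetical stand-in for a transient UNAVAILABLE error."""


RETRYABLE: Tuple[Type[BaseException], ...] = (
    SandboxRequestTimeoutError,
    UnavailableError,
)


def retry_transient_rpc_sketch(
    call: Callable[[], T],
    *,
    non_retryable: Tuple[Type[BaseException], ...] = (),
    max_attempts: int = 3,
    delay_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient errors, except those listed in non_retryable."""
    for attempt in range(max_attempts):
        try:
            return call()
        except non_retryable:
            raise  # the override wins: treat as terminal for this call site
        except RETRYABLE:
            if attempt == max_attempts - 1:
                raise
            sleep(delay_seconds)
    raise ValueError("max_attempts must be at least 1")
```

Passing non_retryable=(SandboxRequestTimeoutError,) on the wait-for-ready create ends the call after one full-length attempt, while UnavailableError still retries.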
@brandonrjacobs merged commit db055eb into main Jun 11, 2026
5 checks passed