refactor(retry): unify transient retry for sandbox idempotent reads #140

Open
nicholaspun-wandb wants to merge 1 commit into main from refactor/unify-transient-retry

@nicholaspun-wandb

Collaborator

Summary

  • Adds cwsandbox/_retry.py as the single home for client-side retry of transient gRPC failures (UNAVAILABLE / DEADLINE_EXCEEDED / RESOURCE_EXHAUSTED, e.g. a gateway pod restart/rollout):
    retry_transient_async (one bounded-retry backoff core — wall-clock budget, AIP-193 RetryInfo honoring, decorrelated jitter) and classify / is_retryable (the single transient-vs-fatal classification
    point).
  • Replaces the bespoke retry loop in _poll_with_retry with the shared primitive — no behavior change to the poll-retry contract (regression-tested).
  • Extends retry to the idempotent reads that previously had none: get_status() (was bypassing the retry budget entirely), Sandbox.list, and unary read_file. read_file excludes RESOURCE_EXHAUSTED from
    retry because there it signals the oversized-message → exec-streaming fallback (see #127, "fix: Add exec fallback for oversized file operations"), not a transient overload.
  • exec (non-TTY) now translates a transport-level UNAVAILABLE / connection reset into a typed SandboxUnavailableError instead of leaking a raw grpc.aio.AioRpcError. exec and start are still never
    auto-retried (not idempotent); this only makes the failure catchable via the SDK exception hierarchy.
  • Removes the now-superseded _classify_poll_error / _RETRYABLE_POLL_EXCEPTIONS from _sandbox.py (moved into _retry).
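
For readers who have not opened the diff, the shape of the backoff core described above can be sketched roughly as follows. This is an illustrative sketch, not the code in cwsandbox/_retry.py: `Code`, `RpcError`, `retry_transient`, `read_file_should_retry`, `demo`, and every parameter name and default are hypothetical stand-ins (the real primitive works on grpc.aio.AioRpcError), and AIP-193 RetryInfo handling is left out. It shows a wall-clock budget, decorrelated jitter, one classification function, and a per-call `should_retry` override of the kind read_file would need.

```python
import asyncio
import random
import time
from enum import Enum
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class Code(Enum):
    """Hypothetical stand-in for grpc.StatusCode, so the sketch runs without grpcio."""

    UNAVAILABLE = "unavailable"
    DEADLINE_EXCEEDED = "deadline_exceeded"
    RESOURCE_EXHAUSTED = "resource_exhausted"
    NOT_FOUND = "not_found"


class RpcError(Exception):
    """Hypothetical stand-in for grpc.aio.AioRpcError."""

    def __init__(self, code: Code) -> None:
        super().__init__(code.name)
        self.code = code


# The three codes the summary lists as transient.
TRANSIENT = frozenset({Code.UNAVAILABLE, Code.DEADLINE_EXCEEDED, Code.RESOURCE_EXHAUSTED})


def is_retryable(exc: BaseException) -> bool:
    """Single classification point: transient (retry) versus fatal (raise)."""
    return isinstance(exc, RpcError) and exc.code in TRANSIENT


def read_file_should_retry(exc: BaseException) -> bool:
    """Per-call override: RESOURCE_EXHAUSTED is not retried for read_file."""
    return is_retryable(exc) and exc.code is not Code.RESOURCE_EXHAUSTED


async def retry_transient(
    call: Callable[[], Awaitable[T]],
    *,
    budget_seconds: float,
    base_delay: float = 0.05,
    max_delay: float = 1.0,
    should_retry: Callable[[BaseException], bool] = is_retryable,
) -> T:
    """Retry `call` on transient errors until the wall-clock budget runs out."""
    deadline = time.monotonic() + budget_seconds
    delay = base_delay
    while True:
        try:
            return await call()
        except Exception as exc:
            remaining = deadline - time.monotonic()
            # Fatal errors and an exhausted budget both re-raise the original error.
            if not should_retry(exc) or remaining <= 0:
                raise
            # Decorrelated jitter: draw from [base, 3 * previous delay], then cap.
            delay = min(max_delay, random.uniform(base_delay, delay * 3))
            # Never sleep past the end of the budget.
            await asyncio.sleep(min(delay, remaining))


async def demo(should_retry=is_retryable, fail_with=Code.UNAVAILABLE, budget_seconds=5.0):
    """Fail twice with `fail_with`, then succeed; report the outcome and attempt count."""
    attempts = 0

    async def flaky() -> str:
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise RpcError(fail_with)
        return "running"

    try:
        result = await retry_transient(
            flaky,
            budget_seconds=budget_seconds,
            base_delay=0.001,
            max_delay=0.01,
            should_retry=should_retry,
        )
    except RpcError as exc:
        result = exc.code.name
    return result, attempts
```

In this sketch, `asyncio.run(demo())` makes three attempts and returns `("running", 3)`, while `budget_seconds=0.0` or a non-transient code such as `NOT_FOUND` stops after a single attempt. Passing `should_retry=read_file_should_retry` with `fail_with=Code.RESOURCE_EXHAUSTED` also stops after one attempt, which is what would let a caller fall through to the exec-streaming fallback instead of burning the retry budget.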

Test plan

  • mise run lint / ruff format --check . clean
  • mypy src/ clean
  • mise run test — 1151 unit tests pass. New coverage in tests/unit/cwsandbox/test_retry.py (primitive: retry-then-success, budget=0 single attempt, fatal short-circuit, should_retry override, timeout
    clamp; and the classify/is_retryable registry), plus get_status retry and exec UNAVAILABLE → SandboxUnavailableError. Existing poll-retry tests preserved (patches re-pointed to cwsandbox._retry).
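
The exec case in the test plan (translate, but never retry) could be checked along these lines. This is a sketch under assumptions, not the test in the PR: `TransportUnavailable` stands in for a grpc.aio.AioRpcError with an UNAVAILABLE status, `exec_once` and `check_translation` are hypothetical helpers, and the `SandboxUnavailableError` to `SandboxError` subclass relationship is assumed from the summary's "catchable via the SDK exception hierarchy".

```python
import asyncio


class TransportUnavailable(Exception):
    """Hypothetical stand-in for a grpc.aio.AioRpcError carrying UNAVAILABLE."""


class SandboxError(Exception):
    """Stand-in for the SDK's base exception (hierarchy assumed, not confirmed)."""


class SandboxUnavailableError(SandboxError):
    """Stand-in for the typed error named in the summary."""


async def exec_once(send):
    """Run `send` exactly once; translate a transport failure, never retry it."""
    try:
        return await send()
    except TransportUnavailable as exc:
        # Chain the raw error so the transport detail stays available to callers.
        raise SandboxUnavailableError("sandbox unavailable during exec") from exc


async def check_translation():
    """Return (raised type, cause type, number of send attempts)."""
    calls = 0

    async def send():
        nonlocal calls
        calls += 1
        raise TransportUnavailable("connection reset")

    try:
        await exec_once(send)
    except SandboxError as exc:
        return type(exc).__name__, type(exc.__cause__).__name__, calls
    return None
```

Here `asyncio.run(check_translation())` returns `("SandboxUnavailableError", "TransportUnavailable", 1)`: the caller catches an SDK-level type, the raw error is preserved as `__cause__`, and the command is sent only once because exec is not idempotent.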

nicholaspun-wandb self-assigned this Jun 9, 2026
nicholaspun-wandb force-pushed the refactor/unify-transient-retry branch 2 times, most recently from 7d3a493 to ad6fa22, on June 9, 2026 19:54
@brandonrjacobs

Collaborator

@nicholaspun-wandb have you run an e2e test suite with this branch?

@nicholaspun-wandb

Collaborator Author

I had to split the run between local dev (where the network tests fail) and W&B serverless (where a couple of operations are blocked), but between the two, the integration tests pass:

CWSANDBOX_BASE_URL=http://localhost:8080 CWSANDBOX_API_KEY=key1 uv run pytest tests/integration/ -n 6

========================================================== short test summary info ==========================================================
FAILED tests/integration/cwsandbox/test_sandbox.py::test_sandbox_with_network_options - cwsandbox.exceptions.SandboxResourceExhaustedError: Start sandbox resource exhausted: peer Gateway returned Start error: peer Start retu...
FAILED tests/integration/cwsandbox/test_sandbox.py::test_sandbox_public_service_connectivity - httpx.ConnectError: [Errno 8] nodename nor servname provided, or not known
=========================================== 2 failed, 112 passed, 1 skipped in 403.40s (0:06:43) ============================================

uv run --no-sync pytest tests/integration/ -p wandb.sandbox -n auto
========================================================== short test summary info ==========================================================
FAILED tests/integration/cwsandbox/test_discovery.py::TestGetProfile::test_get_nonexistent_profile - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
FAILED tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_capacity_filter - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
FAILED tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_filter_nonexistent_profile - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
FAILED tests/integration/cwsandbox/test_discovery.py::TestGetRunner::test_get_nonexistent_runner - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_filter_by_architecture - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_runner_fields_populated - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestGetRunner::test_get_existing_runner - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_include_resources_true - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_profile_fields_populated - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_filter_by_service_exposure_mode - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_filter_by_profile_name - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_include_resources_false_default - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestGetRunner::test_get_runner_always_has_full_details - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestCrossReference::test_profile_runner_ids_exist - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_filter_by_egress_mode - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestGetProfile::test_get_existing_profile - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_filter_by_architecture - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_filter_by_runner_group_id - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_filter_by_runner_id - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_returns_profiles - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestGetProfile::test_get_profile_without_runner_id - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestCrossReference::test_runner_profiles_match_list_profiles - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_returns_runners - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_sandbox.py::test_sandbox_pinned_to_profile - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_sandbox.py::test_sandbox_pinned_to_profile_and_runner - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
ERROR tests/integration/cwsandbox/test_sandbox.py::test_sandbox_pinned_to_runner - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission denied: W&B auth is not allowed for this sandbox operation
======================================= 4 failed, 88 passed, 1 skipped, 22 errors in 64.61s (0:01:04) =======================================

@brandonrjacobs

Collaborator

LGTM after resolving conflicts.

nicholaspun-wandb force-pushed the refactor/unify-transient-retry branch from ad6fa22 to 8b160d7 on June 11, 2026 20:51
@nicholaspun-wandb

Collaborator Author

Post-rebase integration test run:

uv run --no-sync pytest tests/integration/ -p wandb.sandbox -n auto

===================== short test summary info ======================
FAILED tests/integration/cwsandbox/test_discovery.py::TestGetProfile::test_get_nonexistent_profile - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
FAILED tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_capacity_filter - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
FAILED tests/integration/cwsandbox/test_file_system_snapshot.py::test_snapshot_on_stop - cwsandbox.exceptions.SandboxError: Start sandbox failed: no sui...
FAILED tests/integration/cwsandbox/test_discovery.py::TestGetRunner::test_get_nonexistent_runner - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
FAILED tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_filter_nonexistent_profile - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
FAILED tests/integration/cwsandbox/test_file_system_snapshot.py::test_list_get_delete_snapshot - cwsandbox.exceptions.SandboxError: Start sandbox failed: no sui...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_include_resources_false_default - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_runner_fields_populated - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestGetRunner::test_get_runner_always_has_full_details - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_filter_by_profile_name - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestGetProfile::test_get_existing_profile - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestCrossReference::test_profile_runner_ids_exist - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_filter_by_architecture - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_include_resources_true - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_filter_by_egress_mode - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_filter_by_service_exposure_mode - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_profile_fields_populated - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestGetProfile::test_get_profile_without_runner_id - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_filter_by_runner_group_id - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestGetRunner::test_get_existing_runner - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_returns_profiles - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_filter_by_runner_id - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListProfiles::test_filter_by_architecture - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestListRunners::test_returns_runners - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_discovery.py::TestCrossReference::test_runner_profiles_match_list_profiles - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_sandbox.py::test_sandbox_pinned_to_profile_and_runner - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_sandbox.py::test_sandbox_pinned_to_profile - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
ERROR tests/integration/cwsandbox/test_sandbox.py::test_sandbox_pinned_to_runner - cwsandbox.exceptions.CWSandboxAuthenticationError: Permission d...
== 6 failed, 88 passed, 2 skipped, 22 errors in 61.99s (0:01:01) ===

CWSANDBOX_BASE_URL=http://localhost:8080 \
CWSANDBOX_API_KEY=key1 \
uv run pytest \
tests/integration/cwsandbox/test_discovery.py \
tests/integration/cwsandbox/test_sandbox.py::test_sandbox_pinned_to_profile_and_runner \
tests/integration/cwsandbox/test_sandbox.py::test_sandbox_pinned_to_profile \
tests/integration/cwsandbox/test_sandbox.py::test_sandbox_pinned_to_runner \
tests/integration/cwsandbox/test_file_system_snapshot.py \
-n 6
============================================================ test session starts ============================================================
platform darwin -- Python 3.11.10, pytest-9.0.2, pluggy-1.6.0
rootdir: /Users/nicholaspun/Documents/cwsandbox-client
configfile: pyproject.toml
plugins: anyio-4.12.1, xdist-3.8.0, asyncio-1.3.0, dotenv-0.5.2, cov-7.0.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
6 workers [29 items]    
.............................                                                                                                         [100%]
====================================================== 29 passed in 484.74s (0:08:04) =======================================================

nicholaspun-wandb force-pushed the refactor/unify-transient-retry branch from 8b160d7 to 4e61f83 on June 11, 2026 20:55