fix(sdk): Short-circuit repeated provider failures#291

Merged
dcramer merged 7 commits into main from
codex/provider-failure-circuit-breaker
May 6, 2026

@dcramer dcramer commented May 6, 2026

Previously, outage-shaped Claude failures could exhaust retries independently for every chunk. Warden now shares a lightweight circuit breaker across the run so obvious auth failures and repeated provider failures stop queued analysis early.

Circuit Breaker

Authentication failures open the circuit immediately. Provider-unavailable failures open after consecutive failures across chunks and share the same abort path used by the CLI, Ink renderer, and GitHub Action.
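
The behaviour described above can be sketched as a small shared object. This is a minimal illustration, not the code in this PR: the class name, the `auth_failed` code, the default threshold of 3, and the choice to reset the streak on unrelated failures are all assumptions. Only `provider_unavailable` is a name taken from the description.

```typescript
// Hypothetical codes; only "provider_unavailable" appears in the PR text.
type FailureCode = "auth_failed" | "provider_unavailable" | "sdk_error";

class RunCircuitBreaker {
  private readonly threshold: number;
  private consecutiveProviderFailures = 0;
  private openReason: FailureCode | null = null;

  // The default threshold is an assumption, not a value from the PR.
  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  get isOpen(): boolean {
    return this.openReason !== null;
  }

  get reason(): FailureCode | null {
    return this.openReason;
  }

  // A successful chunk ends any streak of provider failures.
  recordSuccess(): void {
    this.consecutiveProviderFailures = 0;
  }

  recordFailure(code: FailureCode): void {
    if (this.openReason !== null) return; // already open: keep the first reason
    if (code === "auth_failed") {
      this.openReason = code; // authentication failures open immediately
      return;
    }
    if (code === "provider_unavailable") {
      this.consecutiveProviderFailures += 1;
      if (this.consecutiveProviderFailures >= this.threshold) {
        this.openReason = code;
      }
      return;
    }
    // Assumption: an unrelated failure breaks the consecutive streak.
    this.consecutiveProviderFailures = 0;
  }
}
```

A single instance would be created per run and passed to every chunk worker, so each worker can check `isOpen` before starting queued work.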

Reporting

provider_unavailable is a stable report ErrorCode, and the GitHub Action's Sentry reporting captures provider and all-chunk failures with stable fingerprints.
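
As an illustration of what a stable fingerprint means here: Sentry groups events by a fingerprint, which is a list of strings, so building it only from fixed identifiers keeps every outage in one issue. The helper name and the literal prefix strings below are hypothetical, not taken from this PR.

```typescript
// Hypothetical helper; the prefix strings are illustrative assumptions.
type ActionFailureCode = "provider_unavailable" | "all_hunks_failed";

function failureFingerprint(code: ActionFailureCode): string[] {
  // Only fixed strings and the error code go in. Raw provider messages,
  // request IDs, or timestamps would split one outage into many groups.
  return ["warden-action", "analysis-failure", code];
}
```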

Validated with pnpm run build, pnpm lint, and pnpm test.

Stop Warden runs after repeated provider-unavailable failures, and stop immediately on authentication failures. This prevents outage-shaped failures from burning through every chunk while preserving regular retry behavior for transient attempts.

Record provider_unavailable as a stable error code in reports and group matching action errors in Sentry.

Co-Authored-By: GPT-5 Codex <noreply@anthropic.com>
@dcramer dcramer marked this pull request as ready for review May 6, 2026 18:19
Comment thread src/sdk/analyze.ts
Comment thread src/sdk/analyze.ts
Return SDK reports when a provider circuit opens after earlier hunks already produced findings.

This keeps direct runSkill consumers aligned with the CLI task path while still treating empty circuit-open runs as fatal.

Co-Authored-By: GPT-5 Codex <noreply@anthropic.com>
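
A rough sketch of the rule in this commit, assuming a hypothetical `finishRun` helper and simplified report types (the real runSkill return shape is not shown in this conversation): partial findings are returned along with the circuit reason, while a circuit-open run with nothing to report throws.

```typescript
// Simplified, hypothetical types for illustration only.
interface Finding {
  title: string;
}

interface SkillReport {
  findings: Finding[];
  circuitReason?: string;
}

class CircuitOpenError extends Error {
  readonly reason: string;

  constructor(reason: string) {
    super(`analysis stopped: ${reason}`);
    this.name = "CircuitOpenError";
    this.reason = reason;
  }
}

function finishRun(findings: Finding[], circuitReason: string | null): SkillReport {
  if (circuitReason === null) {
    return { findings }; // normal completion
  }
  if (findings.length === 0) {
    throw new CircuitOpenError(circuitReason); // nothing useful to return
  }
  // Earlier hunks produced findings: keep them and note why the run stopped.
  return { findings, circuitReason };
}
```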
Comment thread src/sdk/analyze.ts
Comment thread src/sdk/analyze.ts Outdated
Record provider failures after a hunk exhausts retries instead of counting each retry attempt.

This keeps the circuit breaker threshold tied to consecutive failing hunks rather than transient attempts within one hunk.

Co-Authored-By: GPT-5 Codex <noreply@anthropic.com>
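
The distinction between counting attempts and counting hunks can be shown with a small synchronous sketch. The names `runHunk` and `FailureCounter` are hypothetical, and the real code is presumably asynchronous; the point is only that the counter moves once per hunk, after the retry loop ends.

```typescript
// Hypothetical shared counter of consecutive failing hunks.
interface FailureCounter {
  consecutiveFailedHunks: number;
}

// attempt() stands in for one provider call and returns true on success.
function runHunk(
  attempt: () => boolean,
  maxAttempts: number,
  counter: FailureCounter,
): boolean {
  for (let i = 0; i < maxAttempts; i++) {
    if (attempt()) {
      counter.consecutiveFailedHunks = 0; // a success ends the streak
      return true;
    }
    // A failed attempt is retried without touching the shared counter.
  }
  counter.consecutiveFailedHunks += 1; // counted once, after retries run out
  return false;
}
```

With this shape, a hunk that fails twice and then succeeds contributes nothing toward the threshold, which matches the intent of keeping transient attempts out of the count.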
Comment thread src/sdk/analyze.ts
Return a circuit reason only when the current failure is a circuit-breaker code.

This keeps in-flight max_turns and sdk_error hunks from being mislabeled after another hunk opens the shared circuit.

Co-Authored-By: GPT-5 Codex <noreply@anthropic.com>
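
One way to express this guard, as a sketch: the function name, the `auth_failed` code, and the choice to return the shared open reason are assumptions, while `max_turns`, `sdk_error`, and `provider_unavailable` are codes named in this conversation.

```typescript
type HunkFailureCode =
  | "auth_failed" // hypothetical name for the authentication failure code
  | "provider_unavailable"
  | "max_turns"
  | "sdk_error";

// Codes that are allowed to be attributed to the circuit breaker.
const CIRCUIT_CODES: ReadonlySet<HunkFailureCode> = new Set<HunkFailureCode>([
  "auth_failed",
  "provider_unavailable",
]);

function circuitReasonFor(
  current: HunkFailureCode,
  openReason: HunkFailureCode | null,
): HunkFailureCode | null {
  if (openReason === null) return null; // circuit is closed
  // An in-flight hunk that failed for an unrelated reason keeps its own code.
  return CIRCUIT_CODES.has(current) ? openReason : null;
}
```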
Comment thread src/cli/output/tasks.ts Outdated
Comment thread src/cli/output/tasks.ts Outdated
Classify all-hunks-failed runs by the shared analysis failure code even when extraction failures are also present.

This keeps auth and provider outages from falling back to generic all_hunks_failed reports.

Co-Authored-By: GPT-5 Codex <noreply@anthropic.com>
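
A sketch of one reading of this commit, assuming that "shared" means every failed hunk reported the same analysis code. The function name, the summary shape, and `auth_failed` are hypothetical; `provider_unavailable` and `all_hunks_failed` come from the text.

```typescript
const CIRCUIT_REPORT_CODES = ["auth_failed", "provider_unavailable"] as const;
type CircuitReportCode = (typeof CIRCUIT_REPORT_CODES)[number];
type RunFailureCode = CircuitReportCode | "all_hunks_failed";

interface FailedRunSummary {
  analysisFailureCodes: readonly string[];
  extractionFailures: number;
}

function classifyAllFailed(summary: FailedRunSummary): RunFailureCode {
  const codes = summary.analysisFailureCodes;
  const first = codes[0];
  // No analysis failures, or mixed codes: fall back to the generic code.
  if (first === undefined || !codes.every((code) => code === first)) {
    return "all_hunks_failed";
  }
  // summary.extractionFailures is deliberately not consulted, so extraction
  // failures cannot mask an auth or provider outage.
  const match = CIRCUIT_REPORT_CODES.find((code) => code === first);
  return match ?? "all_hunks_failed";
}
```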
Comment thread src/cli/output/tasks.ts
Require a skill to have its own hunk or extraction failures before shared circuit state can synthesize a failed report. This keeps concurrent clean runs from inheriting another skill's circuit-breaker failure.

Co-Authored-By: GPT-5 Codex <noreply@anthropic.com>
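
The guard in this commit amounts to a two-part condition, sketched below with hypothetical names. The shared circuit state alone is not enough: the skill must also have failures of its own.

```typescript
// Hypothetical per-skill tally of failures.
interface SkillOutcome {
  failedHunks: number;
  failedExtractions: number;
}

function shouldReportCircuitFailure(
  outcome: SkillOutcome,
  circuitReason: string | null,
): boolean {
  const hasOwnFailures = outcome.failedHunks > 0 || outcome.failedExtractions > 0;
  // A clean skill that finished while another skill tripped the breaker
  // must not inherit that failure.
  return circuitReason !== null && hasOwnFailures;
}
```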

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 2906cd5.

Comment thread src/sdk/analyze.ts
Require SDK runs to have their own hunk or extraction failures before shared circuit state can throw. This keeps clean zero-finding SDK runs from inheriting another task's circuit-breaker failure.

Co-Authored-By: GPT-5 Codex <noreply@anthropic.com>
@dcramer dcramer merged commit aa9f4f3 into main May 6, 2026
15 checks passed
@dcramer dcramer deleted the codex/provider-failure-circuit-breaker branch May 6, 2026 20:28