-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feat: model fallback system with automatic retry and background agent support #797
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
feat: model fallback system with automatic retry and background agent support #797
Conversation
…a fields Add new optional fields to AgentOverrideConfigSchema and CategoryConfigSchema: - fallback: string[] - array of fallback model strings - fallbackDelayMs: number (100-30000) - delay between retry attempts - fallbackRetryCount: number (1-5) - maximum models to try These fields enable automatic model fallback when primary model fails due to rate limits, service unavailability, or other transient errors.
Implement core model fallback utilities in src/features/model-fallback/: - parseModelString(): Parse 'provider/model' format into ModelSpec - modelSpecToString(): Convert ModelSpec back to string - buildModelChain(): Build ordered list of models from primary + fallbacks - isModelError(): Detect retryable errors (rate limits, 429, 503, timeouts) - withModelFallback(): Generic retry wrapper for simple use cases - formatRetryErrors(): Format error list for user-friendly display The module detects these transient errors for automatic retry: - Rate limits (rate_limit, 429) - Service unavailable (503, 502, unavailable, overloaded, capacity) - Network errors (timeout, ECONNREFUSED, ENOTFOUND) Non-model errors (auth failures, invalid input) are NOT retried.
Update sisyphus-task to use the model fallback system: Integration changes: - Import parseModelString, isModelError, RetryConfig from model-fallback - Update resolveCategoryConfig() to return modelChain and retryConfig - Use configurable fallbackDelayMs and fallbackRetryCount from config - Build model chain from primary model + fallback array User notification on fallback: - Show which model was actually used in task completion message - Display warning when fallback occurred with list of failed models - Example: '⚠️ Model fallback occurred: Primary model failed, used anthropic/claude-haiku-4-5 instead. Failed models: deepseek/deepseek-v3-2' Special handling: - Agent configuration errors (agent.name) do NOT trigger fallback - Only transient model errors (rate limits, timeouts) retry with fallback - Inline retry logic preserved for session lifecycle management
Add 36 unit tests covering all model-fallback functions: parseModelString() tests: - Valid provider/model strings - Nested paths (provider/sub/model) - Invalid single-segment strings - Empty string handling buildModelChain() tests: - Primary model only - Primary with fallback array - Invalid models filtered out - Empty/undefined fallback handling isModelError() tests: - Rate limit detection (rate_limit, 429) - Service errors (503, 502, unavailable, overloaded, capacity) - Network errors (timeout, ECONNREFUSED, ENOTFOUND) - Non-model errors return false - Non-Error objects return false withModelFallback() tests: - Success on first model - Fallback on model error - No retry on non-model errors - Exhausts all models on persistent failures - Respects maxAttempts config - Empty model chain handling - Records usedModel on success/failure formatRetryErrors() tests: - Empty array handling - Single error formatting - Multiple errors as numbered list
- Fix P1: Remove duplicate session.prompt calls in manager.ts that caused double execution before task registration. Synced with origin/dev and properly integrated modelChain support. - Fix P2: parseModelString now rejects empty provider/model segments, preventing malformed model specs from entering the fallback chain. - Remove pr_790_body.txt which belonged to a different PR (HTTP transport).
- Restore constants.ts to use standard provider names (google/, openai/) instead of custom nextologies provider - Add modelChain and retryConfig to resolveCategoryConfig return type
Security & Correctness Fixes: - P1: Fix path traversal vulnerability in getMessageDir by validating sessionID - P2: Fix lsp_symbols array access when limit <= 0 by enforcing min limit of 1 - P2: Add fallback chains to all DEFAULT_CATEGORIES to enable automatic model fallback Documentation Fixes: - P2: Update README.md beta release link from v3.0.0-beta.1 to v3.0.0-beta.6 - P3: Fix AGENTS.md LSP tool count from 11 to 7 to match actual count All fixes verified with full test suite (1048/1048 passing).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5 issues found across 11 files
Confidence score: 2/5
- Zero-attempt retry loop in
src/tools/sisyphus-task/tools.tsleaves subagents without any prompt when no model is configured, so user requests silently fail. maxAttemptsinsrc/features/model-fallback/index.tsaccepts 0/negative values, producing empty errors even when viable models exist, which risks regressions for fallback flows.- Multiple issues in
src/tools/sisyphus-task/tools.ts(undefined model passed to build chain, over-broad agent-not-found detection, leakedsubagentSessions) indicate fragile error handling that can crash or misreport before the user sees a useful message. - Pay close attention to
src/tools/sisyphus-task/tools.ts,src/features/model-fallback/index.ts- unstable retry/mode selection logic and cleanup must be corrected.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/model-fallback/index.ts">
<violation number="1" location="src/features/model-fallback/index.ts:75">
P2: maxAttempts accepts 0/negative values, causing zero attempts and empty errors despite available models</violation>
</file>
<file name="src/tools/sisyphus-task/tools.ts">
<violation number="1" location="src/tools/sisyphus-task/tools.ts:103">
P2: buildModelChain is called with possible undefined model, causing runtime crash before user-facing error when no model is configured</violation>
<violation number="2" location="src/tools/sisyphus-task/tools.ts:551">
P1: No prompt is sent for subagent requests because the retry loop stops at 0 attempts when no model is configured.</violation>
<violation number="3" location="src/tools/sisyphus-task/tools.ts:584">
P2: Over-broad agent-not-found detection: any error containing “undefined” is treated as agent missing, aborting fallbacks and misreporting errors.</violation>
<violation number="4" location="src/tools/sisyphus-task/tools.ts:588">
P2: Early prompt failure returns leak subagent session tracking (missing subagentSessions.delete)</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
| } | ||
|
|
||
| const fallbackList = userConfig?.fallback ?? defaultConfig?.fallback | ||
| const modelChain = buildModelChain(config.model, fallbackList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: buildModelChain is called with possible undefined model, causing runtime crash before user-facing error when no model is configured
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/tools/sisyphus-task/tools.ts, line 103:
<comment>buildModelChain is called with possible undefined model, causing runtime crash before user-facing error when no model is configured</comment>
<file context>
@@ -85,14 +99,22 @@ function resolveCategoryConfig(
}
+ const fallbackList = userConfig?.fallback ?? defaultConfig?.fallback
+ const modelChain = buildModelChain(config.model, fallbackList)
+
+ const retryConfig: RetryConfig = {
</file context>
| const modelChain = buildModelChain(config.model, fallbackList) | |
| const modelChain = config.model ? buildModelChain(config.model, fallbackList) : [] |
- Fix maxAttempts accepting 0/negative values causing zero retry attempts in both withModelFallback() and sisyphus_task retry loop - Remove over-broad "undefined" check in agent-not-found detection that could misclassify unrelated errors and abort fallbacks prematurely - Add missing subagentSessions.delete() calls on early error returns to prevent memory leaks in session tracking Fixes issues reported by cubic-dev-ai review: - P1: No prompt sent for subagent requests when maxAttempts=0 - P2: maxAttempts accepts 0/negative values - P2: Over-broad agent-not-found detection - P2: Session tracking leak on early failures Co-Authored-By: Claude Opus 4.5 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/model-fallback/index.ts">
<violation number="1" location="src/features/model-fallback/index.ts:75">
P2: `maxAttempts` is not validated—`NaN` input causes the loop to skip all models and returns `attempts: NaN`</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Use Number.isFinite() to validate maxAttempts before using it. If maxAttempts is NaN, Infinity, or not a number, fall back to using the model chain length instead. Adds test case for NaN input. Co-Authored-By: Claude Opus 4.5 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/features/model-fallback/index.test.ts">
<violation number="1" location="src/features/model-fallback/index.test.ts:415">
P2: Test does not exercise or assert chain-length fallback for NaN maxAttempts; passes even if NaN coerces to 1.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
The previous test succeeded on first attempt, so it couldn't distinguish maxAttempts=1 from maxAttempts=chainLength. Now uses 3 models that all fail, verifying all 3 are tried (proving maxAttempts correctly falls back to chain length, not 1). Co-Authored-By: Claude Opus 4.5 <[email protected]>
|
This would be super cool. Are you going to keep working on it @stranger2904? |
|
@GollyJer Hi, I'm still testing a few things, it did work for my original use-case, but I want to try to break it a bit more... do you want me to convert it to a draft until than ? |
|
@stranger2904 Thanks for the update. I converted to Draft for now. 👍 |
Summary
Implements a comprehensive model fallback system with automatic retry capabilities, addressing issues found in PR #795. This is a clean replacement for #795 with improved conflict resolution and proper integration with the existing codebase.
Core Features
Model Fallback System
Integration Points
sisyphus_tasktool: Automatic retry with fallback modelsConfiguration Schema
Added three new config fields:
Bug Fixes (from PR #795)
cubic-dev-ai P1/P2 Issues:
session.prompt()calls in background agent managerparseModelString()pr_790_body.txt(belonged to different PR)resolveCategoryConfigreturn type to includemodelChainandretryConfigauto-review bot Issues:
getMessageDir()(sessionID validation)lsp_symbolsarray access when limit <= 0DEFAULT_CATEGORIESTesting
parseModelString()validationbuildModelChain()chain constructionwithModelFallback()retry logic with delaysisModelError()Files Changed
Usage Example
Category Configuration:
{ "categories": { "visual-engineering": { "model": "google/gemini-3-pro-preview", "fallback": ["anthropic/claude-sonnet-4-5", "google/gemini-3-flash-preview"], "fallbackDelayMs": 2000, "fallbackRetryCount": 2 } } }Automatic Fallback:
When primary model fails, automatically retries with fallback models and notifies user:
Replaces
Testing Instructions
bun test && bun run typecheckto verifyRelated
Summary by cubic
Adds a model fallback system with automatic retries for transient failures, integrated into sisyphus_task and background agents to improve reliability and user feedback. Introduces simple config fields for fallback models, retry delay, and max attempts.
New Features
Bug Fixes
Written for commit 7460340. Summary will update on new commits.