feat: automatic model fallback system with background support #795

stranger2904 · 2026-01-14T19:30:02Z

Summary\n\nAdds an automatic model fallback system that gracefully handles API failures by retrying with configured fallback models. Background tasks now properly support model fallback chains, ensuring reliability even for async operations.\n\n## Features\n\n### Automatic Fallback on Transient Errors\n- Rate limits and HTTP 429 errors\n- Service unavailable (HTTP 503, 502)\n- Capacity issues (overloaded, capacity, unavailable)\n- Network errors (timeout, ECONNREFUSED, ENOTFOUND)\n\nNon-model errors (authentication failures, invalid input) are NOT retried.\n\n### Background Task Support\n- Background tasks properly use modelChain for fallback\n- Ensures reliability for async operations\n\n### Validated Configuration\n- maxAttempts >= 1 enforced to prevent silent failures\n- Clean error handling: only aborts on configuration errors, continues on model errors\n\n## Implementation\n\nCore module src/features/model-fallback/ provides utilities for model parsing, chain building, error detection, and retry logic with validation. Integrated with sisyphus-task for seamless fallback handling.\n\n## Fixes from Review\n

P1: maxAttempts validation (Math.max(1, ...))\n- P2: Error matching scoped to agent.name only\n- P2: Background LaunchInput supports modelChain\n- P2: BackgroundTask stores modelChain field\n- P2: Background manager passes modelChain\n\n## Testing\n
✅ 36/36 model-fallback tests passing\n✅ Build successful\n✅ TypeScript types valid\n\nAll commits authored by stranger2904 with GitHub noreply email.\nClean PR without auto-generated descriptions.\n\nRelated: Fixes PR feat: automatic model fallback system with user notification #785 authorship issue.

Summary by cubic

Adds an automatic model fallback system for both sync and background tasks to keep prompts running through transient API failures. Also adds HTTP transport for MCP servers.

New Features
- Retries with fallback models on 429, 502/503, capacity, timeouts, and common network errors; skips auth/validation errors.
- Background tasks now accept and pass modelChain for fallback.
- Config options: fallback[], fallbackDelayMs (100–30000ms), fallbackRetryCount (1–5); schemas updated; default categories define fallback chains.
- Task completion output includes the model used and a fallback notice when applicable.
- Skill MCP: HTTP transport with optional headers, auto-detects HTTP vs stdio, backward compatible.
Bug Fixes
- Enforce maxAttempts >= 1.
- Preserve variant on the primary model in modelChain.
- Narrow error matching to agent.name for config issues.
- Clean up subagentSessions on early returns.

^{Written for commit aa1c1dc. Summary will update on new commits.}

Add support for connecting to remote MCP servers via HTTP in addition to the existing stdio (local process) connections. This enables skills to use cloud-hosted MCP servers and aggregated MCP gateways. ## Changes - Extend SkillMcpManager to detect connection type from config: - Explicit `type: "http"` or `type: "sse"` → HTTP connection - Explicit `type: "stdio"` → stdio connection - Infer from `url` field → HTTP connection - Infer from `command` field → stdio connection - Add StreamableHTTPClientTransport from MCP SDK for HTTP connections - Supports custom headers for authentication (e.g., API keys) - Proper error handling with helpful hints - Maintain full backward compatibility with existing stdio configurations ## Usage ```yaml # HTTP connection (new) mcp: remote-server: url: https://mcp.example.com/mcp headers: Authorization: Bearer ${API_KEY} # stdio connection (existing, unchanged) mcp: local-server: command: npx args: [-y, @some/mcp-server] ``` ## Tests Added comprehensive tests for: - Connection type detection (explicit type vs inferred) - HTTP URL validation and error messages - Headers configuration - Backward compatibility with stdio configs

Add mocks for HTTP transport to avoid real network calls during tests. This addresses reviewer feedback about test reliability: - Tests are now faster (no network latency) - Tests are deterministic across environments - Test intent is clearer (unit testing error handling logic) The mock throws immediately with a controlled error message, allowing tests to validate error handling without network dependencies. Co-Authored-By: Claude Opus 4.5 <[email protected]>

…a fields Add new optional fields to AgentOverrideConfigSchema and CategoryConfigSchema: - fallback: string[] - array of fallback model strings - fallbackDelayMs: number (100-30000) - delay between retry attempts - fallbackRetryCount: number (1-5) - maximum models to try These fields enable automatic model fallback when primary model fails due to rate limits, service unavailability, or other transient errors.

Implement core model fallback utilities in src/features/model-fallback/: - parseModelString(): Parse 'provider/model' format into ModelSpec - modelSpecToString(): Convert ModelSpec back to string - buildModelChain(): Build ordered list of models from primary + fallbacks - isModelError(): Detect retryable errors (rate limits, 429, 503, timeouts) - withModelFallback(): Generic retry wrapper for simple use cases - formatRetryErrors(): Format error list for user-friendly display The module detects these transient errors for automatic retry: - Rate limits (rate_limit, 429) - Service unavailable (503, 502, unavailable, overloaded, capacity) - Network errors (timeout, ECONNREFUSED, ENOTFOUND) Non-model errors (auth failures, invalid input) are NOT retried.

Update sisyphus-task to use the model fallback system: Integration changes: - Import parseModelString, isModelError, RetryConfig from model-fallback - Update resolveCategoryConfig() to return modelChain and retryConfig - Use configurable fallbackDelayMs and fallbackRetryCount from config - Build model chain from primary model + fallback array User notification on fallback: - Show which model was actually used in task completion message - Display warning when fallback occurred with list of failed models - Example: '⚠️ Model fallback occurred: Primary model failed, used anthropic/claude-haiku-4-5 instead. Failed models: deepseek/deepseek-v3-2' Special handling: - Agent configuration errors (agent.name) do NOT trigger fallback - Only transient model errors (rate limits, timeouts) retry with fallback - Inline retry logic preserved for session lifecycle management

Add 36 unit tests covering all model-fallback functions: parseModelString() tests: - Valid provider/model strings - Nested paths (provider/sub/model) - Invalid single-segment strings - Empty string handling buildModelChain() tests: - Primary model only - Primary with fallback array - Invalid models filtered out - Empty/undefined fallback handling isModelError() tests: - Rate limit detection (rate_limit, 429) - Service errors (503, 502, unavailable, overloaded, capacity) - Network errors (timeout, ECONNREFUSED, ENOTFOUND) - Non-model errors return false - Non-Error objects return false withModelFallback() tests: - Success on first model - Fallback on model error - No retry on non-model errors - Exhausts all models on persistent failures - Respects maxAttempts config - Empty model chain handling - Records usedModel on success/failure formatRetryErrors() tests: - Empty array handling - Single error formatting - Multiple errors as numbered list

- P1: Default maxAttempts=1 when no category (was 0, sync prompts failed) - P2: Preserve variant on primary model when building modelChain - P2: Clean up subagentSessions on early returns (memory leak) - P2: Ensure maxAttempts >= 1 even with empty modelChain - P2: Update warning text to use generic model descriptions

P1 Fixes: - maxAttempts validation ensures minimum 1 attempt P2 Fixes: - Remove over-broad undefined error matching (only check for agent.name) - Add modelChain support to background tasks (LaunchInput, BackgroundTask) - Pass modelChain to background manager for fallback support All model-fallback tests pass (36/36).

cubic-dev-ai

4 issues found across 12 files

Confidence score: 3/5

Duplicate session.prompt invocation in src/features/background-agent/manager.ts can double-run the same task and leak concurrency before registration, which is a concrete regression risk.
parseModelString in src/features/model-fallback/index.ts currently accepts empty provider/model pieces, so malformed specs could enter the fallback chain and trigger unexpected fallbacks.
src/features/skill-mcp-manager/manager.test.ts doesn’t confirm headers reach the transport layer, leaving header regressions undetected even though the test would still pass.
Pay close attention to src/features/background-agent/manager.ts, src/features/skill-mcp-manager/manager.test.ts, src/features/model-fallback/index.ts, pr_790_body.txt - functionality and documentation mismatches need to be resolved.

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="pr_790_body.txt">

<violation number="1" location="pr_790_body.txt:3">
P2: PR body describes HTTP transport for SkillMcpManager, which does not match the PR title about automatic model fallback, likely misleading reviewers/release notes.</violation>
</file>

<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:94">
P1: Duplicate session.prompt calls in launch() cause double execution and potential concurrency leak before task registration</violation>
</file>

<file name="src/features/skill-mcp-manager/manager.test.ts">

<violation number="1" location="src/features/skill-mcp-manager/manager.test.ts:47">
P2: HTTP headers test does not verify headers reach the transport; regression dropping headers would still pass</violation>
</file>

<file name="src/features/model-fallback/index.ts">

<violation number="1" location="src/features/model-fallback/index.ts:10">
P2: parseModelString treats empty provider/model segments as valid, allowing malformed model specs into the fallback chain</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

src/features/background-agent/manager.ts

pr_790_body.txt

cubic-dev-ai · 2026-01-14T19:37:03Z

src/features/skill-mcp-manager/manager.test.ts

-        /missing required 'command' field/
-      )
+        // #when / #then
+        await expect(manager.getOrCreateClient(info, config)).rejects.toThrow(


P2: HTTP headers test does not verify headers reach the transport; regression dropping headers would still pass

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At src/features/skill-mcp-manager/manager.test.ts, line 47: <comment>HTTP headers test does not verify headers reach the transport; regression dropping headers would still pass</comment> <file context> @@ -3,46 +3,326 @@ import { SkillMcpManager } from "./manager" - /missing required 'command' field/ - ) + // #when / #then + await expect(manager.getOrCreateClient(info, config)).rejects.toThrow( + /no valid connection configuration/ + ) </file context>

src/features/model-fallback/index.ts

- Fix P1: Remove duplicate session.prompt calls in manager.ts that caused double execution before task registration. Synced with origin/dev and properly integrated modelChain support. - Fix P2: parseModelString now rejects empty provider/model segments, preventing malformed model specs from entering the fallback chain. - Remove pr_790_body.txt which belonged to a different PR (HTTP transport).

…stem

- Restore constants.ts to use standard provider names (google/, openai/) instead of custom nextologies provider - Add modelChain and retryConfig to resolveCategoryConfig return type

github-actions · 2026-01-14T20:25:12Z

All contributors have signed the CLA. Thank you! ✅
_{Posted by the CLA Assistant Lite bot.}

stranger29 · 2026-01-14T20:31:35Z

I have read the CLA Document and I hereby sign the CLA

cubic-dev-ai

6 issues found across 41 files (changes from recent commits).

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/tools/lsp/tools.ts">

<violation number="1">
P2: `lsp_symbols` can throw when `limit` ≤ 0 because it accesses `limited[0]` after slicing to an empty array.</violation>
</file>

<file name="src/tools/sisyphus-task/tools.ts">

<violation number="1">
P1: User-controlled resume session ID is joined into filesystem path without validation, allowing path traversal outside MESSAGE_STORAGE.</violation>
</file>

<file name="README.md">

<violation number="1">
P2: Beta release link still points to v3.0.0-beta.1 while install text references v3.0.0-beta.6, causing user-facing version mismatch.</violation>
</file>

<file name="src/tools/AGENTS.md">

<violation number="1">
P3: Documentation inconsistency: overview still says LSP has 11 tools while this row indicates 7, causing conflicting guidance about available LSP tools.</violation>
</file>

<file name="src/tools/sisyphus-task/constants.ts">

<violation number="1">
P2: Default category configs no longer define fallback chains, so modelChain has no alternatives and the new automatic fallback system doesn’t activate for category/background tasks.</violation>
</file>

<file name="src/tools/sisyphus-task/tools.test.ts">

<violation number="1">
P2: Test reimplements model source detection instead of exercising production modelInfo logic, so it can pass even if the real bug remains.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai · 2026-01-14T20:40:48Z

src/tools/sisyphus-task/tools.ts

 import { resolveMultipleSkills } from "../../features/opencode-skill-loader/skill-content"
 import { createBuiltinSkills } from "../../features/builtin-skills/skills"
 import { getTaskToastManager } from "../../features/task-toast-manager"
 import type { ModelFallbackInfo } from "../../features/task-toast-manager/types"


P1: User-controlled resume session ID is joined into filesystem path without validation, allowing path traversal outside MESSAGE_STORAGE.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At src/tools/sisyphus-task/tools.ts, line 251: <comment>User-controlled resume session ID is joined into filesystem path without validation, allowing path traversal outside MESSAGE_STORAGE.</comment> <file context> @@ -239,9 +248,18 @@ Use \`background_output\` with task_id="${task.id}" to check progress.` }) try { + const resumeMessageDir = getMessageDir(args.resume) + const resumeMessage = resumeMessageDir ? findNearestMessageWithFields(resumeMessageDir) : null + const resumeAgent = resumeMessage?.agent </file context>

stranger2904 · 2026-01-14T20:57:37Z

Replaced by #797 with clean commit history and proper conflict resolution.

stranger2904 and others added 8 commits January 14, 2026 11:35

cubic-dev-ai bot reviewed Jan 14, 2026

View reviewed changes

stranger2904 and others added 3 commits January 14, 2026 15:09

Merge remote-tracking branch 'origin/dev' into feat/model-fallback-sy…

b44ca94

…stem

fix: restore standard provider names and fix type annotations

aa1c1dc

- Restore constants.ts to use standard provider names (google/, openai/) instead of custom nextologies provider - Add modelChain and retryConfig to resolveCategoryConfig return type

github-actions bot added a commit that referenced this pull request Jan 14, 2026

@stranger29 has signed the CLA in #795

5a8d9f0

cubic-dev-ai bot reviewed Jan 14, 2026

View reviewed changes

stranger2904 mentioned this pull request Jan 14, 2026

feat: model fallback system with automatic retry and background agent support #797

Draft

stranger2904 closed this Jan 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: automatic model fallback system with background support #795

feat: automatic model fallback system with background support #795

Uh oh!

stranger2904 commented Jan 14, 2026 •

edited by cubic-dev-ai bot

Loading

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai bot Jan 14, 2026

Uh oh!

Uh oh!

github-actions bot commented Jan 14, 2026 •

edited

Loading

Uh oh!

stranger29 commented Jan 14, 2026

Uh oh!

cubic-dev-ai bot left a comment

Uh oh!

cubic-dev-ai bot Jan 14, 2026

Uh oh!

stranger2904 commented Jan 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: automatic model fallback system with background support #795

feat: automatic model fallback system with background support #795

Uh oh!

Conversation

stranger2904 commented Jan 14, 2026 • edited by cubic-dev-ai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by cubic

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai bot Jan 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Jan 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stranger29 commented Jan 14, 2026

Uh oh!

cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai bot Jan 14, 2026

Choose a reason for hiding this comment

Uh oh!

stranger2904 commented Jan 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stranger2904 commented Jan 14, 2026 •

edited by cubic-dev-ai bot

Loading

github-actions bot commented Jan 14, 2026 •

edited

Loading