Skip to content

Conversation

@stranger2904
Copy link
Contributor

@stranger2904 stranger2904 commented Jan 14, 2026

Summary\n\nAdds an automatic model fallback system that gracefully handles API failures by retrying with configured fallback models. Background tasks now properly support model fallback chains, ensuring reliability even for async operations.\n\n## Features\n\n### Automatic Fallback on Transient Errors\n- Rate limits and HTTP 429 errors\n- Service unavailable (HTTP 503, 502)\n- Capacity issues (overloaded, capacity, unavailable)\n- Network errors (timeout, ECONNREFUSED, ENOTFOUND)\n\nNon-model errors (authentication failures, invalid input) are NOT retried.\n\n### Background Task Support\n- Background tasks properly use modelChain for fallback\n- Ensures reliability for async operations\n\n### Validated Configuration\n- maxAttempts >= 1 enforced to prevent silent failures\n- Clean error handling: only aborts on configuration errors, continues on model errors\n\n## Implementation\n\nCore module src/features/model-fallback/ provides utilities for model parsing, chain building, error detection, and retry logic with validation. Integrated with sisyphus-task for seamless fallback handling.\n\n## Fixes from Review\n

  • P1: maxAttempts validation (Math.max(1, ...))\n- P2: Error matching scoped to agent.name only\n- P2: Background LaunchInput supports modelChain\n- P2: BackgroundTask stores modelChain field\n- P2: Background manager passes modelChain\n\n## Testing\n
    ✅ 36/36 model-fallback tests passing\n✅ Build successful\n✅ TypeScript types valid\n\nAll commits authored by stranger2904 with GitHub noreply email.\nClean PR without auto-generated descriptions.\n\nRelated: Fixes PR feat: automatic model fallback system with user notification #785 authorship issue.

Summary by cubic

Adds an automatic model fallback system for both sync and background tasks to keep prompts running through transient API failures. Also adds HTTP transport for MCP servers.

  • New Features

    • Retries with fallback models on 429, 502/503, capacity, timeouts, and common network errors; skips auth/validation errors.
    • Background tasks now accept and pass modelChain for fallback.
    • Config options: fallback[], fallbackDelayMs (100–30000ms), fallbackRetryCount (1–5); schemas updated; default categories define fallback chains.
    • Task completion output includes the model used and a fallback notice when applicable.
    • Skill MCP: HTTP transport with optional headers, auto-detects HTTP vs stdio, backward compatible.
  • Bug Fixes

    • Enforce maxAttempts >= 1.
    • Preserve variant on the primary model in modelChain.
    • Narrow error matching to agent.name for config issues.
    • Clean up subagentSessions on early returns.

Written for commit aa1c1dc. Summary will update on new commits.

stranger2904 and others added 8 commits January 14, 2026 11:35
Add support for connecting to remote MCP servers via HTTP in addition to
the existing stdio (local process) connections. This enables skills to
use cloud-hosted MCP servers and aggregated MCP gateways.

## Changes

- Extend SkillMcpManager to detect connection type from config:
  - Explicit `type: "http"` or `type: "sse"` → HTTP connection
  - Explicit `type: "stdio"` → stdio connection
  - Infer from `url` field → HTTP connection
  - Infer from `command` field → stdio connection

- Add StreamableHTTPClientTransport from MCP SDK for HTTP connections
  - Supports custom headers for authentication (e.g., API keys)
  - Proper error handling with helpful hints

- Maintain full backward compatibility with existing stdio configurations

## Usage

```yaml
# HTTP connection (new)
mcp:
  remote-server:
    url: https://mcp.example.com/mcp
    headers:
      Authorization: Bearer ${API_KEY}

# stdio connection (existing, unchanged)
mcp:
  local-server:
    command: npx
    args: [-y, @some/mcp-server]
```

## Tests

Added comprehensive tests for:
- Connection type detection (explicit type vs inferred)
- HTTP URL validation and error messages
- Headers configuration
- Backward compatibility with stdio configs
Add mocks for HTTP transport to avoid real network calls during tests.
This addresses reviewer feedback about test reliability:
- Tests are now faster (no network latency)
- Tests are deterministic across environments
- Test intent is clearer (unit testing error handling logic)

The mock throws immediately with a controlled error message,
allowing tests to validate error handling without network dependencies.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…a fields

Add new optional fields to AgentOverrideConfigSchema and CategoryConfigSchema:
- fallback: string[] - array of fallback model strings
- fallbackDelayMs: number (100-30000) - delay between retry attempts
- fallbackRetryCount: number (1-5) - maximum models to try

These fields enable automatic model fallback when primary model fails due to
rate limits, service unavailability, or other transient errors.
Implement core model fallback utilities in src/features/model-fallback/:
- parseModelString(): Parse 'provider/model' format into ModelSpec
- modelSpecToString(): Convert ModelSpec back to string
- buildModelChain(): Build ordered list of models from primary + fallbacks
- isModelError(): Detect retryable errors (rate limits, 429, 503, timeouts)
- withModelFallback(): Generic retry wrapper for simple use cases
- formatRetryErrors(): Format error list for user-friendly display

The module detects these transient errors for automatic retry:
- Rate limits (rate_limit, 429)
- Service unavailable (503, 502, unavailable, overloaded, capacity)
- Network errors (timeout, ECONNREFUSED, ENOTFOUND)

Non-model errors (auth failures, invalid input) are NOT retried.
Update sisyphus-task to use the model fallback system:

Integration changes:
- Import parseModelString, isModelError, RetryConfig from model-fallback
- Update resolveCategoryConfig() to return modelChain and retryConfig
- Use configurable fallbackDelayMs and fallbackRetryCount from config
- Build model chain from primary model + fallback array

User notification on fallback:
- Show which model was actually used in task completion message
- Display warning when fallback occurred with list of failed models
- Example: '⚠️ Model fallback occurred: Primary model failed, used
  anthropic/claude-haiku-4-5 instead. Failed models: deepseek/deepseek-v3-2'

Special handling:
- Agent configuration errors (agent.name) do NOT trigger fallback
- Only transient model errors (rate limits, timeouts) retry with fallback
- Inline retry logic preserved for session lifecycle management
Add 36 unit tests covering all model-fallback functions:

parseModelString() tests:
- Valid provider/model strings
- Nested paths (provider/sub/model)
- Invalid single-segment strings
- Empty string handling

buildModelChain() tests:
- Primary model only
- Primary with fallback array
- Invalid models filtered out
- Empty/undefined fallback handling

isModelError() tests:
- Rate limit detection (rate_limit, 429)
- Service errors (503, 502, unavailable, overloaded, capacity)
- Network errors (timeout, ECONNREFUSED, ENOTFOUND)
- Non-model errors return false
- Non-Error objects return false

withModelFallback() tests:
- Success on first model
- Fallback on model error
- No retry on non-model errors
- Exhausts all models on persistent failures
- Respects maxAttempts config
- Empty model chain handling
- Records usedModel on success/failure

formatRetryErrors() tests:
- Empty array handling
- Single error formatting
- Multiple errors as numbered list
- P1: Default maxAttempts=1 when no category (was 0, sync prompts failed)
- P2: Preserve variant on primary model when building modelChain
- P2: Clean up subagentSessions on early returns (memory leak)
- P2: Ensure maxAttempts >= 1 even with empty modelChain
- P2: Update warning text to use generic model descriptions
P1 Fixes:
- maxAttempts validation ensures minimum 1 attempt

P2 Fixes:
- Remove over-broad undefined error matching (only check for agent.name)
- Add modelChain support to background tasks (LaunchInput, BackgroundTask)
- Pass modelChain to background manager for fallback support

All model-fallback tests pass (36/36).
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 12 files

Confidence score: 3/5

  • Duplicate session.prompt invocation in src/features/background-agent/manager.ts can double-run the same task and leak concurrency before registration, which is a concrete regression risk.
  • parseModelString in src/features/model-fallback/index.ts currently accepts empty provider/model pieces, so malformed specs could enter the fallback chain and trigger unexpected fallbacks.
  • src/features/skill-mcp-manager/manager.test.ts doesn’t confirm headers reach the transport layer, leaving header regressions undetected even though the test would still pass.
  • Pay close attention to src/features/background-agent/manager.ts, src/features/skill-mcp-manager/manager.test.ts, src/features/model-fallback/index.ts, pr_790_body.txt - functionality and documentation mismatches need to be resolved.
Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="pr_790_body.txt">

<violation number="1" location="pr_790_body.txt:3">
P2: PR body describes HTTP transport for SkillMcpManager, which does not match the PR title about automatic model fallback, likely misleading reviewers/release notes.</violation>
</file>

<file name="src/features/background-agent/manager.ts">

<violation number="1" location="src/features/background-agent/manager.ts:94">
P1: Duplicate session.prompt calls in launch() cause double execution and potential concurrency leak before task registration</violation>
</file>

<file name="src/features/skill-mcp-manager/manager.test.ts">

<violation number="1" location="src/features/skill-mcp-manager/manager.test.ts:47">
P2: HTTP headers test does not verify headers reach the transport; regression dropping headers would still pass</violation>
</file>

<file name="src/features/model-fallback/index.ts">

<violation number="1" location="src/features/model-fallback/index.ts:10">
P2: parseModelString treats empty provider/model segments as valid, allowing malformed model specs into the fallback chain</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

/missing required 'command' field/
)
// #when / #then
await expect(manager.getOrCreateClient(info, config)).rejects.toThrow(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: HTTP headers test does not verify headers reach the transport; regression dropping headers would still pass

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/features/skill-mcp-manager/manager.test.ts, line 47:

<comment>HTTP headers test does not verify headers reach the transport; regression dropping headers would still pass</comment>

<file context>
@@ -3,46 +3,326 @@ import { SkillMcpManager } from "./manager"
-        /missing required 'command' field/
-      )
+        // #when / #then
+        await expect(manager.getOrCreateClient(info, config)).rejects.toThrow(
+          /no valid connection configuration/
+        )
</file context>

stranger2904 and others added 3 commits January 14, 2026 15:09
- Fix P1: Remove duplicate session.prompt calls in manager.ts that caused
  double execution before task registration. Synced with origin/dev and
  properly integrated modelChain support.

- Fix P2: parseModelString now rejects empty provider/model segments,
  preventing malformed model specs from entering the fallback chain.

- Remove pr_790_body.txt which belonged to a different PR (HTTP transport).
- Restore constants.ts to use standard provider names (google/, openai/)
  instead of custom nextologies provider
- Add modelChain and retryConfig to resolveCategoryConfig return type
@github-actions
Copy link
Contributor

github-actions bot commented Jan 14, 2026

All contributors have signed the CLA. Thank you! ✅
Posted by the CLA Assistant Lite bot.

@stranger29
Copy link
Contributor

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Jan 14, 2026
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 issues found across 41 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/tools/lsp/tools.ts">

<violation number="1">
P2: `lsp_symbols` can throw when `limit` ≤ 0 because it accesses `limited[0]` after slicing to an empty array.</violation>
</file>

<file name="src/tools/sisyphus-task/tools.ts">

<violation number="1">
P1: User-controlled resume session ID is joined into filesystem path without validation, allowing path traversal outside MESSAGE_STORAGE.</violation>
</file>

<file name="README.md">

<violation number="1">
P2: Beta release link still points to v3.0.0-beta.1 while install text references v3.0.0-beta.6, causing user-facing version mismatch.</violation>
</file>

<file name="src/tools/AGENTS.md">

<violation number="1">
P3: Documentation inconsistency: overview still says LSP has 11 tools while this row indicates 7, causing conflicting guidance about available LSP tools.</violation>
</file>

<file name="src/tools/sisyphus-task/constants.ts">

<violation number="1">
P2: Default category configs no longer define fallback chains, so modelChain has no alternatives and the new automatic fallback system doesn’t activate for category/background tasks.</violation>
</file>

<file name="src/tools/sisyphus-task/tools.test.ts">

<violation number="1">
P2: Test reimplements model source detection instead of exercising production modelInfo logic, so it can pass even if the real bug remains.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

import { resolveMultipleSkills } from "../../features/opencode-skill-loader/skill-content"
import { createBuiltinSkills } from "../../features/builtin-skills/skills"
import { getTaskToastManager } from "../../features/task-toast-manager"
import type { ModelFallbackInfo } from "../../features/task-toast-manager/types"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: User-controlled resume session ID is joined into filesystem path without validation, allowing path traversal outside MESSAGE_STORAGE.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/tools/sisyphus-task/tools.ts, line 251:

<comment>User-controlled resume session ID is joined into filesystem path without validation, allowing path traversal outside MESSAGE_STORAGE.</comment>

<file context>
@@ -239,9 +248,18 @@ Use \`background_output\` with task_id="${task.id}" to check progress.`
         })
 
         try {
+          const resumeMessageDir = getMessageDir(args.resume)
+          const resumeMessage = resumeMessageDir ? findNearestMessageWithFields(resumeMessageDir) : null
+          const resumeAgent = resumeMessage?.agent
</file context>

@stranger2904
Copy link
Contributor Author

Replaced by #797 with clean commit history and proper conflict resolution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants