This repository was archived by the owner on Mar 26, 2026. It is now read-only.

[RFC] Resilience Strategy: Hierarchical Fallback Chains for SDD Pipelines #63

@RamonsDka

Summary

Implement a Hierarchical Fallback Strategy for the SDD multi-agent workflow. This ensures that if a primary AI model fails (due to quota, rate limits, or service downtime), the system automatically retries the task using a pre-configured alternative model, preventing pipeline stagnation and context loss.

The Problem

Currently, agent-teams-lite allows assigning a specific model to each SDD phase (e.g., sdd-design -> claude-3-5-sonnet). This creates a Single Point of Failure (SPOF). If the primary provider (Anthropic, OpenAI, Google) experiences any of the following:

  • Rate Limiting (429)
  • Quota Exceeded (402)
  • Service Outage (503)
  • Context Length Overflow (400)

the entire development pipeline stops. In a complex SDD flow, losing context or stopping midway is a major productivity bottleneck.

Proposed Solution: Agent Fallback Chains

Instead of a 1-to-1 mapping, we propose a Hierarchical Chain of Agents. Each SDD phase will have an ordered list of models (Plan A, Plan B, Plan C).

Example Configuration (config.yaml):

agents:
  sdd-design:
    pipeline:
      - model: "anthropic/claude-3-5-sonnet" # Plan A (Premium)
        temperature: 0.2
      - model: "openai/gpt-4o"              # Plan B (Robust Fallback)
        temperature: 0.3
      - model: "google/gemini-1.5-pro"      # Plan C (Deep Context Fallback)
    retry_policy:
      max_attempts: 3
      backoff: "exponential"
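
As a rough illustration of how an orchestrator might walk such a chain, here is a minimal Python sketch. It is not part of agent-teams-lite: `ProviderError`, `ModelStep`, `run_with_fallback`, and the `call_model` callback are hypothetical names, and it assumes `max_attempts` applies per model rather than to the whole chain.

```python
import time
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumption: status codes treated as transient infrastructure failures.
RETRYABLE_STATUS = {402, 429, 500, 503}


class ProviderError(Exception):
    """Hypothetical error raised by a model client, carrying an HTTP status."""

    def __init__(self, status: int, message: str = ""):
        super().__init__(f"{status} {message}".strip())
        self.status = status


@dataclass
class ModelStep:
    """One entry of the `pipeline` list in the config above."""
    model: str
    temperature: float = 0.2


def run_with_fallback(
    pipeline: Sequence[ModelStep],
    call_model: Callable[[ModelStep, str], str],
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try each step in order; return (output, model_used)."""
    last_error = None
    for step in pipeline:
        for attempt in range(max_attempts):
            try:
                return call_model(step, prompt), step.model
            except ProviderError as err:
                if err.status not in RETRYABLE_STATUS:
                    raise  # non-infrastructure errors are reported normally
                last_error = err
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)  # exponential backoff
        # Attempts for this model are exhausted: fall through to the next plan.
    raise RuntimeError("all models in the fallback chain failed") from last_error
```

Returning the model name alongside the output lets the caller notify the user of the switch. The `sleep` parameter is injectable only so the backoff can be tested without real delays.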

Technical Enhancements & Optimizations

To make this more than a simple "try/catch", I suggest the following:

  1. Smart Error Detection: The fallback should only trigger on infrastructure errors (429, 500, 503, 402). If the error is a logic or syntax error, the system should report it normally.
  2. Context Re-alignment: When the orchestrator switches models (e.g., from Claude to GPT-4o), it should apply any model-specific system prompt tweaks that are defined, to maintain output quality.
  3. Dynamic Promotion (Circuit Breaker): If a primary model fails multiple times in a session, the system should temporarily "promote" the fallback to primary status for the remainder of the process to avoid unnecessary timeouts.
  4. Resilience Metadata: Each generated artifact (spec, design, tasks) should include metadata indicating which model was actually used for its creation.
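
Items 3 and 4 could be sketched roughly as follows. This is only an illustration under assumptions: `SessionRouter`, `failure_threshold`, and the metadata field names are hypothetical and do not exist in the project. A model that has failed `failure_threshold` times in the session is moved to the end of the chain, so the next healthy fallback is effectively promoted.

```python
from dataclasses import dataclass, field


@dataclass
class SessionRouter:
    """Reorders a phase's model chain after repeated failures in one session."""

    chain: list[str]
    failure_threshold: int = 2  # assumption: failures tolerated before demotion
    failures: dict[str, int] = field(default_factory=dict)

    def record_failure(self, model: str) -> None:
        """Count one infrastructure failure for the given model."""
        self.failures[model] = self.failures.get(model, 0) + 1

    def active_chain(self) -> list[str]:
        """Healthy models first, tripped models last, original order kept."""
        healthy = [
            m for m in self.chain
            if self.failures.get(m, 0) < self.failure_threshold
        ]
        tripped = [m for m in self.chain if m not in healthy]
        return healthy + tripped

    def artifact_metadata(self, phase: str, model_used: str) -> dict[str, object]:
        """Metadata to attach to a generated artifact (spec, design, tasks)."""
        return {
            "phase": phase,
            "model_used": model_used,
            "primary_model": self.chain[0],
            "fallback_used": model_used != self.chain[0],
        }
```

Tripped models stay at the end of the chain instead of being dropped, so they remain available as a last resort if every fallback also fails.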

Advantages

  • Uninterrupted Availability (target: 99.9% Uptime): The developer does not get stuck mid-task when a single provider fails.
  • Cost/Quota Management: Allows using expensive models as primary and cheaper ones as safety nets.
  • Multi-Provider Resilience: If one provider goes down, the project stays alive using another.
  • Improved UX: A self-healing system creates more trust for enterprise teams.

User Story

"As a software architect, I want the system to automatically use Gemini 1.5 Flash if my Claude quota is exhausted during the task generation phase (sdd-tasks), notifying me of the change but allowing me to continue working without interruptions."

Labels: enhancement (New feature or request), status:approved (Approved for implementation; PRs can be opened)
