[RFC] Resilience Strategy: Hierarchical Fallback Chains for SDD Pipelines

## Summary
Implement a **Hierarchical Fallback Strategy** for the SDD multi-agent workflow. This ensures that if a primary AI model fails (due to quota, rate limits, or service downtime), the system automatically retries the task using a pre-configured alternative model, preventing pipeline stagnation and context loss.

## The Problem
Currently, `agent-teams-lite` allows assigning a specific model to each SDD phase (e.g., `sdd-design` -> `claude-3-5-sonnet`). This creates a **Single Point of Failure (SPOF)**: the entire development pipeline stops if the primary provider (Anthropic, OpenAI, Google) returns any of the following:
- **Rate Limiting (429)**
- **Quota Exceeded (402)**
- **Service Outage (503)**
- **Context Length Overflow (400)**

In a complex SDD flow, losing context or stopping midway is a major productivity bottleneck.

## Proposed Solution: Agent Fallback Chains
Instead of a 1-to-1 mapping, we propose a **Hierarchical Chain of Agents**. Each SDD phase will have an ordered list of models (Plan A, Plan B, Plan C).

### Example Configuration (`config.yaml`):
```yaml
agents:
  sdd-design:
    pipeline:
      - model: "anthropic/claude-3-5-sonnet" # Plan A (Premium)
        temperature: 0.2
      - model: "openai/gpt-4o"              # Plan B (Robust Fallback)
        temperature: 0.3
      - model: "google/gemini-1.5-pro"      # Plan C (Deep Context Fallback)
    retry_policy:
      max_attempts: 3
      backoff: "exponential"
```
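To make the intended behavior of the chain concrete, here is a minimal Python sketch of how an orchestrator might walk the list in order. It is not existing `agent-teams-lite` code: `ProviderError`, `ModelStep`, `run_with_fallback`, and the `call_model` callback are hypothetical names, and the sketch assumes `max_attempts` applies per model and that `backoff: "exponential"` means doubling a base delay between attempts.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Status codes treated as infrastructure failures (assumption taken from this RFC).
RETRYABLE_STATUS = {402, 429, 500, 503}


class ProviderError(Exception):
    """Hypothetical error raised by a model call, carrying an HTTP status code."""

    def __init__(self, status: int, message: str = ""):
        super().__init__(f"{status}: {message}")
        self.status = status


@dataclass
class ModelStep:
    """One entry of the `pipeline` list in config.yaml."""
    model: str
    temperature: Optional[float] = None


def run_with_fallback(steps, call_model, prompt,
                      max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Try each step in order, retrying a step with exponential backoff.

    Returns (output, model_used). Non-retryable errors propagate immediately.
    """
    last_error = None
    for step in steps:
        for attempt in range(max_attempts):
            try:
                return call_model(step, prompt), step.model
            except ProviderError as err:
                if err.status not in RETRYABLE_STATUS:
                    raise  # e.g. a validation error: report it normally
                last_error = err
                if attempt < max_attempts - 1:
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("all models in the fallback chain failed") from last_error


# Demo with a fake provider: Plan A is rate limited, Plan B answers.
def fake_call(step, prompt):
    if step.model.startswith("anthropic/"):
        raise ProviderError(429, "rate limited")
    return f"design from {step.model}"


chain = [
    ModelStep("anthropic/claude-3-5-sonnet", 0.2),
    ModelStep("openai/gpt-4o", 0.3),
    ModelStep("google/gemini-1.5-pro"),
]
output, used = run_with_fallback(chain, fake_call, "Design the API",
                                 sleep=lambda seconds: None)
print(used)
```

A real implementation would probably skip retries for quota errors, since waiting rarely restores an exhausted quota, and move straight to the next model.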

## Technical Enhancements & Optimizations
To make this more than a simple "try/catch", I suggest the following:

1. **Smart Error Detection**: The fallback should only trigger on **infrastructure errors** (429, 500, 503, 402). If the error is a logic or syntax error, the system should report it normally.
2. **Context Re-alignment**: The orchestrator should ensure that if we switch from `Claude` to `GPT-4`, the system prompts are adjusted (if specific tweaks are defined per model) to maintain output quality.
3. **Dynamic Promotion (Circuit Breaker)**: If a primary model fails multiple times in a session, the system should temporarily "promote" the fallback to primary status for the remainder of the process to avoid unnecessary timeouts.
4. **Resilience Metadata**: Each generated artifact (spec, design, tasks) should include metadata indicating which model was actually used for its creation.
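Items 1, 3, and 4 above could be sketched roughly as follows. This is an illustration under assumptions, not a proposed final API: `should_fallback`, `SessionChain`, the `failure_threshold` default of 2, and the metadata field names are all hypothetical. Item 2 (per-model prompt adjustments) is left out because its shape depends on how prompts are stored.

```python
from dataclasses import dataclass, field

# Infrastructure status codes listed in item 1.
INFRA_STATUS = {402, 429, 500, 503}


def should_fallback(status: int) -> bool:
    """Item 1: only infrastructure errors trigger a fallback."""
    return status in INFRA_STATUS


@dataclass
class SessionChain:
    """Items 3 and 4: per-session failure counts, promotion, artifact metadata."""
    models: list
    failure_threshold: int = 2  # hypothetical default, not specified in the RFC
    failures: dict = field(default_factory=dict)

    def record_failure(self, model: str) -> None:
        self.failures[model] = self.failures.get(model, 0) + 1

    def ordered(self) -> list:
        """Healthy models keep their configured order; tripped models go last."""
        healthy = [m for m in self.models
                   if self.failures.get(m, 0) < self.failure_threshold]
        tripped = [m for m in self.models if m not in healthy]
        return healthy + tripped

    def metadata(self, model_used: str) -> dict:
        """Metadata to attach to a generated artifact (spec, design, tasks)."""
        return {
            "model_used": model_used,
            "primary_model": self.models[0],
            "fallback_used": model_used != self.models[0],
        }


session = SessionChain([
    "anthropic/claude-3-5-sonnet",
    "openai/gpt-4o",
    "google/gemini-1.5-pro",
])
session.record_failure("anthropic/claude-3-5-sonnet")
session.record_failure("anthropic/claude-3-5-sonnet")
print(session.ordered())
print(session.metadata("openai/gpt-4o"))
```

In this sketch, "promotion" is just a reordering for the rest of the session: after two failures the primary is tried last instead of first, so later phases do not pay for its timeouts again.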

## Advantages
- **Higher Availability (target: 99.9% uptime)**: A phase only stalls if every model in its chain fails, so the developer is far less likely to get stuck mid-task.
- **Cost/Quota Management**: Allows using expensive models as primary and cheaper ones as safety nets.
- **Multi-Provider Resilience**: If one provider goes down, the project stays alive using another.
- **Improved UX**: A self-healing system builds trust with enterprise teams.

## User Story
*"As a software architect, I want the system to automatically use Gemini 1.5 Flash if my Claude quota is exhausted during the task generation phase (`sdd-tasks`), notifying me of the change but allowing me to continue working without interruptions."*

