A Claude Code plugin that iteratively refines product specifications through multi-model debate until consensus is reached.
Key insight: A single LLM reviewing a spec will miss things. Multiple LLMs debating a spec will catch gaps, challenge assumptions, and surface edge cases that any one model would overlook. The result is a document that has survived rigorous adversarial review.
Claude is an active participant, not just an orchestrator. Claude provides independent critiques, challenges opponent models, and contributes substantive improvements alongside external models.
# 1. Add the marketplace and install the plugin
claude plugin marketplace add zscole/adversarial-spec
claude plugin install adversarial-spec
# 2. Set at least one API key
export OPENAI_API_KEY="sk-..."
# Or use OpenRouter for access to multiple providers with one key
export OPENROUTER_API_KEY="sk-or-..."
# 3. Run it
/adversarial-spec "Build a rate limiter service with Redis backend"

You describe product
        |
        v
Claude drafts spec
        |
        v
Multiple LLMs critique in parallel <----------------------+
        |                                                  |
        v                                                  |
Claude synthesizes + adds own critique                     |
        |                                                  |
        v                                                  |
Revise and repeat until ALL agree ------- not converged --+
        |
        | consensus
        v
User review period
        |
        v
Final document output
- Describe your product concept or provide an existing document
- (Optional) Start with an in-depth interview to capture requirements
- Claude drafts the initial document (PRD or tech spec)
- Document is sent to opponent models (GPT, Gemini, Grok, etc.) for parallel critique
- Claude provides independent critique alongside opponent feedback
- Claude synthesizes all feedback and revises
- Loop continues until ALL models AND Claude agree
- User review period: request changes or run additional cycles
- Final converged document is output
- Python 3.10+
- litellm package: pip install litellm
- API key for at least one LLM provider
| Provider | Env Var | Example Models |
|---|---|---|
| OpenAI | OPENAI_API_KEY | gpt-4o, gpt-4-turbo, o1 |
| Google | GEMINI_API_KEY | gemini/gemini-2.0-flash, gemini/gemini-pro |
| xAI | XAI_API_KEY | xai/grok-3, xai/grok-beta |
| Mistral | MISTRAL_API_KEY | mistral/mistral-large, mistral/codestral |
| Groq | GROQ_API_KEY | groq/llama-3.3-70b-versatile |
| OpenRouter | OPENROUTER_API_KEY | openrouter/openai/gpt-4o, openrouter/anthropic/claude-3.5-sonnet |
| Deepseek | DEEPSEEK_API_KEY | deepseek/deepseek-chat |
| Zhipu | ZHIPUAI_API_KEY | zhipu/glm-4, zhipu/glm-4-plus |
Check which keys are configured:
python3 ~/.claude/skills/adversarial-spec/scripts/debate.py providers

For enterprise users who need to route all model calls through AWS Bedrock (e.g., for security compliance or inference gateway requirements):
# Enable Bedrock mode
python3 ~/.claude/skills/adversarial-spec/scripts/debate.py bedrock enable --region us-east-1
# Add models enabled in your Bedrock account
python3 ~/.claude/skills/adversarial-spec/scripts/debate.py bedrock add-model claude-3-sonnet
python3 ~/.claude/skills/adversarial-spec/scripts/debate.py bedrock add-model claude-3-haiku
# Check configuration
python3 ~/.claude/skills/adversarial-spec/scripts/debate.py bedrock status
# Disable Bedrock mode
python3 ~/.claude/skills/adversarial-spec/scripts/debate.py bedrock disable

When Bedrock is enabled, all model calls route through Bedrock - no direct API calls are made. Use friendly names like claude-3-sonnet which are automatically mapped to Bedrock model IDs.
Configuration is stored at ~/.claude/adversarial-spec/config.json.
OpenRouter provides unified access to multiple LLM providers through a single API. This is useful for:
- Accessing models from multiple providers with one API key
- Comparing models across different providers
- Automatic fallback and load balancing
- Cost optimization across providers
Setup:
# Get your API key from https://openrouter.ai/keys
export OPENROUTER_API_KEY="sk-or-..."
# Use OpenRouter models (prefix with openrouter/)
python3 debate.py critique --models openrouter/openai/gpt-4o,openrouter/anthropic/claude-3.5-sonnet < spec.md

Popular OpenRouter models:
- openrouter/openai/gpt-4o - GPT-4o via OpenRouter
- openrouter/anthropic/claude-3.5-sonnet - Claude 3.5 Sonnet
- openrouter/google/gemini-2.0-flash - Gemini 2.0 Flash
- openrouter/meta-llama/llama-3.3-70b-instruct - Llama 3.3 70B
- openrouter/qwen/qwen-2.5-72b-instruct - Qwen 2.5 72B
See the full model list at openrouter.ai/models.
For models that expose an OpenAI-compatible API (local LLMs, self-hosted models, alternative providers), set OPENAI_API_BASE:
# Point to a custom endpoint
export OPENAI_API_KEY="your-key"
export OPENAI_API_BASE="https://your-endpoint.com/v1"
# Use with any model name
python3 debate.py critique --models gpt-4o < spec.md

This works with:
- Local LLM servers (Ollama, vLLM, text-generation-webui)
- OpenAI-compatible providers
- Self-hosted inference endpoints
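For example, a local Ollama server can act as an opponent model. A minimal sketch, assuming Ollama is running on its default port and that the model name matches one you have pulled locally (both are illustrative):

```bash
# Ollama exposes an OpenAI-compatible endpoint on port 11434.
export OPENAI_API_KEY="ollama"                      # local servers typically ignore the key, but it must be set
export OPENAI_API_BASE="http://localhost:11434/v1"

# Use whatever model name your local server actually serves.
python3 debate.py critique --models llama3.3 --doc-type tech < spec.md
```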
Start from scratch:
/adversarial-spec "Build a rate limiter service with Redis backend"
Refine an existing document:
/adversarial-spec ./docs/my-spec.md
You will be prompted for:
- Document type: PRD (business/product focus) or tech spec (engineering focus)
- Interview mode: Optional in-depth requirements gathering session
- Opponent models: Comma-separated list (e.g., gpt-4o,gemini/gemini-2.0-flash,xai/grok-3)
More models = more perspectives = stricter convergence.
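The same choices map onto flags when driving the bundled script directly; a minimal sketch, with an illustrative model list and spec path:

```bash
# Run a critique round with three opponent models against a PRD draft.
python3 ~/.claude/skills/adversarial-spec/scripts/debate.py critique \
  --models gpt-4o,gemini/gemini-2.0-flash,xai/grok-3 \
  --doc-type prd < spec.md
```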
For stakeholders, PMs, and designers.
Sections: Executive Summary, Problem Statement, Target Users/Personas, User Stories, Functional Requirements, Non-Functional Requirements, Success Metrics, Scope (In/Out), Dependencies, Risks
Critique focuses on: Clear problem definition, well-defined personas, measurable success criteria, explicit scope boundaries, no technical implementation details
For developers and architects.
Sections: Overview, Goals/Non-Goals, System Architecture, Component Design, API Design (full schemas), Data Models, Infrastructure, Security, Error Handling, Performance/SLAs, Observability, Testing Strategy, Deployment Strategy
Critique focuses on: Complete API contracts, data model coverage, security threat mitigation, error handling, specific performance targets, no ambiguity for engineers
Before the debate begins, opt into an in-depth interview session to capture requirements upfront.
Covers: Problem context, users/stakeholders, functional requirements, technical constraints, UI/UX, tradeoffs, risks, success criteria
The interview uses probing follow-up questions and challenges assumptions. After completion, Claude synthesizes answers into a complete spec before starting the adversarial debate.
Each round, Claude:
- Reviews opponent critiques for validity
- Provides independent critique (what did opponents miss?)
- States agreement/disagreement with specific points
- Synthesizes all feedback into revisions
Display format:
--- Round N ---
Opponent Models:
- [GPT-4o]: critiqued: missing rate limit config
- [Gemini]: agreed
Claude's Critique:
Security section lacks input validation strategy. Adding OWASP top 10 coverage.
Synthesis:
- Accepted from GPT-4o: rate limit configuration
- Added by Claude: input validation, OWASP coverage
- Rejected: none
If a model agrees within the first 2 rounds, Claude is skeptical. The model is pressed to:
- Confirm it read the entire document
- List specific sections reviewed
- Explain why it agrees
- Identify any remaining concerns
This prevents false convergence from models that rubber-stamp without thorough review.
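The CLI exposes this pressure as the --press flag (documented as the anti-laziness check); a minimal sketch with an illustrative model list:

```bash
# Press models that agree suspiciously early to prove they read the spec.
python3 debate.py critique --models gpt-4o,gemini/gemini-2.0-flash --press < spec.md
```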
After all models agree, you enter a review period with three options:
- Accept as-is: Document is complete
- Request changes: Claude updates the spec, you iterate without a full debate cycle
- Run another cycle: Send the updated spec through another adversarial debate
Run multiple cycles with different strategies:
- First cycle with fast models (gpt-4o), second with stronger models (o1)
- First cycle for structure/completeness, second for security focus
- Fresh perspective after user-requested changes
When a PRD reaches consensus, you're offered the option to continue directly into a Technical Specification based on the PRD. This creates a complete documentation pair in a single session.
Direct models to prioritize specific concerns:
--focus security # Auth, input validation, encryption, vulnerabilities
--focus scalability # Horizontal scaling, sharding, caching, capacity
--focus performance # Latency targets, throughput, query optimization
--focus ux # User journeys, error states, accessibility
--focus reliability # Failure modes, circuit breakers, disaster recovery
--focus cost # Infrastructure costs, resource efficiency
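A focus area is just another flag on the critique command; for example, a security-weighted round might look like this (model list and spec path are illustrative):

```bash
# Tell every opponent model to prioritize security concerns in its critique.
python3 debate.py critique --models gpt-4o,xai/grok-3 --focus security --doc-type tech < spec.md
```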
Have models critique from specific professional perspectives:

--persona security-engineer # Thinks like an attacker
--persona oncall-engineer # Cares about debugging at 3am
--persona junior-developer # Flags ambiguity and tribal knowledge
--persona qa-engineer # Missing test scenarios
--persona site-reliability # Deployment, monitoring, incidents
--persona product-manager # User value, success metrics
--persona data-engineer # Data models, ETL implications
--persona mobile-developer # API design for mobile
--persona accessibility-specialist # WCAG, screen readers
--persona legal-compliance # GDPR, CCPA, regulatory

Custom personas also work: --persona "fintech compliance officer"
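Personas combine with the other flags; a sketch using one built-in persona and one free-form string (model list is illustrative):

```bash
# Built-in persona: critique with an on-call engineer's priorities.
python3 debate.py critique --models gpt-4o --persona oncall-engineer < spec.md

# Custom persona: any quoted string works.
python3 debate.py critique --models gpt-4o --persona "fintech compliance officer" < spec.md
```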
Include existing documents for models to consider:
--context ./existing-api.md --context ./schema.sql

Use cases:
- Existing API documentation the new spec must integrate with
- Database schemas the spec must work with
- Design documents or prior specs for consistency
- Compliance requirements documents
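Putting it together, a round that checks a new spec against an existing API doc and schema might look like this (file names are illustrative; --context can be repeated):

```bash
# Give every opponent model the existing API and schema as background material.
python3 debate.py critique \
  --models gpt-4o,gemini/gemini-2.0-flash \
  --context ./existing-api.md \
  --context ./schema.sql \
  --doc-type tech < spec.md
```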
Long debates can crash or need to pause. Sessions save state automatically:
# Start a named session
echo "spec" | python3 debate.py critique --models gpt-4o --session my-feature-spec
# Resume where you left off
python3 debate.py critique --resume my-feature-spec
# List all sessions
python3 debate.py sessions

Sessions save:
- Current spec state
- Round number
- All configuration (models, focus, persona, etc.)
- History of previous rounds
Sessions are stored in ~/.config/adversarial-spec/sessions/.
When using sessions, each round's spec is saved to .adversarial-spec-checkpoints/:
.adversarial-spec-checkpoints/
├── my-feature-spec-round-1.md
├── my-feature-spec-round-2.md
└── my-feature-spec-round-3.md
Use these to rollback if a revision makes things worse.
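Rolling back is just restoring a checkpoint file; a sketch, reusing the session name from the example above (the new session name is illustrative):

```bash
# Discard the round-3 revision and continue from the round-2 checkpoint.
cp .adversarial-spec-checkpoints/my-feature-spec-round-2.md spec.md

# Feed the restored spec back into a fresh debate session.
python3 debate.py critique --models gpt-4o --session my-feature-spec-take-2 < spec.md
```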
Convergence can sand off novel ideas when models interpret "unusual" as "wrong". The --preserve-intent flag makes removal expensive:
--preserve-intent

When enabled, models must:
- Quote exactly what they want to remove or substantially change
- Justify the harm - not just "unnecessary" but what concrete problem it causes
- Distinguish error from preference - only remove things that are factually wrong, contradictory, or risky
- Ask before removing unusual but functional choices: "Was this intentional?"
This shifts the default from "sand off anything unusual" to "add protective detail while preserving distinctive choices."
Use when:
- Your spec contains intentional unconventional choices
- You want models to challenge your ideas, not homogenize them
- Previous rounds removed things you wanted to keep
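It is an ordinary flag on the critique command; a minimal sketch with an illustrative model list:

```bash
# Collect critiques while requiring justification before anything is removed.
python3 debate.py critique --models gpt-4o,xai/grok-3 --preserve-intent --doc-type prd < spec.md
```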
Every critique round displays token usage and estimated cost:
=== Cost Summary ===
Total tokens: 12,543 in / 3,221 out
Total cost: $0.0847
By model:
gpt-4o: $0.0523 (8,234 in / 2,100 out)
gemini/gemini-2.0-flash: $0.0324 (4,309 in / 1,121 out)
Save frequently used configurations:
# Create a profile
python3 debate.py save-profile strict-security \
--models gpt-4o,gemini/gemini-2.0-flash \
--focus security \
--doc-type tech
# Use a profile
python3 debate.py critique --profile strict-security < spec.md
# List profiles
python3 debate.py profiles

Profiles are stored in ~/.config/adversarial-spec/profiles/.
See exactly what changed between spec versions:
python3 debate.py diff --previous round1.md --current round2.md

Extract actionable tasks from a finalized spec:
cat spec-output.md | python3 debate.py export-tasks --models gpt-4o --doc-type prd

Output includes title, type, priority, description, and acceptance criteria.
Use --json for structured output suitable for importing into issue trackers.
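For example, to write machine-readable tasks to a file for an importer script (output file name is illustrative):

```bash
# Export structured tasks as JSON for an issue-tracker import.
cat spec-output.md | python3 debate.py export-tasks --models gpt-4o --doc-type prd --json > tasks.json
```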
Get notified on your phone and inject feedback during the debate.
Setup:
- Message @BotFather on Telegram, send /newbot, follow the prompts
- Copy the bot token
- Run: python3 ~/.claude/skills/adversarial-spec/scripts/telegram_bot.py setup
- Message your bot, then run setup again to get your chat ID
- Set environment variables:

export TELEGRAM_BOT_TOKEN="..."
export TELEGRAM_CHAT_ID="..."

Features:
- Async notifications when rounds complete (includes cost)
- 60-second window to reply with feedback (incorporated into next round)
- Final document sent to Telegram when debate concludes
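With the environment variables set, notifications can be enabled on any critique run via the --telegram flag; a sketch with an illustrative model list and session name:

```bash
# Long-running debate with Telegram notifications and a saved session.
python3 debate.py critique --models gpt-4o,gemini/gemini-2.0-flash \
  --session my-feature-spec --telegram < spec.md
```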
Final document is:
- Complete, following full structure for document type
- Vetted by all models until unanimous agreement
- Ready for stakeholders without further editing
Output locations:
- Printed to terminal
- Written to spec-output.md (PRD) or tech-spec-output.md (tech spec)
- Sent to Telegram (if enabled)
Debate summary includes rounds completed, cycles run, models involved, Claude's contributions, cost, and key refinements made.
# Core commands
debate.py critique --models MODEL_LIST --doc-type TYPE [OPTIONS] < spec.md
debate.py critique --resume SESSION_ID
debate.py diff --previous OLD.md --current NEW.md
debate.py export-tasks --models MODEL --doc-type TYPE [--json] < spec.md
# Info commands
debate.py providers # List providers and API key status
debate.py focus-areas # List focus areas
debate.py personas # List personas
debate.py profiles # List saved profiles
debate.py sessions # List saved sessions
# Profile management
debate.py save-profile NAME --models ... [--focus ...] [--persona ...]
# Bedrock management
debate.py bedrock status # Show Bedrock configuration
debate.py bedrock enable --region REGION # Enable Bedrock mode
debate.py bedrock disable # Disable Bedrock mode
debate.py bedrock add-model MODEL # Add model to available list
debate.py bedrock remove-model MODEL # Remove model from list
debate.py bedrock list-models            # List built-in model mappings

Options:
- --models, -m - Comma-separated model list
- --doc-type, -d - prd or tech
- --focus, -f - Focus area (security, scalability, performance, ux, reliability, cost)
- --persona - Professional persona
- --context, -c - Context file (repeatable)
- --profile - Load saved profile
- --preserve-intent - Require justification for removals
- --session, -s - Session ID for persistence and checkpointing
- --resume - Resume a previous session
- --press, -p - Anti-laziness check
- --telegram, -t - Enable Telegram
- --json, -j - JSON output
adversarial-spec/
├── .claude-plugin/
│ └── plugin.json # Plugin metadata
├── README.md
├── LICENSE
└── skills/
└── adversarial-spec/
├── SKILL.md # Skill definition and process
└── scripts/
├── debate.py # Multi-model debate orchestration
└── telegram_bot.py # Telegram notifications
MIT