
feat: Multi-Agent Engine System for Ralphy#7

Open
zkwentz wants to merge 37 commits into main from feat/multi-agent

Conversation

@zkwentz (Owner) commented Jan 19, 2026

Summary

This PR adds a multi-agent engine system that lets Ralphy run several AI coding engines on a task instead of a single one. It introduces three execution modes (consensus, specialization, and race), a meta-agent that reviews competing solutions, and engine selection informed by recorded performance metrics.

Key Features

  • Consensus Mode: Run multiple engines on the same task, with an AI meta-agent reviewing the results and selecting or merging the best solution
  • Specialization Mode: Route each task to a specialized engine based on pattern-matching rules
  • Race Mode: Multiple engines compete in parallel; the first successful solution wins
  • Meta-Agent Resolution: AI-powered conflict resolution and solution comparison
  • Performance Tracking: Adaptive engine selection based on historical performance metrics
  • Cost Management: Built-in cost controls and estimation for multi-engine execution

Execution Modes

  1. Consensus Mode - Critical tasks requiring high confidence

    • 2+ engines work on same task
    • Meta-agent reviews all solutions
    • Best for: Complex refactoring, critical bug fixes, architecture changes
  2. Specialization Mode - Efficient task distribution

    • Routes tasks based on engine strengths
    • Configurable pattern matching rules
    • Best for: Large PRDs with mixed task types
  3. Race Mode - Speed optimization

    • Parallel execution with early winner detection
    • First valid solution accepted
    • Best for: Simple bug fixes, formatting, documentation

Architecture Changes

  • New Modules: .ralphy/engines.sh, .ralphy/modes.sh, .ralphy/meta-agent.sh, .ralphy/metrics.sh
  • Enhanced Config: Extended .ralphy/config.yaml with engine settings, routing rules, and cost controls
  • Metrics System: Performance tracking with adaptive learning capabilities
  • Validation Gates: Comprehensive solution validation (tests, linting, build verification)

CLI Interface

# Mode selection
./ralphy.sh --mode consensus
./ralphy.sh --mode specialization
./ralphy.sh --mode race

# Engine configuration
./ralphy.sh --consensus-engines "claude,cursor,opencode"
./ralphy.sh --meta-agent claude

# Performance tracking
./ralphy.sh --show-metrics

Implementation Highlights

  • 27 files changed: 7,717 insertions, 67 deletions
  • Comprehensive test suites for each mode and component
  • Backwards compatible with existing Ralphy workflows
  • Bash-based implementation maintaining low barrier to entry
  • Modular design for easy maintenance and extension

Testing Coverage

  • Consensus mode with similar/different results
  • Specialization mode with pattern matching
  • Race mode with early winner and failure scenarios
  • Meta-agent decision parsing and validation
  • Metrics recording and adaptive selection
  • Cost limit enforcement
  • Validation gate handling

Documentation

  • Detailed MultiAgentPlan.md with architecture overview
  • Extended README with usage examples
  • Comprehensive test files demonstrating each feature

Test Plan

  • All test files passing (test_consensus.sh, test_race_mode.sh, test_specialization.sh, etc.)
  • Manual testing of all three execution modes
  • Meta-agent decision making validated
  • Performance metrics tracking verified
  • Cost controls functioning
  • Backwards compatibility confirmed

🤖 Generated with Claude Code

zkwentz and others added 30 commits January 18, 2026 20:17
Created demo login page with enhanced button styling featuring:
- Gradient background with smooth color transitions
- Hover animations with scale and shadow effects
- Shimmer effect overlay for visual polish
- Icon animation on interaction
- Proper accessibility with focus states
- Responsive design for mobile devices

This implementation demonstrates modern UI/UX patterns for the
Ralphy multi-agent system testing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implemented a complete authentication module with extensive test coverage:

Features:
- User creation and management (create, activate, deactivate)
- Password hashing (SHA-256)
- Session token generation and validation
- Token expiration and cleanup
- Concurrent authentication support
- Race condition handling

Test Coverage:
- 56 comprehensive unit tests
- All tests passing successfully
- Covers authentication, authorization, session management
- Tests edge cases, race conditions, and security features

Files Added:
- .ralphy/auth.sh: Core authentication module (385 lines)
- .ralphy/auth.test.sh: Complete test suite (709 lines)
- .ralphy/AUTH_README.md: Documentation with usage examples
- .ralphy/progress.txt: Task progress tracking

This implementation is ready for race mode testing across Cursor, Codex,
and Qwen engines as specified in the task requirements.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Extract engine-specific authentication logic from ralphy.sh into a
dedicated authentication module (.ralphy/auth.sh) for improved
maintainability and modularity.

Changes:
- Create .ralphy/auth.sh with 11 authentication functions
- Centralize permission configurations for all 6 AI engines
- Refactor run_ai_command() to use new auth module
- Update cleanup operations to use cleanup_engine_auth()
- Add comprehensive test suite (.ralphy/test_auth.sh)
- Maintain backward compatibility with fallback implementation

Benefits:
- Separation of concerns: auth logic isolated from orchestration
- Easier to add new engines (update one module vs scattered code)
- Testable authentication layer
- Consistent permission handling across all engines
- Clear documentation of each engine's auth requirements

The refactoring supports the planned multi-engine consensus mode
per MultiAgentPlan.md by providing a clean foundation for engine
management and authentication.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit implements Phase 2 of the multi-agent architecture for Ralphy,
enabling consensus mode where multiple AI engines can work on the same task
in parallel and their solutions are compared and merged.

Features:
- New .ralphy/modes.sh module with consensus orchestration
- New .ralphy/meta-agent.sh module for solution comparison
- CLI flags: --mode, --consensus-engines, --meta-agent
- Git worktree isolation for each engine
- Solution similarity detection (>80% threshold)
- Auto-acceptance when solutions are similar
- Test suite for validation
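
The similarity check and auto-acceptance listed above could look roughly like the following. This is a minimal sketch, not the code in `.ralphy/modes.sh`: the PR does not say how similarity is measured, so the shared-lines-over-all-lines ratio and the function names `similarity_percent` and `solutions_are_similar` are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: line-overlap similarity between two diff files.
# Assumption: similarity = unique shared lines / unique lines in either file.

similarity_percent() {
  local a b common total
  a=$(mktemp) || return 1
  b=$(mktemp) || { rm -f "$a"; return 1; }
  # comm needs sorted input; LC_ALL=C keeps sort and comm in agreement.
  LC_ALL=C sort -u "$1" > "$a"
  LC_ALL=C sort -u "$2" > "$b"
  # $(( ... )) strips the padding some wc implementations add.
  common=$(( $(LC_ALL=C comm -12 "$a" "$b" | wc -l) ))
  total=$(( $(LC_ALL=C sort -u "$a" "$b" | wc -l) ))
  rm -f "$a" "$b"
  if [ "$total" -eq 0 ]; then
    echo 100   # two empty diffs are trivially identical
  else
    echo $(( 100 * common / total ))
  fi
}

# Mirrors the ">80% threshold" from the commit message.
solutions_are_similar() {
  [ "$(similarity_percent "$1" "$2")" -gt 80 ]
}
```

With a measure like this, two five-line diffs that differ in one line share 4 of 6 unique lines (66%), so they would go to the meta-agent rather than being auto-accepted.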

Usage:
  ./ralphy.sh "task" --consensus-engines "claude,cursor"
  ./ralphy.sh "task" --mode consensus --consensus-engines "claude,opencode,cursor"

Implementation follows MultiAgentPlan.md specification.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit addresses multiple critical and high-severity command injection vulnerabilities
identified in ralphy.sh that could allow arbitrary code execution through malicious task titles.

Security Fixes:

1. YQ Command Injection (CRITICAL)
   - Fixed mark_task_complete_yaml() to use env(TASK) instead of string interpolation
   - Fixed get_parallel_group_yaml() to use env(TASK) pattern
   - Prevents arbitrary YAML manipulation and code execution via task titles

2. GitHub CLI Argument Injection (HIGH)
   - Added sanitize_task_title() function to remove control characters
   - Applied sanitization in create_pull_request() function
   - Applied sanitization in parallel execution PR creation
   - Prevents command injection through gh pr create arguments

Implementation Details:
- Added sanitize_task_title() function (removes newlines, null bytes, control chars)
- Changed YQ calls from direct interpolation to environment variable passing
- All PR creation now sanitizes titles before passing to GitHub CLI
- Followed existing secure pattern from add_rule() function
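
The two fixes described above can be sketched as follows. The function bodies are assumptions rather than the code in ralphy.sh: `sanitize_title_sketch` and `mark_complete_sketch` are hypothetical names, and the `.tasks[]`, `.title`, and `.completed` YAML fields are an assumed layout. The second function needs yq v4 installed and is only defined here, not run.

```shell
#!/usr/bin/env bash
# Hypothetical sanitizer: drop every control character (newlines, tabs,
# escape bytes). LC_ALL=C limits the class to ASCII control bytes, so
# multi-byte UTF-8 text passes through untouched.
sanitize_title_sketch() {
  printf '%s' "$1" | LC_ALL=C tr -d '[:cntrl:]'
}

# Hypothetical yq call using the env(TASK) pattern: the title travels in an
# environment variable and is never spliced into the yq expression itself.
mark_complete_sketch() {
  # $1 = task title, $2 = YAML file (assumed to hold a .tasks[] list)
  TASK="$1" yq -i '(.tasks[] | select(.title == env(TASK)) | .completed) = true' "$2"
}
```

Because the expression string is a fixed literal, a title containing quotes or yq operators cannot change what the expression does. yq v4 also offers `strenv(TASK)`, which always treats the value as a string; that may be the safer choice when a title could parse as a number or boolean.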

Testing:
- Created comprehensive security test suite (test_security_fixes_simple.sh)
- All 6 security tests passing
- Validates proper use of env(TASK) pattern
- Confirms sanitization in all PR creation paths
- Verifies CWE-78 documentation

Impact:
Prevents attackers from executing arbitrary commands, injecting YAML content,
or breaking out of shell commands through specially crafted task titles.

References: CWE-78 (Improper Neutralization of Special Elements used in an OS Command, "OS Command Injection")

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit implements a complete consensus mode feature that enables
multiple AI engines to work on the same task in parallel, with a
meta-agent selecting the best solution when results differ.

New Features:
- Multi-engine parallel execution using isolated git worktrees
- Meta-agent AI comparison and selection of best solution
- Real-time status monitoring across all engines
- Comprehensive solution storage for comparison
- CLI flags: --mode consensus, --consensus-engines, --meta-agent

Files Added:
- .ralphy/modes.sh: Core consensus mode orchestration logic
- .ralphy/meta-agent.sh: AI-powered solution comparison
- test_consensus.sh: Comprehensive test suite (10 tests, all passing)
- .ralphy/progress.txt: Implementation documentation

Files Modified:
- ralphy.sh: Added consensus mode support to brownfield tasks

Implementation Details:
- Each engine runs in isolated worktree (agent-N/)
- Solutions stored as diffs, commits, and statistics
- Meta-agent analyzes: correctness, quality, completeness, testing
- Automatic merge of chosen solution to current branch
- Supports all 6 engines: claude, opencode, cursor, codex, qwen, droid

Usage:
  ./ralphy.sh --mode consensus "implement feature"
  ./ralphy.sh --mode consensus --consensus-engines "claude,cursor,opencode" "fix bug"

All tests passing. Ready for production use.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added race mode functionality that allows multiple AI engines to compete
on the same task, with the first to complete successfully being declared
the winner. This optimizes speed for straightforward tasks.

Features:
- CLI flags: --race, --race-engines, --no-validation, --race-timeout
- Real-time early winner detection with 0.3s polling
- Automatic cleanup of losing agents and their worktrees
- Validation pipeline (commits, tests, lint)
- Multi-engine support (claude, opencode, cursor, codex, qwen, droid)
- Timeout handling with configurable multiplier

Functions added:
- validate_race_solution(): Validates solutions before declaring winner
- run_race_agent(): Runs individual racing agent in isolated worktree
- run_race_mode(): Orchestrates the race and manages cleanup
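
Early winner detection with 0.3s polling might be structured like the sketch below. It is illustrative only: `race_first_success` is a hypothetical name, contestants are plain shell command strings rather than AI engines in worktrees, and validation is reduced to "exited with status 0".

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run each argument as a command in the background and
# print the index of the first one that exits successfully.
race_first_success() {
  local dir cmd n pid alive i=0 winner=""
  local pids=()
  dir=$(mktemp -d) || return 1
  for cmd in "$@"; do
    # A contestant leaves a marker file only if it succeeds.
    ( bash -c "$cmd" >/dev/null 2>&1 && : > "$dir/$i.ok" ) &
    pids+=("$!")
    i=$((i + 1))
  done
  while :; do
    for n in "${!pids[@]}"; do
      if [ -e "$dir/$n.ok" ]; then winner=$n; break 2; fi
    done
    alive=0
    for pid in "${pids[@]}"; do
      kill -0 "$pid" 2>/dev/null && alive=1
    done
    if [ "$alive" -eq 0 ]; then
      # Everyone has exited; look once more in case a marker landed late.
      for n in "${!pids[@]}"; do
        if [ -e "$dir/$n.ok" ]; then winner=$n; break; fi
      done
      break
    fi
    sleep 0.3
  done
  # Stop the losers. A real implementation would also have to terminate
  # their child processes and remove their worktrees.
  kill "${pids[@]}" 2>/dev/null
  wait "${pids[@]}" 2>/dev/null
  rm -rf "$dir"
  [ -n "$winner" ] || return 1
  printf '%s\n' "$winner"
}
```

For example, `race_first_success 'sleep 3' 'true' 'false'` should report contestant 1 well before the three-second sleep finishes, and the function returns non-zero when every contestant fails.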

Integration:
- Routes --race flag in main() for single-task mode
- Uses existing worktree infrastructure for isolation
- Maintains backward compatibility with parallel mode

Testing:
- Comprehensive test suite (test_race_mode.sh) verifies all features
- All tests pass successfully

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implemented Phase 3 of the multi-agent architecture to enable
intelligent task routing to specialized AI engines based on
pattern matching rules.

Changes:
- Created .ralphy/modes.sh module for multi-engine execution modes
- Added match_specialization_rule() for pattern-based task routing
- Added get_engine_for_task() to select optimal engine per task
- Added validate_engine_available() to check engine availability
- Integrated specialization mode into ralphy.sh main execution flow
- Added --mode and --specialization CLI flags
- Enhanced config.yaml schema with engines.specialization_rules section
- Added default rules for UI, refactoring, testing, bugs, API, and database tasks
- Created comprehensive test suite (.ralphy/test_specialization.sh)
- All 10 tests passing successfully

Features:
- Routes tasks to specialized engines based on regex pattern matching
- Case-insensitive pattern matching
- Falls back to default engine when no rule matches or engine unavailable
- Fully backward compatible (defaults to single-engine mode)
- Foundation for future consensus and race modes

Usage:
  ./ralphy.sh --specialization --prd PRD.md
  ./ralphy.sh --mode specialization "add login button"

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit implements Phase 3 (Specialization Mode) of the multi-agent
architecture, with specific focus on the fallback behavior when no
specialization rules match a task description.

Features:
- New .ralphy/modes.sh module with specialization functions
- Pattern-based task routing using config rules
- Three-level fallback: config → env var → hardcoded default (claude)
- Comprehensive test suite with 10 test cases
- Metadata tracking for performance metrics
- yq-based YAML config parsing

Key Functions:
- run_specialization_mode() - Main orchestration
- match_specialization_rule() - Regex pattern matching against config
- get_default_engine() - Multi-level fallback logic
- run_single_engine_task() - Single engine execution

Fallback Behavior (No Matching Rules):
1. Task description doesn't match any pattern in config
2. match_specialization_rule() returns empty string
3. get_default_engine() tries:
   - .ralphy/config.yaml: engines.meta_agent.engine
   - Environment: $AI_ENGINE
   - Hardcoded: "claude"
4. Metadata logs: matched_pattern = "(no match - default)"
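
The fallback chain above can be sketched as follows. This is a simplified illustration rather than the code in `.ralphy/modes.sh`: rules live in a bash array instead of being read with yq, `CONFIG_DEFAULT_ENGINE` stands in for the `engines.meta_agent.engine` lookup, the `test|spec` rule and its engine are invented for the example, and the `_sketch` function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical rule table: "<regex>=<engine>", first match wins.
# Only the cursor rule appears in the PR; the second is made up.
RULES=(
  "UI|frontend|styling=cursor"
  "test|spec=codex"
)

# Prints the engine of the first matching rule, or nothing on no match.
# The body runs in a subshell so nocasematch does not leak to the caller.
match_rule_sketch() (
  shopt -s nocasematch
  task=$1
  for rule in "${RULES[@]}"; do
    pattern=${rule%=*}
    if [[ $task =~ $pattern ]]; then
      printf '%s\n' "${rule##*=}"
      exit 0
    fi
  done
)

# Three-level fallback: config value, then $AI_ENGINE, then "claude".
pick_engine_sketch() {
  local engine
  engine=$(match_rule_sketch "$1")
  printf '%s\n' "${engine:-${CONFIG_DEFAULT_ENGINE:-${AI_ENGINE:-claude}}}"
}
```

One thing the sketch makes visible: with case-insensitive, unanchored matching, a short pattern such as `UI` also matches inside other words, so "fix build script" routes to the UI engine here. Longer or more specific patterns avoid that kind of accidental match.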

Test Coverage:
- Pattern matching (UI, test, refactor patterns)
- No-match scenario returns empty
- Default engine fallback from config
- Environment variable fallback
- Hardcoded default fallback
- Missing config handling
- Case-insensitive matching
- First-match-wins precedence

Config Schema:
engines:
  meta_agent:
    engine: "claude"
  specialization_rules:
    - pattern: "UI|frontend|styling"
      engines: ["cursor"]

Manual Testing Checklist Progress:
✓ Specialization with no matching rules (this commit)
- Specialization with matching rules (next)

Implementation follows MultiAgentPlan.md specification (lines 268-300).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add .ralphy/meta-agent.sh with parse_meta_decision() function
  - Parses meta-agent output with DECISION, CHOSEN, REASONING fields
  - Supports both "select" and "merge" decision types
  - Extracts merged solution code blocks
  - Returns JSON-formatted output with proper escaping
  - Handles multiline text and edge cases

- Add helper functions:
  - prepare_meta_prompt(): Build comparison prompt for solutions
  - run_meta_agent(): Execute meta-agent with multi-engine support
  - merge_solutions(): Apply merged solutions (placeholder)

- Add comprehensive test suite (.ralphy/test-meta-agent.sh)
  - 32 tests covering all parsing scenarios
  - Tests for error conditions and edge cases
  - All tests passing ✓
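
Parsing the DECISION, CHOSEN, and REASONING fields could be done along these lines. This is a rough sketch, not `parse_meta_decision()` itself: it assumes each field sits on its own line as `NAME: value`, it emits `key=value` lines rather than JSON, and it ignores multiline reasoning and merged code blocks. The name `parse_decision_sketch` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical parser: reads meta-agent output on stdin, prints key=value
# lines, and fails unless the decision is "select" or "merge".
parse_decision_sketch() {
  awk '
    /^DECISION:/  { sub(/^DECISION:[ \t]*/, "");  decision = tolower($0) }
    /^CHOSEN:/    { sub(/^CHOSEN:[ \t]*/, "");    chosen = $0 }
    /^REASONING:/ { sub(/^REASONING:[ \t]*/, ""); reasoning = $0 }
    END {
      if (decision != "select" && decision != "merge") exit 1
      printf "decision=%s\nchosen=%s\nreasoning=%s\n", decision, chosen, reasoning
    }'
}
```

Rejecting any decision other than the two known types keeps a malformed meta-agent reply from being treated as a valid choice.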

Part of multi-agent engine implementation (Phase 5).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adds race mode implementation for Ralphy's multi-agent system with robust
handling when all engines fail to complete a task.

Key Features:
- Parallel engine execution with isolated git worktrees
- Comprehensive failure reporting and diagnostics
- Four fallback strategies presented to users
- Metrics tracking for race history and outcomes
- Automatic cleanup of worktrees and branches
- Validation system for solution quality
- Bash 3 compatible for macOS support

Files Added:
- .ralphy/engines.sh: Engine abstraction layer
- .ralphy/modes.sh: Race mode implementation with failure handling
- .ralphy/test-race-mode.sh: Comprehensive test suite (all tests passing)
- .ralphy/RACE_MODE.md: Complete documentation
- .ralphy/progress.txt: Implementation progress tracking

When all engines fail, users receive:
1. Detailed failure report with exit codes and outputs
2. Metrics recorded in .ralphy/metrics.json
3. Fallback strategies:
   - Retry with different engines
   - Switch to consensus mode
   - Manual intervention guidance
   - Task breakdown suggestions

Part of the multi-agent system implementation outlined in MultiAgentPlan.md
(Phase 4: Race Mode Implementation).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit adds comprehensive performance tracking and intelligent
engine selection to Ralphy based on historical success rates.

New Features:
- Automatic metrics recording for all task executions
- Pattern-based task categorization (10 patterns)
- Adaptive engine selection based on historical performance
- CLI flags: --show-metrics, --reset-metrics, --export-metrics, --no-adapt
- Comprehensive test suite (14 tests, all passing)

Implementation Details:
- New .ralphy/metrics.sh module with 11 core functions
- JSON-based metrics storage in .ralphy/metrics.json
- Tracks success rate, duration, cost, and tokens per engine
- Pattern-specific performance metrics
- Minimum 5 samples required for adaptive recommendations
- Zero breaking changes, backward compatible
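
The five-sample minimum for adaptive recommendations can be illustrated with a small sketch. The storage format is an assumption: the PR stores metrics as JSON in `.ralphy/metrics.json`, while this example reads plain `<engine> <successes> <attempts>` lines so it needs nothing beyond awk. `best_engine_sketch` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical selector: prints the engine with the highest success rate
# among those with at least $1 recorded attempts (default 5).
# stdin: one "<engine> <successes> <attempts>" record per line.
best_engine_sketch() {
  awk -v min="${1:-5}" '
    $3 > 0 && $3 + 0 >= min + 0 {
      rate = $2 / $3
      if (best == "" || rate > best_rate) { best = $1; best_rate = rate }
    }
    END { if (best == "") exit 1; print best }'
}
```

An engine with a perfect record over only three tasks is skipped in favor of one with 8 of 10 successes, which is the point of a sample floor: a short lucky streak does not drive routing. When no engine has enough samples the function fails, and the caller can fall back to the default engine.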

Files Added:
- .ralphy/metrics.sh (536 lines) - Core metrics module
- .ralphy/test_metrics.sh (290 lines) - Test suite
- .ralphy/progress.txt - Implementation documentation

Files Modified:
- ralphy.sh - Integrated metrics recording and adaptive selection
  * Added metrics module sourcing
  * Added task timing tracking
  * Added success/failure metrics recording
  * Added adaptive engine selection logic
  * Added 4 new CLI flags
  * Updated help text and init function

Foundation for future multi-agent features including consensus mode,
race mode, and specialization routing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add comprehensive cost control features to prevent runaway costs during
multi-agent execution as specified in MultiAgentPlan.md.

Features:
- Per-task cost limits with configurable thresholds
- Per-session cost limits to control total spending
- Warning alerts when approaching cost limits (default 75%)
- Automatic task/session termination when limits exceeded
- Support for actual costs (OpenCode) and estimated costs (token-based)

Configuration (in .ralphy/config.yaml):
- cost_controls.max_per_task: Maximum USD per task (0 = unlimited)
- cost_controls.max_per_session: Maximum USD per session (0 = unlimited)
- cost_controls.warn_threshold: Warning percentage (default 0.75)

Implementation details:
- Added cost tracking variables and functions to ralphy.sh
- Integrated cost checking after each AI API call
- Enhanced config initialization with cost_controls section
- Updated --config display to show cost limits

Testing:
- Syntax validation passed
- Cost calculation verified (1M input + 500k output = $10.50)
- Limit detection and warning thresholds tested successfully
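
The verified figure above (1M input + 500k output = $10.50) is consistent with rates of $3 per million input tokens and $15 per million output tokens; those rates are inferred from the total, not stated in the PR. Under that assumption, token-based estimation and limit checking might look like this sketch, where both function names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical estimator. $1 = input tokens, $2 = output tokens,
# $3/$4 = USD per million input/output tokens (assumed defaults 3 and 15).
estimate_cost_sketch() {
  awk -v in_tok="$1" -v out_tok="$2" -v in_rate="${3:-3}" -v out_rate="${4:-15}" \
    'BEGIN { printf "%.2f\n", (in_tok * in_rate + out_tok * out_rate) / 1000000 }'
}

# Hypothetical limit check. $1 = spent USD, $2 = limit (0 = unlimited),
# $3 = warning threshold as a fraction of the limit (default 0.75).
cost_status_sketch() {
  awk -v spent="$1" -v limit="$2" -v warn="${3:-0.75}" 'BEGIN {
    if (limit + 0 == 0)                         print "ok"
    else if (spent + 0 > limit + 0)             print "exceeded"
    else if (spent + 0 >= (limit + 0) * warn)   print "warn"
    else                                        print "ok"
  }'
}
```

The arithmetic goes through awk because bash has no native floating-point support.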

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add comprehensive validation module for multi-engine mode support:

Features:
- Four validation gates: diff check, linting, tests, and build
- Configurable retry logic with timeout support
- Cross-platform compatibility (macOS/Linux)
- JSON reporting for metrics integration
- Loads commands from .ralphy/config.yaml

Implementation:
- .ralphy/validation.sh: Core validation module with 400+ lines
  - validate_solution(): Main function with all 4 gates
  - validate_solution_with_retry(): Retry wrapper with smart retry logic
  - Individual gate functions with timeout handling
  - Validation reporting and config loading

- tests/test_validation.sh: Comprehensive test suite
  - 22 tests covering all validation scenarios
  - Mock git repository testing for diff validation
  - JSON report validation
  - Cross-platform test handling
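
A gate runner with a timeout fallback and a retry wrapper could be sketched as below. This is not the code in `.ralphy/validation.sh`; the function names and argument order are assumptions, and the "smart" part of the retry logic is reduced to a fixed attempt count.

```shell
#!/usr/bin/env bash
# Hypothetical gate runner. $1 = timeout in seconds, rest = command to run.
run_gate_sketch() {
  local secs=$1
  shift
  if command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"
  else
    "$@"   # graceful fallback: no time limit where `timeout` is unavailable
  fi
}

# Hypothetical retry wrapper. $1 = max attempts, $2 = timeout seconds,
# rest = gate command. Succeeds as soon as one attempt passes.
validate_with_retry_sketch() {
  local attempts=$1 secs=$2 n=1
  shift 2
  while [ "$n" -le "$attempts" ]; do
    run_gate_sketch "$secs" "$@" && return 0
    n=$((n + 1))
  done
  return 1
}
```

A call such as `validate_with_retry_sketch 2 300 npm test` would run a test gate at most twice. Because `timeout` executes a program rather than a shell function, gates written as shell functions would need wrapping, for example with `bash -c`.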

Testing:
- All 22 tests passing
- Graceful fallback when timeout command unavailable
- Tested on macOS environment

This module provides the foundation for handling validation failures
in consensus, race, and specialization modes as specified in
MultiAgentPlan.md lines 474-506.

Closes validation gate failures task (line 619 in MultiAgentPlan.md).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…tion-system-consensus-claude-op

Refactor authentication system into modular architecture
…-styling-auto

Update login button styling with modern design
…auth-race-cursor-codex-qwen

Add comprehensive unit tests for auth module
…ty-bug-consensus-claude-cursor-

Fix critical security vulnerabilities (CWE-78) in command injection
…-2-engines-similar-results

Ralphy/agent 5 consensus mode with 2 engines similar results


1 participant