@Suhaib3100 Suhaib3100 commented Jan 27, 2026

Fixes #279

Summary

Implements the foundation for AI Swarm Mode - enabling a master agent to spawn and orchestrate multiple worker agents in separate browser windows for parallel task execution.

Use Case

User: "Research and compare the top 5 CRM solutions"

Master Agent (Coordinator)
├── Decomposes task into 5 parallel subtasks
├── Spawns 5 worker windows
├── Monitors progress
└── Synthesizes final report
    │
    ├── Worker 1: Research Salesforce
    ├── Worker 2: Research HubSpot
    ├── Worker 3: Research Pipedrive
    ├── Worker 4: Research Zoho
    └── Worker 5: Research Monday

Components Added

Core (apps/server/src/swarm/)

| Component | Description |
| --- | --- |
| types.ts | SwarmState, WorkerState, SwarmRequest, SwarmResult, message types |
| constants.ts | SWARM_LIMITS, SWARM_TIMEOUTS, default configs |
| coordinator/swarm-registry.ts | Tracks active swarms and workers |
| coordinator/task-planner.ts | LLM-based task decomposition |
| coordinator/swarm-coordinator.ts | Main orchestrator for swarm lifecycle |
| worker/worker-lifecycle.ts | Spawn, monitor, terminate workers |
| messaging/swarm-bus.ts | EventEmitter-based pub/sub communication |
| aggregation/result-aggregator.ts | Merges worker results |

API (apps/server/src/api/routes/swarm.ts)

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /swarm | Create and execute swarm |
| POST | /swarm/create | Create swarm only |
| POST | /swarm/:id/execute | Execute existing swarm |
| GET | /swarm/:id | Get status |
| GET | /swarm/:id/stream | SSE for real-time updates |
| DELETE | /swarm/:id | Terminate swarm |

Key Features

  • Task Decomposition: LLM automatically breaks complex tasks into parallel subtasks
  • Worker Management: Spawn workers in separate windows with health monitoring
  • Retry Logic: Exponential backoff for failed workers (max 3 retries)
  • Progress Tracking: Real-time updates via SSE streaming
  • Result Aggregation: Merge worker results with partial failure support
  • Output Formats: JSON, Markdown, or HTML
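
As a rough illustration of the retry feature above, here is a minimal sketch of an exponential backoff schedule capped at 3 retries. Only the retry limit comes from this PR; the base delay, the delay cap, and the function names are assumptions made for the example.

```typescript
// "max 3 retries" comes from the PR summary; both delay values are assumptions.
const MAX_RETRIES = 3;
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;

/** Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped. */
function backoffDelayMs(attempt: number): number {
  const delay = BASE_DELAY_MS * 2 ** (attempt - 1);
  return Math.min(delay, MAX_DELAY_MS);
}

/** A worker may be retried only while it has used fewer than MAX_RETRIES retries. */
function canRetry(retryCount: number): boolean {
  return retryCount < MAX_RETRIES;
}
```

With these assumed values the three retries would wait 1 s, 2 s, and 4 s before a worker is marked as failed.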

Commits

  1. feat(swarm): add core types and constants
  2. feat(swarm): add SwarmRegistry for tracking active swarms
  3. feat(swarm): add SwarmMessagingBus for inter-agent communication
  4. feat(swarm): add WorkerLifecycleManager for worker management
  5. feat(swarm): add TaskPlanner for LLM-based task decomposition
  6. feat(swarm): add ResultAggregator for merging worker results
  7. feat(swarm): add SwarmCoordinator as main orchestrator
  8. feat(swarm): add HTTP API routes for swarm management
  9. docs(swarm): add design document and update exports

Next Steps

  • Integration with main server
  • Worker agent implementation
  • UI components for swarm visualization
  • Chromium-side SwarmWindowManager for resource isolation

- Add SwarmState, WorkerState, and related enums
- Add SwarmConfig, RetryPolicy, ResourceLimits interfaces
- Add SwarmRequest/SwarmResult/SwarmStatus types
- Add Worker and WorkerTask types
- Add SwarmMessage protocol types
- Add Zod schemas for validation
- Add SWARM_LIMITS and SWARM_TIMEOUTS constants
- Add default configurations

Part of browseros-ai#279
- Manage swarm lifecycle (create, update, delete)
- Track workers per swarm with state management
- Calculate progress and status summaries
- Enforce concurrent swarm limits
- Worker state transitions with timestamps

Part of browseros-ai#279
- EventEmitter-based pub/sub messaging
- Channel naming: swarm:{id}:master, swarm:{id}:worker:{id}
- sendToWorker(), broadcast(), sendToMaster() helpers
- subscribe(), subscribeAll(), subscribeBroadcast()
- waitFor() with timeout for sync patterns
- Cleanup with removeSwarmListeners()
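
A minimal sketch of how the bus described above might look, assuming Node's built-in `EventEmitter`. The channel names follow the convention quoted in the commit message; the `BusMessage` shape, the class name, and the method signatures are hypothetical, since the real `SwarmMessagingBus` is not shown here.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical message shape; the PR's SwarmMessage type is not shown.
interface BusMessage {
  type: string;
  payload?: unknown;
}

// Channel names follow the convention quoted in the commit message.
const masterChannel = (swarmId: string) => `swarm:${swarmId}:master`;
const workerChannel = (swarmId: string, workerId: string) =>
  `swarm:${swarmId}:worker:${workerId}`;

class MiniSwarmBus {
  private readonly emitter = new EventEmitter();

  sendToMaster(swarmId: string, message: BusMessage): void {
    this.emitter.emit(masterChannel(swarmId), message);
  }

  sendToWorker(swarmId: string, workerId: string, message: BusMessage): void {
    this.emitter.emit(workerChannel(swarmId, workerId), message);
  }

  /** Registers a handler and returns a function that removes it again. */
  subscribe(channel: string, handler: (m: BusMessage) => void): () => void {
    this.emitter.on(channel, handler);
    return () => {
      this.emitter.off(channel, handler);
    };
  }

  /** Resolves with the first `type` message on `channel`, or rejects after `timeoutMs`. */
  waitFor(channel: string, type: string, timeoutMs: number): Promise<BusMessage> {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        unsubscribe();
        reject(new Error(`Timed out waiting for ${type} on ${channel}`));
      }, timeoutMs);
      const unsubscribe = this.subscribe(channel, (message) => {
        if (message.type !== type) return;
        clearTimeout(timer);
        unsubscribe();
        resolve(message);
      });
    });
  }
}
```

Returning an unsubscribe function from `subscribe()` keeps listener cleanup next to the subscription, which is one way to avoid leaked listeners when a swarm terminates.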

Part of browseros-ai#279
- spawnWorker() creates window via ControllerBridge
- Health monitoring with heartbeat checks
- Progress stale detection
- handleWorkerFailure() with exponential backoff retry
- terminateWorker() and terminateAllWorkers()
- Cleanup methods for graceful shutdown

Part of browseros-ai#279
- decompose() breaks complex tasks into parallel subtasks
- estimateWorkerCount() for optimal worker sizing
- Zod schema validation for LLM output
- createManualTasks() fallback for non-LLM usage
- Dependency handling between subtasks

Part of browseros-ai#279
- aggregate() collects and merges worker results
- Handles partial results from failed workers
- calculateMetrics() for execution stats
- Output formats: JSON, Markdown, HTML
- Optional LLM synthesizer integration

Part of browseros-ai#279
- createSwarm() initializes new swarm
- executeSwarm() runs full lifecycle:
  1. Planning (task decomposition)
  2. Spawning (worker windows)
  3. Executing (monitor progress)
  4. Aggregating (merge results)
- terminateSwarm() for graceful shutdown
- Event-based progress reporting
- Timeout handling and error recovery
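
The lifecycle phases above can be read as a small state machine. The sketch below uses the state names that appear in the sequence diagram of this PR; the `failed` state, the transition table, and the `transition()` helper are assumptions for illustration and may not match how `SwarmCoordinator` actually enforces ordering.

```typescript
// States seen in the PR's sequence diagram, plus `failed` as an assumed terminal state.
type SwarmPhase =
  | "planning"
  | "spawning"
  | "executing"
  | "aggregating"
  | "completed"
  | "failed";

// Each phase may advance to the next one or fail; terminal phases allow nothing.
const ALLOWED_TRANSITIONS: Record<SwarmPhase, readonly SwarmPhase[]> = {
  planning: ["spawning", "failed"],
  spawning: ["executing", "failed"],
  executing: ["aggregating", "failed"],
  aggregating: ["completed", "failed"],
  completed: [],
  failed: [],
};

/** Returns `next` if the move is legal, otherwise throws. */
function transition(current: SwarmPhase, next: SwarmPhase): SwarmPhase {
  if (!ALLOWED_TRANSITIONS[current].includes(next)) {
    throw new Error(`Illegal swarm transition: ${current} -> ${next}`);
  }
  return next;
}
```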

Part of browseros-ai#279
Endpoints:
- POST /swarm - Create and execute swarm
- POST /swarm/create - Create swarm only
- POST /swarm/:id/execute - Execute existing swarm
- GET /swarm/:id - Get status
- GET /swarm/:id/stream - SSE for real-time updates
- DELETE /swarm/:id - Terminate swarm

Includes Zod validation and error handling.

Part of browseros-ai#279
- Add comprehensive design doc with architecture overview
- Export all swarm components from index.ts
- Document API endpoints and message protocol
- Track implementation status

Part of browseros-ai#279
github-actions bot commented Jan 27, 2026

All contributors have signed the CLA. Thank you!
Posted by the CLA Assistant Lite bot.

@Suhaib3100

I have read the CLA Document and I hereby sign the CLA

greptile-apps bot commented Jan 27, 2026

Greptile Overview

Greptile Summary

Implements a comprehensive AI Swarm Mode feature that enables parallel task execution across multiple browser windows. The architecture is well-designed with clear separation of concerns:

  • Core Components: SwarmRegistry tracks state, TaskPlanner decomposes tasks via LLM, WorkerLifecycleManager manages worker windows, SwarmMessagingBus handles inter-agent communication, and ResultAggregator merges results
  • API Layer: HTTP endpoints with SSE streaming for real-time progress updates
  • Type Safety: Comprehensive TypeScript types with Zod validation schemas

Key Issues Found:

  1. Import Path Error (syntax): result-aggregator.ts has incorrect import path for SwarmRegistry
  2. Retry Logic Bug (logic): Worker retry count is reset when respawning, breaking retry limit enforcement
  3. Resource Leak (logic): SSE keepAlive interval not cleared on terminal events
  4. Architecture Violation (style): index.ts barrel export violates CLAUDE.md guideline against bundling all exports

Strengths:

  • Clean architecture with proper dependency injection
  • Comprehensive error handling and logging
  • Well-documented with inline comments
  • Proper event-driven design for monitoring
  • Health monitoring with heartbeat and progress tracking

Confidence Score: 4/5

  • Safe to merge with minor fixes needed for import path and retry logic
  • A score of 4 reflects a solid architecture and implementation with two bugs that need fixing: the incorrect import path will cause a runtime error, and the retry logic bug could lead to infinite retries. The resource leak is less critical but should be addressed. The style violation (index.ts) doesn't block the merge but should be cleaned up per project guidelines.
  • Pay close attention to result-aggregator.ts (import fix required) and worker-lifecycle.ts (retry logic fix required)

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/server/src/swarm/worker/worker-lifecycle.ts | Worker lifecycle management with health monitoring; has retry logic bug where retryCount is reset on respawn |
| apps/server/src/swarm/aggregation/result-aggregator.ts | Result aggregation with formatting options; has incorrect import path for SwarmRegistry |
| apps/server/src/swarm/coordinator/swarm-coordinator.ts | Main orchestrator with proper phase management and event emission - no issues |
| apps/server/src/api/routes/swarm.ts | HTTP API routes with SSE streaming; potential resource leak in keepAlive interval cleanup |
| apps/server/src/swarm/index.ts | Barrel export file that violates CLAUDE.md guideline against index.ts exports |

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as Swarm API Routes
    participant Coord as SwarmCoordinator
    participant Registry as SwarmRegistry
    participant Planner as TaskPlanner
    participant Lifecycle as WorkerLifecycleManager
    participant Bus as SwarmMessagingBus
    participant Bridge as ControllerBridge
    participant Workers as Worker Windows
    participant Aggregator as ResultAggregator

    Client->>API: POST /swarm {task, maxWorkers}
    API->>Coord: createAndExecute(request)
    
    Note over Coord: Phase 1: Planning
    Coord->>Registry: create(task, config)
    Registry-->>Coord: swarm object
    Coord->>Registry: updateState(swarmId, 'planning')
    Coord->>Planner: decompose(task, config)
    Planner->>Planner: LLM generates subtasks
    Planner-->>Coord: WorkerTask[]
    
    Note over Coord: Phase 2: Spawning Workers
    Coord->>Registry: updateState(swarmId, 'spawning')
    loop For each task
        Coord->>Lifecycle: spawnWorker(swarmId, task)
        Lifecycle->>Registry: addWorker(swarmId, worker)
        Lifecycle->>Bridge: sendRequest('create_window')
        Bridge-->>Lifecycle: {windowId}
        Lifecycle->>Bus: startHealthMonitoring()
        Lifecycle-->>Coord: worker object
        Coord->>Coord: emit('worker_spawned')
    end
    
    Note over Coord: Phase 3: Execution & Monitoring
    Coord->>Registry: updateState(swarmId, 'executing')
    Coord->>Coord: emit('swarm_started')
    Coord->>Bus: subscribeToMaster(swarmId)
    
    loop Worker Execution
        Workers->>Bus: sendToMaster('task_progress')
        Bus->>Coord: message received
        Coord->>Registry: updateWorkerProgress()
        Coord->>Coord: emit('worker_progress')
        
        Workers->>Bus: sendToMaster('task_complete')
        Bus->>Coord: message received
        Coord->>Registry: setWorkerResult()
        Coord->>Coord: emit('worker_completed')
    end
    
    Note over Coord: Phase 4: Aggregation
    Coord->>Registry: updateState(swarmId, 'aggregating')
    Coord->>Coord: emit('aggregation_started')
    Coord->>Aggregator: aggregate(swarmId, format)
    Aggregator->>Registry: getWorkers(swarmId)
    Registry-->>Aggregator: Worker[]
    Aggregator->>Aggregator: calculateMetrics()
    Aggregator->>Aggregator: simpleAggregate() or LLM synthesize()
    Aggregator-->>Coord: AggregatedResult
    
    Coord->>Registry: updateState(swarmId, 'completed')
    Coord->>Coord: emit('swarm_completed')
    Coord-->>API: SwarmResult
    API-->>Client: {success: true, data: result}

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 2 comments


Comment on lines +219 to +234
// Keep alive
const keepAlive = setInterval(() => {
  stream.writeSSE({
    event: 'ping',
    data: JSON.stringify({ timestamp: Date.now() }),
  })
}, 30000)

// Wait for stream to close
await new Promise<void>((resolve) => {
  stream.onAbort(() => {
    clearInterval(keepAlive)
    unsubscribe()
    resolve()
  })
})

Potential resource leak - keepAlive interval may not be cleared if stream closes via the earlier terminal event path (lines 206-211)

When stream.close() is called in the terminal event handler, the keepAlive interval continues running. Consider clearing it before closing:

if (
  event.type === 'swarm_completed' ||
  event.type === 'swarm_failed'
) {
  clearInterval(keepAlive)
  stream.close()
}
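
One way to make this kind of fix robust is to route both exit paths through a single idempotent cleanup function. The sketch below is an assumption about how that could look, not the code in swarm.ts; `createStreamCleanup` is a hypothetical helper name.

```typescript
/**
 * Builds one idempotent cleanup function so that the terminal-event path
 * and the abort path can both call it without releasing anything twice.
 */
function createStreamCleanup(
  keepAlive: ReturnType<typeof setInterval>,
  unsubscribe: () => void,
): () => void {
  let done = false;
  return () => {
    if (done) return; // a second call is a no-op
    done = true;
    clearInterval(keepAlive);
    unsubscribe();
  };
}

// Intended use (hypothetical wiring around the quoted snippet):
//   terminal event:  cleanup(); stream.close()
//   abort:           stream.onAbort(() => { cleanup(); resolve() })
```

Because the function guards itself with a flag, it does not matter which path runs first or whether both run.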

Comment on lines 247 to 251
// Increment retry count and respawn
worker.retryCount++
worker.state = 'spawning'
worker.windowId = undefined
worker.error = undefined

Retry logic mutates existing worker object, potentially causing issues when spawning new worker with same ID

When retrying, the code increments retryCount on the existing worker object then calls spawnWorker() which creates a fresh worker with retryCount: 0. The retry count won't persist correctly.


This commit adds production-ready advanced features to AI Swarm Mode:

## Scheduling & Load Balancing
- PriorityTaskQueue: Priority scheduling with aging, deadline urgency,
  dependency resolution, and preemption support
- LoadBalancer: 5 strategies (round-robin, least-connections, weighted,
  resource-aware, latency-based) with sticky sessions and health scoring
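
To illustrate what "priority scheduling with aging" can mean, here is a minimal sketch in which a task's effective priority grows with its waiting time. The task shape, the aging rate, and both function names are assumptions; the internals of `PriorityTaskQueue` are not shown in this PR.

```typescript
// Hypothetical task shape and aging rate for illustration only.
interface QueuedTask {
  id: string;
  priority: number;   // higher runs first
  enqueuedAt: number; // epoch milliseconds
}

const AGING_POINTS_PER_SECOND = 0.1; // assumption

/** Base priority plus a bonus that grows with waiting time, so old tasks are not starved. */
function effectivePriority(task: QueuedTask, now: number): number {
  const waitedSeconds = Math.max(0, now - task.enqueuedAt) / 1000;
  return task.priority + waitedSeconds * AGING_POINTS_PER_SECOND;
}

/** Removes and returns the task with the highest effective priority, if any. */
function dequeueNext(queue: QueuedTask[], now: number): QueuedTask | undefined {
  if (queue.length === 0) return undefined;
  let best = 0;
  for (let i = 1; i < queue.length; i++) {
    if (effectivePriority(queue[i], now) > effectivePriority(queue[best], now)) {
      best = i;
    }
  }
  return queue.splice(best, 1)[0];
}
```

A linear scan keeps the sketch short; a production queue would more likely use a heap and re-evaluate priorities periodically.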

## Fault Tolerance (Resilience)
- CircuitBreaker: Failure threshold monitoring, half-open recovery, fallback
- Bulkhead: Concurrent execution limiting with queue
- Utilities: retryWithBackoff(), withTimeout()
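
The circuit breaker behaviour listed above (failure threshold, half-open recovery, fallback) can be sketched as follows. This is a generic rendering of the pattern under assumed parameters, not the PR's `CircuitBreaker` class; the constructor arguments and the injectable clock are hypothetical.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Thresholds and the injectable clock are assumptions for this sketch.
class MiniCircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  /** Runs `action`, or returns `fallback()` while the breaker is open. */
  async run<T>(action: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetAfterMs) return fallback();
      this.state = "half-open"; // cool-down elapsed: allow one trial call
    }
    try {
      const result = await action();
      this.state = "closed";
      this.failures = 0;
      return result;
    } catch {
      this.failures++;
      // A failed trial call, or too many failures, (re)opens the breaker.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return fallback();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

Injecting the clock makes the cool-down testable without real waiting.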

## Resource Pooling
- WorkerPool: Pre-warmed workers for instant task assignment
- Auto-scaling based on utilization
- Idle timeout and maintenance loops

## Streaming Aggregation
- StreamingAggregator: Real-time result streaming via async iterators
- 4 aggregation modes: merge, concat, vote, custom
- Conflict detection and resolution strategies
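
As a rough guide to three of the aggregation modes named above, the sketch below shows one plausible reading of merge, concat, and vote. The function names, input shapes, and tie-breaking rule are assumptions; `StreamingAggregator` may define these modes differently.

```typescript
/** merge: shallow-merges object results; later workers win on key conflicts. */
function mergeResults(results: Record<string, unknown>[]): Record<string, unknown> {
  return Object.assign({}, ...results);
}

/** concat: flattens per-worker lists into one list, preserving worker order. */
function concatResults<T>(results: T[][]): T[] {
  return results.flat();
}

/** vote: returns the most frequent answer; ties go to the answer seen first. */
function voteResult(answers: string[]): string | undefined {
  const counts = new Map<string, number>();
  for (const answer of answers) {
    counts.set(answer, (counts.get(answer) ?? 0) + 1);
  }
  let winner: string | undefined;
  let best = 0;
  for (const [answer, count] of counts) {
    if (count > best) {
      winner = answer;
      best = count;
    }
  }
  return winner;
}
```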

## Observability
- SwarmTracer: OpenTelemetry-compatible distributed tracing
- SwarmMetricsCollector: Time-series metrics with history
- SwarmHealthChecker: Multi-check health status

## Worker Agent
- SwarmWorkerAgent: LLM-powered execution planning
- Browser automation via BrowserController interface
- Heartbeat reporting, pause/resume, progress tracking

## Integration
- SwarmService: Unified entry point integrating all components
- Enhanced API routes with streaming, health, metrics, tracing endpoints
- Server integration with optional swarm config

Issue: browseros-ai#279
## Extension Side (controller-ext)
- SwarmWindowManager: Manages worker windows for swarm mode
  - Create windows with cascading positions
  - Focus, minimize, close individual workers
  - Terminate entire swarm (close all windows)
  - Arrange windows (grid, cascade, tile layouts)
  - Capture screenshots from workers
  - Handle external window close events

- SwarmActions: Chrome extension action handlers
  - createSwarmWindow, navigateSwarmWindow, focusSwarmWindow
  - closeSwarmWindow, terminateSwarm, arrangeSwarmWindows
  - getSwarmWindows, captureSwarmScreenshot, getSwarmStats

- Registered all swarm actions in BrowserOSController

## Agent UI (React components)
- SwarmPanel: Main visualization panel
  - Shows swarm status, progress, workers
  - Compact and expanded worker views
  - Window arrangement controls
  - Result preview and metrics display

- SwarmWorkerCard: Individual worker status card
  - Visual status indicators (pending, executing, completed, failed)
  - Progress bar and duration tracking
  - Click to focus worker window

- SwarmTrigger: Chat interface button
  - Enable/disable swarm mode
  - Configure max workers and priority

- useSwarm hook: React state management
  - SSE streaming for real-time updates
  - API communication with server
  - Worker focus and termination

Issue: browseros-ai#279
- Mark all extension and UI components as complete
- Add file structure for controller-ext/actions/swarm
- Add file structure for agent/components/swarm and lib/swarm
- Update pending items to only remaining tasks

Issue: browseros-ai#279
- SwarmTrigger: simple toggle button matching ChatModeToggle style
- SwarmPanel: compact inline progress bar (not complex Card)
- SwarmWorkerCard: minimal worker dots/indicators
- Use var(--accent-orange) instead of purple
- Use TooltipProvider delayDuration={0} consistently
- Removed heavy dependencies (Card, Badge, Collapsible)
- Add SwarmTrigger to ChatFooter (shows when in agent mode)
- Add SwarmPanel above ChatFooter for progress visualization
- Update Chat component with swarm state and handlers
- Connect useSwarm hook to getAgentServerUrl() for API calls
- Swarm toggle appears next to ChatModeToggle when in Agent mode
- SSE streaming for real-time worker progress updates
- Pass swarm config to createHttpServer in main.ts
- Enables SwarmService with all features:
  - enabled: true
  - maxWorkers: 10
  - enablePooling: true
  - enableCircuitBreaker: true
  - enableTracing: true
  - loadBalancingStrategy: 'resource-aware'

This enables the /swarm API endpoints in production.
1. Fix resource leak in SSE stream (swarm.ts)
   - Clear keepAlive interval before closing on terminal events
   - Call unsubscribe() to prevent memory leaks

2. Fix retry count persistence (worker-lifecycle.ts)
   - Preserve retryCount when respawning worker
   - New worker now inherits the incremented retry count

@Suhaib3100

Thanks for the review! Both issues have been fixed in commit 1640b9e:

1. Resource leak in SSE stream (swarm.ts)

  • Now calling clearInterval(keepAlive) and unsubscribe() before stream.close() on terminal events

2. Retry count persistence (worker-lifecycle.ts)

  • Storing the incremented retry count in a variable before respawning
  • Applying the preserved retryCount to the new worker after spawn

Both fixes ensure proper cleanup and correct retry behavior.
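
For readers following the retry fix, here is a minimal sketch of the pattern described: store the incremented count first, respawn, then apply the count to the new worker. The `WorkerRecord` shape and both function names are hypothetical stand-ins, since the actual code in worker-lifecycle.ts is not shown in this thread.

```typescript
// Hypothetical shapes; the real Worker type and spawnWorker() signature are not shown.
interface WorkerRecord {
  id: string;
  retryCount: number;
  state: "spawning" | "failed";
}

const MAX_WORKER_RETRIES = 3; // "max 3 retries" per the PR summary

/** Stand-in for spawnWorker(): always returns a fresh record with retryCount 0. */
function spawnFreshWorker(id: string): WorkerRecord {
  return { id, retryCount: 0, state: "spawning" };
}

/** Returns the respawned worker, or undefined once the retry budget is spent. */
function respawnWithRetryCount(failed: WorkerRecord): WorkerRecord | undefined {
  if (failed.retryCount >= MAX_WORKER_RETRIES) return undefined;
  const nextRetryCount = failed.retryCount + 1; // keep the count before respawning
  const respawned = spawnFreshWorker(failed.id);
  respawned.retryCount = nextRetryCount; // carry it over to the new worker
  return respawned;
}
```

Without the carry-over step every respawn would start again at 0 and the limit would never be reached, which is the bug the review pointed out.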

- Add 5s timeout to CDP connection to prevent server hanging
- Make worker pool pre-warming non-blocking (background warmup)
- Initialize SwarmService even without extension bridge connected
- Remove unused imports
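
A connection timeout like the 5s one mentioned above is commonly built by racing the connection promise against a timer. The helper below is a generic sketch of that technique and is not taken from this PR; `connectToCdp` in the usage comment is a hypothetical function name.

```typescript
/** Rejects with a timeout error if `promise` does not settle within `ms` milliseconds. */
function withTimeout<T>(promise: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical usage:
//   await withTimeout(connectToCdp(), 5_000, "CDP connection")
```

Note that the race only stops waiting; it does not cancel the underlying connection attempt, so the caller may still need to close it separately.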

Development

Successfully merging this pull request may close these issues.

feat: AI Swarm Mode - Multi-agent parallel task execution
