
stack 4: feat: implement parallel sandbox/worktree orchestration #163

Open

VX1D wants to merge 9 commits into michaelshimeles:main from VX1D:stack/pr4-parallel-sandbox

Conversation


@VX1D commented Mar 3, 2026

PR4: parallel sandbox

Summary

This PR delivers the heavy parallel runtime: agent runners, sandbox/worktree isolation, conflict-aware batching, merge coordination, and crash-resilient cleanup tracking.

Why this PR exists

  • Running many agents safely requires stronger isolation and merge control.
  • Worktree/sandbox cleanup paths are high-risk and must be guarded.
  • Parallel planning/testing agents need explicit runtime coordination.

What it adds

  • New execution runtime files:
    • agent-runner.ts, runner-types.ts
    • parallel.ts, parallel-no-git.ts
    • graph-coloring.ts (conflict-aware batch partitioning)
    • file-utils.ts, hash-store.ts
  • Updates in:
    • sandbox.ts, sandbox-git.ts
    • conflict-resolution.ts, deferred.ts, execution/index.ts
    • tasks/cached-task-source.ts

PR preview

  • Parallel execution can run as worktree mode or sandbox-only mode (parallel-no-git).
  • Agent runner supports planning-first execution with selective file copy into isolated sandboxes.
  • Merge phase is guarded by a global lock to prevent concurrent merge corruption.
  • Orphaned worktrees are tracked and recovered/cleaned on next run.
  • Conflict-aware batching reduces merge collisions by grouping low-overlap tasks first.
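The conflict-aware batching above can be sketched roughly as follows. This is a hypothetical illustration, not the actual `graph-coloring.ts` code: the `PlannedTask` shape, `buildConflictGraph`, and `colorGraph` names follow the PR's description, but the real signatures may differ.

```typescript
// Hypothetical sketch of conflict-aware batching via DSatur-style coloring.
interface PlannedTask {
  id: string;
  files: string[]; // files this task is expected to modify
}

// Build adjacency: two tasks conflict when their file sets overlap.
function buildConflictGraph(tasks: PlannedTask[]): Map<string, Set<string>> {
  const graph = new Map<string, Set<string>>(
    tasks.map((t) => [t.id, new Set<string>()]),
  );
  for (let i = 0; i < tasks.length; i++) {
    for (let j = i + 1; j < tasks.length; j++) {
      if (tasks[i].files.some((f) => tasks[j].files.includes(f))) {
        graph.get(tasks[i].id)!.add(tasks[j].id);
        graph.get(tasks[j].id)!.add(tasks[i].id);
      }
    }
  }
  return graph;
}

// DSatur greedy: repeatedly pick the uncolored task whose neighbors use the
// most distinct colors (break ties by degree), then assign the lowest free color.
function colorGraph(graph: Map<string, Set<string>>): Map<string, number> {
  const colors = new Map<string, number>();
  const uncolored = new Set(graph.keys());
  while (uncolored.size > 0) {
    let best: string | undefined;
    let bestSat = -1;
    for (const id of uncolored) {
      const sat = new Set(
        [...graph.get(id)!].filter((n) => colors.has(n)).map((n) => colors.get(n)!),
      ).size;
      const bestDeg = best ? graph.get(best)!.size : -1;
      if (sat > bestSat || (sat === bestSat && graph.get(id)!.size > bestDeg)) {
        best = id;
        bestSat = sat;
      }
    }
    const used = new Set([...graph.get(best!)!].map((n) => colors.get(n)));
    let c = 0;
    while (used.has(c)) c++;
    colors.set(best!, c);
    uncolored.delete(best!);
  }
  return colors;
}
```

Tasks sharing a color (no edges between them) can then be grouped into the same parallel batch, since they touch disjoint file sets.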

Concrete scenarios this fixes

  • Multiple ralphy processes finish at once: the global merge lock serializes their merge phases, preventing concurrent-merge branch damage.
  • Process crash during parallel run: tracked worktrees are cleaned instead of accumulating forever.
  • Sandbox copy-back with partial file set: only validated/copied files are staged and committed.
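A minimal sketch of the cross-process merge lock described above, assuming the atomic-file approach the doc suggestions outline (exclusive `wx` create, 5-minute stale timeout, rename-based takeover to avoid TOCTOU races). The lock path and payload are illustrative; the real implementation in `parallel.ts` may differ.

```typescript
import {
  openSync, closeSync, writeFileSync, statSync,
  renameSync, unlinkSync, existsSync,
} from "node:fs";

const STALE_MS = 5 * 60 * 1000; // 5-minute stale timeout

function tryAcquireMergeLock(lockPath: string): boolean {
  try {
    // 'wx' fails if the file already exists, making creation atomic.
    const fd = openSync(lockPath, "wx");
    writeFileSync(fd, String(process.pid));
    closeSync(fd);
    return true;
  } catch {
    // Lock exists: check staleness, and take it over via an atomic rename so
    // two processes cannot both delete-and-recreate it (TOCTOU-safe).
    try {
      const age = Date.now() - statSync(lockPath).mtimeMs;
      if (age > STALE_MS) {
        const claim = `${lockPath}.${process.pid}.stale`;
        renameSync(lockPath, claim); // only one process wins the rename
        unlinkSync(claim);
        return tryAcquireMergeLock(lockPath);
      }
    } catch {
      /* another process raced us; fall through */
    }
    return false;
  }
}

function releaseMergeLock(lockPath: string): void {
  try {
    if (existsSync(lockPath)) unlinkSync(lockPath);
  } catch {
    /* best-effort release */
  }
}
```

If acquisition fails, the merge phase is skipped and completed branches are preserved for manual integration, as described in the PR.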

Planning/testing/agent behavior added

  • Agent runner supports multi-phase execution:
    • planning phase (optional planningModel)
    • execution phase
    • test-model orchestration path (testModel)
  • Planned file pre-copy and selective sandbox isolation.
  • Static per-agent progress display integration.
  • No-git parallel mode for sandbox-only execution.

Security/reliability hardening

  • Safer sandbox/worktree cleanup boundaries.
  • Global merge lock for cross-process merge coordination.
  • Tracked worktree cleanup and orphan recovery logic.
  • Sandbox commit path stages only validated/copied files.
  • Merge and branch cleanup paths hardened under partial failures.
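The tracked-worktree mechanism can be sketched as below. This is a hedged illustration of the approach the PR describes (atomic write-then-rename tracking plus PID liveness checks); the actual `tracked.json` schema and helper names may differ.

```typescript
import { writeFileSync, renameSync, readFileSync, existsSync } from "node:fs";

// Hypothetical tracking entry; the real schema may carry more fields.
interface TrackedWorktree {
  path: string;
  branch: string;
  pid: number;
  createdAt: number;
}

// Write-then-rename keeps the tracking file intact even if we crash mid-write.
function saveTracked(file: string, entries: TrackedWorktree[]): void {
  const tmp = `${file}.tmp.${process.pid}`;
  writeFileSync(tmp, JSON.stringify(entries, null, 2));
  renameSync(tmp, file); // atomic on POSIX filesystems
}

function loadTracked(file: string): TrackedWorktree[] {
  if (!existsSync(file)) return [];
  try {
    return JSON.parse(readFileSync(file, "utf8"));
  } catch {
    return []; // corrupt file: start fresh rather than crash
  }
}

// A worktree is orphaned when its creating process is gone.
// Signal 0 performs a liveness check without actually signalling.
function isOrphaned(entry: TrackedWorktree): boolean {
  try {
    process.kill(entry.pid, 0);
    return false; // process still running
  } catch {
    return true; // no such process
  }
}
```

On startup, entries that fail the liveness check would be cleaned up (subject to path validation against the worktree base directory) and removed from the tracking file.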

Validation

  • Included in cumulative stack verification.
  • Stack passes bun run check, bun tsc --noEmit, and bun test in order.

Note

Implement parallel sandbox/worktree orchestration for task execution and serialize merge with a 5-minute global lock across processes

Adds a state-driven parallel runner with conflict-aware batching, sandbox/worktree orchestration, global merge locking, and a static progress UI. Introduces robust sandbox sync and path validation, unified file/process locking, circuit-breaker retries, environment-aware prompt builders, and structured logging. Migrates sequential/parallel flows to TaskStateManager, includes skill compression and CSV export utilities, and expands telemetry sanitization. Entry points: [file:cli/src/execution/parallel.ts], [file:cli/src/execution/sandbox.ts], [file:cli/src/execution/task-state.ts], [file:cli/src/execution/prompt.ts].

📍Where to Start

Start with runParallel in [file:cli/src/execution/parallel.ts], then follow calls into runAgentInSandbox/runAgentInWorktree, TaskStateManager in [file:cli/src/execution/task-state.ts], and sandbox helpers in [file:cli/src/execution/sandbox.ts].

Macroscope summarized dd5630c.

vercel bot commented Mar 3, 2026

@VX1D is attempting to deploy a commit to the Goshen Labs Team on Vercel.

A member of the Team first needs to authorize it.


dosubot bot commented Mar 3, 2026

Related Documentation

4 document(s) may need updating based on files changed in this PR:

Goshen Labs's Space

AI-Assisted Merge Conflict Resolution
View Suggested Changes
@@ -1,9 +1,33 @@
 ## Merge Phase After Parallel Task Completion
 After all parallel tasks finish, the system enters an optimized merge phase that integrates the results of each batch using a chained integration branch strategy. This process is managed by the `runParallel` function, which invokes the merge phase unless the `--no-merge` CLI flag is set.
 
+### Global Merge Lock for Concurrency Safety
+The merge phase uses a **global merge lock** to prevent concurrent merge corruption when multiple ralphy processes are running simultaneously. This cross-process coordination mechanism is critical for preventing race conditions during merge operations.
+
+The lock is implemented using atomic file operations:
+- Before starting the merge phase, ralphy attempts to acquire an exclusive lock by creating a lock file with the `wx` flag (write-exclusive)
+- If the lock file already exists, ralphy checks if it's stale (older than 5 minutes) using an atomic rename operation to prevent TOCTOU (time-of-check-time-of-use) vulnerabilities
+- If another process holds a valid lock, the merge phase is skipped and branches are preserved for manual merge
+- The lock is automatically released after the merge completes or if the process crashes
+
+This ensures that only one ralphy instance can perform merge operations at a time, preventing branch corruption from concurrent git operations. If the merge lock cannot be acquired, the completed branches are preserved and logged for manual integration.
+
+### Conflict-Aware Batch Partitioning
+
+Before execution, ralphy uses **graph coloring** to intelligently partition parallel tasks into conflict-free batches. This proactive conflict prevention mechanism reduces the likelihood of merge collisions before they occur.
+
+The batching process works as follows:
+
+1. **Conflict Graph Construction**: The system builds a conflict graph where tasks are nodes and edges connect tasks that modify overlapping files. File information is extracted from planning analysis when available.
+2. **Graph Coloring**: A DSatur (degree saturation) graph coloring algorithm assigns colors to tasks such that no two tasks with the same color have file overlaps.
+3. **Batch Formation**: Tasks with the same color are grouped into batches that can run safely in parallel without file conflicts.
+4. **Batch Size Limiting**: If a color group exceeds `maxParallel`, it is split into sub-batches to respect the parallelism limit.
+
+This approach minimizes file overlaps within each batch, reducing contention during the merge phase. Tasks with known file dependencies are strategically separated into different batches, allowing conflict-free parallel execution. [Source: `execution/graph-coloring.ts`]
+
 ### Chained Integration Branch Workflow
 
-Instead of merging all completed branches directly into the original base branch, the system now creates an **integration branch** after each batch of parallel tasks. The workflow is as follows:
+Instead of merging all completed branches directly into the original base branch, the system creates an **integration branch** after each batch of parallel tasks. The workflow is as follows:
 
 1. **Batch Execution**: Tasks are executed in parallel batches (groups). Each batch operates on the current base branch (which may be an integration branch from the previous batch).
 2. **Integration Branch Creation**: After each batch completes, the system creates a new integration branch (e.g., `ralphy/integration-group-1`) from the current base branch.
@@ -32,7 +56,9 @@
 Conflicts are detected during the merge operation. If the merge fails, the system inspects the working directory for conflicted files using `git status`. The list of conflicted files is returned for further processing. [Source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/git/merge.ts#L3-L192)
 
 ### AI Conflict Resolution Workflow
-The AI engine receives a prompt listing all conflicted files and instructions to resolve each by removing conflict markers and ensuring the resulting code is valid and logically correct. The AI edits the files, stages them, and commits the merge. After AI execution, the system checks for remaining conflicts. If all conflicts are resolved, the merge is completed; if not, the merge is aborted and the branch is preserved for manual review. [Source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/conflict-resolution.ts#L8-L80)
+The AI engine receives a prompt listing all conflicted files and instructions to resolve each by removing conflict markers and ensuring the resulting code is valid and logically correct. The AI edits the files, stages them, and commits the merge. After AI execution, the system checks for remaining conflicts. If all conflicts are resolved, the merge is completed; if not, the merge is aborted and the branch is preserved for manual review.
+
+The conflict resolution module has been enhanced with improved error handling using standardized error utilities, providing more consistent error reporting across different failure scenarios. This improves debugging and allows better detection of retryable vs. permanent errors. [Source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/conflict-resolution.ts#L8-L80)
 
 If AI resolution fails or throws an error, the system logs the failure and aborts the merge, leaving the branch intact for manual inspection. Users are notified of branches that failed to merge and require manual review.
 
@@ -63,6 +89,17 @@
 ## Handling Failed Merges
 Branches that fail to merge—either due to unresolved conflicts or other errors—are preserved for manual review. The system logs which branches failed and ensures they are not deleted, allowing users to inspect and resolve issues manually. [Source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/parallel.ts#L146-L372)
 
+### Worktree Cleanup Tracking and Recovery
+The parallel execution system includes a robust worktree cleanup tracking mechanism that recovers from process crashes. This persistent tracking prevents worktree accumulation across multiple runs.
+
+The cleanup tracking system:
+- **Atomic Tracking**: Each worktree is recorded in a persistent tracking file (`.ralphy-worktrees/tracked.json`) using atomic file operations (write-then-rename) to prevent corruption
+- **Process Monitoring**: Each tracked worktree includes the process ID (PID) and creation timestamp, allowing detection of orphaned worktrees from crashed processes
+- **Crash Recovery**: On startup, ralphy automatically detects and cleans up stale worktrees from previous runs if the associated process is no longer running
+- **Safe Cleanup Boundaries**: Path validation ensures worktrees are only cleaned within the designated worktree base directory, preventing accidental deletion of unrelated directories
+
+When a parallel run completes successfully, worktrees are untracked and cleaned up. If a process crashes, the orphaned worktrees remain tracked and are automatically cleaned on the next run. This ensures worktree directories don't accumulate over time, even after unexpected failures.
+
 ## Module Documentation
 ### `git/merge.ts`
 
@@ -81,6 +118,18 @@
 - `PreMergeAnalysis`: Interface describing the result of pre-merge analysis, including branch name, files changed, and file count.
 
 See [source](https://github.com/michaelshimeles/ralphy/blob/main/cli/src/git/merge.ts) for implementation details.
+
+### `execution/graph-coloring.ts`
+
+This module implements conflict-aware batch partitioning using graph coloring algorithms to minimize file overlaps in parallel execution:
+
+- `buildConflictGraph(tasks)`: Constructs a conflict graph where nodes are tasks and edges connect tasks that modify overlapping files. Returns a `ConflictGraph` object with file overlap information.
+- `colorGraph(tasks, graph)`: Applies the DSatur (degree saturation) graph coloring algorithm to assign colors to tasks such that conflicting tasks receive different colors. Returns a map of task IDs to color numbers.
+- `batchByColor(tasks, colors, maxParallel)`: Groups tasks by color into parallel-safe batches, respecting the `maxParallel` limit by splitting large color groups into sub-batches.
+- `PlannedTask`: Interface representing a task with its associated file list, used for conflict analysis.
+- `ConflictGraph`: Interface representing the conflict relationship graph between tasks.
+
+This proactive conflict prevention mechanism reduces merge collisions by ensuring tasks in the same batch don't modify overlapping files. See [source](https://github.com/michaelshimeles/ralphy/blob/main/cli/src/execution/graph-coloring.ts) for implementation details.
 
 ## CLI Flags
 ### `--no-merge`
@@ -127,5 +176,8 @@
 - AI conflict resolution is best-effort; manual intervention may be required for complex conflicts.
 - Failed merges are logged and branches are preserved for manual review.
 - Ensure the working directory is clean before starting parallel execution to avoid unintended merge issues.
+- When running multiple ralphy instances simultaneously, the global merge lock prevents concurrent merge corruption. If a lock cannot be acquired, branches are preserved for manual merge.
+- Orphaned worktrees from crashed processes are automatically cleaned up on the next run, preventing directory accumulation.
+- Graph coloring batch partitioning works best when tasks have known file dependencies (from planning analysis). Tasks without file information are batched using the default strategy.
 
-For further details, consult the implementation in [`git/merge.ts`](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/git/merge.ts#L3-L192) and [`execution/conflict-resolution.ts`](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/conflict-resolution.ts#L8-L80).
+For further details, consult the implementation in [`git/merge.ts`](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/git/merge.ts#L3-L192), [`execution/conflict-resolution.ts`](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/conflict-resolution.ts#L8-L80), and [`execution/graph-coloring.ts`](https://github.com/michaelshimeles/ralphy/blob/main/cli/src/execution/graph-coloring.ts).
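The batch-size limiting step documented above (step 4 of the batching process) can be sketched as follows. Signatures are hypothetical; the actual `batchByColor` export in `graph-coloring.ts` may differ.

```typescript
// Hypothetical sketch: group task IDs by color, then split any color group
// larger than maxParallel into sub-batches to respect the parallelism limit.
function batchByColor(
  taskIds: string[],
  colors: Map<string, number>,
  maxParallel: number,
): string[][] {
  const byColor = new Map<number, string[]>();
  for (const id of taskIds) {
    const c = colors.get(id) ?? 0;
    if (!byColor.has(c)) byColor.set(c, []);
    byColor.get(c)!.push(id);
  }
  const batches: string[][] = [];
  for (const group of byColor.values()) {
    for (let i = 0; i < group.length; i += maxParallel) {
      batches.push(group.slice(i, i + maxParallel));
    }
  }
  return batches;
}
```

Each resulting batch contains only same-colored (non-conflicting) tasks, and no batch exceeds `maxParallel`.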


AI-Assisted Merge Conflict Resolution Workflow
View Suggested Changes
@@ -4,6 +4,8 @@
 The overall flow is:
 
 - `Group 0` (parallel tasks) → `Integration 1` (merges Group 0) → `Group 1` (parallel tasks, sees Group 0's work) → `Integration 2` (merges Group 1) → ... → `Final Merge` (merges last integration branch into the original base branch, e.g., `main`).
+
+Before executing parallel tasks, ralphy uses **conflict-aware batching via graph coloring** to intelligently group tasks that minimize the risk of merge conflicts. The system builds a conflict graph by analyzing which files each task is expected to modify (based on planning analysis or heuristics), then applies a DSatur (degree of saturation) graph coloring algorithm to partition tasks into batches. Tasks in the same batch (same "color") touch disjoint sets of files, meaning they can run in parallel with minimal conflict risk. This proactive approach reduces merge collisions before they occur, complementing the reactive AI-assisted conflict resolution that handles any remaining conflicts during the merge phase.
 
 This phase is triggered unless the `--no-merge` flag is set or the run is a dry run. Each agent still operates in an isolated worktree and branch (e.g., `ralphy/agent-1-create-auth`), but after each batch, their branches are merged into an integration branch (e.g., `ralphy/integration-group-1`).
 
@@ -22,8 +24,14 @@
 
 This process is coordinated by the parallel execution engine and integrates tightly with git operations, using commands such as `git merge`, `git branch -d`, and `git merge --abort` as needed. These optimizations and reliability features significantly reduce the time required for the merge phase and improve robustness, especially when working with many branches or in complex repository setups.
 
+#### Global Merge Lock for Concurrent Safety
+When multiple ralphy processes run concurrently, the merge phase is protected by a **global cross-process merge lock**. This lock prevents simultaneous merge operations from corrupting branch state or causing race conditions. The lock is acquired using atomic file operations before any merge begins, and is automatically released after completion. If another ralphy instance is already merging, the lock acquisition will fail gracefully, and the current instance will skip the merge phase while preserving all completed branches for later manual merge. The lock includes stale detection: if a lock is held for more than 5 minutes (the default timeout), it is automatically cleaned up and the merge can proceed. This coordination mechanism is essential for maintaining repository integrity when parallel ralphy runs complete simultaneously or when multiple developers run ralphy concurrently in the same repository.
+
+#### Tracked Worktree Cleanup and Crash Recovery
+Ralphy tracks all worktree creation and cleanup operations in a persistent tracking file (`.ralphy-worktrees/tracked.json`). This tracking survives process crashes and system restarts. Each worktree entry records its directory path, branch name, creation timestamp, and the process ID that created it. On every ralphy startup, the system checks for orphaned worktrees—those belonging to processes that are no longer running. Orphaned worktrees are automatically cleaned up by removing their directories and git worktree references. This crash-resilient cleanup prevents worktree accumulation after abnormal terminations such as `kill -9`, system crashes, or terminal disconnections. The tracking file uses atomic rename operations to ensure data integrity even during concurrent access from multiple ralphy processes. Successfully cleaned worktrees are removed from the tracking file, while worktrees that fail cleanup (e.g., due to file locks) are retained for retry on the next run.
+
 ### AI-Assisted Merge Conflict Resolution
-If a merge conflict occurs during this phase, ralphy automatically invokes AI-assisted conflict resolution. The system builds a prompt for the AI engine (such as Claude, Codex, Cursor, etc.) instructing it to read the conflict markers in the affected files, understand both sides of the conflict, and produce a resolved version that combines the changes appropriately. The AI is required to remove all conflict markers and ensure the resulting code is syntactically and logically correct. Once the AI resolves the conflicts, it stages the resolved files with `git add` and completes the merge with `git commit --no-edit`. If the AI cannot resolve all conflicts, the merge is aborted and the branch is marked as failed [source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/conflict-resolution.ts#L8-L81).
+If a merge conflict occurs during this phase, ralphy automatically invokes AI-assisted conflict resolution. The system builds a prompt for the AI engine (such as Claude, Codex, Cursor, etc.) instructing it to read the conflict markers in the affected files, understand both sides of the conflict, and produce a resolved version that combines the changes appropriately. The AI is required to remove all conflict markers and ensure the resulting code is syntactically and logically correct. Once the AI resolves the conflicts, it stages the resolved files with `git add` and completes the merge with `git commit --no-edit`. If the AI cannot resolve all conflicts, the merge is aborted and the branch is marked as failed. The conflict resolution module includes improved error handling using standardized error messages for better diagnostics and logging.
 
 Example AI conflict resolution prompt:
 ```
@@ -64,16 +72,20 @@
 #### Example Workflow
 ```mermaid
 graph TD
-    A["Parallel Task Execution"] --> B["Completed Branches"]
-    B --> C["Merge Phase (unless --no-merge)"]
-    C --> D["Merge into Base Branch"]
-    D --> E{"Merge Conflict?"}
-    E -- No --> F["Delete Merged Branch"]
-    E -- Yes --> G["AI Conflict Resolution"]
-    G -- Success --> F
-    G -- Failure --> H["Abort Merge, Preserve Branch"]
-    F --> I["Restore Starting Branch"]
-    H --> I
+    A["Graph Coloring Batching"] --> B["Parallel Task Execution"]
+    B --> C["Completed Branches"]
+    C --> D["Global Merge Lock Acquired"]
+    D --> E["Merge Phase (unless --no-merge)"]
+    E --> F["Merge into Base Branch"]
+    F --> G{"Merge Conflict?"}
+    G -- No --> H["Delete Merged Branch"]
+    G -- Yes --> I["AI Conflict Resolution"]
+    I -- Success --> H
+    I -- Failure --> J["Abort Merge, Preserve Branch"]
+    H --> K["Release Global Merge Lock"]
+    J --> K
+    K --> L["Restore Starting Branch"]
+    L --> M["Cleanup Orphaned Worktrees"]
 ```
 
 For more details, see the [README documentation](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/README.md#L153-L286) and the [merge implementation](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/git/merge.ts#L16-L182).


Copilot AI Engine Integration
View Suggested Changes
@@ -50,10 +50,17 @@
 ### CLI Flags and Model Override
 The CLI supports the `--copilot` flag to select the Copilot engine. Model override is supported via the `--model <name>` flag, which passes the specified model to the Copilot CLI. The argument parser sets the engine to `'copilot'` when the flag is present and includes the model override in the runtime options.
 
+**Planning and Test Model Support:**
+- `--planning-model <name>`: Use a dedicated model for planning-related tasks. During parallel execution, a separate planning phase runs before the main execution, allowing the agent to analyze which files are needed and create an execution plan. This improves efficiency by pre-identifying file dependencies.
+- `--test-model <name>`: Use a dedicated model for test-related tasks. When tasks involve testing, this model is used to orchestrate test execution and validation, providing specialized handling for test workflows.
+
+These model options enable multi-phase agent execution: planning → execution → testing, with each phase using the most appropriate model for that task type.
+
 Example usage:
 ```sh
 ralphy --copilot "Refactor the authentication module"
 ralphy --copilot --model v1.2 "Implement a new search feature"
+ralphy --copilot --planning-model gpt-4 --test-model gpt-3.5-turbo --parallel "Add test coverage"
 ```
 
 You can also pass engine-specific arguments to the underlying Copilot CLI (or any engine) using the `--` separator:
@@ -72,6 +79,9 @@
 
 #### Parallel Execution Reliability and Fallback
 - The CLI checks if git worktrees are available and usable. If not, or if worktree creation fails for any reason, it automatically falls back to sandbox mode for parallel execution. This ensures robust operation across different repository configurations and platforms.
+- **Conflict-Aware Batching:** Parallel batches use file-overlap graph coloring (via DSatur algorithm) to reduce merge conflicts. Tasks that modify overlapping files are assigned different colors and run in separate batches, minimizing the likelihood of merge collisions and improving parallel execution efficiency.
+- **Worktree Cleanup Tracking:** Worktree cleanup state is persisted to disk and automatically recovered across process crashes. If ralphy crashes during parallel execution, orphaned worktrees are detected and cleaned up on the next run, preventing accumulation of stale worktrees and ensuring reliable cleanup.
+- **Global Merge Lock:** The merge phase uses a cross-process atomic lock (via file system operations) to prevent concurrent merge corruption when multiple ralphy instances run simultaneously. This ensures merge operations are coordinated across all ralphy processes, preventing race conditions and repository corruption.
 - When running in parallel mode, if any retryable error (such as rate limits or quota exceeded) is detected, the run stops early to avoid unnecessary retries and allows you to rerun later.
 - Before merging branches in parallel mode, any local changes are stashed and restored after the merge phase to prevent conflicts.
 - If the repository has no commits, parallel/branch mode is disabled until an initial commit is made, and the CLI provides a clear prompt to the user.
@@ -115,6 +125,7 @@
 **How sandbox mode works:**
 - **Symlinks** read-only dependencies (`node_modules`, `.git`, `vendor`, `.venv`, `.pnpm-store`, `.yarn`, `.cache`)
 - **Copies** source files that agents might modify (`src/`, `app/`, `lib/`, config files, etc.)
+- **Intelligent File Selection:** When planning model is enabled, sandbox mode uses semantic file analysis to pre-copy only the files identified during the planning phase, reducing overhead and improving performance. This selective isolation ensures agents work with just the files they need.
 - **Robust cleanup:** When cleaning up sandboxes, ralphy uses a retrying deletion mechanism that attempts to remove sandbox directories up to 5 times with exponential backoff (500ms, 1s, 2s, 4s, 8s) if files are locked (e.g., by the OS, antivirus, or open handles). If the directory remains locked after all retries, a warning is logged, but the runner continues without crashing. This ensures that temporary file locks—common on Windows—do not cause unrecoverable errors or halt the workflow. Leftover locked folders can be cleaned up by the OS or in subsequent runs.
 
 **Benefits:**
@@ -123,6 +134,7 @@
 - Changes sync back to the original directory after each task
 - **Automatic fallback:** If worktree creation fails or is not supported, sandbox mode is used automatically for reliability
 - **Graceful failure:** Locked or busy files/folders during cleanup do not crash the runner; warnings are logged and execution continues
+- **Selective isolation with planning:** Pre-planning identifies required files, reducing unnecessary file copying and improving execution speed
 
 **When to use worktrees instead (default):**
 - Need full git history access in each sandbox


Parallel Execution with Group Dependencies
View Suggested Changes
@@ -82,5 +82,57 @@
 
 This workflow ensures that each group’s tasks have full visibility of all prior group commits, and that merge conflicts are minimized because overlapping changes are integrated before new work begins.
 
+### Conflict-Aware Batching via Graph Coloring
+When tasks share file dependencies, running them in parallel can lead to merge conflicts. To minimize this, Ralphy uses **graph coloring** to detect potential conflicts and intelligently group tasks into batches.
+
+**How It Works:**
+1. For each task, the system identifies which files it will modify (from planning metadata or file patterns).
+2. A conflict graph is built where tasks that touch overlapping files are connected.
+3. Graph coloring assigns each task a "color" such that no two connected tasks share the same color.
+4. Tasks with the same color can run together without conflicts, while different colors run sequentially.
+
+This **conflict-aware batching** reduces merge collisions by ensuring that tasks operating on disjoint file sets run in parallel, while potentially conflicting tasks are serialized across batches. The implementation uses the DSatur algorithm for efficient graph coloring [source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/graph-coloring.ts).
+
+### Global Merge Lock
+When multiple Ralphy processes run simultaneously (e.g., different feature branches or CI/CD pipelines), concurrent merge operations can corrupt the Git repository. To prevent this, the merge phase uses a **global lock** implemented with atomic file operations.
+
+**Lock Mechanism:**
+- Before merging branches, Ralphy acquires a global lock file (`.ralphy/.global-merge.lock`) using exclusive write flags.
+- The lock is held during the entire merge phase to ensure only one process can merge at a time.
+- Stale locks (older than 5 minutes) are automatically detected and removed.
+- If the lock cannot be acquired, the merge phase is skipped and branches are preserved for manual merge.
+
+This cross-process coordination is critical for parallel execution reliability, preventing lockstep branch damage and repository corruption [source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/parallel.ts#L93-L160).
+
+### Worktree Cleanup Tracking and Crash Recovery
+Git worktrees are temporary directories used to isolate parallel task execution. If a process crashes before cleanup, these worktrees can accumulate and consume disk space. Ralphy addresses this with **persistent cleanup tracking**.
+
+**Tracking Mechanism:**
+- Each created worktree is recorded in `.ralphy-worktrees/tracked.json` with metadata (path, branch, PID, timestamp).
+- The tracking file is written atomically using rename operations to prevent corruption.
+- On startup, Ralphy checks for orphaned worktrees (where the parent process is no longer running).
+- Stale worktrees are automatically cleaned up, preventing resource leaks across crashes.
+
+**Safety Features:**
+- Path validation ensures cleanup operations only target legitimate worktree directories.
+- Process liveness checks (via `kill(pid, 0)`) distinguish between active and dead processes.
+- Cleanup failures are logged but don't block execution, ensuring robust operation even in edge cases.
+
+This crash recovery mechanism improves reliability by handling abnormal termination gracefully and maintaining a clean workspace [source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/parallel.ts#L63-L177).
+
+### Sandbox-Only Parallel Mode
+In addition to worktree-based isolation, Ralphy supports a **sandbox-only mode** as an alternative isolation strategy. This mode is faster for large repositories and works without git worktrees.
+
+**Key Differences:**
+- **Worktree Mode:** Creates full git worktrees with separate branches, suitable for git-based workflows.
+- **Sandbox Mode:** Uses lightweight sandboxes with symlinks for read-only resources (node_modules, .git) and copies source files, avoiding git operations entirely.
+
+**When to Use Sandbox Mode:**
+- Large repositories where worktree creation is slow.
+- Systems where git worktrees are unavailable or problematic.
+- Scenarios where git history isn't needed for parallel execution.
+
+The `parallel-no-git.ts` implementation provides sandbox-only execution without merge phases, offering a faster alternative for certain workflows [source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/parallel-no-git.ts).
+
 ### Summary
-By sequencing parallel task execution with integration branches, Ralphy guarantees that dependencies between groups are respected, prior changes are visible to subsequent work, and merge conflicts are significantly reduced. This approach is essential for reliable, scalable automation of complex, multi-step development workflows.
+By combining integration branches, conflict-aware batching, global merge locks, crash-resilient cleanup tracking, and flexible isolation modes, Ralphy provides a comprehensive parallel execution system. Together these features ensure that dependencies between groups are respected, prior changes are visible to subsequent work, and merge conflicts are minimized, while preserving reliability and safety even under adverse conditions. This approach is essential for scalable automation of complex, multi-step development workflows.


@greptile-apps
Contributor

greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR introduces a substantial parallel execution runtime (~9,700 lines added/modified) including agent runners, sandbox/worktree isolation, conflict-aware graph-colored batching, a cross-process TaskStateManager, a global merge lock, and crash-resilient worktree tracking. The architecture is sound and addresses real safety gaps.

Critical issues found:

  • commitSandboxChanges error recovery restores to baseBranch instead of the saved currentBranch (sandbox-git.ts:190): on any commit failure, git is left on the base branch rather than the branch the caller was on, corrupting state for subsequent serialized commits in the same mutex queue.
  • loadState silently resets all task state on any read error (task-state.ts:527): a transient I/O error or file lock during claimTaskForExecution causes this.tasks to be reset to an empty Map, and the next persistState call then atomically overwrites the state file with empty content — permanently destroying all accumulated task progress.
  • Planning rejection path is missing result.tasksFailed++, notifyTaskFailed, and staticAgentDisplay.agentComplete (parallel.ts:636): tasks that fail in the planning phase are silently excluded from the execution result count, no desktop notification is sent, and the static display keeps showing the agent as running.

Confidence Score: 1/5

  • Not safe to merge — data-loss bug in task state manager and incomplete planning-failure handling create correctness and user-visibility issues.
  • Three critical bugs remain: (1) task state can be permanently wiped on transient I/O errors (data loss), (2) git is left on the wrong branch after commit failures (state corruption), (3) planning-phase failures are silently undercounted and not displayed (missing notifications). The data-loss bug in loadState is particularly severe because it silently destroys all accumulated task progress on the next program run if any transient error occurs during state loading.
  • cli/src/execution/task-state.ts (data-loss on read error), cli/src/execution/sandbox-git.ts (wrong branch on error recovery), and cli/src/execution/parallel.ts (incomplete planning-failure handling).

Last reviewed commit: dd5630c

Comment on lines +774 to +778
export async function createHashStore(taskId: string, projectRoot?: string): Promise<FileHashStore> {
	const store = new FileHashStore(taskId, projectRoot);
	await store.initialize();
	return store;
}

🟠 High execution/hash-store.ts:774

The createHashStore function passes taskId directly to the FileHashStore constructor without validation. The constructor uses taskId in join(this.baseDir, taskId) to construct this.taskDir. If taskId contains path traversal sequences like ../../sensitive_dir, taskDir resolves outside the hash store base directory. Since cleanup() calls rmSync(this.taskDir, { recursive: true, force: true }), an attacker-controlled or malformed taskId can cause arbitrary directory deletion anywhere on the filesystem.

+export async function createHashStore(taskId: string, projectRoot?: string): Promise<FileHashStore> {
+	// Validate taskId to prevent path traversal
+	if (!taskId || typeof taskId !== 'string') {
+		throw new HashStoreError('taskId is required and must be a string');
+	}
+	if (taskId.includes('/') || taskId.includes('\\') || taskId.includes('..')) {
+		throw new HashStoreError('taskId cannot contain path separators or traversal sequences');
+	}
+	
+	const store = new FileHashStore(taskId, projectRoot);
	await store.initialize();
	return store;
}
Evidence trail:
cli/src/execution/hash-store.ts lines 774-776 (createHashStore function passes taskId without validation), line 225 (constructor uses `join(this.baseDir, taskId)` without sanitization), line 674 (`rmSync(this.taskDir, { recursive: true, force: true })` in cleanup method). Verified at commit REVIEWED_COMMIT.


for (const [_filePath, reference] of Object.entries(this.index.files)) {
	uniqueHashes.add(reference.hash);


🟢 Low execution/hash-store.ts:614

getStats returns an inflated compressedSize because it sums the physical file size once per logical reference, even when multiple files are deduplicated to the same physical hash. The resulting compressedSize can exceed actual disk usage by multiples. Consider tracking unique hash files to compute the true physical size.
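A dedup-aware computation along the lines the comment suggests can be sketched as follows; `FileReference` and the index shape here are illustrative placeholders, not the actual hash-store types:

```typescript
// Hypothetical sketch: compute physical (deduplicated) size by counting each
// unique content hash once. `FileReference` mirrors the comment's description,
// not the actual hash-store index.
interface FileReference {
	hash: string;
	compressedSize: number; // size of the shared physical {hash}.gz file
}

function physicalSize(files: Record<string, FileReference>): number {
	const seen = new Set<string>();
	let total = 0;
	for (const ref of Object.values(files)) {
		if (seen.has(ref.hash)) continue; // deduplicated: same physical file on disk
		seen.add(ref.hash);
		total += ref.compressedSize;
	}
	return total;
}
```

With two logical files sharing one hash, the physical total counts that hash's file once instead of twice.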

Evidence trail:
cli/src/execution/hash-store.ts lines 357-362 (hashPath is derived from hash, confirming deduplication shares same physical file), lines 393-397 (multiple logical files get same hashPath when deduplicated), lines 612-637 (getStats iterates all files and sums compressedSize per reference without checking uniqueHashes)

Comment on lines +28 to +40
/**
 * Create a promise that rejects after a timeout
 */
function createTimeoutPromise(timeoutMs: number, operation: string): Promise<never> {
	return new Promise((_, reject) => {
		const timer = setTimeout(() => {
			reject(new Error(`${operation} timed out after ${timeoutMs}ms`));
		}, timeoutMs);

		// Prevent memory leak if promise settles before timeout
		return () => clearTimeout(timer);
	});
}

🟡 Medium execution/hash-store.ts:28

createTimeoutPromise/withTimeout leave orphaned timers that later reject after the race settles, keeping the event loop alive and causing crashes. Suggest: don’t return a cleanup from the Promise executor; keep the timer handle and clear it when the wrapped promise settles (e.g., in finally), or use an AbortController to cancel the timeout and the underlying work.

+function createTimeoutPromise(timeoutMs: number, operation: string): Promise<never> {
+	let timer: ReturnType<typeof setTimeout> | undefined;
+	return new Promise((_, reject) => {
+		timer = setTimeout(() => {
+			reject(new Error(`${operation} timed out after ${timeoutMs}ms`));
+		}, timeoutMs);
+	}).finally(() => {
+		if (timer) clearTimeout(timer);
+	});
+}
Also found in 1 other location(s)

cli/src/execution/sandbox.ts:1070

The scheduleBackgroundCleanup function creates a setTimeout timer but fails to call .unref() on it. In Node.js, active timers prevent the process from exiting naturally even after the main event loop is empty. This will cause the CLI process to hang for SANDBOX_BACKGROUND_CLEANUP_DELAY_MS (e.g., 5 minutes) after the command finishes, unless cancelScheduledCleanups is explicitly called in every possible exit path. For background maintenance tasks like this, using .unref() is standard practice to allow the process to terminate once its work is done.
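Both timer problems can be handled with one pattern: clear the timer as soon as the race settles, and `unref()` it so it can never pin the process open. A sketch under Node.js, with names that are illustrative rather than taken from the ralphy code:

```typescript
// Leak-free timeout race: the timer is cleared when the wrapped promise
// settles (win or lose), and unref() lets the process exit even if the
// timer is the only thing left on the event loop.
async function withTimeout<T>(work: Promise<T>, timeoutMs: number, operation: string): Promise<T> {
	let timer: ReturnType<typeof setTimeout> | undefined;
	const timeout = new Promise<never>((_, reject) => {
		timer = setTimeout(() => {
			reject(new Error(`${operation} timed out after ${timeoutMs}ms`));
		}, timeoutMs);
		timer.unref(); // don't keep the process alive just for this timer
	});
	try {
		return await Promise.race([work, timeout]);
	} finally {
		clearTimeout(timer); // runs whether work won or the timeout fired
	}
}
```

Note that clearing inside the `finally` of the race (rather than a `finally` on the timeout promise itself) matters: if `work` wins, the timeout promise never settles, so a `finally` attached to it would never run.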

Evidence trail:
cli/src/execution/hash-store.ts lines 28-43 (REVIEWED_COMMIT). The `createTimeoutPromise` function returns a cleanup function `() => clearTimeout(timer)` from the Promise executor, but Promise executor return values are ignored per the JavaScript/TypeScript language specification. The cleanup is never invoked, so timers created by `setTimeout` persist until they fire. The function is used by `withTimeout` on lines 40-42 and called in `storeCompressed` (line 319) and `storeUncompressed` (line 327) with 30-second timeout.

		}

		const storedHash = await this.getHash(filePath);
		if (!storedHash) {

🟡 Medium execution/hash-store.ts:561

hasChanged calls this.getHash(filePath) without error handling, but getHash invokes initialize() which performs synchronous file operations (mkdirSync, readFileSync). If initialize throws (e.g., permission denied on the store directory), the error propagates and crashes instead of returning true to indicate the file should be treated as changed. Consider wrapping the getHash call in a try/catch to return true on any initialization failure.
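The suggested fail-open behavior can be sketched as a standalone wrapper; the `getHash` shape below follows the comment's description of the store and is not verified against the actual class:

```typescript
// Sketch of the defensive wrapper the comment proposes. `HashLookup` is a
// hypothetical minimal interface, not the real FileHashStore type.
interface HashLookup {
	getHash(filePath: string): Promise<string | null>;
}

async function hasChangedSafe(store: HashLookup, filePath: string, currentHash: string): Promise<boolean> {
	try {
		const stored = await store.getHash(filePath);
		if (!stored) return true; // never hashed before: treat as changed
		return stored !== currentHash;
	} catch {
		// Store unavailable (e.g. EACCES while initializing the store directory):
		// fail open so the file is re-processed instead of crashing the run.
		return true;
	}
}
```

Failing open here is the safe default: re-hashing a file is cheap, while crashing mid-run is not.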

Evidence trail:
cli/src/execution/hash-store.ts lines 555-572 (hasChanged method with no try/catch around getHash), lines 538-547 (getHash calls initialize), lines 250-291 (initialize has try/catch but rethrows as HashStoreError on lines 286-290), lines 252-255 (mkdirSync calls), line 259 (readFileSync call)

Comment on lines +97 to +101
export function batchByColor(
	tasks: PlannedTask[],
	colors: Map<string, number>,
	maxParallel: number,
): Map<number, PlannedTask[]> {

🟠 High execution/graph-coloring.ts:97

When maxParallel is 0, the batch splitting loop for (let i = 0; i < batch.length; i += maxParallel) never increments i, causing a synchronous infinite loop that hangs the application. Consider adding a guard at the start of batchByColor to throw or clamp the value when maxParallel <= 0.

-export function batchByColor(
-	tasks: PlannedTask[],
-	colors: Map<string, number>,
-	maxParallel: number,
-): Map<number, PlannedTask[]> {
+export function batchByColor(
+	tasks: PlannedTask[],
+	colors: Map<string, number>,
+	maxParallel: number,
+): Map<number, PlannedTask[]> {
+	if (maxParallel <= 0) {
+		throw new Error(`maxParallel must be positive, got ${maxParallel}`);
+	}
	const batches = new Map<number, PlannedTask[]>();
Evidence trail:
cli/src/execution/graph-coloring.ts lines 97, 124-129 (REVIEWED_COMMIT): Function `batchByColor` is exported with `maxParallel: number` parameter, and the loop `for (let i = 0; i < batch.length; i += maxParallel)` at line 127 would infinite loop when maxParallel <= 0. cli/src/cli/args.ts line 149: CLI parsing has `|| 3` fallback but this doesn't protect the exported function from direct calls with invalid values, and negative values bypass this check.

	mkdirSync(sandboxDir, { recursive: true });

	// If selective isolation is requested (filesToCopy provided)
	if (filesToCopy && Array.isArray(filesToCopy) && filesToCopy.length > 0) {

🟠 High execution/agent-runner.ts:308

When filesToCopy is provided, the selective isolation block prepares the sandbox but falls through to createSandbox, which overwrites the selective isolation by copying all files from the original directory. The agent runs in a full environment instead of the restricted one, defeating the performance optimization. Consider adding return or moving the agent execution into this block to preserve selective isolation.

Also found in 1 other location(s)

cli/src/execution/sandbox.ts:885

symlinkSharedResources throws on Windows file symlink creation failures without the expected fallback handled by the caller.

The code at lines 916-920 catches symlink creation errors (which are common on Windows for files when running without admin privileges) and re-throws them, with a comment explicitly stating "caller can fall back to copying".

However, the caller in cli/src/execution/agent-runner.ts (runAgentInSandbox) does not implement this fallback logic. It simply catches the error and terminates the agent execution with a failure result. This causes the selective isolation feature (and thus the agent) to fail completely on Windows if any shared resource in the configuration is a file rather than a directory, and the user lacks symlink privileges.
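The control-flow fix amounts to making each isolation strategy terminate its own branch instead of falling through to the full-copy path. A schematic sketch with placeholder names (not the actual agent-runner API):

```typescript
// Schematic only: each strategy returns its own result, so the selective
// branch can never fall through into the full-copy path. All names here
// are hypothetical illustrations of the control flow.
type SandboxResult = { dir: string; mode: "selective" | "chunked" | "full" };

function prepareSandbox(filesToCopy?: string[], useChunking = false): SandboxResult {
	if (filesToCopy && filesToCopy.length > 0) {
		// Selective isolation: copy only the requested files, then STOP here.
		return { dir: "/sandbox/selective", mode: "selective" };
	}
	if (useChunking) {
		// Semantic chunking branch: also returns instead of falling through.
		return { dir: "/sandbox/chunked", mode: "chunked" };
	}
	// Only reached when no restricted strategy applies.
	return { dir: "/sandbox/full", mode: "full" };
}
```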

Evidence trail:
cli/src/execution/agent-runner.ts lines 308-316 (first if block for filesToCopy - no return statement), lines 317-391 (else-if block for semantic chunking - has return at lines 384-390), lines 393-400 (createSandbox call after the if-else chain). The first if block at line 308 sets up selective isolation but falls through to createSandbox at line 393.

Comment on lines +702 to +708
		}

		const baseDir = resolve(projectRoot, HASH_STORE_DIR);

		if (!existsSync(baseDir)) {
			return 0;
		}

🟢 Low execution/hash-store.ts:702

The cleanupOldStores method is declared async but uses synchronous I/O (readdirSync, readFileSync, rmSync), so it blocks the event loop while iterating through directories. When the hash store contains many old entries, the CLI freezes and cannot respond to concurrent async tasks like UI updates or signal handlers.

-static async cleanupOldStores(
+static cleanupOldStores(
 		projectRoot: string = process.cwd(),
 		maxAgeMs: number = HASH_STORE_MAX_AGE_MS,
-	): Promise<number> {
+	): number {
 		if (!ENABLE_HASH_STORE) {
 			return 0;
 		}
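An async variant of the same scan, sketched with `node:fs/promises` so directory iteration yields to the event loop; the directory layout and age policy are assumptions mirroring the comment, not the actual ralphy constants:

```typescript
// Sketch: same cleanup scan, but non-blocking. Each await yields to the
// event loop, so UI updates and signal handlers stay responsive even when
// the store contains many old entries.
import { readdir, stat, rm } from "node:fs/promises";
import { join } from "node:path";

async function cleanupOldStoresAsync(baseDir: string, maxAgeMs: number): Promise<number> {
	let entries: string[];
	try {
		entries = await readdir(baseDir);
	} catch {
		return 0; // base directory missing: nothing to clean
	}
	let removed = 0;
	const now = Date.now();
	for (const name of entries) {
		const dir = join(baseDir, name);
		try {
			const info = await stat(dir);
			if (info.isDirectory() && now - info.mtimeMs > maxAgeMs) {
				await rm(dir, { recursive: true, force: true });
				removed++;
			}
		} catch {
			// Ignore entries that vanish mid-scan
		}
	}
	return removed;
}
```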
Evidence trail:
cli/src/execution/hash-store.ts lines 697-757 (viewed at REVIEWED_COMMIT): Method declared `static async cleanupOldStores(...)` at line 697. Synchronous I/O operations used: `readdirSync` (line 714), `existsSync` (lines 707, 725), `readFileSync` (line 727), `statSync` (line 737), `rmSync` (line 742).

Comment on lines +542 to +547
	if (prdSource === "markdown" || prdSource === "yaml") {
		const srcPath = join(originalDir, prdFile);
		const destPath = join(targetDir, prdFile);
		if (existsSync(srcPath)) {
			copyFileSync(srcPath, destPath);
		}

🟡 Medium execution/agent-runner.ts:542

copyPrdResources calls copyFileSync with destPath = join(targetDir, prdFile) without ensuring the parent directory exists. When prdFile contains a subdirectory path (e.g., docs/prd.md), the copy throws ENOENT because the directory structure doesn't exist in the fresh sandbox/worktree.

 	if (prdSource === "markdown" || prdSource === "yaml") {
 		const srcPath = join(originalDir, prdFile);
 		const destPath = join(targetDir, prdFile);
 		if (existsSync(srcPath)) {
+			const parentDir = dirname(destPath);
+			if (!existsSync(parentDir)) {
+				mkdirSync(parentDir, { recursive: true });
+			}
 			copyFileSync(srcPath, destPath);
 		}
Evidence trail:
cli/src/execution/agent-runner.ts lines 1-2 (imports showing `copyFileSync, mkdirSync` from `node:fs`), lines 536-556 (the `copyPrdResources` function). Line 543: `const destPath = join(targetDir, prdFile)`. Line 545: `copyFileSync(srcPath, destPath)` - no `mkdirSync` call precedes this to create parent directories.

Comment on lines +97 to +106
export function hasExceededMaxDeferrals(
	type: TaskSourceType,
	task: Task,
	workDir: string,
	maxRetries: number,
	prdFile?: string,
): boolean {
	const count = getDeferredCount(type, task, workDir, prdFile);
	return count >= maxRetries;
}

🟢 Low execution/deferred.ts:97

hasExceededMaxDeferrals returns true when count >= maxRetries, so maxRetries: 0 incorrectly blocks initial execution (fresh task count 0 fails 0 >= 0). The function name promises "exceeded" semantics but implements "reached" semantics. Consider using > instead, or renaming to match the actual behavior.

Suggested change
 export function hasExceededMaxDeferrals(
 	type: TaskSourceType,
 	task: Task,
 	workDir: string,
 	maxRetries: number,
 	prdFile?: string,
 ): boolean {
 	const count = getDeferredCount(type, task, workDir, prdFile);
-	return count >= maxRetries;
+	return count > maxRetries;
 }
Also found in 1 other location(s)

cli/src/execution/parallel-no-git.ts:286

Missing retry tracking for resolved retryable errors. When the agent runner resolves with a retryable error (lines 286-292), the code transitions the task to DEFERRED but fails to call recordDeferredTask. Unlike the rejection path (line 253), this omits incrementing the persistent retry counter. Consequently, tasks failing in this manner can be retried indefinitely across multiple CLI runs without ever hitting the maxRetries threshold.

Evidence trail:
cli/src/execution/deferred.ts lines 94-107 (REVIEWED_COMMIT): function `hasExceededMaxDeferrals` uses `return count >= maxRetries;` at line 105, but the function name says 'exceeded' which semantically implies `>` not `>=`.

Comment on lines +356 to +362

		// Check if we already have this hash
		const hashFileName = `${hash}.gz`;
		const hashPath = join("content", hashFileName);
		const absoluteHashPath = join(this.taskDir, hashPath);

		const alreadyExists = existsSync(absoluteHashPath);

🟡 Medium execution/hash-store.ts:356

Dedup/indexing assume .gz and miss uncompressed files, causing redundant writes and broken references. Suggest: check for both {hash}.gz and {hash}, use the actual existing path in the index, and persist/read the compression state from metadata so alreadyExists and hashPath are consistent.

-		// Check if we already have this hash
-		const hashFileName = `${hash}.gz`;
-		const hashPath = join("content", hashFileName);
+		// Check if we already have this hash (look for both compressed and uncompressed)
+		const hashFileNameCompressed = `${hash}.gz`;
+		const hashFileNameUncompressed = `${hash}`;
+		const compressedHashPath = join("content", hashFileNameCompressed);
+		const uncompressedHashPath = join("content", hashFileNameUncompressed);
 		const absoluteHashPath = join(this.taskDir, hashPath);
 
-		const alreadyExists = existsSync(absoluteHashPath);
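An extension-agnostic lookup along the lines the comment proposes might look like the sketch below; the `content/` directory layout follows the comment's description but is otherwise illustrative:

```typescript
// Sketch: prefer the compressed file, fall back to the uncompressed one,
// and report which form exists so the index can record the real path.
// Layout and naming are assumptions, not the actual hash-store code.
import { existsSync } from "node:fs";
import { join } from "node:path";

function findStoredHash(taskDir: string, hash: string): { path: string; compressed: boolean } | null {
	const compressed = join(taskDir, "content", `${hash}.gz`);
	if (existsSync(compressed)) return { path: compressed, compressed: true };
	const plain = join(taskDir, "content", hash);
	if (existsSync(plain)) return { path: plain, compressed: false };
	return null; // not stored yet: caller must write it
}
```

Recording the returned `compressed` flag in the index metadata keeps `alreadyExists` and the indexed path consistent, which is exactly the mismatch the comment describes.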
Evidence trail:
cli/src/execution/hash-store.ts lines 357-362: `const hashFileName = `${hash}.gz`; ... const alreadyExists = existsSync(absoluteHashPath);` - existence check only looks for .gz files. Lines 370-372: `await this.storeUncompressed(absolutePath, absoluteHashPath.replace(".gz", ""));` - uncompressed files are stored WITHOUT .gz extension. This mismatch means the existence check at line 362 will never find previously stored uncompressed files, causing redundant writes.

* @param useJitter - Add random jitter (0-25% of delay)
* Global circuit breaker instance
*/
export const circuitBreaker = ConnectionStateManager.getInstance();

Shared circuit breaker blocks all parallel agents on one agent's failure

circuitBreaker is a module-level singleton (line 185). When any single agent experiences 3 consecutive connection errors, the circuit opens and applies the 30-second recovery timeout to all concurrently running agents — including those with healthy connections. This defeats the purpose of parallel execution.

The circuit breaker should be scoped per-agent or per-engine endpoint, not shared globally. Consider:

  • Creating a per-agent instance via a factory function, or
  • Accepting an optional circuit breaker parameter in withRetry to allow callers to control isolation
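Either option reduces to giving each agent its own breaker instance. A minimal illustrative sketch — the real `ConnectionStateManager` API is not shown in this diff, so the breaker below is hypothetical:

```typescript
// Hypothetical per-agent circuit breaker. Thresholds (3 failures, 30s
// recovery) follow the numbers quoted in the review comment.
class CircuitBreaker {
	private consecutiveFailures = 0;
	private openedAt = 0;
	constructor(private threshold = 3, private recoveryMs = 30_000) {}

	canProceed(now = Date.now()): boolean {
		if (this.consecutiveFailures < this.threshold) return true;
		return now - this.openedAt >= this.recoveryMs; // half-open after timeout
	}
	recordFailure(now = Date.now()): void {
		this.consecutiveFailures++;
		if (this.consecutiveFailures === this.threshold) this.openedAt = now;
	}
	recordSuccess(): void {
		this.consecutiveFailures = 0;
	}
}

// One breaker per agent, so agent 2's connection errors never stall agent 5.
const breakers = new Map<number, CircuitBreaker>();
function breakerForAgent(agentNum: number): CircuitBreaker {
	let b = breakers.get(agentNum);
	if (!b) {
		b = new CircuitBreaker();
		breakers.set(agentNum, b);
	}
	return b;
}
```

The same factory shape also covers the second option: `withRetry` could accept `breakerForAgent(agentNum)` as an optional parameter and default to a shared instance for callers that want the old behavior.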
(Path: cli/src/execution/retry.ts, line 185)

Comment on lines +527 to +530
} catch (error) {
logError(`Failed to load task state: ${error}`);
this.tasks = new Map();
}

loadState silently resets all task state to an empty Map on any file read error, causing permanent data loss on the next persist.

The catch block (line 527–530) catches any error — including transient I/O errors, permission denied, file locks, or corruption — and resets this.tasks = new Map(). When claimTaskForExecution calls loadState inside the claim lock (line 159) and encounters any read error, the next persistState call (line 179) will atomically overwrite the state file with an empty task map, permanently destroying all accumulated task progress.

Distinguish between the file being absent (safe to return empty) and being present but unreadable (should propagate to fail loudly):

} catch (error) {
    // If the file genuinely doesn't exist, start with empty state
    if ((error as NodeJS.ErrnoException).code === "ENOENT") {
        this.tasks = new Map();
        return;
    }
    // For all other errors (corruption, I/O errors), propagate rather than silently losing state
    logError(`Failed to load task state: ${error}`);
    throw error;
}
(Path: cli/src/execution/task-state.ts, lines 527-530)

Comment on lines +636 to +645
if (isPlanningRejection(error)) {
	// Planning phase failed - transition to failed state but don't mark complete
	logDebug(
		`Planning phase failed for task "${task.title}", transitioning to FAILED state`,
	);
	await taskStateManager.transitionState(task.id, TaskState.FAILED, String(error));
	await taskSource.markComplete(task.id);
	clearDeferredTask(taskSource.type, task, workDir, prdFile);
	continue;
}

Planning rejection path omits critical task-completion bookkeeping: missing result.tasksFailed++, notifyTaskFailed, logTaskProgress, and agentComplete.

When isPlanningRejection(error) is true (lines 636–644), the code transitions the task to FAILED and calls markComplete, but four required operations are missing:

  1. result.tasksFailed++ (line 655/671 in other failure paths) — the execution result counts are inaccurate
  2. notifyTaskFailed(task.title, errorMessage) (line 656/672 in other paths) — the user is not notified of the failure
  3. logTaskProgress(task.title, "failed", workDir) (line 654/670 in other paths) — the task progress log is incomplete
  4. staticAgentDisplay.agentComplete(claimedTask.agentNum) (called at line 688 for success but never for rejection) — the UI display keeps showing the agent as running

Compare against the retryable failure path (lines 651–658) and non-retryable failure path (lines 668–675) which both perform all four operations.

Suggested change

```diff
 if (isPlanningRejection(error)) {
-    // Planning phase failed - transition to failed state but don't mark complete
     logDebug(
         `Planning phase failed for task "${task.title}", transitioning to FAILED state`,
     );
     await taskStateManager.transitionState(task.id, TaskState.FAILED, String(error));
+    logTaskProgress(task.title, "failed", workDir);
+    result.tasksFailed++;
+    notifyTaskFailed(task.title, String(error));
     await taskSource.markComplete(task.id);
     clearDeferredTask(taskSource.type, task, workDir, prdFile);
+    staticAgentDisplay.agentComplete(claimedTask.agentNum);
     continue;
 }
```
Prompt To Fix With AI

Path: cli/src/execution/parallel.ts
Line: 636-645
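Consolidating the four bookkeeping operations into one helper makes it harder for a new failure branch to skip a step. The sketch below is illustrative only: `ExecutionResult` and `FailureHooks` are hypothetical stand-ins for this PR's real result object, `notifyTaskFailed`, `logTaskProgress`, and `staticAgentDisplay` APIs.

```typescript
// Hypothetical consolidation of the four failure-path operations.
// The hook signatures are stand-ins for the real notifyTaskFailed,
// logTaskProgress, and staticAgentDisplay APIs in this PR.
interface ExecutionResult {
  tasksFailed: number;
}

interface FailureHooks {
  logProgress: (title: string, status: string) => void;
  notify: (title: string, message: string) => void;
  agentComplete: (agentNum: number) => void;
}

function recordTaskFailure(
  result: ExecutionResult,
  hooks: FailureHooks,
  title: string,
  agentNum: number,
  error: unknown,
): void {
  // Every failure path performs the same steps, in one place.
  hooks.logProgress(title, "failed");
  result.tasksFailed++;
  hooks.notify(title, String(error));
  hooks.agentComplete(agentNum);
}
```

Each failure branch (planning rejection, retryable, non-retryable) would then call a helper like this before its `continue`, so a future branch cannot silently skip a step.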

@greptile-apps

greptile-apps bot commented Mar 4, 2026

Additional Comments (1)

cli/src/execution/sandbox-git.ts
Error recovery restores to baseBranch instead of the saved currentBranch, leaving git on the wrong branch after a failed commit.

When commitSandboxChanges fails after checking out the task branch (line 122), the catch block at lines 190–195 attempts to restore git state but checks against baseBranch instead of the currentBranch saved at line 119. This means if the caller's repository was on any branch other than baseBranch (e.g., a feature branch), after a commit failure the working tree is left on baseBranch, not the original branch.

Since this function is called from the merge coordination path where the repo can be on any branch, this corrupts the state for subsequent serialized commits in the same mutex queue.

```typescript
		} catch (error) {
			const errorMsg = standardizeError(error).message;

			// Try to return to a safe state
			try {
				const branches = await git.branch();
				if (branches.current !== currentBranch) {
					await git.checkout(currentBranch);
				}
			} catch {
				// Ignore cleanup errors
			}
```
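The "mutex queue" mentioned above can be modeled as a promise chain that serializes async callers in FIFO order. This is a generic sketch of the pattern, not the merge lock implemented in this PR:

```typescript
// Minimal promise-chain mutex: callers run one at a time, FIFO.
// Generic sketch of the pattern, not the lock shipped in this PR.
class AsyncMutex {
  private tail: Promise<void> = Promise.resolve();

  runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const run = this.tail.then(fn);
    // Keep the chain alive even if fn rejects, so later callers still run.
    this.tail = run.then(
      () => undefined,
      () => undefined,
    );
    return run;
  }
}
```

Wrapping each sandbox commit in `runExclusive` guarantees that a failed commit, including its branch-restore cleanup, finishes before the next sandbox's commit begins.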
Prompt To Fix With AI

Path: cli/src/execution/sandbox-git.ts
Line: 187-198
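The save-then-restore discipline the comment asks for can be factored into a wrapper, so the branch recorded up front is always the one restored. The `GitLike` interface below is a hypothetical abstraction over the simple-git calls (`branch()`, `checkout()`) used in sandbox-git.ts:

```typescript
// Hypothetical wrapper: record the caller's branch before any checkout,
// and always restore that branch on failure. GitLike is an illustrative
// abstraction, not the simple-git instance used in sandbox-git.ts.
interface GitLike {
  currentBranch(): Promise<string>;
  checkout(branch: string): Promise<void>;
}

async function withBranchRestore<T>(
  git: GitLike,
  work: () => Promise<T>,
): Promise<T> {
  // Save the caller's branch first -- not the base branch.
  const savedBranch = await git.currentBranch();
  try {
    return await work();
  } catch (error) {
    try {
      if ((await git.currentBranch()) !== savedBranch) {
        await git.checkout(savedBranch);
      }
    } catch {
      // Ignore cleanup errors; the original error is what matters.
    }
    throw error;
  }
}
```

With this shape, the comparison can only ever be against the saved branch, so the baseBranch/currentBranch mix-up the comment describes cannot recur.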
