stack3:feat: add execution state orchestration and retry flow #162

VX1D wants to merge 9 commits into michaelshimeles:main
Conversation
@VX1D is attempting to deploy a commit to the Goshen Labs Team on Vercel. A member of the Team first needs to authorize it.
Greptile Summary

This PR introduces a substantial stateful execution core for Ralphy's task lifecycle: atomic task claiming, state persistence in multiple formats (YAML/JSON/CSV/MD), file-based cross-process locking, a planning cache with repo fingerprinting, a test-model orchestrator, and an upgraded sequential runner. The work is architecturally sound and fixes several real problems (Windows key collisions via URL encoding, CSV quoting). Several key concerns remain after reviewing the current state of the code.
Confidence Score: 2/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as runSequential
    participant TSM as TaskStateManager
    participant Lock as FileLock
    participant Disk as StateFile (.ralphy/)
    participant Eng as AIEngine
    participant Orch as Orchestrator
    CLI->>TSM: initialize(allTasks)
    TSM->>Disk: loadState() / persistState()
    loop For each pending task
        CLI->>TSM: getNextPendingTask()
        TSM-->>CLI: pendingTask
        CLI->>TSM: claimTaskForExecution(taskId)
        TSM->>Lock: acquireFileLock(stateFile.claim)
        TSM->>Disk: loadState()
        TSM->>Disk: persistState() [state=RUNNING]
        TSM->>Lock: releaseFileLock()
        TSM-->>CLI: claimed=true
        alt testModel configured & shouldUseOrchestrator
            CLI->>Orch: executeWithOrchestrator(prompt)
            Orch->>Eng: executeWithRetry(mainModel)
            Eng-->>Orch: mainResult
            Orch->>Eng: executeWithRetry(testModel, testPrompt)
            Eng-->>Orch: testResult
            alt hasFailures
                Orch->>Eng: executeWithRetry(mainModel, fixPrompt)
            end
            Orch-->>CLI: OrchestratorResult
        else standard execution
            CLI->>Eng: withRetry(execute)
            Eng-->>CLI: AIResult
        end
        alt success
            CLI->>TSM: transitionState(COMPLETED)
            Note over TSM,Disk: ⚠️ No lock acquired — parallel<br/>writers can overwrite each other
            TSM->>Disk: persistState()
        else retryable failure
            CLI->>TSM: transitionState(DEFERRED)
            TSM->>Disk: persistState()
        else fatal / non-retryable
            CLI->>TSM: transitionState(FAILED)
            TSM->>Disk: persistState()
        end
    end
```

Last reviewed commit: 4be1eff
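The warning note in the diagram flags unlocked writes on completion. A minimal sketch of the fix the diagram implies, re-reading and persisting state under the same lock used for claiming; the names mirror the diagram, but every body below is a stand-in, not the PR's real implementation:

```typescript
// Stand-in state and lock; the real code uses files in .ralphy/ and a lock dir.
type State = Record<string, string>;
let disk: State = { t1: "RUNNING" };
let locked = false;

function acquireFileLock(): boolean {
  if (locked) return false;
  locked = true;
  return true;
}
function releaseFileLock(): void {
  locked = false;
}
function loadState(): State {
  return { ...disk };
}
function persistState(state: State): void {
  disk = { ...state };
}

function transitionState(taskId: string, next: string): boolean {
  // Hold the same lock used for claiming, so parallel writers serialize.
  if (!acquireFileLock()) return false;
  try {
    const state = loadState(); // re-read under the lock, not a stale copy
    state[taskId] = next;
    persistState(state);
    return true;
  } finally {
    releaseFileLock();
  }
}

console.log(transitionState("t1", "COMPLETED")); // true
console.log(disk.t1); // COMPLETED
```

The key point is the read-modify-write happening entirely inside the lock; persisting a state object loaded before acquisition would reintroduce the race.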
```ts
const success = acquireFileLock(testFile, TEST_BASE);
expect(success).toBe(true);
// ...
releaseFileLock(testFile, TEST_BASE);
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: cli/__tests__/locking-security.test.ts
Line: 210
Comment:
**Hardcoded lock-dir path diverges from the `LOCK_DIR` constant and the other test file**
This line hardcodes `".ralphy-locks"`:
```ts
const lockDir = join(TEST_BASE, ".ralphy-locks");
```
However, `locking.test.ts` (line 349) uses `join(workDir, ".ralphy", "locks")`, which implies `LOCK_DIR = ".ralphy/locks"`. Since `LOCK_DIR` is already imported at the top of this test file, this assertion may be checking the wrong directory and could give a false-positive result (the `lockFiles.length` check against 5050 would trivially pass if it looks in a non-existent directory).
```suggestion
const lockDir = join(TEST_BASE, LOCK_DIR);
```
How can I resolve this? If you propose a fix, please make it concise.
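For illustration, building the path from the shared constant keeps both test files in agreement; `LOCK_DIR = ".ralphy/locks"` is an assumption inferred from `locking.test.ts`, not confirmed from the source:

```typescript
import { join } from "node:path";

// Assumption inferred from locking.test.ts: LOCK_DIR resolves to ".ralphy/locks".
const LOCK_DIR = join(".ralphy", "locks");
const TEST_BASE = "/tmp/ralphy-test";

// Compose with the constant instead of hardcoding ".ralphy-locks", so the
// assertion checks the directory the lock implementation actually writes to.
console.log(join(TEST_BASE, LOCK_DIR)); // /tmp/ralphy-test/.ralphy/locks
```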
Additional Comments (5)
Prompt To Fix With AI
This is a comment left during a code review.
Path: cli/src/execution/prompt.ts
Line: 2388-2393
Comment:
**`statSync` follows symlinks — `isSymbolicLink()` always returns false**
`statSync` dereferences the symlink before returning stats, so `isSymbolicLink()` is always `false` on its result. You need `lstatSync` to get the stats of the symlink entry itself. As written, `detectSymlinks` will always return an empty array, so the "Symlink Analysis" section in the prompt can never be populated.
```suggestion
return require("node:fs").lstatSync(join(workDir, d.name as string)).isSymbolicLink();
```
Or, since `lstatSync` is already available in the import-block at the top of this file (if you add it), replace the import and usage accordingly:
```ts
import { existsSync, readFileSync, readdirSync, lstatSync, statSync } from "node:fs";
```
and then:
```typescript
return lstatSync(join(workDir, d.name as string)).isSymbolicLink();
```
How can I resolve this? If you propose a fix, please make it concise.
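The difference is easy to verify in isolation. A self-contained demo (the temporary directory and file names are arbitrary):

```typescript
import { lstatSync, mkdtempSync, statSync, symlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Scratch directory with a regular file and a symlink pointing at it.
const dir = mkdtempSync(join(tmpdir(), "symlink-demo-"));
writeFileSync(join(dir, "target.txt"), "hello");
symlinkSync(join(dir, "target.txt"), join(dir, "link.txt"));

// statSync follows the link, so the stats describe target.txt (a regular file).
console.log(statSync(join(dir, "link.txt")).isSymbolicLink()); // false
// lstatSync describes the link entry itself.
console.log(lstatSync(join(dir, "link.txt")).isSymbolicLink()); // true
```

This is why the `detectSymlinks` helper described in the comment can never find anything when built on `statSync`.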
Prompt To Fix With AI
This is a comment left during a code review.
Path: cli/src/execution/locking.ts
Line: 1019-1026
Comment:
**Synchronous busy-wait blocks the Node.js event loop**
The Node.js fallback path spins in a tight busy-wait loop for up to 5 seconds per retry attempt:
```ts
while (Date.now() - start < delay) {
if (Date.now() % 10 === 0) { } // This does nothing
}
```
The empty `if` block yields nothing. This will fully block the Node.js event loop (and any other async work, timers, or I/O) for the entire duration of the delay. Since `maxRetries` can be 5 and `delay` can be up to 5 seconds each, a single lock contention could freeze the runtime for up to 25 seconds.
Either make `acquireFileLock` async and use `await sleep(delay)`, or use `Atomics.wait` with a `SharedArrayBuffer` for a proper synchronous yield:
```ts
// Option: truly async acquisition
export async function acquireFileLock(...): Promise<boolean> {
// ...
await sleep(delay);
// ...
}
```
If the function must remain synchronous, use `Atomics.wait`:
```ts
const sab = new SharedArrayBuffer(4);
const arr = new Int32Array(sab);
Atomics.wait(arr, 0, 0, delay); // blocks the thread without spinning
```
How can I resolve this? If you propose a fix, please make it concise.
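A runnable sketch of the synchronous option; `sleepSync` is a hypothetical helper name, and note that `Atomics.wait` is permitted on Node's main thread, unlike in browser windows:

```typescript
// Minimal sketch: a synchronous sleep that yields the CPU instead of spinning.
function sleepSync(ms: number): void {
  const arr = new Int32Array(new SharedArrayBuffer(4));
  // Blocks until the value at index 0 changes (it never does) or the timeout elapses.
  Atomics.wait(arr, 0, 0, ms);
}

const start = Date.now();
sleepSync(100);
console.log(Date.now() - start >= 90); // true
```

The thread still blocks for the duration, but the OS parks it rather than burning a core; making `acquireFileLock` async remains the better fix if callers can tolerate a `Promise<boolean>`.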
Prompt To Fix With AI
This is a comment left during a code review.
Path: cli/src/execution/sequential.ts
Line: 3531-3548
Comment:
**Fatal error fast-path (auth/config failures) silently removed**
The previous code had an `isFatalError` fast-path that immediately aborted all remaining tasks when an authentication or configuration error was detected (e.g. `401`, `403`, `not authenticated`, `command not found`). Both error branches in the old code had this guard.
This PR removes both `isFatalError` calls entirely. Now, if the API key is invalid or the CLI tool is not installed, every single task will attempt execution, fail, get marked as `FAILED`, and continue. This can waste significant time (and tokens) compared to the old behavior of aborting on the first fatal error.
If the intent is to defer fatal-error handling to the circuit breaker in `retry.ts`, that only covers connection errors — not auth/config errors. Consider re-introducing the fast-abort for fatal (non-retryable, non-connection) errors, or at minimum document the intentional behavior change in the PR description.
How can I resolve this? If you propose a fix, please make it concise.
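If the fast-path is re-introduced, it could look like the following sketch; the function name matches the removed `isFatalError`, but the pattern list is an assumption based on the examples in the comment, not the original implementation:

```typescript
// Hypothetical reconstruction of the removed fast-path check (patterns assumed
// from the review's examples: 401, 403, auth failures, missing CLI binaries).
function isFatalError(message: string): boolean {
  const fatalPatterns = [
    /\b401\b/,
    /\b403\b/,
    /not authenticated/i,
    /command not found/i,
  ];
  return fatalPatterns.some((pattern) => pattern.test(message));
}

console.log(isFatalError("HTTP 401: not authenticated")); // true
console.log(isFatalError("ETIMEDOUT: connection reset by peer")); // false
```

In the sequential loop, a `true` result would abort the remaining tasks instead of marking each one `FAILED` and continuing.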
Prompt To Fix With AI
This is a comment left during a code review.
Path: cli/src/execution/sequential.ts
Line: 3429-3431
Comment:
**`getAllTasks()` called on every loop iteration — potentially expensive**
`getAllTasks()` is now called inside the main `while(true)` loop on every task iteration to find the full task object by ID. Depending on the task source implementation, this may involve file I/O or network calls on each iteration. For a source with N tasks, this results in O(N²) calls.
Consider loading the full task list once before the loop (or maintaining a task-by-ID map) and only refreshing it when the task source signals a change:
```ts
// Before the while loop
const allTasksCache = await taskSource.getAllTasks();
const taskById = new Map(allTasksCache.map(t => [t.id, t]));
// Inside the loop
const task = taskById.get(pendingTask.id);
```
How can I resolve this? If you propose a fix, please make it concise.
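Expanding the suggestion into a self-contained sketch; the `Task` shape and the stub task source are assumptions for illustration:

```typescript
// Task shape assumed for illustration; the real type lives in the task source.
interface Task {
  id: string;
  title: string;
}

async function getAllTasks(): Promise<Task[]> {
  // Stand-in for the real task source, which may hit files or the network.
  return [
    { id: "t1", title: "Add locking" },
    { id: "t2", title: "Add retry flow" },
  ];
}

async function run(): Promise<void> {
  // Load once before the loop: one O(N) pass instead of one call per iteration.
  const taskById = new Map(
    (await getAllTasks()).map((t) => [t.id, t] as [string, Task]),
  );
  // Inside the loop, each lookup is O(1).
  console.log(taskById.get("t2")?.title); // Add retry flow
}

run();
```

The cache only needs invalidating when the task source signals a change, which keeps the hot loop free of file or network I/O.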
Prompt To Fix With AI
This is a comment left during a code review.
Path: cli/src/execution/retry.ts
Line: 3109-3122
Comment:
**`maxDelay` and `jitter` options silently dropped — potential behavior regression for callers**
The `RetryOptions` interface previously included `maxDelay` (default 60 s) and `jitter` (default `true`) to bound and randomize retry delays. Both have been removed in this PR.
Any existing callers that relied on a capped retry delay or on jitter to prevent thundering-herd effects will now get the new fixed-delay / exponential-backoff behaviour with no upper bound for non-connection errors (`baseDelaySeconds * 1000`). The type change also means callers that previously passed `maxDelay` will now get a TypeScript compile error when they upgrade, which may not be obvious from the PR description.
If these callers exist in the codebase (outside this PR), their option objects will need to be updated. Consider documenting the breaking change in `RetryOptions`, or re-adding the options as optional no-ops for backward compatibility.
How can I resolve this? If you propose a fix, please make it concise.
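One backward-compatible option is to restore both fields as optional bounds on the computed delay. A hedged sketch (the defaults and the helper name are assumed, not taken from `retry.ts`):

```typescript
// Optional maxDelay/jitter restored; assumed defaults: 60 s cap, jitter on.
interface RetryOptions {
  baseDelaySeconds: number;
  maxDelaySeconds?: number;
  jitter?: boolean;
}

function backoffDelayMs(attempt: number, opts: RetryOptions): number {
  // Exponential backoff from the base delay, then capped at maxDelaySeconds.
  const raw = opts.baseDelaySeconds * 1000 * 2 ** attempt;
  const capped = Math.min(raw, (opts.maxDelaySeconds ?? 60) * 1000);
  // Full jitter: pick uniformly in [0, capped] to spread out competing retriers.
  return opts.jitter === false ? capped : Math.random() * capped;
}

console.log(backoffDelayMs(3, { baseDelaySeconds: 10, jitter: false })); // 60000
```

Callers that never passed `maxDelay` or `jitter` keep their current behaviour, while existing option objects compile again without edits.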
Related Documentation

3 document(s) may need updating based on files changed in this PR:

**Model Override and Selection** (Goshen Labs's Space)

View Suggested Changes:

@@ -38,9 +38,14 @@
- If a task has a `model:` property in YAML, it overrides any global model specified via CLI flags for that task.
- If no per-task model is set, the global model override (if any) is used.
- If neither is set, the engine's default model is used.
+- `--planning-model` (if specified) is used for the planning phase; otherwise, the main model is used.
+- `--test-model` (if specified) enables test orchestration for test-related tasks.
**Planning Model Override:**
You can use the `--planning-model` flag to specify a separate model for the planning phase (file prediction) while keeping a different model for code generation/execution. This allows you to use a cheaper or faster model for planning and a more capable model for implementation, optimizing both cost and performance.
+
+**Test Model Override:**
+You can use the `--test-model` flag to specify a separate model for test execution and validation. When specified, test-related tasks automatically trigger orchestrator-based execution, where the main model implements the feature, then the test model verifies the implementation and provides feedback. This allows you to use specialized models for testing without affecting main execution.
For example:
```bash
@@ -52,6 +57,7 @@
ralphy --gemini --model gemini-1.0-pro "summarize" # Use a custom Gemini model
ralphy --ollama --model glm-4.7 "add feature" # Use a custom Ollama model
ralphy --model opus --planning-model haiku "task" # Use 'opus' for execution, 'haiku' for planning
+ralphy --model sonnet --test-model opus "implement X" # Use 'sonnet' for main work, 'opus' for testing
```
[Reference: README.md](https://github.com/michaelshimeles/ralphy/blob/main/README.md)
@@ -59,16 +65,26 @@
For more details on model override flow and engine selection, see the sections below.
### Model Override Flow in the Execution Pipeline
-When you invoke `ralphy` with `--model`, `--planning-model`, or a shortcut flag, the CLI argument parser determines the engine and both model overrides. For example, `--sonnet` is equivalent to `--claude --model sonnet`. The selected engine, model override, and planning model override are stored in the runtime options and passed through all phases of execution, including sequential runs, parallel execution, and conflict resolution.
-
-- The `--model` flag sets the model for code generation and execution.
-- The `--planning-model` flag (optional) sets a separate model for the planning phase (file prediction). If not specified, the main model is used for both phases.
-
-The execution pipeline ensures that both model overrides are consistently propagated. During the planning phase (especially in parallel and no-git modes), the planning model is used to generate file lists for each task. During code generation and execution, the main model is used. Each engine implementation checks for these overrides and appends the appropriate `--model <name>` argument as needed.
-
-This separation allows you to optimize for speed and cost by using a lightweight model for planning and a more capable model for implementation.
-
-[Reference: cli/src/cli/args.ts, cli/src/execution/planning.ts, cli/src/execution/parallel.ts, cli/src/execution/sequential.ts]
+When you invoke `ralphy` with `--model`, `--planning-model`, `--test-model`, or a shortcut flag, the CLI argument parser determines the engine and model overrides. For example, `--sonnet` is equivalent to `--claude --model sonnet`. The selected engine and model overrides are stored in the runtime options and passed through all phases of execution, including sequential runs, parallel execution, and conflict resolution.
+
+The execution pipeline supports three distinct execution phases, each with its own model:
+
+1. **Planning Phase** (uses `--planning-model` if specified, otherwise `--model`): Generates structured analysis and file lists for each task. The planning model is used to predict which files need modification before implementation begins.
+
+2. **Main Execution Phase** (uses `--model`): Implements the actual code changes, performs code generation, and handles primary execution work.
+
+3. **Test Phase** (uses `--test-model` if specified): Automatically triggered for test-related tasks. The test model verifies implementation, runs tests, analyzes results, and provides feedback. If tests fail, the system returns to the main execution phase for fixes, then re-tests.
+
+**Model Selection Decision Tree:**
+- **Planning**: Use `--planning-model` if provided, otherwise use `--model`, otherwise use engine default
+- **Main Execution**: Use `--model` if provided, otherwise use engine default
+- **Testing**: Use `--test-model` if provided, otherwise skip test orchestration
+
+The execution pipeline ensures that all model overrides are consistently propagated. Each engine implementation checks for these overrides and appends the appropriate `--model <name>` argument as needed.
+
+This separation allows you to optimize for different phases: use a lightweight model for planning to reduce cost, a capable model for implementation, and a specialized model for testing—all without affecting the quality of each phase.
+
+[Reference: cli/src/execution/planning.ts, cli/src/execution/orchestrator.ts, cli/src/execution/parallel.ts, cli/src/execution/sequential.ts]
### Specifying Models for Different Engines
You can specify a model for any supported engine by combining the engine flag with `--model`. For example, to use a specific model with OpenCode, use `--opencode --model <model-name>`. The same pattern applies to other engines, including Trae, Gemini, Ollama, and Kimi:
@@ -82,7 +98,9 @@
ralphy --kimi --model kimi-k2.5 "your task"
```
-You can also specify a separate planning model for any engine using `--planning-model <model-name>`. This model will be used only for the planning phase (file prediction), while the main model (from `--model`) is used for code generation and execution:
+You can also specify a separate planning model for any engine using `--planning-model <model-name>`. This model will be used only for the planning phase (file prediction), while the main model (from `--model`) is used for code generation and execution.
+
+Additionally, you can specify a test model using `--test-model <model-name>` to enable orchestrated test execution. When a test model is specified, test-related tasks automatically trigger a flow where the main model implements, the test model validates, and fixes are applied if needed:
```bash
ralphy --model opus --planning-model haiku "task"
@@ -91,11 +109,51 @@
ralphy --gemini --model gemini-1.0-pro --planning-model gemini-1.0 "summarize"
ralphy --ollama --model glm-4.7 --planning-model glm-3.5 "add feature"
ralphy --kimi --model kimi-k2.5 --planning-model kimi-k2.5 "your task"
+
+# With test model orchestration
+ralphy --model sonnet --test-model opus "implement feature with tests"
+ralphy --opencode --model opencode/glm-4.7 --test-model opencode/gpt-5-nano "build API"
```
Only one engine/model combination can be specified per command invocation. There is no built-in support for specifying different models for multiple engines in a single command; you must run separate commands for each engine/model pair.
-[Reference: cli/src/cli/args.ts, cli/src/execution/planning.ts]
+### Test Model Orchestration
+When you specify a `--test-model`, the execution pipeline can automatically delegate test execution and validation to a separate specialized model. This orchestration is triggered for test-related tasks (tasks containing keywords like "test", "implement", "create", "build", "fix", or "debug").
+
+**How Test Model Orchestration Works:**
+
+1. **Automatic Delegation Flow**: The main model implements the feature or fix, then the test model is automatically invoked to verify the implementation by running tests.
+
+2. **shouldUseOrchestrator() Decision**: The system determines whether to use orchestrator mode by checking:
+ - Is a `--test-model` specified?
+ - Does the task title or description contain test-related or implementation keywords?
+ - If both conditions are met, orchestration is enabled.
+
+3. **executeWithOrchestrator() Flow**:
+ - **Step 1 (Main Model)**: Implement the task/feature
+ - **Step 2 (Test Model)**: Run tests and analyze results
+ - **Step 3 (Optional Fix)**: If tests fail, return to main model with test results for fixes, then re-test
+
+**Benefits:**
+- Use a capable model (e.g., Claude Opus) for complex implementation work
+- Use a specialized or faster model (e.g., GPT-4) for test execution and validation
+- Automatic retry flow when tests fail, without manual intervention
+- Separation of concerns: implementation vs. validation
+
+**Example Workflow:**
+```
+User: ralphy --model anthropic/claude-3.5-sonnet --test-model anthropic/claude-3-opus "implement login feature"
+
+System Flow:
+1. Main model (sonnet) implements login feature
+2. Test model (opus) automatically runs tests, finds issues
+3. Main model (sonnet) receives test feedback, applies fixes
+4. Test model (opus) re-validates, confirms tests pass
+```
+
+This orchestration provides a robust way to ensure code quality without requiring the same model to handle both implementation and testing phases.
+
+[Reference: cli/src/execution/orchestrator.ts]
### Examples
- Use the default model for Claude:
@@ -159,11 +217,18 @@
ralphy --ollama --model glm-4.7 --planning-model glm-3.5 "add feature"
ralphy --kimi --model kimi-k2.5 --planning-model kimi-k2.5 "your task"
```
+ - Use a test model for orchestrated testing:
+ ```bash
+ ralphy --model anthropic/claude-3.5-sonnet --test-model anthropic/claude-3-opus "implement feature X"
+ ralphy --model openai/gpt-4 --test-model anthropic/claude-3.5-sonnet --planning-model openai/gpt-4o "complex task"
+ ralphy --opencode --model opencode/glm-4.7 --test-model opencode/gpt-5-nano "build API with tests"
+ ```
### Best Practices
- Use shortcut flags (like `--sonnet`) for common engine/model combinations to reduce typing and avoid mistakes.
- Use `--model` with the appropriate engine flag for custom or less common model selections, including `--kimi` for Kimi Code CLI.
- Use `--planning-model` to select a cheaper or faster model for the planning phase, especially in parallel or no-git modes where planning is a distinct step. This can reduce cost and speed up planning without affecting code quality.
+- Use `--test-model` to enable automatic test orchestration for implementation tasks. This provides a feedback loop where the test model validates implementation and suggests fixes if tests fail, improving code quality without manual test execution.
- Ensure you specify only one engine/model pair per command.
- Model overrides are consistently applied throughout the execution pipeline, including parallel runs and conflict resolution, so you can rely on the selected model being used for all phases of the operation.
- Refer to the engine documentation or `ralphy --help` for the list of supported models for each engine, including Kimi Code CLI.

**Progress Spinner Enhancements and Settings Display**

View Suggested Changes:

@@ -1,5 +1,16 @@
+## Overview
+
+The progress tracking system now operates on multiple layers, providing comprehensive visibility into execution:
+
+1. **Basic Spinner** - Live timer and step updates for sequential execution
+2. **Planning Progress Events** - Structured events during the planning phase (analyzing, planning, files, optimization)
+3. **Orchestrator Progress Callbacks** - Execution phase reporting for test feedback workflows
+4. **Static Agent Display** - Real-time agent execution tracking with task state transitions
+
+Each layer complements the others to provide complete transparency from planning through execution to testing.
+
### Timer with Live Updates
-The spinner now features a timer that updates every second, showing the elapsed time since the task began. This is implemented by starting a `setInterval` when the spinner is created, which calls an update method every 1000 milliseconds. The elapsed time is formatted as minutes and seconds (e.g., `2m 15s`) or just seconds if under a minute, and is displayed in dim text next to the spinner's current step and task description.
+The spinner features a timer that updates every second, showing the elapsed time since the task began. This is implemented by starting a `setInterval` when the spinner is created, which calls an update method every 1000 milliseconds. The elapsed time is formatted as minutes and seconds (e.g., `2m 15s`) or just seconds if under a minute, and is displayed in dim text next to the spinner's current step and task description.
**Operation Timing Breakdown:**
When a task completes, the spinner now displays a breakdown of how much time was spent in each major step (such as "Working", "Retry", etc.), surfacing slow paths and making performance bottlenecks visible. This breakdown appears in dim text after the main spinner output, and only includes steps that took at least 1 second. This helps users and developers identify which parts of the process are slow.
@@ -46,3 +57,167 @@
Progress logging is now non-blocking and batched, ensuring that the CLI remains responsive even during heavy parallel operations.
This transparency helps users understand the current state of the CLI, reduces confusion, and aids in troubleshooting or optimizing task runs. The spinner's integration with the task run commands ensures that feedback is always up-to-date and contextually relevant, from task start through retries to completion.
+
+## Planning Progress Events
+
+The planning phase reports structured progress events through the `PlanningProgressEvent` and `PlanningProgressCallback` types defined in `cli/src/execution/progress-types.ts`.
+
+### Event Structure
+
+Each planning event includes:
+- `taskId`: Identifier for the task being planned
+- `status`: Current planning status - "started", "analyzing", "thinking", "planning", "completed", "error", or custom status strings
+- `timestamp`: Event occurrence time
+- `message`: Optional human-readable description of the current activity
+- `metadata`: Optional additional data (file counts, file lists, planning flags)
+- `reward`: Optional numeric reward/confidence score from the AI model
+
+### Planning Phases
+
+The planning system emits events for distinct phases:
+
+1. **Analyzing** - Reading project structure and files, extracting task keywords, performing semantic chunking to identify relevant files
+2. **Thinking** - Processing the planning request and generating analysis
+3. **Planning** - Creating the structured plan with steps and file lists
+4. **Completed** - Planning finished successfully with file count and step count
+5. **Error/Failed** - Planning encountered issues requiring retry or manual intervention
+
+### Integration with UI
+
+UI components can subscribe to planning progress by providing a callback:
+
+```typescript
+const onProgress: PlanningProgressCallback = (event) => {
+ // Display event.status and event.message
+ // Show event.metadata.fileCount and event.metadata.files if available
+ // Use event.reward for confidence indicators
+};
+
+await planTaskFiles(engine, task, workDir, {
+ onProgress,
+ // ... other options
+});
+```
+
+The planning system automatically detects streaming vs. non-streaming execution and emits appropriate events. For streaming execution, events are emitted in real-time as the AI model processes each phase. For non-streaming execution, events are emitted at key transition points.
+
+[See implementation](https://github.com/michaelshimeles/ralphy/blob/main/cli/src/execution/progress-types.ts)
+
+## Orchestrator Progress Callbacks
+
+The orchestrator provides execution phase tracking for test-feedback workflows through a progress callback mechanism defined in `cli/src/execution/orchestrator.ts`.
+
+### Callback Interface
+
+The orchestrator accepts an optional `onProgress?: (step: string) => void` callback that receives string descriptions of execution phases:
+
+- **"Starting execution with test feedback"** - Orchestrator workflow beginning
+- **"Running main model..."** - Initial implementation phase starting
+- **"Main model complete, running tests..."** - Implementation finished, transitioning to test phase
+- **"Sending to test model (model-name)..."** - Test execution starting with specific model
+- **"Test prompt ready, executing test model..."** - Test prompt constructed and ready
+- **"Test model complete. Response length: N chars"** - Test phase finished
+- **"Test output preview: ..."** - Sample of test results
+- **"Issues found, requesting fixes..."** - Tests failed, requesting fixes from main model
+- **"All tests passed"** - Successful completion
+
+### Execution Flow
+
+The orchestrator coordinates three stages:
+
+1. **Main Model Execution** - Implements the task based on the original prompt
+2. **Test Model Execution** - Runs tests against the implementation and reports results
+3. **Fix Iteration** (if needed) - Requests fixes from main model if tests reveal issues
+
+Each stage transition triggers a progress callback, allowing the UI to update the displayed phase and activity.
+
+### Integration with Spinner
+
+The orchestrator progress callbacks integrate seamlessly with the existing spinner system. When an orchestrator callback fires, it can update the spinner's step label to reflect the current phase:
+
+```typescript
+await executeWithOrchestrator(prompt, options, (step) => {
+ spinner.updateStep(step);
+});
+```
+
+The orchestrator also updates the `StaticAgentDisplay` to show which model is currently running (main vs. test) and which execution phase is active (execution vs. testing).
+
+[See implementation](https://github.com/michaelshimeles/ralphy/blob/main/cli/src/execution/orchestrator.ts)
+
+## Static Agent Display
+
+The `StaticAgentDisplay` component (`cli/src/ui/static-agent-display.ts`) provides a dedicated UI for tracking agent execution across all phases, with support for parallel agent workflows.
+
+### Task State Transitions
+
+The display tracks tasks through these states:
+
+- **pending** - Task is queued but not yet started
+- **planning** - Task is in the planning phase (analyzing codebase, identifying files)
+- **working** - Task is actively being executed by an agent
+- **completed** - Task finished successfully (shown with ✓)
+- **failed** - Task encountered an error (shown with ✗)
+- **deferred** - Task was deferred for later execution
+- **skipped** - Task was skipped due to dry-run or other conditions
+
+### Execution Phases
+
+The display shows high-level execution phases:
+
+- **PLANNING** - Analyzing the codebase and planning file changes (shown in cyan)
+- **EXECUTION** - Implementing the task by modifying files (shown in magenta)
+- **TESTING** - Running tests and verifying implementation (shown in yellow)
+
+### Display Layout
+
+The static agent display renders:
+
+1. **Workflow Bar** - Visual progress indicator showing current phase (PLANNING → EXECUTION → TESTING)
+2. **Agent Status Lines** - One line per agent showing:
+ - Status indicator (●/✓/✗)
+ - Agent number
+ - Current phase tag
+ - Task title
+ - Active model name
+ - Elapsed time
+3. **Recent Steps** - Up to 5 recent actions per agent, numbered and color-coded by action type:
+ - Read/Search actions (blue)
+ - Write/Edit actions (magenta)
+ - Tool/Execute actions (yellow)
+ - Test/Build actions (green)
+ - Fix/Debug actions (red)
+
+### Integration with Orchestrator
+
+The static display integrates with the orchestrator and planning systems:
+
+- Planning phase events update the display to show "planning" status
+- Orchestrator callbacks update the display to show which model is running (main/planning/test)
+- Phase transitions are reflected in the workflow bar
+- Recent steps are populated from OpenCode streaming output or explicit step updates
+
+### Example Display Output
+
+```
+ PLANNING → ▓▓▓ EXECUTION ▓▓▓ → TESTING
+─────────────────── AGENTS ────────────────────
+
+● Agent 1 [EXECUTION] Implement auth flow [main] 0m 42s
+ 1. Write: src/auth/login.ts
+ 2. Edit: src/routes.ts
+ 3. Tool: bash: npm test
+ 4. Read: package.json
+ 5.
+
+✓ Agent 2 [TESTING] Fix validation bug [test] 1m 15s
+ 1. Test: npm test
+ 2. Read: src/validation.ts
+ 3. Edit: src/validation.ts
+ 4. Test: npm test
+ 5. Tool: git commit
+
+Press Ctrl+C to stop
+```
+
+[See implementation](https://github.com/michaelshimeles/ralphy/blob/main/cli/src/ui/static-agent-display.ts)

**Real-Time Streaming Output Handling**

View Suggested Changes:

@@ -9,6 +9,15 @@
- **Linting**: For commands invoking lint tools (e.g., "eslint", "prettier").
- **Committing**: For commands or descriptions containing "git commit".
- **Staging**: For commands or descriptions containing "git add".
+
+#### Planning Phase Steps
+In addition to execution steps, the system recognizes structured planning phase markers from the planning prompt:
+- **ANALYSIS**: Analyzing the problem, complexity, risks, and goal
+- **PLAN**: Creating implementation steps and strategy
+- **FILES**: Identifying files that need modification
+- **OPTIMIZATION**: Determining efficient approaches and shortcuts
+
+These planning steps are detected when the AI output contains the corresponding XML-style tags (`<ANALYSIS>`, `<PLAN>`, `<FILES>`, `<OPTIMIZATION>`) defined in the planning prompt. During streaming execution, planning progress events can report these as status values to provide visibility into the planning process before implementation begins.
Example logic (a sketch of the tag-based detection described above; the helper name is illustrative, not the PR's actual API):
```ts
// Map a streamed AI output chunk to a planning-phase status value (illustrative sketch).
const PLANNING_TAGS = ["ANALYSIS", "PLAN", "FILES", "OPTIMIZATION"];

function detectPlanningPhase(chunk: string): string | undefined {
  for (const tag of PLANNING_TAGS) {
    if (chunk.includes(`<${tag}>`)) {
      return tag.toLowerCase(); // reported as a planning status value
    }
  }
  return undefined;
}
```
This architecture ensures that users receive immediate, meaningful feedback about the current phase of each task as it executes, improving transparency and responsiveness in the development workflow.
+
+## Task State Transitions
+The execution system maintains explicit task states through a centralized state machine implemented in `task-state.ts`. This provides a single source of truth for task lifecycle across all execution modes (sequential, parallel).
+
+### State Machine
+Tasks transition through six states:
+- **pending**: Task is ready to be executed
+- **running**: Task is currently being executed
+- **completed**: Task finished successfully
+- **failed**: Task failed and cannot be retried
+- **deferred**: Task failed with a retryable error and will be retried
+- **skipped**: Task exceeded maximum retry attempts or was explicitly skipped
+
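The legal transitions implied by these states can be sketched as a validity map. This is a hedged illustration only; the exact rules live in `task-state.ts` and may differ:

```ts
// Illustrative transition-validity map for the six task states described above.
enum TaskState {
  PENDING = "pending",
  RUNNING = "running",
  COMPLETED = "completed",
  FAILED = "failed",
  DEFERRED = "deferred",
  SKIPPED = "skipped",
}

const ALLOWED_TRANSITIONS: Record<TaskState, TaskState[]> = {
  [TaskState.PENDING]: [TaskState.RUNNING, TaskState.SKIPPED],
  // PENDING here covers the restart reset of interrupted RUNNING tasks.
  [TaskState.RUNNING]: [TaskState.COMPLETED, TaskState.FAILED, TaskState.DEFERRED, TaskState.PENDING],
  [TaskState.DEFERRED]: [TaskState.PENDING, TaskState.SKIPPED],
  [TaskState.COMPLETED]: [],
  [TaskState.FAILED]: [],
  [TaskState.SKIPPED]: [],
};

function canTransition(from: TaskState, to: TaskState): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

Terminal states (`completed`, `failed`, `skipped`) have no outgoing transitions in this sketch.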
+### State Transitions
+State transitions are atomic and persisted to disk in the same format as the task source (YAML, JSON, CSV, or Markdown):
+
+```ts
+// Claim a task for execution (atomic transition from pending → running)
+const claimed = await taskStateManager.claimTaskForExecution(taskId);
+
+// Transition to completed/failed/deferred/skipped
+await taskStateManager.transitionState(taskId, TaskState.COMPLETED);
+await taskStateManager.transitionState(taskId, TaskState.FAILED, errorMessage);
+await taskStateManager.transitionState(taskId, TaskState.DEFERRED, errorMessage);
+```
+
+### Impact on Progress Tracking
+State transitions affect real-time output in several ways:
+- **Interrupted tasks**: Tasks in `RUNNING` state are automatically reset to `PENDING` on program restart
+- **Retry attempts**: The state manager tracks `attemptCount` per task, preventing infinite retry loops
+- **Parallel execution**: The atomic claim mechanism prevents race conditions where multiple workers try to execute the same task
+- **Progress reporting**: State changes trigger updates to the progress UI, spinner, and agent display
+
+The state manager ensures consistent task lifecycle behavior across both sequential and parallel execution modes. [Source](https://github.com/michaelshimeles/ralphy/pull/162/files#diff-task-state.ts)
+
+## Orchestrator Execution Flow
+The orchestrator pattern in `orchestrator.ts` provides a higher-level execution flow that coordinates between main models and test models. This affects the sequence of steps visible in the streaming output.
+
+### Orchestrated Execution Phases
+When orchestrator mode is enabled (via the `--test-model` flag), tasks follow this execution pattern:
+
+1. **"Starting execution with test feedback"**: Initial setup phase
+2. **"Running main model..."**: Main model implements the task
+3. **"Main model complete, running tests..."**: Transition to test phase
+4. **"Sending to test model (MODEL_NAME)..."**: Test model invoked
+5. **"Test model complete."**: Test results received
+6. **"Issues found, requesting fixes..."**: (If tests fail) Re-invoking main model with test feedback
+7. **Completion or error state**
+
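The main → test → fix loop above can be sketched as follows. This is illustrative only; `runModel` and `hasTestFailures` are stand-ins, not the orchestrator's real API:

```ts
// Stand-in for a real engine call.
async function runModel(model: string, prompt: string): Promise<string> {
  return `${model} output for: ${prompt}`;
}

// Simplified failure check (the real heuristic is discussed in the review below).
function hasTestFailures(testOutput: string): boolean {
  return /failed|✗|assertion/i.test(testOutput);
}

async function orchestrate(task: string, mainModel: string, testModel: string): Promise<string> {
  let result = await runModel(mainModel, task);                    // "Running main model..."
  const report = await runModel(testModel, `test: ${result}`);     // "Sending to test model..."
  if (hasTestFailures(report)) {                                   // "Issues found, requesting fixes..."
    result = await runModel(mainModel, `fix based on: ${report}`);
  }
  return result;
}
```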
+### Test Model Integration
+The orchestrator uses separate progress reporting for test phases:
+```ts
+// Update agent display to show test model is running
+const display = StaticAgentDisplay.getInstance();
+if (display && agentNum) {
+ display.setAgentStatus(agentNum, "", "working", "testing", testModel);
+}
+```
+
+### Impact on Streaming Output
+During orchestrated execution:
+- The streaming output includes orchestrator status messages in addition to model-generated steps
+- Progress spinners update to show "testing" phase when test model is active
+- Test results are appended to the main output response for visibility
+- Multiple model calls (main → test → main) produce a longer sequence of progress events
+
+This orchestration is transparent to the user but provides more comprehensive test coverage. The orchestrator decides whether to use this pattern based on the `testModel` setting and task characteristics. [Source](https://github.com/michaelshimeles/ralphy/pull/162/files#diff-orchestrator.ts)
+
+## Progress Types and Events
+The `progress-types.ts` module defines structured progress event contracts that extend the original step detection system. These types enable more granular progress reporting beyond the heuristic JSON step detection.
+
+### Planning Progress Events
+The `PlanningProgressEvent` type tracks planning phase progress:
+
+```ts
+export interface PlanningProgressEvent {
+ taskId: string;
+ status: "started" | "thinking" | "completed" | "error" | string;
+ timestamp: number;
+ message?: string;
+ metadata?: Record<string, unknown>;
+ reward?: number;
+}
+```
+
+Status values include:
+- **started**: Planning phase initiated
+- **thinking**: AI is processing the planning request
+- **analyzing**: Reading and exploring the codebase
+- **planning**: Generating the implementation plan
+- **completed**: Planning phase finished successfully
+- **failed**: Planning encountered an error
+
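A minimal sketch of emitting these events through a progress callback. The callback wiring is an assumption about how `planTaskFiles` reports progress, and the type is re-declared locally so the snippet stands alone:

```ts
// Local re-declaration of the event shape for a self-contained example.
type PlanningProgressEvent = {
  taskId: string;
  status: string;
  timestamp: number;
  message?: string;
};

function emitPlanning(
  onProgress: (e: PlanningProgressEvent) => void,
  taskId: string,
  status: string,
  message?: string,
): void {
  onProgress({ taskId, status, timestamp: Date.now(), message });
}

// Usage: collect the event stream for one task.
const events: PlanningProgressEvent[] = [];
emitPlanning((e) => events.push(e), "task-1", "started");
emitPlanning((e) => events.push(e), "task-1", "completed", "plan ready");
```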
+### Agent Progress Tracking
+The `AgentProgress` interface tracks parallel agent execution state:
+
+```ts
+export interface AgentProgress {
+ agentNum: number;
+ taskTitle: string;
+ status: "planning" | "working" | "completed" | "failed";
+ phase?: ExecutionPhase; // "planning" | "execution" | "testing"
+ modelName?: string; // "main", "planning", "test"
+ currentStep?: string;
+ recentSteps?: string[];
+ startTime: number;
+}
+```
+
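A self-contained usage sketch, re-declaring the interface so the snippet compiles on its own; `recordStep` is a hypothetical helper that mirrors the display's five-step history:

```ts
type ExecutionPhase = "planning" | "execution" | "testing";

interface AgentProgress {
  agentNum: number;
  taskTitle: string;
  status: "planning" | "working" | "completed" | "failed";
  phase?: ExecutionPhase;
  modelName?: string;
  currentStep?: string;
  recentSteps?: string[];
  startTime: number;
}

const progress: AgentProgress = {
  agentNum: 1,
  taskTitle: "Implement auth flow",
  status: "working",
  phase: "execution",
  modelName: "main",
  recentSteps: [],
  startTime: Date.now(),
};

// Record a step, keeping only the five most recent (matching the display's step history).
function recordStep(p: AgentProgress, step: string): void {
  p.currentStep = step;
  p.recentSteps = [...(p.recentSteps ?? []), step].slice(-5);
}
```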
+### Integration with Streaming Output
+During streaming execution, these progress events complement the original step detection:
+
+1. **Planning phase**: The `planTaskFiles` function emits `PlanningProgressEvent` callbacks during plan generation
+2. **Execution phase**: The `AgentProgress` updates track which model is active and what step it's on
+3. **Test phase**: When orchestrator is enabled, events include `phase: "testing"` and `modelName: "test"`
+
+These structured events provide the data for the parallel progress UI and static agent display, enabling real-time dashboard-like visibility into all running tasks. [Source](https://github.com/michaelshimeles/ralphy/pull/162/files#diff-progress-types.ts)
+
+## Static Agent Display Integration
+The `static-agent-display.ts` component provides a real-time, in-terminal UI for tracking parallel agent execution. This complements the spinner-based progress during sequential execution with a dashboard-like view during parallel execution.
+
+### Visual Representation
+The static agent display renders:
+- **Workflow bar**: Shows current execution phase (planning → execution → testing) with progress indicators
+- **Agent cards**: One card per parallel agent, showing:
+ - Agent number and task title
+ - Current phase tag (`[PLANNING]`, `[EXECUTION]`, `[TESTING]`)
+ - Model tag (e.g., `[main]`, `[planning]`, `[test]`)
+ - Status indicator (● for working, ✓ for completed, ✗ for failed)
+ - Elapsed time
+- **Step history**: Last 5 steps for each agent, numbered 1-5, with color-coded actions
+
+### Integration with Task States
+The static agent display uses task state transitions to update agent status:
+
+```ts
+// When a task is claimed for execution
+display.setAgentStatus(agentNum, taskTitle, "working", "execution", "main");
+
+// When test model runs
+display.setAgentStatus(agentNum, taskTitle, "working", "testing", testModel);
+
+// When task completes
+display.setAgentStatus(agentNum, taskTitle, "completed", "execution", "main");
+```
+
+### Real-Time Step Updates
+The display updates agent steps from two sources:
+1. **Heuristic step detection**: Via `updateAgent(agentNum, step)` with detected steps from JSON output
+2. **OpenCode streaming**: Via `updateAgentFromOpenCode(agentNum, jsonLine)` which parses OpenCode's structured JSON output (tool_use, text, step_finish events)
+
+### Display Rendering
+The display:
+- Refreshes every second to show elapsed time updates
+- Updates in-place using ANSI escape codes to minimize terminal flicker
+- Shows empty numbered steps (1-5) when agents have no recent activity
+- Clears and re-renders the entire terminal on each update for a clean dashboard view
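A minimal sketch of such a refresh loop. The ANSI codes and one-second interval follow the description above; names are illustrative and not the component's real API:

```ts
// Clear the terminal and move the cursor home before each re-render.
const CLEAR_SCREEN = "\x1b[2J\x1b[H";

function renderDashboard(
  agents: { agentNum: number; taskTitle: string; startTime: number }[],
): string {
  const lines = agents.map((a) => {
    const elapsed = Math.floor((Date.now() - a.startTime) / 1000);
    return `● Agent ${a.agentNum} ${a.taskTitle} ${Math.floor(elapsed / 60)}m ${elapsed % 60}s`;
  });
  return CLEAR_SCREEN + lines.join("\n");
}

// A one-second refresh timer would call this repeatedly:
// setInterval(() => process.stdout.write(renderDashboard(agents)), 1000);
```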
+
+This display significantly improves observability during parallel execution by showing all agents' progress simultaneously, making it easy to identify slow agents, stuck tasks, or errors. [Source](https://github.com/michaelshimeles/ralphy/pull/162/files#diff-static-agent-display.ts)
```ts
if (locks.size > LOCK_MAX_LOCKS) {
  logWarn(`Lock registry size (${locks.size}) exceeded ${LOCK_MAX_LOCKS}. Evicting oldest non-own locks.`);

  const sorted = Array.from(locks.entries()).sort((a, b) => a[1].timestamp - b[1].timestamp);

  // Keep all locks owned by this process, evict oldest of others first
  const others = sorted.filter(([_path, info]) => info.owner !== lockOwner);
  const overflow = locks.size - LOCK_MAX_LOCKS;
  const toEvictOthers = others.slice(0, Math.max(overflow, 0));
```
🟡 Medium execution/locking.ts:348
cleanupStaleLocks logs a warning about exceeding LOCK_MAX_LOCKS but never evicts any locks. The filter info.owner !== lockOwner (line 354) always produces an empty array because locks only contains entries where owner === lockOwner (set at lines 152 and 211). When the registry is oversized, the warning fires repeatedly but memory usage is never reduced.
- const sorted = Array.from(locks.entries()).sort((a, b) => a[1].timestamp - b[1].timestamp);
-
- // Keep all locks owned by this process, evict oldest of others first
- const others = sorted.filter(([_path, info]) => info.owner !== lockOwner);
- const overflow = locks.size - LOCK_MAX_LOCKS;
- const toEvictOthers = others.slice(0, Math.max(overflow, 0));
+ const sorted = Array.from(locks.entries()).sort((a, b) => a[1].timestamp - b[1].timestamp);
+
+ // Evict oldest locks from this process when over limit
+ const overflow = locks.size - LOCK_MAX_LOCKS;
+ const toEvict = sorted.slice(0, Math.max(overflow, 0));
Evidence trail:
cli/src/execution/locking.ts lines 17-18 (locks Map and lockOwner definition), lines 149-162 (first locks.set with owner: lockOwner), lines 211-219 (second locks.set only when typedLockInfo.owner === lockOwner), lines 348-359 (cleanup logic with filter info.owner !== lockOwner). git_grep confirmed only two calls to locks.set exist.
```ts
function buildSkillsSection(workDir: string): string {
  const skillsCsv = getSkillsAsCsv(workDir);
  if (skillsCsv) {
    return `## Agent Skills
This repo includes compressed skill/playbook documentation for token efficiency:
${skillsCsv}

Before you start coding:
- Read and follow any relevant skill docs from compressed list above.
- If your engine supports a \`skill\` tool (e.g. OpenCode), use it to load relevant skills before implementing.
- If none apply, continue normally.`;
  }

  const skillRoots = SKILL_DIRECTORIES.map((dir) => join(workDir, dir)).filter(existsSync);
  if (skillRoots.length > 0) {
    return `## Agent Skills
This repo includes skill/playbook docs that describe preferred patterns, workflows, or tooling:
${skillRoots.map((p) => `- ${p}`).join("\n")}

Before you start coding:
- Read and follow any relevant skill docs from paths above.
- If your engine supports a \`skill\` tool (e.g. OpenCode), use it to load relevant skills before implementing.
- If none apply, continue normally.`;
  }
```
🟢 Low execution/prompt.ts:288
When skill directories exist but contain no .md files, getSkillsAsCsv returns empty and buildSkillsSection falls back to listing those same directories as "preferred patterns, workflows, or tooling" paths to read. This prompts the agent to read docs from empty directories, even though getSkillsAsCsv already determined they contain no valid skill files.
- const skillRoots = SKILL_DIRECTORIES.map((dir) => join(workDir, dir)).filter(existsSync);
- if (skillRoots.length > 0) {
- return `## Agent Skills
-This repo includes skill/playbook docs that describe preferred patterns, workflows, or tooling:
-${skillRoots.map((p) => `- ${p}`).join("\n")}
-
-Before you start coding:
-- Read and follow any relevant skill docs from paths above.
-- If your engine supports a \`skill\` tool (e.g. OpenCode), use it to load relevant skills before implementing.
-- If none apply, continue normally.`;
- }
Evidence trail:
cli/src/execution/prompt.ts:21 - `SKILL_DIRECTORIES = [".opencode/skills", ".claude/skills", ".skills"]`
cli/src/execution/prompt.ts:288-316 - `buildSkillsSection` function showing the fallback logic
cli/src/execution/skill-compress.ts:108-137 - `getSkillsAsCsv` function showing it only processes `.md` files and returns empty string if none found
cli/src/execution/locking.ts (outdated)

```ts
const start = Date.now();
while (Date.now() - start < delay) {
  // Minimal work to allow event loop processing
  if (Date.now() % 10 === 0) {
  }
}
}
```
🟢 Low execution/locking.ts:248
The Node.js branch of the retry backoff uses a while loop that spins the CPU for the entire delay duration, blocking the event loop and preventing any I/O, timers, or signal handlers from running. This makes the CLI unresponsive for up to 3+ seconds during lock retries, contradicting the comment claiming it "yields to the event loop". Consider using an actual yielding sleep like Atomics.wait if available, or at minimum setTimeout with a Promise to avoid blocking.
- } else {
- // For Node.js, use a lighter busy-wait with occasional yielding
- const start = Date.now();
- while (Date.now() - start < delay) {
- // Minimal work to allow event loop processing
- if (Date.now() % 10 === 0) {
- }
- }
- }
Evidence trail:
cli/src/execution/locking.ts lines 240-253 at REVIEWED_COMMIT. Line 240 shows `Math.min(baseDelay + jitter, 5000)` confirming max 5s delay. Lines 248-253 show the Node.js branch with a synchronous while loop: `while (Date.now() - start < delay) { if (Date.now() % 10 === 0) {} }`. The comments on lines 247-248 claim 'lighter busy-wait with occasional yielding' and 'Minimal work to allow event loop processing' but synchronous while loops in JavaScript cannot yield to the event loop.
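A non-blocking alternative along the lines the comment suggests could be an awaited `setTimeout`. This is a sketch, assuming the surrounding retry loop can be made async; it is not the repository's code:

```ts
// Non-blocking sleep: yields to the event loop instead of spinning the CPU.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical backoff helper using the same base-delay/jitter/5s-cap shape
// described in the evidence trail above.
async function backoff(attempt: number): Promise<void> {
  const baseDelay = 100 * 2 ** attempt;
  const jitter = Math.random() * 100;
  await sleep(Math.min(baseDelay + jitter, 5000));
}
```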
```ts
  return false;
}

export function releaseFileLock(filePath: string, workDir: string): void {
```
🟠 High execution/locking.ts:262
releaseFileLock unconditionally deletes the lock file without checking ownership. If the lock expired and another process acquired it, this will delete that other process's active lock file, breaking mutual exclusion and allowing a third process to enter the critical section.
Evidence trail:
cli/src/execution/locking.ts lines 262-274 (releaseFileLock function showing no ownership check before unlinkSync), lines 9-14 (LockInfo interface with owner field), line 17 (lockOwner definition), lines 150-153 (lock acquisition stores owner in file)
```ts
const hasFailures =
  /test (fail|error|broken)|failed|failing|✗|❌|assertion|exception/i.test(testOutput) &&
  !/0 fail|no fail|all pass|✓|✔|passed/i.test(testOutput);
```
🟡 Medium execution/orchestrator.ts:240
Suggestion: Make hasFailures robust: "5 passed, 1 failed" should trigger fixes, while infrastructure errors (e.g., Test execution failed: ...) should not. Parse counts/status and handle infra failures separately before deciding to fix.
- // Check if tests indicate failures that need fixing
- const hasFailures =
- /test (fail|error|broken)|failed|failing|✗|❌|assertion|exception/i.test(testOutput) &&
- !/0 fail|no fail|all pass|✓|✔|passed/i.test(testOutput);
Evidence trail:
cli/src/execution/orchestrator.ts lines 232-243 at REVIEWED_COMMIT - shows testOutput construction (lines 232-234) and hasFailures regex logic (lines 240-242). Line 234 shows infrastructure errors become 'Test execution failed: ${testResult.error}' which matches the first regex ('failed') but not the exclusion patterns. The exclusion regex on line 242 includes 'passed' which incorrectly excludes mixed results like '5 passed, 1 failed'.
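One way to implement the count-based check the reviewer suggests. This is a sketch with a hypothetical helper, not the PR's code; the `"Test execution failed:"` prefix follows the evidence trail above:

```ts
// Classify test output: infrastructure errors are separated from real failures,
// and counts like "5 passed, 1 failed" are parsed before falling back to keywords.
function detectTestFailures(output: string): "failed" | "passed" | "infra-error" {
  if (/^Test execution failed:/m.test(output)) return "infra-error";
  const failed = /(\d+)\s+fail(?:ed|ing)?/i.exec(output);
  if (failed && Number(failed[1]) > 0) return "failed";
  const passed = /(\d+)\s+pass(?:ed)?/i.exec(output);
  if (passed) return "passed";
  // Fall back to keyword heuristics when no counts are present.
  return /✗|❌|assertion/i.test(output) ? "failed" : "passed";
}
```

With this shape, mixed results trigger fixes while transient infra errors do not.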
```ts
export interface PlanningProgressEvent {
  taskId: string;
  status: "started" | "thinking" | "completed" | "error" | string;
```
🟢 Low execution/progress-types.ts:35
The status property includes "error" in its type definition, but the implementation in planTaskFiles only ever emits "failed" for error states. Consumers checking for "error" will miss all failure events.
```diff
-  status: "started" | "thinking" | "completed" | "error" | string;
+  status: "started" | "thinking" | "completed" | "failed" | string;
```
Evidence trail:
cli/src/execution/progress-types.ts lines 33-40: PlanningProgressEvent interface with `status: "started" | "thinking" | "completed" | "error" | string;`
cli/src/execution/planning.ts lines ~377-379: streaming callback sets `status = "failed"` for error conditions
cli/src/execution/planning.ts line ~435: emits `status: "failed"` when raw tool use detected
cli/src/execution/planning.ts line ~478: emits `status: "failed"` for regular failures
git_grep for `status === "error"` returned no matches (no consumers checking for "error")
git_grep for `status === "failed"` shows usage in static-agent-display.ts:230
```ts
  "yarn.lock",
];
const fileStates = new Map<string, { mtime: number; size: number; hash: string }>();
let changed = !cached;
```
🟡 Medium execution/planning.ts:49
When a monitored file like package.json is deleted, generateRepoFingerprint returns the stale cached hash instead of detecting the change. The loop skips deleted files without removing them from the cached fileStates, so the changed flag stays false and the stale dirHash is returned. This causes the planning cache to believe the deleted file still exists.
Evidence trail:
cli/src/execution/planning.ts lines 29-107 (REVIEWED_COMMIT): The `generateRepoFingerprint` function at line 29. The loop at lines 48-66 only processes files where `existsSync(filePath)` returns true (line 49). The `changed` flag is only set to `true` when an existing file has different mtime/size (line 60). At lines 85-88, if `!changed && cached`, the stale `cached.dirHash` is returned. There is no code to detect when a file that was previously in `cached.fileStates` no longer exists.
```ts
  return dirHash;
}

export function loadPlanningCache(
```
🟡 Medium execution/planning.ts:112
loadPlanningCache always loads the compressed file (.gz) when present, even if the uncompressed file is newer. Since savePlanningCache can leave a stale compressed file next to a fresh uncompressed file after a gzip error, this causes stale cache data to be loaded. Consider checking mtime of both files and loading whichever is newer.
Evidence trail:
cli/src/execution/planning.ts lines 112-141 (`loadPlanningCache`): checks `existsSync(compressedCacheFile)` at line 117, loads it immediately without mtime check. cli/src/execution/planning.ts lines 143-162 (`savePlanningCache`): catch block at line 159-160 writes uncompressed file but does not delete existing `.gz` file. Commit: REVIEWED_COMMIT
```ts
  return `You are working on a specific task. Focus ONLY on this task:

TASK: ${task}

${sections.join("\n\n")}

${buildProtectedPathsWarning(prdFile, boundaries)}
Do NOT Read, Glob, or Search inside .ralphy-sandboxes or .ralphy-worktrees.
Do NOT mark tasks complete - that will be handled separately.
Focus only on implementing: ${task}`;
}
```
🟡 Medium execution/prompt.ts:385
The buildPrompt function outputs a raw list of protected paths at the end of the prompt without a section header or prohibitory instruction. The LLM receives filenames like the PRD file and .ralphy paths with no explicit "Do NOT modify" directive, unlike buildExecutionPrompt which uses a Boundaries - Do NOT modify: header. This omission leaves system files vulnerable to modification because the agent has no clear signal these are immutable boundaries.
return `You are working on a specific task. Focus ONLY on this task:
TASK: ${task}
${sections.join("\n\n")}
-${buildProtectedPathsWarning(prdFile, boundaries)}
-Do NOT Read, Glob, or Search inside .ralphy-sandboxes or .ralphy-worktrees.
-Do NOT mark tasks complete - that will be handled separately.
+## Boundaries (Do NOT modify these files/directories)
+${buildProtectedPathsWarning(prdFile, boundaries)}
+
+Do NOT Read, Glob, or Search inside .ralphy-sandboxes or .ralphy-worktrees.
+Do NOT mark tasks complete - that will be handled separately.
 Focus only on implementing: ${task}`;
Evidence trail:
cli/src/execution/prompt.ts:348-355 - `buildProtectedPathsWarning` returns raw list with no header: `return [...systemPaths, ...userPaths].join('\n')`; cli/src/execution/prompt.ts:391 - `buildPrompt` uses this raw output; cli/src/execution/prompt.ts:463 - `buildExecutionPrompt` uses proper header `Boundaries - Do NOT modify:\n${allBoundaries.join('\n')}`; cli/src/execution/prompt.ts:15-19 - RALPHY_PROTECTED_PATHS constant definition
cli/src/execution/orchestrator.ts (outdated)

```ts
  "debug",
];

return testKeywords.some((kw) => combined.includes(kw));
```
🟢 Low execution/orchestrator.ts:304
The keyword matching in shouldUseOrchestrator uses String.prototype.includes, so the keyword "test" matches substrings inside words like "latest", "fastest", and "contest". This causes the orchestrator to be enabled for tasks that don't actually involve testing. Consider matching whole words with word boundaries (e.g., regex with \b) or splitting on whitespace and checking for exact matches.
- return testKeywords.some((kw) => combined.includes(kw));
+ const words = combined.split(/\s+/);
+ return testKeywords.some((kw) => words.includes(kw));

Also found in 1 other location(s):
- cli/src/execution/locking.ts:106: The `isInRalphyDir` function uses case-sensitive substring matching (`String.prototype.includes`), which leads to two issues:
  1. **Case sensitivity on Windows**: Windows file systems are case-insensitive, so a path containing `.RALPHY` (e.g. `C:\Project\.RALPHY\data`) will return `false`, failing to identify the directory correctly. This contradicts the logic in `normalizePathForLocking`, which explicitly handles Windows case normalization.
  2. **False positives**: The simple substring check will match unrelated files such as `my.ralphy.backup.txt` or directories like `projects/not.ralphy/`, causing the application to incorrectly classify user data as internal system files.

**Fix**: Normalize the path first (or handle case-insensitivity) and check for specific path segments (e.g. surrounding the search term with path separators) rather than a partial string match.
Evidence trail:
cli/src/execution/orchestrator.ts lines 280-306 (REVIEWED_COMMIT): Function `shouldUseOrchestrator` at line 304 uses `combined.includes(kw)` where `combined` is the concatenation of taskTitle and taskDescription. The `testKeywords` array at line 290 includes 'test'. JavaScript's `String.prototype.includes()` performs substring matching, so 'latest'.includes('test') === true, 'fastest'.includes('test') === true, 'contest'.includes('test') === true.
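A segment-based check along the lines of that fix might look like this (hedged sketch; the real `isInRalphyDir` signature and matching rules may differ):

```ts
// Hypothetical sketch: normalize separators and case, then match whole path
// segments instead of doing a raw substring check.
function isInRalphyDir(p: string): boolean {
  const normalized = p.replace(/\\/g, "/").toLowerCase();
  return normalized
    .split("/")
    .some((seg) => seg === ".ralphy" || seg.startsWith(".ralphy-"));
}
```

This treats `.ralphy` and `.ralphy-*` as path segments, so `my.ralphy.backup.txt` no longer matches while `.RALPHY` on Windows does.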
```ts
const retryableMessages = [
  "timeout",
  "connection refused",
  "network",
  "rate limit",
  "too many requests",
  "temporary failure",
  "try again",
  "locked",
  "conflict",
  "connection error",
  "unable to connect",
  "internet connection",
  "econnrefused",
  "econnreset",
  "socket hang up",
  "fetch failed",
];
```
isRetryableError is missing several patterns from the previous implementation
The retryableMessages list drops these critical patterns present in the old retry.ts version:
| Missing pattern | Example error | Impact |
|---|---|---|
| `hit your limit` | `"rate_limit_error: You have hit your limit"` | API rate limit not retried |
| `quota` | `"You exceeded your current quota"` | Quota errors treated as permanent |
| `429` (bare code) | `"Error: 429"` or just `"429"` | HTTP 429 not retried if message is bare |
| `ENOTFOUND` | `"getaddrinfo ENOTFOUND api.example.com"` | DNS failures permanently fail (not caught by "timeout") |
| `overloaded` | `"Service Overloaded"` | Service overload errors not retried |

These errors will now be treated as non-retryable, causing tasks to fail permanently on transient DNS, quota, or rate-limit errors.
Add these patterns to the `retryableMessages` list:

```ts
const retryableMessages = [
  // ... existing entries ...
  "hit your limit",
  "quota",
  "429",
  "enotfound",
  "overloaded",
];
```
Path: cli/src/utils/errors.ts, lines 87-104.
```ts
if (dryRun && !options.debugOpenCode) {
  spinner.success("(dry run) Skipped");
} else {
```
Unreachable dead-code branch
This if (dryRun && !options.debugOpenCode) condition can never evaluate to true inside the main loop. The early-exit block at lines 107–115 already handles the dryRun && !debugOpenCode case and returns before the TaskStateManager is even initialised:
```ts
// lines 107-115 — returns unconditionally when dryRun && !debugOpenCode
if (dryRun && !options.debugOpenCode) {
  ...
  return result;
}
```

By the time execution enters the `while` loop, the only surviving combination is:
- `dryRun = false` → condition at 224 is trivially `false`, or
- `dryRun = true && debugOpenCode = true` → `!options.debugOpenCode` is `false`, so the condition is still `false`.
The spinner call spinner.success("(dry run) Skipped") is therefore dead code that will never execute. The branch and spinner should be removed to avoid misleading future maintainers.
Suggested change:

```suggestion
} else {
```
```ts
/connection|network|timeout|unable to connect|internet connection|econnrefused|econnreset/i.test(
  error,
);
const attemptNumber = DEFAULT_MAX_REPLANS - maxReplans + 1;
```
attemptNumber uses a global constant instead of actual call depth
```ts
const attemptNumber = DEFAULT_MAX_REPLANS - maxReplans + 1;
```

This formula only produces the correct attempt number when the initial `maxReplans` argument equals `DEFAULT_MAX_REPLANS`. If `planTaskFiles` is ever called directly with a smaller value — e.g. `planTaskFiles(engine, task, workDir, model, 1)` — then on the retry call `maxReplans` becomes `0`, and `attemptNumber` computes as `DEFAULT_MAX_REPLANS - 0 + 1 = DEFAULT_MAX_REPLANS + 1`. The exponential backoff `Math.min(2000 * 2 ** (attemptNumber - 1), 10000)` then yields the maximum delay on the very first retry, introducing an unexpectedly long wait.
Track the attempt independently by passing it down through the recursion, or derive it from a local counter:
```ts
// Replace with a local counter that doesn't depend on a global constant:
const attempt = maxReplans === 0 ? 1 : DEFAULT_MAX_REPLANS - maxReplans + 1;
```

Or, cleaner: add an `attempt = 1` parameter that increments on each recursive call.
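A sketch of the attempt-parameter approach. The name `planWithRetry` and the reduced signature are illustrative; the real `planTaskFiles` takes more arguments, and only the retry bookkeeping is shown here.

```ts
// Illustrative only: backoff is derived from an explicit attempt counter
// passed through the recursion, independent of any global constant.
async function planWithRetry(
  plan: () => Promise<boolean>,
  maxReplans: number,
  attempt = 1,
): Promise<boolean> {
  if (await plan()) return true;
  if (maxReplans <= 0) return false;
  // Backoff grows with actual call depth, capped at 10 seconds:
  const delayMs = Math.min(2000 * 2 ** (attempt - 1), 10000);
  await new Promise((resolve) => setTimeout(resolve, delayMs));
  return planWithRetry(plan, maxReplans - 1, attempt + 1);
}
```

Calling it with any initial `maxReplans` now yields the same backoff schedule, since the delay no longer depends on `DEFAULT_MAX_REPLANS`.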
```ts
async transitionState(
  taskId: string,
  newState: TaskState,
  error?: string,
  executionContext?: TaskStateEntry["executionContext"],
): Promise<void> {
  const key = this.buildTaskKey(taskId);
  const task = this.tasks.get(key);

  if (!task) {
    logError(`Task ${taskId} not found in state manager`);
    return;
  }

  const oldState = task.state;
  task.state = newState;

  if (error) {
    task.errorHistory.push(error);
  }

  if (executionContext) {
    task.executionContext = { ...task.executionContext, ...executionContext };
  }

  await this.persistState();
  logDebug(`Task ${taskId} transitioned from ${oldState} to ${newState}`);
}
```
transitionState writes state to disk without acquiring a file lock
claimTaskForExecution correctly acquires a file lock before loading, mutating, and persisting the state file — preventing parallel workers from double-claiming a task. transitionState, however, operates only on the in-memory tasks Map and writes directly to disk with no lock and no fresh reload:
```ts
async transitionState(...): Promise<void> {
  const task = this.tasks.get(key); // potentially stale in-memory snapshot
  task.state = newState;
  await this.persistState(); // blind overwrite — no lock, no reload
}
```

In parallel execution where multiple workers share a single state file, two agents completing their tasks concurrently will each hold their own (now-diverged) snapshots of the full state file. When both call `persistState`, the second writer overwrites the first writer's completed-task record, silently discarding a `COMPLETED` or `FAILED` transition.
The fix is to apply the same lock / reload / mutate / persist / release pattern already used by claimTaskForExecution to transitionState.
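A minimal runnable sketch of that lock / reload / mutate / persist / release shape, using an in-process FIFO mutex and a plain object as stand-ins for the PR's file lock and on-disk state file. All names here are illustrative, not the real `TaskStateManager` API.

```ts
// Stand-in state: a plain object plays the role of the .ralphy state file.
type TaskState = "PENDING" | "RUNNING" | "COMPLETED" | "FAILED";

let stateFile: Record<string, TaskState> = { "task-a": "RUNNING", "task-b": "RUNNING" };
let lockChain: Promise<void> = Promise.resolve(); // FIFO mutex standing in for acquireFileLock

function withLock<T>(fn: () => Promise<T>): Promise<T> {
  const run = lockChain.then(fn);
  lockChain = run.then(() => undefined, () => undefined); // release regardless of outcome
  return run;
}

// lock → reload → mutate → persist → release, as claimTaskForExecution does:
async function transitionState(taskId: string, newState: TaskState): Promise<void> {
  await withLock(async () => {
    const fresh = { ...stateFile }; // reload a fresh snapshot under the lock
    fresh[taskId] = newState;       // mutate only this task's entry
    stateFile = fresh;              // persist while still holding the lock
  });
}
```

Because each writer reloads under the lock, two concurrent completions each see the other's write before persisting, so neither transition is lost.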
PR3: execution state
Summary
This PR introduces the stateful execution core used by later agent/parallel work: task-state transitions, planning scaffolding, locking primitives, and prompt/test-orchestration contracts.
Why this PR exists
What it adds
- `cli/src/execution/`:
  - `task-state.ts` (state machine + claim/transition APIs)
  - `locking.ts` (file/task lock semantics)
  - `planning.ts` (planning cache, repo fingerprinting, planned-file parsing)
  - `orchestrator.ts`, `skill-compress.ts`, `progress-types.ts`
  - `sequential.ts`, `retry.ts`, `prompt.ts`
- `cli/src/ui/static-agent-display.ts`
- `cli/src/tasks/types.ts` (description/dependencies/csv-aware source typing)

PR preview
(`ANALYSIS`, `PLAN`, `FILES`, `OPTIMIZATION`).

Concrete scenarios this fixes
Planning and test-agent details
`<ANALYSIS>`, `<PLAN>`, `<FILES>`, `<OPTIMIZATION>`

Reliability hardening
Tests added/updated
- `cli/__tests__/locking.test.ts`
- `cli/__tests__/locking-security.test.ts`
- `cli/__tests__/orchestrator.test.ts`
- `cli/src/execution/prompt.test.ts`

Validation

`bun run check`, `bun tsc --noEmit`, and `bun test` in order.

Note
Add execution-state orchestration and retry flow by moving sequential runs to `TaskStateManager` with atomic claims and integrating circuit breaker retries across execution and testing.

Introduce centralized task state with file locks, atomic claims, and transitions in sequential execution; add an orchestrated main→test→fix flow with circuit breaker–backed retries; refactor prompt building to include environment/skills and protected paths; add locking utilities and tests; standardize errors and telemetry redaction; and switch CLI commands to throw instead of `process.exit`.

📍 Where to Start
Start with `runSequential` in sequential.ts, then review state operations in `TaskStateManager` in task-state.ts and the retry/circuit logic in `withRetry` in retry.ts.

Macroscope summarized 4be1eff.
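The retry shape referenced throughout this PR can be sketched as follows. This is a simplified illustration, not the real `withRetry` from retry.ts, which additionally tracks circuit-breaker state; the predicate parameter and defaults here are assumptions.

```ts
// Illustrative retry helper: retries only errors the predicate deems
// transient, with capped exponential backoff between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on permanent errors or when attempts are exhausted:
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      const delayMs = Math.min(2000 * 2 ** (attempt - 1), 10000);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Pairing this with a pattern-based `isRetryableError` predicate (as discussed in the errors.ts review comment above) is what makes the missing-pattern bug consequential: any message the predicate misses fails the task on its first error.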