
stack2: feat: strengthen engine execution and parser reliability #161

Open
VX1D wants to merge 9 commits into michaelshimeles:main from VX1D:stack/pr2-engine-layer

Conversation


VX1D commented on Mar 3, 2026

PR2: engine layer

Summary

This PR rebuilds the engine execution boundary for correctness and safer failure handling. It covers engine adapters, parser behavior, command execution, and diagnostics.

Why this PR exists

  • Engine retries/failures were too brittle in mixed streaming/non-streaming flows.
  • Exit code handling had correctness gaps.
  • Parser output interpretation needed better auth/error extraction behavior.

What it adds

  • Engine-layer updates across cli/src/engines/:
    • base.ts, executor.ts, parsers.ts
    • provider adapters (claude.ts, codex.ts, copilot.ts, cursor.ts, droid.ts, gemini.ts, opencode.ts, qwen.ts)
    • shared engine types/index wiring
  • New execution helper files:
    • cli/src/engines/executor.ts
    • cli/src/engines/parsers.ts

PR preview (what reviewers will notice)

  • Engine failures now report the real CLI exit code instead of defaulting to success in streaming mode.
  • Retry decisions are more consistent because parser output is normalized before error classification.
  • Authentication failures are surfaced as auth failures, not generic command errors.
  • OpenCode debug diagnostics are still useful, but high-risk raw snippets are removed.
  • Provider adapters follow one cleaner error path, which makes behavior match across Claude/Copilot/Codex/OpenCode.

Concrete scenarios this fixes

  • A command exits non-zero in streaming mode: the task now fails correctly and can retry or defer based on the actual error.
  • Copilot returns noisy stderr with auth text: parser extracts auth failure instead of mislabeling it.
  • Debug mode on production-like runs: logs keep operational metadata without leaking full outputs.
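The auth-extraction scenario above can be sketched roughly as follows; the function name and patterns are illustrative assumptions, not the actual `parsers.ts` code:

```typescript
// Hypothetical sketch: classify auth failures before generic command errors,
// so noisy stderr containing auth text is surfaced as an auth failure.
// Patterns and names are assumptions, not the real parsers.ts API.
function classifyFailure(stderr: string): "auth" | "command" | null {
  const authPatterns = [/not logged in/i, /authentication failed/i, /invalid api key/i];
  if (authPatterns.some((p) => p.test(stderr))) return "auth";
  if (/error/i.test(stderr)) return "command";
  return null;
}
```

Checking auth patterns first is what prevents an authentication failure from being mislabeled as a generic command error.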

Security/reliability details

  • Streaming path now propagates real process exit code (no masked failures).
  • Node streaming process lifecycle cleanup tightened to avoid leaked registrations.
  • Parser logic improved for auth failure extraction and command error formatting.
  • OpenCode diagnostics redacted to reduce risk of leaking command args/output snippets.
  • Copilot temp-file lifecycle behavior hardened.
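The exit-code fix amounts to deriving success from the real process exit code rather than assuming success in streaming mode. A minimal sketch with hypothetical names (the actual logic lives in `executor.ts`/`base.ts`):

```typescript
interface StreamResult {
  success: boolean;
  exitCode: number;
  error?: string;
}

// Before the fix, the streaming path effectively reported success regardless
// of how the CLI exited; the hardened path propagates the real code.
function resolveStreamingResult(exitCode: number, output: string): StreamResult {
  if (exitCode !== 0) {
    return {
      success: false,
      exitCode,
      error: `CLI exited with code ${exitCode}: ${output.slice(0, 200)}`,
    };
  }
  return { success: true, exitCode };
}
```

With the exit code carried through, downstream retry/defer logic can react to the actual failure instead of a masked success.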

Test-facing behavior

  • Parser behavior now matches stricter test expectations for failure strings.
  • Engine error pathways are more deterministic under retry.

Validation

  • Included in cumulative stack verification.
  • Stack passes `bun run check`, `bun tsc --noEmit`, and `bun test` in order.

Note

Unify CLI engine execution to read prompts from stdin, add robust stream-json parsing and error handling in engines.BaseAIEngine, and adopt it across OpenCode, Claude, Qwen, Copilot, Cursor, Droid, Gemini, and Codex.

Refactors engine execution to a fuller engines.BaseAIEngine with stdin-based prompts, streaming, timeouts, centralized parsers, and environment handling; migrates engines to use shared helpers; adds OpenCode-specific parsing with session IDs and step detection; introduces executor, validation, and JSON schemas; and hardens telemetry redaction and cleanup utilities.

📍Where to Start

Start with the new shared execution and parsing flow in engines.BaseAIEngine and helpers in cli/src/engines/base.ts, then review parser utilities in cli/src/engines/parsers.ts and executor in cli/src/engines/executor.ts.

Macroscope summarized 16dcd22.


vercel bot commented Mar 3, 2026

@VX1D is attempting to deploy a commit to the Goshen Labs Team on Vercel.

A member of the Team first needs to authorize it.


greptile-apps bot commented Mar 3, 2026

Greptile Summary

This PR refactors the engine execution layer by extracting shared logic into executor.ts, parsers.ts, and validation.ts, and migrates all provider adapters to a unified buildArgs + processCliResult contract on BaseAIEngine. Several real bugs from the previous stack are fixed (streaming exit codes, stderr processing, OPENCODE_REQUEST_DELAY spread order, diagnostic log path), though concerns about per-chunk TextDecoder instantiation remain.

Key issues found:

  • dryRun not honored in six engine overrides: OpenCodeEngine, CopilotEngine, CodexEngine, CursorEngine, GeminiEngine, and DroidEngine all override execute() (and most override executeStreaming()) without re-implementing the dryRun guard that BaseAIEngine.execute now provides. Only ClaudeEngine and QwenEngine benefit from that guard, making dry-run behavior inconsistent across engines.
  • checkForErrors false-positive risk on AI-generated text — the primary pattern scan in parsers.ts applies broad keyword matches ("model not found" + "error", "invalid model" + "error") to the entire combined output, including the AI's own response prose, which can produce a failure result even when the CLI exits 0.
  • CodexEngine.buildArgs generates a lastMessageFile path that processCliResult cannot clean up — the execution is safe today because executeStreaming overrides to call execute, but the buildArgs/processCliResult contract leaks the temp file if the base-class execution path is ever reached.

Confidence Score: 3/5

  • The PR introduces meaningful improvements but ships two logic bugs that can surface in production: dry-run tasks that should be skipped will actually execute for most engines, and the shared checkForErrors can produce false failures on valid AI output.
  • The architectural direction is sound and many previous issues were addressed (stderr processing, diagnostic log path, timeout flag capture). However the dryRun bypass affects six of eight engines and is a silent regression from the base-class refactoring, and the checkForErrors false-positive risk is shared across every engine now that it's centralised.
  • Files to focus on: cli/src/engines/parsers.ts (shared false-positive risk); every engine file that overrides execute()/executeStreaming() without a dryRun guard (opencode.ts, copilot.ts, codex.ts, cursor.ts, gemini.ts, droid.ts); and cli/src/engines/codex.ts (latent lastMessageFile leak).
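One way the missing guard could be restored in an override is sketched below; all names and shapes here are assumptions for illustration, not the PR's actual code:

```typescript
// Assumed minimal shapes, standing in for the real engine types.
interface EngineOptions {
  dryRun?: boolean;
}
interface AIResult {
  success: boolean;
  response: string;
}

// Any engine that overrides execute() must repeat the dry-run short-circuit
// that BaseAIEngine.execute provides, otherwise dry-run tasks still run the CLI.
function guardedExecute(
  options: EngineOptions | undefined,
  run: () => AIResult,
): AIResult {
  if (options?.dryRun) {
    // Mirror the base class: skip the real CLI invocation entirely.
    return { success: true, response: "[dry-run] command not executed" };
  }
  return run();
}
```

The alternative fix is structural: keep the guard only in BaseAIEngine.execute and have overrides call into the base-class path instead of spawning the CLI directly.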

Important Files Changed

  • `cli/src/engines/base.ts`: Core base class refactored to delegate to shared executor/parser modules; executeStreaming now uses execCommandStreamingNew with proper Promise.all for concurrent stream reading; new TextDecoder() is still instantiated per chunk (lines 171, 194), which can corrupt multi-byte UTF-8 sequences split across chunks.
  • `cli/src/engines/executor.ts`: New shared execution module; execCommandStreaming correctly processes both stdout and stderr via readStreamLines; timeout-flag capture (didTimeOut) is fixed before reset; the Node.js path properly wraps streams with Readable.toWeb; registered process cleanup is tightened with an unregisterOnce guard.
  • `cli/src/engines/parsers.ts`: New shared parser module; the secondary error check correctly finds the matched line instead of the last line; however, the primary checkForErrors scan applies broad model-error keyword patterns to the entire output (including AI response text), creating false-positive failure risk on any AI explanation that mentions error class names.
  • `cli/src/engines/validation.ts`: New validation module with sound command/arg whitelisting; validateCommand correctly blocks </> redirection characters, but validateArgs does not, creating a latent inconsistency if shell: true is ever used; otherwise the approach is solid, with shell: false preventing real shell injection.
  • `cli/src/engines/opencode.ts`: OPENCODE_REQUEST_DELAY default is now correctly overridable by callers (spread order fixed); the diagnostic log is correctly gated behind the debug flag and written to os.tmpdir(); however, the execute() override bypasses BaseAIEngine.execute's dryRun guard, so dry-run mode has no effect on OpenCode.
  • `cli/src/engines/copilot.ts`: Auth error detection now present in both execute and executeStreaming; processCliResult now fully implemented; temp file lifecycle is correctly handled with finally blocks; execute() and executeStreaming() overrides bypass BaseAIEngine's dryRun guard.
  • `cli/src/engines/codex.ts`: processCliResult now properly implemented; the execute() override correctly cleans up lastMessageFile via finally; buildArgs generates a new temp file path on each call that processCliResult cannot access, creating a latent file-leak risk if the base-class execution path is ever used.
  • `cli/src/engines/gemini.ts`: processCliResult now correctly throws to prevent silent misuse; execute() and executeStreaming() overrides implement consistent error handling; both overrides bypass the base-class dryRun guard.
  • `cli/src/engines/claude.ts`: Cleanly refactored to rely entirely on BaseAIEngine.execute/executeStreaming via buildArgs + processCliResult; correctly inherits dryRun handling from the base class; no issues found.
  • `cli/src/engines/qwen.ts`: Cleanly refactored to use the base-class execution path; processCliResult properly delegates to shared createErrorResult/createSuccessResult helpers; inherits dryRun handling correctly.
  • `cli/src/engines/cursor.ts`: processCliResult now implemented; uses buildArgsWithStdin for cross-platform stdin delivery; both execute() and executeStreaming() overrides bypass the dryRun guard from the base class.
  • `cli/src/engines/droid.ts`: processCliResult now fully implemented; consistent error-handling pattern across execute, executeStreaming, and processCliResult; execute() and executeStreaming() overrides bypass the base-class dryRun guard.
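A possible mitigation for the checkForErrors false-positive risk (an assumption, not the PR's implementation) is to count keyword hits only when the CLI itself signalled trouble, e.g. on stderr or alongside a non-zero exit:

```typescript
// Hypothetical stricter variant: a keyword hit in stdout alone does not fail
// the task when the CLI exited 0, so an AI response that merely *mentions*
// "invalid model" cannot trigger a false failure.
function checkForCliErrors(
  stderr: string,
  exitCode: number,
  stdout: string,
): string | null {
  const patterns = [/model not found/i, /invalid model/i];
  // stderr is CLI-authored output, so a hit there is trusted.
  if (patterns.some((p) => p.test(stderr))) return stderr.trim();
  // stdout hits only count when the exit code already indicates failure.
  if (exitCode !== 0 && patterns.some((p) => p.test(stdout))) return stdout.trim();
  return null;
}
```

The key design choice is the signal hierarchy: exit code and stderr are treated as authoritative, while response text is not.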

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Caller: execute / executeStreaming] --> B{Engine overrides execute?}
    B -- No: Claude, Qwen --> C[BaseAIEngine.execute]
    B -- Yes: OpenCode, Copilot, Codex, Cursor, Gemini, Droid --> D[Engine.execute override]
    C --> E{dryRun?}
    E -- Yes --> F[Return dry-run result ✅]
    E -- No --> G[buildArgs + execCommand]
    D --> H[execCommand directly ⚠️ no dryRun check]
    G --> I[processCliResult]
    H --> I
    I --> J[checkForErrors on full output]
    J --> K{Error keyword in AI response text?}
    K -- Yes → false positive ⚠️ --> L[Return success:false]
    K -- No --> M[Parse tokens + return AIResult]

    subgraph execCommandStreamingNew
        N[Bun.spawn OR Node spawn+Readable.toWeb]
        N --> O[Promise.all: readStdout + readStderr + exited]
    end

    subgraph execCommandStreaming legacy
        P[execCommandStreamingNew]
        P --> Q[readStreamLines stdout+stderr]
        Q --> R[onLine callback]
    end
```

Last reviewed commit: 16dcd22
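The per-chunk TextDecoder concern flagged for base.ts can be avoided with a single decoder per stream; a self-contained sketch:

```typescript
// One TextDecoder instance with { stream: true } buffers partial multi-byte
// UTF-8 sequences across chunk boundaries instead of emitting U+FFFD.
function decodeChunks(chunks: Uint8Array[]): string {
  const decoder = new TextDecoder("utf-8");
  let out = "";
  for (const chunk of chunks) {
    out += decoder.decode(chunk, { stream: true });
  }
  // Final call without { stream: true } flushes any trailing bytes.
  return out + decoder.decode();
}
```

A fresh `new TextDecoder()` per chunk, by contrast, would decode `0xC3` and `0xA9` separately and produce two replacement characters instead of `é`.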

Comment on lines +112 to +119 of cli/src/engines/codex.ts
```typescript
protected processCliResult(
  _stdout: string,
  _stderr: string,
  _exitCode: number,
  _workDir: string,
): AIResult {
  return { success: true, response: "Not implemented", inputTokens: 0, outputTokens: 0 };
}
```

Stub processCliResult silently returns success with "Not implemented"

CodexEngine (and DroidEngine at cli/src/engines/droid.ts lines 181–188) both implement processCliResult as a no-op that returns { success: true, response: "Not implemented", ... }. Now that BaseAIEngine.executeStreaming is concrete and calls this.processCliResult, any caller that invokes executeStreaming on a CodexEngine or DroidEngine instance will receive a silent success result with a misleading response instead of an error. This is a silent regression from the previous behavior.

Either implement the full result-parsing logic in processCliResult, or throw an unambiguous error:

```typescript
protected processCliResult(...): AIResult {
    throw new Error(`${this.name}: processCliResult is not implemented; use execute() directly`);
}
```

Same fix needed in cli/src/engines/droid.ts line 187.

cli/src/engines/copilot.ts, line 309:

```typescript
// Only check for specific CLI errors, not general "Error:" in response content
const error = checkForErrors(output);
if (error && !response.toLowerCase().includes("here's the fix")) {
```

Fragile hardcoded string suppresses error detection

The condition !response.toLowerCase().includes("here's the fix") silently swallows a detected CLI error any time Copilot's response happens to contain the phrase "here's the fix". This is an extremely narrow, brittle heuristic — any legitimate error that happens alongside a response containing that phrase (e.g. a non-zero exit while Copilot explains a fix) will be reported as a success. There is no principled reason to tie error suppression to this specific natural-language phrase.

```typescript
// Remove the heuristic — error detection should not depend on response content
const error = checkForErrors(output);
if (error) {
    return {
        success: false,
        response,
        inputTokens: tokenCounts.input,
        outputTokens: tokenCounts.output,
        error,
    };
}
```


greptile-apps bot commented Mar 3, 2026

Additional Comments (1)

cli/src/engines/index.ts
GeminiEngine silently removed without a deprecation path

"gemini" has been removed from AIEngineName and from the createEngine() switch statement. This is a silent breaking change: any existing configuration or code that specifies "gemini" as an engine name will now hit the default branch and throw Unknown AI engine: gemini at runtime. There is no migration guide, deprecation warning, or fallback.

If this removal is intentional, consider:

  1. Adding a case "gemini": throw new Error("Gemini engine has been removed; ...") with a helpful message before the default.
  2. Documenting the removal in README.md and CHANGELOG.
  3. Keeping "gemini" in AIEngineName as deprecated with a type comment.
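Suggestion 1 could look roughly like this sketch; this is a simplified stand-in for createEngine (returning a name rather than an engine instance), and the message text is illustrative:

```typescript
// Sketch of suggestion 1: fail fast with a targeted message before `default`,
// so "gemini" users get a migration hint instead of a generic unknown-engine error.
function createEngineClassName(name: string): string {
  switch (name) {
    case "claude":
      return "ClaudeEngine";
    case "gemini":
      throw new Error(
        "Gemini engine has been removed; see the CHANGELOG for alternatives.",
      );
    default:
      throw new Error(`Unknown AI engine: ${name}`);
  }
}
```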


dosubot bot commented Mar 3, 2026

Related Documentation

5 document(s) may need updating based on files changed in this PR:

Goshen Labs's Space

AI Engine Addition Process
@@ -1,5 +1,11 @@
 ## Architecture and Extension Points
-AI engines in Ralphy are implemented as classes extending a shared `BaseAIEngine` abstract class. Each engine defines its name, CLI command, and execution logic. Engines are registered in `cli/src/engines/index.ts` and instantiated via the `createEngine` function, which maps engine names to their respective classes. This modular approach ensures consistent integration and simplifies extension for new engines ([source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/engines/index.ts#L3-L56)).
+AI engines in Ralphy are implemented as classes extending a shared `BaseAIEngine` abstract class. Each engine defines its name, CLI command, and implements two abstract methods: `buildArgs()` to construct command-line arguments, and `processCliResult()` to parse CLI output into an `AIResult`. The base class handles the execution flow, while engines provide command-specific logic. Engines are registered in `cli/src/engines/index.ts` and instantiated via the `createEngine` function, which maps engine names to their respective classes. This modular approach ensures consistent integration and simplifies extension for new engines.
+
+The engine layer is organized into several modules:
+- `base.ts` - Base class and re-exports for backward compatibility
+- `executor.ts` - Command execution logic (handles cross-platform spawning)
+- `parsers.ts` - Parsing and error extraction utilities
+- `validation.ts` - Validation functions for command security
 
 Supported engines include Claude Code, OpenCode, Cursor, Codex, Qwen-Code, Factory Droid, GitHub Copilot, Trae Agent, Gemini CLI, Ollama (via Claude Code CLI), and Kimi Code. Each engine follows the same integration pattern, allowing for consistent behavior and easy extensibility.
 
@@ -34,100 +40,147 @@
 This ensures that when the `--kimi` flag is provided, the Kimi Code engine is selected for execution.
 
 ## Engine Implementation and Command Execution
-Implement the engine as a class extending `BaseAIEngine`. Specify the CLI command and provide `execute` and optionally `executeStreaming` methods. Use the shared `execCommand` and `execCommandStreaming` utilities for command execution, which handle cross-platform compatibility (Node.js, Bun, Windows command wrappers).
-
-**Command Execution Details:**
-- On Windows, npm global packages (like `claude`, `gemini`, `kimi`, or `ollama` via Claude Code) are installed as `.cmd` wrapper scripts. To execute these reliably, Ralphy now uses `shell: true` when spawning processes with Node.js, and wraps commands with `cmd.exe /c` when using Bun. This ensures that `.cmd` wrappers are properly invoked and avoids ENOENT errors, without needing to manually resolve the command path.
-- The `windowsVerbatimArguments` option is set to `false` on Windows and `true` on other platforms to prevent argument escaping issues.
-- If a process spawn error occurs, the error message is included in `stderr` (for `execCommand`) or reported via `onLine` (for `execCommandStreaming`), and the promise resolves with an exit code of 1. This maintains backward compatibility and avoids unhandled promise rejections.
-
-**Kimi Code Example:**
-```typescript
-export class KimiEngine extends BaseAIEngine {
-  name = "Kimi Code";
-  cliCommand = "kimi";
-
-  async execute(prompt: string, workDir: string, options?: EngineOptions): Promise<AIResult> {
-    const args = ["--yolo", "--output-format", "stream-json"];
+Implement the engine as a class extending `BaseAIEngine`. Specify the CLI command and implement two abstract methods:
+- `buildArgs(prompt, workDir, options)` - Constructs the command-line arguments array
+- `processCliResult(stdout, stderr, exitCode, workDir)` - Parses CLI output into an `AIResult`
+
+The base class handles execution flow using the `execCommand` and `execCommandStreaming` utilities from `executor.ts`, which handle cross-platform compatibility (Node.js, Bun, Windows command wrappers).
+
+**Prompt Passing:**
+Prompts are always passed via stdin to ensure cross-platform reliability and avoid shell escaping issues. Use the `buildArgsWithStdin()` helper method to construct arguments with stdin handling:
+
+```typescript
+protected buildArgs(prompt: string, workDir: string, options?: EngineOptions): string[] {
+  const args = ["--yolo", "--output-format", "stream-json"];
+  if (options?.modelOverride) {
+    args.push("--model", options.modelOverride);
+  }
+  // Tell the CLI to read prompt from stdin
+  args.push("-p");
+  return args;
+}
+```
+
+The base class automatically passes the prompt via stdin, so engines only need to add the flag that tells the CLI to read from stdin (e.g., `-p` for Claude/Gemini/Kimi).
+
+**Claude Code Example:**
+```typescript
+export class ClaudeEngine extends BaseAIEngine {
+  name = "Claude Code";
+  cliCommand = "claude";
+
+  protected buildArgs(_prompt: string, _workDir: string, options?: EngineOptions): string[] {
+    const args = ["--dangerously-skip-permissions", "--verbose", "--output-format", "stream-json"];
     if (options?.modelOverride) {
       args.push("--model", options.modelOverride);
     }
-    if (options?.engineArgs && options.engineArgs.length > 0) {
+    if (options?.engineArgs) {
       args.push(...options.engineArgs);
     }
-    let stdinContent: string | undefined;
-    if (isWindows) {
-      args.push("-p");
-      stdinContent = prompt;
-    } else {
-      args.push("-p", prompt);
-    }
-    const { stdout, stderr, exitCode } = await execCommand(
-      this.cliCommand,
-      args,
-      workDir,
-      undefined,
-      stdinContent,
-    );
-    // ...parse output...
-  }
-}
-```
-
-**Ollama Example:**
-```typescript
-export class OllamaEngine extends BaseAIEngine {
-  name = "Ollama (Claude Code)";
-  cliCommand = "claude";
-  // ...
-}
-```
-
-**Factory Droid Example:**
-```typescript
-export class DroidEngine extends BaseAIEngine {
-  name = "Factory Droid";
-  cliCommand = "droid";
-  // ...
-}
-```
-**Qwen-Code Example:**
-```typescript
-export class QwenEngine extends BaseAIEngine {
-  name = "Qwen-Code";
-  cliCommand = "qwen";
-  // ...
-}
-```
+    // Prompt is passed via stdin by base engine
+    args.push("-p");
+    return args;
+  }
+
+  protected processCliResult(stdout: string, stderr: string, exitCode: number): AIResult {
+    const output = stdout + stderr;
+    const error = checkForErrors(output);
+    if (error) {
+      return { success: false, response: "", inputTokens: 0, outputTokens: 0, error };
+    }
+
+    const { response, inputTokens, outputTokens } = parseStreamJsonResult(output);
+
+    if (exitCode !== 0) {
+      return createErrorResult(exitCode, output, response, inputTokens, outputTokens);
+    }
+
+    return createSuccessResult(response, inputTokens, outputTokens);
+  }
+}
+```
+
 **Gemini CLI Example:**
 ```typescript
 export class GeminiEngine extends BaseAIEngine {
   name = "Gemini CLI";
   cliCommand = "gemini";
-  // ...
-}
-```
-**Trae Agent Example:**
-```typescript
-export class TraeEngine extends BaseAIEngine {
-  name = "Trae Agent";
-  cliCommand = "trae";
-  // ...
-}
-```
-Always verify that the CLI tool is available in the system PATH. If not, exit with an error and provide installation instructions ([source](https://github.com/michaelshimeles/ralphy/pull/17)).
-
-For more details, see the implementation in `cli/src/engines/base.ts`. For the latest error handling and cross-platform logic, refer to the current implementation.
+
+  protected buildArgs(prompt: string, workDir: string, options?: EngineOptions): string[] {
+    const args = ["--output-format", "stream-json", "--yolo"];
+    if (options?.modelOverride) {
+      args.push("--model", options.modelOverride);
+    }
+    args.push("-p"); // Read prompt from stdin
+    return args;
+  }
+
+  protected processCliResult(stdout: string, stderr: string, exitCode: number): AIResult {
+    // Same parsing logic as Claude
+    const output = stdout + stderr;
+    const error = checkForErrors(output);
+    if (error) {
+      return { success: false, response: "", inputTokens: 0, outputTokens: 0, error };
+    }
+    const { response, inputTokens, outputTokens } = parseStreamJsonResult(output);
+    if (exitCode !== 0) {
+      return createErrorResult(exitCode, output, response, inputTokens, outputTokens);
+    }
+    return createSuccessResult(response, inputTokens, outputTokens);
+  }
+}
+```
+
+Always verify that the CLI tool is available in the system PATH using the inherited `isAvailable()` method. If not, exit with an error and provide installation instructions.
+
+For more details, see the implementation in `cli/src/engines/base.ts` and examples in `claude.ts`, `gemini.ts`, and other engine files.
 
 ## JSON Output Parsing
-Parse engine output using utility functions or engine-specific logic. For Qwen-Code, Gemini CLI, Kimi Code, Ollama (via Claude Code CLI), and Trae Agent, use `parseStreamJsonResult` to extract response text and token counts from JSON lines. For Factory Droid, parse lines for a `completion` event and extract `finalText` and `durationMs` ([source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/engines/droid.ts#L13-L124), [source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/engines/qwen.ts#L14-L100)).
+Parse engine output using utility functions from `parsers.ts`. The parsing utilities provide consistent error handling and result extraction across engines:
+
+**Available Parsers:**
+- `parseStreamJsonResult(output)` - Extract response text and token counts from stream-json format (used by Claude, Gemini, Qwen, Kimi, Ollama)
+- `extractAuthenticationError(output)` - Detect and extract authentication failure messages
+- `checkForErrors(output)` - Check for general error conditions in CLI output
+- `formatCommandError(exitCode, output)` - Format meaningful error messages with context
+- `detectStepFromOutput(line, logThoughts)` - Extract progress steps from output lines
+
+**Key Parser Behaviors:**
+Authentication and command-failure messages are parsed more strictly to reduce false positives and make retry behavior more predictable. The `extractAuthenticationError()` function now looks for specific authentication-related error patterns and extracts clean error messages. The `formatCommandError()` function checks for authentication errors first before formatting general command failures.
+
+Example usage in `processCliResult()`:
+```typescript
+protected processCliResult(stdout: string, stderr: string, exitCode: number): AIResult {
+  const output = stdout + stderr;
+  
+  // Check for errors using parser utilities
+  const error = checkForErrors(output);
+  if (error) {
+    return { success: false, response: "", inputTokens: 0, outputTokens: 0, error };
+  }
+
+  // Parse result using appropriate utility
+  const { response, inputTokens, outputTokens } = parseStreamJsonResult(output);
+
+  // Use helper to create error result with formatted message
+  if (exitCode !== 0) {
+    return createErrorResult(exitCode, output, response, inputTokens, outputTokens);
+  }
+
+  return createSuccessResult(response, inputTokens, outputTokens);
+}
+```
+
+For engines with custom output formats (like Factory Droid or Trae Agent), implement engine-specific parsing logic within `processCliResult()`:
 
 Example for Factory Droid:
 ```typescript
-private parseOutput(output: string): { response: string; durationMs: number } {
+protected processCliResult(stdout: string, stderr: string, exitCode: number): AIResult {
+  const output = stdout + stderr;
   const lines = output.split("\n").filter(Boolean);
   let response = "";
   let durationMs = 0;
+  
   for (const line of lines) {
     try {
       const parsed = JSON.parse(line);
@@ -137,67 +190,27 @@
       }
     } catch { /* ignore non-JSON lines */ }
   }
-  return { response, durationMs };
-}
-```
-
-Example for Trae Agent:
-```typescript
-private parseOutput(output: string): {
-  response: string;
-  inputTokens: number;
-  outputTokens: number;
-  durationMs: number;
-} {
-  const lines = output.split("\n").filter(Boolean);
-  let response = "";
-  let inputTokens = 0;
-  let outputTokens = 0;
-  let durationMs = 0;
-
-  for (const line of lines) {
-    try {
-      const parsed = JSON.parse(line);
-      if (parsed.type === "result") {
-        response = parsed.result || response;
-        inputTokens = parsed.usage?.input_tokens ?? inputTokens;
-        outputTokens = parsed.usage?.output_tokens ?? outputTokens;
-      }
-      if (parsed.type === "completion") {
-        response = parsed.finalText || response;
-        if (typeof parsed.durationMs === "number") {
-          durationMs = parsed.durationMs;
-        }
-      }
-      if (parsed.type === "assistant" && !response) {
-        const content = parsed.message?.content;
-        if (Array.isArray(content) && content[0]?.text) {
-          response = content[0].text;
-        } else if (typeof content === "string") {
-          response = content;
-        }
-      }
-      if (typeof parsed.duration_ms === "number") {
-        durationMs = parsed.duration_ms;
-      }
-    } catch {
-      // Ignore non-JSON lines
-    }
-  }
-
-  return {
-    response: response || "Task completed",
-    inputTokens,
-    outputTokens,
-    durationMs,
-  };
-}
-```
-
-For Gemini CLI, Qwen-Code, Kimi Code, and Ollama (via Claude Code CLI), use `parseStreamJsonResult` to extract the response and token counts from the stream-json output format. This approach ensures that output from these engines is parsed correctly, extracting relevant response text and metrics.
+  
+  if (exitCode !== 0) {
+    return createErrorResult(exitCode, output, response, 0, 0);
+  }
+  
+  return createSuccessResult(response, 0, 0, { durationMs });
+}
+```
+
+## Error Handling and Exit Codes
+Engine failures now report the real CLI exit code instead of defaulting to success in streaming mode. This ensures proper failure detection and reporting across both streaming and non-streaming execution paths.
+
+The base class handles process lifecycle cleanup and timeout management. In streaming mode:
+- The child process is registered for cleanup and properly terminated on timeout or completion
+- Exit codes are correctly propagated from both Bun and Node.js execution paths
+- If a process spawn error occurs, the error is included in the result and the operation fails with exit code 1
+
+The improved exit code handling makes retry logic more reliable since the system can accurately distinguish between successful completions, authentication failures, and command errors.
 
 ## Integration with Parallel Execution and Merge Conflict Resolution
-Ralphy supports parallel execution via the `--parallel` and `--max-parallel` flags, creating isolated worktrees and branches for each agent/task. After execution, completed branches are merged back to the base branch. If merge conflicts occur, Ralphy uses AI-assisted conflict resolution by building a prompt listing conflicted files and running the selected engine to resolve conflicts. The process verifies that all conflicts are resolved before completing the merge ([source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/parallel.ts#L34-L372), [source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/execution/conflict-resolution.ts#L8-L81)).
+Ralphy supports parallel execution via the `--parallel` and `--max-parallel` flags, creating isolated worktrees and branches for each agent/task. After execution, completed branches are merged back to the base branch. If merge conflicts occur, Ralphy uses AI-assisted conflict resolution by building a prompt listing conflicted files and running the selected engine to resolve conflicts. The process verifies that all conflicts are resolved before completing the merge.
 
 ## Documentation and Changelog Updates
 Update the README to include the new engine flag (`--kimi`), usage examples, engine details table, and any relevant notes about output metrics (tokens, duration, cost). Add a changelog entry noting the new engine, its flag, and the version bump.
@@ -255,15 +268,32 @@
 - Check cross-platform compatibility (macOS, Linux, Windows) by verifying CLI detection and execution ([source](https://github.com/michaelshimeles/ralphy/issues/52), [source](https://github.com/michaelshimeles/ralphy/issues/64)).
 
 ## Cross-Platform Considerations
-Ensure CLI tools are available in the system PATH and use platform-specific detection (`which` on Unix, `where` on Windows). For command execution, Ralphy now uses shell mode (`shell: true` for Node.js, or `cmd.exe /c` for Bun) on Windows to ensure that `.cmd` wrapper scripts for npm global packages are executed correctly. The `windowsVerbatimArguments` option is set to `false` on Windows and `true` on other platforms to prevent argument escaping issues. If a process spawn error occurs, the error message is included in `stderr` or reported via `onLine`, and the process resolves with an exit code of 1, maintaining backward compatibility and avoiding unhandled promise rejections. This approach avoids ENOENT errors and removes the need for custom path resolution logic. Provide clear error messages and installation instructions if the CLI is not found. Test on all supported platforms to confirm compatibility ([source](https://github.com/michaelshimeles/ralphy/issues/52), [source](https://github.com/michaelshimeles/ralphy/issues/64)).
+Ensure CLI tools are available in the system PATH and use platform-specific detection via the `commandExists()` utility (which uses `which` on Unix, `where` on Windows). 
+
+The command execution layer in `executor.ts` handles cross-platform compatibility:
+- Command and argument validation is performed before execution to prevent injection attacks
+- Process spawning is handled consistently across Node.js and Bun runtimes
+- Child processes are registered for cleanup to prevent leaks
+- Shell mode is disabled (`shell: false`) to improve security while maintaining cross-platform compatibility
+
+The executor module uses the following approach:
+- Bun: Direct spawning with command arrays (no shell needed)
+- Node.js: Direct spawning with `shell: false` for security
+
+Prompts are always passed via stdin to avoid shell escaping issues and command line length limits on all platforms. This provides the most reliable cross-platform approach.
+
+Provide clear error messages and installation instructions if the CLI is not found. Test on all supported platforms to confirm compatibility.
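The spawning strategy described above can be sketched roughly as follows. Names here are illustrative, not the actual `executor.ts` API: shell mode disabled, prompt delivered over stdin, and spawn errors resolved (never thrown) with exit code 1.

```typescript
import { spawn } from "node:child_process";

// Hypothetical helper sketching the executor's approach.
function runWithStdin(
  command: string,
  args: string[],
  stdinContent: string,
): Promise<{ stdout: string; stderr: string; exitCode: number }> {
  return new Promise((resolve) => {
    const child = spawn(command, args, {
      shell: false, // no shell: avoids injection and platform escaping issues
      stdio: ["pipe", "pipe", "pipe"],
    });
    let stdout = "";
    let stderr = "";
    child.stdout!.on("data", (chunk) => (stdout += chunk));
    child.stderr!.on("data", (chunk) => (stderr += chunk));
    // A spawn failure (e.g. ENOENT) must not become an unhandled rejection.
    child.on("error", (err) => resolve({ stdout, stderr: String(err), exitCode: 1 }));
    child.on("close", (code) => resolve({ stdout, stderr, exitCode: code ?? 1 }));
    // Write the prompt, then close stdin so the child does not hang waiting for input.
    child.stdin!.write(stdinContent);
    child.stdin!.end();
  });
}
```

For example, `await runWithStdin("cat", [], "hello")` on Unix echoes the prompt back on stdout with exit code 0.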
 
 ## Example: Adding a New Engine
-1. Implement the engine class extending `BaseAIEngine` (see GeminiEngine for Gemini CLI or KimiEngine for Kimi Code as examples).
+1. Implement the engine class extending `BaseAIEngine`:
+   - Define `name` and `cliCommand` properties
+   - Implement `buildArgs()` to construct CLI arguments (pass prompts via stdin)
+   - Implement `processCliResult()` to parse output and return `AIResult`
 2. Register the engine in `cli/src/engines/index.ts` and update the CLI argument parser.
 3. Add a command-line flag for engine selection (e.g., `--gemini`, `--kimi`).
-4. Implement command execution and output parsing, using `parseStreamJsonResult` for Gemini CLI, Kimi Code, and Qwen-Code.
-5. Integrate with parallel execution and merge conflict resolution.
+4. Use parsing utilities from `parsers.ts` (`parseStreamJsonResult`, `checkForErrors`, `formatCommandError`, etc.).
+5. Integrate with parallel execution and merge conflict resolution (handled by base class).
 6. Update documentation and changelog to include the new engine.
 7. Verify functionality and cross-platform compatibility.
 
-By following these practices and using Factory Droid, Qwen-Code, Gemini CLI, and Kimi Code as templates, you can safely and consistently add new AI engines to Ralphy.
+By following these practices and using Claude, Gemini, and other engines as templates, you can safely and consistently add new AI engines to Ralphy.



AI Engine Integration
@@ -113,9 +113,10 @@
 - No fragile text parsing or temp file prompts
 
 **Error Handling:**
-- Errors are surfaced via ACP protocol stop reasons. Authentication errors are detected and reported with instructions to run `copilot /login` or set the `COPILOT_GITHUB_TOKEN` environment variable.
+- Errors are surfaced via ACP protocol stop reasons. Authentication errors are detected and reported with instructions to run `gh auth login` or set the `COPILOT_GITHUB_TOKEN` environment variable.
 - Other errors (rate limiting, network issues) are surfaced as protocol errors when available.
 - Fatal errors abort all remaining tasks and prevent infinite retry loops.
+- Improved temp file lifecycle management ensures cleanup even on errors.
 
 **Token Usage Reporting:**
 - **Token counts are not available**: Copilot CLI's ACP implementation does not expose usage/token metadata. The result only contains `stopReason`. Telemetry loses token tracking for Copilot. This is a Copilot CLI limitation.
@@ -169,17 +170,53 @@
 ```
 This mechanism allows you to leverage advanced or experimental features of any supported engine, even if Ralphy does not yet provide explicit flags for them. The arguments are isolated from Ralphy's own options, ensuring compatibility and safety across all engines.
 
-## Adding a New Engine
+## Engine Architecture
+
+### Modular Structure
+The engine layer has been restructured for improved maintainability and reliability:
+
+- **`base.ts`**: Base engine class that re-exports functionality from specialized modules. Provides the `BaseAIEngine` abstract class with standardized execution flow.
+- **`executor.ts`**: Command execution utilities including `execCommand`, `execCommandStreaming`, and `commandExists`. Handles process lifecycle, stdin/stdout management, and cross-platform compatibility.
+- **`parsers.ts`**: Output parsing functions including `parseStreamJsonResult`, `checkForErrors`, `extractAuthenticationError`, `formatCommandError`, and `detectStepFromOutput`. Provides consistent error detection and token counting across engines.
+- **`validation.ts`**: Command and argument validation for security and correctness.
+- **Provider adapters**: Individual engine implementations (`claude.ts`, `codex.ts`, `copilot.ts`, `cursor.ts`, etc.) that extend `BaseAIEngine`.
+
+### Execution Flow
+Engines implement two key methods called by the base class execution flow:
+
+1. **`buildArgs(prompt, workDir, options)`**: Constructs CLI arguments for the engine. Prompts are passed via stdin (using `buildArgsWithStdin()`) for cross-platform reliability and to avoid shell escaping issues.
+2. **`processCliResult(stdout, stderr, exitCode, workDir)`**: Processes CLI output into an `AIResult`. Parses tokens, detects errors, and formats responses.
+
+The base class handles:
+- Command execution (streaming and non-streaming)
+- Timeout management
+- Process lifecycle cleanup
+- Stdin content delivery
+- Progress callback invocation
+
+### Error Handling Improvements
+Error handling has been strengthened to reduce false positives and improve reliability:
+
+- **Authentication errors** are now detected more accurately and surfaced as auth failures rather than generic command errors. The parser distinguishes between authentication issues and other error types.
+- **Exit codes**: Engine failures now report the actual CLI exit code instead of defaulting to success in streaming mode. This is a critical reliability improvement that affects failure detection and retry logic.
+- **Parser utilities**: New functions in `parsers.ts` provide consistent error extraction:
+  - `extractAuthenticationError`: Detects auth failures from JSON or plain text output, with improved logic for Copilot (references "gh auth login" instead of "/login")
+  - `formatCommandError`: Formats error messages with context while avoiding false positives
+  - `checkForErrors`: Detects errors in stream-json output or general CLI output
+  - `detectStepFromOutput`: Extracts progress steps from output for live feedback
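The first-line authentication check can be illustrated with a minimal sketch; the function name and message wording are simplified from the real `extractAuthenticationError`:

```typescript
// Simplified first-line auth detection (illustrative, not the real utility).
function extractAuthError(output: string): string | null {
  const firstLine = output.trimStart().split("\n")[0].toLowerCase();
  if (firstLine.startsWith("not authenticated")) {
    // Surface an actionable auth failure instead of a generic command error.
    return "Copilot CLI is not authenticated. Run `gh auth login` or set COPILOT_GITHUB_TOKEN.";
  }
  return null; // anything else falls through to the generic error path
}
```

Checking only the start of the first line is what avoids false positives when a valid response merely discusses authentication.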
+
+### Adding a New Engine
 To add a new AI engine:
 1. Implement a new class extending `BaseAIEngine`, specifying the engine's name and CLI command.
-2. Integrate using the engine's protocol or output format (e.g., ACP, stream-json, file, NDJSON).
-3. Handle errors using the shared error detection utilities and protocol stop reasons.
-4. Support model overrides by appending the `--model <name>` flag if provided.
-5. Register the engine in the CLI flag parser and engine registry.
-6. Ensure CLI availability by checking the command exists in the system PATH.
-7. Update documentation and provide example usage.
-
-All engines follow a uniform integration pattern, making it straightforward to add new engines with minimal changes to the CLI and execution flow. See [base.ts](https://github.com/michaelshimeles/ralphy/blob/main/cli/src/engines/base.ts) for the base class and utilities.
+2. Implement `buildArgs()` to construct CLI arguments. Use `buildArgsWithStdin()` for prompts.
+3. Implement `processCliResult()` to parse output into `AIResult`. Use parser utilities from `parsers.ts`.
+4. Use shared error detection utilities (`checkForErrors`, `extractAuthenticationError`) and protocol stop reasons.
+5. Support model overrides by including `--model <name>` in your args if provided.
+6. Register the engine in the CLI flag parser and engine registry.
+7. Ensure CLI availability by checking the command exists in the system PATH.
+8. Update documentation and provide example usage.
+
+All engines follow a uniform integration pattern, making it straightforward to add new engines with minimal changes to the CLI and execution flow. See [base.ts](https://github.com/michaelshimeles/ralphy/blob/main/cli/src/engines/base.ts) for the base class, [executor.ts](https://github.com/michaelshimeles/ralphy/blob/main/cli/src/engines/executor.ts) for execution helpers, and [parsers.ts](https://github.com/michaelshimeles/ralphy/blob/main/cli/src/engines/parsers.ts) for parsing utilities.
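A skeletal adapter following these steps might look like the sketch below. The method signatures paraphrase the documented `buildArgs`/`processCliResult` contract, and the `AIResult` shape is a simplified assumption:

```typescript
// Assumed simplified result type.
interface AIResult {
  success: boolean;
  response: string;
  inputTokens: number;
  outputTokens: number;
}

// Paraphrased contract of the real BaseAIEngine (illustrative only).
abstract class BaseAIEngine {
  abstract name: string;
  abstract cliCommand: string;
  abstract buildArgs(prompt: string, workDir: string, options: { model?: string }): string[];
  abstract processCliResult(stdout: string, stderr: string, exitCode: number, workDir: string): AIResult;
}

class ExampleEngine extends BaseAIEngine {
  name = "example";
  cliCommand = "example-cli"; // hypothetical CLI

  buildArgs(_prompt: string, _workDir: string, options: { model?: string }): string[] {
    // The prompt travels via stdin, so only mode/model flags go on the command line.
    const args = ["--output-format", "stream-json"];
    if (options.model) args.push("--model", options.model);
    return args;
  }

  processCliResult(stdout: string, _stderr: string, exitCode: number, _workDir: string): AIResult {
    if (exitCode !== 0) {
      // Non-zero exit codes are never masked as success.
      return { success: false, response: stdout, inputTokens: 0, outputTokens: 0 };
    }
    return { success: true, response: stdout.trim() || "Task completed", inputTokens: 0, outputTokens: 0 };
  }
}
```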
 
 ---
 


Command Execution and Process Spawning Enhancements
@@ -1,5 +1,15 @@
 # Command Execution and Process Spawning (Windows and Cross-Platform)
 This document describes the approach for robust, cross-platform command execution in the CLI, with a focus on Windows compatibility and reliable process spawning for both Bun and Node.js runtimes.
+
+## Architecture: Modular Command Execution
+
+The command execution system is organized into separate modules for better maintainability and testability:
+
+- **`executor.ts`**: Core command execution module handling process spawning, stdin/stdout/stderr management, and cross-platform compatibility for both Bun and Node.js runtimes.
+- **`parsers.ts`**: Parser utilities for extracting authentication errors, token counts, step detection, and formatting error messages from CLI output.
+- **`base.ts`**: Base engine implementation that re-exports functions from `executor.ts` and `parsers.ts` for backward compatibility with existing engine code.
+
+This modular structure separates concerns: execution logic is isolated in `executor.ts`, output parsing is centralized in `parsers.ts`, and the base engine coordinates between them.
 
 ## Windows Compatibility: .cmd Wrappers and Argument Handling
 On Windows, many CLI tools (especially those installed globally via npm) are exposed as `.cmd` wrappers. To ensure these commands are executed reliably regardless of runtime (Bun or Node.js), the CLI uses the following strategies:
@@ -13,15 +23,42 @@
 
 This approach eliminates the need for manual command resolution and ensures that npm global commands and batch scripts are found and executed correctly on all platforms.
 
-## Passing Prompts and Arguments: Stdin vs. Arguments
-- Copilot CLI output is parsed to filter out CLI artifacts, status messages, and stats sections. Token usage (input/output tokens) is extracted from the output when available.
-- Copilot-specific error detection is now more robust:
-  - **Authentication errors** are detected by parsing each line of output as JSON and checking for error types or result objects with authentication-related keywords (such as "invalid api key", "authentication", "not authenticated", "unauthorized", or "/login") anywhere in the error message, case-insensitive. This includes nested error structures and assistant message formats.
-  - **Rate limiting** is detected only if the output starts with a rate limit phrase (e.g., "Rate limit", "Too many requests") and the output is short (less than 200 characters), to avoid false positives from valid response content.
-  - **Network errors** are not detected based on output content, since such phrases may appear in valid responses (e.g., test results or code feedback). Actual network errors are surfaced via non-zero exit codes.
-  - **Generic errors** are only detected if the output starts with "error:" (case-insensitive). Occurrences of "error:" in the middle of the output are not treated as CLI errors.
+## Stdin-Based Prompt Passing (Universal Approach)
 
-This approach ensures that authentication errors are reliably detected across a variety of output formats, while avoiding false positives for valid Copilot responses.
+For reliable cross-platform behavior, prompts are passed via stdin across **all platforms** (Windows, Linux, macOS). This universal approach:
+
+- Avoids shell escaping issues that vary by platform and shell
+- Prevents command line length limits (especially on Windows)
+- Ensures consistent behavior across Bun and Node.js runtimes
+- Handles multi-line prompts and special characters safely
+
+The executor sets `stdin: "pipe"` when stdin content is provided, writes the prompt, and closes the stream. When no stdin content is provided, stdin is set to `"ignore"` to prevent the child process from hanging while waiting for input.
+
+## Parser Utilities and Error Detection
+
+The `parsers.ts` module provides standardized utilities for interpreting CLI output:
+
+- **`extractAuthenticationError()`**: Detects authentication failures by checking if the first line of output starts with `"not authenticated"`. This simple, reliable check avoids false positives from response content that may mention authentication topics. The error message now references `"gh auth login"` instead of the deprecated `"/login"` command.
+
+- **`formatCommandError()`**: Formats command failures with meaningful context, including exit codes and relevant output snippets.
+
+- **`detectStepFromOutput()`**: Identifies progress steps from CLI output using pattern matching on action verbs and step markers.
+
+- **`checkForErrors()`**: Validates JSON output against error schemas and detects common error patterns (rate limits, quota exceeded, connection errors) while avoiding false positives in valid response content.
+
+This centralized error detection ensures consistent behavior across all engine adapters (Copilot, Claude, OpenCode, etc.).
+
+## Exit Code Propagation and Streaming Reliability
+
+A critical improvement in this architecture is that the **streaming execution path now propagates real process exit codes**, so failures are no longer masked. Previously, streaming commands could incorrectly report success even when the underlying process failed.
+
+The streaming implementation:
+- Captures the actual exit code from the child process
+- Returns the real exit code in the result, ensuring proper error detection
+- Integrates with parser utilities to format errors correctly
+- Ensures retry logic and error handling work correctly based on actual process status
+
+This eliminates a class of bugs where streaming tasks would appear to succeed but actually failed, preventing proper error recovery and user notification.
 
 ## Error Handling: Backward Compatibility
 To maintain backward compatibility with existing CLI code (which does not use try/catch on command execution), error handling is designed to avoid unhandled promise rejections:
@@ -51,24 +88,45 @@
 ```
 
 ## Implementation Highlights
-- **No more resolveCommand():**
-  - The previous approach of resolving command paths manually is no longer used. Shell execution handles `.cmd` wrappers and path resolution.
-- **windowsVerbatimArguments:**
-  - This option is set to prevent double-escaping issues with arguments on non-Windows platforms.
-- **Stdin Management:**
-  - Stdin is set to `pipe` only if content is provided; otherwise, it is ignored/closed to prevent hangs.
+
+### Executor Module (`executor.ts`)
+- **Process spawning**: Handles both Bun and Node.js runtimes with platform-specific strategies
+- **Command validation**: Uses `validateCommandAndArgs()` to prevent command injection
+- **Stdin management**: Sets stdin to `pipe` only when content is provided; otherwise uses `ignore` to prevent hangs
+- **Process cleanup**: Integrates with cleanup registry to prevent leaked child processes
+- **Streaming support**: Provides both legacy callback-based (`execCommandStreaming`) and modern stream-based (`execCommandStreamingNew`) APIs
+
+### Parser Module (`parsers.ts`)
+- **`extractAuthenticationError()`**: Detects auth failures by checking the first line with `startsWith("not authenticated")`
+- **`formatCommandError()`**: Creates meaningful error messages with exit codes and output context
+- **`detectStepFromOutput()`**: Identifies progress steps using pattern matching on action verbs
+- **`checkForErrors()`**: Validates JSON and detects error patterns while avoiding false positives
+- **`extractTokenCounts()`**: Parses token usage from JSON output
+- **`createErrorResult()` / `createSuccessResult()`**: Factory functions for consistent result objects
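As an illustration of the action-verb matching behind `detectStepFromOutput`, a minimal sketch follows; the verb list and pattern are hypothetical, not the module's actual implementation:

```typescript
// Hypothetical action-verb pattern for progress-step detection.
const STEP_PATTERN =
  /^\s*(?:[-*]\s*)?(Creating|Editing|Writing|Running|Installing|Reading|Updating)\b(.{0,80})/i;

function detectStep(line: string): string | null {
  const match = STEP_PATTERN.exec(line);
  // Return a short, human-readable step label, or null for ordinary output.
  return match ? `${match[1]}${match[2]}`.trim() : null;
}
```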
+
+### Base Engine (`base.ts`)
+- **Backward compatibility**: Re-exports all functions from `executor.ts` and `parsers.ts`
+- **Universal stdin handling**: Uses `buildArgsWithStdin()` to pass prompts via stdin on all platforms
+- **Streaming integration**: Coordinates between executor and parsers for streaming execution
+- **Engine lifecycle**: Manages process timeouts, cleanup, and progress callbacks
+
+### Windows-Specific Details
+- **No more resolveCommand()**: The previous approach of resolving command paths manually is no longer used. Shell execution handles `.cmd` wrappers and path resolution.
+- **windowsVerbatimArguments**: This option is set to prevent double-escaping issues with arguments on non-Windows platforms.
 
 ## Summary Table
 | Platform   | Runtime | Command Execution Strategy                | Prompt Passing         |
 |------------|---------|-------------------------------------------|-----------------------|
 | Windows    | Bun     | `cmd.exe /c <command> ...args`            | stdin                 |
 | Windows    | Node.js | `shell: true` in spawn                    | stdin                 |
-| Other      | Bun     | `<command> ...args`                       | argument              |
-| Other      | Node.js | `<command> ...args`                       | argument              |
+| Other      | Bun     | `<command> ...args`                       | stdin                 |
+| Other      | Node.js | `<command> ...args`                       | stdin                 |
 
 ## Impact
 These changes ensure that:
 - All supported engines work reliably on Windows, including those installed via npm as `.cmd` wrappers.
-- Multi-line prompts are handled safely on Windows.
+- Multi-line prompts are handled safely and consistently across all platforms via stdin.
 - Errors are always reported in output, never as unhandled promise rejections.
 - The CLI behaves consistently across Bun and Node.js, and across all platforms.
+- Streaming execution properly propagates exit codes, enabling correct error detection and retry logic.
+- Modular architecture improves maintainability and testability, and makes it easier to add new engines or execution strategies.


Copilot AI Engine Integration
@@ -11,15 +11,23 @@
 ```
 
 ### Implementation Details
-- Prompts are always passed to the Copilot CLI as CLI arguments, and the `--yolo` flag is always included for non-interactive execution. This ensures consistent, non-interactive behavior across all platforms and prevents silent task completion.
-- **Conservative Copilot-specific error detection:** The engine detects only clear, CLI-level errors from Copilot CLI output:
-  - **Authentication errors:** Detected if the output starts with known authentication-related phrases (e.g., "Not authenticated", "No authentication", "Authentication required", "Please authenticate").
-  - **Rate limiting:** Detected if the output starts with rate limit phrases (e.g., "Rate limit", "Too many requests") and the output is short (less than 200 characters), to avoid false positives.
-  - **Network errors:** Only detected via non-zero exit codes from the Copilot CLI, not by searching for phrases in the output.
-  - **Generic errors:** Detected if the output starts with "error:" (case-insensitive). Occurrences of "error:" in the middle of the output are not treated as CLI errors.
+- **Prompt Handling:** Prompts are passed to the Copilot CLI via temporary markdown files to preserve formatting and avoid shell escaping issues. The engine uses a robust lifecycle management system with three methods:
+  - `createTempFile`: Creates uniquely-named temporary files in a dedicated directory with a markdown extension.
+  - `cleanupTempFile`: Removes temporary files after task completion.
+  - `cleanupOldTempFiles`: Automatically cleans up temp files older than 1 hour to prevent disk space exhaustion.
+- **Non-Interactive Execution:** The `--yolo` flag is always included, ensuring consistent behavior across all platforms.
+- **Enhanced Error Detection:** The engine implements conservative, first-line authentication checking and improved error handling:
+  - **Authentication errors:** Detected by checking the first line of output for phrases like "Not authenticated" or "No authentication". Authentication error messages direct users to run `gh auth login` instead of referencing the deprecated `/login` command.
+  - **Improved stderr handling:** Authentication and error detection now examines both stdout and stderr for more reliable error identification.
+  - **Exit code handling:** The engine properly propagates real process exit codes, especially in streaming mode, instead of masking failures with default success codes.
+- **Streaming Support:** The `executeStreaming` method is now implemented for Copilot, enabling real-time progress updates and step detection during task execution. Streaming mode correctly propagates process exit codes for accurate failure reporting.
 - When a fatal error (such as authentication or configuration issues) is detected, all remaining tasks are aborted immediately.
 - Infinite retry loops are prevented: tasks encountering non-retryable or fatal errors are marked complete and not retried in sequential execution mode.
-- Token usage reporting: The engine parses and reports input and output token counts from Copilot CLI output, supporting values with `k` and `m` suffixes (e.g., `17.5k`). Duration is no longer used for cost reporting; token counts are now the primary metric.
+- **Token Usage Reporting:** Token parsing has been refactored into separate `parseTokenCounts` and `parseOutput` methods:
+  - The `parseTokenCounts` method extracts input and output token counts from Copilot CLI output lines (e.g., `17.5k in, 73 out`).
+  - Supports values with `k` and `m` suffixes for readability.
+  - Uses atomic regex patterns to prevent ReDoS vulnerabilities.
+  - Duration is no longer used for cost reporting; token counts are now the primary metric.
 - Improved reliability for parallel execution: If git worktree operations fail or are unavailable, ralphy automatically falls back to sandbox mode for parallel execution, ensuring tasks can still run in parallel even in repositories where worktrees are not supported.
 - Enhanced error output: When a CLI command fails, the error output includes the last lines of the command's output for easier debugging.
 - Retryable rate-limit detection: The CLI detects rate-limit and quota errors and will stop early, allowing you to retry later when limits are reset.
@@ -27,10 +35,12 @@
 - If the repository has no commits, parallel/branch mode will not run and the CLI will prompt you to create an initial commit.
 
 #### Windows Compatibility and Error Handling
-- Prompts are always passed to the Copilot CLI as CLI arguments, and the `--yolo` flag is always included for non-interactive execution. This ensures consistent behavior across all operating systems and prevents silent failures.
-- Improved error handling: Copilot-specific errors (such as authentication failures and rate limiting) are detected and reported clearly, even if the Copilot CLI returns a zero exit code. Network errors are only detected via non-zero exit codes, not by searching for phrases in the output.
+- Prompts are passed to the Copilot CLI via temporary markdown files to preserve formatting and prevent issues with special characters and newlines across all platforms.
+- The `--yolo` flag is always included for non-interactive execution, ensuring consistent behavior across all operating systems and preventing silent failures.
+- Improved error handling: Copilot-specific errors (such as authentication failures and rate limiting) are detected and reported clearly. Authentication failures now direct users to run `gh auth login` instead of the deprecated `/login` command.
+- First-line authentication checking: The engine examines the first line of output for authentication errors, making error detection more reliable and preventing false positives.
+- Exit code handling: Real process exit codes are properly propagated in both streaming and non-streaming modes, ensuring accurate failure reporting.
 - Config parsing errors and process spawn errors are logged to stderr, making debugging easier and preventing silent failures.
-- The engine uses `shell: true` for Node.js on Windows and wraps commands with `cmd.exe /c` for Bun, ensuring .cmd wrappers are executed correctly.
 
 #### Integration with Engine Types
 Copilot is integrated into the engine factory. When `'copilot'` is selected, a new `CopilotEngine` instance is returned. The `AIEngineName` type includes `'copilot'`, ensuring Copilot is recognized throughout the CLI.
@@ -54,7 +64,7 @@
 ```
 
 ### Script Support
-Copilot is fully supported in both single-task and PRD-driven (multi-task) modes. In scripts, Copilot is invoked using the `copilot` CLI with the `-p` flag for the prompt and `--model` if specified. Parallel execution and isolated git worktrees are supported, matching other engines.
+Copilot is fully supported in both single-task and PRD-driven (multi-task) modes. In scripts, Copilot is invoked using the `copilot` CLI. Prompts are passed via temporary markdown files (with the `-p` flag pointing to the temp file path), and `--model` is used if specified. Parallel execution and isolated git worktrees are supported, matching other engines.
 
 ---
 
@@ -67,7 +77,7 @@
 - If the repository has no commits, parallel/branch mode is disabled until an initial commit is made, and the CLI provides a clear prompt to the user.
 
 #### Windows Prompt Handling
-Prompts are always passed as CLI arguments to the Copilot CLI using the `-p <prompt>` flag, regardless of platform. The `--yolo` flag is also always included for non-interactive execution. This ensures compatibility and prevents silent failures on all operating systems. Multi-line prompts are sanitized and special characters are escaped for Windows compatibility.
+Prompts are passed to the Copilot CLI via temporary markdown files to preserve formatting, special characters, and newlines. This approach ensures compatibility across all operating systems and avoids issues with command-line argument parsing. The `--yolo` flag is also always included for non-interactive execution. Multi-line prompts and markdown content are preserved exactly as written without sanitization or escaping.
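A minimal sketch of this temp-file lifecycle, assuming a dedicated prompt directory; paths, naming, and the one-hour threshold parameterization are illustrative, not the engine's actual code:

```typescript
import { mkdtempSync, writeFileSync, readdirSync, statSync, unlinkSync, rmSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";
import { randomUUID } from "node:crypto";

// Hypothetical dedicated directory for prompt files.
const promptDir = mkdtempSync(join(tmpdir(), "ralphy-prompts-"));

function createTempFile(prompt: string): string {
  const file = join(promptDir, `prompt-${randomUUID()}.md`);
  writeFileSync(file, prompt, "utf8"); // markdown written verbatim, no escaping
  return file;
}

function cleanupTempFile(file: string): void {
  try { unlinkSync(file); } catch { /* already gone: ignore */ }
}

// Sweep files older than the threshold (1 hour by default) to bound disk usage.
function cleanupOldTempFiles(maxAgeMs = 60 * 60 * 1000): void {
  const now = Date.now();
  for (const name of readdirSync(promptDir)) {
    const file = join(promptDir, name);
    if (now - statSync(file).mtimeMs > maxAgeMs) cleanupTempFile(file);
  }
}
```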
 
 #### Robust Error Handling
 Process spawning includes backward-compatible error handling. Spawn errors (such as missing executables or .cmd wrappers) are logged to stderr and do not cause unhandled promise rejections.
@@ -76,13 +86,13 @@
 Model override is implemented by passing the `--model` flag to the Copilot CLI if specified.
 
 #### Parsing Logic and Filter Patterns
-The Copilot engine's output parser removes lines that are interactive prompts, command prompts, status messages, and stats sections (such as token usage and breakdowns), ensuring only meaningful AI-generated responses are returned. The parser also extracts input and output token counts from lines like `17.5k in, 73 out`, supporting `k` and `m` suffixes. This allows the engine to report accurate token usage for each task.
+The Copilot engine uses refactored parsing logic with separate `parseTokenCounts` and `parseOutput` methods. The output parser removes lines that are interactive prompts, command prompts, status messages, and stats sections (such as token usage and breakdowns), ensuring only meaningful AI-generated responses are returned. The `parseTokenCounts` method extracts input and output token counts from lines like `17.5k in, 73 out`, supporting `k` and `m` suffixes. The implementation uses atomic regex patterns to prevent ReDoS vulnerabilities and improve reliability. This allows the engine to report accurate token usage for each task.
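The `17.5k in, 73 out` format described above can be parsed along these lines. This is a hedged sketch: the real `parseTokenCounts` regex and rounding behavior may differ.

```typescript
// Illustrative pattern for lines like "17.5k in, 73 out" (k/m suffixes supported).
const TOKEN_LINE = /(\d+(?:\.\d+)?)([km]?)\s*in,\s*(\d+(?:\.\d+)?)([km]?)\s*out/i;

function scale(value: number, suffix: string): number {
  const factor =
    suffix.toLowerCase() === "k" ? 1_000 : suffix.toLowerCase() === "m" ? 1_000_000 : 1;
  return Math.round(value * factor);
}

function parseTokenCounts(output: string): { inputTokens: number; outputTokens: number } {
  const match = TOKEN_LINE.exec(output);
  if (!match) return { inputTokens: 0, outputTokens: 0 }; // absent counts report zero
  return {
    inputTokens: scale(parseFloat(match[1]), match[2]),
    outputTokens: scale(parseFloat(match[3]), match[4]),
  };
}
```

The pattern is deliberately linear (no nested quantifiers), in the spirit of the backtracking-safe regexes the changelog describes.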
 
 #### Cost and Token Reporting
 Copilot parses and reports token usage from CLI output. The engine extracts `inputTokens` and `outputTokens` from Copilot CLI output lines (e.g., `17.5k in, 73 out`). Cost is tracked as token usage, not duration. If token counts are not present in the output, the engine reports zero for both values. The Engine Details table and reporting now reflect tokens as the primary metric for Copilot.
 
 ### Usage Examples
-**Note:** Prompts are always passed to the Copilot CLI as CLI arguments using the `-p <prompt>` flag, and the `--yolo` flag is used for non-interactive mode. This is handled automatically by ralphy; no changes to user commands are required, regardless of operating system.
+**Note:** Prompts are passed to the Copilot CLI via temporary markdown files (using the `-p <temp-file-path>` flag), and the `--yolo` flag is used for non-interactive mode. This is handled automatically by ralphy; no changes to user commands are required, regardless of operating system. The temporary file approach preserves markdown formatting and special characters.
 
 ## Webhook Notifications
 Get notified when sessions complete via Discord, Slack, or custom webhooks. Configure webhooks in `.ralphy/config.yaml`:
@@ -157,7 +167,13 @@
 
 # Changelog v4.5.1 Highlights
 ### v4.7.1
-- **Copilot engine improvements**: non-interactive mode (`--yolo`), conservative error detection for authentication/rate-limit errors, token usage parsing, temp file-based prompts for markdown preservation
+- **Copilot engine improvements**: 
+  - **Prompt handling**: Prompts are now passed via temporary markdown files with improved lifecycle management (`createTempFile`, `cleanupTempFile`, `cleanupOldTempFiles` methods) instead of CLI arguments, preserving formatting and preventing shell escaping issues.
+  - **Streaming support**: `executeStreaming` is now implemented, enabling real-time progress updates and step detection during task execution.
+  - **Enhanced error detection**: First-line authentication checking and improved stderr handling make error detection more reliable. Authentication error messages now direct users to run `gh auth login` instead of `/login`.
+  - **Exit code handling**: Streaming mode now properly propagates real process exit codes instead of masking failures.
+  - **Token parsing refactored**: Separate `parseTokenCounts` and `parseOutput` methods with atomic regex patterns to prevent ReDoS vulnerabilities.
+  - **Non-interactive mode**: `--yolo` flag ensures consistent behavior across platforms.
 - **Fixed infinite retry loop**: tasks now properly abort on fatal configuration/authentication errors
 - **Project standards**: added `.editorconfig` and `.gitattributes` for consistent coding styles
 - **Sandbox cleanup reliability**: sandbox directories are deleted with retry and exponential backoff; locked files/folders log warnings instead of crashing, improving stability on Windows and during parallel execution


Qwen-Code AI Engine
@@ -29,26 +29,19 @@
 - `-p <prompt>` (the prompt to execute)
 - `--model <modelOverride>` (optional, if specified)
 
-**Prompt Passing on Windows:**
-- On Windows, due to limitations with `cmd.exe` and multi-line arguments, ralphy passes the prompt via standard input (stdin) instead of as a command-line argument. The `-p` flag is still included, but the actual prompt text is sent through stdin. This ensures compatibility with multi-line prompts and avoids argument parsing issues.
-- On other platforms, the prompt is passed as a command-line argument as before.
+**Prompt Passing via stdin:**
+- Ralphy passes prompts to the Qwen-Code CLI via standard input (stdin) on all platforms (Windows, Linux, macOS). This universal approach avoids shell escaping issues, handles multi-line content reliably, and bypasses command-line length limitations.
+- The `-p` flag tells the `qwen` CLI to read the prompt from stdin. The prompt text itself is not passed as a command-line argument.
+- This cross-platform stdin behavior is standardized using the `buildArgsWithStdin()` helper, ensuring consistent execution across all operating systems.
 
-Example invocation (non-Windows):
-
-```bash
-qwen --output-format stream-json --approval-mode yolo -p "Implement OAuth2 login"
-```
-
-Example invocation (Windows):
+Example invocation:
 
 ```bash
 qwen --output-format stream-json --approval-mode yolo -p
 # (prompt text is sent via stdin)
 ```
 
-[Source](https://github.com/michaelshimeles/ralphy/blob/fc2df589969b5fe16d31eccb4e7ff91314e31776/cli/src/engines/qwen.ts#L14-L100)
-
-This change is handled automatically by ralphy; you do not need to modify your usage, but be aware of this difference if you are troubleshooting or inspecting process invocation on Windows.
+This is handled automatically by ralphy; you do not need to modify your usage, but be aware that prompts are passed via stdin if you are troubleshooting or inspecting process invocation.
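At the process level, stdin-based prompt delivery looks roughly like the sketch below. This is an illustration under stated assumptions, not ralphy's implementation (the real helper is `buildArgsWithStdin()`); the command is parameterized so the sketch is not tied to the `qwen` binary:

```typescript
import { spawn } from "node:child_process";

// Illustrative sketch of delivering a prompt over stdin instead of
// as a CLI argument; the actual engine code may differ.
function runWithStdinPrompt(
	command: string,
	args: string[],
	prompt: string,
): Promise<{ exitCode: number; stdout: string }> {
	return new Promise((resolve, reject) => {
		const proc = spawn(command, args, { stdio: ["pipe", "pipe", "inherit"] });
		let stdout = "";
		proc.stdout?.on("data", (chunk: Buffer) => {
			stdout += chunk.toString();
		});
		// Guard the stdin stream against EPIPE if the child exits early.
		proc.stdin?.on("error", () => {});
		proc.stdin?.write(prompt);
		proc.stdin?.end();
		proc.on("error", reject);
		proc.on("close", (code) => resolve({ exitCode: code ?? 1, stdout }));
	});
}
```

For the Qwen-Code CLI, the invocation would then be along the lines of `runWithStdinPrompt("qwen", ["--output-format", "stream-json", "--approval-mode", "yolo", "-p"], promptText)`, with the prompt text never appearing in the argument list.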
 
 ## Engine-Specific Arguments
 You can pass additional arguments directly to the Qwen-Code CLI by using the standard `--` separator in your ralphy command. Any arguments after `--` are forwarded unmodified to the underlying engine, allowing you to access advanced or experimental Qwen-Code features without waiting for explicit support in ralphy.
@@ -69,15 +62,20 @@
 This provides maximum flexibility for advanced users and ensures compatibility with future Qwen-Code CLI updates.
 
 ### JSON Output Handling and Result Parsing
-Qwen-Code emits output as newline-delimited JSON objects. Ralphy processes this output by:
+Qwen-Code emits output as newline-delimited JSON objects. Ralphy processes this output using the refactored `parsers.ts` module, which provides enhanced error classification and improved reliability:
 
-1. Concatenating stdout and stderr.
-2. Checking for errors by scanning for JSON lines with `"type": "error"`. If found, the error message is extracted and reported.
-3. If the Qwen-Code process exits with a non-zero exit code, ralphy now reports an error message that includes the exit code **and a snippet of the CLI output** (up to the last 12 lines). This provides more meaningful feedback than the previous generic "Unknown error" message, making troubleshooting easier.
-4. Parsing the result by searching for the last JSON line with `"type": "result"`. From this, ralphy extracts:
+1. **Output Concatenation:** stdout and stderr are concatenated for analysis.
+2. **Error Detection:** The `checkForErrors()` parser function scans for JSON lines with `"type": "error"` and extracts structured error messages. Common error patterns (rate limits, quota issues, model not found) are detected and reported with specific, actionable guidance.
+3. **Authentication Failures:** The `extractAuthenticationError()` parser function identifies authentication-related errors (invalid API keys, unauthorized access) and surfaces them as distinct authentication failures rather than generic command errors. This improves troubleshooting for credential-related issues.
+4. **Exit Code Handling:** When the Qwen-Code process exits with a non-zero exit code, ralphy uses the `formatCommandError()` helper to generate a detailed error message that includes:
+   - The actual process exit code (no longer masked in streaming mode)
+   - A snippet of the CLI output (up to the last 12 lines)
+   - Authentication-specific errors if detected
+   This provides more meaningful feedback than previous generic error messages.
+5. **Result Parsing:** The `parseStreamJsonResult()` function searches for the last JSON line with `"type": "result"` and extracts:
    - The response text (`result`)
    - Token usage fields (`usage.input_tokens`, `usage.output_tokens`)
-5. If no response text is found, ralphy defaults to `"Task completed"`.
+6. **Default Response:** If no response text is found, ralphy defaults to `"Task completed"`.
 
 Example Qwen-Code output line:
 
@@ -87,14 +85,20 @@
 
 Ralphy extracts the `result` and token usage from this line. This parsing logic is shared with other engines like Claude Code, ensuring consistent behavior across engines.
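The result-parsing step can be sketched as follows; the function name mirrors `parseStreamJsonResult()` from the docs, but the body is an assumption about its behavior, not the actual parsers.ts code:

```typescript
// Rough sketch of stream-json result parsing: keep the last line
// with type "result" and fall back to "Task completed".
interface ParsedResult {
	response: string;
	inputTokens: number;
	outputTokens: number;
}

function parseStreamJsonResultSketch(output: string): ParsedResult {
	let parsed: ParsedResult = {
		response: "Task completed",
		inputTokens: 0,
		outputTokens: 0,
	};
	for (const line of output.split("\n")) {
		const trimmed = line.trim();
		if (!trimmed.startsWith("{")) continue;
		try {
			const obj = JSON.parse(trimmed);
			if (obj.type === "result") {
				// Later "result" lines overwrite earlier ones, so the
				// last one in the stream wins.
				parsed = {
					response:
						typeof obj.result === "string" && obj.result
							? obj.result
							: "Task completed",
					inputTokens: obj.usage?.input_tokens ?? 0,
					outputTokens: obj.usage?.output_tokens ?? 0,
				};
			}
		} catch {
			// Non-JSON lines (plain log output) are ignored.
		}
	}
	return parsed;
}
```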
 
-If a command fails, you will now see an error message such as:
+**Example error output:**
 
 ```
 [ERROR] Task "Implement OAuth2 login" failed: Command failed with exit code 1. Output:
 <last lines of CLI output>
 ```
 
-This makes it easier to diagnose issues with the underlying engine or environment.
+**Example authentication error:**
+
+```
+[ERROR] Task failed: Invalid API key. Please check your Qwen-Code credentials.
+```
+
+These improvements make it easier to diagnose issues with the underlying engine, environment, or authentication configuration.
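The exit-code message shown above might be produced by a helper along these lines; this is a minimal sketch, and the real `formatCommandError()` may differ:

```typescript
// Sketch: build an error message from the exit code plus the last
// few lines of CLI output (12 by default, per the docs above).
function formatCommandErrorSketch(
	exitCode: number,
	output: string,
	snippetLines = 12,
): string {
	const lines = output.split("\n").filter((l) => l.trim().length > 0);
	const snippet = lines.slice(-snippetLines).join("\n");
	return `Command failed with exit code ${exitCode}. Output:\n${snippet}`;
}
```

Capping the snippet keeps the message actionable without dumping the full transcript into the error log.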
 
 #### Retryable Error Handling
 If a retryable error is detected (such as a rate limit, quota exceeded, or temporary network issue), ralphy will **stop the current run early** and print a warning. Tasks that encounter retryable errors are not marked as failed and will be retried on the next run. This prevents unnecessary retries and allows you to resume once the temporary issue is resolved.
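A minimal sketch of the retryable/fatal split, assuming a simple pattern list (the actual classification in ralphy's parsers may use different patterns):

```typescript
// Hypothetical retryable-error classifier; pattern list is illustrative.
const RETRYABLE_PATTERNS = [
	/rate limit/i,
	/quota exceeded/i,
	/ETIMEDOUT|ECONNRESET/, // transient network failures
	/overloaded/i,
];

function isRetryableErrorSketch(message: string): boolean {
	return RETRYABLE_PATTERNS.some((p) => p.test(message));
}
```

A match stops the run early and leaves the task eligible for the next run; anything else is treated as a real failure.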
@@ -106,7 +110,15 @@
 
 
 ### Ecosystem Integration
-Qwen-Code is fully integrated into the ralphy engine ecosystem. It is selectable alongside other engines such as Claude Code, OpenCode, Codex, Cursor agent, and Factory Droid. All engines share the same task execution loop, parallel execution support, and merge conflict resolution features. Qwen-Code uses the same JSON parsing and error handling utilities as other engines, so switching between engines is seamless.
+Qwen-Code is fully integrated into the ralphy engine ecosystem. It is selectable alongside other engines such as Claude Code, OpenCode, Codex, Cursor agent, and Factory Droid. All engines share the same task execution loop, parallel execution support, and merge conflict resolution features.
+
+**Refactored Engine Architecture:**
+As part of PR #161, the Qwen-Code engine now uses the refactored engine execution layer, which includes:
+- **executor.ts:** Provides `execCommand()` and `execCommandStreamingNew()` for reliable command execution with proper exit code propagation and cross-platform stdin handling.
+- **parsers.ts:** Centralizes output parsing logic with enhanced error classification (`checkForErrors()`, `extractAuthenticationError()`, `formatCommandError()`). This ensures authentication failures are surfaced as auth failures, not generic command errors.
+- **Improved exit code handling:** The streaming execution path now propagates real process exit codes with no masked failures, improving error detection and retry logic.
+
+These improvements make Qwen-Code more reliable and consistent with other engines. Switching between engines is seamless, as all share the same underlying execution and parsing infrastructure.
 
 **Parallel Execution Reliability Improvements:**
 - When running ralphy in parallel mode, if a git worktree operation fails (for example, due to repository configuration or file system issues), ralphy will **automatically fall back to sandbox mode** for that task. This ensures that tasks can still run even if worktrees are unavailable or misconfigured.
@@ -145,6 +157,12 @@
 5. If the `qwen` CLI is missing, confirm that ralphy exits with the appropriate error message.
 6. Optionally, test model override and parallel execution modes.
 
+**Troubleshooting Tips:**
+- **Authentication errors:** If you see messages like "Invalid API key" or "not authenticated," check your Qwen-Code API credentials and configuration. The new `extractAuthenticationError()` parser will clearly identify these as authentication failures.
+- **Exit code errors:** If a task fails with a non-zero exit code, review the error message for CLI output snippets that may indicate the root cause (e.g., missing dependencies, network issues, rate limits).
+- **Retryable errors:** If ralphy stops early with a warning about rate limits or quotas, wait for the suggested period and re-run ralphy. Tasks with retryable errors are automatically retried on the next run.
+- **Process invocation:** Remember that prompts are passed via stdin on all platforms. If you're manually testing the `qwen` CLI, you'll need to pipe the prompt text to stdin rather than passing it as an argument.
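The auth-detection behavior from the first tip can be sketched like this; the pattern list is hypothetical and not the actual `extractAuthenticationError()` implementation:

```typescript
// Hypothetical auth-error detection: return the first line that looks
// like a credential failure, or null if none is found.
const AUTH_PATTERNS = [/invalid api key/i, /not authenticated/i, /unauthorized/i, /\b401\b/];

function extractAuthenticationErrorSketch(output: string): string | null {
	for (const line of output.split("\n")) {
		if (AUTH_PATTERNS.some((p) => p.test(line))) return line.trim();
	}
	return null;
}
```

Surfacing the matched line as an authentication failure (instead of a generic command error) is what lets the retry logic abort immediately rather than retrying a task that can never succeed.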
+
 Example verification:
 
 ```bash


}
}

protected buildArgs(prompt: string, workDir: string, options?: EngineOptions): string[] {

🟠 High engines/codex.ts:107

Streaming is broken: buildArgs drops the lastMessageFile, and processCliResult returns "Not implemented", so temp files leak and the real response is never read. Suggest persisting the lastMessageFile (e.g., on the instance) and having processCliResult read/trim the file and delete it. This will make executeStreaming return the actual output and stop leaking.


Evidence trail:
cli/src/engines/codex.ts lines 14-30 (buildArgsInternal creates temp file), lines 95-98 (buildArgs discards lastMessageFile), lines 100-108 (processCliResult returns "Not implemented"); cli/src/engines/base.ts line 97 (executeStreaming calls this.buildArgs), line 179 (executeStreaming calls this.processCliResult); cli/src/engines/codex.ts lines 33-90 (non-streaming execute properly handles temp file)

return meaningfulLines.join("\n") || "Task completed";
}

async executeStreaming(

🟠 High engines/copilot.ts:217

Consider restoring the base safeguards in executeStreaming: wrap with try/catch and enforce RALPHY_EXECUTION_TIMEOUT with process cleanup (or reuse BaseAIEngine.executeStreaming) to avoid unhandled rejections and hangs.


Evidence trail:
cli/src/engines/copilot.ts lines 217-281: executeStreaming has try/finally but no catch block; cli/src/engines/base.ts lines 88-210: BaseAIEngine.executeStreaming has try/catch (line 120 try, line 203-210 catch), timeout setup (lines 107-128 with RALPHY_EXECUTION_TIMEOUT, setTimeout, childProcess.kill()); cli/src/engines/copilot.ts line 13: CopilotEngine extends BaseAIEngine

Comment on lines +208 to +209
detectStepFromOutput(line: string, logThoughts = false): string | null {
const trimmed = line.trim();

🟢 Low engines/opencode.ts:208

The detectStepFromOutput method on OpenCodeEngine is unreachable during streaming execution. BaseAIEngine.executeStreaming calls this.parseProgressLine, which invokes the imported standalone detectStepFromOutput from parsers.ts — not this instance method. OpenCode-specific events (tool usage, JSON tool calls) are therefore never detected, and progress reporting silently falls back to generic patterns. Consider overriding parseProgressLine in OpenCodeEngine to call this.detectStepFromOutput.

-	/** Detect step from output for progress tracking */
-	detectStepFromOutput(line: string, logThoughts = false): string | null {
+	/**
+	 * Parse a line of output to extract progress information.
+	 * Overrides base implementation to use OpenCode-specific step detection.
+	 */
+	protected override parseProgressLine(line: string, logThoughts?: boolean): string | null {
+		// Try OpenCode-specific detection first
+		const step = this.detectStepFromOutput(line, logThoughts);
+		if (step) return step;
+		// Fall back to base implementation for other patterns
+		return super.parseProgressLine(line, logThoughts);
+	}
+
+	/** Detect step from output for progress tracking (OpenCode-specific) */
+	private detectStepFromOutput(line: string, logThoughts = false): string | null {

Evidence trail:
cli/src/engines/base.ts line 7: `import { detectStepFromOutput } from "./parsers.ts";`
cli/src/engines/base.ts lines 287, 295: `parseProgressLine` calls standalone `detectStepFromOutput()`
cli/src/engines/base.ts line 171: `executeStreaming` calls `this.parseProgressLine(line, options?.logThoughts)`
cli/src/engines/opencode.ts line 208: instance method `detectStepFromOutput(line: string, logThoughts = false)` (handles JSON tool calls)
cli/src/engines/opencode.ts: no override of `parseProgressLine` (grep returned no results)
cli/src/engines/parsers.ts line 240: `export function detectStepFromOutput(line: string, logThoughts = true)`

Comment on lines +176 to +186
const match = line.match(/(\d+\.?\d*)([km]?)\s+in,\s+(\d+\.?\d*)([km]?)\s+out/i);
if (match) {
let input = Number.parseFloat(match[1]);
let output = Number.parseFloat(match[3]);

// Handle k/m suffixes
if (match[2] === "k") input *= 1000;
if (match[2] === "m") input *= 1000000;
if (match[4] === "k") output *= 1000;
if (match[4] === "m") output *= 1000000;


🟡 Medium engines/copilot.ts:176

In parseTokenCounts, the regex uses the /i flag to match suffixes case-insensitively, but the multiplier checks strictly for lowercase 'k' and 'm'. When the Copilot CLI outputs uppercase suffixes like '15K in', the suffix check fails and the value is returned unmultiplied (e.g., 15 instead of 15000). Consider converting the suffix to lowercase before comparison.

-				if (match[2] === "k") input *= 1000;
-				if (match[2] === "m") input *= 1000000;
-				if (match[4] === "k") output *= 1000;
-				if (match[4] === "m") output *= 1000000;
+				const inputSuffix = match[2].toLowerCase();
+				const outputSuffix = match[4].toLowerCase();
+				if (inputSuffix === "k") input *= 1000;
+				if (inputSuffix === "m") input *= 1000000;
+				if (outputSuffix === "k") output *= 1000;
+				if (outputSuffix === "m") output *= 1000000;

Evidence trail:
cli/src/engines/copilot.ts lines 171-186 at REVIEWED_COMMIT. Line 176 shows regex with `/i` flag: `/(\d+\.?\d*)([km]?)\s+in,\s+(\d+\.?\d*)([km]?)\s+out/i`. Lines 181-184 show lowercase-only comparisons: `match[2] === "k"`, `match[2] === "m"`, `match[4] === "k"`, `match[4] === "m"`.

Comment on lines +185 to +188
if (stdinContent && proc.stdin) {
proc.stdin.write(stdinContent);
proc.stdin.end();
}

🟠 High engines/executor.ts:185

Writing to proc.stdin after spawn without attaching an error listener to the stream itself crashes the process with an unhandled EPIPE error when the child process exits early. The proc.on('error', ...) handler on line 210 catches process spawn errors, not stream errors. Consider adding proc.stdin.on('error', ...) to handle early pipe closure gracefully.

+		// Handle stdin errors to prevent crash on EPIPE
+		proc.stdin.on('error', () => {});
+
 		// Write stdin content if provided
 		if (stdinContent && proc.stdin) {
 			proc.stdin.write(stdinContent);
 			proc.stdin.end();
 		}

Evidence trail:
cli/src/engines/executor.ts lines 175-220 at REVIEWED_COMMIT: Lines 185-187 show `proc.stdin.write(stdinContent)` and `proc.stdin.end()` with no error handler on the stdin stream. Line 210 shows `proc.on('error', ...)` which handles ChildProcess errors, not stream errors. No `proc.stdin.on('error', ...)` handler exists in the function.

Comment on lines +32 to 36
function debugLog(...args: unknown[]): void {
if (DEBUG || (globalThis as { verboseMode?: boolean }).verboseMode === true) {
logDebug(args.map((a) => String(a)).join(" "));
}
}

🟢 Low engines/base.ts:32

debugLog delegates to logDebug, which enforces its own verboseMode check. Setting RALPHY_DEBUG=true has no effect unless verbose mode is also enabled, defeating the purpose of the environment variable. Additionally, args.map(a => String(a)) stringifies objects to "[object Object]", making debug output useless for inspecting data structures. Consider using console.log directly when DEBUG is true, and remove the stringification to preserve object formatting.

 function debugLog(...args: unknown[]): void {
 	if (DEBUG || (globalThis as { verboseMode?: boolean }).verboseMode === true) {
-		logDebug(args.map((a) => String(a)).join(" "));
+		console.log(pc.dim("[DEBUG]"), ...args);
 	}
 }
Also found in 1 other location(s)

cli/src/engines/executor.ts:11

The logic in debugLog defeats the purpose of the RALPHY_DEBUG environment variable. While debugLog correctly checks DEBUG || verboseMode (line 12), it delegates printing to logDebug (line 13). The logDebug function (as shown in references) internally guards its execution with if (verboseMode). Consequently, if a user enables RALPHY_DEBUG but not verboseMode, debugLog will attempt to log, but logDebug will silently ignore the call.


Evidence trail:
cli/src/engines/base.ts lines 32-35 show `debugLog` function with `if (DEBUG || verboseMode)` check that calls `logDebug()`. cli/src/ui/logger.ts lines 43-47 show `logDebug` function with its own `if (verboseMode)` check, where `verboseMode` is a module-level boolean (line 3). This confirms the redundant check claim. JavaScript's `String()` function on objects returns "[object Object]" - this is standard language behavior.


let stderr = "";
let exitCode = 0;

if (isBun && result.stdout?.getReader && result.stderr?.getReader) {

🟡 Medium engines/base.ts:156

Suggestion: Bun streaming decoding corrupts multibyte characters and loses progress on chunk-split lines. Use one TextDecoder per stream with decode(value, { stream: true }), keep a buffer for incomplete lines, and only emit complete lines.

Also found in 1 other location(s)

cli/src/engines/executor.ts:149

Accumulating stdout and stderr by calling data.toString() on each chunk results in data corruption for multi-byte characters (e.g., UTF-8 emojis or non-Latin scripts) that happen to be split across chunk boundaries. This produces replacement characters (U+FFFD) in the output string. To handle this correctly, use StringDecoder or collect all Buffer chunks in an array and decode them only after the stream ends.


Evidence trail:
cli/src/engines/base.ts lines 165-176 at REVIEWED_COMMIT:
- Line 165: `const chunk = new TextDecoder().decode(value);` - creates new decoder per chunk (no stream:true option, no reuse)
- Lines 168-176: `const lines = chunk.split("\n");` followed by immediate processing of each line without buffering incomplete lines


Comment on lines +146 to +152
workDir,
env,
stdinContent,
);

childProcess = result.process;
let stdout = "";

🟠 High engines/base.ts:146

Suggestion: executeStreaming starts a process with execCommandStreamingNew then falls back to execCommand without terminating the first process and clears the timeout, leaving a zombie and no time limit. Please manage a single child: either reuse the first process/streams across Node and Bun, or explicitly kill the first before fallback and keep timeout protection active.

-    try {
-      const result = await execCommandStreamingNew(
-        this.cliCommand,
-        args,
-        workDir,
-        env,
-        stdinContent,
-      );
+    let streamingResult: Awaited<ReturnType<typeof execCommandStreamingNew>> | undefined;
+    try {
+      streamingResult = await execCommandStreamingNew(
+        this.cliCommand,
+        args,
+        workDir,
+        env,
+        stdinContent,
+      );

Evidence trail:
cli/src/engines/base.ts lines 143-151 (execCommandStreamingNew called, childProcess assigned), lines 155 (isBun condition), lines 197-207 (fallback branch - clearTimeout called, execCommand called without killing first process); cli/src/engines/executor.ts lines 262-330 (execCommandStreamingNew spawns real process for Node.js with spawn() at line 302, returns stdout: null, stderr: null at lines 327-328)

Comment on lines +300 to +311
// Check for errors first
const error = checkForErrors(output);
if (error) {
return {
success: false,
response: "",
inputTokens: 0,
outputTokens: 0,
error,
sessionId,
};
}

🟡 Medium engines/opencode.ts:300

When checkForErrors detects an error in the output, processCliResult returns hardcoded zero/empty values for inputTokens, outputTokens, and response, discarding the successfully parsed values from this.parseOutput. This causes data loss for usage tracking — tokens consumed before the error are not reported. The behavior is inconsistent with the exitCode !== 0 path, which preserves these values. Consider passing the parsed values through to the error result so usage metrics are available even when an error occurs.

-		if (error) {
-			return {
-				success: false,
-				response: "",
-				inputTokens: 0,
-				outputTokens: 0,
-				error,
-				sessionId,
-			};
-		}
+		if (error) {
+			return {
+				success: false,
+				response,
+				inputTokens,
+				outputTokens,
+				error,
+				sessionId,
+			};
+		}

Evidence trail:
cli/src/engines/opencode.ts lines 280-335 at REVIEWED_COMMIT. Key lines: 298 (parseOutput call extracts values), 302-309 (checkForErrors path returns hardcoded 0/empty values), 312-319 (exitCode !== 0 path preserves parsed values).

Comment on lines +152 to +175
function execWithNode(
command: string,
args: string[],
workDir: string,
env?: Record<string, string>,
stdinContent?: string,
): Promise<ExecutionResult> {
// Validate before execution
const validation = validateCommandAndArgs(command, args);
if (!validation.valid || !validation.command || !validation.args) {
return Promise.resolve({
stdout: "",
stderr: `Error: ${validation.error}`,
exitCode: 1,
});
}

// Store validated values to ensure TypeScript knows they're defined
const validatedCommand = validation.command;
const validatedArgs = validation.args;

return new Promise((resolve) => {
const proc = spawn(validatedCommand, validatedArgs, {
cwd: workDir,

🟡 Medium engines/executor.ts:152

execWithNode calls validateCommandAndArgs again on arguments that were already validated and potentially transformed by execCommand. Since execWithBun skips re-validation, the same command can succeed under Bun but fail under Node.js when the validator transforms inputs (e.g., resolving aliases). Consider removing the redundant validation block at lines 160-167 so both execution paths behave consistently.

```diff
 ): Promise<ExecutionResult> {
-	// Validate before execution
-	const validation = validateCommandAndArgs(command, args);
-	if (!validation.valid || !validation.command || !validation.args) {
-		return Promise.resolve({
-			stdout: "",
-			stderr: `Error: ${validation.error}`,
-			exitCode: 1,
-		});
-	}
-
-	// Store validated values to ensure TypeScript knows they're defined
-	const validatedCommand = validation.command;
-	const validatedArgs = validation.args;
-
 	return new Promise((resolve) => {
-		const proc = spawn(validatedCommand, validatedArgs, {
+		const proc = spawn(command, args, {
```

Evidence trail:
cli/src/engines/executor.ts lines 67-86 (`execCommand` validates at line 72, passes validated values to `execWithBun`/`execWithNode`); cli/src/engines/executor.ts lines 107-139 (`execWithBun` uses values directly without re-validation); cli/src/engines/executor.ts lines 152-175 (`execWithNode` re-validates at line 160 with same `validateCommandAndArgs` call)

@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Additional Comments (1)

cli/src/engines/gemini.ts
GeminiEngine defines buildArgs() (lines 23–35) and processCliResult() (lines 40–80) as template-method hooks following the pattern introduced for ClaudeEngine/QwenEngine. However, execute() (lines 82–139) rebuilds the exact same argument list from scratch instead of calling this.buildArgs(), making the template hook dead code. The same duplication exists in executeStreaming() (lines 141–214).

This means future changes to buildArgs() will be silently ignored in actual execution, and both execute() and executeStreaming() diverge from the template-method pattern. Remove the standalone overrides and rely on BaseAIEngine.execute() and BaseAIEngine.executeStreaming(), just as ClaudeEngine and QwenEngine now do.


@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Additional Comments (3)

cli/src/engines/copilot.ts
Temp file created inside buildArgs — file leaks on early error

createTempFile is called inside buildArgsInternal, which is called at the very start of both execute and executeStreaming. If any step between buildArgsInternal and the try { ... } finally { cleanupTempFile } block throws synchronously (e.g., constructing args, a future guard check), the temp file will have been created but the finally block will never run because the exception propagates before try is entered.

The temp file creation should be moved inside the try block, or the cleanup should be registered before any code that can throw:

```typescript
async execute(prompt: string, workDir: string, options?: EngineOptions): Promise<AIResult> {
    const { args } = this.buildArgsInternalWithoutFile(prompt, options);
    const tempFile = this.createTempFile(prompt);
    args.push(tempFile);
    try {
        // ... rest of execute
    } finally {
        this.cleanupTempFile(tempFile);
    }
}
```

cli/src/engines/parsers.ts
checkForErrors detects false positives from legitimate AI responses

The plain-text pattern matching in checkForErrors checks for strings like "not available", "model not found", and "invalid model" anywhere in the output. These strings frequently appear in legitimate AI assistant responses (e.g., "This feature is not available in Python 2", "The invalid model you've used..."). Any engine using this function—including ClaudeEngine and QwenEngine—will misclassify successful responses as errors whenever the AI happens to mention these phrases.

The original implementation in base.ts only matched on JSON type === "error" lines, which is safe. The new plain-text scan is far too broad. Consider restricting it to only the first few lines of output, or only apply it when the exit code is already non-zero.
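The "only when the exit code is already non-zero" option could look like the following sketch. It is a simplified stand-in, not the actual `parsers.ts` helper; the signature and the pattern list are assumptions for illustration.

```typescript
// Simplified stand-in for the parsers.ts helper; real signature may differ.
function checkForErrors(output: string, exitCode: number): string | null {
	// Only run the broad plain-text scan when the CLI already signaled
	// failure; with exit code 0 the text is almost certainly AI response
	// content, so scanning it invites false positives.
	if (exitCode === 0) return null;

	for (const line of output.split("\n")) {
		const lower = line.trim().toLowerCase();
		if (
			lower.startsWith("fatal:") ||
			lower.includes("model not found") ||
			lower.includes("invalid model")
		) {
			return line.trim();
		}
	}
	return null;
}
```

Gating on the exit code keeps the broad patterns useful for classifying genuine failures while making them unreachable on successful runs.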


cli/src/engines/gemini.ts
GeminiEngine.execute duplicates buildArgs argument construction

After this refactor, GeminiEngine has both a buildArgs method and a standalone execute method that re-constructs the args array from scratch (including the --output-format stream-json --yolo --model -p flags) rather than calling this.buildArgs(prompt, workDir, options). The same duplication exists in executeStreaming. This means any change to the flag set must be made in three places, making future maintenance error-prone.

execute and executeStreaming should call this.buildArgs(prompt, workDir, options) instead of rebuilding args inline.


@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Additional Comments (1)

cli/src/engines/gemini.ts
GeminiEngine.execute() ignores options.env

execCommand is called with undefined for the env parameter, which means any environment variables supplied by the caller (via EngineOptions.env) are silently dropped. The same issue appears in executeStreaming at line 154 where execCommandStreaming also receives undefined for env.

By contrast, BaseAIEngine.executeStreaming correctly calls this.getEnv(options). The non-streaming path bypasses this, so engine-level overrides (e.g., DEBUG_OPENCODE, custom ANTHROPIC_API_KEY for tests, etc.) will never be propagated for Gemini.

```typescript
		const { stdout, stderr, exitCode } = await execCommand(
			this.cliCommand,
			args,
			workDir,
			this.getEnv(options),
			stdinContent,
		);
```

The same fix is needed at line 154 in executeStreaming:

```typescript
const { exitCode } = await execCommandStreaming(
    this.cliCommand, args, workDir, (line) => {...},
    this.getEnv(options),   // ← was undefined
    stdinContent,
);
```

Comment on lines +57 to 70
```typescript
async execute(prompt: string, workDir: string, options?: EngineOptions): Promise<AIResult> {
	const args = this.buildArgs(prompt, workDir, options);

	// Pass prompt via stdin for cross-platform compatibility
	// This avoids shell escaping issues and argument length limits on all platforms
	const stdinContent = prompt;

	const { stdout, stderr, exitCode } = await execCommand(
		this.cliCommand,
		args,
		workDir,
		{ OPENCODE_PERMISSION: '{"*":"allow"}' },
		this.getEnv(options),
		stdinContent,
	);
```
Contributor

dryRun option not honored in execute() override

OpenCodeEngine.execute overrides BaseAIEngine.execute but does not check options?.dryRun. BaseAIEngine.execute (line 249–252 of base.ts) guards against dry-run execution:

```typescript
if (options?.dryRun) {
    return { success: true, response: "(dry run) Skipped", inputTokens: 0, outputTokens: 0 };
}
```

Because OpenCodeEngine.execute overrides the base class, that guard is never reached. As a result, OpenCodeEngine will issue a real network request even when callers explicitly set dryRun: true. The same problem affects every other engine that overrides execute() in this PR: CopilotEngine (copilot.ts:118), CodexEngine (codex.ts:42), CursorEngine (cursor.ts:34), GeminiEngine (gemini.ts:49), and DroidEngine (droid.ts:34). Only ClaudeEngine and QwenEngine, which rely entirely on the base-class execute, respect dry-run correctly.

Similarly, executeStreaming overrides in CopilotEngine (copilot.ts:235), CursorEngine (cursor.ts:116), GeminiEngine (gemini.ts:97), and DroidEngine (droid.ts:106) skip the base-class dry-run guard in executeStreaming.

The simplest fix is to add the guard at the top of each overriding method:

```typescript
async execute(prompt: string, workDir: string, options?: EngineOptions): Promise<AIResult> {
    if (options?.dryRun) {
        return { success: true, response: "(dry run) Skipped", inputTokens: 0, outputTokens: 0 };
    }
    // ... rest of existing implementation
}
```

Comment on lines +130 to +151
```typescript
// Look for common error patterns in plain text (case-insensitive)
const lowerTrimmed = trimmed.toLowerCase();
const hasExplicitErrorPrefix = lowerTrimmed.startsWith("fatal:");
const hasKnownModelErrorToken =
	lowerTrimmed.includes("providermodelnotfounderror") ||
	lowerTrimmed.includes("modelnotfounderror") ||
	(lowerTrimmed.includes("model not found") && lowerTrimmed.includes("error")) ||
	(lowerTrimmed.includes("invalid model") && lowerTrimmed.includes("error"));

if (hasExplicitErrorPrefix || hasKnownModelErrorToken) {
	// Improve specific error messages
	if (lowerTrimmed.includes("rate limit")) {
		return "OpenCode Rate Limit: Too many requests. Try: Wait 30-60s";
	}
	if (lowerTrimmed.includes("quota")) {
		return "OpenCode Quota Exceeded: You've reached your usage limit. Check your OpenCode plan";
	}
	if (lowerTrimmed.includes("connection") || lowerTrimmed.includes("timeout")) {
		return "OpenCode Connection Error: Unable to connect to the service. Check internet connection";
	}
	return trimmed;
}
```
Contributor

**`checkForErrors` produces false positives on AI-generated explanatory text**

The primary scan loop (lines 130–151) applies pattern matching to every line of the full combined output, including the AI's response text. The patterns at lines 134–137 are broad enough to match explanatory prose:

```typescript
(lowerTrimmed.includes("model not found") && lowerTrimmed.includes("error")) ||
	(lowerTrimmed.includes("invalid model") && lowerTrimmed.includes("error"));
```

If, for example, Claude or Gemini responds with a line such as:

> "This stacktrace occurs because the request triggers a `ProviderModelNotFoundError` when the model ID is incorrect."

the function would return a non-null error string and every calling engine would mark the task as failed — even when `exitCode` is 0 and the AI successfully diagnosed the problem. The same risk applies to `providermodelnotfounderror` and `modelnotfounderror` matching (lines 134–135): error class names are commonly mentioned in AI explanations.

The secondary check (lines 154–167) is safer because it requires a known shell prefix (`fatal:`, `error:`, `bash:`, etc.), but the primary scan has no such guard.

Consider restricting `hasKnownModelErrorToken` to only apply when the line also has a structured-error-like prefix, or limit this plain-text scan to `stdout`/`stderr` sections that are known to be CLI metadata rather than AI response content.
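The structured-error-prefix alternative can be sketched like this. The helper names and the prefix list are assumptions modeled on the secondary check described above, not the repository's actual code.

```typescript
// Assumed prefix list modeled on the safer secondary check.
const STRUCTURED_ERROR_PREFIXES = ["fatal:", "error:", "bash:", "sh:"];

function looksLikeStructuredError(line: string): boolean {
	const lower = line.trim().toLowerCase();
	return STRUCTURED_ERROR_PREFIXES.some((p) => lower.startsWith(p));
}

// Only treat model-error tokens as failures when the line itself carries a
// structured-error prefix, so prose that merely mentions
// ProviderModelNotFoundError is not misclassified.
function hasKnownModelError(line: string): boolean {
	const lower = line.trim().toLowerCase();
	const mentionsModelError =
		lower.includes("providermodelnotfounderror") ||
		lower.includes("modelnotfounderror") ||
		lower.includes("model not found") ||
		lower.includes("invalid model");
	return mentionsModelError && looksLikeStructuredError(line);
}
```

Under this guard, `error: ProviderModelNotFoundError: no such model` is still flagged, while an AI sentence explaining that error class passes through untouched.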


Comment on lines +115 to +118
```typescript
protected buildArgs(prompt: string, workDir: string, options?: EngineOptions): string[] {
	const { args } = this.buildArgsInternal(prompt, workDir, options);
	return args;
}
```
Contributor

buildArgs generates a new lastMessageFile path that processCliResult cannot clean up

Every call to buildArgs delegates to buildArgsInternal, which creates a unique lastMessageFile path (using Date.now() + process.pid + randomUUID()) and embeds it in --output-last-message. The Codex CLI will write to this file during execution.

processCliResult at lines 120–150 has no knowledge of this path and performs no cleanup. The current override of executeStreaming (which calls execute, where a finally block handles cleanup) means this is safe today. However, buildArgs is a protected method that is part of the BaseAIEngine contract — calling it is now a side-effecting operation (the CLI will create a file) that cannot be undone by processCliResult.

Any future refactor that removes the executeStreaming override, any test that calls buildArgs independently, or any subclass that relies on the base execution path will silently leak .codex-last-message-*.txt files in the working directory.

Consider storing the lastMessageFile path as an instance field so processCliResult can read and clean it up regardless of the execution path:

```typescript
private pendingLastMessageFile?: string;

protected buildArgs(prompt: string, workDir: string, options?: EngineOptions): string[] {
    const { args, lastMessageFile } = this.buildArgsInternal(prompt, workDir, options);
    this.pendingLastMessageFile = lastMessageFile;
    return args;
}

protected processCliResult(stdout: string, stderr: string, exitCode: number, workDir: string): AIResult {
    const lastMessageFile = this.pendingLastMessageFile;
    this.pendingLastMessageFile = undefined;
    // ... read and cleanup lastMessageFile
}
```
