Skip to content

refactor(agent): decouple nvm bootstrap from agent Run/InstallMCP commands#76

Open
zpzjzj wants to merge 2 commits into
mainfrom
fix/command_exec
Open

refactor(agent): decouple nvm bootstrap from agent Run/InstallMCP commands#76
zpzjzj wants to merge 2 commits into
mainfrom
fix/command_exec

Conversation

@zpzjzj
Copy link
Copy Markdown
Collaborator

@zpzjzj zpzjzj commented Jun 1, 2026

Summary

Splits the nvm/Node bootstrap and the agent invocation into two distinct rt.Exec calls so failures can be cleanly attributed and the agent's stderr buffer stays free of bootstrap noise.

Fixes #72

Problem

nodeRuntimeCommandWithGuard (internal/agent/node_install.go) packed the bootstrap and the agent's real command (claude -p, codex exec --json, claude mcp add, codex mcp add) into a single multi-line shell script handed to one rt.Exec call. Consequences:

  • Bootstrap failures (curl timeout to the nvm installer, sha256 mismatch, nvm install errors against an offline mirror) surfaced as claude-code run failed: ... / codex run failed: .... Hard to triage without reading the wrapped script.
  • Bootstrap stderr was concatenated into result.Stderr, which is then scanned by providerAuthFailureSignal / providerRateLimitSignal — widening their false-positive surface.
  • Bootstrap stdout could prepend stdout.json artifacts.
  • Bootstrap re-fired silently at Run time when Install failed or wasn't called, masking what should be a clear "Install was never called" failure.

Fix

New ensureNodeRuntime(ctx, rt, cliBin, opts) helper runs only the bootstrap as its own rt.Exec. The script's first line short-circuits with exit 0 when command -v <cli> succeeds, so the happy path is one extra command -v check with no output.

Call site (before) After
claude_code.go:127 nodeRuntimeCommandWithGuard("claude", ...) (Run) ensureNodeRuntime → Exec
claude_code.go:72 nodeRuntimeCommandWithGuard("claude", ...) (MCP) InstallMCP calls ensureNodeRuntime once before the per-server loop
codex.go:236 (Run) ensureNodeRuntime → Exec
codex.go:136, :171 (MCP stdio / http+bridge) wrap-once in InstallMCP

nodeRuntimeCommandWithGuard and nodeRuntimeCommand are removed — no remaining production callers. qodercli is untouched (it never used the wrapper).

Errors from ensureNodeRuntime wrap with node bootstrap failed: ... so they're trivially distinguishable from agent run failures in result handling and signal detectors. Bootstrap stderr no longer reaches result.Stderr of the agent Exec.

A follow-up commit (de46509) preserves the existing convention of returning a non-nil SessionResult{Engine, ExitCode: 1, DurationMs} on bootstrap failure, so the evaluator still records telemetry for failed cases.

Test plan

  • go build ./...
  • go test ./... (full suite green)
  • make verify (gofmt + vet + revive + golangci-lint, 0 issues)
  • New tests in internal/agent/node_install_test.go: script shape (CLI short-circuit at top), Exec-error wrapping, non-zero-exit wrapping with stderr surfacing, empty-cliBin guard
  • Updated fake runtimes (claudeCodeTestRuntime, codexTestRuntime) recognize bootstrap commands by content sniffing and short-circuit them so existing lastCommand / execResult assertions still observe the agent command
  • Manual smoke: run an eval with claude-code and verify trace shows two distinct rt.Exec calls (bootstrap + run) and that stdout.json contains only stream-json from claude
  • Negative smoke: set NVM_SOURCE to an unreachable host; failure should report node bootstrap failed: rather than claude-code run failed:

🤖 Generated with Claude Code

zpzjzj added 2 commits June 1, 2026 11:05
…ate Exec

Previously `nodeRuntimeCommandWithGuard` packed the nvm/Node bootstrap and
the agent's real command (`claude -p`, `codex exec --json`, MCP install)
into a single multi-line shell script handed to one `rt.Exec` call. When
the bootstrap failed (curl timeout to nvm installer, sha256 mismatch, nvm
install errors against an offline mirror), the failure surfaced as a
`claude-code run failed` / `codex run failed` and bootstrap stderr was
concatenated into `result.Stderr`, which then fed `providerAuthFailureSignal`
and `providerRateLimitSignal` — widening their false-positive surface.

Replace the wrapper with `ensureNodeRuntime`, which runs only the bootstrap
as its own `rt.Exec` (short-circuited by `if command -v <cli>; then exit 0`).
Run/InstallMCP call it first, then Exec the agent command separately. Each
Exec now has its own exit code, stderr, stdout; bootstrap failures wrap as
`node bootstrap failed: ...` and never reach the agent's stderr buffer or
`stdout.json` artifact.

`qodercli` is unaffected — it never used the wrapper.

Fixes #72
The new ensureNodeRuntime call path returned `nil, err` on bootstrap
failure, breaking the existing convention (post-Exec error path) of
always returning a SessionResult with Engine/ExitCode/DurationMs so the
evaluator can record telemetry for failed cases. Build a minimal
SessionResult before the error return, matching the surrounding pattern.

Self-review catch.
@zpzjzj zpzjzj requested a review from hittyt as a code owner June 1, 2026 03:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Decouple nvm/node bootstrap from agent Run/Exec commands for clearer failure attribution

1 participant