fix: improve worker lifecycle reliability #360
Four targeted fixes to the worker start/complete/cleanup flow:

1. Detect stale workers with PID=0: Workers where Claude failed to start (PID=0) were never detected as dead by the health check. The health check now marks them for cleanup after a 2-minute grace period.
2. Clean up transient agents with dead processes: When a worker's process dies but its tmux window persists, the health check now marks non-persistent agents for cleanup instead of just logging a warning.
3. Rollback on worker creation failure: If `createWorker()` fails after creating resources (worktree, tmux window), those resources are now cleaned up instead of being left orphaned.
4. Record task history on manual worker removal: `worker rm` now records task history before removing the worker, matching the behavior of automatic cleanup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Triage Review

Priority: P0 (Reliable worker lifecycle - roadmap item)

Changes:
Recommendation: Merge before #364. Solid fix with good test coverage.
Local CI Verification (2026-03-12)
CI Status: No GitHub Actions checks are running — this is expected for first-time fork PRs. GitHub requires a maintainer to approve workflow runs for PRs from forks. Branch is rebased on |
Hi! All 8 open PRs from this fork are waiting on GitHub Actions approval ( Priority PRs ready for review:
All branches are rebased on current |
Summary
- If `createWorker()` fails after creating the worktree/tmux window (e.g., daemon registration fails), orphaned resources are now cleaned up.
- `worker rm`: Manual worker removal via `worker rm` now records task history, matching automatic cleanup behavior.

Context
This addresses the P0 roadmap item "Reliable worker lifecycle - Workers should start, complete, and clean up without manual intervention." The audit identified four gaps in the worker lifecycle that could require manual intervention:
- `checkAgentHealth()` skipped PID=0 workers entirely (line 338: `if agent.PID > 0`)
- `createWorker()` had no rollback - partial failures left orphaned tmux windows and worktrees
- `handleRemoveAgent()` didn't call `recordTaskHistory()`, unlike `cleanupDeadAgents()`

Test plan
- `TestCheckAgentHealthPIDZeroStaleWorker` - verifies PID=0 grace period logic
- `TestHandleRemoveAgentRecordsTaskHistory` - verifies task history is recorded on manual removal
- `go test ./...` (27 packages)
- `go build ./cmd/multiclaude`

Opportunities (not implemented)
🤖 Generated with Claude Code