fix: prevent indefinite hang on Windows when child processes hold pipes open#273
Conversation
umputun
left a comment
There was a problem hiding this comment.
LGTM, thx. clean fix, correct approach - goroutine + buffered channel is the standard Go idiom for making a blocking call context-cancellable. tests pass, linter clean, race detector clean.
There was a problem hiding this comment.
Pull request overview
This PR addresses a Windows-specific hang where readLines() could block indefinitely in ReadString when child processes keep stdout/stderr pipe handles open, preventing context cancellation (idle/session timeouts) from interrupting a blocked read.
Changes:
- Refactors
readLines()to performbufio.Reader.ReadString('\n')in a goroutine andselectbetween the read result andctx.Done(). - Updates inline documentation explaining why async reads are required (notably on Windows with inherited pipe handles).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| reader := bufio.NewReader(r) | ||
| ch := make(chan readResult, 1) // buffered: lets abandoned goroutine exit after kill | ||
|
|
||
| doRead := func() { | ||
| line, err := reader.ReadString('\n') | ||
| ch <- readResult{line, err} | ||
| } | ||
|
|
||
| go doRead() | ||
|
|
There was a problem hiding this comment.
readLines() now always starts a ReadString goroutine before checking ctx. If ctx is already canceled, there’s a race where the read goroutine can fill ch and the select may take the res := <-ch branch, potentially draining input and returning nil/EOF instead of context.Canceled (can make cancellation propagation and tests flaky). Consider checking ctx.Err() up front (before starting the first goroutine) and/or prioritizing cancellation inside the loop (e.g., if ctx.Err()!=nil when receiving res, return the context error without processing/spawning another read).
| } | ||
| return fmt.Errorf("read lines: %w", res.err) | ||
| } | ||
| return fmt.Errorf("read lines: %w", err) | ||
| go doRead() | ||
| } |
There was a problem hiding this comment.
This implementation spawns a new goroutine for every line (go doRead() on each successful read). For long-running/high-volume streams this adds avoidable goroutine creation/teardown overhead. A lower-overhead approach is to run a single long-lived reader goroutine that loops on ReadString and sends readResults on a channel, while the main goroutine selects on ctx.Done() vs results.
Summary
readLines()to runReadStringin a goroutine withselecton context, so cancellation (idle/session timeout) can interrupt a blocked pipe readReadStringto block indefinitelyRoot cause
On Unix,
killProcesssendsSIGKILLto the entire process group, closing all inherited handles. On Windows,process.Kill()only terminates the direct process — child processes survive, pipes stay open,ReadStringnever returns, andctx.Done()was never checked during the blocking read.What changed
readLines()inpkg/executor/linereader.go:Before: synchronous
ReadStringin a loop withselect { case <-ctx.Done() / default }— thedefaultbranch calledReadStringwhich blocked, making context cancellation unreachable.After:
ReadStringruns in a goroutine, result arrives via buffered channel.selectlistens on both the channel andctx.Done(), so cancellation takes effect immediately even while the read is blocked.Recommendation for Windows users
Set
idle_timeout = 5min config (~/.config/ralphex/config). Without a timeout, no one cancels the context, and the pipe hangs forever. 5 minutes of silence is never normal during active task execution.Long-term fix: Job Objects (
CreateJobObject+TerminateJobObject) inprocgroup_windows.goto kill the entire process tree, making pipes close immediately without relying on timeouts.Test plan
go test ./pkg/executor/passesgo build ./...compilesKill()succeeds,Wait()returns, task loop continues to next task