fix(polling): Prevent hanging providers from permanently blocking background usage refresh#414
cnovak wants to merge 2 commits into steipete:main
Conversation
This aggregates three related safety valves to address instances of permanent hangs in usage polling:

1. Adds a 60-second global timeout to the background polling loop in `UsageStore.swift`.
2. Adds a 30-second per-provider timeout in `UsageStore+Refresh.swift`.
3. Hardens `SubprocessRunner.swift` with improved pipe management, task cancellation, and a more aggressive SIGKILL enforcement mechanism to prevent zombie processes. Specifically, it explicitly closes stdout/stderr pipes before awaiting the reading tasks, so that stray inherited file handles do not block reads indefinitely.

Fixes steipete#189
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37eadee216
Using `try?` on `Task.sleep` swallows the `CancellationError`. When the task group is cancelled after a successful provider fetch, the timeout task would still continue to the warning path and falsely report a timeout. Using a do-catch block returns early on cancellation, preventing unreliable hang diagnostics.
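The fix Codex is suggesting could look roughly like this (a sketch, not the PR's exact code; the `logger` name and the return type of the child task are assumptions):

```swift
group.addTask {
    do {
        try await Task.sleep(nanoseconds: 30 * 1_000_000_000)
    } catch {
        // The group was cancelled because a provider fetch finished first;
        // return early instead of falling through to the timeout warning.
        return nil
    }
    // Only reached if the sleep genuinely completed, i.e. a real timeout.
    logger.warning("Provider fetch exceeded 30s; treating as hung")
    return nil
}
```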
```swift
    return nil
}
let first = await group.next()
group.cancelAll()
```
Could we double-check whether canceling the task group here guarantees this method returns in ~30s when a provider fetch is stuck in non-cooperative work?
```swift
    return outcome
} else {
    return ProviderFetchOutcome(
        result: .failure(SubprocessRunnerError.timedOut("\(provider.rawValue) fetch")),
```
When we create this timeout failure, how are we distinguishing a true timeout from parent-task cancellation?
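One way to make that distinction, sketched under the assumption that `ProviderFetchOutcome` takes a `result` parameter as in the diff above:

```swift
if let outcome = first {
    return outcome
} else if Task.isCancelled {
    // The parent task was cancelled; propagate that instead of
    // mislabelling it as a provider hang.
    return ProviderFetchOutcome(result: .failure(CancellationError()))
} else {
    return ProviderFetchOutcome(
        result: .failure(SubprocessRunnerError.timedOut("\(provider.rawValue) fetch")))
}
```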
```swift
    throw SubprocessRunnerError.timedOut("global refresh")
}
_ = try await group.next()
group.cancelAll()
```
Do we know this cancellation path always lets the timer loop move forward, even if refresh work does not respond to cancellation quickly?
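For context, the global-timeout race presumably has roughly this shape (a sketch; `refreshAllProviders` is a hypothetical stand-in for the actual refresh call):

```swift
try await withThrowingTaskGroup(of: Void.self) { group in
    group.addTask { try await refreshAllProviders() }
    group.addTask {
        try await Task.sleep(nanoseconds: 60 * 1_000_000_000)
        throw SubprocessRunnerError.timedOut("global refresh")
    }
    // Whichever child finishes first decides the outcome.
    _ = try await group.next()
    group.cancelAll()
    // Caveat: the group still awaits its cancelled children before returning,
    // so the timer loop only advances promptly if the refresh work responds
    // to cancellation -- which is exactly the concern raised here.
}
```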
```swift
// readToEnd() can block indefinitely if the underlying process is dead but the pipe is still "open"
// in a zombie state or if a child process inherited it. Closing the handle explicitly triggers EOF
// in the reading task, allowing stdoutTask.value to complete.
try? stdoutPipe.fileHandleForReading.close()
```
Is there any chance closing the read handle here could race the reader task and cause us to miss some stdout/stderr data?
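The ordering the PR relies on could be sketched as follows (assumed shape; `process`, `stdoutPipe`, and `stdoutTask` mirror the diff's names):

```swift
// The reader task is started before the process exits and drains the pipe
// continuously, so buffered output is consumed as it arrives.
let stdoutTask = Task.detached {
    try stdoutPipe.fileHandleForReading.readToEnd() ?? Data()
}
process.waitUntilExit()
// Closing the read end after exit forces EOF for a reader blocked by write
// ends inherited by surviving grandchildren; data already read is unaffected.
try? stdoutPipe.fileHandleForReading.close()
let stdout = (try? await stdoutTask.value) ?? Data()
```

Whether a race remains depends on whether `close()` can interleave with an in-flight `readToEnd()` before it has drained the kernel buffer, which is the question being asked here.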
Summary
This PR prevents hanging usage providers from permanently blocking the background usage refresh loop.
What was happening
Root cause
- The `UsageStore` polling loop lacked a global timeout safety net.
- `UsageStore+Refresh` awaited provider `fetchOutcome()` indefinitely.
- `SubprocessRunner` pipes were awaited before being closed, meaning inherited file handles from zombie child processes could block stdout/stderr reads indefinitely.

What changed
- Added a 60-second global timeout to the polling loop in `UsageStore.swift`.
- Added a 30-second per-provider timeout in `UsageStore+Refresh.swift`.
- Hardened `SubprocessRunner.swift` with guaranteed execution of cleanup via a `defer` block.
- Added `SIGKILL` enforcement to terminate processes resisting `SIGTERM`.
- Explicitly closed pipes before awaiting `readToEnd()` calls.

Before / After
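The SIGTERM-to-SIGKILL escalation mentioned above might look roughly like this (a hypothetical helper, not the PR's exact code; the 5-second grace period and polling interval are assumptions):

```swift
import Foundation

func terminateWithEscalation(_ process: Process, gracePeriod: TimeInterval = 5) async {
    process.terminate()  // sends SIGTERM
    let deadline = Date().addingTimeInterval(gracePeriod)
    // Poll briefly for a graceful exit...
    while process.isRunning && Date() < deadline {
        try? await Task.sleep(nanoseconds: 100_000_000)  // 100 ms
    }
    // ...then escalate to SIGKILL, which cannot be caught or ignored.
    if process.isRunning {
        kill(process.processIdentifier, SIGKILL)
    }
}
```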
Before
After
Validation
- Timeouts surface as `SubprocessRunnerError.timedOut`.
- Cleanup runs via `defer` loops.

Notes
Fixes #189