fix(polling): Prevent hanging providers from permanently blocking background usage refresh #414

Open

cnovak wants to merge 2 commits into steipete:main from cnovak:fix/polling-timeout

Conversation


@cnovak cnovak commented Feb 22, 2026

Summary

This PR prevents hanging usage providers from permanently blocking the background usage refresh loop.

What was happening

  • The background usage poller could permanently freeze if a provider request or subprocess blocked indefinitely (e.g., due to a network blackhole or a hanging CLI command).
  • This caused CodexBar to stop updating usage data entirely until restarted.

Root cause

  • The UsageStore polling loop lacked a global timeout safety net.
  • UsageStore+Refresh awaited provider fetchOutcome() indefinitely.
  • SubprocessRunner awaited its pipe read tasks before closing the pipes, so file handles inherited by zombie child processes could block stdout/stderr reads indefinitely.

What changed

  1. Global Polling Timeout
  • Added a 60-second task group timeout to the background polling loop in UsageStore.swift.
  2. Per-Provider Refresh Timeout
  • Added a 30-second task group timeout per provider in UsageStore+Refresh.swift.
  3. Subprocess Hardening
  • Improved SubprocessRunner.swift by guaranteeing cleanup runs via a defer block.
  • Added SIGKILL escalation to forcibly terminate processes that ignore SIGTERM.
  • Explicitly closed stdout/stderr pipes before awaiting their read tasks to unblock hanging readToEnd() calls.
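The timeout changes in items 1 and 2 both boil down to racing work against a sleep inside a task group. A minimal sketch of that pattern, with illustrative names (`RefreshError` and `fetchWithTimeout` are not the actual identifiers in UsageStore+Refresh.swift):

```swift
import Foundation

enum RefreshError: Error { case timedOut(String) }

// Race an async operation against a deadline; whichever child task finishes
// first wins, and the loser is cancelled.
func fetchWithTimeout<T: Sendable>(
    seconds: UInt64,
    operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await operation() }
        group.addTask {
            try await Task.sleep(nanoseconds: seconds * 1_000_000_000)
            throw RefreshError.timedOut("fetch exceeded \(seconds)s")
        }
        // group.next() returns the first completed child (or rethrows its error).
        let result = try await group.next()!
        group.cancelAll()
        return result
    }
}
```

Note that this only guarantees the *caller* resumes within the deadline; a child stuck in non-cooperative blocking work keeps running after `cancelAll()`, which is why the subprocess hardening in item 3 is still needed.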

Before / After

Before

  • Permanent background thread hang if Antigravity stalled, or if network requests were blackholed.

After

  • The system gracefully recovers and logs a warning if any provider takes longer than 30 seconds.
  • Background polling loop cleanly restarts even if a fetch hangs.

Validation

  • Monitored background polling logs to confirm that hanging fetch operations correctly surface SubprocessRunnerError.timedOut.
  • Verified that zombie processes are terminated by the defer cleanup path.

Notes

Fixes #189

…kground usage refresh

This aggregates three related safety valves to address instances of permanent hangs in usage polling:
1. Adds a 60-second global timeout to the background polling loop in `UsageStore.swift`
2. Adds a 30-second per-provider timeout in `UsageStore+Refresh.swift`
3. Hardens `SubprocessRunner.swift` with improved pipe management, task cancellation, and a more aggressive SIGKILL enforcement mechanism to prevent zombie processes. Specifically, it explicitly closes stdout/stderr pipes before awaiting reading tasks so that stray inherited file handles do not block reads indefinitely.

Fixes steipete#189
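The "aggressive SIGKILL enforcement" in point 3 can be sketched as a SIGTERM-then-SIGKILL escalation; the function name and grace period below are illustrative, not the actual SubprocessRunner API:

```swift
import Foundation

// Ask the child to exit with SIGTERM, wait a short grace period, then
// force-kill it with SIGKILL (which cannot be caught or ignored).
func terminateWithEscalation(_ process: Process, gracePeriod: TimeInterval = 2.0) async {
    guard process.isRunning else { return }
    process.terminate() // delivers SIGTERM
    let deadline = Date().addingTimeInterval(gracePeriod)
    while process.isRunning, Date() < deadline {
        try? await Task.sleep(nanoseconds: 100_000_000) // poll every 100 ms
    }
    if process.isRunning {
        kill(process.processIdentifier, SIGKILL) // forced, non-catchable exit
    }
}
```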

@chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37eadee216


Using `try?` on `Task.sleep` swallows the CancellationError. When the task group is cancelled upon a successful provider fetch, the timeout task would continue to the warning path and falsely report a timeout. Using a do-catch block correctly returns early on cancellation, preventing unreliable hang diagnostics.
return nil
}
let first = await group.next()
group.cancelAll()
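The fix Codex suggests can be sketched as follows, with illustrative names (`raceWithTimeout` is not the actual method): catch the `CancellationError` from `Task.sleep` explicitly so that a cancelled timeout task never reaches the timeout path.

```swift
import Foundation

// Returns true only on a genuine timeout. If the work finishes first, the
// sleep is cancelled, the catch branch returns false, and no false timeout
// is reported.
func raceWithTimeout(
    seconds: UInt64,
    work: @escaping @Sendable () async -> Void
) async -> Bool {
    await withTaskGroup(of: Bool.self) { group in
        group.addTask { await work(); return false }
        group.addTask {
            do {
                try await Task.sleep(nanoseconds: seconds * 1_000_000_000)
            } catch {
                return false // sleep cancelled: the work won the race
            }
            return true // full sleep elapsed: a real timeout
        }
        let timedOut = await group.next() ?? false
        group.cancelAll()
        return timedOut
    }
}
```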
Collaborator

Could we double-check whether canceling the task group here guarantees this method returns in ~30s when a provider fetch is stuck in non-cooperative work?

return outcome
} else {
return ProviderFetchOutcome(
result: .failure(SubprocessRunnerError.timedOut("\(provider.rawValue) fetch")),
Collaborator

When we create this timeout failure, how are we distinguishing a true timeout from parent-task cancellation?
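One way to answer this question is to check `Task.isCancelled` before synthesizing the timeout failure, so parent-task cancellation surfaces as a `CancellationError` rather than a spurious hang report. A hedged sketch (`FetchFailure` stands in for SubprocessRunnerError; this is not the PR's actual code):

```swift
import Foundation

enum FetchFailure: Error {
    case timedOut(String)
}

// Decide which error to report when a deadline expires: if the enclosing
// task was cancelled, the refresh as a whole was torn down and we should
// not log a provider hang.
func classifyMissedDeadline(label: String) -> Error {
    if Task.isCancelled {
        return CancellationError()
    }
    return FetchFailure.timedOut(label)
}
```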

throw SubprocessRunnerError.timedOut("global refresh")
}
_ = try await group.next()
group.cancelAll()
Collaborator

Do we know this cancellation path always lets the timer loop move forward, even if refresh work does not respond to cancellation quickly?

// readToEnd() can block indefinitely if the underlying process is dead but the pipe is still "open"
// in a zombie state or if a child process inherited it. Closing the handle explicitly triggers EOF
// in the reading task, allowing stdoutTask.value to complete.
try? stdoutPipe.fileHandleForReading.close()
Collaborator

Is there any chance closing the read handle here could race the reader task and cause us to miss some stdout/stderr data?
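One possible mitigation for this race, sketched with illustrative names (not the PR's actual implementation): give the reader task a short grace window to drain the pipe normally, and force EOF by closing the handle only if it is still blocked afterwards.

```swift
import Foundation

// Wait up to `grace` nanoseconds for the reader task to finish on its own;
// if it is still stuck, close the read handle to trigger EOF, then collect
// whatever (possibly partial) data the reader produced.
func drainOrForceEOF(
    pipe: Pipe,
    reader: Task<Data, Never>,
    grace: UInt64 = 500_000_000 // 0.5 s
) async -> Data {
    let drained = await withTaskGroup(of: Data?.self) { group in
        group.addTask { await reader.value }
        group.addTask {
            try? await Task.sleep(nanoseconds: grace)
            return nil // deadline child: signals the reader is stuck
        }
        let first = await group.next() ?? nil
        group.cancelAll()
        return first
    }
    if let data = drained { return data } // reader finished in time: no data lost
    try? pipe.fileHandleForReading.close() // force EOF on the stuck reader
    return await reader.value
}
```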


Development

Successfully merging this pull request may close these issues.

Antigravity is not refreshing after a few hours
