Codex bridge: thread stuck in waitingOnApproval deadlocks delivery forever (bridge can't answer approval requests; launcher never replaces the hung bridge) #299

@fujikkofujio

Summary

A Codex thread stuck in waitingOnApproval deadlocks the monitor bridge forever: the bridge cannot answer approval requests, tryStartTurn() defers silently while the thread is active, and no watchdog ever fires. On top of that, the launcher's staleness check cannot replace the hung bridge, so everything looks healthy (delivery.sh status reports mode: monitor, watch processes alive) while nothing is delivered for days.

Hit in production on macOS: delivery silently stopped for 3 days (2026-06-30 → 2026-07-03).

  • agmsg: v1.1.2 installed (code paths verified still present on main f665c1c)
  • codex: 0.142.x (Homebrew), app-server over ws://127.0.0.1:<port>

Incident timeline

  1. Bridge (started 06-29) was mid-flight on a busy thread. A bridge-initiated turn (or a subsequent user turn) raised an approval request that nothing ever answered — the bridge has no handler for approval/elicitation events (grep -i approval codex-bridge.js → no hits).
  2. The thread went {"type":"active","activeFlags":["waitingOnApproval"]} and stayed there.
  3. Next wakeup (wakeup 21 in the log) set pendingWake = true, but tryStartTurn() returned silently on the gate (codex-bridge.js:942):
    if (!this.pendingWake || this.turnActive || !this.threadIdle) return;
    The turn watchdog only starts after a bridge turn starts, so nothing ever times out. Last log line for 3 days: codex-bridge: wakeup 21 for dashboard/codex.
  4. A new Codex TUI session days later re-ran the launcher, but codex-bridge-launcher.sh reuses a live bridge whenever its pid is alive and the bound app-server URL matches, so a hung bridge on the same app-server is never torn down. (The #197/#237 guard only covers URL mismatch.)
  5. Even after kill <bridge_pid> (launcher respawned a fresh bridge within a second — that part works great), the fresh bridge resumed the same thread, saw status.type === "active" (codex-bridge.js:785), and deferred the pending wake indefinitely again. The deadlock survives bridge restarts because the stuck state lives in the app-server thread.
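
The deadlock in steps 3 and 5 can be reproduced with a toy model of the gate. This is a minimal sketch, not the bridge's actual code: only the three flags in the quoted condition come from codex-bridge.js, while the BridgeModel class, its method names, the counters, and the "idle" status name are assumptions made for illustration.

```javascript
// Toy model of the gate quoted in step 3. Only pendingWake, turnActive and
// threadIdle come from the issue; the class, its methods, the counters and
// the "idle" status name are hypothetical stand-ins.
class BridgeModel {
  constructor() {
    this.pendingWake = false;
    this.turnActive = false;
    this.threadIdle = true;
    this.turnsStarted = 0;
    this.watchdogArmed = false; // armed only once a bridge turn starts
  }

  onWakeup() {
    this.pendingWake = true;
    this.tryStartTurn();
  }

  onThreadStatus(status) {
    this.threadIdle = status.type === "idle";
    this.tryStartTurn();
  }

  tryStartTurn() {
    // Same condition as the gate in the issue: return silently when blocked.
    if (!this.pendingWake || this.turnActive || !this.threadIdle) return;
    this.pendingWake = false;
    this.turnActive = true;
    this.turnsStarted += 1;
    this.watchdogArmed = true;
  }
}

const bridge = new BridgeModel();
bridge.onThreadStatus({ type: "active", activeFlags: ["waitingOnApproval"] });
bridge.onWakeup();

// Snapshot while the thread is stuck: the wake is pending, no turn started,
// and the watchdog was never armed, so nothing can time out.
const whileStuck = {
  pendingWake: bridge.pendingWake,
  turnsStarted: bridge.turnsStarted,
  watchdogArmed: bridge.watchdogArmed,
};
console.log(whileStuck);

// Once the thread reports idle (as after the manual turn/interrupt), the
// deferred wake is picked up immediately.
bridge.onThreadStatus({ type: "idle" });
console.log(bridge.turnsStarted);
```

In this model a restart is just a new BridgeModel that receives the same "active" status, which is why killing the bridge did not help.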

Also note the fresh bridge attached to the first entry of thread/loaded/list — a 4-day-old thread — not the thread the live TUI was actually showing, so even a successful delivery would have been invisible to the user (related to the #170 discovery heuristic).

Diagnosis (reproducible)

Speak JSON-RPC 2.0 over WebSocket to the app-server (initialize with capabilities.experimentalApi: true, notify initialized, then):

thread/loaded/list                      → ["019f1360-...", "019f1867-...", "019f27a5-..."]
thread/resume {threadId, excludeTurns}  → status: {"type":"active","activeFlags":["waitingOnApproval"]}   ← smoking gun
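
The message sequence above can be sketched as plain objects, ready to be serialized with JSON.stringify and sent over any WebSocket client. Method names, capabilities.experimentalApi, and the status shape are taken from this issue; the envelope fields follow the JSON-RPC 2.0 spec. The helper names, the excludeTurns value, and the "thread-1" id are assumptions, and the app-server may expect a slightly different envelope.

```javascript
// Envelope fields follow the JSON-RPC 2.0 spec; method names and the status
// shape are taken from the issue. Helper names are hypothetical.
let nextId = 1;

function rpcRequest(method, params = {}) {
  return { jsonrpc: "2.0", id: nextId++, method, params };
}

// Notifications carry no id, so no response is expected.
function rpcNotification(method, params = {}) {
  return { jsonrpc: "2.0", method, params };
}

// True only for the "smoking gun" status shown in the diagnosis.
function isStuckOnApproval(status) {
  return (
    status?.type === "active" &&
    Array.isArray(status.activeFlags) &&
    status.activeFlags.includes("waitingOnApproval")
  );
}

function diagnosisMessages(threadId) {
  return [
    rpcRequest("initialize", { capabilities: { experimentalApi: true } }),
    rpcNotification("initialized"),
    rpcRequest("thread/loaded/list"),
    // excludeTurns: true is an assumed value; the issue only names the field.
    rpcRequest("thread/resume", { threadId, excludeTurns: true }),
  ];
}

// "thread-1" is a placeholder id, not one of the ids from the incident.
for (const message of diagnosisMessages("thread-1")) {
  console.log(JSON.stringify(message));
}
```

In practice thread/resume would be sent once per id returned by thread/loaded/list, with isStuckOnApproval applied to each returned status.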

Manual recovery that worked

thread/turns/list {threadId}                 → find turn with status "inProgress"
turn/interrupt   {threadId, turnId}          → thread goes idle within seconds

(turnId is required — turn/interrupt {threadId} alone fails with missing field 'turnId'.)
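
The recovery step can be sketched as a small helper that picks the in-progress turn and builds the turn/interrupt parameters with both required fields. The shape of each turn (an object with id and status) is an assumption about the thread/turns/list result; only the "inProgress" status and the threadId/turnId parameter names come from this issue.

```javascript
// Builds the params for turn/interrupt from a thread/turns/list result.
// Assumption: each turn is an object with `id` and `status` fields.
function buildInterruptParams(threadId, turns) {
  const stuck = turns.find((turn) => turn.status === "inProgress");
  // Nothing to interrupt; the caller should not send a request without a turnId.
  if (!stuck) return null;
  return { threadId, turnId: stuck.id };
}

// Placeholder data, not taken from the incident.
const exampleTurns = [
  { id: "turn-1", status: "completed" },
  { id: "turn-2", status: "inProgress" },
];

console.log(buildInterruptParams("thread-1", exampleTurns));
console.log(buildInterruptParams("thread-1", [{ id: "turn-1", status: "completed" }]));
```

Returning null when no turn is in progress avoids sending the turnId-less request that the app-server rejects.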

The instant the thread went idle, the bridge picked up the pending wake, delivered the queued message, and the agent replied. End-to-end confirmed working again.

Suggested fixes

  1. Handle approval requests on bridge-initiated turns. Subscribe to the approval/elicitation request events and auto-respond (deny by default, configurable), or pass a non-interactive approval policy on turn/start if the API supports one. A headless bridge must never create a prompt only a human can answer.
  2. Pending-wake watchdog. If pendingWake has been deferred for more than N minutes while the thread reports active, re-poll the thread status; on waitingOnApproval with no TUI activity, log loudly and optionally turn/interrupt the stuck turn (opt-in flag).
  3. Launcher liveness beyond URL match. The reuse check should also consider bridge progress (e.g. a heartbeat file/timestamp the bridge touches on every wake/turn event). pid-alive + URL-match kept a 3-day-hung bridge in place.
  4. (Minor, related to #170) Prefer the most-recently-active loaded thread over ids[0] when discovering the TUI thread.
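
Fixes 2 and 3 could be sketched as two pure decision functions, kept free of timers and I/O so they are easy to test. This is only a sketch of one possible shape: every name, the action strings, and the thresholds are hypothetical, and the "no TUI activity" check from fix 2 is left out because the issue does not say how it would be detected.

```javascript
// Fix 2 sketch: decide what a pending-wake watchdog should do.
// `status` is the freshly re-polled thread status; all names are hypothetical.
function watchdogAction({ deferredForMs, thresholdMs, status, allowInterrupt = false }) {
  if (deferredForMs < thresholdMs) return "wait";
  // Thread is no longer active: the deferred wake can be retried.
  if (status?.type !== "active") return "retry";
  const flags = status.activeFlags ?? [];
  // Active but not blocked on approval: a real turn may still be running.
  if (!flags.includes("waitingOnApproval")) return "wait";
  // Stuck on approval: always log loudly, interrupt only when opted in.
  return allowInterrupt ? "interrupt" : "log";
}

// Fix 3 sketch: launcher-side staleness check against a heartbeat timestamp
// that the bridge would touch on every wake/turn event.
function isBridgeStale({ heartbeatMs, nowMs, maxAgeMs }) {
  return nowMs - heartbeatMs > maxAgeMs;
}

const stuckStatus = { type: "active", activeFlags: ["waitingOnApproval"] };
const tenMinutes = 10 * 60 * 1000;

console.log(watchdogAction({ deferredForMs: 60 * 1000, thresholdMs: tenMinutes, status: stuckStatus }));
console.log(watchdogAction({ deferredForMs: 2 * tenMinutes, thresholdMs: tenMinutes, status: stuckStatus }));
console.log(
  watchdogAction({
    deferredForMs: 2 * tenMinutes,
    thresholdMs: tenMinutes,
    status: stuckStatus,
    allowInterrupt: true,
  })
);
console.log(isBridgeStale({ heartbeatMs: 0, nowMs: 3 * 24 * 60 * 60 * 1000, maxAgeMs: tenMinutes }));
```

With a check like isBridgeStale added to the pid-alive and URL-match conditions, a bridge that stopped making progress would be replaced rather than reused. That alone would not clear the stuck thread, which is why the watchdog is still needed.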

Happy to provide the full bridge log excerpts or test a patch.

🤖 Generated with Claude Code
