Summary
A Codex thread stuck in waitingOnApproval deadlocks the monitor bridge forever: the bridge cannot answer approval requests, tryStartTurn() defers silently while the thread is active, and no watchdog ever fires. On top of that, the launcher's staleness check cannot replace the hung bridge, so everything looks healthy (delivery.sh status reports mode: monitor, watch processes alive) while nothing is delivered for days.
Hit in production on macOS: delivery silently stopped for 3 days (2026-06-30 → 2026-07-03).
agmsg: v1.1.2 installed (code paths verified still present on main f665c1c)
codex: 0.142.x (Homebrew), app-server over ws://127.0.0.1:<port>
Incident timeline
Bridge (started 06-29) was mid-flight on a busy thread. A bridge-initiated turn (or a subsequent user turn) raised an approval request that nothing ever answered — the bridge has no handler for approval/elicitation events (grep -i approval codex-bridge.js → no hits).
The thread went {"type":"active","activeFlags":["waitingOnApproval"]} and stayed there.
The next wakeup (wakeup 21 in the log) set pendingWake = true, but tryStartTurn() returned silently at the active-thread gate (codex-bridge.js:942).
The turn watchdog only starts after a bridge turn starts, so nothing ever times out. Last log line for 3 days: codex-bridge: wakeup 21 for dashboard/codex.
Even after kill <bridge_pid> (launcher respawned a fresh bridge within a second — that part works great), the fresh bridge resumed the same thread, saw status.type === "active" (codex-bridge.js:785), and deferred the pending wake indefinitely again. The deadlock survives bridge restarts because the stuck state lives in the app-server thread.
Also note the fresh bridge attached to the first entry of thread/loaded/list — a 4-day-old thread — not the thread the live TUI was actually showing, so even a successful delivery would have been invisible to the user (related to the #170 discovery heuristic).
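To make the deadlock in the timeline concrete, here is a minimal model of the gate. It is a sketch, not the code at codex-bridge.js:942: only pendingWake, tryStartTurn, status.type === "active" and the waitingOnApproval flag come from the report. createBridge, getThreadStatus, turnsStarted, watchdogArmed and the "idle" status value are hypothetical names chosen for illustration.

```javascript
// Hypothetical model of the gate described in the timeline.
// This is NOT the real codex-bridge.js implementation.
function createBridge(getThreadStatus) {
  const state = { pendingWake: false, turnsStarted: 0, watchdogArmed: false };

  function tryStartTurn() {
    if (!state.pendingWake) return;
    const status = getThreadStatus();
    // The gate: while the thread is active, defer silently (no log, no timer).
    if (status.type === "active") return;
    state.pendingWake = false;
    state.turnsStarted += 1;
    // The turn watchdog is armed only once a bridge turn has started.
    state.watchdogArmed = true;
  }

  function onWakeup() {
    state.pendingWake = true;
    tryStartTurn();
  }

  return { state, onWakeup, tryStartTurn };
}

// A thread stuck on an approval prompt never leaves "active".
let status = { type: "active", activeFlags: ["waitingOnApproval"] };
const bridge = createBridge(() => status);

bridge.onWakeup();
for (let i = 0; i < 1000; i++) bridge.tryStartTurn(); // retries change nothing
const stuck = { ...bridge.state };

// Assumed status value once the thread is idle, e.g. after a manual interrupt.
status = { type: "idle" };
bridge.tryStartTurn();
const recovered = { ...bridge.state };
```

In this model the wake stays pending and the watchdog is never armed, however often the gate is retried. Restarting the bridge does not help either, because the status comes from the app-server thread and not from bridge memory.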
Diagnosis (reproducible)
Speak JSON-RPC 2.0 over WebSocket to the app-server (initialize with capabilities.experimentalApi: true, notify initialized, then):
Manual recovery that worked
thread/turns/list {threadId} → find the turn with status "inProgress"
turn/interrupt {threadId, turnId} → thread goes idle within seconds
(turnId is required: turn/interrupt {threadId} alone fails with missing field 'turnId'.)
The instant the thread went idle, the bridge picked up the pending wake, delivered the queued message, and the agent replied. End-to-end confirmed working again.
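The recovery calls above can be sketched as plain message construction. The method names, the threadId and turnId parameters and the "inProgress" status come from this report. Everything else is an assumption for illustration: the helper names, the id field on a turn object, the "completed" status value and the sample ids. The sketch only builds the payloads and leaves the WebSocket transport and the initialize handshake to the caller.

```javascript
// Sketch: build the JSON-RPC messages used for the manual recovery.
// Helper names and the turn object shape ({ id, status }) are assumptions.
let nextId = 1;

function rpcRequest(method, params) {
  // The "jsonrpc" field follows the JSON-RPC 2.0 spec; this sketch does not
  // verify whether the app-server requires it on the wire.
  return { jsonrpc: "2.0", id: nextId++, method, params };
}

function findInProgressTurn(turns) {
  return turns.find((turn) => turn.status === "inProgress") ?? null;
}

function buildInterrupt(threadId, turns) {
  const turn = findInProgressTurn(turns);
  if (!turn) return null; // nothing to interrupt
  // turnId is required: turn/interrupt with only threadId is rejected.
  return rpcRequest("turn/interrupt", { threadId, turnId: turn.id });
}

// Usage with made-up ids and a made-up turn list.
const threadId = "thread-123";
const listRequest = rpcRequest("thread/turns/list", { threadId });
const sampleTurns = [
  { id: "turn-1", status: "completed" },
  { id: "turn-2", status: "inProgress" },
];
const interruptRequest = buildInterrupt(threadId, sampleTurns);
```

Serializing each object with JSON.stringify and sending it over the already-initialized socket would mirror the steps listed above.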
Suggested fixes
Handle approval requests on bridge-initiated turns. Subscribe to the approval/elicitation request events and auto-respond (deny by default, configurable), or pass a non-interactive approval policy on turn/start if the API supports one. A headless bridge must never create a prompt only a human can answer.
Pending-wake watchdog. If pendingWake has been deferred for more than N minutes while the thread reports active, re-poll the thread status; on waitingOnApproval with no TUI activity, log loudly and optionally turn/interrupt the stuck turn (opt-in flag).
Launcher liveness beyond URL match. The reuse check should also consider bridge progress (e.g. a heartbeat file/timestamp the bridge touches on every wake/turn event). pid-alive + URL-match kept a 3-day-hung bridge in place.
Context for the launcher fix: codex-bridge-launcher.sh reuses a live bridge whenever its pid is alive and the bound app-server URL matches, so a hung bridge on the same app-server is never torn down. (The #197/#237 guard only covers URL mismatch.)
[…] ids[0] when discovering the TUI thread.
Happy to provide the full bridge log excerpts or test a patch.
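As a rough illustration of two of the suggested fixes, the pending-wake watchdog and the launcher heartbeat check can each be reduced to a small pure decision function. This is a sketch under assumptions, not a patch. The function names, parameter names and returned action strings are hypothetical, and the "no TUI activity" condition is left out because the report does not say how it would be detected.

```javascript
// Sketch of a pending-wake watchdog decision. All names are hypothetical.
// `status` is the thread status obtained by re-polling the app-server.
function watchdogAction({ pendingSinceMs, nowMs, maxDeferMs, status, allowInterrupt }) {
  if (pendingSinceMs === null) return "none"; // no wake is pending
  if (nowMs - pendingSinceMs < maxDeferMs) return "wait"; // not overdue yet
  const flags = status.activeFlags ?? [];
  const stuckOnApproval =
    status.type === "active" && flags.includes("waitingOnApproval");
  if (!stuckOnApproval) return "repoll"; // overdue, but not the approval case
  // Interrupting the stuck turn is opt-in; otherwise only log loudly.
  return allowInterrupt ? "interrupt" : "log";
}

// Sketch of a launcher-side heartbeat check. lastBeatMs could come from the
// mtime of a file the bridge touches on every wake or turn event.
function isHeartbeatStale(lastBeatMs, nowMs, maxAgeMs) {
  return nowMs - lastBeatMs > maxAgeMs;
}

const MINUTE = 60 * 1000;
const stuckStatus = { type: "active", activeFlags: ["waitingOnApproval"] };

const decision = watchdogAction({
  pendingSinceMs: 0,
  nowMs: 11 * MINUTE,
  maxDeferMs: 10 * MINUTE,
  status: stuckStatus,
  allowInterrupt: false,
});
```

Keeping both checks as pure functions makes them easy to unit test. The bridge or launcher would supply the clock, the re-polled status and the heartbeat timestamp.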
🤖 Generated with Claude Code