fix(wta): log process teardown + agent-pane disconnect; trim log spam#298
Open
vanzue wants to merge 10 commits into
Open
fix(wta): log process teardown + agent-pane disconnect; trim log spam#298vanzue wants to merge 10 commits into
vanzue wants to merge 10 commits into
Conversation
Helper/master deaths were undiagnosable: on a non-graceful teardown the process vanished mid-stream and the non-blocking appender's buffered tail was lost (shutdown_flush never ran), while the C++ pane side stayed silent. The success path was logged exhaustively; the teardown path was not. - logging: install a SetConsoleCtrlHandler that logs which control event (CTRL_CLOSE/LOGOFF/SHUTDOWN/BREAK) tore the process down and drains the appender so final records (incl. the transport-lost WARN) reach disk. Returns FALSE -- termination behavior unchanged. Wired into main() so it covers both helper and master. (A hard taskkill /F sends no control event and stays untraceable by design.) - Tab.cpp: _agentPaneLog when the agent pane's connection closes, so a dead helper is recorded on the C++ side and correlates to the millisecond with the Rust logs -- fires even when the master is gone and cannot emit restart_agent_pane. - trim per-token-chunk / per-VT-sequence logs (master forwarding x2, acp session_notification, autofix WtEvent/classified/cross-tab-drop) from debug to trace; keep one per-event breadcrumb at debug and all connection/lifecycle logs intact. cargo +stable check passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves diagnosability of Windows Terminal Agent (WTA) failures by ensuring teardown paths emit logs and by reducing high-volume success-path debug logging that can drown out lifecycle signals. It also adds a C++-side log record when the agent pane’s backend disconnects to correlate UI-side behavior with Rust-side logs.
Changes:
- Add a Windows console control handler in WTA to log teardown signals and flush buffered logs before termination.
- Log agent-pane backend disconnects from the TerminalApp
Tabto make UI-side disconnects visible. - Reduce log volume by demoting per-chunk/per-sequence debug logs to trace across ACP and autofix paths.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/wta/src/protocol/acp/client.rs | Demotes per-streamed-chunk ACP session_notification logging to trace to reduce debug noise. |
| tools/wta/src/master/mod.rs | Demotes per-chunk session notification routing logs to trace to reduce master log volume. |
| tools/wta/src/main.rs | Installs the new teardown control handler immediately after logging init. |
| tools/wta/src/logging.rs | Adds install_ctrl_handler() to log/flush on console teardown control events. |
| tools/wta/src/app.rs | Demotes high-volume autofix event/classification logs to trace. |
| tools/wta/Cargo.toml | Enables windows-sys console APIs via Win32_System_Console. |
| src/cascadia/TerminalApp/Tab.cpp | Logs when the agent pane’s connection is closed to surface backend death on the C++ side. |
This comment has been minimized.
This comment has been minimized.
…ail, spell - Tab.cpp: edge-trigger the agent-pane disconnect log on the open→closed transition (new `_agentPaneConnectionClosed` guard on Tab). The method runs on every pane connection-state change and on focus changes, so the unconditional log would spam terminal-agent-pane.log after a helper dies. - logging.rs: SetConsoleCtrlHandler install failure now logs at warn, not debug, so release (info) logs explain why teardown signals are missing rather than leaving it silent — the whole point is diagnosability. - spelling: allow the Win32 token PHANDLER (PHANDLER_ROUTINE) in apis.txt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
The noun "diagnosability" isn't in cspell's base dictionary (though "diagnosable" is); reword the one occurrence to "diagnostic" to keep the spell-check gate green without a dictionary entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per PR review: SetConsoleCtrlHandler failure now logs GetLastError() in an `error_code` field, captured immediately before any other call can reset thread-last-error, so the warning distinguishes likely causes (no console attached vs. permission failure) instead of just stating it failed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
These five info! sites fired per-event, not per-lifecycle, and dominated the info logs (~2284 lines in a sample run). Info should be once-per- lifecycle / rare connection events; per-event detail belongs at debug (still available via WTA_LOG=debug). No information removed. - agent_sessions.rs "alive snapshot upgraded Historical row → Live" (1724x) - master/mod.rs "received helper session hook" (157x) - master/mod.rs "handling intellterm.wta/session_hook locally" (157x) - app.rs (agent_route) "applied" (127x) and "routing" (119x) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…es to debug Refines the earlier wholesale demotion of "received helper session hook": terminal/lifecycle transitions (SessionStarted/SessionStopped/PaneClosed/ ConnectionFailed) stay at info so field diagnosis of session-state bugs works from the default release log, while the high-frequency routine events (ToolStarting/ToolCompleted/Notification/resume bookkeeping) go to debug. The generic master-side dispatch breadcrumb stays at debug (it carries no event detail). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per PR review: only the helper is a ConPTY child of Windows Terminal and receives CTRL_CLOSE on pane/tab/window close. The master is spawned CREATE_NO_WINDOW and contained in a Job Object with KILL_ON_JOB_CLOSE (C++ SharedWta), so its routine teardown is a job reap — like TerminateProcess, no control event — and this handler does not make it traceable. Tighten the doc comment and the main.rs call-site comment to say so instead of implying both processes always observe these events. No behavior change (comment-only). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fills the two diagnosability gaps left after the console-ctrl-handler:
1. Rust panic hook (logging.rs): panics wrote only to stderr — invisible
for a ConPTY helper / CREATE_NO_WINDOW master — and the appender's
buffered tail was lost on a fatal panic. install_panic_hook() logs the
panic (message + location + thread) AND synchronously appends to
wta-panic.log (independent of the async appender, so it survives a fatal
panic), then chains the previous hook. Does NOT call shutdown_flush (that
would kill logging after a recoverable catch_unwind).
2. C++ master-death observability (SharedWta.cpp): the master is job-killed
(KILL_ON_JOB_CLOSE) and can't log its own hard death. Log from the parent
that observes it:
- _OnProcessExited: the wait callback fires for unexpected master exit
(crash/OOM/taskkill /F) — the external observer that makes silent
master deaths diagnosable. Deliberate teardowns never reach it.
- _CleanupLocked: record the deliberate KILL_ON_JOB_CLOSE teardown.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- logging.rs: treat SetConsoleCtrlHandler's ERROR_INVALID_HANDLE (no console, expected for the CREATE_NO_WINDOW master / detached CLI) as debug; keep warn only for unexpected failures, so the warn can't reintroduce startup spam. - master/mod.rs: match the session-hook level decision on &event (borrow, not consume) — clearer intent; event is reused by the reducer below. - Tab.cpp: only walk the tree for the agent pane when something is closed, avoiding a second traversal on the common all-healthy path (still re-arms the edge-trigger guard correctly when nothing is closed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines
79
to
99
| # Workaround for the archival of check-spelling/check-spelling (archived 2026-06-16). | ||
| # On archival the maintainer flipped the action's runtime security poll (secpoll) | ||
| # to return the advisory 'Sorry' for v0.0.26 -- the final release -- which fatally | ||
| # to return a security advisory for v0.0.26 -- the final release -- which fatally | ||
| # cancels the workflow (exit 200) on every repo pinned to it. There is no fixed | ||
| # version to upgrade to. The action exposes an `ignore_security_advisory` input, | ||
| # but v0.0.26 never maps it into the secpoll step's environment, so passing it via | ||
| # `with:` is silently dropped. secpoll reads the INPUT_IGNORE_SECURITY_ADVISORY env | ||
| # var directly, so we set it here (composite action steps inherit workflow-level | ||
| # env). The value must exactly match the advisory text 'Sorry'. This keeps the | ||
| # (still-executable) archived action running. Caveat: if check-spelling.dev is ever | ||
| # taken down, secpoll's lookup returns empty/NXDOMAIN and this var being set will | ||
| # instead make secpoll fail ("...but there is no security advisory") -- at which | ||
| # env). secpoll requires this value to EXACTLY equal the current advisory text | ||
| # (the DNS TXT record at 26.0.0.security-status.secpoll.check-spelling.dev, minus | ||
| # its leading "3 " status prefix); any mismatch re-triggers the fatal exit 200 with | ||
| # "(ignore_security_advisory did not match)". The maintainer has already changed the | ||
| # text once (from "Sorry" to the value below), so if spelling starts failing fast | ||
| # (~6s) again, re-read that TXT record and update this string to match. This keeps | ||
| # the (still-executable) archived action running. Caveat: if check-spelling.dev is | ||
| # ever taken down, secpoll's lookup returns empty/NXDOMAIN and this var being set | ||
| # will instead make secpoll fail ("...but there is no security advisory") -- at which | ||
| # point this workflow must be reworked (remove the checker or pin a maintained fork). | ||
| env: | ||
| INPUT_IGNORE_SECURITY_ADVISORY: Sorry | ||
| INPUT_IGNORE_SECURITY_ADVISORY: "Sorry. https://github.com/jsoref/2026-06-16-credential-leak/blob/main/README.md" | ||
|
|
f80858e to
92df298
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
A reported "helper just stopped responding" incident turned out to be undiagnosable from the logs — not because the success path was under-logged (it was logged exhaustively), but because the teardown path was silent:
shutdown_flushonly runs on clean exits; theWorkerGuardlives in astaticthat neverDrops)._FocusAgentPane called.debug!lines flooded the logs (the master log was 631 KB ofdelivered=true), crowding out signal and pushing useful records out of the buffer faster.What
1. Rust — record why/when a process is torn down, and flush before dying
logging::install_ctrl_handler()— aSetConsoleCtrlHandlerthat logs which control event (CTRL_CLOSE/CTRL_LOGOFF/CTRL_SHUTDOWN/CTRL_BREAK) tore the process down and callsshutdown_flush()so the final records (incl. the existing transport-lostWARN) reach disk. ReturnsFALSE→ default handler still runs, termination behavior unchanged.main()right afterlogging::init, so it covers both helper and master.taskkill /F/ Task Manager "End task" delivers no control event and stays untraceable from inside the process.2. C++ — pane no longer blind to backend death
Tab::_UpdateConnectionClosedState()now_agentPaneLog(...)s when the agent pane's connection closes — correlating the UI-side disconnect to the millisecond with the Rust logs, and firing even when the master is gone and can't emitrestart_agent_pane.3. Success-path cleanup (
debug→trace)session_notification, autofixWtEvent/classified/cross-tab-drop). Kept one per-event breadcrumb atdebugand all connection/lifecycle logs intact.Verification
cargo +stable checkpasses; no warnings from the changed lines.wta-main_helper-*.log:WARN lifecycle: console control event received … ctrl_type=2 event="CTRL_CLOSE"— and it is the last line in the file, proving the pre-deathshutdown_flush()landed the buffered tail (the original blind spot).terminal-agent-pane.log:releasing wta-master pid=… (deliberate teardown via KILL_ON_JOB_CLOSE); no_UpdateConnectionClosedStatespam.🤖 Generated with Claude Code