Interactive sessions orphan browser-tool Chrome (+ Python eval kernel) on exit — no teardown in dispose() and no SIGTERM/SIGINT handler #698

@devswha

Description

Interactive gjc sessions do not tear down the subprocesses their tools spawn when the session ends. Specifically, the browser tool's headless Chrome (and, on signal kills, the Python eval kernel) is left running and reparented to init (PID 1). On a long-lived workstation these orphaned Chrome/Python processes accumulate, run indefinitely, and add to system load.

Three independent gaps combine:

  1. The browser tool closes Chrome only on an explicit tab/browser close action. There is no idle teardown and no session-end teardown.
  2. AgentSession.dispose() has no browser teardown at all. It disposes async jobs, MCP, LSP, the Python kernel, and provider sessions — but never releases browser handles. So even a graceful quit leaks any open headless Chrome.
  3. The interactive entry installs no SIGTERM/SIGINT/exit handler. An external kill, a cleanup sweep, OOM, or terminal close kills the agent instantly and bypasses dispose() entirely, orphaning Chrome and the Python kernel.
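A minimal sketch of what closing gap 3 could look like, assuming the interactive entry can reach the same teardown that the normal quit path runs. `installShutdownHandlers` and its options are hypothetical names, not gjc APIs; `dispose` stands in for `session.dispose()`.

```javascript
// Hypothetical helper (not part of gjc): run an async teardown on termination
// signals, but never wait longer than timeoutMs before exiting.
const SIGNAL_EXIT_CODES = { SIGHUP: 129, SIGINT: 130, SIGTERM: 143 }; // 128 + signal number

function installShutdownHandlers(
  dispose,
  { timeoutMs = 3000, exit = (code) => process.exit(code), proc = process } = {},
) {
  let shuttingDown = false;

  const onSignal = async (signal) => {
    if (shuttingDown) return; // a second signal must not start a second teardown
    shuttingDown = true;

    let timer;
    const timeout = new Promise((resolve) => {
      timer = setTimeout(resolve, timeoutMs);
    });

    try {
      // Whichever settles first wins: the teardown or the time box.
      await Promise.race([Promise.resolve().then(dispose), timeout]);
    } catch (err) {
      console.error(`teardown failed during ${signal}:`, err);
    } finally {
      clearTimeout(timer);
      exit(SIGNAL_EXIT_CODES[signal] ?? 1);
    }
  };

  for (const signal of Object.keys(SIGNAL_EXIT_CODES)) {
    proc.on(signal, () => onSignal(signal));
  }
  return onSignal;
}

// Usage (assumption: `session` is the interactive AgentSession):
// installShutdownHandlers(() => session.dispose(), { timeoutMs: 3000 });
```

Two limits apply regardless of the design: SIGKILL (which is what the kernel OOM killer sends) cannot be caught, and Node's `'exit'` listeners may only do synchronous work, so an async `dispose()` cannot complete there. Those cases need a complementary mechanism such as killing by process group.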

Observed in the wild: a 26h session (bun .../gjc, RSS 1.28 GB, wchan=ep_poll) holding a 22h-old headless Chrome, an 18h-old Python runner, 2 unreaped node zombies, plus 30 leaked puppeteer_dev_chrome_profile-* dirs (361 MB) in /tmp. kill -TERM <pid> killed the agent instantly and left Chrome running as a ppid=1 orphan.
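To size the `/tmp` leak on another machine, something like the following could list candidate profile directories. This is a sketch under assumptions: `findStaleProfileDirs` is a hypothetical name, the one-hour default threshold is arbitrary, and an old mtime does not prove a directory is unused (a long-running Chrome may still hold it), so the function only reports and never deletes.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: list puppeteer profile directories older than minAgeMs.
// The prefix matches the directory names seen in the report above.
function findStaleProfileDirs(tmpDir = os.tmpdir(), minAgeMs = 60 * 60 * 1000, now = Date.now()) {
  const prefix = "puppeteer_dev_chrome_profile-";
  const stale = [];
  for (const entry of fs.readdirSync(tmpDir, { withFileTypes: true })) {
    if (!entry.isDirectory() || !entry.name.startsWith(prefix)) continue;
    const fullPath = path.join(tmpDir, entry.name);
    try {
      const { mtimeMs } = fs.statSync(fullPath);
      if (now - mtimeMs >= minAgeMs) stale.push(fullPath);
    } catch (err) {
      if (err.code !== "ENOENT") throw err; // directory vanished between readdir and stat
    }
  }
  return stale.sort();
}

// Usage: console.log(findStaleProfileDirs("/tmp"));
```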

Steps to Reproduce

  1. Start an interactive gjc session; note its bun PID G.
  2. Use the browser tool to open a headless tab — do not close it.
  3. ps --ppid G → confirm a headless Chrome child of G.
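  4. End the session two ways, checking each time whether the Chrome from step 3 is still running with ppid=1:
     a. Quit gracefully through the normal quit path.
     b. Run kill -TERM G from another shell.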
  5. Repeat 4b with an eval Python cell open to see the runner orphaned too.
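The `ps` checks in the steps above can be scripted. The sketch below parses `ps -eo pid,ppid,args` output and reports headless Chrome processes whose parent is PID 1. `findOrphanedHeadlessChrome` is a hypothetical name, and the PID 1 test is an assumption based on this report: on systems where a child subreaper is active, orphans may be reparented to a different process.

```javascript
// Hypothetical helper: find headless Chrome processes reparented to init.
// Expects the text produced by `ps -eo pid,ppid,args`.
function findOrphanedHeadlessChrome(psOutput) {
  const orphans = [];
  for (const line of psOutput.split("\n")) {
    const match = /^\s*(\d+)\s+(\d+)\s+(.+)$/.exec(line);
    if (!match) continue; // header line or blank line
    const [, pid, ppid, args] = match;
    if (Number(ppid) === 1 && /chrome/i.test(args) && args.includes("--headless")) {
      orphans.push({ pid: Number(pid), args });
    }
  }
  return orphans;
}

// Usage:
// const { execFileSync } = require("node:child_process");
// const out = execFileSync("ps", ["-eo", "pid,ppid,args"], { encoding: "utf8" });
// console.log(findOrphanedHeadlessChrome(out));
```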

Zombie sub-symptom (mechanism confirmed under bun 1.3.14): bun reaps children on the event loop, so a wedged/loop-blocked session stops reaping and exited children pile up as Z:

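// Run with: bun /tmp/zblock.js  (watch from another shell)
const { spawn } = require("node:child_process");
const fs = require("node:fs");
// The child exits after 300 ms, well before the parent's loop unblocks.
const c = spawn("node", ["-e", "setTimeout(() => process.exit(0), 300)"], { stdio: "ignore" });
// Record both PIDs so the other shell can inspect them with ps.
fs.writeFileSync("/tmp/zblock-pid.txt", JSON.stringify({ bun: process.pid, child: c.pid }));
// Busy-wait for 4.5 s: the event loop is blocked, so the exited child cannot be reaped.
const t = Date.now();
while (Date.now() - t < 4500) Math.sqrt(Math.random());
process.exit(0);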

At t≈2s the child shows STAT=Z (parent = bun); after the loop unblocks it is reaped. Negative controls (Bun.spawn/child_process with no-await / unref / detached / dropped-ref+gc) all reap cleanly with no zombie. So the zombies are a symptom of a wedged session, not a spawn-pattern bug.
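For watching the child from the other shell without eyeballing `ps`, a small Linux-only check against `/proc/<pid>/stat` is one option. This is a sketch: `parseProcStat` and `isZombie` are hypothetical names, and the parsing relies on the documented layout of that file (`pid (comm) state ppid ...`), splitting on the last `)` because the command name may itself contain spaces or parentheses.

```javascript
const fs = require("node:fs");

// Parse the leading fields of a /proc/<pid>/stat line.
function parseProcStat(statLine) {
  const open = statLine.indexOf("(");
  const close = statLine.lastIndexOf(")");
  if (open === -1 || close < open) throw new Error("unexpected /proc stat format");
  const rest = statLine.slice(close + 2).split(" "); // fields after "(comm) "
  return {
    pid: Number(statLine.slice(0, open).trim()),
    comm: statLine.slice(open + 1, close),
    state: rest[0], // "Z" means zombie (exited but not yet reaped)
    ppid: Number(rest[1]),
  };
}

// Returns true only while the process exists and is in state Z.
function isZombie(pid) {
  try {
    return parseProcStat(fs.readFileSync(`/proc/${pid}/stat`, "utf8")).state === "Z";
  } catch (err) {
    if (err.code === "ENOENT" || err.code === "ESRCH") return false; // already reaped or no /proc
    throw err;
  }
}

// Usage, with the PIDs written by zblock.js:
// const { child } = JSON.parse(fs.readFileSync("/tmp/zblock-pid.txt", "utf8"));
// setInterval(() => console.log(new Date().toISOString(), isZombie(child)), 500);
```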

Expected Behavior

When a session ends — gracefully or via signal — every subprocess the session spawned (headless/spawned Chrome from the browser tool, the Python eval kernel) should be torn down, not reparented to init. Killing or closing a session should not leave Chrome/Python running.

Root cause (code, v0.5.1)

Paths relative to packages/coding-agent/src.

  • Browser closed only on explicit release. tools/browser/registry.ts:165-194 disposeBrowserHandle() → headless does await handle.browser.close() (:169). Callers: registry.ts:53 (only when replacing a different-kind browser), tools/browser/tab-supervisor.ts:243,379 (explicit releaseTab at refCount 0), tab-supervisor.ts:133,144 (worker-init failure). No closeAllBrowsers export and no process-exit hook in registry.ts.
  • AgentSession.dispose() omits browsers. session/agent-session.ts (≈3163-3207): tears down post-prompt tasks, async jobs, MCPManager.disconnectAll() (:3182), shutdownAllLspClients() (:3187), Python kernel (:3188, :3194 disposeKernelSessionsByOwner), power assertion, sessionManager.close(), provider sessions, hindsight — no releaseBrowser/disposeBrowserHandle/registry call anywhere.
  • No interactive signal handler. process.on/once('SIGTERM'|'SIGINT'|'SIGHUP'|'exit'|'beforeExit') exists only in non-interactive subcommands: cli/auth-broker-cli.ts:146-147, cli/auth-gateway-cli.ts:214-215, cli/shell-cli.ts:97 (SIGINT only), cli/stats-cli.ts:158 (SIGINT only), modes/rpc/rpc-mode.ts:672-673. The interactive session that runs browser/eval has none. session.dispose() is only reached on the normal quit path (main.ts:1035, then postmortem.quit(0)).
  • Python kernel is covered on graceful dispose (disposeKernelSessionsByOwner, agent-session.ts:3194) but not on signal (no handler) → orphaned on external kill, matching the observed 18h runner.
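To make the missing `dispose()` step concrete, here is a self-contained sketch of owner-aware browser teardown. It does not reproduce the real `registry.ts` data structures: `BrowserRegistrySketch`, `register`, and `releaseAllForOwner` are hypothetical, and the only call taken from the source is `handle.browser.close()`. The owner set is a stand-in for the refCount/holdBrowser bookkeeping, so a browser shared with a subagent session is closed only when its last owner releases it.

```javascript
// Hypothetical registry: tracks which sessions own each browser handle.
class BrowserRegistrySketch {
  constructor() {
    this.handles = new Map(); // key -> { browser, owners: Set<string> }
  }

  // Record that ownerId uses the browser stored under key.
  register(key, ownerId, browser) {
    const handle = this.handles.get(key) ?? { browser, owners: new Set() };
    handle.owners.add(ownerId);
    this.handles.set(key, handle);
    return handle.browser;
  }

  // Drop ownerId everywhere and close browsers that no session owns any more.
  // Returns the number of close() calls that failed.
  async releaseAllForOwner(ownerId) {
    const closing = [];
    for (const [key, handle] of this.handles) {
      if (!handle.owners.delete(ownerId)) continue; // not owned by this session
      if (handle.owners.size > 0) continue; // still shared with another session
      this.handles.delete(key);
      // Wrap in a promise chain so a synchronous throw cannot abort the loop.
      closing.push(Promise.resolve().then(() => handle.browser.close()));
    }
    const results = await Promise.allSettled(closing);
    return results.filter((r) => r.status === "rejected").length;
  }
}

// Usage inside a dispose()-like method (assumption: `sessionId` identifies the owner):
// await registry.releaseAllForOwner(sessionId);
```

Using `Promise.allSettled` means one browser that fails to close does not prevent the others from being closed, which matters when teardown runs under a time box.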

Error Output

# stale session 798818: bun gjc, 26h, RSS 1.28GB, wchan=ep_poll
$ ps --ppid 798818 -o pid,stat,etime,args
   PID STAT     ELAPSED COMMAND
1989217 Ssl    22:39:24 /opt/google/chrome/chrome --headless ... --user-data-dir=/tmp/puppeteer_dev_chrome_profile-m7dOHm
3308437 Sl+    18:35:41 python3 /tmp/gjc-python-runner/runner-*.py
  99545 Zs            - [node] <defunct>
3203093 Zs            - [node] <defunct>

$ kill -TERM 798818        # agent dies instantly
$ ps -o pid,ppid,etime,args -p 1989217
   PID    PPID     ELAPSED COMMAND
1989217     1    22:39:47 /opt/google/chrome/chrome --headless ...   # reparented to init, still running

Additional context

  • Suggested fix direction (not prescriptive): add browser-registry teardown to AgentSession.dispose() (release/close all session-owned handles, disposeBrowserHandle(..., { kill: true })); install a top-level, time-boxed SIGTERM/SIGINT/exit handler in the interactive entry that runs the same dispose. Consider launching Chrome in its own process group and killTree on teardown — the helper already exists at autoresearch/helpers.ts:104. Mind browser refCount/holdBrowser ownership when subagent sessions share a browser, and the existing RPC-mode signal handling + postmortem.quit path.
  • Open question (unproven): what wedged session 798818 to RSS 1.28 GB / ep_poll for 26h (slow leak vs stuck subsystem vs legitimate growth)? Needs a heap snapshot / long-session repro; not asserting a leak.
  • Happy to send a focused PR (start with browser teardown in dispose(), then the interactive signal handler) targeting the dev branch once you confirm the teardown ownership model.
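The process-group idea from the first bullet could look roughly like this. It is a sketch, not the existing killTree helper (whose signature is not shown here): `spawnInOwnGroup` and `killProcessGroup` are hypothetical names. It relies on documented Node behaviour on POSIX, where `detached: true` makes the child the leader of a new process group and `process.kill()` with a negative PID signals the whole group; that bun behaves identically is an assumption.

```javascript
const { spawn } = require("node:child_process");

// Start a command as the leader of its own process group (POSIX only).
function spawnInOwnGroup(command, args) {
  return spawn(command, args, { detached: true, stdio: "ignore" });
}

// Send SIGTERM to the whole group, then SIGKILL if the leader is still
// alive after graceMs. Returns false if the group was already gone.
async function killProcessGroup(child, graceMs = 2000) {
  if (child.pid === undefined) return false; // spawn failed, nothing to kill

  const signalGroup = (signal) => {
    try {
      process.kill(-child.pid, signal); // negative pid targets the process group
      return true;
    } catch (err) {
      if (err.code === "ESRCH") return false; // no such group
      throw err;
    }
  };

  const exited = new Promise((resolve) => {
    if (child.exitCode !== null || child.signalCode !== null) resolve("exited");
    else child.once("exit", () => resolve("exited"));
  });

  if (!signalGroup("SIGTERM")) return false;

  let timer;
  const grace = new Promise((resolve) => {
    timer = setTimeout(() => resolve("timeout"), graceMs);
  });
  const outcome = await Promise.race([exited, grace]);
  clearTimeout(timer);

  if (outcome === "timeout") signalGroup("SIGKILL"); // escalate for the whole group
  return true;
}

// Usage (assumption: chromePath and chromeArgs come from the browser tool):
// const chrome = spawnInOwnGroup(chromePath, chromeArgs);
// await killProcessGroup(chrome);
```

The sketch only waits on the group leader. Helpers that outlive the leader would need either the SIGKILL sweep to run unconditionally or a process-tree walk.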

Environment

  • Platform: Linux (Ubuntu 22.04, x64)
  • gjc version: 0.5.1 (= current npm latest)
  • Bun version: 1.3.14
  • Provider: not provider-specific
