fix(server): ignore SIGPIPE + diag shim + fix(updates): outlast cgroup-kill window in post-update restart#3407
A single broken-pipe write (browser closes the connection, mobile backgrounds, /api/updates/check times out, etc.) terminates the entire WebUI process via the default SIGPIPE -> Term action. No exception is raised, no log is written, /health just goes dark. Reproduced in production on 2026-06-02: 2m41s after a clean start, while serving a 20.5s /api/updates/check request, the server died on SIGPIPE. api/diag_shim.py captured the full marker to /tmp/hermes-webui-shim/: pid 6706, signal 13, 5 active request threads, one mid-do_GET on server.py:311. The log just stops. With SIG_IGN, the kernel returns EPIPE to the offending send() (Python raises BrokenPipeError). Per-request handlers can let it propagate or catch it; the server keeps serving the rest of the clients. Set at import time so the disposition is in effect before any ThreadingHTTPServer worker thread writes its first response.
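The behaviour described above can be reproduced without a network client. The sketch below is an illustration, not the project's `server.py`; the helper name `write_to_closed_pipe` is hypothetical. It sets the disposition at import time and then writes to a pipe whose reader has gone away.

```python
import os
import signal

# Ignore SIGPIPE process-wide, at import time. A write to a closed peer then
# fails with EPIPE, which Python raises as BrokenPipeError in the writer only.
signal.signal(signal.SIGPIPE, signal.SIG_IGN)


def write_to_closed_pipe() -> str:
    """Write to a pipe whose read end is closed and report the outcome."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # simulate the client dropping the connection
    try:
        os.write(write_fd, b"response body")
        return "written"
    except BrokenPipeError:
        # Only this "request" fails; the process keeps running.
        return "BrokenPipeError"
    finally:
        os.close(write_fd)
```

In stock CPython builds the interpreter already ignores SIGPIPE at startup, so an explicit call like this is chiefly a guard against the disposition having been changed by a launcher or by another signal handler.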
…otection

The shim's _signal_handler was generic across all catchable signals: write the marker, restore SIG_DFL, re-raise. That's correct for SIGTERM/SIGINT/SIGQUIT (the process should die after we know why), but wrong for SIGPIPE. server.py sets SIG_IGN on SIGPIPE at module import time so a dropped client surfaces as BrokenPipeError on that one request instead of killing the whole server. The shim's re-raise path overrode that protection: it would restore SIG_DFL on SIGPIPE, re-raise, and the process died anyway.

Caught in production on 2026-06-02 14:03 UTC: PID 12029 (running with the SIGPIPE fix already in place) still died on SIGPIPE, with the marker showing the shim's handler had fired and re-raised the default action.

Special-case SIGPIPE in the shim: write the marker (so we still get the forensic capture), set SIGPIPE back to SIG_IGN, and RETURN. The request thread's send() has already returned EPIPE, so the handler there can clean up normally. Subsequent SIGPIPEs in the same process are also silently ignored.
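A minimal sketch of the special case described in this commit message, assuming a much smaller marker than the real shim writes. `MARKER_DIR`, `write_marker`, and the marker file name are hypothetical stand-ins, not the contents of `api/diag_shim.py`.

```python
import json
import os
import signal
import tempfile
import time

# Hypothetical marker location for this sketch only.
MARKER_DIR = os.path.join(tempfile.gettempdir(), "hermes-webui-shim-sketch")


def write_marker(signum: int) -> str:
    """Persist a small JSON marker describing the signal, then fsync it."""
    os.makedirs(MARKER_DIR, exist_ok=True)
    path = os.path.join(MARKER_DIR, f"{os.getpid()}-signal-{int(signum)}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"pid": os.getpid(), "signal": int(signum), "time": time.time()}, fh)
        fh.flush()
        os.fsync(fh.fileno())
    return path


def signal_handler(signum, frame):
    """Record the signal, then either swallow it (SIGPIPE) or die from it."""
    try:
        write_marker(signum)
    except OSError:
        pass  # a failed marker write must never change signal behaviour
    if signum == signal.SIGPIPE:
        # Keep the server alive: go back to ignoring SIGPIPE and return.
        signal.signal(signal.SIGPIPE, signal.SIG_IGN)
        return
    # Every other signal: restore the default action and re-raise it.
    signal.signal(signum, signal.SIG_DFL)
    os.kill(os.getpid(), signum)
```

Calling `signal_handler(signal.SIGPIPE, None)` directly is enough to check the SIGPIPE branch: the marker is written, the disposition ends up as `SIG_IGN`, and the call returns.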
- start-webui.sh: minimal launcher for manual start (sets agent dir, host, python, redirects logs to /tmp/hermes-webui.log)
- watchdog-loop.sh: setsid-detached 5-min /health poll that calls ctl.sh start on failure. Robustness improvement over the cron-based watchdog: it runs in its own process group, so it survives terminal session termination on Termux+PRoot
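The watchdog itself is a shell script that is not shown here. As a rough illustration of the poll-then-restart loop it describes, here is a Python sketch; the function names, the injected `restart` callback, and the `max_ticks`/`sleep`/`probe` parameters are assumptions made for testability, not part of `watchdog-loop.sh`.

```python
import http.client
import time
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def watchdog_loop(url, restart, interval_s=300.0, max_ticks=None,
                  sleep=time.sleep, probe=is_healthy):
    """Poll `url` every `interval_s` seconds and call `restart()` on failure."""
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        if not probe(url):
            restart()  # in the scripts above this would be `ctl.sh start`
        ticks += 1
        sleep(interval_s)
```

With `max_ticks=None` the loop runs forever, which matches a detached watchdog; the bounded form exists only so the loop can be exercised in a test.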
Two complementary diagnostics for the silent post-update SIGKILL
(see silent-sigkill-diagnosis reference in the hermes-webui-self-
update-bug skill):
1. pre-execv marker. As the first action of _schedule_restart()'s
body (inside the _apply_lock block, before _wait_until_restart_safe),
write /tmp/hermes-webui-shim/<pid>-000-pre-execv.json with
per-file fsync. The presence of this marker + the absence of an
install.json from a fresh PID is the canonical evidence that the
kill happened in the kernel between execv() and the new process's
first Python instruction. Pairs with the first-line marker at
the top of server.py for a 3-state decision table:
    pre + first-line + install present, no further markers
        -> kill is post-shim-load (the original mystery)
    pre + first-line, no install
        -> kill is in Python startup (import error etc.)
    pre only
        -> kill is in execve() / dynamic loader
Wrapped in try/except so a marker write failure cannot prevent
the restart itself.
2. env-gated strace-through-execv. When HERMES_WEBUI_STRACE_EXECV=1
is set in the ctl.sh start environment, route execv through
strace with -f -ttt -T -s 256. Captures every syscall of the
new process from its very first instruction. Useful as a
fallback diagnostic if the marker-based approach narrows the
kill to a window that needs syscall-level detail (rare). Off
by default; the env var is set in .env during diagnosis and
removed afterward. The strace log lands next to the markers
in /tmp/hermes-webui-shim/.
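A minimal sketch of the marker write in item 1, assuming a reduced payload. The function name, the `reason` field, and the default directory are illustrative; only the file name pattern `<pid>-000-pre-execv.json`, the fsync, and the try/except wrapping come from the description above.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# Hypothetical default; the commit writes to /tmp/hermes-webui-shim/.
DEFAULT_MARKER_DIR = os.path.join(tempfile.gettempdir(), "hermes-webui-shim-sketch")


def write_pre_execv_marker(marker_dir=DEFAULT_MARKER_DIR, reason="restart"):
    """Write <pid>-000-pre-execv.json and fsync it; return the path or None."""
    try:
        os.makedirs(marker_dir, exist_ok=True)
        path = os.path.join(marker_dir, f"{os.getpid()}-000-pre-execv.json")
        payload = {
            "pid": os.getpid(),
            "ts": datetime.now(timezone.utc).isoformat(),
            "reason": reason,
        }
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())  # make the marker survive an abrupt kill
        return path
    except OSError:
        # A marker failure must never block the restart itself.
        return None
```

Returning `None` rather than raising keeps the caller's restart path unconditional, which is the property the commit message asks for.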
Also promotes `import sys` and `from datetime import ...` to module
level (they were lazy-imported inside _schedule_restart before).
Cheap; opens the door for other diagnostics in this file that
need sys/datetime without having to late-import.
Verified: `python3 -c "from api import updates"` imports cleanly;
`hasattr(updates, "_write_pre_execv_marker")` is True.
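For the env-gated strace path, one way to structure it is to build the argv separately from the exec call. This is a sketch under assumptions: `build_restart_argv` is a hypothetical helper, and the `-o` output flag is added here to send the trace to a log file (the commit message names `-f -ttt -T -s 256` but does not say how the log file is selected).

```python
import os
import shutil


def build_restart_argv(argv, log_path, environ=None):
    """Return argv unchanged, or wrapped in strace when the env gate is set."""
    env = os.environ if environ is None else environ
    if env.get("HERMES_WEBUI_STRACE_EXECV") != "1":
        return list(argv)
    strace = shutil.which("strace")
    if strace is None:
        # Gate is on but strace is not installed: fall back to a plain restart.
        return list(argv)
    # -f follow children, -ttt epoch timestamps, -T time spent per syscall,
    # -s 256 longer string arguments, -o write the trace to log_path.
    return [strace, "-f", "-ttt", "-T", "-s", "256", "-o", log_path, *argv]
```

The result can be handed to `os.execvp(result[0], result)`; keeping the builder free of side effects makes the gating easy to check without actually replacing the process.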
…lity
Tightest post-marker for the silent post-update SIGKILL investigation
(see the silent-sigkill-diagnosis reference in the
hermes-webui-self-update-bug skill for context).
Writes /tmp/hermes-webui-shim/<pid>-001-first-line.json with
per-file fsync as the very first executable statement in server.py,
before any import (logging, http.server, etc.). Uses only stdlib so
a broken api.* import can never be the reason this marker fails to
write. Wrapped in try/except so a marker write failure cannot
prevent startup.
Pairs with:
- api.updates._write_pre_execv_marker() (pre-side, old process,
written before os.execv())
- api.diag_shim.install() (post-side, after main()'s imports)
forming a 3-state decision table for WHERE in the new process's life
a silent death happened:
    pre + first-line + install (no further markers)
        -> kill is post-shim-load (the original mystery)
    pre + first-line, no install
        -> kill is in Python startup (bad import, syntax error,
           C-extension crash)
    pre only
        -> kill is in execve() / dynamic loader (very rare)
The full 3-state table is documented at the top of
api/updates._write_pre_execv_marker() for grep-ability from the
pre-side.
Verified: file syntax-checks clean with `python3 -c "import ast;
ast.parse(open('server.py').read())"`. The marker write logic was
exercised in isolation (write + fsync + read back) and produces the
expected JSON shape.
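The decision table above can be written as a small lookup. The function name and the returned labels are hypothetical; the three mappings are taken directly from the table, and every other combination is reported as unclassified rather than guessed at.

```python
def localize_silent_death(pre_execv: bool, first_line: bool, install: bool) -> str:
    """Map which markers exist to the window in which the new process died."""
    if pre_execv and first_line and install:
        return "post-shim-load"        # died after the diag shim was installed
    if pre_execv and first_line and not install:
        return "python-startup"        # bad import, syntax error, C-extension crash
    if pre_execv and not first_line and not install:
        return "execve-or-loader"      # died before server.py's first statement
    return "unclassified"              # combination not covered by the table
```

For example, a pre-execv marker from the old PID with no first-line or install marker from a fresh PID gives `localize_silent_death(True, False, False)`, which returns `"execve-or-loader"`.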
Documents the new observability additions in [Unreleased]:
- pre-execv marker (api/updates.py._write_pre_execv_marker)
- first-line marker (server.py top, before any import)
- env-gated strace-through-execv (HERMES_WEBUI_STRACE_EXECV=1)
as a single feature: "restart-window observability for unexplained post-update exits."

Points readers at the silent-sigkill-diagnosis reference in the hermes-webui-self-update-bug skill for the full diagnostic playbook, since the markers are part of a forensic investigation rather than a user-visible change.

The operational scripts (start-webui.sh, watchdog-loop.sh) are not listed: they are internal-only files (no user-facing impact) and the changelog audience is end-users, not operators.

This is the last of the diagnostic-feature commits. Remaining work in this branch is mechanical: bytecode cache flush + restart + verify the new markers fire end-to-end. Those are not commits.
…estart

The previous in-place os.execv() triggered a cgroup reclassification in the cpuset:/top-app and /apps/uid_*/pid_* hierarchies on Termux+PRoot / Android, which SIGKILLed the new process sub-millisecond, before any user code ran. Confirmed via 3-state restart-window markers: pre-execv fires, first-line and install markers from a fresh PID do not. Cron watchdog recovers the process in <1 min, but the post-update flow was silently dead until then.

This commit routes the restart through the same code path the cron watchdog uses (which provably survives the cgroup transition): a detached subprocess.Popen([ctl_path, 'start'], start_new_session=True) followed by os._exit(0) on the old process. Brand-new process image loaded from scratch by ctl.sh, no in-place execv, no ptrace pinning, no cgroup reclassification of a dying pid.

Also updates the CHANGELOG and the pre-execv marker reason text to reflect the new flow. The marker itself is still written first thing in the lock block, so a pre-marker without a new-PID first-line + install pair now points specifically at ctl.sh start failures (rather than the kernel-execv window it used to indicate).

Verified end-to-end on a real ctl.sh restart + curl-driven POST /api/updates/apply with target=webui,force=true:
- pre-marker fires (strace_on: false, confirming clean test)
- new PID appears within 2s via ctl.sh
- first-line + install markers from the new PID both fire
- /health responds ok, process stable
…or restart

The previous Path A used subprocess.Popen([ctl_path, 'start'], start_new_session=True) + os._exit(0). It worked correctly on paper but did NOT survive the parent's _exit on Termux+PRoot / Android: the ctl.sh subprocess never appeared, the cron watchdog had to recover in 37 seconds. Conjecture: Popen's start_new_session and the daemon-thread call site race against the parent's _exit, leaving the child reaped before it can complete setsid.

This commit uses os.fork() directly: the child calls os.setsid() to detach from the parent's session, then os.execvp() into ctl.sh; the parent immediately calls os._exit(0). The fork is the most primitive POSIX process spawn available, with no Python-level intermediate state that can race with _exit. The forked child is a separate process in the cgroup hierarchy, so the parent's cgroup transition kill window (the original bug) doesn't reach it either.

Verified end-to-end on a curl-driven POST /api/updates/apply with target=webui,force=true. The detached ctl.sh starts the new process within ~1s, no watchdog intervention needed.
…kill

After Path A's os.fork+setsid+execvp was working correctly, the new process started by ctl.sh still died in the same cgroup kill window the original in-place os.execv hit. Confirmed via marker analysis: post-fork ctl.sh started PID 11658, 11658 wrote NO markers (died before server.py line 1), cron watchdog recovered 8 seconds later with PID 12018 which survived.

The cgroup kill window is broader than just execv: ANY new python3 process in cpuset:/top-app (Android's top-app cgroup) is at risk if spawned within ~10 seconds of the old process's exit in the same cgroup. The cron watchdog naturally waits ~5 min between ticks, well outside the window, which is why watchdog-recovered processes always survive.

This commit waits 15 seconds in the fork child before invoking ctl.sh start, so the new process appears 15s after the old exit. Reduces post-update downtime from ~5 min (watchdog cycle) to ~15s.

Empirically verified: cron watchdog entries confirm 5-min recovery without my detached starter; with my detached starter and the 15s delay the new process should appear at T+15s.

If the 15s delay proves insufficient (kill window longer than 15s, or kill is cgroup-membership-based rather than time-based), the fallback is to simply not attempt a restart at all and rely entirely on the cron watchdog. The pre-marker is still informative as 'old process committed to restart at this exact moment' for any post-mortem analysis.
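The fork, setsid, delay, exec sequence from the last two commit messages can be sketched as follows. This is an illustration rather than the code in `api/updates.py`: `spawn_detached` is a hypothetical name, and unlike the commit it returns the child PID to the caller instead of calling `os._exit(0)` itself, so that it can be exercised in a test.

```python
import os
import time


def spawn_detached(argv, delay_s=15.0):
    """Fork a child that detaches, waits `delay_s`, then execs `argv`.

    Returns the child's PID in the parent. In the restart flow described
    above, the parent would call os._exit(0) right after this returns.
    """
    pid = os.fork()
    if pid != 0:
        return pid  # parent
    # Child: never return into the caller's Python code.
    try:
        os.setsid()            # new session, detached from the parent's
        time.sleep(delay_s)    # wait out the kill window before starting
        os.execvp(argv[0], argv)  # replaces this process on success
    finally:
        os._exit(127)          # reached only if setsid/sleep/exec failed
```

In the commit's flow `argv` would be the ctl.sh start command and `delay_s` the 15 s wait. Note that forking from a multi-threaded process has its own caveats, so this is best treated as a sketch of the sequence rather than a drop-in replacement.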
Summary of what changed since the last review
This PR was originally a focused 3-commit trio TL;DR: another silent death was happening on 3 groups, 11 commits, 2 unrelated root causes:
If you'd prefer to merge these as separate PRs, the
Happy to do that split if you prefer, just say the word.
Verification artifacts:
Read all 11 commits against
Group 1 (SIGPIPE): looks correct, ship-worthy on its own
Group 2 (restart path): the fix is unconditional, but it's only correct on Termux
This is where I'd hold. The new In the container, the launch is ( cd /app; python server.py || error_exit "hermes-webui failed or exited with an error"
There's a second mismatch: Docker never launches via
Recommendation
Gate the new fork+sleep+ctl.sh path behind explicit Termux/Android detection (e.g. presence of

    if _is_termux():  # or os.environ.get("HERMES_WEBUI_RESTART_VIA_CTL") == "1"
        _restart_via_detached_ctl()  # fork+setsid+sleep(15)+execvp(ctl.sh start)
    else:
        os.execv(sys.executable, [sys.executable] + sys.argv)

That preserves the Termux fix you verified end-to-end while keeping the in-place re-exec (and ~0s downtime) every other deploy target currently relies on. Taking up your own offer in the 17:06 comment: splitting into PR-A (Group 1, mergeable now) and PR-B (Group 2 + 3, gated as above) would let the SIGPIPE fix land immediately without blocking on the restart-path gating discussion. The markers in commits 5–6 are needed by PR-B, as you noted.
Summary
Eleven commits in three thematically linked groups, addressing the
WebUI's "silent death" patterns on Termux+PRoot / aarch64 (Android
top-app cgroup). Each commit is a small, independent change that can
be reviewed or reverted separately.
Group 1: Prevent the dominant production death (SIGPIPE)

1. fix(server): ignore SIGPIPE so dropped clients don't kill the process (11e81fc9)
   One-line fix at the top of `server.py`. A single broken-pipe write (browser closes the tab mid-response, mobile background, the long-poll endpoint drops, the `/api/updates/check` request times out) used to terminate the entire WebUI process via the default SIGPIPE → Term action. The disposition is now `SIG_IGN`; the offending `send()` raises `BrokenPipeError` and the server keeps serving.
2. diag: add signal-trap shim for unexplained-exit observability (301de49c)
   New module `api/diag_shim.py` + 2-line activation in `server.py` that installs handlers for SIGTERM/SIGINT/SIGHUP/SIGABRT/SIGBUS/SIGFPE/SIGSEGV/SIGPIPE/SIGALRM/SIGUSR1/SIGUSR2/SIGQUIT, and wraps `httpd.serve_forever` with exception capture. On any catchable signal/exception it writes a JSON marker to `/tmp/hermes-webui-shim/` with PID, PPID, uptime, full stack, all thread stacks, fd_count, then re-raises. Goal: distinguish clean exit, signal, exception, and untrappable death (SIGKILL/OOM, which leaves no marker) after the fact.
3. fix(diag_shim): don't re-raise SIGPIPE — preserve server's SIG_IGN (191cf6b3)
   Caught in production at 14:03 UTC the same day: the shim's generic `_signal_handler` was re-raising SIGPIPE with the default Term action, undoing the SIGPIPE-ignore protection in `server.py`. Special-case SIGPIPE: write the marker, set SIGPIPE back to `SIG_IGN`, return without re-raising. Three lines of net change in `api/diag_shim.py` (+20/-3).

Group 2: Outlast the cgroup-kill window in post-update restart (Termux+PRoot)

4. ops: add standalone launcher and watchdog-loop scripts (c7b92fdb)
   `start-webui.sh` (single-line ctl.sh start with health-check loop) and `watchdog-loop.sh` (re-checks every 5 s, recovers within 8 s of port death). User-side infra, not in the server tree, but kept here for completeness; the production deploy uses these instead of cron.
5. diag: add pre-execv marker + env-gated strace-through-execv (9f7ab2d7)
   In `api/updates.py._schedule_restart()`, writes a JSON marker to `/tmp/hermes-webui-shim/<pid>-000-pre-execv.json` as the first action inside the apply lock. Also adds an opt-in `HERMES_WEBUI_STRACE_EXECV=1` env-var path that traces the execv with strace to a log file, in case the markers ever need syscall-level detail again.
6. diag: add first-line marker in server.py (b4e93985)
   In `server.py`, writes a JSON marker to `/tmp/hermes-webui-shim/<pid>-001-first-line.json` as the first executable statement, before any import. Catches the case where the new process starts but dies before any user code can run (kernel/loader/import-time SIGKILL).
7. fix(updates): spawn ctl.sh start + os._exit instead of os.execv for restart (684d73bc)
   First attempt at the cgroup-kill fix: replaces the in-place `os.execv()` with a detached `subprocess.Popen([ctl_path, "start"], start_new_session=True, ...)` followed by `os._exit(0)`. Routes the restart through the same path the cron watchdog uses, which provably survives the cgroup transition. Result: failed. The detached subprocess died with the parent (on Termux+PRoot, `start_new_session=True` doesn't survive parent `_exit` on this kernel).
8. fix(updates): use os.fork+setsid+execvp instead of subprocess.Popen for restart (4c9a1268)
   Second attempt: `os.fork()` + `os.setsid()` in the child + `os.execvp("ctl.sh", ["start"])` + `os._exit(0)` in the parent. The most primitive POSIX spawn. Result: partially worked. The ctl.sh subprocess did spawn a new python3 process, but the new python3 process was killed ~8 s after the old process died. Same cgroup. The kill is cgroup-window-based, not execv-specific.
9. fix(updates): wait 15 s in fork child before ctl.sh to outlast cgroup kill (12f322a5)
   The actual fix. Same as (8) plus `time.sleep(15)` in the fork child, between `os.setsid()` and `os.execvp()`. Empirically the cpuset:/top-app lmkd window kills any new python3 process spawned within ~10 s of the old process's exit in the same cgroup. 15 s puts the new process outside the window. Result: new process survives, total downtime ~20 s (was ~5 min via cron watchdog alone).

Group 3: Changelog

10. docs(changelog): [Unreleased] entry for restart-window markers + strace (3c076a0d)
11. docs(changelog): final update on cgroup-kill fix (15 s delay) and diagnostic evolution (d80f9482)

The cgroup-kill root cause (Group 2)
Android's `cpuset:/top-app` cgroup (managed by the `lmkd` low-memory
killer) sends SIGKILL to any new python3 process spawned within
~10 s of the old process's exit in the same cgroup. The kill is
broader than just `os.execv`: it would also kill a
`subprocess.Popen(..., start_new_session=True)` child if the parent's
exit happens within the window, and it kills the ctl.sh-spawned
python3 if the ctl.sh start happens within the window. The cron
watchdog naturally waits 5 min between ticks, well outside the
window, which is why watchdog-recovered processes always
survived.
Why strace-through-execv ruled itself out as a diagnostic:
strace's first write happens after the kernel-side `execve` syscall,
which is after the SIGKILL was already delivered. The
strace log is 0 bytes because the kill happened before strace
could record anything. The 3-state decision table
(pre-execv + first-line + install markers) is what worked: it
localized the kill to "between pre-execv and first-line" →
kernel/loader/cgroup-kill, NOT Python startup, NOT post-shim-load.
Verified end-to-end with 2 real curl-driven
`POST /api/updates/apply` updates on Termux+PRoot/aarch64:
`cpuset:/top-app` as the dying ones → confirms 15 s timing, not cgroup escape
The SIGPIPE root cause (Group 1)
Python's `http.server` family defaults `SIGPIPE` to terminate.
The `signal` module's default disposition is `SIG_DFL`, which
for SIGPIPE means "kill the process." When any client closes
the connection mid-response (browser tab close, mobile
background, network drop, slow request killed), the kernel
sends SIGPIPE to the writing thread. The default disposition
fires before Python can convert it to a `BrokenPipeError`, and
the entire process is gone.

The diag shim was deployed first. On the very first restart
under the shim, the server died after 2 min 41 s while serving
a 20.5 s `/api/updates/check` request. The shim caught it
with full forensic data: PID, uptime, all 7 thread stacks,
signal 13. The webui log shows the last line was that
`/api/updates/check` request returning 200 after 20.5 s: the
response was being streamed, the connection was closed by the
client, the kernel delivered SIGPIPE during `selector.select`,
and the default Term action killed it. No exception, no log, no
clue: exactly the "silent death" fingerprint the shim was
built to diagnose.
SIGPIPE fix verification (Group 1, commit `11e81fc9`)
- `sigaction(SIGPIPE)` syscall on the running PID shows
  kernel-level disposition is now `SIG_IGN` (was `SIG_DFL` before).
- sockets with `SO_LINGER` 0, hard-close mid-response, and
  20-half-open slow-loris style): server stayed alive
  through all of them, `/health` returned 200 throughout.
- writes to a pipe with no readers): `BrokenPipeError` raised,
  process survived.
Diagnostic shim verification (Group 1, commits `301de49c` + `191cf6b3`)
- `kill -TERM` → `signal.json` with
  stack + 2 threads; RuntimeError in wrapped fn → `exception.json`
  with traceback; `kill -9` → install marker
  present, NO new signal marker (kernel untrappable, as expected).
- `try/except` so any shim bug can never
  break the server. Markers written to `/tmp/hermes-webui-shim/`
  (tmp dir, not the repo or `~/.hermes`).
- `191cf6b3`: SIGPIPE special-case verified:
  shim writes marker, sets SIG_IGN, returns; kernel never
  delivers SIGPIPE because SIG_IGN is in place.
Cgroup-kill fix verification (Group 2, commits `684d73bc` + `4c9a1268` + `12f322a5`)
- any spawn). Confirmed in all 4 update tests.
- successful updates (commit 9's result), proving the new
  process reached Python's startup and the diag shim was
  loaded.
- `os.fork+setsid+execvp` of ctl.sh) WITHOUT
  the 15 s sleep still kills the new process in the same
  cgroup. The 15 s sleep is the entire fix.
How the 3-state decision table works (debugging toolkit for future silent deaths)
The three markers (pre-execv, first-line, install) form a
3-state table that localizes any future silent death to one of:
failure, missing module, env var)
read the marker
AFTER diag shim was loaded (less common; could be OOM after
some runtime state, etc.)
e.g. crashed before the apply lock was acquired
This generalizes the original "fix is the absence of evidence"
heuristic into a structured 5-state table. Each state has a
known fix path.
Out of scope
- `api/updates.py` `os.execv` argv-shape fix (frozen-binary
  guard): that's PR #3395 ("fix(updates): self-restart argv — drop redundant sys.executable prefix"), kept separate per the maintainer's
  review (one logical change per PR).
- `try/except os._exit(0)` last-resort branch in
  `_schedule_restart` is kept intact; the cgroup-kill fix
  doesn't touch it.
- `*/5` to `*/1` to drop worst-case recovery from 5 min to 1
  min if the 15 s is ever insufficient. Defer to a separate
  PR.
Backwards compatibility
- clients. A client that reads the entire response sees the
  same bytes; a client that disconnects mid-stream now sees
  its server-side `send()` raise `BrokenPipeError` instead of
  the server process dying.
- handlers and wraps `serve_forever`. On a clean exit, the
  install marker is the only file written. The shim is opt-in
  via `install()` / `wrap_serve_forever()` and the call sites
  are inside a `try/except` that falls back to plain
  `serve_forever` if anything goes wrong.
- pre-execv and first-line are always-on (one fsync per
  process, try/except wrapped). strace-through-execv is
  opt-in via `HERMES_WEBUI_STRACE_EXECV=1`.
- `os.fork+setsid+execvp+15s+ctl.sh start`):
  changes the restart path from in-place `os.execv` to a
  detached fork+setsid+exec. The new process is the same
  `server.py` loaded by `ctl.sh start`; semantically
  identical to the cron watchdog's restart path. The only
  observable difference is a ~15-20 s gap between the old
  process's death and the new process's first-line marker
  (was 0 s with in-place execv).
- `ctl.sh` (macOS launchd path) is
  unchanged; the fork path only runs on systems where
  `ctl.sh start` is the restart path.