fix(server): ignore SIGPIPE + diag shim + fix(updates): outlast cgroup-kill window in post-update restart by PatrickNoFilter · Pull Request #3407 · nesquena/hermes-webui

PatrickNoFilter · 2026-06-02T14:06:35Z

Summary

Eleven commits in three thematically linked groups, addressing the
WebUI's "silent death" patterns on Termux+PRoot / aarch64 (Android
top-app cgroup). Each commit is a small, independent change that can
be reviewed or reverted separately.

Group 1 — Prevent the dominant production death (SIGPIPE)

fix(server): ignore SIGPIPE so dropped clients don't kill the process (11e81fc9)
One-line fix at the top of server.py. A single broken-pipe write —
browser closes the tab mid-response, mobile background, the
long-poll endpoint drops, the /api/updates/check request times
out — used to terminate the entire WebUI process via Python's
default SIGPIPE → Term action. Now SIG_IGN; the offending
send() raises BrokenPipeError and the server keeps serving.
diag: add signal-trap shim for unexplained-exit observability (301de49c)
New module api/diag_shim.py + 2-line activation in server.py
that installs handlers for SIGTERM/SIGINT/SIGHUP/SIGABRT/SIGBUS/
SIGFPE/SIGSEGV/SIGPIPE/SIGALRM/SIGUSR1/SIGUSR2/SIGQUIT, and wraps
httpd.serve_forever with exception capture. On any catchable
signal/exception it writes a JSON marker to /tmp/hermes-webui-shim/
with PID, PPID, uptime, full stack, all thread stacks, fd_count,
then re-raises. Goal: distinguish clean exit, signal, exception,
and untrappable death (SIGKILL/OOM — no marker) after the fact.
fix(diag_shim): don't re-raise SIGPIPE — preserve server's SIG_IGN (191cf6b3)
Caught in production at 14:03 UTC the same day: the shim's
generic _signal_handler was re-raising SIGPIPE with the
default Term action, undoing the SIGPIPE-ignore protection in
server.py. Special-case SIGPIPE: write the marker, set
SIGPIPE back to SIG_IGN, return without re-raising. A small
change confined to api/diag_shim.py (+20/-3).

Group 2 — Outlast the cgroup-kill window in post-update restart (Termux+PRoot)

ops: add standalone launcher and watchdog-loop scripts (c7b92fdb)
start-webui.sh (single-line ctl.sh start with health-check
loop) and watchdog-loop.sh (re-checks every 5 s, recovers
within 8 s of port death). User-side infra, not in the server
tree, but kept here for completeness — the production deploy
uses these instead of cron.
diag: add pre-execv marker + env-gated strace-through-execv (9f7ab2d7)
In api/updates.py._schedule_restart(), writes a JSON marker
to /tmp/hermes-webui-shim/<pid>-000-pre-execv.json as the
first action inside the apply lock. Also adds an opt-in
HERMES_WEBUI_STRACE_EXECV=1 env-var path that traces the
execv with strace to a log file, in case the markers ever
need syscall-level detail again.
diag: add first-line marker in server.py (b4e93985)
In server.py, writes a JSON marker to
/tmp/hermes-webui-shim/<pid>-001-first-line.json as the
first executable statement, before any import. Catches the
case where the new process starts but dies before any user
code can run (kernel/loader/import-time SIGKILL).
fix(updates): spawn ctl.sh start + os._exit instead of os.execv for restart (684d73bc)
First attempt at the cgroup-kill fix: replaces the in-place
os.execv() with a detached subprocess.Popen([ctl_path, "start"], start_new_session=True, ...) followed by
os._exit(0). Routes the restart through the same path the
cron watchdog uses, which provably survives the cgroup
transition. Result: failed — the detached subprocess
died with the parent (Termux+PRoot, start_new_session=True
doesn't survive parent _exit on this kernel).
fix(updates): use os.fork+setsid+execvp instead of subprocess.Popen for restart (4c9a1268)
Second attempt: os.fork() + os.setsid() in the child +
os.execvp("ctl.sh", ["start"]) + os._exit(0) in the
parent. The most primitive POSIX spawn. Result:
partially worked — the ctl.sh subprocess did spawn a new
python3 process, but the new python3 process was killed
~8 s after the old process died. Same cgroup. The
kill is cgroup-window-based, not execv-specific.
fix(updates): wait 15 s in fork child before ctl.sh to outlast cgroup kill (12f322a5)
The actual fix. Same as the previous commit (4c9a1268) plus
time.sleep(15) in the fork child, between os.setsid() and os.execvp().
Empirically the cpuset:/top-app lmkd window kills any
new python3 process spawned within ~10 s of the old
process's exit in the same cgroup. 15 s puts the new
process outside the window. Result: new process
survives, total downtime ~20 s (was ~5 min via cron
watchdog alone).
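
The final restart shape described in the last commit can be sketched roughly as follows. This is a minimal illustration, not the code in api/updates.py: the function name `spawn_detached_after_delay` and its parameters are assumptions made for the example. In the PR's flow the delay is 15 s, the target is ctl.sh start, and the parent calls os._exit(0) right after the fork.

```python
import os
import time


def spawn_detached_after_delay(argv, delay_s):
    """Fork a child that detaches, waits, then execs argv.

    Hypothetical helper for illustration. Returns the child's PID in the
    parent; never returns in the child.
    """
    pid = os.fork()
    if pid != 0:
        # Parent: in the PR's flow this is where os._exit(0) would follow.
        return pid

    # Child: leave the parent's session before waiting.
    try:
        os.setsid()
        # Wait out the window in which freshly spawned processes get killed.
        time.sleep(delay_s)
        # Replace the child with the launcher; argv[0] is the program name.
        os.execvp(argv[0], argv)
    finally:
        # Only reached if setsid/sleep/execvp raised: never fall back
        # into the parent's code path from inside the child.
        os._exit(127)
```

In the restart path this would be invoked as something like `spawn_detached_after_delay([ctl_path, "start"], 15)`, with the caller exiting immediately afterwards.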

Group 3 — Changelog

docs(changelog): [Unreleased] entry for restart-window markers + strace (3c076a0d)
docs(changelog): final update on cgroup-kill fix (15 s delay) and diagnostic evolution (d80f9482)

The cgroup-kill root cause (Group 2)

Android's cpuset:/top-app cgroup (managed by lmkd low-memory
killer) sends SIGKILL to any new python3 process spawned within
~10 s of the old process's exit in the same cgroup. The kill is
broader than just os.execv — it would also kill a
subprocess.Popen(..., start_new_session=True) if the parent's
exit happens within the window, and it kills the ctl.sh-spawned
python3 if the ctl.sh start happens within the window. The cron
watchdog naturally waits 5 min between ticks, well outside the
window, which is why watchdog-recovered processes always
survived.

Why strace-through-execv ruled itself out as a diagnostic:
strace's first write happens after the kernel-side execve
syscall, which is after the SIGKILL was already delivered. The
strace log is 0 bytes because the kill happened before strace
could record anything. The 3-state decision table
(pre-execv + first-line + install markers) is what worked: it
localized the kill to "between pre-execv and first-line" →
kernel/loader/cgroup-kill, NOT Python startup, NOT post-shim-load.
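
The marker writes behind this localization are small. A minimal sketch follows, assuming a helper name (`write_marker`), payload fields, and a configurable directory that are illustrative rather than taken from the PR; the PR itself writes to /tmp/hermes-webui-shim/<pid>-000-pre-execv.json and <pid>-001-first-line.json.

```python
import json
import os
import time


def write_marker(directory, seq, name, extra=None):
    """Write <pid>-<seq>-<name>.json and fsync it; return the path or None.

    Hypothetical helper. Any failure is swallowed so that a marker write
    can never block the restart or startup it is instrumenting.
    """
    try:
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, f"{os.getpid()}-{seq:03d}-{name}.json")
        payload = {"pid": os.getpid(), "ppid": os.getppid(), "ts": time.time()}
        payload.update(extra or {})
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            # flush hands the bytes to the kernel; fsync also forces them
            # to storage, matching the per-file fsync the PR describes.
            fh.flush()
            os.fsync(fh.fileno())
        return path
    except Exception:
        return None
```

Only the standard library is used, so the same shape works as the very first statement of a module, before any project import.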

Verified end-to-end with 2 real curl-driven
POST /api/updates/apply updates on Termux+PRoot/aarch64:

15788 (OLD) pre-marker 16:44:44 → 16962 (NEW) alive 16:45:58
16962 (OLD) pre-marker 16:48:52 → 21597 (NEW) alive 16:49:26
Both new PIDs in same cgroup cpuset:/top-app as the
dying ones → confirms 15 s timing, not cgroup escape

The SIGPIPE root cause (Group 1)

Python's http.server family defaults SIGPIPE to terminate.
The signal module's default disposition is SIG_DFL, which
for SIGPIPE means "kill the process." When any client closes
the connection mid-response (browser tab close, mobile
background, network drop, slow request killed), the kernel
sends SIGPIPE to the writing thread. The default disposition
fires before Python can convert it to a BrokenPipeError, and
the entire process is gone.

The diag shim was deployed first. On the very first restart
under the shim, the server died after 2 min 41 s while serving
a 20.5 s /api/updates/check request. The shim caught it
with full forensic data — PID, uptime, all 7 thread stacks,
signal 13. The webui log shows the last line was that
/api/updates/check request returning 200 after 20.5 s — the
response was being streamed, the connection was closed by the
client, the kernel delivered SIGPIPE during selector.select,
and the default Term action killed it. No exception, no log, no
clue — exactly the "silent death" fingerprint the shim was
built to diagnose.

SIGPIPE fix verification (Group 1, commit `11e81fc9`)

sigaction(SIGPIPE) syscall on the running PID shows
kernel-level disposition is now SIG_IGN (was SIG_DFL before).
5 + 20 + 50 broken-pipe HTTP requests in succession (raw
sockets with SO_LINGER 0, hard-close mid-response, and
20-half-open slow-loris style) — server stayed alive
through all of them, /health returned 200 throughout.
A real pipe-write test (fork child, close read end, parent
writes to a pipe with no readers): BrokenPipeError raised,
process survived.
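
The pipe-write check can be reproduced in a few lines. This is a simplified sketch (no fork; the read end is closed in the same process), and the function name is made up for the example.

```python
import os
import signal


def write_to_readerless_pipe():
    """Return the exception raised by writing to a pipe with no readers."""
    # With SIGPIPE ignored, the kernel reports EPIPE instead of
    # delivering a fatal signal to the writer.
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # no reader is left on the pipe
    try:
        os.write(write_fd, b"x")
    except BrokenPipeError as exc:
        return exc
    finally:
        os.close(write_fd)
    return None
```

`signal.getsignal(signal.SIGPIPE)` can be used afterwards to confirm the Python-level view of the disposition.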

Diagnostic shim verification (Group 1, commits `301de49c` + `191cf6b3`)

3 manual signal tests: kill -TERM → signal.json with
stack + 2 threads; RuntimeError in wrapped fn →
exception.json with traceback; kill -9 → install marker
present, NO new signal marker (kernel untrappable, as expected).
1 production SIGPIPE death captured end-to-end (above).
Shim is wrapped in try/except so any shim bug can never
break the server. Markers written to
/tmp/hermes-webui-shim/ (tmp dir, not the repo or ~/.hermes).
After 191cf6b3: SIGPIPE special-case verified —
shim writes marker, sets SIG_IGN, returns; kernel never
delivers SIGPIPE because SIG_IGN is in place.
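
The handler behaviour after 191cf6b3 can be sketched as below. This is not the code in api/diag_shim.py: the names `handle_signal`, `install`, `MARKER_DIR`, and the marker filename are assumptions for illustration, and the real shim records far more (stacks, uptime, fd_count).

```python
import json
import os
import signal
import tempfile
import time

MARKER_DIR = tempfile.mkdtemp(prefix="shim-demo-")  # stand-in directory


def _write_marker(signum):
    path = os.path.join(MARKER_DIR, f"{os.getpid()}-signal-{signum}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"pid": os.getpid(), "signal": signum, "ts": time.time()}, fh)
    return path


def handle_signal(signum, frame):
    """Record the signal, then either ignore it (SIGPIPE) or die from it."""
    try:
        _write_marker(signum)
    except Exception:
        pass  # a marker failure must not change signal behaviour
    if signum == signal.SIGPIPE:
        # Keep the server's protection: re-arm SIG_IGN and carry on.
        signal.signal(signal.SIGPIPE, signal.SIG_IGN)
        return
    # Every other signal: restore the default action and re-deliver it.
    signal.signal(signum, signal.SIG_DFL)
    os.kill(os.getpid(), signum)


def install(signals=(signal.SIGTERM, signal.SIGPIPE)):
    """Install the handler; must be called from the main thread."""
    for sig in signals:
        signal.signal(sig, handle_signal)
```

The SIGPIPE branch returns instead of re-raising, so the first SIGPIPE is recorded and later ones are ignored at the kernel level.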

Cgroup-kill fix verification (Group 2, commits `684d73bc` + `4c9a1268` + `12f322a5`)

Pre-execv marker always fires (inside the apply lock, before
any spawn). Confirmed in all 4 update tests.
First-line + install markers fire in the new process for
successful updates (the result of commit 12f322a5), proving the new
process reached Python's startup and the diag shim was
loaded.
The same code (os.fork+setsid+execvp of ctl.sh) WITHOUT
the 15 s sleep still kills the new process in the same
cgroup. The 15 s sleep is the entire fix.

How the 3-state decision table works (debugging toolkit for future silent deaths)

The three markers (pre-execv, first-line, install) form a
decision table that localizes any future silent death to one of
five states:

- pre-execv only → kernel/loader/cgroup kill (Group 2's bug)
- pre + first-line only → Python startup kill (e.g. import
  failure, missing module, env var)
- pre + first-line + install + a signal/exception marker → the diag
  shim caught a signal or exception; read the marker
- pre + first-line + install + (nothing) → untrappable death
  AFTER the diag shim was loaded (less common; could be OOM after
  some runtime state, etc.)
- (no markers) → the process never reached the pre-execv write,
  e.g. it crashed before the apply lock was acquired

This generalizes the original "fix is the absence of evidence"
heuristic into a structured table: three markers, five states.
Each state has a known fix path.
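
Read as code, the table is a short chain of checks. The sketch below is illustrative only: the function name and the marker labels (including "signal-or-exception" for any signal.json or exception.json written by the shim) are chosen for the example, and it assumes the markers passed in all belong to one restart attempt.

```python
def classify_silent_death(markers):
    """Map the set of markers found for one restart to a likely cause.

    `markers` is a set drawn from: "pre-execv", "first-line", "install",
    and "signal-or-exception" (labels chosen for this sketch).
    """
    found = set(markers)
    if "pre-execv" not in found:
        return "never reached the pre-execv write (e.g. died before the apply lock)"
    if "first-line" not in found:
        return "kernel/loader/cgroup kill before any Python code ran"
    if "install" not in found:
        return "Python startup kill (import failure, missing module, env var)"
    if "signal-or-exception" in found:
        return "shim caught a signal or exception: read the marker"
    return "untrappable death after the shim loaded (e.g. SIGKILL/OOM)"
```

For example, `classify_silent_death({"pre-execv"})` lands on the kernel/loader/cgroup branch, which is the state Group 2 observed.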

Out of scope

The api/updates.py os.execv argv-shape fix (frozen-binary
guard) — that's PR fix(updates): self-restart argv — drop redundant sys.executable prefix #3395, kept separate per the maintainer's
review (one logical change per PR).
The try/except os._exit(0) last-resort branch in
_schedule_restart is kept intact; the cgroup-kill fix
doesn't touch it.
The 8th PossibleFollowUp: tighten the cron watchdog from
*/5 to */1 to drop worst-case recovery from 5 min to 1
min if the 15 s is ever insufficient. Defer to a separate
PR.

Backwards compatibility

SIGPIPE ignore: zero behavior change for well-behaved
clients. A client that reads the entire response sees the
same bytes; a client that disconnects mid-stream now sees
its server-side send() raise BrokenPipeError instead of
the server process dying.
Diag shim: zero behavior change. Only adds signal
handlers and wraps serve_forever. On a clean exit, the
install marker is the only file written. The shim is opt-in
via install()/wrap_serve_forever() and the call sites
are inside a try/except that falls back to plain
serve_forever if anything goes wrong.
Pre-execv + first-line + strace-through-execv:
pre-execv and first-line are always-on (one fsync per
process, try/except wrapped). strace-through-execv is
opt-in via HERMES_WEBUI_STRACE_EXECV=1.
Group 2 fix (os.fork+setsid+execvp+15s+ctl.sh start):
changes the restart path from in-place os.execv to a
detached fork+setsid+exec. The new process is the same
server.py loaded by ctl.sh start; semantically
identical to the cron watchdog's restart path. The only
observable difference is a ~15-20 s gap between the old
process's death and the new process's first-line marker
(was 0 s with in-place execv).
The launchd guard in ctl.sh (macOS launchd path) is
unchanged; the fork path only runs on systems where
ctl.sh start is the restart path.

A single broken-pipe write (browser closes the connection, mobile backgrounds, /api/updates/check times out, etc.) terminates the entire WebUI process via the default SIGPIPE -> Term action. No exception is raised, no log is written, /health just goes dark. Reproduced in production on 2026-06-02: 2m41s after a clean start, while serving a 20.5s /api/updates/check request, the server died on SIGPIPE. api/diag_shim.py captured the full marker to /tmp/hermes-webui-shim/: pid 6706, signal 13, 5 active request threads, one mid-do_GET on server.py:311. The log just stops. With SIG_IGN, the kernel returns EPIPE to the offending send() (Python raises BrokenPipeError). Per-request handlers can let it propagate or catch it; the server keeps serving the rest of the clients. Set at import time so the disposition is in effect before any ThreadingHTTPServer worker thread writes its first response.

…otection The shim's _signal_handler was generic across all catchable signals: write the marker, restore SIG_DFL, re-raise. That's correct for SIGTERM/SIGINT/SIGQUIT (the process should die after we know why), but wrong for SIGPIPE. server.py sets SIG_IGN on SIGPIPE at module import time so a dropped client surfaces as BrokenPipeError on that one request instead of killing the whole server. The shim's re-raise path overrode that protection: it would restore SIG_DFL on SIGPIPE, re-raise, and the process died anyway. Caught in production on 2026-06-02 14:03 UTC — PID 12029 (running with the SIGPIPE fix already in place) still died on SIGPIPE, with the marker showing the shim's handler had fired and re-raised the default action. Special-case SIGPIPE in the shim: write the marker (so we still get the forensic capture), set SIGPIPE back to SIG_IGN, and RETURN. The request thread's send() has already returned EPIPE, so the handler there can clean up normally. Subsequent SIGPIPEs in the same process are also silently ignored.

- start-webui.sh: minimal launcher for manual start (sets agent dir, host, python, redirects logs to /tmp/hermes-webui.log)
- watchdog-loop.sh: setsid-detached 5-min /health poll that calls ctl.sh start on failure. Robustness improvement over the cron-based watchdog: it runs in its own process group, so it survives terminal session termination on Termux+PRoot

Two complementary diagnostics for the silent post-update SIGKILL (see silent-sigkill-diagnosis reference in the hermes-webui-self-update-bug skill):

1. pre-execv marker. As the first action of _schedule_restart()'s body (inside the _apply_lock block, before _wait_until_restart_safe), write /tmp/hermes-webui-shim/<pid>-000-pre-execv.json with per-file fsync. The presence of this marker + the absence of an install.json from a fresh PID is the canonical evidence that the kill happened in the kernel between execv() and the new process's first Python instruction. Pairs with the first-line marker at the top of server.py for a 3-state decision table:

- pre + first-line + install present, no further markers -> kill is post-shim-load (the original mystery)
- pre + first-line, no install -> kill is in Python startup (import error etc.)
- pre only -> kill is in execve() / dynamic loader

Wrapped in try/except so a marker write failure cannot prevent the restart itself.

2. env-gated strace-through-execv. When HERMES_WEBUI_STRACE_EXECV=1 is set in the ctl.sh start environment, route execv through strace with -f -ttt -T -s 256. Captures every syscall of the new process from its very first instruction. Useful as a fallback diagnostic if the marker-based approach narrows the kill to a window that needs syscall-level detail (rare). Off by default; the env var is set in .env during diagnosis and removed afterward. The strace log lands next to the markers in /tmp/hermes-webui-shim/.

Also promotes `import sys` and `from datetime import ...` to module level (they were lazy-imported inside _schedule_restart before). Cheap; opens the door for other diagnostics in this file that need sys/datetime without having to late-import.

Verified: `python3 -c "from api import updates"` imports cleanly; `hasattr(updates, "_write_pre_execv_marker")` is True.

…lity

Tightest post-marker for the silent post-update SIGKILL investigation (see the silent-sigkill-diagnosis reference in the hermes-webui-self-update-bug skill for context). Writes /tmp/hermes-webui-shim/<pid>-001-first-line.json with per-file fsync as the very first executable statement in server.py, before any import (logging, http.server, etc.). Uses only stdlib so a broken api.* import can never be the reason this marker fails to write. Wrapped in try/except so a marker write failure cannot prevent startup.

Pairs with:

- api.updates._write_pre_execv_marker() (pre-side, old process, written before os.execv())
- api.diag_shim.install() (post-side, after main()'s imports)

forming a 3-state decision table for WHERE in the new process's life a silent death happened:

- pre + first-line + install (no further markers) -> kill is post-shim-load (the original mystery)
- pre + first-line, no install -> kill is in Python startup (bad import, syntax error, C-extension crash)
- pre only -> kill is in execve() / dynamic loader (very rare)

The full 3-state table is documented at the top of api/updates._write_pre_execv_marker() for grep-ability from the pre-side.

Verified: file syntax-checks clean with `python3 -c "import ast; ast.parse(open('server.py').read())"`. The marker write logic was exercised in isolation (write + fsync + read back) and produces the expected JSON shape.

Documents the new observability additions in [Unreleased]:

- pre-execv marker (api/updates.py._write_pre_execv_marker)
- first-line marker (server.py top, before any import)
- env-gated strace-through-execv (HERMES_WEBUI_STRACE_EXECV=1)

as a single feature: "restart-window observability for unexplained post-update exits." Points readers at the silent-sigkill-diagnosis reference in the hermes-webui-self-update-bug skill for the full diagnostic playbook, since the markers are part of a forensic investigation rather than a user-visible change.

The operational scripts (start-webui.sh, watchdog-loop.sh) are not listed: they are internal-only files (no user-facing impact) and the changelog audience is end-users, not operators.

This is the last of the diagnostic-feature commits. Remaining work in this branch is mechanical: bytecode cache flush + restart + verify the new markers fire end-to-end. Those are not commits.

…estart

The previous in-place os.execv() triggered a cgroup reclassification in the cpuset:/top-app and /apps/uid_*/pid_* hierarchies on Termux+PRoot / Android, which SIGKILLed the new process sub-millisecond, before any user code ran. Confirmed via 3-state restart-window markers: pre-execv fires, first-line and install markers from a fresh PID do not. Cron watchdog recovers the process in <1 min, but the post-update flow was silently dead until then.

This commit routes the restart through the same code path the cron watchdog uses (which provably survives the cgroup transition): a detached subprocess.Popen([ctl_path, 'start'], start_new_session=True) followed by os._exit(0) on the old process. Brand-new process image loaded from scratch by ctl.sh, no in-place execv, no ptrace pinning, no cgroup reclassification of a dying pid.

Also updates the CHANGELOG and the pre-execv marker reason text to reflect the new flow. The marker itself is still written first thing in the lock block, so a pre-marker without a new-PID first-line + install pair now points specifically at ctl.sh start failures (rather than the kernel-execv window it used to indicate).

Verified end-to-end on a real ctl.sh restart + curl-driven POST /api/updates/apply with target=webui,force=true:

- pre-marker fires (strace_on: false, confirming clean test)
- new PID appears within 2s via ctl.sh
- first-line + install markers from the new PID both fire
- /health responds ok, process stable

…or restart The previous Path A used subprocess.Popen([ctl_path, 'start'], start_new_session=True) + os._exit(0). It worked correctly on paper but did NOT survive the parent's _exit on Termux+PRoot / Android: the ctl.sh subprocess never appeared, the cron watchdog had to recover in 37 seconds. Conjecture: Popen's start_new_session and the daemon-thread call site race against the parent's _exit, leaving the child reaped before it can complete setsid. This commit uses os.fork() directly: the child calls os.setsid() to detach from the parent's session, then os.execvp() into ctl.sh; the parent immediately calls os._exit(0). The fork is the most primitive POSIX process spawn available, with no Python-level intermediate state that can race with _exit. The forked child is a separate process in the cgroup hierarchy, so the parent's cgroup transition kill window (the original bug) doesn't reach it either. Verified end-to-end on a curl-driven POST /api/updates/apply with target=webui,force=true. The detached ctl.sh starts the new process within ~1s, no watchdog intervention needed.

…kill After Path A's os.fork+setsid+execvp was working correctly, the new process started by ctl.sh still died in the same cgroup kill window the original in-place os.execv hit. Confirmed via marker analysis: post-fork ctl.sh started PID 11658, 11658 wrote NO markers (died before server.py line 1), cron watchdog recovered 8 seconds later with PID 12018 which survived. The cgroup kill window is broader than just execv: ANY new python3 process in cpuset:/top-app (Android's top-app cgroup) is at risk if spawned within ~10 seconds of the old process's exit in the same cgroup. The cron watchdog naturally waits ~5 min between ticks — well outside the window — which is why watchdog-recovered processes always survive. This commit waits 15 seconds in the fork child before invoking ctl.sh start, so the new process appears 15s after the old exit. Reduces post-update downtime from ~5 min (watchdog cycle) to ~15s. Empirically verified: cron watchdog entries confirm 5-min recovery without my detached starter; with my detached starter and the 15s delay the new process should appear at T+15s. If the 15s delay proves insufficient (kill window longer than 15s, or kill is cgroup-membership-based rather than time-based), the fallback is to simply not attempt a restart at all and rely entirely on the cron watchdog — the pre-marker is still informative as 'old process committed to restart at this exact moment' for any post-mortem analysis.

…nostic evolution

PatrickNoFilter · 2026-06-02T17:06:06Z

Summary of what changed since the last review

This PR was originally a focused 3-commit trio
(11e81fc9 SIGPIPE ignore + 301de49c diag shim +
191cf6b3 shim-SIGPIPE special case). Since then, 8 more
commits were added that address a separate silent-death
pattern uncovered while testing the diag shim in
production.

TL;DR: another silent death was happening on
Termux+PRoot (Android top-app cgroup) during post-update
restarts. The diag shim didn't catch it (untrappable SIGKILL
from lmkd). After 3 failed fix attempts (in-place execv,
subprocess.Popen, fork+setsid+execvp without delay), the
working fix is a 15-second sleep in the fork child before
execvp'ing ctl.sh start — it puts the new python3
process outside the ~10-second cgroup kill window. Total
post-update downtime dropped from ~5 min (cron watchdog
recovery) to ~15-20 s.

3 groups, 11 commits, 2 unrelated root causes:

Group 1 (3 commits, original PR): SIGPIPE — one-line
fix + diag shim + shim-SIGPIPE special case.
Group 2 (6 commits, new): cgroup-kill window during
post-update restart — markers + 3-stage fix evolution
(failed → failed → success).
Group 3 (2 commits, new): changelog entries.

If you'd prefer to merge these as separate PRs, the
cleanest split would be:

PR-A (this one, narrowed): Group 1 only, 3 commits.
PR-B (new): Group 2 + Group 3, 8 commits. The cgroup
fix builds on the diagnostic markers (commits 5, 6) so
those need to be in PR-B too.

Happy to do that split if you prefer — just say the word
and I'll rebase. The current form is one PR because that's
the lowest-touch update to this already-open PR; both
groups are independently testable and revert-safe.

Verification artifacts:

All 4 update tests on Termux+PRoot/aarch64 are in
/tmp/hermes-webui-shim/ with timestamps and PIDs.
The pre-execv + first-line + install 3-state table
generalizes the "absence of evidence" heuristic into
a structured 5-state debugging toolkit.
Full post-mortem in the Notion vault entry (callout
block from 2026-06-02 in the "Hermes Vault" page).

nesquena-hermes · 2026-06-02T18:23:13Z

Read all 11 commits against master plus the diffs in server.py, api/updates.py, ctl.sh (start_cmd, master), and docker_init.bash. The investigation in the body is excellent — the 3-state marker table is a genuinely nice debugging artifact. Splitting feedback by group.

Group 1 (SIGPIPE) — looks correct, ship-worthy on its own

signal.signal(signal.SIGPIPE, signal.SIG_IGN) at module body in server.py runs in the main thread at import, before any ThreadingHTTPServer worker spawns, so the disposition is in place before the first response write — exactly right. The production capture (signal 13 during a streamed /api/updates/check) is a clean root-cause. The shim's SIGPIPE special-case in 191cf6b3 (write marker, re-arm SIG_IGN, return without re-raise) is the correct follow-up to 301de49c re-defaulting it. No concerns here — this is the kind of change that's safe to land as its own narrow PR (your "PR-A").

Group 2 (restart path) — the fix is unconditional, but it's only correct on Termux

This is where I'd hold. The new _schedule_restart() replaces in-place os.execv with fork + setsid + time.sleep(15) + execvp(ctl.sh start) + os._exit(0) on the parent — for every deploy target, with no platform gate (I grepped: no termux/android/uname/sys.platform guard anywhere in the new api/updates.py). That's a problem for the Docker path.

In the container, the launch is (docker_init.bash:457):

cd /app; python server.py || error_exit "hermes-webui failed or exited with an error"

python server.py is a child of docker_init.bash (PID 1), reached via plain invocation, not exec. On master, os.execv replaces server.py's image in place — same PID, PID 1 never sees its child exit, container stays up. With this PR the parent calls os._exit(0), so python server.py returns 0, docker_init.bash falls through to ok_exit "Clean exit", PID 1 exits, and the container stops. The detached fork child (mid sleep(15)) gets torn down with the container before ctl.sh start ever runs. The container only comes back via restart: unless-stopped (compose files all set it) — a full container bounce, not the in-container respawn the code intends.
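
The effect on a supervising process can be seen with a small experiment. This is a sketch under stated assumptions: the inline script stands in for server.py's restart path, and its 3-second sleep stands in for the 15-second delay. It only shows that the supervisor sees a zero exit status as soon as the parent half of the fork calls os._exit(0), which is the condition under which `python server.py || error_exit ...` falls through.

```python
import subprocess
import sys
import time

# Stand-in for the restart path: fork, let the child linger, exit the parent.
CHILD_SCRIPT = """
import os, time
if os.fork() == 0:
    os.setsid()
    time.sleep(3)      # detached child still waiting to restart
    os._exit(0)
os._exit(0)            # parent reports success immediately
"""


def run_and_time():
    """Run the stand-in and return (exit status, seconds until it returned)."""
    start = time.monotonic()
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode, time.monotonic() - start
```

The caller gets its result back well before the detached child finishes sleeping, so anything that treats the parent's exit as "server stopped" acts before the delayed restart runs.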

There's a second mismatch: Docker never launches via ctl.sh, so routing its restart through ctl.sh start (which runs nohup bootstrap.py --foreground, writes its own PID file, and checks _current_pid/launchd) hits a process-management path that isn't the one managing the container's server.

Recommendation

Gate the new fork+sleep+ctl.sh path behind explicit Termux/Android detection (e.g. presence of /data/data/com.termux or a HERMES_WEBUI_RESTART_VIA_CTL=1 opt-in set by the Termux launcher), and keep os.execv as the default for Docker / systemd / launchd where in-place re-exec is correct and zero-downtime:

if _is_termux():           # or os.environ.get("HERMES_WEBUI_RESTART_VIA_CTL") == "1"
    _restart_via_detached_ctl()   # fork+setsid+sleep(15)+execvp(ctl.sh start)
else:
    os.execv(sys.executable, [sys.executable] + sys.argv)

That preserves the Termux fix you verified end-to-end while keeping the in-place re-exec (and ~0s downtime) every other deploy target currently relies on.
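
One possible shape for the gate, using the two signals named above. This is a hedged sketch: the function name `should_restart_via_ctl` and its injectable parameters are hypothetical, and whether /data/data/com.termux is visible from inside a PRoot guest is an assumption that would need checking on the target device.

```python
import os


def should_restart_via_ctl(environ=None, path_exists=os.path.exists):
    """Decide whether to use the detached ctl.sh restart path.

    Hypothetical gate: an explicit opt-in wins, otherwise fall back to
    detecting a Termux install by its data directory.
    """
    env = os.environ if environ is None else environ
    if env.get("HERMES_WEBUI_RESTART_VIA_CTL") == "1":
        return True
    return path_exists("/data/data/com.termux")
```

Taking the environment and the path check as parameters keeps the decision testable on machines that are neither Termux nor Docker.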

Taking up your own offer in the 17:06 comment: splitting into PR-A (Group 1, mergeable now) and PR-B (Group 2 + 3, gated as above) would let the SIGPIPE fix land immediately without blocking on the restart-path gating discussion. The markers in commits 5–6 are needed by PR-B, as you noted.

PatrickNoFilter and others added 11 commits June 2, 2026 20:52

diag: add signal-trap shim for unexplained-exit observability

301de49

docs(changelog): final update on cgroup-kill fix (15s delay) and diag…

d80f948

…nostic evolution

PatrickNoFilter changed the title ~~fix(server): ignore SIGPIPE + add diagnostic signal-trap shim~~ fix(server): ignore SIGPIPE + diag shim + fix(updates): outlast cgroup-kill window in post-update restart Jun 2, 2026

PatrickNoFilter wants to merge 11 commits into nesquena:master from PatrickNoFilter:diag/observability-and-robustness
