fix(server): ignore SIGPIPE + diag shim + fix(updates): outlast cgroup-kill window in post-update restart#3407

Open
PatrickNoFilter wants to merge 11 commits into nesquena:master from PatrickNoFilter:diag/observability-and-robustness

@PatrickNoFilter PatrickNoFilter commented Jun 2, 2026

Summary

Eleven commits in three thematically linked groups, addressing the
WebUI's "silent death" patterns on Termux+PRoot / aarch64 (Android
top-app cgroup). Each commit is a small, independent change that can
be reviewed or reverted separately.

Group 1 — Prevent the dominant production death (SIGPIPE)

  1. fix(server): ignore SIGPIPE so dropped clients don't kill the process (11e81fc9)
    One-line fix at the top of server.py. A single broken-pipe write —
    browser closes the tab mid-response, mobile background, the
    long-poll endpoint drops, the /api/updates/check request times
    out — used to terminate the entire WebUI process via Python's
    default SIGPIPE → Term action. Now SIG_IGN; the offending
    send() raises BrokenPipeError and the server keeps serving.

  2. diag: add signal-trap shim for unexplained-exit observability (301de49c)
    New module api/diag_shim.py + 2-line activation in server.py
    that installs handlers for SIGTERM/SIGINT/SIGHUP/SIGABRT/SIGBUS/
    SIGFPE/SIGSEGV/SIGPIPE/SIGALRM/SIGUSR1/SIGUSR2/SIGQUIT, and wraps
    httpd.serve_forever with exception capture. On any catchable
    signal/exception it writes a JSON marker to /tmp/hermes-webui-shim/
    with PID, PPID, uptime, full stack, all thread stacks, fd_count,
    then re-raises. Goal: distinguish clean exit, signal, exception,
    and untrappable death (SIGKILL/OOM — no marker) after the fact.

  3. fix(diag_shim): don't re-raise SIGPIPE — preserve server's SIG_IGN (191cf6b3)
    Caught in production at 14:03 UTC the same day: the shim's
    generic _signal_handler was re-raising SIGPIPE with the
    default Term action, undoing the SIGPIPE-ignore protection in
    server.py. Special-case SIGPIPE: write the marker, set
    SIGPIPE back to SIG_IGN, return without re-raising. A small,
    self-contained change to api/diag_shim.py (+20/-3).

Group 2 — Outlast the cgroup-kill window in post-update restart (Termux+PRoot)

  4. ops: add standalone launcher and watchdog-loop scripts (c7b92fdb)
    start-webui.sh (single-line ctl.sh start with health-check
    loop) and watchdog-loop.sh (re-checks every 5 s, recovers
    within 8 s of port death). User-side infra, not in the server
    tree, but kept here for completeness; the production deploy
    uses these instead of cron.

  5. diag: add pre-execv marker + env-gated strace-through-execv (9f7ab2d7)
    In api/updates.py._schedule_restart(), writes a JSON marker
    to /tmp/hermes-webui-shim/<pid>-000-pre-execv.json as the
    first action inside the apply lock. Also adds an opt-in
    HERMES_WEBUI_STRACE_EXECV=1 env-var path that traces the
    execv with strace to a log file, in case the markers ever
    need syscall-level detail again.

  6. diag: add first-line marker in server.py (b4e93985)
    In server.py, writes a JSON marker to
    /tmp/hermes-webui-shim/<pid>-001-first-line.json as the
    first executable statement, before any import. Catches the
    case where the new process starts but dies before any user
    code can run (kernel/loader/import-time SIGKILL).

  7. fix(updates): spawn ctl.sh start + os._exit instead of os.execv for restart (684d73bc)
    First attempt at the cgroup-kill fix: replaces the in-place
    os.execv() with a detached
    subprocess.Popen([ctl_path, "start"], start_new_session=True, ...)
    followed by os._exit(0). Routes the restart through the same
    path the cron watchdog uses, which provably survives the cgroup
    transition. Result: failed. The detached subprocess died with
    the parent (on Termux+PRoot, start_new_session=True doesn't
    survive the parent's _exit on this kernel).

  8. fix(updates): use os.fork+setsid+execvp instead of subprocess.Popen for restart (4c9a1268)
    Second attempt: os.fork() + os.setsid() in the child +
    os.execvp("ctl.sh", ["start"]) + os._exit(0) in the
    parent. The most primitive POSIX spawn. Result: partially
    worked. The ctl.sh subprocess did spawn a new python3
    process, but that process was killed ~8 s after the old
    process died. Same cgroup. The kill is cgroup-window-based,
    not execv-specific.

  9. fix(updates): wait 15 s in fork child before ctl.sh to outlast cgroup kill (12f322a5)
    The actual fix. Same as (8) plus time.sleep(15) in
    the fork child, between os.setsid() and os.execvp().
    Empirically the cpuset:/top-app lmkd window kills any
    new python3 process spawned within ~10 s of the old
    process's exit in the same cgroup. 15 s puts the new
    process outside the window. Result: new process survives,
    total downtime ~20 s (was ~5 min via cron watchdog alone).

Group 3 — Changelog

  10. docs(changelog): [Unreleased] entry for restart-window markers + strace (3c076a0d)

  11. docs(changelog): final update on cgroup-kill fix (15 s delay) and diagnostic evolution (d80f9482)

The cgroup-kill root cause (Group 2)

Android's cpuset:/top-app cgroup (managed by lmkd low-memory
killer) sends SIGKILL to any new python3 process spawned within
~10 s of the old process's exit in the same cgroup. The kill is
broader than just os.execv — it would also kill a
subprocess.Popen(..., start_new_session=True) if the parent's
exit happens within the window, and it kills the ctl.sh-spawned
python3 if the ctl.sh start happens within the window. The cron
watchdog naturally waits 5 min between ticks, well outside the
window, which is why watchdog-recovered processes always
survived.
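
The delayed detached restart described above can be sketched roughly as follows. This is a minimal illustration, not the code in api/updates.py: the function name `spawn_detached_after_delay` and its parameters are hypothetical, and the 15 s default simply mirrors the empirically chosen delay.

```python
import os
import time


def spawn_detached_after_delay(argv, delay_s=15.0):
    """Fork; in the child: setsid, sleep delay_s, then exec argv.

    Returns the child's PID in the parent. The child never returns.
    """
    pid = os.fork()
    if pid != 0:
        return pid  # parent: the caller decides when to exit
    try:
        os.setsid()              # new session, detached from the parent's
        time.sleep(delay_s)      # wait out the suspected kill window
        os.execvp(argv[0], argv)
    except BaseException:
        pass                     # fall through to the hard exit below
    os._exit(127)                # only reached if setsid/exec failed
```

In the restart flow the parent would call something like `spawn_detached_after_delay(["./ctl.sh", "start"])` (placeholder path) and then `os._exit(0)` immediately. Exiting the child with `os._exit` on failure keeps it from running any of the parent's remaining Python code.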

Why strace-through-execv ruled itself out as a diagnostic:
strace's first write happens after the kernel-side execve
syscall, which is after the SIGKILL was already delivered. The
strace log is 0 bytes because the kill happened before strace
could record anything. The 3-state decision table
(pre-execv + first-line + install markers) is what worked: it
localized the kill to "between pre-execv and first-line" →
kernel/loader/cgroup-kill, NOT Python startup, NOT post-shim-load.
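
A marker writer in the spirit of the pre-execv and first-line markers might look like the sketch below. It assumes only what the description states (a JSON file named `<pid>-<seq>-<name>.json`, fsynced, wrapped so a failure cannot block the restart); the function name `write_marker` and the payload fields are illustrative.

```python
import json
import os
import time


def write_marker(marker_dir, seq, name, extra=None):
    """Write <pid>-<seq>-<name>.json and fsync it; never raise."""
    try:
        os.makedirs(marker_dir, exist_ok=True)
        path = os.path.join(marker_dir, f"{os.getpid()}-{seq:03d}-{name}.json")
        payload = {"pid": os.getpid(), "ppid": os.getppid(),
                   "name": name, "time": time.time()}
        payload.update(extra or {})
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())  # make the marker durable before the risky step
        return path
    except Exception:
        return None  # a failed marker write must not block restart or startup
```

Using only the standard library keeps the writer usable as the very first statement of a script, before any project import has a chance to fail.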

Verified end-to-end with 2 real curl-driven
POST /api/updates/apply updates on Termux+PRoot/aarch64:

  • 15788 (OLD) pre-marker 16:44:44 → 16962 (NEW) alive 16:45:58
  • 16962 (OLD) pre-marker 16:48:52 → 21597 (NEW) alive 16:49:26
  • Both new PIDs in same cgroup cpuset:/top-app as the
    dying ones → confirms 15 s timing, not cgroup escape

The SIGPIPE root cause (Group 1)

Python's http.server family defaults SIGPIPE to terminate.
The signal module's default disposition is SIG_DFL, which
for SIGPIPE means "kill the process." When any client closes
the connection mid-response (browser tab close, mobile
background, network drop, slow request killed), the kernel
sends SIGPIPE to the writing thread. The default disposition
fires before Python can convert it to a BrokenPipeError, and
the entire process is gone.
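
The effect of the SIG_IGN disposition can be reproduced in isolation with a pipe that has no reader. The sketch below is a standalone demonstration, not server.py itself; the helper names are made up for illustration. The explicit `signal.signal` call is harmless if the disposition is already SIG_IGN.

```python
import os
import signal
import threading


def ignore_sigpipe():
    """Set SIGPIPE to SIG_IGN (signal.signal only works in the main thread)."""
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGPIPE, signal.SIG_IGN)


def write_to_closed_pipe():
    """Write to a pipe with no reader and report what happened."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)                  # no readers left
    try:
        os.write(write_fd, b"response bytes")
        return "written"
    except BrokenPipeError:
        return "BrokenPipeError"       # EPIPE surfaced as an exception
    finally:
        os.close(write_fd)


ignore_sigpipe()
print(write_to_closed_pipe())
```

With SIGPIPE ignored the write fails with EPIPE and Python raises BrokenPipeError, which a per-request handler can catch or let propagate while the rest of the process keeps running.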

The diag shim was deployed first. On the very first restart
under the shim, the server died after 2 min 41 s while serving
a 20.5 s /api/updates/check request. The shim caught it
with full forensic data — PID, uptime, all 7 thread stacks,
signal 13. The webui log shows the last line was that
/api/updates/check request returning 200 after 20.5 s — the
response was being streamed, the connection was closed by the
client, the kernel delivered SIGPIPE during selector.select,
and the default Term action killed it. No exception, no log, no
clue — exactly the "silent death" fingerprint the shim was
built to diagnose.

SIGPIPE fix verification (Group 1, commit 11e81fc9)

  • sigaction(SIGPIPE) syscall on the running PID shows
    kernel-level disposition is now SIG_IGN (was SIG_DFL before).
  • 5 + 20 + 50 broken-pipe HTTP requests in succession (raw
    sockets with SO_LINGER 0, hard-close mid-response, and
    20-half-open slow-loris style) — server stayed alive
    through all of them, /health returned 200 throughout.
  • A real pipe-write test (fork child, close read end, parent
    writes to a pipe with no readers): BrokenPipeError raised,
    process survived.
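
One alternative way to check the kernel-level disposition on Linux, without tracing sigaction, is to read the SigIgn bitmask from /proc/<pid>/status. This is a different technique from the one used in the verification above, shown only as a sketch; the helper names are hypothetical.

```python
import signal


def sigpipe_ignored(status_text):
    """Return True if the SigIgn mask in /proc/<pid>/status includes SIGPIPE."""
    for line in status_text.splitlines():
        if line.startswith("SigIgn:"):
            mask = int(line.split(":", 1)[1].strip(), 16)
            # Signal N is represented by bit N-1 of the hex mask.
            return bool(mask & (1 << (signal.SIGPIPE - 1)))
    raise ValueError("no SigIgn line found")


def read_status(pid="self"):
    """Read the status file of a process (Linux procfs only)."""
    with open(f"/proc/{pid}/status", encoding="ascii") as fh:
        return fh.read()
```

On a Linux host, `sigpipe_ignored(read_status(12345))` would report the disposition for a given PID (12345 is a placeholder).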

Diagnostic shim verification (Group 1, commits 301de49c + 191cf6b3)

  • 3 manual signal tests: kill -TERM → signal.json with
    stack + 2 threads; RuntimeError in wrapped fn →
    exception.json with traceback; kill -9 → install marker
    present, NO new signal marker (kernel untrappable, as expected).
  • 1 production SIGPIPE death captured end-to-end (above).
  • Shim is wrapped in try/except so any shim bug can never
    break the server. Markers written to
    /tmp/hermes-webui-shim/ (tmp dir, not the repo or ~/.hermes).
  • After 191cf6b3: SIGPIPE special-case verified. The shim
    writes the marker, sets SIG_IGN, and returns; subsequent
    SIGPIPEs are never delivered because SIG_IGN is back in place.
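
The handler behavior verified above (marker first, then either re-arm SIG_IGN for SIGPIPE or restore the default action and re-raise) could be sketched like this. It is an assumption-laden illustration, not api/diag_shim.py: the names `make_handler` and `install`, the marker file name, and the payload fields are all hypothetical.

```python
import json
import os
import signal
import sys
import time
import traceback

_START = time.time()


def _write_signal_marker(marker_dir, signum, frame):
    os.makedirs(marker_dir, exist_ok=True)
    threads = {str(tid): traceback.format_stack(f)
               for tid, f in sys._current_frames().items()}
    payload = {
        "pid": os.getpid(), "ppid": os.getppid(),
        "uptime_s": time.time() - _START, "signal": int(signum),
        "stack": traceback.format_stack(frame) if frame else [],
        "threads": threads,
    }
    path = os.path.join(marker_dir, f"{os.getpid()}-signal-{int(signum)}.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
    return path


def make_handler(marker_dir):
    def handler(signum, frame):
        try:
            _write_signal_marker(marker_dir, signum, frame)
        except Exception:
            pass  # a shim bug must never break the server
        if signum == signal.SIGPIPE:
            signal.signal(signal.SIGPIPE, signal.SIG_IGN)  # re-arm ignore
            return                                         # keep serving
        signal.signal(signum, signal.SIG_DFL)  # restore the default action
        os.kill(os.getpid(), signum)           # re-raise so the process dies
    return handler


def install(marker_dir, signals=(signal.SIGTERM, signal.SIGINT, signal.SIGPIPE)):
    """Install the handler; must be called from the main thread."""
    handler = make_handler(marker_dir)
    for sig in signals:
        signal.signal(sig, handler)
```

Because the SIGPIPE branch replaces the Python handler with SIG_IGN, only the first SIGPIPE in a process produces a marker in this sketch; later ones are dropped by the kernel.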

Cgroup-kill fix verification (Group 2, commits 684d73bc + 4c9a1268 + 12f322a5)

  • Pre-execv marker always fires (inside the apply lock, before
    any spawn). Confirmed in all 4 update tests.
  • First-line + install markers fire in the new process for
    successful updates (commit 9's result), proving the new
    process reached Python's startup and the diag shim was
    loaded.
  • The same code (os.fork+setsid+execvp of ctl.sh) WITHOUT
    the 15 s sleep still kills the new process in the same
    cgroup. The 15 s sleep is the entire fix.

How the 3-state decision table works (debugging toolkit for future silent deaths)

The three markers (pre-execv, first-line, install) localize any
future silent death to one of five states:

  • pre-execv only → kernel/loader/cgroup kill (Group 2's bug)
  • pre + first-line only → Python startup kill (e.g. import
    failure, missing module, env var)
  • pre + first-line + install + signal/exception marker → diag
    shim caught a signal or exception; read that marker
  • pre + first-line + install + (nothing further) → untrappable
    death AFTER diag shim was loaded (less common; could be OOM
    after some runtime state, etc.)
  • (no markers) → process never reached the pre-execv write,
    e.g. crashed before the apply lock was acquired

This generalizes the original "fix is the absence of evidence"
heuristic into a structured 5-state table. Each state has a
known fix path.
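
As a rough illustration, the table can be expressed as a small lookup over the set of markers found for one restart. The function name `classify` and the short marker labels are hypothetical; the real markers are files under /tmp/hermes-webui-shim/.

```python
def classify(markers):
    """Map the set of markers found for one restart to a diagnosis.

    `markers` may contain "pre-execv", "first-line", "install", and
    "signal" or "exception" (written by the shim when it traps something).
    """
    markers = set(markers)
    if "pre-execv" not in markers:
        return "never reached the pre-execv write"
    if "first-line" not in markers:
        return "kernel/loader/cgroup kill"
    if "install" not in markers:
        return "Python startup kill"
    if markers & {"signal", "exception"}:
        return "shim caught a signal/exception; read its marker"
    return "untrappable death after shim load"
```

The last state only means a death if the process is actually gone, so a caller would pair this with a liveness check such as the /health endpoint.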

Out of scope

  • The api/updates.py os.execv argv-shape fix (frozen-binary
    guard) — that's PR fix(updates): self-restart argv — drop redundant sys.executable prefix #3395, kept separate per the maintainer's
    review (one logical change per PR).
  • The try/except os._exit(0) last-resort branch in
    _schedule_restart is kept intact; the cgroup-kill fix
    doesn't touch it.
  • Possible follow-up #8: tighten the cron watchdog from
    */5 to */1 to drop worst-case recovery from 5 min to 1
    min if the 15 s delay ever proves insufficient. Deferred
    to a separate PR.

Backwards compatibility

  • SIGPIPE ignore: zero behavior change for well-behaved
    clients. A client that reads the entire response sees the
    same bytes; a client that disconnects mid-stream now sees
    its server-side send() raise BrokenPipeError instead of
    the server process dying.
  • Diag shim: zero behavior change. Only adds signal
    handlers and wraps serve_forever. On a clean exit, the
    install marker is the only file written. The shim is opt-in
    via install()/wrap_serve_forever() and the call sites
    are inside a try/except that falls back to plain
    serve_forever if anything goes wrong.
  • Pre-execv + first-line + strace-through-execv:
    pre-execv and first-line are always-on (one fsync per
    process, try/except wrapped). strace-through-execv is
    opt-in via HERMES_WEBUI_STRACE_EXECV=1.
  • Group 2 fix (os.fork+setsid+execvp+15s+ctl.sh start):
    changes the restart path from in-place os.execv to a
    detached fork+setsid+exec. The new process is the same
    server.py loaded by ctl.sh start; semantically
    identical to the cron watchdog's restart path. The only
    observable difference is a ~15-20 s gap between the old
    process's death and the new process's first-line marker
    (was 0 s with in-place execv).
  • The launchd guard in ctl.sh (macOS launchd path) is
    unchanged; the fork path only runs on systems where
    ctl.sh start is the restart path.

PatrickNoFilter and others added 11 commits June 2, 2026 20:52
A single broken-pipe write (browser closes the connection, mobile
backgrounds, /api/updates/check times out, etc.) terminates the entire
WebUI process via the default SIGPIPE -> Term action. No exception is
raised, no log is written, /health just goes dark.

Reproduced in production on 2026-06-02: 2m41s after a clean start, while
serving a 20.5s /api/updates/check request, the server died on SIGPIPE.
api/diag_shim.py captured the full marker to /tmp/hermes-webui-shim/:
pid 6706, signal 13, 5 active request threads, one mid-do_GET on
server.py:311. The log just stops.

With SIG_IGN, the kernel returns EPIPE to the offending send() (Python
raises BrokenPipeError). Per-request handlers can let it propagate or
catch it; the server keeps serving the rest of the clients.

Set at import time so the disposition is in effect before any
ThreadingHTTPServer worker thread writes its first response.
…otection

The shim's _signal_handler was generic across all catchable signals:
write the marker, restore SIG_DFL, re-raise. That's correct for
SIGTERM/SIGINT/SIGQUIT (the process should die after we know why),
but wrong for SIGPIPE.

server.py sets SIG_IGN on SIGPIPE at module import time so a
dropped client surfaces as BrokenPipeError on that one request
instead of killing the whole server. The shim's re-raise path
overrode that protection: it would restore SIG_DFL on SIGPIPE,
re-raise, and the process died anyway. Caught in production on
2026-06-02 14:03 UTC — PID 12029 (running with the SIGPIPE fix
already in place) still died on SIGPIPE, with the marker showing
the shim's handler had fired and re-raised the default action.

Special-case SIGPIPE in the shim: write the marker (so we still
get the forensic capture), set SIGPIPE back to SIG_IGN, and
RETURN. The request thread's send() has already returned EPIPE,
so the handler there can clean up normally. Subsequent SIGPIPEs
in the same process are also silently ignored.
- start-webui.sh: minimal launcher for manual start (sets agent dir, host,
  python, redirects logs to /tmp/hermes-webui.log)
- watchdog-loop.sh: setsid-detached 5-min /health poll that calls
  ctl.sh start on failure. Robustness improvement over the cron-based
  watchdog — runs in its own process group, so it survives terminal
  session termination on Termux+PRoot
Two complementary diagnostics for the silent post-update SIGKILL
(see silent-sigkill-diagnosis reference in the hermes-webui-self-
update-bug skill):

1. pre-execv marker. As the first action of _schedule_restart()'s
   body (inside the _apply_lock block, before _wait_until_restart_safe),
   write /tmp/hermes-webui-shim/<pid>-000-pre-execv.json with
   per-file fsync. The presence of this marker + the absence of an
   install.json from a fresh PID is the canonical evidence that the
   kill happened in the kernel between execv() and the new process's
   first Python instruction. Pairs with the first-line marker at
   the top of server.py for a 3-state decision table:
     pre + first-line + install present, no further markers
       -> kill is post-shim-load (the original mystery)
     pre + first-line, no install
       -> kill is in Python startup (import error etc.)
     pre only
       -> kill is in execve() / dynamic loader
   Wrapped in try/except so a marker write failure cannot prevent
   the restart itself.

2. env-gated strace-through-execv. When HERMES_WEBUI_STRACE_EXECV=1
   is set in the ctl.sh start environment, route execv through
   strace with -f -ttt -T -s 256. Captures every syscall of the
   new process from its very first instruction. Useful as a
   fallback diagnostic if the marker-based approach narrows the
   kill to a window that needs syscall-level detail (rare). Off
   by default; the env var is set in .env during diagnosis and
   removed afterward. The strace log lands next to the markers
   in /tmp/hermes-webui-shim/.
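
The env-gated strace path could be reduced to an argv rewrite along the lines below. This is a sketch under assumptions: the helper name `build_restart_argv` is hypothetical, and writing the log with strace's `-o` option is an assumption, since the commit only says the log lands next to the markers.

```python
def build_restart_argv(argv, environ, log_path):
    """Prefix argv with strace when HERMES_WEBUI_STRACE_EXECV=1.

    Returns a new list; the input argv is left untouched.
    """
    if environ.get("HERMES_WEBUI_STRACE_EXECV") != "1":
        return list(argv)
    # Flags taken from the commit message: -f -ttt -T -s 256.
    return ["strace", "-f", "-ttt", "-T", "-s", "256", "-o", log_path, *argv]
```

The caller would then exec the returned list, for example with `os.execvp(new_argv[0], new_argv)`.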

Also promotes `import sys` and `from datetime import ...` to module
level (they were lazy-imported inside _schedule_restart before).
Cheap; opens the door for other diagnostics in this file that
need sys/datetime without having to late-import.

Verified: `python3 -c "from api import updates"` imports cleanly;
`hasattr(updates, "_write_pre_execv_marker")` is True.
…lity

Tightest post-marker for the silent post-update SIGKILL investigation
(see the silent-sigkill-diagnosis reference in the
hermes-webui-self-update-bug skill for context).

Writes /tmp/hermes-webui-shim/<pid>-001-first-line.json with
per-file fsync as the very first executable statement in server.py,
before any import (logging, http.server, etc.). Uses only stdlib so
a broken api.* import can never be the reason this marker fails to
write. Wrapped in try/except so a marker write failure cannot
prevent startup.

Pairs with:
  - api.updates._write_pre_execv_marker() (pre-side, old process,
    written before os.execv())
  - api.diag_shim.install() (post-side, after main()'s imports)
forming a 3-state decision table for WHERE in the new process's life
a silent death happened:
  pre + first-line + install (no further markers)
    -> kill is post-shim-load (the original mystery)
  pre + first-line, no install
    -> kill is in Python startup (bad import, syntax error,
       C-extension crash)
  pre only
    -> kill is in execve() / dynamic loader (very rare)

The full 3-state table is documented at the top of
api/updates._write_pre_execv_marker() for grep-ability from the
pre-side.

Verified: file syntax-checks clean with `python3 -c "import ast;
ast.parse(open('server.py').read())"`. The marker write logic was
exercised in isolation (write + fsync + read back) and produces the
expected JSON shape.
Documents the three new observability additions in [Unreleased]:
  - pre-execv marker (api/updates.py._write_pre_execv_marker)
  - first-line marker (server.py top, before any import)
  - env-gated strace-through-execv (HERMES_WEBUI_STRACE_EXECV=1)
as a single feature: "restart-window observability for unexplained
post-update exits." Points readers at the silent-sigkill-diagnosis
reference in the hermes-webui-self-update-bug skill for the full
diagnostic playbook, since the markers are part of a forensic
investigation rather than a user-visible change.

The operational scripts (start-webui.sh, watchdog-loop.sh) are not
listed — they are internal-only files (no user-facing impact) and
the changelog audience is end-users, not operators.

This is the last of the diagnostic-feature commits. Remaining work
in this branch is mechanical: bytecode cache flush + restart +
verify the new markers fire end-to-end. Those are not commits.
…estart

The previous in-place os.execv() triggered a cgroup reclassification in
the cpuset:/top-app and /apps/uid_*/pid_* hierarchies on Termux+PRoot /
Android, which SIGKILLed the new process sub-millisecond, before any
user code ran. Confirmed via 3-state restart-window markers: pre-execv
fires, first-line and install markers from a fresh PID do not. Cron
watchdog recovers the process in <1 min, but the post-update flow was
silently dead until then.

This commit routes the restart through the same code path the cron
watchdog uses (which provably survives the cgroup transition): a
detached subprocess.Popen([ctl_path, 'start'], start_new_session=True)
followed by os._exit(0) on the old process. Brand-new process image
loaded from scratch by ctl.sh, no in-place execv, no ptrace pinning,
no cgroup reclassification of a dying pid.

Also updates the CHANGELOG and the pre-execv marker reason text to
reflect the new flow. The marker itself is still written first thing
in the lock block, so a pre-marker without a new-PID first-line +
install pair now points specifically at ctl.sh start failures
(rather than the kernel-execv window it used to indicate).

Verified end-to-end on a real ctl.sh restart + curl-driven
POST /api/updates/apply with target=webui,force=true:
  - pre-marker fires (strace_on: false, confirming clean test)
  - new PID appears within 2s via ctl.sh
  - first-line + install markers from the new PID both fire
  - /health responds ok, process stable
…or restart

The previous Path A used subprocess.Popen([ctl_path, 'start'],
start_new_session=True) + os._exit(0). It worked correctly on paper
but did NOT survive the parent's _exit on Termux+PRoot / Android:
the ctl.sh subprocess never appeared, the cron watchdog had to
recover in 37 seconds. Conjecture: Popen's start_new_session and
the daemon-thread call site race against the parent's _exit, leaving
the child reaped before it can complete setsid.

This commit uses os.fork() directly: the child calls os.setsid() to
detach from the parent's session, then os.execvp() into ctl.sh; the
parent immediately calls os._exit(0). The fork is the most primitive
POSIX process spawn available, with no Python-level intermediate
state that can race with _exit. The forked child is a separate
process in the cgroup hierarchy, so the parent's cgroup transition
kill window (the original bug) doesn't reach it either.

Verified end-to-end on a curl-driven POST /api/updates/apply with
target=webui,force=true. The detached ctl.sh starts the new process
within ~1s, no watchdog intervention needed.
…kill

After Path A's os.fork+setsid+execvp was working correctly, the new
process started by ctl.sh still died in the same cgroup kill window
the original in-place os.execv hit. Confirmed via marker analysis:
post-fork ctl.sh started PID 11658, 11658 wrote NO markers (died
before server.py line 1), cron watchdog recovered 8 seconds later
with PID 12018 which survived.

The cgroup kill window is broader than just execv: ANY new python3
process in cpuset:/top-app (Android's top-app cgroup) is at risk if
spawned within ~10 seconds of the old process's exit in the same
cgroup. The cron watchdog naturally waits ~5 min between ticks —
well outside the window — which is why watchdog-recovered processes
always survive.

This commit waits 15 seconds in the fork child before invoking
ctl.sh start, so the new process appears 15s after the old exit.
Reduces post-update downtime from ~5 min (watchdog cycle) to ~15s.

Empirically verified: cron watchdog entries confirm 5-min recovery
without my detached starter; with my detached starter and the 15s
delay the new process should appear at T+15s.

If the 15s delay proves insufficient (kill window longer than 15s,
or kill is cgroup-membership-based rather than time-based), the
fallback is to simply not attempt a restart at all and rely entirely
on the cron watchdog — the pre-marker is still informative as
'old process committed to restart at this exact moment' for any
post-mortem analysis.
@PatrickNoFilter PatrickNoFilter changed the title from "fix(server): ignore SIGPIPE + add diagnostic signal-trap shim" to "fix(server): ignore SIGPIPE + diag shim + fix(updates): outlast cgroup-kill window in post-update restart" Jun 2, 2026
@PatrickNoFilter
Author

Summary of what changed since the last review

This PR was originally a focused 3-commit trio
(11e81fc9 SIGPIPE ignore + 301de49c diag shim +
191cf6b3 shim-SIGPIPE special case). Since then, 8 more
commits were added that address a separate silent-death
pattern uncovered while testing the diag shim in
production.

TL;DR: another silent death was happening on
Termux+PRoot (Android top-app cgroup) during post-update
restarts. The diag shim didn't catch it (untrappable SIGKILL
from lmkd). After three approaches failed (the original in-place
execv, subprocess.Popen, fork+setsid+execvp without delay), the
working fix is a 15-second sleep in the fork child before
execvp'ing ctl.sh start: it puts the new python3
process outside the ~10-second cgroup kill window. Total
post-update downtime dropped from ~5 min (cron watchdog
recovery) to ~15-20 s.

3 groups, 11 commits, 2 unrelated root causes:

  • Group 1 (3 commits, original PR): SIGPIPE — one-line
    fix + diag shim + shim-SIGPIPE special case.
  • Group 2 (6 commits, new): cgroup-kill window during
    post-update restart — markers + 3-stage fix evolution
    (failed → failed → success).
  • Group 3 (2 commits, new): changelog entries.

If you'd prefer to merge these as separate PRs, the
cleanest split would be:

  • PR-A (this one, narrowed): Group 1 only, 3 commits.
  • PR-B (new): Group 2 + Group 3, 8 commits. The cgroup
    fix builds on the diagnostic markers (commits 5, 6) so
    those need to be in PR-B too.

Happy to do that split if you prefer — just say the word
and I'll rebase. The current form is one PR because that's
the lowest-touch update to this already-open PR; both
groups are independently testable and revert-safe.

Verification artifacts:

  • All 4 update tests on Termux+PRoot/aarch64 are in
    /tmp/hermes-webui-shim/ with timestamps and PIDs.
  • The pre-execv + first-line + install 3-state table
    generalizes the "absence of evidence" heuristic into
    a structured 5-state debugging toolkit.
  • Full post-mortem in the Notion vault entry (callout
    block from 2026-06-02 in the "Hermes Vault" page).

@nesquena-hermes
Collaborator

Read all 11 commits against master plus the diffs in server.py, api/updates.py, ctl.sh (start_cmd, master), and docker_init.bash. The investigation in the body is excellent — the 3-state marker table is a genuinely nice debugging artifact. Splitting feedback by group.

Group 1 (SIGPIPE) — looks correct, ship-worthy on its own

signal.signal(signal.SIGPIPE, signal.SIG_IGN) at module body in server.py runs in the main thread at import, before any ThreadingHTTPServer worker spawns, so the disposition is in place before the first response write — exactly right. The production capture (signal 13 during a streamed /api/updates/check) is a clean root-cause. The shim's SIGPIPE special-case in 191cf6b3 (write marker, re-arm SIG_IGN, return without re-raise) is the correct follow-up to 301de49c re-defaulting it. No concerns here — this is the kind of change that's safe to land as its own narrow PR (your "PR-A").

Group 2 (restart path) — the fix is unconditional, but it's only correct on Termux

This is where I'd hold. The new _schedule_restart() replaces in-place os.execv with fork + setsid + time.sleep(15) + execvp(ctl.sh start) + os._exit(0) on the parent — for every deploy target, with no platform gate (I grepped: no termux/android/uname/sys.platform guard anywhere in the new api/updates.py). That's a problem for the Docker path.

In the container, the launch is (docker_init.bash:457):

cd /app; python server.py || error_exit "hermes-webui failed or exited with an error"

python server.py is a child of docker_init.bash (PID 1), reached via plain invocation, not exec. On master, os.execv replaces server.py's image in place — same PID, PID 1 never sees its child exit, container stays up. With this PR the parent calls os._exit(0), so python server.py returns 0, docker_init.bash falls through to ok_exit "Clean exit", PID 1 exits, and the container stops. The detached fork child (mid sleep(15)) gets torn down with the container before ctl.sh start ever runs. The container only comes back via restart: unless-stopped (compose files all set it) — a full container bounce, not the in-container respawn the code intends.

There's a second mismatch: Docker never launches via ctl.sh, so routing its restart through ctl.sh start (which runs nohup bootstrap.py --foreground, writes its own PID file, and checks _current_pid/launchd) hits a process-management path that isn't the one managing the container's server.

Recommendation

Gate the new fork+sleep+ctl.sh path behind explicit Termux/Android detection (e.g. presence of /data/data/com.termux or a HERMES_WEBUI_RESTART_VIA_CTL=1 opt-in set by the Termux launcher), and keep os.execv as the default for Docker / systemd / launchd where in-place re-exec is correct and zero-downtime:

if _is_termux():           # or os.environ.get("HERMES_WEBUI_RESTART_VIA_CTL") == "1"
    _restart_via_detached_ctl()   # fork+setsid+sleep(15)+execvp(ctl.sh start)
else:
    os.execv(sys.executable, [sys.executable] + sys.argv)

That preserves the Termux fix you verified end-to-end while keeping the in-place re-exec (and ~0s downtime) every other deploy target currently relies on.
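
A possible shape for the detection helper used in the snippet above, as a sketch only: the name `is_termux` and the injectable parameters are illustrative, and whether /data/data/com.termux is visible from inside PRoot is an assumption that would need checking on a real device.

```python
import os


def is_termux(environ=None, path_exists=os.path.exists):
    """Return True when the detached ctl.sh restart path should be used."""
    environ = os.environ if environ is None else environ
    if environ.get("HERMES_WEBUI_RESTART_VIA_CTL") == "1":
        return True  # explicit opt-in set by the launcher
    return path_exists("/data/data/com.termux")


def choose_restart_path(environ=None, path_exists=os.path.exists):
    """Pick "detached-ctl" on Termux/Android, in-place "execv" elsewhere."""
    return "detached-ctl" if is_termux(environ, path_exists) else "execv"
```

Passing the environment and the path check as parameters keeps the gate testable on hosts that are neither Termux nor Docker.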

Taking up your own offer in the 17:06 comment: splitting into PR-A (Group 1, mergeable now) and PR-B (Group 2 + 3, gated as above) would let the SIGPIPE fix land immediately without blocking on the restart-path gating discussion. The markers in commits 5–6 are needed by PR-B, as you noted.
