fix(producer): serve /health from a worker_thread so probes survive main-thread stalls#1733
Conversation
…ain-thread stalls Adds an off-main-thread /health endpoint that listens on its own port (default 9848, env PRODUCER_HEALTH_PORT). The endpoint binds inside a Node worker_thread with a minimal node:http server — separate event loop, separate isolate — so probe responses don't depend on whatever the producer's main thread is doing. Why now ------- Today's hyperframes-producer crashloop traced to an infinite GSAP timeline -> distributed planner trying to enumerate ~300,000,000,000 frames -> sidecar /health stops landing within k8s's 5s window -> otherwise-healthy pods killed. Miguel is shipping the root-cause fix at plan() time (impossible / non-finite / sentinel durations get rejected before chunk planning). That removes today's wedge. This change is defense-in-depth for the kill mechanism. Even with the plan() guard, future wedge classes can stall the main event loop for seconds at a time: large synchronous file I/O (see the companion fileServer streaming PR), GC pauses on long-running renders, tight loops in user-authored GSAP / Three.js / canvas code, future activity / pool changes whose runtime cost we haven't yet characterized. Probe responsiveness should reflect process liveness, not main-thread event-loop responsiveness. If the entire Node process is dead the OS tears down both threads' sockets simultaneously and k8s correctly kills the pod. Anything short of that and the worker thread's listener keeps answering. Backwards-compatible: the main-thread /health on PRODUCER_PORT (9847) keeps working exactly as before. The k8s sidecar probe config in heygen-com/app can migrate to the worker port at its own pace. A companion heygen-com/app PR in this batch raises the probe timeout from 5s -> 30s as a last-resort backstop. TODO: link Miguel's upstream plan() duration guard PR once known. Test: healthWorker.test.ts (vitest) — 3 tests pass locally, including the load-bearing one: stays responsive while the main thread is blocked on a 500ms sync busy-spin. — Jerrai Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
miga-heygen
left a comment
There was a problem hiding this comment.
Review: fix(producer): serve /health from a worker_thread so probes survive main-thread stalls
Solid defense-in-depth. The architecture is sound — a worker_thread with its own V8 isolate and event loop is the right primitive for decoupling probe liveness from main-thread responsiveness. The PR is well-documented, backwards-compatible, and the test suite pins the load-bearing property. That said, I have a few findings ranging from a potential correctness issue to suggestions for hardening.
Issues
1. Fire-and-forget startHealthWorker() — healthWorker is never assigned (ordering bug)
In server.ts, startHealthWorker() is called as a fire-and-forget .then():
let healthWorker: HealthWorkerHandle | null = null;
if (process.env.PRODUCER_DISABLE_HEALTH_WORKER !== "1") {
startHealthWorker({ logger: log })
.then((handle) => {
healthWorker = handle;
})
.catch(…);
}This is fine for the "don't block server startup on the worker" intent. But there is a subtle ordering issue: if the process receives SIGTERM before the .then() callback fires (e.g., during a very fast crash-restart cycle where k8s sends SIGTERM almost immediately), the shutdown() function will see healthWorker === null and skip worker cleanup — potentially leaving a dangling listener on port 9848. In practice, worker.terminate() from the parent process's exit will kill it, but it is worth noting.
More importantly, the healthWorker variable is captured in a closure that closes over the let binding. The .then() callback mutates the outer let — this works in JS but is the kind of pattern that makes refactors fragile. Consider storing the promise itself so shutdown can await it:
const healthWorkerPromise = startHealthWorker({ logger: log }).catch((err) => {
log.error(`[server] health worker failed to start: ${err.message}`);
return null;
});
// In shutdown:
const hw = await healthWorkerPromise;
if (hw) await hw.shutdown().catch(() => {});This eliminates the race and the mutable let.
Severity: Low-medium. The current code works because process exit kills child threads, but the pattern is fragile.
2. Worker entry resolution: existsSync is synchronous I/O at import time
resolveWorkerEntry() in healthWorker.ts calls existsSync() twice (once for .js, once for .ts). This is fine for a one-shot startup path, but it is worth flagging that the existing shaderTransitionWorkerPool.ts uses the same pattern — so this is at least consistent with codebase convention.
Severity: Nit. Consistent with existing patterns.
3. Worker thread process.exit() inside a timeout
In healthWorkerThread.ts:
server.on("error", (err) => {
parentPort?.postMessage({ type: "listen-error", error: … });
setTimeout(() => process.exit(1), 50);
});The 50ms delay before process.exit(1) is intended to let the message flush, but process.exit() inside a worker_thread calls process.exit() on the entire process, not just the thread. Per Node docs, calling process.exit() from a worker thread emits 'exit' on the worker and then terminates the worker (not the whole process) in Node 22+, but this behavior has been historically inconsistent across Node versions. The safer API is:
parentPort?.close();
// or simply let the event loop drain naturally after posting the errorOr if you want a hard exit, parentPort?.unref(); setTimeout(() => {}, 50) and let the thread die naturally when there is nothing left to do.
Severity: Low. In Node 22 this works correctly (worker-scoped exit), but it is a portability concern and the explicit parentPort.close() is clearer in intent.
4. Shutdown timeout: double 2-second budget
Both healthWorker.ts (main thread handle) and healthWorkerThread.ts (worker thread) have independent 2-second shutdown timeouts:
- Main thread:
Promise.race([exit event, 2s timeout])thenworker.terminate() - Worker thread:
server.close(…)thensetTimeout(() => process.exit(0), 2_000).unref()
These run concurrently, so the actual worst-case shutdown time for the health worker is 2s (main thread's terminate() fires and forcibly kills the worker). The worker's own 2s belt-and-suspenders timeout is redundant but harmless. Just calling it out so future readers know the main thread is the authority.
Severity: Nit. Documented well in comments, just redundant.
Test review
The load-bearing test ("stays responsive while main thread is blocked")
const fetchPromise = fetchHealth(TEST_PORT, 2_000);
const blockStart = Date.now();
while (Date.now() - blockStart < 500) {}
const res = await fetchPromise;This is a good test but has a subtle timing assumption: fetchHealth() issues the HTTP request before the busy-spin begins, so the TCP SYN/connect happens while the main event loop is still free. The test proves "worker thread responds even when main thread is subsequently blocked after connection initiation." This is the right property for k8s probes (kubelet connects, main thread might block during response), so it is valid.
However, a stronger test would block the main thread first, then issue the request — proving that TCP accept on the worker's socket also works independently. Right now, if the connection's TCP handshake and the HTTP request write both complete before the busy-spin starts (which they almost certainly do on localhost), the test is really just proving "worker thread can write a response while main thread is blocked" — which is correct but a narrower assertion than the comment implies.
Consider adding a variant where the block starts first:
// Block FIRST, then issue request
const unblock = blockMainThread(500);
const res = await fetchHealth(TEST_PORT, 2_000);
unblock();But honestly, the current test is sufficient for the property that matters in prod (kubelet probe lands and needs a response within timeoutSeconds). The TCP accept is on the worker's event loop regardless, so it would work. This is a "nice to have" strengthening, not a blocker.
Severity: Suggestion. Current test is valid for the production scenario.
Test port selection
Using 19848 as a fixed test port is fine. The test does verify port release on shutdown (test 3), which handles CI environments. No issue here.
Observations (non-blocking)
-
Response body divergence: The worker thread returns
{ status, uptime, timestamp, thread: "worker" }with the extrathreadfield. The main-thread/healthreturns{ status, uptime, timestamp }without it. This is fine for observability (you can tell which endpoint you hit), but if any consumer does strict equality checks on the response body, thethreadfield would be unexpected. Worth documenting in the migration guide forheygen-com/app. -
0.0.0.0binding: The comment explains this is intentional. Given this is a sidecar pod with no external ingress, it is safe. A paranoid hardening would be127.0.0.1since the k8s probe comes from localhost anyway, but this matches the main server's binding pattern and changing it would mean probes routed through a service/LB wouldn't work. -
No readiness vs. liveness distinction: The worker thread always returns 200 as long as it is alive. This is correct for
livenessProbe(process alive = healthy). If this port is also used forreadinessProbe, the semantics change — "ready to accept work" might require checking main-thread state. For now, the PR description correctly frames this as liveness only, and theheygen-com/appmigration PR can decide. -
Opt-out env var naming:
PRODUCER_DISABLE_HEALTH_WORKER=1follows the "opt-out" pattern. This is fine. The alternativePRODUCER_HEALTH_WORKER=0would be more conventional for a feature flag, but the current naming is clear enough. -
Build target placement: The new
healthWorkerThread.tsentry is correctly added tobuild.mjs'sPromise.all. The output pathdist/services/healthWorkerThread.jsmatches the source layout. Clean. -
.fallowrc.jsoncentry: Correctly placed alongside the other worker entries. Comment is clear.
Verdict
Ship it (with minor polish). The architecture is correct, the test pins the right property, the integration is clean, and the backwards-compat story is solid. Finding #1 (fire-and-forget promise race) is the only one I would fix before merge — it is a one-line refactor and eliminates a real (if unlikely) race condition. The rest are hardening suggestions for the author's judgment.
The complementary work (Miguel's plan() guard in #1732, the probe timeout bump in heygen-com/app#1353, the streaming fileServer in #1735) is well-decomposed.
SHA: 5d4b325
— Miga
Addresses Miga's review on #1733. - server.ts: store the worker as a Promise<HealthWorkerHandle | null> instead of mutating a `let` from inside `.then`. A SIGTERM landing before the `.then` callback fired would previously see `healthWorker === null` and skip cleanup. shutdown() now `await`s the promise with a bounded 1.5s timeout so a hung-startup worker can't keep SIGTERM waiting (worker.terminate() from process exit still kills it). - healthWorkerThread.ts: replace `process.exit()` inside the worker with `parentPort.close()` + natural event-loop drain. Node-version semantics for `process.exit()` from a worker have been historically inconsistent; the documented clean path is to close the channel and let the worker exit naturally. Also drops the redundant 2s force-exit on shutdown — the parent already owns the authoritative deadline via Promise.race + worker.terminate(), so the worker-side timer was belt-and-suspenders noise. Co-Authored-By: Jerrai <noreply@anthropic.com> — Jerrai
miguel-heygen
left a comment
There was a problem hiding this comment.
Reviewed current HEAD 62465b381ee7a71ba1b8cb1285f13169bd1362cc.
Strengths:
packages/producer/src/services/healthWorker.tscleanly isolates the probe listener in a worker_thread and keeps the main-thread/healthpath backwards-compatible.packages/producer/src/server.ts:758now stores the worker startup promise, andpackages/producer/src/server.ts:776awaits it with a bounded timeout, which closes Miga's SIGTERM-before-.then()cleanup race.packages/producer/src/services/healthWorkerThread.ts:74now exits viaserver.close(() => parentPort?.close()), avoiding worker-sideprocess.exit()semantics and keeping the parent as the single shutdown deadline owner.
No blockers from my pass. CI still has pending required jobs at review time, so merge should still wait for those to finish green.
Verdict: APPROVE
Reasoning: The architecture matches the incident-hardening goal, and the only real lifecycle race from R1 was fixed in the follow-up commit.
— Magi
|
Addressed your review at
I left the stronger "block-first-then-fetch" test variant (your test review suggestion) for a follow-up since the current test pins the production-shape property and you marked it non-blocking. The three nits you marked low/nit (#2 existsSync at import, response-body CI re-running. Ready for R2 when you have a moment. — Jerrai |
Summary
Adds an off-main-thread
/healthendpoint for the producer. The endpoint binds inside a Nodeworker_threadwith a minimalnode:httpserver on its own port (default9848, envPRODUCER_HEALTH_PORT). Probe responses no longer depend on whatever the producer's main thread is doing.Today's incident
The hyperframes-producer was in a crashloop. Root cause: infinite GSAP timeline -> distributed
plan()discovered ~10_000_000_000sof declared duration -> planner created ~300_000_000_000frames -> chunk workers got impossible work -> sidecar/healthstopped landing within k8s's 5s probe timeout -> otherwise-healthy pods got SIGKILLed.Miguel (HyperFrames producer code owner) is shipping the root-cause fix at distributed
plan()time: impossible / non-finite / sentinel durations get rejected before chunk planning. TODO: link Miguel's upstream PR once it's open.This PR is defense-in-depth for the kill mechanism, complementary to Miguel's upstream guard. Even with the
plan()guard in place, future wedge classes can stall the main event loop for seconds at a time:renderOrchestrator.ts:1277-1306.Probe responsiveness should reflect process liveness, not main-thread event-loop responsiveness. If the entire Node process is dead the OS tears down both threads' sockets simultaneously and k8s correctly kills the pod. Anything short of that — the process is alive, just busy — and the worker thread's listener keeps answering.
How
packages/producer/src/services/healthWorkerThread.tsnode:httpserver bound toworkerData.porton0.0.0.0. AnswersGET /healthwith{ status, uptime, timestamp, thread: "worker" }. Posts{ type: "listening" }on bind success /{ type: "listen-error", error }on bind failure. Exits 0 on{ type: "shutdown" }.packages/producer/src/services/healthWorker.tslisteningmessage, returns ashutdown()that closes the server (2s grace + hard terminate).packages/producer/src/server.tsstartServer()spawns the worker alongside the main Hono listener. Worker startup errors logged but don't crash the producer — main-thread/healthstill up. Opt-out viaPRODUCER_DISABLE_HEALTH_WORKER=1.packages/producer/build.mjsdist/services/healthWorkerThread.js..fallowrc.jsoncfallow auditsees it as reachable.packages/producer/src/services/healthWorker.test.tsMigration
Backwards-compatible: the main-thread
/healthonPRODUCER_PORT(9847) keeps working exactly as before. The k8s sidecar probe config inheygen-com/appcan migrate to the worker port at its own pace (separate PR).A companion
heygen-com/appPR in this batch raises the probetimeoutSecondsfrom5s->30sas a last-resort backstop: https://github.com/heygen-com/app/pull/1353Testing
packages/producer/src/services/healthWorker.test.ts— 3 tests pass locally, including the load-bearing one: stays responsive while the main thread is blocked on a 500ms sync busy-spin.tsc --noEmitclean for the producer package.fallow auditclean againstorigin/main(worker entry declared in.fallowrc.jsonc).:9848/healthwhile a long render is in flight, confirm< 10msp99 response latency.— Jerrai