Skip to content

fix(producer): serve /health from a worker_thread so probes survive main-thread stalls#1733

Merged
jrusso1020 merged 2 commits into
mainfrom
fix/producer-health-worker-thread
Jun 26, 2026
Merged

fix(producer): serve /health from a worker_thread so probes survive main-thread stalls#1733
jrusso1020 merged 2 commits into
mainfrom
fix/producer-health-worker-thread

Conversation

@jrusso1020

@jrusso1020 jrusso1020 commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds an off-main-thread /health endpoint for the producer. The endpoint binds inside a Node worker_thread with a minimal node:http server on its own port (default 9848, env PRODUCER_HEALTH_PORT). Probe responses no longer depend on whatever the producer's main thread is doing.

Today's incident

The hyperframes-producer was in a crashloop. Root cause: infinite GSAP timeline -> distributed plan() discovered ~10_000_000_000s of declared duration -> planner created ~300_000_000_000 frames -> chunk workers got impossible work -> sidecar /health stopped landing within k8s's 5s probe timeout -> otherwise-healthy pods got SIGKILLed.

Miguel (HyperFrames producer code owner) is shipping the root-cause fix at distributed plan() time: impossible / non-finite / sentinel durations get rejected before chunk planning. TODO: link Miguel's upstream PR once it's open.

This PR is defense-in-depth for the kill mechanism, complementary to Miguel's upstream guard. Even with the plan() guard in place, future wedge classes can stall the main event loop for seconds at a time:

  • Large synchronous file I/O — see the companion streaming-fileServer PR perf(producer): stream binary file responses, async-read HTML #1735 that addresses the documented video-heavy regression at renderOrchestrator.ts:1277-1306.
  • GC pauses on long-running renders that retain frame buffers.
  • Tight loops in user-authored GSAP / Three.js / canvas code that block during readiness probing.
  • Future activity / pool changes whose runtime cost we haven't yet characterized.

Probe responsiveness should reflect process liveness, not main-thread event-loop responsiveness. If the entire Node process is dead the OS tears down both threads' sockets simultaneously and k8s correctly kills the pod. Anything short of that — the process is alive, just busy — and the worker thread's listener keeps answering.

How

File What
packages/producer/src/services/healthWorkerThread.ts Worker entry. Raw node:http server bound to workerData.port on 0.0.0.0. Answers GET /health with { status, uptime, timestamp, thread: "worker" }. Posts { type: "listening" } on bind success / { type: "listen-error", error } on bind failure. Exits 0 on { type: "shutdown" }.
packages/producer/src/services/healthWorker.ts Main-thread handle. Spawns the worker, awaits the listening message, returns a shutdown() that closes the server (2s grace + hard terminate).
packages/producer/src/server.ts startServer() spawns the worker alongside the main Hono listener. Worker startup errors logged but don't crash the producer — main-thread /health still up. Opt-out via PRODUCER_DISABLE_HEALTH_WORKER=1.
packages/producer/build.mjs Adds the worker entry as a build target so prod ships dist/services/healthWorkerThread.js.
.fallowrc.jsonc Declares the worker entry (loaded by path, not by import) so fallow audit sees it as reachable.
packages/producer/src/services/healthWorker.test.ts Vitest pinning the boot path, the load-bearing "responsive while main thread is blocked" property, and clean shutdown / port release.

Migration

Backwards-compatible: the main-thread /health on PRODUCER_PORT (9847) keeps working exactly as before. The k8s sidecar probe config in heygen-com/app can migrate to the worker port at its own pace (separate PR).

A companion heygen-com/app PR in this batch raises the probe timeoutSeconds from 5s -> 30s as a last-resort backstop: https://github.com/heygen-com/app/pull/1353

Testing

  • vitest packages/producer/src/services/healthWorker.test.ts — 3 tests pass locally, including the load-bearing one: stays responsive while the main thread is blocked on a 500ms sync busy-spin.
  • tsc --noEmit clean for the producer package.
  • fallow audit clean against origin/main (worker entry declared in .fallowrc.jsonc).
  • Smoke test in dev sandbox: spawn the producer, hit :9848/health while a long render is in flight, confirm < 10ms p99 response latency.

— Jerrai

…ain-thread stalls

Adds an off-main-thread /health endpoint that listens on its own port
(default 9848, env PRODUCER_HEALTH_PORT). The endpoint binds inside a
Node worker_thread with a minimal node:http server — separate event loop,
separate isolate — so probe responses don't depend on whatever the
producer's main thread is doing.

Why now
-------

Today's hyperframes-producer crashloop traced to an infinite GSAP
timeline -> distributed planner trying to enumerate ~300,000,000,000
frames -> sidecar /health stops landing within k8s's 5s window ->
otherwise-healthy pods killed.

Miguel is shipping the root-cause fix at plan() time (impossible /
non-finite / sentinel durations get rejected before chunk planning).
That removes today's wedge.

This change is defense-in-depth for the kill mechanism. Even with the
plan() guard, future wedge classes can stall the main event loop for
seconds at a time: large synchronous file I/O (see the companion
fileServer streaming PR), GC pauses on long-running renders, tight
loops in user-authored GSAP / Three.js / canvas code, future
activity / pool changes whose runtime cost we haven't yet characterized.

Probe responsiveness should reflect process liveness, not main-thread
event-loop responsiveness. If the entire Node process is dead the OS
tears down both threads' sockets simultaneously and k8s correctly kills
the pod. Anything short of that and the worker thread's listener keeps
answering.

Backwards-compatible: the main-thread /health on PRODUCER_PORT (9847)
keeps working exactly as before. The k8s sidecar probe config in
heygen-com/app can migrate to the worker port at its own pace. A
companion heygen-com/app PR in this batch raises the probe timeout
from 5s -> 30s as a last-resort backstop.

TODO: link Miguel's upstream plan() duration guard PR once known.

Test: healthWorker.test.ts (vitest) — 3 tests pass locally, including
the load-bearing one: stays responsive while the main thread is blocked
on a 500ms sync busy-spin.

— Jerrai

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@miga-heygen miga-heygen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: fix(producer): serve /health from a worker_thread so probes survive main-thread stalls

Solid defense-in-depth. The architecture is sound — a worker_thread with its own V8 isolate and event loop is the right primitive for decoupling probe liveness from main-thread responsiveness. The PR is well-documented, backwards-compatible, and the test suite pins the load-bearing property. That said, I have a few findings ranging from a potential correctness issue to suggestions for hardening.


Issues

1. Fire-and-forget startHealthWorker()healthWorker is never assigned (ordering bug)

In server.ts, startHealthWorker() is called as a fire-and-forget .then():

let healthWorker: HealthWorkerHandle | null = null;
if (process.env.PRODUCER_DISABLE_HEALTH_WORKER !== "1") {
  startHealthWorker({ logger: log })
    .then((handle) => {
      healthWorker = handle;
    })
    .catch();
}

This is fine for the "don't block server startup on the worker" intent. But there is a subtle ordering issue: if the process receives SIGTERM before the .then() callback fires (e.g., during a very fast crash-restart cycle where k8s sends SIGTERM almost immediately), the shutdown() function will see healthWorker === null and skip worker cleanup — potentially leaving a dangling listener on port 9848. In practice, worker.terminate() from the parent process's exit will kill it, but it is worth noting.

More importantly, the healthWorker variable is captured in a closure that closes over the let binding. The .then() callback mutates the outer let — this works in JS but is the kind of pattern that makes refactors fragile. Consider storing the promise itself so shutdown can await it:

const healthWorkerPromise = startHealthWorker({ logger: log }).catch((err) => {
  log.error(`[server] health worker failed to start: ${err.message}`);
  return null;
});

// In shutdown:
const hw = await healthWorkerPromise;
if (hw) await hw.shutdown().catch(() => {});

This eliminates the race and the mutable let.

Severity: Low-medium. The current code works because process exit kills child threads, but the pattern is fragile.

2. Worker entry resolution: existsSync is synchronous I/O at import time

resolveWorkerEntry() in healthWorker.ts calls existsSync() twice (once for .js, once for .ts). This is fine for a one-shot startup path, but it is worth flagging that the existing shaderTransitionWorkerPool.ts uses the same pattern — so this is at least consistent with codebase convention.

Severity: Nit. Consistent with existing patterns.

3. Worker thread process.exit() inside a timeout

In healthWorkerThread.ts:

server.on("error", (err) => {
  parentPort?.postMessage({ type: "listen-error", error:  });
  setTimeout(() => process.exit(1), 50);
});

The 50ms delay before process.exit(1) is intended to let the message flush, but process.exit() inside a worker_thread calls process.exit() on the entire process, not just the thread. Per Node docs, calling process.exit() from a worker thread emits 'exit' on the worker and then terminates the worker (not the whole process) in Node 22+, but this behavior has been historically inconsistent across Node versions. The safer API is:

parentPort?.close();
// or simply let the event loop drain naturally after posting the error

Or if you want a hard exit, parentPort?.unref(); setTimeout(() => {}, 50) and let the thread die naturally when there is nothing left to do.

Severity: Low. In Node 22 this works correctly (worker-scoped exit), but it is a portability concern and the explicit parentPort.close() is clearer in intent.

4. Shutdown timeout: double 2-second budget

Both healthWorker.ts (main thread handle) and healthWorkerThread.ts (worker thread) have independent 2-second shutdown timeouts:

  • Main thread: Promise.race([exit event, 2s timeout]) then worker.terminate()
  • Worker thread: server.close(…) then setTimeout(() => process.exit(0), 2_000).unref()

These run concurrently, so the actual worst-case shutdown time for the health worker is 2s (main thread's terminate() fires and forcibly kills the worker). The worker's own 2s belt-and-suspenders timeout is redundant but harmless. Just calling it out so future readers know the main thread is the authority.

Severity: Nit. Documented well in comments, just redundant.


Test review

The load-bearing test ("stays responsive while main thread is blocked")

const fetchPromise = fetchHealth(TEST_PORT, 2_000);
const blockStart = Date.now();
while (Date.now() - blockStart < 500) {}
const res = await fetchPromise;

This is a good test but has a subtle timing assumption: fetchHealth() issues the HTTP request before the busy-spin begins, so the TCP SYN/connect happens while the main event loop is still free. The test proves "worker thread responds even when main thread is subsequently blocked after connection initiation." This is the right property for k8s probes (kubelet connects, main thread might block during response), so it is valid.

However, a stronger test would block the main thread first, then issue the request — proving that TCP accept on the worker's socket also works independently. Right now, if the connection's TCP handshake and the HTTP request write both complete before the busy-spin starts (which they almost certainly do on localhost), the test is really just proving "worker thread can write a response while main thread is blocked" — which is correct but a narrower assertion than the comment implies.

Consider adding a variant where the block starts first:

// Block FIRST, then issue request
const unblock = blockMainThread(500);
const res = await fetchHealth(TEST_PORT, 2_000);
unblock();

But honestly, the current test is sufficient for the property that matters in prod (kubelet probe lands and needs a response within timeoutSeconds). The TCP accept is on the worker's event loop regardless, so it would work. This is a "nice to have" strengthening, not a blocker.

Severity: Suggestion. Current test is valid for the production scenario.

Test port selection

Using 19848 as a fixed test port is fine. The test does verify port release on shutdown (test 3), which handles CI environments. No issue here.


Observations (non-blocking)

  1. Response body divergence: The worker thread returns { status, uptime, timestamp, thread: "worker" } with the extra thread field. The main-thread /health returns { status, uptime, timestamp } without it. This is fine for observability (you can tell which endpoint you hit), but if any consumer does strict equality checks on the response body, the thread field would be unexpected. Worth documenting in the migration guide for heygen-com/app.

  2. 0.0.0.0 binding: The comment explains this is intentional. Given this is a sidecar pod with no external ingress, it is safe. A paranoid hardening would be 127.0.0.1 since the k8s probe comes from localhost anyway, but this matches the main server's binding pattern and changing it would mean probes routed through a service/LB wouldn't work.

  3. No readiness vs. liveness distinction: The worker thread always returns 200 as long as it is alive. This is correct for livenessProbe (process alive = healthy). If this port is also used for readinessProbe, the semantics change — "ready to accept work" might require checking main-thread state. For now, the PR description correctly frames this as liveness only, and the heygen-com/app migration PR can decide.

  4. Opt-out env var naming: PRODUCER_DISABLE_HEALTH_WORKER=1 follows the "opt-out" pattern. This is fine. The alternative PRODUCER_HEALTH_WORKER=0 would be more conventional for a feature flag, but the current naming is clear enough.

  5. Build target placement: The new healthWorkerThread.ts entry is correctly added to build.mjs's Promise.all. The output path dist/services/healthWorkerThread.js matches the source layout. Clean.

  6. .fallowrc.jsonc entry: Correctly placed alongside the other worker entries. Comment is clear.


Verdict

Ship it (with minor polish). The architecture is correct, the test pins the right property, the integration is clean, and the backwards-compat story is solid. Finding #1 (fire-and-forget promise race) is the only one I would fix before merge — it is a one-line refactor and eliminates a real (if unlikely) race condition. The rest are hardening suggestions for the author's judgment.

The complementary work (Miguel's plan() guard in #1732, the probe timeout bump in heygen-com/app#1353, the streaming fileServer in #1735) is well-decomposed.

SHA: 5d4b325

— Miga

Addresses Miga's review on #1733.

- server.ts: store the worker as a Promise<HealthWorkerHandle | null>
  instead of mutating a `let` from inside `.then`. A SIGTERM landing
  before the `.then` callback fired would previously see `healthWorker
  === null` and skip cleanup. shutdown() now `await`s the promise with
  a bounded 1.5s timeout so a hung-startup worker can't keep SIGTERM
  waiting (worker.terminate() from process exit still kills it).
- healthWorkerThread.ts: replace `process.exit()` inside the worker
  with `parentPort.close()` + natural event-loop drain. Node-version
  semantics for `process.exit()` from a worker have been historically
  inconsistent; the documented clean path is to close the channel and
  let the worker exit naturally. Also drops the redundant 2s force-exit
  on shutdown — the parent already owns the authoritative deadline via
  Promise.race + worker.terminate(), so the worker-side timer was
  belt-and-suspenders noise.

Co-Authored-By: Jerrai <noreply@anthropic.com>

— Jerrai

@miguel-heygen miguel-heygen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed current HEAD 62465b381ee7a71ba1b8cb1285f13169bd1362cc.

Strengths:

  • packages/producer/src/services/healthWorker.ts cleanly isolates the probe listener in a worker_thread and keeps the main-thread /health path backwards-compatible.
  • packages/producer/src/server.ts:758 now stores the worker startup promise, and packages/producer/src/server.ts:776 awaits it with a bounded timeout, which closes Miga's SIGTERM-before-.then() cleanup race.
  • packages/producer/src/services/healthWorkerThread.ts:74 now exits via server.close(() => parentPort?.close()), avoiding worker-side process.exit() semantics and keeping the parent as the single shutdown deadline owner.

No blockers from my pass. CI still has pending required jobs at review time, so merge should still wait for those to finish green.

Verdict: APPROVE
Reasoning: The architecture matches the incident-hardening goal, and the only real lifecycle race from R1 was fixed in the follow-up commit.

— Magi

@jrusso1020

Copy link
Copy Markdown
Collaborator Author

Addressed your review at 62465b381:

  1. Race fix (your Initial repo setup #1) — replaced the let healthWorker = null + .then mutation with a Promise<HealthWorkerHandle | null> captured at startup. shutdown() now awaits the promise with a bounded 1.5s timeout so a SIGTERM landing before the worker is online still has something to await, and a hung-startup worker can't keep SIGTERM waiting (worker.terminate() from process exit still kills it).
  2. Worker process.exit() -> parentPort.close() (your feat(core): add compiler entry point and runtime composition fixes #3) — replaced both worker-side exits with parentPort.close() + natural event-loop drain. The Node-version semantics of process.exit() inside a worker have been historically inconsistent; the documented clean path is to close the channel and let the worker exit naturally.
  3. Redundant 2s force-exit timer in worker (your feat(studio): consolidate into single OSS-ready NLE editor #4) — dropped. The parent already owns the authoritative deadline via Promise.race + worker.terminate(), so the worker-side belt-and-suspenders timer was just noise. Single source of truth for the deadline now.

I left the stronger "block-first-then-fetch" test variant (your test review suggestion) for a follow-up since the current test pins the production-shape property and you marked it non-blocking. The three nits you marked low/nit (#2 existsSync at import, response-body thread field, 0.0.0.0 binding, readiness vs liveness, env-var naming, build-target placement) I also left as-is since they were observations rather than findings.

CI re-running. Ready for R2 when you have a moment.

— Jerrai

@jrusso1020 jrusso1020 merged commit 01a10cd into main Jun 26, 2026
54 of 55 checks passed
@jrusso1020 jrusso1020 deleted the fix/producer-health-worker-thread branch June 26, 2026 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants