Wire canary monitor to real device health checks#4276

Open
stisiTT wants to merge 2 commits into main from stisi/fix-media-canary-device-health-check

stisiTT commented Jun 17, 2026

Summary

The media server had a silent gap: a worker that hangs mid-inference keeps reporting healthy because the canary monitor's device probes were no-ops. This PR closes that gap.

What was broken: `BaseMetalDeviceRunner.health_check()` always returned `True`. The canary monitor called it dutifully, declared the device alive, and `/health` + `/tt-liveness` stayed at 200 — even while the worker was wedged.

What this fixes:

  • Adds a real shallow probe to `BaseMetalDeviceRunner` (pings the TTNN mesh device handle)
  • Adds deep-probe overrides to `TTDiTRunner` (FLUX, Wan, etc.) and `BaseSDXLRunner` — both replay the 2-step compiled warmup forward, reusing the existing trace
  • Flips `canary_gate_readiness=True` and `canary_deep_probe_enabled=True` so a DEAD canary surfaces as `503` on both endpoints instead of silently absorbing requests
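
A minimal sketch of the shallow/deep probe split described above. It is not the code in this PR: `FakeDevice`, `BaseRunner`, `DiffusionRunner`, `ping()` and `_replay_warmup()` are hypothetical stand-ins for the TTNN mesh device handle, `BaseMetalDeviceRunner`, the DiT/SDXL runners and the compiled warmup replay.

```python
class FakeDevice:
    """Hypothetical stand-in for a TTNN mesh device handle."""

    def __init__(self, responsive: bool = True):
        self.responsive = responsive

    def ping(self) -> None:
        # Assumption: any cheap call that raises when the handle is dead.
        if not self.responsive:
            raise RuntimeError("device did not respond")


class BaseRunner:
    """Illustrative base runner, not the real BaseMetalDeviceRunner."""

    def __init__(self, device: FakeDevice):
        self.device = device

    def health_check(self) -> bool:
        # Shallow probe: touch the device handle instead of returning True.
        try:
            self.device.ping()
        except Exception:
            return False
        return True


class DiffusionRunner(BaseRunner):
    """Illustrative deep-probe override for a diffusion-style runner."""

    def _replay_warmup(self) -> list:
        # Assumption: stands in for replaying the 2-step compiled warmup forward.
        self.device.ping()
        return [0.0, 0.0]

    def health_check(self) -> bool:
        # Deep probe: run the shallow check first, then exercise the forward path.
        if not super().health_check():
            return False
        try:
            return len(self._replay_warmup()) == 2
        except Exception:
            return False
```

The point of the override is that the deep probe only passes if the forward path actually completes, which a wedged worker cannot fake.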

Kill-switch: set `CANARY_GATE_READINESS=false` or `CANARY_DEEP_PROBE_ENABLED=false` in the container env to revert without a code change.
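
For reference, a rough sketch of how the kill-switch could be read from the environment. The `env_flag` helper and its truthy-string set are assumptions for illustration; the server's actual settings loader may parse these values differently.

```python
import os


def env_flag(name: str, default: bool, env=None) -> bool:
    """Read a boolean flag from the environment, falling back to a default."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    # Assumption: only these strings count as true; anything else is false.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Both flags default to True after this PR; the env var overrides the default.
canary_gate_readiness = env_flag("CANARY_GATE_READINESS", default=True)
canary_deep_probe_enabled = env_flag("CANARY_DEEP_PROBE_ENABLED", default=True)
```

With neither variable set, both flags stay on; exporting `CANARY_GATE_READINESS=false` turns gating off without a rebuild.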

This PR also includes the 405→503 fix from #4274 (cherry-picked), so #4274 can be closed.
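
The combined status behaviour (503 for every not-ready state, plus 503 when the canary is DEAD and gating is on) can be sketched roughly as below. `liveness_status` and the `"ALIVE"` state name are hypothetical; only `DEAD` and the 405/503/200 codes come from this PR.

```python
from http import HTTPStatus


def liveness_status(
    model_ready: bool,
    canary_state: str,
    gate_readiness: bool = True,
) -> HTTPStatus:
    """Pick the status code for /health and /tt-liveness (illustrative only)."""
    if not model_ready:
        # Warming up: "retry later" (this path used to return 405).
        return HTTPStatus.SERVICE_UNAVAILABLE
    if gate_readiness and canary_state == "DEAD":
        # A dead canary fails the probe only when gating is enabled.
        return HTTPStatus.SERVICE_UNAVAILABLE
    return HTTPStatus.OK
```

Setting `gate_readiness=False` reproduces the pre-PR behaviour, where a DEAD canary still reports 200.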

Closes #4275

Validation runs

| Model | Device | Run |
|---|---|---|
| whisper-large-v3 | p150 | 27720892450 |
| speecht5_tts | p150 | 27720895691 |
| Wan2.2-T2V-A14B-Diffusers | bh-qb-ge | 27720899389 |
| whisper-large-v3 | bh-galaxy | 27720902581 |
| Wan2.2-T2V-A14B-Diffusers | bh-galaxy | 27720905720 |
| Wan2.2-I2V-A14B-Diffusers | bh-galaxy | 27720909813 |
| stable-diffusion-xl-base-1.0 | bh-galaxy | 27720912861 |
| mochi-1-preview | bh-galaxy | 27720916194 |
| whisper-large-v3 | 6u | 27720919662 |
| FLUX.1-dev | 6u | 27720923396 |
| Wan2.2-T2V-A14B-Diffusers | 6u | 27720926831 |
| stable-diffusion-xl-base-1.0 | 6u | 27720930555 |
| mochi-1-preview | 6u | 27720933870 |

stisiTT added 2 commits June 17, 2026 17:06
The scheduler raised HTTP 405 ("Method Not Allowed") while the model was
still warming up, which is semantically wrong and confused health probes:
a shield liveness probe polling /tt-liveness received 405 and treated the
server as broken rather than warming up.

Use 503 ("Service Unavailable") for every not-ready state so probes,
the cloud console, and k8s all interpret it as "retry later." Applies
the same fix to the inference endpoints (chat, llm, audio, video,
fine_tuning) that independently re-raised 405 on the not-ready path.

BaseMetalDeviceRunner.health_check() was a no-op (always returned True),
so the canary probe loop declared every device alive regardless of actual
state. Add a real shallow check (ttnn mesh device ping) at the base level,
and deep-probe overrides in TTDiTRunner (replay 2-step forward via compiled
trace) and BaseSDXLRunner (replay _warmup_inference_block). Flip
canary_gate_readiness and canary_deep_probe_enabled to True so a DEAD
canary surfaces as 503 on /health and /tt-liveness instead of silently
routing requests into a hung worker.

Kill-switch: set CANARY_GATE_READINESS=false or CANARY_DEEP_PROBE_ENABLED=false
in the container env to revert without a code change.

Closes #4275

Development

Successfully merging this pull request may close these issues.

tt-media-server: wire canary monitor to real device health checks