Wire canary monitor to real device health checks#4276
Open
stisiTT wants to merge 2 commits into
The scheduler raised HTTP 405 ("Method Not Allowed") while the model was
still warming up, which is semantically wrong and confused health probes:
a shield liveness probe polling /tt-liveness received 405 and treated the
server as broken rather than warming up.
Use 503 ("Service Unavailable") for every not-ready state so probes,
the cloud console, and k8s all interpret it as "retry later." Applies
the same fix to the inference endpoints (chat, llm, audio, video,
fine_tuning) that independently re-raised 405 on the not-ready path.
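The mapping this commit describes can be sketched with the standard library alone. `ReadyState` and `status_for` are hypothetical names chosen for illustration; the actual scheduler code is not shown here and may track different states.

```python
from enum import Enum
from http import HTTPStatus


class ReadyState(Enum):
    """Hypothetical readiness states (assumption, not the server's real enum)."""
    WARMING_UP = "warming_up"
    READY = "ready"
    DEAD = "dead"


def status_for(state: ReadyState) -> HTTPStatus:
    """Return 200 only when ready; every not-ready state maps to 503."""
    if state is ReadyState.READY:
        return HTTPStatus.OK
    # 503 tells probes and orchestrators "retry later", unlike 405,
    # which says the HTTP method itself is not supported.
    return HTTPStatus.SERVICE_UNAVAILABLE
```

Routing every not-ready state through one helper keeps the liveness route and the inference endpoints from drifting apart again.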
BaseMetalDeviceRunner.health_check() was a no-op (always returned True),
so the canary probe loop declared every device alive regardless of
actual state.

Add a real shallow check (ttnn mesh device ping) at the base level, and
deep-probe overrides in TTDiTRunner (replay 2-step forward via compiled
trace) and BaseSDXLRunner (replay _warmup_inference_block).

Flip canary_gate_readiness and canary_deep_probe_enabled to True so a
DEAD canary surfaces as 503 on /health and /tt-liveness instead of
silently routing requests into a hung worker.

Kill-switch: set CANARY_GATE_READINESS=false or
CANARY_DEEP_PROBE_ENABLED=false in the container env to revert without
a code change.

Closes #4275
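A real probe can itself hang on a wedged device, so a monitor would typically bound each check in time. The sketch below is one way to do that, under the assumption that a probe is a plain callable returning a bool. `probe_with_timeout` is a hypothetical helper, not part of this PR, and the PR does not state how the canary loop handles timeouts.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def probe_with_timeout(check: Callable[[], bool], timeout_s: float) -> bool:
    """Run a health check in a worker thread.

    A hang (timeout) or an exception counts as a failed probe.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        return bool(future.result(timeout=timeout_s))
    except FutureTimeout:
        # The check did not return in time: treat the device as unhealthy.
        return False
    except Exception:
        # The check raised: also unhealthy.
        return False
    finally:
        # Do not block the caller waiting for a possibly hung thread.
        pool.shutdown(wait=False)
```

A no-op check such as `lambda: True` always passes this wrapper, which is exactly the failure mode the commit removes. The wrapper only helps once the underlying check touches the device.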
Summary
The media server had a silent gap: a worker that hangs mid-inference keeps reporting healthy because the canary monitor's device probes were no-ops. This PR closes that gap.
What was broken: `BaseMetalDeviceRunner.health_check()` always returned `True`. The canary monitor called it dutifully, declared the device alive, and `/health` + `/tt-liveness` stayed at 200 — even while the worker was wedged.
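What this fixes:
- `BaseMetalDeviceRunner.health_check()` now performs a real shallow check (ttnn mesh device ping).
- `TTDiTRunner` and `BaseSDXLRunner` override it with deep probes: a 2-step forward replayed via the compiled trace, and a replay of `_warmup_inference_block`, respectively.
- `canary_gate_readiness` and `canary_deep_probe_enabled` are flipped to `True`, so a DEAD canary surfaces as 503 on `/health` and `/tt-liveness` instead of silently routing requests into a hung worker.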
Kill-switch: set `CANARY_GATE_READINESS=false` or `CANARY_DEEP_PROBE_ENABLED=false` in the container env to revert without a code change.
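As a rough illustration of how such a kill-switch might gate the health response, here is a minimal sketch. The `env_flag` and `health_status` helpers and the set of accepted "false" spellings are assumptions for this example; the server's actual settings loader may parse these variables differently.

```python
import os


def env_flag(name: str, default: bool = True) -> bool:
    """Read a boolean env var; unset means `default` (parsing rules are assumed)."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in {"false", "0", "no", "off"}


def health_status(canary_alive: bool) -> int:
    """Return the HTTP status a health route would report."""
    gate = env_flag("CANARY_GATE_READINESS")
    if gate and not canary_alive:
        # Gating enabled and the canary is DEAD: tell callers to retry later.
        return 503
    # Either the canary is alive or gating was switched off via the env.
    return 200
```

With the variable unset, a dead canary yields 503. Setting `CANARY_GATE_READINESS=false` restores the old behavior of always reporting 200.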
This PR also includes the 405→503 fix from #4274 (cherry-picked), so #4274 can be closed.
Closes #4275
Validation runs