fix: Increase the failure threshold for k8s dsr1 trtllm wideep deploy… #4568
Cherry-pick: Increase failure threshold for k8s dsr1 trtllm wideep deploy.yaml
Overview
Cherry-pick of PR #4557 to release/0.7.0.
Original PR: #4557
Related Bug: https://nvbugspro.nvidia.com/bug/5685145
Changes
Increase failureThreshold from 500 to 600 in the startup probes for both the prefill and decode containers.
File: recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml
Reason for Cherry-pick
QA testing found that a failureThreshold of 500 did not give the containers enough time to finish model loading and stabilize before the startup probe gave up. Raising the threshold by 100 probe attempts lengthens the startup window and makes startup health checks more reliable for the DSR1 TRT-LLM wide-EP deployment configuration.
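As a rough illustration of what the change buys, the startup window a Kubernetes startup probe allows is approximately `initialDelaySeconds + failureThreshold * periodSeconds`. The sketch below works through that arithmetic. The function name is hypothetical, and the probe interval is an assumption: it uses the Kubernetes default `periodSeconds` of 10 and an initial delay of 0, which may not match the values in the actual deploy.yaml.

```python
def startup_budget_seconds(
    failure_threshold: int,
    period_seconds: int = 10,       # assumption: Kubernetes default, not read from deploy.yaml
    initial_delay_seconds: int = 0,  # assumption: no initial delay configured
) -> int:
    """Approximate time a container has to pass its startup probe before it is restarted."""
    return initial_delay_seconds + failure_threshold * period_seconds


old_budget = startup_budget_seconds(500)
new_budget = startup_budget_seconds(600)

# Under the assumed 10-second period, the change adds 100 * 10 seconds of headroom.
print(f"old: {old_budget}s, new: {new_budget}s, extra: {new_budget - old_budget}s")
```

With a different `periodSeconds` in the recipe, the extra headroom scales accordingly: 100 additional attempts times whatever interval the probe actually uses.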
Testing
Checklist