@dagil-nvidia
Contributor

Cherry-pick: Increase failure threshold for k8s dsr1 trtllm wideep deploy.yaml

Overview

Cherry-pick of PR #4557 to release/0.7.0.

Original PR: #4557
Related Bug: https://nvbugspro.nvidia.com/bug/5685145

Changes

  • Increased failureThreshold from 500 to 600 in startup probes for both prefill and decode containers
  • File: recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml

Reason for Cherry-pick

QA testing found that a failureThreshold of 500 probe attempts did not leave enough time for the model to load and stabilize. The additional 100 attempts give the startup probes more headroom in the DSR1 TRT-LLM wide-EP deployment configuration.
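
To put the threshold change in terms of wall-clock time, here is a minimal sketch of the startup budget a probe allows, computed as roughly failureThreshold × periodSeconds. The periodSeconds value is not stated in this PR, so the sketch assumes the Kubernetes default of 10 seconds; the function name and constant are illustrative and do not come from the recipe.

```python
# Assumption: Kubernetes' documented default for a probe's periodSeconds is 10.
# The actual deploy.yaml may set a different value; check the recipe before relying on this.
DEFAULT_PERIOD_SECONDS = 10


def startup_budget_seconds(failure_threshold: int,
                           period_seconds: int = DEFAULT_PERIOD_SECONDS) -> int:
    """Approximate time a container has to pass its startup probe
    before the kubelet gives up and restarts it."""
    if failure_threshold < 1 or period_seconds < 1:
        raise ValueError("failure_threshold and period_seconds must be >= 1")
    return failure_threshold * period_seconds


old_budget = startup_budget_seconds(500)  # threshold before this change
new_budget = startup_budget_seconds(600)  # threshold after this change

print(f"old budget: {old_budget} s (~{old_budget / 60:.0f} min)")
print(f"new budget: {new_budget} s (~{new_budget / 60:.0f} min)")
print(f"extra headroom: {new_budget - old_budget} s")
```

Under that assumed 10-second period, the change moves the startup budget from about 5000 s to about 6000 s. If the recipe sets its own periodSeconds, pass it as the second argument.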

Testing

  • Original PR was tested and merged to main on Nov 25, 2025
  • CI checks passed on main branch

Checklist

  • Cherry-picked from main (commit: d6aa4a0)
  • DCO sign-off applied
  • Update Cherry Pick table in release canvas
  • Notify Release PiC in #swdl-dynamo-release

@dagil-nvidia requested review from a team as code owners November 25, 2025 01:06
copy-pr-bot bot commented Nov 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

@nv-tusharma merged commit c203bc4 into release/0.7.0 Nov 25, 2025
11 checks passed
@nv-tusharma deleted the dagil/cherrypick-4557-timeout branch November 25, 2025 01:19