@tanmayv25 tanmayv25 commented Nov 24, 2025

Overview:

QA found that a failure threshold of 500 probe iterations might not be sufficient. They suggest adding 100 more iterations (600 total) to give model loading more time and improve startup stability.

Check https://nvbugspro.nvidia.com/bug/5685145 for more details.

Summary by CodeRabbit

  • Chores
    • Updated deployment configuration parameters to optimize system startup and reliability handling.

@tanmayv25 tanmayv25 requested review from a team as code owners November 24, 2025 19:18
@tzulingk

nit: could you mention https://nvbugspro.nvidia.com/bug/5685145 in the PR overview?

coderabbitai bot commented Nov 24, 2025

Walkthrough

A configuration update that increases the failureThreshold values from 500 to 600 in startup probe blocks for both prefill and decode containers within a Kubernetes deployment YAML file.
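As a rough illustration of what the change buys, a startup probe tolerates about `initialDelaySeconds + failureThreshold * periodSeconds` seconds before Kubernetes restarts the container. The sketch below computes that budget for the old and new thresholds. The thread does not state the probe's `periodSeconds` or `initialDelaySeconds`, so the values used here are the Kubernetes defaults (10 and 0) and are assumptions, not values taken from `deploy.yaml`; the function name is hypothetical.

```python
def startup_budget_seconds(
    failure_threshold: int,
    period_seconds: int = 10,        # assumption: Kubernetes default, not read from deploy.yaml
    initial_delay_seconds: int = 0,  # assumption: Kubernetes default, not read from deploy.yaml
) -> int:
    """Approximate time a container has to pass its startup probe.

    The kubelet probes every `period_seconds` and restarts the container
    after `failure_threshold` consecutive failures.
    """
    return initial_delay_seconds + failure_threshold * period_seconds


old_budget = startup_budget_seconds(500)
new_budget = startup_budget_seconds(600)

print(f"old budget: {old_budget} s")
print(f"new budget: {new_budget} s")
print(f"extra time: {new_budget - old_budget} s")
```

Under those assumed defaults, the extra 100 iterations add 1000 seconds of startup headroom; with a different `periodSeconds` in the actual manifest the gain scales proportionally.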

Changes

| Cohort / File(s) | Summary |
|---|---|
| Kubernetes Deployment Configuration: recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml | Increased failureThreshold in startupProbe from 500 to 600 for prefill container; increased failureThreshold in startupProbe from 500 to 600 for decode container |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

  • Simple numeric configuration adjustment in a single file with no logic or behavioral impact

Poem

🐰 A number here, a threshold there,
From five to six, with gentle care,
The probes now wait a little more,
To check our containers at the door! 🚀✨

Pre-merge checks

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The pull request description lacks required sections including Details, Where should the reviewer start, and Related Issues. | Add missing sections: provide specific details about the changes made, specify which files should be reviewed closely, and include the related GitHub issue reference (#5685145 or equivalent bug tracker reference). |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Title check | ✅ Passed | The title directly and clearly describes the main change: increasing the failure threshold in a Kubernetes deployment configuration file. |

@tanmayv25 tanmayv25 changed the title from "Increase the failure threshold for k8s dsr1 trtllm wideep deploy.yaml" to "fix: Increase the failure threshold for k8s dsr1 trtllm wideep deploy.yaml" Nov 24, 2025
@github-actions github-actions bot added the fix label Nov 24, 2025
@tanmayv25 tanmayv25 enabled auto-merge (squash) November 24, 2025 19:54
@tanmayv25

CI seems to be blocked on an unrelated failure: #4561

@nvda-mesharma nvda-mesharma merged commit d6aa4a0 into main Nov 25, 2025
25 of 27 checks passed
@nvda-mesharma nvda-mesharma deleted the tanmayv-timeout branch November 25, 2025 00:54
dagil-nvidia pushed a commit that referenced this pull request Nov 25, 2025

7 participants