restructure dsr1 recipes and add gb200 #3891
base: main
| # 4096 = 256 * 16 | ||
| # moe_max_num_tokens: 4096 | ||
| load_balancer: /mnt/recipes/deepseek-r1/trtllm/wide_ep/eplb.yaml | ||
| load_balancer: /mnt/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml |
Is this assuming some specific PVC mounting path?
Yes, this path comes from the Slurm guides. In a separate PR we should remove those and turn all of these into k8s examples (if we want to keep the Slurm examples, we need to fully separate them out).
| values: | ||
| - "true" | ||
| mainContainer: | ||
| image: rohanv672/dynamo:0.5.1-trtllm-ssh |
Will need to update this obviously 🙂
Yes, testing with the actual dynamo container today.
| initialDelaySeconds: 30 | ||
| periodSeconds: 10 | ||
| timeoutSeconds: 5 | ||
| failureThreshold: 3000 |
We should probably fix this to a more reasonable number.
What do you think is a good value? Depending on PVC speed, it can end up taking around an hour. I'm thinking 1.5 hours, with a comment saying it depends on your PVC speed.
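For reference, the 1.5-hour budget proposed above works out as follows with the probe settings already in this diff. With `periodSeconds: 10`, the current `failureThreshold: 3000` tolerates roughly 30 + 3000 × 10 = 30,030 seconds (about 8.3 hours) before the probe gives up, while 1.5 hours (5,400 seconds) corresponds to 540 consecutive failures. A sketch of the adjusted values, assuming the other three fields stay unchanged and the comment wording is only a suggestion:

```yaml
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
# 540 failures x 10s period = 5400s = 1.5 hours.
# Model load time depends on PVC read speed; raise this if your PVC is slower.
failureThreshold: 540
```

Keeping the derivation in the comment makes it easier to retune the threshold if `periodSeconds` changes later.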
| requiredDuringSchedulingIgnoredDuringExecution: | ||
| nodeSelectorTerms: | ||
| - matchExpressions: | ||
| - key: nvidia.com/gpu.present |
Is this necessary given we also specify it in the limits?
Good point.
| metadata: | ||
| name: trtllm-test-compute-domain | ||
| spec: | ||
| numNodes: 9 |
We should add a comment here and in prefill/decode that these need to match.
Or, ideally, you wouldn't need to specify it ahead of time and the compute domain would grow or shrink accordingly.
Once Grove adds support for automatic DRA, we'll remove this. Will add the comment.
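A sketch of the comment discussed in this thread, placed on the compute domain fragment from the diff. The wording is only a suggestion, and the exact fields it should point to in the prefill and decode workers are not shown in this excerpt, so they are left unnamed:

```yaml
metadata:
  name: trtllm-test-compute-domain
spec:
  # NOTE: numNodes must be kept consistent with the node counts configured
  # for the prefill and decode workers; update them together.
  # Temporary: to be removed once Grove supports automatic DRA.
  numNodes: 9
```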
| name: trtllm-disagg-multinode | ||
| spec: | ||
| pvcs: | ||
| - name: modelcache-pvc |
Is there a reason we need the PVC here and in each component? Also where is this PVC defined?
Yeah, I'll be adding the same model_downloader setup as biswas.