restructure dsr1 recipes and add gb200 #3891


Draft

nvrohanv wants to merge 1 commit into main from nvrohanv/add-dsr1-gb200-widep-recipe

Contributor

nvrohanv commented Oct 25, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

closes GitHub issue: #xxx


restructure dsr1 recipes and add gb200, needs testing with new container

aae303c

pull-request-size bot added the size/L label

itay reviewed


recipes/deepseek-r1/trtllm/agg/wide_ep/wide_ep_agg.yaml

    
                #   4096 = 256 * 16

                # moe_max_num_tokens: 4096

-               load_balancer: /mnt/recipes/deepseek-r1/trtllm/wide_ep/eplb.yaml

+               load_balancer: /mnt/recipes/deepseek-r1/trtllm/agg/wide_ep/eplb.yaml

itay Oct 25, 2025

Is this assuming some specific PVC mounting path?

Contributor Author

nvrohanv Oct 26, 2025

Yes, this actually comes from the Slurm guides. In a separate PR we should remove those and turn all of these into k8s examples (if we want to keep the Slurm examples, we need to fully separate them out).

recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml

    
                                    values:

                                      - "true"

                      mainContainer:

                        image: rohanv672/dynamo:0.5.1-trtllm-ssh

itay Oct 25, 2025

Will need to update this obviously 🙂

Contributor Author

nvrohanv Oct 26, 2025

Yes, testing with actual dynamo container today

recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml

    
                          initialDelaySeconds: 30

                          periodSeconds: 10

                          timeoutSeconds: 5

                          failureThreshold: 3000

itay Oct 25, 2025

We should probably fix this to a more reasonable number.

Contributor Author

nvrohanv Oct 26, 2025

What do you think is a good value? Depending on PVC speed, it can end up taking around an hour. I'm thinking 1.5 hours, with a comment saying it's dependent on your PVC speed.
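
For reference, the arithmetic behind this thread: a probe tolerates roughly `initialDelaySeconds + failureThreshold * periodSeconds` before the container is treated as failed, so the values in the diff allow about 30 + 3000 * 10 = 30030 seconds (a bit over 8 hours). Below is a minimal sketch of the 1.5-hour budget proposed here, assuming the 10-second period is kept and that this is a startup-style probe (the probe type is not visible in the snippet).

```yaml
# Sketch only: timing fields for a ~1.5 hour startup budget.
# Budget = failureThreshold * periodSeconds = 540 * 10 s = 5400 s = 90 min
# (plus the 30 s initial delay). Actual load time depends on PVC read speed.
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 540
```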

recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml

    
                          requiredDuringSchedulingIgnoredDuringExecution:

                            nodeSelectorTerms:

                              - matchExpressions:

                                  - key: nvidia.com/gpu.present

itay Oct 25, 2025

Is this necessary given we also specify it in the limits?

Contributor Author

nvrohanv Oct 26, 2025

good point
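
The reviewer's point can be sketched as follows: a container that requests the `nvidia.com/gpu` extended resource can only be scheduled onto nodes that advertise it, so a separate node affinity on `nvidia.com/gpu.present` adds no further constraint in that case. The snippet uses plain Kubernetes container syntax, a GPU count of 4, and an `In` operator purely for illustration; the actual field layout and values in `deploy.yaml` may differ.

```yaml
# Illustrative only: GPU count, operator, and field layout are assumptions.
resources:
  limits:
    nvidia.com/gpu: "4"   # pod is only placed on nodes with allocatable GPUs
# With the limit above, the affinity block below is redundant and could be dropped:
# affinity:
#   nodeAffinity:
#     requiredDuringSchedulingIgnoredDuringExecution:
#       nodeSelectorTerms:
#         - matchExpressions:
#             - key: nvidia.com/gpu.present
#               operator: In
#               values:
#                 - "true"
```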

recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml

    
              metadata:

                name: trtllm-test-compute-domain

              spec:

                numNodes: 9

itay Oct 25, 2025

We should add a comment here and in prefill/decode that these need to match.

itay Oct 25, 2025

Or ideally that you don't need to specify it ahead of time and the compute domain will grow/shrink accordingly.

Contributor Author

nvrohanv Oct 26, 2025

Once Grove adds support for automatic DRA, we'll remove this. I will add the comment.
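
A sketch of what the requested comment might look like, using only the fields shown in the diff. The wording is a suggestion, and it assumes "match" means the combined node count of the prefill and decode workers; how the 9 nodes are split between them is not shown in this excerpt.

```yaml
metadata:
  name: trtllm-test-compute-domain
spec:
  # Assumed meaning: must equal the total number of nodes used by the
  # prefill and decode workers defined in this file. Update both places
  # together until the compute domain can be sized automatically.
  numNodes: 9
```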

recipes/deepseek-r1/trtllm/disagg/wide_ep/gb200/deploy.yaml

    
                name: trtllm-disagg-multinode

              spec:

                pvcs:

                  - name: modelcache-pvc

itay Oct 25, 2025

Is there a reason we need the PVC here and in each component? Also where is this PVC defined?

Contributor Author

nvrohanv Oct 26, 2025

Yeah, I will be adding the same model_downloader setup as biswas.
