Support Runpod Instant Clusters #3214

r4victor · 2025-10-22T09:35:55Z

#2802

This PR adds support for Runpod Instant Clusters. dstack can now run multi-node tasks (2-8 nodes) on Runpod.

Implementation details:

Added a new Compute mixin ComputeWithGroupProvisioningSupport for backends that can provision/delete multiple instances at once.
Implemented ComputeWithGroupProvisioningSupport for Runpod.
Added ComputeGroupModel, which is used to terminate the cluster once all of its instances are terminating.
process_submitted_jobs can now provision all jobs of a multi-node task at once if the backend supports it (this happens when the master job is provisioned). Introduced JobModel.waiting_master_job to avoid race conditions with one-by-one job processing.
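The group provisioning idea above can be illustrated with a minimal sketch. It is not the code from this PR: only the names ComputeWithGroupProvisioningSupport, run_jobs and terminate_compute_group come from the description and commit list, while the signatures, the ComputeGroup dataclass, FakeClusterCompute and should_terminate_group are hypothetical and exist only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from itertools import count
from typing import Dict, List


@dataclass
class ComputeGroup:
    # Hypothetical stand-in for a provisioned cluster record.
    group_id: str
    instance_ids: List[str]


class ComputeWithGroupProvisioningSupport(ABC):
    """Mixin for backends that provision/delete several instances in one call.

    The method signatures here are assumptions, not the real dstack interface.
    """

    @abstractmethod
    def run_jobs(self, job_names: List[str]) -> ComputeGroup:
        """Provision one instance per job as a single group."""

    @abstractmethod
    def terminate_compute_group(self, group: ComputeGroup) -> None:
        """Delete the whole group at once."""


class FakeClusterCompute(ComputeWithGroupProvisioningSupport):
    """In-memory backend used only to exercise the mixin."""

    def __init__(self) -> None:
        self.active_groups: Dict[str, ComputeGroup] = {}
        self._ids = count()

    def run_jobs(self, job_names: List[str]) -> ComputeGroup:
        group_id = f"cluster-{next(self._ids)}"
        instance_ids = [f"{group_id}-node-{i}" for i in range(len(job_names))]
        group = ComputeGroup(group_id=group_id, instance_ids=instance_ids)
        self.active_groups[group_id] = group
        return group

    def terminate_compute_group(self, group: ComputeGroup) -> None:
        self.active_groups.pop(group.group_id, None)


def should_terminate_group(instance_statuses: List[str]) -> bool:
    # The group is deleted only when every instance is already terminating.
    return bool(instance_statuses) and all(
        status == "terminating" for status in instance_statuses
    )
```

In this sketch the group is the unit of deletion: individual instances only change status, and the cluster itself is removed once should_terminate_group returns True for all of them.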

Notes:

Tested only on 2x H100:8 and 2x A100:8 clusters, since other configurations were not available at the time.
NCCL tests work, but the bandwidth is suboptimal. This is an internal Runpod issue, and they are working on it.

TODO:

templated_id is currently required by the Runpod API when creating a cluster. Waiting for Runpod to make it optional. For now, the PyTorch template is hardcoded.

Cluster offers are behind the feature flag DSTACK_FF_RUNPOD_CLUSTER_OFFERS_ENABLED. The flag will be dropped once the template issue is resolved.
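As a rough illustration of how an environment-variable feature flag like this is typically read, here is a minimal sketch. Only the variable name comes from this PR. The helper cluster_offers_enabled and the set of accepted truthy values are assumptions, and dstack's actual parsing may differ.

```python
import os
from typing import Mapping, Optional

FLAG_NAME = "DSTACK_FF_RUNPOD_CLUSTER_OFFERS_ENABLED"

# Assumed truthy spellings; the real server may accept a different set.
_TRUTHY = {"1", "true", "yes", "on"}


def cluster_offers_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if the cluster offers flag is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get(FLAG_NAME, "").strip().lower() in _TRUTHY
```

With this helper, an unset variable means cluster offers stay hidden, so the feature is opt-in until the flag is removed.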

r4victor added 23 commits June 16, 2025 15:21

Fix runpod type annotations

e7b74fa

Add _generate_create_cluster_mutation

f373d6b

feat: add create_cluster method to RunpodApiClient

b089af5

feat: add delete_cluster method to RunpodApiClient

38376db

Use keyword arguments

991e886

Implement run_jobs and terminate_compute_group

07f1313

Merge branch 'master' into issue_2802_runpod_instant_clusters

8c13be6

Prototype compute.run_jobs calling

852896f

Prototype compute group provisioning for multinode tasks

ed47824

Add JobModel.waiting_master_job

f538ea6

Add ComputeGroupModel

12a1c13

Implement process_compute_groups to terminate compute groups

6b36b96

Remove todo

41fb7b9

Fix comments

1593b05

Set internal_ip

e987301

Merge branch 'master' into issue_2802_runpod_instant_clusters

df0c9ce

Merge branch 'master' into issue_2802_runpod_instant_clusters

79b8dac

Support Runpod Clusters offers

6b1679b

Respect supported pod_counts

888a7ba

Support registry_auth

e4e0ce8

Fix tests

7ee2643

Add feature flag DSTACK_FF_RUNPOD_CLUSTER_OFFERS_ENABLED

2d4589a

Merge branch 'master' into issue_2802_runpod_instant_clusters

cb0cd54
