@r4victor
Collaborator

#2802

This PR adds support for Runpod Instant Clusters, so dstack can now run multi-node tasks (2-8 nodes) on Runpod.

Implementation details:

  • Added a new Compute mixin ComputeWithGroupProvisioningSupport for backends that can provision/delete multiple instances at once.
  • Implemented ComputeWithGroupProvisioningSupport for Runpod.
  • Added ComputeGroupModel, which is used to terminate the cluster once all of its instances are terminating.
  • When provisioning the master job of a multi-node task, process_submitted_jobs can now provision all of the task's jobs at once if the backend supports it. Introduced JobModel.waiting_master_job to avoid race conditions with one-by-one job processing.
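The mixin-based dispatch and the group termination rule described above can be sketched as follows. This is a simplified illustration, not the code in this PR: only the name ComputeWithGroupProvisioningSupport comes from the description, while the method names (`create_instance`, `create_compute_group`), the `ComputeGroup` and `Instance` dataclasses, the fake backends, and the `provision` helper are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    name: str
    status: str = "provisioning"


@dataclass
class ComputeGroup:
    # Hypothetical stand-in for ComputeGroupModel: instances provisioned together.
    id: str
    instances: List[Instance] = field(default_factory=list)

    def ready_to_terminate(self) -> bool:
        # The cluster is deleted only once every instance is terminating.
        return bool(self.instances) and all(
            i.status == "terminating" for i in self.instances
        )


class Compute(ABC):
    @abstractmethod
    def create_instance(self, name: str) -> Instance: ...


class ComputeWithGroupProvisioningSupport(ABC):
    # Hypothetical method name; the real mixin interface may differ.
    @abstractmethod
    def create_compute_group(self, group_id: str, names: List[str]) -> ComputeGroup: ...


class FakeSingleCompute(Compute):
    def create_instance(self, name: str) -> Instance:
        return Instance(name)


class FakeClusterCompute(FakeSingleCompute, ComputeWithGroupProvisioningSupport):
    def __init__(self) -> None:
        self.group_calls = 0  # counts group provisioning requests

    def create_compute_group(self, group_id: str, names: List[str]) -> ComputeGroup:
        self.group_calls += 1
        return ComputeGroup(group_id, [Instance(n) for n in names])


def provision(compute: Compute, job_names: List[str]) -> List[Instance]:
    # Provision all jobs in one request when the backend supports groups,
    # otherwise fall back to one instance per job.
    if isinstance(compute, ComputeWithGroupProvisioningSupport) and len(job_names) > 1:
        return compute.create_compute_group("group-0", job_names).instances
    return [compute.create_instance(n) for n in job_names]
```

Checking for the mixin with `isinstance` keeps backends without group provisioning on the existing one-by-one path, so they need no changes.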

Notes:

  • Tested only on 2x H100:8 and 2x A100:8 clusters, since other configurations were not available at the time.
  • NCCL tests work, but the bandwidth is suboptimal. This is an internal Runpod issue that they are working on.

TODO:

  • templated_id is currently required by the Runpod API when creating a cluster. Waiting for Runpod to make it optional. For now, a PyTorch template is hardcoded.

Cluster offers are behind the feature flag DSTACK_FF_RUNPOD_CLUSTER_OFFERS_ENABLED. The flag will be dropped when the template issue is resolved.
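A feature flag like this is typically read from the environment. The sketch below shows one way to gate cluster offers on it. Only the variable name comes from this PR; the `cluster_offers_enabled` helper and the set of accepted values are assumptions, and dstack's actual parsing may differ.

```python
import os
from typing import Mapping, Optional

FLAG = "DSTACK_FF_RUNPOD_CLUSTER_OFFERS_ENABLED"


def cluster_offers_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    # Assumption: "1", "true", and "yes" (case-insensitive) enable the flag.
    env = os.environ if env is None else env
    return env.get(FLAG, "").strip().lower() in {"1", "true", "yes"}
```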
