From 2cfcde93d915ddb3119f53da0bc09520aebdcda2 Mon Sep 17 00:00:00 2001 From: Mo King Date: Wed, 25 Jun 2025 09:07:42 -0400 Subject: [PATCH 01/13] Init --- docs.json | 5 +-- instant-clusters.mdx | 3 +- instant-clusters/slurm-managed.mdx | 5 +++ instant-clusters/slurm.mdx | 50 +++++++++++++++++------------- 4 files changed, 38 insertions(+), 25 deletions(-) create mode 100644 instant-clusters/slurm-managed.mdx diff --git a/docs.json b/docs.json index c23ccbd8..ce4e3c54 100644 --- a/docs.json +++ b/docs.json @@ -142,9 +142,10 @@ "group": "Instant Clusters", "pages": [ "instant-clusters", + "instant-clusters/slurm-managed", + "instant-clusters/slurm", "instant-clusters/pytorch", - "instant-clusters/axolotl", - "instant-clusters/slurm" + "instant-clusters/axolotl" ] }, { diff --git a/instant-clusters.mdx b/instant-clusters.mdx index 3cf6035e..650b1a36 100644 --- a/instant-clusters.mdx +++ b/instant-clusters.mdx @@ -27,6 +27,7 @@ All accounts have a default spending limit. To deploy a larger cluster, submit a Get started with Instant Clusters by following a step-by-step tutorial for your preferred framework: +* [Deploy an Instant Cluster with Slurm (managed)](/instant-clusters/slurm-managed). * [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch). * [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl). * [Deploy an Instant Cluster with Slurm](/instant-clusters/slurm). @@ -66,7 +67,7 @@ Instant Clusters support up to 8 interfaces per Pod. Each interface (`eth1` - `e ## Environment variables -The following environment variables are available in all Pods: +The following environment variables are present in all Pods on an Instant Cluster: | Environment Variable | Description | | ------------------------------ | ----------------------------------------------------------------------------- | diff --git a/instant-clusters/slurm-managed.mdx b/instant-clusters/slurm-managed.mdx new file mode 100644 index 00000000..57b29b13 --- /dev/null +++ b/instant-clusters/slurm-managed.mdx @@ -0,0 +1,5 @@ +--- +title: "Deploy an Instant Cluster with Slurm (managed)" +sidebarTitle: "Deploy with Slurm (managed)" +--- + diff --git a/instant-clusters/slurm.mdx b/instant-clusters/slurm.mdx index 4eaace34..9ab297d1 100644 --- a/instant-clusters/slurm.mdx +++ b/instant-clusters/slurm.mdx @@ -1,17 +1,23 @@ --- -title: "Deploy an Instant Cluster with SLURM" -sidebarTitle: "Deploy with SLURM" +title: "Deploy an Instant Cluster with Slurm (unmanaged)" +sidebarTitle: "Deploy with Slurm (unmanaged)" --- -This tutorial demonstrates how to use Instant Clusters with [SLURM](https://slurm.schedmd.com/) (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging SLURM on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs. + + +This guide is for advanced users who want to manage their own Slurm cluster. If you're looking for a managed solution, see [Deploy an Instant Cluster with Slurm (managed)](/instant-clusters/slurm-managed). + + + +This tutorial demonstrates how to use Instant Clusters with [Slurm](https://slurm.schedmd.com/) to manage and schedule distributed workloads across multiple nodes. 
Slurm is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging Slurm on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs. -Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently. +Follow the steps below to deploy a cluster and start running distributed Slurm workloads efficiently. ## Requirements * You've created a [Runpod account](https://www.console.runpod.io/home) and funded it with sufficient credits. * You have basic familiarity with Linux command line. -* You're comfortable working with [Pods](/pods/overview) and understand the basics of [SLURM](https://slurm.schedmd.com/). +* You're comfortable working with [Pods](/pods/overview) and understand the basics of [Slurm](https://slurm.schedmd.com/). ## Step 1: Deploy an Instant Cluster @@ -20,7 +26,7 @@ Follow the steps below to deploy a cluster and start running distributed SLURM w 3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (Runpod PyTorch). 4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds. -## Step 2: Clone demo and install SLURM on each Pod +## Step 2: Clone demo and install Slurm on each Pod To connect to a Pod: @@ -31,28 +37,28 @@ To connect to a Pod: 1. Click **Connect**, then click **Web Terminal**. -2. In the terminal that opens, run this command to clone the SLURM demo files into the Pod's main directory: +2. In the terminal that opens, run this command to clone the Slurm demo files into the Pod's main directory: ```bash git clone https://github.com/pandyamarut/slurm_example.git && cd slurm_example ``` -3. Run this command to install SLURM: +3. Run this command to install Slurm: ```bash apt update && apt install -y slurm-wlm slurm-client munge ``` -## Step 3: Overview of SLURM demo scripts +## Step 3: Overview of Slurm demo scripts -The repository contains several essential scripts for setting up SLURM. Let's examine what each script does: +The repository contains several essential scripts for setting up Slurm. Let's examine what each script does: -* `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node. -* `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup. -* `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment. -* `test_batch.sh`: A sample SLURM job script for testing cluster functionality. +* `create_gres_conf.sh`: Generates the Slurm Generic Resource (GRES) configuration file that defines GPU resources for each node. +* `create_slurm_conf.sh`: Creates the main Slurm configuration file with cluster settings, node definitions, and partition setup. +* `install.sh`: The primary installation script that sets up MUNGE authentication, configures Slurm, and prepares the environment. +* `test_batch.sh`: A sample Slurm job script for testing cluster functionality. -## Step 4: Install SLURM on each Pod +## Step 4: Install Slurm on each Pod Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]` with any secure random string (like a password). 
The secret key is used for authentication between nodes, and must be identical across all Pods in your cluster. @@ -60,9 +66,9 @@ Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]` ./install.sh "[MUNGE_SECRET_KEY]" node-0 node-1 10.65.0.2 10.65.0.3 ``` -This script automates the complex process of configuring a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e. master/control) and secondary (i.e compute/worker) nodes. +This script automates the complex process of configuring a two-node Slurm cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e. master/control) and secondary (i.e compute/worker) nodes. -## Step 5: Start SLURM services +## Step 5: Start Slurm services @@ -70,7 +76,7 @@ If you're not sure which Pod is the primary node, run the command `echo $HOSTNAM -1. **On the primary node** (`node-0`), run both SLURM services: +1. **On the primary node** (`node-0`), run both Slurm services: ```bash slurmctld -D @@ -90,7 +96,7 @@ If you're not sure which Pod is the primary node, run the command `echo $HOSTNAM After running these commands, you should see output indicating that the services have started successfully. The `-D` flag keeps the services running in the foreground, so each command needs its own terminal. -## Step 6: Test your SLURM Cluster +## Step 6: Test your Slurm Cluster 1. Run this command **on the primary node** (`node-0`) to check the status of your nodes: @@ -108,7 +114,7 @@ After running these commands, you should see output indicating that the services This command should list all GPUs across both nodes. -## Step 7: Submit the SLURM job script +## Step 7: Submit the Slurm job script Run the following command **on the primary node** (`node-0`) to submit the test job script and confirm that your cluster is working properly: @@ -130,9 +136,9 @@ You can monitor your cluster usage and spending using the **Billing Explorer** a ## Next steps -Now that you've successfully deployed and tested a SLURM cluster on Runpod, you can: +Now that you've successfully deployed and tested a Slurm cluster on Runpod, you can: -* **Adapt your own distributed workloads** to run using SLURM job scripts. +* **Adapt your own distributed workloads** to run using Slurm job scripts. * **Scale your cluster** by adjusting the number of Pods to handle larger models or datasets. * **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models. * **Optimize performance** by experimenting with different distributed training strategies. 
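+
+As a reference for the first bullet above, here is a minimal sketch of a multi-node job script, assuming the two-node, 8-GPU-per-node setup used in this tutorial. The job name, log path, and training command are placeholders to replace with your own.
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=distributed-train   # Name shown in squeue output
+#SBATCH --nodes=2                      # Use both Pods in the cluster
+#SBATCH --ntasks-per-node=1            # One launcher process per node
+#SBATCH --gres=gpu:8                   # GPUs requested on each node
+#SBATCH --output=%x_%j.out             # Log file: <job-name>_<job-id>.out
+
+# srun runs the command once per task across the allocated nodes.
+srun python train.py
+```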
From a549883c3a6dfcfdfdf9761117d82b49f1434953 Mon Sep 17 00:00:00 2001 From: Mo King Date: Wed, 25 Jun 2025 13:46:14 -0400 Subject: [PATCH 02/13] First draft --- instant-clusters/slurm-managed.mdx | 125 ++++++++++++++++++++++++++++- 1 file changed, 123 insertions(+), 2 deletions(-) diff --git a/instant-clusters/slurm-managed.mdx b/instant-clusters/slurm-managed.mdx index 57b29b13..c0843ee4 100644 --- a/instant-clusters/slurm-managed.mdx +++ b/instant-clusters/slurm-managed.mdx @@ -1,5 +1,126 @@ --- -title: "Deploy an Instant Cluster with Slurm (managed)" -sidebarTitle: "Deploy with Slurm (managed)" +title: Managed Slurm clusters +sidebarTitle: Managed Slurm clusters +description: Deploy fully managed Slurm clusters on Runpod with zero configuration --- +Runpod managed Slurm clusters provide a fully managed high-performance computing (HPC) scheduling solution that enables you to create, scale, and manage Slurm clusters without the complexity of setup and configuration. + +Managed Slurm eliminates the traditional complexity of HPC cluster orchestration by providing: + +- **Zero configuration setup** - Slurm is pre-installed and fully configured +- **Instant provisioning** - Clusters deploy rapidly with minimal setup +- **Automatic role assignment** - System automatically designates controller and worker nodes +- **Built-in optimizations** - Pre-configured for optimal NCCL performance +- **Full Slurm compatibility** - All standard Slurm commands work immediately + +This solution is ideal for AI/ML teams, research institutions, and enterprise R&D departments that need on-demand GPU/CPU clusters without infrastructure management overhead. + +This page shows how to deploy a managed Slurm cluster on Runpod, and how to use it to run distributed training jobs. + +## Creating a managed Slurm cluster + +1. Navigate to the [Runpod Instant Clusters console](https://console.runpod.io/cluster). +2. Click **Create Cluster**. +3. Select **Slurm Cluster** from the cluster type options. +4. Configure your cluster specifications. +5. Click **Deploy Cluster**. + +## Accessing your cluster + +Once deployment completes, view your cluster dashboard. The interface clearly displays: +- Controller node (primary node) with its connection details +- Worker nodes with their status and specifications +- Overall cluster health and resource availability + +SSH directly into the controller node using the provided credentials: + +```bash +ssh username@controller-node-address +``` + +No additional setup is required - Slurm is ready to use immediately upon connection. + +## Submit and manage jobs + +All standard Slurm commands are available without configuration: + +Check cluster status and available resources: +```bash +sinfo +``` + +Submit a job to the cluster: +```bash +sbatch your-job-script.sh +``` + +Monitor job queue and status: +```bash +squeue +``` + +View detailed job information: +```bash +scontrol show job +``` + +The managed environment includes: +- Pre-installed Slurm with all necessary plugins +- Configured Munge authentication +- Optimized topology.conf for NCCL performance +- Support for common HPC workloads and MPI integration + +## Advanced configuration + +While the managed Slurm cluster works out-of-the-box, advanced users can customize configurations through shell access. 
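+
+For example, a quick way to see what the managed defaults are before changing anything is to dump the running configuration from the controller node. `scontrol show config` is a standard Slurm command that prints every active setting:
+
+```bash
+# Print the configuration the controller is currently running with.
+scontrol show config
+
+# Or narrow the output to a few settings of interest, for example:
+scontrol show config | grep -iE "schedulertype|selecttype|grestypes"
+```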
+ +Access Slurm configuration files in their standard locations: +- `/etc/slurm/slurm.conf` - Main configuration file +- `/etc/slurm/topology.conf` - Network topology configuration +- `/etc/slurm/gres.conf` - Generic resource configuration + +Modify these files as needed for your specific requirements. The managed service ensures baseline functionality while allowing flexibility for customization. + +## Performance optimization + +Managed Slurm clusters come pre-optimized for distributed training workloads: + +- **Topology-aware scheduling** - Properly configured topology.conf ensures optimal NCCL performance +- **GPU resource management** - Automatic GRES configuration for GPU scheduling +- **MPI integration** - Pre-configured for seamless MPI job execution + +These optimizations enable efficient large-scale ML training without manual tuning. + +## Monitoring and scaling + +Monitor your cluster health and resource utilization through the Runpod dashboard. Key metrics include: +- Node availability and status +- Job queue statistics +- Resource utilization trends + +Scale your cluster based on workload demands by adding or removing nodes through the Instant Clusters interface. The managed service handles all reconfiguration automatically. + +## Best practices + +To maximize the effectiveness of your managed Slurm cluster: + +- Start with the minimum nodes required and scale as needed +- Use Slurm's built-in accounting features to track resource usage +- Leverage job arrays for similar tasks to improve scheduling efficiency +- Monitor job performance metrics to optimize resource allocation + +## Troubleshooting + +If you encounter issues with your managed Slurm cluster: + +**Jobs stuck in pending state** +Check resource availability with `sinfo` and ensure requested resources are available. + +**Authentication errors** +Munge is pre-configured, but if issues arise, verify the munge service is running on all nodes. + +**Performance issues** +Review topology configuration and ensure jobs are using appropriate resource requests. + +For additional support, contact Runpod support with your cluster ID and specific error messages. 
\ No newline at end of file From 7a923fc6978551fb81815188044ba9735133a1f3 Mon Sep 17 00:00:00 2001 From: Mo King Date: Thu, 26 Jun 2025 09:40:57 -0400 Subject: [PATCH 03/13] update --- docs.json | 14 ++++++--- .../{slurm-managed.mdx => slurm-clusters.mdx} | 31 +++++++++++-------- instant-clusters/slurm.mdx | 6 ++-- 3 files changed, 31 insertions(+), 20 deletions(-) rename instant-clusters/{slurm-managed.mdx => slurm-clusters.mdx} (76%) diff --git a/docs.json b/docs.json index ce4e3c54..8539011b 100644 --- a/docs.json +++ b/docs.json @@ -142,10 +142,16 @@ "group": "Instant Clusters", "pages": [ "instant-clusters", - "instant-clusters/slurm-managed", - "instant-clusters/slurm", - "instant-clusters/pytorch", - "instant-clusters/axolotl" + "instant-clusters/slurm-clusters", + { + "group": "Deployment tutorials", + "pages": [ + "instant-clusters/slurm", + "instant-clusters/pytorch", + "instant-clusters/axolotl" + ] + } + ] }, { diff --git a/instant-clusters/slurm-managed.mdx b/instant-clusters/slurm-clusters.mdx similarity index 76% rename from instant-clusters/slurm-managed.mdx rename to instant-clusters/slurm-clusters.mdx index c0843ee4..093534f6 100644 --- a/instant-clusters/slurm-managed.mdx +++ b/instant-clusters/slurm-clusters.mdx @@ -1,24 +1,29 @@ --- -title: Managed Slurm clusters -sidebarTitle: Managed Slurm clusters -description: Deploy fully managed Slurm clusters on Runpod with zero configuration +title: Slurm Clusters +sidebarTitle: Slurm Clusters +description: Deploy fully managed Slurm Clusters on Runpod with zero configuration +tag: "NEW" --- -Runpod managed Slurm clusters provide a fully managed high-performance computing (HPC) scheduling solution that enables you to create, scale, and manage Slurm clusters without the complexity of setup and configuration. +Runpod Slurm Clusters provide a fully managed high-performance computing (HPC) scheduling solution that enables you to create, scale, and manage Slurm Clusters without the complexity of setup and configuration. -Managed Slurm eliminates the traditional complexity of HPC cluster orchestration by providing: +Slurm Clusters eliminate the traditional complexity of HPC cluster orchestration by providing: -- **Zero configuration setup** - Slurm is pre-installed and fully configured -- **Instant provisioning** - Clusters deploy rapidly with minimal setup -- **Automatic role assignment** - System automatically designates controller and worker nodes -- **Built-in optimizations** - Pre-configured for optimal NCCL performance -- **Full Slurm compatibility** - All standard Slurm commands work immediately +- **Zero configuration setup:** Slurm is pre-installed and fully configured. +- **Instant provisioning:** Clusters deploy rapidly with minimal setup. +- **Automatic role assignment:** Runpod automatically designates controller and worker nodes. +- **Built-in optimizations:** Pre-configured for optimal NCCL performance. +- **Full Slurm compatibility:** All standard Slurm commands work out-of-the-box. -This solution is ideal for AI/ML teams, research institutions, and enterprise R&D departments that need on-demand GPU/CPU clusters without infrastructure management overhead. +This page shows how to deploy a Slurm Cluster on Runpod, and how to use it to run distributed training jobs. -This page shows how to deploy a managed Slurm cluster on Runpod, and how to use it to run distributed training jobs. 
+ -## Creating a managed Slurm cluster +If you would rather configure and manage an Instant Cluster with Slurm manually, see [Deploy an Instant Cluster with Slurm](/instant-clusters/slurm) for a step-by-step guide. + + + +## Creating a Slurm Cluster 1. Navigate to the [Runpod Instant Clusters console](https://console.runpod.io/cluster). 2. Click **Create Cluster**. diff --git a/instant-clusters/slurm.mdx b/instant-clusters/slurm.mdx index 9ab297d1..f6d986e6 100644 --- a/instant-clusters/slurm.mdx +++ b/instant-clusters/slurm.mdx @@ -1,15 +1,15 @@ --- -title: "Deploy an Instant Cluster with Slurm (unmanaged)" +title: "Deploy an Instant Cluster with Slurm" sidebarTitle: "Deploy with Slurm (unmanaged)" --- -This guide is for advanced users who want to manage their own Slurm cluster. If you're looking for a managed solution, see [Deploy an Instant Cluster with Slurm (managed)](/instant-clusters/slurm-managed). +This guide is for advanced users who want to configure and manage their own Slurm deployment on Instant Clusters. If you're looking for a pre-configured solution, see [Slurm Clusters](/instant-clusters/slurm-clusters). -This tutorial demonstrates how to use Instant Clusters with [Slurm](https://slurm.schedmd.com/) to manage and schedule distributed workloads across multiple nodes. Slurm is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging Slurm on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs. +This tutorial demonstrates how to configure Runpod Instant Clusters with [Slurm](https://slurm.schedmd.com/) to manage and schedule distributed workloads across multiple nodes. Slurm is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging Slurm on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs. Follow the steps below to deploy a cluster and start running distributed Slurm workloads efficiently. From e5521ee4522a6d661b00f82d667102e0a10192ac Mon Sep 17 00:00:00 2001 From: Mo King Date: Thu, 26 Jun 2025 12:45:38 -0400 Subject: [PATCH 04/13] Update sidebar titles --- docs.json | 6 ++-- instant-clusters.mdx | 4 +-- instant-clusters/axolotl.mdx | 2 +- instant-clusters/pytorch.mdx | 2 +- instant-clusters/slurm-clusters.mdx | 51 +++++++++++++++-------------- instant-clusters/slurm.mdx | 2 +- 6 files changed, 34 insertions(+), 33 deletions(-) diff --git a/docs.json b/docs.json index 8539011b..e6625dac 100644 --- a/docs.json +++ b/docs.json @@ -144,11 +144,11 @@ "instant-clusters", "instant-clusters/slurm-clusters", { - "group": "Deployment tutorials", + "group": "Deployment guides", "pages": [ - "instant-clusters/slurm", "instant-clusters/pytorch", - "instant-clusters/axolotl" + "instant-clusters/axolotl", + "instant-clusters/slurm" ] } diff --git a/instant-clusters.mdx b/instant-clusters.mdx index 650b1a36..6c1b125a 100644 --- a/instant-clusters.mdx +++ b/instant-clusters.mdx @@ -27,10 +27,10 @@ All accounts have a default spending limit. To deploy a larger cluster, submit a Get started with Instant Clusters by following a step-by-step tutorial for your preferred framework: -* [Deploy an Instant Cluster with Slurm (managed)](/instant-clusters/slurm-managed). +* [Deploy a Slurm Cluster](/instant-clusters/slurm-clusters). 
* [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch). * [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl). -* [Deploy an Instant Cluster with Slurm](/instant-clusters/slurm). +* [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm). ## Use cases for Instant Clusters diff --git a/instant-clusters/axolotl.mdx b/instant-clusters/axolotl.mdx index 7bef1354..a4075e24 100644 --- a/instant-clusters/axolotl.mdx +++ b/instant-clusters/axolotl.mdx @@ -1,6 +1,6 @@ --- title: "Deploy an Instant Cluster with Axolotl" -sidebarTitle: "Deploy with Axolotl" +sidebarTitle: "Axolotl" --- This tutorial demonstrates how to use Instant Clusters with [Axolotl](https://axolotl.ai/) to fine-tune large language models (LLMs) across multiple GPUs. By leveraging PyTorch's distributed training capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups. diff --git a/instant-clusters/pytorch.mdx b/instant-clusters/pytorch.mdx index 182f1664..90397ea6 100644 --- a/instant-clusters/pytorch.mdx +++ b/instant-clusters/pytorch.mdx @@ -1,6 +1,6 @@ --- title: "Deploy an Instant Cluster with PyTorch" -sidebarTitle: "Deploy with PyTorch" +sidebarTitle: "PyTorch" --- This tutorial demonstrates how to use Instant Clusters with [PyTorch](http://pytorch.org) to run distributed workloads across multiple GPUs. By leveraging PyTorch's distributed processing capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups. diff --git a/instant-clusters/slurm-clusters.mdx b/instant-clusters/slurm-clusters.mdx index 093534f6..f75b0a7b 100644 --- a/instant-clusters/slurm-clusters.mdx +++ b/instant-clusters/slurm-clusters.mdx @@ -5,9 +5,9 @@ description: Deploy fully managed Slurm Clusters on Runpod with zero configurati tag: "NEW" --- -Runpod Slurm Clusters provide a fully managed high-performance computing (HPC) scheduling solution that enables you to create, scale, and manage Slurm Clusters without the complexity of setup and configuration. +Runpod Slurm Clusters provide a fully managed high-performance computing scheduling solution that enables you to create, scale, and manage Slurm Clusters without the complexity of setup and configuration. -Slurm Clusters eliminate the traditional complexity of HPC cluster orchestration by providing: +Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing: - **Zero configuration setup:** Slurm is pre-installed and fully configured. - **Instant provisioning:** Clusters deploy rapidly with minimal setup. @@ -19,32 +19,32 @@ This page shows how to deploy a Slurm Cluster on Runpod, and how to use it to ru -If you would rather configure and manage an Instant Cluster with Slurm manually, see [Deploy an Instant Cluster with Slurm](/instant-clusters/slurm) for a step-by-step guide. +If you prefer to manually configure and manage your Slurm deployment, see [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm) for a step-by-step guide. -## Creating a Slurm Cluster +## Deploy a Slurm Cluster -1. Navigate to the [Runpod Instant Clusters console](https://console.runpod.io/cluster). +1. Navigate to the [Instant Cluster page](https://console.runpod.io/cluster) on the Runpod console. 2. Click **Create Cluster**. -3. Select **Slurm Cluster** from the cluster type options. -4. Configure your cluster specifications. +3. 
Select **Slurm Cluster** from the dropdown menu of cluster types. +4. Configure your cluster specifications, including: + - [Network volume](/pods/storage/create-network-volumes), if you need persistent/shared storage. + - Region (must be the same as the network volume if you're using one). + - Pod count (i.e., the number of Pods in the cluster). + - [GPU type](/references/gpu-types). + - Cluster name. + - [Pod template](/pods/templates/overview). + - Click **Edit Template** if you need to adjust start commands, environment variables, ports, or the capacity of the [container/volume disk](/pods/storage/types). 5. Click **Deploy Cluster**. -## Accessing your cluster +## Access your cluster -Once deployment completes, view your cluster dashboard. The interface clearly displays: -- Controller node (primary node) with its connection details -- Worker nodes with their status and specifications -- Overall cluster health and resource availability +Once deployment completes, you can access your cluster from the [Instant Clusters page](https://console.runpod.io/cluster). -SSH directly into the controller node using the provided credentials: +Select your cluster to view details. The interface displays the **Slurm Controller** (primary node) and **Slurm Agents** (secondary nodes) with their status and specifications. You can expand each node to view details on the node's resources and status. -```bash -ssh username@controller-node-address -``` - -No additional setup is required - Slurm is ready to use immediately upon connection. +You can connect to each node using the **Connect** button, or with any of the [connection methods supported by Pods](/pods/connect-to-pods). ## Submit and manage jobs @@ -70,15 +70,15 @@ View detailed job information: scontrol show job ``` -The managed environment includes: -- Pre-installed Slurm with all necessary plugins -- Configured Munge authentication -- Optimized topology.conf for NCCL performance -- Support for common HPC workloads and MPI integration +The managed Slurm environment includes: +- Pre-installed Slurm with all necessary plugins. +- Pre-configured Munge authentication. +- Optimized `topology.conf` for NCCL performance. +- Support for common high-performance computing workloads and MPI integration. ## Advanced configuration -While the managed Slurm cluster works out-of-the-box, advanced users can customize configurations through shell access. +While the managed Slurm cluster works out-of-the-box, advanced users can customize configurations through shell access to the Slurm Controller node. Access Slurm configuration files in their standard locations: - `/etc/slurm/slurm.conf` - Main configuration file @@ -97,9 +97,10 @@ Managed Slurm clusters come pre-optimized for distributed training workloads: These optimizations enable efficient large-scale ML training without manual tuning. -## Monitoring and scaling +## Monitor and scale Monitor your cluster health and resource utilization through the Runpod dashboard. 
Key metrics include: + - Node availability and status - Job queue statistics - Resource utilization trends diff --git a/instant-clusters/slurm.mdx b/instant-clusters/slurm.mdx index f6d986e6..82a5c21b 100644 --- a/instant-clusters/slurm.mdx +++ b/instant-clusters/slurm.mdx @@ -1,6 +1,6 @@ --- title: "Deploy an Instant Cluster with Slurm" -sidebarTitle: "Deploy with Slurm (unmanaged)" +sidebarTitle: "Slurm (unmanaged)" --- From 868d14c152494c92a20100875fa5cd93774d1b09 Mon Sep 17 00:00:00 2001 From: Mo King Date: Fri, 27 Jun 2025 10:07:12 -0400 Subject: [PATCH 05/13] update --- instant-clusters/slurm-clusters.mdx | 39 +++++++++++++++-------------- instant-clusters/slurm.mdx | 2 +- 2 files changed, 21 insertions(+), 20 deletions(-) diff --git a/instant-clusters/slurm-clusters.mdx b/instant-clusters/slurm-clusters.mdx index f75b0a7b..fa636634 100644 --- a/instant-clusters/slurm-clusters.mdx +++ b/instant-clusters/slurm-clusters.mdx @@ -2,49 +2,50 @@ title: Slurm Clusters sidebarTitle: Slurm Clusters description: Deploy fully managed Slurm Clusters on Runpod with zero configuration -tag: "NEW" +tag: "BETA" --- -Runpod Slurm Clusters provide a fully managed high-performance computing scheduling solution that enables you to create, scale, and manage Slurm Clusters without the complexity of setup and configuration. + +Slurm Clusters are currently in beta. If you'd like to provide feedback, please [join our Discord](https://discord.gg/runpod). + + +Runpod Slurm Clusters provide a fully managed high-performance computing and scheduling solution that enables you to create, scale, and manage Slurm Clusters without the complexity of setup and configuration. Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing: - **Zero configuration setup:** Slurm is pre-installed and fully configured. - **Instant provisioning:** Clusters deploy rapidly with minimal setup. -- **Automatic role assignment:** Runpod automatically designates controller and worker nodes. +- **Automatic role assignment:** Runpod automatically designates controller and agent nodes. - **Built-in optimizations:** Pre-configured for optimal NCCL performance. - **Full Slurm compatibility:** All standard Slurm commands work out-of-the-box. -This page shows how to deploy a Slurm Cluster on Runpod, and how to use it to run distributed training jobs. - -If you prefer to manually configure and manage your Slurm deployment, see [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm) for a step-by-step guide. +If you prefer to manually configure your Slurm deployment, see [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm) for a step-by-step guide. ## Deploy a Slurm Cluster -1. Navigate to the [Instant Cluster page](https://console.runpod.io/cluster) on the Runpod console. +1. Open the [Instant Clusters page](https://console.runpod.io/cluster) on the Runpod console. 2. Click **Create Cluster**. -3. Select **Slurm Cluster** from the dropdown menu of cluster types. -4. Configure your cluster specifications, including: - - [Network volume](/pods/storage/create-network-volumes), if you need persistent/shared storage. - - Region (must be the same as the network volume if you're using one). - - Pod count (i.e., the number of Pods in the cluster). - - [GPU type](/references/gpu-types). - - Cluster name. - - [Pod template](/pods/templates/overview). 
- - Click **Edit Template** if you need to adjust start commands, environment variables, ports, or the capacity of the [container/volume disk](/pods/storage/types). +3. Select **Slurm Cluster** from the cluster type dropdown menu. +4. Configure your cluster specifications: + - **Cluster name**: Enter a descriptive name for your cluster. + - **Pod count**: Choose the number of Pods in your cluster. + - **GPU type**: Select your preferred [GPU type](/references/gpu-types). + - **Region**: Choose your deployment region. + - **Network volume** (optional): Add a [network volume](/pods/storage/create-network-volumes) for persistent/shared storage. If using a network volume, ensure the region matches your cluster region. + - **Pod template**: Select a [Pod template](/pods/templates/overview) or click **Edit Template** to customize start commands, environment variables, ports, or [container/volume disk](/pods/storage/types) capacity. 5. Click **Deploy Cluster**. ## Access your cluster Once deployment completes, you can access your cluster from the [Instant Clusters page](https://console.runpod.io/cluster). -Select your cluster to view details. The interface displays the **Slurm Controller** (primary node) and **Slurm Agents** (secondary nodes) with their status and specifications. You can expand each node to view details on the node's resources and status. +Select your cluster to view details. The interface displays the **Slurm Controller** (primary node) and **Slurm Agents** (secondary nodes) with their status and specifications. You can expand each node to view details on each node's resources and status. -You can connect to each node using the **Connect** button, or with any of the [connection methods supported by Pods](/pods/connect-to-pods). +You can connect to a node using the **Connect** button, or using any of the [connection methods supported by Pods](/pods/connect-to-pods). ## Submit and manage jobs @@ -78,7 +79,7 @@ The managed Slurm environment includes: ## Advanced configuration -While the managed Slurm cluster works out-of-the-box, advanced users can customize configurations through shell access to the Slurm Controller node. +While Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm Controller node using the [web terminal or SSH](/pods/connect-to-pods). 
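+
+After changing any of the configuration files listed below, the running daemons need to re-read them. A typical flow, sketched here with standard Slurm commands, looks like this:
+
+```bash
+# Confirm which configuration file the controller loaded.
+scontrol show config | grep -i slurm_conf
+
+# After editing, ask slurmctld and slurmd to re-read their configuration.
+scontrol reconfigure
+```
+
+Keep in mind that some settings only take effect after the Slurm daemons are restarted rather than reconfigured.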
Access Slurm configuration files in their standard locations: - `/etc/slurm/slurm.conf` - Main configuration file diff --git a/instant-clusters/slurm.mdx b/instant-clusters/slurm.mdx index 82a5c21b..cae87676 100644 --- a/instant-clusters/slurm.mdx +++ b/instant-clusters/slurm.mdx @@ -1,5 +1,5 @@ --- -title: "Deploy an Instant Cluster with Slurm" +title: "Deploy an Instant Cluster with Slurm (unmanaged)" sidebarTitle: "Slurm (unmanaged)" --- From d6a625afe49b0cad5e372d04e9cc3fb5001c0434 Mon Sep 17 00:00:00 2001 From: Mo King Date: Fri, 27 Jun 2025 10:31:22 -0400 Subject: [PATCH 06/13] remove unused sections --- instant-clusters/slurm-clusters.mdx | 54 +++++++---------------------- 1 file changed, 13 insertions(+), 41 deletions(-) diff --git a/instant-clusters/slurm-clusters.mdx b/instant-clusters/slurm-clusters.mdx index fa636634..9d3a8c28 100644 --- a/instant-clusters/slurm-clusters.mdx +++ b/instant-clusters/slurm-clusters.mdx @@ -56,7 +56,7 @@ Check cluster status and available resources: sinfo ``` -Submit a job to the cluster: +Submit a job to the cluster from the Slurm controller node: ```bash sbatch your-job-script.sh ``` @@ -66,7 +66,7 @@ Monitor job queue and status: squeue ``` -View detailed job information: +View detailed job information from the Slurm controller node: ```bash scontrol show job ``` @@ -79,55 +79,27 @@ The managed Slurm environment includes: ## Advanced configuration -While Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm Controller node using the [web terminal or SSH](/pods/connect-to-pods). +While Runpod's Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm controller node using the [web terminal or SSH](/pods/connect-to-pods). Access Slurm configuration files in their standard locations: -- `/etc/slurm/slurm.conf` - Main configuration file -- `/etc/slurm/topology.conf` - Network topology configuration -- `/etc/slurm/gres.conf` - Generic resource configuration +- `/etc/slurm/slurm.conf` - Main configuration file. +- `/etc/slurm/topology.conf` - Network topology configuration. +- `/etc/slurm/gres.conf` - Generic resource configuration. -Modify these files as needed for your specific requirements. The managed service ensures baseline functionality while allowing flexibility for customization. +Modify these files as needed for your specific requirements. -## Performance optimization +## Monitoring -Managed Slurm clusters come pre-optimized for distributed training workloads: - -- **Topology-aware scheduling** - Properly configured topology.conf ensures optimal NCCL performance -- **GPU resource management** - Automatic GRES configuration for GPU scheduling -- **MPI integration** - Pre-configured for seamless MPI job execution - -These optimizations enable efficient large-scale ML training without manual tuning. - -## Monitor and scale - -Monitor your cluster health and resource utilization through the Runpod dashboard. Key metrics include: - -- Node availability and status -- Job queue statistics -- Resource utilization trends - -Scale your cluster based on workload demands by adding or removing nodes through the Instant Clusters interface. The managed service handles all reconfiguration automatically. 
- -## Best practices - -To maximize the effectiveness of your managed Slurm cluster: - -- Start with the minimum nodes required and scale as needed -- Use Slurm's built-in accounting features to track resource usage -- Leverage job arrays for similar tasks to improve scheduling efficiency -- Monitor job performance metrics to optimize resource allocation +Monitor your cluster health and resource utilization through the Runpod console. The interface provides visibility into metrics like node availability and status, job queue statistics, and resource utilization trends to help you optimize cluster performance. ## Troubleshooting -If you encounter issues with your managed Slurm cluster: +If you encounter issues with your Slurm Cluster: -**Jobs stuck in pending state** -Check resource availability with `sinfo` and ensure requested resources are available. +**Jobs stuck in pending state:** Check resource availability with `sinfo` and ensure requested resources are available. If you need more resources, you can add more nodes to your cluster. -**Authentication errors** -Munge is pre-configured, but if issues arise, verify the munge service is running on all nodes. +**Authentication errors:** Munge is pre-configured, but if issues arise, verify the munge service is running on all nodes. -**Performance issues** -Review topology configuration and ensure jobs are using appropriate resource requests. +**Performance issues:** Review topology configuration and ensure jobs are using appropriate resource requests. For additional support, contact Runpod support with your cluster ID and specific error messages. \ No newline at end of file From 33696d1e6402c1c02a7efead42f6ccff4d82114a Mon Sep 17 00:00:00 2001 From: Mo King Date: Fri, 27 Jun 2025 10:33:25 -0400 Subject: [PATCH 07/13] link to slurm docs --- instant-clusters/slurm-clusters.mdx | 2 ++ 1 file changed, 2 insertions(+) diff --git a/instant-clusters/slurm-clusters.mdx b/instant-clusters/slurm-clusters.mdx index 9d3a8c28..b571e002 100644 --- a/instant-clusters/slurm-clusters.mdx +++ b/instant-clusters/slurm-clusters.mdx @@ -19,6 +19,8 @@ Slurm Clusters eliminate the traditional complexity of cluster orchestration by - **Built-in optimizations:** Pre-configured for optimal NCCL performance. - **Full Slurm compatibility:** All standard Slurm commands work out-of-the-box. +For more information on working with Slurm, see the [Slurm documentation](https://slurm.schedmd.com/documentation.html). + If you prefer to manually configure your Slurm deployment, see [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm) for a step-by-step guide. From 27b3cea453340b9fcfd2e80b36f14c7f34637608 Mon Sep 17 00:00:00 2001 From: Mo King Date: Fri, 27 Jun 2025 10:38:07 -0400 Subject: [PATCH 08/13] Update --- instant-clusters/slurm-clusters.mdx | 34 +++++++++-------------------- 1 file changed, 10 insertions(+), 24 deletions(-) diff --git a/instant-clusters/slurm-clusters.mdx b/instant-clusters/slurm-clusters.mdx index b571e002..4774923e 100644 --- a/instant-clusters/slurm-clusters.mdx +++ b/instant-clusters/slurm-clusters.mdx @@ -13,7 +13,7 @@ Runpod Slurm Clusters provide a fully managed high-performance computing and sch Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing: -- **Zero configuration setup:** Slurm is pre-installed and fully configured. +- **Zero configuration setup:** Slurm and munge are pre-installed and fully configured. - **Instant provisioning:** Clusters deploy rapidly with minimal setup. 
- **Automatic role assignment:** Runpod automatically designates controller and agent nodes. - **Built-in optimizations:** Pre-configured for optimal NCCL performance. @@ -45,40 +45,38 @@ If you prefer to manually configure your Slurm deployment, see [Deploy an Instan Once deployment completes, you can access your cluster from the [Instant Clusters page](https://console.runpod.io/cluster). -Select your cluster to view details. The interface displays the **Slurm Controller** (primary node) and **Slurm Agents** (secondary nodes) with their status and specifications. You can expand each node to view details on each node's resources and status. +Select your cluster to view details. The interface displays the **Slurm controller** (primary node) and **Slurm agents** (secondary nodes) with their status and specifications. You can expand each node to view details on each node's resources and status. -You can connect to a node using the **Connect** button, or using any of the [connection methods supported by Pods](/pods/connect-to-pods). +You can connect to a node using the **Connect** button, or using any of the [connection methods supported by Pods](/pods/connect-to-a-pod). ## Submit and manage jobs -All standard Slurm commands are available without configuration: +All standard Slurm commands are available without configuration. For example, you can: Check cluster status and available resources: + ```bash sinfo ``` Submit a job to the cluster from the Slurm controller node: + ```bash sbatch your-job-script.sh ``` Monitor job queue and status: + ```bash squeue ``` View detailed job information from the Slurm controller node: + ```bash -scontrol show job +scontrol show job JOB_ID ``` -The managed Slurm environment includes: -- Pre-installed Slurm with all necessary plugins. -- Pre-configured Munge authentication. -- Optimized `topology.conf` for NCCL performance. -- Support for common high-performance computing workloads and MPI integration. - ## Advanced configuration While Runpod's Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm controller node using the [web terminal or SSH](/pods/connect-to-pods). @@ -92,16 +90,4 @@ Modify these files as needed for your specific requirements. ## Monitoring -Monitor your cluster health and resource utilization through the Runpod console. The interface provides visibility into metrics like node availability and status, job queue statistics, and resource utilization trends to help you optimize cluster performance. - -## Troubleshooting - -If you encounter issues with your Slurm Cluster: - -**Jobs stuck in pending state:** Check resource availability with `sinfo` and ensure requested resources are available. If you need more resources, you can add more nodes to your cluster. - -**Authentication errors:** Munge is pre-configured, but if issues arise, verify the munge service is running on all nodes. - -**Performance issues:** Review topology configuration and ensure jobs are using appropriate resource requests. - -For additional support, contact Runpod support with your cluster ID and specific error messages. \ No newline at end of file +Monitor your cluster health and resource utilization through the Runpod console. The interface provides visibility into metrics like node availability and status, job queue statistics, and resource utilization trends to help you optimize cluster performance. 
\ No newline at end of file From dfcbf0e6d0c683b5d3d5c2eddcab0ea7e3043e29 Mon Sep 17 00:00:00 2001 From: Mo King Date: Fri, 27 Jun 2025 10:44:03 -0400 Subject: [PATCH 09/13] update --- instant-clusters/slurm-clusters.mdx | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/instant-clusters/slurm-clusters.mdx b/instant-clusters/slurm-clusters.mdx index 4774923e..e5212a66 100644 --- a/instant-clusters/slurm-clusters.mdx +++ b/instant-clusters/slurm-clusters.mdx @@ -9,7 +9,11 @@ tag: "BETA" Slurm Clusters are currently in beta. If you'd like to provide feedback, please [join our Discord](https://discord.gg/runpod). -Runpod Slurm Clusters provide a fully managed high-performance computing and scheduling solution that enables you to create, scale, and manage Slurm Clusters without the complexity of setup and configuration. +Runpod Slurm Clusters provide a fully managed high-performance computing and scheduling solution that enables you to rapidly create and manage Slurm Clusters with minimal setup. + +For more information on working with Slurm, refer to the [Slurm documentation](https://slurm.schedmd.com/documentation.html). + +## Key features Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing: @@ -19,8 +23,6 @@ Slurm Clusters eliminate the traditional complexity of cluster orchestration by - **Built-in optimizations:** Pre-configured for optimal NCCL performance. - **Full Slurm compatibility:** All standard Slurm commands work out-of-the-box. -For more information on working with Slurm, see the [Slurm documentation](https://slurm.schedmd.com/documentation.html). - If you prefer to manually configure your Slurm deployment, see [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm) for a step-by-step guide. From aebb1c70db0196929ae955c4bde2b115856f5e79 Mon Sep 17 00:00:00 2001 From: Mo King Date: Fri, 27 Jun 2025 10:47:49 -0400 Subject: [PATCH 10/13] Add troubleshooting --- instant-clusters/slurm-clusters.mdx | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/instant-clusters/slurm-clusters.mdx b/instant-clusters/slurm-clusters.mdx index e5212a66..90243919 100644 --- a/instant-clusters/slurm-clusters.mdx +++ b/instant-clusters/slurm-clusters.mdx @@ -92,4 +92,14 @@ Modify these files as needed for your specific requirements. ## Monitoring -Monitor your cluster health and resource utilization through the Runpod console. The interface provides visibility into metrics like node availability and status, job queue statistics, and resource utilization trends to help you optimize cluster performance. \ No newline at end of file +Monitor your cluster health and resource utilization through the Runpod console. The interface provides visibility into metrics like node availability and status, job queue statistics, and resource utilization trends to help you optimize cluster performance. + +## Troubleshooting + +If you encounter issues with your Slurm Cluster, try the following: + +- **Jobs stuck in pending state:** Check resource availability with `sinfo` and ensure requested resources are available. If you need more resources, you can add more nodes to your cluster. +- **Authentication errors:** Munge is pre-configured, but if issues arise, verify the munge service is running on all nodes. +- **Performance issues:** Review topology configuration and ensure jobs are using appropriate resource requests. 
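+
+For the authentication case, a quick way to confirm munge is healthy is to round-trip a credential, first locally and then across nodes. This sketch assumes SSH access between nodes; replace `node-1` with the hostname of one of your agent nodes:
+
+```bash
+# Encode and decode a credential locally; success means munged is running on this node.
+munge -n | unmunge
+
+# Round-trip a credential to another node; success means both nodes share the same munge key.
+munge -n | ssh node-1 unmunge
+```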
+ +For additional support, contact [Runpod support](https://contact.runpod.io/) with your cluster ID and specific error messages. \ No newline at end of file From e2545b924ee2483ff9e982563ab22a8e6e9b4045 Mon Sep 17 00:00:00 2001 From: Mo King Date: Fri, 27 Jun 2025 11:00:18 -0400 Subject: [PATCH 11/13] update ic title --- instant-clusters.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/instant-clusters.mdx b/instant-clusters.mdx index 6c1b125a..77be9d40 100644 --- a/instant-clusters.mdx +++ b/instant-clusters.mdx @@ -1,5 +1,5 @@ --- -title: "Instant Clusters" +title: "Overview" sidebarTitle: "Overview" --- From 791334641297b5631074b0926e3586586e8b28ca Mon Sep 17 00:00:00 2001 From: Mo King Date: Fri, 27 Jun 2025 13:02:53 -0400 Subject: [PATCH 12/13] Update --- instant-clusters/slurm-clusters.mdx | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/instant-clusters/slurm-clusters.mdx b/instant-clusters/slurm-clusters.mdx index 90243919..827cda85 100644 --- a/instant-clusters/slurm-clusters.mdx +++ b/instant-clusters/slurm-clusters.mdx @@ -43,13 +43,13 @@ If you prefer to manually configure your Slurm deployment, see [Deploy an Instan - **Pod template**: Select a [Pod template](/pods/templates/overview) or click **Edit Template** to customize start commands, environment variables, ports, or [container/volume disk](/pods/storage/types) capacity. 5. Click **Deploy Cluster**. -## Access your cluster +## Connect to a Slurm Cluster Once deployment completes, you can access your cluster from the [Instant Clusters page](https://console.runpod.io/cluster). -Select your cluster to view details. The interface displays the **Slurm controller** (primary node) and **Slurm agents** (secondary nodes) with their status and specifications. You can expand each node to view details on each node's resources and status. +From this page you can select a cluster to view it's component nodes, including a label indicating the **Slurm controller** (primary node) and **Slurm agents** (secondary nodes). Expand a node to view details like availability, GPU/storage utilization, and options for connection and management. -You can connect to a node using the **Connect** button, or using any of the [connection methods supported by Pods](/pods/connect-to-a-pod). +Connect to a node using the **Connect** button, or using any of the [connection methods supported by Pods](/pods/connect-to-a-pod). ## Submit and manage jobs @@ -90,10 +90,6 @@ Access Slurm configuration files in their standard locations: Modify these files as needed for your specific requirements. -## Monitoring - -Monitor your cluster health and resource utilization through the Runpod console. The interface provides visibility into metrics like node availability and status, job queue statistics, and resource utilization trends to help you optimize cluster performance. 
- ## Troubleshooting If you encounter issues with your Slurm Cluster, try the following: From f4a382eab1d149bb7c124464f1855dfd07f27ad8 Mon Sep 17 00:00:00 2001 From: Mo King Date: Fri, 27 Jun 2025 13:05:27 -0400 Subject: [PATCH 13/13] info -> tip --- instant-clusters/slurm.mdx | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/instant-clusters/slurm.mdx b/instant-clusters/slurm.mdx index cae87676..93275ff0 100644 --- a/instant-clusters/slurm.mdx +++ b/instant-clusters/slurm.mdx @@ -128,11 +128,11 @@ Check the output file created by the test (`test_simple_[JOBID].out`) and look f If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.console.runpod.io/cluster) and delete your cluster to avoid incurring extra charges. - + You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.console.runpod.io/user/billing) section under the **Cluster** tab. - + ## Next steps