
Commit 5f171aa

Add tutorial for Instant Clusters + SLURM (#221)
1 parent 8ad9d42 commit 5f171aa

File tree

- docs/instant-clusters/axolotl.md
- docs/instant-clusters/index.md
- docs/instant-clusters/pytorch.md
- docs/instant-clusters/slurm.md

4 files changed: +180 -39 lines changed

docs/instant-clusters/axolotl.md

Lines changed: 20 additions & 20 deletions
@@ -8,7 +8,7 @@ description: Learn how to deploy an Instant Cluster and use it to fine-tune a la

This tutorial demonstrates how to use Instant Clusters with [Axolotl](https://axolotl.ai/) to fine-tune large language models (LLMs) across multiple GPUs. By leveraging PyTorch's distributed training capabilities and RunPod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.

-Follow the steps below to deploy your Cluster and start training your models efficiently.
+Follow the steps below to deploy a cluster and start training your models efficiently.

## Step 1: Deploy an Instant Cluster

@@ -19,35 +19,35 @@ Follow the steps below to deploy your Cluster and start training your models eff

## Step 2: Set up Axolotl on each Pod

-1. Click your Cluster to expand the list of Pods.
+1. Click your cluster to expand the list of Pods.
2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.
3. Click **Connect**, then click **Web Terminal**.
-4. Clone the Axolotl repository into the Pod's main directory:
+4. In the terminal that opens, run this command to clone the Axolotl repository into the Pod's main directory:

-```bash
-git clone https://github.com/axolotl-ai-cloud/axolotl
-```
+```bash
+git clone https://github.com/axolotl-ai-cloud/axolotl
+```

5. Navigate to the `axolotl` directory:

-```bash
-cd axolotl
-```
+```bash
+cd axolotl
+```

6. Install the required packages:

-```bash
-pip3 install -U packaging setuptools wheel ninja
-pip3 install --no-build-isolation -e '.[flash-attn,deepspeed]'
-```
+```bash
+pip3 install -U packaging setuptools wheel ninja
+pip3 install --no-build-isolation -e '.[flash-attn,deepspeed]'
+```

7. Navigate to the `examples/llama-3` directory:

-```bash
-cd examples/llama-3
-```
+```bash
+cd examples/llama-3
+```

-Repeat these steps for **each Pod** in your Cluster.
+Repeat these steps for **each Pod** in your cluster.

## Step 3: Start the training process on each Pod

@@ -90,11 +90,11 @@ Congrats! You've successfully trained a model using Axolotl on an Instant Cluste

## Step 4: Clean up

-If you no longer need your Cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your Cluster to avoid incurring extra charges.
+If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges.

:::note

-You can monitor your Cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing) section under the **Cluster** tab.
+You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing) section under the **Cluster** tab.

:::

@@ -103,7 +103,7 @@ You can monitor your Cluster usage and spending using the **Billing Explorer** a
Now that you've successfully deployed and tested an Axolotl distributed training job on an Instant Cluster, you can:

- **Fine-tune your own models** by modifying the configuration files in Axolotl to suit your specific requirements.
-- **Scale your training** by adjusting the number of Pods in your Cluster (and the size of their containers and volumes) to handle larger models or datasets.
+- **Scale your training** by adjusting the number of Pods in your cluster (and the size of their containers and volumes) to handle larger models or datasets.
- **Try different optimization techniques** such as DeepSpeed, FSDP (Fully Sharded Data Parallel), or other distributed training strategies.

For more information on fine-tuning with Axolotl, refer to the [Axolotl documentation](https://github.com/OpenAccess-AI-Collective/axolotl).

docs/instant-clusters/index.md

Lines changed: 7 additions & 6 deletions
@@ -24,8 +24,9 @@ All accounts have a default spending limit. To deploy a larger cluster, submit a

Get started with Instant Clusters by following a step-by-step tutorial for your preferred framework:

-- [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch)
-- [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl)
+- [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch).
+- [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl).
+- [Deploy an Instant Cluster with SLURM](/instant-clusters/slurm).

## Use cases for Instant Clusters

@@ -69,12 +70,12 @@ The following environment variables are available in all Pods:
| `PRIMARY_ADDR` / `MASTER_ADDR` | The address of the primary Pod. |
| `PRIMARY_PORT` / `MASTER_PORT` | The port of the primary Pod (all ports are available). |
| `NODE_ADDR` | The static IP of this Pod within the cluster network. |
-| `NODE_RANK` | The Cluster (i.e., global) rank assigned to this Pod (0 for the primary Pod). |
-| `NUM_NODES` | The number of Pods in the Cluster. |
+| `NODE_RANK` | The cluster (i.e., global) rank assigned to this Pod (0 for the primary Pod). |
+| `NUM_NODES` | The number of Pods in the cluster. |
| `NUM_TRAINERS` | The number of GPUs per Pod. |
| `HOST_NODE_ADDR` | Defined as `PRIMARY_ADDR:PRIMARY_PORT` for convenience. |
-| `WORLD_SIZE` | The total number of GPUs in the Cluster (`NUM_NODES` * `NUM_TRAINERS`). |
+| `WORLD_SIZE` | The total number of GPUs in the cluster (`NUM_NODES` * `NUM_TRAINERS`). |

-Each Pod receives a static IP (`NODE_ADDR`) on the overlay network. When a Cluster is deployed, the system designates one Pod as the primary node by setting the `PRIMARY_ADDR` and `PRIMARY_PORT` environment variables. This simplifies working with multiprocessing libraries that require a primary node.
+Each Pod receives a static IP (`NODE_ADDR`) on the overlay network. When a cluster is deployed, the system designates one Pod as the primary node by setting the `PRIMARY_ADDR` and `PRIMARY_PORT` environment variables. This simplifies working with multiprocessing libraries that require a primary node.

The variables `MASTER_ADDR`/`PRIMARY_ADDR` and `MASTER_PORT`/`PRIMARY_PORT` are equivalent. The `MASTER_*` variables provide compatibility with tools that expect these legacy names.
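For orientation, here is one way these variables are commonly wired into a `torchrun` launch. This is an editorial sketch, not part of the commit; `main.py` stands in for whatever training script you run, and the PyTorch tutorial's own `launcher.sh` (shown below) is the authoritative example.

```bash
# Launch one process per GPU on this Pod; every Pod runs the same command
# and joins the job coordinated by the primary Pod.
torchrun \
  --nproc_per_node=$NUM_TRAINERS \
  --nnodes=$NUM_NODES \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  main.py
```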

docs/instant-clusters/pytorch.md

Lines changed: 13 additions & 13 deletions
@@ -8,7 +8,7 @@ description: Learn how to deploy an Instant Cluster and run a multi-node process

This tutorial demonstrates how to use Instant Clusters with [PyTorch](http://pytorch.org) to run distributed workloads across multiple GPUs. By leveraging PyTorch's distributed processing capabilities and RunPod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.

-Follow the steps below to deploy your Cluster and start running distributed PyTorch workloads efficiently.
+Follow the steps below to deploy a cluster and start running distributed PyTorch workloads efficiently.

## Step 1: Deploy an Instant Cluster

@@ -19,22 +19,22 @@ Follow the steps below to deploy your Cluster and start running distributed PyTo

## Step 2: Clone the PyTorch demo into each Pod

-1. Click your Cluster to expand the list of Pods.
+1. Click your cluster to expand the list of Pods.
2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.
3. Click **Connect**, then click **Web Terminal**.
-4. Run this command to clone a basic `main.py` file into the Pod's main directory:
+4. In the terminal that opens, run this command to clone a basic `main.py` file into the Pod's main directory:

-```bash
-git clone https://github.com/murat-runpod/torch-demo.git
-```
+```bash
+git clone https://github.com/murat-runpod/torch-demo.git
+```

-Repeat these steps for **each Pod** in your Cluster.
+Repeat these steps for **each Pod** in your cluster.

## Step 3: Examine the main.py file

Let's look at the code in our `main.py` file:

-```python
+```python title="main.py"
import os
import torch
import torch.distributed as dist
@@ -80,7 +80,7 @@ This is the minimal code necessary for initializing a distributed environment. T

Run this command in the web terminal of **each Pod** to start the PyTorch process:

-```bash
+```bash title="launcher.sh"
export NCCL_DEBUG=WARN
torchrun \
--nproc_per_node=$NUM_TRAINERS \
@@ -106,7 +106,7 @@ Running on rank 14/15 (local rank: 6), device: cuda:6
Running on rank 10/15 (local rank: 2), device: cuda:2
```

-The first number refers to the global rank of the thread, spanning from `0` to `WORLD_SIZE-1` (`WORLD_SIZE` = the total number of GPUs in the Cluster). In our example there are two Pods of eight GPUs, so the global rank spans from 0-15. The second number is the local rank, which defines the order of GPUs within a single Pod (0-7 for this example).
+The first number refers to the global rank of the thread, spanning from `0` to `WORLD_SIZE-1` (`WORLD_SIZE` = the total number of GPUs in the cluster). In our example there are two Pods of eight GPUs, so the global rank spans from 0-15. The second number is the local rank, which defines the order of GPUs within a single Pod (0-7 for this example).

The specific number and order of ranks may be different in your terminal, and the global ranks listed will be different for each Pod.

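To make the numbering concrete, the global rank follows from the cluster variables as `NODE_RANK * NUM_TRAINERS + local rank`. The snippet below is an editorial illustration of that arithmetic, not part of the demo repository.

```bash
# Local rank 2 on the second Pod (NODE_RANK=1) of an 8-GPU-per-Pod cluster
# (NUM_TRAINERS=8) maps to global rank 10, matching the sample output above.
NODE_RANK=1 NUM_TRAINERS=8 LOCAL_RANK=2
echo $((NODE_RANK * NUM_TRAINERS + LOCAL_RANK))   # prints 10
```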
@@ -116,7 +116,7 @@ This diagram illustrates how local and global ranks are distributed across multi

## Step 5: Clean up

-If you no longer need your Cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your Cluster to avoid incurring extra charges.
+If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges.

:::note

@@ -128,8 +128,8 @@ You can monitor your cluster usage and spending using the **Billing Explorer** a

Now that you've successfully deployed and tested a PyTorch distributed application on an Instant Cluster, you can:

-- **Adapt your own PyTorch code** to run on the Cluster by modifying the distributed initialization in your scripts.
-- **Scale your training** by adjusting the number of Pods in your Cluster to handle larger models or datasets.
+- **Adapt your own PyTorch code** to run on the cluster by modifying the distributed initialization in your scripts.
+- **Scale your training** by adjusting the number of Pods in your cluster to handle larger models or datasets.
- **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
- **Optimize performance** by experimenting with different distributed training strategies like Data Parallel (DP), Distributed Data Parallel (DDP), or Fully Sharded Data Parallel (FSDP).

docs/instant-clusters/slurm.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
---
title: Deploy with SLURM
sidebar_position: 4
description: Learn how to deploy an Instant Cluster and set up SLURM for distributed job scheduling.
---

# Deploy an Instant Cluster with SLURM

This tutorial demonstrates how to use Instant Clusters with [SLURM](https://slurm.schedmd.com/) (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging SLURM on RunPod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.

Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently.

## Requirements

- You've created a [RunPod account](https://www.runpod.io/console/home) and funded it with sufficient credits.
- You have basic familiarity with the Linux command line.
- You're comfortable working with [Pods](/pods/overview) and understand the basics of [SLURM](https://slurm.schedmd.com/).

## Step 1: Deploy an Instant Cluster

1. Open the [Instant Clusters page](https://www.runpod.io/console/cluster) on the RunPod web interface.
2. Click **Create Cluster**.
3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (RunPod PyTorch).
4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.

## Step 2: Clone the demo and install SLURM on each Pod

To connect to a Pod:

1. On the Instant Clusters page, click on the cluster you created to expand the list of Pods.
2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.

**On each Pod:**

1. Click **Connect**, then click **Web Terminal**.
2. In the terminal that opens, run this command to clone the SLURM demo files into the Pod's main directory:

```bash
git clone https://github.com/pandyamarut/slurm_example.git && cd slurm_example
```

3. Run this command to install SLURM:

```bash
apt update && apt install -y slurm-wlm slurm-client munge
```

## Step 3: Overview of SLURM demo scripts

The repository contains several essential scripts for setting up SLURM. Let's examine what each script does:

- `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node.
- `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup.
- `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment.
- `test_batch.sh`: A sample SLURM job script for testing cluster functionality.

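To give a sense of what these scripts generate, the entries below illustrate the kind of node and GPU definitions that `create_gres_conf.sh` and `create_slurm_conf.sh` produce for this two-Pod, eight-GPU-per-Pod cluster. This is an editorial sketch; the partition name and exact fields are assumptions, and the files the scripts actually write may differ.

```
# gres.conf (illustrative): one GPU GRES line per node, eight GPUs each
NodeName=node-0 Name=gpu File=/dev/nvidia[0-7]
NodeName=node-1 Name=gpu File=/dev/nvidia[0-7]

# slurm.conf (illustrative): node definitions and a single partition
NodeName=node-0 NodeAddr=10.65.0.2 Gres=gpu:8 State=UNKNOWN
NodeName=node-1 NodeAddr=10.65.0.3 Gres=gpu:8 State=UNKNOWN
PartitionName=main Nodes=node-0,node-1 Default=YES MaxTime=INFINITE State=UP
```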
## Step 4: Install SLURM on each Pod

Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]` with any secure random string (like a password). The secret key is used for authentication between nodes, and must be identical across all Pods in your cluster.

```bash
./install.sh "[MUNGE_SECRET_KEY]" node-0 node-1 10.65.0.2 10.65.0.3
```

This script automates the complex process of configuring a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e., master/control) and secondary (i.e., compute/worker) nodes.

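Before starting the SLURM services, you can optionally confirm that MUNGE authentication is working on each Pod. This check isn't part of the demo repository; it's the standard MUNGE self-test.

```bash
# Generate a credential and decode it locally; a "Success" status means
# MUNGE is running and the key on this node is usable.
munge -n | unmunge
```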
## Step 5: Start SLURM services

:::tip

If you're not sure which Pod is the primary node, run the command `echo $HOSTNAME` in the web terminal of each Pod and look for `node-0`.

:::

1. **On the primary node** (`node-0`), run both SLURM services. First, start the controller:

```bash
slurmctld -D
```

2. Use the web interface to open a second terminal **on the primary node** and run:

```bash
slurmd -D
```

3. **On the secondary node** (`node-1`), run:

```bash
slurmd -D
```

After running these commands, you should see output indicating that the services have started successfully. The `-D` flag keeps each service running in the foreground, so each command needs its own terminal.

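As an optional check (not part of the tutorial's scripts), once the services are running you can confirm from the primary node that the controller sees both nodes:

```bash
# Print each node's name and state as reported by the controller.
scontrol show nodes | grep -E "NodeName|State"
```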
## Step 6: Test your SLURM cluster

1. Run this command **on the primary node** (`node-0`) to check the status of your nodes:

```bash
sinfo
```

You should see output showing both nodes in your cluster, with a state of `idle` if everything is working correctly.

2. Run this command to test GPU availability across both nodes:

```bash
srun --nodes=2 --gres=gpu:1 nvidia-smi -L
```

This command should list all GPUs across both nodes.

## Step 7: Submit the SLURM job script

Run the following command **on the primary node** (`node-0`) to submit the test job script and confirm that your cluster is working properly:

```bash
sbatch test_batch.sh
```

Check the output file created by the test (`test_simple_[JOBID].out`) and look for the hostnames of both nodes. This confirms that the job ran successfully across the cluster.

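If you later want to adapt your own workloads (see the next steps below), a batch script of this general shape is a reasonable starting point. This is an editorial sketch based on the output file name above, not the repository's actual `test_batch.sh`, which may differ.

```bash
#!/bin/bash
#SBATCH --job-name=test_simple
#SBATCH --output=test_simple_%j.out   # %j expands to the job ID
#SBATCH --nodes=2                     # use both nodes in the cluster
#SBATCH --gres=gpu:8                  # request all eight GPUs on each node

# Print the hostname and the visible GPUs from every allocated node.
srun hostname
srun nvidia-smi -L
```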
## Step 8: Clean up

If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges.

:::note

You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing) section under the **Cluster** tab.

:::

## Next steps

Now that you've successfully deployed and tested a SLURM cluster on RunPod, you can:

- **Adapt your own distributed workloads** to run using SLURM job scripts.
- **Scale your cluster** by adjusting the number of Pods to handle larger models or datasets.
- **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
- **Optimize performance** by experimenting with different distributed training strategies.
