---
title: Deploy with SLURM
sidebar_position: 4
description: Learn how to deploy an Instant Cluster and set up SLURM for distributed job scheduling.
---

# Deploy an Instant Cluster with SLURM

This tutorial demonstrates how to use Instant Clusters with [SLURM](https://slurm.schedmd.com/) (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging SLURM on RunPod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.

Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently.

## Requirements

- You've created a [RunPod account](https://www.runpod.io/console/home) and funded it with sufficient credits.
- You have basic familiarity with the Linux command line.
- You're comfortable working with [Pods](/pods/overview) and understand the basics of [SLURM](https://slurm.schedmd.com/).

## Step 1: Deploy an Instant Cluster

1. Open the [Instant Clusters page](https://www.runpod.io/console/cluster) on the RunPod web interface.
2. Click **Create Cluster**.
3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (RunPod PyTorch).
4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.

## Step 2: Clone demo and install SLURM on each Pod

To connect to a Pod:

1. On the Instant Clusters page, click on the cluster you created to expand the list of Pods.
2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.

**On each Pod:**

1. Click **Connect**, then click **Web Terminal**.
2. In the terminal that opens, run this command to clone the SLURM demo repository and move into its directory:

   ```bash
   git clone https://github.com/pandyamarut/slurm_example.git && cd slurm_example
   ```

3. Run this command to install SLURM:

   ```bash
   apt update && apt install -y slurm-wlm slurm-client munge
   ```

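You can optionally confirm that the SLURM binaries are available before moving on. This is a quick sanity check; the exact version numbers will vary:

```bash
sinfo -V
slurmd -V
```
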
## Step 3: Overview of SLURM demo scripts

The repository contains several essential scripts for setting up SLURM. Let's examine what each script does (a configuration sketch follows the list):

- `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node.
- `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup.
- `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment.
- `test_batch.sh`: A sample SLURM job script for testing cluster functionality.

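To give a sense of what these scripts produce, here is a rough illustration of the kind of GRES and node definitions a two-node cluster with 8 GPUs per node might end up with. This is a generic sketch for orientation only; the hostnames, IP addresses, CPU counts, and partition name are placeholders, and the files generated by the repository's scripts may differ:

```
# gres.conf (illustrative): map each node's GPUs to SLURM's "gpu" resource
NodeName=node-0 Name=gpu File=/dev/nvidia[0-7]
NodeName=node-1 Name=gpu File=/dev/nvidia[0-7]

# slurm.conf (illustrative excerpt): node and partition definitions
NodeName=node-0 NodeAddr=10.65.0.2 Gres=gpu:8 CPUs=64 State=UNKNOWN
NodeName=node-1 NodeAddr=10.65.0.3 Gres=gpu:8 CPUs=64 State=UNKNOWN
PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```
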
## Step 4: Configure SLURM on each Pod

Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]` with any secure random string (like a password). The secret key is used for authentication between nodes and must be identical across all Pods in your cluster.

```bash
./install.sh "[MUNGE_SECRET_KEY]" node-0 node-1 10.65.0.2 10.65.0.3
```

This script automates the complex process of configuring a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (master/control) and secondary (compute/worker) nodes.

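The secret key can be any string, as long as it is the same on every Pod. If you want a strong random value, you can generate one once and paste it into the command on both Pods, and then optionally verify MUNGE authentication locally after the script finishes. Both commands below use standard utilities, and the verification step is optional:

```bash
# Generate a random secret to use as [MUNGE_SECRET_KEY] (run once, reuse on every Pod)
openssl rand -base64 32

# Optional: after install.sh completes, check that MUNGE can issue and validate a credential locally
munge -n | unmunge
```
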
## Step 5: Start SLURM services

:::tip

If you're not sure which Pod is the primary node, run the command `echo $HOSTNAME` on the web terminal of each Pod and look for `node-0`.

:::

1. **On the primary node** (`node-0`), start the SLURM controller:

   ```bash
   slurmctld -D
   ```

2. Use the web interface to open a second terminal **on the primary node** and run:

   ```bash
   slurmd -D
   ```

3. **On the secondary node** (`node-1`), run:

   ```bash
   slurmd -D
   ```

After running these commands, you should see output indicating that the services have started successfully. The `-D` flag keeps the services running in the foreground, so each command needs its own terminal.

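Once the services are up, you can quickly confirm that the controller is reachable. `scontrol ping` reports whether `slurmctld` is responding; run it in a separate terminal on either node:

```bash
scontrol ping
```
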
## Step 6: Test your SLURM cluster

1. Run this command **on the primary node** (`node-0`) to check the status of your nodes:

   ```bash
   sinfo
   ```

   You should see output showing both nodes in your cluster, with a state of "idle" if everything is working correctly.

2. Run this command to test GPU availability across both nodes:

   ```bash
   srun --nodes=2 --gres=gpu:1 nvidia-smi -L
   ```

   This command should list all GPUs across both nodes. If a node doesn't report its GPUs or shows an unexpected state, see the troubleshooting commands after this list.

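If either node appears in a state other than `idle` (for example `down` or `drained`), `scontrol` can help you inspect and recover it. These are standard SLURM administration commands; the node name below is just an example:

```bash
# Show detailed node information, including configured GPUs (GRES)
scontrol show nodes

# Return a downed or drained node to service (replace node-1 with the affected node)
scontrol update NodeName=node-1 State=RESUME
```
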
## Step 7: Submit the SLURM job script

Run the following command **on the primary node** (`node-0`) to submit the test job script and confirm that your cluster is working properly:

```bash
sbatch test_batch.sh
```

Check the output file created by the test (`test_simple_[JOBID].out`) and look for the hostnames of both nodes. This confirms that the job ran successfully across the cluster.

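While the job is queued or running, `squeue` shows its status. For reference, a minimal multi-node batch script follows the same general pattern as the demo's test script; the sketch below is a generic example, not the repository's exact contents:

```bash
#!/bin/bash
#SBATCH --job-name=test_simple        # job name
#SBATCH --nodes=2                     # run on both nodes in the cluster
#SBATCH --ntasks-per-node=1           # one task per node
#SBATCH --gres=gpu:1                  # request one GPU per node
#SBATCH --output=test_simple_%j.out   # %j expands to the job ID

# Print the hostname and visible GPUs from every node in the allocation
srun hostname
srun nvidia-smi -L
```
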
## Step 8: Clean up

If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges.

:::note

You can monitor your cluster usage and spending in the **Billing Explorer** section at the bottom of the [Billing page](https://www.runpod.io/console/user/billing), under the **Cluster** tab.

:::

## Next steps

Now that you've successfully deployed and tested a SLURM cluster on RunPod, you can:

- **Adapt your own distributed workloads** to run using SLURM job scripts (see the sketch after this list).
- **Scale your cluster** by adjusting the number of Pods to handle larger models or datasets.
- **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
- **Optimize performance** by experimenting with different distributed training strategies.
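
For example, a distributed PyTorch workload can be launched through a SLURM batch script along roughly the following lines, assuming PyTorch (and therefore `torchrun`) is available in the Pod image, as it is with the default RunPod PyTorch template. This is a generic sketch rather than a tested recipe: the training script name (`train.py`), process counts, rendezvous port, and GPU counts are placeholders to adapt to your own code and cluster size:

```bash
#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=2                     # one task per node; torchrun spawns one process per GPU
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8                  # all 8 GPUs on each node
#SBATCH --output=ddp_train_%j.out

# Use the first node in the allocation as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_id="$SLURM_JOB_ID" \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${MASTER_ADDR}:29500" \
  train.py
```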