
Commit 5f171aa

Add tutorial for Instant Clusters + SLURM (#221)
1 parent 8ad9d42 commit 5f171aa

File tree

- docs/instant-clusters/axolotl.md
- docs/instant-clusters/index.md
- docs/instant-clusters/pytorch.md
- docs/instant-clusters/slurm.md

4 files changed: +180 -39 lines changed

docs/instant-clusters/axolotl.md

Lines changed: 20 additions & 20 deletions
@@ -8,7 +8,7 @@ description: Learn how to deploy an Instant Cluster and use it to fine-tune a la

This tutorial demonstrates how to use Instant Clusters with [Axolotl](https://axolotl.ai/) to fine-tune large language models (LLMs) across multiple GPUs. By leveraging PyTorch's distributed training capabilities and RunPod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.

-Follow the steps below to deploy your Cluster and start training your models efficiently.
+Follow the steps below to deploy a cluster and start training your models efficiently.

## Step 1: Deploy an Instant Cluster

@@ -19,35 +19,35 @@ Follow the steps below to deploy your Cluster and start training your models eff

## Step 2: Set up Axolotl on each Pod

-1. Click your Cluster to expand the list of Pods.
+1. Click your cluster to expand the list of Pods.
2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.
3. Click **Connect**, then click **Web Terminal**.
-4. Clone the Axolotl repository into the Pod's main directory:
+4. In the terminal that opens, run this command to clone the Axolotl repository into the Pod's main directory:

-```bash
-git clone https://github.com/axolotl-ai-cloud/axolotl
-```
+```bash
+git clone https://github.com/axolotl-ai-cloud/axolotl
+```

5. Navigate to the `axolotl` directory:

-```bash
-cd axolotl
-```
+```bash
+cd axolotl
+```

6. Install the required packages:

-```bash
-pip3 install -U packaging setuptools wheel ninja
-pip3 install --no-build-isolation -e '.[flash-attn,deepspeed]'
-```
+```bash
+pip3 install -U packaging setuptools wheel ninja
+pip3 install --no-build-isolation -e '.[flash-attn,deepspeed]'
+```

7. Navigate to the `examples/llama-3` directory:

-```bash
-cd examples/llama-3
-```
+```bash
+cd examples/llama-3
+```

-Repeat these steps for **each Pod** in your Cluster.
+Repeat these steps for **each Pod** in your cluster.

## Step 3: Start the training process on each Pod

@@ -90,11 +90,11 @@ Congrats! You've successfully trained a model using Axolotl on an Instant Cluste

## Step 4: Clean up

-If you no longer need your Cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your Cluster to avoid incurring extra charges.
+If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges.

:::note

-You can monitor your Cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing) section under the **Cluster** tab.
+You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing) section under the **Cluster** tab.

:::

@@ -103,7 +103,7 @@ You can monitor your Cluster usage and spending using the **Billing Explorer** a
Now that you've successfully deployed and tested an Axolotl distributed training job on an Instant Cluster, you can:

- **Fine-tune your own models** by modifying the configuration files in Axolotl to suit your specific requirements.
-- **Scale your training** by adjusting the number of Pods in your Cluster (and the size of their containers and volumes) to handle larger models or datasets.
+- **Scale your training** by adjusting the number of Pods in your cluster (and the size of their containers and volumes) to handle larger models or datasets.
- **Try different optimization techniques** such as DeepSpeed, FSDP (Fully Sharded Data Parallel), or other distributed training strategies.

For more information on fine-tuning with Axolotl, refer to the [Axolotl documentation](https://github.com/OpenAccess-AI-Collective/axolotl).

docs/instant-clusters/index.md

Lines changed: 7 additions & 6 deletions
@@ -24,8 +24,9 @@ All accounts have a default spending limit. To deploy a larger cluster, submit a

Get started with Instant Clusters by following a step-by-step tutorial for your preferred framework:

-- [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch)
-- [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl)
+- [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch).
+- [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl).
+- [Deploy an Instant Cluster with SLURM](/instant-clusters/slurm).

## Use cases for Instant Clusters

@@ -69,12 +70,12 @@ The following environment variables are available in all Pods:
| `PRIMARY_ADDR` / `MASTER_ADDR` | The address of the primary Pod. |
| `PRIMARY_PORT` / `MASTER_PORT` | The port of the primary Pod (all ports are available). |
| `NODE_ADDR` | The static IP of this Pod within the cluster network. |
-| `NODE_RANK` | The Cluster (i.e., global) rank assigned to this Pod (0 for the primary Pod). |
-| `NUM_NODES` | The number of Pods in the Cluster. |
+| `NODE_RANK` | The cluster (i.e., global) rank assigned to this Pod (0 for the primary Pod). |
+| `NUM_NODES` | The number of Pods in the cluster. |
| `NUM_TRAINERS` | The number of GPUs per Pod. |
| `HOST_NODE_ADDR` | Defined as `PRIMARY_ADDR:PRIMARY_PORT` for convenience. |
-| `WORLD_SIZE` | The total number of GPUs in the Cluster (`NUM_NODES` * `NUM_TRAINERS`). |
+| `WORLD_SIZE` | The total number of GPUs in the cluster (`NUM_NODES` * `NUM_TRAINERS`). |

-Each Pod receives a static IP (`NODE_ADDR`) on the overlay network. When a Cluster is deployed, the system designates one Pod as the primary node by setting the `PRIMARY_ADDR` and `PRIMARY_PORT` environment variables. This simplifies working with multiprocessing libraries that require a primary node.
+Each Pod receives a static IP (`NODE_ADDR`) on the overlay network. When a cluster is deployed, the system designates one Pod as the primary node by setting the `PRIMARY_ADDR` and `PRIMARY_PORT` environment variables. This simplifies working with multiprocessing libraries that require a primary node.

The variables `MASTER_ADDR`/`PRIMARY_ADDR` and `MASTER_PORT`/`PRIMARY_PORT` are equivalent. The `MASTER_*` variables provide compatibility with tools that expect these legacy names.
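For orientation, here is one way these variables are commonly wired into a `torchrun` launch. This is an editorial sketch, not part of the commit; `main.py` stands in for whatever training script you run, and the PyTorch tutorial's own `launcher.sh` (shown below) is the authoritative example.

```bash
# Launch one process per GPU on this Pod; every Pod runs the same command
# and joins the job coordinated by the primary Pod.
torchrun \
  --nproc_per_node=$NUM_TRAINERS \
  --nnodes=$NUM_NODES \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  main.py
```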

docs/instant-clusters/pytorch.md

Lines changed: 13 additions & 13 deletions
@@ -8,7 +8,7 @@ description: Learn how to deploy an Instant Cluster and run a multi-node process

This tutorial demonstrates how to use Instant Clusters with [PyTorch](http://pytorch.org) to run distributed workloads across multiple GPUs. By leveraging PyTorch's distributed processing capabilities and RunPod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.

-Follow the steps below to deploy your Cluster and start running distributed PyTorch workloads efficiently.
+Follow the steps below to deploy a cluster and start running distributed PyTorch workloads efficiently.

## Step 1: Deploy an Instant Cluster

@@ -19,22 +19,22 @@ Follow the steps below to deploy your Cluster and start running distributed PyTo

## Step 2: Clone the PyTorch demo into each Pod

-1. Click your Cluster to expand the list of Pods.
+1. Click your cluster to expand the list of Pods.
2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.
3. Click **Connect**, then click **Web Terminal**.
-4. Run this command to clone a basic `main.py` file into the Pod's main directory:
+4. In the terminal that opens, run this command to clone a basic `main.py` file into the Pod's main directory:

-```bash
-git clone https://github.com/murat-runpod/torch-demo.git
-```
+```bash
+git clone https://github.com/murat-runpod/torch-demo.git
+```

-Repeat these steps for **each Pod** in your Cluster.
+Repeat these steps for **each Pod** in your cluster.

## Step 3: Examine the main.py file

Let's look at the code in our `main.py` file:

-```python
+```python title="main.py"
import os
import torch
import torch.distributed as dist
@@ -80,7 +80,7 @@ This is the minimal code necessary for initializing a distributed environment. T

Run this command in the web terminal of **each Pod** to start the PyTorch process:

-```bash
+```bash title="launcher.sh"
export NCCL_DEBUG=WARN
torchrun \
--nproc_per_node=$NUM_TRAINERS \
@@ -106,7 +106,7 @@ Running on rank 14/15 (local rank: 6), device: cuda:6
Running on rank 10/15 (local rank: 2), device: cuda:2
```

-The first number refers to the global rank of the thread, spanning from `0` to `WORLD_SIZE-1` (`WORLD_SIZE` = the total number of GPUs in the Cluster). In our example there are two Pods of eight GPUs, so the global rank spans from 0-15. The second number is the local rank, which defines the order of GPUs within a single Pod (0-7 for this example).
+The first number refers to the global rank of the thread, spanning from `0` to `WORLD_SIZE-1` (`WORLD_SIZE` = the total number of GPUs in the cluster). In our example there are two Pods of eight GPUs, so the global rank spans from 0-15. The second number is the local rank, which defines the order of GPUs within a single Pod (0-7 for this example).

The specific number and order of ranks may be different in your terminal, and the global ranks listed will be different for each Pod.

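To make the numbering concrete, the global rank follows from the cluster variables as `NODE_RANK * NUM_TRAINERS + local rank`. The snippet below is an editorial illustration of that arithmetic, not part of the demo repository.

```bash
# Local rank 2 on the second Pod (NODE_RANK=1) of an 8-GPU-per-Pod cluster
# (NUM_TRAINERS=8) maps to global rank 10, matching the sample output above.
NODE_RANK=1 NUM_TRAINERS=8 LOCAL_RANK=2
echo $((NODE_RANK * NUM_TRAINERS + LOCAL_RANK))   # prints 10
```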
@@ -116,7 +116,7 @@ This diagram illustrates how local and global ranks are distributed across multi

## Step 5: Clean up

-If you no longer need your Cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your Cluster to avoid incurring extra charges.
+If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges.

:::note

@@ -128,8 +128,8 @@ You can monitor your cluster usage and spending using the **Billing Explorer** a

Now that you've successfully deployed and tested a PyTorch distributed application on an Instant Cluster, you can:

-- **Adapt your own PyTorch code** to run on the Cluster by modifying the distributed initialization in your scripts.
-- **Scale your training** by adjusting the number of Pods in your Cluster to handle larger models or datasets.
+- **Adapt your own PyTorch code** to run on the cluster by modifying the distributed initialization in your scripts.
+- **Scale your training** by adjusting the number of Pods in your cluster to handle larger models or datasets.
- **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
- **Optimize performance** by experimenting with different distributed training strategies like Data Parallel (DP), Distributed Data Parallel (DDP), or Fully Sharded Data Parallel (FSDP).

docs/instant-clusters/slurm.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
---
title: Deploy with SLURM
sidebar_position: 4
description: Learn how to deploy an Instant Cluster and set up SLURM for distributed job scheduling.
---

# Deploy an Instant Cluster with SLURM

This tutorial demonstrates how to use Instant Clusters with [SLURM](https://slurm.schedmd.com/) (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging SLURM on RunPod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.

Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently.

## Requirements

- You've created a [RunPod account](https://www.runpod.io/console/home) and funded it with sufficient credits.
- You have basic familiarity with the Linux command line.
- You're comfortable working with [Pods](/pods/overview) and understand the basics of [SLURM](https://slurm.schedmd.com/).

## Step 1: Deploy an Instant Cluster

1. Open the [Instant Clusters page](https://www.runpod.io/console/cluster) on the RunPod web interface.
2. Click **Create Cluster**.
3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (RunPod PyTorch).
4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.

## Step 2: Clone the demo and install SLURM on each Pod

To connect to a Pod:

1. On the Instant Clusters page, click on the cluster you created to expand the list of Pods.
2. Click on a Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.

**On each Pod:**

1. Click **Connect**, then click **Web Terminal**.
2. In the terminal that opens, run this command to clone the SLURM demo files into the Pod's main directory:

```bash
git clone https://github.com/pandyamarut/slurm_example.git && cd slurm_example
```

3. Run this command to install SLURM:

```bash
apt update && apt install -y slurm-wlm slurm-client munge
```

## Step 3: Overview of SLURM demo scripts

The repository contains several essential scripts for setting up SLURM. Let's examine what each script does:

- `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node.
- `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup.
- `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment.
- `test_batch.sh`: A sample SLURM job script for testing cluster functionality.

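To give a sense of what these scripts generate, the entries below illustrate the kind of node and GPU definitions that `create_gres_conf.sh` and `create_slurm_conf.sh` produce for this two-Pod, eight-GPU-per-Pod cluster. This is an editorial sketch; the partition name and exact fields are assumptions, and the files the scripts actually write may differ.

```
# gres.conf (illustrative): one GPU GRES line per node, eight GPUs each
NodeName=node-0 Name=gpu File=/dev/nvidia[0-7]
NodeName=node-1 Name=gpu File=/dev/nvidia[0-7]

# slurm.conf (illustrative): node definitions and a single partition
NodeName=node-0 NodeAddr=10.65.0.2 Gres=gpu:8 State=UNKNOWN
NodeName=node-1 NodeAddr=10.65.0.3 Gres=gpu:8 State=UNKNOWN
PartitionName=main Nodes=node-0,node-1 Default=YES MaxTime=INFINITE State=UP
```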
## Step 4: Install SLURM on each Pod

Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]` with any secure random string (like a password). The secret key is used for authentication between nodes, and must be identical across all Pods in your cluster.

```bash
./install.sh "[MUNGE_SECRET_KEY]" node-0 node-1 10.65.0.2 10.65.0.3
```

This script automates the complex process of configuring a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e., master/control) and secondary (i.e., compute/worker) nodes.

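Before starting the SLURM services, you can optionally confirm that MUNGE authentication is working on each Pod. This check isn't part of the demo repository; it's the standard MUNGE self-test.

```bash
# Generate a credential and decode it locally; a "Success" status means
# MUNGE is running and the key on this node is usable.
munge -n | unmunge
```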
## Step 5: Start SLURM services

:::tip

If you're not sure which Pod is the primary node, run the command `echo $HOSTNAME` in the web terminal of each Pod and look for `node-0`.

:::

1. **On the primary node** (`node-0`), run both SLURM services. First, start the controller:

```bash
slurmctld -D
```

2. Use the web interface to open a second terminal **on the primary node** and run:

```bash
slurmd -D
```

3. **On the secondary node** (`node-1`), run:

```bash
slurmd -D
```

After running these commands, you should see output indicating that the services have started successfully. The `-D` flag keeps each service running in the foreground, so each command needs its own terminal.

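As an optional check (not part of the tutorial's scripts), once the services are running you can confirm from the primary node that the controller sees both nodes:

```bash
# Print each node's name and state as reported by the controller.
scontrol show nodes | grep -E "NodeName|State"
```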
## Step 6: Test your SLURM cluster

1. Run this command **on the primary node** (`node-0`) to check the status of your nodes:

```bash
sinfo
```

You should see output showing both nodes in your cluster, with a state of `idle` if everything is working correctly.

2. Run this command to test GPU availability across both nodes:

```bash
srun --nodes=2 --gres=gpu:1 nvidia-smi -L
```

This command should list all GPUs across both nodes.

## Step 7: Submit the SLURM job script

Run the following command **on the primary node** (`node-0`) to submit the test job script and confirm that your cluster is working properly:

```bash
sbatch test_batch.sh
```

Check the output file created by the test (`test_simple_[JOBID].out`) and look for the hostnames of both nodes. This confirms that the job ran successfully across the cluster.

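If you later want to adapt your own workloads (see the next steps below), a batch script of this general shape is a reasonable starting point. This is an editorial sketch based on the output file name above, not the repository's actual `test_batch.sh`, which may differ.

```bash
#!/bin/bash
#SBATCH --job-name=test_simple
#SBATCH --output=test_simple_%j.out   # %j expands to the job ID
#SBATCH --nodes=2                     # use both nodes in the cluster
#SBATCH --gres=gpu:8                  # request all eight GPUs on each node

# Print the hostname and the visible GPUs from every allocated node.
srun hostname
srun nvidia-smi -L
```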
## Step 8: Clean up

If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your cluster to avoid incurring extra charges.

:::note

You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing) section under the **Cluster** tab.

:::

## Next steps

Now that you've successfully deployed and tested a SLURM cluster on RunPod, you can:

- **Adapt your own distributed workloads** to run using SLURM job scripts.
- **Scale your cluster** by adjusting the number of Pods to handle larger models or datasets.
- **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
- **Optimize performance** by experimenting with different distributed training strategies.
