Add Slurm Cluster documentation #306

Open
wants to merge 15 commits into base: main
13 changes: 10 additions & 3 deletions docs.json
@@ -142,9 +142,16 @@
"group": "Instant Clusters",
"pages": [
"instant-clusters",
"instant-clusters/pytorch",
"instant-clusters/axolotl",
"instant-clusters/slurm"
"instant-clusters/slurm-clusters",
{
"group": "Deployment guides",
"pages": [
"instant-clusters/pytorch",
"instant-clusters/axolotl",
"instant-clusters/slurm"
]
}

]
},
{
7 changes: 4 additions & 3 deletions instant-clusters.mdx
@@ -1,5 +1,5 @@
---
title: "Instant Clusters"
title: "Overview"
sidebarTitle: "Overview"
---

@@ -27,9 +27,10 @@ All accounts have a default spending limit. To deploy a larger cluster, submit a

Get started with Instant Clusters by following a step-by-step tutorial for your preferred framework:

* [Deploy a Slurm Cluster](/instant-clusters/slurm-clusters).
* [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch).
* [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl).
* [Deploy an Instant Cluster with Slurm](/instant-clusters/slurm).
* [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm).

## Use cases for Instant Clusters

@@ -66,7 +67,7 @@ Instant Clusters support up to 8 interfaces per Pod. Each interface (`eth1` - `e

## Environment variables

The following environment variables are available in all Pods:
The following environment variables are present in all Pods on an Instant Cluster:

| Environment Variable | Description |
| ------------------------------ | ----------------------------------------------------------------------------- |
2 changes: 1 addition & 1 deletion instant-clusters/axolotl.mdx
@@ -1,6 +1,6 @@
---
title: "Deploy an Instant Cluster with Axolotl"
sidebarTitle: "Deploy with Axolotl"
sidebarTitle: "Axolotl"
---

This tutorial demonstrates how to use Instant Clusters with [Axolotl](https://axolotl.ai/) to fine-tune large language models (LLMs) across multiple GPUs. By leveraging PyTorch's distributed training capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.
2 changes: 1 addition & 1 deletion instant-clusters/pytorch.mdx
@@ -1,6 +1,6 @@
---
title: "Deploy an Instant Cluster with PyTorch"
sidebarTitle: "Deploy with PyTorch"
sidebarTitle: "PyTorch"
---

This tutorial demonstrates how to use Instant Clusters with [PyTorch](http://pytorch.org) to run distributed workloads across multiple GPUs. By leveraging PyTorch's distributed processing capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.
101 changes: 101 additions & 0 deletions instant-clusters/slurm-clusters.mdx
@@ -0,0 +1,101 @@
---
title: Slurm Clusters
sidebarTitle: Slurm Clusters
description: Deploy fully managed Slurm Clusters on Runpod with zero configuration
tag: "BETA"
---

<Note>
Slurm Clusters are currently in beta. If you'd like to provide feedback, please [join our Discord](https://discord.gg/runpod).
</Note>

Runpod Slurm Clusters provide a fully managed high-performance computing and scheduling solution, letting you create and manage Slurm Clusters with minimal setup.

For more information on working with Slurm, refer to the [Slurm documentation](https://slurm.schedmd.com/documentation.html).

## Key features

Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing:

- **Zero configuration setup:** Slurm and munge are pre-installed and fully configured.
- **Instant provisioning:** Clusters deploy rapidly with minimal setup.
- **Automatic role assignment:** Runpod automatically designates controller and agent nodes.
- **Built-in optimizations:** Pre-configured for optimal NCCL performance.
- **Full Slurm compatibility:** All standard Slurm commands work out-of-the-box.

<Tip>

If you prefer to manually configure your Slurm deployment, see [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm) for a step-by-step guide.

</Tip>

## Deploy a Slurm Cluster

1. Open the [Instant Clusters page](https://console.runpod.io/cluster) on the Runpod console.
2. Click **Create Cluster**.
3. Select **Slurm Cluster** from the cluster type dropdown menu.
4. Configure your cluster specifications:
- **Cluster name**: Enter a descriptive name for your cluster.
- **Pod count**: Choose the number of Pods in your cluster.
- **GPU type**: Select your preferred [GPU type](/references/gpu-types).
- **Region**: Choose your deployment region.
- **Network volume** (optional): Add a [network volume](/pods/storage/create-network-volumes) for persistent/shared storage. If using a network volume, ensure the region matches your cluster region.
- **Pod template**: Select a [Pod template](/pods/templates/overview) or click **Edit Template** to customize start commands, environment variables, ports, or [container/volume disk](/pods/storage/types) capacity.
5. Click **Deploy Cluster**.

## Connect to a Slurm Cluster

Once deployment completes, you can access your cluster from the [Instant Clusters page](https://console.runpod.io/cluster).

From this page you can select a cluster to view its component nodes, including labels indicating the **Slurm controller** (primary node) and **Slurm agents** (secondary nodes). Expand a node to view details like availability, GPU/storage utilization, and options for connection and management.

Connect to a node using the **Connect** button, or using any of the [connection methods supported by Pods](/pods/connect-to-a-pod).

## Submit and manage jobs

All standard Slurm commands are available without configuration. For example, you can:

Check cluster status and available resources:

```bash
sinfo
```
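
For per-node detail, `sinfo` also supports a node-oriented long view (standard flags, shown here as a convenience):

```bash
# One line per node, long format: state, CPUs, and memory for each node.
sinfo -N -l
```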

Submit a job to the cluster from the Slurm controller node:

```bash
sbatch your-job-script.sh
```

Monitor job queue and status:

```bash
squeue
```

View detailed job information from the Slurm controller node:

```bash
scontrol show job JOB_ID
```
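
For reference, a minimal batch script for a cluster like this might look like the sketch below; the job name, output pattern, node count, and GPU request are illustrative, so adjust them to your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=gpu-check        # illustrative job name
#SBATCH --output=gpu-check_%j.out   # per-job output file (%j = job ID)
#SBATCH --nodes=2                   # adjust to your Pod count
#SBATCH --gres=gpu:8                # GPUs per node; adjust to your GPU count

# Run one task per node and report each hostname and its visible GPU count.
srun --ntasks-per-node=1 bash -c 'echo "$(hostname): $(nvidia-smi -L | wc -l) GPUs visible"'
```

Submit it with `sbatch` from the Slurm controller node and track it with `squeue`, as shown above.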

## Advanced configuration

While Runpod's Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm controller node using the [web terminal or SSH](/pods/connect-to-pods).

Access Slurm configuration files in their standard locations:
- `/etc/slurm/slurm.conf` - Main configuration file.
- `/etc/slurm/topology.conf` - Network topology configuration.
- `/etc/slurm/gres.conf` - Generic resource configuration.

Modify these files as needed for your specific requirements.
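
For example, a minimal GRES entry for an 8-GPU node might look like the sketch below; the node name and device paths are illustrative, so match them to your own nodes (for instance, from `scontrol show nodes`). After editing configuration files, `scontrol reconfigure` asks the running daemons to re-read them.

```bash
# Illustrative /etc/slurm/gres.conf entry for a node with 8 GPUs.
# Replace the node name and device paths with values from your cluster.
NodeName=node-0 Name=gpu File=/dev/nvidia[0-7]
```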

## Troubleshooting

If you encounter issues with your Slurm Cluster, try the following:

- **Jobs stuck in pending state:** Check resource availability with `sinfo` and ensure requested resources are available. If you need more resources, you can add more nodes to your cluster.
- **Authentication errors:** Munge is pre-configured, but if issues arise, verify the munge service is running on all nodes (see the check below).
- **Performance issues:** Review topology configuration and ensure jobs are using appropriate resource requests.
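
A quick way to check munge on a node is to round-trip a credential through the local daemon using the standard munge utilities:

```bash
# Confirm the munge daemon process is running on this node.
pgrep -a munged

# Encode and decode a credential locally; "STATUS: Success" in the
# output means the daemon is up and the local key is usable.
munge -n | unmunge
```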

For additional support, contact [Runpod support](https://contact.runpod.io/) with your cluster ID and specific error messages.
54 changes: 30 additions & 24 deletions instant-clusters/slurm.mdx
@@ -1,17 +1,23 @@
---
title: "Deploy an Instant Cluster with SLURM"
sidebarTitle: "Deploy with SLURM"
title: "Deploy an Instant Cluster with Slurm (unmanaged)"
sidebarTitle: "Slurm (unmanaged)"
---

This tutorial demonstrates how to use Instant Clusters with [SLURM](https://slurm.schedmd.com/) (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging SLURM on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.
<Tip>

This guide is for advanced users who want to configure and manage their own Slurm deployment on Instant Clusters. If you're looking for a pre-configured solution, see [Slurm Clusters](/instant-clusters/slurm-clusters).

</Tip>

Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently.
This tutorial demonstrates how to configure Runpod Instant Clusters with [Slurm](https://slurm.schedmd.com/) to manage and schedule distributed workloads across multiple nodes. Slurm is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging Slurm on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.

Follow the steps below to deploy a cluster and start running distributed Slurm workloads efficiently.

## Requirements

* You've created a [Runpod account](https://www.console.runpod.io/home) and funded it with sufficient credits.
* You have basic familiarity with Linux command line.
* You're comfortable working with [Pods](/pods/overview) and understand the basics of [SLURM](https://slurm.schedmd.com/).
* You're comfortable working with [Pods](/pods/overview) and understand the basics of [Slurm](https://slurm.schedmd.com/).

## Step 1: Deploy an Instant Cluster

@@ -20,7 +26,7 @@ Follow the steps below to deploy a cluster and start running distributed SLURM w
3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (Runpod PyTorch).
4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.

## Step 2: Clone demo and install SLURM on each Pod
## Step 2: Clone demo and install Slurm on each Pod

To connect to a Pod:

@@ -31,46 +37,46 @@ To connect to a Pod:

1. Click **Connect**, then click **Web Terminal**.

2. In the terminal that opens, run this command to clone the SLURM demo files into the Pod's main directory:
2. In the terminal that opens, run this command to clone the Slurm demo files into the Pod's main directory:

```bash
git clone https://github.com/pandyamarut/slurm_example.git && cd slurm_example
```

3. Run this command to install SLURM:
3. Run this command to install Slurm:

```bash
apt update && apt install -y slurm-wlm slurm-client munge
```

## Step 3: Overview of SLURM demo scripts
## Step 3: Overview of Slurm demo scripts

The repository contains several essential scripts for setting up SLURM. Let's examine what each script does:
The repository contains several essential scripts for setting up Slurm. Let's examine what each script does:

* `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node.
* `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup.
* `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment.
* `test_batch.sh`: A sample SLURM job script for testing cluster functionality.
* `create_gres_conf.sh`: Generates the Slurm Generic Resource (GRES) configuration file that defines GPU resources for each node.
* `create_slurm_conf.sh`: Creates the main Slurm configuration file with cluster settings, node definitions, and partition setup.
* `install.sh`: The primary installation script that sets up MUNGE authentication, configures Slurm, and prepares the environment.
* `test_batch.sh`: A sample Slurm job script for testing cluster functionality.

## Step 4: Install SLURM on each Pod
## Step 4: Install Slurm on each Pod

Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]` with any secure random string (like a password). The secret key is used for authentication between nodes, and must be identical across all Pods in your cluster.

```bash
./install.sh "[MUNGE_SECRET_KEY]" node-0 node-1 10.65.0.2 10.65.0.3
```

This script automates the complex process of configuring a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e. master/control) and secondary (i.e compute/worker) nodes.
This script automates the complex process of configuring a two-node Slurm cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e., master/control) and secondary (i.e., compute/worker) nodes.
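
If you want a strong random value for the secret, one option (assuming `openssl` is available in your Pod image) is to generate it once and reuse the same string in the install command on every Pod:

```bash
# Generate a random MUNGE secret; use the exact same value on every Pod.
openssl rand -base64 32
```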

## Step 5: Start SLURM services
## Step 5: Start Slurm services

<Tip>

If you're not sure which Pod is the primary node, run the command `echo $HOSTNAME` in the web terminal of each Pod and look for `node-0`.

</Tip>

1. **On the primary node** (`node-0`), run both SLURM services:
1. **On the primary node** (`node-0`), run both Slurm services:

```bash
slurmctld -D
@@ -90,7 +96,7 @@

After running these commands, you should see output indicating that the services have started successfully. The `-D` flag keeps the services running in the foreground, so each command needs its own terminal.
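
If you'd prefer not to dedicate a terminal to each service, one alternative is to run them in the background and send their output to log files; the log paths below are illustrative.

```bash
# Primary node only: run the controller in the background.
nohup slurmctld -D > /var/log/slurmctld.log 2>&1 &

# Every node: run the compute daemon in the background.
nohup slurmd -D > /var/log/slurmd.log 2>&1 &
```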

## Step 6: Test your SLURM Cluster
## Step 6: Test your Slurm Cluster

1. Run this command **on the primary node** (`node-0`) to check the status of your nodes:

@@ -108,7 +114,7 @@

This command should list all GPUs across both nodes.

## Step 7: Submit the SLURM job script
## Step 7: Submit the Slurm job script

Run the following command **on the primary node** (`node-0`) to submit the test job script and confirm that your cluster is working properly:

@@ -122,17 +128,17 @@

If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.console.runpod.io/cluster) and delete your cluster to avoid incurring extra charges.

<Info>
<Tip>

You can monitor your cluster usage and spending with the **Billing Explorer** at the bottom of the [Billing page](https://www.console.runpod.io/user/billing), under the **Cluster** tab.

</Info>
</Tip>

## Next steps

Now that you've successfully deployed and tested a SLURM cluster on Runpod, you can:
Now that you've successfully deployed and tested a Slurm cluster on Runpod, you can:

* **Adapt your own distributed workloads** to run using SLURM job scripts.
* **Adapt your own distributed workloads** to run using Slurm job scripts.
* **Scale your cluster** by adjusting the number of Pods to handle larger models or datasets.
* **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
* **Optimize performance** by experimenting with different distributed training strategies.