diff --git a/docs.json b/docs.json
index c23ccbd8..e6625dac 100644
--- a/docs.json
+++ b/docs.json
@@ -142,9 +142,16 @@
"group": "Instant Clusters",
"pages": [
"instant-clusters",
- "instant-clusters/pytorch",
- "instant-clusters/axolotl",
- "instant-clusters/slurm"
+ "instant-clusters/slurm-clusters",
+ {
+ "group": "Deployment guides",
+ "pages": [
+ "instant-clusters/pytorch",
+ "instant-clusters/axolotl",
+ "instant-clusters/slurm"
+ ]
+ }
]
},
{
diff --git a/instant-clusters.mdx b/instant-clusters.mdx
index 3cf6035e..77be9d40 100644
--- a/instant-clusters.mdx
+++ b/instant-clusters.mdx
@@ -1,5 +1,5 @@
---
-title: "Instant Clusters"
+title: "Overview"
sidebarTitle: "Overview"
---
@@ -27,9 +27,10 @@ All accounts have a default spending limit. To deploy a larger cluster, submit a
Get started with Instant Clusters by following a step-by-step tutorial for your preferred framework:
+* [Deploy a Slurm Cluster](/instant-clusters/slurm-clusters).
* [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch).
* [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl).
-* [Deploy an Instant Cluster with Slurm](/instant-clusters/slurm).
+* [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm).
## Use cases for Instant Clusters
@@ -66,7 +67,7 @@ Instant Clusters support up to 8 interfaces per Pod. Each interface (`eth1` - `e
## Environment variables
-The following environment variables are available in all Pods:
+The following environment variables are present in all Pods on an Instant Cluster:
| Environment Variable | Description |
| ------------------------------ | ----------------------------------------------------------------------------- |
diff --git a/instant-clusters/axolotl.mdx b/instant-clusters/axolotl.mdx
index 7bef1354..a4075e24 100644
--- a/instant-clusters/axolotl.mdx
+++ b/instant-clusters/axolotl.mdx
@@ -1,6 +1,6 @@
---
title: "Deploy an Instant Cluster with Axolotl"
-sidebarTitle: "Deploy with Axolotl"
+sidebarTitle: "Axolotl"
---
This tutorial demonstrates how to use Instant Clusters with [Axolotl](https://axolotl.ai/) to fine-tune large language models (LLMs) across multiple GPUs. By leveraging PyTorch's distributed training capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.
diff --git a/instant-clusters/pytorch.mdx b/instant-clusters/pytorch.mdx
index 182f1664..90397ea6 100644
--- a/instant-clusters/pytorch.mdx
+++ b/instant-clusters/pytorch.mdx
@@ -1,6 +1,6 @@
---
title: "Deploy an Instant Cluster with PyTorch"
-sidebarTitle: "Deploy with PyTorch"
+sidebarTitle: "PyTorch"
---
This tutorial demonstrates how to use Instant Clusters with [PyTorch](http://pytorch.org) to run distributed workloads across multiple GPUs. By leveraging PyTorch's distributed processing capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.
diff --git a/instant-clusters/slurm-clusters.mdx b/instant-clusters/slurm-clusters.mdx
new file mode 100644
index 00000000..827cda85
--- /dev/null
+++ b/instant-clusters/slurm-clusters.mdx
@@ -0,0 +1,101 @@
+---
+title: Slurm Clusters
+sidebarTitle: Slurm Clusters
+description: Deploy fully managed Slurm Clusters on Runpod with zero configuration
+tag: "BETA"
+---
+
+
+Slurm Clusters are currently in beta. If you'd like to provide feedback, please [join our Discord](https://discord.gg/runpod).
+
+
+Runpod Slurm Clusters are a fully managed high-performance computing and scheduling solution that lets you create and manage a working Slurm environment with minimal setup.
+
+For more information on working with Slurm, refer to the [Slurm documentation](https://slurm.schedmd.com/documentation.html).
+
+## Key features
+
+Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing:
+
+- **Zero configuration setup:** Slurm and MUNGE are pre-installed and fully configured.
+- **Instant provisioning:** Clusters deploy rapidly with minimal setup.
+- **Automatic role assignment:** Runpod automatically designates controller and agent nodes.
+- **Built-in optimizations:** Pre-configured for optimal NCCL performance.
+- **Full Slurm compatibility:** All standard Slurm commands work out of the box.
+
+
+
+If you prefer to manually configure your Slurm deployment, see [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm) for a step-by-step guide.
+
+
+
+## Deploy a Slurm Cluster
+
+1. Open the [Instant Clusters page](https://console.runpod.io/cluster) on the Runpod console.
+2. Click **Create Cluster**.
+3. Select **Slurm Cluster** from the cluster type dropdown menu.
+4. Configure your cluster specifications:
+ - **Cluster name**: Enter a descriptive name for your cluster.
+ - **Pod count**: Choose the number of Pods in your cluster.
+ - **GPU type**: Select your preferred [GPU type](/references/gpu-types).
+ - **Region**: Choose your deployment region.
+ - **Network volume** (optional): Add a [network volume](/pods/storage/create-network-volumes) for persistent/shared storage. If using a network volume, ensure the region matches your cluster region.
+ - **Pod template**: Select a [Pod template](/pods/templates/overview) or click **Edit Template** to customize start commands, environment variables, ports, or [container/volume disk](/pods/storage/types) capacity.
+5. Click **Deploy Cluster**.
+
+## Connect to a Slurm Cluster
+
+Once deployment completes, you can access your cluster from the [Instant Clusters page](https://console.runpod.io/cluster).
+
+From this page, you can select a cluster to view its component nodes. Labels indicate which node is the **Slurm controller** (primary node) and which are **Slurm agents** (secondary nodes). Expand a node to view details like availability, GPU and storage utilization, and options for connection and management.
+
+Connect to a node using the **Connect** button, or use any of the [connection methods supported by Pods](/pods/connect-to-a-pod).
+
+## Submit and manage jobs
+
+All standard Slurm commands are available without any additional setup. For example, you can:
+
+Check cluster status and available resources:
+
+```bash
+sinfo
+```
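+
+If you want more detail than the default summary, `sinfo` also supports node-oriented output. The flags below are standard Slurm options, though the exact columns you see depend on how your cluster is configured:
+
+```bash
+# Long, node-oriented listing
+sinfo -N -l
+
+# Show each node alongside its generic resources (GPUs)
+sinfo -N -o "%N %G"
+```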
+
+Submit a job to the cluster from the Slurm controller node:
+
+```bash
+sbatch your-job-script.sh
+```
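+
+The script name above is a placeholder. As a minimal sketch of what a batch script might look like (assuming a two-node cluster with GPUs exposed to Slurm as generic resources, as on a managed Slurm Cluster), you could start from something like this:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=hello-cluster
+#SBATCH --nodes=2                # run across both nodes of a 2-Pod cluster
+#SBATCH --ntasks-per-node=1
+#SBATCH --gres=gpu:1             # request one GPU per node; adjust to your GPU count
+#SBATCH --output=%x_%j.out       # write output to <job-name>_<job-id>.out
+
+# Print the hostname and the GPUs visible on each allocated node
+srun bash -c 'echo "Running on $(hostname)"; nvidia-smi -L'
+```
+
+Submit it with `sbatch` as shown above, then check the generated `.out` file once the job finishes.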
+
+Monitor job queue and status:
+
+```bash
+squeue
+```
+
+View detailed job information from the Slurm controller node:
+
+```bash
+scontrol show job JOB_ID
+```
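+
+If a job needs to be stopped, `scancel` removes it from the queue (or kills it if it's already running); replace `JOB_ID` with the ID reported by `squeue`:
+
+```bash
+scancel JOB_ID
+```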
+
+## Advanced configuration
+
+While Runpod Slurm Clusters work out of the box, you can customize your configuration by connecting to the Slurm controller node using the [web terminal or SSH](/pods/connect-to-a-pod).
+
+Access Slurm configuration files in their standard locations:
+- `/etc/slurm/slurm.conf` - Main configuration file.
+- `/etc/slurm/topology.conf` - Network topology configuration.
+- `/etc/slurm/gres.conf` - Generic resource configuration.
+
+Modify these files as needed for your specific requirements.
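+
+As a sketch of a typical workflow: after editing `slurm.conf` on the controller, you can ask the Slurm daemons to re-read their configuration with `scontrol reconfigure`. This is a standard Slurm command, but some settings only take effect after restarting `slurmctld` and `slurmd`:
+
+```bash
+# On the Slurm controller node, after editing /etc/slurm/slurm.conf
+scontrol reconfigure
+
+# Confirm that the nodes picked up the change
+sinfo
+```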
+
+## Troubleshooting
+
+If you encounter issues with your Slurm Cluster, try the following:
+
+- **Jobs stuck in pending state:** Check node and resource availability with `sinfo` and make sure the resources your job requests can actually be satisfied. If you need more capacity, you can add more nodes to your cluster.
+- **Authentication errors:** MUNGE is pre-configured, but if authentication issues arise, verify that the MUNGE service is running on all nodes (see the check below).
+- **Performance issues:** Review the topology configuration and make sure jobs request appropriate resources.
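+
+For the authentication check above, MUNGE ships with a simple round-trip test. The commands below are standard MUNGE and Linux utilities, though how the service is supervised can vary by image:
+
+```bash
+# Verify that a credential can be created and decoded locally
+munge -n | unmunge
+
+# Check that the munge daemon is running on this node
+pgrep -a munged
+```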
+
+For additional support, contact [Runpod support](https://contact.runpod.io/) with your cluster ID and specific error messages.
\ No newline at end of file
diff --git a/instant-clusters/slurm.mdx b/instant-clusters/slurm.mdx
index 4eaace34..93275ff0 100644
--- a/instant-clusters/slurm.mdx
+++ b/instant-clusters/slurm.mdx
@@ -1,17 +1,23 @@
---
-title: "Deploy an Instant Cluster with SLURM"
-sidebarTitle: "Deploy with SLURM"
+title: "Deploy an Instant Cluster with Slurm (unmanaged)"
+sidebarTitle: "Slurm (unmanaged)"
---
-This tutorial demonstrates how to use Instant Clusters with [SLURM](https://slurm.schedmd.com/) (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging SLURM on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.
+
+
+This guide is for advanced users who want to configure and manage their own Slurm deployment on Instant Clusters. If you're looking for a pre-configured solution, see [Slurm Clusters](/instant-clusters/slurm-clusters).
+
+
-Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently.
+This tutorial demonstrates how to configure Runpod Instant Clusters with [Slurm](https://slurm.schedmd.com/) to manage and schedule distributed workloads across multiple nodes. Slurm is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging Slurm on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.
+
+Follow the steps below to deploy a cluster and start running distributed Slurm workloads efficiently.
## Requirements
* You've created a [Runpod account](https://www.console.runpod.io/home) and funded it with sufficient credits.
* You have basic familiarity with Linux command line.
-* You're comfortable working with [Pods](/pods/overview) and understand the basics of [SLURM](https://slurm.schedmd.com/).
+* You're comfortable working with [Pods](/pods/overview) and understand the basics of [Slurm](https://slurm.schedmd.com/).
## Step 1: Deploy an Instant Cluster
@@ -20,7 +26,7 @@ Follow the steps below to deploy a cluster and start running distributed SLURM w
3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (Runpod PyTorch).
4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.
-## Step 2: Clone demo and install SLURM on each Pod
+## Step 2: Clone demo and install Slurm on each Pod
To connect to a Pod:
@@ -31,28 +37,28 @@ To connect to a Pod:
1. Click **Connect**, then click **Web Terminal**.
-2. In the terminal that opens, run this command to clone the SLURM demo files into the Pod's main directory:
+2. In the terminal that opens, run this command to clone the Slurm demo files into the Pod's main directory:
```bash
git clone https://github.com/pandyamarut/slurm_example.git && cd slurm_example
```
-3. Run this command to install SLURM:
+3. Run this command to install Slurm:
```bash
apt update && apt install -y slurm-wlm slurm-client munge
```
-## Step 3: Overview of SLURM demo scripts
+## Step 3: Overview of Slurm demo scripts
-The repository contains several essential scripts for setting up SLURM. Let's examine what each script does:
+The repository contains several essential scripts for setting up Slurm. Let's examine what each script does:
-* `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node.
-* `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup.
-* `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment.
-* `test_batch.sh`: A sample SLURM job script for testing cluster functionality.
+* `create_gres_conf.sh`: Generates the Slurm Generic Resource (GRES) configuration file that defines GPU resources for each node.
+* `create_slurm_conf.sh`: Creates the main Slurm configuration file with cluster settings, node definitions, and partition setup.
+* `install.sh`: The primary installation script that sets up MUNGE authentication, configures Slurm, and prepares the environment.
+* `test_batch.sh`: A sample Slurm job script for testing cluster functionality.
-## Step 4: Install SLURM on each Pod
+## Step 4: Install Slurm on each Pod
Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]` with any secure random string (like a password). The secret key is used for authentication between nodes, and must be identical across all Pods in your cluster.
@@ -60,9 +66,9 @@ Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]`
./install.sh "[MUNGE_SECRET_KEY]" node-0 node-1 10.65.0.2 10.65.0.3
```
-This script automates the complex process of configuring a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e. master/control) and secondary (i.e compute/worker) nodes.
+This script automates the complex process of configuring a two-node Slurm cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e., master/control) and secondary (i.e., compute/worker) nodes.
-## Step 5: Start SLURM services
+## Step 5: Start Slurm services
@@ -70,7 +76,7 @@ If you're not sure which Pod is the primary node, run the command `echo $HOSTNAM
-1. **On the primary node** (`node-0`), run both SLURM services:
+1. **On the primary node** (`node-0`), run both Slurm services:
```bash
slurmctld -D
@@ -90,7 +96,7 @@ If you're not sure which Pod is the primary node, run the command `echo $HOSTNAM
After running these commands, you should see output indicating that the services have started successfully. The `-D` flag keeps the services running in the foreground, so each command needs its own terminal.
-## Step 6: Test your SLURM Cluster
+## Step 6: Test your Slurm Cluster
1. Run this command **on the primary node** (`node-0`) to check the status of your nodes:
@@ -108,7 +114,7 @@ After running these commands, you should see output indicating that the services
This command should list all GPUs across both nodes.
-## Step 7: Submit the SLURM job script
+## Step 7: Submit the Slurm job script
Run the following command **on the primary node** (`node-0`) to submit the test job script and confirm that your cluster is working properly:
@@ -122,17 +128,17 @@ Check the output file created by the test (`test_simple_[JOBID].out`) and look f
If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.console.runpod.io/cluster) and delete your cluster to avoid incurring extra charges.
-
+
You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.console.runpod.io/user/billing) section under the **Cluster** tab.
-
+
## Next steps
-Now that you've successfully deployed and tested a SLURM cluster on Runpod, you can:
+Now that you've successfully deployed and tested a Slurm cluster on Runpod, you can:
-* **Adapt your own distributed workloads** to run using SLURM job scripts.
+* **Adapt your own distributed workloads** to run using Slurm job scripts.
* **Scale your cluster** by adjusting the number of Pods to handle larger models or datasets.
* **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
* **Optimize performance** by experimenting with different distributed training strategies.