Add Slurm Cluster documentation #306

Open
wants to merge 15 commits into base: main
13 changes: 10 additions & 3 deletions docs.json
@@ -142,9 +142,16 @@
"group": "Instant Clusters",
"pages": [
"instant-clusters",
"instant-clusters/pytorch",
"instant-clusters/axolotl",
"instant-clusters/slurm"
"instant-clusters/slurm-clusters",
{
"group": "Deployment guides",
"pages": [
"instant-clusters/pytorch",
"instant-clusters/axolotl",
"instant-clusters/slurm"
]
}

]
},
{
7 changes: 4 additions & 3 deletions instant-clusters.mdx
@@ -1,5 +1,5 @@
---
title: "Instant Clusters"
title: "Overview"
sidebarTitle: "Overview"
---

@@ -27,9 +27,10 @@ All accounts have a default spending limit. To deploy a larger cluster, submit a

Get started with Instant Clusters by following a step-by-step tutorial for your preferred framework:

* [Deploy a Slurm Cluster](/instant-clusters/slurm-clusters).
* [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch).
* [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl).
* [Deploy an Instant Cluster with Slurm](/instant-clusters/slurm).
* [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm).

## Use cases for Instant Clusters

@@ -66,7 +67,7 @@ Instant Clusters support up to 8 interfaces per Pod. Each interface (`eth1` - `e

## Environment variables

The following environment variables are available in all Pods:
The following environment variables are present in all Pods on an Instant Cluster:

| Environment Variable | Description |
| ------------------------------ | ----------------------------------------------------------------------------- |
2 changes: 1 addition & 1 deletion instant-clusters/axolotl.mdx
@@ -1,6 +1,6 @@
---
title: "Deploy an Instant Cluster with Axolotl"
sidebarTitle: "Deploy with Axolotl"
sidebarTitle: "Axolotl"
---

This tutorial demonstrates how to use Instant Clusters with [Axolotl](https://axolotl.ai/) to fine-tune large language models (LLMs) across multiple GPUs. By leveraging PyTorch's distributed training capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.
2 changes: 1 addition & 1 deletion instant-clusters/pytorch.mdx
@@ -1,6 +1,6 @@
---
title: "Deploy an Instant Cluster with PyTorch"
sidebarTitle: "Deploy with PyTorch"
sidebarTitle: "PyTorch"
---

This tutorial demonstrates how to use Instant Clusters with [PyTorch](http://pytorch.org) to run distributed workloads across multiple GPUs. By leveraging PyTorch's distributed processing capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.
101 changes: 101 additions & 0 deletions instant-clusters/slurm-clusters.mdx
@@ -0,0 +1,101 @@
---
title: Slurm Clusters
sidebarTitle: Slurm Clusters
description: Deploy fully managed Slurm Clusters on Runpod with zero configuration
tag: "BETA"
---

<Note>
Slurm Clusters are currently in beta. If you'd like to provide feedback, please [join our Discord](https://discord.gg/runpod).
</Note>

Runpod Slurm Clusters provide a fully managed high-performance computing and scheduling solution, letting you create and manage Slurm Clusters with minimal setup.

For more information on working with Slurm, refer to the [Slurm documentation](https://slurm.schedmd.com/documentation.html).

## Key features

Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing:

- **Zero configuration setup:** Slurm and munge are pre-installed and fully configured.
- **Instant provisioning:** Clusters deploy rapidly with minimal setup.
- **Automatic role assignment:** Runpod automatically designates controller and agent nodes.
- **Built-in optimizations:** Pre-configured for optimal NCCL performance.
- **Full Slurm compatibility:** All standard Slurm commands work out-of-the-box.

<Tip>

If you prefer to manually configure your Slurm deployment, see [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm) for a step-by-step guide.

</Tip>

## Deploy a Slurm Cluster

1. Open the [Instant Clusters page](https://console.runpod.io/cluster) on the Runpod console.
2. Click **Create Cluster**.
3. Select **Slurm Cluster** from the cluster type dropdown menu.
4. Configure your cluster specifications:
- **Cluster name**: Enter a descriptive name for your cluster.
- **Pod count**: Choose the number of Pods in your cluster.
- **GPU type**: Select your preferred [GPU type](/references/gpu-types).
- **Region**: Choose your deployment region.
- **Network volume** (optional): Add a [network volume](/pods/storage/create-network-volumes) for persistent/shared storage. If using a network volume, ensure the region matches your cluster region.
- **Pod template**: Select a [Pod template](/pods/templates/overview) or click **Edit Template** to customize start commands, environment variables, ports, or [container/volume disk](/pods/storage/types) capacity.
5. Click **Deploy Cluster**.

## Connect to a Slurm Cluster

Once deployment completes, you can access your cluster from the [Instant Clusters page](https://console.runpod.io/cluster).

From this page you can select a cluster to view its component nodes, including labels indicating the **Slurm controller** (primary node) and **Slurm agents** (secondary nodes). Expand a node to view details like availability, GPU/storage utilization, and options for connection and management.

Connect to a node using the **Connect** button, or using any of the [connection methods supported by Pods](/pods/connect-to-a-pod).

## Submit and manage jobs

All standard Slurm commands are available without configuration. For example, you can:

Check cluster status and available resources:

```bash
sinfo
```
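
For per-node detail, `sinfo` also supports a node-oriented long view (standard flags, shown here as a convenience):

```bash
# One line per node, long format: state, CPUs, and memory for each node.
sinfo -N -l
```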

Submit a job to the cluster from the Slurm controller node:

```bash
sbatch your-job-script.sh
```

Monitor job queue and status:

```bash
squeue
```

View detailed job information from the Slurm controller node:

```bash
scontrol show job JOB_ID
```
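
For reference, a minimal batch script for a cluster like this might look like the sketch below; the job name, output pattern, node count, and GPU request are illustrative, so adjust them to your cluster.

```bash
#!/bin/bash
#SBATCH --job-name=gpu-check        # illustrative job name
#SBATCH --output=gpu-check_%j.out   # per-job output file (%j = job ID)
#SBATCH --nodes=2                   # adjust to your Pod count
#SBATCH --gres=gpu:8                # GPUs per node; adjust to your GPU count

# Run one task per node and report each hostname and its visible GPU count.
srun --ntasks-per-node=1 bash -c 'echo "$(hostname): $(nvidia-smi -L | wc -l) GPUs visible"'
```

Submit it with `sbatch` from the Slurm controller node and track it with `squeue`, as shown above.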

## Advanced configuration

While Runpod's Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm controller node using the [web terminal or SSH](/pods/connect-to-pods).

Access Slurm configuration files in their standard locations:
- `/etc/slurm/slurm.conf` - Main configuration file.
- `/etc/slurm/topology.conf` - Network topology configuration.
- `/etc/slurm/gres.conf` - Generic resource configuration.

Modify these files as needed for your specific requirements.
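
For example, a minimal GRES entry for an 8-GPU node might look like the sketch below; the node name and device paths are illustrative, so match them to your own nodes (for instance, from `scontrol show nodes`). After editing configuration files, `scontrol reconfigure` asks the running daemons to re-read them.

```bash
# Illustrative /etc/slurm/gres.conf entry for a node with 8 GPUs.
# Replace the node name and device paths with values from your cluster.
NodeName=node-0 Name=gpu File=/dev/nvidia[0-7]
```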

## Troubleshooting

If you encounter issues with your Slurm Cluster, try the following:

- **Jobs stuck in pending state:** Check resource availability with `sinfo` and ensure requested resources are available. If you need more resources, you can add more nodes to your cluster.
- **Authentication errors:** Munge is pre-configured, but if issues arise, verify the munge service is running on all nodes (see the check below).
- **Performance issues:** Review topology configuration and ensure jobs are using appropriate resource requests.
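
A quick way to check munge on a node is to round-trip a credential through the local daemon using the standard munge utilities:

```bash
# Confirm the munge daemon process is running on this node.
pgrep -a munged

# Encode and decode a credential locally; "STATUS: Success" in the
# output means the daemon is up and the local key is usable.
munge -n | unmunge
```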

For additional support, contact [Runpod support](https://contact.runpod.io/) with your cluster ID and specific error messages.
54 changes: 30 additions & 24 deletions instant-clusters/slurm.mdx
@@ -1,17 +1,23 @@
---
title: "Deploy an Instant Cluster with SLURM"
sidebarTitle: "Deploy with SLURM"
title: "Deploy an Instant Cluster with Slurm (unmanaged)"
sidebarTitle: "Slurm (unmanaged)"
---

This tutorial demonstrates how to use Instant Clusters with [SLURM](https://slurm.schedmd.com/) (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging SLURM on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.
<Tip>

This guide is for advanced users who want to configure and manage their own Slurm deployment on Instant Clusters. If you're looking for a pre-configured solution, see [Slurm Clusters](/instant-clusters/slurm-clusters).

</Tip>

Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently.
This tutorial demonstrates how to configure Runpod Instant Clusters with [Slurm](https://slurm.schedmd.com/) to manage and schedule distributed workloads across multiple nodes. Slurm is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging Slurm on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.

Follow the steps below to deploy a cluster and start running distributed Slurm workloads efficiently.

## Requirements

* You've created a [Runpod account](https://www.console.runpod.io/home) and funded it with sufficient credits.
* You have basic familiarity with Linux command line.
* You're comfortable working with [Pods](/pods/overview) and understand the basics of [SLURM](https://slurm.schedmd.com/).
* You're comfortable working with [Pods](/pods/overview) and understand the basics of [Slurm](https://slurm.schedmd.com/).

## Step 1: Deploy an Instant Cluster

@@ -20,7 +26,7 @@ Follow the steps below to deploy a cluster and start running distributed SLURM w
3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (Runpod PyTorch).
4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.

## Step 2: Clone demo and install SLURM on each Pod
## Step 2: Clone demo and install Slurm on each Pod

To connect to a Pod:

@@ -31,46 +37,46 @@ To connect to a Pod:

1. Click **Connect**, then click **Web Terminal**.

2. In the terminal that opens, run this command to clone the SLURM demo files into the Pod's main directory:
2. In the terminal that opens, run this command to clone the Slurm demo files into the Pod's main directory:

```bash
git clone https://github.com/pandyamarut/slurm_example.git && cd slurm_example
```

3. Run this command to install SLURM:
3. Run this command to install Slurm:

```bash
apt update && apt install -y slurm-wlm slurm-client munge
```

## Step 3: Overview of SLURM demo scripts
## Step 3: Overview of Slurm demo scripts

The repository contains several essential scripts for setting up SLURM. Let's examine what each script does:
The repository contains several essential scripts for setting up Slurm. Let's examine what each script does:

* `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node.
* `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup.
* `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment.
* `test_batch.sh`: A sample SLURM job script for testing cluster functionality.
* `create_gres_conf.sh`: Generates the Slurm Generic Resource (GRES) configuration file that defines GPU resources for each node.
* `create_slurm_conf.sh`: Creates the main Slurm configuration file with cluster settings, node definitions, and partition setup.
* `install.sh`: The primary installation script that sets up MUNGE authentication, configures Slurm, and prepares the environment.
* `test_batch.sh`: A sample Slurm job script for testing cluster functionality.

## Step 4: Install SLURM on each Pod
## Step 4: Install Slurm on each Pod

Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]` with any secure random string (like a password). The secret key is used for authentication between nodes, and must be identical across all Pods in your cluster.

```bash
./install.sh "[MUNGE_SECRET_KEY]" node-0 node-1 10.65.0.2 10.65.0.3
```

This script automates the complex process of configuring a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e. master/control) and secondary (i.e compute/worker) nodes.
This script automates the complex process of configuring a two-node Slurm cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e., master/control) and secondary (i.e., compute/worker) nodes.
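
If you want a strong random value for the secret, one option (assuming `openssl` is available in your Pod image) is to generate it once and reuse the same string in the install command on every Pod:

```bash
# Generate a random MUNGE secret; use the exact same value on every Pod.
openssl rand -base64 32
```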

## Step 5: Start SLURM services
## Step 5: Start Slurm services

<Tip>

If you're not sure which Pod is the primary node, run the command `echo $HOSTNAME` in the web terminal of each Pod and look for `node-0`.

</Tip>

1. **On the primary node** (`node-0`), run both SLURM services:
1. **On the primary node** (`node-0`), run both Slurm services:

```bash
slurmctld -D
@@ -90,7 +96,7 @@

After running these commands, you should see output indicating that the services have started successfully. The `-D` flag keeps the services running in the foreground, so each command needs its own terminal.
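
If you'd prefer not to dedicate a terminal to each service, one alternative is to run them in the background and send their output to log files; the log paths below are illustrative.

```bash
# Primary node only: run the controller in the background.
nohup slurmctld -D > /var/log/slurmctld.log 2>&1 &

# Every node: run the compute daemon in the background.
nohup slurmd -D > /var/log/slurmd.log 2>&1 &
```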

## Step 6: Test your SLURM Cluster
## Step 6: Test your Slurm Cluster

1. Run this command **on the primary node** (`node-0`) to check the status of your nodes:

@@ -108,7 +114,7 @@

This command should list all GPUs across both nodes.

## Step 7: Submit the SLURM job script
## Step 7: Submit the Slurm job script

Run the following command **on the primary node** (`node-0`) to submit the test job script and confirm that your cluster is working properly:

@@ -122,17 +128,17 @@

If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.console.runpod.io/cluster) and delete your cluster to avoid incurring extra charges.

<Info>
<Tip>

You can monitor your cluster usage and spending with the **Billing Explorer** at the bottom of the [Billing page](https://www.console.runpod.io/user/billing), under the **Cluster** tab.

</Info>
</Tip>

## Next steps

Now that you've successfully deployed and tested a SLURM cluster on Runpod, you can:
Now that you've successfully deployed and tested a Slurm cluster on Runpod, you can:

* **Adapt your own distributed workloads** to run using SLURM job scripts.
* **Adapt your own distributed workloads** to run using Slurm job scripts.
* **Scale your cluster** by adjusting the number of Pods to handle larger models or datasets.
* **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
* **Optimize performance** by experimenting with different distributed training strategies.