diff --git a/docs.json b/docs.json
index c23ccbd8..e6625dac 100644
--- a/docs.json
+++ b/docs.json
@@ -142,9 +142,16 @@
         "group": "Instant Clusters",
         "pages": [
           "instant-clusters",
-          "instant-clusters/pytorch",
-          "instant-clusters/axolotl",
-          "instant-clusters/slurm"
+          "instant-clusters/slurm-clusters",
+          {
+            "group": "Deployment guides",
+            "pages": [
+              "instant-clusters/pytorch",
+              "instant-clusters/axolotl",
+              "instant-clusters/slurm"
+            ]
+          }
+        ]
       },
       {
diff --git a/instant-clusters.mdx b/instant-clusters.mdx
index 3cf6035e..77be9d40 100644
--- a/instant-clusters.mdx
+++ b/instant-clusters.mdx
@@ -1,5 +1,5 @@
 ---
-title: "Instant Clusters"
+title: "Overview"
 sidebarTitle: "Overview"
 ---
 
@@ -27,9 +27,10 @@ All accounts have a default spending limit. To deploy a larger cluster, submit a
 
 Get started with Instant Clusters by following a step-by-step tutorial for your preferred framework:
 
+* [Deploy a Slurm Cluster](/instant-clusters/slurm-clusters).
 * [Deploy an Instant Cluster with PyTorch](/instant-clusters/pytorch).
 * [Deploy an Instant Cluster with Axolotl](/instant-clusters/axolotl).
-* [Deploy an Instant Cluster with Slurm](/instant-clusters/slurm).
+* [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm).
 
 ## Use cases for Instant Clusters
 
@@ -66,7 +67,7 @@ Instant Clusters support up to 8 interfaces per Pod. Each interface (`eth1` - `e
 
 ## Environment variables
 
-The following environment variables are available in all Pods:
+The following environment variables are present in all Pods on an Instant Cluster:
 
 | Environment Variable           | Description                                                                     |
 | ------------------------------ | ------------------------------------------------------------------------------- |
diff --git a/instant-clusters/axolotl.mdx b/instant-clusters/axolotl.mdx
index 7bef1354..a4075e24 100644
--- a/instant-clusters/axolotl.mdx
+++ b/instant-clusters/axolotl.mdx
@@ -1,6 +1,6 @@
 ---
 title: "Deploy an Instant Cluster with Axolotl"
-sidebarTitle: "Deploy with Axolotl"
+sidebarTitle: "Axolotl"
 ---
 
 This tutorial demonstrates how to use Instant Clusters with [Axolotl](https://axolotl.ai/) to fine-tune large language models (LLMs) across multiple GPUs. By leveraging PyTorch's distributed training capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.
diff --git a/instant-clusters/pytorch.mdx b/instant-clusters/pytorch.mdx
index 182f1664..90397ea6 100644
--- a/instant-clusters/pytorch.mdx
+++ b/instant-clusters/pytorch.mdx
@@ -1,6 +1,6 @@
 ---
 title: "Deploy an Instant Cluster with PyTorch"
-sidebarTitle: "Deploy with PyTorch"
+sidebarTitle: "PyTorch"
 ---
 
 This tutorial demonstrates how to use Instant Clusters with [PyTorch](http://pytorch.org) to run distributed workloads across multiple GPUs. By leveraging PyTorch's distributed processing capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups.
diff --git a/instant-clusters/slurm-clusters.mdx b/instant-clusters/slurm-clusters.mdx
new file mode 100644
index 00000000..827cda85
--- /dev/null
+++ b/instant-clusters/slurm-clusters.mdx
@@ -0,0 +1,101 @@
+---
+title: Slurm Clusters
+sidebarTitle: Slurm Clusters
+description: Deploy fully managed Slurm Clusters on Runpod with zero configuration
+tag: "BETA"
+---
+
+
+Slurm Clusters are currently in beta. If you'd like to provide feedback, please [join our Discord](https://discord.gg/runpod).
+
+
+Runpod Slurm Clusters provide a fully managed high-performance computing and scheduling solution that enables you to rapidly create and manage Slurm Clusters with minimal setup.
+
+For more information on working with Slurm, refer to the [Slurm documentation](https://slurm.schedmd.com/documentation.html).
+
+## Key features
+
+Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing:
+
+- **Zero configuration setup:** Slurm and munge are pre-installed and fully configured.
+- **Instant provisioning:** Clusters deploy rapidly with minimal setup.
+- **Automatic role assignment:** Runpod automatically designates controller and agent nodes.
+- **Built-in optimizations:** Pre-configured for optimal NCCL performance.
+- **Full Slurm compatibility:** All standard Slurm commands work out-of-the-box.
+
+
+
+If you prefer to manually configure your Slurm deployment, see [Deploy an Instant Cluster with Slurm (unmanaged)](/instant-clusters/slurm) for a step-by-step guide.
+
+
+
+## Deploy a Slurm Cluster
+
+1. Open the [Instant Clusters page](https://console.runpod.io/cluster) on the Runpod console.
+2. Click **Create Cluster**.
+3. Select **Slurm Cluster** from the cluster type dropdown menu.
+4. Configure your cluster specifications:
+   - **Cluster name**: Enter a descriptive name for your cluster.
+   - **Pod count**: Choose the number of Pods in your cluster.
+   - **GPU type**: Select your preferred [GPU type](/references/gpu-types).
+   - **Region**: Choose your deployment region.
+   - **Network volume** (optional): Add a [network volume](/pods/storage/create-network-volumes) for persistent/shared storage. If using a network volume, ensure the region matches your cluster region.
+   - **Pod template**: Select a [Pod template](/pods/templates/overview) or click **Edit Template** to customize start commands, environment variables, ports, or [container/volume disk](/pods/storage/types) capacity.
+5. Click **Deploy Cluster**.
+
+## Connect to a Slurm Cluster
+
+Once deployment completes, you can access your cluster from the [Instant Clusters page](https://console.runpod.io/cluster).
+
+From this page you can select a cluster to view its component nodes, each labeled as either the **Slurm controller** (primary node) or a **Slurm agent** (secondary node). Expand a node to view details like availability, GPU/storage utilization, and options for connection and management.
+
+Connect to a node using the **Connect** button, or use any of the [connection methods supported by Pods](/pods/connect-to-a-pod).
+
+## Submit and manage jobs
+
+All standard Slurm commands are available without configuration. For example:
+
+Check cluster status and available resources:
+
+```bash
+sinfo
+```
+
+Submit a job to the cluster from the Slurm controller node:
+
+```bash
+sbatch your-job-script.sh
+```
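+
+If you don't already have a job script, the following minimal sketch is one way to get started (save it as `your-job-script.sh` to match the command above). The `#SBATCH` requests below are placeholder assumptions (one node, one GPU), so adjust them for your workload:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=hello-gpu
+#SBATCH --nodes=1
+#SBATCH --gres=gpu:1
+#SBATCH --output=hello-gpu-%j.out
+
+# Print which node ran the job and which GPUs Slurm allocated to it.
+hostname
+nvidia-smi -L
+```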
+
+Monitor job queue and status:
+
+```bash
+squeue
+```
+
+View detailed job information from the Slurm controller node:
+
+```bash
+scontrol show job JOB_ID
+```
+
+## Advanced configuration
+
+While Runpod's Slurm Clusters work out-of-the-box, you can customize your configuration by connecting to the Slurm controller node using the [web terminal or SSH](/pods/connect-to-pods).
+
+Access Slurm configuration files in their standard locations:
+- `/etc/slurm/slurm.conf` - Main configuration file.
+- `/etc/slurm/topology.conf` - Network topology configuration.
+- `/etc/slurm/gres.conf` - Generic resource configuration.
+
+Modify these files as needed for your specific requirements.
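+
+For example, one way to check the values the cluster is actually running with, and to apply edits without redeploying, is with `scontrol`:
+
+```bash
+# Dump the live configuration known to the Slurm controller.
+scontrol show config
+
+# After editing the files above, ask the Slurm daemons to re-read them.
+scontrol reconfigure
+```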
+
+## Troubleshooting
+
+If you encounter issues with your Slurm Cluster, try the following:
+
+- **Jobs stuck in pending state:** Check resource availability with `sinfo` and ensure the requested resources are available. If you need more resources, you can add more nodes to your cluster.
+- **Authentication errors:** Munge is pre-configured, but if issues arise, verify that the munge service is running on all nodes (see the example check below).
+- **Performance issues:** Review the topology configuration and ensure jobs are using appropriate resource requests.
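+
+For example, a quick way to check MUNGE on a node is to confirm the daemon is up and round-trip a credential locally (this is a generic MUNGE check, not something Runpod-specific):
+
+```bash
+# Confirm the munge daemon is running on this node.
+pgrep munged
+
+# Encode and decode a credential locally; the output should report STATUS: Success.
+munge -n | unmunge
+```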
+
+For additional support, contact [Runpod support](https://contact.runpod.io/) with your cluster ID and specific error messages.
\ No newline at end of file
diff --git a/instant-clusters/slurm.mdx b/instant-clusters/slurm.mdx
index 4eaace34..93275ff0 100644
--- a/instant-clusters/slurm.mdx
+++ b/instant-clusters/slurm.mdx
@@ -1,17 +1,23 @@
 ---
-title: "Deploy an Instant Cluster with SLURM"
-sidebarTitle: "Deploy with SLURM"
+title: "Deploy an Instant Cluster with Slurm (unmanaged)"
+sidebarTitle: "Slurm (unmanaged)"
 ---
 
-This tutorial demonstrates how to use Instant Clusters with [SLURM](https://slurm.schedmd.com/) (Simple Linux Utility for Resource Management) to manage and schedule distributed workloads across multiple nodes. SLURM is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging SLURM on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.
+
+
+This guide is for advanced users who want to configure and manage their own Slurm deployment on Instant Clusters. If you're looking for a pre-configured solution, see [Slurm Clusters](/instant-clusters/slurm-clusters).
+
+
-Follow the steps below to deploy a cluster and start running distributed SLURM workloads efficiently.
+This tutorial demonstrates how to configure Runpod Instant Clusters with [Slurm](https://slurm.schedmd.com/) to manage and schedule distributed workloads across multiple nodes. Slurm is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging Slurm on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs.
+
+Follow the steps below to deploy a cluster and start running distributed Slurm workloads efficiently.
 
 ## Requirements
 
 * You've created a [Runpod account](https://www.console.runpod.io/home) and funded it with sufficient credits.
 * You have basic familiarity with Linux command line.
-* You're comfortable working with [Pods](/pods/overview) and understand the basics of [SLURM](https://slurm.schedmd.com/).
+* You're comfortable working with [Pods](/pods/overview) and understand the basics of [Slurm](https://slurm.schedmd.com/).
 
 ## Step 1: Deploy an Instant Cluster
 
@@ -20,7 +26,7 @@ Follow the steps below to deploy a cluster and start running distributed SLURM w
 3. Use the UI to name and configure your cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (Runpod PyTorch).
 4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.
 
-## Step 2: Clone demo and install SLURM on each Pod
+## Step 2: Clone demo and install Slurm on each Pod
 
 To connect to a Pod:
 
@@ -31,28 +37,28 @@ To connect to a Pod:
 
 1. Click **Connect**, then click **Web Terminal**.
 
-2. In the terminal that opens, run this command to clone the SLURM demo files into the Pod's main directory:
+2. In the terminal that opens, run this command to clone the Slurm demo files into the Pod's main directory:
 
 ```bash
 git clone https://github.com/pandyamarut/slurm_example.git && cd slurm_example
 ```
 
-3. Run this command to install SLURM:
+3. Run this command to install Slurm:
 
 ```bash
 apt update && apt install -y slurm-wlm slurm-client munge
 ```
 
-## Step 3: Overview of SLURM demo scripts
+## Step 3: Overview of Slurm demo scripts
 
-The repository contains several essential scripts for setting up SLURM. Let's examine what each script does:
+The repository contains several essential scripts for setting up Slurm. Let's examine what each script does:
 
-* `create_gres_conf.sh`: Generates the SLURM Generic Resource (GRES) configuration file that defines GPU resources for each node.
-* `create_slurm_conf.sh`: Creates the main SLURM configuration file with cluster settings, node definitions, and partition setup.
-* `install.sh`: The primary installation script that sets up MUNGE authentication, configures SLURM, and prepares the environment.
-* `test_batch.sh`: A sample SLURM job script for testing cluster functionality.
+* `create_gres_conf.sh`: Generates the Slurm Generic Resource (GRES) configuration file that defines GPU resources for each node.
+* `create_slurm_conf.sh`: Creates the main Slurm configuration file with cluster settings, node definitions, and partition setup.
+* `install.sh`: The primary installation script that sets up MUNGE authentication, configures Slurm, and prepares the environment.
+* `test_batch.sh`: A sample Slurm job script for testing cluster functionality.
 
-## Step 4: Install SLURM on each Pod
+## Step 4: Install Slurm on each Pod
 
 Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]` with any secure random string (like a password). The secret key is used for authentication between nodes, and must be identical across all Pods in your cluster.
 
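+For example, one way to create a suitably random secret is with `openssl`; run it once and reuse the same output on every Pod:
+
+```bash
+# Generate a random value to use as [MUNGE_SECRET_KEY] in the install command below.
+openssl rand -hex 32
+```
+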
@@ -60,9 +66,9 @@ Now run the installation script **on each Pod**, replacing `[MUNGE_SECRET_KEY]`
 ./install.sh "[MUNGE_SECRET_KEY]" node-0 node-1 10.65.0.2 10.65.0.3
 ```
 
-This script automates the complex process of configuring a two-node SLURM cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e. master/control) and secondary (i.e compute/worker) nodes.
+This script automates the complex process of configuring a two-node Slurm cluster with GPU support, handling everything from system dependencies to authentication and resource configuration. It implements the necessary setup for both the primary (i.e., master/control) and secondary (i.e., compute/worker) nodes.
 
-## Step 5: Start SLURM services
+## Step 5: Start Slurm services
 
 
@@ -70,7 +76,7 @@ If you're not sure which Pod is the primary node, run the command `echo $HOSTNAM
 
-1. **On the primary node** (`node-0`), run both SLURM services:
+1. **On the primary node** (`node-0`), run both Slurm services:
 
 ```bash
 slurmctld -D
 ```
 
@@ -90,7 +96,7 @@ If you're not sure which Pod is the primary node, run the command `echo $HOSTNAM
 After running these commands, you should see output indicating that the services have started successfully. The `-D` flag keeps the services running in the foreground, so each command needs its own terminal.
 
-## Step 6: Test your SLURM Cluster
+## Step 6: Test your Slurm Cluster
 
 1. Run this command **on the primary node** (`node-0`) to check the status of your nodes:
 
@@ -108,7 +114,7 @@ After running these commands, you should see output indicating that the services
 
 This command should list all GPUs across both nodes.
 
-## Step 7: Submit the SLURM job script
+## Step 7: Submit the Slurm job script
 
 Run the following command **on the primary node** (`node-0`) to submit the test job script and confirm that your cluster is working properly:
 
@@ -122,17 +128,17 @@ Check the output file created by the test (`test_simple_[JOBID].out`) and look f
 
 If you no longer need your cluster, make sure you return to the [Instant Clusters page](https://www.console.runpod.io/cluster) and delete your cluster to avoid incurring extra charges.
 
-
+
 You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.console.runpod.io/user/billing) section under the **Cluster** tab.
-
+
 
 ## Next steps
 
-Now that you've successfully deployed and tested a SLURM cluster on Runpod, you can:
+Now that you've successfully deployed and tested a Slurm cluster on Runpod, you can:
 
-* **Adapt your own distributed workloads** to run using SLURM job scripts.
+* **Adapt your own distributed workloads** to run using Slurm job scripts (see the example sketch below).
 * **Scale your cluster** by adjusting the number of Pods to handle larger models or datasets.
 * **Try different frameworks** like [Axolotl](/instant-clusters/axolotl) for fine-tuning large language models.
 * **Optimize performance** by experimenting with different distributed training strategies.
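+
+For example, a minimal multi-node job script for the two-Pod cluster in this walkthrough might look like the sketch below. It only verifies that each node can see its GPUs, so replace the `srun` line with your actual workload, and adjust the resource requests if your cluster differs from the 2 x 8 GPU layout used in this tutorial:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=multinode-check
+#SBATCH --nodes=2
+#SBATCH --ntasks-per-node=1
+#SBATCH --gres=gpu:8
+#SBATCH --output=multinode-check-%j.out
+
+# Launch one task per node; each task reports its hostname and visible GPU count.
+srun bash -c 'echo "$(hostname): $(nvidia-smi -L | wc -l) GPUs visible"'
+```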