133 changes: 130 additions & 3 deletions docs/how-to-run-multi-node.md
@@ -75,17 +75,144 @@ You can see at the end of these commands, we are pointing DLM/MAD to the shared-file system

**NOTE: The commands above assume the shared-file system is mounted at `/nfs`. If it is not, and a user simply copies and pastes these commands on two nodes, DLM/MAD will create a folder called `nfs` on each node and copy the data there, which is not the desired behavior.**

## SLURM Cluster Integration

madengine now supports running workloads on SLURM clusters, allowing you to leverage job scheduling and resource management for multi-node training and inference.

### Overview

When `slurm_args` is provided via `--additional-context`, madengine will:
1. Parse the SLURM configuration parameters
2. Skip the standard Docker container build-and-run workflow
3. Prepare the job environment (model name, node counts, partition, and so on)
4. Execute the model-specific script (e.g., `scripts/sglang_disagg/run.sh`), which submits the job to the SLURM cluster via `sbatch` (see the sketch below)
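
As a rough sketch of step 4 (not the actual madengine internals, just the shape of the flow it describes), the work on the submission host reduces to invoking the framework's run script, which in turn performs the `sbatch` submission:

```bash
# Conceptual sketch only -- the real logic lives inside madengine.
# With slurm_args present, the local Docker build/run is skipped and the
# framework's run script is invoked; that script issues the sbatch call.
FRAMEWORK=sglang_disagg
bash "scripts/${FRAMEWORK}/run.sh"
```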

### SLURM Arguments

The following arguments can be specified in the `slurm_args` dictionary:

| Argument | Description | Required | Default | Example |
|----------|-------------|----------|---------|---------|
| `FRAMEWORK` | Framework to use for the job | Yes | - | `'sglang_disagg'` |
| `PREFILL_NODES` | Number of nodes for prefill phase | Yes | - | `'2'` |
| `DECODE_NODES` | Number of nodes for decode phase | Yes | - | `'2'` |
| `PARTITION` | SLURM partition/queue name | Yes | - | `'amd-rccl'` |
| `TIME` | Maximum job runtime (HH:MM:SS) | Yes | - | `'12:00:00'` |
| `DOCKER_IMAGE` | Docker image to use | No | `''` | `'myregistry/image:tag'` |
| `EXCLUSIVE_MODE` | Request exclusive node access | No | `True` | `True` or `False` |
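
The framework's run script translates these values into an `sbatch` submission. Exactly how is implementation-specific, but a rough sketch of the resulting batch-script header (assuming, for illustration, that the prefill and decode node counts are combined into a single allocation) could look like this:

```bash
#!/bin/bash
# Illustrative sketch only -- in practice these directives are generated by the
# framework's run script (e.g., scripts/sglang_disagg/run.sh), not written by hand.
# PARTITION -> --partition, TIME -> --time,
# PREFILL_NODES + DECODE_NODES (2 + 2) -> --nodes,
# EXCLUSIVE_MODE=True -> --exclusive
#SBATCH --partition=amd-rccl
#SBATCH --time=12:00:00
#SBATCH --nodes=4
#SBATCH --exclusive
```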

### Usage Examples

#### Basic SLURM Job Submission

To run a model on SLURM with default settings:

```bash
madengine run --tags sglang_disagg_pd_qwen3-32B \
--additional-context "{'slurm_args': {
'FRAMEWORK': 'sglang_disagg',
'PREFILL_NODES': '2',
'DECODE_NODES': '2',
'PARTITION': 'amd-rccl',
'TIME': '12:00:00',
'DOCKER_IMAGE': ''
}}"
```

#### Custom Docker Image

To specify a custom Docker image for the SLURM job:

```bash
madengine run --tags sglang_disagg_pd_qwen3-32B \
--additional-context "{'slurm_args': {
'FRAMEWORK': 'sglang_disagg',
'PREFILL_NODES': '4',
'DECODE_NODES': '4',
'PARTITION': 'gpu-high-priority',
'TIME': '24:00:00',
'DOCKER_IMAGE': 'myregistry/custom-image:latest'
}}"
```

#### Running Different Model Configurations

For DeepSeek-V2 model:

```bash
madengine run --tags sglang_disagg_pd_deepseek_v2 \
--additional-context "{'slurm_args': {
'FRAMEWORK': 'sglang_disagg',
'PREFILL_NODES': '8',
'DECODE_NODES': '8',
'PARTITION': 'amd-mi300x',
'TIME': '48:00:00',
'DOCKER_IMAGE': ''
}}"
```

#### Using Exclusive Mode

By default, `EXCLUSIVE_MODE` is `True`, which requests exclusive access to nodes (recommended for distributed inference). To share nodes with other jobs:

```bash
madengine run --tags sglang_disagg_pd_qwen3-32B \
--additional-context "{'slurm_args': {
'FRAMEWORK': 'sglang_disagg',
'PREFILL_NODES': '2',
'DECODE_NODES': '2',
'PARTITION': 'amd-rccl',
'TIME': '12:00:00',
'DOCKER_IMAGE': '',
'EXCLUSIVE_MODE': False
}}"
```

**Note:** Exclusive mode (`--exclusive` in SLURM) is typically recommended for distributed multi-node workloads to ensure consistent performance and avoid interference from other jobs running on the same nodes.

### Model Configuration

Models configured for SLURM should include the model name in the `args` attribute of `models.json`. For example:

```json
{
"name": "sglang_disagg_pd_qwen3-32B",
"args": "--model Qwen3-32B",
"tags": ["sglang_disagg"]
}
```

The model name (e.g., `Qwen/Qwen3-32B`) will be extracted and set as the `MODEL_NAME` environment variable for the SLURM job.
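
For illustration only, extracting the model name from the `args` string could be done as below; this is a hypothetical snippet, not madengine's actual extraction code, which runs internally:

```bash
# Hypothetical illustration of the extraction step madengine performs internally.
ARGS="--model Qwen3-32B"
MODEL_NAME=$(echo "$ARGS" | sed -n 's/.*--model[= ]\([^ ]*\).*/\1/p')
export MODEL_NAME   # made available to the SLURM job
echo "$MODEL_NAME"  # -> Qwen3-32B
```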

### Requirements

To use SLURM integration, ensure the following are available:

1. **SLURM Cluster Access**: A SLURM cluster reachable from the submission host, with valid credentials
2. **Model Scripts**: Framework-specific scripts (e.g., `scripts/sglang_disagg/run.sh`) that handle SLURM job submission
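
Before submitting, it can help to confirm these basics are in place. The commands below are standard SLURM client tools; the partition name and script path are just the examples used earlier in this document:

```bash
# Quick sanity checks before submitting a job.
sinfo -p amd-rccl                  # target partition exists and shows available nodes
sbatch --version                   # SLURM client tools are on PATH
ls scripts/sglang_disagg/run.sh    # framework run script is present in the repo
```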

### How It Works

1. **Context Parsing**: madengine detects `slurm_args` in the additional context
2. **Model Selection**: Extracts model information from `models.json` based on the provided tags
3. **Environment Setup**: Prepares environment variables including `MODEL_NAME`, node counts, partition, etc.
4. **Job Submission**: Executes the framework-specific run script which submits the SLURM job using `sbatch`
5. **Job Monitoring**: The SLURM cluster manages job execution, resource allocation, and scheduling
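
Once the job has been submitted (step 4), it can be tracked with the usual SLURM commands, for example:

```bash
# Standard SLURM commands for following the submitted job.
squeue -u "$USER"            # list your queued/running jobs and their job IDs
scontrol show job <jobid>    # detailed state of a specific job
sacct -j <jobid>             # accounting information once the job has finished
```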

## TODO

### RUNNER

- [x] torchrun
- [ ] mpirun (requires ansible integration)

### Job Schedulare
### Job Scheduler

- [ ] SLURM
- [x] SLURM (via slurm_args integration)
- [ ] Kubernetes

### Design Consideration

- [ ] Having the python model script launched by individual bash scripts can be limiting for multi-node. Perhaps we can explore a full python workflow for multi-node and only the job scheduler uses a bash script like SLURM using sbatch script.
- [x] SLURM integration using sbatch scripts for job submission
- [ ] Full Python workflow for multi-node (without bash script intermediaries)
- [ ] Kubernetes-native job scheduling integration
87 changes: 71 additions & 16 deletions src/madengine/core/context.py
@@ -92,22 +92,43 @@ def __init__(
else:
print("Warning: unknown numa balancing setup ...")

# Keeping gpu_vendor for filtering purposes; if we filter using file names we can get rid of this attribute.
self.ctx["gpu_vendor"] = self.get_gpu_vendor()

# Initialize the docker context
self.ctx["docker_env_vars"] = {}
self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] = self.ctx["gpu_vendor"]
self.ctx["docker_env_vars"]["MAD_SYSTEM_NGPUS"] = self.get_system_ngpus()
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_ARCHITECTURE"] = self.get_system_gpu_architecture()
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_PRODUCT_NAME"] = self.get_system_gpu_product_name()
self.ctx['docker_env_vars']['MAD_SYSTEM_HIP_VERSION'] = self.get_system_hip_version()
self.ctx["docker_build_arg"] = {
"MAD_SYSTEM_GPU_ARCHITECTURE": self.get_system_gpu_architecture(),
"MAD_SYSTEM_GPU_PRODUCT_NAME": self.get_system_gpu_product_name()
}
self.ctx["docker_gpus"] = self.get_docker_gpus()
self.ctx["gpu_renderDs"] = self.get_gpu_renderD_nodes()
# Check if SLURM mode is requested before GPU detection
is_slurm_mode = self._is_slurm_mode(additional_context, additional_context_file)

if is_slurm_mode:
# For SLURM mode, set minimal GPU context to avoid detection on control node
print("SLURM mode detected - skipping GPU detection on control node")
self.ctx["gpu_vendor"] = "AMD" # Default to AMD for SLURM environments
self.ctx["docker_env_vars"] = {}
self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] = self.ctx["gpu_vendor"]
self.ctx["docker_env_vars"]["MAD_SYSTEM_NGPUS"] = "8" # Default value for SLURM
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_ARCHITECTURE"] = "gfx90a" # Default for SLURM
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_PRODUCT_NAME"] = "AMD_GPU" # Default value
self.ctx['docker_env_vars']['MAD_SYSTEM_HIP_VERSION'] = "5.0.0" # Default value
self.ctx["docker_build_arg"] = {
"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a",
"MAD_SYSTEM_GPU_PRODUCT_NAME": "AMD_GPU"
}
self.ctx["docker_gpus"] = "0,1,2,3,4,5,6,7" # Default GPU list
self.ctx["gpu_renderDs"] = [128, 129, 130, 131, 132, 133, 134, 135] # Default renderD nodes
else:
# Normal mode - detect GPUs
# Keeping gpu_vendor for filtering purposes; if we filter using file names we can get rid of this attribute.
self.ctx["gpu_vendor"] = self.get_gpu_vendor()

# Initialize the docker context
self.ctx["docker_env_vars"] = {}
self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] = self.ctx["gpu_vendor"]
self.ctx["docker_env_vars"]["MAD_SYSTEM_NGPUS"] = self.get_system_ngpus()
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_ARCHITECTURE"] = self.get_system_gpu_architecture()
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_PRODUCT_NAME"] = self.get_system_gpu_product_name()
self.ctx['docker_env_vars']['MAD_SYSTEM_HIP_VERSION'] = self.get_system_hip_version()
self.ctx["docker_build_arg"] = {
"MAD_SYSTEM_GPU_ARCHITECTURE": self.get_system_gpu_architecture(),
"MAD_SYSTEM_GPU_PRODUCT_NAME": self.get_system_gpu_product_name()
}
self.ctx["docker_gpus"] = self.get_docker_gpus()
self.ctx["gpu_renderDs"] = self.get_gpu_renderD_nodes()

# Default multi-node configuration
self.ctx['multi_node_args'] = {
@@ -148,6 +169,40 @@ def __init__(
# Set multi-node runner after context update
self.ctx['docker_env_vars']['MAD_MULTI_NODE_RUNNER'] = self.set_multi_node_runner()

def _is_slurm_mode(self, additional_context: str = None, additional_context_file: str = None) -> bool:
"""Check if SLURM mode is requested.

Args:
additional_context: The additional context string.
additional_context_file: The additional context file.

Returns:
bool: True if SLURM mode is detected, False otherwise.
"""
import ast
import json

# Check additional_context_file first
if additional_context_file:
try:
with open(additional_context_file) as f:
context_data = json.load(f)
if 'slurm_args' in context_data:
return True
except Exception:
pass

# Check additional_context string
if additional_context:
try:
dict_additional_context = ast.literal_eval(additional_context)
if 'slurm_args' in dict_additional_context:
return True
except Exception:
pass

return False

def get_ctx_test(self) -> str:
"""Get context test.
