133 changes: 130 additions & 3 deletions docs/how-to-run-multi-node.md
@@ -75,17 +75,144 @@ You can see at the end of these commands, we are pointing DLM/MAD to the shared-file system

**NOTE: The commands above assume the shared-file system is mounted at `/nfs`. If it is not, and a user simply copies and pastes these commands on two nodes, DLM/MAD will create a folder called `nfs` on each node and copy the data there, which is not the desired behavior.**

## SLURM Cluster Integration

madengine now supports running workloads on SLURM clusters, allowing you to leverage job scheduling and resource management for multi-node training and inference.

### Overview

When `slurm_args` is provided via `--additional-context`, madengine will:
1. Parse the SLURM configuration parameters
2. Skip the standard Docker container build-and-run workflow
3. Prepare the job environment (model name, node counts, partition, and so on)
4. Execute the model-specific script (e.g., `scripts/sglang_disagg/run.sh`), which submits the job to the SLURM cluster via `sbatch` (see the sketch below)
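
As a rough sketch of step 4 (not the actual madengine internals, just the shape of the flow it describes), the work on the submission host reduces to invoking the framework's run script, which in turn performs the `sbatch` submission:

```bash
# Conceptual sketch only -- the real logic lives inside madengine.
# With slurm_args present, the local Docker build/run is skipped and the
# framework's run script is invoked; that script issues the sbatch call.
FRAMEWORK=sglang_disagg
bash "scripts/${FRAMEWORK}/run.sh"
```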

### SLURM Arguments

The following arguments can be specified in the `slurm_args` dictionary:

| Argument | Description | Required | Default | Example |
|----------|-------------|----------|---------|---------|
| `FRAMEWORK` | Framework to use for the job | Yes | - | `'sglang_disagg'` |
| `PREFILL_NODES` | Number of nodes for prefill phase | Yes | - | `'2'` |
| `DECODE_NODES` | Number of nodes for decode phase | Yes | - | `'2'` |
| `PARTITION` | SLURM partition/queue name | Yes | - | `'amd-rccl'` |
| `TIME` | Maximum job runtime (HH:MM:SS) | Yes | - | `'12:00:00'` |
| `DOCKER_IMAGE` | Docker image to use | No | `''` | `'myregistry/image:tag'` |
| `EXCLUSIVE_MODE` | Request exclusive node access | No | `True` | `True` or `False` |
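
The framework's run script translates these values into an `sbatch` submission. Exactly how is implementation-specific, but a rough sketch of the resulting batch-script header (assuming, for illustration, that the prefill and decode node counts are combined into a single allocation) could look like this:

```bash
#!/bin/bash
# Illustrative sketch only -- in practice these directives are generated by the
# framework's run script (e.g., scripts/sglang_disagg/run.sh), not written by hand.
# PARTITION -> --partition, TIME -> --time,
# PREFILL_NODES + DECODE_NODES (2 + 2) -> --nodes,
# EXCLUSIVE_MODE=True -> --exclusive
#SBATCH --partition=amd-rccl
#SBATCH --time=12:00:00
#SBATCH --nodes=4
#SBATCH --exclusive
```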

### Usage Examples

#### Basic SLURM Job Submission

To run a model on SLURM with default settings:

```bash
madengine run --tags sglang_disagg_pd_qwen3-32B \
--additional-context "{'slurm_args': {
'FRAMEWORK': 'sglang_disagg',
'PREFILL_NODES': '2',
'DECODE_NODES': '2',
'PARTITION': 'amd-rccl',
'TIME': '12:00:00',
'DOCKER_IMAGE': ''
}}"
```

#### Custom Docker Image

To specify a custom Docker image for the SLURM job:

```bash
madengine run --tags sglang_disagg_pd_qwen3-32B \
--additional-context "{'slurm_args': {
'FRAMEWORK': 'sglang_disagg',
'PREFILL_NODES': '4',
'DECODE_NODES': '4',
'PARTITION': 'gpu-high-priority',
'TIME': '24:00:00',
'DOCKER_IMAGE': 'myregistry/custom-image:latest'
}}"
```

#### Running Different Model Configurations

For DeepSeek-V2 model:

```bash
madengine run --tags sglang_disagg_pd_deepseek_v2 \
--additional-context "{'slurm_args': {
'FRAMEWORK': 'sglang_disagg',
'PREFILL_NODES': '8',
'DECODE_NODES': '8',
'PARTITION': 'amd-mi300x',
'TIME': '48:00:00',
'DOCKER_IMAGE': ''
}}"
```

#### Using Exclusive Mode

By default, `EXCLUSIVE_MODE` is `True`, which requests exclusive access to nodes (recommended for distributed inference). To share nodes with other jobs:

```bash
madengine run --tags sglang_disagg_pd_qwen3-32B \
--additional-context "{'slurm_args': {
'FRAMEWORK': 'sglang_disagg',
'PREFILL_NODES': '2',
'DECODE_NODES': '2',
'PARTITION': 'amd-rccl',
'TIME': '12:00:00',
'DOCKER_IMAGE': '',
'EXCLUSIVE_MODE': False
}}"
```

**Note:** Exclusive mode (`--exclusive` in SLURM) is typically recommended for distributed multi-node workloads to ensure consistent performance and avoid interference from other jobs running on the same nodes.

### Model Configuration

Models configured for SLURM should include the model name in the `args` attribute of `models.json`. For example:

```json
{
"name": "sglang_disagg_pd_qwen3-32B",
"args": "--model Qwen3-32B",
"tags": ["sglang_disagg"]
}
```

The model name (e.g., `Qwen/Qwen3-32B`) will be extracted and set as the `MODEL_NAME` environment variable for the SLURM job.
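
For illustration only, extracting the model name from the `args` string could be done as below; this is a hypothetical snippet, not madengine's actual extraction code, which runs internally:

```bash
# Hypothetical illustration of the extraction step madengine performs internally.
ARGS="--model Qwen3-32B"
MODEL_NAME=$(echo "$ARGS" | sed -n 's/.*--model[= ]\([^ ]*\).*/\1/p')
export MODEL_NAME   # made available to the SLURM job
echo "$MODEL_NAME"  # -> Qwen3-32B
```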

### Requirements

To use SLURM integration, ensure the following are available:

1. **SLURM Cluster Access**: A SLURM cluster reachable from the submission host, with valid credentials
2. **Model Scripts**: Framework-specific scripts (e.g., `scripts/sglang_disagg/run.sh`) that handle SLURM job submission
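
Before submitting, it can help to confirm these basics are in place. The commands below are standard SLURM client tools; the partition name and script path are just the examples used earlier in this document:

```bash
# Quick sanity checks before submitting a job.
sinfo -p amd-rccl                  # target partition exists and shows available nodes
sbatch --version                   # SLURM client tools are on PATH
ls scripts/sglang_disagg/run.sh    # framework run script is present in the repo
```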

### How It Works

1. **Context Parsing**: madengine detects `slurm_args` in the additional context
2. **Model Selection**: Extracts model information from `models.json` based on the provided tags
3. **Environment Setup**: Prepares environment variables including `MODEL_NAME`, node counts, partition, etc.
4. **Job Submission**: Executes the framework-specific run script which submits the SLURM job using `sbatch`
5. **Job Monitoring**: The SLURM cluster manages job execution, resource allocation, and scheduling
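
Once the job has been submitted (step 4), it can be tracked with the usual SLURM commands, for example:

```bash
# Standard SLURM commands for following the submitted job.
squeue -u "$USER"            # list your queued/running jobs and their job IDs
scontrol show job <jobid>    # detailed state of a specific job
sacct -j <jobid>             # accounting information once the job has finished
```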

## TODO

### RUNNER

- [x] torchrun
- [ ] mpirun (requires ansible integration)

### Job Schedulare
### Job Scheduler

- [ ] SLURM
- [x] SLURM (via slurm_args integration)
- [ ] Kubernetes

### Design Consideration

- [ ] Having the python model script launched by individual bash scripts can be limiting for multi-node. Perhaps we can explore a full python workflow for multi-node and only the job scheduler uses a bash script like SLURM using sbatch script.
- [x] SLURM integration using sbatch scripts for job submission
- [ ] Full Python workflow for multi-node (without bash script intermediaries)
- [ ] Kubernetes-native job scheduling integration
87 changes: 71 additions & 16 deletions src/madengine/core/context.py
@@ -92,22 +92,43 @@ def __init__(
else:
print("Warning: unknown numa balancing setup ...")

# Keeping gpu_vendor for filtering purposes; if we filter using file names we can get rid of this attribute.
self.ctx["gpu_vendor"] = self.get_gpu_vendor()

# Initialize the docker context
self.ctx["docker_env_vars"] = {}
self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] = self.ctx["gpu_vendor"]
self.ctx["docker_env_vars"]["MAD_SYSTEM_NGPUS"] = self.get_system_ngpus()
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_ARCHITECTURE"] = self.get_system_gpu_architecture()
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_PRODUCT_NAME"] = self.get_system_gpu_product_name()
self.ctx['docker_env_vars']['MAD_SYSTEM_HIP_VERSION'] = self.get_system_hip_version()
self.ctx["docker_build_arg"] = {
"MAD_SYSTEM_GPU_ARCHITECTURE": self.get_system_gpu_architecture(),
"MAD_SYSTEM_GPU_PRODUCT_NAME": self.get_system_gpu_product_name()
}
self.ctx["docker_gpus"] = self.get_docker_gpus()
self.ctx["gpu_renderDs"] = self.get_gpu_renderD_nodes()
# Check if SLURM mode is requested before GPU detection
is_slurm_mode = self._is_slurm_mode(additional_context, additional_context_file)

if is_slurm_mode:
# For SLURM mode, set minimal GPU context to avoid detection on control node
print("SLURM mode detected - skipping GPU detection on control node")
self.ctx["gpu_vendor"] = "AMD" # Default to AMD for SLURM environments
self.ctx["docker_env_vars"] = {}
self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] = self.ctx["gpu_vendor"]
self.ctx["docker_env_vars"]["MAD_SYSTEM_NGPUS"] = "8" # Default value for SLURM
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_ARCHITECTURE"] = "gfx90a" # Default for SLURM
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_PRODUCT_NAME"] = "AMD_GPU" # Default value
self.ctx['docker_env_vars']['MAD_SYSTEM_HIP_VERSION'] = "5.0.0" # Default value
self.ctx["docker_build_arg"] = {
"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a",
"MAD_SYSTEM_GPU_PRODUCT_NAME": "AMD_GPU"
}
self.ctx["docker_gpus"] = "0,1,2,3,4,5,6,7" # Default GPU list
self.ctx["gpu_renderDs"] = [128, 129, 130, 131, 132, 133, 134, 135] # Default renderD nodes
else:
# Normal mode - detect GPUs
# Keeping gpu_vendor for filtering purposes; if we filter using file names we can get rid of this attribute.
self.ctx["gpu_vendor"] = self.get_gpu_vendor()

# Initialize the docker context
self.ctx["docker_env_vars"] = {}
self.ctx["docker_env_vars"]["MAD_GPU_VENDOR"] = self.ctx["gpu_vendor"]
self.ctx["docker_env_vars"]["MAD_SYSTEM_NGPUS"] = self.get_system_ngpus()
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_ARCHITECTURE"] = self.get_system_gpu_architecture()
self.ctx["docker_env_vars"]["MAD_SYSTEM_GPU_PRODUCT_NAME"] = self.get_system_gpu_product_name()
self.ctx['docker_env_vars']['MAD_SYSTEM_HIP_VERSION'] = self.get_system_hip_version()
self.ctx["docker_build_arg"] = {
"MAD_SYSTEM_GPU_ARCHITECTURE": self.get_system_gpu_architecture(),
"MAD_SYSTEM_GPU_PRODUCT_NAME": self.get_system_gpu_product_name()
}
self.ctx["docker_gpus"] = self.get_docker_gpus()
self.ctx["gpu_renderDs"] = self.get_gpu_renderD_nodes()

# Default multi-node configuration
self.ctx['multi_node_args'] = {
@@ -148,6 +169,40 @@ def __init__(
# Set multi-node runner after context update
self.ctx['docker_env_vars']['MAD_MULTI_NODE_RUNNER'] = self.set_multi_node_runner()

def _is_slurm_mode(self, additional_context: str = None, additional_context_file: str = None) -> bool:
"""Check if SLURM mode is requested.

Args:
additional_context: The additional context string.
additional_context_file: The additional context file.

Returns:
bool: True if SLURM mode is detected, False otherwise.
"""
import ast
import json

# Check additional_context_file first
if additional_context_file:
try:
with open(additional_context_file) as f:
context_data = json.load(f)
if 'slurm_args' in context_data:
return True
except Exception:
pass

# Check additional_context string
if additional_context:
try:
dict_additional_context = ast.literal_eval(additional_context)
if 'slurm_args' in dict_additional_context:
return True
except Exception:
pass

return False

def get_ctx_test(self) -> str:
"""Get context test.
