63 changes: 61 additions & 2 deletions docs/deployment/launcher-orchestrated/slurm.md
@@ -77,11 +77,12 @@ execution:

# Resource allocation
partition: batch # Slurm partition/queue
num_nodes: 1 # Number of nodes
num_nodes: 1 # Total SLURM nodes
num_instances: 1 # Independent deployment instances (HAProxy auto-enabled when > 1)
ntasks_per_node: 1 # Tasks per node
gres: gpu:8 # GPU resources
walltime: "01:00:00" # Wall time limit (HH:MM:SS)

# Environment variables and mounts
env_vars:
deployment: {} # Environment variables for deployment container
@@ -96,6 +97,64 @@ execution:
The `gpus_per_node` parameter can be used as an alternative to `gres` for specifying GPU resources. However, `gres` is the default in the base configuration.
:::

## Multi-Node Deployment

Multi-node deployment can be achieved with or without Ray.

### Without Ray (Custom Command)

For multi-node setups using vLLM's native data parallelism or other custom coordination, override `deployment.command` with your own multi-node logic. The launcher exports `MASTER_IP` and `SLURM_PROCID` to help coordinate nodes:

```yaml
defaults:
- execution: slurm/default
- deployment: vllm
- _self_

execution:
num_nodes: 2

deployment:
command: >-
bash -c 'if [ "$SLURM_PROCID" -eq 0 ]; then
vllm serve ${deployment.hf_model_handle} --data-parallel-size 16 --data-parallel-address $MASTER_IP ...;
else
vllm serve ${deployment.hf_model_handle} --headless --data-parallel-address $MASTER_IP ...;
fi'
```

See `examples/slurm_vllm_multinode_dp.yaml` for a complete native data parallelism example.

### With Ray (vllm_ray)

For models that require tensor/pipeline parallelism across nodes, use the `vllm_ray` deployment config which includes a built-in Ray cluster setup script:

```yaml
defaults:
- execution: slurm/default
- deployment: vllm_ray # Ray-managed multi-node vLLM deployment
- _self_

execution:
num_nodes: 2 # Single instance spanning 2 nodes

deployment:
tensor_parallel_size: 8
pipeline_parallel_size: 2
```

### Multi-Instance with HAProxy

To run multiple independent deployment instances with HAProxy load-balancing:

```yaml
execution:
num_nodes: 4 # Total SLURM nodes
num_instances: 2 # 2 instances of 2 nodes each → HAProxy auto-enabled
```

When `num_instances > 1`, HAProxy is automatically configured to distribute requests across instance head nodes. See the `examples/` directory for complete configurations.

## Configuration Examples

### Benchmark Suite Evaluation
@@ -81,12 +81,39 @@ evaluation:
HF_TOKEN: $host:HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
```

## Multi-Node Deployment with Ray (`vllm_ray`)

For models requiring multiple nodes (e.g., pipeline parallelism across nodes), use the `vllm_ray` deployment config:

```yaml
defaults:
- execution: slurm/default
- deployment: vllm_ray
- _self_
execution:
num_nodes: 2 # Single instance spanning 2 nodes
deployment:
tensor_parallel_size: 8
pipeline_parallel_size: 2
```

The `vllm_ray` config inherits all fields from `vllm` and adds:

- **`distributed_executor_backend`**: Ray backend type (default: `ray`)
- **`ray_compiled_dag_channel_type`**: Ray channel type — `auto`, `shm`, or `nccl` (default: `shm`)
- **`command`**: Built-in Ray cluster setup script that starts a Ray head on rank 0, waits for workers, then launches vLLM with `--distributed-executor-backend`

The `base_command` field in the base `vllm` config contains the `vllm serve ...` invocation. The `vllm_ray` config references it via `${deployment.base_command}` to append Ray-specific flags.
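As a sketch of the interpolation pattern only (these are not the shipped config files, and any field values beyond `base_command`, `command`, and `distributed_executor_backend` are illustrative):

```yaml
# Sketch: how vllm_ray can reuse the base invocation via interpolation.
# --- vllm (base) ---
deployment:
  base_command: vllm serve ${deployment.hf_model_handle} --port ${deployment.port}
---
# --- vllm_ray (override) ---
deployment:
  command: >-
    ${deployment.base_command}
    --distributed-executor-backend ${deployment.distributed_executor_backend}
```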

## Reference

The following example configuration files are available in the `examples/` directory:

- `lepton_vllm_llama_3_1_8b_instruct.yaml` - vLLM deployment on Lepton platform
- `slurm_llama_3_1_8b_instruct.yaml` - vLLM deployment on SLURM cluster
- `slurm_llama_3_1_8b_instruct_hf.yaml` - vLLM deployment using HuggingFace model ID
- `slurm_vllm_basic.yaml` - Basic single-node vLLM deployment
- `slurm_vllm_multinode_ray_tp_pp.yaml` - Multi-node Ray deployment with TP+PP
- `slurm_vllm_multinode_multiinstance_ray_tp_pp.yaml` - Multi-node multi-instance Ray with HAProxy
- `slurm_vllm_multinode_dp_haproxy.yaml` - Multi-node independent instances with HAProxy

Use `nemo-evaluator-launcher run --dry-run` to check your configuration before running.
@@ -85,6 +85,30 @@ env_vars:

**Security:** Secret values are never written into the generated `run.sub` script. They are stored in a separate `.secrets.env` file and sourced at runtime, preventing accidental exposure in logs or artifacts.
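As a hedged sketch (the `$host:` prefix appears in the evaluation examples elsewhere in these docs; that it routes deployment variables through `.secrets.env` is an assumption here, not confirmed by the source):

```yaml
env_vars:
  deployment:
    HF_TOKEN: $host:HF_TOKEN  # resolved on the submitting host; assumed to land in
                              # .secrets.env rather than the generated run.sub
```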

### Multi-Node and Multi-Instance

Configure multi-node deployments using `num_nodes` and `num_instances`:

```yaml
execution:
num_nodes: 4 # Total SLURM nodes
num_instances: 2 # Independent deployment instances (default: 1)
```

- **`num_nodes`**: Total number of SLURM nodes to allocate
- **`num_instances`**: Number of independent deployment instances. When `> 1`, HAProxy is automatically configured to load-balance across instances. `num_nodes` must be divisible by `num_instances`.

For multi-node deployments requiring Ray (e.g., pipeline parallelism across nodes), use the `vllm_ray` deployment config instead of `vllm`:

```yaml
defaults:
- deployment: vllm_ray # Built-in Ray cluster setup
```

:::{note}
The deprecated `deployment.multiple_instances` field is still accepted but will be removed in a future release. Use `execution.num_instances` instead.
:::
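A minimal before/after sketch of the migration (values illustrative):

```yaml
# Deprecated:
deployment:
  multiple_instances: true

# Preferred:
execution:
  num_instances: 2   # HAProxy auto-enabled when > 1
```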

### Mounting and Storage

The Slurm executor provides sophisticated mounting capabilities:
@@ -142,27 +142,59 @@ Show tasks in the current config. Loop until the user confirms the task list is
```
For the None (External) deployment, `api_key_name` should already be defined. The `DUMMY_API_KEY` export is handled in Step 8.

**Step 6: Advanced - Multi-node (Data Parallel)**
**Step 6: Advanced - Multi-node**

Only if model >120B parameters, suggest multi-node. Explain: "This is DP multi-node - the weights are copied (not distributed) across nodes. One deployment instance per node will be run with HAProxy load-balancing requests."
There are two multi-node patterns. Ask the user which applies:

Ask if user wants multi-node. If yes, ask for node count and configure:
**Pattern A: Multi-instance (independent instances with HAProxy)**

Suggest this only if the model is >120B parameters or the user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances."

```yaml
execution:
num_nodes: 4 # 4 nodes = 4 independent deployment instances = 4x throughput
deployment:
n_tasks: ${execution.num_nodes} # Must match num_nodes for multi-instance deployment
num_nodes: 4 # Total nodes
num_instances: 4 # 4 independent instances → HAProxy auto-enabled
```

**Pattern B: Multi-node single instance (Ray TP/PP across nodes)**

Choose this when a single model is too large for one node and needs pipeline parallelism across nodes. It requires the `vllm_ray` deployment config:

```yaml
defaults:
- deployment: vllm_ray # Built-in Ray cluster setup (replaces manual pre_cmd)

execution:
num_nodes: 2 # Single instance spanning 2 nodes

deployment:
tensor_parallel_size: 8
pipeline_parallel_size: 2
```

**Pattern A+B combined: Multi-instance with multi-node instances**

For very large models needing both cross-node parallelism AND multiple instances:

```yaml
defaults:
- deployment: vllm_ray

execution:
num_nodes: 4 # Total nodes
num_instances: 2 # 2 instances of 2 nodes each → HAProxy auto-enabled

deployment:
tensor_parallel_size: 8
pipeline_parallel_size: 2
```

**Common Confusions**

- **This is different from `data_parallel_size`**, which controls DP replicas *within* a single node/deployment instance.
- Global data parallelism is `num_nodes x data_parallel_size` (e.g., 2 nodes x 4 DP each = 8 replicas for max throughput).
- With multi-node, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance.
- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas).
- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
- `num_nodes` must be divisible by `num_instances`.
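The replica arithmetic from the bullets above, as an illustrative config sketch (values are examples, not defaults):

```yaml
execution:
  num_nodes: 2              # one node per instance here
  num_instances: 2          # 2 independent instances behind HAProxy
deployment:
  data_parallel_size: 8     # DP replicas within each instance
# Global data parallelism = num_instances x data_parallel_size = 2 x 8 = 16 replicas
```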

**Step 7: Advanced - Interceptors**

@@ -69,8 +69,6 @@ execution:
walltime: 01:00:00

num_nodes: 2 # Number of SLURM nodes for multi-node deployment
deployment:
n_tasks: ${execution.num_nodes} # For multi-node vLLM deployment, must match num_nodes

mounts:
mount_home: false # Whether to mount home directory (default: true)
@@ -97,7 +95,7 @@ evaluation:
parallelism: 512 # Number of parallel requests (higher for data parallel deployment)
temperature: 0.6 # Sampling temperature
top_p: 0.95 # Nucleus sampling parameter
max_tokens: 32768 # Maximum number of tokens to generate (32k)
max_new_tokens: 32768 # Maximum number of tokens to generate (32k)
request_timeout: 3600 # Timeout for API requests in seconds
target:
api_endpoint:
@@ -61,15 +61,13 @@ execution:
walltime: 01:00:00

num_nodes: 2 # Number of SLURM nodes for multi-node deployment
deployment:
n_tasks: ${execution.num_nodes} # One vLLM instance per node
num_instances: 2 # 2 independent single-node instances → HAProxy auto-enabled

mounts:
mount_home: false # Whether to mount home directory (default: true)

# Override default deployment arguments
deployment:
multiple_instances: true # Enable HAProxy load balancing across nodes
checkpoint_path: null
hf_model_handle: nvidia/NVIDIA-Nemotron-Nano-9B-v2
served_model_name: nvidia/NVIDIA-Nemotron-Nano-9B-v2
@@ -86,7 +84,7 @@ evaluation:
parallelism: 512 # Number of parallel requests (higher for multi-node deployment)
temperature: 0.6 # Sampling temperature
top_p: 0.95 # Nucleus sampling parameter
max_tokens: 32768 # Maximum number of tokens to generate (32k)
max_new_tokens: 32768 # Maximum number of tokens to generate (32k)
request_timeout: 3600 # Timeout for API requests in seconds
target:
api_endpoint:
@@ -0,0 +1,117 @@
#
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ==============================================================================
# Multi-Node Multi-Instance SLURM Deployment: DeepSeek-R1 with HAProxy
# ==============================================================================
# This configuration demonstrates how to run evaluations with DeepSeek-R1
# deployed as multiple instances across SLURM nodes, each instance spanning
# multiple nodes using Ray tensor and pipeline parallelism, with HAProxy
# load-balancing across instances.
#
# Architecture:
# 4 nodes total, 2 instances of 2 nodes each:
# Instance 0 (nodes 0,1): Ray head + worker, vLLM on :8000
# Instance 1 (nodes 2,3): Ray head + worker, vLLM on :8000
# HAProxy: distributes requests across Instance 0 and Instance 1
#
# How to use:
#
# 1. Copy this file locally or clone the repository.
# 2. (Optional) Set the required values in the config file. Alternatively, you can pass them later with -o CLI arguments, e.g.
#      -o execution.hostname=my-cluster.com -o execution.output_dir=/absolute/path/on/cluster -o execution.account=my-account etc.
# 3. (Optional) Run with 10 samples for quick testing by adding the following flag to the command below:
#      -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
# 4. Run the full evaluation:
#      nemo-evaluator-launcher run --config path/to/slurm_vllm_multinode_multiinstance_ray_tp_pp.yaml
#
# ⚠️ WARNING:
# Always run full evaluations (without limit_samples) for actual benchmark results.
# Using a subset of samples is solely for testing configuration and setup.
# Results from such test runs should NEVER be used to compare models or
# report benchmark performance.

# Model Details:
# - Model: deepseek-ai/DeepSeek-R1
# - Hardware: 4 nodes with 8xH100 GPUs each (32 H100 GPUs total)
# - 2 instances, each spanning 2 nodes (16 GPUs per instance)
# - tensor_parallel_size: 8 (within-node parallelism)
# - pipeline_parallel_size: 2 (across-node parallelism within each instance)
#
# Multi-Node Multi-Instance Configuration:
# - execution.num_nodes: 4 (total SLURM nodes)
# - execution.num_instances: 2 (2 instances → HAProxy auto-enabled)
# - num_nodes_per_instance = num_nodes / num_instances = 2
# - deployment.tensor_parallel_size: GPU parallelism within a single node
# - deployment.pipeline_parallel_size: Model parallelism across nodes within an instance
#
# The vllm_ray deployment config contains a built-in Ray setup script.
# The script expects scheduler-agnostic variables (PROC_ID, NUM_TASKS,
# MASTER_IP, ALL_NODE_IPS) exported by the executor, and computes
# per-instance variables (INSTANCE_ID, INSTANCE_RANK, etc.) internally.
# ==============================================================================

defaults:
- execution: slurm/default
- deployment: vllm_ray
- _self_

execution:
hostname: ??? # SLURM headnode (login) hostname (required)
username: ${oc.env:USER}
account: ??? # SLURM account allocation (required)
output_dir: ??? # ABSOLUTE path accessible to SLURM compute nodes (required)
num_nodes: 4 # 4 total SLURM nodes (2 per instance × 2 instances)
num_instances: 2 # 2 instances → HAProxy auto-enabled
mounts:
deployment:
/path/to/hf_home: /root/.cache/huggingface
mount_home: false
env_vars:
deployment:
HF_TOKEN: ${oc.env:HF_TOKEN}

# Ray cluster setup is handled by the vllm_ray deployment config (no pre_cmd needed)
deployment:
image: vllm/vllm-openai:v0.15.1
checkpoint_path: null
hf_model_handle: deepseek-ai/DeepSeek-R1
served_model_name: deepseek-ai/DeepSeek-R1
tensor_parallel_size: 8
pipeline_parallel_size: 2
data_parallel_size: 1
gpu_memory_utilization: 0.90
port: 8000
extra_args: "--disable-custom-all-reduce --enforce-eager"

evaluation:
nemo_evaluator_config:
config:
params:
parallelism: 128
request_timeout: 3600
temperature: 0.6
top_p: 0.95
max_new_tokens: 32768
target:
api_endpoint:
adapter_config:
process_reasoning_traces: true # Strip <think>...</think> tokens from DeepSeek-R1 responses
use_response_logging: true
max_logged_responses: 10
use_request_logging: true
max_logged_requests: 10
tasks:
- name: gsm8k_cot_instruct