63 changes: 61 additions & 2 deletions docs/deployment/launcher-orchestrated/slurm.md
@@ -77,11 +77,12 @@ execution:

# Resource allocation
partition: batch # Slurm partition/queue
num_nodes: 1 # Number of nodes
num_nodes: 1 # Total SLURM nodes
num_instances: 1 # Independent deployment instances (HAProxy auto-enabled when > 1)
ntasks_per_node: 1 # Tasks per node
gres: gpu:8 # GPU resources
walltime: "01:00:00" # Wall time limit (HH:MM:SS)

# Environment variables and mounts
env_vars:
deployment: {} # Environment variables for deployment container
@@ -96,6 +97,64 @@ execution:
The `gpus_per_node` parameter can be used as an alternative to `gres` for specifying GPU resources. However, `gres` is the default in the base configuration.
:::

## Multi-Node Deployment

Multi-node deployment can be achieved with or without Ray.

### Without Ray (Custom Command)

For multi-node setups using vLLM's native data parallelism or other custom coordination, override `deployment.command` with your own multi-node logic. The launcher exports `MASTER_IP` and `SLURM_PROCID` to help coordinate nodes:

```yaml
defaults:
- execution: slurm/default
- deployment: vllm
- _self_

execution:
num_nodes: 2

deployment:
command: >-
bash -c 'if [ "$SLURM_PROCID" -eq 0 ]; then
vllm serve ${deployment.hf_model_handle} --data-parallel-size 16 --data-parallel-address $MASTER_IP ...;
else
vllm serve ${deployment.hf_model_handle} --headless --data-parallel-address $MASTER_IP ...;
fi'
```

See `examples/slurm_vllm_multinode_dp.yaml` for a complete native data parallelism example.

### With Ray (vllm_ray)

For models that require tensor/pipeline parallelism across nodes, use the `vllm_ray` deployment config which includes a built-in Ray cluster setup script:

```yaml
defaults:
- execution: slurm/default
- deployment: vllm_ray # Ray-managed multi-node vLLM deployment
- _self_

execution:
num_nodes: 2 # Single instance spanning 2 nodes

deployment:
tensor_parallel_size: 8
pipeline_parallel_size: 2
```

### Multi-Instance with HAProxy

To run multiple independent deployment instances with HAProxy load-balancing:

```yaml
execution:
num_nodes: 4 # Total SLURM nodes
num_instances: 2 # 2 instances of 2 nodes each → HAProxy auto-enabled
```

When `num_instances > 1`, HAProxy is automatically configured to distribute requests across instance head nodes. See the `examples/` directory for complete configurations.

## Configuration Examples

### Benchmark Suite Evaluation
@@ -81,12 +81,39 @@ evaluation:
HF_TOKEN: $host:HF_TOKEN_FOR_GPQA_DIAMOND # Click request access for GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
```

## Multi-Node Deployment with Ray (`vllm_ray`)

For models requiring multiple nodes (e.g., pipeline parallelism across nodes), use the `vllm_ray` deployment config:

```yaml
defaults:
- execution: slurm/default
- deployment: vllm_ray
- _self_
execution:
num_nodes: 2 # Single instance spanning 2 nodes
deployment:
tensor_parallel_size: 8
pipeline_parallel_size: 2
```

The `vllm_ray` config inherits all fields from `vllm` and adds:

- **`distributed_executor_backend`**: Ray backend type (default: `ray`)
- **`ray_compiled_dag_channel_type`**: Ray channel type — `auto`, `shm`, or `nccl` (default: `shm`)
- **`command`**: Built-in Ray cluster setup script that starts a Ray head on rank 0, waits for workers, then launches vLLM with `--distributed-executor-backend`

The `base_command` field in the base `vllm` config contains the `vllm serve ...` invocation. The `vllm_ray` config references it via `${deployment.base_command}` to append Ray-specific flags.
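As a sketch of the interpolation pattern only (these are not the shipped config files, and any field values beyond `base_command`, `command`, and `distributed_executor_backend` are illustrative):

```yaml
# Sketch: how vllm_ray can reuse the base invocation via interpolation.
# --- vllm (base) ---
deployment:
  base_command: vllm serve ${deployment.hf_model_handle} --port ${deployment.port}
---
# --- vllm_ray (override) ---
deployment:
  command: >-
    ${deployment.base_command}
    --distributed-executor-backend ${deployment.distributed_executor_backend}
```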

## Reference

The following example configuration files are available in the `examples/` directory:

- `lepton_vllm_llama_3_1_8b_instruct.yaml` - vLLM deployment on Lepton platform
- `slurm_llama_3_1_8b_instruct.yaml` - vLLM deployment on SLURM cluster
- `slurm_llama_3_1_8b_instruct_hf.yaml` - vLLM deployment using HuggingFace model ID
- `slurm_vllm_basic.yaml` - Basic single-node vLLM deployment
- `slurm_vllm_multinode_ray_tp_pp.yaml` - Multi-node Ray deployment with TP+PP
- `slurm_vllm_multinode_multiinstance_ray_tp_pp.yaml` - Multi-node multi-instance Ray with HAProxy
- `slurm_vllm_multinode_dp_haproxy.yaml` - Multi-node independent instances with HAProxy

Use `nemo-evaluator-launcher run --dry-run` to check your configuration before running.
@@ -85,6 +85,30 @@ env_vars:

**Security:** Secret values are never written into the generated `run.sub` script. They are stored in a separate `.secrets.env` file and sourced at runtime, preventing accidental exposure in logs or artifacts.
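As a hedged sketch (the `$host:` prefix appears in the evaluation examples elsewhere in these docs; that it routes deployment variables through `.secrets.env` is an assumption here, not confirmed by the source):

```yaml
env_vars:
  deployment:
    HF_TOKEN: $host:HF_TOKEN  # resolved on the submitting host; assumed to land in
                              # .secrets.env rather than the generated run.sub
```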

### Multi-Node and Multi-Instance

Configure multi-node deployments using `num_nodes` and `num_instances`:

```yaml
execution:
num_nodes: 4 # Total SLURM nodes
num_instances: 2 # Independent deployment instances (default: 1)
```

- **`num_nodes`**: Total number of SLURM nodes to allocate
- **`num_instances`**: Number of independent deployment instances. When `> 1`, HAProxy is automatically configured to load-balance across instances. `num_nodes` must be divisible by `num_instances`.

For multi-node deployments requiring Ray (e.g., pipeline parallelism across nodes), use the `vllm_ray` deployment config instead of `vllm`:

```yaml
defaults:
- deployment: vllm_ray # Built-in Ray cluster setup
```

:::{note}
The deprecated `deployment.multiple_instances` field is still accepted but will be removed in a future release. Use `execution.num_instances` instead.
:::
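A minimal before/after sketch of the migration (values illustrative):

```yaml
# Deprecated:
deployment:
  multiple_instances: true

# Preferred:
execution:
  num_instances: 2   # HAProxy auto-enabled when > 1
```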

### Mounting and Storage

The Slurm executor provides sophisticated mounting capabilities:
@@ -142,27 +142,59 @@ Show tasks in the current config. Loop until the user confirms the task list is
```
For the None (External) deployment, `api_key_name` should already be defined. The `DUMMY_API_KEY` export is handled in Step 8.

**Step 6: Advanced - Multi-node (Data Parallel)**
**Step 6: Advanced - Multi-node**

Only if model >120B parameters, suggest multi-node. Explain: "This is DP multi-node - the weights are copied (not distributed) across nodes. One deployment instance per node will be run with HAProxy load-balancing requests."
There are two multi-node patterns. Ask the user which applies:

Ask if user wants multi-node. If yes, ask for node count and configure:
**Pattern A: Multi-instance (independent instances with HAProxy)**

Suggest this only if the model is >120B parameters or the user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances."

```yaml
execution:
num_nodes: 4 # 4 nodes = 4 independent deployment instances = 4x throughput
deployment:
n_tasks: ${execution.num_nodes} # Must match num_nodes for multi-instance deployment
num_nodes: 4 # Total nodes
num_instances: 4 # 4 independent instances → HAProxy auto-enabled
```

**Pattern B: Multi-node single instance (Ray TP/PP across nodes)**

Choose this when a single model is too large for one node and needs pipeline parallelism across nodes. It requires the `vllm_ray` deployment config:

```yaml
defaults:
- deployment: vllm_ray # Built-in Ray cluster setup (replaces manual pre_cmd)

execution:
num_nodes: 2 # Single instance spanning 2 nodes

deployment:
tensor_parallel_size: 8
pipeline_parallel_size: 2
```

**Pattern A+B combined: Multi-instance with multi-node instances**

For very large models needing both cross-node parallelism AND multiple instances:

```yaml
defaults:
- deployment: vllm_ray

execution:
num_nodes: 4 # Total nodes
num_instances: 2 # 2 instances of 2 nodes each → HAProxy auto-enabled

deployment:
tensor_parallel_size: 8
pipeline_parallel_size: 2
```

**Common Confusions**

- **This is different from `data_parallel_size`**, which controls DP replicas *within* a single node/deployment instance.
- Global data parallelism is `num_nodes x data_parallel_size` (e.g., 2 nodes x 4 DP each = 8 replicas for max throughput).
- With multi-node, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance.
- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas).
- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
- `num_nodes` must be divisible by `num_instances`.
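The replica arithmetic from the bullets above, as an illustrative config sketch (values are examples, not defaults):

```yaml
execution:
  num_nodes: 2              # one node per instance here
  num_instances: 2          # 2 independent instances behind HAProxy
deployment:
  data_parallel_size: 8     # DP replicas within each instance
# Global data parallelism = num_instances x data_parallel_size = 2 x 8 = 16 replicas
```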

**Step 7: Advanced - Interceptors**

@@ -69,8 +69,6 @@ execution:
walltime: 01:00:00

num_nodes: 2 # Number of SLURM nodes for multi-node deployment
deployment:
n_tasks: ${execution.num_nodes} # For multi-node vLLM deployment, must match num_nodes

mounts:
mount_home: false # Whether to mount home directory (default: true)
@@ -97,7 +95,7 @@ evaluation:
parallelism: 512 # Number of parallel requests (higher for data parallel deployment)
temperature: 0.6 # Sampling temperature
top_p: 0.95 # Nucleus sampling parameter
max_tokens: 32768 # Maximum number of tokens to generate (32k)
max_new_tokens: 32768 # Maximum number of tokens to generate (32k)
request_timeout: 3600 # Timeout for API requests in seconds
target:
api_endpoint:
@@ -61,15 +61,13 @@ execution:
walltime: 01:00:00

num_nodes: 2 # Number of SLURM nodes for multi-node deployment
deployment:
n_tasks: ${execution.num_nodes} # One vLLM instance per node
num_instances: 2 # 2 independent single-node instances → HAProxy auto-enabled

mounts:
mount_home: false # Whether to mount home directory (default: true)

# Override default deployment arguments
deployment:
multiple_instances: true # Enable HAProxy load balancing across nodes
checkpoint_path: null
hf_model_handle: nvidia/NVIDIA-Nemotron-Nano-9B-v2
served_model_name: nvidia/NVIDIA-Nemotron-Nano-9B-v2
@@ -86,7 +84,7 @@ evaluation:
parallelism: 512 # Number of parallel requests (higher for multi-node deployment)
temperature: 0.6 # Sampling temperature
top_p: 0.95 # Nucleus sampling parameter
max_tokens: 32768 # Maximum number of tokens to generate (32k)
max_new_tokens: 32768 # Maximum number of tokens to generate (32k)
request_timeout: 3600 # Timeout for API requests in seconds
target:
api_endpoint:
@@ -0,0 +1,117 @@
#
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ==============================================================================
# Multi-Node Multi-Instance SLURM Deployment: DeepSeek-R1 with HAProxy
# ==============================================================================
# This configuration demonstrates how to run evaluations with DeepSeek-R1
# deployed as multiple instances across SLURM nodes, each instance spanning
# multiple nodes using Ray tensor and pipeline parallelism, with HAProxy
# load-balancing across instances.
#
# Architecture:
# 4 nodes total, 2 instances of 2 nodes each:
# Instance 0 (nodes 0,1): Ray head + worker, vLLM on :8000
# Instance 1 (nodes 2,3): Ray head + worker, vLLM on :8000
# HAProxy: distributes requests across Instance 0 and Instance 1
#
# How to use:
#
# 1. Copy this file locally or clone the repository.
# 2. (Optional) Set the required values in the config file. Alternatively, you can pass them later with -o CLI arguments, e.g.
#      -o execution.hostname=my-cluster.com -o execution.output_dir=/absolute/path/on/cluster -o execution.account=my-account etc.
# 3. (Optional) Run with 10 samples for quick testing by adding the following flag to the command below:
#      -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10
# 4. Run the full evaluation:
#      nemo-evaluator-launcher run --config path/to/slurm_vllm_multinode_multiinstance_ray_tp_pp.yaml
#
# ⚠️ WARNING:
# Always run full evaluations (without limit_samples) for actual benchmark results.
# Using a subset of samples is solely for testing configuration and setup.
# Results from such test runs should NEVER be used to compare models or
# report benchmark performance.

# Model Details:
# - Model: deepseek-ai/DeepSeek-R1
# - Hardware: 4 nodes with 8xH100 GPUs each (32 H100 GPUs total)
# - 2 instances, each spanning 2 nodes (16 GPUs per instance)
# - tensor_parallel_size: 8 (within-node parallelism)
# - pipeline_parallel_size: 2 (across-node parallelism within each instance)
#
# Multi-Node Multi-Instance Configuration:
# - execution.num_nodes: 4 (total SLURM nodes)
# - execution.num_instances: 2 (2 instances → HAProxy auto-enabled)
# - num_nodes_per_instance = num_nodes / num_instances = 2
# - deployment.tensor_parallel_size: GPU parallelism within a single node
# - deployment.pipeline_parallel_size: Model parallelism across nodes within an instance
#
# The vllm_ray deployment config contains a built-in Ray setup script.
# The script expects scheduler-agnostic variables (PROC_ID, NUM_TASKS,
# MASTER_IP, ALL_NODE_IPS) exported by the executor, and computes
# per-instance variables (INSTANCE_ID, INSTANCE_RANK, etc.) internally.
# ==============================================================================

defaults:
- execution: slurm/default
- deployment: vllm_ray
- _self_

execution:
hostname: ??? # SLURM headnode (login) hostname (required)
username: ${oc.env:USER}
account: ??? # SLURM account allocation (required)
output_dir: ??? # ABSOLUTE path accessible to SLURM compute nodes (required)
num_nodes: 4 # 4 total SLURM nodes (2 per instance × 2 instances)
num_instances: 2 # 2 instances → HAProxy auto-enabled
mounts:
deployment:
/path/to/hf_home: /root/.cache/huggingface
mount_home: false
env_vars:
deployment:
HF_TOKEN: ${oc.env:HF_TOKEN}

# Ray cluster setup is handled by the vllm_ray deployment config (no pre_cmd needed)
deployment:
image: vllm/vllm-openai:v0.15.1
checkpoint_path: null
hf_model_handle: deepseek-ai/DeepSeek-R1
served_model_name: deepseek-ai/DeepSeek-R1
tensor_parallel_size: 8
pipeline_parallel_size: 2
data_parallel_size: 1
gpu_memory_utilization: 0.90
port: 8000
extra_args: "--disable-custom-all-reduce --enforce-eager"

evaluation:
nemo_evaluator_config:
config:
params:
parallelism: 128
request_timeout: 3600
temperature: 0.6
top_p: 0.95
max_new_tokens: 32768
target:
api_endpoint:
adapter_config:
process_reasoning_traces: true # Strip <think>...</think> tokens from DeepSeek-R1 responses
use_response_logging: true
max_logged_responses: 10
use_request_logging: true
max_logged_requests: 10
tasks:
- name: gsm8k_cot_instruct