15 changes: 14 additions & 1 deletion docs/deployment/launcher-orchestrated/slurm.md
Original file line number Diff line number Diff line change
@@ -77,7 +77,8 @@ execution:

# Resource allocation
partition: batch # Slurm partition/queue
num_nodes: 1 # Number of nodes
num_nodes: 1 # Total SLURM nodes
num_instances: 1 # Independent deployment instances (HAProxy auto-enabled when > 1)
ntasks_per_node: 1 # Tasks per node
gres: gpu:8 # GPU resources
walltime: "01:00:00" # Wall time limit (HH:MM:SS)
@@ -96,6 +97,18 @@ execution:
The `gpus_per_node` parameter can be used as an alternative to `gres` for specifying GPU resources. However, `gres` is the default in the base configuration.
:::

## Multi-Instance with HAProxy

To run multiple independent deployment instances with HAProxy load-balancing:

```yaml
execution:
num_nodes: 4 # Total SLURM nodes
num_instances: 2 # 2 instances of 2 nodes each → HAProxy auto-enabled
```

When `num_instances > 1`, HAProxy is automatically configured to distribute requests across instance head nodes. See the `examples/` directory for complete configurations.
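The node grouping behind these two settings can be sketched in a few lines of Python (`split_instances` is a hypothetical helper for illustration, not part of the launcher API):

```python
def split_instances(num_nodes: int, num_instances: int) -> list[list[int]]:
    """Partition node indices into contiguous groups, one per instance.

    Mirrors the launcher's topology rule: num_nodes must be divisible
    by num_instances, and the first node of each group acts as the
    instance head that HAProxy routes requests to.
    """
    if num_nodes % num_instances != 0:
        raise ValueError(
            f"num_nodes ({num_nodes}) must be divisible by "
            f"num_instances ({num_instances})"
        )
    per_instance = num_nodes // num_instances
    return [
        list(range(i * per_instance, (i + 1) * per_instance))
        for i in range(num_instances)
    ]

# With the config above (4 nodes, 2 instances): groups [[0, 1], [2, 3]],
# so HAProxy balances across the heads at nodes 0 and 2.
groups = split_instances(4, 2)
```

An uneven split (e.g. 3 nodes, 2 instances) is rejected, matching the executor's divisibility validation.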

## Configuration Examples

### Benchmark Suite Evaluation
@@ -85,8 +85,9 @@ evaluation:

The following example configuration files are available in the `examples/` directory:

- `lepton_vllm_llama_3_1_8b_instruct.yaml` - vLLM deployment on Lepton platform
- `slurm_llama_3_1_8b_instruct.yaml` - vLLM deployment on SLURM cluster
- `slurm_llama_3_1_8b_instruct_hf.yaml` - vLLM deployment using HuggingFace model ID
- `slurm_vllm_basic.yaml` - Basic single-node vLLM deployment
- `slurm_vllm_multinode_ray_tp_pp.yaml` - Multi-node deployment with TP+PP
- `slurm_vllm_multinode_dp.yaml` - Multi-node data parallelism
- `slurm_vllm_multinode_dp_haproxy.yaml` - Multi-node independent instances with HAProxy

Use `nemo-evaluator-launcher run --dry-run` to check your configuration before running.
@@ -85,6 +85,23 @@ env_vars:

**Security:** Secret values are never written into the generated `run.sub` script. They are stored in a separate `.secrets.env` file and sourced at runtime, preventing accidental exposure in logs or artifacts.

### Multi-Node and Multi-Instance

Configure multi-node deployments using `num_nodes` and `num_instances`:

```yaml
execution:
num_nodes: 4 # Total SLURM nodes
num_instances: 2 # Independent deployment instances (default: 1)
```

- **`num_nodes`**: Total number of SLURM nodes to allocate
- **`num_instances`**: Number of independent deployment instances. When `> 1`, HAProxy is automatically configured to load-balance across instances. `num_nodes` must be divisible by `num_instances`.
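How clients reach the deployment follows from the same setting: with more than one instance, requests target the HAProxy front-end port (5009 by default) instead of the deployment server directly. A minimal sketch, assuming those defaults (`endpoint_port` is an illustrative helper, not launcher API):

```python
def endpoint_port(deployment_port: int, num_instances: int = 1,
                  haproxy_port: int = 5009) -> int:
    """Port clients should target: the HAProxy front-end when several
    independent instances are load-balanced, else the server itself."""
    return haproxy_port if num_instances > 1 else deployment_port

# Single instance: hit the vLLM server directly.
assert endpoint_port(8081) == 8081
# Multi-instance: hit HAProxy, which fans out to the instance heads.
assert endpoint_port(8081, num_instances=2) == 5009
```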

:::{note}
The deprecated `deployment.multiple_instances` field is still accepted but will be removed in a future release. Use `execution.num_instances` instead.
:::

### Mounting and Storage

The Slurm executor provides sophisticated mounting capabilities:
@@ -142,27 +142,26 @@ Show tasks in the current config. Loop until the user confirms the task list is
```
For the None (External) deployment, `api_key_name` should already be defined. The `DUMMY_API_KEY` export is handled in Step 8.

**Step 6: Advanced - Multi-node (Data Parallel)**
**Step 6: Advanced - Multi-node**

Only if model >120B parameters, suggest multi-node. Explain: "This is DP multi-node - the weights are copied (not distributed) across nodes. One deployment instance per node will be run with HAProxy load-balancing requests."
There are two multi-node patterns. Ask the user which applies:

Ask if user wants multi-node. If yes, ask for node count and configure:
**Pattern A: Multi-instance (independent instances with HAProxy)**

Suggest this only for models >120B parameters or when the user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances."

```yaml
execution:
num_nodes: 4 # 4 nodes = 4 independent deployment instances = 4x throughput
deployment:
n_tasks: ${execution.num_nodes} # Must match num_nodes for multi-instance deployment

deployment:
multiple_instances: true
num_nodes: 4 # Total nodes
num_instances: 4 # 4 independent instances → HAProxy auto-enabled
```

**Common Confusions**

- **This is different from `data_parallel_size`**, which controls DP replicas *within* a single node/deployment instance.
- Global data parallelism is `num_nodes x data_parallel_size` (e.g., 2 nodes x 4 DP each = 8 replicas for max throughput).
- With multi-node, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance.
- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas).
- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
- `num_nodes` must be divisible by `num_instances`.
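The replica arithmetic in the bullets above can be made concrete (both helpers are hypothetical illustrations, not launcher API):

```python
def global_dp_replicas(num_instances: int, data_parallel_size: int) -> int:
    """Total data-parallel replicas across all independent instances."""
    return num_instances * data_parallel_size

def per_instance_parallelism(parallelism: int, num_instances: int) -> int:
    """Rough concurrent requests each instance sees, since the task-level
    `parallelism` is global and HAProxy spreads it across instances."""
    return parallelism // num_instances

# 2 instances x 8 DP replicas each -> 16 global replicas
replicas = global_dp_replicas(2, 8)
# parallelism: 512 with 2 instances -> ~256 concurrent requests per instance
per_instance = per_instance_parallelism(512, 2)
```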

**Step 7: Advanced - Interceptors**

@@ -69,8 +69,6 @@ execution:
walltime: 01:00:00

num_nodes: 2 # Number of SLURM nodes for multi-node deployment
deployment:
n_tasks: ${execution.num_nodes} # For multi-node vLLM deployment, must match num_nodes

mounts:
mount_home: false # Whether to mount home directory (default: true)
@@ -97,7 +95,7 @@ evaluation:
parallelism: 512 # Number of parallel requests (higher for data parallel deployment)
temperature: 0.6 # Sampling temperature
top_p: 0.95 # Nucleus sampling parameter
max_tokens: 32768 # Maximum number of tokens to generate (32k)
max_new_tokens: 32768 # Maximum number of tokens to generate (32k)
request_timeout: 3600 # Timeout for API requests in seconds
target:
api_endpoint:
@@ -61,15 +61,13 @@ execution:
walltime: 01:00:00

num_nodes: 2 # Number of SLURM nodes for multi-node deployment
deployment:
n_tasks: ${execution.num_nodes} # One vLLM instance per node
num_instances: 2 # 2 independent single-node instances → HAProxy auto-enabled

mounts:
mount_home: false # Whether to mount home directory (default: true)

# Override default deployment arguments
deployment:
multiple_instances: true # Enable HAProxy load balancing across nodes
checkpoint_path: null
hf_model_handle: nvidia/NVIDIA-Nemotron-Nano-9B-v2
served_model_name: nvidia/NVIDIA-Nemotron-Nano-9B-v2
@@ -86,7 +84,7 @@ evaluation:
parallelism: 512 # Number of parallel requests (higher for multi-node deployment)
temperature: 0.6 # Sampling temperature
top_p: 0.95 # Nucleus sampling parameter
max_tokens: 32768 # Maximum number of tokens to generate (32k)
max_new_tokens: 32768 # Maximum number of tokens to generate (32k)
request_timeout: 3600 # Timeout for API requests in seconds
target:
api_endpoint:
@@ -42,9 +42,8 @@
# - tensor_parallel_size: 8 (within node parallelism)
# - pipeline_parallel_size: 2 (across node parallelism)
#
# Multi-Instance Configuration:
# - execution.num_nodes: Number of SLURM nodes to allocate (2 in this example)
# - execution.deployment.n_tasks: Must match num_nodes for multi-instance deployment
# Multi-Node Configuration:
# - execution.num_nodes: 2 (single instance spanning 2 nodes)
# - deployment.tensor_parallel_size: GPU parallelism within a single node
# - deployment.pipeline_parallel_size: Model parallelism across multiple nodes
#
@@ -67,8 +66,6 @@ execution:
walltime: 02:00:00

num_nodes: 2 # Number of SLURM nodes for multi-node deployment
deployment:
n_tasks: ${execution.num_nodes} # For multi-node ray vLLM deployment, must match num_nodes

mounts:
mount_home: false # Whether to mount home directory (default: true)
@@ -135,7 +132,7 @@ evaluation:
request_timeout: 3600 # Timeout for API requests in seconds
temperature: 0.6 # Sampling temperature
top_p: 0.95 # Nucleus sampling parameter
max_tokens: 32768 # Maximum number of tokens to generate (32k)
max_new_tokens: 32768 # Maximum number of tokens to generate (32k)

target:
api_endpoint:
@@ -339,8 +339,8 @@ def apply_url_override(url: str) -> str:
# Local executor - use localhost
endpoint_uri = cfg.deployment.endpoints[endpoint_type]

# Use HAProxy port if multiple_instances is enabled
if cfg.deployment.get("multiple_instances", False):
# Use HAProxy port if num_instances > 1
if OmegaConf.select(cfg, "execution.num_instances", default=1) > 1:
proxy_config = cfg.execution.get("proxy", {}).get("config", {})
port = proxy_config.get("haproxy_port", 5009)
else:
@@ -20,7 +20,8 @@ username: ${oc.env:USER} # Defaults to $USER env var
account: ??? # SLURM account allocation (required)
output_dir: ??? # Absolute path accessible on compute nodes (required)
partition: batch
num_nodes: 1
num_nodes: 1 # Total SLURM nodes (num_nodes_per_instance = num_nodes / num_instances)
num_instances: 1 # Number of independent deployment instances
ntasks_per_node: 1
gres: gpu:8
walltime: 01:00:00
@@ -30,7 +31,7 @@ sbatch_comment: null # Optional comment for SLURM job (translates to #SBATCH --c

# Deployment-specific SLURM configuration
deployment:
n_tasks: 1 # Number of tasks for deployment srun (default: 1, for multi-instance set to num_nodes)
n_tasks: ${execution.num_nodes} # Number of tasks for deployment srun (default: num_nodes)

mounts:
deployment: {}
@@ -29,7 +29,7 @@

import yaml
from jinja2 import Environment, FileSystemLoader
from omegaconf import DictConfig, OmegaConf
from omegaconf import DictConfig, OmegaConf, open_dict

from nemo_evaluator_launcher.common.env_vars import (
SecretsEnvResult,
@@ -145,7 +145,7 @@ def execute_eval(cfg: DictConfig, dry_run: bool = False) -> str:
)

# Create proxy config file with placeholder IPs for multi-instance deployments
if cfg.deployment.get("multiple_instances", False):
if cfg.execution.num_instances > 1:
proxy_type = cfg.execution.get("proxy", {}).get("type", "haproxy")
if proxy_type == "haproxy":
proxy_config = _generate_haproxy_config_with_placeholders(cfg)
@@ -642,6 +642,22 @@ def _create_slurm_sbatch_script(
Returns:
str: The contents of the sbatch script.
"""
# Remove deprecated deployment.multiple_instances if present
if cfg.deployment.get("multiple_instances") is not None:
logger.warning(
"deployment.multiple_instances is deprecated and will be "
"removed from config — use execution.num_instances instead."
)
with open_dict(cfg):
del cfg.deployment.multiple_instances

# Validate topology: num_nodes must be divisible by num_instances
if cfg.execution.num_nodes % cfg.execution.num_instances != 0:
raise ValueError(
f"execution.num_nodes ({cfg.execution.num_nodes}) must be divisible by "
f"execution.num_instances ({cfg.execution.num_instances})"
)

# get task from mapping, overrides, urls
tasks_mapping = load_tasks_mapping()
task_definition = get_task_definition_for_job(
@@ -780,9 +796,9 @@ def _create_slurm_sbatch_script(

# wait for the server to initialize
health_path = cfg.deployment.endpoints.get("health", "/health")
# For multi-instance check all node IPs, for single instance check localhost
if cfg.deployment.get("multiple_instances", False):
ip_list = '"${NODES_IPS_ARRAY[@]}"'
# HEAD_NODE_IPS always holds one IP per instance head; single-instance deployments just check localhost
if cfg.execution.num_instances > 1:
ip_list = '"${HEAD_NODE_IPS[@]}"'
else:
ip_list = '"127.0.0.1"'
s += _get_wait_for_server_handler(
@@ -795,7 +811,7 @@ def _create_slurm_sbatch_script(
s += "\n\n"

# add proxy load balancer for multi-instance deployments
if cfg.deployment.get("multiple_instances", False):
if cfg.execution.num_instances > 1:
s += _get_proxy_server_srun_command(cfg, remote_task_subdir)

# prepare evaluation mounts
@@ -858,7 +874,7 @@ def _create_slurm_sbatch_script(
# terminate the server after all evaluation clients finish
if cfg.deployment.type != "none":
s += "kill $SERVER_PID # terminate the server to finish gracefully\n"
if cfg.deployment.get("multiple_instances", False):
if cfg.execution.num_instances > 1:
s += "kill $PROXY_PID # terminate proxy to finish gracefully\n"
s += "\n"

@@ -1579,11 +1595,11 @@ def _generate_haproxy_config_with_placeholders(cfg):
env = Environment(loader=FileSystemLoader(template_dir))
template = env.get_template("proxy.cfg.template")

# Prepare template data with placeholder IPs - use actual number of nodes
num_nodes = cfg.execution.num_nodes
# Prepare template data with placeholder IPs - one backend per instance head node
nodes = []
for i in range(num_nodes):
nodes.append({"ip": f"{{IP_{i}}}", "port": cfg.deployment.port})
for i in range(cfg.execution.num_instances):
head_idx = i * cfg.execution.num_nodes // cfg.execution.num_instances
nodes.append({"ip": f"{{IP_{head_idx}}}", "port": cfg.deployment.port})

# Get health check parameters - prefer proxy config, fallback to deployment.endpoints.health
proxy_config = cfg.execution.get("proxy", {}).get("config", {})
@@ -1680,6 +1696,12 @@ def _generate_deployment_srun_command(
s += "# Export MASTER_IP as the first node IP\n"
s += "export MASTER_IP=${NODES_IPS_ARRAY[0]}\n"
s += 'echo "MASTER_IP: $MASTER_IP"\n'
s += 'export ALL_NODE_IPS=$(IFS=,; echo "${NODES_IPS_ARRAY[*]}")\n'
s += "HEAD_NODE_IPS=()\n"
s += f"for ((g=0; g<{cfg.execution.num_instances}; g++)); do\n"
s += f' HEAD_NODE_IPS+=("${{NODES_IPS_ARRAY[$((g * {cfg.execution.num_nodes // cfg.execution.num_instances}))]}}")\n'
s += "done\n"
s += 'echo "HEAD_NODE_IPS: ${HEAD_NODE_IPS[@]}"\n'
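The generated bash loop picks the first node of each contiguous group as an instance head; the same arithmetic as a Python sketch (`head_node_ips` is illustrative only, not part of the codebase):

```python
def head_node_ips(node_ips: list[str], num_instances: int) -> list[str]:
    """Return one IP per instance: the first node of each contiguous
    group, matching the HEAD_NODE_IPS loop emitted above."""
    per_instance = len(node_ips) // num_instances
    return [node_ips[g * per_instance] for g in range(num_instances)]

# 4 nodes split into 2 instances -> heads are nodes 0 and 2
heads = head_node_ips(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"], 2)
```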

# Add debug comment for deployment pre_cmd before srun command
if debug_comment:
@@ -1703,9 +1725,26 @@
if "MASTER_IP" not in deployment_env_var_names:
deployment_env_var_names.append("MASTER_IP")

# Always add ALL_NODE_IPS to the environment variables
if "ALL_NODE_IPS" not in deployment_env_var_names:
deployment_env_var_names.append("ALL_NODE_IPS")

if deployment_env_var_names:
s += f"--container-env {','.join(sorted(deployment_env_var_names))} "

# Build the command that runs inside the container:
# 1. Export scheduler-agnostic env vars (PROC_ID, NUM_TASKS)
# 2. Optionally write + source deployment_pre_cmd.sh
# 3. Write deployment_cmd.sh and execute it
create_script_cmd = _str_to_echo_command(
cfg.deployment.command, filename="deployment_cmd.sh"
)
debug_comment += create_script_cmd.debug + "\n\n"

# Map SLURM task variables to scheduler-agnostic names inside the container
env_setup = "export PROC_ID=${SLURM_PROCID:-0} NUM_TASKS=${SLURM_NTASKS:-1}"
script = f"{env_setup} && {create_script_cmd.cmd} && bash deployment_cmd.sh"

# Wrap deployment command to execute pre_cmd inside container if needed
if pre_cmd:
# Create a wrapper command that runs inside the container:
@@ -1715,16 +1754,13 @@
create_pre_script_cmd = _str_to_echo_command(
pre_cmd, filename="deployment_pre_cmd.sh"
)
# Escape single quotes in the deployment command for bash -c
escaped_deployment_cmd = cfg.deployment.command.replace("'", "'\"'\"'")
wrapped_command = (
f"bash -c '{create_pre_script_cmd.cmd} && "
script = (
f"{env_setup} && "
f"{create_pre_script_cmd.cmd} && "
f"source deployment_pre_cmd.sh && "
f"{escaped_deployment_cmd}'"
f"{create_script_cmd.cmd} && bash deployment_cmd.sh"
)
s += "{} &\n\n".format(wrapped_command)
else:
s += "{} &\n\n".format(cfg.deployment.command) # run asynchronously
s += "bash -c '{}' &\n\n".format(script) # run asynchronously

s += "SERVER_PID=$! # capture the PID of the server background srun process\n\n"

@@ -105,6 +105,7 @@ def test_get_endpoint_url_local_builds_localhost():
cfg = _cfg(
{
"deployment": {"type": "vllm", "port": 8081, "endpoints": {"chat": "/v1"}},
"execution": {"num_instances": 1},
"evaluation": {},
}
)