15 changes: 14 additions & 1 deletion docs/deployment/launcher-orchestrated/slurm.md
Original file line number Diff line number Diff line change
@@ -77,7 +77,8 @@ execution:

# Resource allocation
partition: batch # Slurm partition/queue
num_nodes: 1 # Number of nodes
num_nodes: 1 # Total SLURM nodes
num_instances: 1 # Independent deployment instances (HAProxy auto-enabled when > 1)
ntasks_per_node: 1 # Tasks per node
gres: gpu:8 # GPU resources
walltime: "01:00:00" # Wall time limit (HH:MM:SS)
@@ -96,6 +97,18 @@ execution:
The `gpus_per_node` parameter can be used as an alternative to `gres` for specifying GPU resources. However, `gres` is the default in the base configuration.
:::

## Multi-Instance with HAProxy

To run multiple independent deployment instances with HAProxy load-balancing:

```yaml
execution:
num_nodes: 4 # Total SLURM nodes
num_instances: 2 # 2 instances of 2 nodes each → HAProxy auto-enabled
```

When `num_instances > 1`, HAProxy is automatically configured to distribute requests across instance head nodes. See the `examples/` directory for complete configurations.
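The node grouping behind these two settings can be sketched in a few lines of Python (`split_instances` is a hypothetical helper for illustration, not part of the launcher API):

```python
def split_instances(num_nodes: int, num_instances: int) -> list[list[int]]:
    """Partition node indices into contiguous groups, one per instance.

    Mirrors the launcher's topology rule: num_nodes must be divisible
    by num_instances, and the first node of each group acts as the
    instance head that HAProxy routes requests to.
    """
    if num_nodes % num_instances != 0:
        raise ValueError(
            f"num_nodes ({num_nodes}) must be divisible by "
            f"num_instances ({num_instances})"
        )
    per_instance = num_nodes // num_instances
    return [
        list(range(i * per_instance, (i + 1) * per_instance))
        for i in range(num_instances)
    ]

# With the config above (4 nodes, 2 instances): groups [[0, 1], [2, 3]],
# so HAProxy balances across the heads at nodes 0 and 2.
groups = split_instances(4, 2)
```

An uneven split (e.g. 3 nodes, 2 instances) is rejected, matching the executor's divisibility validation.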

## Configuration Examples

### Benchmark Suite Evaluation
@@ -85,8 +85,9 @@ evaluation:

The following example configuration files are available in the `examples/` directory:

- `lepton_vllm_llama_3_1_8b_instruct.yaml` - vLLM deployment on Lepton platform
- `slurm_llama_3_1_8b_instruct.yaml` - vLLM deployment on SLURM cluster
- `slurm_llama_3_1_8b_instruct_hf.yaml` - vLLM deployment using HuggingFace model ID
- `slurm_vllm_basic.yaml` - Basic single-node vLLM deployment
- `slurm_vllm_multinode_ray_tp_pp.yaml` - Multi-node deployment with TP+PP
- `slurm_vllm_multinode_dp.yaml` - Multi-node data parallelism
- `slurm_vllm_multinode_dp_haproxy.yaml` - Multi-node independent instances with HAProxy

Use `nemo-evaluator-launcher run --dry-run` to check your configuration before running.
@@ -85,6 +85,23 @@ env_vars:

**Security:** Secret values are never written into the generated `run.sub` script. They are stored in a separate `.secrets.env` file and sourced at runtime, preventing accidental exposure in logs or artifacts.

### Multi-Node and Multi-Instance

Configure multi-node deployments using `num_nodes` and `num_instances`:

```yaml
execution:
num_nodes: 4 # Total SLURM nodes
num_instances: 2 # Independent deployment instances (default: 1)
```

- **`num_nodes`**: Total number of SLURM nodes to allocate
- **`num_instances`**: Number of independent deployment instances. When `> 1`, HAProxy is automatically configured to load-balance across instances. `num_nodes` must be divisible by `num_instances`.
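How clients reach the deployment follows from the same setting: with more than one instance, requests target the HAProxy front-end port (5009 by default) instead of the deployment server directly. A minimal sketch, assuming those defaults (`endpoint_port` is an illustrative helper, not launcher API):

```python
def endpoint_port(deployment_port: int, num_instances: int = 1,
                  haproxy_port: int = 5009) -> int:
    """Port clients should target: the HAProxy front-end when several
    independent instances are load-balanced, else the server itself."""
    return haproxy_port if num_instances > 1 else deployment_port

# Single instance: hit the vLLM server directly.
assert endpoint_port(8081) == 8081
# Multi-instance: hit HAProxy, which fans out to the instance heads.
assert endpoint_port(8081, num_instances=2) == 5009
```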

:::{note}
The deprecated `deployment.multiple_instances` field is still accepted but will be removed in a future release. Use `execution.num_instances` instead.
:::

### Mounting and Storage

The Slurm executor provides sophisticated mounting capabilities:
@@ -142,27 +142,26 @@ Show tasks in the current config. Loop until the user confirms the task list is
```
For the None (External) deployment, `api_key_name` should already be defined. The `DUMMY_API_KEY` export is handled in Step 8.

**Step 6: Advanced - Multi-node (Data Parallel)**
**Step 6: Advanced - Multi-node**

Only if model >120B parameters, suggest multi-node. Explain: "This is DP multi-node - the weights are copied (not distributed) across nodes. One deployment instance per node will be run with HAProxy load-balancing requests."
There are two multi-node patterns. Ask the user which applies:

Ask if user wants multi-node. If yes, ask for node count and configure:
**Pattern A: Multi-instance (independent instances with HAProxy)**

Suggest this only for models >120B parameters or when the user wants more throughput. Explain: "Each node runs an independent deployment instance. HAProxy load-balances requests across all instances."

```yaml
execution:
num_nodes: 4 # 4 nodes = 4 independent deployment instances = 4x throughput
deployment:
n_tasks: ${execution.num_nodes} # Must match num_nodes for multi-instance deployment

deployment:
multiple_instances: true
num_nodes: 4 # Total nodes
num_instances: 4 # 4 independent instances → HAProxy auto-enabled
```

**Common Confusions**

- **This is different from `data_parallel_size`**, which controls DP replicas *within* a single node/deployment instance.
- Global data parallelism is `num_nodes x data_parallel_size` (e.g., 2 nodes x 4 DP each = 8 replicas for max throughput).
- With multi-node, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
- **`num_instances`** controls independent deployment instances with HAProxy. **`data_parallel_size`** controls DP replicas *within* a single instance.
- Global data parallelism is `num_instances x data_parallel_size` (e.g., 2 instances x 8 DP each = 16 replicas).
- With multi-instance, `parallelism` in task config is the total concurrent requests across all instances, not per-instance.
- `num_nodes` must be divisible by `num_instances`.
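The replica arithmetic in the bullets above can be made concrete (both helpers are hypothetical illustrations, not launcher API):

```python
def global_dp_replicas(num_instances: int, data_parallel_size: int) -> int:
    """Total data-parallel replicas across all independent instances."""
    return num_instances * data_parallel_size

def per_instance_parallelism(parallelism: int, num_instances: int) -> int:
    """Rough concurrent requests each instance sees, since the task-level
    `parallelism` is global and HAProxy spreads it across instances."""
    return parallelism // num_instances

# 2 instances x 8 DP replicas each -> 16 global replicas
replicas = global_dp_replicas(2, 8)
# parallelism: 512 with 2 instances -> ~256 concurrent requests per instance
per_instance = per_instance_parallelism(512, 2)
```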

**Step 7: Advanced - Interceptors**

@@ -69,8 +69,6 @@ execution:
walltime: 01:00:00

num_nodes: 2 # Number of SLURM nodes for multi-node deployment
deployment:
n_tasks: ${execution.num_nodes} # For multi-node vLLM deployment, must match num_nodes

mounts:
mount_home: false # Whether to mount home directory (default: true)
@@ -97,7 +95,7 @@ evaluation:
parallelism: 512 # Number of parallel requests (higher for data parallel deployment)
temperature: 0.6 # Sampling temperature
top_p: 0.95 # Nucleus sampling parameter
max_tokens: 32768 # Maximum number of tokens to generate (32k)
max_new_tokens: 32768 # Maximum number of tokens to generate (32k)
request_timeout: 3600 # Timeout for API requests in seconds
target:
api_endpoint:
@@ -61,15 +61,13 @@ execution:
walltime: 01:00:00

num_nodes: 2 # Number of SLURM nodes for multi-node deployment
deployment:
n_tasks: ${execution.num_nodes} # One vLLM instance per node
num_instances: 2 # 2 independent single-node instances → HAProxy auto-enabled

mounts:
mount_home: false # Whether to mount home directory (default: true)

# Override default deployment arguments
deployment:
multiple_instances: true # Enable HAProxy load balancing across nodes
checkpoint_path: null
hf_model_handle: nvidia/NVIDIA-Nemotron-Nano-9B-v2
served_model_name: nvidia/NVIDIA-Nemotron-Nano-9B-v2
@@ -86,7 +84,7 @@ evaluation:
parallelism: 512 # Number of parallel requests (higher for multi-node deployment)
temperature: 0.6 # Sampling temperature
top_p: 0.95 # Nucleus sampling parameter
max_tokens: 32768 # Maximum number of tokens to generate (32k)
max_new_tokens: 32768 # Maximum number of tokens to generate (32k)
request_timeout: 3600 # Timeout for API requests in seconds
target:
api_endpoint:
@@ -42,9 +42,8 @@
# - tensor_parallel_size: 8 (within node parallelism)
# - pipeline_parallel_size: 2 (across node parallelism)
#
# Multi-Instance Configuration:
# - execution.num_nodes: Number of SLURM nodes to allocate (2 in this example)
# - execution.deployment.n_tasks: Must match num_nodes for multi-instance deployment
# Multi-Node Configuration:
# - execution.num_nodes: 2 (single instance spanning 2 nodes)
# - deployment.tensor_parallel_size: GPU parallelism within a single node
# - deployment.pipeline_parallel_size: Model parallelism across multiple nodes
#
@@ -67,8 +66,6 @@ execution:
walltime: 02:00:00

num_nodes: 2 # Number of SLURM nodes for multi-node deployment
deployment:
n_tasks: ${execution.num_nodes} # For multi-node ray vLLM deployment, must match num_nodes

mounts:
mount_home: false # Whether to mount home directory (default: true)
@@ -135,7 +132,7 @@ evaluation:
request_timeout: 3600 # Timeout for API requests in seconds
temperature: 0.6 # Sampling temperature
top_p: 0.95 # Nucleus sampling parameter
max_tokens: 32768 # Maximum number of tokens to generate (32k)
max_new_tokens: 32768 # Maximum number of tokens to generate (32k)

target:
api_endpoint:
@@ -339,8 +339,8 @@ def apply_url_override(url: str) -> str:
# Local executor - use localhost
endpoint_uri = cfg.deployment.endpoints[endpoint_type]

# Use HAProxy port if multiple_instances is enabled
if cfg.deployment.get("multiple_instances", False):
# Use HAProxy port if num_instances > 1
if OmegaConf.select(cfg, "execution.num_instances", default=1) > 1:
proxy_config = cfg.execution.get("proxy", {}).get("config", {})
port = proxy_config.get("haproxy_port", 5009)
else:
@@ -20,7 +20,8 @@ username: ${oc.env:USER} # Defaults to $USER env var
account: ??? # SLURM account allocation (required)
output_dir: ??? # Absolute path accessible on compute nodes (required)
partition: batch
num_nodes: 1
num_nodes: 1 # Total SLURM nodes (num_nodes_per_instance = num_nodes / num_instances)
num_instances: 1 # Number of independent deployment instances
ntasks_per_node: 1
gres: gpu:8
walltime: 01:00:00
@@ -30,7 +31,7 @@ sbatch_comment: null # Optional comment for SLURM job (translates to #SBATCH --c

# Deployment-specific SLURM configuration
deployment:
n_tasks: 1 # Number of tasks for deployment srun (default: 1, for multi-instance set to num_nodes)
n_tasks: ${execution.num_nodes} # Number of tasks for deployment srun (default: num_nodes)

mounts:
deployment: {}
@@ -29,7 +29,7 @@

import yaml
from jinja2 import Environment, FileSystemLoader
from omegaconf import DictConfig, OmegaConf
from omegaconf import DictConfig, OmegaConf, open_dict

from nemo_evaluator_launcher.common.env_vars import (
SecretsEnvResult,
@@ -145,7 +145,7 @@ def execute_eval(cfg: DictConfig, dry_run: bool = False) -> str:
)

# Create proxy config file with placeholder IPs for multi-instance deployments
if cfg.deployment.get("multiple_instances", False):
if cfg.execution.num_instances > 1:
proxy_type = cfg.execution.get("proxy", {}).get("type", "haproxy")
if proxy_type == "haproxy":
proxy_config = _generate_haproxy_config_with_placeholders(cfg)
@@ -642,6 +642,22 @@ def _create_slurm_sbatch_script(
Returns:
str: The contents of the sbatch script.
"""
# Remove deprecated deployment.multiple_instances if present
if cfg.deployment.get("multiple_instances") is not None:
logger.warning(
"deployment.multiple_instances is deprecated and will be "
"removed from config — use execution.num_instances instead."
)
with open_dict(cfg):
del cfg.deployment.multiple_instances

# Validate topology: num_nodes must be divisible by num_instances
if cfg.execution.num_nodes % cfg.execution.num_instances != 0:
raise ValueError(
f"execution.num_nodes ({cfg.execution.num_nodes}) must be divisible by "
f"execution.num_instances ({cfg.execution.num_instances})"
)

# get task from mapping, overrides, urls
tasks_mapping = load_tasks_mapping()
task_definition = get_task_definition_for_job(
@@ -780,9 +796,9 @@ def _create_slurm_sbatch_script(

# wait for the server to initialize
health_path = cfg.deployment.endpoints.get("health", "/health")
# For multi-instance check all node IPs, for single instance check localhost
if cfg.deployment.get("multiple_instances", False):
ip_list = '"${NODES_IPS_ARRAY[@]}"'
# HEAD_NODE_IPS always holds one IP per instance head; single-instance deployments just check localhost
if cfg.execution.num_instances > 1:
ip_list = '"${HEAD_NODE_IPS[@]}"'
else:
ip_list = '"127.0.0.1"'
s += _get_wait_for_server_handler(
@@ -795,7 +811,7 @@ def _create_slurm_sbatch_script(
s += "\n\n"

# add proxy load balancer for multi-instance deployments
if cfg.deployment.get("multiple_instances", False):
if cfg.execution.num_instances > 1:
s += _get_proxy_server_srun_command(cfg, remote_task_subdir)

# prepare evaluation mounts
@@ -858,7 +874,7 @@ def _create_slurm_sbatch_script(
# terminate the server after all evaluation clients finish
if cfg.deployment.type != "none":
s += "kill $SERVER_PID # terminate the server to finish gracefully\n"
if cfg.deployment.get("multiple_instances", False):
if cfg.execution.num_instances > 1:
s += "kill $PROXY_PID # terminate proxy to finish gracefully\n"
s += "\n"

@@ -1579,11 +1595,11 @@ def _generate_haproxy_config_with_placeholders(cfg):
env = Environment(loader=FileSystemLoader(template_dir))
template = env.get_template("proxy.cfg.template")

# Prepare template data with placeholder IPs - use actual number of nodes
num_nodes = cfg.execution.num_nodes
# Prepare template data with placeholder IPs - one backend per instance head node
nodes = []
for i in range(num_nodes):
nodes.append({"ip": f"{{IP_{i}}}", "port": cfg.deployment.port})
for i in range(cfg.execution.num_instances):
head_idx = i * cfg.execution.num_nodes // cfg.execution.num_instances
nodes.append({"ip": f"{{IP_{head_idx}}}", "port": cfg.deployment.port})

# Get health check parameters - prefer proxy config, fallback to deployment.endpoints.health
proxy_config = cfg.execution.get("proxy", {}).get("config", {})
@@ -1680,6 +1696,12 @@ def _generate_deployment_srun_command(
s += "# Export MASTER_IP as the first node IP\n"
s += "export MASTER_IP=${NODES_IPS_ARRAY[0]}\n"
s += 'echo "MASTER_IP: $MASTER_IP"\n'
s += 'export ALL_NODE_IPS=$(IFS=,; echo "${NODES_IPS_ARRAY[*]}")\n'
s += "HEAD_NODE_IPS=()\n"
s += f"for ((g=0; g<{cfg.execution.num_instances}; g++)); do\n"
s += f' HEAD_NODE_IPS+=("${{NODES_IPS_ARRAY[$((g * {cfg.execution.num_nodes // cfg.execution.num_instances}))]}}")\n'
s += "done\n"
s += 'echo "HEAD_NODE_IPS: ${HEAD_NODE_IPS[@]}"\n'
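The generated bash loop picks the first node of each contiguous group as an instance head; the same arithmetic as a Python sketch (`head_node_ips` is illustrative only, not part of the codebase):

```python
def head_node_ips(node_ips: list[str], num_instances: int) -> list[str]:
    """Return one IP per instance: the first node of each contiguous
    group, matching the HEAD_NODE_IPS loop emitted above."""
    per_instance = len(node_ips) // num_instances
    return [node_ips[g * per_instance] for g in range(num_instances)]

# 4 nodes split into 2 instances -> heads are nodes 0 and 2
heads = head_node_ips(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"], 2)
```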

# Add debug comment for deployment pre_cmd before srun command
if debug_comment:
@@ -1703,9 +1725,26 @@
if "MASTER_IP" not in deployment_env_var_names:
deployment_env_var_names.append("MASTER_IP")

# Always add ALL_NODE_IPS to the environment variables
if "ALL_NODE_IPS" not in deployment_env_var_names:
deployment_env_var_names.append("ALL_NODE_IPS")

if deployment_env_var_names:
s += f"--container-env {','.join(sorted(deployment_env_var_names))} "

# Build the command that runs inside the container:
# 1. Export scheduler-agnostic env vars (PROC_ID, NUM_TASKS)
# 2. Optionally write + source deployment_pre_cmd.sh
# 3. Write deployment_cmd.sh and execute it
create_script_cmd = _str_to_echo_command(
cfg.deployment.command, filename="deployment_cmd.sh"
)
debug_comment += create_script_cmd.debug + "\n\n"

# Map SLURM task variables to scheduler-agnostic names inside the container
env_setup = "export PROC_ID=${SLURM_PROCID:-0} NUM_TASKS=${SLURM_NTASKS:-1}"
script = f"{env_setup} && {create_script_cmd.cmd} && bash deployment_cmd.sh"

# Wrap deployment command to execute pre_cmd inside container if needed
if pre_cmd:
# Create a wrapper command that runs inside the container:
@@ -1715,16 +1754,13 @@
create_pre_script_cmd = _str_to_echo_command(
pre_cmd, filename="deployment_pre_cmd.sh"
)
# Escape single quotes in the deployment command for bash -c
escaped_deployment_cmd = cfg.deployment.command.replace("'", "'\"'\"'")
wrapped_command = (
f"bash -c '{create_pre_script_cmd.cmd} && "
script = (
f"{env_setup} && "
f"{create_pre_script_cmd.cmd} && "
f"source deployment_pre_cmd.sh && "
f"{escaped_deployment_cmd}'"
f"{create_script_cmd.cmd} && bash deployment_cmd.sh"
)
s += "{} &\n\n".format(wrapped_command)
else:
s += "{} &\n\n".format(cfg.deployment.command) # run asynchronously
s += "bash -c '{}' &\n\n".format(script) # run asynchronously

s += "SERVER_PID=$! # capture the PID of the server background srun process\n\n"

@@ -105,6 +105,7 @@ def test_get_endpoint_url_local_builds_localhost():
cfg = _cfg(
{
"deployment": {"type": "vllm", "port": 8081, "endpoints": {"chat": "/v1"}},
"execution": {"num_instances": 1},
"evaluation": {},
}
)