feat: Declarative multi-node Ray config and native vLLM data-parallel support#757
AdamRajfer wants to merge 11 commits into `main` from
Conversation
Add support for deploying models that span multiple nodes per instance, with multiple replicas behind HAProxy load balancing. This enables running large models (e.g., DeepSeek-R1) that require Ray tensor and pipeline parallelism across nodes, with multiple instances for throughput.

A new config field `deployment.nodes_per_instance` (default: 1) controls how many nodes each vLLM instance spans. When > 1, the launcher:

- Groups SLURM tasks into instances based on `SLURM_PROCID`
- Injects per-task variables: `INSTANCE_ID`, `INSTANCE_RANK`, `INSTANCE_MASTER_IP`, `MASTER_IP`, `NODES_PER_INSTANCE`, `NUM_INSTANCES`, `ALL_NODE_IPS`
- Configures HAProxy to route only to instance head nodes
- Runs health checks only against head nodes

The user provides a `pre_cmd` script that uses the injected variables to coordinate nodes within each instance (e.g., Ray cluster formation).

Includes example config, documentation, and unit tests.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
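The grouping of SLURM tasks into instances comes down to integer arithmetic on the task rank. A minimal sketch of that arithmetic, with illustrative values (this is not the launcher's code; the real values come from SLURM and the config):

```shell
# Sketch of the instance-grouping arithmetic (illustrative values only)
PROC_ID=5              # global task rank (SLURM_PROCID in the real setup)
NODES_PER_INSTANCE=2   # from deployment.nodes_per_instance

let "INSTANCE_ID = PROC_ID / NODES_PER_INSTANCE"    # which instance this task belongs to
let "INSTANCE_RANK = PROC_ID % NODES_PER_INSTANCE"  # rank within the instance (0 = head node)
echo "instance=$INSTANCE_ID rank=$INSTANCE_RANK"
```

Note that `let` exits nonzero when the expression evaluates to 0, so under `set -e` a `|| true` guard is needed in the general case.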
/ok to test c0106bf
```yaml
data_parallel_size: 1
port: 8000
extra_args: "--disable-custom-all-reduce --distributed-executor-backend ray --enforce-eager"
pre_cmd: |
```
Nice!
suggestion: Can we make the injected `pre_cmd` script less verbose and more automatic, to avoid copy-pasting? That's a tad too much verbosity even for our already verbose config, IMO. Scripts like `slurm_ray.sh` from NeMo-Run can serve as inspiration.

suggestion: Add more logging output to that script to show the parameters and configs of the launched scripts.
```python
)
if not cfg.deployment.get("multiple_instances", False):
    raise ValueError(
        "nodes_per_instance > 1 requires deployment.multiple_instances=True"
    )
n_tasks = cfg.execution.get("deployment", {}).get("n_tasks", 1)
if n_tasks != num_nodes:
    raise ValueError(
        f"nodes_per_instance > 1 requires execution.deployment.n_tasks "
```
Must the user set it? (It should be automatic.)
Different multi-node deployments require different values for it, e.g. it differs between TRT-LLM and vLLM; sometimes it is the number of GPUs.
Thanks for the comment. Please check if it looks good now. I am validating this field only for vLLM now.
```yaml
pipeline_parallel_size: 2
data_parallel_size: 1
port: 8000
extra_args: "--disable-custom-all-reduce --distributed-executor-backend ray --enforce-eager"
```
Users have complained in the past about the `--enforce-eager` flag, as it hurts performance. Is it required? If so, please add a comment indicating it's needed (ideally with a short explanation why). If not, let's remove it.
```yaml
top_p: 0.95
max_new_tokens: 32768
tasks:
  - name: gsm8k_cot_instruct
```
Don't we need to set up adapter server with reasoning interceptor?
Yes, it is already configured — see `evaluation.nemo_evaluator_config.target.api_endpoint.adapter_config.process_reasoning_traces: true` (lines 107-113).
```python
s += "{} &\n\n".format(cfg.deployment.command)  # run asynchronously
if instance_prefix:
    # Wrap in bash -c to inject instance vars
    escaped_deployment_cmd = cfg.deployment.command.replace("'", "'\"'\"'")
```
I don't understand this replacement, are you sure it's correct?
Yes, this is correct: it's a standard bash idiom for escaping single quotes inside a single-quoted string. Since the whole command is wrapped in `bash -c '...'`, any single quote inside the deployment command would break the quoting. `.replace("'", "'\"'\"'")` works by:

- ending the current single-quoted string: `'`
- appending a double-quoted literal single quote: `"'"`
- resuming the single-quoted string: `'`

So, for example, if the deployment command contained `it's`, it becomes `it'"'"'s`, which bash concatenates back into `it's`. This is equivalent to the commonly seen `'\''` pattern but uses double-quoting for the escaped quote instead.
Anyway, please ignore this change; I introduced a broader redesign and it no longer applies.
…multi-node-multi-instance

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Cover all code paths in `_resolve_multi_node_config`, Ray auto-injection, HAProxy auto-enablement, and `n_tasks` auto-set for vLLM, with 39 new tests.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
/ok to test 3bd8150
Resolved review threads (outdated) on:
- packages/nemo-evaluator-launcher/examples/slurm_vllm_multinode_dp.yaml
- ...ges/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/execution/slurm/default.yaml
```python
if cfg.execution.num_nodes_per_instance > 1:
    if cfg.deployment.get("type") != "vllm":
        raise ValueError(
            f"Multi-node (num_nodes_per_instance > 1) is only supported for "
            f"vLLM deployments, got deployment.type={cfg.deployment.get('type')}"
        )
```
The echo/base64 pipe in the ray_setup injection was parsed by the batch shell instead of running inside the srun container, causing the server to dump raw base64 to logs and never start Ray or vLLM.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
/ok to test b863910
0.95 causes CUDA OOM during vLLM sampler warmup on 2x8 H100 nodes.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
…_ray config

Replace ray_setup.sh.j2 template with a self-contained vllm_ray.yaml deployment config. Split vllm.yaml command into base_command + command to allow vllm_ray to override only the command while reusing the base vllm serve invocation.

Add num_instances as the user-facing field for multi-instance deployments, replacing the old deployment.multiple_instances boolean. HAProxy is auto-enabled when num_instances > 1. Default n_tasks to num_nodes in the base slurm config. Parameterize ray_compiled_dag_channel_type in vllm_ray.yaml (default: shm). Always base64-encode deployment commands for safe srun transport.

Update examples, docs, and skills to reflect the new configuration patterns.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
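The base64 transport for deployment commands can be sketched as a generic round-trip: a multi-line script becomes a single opaque token that passes safely through srun argument handling, then is decoded and executed on the other side. A standalone illustration (assuming a GNU-style `base64`; the launcher's actual encode/decode points may differ):

```shell
# Sketch: ship a multi-line command as one token, decode and run it remotely
cmd='echo step one
echo step two'
encoded=$(printf %s "$cmd" | base64 | tr -d '\n')  # one line, safe to embed in an srun invocation
printf %s "$encoded" | base64 -d | bash            # "remote" side: decode and execute
```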
/ok to test bd9ba75
…yaml

OmegaConf rejects $() and $(()) syntax. Replace with backticks for command substitution and `let` for arithmetic to avoid parse errors.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
/ok to test d502d27
@AdamRajfer, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
The evaluator validates params and rejects 'max_tokens'. The correct parameter name is 'max_new_tokens'.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
/ok to test 7557d47
```markdown
The `vllm_ray` config automatically handles head/worker node coordination — no `pre_cmd` is needed.
```
do we need this comment about pre_cmd?
Removed, good catch.
```shell
RANK="$SLURM_PROCID"
EXPECTED_NODES="$SLURM_NTASKS"
```
Can we make these env names SLURM-agnostic? Q: why do we need all of these variables? I think we just need MASTER_IP and $SLURM_PROCID (which can be renamed to PROC_ID or something).
Done! The script now maps SLURM vars to generic names at the top (PROC_ID=$SLURM_PROCID, NUM_TASKS=$SLURM_NTASKS) and uses the generic names throughout. Re ALL_NODE_IPS — we still need it in multi-instance mode to compute per-instance head IPs, so it cannot be dropped entirely. Note: cannot use ${VAR:-default} bash syntax due to OmegaConf restrictions, so the aliases are simple assignments (SLURM always provides these vars in srun context).
- Remove --enforce-eager from ray example (marta-sd)
- Remove unnecessary pre_cmd mention from slurm.md (AWarno)
- Make vllm_ray.yaml env vars scheduler-agnostic by aliasing SLURM_PROCID/SLURM_NTASKS to PROC_ID/NUM_TASKS (AWarno)

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
…lti-instance

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>

# Conflicts:
#	packages/nemo-evaluator-launcher/examples/slurm_vllm_multinode_ray_tp_pp.yaml
#	packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/execution/slurm/default.yaml
```shell
# Map SLURM variables to scheduler-agnostic names
PROC_ID=$SLURM_PROCID
NUM_TASKS=$SLURM_NTASKS
```
What I mean is that we should export these values to agnostic variables on the Slurm side before reaching this script, since they do not change much. This script should not be tightly coupled to Slurm.
this is the original script: https://github.com/vllm-project/vllm/blob/main/examples/online_serving/multi-node-serving.sh
Good point — moved the mapping to the SLURM executor (executor.py). It now exports PROC_ID and NUM_TASKS inside the bash -c container wrapper before running the deployment script. The vllm_ray script no longer references any SLURM variables.
Thanks for the reference. Done — the mapping now lives in executor.py and the vllm_ray script is SLURM-free.
```shell
NUM_NODES=${execution.num_nodes}
NUM_INSTANCES=${execution.num_instances}
let "NODES_PER_INSTANCE = NUM_NODES / NUM_INSTANCES"
```
How is multi-node, multi-instance deployment handled without Ray? Right now, this script is tied to Slurm, but it doesn't have to be. This Ray setup mechanism could be made common for any multi-node environment.
The vllm_ray script is now fully scheduler-agnostic — it only uses PROC_ID, NUM_TASKS, MASTER_IP, and ALL_NODE_IPS, all exported by the executor. For multi-node multi-instance without Ray, users override deployment.command with custom logic (see slurm_vllm_multinode_dp.yaml). If we add non-SLURM executors with multi-node support in the future, they would just need to export the same generic variables and the vllm_ray script would work as-is.
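For reference, deriving a per-instance head IP from the generic variables is a few lines of bash. A sketch with made-up values (the actual vllm_ray script may compute this differently):

```shell
# Sketch: pick this instance's head node IP out of ALL_NODE_IPS (illustrative values)
ALL_NODE_IPS="10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4"
PROC_ID=3
NODES_PER_INSTANCE=2

let "INSTANCE_ID = PROC_ID / NODES_PER_INSTANCE"
let "HEAD_INDEX = INSTANCE_ID * NODES_PER_INSTANCE" || true
set -- $ALL_NODE_IPS        # word-split the IP list into positional parameters
shift "$HEAD_INDEX"         # drop the IPs of earlier instances
INSTANCE_MASTER_IP=$1       # first IP of this instance's node group
echo "$INSTANCE_MASTER_IP"
```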
…ecutor

PROC_ID and NUM_TASKS are now exported by the SLURM executor inside the container wrapper (bash -c), so the vllm_ray deployment script is fully scheduler-agnostic and no longer references SLURM_PROCID/SLURM_NTASKS.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Summary

- `vllm_ray.yaml`: Move the Ray cluster bootstrap script from a Jinja template (`ray_setup.sh.j2`) + programmatic injection into a standalone Hydra config (`deployment/vllm_ray.yaml`). The full bash script lives in the `command` field using OmegaConf-safe syntax (backticks for command substitution, `let` for arithmetic).
- `base_command` in `vllm.yaml`: Extract the `vllm serve ...` invocation into a reusable `base_command` key so `vllm_ray.yaml` can reference it via `${deployment.base_command}`.
- New example `slurm_vllm_multinode_dp.yaml` showing multi-node deployment using vLLM's built-in data parallelism (no Ray required).
- Remove `ray_setup.sh.j2` and the programmatic Ray injection block in `executor.py`. Multi-line commands are now handled generically via base64 encoding.
- Update `slurm.md` to cover both non-Ray (data parallelism) and Ray (tensor/pipeline parallelism) approaches.

OmegaConf compatibility

OmegaConf rejects `$(...)` and `$((...))` in YAML strings. The Ray script uses:

- `` `cmd` `` instead of `$(cmd)` for command substitution
- `let "x = a / b"` instead of `$((a / b))` for arithmetic
- `|| true` on `let` statements that can evaluate to 0 (to avoid `set -e` failures)

Test plan
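The listed substitutions behave like this in bash (a standalone illustration of the idioms, not the config itself):

```shell
set -e
NUM_NODES=4
NUM_INSTANCES=2
# let instead of $(( ... )); || true guards expressions that evaluate to 0,
# since let returns a nonzero exit status for a zero result under set -e
let "NODES_PER_INSTANCE = NUM_NODES / NUM_INSTANCES" || true
let "REMAINDER = NUM_NODES % NUM_INSTANCES" || true   # evaluates to 0: would exit without the guard
HOST=`hostname`                                       # backticks instead of $(cmd)
echo "nodes_per_instance=$NODES_PER_INSTANCE remainder=$REMAINDER host=$HOST"
```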