feat: Declarative multi-node Ray config and native vLLM data-parallel support#757
AdamRajfer wants to merge 11 commits into `main` from
Conversation
Add support for deploying models that span multiple nodes per instance, with multiple replicas behind HAProxy load balancing. This enables running large models (e.g., DeepSeek-R1) that require Ray tensor and pipeline parallelism across nodes, with multiple instances for throughput.

A new config field `deployment.nodes_per_instance` (default: 1) controls how many nodes each vLLM instance spans. When > 1, the launcher:

- Groups SLURM tasks into instances based on `SLURM_PROCID`
- Injects per-task variables: `INSTANCE_ID`, `INSTANCE_RANK`, `INSTANCE_MASTER_IP`, `MASTER_IP`, `NODES_PER_INSTANCE`, `NUM_INSTANCES`, `ALL_NODE_IPS`
- Configures HAProxy to route only to instance head nodes
- Runs health checks only against head nodes

The user provides a `pre_cmd` script that uses the injected variables to coordinate nodes within each instance (e.g., Ray cluster formation).

Includes example config, documentation, and unit tests.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
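The grouping of SLURM tasks into instances comes down to integer arithmetic on the task rank. A minimal sketch of that arithmetic, with illustrative values (this is not the launcher's code; the real values come from SLURM and the config):

```shell
# Sketch of the instance-grouping arithmetic (illustrative values only)
PROC_ID=5              # global task rank (SLURM_PROCID in the real setup)
NODES_PER_INSTANCE=2   # from deployment.nodes_per_instance

let "INSTANCE_ID = PROC_ID / NODES_PER_INSTANCE"    # which instance this task belongs to
let "INSTANCE_RANK = PROC_ID % NODES_PER_INSTANCE"  # rank within the instance (0 = head node)
echo "instance=$INSTANCE_ID rank=$INSTANCE_RANK"
```

Note that `let` exits nonzero when the expression evaluates to 0, so under `set -e` a `|| true` guard is needed in the general case.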
/ok to test c0106bf
```yaml
data_parallel_size: 1
port: 8000
extra_args: "--disable-custom-all-reduce --distributed-executor-backend ray --enforce-eager"
pre_cmd: |
```
Nice!
suggestion: Can we make the injected `pre_cmd` script less verbose and more automatic, to avoid copy-pasting? That's a tad too much verbosity even for our already verbose config, IMO. Scripts like `slurm_ray.sh` from NeMo-Run can serve as inspiration.

suggestion: Add more logging output to that script to show the parameters and configs of the launched scripts.
```python
)
if not cfg.deployment.get("multiple_instances", False):
    raise ValueError(
        "nodes_per_instance > 1 requires deployment.multiple_instances=True"
    )
n_tasks = cfg.execution.get("deployment", {}).get("n_tasks", 1)
if n_tasks != num_nodes:
    raise ValueError(
        f"nodes_per_instance > 1 requires execution.deployment.n_tasks "
```
Must the user set it? (It should be automatic.)
Different multi-node deployments require different values for it, e.g. it differs between TRT-LLM and vLLM; sometimes it is the number of GPUs.
Thanks for the comment. Please check if it looks good now. I am validating this field only for vLLM now.
```yaml
pipeline_parallel_size: 2
data_parallel_size: 1
port: 8000
extra_args: "--disable-custom-all-reduce --distributed-executor-backend ray --enforce-eager"
```
Users have complained in the past about the `--enforce-eager` flag, as it hurts performance. Is it required? If so, please add a comment indicating it's needed (ideally with a short explanation why). If not, let's remove it.
```yaml
top_p: 0.95
max_new_tokens: 32768
tasks:
  - name: gsm8k_cot_instruct
```
Don't we need to set up adapter server with reasoning interceptor?
Yes, it is already configured — see `evaluation.nemo_evaluator_config.target.api_endpoint.adapter_config.process_reasoning_traces: true` (lines 107-113).
```python
s += "{} &\n\n".format(cfg.deployment.command)  # run asynchronously
if instance_prefix:
    # Wrap in bash -c to inject instance vars
    escaped_deployment_cmd = cfg.deployment.command.replace("'", "'\"'\"'")
```
I don't understand this replacement, are you sure it's correct?
Yes, this is correct: it's a standard bash idiom for escaping single quotes inside a single-quoted string. Since the whole command is wrapped in `bash -c '...'`, any single quote inside the deployment command would break the quoting. `.replace("'", "'\"'\"'")` works by:

- ending the current single-quoted string: `'`
- appending a double-quoted literal single quote: `"'"`
- resuming the single-quoted string: `'`

So, for example, if the deployment command contained `it's`, it becomes `it'"'"'s`, which bash concatenates back into `it's`. This is equivalent to the commonly seen `'\''` pattern but uses double-quoting for the escaped quote instead.
Anyway, please ignore this change; I introduced a broader redesign and it no longer applies.
…multi-node-multi-instance

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Cover all code paths in `_resolve_multi_node_config`, Ray auto-injection, HAProxy auto-enablement, and `n_tasks` auto-set for vLLM, with 39 new tests.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
/ok to test 3bd8150
Resolved review threads (outdated) on:
- packages/nemo-evaluator-launcher/examples/slurm_vllm_multinode_dp.yaml
- ...ges/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/execution/slurm/default.yaml
```python
if cfg.execution.num_nodes_per_instance > 1:
    if cfg.deployment.get("type") != "vllm":
        raise ValueError(
            f"Multi-node (num_nodes_per_instance > 1) is only supported for "
            f"vLLM deployments, got deployment.type={cfg.deployment.get('type')}"
        )
```
The echo/base64 pipe in the ray_setup injection was parsed by the batch shell instead of running inside the srun container, causing the server to dump raw base64 to logs and never start Ray or vLLM.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
/ok to test b863910
0.95 causes CUDA OOM during vLLM sampler warmup on 2x8 H100 nodes.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
…_ray config

Replace ray_setup.sh.j2 template with a self-contained vllm_ray.yaml deployment config. Split vllm.yaml command into base_command + command to allow vllm_ray to override only the command while reusing the base vllm serve invocation.

Add num_instances as the user-facing field for multi-instance deployments, replacing the old deployment.multiple_instances boolean. HAProxy is auto-enabled when num_instances > 1. Default n_tasks to num_nodes in the base slurm config. Parameterize ray_compiled_dag_channel_type in vllm_ray.yaml (default: shm). Always base64-encode deployment commands for safe srun transport.

Update examples, docs, and skills to reflect the new configuration patterns.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
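The base64 transport for deployment commands can be sketched as a generic round-trip: a multi-line script becomes a single opaque token that passes safely through srun argument handling, then is decoded and executed on the other side. A standalone illustration (assuming a GNU-style `base64`; the launcher's actual encode/decode points may differ):

```shell
# Sketch: ship a multi-line command as one token, decode and run it remotely
cmd='echo step one
echo step two'
encoded=$(printf %s "$cmd" | base64 | tr -d '\n')  # one line, safe to embed in an srun invocation
printf %s "$encoded" | base64 -d | bash            # "remote" side: decode and execute
```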
/ok to test bd9ba75
…yaml

OmegaConf rejects $() and $(()) syntax. Replace with backticks for command substitution and `let` for arithmetic to avoid parse errors.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
/ok to test d502d27
@AdamRajfer, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
The evaluator validates params and rejects 'max_tokens'. The correct parameter name is 'max_new_tokens'.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
/ok to test 7557d47
```markdown
The `vllm_ray` config automatically handles head/worker node coordination — no `pre_cmd` is needed.
```
do we need this comment about pre_cmd?
Removed, good catch.
```shell
RANK="$SLURM_PROCID"
EXPECTED_NODES="$SLURM_NTASKS"
```
Can we make these env names SLURM-agnostic? Q: why do we need all of these variables? I think we just need MASTER_IP and $SLURM_PROCID (which can be renamed to PROC_ID or something).
Done! The script now maps SLURM vars to generic names at the top (PROC_ID=$SLURM_PROCID, NUM_TASKS=$SLURM_NTASKS) and uses the generic names throughout. Re ALL_NODE_IPS — we still need it in multi-instance mode to compute per-instance head IPs, so it cannot be dropped entirely. Note: cannot use ${VAR:-default} bash syntax due to OmegaConf restrictions, so the aliases are simple assignments (SLURM always provides these vars in srun context).
- Remove --enforce-eager from ray example (marta-sd)
- Remove unnecessary pre_cmd mention from slurm.md (AWarno)
- Make vllm_ray.yaml env vars scheduler-agnostic by aliasing SLURM_PROCID/SLURM_NTASKS to PROC_ID/NUM_TASKS (AWarno)

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
…lti-instance

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>

# Conflicts:
#	packages/nemo-evaluator-launcher/examples/slurm_vllm_multinode_ray_tp_pp.yaml
#	packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/execution/slurm/default.yaml
```shell
# Map SLURM variables to scheduler-agnostic names
PROC_ID=$SLURM_PROCID
NUM_TASKS=$SLURM_NTASKS
```
What I mean is that we should export these values to agnostic variables on the Slurm side before reaching this script, since they do not change much. This script should not be tightly coupled to Slurm.
this is the original script: https://github.com/vllm-project/vllm/blob/main/examples/online_serving/multi-node-serving.sh
Good point — moved the mapping to the SLURM executor (executor.py). It now exports PROC_ID and NUM_TASKS inside the bash -c container wrapper before running the deployment script. The vllm_ray script no longer references any SLURM variables.
Thanks for the reference. Done — the mapping now lives in executor.py and the vllm_ray script is SLURM-free.
```shell
NUM_NODES=${execution.num_nodes}
NUM_INSTANCES=${execution.num_instances}
let "NODES_PER_INSTANCE = NUM_NODES / NUM_INSTANCES"
```
How is multi-node, multi-instance deployment handled without Ray? Right now, this script is tied to Slurm, but it doesn't have to be. This Ray setup mechanism could be made common for any multi-node environment.
The vllm_ray script is now fully scheduler-agnostic — it only uses PROC_ID, NUM_TASKS, MASTER_IP, and ALL_NODE_IPS, all exported by the executor. For multi-node multi-instance without Ray, users override deployment.command with custom logic (see slurm_vllm_multinode_dp.yaml). If we add non-SLURM executors with multi-node support in the future, they would just need to export the same generic variables and the vllm_ray script would work as-is.
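For reference, deriving a per-instance head IP from the generic variables is a few lines of bash. A sketch with made-up values (the actual vllm_ray script may compute this differently):

```shell
# Sketch: pick this instance's head node IP out of ALL_NODE_IPS (illustrative values)
ALL_NODE_IPS="10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4"
PROC_ID=3
NODES_PER_INSTANCE=2

let "INSTANCE_ID = PROC_ID / NODES_PER_INSTANCE"
let "HEAD_INDEX = INSTANCE_ID * NODES_PER_INSTANCE" || true
set -- $ALL_NODE_IPS        # word-split the IP list into positional parameters
shift "$HEAD_INDEX"         # drop the IPs of earlier instances
INSTANCE_MASTER_IP=$1       # first IP of this instance's node group
echo "$INSTANCE_MASTER_IP"
```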
…ecutor

PROC_ID and NUM_TASKS are now exported by the SLURM executor inside the container wrapper (bash -c), so the vllm_ray deployment script is fully scheduler-agnostic and no longer references SLURM_PROCID/SLURM_NTASKS.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Summary

- `vllm_ray.yaml`: Move the Ray cluster bootstrap script from a Jinja template (`ray_setup.sh.j2`) + programmatic injection into a standalone Hydra config (`deployment/vllm_ray.yaml`). The full bash script lives in the `command` field using OmegaConf-safe syntax (backticks for command substitution, `let` for arithmetic).
- `base_command` in `vllm.yaml`: Extract the `vllm serve ...` invocation into a reusable `base_command` key so `vllm_ray.yaml` can reference it via `${deployment.base_command}`.
- New example `slurm_vllm_multinode_dp.yaml` showing multi-node deployment using vLLM's built-in data parallelism (no Ray required).
- Remove `ray_setup.sh.j2` and the programmatic Ray injection block in `executor.py`. Multi-line commands are now handled generically via base64 encoding.
- Update `slurm.md` to cover both non-Ray (data parallelism) and Ray (tensor/pipeline parallelism) approaches.

OmegaConf compatibility

OmegaConf rejects `$(...)` and `$((...))` in YAML strings. The Ray script uses:

- `` `cmd` `` instead of `$(cmd)` for command substitution
- `let "x = a / b"` instead of `$((a / b))` for arithmetic
- `|| true` on `let` statements that can evaluate to 0 (to avoid `set -e` failures)

Test plan
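The listed substitutions behave like this in bash (a standalone illustration of the idioms, not the config itself):

```shell
set -e
NUM_NODES=4
NUM_INSTANCES=2
# let instead of $(( ... )); || true guards expressions that evaluate to 0,
# since let returns a nonzero exit status for a zero result under set -e
let "NODES_PER_INSTANCE = NUM_NODES / NUM_INSTANCES" || true
let "REMAINDER = NUM_NODES % NUM_INSTANCES" || true   # evaluates to 0: would exit without the guard
HOST=`hostname`                                       # backticks instead of $(cmd)
echo "nodes_per_instance=$NODES_PER_INSTANCE remainder=$REMAINDER host=$HOST"
```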