
feat: Declarative multi-node Ray config and native vLLM data-parallel support #757

Open

AdamRajfer wants to merge 11 commits into main from arajfer/multi-node-multi-instance

Conversation

@AdamRajfer
Contributor

@AdamRajfer AdamRajfer commented Feb 23, 2026

Summary

  • Declarative vllm_ray.yaml: Move the Ray cluster bootstrap script from a Jinja template (ray_setup.sh.j2) + programmatic injection into a standalone Hydra config (deployment/vllm_ray.yaml). The full bash script lives in the command field using OmegaConf-safe syntax (backticks for command substitution, let for arithmetic).
  • base_command in vllm.yaml: Extract the vllm serve ... invocation into a reusable base_command key so vllm_ray.yaml can reference it via ${deployment.base_command}.
  • Native vLLM data-parallel example: Add slurm_vllm_multinode_dp.yaml showing multi-node deployment using vLLM's built-in data parallelism (no Ray required).
  • Remove Jinja template: Delete ray_setup.sh.j2 and the programmatic Ray injection block in executor.py. Multi-line commands are now handled generically via base64 encoding.
  • Updated docs: Rewrite the Multi-Node Deployment section in slurm.md to cover both non-Ray (data parallelism) and Ray (tensor/pipeline parallelism) approaches.
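The base64 handling mentioned above can be sketched as follows — a minimal illustration of the idea, not the launcher's actual code; the variable names are made up:

```shell
# A multi-line command with quotes and newlines that would be mangled if
# passed to srun as a plain argument.
DEPLOY_CMD='echo "line one"
echo "line two"'

# Encode once on the submit side; tr strips the wrapping some base64
# implementations add, so the result is a single safe token.
ENCODED=$(printf '%s' "$DEPLOY_CMD" | base64 | tr -d '\n')

# On the compute side, decode and feed the script to bash on stdin.
OUTPUT=$(printf '%s' "$ENCODED" | base64 -d | bash)
echo "$OUTPUT"
```

The same round-trip works for any script body, which is why a generic encode/decode step can replace per-case quoting logic.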

OmegaConf compatibility

OmegaConf rejects $(...) and $((...)) in YAML strings. The Ray script uses:

  • Backticks `cmd` instead of $(cmd) for command substitution
  • let "x = a / b" instead of $((a / b)) for arithmetic
  • || true on let statements that can evaluate to 0 (to avoid set -e failures)
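The three substitutions can be illustrated outside YAML like this (a sketch of the same bash constructs; the NUM_NODES/NUM_INSTANCES values are made up):

```shell
set -e
NUM_NODES=4
NUM_INSTANCES=2

# `let` instead of $((...)); note `let` returns a nonzero status when the
# computed value is 0, which would kill a `set -e` script without `|| true`.
let "NODES_PER_INSTANCE = NUM_NODES / NUM_INSTANCES" || true

# Backticks instead of $(...) for command substitution.
WORKDIR=`pwd`

# This one actually evaluates to 0, so the guard is doing real work here.
let "RANK_OFFSET = NODES_PER_INSTANCE - 2" || true

echo "$NODES_PER_INSTANCE $RANK_OFFSET $WORKDIR"
```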

Test plan

  • All 121 unit tests pass
  • Dry-run: single-node, multi-node single-instance, multi-instance, multi-node multi-instance
  • Cluster run: multi-node multi-instance Ray (TP=8, PP=2, 2 instances x 2 nodes)
  • Cluster run: multi-node native data-parallel (no Ray)

Add support for deploying models that span multiple nodes per instance
with multiple replicas behind HAProxy load balancing. This enables
running large models (e.g., DeepSeek-R1) that require Ray tensor and
pipeline parallelism across nodes, with multiple instances for throughput.

New config field `deployment.nodes_per_instance` (default: 1) controls
how many nodes each vLLM instance spans. When > 1, the launcher:
- Groups SLURM tasks into instances based on SLURM_PROCID
- Injects per-task variables: INSTANCE_ID, INSTANCE_RANK,
  INSTANCE_MASTER_IP, MASTER_IP, NODES_PER_INSTANCE, NUM_INSTANCES,
  ALL_NODE_IPS
- Configures HAProxy to route only to instance head nodes
- Runs health checks only against head nodes

The user provides a `pre_cmd` script that uses the injected variables
to coordinate nodes within each instance (e.g., Ray cluster formation).
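A minimal pre_cmd sketch along these lines, assuming the injected variables described above; the values are simulated so the snippet runs standalone, and the `ray start` command lines are illustrative strings, not executed here:

```shell
# Simulated launcher-injected variables (per task, per the description above).
INSTANCE_RANK=0
INSTANCE_MASTER_IP=10.0.0.1

# Rank 0 within an instance becomes the Ray head; other ranks join it.
if [ "$INSTANCE_RANK" -eq 0 ]; then
  ROLE="head"
  RAY_CMD="ray start --head --port=6379"
else
  ROLE="worker"
  RAY_CMD="ray start --address=${INSTANCE_MASTER_IP}:6379 --block"
fi

echo "rank=$INSTANCE_RANK role=$ROLE cmd=$RAY_CMD"
```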

Includes example config, documentation, and unit tests.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer AdamRajfer requested review from a team as code owners February 23, 2026 20:43
@copy-pr-bot

copy-pr-bot bot commented Feb 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added documentation Improvements or additions to documentation nemo-evaluator-launcher tests labels Feb 23, 2026
@AdamRajfer AdamRajfer removed documentation Improvements or additions to documentation tests labels Feb 23, 2026
@AdamRajfer
Contributor Author

/ok to test c0106bf

data_parallel_size: 1
port: 8000
extra_args: "--disable-custom-all-reduce --distributed-executor-backend ray --enforce-eager"
pre_cmd: |
Collaborator

Nice!

suggestion: Can we make the injected pre_cmd script less verbose and more automatic, so users don't have to copy-paste it? That's a tad too much verbosity even for our already verbose config, IMO. Scripts like slurm_ray.sh from Nemo-run could serve as inspiration.

suggestion: more logging output to be added into that script to show the parameters and configs of the launched scripts.

Contributor Author

Done!

)
if not cfg.deployment.get("multiple_instances", False):
raise ValueError(
"nodes_per_instance > 1 requires deployment.multiple_instances=True"
Collaborator

why?

Contributor Author

Fixed!

n_tasks = cfg.execution.get("deployment", {}).get("n_tasks", 1)
if n_tasks != num_nodes:
raise ValueError(
f"nodes_per_instance > 1 requires execution.deployment.n_tasks "
Collaborator

@prokotg prokotg Feb 23, 2026

Must user set it (it should be automatic)?

Contributor

@AWarno AWarno Feb 24, 2026

Different multi-node deployments require different values for it — e.g., it differs between trt-llm and vllm, and sometimes it is the number of GPUs.

Contributor Author

Thanks for the comment. Please check if it looks good now. I am validating this field only for vLLM now.

pipeline_parallel_size: 2
data_parallel_size: 1
port: 8000
extra_args: "--disable-custom-all-reduce --distributed-executor-backend ray --enforce-eager"
Contributor

Users were complaining in the past about --enforce-eager flag as it hurts the performance. Is it required? If so, please add a comment indicating it's needed (ideally with short explanation why). If no -- let's remove it

Contributor Author

Removed!

top_p: 0.95
max_new_tokens: 32768
tasks:
- name: gsm8k_cot_instruct
Contributor

Don't we need to set up adapter server with reasoning interceptor?

Contributor Author

Yes, it is already configured — see evaluation.nemo_evaluator_config.target.api_endpoint.adapter_config.process_reasoning_traces: true (lines 107-113).

s += "{} &\n\n".format(cfg.deployment.command) # run asynchronously
if instance_prefix:
# Wrap in bash -c to inject instance vars
escaped_deployment_cmd = cfg.deployment.command.replace("'", "'\"'\"'")
Contributor

I don't understand this replacement, are you sure it's correct?

Contributor Author

Yes, this is correct — it's a standard bash idiom for escaping single quotes inside a single-quoted string. Since the whole command is wrapped in bash -c '...', any single quotes inside the deployment command would break the quoting. .replace("'", "'\"'\"'") works by:

  1. End the current single-quoted string: '
  2. Append a double-quoted literal single quote: "'"
  3. Resume the single-quoted string: '

So for example, if the deployment command contained it's, it becomes it'"'"'s, which bash concatenates back into it's. This is equivalent to the commonly seen '\'' pattern but uses double-quoting for the escaped quote instead.

Contributor Author

Anyways, please ignore this change as I introduced a broader redesign and it doesn't apply anymore.

…multi-node-multi-instance

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
Cover all code paths in _resolve_multi_node_config, Ray auto-injection,
HAProxy auto-enablement, and n_tasks auto-set for vLLM with 39 new tests.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@github-actions github-actions bot added documentation Improvements or additions to documentation tests labels Feb 24, 2026
@AdamRajfer AdamRajfer changed the title feat: Add multi-node multi-instance deployment support for vLLM feat: Redesign multi-node config with num_nodes_per_instance and num_instances Feb 24, 2026
@AdamRajfer
Contributor Author

/ok to test 3bd8150

Comment on lines +1731 to +1736
if cfg.execution.num_nodes_per_instance > 1:
if cfg.deployment.get("type") != "vllm":
raise ValueError(
f"Multi-node (num_nodes_per_instance > 1) is only supported for "
f"vLLM deployments, got deployment.type={cfg.deployment.get('type')}"
)
Contributor

Why?

Contributor Author

Revoked!

The echo/base64 pipe in the ray_setup injection was parsed by the batch
shell instead of running inside the srun container, causing the server
to dump raw base64 to logs and never start Ray or vLLM.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test b863910

0.95 causes CUDA OOM during vLLM sampler warmup on 2x8 H100 nodes.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
…_ray config

Replace ray_setup.sh.j2 template with a self-contained vllm_ray.yaml deployment
config. Split vllm.yaml command into base_command + command to allow vllm_ray to
override only the command while reusing the base vllm serve invocation.

Add num_instances as the user-facing field for multi-instance deployments,
replacing the old deployment.multiple_instances boolean. HAProxy is auto-enabled
when num_instances > 1. Default n_tasks to num_nodes in base slurm config.

Parameterize ray_compiled_dag_channel_type in vllm_ray.yaml (default: shm).
Always base64-encode deployment commands for safe srun transport.

Update examples, docs, and skills to reflect the new configuration patterns.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test bd9ba75

…yaml

OmegaConf rejects $() and $(()) syntax. Replace with backticks for
command substitution and let for arithmetic to avoid parse errors.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

AdamRajfer commented Feb 25, 2026

/ok to test d502d27

@copy-pr-bot

copy-pr-bot bot commented Feb 25, 2026

/ok to test

@AdamRajfer, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@AdamRajfer AdamRajfer changed the title feat: Redesign multi-node config with num_nodes_per_instance and num_instances feat: Declarative multi-node Ray config and native vLLM data-parallel support Feb 25, 2026
The evaluator validates params and rejects 'max_tokens'. The correct
parameter name is 'max_new_tokens'.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
@AdamRajfer
Contributor Author

/ok to test 7557d47

Comment on lines +146 to +147
The `vllm_ray` config automatically handles head/worker node coordination — no `pre_cmd` is needed.

Contributor

do we need this comment about pre_cmd?

Contributor Author

Removed, good catch.

Comment on lines +73 to +74
RANK="$SLURM_PROCID"
EXPECTED_NODES="$SLURM_NTASKS"
Contributor

@AWarno AWarno Feb 26, 2026

Can we make these env var names SLURM-agnostic? Q: why do we need all of these variables? I think we need just MASTER_IP and $SLURM_PROCID (which could be renamed to PROCID or something).

Contributor Author

Done! The script now maps SLURM vars to generic names at the top (PROC_ID=$SLURM_PROCID, NUM_TASKS=$SLURM_NTASKS) and uses the generic names throughout. Re ALL_NODE_IPS — we still need it in multi-instance mode to compute per-instance head IPs, so it cannot be dropped entirely. Note: cannot use ${VAR:-default} bash syntax due to OmegaConf restrictions, so the aliases are simple assignments (SLURM always provides these vars in srun context).
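The aliasing might look like this as a standalone sketch — the SLURM values are simulated here so the snippet runs on its own; in the real srun context they are always exported, so plain assignments suffice:

```shell
# Simulated srun environment for this standalone sketch.
SLURM_PROCID=3
SLURM_NTASKS=8

# The aliasing itself: plain assignments only, since OmegaConf rejects the
# ${VAR:-default} interpolation-like syntax inside YAML strings.
PROC_ID=$SLURM_PROCID
NUM_TASKS=$SLURM_NTASKS

echo "task $PROC_ID of $NUM_TASKS"
```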

- Remove --enforce-eager from ray example (marta-sd)
- Remove unnecessary pre_cmd mention from slurm.md (AWarno)
- Make vllm_ray.yaml env vars scheduler-agnostic by aliasing
  SLURM_PROCID/SLURM_NTASKS to PROC_ID/NUM_TASKS (AWarno)

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>
…lti-instance

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>

# Conflicts:
#	packages/nemo-evaluator-launcher/examples/slurm_vllm_multinode_ray_tp_pp.yaml
#	packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/execution/slurm/default.yaml
# Map SLURM variables to scheduler-agnostic names
PROC_ID=$SLURM_PROCID
NUM_TASKS=$SLURM_NTASKS

Contributor

What I mean is that we should export these values to agnostic variables on the Slurm side before reaching this script, since they do not change much. This script should not be tightly coupled to Slurm.

Contributor Author

Good point — moved the mapping to the SLURM executor (executor.py). It now exports PROC_ID and NUM_TASKS inside the bash -c container wrapper before running the deployment script. The vllm_ray script no longer references any SLURM variables.

Contributor Author

Thanks for the reference. Done — the mapping now lives in executor.py and the vllm_ray script is SLURM-free.

Comment on lines +56 to +59
NUM_NODES=${execution.num_nodes}
NUM_INSTANCES=${execution.num_instances}
let "NODES_PER_INSTANCE = NUM_NODES / NUM_INSTANCES"

Contributor

How is multi-node, multi instance deployment handled without Ray? Right now, this script is tied to Slurm, but it doesn’t have to be. This Ray setup mechanism could be made common for any multi-node environment.

Contributor Author

The vllm_ray script is now fully scheduler-agnostic — it only uses PROC_ID, NUM_TASKS, MASTER_IP, and ALL_NODE_IPS, all exported by the executor. For multi-node multi-instance without Ray, users override deployment.command with custom logic (see slurm_vllm_multinode_dp.yaml). If we add non-SLURM executors with multi-node support in the future, they would just need to export the same generic variables and the vllm_ray script would work as-is.

…ecutor

PROC_ID and NUM_TASKS are now exported by the SLURM executor inside the
container wrapper (bash -c), so the vllm_ray deployment script is fully
scheduler-agnostic and no longer references SLURM_PROCID/SLURM_NTASKS.

Signed-off-by: Adam Rajfer <arajfer@nvidia.com>

Labels

documentation Improvements or additions to documentation nemo-evaluator-launcher tests


5 participants