
feat: add NeMo Gym rollouts pipeline and multi-node vLLM support#1227

Open
gwarmstrong wants to merge 5 commits into main from georgea/nemo-gym-rollouts-pr

Conversation

gwarmstrong (Collaborator) commented Feb 10, 2026

NeMo-Gym Rollouts Pipeline

Add ns nemo_gym_rollouts CLI command for orchestrating rollout collection with NeMo Gym, including self-hosted or pre-hosted vLLM servers, optional sandbox, and parallel seed-based scaling.

New script classes:

  • NemoGymRolloutsScript: orchestrates ng_run + ng_collect_rollouts
  • MultiVLLMServerScript: data-parallel vLLM with multiple replicas per node (not integrated into pipeline yet)
  • GymClientScript: multi-node vLLM client (not integrated into pipeline yet)

Infrastructure enhancements:

  • Command: mounts, environment, workdir, avoid/force nemo_run_code controls
  • Per-script num_tasks_override for mixed task configurations
  • Allocation ordering fix for multi-component overlap jobs
  • NEMO_SKILLS_FORCE_PATTERN_PACKAGER env var for packaging control
  • --kill-on-bad-exit=1 srun flag for early failure detection

Gym Config Verification: math_with_judge (AIME24) & code_gen (LCB)

Verified that nemo_gym_rollouts pipeline works end-to-end with math_with_judge and code_gen Gym configs on their full validation datasets.

Results

Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 (8x GPU, vLLM, --reasoning-parser deepseek_r1)

| Metric | AIME24 (30 problems) | LCB (322 problems) |
| --- | --- | --- |
| pass@1 | 0.9000 | 0.7259 |
| maj@4 | 0.9000 | 0.6739 |
| pass@4 | 0.9333 | 0.9720 |

4 random seeds per benchmark, temperature=1.0, top_p=1.0, max_output_tokens=65536.
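With 4 seeds per problem, these metrics are typically computed per problem and averaged over the dataset: pass@1 is the mean correctness across seeds, pass@4 checks whether any seed succeeded, and maj@4 scores the majority-vote answer. A minimal sketch of that convention (hypothetical helper names, not the NeMo-Skills implementation):

```python
from collections import Counter

def pass_at_1(correct_flags):
    # Mean correctness across seeds for one problem.
    return sum(correct_flags) / len(correct_flags)

def pass_at_k(correct_flags):
    # 1.0 if any of the k seeded attempts is correct.
    return 1.0 if any(correct_flags) else 0.0

def maj_at_k(answers, reference):
    # Score the most common answer across seeds against the reference.
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return 1.0 if most_common_answer == reference else 0.0

# Toy example: one problem, 4 seeds
answers = ["42", "42", "41", "42"]
correct = [a == "42" for a in answers]
print(pass_at_1(correct))       # 0.75
print(pass_at_k(correct))       # 1.0
print(maj_at_k(answers, "42"))  # 1.0
```

Averaging these per-problem scores over the full validation set yields the table above.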

Repro: AIME24 (math_with_judge)

Requires AIME24 validation data pre-downloaded to the cluster (see data download step below).

# run_rollouts_aime24.py
from nemo_skills.pipeline.cli import nemo_gym_rollouts, wrap_arguments

run_number = 2
cluster = "<your-cluster>"

nemo_gym_rollouts(
    ctx=wrap_arguments(
        "+agent_name=math_with_judge_simple_agent "
        "+num_samples_in_parallel=64 "
        "+responses_create_params.max_output_tokens=65536 "
        "+responses_create_params.temperature=1.0 "
        "+responses_create_params.top_p=1.0 "
    ),
    num_random_seeds=4,
    starting_seed=0,
    cluster=cluster,
    config_paths=(
        "resources_servers/math_with_judge/configs/dapo17k.yaml,"
        "responses_api_models/vllm_model/configs/vllm_model.yaml"
    ),
    input_file="resources_servers/math_with_judge/data/aime24_validation.jsonl",
    output_dir=f"/workspace/results/math_verification_aime24/run{run_number}",
    gym_path='/opt/nemo-rl/3rdparty/Gym-workspace/Gym',
    with_sandbox=False,
    model='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16',
    server_type='vllm',
    server_gpus=8,
    server_args='--trust-remote-code --dtype auto --mamba_ssm_cache_dtype float32 --reasoning-parser deepseek_r1 ',
    partition=None,
    exclusive=True,
)

Repro: LCB (code_gen)

Requires LiveCodeBench v5 validation data pre-downloaded to the cluster (see data download step below).

# run_rollouts_lcb.py
from nemo_skills.pipeline.cli import nemo_gym_rollouts, wrap_arguments

run_number = 2
cluster = "<your-cluster>"

nemo_gym_rollouts(
    ctx=wrap_arguments(
        "+agent_name=code_gen_simple_agent "
        "+num_samples_in_parallel=64 "
        "+responses_create_params.max_output_tokens=65536 "
        "+responses_create_params.temperature=1.0 "
        "+responses_create_params.top_p=1.0 "
    ),
    num_random_seeds=4,
    starting_seed=0,
    cluster=cluster,
    config_paths=(
        "resources_servers/code_gen/configs/code_gen.yaml,"
        "responses_api_models/vllm_model/configs/vllm_model.yaml"
    ),
    input_file="resources_servers/code_gen/data/livecodebench_v5_2024-07-01_2025-02-01_validation.jsonl",
    output_dir=f"/workspace/results/math_verification_lcb/run{run_number}",
    gym_path='/opt/nemo-rl/3rdparty/Gym-workspace/Gym',
    with_sandbox=False,
    model='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16',
    server_type='vllm',
    server_gpus=8,
    server_args='--trust-remote-code --dtype auto --mamba_ssm_cache_dtype float32 --reasoning-parser deepseek_r1 ',
    partition=None,
    exclusive=True,
)

Data Download

Both datasets must be downloaded to the Gym data directories on the cluster before running rollouts. This uses ng_download_dataset_from_hf inside the nemo-rl container:

from nemo_skills.pipeline.cli import run_cmd, wrap_arguments

gym_path = '/opt/nemo-rl/3rdparty/Gym-workspace/Gym'

run_cmd(
    ctx=wrap_arguments(""),
    cluster="<your-cluster>",
    command=(
        f"cd {gym_path} && "
        "uv venv --python 3.12 --allow-existing .venv && "
        "source .venv/bin/activate && "
        "uv sync --active --extra dev && "
        "ng_download_dataset_from_hf "
        "+repo_id=nvidia/Nemotron-RL-math-OpenMathReasoning "
        "+artifact_fpath=validation.jsonl "
        "+output_fpath=resources_servers/math_with_judge/data/aime24_validation.jsonl && "
        "ng_download_dataset_from_hf "
        "+repo_id=nvidia/nemotron-RL-coding-competitive_coding "
        "+artifact_fpath=validation.jsonl "
        "+output_fpath=resources_servers/code_gen/data/livecodebench_v5_2024-07-01_2025-02-01_validation.jsonl"
    ),
    container="nemo-rl",
    expname="download_gym_data",
    partition="cpu",
    num_gpus=0,
)

Summary by CodeRabbit

  • New Features

    • NeMo Gym rollout command: server/sandbox options, policy handling, seed management, dry-run, and parallel execution.
    • Per-script task overrides and per-command mounts, environment and working-directory support for finer job control.
  • Bug Fixes

    • Slurm jobs now fail fast if any task exits abnormally.
  • Chores

    • Deterministic packaging flags added for more reproducible builds.

@gwarmstrong gwarmstrong force-pushed the georgea/nemo-gym-rollouts-pr branch 2 times, most recently from d61b5f4 to d86e6c9 Compare February 10, 2026 22:02
@gwarmstrong gwarmstrong marked this pull request as ready for review February 13, 2026 18:10
@gwarmstrong gwarmstrong requested a review from Kipok February 13, 2026 18:11
greptile-apps bot left a comment:

7 files reviewed, 1 comment

VLLM_SERVER_URL="{vllm_server_url}"
if [ -n "$VLLM_SERVER_URL" ]; then
echo "=== Waiting for vLLM server at $VLLM_SERVER_URL ==="
while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do

The curl command returns the HTTP status code directly in the condition, but this can fail when curl itself fails (network issues, DNS failures, etc). The command substitution $(curl ...) will return an empty string on curl failure, which bash compares numerically to 200, causing an error.

Suggested change
while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do
while ! curl -sf "$VLLM_SERVER_URL/models" >/dev/null 2>&1; do
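Both the original loop and the suggested fix share the same intent: treat any failed probe (connection refused, DNS failure, non-200 status) as "not ready" and keep polling. The same pattern, with a bounded deadline added, can be sketched in Python (a hypothetical helper, not code from this PR):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout_s: float = 600, interval_s: float = 10) -> bool:
    """Poll `url` until it answers HTTP 200, treating every failure mode
    (connection refused, DNS error, non-200 status) as 'not ready yet'."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; fall through to the sleep below
        time.sleep(interval_s)
    return False  # bounded, unlike an unconditional while loop
```

The `--kill-on-bad-exit=1` srun flag added in this PR covers the complementary failure mode: if the server task dies, the whole job is killed instead of the client polling forever.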

coderabbitai bot commented Feb 13, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

Adds a Nemo Gym rollout CLI command and supporting pipeline pieces: new script classes for multi-vLLM servers, gym client, and rollout orchestration; per-command mounts/env/workdir and PYTHONPATH handling; deterministic packaging flags; and a fail-fast Slurm srun option. Also re-exports new scripts and adds a CLI import.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CLI & entry: nemo_skills/pipeline/cli.py, nemo_skills/pipeline/nemo_gym_rollouts.py | Imports nemo_gym_rollouts into CLI; adds new nemo_gym_rollouts CLI command implementing rollout orchestration and related options (server, sandbox, seeds, configs, dry-run). |
| Script classes & re-exports: nemo_skills/pipeline/utils/scripts.py, nemo_skills/pipeline/utils/__init__.py | Adds NemoGymRolloutsScript, MultiVLLMServerScript, GymClientScript; adds num_tasks_override to BaseJobScript; re-exports script classes from utils package. |
| Declarative / execution behavior: nemo_skills/pipeline/utils/declarative.py | Extends Command with mounts, environment, workdir, avoid_nemo_run_code, force_nemo_run_code; prepare_for_execution merges envs/mounts, adds PYTHONPATH/workdir prelude; executor uses merged mounts and per-script task overrides; job ordering heuristic added. |
| Packaging & submission: nemo_skills/pipeline/utils/packager.py, nemo_skills/pipeline/utils/exp.py | Adds NEMO_SKILLS_FORCE_INSTALLED_PACKAGE and NEMO_SKILLS_FORCE_PATTERN_PACKAGER to influence packager selection; adds --kill-on-bad-exit=1 to Slurm srun args for fail-fast behavior. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
participant CLI as "User CLI\n(nemo_gym_rollouts)"
participant Orch as "Pipeline Orchestrator\n(Command / Pipeline)"
participant Server as "Multi-vLLM Servers\n(MultiVLLMServerScript)"
participant Sandbox as "Sandbox Container\n(optional)"
participant Client as "Gym Client\n(GymClientScript)"
participant Storage as "Output Storage\n(output_dir / logs)"

CLI->>Orch: invoke nemo_gym_rollouts(args)
Orch->>Server: create server jobs (replicas, ports, env)
Orch->>Sandbox: optionally create sandbox per job
Orch->>Client: create gym client job(s) with server addresses
Server-->>Client: register server URLs / health OK
Client->>Server: gym connects to vLLM servers
Client->>Storage: collect and write rollouts
Orch->>Storage: aggregate job statuses and logs
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

run GPU tests

Suggested reviewers

  • activatedgeek
🚥 Pre-merge checks: ✅ 3 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title accurately describes the main change: adding a NeMo Gym rollouts pipeline command with multi-node vLLM server support, which aligns with the primary additions in nemo_gym_rollouts.py and supporting infrastructure. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.67%, above the required threshold of 80.00%. |



coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@nemo_skills/pipeline/nemo_gym_rollouts.py`:
- Line 238: The code uses cluster_config["containers"].get(server_type_str,
server_type_str) which can silently accept typos as literal image names; change
this to validate the server_type_str against cluster_config["containers"] (or
use direct access) so missing keys fail fast: check if server_type_str is a key
in cluster_config["containers"] and if so set server_container =
cluster_config["containers"][server_type_str], otherwise raise a clear error (or
fallback to a validated explicit list) so functions referencing server_container
(and the server_type_str lookup) don’t silently accept invalid names.

In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 930-936: The error block that runs when BEST_COUNT is less than
NUM_NODES prints several echo lines with inconsistent indentation; normalize the
leading spaces so all echo lines align with the surrounding echo and exit
statements. Locate the conditional using BEST_COUNT and NUM_NODES and the echo
statements referencing BEST_VAR_NAME, BEST_NODES_STR, SLURM_JOB_NODELIST,
SLURM_NODELIST, and SLURM_STEP_NODELIST, and remove the extra leading whitespace
on the lines echoing SLURM_* and BEST_NODES_STR so formatting is consistent.
- Around line 966-972: The modulo expression ATTEMPT % (60 /
HEALTH_CHECK_INTERVAL) can divide by zero when HEALTH_CHECK_INTERVAL >= 60; fix
by computing a safe divisor first (e.g. LOG_EVERY = 60 / HEALTH_CHECK_INTERVAL
and if LOG_EVERY < 1 set LOG_EVERY=1) and then use ATTEMPT % LOG_EVERY for the
logging condition; update the block using the variables ATTEMPT,
HEALTH_CHECK_INTERVAL, LOG_EVERY (or similar) and keep the existing ELAPSED and
echo logic unchanged.
- Around line 565-572: The health-check loop uses an unquoted command
substitution which can break if curl returns empty or contains whitespace; fix
the loop in the VLLM server wait logic by capturing the curl result into a
quoted variable and comparing it safely, e.g. status="$(curl -s -o /dev/null -w
"%{http_code}" "$VLLM_SERVER_URL/models" 2>/dev/null)" and then use a quoted
comparison like while [ "$status" != "200" ]; do sleep 10; status="$(...)" done;
ensure the variable and substitutions (VLLM_SERVER_URL and the curl command) are
wrapped in double quotes to be defensive.
🧹 Nitpick comments (5)
nemo_skills/pipeline/utils/packager.py (1)

139-157: Mutually reinforcing flags — consider documenting interaction.

When force_pattern_packager=True, repo_path is None so we skip the if repo_path: branch entirely (going to the else at line 164). When force_installed_nemo_skills=True but force_pattern_packager=False, we enter the git repo branch but include the full installed package. These two flags interact subtly — both are documented in the comments above, which is sufficient, but worth noting that force_installed_nemo_skills has no effect when force_pattern_packager is also set (since the code path that checks it is skipped).

nemo_skills/pipeline/utils/declarative.py (1)

263-294: PYTHONPATH filtering only removes exact /nemo_run/code — intentional?

Line 280 uses grep -v '^/nemo_run/code$' which only removes the exact path /nemo_run/code from PYTHONPATH. If PYTHONPATH contains /nemo_run/code/ (trailing slash) or /nemo_run/code/subdir, those entries would be preserved. This is likely intentional (precise removal), but worth confirming it matches the expected behavior. Subdirectory entries like /nemo_run/code/nemo_skills would also remain.
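The follow-up commit in this thread ("Fix PYTHONPATH stripping to match /nemo_run/code prefixed paths") moves to the prefix-aware behavior. The difference is easy to see in a few lines of illustrative Python (not the pipeline's actual shell code):

```python
def strip_exact(pythonpath: str, target: str = "/nemo_run/code") -> str:
    # Mirrors grep -v '^/nemo_run/code$': removes only the exact entry.
    return ":".join(p for p in pythonpath.split(":") if p != target)

def strip_prefix(pythonpath: str, target: str = "/nemo_run/code") -> str:
    # Prefix-aware variant: also removes anything under target.
    keep = [p for p in pythonpath.split(":")
            if p != target and not p.startswith(target + "/")]
    return ":".join(keep)

path = "/opt/lib:/nemo_run/code:/nemo_run/code/nemo_skills"
print(strip_exact(path))   # /opt/lib:/nemo_run/code/nemo_skills
print(strip_prefix(path))  # /opt/lib
```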

nemo_skills/pipeline/nemo_gym_rollouts.py (2)

184-206: Validation logic is thorough and covers key mutual-exclusion cases.

The self-hosted vs. pre-hosted validation gates are well-structured. One edge case: if a user provides --server_type with --server_address (pre-hosted), the server_type is silently unused. Consider warning or erroring when --server_type is provided alongside --server_address.


247-255: Broad exception catch is acceptable here but could be narrower.

The except Exception catches all failures when checking remote file existence. Since this involves SSH/cluster operations that can fail unpredictably, the defensive approach with a warning log and safe fallback (include the seed) is pragmatic. If you want to tighten it, catching (OSError, RuntimeError) would cover network/SSH failures while still letting truly unexpected errors surface.

nemo_skills/pipeline/utils/scripts.py (1)

690-691: Remove unused cluster_config field or document its intended purpose.

The cluster_config: Dict parameter is declared but never used anywhere in MultiVLLMServerScript—not in __post_init__ or elsewhere in the class. Unlike ServerScript and SandboxScript, which both pass this to their respective command builders, MultiVLLMServerScript builds its bash script using only other fields (model_path, num_nodes, gpus_per_server, etc.). Per coding guidelines, avoid silently ignoring user-passed parameters; either remove the field or add clear documentation if it's reserved for future use.

server_container = None
if self_hosted:
server_type_str = server_type.value if hasattr(server_type, "value") else server_type
server_container = cluster_config["containers"].get(server_type_str, server_type_str)
coderabbitai bot commented Feb 13, 2026
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts were executed against the NVIDIA-NeMo/Skills repository (viewing the context around line 238 of nemo_skills/pipeline/nemo_gym_rollouts.py and searching for container resolution patterns with rg 'containers.*get\(', rg 'containers\[', and rg 'cluster_config.*containers') before reaching the conclusion below.

Consider using direct dictionary access or explicit validation for container resolution.

cluster_config["containers"].get(server_type_str, server_type_str) allows passing either a registered container type or a direct image path, but this creates a UX risk: a typo like "vlm" instead of "vllm" silently becomes a literal image name, failing only at runtime instead of failing fast. The rest of the codebase uses direct dictionary access (e.g., cluster_config["containers"]["nemo-skills"]), which fails immediately on missing keys. Either validate the server type against a set of known options or use direct access if the key is always expected.


@gwarmstrong should we do that? Not sure why we fall back to server_type_str here. We should probably have a separate server_container argument that users can use to override, but if it's not specified we should fail if we can't find the container


Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment on lines +565 to +572
# Note: --kill-on-bad-exit in srun ensures job fails if vLLM crashes
VLLM_SERVER_URL="{vllm_server_url}"
if [ -n "$VLLM_SERVER_URL" ]; then
echo "=== Waiting for vLLM server at $VLLM_SERVER_URL ==="
while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do
sleep 10
done
echo "vLLM server is ready!"

⚠️ Potential issue | 🟡 Minor

Unquoted command substitution in health-check loop.

Line 569: The $(curl ...) is not double-quoted. While curl -w "%{http_code}" normally returns "000" on connection failure, quoting defensively is safer.

🛡️ Suggested fix
-    while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do
+    while [ "$(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null)" -ne 200 ]; do

Comment on lines +966 to +972
fi
# Log progress every 60 seconds
if [ $((ATTEMPT % (60 / HEALTH_CHECK_INTERVAL))) -eq 0 ]; then
ELAPSED=$(($(date +%s) - START_TIME))
echo " ... still waiting for server $GLOBAL_IDX (${{ELAPSED}}s elapsed)"
fi
sleep $HEALTH_CHECK_INTERVAL

⚠️ Potential issue | 🟡 Minor

Division-by-zero risk if health_check_interval > 60.

Line 968: ATTEMPT % (60 / HEALTH_CHECK_INTERVAL) — bash integer division truncates, so if HEALTH_CHECK_INTERVAL > 60, 60 / HEALTH_CHECK_INTERVAL evaluates to 0, causing % 0 which is a bash arithmetic error. The default of 10 is safe, but this is a latent issue if someone configures a large interval.

🛡️ Suggested fix
-            if [ $((ATTEMPT % (60 / HEALTH_CHECK_INTERVAL))) -eq 0 ]; then
+            LOG_EVERY=$((60 / HEALTH_CHECK_INTERVAL))
+            if [ "$LOG_EVERY" -le 0 ]; then LOG_EVERY=1; fi
+            if [ $((ATTEMPT % LOG_EVERY)) -eq 0 ]; then

- Fix PYTHONPATH stripping to match /nemo_run/code prefixed paths
- Re-enable pipefail before ng_collect_rollouts
- Fix indentation in GymClientScript error block
- Remove emoji from server ready message

Signed-off-by: George Armstrong <georgea@nvidia.com>
coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/utils/declarative.py (1)

778-802: ⚠️ Potential issue | 🟠 Major

Bug: external_deps is not passed to the first executor when reordering occurs.

After the sort at line 754, prepared_commands is reordered so spanning components appear first (improving SLURM allocation). However, the loop at line 762 checks het_idx == 0 and comp_idx == 0 using the original component indices stored in each entry. If the spanning component that now appears first was originally at comp_idx=1, the condition fails and no executor receives external_deps.

The code already acknowledges this reordering issue at lines 757–759 for shared_packager, but the same problem applies to external_deps. Since nemo-run uses the first executor to determine the SLURM allocation (line 738), the fix is to pass dependencies to the first executor in iteration order:

-        for entry in prepared_commands:
+        for entry_idx, entry in enumerate(prepared_commands):
             het_idx = entry["het_idx"]
             comp_idx = entry["comp_idx"]
             ...
-            exec_dependencies = external_deps if (het_idx == 0 and comp_idx == 0) else None
+            exec_dependencies = external_deps if entry_idx == 0 else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/declarative.py` around lines 778 - 802,
prepared_commands is reordered so relying on stored het_idx/comp_idx to decide
who gets external_deps is incorrect; instead, determine the "first executor" by
iteration order and pass external_deps to that executor. Modify the loop that
builds executors (the block that calls _create_executor) to compute
exec_dependencies based on the iteration position (e.g., using an iteration
counter or a boolean flag) rather than the stored het_idx/comp_idx, ensuring the
very first created executor receives external_deps; update any existing
shared_packager handling consistently so only the first yielded executor gets
external_deps.
🧹 Nitpick comments (4)
nemo_skills/pipeline/utils/declarative.py (1)

638-644: Use direct dictionary access for exec_config["mounts"].

exec_config["mounts"] is always set at line 302, so .get() is unnecessary. Per coding guidelines, use direct access to fail fast if the key is ever unexpectedly missing.

Suggested fix
-        extra_mounts = exec_config.get("mounts") or None
+        extra_mounts = exec_config["mounts"] or None

As per coding guidelines, "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/declarative.py` around lines 638 - 644, Replace
the use of exec_config.get("mounts") with direct dictionary access
exec_config["mounts"] in the mounts merge block so the code fails fast if the
key is missing; specifically, in the block that computes mounts (using variables
mounts, extra_mounts and calling get_mounts_from_config(cluster_config)), set
extra_mounts = exec_config["mounts"] (keeping the subsequent truthy check and
merge logic unchanged) to follow the guideline of direct access when the key is
expected to exist.
nemo_skills/pipeline/utils/scripts.py (3)

518-535: extra_arguments is forwarded verbatim to both ng_run and ng_collect_rollouts.

If a user passes Hydra overrides that are only valid for one of the two commands, the other will fail with an unknown-override error. Consider splitting into separate parameters (e.g., extra_ng_run_args / extra_ng_collect_args) or at least document the coupling more prominently so callers know every override must be understood by both commands.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/scripts.py` around lines 518 - 535,
extra_arguments is forwarded verbatim to both ng_run and ng_collect_rollouts
causing unknown-override failures; split or separate the args so each command
only receives relevant overrides. Change the implementation to accept two
distinct parameters (e.g., extra_ng_run_args and extra_ng_collect_args) and use
extra_ng_run_args when appending to ng_run_parts and building ng_run_cmd, and
use extra_ng_collect_args when appending to ng_collect_parts and building
ng_collect_cmd; update constructor/attribute names and any call sites that set
extra_arguments to pass the appropriate new parameter(s), or alternatively
update documentation to clearly state that extra_arguments must be valid for
both commands if you choose not to split.

639-805: MultiVLLMServerScript — solid design for data-parallel vLLM.

Clean use of SLURM env vars for per-task GPU slicing and port assignment. A few observations:

  1. --trust-remote-code is unconditionally hardcoded (line 800). If a user loads a model from a trusted local checkpoint and wants to avoid this flag, there's no way to opt out. Consider making it configurable or letting users control it via server_args.
  2. The port range (base_port to base_port + total_replicas - 1) is allocated once at init time. If any of those ports are already in use on a target node, the server will fail at runtime. This is inherent to the approach but worth noting in the docstring.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/scripts.py` around lines 639 - 805, The script
unconditionally injects --trust-remote-code in MultiVLLMServerScript which
prevents users from opting out; add a boolean field (e.g., trust_remote_code:
bool = True) to the dataclass and conditionally include the
"--trust-remote-code" token in the generated cmd only when that flag is true (or
let users pass it via server_args by default). Update the __post_init__ build
logic that constructs cmd to reference this new field when composing the python3
-m vllm.entrypoints.openai.api_server command, and update the class docstring to
note the port allocation caveat (base_port .. base_port + _total_replicas - 1)
so users know ports may be in-use on target nodes.

625-633: hostname_ref() in env vars won't resolve in heterogeneous jobs.

Line 630 sets NEMO_SKILLS_SANDBOX_HOST to self.sandbox.hostname_ref(). For non-het jobs this returns "127.0.0.1" (fine), but for het jobs it returns a bash expression like ${SLURM_MASTER_NODE_HET_GROUP_0:-localhost}. Since env vars set via nemo-run are literal values (not shell-evaluated), OmegaConf/Gym would receive the unexpanded string.

This isn't an issue for the current single-group use case, but will silently break if someone uses this script in a heterogeneous job. Consider adding a guard or a comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/scripts.py` around lines 625 - 633, The code sets
NEMO_SKILLS_SANDBOX_HOST to self.sandbox.hostname_ref() which can be a bash
expression for heterogeneous (het) SLURM jobs and will be passed literally;
change the logic around env_vars population so you detect bash-expression
patterns (e.g. contains "${" or starts with "$") from
self.sandbox.hostname_ref() and avoid injecting an unexpanded literal into
env_vars["NEMO_SKILLS_SANDBOX_HOST"] — instead either (a) set a safe fallback
like "localhost"/resolved IP, (b) omit the env var so the target process can
resolve itself, and (c) add a warning log entry; keep
env_vars["NEMO_SKILLS_SANDBOX_PORT"] as-is. Update the code around
self.sandbox.hostname_ref(), self.sandbox.port and the return that builds
{"environment": env_vars} and add a brief comment noting het-only hostname
expressions must not be passed literally.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 590-605: The health check currently relies on a brittle string
match of STATUS_OUTPUT for "healthy, 0 unhealthy" (in the ng_status loop) which
can break if ng_status output changes; update the loop to use a more robust
signal: prefer ng_status's exit code or a structured output (JSON) if available,
or parse numeric healthy/unhealthy counts with a regex that extracts both
numbers from STATUS_OUTPUT and checks unhealthy == 0; also add a timeout/counter
to the loop to prevent an infinite hang, referencing the ng_status call,
STATUS_OUTPUT, CURRENT_STATUS and LAST_STATUS variables for where to implement
the changes.

---

Outside diff comments:
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 778-802: prepared_commands is reordered so relying on stored
het_idx/comp_idx to decide who gets external_deps is incorrect; instead,
determine the "first executor" by iteration order and pass external_deps to that
executor. Modify the loop that builds executors (the block that calls
_create_executor) to compute exec_dependencies based on the iteration position
(e.g., using an iteration counter or a boolean flag) rather than the stored
het_idx/comp_idx, ensuring the very first created executor receives
external_deps; update any existing shared_packager handling consistently so only
the first yielded executor gets external_deps.

---

Duplicate comments:
In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 565-572: The health-check loop uses unquoted command substitution
for the curl exit code; update the while condition in the VLLM_SERVER_URL block
so the $(curl ...) expansion is wrapped in double quotes to prevent
word-splitting/empty-string issues (i.e., quote the command substitution used in
the while test that queries "$VLLM_SERVER_URL/models" so the comparison against
200 is reliable).
- Around line 933-939: The error block around the BEST_COUNT check has
consistent 4-space indentation for the echo lines but the review contains a
duplicate comment marker; remove the duplicate marker and verify the conditional
block (if [ "$BEST_COUNT" -lt "$NUM_NODES" ]; then ... exit 1) keeps the echo
lines for BEST_VAR_NAME, BEST_NODES_STR, SLURM_JOB_NODELIST, SLURM_NODELIST, and
SLURM_STEP_NODELIST indented consistently and that no stray duplicate comments
remain in scripts.py.
- Around line 963-976: The modulo expression ATTEMPT % (60 /
HEALTH_CHECK_INTERVAL) can divide by zero when HEALTH_CHECK_INTERVAL >= 60; fix
by computing a safe log-frequency before the loop (e.g. LOG_EVERY=$((60 /
HEALTH_CHECK_INTERVAL)); if [ "$LOG_EVERY" -le 0 ]; then LOG_EVERY=1; fi) and
then replace the existing check with if [ $((ATTEMPT % LOG_EVERY)) -eq 0 ]; then
...; use the same variables ATTEMPT, HEALTH_CHECK_INTERVAL, LOG_EVERY, ELAPSED
and GLOBAL_IDX to locate and update the code inside the while true loop.

---

Nitpick comments:
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 638-644: Replace the use of exec_config.get("mounts") with direct
dictionary access exec_config["mounts"] in the mounts merge block so the code
fails fast if the key is missing; specifically, in the block that computes
mounts (using variables mounts, extra_mounts and calling
get_mounts_from_config(cluster_config)), set extra_mounts =
exec_config["mounts"] (keeping the subsequent truthy check and merge logic
unchanged) to follow the guideline of direct access when the key is expected to
exist.

In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 518-535: extra_arguments is forwarded verbatim to both ng_run and
ng_collect_rollouts causing unknown-override failures; split or separate the
args so each command only receives relevant overrides. Change the implementation
to accept two distinct parameters (e.g., extra_ng_run_args and
extra_ng_collect_args) and use extra_ng_run_args when appending to ng_run_parts
and building ng_run_cmd, and use extra_ng_collect_args when appending to
ng_collect_parts and building ng_collect_cmd; update constructor/attribute names
and any call sites that set extra_arguments to pass the appropriate new
parameter(s), or alternatively update documentation to clearly state that
extra_arguments must be valid for both commands if you choose not to split.
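The divide-by-zero guard flagged above for the `ATTEMPT % (60 / HEALTH_CHECK_INTERVAL)` expression is small enough to sketch directly (`log_every` is a hypothetical helper name):

```shell
# Compute how often to log inside the health-check loop; guard against
# 60 / HEALTH_CHECK_INTERVAL truncating to zero when the interval is >= 60s.
log_every() {
  local interval=$1
  local every=$((60 / interval))
  if [ "$every" -le 0 ]; then
    every=1
  fi
  echo "$every"
}
```

The loop would then compute `LOG_EVERY=$(log_every "$HEALTH_CHECK_INTERVAL")` once before entering and check `[ $((ATTEMPT % LOG_EVERY)) -eq 0 ]` inside.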

Comment on lines +590 to +605
STATUS_OUTPUT=$(ng_status 2>&1)

if echo "$STATUS_OUTPUT" | grep -q "healthy, 0 unhealthy"; then
echo "All servers ready!"
break
fi

# Only print status when it changes (reduce verbosity)
CURRENT_STATUS=$(echo "$STATUS_OUTPUT" | grep -oE '[0-9]+ healthy' | head -1 || echo "starting")
if [ "$CURRENT_STATUS" != "$LAST_STATUS" ]; then
echo "Server status: $CURRENT_STATUS"
LAST_STATUS="$CURRENT_STATUS"
fi

sleep 10
done
⚠️ Potential issue | 🟡 Minor

Fragile health-check string match: "healthy, 0 unhealthy".

If ng_status changes its output format (e.g., wording, ordering, locale), the grep -q "healthy, 0 unhealthy" at line 592 will never match and the loop will spin forever (silent hang with no timeout). Consider using a more structured health-check mechanism (e.g., exit code of ng_status, or a JSON output field) if one is available.

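A sketch of the count-based check with a bounded wait (this assumes `ng_status` prints counts like "4 healthy, 0 unhealthy"; the function and its parameters are hypothetical):

```shell
# Poll ng_status, parsing numeric healthy/unhealthy counts instead of matching
# a fixed phrase, and give up after max_attempts rather than spinning forever.
wait_for_servers() {
  local max_attempts=${1:-180} interval=${2:-10} attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    STATUS_OUTPUT=$(ng_status 2>&1)
    HEALTHY=$(echo "$STATUS_OUTPUT" | grep -oE '[0-9]+ healthy' | head -1 | grep -oE '[0-9]+')
    UNHEALTHY=$(echo "$STATUS_OUTPUT" | grep -oE '[0-9]+ unhealthy' | head -1 | grep -oE '[0-9]+')
    if [ -n "$HEALTHY" ] && [ "$HEALTHY" -gt 0 ] && [ "${UNHEALTHY:-1}" -eq 0 ]; then
      echo "All servers ready!"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "Timed out waiting for servers after $((max_attempts * interval))s" >&2
  return 1
}
```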

@cmunley1

curious if this will support resuming rollouts like with .done file?

@gwarmstrong
Collaborator Author

curious if this will support resuming rollouts like with .done file?

@cmunley1 no concrete plans at the moment; it would be better if Gym supported resuming natively.

@cmunley1

I'm also curious whether this would/could enable end-to-end reward profiling across many parallel SLURM jobs.

I.e., say I want to run 16 rollouts per prompt for 20,000 prompts across a few hundred SLURM jobs, then aggregate results so I have the avg/stddev reward per task for filtering or curriculum learning.

We have a small feature in NeMo Gym to aggregate results, but chunking across SLURM jobs is not supported in Gym natively; it feels more suited here since ns can launch SLURM jobs.

NVIDIA-NeMo/Gym#621

This is a common workflow that currently requires custom scripts in NeMo Gym to chunk, submit parallel jobs, and merge results.

@gwarmstrong
Collaborator Author

@cmunley1 we already do something similar for ns generate; we can probably extend it to Gym rollouts, but I might do it as a follow-up PR

@Kipok Kipok (Collaborator) left a comment
added a few comments. Here is another one from codex, not fully sure if it's meaningful or not, but worth checking

Sorting prepared_commands changes executor order, but external dependencies are still attached based on the original comp_idx == 0 check later in the loop. When a non-spanning command was originally first and a spanning command is reordered ahead of it, the dependency can be attached to a non-leading executor, which can let a run_after job start before its intended upstream dependency.

#
# This can be surprising when users rely on a venv editable install / uncommitted changes.
# These flags let you force using the installed package tree (PatternPackager) regardless of CWD.
force_installed_nemo_skills = bool(int(os.getenv("NEMO_SKILLS_FORCE_INSTALLED_PACKAGE", "0")))
would be good to document this in packaging docs



@dataclass(kw_only=True)
class NemoGymRolloutsScript(BaseJobScript):
might be a good time to make scripts a subfolder and split this into different files



@dataclass(kw_only=True)
class MultiVLLMServerScript(BaseJobScript):
I'm a bit confused why we need this and the class below. vLLM should natively support data parallelism via the --data-parallel-size argument. Why can't we just use that with our normal vLLM server? Gym would then talk to a single IP, and the data split would still be handled, just on the server side.

@@ -0,0 +1,374 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
can we add a simple gpu test?

cmd = f"""set -e
set -o pipefail

echo "=== Installing NeMo Gym ==="
why do we need to install gym? It should be preinstalled in the container, right?

# Wait for vLLM server to be ready before starting ng_run
# Note: --kill-on-bad-exit in srun ensures job fails if vLLM crashes
VLLM_SERVER_URL="{vllm_server_url}"
if [ -n "$VLLM_SERVER_URL" ]; then
can we reuse get_server_wait_cmd here?
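For reference, a sketch of what such a wait loop looks like with the quoting fix called out elsewhere in the review (the `$(curl ...)` substitution is quoted so an empty result cannot word-split the test; `wait_for_vllm` and `probe_http` are hypothetical names, and `get_server_wait_cmd` is the existing helper suggested above):

```shell
# Return the HTTP status code for a URL; factored out so it can be stubbed.
probe_http() { curl -s -o /dev/null -w '%{http_code}' "$1"; }

# Wait for the vLLM OpenAI endpoint to return 200, with a bounded timeout.
wait_for_vllm() {
  local url=$1 timeout_s=${2:-600} interval=${3:-5} elapsed=0
  while [ "$(probe_http "$url/models")" != "200" ]; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      echo "vLLM server at $url not ready after ${timeout_s}s" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```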

"dummy",
help="API key for policy server. Use 'dummy' for local vLLM servers.",
),
policy_model_name: str = typer.Option(
can we just reuse the "model" parameter here? That would be consistent with how we do that in main generate pipeline

if random_seeds is not None:
# Explicit seeds provided
if isinstance(random_seeds, str):
seed_indices = [int(s.strip()) for s in random_seeds.split(",")]
can we reuse str_ids_to_list from main generate pipeline for consistency?
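For context, a minimal stand-in illustrating the kind of parsing such a helper does (the actual behavior of the repo's `str_ids_to_list` is assumed here, not verified):

```python
def parse_ids(spec) -> list[int]:
    """Parse comma-separated ints with optional a-b ranges, e.g. "0-3,7" -> [0,1,2,3,7].

    Illustrative stand-in only; reuse the real str_ids_to_list for consistency.
    """
    if isinstance(spec, (list, tuple)):
        return [int(s) for s in spec]
    out = []
    for part in str(spec).split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            out.extend(range(int(lo), int(hi) + 1))
        else:
            out.append(int(part))
    return out
```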

server_container = None
if self_hosted:
server_type_str = server_type.value if hasattr(server_type, "value") else server_type
server_container = cluster_config["containers"].get(server_type_str, server_type_str)
@gwarmstrong should we do that? Not sure why we fall back to server_type_str here. We should probably have a separate server_container argument that users can use to override, but if it's not specified we should fail if we can't find the container

skipped_seeds.append(seed_idx)
else:
filtered_seeds.append(seed_idx)
except Exception as e:
do we expect this to fail? Why a broad `except Exception`?
