feat: add NeMo Gym rollouts pipeline and multi-node vLLM support #1227
gwarmstrong wants to merge 5 commits into main
Conversation
Add `ns nemo_gym_rollouts` CLI command for orchestrating rollout collection with NeMo Gym, including self-hosted or pre-hosted vLLM servers, optional sandbox, and parallel seed-based scaling.

New script classes:
- NemoGymRolloutsScript: orchestrates ng_run + ng_collect_rollouts
- MultiVLLMServerScript: data-parallel vLLM with multiple replicas per node
- GymClientScript: multi-node vLLM server discovery via SLURM env vars

Infrastructure enhancements:
- Command: mounts, environment, workdir, avoid/force nemo_run_code controls
- Per-script num_tasks_override for mixed task configurations
- Allocation ordering fix for multi-component overlap jobs
- NEMO_SKILLS_FORCE_PATTERN_PACKAGER env var for packaging control
- --kill-on-bad-exit=1 srun flag for early failure detection

Signed-off-by: George Armstrong <georgea@nvidia.com>
d61b5f4 to d86e6c9
```bash
VLLM_SERVER_URL="{vllm_server_url}"
if [ -n "$VLLM_SERVER_URL" ]; then
    echo "=== Waiting for vLLM server at $VLLM_SERVER_URL ==="
    while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do
```
The curl command returns the HTTP status code directly in the condition, but this can fail when curl itself fails (network issues, DNS failures, etc). The command substitution $(curl ...) will return an empty string on curl failure, which bash compares numerically to 200, causing an error.
```diff
- while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do
+ while ! curl -sf "$VLLM_SERVER_URL/models" >/dev/null 2>&1; do
```
No actionable comments were generated in the recent review. 🎉

📝 Walkthrough

Adds a Nemo Gym rollout CLI command and supporting pipeline pieces: new script classes for multi-vLLM servers, gym client, and rollout orchestration; per-command mounts/env/workdir and PYTHONPATH handling; deterministic packaging flags; and a fail-fast Slurm srun option. Also re-exports new scripts and adds a CLI import.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@nemo_skills/pipeline/nemo_gym_rollouts.py`:
- Line 238: The code uses cluster_config["containers"].get(server_type_str,
server_type_str) which can silently accept typos as literal image names; change
this to validate the server_type_str against cluster_config["containers"] (or
use direct access) so missing keys fail fast: check if server_type_str is a key
in cluster_config["containers"] and if so set server_container =
cluster_config["containers"][server_type_str], otherwise raise a clear error (or
fallback to a validated explicit list) so functions referencing server_container
(and the server_type_str lookup) don’t silently accept invalid names.
In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 930-936: The error block that runs when BEST_COUNT is less than
NUM_NODES prints several echo lines with inconsistent indentation; normalize the
leading spaces so all echo lines align with the surrounding echo and exit
statements. Locate the conditional using BEST_COUNT and NUM_NODES and the echo
statements referencing BEST_VAR_NAME, BEST_NODES_STR, SLURM_JOB_NODELIST,
SLURM_NODELIST, and SLURM_STEP_NODELIST, and remove the extra leading whitespace
on the lines echoing SLURM_* and BEST_NODES_STR so formatting is consistent.
- Around line 966-972: The modulo expression ATTEMPT % (60 /
HEALTH_CHECK_INTERVAL) can divide by zero when HEALTH_CHECK_INTERVAL >= 60; fix
by computing a safe divisor first (e.g. LOG_EVERY = 60 / HEALTH_CHECK_INTERVAL
and if LOG_EVERY < 1 set LOG_EVERY=1) and then use ATTEMPT % LOG_EVERY for the
logging condition; update the block using the variables ATTEMPT,
HEALTH_CHECK_INTERVAL, LOG_EVERY (or similar) and keep the existing ELAPSED and
echo logic unchanged.
- Around line 565-572: The health-check loop uses an unquoted command
substitution which can break if curl returns empty or contains whitespace; fix
the loop in the VLLM server wait logic by capturing the curl result into a
quoted variable and comparing it safely, e.g. status="$(curl -s -o /dev/null -w
"%{http_code}" "$VLLM_SERVER_URL/models" 2>/dev/null)" and then use a quoted
comparison like while [ "$status" != "200" ]; do sleep 10; status="$(...)" done;
ensure the variable and substitutions (VLLM_SERVER_URL and the curl command) are
wrapped in double quotes to be defensive.
🧹 Nitpick comments (5)
nemo_skills/pipeline/utils/packager.py (1)
139-157: Mutually reinforcing flags — consider documenting interaction.

When `force_pattern_packager=True`, `repo_path` is `None` so we skip the `if repo_path:` branch entirely (going to the `else` at line 164). When `force_installed_nemo_skills=True` but `force_pattern_packager=False`, we enter the git repo branch but include the full installed package. These two flags interact subtly — both are documented in the comments above, which is sufficient, but worth noting that `force_installed_nemo_skills` has no effect when `force_pattern_packager` is also set (since the code path that checks it is skipped).

nemo_skills/pipeline/utils/declarative.py (1)
263-294: PYTHONPATH filtering only removes exact `/nemo_run/code` — intentional?

Line 280 uses `grep -v '^/nemo_run/code$'`, which only removes the exact path `/nemo_run/code` from PYTHONPATH. If PYTHONPATH contains `/nemo_run/code/` (trailing slash) or `/nemo_run/code/subdir`, those entries would be preserved. This is likely intentional (precise removal), but worth confirming it matches the expected behavior. Subdirectory entries like `/nemo_run/code/nemo_skills` would also remain.

nemo_skills/pipeline/nemo_gym_rollouts.py (2)
184-206: Validation logic is thorough and covers key mutual-exclusion cases.

The self-hosted vs. pre-hosted validation gates are well-structured. One edge case: if a user provides `--server_type` with `--server_address` (pre-hosted), the `server_type` is silently unused. Consider warning or erroring when `--server_type` is provided alongside `--server_address`.
247-255: Broad exception catch is acceptable here but could be narrower.

The `except Exception` catches all failures when checking remote file existence. Since this involves SSH/cluster operations that can fail unpredictably, the defensive approach with a warning log and safe fallback (include the seed) is pragmatic. If you want to tighten it, catching `(OSError, RuntimeError)` would cover network/SSH failures while still letting truly unexpected errors surface.

nemo_skills/pipeline/utils/scripts.py (1)
690-691: Remove unused `cluster_config` field or document its intended purpose.

The `cluster_config: Dict` parameter is declared but never used anywhere in `MultiVLLMServerScript` — not in `__post_init__` or elsewhere in the class. Unlike `ServerScript` and `SandboxScript`, which both pass this to their respective command builders, `MultiVLLMServerScript` builds its bash script using only other fields (model_path, num_nodes, gpus_per_server, etc.). Per coding guidelines, avoid silently ignoring user-passed parameters; either remove the field or add clear documentation if it's reserved for future use.
```python
server_container = None
if self_hosted:
    server_type_str = server_type.value if hasattr(server_type, "value") else server_type
    server_container = cluster_config["containers"].get(server_type_str, server_type_str)
```
Consider using direct dictionary access or explicit validation for container resolution.
cluster_config["containers"].get(server_type_str, server_type_str) allows passing either a registered container type or a direct image path, but this creates a UX risk: a typo like "vlm" instead of "vllm" silently becomes a literal image name, failing only at runtime instead of failing fast. The rest of the codebase uses direct dictionary access (e.g., cluster_config["containers"]["nemo-skills"]), which fails immediately on missing keys. Either validate the server type against a set of known options or use direct access if the key is always expected.
🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/nemo_gym_rollouts.py` at line 238, The code uses
cluster_config["containers"].get(server_type_str, server_type_str) which can
silently accept typos as literal image names; change this to validate the
server_type_str against cluster_config["containers"] (or use direct access) so
missing keys fail fast: check if server_type_str is a key in
cluster_config["containers"] and if so set server_container =
cluster_config["containers"][server_type_str], otherwise raise a clear error (or
fallback to a validated explicit list) so functions referencing server_container
(and the server_type_str lookup) don’t silently accept invalid names.
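A minimal sketch of the fail-fast lookup this comment recommends (the helper name and error message are illustrative, not the actual implementation):

```python
def resolve_server_container(cluster_config: dict, server_type_str: str) -> str:
    """Resolve a server type to a container image, failing fast on unknown types."""
    containers = cluster_config["containers"]
    if server_type_str not in containers:
        # A typo like "vlm" now raises immediately instead of becoming a
        # literal image name that only fails at runtime.
        raise ValueError(
            f"Unknown server type '{server_type_str}'. "
            f"Available containers: {sorted(containers)}"
        )
    return containers[server_type_str]
```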
@gwarmstrong should we do that? Not sure why we fall back to server_type_str here. We should probalby have a separate server_container argument that users can use to override, but if it's not specified we should fail if we can't find container
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
```bash
# Note: --kill-on-bad-exit in srun ensures job fails if vLLM crashes
VLLM_SERVER_URL="{vllm_server_url}"
if [ -n "$VLLM_SERVER_URL" ]; then
    echo "=== Waiting for vLLM server at $VLLM_SERVER_URL ==="
    while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do
        sleep 10
    done
    echo "vLLM server is ready!"
```
Unquoted command substitution in health-check loop.
Line 569: The $(curl ...) is not double-quoted. While curl -w "%{http_code}" normally returns "000" on connection failure, quoting defensively is safer.
🛡️ Suggested fix
```diff
- while [ $(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null) -ne 200 ]; do
+ while [ "$(curl -s -o /dev/null -w "%{{http_code}}" "$VLLM_SERVER_URL/models" 2>/dev/null)" -ne 200 ]; do
```

🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/utils/scripts.py` around lines 565 - 572, The
health-check loop uses an unquoted command substitution which can break if curl
returns empty or contains whitespace; fix the loop in the VLLM server wait logic
by capturing the curl result into a quoted variable and comparing it safely,
e.g. status="$(curl -s -o /dev/null -w "%{http_code}" "$VLLM_SERVER_URL/models"
2>/dev/null)" and then use a quoted comparison like while [ "$status" != "200"
]; do sleep 10; status="$(...)" done; ensure the variable and substitutions
(VLLM_SERVER_URL and the curl command) are wrapped in double quotes to be
defensive.
```bash
    fi
    # Log progress every 60 seconds
    if [ $((ATTEMPT % (60 / HEALTH_CHECK_INTERVAL))) -eq 0 ]; then
        ELAPSED=$(($(date +%s) - START_TIME))
        echo "  ... still waiting for server $GLOBAL_IDX (${{ELAPSED}}s elapsed)"
    fi
    sleep $HEALTH_CHECK_INTERVAL
```
Division-by-zero risk if health_check_interval ≥ 60.
Line 968: ATTEMPT % (60 / HEALTH_CHECK_INTERVAL) — bash integer division truncates, so if HEALTH_CHECK_INTERVAL > 60, 60 / HEALTH_CHECK_INTERVAL evaluates to 0, causing % 0 which is a bash arithmetic error. The default of 10 is safe, but this is a latent issue if someone configures a large interval.
🛡️ Suggested fix
```diff
- if [ $((ATTEMPT % (60 / HEALTH_CHECK_INTERVAL))) -eq 0 ]; then
+ LOG_EVERY=$((60 / HEALTH_CHECK_INTERVAL))
+ if [ "$LOG_EVERY" -le 0 ]; then LOG_EVERY=1; fi
+ if [ $((ATTEMPT % LOG_EVERY)) -eq 0 ]; then
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
    fi
    # Log progress every 60 seconds
    LOG_EVERY=$((60 / HEALTH_CHECK_INTERVAL))
    if [ "$LOG_EVERY" -le 0 ]; then LOG_EVERY=1; fi
    if [ $((ATTEMPT % LOG_EVERY)) -eq 0 ]; then
        ELAPSED=$(($(date +%s) - START_TIME))
        echo "  ... still waiting for server $GLOBAL_IDX (${{ELAPSED}}s elapsed)"
    fi
    sleep $HEALTH_CHECK_INTERVAL
```
🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/utils/scripts.py` around lines 966 - 972, The modulo
expression ATTEMPT % (60 / HEALTH_CHECK_INTERVAL) can divide by zero when
HEALTH_CHECK_INTERVAL >= 60; fix by computing a safe divisor first (e.g.
LOG_EVERY = 60 / HEALTH_CHECK_INTERVAL and if LOG_EVERY < 1 set LOG_EVERY=1) and
then use ATTEMPT % LOG_EVERY for the logging condition; update the block using
the variables ATTEMPT, HEALTH_CHECK_INTERVAL, LOG_EVERY (or similar) and keep
the existing ELAPSED and echo logic unchanged.
- Fix PYTHONPATH stripping to match /nemo_run/code prefixed paths
- Re-enable pipefail before ng_collect_rollouts
- Fix indentation in GymClientScript error block
- Remove emoji from server ready message

Signed-off-by: George Armstrong <georgea@nvidia.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/utils/declarative.py (1)
778-802: ⚠️ Potential issue | 🟠 Major

Bug: `external_deps` is not passed to the first executor when reordering occurs.

After the sort at line 754, `prepared_commands` is reordered so spanning components appear first (improving SLURM allocation). However, the loop at line 762 checks `het_idx == 0 and comp_idx == 0` using the original component indices stored in each entry. If the spanning component that now appears first was originally at `comp_idx=1`, the condition fails and no executor receives `external_deps`.

The code already acknowledges this reordering issue at lines 757–759 for `shared_packager`, but the same problem applies to `external_deps`. Since nemo-run uses the first executor to determine the SLURM allocation (line 738), the fix is to pass dependencies to the first executor in iteration order:

```diff
- for entry in prepared_commands:
+ for entry_idx, entry in enumerate(prepared_commands):
      het_idx = entry["het_idx"]
      comp_idx = entry["comp_idx"]
      ...
- exec_dependencies = external_deps if (het_idx == 0 and comp_idx == 0) else None
+ exec_dependencies = external_deps if entry_idx == 0 else None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/declarative.py` around lines 778 - 802, prepared_commands is reordered so relying on stored het_idx/comp_idx to decide who gets external_deps is incorrect; instead, determine the "first executor" by iteration order and pass external_deps to that executor. Modify the loop that builds executors (the block that calls _create_executor) to compute exec_dependencies based on the iteration position (e.g., using an iteration counter or a boolean flag) rather than the stored het_idx/comp_idx, ensuring the very first created executor receives external_deps; update any existing shared_packager handling consistently so only the first yielded executor gets external_deps.
🧹 Nitpick comments (4)
nemo_skills/pipeline/utils/declarative.py (1)
638-644: Use direct dictionary access for `exec_config["mounts"]`.

`exec_config["mounts"]` is always set at line 302, so `.get()` is unnecessary. Per coding guidelines, use direct access to fail fast if the key is ever unexpectedly missing.

Suggested fix

```diff
- extra_mounts = exec_config.get("mounts") or None
+ extra_mounts = exec_config["mounts"] or None
```

As per coding guidelines, "Do not use `.get()` for accessing dictionary keys if the code expects them to be present; use direct dictionary access `dict[key]` instead to allow proper error handling and fail fast with clear errors".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/declarative.py` around lines 638 - 644, Replace the use of exec_config.get("mounts") with direct dictionary access exec_config["mounts"] in the mounts merge block so the code fails fast if the key is missing; specifically, in the block that computes mounts (using variables mounts, extra_mounts and calling get_mounts_from_config(cluster_config)), set extra_mounts = exec_config["mounts"] (keeping the subsequent truthy check and merge logic unchanged) to follow the guideline of direct access when the key is expected to exist.

nemo_skills/pipeline/utils/scripts.py (3)
518-535: `extra_arguments` is forwarded verbatim to both `ng_run` and `ng_collect_rollouts`.

If a user passes Hydra overrides that are only valid for one of the two commands, the other will fail with an unknown-override error. Consider splitting into separate parameters (e.g., `extra_ng_run_args` / `extra_ng_collect_args`) or at least document the coupling more prominently so callers know every override must be understood by both commands.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/scripts.py` around lines 518 - 535, extra_arguments is forwarded verbatim to both ng_run and ng_collect_rollouts causing unknown-override failures; split or separate the args so each command only receives relevant overrides. Change the implementation to accept two distinct parameters (e.g., extra_ng_run_args and extra_ng_collect_args) and use extra_ng_run_args when appending to ng_run_parts and building ng_run_cmd, and use extra_ng_collect_args when appending to ng_collect_parts and building ng_collect_cmd; update constructor/attribute names and any call sites that set extra_arguments to pass the appropriate new parameter(s), or alternatively update documentation to clearly state that extra_arguments must be valid for both commands if you choose not to split.
639-805: `MultiVLLMServerScript` — solid design for data-parallel vLLM.

Clean use of SLURM env vars for per-task GPU slicing and port assignment. A few observations:

- `--trust-remote-code` is unconditionally hardcoded (line 800). If a user loads a model from a trusted local checkpoint and wants to avoid this flag, there's no way to opt out. Consider making it configurable or letting users control it via `server_args`.
- The port range (`base_port` to `base_port + total_replicas - 1`) is allocated once at init time. If any of those ports are already in use on a target node, the server will fail at runtime. This is inherent to the approach but worth noting in the docstring.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/scripts.py` around lines 639 - 805, The script unconditionally injects --trust-remote-code in MultiVLLMServerScript which prevents users from opting out; add a boolean field (e.g., trust_remote_code: bool = True) to the dataclass and conditionally include the "--trust-remote-code" token in the generated cmd only when that flag is true (or let users pass it via server_args by default). Update the __post_init__ build logic that constructs cmd to reference this new field when composing the python3 -m vllm.entrypoints.openai.api_server command, and update the class docstring to note the port allocation caveat (base_port .. base_port + _total_replicas - 1) so users know ports may be in-use on target nodes.
625-633: `hostname_ref()` in env vars won't resolve in heterogeneous jobs.

Line 630 sets `NEMO_SKILLS_SANDBOX_HOST` to `self.sandbox.hostname_ref()`. For non-het jobs this returns `"127.0.0.1"` (fine), but for het jobs it returns a bash expression like `${SLURM_MASTER_NODE_HET_GROUP_0:-localhost}`. Since env vars set via nemo-run are literal values (not shell-evaluated), OmegaConf/Gym would receive the unexpanded string.

This isn't an issue for the current single-group use case, but will silently break if someone uses this script in a heterogeneous job. Consider adding a guard or a comment.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/scripts.py` around lines 625 - 633, The code sets NEMO_SKILLS_SANDBOX_HOST to self.sandbox.hostname_ref() which can be a bash expression for heterogeneous (het) SLURM jobs and will be passed literally; change the logic around env_vars population so you detect bash-expression patterns (e.g. contains "${" or starts with "$") from self.sandbox.hostname_ref() and avoid injecting an unexpanded literal into env_vars["NEMO_SKILLS_SANDBOX_HOST"] — instead either (a) set a safe fallback like "localhost"/resolved IP, (b) omit the env var so the target process can resolve itself, and (c) add a warning log entry; keep env_vars["NEMO_SKILLS_SANDBOX_PORT"] as-is. Update the code around self.sandbox.hostname_ref(), self.sandbox.port and the return that builds {"environment": env_vars} and add a brief comment noting het-only hostname expressions must not be passed literally.
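A sketch of the guard the prompt describes (the helper name and the `localhost` fallback are assumptions, not the project's API):

```python
import logging

logger = logging.getLogger(__name__)


def safe_sandbox_host(hostname_ref: str, fallback: str = "localhost") -> str:
    """Return a literal hostname, replacing unexpanded shell expressions.

    Env vars set via nemo-run are passed literally, so a het-job reference
    like "${SLURM_MASTER_NODE_HET_GROUP_0:-localhost}" must not leak through
    to OmegaConf/Gym as an unexpanded string.
    """
    if "${" in hostname_ref or hostname_ref.startswith("$"):
        logger.warning(
            "Sandbox hostname %r is a shell expression; falling back to %r",
            hostname_ref,
            fallback,
        )
        return fallback
    return hostname_ref
```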
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 590-605: The health check currently relies on a brittle string
match of STATUS_OUTPUT for "healthy, 0 unhealthy" (in the ng_status loop) which
can break if ng_status output changes; update the loop to use a more robust
signal: prefer ng_status's exit code or a structured output (JSON) if available,
or parse numeric healthy/unhealthy counts with a regex that extracts both
numbers from STATUS_OUTPUT and checks unhealthy == 0; also add a timeout/counter
to the loop to prevent an infinite hang, referencing the ng_status call,
STATUS_OUTPUT, CURRENT_STATUS and LAST_STATUS variables for where to implement
the changes.
---
Outside diff comments:
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 778-802: prepared_commands is reordered so relying on stored
het_idx/comp_idx to decide who gets external_deps is incorrect; instead,
determine the "first executor" by iteration order and pass external_deps to that
executor. Modify the loop that builds executors (the block that calls
_create_executor) to compute exec_dependencies based on the iteration position
(e.g., using an iteration counter or a boolean flag) rather than the stored
het_idx/comp_idx, ensuring the very first created executor receives
external_deps; update any existing shared_packager handling consistently so only
the first yielded executor gets external_deps.
---
Duplicate comments:
In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 565-572: The health-check loop uses unquoted command substitution
for the curl exit code; update the while condition in the VLLM_SERVER_URL block
so the $(curl ...) expansion is wrapped in double quotes to prevent
word-splitting/empty-string issues (i.e., quote the command substitution used in
the while test that queries "$VLLM_SERVER_URL/models" so the comparison against
200 is reliable).
- Around line 933-939: The error block around the BEST_COUNT check has
consistent 4-space indentation for the echo lines but the review contains a
duplicate comment marker; remove the duplicate marker and verify the conditional
block (if [ "$BEST_COUNT" -lt "$NUM_NODES" ]; then ... exit 1) keeps the echo
lines for BEST_VAR_NAME, BEST_NODES_STR, SLURM_JOB_NODELIST, SLURM_NODELIST, and
SLURM_STEP_NODELIST indented consistently and that no stray duplicate comments
remain in scripts.py.
- Around line 963-976: The modulo expression ATTEMPT % (60 /
HEALTH_CHECK_INTERVAL) can divide by zero when HEALTH_CHECK_INTERVAL >= 60; fix
by computing a safe log-frequency before the loop (e.g. LOG_EVERY=$((60 /
HEALTH_CHECK_INTERVAL)); if [ "$LOG_EVERY" -le 0 ]; then LOG_EVERY=1; fi) and
then replace the existing check with if [ $((ATTEMPT % LOG_EVERY)) -eq 0 ]; then
...; use the same variables ATTEMPT, HEALTH_CHECK_INTERVAL, LOG_EVERY, ELAPSED
and GLOBAL_IDX to locate and update the code inside the while true loop.
---
Nitpick comments:
In `@nemo_skills/pipeline/utils/declarative.py`:
- Around line 638-644: Replace the use of exec_config.get("mounts") with direct
dictionary access exec_config["mounts"] in the mounts merge block so the code
fails fast if the key is missing; specifically, in the block that computes
mounts (using variables mounts, extra_mounts and calling
get_mounts_from_config(cluster_config)), set extra_mounts =
exec_config["mounts"] (keeping the subsequent truthy check and merge logic
unchanged) to follow the guideline of direct access when the key is expected to
exist.
In `@nemo_skills/pipeline/utils/scripts.py`:
- Around line 518-535: extra_arguments is forwarded verbatim to both ng_run and
ng_collect_rollouts causing unknown-override failures; split or separate the
args so each command only receives relevant overrides. Change the implementation
to accept two distinct parameters (e.g., extra_ng_run_args and
extra_ng_collect_args) and use extra_ng_run_args when appending to ng_run_parts
and building ng_run_cmd, and use extra_ng_collect_args when appending to
ng_collect_parts and building ng_collect_cmd; update constructor/attribute names
and any call sites that set extra_arguments to pass the appropriate new
parameter(s), or alternatively update documentation to clearly state that
extra_arguments must be valid for both commands if you choose not to split.
- Around line 639-805: The script unconditionally injects --trust-remote-code in
MultiVLLMServerScript which prevents users from opting out; add a boolean field
(e.g., trust_remote_code: bool = True) to the dataclass and conditionally
include the "--trust-remote-code" token in the generated cmd only when that flag
is true (or let users pass it via server_args by default). Update the
__post_init__ build logic that constructs cmd to reference this new field when
composing the python3 -m vllm.entrypoints.openai.api_server command, and update
the class docstring to note the port allocation caveat (base_port .. base_port +
_total_replicas - 1) so users know ports may be in-use on target nodes.
- Around line 625-633: The code sets NEMO_SKILLS_SANDBOX_HOST to
self.sandbox.hostname_ref() which can be a bash expression for heterogeneous
(het) SLURM jobs and will be passed literally; change the logic around env_vars
population so you detect bash-expression patterns (e.g. contains "${" or starts
with "$") from self.sandbox.hostname_ref() and avoid injecting an unexpanded
literal into env_vars["NEMO_SKILLS_SANDBOX_HOST"] — instead either (a) set a
safe fallback like "localhost"/resolved IP, (b) omit the env var so the target
process can resolve itself, and (c) add a warning log entry; keep
env_vars["NEMO_SKILLS_SANDBOX_PORT"] as-is. Update the code around
self.sandbox.hostname_ref(), self.sandbox.port and the return that builds
{"environment": env_vars} and add a brief comment noting het-only hostname
expressions must not be passed literally.
```bash
STATUS_OUTPUT=$(ng_status 2>&1)

if echo "$STATUS_OUTPUT" | grep -q "healthy, 0 unhealthy"; then
    echo "All servers ready!"
    break
fi

# Only print status when it changes (reduce verbosity)
CURRENT_STATUS=$(echo "$STATUS_OUTPUT" | grep -oE '[0-9]+ healthy' | head -1 || echo "starting")
if [ "$CURRENT_STATUS" != "$LAST_STATUS" ]; then
    echo "Server status: $CURRENT_STATUS"
    LAST_STATUS="$CURRENT_STATUS"
fi

sleep 10
done
```
Fragile health-check string match: "healthy, 0 unhealthy".
If ng_status changes its output format (e.g., wording, ordering, locale), the grep -q "healthy, 0 unhealthy" at line 592 will never match and the loop will spin forever (silent hang with no timeout). Consider using a more structured health-check mechanism (e.g., exit code of ng_status, or a JSON output field) if one is available.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/utils/scripts.py` around lines 590 - 605, The health
check currently relies on a brittle string match of STATUS_OUTPUT for "healthy,
0 unhealthy" (in the ng_status loop) which can break if ng_status output
changes; update the loop to use a more robust signal: prefer ng_status's exit
code or a structured output (JSON) if available, or parse numeric
healthy/unhealthy counts with a regex that extracts both numbers from
STATUS_OUTPUT and checks unhealthy == 0; also add a timeout/counter to the loop
to prevent an infinite hang, referencing the ng_status call, STATUS_OUTPUT,
CURRENT_STATUS and LAST_STATUS variables for where to implement the changes.
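As a sketch of the regex-based parsing the prompt suggests (function names are illustrative, and the status-line format is assumed from the grep pattern in the snippet above):

```python
import re
from typing import Optional, Tuple


def parse_health_counts(status_output: str) -> Optional[Tuple[int, int]]:
    """Extract (healthy, unhealthy) counts from ng_status-style output.

    Returns None when the counts cannot be found, so callers can treat
    unparseable output as "not ready yet" instead of hanging on a fixed
    string match.
    """
    match = re.search(r"(\d+)\s+healthy,\s*(\d+)\s+unhealthy", status_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def all_servers_ready(status_output: str) -> bool:
    # Ready means: counts are parseable, at least one healthy, none unhealthy.
    counts = parse_health_counts(status_output)
    return counts is not None and counts[0] > 0 and counts[1] == 0
```

In the actual bash loop this would pair naturally with an attempt counter and timeout so the wait can never spin forever.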
curious if this will support resuming rollouts like with .done file?

@cmunley1 no concrete plans at the moment, it would be better if Gym supported resuming natively.

im also curious if this would/could enable end-to-end reward profiling across many parallel Slurm jobs, i.e. say I want to run 16 rollouts per prompt, for 20,000 prompts, across a few hundred Slurm jobs, then aggregate results, so I have avg/stddev reward per task for filtering or curriculum. We have a small feature in NeMo Gym to aggregate results, but chunking across Slurm jobs is not supported in Gym natively; it feels more suited here since ns can launch Slurm jobs. This is a common workflow that currently requires custom scripts in NeMo Gym to chunk, submit parallel jobs, and merge results.

@cmunley1 we do something similar to
Kipok
left a comment
added a few comments. Here is another one from codex, not fully sure if it's meaningful or not, but worth checking
Sorting prepared_commands changes executor order, but external dependencies are still attached based on the original comp_idx == 0 check later in the loop. When a non-spanning command was originally first and a spanning command is reordered ahead of it, the dependency can be attached to a non-leading executor, which can let a run_after job start before its intended upstream dependency.
```python
#
# This can be surprising when users rely on a venv editable install / uncommitted changes.
# These flags let you force using the installed package tree (PatternPackager) regardless of CWD.
force_installed_nemo_skills = bool(int(os.getenv("NEMO_SKILLS_FORCE_INSTALLED_PACKAGE", "0")))
```
would be good to document this in packaging docs
```python
@dataclass(kw_only=True)
class NemoGymRolloutsScript(BaseJobScript):
```
might be a good time to make scripts a subfolder and split this into different files
```python
@dataclass(kw_only=True)
class MultiVLLMServerScript(BaseJobScript):
```
I'm a bit confused why we need this and the class below. vLLM should naturally support data parallel with a --data-parallel-size argument. Why can't we just use that with our normal vLLM server? And then there is a single IP Gym talks to, but it still handles the data split, just on the server side?
```diff
@@ -0,0 +1,374 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
can we add a simple gpu test?
```python
cmd = f"""set -e
set -o pipefail

echo "=== Installing NeMo Gym ==="
```
why do we need to install gym? It should be preinstalled in the container, right?
```bash
# Wait for vLLM server to be ready before starting ng_run
# Note: --kill-on-bad-exit in srun ensures job fails if vLLM crashes
VLLM_SERVER_URL="{vllm_server_url}"
if [ -n "$VLLM_SERVER_URL" ]; then
```
can we reuse get_server_wait_cmd here?
```python
    "dummy",
    help="API key for policy server. Use 'dummy' for local vLLM servers.",
),
policy_model_name: str = typer.Option(
```
can we just reuse the "model" parameter here? That would be consistent with how we do that in main generate pipeline
```python
if random_seeds is not None:
    # Explicit seeds provided
    if isinstance(random_seeds, str):
        seed_indices = [int(s.strip()) for s in random_seeds.split(",")]
```
can we reuse str_ids_to_list from main generate pipeline for consistency?
```python
server_container = None
if self_hosted:
    server_type_str = server_type.value if hasattr(server_type, "value") else server_type
    server_container = cluster_config["containers"].get(server_type_str, server_type_str)
```
@gwarmstrong should we do that? Not sure why we fall back to server_type_str here. We should probalby have a separate server_container argument that users can use to override, but if it's not specified we should fail if we can't find container
```python
            skipped_seeds.append(seed_idx)
        else:
            filtered_seeds.append(seed_idx)
except Exception as e:
```
do we expect this to fail? Why blank exception?
NeMo-Gym Rollouts Pipeline
Add `ns nemo_gym_rollouts` CLI command for orchestrating rollout collection with NeMo Gym, including self-hosted or pre-hosted vLLM servers, optional sandbox, and parallel seed-based scaling.

New script classes:

Infrastructure enhancements:
Gym Config Verification: math_with_judge (AIME24) & code_gen (LCB)
Verified that the `nemo_gym_rollouts` pipeline works end-to-end with `math_with_judge` and `code_gen` Gym configs on their full validation datasets.

Results
Model: `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` (8x GPU, vLLM, `--reasoning-parser deepseek_r1`)

4 random seeds per benchmark, `temperature=1.0`, `top_p=1.0`, `max_output_tokens=65536`.

Repro: AIME24 (math_with_judge)
Requires AIME24 validation data pre-downloaded to the cluster (see data download step below).
Repro: LCB (code_gen)
Requires LiveCodeBench v5 validation data pre-downloaded to the cluster (see data download step below).
Data Download
Both datasets must be downloaded to the Gym data directories on the cluster before running rollouts. This uses `ng_download_dataset_from_hf` inside the nemo-rl container:

Summary by CodeRabbit
New Features
Bug Fixes
Chores