
Merge Gym GRPO into main pipeline with --use-gym flag #1238

Open
gwarmstrong wants to merge 8 commits into main from georgea/move-gym-grpo-to-main-pipeline

Conversation

Collaborator

gwarmstrong commented Feb 12, 2026

Add --use-gym and --training-config flags to ns nemo_rl grpo so Gym-based GRPO training uses the same unified command instead of a separate pipeline.

Summary by CodeRabbit

  • New Features
    • GRPO Gym training support with CLI flags to enable gym-based training and pass training configs
    • New comprehensive GRPO Gym training configuration template
    • Runtime detection of NemoRL installation paths for multiple install layouts
    • New entrypoint to run GRPO training directly with NemoGym integration
  • Tests
    • Added GPU tests exercising GRPO Gym training and evaluation
    • Added a small sample dataset for Gym test/eval runs

Add --use-gym and --training-config flags to `ns nemo_rl grpo` so Gym-based
GRPO training uses the same unified command instead of a separate pipeline.

Signed-off-by: George Armstrong <georgea@nvidia.com>
gwarmstrong force-pushed the georgea/move-gym-grpo-to-main-pipeline branch from 6fa2749 to 241e195 on February 12, 2026 at 01:08
gwarmstrong and others added 4 commits on February 11, 2026 at 17:08
Signed-off-by: Wei Du <wedu@nvidia.com>
Newer container images use /opt/nemo-rl while older images and user
mounts may still use /opt/NeMo-RL. Add a shared shell snippet that
checks for the uppercase path first and falls back to lowercase,
keeping both conventions working.

Signed-off-by: George Armstrong <georgea@nvidia.com>
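
For illustration, a minimal sketch of what such a dual-path detection constant could look like (hypothetical shape only; the actual DETECT_NEMO_RL_DIR snippet added in nemo_skills/pipeline/nemo_rl/__init__.py may differ):

# Hypothetical sketch; the real DETECT_NEMO_RL_DIR may differ in detail.
# Checks the uppercase path first and falls back to lowercase, as the
# commit message describes.
DETECT_NEMO_RL_DIR = (
    "if [ -d /opt/NeMo-RL ]; then "
    "export NEMO_RL_DIR=/opt/NeMo-RL; "
    "else "
    "export NEMO_RL_DIR=/opt/nemo-rl; "
    "fi"
)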
gwarmstrong marked this pull request as ready for review on February 17, 2026 at 23:49
Contributor

coderabbitai bot commented Feb 17, 2026

📝 Walkthrough

Adds NemoGym support for GRPO: runtime NeMo‑RL directory detection, gym-aware GRPO training entrypoint and YAML config, pipeline/CLI plumbing to invoke gym mode and pass a training config, command-builder updates to use detected paths, plus tests and sample data.

Changes

  • Environment Detection (nemo_skills/pipeline/nemo_rl/__init__.py): Adds DETECT_NEMO_RL_DIR constant: shell snippet resolving NEMO_RL_DIR to /opt/NeMo-RL or /opt/nemo-rl.
  • GRPO Gym Pipeline (nemo_skills/pipeline/nemo_rl/grpo.py): Adds use_gym: bool and training_config: str to NemoRLTask; threads use_gym/training_config through CLI, get_training_cmd, and task creation; selects start_grpo_gym.py when use_gym is true; validates train/val data in gym mode; updates checkpoint command builders to prefer DETECT_NEMO_RL_DIR/$NEMO_RL_DIR.
  • SFT Command Updates (nemo_skills/pipeline/nemo_rl/sft.py): Imports DETECT_NEMO_RL_DIR and replaces hard-coded /opt/NeMo-RL with DETECT_NEMO_RL_DIR / $NEMO_RL_DIR in command and env construction.
  • Training Config (nemo_skills/training/nemo_rl/configs/grpo_gym.yaml): Adds a comprehensive GRPO gym YAML config (grpo, loss_fn, checkpointing, policy, data, env, logger, cluster) with placeholders and distributed training settings.
  • Gym Training Script (nemo_skills/training/nemo_rl/start_grpo_gym.py): New gym-aware GRPO entrypoint: CLI/Hydra overrides, JSONL dataset loaders and DatumSpec conversion, Ray init, tokenizer/NemoGym setup and registration, RL components initialization, and invocation of grpo_train.
  • Tests & Data (tests/data/small-grpo-gym-data.test, tests/gpu-tests/test_train.py): Adds a small JSONL test dataset (10 items) and new GPU tests that run gym-mode GRPO training, evaluation, and metric assertions.

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as CLI Parser
    participant Config as Config Loader
    participant Ray as Ray Runtime
    participant Tokenizer as Tokenizer
    participant Env as NemoGym Env
    participant RL as RL Components
    participant Trainer as GRPO Trainer

    User->>CLI: run start_grpo_gym.py (config + overrides)
    CLI->>Config: parse args, load YAML, apply overrides
    Config->>Ray: initialize Ray cluster / runtime env
    Ray->>Tokenizer: load tokenizer & generate config
    Config->>Env: register/create NemoGym environments
    Config->>RL: load/prepare JSONL datasets -> DatumSpec
    RL->>RL: init policy, dataloaders, loss, logger, checkpointer
    Env->>RL: map tasks to environments (health check)
    RL->>Trainer: invoke grpo_train with components
    Trainer->>Trainer: execute training loop, save checkpoints
    Trainer-->>User: return trained model/checkpoints

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 26.32%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: the title accurately captures the main change (integrating Gym GRPO into the main pipeline with a new --use-gym flag), which aligns with all major file changes.


coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/nemo_rl/grpo.py (1)

412-414: ⚠️ Potential issue | 🔴 Critical

Bug: dependent_jobs > 0 should be dependent_jobs >= 0; the error message contradicts the condition, and training still runs when dependent_jobs == 0.

At line 412 the condition is if dependent_jobs > 0:, but the error message at line 414 says "training_data is required when dependent_jobs >= 0". The loop at line 449 runs range(dependent_jobs + 1) iterations, so training executes even when dependent_jobs == 0. With the current > 0 condition, validation is skipped when dependent_jobs == 0, allowing training_data to remain None and later produce +data.train_data_path=None in the command.

Both sft.py (line 369) and ppo.py (line 340) use >= 0, establishing the correct pattern.

Also, line 73 should use training_config: str | None = None to properly type-hint that the field accepts None.

🐛 Proposed fixes
-    if dependent_jobs > 0:
+    if dependent_jobs >= 0:
-    training_config: str = None
+    training_config: str | None = None
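
To see why this matters, a simplified sketch of the guard-plus-loop pattern (illustrative only; the real grpo.py code differs):

def schedule_training(training_data, dependent_jobs: int = 0):
    # The loop below always runs at least once, so the guard must also
    # fire when dependent_jobs == 0; a '>' check skips exactly that case.
    if dependent_jobs >= 0 and training_data is None:
        raise ValueError("training_data is required when dependent_jobs >= 0")
    for _ in range(dependent_jobs + 1):
        pass  # one training job per iteration in the real pipeline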
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/nemo_rl/grpo.py` around lines 412 - 414, Fix the
conditional that validates training_data: change the check on dependent_jobs
from ">" to ">=" so that when dependent_jobs == 0 the code enforces
training_data presence (make the condition dependent_jobs >= 0 guarding the
raise ValueError("training_data is required when dependent_jobs >= 0")). Also
update the type hint for the field training_config to allow None (use
training_config: str | None = None) so the config correctly accepts null values;
use the same pattern as sft.py and ppo.py and update the symbols dependent_jobs,
training_data, and training_config accordingly.
🧹 Nitpick comments (5)
nemo_skills/training/nemo_rl/start_grpo_gym.py (3)

194-205: Prefix unused cluster variable with underscore.

The cluster variable from the setup() return tuple is never used (as flagged by static analysis). Prefix it with _ to signal intent.

Proposed fix
     (
         policy,
         policy_generation,
-        cluster,
+        _cluster,
         dataloader,
         val_dataloader,
         loss_fn,
         logger,
         checkpointer,
         grpo_state,
         master_config,
     ) = setup(config, tokenizer, train_dataset, val_dataset)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 194 - 205, The
variable returned from setup(...) named cluster is unused; update the tuple
unpacking where setup(config, tokenizer, train_dataset, val_dataset) is called
so that cluster is prefixed with an underscore (e.g., _cluster) to indicate it's
intentionally unused. Keep the same order of returned values and only rename
cluster to _cluster in the assignment (refer to the setup(...) call and the
tuple containing policy, policy_generation, cluster, ...).

76-86: Dead code: load_jsonl_dataset is defined but never called.

setup_single_nemo_gym_dataset (line 89) has its own JSONL loading logic. This function is unused.

Proposed fix — remove unused function
-def load_jsonl_dataset(filepath: str) -> Dataset:
-    """Load JSONL file as HuggingFace Dataset."""
-    records = []
-    with open(filepath, "r", encoding="utf-8") as f:
-        for line in f:
-            line = line.strip()
-            if not line:
-                continue
-            obj = json.loads(line)
-            records.append(obj)
-    return Dataset.from_list(records)
-
-
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 76 - 86, The
function load_jsonl_dataset is dead code (defined but never used); remove the
entire load_jsonl_dataset definition and also clean up any now-unused imports it
relied on (e.g., json and Dataset if they are only used by that function) so
that setup_single_nemo_gym_dataset remains the single JSONL loading path; ensure
no other references to load_jsonl_dataset exist before deleting it.

64-64: Global OmegaConf.register_new_resolver call at module level.

This has the side effect of registering the "mul" resolver on import. If another module has already registered a "mul" resolver with different behavior, this would conflict. Since the grpo_gym.yaml config depends on this resolver (for ${mul:...} interpolations), consider guarding with a try/except or checking if it's already registered. This is low-risk since this script is an entrypoint, but worth noting.
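
A guarded registration along these lines (a sketch using OmegaConf's public has_resolver check):

from omegaconf import OmegaConf

# Register "mul" only if no other module has already claimed the name,
# so importing this entrypoint never clobbers an existing resolver.
if not OmegaConf.has_resolver("mul"):
    OmegaConf.register_new_resolver("mul", lambda a, b: a * b)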

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` at line 64, The module-level
call OmegaConf.register_new_resolver("mul", lambda a, b: a * b) should be
guarded to avoid clobbering an existing resolver; update the top-level code in
start_grpo_gym.py to first check OmegaConf.has_resolver("mul") and only register
if absent, and also wrap the register call in a try/except to safely handle
unexpected errors (optionally logging or ignoring the failure) so importing this
module cannot overwrite or crash on existing resolver registrations.
nemo_skills/pipeline/nemo_rl/grpo.py (1)

72-73: Type hint for training_config doesn't reflect that it accepts None.

training_config: str = None is misleading: it defaults to None but the annotation says str. Use str | None = None (or Optional[str]) for correctness. The use_gym annotation, by contrast, is fine as bool = False.

Proposed fix
     use_gym: bool = False
-    training_config: str = None
+    training_config: str | None = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/nemo_rl/grpo.py` around lines 72 - 73, The type
annotation for the variable training_config is incorrect because it allows None
but is annotated as str; update the annotation for training_config to indicate
it may be None (e.g., use str | None or Optional[str]) while keeping the default
= None; leave use_gym as bool = False unchanged and ensure the change is applied
where training_config is declared.
nemo_skills/training/nemo_rl/configs/grpo_gym.yaml (1)

41-45: Inconsistent boolean casing in YAML config.

The file mixes YAML-standard true/false with Python-style True/False. While both are valid in YAML 1.1, the inconsistency can confuse contributors. For example, use_dynamic_sampling: False (Line 41) vs normalize_rewards: true (Line 8), or cpu_offload: False (Line 88) vs enabled: false (Line 87).

Consider normalizing to lowercase true/false throughout for consistency with YAML conventions.

Also applies to: 88-89, 99-99, 118-119, 170-171, 207-208
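
For reference, both spellings parse to the same Python booleans under a YAML 1.1 loader such as PyYAML, so the change is purely stylistic:

import yaml

# YAML 1.1 resolves True/true (and False/false) to booleans alike,
# so mixed casing changes nothing semantically.
print(yaml.safe_load("a: True\nb: true\nc: False\nd: false"))
# -> {'a': True, 'b': True, 'c': False, 'd': False}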

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/configs/grpo_gym.yaml` around lines 41 - 45, The
YAML config mixes Python-style booleans (True/False) with YAML-style lowercase
(true/false) which is inconsistent; update all boolean values to lowercase YAML
form (e.g., change use_dynamic_sampling, reward_shaping.enabled,
reward_scaling.enabled, cpu_offload, normalize_rewards and any other boolean
keys referenced like the ones around lines noted) so every boolean is `true` or
`false` consistently across the file; locate and replace occurrences in keys
such as use_dynamic_sampling, reward_shaping.enabled, reward_scaling.enabled,
cpu_offload, normalize_rewards (and the other flagged boolean entries) to use
lowercase `true`/`false`.

Comment on lines +137 to +140
if config["policy"]["make_sequence_length_divisible_by"] is None:
    tp = config["policy"]["tensor_model_parallel_size"]
    cp = config["policy"]["context_parallel_size"]
    config["policy"]["make_sequence_length_divisible_by"] = setup_make_sequence_length_divisible_by(tp, cp)
Contributor:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Scripts executed against the NVIDIA-NeMo/Skills repository to verify the finding (output widgets omitted):

fd 'grpo_gym.yaml' --type f
fd 'start_grpo_gym.py' --type f
cat -n nemo_skills/training/nemo_rl/start_grpo_gym.py | head -150 | tail -40
# Check the grpo_gym.yaml structure around the relevant keys
cat nemo_skills/training/nemo_rl/grpo_gym.yaml | head -200
cat nemo_skills/training/nemo_rl/configs/grpo_gym.yaml | head -250
# Check for megatron_cfg structure in the YAML
grep -n 'megatron_cfg\|tensor_model_parallel_size\|context_parallel_size\|make_sequence_length_divisible_by' nemo_skills/training/nemo_rl/configs/grpo_gym.yaml
# Check whether load_config / parse_hydra_overrides do any flattening
grep -n 'def load_config\|def parse_hydra_overrides' nemo_skills/training/nemo_rl/*.py
rg 'def load_config' nemo_skills/training/nemo_rl/ -A 15
rg 'def parse_hydra_overrides' nemo_skills/training/nemo_rl/ -A 15
# Check if there's any config flattening or transformation in the codebase
rg 'flatten|tensor_model_parallel_size.*policy' nemo_skills/training/nemo_rl/
rg 'load_config|parse_hydra_overrides' nemo_skills/ --type py | head -30
# Check the imports in start_grpo_gym.py
head -30 nemo_skills/training/nemo_rl/start_grpo_gym.py
sed -n '1,50p' nemo_skills/training/nemo_rl/start_grpo_gym.py | cat -n
# Locate the config utilities module
find nemo_skills/training/nemo_rl -name "*config*" -type f
cat nemo_skills/training/nemo_rl/utils/config.py
find nemo_skills/training/nemo_rl -name "config.py" -o -name "*config*.py" | head -20
fd '\.py' nemo_skills/training/nemo_rl | xargs rg -l 'def load_config|def parse_hydra_overrides'
# List the directory structure of nemo_rl and check for utils / __init__.py
ls -la nemo_skills/training/nemo_rl/
ls -la nemo_skills/training/nemo_rl/utils/ 2>/dev/null || echo "utils directory not found"
ls -la nemo_skills/training/nemo_rl/__init__.py
# Search repo-wide for the function definitions
rg 'def load_config' nemo_skills/ -t py
rg 'def parse_hydra_overrides' nemo_skills/ -t py
# Double-check the structure in other config files to confirm the pattern
grep -n 'tensor_model_parallel_size\|context_parallel_size' nemo_skills/training/nemo_rl/configs/*.yaml | head -20


Incorrect dictionary access paths — tensor_model_parallel_size and context_parallel_size must be accessed via megatron_cfg, not directly from policy.

In grpo_gym.yaml, these parameters live at policy.megatron_cfg, but the code at lines 138–139 tries to access them directly under policy, which will raise KeyError:

tp = config["policy"]["tensor_model_parallel_size"]  # KeyError: 'tensor_model_parallel_size'
cp = config["policy"]["context_parallel_size"]        # KeyError: 'context_parallel_size'

The default config avoids this because make_sequence_length_divisible_by is set via interpolation (${policy.megatron_cfg.tensor_model_parallel_size}), so the conditional never triggers. However, any custom config with make_sequence_length_divisible_by: null will hit this bug.

Fix: use the correct nested paths
     if config["policy"]["make_sequence_length_divisible_by"] is None:
-        tp = config["policy"]["tensor_model_parallel_size"]
-        cp = config["policy"]["context_parallel_size"]
+        tp = config["policy"]["megatron_cfg"]["tensor_model_parallel_size"]
+        cp = config["policy"]["megatron_cfg"]["context_parallel_size"]
         config["policy"]["make_sequence_length_divisible_by"] = setup_make_sequence_length_divisible_by(tp, cp)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 137 - 140, The
code checks config["policy"]["make_sequence_length_divisible_by"] but
incorrectly reads tensor_model_parallel_size and context_parallel_size from
config["policy"] instead of the nested config["policy"]["megatron_cfg"], causing
KeyError for custom configs; update the access in the conditional to obtain tp
and cp from config["policy"]["megatron_cfg"] and then pass them into
setup_make_sequence_length_divisible_by (i.e., replace references to
config["policy"]["tensor_model_parallel_size"] and
config["policy"]["context_parallel_size"] with
config["policy"]["megatron_cfg"]["tensor_model_parallel_size"] and
config["policy"]["megatron_cfg"]["context_parallel_size"] respectively) so
setup_make_sequence_length_divisible_by is called with the correct values.

f"echo 'Starting training' && "
f"uv run --active python /nemo_run/code/nemo_skills/training/nemo_rl/start_grpo.py "
f"uv run --active python /nemo_run/code/nemo_skills/training/nemo_rl/{start_script} "
f" {config_arg}"
Collaborator:
Suggested change
-    f" {config_arg}"
+    f" {config_arg} "

    ),
    use_gym: bool = typer.Option(
        False,
        help="If True, uses NeMo Gym for environment interaction instead of NeMo Skills prompt templating. "
Collaborator:
Suggested change
-        help="If True, uses NeMo Gym for environment interaction instead of NeMo Skills prompt templating. "
+        help="If True, uses NeMo Gym for environment interaction instead of native NeMo RL logic. "

Gym vs. native NeMo RL is the main difference here, not really prompt templating, although we do have that part as well.

    LOG.info("Extra arguments that will be passed to the underlying script: %s", extra_arguments)

    if use_gym:
        if training_data is None or validation_data is None:
Collaborator:
disabling validation or limiting to 1 batch isn't supported in gym?

checkpoint_dir: "results/grpo"
metric_name: "val_reward"
higher_is_better: true
keep_top_k: 4
Collaborator:
Suggested change
-  keep_top_k: 4
+  keep_top_k: 50

best to try to keep such high-level settings consistent across different configs

higher_is_better: true
keep_top_k: 4
save_period: 5
checkpoint_must_save_by: "00:03:35:00"
Collaborator:
this feels like a bad default; we shouldn't enforce this parameter here, but instead set it through the timeout argument inside our code. This is only good for clusters with a 4-hour timeout
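
One way to derive it from a timeout argument, as a hedged sketch (the DD:HH:MM:SS reading of the format and the 25-minute buffer are assumptions):

def checkpoint_must_save_by(timeout_minutes: int, buffer_minutes: int = 25) -> str:
    # Hypothetical helper: compute the deadline from the job timeout with a
    # safety buffer instead of hard-coding a 4-hour-cluster value.
    total = max(timeout_minutes - buffer_minutes, 0)
    days, rem = divmod(total, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return f"{days:02d}:{hours:02d}:{minutes:02d}:00"  # DD:HH:MM:SS

# e.g. checkpoint_must_save_by(240) == "00:03:35:00" for a 4-hour limit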

calculate_per_token_loss: true
scale_loss_by_dp_cp_size: false

optimizer:
Collaborator:
in the other grpo config we have some piping to sync optimizer parameters between dtensor and megatron; it would be good to reuse that and also use the same defaults. Generally I'd recommend checking all parameters in here and trying to make grpo vs grpo_gym as similar as possible in non-gym-related options

nemo_gym:
  config_paths:
    - responses_api_models/vllm_model/configs/vllm_model_for_training.yaml
    - resources_servers/math_with_judge/configs/math_with_judge.yaml
Collaborator:
should we not have any default here? Otherwise it might be harder to override from the cmdline? Although I guess gym is pretty hard to set from the cmdline anyway?

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Collaborator:
please add a link to the reference file


    if use_gym:
        if training_data is None or validation_data is None:
            raise typer.BadParameter("--use-gym requires both --training-data and --validation-data to be specified.")
Collaborator:
Suggested change
-            raise typer.BadParameter("--use-gym requires both --training-data and --validation-data to be specified.")
+            raise typer.BadParameter("--use_gym requires both --training_data and --validation_data to be specified.")

help="If True, uses NeMo Gym for environment interaction instead of NeMo Skills prompt templating. "
"Requires both --training-data and --validation-data.",
),
training_config: str = typer.Option(
Collaborator:
should we handle this default-setting logic explicitly in this script? That might be a bit more explicit than delegating to the underlying gym script

Collaborator:
In the long run, do we still need to support the nemo-skills environment?

- Add test_grpo_gym_nemo_rl GPU test mirroring the non-gym GRPO test
- Add small-grpo-gym-data.test with 10 simple math problems in Gym format
- Add FSDP optimizer/scheduler defaults to grpo_gym.yaml so both backends work
  (was previously null, only worked with megatron)

Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
coderabbitai bot left a comment

🧹 Nitpick comments (1)
tests/gpu-tests/test_train.py (1)

193-194: @pytest.mark.parametrize with a single-element list is unnecessary overhead.

Since only "fsdp" is tested (gym mode apparently doesn't yet support megatron), this adds pytest overhead with no actual parametrization. Either drop the decorator and hardcode the backend, or leave a comment explaining megatron support is planned.

♻️ Suggested simplification (if megatron is not planned)
-@pytest.mark.parametrize("backend", ["fsdp"])
-def test_grpo_gym_nemo_rl(backend):
+def test_grpo_gym_nemo_rl():
+    backend = "fsdp"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu-tests/test_train.py` around lines 193 - 194, The
`@pytest.mark.parametrize`("backend", ["fsdp"]) decorator is unnecessary since
only "fsdp" is used; remove the parametrize decorator and hardcode the backend
value in the test (or, if megatron support is planned later, keep the decorator
but add a short inline comment referencing planned megatron support). Locate the
test decorated with `@pytest.mark.gpu` and `@pytest.mark.parametrize` in
tests/gpu-tests/test_train.py (the decorator lines shown) and either delete the
parametrize decorator and replace any parameter usage with the literal "fsdp",
or add a comment above the parametrize explaining why only "fsdp" is included
and that "megatron" support may be added later.

Signed-off-by: Wei Du <wedu@nvidia.com>
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
nemo_skills/training/nemo_rl/start_grpo_gym.py (2)

194-205: Prefix unused cluster with an underscore.

Static analysis flags cluster (line 197) as never used. Convention is to prefix it _cluster (or _) to signal intent.

Proposed fix
     (
         policy,
         policy_generation,
-        cluster,
+        _cluster,
         dataloader,
         val_dataloader,
         loss_fn,
         logger,
         checkpointer,
         grpo_state,
         master_config,
     ) = setup(config, tokenizer, train_dataset, val_dataset)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 194 - 205, The
variable returned from setup(...) named cluster is never used; change its name
in the tuple unpacking to _cluster (or _) to indicate it's intentionally unused
— update the assignment that unpacks (policy, policy_generation, cluster,
dataloader, ...) = setup(...) to use _cluster instead so static analysis and
linters stop flagging it.

76-86: load_jsonl_dataset is unused dead code.

This function is never called anywhere in the codebase. The main() function delegates data loading to setup_single_nemo_gym_dataset, which handles JSONL loading independently. Remove this function or refactor setup_single_nemo_gym_dataset to use it as a helper if the HuggingFace Dataset wrapper is needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 76 - 86, The
load_jsonl_dataset function is dead code and should be removed or reused; either
delete the unused function load_jsonl_dataset from
nemo_skills/training/nemo_rl/start_grpo_gym.py, or refactor
setup_single_nemo_gym_dataset to call load_jsonl_dataset for
JSONL-to-HuggingFace Dataset conversion (replace inline JSONL parsing inside
setup_single_nemo_gym_dataset with a call to load_jsonl_dataset and return its
Dataset) so there is no redundant parsing logic.

Comment on lines +89 to +92
def setup_single_nemo_gym_dataset(jsonl_fpath: str, tokenizer, num_repeats: Optional[int] = None):
    """Setup NemoGym dataset from JSONL file."""
    with open(jsonl_fpath) as f:
        nemo_gym_examples = list(map(json.loads, f))
Contributor:
⚠️ Potential issue | 🟡 Minor

Trailing empty lines in JSONL will cause JSONDecodeError.

Unlike load_jsonl_dataset (which strips and skips blanks), this function feeds every line—including a common trailing empty line—straight into json.loads. A file ending with \n (the norm on Unix) will produce an empty string that blows up.

Proposed fix
     with open(jsonl_fpath) as f:
-        nemo_gym_examples = list(map(json.loads, f))
+        nemo_gym_examples = [json.loads(line) for line in f if line.strip()]
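
The same pattern as a standalone helper, for comparison (a sketch mirroring load_jsonl_dataset's blank-line handling):

import json

def read_jsonl(path: str) -> list[dict]:
    # Skip blank lines (including the trailing newline typical on Unix)
    # before parsing, so a well-formed JSONL file never raises.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]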
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 89 - 92, The
function setup_single_nemo_gym_dataset currently calls json.loads on every raw
line which will raise JSONDecodeError on trailing blank lines; change the file
read to iterate lines, strip each line and skip empty ones before calling
json.loads so only non-blank lines are parsed (match behavior of
load_jsonl_dataset); update references inside setup_single_nemo_gym_dataset to
build nemo_gym_examples from filtered json.loads results.
