
Merge Gym GRPO into main pipeline with --use-gym flag #1238

Open
gwarmstrong wants to merge 8 commits into main from georgea/move-gym-grpo-to-main-pipeline

Conversation

Collaborator

gwarmstrong commented Feb 12, 2026

Add --use-gym and --training-config flags to ns nemo_rl grpo so Gym-based GRPO training uses the same unified command instead of a separate pipeline.

Summary by CodeRabbit

  • New Features
    • GRPO Gym training support with CLI flags to enable gym-based training and pass training configs
    • New comprehensive GRPO Gym training configuration template
    • Runtime detection of NemoRL installation paths for multiple install layouts
    • New entrypoint to run GRPO training directly with NemoGym integration
  • Tests
    • Added GPU tests exercising GRPO Gym training and evaluation
    • Added a small sample dataset for Gym test/eval runs

Add --use-gym and --training-config flags to `ns nemo_rl grpo` so Gym-based
GRPO training uses the same unified command instead of a separate pipeline.

Signed-off-by: George Armstrong <georgea@nvidia.com>
gwarmstrong force-pushed the georgea/move-gym-grpo-to-main-pipeline branch from 6fa2749 to 241e195 on February 12, 2026 at 01:08
gwarmstrong and others added 4 commits on February 11, 2026 at 17:08
Signed-off-by: Wei Du <wedu@nvidia.com>
Newer container images use /opt/nemo-rl while older images and user
mounts may still use /opt/NeMo-RL. Add a shared shell snippet that
checks for the uppercase path first and falls back to lowercase,
keeping both conventions working.

Signed-off-by: George Armstrong <georgea@nvidia.com>
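
For illustration, a minimal sketch of what such a dual-path detection constant could look like (hypothetical shape only; the actual DETECT_NEMO_RL_DIR snippet added in nemo_skills/pipeline/nemo_rl/__init__.py may differ):

# Hypothetical sketch; the real DETECT_NEMO_RL_DIR may differ in detail.
# Checks the uppercase path first and falls back to lowercase, as the
# commit message describes.
DETECT_NEMO_RL_DIR = (
    "if [ -d /opt/NeMo-RL ]; then "
    "export NEMO_RL_DIR=/opt/NeMo-RL; "
    "else "
    "export NEMO_RL_DIR=/opt/nemo-rl; "
    "fi"
)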
gwarmstrong marked this pull request as ready for review on February 17, 2026 at 23:49
Contributor

coderabbitai bot commented Feb 17, 2026

📝 Walkthrough

Adds NemoGym support for GRPO: runtime NeMo‑RL directory detection, gym-aware GRPO training entrypoint and YAML config, pipeline/CLI plumbing to invoke gym mode and pass a training config, command-builder updates to use detected paths, plus tests and sample data.

Changes

  • Environment Detection (nemo_skills/pipeline/nemo_rl/__init__.py): Adds DETECT_NEMO_RL_DIR constant: shell snippet resolving NEMO_RL_DIR to /opt/NeMo-RL or /opt/nemo-rl.
  • GRPO Gym Pipeline (nemo_skills/pipeline/nemo_rl/grpo.py): Adds use_gym: bool and training_config: str to NemoRLTask; threads use_gym/training_config through CLI, get_training_cmd, and task creation; selects start_grpo_gym.py when use_gym is true; validates train/val data in gym mode; updates checkpoint command builders to prefer DETECT_NEMO_RL_DIR/$NEMO_RL_DIR.
  • SFT Command Updates (nemo_skills/pipeline/nemo_rl/sft.py): Imports DETECT_NEMO_RL_DIR and replaces hard-coded /opt/NeMo-RL with DETECT_NEMO_RL_DIR / $NEMO_RL_DIR in command and env construction.
  • Training Config (nemo_skills/training/nemo_rl/configs/grpo_gym.yaml): Adds a comprehensive GRPO gym YAML config (grpo, loss_fn, checkpointing, policy, data, env, logger, cluster) with placeholders and distributed training settings.
  • Gym Training Script (nemo_skills/training/nemo_rl/start_grpo_gym.py): New gym-aware GRPO entrypoint: CLI/Hydra overrides, JSONL dataset loaders and DatumSpec conversion, Ray init, tokenizer/NemoGym setup and registration, RL components initialization, and invocation of grpo_train.
  • Tests & Data (tests/data/small-grpo-gym-data.test, tests/gpu-tests/test_train.py): Adds a small JSONL test dataset (10 items) and new GPU tests that run gym-mode GRPO training, evaluation, and metric assertions.

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as CLI Parser
    participant Config as Config Loader
    participant Ray as Ray Runtime
    participant Tokenizer as Tokenizer
    participant Env as NemoGym Env
    participant RL as RL Components
    participant Trainer as GRPO Trainer

    User->>CLI: run start_grpo_gym.py (config + overrides)
    CLI->>Config: parse args, load YAML, apply overrides
    Config->>Ray: initialize Ray cluster / runtime env
    Ray->>Tokenizer: load tokenizer & generate config
    Config->>Env: register/create NemoGym environments
    Config->>RL: load/prepare JSONL datasets -> DatumSpec
    RL->>RL: init policy, dataloaders, loss, logger, checkpointer
    Env->>RL: map tasks to environments (health check)
    RL->>Trainer: invoke grpo_train with components
    Trainer->>Trainer: execute training loop, save checkpoints
    Trainer-->>User: return trained model/checkpoints

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 26.32%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: the title accurately captures the main change (integrating Gym GRPO into the main pipeline with a new --use-gym flag), which aligns with all major file changes.


coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/nemo_rl/grpo.py (1)

412-414: ⚠️ Potential issue | 🔴 Critical

Bug: dependent_jobs > 0 should be dependent_jobs >= 0; the error message contradicts the condition, and training still runs when dependent_jobs == 0.

At line 412 the condition is if dependent_jobs > 0:, but the error message at line 414 says "training_data is required when dependent_jobs >= 0". The loop at line 449 runs range(dependent_jobs + 1) iterations, so training executes even when dependent_jobs == 0. With the current > 0 condition, validation is skipped when dependent_jobs == 0, allowing training_data to remain None and later produce +data.train_data_path=None in the command.

Both sft.py (line 369) and ppo.py (line 340) use >= 0, establishing the correct pattern.

Also, line 73 should use training_config: str | None = None to properly type-hint that the field accepts None.

🐛 Proposed fixes
-    if dependent_jobs > 0:
+    if dependent_jobs >= 0:
-    training_config: str = None
+    training_config: str | None = None
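
To see why this matters, a simplified sketch of the guard-plus-loop pattern (illustrative only; the real grpo.py code differs):

def schedule_training(training_data, dependent_jobs: int = 0):
    # The loop below always runs at least once, so the guard must also
    # fire when dependent_jobs == 0; a '>' check skips exactly that case.
    if dependent_jobs >= 0 and training_data is None:
        raise ValueError("training_data is required when dependent_jobs >= 0")
    for _ in range(dependent_jobs + 1):
        pass  # one training job per iteration in the real pipeline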
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/nemo_rl/grpo.py` around lines 412 - 414, Fix the
conditional that validates training_data: change the check on dependent_jobs
from ">" to ">=" so that when dependent_jobs == 0 the code enforces
training_data presence (make the condition dependent_jobs >= 0 guarding the
raise ValueError("training_data is required when dependent_jobs >= 0")). Also
update the type hint for the field training_config to allow None (use
training_config: str | None = None) so the config correctly accepts null values;
use the same pattern as sft.py and ppo.py and update the symbols dependent_jobs,
training_data, and training_config accordingly.
🧹 Nitpick comments (5)
nemo_skills/training/nemo_rl/start_grpo_gym.py (3)

194-205: Prefix unused cluster variable with underscore.

The cluster variable from the setup() return tuple is never used (as flagged by static analysis). Prefix it with _ to signal intent.

Proposed fix
     (
         policy,
         policy_generation,
-        cluster,
+        _cluster,
         dataloader,
         val_dataloader,
         loss_fn,
         logger,
         checkpointer,
         grpo_state,
         master_config,
     ) = setup(config, tokenizer, train_dataset, val_dataset)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 194 - 205, The
variable returned from setup(...) named cluster is unused; update the tuple
unpacking where setup(config, tokenizer, train_dataset, val_dataset) is called
so that cluster is prefixed with an underscore (e.g., _cluster) to indicate it's
intentionally unused. Keep the same order of returned values and only rename
cluster to _cluster in the assignment (refer to the setup(...) call and the
tuple containing policy, policy_generation, cluster, ...).

76-86: Dead code: load_jsonl_dataset is defined but never called.

setup_single_nemo_gym_dataset (line 89) has its own JSONL loading logic. This function is unused.

Proposed fix — remove unused function
-def load_jsonl_dataset(filepath: str) -> Dataset:
-    """Load JSONL file as HuggingFace Dataset."""
-    records = []
-    with open(filepath, "r", encoding="utf-8") as f:
-        for line in f:
-            line = line.strip()
-            if not line:
-                continue
-            obj = json.loads(line)
-            records.append(obj)
-    return Dataset.from_list(records)
-
-
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 76 - 86, The
function load_jsonl_dataset is dead code (defined but never used); remove the
entire load_jsonl_dataset definition and also clean up any now-unused imports it
relied on (e.g., json and Dataset if they are only used by that function) so
that setup_single_nemo_gym_dataset remains the single JSONL loading path; ensure
no other references to load_jsonl_dataset exist before deleting it.

64-64: Global OmegaConf.register_new_resolver call at module level.

This has the side effect of registering the "mul" resolver on import. If another module has already registered a "mul" resolver with different behavior, this would conflict. Since the grpo_gym.yaml config depends on this resolver (for ${mul:...} interpolations), consider guarding with a try/except or checking if it's already registered. This is low-risk since this script is an entrypoint, but worth noting.
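
A guarded registration along these lines (a sketch using OmegaConf's public has_resolver check):

from omegaconf import OmegaConf

# Register "mul" only if no other module has already claimed the name,
# so importing this entrypoint never clobbers an existing resolver.
if not OmegaConf.has_resolver("mul"):
    OmegaConf.register_new_resolver("mul", lambda a, b: a * b)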

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` at line 64, The module-level
call OmegaConf.register_new_resolver("mul", lambda a, b: a * b) should be
guarded to avoid clobbering an existing resolver; update the top-level code in
start_grpo_gym.py to first check OmegaConf.has_resolver("mul") and only register
if absent, and also wrap the register call in a try/except to safely handle
unexpected errors (optionally logging or ignoring the failure) so importing this
module cannot overwrite or crash on existing resolver registrations.
nemo_skills/pipeline/nemo_rl/grpo.py (1)

72-73: Type hint for training_config doesn't reflect that it accepts None.

training_config: str = None is misleading: it defaults to None but the annotation says str. Use str | None = None (or Optional[str]) for correctness. The use_gym annotation, by contrast, is fine as bool = False.

Proposed fix
     use_gym: bool = False
-    training_config: str = None
+    training_config: str | None = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/nemo_rl/grpo.py` around lines 72 - 73, The type
annotation for the variable training_config is incorrect because it allows None
but is annotated as str; update the annotation for training_config to indicate
it may be None (e.g., use str | None or Optional[str]) while keeping the default
= None; leave use_gym as bool = False unchanged and ensure the change is applied
where training_config is declared.
nemo_skills/training/nemo_rl/configs/grpo_gym.yaml (1)

41-45: Inconsistent boolean casing in YAML config.

The file mixes YAML-standard true/false with Python-style True/False. While both are valid in YAML 1.1, the inconsistency can confuse contributors. For example, use_dynamic_sampling: False (Line 41) vs normalize_rewards: true (Line 8), or cpu_offload: False (Line 88) vs enabled: false (Line 87).

Consider normalizing to lowercase true/false throughout for consistency with YAML conventions.

Also applies to: 88-89, 99-99, 118-119, 170-171, 207-208
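
For reference, both spellings parse to the same Python booleans under a YAML 1.1 loader such as PyYAML, so the change is purely stylistic:

import yaml

# YAML 1.1 resolves True/true (and False/false) to booleans alike,
# so mixed casing changes nothing semantically.
print(yaml.safe_load("a: True\nb: true\nc: False\nd: false"))
# -> {'a': True, 'b': True, 'c': False, 'd': False}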

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/configs/grpo_gym.yaml` around lines 41 - 45, The
YAML config mixes Python-style booleans (True/False) with YAML-style lowercase
(true/false) which is inconsistent; update all boolean values to lowercase YAML
form (e.g., change use_dynamic_sampling, reward_shaping.enabled,
reward_scaling.enabled, cpu_offload, normalize_rewards and any other boolean
keys referenced like the ones around lines noted) so every boolean is `true` or
`false` consistently across the file; locate and replace occurrences in keys
such as use_dynamic_sampling, reward_shaping.enabled, reward_scaling.enabled,
cpu_offload, normalize_rewards (and the other flagged boolean entries) to use
lowercase `true`/`false`.

Comment on lines +137 to +140
if config["policy"]["make_sequence_length_divisible_by"] is None:
    tp = config["policy"]["tensor_model_parallel_size"]
    cp = config["policy"]["context_parallel_size"]
    config["policy"]["make_sequence_length_divisible_by"] = setup_make_sequence_length_divisible_by(tp, cp)
Contributor:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Scripts executed against the NVIDIA-NeMo/Skills repository to verify the finding (output widgets omitted):

fd 'grpo_gym.yaml' --type f
fd 'start_grpo_gym.py' --type f
cat -n nemo_skills/training/nemo_rl/start_grpo_gym.py | head -150 | tail -40
# Check the grpo_gym.yaml structure around the relevant keys
cat nemo_skills/training/nemo_rl/grpo_gym.yaml | head -200
cat nemo_skills/training/nemo_rl/configs/grpo_gym.yaml | head -250
# Check for megatron_cfg structure in the YAML
grep -n 'megatron_cfg\|tensor_model_parallel_size\|context_parallel_size\|make_sequence_length_divisible_by' nemo_skills/training/nemo_rl/configs/grpo_gym.yaml
# Check whether load_config / parse_hydra_overrides do any flattening
grep -n 'def load_config\|def parse_hydra_overrides' nemo_skills/training/nemo_rl/*.py
rg 'def load_config' nemo_skills/training/nemo_rl/ -A 15
rg 'def parse_hydra_overrides' nemo_skills/training/nemo_rl/ -A 15
# Check if there's any config flattening or transformation in the codebase
rg 'flatten|tensor_model_parallel_size.*policy' nemo_skills/training/nemo_rl/
rg 'load_config|parse_hydra_overrides' nemo_skills/ --type py | head -30
# Check the imports in start_grpo_gym.py
head -30 nemo_skills/training/nemo_rl/start_grpo_gym.py
sed -n '1,50p' nemo_skills/training/nemo_rl/start_grpo_gym.py | cat -n
# Locate the config utilities module
find nemo_skills/training/nemo_rl -name "*config*" -type f
cat nemo_skills/training/nemo_rl/utils/config.py
find nemo_skills/training/nemo_rl -name "config.py" -o -name "*config*.py" | head -20
fd '\.py' nemo_skills/training/nemo_rl | xargs rg -l 'def load_config|def parse_hydra_overrides'
# List the directory structure of nemo_rl and check for utils / __init__.py
ls -la nemo_skills/training/nemo_rl/
ls -la nemo_skills/training/nemo_rl/utils/ 2>/dev/null || echo "utils directory not found"
ls -la nemo_skills/training/nemo_rl/__init__.py
# Search repo-wide for the function definitions
rg 'def load_config' nemo_skills/ -t py
rg 'def parse_hydra_overrides' nemo_skills/ -t py
# Double-check the structure in other config files to confirm the pattern
grep -n 'tensor_model_parallel_size\|context_parallel_size' nemo_skills/training/nemo_rl/configs/*.yaml | head -20


Incorrect dictionary access paths — tensor_model_parallel_size and context_parallel_size must be accessed via megatron_cfg, not directly from policy.

In grpo_gym.yaml, these parameters live at policy.megatron_cfg, but the code at lines 138–139 tries to access them directly under policy, which will raise KeyError:

tp = config["policy"]["tensor_model_parallel_size"]  # KeyError: 'tensor_model_parallel_size'
cp = config["policy"]["context_parallel_size"]        # KeyError: 'context_parallel_size'

The default config avoids this because make_sequence_length_divisible_by is set via interpolation (${policy.megatron_cfg.tensor_model_parallel_size}), so the conditional never triggers. However, any custom config with make_sequence_length_divisible_by: null will hit this bug.

Fix: use the correct nested paths
     if config["policy"]["make_sequence_length_divisible_by"] is None:
-        tp = config["policy"]["tensor_model_parallel_size"]
-        cp = config["policy"]["context_parallel_size"]
+        tp = config["policy"]["megatron_cfg"]["tensor_model_parallel_size"]
+        cp = config["policy"]["megatron_cfg"]["context_parallel_size"]
         config["policy"]["make_sequence_length_divisible_by"] = setup_make_sequence_length_divisible_by(tp, cp)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 137 - 140, The
code checks config["policy"]["make_sequence_length_divisible_by"] but
incorrectly reads tensor_model_parallel_size and context_parallel_size from
config["policy"] instead of the nested config["policy"]["megatron_cfg"], causing
KeyError for custom configs; update the access in the conditional to obtain tp
and cp from config["policy"]["megatron_cfg"] and then pass them into
setup_make_sequence_length_divisible_by (i.e., replace references to
config["policy"]["tensor_model_parallel_size"] and
config["policy"]["context_parallel_size"] with
config["policy"]["megatron_cfg"]["tensor_model_parallel_size"] and
config["policy"]["megatron_cfg"]["context_parallel_size"] respectively) so
setup_make_sequence_length_divisible_by is called with the correct values.

f"echo 'Starting training' && "
f"uv run --active python /nemo_run/code/nemo_skills/training/nemo_rl/start_grpo.py "
f"uv run --active python /nemo_run/code/nemo_skills/training/nemo_rl/{start_script} "
f" {config_arg}"
Collaborator:
Suggested change
-    f" {config_arg}"
+    f" {config_arg} "

    ),
    use_gym: bool = typer.Option(
        False,
        help="If True, uses NeMo Gym for environment interaction instead of NeMo Skills prompt templating. "
Collaborator:
Suggested change
-        help="If True, uses NeMo Gym for environment interaction instead of NeMo Skills prompt templating. "
+        help="If True, uses NeMo Gym for environment interaction instead of native NeMo RL logic. "

Gym vs. native NeMo RL is the main difference here, not really prompt templating, although we do have that part as well.

    LOG.info("Extra arguments that will be passed to the underlying script: %s", extra_arguments)

    if use_gym:
        if training_data is None or validation_data is None:
Collaborator:
disabling validation or limiting to 1 batch isn't supported in gym?

checkpoint_dir: "results/grpo"
metric_name: "val_reward"
higher_is_better: true
keep_top_k: 4
Collaborator:
Suggested change
-  keep_top_k: 4
+  keep_top_k: 50

best to try to keep such high-level settings consistent across different configs

higher_is_better: true
keep_top_k: 4
save_period: 5
checkpoint_must_save_by: "00:03:35:00"
Collaborator:
this feels like a bad default; we shouldn't enforce this parameter here, but instead set it through the timeout argument inside our code. This is only good for clusters with a 4-hour timeout
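
One way to derive it from a timeout argument, as a hedged sketch (the DD:HH:MM:SS reading of the format and the 25-minute buffer are assumptions):

def checkpoint_must_save_by(timeout_minutes: int, buffer_minutes: int = 25) -> str:
    # Hypothetical helper: compute the deadline from the job timeout with a
    # safety buffer instead of hard-coding a 4-hour-cluster value.
    total = max(timeout_minutes - buffer_minutes, 0)
    days, rem = divmod(total, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return f"{days:02d}:{hours:02d}:{minutes:02d}:00"  # DD:HH:MM:SS

# e.g. checkpoint_must_save_by(240) == "00:03:35:00" for a 4-hour limit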

calculate_per_token_loss: true
scale_loss_by_dp_cp_size: false

optimizer:
Collaborator:
in the other grpo config we have some piping to sync optimizer parameters between dtensor and megatron; it would be good to reuse that and also use the same defaults. Generally I'd recommend checking all parameters in here and trying to make grpo vs grpo_gym as similar as possible in non-gym-related options

nemo_gym:
  config_paths:
    - responses_api_models/vllm_model/configs/vllm_model_for_training.yaml
    - resources_servers/math_with_judge/configs/math_with_judge.yaml
Collaborator:
should we not have any default here? Otherwise it might be harder to override from the cmdline? Although I guess gym is pretty hard to set from the cmdline anyway?

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Collaborator:
please add a link to the reference file


    if use_gym:
        if training_data is None or validation_data is None:
            raise typer.BadParameter("--use-gym requires both --training-data and --validation-data to be specified.")
Collaborator:
Suggested change
-            raise typer.BadParameter("--use-gym requires both --training-data and --validation-data to be specified.")
+            raise typer.BadParameter("--use_gym requires both --training_data and --validation_data to be specified.")

help="If True, uses NeMo Gym for environment interaction instead of NeMo Skills prompt templating. "
"Requires both --training-data and --validation-data.",
),
training_config: str = typer.Option(
Collaborator:
should we handle this default-setting logic explicitly in this script? That might be a bit more explicit than delegating to the underlying gym script

Collaborator:
In the long run, do we still need to support the nemo-skills environment?

- Add test_grpo_gym_nemo_rl GPU test mirroring the non-gym GRPO test
- Add small-grpo-gym-data.test with 10 simple math problems in Gym format
- Add FSDP optimizer/scheduler defaults to grpo_gym.yaml so both backends work
  (was previously null, only worked with megatron)

Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
coderabbitai bot left a comment

🧹 Nitpick comments (1)
tests/gpu-tests/test_train.py (1)

193-194: @pytest.mark.parametrize with a single-element list is unnecessary overhead.

Since only "fsdp" is tested (gym mode apparently doesn't yet support megatron), this adds pytest overhead with no actual parametrization. Either drop the decorator and hardcode the backend, or leave a comment explaining megatron support is planned.

♻️ Suggested simplification (if megatron is not planned)
-@pytest.mark.parametrize("backend", ["fsdp"])
-def test_grpo_gym_nemo_rl(backend):
+def test_grpo_gym_nemo_rl():
+    backend = "fsdp"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu-tests/test_train.py` around lines 193 - 194, The
`@pytest.mark.parametrize`("backend", ["fsdp"]) decorator is unnecessary since
only "fsdp" is used; remove the parametrize decorator and hardcode the backend
value in the test (or, if megatron support is planned later, keep the decorator
but add a short inline comment referencing planned megatron support). Locate the
test decorated with `@pytest.mark.gpu` and `@pytest.mark.parametrize` in
tests/gpu-tests/test_train.py (the decorator lines shown) and either delete the
parametrize decorator and replace any parameter usage with the literal "fsdp",
or add a comment above the parametrize explaining why only "fsdp" is included
and that "megatron" support may be added later.

Signed-off-by: Wei Du <wedu@nvidia.com>
coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
nemo_skills/training/nemo_rl/start_grpo_gym.py (2)

194-205: Prefix unused cluster with an underscore.

Static analysis flags cluster (line 197) as never used. Convention is to prefix it _cluster (or _) to signal intent.

Proposed fix
     (
         policy,
         policy_generation,
-        cluster,
+        _cluster,
         dataloader,
         val_dataloader,
         loss_fn,
         logger,
         checkpointer,
         grpo_state,
         master_config,
     ) = setup(config, tokenizer, train_dataset, val_dataset)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 194 - 205, The
variable returned from setup(...) named cluster is never used; change its name
in the tuple unpacking to _cluster (or _) to indicate it's intentionally unused
— update the assignment that unpacks (policy, policy_generation, cluster,
dataloader, ...) = setup(...) to use _cluster instead so static analysis and
linters stop flagging it.

76-86: load_jsonl_dataset is unused dead code.

This function is never called anywhere in the codebase. The main() function delegates data loading to setup_single_nemo_gym_dataset, which handles JSONL loading independently. Remove this function or refactor setup_single_nemo_gym_dataset to use it as a helper if the HuggingFace Dataset wrapper is needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 76 - 86, The
load_jsonl_dataset function is dead code and should be removed or reused; either
delete the unused function load_jsonl_dataset from
nemo_skills/training/nemo_rl/start_grpo_gym.py, or refactor
setup_single_nemo_gym_dataset to call load_jsonl_dataset for
JSONL-to-HuggingFace Dataset conversion (replace inline JSONL parsing inside
setup_single_nemo_gym_dataset with a call to load_jsonl_dataset and return its
Dataset) so there is no redundant parsing logic.

Comment on lines +89 to +92
def setup_single_nemo_gym_dataset(jsonl_fpath: str, tokenizer, num_repeats: Optional[int] = None):
    """Setup NemoGym dataset from JSONL file."""
    with open(jsonl_fpath) as f:
        nemo_gym_examples = list(map(json.loads, f))
Contributor:
⚠️ Potential issue | 🟡 Minor

Trailing empty lines in JSONL will cause JSONDecodeError.

Unlike load_jsonl_dataset (which strips and skips blanks), this function feeds every line—including a common trailing empty line—straight into json.loads. A file ending with \n (the norm on Unix) will produce an empty string that blows up.

Proposed fix
     with open(jsonl_fpath) as f:
-        nemo_gym_examples = list(map(json.loads, f))
+        nemo_gym_examples = [json.loads(line) for line in f if line.strip()]
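
The same pattern as a standalone helper, for comparison (a sketch mirroring load_jsonl_dataset's blank-line handling):

import json

def read_jsonl(path: str) -> list[dict]:
    # Skip blank lines (including the trailing newline typical on Unix)
    # before parsing, so a well-formed JSONL file never raises.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]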
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 89 - 92, The
function setup_single_nemo_gym_dataset currently calls json.loads on every raw
line which will raise JSONDecodeError on trailing blank lines; change the file
read to iterate lines, strip each line and skip empty ones before calling
json.loads so only non-blank lines are parsed (match behavior of
load_jsonl_dataset); update references inside setup_single_nemo_gym_dataset to
build nemo_gym_examples from filtered json.loads results.
