Merge Gym GRPO into main pipeline with --use-gym flag #1238

gwarmstrong wants to merge 8 commits into main
Conversation
Add --use-gym and --training-config flags to `ns nemo_rl grpo` so Gym-based GRPO training uses the same unified command instead of a separate pipeline.

Signed-off-by: George Armstrong <georgea@nvidia.com>
Force-pushed from 6fa2749 to 241e195
Signed-off-by: Wei Du <wedu@nvidia.com>
Newer container images use /opt/nemo-rl while older images and user mounts may still use /opt/NeMo-RL. Add a shared shell snippet that checks for the uppercase path first and falls back to lowercase, keeping both conventions working. Signed-off-by: George Armstrong <georgea@nvidia.com>
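A minimal sketch of the described fallback, written in Python for illustration (the actual change is a shared shell snippet; the constant name is ours):

```python
import os

# Prefer the legacy uppercase mount used by older images and user mounts,
# then fall back to the lowercase path used by newer containers.
NEMO_RL_DIR = "/opt/NeMo-RL" if os.path.isdir("/opt/NeMo-RL") else "/opt/nemo-rl"
```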
📝 Walkthrough

Adds NemoGym support for GRPO: runtime NeMo-RL directory detection, gym-aware GRPO training entrypoint and YAML config, pipeline/CLI plumbing to invoke gym mode and pass a training config, command-builder updates to use detected paths, plus tests and sample data.
Sequence Diagram

```mermaid
sequenceDiagram
participant User
participant CLI as CLI Parser
participant Config as Config Loader
participant Ray as Ray Runtime
participant Tokenizer as Tokenizer
participant Env as NemoGym Env
participant RL as RL Components
participant Trainer as GRPO Trainer
User->>CLI: run start_grpo_gym.py (config + overrides)
CLI->>Config: parse args, load YAML, apply overrides
Config->>Ray: initialize Ray cluster / runtime env
Ray->>Tokenizer: load tokenizer & generate config
Config->>Env: register/create NemoGym environments
Config->>RL: load/prepare JSONL datasets -> DatumSpec
RL->>RL: init policy, dataloaders, loss, logger, checkpointer
Env->>RL: map tasks to environments (health check)
RL->>Trainer: invoke grpo_train with components
Trainer->>Trainer: execute training loop, save checkpoints
Trainer-->>User: return trained model/checkpoints
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/nemo_rl/grpo.py (1)
412-414: ⚠️ Potential issue | 🔴 Critical

Bug: `dependent_jobs > 0` should be `dependent_jobs >= 0`. The error message contradicts the condition, and training runs when `dependent_jobs == 0`.

At line 412 the condition is `if dependent_jobs > 0:`, but the error message at line 414 says `"training_data is required when dependent_jobs >= 0"`. The loop at line 449 runs `range(dependent_jobs + 1)` iterations, so training executes even when `dependent_jobs == 0`. With the current `> 0` condition, validation is skipped when `dependent_jobs == 0`, allowing `training_data` to remain `None` and later produce `+data.train_data_path=None` in the command.

Both `sft.py` (line 369) and `ppo.py` (line 340) use `>= 0`, establishing the correct pattern.

Also, line 73 should use `training_config: str | None = None` to properly type-hint that the field accepts `None`.

🐛 Proposed fixes

```diff
-    if dependent_jobs > 0:
+    if dependent_jobs >= 0:
```

```diff
-    training_config: str = None
+    training_config: str | None = None
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `nemo_skills/pipeline/nemo_rl/grpo.py` around lines 412-414: change the check on `dependent_jobs` from `>` to `>=` so that when `dependent_jobs == 0` the code still enforces `training_data` presence (the `dependent_jobs >= 0` condition guards the `raise ValueError("training_data is required when dependent_jobs >= 0")`). Also update the type hint for `training_config` to allow `None` (`training_config: str | None = None`), following the same pattern as `sft.py` and `ppo.py`.
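To see why training still runs with zero dependent jobs, a quick standalone illustration (not code from the PR):

```python
# range(dependent_jobs + 1) always yields at least one iteration,
# so a training job is launched even when dependent_jobs == 0.
dependent_jobs = 0
for i in range(dependent_jobs + 1):  # range(1) -> [0]
    print(f"launching training job {i}")
```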
🧹 Nitpick comments (5)
nemo_skills/training/nemo_rl/start_grpo_gym.py (3)
194-205: Prefix unused `cluster` variable with underscore.

The `cluster` variable from the `setup()` return tuple is never used (as flagged by static analysis). Prefix it with `_` to signal intent.

Proposed fix

```diff
 (
     policy,
     policy_generation,
-    cluster,
+    _cluster,
     dataloader,
     val_dataloader,
     loss_fn,
     logger,
     checkpointer,
     grpo_state,
     master_config,
 ) = setup(config, tokenizer, train_dataset, val_dataset)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 194-205: the value returned from `setup(...)` named `cluster` is unused; rename it to `_cluster` in the tuple unpacking to mark it as intentionally unused, keeping the order of returned values unchanged.
76-86: Dead code: `load_jsonl_dataset` is defined but never called.

`setup_single_nemo_gym_dataset` (line 89) has its own JSONL loading logic. This function is unused.

Proposed fix: remove the unused function

```diff
-def load_jsonl_dataset(filepath: str) -> Dataset:
-    """Load JSONL file as HuggingFace Dataset."""
-    records = []
-    with open(filepath, "r", encoding="utf-8") as f:
-        for line in f:
-            line = line.strip()
-            if not line:
-                continue
-            obj = json.loads(line)
-            records.append(obj)
-    return Dataset.from_list(records)
-
-
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 76-86: remove the entire `load_jsonl_dataset` definition and clean up any now-unused imports it relied on (e.g., `json` and `Dataset` if they are only used by that function) so that `setup_single_nemo_gym_dataset` remains the single JSONL loading path; ensure no other references to `load_jsonl_dataset` exist before deleting it.
64: Global `OmegaConf.register_new_resolver` call at module level.

This has the side effect of registering the `"mul"` resolver on import. If another module has already registered a `"mul"` resolver with different behavior, this would conflict. Since grpo_gym.yaml depends on this resolver (for `${mul:...}` interpolations), consider guarding with a try/except or checking whether it's already registered. This is low-risk since this script is an entrypoint, but worth noting.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `nemo_skills/training/nemo_rl/start_grpo_gym.py` at line 64: guard the module-level `OmegaConf.register_new_resolver("mul", lambda a, b: a * b)` call so it cannot clobber an existing resolver; check `OmegaConf.has_resolver("mul")` and only register if absent, and optionally wrap the call in a try/except so importing this module cannot overwrite or crash on an existing registration.
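A minimal sketch of the guarded registration the comment asks for, assuming a recent omegaconf where `OmegaConf.has_resolver` exists and duplicate registration raises `ValueError`:

```python
from omegaconf import OmegaConf

# Register "mul" only if no other module has claimed the name, and tolerate
# a duplicate registration instead of failing at import time.
if not OmegaConf.has_resolver("mul"):
    try:
        OmegaConf.register_new_resolver("mul", lambda a, b: a * b)
    except ValueError:
        pass  # already registered elsewhere; keep the existing resolver
```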
nemo_skills/pipeline/nemo_rl/grpo.py (1)

72-73: Type hint for `training_config` doesn't reflect that it accepts `None`.

`training_config: str = None` is misleading: it defaults to `None` but the annotation says `str`. Use `str | None = None` (or `Optional[str]`) for correctness. The `use_gym` annotation is fine as `bool = False`.

Proposed fix

```diff
     use_gym: bool = False
-    training_config: str = None
+    training_config: str | None = None
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `nemo_skills/pipeline/nemo_rl/grpo.py` around lines 72-73: update the annotation for `training_config` to indicate it may be `None` (use `str | None` or `Optional[str]`) while keeping the default `= None`; leave `use_gym: bool = False` unchanged.
nemo_skills/training/nemo_rl/configs/grpo_gym.yaml (1)

41-45: Inconsistent boolean casing in YAML config.

The file mixes YAML-standard `true`/`false` with Python-style `True`/`False`. While both are valid in YAML 1.1, the inconsistency can confuse contributors. For example, `use_dynamic_sampling: False` (line 41) vs `normalize_rewards: true` (line 8), or `cpu_offload: False` (line 88) vs `enabled: false` (line 87).

Consider normalizing to lowercase `true`/`false` throughout for consistency with YAML conventions.

Also applies to: 88-89, 99, 118-119, 170-171, 207-208

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `nemo_skills/training/nemo_rl/configs/grpo_gym.yaml` around lines 41-45 (and the other flagged ranges): replace Python-style booleans with lowercase YAML form so every boolean value in the file is `true` or `false`, e.g., in keys such as `use_dynamic_sampling`, `cpu_offload`, and `normalize_rewards`.
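A quick check of the "both are valid in YAML 1.1" claim, assuming PyYAML (the keys are just examples from the config):

```python
import yaml

# YAML 1.1 treats True/False and true/false identically, so the mixed
# casing is purely a style inconsistency, not a behavior change.
cfg = yaml.safe_load("use_dynamic_sampling: False\nnormalize_rewards: true")
print(cfg)  # {'use_dynamic_sampling': False, 'normalize_rewards': True}
```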
Flagged snippet in nemo_skills/training/nemo_rl/start_grpo_gym.py (file lines 137-140):

```python
if config["policy"]["make_sequence_length_divisible_by"] is None:
    tp = config["policy"]["tensor_model_parallel_size"]
    cp = config["policy"]["context_parallel_size"]
    config["policy"]["make_sequence_length_divisible_by"] = setup_make_sequence_length_divisible_by(tp, cp)
```
🧩 Analysis chain
🏁 Scripts executed (repository: NVIDIA-NeMo/Skills):

```shell
fd 'grpo_gym.yaml' --type f
fd 'start_grpo_gym.py' --type f
cat -n nemo_skills/training/nemo_rl/start_grpo_gym.py | head -150 | tail -40
# Check the grpo_gym.yaml structure around the relevant keys
cat nemo_skills/training/nemo_rl/grpo_gym.yaml | head -200
cat nemo_skills/training/nemo_rl/configs/grpo_gym.yaml | head -250
# Check for megatron_cfg structure in the YAML
grep -n 'megatron_cfg\|tensor_model_parallel_size\|context_parallel_size\|make_sequence_length_divisible_by' nemo_skills/training/nemo_rl/configs/grpo_gym.yaml
# Check the load_config and parse_hydra_overrides functions to see if they do any flattening
grep -n 'def load_config\|def parse_hydra_overrides' nemo_skills/training/nemo_rl/*.py
# Search for load_config function definition
rg 'def load_config' nemo_skills/training/nemo_rl/ -A 15
# Search for parse_hydra_overrides function definition
rg 'def parse_hydra_overrides' nemo_skills/training/nemo_rl/ -A 15
# Check if there's any config flattening or transformation in the codebase
rg 'flatten|tensor_model_parallel_size.*policy' nemo_skills/training/nemo_rl/
# Search for load_config in the entire repo, checking imports and all files
rg 'load_config|parse_hydra_overrides' nemo_skills/ --type py | head -30
# Check the imports in start_grpo_gym.py
head -30 nemo_skills/training/nemo_rl/start_grpo_gym.py
# Now find the config utilities module
find nemo_skills/training/nemo_rl -name "*config*" -type f
# Read the config utilities to understand what load_config and parse_hydra_overrides do
cat nemo_skills/training/nemo_rl/utils/config.py
# Find config.py in nemo_rl directory
find nemo_skills/training/nemo_rl -name "config.py" -o -name "*config*.py" | head -20
# Search for the actual location of load_config function imported from nemo_rl.utils.config
fd '\.py' nemo_skills/training/nemo_rl | xargs rg -l 'def load_config|def parse_hydra_overrides'
# List the directory structure of nemo_rl
ls -la nemo_skills/training/nemo_rl/
# Check if utils directory exists
ls -la nemo_skills/training/nemo_rl/utils/ 2>/dev/null || echo "utils directory not found"
# Search more broadly for where load_config is defined
rg 'def load_config' nemo_skills/ -t py
# Check the full imports section of start_grpo_gym.py
sed -n '1,50p' nemo_skills/training/nemo_rl/start_grpo_gym.py | cat -n
# Verify if nemo_rl is a package and check for __init__.py
ls -la nemo_skills/training/nemo_rl/__init__.py
# Try to find parse_hydra_overrides anywhere in the codebase
rg 'def parse_hydra_overrides' nemo_skills/ -t py
# Double-check the structure in other config files to confirm the pattern
grep -n 'tensor_model_parallel_size\|context_parallel_size' nemo_skills/training/nemo_rl/configs/*.yaml | head -20
```
Incorrect dictionary access paths — tensor_model_parallel_size and context_parallel_size must be accessed via megatron_cfg, not directly from policy.
In grpo_gym.yaml, these parameters live at policy.megatron_cfg, but the code at lines 138–139 tries to access them directly under policy, which will raise KeyError:
```python
tp = config["policy"]["tensor_model_parallel_size"]  # KeyError: 'tensor_model_parallel_size'
cp = config["policy"]["context_parallel_size"]       # KeyError: 'context_parallel_size'
```

The default config avoids this because make_sequence_length_divisible_by is set via interpolation (`${policy.megatron_cfg.tensor_model_parallel_size}`), so the conditional never triggers. However, any custom config with `make_sequence_length_divisible_by: null` will hit this bug.
Fix: use the correct nested paths
```diff
 if config["policy"]["make_sequence_length_divisible_by"] is None:
-    tp = config["policy"]["tensor_model_parallel_size"]
-    cp = config["policy"]["context_parallel_size"]
+    tp = config["policy"]["megatron_cfg"]["tensor_model_parallel_size"]
+    cp = config["policy"]["megatron_cfg"]["context_parallel_size"]
     config["policy"]["make_sequence_length_divisible_by"] = setup_make_sequence_length_divisible_by(tp, cp)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 137 - 140, The
code checks config["policy"]["make_sequence_length_divisible_by"] but
incorrectly reads tensor_model_parallel_size and context_parallel_size from
config["policy"] instead of the nested config["policy"]["megatron_cfg"], causing
KeyError for custom configs; update the access in the conditional to obtain tp
and cp from config["policy"]["megatron_cfg"] and then pass them into
setup_make_sequence_length_divisible_by (i.e., replace references to
config["policy"]["tensor_model_parallel_size"] and
config["policy"]["context_parallel_size"] with
config["policy"]["megatron_cfg"]["tensor_model_parallel_size"] and
config["policy"]["megatron_cfg"]["context_parallel_size"] respectively) so
setup_make_sequence_length_divisible_by is called with the correct values.
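A sketch of the corrected lookup; the explicit error for a missing `megatron_cfg` is our addition, not part of the posted fix:

```python
# Assumes `config` is the loaded dict-like training config and
# setup_make_sequence_length_divisible_by is the existing helper from the PR.
policy_cfg = config["policy"]
if policy_cfg["make_sequence_length_divisible_by"] is None:
    megatron_cfg = policy_cfg.get("megatron_cfg")
    if megatron_cfg is None:
        raise KeyError("policy.megatron_cfg is required when make_sequence_length_divisible_by is null")
    tp = megatron_cfg["tensor_model_parallel_size"]
    cp = megatron_cfg["context_parallel_size"]
    policy_cfg["make_sequence_length_divisible_by"] = setup_make_sequence_length_divisible_by(tp, cp)
```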
| f"echo 'Starting training' && " | ||
| f"uv run --active python /nemo_run/code/nemo_skills/training/nemo_rl/start_grpo.py " | ||
| f"uv run --active python /nemo_run/code/nemo_skills/training/nemo_rl/{start_script} " | ||
| f" {config_arg}" |
| f" {config_arg}" | |
| f" {config_arg} " |
```python
    ),
    use_gym: bool = typer.Option(
        False,
        help="If True, uses NeMo Gym for environment interaction instead of NeMo Skills prompt templating. "
```
| help="If True, uses NeMo Gym for environment interaction instead of NeMo Skills prompt templating. " | |
| help="If True, uses NeMo Gym for environment interaction instead of native NeMo RL logic. " |
gym vs native NeMo RL is the main difference here, not really prompt templating, although we do have that part as well
```python
LOG.info("Extra arguments that will be passed to the underlying script: %s", extra_arguments)


if use_gym:
    if training_data is None or validation_data is None:
```
disabling validation or limiting to 1 batch isn't supported in gym?
```yaml
checkpoint_dir: "results/grpo"
metric_name: "val_reward"
higher_is_better: true
keep_top_k: 4
```
Suggested change:

```diff
-keep_top_k: 4
+keep_top_k: 50
```
best to try to keep such high-level settings consistent across different configs
```yaml
higher_is_better: true
keep_top_k: 4
save_period: 5
checkpoint_must_save_by: "00:03:35:00"
```
this feels like a bad default; we shouldn't enforce this parameter here, but instead set it through the timeout argument inside our code. This is only good for clusters with a 4-hour timeout
```yaml
calculate_per_token_loss: true
scale_loss_by_dp_cp_size: false

optimizer:
```
in the other grpo config we have some piping to sync optimizer parameters between dtensor and megatron; it would be good to reuse that and use the same defaults. Generally I'd recommend checking all parameters in here and trying to make grpo vs grpo_gym as similar as possible in non-gym-related options
```yaml
nemo_gym:
  config_paths:
    - responses_api_models/vllm_model/configs/vllm_model_for_training.yaml
    - resources_servers/math_with_judge/configs/math_with_judge.yaml
```
should we not have any default here? Otherwise might be harder to override from cmdline? Although I guess gym is pretty hard to set from cmdline anyway?
```python
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```
please add a link to the reference file
```python
if use_gym:
    if training_data is None or validation_data is None:
        raise typer.BadParameter("--use-gym requires both --training-data and --validation-data to be specified.")
```
Suggested change:

```diff
-raise typer.BadParameter("--use-gym requires both --training-data and --validation-data to be specified.")
+raise typer.BadParameter("--use_gym requires both --training_data and --validation_data to be specified.")
```
| help="If True, uses NeMo Gym for environment interaction instead of NeMo Skills prompt templating. " | ||
| "Requires both --training-data and --validation-data.", | ||
| ), | ||
| training_config: str = typer.Option( |
should we handle this default setting logic explicitly in this script? Might be a bit more explicit than delegating to the underlying gym script
In the long run, do we still need to support the nemo-skills environment?
- Add test_grpo_gym_nemo_rl GPU test mirroring the non-gym GRPO test
- Add small-grpo-gym-data.test with 10 simple math problems in Gym format
- Add FSDP optimizer/scheduler defaults to grpo_gym.yaml so both backends work (was previously null, only worked with megatron)

Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
🧹 Nitpick comments (1)
tests/gpu-tests/test_train.py (1)
193-194: `@pytest.mark.parametrize` with a single-element list is unnecessary overhead.

Since only `"fsdp"` is tested (gym mode apparently doesn't yet support megatron), this adds pytest overhead with no actual parametrization. Either drop the decorator and hardcode the backend, or leave a comment explaining that megatron support is planned.

♻️ Suggested simplification (if megatron is not planned)

```diff
-@pytest.mark.parametrize("backend", ["fsdp"])
-def test_grpo_gym_nemo_rl(backend):
+def test_grpo_gym_nemo_rl():
+    backend = "fsdp"
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `tests/gpu-tests/test_train.py` around lines 193-194: remove the single-element `@pytest.mark.parametrize("backend", ["fsdp"])` decorator and hardcode `backend = "fsdp"` in the test, or keep the decorator with a short inline comment noting that megatron support may be added later.
Signed-off-by: Wei Du <wedu@nvidia.com>
Actionable comments posted: 1
🧹 Nitpick comments (2)
nemo_skills/training/nemo_rl/start_grpo_gym.py (2)
194-205: Prefix unused `cluster` with an underscore.

Static analysis flags `cluster` (line 197) as never used. Convention is to prefix it `_cluster` (or `_`) to signal intent.

Proposed fix

```diff
 (
     policy,
     policy_generation,
-    cluster,
+    _cluster,
     dataloader,
     val_dataloader,
     loss_fn,
     logger,
     checkpointer,
     grpo_state,
     master_config,
 ) = setup(config, tokenizer, train_dataset, val_dataset)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 194-205: rename `cluster` to `_cluster` in the tuple unpacking so static analysis and linters stop flagging it.
76-86: `load_jsonl_dataset` is unused dead code.

This function is never called anywhere in the codebase. The `main()` function delegates data loading to `setup_single_nemo_gym_dataset`, which handles JSONL loading independently. Remove this function, or refactor `setup_single_nemo_gym_dataset` to use it as a helper if the HuggingFace `Dataset` wrapper is needed.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 76-86: either delete the unused `load_jsonl_dataset` function, or refactor `setup_single_nemo_gym_dataset` to call it for JSONL-to-HuggingFace `Dataset` conversion so there is no redundant parsing logic.
In nemo_skills/training/nemo_rl/start_grpo_gym.py:

```python
def setup_single_nemo_gym_dataset(jsonl_fpath: str, tokenizer, num_repeats: Optional[int] = None):
    """Setup NemoGym dataset from JSONL file."""
    with open(jsonl_fpath) as f:
        nemo_gym_examples = list(map(json.loads, f))
```
Trailing empty lines in JSONL will cause JSONDecodeError.
Unlike load_jsonl_dataset (which strips and skips blanks), this function feeds every line, including any blank line, straight into json.loads. A trailing blank line at the end of the file yields a whitespace-only string that json.loads rejects with JSONDecodeError.
Proposed fix

```diff
 with open(jsonl_fpath) as f:
-    nemo_gym_examples = list(map(json.loads, f))
+    nemo_gym_examples = [json.loads(line) for line in f if line.strip()]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/training/nemo_rl/start_grpo_gym.py` around lines 89 - 92, The
function setup_single_nemo_gym_dataset currently calls json.loads on every raw
line which will raise JSONDecodeError on trailing blank lines; change the file
read to iterate lines, strip each line and skip empty ones before calling
json.loads so only non-blank lines are parsed (match behavior of
load_jsonl_dataset); update references inside setup_single_nemo_gym_dataset to
build nemo_gym_examples from filtered json.loads results.
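A self-contained sketch of the tolerant loading behavior both comments converge on (the function name is illustrative, not from the PR):

```python
import json

def load_jsonl_skipping_blanks(path: str) -> list[dict]:
    """Parse a JSONL file, ignoring blank lines such as a trailing empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # never feed "" or "\n" to json.loads
                records.append(json.loads(line))
    return records
```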
Add --use-gym and --training-config flags to `ns nemo_rl grpo` so Gym-based GRPO training uses the same unified command instead of a separate pipeline.