@RayenTian (Contributor) commented Feb 10, 2026

PR: Enhance GRPO setup to support reward model environment configuration

Summary

This PR fixes how GRPO determines whether the reward model environment is used, so that cluster resource setup (GPU/node allocation for policy, inference, and reward model) is correct when train or validation data uses the reward_model env.

Previously, reward-model usage was inferred inside grpo.setup() from data_config["env_name"] == "reward_model", which only reflects the default env and can miss cases where the reward model env is used only in validation or in a non-default task. This change moves the detection to the entrypoint using the actual env names required by the data config, and passes a single flag into setup().

Motivation

  • Correctness: When using data.default.env_name: "reward_model" or any train/validation task that uses the reward model env, the cluster must reserve GPUs/nodes for the reward model. Relying only on data_config["env_name"] can be wrong when:
    • Validation uses the reward model env but the default env is something else, or
    • Multiple datasets/tasks are used and the default does not reflect all required envs (a sketch of such a case follows this list).
  • Single source of truth: Env names are already computed in extract_necessary_env_names(config["data"]). Using that for “is reward model env needed?” avoids duplicating logic and keeps behavior aligned with what the data pipeline actually uses.
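
To make the missed case concrete, here is a minimal sketch. The data-config layout, the field names, and the inline helper are illustrative assumptions, not the real schema or the actual extract_necessary_env_names implementation; only the "default env vs. all required envs" distinction comes from the PR.

```python
# Hypothetical data config, for illustration only; the real schema may differ.
data_config = {
    "env_name": "math",                          # default env used for training
    "validation": {"env_name": "reward_model"},  # validation task needs the RM env
}

# Old check (what grpo.setup() used to do): only the default env name is
# consulted, so reward-model usage in validation is missed.
old_rm_enabled = data_config["env_name"] == "reward_model"  # False

# New check: gather every env name the data config actually requires
# (an inline stand-in for extract_necessary_env_names, shown for clarity).
def required_env_names(cfg: dict) -> set:
    names = {cfg["env_name"]}
    for value in cfg.values():
        if isinstance(value, dict) and "env_name" in value:
            names.add(value["env_name"])
    return names

new_rm_enabled = "reward_model" in required_env_names(data_config)  # True
```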

Changes

1. examples/run_grpo.py

  • Import extract_necessary_env_names from nemo_rl.data.datasets.
  • After setup_data_with_envs(), compute:
    • env_name_list = extract_necessary_env_names(config["data"])
    • rm_env_enabled = "reward_model" in env_name_list
  • Pass rm_env_enabled=rm_env_enabled into setup() (see the sketch below).
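
A minimal sketch of the entrypoint wiring described above. Only extract_necessary_env_names, setup(), and the rm_env_enabled keyword are taken from the PR; the helper name and the surrounding glue are illustrative.

```python
# examples/run_grpo.py (sketch; unrelated entrypoint code elided)
from nemo_rl.data.datasets import extract_necessary_env_names


def derive_rm_env_enabled(config: dict) -> bool:
    """Hypothetical helper: True if any configured task needs the reward_model env."""
    env_name_list = extract_necessary_env_names(config["data"])
    return "reward_model" in env_name_list


# After setup_data_with_envs(...) has built the datasets and environments:
#   rm_env_enabled = derive_rm_env_enabled(config)
#   setup(..., rm_env_enabled=rm_env_enabled)  # setup() no longer inspects
#                                              # data_config["env_name"] itself
```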

2. nemo_rl/algorithms/grpo.py

  • Add a new parameter to setup(): rm_env_enabled: bool = False.
  • Remove the previous logic that set reward_model_enabled from data_config["env_name"].
  • Use rm_env_enabled everywhere the previous reward_model_enabled was used for:
    • Reward model resource (nodes, GPUs per node) and cluster sizing.
    • Non-colocated inference path: subtracting reward model GPUs when total_nodes == 1 and when asserting train_gpus_per_node > 0 (both paths are sketched below).
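
How the first of these might look inside setup(), as a sketch: only the rm_env_enabled parameter and its False default come from the PR; the function name and the config keys are assumptions.

```python
def reward_model_resources(rm_config: dict, rm_env_enabled: bool = False) -> tuple:
    """Return (nodes, gpus_per_node) to reserve for the reward model.

    Sketch only: mirrors the "zero resources when the RM env is unused"
    behavior described in the PR; the "num_nodes"/"gpus_per_node" keys are assumed.
    """
    if not rm_env_enabled:
        # The data config never requires the reward_model env: reserve nothing,
        # so the whole cluster goes to policy training and inference.
        return 0, 0
    return rm_config["num_nodes"], rm_config["gpus_per_node"]
```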

Behavior of cluster setup (colocated vs non-colocated, single vs multi-node, reward model reservation) is unchanged except that the “reward model env in use” flag is now derived from the data config’s required env names instead of the default env name only.
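
For the non-colocated, single-node path, the flag gates whether reward-model GPUs are carved out of the training allocation before the train_gpus_per_node > 0 assertion. A rough sketch, with all names assumed:

```python
def train_gpus_on_shared_node(
    total_gpus_per_node: int,
    inference_gpus: int,
    rm_gpus: int,
    rm_env_enabled: bool,
) -> int:
    """Sketch of the single-node, non-colocated sizing check described above."""
    train_gpus_per_node = total_gpus_per_node - inference_gpus
    if rm_env_enabled:
        # The reward model shares the single node, so its GPUs also come out
        # of the training budget.
        train_gpus_per_node -= rm_gpus
    assert train_gpus_per_node > 0, (
        "No GPUs left for training after reserving inference and "
        "reward-model GPUs on a single node."
    )
    return train_gpus_per_node
```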

3. tests/functional/L1_Functional_Tests_GPU.sh

  • Add a run of the reward-model env functional test:
    time uv run --no-sync bash ./tests/functional/grpo_rm.sh
    so that the reward model env path is exercised in L1 GPU tests.

Testing

  • Run the existing reward-model env functional test:
    bash tests/functional/grpo_rm_env.sh
    (or grpo_rm.sh as wired in L1).
  • Manually run GRPO with a config that uses the reward model env for training and/or validation, and confirm that cluster setup and the run complete (e.g., no placement-group or resource-assertion errors).

Summary by CodeRabbit

  • Improvements

    • Enhanced flexibility in reward-model environment configuration detection and resource allocation
  • Tests

    • Added functional test coverage for reward-model environment scenarios

@RayenTian added the CI:L1 (Run doctests, unit tests, and functional tests) label Feb 10, 2026
Signed-off-by: ruit <ruit@nvidia.com>
@RayenTian marked this pull request as ready for review February 11, 2026 02:27
@RayenTian requested review from a team as code owners February 11, 2026 02:27
@RayenTian added and removed the CI:L1 label February 11, 2026
coderabbitai bot commented Feb 11, 2026

📝 Walkthrough

Replaces hard-coded reward-model environment checks with dynamic extraction-based detection in the GRPO algorithm, enabling flexible resource allocation based on configured environment names instead of direct string matching. Also adds a new test invocation for reward-model environment scenarios.

Changes

  • GRPO Algorithm Logic (nemo_rl/algorithms/grpo.py): Refactors reward-model detection from the hard-coded env_name == "reward_model" check to dynamic extraction via extract_necessary_env_names(data_config). Introduces an rm_env_enabled flag to gate resource allocation (GPUs/nodes) and adds a default zero-resource path when the reward-model environment is not active.
  • GPU Functional Tests (tests/functional/L1_Functional_Tests_GPU.sh): Adds a new test invocation for grpo_rm_env.sh in the GPU test sequence, positioned after grpo_non_colocated.sh.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • yuki-97
  • terrykong
  • joyang-nv
🚥 Pre-merge checks: ✅ 4 passed

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'fix: fix and re-enable rm env functional test' accurately describes the main changes: fixing the reward-model environment detection logic and re-enabling the functional test for it.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Test Results For Major Changes: ✅ Passed. PR documentation includes testing instructions (grpo_rm_env.sh) and manual validation steps for the bug fix changes.


No actionable comments were generated in the recent review. 🎉

