fix: fix and re-enable rm env functional test #1905
PR: Enhance GRPO setup to support reward model environment configuration
Summary
This PR fixes how GRPO determines whether the reward model environment is used, so that cluster resource setup (GPU/node allocation for policy, inference, and reward model) is correct when train or validation data uses the `reward_model` env. Previously, reward-model usage was inferred inside `grpo.setup()` from `data_config["env_name"] == "reward_model"`, which only reflects the default env and can miss cases where the reward model env is used only in validation or in a non-default task. This change moves the detection to the entrypoint, which uses the actual env names required by the data config and passes a single flag into `setup()`.
Motivation
When `data.default.env_name: "reward_model"` is set, or any train/validation task uses the reward model env, the cluster must reserve GPUs/nodes for the reward model. Relying only on `data_config["env_name"]` can be wrong when the default env is not `reward_model` but a validation task or a non-default train task still uses it. The data setup already derives the full set of required env names via `extract_necessary_env_names(config["data"])`; using that to decide whether the reward model env is needed avoids duplicating logic and keeps behavior aligned with what the data pipeline actually uses.
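To make the failure mode concrete, here is a minimal sketch; the config shape and the helper body are simplified stand-ins for the actual NeMo RL data schema and for `nemo_rl.data.datasets.extract_necessary_env_names`, not the real implementation.

```python
# Illustrative only: simplified config shape and a stand-in helper, not NeMo RL code.
data_config = {
    "env_name": "math",                           # default env: not the reward model env
    "validation": {"env_name": "reward_model"},   # but validation still needs it
}

# Old check (inside grpo.setup()): only the default env is consulted, so the
# reward model env used by validation is missed and no RM resources are reserved.
old_reward_model_enabled = data_config["env_name"] == "reward_model"   # False

# New check (in the entrypoint): collect every env name the data config actually
# requires, then test whether the reward model env is among them.
def extract_required_env_names(cfg):              # hypothetical stand-in helper
    names = {cfg["env_name"]}
    names.update(v["env_name"] for v in cfg.values() if isinstance(v, dict))
    return sorted(names)

rm_env_enabled = "reward_model" in extract_required_env_names(data_config)     # True
```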
Changes
1. `examples/run_grpo.py`
   - Import `extract_necessary_env_names` from `nemo_rl.data.datasets`.
   - After `setup_data_with_envs()`, compute:
     - `env_name_list = extract_necessary_env_names(config["data"])`
     - `rm_env_enabled = "reward_model" in env_name_list`
   - Pass `rm_env_enabled=rm_env_enabled` into `setup()` (see the sketch after this list).
2. `nemo_rl/algorithms/grpo.py`
   - `setup()`: add a new parameter `rm_env_enabled: bool = False`.
   - Remove the derivation of `reward_model_enabled` from `data_config["env_name"]`.
   - Use `rm_env_enabled` everywhere the previous `reward_model_enabled` was used, e.g. in the resource handling when `total_nodes == 1` and when asserting `train_gpus_per_node > 0`.
   - Cluster setup behavior (colocated vs. non-colocated, single vs. multi-node, reward model reservation) is unchanged, except that the "reward model env in use" flag is now derived from the data config's required env names instead of the default env name only.
3. `tests/functional/L1_Functional_Tests_GPU.sh`
   - Re-enable `time uv run --no-sync bash ./tests/functional/grpo_rm.sh` so that the reward model env path is exercised in L1 GPU tests.
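Below is a minimal, self-contained sketch of the entrypoint-to-`setup()` handoff described in items 1 and 2. It is not the actual diff; the GPU-splitting arithmetic inside `setup()` is purely illustrative, since the PR only states that the flag drives the `total_nodes == 1` handling and the `train_gpus_per_node > 0` assertion.

```python
# Simplified sketch of the new wiring, not the actual NeMo RL code.
def setup(data_config, total_nodes, train_gpus_per_node, rm_env_enabled=False):
    # Before this PR: reward_model_enabled = data_config["env_name"] == "reward_model"
    # Now the caller decides, based on every env the data config actually needs.
    if rm_env_enabled and total_nodes == 1:
        # Colocated single-node case: part of the node goes to the reward model,
        # so policy training must still be left with at least one GPU.
        train_gpus_per_node -= 1                     # hypothetical split
        assert train_gpus_per_node > 0, "no GPUs left for policy training"
    return {"rm_env_enabled": rm_env_enabled, "train_gpus_per_node": train_gpus_per_node}

# Entrypoint side (examples/run_grpo.py), after the data/env setup:
env_name_list = ["math", "reward_model"]   # e.g. extract_necessary_env_names(config["data"])
rm_env_enabled = "reward_model" in env_name_list
resources = setup({"env_name": "math"}, total_nodes=1, train_gpus_per_node=8,
                  rm_env_enabled=rm_env_enabled)
print(resources)   # {'rm_env_enabled': True, 'train_gpus_per_node': 7}
```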
Testing
Ran `bash tests/functional/grpo_rm_env.sh` (or `grpo_rm.sh` as wired in L1).