feat: improve dataset by yuki-97 · Pull Request #1893 · NVIDIA-NeMo/RL

yuki-97 · 2026-02-06T16:12:41Z

Support `subset` so that datasets like gsm8k can be loaded directly with ResponseDataset.
Update the doc guide on how to add a new dataset class.

Summary by CodeRabbit

Release Notes

New Features
- Added support for specifying dataset subsets in HuggingFace dataset configurations, enabling selection of specific dataset variants (e.g., GSM8K's "main" subset).

Documentation
- Updated all dataset guides (DPO, GRPO, RM, SFT) to clarify the required task_name field structure and dataset formatting conventions.
- Enhanced configuration examples to demonstrate subset usage for HuggingFace datasets.

Tests
- Added test coverage for dataset loading with subset selection.

Signed-off-by: Yuki Huang <yukih@nvidia.com>

coderabbitai · 2026-02-06T16:17:31Z

Walkthrough

This PR introduces support for a subset parameter across the dataset loading infrastructure to enable HuggingFace dataset subset selection. Changes include type definitions, dataset class constructors, loading utilities, configuration examples, and documentation updates describing the new task_name field requirement and subset parameter.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation updates: `docs/guides/dpo.md`, `docs/guides/grpo.md`, `docs/guides/rm.md`, `docs/guides/sft.md` | Updated dataset documentation to describe `task_name` as a required field in formatted examples and added guidance on the new `subset` parameter for HuggingFace datasets. |
| Configuration examples: `examples/configs/dpo.yaml`, `examples/configs/rm.yaml` | Added `subset: null` placeholder entries for HuggingFace datasets in train and validation data sections. |
| Type definitions: `nemo_rl/data/__init__.py` | Added `subset: NotRequired[str]` field to `ResponseDatasetConfig` and `PreferenceDatasetConfig` TypedDicts. |
| Dataset class implementations: `nemo_rl/data/datasets/response_datasets/response_dataset.py`, `nemo_rl/data/datasets/preference_datasets/preference_dataset.py`, `nemo_rl/data/datasets/preference_datasets/binary_preference_dataset.py` | Added optional `subset: Optional[str] = None` parameter to constructors and threaded it through to `load_dataset_from_path` calls. |
| Dataset loading utilities: `nemo_rl/data/datasets/utils.py` | Extended `load_dataset_from_path` signature to accept `data_subset` parameter with conditional logic: asserts subset is None for local file types, and passes subset to `load_dataset()` for HuggingFace datasets. |
| Unit tests: `tests/unit/data/datasets/test_response_dataset.py` | Added new test `test_response_dataset_gsm8k_with_subset` validating ResponseDataset loading with HuggingFace subset selection and message formatting. |
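
The constructor threading summarized in the table can be pictured with a minimal sketch. It assumes a stand-in loader rather than the real `load_dataset_from_path`; the class name `SketchResponseDataset`, the `loader` argument, the `fake_loader` helper, and the `openai/gsm8k` path are hypothetical and used only for illustration. Only the `subset: Optional[str] = None` parameter and the `data_subset` name come from the summary above.

```python
from typing import Any, Callable, Optional


def fake_loader(data_path: str, data_subset: Optional[str] = None) -> dict[str, Any]:
    """Hypothetical stand-in for load_dataset_from_path; it only records its inputs."""
    return {"path": data_path, "subset": data_subset}


class SketchResponseDataset:
    """Illustrative only: an optional subset is forwarded unchanged to the loader."""

    def __init__(
        self,
        data_path: str,
        subset: Optional[str] = None,  # default of None keeps existing callers working
        loader: Callable[..., dict[str, Any]] = fake_loader,
    ) -> None:
        # The constructor argument is `subset`; the loader's parameter is `data_subset`.
        self.raw = loader(data_path, data_subset=subset)


# Assumed Hub path for GSM8K; the "main" subset name is taken from the release notes.
with_subset = SketchResponseDataset("openai/gsm8k", subset="main")
without_subset = SketchResponseDataset("some/dataset")
```

Because the new parameter defaults to `None`, callers that never mention `subset` take the same code path as before, which is the backward-compatibility property the pre-merge check below refers to.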

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

refactor: refactor dataset module #977: Extends dataset refactor by modifying dataset-loading utilities and dataset classes with subset parameter threading.
refactor: refactor env and data processor & add nemotron super 49b recipes #1506: Related through task_name dataflow changes being propagated into dataset formats and processors.
refactor: split train and val dataset in response dataset #1649: Touches same dataset-loading surface with modifications to load_dataset_from_path and dataset constructor signatures.

Suggested labels

CI:L1

Suggested reviewers

terrykong
parthchadha
odelalleau

🚥 Pre-merge checks | ✅ 1 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Results For Major Changes | ⚠️ Warning | PR adds optional subset parameter to dataset classes with backward-compatible defaults, includes new test case for GSM8K subset functionality, but PR description lacks explicit test result documentation. | Add explicit test result documentation to PR description confirming tests passed and backward compatibility verified with default parameter values. |
| Title check | ❓ Inconclusive | The title 'feat: improve dataset' is vague and generic, using non-descriptive language that does not convey the specific changes made in the changeset. | Consider a more specific title like 'feat: add dataset subset support' or 'feat: add HuggingFace subset parameter to dataset loaders' to clearly communicate the primary change. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

nemo_rl/data/datasets/utils.py (1)
86-104: ⚠️ Potential issue | 🟡 Minor

data_subset is silently ignored in the load_from_disk fallback.

If a user provides a data_subset with a path that triggers the load_from_disk fallback (line 102), the subset is silently discarded. Consider adding an assertion (matching the pattern on line 88) to warn the user.
Proposed fix

```diff
         except ValueError as e:
             # load from local file (save_to_disk format)
             if "load_from_disk" in str(e):
+                assert data_subset is None, (
+                    "data_subset is only supported for huggingface datasets"
+                )
                 raw_dataset = load_from_disk(data_path)
             else:
                 raise e
```
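
To see where the extra guard sits in the overall dispatch, here is a rough sketch of the control flow described in this review. It is not the code in `nemo_rl/data/datasets/utils.py`: the function name `load_with_subset`, the injected `hub_loader`, `disk_loader`, and `file_loader` callables, and the `LOCAL_SUFFIXES` list are all assumptions made so the example runs without the `datasets` library.

```python
from typing import Any, Callable, Optional

# Assumed set of local file types; the real list lives in the PR's utils module.
LOCAL_SUFFIXES = (".json", ".jsonl", ".parquet", ".csv")
SUBSET_ERROR = "data_subset is only supported for huggingface datasets"


def load_with_subset(
    data_path: str,
    data_subset: Optional[str],
    hub_loader: Callable[..., Any],
    disk_loader: Callable[[str], Any],
    file_loader: Callable[[str], Any],
) -> Any:
    """Sketch of the three loading branches, each with an explicit subset policy."""
    if data_path.endswith(LOCAL_SUFFIXES):
        # Local files have no notion of a subset, so reject it loudly.
        assert data_subset is None, SUBSET_ERROR
        return file_loader(data_path)
    try:
        if data_subset:
            return hub_loader(data_path, name=data_subset)
        return hub_loader(data_path)
    except ValueError as e:
        # Fallback for directories written in save_to_disk format.
        if "load_from_disk" in str(e):
            # Without this guard the subset would be dropped silently.
            assert data_subset is None, SUBSET_ERROR
            return disk_loader(data_path)
        raise


# Stand-in loaders that only report which branch was taken.
hub_result = load_with_subset(
    "openai/gsm8k",
    "main",
    hub_loader=lambda path, name=None: ("hub", path, name),
    disk_loader=lambda path: ("disk", path),
    file_loader=lambda path: ("file", path),
)
```

With the guard in place, a subset combined with a `save_to_disk` directory fails with the same message as the local-file branch instead of loading the full dataset unnoticed.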

🤖 Fix all issues with AI agents

In `@docs/guides/dpo.md`:
- Line 39: The link text "response_datasets/tulu3.py" in docs/guides/dpo.md is
misleading because it points to
../../nemo_rl/data/datasets/preference_datasets/tulu3.py; update the doc so link
text matches the actual target (e.g., change the link text to
"preference_datasets/tulu3.py") or adjust the URL to point to the intended
response_datasets path, ensuring the visible text and link destination are
consistent.

In `@docs/guides/rm.md`:
- Line 28: The link display text is inconsistent: the markdown shows
"response_datasets/tulu3.py" while the URL points to
"../../nemo_rl/data/datasets/preference_datasets/tulu3.py"; update the markdown
in rm.md so the link text matches the actual target (e.g., change the display
text to "preference_datasets/tulu3.py") or alternatively change the URL if you
intended to link to response_datasets; edit the specific link near the sentence
"An example implementation can be found in [response_datasets/tulu3.py]" to keep
display text and path consistent.

In `@nemo_rl/data/__init__.py`:
- Line 23: Change the type of the optional subset field to allow explicit nulls:
update the NotRequired annotation for subset in DatasetConfig (and the subset
field in PreferenceDatasetConfig) from NotRequired[str] to NotRequired[str |
None] (or NotRequired[Optional[str]]), ensuring any necessary typing imports are
present so YAML null maps to None without violating the type contract.
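
The typing suggestion for `subset` can be illustrated with a small sketch. The `SketchDatasetConfig` class, the `data_path` key, and the `resolve_subset` helper are hypothetical; the real TypedDicts in `nemo_rl/data/__init__.py` have other fields that are not reproduced here. The point is only that `NotRequired[Optional[str]]` accepts a missing key, an explicit `None` (what a YAML `null` is parsed to), and a string.

```python
from typing import Optional, TypedDict

try:
    from typing import NotRequired  # Python 3.11+
except ImportError:  # older interpreters
    from typing_extensions import NotRequired


class SketchDatasetConfig(TypedDict):
    """Hypothetical, trimmed-down config used only for this example."""

    data_path: str
    # The key may be absent, and when present it may be None or a string.
    subset: NotRequired[Optional[str]]


def resolve_subset(cfg: SketchDatasetConfig) -> Optional[str]:
    """Treat a missing key and an explicit None the same way."""
    return cfg.get("subset")


omitted: SketchDatasetConfig = {"data_path": "openai/gsm8k"}
explicit_null: SketchDatasetConfig = {"data_path": "openai/gsm8k", "subset": None}
chosen: SketchDatasetConfig = {"data_path": "openai/gsm8k", "subset": "main"}
```

With the narrower `NotRequired[str]`, a static type checker would be expected to flag the `explicit_null` assignment, which is the mismatch with the `subset: null` placeholders in the example configs.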

🧹 Nitpick comments (1)

nemo_rl/data/datasets/utils.py (1)
94-98: Use keyword argument for load_dataset subset parameter.

load_dataset(data_path, data_subset) passes the subset as a positional argument, relying on name being the second parameter. Use the explicit keyword argument for clarity and resilience to API changes.
Proposed change

```diff
             # load from huggingface
             if data_subset:
-                raw_dataset = load_dataset(data_path, data_subset)
+                raw_dataset = load_dataset(data_path, name=data_subset)
             else:
                 raw_dataset = load_dataset(data_path)
```
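
The resilience argument behind this nitpick can be shown with two stand-in functions. Neither is the real `datasets.load_dataset`; `loader_v1`, `loader_v2`, and the inserted `revision` position are hypothetical and exist only to show what happens to a positional argument if a parameter is ever placed before `name`.

```python
from typing import Optional


def loader_v1(path: str, name: Optional[str] = None) -> dict:
    """Stand-in with `name` as the second parameter, as the comment above describes."""
    return {"path": path, "name": name}


def loader_v2(
    path: str, revision: Optional[str] = None, name: Optional[str] = None
) -> dict:
    """Hypothetical later signature with a parameter inserted before `name`."""
    return {"path": path, "revision": revision, "name": name}


# Both call styles agree while `name` is the second parameter.
positional_v1 = loader_v1("openai/gsm8k", "main")
keyword_v1 = loader_v1("openai/gsm8k", name="main")

# After the hypothetical signature change, only the keyword call keeps its meaning.
positional_v2 = loader_v2("openai/gsm8k", "main")  # "main" is bound to `revision`
keyword_v2 = loader_v2("openai/gsm8k", name="main")  # "main" is still bound to `name`
```

The keyword form also documents intent at the call site, since a reader does not need to recall the parameter order to see that the second value is a subset name.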

yuki-97 added 2 commits February 6, 2026 06:42

support data subset

4366ca0

Signed-off-by: Yuki Huang <yukih@nvidia.com>

update doc

8904c7e

Signed-off-by: Yuki Huang <yukih@nvidia.com>

yuki-97 requested review from a team as code owners February 6, 2026 16:12

github-actions bot added the documentation label (Improvements or additions to documentation) Feb 6, 2026

yuki-97 added the CI:L1 label (Run doctests, unit tests, and functional tests) and removed the documentation label (Improvements or additions to documentation) Feb 6, 2026

yuki-97 had a problem deploying to nemo-ci February 6, 2026 16:13 — with GitHub Actions Error

yuki-97 requested a review from terrykong February 6, 2026 16:13

coderabbitai bot reviewed Feb 6, 2026

coderabbit

aeeb9eb

Signed-off-by: Yuki Huang <yukih@nvidia.com>

github-actions bot added the documentation label (Improvements or additions to documentation) Feb 6, 2026

yuki-97 added and removed the CI:L1 label (Run doctests, unit tests, and functional tests) Feb 6, 2026

yuki-97 temporarily deployed to nemo-ci February 6, 2026 16:26 — with GitHub Actions Inactive

yuki-97 temporarily deployed to nemo-ci February 6, 2026 18:27 — with GitHub Actions Inactive

yuki-97 temporarily deployed to nemo-ci February 7, 2026 00:11 — with GitHub Actions Inactive

yuki-97 wants to merge 3 commits into main from yukih/improve-dataset
