
feat: improve dataset #1893

Open
yuki-97 wants to merge 3 commits into main from yukih/improve-dataset

Conversation

Contributor

@yuki-97 yuki-97 commented Feb 6, 2026

  1. Support a subset option so that datasets such as gsm8k can be loaded directly with ResponseDataset.
  2. Update the documentation guide on how to add a new dataset class.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for specifying dataset subsets in HuggingFace dataset configurations, enabling selection of specific dataset variants (e.g., GSM8K's "main" subset).
  • Documentation

    • Updated all dataset guides (DPO, GRPO, RM, SFT) to clarify the required task_name field structure and dataset formatting conventions.
    • Enhanced configuration examples to demonstrate subset usage for HuggingFace datasets.
  • Tests

    • Added test coverage for dataset loading with subset selection.

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97 yuki-97 requested review from a team as code owners February 6, 2026 16:12
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 6, 2026
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed documentation Improvements or additions to documentation labels Feb 6, 2026
@yuki-97 yuki-97 requested a review from terrykong February 6, 2026 16:13
Contributor

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

This PR introduces support for a subset parameter across the dataset loading infrastructure to enable HuggingFace dataset subset selection. Changes include type definitions, dataset class constructors, loading utilities, configuration examples, and documentation updates describing the new task_name field requirement and subset parameter.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation updates: docs/guides/dpo.md, docs/guides/grpo.md, docs/guides/rm.md, docs/guides/sft.md | Updated dataset documentation to describe task_name as a required field in formatted examples and added guidance on the new subset parameter for HuggingFace datasets. |
| Configuration examples: examples/configs/dpo.yaml, examples/configs/rm.yaml | Added subset: null placeholder entries for HuggingFace datasets in train and validation data sections. |
| Type definitions: nemo_rl/data/__init__.py | Added subset: NotRequired[str] field to ResponseDatasetConfig and PreferenceDatasetConfig TypedDicts. |
| Dataset class implementations: nemo_rl/data/datasets/response_datasets/response_dataset.py, nemo_rl/data/datasets/preference_datasets/preference_dataset.py, nemo_rl/data/datasets/preference_datasets/binary_preference_dataset.py | Added optional subset: Optional[str] = None parameter to constructors and threaded it through to load_dataset_from_path calls. |
| Dataset loading utilities: nemo_rl/data/datasets/utils.py | Extended load_dataset_from_path signature to accept data_subset parameter with conditional logic: asserts subset is None for local file types, and passes subset to load_dataset() for HuggingFace datasets. |
| Unit tests: tests/unit/data/datasets/test_response_dataset.py | Added new test test_response_dataset_gsm8k_with_subset validating ResponseDataset loading with HuggingFace subset selection and message formatting. |
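
The loading behavior summarized for nemo_rl/data/datasets/utils.py can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: the hf_loader parameter, the LOCAL_SUFFIXES tuple, and the placeholder local-file branch are hypothetical and exist only so the sketch can run without network access. By default the HuggingFace branch falls back to datasets.load_dataset, whose second parameter (the configuration or subset name) is called name.

```python
from typing import Any, Callable, Optional

# Assumption: suffixes treated as local files; the real utility's list may differ.
LOCAL_SUFFIXES = (".json", ".jsonl", ".parquet", ".csv")


def load_dataset_from_path(
    data_path: str,
    data_subset: Optional[str] = None,
    hf_loader: Optional[Callable[..., Any]] = None,  # hypothetical injection point
) -> Any:
    """Sketch: reject subsets for local files, forward them for hub datasets."""
    if data_path.endswith(LOCAL_SUFFIXES):
        assert data_subset is None, (
            "data_subset is only supported for huggingface datasets"
        )
        # Placeholder for the local-file branch, which this sketch does not model.
        return {"source": "local", "path": data_path}

    if hf_loader is None:
        # Imported lazily so the sketch can be exercised without `datasets` installed.
        from datasets import load_dataset as hf_loader

    if data_subset:
        # Pass the subset by keyword rather than relying on argument position.
        return hf_loader(data_path, name=data_subset)
    return hf_loader(data_path)
```

Calling load_dataset_from_path("openai/gsm8k", "main") would then request the "main" subset, while passing a subset together with a path such as train.jsonl trips the assertion instead of silently ignoring the value.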

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

CI:L1

Suggested reviewers

  • terrykong
  • parthchadha
  • odelalleau

🚥 Pre-merge checks | ✅ 1 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Results For Major Changes | ⚠️ Warning | PR adds optional subset parameter to dataset classes with backward-compatible defaults, includes new test case for GSM8K subset functionality, but PR description lacks explicit test result documentation. | Add explicit test result documentation to PR description confirming tests passed and backward compatibility verified with default parameter values. |
| Title check | ❓ Inconclusive | The title 'feat: improve dataset' is vague and generic, using non-descriptive language that does not convey the specific changes made in the changeset. | Consider a more specific title like 'feat: add dataset subset support' or 'feat: add HuggingFace subset parameter to dataset loaders' to clearly communicate the primary change. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_rl/data/datasets/utils.py (1)

86-104: ⚠️ Potential issue | 🟡 Minor

data_subset is silently ignored in the load_from_disk fallback.

If a user provides a data_subset with a path that triggers the load_from_disk fallback (line 102), the subset is silently discarded. Consider adding an assertion (matching the pattern on line 88) to warn the user.

Proposed fix
         except ValueError as e:
             # load from local file (save_to_disk format)
             if "load_from_disk" in str(e):
+                assert data_subset is None, (
+                    "data_subset is only supported for huggingface datasets"
+                )
                 raw_dataset = load_from_disk(data_path)
             else:
                 raise e
🤖 Fix all issues with AI agents
In `@docs/guides/dpo.md`:
- Line 39: The link text "response_datasets/tulu3.py" in docs/guides/dpo.md is
misleading because it points to
../../nemo_rl/data/datasets/preference_datasets/tulu3.py; update the doc so link
text matches the actual target (e.g., change the link text to
"preference_datasets/tulu3.py") or adjust the URL to point to the intended
response_datasets path, ensuring the visible text and link destination are
consistent.

In `@docs/guides/rm.md`:
- Line 28: The link display text is inconsistent: the markdown shows
"response_datasets/tulu3.py" while the URL points to
"../../nemo_rl/data/datasets/preference_datasets/tulu3.py"; update the markdown
in rm.md so the link text matches the actual target (e.g., change the display
text to "preference_datasets/tulu3.py") or alternatively change the URL if you
intended to link to response_datasets; edit the specific link near the sentence
"An example implementation can be found in [response_datasets/tulu3.py]" to keep
display text and path consistent.

In `@nemo_rl/data/__init__.py`:
- Line 23: Change the type of the optional subset field to allow explicit nulls:
update the NotRequired annotation for subset in DatasetConfig (and the subset
field in PreferenceDatasetConfig) from NotRequired[str] to NotRequired[str |
None] (or NotRequired[Optional[str]]), ensuring any necessary typing imports are
present so YAML null maps to None without violating the type contract.
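
To illustrate the typing suggestion for nemo_rl/data/__init__.py, the sketch below shows a TypedDict whose subset key is both optional and nullable, so a YAML entry of subset: null (parsed as None) still satisfies the declared type. ExampleDatasetConfig, its dataset_name field, and resolve_subset are hypothetical stand-ins, not the project's actual definitions.

```python
from typing import Optional, TypedDict

try:
    from typing import NotRequired  # Python 3.11+
except ImportError:
    from typing_extensions import NotRequired  # older interpreters


class ExampleDatasetConfig(TypedDict):
    # Hypothetical stand-in for the project's dataset config TypedDicts.
    dataset_name: str
    # The key may be absent, and when present it may be None (YAML `subset: null`).
    subset: NotRequired[Optional[str]]


def resolve_subset(cfg: ExampleDatasetConfig) -> Optional[str]:
    """Return the configured subset, treating a missing key and null alike."""
    return cfg.get("subset")


with_null: ExampleDatasetConfig = {"dataset_name": "openai/gsm8k", "subset": None}
omitted: ExampleDatasetConfig = {"dataset_name": "openai/gsm8k"}
named: ExampleDatasetConfig = {"dataset_name": "openai/gsm8k", "subset": "main"}
```

With NotRequired[str] alone, a type checker would be expected to reject the with_null assignment, which is the mismatch the review comment points at.
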
🧹 Nitpick comments (1)
nemo_rl/data/datasets/utils.py (1)

94-98: Use keyword argument for load_dataset subset parameter.

load_dataset(data_path, data_subset) passes the subset as a positional argument, relying on name being the second parameter. Use the explicit keyword argument for clarity and resilience to API changes.

Proposed change
             # load from huggingface
             if data_subset:
-                raw_dataset = load_dataset(data_path, data_subset)
+                raw_dataset = load_dataset(data_path, name=data_subset)
             else:
                 raw_dataset = load_dataset(data_path)

Signed-off-by: Yuki Huang <yukih@nvidia.com>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 6, 2026
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 6, 2026

Labels

  • CI:L1: Run doctests, unit tests, and functional tests
  • documentation: Improvements or additions to documentation
