Conversation
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
📝 WalkthroughWalkthroughThis PR introduces support for a Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 3❌ Failed checks (2 warnings, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_rl/data/datasets/utils.py (1)
86-104:⚠️ Potential issue | 🟡 Minor
data_subsetis silently ignored in theload_from_diskfallback.If a user provides a
data_subsetwith a path that triggers theload_from_diskfallback (line 102), the subset is silently discarded. Consider adding an assertion (matching the pattern on line 88) to warn the user.Proposed fix
except ValueError as e: # load from local file (save_to_disk format) if "load_from_disk" in str(e): + assert data_subset is None, ( + "data_subset is only supported for huggingface datasets" + ) raw_dataset = load_from_disk(data_path) else: raise e
🤖 Fix all issues with AI agents
In `@docs/guides/dpo.md`:
- Line 39: The link text "response_datasets/tulu3.py" in docs/guides/dpo.md is
misleading because it points to
../../nemo_rl/data/datasets/preference_datasets/tulu3.py; update the doc so link
text matches the actual target (e.g., change the link text to
"preference_datasets/tulu3.py") or adjust the URL to point to the intended
response_datasets path, ensuring the visible text and link destination are
consistent.
In `@docs/guides/rm.md`:
- Line 28: The link display text is inconsistent: the markdown shows
"response_datasets/tulu3.py" while the URL points to
"../../nemo_rl/data/datasets/preference_datasets/tulu3.py"; update the markdown
in rm.md so the link text matches the actual target (e.g., change the display
text to "preference_datasets/tulu3.py") or alternatively change the URL if you
intended to link to response_datasets; edit the specific link near the sentence
"An example implementation can be found in [response_datasets/tulu3.py]" to keep
display text and path consistent.
In `@nemo_rl/data/__init__.py`:
- Line 23: Change the type of the optional subset field to allow explicit nulls:
update the NotRequired annotation for subset in DatasetConfig (and the subset
field in PreferenceDatasetConfig) from NotRequired[str] to NotRequired[str |
None] (or NotRequired[Optional[str]]), ensuring any necessary typing imports are
present so YAML null maps to None without violating the type contract.
🧹 Nitpick comments (1)
nemo_rl/data/datasets/utils.py (1)
94-98: Use keyword argument forload_datasetsubset parameter.
load_dataset(data_path, data_subset)passes the subset as a positional argument, relying onnamebeing the second parameter. Use the explicit keyword argument for clarity and resilience to API changes.Proposed change
# load from huggingface if data_subset: - raw_dataset = load_dataset(data_path, data_subset) + raw_dataset = load_dataset(data_path, name=data_subset) else: raw_dataset = load_dataset(data_path)
Signed-off-by: Yuki Huang <yukih@nvidia.com>
gsm8kcan directly useResponseDatasetto load.Summary by CodeRabbit
Release Notes
New Features
Documentation
task_namefield structure and dataset formatting conventions.Tests