Conversation
There was a problem hiding this comment.
Pull request overview
This PR makes missing data.group_training_examples_by columns fail fast with a consistent, user-facing error, instead of warning during holdout and later failing with less-informative downstream errors.
Changes:
- Introduces a shared
validate_groupby_columnhelper (and centralized error messages) and reuses it across SDK processing, holdout, and training. - Updates holdout behavior to raise immediately on missing/invalid group-by rather than silently falling back to an ungrouped split.
- Adds/updates tests to cover early failure and the new validation helper.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/nemo_safe_synthesizer/data_processing/validation.py |
Adds shared group-by validation + standardized error messages. |
src/nemo_safe_synthesizer/sdk/library_builder.py |
Validates group-by early in process_data() before constructing/running holdout. |
src/nemo_safe_synthesizer/holdout/holdout.py |
Replaces warning+fallback behavior with centralized validation + early error. |
src/nemo_safe_synthesizer/training/huggingface_backend.py |
Reuses centralized validator and validates before autoconfig resolution. |
tests/data_processing/test_validation.py |
New unit tests for validate_groupby_column. |
tests/holdout/test_holdout.py |
Updates holdout test to expect the new missing-group-by behavior/message. |
tests/sdk/test_process_data.py |
Adds regression test ensuring process_data() fails before Holdout is invoked. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
seayang-nv
left a comment
There was a problem hiding this comment.
Overall looks good to me. Thanks!
b86051c to
2a37c9b
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6230446 to
9185def
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
kendrickb-nvidia
left a comment
There was a problem hiding this comment.
Other than removing/consolidating the testing to test_validation.py, this looks good to me.
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> centralize group by column validation Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> PR feedback: add early fail to order by check, remove unnecessary internal method, clean up docstrings, etc. Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com> minor ty/docstring fixes Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
22af3b7 to
673867d
Compare
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
src/nemo_safe_synthesizer/holdout/holdout.py:196
Holdout.train_test_split()validatesself.group_byviavalidate_groupby_column(...), and then (when grouping is enabled) callsgrouped_train_test_split(...), which validates the same column again. This double-scans the group column for nulls and adds avoidable overhead on large datasets. Consider removing one of the validations (e.g., rely ongrouped_train_test_splitfor the grouped path and change the branch toif self.group_by is not None:so empty-string misconfigs still raise) to keep validation single-pass.
validate_groupby_column(input_df, self.group_by)
if self.group_by:
training_df, test_df = grouped_train_test_split(
input_df=input_df,
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This reverts commit 040f078. Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
a1aecd1 to
de92dd3
Compare
Summary
Previously if the
group_training_examples_bycolumn is missing in the data set, we file a warning at the holdout step and revert to the naive train test split. This is pointless because later on we'd error out at either the rope scaling auto resolution step or the training step, with less informative error messages. In this PR, weorder_training_examples_byin the library builder to fail early as wellPre-Review Checklist
Ensure that the following pass:
make format && make checkor via prek validation.make testpasses locallymake test-e2epasses locallymake test-ci-containerpasses locally (recommended)/syncon this PR to trigger a run (auto-triggers on ready-for-review)Pre-Merge Checklist
Other Notes
Current Behavior
First, a warning
Then, raises key error at…
config/autoconfig.py::get_max_token_countSame warning, then raises error at the training step,
Raises an error in the training step:
New Behavior
Raises the same error always, and early: