fix: raise group by column not found error by nina-xu · Pull Request #374 · NVIDIA-NeMo/Safe-Synthesizer

nina-xu · 2026-04-07T15:44:17Z

Summary

Previously, if the group_training_examples_by column was missing from the dataset, we logged a warning at the holdout step and fell back to the naive train/test split. That fallback was pointless, because the run would fail later anyway, at either the rope scaling auto-resolution step or the training step, with a less informative error message. This PR:

Raises an error early, at the holdout step, instead of warning
Adds an early validation for order_training_examples_by in the library builder so that it fails early as well
Centralizes the group-by and order-by validation checks so that the same check runs regardless of the entry point and consistently produces an informative error message
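
The centralization described above could be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `ParameterError` is a local stand-in for `nemo_safe_synthesizer.errors.ParameterError`, and `check_group_by` / `check_order_by` are hypothetical names that take a plain list of column names, whereas the PR's `validate_groupby_column` / `validate_orderby_column` receive the dataset itself and may perform further checks that are omitted here. The message strings are copied from the tracebacks under New Behavior.

```python
from typing import Iterable, Optional


class ParameterError(ValueError):
    """Local stand-in for nemo_safe_synthesizer.errors.ParameterError."""


# Message text copied from the tracebacks in this PR description.
MISSING_GROUP_BY_COLUMN_ERROR = (
    "Group by column '{group_by}' not found in input dataset columns. "
    "Please set `data.group_training_examples_by` to an existing column "
    "or disable grouping."
)
MISSING_ORDER_BY_COLUMN_ERROR = "Order by column '{order_by}' not found in the input data."


def check_group_by(columns: Iterable[str], group_by: Optional[str]) -> None:
    """Hypothetical helper: raise if the requested group-by column is absent."""
    if group_by is None:  # grouping disabled, nothing to validate
        return
    if group_by not in set(columns):
        raise ParameterError(MISSING_GROUP_BY_COLUMN_ERROR.format(group_by=group_by))


def check_order_by(columns: Iterable[str], order_by: Optional[str]) -> None:
    """Hypothetical helper: raise if the requested order-by column is absent."""
    if order_by is None:  # ordering disabled, nothing to validate
        return
    if order_by not in set(columns):
        raise ParameterError(MISSING_ORDER_BY_COLUMN_ERROR.format(order_by=order_by))
```

Because every entry point would call the same two functions, a misconfigured column name yields the same exception type and message wherever it is first detected.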

Pre-Review Checklist

Ensure that the following pass:

make format && make check or via prek validation.
make test passes locally
make test-e2e passes locally
make test-ci-container passes locally (recommended)
GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist

New or updated tests for any fix or new behavior
Updated documentation for new features and behaviors, including docstrings for API docs.

Other Notes

Closes #176 (fix: group by config error not handled properly)

Current Behavior

entry point 1

safe-synthesizer run --config /root/configs/quick-tinyllama-unsloth.yaml --data-source /root/datasets/clinc_oos.csv --data__group_training_examples_by non_existent

First, a warning

2026-04-07T14:56:01.168 | Nemo Safe Synthesizer |  runtime |  warning |  holdout.py: Holdout.train_test_split: 177 
Group By column non_existent not found in input Dataset columns! Doing a normal split.

Then a key error is raised at… config/autoconfig.py::get_max_token_count

entry point 2

safe-synthesizer run --config /root/configs/quick-tinyllama-unsloth.yaml --data-source /root/datasets/clinc_oos.csv --data__group_training_examples_by non_existent --training__rope_scaling_factor 1

Same warning, then an error is raised at the training step:

File "/root/Safe-Synthesizer/src/nemo_safe_synthesizer/training/huggingface_backend.py", line 548, in _validate_groupby_column
    raise ParameterError(msg)
nemo_safe_synthesizer.errors.ParameterError: Group by column 'non_existent' not found in the input data.

order by missing

safe-synthesizer run --config /root/configs/quick-tinyllama-unsloth.yaml --data-source /root/datasets/clinc_oos.csv --data__group_training_examples_by label --data__order_training_examples_by non_existent

Raises an error in the training step:

 File "/root/Safe-Synthesizer/src/nemo_safe_synthesizer/training/huggingface_backend.py", line 551, in _validate_orderby_column
    raise ParameterError(msg)
nemo_safe_synthesizer.errors.ParameterError: Order by column 'non_existent' not found in the input data.

New Behavior

group by error
Always raises the same error, and raises it early:

  File "/root/Safe-Synthesizer/src/nemo_safe_synthesizer/data_processing/validation.py", line 34, in validate_groupby_column
    raise ParameterError(MISSING_GROUP_BY_COLUMN_ERROR.format(group_by=group_by))
nemo_safe_synthesizer.errors.ParameterError: Group by column 'non_existent' not found in input dataset columns. Please set `data.group_training_examples_by` to an existing column or disable grouping.

order by error

  File "/root/Safe-Synthesizer/src/nemo_safe_synthesizer/sdk/library_builder.py", line 278, in process_data
    validate_orderby_column(self._data_source, self._nss_config.data.order_training_examples_by)
  File "/root/Safe-Synthesizer/src/nemo_safe_synthesizer/data_processing/validation.py", line 55, in validate_orderby_column
    raise ParameterError(MISSING_ORDER_BY_COLUMN_ERROR.format(order_by=order_by))
nemo_safe_synthesizer.errors.ParameterError: Order by column 'non_existent' not found in the input data.

Copilot

Pull request overview

This PR makes missing data.group_training_examples_by columns fail fast with a consistent, user-facing error, instead of warning during holdout and later failing with less-informative downstream errors.

Changes:

Introduces a shared validate_groupby_column helper (and centralized error messages) and reuses it across SDK processing, holdout, and training.
Updates holdout behavior to raise immediately on missing/invalid group-by rather than silently falling back to an ungrouped split.
Adds/updates tests to cover early failure and the new validation helper.
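
To make the holdout change concrete, here is a hedged before/after sketch. The names `split_with_fallback` and `split_fail_fast` are hypothetical, rows are plain dictionaries, `ParameterError` is a local stand-in, and both split helpers are deliberately naive placeholders. None of this is the project's `Holdout` implementation.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class ParameterError(ValueError):
    """Local stand-in for nemo_safe_synthesizer.errors.ParameterError."""


def _naive_split(rows, test_fraction=0.25):
    """Placeholder ungrouped split: the last rows become the test set."""
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[:-n_test], rows[-n_test:]


def _grouped_split(rows, group_by, test_fraction=0.25):
    """Placeholder grouped split: every group lands entirely on one side."""
    groups = sorted({row[group_by] for row in rows})
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[-n_test:])
    train = [row for row in rows if row[group_by] not in test_groups]
    test = [row for row in rows if row[group_by] in test_groups]
    return train, test


def _column_missing(rows, group_by: Optional[str]) -> bool:
    return group_by is not None and any(group_by not in row for row in rows)


def split_with_fallback(rows, group_by: Optional[str]):
    """Old behavior: warn, then silently fall back to an ungrouped split."""
    if _column_missing(rows, group_by):
        logger.warning("Group By column %s not found. Doing a normal split.", group_by)
        return _naive_split(rows)
    return _grouped_split(rows, group_by) if group_by else _naive_split(rows)


def split_fail_fast(rows, group_by: Optional[str]):
    """New behavior: a missing group-by column is an error at the holdout step."""
    if _column_missing(rows, group_by):
        raise ParameterError(f"Group by column '{group_by}' not found in input dataset columns.")
    return _grouped_split(rows, group_by) if group_by else _naive_split(rows)
```

With the fallback variant the misconfiguration only surfaces later in the pipeline; with the fail-fast variant the caller sees the problem before any split is produced.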

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `src/nemo_safe_synthesizer/data_processing/validation.py` | Adds shared group-by validation + standardized error messages. |
| `src/nemo_safe_synthesizer/sdk/library_builder.py` | Validates group-by early in `process_data()` before constructing/running holdout. |
| `src/nemo_safe_synthesizer/holdout/holdout.py` | Replaces warning+fallback behavior with centralized validation + early error. |
| `src/nemo_safe_synthesizer/training/huggingface_backend.py` | Reuses centralized validator and validates before autoconfig resolution. |
| `tests/data_processing/test_validation.py` | New unit tests for `validate_groupby_column`. |
| `tests/holdout/test_holdout.py` | Updates holdout test to expect the new missing-group-by behavior/message. |
| `tests/sdk/test_process_data.py` | Adds regression test ensuring `process_data()` fails before `Holdout` is invoked. |
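
A regression test of the kind described for `tests/sdk/test_process_data.py` might follow the pattern below. `FakeBuilder` and its constructor arguments are hypothetical stand-ins rather than the SDK's builder class, and `ParameterError` is a local stand-in; the sketch only shows how a mock can be used to assert that the holdout step is never reached.

```python
from unittest import mock


class ParameterError(ValueError):
    """Local stand-in for nemo_safe_synthesizer.errors.ParameterError."""


class FakeBuilder:
    """Hypothetical builder: validates the group-by column, then runs holdout."""

    def __init__(self, columns, group_by, holdout):
        self._columns = columns
        self._group_by = group_by
        self._holdout = holdout

    def process_data(self):
        # Validation happens first, so a bad column never reaches holdout.
        if self._group_by is not None and self._group_by not in self._columns:
            raise ParameterError(f"Group by column '{self._group_by}' not found.")
        return self._holdout.train_test_split()


def test_process_data_fails_before_holdout():
    holdout = mock.Mock()
    builder = FakeBuilder(["text", "label"], "non_existent", holdout)
    try:
        builder.process_data()
    except ParameterError as exc:
        assert "non_existent" in str(exc)
    else:
        raise AssertionError("expected ParameterError")
    # The mocked holdout must not have been touched.
    holdout.train_test_split.assert_not_called()
```

Under pytest the `try`/`except` could be replaced by `pytest.raises`; plain assertions are used here to keep the sketch dependency-free.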

seayang-nv

Overall looks good to me. Thanks!

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

kendrickb-nvidia

This looks good. Will need to do some merge resolution after whichever of #374 and #344 merges first.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

kendrickb-nvidia

Other than removing/consolidating the testing to test_validation.py, this looks good to me.

Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

centralize group by column validation
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

PR feedback: add early fail to order by check, remove unnecessary internal method, clean up docstrings, etc.
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

minor ty/docstring fixes
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

src/nemo_safe_synthesizer/holdout/holdout.py:196

Holdout.train_test_split() validates self.group_by via validate_groupby_column(...), and then (when grouping is enabled) calls grouped_train_test_split(...), which validates the same column again. This double-scans the group column for nulls and adds avoidable overhead on large datasets. Consider removing one of the validations (e.g., rely on grouped_train_test_split for the grouped path and change the branch to if self.group_by is not None: so empty-string misconfigs still raise) to keep validation single-pass.

        validate_groupby_column(input_df, self.group_by)

        if self.group_by:
            training_df, test_df = grouped_train_test_split(
                input_df=input_df,
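
The distinction Copilot draws between `if self.group_by:` and `if self.group_by is not None:` comes down to Python truthiness: an empty string is falsy, so a truthiness check treats `""` the same as `None` and skips the grouped path. A small standalone illustration follows; the two `route_*` helpers are hypothetical and not part of the codebase.

```python
def route_truthy(group_by):
    """Mirrors `if self.group_by:`; an empty string counts as 'no grouping'."""
    return "grouped" if group_by else "ungrouped"


def route_not_none(group_by):
    """Mirrors `if self.group_by is not None:`; only None disables grouping."""
    return "grouped" if group_by is not None else "ungrouped"


for value in (None, "", "label"):
    print(repr(value), route_truthy(value), route_not_none(value))
# None ungrouped ungrouped
# '' ungrouped grouped
# 'label' grouped grouped
```

Only the `is not None` form sends an empty-string misconfiguration down the grouped path, where a column check would then reject it.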

nina-xu changed the title ~~Fix: raise group by column not found error~~ fix: raise group by column not found error Apr 7, 2026

nina-xu marked this pull request as ready for review April 7, 2026 15:57

nina-xu requested a review from a team as a code owner April 7, 2026 15:57

Copilot AI review requested due to automatic review settings April 7, 2026 15:57

Copilot started reviewing on behalf of nina-xu April 7, 2026 15:58

Copilot AI reviewed Apr 7, 2026

Comment thread tests/holdout/test_holdout.py Outdated

Comment thread src/nemo_safe_synthesizer/training/huggingface_backend.py

seayang-nv previously approved these changes Apr 7, 2026

Comment thread src/nemo_safe_synthesizer/holdout/holdout.py Outdated

kendrickb-nvidia requested changes Apr 7, 2026

kendrickb-nvidia mentioned this pull request Apr 7, 2026

fix: standardized groupby types #344

Merged

7 tasks

nina-xu dismissed seayang-nv’s stale review via b86051c April 8, 2026 15:11

Copilot AI review requested due to automatic review settings April 8, 2026 15:15

nina-xu force-pushed the nina-xu/176-error-out-group-by-column-not-found branch from b86051c to 2a37c9b Compare April 8, 2026 15:15

Copilot started reviewing on behalf of nina-xu April 8, 2026 15:16

Copilot AI reviewed Apr 8, 2026

Comment thread src/nemo_safe_synthesizer/holdout/holdout.py Outdated

Comment thread src/nemo_safe_synthesizer/holdout/holdout.py Outdated

Comment thread src/nemo_safe_synthesizer/training/huggingface_backend.py Outdated

nina-xu requested review from kendrickb-nvidia and seayang-nv April 8, 2026 15:22

kendrickb-nvidia previously approved these changes Apr 8, 2026

Copilot AI review requested due to automatic review settings April 9, 2026 14:50

nina-xu dismissed kendrickb-nvidia’s stale review via 9185def April 9, 2026 14:50

nina-xu force-pushed the nina-xu/176-error-out-group-by-column-not-found branch from 6230446 to 9185def Compare April 9, 2026 14:50

Copilot started reviewing on behalf of nina-xu April 9, 2026 14:50

Copilot AI reviewed Apr 9, 2026

Comment thread src/nemo_safe_synthesizer/training/huggingface_backend.py

Comment thread src/nemo_safe_synthesizer/data_processing/validation.py

nina-xu requested a review from kendrickb-nvidia April 9, 2026 14:59

kendrickb-nvidia previously approved these changes Apr 14, 2026

Comment thread tests/training/test_huggingface_backend.py Outdated

nina-xu added 4 commits April 15, 2026 17:27

refactor validation now that we also check for comma in the name

cb85128

Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

make format

879dc0f

Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

dedupe tests

673867d

Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

nina-xu dismissed kendrickb-nvidia’s stale review via 673867d April 15, 2026 17:43

nina-xu force-pushed the nina-xu/176-error-out-group-by-column-not-found branch from 22af3b7 to 673867d Compare April 15, 2026 17:43

revert nit change

040f078

Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

nina-xu requested review from Copilot and kendrickb-nvidia April 15, 2026 17:53

Copilot started reviewing on behalf of nina-xu April 15, 2026 17:53

Copilot AI reviewed Apr 15, 2026

Revert "revert nit change"

de92dd3

This reverts commit 040f078. Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

nina-xu force-pushed the nina-xu/176-error-out-group-by-column-not-found branch from a1aecd1 to de92dd3 Compare April 15, 2026 17:59

kendrickb-nvidia approved these changes Apr 15, 2026
