fix: raise group by column not found error #374

Open
nina-xu wants to merge 6 commits into main from nina-xu/176-error-out-group-by-column-not-found
Conversation

@nina-xu
Contributor

@nina-xu nina-xu commented Apr 7, 2026

Summary

Previously, if the group_training_examples_by column was missing from the dataset, we logged a warning at the holdout step and fell back to the naive train/test split. The fallback was pointless: the run would fail later anyway, at either the RoPE scaling auto-resolution step or the training step, with a less informative error message. This PR:

  • Raises an error early, at the holdout step, instead
  • Adds early validation for order_training_examples_by in the library builder so that it fails early as well
  • Centralizes the group-by and order-by validation checks so that the same check runs regardless of the entry point and consistently produces an informative error message.

Pre-Review Checklist

Ensure that the following pass:

  • make format && make check or via prek validation.
  • make test passes locally
  • make test-e2e passes locally
  • make test-ci-container passes locally (recommended)
  • GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist

  • New or updated tests for any fix or new behavior
  • Updated documentation for new features and behaviors, including docstrings for API docs.

Other Notes

Current Behavior

  • entry point 1
safe-synthesizer run --config /root/configs/quick-tinyllama-unsloth.yaml --data-source /root/datasets/clinc_oos.csv --data__group_training_examples_by non_existent

First, a warning

2026-04-07T14:56:01.168 | Nemo Safe Synthesizer |  runtime |  warning |  holdout.py: Holdout.train_test_split: 177 
Group By column non_existent not found in input Dataset columns! Doing a normal split.

Then a KeyError is raised at… config/autoconfig.py::get_max_token_count

  • entry point 2
safe-synthesizer run --config /root/configs/quick-tinyllama-unsloth.yaml --data-source /root/datasets/clinc_oos.csv --data__group_training_examples_by non_existent --training__rope_scaling_factor 1

Same warning, then an error is raised at the training step:

File "/root/Safe-Synthesizer/src/nemo_safe_synthesizer/training/huggingface_backend.py", line 548, in _validate_groupby_column
    raise ParameterError(msg)
nemo_safe_synthesizer.errors.ParameterError: Group by column 'non_existent' not found in the input data.
  • order by missing
safe-synthesizer run --config /root/configs/quick-tinyllama-unsloth.yaml --data-source /root/datasets/clinc_oos.csv --data__group_training_examples_by label --data__order_training_examples_by non_existent

Raises an error in the training step:

 File "/root/Safe-Synthesizer/src/nemo_safe_synthesizer/training/huggingface_backend.py", line 551, in _validate_orderby_column
    raise ParameterError(msg)
nemo_safe_synthesizer.errors.ParameterError: Order by column 'non_existent' not found in the input data.

New Behavior

  • group by error
    Always raises the same error, and raises it early:
  File "/root/Safe-Synthesizer/src/nemo_safe_synthesizer/data_processing/validation.py", line 34, in validate_groupby_column
    raise ParameterError(MISSING_GROUP_BY_COLUMN_ERROR.format(group_by=group_by))
nemo_safe_synthesizer.errors.ParameterError: Group by column 'non_existent' not found in input dataset columns. Please set `data.group_training_examples_by` to an existing column or disable grouping.
  • order by error
  File "/root/Safe-Synthesizer/src/nemo_safe_synthesizer/sdk/library_builder.py", line 278, in process_data
    validate_orderby_column(self._data_source, self._nss_config.data.order_training_examples_by)
  File "/root/Safe-Synthesizer/src/nemo_safe_synthesizer/data_processing/validation.py", line 55, in validate_orderby_column
    raise ParameterError(MISSING_ORDER_BY_COLUMN_ERROR.format(order_by=order_by))
nemo_safe_synthesizer.errors.ParameterError: Order by column 'non_existent' not found in the input data.
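
The centralized check described above can be pictured roughly as follows. This is a minimal sketch, not the code in validation.py: the names validate_groupby_column, validate_orderby_column, ParameterError and the two message constants, along with the message text, come from the tracebacks above, while the function bodies, the _require_column helper, the treatment of None as "feature disabled", and the stand-in dataset object are assumptions made for illustration.

```python
from types import SimpleNamespace


class ParameterError(ValueError):
    """Stand-in for nemo_safe_synthesizer.errors.ParameterError (assumed shape)."""


# Message templates copied from the tracebacks above.
MISSING_GROUP_BY_COLUMN_ERROR = (
    "Group by column '{group_by}' not found in input dataset columns. "
    "Please set `data.group_training_examples_by` to an existing column "
    "or disable grouping."
)
MISSING_ORDER_BY_COLUMN_ERROR = "Order by column '{order_by}' not found in the input data."


def _require_column(dataset, column, message):
    """Hypothetical helper: raise ParameterError if `column` is set but absent."""
    if column is None:  # assumption: None means the feature is disabled
        return
    if column not in list(dataset.columns):
        raise ParameterError(message)


def validate_groupby_column(dataset, group_by):
    """Fail fast when the configured group-by column is missing."""
    _require_column(dataset, group_by, MISSING_GROUP_BY_COLUMN_ERROR.format(group_by=group_by))


def validate_orderby_column(dataset, order_by):
    """Fail fast when the configured order-by column is missing."""
    _require_column(dataset, order_by, MISSING_ORDER_BY_COLUMN_ERROR.format(order_by=order_by))


# Any object exposing `.columns` works here, for example a pandas DataFrame.
dataset = SimpleNamespace(columns=["text", "label"])
validate_groupby_column(dataset, "label")  # passes silently

try:
    validate_groupby_column(dataset, "non_existent")
except ParameterError as exc:
    print(exc)
```

Because every entry point would call the same two functions, the message a user sees no longer depends on which step happens to touch the missing column first.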

@nina-xu nina-xu changed the title from "Fix: raise group by column not found error" to "fix: raise group by column not found error" Apr 7, 2026
@nina-xu nina-xu marked this pull request as ready for review April 7, 2026 15:57
@nina-xu nina-xu requested a review from a team as a code owner April 7, 2026 15:57
Copilot AI review requested due to automatic review settings April 7, 2026 15:57
Contributor

Copilot AI left a comment

Pull request overview

This PR makes missing data.group_training_examples_by columns fail fast with a consistent, user-facing error, instead of warning during holdout and later failing with less-informative downstream errors.

Changes:

  • Introduces a shared validate_groupby_column helper (and centralized error messages) and reuses it across SDK processing, holdout, and training.
  • Updates holdout behavior to raise immediately on missing/invalid group-by rather than silently falling back to an ungrouped split.
  • Adds/updates tests to cover early failure and the new validation helper.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/nemo_safe_synthesizer/data_processing/validation.py | Adds shared group-by validation + standardized error messages. |
| src/nemo_safe_synthesizer/sdk/library_builder.py | Validates group-by early in process_data() before constructing/running holdout. |
| src/nemo_safe_synthesizer/holdout/holdout.py | Replaces warning+fallback behavior with centralized validation + early error. |
| src/nemo_safe_synthesizer/training/huggingface_backend.py | Reuses centralized validator and validates before autoconfig resolution. |
| tests/data_processing/test_validation.py | New unit tests for validate_groupby_column. |
| tests/holdout/test_holdout.py | Updates holdout test to expect the new missing-group-by behavior/message. |
| tests/sdk/test_process_data.py | Adds regression test ensuring process_data() fails before Holdout is invoked. |
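
The regression test described for tests/sdk/test_process_data.py can be sketched with unittest.mock. The process_data function below is a hypothetical stand-in for the library builder method, and the holdout object is a mock. Only the idea (validate first, then assert that the holdout was never touched) is taken from the description above.

```python
from unittest.mock import MagicMock


class ParameterError(ValueError):
    """Stand-in for the project's ParameterError (assumed shape)."""


def process_data(columns, group_by, holdout):
    """Hypothetical stand-in: validate the group-by column, then split."""
    if group_by is not None and group_by not in columns:
        raise ParameterError(
            f"Group by column '{group_by}' not found in input dataset columns."
        )
    return holdout.train_test_split(columns)


def test_process_data_fails_before_holdout():
    holdout = MagicMock()
    try:
        process_data(["text", "label"], "non_existent", holdout)
    except ParameterError:
        pass
    else:
        raise AssertionError("expected ParameterError")
    # The split must never run when validation fails.
    holdout.train_test_split.assert_not_called()


test_process_data_fails_before_holdout()
```

Asserting on the mock, rather than only on the exception, is what distinguishes "fails early" from "fails eventually".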

Comment thread tests/holdout/test_holdout.py Outdated
Comment thread src/nemo_safe_synthesizer/training/huggingface_backend.py
seayang-nv
seayang-nv previously approved these changes Apr 7, 2026
Contributor

@seayang-nv seayang-nv left a comment

Overall looks good to me. Thanks!

Comment thread src/nemo_safe_synthesizer/holdout/holdout.py Outdated
Comment thread src/nemo_safe_synthesizer/data_processing/validation.py Outdated
Comment thread src/nemo_safe_synthesizer/holdout/holdout.py
Comment thread src/nemo_safe_synthesizer/holdout/holdout.py
Comment thread src/nemo_safe_synthesizer/training/huggingface_backend.py Outdated
Comment thread src/nemo_safe_synthesizer/training/huggingface_backend.py Outdated
Comment thread src/nemo_safe_synthesizer/training/huggingface_backend.py
Comment thread tests/data_processing/test_validation.py Outdated
Comment thread src/nemo_safe_synthesizer/data_processing/validation.py
Copilot AI review requested due to automatic review settings April 8, 2026 15:15
@nina-xu nina-xu force-pushed the nina-xu/176-error-out-group-by-column-not-found branch from b86051c to 2a37c9b April 8, 2026 15:15
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


Comment thread src/nemo_safe_synthesizer/holdout/holdout.py Outdated
Comment thread src/nemo_safe_synthesizer/holdout/holdout.py Outdated
Comment thread src/nemo_safe_synthesizer/training/huggingface_backend.py Outdated
Collaborator

@kendrickb-nvidia kendrickb-nvidia left a comment

This looks good. Will need to do some merge resolution after whichever of #374 and #344 merges first.

Copilot AI review requested due to automatic review settings April 9, 2026 14:50
@nina-xu nina-xu force-pushed the nina-xu/176-error-out-group-by-column-not-found branch from 6230446 to 9185def April 9, 2026 14:50
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.


Comment thread src/nemo_safe_synthesizer/training/huggingface_backend.py
Comment thread src/nemo_safe_synthesizer/data_processing/validation.py
@nina-xu nina-xu requested a review from kendrickb-nvidia April 9, 2026 14:59
Collaborator

@kendrickb-nvidia kendrickb-nvidia left a comment

Other than removing/consolidating the testing to test_validation.py, this looks good to me.

Comment thread tests/training/test_huggingface_backend.py Outdated
nina-xu added 4 commits April 15, 2026 17:27
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

centralize group by column validation

Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

PR feedback: add early fail to order by check, remove unnecessary internal method, clean up docstrings, etc.

Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>

minor ty/docstring fixes

Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
@nina-xu nina-xu force-pushed the nina-xu/176-error-out-group-by-column-not-found branch from 22af3b7 to 673867d April 15, 2026 17:43
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

src/nemo_safe_synthesizer/holdout/holdout.py:196

  • Holdout.train_test_split() validates self.group_by via validate_groupby_column(...), and then (when grouping is enabled) calls grouped_train_test_split(...), which validates the same column again. This double-scans the group column for nulls and adds avoidable overhead on large datasets. Consider removing one of the validations (e.g., rely on grouped_train_test_split for the grouped path and change the branch to if self.group_by is not None: so empty-string misconfigs still raise) to keep validation single-pass.
        validate_groupby_column(input_df, self.group_by)

        if self.group_by:
            training_df, test_df = grouped_train_test_split(
                input_df=input_df,
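
The branch condition matters for the empty-string case mentioned in this comment. The sketch below uses hypothetical helper names and a plain list of column names rather than the project's DataFrame. It only illustrates that `if group_by:` silently skips an empty string, while `if group_by is not None:` sends it down the validated path.

```python
class ParameterError(ValueError):
    """Stand-in for the project's ParameterError (assumed shape)."""


def _validated_grouped_path(columns, group_by):
    """Hypothetical grouped path that validates the column exactly once."""
    if group_by not in columns:
        raise ParameterError(
            f"Group by column '{group_by}' not found in input dataset columns."
        )
    return "grouped"


def split_truthy(columns, group_by):
    # `if group_by:` treats "" like None, so a misconfigured empty string
    # falls through to the ungrouped split without any error.
    if group_by:
        return _validated_grouped_path(columns, group_by)
    return "ungrouped"


def split_not_none(columns, group_by):
    # `is not None` routes "" into the grouped path, where validation raises.
    if group_by is not None:
        return _validated_grouped_path(columns, group_by)
    return "ungrouped"


columns = ["text", "label"]
print(split_truthy(columns, ""))      # ungrouped
print(split_not_none(columns, None))  # ungrouped
try:
    split_not_none(columns, "")
except ParameterError as exc:
    print(exc)
```

With the second form, a single validation inside the grouped path covers both the missing-column and empty-string misconfigurations, which is the single-pass arrangement the comment proposes.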

This reverts commit 040f078.

Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
@nina-xu nina-xu force-pushed the nina-xu/176-error-out-group-by-column-not-found branch from a1aecd1 to de92dd3 April 15, 2026 17:59
Development

Successfully merging this pull request may close these issues.

fix: group by config error not handled properly

4 participants