fix: standardized groupby types by seayang-nv · Pull Request #344 · NVIDIA-NeMo/Safe-Synthesizer

seayang-nv · 2026-04-02T18:10:54Z

Summary

The config source of truth in config/data.py defines group_training_examples_by as str | None, but several downstream consumers accepted str | list[str] or list[str] | str | None. This PR makes all public/interface-level group_by type annotations consistently str (or str | None).

Changes:

utils.py -- grouped_train_test_split: narrowed group_by from str | list[str] to str
config/autoconfig.py -- get_max_token_count: narrowed group_by from list[str] | str | None to str | None
generation/processors.py -- GroupedDataProcessor.__init__: narrowed group_by from str | list[str] to str, removed the isinstance coercion guard. Internal storage remains self.group_by: list[str] = [group_by] to leave room for future multi-column expansion.
holdout/holdout.py -- grouped_train_test_split: added type annotations (df: pd.DataFrame, test_size: int | float, group_by: str, random_state: int | None)
tests/generation/test_processors.py -- commented out two multi-column group_by tests (test_grouped_data_processor_multiple_group_by and test_grouped_data_processor_multiple_group_by_error) that passed list[str], which is no longer valid at the public API level. Single-column grouping remains well-covered by the existing test suite.

Design note: Classes that internally iterate over group columns (GroupedDataProcessor, GroupedDataExampleAssembler) still store self.group_by as list[str] internally. This means expanding back to multi-column grouping in the future only requires widening the public signatures -- no internal refactoring needed.
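The design note above can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not the actual GroupedDataProcessor: only the `group_by: str` parameter and the `self.group_by: list[str] = [group_by]` storage come from the PR description. The class name, the `group_key` helper, and the row-as-dict representation are assumptions made for illustration.

```python
from __future__ import annotations


class GroupedProcessorSketch:
    """Hypothetical stand-in for a processor that groups rows by one column."""

    def __init__(self, group_by: str) -> None:
        # The public signature accepts a single column name only.
        # Internally it is wrapped in a list, so code that iterates over
        # group columns would not need to change if multi-column grouping
        # were re-enabled later.
        self.group_by: list[str] = [group_by]

    def group_key(self, row: dict[str, object]) -> tuple[object, ...]:
        # Iterates over the internal list, so it already handles N columns.
        return tuple(row[col] for col in self.group_by)


processor = GroupedProcessorSketch(group_by="patient_id")
key = processor.group_key({"patient_id": 7, "event_id": 3})
```

With this shape, widening the constructor to accept a list again would only touch `__init__`; `group_key` and any similar loops stay as they are.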

e2e test on patient_events.csv with the following config:

data:
  holdout: 0    
  max_holdout: 0
  group_training_examples_by: patient_id
  order_training_examples_by: event_id

generation:
  num_records: 1000   

training:
  pretrained_model: "HuggingFaceTB/SmolLM3-3B"
  num_input_records_to_sample: 25000

Results:

                                                    Quality Metrics                                                     
+----------------------------------------------------------------------------------------------------------------------+
| Metric                                | Value                                                                        |
|---------------------------------------+------------------------------------------------------------------------------|
| Synthetic Data Quality Score          | 8.80                                                                         |
| Column Correlation Stability Score    | 8.90                                                                         |
| Deep Structure Stability Score        | 8.10                                                                         |
| Column Distribution Stability Score   | 8.80                                                                         |
| Text Semantic Similarity Score        | None                                                                         |
| Text Structure Similarity Score       | 9.90                                                                         |
| Data Privacy Score                    | 9.60                                                                         |
| Membership Inference Protection Score | None                                                                         |
| Attribute Inference Protection Score  | 9.60                                                                         |
| Num Valid Records                     | 1010                                                                         |
| Num Invalid Records                   | 254                                                                          |
| Num Prompts                           | 280                                                                          |
| Valid Record Fraction                 | 79.91%                                                                       |
| Timing                                | {'total_time_sec': 1342.7879953924567, 'pii_replacer_time_sec': None,        |
|                                       | 'training_time_sec': 514.4236079379916, 'generation_time_sec':               |
|                                       | 576.6627847012132, 'evaluation_time_sec': 35.51542983856052}                 |
+----------------------------------------------------------------------------------------------------------------------+
 
2026-04-02T18:11:26.464 | Nemo Safe Synthesizer |  runtime |  info  |  external_results.py: SafeSynthesizerTiming.log_timing: 34 
Safe Synthesizer timing

          Pipeline Timing          
+---------------------------------+
| Metric                | Value   |
|-----------------------+---------|
| Total Time Sec        | 1342.79 |
| Pii Replacer Time Sec | None    |
| Training Time Sec     | 514.42  |
| Generation Time Sec   | 576.66  |
| Evaluation Time Sec   | 35.52   |
+---------------------------------+

Pre-Review Checklist

Ensure that the following pass:

make format && make check or via prek validation.
make test passes locally
make test-e2e passes locally
make test-ci-container passes locally (recommended)
GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist

New or updated tests for any fix or new behavior
Updated documentation for new features and behaviors, including docstrings for API docs.

Other Notes

Closes bug: groupby type unclear #180

nina-xu

question: if the user puts in group_training_examples_by: col1,col2, does that get parsed into a list and fail validation, or does it get parsed into the single string "col1,col2" and pass validation?
In general, it might be nice to add a test demonstrating that one column passes validation and a list fails.

seayang-nv · 2026-04-03T15:46:07Z

Good suggestions. Added the following:

If a user passes col1,col2, we emit a warning:

group_training_examples_by contains a comma: col1,col2. Only a single column name is supported. If you intended to specify multiple columns, note that multi-column grouping is not currently supported.

I don't want to reject it outright, in case some users have commas in their column names.

In the col1,col2 case, the user would get a KeyError. Added this scenario to the troubleshooting guide.
Added tests.
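The tests mentioned above might look roughly like the following. This is a sketch only: `validate_group_by` is a hypothetical plain-Python stand-in for the config validation (the real check lives in the Pydantic model in config/data.py and is not reproduced here), and the test names are made up for illustration.

```python
from __future__ import annotations


def validate_group_by(value: object) -> str | None:
    """Hypothetical stand-in: accept a single column name or None."""
    if value is None or isinstance(value, str):
        return value
    raise TypeError(
        f"group_training_examples_by must be str or None, got {type(value).__name__}"
    )


def test_single_column_passes() -> None:
    assert validate_group_by("patient_id") == "patient_id"


def test_comma_string_is_one_column_name() -> None:
    # "col1,col2" is a plain string, so it passes validation as ONE column name.
    assert validate_group_by("col1,col2") == "col1,col2"


def test_list_fails() -> None:
    try:
        validate_group_by(["col1", "col2"])
    except TypeError:
        return
    raise AssertionError("a list should be rejected")
```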

kendrickb-nvidia

The more I think about it, I don't think a warning in pydantic is a good idea. That warning will be printed out in random places whenever we end up parsing the pydantic model and will be really annoying if you do have a comma in the name.

This is a super clear-cut case: we already explicitly check for it and raise an error if the column name can't be found in the data frame. If we want a warning about the comma, let's put it there, right before we raise the error (and only if the column was not found), when we know it's a problem, not in pydantic validation.

src/nemo_safe_synthesizer/config/data.py

tests/generation/test_processors.py

Signed-off-by: Sean Yang <seayang@nvidia.com>

# Summary

Improve the heartbeat message by adding an explanation that it is normal to have long stretches with no new records.

## Pre-Review Checklist

Ensure that the following pass:

- [x] `make format && make check` or via prek validation.
- [x] `make test` passes locally
- [x] `make test-e2e` passes locally
- [ ] `make test-ci-container` passes locally (recommended)
- [ ] GPU CI status check passes -- comment `/sync` on this PR to trigger a run (auto-triggers on ready-for-review)

## Pre-Merge Checklist

- [ ] New or updated tests for any fix or new behavior
- [ ] Updated documentation for new features and behaviors, including docstrings for API docs.

## Other Notes

- Closes #<issue>

---------

Signed-off-by: Alexa Haushalter <ahaushalter@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>

Signed-off-by: Sean Yang <seayang@nvidia.com>

seayang-nv · 2026-04-03T20:03:06Z

Moved the comma-in-column-name hint out of Pydantic validation and into the actual error sites. Previously, a warning fired every time the config was parsed. Now the hint only appears when the column is not found in the data, appended to the existing error messages in _validate_groupby_column, holdout.train_test_split, and assembler._validate_columns.
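A minimal sketch of the approach described above, assuming a helper that checks the group-by column against the available column names. The function name `check_group_by_column`, the exact message wording, and the use of a plain list of column names instead of a DataFrame are illustrative assumptions; the actual code in _validate_groupby_column, holdout.train_test_split, and assembler._validate_columns may differ.

```python
from __future__ import annotations

COMMA_HINT = (
    " The value contains a comma; if you intended to specify multiple columns,"
    " note that multi-column grouping is not currently supported."
)


def check_group_by_column(group_by: str, columns: list[str]) -> None:
    """Raise KeyError if group_by is missing, adding a comma hint if relevant."""
    if group_by in columns:
        # A column whose name really contains a comma is found here,
        # so no hint or warning is ever produced for it.
        return
    message = f"Group by column '{group_by}' not found in the data."
    if "," in group_by:
        # The hint is attached only once we know the lookup has failed.
        message += COMMA_HINT
    raise KeyError(message)


columns = ["patient_id", "event_id", "city,state"]
check_group_by_column("patient_id", columns)  # found, no error
check_group_by_column("city,state", columns)  # comma in a real name, no error

try:
    check_group_by_column("col1,col2", columns)
except KeyError as exc:
    error_text = str(exc)
```

Because the hint is built only on the failure path, parsing the config never prints anything, which addresses the concern about warnings showing up in unrelated places.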

seayang-nv requested a review from a team as a code owner April 2, 2026 18:10

seayang-nv changed the title ~~standardiszed groupby types~~ fix: standardized groupby types Apr 2, 2026

seayang-nv force-pushed the seayang/clarify-groupby-type branch from 9b3c801 to e6d56d6 Compare April 2, 2026 18:19

nina-xu previously approved these changes Apr 3, 2026

seayang-nv dismissed nina-xu’s stale review via c508066 April 3, 2026 15:49

seayang-nv requested a review from a team as a code owner April 3, 2026 15:49

nina-xu previously approved these changes Apr 3, 2026

kendrickb-nvidia requested changes Apr 3, 2026

seayang-nv and others added 7 commits April 3, 2026 13:59

standardiszed groupby types

99d26bd

Signed-off-by: Sean Yang <seayang@nvidia.com>

commented out multi column group by

54c96cd

Signed-off-by: Sean Yang <seayang@nvidia.com>

fixed type checking errors

84b5fc0

Signed-off-by: Sean Yang <seayang@nvidia.com>

added validation and troubleshooting guide for the col1,col2 case

576f4a3

Signed-off-by: Sean Yang <seayang@nvidia.com>

fixed format

3effef3

Signed-off-by: Sean Yang <seayang@nvidia.com>

addressed feedback

7084ab6

Signed-off-by: Sean Yang <seayang@nvidia.com>

seayang-nv dismissed nina-xu’s stale review via 7084ab6 April 3, 2026 20:00

seayang-nv force-pushed the seayang/clarify-groupby-type branch from 8648e8f to 7084ab6 Compare April 3, 2026 20:00

Merge branch 'main' into seayang/clarify-groupby-type

3c22a09

seayang-nv requested review from kendrickb-nvidia and nina-xu April 3, 2026 20:03

seayang-nv wants to merge 8 commits into main from seayang/clarify-groupby-type
