
fix: standardized groupby types#344

Open
seayang-nv wants to merge 8 commits into main from seayang/clarify-groupby-type

Conversation

@seayang-nv
Contributor

@seayang-nv seayang-nv commented Apr 2, 2026

Summary

The config source of truth in config/data.py defines group_training_examples_by as str | None, but several downstream consumers accepted str | list[str] or list[str] | str | None. This PR makes all public/interface-level group_by type annotations consistently str (or str | None).

Changes:

  • utils.py -- grouped_train_test_split: narrowed group_by from str | list[str] to str
  • config/autoconfig.py -- get_max_token_count: narrowed group_by from list[str] | str | None to str | None
  • generation/processors.py -- GroupedDataProcessor.__init__: narrowed group_by from str | list[str] to str, removed the isinstance coercion guard. Internal storage remains self.group_by: list[str] = [group_by] to leave room for future multi-column expansion.
  • holdout/holdout.py -- grouped_train_test_split: added type annotations (df: pd.DataFrame, test_size: int | float, group_by: str, random_state: int | None)
  • tests/generation/test_processors.py -- commented out two multi-column group_by tests (test_grouped_data_processor_multiple_group_by and test_grouped_data_processor_multiple_group_by_error) that passed list[str], which is no longer valid at the public API level. Single-column grouping remains well-covered by the existing test suite.
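
To illustrate the narrowed `group_by: str` signature, here is a minimal sketch of a grouped split. It is not the project's implementation: the name `grouped_train_test_split_sketch` is hypothetical, plain lists of dicts stand in for `pd.DataFrame` to keep the example dependency-free, and treating a float `test_size` as a fraction of groups and an int as a group count is an assumption.

```python
from __future__ import annotations

import random


def grouped_train_test_split_sketch(
    records: list[dict[str, object]],
    test_size: int | float,
    group_by: str,
    random_state: int | None = None,
) -> tuple[list[dict[str, object]], list[dict[str, object]]]:
    """Split records so every group lands entirely in train or in test."""
    # Collect distinct group values in first-seen order so the shuffle is reproducible.
    groups = list(dict.fromkeys(record[group_by] for record in records))
    rng = random.Random(random_state)
    rng.shuffle(groups)
    # Assumption: a float is a fraction of groups, an int is a number of groups.
    if isinstance(test_size, float):
        n_test = round(test_size * len(groups))
    else:
        n_test = test_size
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_by] not in test_groups]
    test = [r for r in records if r[group_by] in test_groups]
    return train, test


records = [{"patient_id": pid, "event_id": eid} for pid in (1, 2, 3) for eid in (0, 1)]
train, test = grouped_train_test_split_sketch(
    records, test_size=1, group_by="patient_id", random_state=0
)
```

Because `group_by` is a single string, a missing column surfaces as a plain `KeyError` on that one name.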

Design note: Classes that internally iterate over group columns (GroupedDataProcessor, GroupedDataExampleAssembler) still store self.group_by as list[str] internally. This means expanding back to multi-column grouping in the future only requires widening the public signatures -- no internal refactoring needed.
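
The design note can be sketched roughly as follows. This is an illustrative stand-in, not the actual `GroupedDataProcessor`: the class name `GroupedDataProcessorSketch` and the `group_key` helper are hypothetical, and records are plain dicts rather than DataFrame rows.

```python
from __future__ import annotations


class GroupedDataProcessorSketch:
    """Hypothetical stand-in for GroupedDataProcessor; not the project's code."""

    def __init__(self, group_by: str) -> None:
        # The public signature accepts exactly one column name.
        # Internally the value is wrapped in a list, so code that loops over
        # group columns does not need to change.
        self.group_by: list[str] = [group_by]

    def group_key(self, record: dict[str, object]) -> tuple[object, ...]:
        # Iterates over the internal list; supporting several columns later
        # would only require widening the __init__ annotation.
        return tuple(record[column] for column in self.group_by)


processor = GroupedDataProcessorSketch("patient_id")
key = processor.group_key({"patient_id": 7, "event_id": 1})
```

With this shape, callers only ever see `str`, while the iteration logic already handles a list.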

e2e test on patient_events.csv with the following config:

data:
  holdout: 0    
  max_holdout: 0
  group_training_examples_by: patient_id
  order_training_examples_by: event_id

generation:
  num_records: 1000   

training:
  pretrained_model: "HuggingFaceTB/SmolLM3-3B"
  num_input_records_to_sample: 25000  

Results:

                                                    Quality Metrics                                                     
+----------------------------------------------------------------------------------------------------------------------+
| Metric                                | Value                                                                        |
|---------------------------------------+------------------------------------------------------------------------------|
| Synthetic Data Quality Score          | 8.80                                                                         |
| Column Correlation Stability Score    | 8.90                                                                         |
| Deep Structure Stability Score        | 8.10                                                                         |
| Column Distribution Stability Score   | 8.80                                                                         |
| Text Semantic Similarity Score        | None                                                                         |
| Text Structure Similarity Score       | 9.90                                                                         |
| Data Privacy Score                    | 9.60                                                                         |
| Membership Inference Protection Score | None                                                                         |
| Attribute Inference Protection Score  | 9.60                                                                         |
| Num Valid Records                     | 1010                                                                         |
| Num Invalid Records                   | 254                                                                          |
| Num Prompts                           | 280                                                                          |
| Valid Record Fraction                 | 79.91%                                                                       |
| Timing                                | {'total_time_sec': 1342.7879953924567, 'pii_replacer_time_sec': None,        |
|                                       | 'training_time_sec': 514.4236079379916, 'generation_time_sec':               |
|                                       | 576.6627847012132, 'evaluation_time_sec': 35.51542983856052}                 |
+----------------------------------------------------------------------------------------------------------------------+
 
2026-04-02T18:11:26.464 | Nemo Safe Synthesizer |  runtime |  info  |  external_results.py: SafeSynthesizerTiming.log_timing: 34 
Safe Synthesizer timing

          Pipeline Timing          
+---------------------------------+
| Metric                | Value   |
|-----------------------+---------|
| Total Time Sec        | 1342.79 |
| Pii Replacer Time Sec | None    |
| Training Time Sec     | 514.42  |
| Generation Time Sec   | 576.66  |
| Evaluation Time Sec   | 35.52   |
+---------------------------------+

Pre-Review Checklist

Ensure that the following pass:

  • make format && make check or via prek validation.
  • make test passes locally
  • make test-e2e passes locally
  • make test-ci-container passes locally (recommended)
  • GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist

  • New or updated tests for any fix or new behavior
  • Updated documentation for new features and behaviors, including docstrings for API docs.

Other Notes

@seayang-nv seayang-nv requested a review from a team as a code owner April 2, 2026 18:10
@seayang-nv seayang-nv changed the title from "standardiszed groupby types" to "fix: standardized groupby types" Apr 2, 2026
@seayang-nv seayang-nv force-pushed the seayang/clarify-groupby-type branch from 9b3c801 to e6d56d6 Compare April 2, 2026 18:19
nina-xu
nina-xu previously approved these changes Apr 3, 2026
Contributor

@nina-xu nina-xu left a comment


Question: if the user puts in group_training_examples_by: col1,col2, does that get parsed into a list and fail validation, or does it get parsed into one string "col1,col2" and pass validation?
In general, it might be nice to add a test demonstrating that one column passes validation and a list fails.

@seayang-nv
Contributor Author

Question: if the user puts in group_training_examples_by: col1,col2, does that get parsed into a list and fail validation, or does it get parsed into one string "col1,col2" and pass validation? In general, it might be nice to add a test demonstrating that one column passes validation and a list fails.

Good suggestions. Added the following:

  • If a user passes col1,col2, we now emit a warning:
group_training_examples_by contains a comma: col1,col2. Only a single column name is supported. If you intended to specify multiple columns, note that multi-column grouping is not currently supported.

I don't want to reject the value outright, in case some users have commas in their column names.

  • In the col1,col2 case, the user would get a KeyError. Added this scenario to the troubleshooting guide.
  • Added tests.

@seayang-nv seayang-nv requested a review from a team as a code owner April 3, 2026 15:49
nina-xu
nina-xu previously approved these changes Apr 3, 2026
Collaborator

@kendrickb-nvidia kendrickb-nvidia left a comment


The more I think about it, I don't think a warning in pydantic is a good idea. That warning will be printed out in random places whenever we end up parsing the pydantic model and will be really annoying if you do have a comma in the name.

This is a super clear cut error that we explicitly check for and raise an error if the column name can't be found in the data frame. If we want a warning about the comma, let's put it there right before we raise the error (but only if column was not found) when we know it's a problem. And not in pydantic validation.

seayang-nv and others added 7 commits April 3, 2026 13:59
Signed-off-by: Sean Yang <seayang@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION
& AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

<!-- Thank you for contributing to Safe Synthesizer! -->

# Summary
Improve the heartbeat message by adding an explanation that it is normal
to have long stretches with no new records.

## Pre-Review Checklist

<!-- These checks should be completed before a PR is reviewed, -->
<!-- but you can submit a draft early to indicate that the issue is
being worked on. -->

Ensure that the following pass:

- [x] `make format && make check` or via prek validation.
- [x] `make test` passes locally
- [x] `make test-e2e` passes locally
- [ ] `make test-ci-container` passes locally (recommended)
- [ ] GPU CI status check passes -- comment `/sync` on this PR to
trigger a run (auto-triggers on ready-for-review)

## Pre-Merge Checklist

<!-- These checks need to be completed before a PR is merged, -->
<!-- but as PRs often change significantly during review, -->
<!-- it's OK for them to be incomplete when review is first requested.
-->

- [ ] New or updated tests for any fix or new behavior
- [ ] Updated documentation for new features and behaviors, including
docstrings for API docs.

## Other Notes

<!-- Please add the issue number that should be closed when this PR is
merged. -->
- Closes #<issue>

---------

Signed-off-by: Alexa Haushalter <ahaushalter@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>
@seayang-nv
Contributor Author

Moved the comma-in-column-name hint out of Pydantic validation and into the actual error sites. Previously, a warning fired every time the config was parsed. Now the hint appears only when the column is not found in the data, appended to the existing error messages in _validate_groupby_column, holdout.train_test_split, and assembler._validate_columns.
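
A minimal sketch of that behavior, under stated assumptions: the function name `validate_groupby_column_sketch`, its signature, and the exact message wording are hypothetical and are not copied from `_validate_groupby_column`. The point is only that the comma hint is appended at the error site, and only when the column is missing.

```python
from __future__ import annotations


def validate_groupby_column_sketch(group_by: str, columns: list[str]) -> None:
    """Raise KeyError if group_by is missing; add a comma hint only then."""
    if group_by in columns:
        # A column whose real name contains a comma passes silently.
        return
    message = f"group_by column {group_by!r} not found in data columns {columns}."
    if "," in group_by:
        # Hint is attached only once we know the lookup has failed.
        message += (
            " Only a single column name is supported;"
            " multi-column grouping is not currently supported."
        )
    raise KeyError(message)


# A real column named "a,b" is accepted without any warning.
validate_groupby_column_sketch("a,b", ["a,b", "event_id"])
```

Calling it with `"col1,col2"` against columns `["col1", "col2"]` raises a `KeyError` whose message carries the hint, which is the situation the reviewers wanted flagged.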


Development

Successfully merging this pull request may close these issues.

bug: groupby type unclear

4 participants