Conversation
9b3c801 to
e6d56d6
Compare
nina-xu
left a comment
There was a problem hiding this comment.
question: if the user puts in group_training_examples_by: col1,col2, does that get parsed into a list and fail the validation? or gets parsed into one string "col1,col2" and pass validation?
in general it might be nice to add a test to demonstrate one column passes validation and a list fails
Good suggestions. Added the following:
I don't want to reject it in case there are users have commas in their column names.
|
kendrickb-nvidia
left a comment
There was a problem hiding this comment.
The more I think about it, I don't think a warning in pydantic is a good idea. That warning will be printed out in random places whenever we end up parsing the pydantic model and will be really annoying if you do have a comma in the name.
This is a super clear cut error that we explicitly check for and raise an error if the column name can't be found in the data frame. If we want a warning about the comma, let's put it there right before we raise the error (but only if column was not found) when we know it's a problem. And not in pydantic validation.
Signed-off-by: Sean Yang <seayang@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. --> <!-- SPDX-License-Identifier: Apache-2.0 --> <!-- Thank you for contributing to Safe Synthesizer! --> # Summary Improve the heartbeat message by adding an explanation that it is normal to have long stretches with no new records. ## Pre-Review Checklist <!-- These checks should be completed before a PR is reviewed, --> <!-- but you can submit a draft early to indicate that the issue is being worked on. --> Ensure that the following pass: - [x] `make format && make check` or via prek validation. - [x] `make test` passes locally - [x] `make test-e2e` passes locally - [ ] `make test-ci-container` passes locally (recommended) - [ ] GPU CI status check passes -- comment `/sync` on this PR to trigger a run (auto-triggers on ready-for-review) ## Pre-Merge Checklist <!-- These checks need to be completed before a PR is merged, --> <!-- but as PRs often change significantly during review, --> <!-- it's OK for them to be incomplete when review is first requested. --> - [ ] New or updated tests for any fix or new behavior - [ ] Updated documentation for new features and behaviors, including docstrings for API docs. ## Other Notes <!-- Please add the issue number that should be closed when this PR is merged. --> - Closes #<issue> --------- Signed-off-by: Alexa Haushalter <ahaushalter@nvidia.com> Signed-off-by: Sean Yang <seayang@nvidia.com>
Signed-off-by: Sean Yang <seayang@nvidia.com>
8648e8f to
7084ab6
Compare
|
Moved the comma-in-column-name hint out of Pydantic validation and into the actual error sites. Previously, a warning fired every time the config was parsed . Now the hint only appears when the column is not found in the data, appended to the existing error messages in |
Summary
The config source of truth in
config/data.pydefinesgroup_training_examples_byasstr | None, but several downstream consumers acceptedstr | list[str]orlist[str] | str | None. This PR makes all public/interface-levelgroup_bytype annotations consistentlystr(orstr | None).Changes:
utils.py--grouped_train_test_split: narrowedgroup_byfromstr | list[str]tostrconfig/autoconfig.py--get_max_token_count: narrowedgroup_byfromlist[str] | str | Nonetostr | Nonegeneration/processors.py--GroupedDataProcessor.__init__: narrowedgroup_byfromstr | list[str]tostr, removed theisinstancecoercion guard. Internal storage remainsself.group_by: list[str] = [group_by]to leave room for future multi-column expansion.holdout/holdout.py--grouped_train_test_split: added type annotations(df: pd.DataFrame, test_size: int | float, group_by: str, random_state: int | None)tests/generation/test_processors.py-- commented out two multi-column group_by tests (test_grouped_data_processor_multiple_group_byandtest_grouped_data_processor_multiple_group_by_error) that passedlist[str], which is no longer valid at the public API level. Single-column grouping remains well-covered by the existing test suite.Design note: Classes that internally iterate over group columns (
GroupedDataProcessor,GroupedDataExampleAssembler) still storeself.group_byaslist[str]internally. This means expanding back to multi-column grouping in the future only requires widening the public signatures -- no internal refactoring needed.e2e test on
patient_events.csvwith the following config:Results:
Pre-Review Checklist
Ensure that the following pass:
make format && make checkor via prek validation.make testpasses locallymake test-e2epasses locallymake test-ci-containerpasses locally (recommended)/syncon this PR to trigger a run (auto-triggers on ready-for-review)Pre-Merge Checklist
Other Notes