fix: addressed holdout=0 for generation #343

Merged
seayang-nv merged 3 commits into main from seayang/run-generate-fails-with-no-holdout
Apr 3, 2026

seayang-nv commented Apr 2, 2026

Summary

Root Cause: When holdout=0, process_data() correctly produces no test set (_test_df = None), but it still creates an empty 0-byte test.csv via touch(). On resume, load_from_save_path() sees the file exists and unconditionally calls pd.read_csv() on it, which raises EmptyDataError because the file has no content.
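The failure mode can be reproduced in isolation. The snippet below is a minimal sketch, not code from this PR: it creates a 0-byte file the same way a touch() call would and shows that pd.read_csv() rejects it. The helper name read_csv_outcome is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd
from pandas.errors import EmptyDataError


def read_csv_outcome(path: Path) -> str:
    """Return 'ok' if pandas parses the file, else the exception class name.

    Hypothetical helper for illustration only.
    """
    try:
        pd.read_csv(path)
        return "ok"
    except EmptyDataError as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    empty = Path(tmp) / "test.csv"
    empty.touch()  # 0-byte file, like the one left behind when holdout=0
    size = empty.stat().st_size
    outcome = read_csv_outcome(empty)

print(size, outcome)
```

The file exists, so an existence check alone passes, but it has no header row for pandas to parse.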

Changes:

  • library_builder.py -- process_data(): Removed the else: touch() branch so no test.csv is written when there is no holdout set. This prevents the empty file from being created in the first place.
  • library_builder.py -- load_from_save_path(): Changed the loading condition to only require training.csv. If test.csv is missing or empty (0 bytes), _test_df is set to None instead of attempting pd.read_csv(). This fixes the crash on resume and provides backward compatibility with saved runs that already have an empty test.csv on disk.
  • cli/utils.py: Relaxed the CLI resume validation to only require training.csv to exist, since test.csv is legitimately absent when holdout=0.
  • tests/sdk/test_process_data.py: Added three tests covering the fix -- no test.csv written when holdout is zero, successful resume without test.csv, and backward-compat handling of empty test.csv from older runs.
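The conditions described in the changes above can be sketched with the standard library alone. This is an illustration under assumptions, not the code in library_builder.py or cli/utils.py: the function names has_usable_test_csv and can_resume are hypothetical, and the three directories mirror the three cases the new tests are described as covering.

```python
import tempfile
from pathlib import Path


def has_usable_test_csv(save_path: Path) -> bool:
    """True only when test.csv exists and holds at least one byte.

    Hypothetical helper; a 0-byte file from an older run counts as absent.
    """
    test_csv = save_path / "test.csv"
    return test_csv.is_file() and test_csv.stat().st_size > 0


def can_resume(save_path: Path) -> bool:
    """Resume needs only training.csv; test.csv is optional when holdout=0.

    Hypothetical helper mirroring the relaxed validation described above.
    """
    return (save_path / "training.csv").is_file()


results = {}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cases = [("missing", None), ("empty", b""), ("populated", b"a,b\n1,2\n")]
    for name, test_bytes in cases:
        run_dir = root / name
        run_dir.mkdir()
        (run_dir / "training.csv").write_bytes(b"a,b\n3,4\n")
        if test_bytes is not None:
            (run_dir / "test.csv").write_bytes(test_bytes)
        # (resume allowed?, test set loadable?)
        results[name] = (can_resume(run_dir), has_usable_test_csv(run_dir))

print(results)
```

A loader built on this check would call pd.read_csv() on test.csv only when has_usable_test_csv() returns True and otherwise leave the test frame as None.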

Pre-Review Checklist

Ensure that the following pass:

  • make format && make check or via prek validation.
  • make test passes locally
  • make test-e2e passes locally
  • make test-ci-container passes locally (recommended)
  • GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Ran e2e on shoppers.csv with the following config:

data:
  holdout: 0    
  max_holdout: 0
  group_training_examples_by: null
  order_training_examples_by: null

generation:
  num_records: 1000   

training:
  pretrained_model: "HuggingFaceTB/SmolLM3-3B"
  num_input_records_to_sample: 25000   

safe-synthesizer run --data-source data/shoppers.csv --config safe-synthesizer-config.yaml

                                                    Quality Metrics                                                     
+----------------------------------------------------------------------------------------------------------------------+
| Metric                                | Value                                                                        |
|---------------------------------------+------------------------------------------------------------------------------|
| Synthetic Data Quality Score          | 8.80                                                                         |
| Column Correlation Stability Score    | 9.00                                                                         |
| Deep Structure Stability Score        | 8.40                                                                         |
| Column Distribution Stability Score   | 9.00                                                                         |
| Text Semantic Similarity Score        | None                                                                         |
| Text Structure Similarity Score       | None                                                                         |
| Data Privacy Score                    | 9.60                                                                         |
| Membership Inference Protection Score | None                                                                         |
| Attribute Inference Protection Score  | 9.60                                                                         |
| Num Valid Records                     | 1675                                                                         |
| Num Invalid Records                   | 6                                                                            |
| Num Prompts                           | 100                                                                          |
| Valid Record Fraction                 | 99.64%                                                                       |
| Timing                                | {'total_time_sec': 2078.319779300131, 'pii_replacer_time_sec': None,         |
|                                       | 'training_time_sec': 688.9708022195846, 'generation_time_sec':               |
|                                       | 1285.9716185210273, 'evaluation_time_sec': 6.210952864959836}                |
+----------------------------------------------------------------------------------------------------------------------+

Then ran only generation, reusing the adapter trained in the run above:

safe-synthesizer run generate --data-source data/shoppers.csv --config safe-synthesizer-config.yaml --run-path safe-synthesizer-artifacts/safe-synthesizer-config---shoppers/2026-04-02T16\:27\:08/

                                                    Quality Metrics                                                     
+----------------------------------------------------------------------------------------------------------------------+
| Metric                                | Value                                                                        |
|---------------------------------------+------------------------------------------------------------------------------|
| Synthetic Data Quality Score          | 8.90                                                                         |
| Column Correlation Stability Score    | 9.00                                                                         |
| Deep Structure Stability Score        | 8.60                                                                         |
| Column Distribution Stability Score   | 9.00                                                                         |
| Text Semantic Similarity Score        | None                                                                         |
| Text Structure Similarity Score       | None                                                                         |
| Data Privacy Score                    | 9.60                                                                         |
| Membership Inference Protection Score | None                                                                         |
| Attribute Inference Protection Score  | 9.60                                                                         |
| Num Valid Records                     | 1544                                                                         |
| Num Invalid Records                   | 6                                                                            |
| Num Prompts                           | 100                                                                          |
| Valid Record Fraction                 | 99.61%                                                                       |
| Timing                                | {'total_time_sec': 1248.903744426556, 'pii_replacer_time_sec': None,         |
|                                       | 'training_time_sec': None, 'generation_time_sec': 1224.3934827037156,        |
|                                       | 'evaluation_time_sec': 5.736630130559206}                                    |
+----------------------------------------------------------------------------------------------------------------------+

Pre-Merge Checklist

  • New or updated tests for any fix or new behavior
  • Updated documentation for new features and behaviors, including docstrings for API docs.

Other Notes

Signed-off-by: Sean Yang <seayang@nvidia.com>
seayang-nv requested a review from a team as a code owner April 2, 2026 17:35
seayang-nv changed the title from "addressed hold=0 for generation" to "fix: addressed hold=0 for generation" Apr 2, 2026
seayang-nv changed the title from "fix: addressed hold=0 for generation" to "fix: addressed holdout=0 for generation" Apr 2, 2026
nina-xu previously approved these changes Apr 2, 2026
Signed-off-by: Sean Yang <seayang@nvidia.com>
@seayang-nv seayang-nv merged commit a823d08 into main Apr 3, 2026
10 checks passed
@seayang-nv seayang-nv deleted the seayang/run-generate-fails-with-no-holdout branch April 3, 2026 19:32
Development

Successfully merging this pull request may close these issues.

bug: run generate errors when the original training job has holdout = 0

4 participants