fix: addressed holdout=0 for generation by seayang-nv · Pull Request #343 · NVIDIA-NeMo/Safe-Synthesizer

seayang-nv · 2026-04-02T17:35:15Z

Summary

Root Cause: When holdout=0, process_data() correctly produces no test set (_test_df = None), but it still creates an empty 0-byte test.csv via touch(). On resume, load_from_save_path() sees the file exists and unconditionally calls pd.read_csv() on it, which raises EmptyDataError because the file has no content.
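The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `read_zero_byte_csv` is a hypothetical helper name, and the temporary directory stands in for the real save path. It only demonstrates that `pd.read_csv()` raises `EmptyDataError` on a file created by `touch()`.

```python
import tempfile
from pathlib import Path

import pandas as pd
from pandas.errors import EmptyDataError


def read_zero_byte_csv() -> str:
    """Create a 0-byte test.csv with touch() and report how pd.read_csv reacts.

    Hypothetical helper for illustration; the real save path layout is assumed.
    """
    with tempfile.TemporaryDirectory() as tmp:
        test_csv = Path(tmp) / "test.csv"
        test_csv.touch()  # file exists but has no content, as with holdout=0
        try:
            pd.read_csv(test_csv)
        except EmptyDataError as exc:
            return f"EmptyDataError: {exc}"
    return "no error"


print(read_zero_byte_csv())
```

Because the file exists, an existence check alone cannot tell an empty placeholder apart from a real test split; the size has to be checked as well.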

Changes:

library_builder.py -- process_data(): Removed the else: touch() branch so no test.csv is written when there is no holdout set. This prevents the empty file from being created in the first place.
library_builder.py -- load_from_save_path(): Changed the loading condition to only require training.csv. If test.csv is missing or empty (0 bytes), _test_df is set to None instead of attempting pd.read_csv(). This fixes the crash on resume and provides backward compatibility with saved runs that already have an empty test.csv on disk.
cli/utils.py: Relaxed the CLI resume validation to only require training.csv to exist, since test.csv is legitimately absent when holdout=0.
tests/sdk/test_process_data.py: Added three tests covering the fix -- no test.csv written when holdout is zero, successful resume without test.csv, and backward-compat handling of empty test.csv from older runs.
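The resume-time checks described in the list above could look roughly like the sketch below. It is an illustration under assumptions, not the code from library_builder.py or cli/utils.py: `usable_test_csv`, `can_resume`, and `demo` are hypothetical names, and only the file names training.csv and test.csv come from this PR. It uses only the standard library, so the actual CSV loading step is left out.

```python
import tempfile
from pathlib import Path
from typing import Optional


def usable_test_csv(save_dir: Path) -> Optional[Path]:
    """Return the test.csv path only if it exists and is non-empty, else None.

    A missing file (new behavior) and a 0-byte file (older runs) are both
    treated as "no holdout set".
    """
    test_csv = save_dir / "test.csv"
    if test_csv.is_file() and test_csv.stat().st_size > 0:
        return test_csv
    return None


def can_resume(save_dir: Path) -> bool:
    """Relaxed resume validation: only training.csv is required."""
    return (save_dir / "training.csv").is_file()


def demo() -> dict:
    """Exercise the three cases: test.csv missing, empty, and populated."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        save_dir = Path(tmp)
        (save_dir / "training.csv").write_text("a,b\n1,2\n")

        # Case 1: holdout=0 after the fix, so no test.csv is written.
        results["missing"] = (can_resume(save_dir), usable_test_csv(save_dir))

        # Case 2: older run that left a 0-byte test.csv behind.
        (save_dir / "test.csv").touch()
        results["empty"] = (can_resume(save_dir), usable_test_csv(save_dir))

        # Case 3: a real holdout set is present.
        (save_dir / "test.csv").write_text("a,b\n3,4\n")
        found = usable_test_csv(save_dir)
        results["populated"] = (can_resume(save_dir), found is not None)
    return results


print(demo())
```

A caller would pass the returned path to its CSV reader when it is not None and otherwise leave the test DataFrame unset.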

Pre-Review Checklist

Ensure that the following pass:

make format && make check or via prek validation.
make test passes locally
make test-e2e passes locally
make test-ci-container passes locally (recommended)
GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Ran e2e on shoppers.csv with the following config:

data:
  holdout: 0
  max_holdout: 0
  group_training_examples_by: null
  order_training_examples_by: null

generation:
  num_records: 1000

training:
  pretrained_model: "HuggingFaceTB/SmolLM3-3B"
  num_input_records_to_sample: 25000

safe-synthesizer run --data-source data/shoppers.csv --config safe-synthesizer-config.yaml

                                                    Quality Metrics                                                     
+----------------------------------------------------------------------------------------------------------------------+
| Metric                                | Value                                                                        |
|---------------------------------------+------------------------------------------------------------------------------|
| Synthetic Data Quality Score          | 8.80                                                                         |
| Column Correlation Stability Score    | 9.00                                                                         |
| Deep Structure Stability Score        | 8.40                                                                         |
| Column Distribution Stability Score   | 9.00                                                                         |
| Text Semantic Similarity Score        | None                                                                         |
| Text Structure Similarity Score       | None                                                                         |
| Data Privacy Score                    | 9.60                                                                         |
| Membership Inference Protection Score | None                                                                         |
| Attribute Inference Protection Score  | 9.60                                                                         |
| Num Valid Records                     | 1675                                                                         |
| Num Invalid Records                   | 6                                                                            |
| Num Prompts                           | 100                                                                          |
| Valid Record Fraction                 | 99.64%                                                                       |
| Timing                                | {'total_time_sec': 2078.319779300131, 'pii_replacer_time_sec': None,         |
|                                       | 'training_time_sec': 688.9708022195846, 'generation_time_sec':               |
|                                       | 1285.9716185210273, 'evaluation_time_sec': 6.210952864959836}                |
+----------------------------------------------------------------------------------------------------------------------+

Then ran only generation, reusing the adapter trained in the run above:

safe-synthesizer run generate --data-source data/shoppers.csv --config safe-synthesizer-config.yaml --run-path safe-synthesizer-artifacts/safe-synthesizer-config---shoppers/2026-04-02T16\:27\:08/

                                                    Quality Metrics                                                     
+----------------------------------------------------------------------------------------------------------------------+
| Metric                                | Value                                                                        |
|---------------------------------------+------------------------------------------------------------------------------|
| Synthetic Data Quality Score          | 8.90                                                                         |
| Column Correlation Stability Score    | 9.00                                                                         |
| Deep Structure Stability Score        | 8.60                                                                         |
| Column Distribution Stability Score   | 9.00                                                                         |
| Text Semantic Similarity Score        | None                                                                         |
| Text Structure Similarity Score       | None                                                                         |
| Data Privacy Score                    | 9.60                                                                         |
| Membership Inference Protection Score | None                                                                         |
| Attribute Inference Protection Score  | 9.60                                                                         |
| Num Valid Records                     | 1544                                                                         |
| Num Invalid Records                   | 6                                                                            |
| Num Prompts                           | 100                                                                          |
| Valid Record Fraction                 | 99.61%                                                                       |
| Timing                                | {'total_time_sec': 1248.903744426556, 'pii_replacer_time_sec': None,         |
|                                       | 'training_time_sec': None, 'generation_time_sec': 1224.3934827037156,        |
|                                       | 'evaluation_time_sec': 5.736630130559206}                                    |
+----------------------------------------------------------------------------------------------------------------------+

Pre-Merge Checklist

New or updated tests for any fix or new behavior
Updated documentation for new features and behaviors, including docstrings for API docs.

Other Notes

Closes bug: run generate errors when the original training job has holdout = 0 #276

Signed-off-by: Sean Yang <seayang@nvidia.com>

addressed hold=0 for generation

8414afb

Signed-off-by: Sean Yang <seayang@nvidia.com>

seayang-nv requested review from kendrickb-nvidia and mckornfield April 2, 2026 17:35

seayang-nv requested a review from a team as a code owner April 2, 2026 17:35

seayang-nv changed the title ~~addressed hold=0 for generation~~ fix: addressed hold=0 for generation Apr 2, 2026

Merge branch 'main' into seayang/run-generate-fails-with-no-holdout

0018025

seayang-nv changed the title ~~fix: addressed hold=0 for generation~~ fix: addressed holdout=0 for generation Apr 2, 2026

nina-xu previously approved these changes Apr 2, 2026

added warning when no test split is loaded

b2890d2

Signed-off-by: Sean Yang <seayang@nvidia.com>

seayang-nv dismissed nina-xu’s stale review via b2890d2 April 2, 2026 18:02

kendrickb-nvidia approved these changes Apr 3, 2026

mckornfield approved these changes Apr 3, 2026

seayang-nv merged commit a823d08 into main Apr 3, 2026
10 checks passed

seayang-nv deleted the seayang/run-generate-fails-with-no-holdout branch April 3, 2026 19:32

seayang-nv merged 3 commits into main from seayang/run-generate-fails-with-no-holdout