
chore: naming consistency for training/synthetic/test datasets #337

Open
nina-xu wants to merge 3 commits into main from ninaxu/143-dataset-naming-consistency

Conversation

@nina-xu
Contributor

@nina-xu nina-xu commented Apr 1, 2026

Summary

Standardize dataset naming conventions throughout the codebase, documentation, and quality report for consistency and clarity.

Canonical naming convention

| Dataset | Old names | New name |
| --- | --- | --- |
| Training data | reference, real, training, train, df1, original, df_all | training |
| Synthetic data | output, synthetic, synth, df2 | synthetic |
| Test holdout | test | test (unchanged) |
| Full data before split | df | input |
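The mapping in the table can be expressed as a simple rename dictionary. This is an illustrative sketch derived from the table above, not code from the PR; `CANONICAL_NAMES` and `canonicalize` are hypothetical names.

```python
# Illustrative rename map built from the table above: legacy aliases on the
# left, canonical names on the right. Unknown names pass through unchanged.
CANONICAL_NAMES = {
    "reference": "training", "real": "training", "train": "training",
    "df1": "training", "original": "training", "df_all": "training",
    "output": "synthetic", "synth": "synthetic", "df2": "synthetic",
    "holdout": "test",
    "df": "input",
}


def canonicalize(name: str) -> str:
    return CANONICAL_NAMES.get(name, name)


print(canonicalize("reference"))  # training
print(canonicalize("df2"))       # synthetic
```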

Variable naming conventions

  • Pydantic model fields: short form (training, synthetic)
  • pandas DataFrame variables/params: _df suffix (training_df, synthetic_df, test_df)
  • Embedding DataFrames: _embd_df suffix (training_embd_df, synthetic_embd_df)
  • Parameter ordering: always training, synthetic, test when multiple datasets appear together

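Taken together, the conventions above look roughly like the following sketch. `EvaluationDatasets` here is a stand-in dataclass (the project uses a Pydantic model), plain lists stand in for pandas DataFrames, and `count_rows` is a hypothetical helper used only to illustrate the naming and ordering rules.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class EvaluationDatasets:
    # Model fields use the short form: training, synthetic, test.
    training: Any
    synthetic: Any
    test: Any


def count_rows(training_df, synthetic_df, test_df):
    # DataFrame-like parameters take the _df suffix and follow the
    # canonical ordering: training, synthetic, test.
    return {
        "training": len(training_df),
        "synthetic": len(synthetic_df),
        "test": len(test_df),
    }


datasets = EvaluationDatasets(training=[1, 2, 3], synthetic=[4, 5], test=[6])
print(count_rows(datasets.training, datasets.synthetic, datasets.test))
# {'training': 3, 'synthetic': 2, 'test': 1}
```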
Other fixes for clarity

  • Renamed class EvaluationDataset → EvaluationDatasets; file renamed to evaluation_datasets.py
  • Renamed mode parameter → based_on in column-type query methods (get_tabular_columns, etc.)
  • Fixed some inaccuracies in the evaluation HTML report while I was at it

Pre-Review Checklist

Ensure that the following pass:

  • make format && make check passes (or validate via prek)
  • make test passes locally
  • make test-e2e passes locally
  • make test-ci-container passes locally (recommended)
  • GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist

  • New or updated tests for any fix or new behavior
  • Updated documentation for new features and behaviors, including docstrings for API docs.

Testing Plan

  • make test-e2e
  • Run a few end-to-end jobs via the SDK/CLI and inspect the new evaluation report (evaluation_report.html)
  • Run Slurm jobs and compare with baseline to ensure that no implementation change is accidentally introduced

@nina-xu nina-xu changed the title from "naming consistency for training/synthetic/test datasets" to "chore: naming consistency for training/synthetic/test datasets" Apr 1, 2026
```diff
     raise ValueError(f"{df_name} is empty!")

-def get_columns_of_type(self, types: set[FieldType], mode="reference") -> list[str]:
+def get_columns_of_type(self, types: set[FieldType], based_on="training") -> list[str]:
```
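The renamed parameter can be illustrated with a minimal stand-in. This is a sketch only: the real `get_columns_of_type` is a method on the evaluation data model, and the `field_types` structure below is an assumption made for illustration.

```python
# Hedged sketch of the based_on parameter: it selects which dataset's
# inferred schema to consult when looking up column types.
def get_columns_of_type(field_types, types, based_on="training"):
    """Return columns whose type in the chosen dataset matches `types`."""
    return [col for col, ftype in field_types[based_on].items() if ftype in types]


field_types = {
    "training": {"age": "numeric", "city": "categorical"},
    "synthetic": {"age": "numeric", "city": "text"},
}
print(get_columns_of_type(field_types, {"categorical"}, based_on="training"))
# ['city']
```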
Contributor Author

Changed the argument name here because mode="training" feels misleading.

@nina-xu nina-xu marked this pull request as ready for review April 2, 2026 15:18
@nina-xu nina-xu requested review from a team as code owners April 2, 2026 15:18
Copilot AI review requested due to automatic review settings April 2, 2026 15:18
@nina-xu nina-xu requested a review from alexahaushalter April 2, 2026 15:21

Copilot AI left a comment


Pull request overview

This PR standardizes dataset naming across the project (code, tests, docs, and the evaluation HTML report) to consistently use training, synthetic, test, and input (pre-split) terminology.

Changes:

  • Renames evaluation data model EvaluationDataset → EvaluationDatasets and propagates the new training/synthetic naming through evaluation components, reports, and templates.
  • Updates holdout splitting APIs/params and training pipeline variables to use training_df / synthetic_df naming consistently.
  • Refreshes tests and documentation/report copy to reflect the new canonical naming and labels.

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/stub_datasets/licenses.md | Adds SPDX headers for test stub dataset licensing doc. |
| tests/smoke/test_evaluation_cpu.py | Updates report construction args to training/synthetic. |
| tests/sdk/test_process_data.py | Renames builder internals to _training_df / _original_training_df in tests. |
| tests/holdout/test_holdout.py | Updates holdout split call to input_df=.... |
| tests/evaluation/test_render.py | Updates fixtures/kwargs to training/synthetic. |
| tests/evaluation/reports/test_multimodal_report.py | Updates report args to training/synthetic. |
| tests/evaluation/datamodel/test_evaluation_datasets.py | New tests for EvaluationDatasets model behavior. |
| tests/evaluation/datamodel/test_evaluation_dataset.py | Removes obsolete tests for old EvaluationDataset model. |
| tests/evaluation/conftest.py | Renames fixtures and switches to EvaluationDatasets. |
| tests/evaluation/components/test_pii_replay.py | Updates PII replay construction and field names. |
| tests/evaluation/components/test_multi_modal_figures.py | Updates figure trace naming (“Training”, “Synthetic”). |
| tests/evaluation/components/test_membership_inference_protection.py | Updates MIA tests to new model/constructor names. |
| tests/evaluation/components/test_deep_structure.py | Updates PCA outputs to training/synthetic naming. |
| tests/evaluation/components/test_correlation.py | Updates correlation outputs to training/synthetic naming. |
| tests/evaluation/components/test_composite_score.py | Updates component construction helper fixture usage. |
| tests/evaluation/components/test_column_distribution.py | Updates column distribution construction and inputs. |
| tests/evaluation/components/test_attribute_inference_protection.py | Updates AIA construction to new model/constructor names. |
| tests/data_processing/test_assembler.py | Updates assembler attribute naming (training_dataset). |
| src/nemo_safe_synthesizer/utils.py | Updates wrapper to pass input_df to holdout module. |
| src/nemo_safe_synthesizer/training/huggingface_backend.py | Renames internal training/test df fields; updates trainer result payload. |
| src/nemo_safe_synthesizer/training/backend.py | Renames df_train → training_df and df_test → test_df in base types. |
| src/nemo_safe_synthesizer/sdk/library_builder.py | Renames internal cached split fields and evaluator arg to training_df. |
| src/nemo_safe_synthesizer/holdout/holdout.py | Renames split fns/params to input_df and returns training_df. |
| src/nemo_safe_synthesizer/evaluation/statistics/stats.py | Renames parameters for memorization/PCA helpers to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/reports/multimodal/multimodal_report.py | Switches report assembly to EvaluationDatasets and new component constructors. |
| src/nemo_safe_synthesizer/evaluation/evaluator.py | Renames evaluator input from train_df to training_df. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_report.py | Renames report field to evaluation_datasets. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_field.py | Renames field feature/distribution attributes to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_datasets.py | Renames and updates evaluation data model; renames mode → based_on. |
| src/nemo_safe_synthesizer/evaluation/components/text_structure_similarity.py | Updates component to consume EvaluationDatasets and new naming. |
| src/nemo_safe_synthesizer/evaluation/components/text_semantic_similarity.py | Updates component naming and PCA labeling (“training” vs “train”). |
| src/nemo_safe_synthesizer/evaluation/components/sqs_score.py | Updates field-type lookup to training-based features. |
| src/nemo_safe_synthesizer/evaluation/components/pii_replay.py | Renames PII replay fields to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py | Updates figure labels and function parameter names to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/components/membership_inference_protection.py | Renames params and normalization flow to training/synthetic/test naming. |
| src/nemo_safe_synthesizer/evaluation/components/deep_structure.py | Renames PCA outputs and data-prep params to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/components/dataset_statistics.py | Renames dataset stats fields to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/components/correlation.py | Renames correlation matrices and wiring to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/components/component.py | Renames base factory method to from_evaluation_datasets. |
| src/nemo_safe_synthesizer/evaluation/components/column_distribution.py | Updates plotting/data access to training/synthetic fields. |
| src/nemo_safe_synthesizer/evaluation/components/attribute_inference_protection.py | Renames params and internal variables to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/assets/text/multi_modal_tooltips.py | Updates report tooltip text to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/reports/multi_modal_report.j2 | Updates report navigation and includes to “Training Columns”. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2 | Renames “Reference Columns” section to “Training Data Columns”. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/pii_replay.j2 | Updates headings/fields to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/dataset_statistics.j2 | Updates table labels and memorized-lines row wording. |
| src/nemo_safe_synthesizer/data_processing/assembler.py | Renames internal dataset attributes and split variables to training/validation. |
| docs/user-guide/running.md | Updates generation-stage description to “training dataset schema”. |
| docs/product-overview/evaluation.md | Updates SQS/DPS copy to training vs synthetic terminology. |
Comments suppressed due to low confidence (1)

src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2:35

  • The "Type" column in the "Training Data Columns" table is currently sourced from synthetic_field_features.type. If this table is intended to describe the training data (as the header + tooltip indicate), this should use training_field_features.type instead; otherwise the label/tooltip should be updated to clarify which dataset’s type is shown.
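A minimal sketch of the fix this comment suggests, assuming the template exposes `column.training_field_features` alongside the synthetic features (the surrounding table markup is not reproduced here and is an assumption):

```jinja
{# Before: the Type column reads from the synthetic features #}
<td>{{ column.synthetic_field_features.type }}</td>

{# After: source the Type column from the training features,
   matching the "Training Data Columns" header and tooltip #}
<td>{{ column.training_field_features.type }}</td>
```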


Collaborator

@kendrickb-nvidia kendrickb-nvidia left a comment


Thanks @nina-xu this is really good to be consistent on naming.

As a larger refactoring, I'd like to get our ty type checking updated and working more fully in #141 first. That will give us more confidence that the refactoring didn't miss anything or introduce any typos.

My goal is to get #141 ready today. Hopefully I can run a bit of slurm testing over the weekend and we can merge #141 on Monday. Then do one merge conflict resolution with this PR and proceed. If things don't go well on the type checking PR today, then we can swap the order and I'll figure out merge conflicts in the already massive #141. Will check in on Monday on how to proceed.

Align dataset split naming to canonical training/synthetic/test/input terms across code, docs, and evaluation artifacts for clearer, consistent semantics.

Made-with: Cursor
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 8, 2026 17:47
@nina-xu nina-xu force-pushed the ninaxu/143-dataset-naming-consistency branch from edb34d2 to 63b5c59 on April 8, 2026 17:47

Copilot AI left a comment


Pull request overview

Copilot reviewed 43 out of 43 changed files in this pull request and generated 4 comments.



nina-xu added 2 commits April 8, 2026 17:53
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 8, 2026 21:26

Copilot AI left a comment


Pull request overview

Copilot reviewed 48 out of 48 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2:36

  • The "Training Data Columns" table tooltip/description says values are for the training data, but the rendered "Type" column is pulled from column.synthetic_field_features.type. This will display the synthetic type rather than the training type (and can be misleading when types differ). Use column.training_field_features.type here, or render both explicitly if the intent is comparison.


Comment on lines 15 to 17
```diff
 "sqs_info": """
-The SQS (Synthetic Quality Score) is a measure of how well the output data matches the reference data. A higher SQS indicates a better match.
+The SQS (Synthetic Quality Score) is a measure of how well the synthetic data matches the input data. A higher SQS indicates a better match.
 SQS is comprised of five metrics. Column Correlation Stability, Deep Structure Stability, and Column Distribution Stability are calculated and
```

Copilot AI Apr 8, 2026


In the SQS tooltip text, "synthetic data matches the input data" is inconsistent with the rest of the report (and the evaluation code), which compares synthetic data against the training split. Consider changing "input data" to "training data" here (and similarly in the Text Semantic Similarity tooltip at line 54) to avoid confusing users about what baseline SQS is computed against.

Comment on lines 448 to 450
```python
# ---------------------------------------------------------------------------
# Tests: config validation at process_data entry
# ---------------------------------------------------------------------------
```

Copilot AI Apr 8, 2026


The holdout=0 resume-path regression tests (missing/empty test.csv handling in load_from_save_path) were removed here. The production code still has special-case logic for test.csv being absent or 0-bytes, so dropping these tests reduces coverage for a historically fragile edge case. Please reintroduce equivalent coverage (either here or in a more appropriate test module).
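A sketch of regression coverage along the lines this comment requests. `load_from_save_path` below is a stand-in, not the project's actual function; it only mirrors the behavior the comment describes, namely that a missing or zero-byte test.csv on the holdout=0 resume path should yield no test split rather than fail.

```python
import csv
import tempfile
from pathlib import Path


def load_from_save_path(save_path: Path):
    """Stand-in loader: returns (training_rows, test_rows_or_None)."""
    with (save_path / "training.csv").open(newline="") as f:
        training = list(csv.reader(f))
    test_file = save_path / "test.csv"
    # Special-case: test.csv absent or 0 bytes means no holdout split.
    if not test_file.exists() or test_file.stat().st_size == 0:
        return training, None
    with test_file.open(newline="") as f:
        return training, list(csv.reader(f))


def test_missing_and_empty_test_csv():
    with tempfile.TemporaryDirectory() as tmp:
        save_path = Path(tmp)
        (save_path / "training.csv").write_text("a,b\n1,2\n")

        # holdout=0 resume: test.csv absent entirely
        _, test_rows = load_from_save_path(save_path)
        assert test_rows is None

        # holdout=0 resume: test.csv present but zero bytes
        (save_path / "test.csv").write_text("")
        _, test_rows = load_from_save_path(save_path)
        assert test_rows is None


test_missing_and_empty_test_csv()
print("ok")
```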



Successfully merging this pull request may close these issues.

chore: naming consistency for train/output/test datasets
