
chore: naming consistency for training/synthetic/test datasets #337

Open
nina-xu wants to merge 3 commits into main from ninaxu/143-dataset-naming-consistency

Conversation

@nina-xu
Contributor

@nina-xu nina-xu commented Apr 1, 2026

Summary

Standardize dataset naming conventions throughout the codebase, documentation, and quality report for consistency and clarity.

Canonical naming convention

| Dataset | Old names | New name |
| --- | --- | --- |
| Training data | reference, real, training, train, df1, original, df_all | training |
| Synthetic data | output, synthetic, synth, df2 | synthetic |
| Test holdout | test | test (unchanged) |
| Full data before split | df | input |
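The mapping in the table can be expressed as a simple rename dictionary. This is an illustrative sketch derived from the table above, not code from the PR; `CANONICAL_NAMES` and `canonicalize` are hypothetical names.

```python
# Illustrative rename map built from the table above: legacy aliases on the
# left, canonical names on the right. Unknown names pass through unchanged.
CANONICAL_NAMES = {
    "reference": "training", "real": "training", "train": "training",
    "df1": "training", "original": "training", "df_all": "training",
    "output": "synthetic", "synth": "synthetic", "df2": "synthetic",
    "holdout": "test",
    "df": "input",
}


def canonicalize(name: str) -> str:
    return CANONICAL_NAMES.get(name, name)


print(canonicalize("reference"))  # training
print(canonicalize("df2"))       # synthetic
```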

Variable naming conventions

  • Pydantic model fields: short form (training, synthetic)
  • pandas DataFrame variables/params: _df suffix (training_df, synthetic_df, test_df)
  • Embedding DataFrames: _embd_df suffix (training_embd_df, synthetic_embd_df)
  • Parameter ordering: always training, synthetic, test when multiple datasets appear together

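Taken together, the conventions above look roughly like the following sketch. `EvaluationDatasets` here is a stand-in dataclass (the project uses a Pydantic model), plain lists stand in for pandas DataFrames, and `count_rows` is a hypothetical helper used only to illustrate the naming and ordering rules.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class EvaluationDatasets:
    # Model fields use the short form: training, synthetic, test.
    training: Any
    synthetic: Any
    test: Any


def count_rows(training_df, synthetic_df, test_df):
    # DataFrame-like parameters take the _df suffix and follow the
    # canonical ordering: training, synthetic, test.
    return {
        "training": len(training_df),
        "synthetic": len(synthetic_df),
        "test": len(test_df),
    }


datasets = EvaluationDatasets(training=[1, 2, 3], synthetic=[4, 5], test=[6])
print(count_rows(datasets.training, datasets.synthetic, datasets.test))
# {'training': 3, 'synthetic': 2, 'test': 1}
```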
Other fixes for clarity

  • Renamed class EvaluationDataset → EvaluationDatasets; file renamed to evaluation_datasets.py
  • Renamed mode parameter → based_on in column-type query methods (get_tabular_columns, etc.)
  • Fixed some inaccuracies in the evaluation HTML report while I was at it

Pre-Review Checklist

Ensure that the following pass:

  • make format && make check passes (or validate via prek)
  • make test passes locally
  • make test-e2e passes locally
  • make test-ci-container passes locally (recommended)
  • GPU CI status check passes -- comment /sync on this PR to trigger a run (auto-triggers on ready-for-review)

Pre-Merge Checklist

  • New or updated tests for any fix or new behavior
  • Updated documentation for new features and behaviors, including docstrings for API docs.

Testing Plan

  • make test-e2e
  • Run a few end-to-end jobs via the SDK/CLI and inspect the new evaluation report (evaluation_report.html)
  • Run Slurm jobs and compare with baseline to ensure that no implementation change is accidentally introduced

@nina-xu nina-xu changed the title from "naming consistency for training/synthetic/test datasets" to "chore: naming consistency for training/synthetic/test datasets" Apr 1, 2026
```diff
     raise ValueError(f"{df_name} is empty!")

-def get_columns_of_type(self, types: set[FieldType], mode="reference") -> list[str]:
+def get_columns_of_type(self, types: set[FieldType], based_on="training") -> list[str]:
```
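The renamed parameter can be illustrated with a minimal stand-in. This is a sketch only: the real `get_columns_of_type` is a method on the evaluation data model, and the `field_types` structure below is an assumption made for illustration.

```python
# Hedged sketch of the based_on parameter: it selects which dataset's
# inferred schema to consult when looking up column types.
def get_columns_of_type(field_types, types, based_on="training"):
    """Return columns whose type in the chosen dataset matches `types`."""
    return [col for col, ftype in field_types[based_on].items() if ftype in types]


field_types = {
    "training": {"age": "numeric", "city": "categorical"},
    "synthetic": {"age": "numeric", "city": "text"},
}
print(get_columns_of_type(field_types, {"categorical"}, based_on="training"))
# ['city']
```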
Contributor Author

Changed the argument name here because mode="training" feels misleading.

@nina-xu nina-xu marked this pull request as ready for review April 2, 2026 15:18
@nina-xu nina-xu requested review from a team as code owners April 2, 2026 15:18
Copilot AI review requested due to automatic review settings April 2, 2026 15:18
@nina-xu nina-xu requested a review from alexahaushalter April 2, 2026 15:21

Copilot AI left a comment


Pull request overview

This PR standardizes dataset naming across the project (code, tests, docs, and the evaluation HTML report) to consistently use training, synthetic, test, and input (pre-split) terminology.

Changes:

  • Renames evaluation data model EvaluationDataset → EvaluationDatasets and propagates the new training/synthetic naming through evaluation components, reports, and templates.
  • Updates holdout splitting APIs/params and training pipeline variables to use training_df / synthetic_df naming consistently.
  • Refreshes tests and documentation/report copy to reflect the new canonical naming and labels.

Reviewed changes

Copilot reviewed 49 out of 49 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/stub_datasets/licenses.md | Adds SPDX headers for test stub dataset licensing doc. |
| tests/smoke/test_evaluation_cpu.py | Updates report construction args to training/synthetic. |
| tests/sdk/test_process_data.py | Renames builder internals to _training_df / _original_training_df in tests. |
| tests/holdout/test_holdout.py | Updates holdout split call to input_df=.... |
| tests/evaluation/test_render.py | Updates fixtures/kwargs to training/synthetic. |
| tests/evaluation/reports/test_multimodal_report.py | Updates report args to training/synthetic. |
| tests/evaluation/datamodel/test_evaluation_datasets.py | New tests for EvaluationDatasets model behavior. |
| tests/evaluation/datamodel/test_evaluation_dataset.py | Removes obsolete tests for old EvaluationDataset model. |
| tests/evaluation/conftest.py | Renames fixtures and switches to EvaluationDatasets. |
| tests/evaluation/components/test_pii_replay.py | Updates PII replay construction and field names. |
| tests/evaluation/components/test_multi_modal_figures.py | Updates figure trace naming (“Training”, “Synthetic”). |
| tests/evaluation/components/test_membership_inference_protection.py | Updates MIA tests to new model/constructor names. |
| tests/evaluation/components/test_deep_structure.py | Updates PCA outputs to training/synthetic naming. |
| tests/evaluation/components/test_correlation.py | Updates correlation outputs to training/synthetic naming. |
| tests/evaluation/components/test_composite_score.py | Updates component construction helper fixture usage. |
| tests/evaluation/components/test_column_distribution.py | Updates column distribution construction and inputs. |
| tests/evaluation/components/test_attribute_inference_protection.py | Updates AIA construction to new model/constructor names. |
| tests/data_processing/test_assembler.py | Updates assembler attribute naming (training_dataset). |
| src/nemo_safe_synthesizer/utils.py | Updates wrapper to pass input_df to holdout module. |
| src/nemo_safe_synthesizer/training/huggingface_backend.py | Renames internal training/test df fields; updates trainer result payload. |
| src/nemo_safe_synthesizer/training/backend.py | Renames df_train → training_df and df_test → test_df in base types. |
| src/nemo_safe_synthesizer/sdk/library_builder.py | Renames internal cached split fields and evaluator arg to training_df. |
| src/nemo_safe_synthesizer/holdout/holdout.py | Renames split fns/params to input_df and returns training_df. |
| src/nemo_safe_synthesizer/evaluation/statistics/stats.py | Renames parameters for memorization/PCA helpers to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/reports/multimodal/multimodal_report.py | Switches report assembly to EvaluationDatasets and new component constructors. |
| src/nemo_safe_synthesizer/evaluation/evaluator.py | Renames evaluator input from train_df to training_df. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_report.py | Renames report field to evaluation_datasets. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_field.py | Renames field feature/distribution attributes to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/data_model/evaluation_datasets.py | Renames and updates evaluation data model; renames mode → based_on. |
| src/nemo_safe_synthesizer/evaluation/components/text_structure_similarity.py | Updates component to consume EvaluationDatasets and new naming. |
| src/nemo_safe_synthesizer/evaluation/components/text_semantic_similarity.py | Updates component naming and PCA labeling (“training” vs “train”). |
| src/nemo_safe_synthesizer/evaluation/components/sqs_score.py | Updates field-type lookup to training-based features. |
| src/nemo_safe_synthesizer/evaluation/components/pii_replay.py | Renames PII replay fields to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/components/multi_modal_figures.py | Updates figure labels and function parameter names to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/components/membership_inference_protection.py | Renames params and normalization flow to training/synthetic/test naming. |
| src/nemo_safe_synthesizer/evaluation/components/deep_structure.py | Renames PCA outputs and data-prep params to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/components/dataset_statistics.py | Renames dataset stats fields to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/components/correlation.py | Renames correlation matrices and wiring to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/components/component.py | Renames base factory method to from_evaluation_datasets. |
| src/nemo_safe_synthesizer/evaluation/components/column_distribution.py | Updates plotting/data access to training/synthetic fields. |
| src/nemo_safe_synthesizer/evaluation/components/attribute_inference_protection.py | Renames params and internal variables to training/synthetic. |
| src/nemo_safe_synthesizer/evaluation/assets/text/multi_modal_tooltips.py | Updates report tooltip text to training/synthetic terminology. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/reports/multi_modal_report.j2 | Updates report navigation and includes to “Training Columns”. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2 | Renames “Reference Columns” section to “Training Data Columns”. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/pii_replay.j2 | Updates headings/fields to training/synthetic naming. |
| src/nemo_safe_synthesizer/evaluation/assets/jinja/components/dataset_statistics.j2 | Updates table labels and memorized-lines row wording. |
| src/nemo_safe_synthesizer/data_processing/assembler.py | Renames internal dataset attributes and split variables to training/validation. |
| docs/user-guide/running.md | Updates generation-stage description to “training dataset schema”. |
| docs/product-overview/evaluation.md | Updates SQS/DPS copy to training vs synthetic terminology. |
Comments suppressed due to low confidence (1)

src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2:35

  • The "Type" column in the "Training Data Columns" table is currently sourced from synthetic_field_features.type. If this table is intended to describe the training data (as the header + tooltip indicate), this should use training_field_features.type instead; otherwise the label/tooltip should be updated to clarify which dataset’s type is shown.
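A minimal sketch of the fix this comment suggests, assuming the template exposes `column.training_field_features` alongside the synthetic features (the surrounding table markup is not reproduced here and is an assumption):

```jinja
{# Before: the Type column reads from the synthetic features #}
<td>{{ column.synthetic_field_features.type }}</td>

{# After: source the Type column from the training features,
   matching the "Training Data Columns" header and tooltip #}
<td>{{ column.training_field_features.type }}</td>
```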


Collaborator

@kendrickb-nvidia kendrickb-nvidia left a comment


Thanks @nina-xu this is really good to be consistent on naming.

As a larger refactoring, I'd like to get our ty type checking updated and working more fully in #141 first. That will give us more confidence that the refactoring didn't miss anything or introduce any typos.

My goal is to get #141 ready today. Hopefully I can run a bit of slurm testing over the weekend and we can merge #141 on Monday. Then do one merge conflict resolution with this PR and proceed. If things don't go well on the type checking PR today, then we can swap the order and I'll figure out merge conflicts in the already massive #141. Will check in on Monday on how to proceed.

Align dataset split naming to canonical training/synthetic/test/input terms across code, docs, and evaluation artifacts for clearer, consistent semantics.

Made-with: Cursor
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 8, 2026 17:47
@nina-xu nina-xu force-pushed the ninaxu/143-dataset-naming-consistency branch from edb34d2 to 63b5c59 on April 8, 2026 17:47

Copilot AI left a comment


Pull request overview

Copilot reviewed 43 out of 43 changed files in this pull request and generated 4 comments.



nina-xu added 2 commits April 8, 2026 17:53
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Signed-off-by: nina-xu <19981858+nina-xu@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 8, 2026 21:26

Copilot AI left a comment


Pull request overview

Copilot reviewed 48 out of 48 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/nemo_safe_synthesizer/evaluation/assets/jinja/components/training_columns.j2:36

  • The "Training Data Columns" table tooltip/description says values are for the training data, but the rendered "Type" column is pulled from column.synthetic_field_features.type. This will display the synthetic type rather than the training type (and can be misleading when types differ). Use column.training_field_features.type here, or render both explicitly if the intent is comparison.


Comment on lines 15 to 17
```diff
 "sqs_info": """
-The SQS (Synthetic Quality Score) is a measure of how well the output data matches the reference data. A higher SQS indicates a better match.
+The SQS (Synthetic Quality Score) is a measure of how well the synthetic data matches the input data. A higher SQS indicates a better match.
 SQS is comprised of five metrics. Column Correlation Stability, Deep Structure Stability, and Column Distribution Stability are calculated and
```

Copilot AI Apr 8, 2026


In the SQS tooltip text, "synthetic data matches the input data" is inconsistent with the rest of the report (and the evaluation code), which compares synthetic data against the training split. Consider changing "input data" to "training data" here (and similarly in the Text Semantic Similarity tooltip at line 54) to avoid confusing users about what baseline SQS is computed against.

Comment on lines 448 to 450
```python
# ---------------------------------------------------------------------------
# Tests: config validation at process_data entry
# ---------------------------------------------------------------------------
```

Copilot AI Apr 8, 2026


The holdout=0 resume-path regression tests (missing/empty test.csv handling in load_from_save_path) were removed here. The production code still has special-case logic for test.csv being absent or 0-bytes, so dropping these tests reduces coverage for a historically fragile edge case. Please reintroduce equivalent coverage (either here or in a more appropriate test module).
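A sketch of regression coverage along the lines this comment requests. `load_from_save_path` below is a stand-in, not the project's actual function; it only mirrors the behavior the comment describes, namely that a missing or zero-byte test.csv on the holdout=0 resume path should yield no test split rather than fail.

```python
import csv
import tempfile
from pathlib import Path


def load_from_save_path(save_path: Path):
    """Stand-in loader: returns (training_rows, test_rows_or_None)."""
    with (save_path / "training.csv").open(newline="") as f:
        training = list(csv.reader(f))
    test_file = save_path / "test.csv"
    # Special-case: test.csv absent or 0 bytes means no holdout split.
    if not test_file.exists() or test_file.stat().st_size == 0:
        return training, None
    with test_file.open(newline="") as f:
        return training, list(csv.reader(f))


def test_missing_and_empty_test_csv():
    with tempfile.TemporaryDirectory() as tmp:
        save_path = Path(tmp)
        (save_path / "training.csv").write_text("a,b\n1,2\n")

        # holdout=0 resume: test.csv absent entirely
        _, test_rows = load_from_save_path(save_path)
        assert test_rows is None

        # holdout=0 resume: test.csv present but zero bytes
        (save_path / "test.csv").write_text("")
        _, test_rows = load_from_save_path(save_path)
        assert test_rows is None


test_missing_and_empty_test_csv()
print("ok")
```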



Successfully merging this pull request may close these issues.

chore: naming consistency for train/output/test datasets
