🛡️ NeMo Safe Synthesizer

NVIDIA NeMo Safe Synthesizer creates private, safe versions of sensitive tabular datasets -- entirely synthetic data with no one-to-one mapping to your original records. Purpose-built for privacy compliance and sensitive information protection while preserving data utility for downstream AI tasks.

Quick Start

Read detailed usage below, or jump to the documentation with Getting Started or the Safe Synthesizer 101 notebook.

Prerequisites

  • Python 3.11–3.13 (we pin a specific 3.11.x in .python-version for local/dev bootstrap; any 3.11, 3.12, or 3.13 interpreter works. Python 3.14+ is NOT supported because ray, a transitive dependency of vLLM, does not yet publish cp314 wheels)
  • uv - Python package manager (>=0.9.14, <0.11.0)
  • NVIDIA GPU (A100 or larger) for training and generation
  • Linux only -- macOS, Windows, and Apple Silicon are not supported for training or generation. A CPU-only install is available for development and configuration validation.

Installation

uv pip install "nemo-safe-synthesizer[cu128,engine]" \
  --index https://flashinfer.ai/whl/cu128 \
  --index https://download.pytorch.org/whl/cu128 \
  --index-strategy unsafe-best-match

Or install from source:

git clone https://github.com/NVIDIA-NeMo/Safe-Synthesizer.git
cd Safe-Synthesizer
make bootstrap-tools
make bootstrap-nss cuda

Running

Activate the Python virtual environment and run the CLI using the safe-synthesizer entry point:

> safe-synthesizer --help
Usage: safe-synthesizer [OPTIONS] COMMAND [ARGS]...

  NeMo Safe Synthesizer command-line interface. This application is used to
  run the Safe Synthesizer pipeline. It can be used to train a model, generate
  synthetic data, and evaluate the synthetic data. It can also be used to
  modify a config file.

Options:
  --help  Show this message and exit.

Commands:
  artifacts  Artifacts management commands.
  config     Manage Safe Synthesizer configurations.
  run        Run the Safe Synthesizer end-to-end pipeline.

Running the Pipeline

The run command executes the Safe Synthesizer pipeline. Without a subcommand, it runs the full end-to-end pipeline:

> uv run safe-synthesizer run --help
Usage: safe-synthesizer run [OPTIONS] COMMAND [ARGS]...

  Run the Safe Synthesizer end-to-end pipeline.

  Without a subcommand, runs the full end-to-end pipeline. Use 'run train' or
  'run generate' for individual stages.

Options:
  --config TEXT                   path to a yaml config file
  --data-source TEXT              Dataset name, URL, or path to CSV dataset.
                                  For 'run generate', this is optional if a
                                  cached dataset exists in the workdir.
  --artifact-path DIRECTORY       Base directory for all runs. Runs are
                                  created as <artifact-
                                  path>/<config>---<dataset>/<timestamp>/. Can
                                  also be set via NSS_ARTIFACTS_PATH env var.
                                  [default: ./safe-synthesizer-artifacts]
  --run-path DIRECTORY            Explicit path for this run's output
                                  directory. When specified, outputs go
                                  directly to this path. Overrides --artifact-
                                  path.
  --output-file PATH              Path to output CSV file. Overrides the
                                  default workdir output location.
  --log-format [json|plain]       Log format for console output. File logging
                                  will always be JSON. Can also be set via
                                  NSS_LOG_FORMAT env var. [default: plain]
  --log-color / --no-log-color    Whether to colorize the log output on the
                                  console. [default: --log-color]
  --log-file PATH                 Path to log file. Defaults to a file nested
                                  under the run directory. Can also be set via
                                  NSS_LOG_FILE env var.
  --wandb-mode [online|offline|disabled]
                                  Wandb mode. 'online' will upload logs to
                                  wandb, 'offline' will save logs to a local
                                  file, 'disabled' will not upload logs to
                                  wandb. Can also be set via WANDB_MODE env
                                  var. [default: disabled]
  --wandb-project TEXT            Wandb project. Can also be set via
                                  WANDB_PROJECT env var.
  -v                              Verbose logging. 'v' shows debug info from
                                  main program, 'vv' shows debug from
                                  dependencies too
  --dataset-registry TEXT         URL or path of a dataset registry YAML file.
                                  If provided, datasets in the registry may be
                                  referenced by name in --data-source. Can
                                  also be set
                                  via NSS_DATASET_REGISTRY env var. If both
                                  env var and CLI option are provided, the CLI
                                  option takes precedence.
  --help                          Show this message and exit.

Commands:
  generate  Run the generation stage only.
  train     Run the training stage only.

Subcommands

  • safe-synthesizer run train - Run only the training stage, saving the adapter to the run directory.
  • safe-synthesizer run generate - Run only the generation stage using a saved adapter.
> uv run safe-synthesizer run generate --help
Usage: safe-synthesizer run generate [OPTIONS]

  Run the generation stage only.

  This command loads a trained adapter and generates synthetic data. Requires
  'run train' to have been executed first.

  Use --run-path to specify the exact run directory containing the trained
  model, or use --auto-discover-adapter with --artifact-path to automatically
  find the latest trained run.

Options:
  --config TEXT                   path to a yaml config file
  --data-source TEXT              Dataset name, URL, or path to CSV dataset.
                                  [required]
  --artifact-path DIRECTORY       Base directory for all runs. Runs are
                                  created as <artifact-
                                  path>/<config>---<dataset>/<timestamp>/.
                                  [default: ./safe-synthesizer-artifacts]
  --run-path DIRECTORY            Explicit path for this run's output
                                  directory. When specified, outputs go
                                  directly to this path. Overrides --artifact-
                                  path.
  --output-file PATH              Path to output CSV file. Overrides the
                                  default workdir output location.
  --log-format [json|plain]       Log format for console output. File logging
                                  will always be JSON.
  --log-color / --no-log-color    Whether to colorize the log output on the
                                  console
  --log-file PATH                 Path to log file. Defaults to a file nested
                                  under the run directory.
  -v                              Verbose logging. 'v' shows debug info from
                                  main program, 'vv' shows debug from
                                  dependencies too
  --wandb-mode [online|offline|disabled]
                                  Wandb mode. 'online' will upload logs to
                                  wandb, 'offline' will save logs to a local
                                  file, 'disabled' will not upload logs to
                                  wandb.
  --wandb-project TEXT            Wandb project. If not specified, the project
                                  will be taken from the environment variable
                                  WANDB_PROJECT.
  --auto-discover-adapter         Automatically find the latest trained
                                  adapter in --artifact-path. Without this
                                  flag, --run-path must point to a specific
                                  trained run.
  --help                          Show this message and exit.
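Putting the two stages together, a typical split workflow might look like this (a sketch based on the options above; dataset names and paths are placeholders):

```shell
# Stage 1: train only, writing all outputs to an explicit run directory
safe-synthesizer run train --data-source my_data.csv --run-path ./runs/experiment-1

# Stage 2: generate synthetic data from the adapter saved by stage 1
safe-synthesizer run generate --data-source my_data.csv --run-path ./runs/experiment-1
```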

Managing Configurations

The config command provides tools to validate and modify configuration files:

> uv run safe-synthesizer config --help
Usage: safe-synthesizer config [OPTIONS] COMMAND [ARGS]...

  Manage Safe Synthesizer configurations.

Options:
  --help  Show this message and exit.

Commands:
  modify    Modify a Safe Synthesizer configuration.
  validate  Validate a Safe Synthesizer configuration.

Attention Configuration

Safe Synthesizer exposes attention implementation settings for both training and generation.

Training (attn_implementation)

Controls the HuggingFace attention backend used during model loading for training. Set via config YAML, CLI, or SDK:

# config.yaml
training:
  attn_implementation: "kernels-community/vllm-flash-attn3"

# CLI override
safe-synthesizer run --training__attn_implementation sdpa --data-source my_data.csv
Value                               Description                                              Requires
kernels-community/vllm-flash-attn3  Flash Attention 3 via HuggingFace Kernels Hub (default)  kernels pip package
kernels-community/flash-attn2       Flash Attention 2 via HuggingFace Kernels Hub            kernels pip package
flash_attention_2                   Flash Attention 2 (traditional)                          flash-attn pip package
sdpa                                PyTorch scaled dot product attention                     None (built-in)
eager                               Standard PyTorch attention                               None (built-in)

If the default kernels-community/vllm-flash-attn3 is configured but the kernels package is not installed, the backend automatically falls back to sdpa.
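The fallback behavior can be pictured with a simplified sketch (illustrative only, not the actual resolution code inside Safe Synthesizer):

```python
import importlib.util

def resolve_attn_implementation(requested: str) -> str:
    """Fall back to sdpa when a Kernels Hub backend is requested but the
    `kernels` package is unavailable (simplified illustration)."""
    needs_kernels = requested.startswith("kernels-community/")
    if needs_kernels and importlib.util.find_spec("kernels") is None:
        return "sdpa"
    return requested

# Built-in backends pass through unchanged
print(resolve_attn_implementation("sdpa"))
print(resolve_attn_implementation("eager"))
```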

Generation (attention_backend)

Controls the vLLM attention backend used during synthetic data generation. Defaults to "auto", which lets vLLM auto-select the best available backend.

# config.yaml
generation:
  attention_backend: "FLASH_ATTN"

Common values: FLASHINFER, FLASH_ATTN, TORCH_SDPA, TRITON_ATTN, FLEX_ATTENTION.

NIM Integration

Column classification uses a NIM/OpenAI-compatible endpoint to detect entity types in your data. NSS_INFERENCE_ENDPOINT defaults to https://integrate.api.nvidia.com/v1; override it to use a different endpoint.

When using the CLI or Python SDK, set NSS_INFERENCE_KEY (and NSS_INFERENCE_ENDPOINT only if not using the default) so column classification can run.

Local Endpoint

To point to a locally hosted LLM:

export NSS_INFERENCE_ENDPOINT="https://your-local-nim-endpoint"
export NSS_INFERENCE_KEY="your-api-key"  # pragma: allowlist secret

Disable Classification

To disable classification entirely:

replace_pii:
  globals:
    classify:
      enable_classify: false

When classification is disabled, NSS falls back to default entity types.

Artifacts and Workdirs

Safe Synthesizer uses a structured directory format to manage artifacts (trained models, synthetic data, logs).

Directory Layout

By default, runs are nested under --artifact-path using the project name (<config>---<dataset>) and a unique run name.

<artifact-path>/<config>---<dataset>/<run_name>/
├── train/
│   ├── safe-synthesizer-config.json
│   └── adapter/                     # trained PEFT adapter
│       ├── adapter_config.json
│       ├── adapter_model.safetensors
│       ├── metadata_v2.json
│       └── dataset_schema.json
├── generate/
│   ├── logs.jsonl                   # generate-only workflow
│   ├── info.json                    # generate-only workflow
│   ├── synthetic_data.csv
│   ├── evaluation_report.html
│   └── evaluation_metrics.json      # machine-readable metrics
├── dataset/
│   ├── training.csv
│   ├── test.csv
│   ├── validation.csv               # when training.validation_ratio > 0
│   └── transformed_training.csv     # when PII replacement transforms the data
└── logs/
    └── <phase>.jsonl                # e.g. end_to_end.jsonl or train.jsonl

Run Names

If --run-path is not provided, a run name is generated automatically from the current <timestamp>.
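Assuming the layout described above, the default run directory can be sketched with a small helper (hypothetical code; the timestamp format is an assumption, and the actual naming is handled internally):

```python
from datetime import datetime
from pathlib import Path

def default_run_path(artifact_path: str, config_name: str, dataset_name: str) -> Path:
    """Compose <artifact-path>/<config>---<dataset>/<timestamp>/ per the documented layout."""
    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")  # assumed timestamp format
    return Path(artifact_path) / f"{config_name}---{dataset_name}" / timestamp

print(default_run_path("./safe-synthesizer-artifacts", "default", "my_data"))
```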

Overriding Paths

  • Use --run-path to specify an explicit directory for the run, bypassing the <project>/<timestamp> nesting.
  • Use --output-file to specify an explicit path for the final synthetic CSV, overriding the default location in the generate/ directory.
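For example, to redirect both the run directory and the final CSV (placeholder paths):

```shell
safe-synthesizer run \
  --data-source my_data.csv \
  --run-path ./runs/experiment-1 \
  --output-file ./results/synthetic.csv
```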

WandB Logging

Safe Synthesizer supports Weights & Biases (WandB) for experiment tracking.

Configuration

You can enable WandB logging using CLI options or environment variables:

  • --wandb-mode [online|offline|disabled]: Set the WandB mode. Default is disabled.
  • --wandb-project <name>: Specify the WandB project name.
  • WANDB_API_KEY: Ensure your API key is set in your environment.
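For example, to track a run online (assumes you have a WandB account and API key):

```shell
export WANDB_API_KEY="your-api-key"  # pragma: allowlist secret
safe-synthesizer run --wandb-mode online --wandb-project my-project --data-source my_data.csv
```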

Logged Data

The following information is logged to WandB:

  • Configuration parameters
  • Training metrics (if supported by the backend)
  • Generation statistics
  • Evaluation results
  • Timing information

Dataset Registry

Safe Synthesizer supports a dataset registry to simplify working with a standard set of datasets. Datasets in the registry can be referenced by name, rather than repeatedly specifying long URLs or file paths on the command line. The registry also supports custom config overrides or load arguments specific to individual datasets.

Providing a Dataset Registry

You can supply a dataset registry (YAML file) via either the CLI or an environment variable:

  • CLI Option: --dataset-registry <path_or_url>
  • Environment Variable: Set NSS_DATASET_REGISTRY to point to your YAML file (path or URL).

If both are provided, the CLI option takes precedence.

Referencing Datasets

When a dataset registry is provided, you can use dataset names defined in the registry with the --data-source argument. For example:

safe-synthesizer run --dataset-registry my_registry.yaml --data-source my_dataset

This loads the dataset from the registered URL and applies any overrides defined for my_dataset in the registry YAML.

Dataset Registry YAML Format

The registry file should conform to the pydantic model defined by DatasetRegistry in cli/datasets.py. For example:

# registry.yaml
base_url: /root/data/location
datasets:
- name: dataset1
  url: dataset1.csv
- name: dataset2
  url: dataset2.jsonl
  overrides:
    data:
      group_training_examples_by: id
- name: dataset3
  url: /absolute/path/to/dataset.csv
- name: dataset4
  url: https://myhost.com/path/to/dataset.json
  load_args:
    keyword: custom_arg_for_data_reader
  • Minimal requirements for each entry in the datasets: list are a name and a url. url may be a URL or a file path; anything that data readers like pd.read_csv will accept.
  • base_url - Any relative URLs or paths are prepended with base_url before the dataset is loaded. This applies only to named datasets in the registry that have a relative url. Passing a relative --data-source on the CLI always loads the file relative to your current working directory, regardless of whether a registry is provided or base_url is set. base_url is optional; if it is not provided, it is recommended to use absolute URLs or file paths for all entries.
  • overrides - Dataset-specific config overrides, such as for a dataset that should always be run with group_training_examples_by. Config values passed as CLI arguments take precedence, followed by overrides from the registry, and finally values from the --config YAML file.
  • load_args - Extra arguments needed by the data reader for a specific dataset, for example changing the separator used by pd.read_csv for a .csv file with a different delimiter.

License

NeMo Safe Synthesizer is licensed under the Apache License 2.0.
