This file provides guidance to agents when working with code in this repository.
DataDesigner is an NVIDIA NeMo project for creating synthetic datasets from scratch. It's a comprehensive framework that generates structured data using multiple generation strategies:
- Sampled data: Built-in generators (UUID, DateTime, etc.) and Faker integration
- LLM-generated content: Text, code, and structured data via LiteLLM
- Expression-based columns: Derived columns using Jinja2 templates
- Validation & scoring: Python, SQL, and remote validators; LLM-based judge scoring
- Seed dataset-based generation: Generate from existing datasets
The project follows a layered architecture:
- Config Layer (packages/data-designer-config/src/data_designer/config/): User-facing configuration API
  - config_builder.py: Main builder API for constructing configurations
  - column_configs.py: Column configuration types (Sampler, LLMText, LLMCode, LLMStructured, LLMJudge, Expression, Validation, SeedDataset)
  - models.py: Model configurations and inference parameters
  - sampler_params.py: Parametrized samplers (Uniform, Category, Person, DateTime, etc.)
- Engine Layer (packages/data-designer-engine/src/data_designer/engine/): Internal generation and processing
  - column_generators/: Generates individual columns from configs
  - dataset_builders/: Orchestrates full dataset generation with DAG-based dependency management
  - models/: LLM integration via LiteLLM with response parsing
  - validators/: Column validation (Python, SQL, Code, Remote)
  - sampling_gen/: Sophisticated person/entity sampling
- Interface Layer (packages/data-designer/src/data_designer/interface/): Public API
  - data_designer.py: Main `DataDesigner` class (primary entry point)
  - results.py: Result containers
  - errors.py: Public error types
```python
import data_designer.config as dd
from data_designer.interface import DataDesigner

# Usage:
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["A", "B"]),
    )
)
```

Key design patterns:

- Builder pattern: Configuration construction via `DataDesignerConfigBuilder`
- Registry pattern: Plugin system for column generators, validators, and profilers
- Strategy pattern: Multiple generation approaches (sampled, LLM, expression, seed)
- DAG-based execution: Column dependencies managed as directed acyclic graph
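The DAG-based execution above can be sketched with a toy dependency map. This is an illustration of the concept only, not the engine's actual internals; the column names and dependency structure here are hypothetical.

```python
# Illustrative sketch: resolving column dependencies into a generation
# order with a topological sort (stdlib graphlib, Python 3.9+).
from graphlib import TopologicalSorter

# Each column maps to the set of columns it depends on
dependencies: dict[str, set[str]] = {
    "category": set(),                      # sampled, no inputs
    "product_name": {"category"},           # LLM prompt references category
    "review": {"category", "product_name"},
    "review_score": {"review"},             # judge column scores the review
}

# static_order() yields every column after all of its dependencies
generation_order = list(TopologicalSorter(dependencies).static_order())
```

`TopologicalSorter` also raises `CycleError` on circular dependencies, which is the same property a DAG-based builder relies on to reject invalid configurations.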
This project uses uv for dependency management and make for common tasks:
```bash
# Install dependencies
uv sync

# Install with dev dependencies
uv sync --all-extras

# Run the main module (if applicable)
uv run python -m data_designer
```

```bash
# Using Make (recommended)
make lint          # Run ruff linter
make lint-fix      # Fix linting issues automatically
make format        # Format code with ruff
make format-check  # Check code formatting without changes
make check-all     # Run all checks (format-check + lint)
make check-all-fix # Run all checks with autofix (format + lint-fix)

# Direct commands
uv run ruff check          # Lint all files
uv run ruff check --fix    # Lint with autofix
uv run ruff format         # Format all files
uv run ruff format --check # Check formatting
```

```bash
# Run all tests
uv run pytest

# Run tests with verbose output
uv run pytest -v

# Run a specific test file
uv run pytest tests/config/test_sampler_constraints.py

# Run tests with coverage
uv run pytest --cov=data_designer --cov-report=term-missing --cov-report=html

# Using Make
make test     # Run all tests
make coverage # Run tests with coverage report
```

Key files:

- packages/data-designer/src/data_designer/interface/data_designer.py - Main entry point (`DataDesigner` class)
- packages/data-designer-config/src/data_designer/config/config_builder.py - Configuration API (`DataDesignerConfigBuilder`)
- packages/data-designer-config/src/data_designer/config/__init__.py - User-facing config API exports
- packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py - Generation orchestrator
- pyproject.toml - Project dependencies and tool configurations
- Makefile - Common development commands
- Comments: Add comments only where the code is genuinely hard to understand; basic code blocks don't need them. We want readable code without vacuous comments.
- License headers: All Python files must include the NVIDIA SPDX license header:

  ```python
  # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  # SPDX-License-Identifier: Apache-2.0
  ```

  Use `make update-license-headers` to add headers to all files automatically.
- `from __future__ import annotations`: Include at the top of all Python source files for deferred type evaluation.
- Imports: Avoid importing Python modules inside method definitions. Prefer module-level imports for better performance and clarity. See Code Style for detailed import and type annotation guidelines.
This project uses ruff (>=0.14.10) for linting and formatting. Follow these guidelines to avoid linter errors:
- Line length: Maximum 120 characters per line
- Quote style: Always use double quotes (`"`) for strings
- Indentation: Use 4 spaces (never tabs)
- Target version: Python 3.10+
Type annotations are REQUIRED for all code in this project. This is strictly enforced for code quality and maintainability. Modern type syntax is enforced by ruff rules UP006, UP007, and UP045.
- ALWAYS add type annotations to all functions, methods, and class attributes (including tests)
- Use primitive types when possible: `list` not `List`, `dict` not `Dict`, `set` not `Set`, `tuple` not `Tuple` (enforced by `UP006`)
- Use modern union syntax with `|` for optional and union types: `str | None` not `Optional[str]` (enforced by `UP045`); `int | str` not `Union[int, str]` (enforced by `UP007`)
- Only import from `typing` when absolutely necessary for complex generic types
- For Pydantic models, use field-level type annotations

```python
# Good
def process_items(items: list[str], max_count: int | None = None) -> dict[str, int]:
    return {item: len(item) for item in items}

# Avoid - missing type annotations
def process_items(items, max_count=None):
    return {item: len(item) for item in items}

# Avoid - old-style typing
from typing import List, Dict, Optional

def process_items(items: List[str], max_count: Optional[int] = None) -> Dict[str, int]:
    return {item: len(item) for item in items}
```
- ALWAYS include `from __future__ import annotations` at the top of every Python source file (after the license header) for deferred type evaluation
- ALWAYS use absolute imports, never relative imports (enforced by `TID`)
- Place imports at module level, not inside functions (exception: when it is unavoidable for performance reasons)
- Import sorting is handled by `ruff`'s `isort` - imports should be grouped and sorted:
  - Standard library imports
  - Third-party imports (use `lazy_heavy_imports` for heavy libraries)
  - First-party imports (`data_designer`)
- Use standard import conventions (enforced by `ICN`)
- See Lazy Loading and TYPE_CHECKING section for optimization guidelines

```python
# Good
from data_designer.config.config_builder import DataDesignerConfigBuilder

# Bad - relative import (will cause linter errors)
from .config_builder import DataDesignerConfigBuilder

# Good - imports at module level
from pathlib import Path

def process_file(filename: str) -> None:
    path = Path(filename)

# Bad - import inside function
def process_file(filename: str) -> None:
    from pathlib import Path
    path = Path(filename)
```
This project uses lazy loading for heavy third-party dependencies to optimize import performance.
Heavy third-party libraries (>100ms import cost) should be lazy-loaded via lazy_heavy_imports.py:
```python
# ❌ Don't import directly
import pandas as pd
import numpy as np

# ✅ Use lazy loading with IDE support
from typing import TYPE_CHECKING

from data_designer.lazy_heavy_imports import pd, np

if TYPE_CHECKING:
    import pandas as pd  # For IDE autocomplete and type hints
    import numpy as np
```

This pattern provides:
- Runtime lazy loading (fast startup)
- Full IDE support (autocomplete, type hints)
- Type checker validation
See lazy_heavy_imports.py for the current list of lazy-loaded libraries.
If you add a new dependency with significant import cost (>100ms):
- Add to `lazy_heavy_imports.py`:

  ```python
  _LAZY_IMPORTS = {
      # ... existing entries ...
      "your_lib": "your_library_name",
  }
  ```

- Update imports across the codebase:

  ```python
  from typing import TYPE_CHECKING

  from data_designer.lazy_heavy_imports import your_lib

  if TYPE_CHECKING:
      import your_library_name as your_lib  # For IDE support
  ```

- Verify with the performance test:

  ```bash
  make perf-import CLEAN=1
  ```
TYPE_CHECKING blocks defer imports that are only needed for type hints, preventing circular dependencies and reducing import time.
For internal data_designer imports:
```python
from __future__ import annotations

from typing import TYPE_CHECKING

# Runtime imports
from pathlib import Path

from data_designer.config.base import ConfigBase

if TYPE_CHECKING:
    # Type-only imports - only visible to type checkers
    from data_designer.engine.models.facade import ModelFacade

def get_model(model: ModelFacade) -> str:
    return model.name
```

For lazy-loaded libraries (see pattern in "When to Use Lazy Loading" above):
- Import from `lazy_heavy_imports` for runtime
- Add the full import in a `TYPE_CHECKING` block for IDE support
Rules for TYPE_CHECKING:
✅ DO put in TYPE_CHECKING:
- Internal `data_designer` imports used only in type hints
- Imports that would cause circular dependencies
- Full imports of lazy-loaded libraries for IDE support (e.g., `import pandas as pd` in addition to the runtime `from data_designer.lazy_heavy_imports import pd`)

❌ DON'T put in TYPE_CHECKING:
- Standard library imports (`Path`, `Any`, `Callable`, `Literal`, `TypeAlias`, etc.)
- Pydantic model types used in field definitions (needed at runtime for validation)
- Types used in discriminated unions (Pydantic needs them at runtime)
- Any import used at runtime (instantiation, method calls, base classes, etc.)
Examples:
```python
# ✅ CORRECT - Lazy-loaded library with IDE support
from typing import TYPE_CHECKING

from data_designer.lazy_heavy_imports import pd

if TYPE_CHECKING:
    import pandas as pd  # IDE gets full type hints

def load_data(path: str) -> pd.DataFrame:  # IDE understands pd.DataFrame
    return pd.read_csv(path)

# ✅ CORRECT - Standard library NOT in TYPE_CHECKING
from pathlib import Path
from typing import Any

def process_file(path: Path) -> Any:
    return path.read_text()

# ✅ CORRECT - Internal type-only import
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from data_designer.engine.models.facade import ModelFacade

def get_model(model: ModelFacade) -> str:  # Only used in type hint
    return model.name

# ❌ INCORRECT - Pydantic field type in TYPE_CHECKING
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from data_designer.config.models import ModelConfig  # Wrong!

class MyConfig(BaseModel):
    model: ModelConfig  # Pydantic needs this at runtime!

# ✅ CORRECT - Pydantic field type at runtime
from data_designer.config.models import ModelConfig

class MyConfig(BaseModel):
    model: ModelConfig
```

Follow PEP 8 naming conventions:
- Functions and variables: `snake_case`
- Classes: `PascalCase`
- Constants: `UPPER_SNAKE_CASE`
- Private attributes: prefix with a single underscore, e.g. `_private_var`
- Function and method names must start with an action verb: e.g. `get_value_from` not `value_from`, `coerce_to_int` not `to_int`, `extract_usage` not `usage`

```python
# Good
class DatasetGenerator:
    MAX_RETRIES = 3

    def __init__(self) -> None:
        self._cache: dict[str, str] = {}

    def generate_dataset(self, config: dict[str, str]) -> pd.DataFrame:
        pass

# Bad
class dataset_generator:  # Should be PascalCase
    maxRetries = 3  # Should be UPPER_SNAKE_CASE

    def GenerateDataset(self, Config):  # Should be snake_case
        pass
```
- Public before private: Public functions/methods appear before private ones in modules and classes
- Class method order: `__init__` and other dunder methods first, then properties, then public methods, then private helpers. Group related method types together (e.g., all `@staticmethod`s in one block, all `@classmethod`s in one block).
- Prefer public over private for testability: Use public functions (no `_` prefix) for helpers that benefit from direct testing
- Section comments in larger modules: Use `# ---` separators to delineate logical groups (e.g. image parsing, usage extraction, generic accessors)
DRY
- Extract shared logic into pure helper functions rather than duplicating across similar call sites
- Rule of thumb: tolerate duplication until the third occurrence, then extract
KISS
- Prefer flat, obvious code over clever abstractions — two similar lines is better than a premature helper
- When in doubt between DRY and KISS, favor readability over deduplication
YAGNI
- Don't add parameters, config, or abstraction layers for hypothetical future use cases
- Don't generalize until the third caller appears
SOLID
- Wrap third-party exceptions at module boundaries — callers depend on canonical error types, not leaked internals
- Use `Protocol` for contracts between layers
- One function, one job — separate logic from I/O
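A compact sketch combining two of these points — wrapping a third-party exception at a module boundary behind a canonical error type, and expressing the layer contract as a `Protocol`. All names here are hypothetical, not from the codebase:

```python
from __future__ import annotations

import json
from typing import Protocol

class ConfigParseError(Exception):
    """Canonical error type callers depend on, instead of json.JSONDecodeError."""

class ConfigLoader(Protocol):
    """Contract between layers; any loader with this shape satisfies it."""

    def load_config(self, raw: str) -> dict: ...

class JsonConfigLoader:
    def load_config(self, raw: str) -> dict:
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            # Wrap the lower-level exception at the boundary so callers
            # never depend on the JSON library's internals
            raise ConfigParseError(f"invalid config: {exc}") from exc
```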
- Mutable default arguments:

  ```python
  # Bad - mutable default argument
  def add_item(item: str, items: list[str] = []) -> list[str]:
      items.append(item)
      return items

  # Good
  def add_item(item: str, items: list[str] | None = None) -> list[str]:
      if items is None:
          items = []
      items.append(item)
      return items
  ```

- Unused imports and variables:

  ```python
  # Bad - unused import
  from pathlib import Path
  from typing import Any  # Not used

  def process() -> None:
      pass

  # Good - only import what you use
  from pathlib import Path

  def process() -> None:
      pass
  ```

- Simplify code where possible (enforced by `SIM`):

  ```python
  # Bad
  if condition:
      return True
  else:
      return False

  # Good
  return condition

  # Bad
  if key in my_dict:
      value = my_dict[key]
  else:
      value = default

  # Good
  value = my_dict.get(key, default)
  ```

- Use comprehensions properly:

  ```python
  # Bad
  list([x for x in items])  # Unnecessary list() call

  # Good
  [x for x in items]

  # Bad
  dict([(k, v) for k, v in items])

  # Good
  {k: v for k, v in items}
  ```

- Proper return statements:

  ```python
  # Bad - unnecessary else after return
  def get_value(condition: bool) -> str:
      if condition:
          return "yes"
      else:
          return "no"

  # Good
  def get_value(condition: bool) -> str:
      if condition:
          return "yes"
      return "no"
  ```
The following ruff linter rules are currently enabled (see pyproject.toml):
- `W`: pycodestyle warnings
- `F`: pyflakes (unused imports, undefined names)
- `I`: isort (import sorting)
- `ICN`: flake8-import-conventions (standard import names)
- `PIE`: flake8-pie (miscellaneous lints)
- `TID`: flake8-tidy-imports (bans relative imports)
- `UP006`: `List[A]` -> `list[A]`
- `UP007`: `Union[A, B]` -> `A | B`
- `UP045`: `Optional[A]` -> `A | None`
Note: Additional rules (E, N, ANN, B, C4, DTZ, RET, SIM, PTH) are commented out but may be enabled in the future. Write code that would pass these checks for future-proofing.
The project uses pytest with the following patterns:
- Fixtures: Shared test data and configurations in tests/conftest.py
- Stub configs: YAML-based configuration stubs for testing (see the `stub_data_designer_config_str` fixture)
- Mocking: Use `unittest.mock.patch` for external services and dependencies
- Async support: pytest-asyncio for async tests (`asyncio_default_fixture_loop_scope = "session"`)
- HTTP mocking: pytest-httpx for mocking HTTP requests
- Coverage: Track test coverage with pytest-cov
- Parametrize over duplicate: Use `@pytest.mark.parametrize` instead of writing multiple test functions for variations of the same behavior
- Minimal fixtures: Fixtures should be simple — one fixture, one responsibility, just setup with no behavior logic
- Shared fixtures in `conftest.py`: Place fixtures shared across a test directory in `conftest.py`
- Mock at boundaries: Mock external dependencies (APIs, databases, third-party services), not internal functions
- Test behavior, not implementation: Assert on outputs and side effects, not internal call counts (unless verifying routing)
- Keep mocking shallow: If a test requires deeply nested mocking, the code under test may need refactoring
Example test structure:
```python
from typing import Any

from data_designer.config.config_builder import DataDesignerConfigBuilder

def test_something(stub_model_configs: dict[str, Any]) -> None:
    """Test description."""
    builder = DataDesignerConfigBuilder(model_configs=stub_model_configs)
    # ... test implementation
    assert expected == actual
```

When working with column configurations, understand these key types:
- `SamplerColumnConfig`: Built-in samplers (UUID, Category, Uniform, Gaussian, Person, DateTime, etc.)
- `LLMTextColumnConfig`: LLM text generation with Jinja2 templating
- `LLMCodeColumnConfig`: Code generation with language specification
- `LLMStructuredColumnConfig`: Structured JSON generation with schema
- `LLMJudgeColumnConfig`: Judge/scoring columns for quality assessment
- `ExpressionColumnConfig`: Expression-based derived columns (Python eval or Jinja2)
- `ValidationColumnConfig`: Validation results (Python, SQL, Code, Remote validators)
- `SeedDatasetColumnConfig`: Data from seed datasets
- `EmbeddingColumnConfig`: Embedding generation for text columns using a specified model
- `CustomColumnConfig`: Custom user-defined column generators via the `@custom_column_generator` decorator
See packages/data-designer-config/src/data_designer/config/column_configs.py for detailed schemas.
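As a rough illustration of how an LLM column might reference an earlier column through a Jinja2 template: the field names `model_alias` and `prompt` below are assumptions extrapolated from the builder example above, so verify them against the real schema in column_configs.py before use.

```python
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="product_review",
        model_alias="my-model",  # assumed field name - check column_configs.py
        prompt="Write a short review of a {{ category }} product.",  # Jinja2 reference to the category column
    )
)
```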
Models are configured via `ModelConfig` with:

- `alias`: User-defined alias for the model
- `model`: Model ID (e.g., from build.nvidia.com)
- `inference_parameters`: Temperature, top_p, max_tokens (can be distribution-based)
- `system_prompt`: Optional system prompt
- `image_modality`: Support for image inputs
See packages/data-designer-config/src/data_designer/config/models.py for details.
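A sketch of what a model configuration might look like. The field names follow the list above, but the inference-parameters class name and the model ID are illustrative assumptions; check models.py for the actual types.

```python
model_config = dd.ModelConfig(
    alias="my-model",
    model="meta/llama-3.1-8b-instruct",  # example model ID, e.g. from build.nvidia.com
    inference_parameters=dd.InferenceParameters(  # assumed class name - see models.py
        temperature=0.7,
        top_p=0.9,
        max_tokens=512,
    ),
)
```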
The project uses a registry pattern for extensibility. Key registries:
- Column generators: packages/data-designer-engine/src/data_designer/engine/column_generators/registry.py
- Validators: packages/data-designer-engine/src/data_designer/engine/validators/
- Column profilers: packages/data-designer-engine/src/data_designer/engine/analysis/column_profilers/registry.py
- Models: packages/data-designer-engine/src/data_designer/engine/models/registry.py
When adding new generators or validators, register them appropriately.
The project uses pre-commit hooks to enforce code quality. Install them with:
```bash
uv run pre-commit install
```

Hooks include:
- Trailing whitespace removal
- End-of-file fixer
- YAML/JSON/TOML validation
- Merge conflict detection
- Debug statement detection
- Ruff linting and formatting
```bash
# Clean up generated files
make clean

# Update license headers
make update-license-headers

# Run all checks before committing
make check-all-fix
make test

# Generate coverage report
make coverage
# View htmlcov/index.html in browser

# Profile import performance (use after adding heavy dependencies)
make perf-import          # Profile import time
make perf-import CLEAN=1  # Clean cache first, then profile
```

- README.md: Installation and basic usage examples
- packages/data-designer-config/src/data_designer/config/: Configuration API documentation
- tests/: Comprehensive test suite with usage examples