AGENTS.md

This file provides guidance to agents when working with code in this repository.

Project Overview

DataDesigner is an NVIDIA NeMo project for creating synthetic datasets from scratch. It's a comprehensive framework that generates structured data using multiple generation strategies:

  • Sampled data: Built-in generators (UUID, DateTime, etc.) and Faker integration
  • LLM-generated content: Text, code, and structured data via LiteLLM
  • Expression-based columns: Derived columns using Jinja2 templates
  • Validation & scoring: Python, SQL, and remote validators; LLM-based judge scoring
  • Seed dataset-based generation: Generate from existing datasets

Architecture

The project follows a layered architecture:

  1. Config Layer (packages/data-designer-config/src/data_designer/config/): User-facing configuration API

    • config_builder.py: Main builder API for constructing configurations
    • column_configs.py: Column configuration types (Sampler, LLMText, LLMCode, LLMStructured, LLMJudge, Expression, Validation, SeedDataset)
    • models.py: Model configurations and inference parameters
    • sampler_params.py: Parametrized samplers (Uniform, Category, Person, DateTime, etc.)
  2. Engine Layer (packages/data-designer-engine/src/data_designer/engine/): Internal generation and processing

    • column_generators/: Generates individual columns from configs
    • dataset_builders/: Orchestrates full dataset generation with DAG-based dependency management
    • models/: LLM integration via LiteLLM with response parsing
    • validators/: Column validation (Python, SQL, Code, Remote)
    • sampling_gen/: Sophisticated person/entity sampling
  3. Interface Layer (packages/data-designer/src/data_designer/interface/): Public API

    • data_designer.py: Main DataDesigner class (primary entry point)
    • results.py: Result containers
    • errors.py: Public error types

Recommended Import Pattern

import data_designer.config as dd
from data_designer.interface import DataDesigner

# Usage:
data_designer = DataDesigner()
config_builder = dd.DataDesignerConfigBuilder()
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["A", "B"]),
    )
)

Key Design Patterns

  • Builder pattern: Configuration construction via DataDesignerConfigBuilder
  • Registry pattern: Plugin system for column generators, validators, and profilers
  • Strategy pattern: Multiple generation approaches (sampled, LLM, expression, seed)
  • DAG-based execution: Column dependencies managed as directed acyclic graph
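The DAG-based execution idea can be sketched with the standard library's graphlib (a simplified illustration of topological ordering, not the engine's actual implementation — the column names here are hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical column dependency map: each column lists the columns it reads.
dependencies: dict[str, set[str]] = {
    "category": set(),        # sampled column, no inputs
    "prompt": {"category"},   # Jinja2 template reads {{ category }}
    "response": {"prompt"},   # LLM column reads the rendered prompt
    "is_valid": {"response"}, # validator column reads the response
}

# A topological order of the DAG gives a safe generation sequence.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['category', 'prompt', 'response', 'is_valid']
```

graphlib raises CycleError for cyclic dependencies, which is exactly the invariant a DAG-based builder relies on.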

Development Workflow

This project uses uv for dependency management and make for common tasks:

# Install dependencies
uv sync

# Install with dev dependencies
uv sync --all-extras

# Run the main module (if applicable)
uv run python -m data_designer

Code Quality

# Using Make (recommended)
make lint           # Run ruff linter
make lint-fix       # Fix linting issues automatically
make format         # Format code with ruff
make format-check   # Check code formatting without changes
make check-all      # Run all checks (format-check + lint)
make check-all-fix  # Run all checks with autofix (format + lint-fix)

# Direct commands
uv run ruff check                # Lint all files
uv run ruff check --fix          # Lint with autofix
uv run ruff format               # Format all files
uv run ruff format --check       # Check formatting

Running Tests

# Run all tests
uv run pytest

# Run tests with verbose output
uv run pytest -v

# Run a specific test file
uv run pytest tests/config/test_sampler_constraints.py

# Run tests with coverage
uv run pytest --cov=data_designer --cov-report=term-missing --cov-report=html

# Using Make
make test           # Run all tests
make coverage       # Run tests with coverage report

Key Files

Working Guidelines

  • Comments: Only insert comments when code is especially important to understand. For basic code blocks, comments aren't necessary. We want readable code without vacuous comments.
  • License headers: All Python files must include the NVIDIA SPDX license header:
    # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
    # SPDX-License-Identifier: Apache-2.0
    Use make update-license-headers to add headers to all files automatically.
  • from __future__ import annotations: Include at the top of all Python source files for deferred type evaluation.
  • Imports: Avoid importing Python modules inside method definitions. Prefer module-level imports for better performance and clarity. See Code Style for detailed import and type annotation guidelines.
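Putting these guidelines together, a new Python module typically starts like this (the function below is purely illustrative):

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
from __future__ import annotations

from pathlib import Path  # module-level import, not inside a function


def read_config_text(path: Path) -> str:
    return path.read_text()
```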

Code Style

This project uses ruff (>=0.14.10) for linting and formatting. Follow these guidelines to avoid linter errors:

General Formatting

  • Line length: Maximum 120 characters per line
  • Quote style: Always use double quotes (") for strings
  • Indentation: Use 4 spaces (never tabs)
  • Target version: Python 3.10+

Type Annotations

Type annotations are REQUIRED for all code in this project. This is strictly enforced for code quality and maintainability. Modern type syntax is enforced by ruff rules UP006, UP007, and UP045.

  • ALWAYS add type annotations to all functions, methods, and class attributes (including tests)

  • Use primitive types when possible: list not List, dict not Dict, set not Set, tuple not Tuple (enforced by UP006)

  • Use modern union syntax with | for optional and union types:

    • str | None not Optional[str] (enforced by UP045)
    • int | str not Union[int, str] (enforced by UP007)
  • Only import from typing when absolutely necessary for complex generic types

  • For Pydantic models, use field-level type annotations

    # Good
    def process_items(items: list[str], max_count: int | None = None) -> dict[str, int]:
        return {item: len(item) for item in items}
    
    # Avoid - missing type annotations
    def process_items(items, max_count=None):
        return {item: len(item) for item in items}
    
    # Avoid - old-style typing
    from typing import List, Dict, Optional
    def process_items(items: List[str], max_count: Optional[int] = None) -> Dict[str, int]:
        return {item: len(item) for item in items}

Import Style

  • ALWAYS include from __future__ import annotations at the top of every Python source file (after the license header) for deferred type evaluation

  • ALWAYS use absolute imports, never relative imports (enforced by TID)

  • Place imports at module level, not inside functions (exception: when a function-level import is unavoidable for performance reasons)

  • Import sorting is handled by ruff's isort - imports should be grouped and sorted:

    1. Standard library imports
    2. Third-party imports (use lazy_heavy_imports for heavy libraries)
    3. First-party imports (data_designer)
  • Use standard import conventions (enforced by ICN)

  • See Lazy Loading and TYPE_CHECKING section for optimization guidelines

    # Good
    from data_designer.config.config_builder import DataDesignerConfigBuilder
    
    # Bad - relative import (will cause linter errors)
    from .config_builder import DataDesignerConfigBuilder
    
    # Good - imports at module level
    from pathlib import Path
    
    def process_file(filename: str) -> None:
        path = Path(filename)
    
    # Bad - import inside function
    def process_file(filename: str) -> None:
        from pathlib import Path
        path = Path(filename)

Lazy Loading and TYPE_CHECKING

This project uses lazy loading for heavy third-party dependencies to optimize import performance.

When to Use Lazy Loading

Heavy third-party libraries (>100ms import cost) should be lazy-loaded via lazy_heavy_imports.py:

# ❌ Don't import directly
import pandas as pd
import numpy as np

# ✅ Use lazy loading with IDE support
from typing import TYPE_CHECKING
from data_designer.lazy_heavy_imports import pd, np

if TYPE_CHECKING:
    import pandas as pd  # For IDE autocomplete and type hints
    import numpy as np

This pattern provides:

  • Runtime lazy loading (fast startup)
  • Full IDE support (autocomplete, type hints)
  • Type checker validation

See lazy_heavy_imports.py for the current list of lazy-loaded libraries.

Adding New Heavy Dependencies

If you add a new dependency with significant import cost (>100ms):

  1. Add to lazy_heavy_imports.py:

    _LAZY_IMPORTS = {
        # ... existing entries ...
        "your_lib": "your_library_name",
    }
  2. Update imports across codebase:

    from typing import TYPE_CHECKING
    from data_designer.lazy_heavy_imports import your_lib
    
    if TYPE_CHECKING:
        import your_library_name as your_lib  # For IDE support
  3. Verify with performance test:

    make perf-import CLEAN=1

Using TYPE_CHECKING Blocks

TYPE_CHECKING blocks defer imports that are only needed for type hints, preventing circular dependencies and reducing import time.

For internal data_designer imports:

from __future__ import annotations

from typing import TYPE_CHECKING

# Runtime imports
from pathlib import Path
from data_designer.config.base import ConfigBase

if TYPE_CHECKING:
    # Type-only imports - only visible to type checkers
    from data_designer.engine.models.facade import ModelFacade

def get_model(model: ModelFacade) -> str:
    return model.name

For lazy-loaded libraries (see pattern in "When to Use Lazy Loading" above):

  • Import from lazy_heavy_imports for runtime
  • Add full import in TYPE_CHECKING block for IDE support

Rules for TYPE_CHECKING:

DO put in TYPE_CHECKING:

  • Internal data_designer imports used only in type hints
  • Imports that would cause circular dependencies
  • Full imports of lazy-loaded libraries for IDE support (e.g., import pandas as pd in addition to runtime from data_designer.lazy_heavy_imports import pd)

DON'T put in TYPE_CHECKING:

  • Standard library imports (Path, Any, Callable, Literal, TypeAlias, etc.)
  • Pydantic model types used in field definitions (needed at runtime for validation)
  • Types used in discriminated unions (Pydantic needs them at runtime)
  • Any import used at runtime (instantiation, method calls, base classes, etc.)

Examples:

# ✅ CORRECT - Lazy-loaded library with IDE support
from typing import TYPE_CHECKING
from data_designer.lazy_heavy_imports import pd

if TYPE_CHECKING:
    import pandas as pd  # IDE gets full type hints

def load_data(path: str) -> pd.DataFrame:  # IDE understands pd.DataFrame
    return pd.read_csv(path)

# ✅ CORRECT - Standard library NOT in TYPE_CHECKING
from pathlib import Path
from typing import Any

def process_file(path: Path) -> Any:
    return path.read_text()

# ✅ CORRECT - Internal type-only import
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from data_designer.engine.models.facade import ModelFacade

def get_model(model: ModelFacade) -> str:  # Only used in type hint
    return model.name

# ❌ INCORRECT - Pydantic field type in TYPE_CHECKING
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from data_designer.config.models import ModelConfig  # Wrong!

class MyConfig(BaseModel):
    model: ModelConfig  # Pydantic needs this at runtime!

# ✅ CORRECT - Pydantic field type at runtime
from data_designer.config.models import ModelConfig

class MyConfig(BaseModel):
    model: ModelConfig

Naming Conventions (PEP 8)

Follow PEP 8 naming conventions:

  • Functions and variables: snake_case

  • Classes: PascalCase

  • Constants: UPPER_SNAKE_CASE

  • Private attributes: prefix with single underscore _private_var

  • Function and method names must start with an action verb: e.g. get_value_from not value_from, coerce_to_int not to_int, extract_usage not usage

    # Good
    class DatasetGenerator:
        MAX_RETRIES = 3
    
        def __init__(self) -> None:
            self._cache: dict[str, str] = {}
    
        def generate_dataset(self, config: dict[str, str]) -> pd.DataFrame:
            pass
    
    # Bad
    class dataset_generator:  # Should be PascalCase
        maxRetries = 3        # Should be UPPER_SNAKE_CASE
    
        def GenerateDataset(self, Config):  # Should be snake_case
            pass

Code Organization

  • Public before private: Public functions/methods appear before private ones in modules and classes
  • Class method order: __init__ and other dunder methods first, then properties, then public methods, then private helpers. Group related method types together (e.g., all @staticmethods in one block, all @classmethods in one block).
  • Prefer public over private for testability: Use public functions (no _ prefix) for helpers that benefit from direct testing
  • Section comments in larger modules: Use # --- separators to delineate logical groups (e.g. image parsing, usage extraction, generic accessors)
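As a sketch of that ordering (class and method names are illustrative):

```python
class UsageTracker:
    def __init__(self) -> None:  # dunder methods first
        self._counts: dict[str, int] = {}

    @property
    def total(self) -> int:  # then properties
        return sum(self._counts.values())

    def record_usage(self, key: str) -> None:  # then public methods
        self._counts[key] = self._counts.get(key, 0) + 1

    def _reset_counts(self) -> None:  # private helpers last
        self._counts.clear()
```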

Design Principles

DRY

  • Extract shared logic into pure helper functions rather than duplicating across similar call sites
  • Rule of thumb: tolerate duplication until the third occurrence, then extract

KISS

  • Prefer flat, obvious code over clever abstractions — two similar lines are better than a premature helper
  • When in doubt between DRY and KISS, favor readability over deduplication

YAGNI

  • Don't add parameters, config, or abstraction layers for hypothetical future use cases
  • Don't generalize until the third caller appears

SOLID

  • Wrap third-party exceptions at module boundaries — callers depend on canonical error types, not leaked internals
  • Use Protocol for contracts between layers
  • One function, one job — separate logic from I/O
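The first two points can be sketched together (the names here are hypothetical, not part of the DataDesigner API):

```python
from typing import Protocol


class GenerationError(Exception):
    """Canonical error type exposed at the module boundary."""


class ColumnGenerator(Protocol):
    """Contract between layers: any generator only needs this method."""

    def generate_values(self, num_records: int) -> list[str]: ...


def run_generator(generator: ColumnGenerator, num_records: int) -> list[str]:
    try:
        return generator.generate_values(num_records)
    except Exception as exc:  # wrap leaked internals in the canonical type
        raise GenerationError(f"generation failed: {exc}") from exc
```

Callers catch GenerationError regardless of which concrete generator (or third-party library) failed underneath.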

Common Pitfalls to Avoid

  1. Mutable default arguments:

    # Bad - mutable default argument
    def add_item(item: str, items: list[str] = []) -> list[str]:
        items.append(item)
        return items
    
    # Good
    def add_item(item: str, items: list[str] | None = None) -> list[str]:
        if items is None:
            items = []
        items.append(item)
        return items
  2. Unused imports and variables:

    # Bad - unused import
    from pathlib import Path
    from typing import Any  # Not used
    
    def process() -> None:
        pass
    
    # Good - only import what you use
    from pathlib import Path
    
    def process() -> None:
        pass
  3. Simplify code where possible (enforced by SIM):

    # Bad
    if condition:
        return True
    else:
        return False
    
    # Good
    return condition
    
    # Bad
    if key in my_dict:
        value = my_dict[key]
    else:
        value = default
    
    # Good
    value = my_dict.get(key, default)
  4. Use comprehensions properly:

    # Bad
    list([x for x in items])  # Unnecessary list() call
    
    # Good
    [x for x in items]
    
    # Bad
    dict([(k, v) for k, v in items])
    
    # Good
    {k: v for k, v in items}
  5. Proper return statements:

    # Bad - unnecessary else after return
    def get_value(condition: bool) -> str:
        if condition:
            return "yes"
        else:
            return "no"
    
    # Good
    def get_value(condition: bool) -> str:
        if condition:
            return "yes"
        return "no"

Active Linter Rules

The following ruff linter rules are currently enabled (see pyproject.toml):

  • W: pycodestyle warnings
  • F: pyflakes (unused imports, undefined names)
  • I: isort (import sorting)
  • ICN: flake8-import-conventions (standard import names)
  • PIE: flake8-pie (miscellaneous lints)
  • TID: flake8-tidy-imports (bans relative imports)
  • UP006: List[A] -> list[A]
  • UP007: Union[A, B] -> A | B
  • UP045: Optional[A] -> A | None

Note: Additional rules (E, N, ANN, B, C4, DTZ, RET, SIM, PTH) are commented out but may be enabled in the future. Write code that would pass these checks for future-proofing.

Testing Patterns

The project uses pytest with the following patterns:

  • Fixtures: Shared test data and configurations in tests/conftest.py
  • Stub configs: YAML-based configuration stubs for testing (see stub_data_designer_config_str fixture)
  • Mocking: Use unittest.mock.patch for external services and dependencies
  • Async support: pytest-asyncio for async tests (asyncio_default_fixture_loop_scope = "session")
  • HTTP mocking: pytest-httpx for mocking HTTP requests
  • Coverage: Track test coverage with pytest-cov

Test Guidelines

  • Parametrize over duplicate: Use @pytest.mark.parametrize instead of writing multiple test functions for variations of the same behavior
  • Minimal fixtures: Fixtures should be simple — one fixture, one responsibility, just setup with no behavior logic
  • Shared fixtures in conftest.py: Place fixtures shared across a test directory in conftest.py
  • Mock at boundaries: Mock external dependencies (APIs, databases, third-party services), not internal functions
  • Test behavior, not implementation: Assert on outputs and side effects, not internal call counts (unless verifying routing)
  • Keep mocking shallow: If a test requires deeply nested mocking, the code under test may need refactoring

Example test structure:

from typing import Any

from data_designer.config.config_builder import DataDesignerConfigBuilder

def test_something(stub_model_configs: dict[str, Any]) -> None:
    """Test description."""
    builder = DataDesignerConfigBuilder(model_configs=stub_model_configs)
    # ... test implementation
    assert expected == actual

Column Configuration Types

When working with column configurations, understand these key types:

  • SamplerColumnConfig: Built-in samplers (UUID, Category, Uniform, Gaussian, Person, DateTime, etc.)
  • LLMTextColumnConfig: LLM text generation with Jinja2 templating
  • LLMCodeColumnConfig: Code generation with language specification
  • LLMStructuredColumnConfig: Structured JSON generation with schema
  • LLMJudgeColumnConfig: Judge/scoring columns for quality assessment
  • ExpressionColumnConfig: Expression-based derived columns (Python eval or Jinja2)
  • ValidationColumnConfig: Validation results (Python, SQL, Code, Remote validators)
  • SeedDatasetColumnConfig: Data from seed datasets
  • EmbeddingColumnConfig: Embedding generation for text columns using a specified model
  • CustomColumnConfig: Custom user-defined column generators via @custom_column_generator decorator

See packages/data-designer-config/src/data_designer/config/column_configs.py for detailed schemas.
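For instance, combining a sampler column with a derived column might look like this. The SamplerColumnConfig usage mirrors the import-pattern example above; the ExpressionColumnConfig field names are assumptions, so verify them against column_configs.py:

```python
import data_designer.config as dd

config_builder = dd.DataDesignerConfigBuilder()
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=["A", "B"]),
    )
)
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="label",
        expr="Category: {{ category }}",  # field name assumed; check the actual schema
    )
)
```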

Model Configuration

Models are configured via ModelConfig with:

  • alias: User-defined alias for the model
  • model: Model ID (e.g., from build.nvidia.com)
  • inference_parameters: Temperature, top_p, max_tokens (can be distribution-based)
  • system_prompt: Optional system prompt
  • image_modality: Support for image inputs

See packages/data-designer-config/src/data_designer/config/models.py for details.
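A sketch of what that configuration looks like. The top-level field names come from the list above, but the inference-parameters class name is an assumption, so check models.py before relying on it:

```python
import data_designer.config as dd

model_config = dd.ModelConfig(
    alias="reasoning-model",                         # columns reference the model by alias
    model="nvidia/llama-3.1-nemotron-70b-instruct",  # example model ID from build.nvidia.com
    inference_parameters=dd.InferenceParameters(     # class name assumed
        temperature=0.7,
        top_p=0.9,
        max_tokens=1024,
    ),
    system_prompt="You are a helpful assistant.",
)
```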

Registry System

The project uses a registry pattern for extensibility: column generators, validators, and profilers are all registered as plugins.

When adding a new generator or validator, register it in the appropriate registry.

Pre-commit Hooks

The project uses pre-commit hooks to enforce code quality. Install them with:

uv run pre-commit install

Hooks include:

  • Trailing whitespace removal
  • End-of-file fixer
  • YAML/JSON/TOML validation
  • Merge conflict detection
  • Debug statement detection
  • Ruff linting and formatting

Common Development Tasks

# Clean up generated files
make clean

# Update license headers
make update-license-headers

# Run all checks before committing
make check-all-fix
make test

# Generate coverage report
make coverage
# View htmlcov/index.html in browser

# Profile import performance (use after adding heavy dependencies)
make perf-import            # Profile import time
make perf-import CLEAN=1    # Clean cache first, then profile

Additional Resources

  • README.md: Installation and basic usage examples
  • packages/data-designer-config/src/data_designer/config/: Configuration API documentation
  • tests/: Comprehensive test suite with usage examples