From 3aac4f1e7485b152080106ef795842db2f66f1f8 Mon Sep 17 00:00:00 2001
From: Johnny Greco <jogreco@nvidia.com>
Date: Tue, 17 Mar 2026 15:08:05 -0700
Subject: [PATCH 1/3] chore: strip down AGENTS.md to match CLAUDE.md (#430)

Remove verbose development instructions, code style guidelines, and
testing patterns that duplicate CLAUDE.md. Keep only the concise
sections: identity, layering, core concepts, design principles,
structural invariants, and development commands.
---
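Reviewer note (below the `---`, so not part of the commit message): the new
"Core Concepts" text says each column declares its dependencies and the DAG of
those declarations determines execution order. The sketch below illustrates
that idea with the standard library's `graphlib`. The column names and the
`resolve_order` helper are hypothetical and are not Data Designer's API or its
compiler.

```python
from graphlib import CycleError, TopologicalSorter


def resolve_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return column names ordered so that every dependency comes first."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"circular column dependencies: {exc.args[1]}") from exc


# Hypothetical columns: each key maps to the columns it depends on.
columns = {
    "review": {"product", "rating"},
    "product": {"category"},
    "rating": set(),
    "category": set(),
}

order = resolve_order(columns)
print(order)
```

The caller only states what each column needs; the ordering falls out of the
graph, which is the "declare, don't orchestrate" contract in miniature.
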
 AGENTS.md | 644 +++---------------------------------------------------
 1 file changed, 33 insertions(+), 611 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index 7d46ef76..800883cb 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,626 +1,48 @@
-# AGENTS.md
+## Identity
 
-This file provides guidance to agents when working with code in this repository.
+You are developing **Data Designer** — a framework for creating synthetic datasets from scratch. Users declare *what* data they want (columns, types, relationships, quality criteria) and the system figures out *how* to generate it.
 
-## Project Overview
+The core contract is **declare, don't orchestrate**. General users should never have to think about execution order, batching, or model plumbing. Every change you make should preserve or strengthen that contract. If a change forces users to reason about runtime mechanics, it's moving in the wrong direction.
 
-**DataDesigner** is an NVIDIA NeMo project for creating synthetic datasets from scratch. It's a comprehensive framework that generates structured data using multiple generation strategies:
+## The Layering Is Structural
 
-- **Sampled data**: Built-in generators (UUID, DateTime, etc.) and Faker integration
-- **LLM-generated content**: Text, code, and structured data via LiteLLM
-- **Expression-based columns**: Derived columns using Jinja2 templates
-- **Validation & scoring**: Python, SQL, and remote validators; LLM-based judge scoring
-- **Seed dataset-based generation**: Generate from existing datasets
+Three packages share the `data_designer` namespace via PEP 420 implicit namespace packages (no `__init__.py` at the `data_designer/` level — this is intentional): **config** (schema) → **engine** (execution) → **interface** (entry-point).
 
-### Architecture
+- **Config** (`data-designer-config`): Schema layer. Pydantic models for columns, pipelines, and model routing. Builder API for config construction. Pure data — no I/O.
+- **Engine** (`data-designer-engine`): Execution layer. Compiler resolves configs into a DAG; runtime executes it via LLM calls, batching, and task scheduling. Registries map config types to implementations.
+- **Interface** (`data-designer`): Entry-point layer. `DataDesigner` class, CLI, and result types. Orchestrates engine on behalf of users.
 
-The project follows a layered architecture:
+**Import direction is one-way and absolute.** Config must never import engine. Engine must never import interface. Violations create circular dependencies and break the namespace package structure.
 
-1. **Config Layer** ([packages/data-designer-config/src/data_designer/config/](packages/data-designer-config/src/data_designer/config/)): User-facing configuration API
-   - `config_builder.py`: Main builder API for constructing configurations
-   - `column_configs.py`: Column configuration types (Sampler, LLMText, LLMCode, LLMStructured, LLMJudge, Expression, Validation, SeedDataset)
-   - `models.py`: Model configurations and inference parameters
-   - `sampler_params.py`: Parametrized samplers (Uniform, Category, Person, DateTime, etc.)
+## Core Concepts
 
-2. **Engine Layer** ([packages/data-designer-engine/src/data_designer/engine/](packages/data-designer-engine/src/data_designer/engine/)): Internal generation and processing
-   - `column_generators/`: Generates individual columns from configs
-   - `dataset_builders/`: Orchestrates full dataset generation with DAG-based dependency management
-   - `models/`: LLM integration via LiteLLM with response parsing
-   - `validators/`: Column validation (Python, SQL, Code, Remote)
-   - `sampling_gen/`: Sophisticated person/entity sampling
+- **Columns** — the primary abstraction. Each column declares its dependencies and extra outputs. The DAG of these declarations determines execution order automatically — users never specify ordering.
+- **Samplers** — statistical distributions, category sets, and persona generators used by sampler columns to produce values without model calls.
+- **Seed datasets** — existing data that bootstraps generation, providing real-world context for downstream columns.
+- **Processors** — batch-level transformations (schema reshaping, column dropping) that run outside the column DAG.
+- **Models** — model configurations are routed per-column via `model_alias`. The framework is model-agnostic and provider-agnostic.
+- **Plugins** — extend any of the above via `setuptools` entry points. Define a config, implement the behavior, register both.
 
-3. **Interface Layer** ([packages/data-designer/src/data_designer/interface/](packages/data-designer/src/data_designer/interface/)): Public API
-   - `data_designer.py`: Main `DataDesigner` class (primary entry point)
-   - `results.py`: Result containers
-   - `errors.py`: Public error types
+## Core Design Principles
 
-### Recommended Import Pattern
+- **Config is declarative, engine is imperative.** Users build configs that describe intent; the engine is where I/O, LLM calls, and state mutation happen. The **compiler** bridges the two: it validates the config, resolves the full DAG, and produces the runtime plan the engine executes.
+- **Registries connect types to behavior.** `TaskRegistry` maps config type → implementation class. Plugins extend this at runtime.
+- **Errors normalize at boundaries.** Each layer defines canonical error types. Callers depend on these public types, never on vendor-specific or internal exceptions.
 
-```python
-import data_designer.config as dd
-from data_designer.interface import DataDesigner
+## Structural Invariants
 
-# Usage:
-data_designer = DataDesigner()
-config_builder = dd.DataDesignerConfigBuilder()
-config_builder.add_column(
-    dd.SamplerColumnConfig(
-        name="category",
-        sampler_type=dd.SamplerType.CATEGORY,
-        params=dd.CategorySamplerParams(values=["A", "B"]),
-    )
-)
-```
+- **Imports flow downstream only** — config must not import from engine or interface; engine must not import from interface
+- **Fast imports** — Avoid new import-time side effects. Prefer lazy initialization and defer heavy work until use-time
+- **No relative imports** — absolute only
+- **No untyped code** — all code carries type annotations
+- **Follow established patterns** — before introducing new conventions, study surrounding code and match existing style, structure, and idioms
+- **No untested code paths** — new and changed logic must have associated unit tests
 
-### Key Design Patterns
+## Development
 
-- **Builder pattern**: Configuration construction via `DataDesignerConfigBuilder`
-- **Registry pattern**: Plugin system for column generators, validators, and profilers
-- **Strategy pattern**: Multiple generation approaches (sampled, LLM, expression, seed)
-- **DAG-based execution**: Column dependencies managed as directed acyclic graph
+Use `make` targets — not raw tool commands — for all standard workflows:
 
-## Development Workflow
-
-This project uses `uv` for dependency management and `make` for common tasks:
-
-```bash
-# Install dependencies
-uv sync
-
-# Install with dev dependencies
-uv sync --all-extras
-
-# Run the main module (if applicable)
-uv run python -m data_designer
-```
-
-### Code Quality
-
-```bash
-# Using Make (recommended)
-make lint           # Run ruff linter
-make lint-fix       # Fix linting issues automatically
-make format         # Format code with ruff
-make format-check   # Check code formatting without changes
-make check-all      # Run all checks (format-check + lint)
-make check-all-fix  # Run all checks with autofix (format + lint-fix)
-
-# Direct commands
-uv run ruff check                # Lint all files
-uv run ruff check --fix          # Lint with autofix
-uv run ruff format               # Format all files
-uv run ruff format --check       # Check formatting
-```
-
-### Running Tests
-
-```bash
-# Run all tests
-uv run pytest
-
-# Run tests with verbose output
-uv run pytest -v
-
-# Run a specific test file
-uv run pytest tests/config/test_sampler_constraints.py
-
-# Run tests with coverage
-uv run pytest --cov=data_designer --cov-report=term-missing --cov-report=html
-
-# Using Make
-make test           # Run all tests
-make coverage       # Run tests with coverage report
-```
-
-## Key Files
-
-- [packages/data-designer/src/data_designer/interface/data_designer.py](packages/data-designer/src/data_designer/interface/data_designer.py) - Main entry point (`DataDesigner` class)
-- [packages/data-designer-config/src/data_designer/config/config_builder.py](packages/data-designer-config/src/data_designer/config/config_builder.py) - Configuration API (`DataDesignerConfigBuilder`)
-- [packages/data-designer-config/src/data_designer/config/__init__.py](packages/data-designer-config/src/data_designer/config/__init__.py) - User-facing config API exports
-- [packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py](packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py) - Generation orchestrator
-- [pyproject.toml](pyproject.toml) - Project dependencies and tool configurations
-- [Makefile](Makefile) - Common development commands
-
-## Working Guidelines
-
-- **Comments**: Only insert comments when code is especially important to understand. For basic code blocks, comments aren't necessary. We want readable code without vacuous comments.
-- **License headers**: All Python files must include the NVIDIA SPDX license header:
-  ```python
-  # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-  # SPDX-License-Identifier: Apache-2.0
-  ```
-  Use `make update-license-headers` to add headers to all files automatically.
-- **`from __future__ import annotations`**: Include at the top of all Python source files for deferred type evaluation.
-- **Imports**: Avoid importing Python modules inside method definitions. Prefer module-level imports for better performance and clarity. See [Code Style](#code-style) for detailed import and type annotation guidelines.
-
-## Code Style
-
-This project uses `ruff` (>=0.14.10) for linting and formatting. Follow these guidelines to avoid linter errors:
-
-### General Formatting
-
-- **Line length**: Maximum 120 characters per line
-- **Quote style**: Always use double quotes (`"`) for strings
-- **Indentation**: Use 4 spaces (never tabs)
-- **Target version**: Python 3.10+
-
-### Type Annotations
-
-Type annotations are REQUIRED for all code in this project. This is strictly enforced for code quality and maintainability. Modern type syntax is enforced by ruff rules `UP006`, `UP007`, and `UP045`.
-
-- **ALWAYS** add type annotations to all functions, methods, and class attributes (including tests)
-- Use primitive types when possible: `list` not `List`, `dict` not `Dict`, `set` not `Set`, `tuple` not `Tuple` (enforced by `UP006`)
-- Use modern union syntax with `|` for optional and union types:
-  - `str | None` not `Optional[str]` (enforced by `UP045`)
-  - `int | str` not `Union[int, str]` (enforced by `UP007`)
-- Only import from `typing` when absolutely necessary for complex generic types
-- For Pydantic models, use field-level type annotations
-
-  ```python
-  # Good
-  def process_items(items: list[str], max_count: int | None = None) -> dict[str, int]:
-      return {item: len(item) for item in items}
-
-  # Avoid - missing type annotations
-  def process_items(items, max_count=None):
-      return {item: len(item) for item in items}
-
-  # Avoid - old-style typing
-  from typing import List, Dict, Optional
-  def process_items(items: List[str], max_count: Optional[int] = None) -> Dict[str, int]:
-      return {item: len(item) for item in items}
-  ```
-
-### Import Style
-
-- **ALWAYS** include `from __future__ import annotations` at the top of every Python source file (after the license header) for deferred type evaluation
-- **ALWAYS** use absolute imports, never relative imports (enforced by `TID`)
-- Place imports at module level, not inside functions (exception: it is unavoidable for performance reasons)
-- Import sorting is handled by `ruff`'s `isort` - imports should be grouped and sorted:
-  1. Standard library imports
-  2. Third-party imports (use `lazy_heavy_imports` for heavy libraries)
-  3. First-party imports (`data_designer`)
-- Use standard import conventions (enforced by `ICN`)
-- See [Lazy Loading and TYPE_CHECKING](#lazy-loading-and-type_checking) section for optimization guidelines
-
-  ```python
-  # Good
-  from data_designer.config.config_builder import DataDesignerConfigBuilder
-
-  # Bad - relative import (will cause linter errors)
-  from .config_builder import DataDesignerConfigBuilder
-
-  # Good - imports at module level
-  from pathlib import Path
-
-  def process_file(filename: str) -> None:
-      path = Path(filename)
-
-  # Bad - import inside function
-  def process_file(filename: str) -> None:
-      from pathlib import Path
-      path = Path(filename)
-  ```
-
-### Lazy Loading and TYPE_CHECKING
-
-This project uses lazy loading for heavy third-party dependencies to optimize import performance.
-
-#### When to Use Lazy Loading
-
-**Heavy third-party libraries** (>100ms import cost) should be lazy-loaded via `lazy_heavy_imports.py`:
-
-```python
-# ❌ Don't import directly
-import pandas as pd
-import numpy as np
-
-# ✅ Use lazy loading with IDE support
-from typing import TYPE_CHECKING
-from data_designer.lazy_heavy_imports import pd, np
-
-if TYPE_CHECKING:
-    import pandas as pd  # For IDE autocomplete and type hints
-    import numpy as np
-```
-
-This pattern provides:
-- Runtime lazy loading (fast startup)
-- Full IDE support (autocomplete, type hints)
-- Type checker validation
-
-**See [lazy_heavy_imports.py](packages/data-designer-config/src/data_designer/lazy_heavy_imports.py) for the current list of lazy-loaded libraries.**
-
-#### Adding New Heavy Dependencies
-
-If you add a new dependency with significant import cost (>100ms):
-
-1. **Add to `lazy_heavy_imports.py`:**
-   ```python
-   _LAZY_IMPORTS = {
-       # ... existing entries ...
-       "your_lib": "your_library_name",
-   }
-   ```
-
-2. **Update imports across codebase:**
-   ```python
-   from typing import TYPE_CHECKING
-   from data_designer.lazy_heavy_imports import your_lib
-
-   if TYPE_CHECKING:
-       import your_library_name as your_lib  # For IDE support
-   ```
-
-3. **Verify with performance test:**
-   ```bash
-   make perf-import CLEAN=1
-   ```
-
-#### Using TYPE_CHECKING Blocks
-
-`TYPE_CHECKING` blocks defer imports that are only needed for type hints, preventing circular dependencies and reducing import time.
-
-**For internal data_designer imports:**
-
-```python
-from __future__ import annotations
-
-from typing import TYPE_CHECKING
-
-# Runtime imports
-from pathlib import Path
-from data_designer.config.base import ConfigBase
-
-if TYPE_CHECKING:
-    # Type-only imports - only visible to type checkers
-    from data_designer.engine.models.facade import ModelFacade
-
-def get_model(model: ModelFacade) -> str:
-    return model.name
-```
-
-**For lazy-loaded libraries (see pattern in "When to Use Lazy Loading" above):**
-- Import from `lazy_heavy_imports` for runtime
-- Add full import in `TYPE_CHECKING` block for IDE support
-
-**Rules for TYPE_CHECKING:**
-
-✅ **DO put in TYPE_CHECKING:**
-- Internal `data_designer` imports used **only** in type hints
-- Imports that would cause circular dependencies
-- **Full imports of lazy-loaded libraries for IDE support** (e.g., `import pandas as pd` in addition to runtime `from data_designer.lazy_heavy_imports import pd`)
-
-❌ **DON'T put in TYPE_CHECKING:**
-- **Standard library imports** (`Path`, `Any`, `Callable`, `Literal`, `TypeAlias`, etc.)
-- **Pydantic model types** used in field definitions (needed at runtime for validation)
-- **Types used in discriminated unions** (Pydantic needs them at runtime)
-- **Any import used at runtime** (instantiation, method calls, base classes, etc.)
-
-**Examples:**
-
-```python
-# ✅ CORRECT - Lazy-loaded library with IDE support
-from typing import TYPE_CHECKING
-from data_designer.lazy_heavy_imports import pd
-
-if TYPE_CHECKING:
-    import pandas as pd  # IDE gets full type hints
-
-def load_data(path: str) -> pd.DataFrame:  # IDE understands pd.DataFrame
-    return pd.read_csv(path)
-
-# ✅ CORRECT - Standard library NOT in TYPE_CHECKING
-from pathlib import Path
-from typing import Any
-
-def process_file(path: Path) -> Any:
-    return path.read_text()
-
-# ✅ CORRECT - Internal type-only import
-from typing import TYPE_CHECKING
-
-if TYPE_CHECKING:
-    from data_designer.engine.models.facade import ModelFacade
-
-def get_model(model: ModelFacade) -> str:  # Only used in type hint
-    return model.name
-
-# ❌ INCORRECT - Pydantic field type in TYPE_CHECKING
-from typing import TYPE_CHECKING
-
-if TYPE_CHECKING:
-    from data_designer.config.models import ModelConfig  # Wrong!
-
-class MyConfig(BaseModel):
-    model: ModelConfig  # Pydantic needs this at runtime!
-
-# ✅ CORRECT - Pydantic field type at runtime
-from data_designer.config.models import ModelConfig
-
-class MyConfig(BaseModel):
-    model: ModelConfig
-```
-
-### Naming Conventions (PEP 8)
-
-Follow PEP 8 naming conventions:
-
-- **Functions and variables**: `snake_case`
-- **Classes**: `PascalCase`
-- **Constants**: `UPPER_SNAKE_CASE`
-- **Private attributes**: prefix with single underscore `_private_var`
-- **Function and method names must start with an action verb**: e.g. `get_value_from` not `value_from`, `coerce_to_int` not `to_int`, `extract_usage` not `usage`
-
-  ```python
-  # Good
-  class DatasetGenerator:
-      MAX_RETRIES = 3
-
-      def __init__(self) -> None:
-          self._cache: dict[str, str] = {}
-
-      def generate_dataset(self, config: dict[str, str]) -> pd.DataFrame:
-          pass
-
-  # Bad
-  class dataset_generator:  # Should be PascalCase
-      maxRetries = 3        # Should be UPPER_SNAKE_CASE
-
-      def GenerateDataset(self, Config):  # Should be snake_case
-          pass
-  ```
-
-### Code Organization
-
-- **Public before private**: Public functions/methods appear before private ones in modules and classes
-- **Class method order**: `__init__` and other dunder methods first, then properties, then public methods, then private helpers. Group related method types together (e.g., all `@staticmethod`s in one block, all `@classmethod`s in one block).
-- **Prefer public over private for testability**: Use public functions (no `_` prefix) for helpers that benefit from direct testing
-- **Section comments in larger modules**: Use `# ---` separators to delineate logical groups (e.g. image parsing, usage extraction, generic accessors)
-
-### Design Principles
-
-**DRY**
-- Extract shared logic into pure helper functions rather than duplicating across similar call sites
-- Rule of thumb: tolerate duplication until the third occurrence, then extract
-
-**KISS**
-- Prefer flat, obvious code over clever abstractions — two similar lines is better than a premature helper
-- When in doubt between DRY and KISS, favor readability over deduplication
-
-**YAGNI**
-- Don't add parameters, config, or abstraction layers for hypothetical future use cases
-- Don't generalize until the third caller appears
-
-**SOLID**
-- Wrap third-party exceptions at module boundaries — callers depend on canonical error types, not leaked internals
-- Use `Protocol` for contracts between layers
-- One function, one job — separate logic from I/O
-
-### Common Pitfalls to Avoid
-
-1. **Mutable default arguments**:
-
-   ```python
-   # Bad - mutable default argument
-   def add_item(item: str, items: list[str] = []) -> list[str]:
-       items.append(item)
-       return items
-
-   # Good
-   def add_item(item: str, items: list[str] | None = None) -> list[str]:
-       if items is None:
-           items = []
-       items.append(item)
-       return items
-   ```
-
-2. **Unused imports and variables**:
-
-   ```python
-   # Bad - unused import
-   from pathlib import Path
-   from typing import Any  # Not used
-
-   def process() -> None:
-       pass
-
-   # Good - only import what you use
-   from pathlib import Path
-
-   def process() -> None:
-       pass
-   ```
-
-3. **Simplify code where possible** (enforced by `SIM`):
-
-   ```python
-   # Bad
-   if condition:
-       return True
-   else:
-       return False
-
-   # Good
-   return condition
-
-   # Bad
-   if key in my_dict:
-       value = my_dict[key]
-   else:
-       value = default
-
-   # Good
-   value = my_dict.get(key, default)
-   ```
-
-4. **Use comprehensions properly**:
-
-   ```python
-   # Bad
-   list([x for x in items])  # Unnecessary list() call
-
-   # Good
-   [x for x in items]
-
-   # Bad
-   dict([(k, v) for k, v in items])
-
-   # Good
-   {k: v for k, v in items}
-   ```
-
-5. **Proper return statements**:
-
-   ```python
-   # Bad - unnecessary else after return
-   def get_value(condition: bool) -> str:
-       if condition:
-           return "yes"
-       else:
-           return "no"
-
-   # Good
-   def get_value(condition: bool) -> str:
-       if condition:
-           return "yes"
-       return "no"
-   ```
-
-### Active Linter Rules
-
-The following ruff linter rules are currently enabled (see [pyproject.toml](pyproject.toml)):
-
-- `W`: pycodestyle warnings
-- `F`: pyflakes (unused imports, undefined names)
-- `I`: isort (import sorting)
-- `ICN`: flake8-import-conventions (standard import names)
-- `PIE`: flake8-pie (miscellaneous lints)
-- `TID`: flake8-tidy-imports (bans relative imports)
-- `UP006`: `List[A]` -> `list[A]`
-- `UP007`: `Union[A, B]` -> `A | B`
-- `UP045`: `Optional[A]` -> `A | None`
-
-**Note**: Additional rules (E, N, ANN, B, C4, DTZ, RET, SIM, PTH) are commented out but may be enabled in the future. Write code that would pass these checks for future-proofing.
-
-## Testing Patterns
-
-The project uses `pytest` with the following patterns:
-
-- **Fixtures**: Shared fixtures are provided via `pytest_plugins` from `data_designer.config.testing.fixtures` and `data_designer.engine.testing.fixtures`, plus local `conftest.py` files in each test directory
-- **Stub configs**: YAML-based configuration stubs for testing (see `stub_data_designer_config_str` fixture)
-- **Mocking**: Use `unittest.mock.patch` for external services and dependencies
-- **Async support**: pytest-asyncio for async tests (`asyncio_default_fixture_loop_scope = "session"`)
-- **HTTP mocking**: pytest-httpx for mocking HTTP requests
-- **Coverage**: Track test coverage with pytest-cov
-
-### Test Guidelines
-
-- **Test public APIs only**: Tests should exercise public interfaces, not `_`-prefixed functions or classes. If something is hard to test without reaching into private internals, consider refactoring the code to expose a public entry point
-- **Type annotations required**: Test functions and fixtures must include type annotations — `-> None` for tests, typed parameters, and typed return values for fixtures
-- **Imports at module level**: Follow the same import rules as production code — keep imports at the top of the file, not inside test functions
-- **Parametrize over duplicate**: Use `@pytest.mark.parametrize` (with `ids=` for readable names) instead of writing multiple test functions for variations of the same behavior
-- **Minimal fixtures**: Fixtures should be simple — one fixture, one responsibility, just setup with no behavior logic
-- **Shared fixtures in `conftest.py`**: Place fixtures shared across a test directory in `conftest.py`
-- **Mock at boundaries**: Mock external dependencies (APIs, databases, third-party services), not internal functions
-- **Test behavior, not implementation**: Assert on outputs and side effects, not internal call counts (unless verifying routing)
-- **Keep mocking shallow**: If a test requires deeply nested mocking, the code under test may need refactoring
-
-Example test structure:
-
-```python
-from typing import Any
-
-from data_designer.config.config_builder import DataDesignerConfigBuilder
-
-
-def test_something(stub_model_configs: dict[str, Any]) -> None:
-    """Test description."""
-    builder = DataDesignerConfigBuilder(model_configs=stub_model_configs)
-    # ... test implementation
-    assert expected == actual
-```
-
-## Column Configuration Types
-
-When working with column configurations, understand these key types:
-
-- **`SamplerColumnConfig`**: Built-in samplers (UUID, Category, Uniform, Gaussian, Person, DateTime, etc.)
-- **`LLMTextColumnConfig`**: LLM text generation with Jinja2 templating
-- **`LLMCodeColumnConfig`**: Code generation with language specification
-- **`LLMStructuredColumnConfig`**: Structured JSON generation with schema
-- **`LLMJudgeColumnConfig`**: Judge/scoring columns for quality assessment
-- **`ExpressionColumnConfig`**: Expression-based derived columns (Python eval or Jinja2)
-- **`ValidationColumnConfig`**: Validation results (Python, SQL, Code, Remote validators)
-- **`SeedDatasetColumnConfig`**: Data from seed datasets
-- **`EmbeddingColumnConfig`**: Embedding generation for text columns using a specified model
-- **`CustomColumnConfig`**: Custom user-defined column generators via `@custom_column_generator` decorator
-
-See [packages/data-designer-config/src/data_designer/config/column_configs.py](packages/data-designer-config/src/data_designer/config/column_configs.py) for detailed schemas.
-
-## Model Configuration
-
-Models are configured via `ModelConfig` with:
-
-- `alias`: User-defined alias for the model
-- `model`: Model ID (e.g., from build.nvidia.com)
-- `inference_parameters`: Temperature, top_p, max_tokens (can be distribution-based)
-- `system_prompt`: Optional system prompt
-- `image_modality`: Support for image inputs
-
-See [packages/data-designer-config/src/data_designer/config/models.py](packages/data-designer-config/src/data_designer/config/models.py) for details.
-
-## Registry System
-
-The project uses a registry pattern for extensibility. Key registries:
-
-- **Column generators**: [packages/data-designer-engine/src/data_designer/engine/column_generators/registry.py](packages/data-designer-engine/src/data_designer/engine/column_generators/registry.py)
-- **Validators**: [packages/data-designer-engine/src/data_designer/engine/validators/](packages/data-designer-engine/src/data_designer/engine/validators/)
-- **Column profilers**: [packages/data-designer-engine/src/data_designer/engine/analysis/column_profilers/registry.py](packages/data-designer-engine/src/data_designer/engine/analysis/column_profilers/registry.py)
-- **Models**: [packages/data-designer-engine/src/data_designer/engine/models/registry.py](packages/data-designer-engine/src/data_designer/engine/models/registry.py)
-
-When adding new generators or validators, register them appropriately.
-
-## Pre-commit Hooks
-
-The project uses pre-commit hooks to enforce code quality. Install them with:
-
-```bash
-uv run pre-commit install
-```
-
-Hooks include:
-- Trailing whitespace removal
-- End-of-file fixer
-- YAML/JSON/TOML validation
-- Merge conflict detection
-- Debug statement detection
-- Ruff linting and formatting
-
-## Common Development Tasks
-
-```bash
-# Clean up generated files
-make clean
-
-# Update license headers
-make update-license-headers
-
-# Run all checks before committing
-make check-all-fix
-make test
-
-# Generate coverage report
-make coverage
-# View htmlcov/index.html in browser
-
-# Profile import performance (use after adding heavy dependencies)
-make perf-import            # Profile import time
-make perf-import CLEAN=1    # Clean cache first, then profile
-```
-
-## Additional Resources
-
-- **README.md**: Installation and basic usage examples
-- **packages/data-designer-config/src/data_designer/config/**: Configuration API documentation
-- **tests/**: Comprehensive test suite with usage examples
+- `make check-all-fix` — format + lint (ruff)
+- `make test` — run all tests
+- `make update-license-headers` — add SPDX license headers (never add them manually)
+- `make perf-import CLEAN=1` — profile import time after dependency changes

From 08ea087f2ca99275ee43e834005220145d0f1ff2 Mon Sep 17 00:00:00 2001
From: Johnny Greco <jogreco@nvidia.com>
Date: Tue, 17 Mar 2026 15:10:25 -0700
Subject: [PATCH 2/3] chore: make CLAUDE.md a symlink to AGENTS.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Single source of truth: CLAUDE.md is now a symlink to AGENTS.md, so both
paths resolve to the same content and there is only one file to maintain.
---
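Reviewer note (not part of the commit message): git stores a symlink as a blob
whose content is the link target, which is why the diff below shows mode
120000 and a single `AGENTS.md` line with no trailing newline. A small sketch
of the same arrangement in a throwaway directory; the file content is a
placeholder, and creating symlinks on Windows may need extra privileges.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "AGENTS.md").write_text("## Identity\n", encoding="utf-8")

    # Relative target, so the link stays valid wherever the checkout lives.
    os.symlink("AGENTS.md", repo / "CLAUDE.md")

    # The stored target is the bare string, with no trailing newline.
    link_target = os.readlink(repo / "CLAUDE.md")

    # Reading through the link returns the content of AGENTS.md.
    same_content = (repo / "CLAUDE.md").read_text(encoding="utf-8") == (
        repo / "AGENTS.md"
    ).read_text(encoding="utf-8")

print(link_target, same_content)
```
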
 CLAUDE.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
 mode change 100644 => 120000 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
deleted file mode 100644
index 68490f0c..00000000
--- a/CLAUDE.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# In ./CLAUDE.md
-
-@AGENTS.md
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 120000
index 00000000..47dc3e3d
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md
\ No newline at end of file

From 391c757ecec7fe39c0a7ea1156892a1688708c03 Mon Sep 17 00:00:00 2001
From: Johnny Greco <jogreco@nvidia.com>
Date: Tue, 17 Mar 2026 15:12:38 -0700
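Subject: [PATCH 3/3] chore: remove duplicate import direction statement

The "Structural Invariants" section already states that imports flow
downstream only, so the bold paragraph under the layering section
repeats it.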
---
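Reviewer note (not part of the commit message): the invariant that remains
after this change is "imports flow downstream only". The sketch below shows one
way such a rule could be checked statically with `ast`. The `FORBIDDEN` table
and `find_violations` helper are hypothetical and not something this repository
is known to ship.

```python
import ast

# Hypothetical rule table: for each layer, the module prefixes it must not import.
FORBIDDEN = {
    "config": ("data_designer.engine", "data_designer.interface"),
    "engine": ("data_designer.interface",),
    "interface": (),
}


def find_violations(layer: str, source: str) -> list[str]:
    """Return imported module names in `source` that `layer` must not import."""
    imported: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            imported.append(node.module)
    banned = FORBIDDEN[layer]
    return [
        name
        for name in imported
        if any(name == prefix or name.startswith(prefix + ".") for prefix in banned)
    ]


bad = find_violations("config", "from data_designer.engine.models.facade import ModelFacade\n")
ok = find_violations("engine", "import data_designer.config as dd\n")
print(bad, ok)
```

This sketch only looks at module names. It misses `from data_designer import
engine`, and it does not treat imports inside `TYPE_CHECKING` blocks
differently from runtime imports.
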
 AGENTS.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index 800883cb..f5e1471b 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -12,8 +12,6 @@ Three packages share the `data_designer` namespace via PEP 420 implicit namespac
 - **Engine** (`data-designer-engine`): Execution layer. Compiler resolves configs into a DAG; runtime executes it via LLM calls, batching, and task scheduling. Registries map config types to implementations.
 - **Interface** (`data-designer`): Entry-point layer. `DataDesigner` class, CLI, and result types. Orchestrates engine on behalf of users.
 
-**Import direction is one-way and absolute.** Config must never import engine. Engine must never import interface. Violations create circular dependencies and break the namespace package structure.
-
 ## Core Concepts
 
 - **Columns** — the primary abstraction. Each column declares its dependencies and extra outputs. The DAG of these declarations determines execution order automatically — users never specify ordering.