From 3aac4f1e7485b152080106ef795842db2f66f1f8 Mon Sep 17 00:00:00 2001 From: Johnny Greco Date: Tue, 17 Mar 2026 15:08:05 -0700 Subject: [PATCH 1/3] chore: strip down AGENTS.md to match CLAUDE.md (#430) Remove verbose development instructions, code style guidelines, and testing patterns that duplicate CLAUDE.md. Keep only the concise sections: identity, layering, core concepts, design principles, structural invariants, and development commands. --- AGENTS.md | 644 +++--------------------------------------------------- 1 file changed, 33 insertions(+), 611 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 7d46ef76..800883cb 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -1,626 +1,48 @@ -# AGENTS.md +## Identity -This file provides guidance to agents when working with code in this repository. +You are developing **Data Designer** — a framework for creating synthetic datasets from scratch. Users declare *what* data they want (columns, types, relationships, quality criteria) and the system figures out *how* to generate it. -## Project Overview +The core contract is **declare, don't orchestrate**. General users should never have to think about execution order, batching, or model plumbing. Every change you make should preserve or strengthen that contract. If a change forces users to reason about runtime mechanics, it's moving in the wrong direction. -**DataDesigner** is an NVIDIA NeMo project for creating synthetic datasets from scratch. It's a comprehensive framework that generates structured data using multiple generation strategies: +## The Layering Is Structural -- **Sampled data**: Built-in generators (UUID, DateTime, etc.) 
and Faker integration -- **LLM-generated content**: Text, code, and structured data via LiteLLM -- **Expression-based columns**: Derived columns using Jinja2 templates -- **Validation & scoring**: Python, SQL, and remote validators; LLM-based judge scoring -- **Seed dataset-based generation**: Generate from existing datasets +Three packages share the `data_designer` namespace via PEP 420 implicit namespace packages (no `__init__.py` at the `data_designer/` level — this is intentional): **config** (schema) → **engine** (execution) → **interface** (entry-point). -### Architecture +- **Config** (`data-designer-config`): Schema layer. Pydantic models for columns, pipelines, and model routing. Builder API for config construction. Pure data — no I/O. +- **Engine** (`data-designer-engine`): Execution layer. Compiler resolves configs into a DAG; runtime executes it via LLM calls, batching, and task scheduling. Registries map config types to implementations. +- **Interface** (`data-designer`): Entry-point layer. `DataDesigner` class, CLI, and result types. Orchestrates engine on behalf of users. -The project follows a layered architecture: +**Import direction is one-way and absolute.** Config must never import engine. Engine must never import interface. Violations create circular dependencies and break the namespace package structure. -1. **Config Layer** ([packages/data-designer-config/src/data_designer/config/](packages/data-designer-config/src/data_designer/config/)): User-facing configuration API - - `config_builder.py`: Main builder API for constructing configurations - - `column_configs.py`: Column configuration types (Sampler, LLMText, LLMCode, LLMStructured, LLMJudge, Expression, Validation, SeedDataset) - - `models.py`: Model configurations and inference parameters - - `sampler_params.py`: Parametrized samplers (Uniform, Category, Person, DateTime, etc.) +## Core Concepts -2. 
**Engine Layer** ([packages/data-designer-engine/src/data_designer/engine/](packages/data-designer-engine/src/data_designer/engine/)): Internal generation and processing - - `column_generators/`: Generates individual columns from configs - - `dataset_builders/`: Orchestrates full dataset generation with DAG-based dependency management - - `models/`: LLM integration via LiteLLM with response parsing - - `validators/`: Column validation (Python, SQL, Code, Remote) - - `sampling_gen/`: Sophisticated person/entity sampling +- **Columns** — the primary abstraction. Each column declares its dependencies and extra outputs. The DAG of these declarations determines execution order automatically — users never specify ordering. +- **Samplers** — statistical distributions, category sets, and persona generators used by sampler columns to produce values without model calls. +- **Seed datasets** — existing data that bootstraps generation, providing real-world context for downstream columns. +- **Processors** — batch-level transformations (schema reshaping, column dropping) that run outside the column DAG. +- **Models** — model configurations are routed per-column via `model_alias`. The framework is model-agnostic and provider-agnostic. +- **Plugins** — extend any of the above via `setuptools` entry points. Define a config, implement the behavior, register both. -3. **Interface Layer** ([packages/data-designer/src/data_designer/interface/](packages/data-designer/src/data_designer/interface/)): Public API - - `data_designer.py`: Main `DataDesigner` class (primary entry point) - - `results.py`: Result containers - - `errors.py`: Public error types +## Core Design Principles -### Recommended Import Pattern +- **Config is declarative, engine is imperative.** Users build configs that describe intent; the engine is where I/O, LLM calls, and state mutation happen. 
The **compiler** bridges the two: it validates the config, resolves the full DAG, and produces the runtime plan the engine executes. +- **Registries connect types to behavior.** `TaskRegistry` maps config type → implementation class. Plugins extend this at runtime. +- **Errors normalize at boundaries.** Each layer defines canonical error types. Callers depend on these public types, never on vendor-specific or internal exceptions. -```python -import data_designer.config as dd -from data_designer.interface import DataDesigner +## Structural Invariants -# Usage: -data_designer = DataDesigner() -config_builder = dd.DataDesignerConfigBuilder() -config_builder.add_column( - dd.SamplerColumnConfig( - name="category", - sampler_type=dd.SamplerType.CATEGORY, - params=dd.CategorySamplerParams(values=["A", "B"]), - ) -) -``` +- **Imports flow downstream only** — config must not import from engine or interface; engine must not import from interface +- **Fast imports** — Avoid new import-time side effects. 
Prefer lazy initialization and defer heavy work until use-time +- **No relative imports** — absolute only +- **No untyped code** — all code carries type annotations +- **Follow established patterns** — before introducing new conventions, study surrounding code and match existing style, structure, and idioms +- **No untested code paths** — new and changed logic must have associated unit tests -### Key Design Patterns +## Development -- **Builder pattern**: Configuration construction via `DataDesignerConfigBuilder` -- **Registry pattern**: Plugin system for column generators, validators, and profilers -- **Strategy pattern**: Multiple generation approaches (sampled, LLM, expression, seed) -- **DAG-based execution**: Column dependencies managed as directed acyclic graph +Use `make` targets — not raw tool commands — for all standard workflows: -## Development Workflow - -This project uses `uv` for dependency management and `make` for common tasks: - -```bash -# Install dependencies -uv sync - -# Install with dev dependencies -uv sync --all-extras - -# Run the main module (if applicable) -uv run python -m data_designer -``` - -### Code Quality - -```bash -# Using Make (recommended) -make lint # Run ruff linter -make lint-fix # Fix linting issues automatically -make format # Format code with ruff -make format-check # Check code formatting without changes -make check-all # Run all checks (format-check + lint) -make check-all-fix # Run all checks with autofix (format + lint-fix) - -# Direct commands -uv run ruff check # Lint all files -uv run ruff check --fix # Lint with autofix -uv run ruff format # Format all files -uv run ruff format --check # Check formatting -``` - -### Running Tests - -```bash -# Run all tests -uv run pytest - -# Run tests with verbose output -uv run pytest -v - -# Run a specific test file -uv run pytest tests/config/test_sampler_constraints.py - -# Run tests with coverage -uv run pytest --cov=data_designer --cov-report=term-missing --cov-report=html 
- -# Using Make -make test # Run all tests -make coverage # Run tests with coverage report -``` - -## Key Files - -- [packages/data-designer/src/data_designer/interface/data_designer.py](packages/data-designer/src/data_designer/interface/data_designer.py) - Main entry point (`DataDesigner` class) -- [packages/data-designer-config/src/data_designer/config/config_builder.py](packages/data-designer-config/src/data_designer/config/config_builder.py) - Configuration API (`DataDesignerConfigBuilder`) -- [packages/data-designer-config/src/data_designer/config/__init__.py](packages/data-designer-config/src/data_designer/config/__init__.py) - User-facing config API exports -- [packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py](packages/data-designer-engine/src/data_designer/engine/dataset_builders/column_wise_builder.py) - Generation orchestrator -- [pyproject.toml](pyproject.toml) - Project dependencies and tool configurations -- [Makefile](Makefile) - Common development commands - -## Working Guidelines - -- **Comments**: Only insert comments when code is especially important to understand. For basic code blocks, comments aren't necessary. We want readable code without vacuous comments. -- **License headers**: All Python files must include the NVIDIA SPDX license header: - ```python - # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. - # SPDX-License-Identifier: Apache-2.0 - ``` - Use `make update-license-headers` to add headers to all files automatically. -- **`from __future__ import annotations`**: Include at the top of all Python source files for deferred type evaluation. -- **Imports**: Avoid importing Python modules inside method definitions. Prefer module-level imports for better performance and clarity. See [Code Style](#code-style) for detailed import and type annotation guidelines. - -## Code Style - -This project uses `ruff` (>=0.14.10) for linting and formatting. 
Follow these guidelines to avoid linter errors: - -### General Formatting - -- **Line length**: Maximum 120 characters per line -- **Quote style**: Always use double quotes (`"`) for strings -- **Indentation**: Use 4 spaces (never tabs) -- **Target version**: Python 3.10+ - -### Type Annotations - -Type annotations are REQUIRED for all code in this project. This is strictly enforced for code quality and maintainability. Modern type syntax is enforced by ruff rules `UP006`, `UP007`, and `UP045`. - -- **ALWAYS** add type annotations to all functions, methods, and class attributes (including tests) -- Use primitive types when possible: `list` not `List`, `dict` not `Dict`, `set` not `Set`, `tuple` not `Tuple` (enforced by `UP006`) -- Use modern union syntax with `|` for optional and union types: - - `str | None` not `Optional[str]` (enforced by `UP045`) - - `int | str` not `Union[int, str]` (enforced by `UP007`) -- Only import from `typing` when absolutely necessary for complex generic types -- For Pydantic models, use field-level type annotations - - ```python - # Good - def process_items(items: list[str], max_count: int | None = None) -> dict[str, int]: - return {item: len(item) for item in items} - - # Avoid - missing type annotations - def process_items(items, max_count=None): - return {item: len(item) for item in items} - - # Avoid - old-style typing - from typing import List, Dict, Optional - def process_items(items: List[str], max_count: Optional[int] = None) -> Dict[str, int]: - return {item: len(item) for item in items} - ``` - -### Import Style - -- **ALWAYS** include `from __future__ import annotations` at the top of every Python source file (after the license header) for deferred type evaluation -- **ALWAYS** use absolute imports, never relative imports (enforced by `TID`) -- Place imports at module level, not inside functions (exception: it is unavoidable for performance reasons) -- Import sorting is handled by `ruff`'s `isort` - imports should be grouped 
and sorted: - 1. Standard library imports - 2. Third-party imports (use `lazy_heavy_imports` for heavy libraries) - 3. First-party imports (`data_designer`) -- Use standard import conventions (enforced by `ICN`) -- See [Lazy Loading and TYPE_CHECKING](#lazy-loading-and-type_checking) section for optimization guidelines - - ```python - # Good - from data_designer.config.config_builder import DataDesignerConfigBuilder - - # Bad - relative import (will cause linter errors) - from .config_builder import DataDesignerConfigBuilder - - # Good - imports at module level - from pathlib import Path - - def process_file(filename: str) -> None: - path = Path(filename) - - # Bad - import inside function - def process_file(filename: str) -> None: - from pathlib import Path - path = Path(filename) - ``` - -### Lazy Loading and TYPE_CHECKING - -This project uses lazy loading for heavy third-party dependencies to optimize import performance. - -#### When to Use Lazy Loading - -**Heavy third-party libraries** (>100ms import cost) should be lazy-loaded via `lazy_heavy_imports.py`: - -```python -# ❌ Don't import directly -import pandas as pd -import numpy as np - -# ✅ Use lazy loading with IDE support -from typing import TYPE_CHECKING -from data_designer.lazy_heavy_imports import pd, np - -if TYPE_CHECKING: - import pandas as pd # For IDE autocomplete and type hints - import numpy as np -``` - -This pattern provides: -- Runtime lazy loading (fast startup) -- Full IDE support (autocomplete, type hints) -- Type checker validation - -**See [lazy_heavy_imports.py](packages/data-designer-config/src/data_designer/lazy_heavy_imports.py) for the current list of lazy-loaded libraries.** - -#### Adding New Heavy Dependencies - -If you add a new dependency with significant import cost (>100ms): - -1. **Add to `lazy_heavy_imports.py`:** - ```python - _LAZY_IMPORTS = { - # ... existing entries ... - "your_lib": "your_library_name", - } - ``` - -2. 
**Update imports across codebase:** - ```python - from typing import TYPE_CHECKING - from data_designer.lazy_heavy_imports import your_lib - - if TYPE_CHECKING: - import your_library_name as your_lib # For IDE support - ``` - -3. **Verify with performance test:** - ```bash - make perf-import CLEAN=1 - ``` - -#### Using TYPE_CHECKING Blocks - -`TYPE_CHECKING` blocks defer imports that are only needed for type hints, preventing circular dependencies and reducing import time. - -**For internal data_designer imports:** - -```python -from __future__ import annotations - -from typing import TYPE_CHECKING - -# Runtime imports -from pathlib import Path -from data_designer.config.base import ConfigBase - -if TYPE_CHECKING: - # Type-only imports - only visible to type checkers - from data_designer.engine.models.facade import ModelFacade - -def get_model(model: ModelFacade) -> str: - return model.name -``` - -**For lazy-loaded libraries (see pattern in "When to Use Lazy Loading" above):** -- Import from `lazy_heavy_imports` for runtime -- Add full import in `TYPE_CHECKING` block for IDE support - -**Rules for TYPE_CHECKING:** - -✅ **DO put in TYPE_CHECKING:** -- Internal `data_designer` imports used **only** in type hints -- Imports that would cause circular dependencies -- **Full imports of lazy-loaded libraries for IDE support** (e.g., `import pandas as pd` in addition to runtime `from data_designer.lazy_heavy_imports import pd`) - -❌ **DON'T put in TYPE_CHECKING:** -- **Standard library imports** (`Path`, `Any`, `Callable`, `Literal`, `TypeAlias`, etc.) -- **Pydantic model types** used in field definitions (needed at runtime for validation) -- **Types used in discriminated unions** (Pydantic needs them at runtime) -- **Any import used at runtime** (instantiation, method calls, base classes, etc.) 
- -**Examples:** - -```python -# ✅ CORRECT - Lazy-loaded library with IDE support -from typing import TYPE_CHECKING -from data_designer.lazy_heavy_imports import pd - -if TYPE_CHECKING: - import pandas as pd # IDE gets full type hints - -def load_data(path: str) -> pd.DataFrame: # IDE understands pd.DataFrame - return pd.read_csv(path) - -# ✅ CORRECT - Standard library NOT in TYPE_CHECKING -from pathlib import Path -from typing import Any - -def process_file(path: Path) -> Any: - return path.read_text() - -# ✅ CORRECT - Internal type-only import -from typing import TYPE_CHECKING - -if TYPE_CHECKING: - from data_designer.engine.models.facade import ModelFacade - -def get_model(model: ModelFacade) -> str: # Only used in type hint - return model.name - -# ❌ INCORRECT - Pydantic field type in TYPE_CHECKING -from typing import TYPE_CHECKING - -if TYPE_CHECKING: - from data_designer.config.models import ModelConfig # Wrong! - -class MyConfig(BaseModel): - model: ModelConfig # Pydantic needs this at runtime! - -# ✅ CORRECT - Pydantic field type at runtime -from data_designer.config.models import ModelConfig - -class MyConfig(BaseModel): - model: ModelConfig -``` - -### Naming Conventions (PEP 8) - -Follow PEP 8 naming conventions: - -- **Functions and variables**: `snake_case` -- **Classes**: `PascalCase` -- **Constants**: `UPPER_SNAKE_CASE` -- **Private attributes**: prefix with single underscore `_private_var` -- **Function and method names must start with an action verb**: e.g. 
`get_value_from` not `value_from`, `coerce_to_int` not `to_int`, `extract_usage` not `usage` - - ```python - # Good - class DatasetGenerator: - MAX_RETRIES = 3 - - def __init__(self) -> None: - self._cache: dict[str, str] = {} - - def generate_dataset(self, config: dict[str, str]) -> pd.DataFrame: - pass - - # Bad - class dataset_generator: # Should be PascalCase - maxRetries = 3 # Should be UPPER_SNAKE_CASE - - def GenerateDataset(self, Config): # Should be snake_case - pass - ``` - -### Code Organization - -- **Public before private**: Public functions/methods appear before private ones in modules and classes -- **Class method order**: `__init__` and other dunder methods first, then properties, then public methods, then private helpers. Group related method types together (e.g., all `@staticmethod`s in one block, all `@classmethod`s in one block). -- **Prefer public over private for testability**: Use public functions (no `_` prefix) for helpers that benefit from direct testing -- **Section comments in larger modules**: Use `# ---` separators to delineate logical groups (e.g. image parsing, usage extraction, generic accessors) - -### Design Principles - -**DRY** -- Extract shared logic into pure helper functions rather than duplicating across similar call sites -- Rule of thumb: tolerate duplication until the third occurrence, then extract - -**KISS** -- Prefer flat, obvious code over clever abstractions — two similar lines is better than a premature helper -- When in doubt between DRY and KISS, favor readability over deduplication - -**YAGNI** -- Don't add parameters, config, or abstraction layers for hypothetical future use cases -- Don't generalize until the third caller appears - -**SOLID** -- Wrap third-party exceptions at module boundaries — callers depend on canonical error types, not leaked internals -- Use `Protocol` for contracts between layers -- One function, one job — separate logic from I/O - -### Common Pitfalls to Avoid - -1. 
**Mutable default arguments**: - - ```python - # Bad - mutable default argument - def add_item(item: str, items: list[str] = []) -> list[str]: - items.append(item) - return items - - # Good - def add_item(item: str, items: list[str] | None = None) -> list[str]: - if items is None: - items = [] - items.append(item) - return items - ``` - -2. **Unused imports and variables**: - - ```python - # Bad - unused import - from pathlib import Path - from typing import Any # Not used - - def process() -> None: - pass - - # Good - only import what you use - from pathlib import Path - - def process() -> None: - pass - ``` - -3. **Simplify code where possible** (enforced by `SIM`): - - ```python - # Bad - if condition: - return True - else: - return False - - # Good - return condition - - # Bad - if key in my_dict: - value = my_dict[key] - else: - value = default - - # Good - value = my_dict.get(key, default) - ``` - -4. **Use comprehensions properly**: - - ```python - # Bad - list([x for x in items]) # Unnecessary list() call - - # Good - [x for x in items] - - # Bad - dict([(k, v) for k, v in items]) - - # Good - {k: v for k, v in items} - ``` - -5. 
**Proper return statements**: - - ```python - # Bad - unnecessary else after return - def get_value(condition: bool) -> str: - if condition: - return "yes" - else: - return "no" - - # Good - def get_value(condition: bool) -> str: - if condition: - return "yes" - return "no" - ``` - -### Active Linter Rules - -The following ruff linter rules are currently enabled (see [pyproject.toml](pyproject.toml)): - -- `W`: pycodestyle warnings -- `F`: pyflakes (unused imports, undefined names) -- `I`: isort (import sorting) -- `ICN`: flake8-import-conventions (standard import names) -- `PIE`: flake8-pie (miscellaneous lints) -- `TID`: flake8-tidy-imports (bans relative imports) -- `UP006`: `List[A]` -> `list[A]` -- `UP007`: `Union[A, B]` -> `A | B` -- `UP045`: `Optional[A]` -> `A | None` - -**Note**: Additional rules (E, N, ANN, B, C4, DTZ, RET, SIM, PTH) are commented out but may be enabled in the future. Write code that would pass these checks for future-proofing. - -## Testing Patterns - -The project uses `pytest` with the following patterns: - -- **Fixtures**: Shared fixtures are provided via `pytest_plugins` from `data_designer.config.testing.fixtures` and `data_designer.engine.testing.fixtures`, plus local `conftest.py` files in each test directory -- **Stub configs**: YAML-based configuration stubs for testing (see `stub_data_designer_config_str` fixture) -- **Mocking**: Use `unittest.mock.patch` for external services and dependencies -- **Async support**: pytest-asyncio for async tests (`asyncio_default_fixture_loop_scope = "session"`) -- **HTTP mocking**: pytest-httpx for mocking HTTP requests -- **Coverage**: Track test coverage with pytest-cov - -### Test Guidelines - -- **Test public APIs only**: Tests should exercise public interfaces, not `_`-prefixed functions or classes. 
If something is hard to test without reaching into private internals, consider refactoring the code to expose a public entry point -- **Type annotations required**: Test functions and fixtures must include type annotations — `-> None` for tests, typed parameters, and typed return values for fixtures -- **Imports at module level**: Follow the same import rules as production code — keep imports at the top of the file, not inside test functions -- **Parametrize over duplicate**: Use `@pytest.mark.parametrize` (with `ids=` for readable names) instead of writing multiple test functions for variations of the same behavior -- **Minimal fixtures**: Fixtures should be simple — one fixture, one responsibility, just setup with no behavior logic -- **Shared fixtures in `conftest.py`**: Place fixtures shared across a test directory in `conftest.py` -- **Mock at boundaries**: Mock external dependencies (APIs, databases, third-party services), not internal functions -- **Test behavior, not implementation**: Assert on outputs and side effects, not internal call counts (unless verifying routing) -- **Keep mocking shallow**: If a test requires deeply nested mocking, the code under test may need refactoring - -Example test structure: - -```python -from typing import Any - -from data_designer.config.config_builder import DataDesignerConfigBuilder - - -def test_something(stub_model_configs: dict[str, Any]) -> None: - """Test description.""" - builder = DataDesignerConfigBuilder(model_configs=stub_model_configs) - # ... test implementation - assert expected == actual -``` - -## Column Configuration Types - -When working with column configurations, understand these key types: - -- **`SamplerColumnConfig`**: Built-in samplers (UUID, Category, Uniform, Gaussian, Person, DateTime, etc.) 
-- **`LLMTextColumnConfig`**: LLM text generation with Jinja2 templating -- **`LLMCodeColumnConfig`**: Code generation with language specification -- **`LLMStructuredColumnConfig`**: Structured JSON generation with schema -- **`LLMJudgeColumnConfig`**: Judge/scoring columns for quality assessment -- **`ExpressionColumnConfig`**: Expression-based derived columns (Python eval or Jinja2) -- **`ValidationColumnConfig`**: Validation results (Python, SQL, Code, Remote validators) -- **`SeedDatasetColumnConfig`**: Data from seed datasets -- **`EmbeddingColumnConfig`**: Embedding generation for text columns using a specified model -- **`CustomColumnConfig`**: Custom user-defined column generators via `@custom_column_generator` decorator - -See [packages/data-designer-config/src/data_designer/config/column_configs.py](packages/data-designer-config/src/data_designer/config/column_configs.py) for detailed schemas. - -## Model Configuration - -Models are configured via `ModelConfig` with: - -- `alias`: User-defined alias for the model -- `model`: Model ID (e.g., from build.nvidia.com) -- `inference_parameters`: Temperature, top_p, max_tokens (can be distribution-based) -- `system_prompt`: Optional system prompt -- `image_modality`: Support for image inputs - -See [packages/data-designer-config/src/data_designer/config/models.py](packages/data-designer-config/src/data_designer/config/models.py) for details. - -## Registry System - -The project uses a registry pattern for extensibility. 
Key registries: - -- **Column generators**: [packages/data-designer-engine/src/data_designer/engine/column_generators/registry.py](packages/data-designer-engine/src/data_designer/engine/column_generators/registry.py) -- **Validators**: [packages/data-designer-engine/src/data_designer/engine/validators/](packages/data-designer-engine/src/data_designer/engine/validators/) -- **Column profilers**: [packages/data-designer-engine/src/data_designer/engine/analysis/column_profilers/registry.py](packages/data-designer-engine/src/data_designer/engine/analysis/column_profilers/registry.py) -- **Models**: [packages/data-designer-engine/src/data_designer/engine/models/registry.py](packages/data-designer-engine/src/data_designer/engine/models/registry.py) - -When adding new generators or validators, register them appropriately. - -## Pre-commit Hooks - -The project uses pre-commit hooks to enforce code quality. Install them with: - -```bash -uv run pre-commit install -``` - -Hooks include: -- Trailing whitespace removal -- End-of-file fixer -- YAML/JSON/TOML validation -- Merge conflict detection -- Debug statement detection -- Ruff linting and formatting - -## Common Development Tasks - -```bash -# Clean up generated files -make clean - -# Update license headers -make update-license-headers - -# Run all checks before committing -make check-all-fix -make test - -# Generate coverage report -make coverage -# View htmlcov/index.html in browser - -# Profile import performance (use after adding heavy dependencies) -make perf-import # Profile import time -make perf-import CLEAN=1 # Clean cache first, then profile -``` - -## Additional Resources - -- **README.md**: Installation and basic usage examples -- **packages/data-designer-config/src/data_designer/config/**: Configuration API documentation -- **tests/**: Comprehensive test suite with usage examples +- `make check-all-fix` — format + lint (ruff) +- `make test` — run all tests +- `make update-license-headers` — add SPDX license 
headers (never add them manually)
+- `make perf-import CLEAN=1` — profile import time after dependency changes

From 08ea087f2ca99275ee43e834005220145d0f1ff2 Mon Sep 17 00:00:00 2001
From: Johnny Greco
Date: Tue, 17 Mar 2026 15:10:25 -0700
Subject: [PATCH 2/3] chore: make CLAUDE.md a symlink to AGENTS.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Single source of truth — both files now share the same content via symlink rather than maintaining two copies.
---
 CLAUDE.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
 mode change 100644 => 120000 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
deleted file mode 100644
index 68490f0c..00000000
--- a/CLAUDE.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# In ./CLAUDE.md
-
-@AGENTS.md
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 120000
index 00000000..47dc3e3d
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md
\ No newline at end of file

From 391c757ecec7fe39c0a7ea1156892a1688708c03 Mon Sep 17 00:00:00 2001
From: Johnny Greco
Date: Tue, 17 Mar 2026 15:12:38 -0700
Subject: [PATCH 3/3] chore: remove duplicate import direction statement

---
 AGENTS.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index 800883cb..f5e1471b 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -12,8 +12,6 @@ Three packages share the `data_designer` namespace via PEP 420 implicit namespac
 - **Engine** (`data-designer-engine`): Execution layer. Compiler resolves configs into a DAG; runtime executes it via LLM calls, batching, and task scheduling. Registries map config types to implementations.
 - **Interface** (`data-designer`): Entry-point layer. `DataDesigner` class, CLI, and result types. Orchestrates engine on behalf of users.

-**Import direction is one-way and absolute.** Config must never import engine. Engine must never import interface. Violations create circular dependencies and break the namespace package structure.
-
 ## Core Concepts

 - **Columns** — the primary abstraction. Each column declares its dependencies and extra outputs. The DAG of these declarations determines execution order automatically — users never specify ordering.
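
Reviewer note: the "Columns" bullet says that declared dependencies form a DAG which fixes the execution order. The sketch below illustrates that idea with the standard library only. It is not Data Designer's compiler: the column names, the `column_dependencies` mapping, and `resolve_execution_order` are hypothetical and exist purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical declarations: column name -> names of the columns it depends on.
# None of these names come from the Data Designer codebase.
column_dependencies: dict[str, set[str]] = {
    "category": set(),
    "product_name": {"category"},
    "review": {"product_name", "category"},
    "review_score": {"review"},
}


def resolve_execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every column comes after its dependencies.

    Raises graphlib.CycleError if the declarations contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = resolve_execution_order(column_dependencies)
print(order)
```

The caller only states what each column needs. The ordering falls out of the graph, which is the "declare, don't orchestrate" contract in miniature.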