Community Revival: Modernize data-diff as v1.0.0 #1
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Delete data_diff/cloud/ directory (DatafoldAPI, data_source)
- Delete data_diff/tracking.py (RudderStack telemetry, tokens, profile logic)
- Delete tests/cloud/ directory
- Remove --cloud and --no-tracking CLI flags from __main__.py
- Remove DATAFOLD_TRIGGERED_BY env var handling
- Remove _cloud_diff(), _initialize_api(), _initialize_events(), _email_signup(), _extension_notification() from dbt.py
- Remove is_cloud parameter threading through dbt_diff()
- Remove tracking event calls from diff_tables.py
- Remove cloud-specific error classes from errors.py
- Remove cloud-related test cases from test_dbt.py
- Update remaining Datafold URLs to community GitHub org
- Clean up unused imports (time, run_as_daemon, truncate_error)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove deprecated version key
- Update service images (postgres 16, mysql 8.4, clickhouse 24.3, trino 439, vertica 24.1)
- Add health checks to all database services
- Remove deprecated mysql_native_password auth plugin flag
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite pyproject.toml from [tool.poetry] to [project] (PEP 621)
- Switch build system from poetry-core to hatchling
- Set version to 1.0.0, requires-python >= 3.10
- Make dbt-core optional (moved to [project.optional-dependencies])
- Uncomment BigQuery and Databricks extras
- Pin all wildcard dependencies to minimum version ranges
- Remove keyring, urllib3<2 pin, toml dependencies
- Replace toml with tomllib/tomli for TOML parsing
- Add tomli as conditional dep for Python 3.10
- Upgrade pydantic to >=2.0, fix Field(regex=) -> Field(pattern=)
- Add [dependency-groups] dev with pytest, ruff, and DB drivers
- Add [tool.pytest.ini_options] configuration
- Lazy-load dbt import in __main__.py
- Export __version__ from data_diff/__init__.py
- Delete poetry.lock, generate uv.lock
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
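The toml-to-tomllib/tomli switch in this commit can be sketched as a version-gated import. This is a minimal illustration, not the repository's code: `project_version` is a hypothetical helper, and the Python 3.10 branch assumes the `tomli` backport is installed.

```python
import sys

if sys.version_info >= (3, 11):
    import tomllib  # standard library from Python 3.11 onward
else:
    import tomli as tomllib  # assumption: the tomli backport is installed on 3.10


def project_version(pyproject_text: str) -> str:
    """Hypothetical helper: read [project].version from pyproject.toml text."""
    return tomllib.loads(pyproject_text)["project"]["version"]


sample = '[project]\nname = "data-diff"\nversion = "1.0.0"\n'
print(project_version(sample))
```

Aliasing the backport to the name `tomllib` keeps every call site identical on both interpreter versions.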
Security hardening: assert statements are stripped when Python runs with -O flag, silently disabling safety checks. Replace all production asserts with explicit if/raise providing descriptive error messages.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
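A minimal sketch of the assert-to-raise pattern this commit describes. The function names and the range check are hypothetical, chosen only to show the difference in behavior.

```python
def chunk_bounds_assert(start: int, end: int) -> tuple[int, int]:
    # Under `python -O` this assert is compiled away, so an inverted
    # range would pass through unchecked.
    assert start <= end, "start must not exceed end"
    return start, end


def chunk_bounds_checked(start: int, end: int) -> tuple[int, int]:
    # An explicit check is enforced regardless of interpreter flags.
    if start > end:
        raise ValueError(f"start ({start}) must not exceed end ({end})")
    return start, end
```

The explicit version also lets the caller catch a specific exception type (`ValueError` here) instead of a generic `AssertionError`.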
MySQL and PostgreSQL test URIs now read from DATADIFF_MYSQL_URI and DATADIFF_POSTGRESQL_URI env vars, falling back to local docker defaults.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
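The env-var override with a docker fallback could look roughly like the sketch below. The variable names come from the commit message; the default URIs and the `resolve_test_uri` helper are placeholders, not the values used in the test suite.

```python
import os

# Placeholder defaults: the real local-docker URIs in the repository may differ.
DEFAULT_MYSQL_URI = "mysql://user:password@localhost/testdb"
DEFAULT_POSTGRESQL_URI = "postgresql://user:password@localhost/testdb"


def resolve_test_uri(env_var: str, default: str) -> str:
    """Return the URI from the environment, or the default if unset or empty."""
    return os.environ.get(env_var) or default


mysql_uri = resolve_test_uri("DATADIFF_MYSQL_URI", DEFAULT_MYSQL_URI)
postgresql_uri = resolve_test_uri("DATADIFF_POSTGRESQL_URI", DEFAULT_POSTGRESQL_URI)
```

Using `or default` rather than a second argument to `os.environ.get` also treats an empty string as "not set", which is usually what a CI configuration intends.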
…rrides
- Replace assert statements in production code with explicit ValueError/TypeError raises
- Add env var overrides for MySQL and PostgreSQL test connection strings
- Create dev/.env.example with placeholder credentials
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite main Dockerfile with multi-stage build (builder + runtime), python:3.12-slim base, uv instead of Poetry, non-root appuser, minimal apt packages (libpq5 only), and selective COPY
- Update dev/Dockerfile.prestosql.340 base image from EOL openjdk:11-jdk-slim-buster to eclipse-temurin:11-jre-jammy
- Expand .dockerignore to exclude .git/, tests/, docs/, dev/, *.md, .github/, .worktrees/, __pycache__/, *.pyc, .ruff_cache/
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite README.md: remove Datafold branding, add community-maintained badge, installation/quick-start examples, supported databases list
- Update CONTRIBUTING.md: replace Poetry with uv, pytest instead of unittest, ruff for linting, remove Datafold-specific references
- Create CHANGELOG.md starting at v1.0.0 with community revival summary
- Create GOVERNANCE.md describing community maintenance model
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Pin ALL actions to full SHA commit hashes to prevent supply chain attacks
- Replace Poetry with astral-sh/setup-uv + uv sync --frozen
- Replace unittest-parallel with uv run pytest tests/
- Update Python matrix to 3.10, 3.11, 3.12, 3.13
- Add explicit permissions: contents: read at top level of CI workflows
- Remove BigQuery install hacks and cloud-specific secrets/references
- Replace andymckay/labeler@master (supply chain risk) with SHA-pinned actions/github-script
- Upgrade actions/stale from v5 to v9, actions/github-script from v6 to v7
- Add new release.yml workflow: uv build, PyPI trusted publishing, GitHub Release
- Add dependabot.yml for weekly GitHub Actions ecosystem updates
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…to pytest
Ruff linter:
- Expand ruff.toml with F/E/W/I/UP/B/SIM/RUF/C4/PIE/T20 lint rules
- Auto-apply UP (pyupgrade) fixes: Union -> X|Y, Optional -> X|None, typing.List -> list, typing.Dict -> dict, typing.Tuple -> tuple
- Auto-apply I (isort) import ordering across all files
- Fix B904 (raise from), B007 (unused loop vars), unused imports
- Add per-file ignores for test files, CLI output prints, SQL generation
- ruff check passes with zero errors
Pre-commit:
- Update ruff-pre-commit from v0.1.2 to v0.9.9
- Add ruff lint hook (not just format)
- Add check-yaml, check-toml, end-of-file-fixer, trailing-whitespace
- Add no-commit-to-branch (master) guard
- Add gitleaks for secret detection
Pytest migration:
- Create tests/conftest.py with shared DuckDB fixtures (in-memory and file-backed) using lazy imports to avoid import-time errors
- Convert test_utils.py, test_config.py, test_parse_time.py, and test_format.py from unittest.TestCase to standalone pytest functions
- Replace self.assertEqual with assert ==, self.assertRaises with pytest.raises, remove class wrappers
- Fix pre-existing bugs: modernize type annotations in queries/api.py, remove duplicate attrs fields in thread_utils.py, fix imports in test_format.py (SegmentInfo/InfoTree from info_tree module)
- All 21 converted tests pass; complex parameterized DB tests left as unittest.TestCase (pytest runs them natively)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add [tool.ty] config targeting Python 3.10 in pyproject.toml. Add ty check step to CI workflow with continue-on-error: true so it reports diagnostics without blocking PRs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unnecessary int() cast in utils.py number_to_human
- Prefix unused unpacked variable with underscore in test_database_types.py
- Convert isinstance tuple to union syntax for UP038 compat
- Remove deprecated UP038 from ruff ignore list
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3a7682e573
Dockerfile (outdated)
    - RUN poetry install
    - ENTRYPOINT ["poetry", "run", "python3", "-m", "data_diff"]
    + COPY pyproject.toml uv.lock ./
Copy README into builder before running uv sync
uv sync --frozen --no-dev installs the local project from pyproject.toml, and this project declares readme = "README.md" (pyproject line 5), but the builder only copies pyproject.toml, uv.lock, and data_diff/ before syncing. In Docker builds this leaves no README.md available, so package metadata build for data-diff can fail during image creation; .dockerignore now also excludes *.md, which guarantees the README cannot be copied from build context.
- Raise click.BadParameter in _set_age instead of silently swallowing ParseError, which caused unfiltered diffs on invalid age expressions
- Change bare except: to except Exception: in try_set_dbt_flags to avoid catching KeyboardInterrupt and SystemExit
- Remove dead is_cloud and deps_impacts params from dbt_diff_string_template now that cloud code is fully removed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
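The reason for narrowing a bare `except:` is that `KeyboardInterrupt` and `SystemExit` derive from `BaseException`, not `Exception`. A minimal sketch, using a hypothetical `run_best_effort` wrapper rather than the project's `try_set_dbt_flags`:

```python
def run_best_effort(action):
    """Run `action`, ignoring ordinary errors but not interpreter-exit signals."""
    try:
        return action()
    except Exception:
        # ZeroDivisionError, ValueError, etc. land here.
        # KeyboardInterrupt and SystemExit do not, so Ctrl-C still works.
        return None


def divide_by_zero():
    return 1 / 0


def interrupt():
    raise KeyboardInterrupt
```

Calling `run_best_effort(divide_by_zero)` returns `None`, while `run_best_effort(interrupt)` lets the `KeyboardInterrupt` propagate to the caller.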
Error handling improvements:
- Propagate diff task failures in dbt_diff instead of swallowing them
- Add symmetric error handling for table2.get_schema() in _local_diff
- Log schema fetch failures at warning level with exception details
- Raise click.UsageError for --limit/--stats conflict and missing DBs instead of returning silently with exit code 0
Bug fixes:
- Fix wrong schema referenced in error message (schema1 → schema2)
- Fix duplicate tuple comparison in base.py after typing.Tuple removal
- Add type guards in _remove_passwords_in_dict for non-string values
- Simplify useless re-raise in get_pk_from_model
- Fix isinstance/issubclass to use X | Y syntax per ruff UP038
Dead code removal:
- Remove unused mashumaro[msgpack] dependency
- Remove run_as_daemon and truncate_error (only used by removed cloud code)
- Remove unused threading import
URL corrections:
- Replace nonexistent data-diff-community GitHub URLs with datafold/data-diff
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
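As an illustration of the "type guards for non-string values" fix, the sketch below masks passwords in a nested config dict only when the value is actually a string. It is a hypothetical stand-in, not the project's `_remove_passwords_in_dict`, whose exact behavior is not shown in this conversation.

```python
def redact_passwords(config: dict) -> dict:
    """Hypothetical sketch: return a copy of `config` with string passwords masked."""
    redacted = {}
    for key, value in config.items():
        if isinstance(value, dict):
            # Recurse into nested sections.
            redacted[key] = redact_passwords(value)
        elif key == "password" and isinstance(value, str):
            # Guard: len() is only called on strings.
            redacted[key] = "*" * len(value)
        else:
            # Non-string values (None, ints, ...) pass through instead of
            # raising TypeError from a string operation.
            redacted[key] = value
    return redacted
```

Without the `isinstance(value, str)` guard, a config containing `"password": None` would raise `TypeError` when `len()` is applied to it.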
Documentation fixes:
- Fix database count from "14+" to "13+" in README
- Remove "non-cloud dbs" terminology from docstrings and CLI help
- Use modern "docker compose" command in CONTRIBUTING.md
Test improvements:
- Add tests for Databricks/Snowflake URI validation error paths
- Use monkeypatch.setenv in test_embed_env to prevent env pollution
- Add conn.close() to duckdb_file_connection fixture teardown
- Remove unused os import from test_config.py
CI/CD:
- Add tag-version validation step to release workflow
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pyproject.toml declares readme = "README.md", but the builder stage didn't copy it and .dockerignore excluded all *.md files, causing package metadata builds to fail.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Guard dbt.config.renderer import with try/except for clear error when dbt-core is not installed (optional dependency)
- Fix NoneType crash in dbt_diff_string_template when extra_info_dict is None
- Replace bare except: with except AttributeError: in 4 database dialects (snowflake, bigquery, databricks, postgresql) to avoid catching SystemExit, KeyboardInterrupt, etc.
- Split overly broad except Exception: pass in try_set_dbt_flags into ImportError (expected) vs Exception (logged at debug level)
- Log connection creation failures in ThreadedDatabase.set_conn() instead of silently storing them
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
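Guarding an optional import so a missing dependency produces a clear message might be sketched as follows. The `require_optional` helper, the `feature` wording, and the error text are assumptions for illustration; the commit does not show the actual guard.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
        raise RuntimeError(
            f"{feature} requires the optional package {module_name!r}, "
            "which is not installed."
        ) from exc
```

Chaining with `from exc` keeps the original `ImportError` available in the traceback for debugging while the user sees the friendlier message first.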
- Re-raise exceptions in _local_diff schema fetching so dbt_diff error aggregation actually captures failures instead of silently skipping
- Add debug logging to Redshift query_table_schema fallback chain and narrow innermost except to RuntimeError
- Remove dead datasource_id field from TDatadiffConfig (cloud remnant)
- Remove unused conftest.py fixtures (duckdb_connection, duckdb_file_connection)
- Fix broken anchor link in dbt_diff doc_url
- Fix stale "non-cloud dbs" docstring in JoinDiffer
- Replace "cloud databases" with "thread-pooled databases" in _connect.py docs
- Fix columns_flag type hint: tuple[str] -> tuple[str, ...]
- Update Union[str, DbPath] to str | DbPath in docstring
- Fix isinstance tuple syntax to use X | Y (ruff UP038)
- Add regression tests for diff_schemas bug fix and None extra_info_dict
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
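The `tuple[str]` to `tuple[str, ...]` fix matters because the first form describes a tuple of exactly one string, while the second describes any number of strings. A short sketch; `normalize_columns` is a hypothetical function, not the CLI's actual handling of `columns_flag`:

```python
from typing import get_args

# tuple[str] has one type argument: exactly one element, a str.
one_string = get_args(tuple[str])
# tuple[str, ...] carries an Ellipsis: zero or more str elements.
any_strings = get_args(tuple[str, ...])


def normalize_columns(columns: tuple[str, ...]) -> tuple[str, ...]:
    """Strip whitespace and drop empty column names."""
    return tuple(c.strip() for c in columns if c.strip())
```

A type checker would reject passing `("id", "name")` to a parameter annotated `tuple[str]`, even though the code runs fine, which is why the hint needed correcting.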
- Extract duplicate tomllib shim into data_diff/_compat.py
- Remove duplicate --limit + --stats validation (dead code at line 593)
- Remove unused TypeVar T from databases/base.py
- Migrate remaining Union[] type hints to X | Y syntax in utils.py
- Gate CI type check to run only on Python 3.10 matrix entry
- Replace commented-out logging with active debug logging in base.py
- Add from None to exception conversions in _connect.py for cleaner tracebacks
- Fix is_dbt parameter type: bool | None -> bool
- Add clarifying comment for "datafold" dbt config key
- Fix CHANGELOG: "replacing black/flake8" -> "replacing black-based configuration"
- Add dbt integration section to README
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
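The "from None" item refers to suppressing the implicit exception context when one error is converted into another. A minimal sketch with a hypothetical `parse_port` function (not taken from `_connect.py`):

```python
def parse_port(raw: str) -> int:
    """Convert a port string to int, reporting failures in domain terms."""
    try:
        return int(raw)
    except ValueError:
        # `from None` hides the low-level "invalid literal for int()" traceback,
        # so only the domain-specific message is displayed.
        raise ValueError(f"Invalid port in connection string: {raw!r}") from None
```

With `from None`, the new exception has `__suppress_context__` set to `True`, so the traceback omits the "During handling of the above exception, another exception occurred" section.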
Summary
Complete Phase 1 modernization of data-diff for community-maintained open-source revival after Datafold sunset the project in May 2024.
- … `cloud/`, `tracking.py`, `tests/cloud/`; strip all cloud/tracking references from CLI, dbt integration, and diff engine
- … `pyproject.toml` with hatchling build system, make dbt-core optional, fix BigQuery/Databricks extras, replace `toml` with `tomllib`/`tomli`, bump to v1.0.0
- … `assert` statements in production code with explicit `if`/`raise`, make test DB URIs configurable via env vars
- … `release.yml` workflow for tag-triggered PyPI publishing with trusted publishing; add `dependabot.yml` for Actions updates
- … `conftest.py` with shared fixtures, convert simple test files from unittest to pytest style (21 tests passing)
- … `[tool.ty]` config, non-blocking CI step

Stats
Test plan
- `uv sync` installs successfully
- `uv run ruff check .` passes with zero errors
- `uv run ruff format --check .` passes
- `uv run pytest tests/test_utils.py tests/test_config.py tests/test_parse_time.py tests/test_format.py`: 21 tests pass
- `grep -ri "datafold\|rudderstack" data_diff/` returns zero matches
- `python -c "from data_diff import diff_tables, connect_to_table"` works
- `python -c "from data_diff.dbt import dbt_diff"` fails with `ModuleNotFoundError: No module named 'dbt'` (correct: dbt is optional)
- `python -c "import data_diff; print(data_diff.__version__)"` prints `1.0.0`
- `pre-commit run --all-files`
- `docker build .` produces a working image

🤖 Generated with Claude Code