Community Revival: Modernize data-diff as v1.0.0 #1

Merged
dtsong merged 20 commits into master from community-revival
Mar 1, 2026

@dtsong (Owner) commented Mar 1, 2026

Summary

Complete Phase 1 modernization of data-diff for community-maintained open-source revival after Datafold sunset the project in May 2024.

  • Remove Datafold Cloud code and telemetry — delete cloud/, tracking.py, tests/cloud/; strip all cloud/tracking references from CLI, dbt integration, and diff engine
  • Migrate from Poetry to uv (PEP 621) — rewrite pyproject.toml with hatchling build system, make dbt-core optional, fix BigQuery/Databricks extras, replace toml with tomllib/tomli, bump to v1.0.0
  • Security hardening — replace 50+ assert statements in production code with explicit if/raise, make test DB URIs configurable via env vars
  • Configure ruff linter — add F/E/W/I/UP/B/SIM/RUF/C4/PIE/T20 rules, auto-apply 3.10+ syntax modernizations (Union→X|Y, Optional→X|None, typing.List→list), enforce isort ordering
  • Modernize pre-commit hooks — update ruff from v0.1.2 to v0.9.9, add lint hook, check-yaml/toml, trailing-whitespace, gitleaks secret detection, no-commit-to-master guard
  • Harden Dockerfile — multi-stage build, non-root user, uv instead of Poetry, minimal runtime image; update Presto dev Dockerfile base image
  • Modernize docker-compose.yml — update service images (postgres:16, mysql:8.4, clickhouse:24.3, trino:439, vertica:24.1), add health checks, remove deprecated version key
  • Overhaul GitHub Actions CI/CD — SHA-pin all actions, add permissions blocks, replace Poetry with uv, replace unittest-parallel with pytest, drop Python 3.8/3.9 and add 3.12/3.13
  • Create release automation — new release.yml workflow for tag-triggered PyPI publishing with trusted publishing; add dependabot.yml for Actions updates
  • Migrate test suite to pytest — create conftest.py with shared fixtures, convert simple test files from unittest to pytest style (21 tests passing)
  • Integrate ty type checker — add [tool.ty] config, non-blocking CI step
  • Rewrite documentation — new README, CONTRIBUTING.md (uv-based), CHANGELOG.md, GOVERNANCE.md; remove all Datafold branding

Stats

  • 96 files changed, +5,468 / -6,968 lines (net reduction of 1,500 lines)
  • 12 commits covering all Phase 1 plan items

Test plan

  • uv sync installs successfully
  • uv run ruff check . passes with zero errors
  • uv run ruff format --check . passes
  • uv run pytest tests/test_utils.py tests/test_config.py tests/test_parse_time.py tests/test_format.py — 21 tests pass
  • grep -ri "datafold\|rudderstack" data_diff/ returns zero matches
  • python -c "from data_diff import diff_tables, connect_to_table" works
  • python -c "from data_diff.dbt import dbt_diff" fails with ModuleNotFoundError: No module named 'dbt' (correct — dbt is optional)
  • python -c "import data_diff; print(data_diff.__version__)" prints 1.0.0
  • Pre-commit hooks pass: pre-commit run --all-files
  • CI passes on Python 3.10, 3.11, 3.12, 3.13 (requires Docker services for full DB tests)
  • docker build . produces a working image

🤖 Generated with Claude Code

dtsong and others added 13 commits February 28, 2026 23:43
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Delete data_diff/cloud/ directory (DatafoldAPI, data_source)
- Delete data_diff/tracking.py (RudderStack telemetry, tokens, profile logic)
- Delete tests/cloud/ directory
- Remove --cloud and --no-tracking CLI flags from __main__.py
- Remove DATAFOLD_TRIGGERED_BY env var handling
- Remove _cloud_diff(), _initialize_api(), _initialize_events(),
  _email_signup(), _extension_notification() from dbt.py
- Remove is_cloud parameter threading through dbt_diff()
- Remove tracking event calls from diff_tables.py
- Remove cloud-specific error classes from errors.py
- Remove cloud-related test cases from test_dbt.py
- Update remaining Datafold URLs to community GitHub org
- Clean up unused imports (time, run_as_daemon, truncate_error)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove deprecated version key
- Update service images (postgres 16, mysql 8.4, clickhouse 24.3, trino 439, vertica 24.1)
- Add health checks to all database services
- Remove deprecated mysql_native_password auth plugin flag

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite pyproject.toml from [tool.poetry] to [project] (PEP 621)
- Switch build system from poetry-core to hatchling
- Set version to 1.0.0, requires-python >= 3.10
- Make dbt-core optional (moved to [project.optional-dependencies])
- Uncomment BigQuery and Databricks extras
- Pin all wildcard dependencies to minimum version ranges
- Remove keyring, urllib3<2 pin, toml dependencies
- Replace toml with tomllib/tomli for TOML parsing
- Add tomli as conditional dep for Python 3.10
- Upgrade pydantic to >=2.0, fix Field(regex=) -> Field(pattern=)
- Add [dependency-groups] dev with pytest, ruff, and DB drivers
- Add [tool.pytest.ini_options] configuration
- Lazy-load dbt import in __main__.py
- Export __version__ from data_diff/__init__.py
- Delete poetry.lock, generate uv.lock
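
The tomllib/tomli swap listed above is usually written as a small import shim. This is a minimal sketch, not necessarily the code in this PR: it assumes tomli is installed on Python 3.10, and the `SAMPLE` string and `read_version` helper are made up for illustration.

```python
import sys

# tomllib joined the standard library in Python 3.11; on 3.10 the tomli
# backport exposes the same read-only API.
if sys.version_info >= (3, 11):
    import tomllib
else:
    import tomli as tomllib  # assumed to be installed as a conditional dependency

# Hypothetical TOML snippet, only here to exercise the parser.
SAMPLE = """
[project]
name = "data-diff"
version = "1.0.0"
"""


def read_version(toml_text: str) -> str:
    """Return project.version from a TOML document given as a string."""
    return tomllib.loads(toml_text)["project"]["version"]
```

One difference from the old toml package worth keeping in mind: `tomllib.load()` expects a file opened in binary mode (`open(path, "rb")`), while `tomllib.loads()` takes a `str`.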

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Security hardening: assert statements are stripped when Python runs with
-O flag, silently disabling safety checks. Replace all production asserts
with explicit if/raise providing descriptive error messages.
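
To illustrate the difference, here is a minimal sketch of the pattern; the function names are hypothetical and do not come from the data-diff codebase.

```python
def checked_ratio_assert(part: int, total: int) -> float:
    # Under `python -O` this assert is removed, so total == 0 would surface
    # later as a ZeroDivisionError instead of a clear message.
    assert total > 0, "total must be positive"
    return part / total


def checked_ratio(part: int, total: int) -> float:
    # Explicit check: still enforced when assertions are stripped.
    if total <= 0:
        raise ValueError(f"total must be positive, got {total}")
    return part / total
```

The second form also lets callers catch a specific exception type (`ValueError` here) rather than `AssertionError`.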

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MySQL and PostgreSQL test URIs now read from DATADIFF_MYSQL_URI and
DATADIFF_POSTGRESQL_URI env vars, falling back to local docker defaults.
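
A minimal sketch of this kind of override, assuming a plain environment lookup with a fallback. The default URI and the `mysql_test_uri` helper are placeholders for illustration, not the values used in the test suite.

```python
import os

# Hypothetical local default; the real fallback lives in the test suite.
DEFAULT_MYSQL_URI = "mysql://user:password@localhost/mysql"


def mysql_test_uri(env=None) -> str:
    """Return the MySQL test URI, preferring DATADIFF_MYSQL_URI when set."""
    # Accept an explicit mapping so the lookup can be tested without
    # touching the real process environment.
    env = os.environ if env is None else env
    return env.get("DATADIFF_MYSQL_URI", DEFAULT_MYSQL_URI)
```

In CI, exporting `DATADIFF_MYSQL_URI` before running the tests would then point them at a different server without code changes.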

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rrides

- Replace assert statements in production code with explicit ValueError/TypeError raises
- Add env var overrides for MySQL and PostgreSQL test connection strings
- Create dev/.env.example with placeholder credentials

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite main Dockerfile with multi-stage build (builder + runtime),
  python:3.12-slim base, uv instead of Poetry, non-root appuser,
  minimal apt packages (libpq5 only), and selective COPY
- Update dev/Dockerfile.prestosql.340 base image from EOL
  openjdk:11-jdk-slim-buster to eclipse-temurin:11-jre-jammy
- Expand .dockerignore to exclude .git/, tests/, docs/, dev/, *.md,
  .github/, .worktrees/, __pycache__/, *.pyc, .ruff_cache/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rewrite README.md: remove Datafold branding, add community-maintained
  badge, installation/quick-start examples, supported databases list
- Update CONTRIBUTING.md: replace Poetry with uv, pytest instead of
  unittest, ruff for linting, remove Datafold-specific references
- Create CHANGELOG.md starting at v1.0.0 with community revival summary
- Create GOVERNANCE.md describing community maintenance model

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Pin ALL actions to full SHA commit hashes to prevent supply chain attacks
- Replace Poetry with astral-sh/setup-uv + uv sync --frozen
- Replace unittest-parallel with uv run pytest tests/
- Update Python matrix to 3.10, 3.11, 3.12, 3.13
- Add explicit permissions: contents: read at top level of CI workflows
- Remove BigQuery install hacks and cloud-specific secrets/references
- Replace andymckay/labeler@master (supply chain risk) with SHA-pinned actions/github-script
- Upgrade actions/stale from v5 to v9, actions/github-script from v6 to v7
- Add new release.yml workflow: uv build, PyPI trusted publishing, GitHub Release
- Add dependabot.yml for weekly GitHub Actions ecosystem updates

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…to pytest

Ruff linter:
- Expand ruff.toml with F/E/W/I/UP/B/SIM/RUF/C4/PIE/T20 lint rules
- Auto-apply UP (pyupgrade) fixes: Union -> X|Y, Optional -> X|None,
  typing.List -> list, typing.Dict -> dict, typing.Tuple -> tuple
- Auto-apply I (isort) import ordering across all files
- Fix B904 (raise from), B007 (unused loop vars), unused imports
- Add per-file ignores for test files, CLI output prints, SQL generation
- ruff check passes with zero errors

Pre-commit:
- Update ruff-pre-commit from v0.1.2 to v0.9.9
- Add ruff lint hook (not just format)
- Add check-yaml, check-toml, end-of-file-fixer, trailing-whitespace
- Add no-commit-to-branch (master) guard
- Add gitleaks for secret detection

Pytest migration:
- Create tests/conftest.py with shared DuckDB fixtures (in-memory and
  file-backed) using lazy imports to avoid import-time errors
- Convert test_utils.py, test_config.py, test_parse_time.py, and
  test_format.py from unittest.TestCase to standalone pytest functions
- Replace self.assertEqual with assert ==, self.assertRaises with
  pytest.raises, remove class wrappers
- Fix pre-existing bugs: modernize type annotations in queries/api.py,
  remove duplicate attrs fields in thread_utils.py, fix imports in
  test_format.py (SegmentInfo/InfoTree from info_tree module)
- All 21 converted tests pass; complex parameterized DB tests left as
  unittest.TestCase (pytest runs them natively)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add [tool.ty] config targeting Python 3.10 in pyproject.toml.
Add ty check step to CI workflow with continue-on-error: true
so it reports diagnostics without blocking PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unnecessary int() cast in utils.py number_to_human
- Prefix unused unpacked variable with underscore in test_database_types.py
- Convert isinstance tuple to union syntax for UP038 compat
- Remove deprecated UP038 from ruff ignore list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3a7682e573

Dockerfile Outdated
RUN poetry install
ENTRYPOINT ["poetry", "run", "python3", "-m", "data_diff"]

COPY pyproject.toml uv.lock ./

P1: Copy README into builder before running uv sync

uv sync --frozen --no-dev installs the local project from pyproject.toml, which declares readme = "README.md" (pyproject line 5), but the builder stage copies only pyproject.toml, uv.lock, and data_diff/ before syncing. In Docker builds README.md is therefore missing, so building the package metadata for data-diff can fail during image creation. .dockerignore now also excludes *.md, so the README cannot be copied from the build context at all.

dtsong and others added 7 commits March 1, 2026 00:28
- Raise click.BadParameter in _set_age instead of silently swallowing
  ParseError, which caused unfiltered diffs on invalid age expressions
- Change bare except: to except Exception: in try_set_dbt_flags to
  avoid catching KeyboardInterrupt and SystemExit
- Remove dead is_cloud and deps_impacts params from
  dbt_diff_string_template now that cloud code is fully removed
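
The bare-except change is easiest to see side by side. This is a self-contained sketch with hypothetical helper names, not the code of try_set_dbt_flags.

```python
def swallow_everything(action):
    try:
        action()
    except:  # noqa: E722 - bare except also traps KeyboardInterrupt/SystemExit
        return "swallowed"
    return "ok"


def swallow_errors_only(action):
    try:
        action()
    except Exception:  # ordinary errors only; interrupts and exits propagate
        return "swallowed"
    return "ok"


def interrupt():
    # Simulates the user pressing Ctrl-C while the action runs.
    raise KeyboardInterrupt
```

With the first helper a Ctrl-C inside `action` is silently absorbed; with the second it reaches the caller, because `KeyboardInterrupt` and `SystemExit` derive from `BaseException` rather than `Exception`.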

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Error handling improvements:
- Propagate diff task failures in dbt_diff instead of swallowing them
- Add symmetric error handling for table2.get_schema() in _local_diff
- Log schema fetch failures at warning level with exception details
- Raise click.UsageError for --limit/--stats conflict and missing DBs
  instead of returning silently with exit code 0

Bug fixes:
- Fix wrong schema referenced in error message (schema1 → schema2)
- Fix duplicate tuple comparison in base.py after typing.Tuple removal
- Add type guards in _remove_passwords_in_dict for non-string values
- Simplify useless re-raise in get_pk_from_model
- Fix isinstance/issubclass to use X | Y syntax per ruff UP038

Dead code removal:
- Remove unused mashumaro[msgpack] dependency
- Remove run_as_daemon and truncate_error (only used by removed cloud code)
- Remove unused threading import

URL corrections:
- Replace nonexistent data-diff-community GitHub URLs with datafold/data-diff

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documentation fixes:
- Fix database count from "14+" to "13+" in README
- Remove "non-cloud dbs" terminology from docstrings and CLI help
- Use modern "docker compose" command in CONTRIBUTING.md

Test improvements:
- Add tests for Databricks/Snowflake URI validation error paths
- Use monkeypatch.setenv in test_embed_env to prevent env pollution
- Add conn.close() to duckdb_file_connection fixture teardown
- Remove unused os import from test_config.py

CI/CD:
- Add tag-version validation step to release workflow
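
The commit does not show how the tag-version check is implemented. One possible shape, assuming release tags are the package version with an optional leading "v", is sketched below; `tag_matches_version` is a hypothetical helper.

```python
def tag_matches_version(tag: str, version: str) -> bool:
    """Return True when a release tag such as 'v1.0.0' names `version`."""
    # Assumption: tags are the package version with an optional leading 'v'.
    return tag.removeprefix("v") == version
```

A release job could call this with the pushed tag and `data_diff.__version__`, and abort before publishing when it returns False.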

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pyproject.toml declares readme = "README.md", but the builder stage
didn't copy it and .dockerignore excluded all *.md files, causing
package metadata builds to fail.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Guard dbt.config.renderer import with try/except for clear error when
  dbt-core is not installed (optional dependency)
- Fix NoneType crash in dbt_diff_string_template when extra_info_dict is None
- Replace bare except: with except AttributeError: in 4 database dialects
  (snowflake, bigquery, databricks, postgresql) to avoid catching SystemExit,
  KeyboardInterrupt, etc.
- Split overly broad except Exception: pass in try_set_dbt_flags into
  ImportError (expected) vs Exception (logged at debug level)
- Log connection creation failures in ThreadedDatabase.set_conn() instead
  of silently storing them
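
A guarded import for an optional dependency might look like the sketch below. The helper name, the error type, and the `data-diff[dbt]` extra named in the message are assumptions for illustration; the PR text does not show the actual guard.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Hypothetical message; the extra name is an assumption.
        raise RuntimeError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'data-diff[{extra}]'"
        ) from exc
```

Chaining with `from exc` keeps the original ImportError in the traceback, which helps when the import fails for a reason other than the package being absent.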

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Re-raise exceptions in _local_diff schema fetching so dbt_diff error
  aggregation actually captures failures instead of silently skipping
- Add debug logging to Redshift query_table_schema fallback chain and
  narrow innermost except to RuntimeError
- Remove dead datasource_id field from TDatadiffConfig (cloud remnant)
- Remove unused conftest.py fixtures (duckdb_connection, duckdb_file_connection)
- Fix broken anchor link in dbt_diff doc_url
- Fix stale "non-cloud dbs" docstring in JoinDiffer
- Replace "cloud databases" with "thread-pooled databases" in _connect.py docs
- Fix columns_flag type hint: tuple[str] -> tuple[str, ...]
- Update Union[str, DbPath] to str | DbPath in docstring
- Fix isinstance tuple syntax to use X | Y (ruff UP038)
- Add regression tests for diff_schemas bug fix and None extra_info_dict

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extract duplicate tomllib shim into data_diff/_compat.py
- Remove duplicate --limit + --stats validation (dead code at line 593)
- Remove unused TypeVar T from databases/base.py
- Migrate remaining Union[] type hints to X | Y syntax in utils.py
- Gate CI type check to run only on Python 3.10 matrix entry
- Replace commented-out logging with active debug logging in base.py
- Add from None to exception conversions in _connect.py for cleaner tracebacks
- Fix is_dbt parameter type: bool | None -> bool
- Add clarifying comment for "datafold" dbt config key
- Fix CHANGELOG: "replacing black/flake8" -> "replacing black-based configuration"
- Add dbt integration section to README
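
The "from None" item above refers to suppressing the original exception when converting it to a domain error. A minimal sketch follows; `ConfigError` and `parse_port` are hypothetical names, not the ones in _connect.py.

```python
class ConfigError(Exception):
    """Hypothetical domain-specific error used for this example."""


def parse_port(raw: str) -> int:
    try:
        return int(raw)
    except ValueError:
        # `from None` hides the original ValueError from the traceback, so
        # the user sees only the domain-specific error.
        raise ConfigError(f"invalid port: {raw!r}") from None
```

Without `from None`, the traceback would include both exceptions joined by "During handling of the above exception, another exception occurred".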

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dtsong dtsong merged commit 71e46d1 into master Mar 1, 2026
1 of 6 checks passed