Add agent-discoverability contract test (closes #461)

igerber · claude · igerber · commit 95ae91449630 · 2026-05-18T21:22:43.000-04:00
New `tests/test_agent_discoverability.py` pins the agent-facing surface introduced by PR #464 against future regression. Snapshot/static assertions only: no live API calls, no subprocess, runs in the default pytest suite.

What's locked:

1. `__all__` membership of agent_workflow / profile_panel / get_llm_guide / practitioner_next_steps / BusinessReport (catches export pruning).
2. `dir(diff_diff)` head-first ordering matches `_AGENT_FACING_ORDER` (catches drift in `_OrderedName.__lt__` or `__dir__()` regression).
3. `dir()` tail stays alphabetic when keyed by `str` (recovery key for downstream tooling that re-sorts).
4. `dir()` returns the FULL module namespace, not just `__all__` (preserves `__doc__` / `__name__` / `__file__` for `inspect.getmembers` consumers).
5. `_OrderedName` invariants: `isinstance(_, str)` holds, str methods work (upper, eq, hash, `in`, f-string).
6. Top-level `__doc__` first non-blank paragraph names `agent_workflow`; full doc text names the 4 downstream primitives.
7. `agent_workflow()` output script references each canonical helper by name; every `fit_candidates` entry resolves on the diff_diff namespace.
8. Canonical estimator class names (CallawaySantAnna, ChaisemartinDHaultfoeuille, ContinuousDiD, DifferenceInDifferences, HeterogeneousAdoptionDiD, HonestDiD, ImputationDiD, PreTrendsPower, SunAbraham, TwoWayFixedEffects, WooldridgeDiD) remain importable.
9. Each agent-facing entrypoint stays callable.

17 tests (12 standalone + 5 parametrize cells over the agent-facing entrypoint names).

Closes #461 (snapshot variant). The live-agent regression test remains a follow-up that depends on causal-llm-eval packaging its harness module. Also closes the `__dir__()` contract-test row from PR #464's TODO.md (deferred there, landed here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
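
Reviewer note: for readers unfamiliar with the ordering trick that items 2, 3, and 5 lock down, the following is a minimal sketch of the general technique, not the code in `diff_diff/__init__.py`. It assumes a `str` subclass whose `__lt__` ranks a curated tuple ahead of everything else, paired with a module-level `__dir__` hook. The names `PRIORITY`, `OrderedName`, and `build_demo_module` are hypothetical and exist only for this illustration.

```python
import types

# Hypothetical stand-in for a curated head tuple such as _AGENT_FACING_ORDER.
PRIORITY = ("agent_workflow", "profile_panel", "get_llm_guide")


class OrderedName(str):
    """str subclass whose ``<`` ranks PRIORITY names first, then alphabetic."""

    def _key(self):
        text = str(self)  # plain str copy, so tuple comparison is ordinary
        if text in PRIORITY:
            return (0, PRIORITY.index(text), text)
        return (1, 0, text)

    def __lt__(self, other):
        if isinstance(other, OrderedName):
            return self._key() < other._key()
        # Against plain strings, defer to the normal str ordering.
        return str.__lt__(self, other)


def build_demo_module():
    """Build a throwaway module whose dir() output is priority-ordered."""
    mod = types.ModuleType("demo_pkg")
    for name in ("zeta", "alpha", "profile_panel", "get_llm_guide", "agent_workflow"):
        setattr(mod, name, lambda: None)
    # A module-level __dir__ hook: the builtin dir() calls it and then sorts
    # the returned list, and that sort consults OrderedName.__lt__.
    mod.__dir__ = lambda: [OrderedName(n) for n in vars(mod)]
    return mod


demo = build_demo_module()
print(dir(demo)[: len(PRIORITY)])
```

In this sketch every element remains a `str` instance, so equality, hashing, and f-string interpolation behave as usual, and `sorted(dir(demo), key=str)` recovers plain alphabetic order because the key function hands the sort ordinary strings.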
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
+- **Agent-discoverability contract test (`tests/test_agent_discoverability.py`).** New static-snapshot test pinning the agent-facing surface introduced by PR #464: `__all__` membership of `agent_workflow` / `profile_panel` / `get_llm_guide` / `practitioner_next_steps` / `BusinessReport`; `dir(diff_diff)` head-first ordering against `_AGENT_FACING_ORDER` (catches drift in the `_OrderedName` `__lt__` ordering trick); `_OrderedName` `isinstance(_, str)` + str-method compatibility; `dir()` full-namespace + `inspect.getmembers` parity; top-level `__doc__` first-paragraph mention of `agent_workflow` + named references to the 5-step workflow primitives; `agent_workflow()` script content references each downstream helper by name; canonical estimator class names (CallawaySantAnna, ContinuousDiD, HeterogeneousAdoptionDiD, etc.) remain importable. No live API calls; runs in the default pytest suite. Closes [issue #461](https://github.com/igerber/diff-diff/issues/461) (snapshot variant — live-agent regression test deferred to a separate follow-up that depends on causal-llm-eval packaging its harness). Also closes the `__dir__()` contract-test row from `TODO.md` that PR #464 deferred here.
 - **`diff_diff.agent_workflow(df, unit=..., time=..., treatment=..., outcome=...)` — stateless orchestrator for LLM-agent discoverability** (`diff_diff/agent_workflow.py`). Prints (and returns as dict) a copy-pasteable 5-step workflow with the caller's column names templated in: `profile_panel` → `get_llm_guide("autonomous")` → `<Estimator>(...).fit(df, ...)` → `practitioner_next_steps(result)` → `BusinessReport(result).full_report()`. The function calls nothing internally and does not inspect `df`; it is a guided tour, not a router. Surfaces the canonical workflow primitives (`profile_panel`, `get_llm_guide`, `practitioner_next_steps`, `BusinessReport`) that cold-start agent dry-passes at [igerber/causal-llm-eval](https://github.com/igerber/causal-llm-eval) showed agents practically never reach for on their own. Output structure: `{"profile_call", "guide_call", "fit_candidates", "validation_calls", "reporting_call", "script"}`; `fit_candidates` is a flat list of estimator/diagnostic class names referenced in the workflow patterns (each must remain importable on `diff_diff`, locked by `tests/test_agent_workflow.py::test_fit_candidates_all_importable`). Closes [issue #460](https://github.com/igerber/diff-diff/issues/460).
 - **Top-level `__doc__` rewritten to lead with the agent workflow** (`diff_diff/__init__.py`). `help(diff_diff)` now opens with the `agent_workflow(df, ...)` recommendation as the first non-blank paragraph; `get_llm_guide("full")` and `get_llm_guide("practitioner")` pointers preserved for the existing `tests/test_guides.py::test_module_docstring_mentions_helper` guard.
 - **`dir(diff_diff)` now surfaces agent-facing entrypoints first** via a module-level `__dir__()` override paired with a small `_OrderedName(str)` subclass that subverts CPython's unconditional alphabetic sort (PyList_Sort respects `__lt__` on the elements). Agent-facing names (`agent_workflow`, `profile_panel`, `get_llm_guide`, `practitioner_next_steps`, `BusinessReport`, `DiagnosticReport`) appear at the head of the list; the remainder stays alphabetic via the `str.__lt__` fallback. The underlying `__all__` membership is **unchanged** and `from diff_diff import *` semantics are unaffected (driven by `__all__`, not `dir()`). Elements are `isinstance(x, str)` and compatible with `inspect.getmembers`, dict-key lookup, f-strings, and standard `str` methods; tooling that re-sorts via `sorted(dir(diff_diff))` will see priority order (use `sorted(dir(diff_diff), key=str)` to recover plain alphabetic if needed). Internal: `_AGENT_FACING_ORDER` tuple is read by the new `tests/test_agent_discoverability.py` contract test (PR B). Addresses [issue #460](https://github.com/igerber/diff-diff/issues/460) item 3.
diff --git a/TODO.md b/TODO.md
@@ -162,7 +162,6 @@ Deferred items from PR reviews that were not addressed before merge.
 | Add CI validation for `docs/doc-deps.yaml` integrity (stale paths, unmapped source files) | `docs/doc-deps.yaml` | #269 | Low |
 | SyntheticDiD: rename internal `placebo_effects` variable to `variance_effects` (or `resampled_effects`). Misleading name across the placebo/bootstrap/jackknife dispatch paths — holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve `placebo_effects` as a deprecated alias for one release. | `synthetic_did.py`, `results.py` | follow-up | Medium |
 | AI review CI: pin workflow contract via test (uses `openai/codex-action@v1`, passes `prompt-file`, reads `steps.run_codex.outputs.final-message`, preserves diff-exclude paths and comment markers). Currently only the wrapper-tag and closing-tag-escape strings are asserted. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #416 | Low |
-| `__dir__()` discoverability contract test (head order, membership, `_OrderedName` invariants, `inspect.getmembers` parity) — deferred from PR #464 to the planned PR B addressing #461. The full snapshot/contract surface lands together in `tests/test_agent_discoverability.py`. | `diff_diff/__init__.py::__dir__`, `tests/test_agent_discoverability.py` (new in PR B) | #464 | Low |
 | `TestWorkflowDoesNotExecutePRHeadCode` (CodeQL #14 dismissal guard) does not model: `bash <script>` / `sh <script>` / `./<script>` / `source <script>` / `. <script>` direct shell-script execution; multi-line `python3 -c` bodies (line-by-line shlex can't reassemble across newlines — the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (`SCRIPT="$X"; python3 "$SCRIPT"`); `eval`; `find -exec`; `xargs -I {}`. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists — significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in `.github/workflows/ai_pr_review.yml` comment block. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #436 | Low |
 | Render `docs/methodology/REPORTING.md` and `docs/methodology/REGISTRY.md` as in-site Sphinx pages so cross-references can use `:doc:` instead of off-site GitHub `blob/main` URLs. Current state (#410 fix-audit-r2) restores navigable links via `blob/main`, but stable-docs readers can land on a different revision than the package version they are reading. Two viable paths: (a) add `myst-parser` to `docs/conf.py` extensions + docs extras and link with `:doc:`, or (b) convert both files to `.rst`. | `docs/conf.py`, `docs/api/business_report.rst`, `docs/api/diagnostic_report.rst`, `docs/tutorials/18_geo_experiments.ipynb`, `docs/tutorials/19_dcdh_marketing_pulse.ipynb` | follow-up | Low |
 
diff --git a/tests/test_agent_discoverability.py b/tests/test_agent_discoverability.py
@@ -0,0 +1,288 @@
+"""Contract test for the agent-discoverability surface (issue #461).
+
+This is a static snapshot test of the four contract surfaces that PR
+#464 introduced for LLM-agent discovery:
+
+1. ``__all__`` membership of agent-facing primitives
+2. ``dir(diff_diff)`` head-first ordering (via the ``_OrderedName`` trick)
+3. Top-level ``__doc__`` content (first paragraph names the recommended
+   call; the 5-step workflow primitives all appear)
+4. ``agent_workflow()`` output references the canonical downstream
+   primitives by name
+
+It also locks the ``__dir__()`` invariants (head matches
+``_AGENT_FACING_ORDER``, tail is alphabetic by ``str``, module dunders
+are preserved, ``inspect.getmembers`` parity).
+
+Closes the ``__dir__`` contract-test deferral row from PR #464's
+``TODO.md``.
+
+No live API calls, no subprocess, no live agents — purely string/identity
+assertions runnable in the default ``pytest`` suite.
+"""
+
+from __future__ import annotations
+
+import inspect
+
+import pandas as pd
+import pytest
+
+import diff_diff
+from diff_diff import _AGENT_FACING_ORDER
+
+# ---------------------------------------------------------------------------
+# __all__ membership
+# ---------------------------------------------------------------------------
+
+
+def test_agent_facing_names_in_all():
+    """The named primitives must remain in the public API surface.
+
+    Catches export pruning that would silently remove an agent-facing
+    name from ``from diff_diff import *``.
+    """
+    required = {
+        "agent_workflow",
+        "profile_panel",
+        "get_llm_guide",
+        "practitioner_next_steps",
+        "BusinessReport",
+    }
+    assert required <= set(
+        diff_diff.__all__
+    ), f"missing from __all__: {required - set(diff_diff.__all__)}"
+
+
+def test_estimator_class_names_importable():
+    """Class-name renames silently break agent recognition.
+
+    The canonical staggered estimators + the simple-2x2 case must remain
+    importable under their documented names; the orchestrator's Step 3
+    examples and ``llms-autonomous.txt`` routing matrix reference them
+    by these literal identifiers.
+    """
+    from diff_diff import (  # noqa: F401
+        CallawaySantAnna,
+        ChaisemartinDHaultfoeuille,
+        ContinuousDiD,
+        DifferenceInDifferences,
+        HeterogeneousAdoptionDiD,
+        HonestDiD,
+        ImputationDiD,
+        PreTrendsPower,
+        SunAbraham,
+        TwoWayFixedEffects,
+        WooldridgeDiD,
+    )
+
+
+# ---------------------------------------------------------------------------
+# __dir__() head-first ordering + _OrderedName invariants
+# ---------------------------------------------------------------------------
+
+
+def test_dir_head_matches_agent_facing_order():
+    """``dir(diff_diff)`` must surface ``_AGENT_FACING_ORDER`` at the
+    head, IN THE DECLARED ORDER.
+
+    Anchors to the contract (the override's curated tuple) rather than
+    a fixed slice length: if a future change adds or trims the head
+    tuple, this test follows it. Catches the failure mode where
+    ``__dir__()`` is dropped, mis-ordered, or where the
+    ``_OrderedName`` ``__lt__`` is broken.
+    """
+    names = dir(diff_diff)
+    head_size = len(_AGENT_FACING_ORDER)
+    assert names[:head_size] == list(_AGENT_FACING_ORDER), (
+        f"dir() head does not match _AGENT_FACING_ORDER. "
+        f"Got: {names[:head_size]!r}. "
+        f"Expected: {list(_AGENT_FACING_ORDER)!r}."
+    )
+
+
+def test_dir_tail_alphabetic_by_str():
+    """The non-head portion of ``dir()`` should stay alphabetic when
+    keyed by ``str``.
+
+    ``_OrderedName`` ranks head members by priority through its custom
+    ``__lt__``; tail members fall back to ``str.__lt__`` when CPython's
+    ``PyList_Sort`` runs. ``sorted(tail, key=str)`` is the canonical
+    recovery key in case any downstream tooling re-sorts.
+    """
+    names = dir(diff_diff)
+    tail = names[len(_AGENT_FACING_ORDER) :]
+    assert tail == sorted(tail, key=str)
+
+
+def test_dir_returns_full_module_namespace():
+    """``dir(diff_diff)`` must enumerate the full module namespace.
+
+    Restricting to ``__all__`` would drop module dunders (``__doc__``,
+    ``__name__``, ``__file__``) and break ``inspect.getmembers``
+    consumers. The override returns ``[_OrderedName(n) for n in
+    globals()]`` to preserve that compatibility.
+    """
+    names = dir(diff_diff)
+    for dunder in ("__doc__", "__name__", "__file__", "__all__"):
+        assert dunder in names, f"{dunder!r} missing from dir() output"
+
+
+def test_getmembers_parity_with_default_module_dir():
+    """``inspect.getmembers(diff_diff)`` should return the same set of
+    names as ``dir(diff_diff)``, with ``__doc__`` accessible.
+
+    Catches regressions where ``__dir__`` is reduced to ``__all__`` only.
+    """
+    dir_names = set(dir(diff_diff))
+    gm_names = {name for name, _ in inspect.getmembers(diff_diff)}
+    assert dir_names == gm_names, (
+        f"dir() and inspect.getmembers() disagree on: {sorted(dir_names ^ gm_names)[:5]}"
+    )
+    # And the steering surface must be accessible.
+    assert diff_diff.__doc__ is not None
+    assert "agent_workflow" in diff_diff.__doc__.lower()
+
+
+# ---------------------------------------------------------------------------
+# _OrderedName subclass invariants
+# ---------------------------------------------------------------------------
+
+
+def test_ordered_name_isinstance_str():
+    """Every ``dir()`` element must still be ``isinstance(..., str)`` so
+    consumers that type-check don't break.
+    """
+    for name in dir(diff_diff):
+        assert isinstance(
+            name, str
+        ), f"dir() element {name!r} is type {type(name).__name__}, not a str subclass"
+
+
+def test_ordered_name_str_methods_work():
+    """The head ``_OrderedName`` instances must support all the str
+    operations downstream tooling relies on (upper, eq, hash for dict
+    keys, ``in`` membership, f-string interpolation).
+    """
+    head = dir(diff_diff)[: len(_AGENT_FACING_ORDER)]
+    for n in head:
+        assert n.upper() == str(n).upper()
+        assert n == str(n)
+        assert {n: 1}.get(n) == 1
+        assert n in [str(n)]
+        assert f"{n}" == str(n)
+
+
+# ---------------------------------------------------------------------------
+# __doc__ first-paragraph contract
+# ---------------------------------------------------------------------------
+
+
+def test_doc_first_paragraph_names_agent_workflow():
+    """``help(diff_diff)`` opens with ``__doc__``; the first non-blank
+    paragraph must name ``agent_workflow``.
+
+    Catches a docstring rewrite that drops the recommended-call hint
+    from the top-of-help surface.
+    """
+    doc = diff_diff.__doc__
+    assert doc is not None
+    first_block = doc.strip().split("\n\n")[0]
+    assert "agent_workflow" in first_block.lower()
+
+
+def test_doc_names_canonical_workflow_helpers():
+    """The full 5-step workflow's primitive names must remain reachable
+    from ``help(diff_diff)``.
+
+    Catches a docstring trim that removes references to the downstream
+    helpers an agent following the doc would call next.
+    """
+    assert diff_diff.__doc__ is not None
+    doc_lower = diff_diff.__doc__.lower()
+    for name in (
+        "profile_panel",
+        "get_llm_guide",
+        "practitioner_next_steps",
+        "businessreport",
+    ):
+        assert name in doc_lower, f"{name!r} missing from __doc__"
+
+
+# ---------------------------------------------------------------------------
+# agent_workflow() output references the canonical primitives
+# ---------------------------------------------------------------------------
+
+
+def test_agent_workflow_output_names_canonical_helpers():
+    """Calling ``agent_workflow()`` must still produce a script that
+    names the four downstream primitives. Catches the orchestrator
+    content drifting away from the helpers it advertises.
+    """
+    df = pd.DataFrame({"u": [1], "t": [0], "tr": [0], "y": [0.0]})
+    out = diff_diff.agent_workflow(
+        df,
+        unit="u",
+        time="t",
+        treatment="tr",
+        outcome="y",
+        verbose=False,
+    )
+    for name in (
+        "profile_panel",
+        "get_llm_guide",
+        "practitioner_next_steps",
+        "BusinessReport",
+    ):
+        assert name in out["script"], f"{name!r} missing from agent_workflow script"
+
+
+def test_agent_workflow_fit_candidates_resolve_on_diff_diff():
+    """Every estimator advertised in ``agent_workflow().fit_candidates``
+    must be a real attribute on the ``diff_diff`` namespace.
+
+    Mirrors the per-PR test in ``test_agent_workflow.py``; here we
+    re-assert as part of the discoverability contract so a rename
+    that escapes the per-PR suite is still caught at the surface
+    level.
+    """
+    df = pd.DataFrame({"u": [1], "t": [0], "tr": [0], "y": [0.0]})
+    out = diff_diff.agent_workflow(
+        df,
+        unit="u",
+        time="t",
+        treatment="tr",
+        outcome="y",
+        verbose=False,
+    )
+    missing = [n for n in out["fit_candidates"] if not hasattr(diff_diff, n)]
+    assert not missing, f"fit_candidates not on diff_diff namespace: {missing}"
+
+
+# ---------------------------------------------------------------------------
+# Cross-surface sanity (all five agent-facing entrypoints callable)
+# ---------------------------------------------------------------------------
+
+
+@pytest.mark.parametrize(
+    "name",
+    sorted(
+        {
+            "agent_workflow",
+            "profile_panel",
+            "get_llm_guide",
+            "practitioner_next_steps",
+            "BusinessReport",
+        }
+    ),
+)
+def test_agent_facing_entrypoint_callable(name):
+    """Each agent-facing primitive must remain a callable attribute on
+    the top-level package.
+
+    Catches an accidental replacement of one of these names with a
+    module or constant (which would silently break the agent's
+    ``help(name)`` follow-up).
+    """
+    obj = getattr(diff_diff, name)
+    assert callable(obj), f"{name!r} is not callable on the diff_diff namespace"