Address codex R5 P2+P3 on HAD: stute_test scope + verdict-language accuracy

igerber · claude · igerber · commit 0fd456447690 · 2026-05-20T08:10:00.000-04:00
- P2 (Methodology): tightened stute_test / yatchew_hr_test / class docstring
  to correctly attribute Assumption 7 (mean-independence pre-trends) to
  joint_pretrends_test (intercept-only residual form via
  null_form="mean_independence") rather than to the raw stute_test helper.
  The raw stute_test always fits dy ~ 1 + d and tests Assumption 8 linearity.
  Updated all 5 surfaces: stute_test Notes, yatchew_hr_test Notes (now also
  documents the null="linearity" vs null="mean_independence" kwarg and no
  longer references the nonexistent "residual_form"), HeterogeneousAdoptionDiD
  class docstring (split into 3 distinct ADJACENT-condition bullets), REGISTRY
  HAD checklist L2694 closure, paper-review L192 closure.

- P3 (Documentation/Tests): the new workflow / REGISTRY / paper-review prose
  said the composite verdict surfaces the Assumption 5/6 caveat. Actually
  the verdict string only flags the Assumption 7 step-2 gap on the
  aggregate="overall" path. Reworded in 4 surfaces (workflow Notes, HAD class
  docstring, REGISTRY L2694, paper-review L192) to clarify that the
  Assumption 5/6 caveat is surfaced by (a) the Design 1 fit-time UserWarning
  and (b) T21 tutorial prose — NOT by the workflow verdict string.

- P3 (Documentation/Tests): yatchew_hr_test Notes referenced a nonexistent
  "residual_form" selector. Replaced with the correct kwarg name "null"
  ({"linearity", "mean_independence"}) and described both branches.
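
To make the two residual definitions above concrete, here is a minimal sketch
in plain Python. It is illustrative only and does not call diff_diff; the
helper names (linearity_residuals, mean_independence_residuals) and the toy
data are hypothetical, and the actual tests apply further steps to these
residuals that are not shown here.

```python
# Illustrative only: plain-Python residuals, NOT the diff_diff implementation.


def linearity_residuals(d, dy):
    """Residuals from OLS of dy on an intercept and d (dy ~ 1 + d)."""
    n = len(d)
    d_bar = sum(d) / n
    dy_bar = sum(dy) / n
    sxx = sum((x - d_bar) ** 2 for x in d)
    sxy = sum((x - d_bar) * (y - dy_bar) for x, y in zip(d, dy))
    slope = sxy / sxx
    intercept = dy_bar - slope * d_bar
    return [y - intercept - slope * x for x, y in zip(d, dy)]


def mean_independence_residuals(dy):
    """Residuals from the intercept-only fit (dy ~ 1)."""
    dy_bar = sum(dy) / len(dy)
    return [y - dy_bar for y in dy]


# Hypothetical toy data in which dy is exactly linear in the dose d.
dose = [1.0, 2.0, 3.0, 4.0, 5.0]
delta_y = [2.0 * x + 1.0 for x in dose]

lin_resid = linearity_residuals(dose, delta_y)      # all zero: linear fit is exact
mi_resid = mean_independence_residuals(delta_y)     # still trends with the dose
```

On this toy data the dy ~ 1 + d residuals vanish while the dy ~ 1 residuals
still move with the dose, which is why the linearity null (Assumption 8) and
the mean-independence null (Assumption 7) need separate residual definitions.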

All 35 methodology tests pass; full HAD + drift sweep 665 passed; lint clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/diff_diff/had.py b/diff_diff/had.py
@@ -2615,18 +2615,29 @@ class HeterogeneousAdoptionDiD:
     is on the Design 1 family (``continuous_near_d_lower`` or
     ``mass_point``) so users are not silently led to interpret point
     estimates as full point identification. The available pre-tests
-    (:func:`diff_diff.qug_test`, :func:`diff_diff.stute_test`,
-    :func:`diff_diff.yatchew_hr_test`) verify ADJACENT identifying
-    conditions: QUG tests the Theorem 4 / Design 1' support-infimum
-    null ``d_lower = 0`` — adjacent evidence on the ``d_lower = 0``
-    clause of Assumption 4 only, NOT a test of the full Assumption 4
-    statement (which also covers boundary-density positivity,
-    conditional-mean smoothness, conditional-variance regularity, and
-    bandwidth conditions); Assumption 7 mean-independence pre-trends
-    via Stute; Assumption 8 linearity / homogeneity via Yatchew. None
-    of these test Assumptions 5 or 6 directly. T21 (HAD pretest
-    workflow tutorial) shows the verdict-language convention that
-    surfaces this caveat to end users.
+    verify ADJACENT identifying conditions:
+
+    - :func:`diff_diff.qug_test`: Theorem 4 / Design 1' support-infimum
+      null ``d_lower = 0`` (adjacent evidence on the ``d_lower = 0``
+      clause of Assumption 4 only, NOT a test of the full Assumption 4
+      statement which also covers boundary-density positivity,
+      conditional-mean smoothness, conditional-variance regularity, and
+      bandwidth conditions).
+    - :func:`diff_diff.stute_test` / :func:`diff_diff.yatchew_hr_test`:
+      Assumption 8 linearity of ``E[ΔY | D_2]`` in ``D_2`` (residuals
+      from ``dy ~ 1 + d``).
+    - :func:`diff_diff.joint_pretrends_test`: Assumption 7
+      mean-independence pre-trends across multi-period placebos
+      (intercept-only residual form via ``null_form="mean_independence"``;
+      the raw ``stute_test`` / ``yatchew_hr_test`` helpers do NOT cover
+      Assumption 7 on their own).
+
+    None of these test Assumptions 5 or 6 directly. The Assumption 5/6
+    non-testability caveat is surfaced by the Design 1 fit-time
+    ``UserWarning`` and by T21 (HAD pretest workflow tutorial) prose,
+    NOT by the composite workflow verdict string (which only flags the
+    Assumption 7 step-2 gap on the two-period ``aggregate="overall"``
+    path).
 
     **Diagnostics coverage.** ``HeterogeneousAdoptionDiDResults.bandwidth_diagnostics``
     and ``.bias_corrected_fit`` are populated only on the continuous
diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py
@@ -1653,16 +1653,21 @@ def stute_test(
     Notes
     -----
     **Scope (what this test does NOT cover).** ``stute_test`` targets
-    paper Assumption 8 (mean-independence of treatment effects /
-    pre-trends linearity, depending on the residual definition). It does
+    paper Assumption 8 (linearity of ``E[ΔY | D_2]`` in ``D_2``) — the
+    raw helper always fits ``dy ~ 1 + d`` and tests the linearity null;
+    it does NOT target Assumption 7 mean-independence pre-trends on its
+    own. For Assumption 7 mean-independence (residuals from intercept-
+    only ``dy ~ 1``), use :func:`joint_pretrends_test` (which routes
+    ``null_form="mean_independence"`` into the joint CvM core). It does
     NOT and CANNOT test Assumptions 5 and 6 from de Chaisemartin et al.
     (2026) Section 3.1.2, which are required for sign / point
     identification of ``WAS_{d_lower}`` on the Design 1 family
     (``d_lower > 0``). Assumptions 5/6 are non-testable via pre-trends
     (boundary-conditional expectations and counterfactual-mean alignment
-    statements). See :class:`HeterogeneousAdoptionDiD` class docstring
-    Notes for the full statement and T21 for the verdict-language
-    convention that surfaces this gap to end users.
+    statements); they are surfaced by the Design 1 fit-time
+    ``UserWarning`` and by T21 tutorial prose, NOT by the workflow
+    verdict string. See :class:`HeterogeneousAdoptionDiD` class
+    docstring Notes for the full statement.
 
     Sample-size gate: below ``G = 10`` the CvM statistic is not
     well-calibrated. In that case the function emits ``UserWarning`` and
@@ -2141,15 +2146,18 @@ def yatchew_hr_test(
     Notes
     -----
     **Scope (what this test does NOT cover).** ``yatchew_hr_test`` targets
-    paper Assumption 8 (linearity of ``E[ΔY | D_2]`` in ``D_2``, or
-    mean-independence depending on ``residual_form``). It does NOT and
-    CANNOT test Assumptions 5 and 6 from de Chaisemartin et al. (2026)
-    Section 3.1.2, which are required for sign / point identification of
-    ``WAS_{d_lower}`` on the Design 1 family (``d_lower > 0``).
-    Assumptions 5/6 are non-testable via pre-trends. See
-    :class:`HeterogeneousAdoptionDiD` class docstring Notes for the full
-    statement and T21 for the verdict-language convention that surfaces
-    this gap to end users.
+    paper Assumption 8 (linearity of ``E[ΔY | D_2]`` in ``D_2``) under
+    ``null="linearity"`` (default); ``null="mean_independence"`` swaps
+    the residual definition to intercept-only ``dy ~ 1`` for R parity
+    with ``YatchewTest::yatchew_test(order=0)`` on pre-trend placebos.
+    It does NOT and CANNOT test Assumptions 5 and 6 from de
+    Chaisemartin et al. (2026) Section 3.1.2, which are required for
+    sign / point identification of ``WAS_{d_lower}`` on the Design 1
+    family (``d_lower > 0``). Assumptions 5/6 are non-testable via
+    pre-trends; they are surfaced by the Design 1 fit-time
+    ``UserWarning`` and by T21 tutorial prose, NOT by the workflow
+    verdict string. See :class:`HeterogeneousAdoptionDiD` class
+    docstring Notes for the full statement.
 
     Sample-size gate: below ``G = 3`` the difference-variance estimator
     is undefined; the function emits ``UserWarning`` and returns NaN
@@ -4599,12 +4607,14 @@ def did_had_pretest_workflow(
     from de Chaisemartin et al. (2026) Section 3.1.2, which are required
     for sign / point identification of ``WAS_{d_lower}`` on the Design 1
     family (``d_lower > 0``). Assumptions 5/6 are non-testable via
-    pre-trends. The composite verdict surfaces this gap explicitly via
-    its ``"Assumption 7 gap"`` (when QUG defers) and via the
+    pre-trends. The composite verdict string does NOT mention
+    Assumptions 5 or 6 — it only flags the Assumption 7 step-2 gap on
+    the two-period ``aggregate="overall"`` path. The Assumption 5/6
+    caveat is surfaced separately by (a) the
     ``HeterogeneousAdoptionDiD.fit()`` fit-time ``UserWarning`` (which
-    fires whenever the resolved design is Design 1 family). T21 (HAD
-    pretest workflow tutorial) shows the recommended user-facing
-    verdict-language convention.
+    fires whenever the resolved design is Design 1 family —
+    ``continuous_near_d_lower`` or ``mass_point``) and (b) T21 (HAD
+    pretest workflow tutorial) tutorial prose.
 
     Survey/weighted data (Phase 4.5 C): under ``survey=`` or ``weights=``,
     the workflow:
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -2691,7 +2691,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in
 - [x] Phase 5 (partial): README catalog one-liner, bundled `llms.txt` `## Estimators` entry, `docs/api/had.rst` (autoclass for the three classes), and `docs/references.rst` citation landed in PR #372 docs refresh.
 - [x] Phase 5 (wave 2 first slice, PR #409): T21 HAD pretest workflow tutorial (`docs/tutorials/21_had_pretest_workflow.ipynb`) — composite pre-test walkthrough for `did_had_pretest_workflow`. Uses a `Uniform[$0.01K, $50K]` dose-distribution variant of T20's brand-campaign panel (true support strictly positive but near-zero, chosen so QUG fails-to-reject `H0: d_lower = 0` in finite sample). Walks through `aggregate="overall"` (Steps 1 + 3 only, verdict explicitly flags Step 2 deferral) and upgrades to `aggregate="event_study"` (joint pre-trends Stute + joint homogeneity Stute close the gap). Side panel exercises both `yatchew_hr_test` null modes (`linearity` vs `mean_independence`). Companion drift-test file `tests/test_t21_had_pretest_workflow_drift.py` (17 tests pinning panel composition, both verdict pivots, structural anchors, deterministic stats, bootstrap p-value tolerance bands per backend, and `HAD(design="auto")` resolution to `continuous_at_zero` on this panel).
 - [x] Phase 5 (wave 2 second slice): T22 weighted/survey HAD tutorial (`docs/tutorials/22_had_survey_design.ipynb`) - shipped as the follow-up to PR #432. End-to-end walkthrough of `HeterogeneousAdoptionDiD` + `did_had_pretest_workflow` under `SurveyDesign(weights, strata, psu, fpc)` on a BRFSS-shape state-rollout panel (5 strata x 6 PSUs/stratum x 2 states/PSU = 60 states; post-stratification raking weights with CV ~ 0.30; FPC = 30 PSUs/stratum). Companion drift-test file `tests/test_t22_had_survey_design_drift.py` (32 tests pinning panel composition, naive-vs-survey SE inflation direction, design auto-detection, event-study cband-vs-pointwise width ordering, `_QUG_DEFERRED_SUFFIX` substring on `report.verdict` for both overall and event-study paths, the distinct `report.summary()` QUG-skip note on the event-study path, deterministic Yatchew sigma2_*, bootstrap p-value anchored windows of total width 0.30 (± 0.15 around seeded centers) per `feedback_strata_bootstrap_path_divergence`, workflow-surface separation between overall and event-study paths, and the weighted point-estimation contract via the `_fit_continuous` algebraic identity).
-- [x] Documentation of non-testability of Assumptions 5 and 6. **Closed 2026-05-20:** `HeterogeneousAdoptionDiD` class docstring carries a "Non-testable assumptions (paper Section 3.1.2)" Notes block; `qug_test` / `stute_test` / `yatchew_hr_test` / `did_had_pretest_workflow` Notes sections carry "Scope (what this test does NOT cover)" clauses explicitly stating they verify ADJACENT assumptions (Assumption 4 / 7 / 8) and CANNOT test Assumptions 5 or 6. Belt-and-suspenders: `HAD.fit()` emits a `UserWarning` in `diff_diff/had.py` (search for "---- Assumption 5/6 warning on Design 1 paths ----") whenever the resolved design is Design 1 family (`continuous_near_d_lower` or `mass_point`). T21 surfaces the caveat to end users via the verdict language.
+- [x] Documentation of non-testability of Assumptions 5 and 6. **Closed 2026-05-20:** `HeterogeneousAdoptionDiD` class docstring carries a "Non-testable assumptions (paper Section 3.1.2)" Notes block; `qug_test` / `stute_test` / `yatchew_hr_test` / `did_had_pretest_workflow` Notes sections carry "Scope (what this test does NOT cover)" clauses explicitly stating they verify ADJACENT identifying conditions (QUG: support-infimum null `d_lower = 0`; Stute / Yatchew: Assumption 8 linearity; `joint_pretrends_test`: Assumption 7 mean-independence) and CANNOT test Assumptions 5 or 6. The composite workflow verdict string does NOT mention Assumptions 5 or 6 — it only flags the Assumption 7 step-2 gap on the two-period `aggregate="overall"` path. The Assumption 5/6 non-testability caveat is surfaced separately by (a) `HAD.fit()`'s fit-time `UserWarning` in `diff_diff/had.py` (search for "---- Assumption 5/6 warning on Design 1 paths ----") which fires whenever the resolved design is Design 1 family (`continuous_near_d_lower` or `mass_point`), and (b) T21 (HAD pretest workflow tutorial) tutorial prose.
 - [x] Warnings for staggered treatment timing (redirect to `ChaisemartinDHaultfoeuille`). **Closed 2026-05-20:** fail-closed `ValueError` at `diff_diff/had.py:1511` (see Deviations § "Library extension: Staggered-timing fail-closed" for the rationale on raising vs warning).
 - [ ] `NotImplementedError` phase pointer when `covariates=` is passed (Theorem 6 future work). **Status 2026-05-20:** current behavior is a Python `TypeError` (the `covariates=` kwarg is not in the `HAD.fit()` signature). Adding an explicit `**kwargs`-trap with `NotImplementedError` and a Theorem 6 pointer is a follow-up PR; tracked in `TODO.md` as Low priority — the existing TypeError is fail-closed.
 
diff --git a/docs/methodology/papers/dechaisemartin-2026-review.md b/docs/methodology/papers/dechaisemartin-2026-review.md
@@ -189,7 +189,7 @@ Alternative to Stute when `G` is large or heteroskedasticity is suspected.
 - [x] Composite workflow `did_had_pretest_workflow()` (paper Section 4.2-4.3). **Phase 3 implementation (2026-04):** `aggregate="overall"` (default, two-period) runs QUG + Stute + Yatchew on a two-period panel; step 2 is NOT run on this path because a two-period panel has no pre-period placebo horizon. **Phase 3 follow-up (2026-04):** `aggregate="event_study"` (multi-period) runs QUG at F + joint pre-trends Stute + joint homogeneity-linearity Stute; closes the paper step-2 gap.
 - [x] Warnings for staggered treatment timing (direct users to existing `ChaisemartinDHaultfoeuille` in diff-diff). **Phase 4 closure (2026-05-20):** fail-closed `ValueError` at `diff_diff/had.py:1511` when multiple first-treat cohorts are detected without `first_treat_col`; the error message directs the user to either supply `first_treat_col` (which activates the last-cohort + never-treated auto-filter per Appendix B.2) or to use `ChaisemartinDHaultfoeuille` (`did_multiplegt_dyn`) for full staggered support. The fail-closed choice (over `UserWarning`) is documented in REGISTRY Deviations § "Staggered-timing fail-closed" as a library extension toward stricter safety than the paper's "Warn" prescription.
 - [ ] Warnings for extensive-margin effects / positive mass of untreated (not fatal; suggests running existing DiD). **Status 2026-05-20 (partial):** `qug_test()` filters zero-dose observations upfront with a `UserWarning` naming the exclusion count — surfaces the *presence* of extensive-margin / positive-mass-of-untreated units to users running pre-tests. The paper-language "suggests running existing DiD" recommendation is NOT a separate fit-time warning on the main `HeterogeneousAdoptionDiD.fit()` path; this item remains open as a Low-priority follow-up tracked in `TODO.md`.
-- [x] Documentation of non-testability of Assumptions 5 and 6. **Phase 4 closure (2026-05-20):** `HeterogeneousAdoptionDiD.fit()` emits a `UserWarning` at fit time when `resolved_design ∈ {continuous_near_d_lower, mass_point}` (Design 1 family) explicitly flagging that point identification of `WAS_{d_lower}` requires Assumption 6, sign identification requires Assumption 5, and NEITHER is testable via pre-trends (`diff_diff/had.py`, search for "---- Assumption 5/6 warning on Design 1 paths ----"). The `HeterogeneousAdoptionDiD` class docstring + `qug_test` / `stute_test` / `yatchew_hr_test` / `did_had_pretest_workflow` Notes sections cross-reference this and explicitly state that the available pre-tests verify ADJACENT identifying conditions (QUG tests the Theorem 4 / Design 1' support-infimum null `d_lower = 0` — adjacent evidence on the `d_lower = 0` clause of Assumption 4 only, NOT a test of full Assumption 4's boundary-density / conditional-mean smoothness / variance regularity statement; Assumption 7 mean-independence pre-trends via Stute; Assumption 8 linearity / homogeneity via Yatchew) and do NOT and CANNOT test Assumptions 5 or 6 directly. T21 verdict logic surfaces the caveat to end users.
+- [x] Documentation of non-testability of Assumptions 5 and 6. **Phase 4 closure (2026-05-20):** `HeterogeneousAdoptionDiD.fit()` emits a `UserWarning` at fit time when `resolved_design ∈ {continuous_near_d_lower, mass_point}` (Design 1 family) explicitly flagging that point identification of `WAS_{d_lower}` requires Assumption 6, sign identification requires Assumption 5, and NEITHER is testable via pre-trends (`diff_diff/had.py`, search for "---- Assumption 5/6 warning on Design 1 paths ----"). The `HeterogeneousAdoptionDiD` class docstring + `qug_test` / `stute_test` / `yatchew_hr_test` / `did_had_pretest_workflow` Notes sections cross-reference this and explicitly state that the available pre-tests verify ADJACENT identifying conditions: QUG tests the Theorem 4 / Design 1' support-infimum null `d_lower = 0` — adjacent evidence on the `d_lower = 0` clause of Assumption 4 only, NOT a test of full Assumption 4's boundary-density / conditional-mean smoothness / variance regularity statement; the raw `stute_test` / `yatchew_hr_test` helpers test Assumption 8 linearity (residuals from `dy ~ 1 + d`); `joint_pretrends_test` tests Assumption 7 mean-independence (intercept-only residuals via `null_form="mean_independence"`). None of these test Assumptions 5 or 6 directly. The composite workflow verdict string does NOT mention Assumptions 5 or 6 — it only flags the Assumption 7 step-2 gap on the two-period `aggregate="overall"` path. The Assumption 5/6 caveat is surfaced separately by the Design 1 fit-time `UserWarning` and by T21 tutorial prose.
 - [x] Multi-period event-study extension (Appendix B.2). **Phase 2b implementation (2026-04):** `aggregate="event_study"` returns per-event-time WAS estimates using uniform `F-1` anchor. Staggered-timing contract (see L190 closure for full statement): when `first_treat_col` is supplied, the panel auto-filters to last-cohort + never-treated units with a `UserWarning` per Appendix B.2 prescription; when omitted on a multi-cohort panel, the estimator raises `ValueError` (fail-closed, see REGISTRY § "Library extension: Staggered-timing fail-closed"). Pointwise CIs per horizon (no joint cross-horizon covariance; matches paper's Pierce-Schott Figure 2). Pre-period placebos at `e <= -2`; the anchor `e = -1` is skipped since `ΔY = 0` there by construction.
 - [x] Joint Stute tests (paper Section 4.2 step 2 + Section 4.3 joint extension, pages 23-25 + 32). **Phase 3 follow-up (2026-04):** `stute_joint_pretest()` (residuals-in core) + `joint_pretrends_test()` (mean-independence null) + `joint_homogeneity_test()` (linearity null) in `diff_diff/had_pretests.py`. Sum-of-CvMs aggregation, shared-η Mammen wild bootstrap across horizons (Delgado-Manteiga 2001), per-horizon exact-linear short-circuit. Paper Eq (18) linear-trend detrending variant (Section 5.2 Pierce-Schott p=0.51) deferred to Phase 4 replication harness where the published value serves as parity anchor.