From 9945d306704704feaec61e5489e65f603f37017b Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 14 May 2026 20:05:51 -0400
Subject: [PATCH 1/2] Fix #409 holistic audit residuals: T20+T21
 notebook-narrative drift coverage + TODO status
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Holistic re-audit of merged #409 (Tutorial 21 HAD pre-test workflow notebook + drift tests) + #425 (cleanup: pin deterministic p-values). The per-PR cleanup PR review on #425 couldn't see the combined post-PR holistic state. 4 local agentic codex rounds (R0-R3) surfaced a repeated rendered-surface drift gap on both the T21 PR added and the sibling T20 drift test, plus a small TODO status reconciliation.

**Rendered-surface drift gap** — both `test_t20_*_drift.py` and `test_t21_*_drift.py` had file-level docstrings claiming to check pinned numbers "against the values quoted in the tutorial markdown", but every prior assert just re-derived values from the locked DGP and compared to a hardcoded constant — the .ipynb files themselves were never read. Because `nbsphinx_execute = "never"` in `docs/conf.py`, CI cannot detect drift between drift-test constants and the committed tutorial via notebook re-execution; the two surfaces can diverge silently.

Closed via a new shared helper `tests/_tutorial_drift.py` (`notebook_markdown` / `notebook_output_text` / `notebook_rendered_text` + `assert_quotes_in_rendered`) used by:

- `tests/test_t21_had_pretest_workflow_drift.py::test_notebook_quotes_match_pinned_constants` — pins 16 load-bearing rendered substrings: verdict text on both `aggregate="overall"` and `aggregate="event_study"` paths; structural-field anchors; QUG / Stute / Yatchew p-values (`0.2059`, `0.6860`, `0.0720`, `0.7630`, `0.4917`, `0.2899`) the file already pins analytically; design / target anchors (`continuous_at_zero`, `WAS`); rendered Yatchew sigma2_lin values (`6250.2569`, `6.5340`, `7.0076`).
- `tests/test_t20_had_brand_campaign_drift.py::test_notebook_quotes_match_pinned_constants` — pins the headline WAS estimate prose (`100 weekly visits`, `98.6 to 101.4`), the design auto-detect outcome (`continuous_near_d_lower`), the target label (`WAS_d_lower`), and the placebo magnitude / sample-summary prose (`±0.06`, `median`, `$25K`) the file already pins analytically.

**TODO status reconciliation** — TODO.md still said T21 was "PR-pending" while CHANGELOG and REGISTRY already credited PR #409. Updated.

4 files, +244/-1. No methodology changes. No estimator/inference behavior change. Tests-only + one TODO line.

Two pilot findings NOT included in this fix-PR:

1. The P3 stale verdict text (`"paper step 2 deferred to Phase 3 follow-up"`) — the codex correctly observes the Phase 3 follow-up has shipped as `aggregate="event_study"`, so the wording is historically frozen. Changing it is a user-facing API string change touching 6+ surfaces (`had_pretests.py` source + 4 test assertions + notebook + CHANGELOG); deferred for explicit guidance before changing the verdict library-wide.

2. The codex's R3 P1 about syncing `docs/_review/t21_notebook_extract.md` against the notebook — phantom finding. That file was deleted from main in a follow-up PR (it's a temporary review aid per its own header); pilot-409 has it only because cherry-picking #409 brought it in. The fix-PR off main has nothing to sync.

Per `feedback_holistic_fix_on_repeated_p1s`: shipping after the same-class finding repeated 4 rounds rather than continuing to enumerate anchors.
---
 TODO.md                                      |   2 +-
 tests/_tutorial_drift.py                     | 113 +++++++++++++++++++
 tests/test_t20_had_brand_campaign_drift.py   |  49 ++++++++
 tests/test_t21_had_pretest_workflow_drift.py |  79 +++++++++++++
 4 files changed, 242 insertions(+), 1 deletion(-)
 create mode 100644 tests/_tutorial_drift.py

diff --git a/TODO.md b/TODO.md
index aaae5dfb..5128c648 100644
--- a/TODO.md
+++ b/TODO.md
@@ -112,7 +112,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | `HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred. | `tests/test_had_pretests.py` | Phase 3 | Low |
 | `HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low |
 | `HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. | `benchmarks/`, `tests/` | Phase 2a | Low |
-| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial (T22 weighted/survey HAD tutorial). T21 HAD pretest workflow notebook landed (PR-pending); `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1 (PR #402). | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a | Low |
+| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial (T22 weighted/survey HAD tutorial). T21 HAD pretest workflow notebook landed in PR #409; `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1 (PR #402). | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a | Low |
 | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low |
 | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium |
 | SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low |
diff --git a/tests/_tutorial_drift.py b/tests/_tutorial_drift.py
new file mode 100644
index 00000000..815e2546
--- /dev/null
+++ b/tests/_tutorial_drift.py
@@ -0,0 +1,113 @@
+"""Shared helpers for tutorial-drift tests (T20, T21, ...).
+
+The HAD tutorial drift tests pin numbers / verdict strings against the
+locked DGP + seed. Without these helpers each drift test re-derived
+numbers but never verified that the rendered notebook surface (markdown
+prose + executed output cells) actually quotes those values. Because
+``nbsphinx_execute = "never"`` in ``docs/conf.py``, CI cannot detect
+drift between the pinned constants and the committed tutorial via
+notebook re-execution; the constants and the notebook can diverge
+silently. These helpers parse the .ipynb JSON directly so each
+tutorial-drift test file can cross-check its pins against the
+rendered surface it claims to protect.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+from typing import Iterable
+
+
+def _read_notebook(nb_relpath: str) -> dict:
+    """Load a notebook by repo-relative path (e.g. ``docs/tutorials/X.ipynb``)."""
+    nb_path = Path(__file__).resolve().parents[1] / nb_relpath
+    return json.loads(nb_path.read_text())
+
+
+def notebook_markdown(nb_relpath: str) -> str:
+    """Return all markdown cells concatenated into one string."""
+    nb = _read_notebook(nb_relpath)
+    parts = []
+    for cell in nb["cells"]:
+        if cell["cell_type"] != "markdown":
+            continue
+        src = cell["source"]
+        if isinstance(src, list):
+            src = "".join(src)
+        parts.append(src)
+    return "\n".join(parts)
+
+
+def notebook_output_text(nb_relpath: str) -> str:
+    """Return all executed-output text (``stream`` and ``execute_result``
+    text/plain) from every code cell, concatenated.
+
+    Covers the rendered numeric surface that markdown alone misses —
+    e.g. printed verdict strings, formatted summary tables, p-values.
+    """
+    nb = _read_notebook(nb_relpath)
+    parts = []
+    for cell in nb["cells"]:
+        if cell["cell_type"] != "code":
+            continue
+        for out in cell.get("outputs", []):
+            # stream-style outputs (print / stdout / stderr)
+            text = out.get("text")
+            if text is not None:
+                parts.append("".join(text) if isinstance(text, list) else text)
+            # execute_result / display_data with text/plain
+            data = out.get("data") or {}
+            plain = data.get("text/plain")
+            if plain is not None:
+                parts.append("".join(plain) if isinstance(plain, list) else plain)
+    return "\n".join(parts)
+
+
+def notebook_rendered_text(nb_relpath: str) -> str:
+    """Return markdown + executed-output text together — the full
+    rendered surface a reader sees on RTD."""
+    return notebook_markdown(nb_relpath) + "\n" + notebook_output_text(nb_relpath)
+
+
+def assert_quotes_in_rendered(
+    nb_relpath: str,
+    expected_quotes: Iterable[str],
+    *,
+    surface: str = "rendered",
+) -> None:
+    """Assert each expected substring appears in the chosen rendered surface.
+
+    Parameters
+    ----------
+    nb_relpath
+        Notebook path relative to repo root (e.g.
+        ``"docs/tutorials/21_had_pretest_workflow.ipynb"``).
+    expected_quotes
+        Iterable of substrings that MUST appear in the chosen rendered
+        surface. Each is checked independently; the assertion message
+        lists every missing quote so a single failure surfaces all of
+        them.
+    surface
+        Which slice of the notebook to check: ``"markdown"`` (prose
+        only), ``"output"`` (executed output cells only), or
+        ``"rendered"`` (both — default; matches what a reader sees
+        on RTD).
+    """
+    if surface == "markdown":
+        text = notebook_markdown(nb_relpath)
+    elif surface == "output":
+        text = notebook_output_text(nb_relpath)
+    elif surface == "rendered":
+        text = notebook_rendered_text(nb_relpath)
+    else:
+        raise ValueError(f"surface must be 'markdown' / 'output' / 'rendered'; got {surface!r}")
+    missing = [q for q in expected_quotes if q not in text]
+    assert not missing, (
+        f"Tutorial {nb_relpath!r} ({surface=}) is missing load-bearing "
+        f"quoted values that the pinned drift constants assume are "
+        f"rendered verbatim. Either the notebook drifted from the "
+        f"locked DGP output (rerun the tutorial against the pinned "
+        f"seed) or the drift-test constants were updated without "
+        f"updating the tutorial. Missing: {missing}"
+    )
diff --git a/tests/test_t20_had_brand_campaign_drift.py b/tests/test_t20_had_brand_campaign_drift.py
index 10c4d1c9..f1aa6238 100644
--- a/tests/test_t20_had_brand_campaign_drift.py
+++ b/tests/test_t20_had_brand_campaign_drift.py
@@ -270,3 +270,52 @@ def test_event_study_pre_placebos_cover_zero(event_study_result):
         i = event_times.index(e)
         assert abs(atts[i]) < 0.1, (e, atts[i])
         assert ci_lows[i] <= 0.0 <= ci_highs[i], (e, ci_lows[i], ci_highs[i])
+
+
+# =============================================================================
+# Notebook-narrative cross-check
+# =============================================================================
+#
+# Mirror of the T21 notebook cross-check (PR #409 holistic re-audit).
+# The asserts above re-derive numbers from the locked DGP+seed but
+# never verify that the rendered T20 tutorial actually quotes those
+# same numbers. Because ``nbsphinx_execute = "never"`` in
+# ``docs/conf.py``, CI cannot detect drift between the pinned drift
+# constants and the committed tutorial via notebook re-execution.
+# Use the shared helper from ``tests/_tutorial_drift.py`` to assert
+# each load-bearing quote is present in the rendered notebook
+# surface (markdown prose + executed output cells).
+
+
+T20_NOTEBOOK = "docs/tutorials/20_had_brand_campaign.ipynb"
+
+
+def test_notebook_quotes_match_pinned_constants():
+    """Every load-bearing T20 verdict / number must appear verbatim
+    in the rendered notebook (markdown prose + executed output cells).
+
+    Closes the same gap fixed for T21 in the PR #409 holistic
+    re-audit: the file-level docstring claimed "check against the
+    values quoted in the tutorial markdown" but every prior assert
+    re-derived numbers from the DGP and compared them to a hardcoded
+    constant, leaving the notebook completely uncross-checked.
+    """
+    from tests._tutorial_drift import assert_quotes_in_rendered
+
+    expected_quotes = [
+        # Headline WAS estimate quoted in cell 10 narrative.
+        "100 weekly visits",
+        # CI quoted alongside the headline.
+        "98.6 to 101.4",
+        # Design auto-detect outcome.
+        "continuous_near_d_lower",
+        # Target parameter label used across cells 8 / 12 / 13.
+        "WAS_d_lower",
+        # Placebo-magnitude prose claim (locked analytically above by
+        # test_event_study_pre_atts_near_zero with the ±0.1 envelope).
+        "±0.06",
+        # Sample summary in the design-fit narrative.
+        "median",
+        "$25K",
+    ]
+    assert_quotes_in_rendered(T20_NOTEBOOK, expected_quotes, surface="rendered")
diff --git a/tests/test_t21_had_pretest_workflow_drift.py b/tests/test_t21_had_pretest_workflow_drift.py
index c1e155a6..cbc1433e 100644
--- a/tests/test_t21_had_pretest_workflow_drift.py
+++ b/tests/test_t21_had_pretest_workflow_drift.py
@@ -366,3 +366,82 @@ def test_yatchew_side_panel_mean_independence_passes(yatchew_side_panel_inputs):
     assert res_mi.sigma2_lin > res_lin.sigma2_lin
     # And the differencing variance (sigma2_diff) is shared across modes.
     assert round(res_mi.sigma2_diff, 4) == round(res_lin.sigma2_diff, 4)
+
+
+# =============================================================================
+# Notebook-narrative cross-check
+# =============================================================================
+#
+# The asserts above re-derive numbers from the locked DGP+seed but do NOT
+# verify that the rendered tutorial actually quotes those same numbers.
+# Without this layer, the notebook prose can drift independently of the
+# library numerics (or vice versa) and CI stays green because
+# `nbsphinx_execute = "never"` in `docs/conf.py` (CI doesn't re-execute
+# notebooks during build). Use the shared tutorial-drift helper that
+# parses the notebook JSON and checks both markdown prose AND executed
+# output cells (since the load-bearing verdict strings appear in
+# print()-rendered output blocks, not just markdown prose).
+
+
+T21_NOTEBOOK = "docs/tutorials/21_had_pretest_workflow.ipynb"
+
+
+def test_notebook_quotes_match_pinned_constants():
+    """Every load-bearing verdict/value this file pins must appear
+    verbatim in the rendered T21 notebook surface (markdown prose +
+    executed output cells).
+
+    Closes the gap the file-level docstring claims to cover ("check
+    against the values quoted in the tutorial markdown") but the rest
+    of the file did not actually exercise — every prior assert
+    re-derives numbers from the DGP and compares them to a hardcoded
+    constant, leaving the notebook completely uncross-checked.
+    Without this test, the notebook can drift independently of the
+    library numerics (or vice versa) and CI stays green because
+    ``nbsphinx_execute = "never"`` in ``docs/conf.py``.
+    """
+    from tests._tutorial_drift import assert_quotes_in_rendered
+
+    expected_quotes = [
+        # ---- Verdict-string anchors ----
+        # Overall verdict substring (also pinned in test_overall_workflow_*).
+        # Appears in markdown prose AND in the verdict-print output cell.
+        "paper step 2 deferred to Phase 3 follow-up",
+        # Event-study verdict substring (rendered output of the
+        # aggregate='event_study' workflow + markdown reading-cell).
+        "TWFE admissible under Section 4 assumptions",
+        # Event-study output cell anchor — full verdict header.
+        "QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject",
+        # ---- Structural-field anchors ----
+        "aggregate = 'event_study'",
+        "pretrends_joint populated? True",
+        "homogeneity_joint populated? True",
+        "aggregate = 'overall'",
+        "pretrends_joint populated? False",
+        # ---- Verdict-reading markdown anchors (cell 6) ----
+        "T = D_(1) / (D_(2) - D_(1)) ~ 3.86",
+        "1/alpha - 1 = 19",
+        # ---- Numeric anchors pinned analytically above ----
+        # Every value pinned via round(..., 4) == 0.NNNN in this file
+        # must also appear in the rendered notebook (otherwise the
+        # tutorial prose / output is showing a different number than
+        # the test claims to lock).
+        "0.2059",  # QUG p-value (test_overall_workflow_*)
+        "0.6860",  # Stute p-value tolerance band anchor
+        "0.0720",  # joint-pretrends Stute p-value (event-study)
+        "0.7630",  # joint-homogeneity Stute p-value (event-study)
+        "0.4917",  # Yatchew side-panel null=linearity p-value
+        "0.2899",  # Yatchew side-panel null=mean_independence p-value
+        # Design auto-detect outcome (also pinned by overall-path tests).
+        "continuous_at_zero",
+        "WAS",
+        # Overall Yatchew p-value (analytical short-circuit on this DGP).
+        "1.0000",
+        # Overall Yatchew sigma2_lin in the rendered output.
+        "6250.2569",
+        # Side-panel Yatchew sigma2_lin under null='linearity'.
+        "6.5340",
+        # Side-panel Yatchew sigma2_lin under null='mean_independence'.
+        "7.0076",
+    ]
+    assert_quotes_in_rendered(T21_NOTEBOOK, expected_quotes, surface="rendered")

From a01600d8f42e5a2221670c3f6e88f6b9149571cd Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 14 May 2026 20:22:00 -0400
Subject: [PATCH 2/2] CI R1: add missing-notebook skip guard + tighten generic
 anchors

CI AI review on #439 R1:

P1: tests/_tutorial_drift.py read docs/tutorials/*.ipynb
unconditionally, but the Rust-test CI matrix copies only tests/ to
/tmp/tests and runs pytest there (no docs/), so the new tests would
FileNotFoundError instead of skipping cleanly. Added pytest.skip()
guard in _read_notebook() matching the repo convention used in
test_notebook_md_extract.py and test_nprobust_port.py.

P3: tightened two generic anchors that could false-pass:
- T21 'WAS' (matches many incidental occurrences) -> 'target = `WAS`'
  (the exact paper-step-1 phrasing, single occurrence).
- T20 'median' + '$25K' (split, can both pass independently even if
  the sentence drifts) -> 'median ~$25K' (the exact tilde-prefixed
  phrase from the design-fit narrative).
---
 tests/_tutorial_drift.py                     | 18 +++++++++++++++++-
 tests/test_t20_had_brand_campaign_drift.py   |  7 ++++---
 tests/test_t21_had_pretest_workflow_drift.py |  5 ++++-
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/tests/_tutorial_drift.py b/tests/_tutorial_drift.py
index 815e2546..1e31ccfa 100644
--- a/tests/_tutorial_drift.py
+++ b/tests/_tutorial_drift.py
@@ -20,8 +20,24 @@
 
 
 def _read_notebook(nb_relpath: str) -> dict:
-    """Load a notebook by repo-relative path (e.g. ``docs/tutorials/X.ipynb``)."""
+    """Load a notebook by repo-relative path (e.g. ``docs/tutorials/X.ipynb``).
+
+    Skips the calling test via ``pytest.skip(...)`` when the notebook file
+    is not present. The Rust-test CI job (and the isolated-install job)
+    copies only ``tests/`` to ``/tmp/tests`` and runs from there, without
+    ``docs/`` available. The repo convention is to skip cleanly when
+    artifacts are absent rather than fail (see e.g.
+    ``tests/test_notebook_md_extract.py`` and ``tests/test_nprobust_port.py``).
+    """
+    import pytest
+
     nb_path = Path(__file__).resolve().parents[1] / nb_relpath
+    if not nb_path.exists():
+        pytest.skip(
+            f"Notebook {nb_relpath!r} not available in this CI environment "
+            "(isolated-install job copies only tests/, not docs/); "
+            "rendered-surface cross-check requires a full repo checkout."
+        )
     return json.loads(nb_path.read_text())
 
 
diff --git a/tests/test_t20_had_brand_campaign_drift.py b/tests/test_t20_had_brand_campaign_drift.py
index f1aa6238..83fb9485 100644
--- a/tests/test_t20_had_brand_campaign_drift.py
+++ b/tests/test_t20_had_brand_campaign_drift.py
@@ -314,8 +314,9 @@ def test_notebook_quotes_match_pinned_constants():
         # Placebo-magnitude prose claim (locked analytically above by
         # test_event_study_pre_atts_near_zero with the ±0.1 envelope).
         "±0.06",
-        # Sample summary in the design-fit narrative.
-        "median",
-        "$25K",
+        # Sample-summary phrase in the design-fit narrative. Use the
+        # exact tilde-prefixed form so a future drift in the sentence
+        # (e.g. "median around $25K") would surface here.
+        "median ~$25K",
     ]
     assert_quotes_in_rendered(T20_NOTEBOOK, expected_quotes, surface="rendered")
diff --git a/tests/test_t21_had_pretest_workflow_drift.py b/tests/test_t21_had_pretest_workflow_drift.py
index cbc1433e..c2b588b5 100644
--- a/tests/test_t21_had_pretest_workflow_drift.py
+++ b/tests/test_t21_had_pretest_workflow_drift.py
@@ -434,7 +434,10 @@ def test_notebook_quotes_match_pinned_constants():
         "0.2899",  # Yatchew side-panel null=mean_independence p-value
         # Design auto-detect outcome (also pinned by overall-path tests).
         "continuous_at_zero",
-        "WAS",
+        # Use the exact paper-step-1 phrasing with target=`WAS` so we
+        # don't false-pass on the many incidental occurrences of "WAS"
+        # elsewhere in the prose.
+        "target = `WAS`",
         # Overall Yatchew p-value (analytical short-circuit on this DGP).
         "1.0000",
         # Overall Yatchew sigma2_lin in the rendered output.