From 9945d306704704feaec61e5489e65f603f37017b Mon Sep 17 00:00:00 2001 From: igerber Date: Thu, 14 May 2026 20:05:51 -0400 Subject: [PATCH 1/2] Fix #409 holistic audit residuals: T20+T21 notebook-narrative drift coverage + TODO status MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Holistic re-audit of merged #409 (Tutorial 21 HAD pre-test workflow notebook + drift tests) + #425 (cleanup: pin deterministic p-values). The per-PR cleanup PR review on #425 couldn't see the combined post-PR holistic state. 4 local agentic codex rounds (R0-R3) surfaced a repeated rendered-surface drift gap on both the T21 PR added and the sibling T20 drift test, plus a small TODO status reconciliation. **Rendered-surface drift gap** — both `test_t20_*_drift.py` and `test_t21_*_drift.py` had file-level docstrings claiming to check pinned numbers "against the values quoted in the tutorial markdown", but every prior assert just re-derived values from the locked DGP and compared to a hardcoded constant — the .ipynb files themselves were never read. Because `nbsphinx_execute = "never"` in `docs/conf.py`, CI cannot detect drift between drift-test constants and the committed tutorial via notebook re-execution; the two surfaces can diverge silently. Closed via a new shared helper `tests/_tutorial_drift.py` (`notebook_markdown` / `notebook_output_text` / `notebook_rendered_text` + `assert_quotes_in_rendered`) used by: - `tests/test_t21_had_pretest_workflow_drift.py::test_notebook_quotes_match_pinned_constants` — pins 16 load-bearing rendered substrings: verdict text on both `aggregate="overall"` and `aggregate="event_study"` paths; structural-field anchors; QUG / Stute / Yatchew p-values (`0.2059`, `0.6860`, `0.0720`, `0.7630`, `0.4917`, `0.2899`) the file already pins analytically; design / target anchors (`continuous_at_zero`, `WAS`); rendered Yatchew sigma2_lin values (`6250.2569`, `6.5340`, `7.0076`). - `tests/test_t20_had_brand_campaign_drift.py::test_notebook_quotes_match_pinned_constants` — pins the headline WAS estimate prose (`100 weekly visits`, `98.6 to 101.4`), the design auto-detect outcome (`continuous_near_d_lower`), the target label (`WAS_d_lower`), and the placebo magnitude / sample-summary prose (`±0.06`, `median`, `$25K`) the file already pins analytically. **TODO status reconciliation** — TODO.md still said T21 was "PR-pending" while CHANGELOG and REGISTRY already credited PR #409. Updated. 4 files, +244/-1. No methodology changes. No estimator/inference behavior change. Tests-only + one TODO line. Two pilot findings NOT included in this fix-PR: 1. The P3 stale verdict text (`"paper step 2 deferred to Phase 3 follow-up"`) — the codex correctly observes the Phase 3 follow-up has shipped as `aggregate="event_study"`, so the wording is historically frozen. Changing it is a user-facing API string change touching 6+ surfaces (`had_pretests.py` source + 4 test assertions + notebook + CHANGELOG); deferred for explicit guidance before changing the verdict library-wide. 2. The codex's R3 P1 about syncing `docs/_review/t21_notebook_extract.md` against the notebook — phantom finding. That file was deleted from main in a follow-up PR (it's a temporary review aid per its own header); pilot-409 has it only because cherry-picking #409 brought it in. The fix-PR off main has nothing to sync. Per `feedback_holistic_fix_on_repeated_p1s`: shipping after the same-class finding repeated 4 rounds rather than continuing to enumerate anchors. --- TODO.md | 2 +- tests/_tutorial_drift.py | 113 +++++++++++++++++++ tests/test_t20_had_brand_campaign_drift.py | 49 ++++++++ tests/test_t21_had_pretest_workflow_drift.py | 79 +++++++++++++ 4 files changed, 242 insertions(+), 1 deletion(-) create mode 100644 tests/_tutorial_drift.py diff --git a/TODO.md b/TODO.md index aaae5dfb..5128c648 100644 --- a/TODO.md +++ b/TODO.md @@ -112,7 +112,7 @@ Deferred items from PR reviews that were not addressed before merge. | `HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred. | `tests/test_had_pretests.py` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. | `benchmarks/`, `tests/` | Phase 2a | Low | -| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial (T22 weighted/survey HAD tutorial). T21 HAD pretest workflow notebook landed (PR-pending); `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1 (PR #402). | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a | Low | +| `HeterogeneousAdoptionDiD` Phase 5 follow-up tutorial (T22 weighted/survey HAD tutorial). T21 HAD pretest workflow notebook landed in PR #409; `practitioner_next_steps()` HAD handlers + `llms-full.txt` HeterogeneousAdoptionDiD section + Choosing-an-Estimator row landed in Phase 5 wave 1 (PR #402). | `tutorials/`, `tests/test_t22_*_drift.py` | Phase 2a | Low | | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low | | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium | | SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low | diff --git a/tests/_tutorial_drift.py b/tests/_tutorial_drift.py new file mode 100644 index 00000000..815e2546 --- /dev/null +++ b/tests/_tutorial_drift.py @@ -0,0 +1,113 @@ +"""Shared helpers for tutorial-drift tests (T20, T21, ...). + +The HAD tutorial drift tests pin numbers / verdict strings against the +locked DGP + seed. Without these helpers each drift test re-derived +numbers but never verified that the rendered notebook surface (markdown +prose + executed output cells) actually quotes those values. Because +``nbsphinx_execute = "never"`` in ``docs/conf.py``, CI cannot detect +drift between the pinned constants and the committed tutorial via +notebook re-execution; the constants and the notebook can diverge +silently. These helpers parse the .ipynb JSON directly so each +tutorial-drift test file can cross-check its pins against the +rendered surface it claims to protect. +""" + +from __future__ import annotations + +import json +from pathlib import Path +from typing import Iterable + + +def _read_notebook(nb_relpath: str) -> dict: + """Load a notebook by repo-relative path (e.g. ``docs/tutorials/X.ipynb``).""" + nb_path = Path(__file__).resolve().parents[1] / nb_relpath + return json.loads(nb_path.read_text()) + + +def notebook_markdown(nb_relpath: str) -> str: + """Return all markdown cells concatenated into one string.""" + nb = _read_notebook(nb_relpath) + parts = [] + for cell in nb["cells"]: + if cell["cell_type"] != "markdown": + continue + src = cell["source"] + if isinstance(src, list): + src = "".join(src) + parts.append(src) + return "\n".join(parts) + + +def notebook_output_text(nb_relpath: str) -> str: + """Return all executed-output text (``stream`` and ``execute_result`` + text/plain) from every code cell, concatenated. + + Covers the rendered numeric surface that markdown alone misses — + e.g. printed verdict strings, formatted summary tables, p-values. + """ + nb = _read_notebook(nb_relpath) + parts = [] + for cell in nb["cells"]: + if cell["cell_type"] != "code": + continue + for out in cell.get("outputs", []): + # stream-style outputs (print / stdout / stderr) + text = out.get("text") + if text is not None: + parts.append("".join(text) if isinstance(text, list) else text) + # execute_result / display_data with text/plain + data = out.get("data") or {} + plain = data.get("text/plain") + if plain is not None: + parts.append("".join(plain) if isinstance(plain, list) else plain) + return "\n".join(parts) + + +def notebook_rendered_text(nb_relpath: str) -> str: + """Return markdown + executed-output text together — the full + rendered surface a reader sees on RTD.""" + return notebook_markdown(nb_relpath) + "\n" + notebook_output_text(nb_relpath) + + +def assert_quotes_in_rendered( + nb_relpath: str, + expected_quotes: Iterable[str], + *, + surface: str = "rendered", +) -> None: + """Assert each expected substring appears in the chosen rendered surface. + + Parameters + ---------- + nb_relpath + Notebook path relative to repo root (e.g. + ``"docs/tutorials/21_had_pretest_workflow.ipynb"``). + expected_quotes + Iterable of substrings that MUST appear in the chosen rendered + surface. Each is checked independently; the assertion message + lists every missing quote so a single failure surfaces all of + them. + surface + Which slice of the notebook to check: ``"markdown"`` (prose + only), ``"output"`` (executed output cells only), or + ``"rendered"`` (both — default; matches what a reader sees + on RTD). + """ + if surface == "markdown": + text = notebook_markdown(nb_relpath) + elif surface == "output": + text = notebook_output_text(nb_relpath) + elif surface == "rendered": + text = notebook_rendered_text(nb_relpath) + else: + raise ValueError(f"surface must be 'markdown' / 'output' / 'rendered'; got {surface!r}") + missing = [q for q in expected_quotes if q not in text] + assert not missing, ( + f"Tutorial {nb_relpath!r} ({surface=}) is missing load-bearing " + f"quoted values that the pinned drift constants assume are " + f"rendered verbatim. Either the notebook drifted from the " + f"locked DGP output (rerun the tutorial against the pinned " + f"seed) or the drift-test constants were updated without " + f"updating the tutorial. Missing: {missing}" + ) diff --git a/tests/test_t20_had_brand_campaign_drift.py b/tests/test_t20_had_brand_campaign_drift.py index 10c4d1c9..f1aa6238 100644 --- a/tests/test_t20_had_brand_campaign_drift.py +++ b/tests/test_t20_had_brand_campaign_drift.py @@ -270,3 +270,52 @@ def test_event_study_pre_placebos_cover_zero(event_study_result): i = event_times.index(e) assert abs(atts[i]) < 0.1, (e, atts[i]) assert ci_lows[i] <= 0.0 <= ci_highs[i], (e, ci_lows[i], ci_highs[i]) + + +# ============================================================================= +# Notebook-narrative cross-check +# ============================================================================= +# +# Mirror of the T21 notebook cross-check (PR #409 holistic re-audit). +# The asserts above re-derive numbers from the locked DGP+seed but +# never verify that the rendered T20 tutorial actually quotes those +# same numbers. Because ``nbsphinx_execute = "never"`` in +# ``docs/conf.py``, CI cannot detect drift between the pinned drift +# constants and the committed tutorial via notebook re-execution. +# Use the shared helper from ``tests/_tutorial_drift.py`` to assert +# each load-bearing quote is present in the rendered notebook +# surface (markdown prose + executed output cells). + + +T20_NOTEBOOK = "docs/tutorials/20_had_brand_campaign.ipynb" + + +def test_notebook_quotes_match_pinned_constants(): + """Every load-bearing T20 verdict / number must appear verbatim + in the rendered notebook (markdown prose + executed output cells). + + Closes the same gap fixed for T21 in the PR #409 holistic + re-audit: the file-level docstring claimed "check against the + values quoted in the tutorial markdown" but every prior assert + re-derived numbers from the DGP and compared them to a hardcoded + constant, leaving the notebook completely uncross-checked. + """ + from tests._tutorial_drift import assert_quotes_in_rendered + + expected_quotes = [ + # Headline WAS estimate quoted in cell 10 narrative. + "100 weekly visits", + # CI quoted alongside the headline. + "98.6 to 101.4", + # Design auto-detect outcome. + "continuous_near_d_lower", + # Target parameter label used across cells 8 / 12 / 13. + "WAS_d_lower", + # Placebo-magnitude prose claim (locked analytically above by + # test_event_study_pre_atts_near_zero with the ±0.1 envelope). + "±0.06", + # Sample summary in the design-fit narrative. + "median", + "$25K", + ] + assert_quotes_in_rendered(T20_NOTEBOOK, expected_quotes, surface="rendered") diff --git a/tests/test_t21_had_pretest_workflow_drift.py b/tests/test_t21_had_pretest_workflow_drift.py index c1e155a6..cbc1433e 100644 --- a/tests/test_t21_had_pretest_workflow_drift.py +++ b/tests/test_t21_had_pretest_workflow_drift.py @@ -366,3 +366,82 @@ def test_yatchew_side_panel_mean_independence_passes(yatchew_side_panel_inputs): assert res_mi.sigma2_lin > res_lin.sigma2_lin # And the differencing variance (sigma2_diff) is shared across modes. assert round(res_mi.sigma2_diff, 4) == round(res_lin.sigma2_diff, 4) + + +# ============================================================================= +# Notebook-narrative cross-check +# ============================================================================= +# +# The asserts above re-derive numbers from the locked DGP+seed but do NOT +# verify that the rendered tutorial actually quotes those same numbers. +# Without this layer, the notebook prose can drift independently of the +# library numerics (or vice versa) and CI stays green because +# `nbsphinx_execute = "never"` in `docs/conf.py` (CI doesn't re-execute +# notebooks during build). Use the shared tutorial-drift helper that +# parses the notebook JSON and checks both markdown prose AND executed +# output cells (since the load-bearing verdict strings appear in +# print()-rendered output blocks, not just markdown prose). + + +T21_NOTEBOOK = "docs/tutorials/21_had_pretest_workflow.ipynb" + + +def test_notebook_quotes_match_pinned_constants(): + """Every load-bearing verdict/value this file pins must appear + verbatim in the rendered T21 notebook surface (markdown prose + + executed output cells). + + Closes the gap the file-level docstring claims to cover ("check + against the values quoted in the tutorial markdown") but the rest + of the file did not actually exercise — every prior assert + re-derives numbers from the DGP and compares them to a hardcoded + constant, leaving the notebook completely uncross-checked. + Without this test, the notebook can drift independently of the + library numerics (or vice versa) and CI stays green because + ``nbsphinx_execute = "never"`` in ``docs/conf.py``. + """ + from tests._tutorial_drift import assert_quotes_in_rendered + + expected_quotes = [ + # ---- Verdict-string anchors ---- + # Overall verdict substring (also pinned in test_overall_workflow_*). + # Appears in markdown prose AND in the verdict-print output cell. + "paper step 2 deferred to Phase 3 follow-up", + # Event-study verdict substring (rendered output of the + # aggregate='event_study' workflow + markdown reading-cell). + "TWFE admissible under Section 4 assumptions", + # Event-study output cell anchor — full verdict header. + "QUG, joint pre-trends, and joint linearity diagnostics fail-to-reject", + # ---- Structural-field anchors ---- + "aggregate = 'event_study'", + "pretrends_joint populated? True", + "homogeneity_joint populated? True", + "aggregate = 'overall'", + "pretrends_joint populated? False", + # ---- Verdict-reading markdown anchors (cell 6) ---- + "T = D_(1) / (D_(2) - D_(1)) ~ 3.86", + "1/alpha - 1 = 19", + # ---- Numeric anchors pinned analytically above ---- + # Every value pinned via round(..., 4) == 0.NNNN in this file + # must also appear in the rendered notebook (otherwise the + # tutorial prose / output is showing a different number than + # the test claims to lock). + "0.2059", # QUG p-value (test_overall_workflow_*) + "0.6860", # Stute p-value tolerance band anchor + "0.0720", # joint-pretrends Stute p-value (event-study) + "0.7630", # joint-homogeneity Stute p-value (event-study) + "0.4917", # Yatchew side-panel null=linearity p-value + "0.2899", # Yatchew side-panel null=mean_independence p-value + # Design auto-detect outcome (also pinned by overall-path tests). + "continuous_at_zero", + "WAS", + # Overall Yatchew p-value (analytical short-circuit on this DGP). + "1.0000", + # Overall Yatchew sigma2_lin in the rendered output. + "6250.2569", + # Side-panel Yatchew sigma2_lin under null='linearity'. + "6.5340", + # Side-panel Yatchew sigma2_lin under null='mean_independence'. + "7.0076", + ] + assert_quotes_in_rendered(T21_NOTEBOOK, expected_quotes, surface="rendered") From a01600d8f42e5a2221670c3f6e88f6b9149571cd Mon Sep 17 00:00:00 2001 From: igerber Date: Thu, 14 May 2026 20:22:00 -0400 Subject: [PATCH 2/2] CI R1: add missing-notebook skip guard + tighten generic anchors CI AI review on #439 R1: P1: tests/_tutorial_drift.py read docs/tutorials/*.ipynb unconditionally, but the Rust-test CI matrix copies only tests/ to /tmp/tests and runs pytest there (no docs/), so the new tests would FileNotFoundError instead of skipping cleanly. Added pytest.skip() guard in _read_notebook() matching the repo convention used in test_notebook_md_extract.py and test_nprobust_port.py. P3: tightened two generic anchors that could false-pass: - T21 'WAS' (matches many incidental occurrences) -> 'target = `WAS`' (the exact paper-step-1 phrasing, single occurrence). - T20 'median' + '$25K' (split, can both pass independently even if the sentence drifts) -> 'median ~$25K' (the exact tilde-prefixed phrase from the design-fit narrative). --- tests/_tutorial_drift.py | 18 +++++++++++++++++- tests/test_t20_had_brand_campaign_drift.py | 7 ++++--- tests/test_t21_had_pretest_workflow_drift.py | 5 ++++- 3 files changed, 25 insertions(+), 5 deletions(-) diff --git a/tests/_tutorial_drift.py b/tests/_tutorial_drift.py index 815e2546..1e31ccfa 100644 --- a/tests/_tutorial_drift.py +++ b/tests/_tutorial_drift.py @@ -20,8 +20,24 @@ def _read_notebook(nb_relpath: str) -> dict: - """Load a notebook by repo-relative path (e.g. ``docs/tutorials/X.ipynb``).""" + """Load a notebook by repo-relative path (e.g. ``docs/tutorials/X.ipynb``). + + Skips the calling test via ``pytest.skip(...)`` when the notebook file + is not present. The Rust-test CI job (and the isolated-install job) + copies only ``tests/`` to ``/tmp/tests`` and runs from there, without + ``docs/`` available. The repo convention is to skip cleanly when + artifacts are absent rather than fail (see e.g. + ``tests/test_notebook_md_extract.py`` and ``tests/test_nprobust_port.py``). + """ + import pytest + nb_path = Path(__file__).resolve().parents[1] / nb_relpath + if not nb_path.exists(): + pytest.skip( + f"Notebook {nb_relpath!r} not available in this CI environment " + "(isolated-install job copies only tests/, not docs/); " + "rendered-surface cross-check requires a full repo checkout." + ) return json.loads(nb_path.read_text()) diff --git a/tests/test_t20_had_brand_campaign_drift.py b/tests/test_t20_had_brand_campaign_drift.py index f1aa6238..83fb9485 100644 --- a/tests/test_t20_had_brand_campaign_drift.py +++ b/tests/test_t20_had_brand_campaign_drift.py @@ -314,8 +314,9 @@ def test_notebook_quotes_match_pinned_constants(): # Placebo-magnitude prose claim (locked analytically above by # test_event_study_pre_atts_near_zero with the ±0.1 envelope). "±0.06", - # Sample summary in the design-fit narrative. - "median", - "$25K", + # Sample-summary phrase in the design-fit narrative. Use the + # exact tilde-prefixed form so a future drift in the sentence + # (e.g. "median around $25K") would surface here. + "median ~$25K", ] assert_quotes_in_rendered(T20_NOTEBOOK, expected_quotes, surface="rendered") diff --git a/tests/test_t21_had_pretest_workflow_drift.py b/tests/test_t21_had_pretest_workflow_drift.py index cbc1433e..c2b588b5 100644 --- a/tests/test_t21_had_pretest_workflow_drift.py +++ b/tests/test_t21_had_pretest_workflow_drift.py @@ -434,7 +434,10 @@ def test_notebook_quotes_match_pinned_constants(): "0.2899", # Yatchew side-panel null=mean_independence p-value # Design auto-detect outcome (also pinned by overall-path tests). "continuous_at_zero", - "WAS", + # Use the exact paper-step-1 phrasing with target=`WAS` so we + # don't false-pass on the many incidental occurrences of "WAS" + # elsewhere in the prose. + "target = `WAS`", # Overall Yatchew p-value (analytical short-circuit on this DGP). "1.0000", # Overall Yatchew sigma2_lin in the rendered output.