Pin T21 deterministic p-values quoted in tutorial (re-audit of PR #409) #425
The restored CI reviewer caught one P2 on PR #409: the T21 drift test locked QUG `t_stat` / `critical_value` and Yatchew `t_stat_hr` / `sigma2_lin` but left the corresponding p-values unpinned, even though all three are deterministic closed-form values that the notebook / review extract quote verbatim:

- `overall_report.qug.p_value` = 0.2059 (notebook + extract Section 3, reproduced in event-study path)
- `res_lin.p_value` = 0.4917 (Yatchew linearity side panel)
- `res_mi.p_value` = 0.2899 (Yatchew mean-independence side panel)

A drift in any of those numbers could ship silently while the test still passes on the already-locked t-stat. Add direct rounded assertions, matching the rounding style of the existing locks.

All four touched tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Overall assessment: ✅ Looks good

Executive summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
…drift coverage + TODO status

Holistic re-audit of merged igerber#409 (Tutorial 21 HAD pre-test workflow notebook + drift tests) + igerber#425 (cleanup: pin deterministic p-values). The per-PR review of the cleanup PR igerber#425 could not see the combined post-merge state. Four local agentic codex rounds (R0-R3) surfaced a repeated rendered-surface drift gap in both the T21 drift test that PR added and the sibling T20 drift test, plus a small TODO status reconciliation.

**Rendered-surface drift gap.** Both `test_t20_*_drift.py` and `test_t21_*_drift.py` had file-level docstrings claiming to check pinned numbers "against the values quoted in the tutorial markdown", but every prior assert just re-derived values from the locked DGP and compared them to a hardcoded constant; the .ipynb files themselves were never read. Because `nbsphinx_execute = "never"` in `docs/conf.py`, CI cannot detect drift between drift-test constants and the committed tutorial via notebook re-execution, so the two surfaces can diverge silently. Closed via a new shared helper `tests/_tutorial_drift.py` (`notebook_markdown` / `notebook_output_text` / `notebook_rendered_text` + `assert_quotes_in_rendered`) used by:

- `tests/test_t21_had_pretest_workflow_drift.py::test_notebook_quotes_match_pinned_constants` pins 16 load-bearing rendered substrings: verdict text on both `aggregate="overall"` and `aggregate="event_study"` paths; structural-field anchors; QUG / Stute / Yatchew p-values (`0.2059`, `0.6860`, `0.0720`, `0.7630`, `0.4917`, `0.2899`) the file already pins analytically; design / target anchors (`continuous_at_zero`, `WAS`); rendered Yatchew sigma2_lin values (`6250.2569`, `6.5340`, `7.0076`).
- `tests/test_t20_had_brand_campaign_drift.py::test_notebook_quotes_match_pinned_constants` pins the headline WAS estimate prose (`100 weekly visits`, `98.6 to 101.4`), the design auto-detect outcome (`continuous_near_d_lower`), the target label (`WAS_d_lower`), and the placebo magnitude / sample-summary prose (`±0.06`, `median`, `$25K`) the file already pins analytically.

**TODO status reconciliation.** TODO.md still said T21 was "PR-pending" while CHANGELOG and REGISTRY already credited PR igerber#409. Updated.

4 files, +244/-1. No methodology changes. No estimator/inference behavior change. Tests-only + one TODO line.

Two pilot findings NOT included in this fix-PR:

1. The P3 stale verdict text (`"paper step 2 deferred to Phase 3 follow-up"`): the codex correctly observes that the Phase 3 follow-up has shipped as `aggregate="event_study"`, so the wording is historically frozen. Changing it is a user-facing API string change touching 6+ surfaces (`had_pretests.py` source + 4 test assertions + notebook + CHANGELOG); deferred pending explicit guidance before changing the verdict library-wide.
2. The codex's R3 P1 about syncing `docs/_review/t21_notebook_extract.md` against the notebook is a phantom finding. That file was deleted from main in a follow-up PR (it is a temporary review aid per its own header); pilot-409 has it only because cherry-picking igerber#409 brought it in. The fix-PR off main has nothing to sync.

Per `feedback_holistic_fix_on_repeated_p1s`: shipping after the same class of finding repeated for 4 rounds rather than continuing to enumerate anchors.
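A minimal sketch of what a helper like `tests/_tutorial_drift.py` could look like, for readers who want to see the rendered-surface check concretely. The function names come from the description above; the bodies are assumptions (plain `json` parsing of the nbformat v4 layout), not the code that shipped, and the demo notebook is fabricated purely for illustration.

```python
import json
import tempfile
from pathlib import Path


def _joined(value) -> str:
    # nbformat stores text either as one string or as a list of lines.
    return "".join(value) if isinstance(value, list) else value


def notebook_markdown(path) -> str:
    """Concatenate the source of every markdown cell."""
    nb = json.loads(Path(path).read_text(encoding="utf-8"))
    return "\n".join(
        _joined(c.get("source", "")) for c in nb["cells"] if c["cell_type"] == "markdown"
    )


def notebook_output_text(path) -> str:
    """Concatenate stream text and text/plain results of every code cell."""
    nb = json.loads(Path(path).read_text(encoding="utf-8"))
    chunks = []
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        for out in cell.get("outputs", []):
            if out.get("output_type") == "stream":
                chunks.append(_joined(out.get("text", "")))
            else:  # execute_result / display_data; error outputs yield ""
                chunks.append(_joined(out.get("data", {}).get("text/plain", "")))
    return "\n".join(chunks)


def notebook_rendered_text(path) -> str:
    return notebook_markdown(path) + "\n" + notebook_output_text(path)


def assert_quotes_in_rendered(path, quotes) -> None:
    """Fail if any pinned substring is absent from the committed notebook."""
    rendered = notebook_rendered_text(path)
    missing = [q for q in quotes if q not in rendered]
    assert not missing, f"quotes not found in {path}: {missing}"


# Illustrative two-cell notebook (not the real Tutorial 21 content).
demo = {
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["The QUG p-value is 0.2059.\n"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": ["print(round(res_lin.p_value, 4))"],
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": ["0.4917\n"]}]},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    nb_path = Path(tmp) / "demo.ipynb"
    nb_path.write_text(json.dumps(demo), encoding="utf-8")
    assert_quotes_in_rendered(nb_path, ["0.2059", "0.4917"])
```

Reading the committed `.ipynb` directly avoids re-executing the notebook, which matters here because `nbsphinx_execute = "never"` means CI never regenerates the outputs.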
Summary
Audit follow-up to PR #409. The restored CI reviewer's one P2 finding: the T21 drift test pins QUG `t_stat` / `critical_value` and Yatchew `t_stat_hr` / `sigma2_lin` but leaves the corresponding p-values unpinned, even though all three are deterministic closed-form values that the notebook + review extract quote verbatim.
Pin the three deterministic p-values:

- `overall_report.qug.p_value` = 0.2059
- `res_lin.p_value` = 0.4917
- `res_mi.p_value` = 0.2899
A drift in any of these would otherwise ship silently while the existing test still passed on the already-locked `t_stat`. For QUG the p-value is the closed form `p_value = 1 / (1 + T)` with `T = t_stat`, so a drift in either one suggests a real divergence. All four touched tests still pass locally.
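The QUG relationship quoted above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: `qug_p_value` is a hypothetical stand-in for the library's computation, and the `t_stat` used here is back-solved from the quoted p-value of 0.2059 rather than taken from the real test file.

```python
def qug_p_value(t_stat: float) -> float:
    # Closed form stated in the PR summary: p = 1 / (1 + T).
    return 1.0 / (1.0 + t_stat)


P_QUOTED = 0.2059                   # value quoted in the notebook
t_implied = 1.0 / P_QUOTED - 1.0    # back-solved T, roughly 3.857 (illustrative)

# Rounded assertion in the style the PR describes.
assert round(qug_p_value(t_implied), 4) == P_QUOTED

# A small shift in T moves the rounded p-value, so pinning p catches
# a divergence that a loosely rounded t_stat lock could miss.
assert round(qug_p_value(3.9), 4) != P_QUOTED
```

Pinning both numbers is deliberately redundant: each lock guards the other against a silent change in either the statistic or the formula that maps it to a p-value.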
Test plan
🤖 Generated with Claude Code