Pin T21 deterministic p-values quoted in tutorial (re-audit of PR #409) by igerber · Pull Request #425 · igerber/diff-diff

igerber · 2026-05-13T23:14:54Z

Summary

Audit follow-up to PR #409. The restored CI reviewer's one P2 finding: the T21 drift test pins QUG `t_stat` / `critical_value` and Yatchew `t_stat_hr` / `sigma2_lin` but leaves the corresponding p-values unpinned, even though all three are deterministic closed-form values that the notebook + review extract quote verbatim.

Pin the three deterministic p-values:

`overall_report.qug.p_value == 0.2059` (Section 3, reproduced on the event-study path so a divergence between the two surfaces also surfaces)
`res_lin.p_value == 0.4917` (Yatchew linearity side panel, Section 5)
`res_mi.p_value == 0.2899` (Yatchew mean-independence side panel, Section 5)

A drift in any of these would otherwise ship silently while the existing test still passed on the already-locked `t_stat` (the relationship between QUG `t_stat` and `p_value` is `1 / (1 + T)`, so a drift in either suggests a real divergence). All four touched tests still pass locally.
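The `1 / (1 + T)` relationship mentioned above can be sketched in a few lines. This is an illustration only: the function names are hypothetical and are not the library's API, the non-negativity check on `T` is an assumption of the sketch, and the value `3.0` is a made-up input rather than the tutorial's statistic.

```python
def qug_p_value(t_stat: float) -> float:
    """p = 1 / (1 + T), the QUG relationship quoted in the PR description."""
    if t_stat < 0:
        # Assumption of this sketch: T is treated as non-negative.
        raise ValueError("t_stat is assumed non-negative in this sketch")
    return 1.0 / (1.0 + t_stat)


def implied_t_stat(p_value: float) -> float:
    """Invert p = 1 / (1 + T) to recover the T implied by a quoted p-value."""
    return 1.0 / p_value - 1.0


# Round-trip on an illustrative value (not the tutorial's statistic).
t_example = 3.0
p_example = qug_p_value(t_example)
assert round(p_example, 4) == 0.25
assert abs(implied_t_stat(p_example) - t_example) < 1e-12
```

Because the mapping is one-to-one, a pinned `t_stat` and a pinned `p_value` that stop agreeing under this formula would point to a change in how one of them is computed or rendered.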

Test plan

CI - `pytest tests/test_t21_had_pretest_workflow_drift.py::test_overall_qug_fails_to_reject tests/test_t21_had_pretest_workflow_drift.py::test_event_study_qug_matches_overall tests/test_t21_had_pretest_workflow_drift.py::test_yatchew_side_panel_linearity_passes tests/test_t21_had_pretest_workflow_drift.py::test_yatchew_side_panel_mean_independence_passes` all pass on the new assertions.

🤖 Generated with Claude Code

The restored CI reviewer caught one P2 on PR #409: the T21 drift test locked QUG `t_stat` / `critical_value` and Yatchew `t_stat_hr` / `sigma2_lin` but left the corresponding p-values unpinned, even though all three are deterministic closed-form values that the notebook / review extract quote verbatim:

- `overall_report.qug.p_value` = 0.2059 (notebook + extract Section 3, reproduced in event-study path)
- `res_lin.p_value` = 0.4917 (Yatchew linearity side panel)
- `res_mi.p_value` = 0.2899 (Yatchew mean-independence side panel)

A drift in any of those numbers could ship silently while the test still passes on the already-locked t-stat. Add direct rounded assertions, matching the rounding style of the existing locks. All four touched tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-13T23:20:09Z

Overall assessment: ✅ Looks good

Executive summary

Re-review focus: the prior gap appears addressed. This is a test-only diff in tests/test_t21_had_pretest_workflow_drift.py; it does not change estimator logic, weighting, variance/SE code, identification assumptions, or defaults.
The new QUG assertions at tests/test_t21_had_pretest_workflow_drift.py:L204-L207 and :L257-L260 match the library’s documented contract p_value = 1 / (1 + T) in diff_diff/had_pretests.py:L150-L156 and :L1293-L1297, and they match the saved tutorial outputs at docs/tutorials/21_had_pretest_workflow.ipynb:L223-L229 and :L376-L380.
The new Yatchew assertions at tests/test_t21_had_pretest_workflow_drift.py:L346-L348 and :L362-L364 match the documented p_value = 1 - Phi(T_hr) contract in diff_diff/had_pretests.py:L328-L352 and :L2033-L2051, and the tutorial’s saved outputs at docs/tutorials/21_had_pretest_workflow.ipynb:L484-L501.
null="mean_independence" remains a documented R-parity extension rather than the paper’s default linearity-null surface, so pinning that deterministic value is informationally fine and not a methodology defect.
I found no new NaN/inference anti-patterns, partial parameter-propagation issues, or control-group logic risks in scope.
I could not execute the targeted tests in this sandbox because the available Python lacks the repo’s declared test dependencies (numpy, pytest).

Methodology

Severity P3 (informational). Impact: tests/test_t21_had_pretest_workflow_drift.py:L351-L364 exercises yatchew_hr_test(null="mean_independence"), which is a documented repo deviation/R-parity extension rather than a paper-prescribed null; this is explicitly allowed by the registry and is consistent with the exposed order=0 constant-mean variant. Concrete fix: none. docs/methodology/REGISTRY.md:L2474-L2482.
No P0/P1 findings. The QUG and Yatchew p-value pins are consistent with the documented formulas and saved tutorial outputs: docs/methodology/REGISTRY.md:L2549-L2552, docs/tutorials/21_had_pretest_workflow.ipynb:L223-L229, :L376-L380, :L484-L501.
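For reference, the Yatchew contract cited in this review, p_value = 1 - Phi(T_hr), is an upper-tail standard normal probability. The sketch below uses only the Python standard library; the function names are hypothetical, and it does not reproduce how diff_diff/had_pretests.py computes the value.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def yatchew_p_value(t_stat_hr: float) -> float:
    """Upper-tail p-value 1 - Phi(T_hr) for a standard normal reference."""
    return 1.0 - _STD_NORMAL.cdf(t_stat_hr)


def implied_t_stat_hr(p_value: float) -> float:
    """Invert p = 1 - Phi(T) to get the statistic implied by a p-value."""
    return _STD_NORMAL.inv_cdf(1.0 - p_value)


# A statistic of zero sits exactly at the median of the reference distribution.
assert abs(yatchew_p_value(0.0) - 0.5) < 1e-12
# Larger statistics give smaller p-values.
assert yatchew_p_value(0.5) < 0.5 < yatchew_p_value(-0.5)
# The mapping round-trips on an illustrative p-value.
assert abs(yatchew_p_value(implied_t_stat_hr(0.3)) - 0.3) < 1e-9
```

Under this formula any p-value below 0.5 corresponds to a positive statistic, which is the case for both quoted Yatchew values (0.4917 and 0.2899).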

Code Quality
No findings. The added assertions follow the existing drift-test pattern: exact pins for deterministic closed-form outputs, tolerance bands only for bootstrap-driven outputs.
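The pattern described here (exact pins for closed-form values, tolerance bands for bootstrap-driven ones) might look roughly like the following. The helper names, the four-digit rounding default, and the toy inputs are assumptions made for illustration; the repo's actual assertions are not shown on this page.

```python
import random


def assert_pinned(value: float, expected: float, ndigits: int = 4) -> None:
    """Exact pin: a deterministic value must round to the quoted constant."""
    rounded = round(value, ndigits)
    assert rounded == expected, f"pinned value drifted: {rounded} != {expected}"


def assert_in_band(value: float, low: float, high: float) -> None:
    """Tolerance band: a simulation-driven value only has to stay in range."""
    assert low <= value <= high, f"{value} outside [{low}, {high}]"


# Deterministic closed form: pin it exactly (toy value, not from the tutorial).
assert_pinned(1.0 / (1.0 + 3.0), 0.25)

# Simulation-driven quantity: stand-in for a bootstrap p-value, checked by band.
rng = random.Random(0)
draws = [rng.random() for _ in range(1000)]
assert_in_band(sum(draws) / len(draws), 0.4, 0.6)
```

The split matters because an exact pin on a resampled quantity would fail on harmless numerical noise, while a band on a closed-form value would hide real drift.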

Performance
No findings. This is a test-only change with negligible runtime impact.

Maintainability
No findings. The new comments explain why these p-values are pinned and make the intent of the stricter assertions clear.

Tech Debt
No findings. This diff does not introduce new deferred work and does not require a TODO.md entry.

Security
No findings. Test-only change; no new secret-handling, deserialization, or input-surface risk.

Documentation/Tests
No findings. The added assertions directly cover previously unpinned deterministic tutorial values, which is the right regression surface for a notebook drift test.
Residual review limitation: execution was not available in this sandbox because /usr/bin/python here does not have the repo’s test dependencies installed.

…drift coverage + TODO status

Holistic re-audit of merged igerber#409 (Tutorial 21 HAD pre-test workflow notebook + drift tests) + igerber#425 (cleanup: pin deterministic p-values). The per-PR cleanup PR review on igerber#425 couldn't see the combined post-PR holistic state. 4 local agentic codex rounds (R0-R3) surfaced a repeated rendered-surface drift gap on both the drift test the T21 PR added and the sibling T20 drift test, plus a small TODO status reconciliation.

**Rendered-surface drift gap**: both `test_t20_*_drift.py` and `test_t21_*_drift.py` had file-level docstrings claiming to check pinned numbers "against the values quoted in the tutorial markdown", but every prior assert just re-derived values from the locked DGP and compared to a hardcoded constant; the .ipynb files themselves were never read. Because `nbsphinx_execute = "never"` in `docs/conf.py`, CI cannot detect drift between drift-test constants and the committed tutorial via notebook re-execution; the two surfaces can diverge silently. Closed via a new shared helper `tests/_tutorial_drift.py` (`notebook_markdown` / `notebook_output_text` / `notebook_rendered_text` + `assert_quotes_in_rendered`) used by:

- `tests/test_t21_had_pretest_workflow_drift.py::test_notebook_quotes_match_pinned_constants`, which pins 16 load-bearing rendered substrings: verdict text on both `aggregate="overall"` and `aggregate="event_study"` paths; structural-field anchors; QUG / Stute / Yatchew p-values (`0.2059`, `0.6860`, `0.0720`, `0.7630`, `0.4917`, `0.2899`) the file already pins analytically; design / target anchors (`continuous_at_zero`, `WAS`); rendered Yatchew sigma2_lin values (`6250.2569`, `6.5340`, `7.0076`).
- `tests/test_t20_had_brand_campaign_drift.py::test_notebook_quotes_match_pinned_constants`, which pins the headline WAS estimate prose (`100 weekly visits`, `98.6 to 101.4`), the design auto-detect outcome (`continuous_near_d_lower`), the target label (`WAS_d_lower`), and the placebo magnitude / sample-summary prose (`±0.06`, `median`, `$25K`) the file already pins analytically.

**TODO status reconciliation**: TODO.md still said T21 was "PR-pending" while CHANGELOG and REGISTRY already credited PR igerber#409. Updated.

4 files, +244/-1. No methodology changes. No estimator/inference behavior change. Tests-only + one TODO line.

Two pilot findings NOT included in this fix-PR:

1. The P3 stale verdict text (`"paper step 2 deferred to Phase 3 follow-up"`): the codex correctly observes the Phase 3 follow-up has shipped as `aggregate="event_study"`, so the wording is historically frozen. Changing it is a user-facing API string change touching 6+ surfaces (`had_pretests.py` source + 4 test assertions + notebook + CHANGELOG); deferred for explicit guidance before changing the verdict library-wide.
2. The codex's R3 P1 about syncing `docs/_review/t21_notebook_extract.md` against the notebook: phantom finding. That file was deleted from main in a follow-up PR (it's a temporary review aid per its own header); pilot-409 has it only because cherry-picking igerber#409 brought it in. The fix-PR off main has nothing to sync.

Per `feedback_holistic_fix_on_repeated_p1s`: shipping after the same-class finding repeated 4 rounds rather than continuing to enumerate anchors.
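The rendered-surface check described above amounts to reading the committed `.ipynb` JSON and asserting that quoted substrings appear in its markdown cells or saved outputs. The sketch below is a guess at that general shape, not the contents of `tests/_tutorial_drift.py`: every function name and signature is hypothetical, and the toy notebook is made up for illustration (it reuses two strings the commit message lists as pinned).

```python
import json
from pathlib import Path


def _as_text(field) -> str:
    """Notebook JSON stores text as a string or as a list of line strings."""
    return field if isinstance(field, str) else "".join(field)


def markdown_text(nb: dict) -> str:
    """Concatenate the source of every markdown cell."""
    cells = [c for c in nb["cells"] if c["cell_type"] == "markdown"]
    return "\n".join(_as_text(c["source"]) for c in cells)


def output_text(nb: dict) -> str:
    """Concatenate saved stream and text/plain outputs of code cells."""
    chunks = []
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        for out in cell.get("outputs", []):
            if "text" in out:  # stream output (stdout / stderr)
                chunks.append(_as_text(out["text"]))
            elif "text/plain" in out.get("data", {}):  # execute_result etc.
                chunks.append(_as_text(out["data"]["text/plain"]))
    return "\n".join(chunks)


def rendered_text(nb: dict) -> str:
    """Everything a reader of the committed notebook would see as text."""
    return markdown_text(nb) + "\n" + output_text(nb)


def assert_quotes_present(nb: dict, quotes: list[str]) -> None:
    """Fail with the list of quoted substrings missing from the notebook."""
    text = rendered_text(nb)
    missing = [q for q in quotes if q not in text]
    assert not missing, f"quotes missing from rendered notebook: {missing}"


def load_notebook(path: str) -> dict:
    """Read a committed .ipynb file without executing it."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Toy in-memory notebook (illustrative, not the tutorial's real content).
toy = {
    "cells": [
        {"cell_type": "markdown", "source": ["QUG p-value: ", "0.2059\n"]},
        {
            "cell_type": "code",
            "source": "print(design)",
            "outputs": [{"output_type": "stream", "text": "continuous_at_zero\n"}],
        },
    ]
}
assert_quotes_present(toy, ["0.2059", "continuous_at_zero"])
```

Reading the file as plain JSON keeps the check independent of notebook execution, which fits the stated constraint that `nbsphinx_execute = "never"` prevents CI from re-running the tutorial.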

igerber added the ready-for-ci label (Triggers CI test workflows) May 13, 2026

igerber merged commit 6425094 into main May 14, 2026
25 of 26 checks passed

igerber deleted the fix-audit-409 branch May 14, 2026 00:47

igerber mentioned this pull request May 15, 2026

Fix #409 holistic audit residuals: T20+T21 notebook cross-check + TODO status #439

Merged

2 tasks

igerber merged 1 commit into main from fix-audit-409