
Commit 3ab7a86

igerber and claude committed
Address PR #409 R7 review (P2 D1) — bounded p-value drift bands
Two bootstrap p-value drift tests had lower-bound-only assertions:

- `test_overall_stute_fails_to_reject`: was `p > 0.50`, tutorial quotes ~0.686 → would silently pass if p drifted to 0.99
- `test_event_study_homogeneity_fails_to_reject`: was `p > 0.50`, tutorial quotes ~0.763 → same silent-stale risk

The third bootstrap test (`test_event_study_pretrends_fails_to_reject`) already used a bounded band `0.0 <= p <= 0.25`. Mirror that pattern on the other two with bounded bands per `feedback_bootstrap_drift_tests_need_backend_tolerance` (>= 0.15 width):

- Stute: 0.53 <= p <= 0.84 (band ~0.31 around 0.686)
- Homogeneity: 0.61 <= p <= 0.92 (band ~0.31 around 0.763)

Both bands wide enough for Rust ↔ pure-Python RNG path differences; both narrow enough that drift in either direction (toward rejection or toward an even cleaner pass) flags the prose as stale.

All 16 drift tests pass on both backends within the new bands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f9f951f commit 3ab7a86

1 file changed

Lines changed: 14 additions & 9 deletions

tests/test_t21_had_pretest_workflow_drift.py

@@ -204,14 +204,14 @@ def test_overall_qug_fails_to_reject(overall_report):


 def test_overall_stute_fails_to_reject(overall_report):
-    """Section 3 narrative claims Stute fails-to-reject linearity.
-    Stute uses Mammen wild bootstrap so the p-value is RNG-dependent;
-    use binary fail-to-reject + abs tolerance band per
-    `feedback_bootstrap_drift_tests_need_backend_tolerance`."""
+    """Section 3 narrative quotes Stute p_value ~0.686. Stute uses
+    Mammen wild bootstrap so the p-value is RNG-dependent; use a
+    bounded abs tolerance band per
+    `feedback_bootstrap_drift_tests_need_backend_tolerance` (>= 0.15
+    width). Both bounds tight enough to catch methodology drift in
+    either direction, loose enough for backend RNG path differences."""
     assert overall_report.stute.reject is False
-    # Tight enough to catch methodology drift, loose enough for backend
-    # RNG path differences.
-    assert overall_report.stute.p_value > 0.50, overall_report.stute.p_value
+    assert 0.53 <= overall_report.stute.p_value <= 0.84, overall_report.stute.p_value


 def test_overall_yatchew_fails_to_reject(overall_report):
@@ -292,11 +292,16 @@ def test_event_study_pretrends_fails_to_reject(event_study_report):

 def test_event_study_homogeneity_fails_to_reject(event_study_report):
     """Section 4 narrative claims joint homogeneity strongly fails to
-    reject (~0.76 from numbers.json)."""
+    reject and quotes p ~0.763 from numbers.json. Use a bounded abs
+    tolerance band per
+    `feedback_bootstrap_drift_tests_need_backend_tolerance` so that
+    drift in either direction (toward rejection or toward an even
+    cleaner pass) flags the prose as stale rather than silently
+    passing."""
     hj = event_study_report.homogeneity_joint
     assert hj is not None
     assert hj.reject is False
-    assert hj.p_value > 0.50, hj.p_value
+    assert 0.61 <= hj.p_value <= 0.92, hj.p_value


 def test_had_design_auto_lands_on_continuous_at_zero(two_period):
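
The band arithmetic in the commit message can be reproduced with a small sketch. This is illustrative only and not part of the commit: the helper names `drift_band` and `within_band` are hypothetical, and the half-width of 0.155 is an assumption inferred from the quoted bounds (0.686 ± 0.155 and 0.763 ± 0.155 round to the bands above), not a value the commit states.

```python
# Hypothetical helpers, not part of the repository under test.
# ASSUMPTION: a symmetric half-width of 0.155 around the quoted p-value,
# inferred from the bands in the commit message (total width ~0.31).

def drift_band(quoted_p, half_width=0.155):
    """Return a (lo, hi) band centered on the quoted p-value.

    The band is clipped to [0, 1] and rounded to two decimals.
    """
    lo = max(0.0, quoted_p - half_width)
    hi = min(1.0, quoted_p + half_width)
    return round(lo, 2), round(hi, 2)


def within_band(p_value, band):
    """True if p_value lies inside the closed interval band = (lo, hi)."""
    lo, hi = band
    return lo <= p_value <= hi


stute_band = drift_band(0.686)        # tutorial quotes ~0.686
homogeneity_band = drift_band(0.763)  # tutorial quotes ~0.763

# A p-value that drifted to 0.99 satisfies the old lower-bound-only
# check (p > 0.50) but falls outside the bounded band.
drifted_p = 0.99
old_check_passes = drifted_p > 0.50
new_check_passes = within_band(drifted_p, stute_band)
```

Under these assumptions the old check accepts the drifted value while the bounded check rejects it, which is the silent-stale risk the commit message describes.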
