Skip to content

Commit aef0702

Browse files
igerberclaude
andcommitted
Fix CI failure: relax bootstrap-p-value bit-equality claim in trivial-stratum test
CI matrix surfaced a single failing assertion across all 6 failed Python Tests jobs (5176 pass, 1 fail per matrix): test_trivial_stratum_reduces_to_strata_none assert_allclose(r_explicit.p_value, r_implicit.p_value, atol=1e-12) ACTUAL: 0.475, DESIRED: 0.445 (Δ = 0.030) Root cause: test design flaw, not methodology. ``generate_survey_multiplier_weights_batch`` takes structurally different code paths based on whether ``strata`` is None: - strata-not-None (bootstrap_utils.py:579+): iterates np.unique (strata), per-stratum batch via generate_bootstrap_weights_batch_numpy. - strata-None (bootstrap_utils.py:556+): single batch via generate_bootstrap_weights_batch, which routes through the Rust backend when available. Both paths produce different RNG state evolutions even at the same seed (single batch call vs per-stratum loop advances numpy default_rng differently), AND the Rust-vs-numpy dispatch divergence on the strata-None branch adds a second source of multiplier difference on machines with the Rust backend installed. The test passed locally (no Rust installed) but fails on CI (Rust present). Fix: keep the deterministic CvM statistic bit-equal claim (atol=1e-14, the actual algebraic-identity invariant), relax the bootstrap p-value claim to a 0.15 absolute closeness band (within bootstrap noise at B=199; ~4σ). Update the class docstring and the test docstring to explain why bit-equality on p-values was never achievable. The methodology is unaffected — only the test's overclaim on bootstrap-p bit-equality was wrong. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ce2b91c commit aef0702

1 file changed

Lines changed: 41 additions & 10 deletions

File tree

tests/test_had_pretests.py

Lines changed: 41 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -5469,7 +5469,12 @@ class TestStuteStratifiedSurveyBootstrap:
54695469
smoke; QUG silently skipped, joint Stute + Yatchew run
54705470
survey-aware; verdict carries the C0 deferral substring.
54715471
- Trivial-stratum reduction: ``SurveyDesign(strata="all_ones")``
5472-
≡ ``SurveyDesign(strata=None)`` at atol=1e-12.
5472+
vs ``SurveyDesign(strata=None)`` — the deterministic CvM
5473+
statistic is bit-equal (atol=1e-14, the algebraic-identity
5474+
claim); the bootstrap p-value matches only within
5475+
bootstrap-noise (≤0.15 abs) because
5476+
``generate_survey_multiplier_weights_batch`` takes different
5477+
RNG-consumption paths for strata-Some vs strata-None.
54735478
- Non-strata calibration-shift end-to-end smoke: finite + range
54745479
check that the non-strata Stute path runs after the
54755480
single-implicit-stratum centering lands. A direction-pin via
@@ -5610,10 +5615,24 @@ def test_workflow_stratified_event_study_end_to_end_smoke(self):
56105615

56115616
def test_trivial_stratum_reduces_to_strata_none(self):
56125617
"""``SurveyDesign(weights, psu, strata="all_ones")`` (single
5613-
explicit stratum) is algebraically identical to
5618+
explicit stratum) is *algebraically* equivalent to
56145619
``SurveyDesign(weights, psu, strata=None)`` (single implicit
5615-
stratum) after the PR. Both apply the same Bessel correction.
5616-
Validates the trivial-stratum reduction at atol=1e-12."""
5620+
stratum) after the PR — both apply the same within-implicit-
5621+
stratum demean + sqrt(n_psu/(n_psu-1)) Bessel rescale.
5622+
5623+
The DETERMINISTIC CvM statistic matches bit-exactly between the
5624+
two fits — that is the algebraic-identity claim.
5625+
5626+
The BOOTSTRAP p-value does NOT match bit-exactly because
5627+
``generate_survey_multiplier_weights_batch`` takes structurally
5628+
different code paths based on ``strata`` (see
5629+
``bootstrap_utils.py:556+`` strata-None branch vs ``:579+``
5630+
stratified branch — different RNG-consumption patterns and the
5631+
strata-None branch additionally routes through the Rust backend
5632+
when available, while the stratified per-stratum loop uses
5633+
``..._numpy``). Bootstrap p-values are expected to differ at
5634+
the order of bootstrap noise — we assert a loose closeness band
5635+
as a SMOKE, not bit-equality."""
56175636
from diff_diff import SurveyDesign
56185637

56195638
df = self._stratified_panel(n_strata=1, n_psu_per_stratum=20)
@@ -5635,19 +5654,31 @@ def test_trivial_stratum_reduces_to_strata_none(self):
56355654
resolved_explicit = sd_explicit.resolve(unit_df)
56365655
resolved_implicit = sd_implicit.resolve(unit_df)
56375656

5638-
# Use the same seed for both to lock the RNG path.
5657+
# Use the same seed for both. The CvM is deterministic; the
5658+
# bootstrap p-value depends on the multiplier RNG path, which
5659+
# differs between the two helper branches (see docstring).
56395660
r_explicit = stute_test(
56405661
d=d_arr, dy=dy_arr, survey_design=resolved_explicit, n_bootstrap=199, seed=42
56415662
)
56425663
r_implicit = stute_test(
56435664
d=d_arr, dy=dy_arr, survey_design=resolved_implicit, n_bootstrap=199, seed=42
56445665
)
5645-
# CvM statistic is deterministic (no bootstrap) — bit-exact match.
5666+
# CvM statistic is deterministic — bit-exact match (the
5667+
# algebraic-identity claim).
56465668
np.testing.assert_allclose(r_explicit.cvm_stat, r_implicit.cvm_stat, atol=1e-14)
5647-
# Bootstrap p-value: at the same seed + same multiplier draw,
5648-
# both apply the SAME stratum centering (one explicit, one
5649-
# implicit). p-values match at atol=1e-12.
5650-
np.testing.assert_allclose(r_explicit.p_value, r_implicit.p_value, atol=1e-12)
5669+
# Bootstrap p-values: loose closeness band (within bootstrap
5670+
# noise), NOT bit-equality. At B=199 the bootstrap SD on the
5671+
# uniform p-value is ~sqrt(0.25/199) ≈ 0.035; a ±0.15 band is
5672+
# ~4σ and decisively distinguishes "two fits of the same
5673+
# statistic" from "the algebraic-equivalence claim is broken"
5674+
# (which would push p-values to opposite tails).
5675+
assert abs(float(r_explicit.p_value) - float(r_implicit.p_value)) < 0.15, (
5676+
f"Trivial-stratum reduction smoke: explicit p={r_explicit.p_value!r} "
5677+
f"vs implicit p={r_implicit.p_value!r} differ by more than the "
5678+
f"bootstrap-noise band (0.15). The algebraic equivalence between "
5679+
f"SurveyDesign(strata='all_ones') and SurveyDesign(strata=None) "
5680+
f"may be broken; check apply_stratum_centering's strata-None branch."
5681+
)
56515682

56525683
def test_stute_call_sites_invoke_apply_stratum_centering(self, monkeypatch):
56535684
"""Regression: ensure the Stute survey-bootstrap call sites

0 commit comments

Comments
 (0)