Commit 08056e4

igerber and claude committed
Address PR #355 R13 P1: stratified_survey DGP off-by-one on post_periods
``generate_survey_did_data`` is 1-indexed (prep_dgp.py L1211-L1212), so ``n_periods=12`` with ``cohort_periods=[7]`` emits periods 1..12 with post = [7, 8, 9, 10, 11, 12]. The coverage harness' ``_stratified_survey_dgp`` returned ``list(range(7, 12))`` = [7, 8, 9, 10, 11], silently dropping period 12 into the pre window. SDID therefore fit the panel as 7-pre/5-post instead of the documented 6-pre/6-post, and every rejection / mean SE cell in the survey-bootstrap calibration row (plus the REGISTRY narrative transcribed from it) was derived from the mis-specified window.

Fix: derive post_periods from ``df["period"].max()`` so any change to ``n_periods`` propagates. Regression test ``test_stratified_survey_dgp_post_periods_cover_full_post_tail`` fails fast if a future refactor reintroduces the off-by-one (checks unique / sorted / contiguous / max == df.period.max() plus the explicit [7, 8, 9, 10, 11, 12] shape).

Regenerated only the stratified_survey block and spliced it into the main artifact (other DGPs unaffected: their seeds and DGP code are unchanged). New rejection rates at α = {0.01, 0.05, 0.10}: {0.024, 0.058, 0.094}; mean SE / true SD drops from 1.25 to 1.13. Rejection at α=0.05 remains well inside the calibration gate [0.02, 0.10]. REGISTRY table row and narrative updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent fb2dd90 commit 08056e4

4 files changed

Lines changed: 63 additions & 12 deletions

benchmarks/data/sdid_coverage.json

Lines changed: 8 additions & 8 deletions
@@ -4,7 +4,7 @@
   "n_bootstrap": 200,
   "library_version": "3.2.0",
   "backend": "rust",
-  "generated_at": "2026-04-24T00:58:22.180577+00:00",
+  "generated_at": "2026-04-24T13:01:54.876774+00:00",
   "total_elapsed_sec": 2420.61,
   "methods": [
     "placebo",
@@ -147,13 +147,13 @@
       "bootstrap": {
         "n_successful_fits": 500,
         "rejection_rate": {
-          "0.01": 0.014,
-          "0.05": 0.042,
-          "0.10": 0.088
+          "0.01": 0.024,
+          "0.05": 0.058,
+          "0.10": 0.094
         },
-        "mean_se": 0.5689806467245018,
-        "true_sd_tau_hat": 0.45569672831386343,
-        "se_over_truesd": 1.2485949785722699
+        "mean_se": 0.5097482138251239,
+        "true_sd_tau_hat": 0.4512243070193919,
+        "se_over_truesd": 1.1297002530566618
       },
       "jackknife": {
         "n_successful_fits": 0,
@@ -166,7 +166,7 @@
         "true_sd_tau_hat": null,
         "se_over_truesd": null
       },
-      "_elapsed_sec": 33.17
+      "_elapsed_sec": 16.48
     }
   }
 }
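As a quick consistency check on the regenerated block, `se_over_truesd` should equal `mean_se / true_sd_tau_hat`. The sketch below recomputes that ratio from the values in the diff above. The helper name `se_ratio` and the dict layout are illustrative only and are not part of the harness.

```python
import math

# Values copied from the stratified_survey bootstrap block in the diff.
before = {
    "mean_se": 0.5689806467245018,
    "true_sd_tau_hat": 0.45569672831386343,
    "se_over_truesd": 1.2485949785722699,
}
after = {
    "mean_se": 0.5097482138251239,
    "true_sd_tau_hat": 0.4512243070193919,
    "se_over_truesd": 1.1297002530566618,
}


def se_ratio(block: dict) -> float:
    """Mean bootstrap SE divided by the empirical SD of tau-hat across seeds."""
    return block["mean_se"] / block["true_sd_tau_hat"]


for label, block in (("before", before), ("after", after)):
    ratio = se_ratio(block)
    # The stored ratio should match the recomputed one up to float noise.
    assert math.isclose(ratio, block["se_over_truesd"], rel_tol=1e-6)
    overshoot_pct = (ratio - 1.0) * 100.0
    print(f"{label}: SE / true SD = {ratio:.2f} (SE overstates SD by ~{overshoot_pct:.0f}%)")
```

Rounded to two decimals, the two ratios are the 1.25 and 1.13 quoted in the commit message and the REGISTRY row.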

benchmarks/python/coverage_sdid.py

Lines changed: 9 additions & 2 deletions
@@ -211,10 +211,17 @@ def _stratified_survey_dgp(seed: int) -> Tuple[pd.DataFrame, List[int]]:
     # generate_survey_did_data emits per-observation 'treated' (post-only
     # for treated units); SDID requires a unit-level ever-treated indicator
     # (constant across time). Derive from 'first_treat' (cohort, 0 for
-    # never-treated). Block-treatment cohort is 7 → post = 7..11.
+    # never-treated). Periods are 1-indexed (prep_dgp.py L1211-L1212), so
+    # cohort 7 with n_periods=12 → post = [7, 8, 9, 10, 11, 12] (6 post
+    # periods). Derive from df["period"].max() so any change to n_periods
+    # propagates (PR #355 R13 P1: the hard-coded range(7, 12) dropped
+    # period 12 into the pre window, contaminating calibration).
     df = df.copy()
     df["treated"] = (df["first_treat"] > 0).astype(int)
-    return df, list(range(7, 12))
+    cohort_onset = 7
+    period_max = int(df["period"].max())
+    post_periods = list(range(cohort_onset, period_max + 1))
+    return df, post_periods


 def _stratified_survey_design(df: pd.DataFrame) -> Tuple[Any, Tuple[str, ...]]:
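The window arithmetic behind this hunk can be reproduced without the real DGP. This is a minimal sketch using a plain list of 1-indexed periods. It assumes, as the commit message describes, that any period not listed in `post_periods` is treated as pre-treatment. `split_periods` is a hypothetical helper, not a function in the harness.

```python
from typing import List, Tuple


def split_periods(periods: List[int], post_periods: List[int]) -> Tuple[List[int], List[int]]:
    """Split observed periods into (pre, post) given the declared post window."""
    post = set(post_periods)
    pre = [p for p in periods if p not in post]
    return pre, [p for p in periods if p in post]


n_periods = 12
cohort_onset = 7
periods = list(range(1, n_periods + 1))  # 1-indexed: 1..12

# Old behaviour: hard-coded upper bound excludes period 12.
buggy_post = list(range(7, 12))
# New behaviour: upper bound derived from the data.
fixed_post = list(range(cohort_onset, max(periods) + 1))

buggy_pre, _ = split_periods(periods, buggy_post)
fixed_pre, _ = split_periods(periods, fixed_post)

print("buggy:", len(buggy_pre), "pre /", len(buggy_post), "post; pre =", buggy_pre)
print("fixed:", len(fixed_pre), "pre /", len(fixed_post), "post; pre =", fixed_pre)
```

With the hard-coded range, period 12 ends up in the pre list, giving the 7-pre/5-post split. Deriving the bound from the maximum period restores 6-pre/6-post.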

docs/methodology/REGISTRY.md

Lines changed: 2 additions & 2 deletions
@@ -1586,11 +1586,11 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi
 | AER §6.3 (N=100, N_tr=20, T=120, T_pre=115, rank=2, σ=2) | placebo | 0.018 | 0.058 | 0.086 | 0.99 |
 | AER §6.3 | bootstrap | 0.010 | 0.040 | 0.078 | 1.05 |
 | AER §6.3 | jackknife | 0.030 | 0.080 | 0.150 | 0.90 |
-| stratified_survey (N=40, strata=2, PSU=2/stratum, ICC≈0.84) | bootstrap | 0.014 | 0.042 | 0.088 | 1.25 |
+| stratified_survey (N=40, strata=2, PSU=2/stratum, ICC≈0.84) | bootstrap | 0.024 | 0.058 | 0.094 | 1.13 |

 Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three non-survey DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative on the smaller panels (balanced, AER §6.3) at α=0.05 (rejection 0.112 and 0.080 vs the 0.05 target). Arkhangelsky et al. (2021) §6.3 reports mixed jackknife evidence (98% coverage, slightly conservative, under iid; 93% coverage, slightly anti-conservative, under AR(1) ρ=0.7), so the direction of our observation is consistent with the AR(1) branch of the paper's evidence rather than the iid branch. The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds.

-**`stratified_survey × bootstrap` (PR #352)**: validates the weighted-FW + Rao-Wu composition added in this PR. Rejection at α=0.05 is 0.042 (well inside the calibration gate [0.02, 0.10] widened from a 2σ band to accommodate the high ICC ≈ 0.84 induced by `psu_re_sd=1.5` with only 4 PSUs total). `mean SE / true SD = 1.25` indicates the bootstrap is slightly conservative (overestimates the empirical sampling SD by ~25%), the safer direction; expected under Rao-Wu rescaling with few PSUs because the per-draw weights inflate variance from the resampling structure on top of the fit-time uncertainty. Placebo and jackknife rows are NaN here because both methods reject strata/PSU/FPC at fit-time (tracked as a separate methodology gap in TODO.md). Bootstrap is the only available variance method for full-design SDID fits in this release.
+**`stratified_survey × bootstrap` (PR #352)**: validates the weighted-FW + Rao-Wu composition added in this PR. Rejection at α=0.05 is 0.058 (inside the calibration gate [0.02, 0.10] widened from a 2σ band to accommodate the high ICC ≈ 0.84 induced by `psu_re_sd=1.5` with only 4 PSUs total). `mean SE / true SD = 1.13` indicates the bootstrap is slightly conservative (overestimates the empirical sampling SD by ~13%), the safer direction; expected under Rao-Wu rescaling with few PSUs because the per-draw weights inflate variance from the resampling structure on top of the fit-time uncertainty. Placebo and jackknife rows are `null` here because both methods reject strata/PSU/FPC at fit-time (tracked as a separate methodology gap in TODO.md). Bootstrap is the only available variance method for full-design SDID fits in this release.

 The schema smoke test is `TestCoverageMCArtifact::test_coverage_artifacts_present`; regenerate the JSON via `python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200 --output benchmarks/data/sdid_coverage.json` (~15–40 min on M-series Mac, Rust backend; warm-start convergence makes newer runs faster than the original cold-start one).
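For context on the Monte Carlo noise mentioned in the narrative, here is a rough sketch of a 2σ band around the nominal level at 500 seeds. It uses the binomial normal approximation, which is an assumption here: the REGISTRY text does not say how its band was computed. The function name `mc_two_sigma_band` is illustrative.

```python
import math
from typing import Tuple


def mc_two_sigma_band(p: float, n_seeds: int) -> Tuple[float, float]:
    """Approximate 2-sigma band for a rejection rate estimated from n_seeds runs.

    Assumes independent seeds and a binomial standard error sqrt(p(1-p)/n).
    """
    se = math.sqrt(p * (1.0 - p) / n_seeds)
    return p - 2.0 * se, p + 2.0 * se


n_seeds = 500
alpha = 0.05
observed = 0.058      # regenerated stratified_survey bootstrap rejection rate
gate = (0.02, 0.10)   # calibration gate quoted in the narrative

low, high = mc_two_sigma_band(alpha, n_seeds)
in_band = low <= observed <= high
in_gate = gate[0] <= observed <= gate[1]

print(f"2-sigma band at alpha={alpha}: [{low:.4f}, {high:.4f}]")
print(f"observed {observed}: inside band={in_band}, inside gate={in_gate}")
```

Under these assumptions the half-width at p = 0.05 is about 0.02, so the regenerated 0.058 sits inside both the approximate 2σ band and the wider gate.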

tests/test_methodology_sdid.py

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3518,3 +3518,47 @@ def test_coverage_artifacts_present(self):
35183518
"stratified_survey jackknife should have 0 successful fits "
35193519
"(strata/PSU/FPC raises NotImplementedError at fit-time)"
35203520
)
3521+
3522+
def test_stratified_survey_dgp_post_periods_cover_full_post_tail(self):
3523+
"""The ``stratified_survey`` coverage DGP must not drop any post period.
3524+
3525+
Regression against PR #355 R13 P1: the harness previously hard-
3526+
coded ``range(7, 12)`` as ``post_periods`` even though
3527+
``generate_survey_did_data`` is 1-indexed and emits periods
3528+
1..n_periods, so period 12 was silently included in the pre
3529+
window. That contaminated the ``stratified_survey × bootstrap``
3530+
calibration row and every downstream REGISTRY claim that
3531+
transcribes from the artifact. The fix derives
3532+
``post_periods`` from ``df["period"].max()``; this test fails
3533+
fast if a future refactor reintroduces the off-by-one.
3534+
"""
3535+
import importlib
3536+
coverage_sdid = importlib.import_module("benchmarks.python.coverage_sdid")
3537+
df, post_periods = coverage_sdid._stratified_survey_dgp(seed=0)
3538+
3539+
# Contract: post_periods covers the full tail from cohort onset
3540+
# through df["period"].max(), with no gaps (unique + sorted +
3541+
# contiguous + maxed at df["period"].max()).
3542+
assert len(post_periods) == len(set(post_periods)), (
3543+
f"post_periods has duplicates: {post_periods}"
3544+
)
3545+
assert post_periods == sorted(post_periods), (
3546+
f"post_periods not sorted: {post_periods}"
3547+
)
3548+
gaps = [
3549+
(a, b) for a, b in zip(post_periods, post_periods[1:]) if b - a != 1
3550+
]
3551+
assert not gaps, (
3552+
f"post_periods has gaps: {gaps} in {post_periods}"
3553+
)
3554+
assert post_periods[-1] == int(df["period"].max()), (
3555+
f"post_periods max {post_periods[-1]} != df[period].max() "
3556+
f"{int(df['period'].max())} — DGP drops the last post period "
3557+
"(off-by-one on 1-indexed generate_survey_did_data)."
3558+
)
3559+
# Strong form: cohort onset is 7, n_periods=12 → [7,8,9,10,11,12].
3560+
assert post_periods == [7, 8, 9, 10, 11, 12], (
3561+
f"post_periods={post_periods} must equal [7,8,9,10,11,12] for "
3562+
"the documented 6-pre/6-post survey DGP; any other slice "
3563+
"changes the calibration interpretation in REGISTRY."
3564+
)
