Address PR #355 R13 P1: stratified_survey DGP off-by-one on post_periods

``generate_survey_did_data`` is 1-indexed (prep_dgp.py L1211-L1212),
so ``n_periods=12`` with ``cohort_periods=[7]`` emits periods 1..12
with post = [7, 8, 9, 10, 11, 12]. The coverage harness'
``_stratified_survey_dgp`` returned ``list(range(7, 12))`` =
[7, 8, 9, 10, 11], silently dropping period 12 into the pre window.
SDID therefore fit the panel as 7-pre/5-post instead of the
documented 6-pre/6-post, and every rejection / mean SE cell in the
survey-bootstrap calibration row (plus the REGISTRY narrative
transcribed from it) was derived from the mis-specified window.

Fix: derive post_periods from ``df["period"].max()`` so any change
to ``n_periods`` propagates. Regression test
``test_stratified_survey_dgp_post_periods_cover_full_post_tail``
fails fast if a future refactor reintroduces the off-by-one (checks
unique / sorted / contiguous / max == df.period.max() plus the
explicit [7, 8, 9, 10, 11, 12] shape).
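The off-by-one and the checks described above can be sketched in plain Python. This is only an illustration, not the harness code: ``derive_post_periods`` and ``check_post_periods`` are hypothetical names, and a plain list stands in for ``df["period"]``.

```python
def derive_post_periods(periods, first_treated_period):
    """Return every period from first_treated_period through the last observed one."""
    last_period = max(periods)
    # range() excludes its stop value, so add 1 to keep the final period.
    return list(range(first_treated_period, last_period + 1))


def check_post_periods(post_periods, periods):
    """Hypothetical mirror of the regression checks: unique, sorted, contiguous, full tail."""
    assert len(set(post_periods)) == len(post_periods)
    assert post_periods == sorted(post_periods)
    assert all(b - a == 1 for a, b in zip(post_periods, post_periods[1:]))
    assert post_periods[-1] == max(periods)


# 1-indexed panel, as with n_periods=12 and cohort_periods=[7].
periods = list(range(1, 13))

buggy = list(range(7, 12))               # hard-coded stop drops period 12
fixed = derive_post_periods(periods, 7)  # stop follows the data
check_post_periods(fixed, periods)

pre_buggy = [p for p in periods if p not in buggy]  # period 12 lands here
pre_fixed = [p for p in periods if p not in fixed]
```

With the hard-coded stop, ``pre_buggy`` has 7 entries and ``buggy`` has 5; with the data-derived stop the split is 6 and 6.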

Regenerated only the stratified_survey block and spliced it into
the main artifact (other DGPs unaffected — their seeds and DGP code
are unchanged). New rejection rates at α = {0.01, 0.05, 0.10}:
{0.024, 0.058, 0.094}; mean SE / true SD drops from 1.25 to 1.13.
Rejection at α=0.05 remains well inside the calibration gate
[0.02, 0.10]. REGISTRY table row and narrative updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three non-survey DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative on the smaller panels (balanced, AER §6.3) at α=0.05 (rejection 0.112 and 0.080 vs the 0.05 target). Arkhangelsky et al. (2021) §6.3 reports mixed jackknife evidence (98% coverage — slightly conservative — under iid, and 93% coverage — slightly anti-conservative — under AR(1) ρ=0.7), so the direction of our observation is consistent with the AR(1) branch of the paper's evidence rather than the iid branch. The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds.
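As a rough illustration of the quantities in this reading, the sketch below computes a rejection rate, a binomial Monte Carlo 2σ half-width, and the `mean SE / true SD` ratio using only the standard library. The function names are hypothetical and the inputs are toy values, not artifact data. Two assumptions are made here: the band is taken to be a 2σ half-width, and the "true SD" uses the sample standard deviation (n − 1 denominator). The text above does not state either convention.

```python
import math
import statistics


def rejection_rate(p_values, alpha):
    """Share of seeds whose p-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def mc_two_sigma(p, n_seeds):
    """2-sigma half-width of a binomial proportion estimated from n_seeds draws."""
    return 2.0 * math.sqrt(p * (1.0 - p) / n_seeds)


def se_ratio(tau_hats, se_hats):
    """Mean estimated SE divided by the empirical SD of the point estimates."""
    # statistics.stdev uses the n - 1 denominator (an assumption of this sketch).
    return statistics.mean(se_hats) / statistics.stdev(tau_hats)


# Half-widths at 500 seeds: about 0.019 at p = 0.05 and about 0.027 at p = 0.10.
band_05 = mc_two_sigma(0.05, 500)
band_10 = mc_two_sigma(0.10, 500)

# Toy inputs only: a ratio above 1 means the estimated SEs overstate the
# sampling SD (conservative); a ratio below 1 means they understate it.
toy_ratio = se_ratio([1.0, 2.0, 3.0], [1.1, 1.2, 1.3])
toy_rate = rejection_rate([0.01, 0.20, 0.04, 0.60], 0.05)
```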
-
**`stratified_survey × bootstrap` (PR #352)**: validates the weighted-FW + Rao-Wu composition added in this PR. Rejection at α=0.05 is 0.042 (well inside the calibration gate [0.02, 0.10] widened from a 2σ band to accommodate the high ICC ≈ 0.84 induced by `psu_re_sd=1.5` with only 4 PSUs total). `mean SE / true SD = 1.25` indicates the bootstrap is slightly conservative (overestimates the empirical sampling SD by ~25%) — the safer direction; expected under Rao-Wu rescaling with few PSUs because the per-draw weights inflate variance from the resampling structure on top of the fit-time uncertainty. Placebo and jackknife rows are NaN here because both methods reject strata/PSU/FPC at fit-time (tracked as a separate methodology gap in TODO.md). Bootstrap is the only available variance method for full-design SDID fits in this release.
+
**`stratified_survey × bootstrap` (PR #352)**: validates the weighted-FW + Rao-Wu composition added in this PR. Rejection at α=0.05 is 0.058 (inside the calibration gate [0.02, 0.10] widened from a 2σ band to accommodate the high ICC ≈ 0.84 induced by `psu_re_sd=1.5` with only 4 PSUs total). `mean SE / true SD = 1.13` indicates the bootstrap is slightly conservative (overestimates the empirical sampling SD by ~13%) — the safer direction; expected under Rao-Wu rescaling with few PSUs because the per-draw weights inflate variance from the resampling structure on top of the fit-time uncertainty. Placebo and jackknife rows are `null` here because both methods reject strata/PSU/FPC at fit-time (tracked as a separate methodology gap in TODO.md). Bootstrap is the only available variance method for full-design SDID fits in this release.
The schema smoke test is `TestCoverageMCArtifact::test_coverage_artifacts_present`; regenerate the JSON via `python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200 --output benchmarks/data/sdid_coverage.json` (~15–40 min on M-series Mac, Rust backend — warm-start convergence makes newer runs faster than the original cold-start one).