Address PR #355 R13 P1: stratified_survey DGP off-by-one on post_periods

igerber · claude · igerber · commit 08056e482fa9 · 2026-04-24T09:04:18.000-04:00
``generate_survey_did_data`` is 1-indexed (prep_dgp.py L1211-L1212),
so ``n_periods=12`` with ``cohort_periods=[7]`` emits periods 1..12
with post = [7, 8, 9, 10, 11, 12]. The coverage harness'
``_stratified_survey_dgp`` returned ``list(range(7, 12))`` =
[7, 8, 9, 10, 11], silently dropping period 12 into the pre window.
SDID therefore fit the panel as 7-pre/5-post instead of the
documented 6-pre/6-post, and every rejection / mean SE cell in the
survey-bootstrap calibration row (plus the REGISTRY narrative
transcribed from it) was derived from the mis-specified window.

Fix: derive post_periods from ``df["period"].max()`` so any change
to ``n_periods`` propagates. Regression test
``test_stratified_survey_dgp_post_periods_cover_full_post_tail``
fails fast if a future refactor reintroduces the off-by-one (checks
unique / sorted / contiguous / max == df.period.max() plus the
explicit [7, 8, 9, 10, 11, 12] shape).
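
For reviewers, a minimal standalone sketch of the off-by-one and the
fix. It uses a plain list of periods instead of the real DataFrame, and
``derive_post_periods`` is a hypothetical helper named here only for
illustration; the harness inlines the same arithmetic.

```python
from typing import Iterable, List


def derive_post_periods(periods: Iterable[int], cohort_onset: int) -> List[int]:
    """Hypothetical helper: post window from cohort_onset through max(periods)."""
    period_max = max(periods)
    # range() excludes its stop value, so the last period needs the +1.
    return list(range(cohort_onset, period_max + 1))


n_periods = 12
periods = list(range(1, n_periods + 1))  # 1-indexed panel: 1..12

buggy_post = list(range(7, 12))  # old hard-coded window, stops at 11
fixed_post = derive_post_periods(periods, cohort_onset=7)

buggy_pre = [p for p in periods if p not in buggy_post]
fixed_pre = [p for p in periods if p not in fixed_post]

print(len(buggy_pre), len(buggy_post))  # 7 5  (period 12 lands in pre)
print(len(fixed_pre), len(fixed_post))  # 6 6  (documented split)
```

Deriving the upper bound from the data rather than from a literal means
a change to ``n_periods`` cannot leave the window stale.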

Regenerated only the stratified_survey block and spliced it into
the main artifact (other DGPs unaffected — their seeds and DGP code
are unchanged). New rejection rates at α = {0.01, 0.05, 0.10}:
{0.024, 0.058, 0.094}; mean SE / true SD drops from 1.25 to 1.13.
Rejection at α=0.05 (0.058) remains inside the calibration gate
[0.02, 0.10]. REGISTRY table row and narrative updated to match.
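
The headline numbers can be re-derived from the artifact values in the
diff below. This is a sketch for sanity-checking only: ``se_ratio`` and
``inside_gate`` are illustrative names, not functions in the harness,
and the inputs are copied by hand from sdid_coverage.json.

```python
# Values copied from the old and new stratified_survey bootstrap blocks.
old = {"mean_se": 0.5689806467245018, "true_sd_tau_hat": 0.45569672831386343}
new = {"mean_se": 0.5097482138251239, "true_sd_tau_hat": 0.4512243070193919}

GATE = (0.02, 0.10)  # calibration gate for rejection at alpha = 0.05


def se_ratio(row: dict) -> float:
    """Mean estimated SE divided by the empirical sampling SD of tau-hat."""
    return row["mean_se"] / row["true_sd_tau_hat"]


def inside_gate(rate: float, gate: tuple = GATE) -> bool:
    """True when the rejection rate lies within the closed gate interval."""
    lo, hi = gate
    return lo <= rate <= hi


print(f"{se_ratio(old):.2f} -> {se_ratio(new):.2f}")  # 1.25 -> 1.13
print(inside_gate(0.042), inside_gate(0.058))  # True True
```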

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/benchmarks/data/sdid_coverage.json b/benchmarks/data/sdid_coverage.json
@@ -4,7 +4,7 @@
     "n_bootstrap": 200,
     "library_version": "3.2.0",
     "backend": "rust",
-    "generated_at": "2026-04-24T00:58:22.180577+00:00",
+    "generated_at": "2026-04-24T13:01:54.876774+00:00",
     "total_elapsed_sec": 2420.61,
     "methods": [
       "placebo",
@@ -147,13 +147,13 @@
       "bootstrap": {
         "n_successful_fits": 500,
         "rejection_rate": {
-          "0.01": 0.014,
-          "0.05": 0.042,
-          "0.10": 0.088
+          "0.01": 0.024,
+          "0.05": 0.058,
+          "0.10": 0.094
         },
-        "mean_se": 0.5689806467245018,
-        "true_sd_tau_hat": 0.45569672831386343,
-        "se_over_truesd": 1.2485949785722699
+        "mean_se": 0.5097482138251239,
+        "true_sd_tau_hat": 0.4512243070193919,
+        "se_over_truesd": 1.1297002530566618
       },
       "jackknife": {
         "n_successful_fits": 0,
@@ -166,7 +166,7 @@
         "true_sd_tau_hat": null,
         "se_over_truesd": null
       },
-      "_elapsed_sec": 33.17
+      "_elapsed_sec": 16.48
     }
   }
 }
diff --git a/benchmarks/python/coverage_sdid.py b/benchmarks/python/coverage_sdid.py
@@ -211,10 +211,17 @@ def _stratified_survey_dgp(seed: int) -> Tuple[pd.DataFrame, List[int]]:
     # generate_survey_did_data emits per-observation 'treated' (post-only
     # for treated units); SDID requires a unit-level ever-treated indicator
     # (constant across time). Derive from 'first_treat' (cohort, 0 for
-    # never-treated). Block-treatment cohort is 7 → post = 7..11.
+    # never-treated). Periods are 1-indexed (prep_dgp.py L1211-L1212), so
+    # cohort 7 with n_periods=12 → post = [7, 8, 9, 10, 11, 12] (6 post
+    # periods). Derive from df["period"].max() so any change to n_periods
+    # propagates (PR #355 R13 P1 — the hard-coded range(7, 12) dropped
+    # period 12 into the pre window, contaminating calibration).
     df = df.copy()
     df["treated"] = (df["first_treat"] > 0).astype(int)
-    return df, list(range(7, 12))
+    cohort_onset = 7
+    period_max = int(df["period"].max())
+    post_periods = list(range(cohort_onset, period_max + 1))
+    return df, post_periods
 
 
 def _stratified_survey_design(df: pd.DataFrame) -> Tuple[Any, Tuple[str, ...]]:
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -1586,11 +1586,11 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi
     | AER §6.3 (N=100, N_tr=20, T=120, T_pre=115, rank=2, σ=2)  | placebo    | 0.018  | 0.058  | 0.086  | 0.99 |
     | AER §6.3                                                  | bootstrap  | 0.010  | 0.040  | 0.078  | 1.05 |
     | AER §6.3                                                  | jackknife  | 0.030  | 0.080  | 0.150  | 0.90 |
-    | stratified_survey (N=40, strata=2, PSU=2/stratum, ICC≈0.84) | bootstrap  | 0.014  | 0.042  | 0.088  | 1.25 |
+    | stratified_survey (N=40, strata=2, PSU=2/stratum, ICC≈0.84) | bootstrap  | 0.024  | 0.058  | 0.094  | 1.13 |
 
     Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three non-survey DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative on the smaller panels (balanced, AER §6.3) at α=0.05 (rejection 0.112 and 0.080 vs the 0.05 target). Arkhangelsky et al. (2021) §6.3 reports mixed jackknife evidence (98% coverage — slightly conservative — under iid, and 93% coverage — slightly anti-conservative — under AR(1) ρ=0.7), so the direction of our observation is consistent with the AR(1) branch of the paper's evidence rather than the iid branch. The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds.
 
-    **`stratified_survey × bootstrap` (PR #352)**: validates the weighted-FW + Rao-Wu composition added in this PR. Rejection at α=0.05 is 0.042 (well inside the calibration gate [0.02, 0.10] widened from a 2σ band to accommodate the high ICC ≈ 0.84 induced by `psu_re_sd=1.5` with only 4 PSUs total). `mean SE / true SD = 1.25` indicates the bootstrap is slightly conservative (overestimates the empirical sampling SD by ~25%) — the safer direction; expected under Rao-Wu rescaling with few PSUs because the per-draw weights inflate variance from the resampling structure on top of the fit-time uncertainty. Placebo and jackknife rows are NaN here because both methods reject strata/PSU/FPC at fit-time (tracked as a separate methodology gap in TODO.md). Bootstrap is the only available variance method for full-design SDID fits in this release.
+    **`stratified_survey × bootstrap` (PR #352)**: validates the weighted-FW + Rao-Wu composition added in this PR. Rejection at α=0.05 is 0.058 (inside the calibration gate [0.02, 0.10] widened from a 2σ band to accommodate the high ICC ≈ 0.84 induced by `psu_re_sd=1.5` with only 4 PSUs total). `mean SE / true SD = 1.13` indicates the bootstrap is slightly conservative (overestimates the empirical sampling SD by ~13%) — the safer direction; expected under Rao-Wu rescaling with few PSUs because the per-draw weights inflate variance from the resampling structure on top of the fit-time uncertainty. Placebo and jackknife rows are `null` here because both methods reject strata/PSU/FPC at fit-time (tracked as a separate methodology gap in TODO.md). Bootstrap is the only available variance method for full-design SDID fits in this release.
 
     The schema smoke test is `TestCoverageMCArtifact::test_coverage_artifacts_present`; regenerate the JSON via `python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200 --output benchmarks/data/sdid_coverage.json` (~15–40 min on M-series Mac, Rust backend — warm-start convergence makes newer runs faster than the original cold-start one).
 
diff --git a/tests/test_methodology_sdid.py b/tests/test_methodology_sdid.py
@@ -3518,3 +3518,47 @@ def test_coverage_artifacts_present(self):
             "stratified_survey jackknife should have 0 successful fits "
             "(strata/PSU/FPC raises NotImplementedError at fit-time)"
         )
+
+    def test_stratified_survey_dgp_post_periods_cover_full_post_tail(self):
+        """The ``stratified_survey`` coverage DGP must not drop any post period.
+
+        Regression against PR #355 R13 P1: the harness previously hard-
+        coded ``range(7, 12)`` as ``post_periods`` even though
+        ``generate_survey_did_data`` is 1-indexed and emits periods
+        1..n_periods, so period 12 was silently included in the pre
+        window. That contaminated the ``stratified_survey × bootstrap``
+        calibration row and every downstream REGISTRY claim that
+        transcribes from the artifact. The fix derives
+        ``post_periods`` from ``df["period"].max()``; this test fails
+        fast if a future refactor reintroduces the off-by-one.
+        """
+        import importlib
+        coverage_sdid = importlib.import_module("benchmarks.python.coverage_sdid")
+        df, post_periods = coverage_sdid._stratified_survey_dgp(seed=0)
+
+        # Contract: post_periods covers the full tail from cohort onset
+        # through df["period"].max(), with no gaps (unique + sorted +
+        # contiguous + maxed at df["period"].max()).
+        assert len(post_periods) == len(set(post_periods)), (
+            f"post_periods has duplicates: {post_periods}"
+        )
+        assert post_periods == sorted(post_periods), (
+            f"post_periods not sorted: {post_periods}"
+        )
+        gaps = [
+            (a, b) for a, b in zip(post_periods, post_periods[1:]) if b - a != 1
+        ]
+        assert not gaps, (
+            f"post_periods has gaps: {gaps} in {post_periods}"
+        )
+        assert post_periods[-1] == int(df["period"].max()), (
+            f"post_periods max {post_periods[-1]} != df['period'].max() "
+            f"{int(df['period'].max())} — DGP drops the last post period "
+            "(off-by-one on 1-indexed generate_survey_did_data)."
+        )
+        # Strong form: cohort onset is 7, n_periods=12 → [7,8,9,10,11,12].
+        assert post_periods == [7, 8, 9, 10, 11, 12], (
+            f"post_periods={post_periods} must equal [7,8,9,10,11,12] for "
+            "the documented 6-pre/6-post survey DGP; any other slice "
+            "changes the calibration interpretation in REGISTRY."
+        )