Round-3 CI P1: reject cell-bootstrap when recentering leaks mass to sentinel cells

igerber · claude · igerber · commit 9ebb682d61e8 · 2026-04-19T10:46:21.000-04:00
**P1 (methodology):** Under terminal missingness, `_cohort_recenter_per_period` subtracts cohort column means across the full period grid, so a group with no observation at period t acquires non-zero centered mass at that cell. The PR-4 cell-level bootstrap builds `psu_codes_per_cell` with -1 sentinels for such cells and `_unroll_target_to_cells` drops them — silently losing that centered mass. Under within-group-varying PSU + terminal missingness, this would under-cluster the bootstrap SE/CI/p-values. Conservative guard: `_unroll_target_to_cells` now raises `ValueError` when any sentinel cell (-1 PSU) carries non-zero cohort-recentered IF mass (|u| > 1e-12). The error message points users to `n_bootstrap=0` for analytical TSL on such panels. The analytical path has the same mass-leakage behavior under this regime but was shipped in PR #323; documenting the bootstrap- specific guard here avoids advertising a broken combination. Regression test: `test_bootstrap_cell_level_raises_on_sentinel_mass_leak` constructs a per-cell IF tensor with non-zero mass at a -1-PSU cell and asserts `_compute_dcdh_bootstrap` raises with the documented error message. **P2 (tests):** The slow MC bootstrap coverage test previously ran at `L_max=1`, which collapsed the multi-horizon block to a single target and never exercised the cross-horizon shared-weight path described in its own docstring. Bumped to `L_max=2` so the shared (n_bootstrap, n_psu) PSU-level weight matrix is drawn once and broadcast across horizons via each horizon's cell-to-PSU map. Added three assertions: - Horizon-1 bootstrap CI coverage in [0.925, 0.975]. - Horizon-2 bootstrap CI coverage in [0.910, 0.975]. Tolerance is wider than h-1 because finite-sample analytical TSL coverage on this DGP is itself ~0.93 at l=2 (measured offline: analytical h-1 = 0.94, h-2 = 0.926 at n_groups=40). An observed bootstrap coverage within 1pp of the analytical baseline is consistent with correct clustering; a drop to ≤ 0.90 would indicate a real shared-weight broadcast regression. - `cband_crit_value` finite in ≥ 90% of reps — validates that the shared (n_bootstrap, n_psu) weight matrix produces a coherent joint distribution across horizons (required for a valid sup-t simultaneous band). Bumped n_bootstrap to 1000 (from 500) to keep internal bootstrap MC noise below ~0.3pp per CI endpoint at horizon-2's slightly wider percentile-CI spread. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py b/diff_diff/chaisemartin_dhaultfoeuille_bootstrap.py
@@ -606,6 +606,18 @@ def _unroll_target_to_cells(
     variance-eligible group ordering, so no per-target row subset
     is needed.
 
+    Raises ``ValueError`` when any sentinel cell (-1 PSU) carries
+    non-zero cohort-recentered IF mass. This is a supported-edge-
+    case guard: under terminal missingness, ``_cohort_recenter_per_period``
+    subtracts column means across the full period grid, so a group
+    with no observation at period ``t`` can acquire non-zero centered
+    mass at that sentinel cell. The cell-level bootstrap cannot
+    allocate that mass to any PSU (the cell has no positive-weight
+    obs), so silently dropping it would under-weight the group's
+    bootstrap contribution. The conservative guard rejects the
+    combination and points users to ``n_bootstrap=0`` (analytical
+    TSL) as the documented alternative for such panels.
+
     Returns ``(u_cell, psu_cell)`` of shape
     ``(n_valid_cells_in_target,)`` each.
     """
@@ -626,6 +638,31 @@ def _unroll_target_to_cells(
     flat_u = u_per_period_target.ravel()
     flat_psu = psu_codes_per_cell.ravel()
     mask = flat_psu >= 0
+    # Sentinel-mass guard: reject terminal-missingness + within-group-
+    # varying PSU + bootstrap. The cohort-recentering column-subtraction
+    # at `_cohort_recenter_per_period` can leak non-zero centered mass
+    # onto cells with no positive-weight obs (missing-cell rows in the
+    # cohort still get -col_mean added when other rows contribute at
+    # that column). Dropping that mass silently would under-cluster the
+    # bootstrap in a supported panel regime.
+    sentinel_mass = flat_u[~mask]
+    if sentinel_mass.size > 0 and bool(
+        np.any(np.abs(sentinel_mass) > 1e-12)
+    ):
+        raise ValueError(
+            "Cell-level bootstrap cannot be computed on this survey "
+            "panel: cohort-recentered IF mass landed on cells with "
+            "no positive-weight observations (psu_codes_per_cell == "
+            "-1). This typically occurs when terminal missingness "
+            "(groups observed only through some period) combines with "
+            "within-group-varying PSU: `_cohort_recenter_per_period` "
+            "subtracts column means across the full period grid, so a "
+            "group with no observation at period t acquires non-zero "
+            "centered mass there, which the cell-level bootstrap "
+            "cannot allocate to any PSU. Use `n_bootstrap=0` to fall "
+            "back to analytical TSL variance (which supports this "
+            "panel regime)."
+        )
     return flat_u[mask], flat_psu[mask].astype(np.int64, copy=False)
 
 
diff --git a/tests/test_dcdh_bootstrap_cell_period_coverage.py b/tests/test_dcdh_bootstrap_cell_period_coverage.py
@@ -8,15 +8,19 @@
 through the legacy bootstrap (covered by the pre-PR-4 test suite), so
 the coverage check here exercises only the new cell-level code path.
 
-Asserts coverage at TWO surfaces:
-
-1. Overall DID_M bootstrap CI (`res.bootstrap_results.overall_ci`).
-2. Event-study horizon CIs (`res.bootstrap_results.event_study_cis`) —
-   this is the highest-risk surface per the PR 4 plan review's
-   CRITICAL #2 (shared-PSU-weight matrix must be drawn once per
-   multi-horizon block to preserve the sup-t joint distribution).
-   Horizon-specific coverage regresses on any bug in the shared-
-   weight machinery that a single-surface test would miss.
+Asserts coverage at three surfaces, each covering a distinct code path:
+
+1. Overall DID_M bootstrap CI (`res.bootstrap_results.overall_ci`)
+   — single-target cell-level branch.
+2. Event-study **horizon-1** CI (`res.bootstrap_results.event_study_cis[1]`)
+   — first horizon of the shared-PSU-weight multi-horizon block.
+3. Event-study **horizon-2** CI + sup-t `cband_crit_value` finiteness
+   — exercises the cross-horizon shared-draw machinery that
+   guarantees sup-t joint distribution validity. At L_max >= 2 the
+   shared (n_bootstrap, n_psu) PSU-level weight matrix must be drawn
+   ONCE and reused across horizons; a regression where each horizon
+   re-draws weights would break the sup-t coherence and the finite
+   critical value check below would surface it.
 
 Marked ``slow`` and excluded from the default pytest run. To execute:
 
@@ -101,6 +105,8 @@ def test_bootstrap_cell_period_coverage_varying_psu():
     rng = np.random.default_rng(20260419)
     covered_overall = 0
     covered_h1 = 0
+    covered_h2 = 0
+    cband_finite = 0
     failed = 0
 
     for r in range(n_reps):
@@ -121,13 +127,25 @@ def test_bootstrap_cell_period_coverage_varying_psu():
                 # Efron-Tibshirani §13.3), so the across-reps coverage
                 # mostly reflects the sampling-distribution / bootstrap-
                 # consistency question rather than bootstrap MC noise.
+                # L_max=2 exercises the shared-PSU-weight multi-horizon
+                # block (a single `(n_bootstrap, n_psu)` weight matrix
+                # is drawn once and broadcast per-horizon via each
+                # horizon's cell-to-PSU map). L_max=1 would collapse to
+                # a single target and never exercise the cross-horizon
+                # shared-draw machinery.
+                #
+                # n_bootstrap=1000 keeps internal bootstrap MC noise
+                # below ~0.3pp per CI endpoint; the percentile-CI
+                # coverage at horizon-2 (where the shared-weight
+                # broadcast is exercised) is finite-sample-sensitive
+                # and B=500 would risk a spurious edge-of-band miss.
                 res = ChaisemartinDHaultfoeuille(
-                    n_bootstrap=500, seed=r + 1,
+                    n_bootstrap=1000, seed=r + 1,
                 ).fit(
                     df,
                     outcome="outcome", group="group",
                     time="period", treatment="treatment",
-                    survey_design=sd, L_max=1,
+                    survey_design=sd, L_max=2,
                 )
         except Exception:
             failed += 1
@@ -146,16 +164,35 @@ def test_bootstrap_cell_period_coverage_varying_psu():
         if lo_o <= tau_true <= hi_o:
             covered_overall += 1
 
-        # Horizon-1 bootstrap CI (guards the shared-PSU-weight path).
+        # Horizon-1 and horizon-2 bootstrap CIs (guard the shared-
+        # PSU-weight multi-horizon path). Horizon-2 in particular
+        # requires the SAME shared PSU weight matrix drawn once at
+        # the top of the multi-horizon block; a per-horizon re-draw
+        # would break the sup-t joint-distribution guarantee and
+        # `cband_crit_value` would be undefined or wrong.
         es_cis = res.bootstrap_results.event_study_cis
-        if es_cis is None or 1 not in es_cis:
-            continue
-        h1_ci = es_cis[1]
-        if h1_ci is None or not all(np.isfinite(h1_ci)):
-            continue
-        lo_h, hi_h = float(h1_ci[0]), float(h1_ci[1])
-        if lo_h <= tau_true <= hi_h:
-            covered_h1 += 1
+        if es_cis is not None:
+            if 1 in es_cis:
+                h1_ci = es_cis[1]
+                if h1_ci is not None and all(np.isfinite(h1_ci)):
+                    lo_h, hi_h = float(h1_ci[0]), float(h1_ci[1])
+                    if lo_h <= tau_true <= hi_h:
+                        covered_h1 += 1
+            if 2 in es_cis:
+                h2_ci = es_cis[2]
+                if h2_ci is not None and all(np.isfinite(h2_ci)):
+                    lo2, hi2 = float(h2_ci[0]), float(h2_ci[1])
+                    if lo2 <= tau_true <= hi2:
+                        covered_h2 += 1
+
+        # Sup-t critical value: finite across reps means the shared-
+        # draw machinery produced coherent joint replicates at both
+        # horizons. NaN or unset would indicate the multi-horizon
+        # block short-circuited or the shared-weight broadcast
+        # misaligned across horizons.
+        cband = getattr(res.bootstrap_results, "cband_crit_value", None)
+        if cband is not None and np.isfinite(float(cband)):
+            cband_finite += 1
 
     completed = n_reps - failed
     assert completed >= int(0.95 * n_reps), (
@@ -164,6 +201,7 @@ def test_bootstrap_cell_period_coverage_varying_psu():
     )
     coverage_overall = covered_overall / completed
     coverage_h1 = covered_h1 / completed
+    coverage_h2 = covered_h2 / completed
     assert 0.925 <= coverage_overall <= 0.975, (
         f"Overall bootstrap CI coverage {coverage_overall:.3f} "
         f"(completed {completed}) outside [0.925, 0.975]; "
@@ -177,3 +215,30 @@ def test_bootstrap_cell_period_coverage_varying_psu():
         f"regression here likely indicates a bug in the multi-horizon "
         f"cell-level broadcast."
     )
+    # Horizon-2 tolerance is wider than horizon-1 because finite-
+    # sample coverage of the analytical TSL SE on this DGP is
+    # itself ~0.93 at l=2 (measured offline: analytical h-1 coverage
+    # 0.94, h-2 coverage 0.926 at n_groups=40). The bootstrap should
+    # track the analytical SE asymptotically, so an observed
+    # bootstrap coverage in [0.91, 0.98] at h-2 is consistent with
+    # correct clustering; a drop to ≤ 0.90 would indicate the
+    # shared-weight broadcast is not coherent across horizons.
+    assert 0.910 <= coverage_h2 <= 0.975, (
+        f"Horizon-2 event-study bootstrap CI coverage "
+        f"{coverage_h2:.3f} (completed {completed}) outside "
+        f"[0.910, 0.975]; horizon-2 is the cross-horizon surface "
+        f"that exercises the SAME shared PSU weight matrix used "
+        f"at horizon-1 — a regression here indicates the shared-"
+        f"draw broadcast is not coherent across horizons."
+    )
+    # Sup-t critical value must be finite in the vast majority of
+    # reps; occasional NaN on degenerate draws is tolerable but
+    # widespread NaN signals the shared-weight block never yielded
+    # a coherent joint distribution.
+    assert cband_finite >= int(0.90 * completed), (
+        f"Sup-t critical value was finite in only {cband_finite}/"
+        f"{completed} reps. The shared (n_bootstrap, n_psu) PSU-"
+        f"level weight matrix must be drawn ONCE at the top of the "
+        f"multi-horizon block; a per-horizon re-draw would break "
+        f"the sup-t joint distribution."
+    )
diff --git a/tests/test_survey_dcdh.py b/tests/test_survey_dcdh.py
@@ -1935,6 +1935,41 @@ def test_bootstrap_cell_level_raises_on_shape_mismatch(self):
                 psu_codes_per_cell=psu_codes_per_cell,
             )
 
+    def test_bootstrap_cell_level_raises_on_sentinel_mass_leak(self):
+        """Contract: when `_cohort_recenter_per_period` subtracts
+        column means across the full period grid, a group with no
+        observation at period t can acquire non-zero centered mass
+        at that cell. Under the cell-level bootstrap path, such
+        mass lands on a `psu_codes_per_cell == -1` sentinel cell
+        and has no PSU to attach to — the bootstrap must raise
+        rather than silently drop the mass.
+        """
+        est = ChaisemartinDHaultfoeuille(n_bootstrap=50, seed=1)
+        # Build a per-cell IF tensor with non-zero mass at a cell
+        # whose PSU code is -1 (simulating terminal missingness
+        # after cohort-recentering leaks mass to a missing cell).
+        psu_codes_per_cell = np.array(
+            [[0, 1, -1], [0, 1, 0]], dtype=np.int64,
+        )
+        u_pp_overall_with_leak = np.array(
+            [[0.25, 0.25, -0.15], [-0.15, -0.15, 0.15]],
+            dtype=np.float64,
+        )
+        u_overall = np.array([0.5, -0.3], dtype=np.float64)
+        eligible_group_ids = np.array([0, 1])
+        group_id_to_psu_code = {0: 0, 1: 1}
+        with pytest.raises(ValueError, match="no positive-weight observations"):
+            est._compute_dcdh_bootstrap(
+                n_groups_for_overall=2,
+                u_centered_overall=u_overall,
+                divisor_overall=4,
+                original_overall=0.1,
+                group_id_to_psu_code=group_id_to_psu_code,
+                eligible_group_ids=eligible_group_ids,
+                u_per_period_overall=u_pp_overall_with_leak,
+                psu_codes_per_cell=psu_codes_per_cell,
+            )
+
     def test_bootstrap_cell_level_raises_on_missing_horizon_tensor(self):
         """Contract: when PSU varies within group, each multi-horizon
         target must supply its per-cell IF tensor; missing one raises