Round 10: propagate bootstrap p-value and CI to top-level results

igerber · claude · igerber · commit 819ba5789e3c · 2026-04-11T21:47:12.000-04:00
Fixes a P1 where the dCDH bootstrap branch silently replaced the
multiplier-bootstrap percentile p-values and CIs with normal-theory
recomputations from the bootstrap SE. The bootstrap helper computes
the per-target SE / CI / p-value triples correctly on the
DCDHBootstrapResults object, but fit() was only copying the SE and
then calling safe_inference() which returns normal-theory p-values
and CIs. The user requested multiplier bootstrap inference and got
a hybrid (bootstrap SE + normal-theory p/CI) — the public surface
fields, event_study_effects[1], summary(), to_dataframe(),
is_significant, and significance_stars all reflected the
normal-theory recomputations instead of the bootstrap inference.
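
Illustration only, not library code: a minimal stdlib sketch of why the
hybrid is not equivalent to percentile inference. The draws are a
simulated skewed distribution, and the sign-based percentile p-value is
an assumption made for the example; the p-value definition actually used
by the dCDH bootstrap helper is not shown in this commit and may differ.

```python
import math
import random
from statistics import NormalDist, stdev


def normal_theory_inference(effect, se, alpha=0.05):
    """Normal-theory p-value and symmetric CI from a point estimate and an SE."""
    z = effect / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return p, (effect - crit * se, effect + crit * se)


def percentile_inference(draws, alpha=0.05):
    """Percentile CI plus a two-sided sign-based p-value (illustrative definition)."""
    ordered = sorted(draws)
    n = len(ordered)
    lo = ordered[int(math.floor((alpha / 2.0) * (n - 1)))]
    hi = ordered[int(math.ceil((1.0 - alpha / 2.0) * (n - 1)))]
    frac_le_zero = sum(d <= 0.0 for d in ordered) / n
    frac_ge_zero = sum(d >= 0.0 for d in ordered) / n
    p = min(1.0, 2.0 * min(frac_le_zero, frac_ge_zero))
    return p, (lo, hi)


# Simulated, right-skewed bootstrap draws centred (in expectation) on the effect.
random.seed(42)
effect = 0.4
draws = [
    effect + 0.3 * (random.lognormvariate(0.0, 1.0) - math.exp(0.5))
    for _ in range(999)
]
boot_se = stdev(draws)

# Hybrid: bootstrap SE fed into normal theory (the pre-fix behaviour).
hybrid_p, hybrid_ci = normal_theory_inference(effect, boot_se)
# Percentile: read p-value and CI off the bootstrap distribution itself.
pct_p, pct_ci = percentile_inference(draws)

print("hybrid    :", hybrid_p, hybrid_ci)
print("percentile:", pct_p, pct_ci)
```

The hybrid interval is symmetric around the point estimate by
construction, while the percentile interval follows the skew of the
draws, so the two can disagree on significance.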

The fix: in the fit() bootstrap branch, propagate br.overall_p_value
and br.overall_ci directly to the top-level overall_p / overall_ci
(and joiners/leavers analogues). Keep the t-stat from
safe_inference()[0]: percentile bootstrap doesn't define an
alternative t-stat semantic, and routing the SE through
safe_inference() satisfies the project anti-pattern rule (never
compute t = effect/se inline).
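
The propagation pattern can be sketched in isolation as follows. This is
a simplified stand-in, not the fit() body: `BootstrapTarget`,
`safe_inference_sketch`, and `propagate` are hypothetical names
introduced for illustration, and the stand-in only approximates what the
library's safe_inference() is assumed to do (return a NaN-safe
(t, p, ci) triple).

```python
import math
from dataclasses import dataclass
from statistics import NormalDist
from typing import Optional, Tuple

NAN = float("nan")


@dataclass
class BootstrapTarget:
    """Hypothetical per-target SE / p-value / CI triple from a bootstrap run."""
    se: float
    p_value: Optional[float]
    ci: Optional[Tuple[float, float]]


def safe_inference_sketch(effect, se, alpha=0.05):
    """Hypothetical stand-in for safe_inference(): returns (t, p, ci), NaN-safe."""
    if not math.isfinite(se) or se <= 0.0:
        return NAN, NAN, (NAN, NAN)
    t = effect / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return t, p, (effect - crit * se, effect + crit * se)


def propagate(effect, analytical_se, boot, alpha=0.05):
    """Return (se, t, p, ci) for the public surface of one target."""
    if boot is None or not math.isfinite(boot.se):
        # No usable bootstrap output: keep the analytical normal-theory triple.
        t, p, ci = safe_inference_sketch(effect, analytical_se, alpha)
        return analytical_se, t, p, ci
    # Bootstrap path: the t-stat still comes from the SE via the helper,
    # while p-value and CI are taken from the bootstrap output unchanged.
    t = safe_inference_sketch(effect, boot.se, alpha)[0]
    p = boot.p_value if boot.p_value is not None else NAN
    ci = boot.ci if boot.ci is not None else (NAN, NAN)
    return boot.se, t, p, ci


boot = BootstrapTarget(se=0.25, p_value=0.03, ci=(0.1, 1.4))
print(propagate(0.5, 0.2, boot))
print(propagate(0.5, 0.2, None))
```

Only the t-stat is recomputed on the bootstrap path; the p-value and CI
pass through untouched, which is the contract the regression test pins.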

Library precedent: imputation.py:790-805, two_stage.py:778-787, and
efficient_did.py:1009-1013 all propagate bootstrap p/CI to the
public surface while keeping a SE-derived t-stat. dCDH was the only
modern bootstrap-enabled estimator that didn't follow this pattern.

Documentation updates:
- New `**Note (bootstrap inference surface):**` block in REGISTRY.md
  documenting the propagation contract, the rationale for the
  SE-based t-stat, and the placebo carve-out (placebo bootstrap
  remains deferred to Phase 2).
- Inference-method switch paragraph added to the
  ChaisemartinDHaultfoeuilleResults class docstring `Notes` section.
- README.md row updated to clarify "cohort-recentered analytical SE
  by default; multiplier-bootstrap percentile inference when
  n_bootstrap > 0".
- API rst Multiplier bootstrap example now shows that
  results.overall_p_value and results.overall_conf_int reflect the
  bootstrap inference (not just bootstrap_results.overall_*).

Regression test added: test_bootstrap_p_value_and_ci_propagated_to_top_level
asserts results.overall_p_value == bootstrap_results.overall_p_value,
overall_conf_int == bootstrap_results.overall_ci, the joiners/leavers
analogues, event_study_effects[1] reflects bootstrap, the t-stat is
the SE-derived value, summary() doesn't crash, and to_dataframe()
returns the bootstrap-derived numbers. This pins the contract so
the silent inference-method swap can't regress.

Test counts: 110 -> 111 (one new bootstrap propagation regression
test). The existing TestBootstrap tests at lines 855-955 only
asserted that bootstrap_results was populated and SE was finite,
which is why the silent swap passed continuous integration for nine
commits.

Files modified:
- diff_diff/chaisemartin_dhaultfoeuille.py
  (fit() bootstrap branch propagates p/CI from br instead of
  recomputing via safe_inference)
- diff_diff/chaisemartin_dhaultfoeuille_results.py
  (class docstring `Notes` section gains the inference-method
  switch paragraph)
- docs/methodology/REGISTRY.md
  (new bootstrap inference surface Note)
- README.md (field-description row clarification)
- docs/api/chaisemartin_dhaultfoeuille.rst
  (Multiplier bootstrap example reflects the new propagation)
- tests/test_chaisemartin_dhaultfoeuille.py
  (test_bootstrap_p_value_and_ci_propagated_to_top_level)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/README.md b/README.md
@@ -1205,7 +1205,7 @@ ChaisemartinDHaultfoeuille(
 
 | Field | Description |
 |-------|-------------|
-| `overall_att`, `overall_se`, `overall_conf_int` | `DID_M` and inference (cohort-recentered analytical SE) |
+| `overall_att`, `overall_se`, `overall_conf_int` | `DID_M` and inference (cohort-recentered analytical SE by default; multiplier-bootstrap percentile inference when `n_bootstrap > 0`) |
 | `joiners_att`, `leavers_att` | Decomposition into the joiners (`DID_+`) and leavers (`DID_-`) views |
 | `placebo_effect` | Single-lag placebo (`DID_M^pl`) point estimate |
 | `per_period_effects` | Per-period decomposition with explicit A11-violation flags |
diff --git a/diff_diff/chaisemartin_dhaultfoeuille.py b/diff_diff/chaisemartin_dhaultfoeuille.py
@@ -1098,28 +1098,42 @@ def fit(
             )
             bootstrap_results = br
 
-            # Replace analytical SE with bootstrap SE for the targets that
-            # have valid bootstrap output. The original analytical values
-            # remain available via re-running with n_bootstrap=0. After
-            # the SE replacement we recompute t-stat / p-value / CI through
-            # ``safe_inference()`` so all inference fields stay consistent
-            # with the library-wide convention (project anti-pattern rule:
-            # never compute t_stat = effect / se inline; use safe_inference).
+            # Replace the analytical SE with the bootstrap SE for the
+            # targets that have valid bootstrap output, AND propagate
+            # the bootstrap percentile p-value and CI directly to the
+            # top-level fields. The t-stat is computed from the SE via
+            # safe_inference()[0] so the project anti-pattern rule
+            # (never compute t_stat = effect / se inline) stays
+            # satisfied — bootstrap does not define an alternative
+            # t-stat semantic for percentile bootstrap, so the
+            # SE-based t-stat is the natural choice.
+            #
+            # Library precedent: imputation.py:790-805,
+            # two_stage.py:778-787, and efficient_did.py:1009-1013 all
+            # propagate bootstrap p/CI to the public surface while
+            # keeping a SE-derived t-stat. Round 10 brings dCDH in line
+            # with that pattern (the prior code silently recomputed
+            # normal-theory p/CI from the bootstrap SE, which made the
+            # public inference surface a hybrid).
+            #
+            # See REGISTRY.md ChaisemartinDHaultfoeuille `Note
+            # (bootstrap inference surface)` and the regression test
+            # ``test_bootstrap_p_value_and_ci_propagated_to_top_level``.
             if np.isfinite(br.overall_se):
                 overall_se = br.overall_se
-                overall_t, overall_p, overall_ci = safe_inference(
-                    overall_att, overall_se, alpha=self.alpha, df=None
-                )
+                overall_p = br.overall_p_value if br.overall_p_value is not None else np.nan
+                overall_ci = br.overall_ci if br.overall_ci is not None else (np.nan, np.nan)
+                overall_t = safe_inference(overall_att, overall_se, alpha=self.alpha, df=None)[0]
             if joiners_available and br.joiners_se is not None and np.isfinite(br.joiners_se):
                 joiners_se = br.joiners_se
-                joiners_t, joiners_p, joiners_ci = safe_inference(
-                    joiners_att, joiners_se, alpha=self.alpha, df=None
-                )
+                joiners_p = br.joiners_p_value if br.joiners_p_value is not None else np.nan
+                joiners_ci = br.joiners_ci if br.joiners_ci is not None else (np.nan, np.nan)
+                joiners_t = safe_inference(joiners_att, joiners_se, alpha=self.alpha, df=None)[0]
             if leavers_available and br.leavers_se is not None and np.isfinite(br.leavers_se):
                 leavers_se = br.leavers_se
-                leavers_t, leavers_p, leavers_ci = safe_inference(
-                    leavers_att, leavers_se, alpha=self.alpha, df=None
-                )
+                leavers_p = br.leavers_p_value if br.leavers_p_value is not None else np.nan
+                leavers_ci = br.leavers_ci if br.leavers_ci is not None else (np.nan, np.nan)
+                leavers_t = safe_inference(leavers_att, leavers_se, alpha=self.alpha, df=None)[0]
 
         # ------------------------------------------------------------------
         # Step 20: Build the results dataclass
diff --git a/diff_diff/chaisemartin_dhaultfoeuille_results.py b/diff_diff/chaisemartin_dhaultfoeuille_results.py
@@ -138,6 +138,25 @@ class ChaisemartinDHaultfoeuilleResults:
     ``DIDmultiplegtDYN`` reference. The number of dropped groups is
     exposed via ``n_groups_dropped_crossers``.
 
+    **Inference-method switch when bootstrap is enabled.** The
+    ``overall_p_value`` / ``overall_conf_int`` (and joiners/leavers
+    analogues) fields are populated by *normal-theory* inference from
+    the cohort-recentered analytical SE when ``n_bootstrap=0`` (the
+    default). When ``n_bootstrap > 0``, the same fields are populated
+    by *percentile-based bootstrap inference* from the multiplier
+    bootstrap distribution computed by ``_compute_dcdh_bootstrap()``.
+    The t-stat (``overall_t_stat``, etc.) is computed from the SE in
+    both cases, since percentile bootstrap does not define an
+    alternative t-stat semantic. ``event_study_effects[1]``,
+    ``summary()``, ``to_dataframe()``, ``is_significant``, and
+    ``significance_stars`` all read from these top-level fields and
+    therefore reflect the bootstrap inference automatically. The
+    placebo path is unchanged: placebo bootstrap is deferred to Phase
+    2, so ``placebo_p_value`` and ``placebo_conf_int`` stay NaN even
+    when ``n_bootstrap > 0``. See the methodology registry
+    ``Note (bootstrap inference surface)`` for the full contract and
+    library precedent.
+
     Attributes
     ----------
     overall_att : float
diff --git a/docs/api/chaisemartin_dhaultfoeuille.rst b/docs/api/chaisemartin_dhaultfoeuille.rst
@@ -202,8 +202,15 @@ Multiplier bootstrap inference::
         data, outcome="outcome", group="group",
         time="period", treatment="treatment",
     )
-    print(f"Bootstrap SE: {results.bootstrap_results.overall_se:.3f}")
-    print(f"Bootstrap CI: {results.bootstrap_results.overall_ci}")
+    # When n_bootstrap > 0, the top-level overall_*/joiners_*/leavers_*
+    # p-value and conf_int fields hold percentile-based bootstrap
+    # inference (not normal-theory recomputations from the bootstrap SE).
+    # The t-stat is computed from the SE in both cases. See REGISTRY.md
+    # `Note (bootstrap inference surface)` for the full contract.
+    print(f"Top-level p-value (bootstrap): {results.overall_p_value:.4f}")
+    print(f"Top-level CI (bootstrap):     {results.overall_conf_int}")
+    print(f"bootstrap_results.overall_se: {results.bootstrap_results.overall_se:.3f}")
+    print(f"bootstrap_results.overall_ci: {results.bootstrap_results.overall_ci}")
 
 Standalone TWFE diagnostic (without fitting the full estimator)::
 
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -558,6 +558,8 @@ Alternative: Multiplier bootstrap clustered at group via the `n_bootstrap` param
 
 - **Note (Phase 1 cluster contract):** `ChaisemartinDHaultfoeuille` always clusters at the group level. The cohort-recentered analytical SE plug-in operates on per-group influence-function values (one `U^G_g` per group); the multiplier bootstrap generates one weight per group; both inference paths cluster at the user's `group` column with no other option. The constructor accepts `cluster=None` (the default and only supported value); passing any non-`None` value raises `NotImplementedError` with a Phase 1 pointer at construction time (and the same gate fires from `set_params`). Custom clustering at a coarser or finer level than the group is reserved for a future phase. The matching test is `test_cluster_parameter_raises_not_implemented` in `tests/test_chaisemartin_dhaultfoeuille.py::TestForwardCompatGates`.
 
+- **Note (bootstrap inference surface):** When `n_bootstrap > 0`, the top-level `results.overall_p_value` / `results.overall_conf_int` (and joiners/leavers analogues) hold **percentile-based bootstrap inference** computed by the multiplier bootstrap, NOT normal-theory recomputations from the bootstrap SE. The t-stat (`overall_t_stat`, etc.) is computed from the SE via `safe_inference()[0]` to satisfy the project's anti-pattern rule (never compute `t = effect / se` inline) — bootstrap does not define an alternative t-stat semantic for percentile bootstrap, so the SE-based t-stat is the natural choice. `event_study_effects[1]`, `summary()`, `to_dataframe()`, `is_significant`, and `significance_stars` all read from these top-level fields and therefore reflect the bootstrap inference automatically. The library precedent for this propagation is `imputation.py:790-805`, `two_stage.py:778-787`, and `efficient_did.py:1009-1013`. The placebo path is unchanged: placebo bootstrap is deferred to Phase 2 (see the placebo SE Note above), so `placebo_p_value` and `placebo_conf_int` stay NaN even when `n_bootstrap > 0`. The matching test is `test_bootstrap_p_value_and_ci_propagated_to_top_level` in `tests/test_chaisemartin_dhaultfoeuille.py::TestBootstrap`.
+
 - **Note:** Placebo Assumption 11 violations (placebo joiners exist but no 3-period stable_0 controls, or symmetric for leavers/stable_1) trigger zero-retention in the placebo numerator AND emit a consolidated `Placebo (DID_M^pl) Assumption 11 violations` warning from `fit()`, mirroring the main DID path's contract documented above. The zeroed placebo periods retain their switcher counts in the placebo `N_S^pl` denominator, biasing `DID_M^pl` toward zero in the offending direction (matching the Theorem 4 paper convention).
 
 - **Note (TWFE diagnostic sample contract):** The fitted `results.twfe_weights` / `results.twfe_fraction_negative` / `results.twfe_sigma_fe` / `results.twfe_beta_fe` are computed on the **FULL pre-filter cell sample** — the data the user passed in, after `_validate_and_aggregate_to_cells()` runs but **before** the ragged-panel validation (Step 5b) and the multi-switch filter (`drop_larger_lower`, Step 6). They do NOT describe the post-filter estimation sample used by `overall_att`, `results.groups`, and the inference fields. `fit()` has three sample-shaping filters in total: (1) interior-gap drops in Step 5b, (2) multi-switch drops in Step 6, and (3) the singleton-baseline filter in Step 7. Filters (1) and (2) actually shrink the point-estimate sample, so when either fires, the fitted TWFE diagnostic and `overall_att` describe **different samples** and the estimator emits a `UserWarning` explaining the divergence with explicit counts. Filter (3) is **variance-only** — singleton-baseline groups remain in the point-estimate sample as period-based stable controls (see the singleton-baseline Note above) — so it does NOT create a fitted-vs-`overall_att` mismatch and does NOT trigger the divergence warning. Rationale for the pre-filter design: the TWFE diagnostic answers "what would the plain TWFE estimator say on the data you passed in?" — not "what would TWFE say on the data dCDH actually used after filtering?" — so users comparing TWFE vs dCDH on a fixed input can do so without an interaction effect from the dCDH-specific filters. The standalone `twowayfeweights()` function uses the same pre-filter sample, so the fitted and standalone APIs always produce identical numbers on the same input. To reproduce the dCDH estimation sample for an external TWFE comparison, pre-process your data to drop the multi-switch and interior-gap groups before fitting (the warning lists offending IDs). The matching tests are `test_twfe_pre_filter_contract_with_interior_gap_drop` and `test_twfe_pre_filter_contract_with_multi_switch_drop` in `tests/test_chaisemartin_dhaultfoeuille.py`.
diff --git a/tests/test_chaisemartin_dhaultfoeuille.py b/tests/test_chaisemartin_dhaultfoeuille.py
@@ -971,6 +971,83 @@ def test_placebo_bootstrap_unavailable_in_phase_1(self, data, ci_params):
         if results.placebo_available:
             assert np.isfinite(results.placebo_effect)
 
+    def test_bootstrap_p_value_and_ci_propagated_to_top_level(self, data, ci_params):
+        """
+        Per the bootstrap inference surface contract: when
+        ``n_bootstrap > 0``, the top-level ``results.overall_*`` /
+        ``joiners_*`` / ``leavers_*`` p-value and CI fields hold the
+        percentile-based bootstrap inference computed by the
+        multiplier bootstrap, NOT normal-theory recomputations from
+        the bootstrap SE. The t-stat is still computed from the SE
+        (project anti-pattern rule: never compute t = effect/se
+        inline).
+
+        Pre-Round-10, the dCDH ``fit()`` body silently called
+        ``safe_inference(overall_att, br.overall_se)`` and stored its
+        normal-theory p/CI on the top-level fields, which made the
+        public inference surface a hybrid (bootstrap SE + normal-
+        theory p/CI). Library precedent for the propagation:
+        ``imputation.py:790-805``, ``two_stage.py:778-787``,
+        ``efficient_did.py:1009-1013``. This test pins the new
+        contract.
+
+        See REGISTRY.md ``ChaisemartinDHaultfoeuille`` ``Note
+        (bootstrap inference surface)``.
+        """
+        n_boot = ci_params.bootstrap(199)
+        est = ChaisemartinDHaultfoeuille(
+            n_bootstrap=n_boot,
+            bootstrap_weights="rademacher",
+            seed=42,
+        )
+        results = est.fit(
+            data,
+            outcome="outcome",
+            group="group",
+            time="period",
+            treatment="treatment",
+        )
+        br = results.bootstrap_results
+        assert br is not None
+
+        # Overall DID_M: top-level p-value and CI come from bootstrap
+        assert results.overall_p_value == pytest.approx(br.overall_p_value)
+        assert results.overall_conf_int == pytest.approx(br.overall_ci)
+        # The t-stat is computed from the SE (effect / se), not from
+        # a percentile distribution
+        assert np.isfinite(results.overall_t_stat)
+        expected_t = results.overall_att / results.overall_se
+        assert results.overall_t_stat == pytest.approx(expected_t)
+
+        # Joiners
+        if results.joiners_available and br.joiners_p_value is not None:
+            assert results.joiners_p_value == pytest.approx(br.joiners_p_value)
+            assert results.joiners_conf_int == pytest.approx(br.joiners_ci)
+
+        # Leavers
+        if results.leavers_available and br.leavers_p_value is not None:
+            assert results.leavers_p_value == pytest.approx(br.leavers_p_value)
+            assert results.leavers_conf_int == pytest.approx(br.leavers_ci)
+
+        # event_study_effects[1] mirrors the top-level overall fields,
+        # so it should also reflect the bootstrap inference
+        assert results.event_study_effects is not None
+        assert 1 in results.event_study_effects
+        es = results.event_study_effects[1]
+        assert es["p_value"] == pytest.approx(br.overall_p_value)
+        assert es["conf_int"] == pytest.approx(br.overall_ci)
+
+        # summary() and to_dataframe() chain off the top-level fields,
+        # so they automatically reflect the bootstrap inference. Smoke
+        # test that they don't crash and that the rendered values match
+        # the bootstrap output.
+        summary_text = results.summary()
+        assert "DID_M" in summary_text
+        df_overall = results.to_dataframe(level="overall")
+        assert df_overall.iloc[0]["p_value"] == pytest.approx(br.overall_p_value)
+        assert df_overall.iloc[0]["conf_int_lower"] == pytest.approx(br.overall_ci[0])
+        assert df_overall.iloc[0]["conf_int_upper"] == pytest.approx(br.overall_ci[1])
+
     def test_bootstrap_seed_reproducibility(self, data, ci_params):
         n_boot = ci_params.bootstrap(99)
         r1 = ChaisemartinDHaultfoeuille(n_bootstrap=n_boot, seed=42).fit(