Skip to content

Commit 819ba57

Browse files
igerberclaude
andcommitted
Round 10: propagate bootstrap p-value and CI to top-level results
Fixes a P1 where the dCDH bootstrap branch silently replaced the multiplier-bootstrap percentile p-values and CIs with normal-theory recomputations from the bootstrap SE. The bootstrap helper computes the per-target SE / CI / p-value triples correctly on the DCDHBootstrapResults object, but fit() was only copying the SE and then calling safe_inference() which returns normal-theory p-values and CIs. The user requested multiplier bootstrap inference and got a hybrid (bootstrap SE + normal-theory p/CI) — the public surface fields, event_study_effects[1], summary(), to_dataframe(), is_significant, and significance_stars all reflected the normal-theory recomputations instead of the bootstrap inference. The fix: in the fit() bootstrap branch, propagate br.overall_p_value and br.overall_ci directly to the top-level overall_p / overall_ci (and joiners/leavers analogues). Keep the t-stat from safe_inference()[0] since percentile bootstrap doesn't define an alternative t-stat semantic, and that satisfies the project anti-pattern rule (never compute t = effect/se inline). Library precedent: imputation.py:790-805, two_stage.py:778-787, and efficient_did.py:1009-1013 all propagate bootstrap p/CI to the public surface while keeping a SE-derived t-stat. dCDH was the only modern bootstrap-enabled estimator that didn't follow this pattern. Documentation updates: - New `**Note (bootstrap inference surface):**` block in REGISTRY.md documenting the propagation contract, the rationale for the SE-based t-stat, and the placebo carve-out (placebo bootstrap remains deferred to Phase 2). - Inference-method switch paragraph added to the ChaisemartinDHaultfoeuilleResults class docstring `Notes` section. - README.md row updated to clarify "cohort-recentered analytical SE by default; multiplier-bootstrap percentile inference when n_bootstrap > 0". - API rst Multiplier bootstrap example now shows that results.overall_p_value and results.overall_conf_int reflect the bootstrap inference (not just bootstrap_results.overall_*). Regression test added: test_bootstrap_p_value_and_ci_propagated_to_top_level asserts results.overall_p_value == bootstrap_results.overall_p_value, overall_conf_int == bootstrap_results.overall_ci, the joiners/leavers analogues, event_study_effects[1] reflects bootstrap, the t-stat is the SE-derived value, summary() doesn't crash, and to_dataframe() returns the bootstrap-derived numbers. This pins the contract so the silent inference-method swap can't regress. Test counts: 110 -> 111 (one new bootstrap propagation regression test). The existing TestBootstrap tests at lines 855-955 only asserted that bootstrap_results was populated and SE was finite, which is why the silent swap passed CI for nine commits. Files modified: - diff_diff/chaisemartin_dhaultfoeuille.py (fit() bootstrap branch propagates p/CI from br instead of recomputing via safe_inference) - diff_diff/chaisemartin_dhaultfoeuille_results.py (class docstring `Notes` section gains the inference-method switch paragraph) - docs/methodology/REGISTRY.md (new bootstrap inference surface Note) - README.md (field-description row clarification) - docs/api/chaisemartin_dhaultfoeuille.rst (Multiplier bootstrap example reflects the new propagation) - tests/test_chaisemartin_dhaultfoeuille.py (test_bootstrap_p_value_and_ci_propagated_to_top_level) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 63edb3d commit 819ba57

6 files changed

Lines changed: 138 additions & 19 deletions

File tree

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1205,7 +1205,7 @@ ChaisemartinDHaultfoeuille(
12051205

12061206
| Field | Description |
12071207
|-------|-------------|
1208-
| `overall_att`, `overall_se`, `overall_conf_int` | `DID_M` and inference (cohort-recentered analytical SE) |
1208+
| `overall_att`, `overall_se`, `overall_conf_int` | `DID_M` and inference (cohort-recentered analytical SE by default; multiplier-bootstrap percentile inference when `n_bootstrap > 0`) |
12091209
| `joiners_att`, `leavers_att` | Decomposition into the joiners (`DID_+`) and leavers (`DID_-`) views |
12101210
| `placebo_effect` | Single-lag placebo (`DID_M^pl`) point estimate |
12111211
| `per_period_effects` | Per-period decomposition with explicit A11-violation flags |

diff_diff/chaisemartin_dhaultfoeuille.py

Lines changed: 30 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -1098,28 +1098,42 @@ def fit(
10981098
)
10991099
bootstrap_results = br
11001100

1101-
# Replace analytical SE with bootstrap SE for the targets that
1102-
# have valid bootstrap output. The original analytical values
1103-
# remain available via re-running with n_bootstrap=0. After
1104-
# the SE replacement we recompute t-stat / p-value / CI through
1105-
# ``safe_inference()`` so all inference fields stay consistent
1106-
# with the library-wide convention (project anti-pattern rule:
1107-
# never compute t_stat = effect / se inline; use safe_inference).
1101+
# Replace the analytical SE with the bootstrap SE for the
1102+
# targets that have valid bootstrap output, AND propagate
1103+
# the bootstrap percentile p-value and CI directly to the
1104+
# top-level fields. The t-stat is computed from the SE via
1105+
# safe_inference()[0] so the project anti-pattern rule
1106+
# (never compute t_stat = effect / se inline) stays
1107+
# satisfied — bootstrap does not define an alternative
1108+
# t-stat semantic for percentile bootstrap, so the
1109+
# SE-based t-stat is the natural choice.
1110+
#
1111+
# Library precedent: imputation.py:790-805,
1112+
# two_stage.py:778-787, and efficient_did.py:1009-1013 all
1113+
# propagate bootstrap p/CI to the public surface while
1114+
# keeping a SE-derived t-stat. Round 10 brings dCDH in line
1115+
# with that pattern (the prior code silently recomputed
1116+
# normal-theory p/CI from the bootstrap SE, which made the
1117+
# public inference surface a hybrid).
1118+
#
1119+
# See REGISTRY.md ChaisemartinDHaultfoeuille `Note
1120+
# (bootstrap inference surface)` and the regression test
1121+
# ``test_bootstrap_p_value_and_ci_propagated_to_top_level``.
11081122
if np.isfinite(br.overall_se):
11091123
overall_se = br.overall_se
1110-
overall_t, overall_p, overall_ci = safe_inference(
1111-
overall_att, overall_se, alpha=self.alpha, df=None
1112-
)
1124+
overall_p = br.overall_p_value if br.overall_p_value is not None else np.nan
1125+
overall_ci = br.overall_ci if br.overall_ci is not None else (np.nan, np.nan)
1126+
overall_t = safe_inference(overall_att, overall_se, alpha=self.alpha, df=None)[0]
11131127
if joiners_available and br.joiners_se is not None and np.isfinite(br.joiners_se):
11141128
joiners_se = br.joiners_se
1115-
joiners_t, joiners_p, joiners_ci = safe_inference(
1116-
joiners_att, joiners_se, alpha=self.alpha, df=None
1117-
)
1129+
joiners_p = br.joiners_p_value if br.joiners_p_value is not None else np.nan
1130+
joiners_ci = br.joiners_ci if br.joiners_ci is not None else (np.nan, np.nan)
1131+
joiners_t = safe_inference(joiners_att, joiners_se, alpha=self.alpha, df=None)[0]
11181132
if leavers_available and br.leavers_se is not None and np.isfinite(br.leavers_se):
11191133
leavers_se = br.leavers_se
1120-
leavers_t, leavers_p, leavers_ci = safe_inference(
1121-
leavers_att, leavers_se, alpha=self.alpha, df=None
1122-
)
1134+
leavers_p = br.leavers_p_value if br.leavers_p_value is not None else np.nan
1135+
leavers_ci = br.leavers_ci if br.leavers_ci is not None else (np.nan, np.nan)
1136+
leavers_t = safe_inference(leavers_att, leavers_se, alpha=self.alpha, df=None)[0]
11231137

11241138
# ------------------------------------------------------------------
11251139
# Step 20: Build the results dataclass

diff_diff/chaisemartin_dhaultfoeuille_results.py

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -138,6 +138,25 @@ class ChaisemartinDHaultfoeuilleResults:
138138
``DIDmultiplegtDYN`` reference. The number of dropped groups is
139139
exposed via ``n_groups_dropped_crossers``.
140140
141+
**Inference-method switch when bootstrap is enabled.** The
142+
``overall_p_value`` / ``overall_conf_int`` (and joiners/leavers
143+
analogues) fields are populated by *normal-theory* inference from
144+
the cohort-recentered analytical SE when ``n_bootstrap=0`` (the
145+
default). When ``n_bootstrap > 0``, the same fields are populated
146+
by *percentile-based bootstrap inference* from the multiplier
147+
bootstrap distribution computed by ``_compute_dcdh_bootstrap()``.
148+
The t-stat (``overall_t_stat``, etc.) is computed from the SE in
149+
both cases, since percentile bootstrap does not define an
150+
alternative t-stat semantic. ``event_study_effects[1]``,
151+
``summary()``, ``to_dataframe()``, ``is_significant``, and
152+
``significance_stars`` all read from these top-level fields and
153+
therefore reflect the bootstrap inference automatically. The
154+
placebo path is unchanged: placebo bootstrap is deferred to Phase
155+
2, so ``placebo_p_value`` and ``placebo_conf_int`` stay NaN even
156+
when ``n_bootstrap > 0``. See the methodology registry
157+
``Note (bootstrap inference surface)`` for the full contract and
158+
library precedent.
159+
141160
Attributes
142161
----------
143162
overall_att : float

docs/api/chaisemartin_dhaultfoeuille.rst

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -202,8 +202,15 @@ Multiplier bootstrap inference::
202202
data, outcome="outcome", group="group",
203203
time="period", treatment="treatment",
204204
)
205-
print(f"Bootstrap SE: {results.bootstrap_results.overall_se:.3f}")
206-
print(f"Bootstrap CI: {results.bootstrap_results.overall_ci}")
205+
# When n_bootstrap > 0, the top-level overall_*/joiners_*/leavers_*
206+
# p-value and conf_int fields hold percentile-based bootstrap
207+
# inference (not normal-theory recomputations from the bootstrap SE).
208+
# The t-stat is computed from the SE in both cases. See REGISTRY.md
209+
# `Note (bootstrap inference surface)` for the full contract.
210+
print(f"Top-level p-value (bootstrap): {results.overall_p_value:.4f}")
211+
print(f"Top-level CI (bootstrap): {results.overall_conf_int}")
212+
print(f"bootstrap_results.overall_se: {results.bootstrap_results.overall_se:.3f}")
213+
print(f"bootstrap_results.overall_ci: {results.bootstrap_results.overall_ci}")
207214

208215
Standalone TWFE diagnostic (without fitting the full estimator)::
209216

docs/methodology/REGISTRY.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -558,6 +558,8 @@ Alternative: Multiplier bootstrap clustered at group via the `n_bootstrap` param
558558

559559
- **Note (Phase 1 cluster contract):** `ChaisemartinDHaultfoeuille` always clusters at the group level. The cohort-recentered analytical SE plug-in operates on per-group influence-function values (one `U^G_g` per group); the multiplier bootstrap generates one weight per group; both inference paths cluster at the user's `group` column with no other option. The constructor accepts `cluster=None` (the default and only supported value); passing any non-`None` value raises `NotImplementedError` with a Phase 1 pointer at construction time (and the same gate fires from `set_params`). Custom clustering at a coarser or finer level than the group is reserved for a future phase. The matching test is `test_cluster_parameter_raises_not_implemented` in `tests/test_chaisemartin_dhaultfoeuille.py::TestForwardCompatGates`.
560560

561+
- **Note (bootstrap inference surface):** When `n_bootstrap > 0`, the top-level `results.overall_p_value` / `results.overall_conf_int` (and joiners/leavers analogues) hold **percentile-based bootstrap inference** computed by the multiplier bootstrap, NOT normal-theory recomputations from the bootstrap SE. The t-stat (`overall_t_stat`, etc.) is computed from the SE via `safe_inference()[0]` to satisfy the project's anti-pattern rule (never compute `t = effect / se` inline) — bootstrap does not define an alternative t-stat semantic for percentile bootstrap, so the SE-based t-stat is the natural choice. `event_study_effects[1]`, `summary()`, `to_dataframe()`, `is_significant`, and `significance_stars` all read from these top-level fields and therefore reflect the bootstrap inference automatically. The library precedent for this propagation is `imputation.py:790-805`, `two_stage.py:778-787`, and `efficient_did.py:1009-1013`. The placebo path is unchanged: placebo bootstrap is deferred to Phase 2 (see the placebo SE Note above), so `placebo_p_value` and `placebo_conf_int` stay NaN even when `n_bootstrap > 0`. The matching test is `test_bootstrap_p_value_and_ci_propagated_to_top_level` in `tests/test_chaisemartin_dhaultfoeuille.py::TestBootstrap`.
562+
561563
- **Note:** Placebo Assumption 11 violations (placebo joiners exist but no 3-period stable_0 controls, or symmetric for leavers/stable_1) trigger zero-retention in the placebo numerator AND emit a consolidated `Placebo (DID_M^pl) Assumption 11 violations` warning from `fit()`, mirroring the main DID path's contract documented above. The zeroed placebo periods retain their switcher counts in the placebo `N_S^pl` denominator, biasing `DID_M^pl` toward zero in the offending direction (matching the Theorem 4 paper convention).
562564

563565
- **Note (TWFE diagnostic sample contract):** The fitted `results.twfe_weights` / `results.twfe_fraction_negative` / `results.twfe_sigma_fe` / `results.twfe_beta_fe` are computed on the **FULL pre-filter cell sample** — the data the user passed in, after `_validate_and_aggregate_to_cells()` runs but **before** the ragged-panel validation (Step 5b) and the multi-switch filter (`drop_larger_lower`, Step 6). They do NOT describe the post-filter estimation sample used by `overall_att`, `results.groups`, and the inference fields. `fit()` has three sample-shaping filters in total: (1) interior-gap drops in Step 5b, (2) multi-switch drops in Step 6, and (3) the singleton-baseline filter in Step 7. Filters (1) and (2) actually shrink the point-estimate sample, so when either fires, the fitted TWFE diagnostic and `overall_att` describe **different samples** and the estimator emits a `UserWarning` explaining the divergence with explicit counts. Filter (3) is **variance-only** — singleton-baseline groups remain in the point-estimate sample as period-based stable controls (see the singleton-baseline Note above) — so it does NOT create a fitted-vs-`overall_att` mismatch and does NOT trigger the divergence warning. Rationale for the pre-filter design: the TWFE diagnostic answers "what would the plain TWFE estimator say on the data you passed in?" — not "what would TWFE say on the data dCDH actually used after filtering?" — so users comparing TWFE vs dCDH on a fixed input can do so without an interaction effect from the dCDH-specific filters. The standalone `twowayfeweights()` function uses the same pre-filter sample, so the fitted and standalone APIs always produce identical numbers on the same input. To reproduce the dCDH estimation sample for an external TWFE comparison, pre-process your data to drop the multi-switch and interior-gap groups before fitting (the warning lists offending IDs). The matching tests are `test_twfe_pre_filter_contract_with_interior_gap_drop` and `test_twfe_pre_filter_contract_with_multi_switch_drop` in `tests/test_chaisemartin_dhaultfoeuille.py`.

tests/test_chaisemartin_dhaultfoeuille.py

Lines changed: 77 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -971,6 +971,83 @@ def test_placebo_bootstrap_unavailable_in_phase_1(self, data, ci_params):
971971
if results.placebo_available:
972972
assert np.isfinite(results.placebo_effect)
973973

974+
def test_bootstrap_p_value_and_ci_propagated_to_top_level(self, data, ci_params):
975+
"""
976+
Per the bootstrap inference surface contract: when
977+
``n_bootstrap > 0``, the top-level ``results.overall_*`` /
978+
``joiners_*`` / ``leavers_*`` p-value and CI fields hold the
979+
percentile-based bootstrap inference computed by the
980+
multiplier bootstrap, NOT normal-theory recomputations from
981+
the bootstrap SE. The t-stat is still computed from the SE
982+
(project anti-pattern rule: never compute t = effect/se
983+
inline).
984+
985+
Pre-Round-10, the dCDH ``fit()`` body silently called
986+
``safe_inference(overall_att, br.overall_se)`` and stored its
987+
normal-theory p/CI on the top-level fields, which made the
988+
public inference surface a hybrid (bootstrap SE + normal-
989+
theory p/CI). Library precedent for the propagation:
990+
``imputation.py:790-805``, ``two_stage.py:778-787``,
991+
``efficient_did.py:1009-1013``. This test pins the new
992+
contract.
993+
994+
See REGISTRY.md ``ChaisemartinDHaultfoeuille`` ``Note
995+
(bootstrap inference surface)``.
996+
"""
997+
n_boot = ci_params.bootstrap(199)
998+
est = ChaisemartinDHaultfoeuille(
999+
n_bootstrap=n_boot,
1000+
bootstrap_weights="rademacher",
1001+
seed=42,
1002+
)
1003+
results = est.fit(
1004+
data,
1005+
outcome="outcome",
1006+
group="group",
1007+
time="period",
1008+
treatment="treatment",
1009+
)
1010+
br = results.bootstrap_results
1011+
assert br is not None
1012+
1013+
# Overall DID_M: top-level p-value and CI come from bootstrap
1014+
assert results.overall_p_value == pytest.approx(br.overall_p_value)
1015+
assert results.overall_conf_int == pytest.approx(br.overall_ci)
1016+
# The t-stat is computed from the SE (effect / se), not from
1017+
# a percentile distribution
1018+
assert np.isfinite(results.overall_t_stat)
1019+
expected_t = results.overall_att / results.overall_se
1020+
assert results.overall_t_stat == pytest.approx(expected_t)
1021+
1022+
# Joiners
1023+
if results.joiners_available and br.joiners_p_value is not None:
1024+
assert results.joiners_p_value == pytest.approx(br.joiners_p_value)
1025+
assert results.joiners_conf_int == pytest.approx(br.joiners_ci)
1026+
1027+
# Leavers
1028+
if results.leavers_available and br.leavers_p_value is not None:
1029+
assert results.leavers_p_value == pytest.approx(br.leavers_p_value)
1030+
assert results.leavers_conf_int == pytest.approx(br.leavers_ci)
1031+
1032+
# event_study_effects[1] mirrors the top-level overall fields,
1033+
# so it should also reflect the bootstrap inference
1034+
assert results.event_study_effects is not None
1035+
assert 1 in results.event_study_effects
1036+
es = results.event_study_effects[1]
1037+
assert es["p_value"] == pytest.approx(br.overall_p_value)
1038+
assert es["conf_int"] == pytest.approx(br.overall_ci)
1039+
1040+
# summary() and to_dataframe() chain off the top-level fields,
1041+
# so they automatically reflect the bootstrap inference. Smoke
1042+
# test that they don't crash and that the rendered values match
1043+
# the bootstrap output.
1044+
summary_text = results.summary()
1045+
assert "DID_M" in summary_text
1046+
df_overall = results.to_dataframe(level="overall")
1047+
assert df_overall.iloc[0]["p_value"] == pytest.approx(br.overall_p_value)
1048+
assert df_overall.iloc[0]["conf_int_lower"] == pytest.approx(br.overall_ci[0])
1049+
assert df_overall.iloc[0]["conf_int_upper"] == pytest.approx(br.overall_ci[1])
1050+
9741051
def test_bootstrap_seed_reproducibility(self, data, ci_params):
9751052
n_boot = ci_params.bootstrap(99)
9761053
r1 = ChaisemartinDHaultfoeuille(n_bootstrap=n_boot, seed=42).fit(

0 commit comments

Comments
 (0)