Round 10: propagate bootstrap p-value and CI to top-level results
Fixes a P1 where the dCDH bootstrap branch silently replaced the
multiplier-bootstrap percentile p-values and CIs with normal-theory
recomputations from the bootstrap SE. The bootstrap helper computes
the per-target SE / CI / p-value triples correctly on the
DCDHBootstrapResults object, but fit() was only copying the SE and
then calling safe_inference() which returns normal-theory p-values
and CIs. The user requested multiplier bootstrap inference and got
a hybrid (bootstrap SE + normal-theory p/CI): the public surface
fields, event_study_effects[1], summary(), to_dataframe(),
is_significant, and significance_stars all reflected the
normal-theory recomputations instead of the bootstrap inference.
The fix: in the fit() bootstrap branch, propagate br.overall_p_value
and br.overall_ci directly to the top-level overall_p / overall_ci
(and joiners/leavers analogues). Keep the t-stat from
safe_inference()[0]: the percentile bootstrap doesn't define an
alternative t-stat semantic, and routing the t-stat through
safe_inference() satisfies the project anti-pattern rule (never
compute t = effect/se inline).
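The propagation pattern described above can be sketched in isolation. This is a minimal illustration, not the library's code: `normal_inference` is a hypothetical stand-in for `safe_inference()`, `percentile_inference` and `fit_top_level` are invented names, and the two-sided percentile p-value formula is an assumption that may differ from what the bootstrap helper actually computes.

```python
import random
from statistics import NormalDist, stdev


def normal_inference(effect, se, alpha=0.05):
    """Hypothetical stand-in for safe_inference(): (t_stat, p_value, conf_int)."""
    z = NormalDist()
    t_stat = effect / se
    p_value = 2.0 * (1.0 - z.cdf(abs(t_stat)))
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    return t_stat, p_value, (effect - crit * se, effect + crit * se)


def percentile_inference(draws, alpha=0.05):
    """Bootstrap SE, percentile p-value, and percentile CI from the draws."""
    ordered = sorted(draws)
    n = len(ordered)
    se = stdev(ordered)
    # Assumed two-sided p-value: twice the smaller tail share around zero.
    share_le = sum(d <= 0.0 for d in ordered) / n
    share_ge = sum(d >= 0.0 for d in ordered) / n
    p_value = min(1.0, 2.0 * min(share_le, share_ge))
    # Approximate empirical quantiles picked by index.
    lower = ordered[int(alpha / 2.0 * n)]
    upper = ordered[int((1.0 - alpha / 2.0) * n) - 1]
    return se, p_value, (lower, upper)


def fit_top_level(effect, draws):
    """Propagate bootstrap p/CI; take only the t-stat from the normal helper."""
    se, boot_p, boot_ci = percentile_inference(draws)
    t_stat = normal_inference(effect, se)[0]
    return {"se": se, "t_stat": t_stat, "p_value": boot_p, "conf_int": boot_ci}


# Synthetic, right-skewed bootstrap draws (illustrative data only).
rng = random.Random(0)
effect = 1.0
draws = [effect + 1.5 * (rng.expovariate(1.0) - 1.0) for _ in range(2000)]

top = fit_top_level(effect, draws)
# What the pre-fix hybrid would have reported from the same bootstrap SE.
_, hybrid_p, hybrid_ci = normal_inference(effect, top["se"])
print("bootstrap p/CI:", top["p_value"], top["conf_int"])
print("hybrid p/CI:   ", hybrid_p, hybrid_ci)
```

With skewed draws the percentile interval is asymmetric around the estimate, while the normal-theory interval built from the same SE is symmetric, which is why the two surfaces can disagree.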
Library precedent: imputation.py:790-805, two_stage.py:778-787, and
efficient_did.py:1009-1013 all propagate bootstrap p/CI to the
public surface while keeping a SE-derived t-stat. dCDH was the only
modern bootstrap-enabled estimator that didn't follow this pattern.
Documentation updates:
- New `**Note (bootstrap inference surface):**` block in REGISTRY.md
  documenting the propagation contract, the rationale for the
  SE-based t-stat, and the placebo carve-out (placebo bootstrap
  remains deferred to Phase 2).
- Inference-method switch paragraph added to the
  ChaisemartinDHaultfoeuilleResults class docstring `Notes` section.
- README.md row updated to clarify "cohort-recentered analytical SE
  by default; multiplier-bootstrap percentile inference when
  n_bootstrap > 0".
- API rst Multiplier bootstrap example now shows that
  results.overall_p_value and results.overall_conf_int reflect the
  bootstrap inference (not just bootstrap_results.overall_*).
Regression test added: test_bootstrap_p_value_and_ci_propagated_to_top_level
asserts results.overall_p_value == bootstrap_results.overall_p_value,
overall_conf_int == bootstrap_results.overall_ci, the joiners/leavers
analogues, event_study_effects[1] reflects bootstrap, the t-stat is
the SE-derived value, summary() doesn't crash, and to_dataframe()
returns the bootstrap-derived numbers. This pins the contract so
the silent inference-method swap can't regress.
Test counts: 110 -> 111 (one new bootstrap propagation regression
test). The existing TestBootstrap tests at lines 855-955 only
asserted that bootstrap_results was populated and SE was finite,
which is why the silent swap passed CI for nine commits.
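A small sketch of why the older assertions could not catch the swap, using hypothetical stand-in dataclasses rather than the real results classes. The field names follow those mentioned in this message; the numbers are made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class BootstrapStandIn:
    """Hypothetical stand-in for the bootstrap results object."""
    overall_se: float
    overall_p_value: float
    overall_ci: tuple


@dataclass
class ResultsStandIn:
    """Hypothetical stand-in for the top-level results object."""
    overall_se: float
    overall_p_value: float
    overall_conf_int: tuple
    bootstrap_results: BootstrapStandIn


def weak_check(res):
    """Mirrors the older assertions: bootstrap populated and SE finite."""
    return res.bootstrap_results is not None and math.isfinite(res.overall_se)


def contract_check(res):
    """Mirrors the new contract: top-level p/CI equal the bootstrap p/CI."""
    br = res.bootstrap_results
    return (
        res.overall_p_value == br.overall_p_value
        and res.overall_conf_int == br.overall_ci
    )


# Illustrative numbers only.
br = BootstrapStandIn(overall_se=0.5, overall_p_value=0.04, overall_ci=(0.1, 2.2))
hybrid = ResultsStandIn(0.5, 0.0455, (0.02, 1.98), br)  # bootstrap SE, normal p/CI
fixed = ResultsStandIn(br.overall_se, br.overall_p_value, br.overall_ci, br)

print(weak_check(hybrid), contract_check(hybrid))
print(weak_check(fixed), contract_check(fixed))
```

Both objects pass the weak check, so only an equality assertion against the bootstrap object distinguishes the hybrid from the fixed behavior.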
Files modified:
- diff_diff/chaisemartin_dhaultfoeuille.py
  (fit() bootstrap branch propagates p/CI from br instead of
  recomputing via safe_inference)
- diff_diff/chaisemartin_dhaultfoeuille_results.py
  (class docstring `Notes` section gains the inference-method
  switch paragraph)
- docs/methodology/REGISTRY.md
  (new bootstrap inference surface Note)
- README.md (field-description row clarification)
- docs/api/chaisemartin_dhaultfoeuille.rst
  (Multiplier bootstrap example reflects the new propagation)
- tests/test_chaisemartin_dhaultfoeuille.py
  (test_bootstrap_p_value_and_ci_propagated_to_top_level)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
README.md (1 addition, 1 deletion)
@@ -1205,7 +1205,7 @@ ChaisemartinDHaultfoeuille(
 
 | Field | Description |
 |-------|-------------|
-| `overall_att`, `overall_se`, `overall_conf_int` | `DID_M` and inference (cohort-recentered analytical SE) |
+| `overall_att`, `overall_se`, `overall_conf_int` | `DID_M` and inference (cohort-recentered analytical SE by default; multiplier-bootstrap percentile inference when `n_bootstrap > 0`) |
 | `joiners_att`, `leavers_att` | Decomposition into the joiners (`DID_+`) and leavers (`DID_-`) views |
 | `placebo_effect` | Single-lag placebo (`DID_M^pl`) point estimate |
 | `per_period_effects` | Per-period decomposition with explicit A11-violation flags |
docs/methodology/REGISTRY.md (2 additions, 0 deletions)
@@ -558,6 +558,8 @@ Alternative: Multiplier bootstrap clustered at group via the `n_bootstrap` param
 
 - **Note (Phase 1 cluster contract):** `ChaisemartinDHaultfoeuille` always clusters at the group level. The cohort-recentered analytical SE plug-in operates on per-group influence-function values (one `U^G_g` per group); the multiplier bootstrap generates one weight per group; both inference paths cluster at the user's `group` column with no other option. The constructor accepts `cluster=None` (the default and only supported value); passing any non-`None` value raises `NotImplementedError` with a Phase 1 pointer at construction time (and the same gate fires from `set_params`). Custom clustering at a coarser or finer level than the group is reserved for a future phase. The matching test is `test_cluster_parameter_raises_not_implemented` in `tests/test_chaisemartin_dhaultfoeuille.py::TestForwardCompatGates`.
 
+- **Note (bootstrap inference surface):** When `n_bootstrap > 0`, the top-level `results.overall_p_value` / `results.overall_conf_int` (and joiners/leavers analogues) hold **percentile-based bootstrap inference** computed by the multiplier bootstrap, NOT normal-theory recomputations from the bootstrap SE. The t-stat (`overall_t_stat`, etc.) is computed from the SE via `safe_inference()[0]` to satisfy the project's anti-pattern rule (never compute `t = effect / se` inline) — bootstrap does not define an alternative t-stat semantic for percentile bootstrap, so the SE-based t-stat is the natural choice. `event_study_effects[1]`, `summary()`, `to_dataframe()`, `is_significant`, and `significance_stars` all read from these top-level fields and therefore reflect the bootstrap inference automatically. The library precedent for this propagation is `imputation.py:790-805`, `two_stage.py:778-787`, and `efficient_did.py:1009-1013`. The placebo path is unchanged: placebo bootstrap is deferred to Phase 2 (see the placebo SE Note above), so `placebo_p_value` and `placebo_conf_int` stay NaN even when `n_bootstrap > 0`. The matching test is `test_bootstrap_p_value_and_ci_propagated_to_top_level` in `tests/test_chaisemartin_dhaultfoeuille.py::TestBootstrap`.
+
 - **Note:** Placebo Assumption 11 violations (placebo joiners exist but no 3-period stable_0 controls, or symmetric for leavers/stable_1) trigger zero-retention in the placebo numerator AND emit a consolidated `Placebo (DID_M^pl) Assumption 11 violations` warning from `fit()`, mirroring the main DID path's contract documented above. The zeroed placebo periods retain their switcher counts in the placebo `N_S^pl` denominator, biasing `DID_M^pl` toward zero in the offending direction (matching the Theorem 4 paper convention).
 
 - **Note (TWFE diagnostic sample contract):** The fitted `results.twfe_weights` / `results.twfe_fraction_negative` / `results.twfe_sigma_fe` / `results.twfe_beta_fe` are computed on the **FULL pre-filter cell sample** — the data the user passed in, after `_validate_and_aggregate_to_cells()` runs but **before** the ragged-panel validation (Step 5b) and the multi-switch filter (`drop_larger_lower`, Step 6). They do NOT describe the post-filter estimation sample used by `overall_att`, `results.groups`, and the inference fields. `fit()` has three sample-shaping filters in total: (1) interior-gap drops in Step 5b, (2) multi-switch drops in Step 6, and (3) the singleton-baseline filter in Step 7. Filters (1) and (2) actually shrink the point-estimate sample, so when either fires, the fitted TWFE diagnostic and `overall_att` describe **different samples** and the estimator emits a `UserWarning` explaining the divergence with explicit counts. Filter (3) is **variance-only** — singleton-baseline groups remain in the point-estimate sample as period-based stable controls (see the singleton-baseline Note above) — so it does NOT create a fitted-vs-`overall_att` mismatch and does NOT trigger the divergence warning. Rationale for the pre-filter design: the TWFE diagnostic answers "what would the plain TWFE estimator say on the data you passed in?" — not "what would TWFE say on the data dCDH actually used after filtering?" — so users comparing TWFE vs dCDH on a fixed input can do so without an interaction effect from the dCDH-specific filters. The standalone `twowayfeweights()` function uses the same pre-filter sample, so the fitted and standalone APIs always produce identical numbers on the same input. To reproduce the dCDH estimation sample for an external TWFE comparison, pre-process your data to drop the multi-switch and interior-gap groups before fitting (the warning lists offending IDs). The matching tests are `test_twfe_pre_filter_contract_with_interior_gap_drop` and `test_twfe_pre_filter_contract_with_multi_switch_drop` in `tests/test_chaisemartin_dhaultfoeuille.py`.