
Commit 63edb3d

igerber and claude committed
Round 9: TWFE diagnostic sample-contract clarification + warning + tests
Documents the dCDH TWFE diagnostic sample contract that Round 7's swap left implicit. The fitted results.twfe_* values are computed on the FULL pre-filter cell sample (matching the standalone twowayfeweights() function), NOT on the post-filter estimation sample used by overall_att / results.groups / inference fields. The existing user-facing wording said "TWFE on the same data" / "diagnostic from the same fit" (phrases that naturally read as "same data as overall_att"), which contradicted the post-Round-7 behavior.

This commit:

1. Adds a new `**Note (TWFE diagnostic sample contract):**` block in REGISTRY.md enumerating all three sample-shaping filters (interior-gap, multi-switch, singleton-baseline) and explicitly carving out singleton-baseline as variance-only (no fitted-vs-overall_att mismatch, so no warning).
2. Rewrites the `twfe_diagnostic` parameter docstring in chaisemartin_dhaultfoeuille.py to describe the pre-filter contract and the divergence warning.
3. Rewrites the twfe_weights / twfe_fraction_negative / twfe_sigma_fe / twfe_beta_fe field docstrings in the results dataclass to clarify that they describe the FULL pre-filter cell sample, with a pointer to the REGISTRY contract Note.
4. Adds a `UserWarning` from `fit()` whenever the user requested the TWFE diagnostic AND the interior-gap or multi-switch filter dropped groups. The warning explains the divergence with explicit counts and points at REGISTRY for the rationale. The warning fires regardless of whether the diagnostic itself succeeded or hit the rank-deficient fallback (the plan-review correctly flagged that the `twfe_diagnostic_payload is not None` guard would swallow the rare rank-deficient + filtered-panel intersection, so that guard was dropped).
5. Updates docs/api/chaisemartin_dhaultfoeuille.rst and docs/choosing_estimator.rst to replace "from the same fit" with "computed on the data you pass in (pre-filter)".
6. Adds three regression tests in TestTwowayFeweightsHelper:
   - test_twfe_pre_filter_contract_with_interior_gap_drop: panel with a dropped interior-gap group; asserts that fitted twfe_* matches standalone, that the estimation sample is smaller, and that the divergence warning fires with the expected counts.
   - test_twfe_pre_filter_contract_with_multi_switch_drop: panel with an injected multi-switch crosser; similar assertions.
   - test_twfe_no_divergence_warning_on_clean_panel: negative test asserting that NO divergence warning fires on a clean panel (hard-codes pattern="single_switch" to close a future footgun).
7. Fixes the stale "Step 5a guarantees..." comment at line 712 to "Step 5b guarantees..." (post-Round-7 the ragged-panel validation is Step 5b, not Step 5a). Independent cleanup; bundled because it is in the same file and on the same topic.

This resolution preserves Round 7's standalone-vs-fitted parity (both APIs use the pre-filter cell sample) and addresses Round 9's P1 about the documentation contract. Both reviewers' concerns are now satisfied: the standalone and fitted APIs produce identical numbers on the same input, AND users see an explicit warning when filters make the fitted sample diverge from the dCDH estimation sample.

Test counts: 107 -> 110 (three new sample-contract regression tests). Black, ruff clean.

Files modified:
- docs/methodology/REGISTRY.md (new TWFE sample contract Note enumerating all three filters)
- diff_diff/chaisemartin_dhaultfoeuille.py (twfe_diagnostic param docstring, n_groups_dropped_interior_gap tracking, divergence warning at Step 6b, stale comment fix)
- diff_diff/chaisemartin_dhaultfoeuille_results.py (twfe_weights / twfe_fraction_negative / twfe_sigma_fe / twfe_beta_fe field docstrings)
- docs/api/chaisemartin_dhaultfoeuille.rst (wording fix)
- docs/choosing_estimator.rst (wording fix)
- tests/test_chaisemartin_dhaultfoeuille.py (3 new tests + 1 parity test comment update)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 8cafae1 commit 63edb3d

6 files changed

Lines changed: 260 additions & 19 deletions


diff_diff/chaisemartin_dhaultfoeuille.py

Lines changed: 56 additions & 4 deletions
@@ -275,9 +275,17 @@ class ChaisemartinDHaultfoeuille(ChaisemartinDHaultfoeuilleBootstrapMixin):
         from Theorem 1 of AER 2020: per-``(g, t)`` weights, fraction of
         treated cells with negative weights, and ``sigma_fe`` (the
         smallest cell-effect standard deviation that could flip the sign
-        of the plain TWFE coefficient). Useful for diagnosing whether
-        TWFE on the same data would have a different (potentially
-        wrong-signed) answer than ``DID_M``.
+        of the plain TWFE coefficient). The diagnostic answers "what
+        would the plain TWFE estimator say on the data you passed in?",
+        so it runs on the **FULL pre-filter cell sample** (the same
+        input as the standalone :func:`twowayfeweights` function), NOT
+        on the post-filter estimation sample used by ``DID_M``. When
+        the ragged-panel filter or ``drop_larger_lower`` drops groups,
+        the fitted ``results.twfe_*`` values describe a LARGER sample
+        (pre-filter) than ``results.overall_att`` and a ``UserWarning``
+        is emitted to make the divergence explicit. See REGISTRY.md
+        ``ChaisemartinDHaultfoeuille`` ``Note (TWFE diagnostic sample
+        contract)`` for the full rationale.
     drop_larger_lower : bool, default=True
         If ``True`` (default, matches R ``DIDmultiplegtDYN``), drops
         groups whose treatment switches more than once (multi-switch
@@ -636,6 +644,7 @@ def fit(
             expected_count = g_max_idx - g_min_idx + 1
             if len(g_periods) != expected_count:
                 groups_with_interior_gaps.append(g_id)
+        n_groups_dropped_interior_gap = len(groups_with_interior_gaps)
         if groups_with_interior_gaps:
             warnings.warn(
                 f"Dropping {len(groups_with_interior_gaps)} group(s) with interior "
@@ -685,6 +694,49 @@ def fit(
                 stacklevel=2,
             )

+        # ------------------------------------------------------------------
+        # Step 6b: TWFE diagnostic sample-contract notice
+        #
+        # The fitted twfe_* values (if the diagnostic succeeded in
+        # Step 5a) were computed on the FULL pre-filter cell sample,
+        # matching the standalone twowayfeweights() output. Steps 5b
+        # and 6 may have dropped groups since then. When they did, the
+        # fitted diagnostic and the dCDH point estimate describe
+        # DIFFERENT samples, so we surface that divergence as a
+        # UserWarning per the REGISTRY contract Note. Users see the
+        # warning at fit time and can decide whether to pre-process
+        # their data before re-fitting (or accept the documented
+        # divergence).
+        #
+        # The warning fires whenever the user requested the diagnostic
+        # AND filters dropped groups, even if _compute_twfe_diagnostic
+        # itself failed (rank-deficient fallback) and
+        # twfe_diagnostic_payload is None. The warning text uses "(if
+        # the diagnostic succeeded)" to remain accurate in both cases.
+        # ------------------------------------------------------------------
+        if self.twfe_diagnostic and (n_groups_dropped_interior_gap + n_groups_dropped_crossers) > 0:
+            warnings.warn(
+                f"TWFE diagnostic sample-contract notice: the dCDH point "
+                f"estimate, results.groups, and inference fields use a "
+                f"POST-FILTER sample after Step 5b dropped "
+                f"{n_groups_dropped_interior_gap} interior-gap group(s) "
+                f"and Step 6 dropped {n_groups_dropped_crossers} multi-"
+                f"switch group(s). The fitted results.twfe_* values (if "
+                f"the diagnostic succeeded) were computed on the FULL "
+                f"pre-filter cell sample, so they describe a LARGER "
+                f"sample (pre-filter) than overall_att. The standalone "
+                f"twowayfeweights() function also uses the pre-filter "
+                f"sample. This is the documented Phase 1 contract — see "
+                f"REGISTRY.md ChaisemartinDHaultfoeuille `Note (TWFE "
+                f"diagnostic sample contract)` for the rationale. To "
+                f"reproduce the dCDH estimation sample for an external "
+                f"TWFE comparison, pre-process your data to drop the "
+                f"{n_groups_dropped_interior_gap + n_groups_dropped_crossers} "
+                f"flagged groups before re-fitting.",
+                UserWarning,
+                stacklevel=2,
+            )
+
         # ------------------------------------------------------------------
         # Step 7: Singleton-baseline identification (footnote 15 of dynamic paper)
         # ------------------------------------------------------------------
@@ -700,7 +752,7 @@ def fit(
         # variance stage only — the cell DataFrame retains these groups
         # so they can serve as stable controls.
         # Use the validated first global period as the canonical baseline.
-        # Step 5a guarantees every group has an observation at this period,
+        # Step 5b guarantees every group has an observation at this period,
         # so we can read it directly without a groupby.first() that could
         # otherwise return a later observed period for late-entry groups.
         baselines_per_group = cell.loc[cell[time] == first_global_period, [group, "d_gt"]].rename(
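
The Step 6b condition above can be illustrated in isolation. The following is a minimal sketch, not the library's code: `maybe_warn_twfe_divergence` and `count_contract_warnings` are hypothetical stand-ins that take the two drop counts as plain arguments, and the warning text is abbreviated. It shows the three cases the commit message describes: a warning when a filter dropped groups, none on a clean panel, and none when the diagnostic was not requested.

```python
import warnings


def maybe_warn_twfe_divergence(twfe_diagnostic, n_interior_gap, n_crossers):
    """Hypothetical stand-in for the Step 6b check (not the library's function)."""
    n_dropped = n_interior_gap + n_crossers
    # Warn only if the diagnostic was requested AND a filter dropped groups.
    # Whether the diagnostic itself succeeded is deliberately not checked.
    if twfe_diagnostic and n_dropped > 0:
        warnings.warn(
            "TWFE diagnostic sample-contract notice: "
            f"{n_interior_gap} interior-gap group(s) and {n_crossers} "
            "multi-switch group(s) were dropped; twfe_* values describe "
            "the larger pre-filter sample.",
            UserWarning,
            stacklevel=2,
        )
    return n_dropped


def count_contract_warnings(twfe_diagnostic, n_interior_gap, n_crossers):
    """Run the check and count the sample-contract warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, no dedup
        maybe_warn_twfe_divergence(twfe_diagnostic, n_interior_gap, n_crossers)
    return sum(
        issubclass(w.category, UserWarning)
        and "sample-contract notice" in str(w.message)
        for w in caught
    )


print(count_contract_warnings(True, 1, 0))   # a filter dropped a group
print(count_contract_warnings(True, 0, 0))   # clean panel
print(count_contract_warnings(False, 2, 3))  # diagnostic not requested
```

Recording warnings with `warnings.catch_warnings(record=True)` is also how a caller could detect the notice around a real `fit()` call, assuming the message keeps the "sample-contract notice" prefix shown in the diff.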

diff_diff/chaisemartin_dhaultfoeuille_results.py

Lines changed: 23 additions & 7 deletions
@@ -208,18 +208,34 @@ class ChaisemartinDHaultfoeuilleResults:
     twfe_weights : pd.DataFrame, optional
         Per-cell TWFE decomposition weights from Theorem 1 of de
         Chaisemartin & D'Haultfoeuille (2020). Columns: ``group``,
-        ``time``, ``weight``. Only populated when ``twfe_diagnostic=True``.
+        ``time``, ``weight``. Computed on the **FULL pre-filter cell
+        sample** passed by the user (the same input the standalone
+        :func:`twowayfeweights` function uses) — NOT the post-filter
+        estimation sample described by ``overall_att`` and
+        ``groups``. When ``fit()`` drops groups via the ragged-panel
+        or ``drop_larger_lower`` filters, ``results.twfe_*`` and
+        ``results.overall_att`` describe different samples and a
+        ``UserWarning`` is emitted; see REGISTRY.md
+        ``ChaisemartinDHaultfoeuille`` ``Note (TWFE diagnostic
+        sample contract)`` for the rationale. Only populated when
+        ``twfe_diagnostic=True``.
     twfe_fraction_negative : float, optional
-        Fraction of treated-cell weights that are negative. ``> 0`` is the
-        diagnostic for the heterogeneous-treatment-effect bias of the
-        plain TWFE estimator on the same data.
+        Fraction of treated-cell weights that are negative. ``> 0`` is
+        the diagnostic for the heterogeneous-treatment-effect bias of
+        the plain TWFE estimator on the **FULL pre-filter cell sample**
+        (NOT the post-filter estimation sample). See the
+        ``twfe_weights`` docstring above for the sample contract.
     twfe_sigma_fe : float, optional
         Smallest standard deviation of per-cell treatment effects that
         could flip the sign of the plain TWFE estimator (Corollary 1 of
-        the AER 2020 paper).
+        the AER 2020 paper). Computed on the **FULL pre-filter cell
+        sample**.
     twfe_beta_fe : float, optional
-        The plain TWFE coefficient computed on the same data, for
-        comparison with ``overall_att``.
+        The plain TWFE coefficient computed on the **FULL pre-filter
+        cell sample**, for comparison with ``overall_att``. Note that
+        the two are computed on different samples when ``fit()``
+        filters drop groups — see the ``twfe_weights`` docstring above
+        for the sample contract.
     groups : list
         Group identifiers in the post-filter sample.
     time_periods : list
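
To make the sample dependence of these fields concrete, here is a rough sketch of the kind of per-cell weights the docstrings describe. It is an illustration under stated assumptions, not the library's `twowayfeweights` implementation: `twfe_weights_sketch` is a hypothetical name, the panel is assumed balanced with equal cell sizes and binary treatment, and the weights are normalized to sum to one over treated cells (the library's normalization may differ). The treatment residual is obtained by double-demeaning, which coincides with removing group and time fixed effects only on a balanced panel.

```python
import numpy as np


def twfe_weights_sketch(d):
    """Illustrative per-cell TWFE weights for a balanced (G, T) binary panel.

    Returns (weights, fraction_negative). `weights` is NaN on untreated
    cells and sums to 1 over treated cells (an assumed normalization).
    """
    d = np.asarray(d, dtype=float)
    # Residual of treatment after removing group and time fixed effects.
    eps = d - d.mean(axis=1, keepdims=True) - d.mean(axis=0, keepdims=True) + d.mean()
    treated = d == 1
    weights = np.full(d.shape, np.nan)
    weights[treated] = eps[treated] / eps[treated].sum()
    fraction_negative = float((weights[treated] < 0).mean())
    return weights, fraction_negative


# Pre-filter sample: two single-switch groups plus one multi-switch group.
d_full = np.array([
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 0, 1],  # switches more than once
])
# Post-filter sample: the multi-switch group removed.
d_filtered = d_full[:2]

w_full, frac_full = twfe_weights_sketch(d_full)
w_filtered, frac_filtered = twfe_weights_sketch(d_filtered)
print(frac_full, frac_filtered)
```

The two fractions differ because the residualized treatment depends on every group in the sample, which is why a diagnostic computed on the pre-filter sample need not describe the post-filter sample behind ``overall_att``.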

docs/api/chaisemartin_dhaultfoeuille.rst

Lines changed: 8 additions & 2 deletions
@@ -37,8 +37,14 @@ The estimator:
   seasonal promotions, on/off policy cycles)
 - You need separate joiners (``DID_+``) and leavers (``DID_-``) views, plus
   the aggregate ``DID_M``
-- You want a built-in placebo and a TWFE decomposition diagnostic from the
-  same fit
+- You want a built-in placebo and a TWFE decomposition diagnostic computed
+  on the data you pass in (pre-filter) for direct comparison against
+  ``DID_M``. The fitted TWFE diagnostic uses the FULL pre-filter cell
+  sample (matching :func:`twowayfeweights`); when ``fit()`` drops groups
+  via the ragged-panel or ``drop_larger_lower`` filters, a ``UserWarning``
+  is emitted to make the divergence from the post-filter ``DID_M`` sample
+  explicit. See REGISTRY.md ``ChaisemartinDHaultfoeuille`` ``Note (TWFE
+  diagnostic sample contract)`` for the rationale.
 - You want a Python implementation that matches R ``DIDmultiplegtDYN`` at
   ``l = 1``

docs/choosing_estimator.rst

Lines changed: 3 additions & 2 deletions
@@ -235,8 +235,9 @@ Use :class:`~diff_diff.ChaisemartinDHaultfoeuille` (alias :class:`~diff_diff.DCD
   seasonal promotions, on/off policy cycles, binary fuzzy designs)
 - You need separate joiners (``DID_+``) and leavers (``DID_-``) views, plus
   the aggregate ``DID_M``
-- You want a built-in placebo and a TWFE decomposition diagnostic from the
-  same fit
+- You want a built-in placebo and a TWFE decomposition diagnostic computed
+  on the data you pass in (pre-filter) for direct comparison against
+  ``DID_M``

 This is **the only library estimator that handles non-absorbing treatments**.
 All other staggered estimators

docs/methodology/REGISTRY.md

Lines changed: 2 additions & 0 deletions
@@ -560,6 +560,8 @@ Alternative: Multiplier bootstrap clustered at group via the `n_bootstrap` param

 - **Note:** Placebo Assumption 11 violations (placebo joiners exist but no 3-period stable_0 controls, or symmetric for leavers/stable_1) trigger zero-retention in the placebo numerator AND emit a consolidated `Placebo (DID_M^pl) Assumption 11 violations` warning from `fit()`, mirroring the main DID path's contract documented above. The zeroed placebo periods retain their switcher counts in the placebo `N_S^pl` denominator, biasing `DID_M^pl` toward zero in the offending direction (matching the Theorem 4 paper convention).

+- **Note (TWFE diagnostic sample contract):** The fitted `results.twfe_weights` / `results.twfe_fraction_negative` / `results.twfe_sigma_fe` / `results.twfe_beta_fe` are computed on the **FULL pre-filter cell sample** — the data the user passed in, after `_validate_and_aggregate_to_cells()` runs but **before** the ragged-panel validation (Step 5b) and the multi-switch filter (`drop_larger_lower`, Step 6). They do NOT describe the post-filter estimation sample used by `overall_att`, `results.groups`, and the inference fields. `fit()` has three sample-shaping filters in total: (1) interior-gap drops in Step 5b, (2) multi-switch drops in Step 6, and (3) the singleton-baseline filter in Step 7. Filters (1) and (2) actually shrink the point-estimate sample, so when either fires, the fitted TWFE diagnostic and `overall_att` describe **different samples** and the estimator emits a `UserWarning` explaining the divergence with explicit counts. Filter (3) is **variance-only** — singleton-baseline groups remain in the point-estimate sample as period-based stable controls (see the singleton-baseline Note above) — so it does NOT create a fitted-vs-`overall_att` mismatch and does NOT trigger the divergence warning. Rationale for the pre-filter design: the TWFE diagnostic answers "what would the plain TWFE estimator say on the data you passed in?" — not "what would TWFE say on the data dCDH actually used after filtering?" — so users comparing TWFE vs dCDH on a fixed input can do so without an interaction effect from the dCDH-specific filters. The standalone `twowayfeweights()` function uses the same pre-filter sample, so the fitted and standalone APIs always produce identical numbers on the same input. To reproduce the dCDH estimation sample for an external TWFE comparison, pre-process your data to drop the multi-switch and interior-gap groups before fitting (the warning lists offending IDs). The matching tests are `test_twfe_pre_filter_contract_with_interior_gap_drop` and `test_twfe_pre_filter_contract_with_multi_switch_drop` in `tests/test_chaisemartin_dhaultfoeuille.py`.
+
 - **Note:** By default (`drop_larger_lower=True`), the estimator drops groups whose treatment switches more than once before estimation. This matches R `DIDmultiplegtDYN`'s default and is required for the analytical variance formula (Web Appendix Section 3.7.3 of the dynamic paper, which assumes Assumption 5 / no-crossing) to be consistent with the AER 2020 Theorem 3 point estimate. Both formulas operate on the same post-drop dataset. Setting `drop_larger_lower=False` is supported for diagnostic comparison but produces an inconsistent estimator-variance pairing for any multi-switch groups present, and emits an explicit warning.

 - **Note:** When Assumption 11 (existence of stable controls) is violated for some period `t` — i.e., joiners exist but no stable-untreated controls, or leavers exist but no stable-treated controls — `DID_{+,t}` (or `DID_{-,t}`) is set to zero by paper convention, and the period's switcher count is **retained** in the `N_S` denominator. This means the affected period contributes a zero to the numerator with a non-zero weight in the denominator, biasing `DID_M` toward zero in the offending direction. Users can detect this by inspecting `results.per_period_effects[t]['did_plus_t_a11_zeroed']` (or `did_minus_t_a11_zeroed`) or the consolidated `fit()` warning. This matches the AER 2020 Theorem 3 paper convention and the worked example arithmetic.
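
The pre-processing step recommended in the sample-contract Note could look roughly like the sketch below. It is a hedged illustration rather than the library's filter: `drop_flagged_groups` and the column names `group`, `time`, `d` are hypothetical, the input is assumed to hold one row per (group, time) cell, and only the two conditions visible in this diff are mirrored (observed periods that do not fill the span between a group's first and last period, and treatment that switches more than once). Any other ragged-panel check in Step 5b is not reproduced here.

```python
import pandas as pd


def drop_flagged_groups(df, group="group", time="time", treatment="d"):
    """Hypothetical helper: returns (filtered_df, interior_gap_ids, multi_switch_ids)."""
    period_idx = {p: i for i, p in enumerate(sorted(df[time].unique()))}
    gap_ids, switch_ids = [], []
    for g_id, g_df in df.sort_values(time).groupby(group):
        idx = g_df[time].map(period_idx)
        # Interior gap: observed periods do not fill the first..last span.
        if idx.nunique() != idx.max() - idx.min() + 1:
            gap_ids.append(g_id)
            continue  # a gap group is not also counted as multi-switch
        # Multi-switch: treatment changes value more than once over time.
        n_switches = int((g_df[treatment].diff().fillna(0) != 0).sum())
        if n_switches > 1:
            switch_ids.append(g_id)
    keep = ~df[group].isin(gap_ids + switch_ids)
    return df[keep].copy(), gap_ids, switch_ids


panel = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 3 + ["C"] * 4,
    "time": [1, 2, 3, 4, 1, 2, 4, 1, 2, 3, 4],
    "d": [0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
})
filtered, gap_ids, switch_ids = drop_flagged_groups(panel)
print(gap_ids, switch_ids, sorted(filtered["group"].unique()))
```

Fitting both the dCDH estimator and an external TWFE regression on `filtered` would then compare the two on a common sample, at the cost of no longer answering the "data you passed in" question that the fitted diagnostic targets.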
