Round 9: TWFE diagnostic sample-contract clarification + warning + tests

igerber · claude · igerber · commit 63edb3d9041e · 2026-04-11T21:27:33.000-04:00
Documents the dCDH TWFE diagnostic sample contract that Round 7's
swap left implicit. The fitted results.twfe_* values are computed on
the FULL pre-filter cell sample (matching the standalone
twowayfeweights() function), NOT on the post-filter estimation
sample used by overall_att / results.groups / inference fields. The
existing user-facing wording said "TWFE on the same data" /
"diagnostic from the same fit" — phrases that naturally read as
"same data as overall_att" — which contradicted the post-Round-7
behavior. This commit:

1. Adds a new `**Note (TWFE diagnostic sample contract):**` block in
   REGISTRY.md enumerating all three sample-shaping filters
   (interior-gap, multi-switch, singleton-baseline) and explicitly
   carving out singleton-baseline as variance-only (it creates no
   fitted-vs-overall_att mismatch, so it triggers no warning).

2. Rewrites the `twfe_diagnostic` parameter docstring in
   chaisemartin_dhaultfoeuille.py to describe the pre-filter contract
   and the divergence warning.

3. Rewrites the twfe_weights / twfe_fraction_negative / twfe_sigma_fe
   / twfe_beta_fe field docstrings in the results dataclass to clarify
   they describe the FULL pre-filter cell sample, with a pointer to
   the REGISTRY contract Note.

4. Adds a `UserWarning` from `fit()` whenever the user requested the
   TWFE diagnostic AND either the interior-gap or the multi-switch
   filter dropped groups. The warning explains the divergence with
   explicit counts and points at REGISTRY for the rationale. It fires
   regardless of whether the diagnostic itself succeeded or hit the
   rank-deficient fallback: the plan review correctly flagged that a
   `twfe_diagnostic_payload is not None` guard would swallow the rare
   rank-deficient + filtered-panel intersection, so that guard was
   dropped.

5. Updates docs/api/chaisemartin_dhaultfoeuille.rst and
   docs/choosing_estimator.rst to replace "from the same fit" with
   "computed on the data you pass in (pre-filter)".

6. Adds three regression tests in TestTwowayFeweightsHelper:
   - test_twfe_pre_filter_contract_with_interior_gap_drop: panel with
     a dropped interior-gap group, asserts fitted twfe_* matches
     standalone, estimation sample is smaller, and the divergence
     warning fires with the expected counts.
   - test_twfe_pre_filter_contract_with_multi_switch_drop: panel with
     an injected multi-switch crosser, similar assertions.
   - test_twfe_no_divergence_warning_on_clean_panel: negative test
     asserting NO divergence warning fires on a clean panel
     (hard-codes pattern="single_switch" to close a future footgun).

7. Fixes the stale "Step 5a guarantees..." comment at line 712 to
   "Step 5b guarantees..." (post-Round-7 the ragged-panel validation
   is Step 5b, not Step 5a). Independent cleanup; bundled because
   it's in the same file and the same topic.
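
A minimal sketch of the guard described in item 4, for illustration
only. The helper names `maybe_warn_twfe_divergence` and
`count_notices` are hypothetical and are not part of the library; the
point is that the condition depends only on the diagnostic flag and
the two drop counts, never on whether a diagnostic payload exists.

```python
import warnings


def maybe_warn_twfe_divergence(twfe_requested, n_interior_gap, n_multi_switch):
    """Hypothetical stand-in for the Step 6b guard (not the library's code).

    Warns when the diagnostic was requested and either filter dropped
    groups. The diagnostic payload is deliberately not an input, so a
    rank-deficient fallback cannot suppress the notice.
    """
    n_dropped = n_interior_gap + n_multi_switch
    if twfe_requested and n_dropped > 0:
        warnings.warn(
            "TWFE diagnostic sample-contract notice: "
            f"{n_interior_gap} interior-gap and {n_multi_switch} multi-switch "
            "group(s) were dropped after the diagnostic was computed.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


def count_notices(twfe_requested, n_interior_gap, n_multi_switch):
    """Run the guard and count how many sample-contract notices it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        maybe_warn_twfe_divergence(twfe_requested, n_interior_gap, n_multi_switch)
    return sum("sample-contract notice" in str(w.message) for w in caught)
```

The clean-panel case (`count_notices(True, 0, 0)`) is the shape of
the negative test in item 6: zero drops must produce zero notices.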

This resolution preserves Round 7's standalone-vs-fitted parity
(both APIs use the pre-filter cell sample) and addresses Round 9's
P1 about the documentation contract. Both reviewers' concerns are
now satisfied: the standalone and fitted APIs produce identical
numbers on the same input, AND users see an explicit warning when
filters make the fitted diagnostic sample diverge from the dCDH
estimation sample.
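
For users who want to reproduce the post-filter sample before an
external TWFE comparison, the pre-processing could look roughly like
the sketch below. It is an assumption-laden illustration, not the
library's filter: `flag_groups` is a hypothetical helper, cells are
assumed to be `(group, period_index, treatment)` tuples with integer
period indices, and only the two rules as worded in this commit are
applied (an interior gap means the observed periods are not
contiguous; multi-switch means treatment changes more than once). The
library's actual Step 5b / Step 6 logic may apply further checks.

```python
from collections import defaultdict


def flag_groups(cells):
    """Return (interior_gap_groups, multi_switch_groups) as two sets.

    Hypothetical helper. `cells` is an iterable of
    (group, period_index, treatment) tuples.
    """
    by_group = defaultdict(dict)
    for group, period, treatment in cells:
        by_group[group][period] = treatment

    interior_gap, multi_switch = set(), set()
    for group, observed in by_group.items():
        periods = sorted(observed)
        # Contiguity check: same count comparison as the diff's Step 5b loop.
        if len(periods) != periods[-1] - periods[0] + 1:
            interior_gap.add(group)
            continue  # gap groups are dropped before the multi-switch check
        path = [observed[t] for t in periods]
        n_switches = sum(a != b for a, b in zip(path, path[1:]))
        if n_switches > 1:
            multi_switch.add(group)
    return interior_gap, multi_switch


def drop_flagged(cells):
    """Keep only cells whose group passed both checks."""
    cells = list(cells)
    gaps, crossers = flag_groups(cells)
    return [c for c in cells if c[0] not in gaps | crossers]
```

Checking gaps first mirrors the step order in `fit()`, so a group
with both problems is counted once, as an interior-gap drop.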

Test counts: 107 -> 110 (three new sample-contract regression
tests). Black, ruff clean.

Files modified:
- docs/methodology/REGISTRY.md
  (new TWFE sample contract Note enumerating all three filters)
- diff_diff/chaisemartin_dhaultfoeuille.py
  (twfe_diagnostic param docstring, n_groups_dropped_interior_gap
  tracking, divergence warning at Step 6b, stale comment fix)
- diff_diff/chaisemartin_dhaultfoeuille_results.py
  (twfe_weights / twfe_fraction_negative / twfe_sigma_fe /
  twfe_beta_fe field docstrings)
- docs/api/chaisemartin_dhaultfoeuille.rst (wording fix)
- docs/choosing_estimator.rst (wording fix)
- tests/test_chaisemartin_dhaultfoeuille.py (3 new tests + 1
  parity test comment update)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/diff_diff/chaisemartin_dhaultfoeuille.py b/diff_diff/chaisemartin_dhaultfoeuille.py
@@ -275,9 +275,17 @@ class ChaisemartinDHaultfoeuille(ChaisemartinDHaultfoeuilleBootstrapMixin):
         from Theorem 1 of AER 2020: per-``(g, t)`` weights, fraction of
         treated cells with negative weights, and ``sigma_fe`` (the
         smallest cell-effect standard deviation that could flip the sign
-        of the plain TWFE coefficient). Useful for diagnosing whether
-        TWFE on the same data would have a different (potentially
-        wrong-signed) answer than ``DID_M``.
+        of the plain TWFE coefficient). The diagnostic answers "what
+        would the plain TWFE estimator say on the data you passed in?",
+        so it runs on the **FULL pre-filter cell sample** (the same
+        input as the standalone :func:`twowayfeweights` function), NOT
+        on the post-filter estimation sample used by ``DID_M``. When
+        the ragged-panel filter or ``drop_larger_lower`` drops groups,
+        the fitted ``results.twfe_*`` values describe a LARGER sample
+        (pre-filter) than ``results.overall_att`` and a ``UserWarning``
+        is emitted to make the divergence explicit. See REGISTRY.md
+        ``ChaisemartinDHaultfoeuille`` ``Note (TWFE diagnostic sample
+        contract)`` for the full rationale.
     drop_larger_lower : bool, default=True
         If ``True`` (default, matches R ``DIDmultiplegtDYN``), drops
         groups whose treatment switches more than once (multi-switch
@@ -636,6 +644,7 @@ def fit(
             expected_count = g_max_idx - g_min_idx + 1
             if len(g_periods) != expected_count:
                 groups_with_interior_gaps.append(g_id)
+        n_groups_dropped_interior_gap = len(groups_with_interior_gaps)
         if groups_with_interior_gaps:
             warnings.warn(
                 f"Dropping {len(groups_with_interior_gaps)} group(s) with interior "
@@ -685,6 +694,49 @@ def fit(
                 stacklevel=2,
             )
 
+        # ------------------------------------------------------------------
+        # Step 6b: TWFE diagnostic sample-contract notice
+        #
+        # The fitted twfe_* values (if the diagnostic succeeded in
+        # Step 5a) were computed on the FULL pre-filter cell sample,
+        # matching the standalone twowayfeweights() output. Steps 5b
+        # and 6 may have dropped groups since then. When they did, the
+        # fitted diagnostic and the dCDH point estimate describe
+        # DIFFERENT samples, so we surface that divergence as a
+        # UserWarning per the REGISTRY contract Note. Users see the
+        # warning at fit time and can decide whether to pre-process
+        # their data before re-fitting (or accept the documented
+        # divergence).
+        #
+        # The warning fires whenever the user requested the diagnostic
+        # AND filters dropped groups, even if _compute_twfe_diagnostic
+        # itself failed (rank-deficient fallback) and
+        # twfe_diagnostic_payload is None. The warning text uses "(if
+        # the diagnostic succeeded)" to remain accurate in both cases.
+        # ------------------------------------------------------------------
+        if self.twfe_diagnostic and (n_groups_dropped_interior_gap + n_groups_dropped_crossers) > 0:
+            warnings.warn(
+                f"TWFE diagnostic sample-contract notice: the dCDH point "
+                f"estimate, results.groups, and inference fields use a "
+                f"POST-FILTER sample after Step 5b dropped "
+                f"{n_groups_dropped_interior_gap} interior-gap group(s) "
+                f"and Step 6 dropped {n_groups_dropped_crossers} multi-"
+                f"switch group(s). The fitted results.twfe_* values (if "
+                f"the diagnostic succeeded) were computed on the FULL "
+                f"pre-filter cell sample, so they describe a LARGER "
+                f"sample (pre-filter) than overall_att. The standalone "
+                f"twowayfeweights() function also uses the pre-filter "
+                f"sample. This is the documented Phase 1 contract — see "
+                f"REGISTRY.md ChaisemartinDHaultfoeuille `Note (TWFE "
+                f"diagnostic sample contract)` for the rationale. To "
+                f"reproduce the dCDH estimation sample for an external "
+                f"TWFE comparison, pre-process your data to drop the "
+                f"{n_groups_dropped_interior_gap + n_groups_dropped_crossers} "
+                f"flagged groups before re-fitting.",
+                UserWarning,
+                stacklevel=2,
+            )
+
         # ------------------------------------------------------------------
         # Step 7: Singleton-baseline identification (footnote 15 of dynamic paper)
         # ------------------------------------------------------------------
@@ -700,7 +752,7 @@ def fit(
         # variance stage only — the cell DataFrame retains these groups
         # so they can serve as stable controls.
         # Use the validated first global period as the canonical baseline.
-        # Step 5a guarantees every group has an observation at this period,
+        # Step 5b guarantees every group has an observation at this period,
         # so we can read it directly without a groupby.first() that could
         # otherwise return a later observed period for late-entry groups.
         baselines_per_group = cell.loc[cell[time] == first_global_period, [group, "d_gt"]].rename(
diff --git a/diff_diff/chaisemartin_dhaultfoeuille_results.py b/diff_diff/chaisemartin_dhaultfoeuille_results.py
@@ -208,18 +208,34 @@ class ChaisemartinDHaultfoeuilleResults:
     twfe_weights : pd.DataFrame, optional
         Per-cell TWFE decomposition weights from Theorem 1 of de
         Chaisemartin & D'Haultfoeuille (2020). Columns: ``group``,
-        ``time``, ``weight``. Only populated when ``twfe_diagnostic=True``.
+        ``time``, ``weight``. Computed on the **FULL pre-filter cell
+        sample** passed by the user (the same input the standalone
+        :func:`twowayfeweights` function uses) — NOT the post-filter
+        estimation sample described by ``overall_att`` and
+        ``groups``. When ``fit()`` drops groups via the ragged-panel
+        or ``drop_larger_lower`` filters, ``results.twfe_*`` and
+        ``results.overall_att`` describe different samples and a
+        ``UserWarning`` is emitted; see REGISTRY.md
+        ``ChaisemartinDHaultfoeuille`` ``Note (TWFE diagnostic
+        sample contract)`` for the rationale. Only populated when
+        ``twfe_diagnostic=True``.
     twfe_fraction_negative : float, optional
-        Fraction of treated-cell weights that are negative. ``> 0`` is the
-        diagnostic for the heterogeneous-treatment-effect bias of the
-        plain TWFE estimator on the same data.
+        Fraction of treated-cell weights that are negative. ``> 0`` is
+        the diagnostic for the heterogeneous-treatment-effect bias of
+        the plain TWFE estimator on the **FULL pre-filter cell sample**
+        (NOT the post-filter estimation sample). See the
+        ``twfe_weights`` docstring above for the sample contract.
     twfe_sigma_fe : float, optional
         Smallest standard deviation of per-cell treatment effects that
         could flip the sign of the plain TWFE estimator (Corollary 1 of
-        the AER 2020 paper).
+        the AER 2020 paper). Computed on the **FULL pre-filter cell
+        sample**.
     twfe_beta_fe : float, optional
-        The plain TWFE coefficient computed on the same data, for
-        comparison with ``overall_att``.
+        The plain TWFE coefficient computed on the **FULL pre-filter
+        cell sample**, for comparison with ``overall_att``. Note that
+        the two are computed on different samples when ``fit()``
+        filters drop groups — see the ``twfe_weights`` docstring above
+        for the sample contract.
     groups : list
         Group identifiers in the post-filter sample.
     time_periods : list
diff --git a/docs/api/chaisemartin_dhaultfoeuille.rst b/docs/api/chaisemartin_dhaultfoeuille.rst
@@ -37,8 +37,14 @@ The estimator:
   seasonal promotions, on/off policy cycles)
 - You need separate joiners (``DID_+``) and leavers (``DID_-``) views, plus
   the aggregate ``DID_M``
-- You want a built-in placebo and a TWFE decomposition diagnostic from the
-  same fit
+- You want a built-in placebo and a TWFE decomposition diagnostic computed
+  on the data you pass in (pre-filter) for direct comparison against
+  ``DID_M``. The fitted TWFE diagnostic uses the FULL pre-filter cell
+  sample (matching :func:`twowayfeweights`); when ``fit()`` drops groups
+  via the ragged-panel or ``drop_larger_lower`` filters, a ``UserWarning``
+  is emitted to make the divergence from the post-filter ``DID_M`` sample
+  explicit. See REGISTRY.md ``ChaisemartinDHaultfoeuille`` ``Note (TWFE
+  diagnostic sample contract)`` for the rationale.
 - You want a Python implementation that matches R ``DIDmultiplegtDYN`` at
   ``l = 1``
 
diff --git a/docs/choosing_estimator.rst b/docs/choosing_estimator.rst
@@ -235,8 +235,9 @@ Use :class:`~diff_diff.ChaisemartinDHaultfoeuille` (alias :class:`~diff_diff.DCD
   seasonal promotions, on/off policy cycles, binary fuzzy designs)
 - You need separate joiners (``DID_+``) and leavers (``DID_-``) views, plus
   the aggregate ``DID_M``
-- You want a built-in placebo and a TWFE decomposition diagnostic from the
-  same fit
+- You want a built-in placebo and a TWFE decomposition diagnostic computed
+  on the data you pass in (pre-filter) for direct comparison against
+  ``DID_M``
 
 This is **the only library estimator that handles non-absorbing treatments**.
 All other staggered estimators
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -560,6 +560,8 @@ Alternative: Multiplier bootstrap clustered at group via the `n_bootstrap` param
 
 - **Note:** Placebo Assumption 11 violations (placebo joiners exist but no 3-period stable_0 controls, or symmetric for leavers/stable_1) trigger zero-retention in the placebo numerator AND emit a consolidated `Placebo (DID_M^pl) Assumption 11 violations` warning from `fit()`, mirroring the main DID path's contract documented above. The zeroed placebo periods retain their switcher counts in the placebo `N_S^pl` denominator, biasing `DID_M^pl` toward zero in the offending direction (matching the Theorem 4 paper convention).
 
+- **Note (TWFE diagnostic sample contract):** The fitted `results.twfe_weights` / `results.twfe_fraction_negative` / `results.twfe_sigma_fe` / `results.twfe_beta_fe` are computed on the **FULL pre-filter cell sample** — the data the user passed in, after `_validate_and_aggregate_to_cells()` runs but **before** the ragged-panel validation (Step 5b) and the multi-switch filter (`drop_larger_lower`, Step 6). They do NOT describe the post-filter estimation sample used by `overall_att`, `results.groups`, and the inference fields. `fit()` has three sample-shaping filters in total: (1) interior-gap drops in Step 5b, (2) multi-switch drops in Step 6, and (3) the singleton-baseline filter in Step 7. Filters (1) and (2) actually shrink the point-estimate sample, so when either fires, the fitted TWFE diagnostic and `overall_att` describe **different samples** and the estimator emits a `UserWarning` explaining the divergence with explicit counts. Filter (3) is **variance-only** — singleton-baseline groups remain in the point-estimate sample as period-based stable controls (see the singleton-baseline Note above) — so it does NOT create a fitted-vs-`overall_att` mismatch and does NOT trigger the divergence warning. Rationale for the pre-filter design: the TWFE diagnostic answers "what would the plain TWFE estimator say on the data you passed in?" — not "what would TWFE say on the data dCDH actually used after filtering?" — so users comparing TWFE vs dCDH on a fixed input can do so without an interaction effect from the dCDH-specific filters. The standalone `twowayfeweights()` function uses the same pre-filter sample, so the fitted and standalone APIs always produce identical numbers on the same input. To reproduce the dCDH estimation sample for an external TWFE comparison, pre-process your data to drop the multi-switch and interior-gap groups before fitting (the warning lists offending IDs). The matching tests are `test_twfe_pre_filter_contract_with_interior_gap_drop` and `test_twfe_pre_filter_contract_with_multi_switch_drop` in `tests/test_chaisemartin_dhaultfoeuille.py`.
+
 - **Note:** By default (`drop_larger_lower=True`), the estimator drops groups whose treatment switches more than once before estimation. This matches R `DIDmultiplegtDYN`'s default and is required for the analytical variance formula (Web Appendix Section 3.7.3 of the dynamic paper, which assumes Assumption 5 / no-crossing) to be consistent with the AER 2020 Theorem 3 point estimate. Both formulas operate on the same post-drop dataset. Setting `drop_larger_lower=False` is supported for diagnostic comparison but produces an inconsistent estimator-variance pairing for any multi-switch groups present, and emits an explicit warning.
 
 - **Note:** When Assumption 11 (existence of stable controls) is violated for some period `t` — i.e., joiners exist but no stable-untreated controls, or leavers exist but no stable-treated controls — `DID_{+,t}` (or `DID_{-,t}`) is set to zero by paper convention, and the period's switcher count is **retained** in the `N_S` denominator. This means the affected period contributes a zero to the numerator with a non-zero weight in the denominator, biasing `DID_M` toward zero in the offending direction. Users can detect this by inspecting `results.per_period_effects[t]['did_plus_t_a11_zeroed']` (or `did_minus_t_a11_zeroed`) or the consolidated `fit()` warning. This matches the AER 2020 Theorem 3 paper convention and the worked example arithmetic.
diff --git a/tests/test_chaisemartin_dhaultfoeuille.py b/tests/test_chaisemartin_dhaultfoeuille.py