Round 7: cluster gate + TWFE diagnostic order + singleton-baseline language

igerber · claude · igerber · commit 83cc093c82fa · 2026-04-11T20:26:43.000-04:00
Resolves three findings from the prior CI review:

P1 — cluster parameter was a public no-op. The constructor exposed
`cluster: Optional[str] = None` and stored it on `self.cluster`, but
neither `fit()` nor `_compute_dcdh_bootstrap()` ever read it, so
`cluster="state"` silently produced the same group-level inference as
`cluster=None`. The Phase 1 contract is now: dCDH always clusters at
the group level (via the cohort-recentered influence function and the
multiplier bootstrap), and any non-None cluster value raises
`NotImplementedError` at construction time. The same gate fires from
`set_params`. Added `test_cluster_parameter_raises_not_implemented`
covering four cases: rejection from `__init__`, rejection from
`set_params`, the `cluster=None` happy path, and rejection from the
convenience function.
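
As a rough illustration of the gate pattern only (not the code in this
commit), the sketch below uses a hypothetical `GatedEstimator` class
with a hypothetical shared `_validate` helper. The actual change
repeats the check inline in `__init__` and `set_params`, with a longer
error message.

```python
from typing import Any, Optional


class GatedEstimator:
    """Hypothetical stand-in for the estimator; not the library class."""

    _PARAMS = ("alpha", "cluster")

    def __init__(self, alpha: float = 0.05, cluster: Optional[str] = None) -> None:
        self.alpha = alpha
        self.cluster = cluster
        self._validate()

    def _validate(self) -> None:
        # Assumption: one shared gate, so both entry points apply the same rule.
        if self.cluster is not None:
            raise NotImplementedError(
                f"cluster={self.cluster!r}: custom clustering is not "
                f"supported in Phase 1. Pass cluster=None (the default)."
            )

    def set_params(self, **params: Any) -> "GatedEstimator":
        for name, value in params.items():
            if name not in self._PARAMS:
                raise ValueError(f"Unknown parameter: {name}")
            setattr(self, name, value)
        # Re-run the gate after every parameter update.
        self._validate()
        return self
```

With this shape, `GatedEstimator(cluster="state")` and
`GatedEstimator().set_params(cluster="state")` both raise
`NotImplementedError`, while `GatedEstimator(cluster=None)` constructs
normally.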

P3 — TWFE diagnostic was running on the post-Step-5a sample but the
inline comment claimed "FULL pre-filter cell dataset" and the
standalone `twowayfeweights()` function used the genuinely unfiltered
cell dataset. On ragged panels with interior gaps, the two paths
therefore produced different diagnostics.
Fixed by reordering: the TWFE diagnostic now runs as Step 5a (was
5b) BEFORE the ragged-panel validation, which is now Step 5b (was
5a). The blocks are independent — the diagnostic just reads `cell`
while the ragged-panel block mutates it. Both API surfaces now
operate on the same `_validate_and_aggregate_to_cells()` output.
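
A toy sketch of why the order matters, assuming a hypothetical
read-only diagnostic (`count_cells_per_group`) and a hypothetical
interior-gap filter (`drop_interior_gap_groups`). Neither is the
library's implementation; they only mimic the "one block reads, the
other drops groups" relationship described above.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int]  # (group, period)


def count_cells_per_group(cells: List[Cell]) -> Dict[str, int]:
    """Hypothetical read-only diagnostic: observed cells per group."""
    counts: Dict[str, int] = {}
    for group, _ in cells:
        counts[group] = counts.get(group, 0) + 1
    return counts


def drop_interior_gap_groups(cells: List[Cell]) -> List[Cell]:
    """Hypothetical filter: keep groups whose periods are consecutive."""
    periods: Dict[str, List[int]] = {}
    for group, period in cells:
        periods.setdefault(group, []).append(period)
    keep = {
        g
        for g, ps in periods.items()
        if sorted(ps) == list(range(min(ps), max(ps) + 1))
    }
    return [c for c in cells if c[0] in keep]


# Group "b" skips period 2, so it has an interior gap.
cells = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 3)]

# New order: diagnostic first (pre-filter data), then the filter.
diag_before = count_cells_per_group(cells)
filtered = drop_interior_gap_groups(cells)

# Old order: the diagnostic saw the already-filtered data.
diag_after = count_cells_per_group(filtered)

print(diag_before)  # {'a': 3, 'b': 2}
print(diag_after)   # {'a': 3}
```

Running the diagnostic before the filter gives the same answer as
calling it standalone on the unfiltered cells, which is the parity the
reorder restores.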

P3 — API rst step 3 said "Filters singleton-baseline groups" which
read as a point-estimate sample drop. After Round 3, singleton-
baseline groups are excluded from the variance computation only
(they remain in the point-estimate sample as period-based stable
controls). Fixed the language to match.

Documentation:
- REGISTRY.md gets a new `**Note (Phase 1 cluster contract):**`
  block in the dCDH section explaining the always-group-level
  semantics and the construction-time NotImplementedError gate.
- API rst step 3 reflects the variance-only singleton-baseline scope.
- The dCDH `cluster` parameter docstring now describes the Phase 1
  contract instead of "Currently unused — analytical SEs are always
  at the group level via the cohort-recentered plug-in."

Test counts: 106 -> 107 (one new cluster gate regression test).

Files modified:
- diff_diff/chaisemartin_dhaultfoeuille.py (cluster gate in __init__
  and set_params, docstring update, TWFE diagnostic block reorder
  with renumbered Step 5a/5b labels)
- docs/methodology/REGISTRY.md (cluster contract Note + rename
  Step 5a -> Step 5b in the ragged-panel deviation Note)
- docs/api/chaisemartin_dhaultfoeuille.rst (singleton-baseline
  language fix in step 3 of the overview)
- tests/test_chaisemartin_dhaultfoeuille.py
  (test_cluster_parameter_raises_not_implemented in
  TestForwardCompatGates; rename Step 5a -> Step 5b in three
  ragged-panel test docstrings)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/diff_diff/chaisemartin_dhaultfoeuille.py b/diff_diff/chaisemartin_dhaultfoeuille.py
@@ -246,10 +246,15 @@ class ChaisemartinDHaultfoeuille(ChaisemartinDHaultfoeuilleBootstrapMixin):
     ----------
     alpha : float, default=0.05
         Significance level for confidence intervals.
-    cluster : str, optional
-        Reserved for future cluster-robust SE customization. Currently
-        unused — analytical SEs are always at the group level via the
-        cohort-recentered plug-in.
+    cluster : str, optional, default=None
+        **Phase 1 contract:** ``cluster`` must be ``None`` (the default).
+        dCDH always clusters at the group level via the cohort-recentered
+        influence-function plug-in (analytical SEs) and the multiplier
+        bootstrap (also grouped at the ``group`` column). Passing any
+        non-``None`` value raises ``NotImplementedError`` with a Phase 1
+        pointer. Custom clustering at a coarser or finer level than the
+        group is reserved for a future phase. See REGISTRY.md
+        ``ChaisemartinDHaultfoeuille`` section for the full contract.
     n_bootstrap : int, default=0
         Number of multiplier-bootstrap iterations. ``0`` (default) uses
         only the analytical SE. Set to ``999`` or higher for stable
@@ -344,6 +349,18 @@ def __init__(
             raise ValueError(f"alpha must be in (0, 1), got {alpha}")
         if n_bootstrap < 0:
             raise ValueError(f"n_bootstrap must be non-negative, got {n_bootstrap}")
+        if cluster is not None:
+            raise NotImplementedError(
+                f"cluster={cluster!r}: custom clustering is not supported in "
+                f"Phase 1 of ChaisemartinDHaultfoeuille. dCDH always clusters "
+                f"at the group level via the cohort-recentered influence-"
+                f"function plug-in (analytical SEs) and the multiplier "
+                f"bootstrap (also grouped at the group column). To use the "
+                f"supported group-level clustering, pass cluster=None (the "
+                f"default). Custom clustering is reserved for a future "
+                f"phase. See REGISTRY.md ChaisemartinDHaultfoeuille section "
+                f"for the full contract."
+            )
 
         self.alpha = alpha
         self.cluster = cluster
@@ -403,6 +420,15 @@ def set_params(self, **params: Any) -> "ChaisemartinDHaultfoeuille":
             raise ValueError(f"alpha must be in (0, 1), got {self.alpha}")
         if self.n_bootstrap < 0:
             raise ValueError(f"n_bootstrap must be non-negative, got {self.n_bootstrap}")
+        if self.cluster is not None:
+            raise NotImplementedError(
+                f"cluster={self.cluster!r}: custom clustering is not supported "
+                f"in Phase 1 of ChaisemartinDHaultfoeuille. dCDH always clusters "
+                f"at the group level. To use the supported group-level "
+                f"clustering, pass cluster=None (the default). Custom clustering "
+                f"is reserved for a future phase. See REGISTRY.md "
+                f"ChaisemartinDHaultfoeuille section for the full contract."
+            )
         return self
 
     # ------------------------------------------------------------------
@@ -531,7 +557,34 @@ def fit(
         )
 
         # ------------------------------------------------------------------
-        # Step 5a: Ragged panel validation
+        # Step 5a: Compute the TWFE diagnostic on the FULL pre-filter cell
+        #          dataset, so the diagnostic reflects the data the user
+        #          actually passed in. This MUST run BEFORE Step 5b (the
+        #          ragged-panel filter) so that the fitted diagnostic and
+        #          the standalone twowayfeweights() function produce
+        #          identical results on ragged panels — both operate on
+        #          the same _validate_and_aggregate_to_cells() output.
+        # ------------------------------------------------------------------
+        twfe_diagnostic_payload = None
+        if self.twfe_diagnostic:
+            try:
+                twfe_diagnostic_payload = _compute_twfe_diagnostic(
+                    cell=cell,
+                    group_col=group,
+                    time_col=time,
+                    rank_deficient_action=self.rank_deficient_action,
+                )
+            except Exception as exc:  # noqa: BLE001
+                warnings.warn(
+                    f"TWFE decomposition diagnostic failed: {exc}. "
+                    "Skipping diagnostic; main estimation continues.",
+                    UserWarning,
+                    stacklevel=2,
+                )
+                twfe_diagnostic_payload = None
+
+        # ------------------------------------------------------------------
+        # Step 5b: Ragged panel validation
         #
         # The cohort/variance path treats D_{g,1} as the canonical
         # baseline and walks adjacent observed periods to detect first
@@ -613,29 +666,6 @@ def fit(
                 f"got {len(all_periods_pre_drop)}"
             )
 
-        # ------------------------------------------------------------------
-        # Step 5b: Compute the TWFE diagnostic on the FULL pre-filter cell
-        #          dataset, so the diagnostic reflects the data the user
-        #          actually passed in (per the plan).
-        # ------------------------------------------------------------------
-        twfe_diagnostic_payload = None
-        if self.twfe_diagnostic:
-            try:
-                twfe_diagnostic_payload = _compute_twfe_diagnostic(
-                    cell=cell,
-                    group_col=group,
-                    time_col=time,
-                    rank_deficient_action=self.rank_deficient_action,
-                )
-            except Exception as exc:  # noqa: BLE001
-                warnings.warn(
-                    f"TWFE decomposition diagnostic failed: {exc}. "
-                    "Skipping diagnostic; main estimation continues.",
-                    UserWarning,
-                    stacklevel=2,
-                )
-                twfe_diagnostic_payload = None
-
         # ------------------------------------------------------------------
         # Step 6: Drop A5-violating (multi-switch) cells per drop_larger_lower
         # ------------------------------------------------------------------
diff --git a/docs/api/chaisemartin_dhaultfoeuille.rst b/docs/api/chaisemartin_dhaultfoeuille.rst
@@ -19,7 +19,7 @@ The estimator:
 
 1. Aggregates individual-level panel data to ``(group, time)`` cells
 2. Drops multi-switch groups by default (matches R ``DIDmultiplegtDYN``)
-3. Filters singleton-baseline groups (footnote 15 of the dynamic paper)
+3. Excludes singleton-baseline groups from the variance computation only (footnote 15 of the dynamic paper)
 4. Computes per-period joiner (``DID_{+,t}``) and leaver (``DID_{-,t}``)
    contributions via Theorem 3 of the AER 2020 paper
 5. Aggregates them into ``DID_M``, the joiners-only ``DID_+``, and the
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -556,6 +556,8 @@ Alternative: Multiplier bootstrap clustered at group via the `n_bootstrap` param
 
 - **Note:** When every variance-eligible group forms its own `(D_{g,1}, F_g, S_g)` cohort (a degenerate small-panel case where the cohort framework has zero degrees of freedom), the cohort-recentered plug-in formula is unidentified: cohort recentering subtracts the cohort mean from each group's `U^G_g`, and for singleton cohorts the centered value is exactly zero, so the centered influence function vector collapses to all zeros. The estimator returns `overall_se = NaN` with a `UserWarning` rather than silently collapsing to `0.0` (which would falsely imply infinite precision). The `DID_M` point estimate remains well-defined. The bootstrap path inherits the same degeneracy on these panels — the multiplier weights act on an all-zero vector, so the bootstrap distribution is also degenerate. **Deviation from R `DIDmultiplegtDYN`:** R returns a non-zero SE on the canonical 4-group worked example via small-sample sandwich machinery that Python does not implement. Both responses are valid for a degenerate case; Python's `NaN`+warning is the safer default. To get a non-degenerate SE, include more groups so cohorts have peers (real-world panels typically have `G >> K`).
 
+- **Note (Phase 1 cluster contract):** `ChaisemartinDHaultfoeuille` always clusters at the group level. The cohort-recentered analytical SE plug-in operates on per-group influence-function values (one `U^G_g` per group); the multiplier bootstrap generates one weight per group; both inference paths cluster at the user's `group` column with no other option. The constructor accepts `cluster=None` (the default and only supported value); passing any non-`None` value raises `NotImplementedError` with a Phase 1 pointer at construction time (and the same gate fires from `set_params`). Custom clustering at a coarser or finer level than the group is reserved for a future phase. The matching test is `test_cluster_parameter_raises_not_implemented` in `tests/test_chaisemartin_dhaultfoeuille.py::TestForwardCompatGates`.
+
 - **Note:** Placebo Assumption 11 violations (placebo joiners exist but no 3-period stable_0 controls, or symmetric for leavers/stable_1) trigger zero-retention in the placebo numerator AND emit a consolidated `Placebo (DID_M^pl) Assumption 11 violations` warning from `fit()`, mirroring the main DID path's contract documented above. The zeroed placebo periods retain their switcher counts in the placebo `N_S^pl` denominator, biasing `DID_M^pl` toward zero in the offending direction (matching the Theorem 4 paper convention).
 
 - **Note:** By default (`drop_larger_lower=True`), the estimator drops groups whose treatment switches more than once before estimation. This matches R `DIDmultiplegtDYN`'s default and is required for the analytical variance formula (Web Appendix Section 3.7.3 of the dynamic paper, which assumes Assumption 5 / no-crossing) to be consistent with the AER 2020 Theorem 3 point estimate. Both formulas operate on the same post-drop dataset. Setting `drop_larger_lower=False` is supported for diagnostic comparison but produces an inconsistent estimator-variance pairing for any multi-switch groups present, and emits an explicit warning.
@@ -566,7 +568,7 @@ Alternative: Multiplier bootstrap clustered at group via the `n_bootstrap` param
 
 - **Note (deviation from R DIDmultiplegtDYN):** Python uses **period-based** stable-control sets — `stable_0(t)` is any cell with `D_{g,t-1} = D_{g,t} = 0` regardless of baseline `D_{g,1}`, and similarly for `stable_1(t)`. R `DIDmultiplegtDYN` uses **cohort-based** stable-control sets that additionally require `D_{g,1}` to match the side. Python's definition matches the AER 2020 Theorem 3 cell-count notation `N_{0,0,t}` and `N_{1,1,t}` literally; R's definition matches the dynamic companion paper's cohort `(D_{g,1}, F_g, S_g)` framework. The two definitions agree exactly on (a) panels containing only joiners, (b) panels containing only leavers, (c) the hand-calculable 4-group worked example, or (d) any panel where no joiner's post-switch state overlaps a period when leavers are switching. They disagree by O(1%) on the **point estimate** when both joiners and leavers exist AND some joiners' post-switch cells could serve as leavers' controls (or vice versa). After the Round 2 fix that implemented the full `Lambda^G_{g,l=1}` influence function, the **standard error** parity gap on pure-direction scenarios narrowed from ~18% to ~3%. The R parity tests in `tests/test_chaisemartin_dhaultfoeuille_parity.py` use a tight `1e-4` tolerance for pure-direction point estimates, a 5% rtol for pure-direction SEs, and a 2.5% tolerance for mixed-direction point estimates (with the SE check skipped on mixed scenarios because the period-vs-cohort point-estimate deviation cascades into the variance).
 
-- **Note (deviation from R DIDmultiplegtDYN):** Phase 1 requires panels with a **balanced baseline** (every group observed at the first global period) and **no interior period gaps**. The Step 5a validation in `fit()` enforces this contract: groups missing the baseline raise `ValueError`; groups with interior gaps are dropped with a `UserWarning`; groups with **terminal missingness** (early exit / right-censoring — observed at the baseline but missing one or more later periods) are retained and contribute from their observed periods only. R `DIDmultiplegtDYN` accepts unbalanced panels with documented missing-treatment-before-first-switch handling. Python's restriction is a Phase 1 limitation: the cohort enumeration uses `D_{g,1}` as the canonical baseline (so the baseline observation must exist) and the first-switch detection walks adjacent observed periods (so interior gaps create ambiguous transition counts). Terminal missingness is supported because the per-period `present = (N_mat[:, t] > 0) & (N_mat[:, t-1] > 0)` guard appears at three sites in the variance computation (`_compute_per_period_dids`, `_compute_full_per_group_contributions`, `_compute_cohort_recentered_inputs`) and cleanly masks out missing transitions without propagating NaN into the arithmetic. **Workaround for unbalanced panels:** pre-process your data to back-fill the baseline (or drop late-entry groups before fitting), or use R `DIDmultiplegtDYN` until a future phase lifts the restriction. The Step 5a `ValueError` and `UserWarning` messages name the offending group IDs so you can locate them quickly.
+- **Note (deviation from R DIDmultiplegtDYN):** Phase 1 requires panels with a **balanced baseline** (every group observed at the first global period) and **no interior period gaps**. The Step 5b validation in `fit()` enforces this contract: groups missing the baseline raise `ValueError`; groups with interior gaps are dropped with a `UserWarning`; groups with **terminal missingness** (early exit / right-censoring — observed at the baseline but missing one or more later periods) are retained and contribute from their observed periods only. R `DIDmultiplegtDYN` accepts unbalanced panels with documented missing-treatment-before-first-switch handling. Python's restriction is a Phase 1 limitation: the cohort enumeration uses `D_{g,1}` as the canonical baseline (so the baseline observation must exist) and the first-switch detection walks adjacent observed periods (so interior gaps create ambiguous transition counts). Terminal missingness is supported because the per-period `present = (N_mat[:, t] > 0) & (N_mat[:, t-1] > 0)` guard appears at three sites in the variance computation (`_compute_per_period_dids`, `_compute_full_per_group_contributions`, `_compute_cohort_recentered_inputs`) and cleanly masks out missing transitions without propagating NaN into the arithmetic. **Workaround for unbalanced panels:** pre-process your data to back-fill the baseline (or drop late-entry groups before fitting), or use R `DIDmultiplegtDYN` until a future phase lifts the restriction. The Step 5b `ValueError` and `UserWarning` messages name the offending group IDs so you can locate them quickly.
 
 **Reference implementation(s):**
 - R: [`DIDmultiplegtDYN`](https://cran.r-project.org/package=DIDmultiplegtDYN) (CRAN, maintained by the paper authors). The Python implementation matches `did_multiplegt_dyn(..., effects=1)` at horizon `l = 1`. Parity tests live in `tests/test_chaisemartin_dhaultfoeuille_parity.py`.
diff --git a/tests/test_chaisemartin_dhaultfoeuille.py b/tests/test_chaisemartin_dhaultfoeuille.py
@@ -323,6 +323,53 @@ def test_survey_design_raises_not_implemented(self, data):
                 survey_design=object(),
             )
 
+    def test_cluster_parameter_raises_not_implemented(self, data):
+        """
+        Per Phase 1 cluster contract: dCDH always clusters at the
+        group level via the cohort-recentered influence function
+        (analytical SEs) and the multiplier bootstrap (also grouped at
+        the group column). Custom clustering is not supported in
+        Phase 1.
+
+        The reviewer flagged that ``cluster`` was previously accepted
+        on ``__init__`` and stored on ``self.cluster`` but never
+        actually read by ``fit()`` or ``_compute_dcdh_bootstrap()``,
+        making it a silent no-op. This test pins the new contract: any
+        non-None cluster value raises ``NotImplementedError`` at
+        construction time with a message naming the offending value
+        and pointing at the Phase 1 reservation. The same gate fires
+        from ``set_params``.
+
+        See REGISTRY.md ``Note (Phase 1 cluster contract)``.
+        """
+        # __init__ rejects any non-None cluster
+        with pytest.raises(NotImplementedError, match=r"cluster.*Phase 1"):
+            ChaisemartinDHaultfoeuille(cluster="state")
+        with pytest.raises(NotImplementedError, match=r"cluster.*Phase 1"):
+            ChaisemartinDHaultfoeuille(cluster="unit")
+
+        # set_params after construction also rejects
+        est = ChaisemartinDHaultfoeuille()
+        with pytest.raises(NotImplementedError, match=r"cluster.*Phase 1"):
+            est.set_params(cluster="state")
+
+        # cluster=None still works (the only supported value)
+        est_default = ChaisemartinDHaultfoeuille(cluster=None)
+        assert est_default.cluster is None
+        assert est_default.get_params()["cluster"] is None
+
+        # The convenience function also rejects (forward-compat gate
+        # propagates through the wrapper at __init__ time)
+        with pytest.raises(NotImplementedError, match=r"cluster.*Phase 1"):
+            chaisemartin_dhaultfoeuille(
+                data,
+                outcome="outcome",
+                group="group",
+                time="period",
+                treatment="treatment",
+                cluster="state",
+            )
+
 
 # =============================================================================
 # drop_larger_lower (Critical #1)
@@ -461,7 +508,7 @@ def test_singleton_baseline_filter_variance_only(self):
 
     def test_missing_baseline_period_raises_value_error(self):
         """
-        Per fit() Step 5a: groups missing the first global period have
+        Per fit() Step 5b: groups missing the first global period have
         an undefined baseline D_{g,1} and must be rejected with a clear
         error rather than crashing the cohort enumeration with NaN.
         """
@@ -480,7 +527,7 @@ def test_missing_baseline_period_raises_value_error(self):
 
     def test_interior_gap_drops_group_with_warning(self):
         """
-        Per fit() Step 5a: groups with missing intermediate periods
+        Per fit() Step 5b: groups with missing intermediate periods
         (interior gaps between their first and last observed period)
         are dropped with an explicit warning. The cohort/variance path
         requires consecutive observed periods to detect first switches
@@ -508,7 +555,7 @@ def test_interior_gap_drops_group_with_warning(self):
 
     def test_terminal_missingness_retained(self):
         """
-        Per fit() Step 5a contract: groups observed at the baseline but
+        Per fit() Step 5b contract: groups observed at the baseline but
         missing one or more LATER periods (terminal missingness / early
         exit / right-censoring) are RETAINED. The group contributes from
         its observed periods only, masked out of missing transitions by