Fix CI review Round 5: partial theta_hat, coarser-than-group, het docs

igerber · claude · igerber · commit bc1dab76d317 · 2026-04-13T14:28:44.000-04:00
P1: DID^X with a rank-deficient first-stage OLS now residualizes using
    the finite subset of theta_hat (NaN coefficients are zeroed)
    instead of skipping residualization entirely.
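
A minimal sketch of the zeroing step, assuming outcomes of shape (groups, periods) and controls of shape (groups, periods, covariates). The helper name `residualize_with_finite_coefs` and the toy arrays are hypothetical and not part of the package; only the `np.where` / `einsum` pattern mirrors the change in this commit.

```python
import numpy as np


def residualize_with_finite_coefs(Y, X, theta_hat):
    """Residualize Y on X using only the finite entries of theta_hat.

    Y: (n_groups, n_periods), X: (n_groups, n_periods, n_covariates),
    theta_hat: (n_covariates,). NaN coefficients (controls dropped by a
    rank-deficient OLS) are replaced with 0 so they contribute nothing.
    """
    nan_mask = ~np.isfinite(theta_hat)
    theta_used = np.where(nan_mask, 0.0, theta_hat)
    # Y_resid[g, t] = Y[g, t] - X[g, t, :] @ theta_used
    Y_resid = Y - np.einsum("gtk,k->gt", X, theta_used)
    return Y_resid, int(nan_mask.sum())


# Hypothetical toy data: 3 groups, 4 periods, 2 controls, second one dropped.
rng = np.random.default_rng(0)
Y = rng.normal(size=(3, 4))
X = rng.normal(size=(3, 4, 2))
theta_hat = np.array([0.5, np.nan])

Y_resid, n_dropped = residualize_with_finite_coefs(Y, X, theta_hat)
```

With the NaN coefficient zeroed, the result equals residualizing on the first control alone, and no NaN propagates into the residualized outcomes.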

P1: trends_nonparam now rejects set definitions that are not coarser
    than group (singleton sets have no within-set controls).
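
The check can be sketched in isolation as follows. The function name `check_set_partition`, the column names, and the toy panel are hypothetical; the count comparison follows the hunk in this commit.

```python
import pandas as pd


def check_set_partition(data, group, set_col):
    """Raise if the set partition has as many sets as groups."""
    # One set label per group (set membership is assumed time-invariant).
    set_per_group = data.groupby(group)[set_col].first()
    n_sets = set_per_group.nunique()
    n_groups = len(set_per_group)
    if n_sets >= n_groups:
        raise ValueError(
            f"column {set_col!r} defines {n_sets} distinct sets for "
            f"{n_groups} groups; the partition must be coarser than group"
        )
    return n_sets, n_groups


# Hypothetical panel: 4 groups, 2 periods, 2 states.
panel = pd.DataFrame(
    {
        "group": [1, 1, 2, 2, 3, 3, 4, 4],
        "period": [0, 1] * 4,
        "state": ["A", "A", "A", "A", "B", "B", "B", "B"],
    }
)
# A set column that simply copies the group id (every set is a singleton).
panel["own_set"] = panel["group"]

n_sets, n_groups = check_set_partition(panel, "group", "state")

rejected = False
try:
    check_set_partition(panel, "group", "own_set")
except ValueError:
    rejected = True
```

Because the number of sets can never exceed the number of groups, the comparison fires only when every group is its own set. A partition that mixes singleton and multi-group sets passes this particular check.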

P1: heterogeneity restrictions on trends_linear and trends_nonparam
    now documented in REGISTRY.md and fit() docstring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/diff_diff/chaisemartin_dhaultfoeuille.py b/diff_diff/chaisemartin_dhaultfoeuille.py
@@ -544,7 +544,8 @@ def fit(
             heterogeneous effects (Web Appendix Section 1.5, Lemma 7).
             Partial implementation: post-treatment regressions only
             (no placebo regressions or joint null test). Cannot be
-            combined with ``controls``. Requires ``L_max >= 1``.
+            combined with ``controls``, ``trends_linear``, or
+            ``trends_nonparam``. Requires ``L_max >= 1``.
         design2 : bool, default=False
             If ``True``, identify and report switch-in/switch-out
             (Design-2) groups. Convenience wrapper (descriptive summary,
@@ -1075,6 +1076,20 @@ def fit(
                     f"{len(time_varying)} group(s) have varying values. "
                     f"Examples: {time_varying.index.tolist()[:5]}"
                 )
+            # Set partition must be coarser than group (multiple groups
+            # per set). A group-level partition creates singleton sets
+            # with no within-set controls available.
+            set_map_check = data.groupby(group)[set_col].first()
+            n_sets = set_map_check.nunique()
+            n_groups_total = len(set_map_check)
+            if n_sets >= n_groups_total:
+                raise ValueError(
+                    f"trends_nonparam column {set_col!r} defines "
+                    f"{n_sets} distinct sets for {n_groups_total} "
+                    f"groups. The set partition must be coarser than "
+                    f"group (multiple groups per set) to provide "
+                    f"within-set controls."
+                )
             # Extract set membership per group aligned with all_groups
             set_map = data.groupby(group)[set_col].first()
             set_ids_arr = np.array(
@@ -2848,18 +2863,22 @@ def _compute_covariate_residualization(
             "r_squared": r_squared,
         }
 
-        # Guard: if any control coefficient is NaN (rank-deficient OLS
-        # dropped a collinear control), skip residualization for this
-        # baseline to prevent NaN propagation through Y_resid.
-        if not np.all(np.isfinite(theta_hat)):
+        # Guard: if some control coefficients are NaN (rank-deficient
+        # OLS dropped collinear controls), residualize with only the
+        # finite subset. Replace NaN coefficients with 0 so einsum
+        # only uses the identified controls.
+        nan_mask = ~np.isfinite(theta_hat)
+        if nan_mask.any():
+            n_dropped = int(nan_mask.sum())
             warnings.warn(
                 f"DID^X: rank-deficient first-stage OLS for baseline "
-                f"d={d_val} produced NaN coefficients. Outcomes for "
-                f"groups with this baseline are not residualized.",
+                f"d={d_val} dropped {n_dropped} collinear control(s). "
+                f"Residualization uses the {n_covariates - n_dropped} "
+                f"identified control(s).",
                 UserWarning,
                 stacklevel=3,
             )
-            continue
+            theta_hat = np.where(np.isfinite(theta_hat), theta_hat, 0.0)
 
         # Residualize Y at levels for all groups with this baseline.
         # Vectorized level residualization: Y_tilde[g, t] = Y[g, t] - X[g, t] @ theta_hat
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -615,7 +615,7 @@ Alternative: Multiplier bootstrap clustered at group via the `n_bootstrap` param
 
 - **Note (Phase 3 state-set trends):** Implements state-set-specific trends from Web Appendix Section 1.4 (Assumptions 13-14). Restricts the control pool for each switcher to groups in the same set (e.g., same state in county-level data). The restriction applies in BOTH `_compute_multi_horizon_dids()` (point estimates) and `_compute_per_group_if_multi_horizon()` (influence functions) to ensure IF consistency. Cohort structure stays as `(D_{g,1}, F_g, S_g)` triples (does not incorporate set membership). Set membership must be time-invariant per group. Activated via `trends_nonparam="state_column"` in `fit()`.
 
-- **Note (Phase 3 heterogeneity testing - partial implementation):** Partial implementation of the heterogeneity test from Web Appendix Section 1.5 (Assumption 15, Lemma 7). Computes post-treatment saturated OLS regressions of `S_g * (Y_{g, F_g-1+l} - Y_{g, F_g-1})` on a time-invariant covariate `X_g` plus cohort indicator dummies. Standard OLS inference is valid (paper shows no DID error correction needed). **Deviation from R `predict_het`:** R's full `predict_het` option additionally computes placebo regressions and a joint null test, and disallows combination with `controls`. This implementation provides only post-treatment regressions. Combination with `controls` is rejected (matching R). Results stored in `results.heterogeneity_effects`. Activated via `heterogeneity="covariate_column"` in `fit()`.
+- **Note (Phase 3 heterogeneity testing - partial implementation):** Partial implementation of the heterogeneity test from Web Appendix Section 1.5 (Assumption 15, Lemma 7). Computes post-treatment saturated OLS regressions of `S_g * (Y_{g, F_g-1+l} - Y_{g, F_g-1})` on a time-invariant covariate `X_g` plus cohort indicator dummies. Standard OLS inference is valid (paper shows no DID error correction needed). **Deviation from R `predict_het`:** R's full `predict_het` option additionally computes placebo regressions and a joint null test, and disallows combination with `controls`. This implementation provides only post-treatment regressions. **Rejected combinations:** `controls` (matching R), `trends_linear` (heterogeneity test uses raw level changes, incompatible with second-differenced outcomes), and `trends_nonparam` (heterogeneity test does not thread state-set control-pool restrictions). Results stored in `results.heterogeneity_effects`. Activated via `heterogeneity="covariate_column"` in `fit()`.
 
 - **Note (Phase 3 Design-2 switch-in/switch-out):** Convenience wrapper for Web Appendix Section 1.6 (Assumption 16). Identifies groups with exactly 2 treatment changes (join then leave), reports switch-in and switch-out mean effects. This is a descriptive summary, not a full re-estimation with specialized control pools as described in the paper. The paper notes Design-2 can be implemented by "running the command on a restricted subsample and using `trends_nonparam` for the entry-timing grouping." Activated via `design2=True` in `fit()`, requires `drop_larger_lower=False` to retain 2-switch groups.
 
diff --git a/tests/test_chaisemartin_dhaultfoeuille.py b/tests/test_chaisemartin_dhaultfoeuille.py
@@ -2672,6 +2672,16 @@ def test_missing_set_column_raises(self):
                 L_max=1, trends_nonparam="nonexistent",
             )
 
+    def test_group_level_set_rejected(self):
+        """Set partition at group level (not coarser) raises ValueError."""
+        df = self._make_panel_with_sets()
+        # Use group column itself as set (each group is its own set)
+        with pytest.raises(ValueError, match="coarser than group"):
+            ChaisemartinDHaultfoeuille(seed=1).fit(
+                df, "outcome", "group", "period", "treatment",
+                L_max=1, trends_nonparam="group",
+            )
+
     def test_nonparam_with_covariates(self):
         """Combined state-set trends + covariates."""
         df = self._make_panel_with_sets()