Address residual P1 + P2s from re-audit of PR #412

igerber · claude · igerber · commit 71ee0b9ee022 · 2026-05-12T20:14:00.000-04:00
The restored CI reviewer surfaced findings the degraded reviewer missed across all 5 prior rounds on PR #412:

P1 (REGISTRY + code comment): the claim that "R does not ship per-path predict_het on placebos either, so parity is preserved by deferral" contradicts what R's `did_multiplegt_dyn(..., by_path, predict_het)` dispatcher actually does - it forwards `predict_het` into each per-path `did_multiplegt_main` call along with `placebo`, so R may emit per-path placebo heterogeneity rows we do not yet mirror. Rewrite both surfaces (chaisemartin_dhaultfoeuille_results.py code comment and REGISTRY.md DataFrame-integration paragraph) as an explicit Python-side deferral rather than a verified R-parity. Add a TODO row to track validating R's actual placebo predict_het output and either implementing parity or documenting the deviation explicitly.

P2 (REGISTRY rtol claim): the per-path heterogeneity R-parity paragraph claimed "rtol ~1e-6 on point estimates AND SE", but the parity tests use BETA_RTOL=1e-6 and SE_RTOL=1e-5 (one decade looser on SE). Split the claim into the two separate tolerances and note the WLS-denominator/cohort-recentering numerical drift that motivates the looser SE bound.

P2 (replicate-weight df_survey refresh): the existing test only checked finite SE; it would have passed if the new dedicated heterogeneity refresh loop failed to recompute t_stat / p_value / conf_int at the final df_survey. Strengthen the test to call `safe_inference(beta, se, df=df_survey)` on the first finite entry and assert the stored inference fields match - this anti-regression covers the dedicated post-call refresh added for path_heterogeneity_effects.

P2 (paths_of_interest survey gap): the documented composability of `paths_of_interest + heterogeneity + survey_design` was not regression-locked - all existing survey-specific tests used `by_path=k`. Add test_paths_of_interest_heterogeneity_survey_design_analytical (verify analytical Binder TSL fits, selector ordering preserved, finite SE per populated (path, l)) and test_paths_of_interest_heterogeneity_survey_n_bootstrap_gate (verify the multiplier-bootstrap gate applies under paths_of_interest too).

No estimator behavior, weighting, variance/SE, identification, or default statistical surface changed in source - documentation accuracy plus expanded regression coverage only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
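The split-tolerance claim in the rtol item above can be illustrated with a minimal sketch. Only the constants `BETA_RTOL=1e-6` and `SE_RTOL=1e-5` come from this commit message; the helper `assert_parity` and the numeric values are hypothetical, and `math.isclose` stands in for whatever comparison the real parity tests use.

```python
import math

# Tolerances named in the commit message; everything else is illustrative.
BETA_RTOL = 1e-6  # bound on point estimates
SE_RTOL = 1e-5    # one decade looser, bound on standard errors


def assert_parity(py_beta, py_se, r_beta, r_se):
    """Hypothetical helper: compare a Python estimate against an R reference
    using separate relative tolerances for the point estimate and the SE."""
    assert math.isclose(py_beta, r_beta, rel_tol=BETA_RTOL), (
        f"beta outside rtol={BETA_RTOL}: {py_beta} vs {r_beta}"
    )
    assert math.isclose(py_se, r_se, rel_tol=SE_RTOL), (
        f"SE outside rtol={SE_RTOL}: {py_se} vs {r_se}"
    )


# Made-up numbers: the SE differs by a relative 5e-6, which passes at
# SE_RTOL but would fail if the SE were held to BETA_RTOL.
assert_parity(1.0, 0.5, 1.0 + 5e-7, 0.5 * (1 + 5e-6))
```

Keeping the two bounds as separate named constants means a documentation claim such as "rtol ~1e-6 on point estimates AND SE" can be checked directly against what the tests enforce.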
diff --git a/TODO.md b/TODO.md
@@ -60,6 +60,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. | `chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md` | PR 2 | Medium |
 | dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting). | `test_chaisemartin_dhaultfoeuille_parity.py` | #294 | Low |
 | dCDH by_path: negative-baseline path regression (e.g. `(-1, 0, 0, 0)`) is not yet exercised. The existing negative-D test (`test_negative_integer_D_supported`) only covers paths with negative values in non-baseline positions like `(0, -1, -1, -1)`, which does not trigger the R `substr(path, 1, 1)` bug regime (the bug needs a multi-character baseline). Add a switcher fixture with `D_{g,1} = -1` and assert the resulting path tuple key. | `tests/test_chaisemartin_dhaultfoeuille.py` | #419 | Low |
+| dCDH by_path: per-path placebo heterogeneity (`predict_het` rows for negative horizons) is currently NaN-filled in `to_dataframe(level="by_path")` `het_*` columns and unpopulated in `path_heterogeneity_effects`. R `did_multiplegt_dyn(..., by_path, predict_het)` forwards `predict_het` into each per-path `did_multiplegt_main` call alongside `placebo`, so R likely emits placebo het rows we do not yet mirror. Validate R's actual placebo predict_het output, then either implement parity or document the deviation explicitly. | `diff_diff/chaisemartin_dhaultfoeuille.py`, `diff_diff/chaisemartin_dhaultfoeuille_results.py`, `tests/test_chaisemartin_dhaultfoeuille_parity.py` | #421 | Medium |
 | CallawaySantAnna: consider materializing NaN entries for non-estimable (g,t) cells in group_time_effects dict (currently omitted with consolidated warning); would require updating downstream consumers (event study, balance_e, aggregation) | `staggered.py` | #256 | Low |
 | ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium (deferred — only triggers when sparse solver fails) |
 | Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) | `estimators.py` | #218 | Medium |
diff --git a/diff_diff/chaisemartin_dhaultfoeuille_results.py b/diff_diff/chaisemartin_dhaultfoeuille_results.py
@@ -1887,9 +1887,13 @@ def to_dataframe(self, level: str = "overall") -> pd.DataFrame:
                             "cband_upper": ph_cband[1] if ph_cband else np.nan,
                             "cumulated_effect": np.nan,
                             "cumulated_se": np.nan,
-                            # Heterogeneity is forward-only (R doesn't ship
-                            # per-path predict_het on placebos); placebo
-                            # rows always emit NaN here.
+                            # Heterogeneity is forward-only in this release.
+                            # Per-path placebo heterogeneity is not exposed
+                            # yet; R may emit placebo het rows under
+                            # did_multiplegt_dyn(..., by_path, predict_het)
+                            # but R-parity for that surface has not been
+                            # validated, so we emit NaN on placebo rows
+                            # rather than claim parity. See REGISTRY note.
                             "het_beta": np.nan,
                             "het_se": np.nan,
                             "het_t_stat": np.nan,
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
diff --git a/tests/test_chaisemartin_dhaultfoeuille.py b/tests/test_chaisemartin_dhaultfoeuille.py
@@ -10574,6 +10574,134 @@ def test_per_path_heterogeneity_replicate_weights_propagates_n_valid(self):
                         f"path={path} l={l_h}: replicate SE non-finite"
                     )
 
+        # Verify the final df_survey is actually USED to refresh the
+        # inference fields on path_heterogeneity_effects (not the
+        # compute-time snapshot). Pick the first finite entry, recompute
+        # safe_inference at the final df, and require the stored fields
+        # to match. Anti-regression for the dedicated refresh loop at
+        # chaisemartin_dhaultfoeuille.py R2 P1b: a regression in that
+        # loop would leave stale t_stat / p_value / conf_int derived
+        # from an earlier (likely larger) df.
+        from diff_diff.utils import safe_inference
+
+        df_final = res.survey_metadata.df_survey
+        checked = False
+        for path, horizons in res.path_heterogeneity_effects.items():
+            for l_h, vals in horizons.items():
+                if vals["n_obs"] >= 3 and np.isfinite(vals["se"]):
+                    expected_t, expected_p, expected_ci = safe_inference(
+                        vals["beta"], vals["se"], df=df_final
+                    )
+                    assert vals["t_stat"] == pytest.approx(
+                        expected_t, rel=1e-12, nan_ok=True
+                    ), (
+                        f"path={path} l={l_h}: t_stat not refreshed at "
+                        f"df={df_final} (have {vals['t_stat']}, expected "
+                        f"{expected_t})"
+                    )
+                    assert vals["p_value"] == pytest.approx(
+                        expected_p, rel=1e-12, nan_ok=True
+                    ), (
+                        f"path={path} l={l_h}: p_value not refreshed at "
+                        f"df={df_final} (have {vals['p_value']}, expected "
+                        f"{expected_p})"
+                    )
+                    assert vals["conf_int"][0] == pytest.approx(
+                        expected_ci[0], rel=1e-12, nan_ok=True
+                    )
+                    assert vals["conf_int"][1] == pytest.approx(
+                        expected_ci[1], rel=1e-12, nan_ok=True
+                    )
+                    checked = True
+                    break
+            if checked:
+                break
+        assert checked, (
+            "Expected at least one finite (path, l) entry to refresh-"
+            "check; fixture is degenerate."
+        )
+
+    @pytest.mark.slow
+    def test_paths_of_interest_heterogeneity_survey_design_analytical(self):
+        """Mirror of the by_path+heterogeneity+survey_design analytical
+        path using paths_of_interest. Anti-regression: the docs claim
+        both selectors compose with heterogeneity under survey_design,
+        but the existing TestByPathHeterogeneity survey tests only
+        exercise by_path=. This test pins the reciprocal selector under
+        analytical Binder TSL.
+        """
+        from diff_diff.survey import SurveyDesign
+
+        df = self._by_path_het_data_with_survey()
+        sd = SurveyDesign(weights="survey_weights", strata="strata", psu="psu")
+        # Three observed paths in the fixture; pick two in non-frequency
+        # order so we can verify selector ordering is preserved.
+        est = ChaisemartinDHaultfoeuille(
+            drop_larger_lower=False,
+            paths_of_interest=[(0, 1, 1, 0), (0, 1, 1, 1)],
+        )
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", UserWarning)
+            res = est.fit(
+                df,
+                outcome="outcome",
+                group="group",
+                time="period",
+                treatment="treatment",
+                L_max=3,
+                heterogeneity="het_x",
+                survey_design=sd,
+            )
+        assert res.path_heterogeneity_effects, (
+            "paths_of_interest + heterogeneity + survey_design must "
+            "populate path_heterogeneity_effects"
+        )
+        # Selector keys are preserved in the user-specified order
+        # (not frequency-ranked like by_path).
+        keys = list(res.path_heterogeneity_effects.keys())
+        assert keys == [(0, 1, 1, 0), (0, 1, 1, 1)], (
+            f"paths_of_interest order not preserved: got {keys}"
+        )
+        # Every populated (path, l) entry yields finite analytical SE.
+        for path, horizons in res.path_heterogeneity_effects.items():
+            for l_h, vals in horizons.items():
+                if vals["n_obs"] >= 3:
+                    assert np.isfinite(vals["se"]), (
+                        f"path={path} l={l_h}: analytical survey SE "
+                        f"non-finite under paths_of_interest"
+                    )
+
+    @pytest.mark.slow
+    def test_paths_of_interest_heterogeneity_survey_n_bootstrap_gate(self):
+        """The by_path + survey_design + n_bootstrap > 0 gate (PR #408)
+        also fires under paths_of_interest + heterogeneity. Anti-
+        regression: the multiplier-bootstrap-survey gate must apply to
+        both selectors.
+        """
+        from diff_diff.survey import SurveyDesign
+
+        df = self._by_path_het_data_with_survey()
+        sd = SurveyDesign(weights="survey_weights", strata="strata", psu="psu")
+        est = ChaisemartinDHaultfoeuille(
+            drop_larger_lower=False,
+            paths_of_interest=[(0, 1, 1, 1)],
+            n_bootstrap=10,
+            seed=1,
+        )
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", UserWarning)
+            with pytest.raises(NotImplementedError, match="multiplier"):
+                est.fit(
+                    df,
+                    outcome="outcome",
+                    group="group",
+                    time="period",
+                    treatment="treatment",
+                    L_max=3,
+                    heterogeneity="het_x",
+                    survey_design=sd,
+                )
+
     @pytest.mark.slow
     def test_survey_design_plus_n_bootstrap_with_heterogeneity_still_raises(
         self,