Address PR #347 R1: fix StackedDiD wording, drop dead TWFE branch, fix dCDH headline_attribute

igerber · claude · igerber · commit 7e8e26479f4e · 2026-04-20T19:04:30.000-04:00
R1 surfaced three P1s, all legitimate:

1. StackedDiD wording mismatch. The definition claimed ``overall_att``
   is a treated-share-weighted aggregate across sub-experiments; the
   actual implementation (``stacked_did.py`` ~line 541) computes
   ``overall_att`` as the simple average of post-treatment event-study
   coefficients ``delta_h`` with delta-method SE. Per-horizon
   ``delta_h`` is the paper's ``theta_kappa^e`` cross-event aggregate,
   but the headline is an equally-weighted average over those
   per-horizon coefficients, not a separate cross-event weighting at
   the ATT level. Definition rewritten to describe the actual estimand.

2. Dead ``TwoWayFixedEffectsResults`` branch. ``TwoWayFixedEffects`` is
   a subclass of ``DifferenceInDifferences`` and its ``fit()`` returns
   ``DiDResults``; there is no separate TWFE result class, so the
   ``type(results).__name__ == "TwoWayFixedEffectsResults"`` dispatch
   branch was unreachable on any real fit. Removed the dead branch and
   rewrote the ``DiDResults`` branch to cover both 2x2 DiD and TWFE
   interpretations explicitly (both estimators route here). Follow-up
   for a future PR: persist estimator provenance on ``DiDResults`` (or
   return a dedicated TWFE result class) so the branch can split again;
   documented inline.

3. dCDH ``headline_attribute="att"``. Both dCDH branches (``DID_M`` for
   ``L_max=None``, ``DID_l``/derivatives for ``L_max >= 1``) named
   ``"att"`` as the headline attribute, but
   ``ChaisemartinDHaultfoeuilleResults`` stores the headline in
   ``overall_att`` (``chaisemartin_dhaultfoeuille_results.py:357``).
   Fixed both branches to ``"overall_att"``; downstream consumers using
   the machine-readable contract now point at the correct attribute.

Tests: new ``TestTargetParameterRealFitIntegration`` covers the gap R1
P2 flagged: prior coverage was stub-based and would not have caught any
of the three P1s. Four new real-fit tests:

- ``TwoWayFixedEffects().fit(...)`` returns ``DiDResults``; the
  target-parameter block uses the shared DiD/TWFE branch.
- ``StackedDiD(...).fit(...)`` on a staggered panel; the
  ``headline_attribute`` matches the real attribute and the definition
  names the event-study-coefficient estimand.
- ``ChaisemartinDHaultfoeuille().fit(...)`` on a reversible-treatment
  panel (two tests, one each for the ``DID_M`` and ``DID_l`` regimes);
  ``headline_attribute == "overall_att"`` and the named attribute
  actually exists on the real fit object.

Existing stub-based dispatch tests updated: the ``test_twfe_results``
test is now ``test_did_results_mentions_twfe`` (asserts the DiD branch
describes both estimators). The dCDH stub tests now also assert
``headline_attribute == "overall_att"``.

All 323 BR/DR tests pass (319 prior + 4 new real-fit integration).

Out of scope (plan-review MEDIUM #2: centralizing report metadata in a
single registry shared by estimator outputs and reporting helpers):
queued as a separate PR. The current approach (string dispatch on
``type(results).__name__`` + REGISTRY.md references) is working but
brittle; a centralized registry is the principled fix for the
TWFE-dispatch-dead-code class of bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
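
For readers checking the corrected StackedDiD wording, here is a minimal sketch of what "equally-weighted average of post-treatment coefficients with delta-method SE" means numerically. It is not the code in ``stacked_did.py``: the helper name ``average_post_coefficients`` is hypothetical, and the coefficients and covariance matrix are made-up numbers. Because the average is a linear function of the coefficients, the delta-method variance reduces to ``w' V w`` with ``w = (1/n, ..., 1/n)``.

```python
import math


def average_post_coefficients(delta, vcov):
    """Equally weighted average of coefficients and its delta-method SE.

    Hypothetical helper for illustration only; not part of diff_diff.
    ``delta`` is a list of n coefficients, ``vcov`` their n x n covariance.
    """
    n = len(delta)
    if n == 0:
        raise ValueError("need at least one post-treatment coefficient")
    if len(vcov) != n or any(len(row) != n for row in vcov):
        raise ValueError("vcov must be n x n, matching len(delta)")
    # The gradient of the mean is the constant weight 1/n for every
    # coefficient, so Var(mean) = (1/n)^2 * sum of all entries of vcov.
    w = 1.0 / n
    att = w * sum(delta)
    variance = w * w * sum(vcov[i][j] for i in range(n) for j in range(n))
    return att, math.sqrt(variance)


# Made-up per-horizon coefficients (two post-treatment horizons) and covariance.
delta_post = [1.0, 3.0]
vcov_post = [[0.04, 0.01], [0.01, 0.09]]
att, se = average_post_coefficients(delta_post, vcov_post)
```

Any treated-share weighting, per the commit message, already sits inside each per-horizon coefficient; the headline step applies no further cross-event weights.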
diff --git a/diff_diff/_reporting_helpers.py b/diff_diff/_reporting_helpers.py
@@ -70,17 +70,34 @@ def describe_target_parameter(results: Any) -> Dict[str, Any]:
     name = type(results).__name__
 
     if name == "DiDResults":
+        # Covers both ``DifferenceInDifferences`` (2x2 DiD) and
+        # ``TwoWayFixedEffects`` (TWFE with unit + time FE). Both
+        # estimators return ``DiDResults``; there is no separate
+        # ``TwoWayFixedEffectsResults`` class as of this PR
+        # (confirmed in PR #347 R1 review). The description covers
+        # both interpretations because the result carries no
+        # estimator-provenance marker BR/DR can dispatch on. Adding a
+        # dedicated TWFE result class (or persisting provenance on
+        # DiDResults) is queued as follow-up so this branch can split
+        # in a future PR.
         return {
-            "name": "ATT (2x2)",
+            "name": "ATT (2x2 or TWFE within-transformed coefficient)",
             "definition": (
-                "The average treatment effect on the treated, estimated as a "
-                "single 2x2 Difference-in-Differences contrast between the "
-                "treated-unit change and the control-unit change across the "
-                "pre / post period."
+                "The average treatment effect on the treated. For "
+                "``DifferenceInDifferences``, this is the 2x2 DiD "
+                "contrast between treated-unit change and control-unit "
+                "change across pre / post. For ``TwoWayFixedEffects``, "
+                "this is the coefficient on the treatment-by-post "
+                "interaction in a regression with unit and time fixed "
+                "effects; under homogeneous treatment effects it is "
+                "the ATT, and under heterogeneous effects with staggered "
+                "adoption it is a weighted average of 2x2 comparisons "
+                "that may include forbidden later-vs-earlier comparisons "
+                "(see Goodman-Bacon)."
             ),
             "aggregation": "2x2",
             "headline_attribute": "att",
-            "reference": "REGISTRY.md Sec. DifferenceInDifferences",
+            "reference": ("REGISTRY.md Sec. DifferenceInDifferences / TwoWayFixedEffects"),
         }
 
     if name == "MultiPeriodDiDResults":
@@ -97,22 +114,6 @@ def describe_target_parameter(results: Any) -> Dict[str, Any]:
             "reference": "REGISTRY.md Sec. MultiPeriodDiD",
         }
 
-    if name == "TwoWayFixedEffectsResults":
-        return {
-            "name": "TWFE ATT (within-transformed DiD coefficient)",
-            "definition": (
-                "The coefficient on the treatment-by-post interaction in a "
-                "two-way-fixed-effects regression (unit + time FE). Under "
-                "homogeneous treatment effects this is the ATT; under "
-                "heterogeneous effects with staggered adoption it is a weighted "
-                "average of 2x2 comparisons, possibly including forbidden "
-                "comparisons (see Goodman-Bacon)."
-            ),
-            "aggregation": "twfe",
-            "headline_attribute": "att",
-            "reference": "REGISTRY.md Sec. TwoWayFixedEffects",
-        }
-
     if name == "CallawaySantAnnaResults":
         return {
             "name": "overall ATT (cohort-size-weighted average of ATT(g,t))",
@@ -197,13 +198,19 @@ def describe_target_parameter(results: Any) -> Dict[str, Any]:
                 "(``A_s > a + kappa_post``)."
             )
         return {
-            "name": "overall ATT (sub-experiment-weighted aggregate across stacked events)",
+            "name": "overall ATT (average of post-treatment event-study coefficients)",
             "definition": (
-                "A weighted aggregate of per-sub-experiment ATTs across stacked "
-                "adoption events. Each sub-experiment aligns a treated cohort "
-                "with its clean-control set over the event window "
-                "``[-kappa_pre, +kappa_post]``; the overall ATT averages these "
-                "sub-experiment ATTs using treated-unit share weights. " + control_clause
+                "The average of post-treatment event-study coefficients "
+                "``delta_h`` (h >= -anticipation), estimated from the stacked "
+                "sub-experiment panel with delta-method SE "
+                "(``stacked_did.py`` around line 541). Each sub-experiment "
+                "aligns a treated cohort with its clean-control set over the "
+                "event window ``[-kappa_pre, +kappa_post]``; each per-horizon "
+                "``delta_h`` is the paper's ``theta_kappa^e`` "
+                "treated-share-weighted cross-event aggregate. The "
+                "``overall_att`` headline is the equally-weighted average of "
+                "these per-horizon coefficients, not a separate cross-event "
+                "weighted aggregate at the ATT level. " + control_clause
             ),
             "aggregation": "stacked",
             "headline_attribute": "overall_att",
@@ -322,7 +329,7 @@ def describe_target_parameter(results: Any) -> Dict[str, Any]:
                     "contrasts."
                 ),
                 "aggregation": "M",
-                "headline_attribute": "att",
+                "headline_attribute": "overall_att",
                 "reference": (
                     "de Chaisemartin & D'Haultfoeuille (2020); "
                     "REGISTRY.md Sec. ChaisemartinDHaultfoeuille"
@@ -362,7 +369,7 @@ def describe_target_parameter(results: Any) -> Dict[str, Any]:
                 "share weights." + extra
             ),
             "aggregation": agg_tag,
-            "headline_attribute": "att",
+            "headline_attribute": "overall_att",
             "reference": (
                 "de Chaisemartin & D'Haultfoeuille (2020, 2024); "
                 "REGISTRY.md Sec. ChaisemartinDHaultfoeuille"
diff --git a/tests/test_target_parameter.py b/tests/test_target_parameter.py
@@ -54,10 +54,15 @@ def test_multi_period_did_results(self):
         assert tp["headline_attribute"] == "avg_att"
         assert "event-study" in tp["name"].lower()
 
-    def test_twfe_results(self):
-        tp = describe_target_parameter(_minimal_result("TwoWayFixedEffectsResults"))
-        assert tp["aggregation"] == "twfe"
-        assert "TWFE" in tp["name"]
+    def test_did_results_mentions_twfe(self):
+        """``TwoWayFixedEffects.fit()`` returns ``DiDResults`` (verified in
+        PR #347 R1 review), so the DiDResults branch must cover both
+        the 2x2 DiD and TWFE interpretations. There is no separate
+        ``TwoWayFixedEffectsResults`` class.
+        """
+        tp = describe_target_parameter(_minimal_result("DiDResults"))
+        assert tp["aggregation"] == "2x2"
+        assert "TWFE" in tp["name"] or "TWFE" in tp["definition"]
 
     def test_callaway_santanna(self):
         tp = describe_target_parameter(_minimal_result("CallawaySantAnnaResults"))
@@ -133,6 +138,10 @@ def test_dcdh_m(self):
         )
         assert tp["aggregation"] == "M"
         assert "DID_M" in tp["name"]
+        # R1 PR #347 review: headline lives in ``overall_att``, not
+        # ``att``. Machine-readable field must match the raw attribute
+        # downstream consumers should read.
+        assert tp["headline_attribute"] == "overall_att"
 
     def test_dcdh_l(self):
         tp = describe_target_parameter(
@@ -145,6 +154,7 @@ def test_dcdh_l(self):
         )
         assert tp["aggregation"] == "l"
         assert "DID_l" in tp["name"]
+        assert tp["headline_attribute"] == "overall_att"
 
     def test_dcdh_l_with_controls(self):
         tp = describe_target_parameter(
@@ -344,3 +354,136 @@ def test_dr_full_report_emits_target_parameter_section(self):
         assert "## Target Parameter" in md
         assert "Aggregation tag:" in md
         assert "Headline attribute:" in md
+
+
+class TestTargetParameterRealFitIntegration:
+    """PR #347 R1 review P2: stub-based dispatch tests missed (a) the
+    TWFE-returns-DiDResults mismatch, (b) the dCDH headline_attribute
+    bug. Exercise real fits so these contracts are enforced end-to-end.
+    """
+
+    def test_twfe_fit_returns_did_results_branch(self):
+        """``TwoWayFixedEffects.fit()`` returns ``DiDResults``, so the
+        target-parameter block must be the DiD/TWFE-covering branch.
+        Guards against reintroducing a dead-code
+        ``TwoWayFixedEffectsResults`` branch.
+        """
+        import warnings
+
+        from diff_diff import TwoWayFixedEffects, generate_did_data
+
+        warnings.filterwarnings("ignore")
+        df = generate_did_data(n_units=40, n_periods=4, seed=7)
+        fit = TwoWayFixedEffects().fit(
+            df, outcome="outcome", treatment="treated", time="post", unit="unit"
+        )
+        # Real TWFE fit returns DiDResults (no separate TWFE result class).
+        assert type(fit).__name__ == "DiDResults"
+        tp = describe_target_parameter(fit)
+        assert tp["aggregation"] == "2x2"
+        # The DiDResults branch must name TWFE explicitly so the
+        # description is source-faithful for both DiD and TWFE fits.
+        assert "TWFE" in tp["name"] or "TWFE" in tp["definition"]
+
+    def test_stacked_did_fit_headline_attribute_matches_real_estimand(self):
+        """``StackedDiDResults.overall_att`` is the average of
+        post-treatment event-study coefficients ``delta_h`` with
+        delta-method SE (``stacked_did.py`` around line 541). Real-fit
+        regression against the reviewer's R1 P1 wording catch.
+        """
+        import warnings
+
+        from diff_diff import StackedDiD, generate_staggered_data
+
+        warnings.filterwarnings("ignore")
+        df = generate_staggered_data(n_units=60, n_periods=6, seed=13)
+        fit = StackedDiD(kappa_pre=1, kappa_post=1).fit(
+            df,
+            outcome="outcome",
+            unit="unit",
+            time="period",
+            first_treat="first_treat",
+        )
+        assert type(fit).__name__ == "StackedDiDResults"
+        tp = describe_target_parameter(fit)
+        assert tp["headline_attribute"] == "overall_att"
+        # R1 P1 fix: the definition must describe the actual estimand —
+        # the average of post-treatment delta_h event-study coefficients.
+        assert (
+            "event-study" in tp["definition"].lower()
+            or "delta_h" in tp["definition"]
+            or "post-treatment" in tp["definition"].lower()
+        )
+
+    def _dcdh_reversible_panel(self, seed):
+        """Build a minimal reversible-treatment panel that dCDH
+        accepts (group / time / treatment columns, at least one
+        switcher)."""
+        import numpy as np
+        import pandas as pd
+
+        rng = np.random.default_rng(seed)
+        units = list(range(20))
+        periods = list(range(6))
+        rows = []
+        for u in units:
+            # Half of units switch from 0 -> 1 at period 3.
+            for t in periods:
+                d = 1 if (u < 10 and t >= 3) else 0
+                y = d * 2.0 + rng.normal(0.0, 0.5)
+                rows.append({"unit": u, "period": t, "treated": d, "outcome": y})
+        return pd.DataFrame(rows)
+
+    def test_dcdh_did_m_fit_headline_attribute_is_overall_att(self):
+        """``ChaisemartinDHaultfoeuilleResults.overall_att`` holds the
+        DID_M headline scalar (``chaisemartin_dhaultfoeuille_results.py``
+        line ~357). R1 P1: previously ``headline_attribute="att"``
+        pointed at a non-existent attribute.
+        """
+        import warnings
+
+        from diff_diff import ChaisemartinDHaultfoeuille
+
+        warnings.filterwarnings("ignore")
+        df = self._dcdh_reversible_panel(seed=11)
+        # DID_M regime: L_max not supplied.
+        fit = ChaisemartinDHaultfoeuille().fit(
+            df,
+            outcome="outcome",
+            group="unit",
+            time="period",
+            treatment="treated",
+        )
+        assert type(fit).__name__ == "ChaisemartinDHaultfoeuilleResults"
+        tp = describe_target_parameter(fit)
+        assert tp["aggregation"] == "M"
+        assert tp["headline_attribute"] == "overall_att"
+        # Sanity: the attribute BR/DR points at actually exists on the
+        # real fit object.
+        assert hasattr(fit, tp["headline_attribute"]), (
+            f"headline_attribute={tp['headline_attribute']!r} must name "
+            "an attribute that actually exists on the real result object."
+        )
+
+    def test_dcdh_did_l_fit_headline_attribute_is_overall_att(self):
+        """Same guard for the DID_l dynamic-horizon regime
+        (``L_max >= 1``). Real-fit regression.
+        """
+        import warnings
+
+        from diff_diff import ChaisemartinDHaultfoeuille
+
+        warnings.filterwarnings("ignore")
+        df = self._dcdh_reversible_panel(seed=12)
+        fit = ChaisemartinDHaultfoeuille().fit(
+            df,
+            outcome="outcome",
+            group="unit",
+            time="period",
+            treatment="treated",
+            L_max=2,
+        )
+        tp = describe_target_parameter(fit)
+        assert tp["aggregation"] == "l"
+        assert tp["headline_attribute"] == "overall_att"
+        assert hasattr(fit, tp["headline_attribute"])
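
The ``hasattr`` guard in the two dCDH tests above generalizes to any results object. The following is a minimal sketch of that contract check under stated assumptions: ``read_headline`` is a hypothetical helper that does not exist in ``diff_diff``, and a ``SimpleNamespace`` stands in for a real fit so the example runs without the library.

```python
from types import SimpleNamespace


def read_headline(results, target_parameter):
    """Return the scalar named by ``headline_attribute``.

    Hypothetical helper for illustration; raises if the machine-readable
    contract names an attribute the results object does not have.
    """
    attr = target_parameter["headline_attribute"]
    if not hasattr(results, attr):
        raise AttributeError(
            f"headline_attribute={attr!r} is missing on "
            f"{type(results).__name__}"
        )
    return getattr(results, attr)


# Stand-in for a dCDH fit: the headline lives in ``overall_att``.
fake_fit = SimpleNamespace(overall_att=2.0)

# Corrected contract: resolves to the stored value.
value = read_headline(fake_fit, {"headline_attribute": "overall_att"})

# Pre-fix contract: "att" does not exist on the object, so the check fails.
try:
    read_headline(fake_fit, {"headline_attribute": "att"})
    stale_contract_detected = False
except AttributeError:
    stale_contract_detected = True
```

A stub-based test that only inspects the returned dict would pass with either value of ``headline_attribute``, which is why the real-fit tests check the attribute on the fitted object itself.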