PR #457 R4 polish: narrow always-treated carve-out to U-bucket only

igerber · igerber · commit 8225ba06567f · 2026-05-16T14:58:03.000-04:00

The R4 verdict was "Looks good" with one P3 informational item: the
per-component parity test skipped the ENTIRE always_treated_remapped
fixture, leaving the 6 timing-vs-timing rows (Earlier vs Later Treated
and Later vs Earlier Treated, between cohorts 3/4/5) without direct
per-component parity assertions. Per memory
feedback_test_coverage_gap_treat_as_actionable, this is the "test exists
but doesn't directly exercise the surface" pattern and should be treated
as actionable.

Narrowed the carve-out: instead of skipping the whole fixture, drop only
the treated_vs_never keys from both the Python and R sides (the actual
U-bucket convention divergence) and keep direct atol=1e-6 parity
assertions on the 6 timing-vs-timing keys. Also refined _classify_r_type
to canonicalize R's "Later vs Always Treated" type string to
treated_vs_never, which keeps the narrow carve-out simple. Python folds
always-treated units into the U bucket per paper footnote 11, so these
rows belong to the U comparison set semantically even though R numbers
them by the always-treated cohort.
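
Illustrative sketch of the narrowed carve-out (not the test code itself).
It assumes component keys are tuples whose first element is the canonical
comparison type, as the k[0] filter in the diff suggests. The helper names
classify_r_type, drop_u_bucket and assert_parity, the key layout, and all
numbers are hypothetical and exist only to show the pattern: canonicalize
the type string, drop U-bucket keys on both sides, then require full-set
equality and atol=1e-6 agreement on what remains.

```python
import math
from typing import Optional

def classify_r_type(type_str: str) -> Optional[str]:
    """Map an R `type` label to the U bucket, or None to infer from cohorts."""
    t = type_str.lower()
    # "Later vs Always Treated" joins the U bucket alongside the
    # never-treated / untreated labels.
    if "never" in t or "untreated" in t or "always" in t:
        return "treated_vs_never"
    return None

def drop_u_bucket(components: dict) -> dict:
    """Keep only timing-vs-timing keys; key[0] is the comparison type."""
    return {k: v for k, v in components.items() if k[0] != "treated_vs_never"}

def assert_parity(py: dict, r: dict, atol: float = 1e-6) -> None:
    """Full-set key equality plus absolute-tolerance value agreement."""
    assert set(py) == set(r), f"key mismatch: {set(py) ^ set(r)}"
    for key in py:
        assert math.isclose(py[key], r[key], rel_tol=0.0, abs_tol=atol), key

# Toy estimates (made up): Python has one U cell, R has two.
py_estimates = {
    ("treated_vs_never", 3, "U"): 0.50,
    ("earlier_vs_later", 3, 4): 0.20,
    ("later_vs_earlier", 4, 3): -0.10,
}
r_estimates = {
    ("treated_vs_never", 3, "U"): 0.41,
    ("treated_vs_never", 3, 1): 0.62,
    ("earlier_vs_later", 3, 4): 0.20 + 5e-7,
    ("later_vs_earlier", 4, 3): -0.10,
}

py_kept = drop_u_bucket(py_estimates)
r_kept = drop_u_bucket(r_estimates)
assert_parity(py_kept, r_kept)  # only the timing-vs-timing keys are compared
```

Filtering both sides by the same predicate keeps the full-set equality
check meaningful: a missing or spurious timing-vs-timing component still
fails, while the structurally different U-bucket rows are left to the
aggregate fold-back test.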

Tests: 34/34 pass in test_methodology_bacon.py (+6 directly asserted
timing-vs-timing comparisons in the remap fixture vs prior coverage).
diff --git a/tests/test_methodology_bacon.py b/tests/test_methodology_bacon.py
@@ -372,13 +372,21 @@ def _canonical_control(ctype: str, group):
 
         def _classify_r_type(c: dict, fixture_name: str) -> str:
             # R bacondecomp's `type` strings vary across versions
-            # ("Treated vs Untreated", "Earlier vs Later Treated", ...).
-            # Fall back to inferring from the control_group: U sentinel
-            # (0, np.inf, or "never"-containing string) -> treated_vs_never;
-            # otherwise treated_group < control_group is earlier-vs-later.
+            # ("Treated vs Untreated", "Earlier vs Later Treated",
+            # "Later vs Always Treated", ...). Fall back to inferring from
+            # the control_group: U sentinel (0, np.inf, or "never"-containing
+            # string) -> treated_vs_never; otherwise treated_group <
+            # control_group is earlier-vs-later. Note: ``Later vs Always
+            # Treated`` is canonicalized to ``treated_vs_never`` here because
+            # Python's paper-footnote-11 convention folds always-treated
+            # units into the U bucket — semantically these R rows belong
+            # to the U comparison set even though R numbers them by the
+            # always-treated cohort (typically first_treat=1).
             t = c.get("type") or ""
             if "never" in t.lower() or "untreated" in t.lower():
                 return "treated_vs_never"
+            if "always" in t.lower():
+                return "treated_vs_never"
             ctrl = c["control_group"]
             if isinstance(ctrl, str) and "never" in ctrl.lower():
                 return "treated_vs_never"
@@ -398,30 +406,17 @@ def _classify_r_type(c: dict, fixture_name: str) -> str:
         for fixture_name, fix in golden.items():
             if fixture_name == "meta":
                 continue
-            # ``always_treated_remapped``: R keeps ``first_treat=1`` as a
-            # separate cohort and emits ``Later vs Always Treated`` (and
-            # ``Treated vs Untreated``) comparisons against it. Python's
-            # paper-footnote-11 convention remaps those units to U,
-            # folding R's two columns of components into single
-            # ``treated_vs_never`` cells per treated cohort. The aggregate
-            # (TWFE coefficient + weights-sum) is invariant to this
-            # re-bucketing and is locked by ``test_twfe_coef_matches_r``
-            # and ``test_weights_sum_matches_r`` above, but the
-            # per-component set differs **structurally** under the two
-            # conventions. Skip this fixture's per-component assertion
-            # while keeping the aggregate parity. See REGISTRY note on
-            # always-treated remap for the convention rationale.
-            if fixture_name == "always_treated_remapped":
-                continue
             panel = pd.DataFrame(fix["panel"])
-            results = bacon_decompose(
-                panel,
-                outcome="y",
-                unit="unit",
-                time="time",
-                first_treat="first_treat",
-                weights="exact",
-            )
+            with warnings.catch_warnings():
+                warnings.simplefilter("ignore", category=UserWarning)
+                results = bacon_decompose(
+                    panel,
+                    outcome="y",
+                    unit="unit",
+                    time="time",
+                    first_treat="first_treat",
+                    weights="exact",
+                )
             py_estimates = {}
             py_weights = {}
             for c in results.comparisons:
@@ -443,6 +438,24 @@ def _classify_r_type(c: dict, fixture_name: str) -> str:
                 )
                 r_estimates[key] = c["estimate"]
                 r_weights[key] = c["weight"]
+            # ``always_treated_remapped`` carves out only the U-bucket rows,
+            # which R and Python decompose under different conventions
+            # (R: separate ``Later vs Always Treated`` + ``Treated vs
+            # Untreated``; Python: single ``treated_vs_never`` per cohort
+            # via paper-footnote-11 remap). The aggregated fold-back is
+            # asserted in ``test_always_treated_remapped_fold_back_matches_r``.
+            # The 6 timing-vs-timing rows in that fixture are NOT affected
+            # by the convention split and must satisfy direct per-component
+            # parity at atol=1e-6 — narrow the carve-out to U-bucket keys
+            # only so regressions in timing-vs-timing decomposition are
+            # caught directly, not just through aggregate parity.
+            if fixture_name == "always_treated_remapped":
+                # Drop only treated_vs_never keys from both sides; keep
+                # earlier_vs_later + later_vs_earlier for direct parity.
+                py_estimates = {k: v for k, v in py_estimates.items() if k[0] != "treated_vs_never"}
+                py_weights = {k: v for k, v in py_weights.items() if k[0] != "treated_vs_never"}
+                r_estimates = {k: v for k, v in r_estimates.items() if k[0] != "treated_vs_never"}
+                r_weights = {k: v for k, v in r_weights.items() if k[0] != "treated_vs_never"}
             # Full-set equality: no Python component missing from R, no R
             # component missing from Python. A dropped β̂_{kU} term or an
             # extra spurious comparison would fail here.