Commit 81b9430

Merge pull request igerber#378 from igerber/dcdh-by-path-controls
Lift by_path + controls gate (DID^X residualization)
2 parents 2a7f3c0 + 6785d6a commit 81b9430

7 files changed

Lines changed: 1080 additions & 15 deletions

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
Large diffs are not rendered by default.

benchmarks/R/generate_dcdh_dynr_test_values.R

Lines changed: 54 additions & 0 deletions
@@ -699,6 +699,60 @@ scenarios$multi_path_reversible_by_path_placebo <- list(
   results = extract_dcdh_by_path(res15, n_effects = 3, n_placebos = 2)
 )
 
+# Scenario 16: multi_path_reversible + by_path=3 + controls="X1" (Phase 3
+# Wave 3 #5: by_path + DID^X residualization). Same deterministic DGP
+# and n_periods=10 as scenarios 14/15, with a confounding covariate X1
+# added via the same `add_covariate` helper used by scenario 10's
+# `joiners_only_controls`. **R re-runs `did_multiplegt_main()` per path**
+# with a path-restricted subsample (path's switchers + same-baseline
+# not-yet-treated controls), so its per-baseline OLS residualization
+# coefficients can vary per path (verified against
+# `chaisemartinPackages/did_multiplegt_dyn` source —
+# `R/R/did_multiplegt_dyn.R` lines 393-411 dispatch the per-path loop;
+# `did_multiplegt_by_path` is a path-classifier preprocessor only).
+# Python residualizes once on the full panel before path enumeration,
+# then disaggregates per path. **The two strategies coincide on
+# single-baseline switcher panels** (every switcher shares D_{g,1}=0)
+# because R's per-path control pool then equals the global control pool
+# — `multi_path_reversible` is built precisely for this property, so
+# per-path event-study point estimates and switcher counts must match R
+# bit-exactly on the one-observation-per-(g,t) DGP this generator
+# produces. (On panels with multiple observations per `(g, t)` cell, the
+# library's equal-cell-weighting first stage diverges from R's `N_gt`-
+# weighted first stage per the existing DID^X cell-weighting deviation
+# in `docs/methodology/REGISTRY.md` "Note (Phase 3 DID^X covariate
+# adjustment)" — that deviation is independent of the by_path lift.)
+# Per-path SE inherits the documented cross-path cohort-sharing
+# deviation from R for `path_effects`. On multi-baseline switcher panels
+# the residualization coefficients can diverge per path between Python
+# and R; the production fit emits a `UserWarning` in that configuration.
+# Single covariate keeps the scenario tight; multi-covariate is
+# exercised via internal regression tests.
+cat(" Scenario 16: multi_path_reversible_by_path_controls\n")
+d16 <- gen_reversible(n_groups = N_GOLDEN, n_periods = 10,
+                      pattern = "multi_path_reversible", seed = 116,
+                      L_max = 3)
+d16 <- add_covariate(d16, seed = 216, x_effect = 1.5)
+res16 <- did_multiplegt_dyn(
+  df = d16, outcome = "outcome", group = "group", time = "period",
+  treatment = "treatment", effects = 3, by_path = 3, controls = "X1",
+  ci_level = 95
+)
+scenarios$multi_path_reversible_by_path_controls <- list(
+  data = list(
+    group = as.numeric(d16$group),
+    period = as.numeric(d16$period),
+    treatment = as.numeric(d16$treatment),
+    outcome = as.numeric(d16$outcome),
+    X1 = as.numeric(d16$X1)
+  ),
+  params = list(pattern = "multi_path_reversible",
+                n_switcher_groups = N_GOLDEN, n_realized_groups = N_GOLDEN + 40L,
+                n_periods = 10, seed = 116, effects = 3, by_path = 3,
+                controls = "X1", ci_level = 95),
+  results = extract_dcdh_by_path(res16, n_effects = 3)
+)
+
 # ---------------------------------------------------------------------------
 # Write output
 # ---------------------------------------------------------------------------
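
The Scenario 16 comment argues that, on a single-baseline switcher panel, R's per-path subsample and a single global pass feed the same cells into the first-stage residualization. A minimal sketch of that set identity follows. It assumes a hypothetical toy panel; `GROUPS`, `pre_switch`, `global_sample`, and `per_path_sample` are illustrative names, not code from this library or from the R package, and the subset rule is a paraphrase of the comment rather than a port of the R source.

```python
# Illustrative only: a toy panel, not the library's or R's data structures.
N_PERIODS = 8

# Each group: (baseline D_{g,1}, first-switch period F_g or None, path label or None).
GROUPS = [
    (0, 3, "A"), (0, 5, "A"),          # path A switchers
    (0, 4, "B"), (0, 6, "B"),          # path B switchers
    (0, None, None), (0, None, None),  # never-switchers
]


def pre_switch(first_switch, t):
    """True while the group's treatment has not changed yet at period t."""
    return first_switch is None or t < first_switch


# Global pass: every not-yet-switched (g, t) cell of the single baseline.
global_sample = {
    (g, t)
    for g, (_, f_g, _) in enumerate(GROUPS)
    for t in range(1, N_PERIODS)
    if pre_switch(f_g, t)
}


def per_path_sample(path):
    """First-stage cells after first restricting the panel to one path."""
    baseline = next(b for b, _, p in GROUPS if p == path)
    cells = set()
    for g, (b, f_g, p) in enumerate(GROUPS):
        for t in range(1, N_PERIODS):
            # Subset rule: the path's own switchers, or yet-to-switch rows
            # of any group that shares the path's baseline.
            in_subset = p == path or (b == baseline and pre_switch(f_g, t))
            # The first stage then keeps only not-yet-switched cells.
            if in_subset and pre_switch(f_g, t):
                cells.add((g, t))
    return cells


for path in ("A", "B"):
    assert per_path_sample(path) == global_sample
```

Because the two cell sets are identical, any regression fitted on them yields identical coefficients, which is the property the golden-value scenario relies on. The sketch covers the single-baseline case only and makes no claim about multi-baseline panels.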

benchmarks/data/dcdh_dynr_golden_values.json

Lines changed: 114 additions & 0 deletions
Large diffs are not rendered by default.

diff_diff/chaisemartin_dhaultfoeuille.py

Lines changed: 95 additions & 12 deletions
@@ -408,10 +408,36 @@ class ChaisemartinDHaultfoeuille(ChaisemartinDHaultfoeuilleBootstrapMixin):
         the object of interest) and ``L_max >= 1`` (the path window
         depends on ``L_max``). Binary treatment only — non-binary
         treatment + ``by_path`` is deferred. Also incompatible with
-        ``controls``, ``trends_linear``, ``trends_nonparam``,
-        ``heterogeneity``, ``design2``, ``honest_did``, and
-        ``survey_design`` (each combination raises
-        ``NotImplementedError`` in the current release).
+        ``trends_linear``, ``trends_nonparam``, ``heterogeneity``,
+        ``design2``, ``honest_did``, and ``survey_design`` (each
+        combination raises ``NotImplementedError`` in the current
+        release).
+
+        Compatible with ``controls`` (DID^X residualization) -- the
+        per-baseline OLS residualization runs once on first-differenced
+        ``Y`` BEFORE path enumeration, so per-path point estimates,
+        bootstrap SE, per-path placebos, and per-path sup-t bands all
+        consume the residualized ``Y_mat`` automatically (Frisch-
+        Waugh-Lovell). Per-period effects remain unadjusted, consistent
+        with the existing ``controls`` + per-period DID contract.
+
+        **Deviation from R on multi-baseline switcher panels:** R
+        ``did_multiplegt_dyn(..., by_path, controls)`` re-runs the
+        per-baseline residualization on each path's restricted
+        subsample (path's switchers + same-baseline not-yet-treated
+        controls), so its residualization coefficients vary per path
+        when switchers have different baseline values. Our global-
+        residualization architecture coincides with R on single-
+        baseline panels (every switcher shares the same ``D_{g,1}``)
+        and per-path point estimates match exactly on the one-
+        observation-per-``(g, t)`` regime; on multi-observation-per-
+        cell panels the existing DID^X cell-weighting deviation from
+        R applies (see ``docs/methodology/REGISTRY.md`` "Note (Phase
+        3 DID^X covariate adjustment)"; independent of the by_path
+        lift). On multi-baseline switcher panels, point estimates can
+        diverge — a ``UserWarning`` is emitted at fit-time when this
+        configuration is detected. SE inherits the cross-path cohort-
+        sharing deviation from R documented for ``path_effects``.
 
         Compatible with ``n_bootstrap > 0`` -- the top-k paths are
         enumerated once on the observed data (paths held fixed across
@@ -985,11 +1011,6 @@ def fit(
                     "[F_g - 1, F_g - 1 + L_max] and therefore depends on "
                     "the event-study horizon. Set L_max when calling fit()."
                 )
-            if controls is not None:
-                raise NotImplementedError(
-                    "by_path combined with controls (DID^X residualization) "
-                    "is deferred to a future release."
-                )
             if trends_linear:
                 raise NotImplementedError(
                     "by_path combined with trends_linear (DID^{fd}) is "
@@ -1450,9 +1471,14 @@ def fit(
         #
         # When controls are specified, residualize Y_mat by partialling
         # out covariate effects per baseline treatment group. This
-        # transforms Y_mat in-place so ALL downstream DID computations
-        # (per-period and per-group multi-horizon) automatically produce
-        # covariate-adjusted estimates. See Web Appendix Section 1.2.
+        # transforms Y_mat so the per-group multi-horizon DID path
+        # (event_study_effects, overall_att, joiners/leavers, by_path
+        # surfaces, placebos, sup-t bands) automatically produces
+        # covariate-adjusted estimates. The per-period DID path
+        # (per_period_effects) intentionally remains on raw outcomes —
+        # it uses binary joiner/leaver categorization and is not part
+        # of the DID^X contract per REGISTRY.md "Note (Phase 3 DID^X
+        # covariate adjustment)". See Web Appendix Section 1.2.
         # ------------------------------------------------------------------
         covariate_diagnostics: Optional[Dict[str, Any]] = None
         _switch_metadata_computed = False
@@ -1473,6 +1499,63 @@ def fit(
             )
             _switch_metadata_computed = True
 
+            # by_path + controls residualization-sample deviation from R.
+            # R's `did_multiplegt_dyn(..., by_path, controls)` calls
+            # `did_multiplegt_main()` once per path with `df_main` filtered
+            # to: rows of the path's switchers OR rows where
+            # `yet_to_switch=1 AND baseline matches the path's baseline`
+            # (R/R/did_multiplegt_dyn.R lines 401-405). Inside the per-path
+            # `did_multiplegt_main()` call, the per-baseline first-stage
+            # residualization regression uses `(g, t)` cells where g's
+            # treatment hasn't changed yet at t. Critically, R's path-
+            # restricted subset INCLUDES the pre-switch rows of OTHER-path
+            # switchers via the `yet_to_switch=1 AND baseline matches`
+            # clause, so the first-stage SAMPLE that R uses for path B
+            # equals: pre-switch rows of all switchers with matching
+            # baseline + all rows of never-switchers with matching
+            # baseline. This is BIT-IDENTICAL to the first-stage sample
+            # we use under our global residualization — first-stage
+            # coefficients (and therefore residualized outcomes) coincide,
+            # and per-path point estimates match R exactly **under single-
+            # baseline switcher panels** (every switcher has the same
+            # `D_{g,1}`, regardless of how `F_g` varies across paths or
+            # within a path). Empirical confirmation: the
+            # `multi_path_reversible_by_path_controls` R-parity scenario
+            # has 4 paths with switcher `F_g` values spanning [0..6] under
+            # `D_{g,1}=0` for every switcher, and Python matches R to
+            # rtol ~1e-11 across all `(path, horizon)` cells.
+            #
+            # On MULTI-baseline switcher panels the per-baseline regression
+            # coefficients diverge per path under R (R's per-path subset
+            # for path B drops switchers whose baseline differs from B's
+            # baseline), so point estimates can diverge between Python and
+            # R — warn the user explicitly. The check filters to switcher
+            # groups only (never-switchers do not contribute to "switcher
+            # baseline" multiplicity even if they appear at multiple
+            # `D_{g,1}` values across the never-treated / always-treated
+            # control mix). SE inheritance (cross-path cohort-sharing) is
+            # documented separately in REGISTRY.md.
+            if self.by_path is not None:
+                _switcher_mask = first_switch_idx_arr >= 0
+                if _switcher_mask.any():
+                    _switcher_baselines = baselines[_switcher_mask]
+                    if np.unique(_switcher_baselines).size > 1:
+                        warnings.warn(
+                            "by_path + controls: switcher baselines D_{g,1} "
+                            "take multiple values in this panel. Python "
+                            "residualizes once on the full panel before path "
+                            "enumeration; R `did_multiplegt_dyn(..., by_path, "
+                            "controls)` re-runs residualization per path on "
+                            "the path-restricted subsample, so per-path point "
+                            "estimates can diverge between Python and R on "
+                            "this panel. See `docs/methodology/REGISTRY.md` "
+                            "(`Note (Phase 3 by_path ...)` -> Per-path "
+                            "covariate residualization) for the full "
+                            "deviation contract.",
+                            UserWarning,
+                            stacklevel=2,
+                        )
+
             Y_mat_residualized, covariate_diagnostics, _failed_baselines = (
                 _compute_covariate_residualization(
                     Y_mat=Y_mat,
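
The multi-baseline guard added in this hunk can be read in isolation as a small predicate over two per-group arrays. The sketch below restates it outside `fit()`. It assumes that never-switchers carry a negative first-switch index, which the `>= 0` mask in the diff implies but does not state; `has_multi_baseline_switchers` is a hypothetical name and the warning text is abbreviated.

```python
import warnings

import numpy as np


def has_multi_baseline_switchers(baselines, first_switch_idx):
    """Warn and return True when switchers start from more than one D_{g,1}.

    Assumption: first_switch_idx < 0 marks a group that never switches.
    """
    baselines = np.asarray(baselines)
    switcher_mask = np.asarray(first_switch_idx) >= 0
    if not switcher_mask.any():
        return False
    # Never-switchers are excluded, so a mixed never-treated / always-treated
    # control pool does not trigger the warning on its own.
    if np.unique(baselines[switcher_mask]).size > 1:
        warnings.warn(
            "by_path + controls: switcher baselines take multiple values; "
            "per-path point estimates can diverge from R.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


# Switchers all start at 0; the baseline-1 group never switches: no warning.
single = has_multi_baseline_switchers([0, 0, 1], [2, 3, -1])

# Two switchers with different baselines: the warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    multi = has_multi_baseline_switchers([0, 1, 0], [2, 3, -1])
```

A caller who wants to treat the divergence as an error during R-parity checks could wrap the fit in `warnings.catch_warnings()` with `warnings.simplefilter("error", UserWarning)`, which is standard-library behaviour rather than anything specific to this package.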
