igerber
diff --git a/BRIEFING.md b/BRIEFING.md
Lines changed: 59 additions & 83 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 1 addition & 0 deletions
diff --git a/ROADMAP.md b/ROADMAP.md
Lines changed: 1 addition & 1 deletion
@@ -1,83 +1,59 @@
-# SDID Practitioner Validation Tooling - Briefing
-
-## Problem
-
-A data scientist runs `SyntheticDiD`, gets an ATT and a p-value, and then
-faces the question: *should I trust this estimate?* The library gives them the
-point estimate and inference, but the validation workflow - the steps between
-"I got a number" and "I'm confident enough to present this" - is largely
-left to the practitioner to assemble from scratch.
-
-The standard validation workflow for synthetic control methods is well
-understood in the econometrics literature (Arkhangelsky et al. 2021,
-Abadie et al. 2010, Abadie 2021). The pieces include pre-treatment fit
-assessment, weight diagnostics, placebo/falsification tests, sensitivity
-analysis, and cross-estimator comparison. Our library provides some of the
-raw ingredients (pre-treatment RMSE, weight dicts, placebo effects array)
-but doesn't connect them into an accessible diagnostic workflow.
-
-The gap is most visible in `practitioner.py`, where `_handle_synthetic`
-recommends in-time placebos and leave-one-out analysis but provides only
-comment-only pseudo-code. A practitioner following that guidance hits a wall.
-
-## Current state
-
-What we have today:
-
-- `results.pre_treatment_fit` (RMSE) with a warning when it exceeds the
-  treated pre-period SD
-- `results.get_unit_weights_df()` and `results.get_time_weights_df()`
-- Three variance methods: placebo (default), bootstrap, and jackknife (just
-  landed in v3.1.1)
-- `results.placebo_effects` - stores per-iteration estimates for all three
-  variance methods, but for jackknife these are positional LOO estimates
-  with no unit labels
-- `results.summary()` shows top-5 unit weights and count of non-trivial weights
-- `practitioner.py` guidance that names the right steps but can't point to
-  runnable code for most of them
-
-What the practitioner must currently build themselves:
-
-- Mapping jackknife LOO estimates back to unit identities to answer "which
-  unit, when dropped, changes my estimate the most?"
-- In-time placebo tests (re-estimate with a fake treatment date)
-- Any weight concentration metric beyond eyeballing the sorted list
-- Any sense of whether their RMSE is "bad enough to worry about" beyond
-  the binary warning
-- Regularization sensitivity (does the ATT change if I perturb zeta?)
-- Pre-treatment trajectory data for plotting (the Y matrices are internal
-  to `fit()` and not returned)
-
-## Context from prior discussion
-
-The jackknife work created an interesting opportunity. The delete-one-re-estimate
-loop already runs for SE computation. The per-unit ATT estimates are stored in
-`results.placebo_effects`. The missing piece is a presentation layer that maps
-those estimates to unit identities and surfaces the diagnostic interpretation
-(which units are influential, how stable is the estimate to unit composition).
-
-More broadly, the validation gaps fall into two categories:
-
-1. **Low-marginal-cost additions** - things where the computation already
-   exists and we just need to expose or label it (LOO diagnostic from
-   jackknife, weight concentration metrics, trajectory data extraction)
-
-2. **New functionality** - things that require new estimation loops or
-   helpers (in-time placebo, regularization sensitivity sweep)
-
-The practitioner guidance in `practitioner.py` should evolve alongside any
-new tooling so that the recommended steps point to real, runnable code paths.
-
-## What "done" looks like
-
-A practitioner using SyntheticDiD should be able to follow a credible
-validation workflow using library-provided tools and guidance, without
-needing to reverse-engineer internals or write substantial boilerplate.
-The validation steps recognized in the literature should either be directly
-supported or have clear, concrete guidance for how to perform them with
-the library's API.
-
-This is not about adding visualization or plotting (that's a separate
-concern). It's about making the computational and diagnostic building
-blocks accessible and well-documented through the results API and
-practitioner guidance.
+# dcdh-by-path — Briefing
+
+## The ask
+
+Clément de Chaisemartin (dCDH author) suggested implementing the `by_path`
+option from R's `did_multiplegt_dyn`. It disaggregates the dynamic event-study
+by observed treatment trajectory so practitioners can compare paths like:
+
+- `(0,1,0,0)`: one pulse
+- `(0,1,1,0)`: two periods on, then off
+- `(0,1,1,1)`: three periods on, never switched off (sustained exposure)
+- `(0,1,0,1)` vs `(0,1,1,0)`: sequencing
+
+Use case: "is a single pulse enough, or do you need sustained exposure?"
+
+## Where we stand today
+
+`diff_diff/chaisemartin_dhaultfoeuille.py` implements `ChaisemartinDHaultfoeuille`.
+
+- Supports reversible on/off treatments (the only estimator in the library
+  that does)
+- **Currently drops multi-switch groups by default** (`drop_larger_lower=True`) —
+  exactly the groups `by_path` wants to keep and compare
+- Stratifies by direction cohort (`DID_+`, `DID_-`, `S_g = sign(Δ)`) but not
+  by trajectory
+- No `by_path`, `treatment_path`, or path-enumeration code exists anywhere
+- Not on ROADMAP.md; not in TODO.md
+
+## Shape of the work
+
+1. Parameter: likely `by_path: bool = False` (implies `drop_larger_lower=False`)
+2. Enumerate unique treatment histories `(D_{g,1}, …, D_{g,T})` per group;
+   optionally accept a user-specified subset of paths of interest
+3. Per-path `DID_{g,l}` aggregation with influence-function SEs per path
+4. Result container extension: `path_effects` dict keyed by trajectory tuple,
+   each holding ATT + SE + CI vectors
+5. Decide interaction with `drop_larger_lower`: probably forbid both being
+   non-default simultaneously, or have `by_path` override
+6. REGISTRY.md section on path-heterogeneity methodology + deviation notes
+7. Methodology reference: the `by_path` section of the `did_multiplegt_dyn`
+   manual; dCDH dynamic paper for the `DID_{g,l}` building block (already
+   cited in REGISTRY)
+
+## Open methodology questions (for plan mode)
+
+- Which paths are enumerable? All observed, or user-specified subset only?
+  R's default behavior on cardinality control is worth checking.
+- How does path stratification interact with the current cohort pooling
+  `(D_{g,1}, F_g, S_g)` used for variance recentering — does it still apply
+  per path?
+- Placebo and TWFE diagnostics: compute per-path or overall only?
+- Bootstrap interaction: per-path bootstrap blocks vs single bootstrap with
+  per-path aggregation
+
+## Before starting
+
+- Pull the R manual section on `by_path` for `did_multiplegt_dyn` — the option
+  spec there is load-bearing; don't infer from usage examples alone
+- Methodology changes: consult `docs/methodology/REGISTRY.md` first
+- New estimator surface → budget ~12-20 CI review rounds
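
Step 2 of "Shape of the work" (enumerating unique treatment histories per group, with an optional user-specified subset) can be illustrated with a minimal sketch. This is not the library's implementation: `enumerate_paths`, its `records` input of `(group, period, treatment)` triples, and the `paths_of_interest` argument are hypothetical names chosen for illustration, and a balanced panel is assumed.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Set, Tuple

Path = Tuple[int, ...]


def enumerate_paths(
    records: Iterable[Tuple[Hashable, int, int]],
    paths_of_interest: Optional[Set[Path]] = None,
) -> Dict[Path, List[Hashable]]:
    """Group units by their full observed treatment history (hypothetical helper).

    records: (group, period, treatment) triples from a balanced long panel.
    paths_of_interest: optional whitelist of trajectories to keep.
    """
    # Collect each group's treatment value per period.
    by_group: Dict[Hashable, Dict[int, int]] = defaultdict(dict)
    for group, period, treatment in records:
        by_group[group][period] = treatment

    # Order each history by period and bucket groups by the resulting tuple.
    groups_by_path: Dict[Path, List[Hashable]] = defaultdict(list)
    for group, history in by_group.items():
        path = tuple(history[t] for t in sorted(history))
        if paths_of_interest is None or path in paths_of_interest:
            groups_by_path[path].append(group)
    return dict(groups_by_path)


# Toy panel: four groups observed over four periods.
panel = {
    "a": (0, 1, 0, 0),
    "b": (0, 1, 1, 0),
    "c": (0, 1, 1, 1),
    "d": (0, 1, 1, 0),
}
records = [(g, t, d) for g, path in panel.items() for t, d in enumerate(path, start=1)]
paths = enumerate_paths(records)
print(paths[(0, 1, 1, 0)])  # ['b', 'd']
```

Keying the result by trajectory tuple matches the shape proposed for `path_effects` in step 4, so per-path estimates could later be attached to the same keys.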
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added
 - **`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test` + `StuteJointResult`** (HeterogeneousAdoptionDiD Phase 3 follow-up). Joint Cramér-von Mises pretests across K horizons with shared-η Mammen wild bootstrap (preserves vector-valued empirical-process unit-level dependence per Delgado-Manteiga 2001 / Hlávka-Hušková 2020). The core `stute_joint_pretest` is residuals-in; two thin data-in wrappers construct per-horizon residuals for the two nulls the paper spells out: mean-independence (step 2 pre-trends, `OLS(Y_t − Y_base ~ 1)` per pre-period) and linearity (step 3 joint, `OLS(Y_t − Y_base ~ 1 + D)` per post-period). Sum-of-CvMs aggregation (`S_joint = Σ_k S_k`); per-horizon scale-invariant exact-linear short-circuit. Closes the paper Section 4.2 step-2 gap that Phase 3 `did_had_pretest_workflow` previously flagged with an "Assumption 7 pre-trends test NOT run" caveat. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Joint Stute tests" for algorithm, invariants, and scope exclusion of Eq 18 linear-trend detrending (deferred to Phase 4 Pierce-Schott replication).
 - **`did_had_pretest_workflow(aggregate="event_study")`**: multi-period dispatch on balanced ≥3-period panels. Runs QUG at `F` + joint pre-trends Stute across earlier pre-periods + joint homogeneity-linearity Stute across post-periods. Step 2 closure requires ≥2 pre-periods; with only a single pre-period (the base `F-1`) `pretrends_joint=None` and the verdict flags the skip. Reuses the Phase 2b event-study panel validator (last-cohort auto-filter under staggered timing with `UserWarning`; `ValueError` when `first_treat_col=None` and the panel is staggered). The data-in wrappers `joint_pretrends_test` and `joint_homogeneity_test` also route through that same validator internally, so direct wrapper calls inherit the last-cohort filter and constant-post-dose invariant. `HADPretestReport` extended with `pretrends_joint`, `homogeneity_joint`, and `aggregate` fields; serialization methods (`summary`, `to_dict`, `to_dataframe`, `__repr__`) preserve the Phase 3 output bit-exactly on `aggregate="overall"` — no `aggregate` key, no header row, no schema drift — and only surface the new fields on `aggregate="event_study"`.
+- **`ChaisemartinDHaultfoeuille.by_path`** — per-path event-study disaggregation, mirroring R `did_multiplegt_dyn(..., by_path=k)`. Passing `by_path=k` (positive int) to the estimator reports separate `DID_{path,l}` + SE + inference for the top-k most common observed treatment paths in the window `[F_g-1, F_g-1+L_max]`, answering the practitioner question "is a single pulse enough, or do you need sustained exposure?" across paths like `(0,1,0,0)` vs `(0,1,1,0)` vs `(0,1,1,1)`. The per-path SE follows the joiners-only / leavers-only IF precedent (switcher-side contribution zeroed for non-path groups; control pool and cohort structure unchanged; plug-in SE with path-specific divisor). Requires `drop_larger_lower=False` (multi-switch groups are the object of interest) and `L_max >= 1`. Binary treatment only in this release; combinations with `controls`, `trends_linear`, `trends_nonparam`, `heterogeneity`, `design2`, `honest_did`, `survey_design`, and `n_bootstrap > 0` raise `NotImplementedError` and are deferred to follow-up PRs. Results expose `results.path_effects: Dict[Tuple[int, ...], Dict[str, Any]]` and `results.to_dataframe(level="by_path")`; the summary grows a "Treatment-Path Disaggregation" block. Ties in path frequency are broken lexicographically on the path tuple for deterministic ranking. Overflow (`by_path > n_observed_paths`) returns all observed paths with a `UserWarning`. See `docs/methodology/REGISTRY.md` §ChaisemartinDHaultfoeuille `Note (Phase 3 by_path per-path event-study disaggregation)` for the full contract.
 - **`target_parameter` block in BR/DR schemas (experimental; schema version bumped to 2.0)** — `BUSINESS_REPORT_SCHEMA_VERSION` and `DIAGNOSTIC_REPORT_SCHEMA_VERSION` bumped from `"1.0"` to `"2.0"` because the new `"no_scalar_by_design"` value on the `headline.status` / `headline_metric.status` enum (dCDH `trends_linear=True, L_max>=2` configuration) is a breaking change per the REPORTING.md stability policy. BusinessReport and DiagnosticReport now emit a top-level `target_parameter` block naming what the headline scalar actually represents for each of the 16 result classes. Closes BR/DR foundation gap #6 (target-parameter clarity). Fields: `name`, `definition`, `aggregation` (machine-readable dispatch tag), `headline_attribute` (raw result attribute), `reference` (citation pointer). BR's summary emits the short `name` right after the headline; DR's overall-interpretation paragraph does the same; both full reports carry a "## Target Parameter" section with the full definition. Per-estimator dispatch is sourced from REGISTRY.md and lives in the new `diff_diff/_reporting_helpers.py::describe_target_parameter`. A few branches read fit-time config (`EfficientDiDResults.pt_assumption`, `StackedDiDResults.clean_control`, `ChaisemartinDHaultfoeuilleResults.L_max` / `covariate_residuals` / `linear_trends_effects`); others emit a fixed tag (the fit-time `aggregate` kwarg on CS / Imputation / TwoStage / Wooldridge does not change the `overall_att` scalar — disambiguating horizon / group tables is tracked under gap #9). See `docs/methodology/REPORTING.md` "Target parameter" section.
 - SyntheticDiD coverage Monte Carlo calibration table added to `docs/methodology/REGISTRY.md` §SyntheticDiD — rejection rates at α ∈ {0.01, 0.05, 0.10} across `placebo` / `bootstrap` / `jackknife` on 3 representative DGPs (balanced / exchangeable, unbalanced, and Arkhangelsky et al. (2021) AER §6.3 non-exchangeable). Artifact at `benchmarks/data/sdid_coverage.json` (500 seeds × B=200), regenerable via `benchmarks/python/coverage_sdid.py`.
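
The path-selection rule in the `by_path` entry above (top-k most common paths in the window `[F_g-1, F_g-1+L_max]`, lexicographic tie-breaking, `UserWarning` on overflow) can be sketched roughly as follows. This is an illustrative reading of the changelog text, not the estimator's code: `first_switch` and `top_k_paths` are hypothetical helpers, ascending lexicographic order is assumed for ties, and groups that are not observed over the full window are assumed to be skipped.

```python
import warnings
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

Path = Tuple[int, ...]


def first_switch(history: Sequence[int]) -> Optional[int]:
    """0-based index of the first period whose treatment differs from the
    previous period (F_g), or None for never-switchers."""
    for t in range(1, len(history)):
        if history[t] != history[t - 1]:
            return t
    return None


def top_k_paths(
    histories: Dict[str, Sequence[int]], k: int, l_max: int
) -> List[Tuple[Path, int]]:
    """Rank windowed treatment paths by frequency and keep the top k."""
    if k < 1 or l_max < 1:
        raise ValueError("k and l_max must be positive integers")
    counts: Counter = Counter()
    for history in histories.values():
        f = first_switch(history)
        if f is None:
            continue  # never-switchers define no path
        window = tuple(history[f - 1 : f + l_max])  # periods F_g-1 .. F_g-1+L_max
        if len(window) < l_max + 1:
            continue  # assumption: skip groups not observed over the full window
        counts[window] += 1
    if k > len(counts):
        warnings.warn(
            f"by_path={k} exceeds {len(counts)} observed paths; returning all",
            UserWarning,
        )
    # Most frequent first; ties broken lexicographically on the path tuple.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


histories = {
    "a": (0, 0, 1, 0, 0),
    "b": (0, 0, 1, 1, 0),
    "c": (0, 1, 1, 0, 0),
    "d": (0, 0, 1, 1, 1),
    "e": (0, 0, 0, 0, 0),  # never switches
    "f": (0, 0, 0, 0, 1),  # switches too late for a full window
}
print(top_k_paths(histories, k=2, l_max=3))
# [((0, 1, 1, 0), 2), ((0, 1, 0, 0), 1)]
```

Sorting on `(-count, path)` makes the ranking deterministic: `(0,1,0,0)` and `(0,1,1,1)` both occur once here, and the lexicographically smaller tuple takes the second slot.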
 
 
@@ -58,7 +58,7 @@ See [Survey Design Support](docs/choosing_estimator.rst#survey-design-support) f
 Major landings since the prior roadmap revision. See [CHANGELOG.md](CHANGELOG.md) for the full history.
 
 - **`BusinessReport` and `DiagnosticReport`** - practitioner-ready output layer. Plain-English stakeholder summaries + unified diagnostic runner with a stable AI-legible `to_dict()` schema. `BusinessReport` auto-constructs `DiagnosticReport` by default so summaries mention pre-trends, robustness, and design-effect findings in one call. Estimator-native validation surfaces are routed through: SyntheticDiD uses `pre_treatment_fit` / `in_time_placebo` / `sensitivity_to_zeta_omega`; EfficientDiD uses its native `hausman_pretest`; TROP exposes factor-model fit metrics. See `docs/methodology/REPORTING.md` for methodology deviations including no-traffic-light gates, pre-trends verdict thresholds, and power-aware phrasing.
-- **ChaisemartinDHaultfoeuille (dCDH)** - full feature set: `DID_M` contemporaneous-switch, multi-horizon `DID_l` event study, analytical SE, multiplier bootstrap, TWFE decomposition diagnostic, dynamic placebos, normalized estimator, cost-benefit aggregate, sup-t bands, covariate adjustment (`DID^X`), group-specific linear trends (`DID^{fd}`), state-set-specific trends, heterogeneity testing, non-binary treatment, HonestDiD integration, and survey support (TSL + pweight).
+- **ChaisemartinDHaultfoeuille (dCDH)** - full feature set: `DID_M` contemporaneous-switch, multi-horizon `DID_l` event study, analytical SE, multiplier bootstrap, TWFE decomposition diagnostic, dynamic placebos, normalized estimator, cost-benefit aggregate, sup-t bands, covariate adjustment (`DID^X`), group-specific linear trends (`DID^{fd}`), state-set-specific trends, heterogeneity testing, non-binary treatment, HonestDiD integration, survey support (TSL + pweight), and per-path event-study disaggregation via `by_path=k` (mirrors R `did_multiplegt_dyn(..., by_path=k)`).
 - **SyntheticDiD jackknife variance** (`variance_method='jackknife'`) with survey-weighted jackknife.
 - **SyntheticDiD validation diagnostics**.
 - **Survey support completion** - all 16 estimators accept `survey_design`; `aggregate_survey()` microdata-to-panel bridge with `second_stage_weights` parameter; `conditional_pt` DGP parameter for conditional-PT scenarios.