Sync scenario-spec doc to script phase lists (P2 cleanup)

igerber · claude · igerber · commit bf9b9d7a0f97 · 2026-04-19T13:37:14.000-04:00
CI review P2: performance-scenarios.md had four drift points where the
documented operation chain did not match what the scripts actually time.
Fixed each so the doc is a faithful spec that a reviewer can
cross-check against the scripts:

- BRFSS small scale: "single year" -> "narrow analytic slice on a
  state-year grid" (all scales use n_years=10).
- Scenario 4 (SDiD): removed the seventh plot_synth_weights step the
  script never times; chain is now 6 steps, matching the script.
- Scenario 5 (dCDH): replaced "results.print_summary()" with the
  actual attribute snapshot the script performs (placebo_effect,
  overall_att, joiners_att, leavers_att); chain is now 4 steps.
- Scenario 6 (dose-response): event-study step is no longer described
  as to_dataframe(level="event_study") on a dose-only fit (that API
  path raises because aggregate="dose" does not populate event_study);
  it is now described as a second CDiD fit with aggregate="eventstudy",
  matching the separate phase the script times.

The within-estimator API-spelling inconsistency that surfaced during
this cleanup (ContinuousDiD uses "eventstudy" on fit(aggregate=...) but
"event_study" on to_dataframe(level=...)) is captured in the
correctness-adjacent observations in performance-plan.md.
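
For reviewers who want the failure mode in one place, here is a minimal
sketch of the behaviour described above. It uses a hypothetical stand-in
class (MockContinuousDiDResults), not the real diff_diff API; only the
two spellings and the ValueError are taken from this commit, and the
table contents are placeholders.

```python
class MockContinuousDiDResults:
    """Hypothetical stand-in for a ContinuousDiD result; not library code."""

    def __init__(self, aggregate):
        # Assumption: a dose-only fit populates only the dose-response table.
        self._tables = {"dose_response": [("dose", "att")]}
        if aggregate == "eventstudy":  # fit() spelling: no underscore
            self._tables["event_study"] = [("rel_period", "att")]

    def to_dataframe(self, level):
        # to_dataframe() spelling: with underscore
        if level not in self._tables:
            raise ValueError(f"level={level!r} is not populated by this fit")
        return self._tables[level]


def run_chain():
    # Step 1-2: dose-only fit, then extract the dose-response table.
    dose_fit = MockContinuousDiDResults(aggregate="dose")
    dose_curve = dose_fit.to_dataframe(level="dose_response")

    # The old doc wording implied this call works; in the mock it raises.
    try:
        dose_fit.to_dataframe(level="event_study")
        raised = False
    except ValueError:
        raised = True

    # Step 3: a second fit with aggregate="eventstudy" populates the table.
    es_fit = MockContinuousDiDResults(aggregate="eventstudy")
    event_study = es_fit.to_dataframe(level="event_study")
    return dose_curve, raised, event_study
```

Calling run_chain() on the mock shows why the spec now lists the
event-study step as a separate fit rather than a second extraction.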

No changes under diff_diff/, rust/, scripts, or baselines. Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
@@ -220,10 +220,13 @@ These are developer-ergonomics / API-consistency smells surfaced during
 scenario development. None are silent-failures and none belong in this PR
 or in the silent-failures audit; logging here for awareness.
 
-1. **`aggregate` parameter naming.** CS accepts `aggregate="event_study"`;
-   ContinuousDiD requires `aggregate="eventstudy"` (no underscore). Both
-   estimators expose the same conceptual aggregation but different
-   spellings. Route: API-consistency cleanup, minor.
+1. **`aggregate` / `level` parameter naming is inconsistent.** CS accepts
+   `aggregate="event_study"`; ContinuousDiD requires
+   `aggregate="eventstudy"` on `fit()` **but** `level="event_study"` on
+   `to_dataframe()`. Two spellings within one estimator, and the `fit()`
+   spelling also differs from CS. Surfaced when the P1 exit-propagation
+   fix stopped silently swallowing the resulting `ValueError` in the
+   dose-response benchmark. Route: API-consistency cleanup, minor.
 2. **`generate_survey_did_data(panel=True)` `treated` column.** Row-level
    active-treatment indicator that is zero in pre-periods, which makes it
    quietly incompatible with `check_parallel_trends` (expects unit-level
diff --git a/docs/performance-scenarios.md b/docs/performance-scenarios.md
@@ -173,10 +173,10 @@ serves a different purpose: R-parity accuracy). They complement it.
   to a state-year panel, then a modern staggered estimator.
 - **Data shape (scale sweep).** 50 states × 10 years × N respondents per
   state-year cell, 5 adoption cohorts staggered over the window. Three scales:
-    - **small** - 50,000 rows (100/cell, 10 strata × 200 PSUs). Substate
-      analytic slice of a single year.
-    - **medium** - 250,000 rows (500/cell, 15 strata × 600 PSUs). Pooled
-      substate analytic slice across multiple years.
+    - **small** - 50,000 rows (100/cell, 10 strata × 200 PSUs). Narrow
+      analytic slice on a state-year grid.
+    - **medium** - 250,000 rows (500/cell, 15 strata × 600 PSUs).
+      Mid-range analytic slice on the same state-year grid.
     - **large** - 1,000,000 rows (2,000/cell, 20 strata × 1,000 PSUs).
       A realistic pooled 10-year multi-state analysis - comparable to the
       kind of panel built from BRFSS 2024's ~458K-record universe filtered
@@ -229,13 +229,12 @@ serves a different purpose: R-parity accuracy). They complement it.
   # then also variance_method="bootstrap", n_bootstrap=200 for comparison
   ```
 - **Operation chain.** (1) SDiD fit with `variance_method="jackknife"` -
-  exercises the leave-one-out refit loop (80 full refits); (2) SDiD fit
-  with `variance_method="bootstrap"`, `n_bootstrap=200` for SE comparison;
+  exercises the leave-one-out refit loop; (2) SDiD fit with
+  `variance_method="bootstrap"`, `n_bootstrap=200` for SE comparison;
   (3) `results.in_time_placebo()`; (4) `results.get_loo_effects_df()`;
   (5) `results.sensitivity_to_zeta_omega()`; (6)
-  `results.get_weight_concentration()`; (7) `plot_synth_weights()` equivalent
-  (data extraction via `results.get_unit_weights_df()`). The jackknife loop
-  is the primary time sink; `sensitivity_to_zeta_omega` also refits.
+  `results.get_weight_concentration()`. The jackknife loop is the primary
+  time sink; `sensitivity_to_zeta_omega` also refits.
 - **Source anchor.** `docs/tutorials/18_geo_experiments.ipynb`,
   Arkhangelsky et al. (2021), Mercado Libre geo-experiment writeup
   (medium.com/mercadolibre-tech), Meta GeoLift methodology docs
@@ -261,12 +260,12 @@ serves a different purpose: R-parity accuracy). They complement it.
   )
   ```
 - **Operation chain.** (1) dCDH fit with `L_max=3` (computes `DID_l` for
-  l=1..3, dynamic placebos, sup-t bands, TWFE diagnostic); (2) inspect
-  `placebo_effect` and dynamic placebos for pre-trend evidence;
-  (3) `results.print_summary()`; (4) `compute_honest_did()` on the placebo
-  event study; (5) heterogeneity refit with `heterogeneity="group"`.
-  The TSL path for `L_max >= 1` is newer code (v3.1) and has not been
-  profiled.
+  l=1..3, dynamic placebos, sup-t bands, TWFE diagnostic); (2) snapshot
+  `placebo_effect`, `overall_att`, `joiners_att`, `leavers_att` from the
+  result object for pre-trend evidence and joiner/leaver inspection;
+  (3) `compute_honest_did()` M-grid on the placebo event study;
+  (4) heterogeneity refit with `heterogeneity="group"`. The TSL path for
+  `L_max >= 1` is newer code (v3.1) and has not been profiled.
 - **Source anchor.** `docs/practitioner_decision_tree.rst`
   ("Reversible Treatment (On/Off Cycles)"), de Chaisemartin & D'Haultfoeuille
   (2020), NBER WP 29873 (dynamic companion), R package
@@ -290,12 +289,17 @@ serves a different purpose: R-parity accuracy). They complement it.
   )
   ```
 - **Operation chain.** (1) CDiD fit with `aggregate="dose"` - produces
-  overall ATT, overall ACRT, and the dose-response curves; (2)
-  `results.to_dataframe(level="dose_response")`; (3)
-  `results.to_dataframe(level="event_study")` for pre-trend diagnostics;
+  overall ATT, overall ACRT, and the dose-response curves; (2) extract
+  `results.to_dataframe(level="dose_response")` and
+  `level="group_time"` (event-study is not populated by a dose-only
+  fit, so it is extracted in a separate step); (3) a second CDiD fit
+  with `aggregate="eventstudy"` for pre-trend diagnostics (note the
+  spelling: `fit(aggregate="eventstudy")` with no underscore, but
+  `to_dataframe(level="event_study")` with underscore - see the
+  correctness-adjacent observations in `performance-plan.md`);
   (4) compare to a binarized DiD fit on the same data to quantify
-  the information loss from binarizing; (5) alternate `degree=1`
-  (linear) and `num_knots=2` refits for spline-sensitivity. The dose-curve
+  information loss from binarizing; (5) alternate `degree=1` (linear)
+  and (6) `num_knots=2` refits for spline-sensitivity. The dose-curve
   bootstrap loop (199 reps x spline refit) is the primary time sink.
 - **Source anchor.** `docs/tutorials/14_continuous_did.ipynb`,
   Callaway, Goodman-Bacon & Sant'Anna (2024), `docs/methodology/REGISTRY.md`