Skip to content

Commit bf9b9d7

Browse files
igerberclaude
andcommitted
Sync scenario-spec doc to script phase lists (P2 cleanup)
CI review P2: performance-scenarios.md had four drift points where the documented operation chain did not match what the scripts actually time. Fixed each to be a faithful spec the reviewer can cross-check against: - BRFSS small scale: "single year" -> "narrow analytic slice on a state-year grid" (all scales use n_years=10). - Scenario 4 (SDiD): removed the seventh plot_synth_weights step the script never times; chain is now 6 steps, matching the script. - Scenario 5 (dCDH): replaced "results.print_summary()" with the actual attribute snapshot the script performs (placebo_effect, overall_att, joiners_att, leavers_att); chain is now 4 steps. - Scenario 6 (dose-response): event-study step is no longer described as to_dataframe(level="event_study") on a dose-only fit (that API path raises because aggregate="dose" does not populate event_study); it is now described as a second CDiD fit with aggregate="eventstudy", matching the separate phase the script times. The within-estimator API-spelling inconsistency that surfaced during this cleanup (ContinuousDiD uses "eventstudy" on fit(aggregate=...) but "event_study" on to_dataframe(level=...)) is captured in the correctness-adjacent observations in performance-plan.md. No changes under diff_diff/, rust/, scripts, or baselines. Docs only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 98a1f3a commit bf9b9d7

2 files changed

Lines changed: 31 additions & 24 deletions

File tree

docs/performance-plan.md

Lines changed: 7 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -220,10 +220,13 @@ These are developer-ergonomics / API-consistency smells surfaced during
220220
scenario development. None are silent-failures and none belong in this PR
221221
or in the silent-failures audit; logging here for awareness.
222222

223-
1. **`aggregate` parameter naming.** CS accepts `aggregate="event_study"`;
224-
ContinuousDiD requires `aggregate="eventstudy"` (no underscore). Both
225-
estimators expose the same conceptual aggregation but different
226-
spellings. Route: API-consistency cleanup, minor.
223+
1. **`aggregate` / `level` parameter naming is inconsistent.** CS accepts
224+
`aggregate="event_study"`; ContinuousDiD requires
225+
`aggregate="eventstudy"` on `fit()` **but** `level="event_study"` on
226+
`to_dataframe()`. Two different spellings within one estimator plus a
227+
third cross-estimator spelling. Surfaced when the P1 exit-propagation
228+
fix stopped silently swallowing the resulting `ValueError` in the
229+
dose-response benchmark. Route: API-consistency cleanup, minor.
227230
2. **`generate_survey_did_data(panel=True)` `treated` column.** Row-level
228231
active-treatment indicator that is zero in pre-periods, which makes it
229232
quietly incompatible with `check_parallel_trends` (expects unit-level

docs/performance-scenarios.md

Lines changed: 24 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -173,10 +173,10 @@ serves a different purpose: R-parity accuracy). They complement it.
173173
to a state-year panel, then a modern staggered estimator.
174174
- **Data shape (scale sweep).** 50 states × 10 years × N respondents per
175175
state-year cell, 5 adoption cohorts staggered over the window. Three scales:
176-
- **small** - 50,000 rows (100/cell, 10 strata × 200 PSUs). Substate
177-
analytic slice of a single year.
178-
- **medium** - 250,000 rows (500/cell, 15 strata × 600 PSUs). Pooled
179-
substate analytic slice across multiple years.
176+
- **small** - 50,000 rows (100/cell, 10 strata × 200 PSUs). Narrow
177+
analytic slice on a state-year grid.
178+
- **medium** - 250,000 rows (500/cell, 15 strata × 600 PSUs).
179+
Mid-range analytic slice on the same state-year grid.
180180
- **large** - 1,000,000 rows (2,000/cell, 20 strata × 1,000 PSUs).
181181
A realistic pooled 10-year multi-state analysis - comparable to the
182182
kind of panel built from BRFSS 2024's ~458K-record universe filtered
@@ -229,13 +229,12 @@ serves a different purpose: R-parity accuracy). They complement it.
229229
# then also variance_method="bootstrap", n_bootstrap=200 for comparison
230230
```
231231
- **Operation chain.** (1) SDiD fit with `variance_method="jackknife"` -
232-
exercises the leave-one-out refit loop (80 full refits); (2) SDiD fit
233-
with `variance_method="bootstrap"`, `n_bootstrap=200` for SE comparison;
232+
exercises the leave-one-out refit loop; (2) SDiD fit with
233+
`variance_method="bootstrap"`, `n_bootstrap=200` for SE comparison;
234234
(3) `results.in_time_placebo()`; (4) `results.get_loo_effects_df()`;
235235
(5) `results.sensitivity_to_zeta_omega()`; (6)
236-
`results.get_weight_concentration()`; (7) `plot_synth_weights()` equivalent
237-
(data extraction via `results.get_unit_weights_df()`). The jackknife loop
238-
is the primary time sink; `sensitivity_to_zeta_omega` also refits.
236+
`results.get_weight_concentration()`. The jackknife loop is the primary
237+
time sink; `sensitivity_to_zeta_omega` also refits.
239238
- **Source anchor.** `docs/tutorials/18_geo_experiments.ipynb`,
240239
Arkhangelsky et al. (2021), Mercado Libre geo-experiment writeup
241240
(medium.com/mercadolibre-tech), Meta GeoLift methodology docs
@@ -261,12 +260,12 @@ serves a different purpose: R-parity accuracy). They complement it.
261260
)
262261
```
263262
- **Operation chain.** (1) dCDH fit with `L_max=3` (computes `DID_l` for
264-
l=1..3, dynamic placebos, sup-t bands, TWFE diagnostic); (2) inspect
265-
`placebo_effect` and dynamic placebos for pre-trend evidence;
266-
(3) `results.print_summary()`; (4) `compute_honest_did()` on the placebo
267-
event study; (5) heterogeneity refit with `heterogeneity="group"`.
268-
The TSL path for `L_max >= 1` is newer code (v3.1) and has not been
269-
profiled.
263+
l=1..3, dynamic placebos, sup-t bands, TWFE diagnostic); (2) snapshot
264+
`placebo_effect`, `overall_att`, `joiners_att`, `leavers_att` from the
265+
result object for pre-trend evidence and joiner/leaver inspection;
266+
(3) `compute_honest_did()` M-grid on the placebo event study;
267+
(4) heterogeneity refit with `heterogeneity="group"`. The TSL path for
268+
`L_max >= 1` is newer code (v3.1) and has not been profiled.
270269
- **Source anchor.** `docs/practitioner_decision_tree.rst`
271270
("Reversible Treatment (On/Off Cycles)"), de Chaisemartin & D'Haultfoeuille
272271
(2020), NBER WP 29873 (dynamic companion), R package
@@ -290,12 +289,17 @@ serves a different purpose: R-parity accuracy). They complement it.
290289
)
291290
```
292291
- **Operation chain.** (1) CDiD fit with `aggregate="dose"` - produces
293-
overall ATT, overall ACRT, and the dose-response curves; (2)
294-
`results.to_dataframe(level="dose_response")`; (3)
295-
`results.to_dataframe(level="event_study")` for pre-trend diagnostics;
292+
overall ATT, overall ACRT, and the dose-response curves; (2) extract
293+
`results.to_dataframe(level="dose_response")` and
294+
`level="group_time"` (event-study is not populated by a dose-only
295+
fit, so it is extracted in a separate step); (3) a second CDiD fit
296+
with `aggregate="eventstudy"` for pre-trend diagnostics (note the
297+
spelling: `fit(aggregate="eventstudy")` with no underscore, but
298+
`to_dataframe(level="event_study")` with underscore - see the
299+
correctness-adjacent observations in `performance-plan.md`);
296300
(4) compare to a binarized DiD fit on the same data to quantify
297-
the information loss from binarizing; (5) alternate `degree=1`
298-
(linear) and `num_knots=2` refits for spline-sensitivity. The dose-curve
301+
information loss from binarizing; (5) alternate `degree=1` (linear)
302+
and (6) `num_knots=2` refits for spline-sensitivity. The dose-curve
299303
bootstrap loop (199 reps x spline refit) is the primary time sink.
300304
- **Source anchor.** `docs/tutorials/14_continuous_did.ipynb`,
301305
Callaway, Goodman-Bacon & Sant'Anna (2024), `docs/methodology/REGISTRY.md`

0 commit comments

Comments
 (0)