
Commit de6ce63

igerber and claude committed
Add practitioner-workflow performance baseline
Six end-to-end scenarios covering CS + 8-step chain, survey DiD, BRFSS microdata -> CS panel, SDiD few-markets, reversible dCDH, and continuous dose-response -- anchored to applied-econ paper and industry conventions rather than the 200 x 8 cookie cutter.

Each chain is timed per-phase and profiled with pyinstrument under both backends; findings and recommended actions are in docs/performance-plan.md.

Measurement only -- no changes under diff_diff/ or rust/. The decision doc identifies aggregate_survey per-cell scaffolding, ImputationDiD fit loop, and dCDH heterogeneity refit as candidates for follow-up PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 328dc33 commit de6ce63

24 files changed

Lines changed: 2172 additions & 0 deletions

benchmarks/speed_review/README.md

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
# Speed Review — Practitioner Workflow Benchmarks

Scenario-driven performance measurement for end-to-end practitioner chains,
as distinct from `benchmarks/run_benchmarks.py` which measures R-parity on
isolated `fit()` calls.

## Why these exist

See [`docs/performance-scenarios.md`](../../docs/performance-scenarios.md) for
the full methodology. Short version: the existing benchmarks measure
`fit()` in isolation on 200 x 8 synthetic panels, which does not reflect what
a practitioner running the 8-step Baker et al. (2025) workflow on a real
BRFSS or geo-experiment panel actually sees. These scripts measure the full
chain (Bacon -> fit -> HonestDiD -> cross-estimator robustness -> reporting)
at data shapes anchored to applied-econ conventions.

## Layout

```
benchmarks/speed_review/
├── README.md                          # this file
├── bench_shared.py                    # timing + pyinstrument harness
├── run_all.py                         # orchestrator (both backends)
├── bench_campaign_staggered.py        # Scenario 1: CS + 8-step chain
├── bench_brand_awareness_survey.py    # Scenario 2: DiD + SurveyDesign
├── bench_brfss_panel.py               # Scenario 3: aggregate_survey -> CS
├── bench_geo_few_markets.py           # Scenario 4: SDiD + jackknife
├── bench_reversible_dcdh.py           # Scenario 5: dCDH L_max + TSL
├── bench_dose_response.py             # Scenario 6: ContinuousDiD splines
├── bench_callaway.py                  # pre-existing CS scaling sweep
├── baseline_results.json              # pre-existing CS baseline
└── baselines/                         # this effort's output
    ├── <scenario>_<backend>.json      # phase-level wall-clock (committed)
    └── profiles/                      # flame HTMLs (gitignored)
        └── <scenario>_<backend>.html  # pyinstrument flame output
```

**Note on profile HTMLs.** pyinstrument flames are ~500KB-1.2MB each and are
regenerated on every run; they live under `baselines/profiles/` which is
gitignored. The key hotspots identified from them are already captured in
the findings doc (top-5 hot phases per scenario); run a scenario locally
to regenerate the full flame when needed.

## Running

```bash
# One-time install
pip install pyinstrument

# All scenarios, both backends
python benchmarks/speed_review/run_all.py

# One scenario, one backend
DIFF_DIFF_BACKEND=rust python benchmarks/speed_review/bench_campaign_staggered.py

# Subset
python benchmarks/speed_review/run_all.py --scenarios brfss_panel geo_few_markets
```

## Where to look for findings

[`docs/performance-plan.md`](../../docs/performance-plan.md) — "Practitioner
Workflow Baseline (v3.1.3)" section holds per-scenario hot-phase rankings
and action recommendations. The scenarios here are the measurement surface;
the findings doc is the decision output.

## Adding a scenario

1. Add the scenario definition to `docs/performance-scenarios.md`
   (persona, data shape, operation chain, source anchor).
2. Add `bench_<name>.py` following the existing scripts: build data, define
   `phases` as a list of `(label, callable)` tuples, call `run_scenario`.
3. Register it in `run_all.py`'s `SCRIPTS` dict.
4. Run under both backends and commit the refreshed `baselines/*.json`. The
   corresponding `baselines/profiles/*.html` flames are gitignored, so they
   are regenerated locally rather than committed.
5. Add a per-scenario finding paragraph to `docs/performance-plan.md`.
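
The shape of a scenario script described in step 2 can be sketched as follows. This is a minimal illustration, not the actual harness: the `run_scenario` defined here is a hypothetical stand-in for the one in `bench_shared.py` (whose real signature is not shown in this README), and `build_data` plus the two toy phases are invented for the example. The output keys mirror the committed `baselines/*.json` files (`scenario`, `total_seconds`, `phases` with `seconds`/`ok`/`error`, `metadata`); computing `total_seconds` as the sum of phase times is an assumption.

```python
import json
import time


def run_scenario(name, phases, metadata=None):
    """Hypothetical stand-in for bench_shared.run_scenario.

    Times each (label, callable) phase and records failures instead of
    aborting, so one broken phase does not hide the others.
    """
    results = {}
    for label, fn in phases:
        start = time.perf_counter()
        try:
            fn()
            ok, error = True, None
        except Exception as exc:  # record the failure and keep going
            ok, error = False, repr(exc)
        results[label] = {
            "seconds": time.perf_counter() - start,
            "ok": ok,
            "error": error,
        }
    return {
        "scenario": name,
        # Assumption: total is the sum of phase times.
        "total_seconds": sum(p["seconds"] for p in results.values()),
        "phases": results,
        "metadata": metadata or {},
    }


def build_data(n_units=200, n_periods=12):
    """Toy long-format panel: one (unit, period, outcome) row per cell."""
    return [(u, t, float(u + t)) for u in range(n_units) for t in range(n_periods)]


data = build_data()

# Phases are (label, callable) tuples, run in order.
phases = [
    ("1_sum_outcomes", lambda: sum(row[2] for row in data)),
    ("2_failing_phase", lambda: 1 / 0),  # deliberately fails to show error capture
]

report = run_scenario("toy_scenario", phases, {"n_obs": len(data)})
print(json.dumps(report, indent=2))
```

Prefixing labels with the step number, as the existing baselines do, keeps the phases readable in chain order regardless of how the JSON is later sorted.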
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
profiles/
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
{
  "scenario": "brand_awareness_survey",
  "backend": "python",
  "has_rust_backend": false,
  "total_seconds": 0.18850491600000008,
  "phases": {
    "1_naive_fit_no_survey_design": {
      "seconds": 0.0016701670000000002,
      "ok": true,
      "error": null
    },
    "2_tsl_strata_psu_fpc": {
      "seconds": 0.006741541999999989,
      "ok": true,
      "error": null
    },
    "3_replicate_weights_brr": {
      "seconds": 0.014424250000000027,
      "ok": true,
      "error": null
    },
    "4_multi_outcome_loop_3_metrics": {
      "seconds": 0.043619666,
      "ok": true,
      "error": null
    },
    "5_check_parallel_trends": {
      "seconds": 0.00915220799999994,
      "ok": true,
      "error": null
    },
    "6_placebo_refit_pre_period": {
      "seconds": 0.029268290999999946,
      "ok": true,
      "error": null
    },
    "7_event_study_plus_honest_did": {
      "seconds": 0.08362433400000002,
      "ok": true,
      "error": null
    }
  },
  "metadata": {
    "n_units": 200,
    "n_periods": 12,
    "n_obs": 2400,
    "n_strata": 10,
    "n_psu_per_stratum": 4,
    "n_replicate_weights": 40,
    "outcomes": [
      "outcome",
      "consideration",
      "purchase_intent"
    ]
  },
  "diff_diff_version": "3.1.3",
  "numpy_version": "2.0.2"
}
Lines changed: 58 additions & 0 deletions
@@ -0,0 +1,58 @@
{
  "scenario": "brand_awareness_survey",
  "backend": "rust",
  "has_rust_backend": true,
  "total_seconds": 0.16800324999999994,
  "phases": {
    "1_naive_fit_no_survey_design": {
      "seconds": 0.0018907079999999077,
      "ok": true,
      "error": null
    },
    "2_tsl_strata_psu_fpc": {
      "seconds": 0.006109541999999912,
      "ok": true,
      "error": null
    },
    "3_replicate_weights_brr": {
      "seconds": 0.01849195799999992,
      "ok": true,
      "error": null
    },
    "4_multi_outcome_loop_3_metrics": {
      "seconds": 0.02723191700000005,
      "ok": true,
      "error": null
    },
    "5_check_parallel_trends": {
      "seconds": 0.009134625000000063,
      "ok": true,
      "error": null
    },
    "6_placebo_refit_pre_period": {
      "seconds": 0.024182666999999936,
      "ok": true,
      "error": null
    },
    "7_event_study_plus_honest_did": {
      "seconds": 0.08095333299999996,
      "ok": true,
      "error": null
    }
  },
  "metadata": {
    "n_units": 200,
    "n_periods": 12,
    "n_obs": 2400,
    "n_strata": 10,
    "n_psu_per_stratum": 4,
    "n_replicate_weights": 40,
    "outcomes": [
      "outcome",
      "consideration",
      "purchase_intent"
    ]
  },
  "diff_diff_version": "3.1.3",
  "numpy_version": "2.0.2"
}
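
The two `brand_awareness_survey` baselines above can be compared phase by phase. The sketch below is illustrative only: the phase timings are copied from the two JSON files, while the `speedups` helper is a hypothetical name introduced here and is not part of the benchmark harness. Each value is a single recorded measurement, so differences at the millisecond scale may be noise rather than a backend effect.

```python
# Per-phase wall-clock seconds, copied from the python and rust baselines.
python_phases = {
    "1_naive_fit_no_survey_design": 0.0016701670000000002,
    "2_tsl_strata_psu_fpc": 0.006741541999999989,
    "3_replicate_weights_brr": 0.014424250000000027,
    "4_multi_outcome_loop_3_metrics": 0.043619666,
    "5_check_parallel_trends": 0.00915220799999994,
    "6_placebo_refit_pre_period": 0.029268290999999946,
    "7_event_study_plus_honest_did": 0.08362433400000002,
}
rust_phases = {
    "1_naive_fit_no_survey_design": 0.0018907079999999077,
    "2_tsl_strata_psu_fpc": 0.006109541999999912,
    "3_replicate_weights_brr": 0.01849195799999992,
    "4_multi_outcome_loop_3_metrics": 0.02723191700000005,
    "5_check_parallel_trends": 0.009134625000000063,
    "6_placebo_refit_pre_period": 0.024182666999999936,
    "7_event_study_plus_honest_did": 0.08095333299999996,
}


def speedups(py, rs):
    """Hypothetical helper: python/rust time ratio per phase (>1 means rust is faster)."""
    return {label: py[label] / rs[label] for label in py}


ratios = speedups(python_phases, rust_phases)

# Largest rust advantage first.
for label, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:36s} {ratio:5.2f}x")
```

On these numbers the multi-outcome loop shows the largest ratio (about 1.6x), while the naive fit and the BRR replicate-weights phase were recorded as slower under the rust backend.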
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
{
  "scenario": "brfss_panel",
  "backend": "python",
  "has_rust_backend": false,
  "total_seconds": 1.599043583,
  "phases": {
    "1_aggregate_survey_microdata_to_panel": {
      "seconds": 1.530210625,
      "ok": true,
      "error": null
    },
    "2_cs_fit_with_stage2_survey_design": {
      "seconds": 0.014581666999999854,
      "ok": true,
      "error": null
    },
    "3_inspect_pretrends": {
      "seconds": 1.8749999997069722e-06,
      "ok": true,
      "error": null
    },
    "4_honest_did_grid": {
      "seconds": 0.003660958000000214,
      "ok": true,
      "error": null
    },
    "5_sun_abraham_robustness": {
      "seconds": 0.05053487499999987,
      "ok": true,
      "error": null
    },
    "6_practitioner_next_steps": {
      "seconds": 4.9042000000110164e-05,
      "ok": true,
      "error": null
    }
  },
  "metadata": {
    "n_microdata_rows": 50000,
    "n_states": 50,
    "n_years": 10,
    "n_strata": 10,
    "n_psu": 200,
    "n_bootstrap": 199
  },
  "diff_diff_version": "3.1.3",
  "numpy_version": "2.0.2"
}
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
{
  "scenario": "brfss_panel",
  "backend": "rust",
  "has_rust_backend": true,
  "total_seconds": 1.5960411249999997,
  "phases": {
    "1_aggregate_survey_microdata_to_panel": {
      "seconds": 1.5271849580000003,
      "ok": true,
      "error": null
    },
    "2_cs_fit_with_stage2_survey_design": {
      "seconds": 0.014870542000000153,
      "ok": true,
      "error": null
    },
    "3_inspect_pretrends": {
      "seconds": 2.208000000170074e-06,
      "ok": true,
      "error": null
    },
    "4_honest_did_grid": {
      "seconds": 0.003847707999999894,
      "ok": true,
      "error": null
    },
    "5_sun_abraham_robustness": {
      "seconds": 0.05008866700000025,
      "ok": true,
      "error": null
    },
    "6_practitioner_next_steps": {
      "seconds": 4.3584000000151946e-05,
      "ok": true,
      "error": null
    }
  },
  "metadata": {
    "n_microdata_rows": 50000,
    "n_states": 50,
    "n_years": 10,
    "n_strata": 10,
    "n_psu": 200,
    "n_bootstrap": 199
  },
  "diff_diff_version": "3.1.3",
  "numpy_version": "2.0.2"
}
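
A hot-phase ranking of the kind the findings doc reports can be derived directly from a baseline file. The sketch below uses the `brfss_panel` python-backend timings shown above; `rank_phases` is a hypothetical helper introduced for illustration and is not claimed to be how `docs/performance-plan.md` was produced.

```python
# Timings copied from the brfss_panel python-backend baseline.
total_seconds = 1.599043583
phase_seconds = {
    "1_aggregate_survey_microdata_to_panel": 1.530210625,
    "2_cs_fit_with_stage2_survey_design": 0.014581666999999854,
    "3_inspect_pretrends": 1.8749999997069722e-06,
    "4_honest_did_grid": 0.003660958000000214,
    "5_sun_abraham_robustness": 0.05053487499999987,
    "6_practitioner_next_steps": 4.9042000000110164e-05,
}


def rank_phases(phases, total):
    """Hypothetical helper: phases sorted by time, with each one's share of the total."""
    ranked = sorted(phases.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, secs, secs / total) for label, secs in ranked]


ranked = rank_phases(phase_seconds, total_seconds)

# Top-5 hot phases, matching the granularity the README mentions.
for label, secs, share in ranked[:5]:
    print(f"{label:40s} {secs:10.6f}s {share:6.1%}")
```

In this baseline the `aggregate_survey` phase accounts for about 96% of the chain's wall-clock time, which is consistent with the commit message naming it as a follow-up candidate.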
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
{
  "scenario": "campaign_staggered",
  "backend": "python",
  "has_rust_backend": false,
  "total_seconds": 0.493763792,
  "phases": {
    "1_bacon_decomposition": {
      "seconds": 0.00662462499999994,
      "ok": true,
      "error": null
    },
    "2_cs_fit_with_covariates_bootstrap999": {
      "seconds": 0.06328537499999998,
      "ok": true,
      "error": null
    },
    "3_inspect_pretrends": {
      "seconds": 3.3750000000276614e-06,
      "ok": true,
      "error": null
    },
    "4_honest_did_M_grid": {
      "seconds": 0.0047993339999999884,
      "ok": true,
      "error": null
    },
    "5_sun_abraham_robustness": {
      "seconds": 0.09586058399999997,
      "ok": true,
      "error": null
    },
    "6_imputation_did_robustness": {
      "seconds": 0.29060341599999995,
      "ok": true,
      "error": null
    },
    "7_cs_without_covariates": {
      "seconds": 0.03254304100000005,
      "ok": true,
      "error": null
    },
    "8_practitioner_next_steps": {
      "seconds": 3.7708000000025166e-05,
      "ok": true,
      "error": null
    }
  },
  "metadata": {
    "n_units": 150,
    "n_periods": 26,
    "n_cohorts": 2,
    "covariates": [
      "log_pop",
      "baseline_spend"
    ],
    "n_bootstrap": 999,
    "aggregate": "all",
    "estimation_method": "dr"
  },
  "diff_diff_version": "3.1.3",
  "numpy_version": "2.0.2"
}
