Commit a0aafc5

igerber and claude committed
Narrow three P3 narrative overgeneralizations
CI re-review P3 items, all documentation-only:

- Scenario 3 operation chain: said "analytical TSL via strata + PSU", but aggregate_survey()'s returned second-stage design is pweight with geographic PSU clustering and no stage-2 strata. Reworded to match the actual second-stage design surface being benchmarked.
- ImputationDiD "consistently dominant" claim in scaling finding #2 and hotspot table row #2: at Rust medium SunAbraham clearly leads (0.353s vs 0.214s). Both claims narrowed to "Python all scales + Rust small/large" with the Rust-medium SunAbraham exception called out explicitly; the "together ~70-80% of the chain" framing preserves the optimization recommendation.
- SDiD narrative said sensitivity_to_zeta_omega and in_time_placebo are the two largest at every scale/backend, but at Rust small bootstrap_variance slightly edges both (at sub-50ms totals, per-phase fixed overhead dominates ranking). Qualified to Python all scales + Rust medium/large.

Docs-only. No script or baseline changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 4bf991c commit a0aafc5
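
The ranking checks described in the commit message can be sketched as a small helper that orders phases by wall time and reports the combined share of a chosen pair. This is a minimal illustration, not code from the repository: the function names are hypothetical, the two Rust-medium figures are read from the commit message (taking 0.353s as SunAbraham and 0.214s as ImputationDiD, since SunAbraham is said to lead), and the `other_phases` value is a made-up placeholder so the shares can be computed.

```python
from typing import Dict, Iterable, List, Tuple


def rank_phases(timings: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (phase, share_of_total) pairs, slowest phase first."""
    total = sum(timings.values())
    if total <= 0:
        raise ValueError("timings must sum to a positive total")
    ranked = sorted(timings.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, seconds / total) for name, seconds in ranked]


def combined_share(timings: Dict[str, float], phases: Iterable[str]) -> float:
    """Fraction of total chain time spent in the named phases."""
    total = sum(timings.values())
    return sum(timings[p] for p in phases) / total


# Rust medium, Staggered CS chain (seconds).
rust_medium = {
    "SunAbraham": 0.353,     # from the commit message
    "ImputationDiD": 0.214,  # from the commit message
    "other_phases": 0.180,   # HYPOTHETICAL placeholder, not a measured value
}

ranked = rank_phases(rust_medium)
top_phase = ranked[0][0]
pair_share = combined_share(rust_medium, ["ImputationDiD", "SunAbraham"])
print(f"top phase: {top_phase}, pair share: {pair_share:.0%}")
```

Running the same check for every (scale, backend) cell is what separates a claim like "dominant at every scale" from "dominant at most combinations"; a single cell where the top phase differs is enough to require the narrower wording.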

2 files changed

Lines changed: 18 additions & 12 deletions

docs/performance-plan.md

Lines changed: 12 additions & 7 deletions
@@ -69,7 +69,9 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
 (`aggregate_survey` is entirely Python).
 2. **Staggered CS chain stays cheap across scales.** A 10x unit increase
 (150 -> 1,500) is a small-single-digit multiplier on total time.
-ImputationDiD is consistently the dominant phase but scales well.
+ImputationDiD is the dominant phase at most (scale, backend)
+combinations; SunAbraham takes the top spot at Rust medium but the
+two phases together consistently account for ~70-80% of the chain.
 3. **SDiD Rust gap is stable across scales, not emergent.** Python SDiD
 has a fixed per-jackknife-refit overhead that dominates even at small
 n. Rust stays sub-second through 500 units.
@@ -132,11 +134,14 @@ any rerun):
 effectively 100% of runtime at 1M rows. Downstream phases (CS fit,
 SunAbraham, HonestDiD) are a fraction of a second combined.
 - **SDiD few markets.** `sensitivity_to_zeta_omega` and
-`in_time_placebo` are the two largest phases under both backends at
-every scale - they together account for roughly ~70% of the chain.
-The difference is absolute: under Python they drive a multi-second
-chain, under Rust they stay the top phases but of a sub-second total
-runtime. That is the Python-vs-Rust story for this scenario.
+`in_time_placebo` are the two largest phases under Python at every
+scale and under Rust at medium/large (together ~70% of the chain).
+At Rust small the absolute cost collapses so far that per-phase
+fixed overhead dominates and `2_sdid_bootstrap_variance_200` slightly
+edges the other two. The difference across backends is absolute:
+under Python these phases drive a multi-second chain, under Rust
+they stay in the top ranks but of a sub-second total runtime. That
+is the Python-vs-Rust story for this scenario.
 - **Reversible dCDH.** Main fit and heterogeneity refit are the two
 largest phases by design - together effectively the whole chain. The
 split is not stable across backends: under Python the main fit is
@@ -152,7 +157,7 @@ any rerun):
 | # | Location | Scenario + scale | Signal | Recommended action |
 |---|---|---|---|---|
 | 1 | `diff_diff/survey.py:1160` `_compute_stratified_psu_meat` | BRFSS @ 1M rows | dominates BRFSS chain at all scales, ~100% at 1M rows | **Algorithmic fix, highest priority.** Function called once per (state, year) cell (500 calls); per-call work rebuilds stratum-PSU scaffolding every time. Precompute stratum indexes once at `aggregate_survey` top-level and reuse. |
-| 2 | `diff_diff/imputation.py` ImputationDiD fit | Staggered CS @ 1,500 units | dominant phase of the CS chain under both backends at all scales; SunAbraham narrows the gap under Rust at large but ImputationDiD still leads | **Investigate only after BRFSS fix lands.** Total chain is well under practitioner-perceptible threshold; candidate follow-up. |
+| 2 | `diff_diff/imputation.py` ImputationDiD fit | Staggered CS @ 1,500 units | dominant phase under Python at every scale and under Rust at small/large; at Rust medium SunAbraham takes the top spot. Together ImputationDiD + SunAbraham are ~70-80% of the chain at every scale | **Investigate only after BRFSS fix lands.** Total chain is well under practitioner-perceptible threshold; candidate follow-up. |
 | 3 | `diff_diff/utils.py:1434` `_sc_weight_fw_numpy` | SDiD python @ any scale | dominates Python SDiD at all scales | **Already ported to Rust.** Python fallback acceptable as a teaching/safety path; non-production for n > 100. Python skipped at n=500 (jackknife cost would exceed 4 minutes per run). |
 | 4 | `diff_diff/chaisemartin_dhaultfoeuille.py` dCDH fit + heterogeneity | Reversible (single scale) | main fit and survey-aware heterogeneity refit each rebuild TSL scaffolding; heterogeneity phase is as expensive as the main fit | **Cache/precompute** - heterogeneity refit duplicates the main fit's TSL setup under the same `SurveyDesign`. Not P0; newer code path (v3.1) never optimization-reviewed. |
 | 5 | `diff_diff/continuous_did.py` CDiD spline bootstrap | Dose-response (single scale) | four spline fits ~equal, linear in variant count | **Leave alone** - well under perceptible threshold. |
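
The recommended action in row 1 (build the stratum-PSU scaffolding once and reuse it across cells) follows a common precompute-and-reuse pattern. The sketch below illustrates that pattern only; it is not the implementation of `_compute_stratified_psu_meat` or `aggregate_survey`, and every name in it (`build_stratum_psu_index`, `stratum_psu_totals`, the toy inputs) is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence

# Hypothetical type alias: stratum -> PSU -> row positions.
StratumPsuIndex = Dict[Hashable, Dict[Hashable, List[int]]]


def build_stratum_psu_index(
    strata: Sequence[Hashable], psus: Sequence[Hashable]
) -> StratumPsuIndex:
    """Group row positions by (stratum, PSU) in a single pass over the data."""
    index: Dict[Hashable, Dict[Hashable, List[int]]] = defaultdict(
        lambda: defaultdict(list)
    )
    for row, (stratum, psu) in enumerate(zip(strata, psus)):
        index[stratum][psu].append(row)
    # Freeze into plain dicts so later lookups cannot add keys by accident.
    return {s: dict(psu_map) for s, psu_map in index.items()}


def stratum_psu_totals(
    index: StratumPsuIndex, scores: Sequence[float]
) -> Dict[Hashable, Dict[Hashable, float]]:
    """Sum per-row scores within each PSU, reusing the prebuilt index."""
    return {
        s: {p: sum(scores[i] for i in rows) for p, rows in psu_map.items()}
        for s, psu_map in index.items()
    }


# Toy design: built ONCE, outside the per-cell loop.
strata = ["a", "a", "b", "b"]
psus = [1, 2, 1, 1]
index = build_stratum_psu_index(strata, psus)

# Each cell supplies its own score vector but shares the same index.
cell_scores = {
    "cell_1": [1.0, 2.0, 3.0, 4.0],
    "cell_2": [0.5, 0.5, 0.0, 1.0],
}
totals = {cell: stratum_psu_totals(index, s) for cell, s in cell_scores.items()}
```

Under this assumption the grouping cost is paid once rather than once per (state, year) cell, so the per-cell work reduces to the sums themselves.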

docs/performance-scenarios.md

Lines changed: 6 additions & 5 deletions
@@ -201,11 +201,12 @@ serves a different purpose: R-parity accuracy). They complement it.
 compute_honest_did(results, method="relative_magnitude", M=[0.5, 1.0, 1.5])
 ```
 - **Operation chain.** (1) `aggregate_survey()` - the microdata-to-panel
-collapse; (2) CS fit with staged second-stage SurveyDesign
-(`weight_type="pweight"`, analytical TSL via strata + PSU) and bootstrap
-at PSU level; (3) event-study pre-trend inspection; (4) HonestDiD
-sensitivity grid; (5) SunAbraham robustness refit using the same
-second-stage pweight SurveyDesign; (6) `practitioner_next_steps()`.
+collapse; (2) CS fit with the second-stage SurveyDesign returned by
+`aggregate_survey` (pweight + geographic PSU clustering; `aggregate_survey`
+does not stratify the collapsed cell panel) and bootstrap at PSU level;
+(3) event-study pre-trend inspection; (4) HonestDiD sensitivity grid;
+(5) SunAbraham robustness refit using the same second-stage pweight
+SurveyDesign; (6) `practitioner_next_steps()`.
 - **Source anchor.** `docs/practitioner_getting_started.rst` ("What If
 You Have Survey Data?" section), CDC BRFSS 2024 overview
 (cdc.gov/brfss/annual_data/2024), `diff_diff.prep.aggregate_survey`
