
Commit 307868c

igerber and claude committed
Harden brand-awareness narrative; rename JK1 label to "replicate-weight variance"
CI re-review P2 + P3, both docs/label only:

- docs/performance-plan.md had two remaining specific-magnitude claims
  about brand-awareness medium ("~1.6x under Python", "Python and Rust
  separate the most at medium", "~1.6x at worst on brand-awareness
  medium"). Those were true on one earlier rerun but the current
  committed baselines show medium at 0.56 / 0.55 (essentially tied) and
  the widest non-SDiD gap is now ~1.1x at brand-large. Reworded
  per-scenario paragraph and scaling finding #5 to describe the stable
  aggregate pattern and defer exact ratios to the scale-sweep table.
  Same treatment as the earlier staggered/dCDH pass: narrative stops
  claiming magnitudes that can shift on rerun; the generator-owned table
  carries the specifics.
- bench_brand_awareness_survey.py module docstring labeled JK1 as
  "replicate-weight bootstrap". Per REGISTRY.md, JK1 is replicate-weight
  variance (jackknife-style), not bootstrap inference - they are
  distinct methodology surfaces. Renamed to "replicate-weight variance
  (JK1 delete-one-PSU)" with an inline note pointing to the registry.

Docs + docstring only. No script behaviour change; no baseline
regeneration needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 030d5f5 commit 307868c

2 files changed

Lines changed: 17 additions & 17 deletions

benchmarks/speed_review/bench_brand_awareness_survey.py

Lines changed: 4 additions & 2 deletions
@@ -3,8 +3,10 @@
 
 DifferenceInDifferences + SurveyDesign under two variance paths:
 (a) analytical Taylor-series linearization (strata + PSU + FPC)
-(b) replicate-weight bootstrap (JK1 delete-one-PSU weights; count equals
-the number of PSUs, so 40/90/160 at small/medium/large)
+(b) replicate-weight variance (JK1 delete-one-PSU; count equals
+the number of PSUs, so 40/90/160 at small/medium/large).
+This is replicate-weight variance, not bootstrap resampling -
+see REGISTRY.md for the distinction.
 
 Chains: naive fit (for SE-inflation comparison) -> TSL -> replicate -> multi-
 outcome refit loop -> check_parallel_trends -> placebo -> HonestDiD grid.
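
The distinction the new docstring draws can be illustrated with a minimal sketch of textbook JK1 delete-one-PSU variance. This is not the benchmark's or the library's implementation: the function names (`jk1_replicate_weights`, `jk1_variance`, `weighted_mean`) are hypothetical, the design is assumed unstratified, and a weighted mean stands in for the DiD estimator. What it shows is that the replicate set is deterministic (one replicate per PSU, no random resampling), which is why the replicate count equals the number of PSUs.

```python
import numpy as np


def jk1_replicate_weights(weights, psu):
    """Build JK1 replicate weights: one column per PSU (hypothetical helper).

    In replicate r the r-th PSU is dropped (weight 0) and the remaining
    PSUs are scaled up by n_psu / (n_psu - 1).
    """
    weights = np.asarray(weights, dtype=float)
    psu = np.asarray(psu)
    psus = np.unique(psu)
    n_psu = len(psus)
    rep = np.tile(weights[:, None], (1, n_psu))
    for r, p in enumerate(psus):
        dropped = psu == p
        rep[dropped, r] = 0.0
        rep[~dropped, r] *= n_psu / (n_psu - 1)
    return rep


def weighted_mean(y, w):
    """Stand-in estimator; the real chain refits a DiD model per replicate."""
    return np.sum(w * y) / np.sum(w)


def jk1_variance(estimator, y, weights, rep_weights):
    """JK1 variance: (R - 1) / R * sum over replicates of (theta_r - theta_full)^2."""
    theta_full = estimator(y, weights)
    n_rep = rep_weights.shape[1]
    theta_reps = np.array(
        [estimator(y, rep_weights[:, r]) for r in range(n_rep)]
    )
    return (n_rep - 1) / n_rep * np.sum((theta_reps - theta_full) ** 2)
```

Because every replicate requires a full refit of the estimator, the cost of this path grows with the number of PSUs, which is consistent with the replicate fit showing up as a dominant phase at larger scales.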

docs/performance-plan.md

Lines changed: 13 additions & 15 deletions
@@ -84,12 +84,12 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
 n_units x n_replicates - faster growth than the chain total, so it
 increasingly dominates at large n.
 5. Rust backend gives large uplift only for SDiD (order-of-magnitude
-and up). Elsewhere the gap is modest - under ~1.6x at worst on
-brand-awareness medium, and within noise on the other scenarios
-and scales. The primary bottlenecks live in Python code the Rust
-backend does not touch (`aggregate_survey`, JK1 replicate fit), and
-paths that Rust does touch (CS bootstrap, ImputationDiD, Survey
-TSL) are already well-vectorized in Python.
+and up). Elsewhere the gap is modest across all measured (scenario,
+scale) cells - see the scale-sweep table for exact ratios. The
+primary bottlenecks live in Python code the Rust backend does not
+touch (`aggregate_survey`, JK1 replicate fit), and paths that Rust
+does touch (CS bootstrap, ImputationDiD, Survey TSL) are already
+well-vectorized in Python.
 
 ### Top phases by scenario at largest measured scale
 
@@ -122,15 +122,13 @@ any rerun):
 without covariates) is well-vectorized and sits well below both in
 the ranking. Either phase is a legitimate optimization target; the
 aggregate share is what drives the "next hotspot" priority.
-- **Brand awareness survey.** At small scale HonestDiD dominates. At
-medium the backends diverge: on Python JK1 leads clearly (about
-2.2x the multi-outcome loop), while on Rust the multi-outcome loop
-and JK1 come in essentially tied. Medium is also the scale where
-Python and Rust separate the most on total time (~1.6x under
-Python at the time of writing); the analytical TSL path with FPC
-appears to vectorize better under Rust at that shape. At large,
-JK1 becomes the clearly dominant phase under both backends and
-totals re-converge.
+- **Brand awareness survey.** At small scale HonestDiD dominates. From
+medium onwards JK1 is the single largest phase under both backends;
+see the table for the exact share per cell. Python and Rust totals
+stay close across the sweep (within ~1.1x at any measured scale,
+see scale-sweep table); the JK1 replicate-fit loop is not
+Rust-accelerated, so the backends neither help nor hurt each other
+meaningfully on this chain.
 - **BRFSS.** `aggregate_survey` share of total grows with scale and is
 effectively 100% of runtime at 1M rows. Downstream phases (CS fit,
 SunAbraham, HonestDiD) are a fraction of a second combined.
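
As a rough illustration of how a "within ~1.1x" claim can be checked against the committed baselines, the sketch below computes a symmetric backend ratio for the one cell the commit message quotes (brand-awareness medium at 0.56 / 0.55). The helper name `backend_ratio` is hypothetical and is not part of the benchmark generator. Which backend each figure belongs to is not stated in the message, so the ratio is taken as slower over faster.

```python
def backend_ratio(total_a, total_b):
    """Return slower / faster, so the result is >= 1.0 whichever backend wins."""
    slow, fast = max(total_a, total_b), min(total_a, total_b)
    return slow / fast


# Totals quoted in the commit message for brand-awareness medium.
medium_ratio = backend_ratio(0.56, 0.55)
print(f"brand-awareness medium: {medium_ratio:.2f}x")  # prints 1.02x
```

Using a symmetric ratio keeps the narrative claim independent of which backend happens to be ahead on a given rerun.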
