Harden brand-awareness narrative; rename JK1 label to "replicate-weight variance"

igerber · claude · igerber · commit 307868c93671 · 2026-04-19T15:42:16.000-04:00
CI re-review P2 + P3, both docs/label only:

- docs/performance-plan.md had two remaining specific-magnitude claims about brand-awareness medium ("~1.6x under Python", "Python and Rust separate the most at medium", "~1.6x at worst on brand-awareness medium"). Those were true on one earlier rerun but the current committed baselines show medium at 0.56 / 0.55 (essentially tied) and the widest non-SDiD gap is now ~1.1x at brand-large. Reworded per-scenario paragraph and scaling finding #5 to describe the stable aggregate pattern and defer exact ratios to the scale-sweep table. Same treatment as the earlier staggered/dCDH pass: narrative stops claiming magnitudes that can shift on rerun; the generator-owned table carries the specifics.
- bench_brand_awareness_survey.py module docstring labeled JK1 as "replicate-weight bootstrap". Per REGISTRY.md, JK1 is replicate-weight variance (jackknife-style), not bootstrap inference - they are distinct methodology surfaces. Renamed to "replicate-weight variance (JK1 delete-one-PSU)" with an inline note pointing to the registry.

Docs + docstring only. No script behaviour change; no baseline regeneration needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/benchmarks/speed_review/bench_brand_awareness_survey.py b/benchmarks/speed_review/bench_brand_awareness_survey.py
@@ -3,8 +3,10 @@
 
 DifferenceInDifferences + SurveyDesign under two variance paths:
   (a) analytical Taylor-series linearization (strata + PSU + FPC)
-  (b) replicate-weight bootstrap (JK1 delete-one-PSU weights; count equals
-      the number of PSUs, so 40/90/160 at small/medium/large)
+  (b) replicate-weight variance (JK1 delete-one-PSU; count equals
+      the number of PSUs, so 40/90/160 at small/medium/large).
+      This is replicate-weight variance, not bootstrap resampling -
+      see REGISTRY.md for the distinction.
 
 Chains: naive fit (for SE-inflation comparison) -> TSL -> replicate -> multi-
 outcome refit loop -> check_parallel_trends -> placebo -> HonestDiD grid.
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
@@ -84,12 +84,12 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
    n_units x n_replicates - faster growth than the chain total, so it
    increasingly dominates at large n.
 5. Rust backend gives large uplift only for SDiD (order-of-magnitude
-   and up). Elsewhere the gap is modest - under ~1.6x at worst on
-   brand-awareness medium, and within noise on the other scenarios
-   and scales. The primary bottlenecks live in Python code the Rust
-   backend does not touch (`aggregate_survey`, JK1 replicate fit), and
-   paths that Rust does touch (CS bootstrap, ImputationDiD, Survey
-   TSL) are already well-vectorized in Python.
+   and up). Elsewhere the gap is modest across all measured (scenario,
+   scale) cells - see the scale-sweep table for exact ratios. The
+   primary bottlenecks live in Python code the Rust backend does not
+   touch (`aggregate_survey`, JK1 replicate fit), and paths that Rust
+   does touch (CS bootstrap, ImputationDiD, Survey TSL) are already
+   well-vectorized in Python.
 
 ### Top phases by scenario at largest measured scale
 
@@ -122,15 +122,13 @@ any rerun):
   without covariates) is well-vectorized and sits well below both in
   the ranking. Either phase is a legitimate optimization target; the
   aggregate share is what drives the "next hotspot" priority.
-- **Brand awareness survey.** At small scale HonestDiD dominates. At
-  medium the backends diverge: on Python JK1 leads clearly (about
-  2.2x the multi-outcome loop), while on Rust the multi-outcome loop
-  and JK1 come in essentially tied. Medium is also the scale where
-  Python and Rust separate the most on total time (~1.6x under
-  Python at the time of writing); the analytical TSL path with FPC
-  appears to vectorize better under Rust at that shape. At large,
-  JK1 becomes the clearly dominant phase under both backends and
-  totals re-converge.
+- **Brand awareness survey.** At small scale HonestDiD dominates. From
+  medium onwards JK1 is the single largest phase under both backends;
+  see the table for the exact share per cell. Python and Rust totals
+  stay close across the sweep (within ~1.1x at any measured scale,
+  see scale-sweep table); the JK1 replicate-fit loop is not
+  Rust-accelerated, so the backends neither help nor hurt each other
+  meaningfully on this chain.
 - **BRFSS.** `aggregate_survey` share of total grows with scale and is
   effectively 100% of runtime at 1M rows. Downstream phases (CS fit,
   SunAbraham, HonestDiD) are a fraction of a second combined.
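
Note on the docstring rename: a minimal sketch of what "replicate-weight variance (JK1 delete-one-PSU)" means, as opposed to bootstrap resampling. This is not the benchmark's or the library's implementation. The helper names (`jk1_replicate_weights`, `weighted_mean`, `jk1_variance`) are hypothetical, the estimator is a plain weighted mean rather than a DiD fit, and the design is assumed to be unstratified with at least two PSUs.

```python
# Illustrative only: hypothetical helpers, not the library's API.
# Assumes an unstratified design and a weighted-mean estimator.

def jk1_replicate_weights(weights, psu_ids):
    """Build one replicate weight vector per PSU (delete-one-PSU)."""
    psus = sorted(set(psu_ids))
    n_psu = len(psus)
    if n_psu < 2:
        raise ValueError("JK1 needs at least two PSUs")
    # Rows in the retained PSUs are scaled up to compensate for the dropped PSU.
    scale = n_psu / (n_psu - 1)
    return [
        [0.0 if psu == dropped else w * scale for w, psu in zip(weights, psu_ids)]
        for dropped in psus
    ]


def weighted_mean(values, weights):
    """Weighted mean; stands in for whatever estimator is refit per replicate."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jk1_variance(values, weights, psu_ids):
    """JK1 variance: (R - 1) / R * sum over replicates of (theta_r - theta_full)^2."""
    theta_full = weighted_mean(values, weights)
    replicates = jk1_replicate_weights(weights, psu_ids)
    r = len(replicates)  # equals the number of PSUs
    squared_devs = sum(
        (weighted_mean(values, rep) - theta_full) ** 2 for rep in replicates
    )
    return (r - 1) / r * squared_devs


values = [1.0, 2.0, 3.0, 4.0]
weights = [1.0, 1.0, 1.0, 1.0]
psu_ids = [0, 1, 2, 3]
variance = jk1_variance(values, weights, psu_ids)
```

The replicate count is fixed by the number of PSUs, which matches the docstring's "count equals the number of PSUs". Nothing in the sketch draws random samples, so rerunning it on the same data returns the same variance; a bootstrap would resample PSUs and vary from run to run unless seeded.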