Address PR #351 R9 P3: unify SDID bootstrap slowdown wording

igerber · claude · igerber · commit dc2045fb781e · 2026-04-22T19:45:53.000-04:00
Single actionable P3 from R9 CI review: user-facing runtime wording for
refit bootstrap had diverged across surfaces, giving conflicting
expectations about the cost of the new bootstrap path:

- CHANGELOG.md and diff_diff/synthetic_did.py said ~5-30x slower.
- diff_diff/power.py said ~10-100x slower (two sites).
- docs/choosing_estimator.rst said ~10-100x slower.
- docs/performance-scenarios.md said ~10-100x slower.
- docs/methodology/REGISTRY.md coverage-MC block said ~10-100x slower.
- docs/tutorials/03_synthetic_did.ipynb and
  docs/tutorials/18_geo_experiments.ipynb said ~10-100x slower.
- benchmarks/python/coverage_sdid.py said the 500-seed MC run takes
  ~2-4 hours, while REGISTRY.md said ~15-40 min (the actually-observed
  wall-clock; aer63 is ~37 min, balanced + unbalanced ~2 min combined).

Unify on "~5-30x slower than placebo (panel-size dependent)" for the
per-fit slowdown (the warm-start plumbing closed the gap vs the
pre-warm-start cold-start estimate of 10-100x) and on "~15-40 min" for
the coverage MC wall-clock. The CHANGELOG entry already notes the
10-100x figure as a historical "prior estimate"; it is left as-is so the
release notes continue to explain the revision.

Also fix two tutorial surfaces that still called placebo "R's default"
(tutorial 03, sections 7 and 10). R's default is bootstrap; placebo is
the library default per the REGISTRY Note added in 710f966. Reword to
describe placebo as the library default with the rationale pointer.

Verified: 353 tests pass across test_methodology_sdid, test_power,
test_guides (UTF-8 fingerprint preserved). Tutorial-18 nbmake drift
guards unaffected because the change is markdown-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
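
The kind of wording drift described in the commit message can be caught mechanically. The sketch below is not part of this PR or of the library: the helper names (`slowdown_ranges`, `find_conflicts`) are hypothetical and the sample strings are illustrative rather than quoted from the files. It only shows one way to extract "~N-Mx slower" ranges from several text surfaces and group the surfaces by the range they quote.

```python
from __future__ import annotations

import re

# Matches phrases such as "~5-30x slower" or "~10–100× slower"
# (ASCII hyphen or en dash between the numbers, "x" or "×" after them).
SLOWDOWN_RE = re.compile(r"~(\d+)\s*[-\u2013]\s*(\d+)\s*[x\u00d7]\s+slower")


def slowdown_ranges(text: str) -> set[tuple[int, int]]:
    """Return the distinct (low, high) slowdown ranges mentioned in text."""
    return {(int(lo), int(hi)) for lo, hi in SLOWDOWN_RE.findall(text)}


def find_conflicts(surfaces: dict[str, str]) -> dict[tuple[int, int], list[str]]:
    """Group surface names by the range they quote.

    More than one key in the result means the surfaces disagree.
    """
    by_range: dict[tuple[int, int], list[str]] = {}
    for name, text in surfaces.items():
        for rng in slowdown_ranges(text):
            by_range.setdefault(rng, []).append(name)
    return by_range


if __name__ == "__main__":
    # Illustrative snippets only; real use would read the files from disk.
    surfaces = {
        "power.py": "paper-faithful refit; ~10-100x slower than placebo",
        "CHANGELOG.md": "refit bootstrap is ~5-30x slower than placebo",
        "choosing_estimator.rst": "per draw; ~10\u2013100\u00d7 slower than placebo",
    }
    for rng, names in sorted(find_conflicts(surfaces).items()):
        print(rng, names)
```

A check like this could run in CI over the listed docs and source files, failing when more than one range is found. Whether that is worth the maintenance cost is a judgment call for the project.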
diff --git a/benchmarks/python/coverage_sdid.py b/benchmarks/python/coverage_sdid.py
@@ -11,7 +11,9 @@
 ``docs/methodology/REGISTRY.md`` §SyntheticDiD.
 
 Usage:
-    # Full run (~2–4 hours, AER §6.3 refit is the long tail)
+    # Full run (~15–40 min on M-series Mac with Rust backend; AER §6.3 refit
+    # is the long tail at ~37 min. Matches the wall-clock wording in
+    # REGISTRY.md §SyntheticDiD coverage MC note.)
     python benchmarks/python/coverage_sdid.py \\
         --n-seeds 500 --n-bootstrap 200 \\
         --output benchmarks/data/sdid_coverage.json
diff --git a/diff_diff/power.py b/diff_diff/power.py
@@ -718,7 +718,7 @@ def _check_sdid_placebo_data(
             f"n_treated={n_treated}. Either adjust your data_generator so that "
             f"n_control > n_treated, or use "
             f"SyntheticDiD(variance_method='bootstrap') (paper-faithful refit; "
-            f"~10-100x slower than placebo) or SyntheticDiD(variance_method='jackknife')."
+            f"~5-30x slower than placebo) or SyntheticDiD(variance_method='jackknife')."
         )
 
 
@@ -2047,7 +2047,7 @@ def simulate_power(
                 f"n_treated={effective_n_treated}). Either lower "
                 f"treatment_fraction so that n_control > n_treated, or use "
                 f"SyntheticDiD(variance_method='bootstrap') (paper-faithful refit; "
-                f"~10-100x slower than placebo) or "
+                f"~5-30x slower than placebo) or "
                 f"SyntheticDiD(variance_method='jackknife')."
             )
 
diff --git a/docs/choosing_estimator.rst b/docs/choosing_estimator.rst
@@ -611,7 +611,7 @@ differences helps interpret results and choose appropriate inference.
      - Uses influence-function SEs with WIF adjustment by default. Set ``n_bootstrap=999`` for multiplier bootstrap inference (weight types: ``rademacher``, ``mammen``, ``webb``).
    * - ``SyntheticDiD``
      - Placebo, paper-faithful refit bootstrap, or jackknife
-     - Default uses placebo-based variance (``variance_method="placebo"``). Set ``variance_method="bootstrap"`` for paper-faithful Algorithm 2 bootstrap (re-estimates ω and λ via Frank-Wolfe per draw; ~10–100× slower than placebo). Both methods use ``n_bootstrap`` replications (default 200). ``variance_method="jackknife"`` is also available.
+     - Default uses placebo-based variance (``variance_method="placebo"``). Set ``variance_method="bootstrap"`` for paper-faithful Algorithm 2 bootstrap (re-estimates ω and λ via Frank-Wolfe per draw; ~5–30× slower than placebo, panel-size dependent). Both methods use ``n_bootstrap`` replications (default 200). ``variance_method="jackknife"`` is also available.
    * - ``ContinuousDiD``
      - Analytical (influence function)
      - Uses influence-function-based SEs by default. Use ``n_bootstrap=199`` (or higher) for multiplier bootstrap inference with proper CIs.
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -1505,7 +1505,7 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi
 
   R-parity rationale: `synthdid_estimate()` (synthdid.R) stores `update.omega = TRUE` in `attr(estimate, "opts")`, and `vcov.R::bootstrap_sample` rebinds those `opts` inside its `do.call` back into `synthdid_estimate`, so the renormalized ω passed via `weights$omega` is used as Frank-Wolfe initialization (the `sum_normalize` helper in R's source explicitly says so). The Python path threads the same warm-start via `compute_sdid_unit_weights(..., init_weights=...)` and `compute_time_weights(..., init_weights=...)`. The FW objective is strictly convex on the simplex (quadratic loss + ζ² ridge on simplex), so warm- and cold-start converge to the same global minimum given enough iterations; warm-start matters in practice because the 100-iter first pass then sparsification is path-dependent on draws where the pre-sparsify budget is tight. Cross-language SE parity at bit tolerance is not claimed — different BLAS / RNG paths — but the procedure matches R's default bootstrap shape at the algorithm level, and Python-only bit-identity on non-survey data is asserted via `TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]` at `rel=1e-14`.
 
-  Expected wall-clock ~10–100× slower per fit than a fixed-weight shortcut would be (panel-size dependent; Frank-Wolfe second-pass can hit its 10K-iter cap on larger panels). Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed inside the loop and aggregated into a single summary warning emitted after the loop when the share of valid bootstrap draws with any non-convergence event (counted once per draw — each draw runs Frank-Wolfe once for ω and once for λ, and any of those calls firing a non-convergence warning trips the draw) exceeds 5% of `n_successful`. Composed with any survey design (including pweight-only) this path raises `NotImplementedError` in the current release — see the survey-regression Note below for scope and the deferred-composition sketch.
+  Expected wall-clock ~5–30× slower per fit than placebo (panel-size dependent; Frank-Wolfe second-pass can hit its 10K-iter cap on larger panels; warm-start plumbing closes the gap vs cold-start, which would be closer to 10–30× on these DGPs). Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed inside the loop and aggregated into a single summary warning emitted after the loop when the share of valid bootstrap draws with any non-convergence event (counted once per draw — each draw runs Frank-Wolfe once for ω and once for λ, and any of those calls firing a non-convergence warning trips the draw) exceeds 5% of `n_successful`. Composed with any survey design (including pweight-only) this path raises `NotImplementedError` in the current release — see the survey-regression Note below for scope and the deferred-composition sketch.
 
 - Alternative: Jackknife variance (matching R's `synthdid::vcov(method="jackknife")`)
   Implements Algorithm 3 from Arkhangelsky et al. (2021):
diff --git a/docs/performance-scenarios.md b/docs/performance-scenarios.md
@@ -235,8 +235,9 @@ serves a different purpose: R-parity accuracy). They complement it.
   SyntheticDiD(variance_method="jackknife", n_bootstrap=0).fit(...)
   # then also variance_method="bootstrap", n_bootstrap=200 for comparison
   # NOTE: bootstrap is now paper-faithful refit (re-estimates ω and λ via
-  # Frank-Wolfe per draw); ~10–100× slower than placebo or the previous
-  # release's fixed-weight bootstrap. Plan accordingly when timing.
+  # Frank-Wolfe per draw); ~5–30× slower than placebo (panel-size dependent;
+  # warm-start plumbing reduces the slowdown relative to cold-start). Plan
+  # accordingly when timing.
   ```
 - **Operation chain.** (1) SDiD fit with `variance_method="jackknife"` -
   exercises the leave-one-out refit loop; (2) SDiD fit with
diff --git a/docs/tutorials/03_synthetic_did.ipynb b/docs/tutorials/03_synthetic_did.ipynb
@@ -398,7 +398,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## 7. Inference Methods\n\nSDID supports three inference methods:\n\n1. **Placebo** (`variance_method=\"placebo\"`, default): Placebo-based variance using Algorithm 4 from Arkhangelsky et al. (2021). This matches R's default.\n2. **Bootstrap** (`variance_method=\"bootstrap\"`): Paper-faithful pairs bootstrap (Algorithm 2 step 2) — re-estimates ω and λ via Frank-Wolfe on each draw. Also matches R's default `synthdid::vcov(method=\"bootstrap\")` behavior. Expect ~10–100× slower per fit than placebo.\n3. **Jackknife** (`variance_method=\"jackknife\"`): Algorithm 3 — fixed-weight leave-one-out. Deterministic; no bootstrap replications."
+   "source": "## 7. Inference Methods\n\nSDID supports three inference methods:\n\n1. **Placebo** (`variance_method=\"placebo\"`, default): Placebo-based variance using Algorithm 4 from Arkhangelsky et al. (2021). Library default (R's default is bootstrap — we deviate because placebo is unconditionally available on pweight-only survey designs and sidesteps the refit bootstrap slowdown).\n2. **Bootstrap** (`variance_method=\"bootstrap\"`): Paper-faithful pairs bootstrap (Algorithm 2 step 2) — re-estimates ω and λ via Frank-Wolfe on each draw. Matches R's default `synthdid::vcov(method=\"bootstrap\")` behavior. Expect ~5–30× slower per fit than placebo (panel-size dependent).\n3. **Jackknife** (`variance_method=\"jackknife\"`): Algorithm 3 — fixed-weight leave-one-out. Deterministic; no bootstrap replications."
   },
   {
    "cell_type": "code",
@@ -599,7 +599,7 @@
   {
    "cell_type": "markdown",
    "metadata": {},
-   "source": "## Summary\n\nKey takeaways for Synthetic DiD:\n\n1. **Best use cases**: Few treated units, many controls, long pre-period\n2. **Unit weights**: Identify which controls are most similar to treated (Frank-Wolfe with sparsification)\n3. **Time weights**: Determine which pre-periods are most informative (Frank-Wolfe on collapsed form)\n4. **Pre-treatment fit**: Lower RMSE indicates better synthetic match\n5. **Inference options**:\n   - Placebo (`variance_method=\"placebo\"`, default): Placebo-based variance from controls (matches R)\n   - Bootstrap (`variance_method=\"bootstrap\"`): Paper-faithful pairs bootstrap re-estimating ω and λ via Frank-Wolfe per draw (Algorithm 2 step 2; matches R's default `vcov`). ~10–100× slower than placebo.\n   - Jackknife (`variance_method=\"jackknife\"`): Algorithm 3 — fixed-weight leave-one-out.\n6. **Regularization**: Auto-computed from data noise level by default. Override with `zeta_omega`/`zeta_lambda`.\n\nReference:\n- Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic difference-in-differences. American Economic Review, 111(12), 4088-4118."
+   "source": "## Summary\n\nKey takeaways for Synthetic DiD:\n\n1. **Best use cases**: Few treated units, many controls, long pre-period\n2. **Unit weights**: Identify which controls are most similar to treated (Frank-Wolfe with sparsification)\n3. **Time weights**: Determine which pre-periods are most informative (Frank-Wolfe on collapsed form)\n4. **Pre-treatment fit**: Lower RMSE indicates better synthetic match\n5. **Inference options**:\n   - Placebo (`variance_method=\"placebo\"`, default): Placebo-based variance from controls. Library default (R's default is bootstrap; we deviate for survey availability + perf).\n   - Bootstrap (`variance_method=\"bootstrap\"`): Paper-faithful pairs bootstrap re-estimating ω and λ via Frank-Wolfe per draw (Algorithm 2 step 2; matches R's default `vcov`). ~5–30× slower than placebo.\n   - Jackknife (`variance_method=\"jackknife\"`): Algorithm 3 — fixed-weight leave-one-out.\n6. **Regularization**: Auto-computed from data noise level by default. Override with `zeta_omega`/`zeta_lambda`.\n\nReference:\n- Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic difference-in-differences. American Economic Review, 111(12), 4088-4118."
   }
  ],
  "metadata": {
diff --git a/docs/tutorials/18_geo_experiments.ipynb b/docs/tutorials/18_geo_experiments.ipynb
@@ -854,17 +854,7 @@
    "cell_type": "markdown",
    "id": "t18-cell-028",
    "metadata": {},
-   "source": [
-    "diff-diff's `SyntheticDiD` supports three standard error methods, and the difference between the two paper-based ones is *what gets resampled* per replication:\n",
-    "\n",
-    "- **Placebo SE** (default): permutes which control units are pretended to be \"treated\", then **re-estimates both the unit weights and the time weights** (Frank-Wolfe) on each permutation and recomputes SDiD. The standard deviation of those placebo effects is the SE. This is Algorithm 4 in Arkhangelsky et al. (2021) and matches R's `synthdid::vcov(method=\"placebo\")`.\n",
-    "- **Bootstrap SE**: pairs-bootstrap resampling of all units with replacement, then **re-estimates both the unit weights and the time weights** via Frank-Wolfe on each resampled panel and recomputes SDiD. This is Algorithm 2 step 2 in Arkhangelsky et al. (2021) and matches R's default `synthdid::vcov(method=\"bootstrap\")` behavior (which rebinds `attr(estimate, \"opts\")` so the renormalized ω is only Frank-Wolfe initialization). Expect ~10–100× slower per fit than placebo.\n",
-    "- **Jackknife SE**: deterministic Algorithm 3 — fixed-weight leave-one-out across all units. Faster than bootstrap; mildly anti-conservative on smaller panels.\n",
-    "\n",
-    "Both bootstrap and placebo re-estimate the weights per replication, so each reflects the full uncertainty in the weighting procedure. They differ in *how* they resample: placebo permutes the control-vs-treated assignment, bootstrap draws with replacement. On exchangeable DGPs the two SEs typically track each other; on small panels with non-exchangeable factor structure (like the marketing geo-experiment here), they can differ in magnitude while still agreeing on significance and CI direction.\n",
-    "\n",
-    "All three methods are configured on the `SyntheticDiD` *constructor*, not on `.fit()`. Use placebo by default (it's the published method); switch to bootstrap if you want a cross-check from a different resampling protocol; switch to jackknife if you need a deterministic, fast alternative."
-   ]
+   "source": "diff-diff's `SyntheticDiD` supports three standard error methods, and the difference between the two paper-based ones is *what gets resampled* per replication:\n\n- **Placebo SE** (default): permutes which control units are pretended to be \"treated\", then **re-estimates both the unit weights and the time weights** (Frank-Wolfe) on each permutation and recomputes SDiD. The standard deviation of those placebo effects is the SE. This is Algorithm 4 in Arkhangelsky et al. (2021) and matches R's `synthdid::vcov(method=\"placebo\")`.\n- **Bootstrap SE**: pairs-bootstrap resampling of all units with replacement, then **re-estimates both the unit weights and the time weights** via Frank-Wolfe on each resampled panel and recomputes SDiD. This is Algorithm 2 step 2 in Arkhangelsky et al. (2021) and matches R's default `synthdid::vcov(method=\"bootstrap\")` behavior (which rebinds `attr(estimate, \"opts\")` so the renormalized ω is only Frank-Wolfe initialization). Expect ~5–30× slower per fit than placebo (panel-size dependent).\n- **Jackknife SE**: deterministic Algorithm 3 — fixed-weight leave-one-out across all units. Faster than bootstrap; mildly anti-conservative on smaller panels.\n\nBoth bootstrap and placebo re-estimate the weights per replication, so each reflects the full uncertainty in the weighting procedure. They differ in *how* they resample: placebo permutes the control-vs-treated assignment, bootstrap draws with replacement. On exchangeable DGPs the two SEs typically track each other; on small panels with non-exchangeable factor structure (like the marketing geo-experiment here), they can differ in magnitude while still agreeing on significance and CI direction.\n\nAll three methods are configured on the `SyntheticDiD` *constructor*, not on `.fit()`. Use placebo by default (it's the library default; R's default is bootstrap); switch to bootstrap if you want a cross-check from a different resampling protocol; switch to jackknife if you need a deterministic, fast alternative."
   },
   {
    "cell_type": "code",
@@ -1095,4 +1085,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}
+}
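
The REGISTRY.md hunk above describes per-draw non-convergence UserWarnings being suppressed inside the bootstrap loop and replaced by a single summary warning when more than 5% of successful draws had any such event, counted once per draw. The sketch below illustrates that aggregation pattern with the standard `warnings` module only. It is not the library's implementation: `run_draws`, `draw_fn`, and `toy_draw` are hypothetical names, and treating `ValueError` as a failed draw is an assumption made for the example.

```python
import warnings


def run_draws(draw_fn, n_draws, threshold=0.05):
    """Call draw_fn(i) for each draw, silencing per-draw UserWarnings.

    Emits one summary UserWarning after the loop if the share of successful
    draws that raised at least one warning exceeds `threshold`.
    """
    estimates = []
    n_flagged = 0
    for i in range(n_draws):
        # record=True captures warnings in a list instead of showing them.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            try:
                value = draw_fn(i)
            except ValueError:
                continue  # failed draw (assumed): excluded from n_successful
        estimates.append(value)
        # Count the draw once, however many warnings it produced.
        if any(issubclass(w.category, UserWarning) for w in caught):
            n_flagged += 1

    n_successful = len(estimates)
    if n_successful and n_flagged / n_successful > threshold:
        warnings.warn(
            f"{n_flagged} of {n_successful} draws reported non-convergence",
            UserWarning,
            stacklevel=2,
        )
    return estimates


def toy_draw(i):
    """Hypothetical draw: every 10th draw warns twice (once per solver call)."""
    if i % 10 == 0:
        warnings.warn("omega solve did not converge", UserWarning)
        warnings.warn("lambda solve did not converge", UserWarning)
    return float(i)
```

Calling `run_draws(toy_draw, 100)` flags 10 of 100 draws, which is above the 5% threshold, so the caller sees one summary warning rather than 20 individual ones.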
