Address PR #351 R8 P3 cleanup

igerber · claude · igerber · commit 710f96654d44 · 2026-04-22T19:16:35.000-04:00
Four P3-only items from R8 CI review:

1. Correctly attribute R's default `vcov()` method:
   - diff_diff/synthetic_did.py:53 docstring previously claimed placebo was
     "R's default". R's `synthdid::vcov()` actually defaults to
     `method="bootstrap"`. Reword to describe placebo as the library default
     with a rationale paragraph (survey availability, perf) and cross-
     reference to the REGISTRY Note below.
   - METHODOLOGY_REVIEW.md item 5 said the same incorrect thing. Rewrite
     to frame the default as a deliberate library deviation with the same
     two-reason rationale.

2. Add a REGISTRY.md Note (default variance_method deviation from R)
   that documents the rationale explicitly (survey availability + perf)
   so the AI reviewer recognizes the deviation as documented rather than
   as an accidental contradiction between the docstring and R's actual
   default.

3. Soften the placebo-failure fallback guidance in
   `_placebo_variance_se`: the previous strings recommended
   `variance_method="bootstrap"`, which now raises `NotImplementedError`
   on every survey design. Branch on `w_control is not None` (survey
   fit) to recommend jackknife + adding controls for survey users and
   keep the bootstrap + jackknife + more-controls recommendation for
   non-survey users.

4. Fix the REGISTRY pointer to the slow dispersion guard: reference
   `TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (the
   current name) instead of the pre-rename
   `test_bootstrap_p_value_null_calibration`, and describe the new
   contract (calibration-agnostic dispersion + loose rejection-rate
   band).

5. Rephrase the REGISTRY coverage-MC narrative about jackknife
   anti-conservatism: the paper's AER §6.3 shows mixed jackknife
   evidence (98% iid — slightly conservative; 93% AR(1) — slightly
   anti-conservative). Our observed anti-conservatism aligns with the
   AR(1) branch of the paper's evidence, not a uniform "in line"
   pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
@@ -514,9 +514,13 @@ variables appear to the left of the `|` separator.
    in PR #351 along with its R-parity fixture, which had also been mis-anchored.
    The same PR added the warm-start plumbing to `compute_sdid_unit_weights` /
    `compute_time_weights` via new `init_weights=` kwargs.)*
-5. **Default `variance_method` changed to `"placebo"`** matching R's default. The R
-   package uses placebo variance by default (`synthdid_estimate` returns an object whose
-   `vcov()` uses the placebo method); our default now matches.
+5. **Default `variance_method` changed to `"placebo"`** — intentional deviation from
+   R's default (R's `synthdid::vcov()` defaults to `"bootstrap"`). The library default
+   is placebo for two reasons: (a) placebo is unconditionally available on pweight-only
+   survey designs, whereas refit bootstrap rejects every survey design in this release;
+   (b) placebo sidesteps the ~5–30× slowdown of per-draw Frank-Wolfe re-estimation in
+   refit bootstrap. See REGISTRY.md §SyntheticDiD `Note (default variance_method
+   deviation from R)` for details.
 6. **Deprecated `lambda_reg` and `zeta` params; new params are `zeta_omega` and
    `zeta_lambda`**. The old parameters had unclear semantics and did not correspond to
    the paper's notation. The new parameters directly match the paper and R package
diff --git a/diff_diff/synthetic_did.py b/diff_diff/synthetic_did.py
@@ -50,7 +50,11 @@ class SyntheticDiD(DifferenceInDifferences):
     variance_method : str, default="placebo"
         Method for variance estimation:
         - "placebo": Placebo-based variance matching R's synthdid::vcov(method="placebo").
-          Implements Algorithm 4 from Arkhangelsky et al. (2021). This is R's default.
+          Implements Algorithm 4 from Arkhangelsky et al. (2021). Library default
+          (R's default is ``"bootstrap"``; we default to placebo because it is
+          unconditionally available on pweight-only survey designs and avoids the
+          ~5–30× slowdown of the refit bootstrap). See REGISTRY.md §SyntheticDiD
+          ``Note (default variance_method deviation from R)`` for rationale.
         - "bootstrap": Paper-faithful pairs bootstrap — Arkhangelsky et al. (2021)
           Algorithm 2 step 2, also the behavior of R's default
           synthdid::vcov(method="bootstrap") (which rebinds ``attr(estimate, "opts")``
@@ -1125,10 +1129,22 @@ def _placebo_variance_se(
         # Ensure we have enough controls for the split
         n_pseudo_control = n_control - n_treated
         if n_pseudo_control < 1:
+            # Bootstrap rejects every survey design in this release, so
+            # steer survey users to jackknife (pweight-only only) or
+            # adding controls. Non-survey users can still fall back to
+            # bootstrap or jackknife.
+            fallback = (
+                "variance_method='jackknife' or adding more control units "
+                "(strata/PSU/FPC are not yet supported by any SDID variance "
+                "method)"
+                if w_control is not None
+                else "variance_method='bootstrap', variance_method='jackknife', "
+                "or adding more control units"
+            )
             warnings.warn(
                 f"Not enough control units ({n_control}) for placebo variance "
                 f"estimation with {n_treated} treated units. "
-                f"Consider using variance_method='bootstrap'.",
+                f"Consider using {fallback}.",
                 UserWarning,
                 stacklevel=3,
             )
@@ -1217,11 +1233,21 @@ def _placebo_variance_se(
         n_successful = len(placebo_estimates)
 
         if n_successful < 2:
+            # Same survey-awareness branch as the pre-replication guard
+            # above — bootstrap rejects every survey design in this
+            # release, so suggest jackknife for pweight-only fits.
+            fallback = (
+                "variance_method='jackknife' or increasing the number of "
+                "control units (strata/PSU/FPC are not yet supported by any "
+                "SDID variance method)"
+                if w_control is not None
+                else "variance_method='bootstrap' or variance_method='jackknife' "
+                "or increasing the number of control units"
+            )
             warnings.warn(
                 f"Only {n_successful} placebo replications completed successfully. "
                 f"Standard error cannot be estimated reliably. "
-                f"Consider using variance_method='bootstrap' or increasing "
-                f"the number of control units.",
+                f"Consider using {fallback}.",
                 UserWarning,
                 stacklevel=3,
             )
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -1547,6 +1547,7 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi
 - **Jackknife with non-finite LOO estimate**: Returns NaN SE. Unlike bootstrap/placebo, jackknife is deterministic and cannot skip failed iterations; NaN propagates through `var()` (matches R behavior).
 - **Jackknife with survey weights**: Guards on effective positive support (omega * w_control > 0 and w_treated > 0) after composition, not raw FW counts. Returns NaN SE if fewer than 2 effective controls or 2 positive-weight treated units. Per-iteration zero-sum guards return NaN for individual LOO iterations when remaining composed weights sum to zero.
 - **Note (survey support):** weights are supported; strata/PSU/FPC are not. The pweight-only path is accepted for `variance_method="placebo"` and `variance_method="jackknife"` — treated-side means are survey-weighted (Frank-Wolfe target and ATT formula); control-side synthetic weights are composed with survey weights post-optimization (ω_eff = ω * w_co, renormalized). Frank-Wolfe optimization itself is unweighted — survey importance enters after trajectory-matching. Covariate residualization uses WLS with survey weights. **`variance_method="bootstrap"` rejects all survey designs (including pweight-only)** in this release; see the deferred-composition Note below.
+- **Note (default variance_method deviation from R):** R's `synthdid::vcov()` defaults to `method="bootstrap"`; our `SyntheticDiD.__init__` defaults to `variance_method="placebo"`. This is an intentional library deviation for two reasons: (a) placebo is unconditionally available on pweight-only survey designs whereas refit bootstrap rejects every survey design in this release (see the deferred-composition Note below); (b) placebo avoids the ~5–30× per-draw Frank-Wolfe refit slowdown relative to the deleted fixed-weight path, which becomes prohibitive on large panels at B=200 replications. Users can opt into the R-matching default with `variance_method="bootstrap"`. Placebo (Algorithm 4) and bootstrap (Algorithm 2) both track nominal calibration in the committed coverage MC; see the calibration table below.
 - **Note (deferred survey + bootstrap composition):** Rao-Wu rescaled bootstrap composed with paper-faithful refit is not yet implemented. Reusable scaffolding for the follow-up implementer: `generate_rao_wu_weights` (in `bootstrap_utils`, retained for use by other estimators), the split into `rw_control` / `rw_treated`, the degenerate-retry check, and the treated-mean weighting pattern are portable from the fixed-weight Rao-Wu branch this release removed (recoverable via `git show 91082e5:diff_diff/synthetic_did.py` near the pre-rewrite `_bootstrap_se` body (PR #351 "Replace SDID fixed-weight bootstrap with paper-faithful refit")). The genuinely new work is a **weighted Frank-Wolfe** variant of `_sc_weight_fw` that accepts per-unit weights in the loss and regularization (`Σ rw_i ω_i Y_i,pre` / `ζ² Σ rw_i ω_i²`), threaded through `compute_sdid_unit_weights` / `compute_time_weights`. Compose-after-unweighted-FW does NOT work — it silently reproduces the fixed-weight + Rao-Wu behavior we removed (which was never paper-faithful and never matched R's default vcov). Validation: re-use the coverage MC harness with a stratified DGP and confirm near-nominal rejection rates against placebo-SE tracking. Survey + SDID variance is therefore deferred capability — pre-existing strata/PSU/FPC users have no SDID variance method on this release.
 - **Note:** P-value computation is variance-method dependent. Placebo (Algorithm 4) uses the empirical null formula `max(mean(|placebo_effects| ≥ |att|), 1/(r+1))` because permuting control indices generates draws from the null distribution (centered on 0). Bootstrap (Algorithm 2) and jackknife (Algorithm 3) use the analytical p-value from `safe_inference(att, se)` (normal-theory): bootstrap draws are centered on `τ̂` (sampling distribution of the estimator) and jackknife pseudo-values are not null draws, so the empirical null formula is invalid for them. This matches R's `synthdid::vcov()` convention, where variance is returned and inference is normal-theory from the SE.
 - **Note (coverage Monte Carlo calibration):** `benchmarks/data/sdid_coverage.json` carries empirical rejection rates across all three variance methods on 3 representative null-panel DGPs (500 seeds × B=200, regenerable via `benchmarks/python/coverage_sdid.py`). Under H0 the nominal rejection rate at each α equals α; rates substantially above α indicate anti-conservatism, rates below indicate over-coverage.
@@ -1563,11 +1564,11 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi
     | AER §6.3                                                  | bootstrap  | 0.010  | 0.040  | 0.078  | 1.05 |
     | AER §6.3                                                  | jackknife  | 0.030  | 0.080  | 0.150  | 0.90 |
 
-    Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative on the smaller panels (balanced, AER §6.3), in line with Arkhangelsky et al. (2021) §6.3's reported 98% coverage (iid) and 93% coverage (AR(1) ρ=0.7). The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds.
+    Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative on the smaller panels (balanced, AER §6.3) at α=0.05 (rejection 0.112 and 0.080 vs the 0.05 target). Arkhangelsky et al. (2021) §6.3 reports mixed jackknife evidence (98% coverage — slightly conservative — under iid, and 93% coverage — slightly anti-conservative — under AR(1) ρ=0.7), so the direction of our observation is consistent with the AR(1) branch of the paper's evidence rather than the iid branch. The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds.
 
     The schema smoke test is `TestCoverageMCArtifact::test_coverage_artifacts_present`; regenerate the JSON via `python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200 --output benchmarks/data/sdid_coverage.json` (~15–40 min on M-series Mac, Rust backend — warm-start convergence makes newer runs faster than the original cold-start one).
 
-    **Artifact cadence (documentation-grade, not pinned-regression-grade):** this file is a documentation substrate — the table above transcribes from it, and the schema test just checks structure. It is **not** numerically pinned by any regression test, because MC noise at 500 seeds × B=200 (2σ ≈ 0.02–0.05 per cell) makes tight bounds fragile and loose bounds uninformative. The runtime characterization test that guards the one regression class worth catching (dispatch breakage → rejection rate ≈0 or ≈0.5) is `TestPValueSemantics::test_bootstrap_p_value_null_calibration` (slow). Regenerate the JSON when a methodology change materially shifts per-draw numerics — SE formula, new variance method, FW solver swap, paper-procedure correction. **Do not** regenerate on refactors that preserve the FW global optimum (warm-start vs cold-start, backend migration, pure renames, docstring fixes) — those stay within MC noise of the committed numbers. Per-seed bit-identity on the captured fixture is the cheaper, stricter check: see `TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]` at `rel=1e-14`.
+    **Artifact cadence (documentation-grade, not pinned-regression-grade):** this file is a documentation substrate — the table above transcribes from it, and the schema test just checks structure. It is **not** numerically pinned by any regression test, because MC noise at 500 seeds × B=200 (2σ ≈ 0.02–0.05 per cell) makes tight bounds fragile and loose bounds uninformative. The runtime characterization test that guards the one regression class worth catching (dispatch breakage → rejection rate ≈0 or ≈0.5, or SE-collapse) is `TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (slow; calibration-agnostic dispersion + loose rejection-rate band). Regenerate the JSON when a methodology change materially shifts per-draw numerics — SE formula, new variance method, FW solver swap, paper-procedure correction. **Do not** regenerate on refactors that preserve the FW global optimum (warm-start vs cold-start, backend migration, pure renames, docstring fixes) — those stay within MC noise of the committed numbers. Per-seed bit-identity on the captured fixture is the cheaper, stricter check: see `TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]` at `rel=1e-14`.
 - **Note:** Internal Y normalization. Before weight optimization, the estimator, and variance procedures, `fit()` centers Y by `mean(Y_pre_control)` and scales by `std(Y_pre_control)`; `Y_scale` falls back to `1.0` when std is non-finite or below `1e-12 * max(|mean|, 1)`. Auto-regularization and `noise_level` are computed on normalized Y; user-supplied `zeta_omega` / `zeta_lambda` are divided by `Y_scale` internally for Frank-Wolfe. τ, SE, CI, the placebo/bootstrap/jackknife effect vectors, `results_.noise_level`, and `results_.zeta_omega` / `results_.zeta_lambda` are all reported on the user's original outcome scale (user-supplied zetas are echoed back exactly to avoid float roundoff). Mathematically a no-op — τ is location-invariant and scale-equivariant, and FW weights are invariant under `(Y, ζ) → (Y/s, ζ/s)` — but prevents catastrophic cancellation in the SDID double-difference when outcomes span millions-to-billions (see synth-inference/synthdid#71 for the R-package version of this issue). Normalization constants are derived from controls' pre-period only so the reference is unaffected by treatment. `in_time_placebo()` and `sensitivity_to_zeta_omega()` reuse the exact same `Y_shift` / `Y_scale` captured on the fit snapshot: they normalize the re-sliced arrays before re-running Frank-Wolfe, pass `zeta / Y_scale` to the weight solvers, and rescale the returned `att` and `pre_fit_rmse` by `Y_scale` before reporting; unit-weight diagnostics (`max_unit_weight`, `effective_n`) are scale-invariant and reported directly.
 
 *Validation diagnostics (post-fit methods on `SyntheticDiDResults`):*