Skip to content

Commit 710f966

Browse files
igerberclaude
andcommitted
Address PR #351 R8 P3 cleanup
Four P3-only items from R8 CI review: 1. Correctly attribute R's default `vcov()` method: - diff_diff/synthetic_did.py:53 docstring previously claimed placebo was "R's default". R's `synthdid::vcov()` actually defaults to `method="bootstrap"`. Reword to describe placebo as the library default with a rationale paragraph (survey availability, perf) and cross- reference to the REGISTRY Note below. - METHODOLOGY_REVIEW.md item 5 said the same incorrect thing. Rewrite to frame the default as a deliberate library deviation with the same two-reason rationale. 2. Add a REGISTRY.md Note (default variance_method deviation from R) that documents the rationale explicitly (survey availability + perf) so the AI reviewer recognizes the deviation as documented rather than as an accidental contradiction between the docstring and R's actual default. 3. Soften the placebo-failure fallback guidance in `_placebo_variance_se`: the previous strings recommended `variance_method="bootstrap"`, which now raises `NotImplementedError` on every survey design. Branch on `w_control is not None` (survey fit) to recommend jackknife + adding controls for survey users and keep the bootstrap + jackknife + more-controls recommendation for non-survey users. 4. Fix the REGISTRY pointer to the slow dispersion guard: reference `TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (the current name) instead of the pre-rename `test_bootstrap_p_value_null_calibration`, and describe the new contract (calibration-agnostic dispersion + loose rejection-rate band). 5. Rephrase the REGISTRY coverage-MC narrative about jackknife anti-conservatism: the paper's AER §6.3 shows mixed jackknife evidence (98% iid — slightly conservative; 93% AR(1) — slightly anti-conservative). Our observed anti-conservatism aligns with the AR(1) branch of the paper's evidence, not a uniform "in line" pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9280cdd commit 710f966

3 files changed

Lines changed: 40 additions & 9 deletions

File tree

METHODOLOGY_REVIEW.md

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -514,9 +514,13 @@ variables appear to the left of the `|` separator.
514514
in PR #351 along with its R-parity fixture, which had also been mis-anchored.
515515
The same PR added the warm-start plumbing to `compute_sdid_unit_weights` /
516516
`compute_time_weights` via new `init_weights=` kwargs.)*
517-
5. **Default `variance_method` changed to `"placebo"`** matching R's default. The R
518-
package uses placebo variance by default (`synthdid_estimate` returns an object whose
519-
`vcov()` uses the placebo method); our default now matches.
517+
5. **Default `variance_method` changed to `"placebo"`** — intentional deviation from
518+
R's default (R's `synthdid::vcov()` defaults to `"bootstrap"`). The library default
519+
is placebo for two reasons: (a) placebo is unconditionally available on pweight-only
520+
survey designs, whereas refit bootstrap rejects every survey design in this release;
521+
(b) placebo sidesteps the ~5–30× slowdown of per-draw Frank-Wolfe re-estimation in
522+
refit bootstrap. See REGISTRY.md §SyntheticDiD `Note (default variance_method
523+
deviation from R)` for details.
520524
6. **Deprecated `lambda_reg` and `zeta` params; new params are `zeta_omega` and
521525
`zeta_lambda`**. The old parameters had unclear semantics and did not correspond to
522526
the paper's notation. The new parameters directly match the paper and R package

diff_diff/synthetic_did.py

Lines changed: 30 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -50,7 +50,11 @@ class SyntheticDiD(DifferenceInDifferences):
5050
variance_method : str, default="placebo"
5151
Method for variance estimation:
5252
- "placebo": Placebo-based variance matching R's synthdid::vcov(method="placebo").
53-
Implements Algorithm 4 from Arkhangelsky et al. (2021). This is R's default.
53+
Implements Algorithm 4 from Arkhangelsky et al. (2021). Library default
54+
(R's default is ``"bootstrap"``; we default to placebo because it is
55+
unconditionally available on pweight-only survey designs and avoids the
56+
~5–30× slowdown of the refit bootstrap). See REGISTRY.md §SyntheticDiD
57+
``Note (default variance_method deviation from R)`` for rationale.
5458
- "bootstrap": Paper-faithful pairs bootstrap — Arkhangelsky et al. (2021)
5559
Algorithm 2 step 2, also the behavior of R's default
5660
synthdid::vcov(method="bootstrap") (which rebinds ``attr(estimate, "opts")``
@@ -1125,10 +1129,22 @@ def _placebo_variance_se(
11251129
# Ensure we have enough controls for the split
11261130
n_pseudo_control = n_control - n_treated
11271131
if n_pseudo_control < 1:
1132+
# Bootstrap rejects every survey design in this release, so
1133+
# steer survey users to jackknife (pweight-only only) or
1134+
# adding controls. Non-survey users can still fall back to
1135+
# bootstrap or jackknife.
1136+
fallback = (
1137+
"variance_method='jackknife' or adding more control units "
1138+
"(strata/PSU/FPC are not yet supported by any SDID variance "
1139+
"method)"
1140+
if w_control is not None
1141+
else "variance_method='bootstrap', variance_method='jackknife', "
1142+
"or adding more control units"
1143+
)
11281144
warnings.warn(
11291145
f"Not enough control units ({n_control}) for placebo variance "
11301146
f"estimation with {n_treated} treated units. "
1131-
f"Consider using variance_method='bootstrap'.",
1147+
f"Consider using {fallback}.",
11321148
UserWarning,
11331149
stacklevel=3,
11341150
)
@@ -1217,11 +1233,21 @@ def _placebo_variance_se(
12171233
n_successful = len(placebo_estimates)
12181234

12191235
if n_successful < 2:
1236+
# Same survey-awareness branch as the pre-replication guard
1237+
# above — bootstrap rejects every survey design in this
1238+
# release, so suggest jackknife for pweight-only fits.
1239+
fallback = (
1240+
"variance_method='jackknife' or increasing the number of "
1241+
"control units (strata/PSU/FPC are not yet supported by any "
1242+
"SDID variance method)"
1243+
if w_control is not None
1244+
else "variance_method='bootstrap' or variance_method='jackknife' "
1245+
"or increasing the number of control units"
1246+
)
12201247
warnings.warn(
12211248
f"Only {n_successful} placebo replications completed successfully. "
12221249
f"Standard error cannot be estimated reliably. "
1223-
f"Consider using variance_method='bootstrap' or increasing "
1224-
f"the number of control units.",
1250+
f"Consider using {fallback}.",
12251251
UserWarning,
12261252
stacklevel=3,
12271253
)

docs/methodology/REGISTRY.md

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1547,6 +1547,7 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi
15471547
- **Jackknife with non-finite LOO estimate**: Returns NaN SE. Unlike bootstrap/placebo, jackknife is deterministic and cannot skip failed iterations; NaN propagates through `var()` (matches R behavior).
15481548
- **Jackknife with survey weights**: Guards on effective positive support (omega * w_control > 0 and w_treated > 0) after composition, not raw FW counts. Returns NaN SE if fewer than 2 effective controls or 2 positive-weight treated units. Per-iteration zero-sum guards return NaN for individual LOO iterations when remaining composed weights sum to zero.
15491549
- **Note (survey support):** weights are supported; strata/PSU/FPC are not. The pweight-only path is accepted for `variance_method="placebo"` and `variance_method="jackknife"` — treated-side means are survey-weighted (Frank-Wolfe target and ATT formula); control-side synthetic weights are composed with survey weights post-optimization (ω_eff = ω * w_co, renormalized). Frank-Wolfe optimization itself is unweighted — survey importance enters after trajectory-matching. Covariate residualization uses WLS with survey weights. **`variance_method="bootstrap"` rejects all survey designs (including pweight-only)** in this release; see the deferred-composition Note below.
1550+
- **Note (default variance_method deviation from R):** R's `synthdid::vcov()` defaults to `method="bootstrap"`; our `SyntheticDiD.__init__` defaults to `variance_method="placebo"`. This is an intentional library deviation for two reasons: (a) placebo is unconditionally available on pweight-only survey designs whereas refit bootstrap rejects every survey design in this release (see the deferred-composition Note below); (b) placebo avoids the ~5–30× per-draw Frank-Wolfe refit slowdown relative to the deleted fixed-weight path, which becomes prohibitive on large panels at B=200 replications. Users can opt into the R-matching default with `variance_method="bootstrap"`. Placebo (Algorithm 4) and bootstrap (Algorithm 2) both track nominal calibration in the committed coverage MC; see the calibration table below.
15501551
- **Note (deferred survey + bootstrap composition):** Rao-Wu rescaled bootstrap composed with paper-faithful refit is not yet implemented. Reusable scaffolding for the follow-up implementer: `generate_rao_wu_weights` (in `bootstrap_utils`, retained for use by other estimators), the split into `rw_control` / `rw_treated`, the degenerate-retry check, and the treated-mean weighting pattern are portable from the fixed-weight Rao-Wu branch this release removed (recoverable via `git show 91082e5:diff_diff/synthetic_did.py` near the pre-rewrite `_bootstrap_se` body (PR #351 "Replace SDID fixed-weight bootstrap with paper-faithful refit")). The genuinely new work is a **weighted Frank-Wolfe** variant of `_sc_weight_fw` that accepts per-unit weights in the loss and regularization (`Σ rw_i ω_i Y_i,pre` / `ζ² Σ rw_i ω_i²`), threaded through `compute_sdid_unit_weights` / `compute_time_weights`. Compose-after-unweighted-FW does NOT work — it silently reproduces the fixed-weight + Rao-Wu behavior we removed (which was never paper-faithful and never matched R's default vcov). Validation: re-use the coverage MC harness with a stratified DGP and confirm near-nominal rejection rates against placebo-SE tracking. Survey + SDID variance is therefore deferred capability — pre-existing strata/PSU/FPC users have no SDID variance method on this release.
15511552
- **Note:** P-value computation is variance-method dependent. Placebo (Algorithm 4) uses the empirical null formula `max(mean(|placebo_effects| ≥ |att|), 1/(r+1))` because permuting control indices generates draws from the null distribution (centered on 0). Bootstrap (Algorithm 2) and jackknife (Algorithm 3) use the analytical p-value from `safe_inference(att, se)` (normal-theory): bootstrap draws are centered on `τ̂` (sampling distribution of the estimator) and jackknife pseudo-values are not null draws, so the empirical null formula is invalid for them. This matches R's `synthdid::vcov()` convention, where variance is returned and inference is normal-theory from the SE.
15521553
- **Note (coverage Monte Carlo calibration):** `benchmarks/data/sdid_coverage.json` carries empirical rejection rates across all three variance methods on 3 representative null-panel DGPs (500 seeds × B=200, regenerable via `benchmarks/python/coverage_sdid.py`). Under H0 the nominal rejection rate at each α equals α; rates substantially above α indicate anti-conservatism, rates below indicate over-coverage.
@@ -1563,11 +1564,11 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi
15631564
| AER §6.3 | bootstrap | 0.010 | 0.040 | 0.078 | 1.05 |
15641565
| AER §6.3 | jackknife | 0.030 | 0.080 | 0.150 | 0.90 |
15651566

1566-
Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative on the smaller panels (balanced, AER §6.3), in line with Arkhangelsky et al. (2021) §6.3's reported 98% coverage (iid) and 93% coverage (AR(1) ρ=0.7). The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds.
1567+
Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative on the smaller panels (balanced, AER §6.3) at α=0.05 (rejection 0.112 and 0.080 vs the 0.05 target). Arkhangelsky et al. (2021) §6.3 reports mixed jackknife evidence (98% coverage — slightly conservative — under iid, and 93% coverage — slightly anti-conservative — under AR(1) ρ=0.7), so the direction of our observation is consistent with the AR(1) branch of the paper's evidence rather than the iid branch. The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds.
15671568

15681569
The schema smoke test is `TestCoverageMCArtifact::test_coverage_artifacts_present`; regenerate the JSON via `python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200 --output benchmarks/data/sdid_coverage.json` (~15–40 min on M-series Mac, Rust backend — warm-start convergence makes newer runs faster than the original cold-start one).
15691570

1570-
**Artifact cadence (documentation-grade, not pinned-regression-grade):** this file is a documentation substrate — the table above transcribes from it, and the schema test just checks structure. It is **not** numerically pinned by any regression test, because MC noise at 500 seeds × B=200 (2σ ≈ 0.02–0.05 per cell) makes tight bounds fragile and loose bounds uninformative. The runtime characterization test that guards the one regression class worth catching (dispatch breakage → rejection rate ≈0 or ≈0.5) is `TestPValueSemantics::test_bootstrap_p_value_null_calibration` (slow). Regenerate the JSON when a methodology change materially shifts per-draw numerics — SE formula, new variance method, FW solver swap, paper-procedure correction. **Do not** regenerate on refactors that preserve the FW global optimum (warm-start vs cold-start, backend migration, pure renames, docstring fixes) — those stay within MC noise of the committed numbers. Per-seed bit-identity on the captured fixture is the cheaper, stricter check: see `TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]` at `rel=1e-14`.
1571+
**Artifact cadence (documentation-grade, not pinned-regression-grade):** this file is a documentation substrate — the table above transcribes from it, and the schema test just checks structure. It is **not** numerically pinned by any regression test, because MC noise at 500 seeds × B=200 (2σ ≈ 0.02–0.05 per cell) makes tight bounds fragile and loose bounds uninformative. The runtime characterization test that guards the one regression class worth catching (dispatch breakage → rejection rate ≈0 or ≈0.5, or SE-collapse) is `TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (slow; calibration-agnostic dispersion + loose rejection-rate band). Regenerate the JSON when a methodology change materially shifts per-draw numerics — SE formula, new variance method, FW solver swap, paper-procedure correction. **Do not** regenerate on refactors that preserve the FW global optimum (warm-start vs cold-start, backend migration, pure renames, docstring fixes) — those stay within MC noise of the committed numbers. Per-seed bit-identity on the captured fixture is the cheaper, stricter check: see `TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]` at `rel=1e-14`.
15711572
- **Note:** Internal Y normalization. Before weight optimization, the estimator, and variance procedures, `fit()` centers Y by `mean(Y_pre_control)` and scales by `std(Y_pre_control)`; `Y_scale` falls back to `1.0` when std is non-finite or below `1e-12 * max(|mean|, 1)`. Auto-regularization and `noise_level` are computed on normalized Y; user-supplied `zeta_omega` / `zeta_lambda` are divided by `Y_scale` internally for Frank-Wolfe. τ, SE, CI, the placebo/bootstrap/jackknife effect vectors, `results_.noise_level`, and `results_.zeta_omega` / `results_.zeta_lambda` are all reported on the user's original outcome scale (user-supplied zetas are echoed back exactly to avoid float roundoff). Mathematically a no-op — τ is location-invariant and scale-equivariant, and FW weights are invariant under `(Y, ζ) → (Y/s, ζ/s)` — but prevents catastrophic cancellation in the SDID double-difference when outcomes span millions-to-billions (see synth-inference/synthdid#71 for the R-package version of this issue). Normalization constants are derived from controls' pre-period only so the reference is unaffected by treatment. `in_time_placebo()` and `sensitivity_to_zeta_omega()` reuse the exact same `Y_shift` / `Y_scale` captured on the fit snapshot: they normalize the re-sliced arrays before re-running Frank-Wolfe, pass `zeta / Y_scale` to the weight solvers, and rescale the returned `att` and `pre_fit_rmse` by `Y_scale` before reporting; unit-weight diagnostics (`max_unit_weight`, `effective_n`) are scale-invariant and reported directly.
15721573

15731574
*Validation diagnostics (post-fit methods on `SyntheticDiDResults`):*

0 commit comments

Comments
 (0)