You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Four P3-only items from R8 CI review:
1. Correctly attribute R's default `vcov()` method:
- diff_diff/synthetic_did.py:53 docstring previously claimed placebo was
"R's default". R's `synthdid::vcov()` actually defaults to
`method="bootstrap"`. Reword to describe placebo as the library default
with a rationale paragraph (survey availability, perf) and cross-
reference to the REGISTRY Note below.
- METHODOLOGY_REVIEW.md item 5 said the same incorrect thing. Rewrite
to frame the default as a deliberate library deviation with the same
two-reason rationale.
2. Add a REGISTRY.md Note (default variance_method deviation from R)
that documents the rationale explicitly (survey availability + perf)
so the AI reviewer recognizes the deviation as documented rather than
as an accidental contradiction between the docstring and R's actual
default.
3. Soften the placebo-failure fallback guidance in
`_placebo_variance_se`: the previous strings recommended
`variance_method="bootstrap"`, which now raises `NotImplementedError`
on every survey design. Branch on `w_control is not None` (survey
fit) to recommend jackknife + adding controls for survey users and
keep the bootstrap + jackknife + more-controls recommendation for
non-survey users.
4. Fix the REGISTRY pointer to the slow dispersion guard: reference
`TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (the
current name) instead of the pre-rename
`test_bootstrap_p_value_null_calibration`, and describe the new
contract (calibration-agnostic dispersion + loose rejection-rate
band).
5. Rephrase the REGISTRY coverage-MC narrative about jackknife
anti-conservatism: the paper's AER §6.3 shows mixed jackknife
evidence (98% iid — slightly conservative; 93% AR(1) — slightly
anti-conservative). Our observed anti-conservatism aligns with the
AR(1) branch of the paper's evidence, not a uniform "in line"
pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy file name to clipboardExpand all lines: docs/methodology/REGISTRY.md
+3-2Lines changed: 3 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -1547,6 +1547,7 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi
1547
1547
-**Jackknife with non-finite LOO estimate**: Returns NaN SE. Unlike bootstrap/placebo, jackknife is deterministic and cannot skip failed iterations; NaN propagates through `var()` (matches R behavior).
1548
1548
-**Jackknife with survey weights**: Guards on effective positive support (omega * w_control > 0 and w_treated > 0) after composition, not raw FW counts. Returns NaN SE if fewer than 2 effective controls or 2 positive-weight treated units. Per-iteration zero-sum guards return NaN for individual LOO iterations when remaining composed weights sum to zero.
1549
1549
-**Note (survey support):** weights are supported; strata/PSU/FPC are not. The pweight-only path is accepted for `variance_method="placebo"` and `variance_method="jackknife"` — treated-side means are survey-weighted (Frank-Wolfe target and ATT formula); control-side synthetic weights are composed with survey weights post-optimization (ω_eff = ω * w_co, renormalized). Frank-Wolfe optimization itself is unweighted — survey importance enters after trajectory-matching. Covariate residualization uses WLS with survey weights. **`variance_method="bootstrap"` rejects all survey designs (including pweight-only)** in this release; see the deferred-composition Note below.
1550
+
-**Note (default variance_method deviation from R):** R's `synthdid::vcov()` defaults to `method="bootstrap"`; our `SyntheticDiD.__init__` defaults to `variance_method="placebo"`. This is an intentional library deviation for two reasons: (a) placebo is unconditionally available on pweight-only survey designs whereas refit bootstrap rejects every survey design in this release (see the deferred-composition Note below); (b) placebo avoids the ~5–30× per-draw Frank-Wolfe refit slowdown relative to the deleted fixed-weight path, which becomes prohibitive on large panels at B=200 replications. Users can opt into the R-matching default with `variance_method="bootstrap"`. Placebo (Algorithm 4) and bootstrap (Algorithm 2) both track nominal calibration in the committed coverage MC; see the calibration table below.
1550
1551
- **Note (deferred survey + bootstrap composition):** Rao-Wu rescaled bootstrap composed with paper-faithful refit is not yet implemented. Reusable scaffolding for the follow-up implementer: `generate_rao_wu_weights` (in `bootstrap_utils`, retained for use by other estimators), the split into `rw_control` / `rw_treated`, the degenerate-retry check, and the treated-mean weighting pattern are portable from the fixed-weight Rao-Wu branch this release removed (recoverable via `git show 91082e5:diff_diff/synthetic_did.py` near the pre-rewrite `_bootstrap_se` body (PR #351 "Replace SDID fixed-weight bootstrap with paper-faithful refit")). The genuinely new work is a **weighted Frank-Wolfe** variant of `_sc_weight_fw` that accepts per-unit weights in the loss and regularization (`Σ rw_i ω_i Y_i,pre` / `ζ² Σ rw_i ω_i²`), threaded through `compute_sdid_unit_weights` / `compute_time_weights`. Compose-after-unweighted-FW does NOT work — it silently reproduces the fixed-weight + Rao-Wu behavior we removed (which was never paper-faithful and never matched R's default vcov). Validation: re-use the coverage MC harness with a stratified DGP and confirm near-nominal rejection rates against placebo-SE tracking. Survey + SDID variance is therefore deferred capability — pre-existing strata/PSU/FPC users have no SDID variance method on this release.
1551
1552
-**Note:** P-value computation is variance-method dependent. Placebo (Algorithm 4) uses the empirical null formula `max(mean(|placebo_effects| ≥ |att|), 1/(r+1))` because permuting control indices generates draws from the null distribution (centered on 0). Bootstrap (Algorithm 2) and jackknife (Algorithm 3) use the analytical p-value from `safe_inference(att, se)` (normal-theory): bootstrap draws are centered on `τ̂` (sampling distribution of the estimator) and jackknife pseudo-values are not null draws, so the empirical null formula is invalid for them. This matches R's `synthdid::vcov()` convention, where variance is returned and inference is normal-theory from the SE.
1552
1553
-**Note (coverage Monte Carlo calibration):**`benchmarks/data/sdid_coverage.json` carries empirical rejection rates across all three variance methods on 3 representative null-panel DGPs (500 seeds × B=200, regenerable via `benchmarks/python/coverage_sdid.py`). Under H0 the nominal rejection rate at each α equals α; rates substantially above α indicate anti-conservatism, rates below indicate over-coverage.
@@ -1563,11 +1564,11 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi
Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative on the smaller panels (balanced, AER §6.3), in line with Arkhangelsky et al. (2021) §6.3's reported 98% coverage (iid) and 93% coverage (AR(1) ρ=0.7). The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds.
1567
+
Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative on the smaller panels (balanced, AER §6.3) at α=0.05 (rejection 0.112 and 0.080 vs the 0.05 target). Arkhangelsky et al. (2021) §6.3 reports mixed jackknife evidence (98% coverage — slightly conservative — under iid, and 93% coverage — slightly anti-conservative — under AR(1) ρ=0.7), so the direction of our observation is consistent with the AR(1) branch of the paper's evidence rather than the iid branch. The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds.
1567
1568
1568
1569
The schema smoke test is `TestCoverageMCArtifact::test_coverage_artifacts_present`; regenerate the JSON via `python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200 --output benchmarks/data/sdid_coverage.json` (~15–40 min on M-series Mac, Rust backend — warm-start convergence makes newer runs faster than the original cold-start one).
1569
1570
1570
-
**Artifact cadence (documentation-grade, not pinned-regression-grade):** this file is a documentation substrate — the table above transcribes from it, and the schema test just checks structure. It is **not** numerically pinned by any regression test, because MC noise at 500 seeds × B=200 (2σ ≈ 0.02–0.05 per cell) makes tight bounds fragile and loose bounds uninformative. The runtime characterization test that guards the one regression class worth catching (dispatch breakage → rejection rate ≈0 or ≈0.5) is `TestPValueSemantics::test_bootstrap_p_value_null_calibration` (slow). Regenerate the JSON when a methodology change materially shifts per-draw numerics — SE formula, new variance method, FW solver swap, paper-procedure correction. **Do not** regenerate on refactors that preserve the FW global optimum (warm-start vs cold-start, backend migration, pure renames, docstring fixes) — those stay within MC noise of the committed numbers. Per-seed bit-identity on the captured fixture is the cheaper, stricter check: see `TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]` at `rel=1e-14`.
1571
+
**Artifact cadence (documentation-grade, not pinned-regression-grade):** this file is a documentation substrate — the table above transcribes from it, and the schema test just checks structure. It is **not** numerically pinned by any regression test, because MC noise at 500 seeds × B=200 (2σ ≈ 0.02–0.05 per cell) makes tight bounds fragile and loose bounds uninformative. The runtime characterization test that guards the one regression class worth catching (dispatch breakage → rejection rate ≈0 or ≈0.5, or SE-collapse) is `TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (slow; calibration-agnostic dispersion + loose rejection-rate band). Regenerate the JSON when a methodology change materially shifts per-draw numerics — SE formula, new variance method, FW solver swap, paper-procedure correction. **Do not** regenerate on refactors that preserve the FW global optimum (warm-start vs cold-start, backend migration, pure renames, docstring fixes) — those stay within MC noise of the committed numbers. Per-seed bit-identity on the captured fixture is the cheaper, stricter check: see `TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]` at `rel=1e-14`.
1571
1572
- **Note:** Internal Y normalization. Before weight optimization, the estimator, and variance procedures, `fit()` centers Y by `mean(Y_pre_control)` and scales by `std(Y_pre_control)`; `Y_scale` falls back to `1.0` when std is non-finite or below `1e-12 * max(|mean|, 1)`. Auto-regularization and `noise_level` are computed on normalized Y; user-supplied `zeta_omega` / `zeta_lambda` are divided by `Y_scale` internally for Frank-Wolfe. τ, SE, CI, the placebo/bootstrap/jackknife effect vectors, `results_.noise_level`, and `results_.zeta_omega` / `results_.zeta_lambda` are all reported on the user's original outcome scale (user-supplied zetas are echoed back exactly to avoid float roundoff). Mathematically a no-op — τ is location-invariant and scale-equivariant, and FW weights are invariant under `(Y, ζ) → (Y/s, ζ/s)` — but prevents catastrophic cancellation in the SDID double-difference when outcomes span millions-to-billions (see synth-inference/synthdid#71 for the R-package version of this issue). Normalization constants are derived from controls' pre-period only so the reference is unaffected by treatment. `in_time_placebo()` and `sensitivity_to_zeta_omega()` reuse the exact same `Y_shift` / `Y_scale` captured on the fit snapshot: they normalize the re-sliced arrays before re-running Frank-Wolfe, pass `zeta / Y_scale` to the weight solvers, and rescale the returned `att` and `pre_fit_rmse` by `Y_scale` before reporting; unit-weight diagnostics (`max_unit_weight`, `effective_n`) are scale-invariant and reported directly.
1572
1573
1573
1574
*Validation diagnostics (post-fit methods on `SyntheticDiDResults`):*
0 commit comments