Skip to content

Commit 08bcbfe

Browse files
authored
Merge pull request #354 from igerber/fix/trop-bootstrap-rng-parity
Fix TROP bootstrap SE backend divergence under fixed seed
2 parents 172d1d8 + 5ce5b8f commit 08bcbfe

8 files changed

Lines changed: 531 additions & 126 deletions

File tree

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
1616
- SyntheticDiD `variance_method="bootstrap"` now computes p-values from the analytical normal-theory formula using the bootstrap SE (matching R's `synthdid::vcov()` convention), rather than an empirical null-distribution formula that is not valid for bootstrap draws. `is_significant` and `significance_stars` are derived from `p_value` and will also change for bootstrap fits. Placebo and jackknife are unchanged. Point estimates are unaffected.
1717
- SyntheticDiD bootstrap SE formula applies the `sqrt((r-1)/r)` correction matching R's synthdid and the placebo SE formula.
1818
- SyntheticDiD bootstrap now retries degenerate resamples (all-control or all-treated, or non-finite `τ_b`) until exactly `n_bootstrap` valid replicates are accumulated, matching R's `synthdid::bootstrap_sample` and Arkhangelsky et al. (2021) Algorithm 2. Previously the Python path counted attempts (with degenerate draws silently dropped), producing fewer valid replicates than requested. A bounded-attempt guard (`20 × n_bootstrap`) prevents pathological-input hangs.
19+
- **TROP global bootstrap SE backend parity under fixed seed** — Rust and Python backends now produce bit-identical bootstrap SE under the same `seed`. Previously Rust's `bootstrap_trop_variance_global` seeded `rand_xoshiro::Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), producing ~28% SE divergence on tiny panels under `seed=42`. Fixed by extracting a shared `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` that pre-generates per-replicate stratified sample indices via numpy on the Python side; both backends consume the same integer arrays through the PyO3 surface. Sampling law (stratified: controls then treated, with replacement) is unchanged. Closes silent-failures audit finding #23 (bootstrap half; grid-search half closed in PR #348). Local-method TROP also adopts the Python-canonical index contract for the RNG layer, but separately-discovered backend divergences (Rust normalizes weight-matrix outer product, Python `_compute_observation_weights` reads stale `_precomputed` cache) prevent local-method bit-identity SE; tracked as a follow-up in `TODO.md`.
1920

2021
### Changed
2122
- **SyntheticDiD bootstrap no longer supports survey designs** (capability regression). The removed fixed-weight bootstrap path was the only SDID variance method that supported strata/PSU/FPC (via Rao-Wu rescaled bootstrap); the new paper-faithful refit bootstrap rejects all survey designs (including pweight-only) with `NotImplementedError`. Pweight-only users can switch to `variance_method="placebo"` or `"jackknife"`. Strata/PSU/FPC users have no SDID variance option on this release. Composing Rao-Wu rescaled weights with Frank-Wolfe re-estimation requires a separate derivation (weighted FW solver); sketch and reusable scaffolding pointers are in `docs/methodology/REGISTRY.md` §SyntheticDiD and `TODO.md`.

TODO.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -83,7 +83,8 @@ Deferred items from PR reviews that were not addressed before merge.
8383
| Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium |
8484
| Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R (`Rscript benchmarks/R/generate_clubsandwich_golden.R`). Current JSON has `source: python_self_reference` as a stability anchor until an authoritative R run. | `benchmarks/R/generate_clubsandwich_golden.R` | Phase 1a | Medium |
8585
| `honest_did.py:1907` `np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | `honest_did.py` | #334 | Low |
86-
| TROP Rust vs Python bootstrap SE divergence under fixed seed: `seed=42` on a tiny panel produces ~28% bootstrap-SE gap. Root cause: Rust bootstrap uses its own RNG (`rand` crate) while Python uses `numpy.random.default_rng`; same seed value maps to different bytestreams across backends. Audit axis-H (RNG/seed) adjacent. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility` baselines the gap. Unifying RNG (threading a numpy-generated seed-sequence into Rust, or porting Python to ChaCha) would close it. | `trop_global.py`, `rust/` | follow-up | Medium |
86+
| TROP local-method Rust vs Python bootstrap SE divergence beyond RNG: with stratified indices now Python-canonical (closes RNG axis-H), local-method SE still diverges by 10-20% on tiny panels. Two downstream causes: (a) Rust `compute_weight_matrix` (`rust/src/trop.rs:573-606`) normalizes time_weights and unit_weights to sum to 1 before the outer product; Python `_compute_observation_weights` (`diff_diff/trop_local.py:489`) does not normalize. (b) Python `_compute_observation_weights` reads `self._precomputed["Y"]`, `["D"]`, `["time_dist_matrix"]` (lines 423-458) — original-panel cache — rather than the bootstrap-sample data passed via the function arguments, so unit-distance computation in the Python fallback uses stale data. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local` baselines the gap across `seeds=[0, 42, 12345]`. Affects local-method TROP backend parity broadly, not just bootstrap. | `trop_local.py`, `rust/src/trop.rs` | follow-up | Medium |
87+
| Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #NNN): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium |
8788
| `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
8889
| `bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | `diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Medium |
8990
| `bias_corrected_local_linear`: support `weights=` once survey-design adaptation lands. nprobust's `lprobust` has no weight argument so there is no parity anchor; derivation needed. | `diff_diff/local_linear.py`, `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Medium |

diff_diff/bootstrap_utils.py

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,9 +24,50 @@
2424
"compute_effect_bootstrap_stats",
2525
"compute_effect_bootstrap_stats_batch",
2626
"warn_bootstrap_failure_rate",
27+
"stratified_bootstrap_indices",
2728
]
2829

2930

31+
def stratified_bootstrap_indices(
32+
rng: np.random.Generator,
33+
n_control: int,
34+
n_treated: int,
35+
n_bootstrap: int,
36+
) -> Tuple[np.ndarray, np.ndarray]:
37+
"""Generate stratified bootstrap sample indices.
38+
39+
Draws controls first, then treated, per replicate. A single ``rng``
40+
advances through all replicates so Python and Rust consumers share an
41+
identical bytestream under the same seed.
42+
43+
Parameters
44+
----------
45+
rng : np.random.Generator
46+
Seeded numpy generator. Consumed positionally; caller owns seeding.
47+
n_control : int
48+
Size of the control pool. Indices drawn in ``[0, n_control)``.
49+
n_treated : int
50+
Size of the treated pool. Indices drawn in ``[0, n_treated)``.
51+
n_bootstrap : int
52+
Number of bootstrap replicates.
53+
54+
Returns
55+
-------
56+
(control_indices, treated_indices) : tuple of np.ndarray
57+
Shapes ``(n_bootstrap, n_control)`` and ``(n_bootstrap, n_treated)``,
58+
dtype ``int64``. Each row is one replicate's sample-with-replacement
59+
indices. Callers map these back to unit identities via their own pool.
60+
"""
61+
control_idx = np.empty((n_bootstrap, n_control), dtype=np.int64)
62+
treated_idx = np.empty((n_bootstrap, n_treated), dtype=np.int64)
63+
for b in range(n_bootstrap):
64+
if n_control > 0:
65+
control_idx[b] = rng.choice(n_control, size=n_control, replace=True)
66+
if n_treated > 0:
67+
treated_idx[b] = rng.choice(n_treated, size=n_treated, replace=True)
68+
return control_idx, treated_idx
69+
70+
3071
def warn_bootstrap_failure_rate(
3172
n_success: int,
3273
n_attempted: int,

diff_diff/trop_global.py

Lines changed: 29 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,10 @@
2424
_rust_bootstrap_trop_variance_global,
2525
_rust_loocv_grid_search_global,
2626
)
27-
from diff_diff.bootstrap_utils import warn_bootstrap_failure_rate
27+
from diff_diff.bootstrap_utils import (
28+
stratified_bootstrap_indices,
29+
warn_bootstrap_failure_rate,
30+
)
2831
from diff_diff.trop_local import _soft_threshold_svd, _validate_and_pivot_treatment
2932
from diff_diff.trop_results import TROPResults
3033
from diff_diff.utils import safe_inference, warn_if_not_converged
@@ -961,6 +964,21 @@ def _bootstrap_variance_global(
961964
survey_design,
962965
)
963966

967+
# Stratified bootstrap pools (shared by Rust and Python paths)
968+
unit_ever_treated = data.groupby(unit)[treatment].max()
969+
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index.tolist())
970+
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index.tolist())
971+
n_treated_units = len(treated_units)
972+
n_control_units = len(control_units)
973+
974+
# Pre-generate stratified bootstrap indices via numpy (Python-canonical RNG).
975+
# Both backends consume these indices so SE is identical under the same seed
976+
# (silent-failures finding #23, bootstrap half).
977+
rng = np.random.default_rng(self.seed)
978+
control_idx, treated_idx = stratified_bootstrap_indices(
979+
rng, n_control_units, n_treated_units, self.n_bootstrap
980+
)
981+
964982
# Try Rust backend for parallel bootstrap (5-15x speedup)
965983
# Only used for pweight-only designs (no strata/PSU/FPC)
966984
if HAS_RUST_BACKEND and _rust_bootstrap_trop_variance_global is not None:
@@ -991,7 +1009,8 @@ def _bootstrap_variance_global(
9911009
self.n_bootstrap,
9921010
self.max_iter,
9931011
self.tol,
994-
self.seed if self.seed is not None else 0,
1012+
control_idx,
1013+
treated_idx,
9951014
unit_weight_arr,
9961015
)
9971016

@@ -1015,32 +1034,17 @@ def _bootstrap_variance_global(
10151034
stacklevel=2,
10161035
)
10171036

1018-
# Python fallback implementation
1019-
rng = np.random.default_rng(self.seed)
1020-
1021-
# Stratified bootstrap sampling
1022-
unit_ever_treated = data.groupby(unit)[treatment].max()
1023-
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index.tolist())
1024-
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index.tolist())
1025-
1026-
n_treated_units = len(treated_units)
1027-
n_control_units = len(control_units)
1028-
1037+
# Python fallback: consume the same indices the Rust branch would have used.
10291038
bootstrap_estimates_list: List[float] = []
10301039
nonconverg_tracker: List[int] = []
10311040

1032-
for _ in range(self.n_bootstrap):
1033-
# Stratified sampling
1034-
if n_control_units > 0:
1035-
sampled_control = rng.choice(control_units, size=n_control_units, replace=True)
1036-
else:
1037-
sampled_control = np.array([], dtype=object)
1038-
1039-
if n_treated_units > 0:
1040-
sampled_treated = rng.choice(treated_units, size=n_treated_units, replace=True)
1041-
else:
1042-
sampled_treated = np.array([], dtype=object)
1043-
1041+
for b in range(self.n_bootstrap):
1042+
sampled_control = (
1043+
control_units[control_idx[b]] if n_control_units > 0 else np.array([], dtype=object)
1044+
)
1045+
sampled_treated = (
1046+
treated_units[treated_idx[b]] if n_treated_units > 0 else np.array([], dtype=object)
1047+
)
10441048
sampled_units = np.concatenate([sampled_control, sampled_treated])
10451049

10461050
# Create bootstrap sample

diff_diff/trop_local.py

Lines changed: 40 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,10 @@
2424
_rust_bootstrap_trop_variance,
2525
_rust_unit_distance_matrix,
2626
)
27-
from diff_diff.bootstrap_utils import warn_bootstrap_failure_rate
27+
from diff_diff.bootstrap_utils import (
28+
stratified_bootstrap_indices,
29+
warn_bootstrap_failure_rate,
30+
)
2831
from diff_diff.trop_results import _PrecomputedStructures
2932
from diff_diff.utils import warn_if_not_converged
3033

@@ -957,6 +960,28 @@ def _bootstrap_variance(
957960
survey_design,
958961
)
959962

963+
# Stratified bootstrap pools (shared by Rust and Python paths).
964+
# Paper's Algorithm 3 (page 27) specifies sampling N_0 control rows
965+
# and N_1 treated rows separately to preserve treatment ratio.
966+
unit_ever_treated = data.groupby(unit)[treatment].max()
967+
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index)
968+
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index)
969+
n_treated_units = len(treated_units)
970+
n_control_units = len(control_units)
971+
972+
# Pre-generate stratified bootstrap indices via numpy (Python-canonical RNG).
973+
# This aligns the RNG layer between backends. For the global method the RNG
974+
# was the only divergence, so global SE is now bit-identical under the same
975+
# seed. For the local method two downstream divergences remain (Rust weight-
976+
# matrix normalization + Python `_compute_observation_weights` reading the
977+
# stale `_precomputed` cache) — tracked in TODO.md; until those land, local
978+
# bootstrap SE still differs across backends. Silent-failures finding #23
979+
# (bootstrap half) closed for global; local follow-up queued.
980+
rng = np.random.default_rng(self.seed)
981+
control_idx, treated_idx = stratified_bootstrap_indices(
982+
rng, n_control_units, n_treated_units, self.n_bootstrap
983+
)
984+
960985
# Try Rust backend for parallel bootstrap (5-15x speedup)
961986
# Only used for pweight-only designs (no strata/PSU/FPC)
962987
if (
@@ -981,7 +1006,8 @@ def _bootstrap_variance(
9811006
self.n_bootstrap,
9821007
self.max_iter,
9831008
self.tol,
984-
self.seed if self.seed is not None else 0,
1009+
control_idx,
1010+
treated_idx,
9851011
unit_weight_arr,
9861012
)
9871013

@@ -1003,36 +1029,21 @@ def _bootstrap_variance(
10031029
stacklevel=2,
10041030
)
10051031

1006-
# Python implementation (fallback)
1007-
rng = np.random.default_rng(self.seed)
1008-
1009-
# Issue D fix: Stratified bootstrap sampling
1010-
# Paper's Algorithm 3 (page 27) specifies sampling N_0 control rows
1011-
# and N_1 treated rows separately to preserve treatment ratio
1012-
unit_ever_treated = data.groupby(unit)[treatment].max()
1013-
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index)
1014-
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index)
1015-
1016-
n_treated_units = len(treated_units)
1017-
n_control_units = len(control_units)
1018-
1032+
# Python fallback: consume the same indices the Rust branch would have used.
10191033
bootstrap_estimates_list = []
10201034
nonconverg_tracker: List[int] = []
10211035

1022-
for _ in range(self.n_bootstrap):
1023-
# Stratified sampling: sample control and treated units separately
1024-
# This preserves the treatment ratio in each bootstrap sample
1025-
if n_control_units > 0:
1026-
sampled_control = rng.choice(control_units, size=n_control_units, replace=True)
1027-
else:
1028-
sampled_control = np.array([], dtype=control_units.dtype)
1029-
1030-
if n_treated_units > 0:
1031-
sampled_treated = rng.choice(treated_units, size=n_treated_units, replace=True)
1032-
else:
1033-
sampled_treated = np.array([], dtype=treated_units.dtype)
1034-
1035-
# Combine stratified samples
1036+
for b in range(self.n_bootstrap):
1037+
sampled_control = (
1038+
control_units[control_idx[b]]
1039+
if n_control_units > 0
1040+
else np.array([], dtype=control_units.dtype)
1041+
)
1042+
sampled_treated = (
1043+
treated_units[treated_idx[b]]
1044+
if n_treated_units > 0
1045+
else np.array([], dtype=treated_units.dtype)
1046+
)
10361047
sampled_units = np.concatenate([sampled_control, sampled_treated])
10371048

10381049
# Create bootstrap sample with unique unit IDs

0 commit comments

Comments
 (0)