Skip to content

Commit b10e46c

Browse files
igerberclaude
andcommitted
Fix TROP bootstrap SE backend divergence under fixed seed
Rust and Python TROP backends produced different bootstrap standard errors for the same `seed` value. On a tiny correlated panel under `seed=42` the gap was ~28% of SE: Rust seeded `rand_xoshiro:: Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), so identical seeds mapped to different bytestreams. Canonicalize on numpy. New `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` pre-generates per-replicate (control, treated) positional index arrays from a numpy `Generator` and hands them to both backends through the PyO3 surface — both Rust bootstrap functions (`bootstrap_trop_variance_global`, `bootstrap_trop_variance`) now accept `control_indices` and `treated_indices` as `i64` arrays in place of `seed: u64`. Parallelism is preserved. Sampling law (stratified: controls then treated, with replacement) is unchanged. Global-method SE is now backend-invariant under the same seed to machine precision: the prior `xfail(strict=True)` in `test_bootstrap_seed_reproducibility` is flipped to a passing `assert_allclose(atol=rtol=1e-14)` and parametrized over `[0, 42, 12345]`. A companion `test_bootstrap_seed_reproducibility_local` is added for the local-method bootstrap. It is currently `xfail(strict=True)` because aligning the RNG exposed two separate local-method backend divergences beyond this PR's scope: Rust's `compute_weight_matrix` normalizes time and unit weights to sum to 1, while Python's `_compute_observation_weights` does not; and the Python fallback's `_compute_observation_weights(_precomputed branch)` reads the original-panel cached `Y`/`D` instead of the bootstrap-sample arguments. Both are tracked as follow-up rows in `TODO.md` with file:line pointers and will land in a separate methodology PR. Closes the bootstrap half of silent-failures audit finding #23 (the grid-search half closed in PR #348). Reference: Athey, Imbens, Qu & Viviano (2025), "Triply Robust Panel Estimators", Algorithm 3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent bbbcc4d commit b10e46c

8 files changed

Lines changed: 361 additions & 120 deletions

File tree

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
### Fixed
11+
- **TROP global bootstrap SE backend parity under fixed seed** — Rust and Python backends now produce bit-identical bootstrap SE under the same `seed`. Previously Rust's `bootstrap_trop_variance_global` seeded `rand_xoshiro::Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), producing ~28% SE divergence on tiny panels under `seed=42`. Fixed by extracting a shared `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` that pre-generates per-replicate stratified sample indices via numpy on the Python side; both backends consume the same integer arrays through the PyO3 surface. Sampling law (stratified: controls then treated, with replacement) is unchanged. Closes silent-failures audit finding #23 (bootstrap half; grid-search half closed in PR #348). Local-method TROP also adopts the Python-canonical index contract for the RNG layer, but separately-discovered backend divergences (Rust normalizes weight-matrix outer product, Python `_compute_observation_weights` reads stale `_precomputed` cache) prevent local-method bit-identity SE; tracked as a follow-up in `TODO.md`.
12+
1013
## [3.2.0] - 2026-04-19
1114

1215
### Added

TODO.md

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -83,7 +83,8 @@ Deferred items from PR reviews that were not addressed before merge.
8383
| Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium |
8484
| Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R (`Rscript benchmarks/R/generate_clubsandwich_golden.R`). Current JSON has `source: python_self_reference` as a stability anchor until an authoritative R run. | `benchmarks/R/generate_clubsandwich_golden.R` | Phase 1a | Medium |
8585
| `honest_did.py:1907` `np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | `honest_did.py` | #334 | Low |
86-
| TROP Rust vs Python bootstrap SE divergence under fixed seed: `seed=42` on a tiny panel produces ~28% bootstrap-SE gap. Root cause: Rust bootstrap uses its own RNG (`rand` crate) while Python uses `numpy.random.default_rng`; same seed value maps to different bytestreams across backends. Audit axis-H (RNG/seed) adjacent. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility` baselines the gap. Unifying RNG (threading a numpy-generated seed-sequence into Rust, or porting Python to ChaCha) would close it. | `trop_global.py`, `rust/` | follow-up | Medium |
86+
| TROP local-method Rust vs Python bootstrap SE divergence beyond RNG: with stratified indices now Python-canonical (closes RNG axis-H), local-method SE still diverges by 10-20% on tiny panels. Two downstream causes: (a) Rust `compute_weight_matrix` (`rust/src/trop.rs:573-606`) normalizes time_weights and unit_weights to sum to 1 before the outer product; Python `_compute_observation_weights` (`diff_diff/trop_local.py:489`) does not normalize. (b) Python `_compute_observation_weights` reads `self._precomputed["Y"]`, `["D"]`, `["time_dist_matrix"]` (lines 423-458) — original-panel cache — rather than the bootstrap-sample data passed via the function arguments, so unit-distance computation in the Python fallback uses stale data. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local` baselines the gap across `seeds=[0, 42, 12345]`. Affects local-method TROP backend parity broadly, not just bootstrap. | `trop_local.py`, `rust/src/trop.rs` | follow-up | Medium |
87+
| Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #NNN): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium |
8788
| `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
8889
| `bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | `diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Medium |
8990
| `bias_corrected_local_linear`: support `weights=` once survey-design adaptation lands. nprobust's `lprobust` has no weight argument so there is no parity anchor; derivation needed. | `diff_diff/local_linear.py`, `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Medium |

diff_diff/bootstrap_utils.py

Lines changed: 41 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,9 +24,50 @@
2424
"compute_effect_bootstrap_stats",
2525
"compute_effect_bootstrap_stats_batch",
2626
"warn_bootstrap_failure_rate",
27+
"stratified_bootstrap_indices",
2728
]
2829

2930

31+
def stratified_bootstrap_indices(
32+
rng: np.random.Generator,
33+
n_control: int,
34+
n_treated: int,
35+
n_bootstrap: int,
36+
) -> Tuple[np.ndarray, np.ndarray]:
37+
"""Generate stratified bootstrap sample indices.
38+
39+
Draws controls first, then treated, per replicate. A single ``rng``
40+
advances through all replicates so Python and Rust consumers share an
41+
identical bytestream under the same seed.
42+
43+
Parameters
44+
----------
45+
rng : np.random.Generator
46+
Seeded numpy generator. Consumed positionally; caller owns seeding.
47+
n_control : int
48+
Size of the control pool. Indices drawn in ``[0, n_control)``.
49+
n_treated : int
50+
Size of the treated pool. Indices drawn in ``[0, n_treated)``.
51+
n_bootstrap : int
52+
Number of bootstrap replicates.
53+
54+
Returns
55+
-------
56+
(control_indices, treated_indices) : tuple of np.ndarray
57+
Shapes ``(n_bootstrap, n_control)`` and ``(n_bootstrap, n_treated)``,
58+
dtype ``int64``. Each row is one replicate's sample-with-replacement
59+
indices. Callers map these back to unit identities via their own pool.
60+
"""
61+
control_idx = np.empty((n_bootstrap, n_control), dtype=np.int64)
62+
treated_idx = np.empty((n_bootstrap, n_treated), dtype=np.int64)
63+
for b in range(n_bootstrap):
64+
if n_control > 0:
65+
control_idx[b] = rng.choice(n_control, size=n_control, replace=True)
66+
if n_treated > 0:
67+
treated_idx[b] = rng.choice(n_treated, size=n_treated, replace=True)
68+
return control_idx, treated_idx
69+
70+
3071
def warn_bootstrap_failure_rate(
3172
n_success: int,
3273
n_attempted: int,

diff_diff/trop_global.py

Lines changed: 29 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,10 @@
2424
_rust_bootstrap_trop_variance_global,
2525
_rust_loocv_grid_search_global,
2626
)
27-
from diff_diff.bootstrap_utils import warn_bootstrap_failure_rate
27+
from diff_diff.bootstrap_utils import (
28+
stratified_bootstrap_indices,
29+
warn_bootstrap_failure_rate,
30+
)
2831
from diff_diff.trop_local import _soft_threshold_svd, _validate_and_pivot_treatment
2932
from diff_diff.trop_results import TROPResults
3033
from diff_diff.utils import safe_inference, warn_if_not_converged
@@ -961,6 +964,21 @@ def _bootstrap_variance_global(
961964
survey_design,
962965
)
963966

967+
# Stratified bootstrap pools (shared by Rust and Python paths)
968+
unit_ever_treated = data.groupby(unit)[treatment].max()
969+
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index.tolist())
970+
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index.tolist())
971+
n_treated_units = len(treated_units)
972+
n_control_units = len(control_units)
973+
974+
# Pre-generate stratified bootstrap indices via numpy (Python-canonical RNG).
975+
# Both backends consume these indices so SE is identical under the same seed
976+
# (silent-failures finding #23, bootstrap half).
977+
rng = np.random.default_rng(self.seed)
978+
control_idx, treated_idx = stratified_bootstrap_indices(
979+
rng, n_control_units, n_treated_units, self.n_bootstrap
980+
)
981+
964982
# Try Rust backend for parallel bootstrap (5-15x speedup)
965983
# Only used for pweight-only designs (no strata/PSU/FPC)
966984
if HAS_RUST_BACKEND and _rust_bootstrap_trop_variance_global is not None:
@@ -991,7 +1009,8 @@ def _bootstrap_variance_global(
9911009
self.n_bootstrap,
9921010
self.max_iter,
9931011
self.tol,
994-
self.seed if self.seed is not None else 0,
1012+
control_idx,
1013+
treated_idx,
9951014
unit_weight_arr,
9961015
)
9971016

@@ -1015,32 +1034,17 @@ def _bootstrap_variance_global(
10151034
stacklevel=2,
10161035
)
10171036

1018-
# Python fallback implementation
1019-
rng = np.random.default_rng(self.seed)
1020-
1021-
# Stratified bootstrap sampling
1022-
unit_ever_treated = data.groupby(unit)[treatment].max()
1023-
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index.tolist())
1024-
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index.tolist())
1025-
1026-
n_treated_units = len(treated_units)
1027-
n_control_units = len(control_units)
1028-
1037+
# Python fallback: consume the same indices the Rust branch would have used.
10291038
bootstrap_estimates_list: List[float] = []
10301039
nonconverg_tracker: List[int] = []
10311040

1032-
for _ in range(self.n_bootstrap):
1033-
# Stratified sampling
1034-
if n_control_units > 0:
1035-
sampled_control = rng.choice(control_units, size=n_control_units, replace=True)
1036-
else:
1037-
sampled_control = np.array([], dtype=object)
1038-
1039-
if n_treated_units > 0:
1040-
sampled_treated = rng.choice(treated_units, size=n_treated_units, replace=True)
1041-
else:
1042-
sampled_treated = np.array([], dtype=object)
1043-
1041+
for b in range(self.n_bootstrap):
1042+
sampled_control = (
1043+
control_units[control_idx[b]] if n_control_units > 0 else np.array([], dtype=object)
1044+
)
1045+
sampled_treated = (
1046+
treated_units[treated_idx[b]] if n_treated_units > 0 else np.array([], dtype=object)
1047+
)
10441048
sampled_units = np.concatenate([sampled_control, sampled_treated])
10451049

10461050
# Create bootstrap sample

diff_diff/trop_local.py

Lines changed: 35 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -24,7 +24,10 @@
2424
_rust_bootstrap_trop_variance,
2525
_rust_unit_distance_matrix,
2626
)
27-
from diff_diff.bootstrap_utils import warn_bootstrap_failure_rate
27+
from diff_diff.bootstrap_utils import (
28+
stratified_bootstrap_indices,
29+
warn_bootstrap_failure_rate,
30+
)
2831
from diff_diff.trop_results import _PrecomputedStructures
2932
from diff_diff.utils import warn_if_not_converged
3033

@@ -957,6 +960,23 @@ def _bootstrap_variance(
957960
survey_design,
958961
)
959962

963+
# Stratified bootstrap pools (shared by Rust and Python paths).
964+
# Paper's Algorithm 3 (page 27) specifies sampling N_0 control rows
965+
# and N_1 treated rows separately to preserve treatment ratio.
966+
unit_ever_treated = data.groupby(unit)[treatment].max()
967+
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index)
968+
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index)
969+
n_treated_units = len(treated_units)
970+
n_control_units = len(control_units)
971+
972+
# Pre-generate stratified bootstrap indices via numpy (Python-canonical RNG).
973+
# Both backends consume these indices so SE is identical under the same seed
974+
# (silent-failures finding #23, bootstrap half).
975+
rng = np.random.default_rng(self.seed)
976+
control_idx, treated_idx = stratified_bootstrap_indices(
977+
rng, n_control_units, n_treated_units, self.n_bootstrap
978+
)
979+
960980
# Try Rust backend for parallel bootstrap (5-15x speedup)
961981
# Only used for pweight-only designs (no strata/PSU/FPC)
962982
if (
@@ -981,7 +1001,8 @@ def _bootstrap_variance(
9811001
self.n_bootstrap,
9821002
self.max_iter,
9831003
self.tol,
984-
self.seed if self.seed is not None else 0,
1004+
control_idx,
1005+
treated_idx,
9851006
unit_weight_arr,
9861007
)
9871008

@@ -1003,36 +1024,21 @@ def _bootstrap_variance(
10031024
stacklevel=2,
10041025
)
10051026

1006-
# Python implementation (fallback)
1007-
rng = np.random.default_rng(self.seed)
1008-
1009-
# Issue D fix: Stratified bootstrap sampling
1010-
# Paper's Algorithm 3 (page 27) specifies sampling N_0 control rows
1011-
# and N_1 treated rows separately to preserve treatment ratio
1012-
unit_ever_treated = data.groupby(unit)[treatment].max()
1013-
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index)
1014-
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index)
1015-
1016-
n_treated_units = len(treated_units)
1017-
n_control_units = len(control_units)
1018-
1027+
# Python fallback: consume the same indices the Rust branch would have used.
10191028
bootstrap_estimates_list = []
10201029
nonconverg_tracker: List[int] = []
10211030

1022-
for _ in range(self.n_bootstrap):
1023-
# Stratified sampling: sample control and treated units separately
1024-
# This preserves the treatment ratio in each bootstrap sample
1025-
if n_control_units > 0:
1026-
sampled_control = rng.choice(control_units, size=n_control_units, replace=True)
1027-
else:
1028-
sampled_control = np.array([], dtype=control_units.dtype)
1029-
1030-
if n_treated_units > 0:
1031-
sampled_treated = rng.choice(treated_units, size=n_treated_units, replace=True)
1032-
else:
1033-
sampled_treated = np.array([], dtype=treated_units.dtype)
1034-
1035-
# Combine stratified samples
1031+
for b in range(self.n_bootstrap):
1032+
sampled_control = (
1033+
control_units[control_idx[b]]
1034+
if n_control_units > 0
1035+
else np.array([], dtype=control_units.dtype)
1036+
)
1037+
sampled_treated = (
1038+
treated_units[treated_idx[b]]
1039+
if n_treated_units > 0
1040+
else np.array([], dtype=treated_units.dtype)
1041+
)
10361042
sampled_units = np.concatenate([sampled_control, sampled_treated])
10371043

10381044
# Create bootstrap sample with unique unit IDs

0 commit comments

Comments
 (0)