You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Fix TROP bootstrap SE backend divergence under fixed seed
Rust and Python TROP backends produced different bootstrap standard
errors for the same `seed` value. On a tiny correlated panel under
`seed=42` the gap was ~28% of SE: Rust seeded `rand_xoshiro::
Xoshiro256PlusPlus` per replicate while Python's fallback consumed
`numpy.random.default_rng` (PCG64), so identical seeds mapped to
different bytestreams.
Canonicalize on numpy. New `stratified_bootstrap_indices` helper in
`diff_diff/bootstrap_utils.py` pre-generates per-replicate
(control, treated) positional index arrays from a numpy `Generator`
and hands them to both backends through the PyO3 surface — both
Rust bootstrap functions (`bootstrap_trop_variance_global`,
`bootstrap_trop_variance`) now accept `control_indices` and
`treated_indices` as `i64` arrays in place of `seed: u64`. Parallelism
is preserved. Sampling law (stratified: controls then treated, with
replacement) is unchanged.
Global-method SE is now backend-invariant under the same seed to
machine precision: the prior `xfail(strict=True)` in
`test_bootstrap_seed_reproducibility` is flipped to a passing
`assert_allclose(atol=rtol=1e-14)` and parametrized over
`[0, 42, 12345]`.
A companion `test_bootstrap_seed_reproducibility_local` is added for
the local-method bootstrap. It is currently `xfail(strict=True)`
because aligning the RNG exposed two separate local-method backend
divergences beyond this PR's scope: Rust's `compute_weight_matrix`
normalizes time and unit weights to sum to 1, while Python's
`_compute_observation_weights` does not; and the Python fallback's
`_compute_observation_weights(_precomputed branch)` reads the
original-panel cached `Y`/`D` instead of the bootstrap-sample
arguments. Both are tracked as follow-up rows in `TODO.md` with
file:line pointers and will land in a separate methodology PR.
Closes the bootstrap half of silent-failures audit finding #23 (the
grid-search half closed in PR #348). Reference: Athey, Imbens, Qu &
Viviano (2025), "Triply Robust Panel Estimators", Algorithm 3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy file name to clipboardExpand all lines: CHANGELOG.md
+3Lines changed: 3 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
7
7
8
8
## [Unreleased]
9
9
10
+
### Fixed
11
+
- **TROP global bootstrap SE backend parity under fixed seed** — Rust and Python backends now produce bit-identical bootstrap SE under the same `seed`. Previously Rust's `bootstrap_trop_variance_global` seeded `rand_xoshiro::Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), producing ~28% SE divergence on tiny panels under `seed=42`. Fixed by extracting a shared `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` that pre-generates per-replicate stratified sample indices via numpy on the Python side; both backends consume the same integer arrays through the PyO3 surface. Sampling law (stratified: controls then treated, with replacement) is unchanged. Closes silent-failures audit finding #23 (bootstrap half; grid-search half closed in PR #348). Local-method TROP also adopts the Python-canonical index contract for the RNG layer, but separately-discovered backend divergences (Rust normalizes weight-matrix outer product, Python `_compute_observation_weights` reads stale `_precomputed` cache) prevent local-method bit-identity SE; tracked as a follow-up in `TODO.md`.
Copy file name to clipboardExpand all lines: TODO.md
+2-1Lines changed: 2 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -83,7 +83,8 @@ Deferred items from PR reviews that were not addressed before merge.
83
83
| Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. |`linalg.py::_compute_cr2_bm`| Phase 1a | Medium |
84
84
| Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R (`Rscript benchmarks/R/generate_clubsandwich_golden.R`). Current JSON has `source: python_self_reference` as a stability anchor until an authoritative R run. |`benchmarks/R/generate_clubsandwich_golden.R`| Phase 1a | Medium |
85
85
|`honest_did.py:1907``np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. |`honest_did.py`|#334| Low |
86
-
| TROP Rust vs Python bootstrap SE divergence under fixed seed: `seed=42` on a tiny panel produces ~28% bootstrap-SE gap. Root cause: Rust bootstrap uses its own RNG (`rand` crate) while Python uses `numpy.random.default_rng`; same seed value maps to different bytestreams across backends. Audit axis-H (RNG/seed) adjacent. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility` baselines the gap. Unifying RNG (threading a numpy-generated seed-sequence into Rust, or porting Python to ChaCha) would close it. |`trop_global.py`, `rust/`| follow-up | Medium |
86
+
| TROP local-method Rust vs Python bootstrap SE divergence beyond RNG: with stratified indices now Python-canonical (closes RNG axis-H), local-method SE still diverges by 10-20% on tiny panels. Two downstream causes: (a) Rust `compute_weight_matrix` (`rust/src/trop.rs:573-606`) normalizes time_weights and unit_weights to sum to 1 before the outer product; Python `_compute_observation_weights` (`diff_diff/trop_local.py:489`) does not normalize. (b) Python `_compute_observation_weights` reads `self._precomputed["Y"]`, `["D"]`, `["time_dist_matrix"]` (lines 423-458) — original-panel cache — rather than the bootstrap-sample data passed via the function arguments, so unit-distance computation in the Python fallback uses stale data. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local` baselines the gap across `seeds=[0, 42, 12345]`. Affects local-method TROP backend parity broadly, not just bootstrap. | `trop_local.py`, `rust/src/trop.rs` | follow-up | Medium |
87
+
| Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #NNN): pre-generate weights in Python via numpy and pass them to Rust through PyO3. |`rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py`| follow-up | Medium |
87
88
|`bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). |`benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py`| Phase 1c | Low |
88
89
|`bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). |`diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py`| Phase 1c | Medium |
89
90
|`bias_corrected_local_linear`: support `weights=` once survey-design adaptation lands. nprobust's `lprobust` has no weight argument so there is no parity anchor; derivation needed. |`diff_diff/local_linear.py`, `diff_diff/_nprobust_port.py::lprobust`| Phase 1c | Medium |
0 commit comments