Skip to content

Commit 2c48e60

Browse files
igerberclaude
andcommitted
Restrict Phase 1c wrapper vce to 'nn'; align tolerance docs
Address CI review P1 on HC2/HC3 leverage. nprobust's lprobust reuses the p-fit hat-matrix leverage for the q-fit residual path (lprobust.R:229-241). The Python port matches R exactly, but that R behavior deserves a derivation and a dedicated golden anchor before the public wrapper can advertise CCT-2014 conformance on hc-mode paths. The Phase 1c plan always said hc0-3 were "exposed but not golden-tested"; this commit restricts the public surface to what actually has a parity anchor. Changes: - `bias_corrected_local_linear`: `vce != "nn"` now raises `NotImplementedError` with a pointer to the port-level `diff_diff._nprobust_port.lprobust` for callers who still need the broader surface. Docstring updated. - Tests: swap positive-path hc-mode tests for negative-path raises; drop `test_auto_vce_hc1_returns_finite` (now superseded by `test_auto_vce_hc1_rejected_in_phase_1c`); drop the cluster+hc2/hc3 tests (now covered by the plain hc2/hc3 rejection tests). - REGISTRY.md: swap "Deviation from R" wording for a clearer "public-API surface restriction" note explaining the R-side hii reuse and the deferred derivation. - TODO.md: upgrade the hc-mode expansion TODO from Low to Medium and document the q-fit leverage derivation as the gating work. Also fix the P3 tolerance-doc inconsistency: registry/docstring now say atol=1e-12 on all six scalars across unclustered DGPs (tests actually assert 1e-12, with DGP 1 / DGP 3 landing closer to 1e-13 in practice). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 7eeb42a commit 2c48e60

4 files changed

Lines changed: 68 additions & 52 deletions

File tree

TODO.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -87,7 +87,7 @@ Deferred items from PR reviews that were not addressed before merge.
8787
| TROP Rust vs Python grid-search divergence on rank-deficient Y: on two near-parallel control units, LOOCV grid-search ATT diverges ~6% between Rust (`trop_global.py:688`) and Python fallback (`trop_global.py:753`). Either grid-winner ties are broken differently or the per-λ solver reaches different stationary points under rank deficiency. Audit finding #23 flagged this surface. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_grid_search_rank_deficient_Y` baselines the gap. | `trop_global.py`, `rust/` | follow-up | Medium |
8888
| TROP Rust vs Python bootstrap SE divergence under fixed seed: `seed=42` on a tiny panel produces ~28% bootstrap-SE gap. Root cause: Rust bootstrap uses its own RNG (`rand` crate) while Python uses `numpy.random.default_rng`; same seed value maps to different bytestreams across backends. Audit axis-H (RNG/seed) adjacent. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility` baselines the gap. Unifying RNG (threading a numpy-generated seed-sequence into Rust, or porting Python to ChaCha) would close it. | `trop_global.py`, `rust/` | follow-up | Medium |
8989
| `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
90-
| `bias_corrected_local_linear`: extend golden parity to `vce in {"hc0", "hc1", "hc2", "hc3"}`. The port supports all four via `lprobust_res`'s hc branches but Phase 1c golden-tested only `vce="nn"`. | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
90+
| `bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | `diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Medium |
9191
| `bias_corrected_local_linear`: support `weights=` once survey-design adaptation lands. nprobust's `lprobust` has no weight argument so there is no parity anchor; derivation needed. | `diff_diff/local_linear.py`, `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Medium |
9292
| `bias_corrected_local_linear`: support multi-eval grid (`neval > 1`) with cross-covariance (`covgrid=TRUE` branch of `lprobust.R:253-378`). Not needed for HAD but useful for multi-dose diagnostics. | `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Low |
9393
| Clustered-DGP parity: Phase 1c's DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster bug in `lpbwselect.mse.dpi`'s pilot fits. Once nprobust ships a fix (or we derive one independently), add a clustered-auto-bandwidth parity test. | `benchmarks/R/generate_nprobust_lprobust_golden.R` | Phase 1c | Low |

diff_diff/local_linear.py

Lines changed: 38 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -990,9 +990,11 @@ def bias_corrected_local_linear(
990990
provided without ``h`` raises ``ValueError``.
991991
alpha : float, default=0.05
992992
CI level; ``0.05`` gives a 95% CI. Must be in ``(0, 1)``.
993-
vce : {"nn", "hc0", "hc1", "hc2", "hc3"}, default="nn"
994-
Variance-estimation method. ``"nn"`` is parity-tested in
995-
Phase 1c; other modes are exposed but not golden-parity-tested.
993+
vce : {"nn"}, default="nn"
994+
Variance-estimation method. Only ``"nn"`` is supported in
995+
Phase 1c; hc0/hc1/hc2/hc3 are queued for Phase 2+ pending
996+
dedicated R parity goldens. Passing anything else raises
997+
``NotImplementedError``.
996998
cluster : np.ndarray or None
997999
Per-observation cluster IDs for cluster-robust variance. Missing
9981000
(NaN) cluster IDs raise ``ValueError`` rather than silently
@@ -1019,18 +1021,19 @@ def bias_corrected_local_linear(
10191021
10201022
Notes
10211023
-----
1022-
Parity against ``nprobust::lprobust(..., bwselect="mse-dpi")`` is tiered
1023-
(see ``docs/methodology/REGISTRY.md``): ``atol=1e-12`` on ``tau_cl``,
1024-
``tau_bc``, ``se_cl``, and ``se_rb`` across the three unclustered
1025-
golden DGPs; ``atol=1e-13`` on CI bounds. The Python wrapper computes
1026-
its own ``z_{1-alpha/2}`` via ``scipy.stats.norm.ppf`` inside
1027-
``safe_inference()``; R's ``qnorm`` value is stored in the golden JSON
1028-
for audit, and the parity harness compares Python's CI bounds to R's
1029-
pre-computed CI bounds, so any residual drift is purely the
1030-
floating-point arithmetic in ``tau.bc +/- z * se.rb``, not a
1031-
critical-value disagreement. Clustered DGP 4 achieves bit-parity
1032-
(``atol=1e-14``) when cluster IDs happen to be in first-appearance
1033-
order; otherwise BLAS reduction ordering can drift to ``atol=1e-10``.
1024+
Parity against ``nprobust::lprobust(..., bwselect="mse-dpi")`` is
1025+
asserted at ``atol=1e-12`` on ``tau_cl``, ``tau_bc``, ``se_cl``,
1026+
``se_rb``, ``ci_low``, and ``ci_high`` across the three unclustered
1027+
golden DGPs; DGP 1 and DGP 3 typically land closer to ``1e-13``.
1028+
The Python wrapper computes its own ``z_{1-alpha/2}`` via
1029+
``scipy.stats.norm.ppf`` inside ``safe_inference()``; R's ``qnorm``
1030+
value is stored in the golden JSON for audit, and the parity harness
1031+
compares Python's CI bounds to R's pre-computed CI bounds, so any
1032+
residual drift is purely the floating-point arithmetic in
1033+
``tau.bc +/- z * se.rb``, not a critical-value disagreement.
1034+
Clustered DGP 4 achieves bit-parity (``atol=1e-14``) when cluster
1035+
IDs are in first-appearance order; otherwise BLAS reduction
1036+
ordering can drift to ``atol=1e-10``.
10341037
"""
10351038
if weights is not None:
10361039
raise NotImplementedError(
@@ -1050,15 +1053,26 @@ def bias_corrected_local_linear(
10501053
if not (0.0 < alpha < 1.0):
10511054
raise ValueError(f"alpha must be in (0, 1); got {alpha!r}.")
10521055

1053-
# Reject manifestly-incompatible vce+cluster combinations upfront.
1054-
# nprobust's lprobust silently accepts hc2/hc3 + cluster and produces a
1055-
# weighted-score cluster meat that is NOT the standard CR2 or clustered
1056-
# HC estimator. Rather than propagate that ambiguity, reject.
1057-
if cluster is not None and vce in ("hc2", "hc3"):
1058-
raise ValueError(
1059-
f"vce={vce!r} combined with cluster= is not a well-defined "
1060-
"estimator in nprobust (hc2/hc3 assume independent observations)."
1061-
" Use vce='nn' or drop the cluster argument."
1056+
# Phase 1c public-wrapper vce restriction.
1057+
# Only ``vce="nn"`` is golden-tested against R in Phase 1c. Exposing
1058+
# hc0/hc1/hc2/hc3 on the public surface would ship non-parity-verified
1059+
# inference -- and nprobust's internal hc2/hc3 residual path reuses
1060+
# the p-fit hat-matrix leverage for the q-fit residuals (lprobust.R
1061+
# :229-241), a detail that would need its own R parity anchor before
1062+
# the Python port can advertise it as CCT-2014-compliant. Defer the
1063+
# hc-mode public surface to Phase 2+ with dedicated goldens. The
1064+
# port-level ``diff_diff._nprobust_port.lprobust`` still accepts
1065+
# hc0..hc3 for callers who need the broader surface and accept that
1066+
# the hc-mode behavior has not been separately parity-tested.
1067+
if vce != "nn":
1068+
raise NotImplementedError(
1069+
f"vce={vce!r} is not supported on bias_corrected_local_linear "
1070+
"in Phase 1c. Only vce='nn' is golden-tested against R "
1071+
"nprobust::lprobust in this phase; hc0/hc1/hc2/hc3 are queued "
1072+
"for Phase 2+ pending dedicated R parity goldens. If you need "
1073+
"the hc-mode port path for exploratory use, call "
1074+
"diff_diff._nprobust_port.lprobust directly and accept that "
1075+
"those paths are not separately parity-tested."
10621076
)
10631077

10641078
# HAD-scope input validation (shared with Phase 1b via _validate_had_inputs).

docs/methodology/REGISTRY.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2314,7 +2314,7 @@ Shipped as `did_had_pretest_workflow()` and surfaced via `practitioner_next_step
23142314
- [x] Phase 1a: `clubSandwich::vcovCR(..., type="CR2")` parity harness committed: R script at `benchmarks/R/generate_clubsandwich_golden.R` plus a regression-anchor JSON at `benchmarks/data/clubsandwich_cr2_golden.json`. **Note:** the committed JSON currently has `"source": "python_self_reference"` and pins numerical stability only; authoritative R-produced values are generated by running the R script, which the TODO.md row under Methodology/Correctness tracks. The parity test at `tests/test_linalg_hc2_bm.py::TestCR2BMCluster::test_cr2_parity_with_golden` runs at 1e-6 tolerance (Phase 1a plan commits 6-digit parity once R regen completes).
23152315
- [x] Phase 1b: Calonico-Cattaneo-Farrell (2018) MSE-optimal bandwidth selector. In-house port of `nprobust::lpbwselect(bwselect="mse-dpi")` (nprobust 0.5.0, SHA `36e4e53`) as `diff_diff.mse_optimal_bandwidth` and `BandwidthResult`, backed by the private `diff_diff._nprobust_port` module (`kernel_W`, `lprobust_bw`, `lpbwselect_mse_dpi`). Three-stage DPI with four `lprobust.bw` calls at orders `q+1`, `q+2`, `q`, `p`. Python matches R to `0.0000%` relative error (i.e., bit-parity within float64 precision, ~8-13 digits agreement) on all five stage bandwidths (`c_bw`, `bw_mp2`, `bw_mp3`, `b_mse`, `h_mse`) across three deterministic DGPs (uniform, Beta(2,2), half-normal) via `benchmarks/R/generate_nprobust_golden.R` → `benchmarks/data/nprobust_mse_dpi_golden.json`. **Note:** `weights=` is currently unsupported (raises `NotImplementedError`); nprobust's `lpbwselect` has no weight argument so there is no parity anchor. Weighted-data support deferred to Phase 2 (survey-design adaptation). **Note (public API scope restriction):** the exported wrapper `mse_optimal_bandwidth` hard-codes the HAD Phase 1b configuration (`p=1`, `deriv=0`, `interior=False`, `vce="nn"`, `nnmatch=3`). The underlying port supports a broader surface (`hc0`/`hc1`/`hc2`/`hc3` variance, interior evaluation, higher `p`), but those paths are not parity-tested against `nprobust` and are deferred. Callers needing the broader surface should use `diff_diff._nprobust_port.lpbwselect_mse_dpi` directly and accept that parity has not been verified on non-HAD configurations. **Note (input contract):** the wrapper enforces HAD's support restriction `D_{g,2} >= 0` (front-door `ValueError` on negative doses and empty inputs). `boundary` must equal `0` (Design 1') or `float(d.min())` (Design 1 continuous-near-d_lower) within float tolerance; off-support values raise `ValueError`. When `boundary ~ 0`, the wrapper additionally requires `d.min() <= 0.05 * median(|d|)` as a Design 1' support plausibility heuristic, chosen to pass the paper's thin-boundary-density DGPs (Beta(2,2), d.min/median ~ 3%) while rejecting substantially off-support samples (U(0.5, 1.0), d.min/median ~ 1.0). Detected mass-point designs (`d.min() > 0` with modal fraction at `d.min() > 2%`) raise `NotImplementedError` pointing to the Phase 2 2SLS path per paper Section 3.2.4.
23162316
- [x] Phase 1c: First-order bias estimator `M̂_{ĥ*_G}` and robust variance `V̂_{ĥ*_G}`. Implemented via Calonico-Cattaneo-Titiunik (2014) bias-combined design matrix `Q.q` in the in-house port `diff_diff._nprobust_port.lprobust` (single-eval-point path of `nprobust::lprobust`, npfunctions.R:177-246).
2317-
- [x] Phase 1c: Bias-corrected CI (Equation 8) with `nprobust` parity. Public wrapper `diff_diff.bias_corrected_local_linear` returns `BiasCorrectedFit` with μ̂-scale point estimate, robust SE, and bias-corrected 95% CI `[tau.bc ± z_{1-α/2} * se.rb]`. The β-scale rescaling from Equation 8, `(1/G) Σ D_{g,2}`, is applied by Phase 2's `HeterogeneousAdoptionDiD.fit()`. Parity against `nprobust::lprobust(..., bwselect="mse-dpi")` is tiered: `atol=1e-12` on `tau_cl`/`tau_bc`/`se_cl`/`se_rb` (the `Q.q` outer-product + broadcast-multiply path drifts slightly from the Phase 1b bit-parity primitives), `atol=1e-13` on CI bounds (the Python wrapper computes its own `z_{1-α/2}` via `scipy.stats.norm.ppf` inside `safe_inference()`; R's `qnorm` value is stored in the golden JSON for audit, and the parity harness compares Python's CI bounds to R's pre-computed CI bounds so any residual drift is purely the floating-point arithmetic in `tau.bc ± z * se.rb`, not a critical-value disagreement). The clustered DGP achieves bit-parity (`atol=1e-14`) when cluster IDs happen to be sorted; otherwise BLAS reduction ordering can drift to `atol=1e-10`. Generator: `benchmarks/R/generate_nprobust_lprobust_golden.R`. **Note:** The wrapper matches nprobust's `rho=1` default (`b = h` in auto mode), so Phase 1b's separately-computed `b_mse` is surfaced via `bandwidth_diagnostics.b_mse` but not applied. **Deviation from R:** nprobust silently accepts `vce="hc2"`/`vce="hc3"` combined with `cluster=`; this wrapper raises `ValueError` because the combination is not a well-defined estimator (hc2/hc3 assume independent observations). **Note (Phase 1c internal bug workaround):** The clustered golden DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster shape bug in `lprobust.vce` fired by the mse-dpi pilot fits; the Python port has no equivalent bug.
2317+
- [x] Phase 1c: Bias-corrected CI (Equation 8) with `nprobust` parity. Public wrapper `diff_diff.bias_corrected_local_linear` returns `BiasCorrectedFit` with μ̂-scale point estimate, robust SE, and bias-corrected 95% CI `[tau.bc ± z_{1-α/2} * se.rb]`. The β-scale rescaling from Equation 8, `(1/G) Σ D_{g,2}`, is applied by Phase 2's `HeterogeneousAdoptionDiD.fit()`. Parity against `nprobust::lprobust(..., bwselect="mse-dpi")` is asserted at `atol=1e-12` on `tau_cl`/`tau_bc`/`se_cl`/`se_rb`/`ci_low`/`ci_high` across the three unclustered golden DGPs (DGP 1 and DGP 3 typically land closer to `1e-13`). The Python wrapper computes its own `z_{1-α/2}` via `scipy.stats.norm.ppf` inside `safe_inference()`; R's `qnorm` value is stored in the golden JSON for audit, and the parity harness compares Python's CI bounds to R's pre-computed CI bounds so any residual drift is purely the floating-point arithmetic in `tau.bc ± z * se.rb`, not a critical-value disagreement. The clustered DGP achieves bit-parity (`atol=1e-14`) when cluster IDs are in first-appearance order; otherwise BLAS reduction ordering can drift to `atol=1e-10`. Generator: `benchmarks/R/generate_nprobust_lprobust_golden.R`. **Note:** The wrapper matches nprobust's `rho=1` default (`b = h` in auto mode), so Phase 1b's separately-computed `b_mse` is surfaced via `bandwidth_diagnostics.b_mse` but not applied. **Note (public-API surface restriction):** Phase 1c restricts the public wrapper's `vce` parameter to `"nn"`; hc0/hc1/hc2/hc3 raise `NotImplementedError` and are queued for Phase 2+ pending dedicated R parity goldens. The port-level `diff_diff._nprobust_port.lprobust` still accepts all five vce modes (matching R's `nprobust::lprobust` signature) for callers who need the broader surface and accept that the hc-mode variance path — which reuses p-fit hat-matrix leverage for the q-fit residual in R (lprobust.R:229-241) — has not been separately parity-tested. **Note (Phase 1c internal bug workaround):** The clustered golden DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster shape bug in `lprobust.vce` fired by the mse-dpi pilot fits; the Python port has no equivalent bug.
23182318
- [ ] Phase 2: `HeterogeneousAdoptionDiD` class with separate code paths for Design 1', Design 1 mass-point, and Design 1 continuous-near-``.
23192319
- [ ] Phase 2: `design="auto"` detection rule (`min_g D_{g,2} < 0.01 · median_g D_{g,2}` → continuous_at_zero; modal-min fraction > 2% → mass_point; else continuous_near_lower).
23202320
- [ ] Phase 2: Panel validator verifies `D_{g,1} = 0` for all units; error on staggered timing without last-cohort subgroup.

tests/test_bias_corrected_lprobust.py

Lines changed: 28 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -380,23 +380,30 @@ def test_cluster_with_manual_bandwidth(self):
380380
assert np.isfinite(fit.estimate_bias_corrected)
381381
assert fit.se_robust > 0
382382

383-
def test_vce_hc2_with_cluster_raises(self):
384-
d = np.linspace(0.05, 1.0, 100)
383+
def test_vce_hc0_raises_not_implemented(self):
384+
"""Only vce='nn' is golden-tested in Phase 1c; hc-modes raise."""
385+
d = np.linspace(0.0, 1.0, 100)
385386
y = d.copy()
386-
cluster = np.repeat(np.arange(10), 10)
387-
with pytest.raises(ValueError, match="hc2.*not a well-defined"):
388-
bias_corrected_local_linear(
389-
d, y, h=0.3, vce="hc2", cluster=cluster,
390-
)
387+
with pytest.raises(NotImplementedError, match="vce='hc0'"):
388+
bias_corrected_local_linear(d, y, h=0.3, vce="hc0")
391389

392-
def test_vce_hc3_with_cluster_raises(self):
393-
d = np.linspace(0.05, 1.0, 100)
390+
def test_vce_hc1_raises_not_implemented(self):
391+
d = np.linspace(0.0, 1.0, 100)
392+
y = d.copy()
393+
with pytest.raises(NotImplementedError, match="vce='hc1'"):
394+
bias_corrected_local_linear(d, y, h=0.3, vce="hc1")
395+
396+
def test_vce_hc2_raises_not_implemented(self):
397+
d = np.linspace(0.0, 1.0, 100)
398+
y = d.copy()
399+
with pytest.raises(NotImplementedError, match="vce='hc2'"):
400+
bias_corrected_local_linear(d, y, h=0.3, vce="hc2")
401+
402+
def test_vce_hc3_raises_not_implemented(self):
403+
d = np.linspace(0.0, 1.0, 100)
394404
y = d.copy()
395-
cluster = np.repeat(np.arange(10), 10)
396-
with pytest.raises(ValueError, match="hc3.*not a well-defined"):
397-
bias_corrected_local_linear(
398-
d, y, h=0.3, vce="hc3", cluster=cluster,
399-
)
405+
with pytest.raises(NotImplementedError, match="vce='hc3'"):
406+
bias_corrected_local_linear(d, y, h=0.3, vce="hc3")
400407

401408

402409
# =============================================================================
@@ -503,19 +510,14 @@ def test_auto_cluster_returns_finite(self):
503510
# be identical to bit-parity.
504511
assert fit_cluster.h != fit_uncluster.h
505512

506-
def test_auto_vce_hc1_returns_finite(self):
507-
"""Auto-bandwidth with non-default vce must use the requested vce
508-
during bandwidth selection, not silently fall back to nn."""
513+
def test_auto_vce_hc1_rejected_in_phase_1c(self):
514+
"""Phase 1c public wrapper restricts vce to 'nn'; hc-modes raise.
515+
Port-level lprobust still forwards vce through, so
516+
lpbwselect_mse_dpi gets the correct setting when unlocked in
517+
Phase 2+."""
509518
d, y = self._smoke_data()
510-
fit_hc1 = bias_corrected_local_linear(d, y, vce="hc1")
511-
fit_nn = bias_corrected_local_linear(d, y, vce="nn")
512-
assert fit_hc1.bandwidth_source == "auto"
513-
assert np.isfinite(fit_hc1.estimate_bias_corrected)
514-
assert np.isfinite(fit_hc1.se_robust)
515-
# Different residual definitions yield different stage-2/3 AMSE
516-
# and therefore different bandwidths. Bit-identity would indicate
517-
# the selector silently ignored vce.
518-
assert fit_hc1.h != fit_nn.h
519+
with pytest.raises(NotImplementedError, match="vce='hc1'"):
520+
bias_corrected_local_linear(d, y, vce="hc1")
519521

520522
def test_auto_nnmatch_non_default_returns_finite(self):
521523
"""Auto-bandwidth with non-default nnmatch must forward it to the

0 commit comments

Comments
 (0)