Restrict Phase 1c wrapper vce to 'nn'; align tolerance docs

igerber · claude · igerber · commit 2c48e60e7efd · 2026-04-20T07:34:21.000-04:00
Address CI review P1 on HC2/HC3 leverage. nprobust's lprobust reuses
the p-fit hat-matrix leverage for the q-fit residual path
(lprobust.R:229-241). The Python port matches R exactly, but that R
behavior deserves a derivation and a dedicated golden anchor before
the public wrapper can advertise CCT-2014 conformance on hc-mode
paths. The Phase 1c plan always said hc0-3 were "exposed but not
golden-tested"; this commit restricts the public surface to what
actually has a parity anchor.

Changes:
- `bias_corrected_local_linear`: `vce != "nn"` now raises
  `NotImplementedError` with a pointer to the port-level
  `diff_diff._nprobust_port.lprobust` for callers who still need the
  broader surface. Docstring updated.
- Tests: swap positive-path hc-mode tests for negative-path raises;
  drop `test_auto_vce_hc1_returns_finite` (now superseded by
  `test_auto_vce_hc1_rejected_in_phase_1c`); drop the cluster+hc2/hc3
  tests (now covered by the plain hc2/hc3 rejection tests).
- REGISTRY.md: swap "Deviation from R" wording for a clearer
  "public-API surface restriction" note explaining the R-side hii
  reuse and the deferred derivation.
- TODO.md: upgrade the hc-mode expansion TODO from Low to Medium and
  document the q-fit leverage derivation as the gating work.

Also fix the P3 tolerance-doc inconsistency: registry/docstring now
say atol=1e-12 on all six scalars across unclustered DGPs (tests
actually assert 1e-12, with DGP 1 / DGP 3 landing closer to 1e-13
in practice).

Co-Authored-By: Claude Opus 4.7 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/TODO.md b/TODO.md
@@ -87,7 +87,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | TROP Rust vs Python grid-search divergence on rank-deficient Y: on two near-parallel control units, LOOCV grid-search ATT diverges ~6% between Rust (`trop_global.py:688`) and Python fallback (`trop_global.py:753`). Either grid-winner ties are broken differently or the per-λ solver reaches different stationary points under rank deficiency. Audit finding #23 flagged this surface. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_grid_search_rank_deficient_Y` baselines the gap. | `trop_global.py`, `rust/` | follow-up | Medium |
 | TROP Rust vs Python bootstrap SE divergence under fixed seed: `seed=42` on a tiny panel produces ~28% bootstrap-SE gap. Root cause: Rust bootstrap uses its own RNG (`rand` crate) while Python uses `numpy.random.default_rng`; same seed value maps to different bytestreams across backends. Audit axis-H (RNG/seed) adjacent. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility` baselines the gap. Unifying RNG (threading a numpy-generated seed-sequence into Rust, or porting Python to ChaCha) would close it. | `trop_global.py`, `rust/` | follow-up | Medium |
 | `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
-| `bias_corrected_local_linear`: extend golden parity to `vce in {"hc0", "hc1", "hc2", "hc3"}`. The port supports all four via `lprobust_res`'s hc branches but Phase 1c golden-tested only `vce="nn"`. | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
+| `bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | `diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Medium |
 | `bias_corrected_local_linear`: support `weights=` once survey-design adaptation lands. nprobust's `lprobust` has no weight argument so there is no parity anchor; derivation needed. | `diff_diff/local_linear.py`, `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Medium |
 | `bias_corrected_local_linear`: support multi-eval grid (`neval > 1`) with cross-covariance (`covgrid=TRUE` branch of `lprobust.R:253-378`). Not needed for HAD but useful for multi-dose diagnostics. | `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Low |
 | Clustered-DGP parity: Phase 1c's DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster bug in `lpbwselect.mse.dpi`'s pilot fits. Once nprobust ships a fix (or we derive one independently), add a clustered-auto-bandwidth parity test. | `benchmarks/R/generate_nprobust_lprobust_golden.R` | Phase 1c | Low |
diff --git a/diff_diff/local_linear.py b/diff_diff/local_linear.py
@@ -990,9 +990,11 @@ def bias_corrected_local_linear(
         provided without ``h`` raises ``ValueError``.
     alpha : float, default=0.05
         CI level; ``0.05`` gives a 95% CI. Must be in ``(0, 1)``.
-    vce : {"nn", "hc0", "hc1", "hc2", "hc3"}, default="nn"
-        Variance-estimation method. ``"nn"`` is parity-tested in
-        Phase 1c; other modes are exposed but not golden-parity-tested.
+    vce : {"nn"}, default="nn"
+        Variance-estimation method. Only ``"nn"`` is supported in
+        Phase 1c; hc0/hc1/hc2/hc3 are queued for Phase 2+ pending
+        dedicated R parity goldens. Passing anything else raises
+        ``NotImplementedError``.
     cluster : np.ndarray or None
         Per-observation cluster IDs for cluster-robust variance. Missing
         (NaN) cluster IDs raise ``ValueError`` rather than silently
@@ -1019,18 +1021,19 @@ def bias_corrected_local_linear(
 
     Notes
     -----
-    Parity against ``nprobust::lprobust(..., bwselect="mse-dpi")`` is tiered
-    (see ``docs/methodology/REGISTRY.md``): ``atol=1e-12`` on ``tau_cl``,
-    ``tau_bc``, ``se_cl``, and ``se_rb`` across the three unclustered
-    golden DGPs; ``atol=1e-13`` on CI bounds. The Python wrapper computes
-    its own ``z_{1-alpha/2}`` via ``scipy.stats.norm.ppf`` inside
-    ``safe_inference()``; R's ``qnorm`` value is stored in the golden JSON
-    for audit, and the parity harness compares Python's CI bounds to R's
-    pre-computed CI bounds, so any residual drift is purely the
-    floating-point arithmetic in ``tau.bc +/- z * se.rb``, not a
-    critical-value disagreement. Clustered DGP 4 achieves bit-parity
-    (``atol=1e-14``) when cluster IDs happen to be in first-appearance
-    order; otherwise BLAS reduction ordering can drift to ``atol=1e-10``.
+    Parity against ``nprobust::lprobust(..., bwselect="mse-dpi")`` is
+    asserted at ``atol=1e-12`` on ``tau_cl``, ``tau_bc``, ``se_cl``,
+    ``se_rb``, ``ci_low``, and ``ci_high`` across the three unclustered
+    golden DGPs; DGP 1 and DGP 3 typically land closer to ``1e-13``.
+    The Python wrapper computes its own ``z_{1-alpha/2}`` via
+    ``scipy.stats.norm.ppf`` inside ``safe_inference()``; R's ``qnorm``
+    value is stored in the golden JSON for audit, and the parity harness
+    compares Python's CI bounds to R's pre-computed CI bounds, so any
+    residual drift is purely the floating-point arithmetic in
+    ``tau.bc +/- z * se.rb``, not a critical-value disagreement.
+    Clustered DGP 4 achieves bit-parity (``atol=1e-14``) when cluster
+    IDs are in first-appearance order; otherwise BLAS reduction
+    ordering can drift to ``atol=1e-10``.
     """
     if weights is not None:
         raise NotImplementedError(
@@ -1050,15 +1053,26 @@ def bias_corrected_local_linear(
     if not (0.0 < alpha < 1.0):
         raise ValueError(f"alpha must be in (0, 1); got {alpha!r}.")
 
-    # Reject manifestly-incompatible vce+cluster combinations upfront.
-    # nprobust's lprobust silently accepts hc2/hc3 + cluster and produces a
-    # weighted-score cluster meat that is NOT the standard CR2 or clustered
-    # HC estimator. Rather than propagate that ambiguity, reject.
-    if cluster is not None and vce in ("hc2", "hc3"):
-        raise ValueError(
-            f"vce={vce!r} combined with cluster= is not a well-defined "
-            "estimator in nprobust (hc2/hc3 assume independent observations)."
-            " Use vce='nn' or drop the cluster argument."
+    # Phase 1c public-wrapper vce restriction.
+    # Only ``vce="nn"`` is golden-tested against R in Phase 1c. Exposing
+    # hc0/hc1/hc2/hc3 on the public surface would ship non-parity-verified
+    # inference -- and nprobust's internal hc2/hc3 residual path reuses
+    # the p-fit hat-matrix leverage for the q-fit residuals (lprobust.R
+    # :229-241), a detail that would need its own R parity anchor before
+    # the Python port can advertise it as CCT-2014-compliant. Defer the
+    # hc-mode public surface to Phase 2+ with dedicated goldens. The
+    # port-level ``diff_diff._nprobust_port.lprobust`` still accepts
+    # hc0..hc3 for callers who need the broader surface and accept that
+    # the hc-mode behavior has not been separately parity-tested.
+    if vce != "nn":
+        raise NotImplementedError(
+            f"vce={vce!r} is not supported on bias_corrected_local_linear "
+            "in Phase 1c. Only vce='nn' is golden-tested against R "
+            "nprobust::lprobust in this phase; hc0/hc1/hc2/hc3 are queued "
+            "for Phase 2+ pending dedicated R parity goldens. If you need "
+            "the hc-mode port path for exploratory use, call "
+            "diff_diff._nprobust_port.lprobust directly and accept that "
+            "those paths are not separately parity-tested."
         )
 
     # HAD-scope input validation (shared with Phase 1b via _validate_had_inputs).
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -2314,7 +2314,7 @@ Shipped as `did_had_pretest_workflow()` and surfaced via `practitioner_next_step
 - [x] Phase 1a: `clubSandwich::vcovCR(..., type="CR2")` parity harness committed: R script at `benchmarks/R/generate_clubsandwich_golden.R` plus a regression-anchor JSON at `benchmarks/data/clubsandwich_cr2_golden.json`. **Note:** the committed JSON currently has `"source": "python_self_reference"` and pins numerical stability only; authoritative R-produced values are generated by running the R script, which the TODO.md row under Methodology/Correctness tracks. The parity test at `tests/test_linalg_hc2_bm.py::TestCR2BMCluster::test_cr2_parity_with_golden` runs at 1e-6 tolerance (Phase 1a plan commits 6-digit parity once R regen completes).
 - [x] Phase 1b: Calonico-Cattaneo-Farrell (2018) MSE-optimal bandwidth selector. In-house port of `nprobust::lpbwselect(bwselect="mse-dpi")` (nprobust 0.5.0, SHA `36e4e53`) as `diff_diff.mse_optimal_bandwidth` and `BandwidthResult`, backed by the private `diff_diff._nprobust_port` module (`kernel_W`, `lprobust_bw`, `lpbwselect_mse_dpi`). Three-stage DPI with four `lprobust.bw` calls at orders `q+1`, `q+2`, `q`, `p`. Python matches R to `0.0000%` relative error (i.e., bit-parity within float64 precision, ~8-13 digits agreement) on all five stage bandwidths (`c_bw`, `bw_mp2`, `bw_mp3`, `b_mse`, `h_mse`) across three deterministic DGPs (uniform, Beta(2,2), half-normal) via `benchmarks/R/generate_nprobust_golden.R` → `benchmarks/data/nprobust_mse_dpi_golden.json`. **Note:** `weights=` is currently unsupported (raises `NotImplementedError`); nprobust's `lpbwselect` has no weight argument so there is no parity anchor. Weighted-data support deferred to Phase 2 (survey-design adaptation). **Note (public API scope restriction):** the exported wrapper `mse_optimal_bandwidth` hard-codes the HAD Phase 1b configuration (`p=1`, `deriv=0`, `interior=False`, `vce="nn"`, `nnmatch=3`). The underlying port supports a broader surface (`hc0`/`hc1`/`hc2`/`hc3` variance, interior evaluation, higher `p`), but those paths are not parity-tested against `nprobust` and are deferred. Callers needing the broader surface should use `diff_diff._nprobust_port.lpbwselect_mse_dpi` directly and accept that parity has not been verified on non-HAD configurations. **Note (input contract):** the wrapper enforces HAD's support restriction `D_{g,2} >= 0` (front-door `ValueError` on negative doses and empty inputs). `boundary` must equal `0` (Design 1') or `float(d.min())` (Design 1 continuous-near-d_lower) within float tolerance; off-support values raise `ValueError`. When `boundary ~ 0`, the wrapper additionally requires `d.min() <= 0.05 * median(|d|)` as a Design 1' support plausibility heuristic, chosen to pass the paper's thin-boundary-density DGPs (Beta(2,2), d.min/median ~ 3%) while rejecting substantially off-support samples (U(0.5, 1.0), d.min/median ~ 1.0). Detected mass-point designs (`d.min() > 0` with modal fraction at `d.min() > 2%`) raise `NotImplementedError` pointing to the Phase 2 2SLS path per paper Section 3.2.4.
 - [x] Phase 1c: First-order bias estimator `M̂_{ĥ*_G}` and robust variance `V̂_{ĥ*_G}`. Implemented via Calonico-Cattaneo-Titiunik (2014) bias-combined design matrix `Q.q` in the in-house port `diff_diff._nprobust_port.lprobust` (single-eval-point path of `nprobust::lprobust`, npfunctions.R:177-246).
-- [x] Phase 1c: Bias-corrected CI (Equation 8) with `nprobust` parity. Public wrapper `diff_diff.bias_corrected_local_linear` returns `BiasCorrectedFit` with μ̂-scale point estimate, robust SE, and bias-corrected 95% CI `[tau.bc ± z_{1-α/2} * se.rb]`. The β-scale rescaling from Equation 8, `(1/G) Σ D_{g,2}`, is applied by Phase 2's `HeterogeneousAdoptionDiD.fit()`. Parity against `nprobust::lprobust(..., bwselect="mse-dpi")` is tiered: `atol=1e-12` on `tau_cl`/`tau_bc`/`se_cl`/`se_rb` (the `Q.q` outer-product + broadcast-multiply path drifts slightly from the Phase 1b bit-parity primitives), `atol=1e-13` on CI bounds (the Python wrapper computes its own `z_{1-α/2}` via `scipy.stats.norm.ppf` inside `safe_inference()`; R's `qnorm` value is stored in the golden JSON for audit, and the parity harness compares Python's CI bounds to R's pre-computed CI bounds so any residual drift is purely the floating-point arithmetic in `tau.bc ± z * se.rb`, not a critical-value disagreement). The clustered DGP achieves bit-parity (`atol=1e-14`) when cluster IDs happen to be sorted; otherwise BLAS reduction ordering can drift to `atol=1e-10`. Generator: `benchmarks/R/generate_nprobust_lprobust_golden.R`. **Note:** The wrapper matches nprobust's `rho=1` default (`b = h` in auto mode), so Phase 1b's separately-computed `b_mse` is surfaced via `bandwidth_diagnostics.b_mse` but not applied. **Deviation from R:** nprobust silently accepts `vce="hc2"`/`vce="hc3"` combined with `cluster=`; this wrapper raises `ValueError` because the combination is not a well-defined estimator (hc2/hc3 assume independent observations). **Note (Phase 1c internal bug workaround):** The clustered golden DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster shape bug in `lprobust.vce` fired by the mse-dpi pilot fits; the Python port has no equivalent bug.
+- [x] Phase 1c: Bias-corrected CI (Equation 8) with `nprobust` parity. Public wrapper `diff_diff.bias_corrected_local_linear` returns `BiasCorrectedFit` with μ̂-scale point estimate, robust SE, and bias-corrected 95% CI `[tau.bc ± z_{1-α/2} * se.rb]`. The β-scale rescaling from Equation 8, `(1/G) Σ D_{g,2}`, is applied by Phase 2's `HeterogeneousAdoptionDiD.fit()`. Parity against `nprobust::lprobust(..., bwselect="mse-dpi")` is asserted at `atol=1e-12` on `tau_cl`/`tau_bc`/`se_cl`/`se_rb`/`ci_low`/`ci_high` across the three unclustered golden DGPs (DGP 1 and DGP 3 typically land closer to `1e-13`). The Python wrapper computes its own `z_{1-α/2}` via `scipy.stats.norm.ppf` inside `safe_inference()`; R's `qnorm` value is stored in the golden JSON for audit, and the parity harness compares Python's CI bounds to R's pre-computed CI bounds so any residual drift is purely the floating-point arithmetic in `tau.bc ± z * se.rb`, not a critical-value disagreement. The clustered DGP achieves bit-parity (`atol=1e-14`) when cluster IDs are in first-appearance order; otherwise BLAS reduction ordering can drift to `atol=1e-10`. Generator: `benchmarks/R/generate_nprobust_lprobust_golden.R`. **Note:** The wrapper matches nprobust's `rho=1` default (`b = h` in auto mode), so Phase 1b's separately-computed `b_mse` is surfaced via `bandwidth_diagnostics.b_mse` but not applied. **Note (public-API surface restriction):** Phase 1c restricts the public wrapper's `vce` parameter to `"nn"`; hc0/hc1/hc2/hc3 raise `NotImplementedError` and are queued for Phase 2+ pending dedicated R parity goldens. The port-level `diff_diff._nprobust_port.lprobust` still accepts all five vce modes (matching R's `nprobust::lprobust` signature) for callers who need the broader surface and accept that the hc-mode variance path — which reuses p-fit hat-matrix leverage for the q-fit residual in R (lprobust.R:229-241) — has not been separately parity-tested. **Note (Phase 1c internal bug workaround):** The clustered golden DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster shape bug in `lprobust.vce` fired by the mse-dpi pilot fits; the Python port has no equivalent bug.
 - [ ] Phase 2: `HeterogeneousAdoptionDiD` class with separate code paths for Design 1', Design 1 mass-point, and Design 1 continuous-near-`d̲`.
 - [ ] Phase 2: `design="auto"` detection rule (`min_g D_{g,2} < 0.01 · median_g D_{g,2}` → continuous_at_zero; modal-min fraction > 2% → mass_point; else continuous_near_lower).
 - [ ] Phase 2: Panel validator verifies `D_{g,1} = 0` for all units; error on staggered timing without last-cohort subgroup.
diff --git a/tests/test_bias_corrected_lprobust.py b/tests/test_bias_corrected_lprobust.py
@@ -380,23 +380,30 @@ def test_cluster_with_manual_bandwidth(self):
         assert np.isfinite(fit.estimate_bias_corrected)
         assert fit.se_robust > 0
 
-    def test_vce_hc2_with_cluster_raises(self):
-        d = np.linspace(0.05, 1.0, 100)
+    def test_vce_hc0_raises_not_implemented(self):
+        """Only vce='nn' is golden-tested in Phase 1c; hc-modes raise."""
+        d = np.linspace(0.0, 1.0, 100)
         y = d.copy()
-        cluster = np.repeat(np.arange(10), 10)
-        with pytest.raises(ValueError, match="hc2.*not a well-defined"):
-            bias_corrected_local_linear(
-                d, y, h=0.3, vce="hc2", cluster=cluster,
-            )
+        with pytest.raises(NotImplementedError, match="vce='hc0'"):
+            bias_corrected_local_linear(d, y, h=0.3, vce="hc0")
 
-    def test_vce_hc3_with_cluster_raises(self):
-        d = np.linspace(0.05, 1.0, 100)
+    def test_vce_hc1_raises_not_implemented(self):
+        d = np.linspace(0.0, 1.0, 100)
+        y = d.copy()
+        with pytest.raises(NotImplementedError, match="vce='hc1'"):
+            bias_corrected_local_linear(d, y, h=0.3, vce="hc1")
+
+    def test_vce_hc2_raises_not_implemented(self):
+        d = np.linspace(0.0, 1.0, 100)
+        y = d.copy()
+        with pytest.raises(NotImplementedError, match="vce='hc2'"):
+            bias_corrected_local_linear(d, y, h=0.3, vce="hc2")
+
+    def test_vce_hc3_raises_not_implemented(self):
+        d = np.linspace(0.0, 1.0, 100)
         y = d.copy()
-        cluster = np.repeat(np.arange(10), 10)
-        with pytest.raises(ValueError, match="hc3.*not a well-defined"):
-            bias_corrected_local_linear(
-                d, y, h=0.3, vce="hc3", cluster=cluster,
-            )
+        with pytest.raises(NotImplementedError, match="vce='hc3'"):
+            bias_corrected_local_linear(d, y, h=0.3, vce="hc3")
 
 
 # =============================================================================
@@ -503,19 +510,14 @@ def test_auto_cluster_returns_finite(self):
         # be identical to bit-parity.
         assert fit_cluster.h != fit_uncluster.h
 
-    def test_auto_vce_hc1_returns_finite(self):
-        """Auto-bandwidth with non-default vce must use the requested vce
-        during bandwidth selection, not silently fall back to nn."""
+    def test_auto_vce_hc1_rejected_in_phase_1c(self):
+        """Phase 1c public wrapper restricts vce to 'nn'; hc-modes raise.
+        Port-level lprobust still forwards vce through, so
+        lpbwselect_mse_dpi gets the correct setting when unlocked in
+        Phase 2+."""
         d, y = self._smoke_data()
-        fit_hc1 = bias_corrected_local_linear(d, y, vce="hc1")
-        fit_nn = bias_corrected_local_linear(d, y, vce="nn")
-        assert fit_hc1.bandwidth_source == "auto"
-        assert np.isfinite(fit_hc1.estimate_bias_corrected)
-        assert np.isfinite(fit_hc1.se_robust)
-        # Different residual definitions yield different stage-2/3 AMSE
-        # and therefore different bandwidths. Bit-identity would indicate
-        # the selector silently ignored vce.
-        assert fit_hc1.h != fit_nn.h
+        with pytest.raises(NotImplementedError, match="vce='hc1'"):
+            bias_corrected_local_linear(d, y, vce="hc1")
 
     def test_auto_nnmatch_non_default_returns_finite(self):
         """Auto-bandwidth with non-default nnmatch must forward it to the