Clarify z-critical-value doc claims (audit export, not direct consumption)

igerber · claude · igerber · commit 12a8efea0715 · 2026-04-20T06:39:40.000-04:00
P3 doc-accuracy note from AI review: the registry, R generator, and
local_linear.py comments said the Python side "consumes R's z
directly" but the wrapper actually computes its own z via
scipy.stats.norm.ppf inside safe_inference(). The golden JSON's z
field is an audit/reference export so a reviewer can confirm R's
qnorm and Python's scipy.stats.norm.ppf agree to machine precision.
The parity harness compares Python-computed CI bounds to R-computed
CI bounds (not R's z fed into Python's arithmetic).

Behavior is unchanged; only doc phrasing.

Co-Authored-By: Claude Opus 4.7 (1M context) &lt;noreply@anthropic.com&gt;
diff --git a/benchmarks/R/generate_nprobust_lprobust_golden.R b/benchmarks/R/generate_nprobust_lprobust_golden.R
@@ -123,8 +123,11 @@ golden <- list(
     generator = "benchmarks/R/generate_nprobust_lprobust_golden.R",
     algorithm = paste("nprobust::lprobust(..., bwselect='mse-dpi') at a single",
                        "eval point, p=1, deriv=0, kernel='epa', vce='nn'",
-                       "unless noted. z = qnorm(1 - alpha/2) exported so the",
-                       "Python side consumes R's critical value directly.")
+                       "unless noted. The Python wrapper computes its own",
+                       "z_{1-alpha/2} via scipy.stats.norm.ppf inside",
+                       "safe_inference(); R's z is exported here for audit",
+                       "so a reviewer can verify the two critical values",
+                       "agree to machine precision.")
   ),
   dgp1 = c(list(n = G, d = d1, y = y1, kernel = "epa", vce = "nn",
                 description = "Uniform(0,1), polynomial m(d) = d + d^2"),
diff --git a/diff_diff/local_linear.py b/diff_diff/local_linear.py
@@ -1191,11 +1191,13 @@ def bias_corrected_local_linear(
     # cases where the residual vector collapses), ALL inference fields —
     # including the CI — must return NaN. This enforces the repo-wide
     # inference contract (CLAUDE.md Key Design Pattern #6; CI review
-    # PR #340 P0) rather than returning a misleading zero-width or infinite
-    # CI. safe_inference also handles the R z = qnorm(1 - alpha/2) critical
-    # value via scipy.stats.norm.ppf (the golden JSON stores R's z so
-    # parity tests consume R's value directly and drift is pure
-    # tau.bc + z * se.rb arithmetic).
+    # PR #340 P0) rather than returning a misleading zero-width or
+    # infinite CI. safe_inference computes the critical value z_{1-α/2}
+    # via scipy.stats.norm.ppf; the parity tests compare Python's
+    # scipy-computed ci_low/ci_high to R's qnorm-computed ci_low/ci_high
+    # stored in the golden JSON. The golden JSON also exports R's raw
+    # `z` value for audit/reference so a reviewer can verify the two
+    # critical values agree to machine precision.
     from diff_diff.utils import safe_inference
 
     _, _, (ci_low, ci_high) = safe_inference(
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -2314,7 +2314,7 @@ Shipped as `did_had_pretest_workflow()` and surfaced via `practitioner_next_step
 - [x] Phase 1a: `clubSandwich::vcovCR(..., type="CR2")` parity harness committed: R script at `benchmarks/R/generate_clubsandwich_golden.R` plus a regression-anchor JSON at `benchmarks/data/clubsandwich_cr2_golden.json`. **Note:** the committed JSON currently has `"source": "python_self_reference"` and pins numerical stability only; authoritative R-produced values are generated by running the R script, which the TODO.md row under Methodology/Correctness tracks. The parity test at `tests/test_linalg_hc2_bm.py::TestCR2BMCluster::test_cr2_parity_with_golden` runs at 1e-6 tolerance (Phase 1a plan commits 6-digit parity once R regen completes).
 - [x] Phase 1b: Calonico-Cattaneo-Farrell (2018) MSE-optimal bandwidth selector. In-house port of `nprobust::lpbwselect(bwselect="mse-dpi")` (nprobust 0.5.0, SHA `36e4e53`) as `diff_diff.mse_optimal_bandwidth` and `BandwidthResult`, backed by the private `diff_diff._nprobust_port` module (`kernel_W`, `lprobust_bw`, `lpbwselect_mse_dpi`). Three-stage DPI with four `lprobust.bw` calls at orders `q+1`, `q+2`, `q`, `p`. Python matches R to `0.0000%` relative error (i.e., bit-parity within float64 precision, ~8-13 digits agreement) on all five stage bandwidths (`c_bw`, `bw_mp2`, `bw_mp3`, `b_mse`, `h_mse`) across three deterministic DGPs (uniform, Beta(2,2), half-normal) via `benchmarks/R/generate_nprobust_golden.R` → `benchmarks/data/nprobust_mse_dpi_golden.json`. **Note:** `weights=` is currently unsupported (raises `NotImplementedError`); nprobust's `lpbwselect` has no weight argument so there is no parity anchor. Weighted-data support deferred to Phase 2 (survey-design adaptation). **Note (public API scope restriction):** the exported wrapper `mse_optimal_bandwidth` hard-codes the HAD Phase 1b configuration (`p=1`, `deriv=0`, `interior=False`, `vce="nn"`, `nnmatch=3`). The underlying port supports a broader surface (`hc0`/`hc1`/`hc2`/`hc3` variance, interior evaluation, higher `p`), but those paths are not parity-tested against `nprobust` and are deferred. Callers needing the broader surface should use `diff_diff._nprobust_port.lpbwselect_mse_dpi` directly and accept that parity has not been verified on non-HAD configurations. **Note (input contract):** the wrapper enforces HAD's support restriction `D_{g,2} >= 0` (front-door `ValueError` on negative doses and empty inputs). `boundary` must equal `0` (Design 1') or `float(d.min())` (Design 1 continuous-near-d_lower) within float tolerance; off-support values raise `ValueError`. When `boundary ~ 0`, the wrapper additionally requires `d.min() <= 0.05 * median(|d|)` as a Design 1' support plausibility heuristic, chosen to pass the paper's thin-boundary-density DGPs (Beta(2,2), d.min/median ~ 3%) while rejecting substantially off-support samples (U(0.5, 1.0), d.min/median ~ 1.0). Detected mass-point designs (`d.min() > 0` with modal fraction at `d.min() > 2%`) raise `NotImplementedError` pointing to the Phase 2 2SLS path per paper Section 3.2.4.
 - [x] Phase 1c: First-order bias estimator `M̂_{ĥ*_G}` and robust variance `V̂_{ĥ*_G}`. Implemented via Calonico-Cattaneo-Titiunik (2014) bias-combined design matrix `Q.q` in the in-house port `diff_diff._nprobust_port.lprobust` (single-eval-point path of `nprobust::lprobust`, npfunctions.R:177-246).
-- [x] Phase 1c: Bias-corrected CI (Equation 8) with `nprobust` parity. Public wrapper `diff_diff.bias_corrected_local_linear` returns `BiasCorrectedFit` with μ̂-scale point estimate, robust SE, and bias-corrected 95% CI `[tau.bc ± z_{1-α/2} * se.rb]`. The β-scale rescaling from Equation 8, `(1/G) Σ D_{g,2}`, is applied by Phase 2's `HeterogeneousAdoptionDiD.fit()`. Parity against `nprobust::lprobust(..., bwselect="mse-dpi")` is tiered: `atol=1e-12` on `tau_cl`/`tau_bc`/`se_cl`/`se_rb` (the `Q.q` outer-product + broadcast-multiply path drifts slightly from the Phase 1b bit-parity primitives), `atol=1e-13` on CI bounds (R's `z_{1-α/2}` is stored in the golden JSON so Python consumes it directly rather than calling `scipy.stats.norm.ppf`). The clustered DGP achieves bit-parity (`atol=1e-14`) when cluster IDs happen to be sorted; otherwise BLAS reduction ordering can drift to `atol=1e-10`. Generator: `benchmarks/R/generate_nprobust_lprobust_golden.R`. **Note:** The wrapper matches nprobust's `rho=1` default (`b = h` in auto mode), so Phase 1b's separately-computed `b_mse` is surfaced via `bandwidth_diagnostics.b_mse` but not applied. **Deviation from R:** nprobust silently accepts `vce="hc2"`/`vce="hc3"` combined with `cluster=`; this wrapper raises `ValueError` because the combination is not a well-defined estimator (hc2/hc3 assume independent observations). **Note (Phase 1c internal bug workaround):** The clustered golden DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster shape bug in `lprobust.vce` fired by the mse-dpi pilot fits; the Python port has no equivalent bug.
+- [x] Phase 1c: Bias-corrected CI (Equation 8) with `nprobust` parity. Public wrapper `diff_diff.bias_corrected_local_linear` returns `BiasCorrectedFit` with μ̂-scale point estimate, robust SE, and bias-corrected 95% CI `[tau.bc ± z_{1-α/2} * se.rb]`. The β-scale rescaling from Equation 8, `(1/G) Σ D_{g,2}`, is applied by Phase 2's `HeterogeneousAdoptionDiD.fit()`. Parity against `nprobust::lprobust(..., bwselect="mse-dpi")` is tiered: `atol=1e-12` on `tau_cl`/`tau_bc`/`se_cl`/`se_rb` (the `Q.q` outer-product + broadcast-multiply path drifts slightly from the Phase 1b bit-parity primitives), `atol=1e-13` on CI bounds (the Python wrapper computes its own `z_{1-α/2}` via `scipy.stats.norm.ppf` inside `safe_inference()`; R's `qnorm` value is stored in the golden JSON for audit, and the parity harness compares Python's CI bounds to R's pre-computed CI bounds so any residual drift is purely the floating-point arithmetic in `tau.bc ± z * se.rb`, not a critical-value disagreement). The clustered DGP achieves bit-parity (`atol=1e-14`) when cluster IDs happen to be sorted; otherwise BLAS reduction ordering can drift to `atol=1e-10`. Generator: `benchmarks/R/generate_nprobust_lprobust_golden.R`. **Note:** The wrapper matches nprobust's `rho=1` default (`b = h` in auto mode), so Phase 1b's separately-computed `b_mse` is surfaced via `bandwidth_diagnostics.b_mse` but not applied. **Deviation from R:** nprobust silently accepts `vce="hc2"`/`vce="hc3"` combined with `cluster=`; this wrapper raises `ValueError` because the combination is not a well-defined estimator (hc2/hc3 assume independent observations). **Note (Phase 1c internal bug workaround):** The clustered golden DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster shape bug in `lprobust.vce` fired by the mse-dpi pilot fits; the Python port has no equivalent bug.
 - [ ] Phase 2: `HeterogeneousAdoptionDiD` class with separate code paths for Design 1', Design 1 mass-point, and Design 1 continuous-near-`d̲`.
 - [ ] Phase 2: `design="auto"` detection rule (`min_g D_{g,2} < 0.01 · median_g D_{g,2}` → continuous_at_zero; modal-min fraction > 2% → mass_point; else continuous_near_lower).
 - [ ] Phase 2: Panel validator verifies `D_{g,1} = 0` for all units; error on staggered timing without last-cohort subgroup.