Address PR #392 R8 review (1 P3, doc-scope precision)

igerber · claude · igerber · commit 476cc3233065 · 2026-04-26T10:52:35.000-04:00
R8 was ✅ Looks good — only 1 P3 doc nit.

The CHANGELOG and REGISTRY parity claim said "HAD R-package end-
to-end parity" without qualifying that the harness explicitly
forces `HeterogeneousAdoptionDiD(design="continuous_at_zero")`. R
`did_had` always evaluates the local-linear at d=0 regardless of
dose distribution; our default `design="auto"` may legitimately
resolve to `continuous_near_d_lower` or `mass_point` on dose
distributions with boundary density bounded away from zero (e.g.
Beta(2,2) at G=200), in which case the WAS estimand evaluates at
a different point and diverges from R numerically. That
divergence is methodologically defensible — our auto-detect uses
more information when boundary mass is sparse — but it means the
parity test does NOT validate the default `design="auto"`
surface, only the Design 1' surface that R also uses.
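
A minimal NumPy-only sketch of why the three fixture dose
distributions can behave differently at the lower boundary. It draws
G=200 doses from each DGP named in the fixture and reports the
smallest observed dose. The helper name `min_dose_by_dgp`, the seed,
and the Uniform(0, 1) support are assumptions made for illustration.
It does not call `HeterogeneousAdoptionDiD` and does not reproduce the
library's auto-detect rule, which this commit does not describe.

```python
import numpy as np


def min_dose_by_dgp(G=200, seed=0):
    """Return the smallest sampled dose for each illustrative DGP.

    Hypothetical helper: it only shows how far the sample minimum sits
    from d=0, which is the point where R `did_had` evaluates its fit.
    """
    rng = np.random.default_rng(seed)
    draws = {
        "Uniform(0,1)": rng.uniform(0.0, 1.0, size=G),
        "Beta(2,2)": rng.beta(2.0, 2.0, size=G),
        "Beta(0.5,1)": rng.beta(0.5, 1.0, size=G),
    }
    return {name: float(d.min()) for name, d in draws.items()}


if __name__ == "__main__":
    for name, d_min in min_dose_by_dgp().items():
        print(f"{name:>12}: min dose = {d_min:.6f}")
```

Comparing the printed minima across seeds gives a rough sense of how
much data each DGP places near d=0, which is the situation in which a
data-driven design choice and a fixed evaluation at d=0 can part ways.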

Updated both wording surfaces to qualify "on the
`design='continuous_at_zero'` (Design 1') surface" and explain
the auto-detect divergence as out-of-scope-for-this-test (not a
defect).

Stats: 540 tests pass, 0 regressions. Doc-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,7 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 - **HAD `trends_lin=True` linear-trend detrending mode** on `HeterogeneousAdoptionDiD.fit(aggregate="event_study")`, `joint_pretrends_test`, and `joint_homogeneity_test`. Mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)` (paper Eq. 17 / Eq. 18 / page 32 joint-Stute homogeneity-with-trends). Per-group linear-trend slope estimated as `Y[g, F-1] - Y[g, F-2]` and applied as `(t - base) × slope` adjustment to per-event-time outcome evolutions. Requires F ≥ 3 (panel must contain F-2). The "consumed" placebo at our event-time `e=-2` is auto-dropped (R reduces max placebo lag by 1 with the same effect). Mutually exclusive with survey weighting (`survey_design` / `survey` / `weights`): raises `NotImplementedError` per `feedback_per_method_survey_element_contract` (weighted slope estimator not derived from paper; tracked in TODO.md as a follow-up). Bit-exact backcompat for `trends_lin=False` (default). Patch-level (additive keyword-only kwarg).
-- **HAD R-package end-to-end parity test** vs `DIDHAD` v2.0.0 (`Credible-Answers/did_had`). New parity fixture `benchmarks/data/did_had_golden.json` generated by `benchmarks/R/generate_did_had_golden.R` covers 3 paper-derived synthetic DGPs (Uniform, Beta(2,2), Beta(0.5,1)) × 5 method combinations (overall, event-study, placebo, yatchew, trends_lin). Python parity test `tests/test_did_had_parity.py` asserts point estimate / SE / CI bounds at `atol=1e-8` and Yatchew T-stat at `atol=1e-10` after a documented `× G/(G-1)` finite-sample convention shift. Two intentional convention deviations from R, documented in `docs/methodology/REGISTRY.md`: (a) we report the bias-corrected point estimate (modern CCF 2018 convention; R's `Estimate` column reports the conventional estimate with the bias-corrected CI separately — our `att` matches R's CI midpoint); (b) Yatchew uses paper Appendix E's literal (1/G) variance-denominator convention while R uses base-R `var()`'s (1/(N-1)) sample-variance convention (parity is bit-exact after the `× G/(G-1)` shift). Yatchew on placebos with R's mean-independence null (`order=0`) is not yet exposed in our `yatchew_hr_test` (we currently only support the linearity null) and is skipped in the parity test; tracked as TODO follow-up.
+- **HAD R-package end-to-end parity test** vs `DIDHAD` v2.0.0 (`Credible-Answers/did_had`) on the **`design="continuous_at_zero"` (Design 1') surface**. New parity fixture `benchmarks/data/did_had_golden.json` generated by `benchmarks/R/generate_did_had_golden.R` covers 3 paper-derived synthetic DGPs (Uniform, Beta(2,2), Beta(0.5,1)) × 5 method combinations (overall, event-study, placebo, yatchew, trends_lin). The harness explicitly forces `HeterogeneousAdoptionDiD(design="continuous_at_zero")` because R `did_had` always evaluates the local-linear at `d=0` regardless of dose distribution; our default `design="auto"` may legitimately choose `continuous_near_d_lower` or `mass_point` on dose distributions with boundary density bounded away from zero (e.g., Beta(2,2)) and thereby diverge from R numerically — that divergence is methodologically defensible but out of scope for this parity test. Python parity test `tests/test_did_had_parity.py` asserts point estimate / SE / CI bounds at `atol=1e-8` and Yatchew T-stat at `atol=1e-10` after a documented `× G/(G-1)` finite-sample convention shift. Two intentional convention deviations from R, documented in `docs/methodology/REGISTRY.md`: (a) we report the bias-corrected point estimate (modern CCF 2018 convention; R's `Estimate` column reports the conventional estimate with the bias-corrected CI separately — our `att` matches R's CI midpoint); (b) Yatchew uses paper Appendix E's literal (1/G) variance-denominator convention while R uses base-R `var()`'s (1/(N-1)) sample-variance convention (parity is bit-exact after the `× G/(G-1)` shift). Yatchew on placebos with R's mean-independence null (`order=0`) is not yet exposed in our `yatchew_hr_test` (we currently only support the linearity null) and is skipped in the parity test; tracked as TODO follow-up.
 
 ## [3.3.1] - 2026-04-25
 
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -2499,7 +2499,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in
 - **Note:** Sum-of-CvMs aggregation is a standard joint specification-test construction (Delgado 1993; Escanciano 2006); the paper does not prescribe an aggregation rule. Sum-of-CvMs balances power across diffuse vs concentrated alternatives and bootstraps cleanly with shared-η.
 - **Note:** Event-study dispatch adjudicates step 3 via joint Stute only; there is no joint Yatchew variant because the paper does not derive one. The overall two-period path still uses the Phase 3 "Stute OR Yatchew" adjudication. Users who need Yatchew-style adjacent-difference variance-ratio robustness under multi-period data can run `yatchew_hr_test` on each (base, post) pair manually.
 - **Note (Phase 4 — Eq 17 / Eq 18 linear-trend detrending shipped):** `trends_lin: bool = False` (keyword-only) on `HeterogeneousAdoptionDiD.fit(aggregate="event_study")`, `joint_pretrends_test`, and `joint_homogeneity_test` (PR #389, 2026-04). Mirrors R `DIDHAD::did_had(..., trends_lin=TRUE)` (Credible-Answers/did_had v2.0.0, SHA `edc09197`). Per-group linear-trend slope estimated as `Y[g, F-1] - Y[g, F-2]` and applied as `(t - base) × slope` adjustment to per-event-time outcome evolutions (HAD.fit) or to `Y[g, t] - Y[g, base]` directly (joint pretests). The "consumed" placebo at our event-time `e=-2` is auto-dropped (R reduces max placebo lag by 1 with the same effect). Requires F ≥ 3 / `base_period - 1` in panel — front-door `ValueError` if not. Mutually exclusive with survey weighting (raises `NotImplementedError` per `feedback_per_method_survey_element_contract`; weighted slope estimator not derived from paper). Pierce-Schott published-number replication (paper p=0.51 / p=0.40 anchors) deferred indefinitely — primary analysis panel is LBD-restricted (Census FSRDC); the public-deposit proxy panel has filtering ambiguity that prevents exact published-number parity. Replaced by end-to-end R-package parity below, which is a strictly stronger correctness signal.
-- **Note (R-package end-to-end parity, PR #389):** Validated against `DIDHAD` v2.0.0 (Credible-Answers/did_had, SHA `edc09197`) on 3 paper-derived synthetic DGPs (Uniform, Beta(2,2), Beta(0.5,1)) × 5 method combinations (overall, event-study, placebo, yatchew, trends_lin). Generator: `benchmarks/R/generate_did_had_golden.R`; fixture: `benchmarks/data/did_had_golden.json`; test: `tests/test_did_had_parity.py`. Tolerances: point estimate / SE / CI bounds at `atol=1e-8`; closed-form Yatchew T-stat at `atol=1e-10` after a documented `× G/(G-1)` finite-sample convention shift. Two intentional convention deviations from R: **(a)** we report the **bias-corrected** point estimate `att = (mean(ΔY) - tau.bc) / mean(D)` (modern CCF 2018 convention); R's `Estimate` column reports the **conventional** estimate `(mean(ΔY) - tau.us) / mean(D)` with the bias-corrected CI separately — our `att` matches R's CI midpoint, our `se` / `conf_int_low` / `conf_int_high` match R's `se` / `ci_lo` / `ci_hi` directly. **(b)** Our `yatchew_hr_test` follows paper Appendix E's literal `(1/G)` and `(1/(2G))` variance-denominator convention; R's `YatchewTest::yatchew_test` uses base-R `var()`'s `(1/(N-1))` sample-variance convention. Ratio is exactly `N/(N-1)`; both converge to the same asymptotic null distribution. Yatchew on placebos with R's mean-independence null (`order=0`, fits `Y ~ 1`) is not yet exposed in our `yatchew_hr_test` (we always fit `Y ~ D`, the linearity null) and is skipped in the parity test; tracked as TODO follow-up to add a `null="mean_independence"` mode.
+- **Note (R-package end-to-end parity, PR #389):** Validated against `DIDHAD` v2.0.0 (Credible-Answers/did_had, SHA `edc09197`) on the **`design="continuous_at_zero"` (Design 1') surface**, on 3 paper-derived synthetic DGPs (Uniform, Beta(2,2), Beta(0.5,1)) × 5 method combinations (overall, event-study, placebo, yatchew, trends_lin). Generator: `benchmarks/R/generate_did_had_golden.R`; fixture: `benchmarks/data/did_had_golden.json`; test: `tests/test_did_had_parity.py`. **Scope qualifier (PR #392 R8 P3):** the harness explicitly forces `HeterogeneousAdoptionDiD(design="continuous_at_zero")` because R `did_had` always evaluates the local-linear at `d=0` regardless of dose distribution. Our default `design="auto"` may legitimately resolve to `continuous_near_d_lower` (`d_lower=d.min()`, Design 1) or `mass_point` (Design 2) on dose distributions with boundary density bounded away from zero (e.g., Beta(2,2) at G=200), in which case the WAS estimand evaluates at a different point and diverges from R's `did_had` numerically. That divergence is methodologically defensible — our auto-detect uses more information when boundary mass is sparse — but is out of scope for this parity contract. Tolerances: point estimate / SE / CI bounds at `atol=1e-8`; closed-form Yatchew T-stat at `atol=1e-10` after a documented `× G/(G-1)` finite-sample convention shift. Two intentional convention deviations from R: **(a)** we report the **bias-corrected** point estimate `att = (mean(ΔY) - tau.bc) / mean(D)` (modern CCF 2018 convention); R's `Estimate` column reports the **conventional** estimate `(mean(ΔY) - tau.us) / mean(D)` with the bias-corrected CI separately — our `att` matches R's CI midpoint, our `se` / `conf_int_low` / `conf_int_high` match R's `se` / `ci_lo` / `ci_hi` directly. **(b)** Our `yatchew_hr_test` follows paper Appendix E's literal `(1/G)` and `(1/(2G))` variance-denominator convention; R's `YatchewTest::yatchew_test` uses base-R `var()`'s `(1/(N-1))` sample-variance convention. Ratio is exactly `N/(N-1)`; both converge to the same asymptotic null distribution. Yatchew on placebos with R's mean-independence null (`order=0`, fits `Y ~ 1`) is not yet exposed in our `yatchew_hr_test` (we always fit `Y ~ D`, the linearity null) and is skipped in the parity test; tracked as TODO follow-up to add a `null="mean_independence"` mode.
 - **Note:** Horizon labels in `StuteJointResult.horizon_labels` are `str(t)` verbatim and carry STRING IDENTITY ONLY — NOT a chronological ordering key. Callers who need chronological order must preserve the original period values alongside (e.g. from the `pre_periods` / `post_periods` argument).
 - **Note:** NaN propagation is explicit: when any horizon has NaN in residuals, `cvm_stat_joint=NaN`, `p_value=NaN`, `reject=False`, AND `per_horizon_stats={label: np.nan for every horizon}` (full dict preserved with NaN values — not empty, not partial).