Address PR #370 R10 review (1 P1 + 1 P3)

igerber · claude · igerber · commit d8a7353dd8ea · 2026-04-25T13:00:53.000-04:00
R10 P1 (Methodology) -- Stute survey bootstrap was silently miscalibrated
on stratified designs. The HAD sup-t bootstrap (had.py:2120+) applies a
within-stratum demean + sqrt(n_h/(n_h-1)) small-sample correction AFTER
generate_survey_multiplier_weights_batch returns, to make the bootstrap
variance match the Binder-TSL stratified target. The same correction
has NOT been derived for the Stute CvM functional, so applying the
helper's raw multipliers directly to residual perturbations on stratified
designs left the bootstrap p-value silently miscalibrated.
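
For concreteness, the kind of correction described above can be sketched as
follows. This is an illustrative sketch only, not the code at had.py:2120+:
the function name `demean_within_strata`, the `strata_of_psu` argument, and
the choice to zero out singleton strata are all assumptions made for the
example. It takes an `(n_bootstrap, n_psu)` multiplier matrix, centers each
row within each stratum, and rescales by sqrt(n_h/(n_h-1)).

```python
import numpy as np


def demean_within_strata(eta_psu, strata_of_psu):
    """Hypothetical helper: center PSU multipliers within each stratum and
    rescale by sqrt(n_h / (n_h - 1)).

    eta_psu       : (n_bootstrap, n_psu) multiplier matrix.
    strata_of_psu : (n_psu,) stratum label of each PSU (assumed layout).
    """
    eta = np.asarray(eta_psu, dtype=float)
    strata = np.asarray(strata_of_psu)
    out = np.empty_like(eta)
    for h in np.unique(strata):
        cols = np.flatnonzero(strata == h)
        n_h = cols.size
        if n_h < 2:
            # Assumption for this sketch: a singleton stratum carries no
            # within-stratum variation, so its multiplier is set to zero.
            out[:, cols] = 0.0
            continue
        block = eta[:, cols]
        # Demean each bootstrap replicate (row) across the PSUs of stratum h.
        centered = block - block.mean(axis=1, keepdims=True)
        # Small-sample rescaling by sqrt(n_h / (n_h - 1)).
        out[:, cols] = centered * np.sqrt(n_h / (n_h - 1))
    return out
```

The point of the sketch is only to show what is applied after the multipliers
are drawn on the sup-t path; no analogous step has been derived for the
Stute CvM functional, which is why the raw multipliers are not trusted there.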

Per the reviewer's offered "narrow support" path: Phase 4.5 C now
explicitly rejects stratified designs on the Stute family with
NotImplementedError. Pweight-only, PSU-only, and FPC-only designs
remain supported (the helper's output is appropriately scaled for those
without further correction). Stratified is a follow-up after the
matching Stute-CvM stratified-correction derivation lands.
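
The resulting contract can be sketched in isolation as below. This is a
minimal stand-in, not the library code: `ResolvedDesignStub` and
`reject_stratified` are hypothetical names invented for the example, and the
message text is abbreviated. The only behavior taken from this commit is that
a non-None `strata` attribute triggers NotImplementedError while PSU-only,
FPC-only, and pweight-only designs pass through.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ResolvedDesignStub:
    """Hypothetical stand-in for a resolved survey design."""

    weights: Sequence[float]
    strata: Optional[Sequence[int]] = None
    psu: Optional[Sequence[int]] = None
    fpc: Optional[Sequence[float]] = None


def reject_stratified(design: ResolvedDesignStub, caller: str) -> None:
    """Raise if the design carries strata; otherwise do nothing."""
    if design.strata is not None:
        raise NotImplementedError(
            f"{caller}: SurveyDesign(strata=...) is not yet supported; "
            "the stratified correction has not been derived for the "
            "Stute CvM functional."
        )


# PSU-only design: the guard is a no-op.
reject_stratified(ResolvedDesignStub(weights=[1.0, 2.0], psu=[0, 1]), "stute_test")

# Stratified design: the guard raises.
try:
    reject_stratified(
        ResolvedDesignStub(weights=[1.0, 2.0], strata=[0, 0]), "stute_test"
    )
except NotImplementedError as exc:
    print(f"rejected: {exc}")
```

Raising instead of warning keeps a miscalibrated p-value from being returned
at all, which matches the explicit-rejection approach taken here.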

This mirrors the lonely_psu='adjust' rejection pattern (R5 P1): both are
explicit NotImplementedErrors driven by a methodology gap, each with a
documented follow-up. The strata guard supersedes the lonely_psu='adjust'
singleton-strata guard for any stratified design (the latter is now
defense-in-depth for the unstratified residual case).

R10 P3 -- Added "stratified-design rejection" entry to REGISTRY's
Note (Phase 4.5 C). Also updated CHANGELOG to narrow the documented
survey contract.

Tests updated:
- test_stute_test_lonely_psu_adjust_singletons_raises ->
  test_stute_test_stratified_design_raises (the strata guard now fires
  first, so the test still expects NotImplementedError but matches on
  the strata message).
- Same rename for the stute_joint_pretest variant.
- test_stute_test_lonely_psu_remove_singletons_returns_nan REMOVED
  (singleton strata under lonely_psu='remove' now hit the strata
  guard instead of the df_survey<=0 guard).
- test_joint_homogeneity_test_psu_strata_survey_smoke ->
  test_joint_homogeneity_test_psu_only_survey_smoke (positive coverage
  on PSU-only design) + new test_joint_homogeneity_test_stratified_raises.
- test_workflow_event_study_psu_strata_survey_smoke ->
  test_workflow_event_study_psu_only_survey_smoke.
- test_workflow_event_study_survey_pass_does_not_say_inconclusive
  switched from strata to PSU-only.

191 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
-- **HAD linearity-family pretests under survey (Phase 4.5 C).** `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` now accept `weights=` / `survey=` keyword-only kwargs. Stute family uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap): each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix, broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`, weighted OLS refit, weighted CvM via new `_cvm_statistic_weighted` helper. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence AND PSU clustering. Yatchew uses **closed-form weighted OLS + pweight-sandwich variance components** (no bootstrap): `sigma2_lin = sum(w·eps²)/sum(w)`, `sigma2_diff = sum(w_avg·diff²)/(2·sum(w))` with arithmetic-mean pair weights `w_avg_g = (w_g+w_{g-1})/2`, `sigma4_W = sum(w_avg·prod)/sum(w_avg)`, `T_hr = sqrt(sum(w))·(sigma2_lin-sigma2_diff)/sigma2_W`. All three Yatchew components reduce bit-exactly to the unweighted formulas at `w=ones(G)` (locked at `atol=1e-14` by direct helper test). The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (new `survey._make_trivial_resolved` helper) so the same kernel handles both entry paths. `did_had_pretest_workflow(..., survey=, weights=)` removes the Phase 4.5 C0 `NotImplementedError`, dispatches to the survey-aware sub-tests, **skips the QUG step with `UserWarning`** (per C0 deferral), sets `qug=None` on the report, and appends a `"linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0"` suffix to the verdict. `HADPretestReport.qug` retyped from `QUGTestResults` to `Optional[QUGTestResults]`; `summary()` / `to_dict()` / `to_dataframe()` updated to None-tolerant rendering. 
Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise `NotImplementedError` at every entry point (defense in depth, reciprocal-guard discipline) — parallel follow-up after this PR. Strictly positive weights required on Yatchew (the adjacent-difference variance is undefined under contiguous-zero blocks). Per-row `weights=` / `survey=col` aggregated to per-unit via existing HAD helpers `_aggregate_unit_weights` / `_aggregate_unit_resolved_survey` (constant-within-unit invariant enforced). Unweighted code paths preserved bit-exactly. Patch-level addition (additive on stable surfaces). See `docs/methodology/REGISTRY.md` § "QUG Null Test" — Note (Phase 4.5 C) for the full methodology.
+- **HAD linearity-family pretests under survey (Phase 4.5 C).** `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` now accept `weights=` / `survey=` keyword-only kwargs. Stute family uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap): each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix, broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`, weighted OLS refit, weighted CvM via new `_cvm_statistic_weighted` helper. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence AND PSU clustering. Yatchew uses **closed-form weighted OLS + pweight-sandwich variance components** (no bootstrap): `sigma2_lin = sum(w·eps²)/sum(w)`, `sigma2_diff = sum(w_avg·diff²)/(2·sum(w))` with arithmetic-mean pair weights `w_avg_g = (w_g+w_{g-1})/2`, `sigma4_W = sum(w_avg·prod)/sum(w_avg)`, `T_hr = sqrt(sum(w))·(sigma2_lin-sigma2_diff)/sigma2_W`. All three Yatchew components reduce bit-exactly to the unweighted formulas at `w=ones(G)` (locked at `atol=1e-14` by direct helper test). The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (new `survey._make_trivial_resolved` helper) so the same kernel handles both entry paths. `did_had_pretest_workflow(..., survey=, weights=)` removes the Phase 4.5 C0 `NotImplementedError`, dispatches to the survey-aware sub-tests, **skips the QUG step with `UserWarning`** (per C0 deferral), sets `qug=None` on the report, and appends a `"linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0"` suffix to the verdict. `HADPretestReport.qug` retyped from `QUGTestResults` to `Optional[QUGTestResults]`; `summary()` / `to_dict()` / `to_dataframe()` updated to None-tolerant rendering. 
Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise `NotImplementedError` at every entry point (defense in depth, reciprocal-guard discipline) — parallel follow-up after this PR. **Stratified designs (`SurveyDesign(strata=...)`) also raise `NotImplementedError` on the Stute family** — the within-stratum demean + `sqrt(n_h/(n_h-1))` correction that the HAD sup-t bootstrap applies to match the Binder-TSL stratified target has not been derived for the Stute CvM functional, so applying raw multipliers from `generate_survey_multiplier_weights_batch` directly to residual perturbations would leave the bootstrap p-value silently miscalibrated. Phase 4.5 C narrows survey support to **pweight-only**, **PSU-only** (`SurveyDesign(weights=, psu=)`), and **FPC-only** (`SurveyDesign(weights=, fpc=)`) designs; stratified is a follow-up after the matching Stute-CvM stratified-correction derivation lands. Strictly positive weights required on Yatchew (the adjacent-difference variance is undefined under contiguous-zero blocks). Per-row `weights=` / `survey=col` aggregated to per-unit via existing HAD helpers `_aggregate_unit_weights` / `_aggregate_unit_resolved_survey` (constant-within-unit invariant enforced). Unweighted code paths preserved bit-exactly. Patch-level addition (additive on stable surfaces). See `docs/methodology/REGISTRY.md` § "QUG Null Test" — Note (Phase 4.5 C) for the full methodology.
 - **`ChaisemartinDHaultfoeuille.by_path` + `placebo=True`** — per-path backward-horizon placebos `DID^{pl}_{path, l}` for `l = 1..L_max`. The same per-path SE convention used for the event-study (joiners/leavers IF precedent: switcher-side contributions zeroed for non-path groups; cohort structure and control pool unchanged; plug-in SE with path-specific divisor `N^{pl}_{l, path}`) is applied to backward horizons via the new `switcher_subset_mask` parameter on `_compute_per_group_if_placebo_horizon`. Surfaced on `results.path_placebo_event_study[path][-l]` (negative-int inner keys mirroring `placebo_event_study`); `summary()` renders the rows alongside per-path event-study horizons; `to_dataframe(level="by_path")` emits negative-horizon rows alongside the existing positive-horizon rows. **Bootstrap** (when `n_bootstrap > 0`) propagates per-`(path, lag)` percentile CI / p-value through the same `_bootstrap_one_target` dispatch as the per-path event-study, with the canonical NaN-on-invalid contract enforced on the new surface (PR #364 library-wide invariant). **SE inherits the cross-path cohort-sharing deviation from R** documented for `path_effects` (full-panel cohort-centered plug-in vs R's per-path re-run): tracks R within tolerance on single-path-cohort panels, diverges materially on cohort-mixed panels — the bootstrap SE is a Monte Carlo analog of the analytical SE and inherits the same deviation. R-parity confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathPlacebo` on the new `multi_path_reversible_by_path_placebo` scenario (point estimates exact match; SE within Phase-2 envelope rtol ≤ 5%); positive analytical + bootstrap invariants at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPlacebo` (and the gated `::TestBootstrap` subclass). See `docs/methodology/REGISTRY.md` §ChaisemartinDHaultfoeuille `Note (Phase 3 by_path ...)` → "Per-path placebos" for the full contract.
 
 ## [3.3.0] - 2026-04-25
diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py
@@ -1722,14 +1722,36 @@ def stute_test(
         # CvM recompute. Routes via synthetic trivial ResolvedSurveyDesign
         # for the weights= shortcut to share the same kernel.
         resolved_for_boot = survey if survey is not None else _make_trivial_resolved(w_arr)
+        # R10 P1: reject stratified designs explicitly until a derived
+        # Stute-specific correction lands. The HAD sup-t bootstrap
+        # (had.py:2120+) applies a within-stratum demean +
+        # sqrt(n_h/(n_h-1)) small-sample correction AFTER
+        # generate_survey_multiplier_weights_batch returns, to make the
+        # bootstrap variance match the Binder-TSL stratified target.
+        # That same correction has NOT been derived for the Stute CvM
+        # functional, so applying the helper's raw multipliers directly
+        # to residual perturbations on stratified designs leaves the
+        # bootstrap p-value silently miscalibrated. Pweight-only,
+        # PSU-only, and FPC-only designs are still supported (the
+        # helper's output is appropriately scaled for those).
+        if resolved_for_boot.strata is not None:
+            raise NotImplementedError(
+                "stute_test: SurveyDesign(strata=...) with stratified "
+                "sampling is not yet supported. The Stute CvM bootstrap "
+                "calibration on stratified designs requires a within-"
+                "stratum demean + sqrt(n_h/(n_h-1)) small-sample "
+                "correction analogous to the HAD sup-t bootstrap, but "
+                "the matching derivation for the Stute functional has "
+                "not been completed. Pweight-only or PSU-only "
+                "(SurveyDesign(weights=..., psu=...)) designs are "
+                "supported; pre-process stratified designs to remove "
+                "the strata column or wait for the derivation in a "
+                "follow-up PR."
+            )
         # R5 P1: reject lonely_psu='adjust' singleton-strata designs
-        # explicitly (mirrors HAD sup-t bootstrap at had.py:2081-2118).
-        # The bootstrap helper pools singletons into a pseudo-stratum with
-        # NONZERO multipliers, but the matching variance target requires
-        # a pseudo-stratum centering transform that is not derived for
-        # the Stute CvM. Other lonely_psu modes ("remove" / "certainty")
-        # produce zero multipliers for singletons and are caught by the
-        # df_survey guard below.
+        # explicitly (now redundant with the strata guard above; kept
+        # for defense in depth and for residual non-stratified
+        # singleton-strata edge cases).
         if _has_lonely_psu_adjust_singletons(resolved_for_boot):
             raise NotImplementedError(
                 "stute_test: SurveyDesign(lonely_psu='adjust') with "
@@ -2914,9 +2936,25 @@ def stute_joint_pretest(
         # vector-valued empirical-process unit-level dependence (paper
         # convention) AND PSU clustering (Krieger-Pfeffermann 1997).
         resolved_for_boot = survey if survey is not None else _make_trivial_resolved(w_arr)
+        # R10 P1: reject stratified designs explicitly until a derived
+        # Stute-specific correction lands (mirrors stute_test
+        # single-horizon).
+        if resolved_for_boot.strata is not None:
+            raise NotImplementedError(
+                "stute_joint_pretest: SurveyDesign(strata=...) with "
+                "stratified sampling is not yet supported. The Stute "
+                "CvM bootstrap calibration on stratified designs "
+                "requires a within-stratum demean + sqrt(n_h/(n_h-1)) "
+                "small-sample correction analogous to the HAD sup-t "
+                "bootstrap, but the matching derivation for the joint "
+                "Stute functional has not been completed. Pweight-only "
+                "or PSU-only designs are supported; pre-process "
+                "stratified designs to remove the strata column or wait "
+                "for the derivation in a follow-up PR."
+            )
         # R5 P1: reject lonely_psu='adjust' singleton-strata designs
-        # explicitly (mirrors stute_test single-horizon and HAD sup-t
-        # bootstrap).
+        # explicitly (now redundant with the strata guard above; kept
+        # for defense in depth).
         if _has_lonely_psu_adjust_singletons(resolved_for_boot):
             raise NotImplementedError(
                 "stute_joint_pretest: SurveyDesign(lonely_psu='adjust') "
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -2444,6 +2444,7 @@ Tuning-parameter-free test of `H_0: d̲ = 0` versus `H_1: d̲ > 0`. Shipped in `
     - `aggregate="event_study"`: `True` iff `pretrends_joint` is non-None and conclusive, `homogeneity_joint` is conclusive, AND neither rejects. Both joint variants must be conclusive on the event-study path (same step-2 + step-3 closure as the unweighted aggregate, just without the QUG step).
   - **Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) deferred** to a parallel follow-up. Each helper raises `NotImplementedError` on `survey.replicate_weights is not None` (defense in depth: workflow + every direct-helper entry rejects, mirroring the reciprocal-guard discipline from PR #346). The per-replicate weight-ratio rescaling for the OLS-on-residuals refit step is not covered by the multiplier-bootstrap composition above.
   - **`lonely_psu='adjust'` with singleton strata is rejected** with `NotImplementedError` on the Stute family (mirrors HAD sup-t bootstrap at `had.py:2081-2118`). The bootstrap multiplier helper pools singleton strata into a pseudo-stratum with nonzero multipliers, but the analytical variance target requires a pseudo-stratum centering transform that has not been derived for the Stute CvM. Use `lonely_psu='remove'` (drops singleton contributions) or `'certainty'` (zero-variance singletons); both produce all-zero singleton multipliers that match a well-defined analytical target. Variance-unidentified designs (`df_survey <= 0` after the adjust+singleton case is handled) return `NaN` with a `UserWarning` (single-PSU unstratified or one-PSU-per-stratum under remove/certainty).
+  - **Stratified designs (`SurveyDesign(strata=...)`) are rejected** with `NotImplementedError` on `stute_test` and `stute_joint_pretest` (and propagate to `did_had_pretest_workflow`). The HAD sup-t bootstrap (had.py:2120+) applies a within-stratum demean + `sqrt(n_h/(n_h-1))` small-sample correction AFTER `generate_survey_multiplier_weights_batch` to make the bootstrap variance match the Binder-TSL stratified target. That same correction has NOT been derived for the Stute CvM functional, so applying the helper's raw multipliers directly to residual perturbations on stratified designs would leave the bootstrap p-value silently miscalibrated. Phase 4.5 C narrows Stute-family survey support to **pweight-only** (no strata, no PSU), **PSU-only** (`SurveyDesign(weights=, psu=)`), and **FPC-only** (`SurveyDesign(weights=, fpc=)`) designs. Stratified designs are a follow-up after the matching Stute-CvM stratified-correction derivation lands.
   - **Constant-within-unit invariant**: per-row `weights=` / `survey=col` are aggregated to per-unit `(G,)` arrays via the existing HAD helpers `_aggregate_unit_weights` / `_aggregate_unit_resolved_survey` (had.py:1604, :1671); these enforce constant-within-unit invariant on weights and on every survey design column (strata, psu, fpc) and raise on violation. Direct callers passing already-resolved `ResolvedSurveyDesign` (or per-unit `weights` array) bypass this aggregation; the invariant is the caller's responsibility on that path.
   - **Distributional parity, NOT bit-exact**: at `weights=ones(G)` the survey path produces a different bootstrap p-value than the unweighted path because RNG consumption differs (batched `generate_survey_multiplier_weights_batch` vs per-iteration `_generate_mammen_weights`). The two paths agree DISTRIBUTIONALLY at large B (`|p_avg_diff| < 0.03` over 100 reps at `B=5000`); they DO NOT agree numerically at `atol=1e-10`. The unweighted code path is preserved bit-exactly (stability invariant; the new `weights=`/`survey=` arms are separate `if` branches).
 
diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py