Address PR #370 R1 review (1 P0 + 2 P1 + 1 P3)

igerber · claude · igerber · commit fb032672790c · 2026-04-25T09:01:36.000-04:00
R1 P0: Stute survey path silently accepted zero-weight units, which leak into the dose-variation check + CvM cusum + bootstrap refit while contributing zero population mass. Extreme case: only zero-weight units carry dose variation -> spurious finite test statistic with no warning. Fix: strictly-positive guards on every survey-aware Stute / Yatchew / workflow entry point (the weights= shortcut already had this; survey= branch was the gap).

R1 P1 #1: aweight/fweight survey designs slipped through pweight-only formulas silently (the variance components are derived assuming pweight sandwich semantics). Fix: weight_type='pweight' guards added in _resolve_pretest_unit_weights and on every direct-helper survey= branch (stute_test, yatchew_hr_test, stute_joint_pretest). Mirrors HAD.fit guard at had.py:2976 + survey._resolve_pweight_only at survey.py:914.

R1 P1 #2: workflow's row-level weights= crashed on staggered event-study panels because _validate_multi_period_panel filters to last cohort but the joint wrappers re-aggregate with the original full-panel weights array. Fix: subset joint_weights to data_filtered's rows via data.index.get_indexer(data_filtered.index) BEFORE passing to the wrappers. Mirrors HeterogeneousAdoptionDiD.fit positional-index pattern. survey= path is unaffected (column references resolve internally on data_filtered).

R1 P3: REGISTRY C0 note still said "the same gate applies to did_had_pretest_workflow" and "Phase 4.5 C uses Rao-Wu rescaling"; both are stale post-C. Updated to clarify (a) workflow gate was temporary and is now closed by C, (b) qug_test direct-helper gate remains permanent, (c) C uses PSU-level Mammen multiplier bootstrap (NOT Rao-Wu rescaling).

7 new tests in TestPhase45CR1Regressions covering: zero-weight survey on stute_test / stute_joint_pretest / workflow; aweight rejection on stute_test / workflow; fweight rejection on yatchew_hr_test; staggered event-study workflow with weights= (catches the length-mismatch crash).

165 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
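
For reviewers, a minimal standalone sketch of the positional-index alignment used in the R1 P1 #2 fix. The toy panel, the `cohort` column, and the last-cohort filter are hypothetical stand-ins for what `_validate_multi_period_panel` does; only `data.index.get_indexer(data_filtered.index)` is taken from the patch. It assumes `data.index` is unique, since `get_indexer` raises on a non-unique index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel with a non-default index, so that index labels and
# row positions differ (labels 10..15 sit at positions 0..5).
data = pd.DataFrame(
    {
        "unit": [1, 1, 2, 2, 3, 3],
        "cohort": [2019, 2019, 2020, 2020, 2020, 2020],
    },
    index=[10, 11, 12, 13, 14, 15],
)

# Row-level weights supplied alongside the FULL panel (one entry per row).
weights = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])

# Stand-in for the staggered auto-filter: keep only the last cohort.
data_filtered = data[data["cohort"] == data["cohort"].max()]

# Map each filtered row's index label back to its position in the full panel.
# get_indexer returns -1 for labels that are missing from data.index.
pos_idx = data.index.get_indexer(data_filtered.index)
if (pos_idx < 0).any():
    raise ValueError("filtered rows are not a subset of the original panel")

# Subset the weights positionally so they line up with data_filtered.
weights_filtered = np.asarray(weights, dtype=np.float64)[pos_idx]

# Without the subsetting step, the lengths would be 6 vs 4.
assert weights_filtered.shape[0] == len(data_filtered)
```

Positional lookup through `get_indexer` is used rather than label-based indexing into `weights` because `weights` is a plain array with no index of its own; it only has meaning relative to the row order of the original `data`.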
diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py
@@ -1483,6 +1483,15 @@ def stute_test(
             "by the multiplier-bootstrap composition. Replicate-weight pretests "
             "are a parallel follow-up after Phase 4.5 C."
         )
+    # R1 P1: pweight-only guard on the direct-helper survey entry (mirrors
+    # _resolve_pretest_unit_weights for the workflow path).
+    if survey is not None and getattr(survey, "weight_type", "pweight") != "pweight":
+        raise ValueError(
+            f"stute_test: HAD pretests require weight_type='pweight'. Got "
+            f"weight_type={survey.weight_type!r}. aweight / fweight have "
+            "different sandwich-variance semantics that are not derived "
+            "for the Stute CvM bootstrap calibration."
+        )
 
     d_arr = _validate_1d_numeric(d, "d")
     dy_arr = _validate_1d_numeric(dy, "dy")
@@ -1539,6 +1548,20 @@ def stute_test(
                 f"stute_test: survey.weights length {w_arr.shape[0]} does not "
                 f"match d/dy length {G}."
             )
+        # R1 P0: strictly-positive weights at the per-unit level (mirrors
+        # workflow guard in _resolve_pretest_unit_weights). Zero-weight
+        # units would leak into the dose-variation check + CvM cusum +
+        # bootstrap refit, producing silently wrong pretest decisions on
+        # subpopulation-restricted designs (e.g. only zero-weight units
+        # carry dose variation -> spurious finite test statistic).
+        if (w_arr <= 0).any():
+            raise ValueError(
+                "stute_test: survey weights must be strictly positive. "
+                "Zero / negative weights would leave units in the "
+                "variance / CvM computation while contributing zero "
+                "population mass; pre-filter the panel to the positive-"
+                "weight subpopulation before calling stute_test."
+            )
     elif weights is not None:
         w_arr = np.asarray(weights, dtype=np.float64)
         if w_arr.shape[0] != G:
@@ -1807,6 +1830,13 @@ def yatchew_hr_test(
             "SDR) are not yet supported on HAD pretests. Replicate-weight "
             "pretests are a parallel follow-up after Phase 4.5 C."
         )
+    # R1 P1: pweight-only guard (aweight/fweight have different sandwich-
+    # variance semantics not derived for the variance-ratio statistic).
+    if survey is not None and getattr(survey, "weight_type", "pweight") != "pweight":
+        raise ValueError(
+            f"yatchew_hr_test: HAD pretests require weight_type='pweight'. "
+            f"Got weight_type={survey.weight_type!r}."
+        )
 
     d_arr = _validate_1d_numeric(d, "d")
     dy_arr = _validate_1d_numeric(dy, "dy")
@@ -2421,6 +2451,12 @@ def stute_joint_pretest(
             "JKn/SDR) are not yet supported on HAD pretests. Replicate-weight "
             "pretests are a parallel follow-up after Phase 4.5 C."
         )
+    # R1 P1: pweight-only guard.
+    if survey is not None and getattr(survey, "weight_type", "pweight") != "pweight":
+        raise ValueError(
+            f"stute_joint_pretest: HAD pretests require weight_type='pweight'. "
+            f"Got weight_type={survey.weight_type!r}."
+        )
 
     if not isinstance(residuals_by_horizon, dict) or not isinstance(fitted_by_horizon, dict):
         raise ValueError(
@@ -2604,6 +2640,14 @@ def stute_joint_pretest(
                 f"stute_joint_pretest: survey.weights length {w_arr.shape[0]} "
                 f"does not match doses length {G}."
             )
+        # R1 P0: strictly-positive guard (mirrors stute_test single-horizon).
+        if (w_arr <= 0).any():
+            raise ValueError(
+                "stute_joint_pretest: survey weights must be strictly "
+                "positive. Zero / negative weights would leave units in "
+                "the variance / CvM computation while contributing zero "
+                "population mass."
+            )
     elif weights is not None:
         w_arr = np.asarray(weights, dtype=np.float64)
         if w_arr.shape[0] != G:
@@ -2797,6 +2841,17 @@ def _resolve_pretest_unit_weights(
     if weights is not None:
         weights_arr = np.asarray(weights, dtype=np.float64)
         weights_unit = _aggregate_unit_weights(data, weights_arr, unit_col)
+        # R1 P0: strictly-positive weights required on the pweight shortcut
+        # (matches stute_test/yatchew_hr_test direct entry behavior; the CvM
+        # cusum + adjacent-difference variance assume all rows contribute).
+        if (weights_unit <= 0).any():
+            raise ValueError(
+                f"{caller_name}: weights must be strictly positive at the "
+                "per-unit level. Zero / negative weights would leave units "
+                "in the variance/CvM computation while contributing zero "
+                "mass; use survey= with explicit lonely-PSU handling for "
+                "principled subpopulation analysis."
+            )
         return weights_unit, None
     # survey is not None
     if not hasattr(survey, "resolve"):
@@ -2811,7 +2866,30 @@ def _resolve_pretest_unit_weights(
             "SDR) are not yet supported on HAD pretests. Replicate-weight "
             "pretests are a parallel follow-up after Phase 4.5 C."
         )
+    # R1 P1: pweight-only guard. aweight/fweight slip through pweight-only
+    # formulas silently otherwise (mirrors HeterogeneousAdoptionDiD.fit() at
+    # had.py:2976+ and survey._resolve_pweight_only at survey.py:914).
+    if getattr(resolved_full, "weight_type", "pweight") != "pweight":
+        raise ValueError(
+            f"{caller_name}: HAD pretests require weight_type='pweight'. "
+            f"Got weight_type={resolved_full.weight_type!r}. aweight / "
+            "fweight have different sandwich-variance semantics that are "
+            "not derived for the pretest variance components."
+        )
     resolved_unit = _aggregate_unit_resolved_survey(data, resolved_full, unit_col)
+    # R1 P0: strictly-positive weights at the per-unit level (mirrors the
+    # weights= shortcut). Zero per-unit weights leave units in the dose-
+    # variation check / CvM sum while contributing zero population mass,
+    # which can produce silently-wrong pretest decisions.
+    if (np.asarray(resolved_unit.weights) <= 0).any():
+        raise ValueError(
+            f"{caller_name}: survey weights must be strictly positive at "
+            "the per-unit level. Zero / negative weights would leave units "
+            "in the variance/CvM computation while contributing zero "
+            "mass; this would produce silently wrong pretest decisions on "
+            "subpopulation-restricted designs. Pre-filter the panel to "
+            "the positive-weight subpopulation before calling the workflow."
+        )
     return None, resolved_unit
 
 
@@ -3413,8 +3491,28 @@ def did_had_pretest_workflow(
 
         # Phase 4.5 C: forward weights/survey to the joint helpers. The
         # data-in wrappers handle their own per-row → per-unit aggregation
-        # via _resolve_pretest_unit_weights internally.
-        joint_weights = weights if use_survey_path and weights is not None else None
+        # via _resolve_pretest_unit_weights internally on `data_filtered`.
+        # R1 P1 fix: subset row-level `weights` to data_filtered's rows
+        # BEFORE passing through. Otherwise on staggered panels (where
+        # _validate_multi_period_panel auto-filters to last cohort),
+        # the wrappers would call _aggregate_unit_weights(data_filtered,
+        # weights[full_panel_length], ...) and crash on length mismatch.
+        # Mirrors HeterogeneousAdoptionDiD.fit()'s positional-index
+        # subsetting via `data.index.get_indexer(data_filtered.index)`.
+        # `survey=` carries column references resolved internally on
+        # data_filtered, so no subsetting needed there.
+        if use_survey_path and weights is not None:
+            pos_idx = data.index.get_indexer(data_filtered.index)
+            if (pos_idx < 0).any():
+                raise ValueError(
+                    "did_had_pretest_workflow: cannot align row-level "
+                    "weights to the staggered-filtered panel "
+                    "(some data_filtered rows do not appear in original "
+                    "data.index). This is a bug; please report."
+                )
+            joint_weights = np.asarray(weights, dtype=np.float64)[pos_idx]
+        else:
+            joint_weights = None
         joint_survey = survey if use_survey_path and survey is not None else None
 
         # Step 2: joint pre-trends on earlier pre-periods (those
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -2425,11 +2425,11 @@ Tuning-parameter-free test of `H_0: d̲ = 0` versus `H_1: d̲ > 0`. Shipped in `
 4. Theorem 4 establishes: asymptotic size `α`; uniform consistency against fixed alternatives; local power at rate `G` on the class `F^{d̲,d̄}_{m,K}` of differentiable cdfs with positive density and Lipschitz derivative.
 5. Li et al. (2024, Theorem 2.4) implies the QUG test is asymptotically independent of the WAS / TWFE estimator, so conditional inference on WAS given non-rejection does not distort inference (asymptotically; the paper's Footnote 8 notes the extension to triangular arrays is conjectured but not proven).
 - **Note:** Implementation is `O(G)` via `np.partition`; no sort required.
-- **Note (Phase 4.5 C0):** `qug_test(..., survey=...)` and `qug_test(..., weights=...)` raise `NotImplementedError` permanently (Phase 4.5 C0 decision gate, 2026-04). The same gate applies to `did_had_pretest_workflow(..., survey=...)` / `weights=`. Three reasons survey extension is genuinely hard, not "we just haven't done the lit review":
+- **Note (Phase 4.5 C0):** `qug_test(..., survey=...)` and `qug_test(..., weights=...)` raise `NotImplementedError` **permanently** (Phase 4.5 C0 decision gate, 2026-04 -- direct-helper gate is permanent). The Phase 4.5 C0 release also gated `did_had_pretest_workflow(..., survey=...)` / `weights=` with `NotImplementedError`, but that workflow gate was **temporary**: Phase 4.5 C (PR #370, 2026-04) replaces it with functional dispatch that skips the QUG step with `UserWarning` and runs the linearity family with the survey-aware mechanism (see Note (Phase 4.5 C) below for the full algorithm). Direct callers of `qug_test` still get the permanent rejection. Three reasons QUG-under-survey is genuinely hard, not "we just haven't done the lit review":
   1. **Extreme order statistics are not smooth functionals of the empirical CDF.** Standard survey machinery (Binder-TSL linearization via `compute_survey_if_variance`, Rao-Wu rescaled bootstrap via `bootstrap_utils.generate_rao_wu_weights`, Krieger-Pfeffermann (1997) EDF tests for complex surveys) all rely on Hadamard differentiability of the test statistic in the empirical CDF. The first two order statistics are NOT differentiable functionals — small perturbations to F near zero produce O(1) shifts in `D_{(1)}`. None of the standard survey-bootstrap or linearization tools give a calibrated test for QUG.
   2. **The `Exp(1)/Exp(1)` limit law assumes iid sampling with smooth density at zero.** Under cluster sampling, `D_{(1)}` and `D_{(2)}` may both come from the same PSU, breaking the independence required for the Poisson-process limit of rescaled spacings near the boundary. Under stratification, the smallest dose may come from a small stratum that's systematically over- or under-sampled, biasing the test.
   3. **The literature on EVT under unequal-probability sampling is sparse.** Quintos et al. (2001) and Beirlant et al. cover tail-INDEX estimation under unequal sample sizes. There is no off-the-shelf method for "test the support endpoint under complex sampling" in the standard survey-statistics toolkit. Adapting Hill / Pickands / DEdH estimators to the boundary problem would be novel research, not engineering. The de Chaisemartin et al. (2026) paper itself does not discuss survey extensions of QUG.
-  The survey-compatible alternative for HAD pretesting is **joint Stute** (a CvM cusum of regression residuals) — a smooth functional of the empirical CDF for which Krieger-Pfeffermann (1997) + Rao-Wu rescaled bootstrap give a calibrated survey-aware test. Phase 4.5 C ships survey support for the linearity family with mechanism varying by test: Rao-Wu rescaled bootstrap for `stute_test` and the joint variants (`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`); weighted OLS residuals + weighted variance estimator for `yatchew_hr_test` (Yatchew 1997 is a closed-form variance-ratio test, not bootstrap-based).
+  The survey-compatible alternative for HAD pretesting is **joint Stute** (a CvM cusum of regression residuals) — a smooth functional of the empirical CDF for which Krieger-Pfeffermann (1997) + a survey-aware multiplier bootstrap give a calibrated test. Phase 4.5 C (PR #370) ships survey support for the linearity family — the **PSU-level Mammen multiplier bootstrap** for `stute_test` and the joint variants (NOT Rao-Wu rescaling — multiplier bootstrap is a different mechanism), and **closed-form weighted OLS + pweight-sandwich variance components** for `yatchew_hr_test`. See the dedicated Note (Phase 4.5 C) below for the full algorithm.
   **Research direction (out of scope for diff-diff):** the bridge IS sketchable by combining (a) endpoint-estimation EVT under iid (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999, Beirlant-de Wet-Goegebeur 2006); (b) survey-aware functional CLT for the empirical process (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017); and (c) tail-empirical-process theory (Drees 2003) to define a "design-effective boundary intensity" `λ_eff = Σ_h W_h · f_h(0+)`. Under a "no boundary clumping" assumption (`P(D_{(1)}, D_{(2)}` in same PSU `| both ≤ δ) → 0`), the `Exp(1)/Exp(1)` limit law's pivotality is preserved and only the calibration needs a survey-aware bootstrap (subsampling within strata per Politis-Romano-Wolf, or Bertail et al.'s design-aware bootstrap). This is publishable methodology research — one paper, ~6-12 months for a methods PhD student. If the bridge gets built and published externally, this gate can be revisited.
 - **Note (Phase 4.5 C):** `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` accept `weights=` and `survey=ResolvedSurveyDesign` kwargs (or `survey=SurveyDesign` for the data-in entries). Mechanism varies by test:
   - **Stute family** (`stute_test`, `stute_joint_pretest`, joint wrappers) uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap). Each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix; multipliers broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`. The bootstrap residual perturbation is `dy_b = fitted + eps * w * eta_obs`, followed by weighted OLS refit and weighted CvM recompute via `_cvm_statistic_weighted`. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence (Delgado 1993; Escanciano 2006) AND PSU clustering (Krieger-Pfeffermann 1997). PSU-shared multipliers are conservative under no-within-PSU outcome correlation (over-clustering gives conservative size in finite samples), asymptotically correct under the standard survey assumption that PSU is the ultimate sampling unit AND outcomes correlate within PSU. The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (constructed via `survey._make_trivial_resolved`) so the kernel is shared across both entry paths. NOT "Rao-Wu rescaled bootstrap" — different mechanism (the Rao-Wu kernel rescales per-unit weights via stratified PSU resampling, while this kernel applies multipliers without resampling).
diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py