Address PR #370 R2 review (2 P1 + 1 P3)

igerber · claude · igerber · commit 085d8ebfa236 · 2026-04-25T09:29:04.000-04:00
R2 P1 #1 (Code Quality) -- Direct calls to joint_pretrends_test and joint_homogeneity_test still crashed on staggered panels because the staggered-weights subset fix from R1 was applied only at the workflow level. The wrappers run their own _validate_had_panel_event_study(), which may filter the panel to data_filtered, but they then passed the original full-panel weights array to _resolve_pretest_unit_weights(data_filtered, ...), which expects the filtered row count. Fix: subset row-level weights to the data_filtered.index positions (via data.index.get_indexer) BEFORE calling _resolve_pretest_unit_weights, mirroring the workflow fix.

R2 P1 #2 (Methodology) -- The REGISTRY note documented the bootstrap perturbation as `dy_b = fitted + eps * w * eta_obs`, but the code does `dy_b = fitted + eps * eta_obs` (no `* w`). The code is correct: the paper's Appendix D wild bootstrap perturbs UNWEIGHTED residuals; weighting flows through the OLS refit and the weighted CvM, not through the perturbation. Adding `* w` would over-weight by w². Fix: update the REGISTRY note to remove the spurious `* w` and clarify the canonical form. Add a regression test that pins (a) bit-exact cvm_stat reduction at uniform weights and (b) bootstrap p-value distributional agreement within Monte-Carlo noise.

R2 P3 -- In-code docstrings still referenced the pre-Phase-4.5-C contract:
- The qug_test docstring said survey-aware Stute "admits a Rao-Wu rescaled bootstrap" (a PSU-level Mammen multiplier bootstrap is what shipped). Updated to reflect the correct mechanism.
- The HADPretestReport.all_pass docstring described the unweighted contract only; the survey/weights path drops the QUG-conclusiveness gate (linearity-conditional admissibility per C0 deferral). Updated.
3 new regression tests in TestPhase45CR1Regressions:
- test_joint_pretrends_test_staggered_weights_subset
- test_joint_homogeneity_test_staggered_weights_subset
- test_stute_survey_perturbation_does_not_double_weight (locks the perturbation form via cvm_stat bit-exact reduction + p-value MC bound)

168 pretest tests pass (was 165 after R1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
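
The row-alignment step behind the R2 P1 #1 fix can be sketched in isolation. This is a minimal illustration, not the library code: the toy panel, its column names, and the boolean-mask filter standing in for the last-cohort auto-filter are all assumptions made for the example. Only `Index.get_indexer` and the negative-position check mirror the pattern used in the diff.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: 4 units x 2 periods with a default RangeIndex.
data = pd.DataFrame(
    {
        "unit": [1, 1, 2, 2, 3, 3, 4, 4],
        "F": [3, 3, 3, 3, 5, 5, 5, 5],  # first-treatment period per unit
    }
)
weights = np.arange(1.0, 9.0)  # one weight per row of `data`

# Stand-in for the staggered last-cohort filter (assumption): a row subset
# that keeps the original index labels.
data_filtered = data[data["F"] == 5]

# Map each filtered index label to its integer position in the original
# frame; get_indexer returns -1 for labels it cannot find.
pos_idx = data.index.get_indexer(data_filtered.index)
if (pos_idx < 0).any():
    raise ValueError("some filtered rows do not appear in the original index")

# Positional subset, so the weights line up row-for-row with data_filtered.
weights_filtered = np.asarray(weights, dtype=np.float64)[pos_idx]
```

One caveat worth keeping in mind: `get_indexer` requires the original index to be unique and raises on duplicate labels, so a panel with a non-unique index would need to be reset before this alignment.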
diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py
@@ -607,15 +607,19 @@ class HADPretestReport:
         Populated when ``aggregate == "event_study"``; ``None`` on the
         overall path.
     all_pass : bool
-        On the overall path: same Phase 3 semantics - True iff QUG is
-        conclusive AND at least one of Stute/Yatchew is conclusive AND
-        no conclusive test rejects. On the event-study path: True iff
-        ``np.isfinite(qug.p_value)``,
+        On the **unweighted overall path**: same Phase 3 semantics - True
+        iff QUG is conclusive AND at least one of Stute/Yatchew is
+        conclusive AND no conclusive test rejects. On the **unweighted
+        event-study path**: True iff ``np.isfinite(qug.p_value)``,
         ``pretrends_joint is not None and
         np.isfinite(pretrends_joint.p_value)``,
         ``np.isfinite(homogeneity_joint.p_value)``, AND none of the
-        three rejects. Mirrors Phase 3's ``bool(np.isfinite(p_value))``
-        convention - no ``.conclusive()`` helper on any result dataclass.
+        three rejects. On the **survey/weights path** (Phase 4.5 C): the
+        QUG-conclusiveness gate is dropped (qug=None per C0 deferral);
+        ``True`` iff at least one linearity test is conclusive AND no
+        conclusive test rejects (linearity-conditional admissibility).
+        Mirrors Phase 3's ``bool(np.isfinite(p_value))`` convention - no
+        ``.conclusive()`` helper on any result dataclass.
     verdict : str
         Human-readable classification. Paper rule applies symmetrically:
         TWFE is admissible only if NONE of the implemented tests
@@ -1202,18 +1206,18 @@ def qug_test(
     or pweight inputs (Phase 4.5 C0 decision gate, 2026-04). The test
     statistic uses extreme order statistics ``(D_{(1)}, D_{(2)})``, which
     are NOT smooth functionals of the empirical CDF -- standard survey
-    machinery (Binder TSL linearization, Rao-Wu rescaled bootstrap) does
-    not yield a calibrated test, and under cluster sampling the
-    ``Exp(1)/Exp(1)`` limit law's independence assumption breaks. The
-    extreme-value-theory-under-unequal-probability-sampling literature
-    (Quintos et al. 2001, Beirlant et al.) addresses tail-index
-    estimation, not boundary tests; no off-the-shelf survey-aware QUG
-    exists. Use joint Stute via :func:`did_had_pretest_workflow`
-    (``aggregate="event_study"``) for survey-aware HAD pretesting once
-    Phase 4.5 C ships -- Stute tests a smooth empirical-CDF functional
-    and admits a Rao-Wu rescaled bootstrap. See
-    ``docs/methodology/REGISTRY.md`` § "QUG Null Test" for the full
-    methodology note.
+    machinery (Binder TSL linearization, multiplier bootstrap, Rao-Wu
+    rescaled bootstrap) does not yield a calibrated test, and under
+    cluster sampling the ``Exp(1)/Exp(1)`` limit law's independence
+    assumption breaks. The extreme-value-theory-under-unequal-probability-
+    sampling literature (Quintos et al. 2001, Beirlant et al.) addresses
+    tail-index estimation, not boundary tests; no off-the-shelf
+    survey-aware QUG exists. Phase 4.5 C ships survey-aware Stute via
+    :func:`did_had_pretest_workflow` (which skips the QUG step under
+    survey/weights and runs the linearity family with a PSU-level Mammen
+    multiplier bootstrap for Stute and weighted OLS + pweight-sandwich
+    variance components for Yatchew). See ``docs/methodology/REGISTRY.md``
+    § "QUG Null Test" for the full methodology note.
 
     References
     ----------
@@ -3064,8 +3068,24 @@ def joint_pretrends_test(
     # Phase 4.5 C: aggregate per-row weights/survey to per-unit (G,)
     # using the existing HAD helpers (constant-within-unit invariant
     # enforced; replicate-weight rejected on the survey path).
+    # R2 P1 fix: subset row-level `weights` to data_filtered's rows BEFORE
+    # resolution, mirroring did_had_pretest_workflow. When
+    # _validate_had_panel_event_study auto-filters to the last cohort
+    # under staggered timing, the original weights array no longer aligns
+    # with data_filtered's row count. Survey= path is unaffected
+    # (column references resolved internally on data_filtered).
+    weights_for_resolve = weights
+    if weights is not None and len(data_filtered) != len(data):
+        pos_idx = data.index.get_indexer(data_filtered.index)
+        if (pos_idx < 0).any():
+            raise ValueError(
+                "joint_pretrends_test: cannot align row-level weights to "
+                "the staggered-filtered panel; some data_filtered rows do "
+                "not appear in original data.index."
+            )
+        weights_for_resolve = np.asarray(weights, dtype=np.float64)[pos_idx]
     weights_unit, resolved_unit = _resolve_pretest_unit_weights(
-        data_filtered, unit_col, weights, survey, "joint_pretrends_test"
+        data_filtered, unit_col, weights_for_resolve, survey, "joint_pretrends_test"
     )
     # Reorder per-unit weights to match d_arr/dy_by_horizon ordering.
     # _aggregate_for_joint_test sorts the wide pivot by index (unit_col),
@@ -3265,8 +3285,21 @@ def joint_homogeneity_test(
             )
 
     # Phase 4.5 C: aggregate weights/survey to per-unit; thread through.
+    # R2 P1 fix: subset row-level `weights` to data_filtered's rows BEFORE
+    # resolution, mirroring did_had_pretest_workflow / joint_pretrends_test
+    # for staggered last-cohort filtering.
+    weights_for_resolve = weights
+    if weights is not None and len(data_filtered) != len(data):
+        pos_idx = data.index.get_indexer(data_filtered.index)
+        if (pos_idx < 0).any():
+            raise ValueError(
+                "joint_homogeneity_test: cannot align row-level weights to "
+                "the staggered-filtered panel; some data_filtered rows do "
+                "not appear in original data.index."
+            )
+        weights_for_resolve = np.asarray(weights, dtype=np.float64)[pos_idx]
     weights_unit, resolved_unit = _resolve_pretest_unit_weights(
-        data_filtered, unit_col, weights, survey, "joint_homogeneity_test"
+        data_filtered, unit_col, weights_for_resolve, survey, "joint_homogeneity_test"
     )
     w_eff = resolved_unit.weights if resolved_unit is not None else weights_unit
 
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -2432,7 +2432,7 @@ Tuning-parameter-free test of `H_0: d̲ = 0` versus `H_1: d̲ > 0`. Shipped in `
   The survey-compatible alternative for HAD pretesting is **joint Stute** (a CvM cusum of regression residuals) — a smooth functional of the empirical CDF for which Krieger-Pfeffermann (1997) + a survey-aware multiplier bootstrap give a calibrated test. Phase 4.5 C (PR #370) ships survey support for the linearity family — the **PSU-level Mammen multiplier bootstrap** for `stute_test` and the joint variants (NOT Rao-Wu rescaling — multiplier bootstrap is a different mechanism), and **closed-form weighted OLS + pweight-sandwich variance components** for `yatchew_hr_test`. See the dedicated Note (Phase 4.5 C) below for the full algorithm.
   **Research direction (out of scope for diff-diff):** the bridge IS sketchable by combining (a) endpoint-estimation EVT under iid (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999, Beirlant-de Wet-Goegebeur 2006); (b) survey-aware functional CLT for the empirical process (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017); and (c) tail-empirical-process theory (Drees 2003) to define a "design-effective boundary intensity" `λ_eff = Σ_h W_h · f_h(0+)`. Under a "no boundary clumping" assumption (`P(D_{(1)}, D_{(2)}` in same PSU `| both ≤ δ) → 0`), the `Exp(1)/Exp(1)` limit law's pivotality is preserved and only the calibration needs a survey-aware bootstrap (subsampling within strata per Politis-Romano-Wolf, or Bertail et al.'s design-aware bootstrap). This is publishable methodology research — one paper, ~6-12 months for a methods PhD student. If the bridge gets built and published externally, this gate can be revisited.
 - **Note (Phase 4.5 C):** `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` accept `weights=` and `survey=ResolvedSurveyDesign` kwargs (or `survey=SurveyDesign` for the data-in entries). Mechanism varies by test:
-  - **Stute family** (`stute_test`, `stute_joint_pretest`, joint wrappers) uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap). Each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix; multipliers broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`. The bootstrap residual perturbation is `dy_b = fitted + eps * w * eta_obs`, followed by weighted OLS refit and weighted CvM recompute via `_cvm_statistic_weighted`. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence (Delgado 1993; Escanciano 2006) AND PSU clustering (Krieger-Pfeffermann 1997). PSU-shared multipliers are conservative under no-within-PSU outcome correlation (over-clustering gives conservative size in finite samples), asymptotically correct under the standard survey assumption that PSU is the ultimate sampling unit AND outcomes correlate within PSU. The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (constructed via `survey._make_trivial_resolved`) so the kernel is shared across both entry paths. NOT "Rao-Wu rescaled bootstrap" — different mechanism (the Rao-Wu kernel rescales per-unit weights via stratified PSU resampling, while this kernel applies multipliers without resampling).
+  - **Stute family** (`stute_test`, `stute_joint_pretest`, joint wrappers) uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap). Each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix; multipliers broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`. The bootstrap residual perturbation is `dy_b = fitted + eps * eta_obs` (paper Appendix D wild-bootstrap form — multipliers attach to UNWEIGHTED residuals; the weighting flows through the OLS refit + the weighted CvM, NOT through the perturbation step). Followed by weighted OLS refit (`_fit_weighted_ols_intercept_slope`) and weighted CvM recompute via `_cvm_statistic_weighted`. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence (Delgado 1993; Escanciano 2006) AND PSU clustering (Krieger-Pfeffermann 1997). PSU-shared multipliers are conservative under no-within-PSU outcome correlation (over-clustering gives conservative size in finite samples), asymptotically correct under the standard survey assumption that PSU is the ultimate sampling unit AND outcomes correlate within PSU. The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (constructed via `survey._make_trivial_resolved`) so the kernel is shared across both entry paths. NOT "Rao-Wu rescaled bootstrap" — different mechanism (the Rao-Wu kernel rescales per-unit weights via stratified PSU resampling, while this kernel applies multipliers without resampling).
   - **Yatchew** (`yatchew_hr_test`) uses **closed-form weighted OLS + pweight-sandwich variance components** (no bootstrap). All three components reduce bit-exactly to the unweighted formulas at `w=ones(G)` (locked at `atol=1e-14` in `TestYatchewHRTestSurvey::test_weighted_reduces_to_unweighted_at_uniform_weights`):
     - `sigma2_lin = sum(w * eps^2) / sum(w)` (weighted OLS residual variance).
     - `sigma2_diff = sum(w_avg * (dy_g - dy_{g-1})^2) / (2 * sum(w))` with arithmetic-mean pair weights `w_avg_g = (w_g + w_{g-1})/2`. Divisor uses `sum(w)` (=G at `w=1`), NOT `sum(w_avg)`, to match the existing `(1/(2G))` unweighted formula at `had_pretests.py:1635`.
diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py
@@ -3547,3 +3547,77 @@ def test_workflow_staggered_event_study_weights_subset_correctly(self):
         assert report.aggregate == "event_study"
         assert report.qug is None
         assert report.homogeneity_joint is not None
+
+    # --- R2 P1: direct-wrapper staggered weights= subsetting ---------------
+
+    def test_joint_pretrends_test_staggered_weights_subset(self):
+        """R2 P1: joint_pretrends_test direct call must subset row-level
+        weights= when its own _validate_had_panel_event_study filtering
+        triggers on staggered panels. Pre-fix this crashed with a length-
+        mismatch ValueError because the wrapper passed the full-panel
+        weights array into _resolve_pretest_unit_weights(data_filtered, ...)."""
+        df = self._make_staggered_panel(G_per_cohort=10)
+        n_rows = 2 * 10 * 4
+        weights_per_row = np.ones(n_rows) * 1.5
+        with pytest.warns(UserWarning):
+            r = joint_pretrends_test(
+                df,
+                "y",
+                "d",
+                "time",
+                "unit",
+                pre_periods=[0, 1],
+                base_period=2,
+                first_treat_col="F",
+                n_bootstrap=199,
+                seed=0,
+                weights=weights_per_row,
+            )
+        assert np.isfinite(r.cvm_stat_joint)
+
+    def test_joint_homogeneity_test_staggered_weights_subset(self):
+        df = self._make_staggered_panel(G_per_cohort=10)
+        n_rows = 2 * 10 * 4
+        weights_per_row = np.ones(n_rows) * 1.5
+        with pytest.warns(UserWarning):
+            r = joint_homogeneity_test(
+                df,
+                "y",
+                "d",
+                "time",
+                "unit",
+                post_periods=[3],
+                base_period=2,
+                first_treat_col="F",
+                n_bootstrap=199,
+                seed=0,
+                weights=weights_per_row,
+            )
+        assert np.isfinite(r.cvm_stat_joint)
+
+    # --- R2 P1: bootstrap perturbation form lock ---------------------------
+
+    def test_stute_survey_perturbation_does_not_double_weight(self):
+        """R2 P1: bootstrap perturbation is `dy_b = fitted + eps * eta_obs`
+        (paper Appendix D form), NOT `eps * w * eta_obs`. Adding `* w` to
+        the perturbation would over-weight by w² (weighting flows through
+        weighted OLS refit + weighted CvM, NOT through the multiplier).
+
+        Lock test: cvm_stat at uniform weights matches between paths
+        bit-exactly (W=G under uniform weights so 1/W² = 1/G²); the
+        bootstrap p-value distributions agree within Monte-Carlo noise
+        (RNG draw ordering differs between batched survey-aware path and
+        per-iteration unweighted path; numerical equivalence is unreachable).
+        """
+        d, dy = _linear_dgp(G=50, beta=2.0, sigma=0.3)
+        r_unweighted = stute_test(d, dy, n_bootstrap=999, seed=0)
+        r_weighted = stute_test(d, dy, weights=np.ones(50), n_bootstrap=999, seed=0)
+        # cvm_stat: bit-exact reduction at w=1 (W=G, weighted CvM ≡ unweighted).
+        np.testing.assert_allclose(
+            r_unweighted.cvm_stat, r_weighted.cvm_stat, atol=1e-14, rtol=1e-14
+        )
+        # p_value: distributional agreement at large B; Monte-Carlo noise.
+        # If the survey path were over-weighting (w² instead of w), the
+        # bootstrap distribution would be inflated and the survey p-value
+        # would systematically deviate. With the correct form, |diff| < 0.10.
+        assert abs(r_unweighted.p_value - r_weighted.p_value) < 0.10
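
The double-weighting concern in the docstring above can be illustrated with a few lines of arithmetic. This is a simplified sketch under stated assumptions, not the library's bootstrap: it skips the OLS refit entirely and assumes only that the weighted statistic multiplies each perturbed residual by its weight once. The names `contrib_ok` and `contrib_bad` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
G = 6
w = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])  # non-uniform weights
eps = rng.normal(size=G)                      # stand-in residuals

# Mammen two-point multipliers (mean 0, variance 1).
sqrt5 = np.sqrt(5.0)
values = np.array([-(sqrt5 - 1.0) / 2.0, (sqrt5 + 1.0) / 2.0])
p_low = (sqrt5 + 1.0) / (2.0 * sqrt5)
eta = rng.choice(values, size=G, p=[p_low, 1.0 - p_low])

# Documented form: multipliers attach to unweighted residuals.
pert_ok = eps * eta
# Spurious form from the old REGISTRY note.
pert_bad = eps * w * eta

# Assumption: the weighted statistic applies w once to each residual.
contrib_ok = w * pert_ok    # w * eps * eta
contrib_bad = w * pert_bad  # w**2 * eps * eta, i.e. weighted twice
```

In this sketch the two contributions differ by exactly a factor of `w` per unit, and they coincide when every weight equals one, so a non-uniform weight vector is what makes the difference between the two forms visible.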