Address PR #370 R7 review (1 P0 + 1 P1)

igerber · claude · igerber · commit 43230619ebed · 2026-04-25T10:46:21.000-04:00
R7 P0 (Methodology) -- weighted CvM was missing the outer measure
factor. Survey-weighted plug-in of Stute's CvM functional integrates
the squared cusum process against the survey-weighted EDF
F_hat_w = (1/W) sum_i w_i delta_{D_i}, which weights BOTH the inner
cusum AND the outer integration measure:
    C_g = sum_{h <= g} w_h * eps_h
    S_w = (1/W^2) * sum_g w_g * (C_g)^2

Earlier revisions used (1/W^2) * sum_g C_g^2 (no outer w_g), which is
a count-weighted-cusum functional (weighted cusum, uniform outer
measure) and silently misreports survey-weighted Stute statistics
under non-uniform weights. At w=ones(G) both forms reduce to
(1/G^2) sum_g C_g^2; only non-uniform weights distinguish them, which
is why the prior reduction tests didn't catch this.
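
A minimal sketch of why a uniform-weight reduction test cannot separate
the two forms. It assumes residuals already sorted by dose with no ties;
cvm_outer_weighted and cvm_uniform_outer are illustrative names, not
functions in diff_diff:

```python
import numpy as np


def cvm_outer_weighted(eps, w):
    """(1/W^2) * sum_g w_g * C_g^2, with C_g the weighted cusum (no ties assumed)."""
    c = np.cumsum(w * eps)
    return float(np.sum(w * c * c) / np.sum(w) ** 2)


def cvm_uniform_outer(eps, w):
    """Pre-fix form: (1/W^2) * sum_g C_g^2, i.e. no outer w_g factor."""
    c = np.cumsum(w * eps)
    return float(np.sum(c * c) / np.sum(w) ** 2)


eps = np.array([1.0, -2.0, 3.0, 0.5, -0.7])

# Uniform weights: multiplying by w_g = 1 changes nothing, so both forms agree
# and a w=1 reduction test passes for either implementation.
w_uniform = np.ones(5)
same_at_uniform = cvm_outer_weighted(eps, w_uniform) == cvm_uniform_outer(eps, w_uniform)

# Non-uniform weights: the outer w_g factor changes the value.
w_skewed = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
differ_when_skewed = not np.isclose(
    cvm_outer_weighted(eps, w_skewed), cvm_uniform_outer(eps, w_skewed)
)
```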

Fix: add the outer w_sorted factor to _cvm_statistic_weighted. New
oracle test pins the formula on a hand-computed non-uniform-weight
example (w=[1,2,3], eps=[1,-2,3] -> 127/36 outer-weighted, NOT 46/36
count-weighted-cusum). Reduction at w=1 still bit-exact.
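
The hand computation can be replayed in exact rational arithmetic. This
is a standalone check of the numbers quoted above, not the project's
test; the variable names are illustrative and the doses are assumed
distinct, so no tie collapse applies:

```python
from fractions import Fraction
from itertools import accumulate

w = [Fraction(1), Fraction(2), Fraction(3)]
eps = [Fraction(1), Fraction(-2), Fraction(3)]

# Weighted cusum C_g = sum_{h <= g} w_h * eps_h.
cusum = list(accumulate(wi * ei for wi, ei in zip(w, eps)))
W = sum(w)

# Outer-weighted form (post-fix) versus count-weighted-cusum form (pre-fix).
outer_weighted = sum(wi * c * c for wi, c in zip(w, cusum)) / W**2
uniform_outer = sum(c * c for c in cusum) / W**2
```

The two values differ by 81/36 = 2.25, which is why the oracle test can
assert a gap larger than 1.0 between them.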

R7 P1 (Code Quality) -- the survey verdict could start with
"inconclusive" on passing cases. The previous approach composed the
unweighted verdict against a synthetic NaN QUG, then string-replaced
the "QUG NaN" mention away. When no rejections fired and linearity was
conclusive, the underlying "inconclusive - QUG NaN" template collapsed
to "inconclusive", even on all_pass=True paths.
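
A small sketch of the failure mode. It replays only the removed
string-replace chain on the "inconclusive - QUG NaN" template named
above; the unweighted composer itself is not reproduced, and
old_postprocess is a hypothetical name:

```python
SUFFIX = " (linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0)"


def old_postprocess(base_verdict: str) -> str:
    """Replay the removed string-replace post-processing on a base verdict."""
    stripped = base_verdict.replace(
        "; additional steps unresolved: QUG NaN", ""
    ).replace("inconclusive - QUG NaN", "inconclusive")
    return stripped + SUFFIX


# Base verdict for "no rejections, linearity conclusive, QUG NaN", as described
# in this message. The replace strips the reason but keeps the leading word.
pass_case = old_postprocess("inconclusive - QUG NaN")
starts_inconclusive = pass_case.startswith("inconclusive")
```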

Fix: explicit survey-aware verdict composers
_compose_verdict_overall_survey and _compose_verdict_event_study_survey
that drop QUG from consideration entirely and emit linearity-only
priority text. Both append the
"(linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0)"
suffix in every branch (rejections, inconclusive, fail-to-reject).

4 new regression tests:
- test_cvm_statistic_weighted_outer_measure_oracle (R7 P0 hand-computed)
- test_cvm_statistic_weighted_reduces_at_uniform_weights (R7 P0 reduction)
- test_workflow_overall_survey_pass_does_not_say_inconclusive (R7 P1)
- test_workflow_event_study_survey_pass_does_not_say_inconclusive (R7 P1)

187 pretest tests pass (was 183 after R6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py
@@ -1049,21 +1049,31 @@ def _has_lonely_psu_adjust_singletons(resolved: Any) -> bool:
 def _cvm_statistic_weighted(
     eps_sorted: np.ndarray, d_sorted: np.ndarray, w_sorted: np.ndarray
 ) -> float:
-    """Weighted analog of :func:`_cvm_statistic`.
-
-    Aggregates weighted residuals into the cusum:
-
-        C_g = sum_{h : D_h <= D_g} w_h * eps_h
-        S   = (1 / W^2) * sum_g (C_g)^2,  W = sum(w)
-
-    At ``w = ones(G)``, ``W = G`` and the formula reduces bit-exactly to
-    the unweighted ``_cvm_statistic`` (locked at ``atol=1e-14`` by the
-    survey-path direct helper tests; Phase 4.5 C stability invariant #1).
+    """Weighted analog of :func:`_cvm_statistic` (survey-weighted plug-in).
+
+    The unweighted Stute CvM ``S = (1/G) * sum_g c_G(D_g)^2`` integrates
+    the squared cusum process against the empirical CDF
+    ``F_hat = (1/G) sum_i delta_{D_i}``. The weighted plug-in replaces
+    ``F_hat`` by the survey-weighted EDF
+    ``F_hat_w = (1/W) sum_i w_i delta_{D_i}``, which weights BOTH the
+    inner cusum AND the outer integration measure:
+
+        C_g = sum_{h : D_h <= D_g} w_h * eps_h     (inner cusum, weighted)
+        S_w = (1 / W^2) * sum_g w_g * (C_g)^2      (outer measure, weighted)
+        W   = sum(w)
+
+    The outer ``w_g`` factor on each squared cusum (R7 P0 fix) is what
+    distinguishes this from a count-weighted-cusum form
+    ``(1/W^2) * sum_g C_g^2`` (no outer ``w_g``), which silently
+    misreports survey-weighted Stute statistics for non-uniform weights.
+    At ``w = ones(G)`` both forms reduce to ``(1/G^2) sum_g C_g^2``
+    (unweighted) -- only non-uniform weights distinguish them.
 
     Tie-block collapse uses the same ``np.unique(d_sorted)`` count
-    machinery as the unweighted form — positions are determined by
-    ``d_sorted`` ties (independent of weights), so the collapse pattern is
-    weight-invariant.
+    machinery as the unweighted form -- positions are determined by
+    ``d_sorted`` ties (independent of weights), so the collapse pattern
+    is weight-invariant. The outer ``w_sorted`` factor applies to the
+    tie-collapsed cusum at each observation.
 
     Parameters
     ----------
@@ -1082,7 +1092,9 @@ def _cvm_statistic_weighted(
     tie_end_idx = np.cumsum(counts) - 1
     cumsum_tie_safe = np.repeat(cumsum[tie_end_idx], counts)
     W = float(np.sum(w_sorted))
-    return float(np.sum(cumsum_tie_safe * cumsum_tie_safe) / (W * W))
+    # R7 P0: integrate outer measure against F_hat_w via the w_sorted
+    # factor on each squared cusum (NOT against uniform 1/G measure).
+    return float(np.sum(w_sorted * cumsum_tie_safe * cumsum_tie_safe) / (W * W))
 
 
 def _compose_verdict(
@@ -3534,6 +3546,114 @@ def joint_homogeneity_test(
 
 _VALID_AGGREGATES = ("overall", "event_study")
 
+_QUG_DEFERRED_SUFFIX = (
+    " (linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0)"
+)
+
+
+def _compose_verdict_overall_survey(
+    stute: Optional[StuteTestResults],
+    yatchew: Optional[YatchewTestResults],
+) -> str:
+    """Build the overall-path :class:`HADPretestReport` verdict on the
+    survey/weights branch (Phase 4.5 C).
+
+    Drops the QUG step from consideration (skipped per Phase 4.5 C0)
+    and composes the verdict from Stute + Yatchew alone, with the
+    linearity-conditional suffix appended in every branch. R7 P1 fix:
+    explicit survey-aware composer replaces the prior approach of
+    composing the unweighted verdict with a NaN QUG and string-replacing
+    the resulting "QUG NaN" suffix, which could leave pass cases starting
+    with "inconclusive".
+
+    Priority (mirrors :func:`_compose_verdict` minus QUG):
+      1. Conclusive rejections of Stute or Yatchew lead.
+      2. No conclusive rejection but linearity inconclusive (both NaN)
+         -> "inconclusive - both Stute and Yatchew linearity tests NaN".
+      3. Linearity conclusive (at least one of Stute/Yatchew finite) AND
+         no rejection -> fail-to-reject string.
+    All branches end with ``_QUG_DEFERRED_SUFFIX``.
+    """
+    stute_ok = stute is not None and bool(np.isfinite(stute.p_value))
+    yatchew_ok = yatchew is not None and bool(np.isfinite(yatchew.p_value))
+    stute_rej = stute_ok and bool(stute.reject)
+    yatchew_rej = yatchew_ok and bool(yatchew.reject)
+
+    reasons = []
+    if stute_rej:
+        reasons.append("linearity rejected - heterogeneity bias (Stute)")
+    if yatchew_rej:
+        reasons.append("linearity rejected - heterogeneity bias (Yatchew)")
+
+    unresolved = []
+    if not stute_ok:
+        unresolved.append("Stute NaN")
+    if not yatchew_ok:
+        unresolved.append("Yatchew NaN")
+
+    if reasons:
+        verdict = "; ".join(reasons)
+        if unresolved:
+            verdict += "; additional steps unresolved: " + "; ".join(unresolved)
+        return verdict + _QUG_DEFERRED_SUFFIX
+
+    # No rejections.
+    if not (stute_ok or yatchew_ok):
+        return "inconclusive - both Stute and Yatchew linearity tests NaN" + _QUG_DEFERRED_SUFFIX
+
+    # At least one linearity test conclusive AND no rejection.
+    skipped_note = ""
+    if not stute_ok:
+        skipped_note = " (Stute NaN - skipped)"
+    elif not yatchew_ok:
+        skipped_note = " (Yatchew NaN - skipped)"
+    return (
+        "Stute and Yatchew linearity diagnostics fail-to-reject"
+        + skipped_note
+        + _QUG_DEFERRED_SUFFIX
+    )
+
+
+def _compose_verdict_event_study_survey(
+    pretrends_joint: Optional[StuteJointResult],
+    homogeneity_joint: Optional[StuteJointResult],
+) -> str:
+    """Event-study survey-path verdict (R7 P1 fix; mirrors
+    :func:`_compose_verdict_event_study` minus QUG)."""
+    pretrends_ok = pretrends_joint is not None and bool(np.isfinite(pretrends_joint.p_value))
+    homogeneity_ok = homogeneity_joint is not None and bool(np.isfinite(homogeneity_joint.p_value))
+    pretrends_rej = pretrends_joint is not None and pretrends_ok and bool(pretrends_joint.reject)
+    homogeneity_rej = (
+        homogeneity_joint is not None and homogeneity_ok and bool(homogeneity_joint.reject)
+    )
+
+    reasons = []
+    if pretrends_rej:
+        reasons.append("joint pre-trends rejected - assumption 7 violated (joint Stute)")
+    if homogeneity_rej:
+        reasons.append("joint linearity rejected - heterogeneity bias (joint Stute)")
+
+    unresolved = []
+    if pretrends_joint is None:
+        unresolved.append("joint pre-trends skipped (no earlier pre-period)")
+    elif not pretrends_ok:
+        unresolved.append("joint pre-trends NaN")
+    if homogeneity_joint is None:
+        unresolved.append("joint linearity skipped")
+    elif not homogeneity_ok:
+        unresolved.append("joint linearity NaN")
+
+    if reasons:
+        verdict = "; ".join(reasons)
+        if unresolved:
+            verdict += "; additional steps unresolved: " + "; ".join(unresolved)
+        return verdict + _QUG_DEFERRED_SUFFIX
+
+    if unresolved:
+        return "inconclusive - " + "; ".join(unresolved) + _QUG_DEFERRED_SUFFIX
+
+    return "joint pre-trends and joint linearity diagnostics fail-to-reject" + _QUG_DEFERRED_SUFFIX
+
 
 def did_had_pretest_workflow(
     data: pd.DataFrame,
@@ -3820,33 +3940,11 @@ def did_had_pretest_workflow(
                 and homogeneity_ok
                 and not homogeneity_joint.reject
             )
-            # Reuse the unweighted verdict composer with a synthetic NaN
-            # qug (all_finite=False, reject=False) so the existing logic
-            # produces a "linearity-conditional" verdict. Then append the
-            # explicit Phase 4.5 C0 suffix per Reviewer LOW #2.
-            qug_skip = QUGTestResults(
-                t_stat=float("nan"),
-                p_value=float("nan"),
-                reject=False,
-                alpha=alpha,
-                critical_value=float("nan"),
-                n_obs=int(doses_at_F.shape[0]),
-                n_excluded_zero=0,
-                d_order_1=float("nan"),
-                d_order_2=float("nan"),
-            )
-            base_verdict = _compose_verdict_event_study(
-                qug_skip, pretrends_joint, homogeneity_joint
-            )
-            # Strip the "QUG NaN" mention from the unresolved-steps suffix
-            # since users get a more informative QUG-skip warning + suffix.
-            base_verdict = base_verdict.replace(
-                "; additional steps unresolved: QUG NaN", ""
-            ).replace("inconclusive - QUG NaN", "inconclusive")
-            verdict = (
-                base_verdict + " (linearity-conditional verdict; QUG-under-survey "
-                "deferred per Phase 4.5 C0)"
-            )
+            # R7 P1 fix: explicit survey-aware verdict composer instead
+            # of post-processing the unweighted-verdict output (the
+            # previous string-replace approach could leave pass cases
+            # starting with "inconclusive" even when all_pass=True).
+            verdict = _compose_verdict_event_study_survey(pretrends_joint, homogeneity_joint)
         else:
             qug_ok = bool(np.isfinite(qug_res.p_value))
             all_pass = bool(
@@ -3934,28 +4032,8 @@ def did_had_pretest_workflow(
     if use_survey_path:
         any_reject = stute_res.reject or yatchew_res.reject
         all_pass = bool(linearity_conclusive and not any_reject)
-        # Compose verdict from existing _compose_verdict but omit QUG;
-        # synthesize a NaN QUG so existing logic produces "QUG NaN" suffix,
-        # then strip that and append Phase 4.5 C0 suffix.
-        qug_skip = QUGTestResults(
-            t_stat=float("nan"),
-            p_value=float("nan"),
-            reject=False,
-            alpha=alpha,
-            critical_value=float("nan"),
-            n_obs=int(d_arr.shape[0]),
-            n_excluded_zero=0,
-            d_order_1=float("nan"),
-            d_order_2=float("nan"),
-        )
-        base_verdict = _compose_verdict(qug_skip, stute_res, yatchew_res)
-        base_verdict = base_verdict.replace("; additional steps unresolved: QUG NaN", "").replace(
-            "inconclusive - QUG NaN", "inconclusive"
-        )
-        verdict = (
-            base_verdict + " (linearity-conditional verdict; QUG-under-survey "
-            "deferred per Phase 4.5 C0)"
-        )
+        # R7 P1 fix: explicit survey-aware verdict composer.
+        verdict = _compose_verdict_overall_survey(stute_res, yatchew_res)
     else:
         qug_conclusive = bool(np.isfinite(qug_res.p_value))
         any_reject = qug_res.reject or stute_res.reject or yatchew_res.reject
diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py
@@ -4014,3 +4014,88 @@ def test_workflow_event_study_zero_weights_on_dropped_cohort(self):
         assert report.aggregate == "event_study"
         assert report.qug is None
         assert report.homogeneity_joint is not None
+
+    # --- R7 P0: weighted-CvM outer-measure oracle -------------------------
+
+    def test_cvm_statistic_weighted_outer_measure_oracle(self):
+        """R7 P0: weighted CvM must integrate outer measure against F_hat_w
+        too. Hand-computed oracle distinguishes outer-weighted form
+        ((1/W^2) sum_g w_g C_g^2) from count-weighted-cusum form
+        ((1/W^2) sum_g C_g^2). Uniform weights cannot tell the two apart."""
+        from diff_diff.had_pretests import _cvm_statistic_weighted
+
+        eps = np.array([1.0, -2.0, 3.0])
+        d = np.array([0.1, 0.2, 0.3])
+        w = np.array([1.0, 2.0, 3.0])
+        # C_1=1, C_2=-3, C_3=6, W=6.
+        # Outer-weighted: (1*1 + 2*9 + 3*36) / 36 = 127/36.
+        # Count-weighted (WRONG): (1+9+36) / 36 = 46/36.
+        result = _cvm_statistic_weighted(eps, d, w)
+        outer_weighted = (1 * 1.0**2 + 2 * (-3.0) ** 2 + 3 * 6.0**2) / (6.0**2)
+        count_weighted = (1.0**2 + (-3.0) ** 2 + 6.0**2) / (6.0**2)
+        np.testing.assert_allclose(result, outer_weighted, atol=1e-14, rtol=1e-14)
+        assert abs(outer_weighted - count_weighted) > 1.0
+        assert abs(result - count_weighted) > 1.0
+
+    def test_cvm_statistic_weighted_reduces_at_uniform_weights(self):
+        """At w=ones(G), outer-weighted form reduces bit-exactly to the
+        unweighted statistic."""
+        from diff_diff.had_pretests import _cvm_statistic, _cvm_statistic_weighted
+
+        eps = np.array([1.0, -2.0, 3.0, 0.5, -0.7])
+        d = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
+        w_uniform = np.ones(5)
+        np.testing.assert_allclose(
+            _cvm_statistic_weighted(eps, d, w_uniform),
+            _cvm_statistic(eps, d),
+            atol=1e-14,
+            rtol=1e-14,
+        )
+
+    # --- R7 P1: survey verdict consistency --------------------------------
+
+    def test_workflow_overall_survey_pass_does_not_say_inconclusive(self):
+        """R7 P1: when all_pass=True on the overall survey path, the
+        verdict must NOT start with 'inconclusive'. Locks the explicit
+        survey-aware verdict composer."""
+        df = self._make_overall_panel()
+        weights_per_row = np.full(40, 1.5)
+        with pytest.warns(UserWarning):
+            report = did_had_pretest_workflow(
+                df,
+                "y",
+                "d",
+                "time",
+                "unit",
+                weights=weights_per_row,
+                n_bootstrap=199,
+                seed=0,
+            )
+        if report.all_pass:
+            assert not report.verdict.startswith("inconclusive"), (
+                f"all_pass=True but verdict starts with 'inconclusive': " f"{report.verdict!r}"
+            )
+
+    def test_workflow_event_study_survey_pass_does_not_say_inconclusive(self):
+        """R7 P1: same invariant on the event-study survey path."""
+        from diff_diff import SurveyDesign
+
+        df = self._make_event_study_panel_with_psu_strata(
+            n_strata=2, n_psu_per_stratum=3, n_units_per_psu=2
+        )
+        with pytest.warns(UserWarning, match="QUG step skipped"):
+            report = did_had_pretest_workflow(
+                df,
+                "y",
+                "d",
+                "time",
+                "unit",
+                aggregate="event_study",
+                survey=SurveyDesign(weights="w", strata="stratum", psu="psu"),
+                n_bootstrap=199,
+                seed=0,
+            )
+        if report.all_pass:
+            assert not report.verdict.startswith("inconclusive"), (
+                f"all_pass=True but verdict starts with 'inconclusive': " f"{report.verdict!r}"
+            )