Skip to content

Commit fb03267

Browse files
igerberclaude
andcommitted
Address PR #370 R1 review (1 P0 + 2 P1 + 1 P3)
R1 P0 — Stute survey path silently accepted zero-weight units, which leak into the dose-variation check + CvM cusum + bootstrap refit while contributing zero population mass. Extreme case: only zero-weight units carry dose variation -> spurious finite test statistic with no warning. Fix: strictly-positive guards on every survey-aware Stute / Yatchew / workflow entry point (the weights= shortcut already had this; survey= branch was the gap). R1 P1 #1 — aweight/fweight survey designs slipped through pweight-only formulas silently (the variance components are derived assuming pweight sandwich semantics). Fix: weight_type='pweight' guards added in _resolve_pretest_unit_weights and on every direct-helper survey= branch (stute_test, yatchew_hr_test, stute_joint_pretest). Mirrors HAD.fit guard at had.py:2976 + survey._resolve_pweight_only at survey.py:914. R1 P1 #2 — workflow's row-level weights= crashed on staggered event- study panels because _validate_multi_period_panel filters to last cohort but the joint wrappers re-aggregate with the original full- panel weights array. Fix: subset joint_weights to data_filtered's rows via data.index.get_indexer(data_filtered.index) BEFORE passing to the wrappers. Mirrors HeterogeneousAdoptionDiD.fit positional- index pattern. Survey= path is unaffected (column references resolve internally on data_filtered). R1 P3 — REGISTRY C0 note still said "the same gate applies to did_had_pretest_workflow" and "Phase 4.5 C uses Rao-Wu rescaling"; both are stale post-C. Updated to clarify (a) workflow gate was temporary and is now closed by C, (b) qug_test direct-helper gate remains permanent, (c) C uses PSU-level Mammen multiplier bootstrap (NOT Rao-Wu rescaling). 7 new tests in TestPhase45CR1Regressions covering: zero-weight survey on stute_test / stute_joint_pretest / workflow; aweight rejection on stute_test / workflow; fweight rejection on yatchew_hr_test; staggered event-study workflow with weights= (catches the length-mismatch crash). 165 pretest tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 412b189 commit fb03267

3 files changed

Lines changed: 295 additions & 4 deletions

File tree

diff_diff/had_pretests.py

Lines changed: 100 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1483,6 +1483,15 @@ def stute_test(
14831483
"by the multiplier-bootstrap composition. Replicate-weight pretests "
14841484
"are a parallel follow-up after Phase 4.5 C."
14851485
)
1486+
# R1 P1: pweight-only guard on the direct-helper survey entry (mirrors
1487+
# _resolve_pretest_unit_weights for the workflow path).
1488+
if survey is not None and getattr(survey, "weight_type", "pweight") != "pweight":
1489+
raise ValueError(
1490+
f"stute_test: HAD pretests require weight_type='pweight'. Got "
1491+
f"weight_type={survey.weight_type!r}. aweight / fweight have "
1492+
"different sandwich-variance semantics that are not derived "
1493+
"for the Stute CvM bootstrap calibration."
1494+
)
14861495

14871496
d_arr = _validate_1d_numeric(d, "d")
14881497
dy_arr = _validate_1d_numeric(dy, "dy")
@@ -1539,6 +1548,20 @@ def stute_test(
15391548
f"stute_test: survey.weights length {w_arr.shape[0]} does not "
15401549
f"match d/dy length {G}."
15411550
)
1551+
# R1 P0: strictly-positive weights at the per-unit level (mirrors
1552+
# workflow guard in _resolve_pretest_unit_weights). Zero-weight
1553+
# units would leak into the dose-variation check + CvM cusum +
1554+
# bootstrap refit, producing silent wrong pretest decisions on
1555+
# subpopulation-restricted designs (e.g. only zero-weight units
1556+
# carry dose variation -> spurious finite test statistic).
1557+
if (w_arr <= 0).any():
1558+
raise ValueError(
1559+
"stute_test: survey weights must be strictly positive. "
1560+
"Zero / negative weights would leave units in the "
1561+
"variance / CvM computation while contributing zero "
1562+
"population mass; pre-filter the panel to the positive-"
1563+
"weight subpopulation before calling stute_test."
1564+
)
15421565
elif weights is not None:
15431566
w_arr = np.asarray(weights, dtype=np.float64)
15441567
if w_arr.shape[0] != G:
@@ -1807,6 +1830,13 @@ def yatchew_hr_test(
18071830
"SDR) are not yet supported on HAD pretests. Replicate-weight "
18081831
"pretests are a parallel follow-up after Phase 4.5 C."
18091832
)
1833+
# R1 P1: pweight-only guard (aweight/fweight have different sandwich-
1834+
# variance semantics not derived for the variance-ratio statistic).
1835+
if survey is not None and getattr(survey, "weight_type", "pweight") != "pweight":
1836+
raise ValueError(
1837+
f"yatchew_hr_test: HAD pretests require weight_type='pweight'. "
1838+
f"Got weight_type={survey.weight_type!r}."
1839+
)
18101840

18111841
d_arr = _validate_1d_numeric(d, "d")
18121842
dy_arr = _validate_1d_numeric(dy, "dy")
@@ -2421,6 +2451,12 @@ def stute_joint_pretest(
24212451
"JKn/SDR) are not yet supported on HAD pretests. Replicate-weight "
24222452
"pretests are a parallel follow-up after Phase 4.5 C."
24232453
)
2454+
# R1 P1: pweight-only guard.
2455+
if survey is not None and getattr(survey, "weight_type", "pweight") != "pweight":
2456+
raise ValueError(
2457+
f"stute_joint_pretest: HAD pretests require weight_type='pweight'. "
2458+
f"Got weight_type={survey.weight_type!r}."
2459+
)
24242460

24252461
if not isinstance(residuals_by_horizon, dict) or not isinstance(fitted_by_horizon, dict):
24262462
raise ValueError(
@@ -2604,6 +2640,14 @@ def stute_joint_pretest(
26042640
f"stute_joint_pretest: survey.weights length {w_arr.shape[0]} "
26052641
f"does not match doses length {G}."
26062642
)
2643+
# R1 P0: strictly-positive guard (mirrors stute_test single-horizon).
2644+
if (w_arr <= 0).any():
2645+
raise ValueError(
2646+
"stute_joint_pretest: survey weights must be strictly "
2647+
"positive. Zero / negative weights would leave units in "
2648+
"the variance / CvM computation while contributing zero "
2649+
"population mass."
2650+
)
26072651
elif weights is not None:
26082652
w_arr = np.asarray(weights, dtype=np.float64)
26092653
if w_arr.shape[0] != G:
@@ -2797,6 +2841,17 @@ def _resolve_pretest_unit_weights(
27972841
if weights is not None:
27982842
weights_arr = np.asarray(weights, dtype=np.float64)
27992843
weights_unit = _aggregate_unit_weights(data, weights_arr, unit_col)
2844+
# R1 P0: strictly-positive weights required on the pweight shortcut
2845+
# (matches stute_test/yatchew_hr_test direct entry behavior; the CvM
2846+
# cusum + adjacent-difference variance assume all rows contribute).
2847+
if (weights_unit <= 0).any():
2848+
raise ValueError(
2849+
f"{caller_name}: weights must be strictly positive at the "
2850+
"per-unit level. Zero / negative weights would leave units "
2851+
"in the variance/CvM computation while contributing zero "
2852+
"mass; use survey= with explicit lonely-PSU handling for "
2853+
"principled subpopulation analysis."
2854+
)
28002855
return weights_unit, None
28012856
# survey is not None
28022857
if not hasattr(survey, "resolve"):
@@ -2811,7 +2866,30 @@ def _resolve_pretest_unit_weights(
28112866
"SDR) are not yet supported on HAD pretests. Replicate-weight "
28122867
"pretests are a parallel follow-up after Phase 4.5 C."
28132868
)
2869+
# R1 P1: pweight-only guard. aweight/fweight slip through pweight-only
2870+
# formulas silently otherwise (mirrors HeterogeneousAdoptionDiD.fit() at
2871+
# had.py:2976+ and survey._resolve_pweight_only at survey.py:914).
2872+
if getattr(resolved_full, "weight_type", "pweight") != "pweight":
2873+
raise ValueError(
2874+
f"{caller_name}: HAD pretests require weight_type='pweight'. "
2875+
f"Got weight_type={resolved_full.weight_type!r}. aweight / "
2876+
"fweight have different sandwich-variance semantics that are "
2877+
"not derived for the pretest variance components."
2878+
)
28142879
resolved_unit = _aggregate_unit_resolved_survey(data, resolved_full, unit_col)
2880+
# R1 P0: strictly-positive weights at the per-unit level (mirrors the
2881+
# weights= shortcut). Zero per-unit weights leave units in the dose-
2882+
# variation check / CvM sum while contributing zero population mass,
2883+
# which can produce silently-wrong pretest decisions.
2884+
if (np.asarray(resolved_unit.weights) <= 0).any():
2885+
raise ValueError(
2886+
f"{caller_name}: survey weights must be strictly positive at "
2887+
"the per-unit level. Zero / negative weights would leave units "
2888+
"in the variance/CvM computation while contributing zero "
2889+
"mass; this would produce silent wrong pretest decisions on "
2890+
"subpopulation-restricted designs. Pre-filter the panel to "
2891+
"the positive-weight subpopulation before calling the workflow."
2892+
)
28152893
return None, resolved_unit
28162894

28172895

@@ -3413,8 +3491,28 @@ def did_had_pretest_workflow(
34133491

34143492
# Phase 4.5 C: forward weights/survey to the joint helpers. The
34153493
# data-in wrappers handle their own per-row → per-unit aggregation
3416-
# via _resolve_pretest_unit_weights internally.
3417-
joint_weights = weights if use_survey_path and weights is not None else None
3494+
# via _resolve_pretest_unit_weights internally on `data_filtered`.
3495+
# R1 P1 fix: subset row-level `weights` to data_filtered's rows
3496+
# BEFORE passing through. Otherwise on staggered panels (where
3497+
# _validate_multi_period_panel auto-filters to last cohort),
3498+
# the wrappers would call _aggregate_unit_weights(data_filtered,
3499+
# weights[full_panel_length], ...) and crash on length mismatch.
3500+
# Mirrors HeterogeneousAdoptionDiD.fit()'s positional-index
3501+
# subsetting via `data.index.get_indexer(data_filtered.index)`.
3502+
# `survey=` carries column references resolved internally on
3503+
# data_filtered, so no subsetting needed there.
3504+
if use_survey_path and weights is not None:
3505+
pos_idx = data.index.get_indexer(data_filtered.index)
3506+
if (pos_idx < 0).any():
3507+
raise ValueError(
3508+
"did_had_pretest_workflow: cannot align row-level "
3509+
"weights to the staggered-filtered panel "
3510+
"(some data_filtered rows do not appear in original "
3511+
"data.index). This is a bug; please report."
3512+
)
3513+
joint_weights = np.asarray(weights, dtype=np.float64)[pos_idx]
3514+
else:
3515+
joint_weights = None
34183516
joint_survey = survey if use_survey_path and survey is not None else None
34193517

34203518
# Step 2: joint pre-trends on earlier pre-periods (those

docs/methodology/REGISTRY.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2425,11 +2425,11 @@ Tuning-parameter-free test of `H_0: d̲ = 0` versus `H_1: d̲ > 0`. Shipped in `
24252425
4. Theorem 4 establishes: asymptotic size `α`; uniform consistency against fixed alternatives; local power at rate `G` on the class `F^{d̲,d̄}_{m,K}` of differentiable cdfs with positive density and Lipschitz derivative.
24262426
5. Li et al. (2024, Theorem 2.4) implies the QUG test is asymptotically independent of the WAS / TWFE estimator, so conditional inference on WAS given non-rejection does not distort inference (asymptotically; the paper's Footnote 8 notes the extension to triangular arrays is conjectured but not proven).
24272427
- **Note:** Implementation is `O(G)` via `np.partition`; no sort required.
2428-
- **Note (Phase 4.5 C0):** `qug_test(..., survey=...)` and `qug_test(..., weights=...)` raise `NotImplementedError` permanently (Phase 4.5 C0 decision gate, 2026-04). The same gate applies to `did_had_pretest_workflow(..., survey=...)` / `weights=`. Three reasons survey extension is genuinely hard, not "we just haven't done the lit review":
2428+
- **Note (Phase 4.5 C0):** `qug_test(..., survey=...)` and `qug_test(..., weights=...)` raise `NotImplementedError` **permanently** (Phase 4.5 C0 decision gate, 2026-04 -- direct-helper gate is permanent). The Phase 4.5 C0 release also gated `did_had_pretest_workflow(..., survey=...)` / `weights=` with `NotImplementedError`, but that workflow gate was **temporary**: Phase 4.5 C (PR #370, 2026-04) replaces it with functional dispatch that skips the QUG step with `UserWarning` and runs the linearity family with the survey-aware mechanism (see Note (Phase 4.5 C) below for the full algorithm). Direct callers of `qug_test` still get the permanent rejection. Three reasons QUG-under-survey is genuinely hard, not "we just haven't done the lit review":
24292429
1. **Extreme order statistics are not smooth functionals of the empirical CDF.** Standard survey machinery (Binder-TSL linearization via `compute_survey_if_variance`, Rao-Wu rescaled bootstrap via `bootstrap_utils.generate_rao_wu_weights`, Krieger-Pfeffermann (1997) EDF tests for complex surveys) all rely on Hadamard differentiability of the test statistic in the empirical CDF. The first two order statistics are NOT differentiable functionals — small perturbations to F near zero produce O(1) shifts in `D_{(1)}`. None of the standard survey-bootstrap or linearization tools give a calibrated test for QUG.
24302430
2. **The `Exp(1)/Exp(1)` limit law assumes iid sampling with smooth density at zero.** Under cluster sampling, `D_{(1)}` and `D_{(2)}` may both come from the same PSU, breaking the independence required for the Poisson-process limit of rescaled spacings near the boundary. Under stratification, the smallest dose may come from a small stratum that's systematically over- or under-sampled, biasing the test.
24312431
3. **The literature on EVT under unequal-probability sampling is sparse.** Quintos et al. (2001) and Beirlant et al. cover tail-INDEX estimation under unequal sample sizes. There is no off-the-shelf method for "test the support endpoint under complex sampling" in the standard survey-statistics toolkit. Adapting Hill / Pickands / DEdH estimators to the boundary problem would be novel research, not engineering. The de Chaisemartin et al. (2026) paper itself does not discuss survey extensions of QUG.
2432-
The survey-compatible alternative for HAD pretesting is **joint Stute** (a CvM cusum of regression residuals) — a smooth functional of the empirical CDF for which Krieger-Pfeffermann (1997) + Rao-Wu rescaled bootstrap give a calibrated survey-aware test. Phase 4.5 C ships survey support for the linearity family with mechanism varying by test: Rao-Wu rescaled bootstrap for `stute_test` and the joint variants (`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`); weighted OLS residuals + weighted variance estimator for `yatchew_hr_test` (Yatchew 1997 is a closed-form variance-ratio test, not bootstrap-based).
2432+
The survey-compatible alternative for HAD pretesting is **joint Stute** (a CvM cusum of regression residuals) — a smooth functional of the empirical CDF for which Krieger-Pfeffermann (1997) + a survey-aware multiplier bootstrap give a calibrated test. Phase 4.5 C (PR #370) ships survey support for the linearity family — the **PSU-level Mammen multiplier bootstrap** for `stute_test` and the joint variants (NOT Rao-Wu rescaling — multiplier bootstrap is a different mechanism), and **closed-form weighted OLS + pweight-sandwich variance components** for `yatchew_hr_test`. See the dedicated Note (Phase 4.5 C) below for the full algorithm.
24332433
**Research direction (out of scope for diff-diff):** the bridge IS sketchable by combining (a) endpoint-estimation EVT under iid (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999, Beirlant-de Wet-Goegebeur 2006); (b) survey-aware functional CLT for the empirical process (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017); and (c) tail-empirical-process theory (Drees 2003) to define a "design-effective boundary intensity" `λ_eff = Σ_h W_h · f_h(0+)`. Under a "no boundary clumping" assumption (`P(D_{(1)}, D_{(2)}` in same PSU `| both ≤ δ) → 0`), the `Exp(1)/Exp(1)` limit law's pivotality is preserved and only the calibration needs a survey-aware bootstrap (subsampling within strata per Politis-Romano-Wolf, or Bertail et al.'s design-aware bootstrap). This is publishable methodology research — one paper, ~6-12 months for a methods PhD student. If the bridge gets built and published externally, this gate can be revisited.
24342434
- **Note (Phase 4.5 C):** `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` accept `weights=` and `survey=ResolvedSurveyDesign` kwargs (or `survey=SurveyDesign` for the data-in entries). Mechanism varies by test:
24352435
- **Stute family** (`stute_test`, `stute_joint_pretest`, joint wrappers) uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap). Each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix; multipliers broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`. The bootstrap residual perturbation is `dy_b = fitted + eps * w * eta_obs`, followed by weighted OLS refit and weighted CvM recompute via `_cvm_statistic_weighted`. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence (Delgado 1993; Escanciano 2006) AND PSU clustering (Krieger-Pfeffermann 1997). PSU-shared multipliers are conservative under no-within-PSU outcome correlation (over-clustering gives conservative size in finite samples), asymptotically correct under the standard survey assumption that PSU is the ultimate sampling unit AND outcomes correlate within PSU. The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (constructed via `survey._make_trivial_resolved`) so the kernel is shared across both entry paths. NOT "Rao-Wu rescaled bootstrap" — different mechanism (the Rao-Wu kernel rescales per-unit weights via stratified PSU resampling, while this kernel applies multipliers without resampling).

0 commit comments

Comments
 (0)