Skip to content

Commit 085d8eb

Browse files
igerberclaude
andcommitted
Address PR #370 R2 review (2 P1 + 1 P3)
R2 P1 #1 (Code Quality) -- joint_pretrends_test and joint_homogeneity_test direct calls still crashed on staggered panels because the staggered- weights subset fix from R1 was only applied at the workflow level. The wrappers run their own _validate_had_panel_event_study() and may filter to data_filtered, then passed the original full-panel weights array to _resolve_pretest_unit_weights(data_filtered, ...) which expects the filtered row count. Fix: subset row-level weights to data_filtered.index positions (via data.index.get_indexer) BEFORE _resolve_pretest_unit_weights, mirroring the workflow fix. R2 P1 #2 (Methodology) -- REGISTRY note documented the bootstrap perturbation as `dy_b = fitted + eps * w * eta_obs`, but the code does `dy_b = fitted + eps * eta_obs` (no `* w`). Code is correct: paper Appendix D wild-bootstrap perturbs UNWEIGHTED residuals; weighting flows through the OLS refit and the weighted CvM, not through the perturbation. Adding `* w` would over-weight by w². Fix: update REGISTRY note to remove the spurious `* w` and clarify the canonical form. Add a regression that pins (a) bit-exact cvm_stat reduction at uniform weights, (b) bootstrap p-value distributional agreement within Monte-Carlo noise. R2 P3 -- in-code docstrings still referenced the pre-Phase-4.5-C contract: - qug_test docstring said survey-aware Stute "admits a Rao-Wu rescaled bootstrap" (PSU-level Mammen multiplier bootstrap is what shipped). Updated to reflect the correct mechanism. - HADPretestReport.all_pass docstring described the unweighted contract only; survey/weights path drops the QUG-conclusiveness gate (linearity-conditional admissibility per C0 deferral). Updated. 3 new regression tests in TestPhase45CR1Regressions: - test_joint_pretrends_test_staggered_weights_subset - test_joint_homogeneity_test_staggered_weights_subset - test_stute_survey_perturbation_does_not_double_weight (locks the perturbation form via cvm_stat bit-exact reduction + p-value MC bound) 168 pretest tests pass (was 165 after R1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent fb03267 commit 085d8eb

3 files changed

Lines changed: 128 additions & 21 deletions

File tree

diff_diff/had_pretests.py

Lines changed: 53 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -607,15 +607,19 @@ class HADPretestReport:
607607
Populated when ``aggregate == "event_study"``; ``None`` on the
608608
overall path.
609609
all_pass : bool
610-
On the overall path: same Phase 3 semantics - True iff QUG is
611-
conclusive AND at least one of Stute/Yatchew is conclusive AND
612-
no conclusive test rejects. On the event-study path: True iff
613-
``np.isfinite(qug.p_value)``,
610+
On the **unweighted overall path**: same Phase 3 semantics - True
611+
iff QUG is conclusive AND at least one of Stute/Yatchew is
612+
conclusive AND no conclusive test rejects. On the **unweighted
613+
event-study path**: True iff ``np.isfinite(qug.p_value)``,
614614
``pretrends_joint is not None and
615615
np.isfinite(pretrends_joint.p_value)``,
616616
``np.isfinite(homogeneity_joint.p_value)``, AND none of the
617-
three rejects. Mirrors Phase 3's ``bool(np.isfinite(p_value))``
618-
convention - no ``.conclusive()`` helper on any result dataclass.
617+
three rejects. On the **survey/weights path** (Phase 4.5 C): the
618+
QUG-conclusiveness gate is dropped (qug=None per C0 deferral);
619+
``True`` iff at least one linearity test is conclusive AND no
620+
conclusive test rejects (linearity-conditional admissibility).
621+
Mirrors Phase 3's ``bool(np.isfinite(p_value))`` convention - no
622+
``.conclusive()`` helper on any result dataclass.
619623
verdict : str
620624
Human-readable classification. Paper rule applies symmetrically:
621625
TWFE is admissible only if NONE of the implemented tests
@@ -1202,18 +1206,18 @@ def qug_test(
12021206
or pweight inputs (Phase 4.5 C0 decision gate, 2026-04). The test
12031207
statistic uses extreme order statistics ``(D_{(1)}, D_{(2)})``, which
12041208
are NOT smooth functionals of the empirical CDF -- standard survey
1205-
machinery (Binder TSL linearization, Rao-Wu rescaled bootstrap) does
1206-
not yield a calibrated test, and under cluster sampling the
1207-
``Exp(1)/Exp(1)`` limit law's independence assumption breaks. The
1208-
extreme-value-theory-under-unequal-probability-sampling literature
1209-
(Quintos et al. 2001, Beirlant et al.) addresses tail-index
1210-
estimation, not boundary tests; no off-the-shelf survey-aware QUG
1211-
exists. Use joint Stute via :func:`did_had_pretest_workflow`
1212-
(``aggregate="event_study"``) for survey-aware HAD pretesting once
1213-
Phase 4.5 C ships -- Stute tests a smooth empirical-CDF functional
1214-
and admits a Rao-Wu rescaled bootstrap. See
1215-
``docs/methodology/REGISTRY.md`` § "QUG Null Test" for the full
1216-
methodology note.
1209+
machinery (Binder TSL linearization, multiplier bootstrap, Rao-Wu
1210+
rescaled bootstrap) does not yield a calibrated test, and under
1211+
cluster sampling the ``Exp(1)/Exp(1)`` limit law's independence
1212+
assumption breaks. The extreme-value-theory-under-unequal-probability-
1213+
sampling literature (Quintos et al. 2001, Beirlant et al.) addresses
1214+
tail-index estimation, not boundary tests; no off-the-shelf
1215+
survey-aware QUG exists. Phase 4.5 C ships survey-aware Stute via
1216+
:func:`did_had_pretest_workflow` (which skips the QUG step under
1217+
survey/weights and runs the linearity family with a PSU-level Mammen
1218+
multiplier bootstrap for Stute and weighted OLS + pweight-sandwich
1219+
variance components for Yatchew). See ``docs/methodology/REGISTRY.md``
1220+
§ "QUG Null Test" for the full methodology note.
12171221
12181222
References
12191223
----------
@@ -3064,8 +3068,24 @@ def joint_pretrends_test(
30643068
# Phase 4.5 C: aggregate per-row weights/survey to per-unit (G,)
30653069
# using the existing HAD helpers (constant-within-unit invariant
30663070
# enforced; replicate-weight rejected on the survey path).
3071+
# R2 P1 fix: subset row-level `weights` to data_filtered's rows BEFORE
3072+
# resolution, mirroring did_had_pretest_workflow. When
3073+
# _validate_had_panel_event_study auto-filters to the last cohort
3074+
# under staggered timing, the original weights array no longer aligns
3075+
# with data_filtered's row count. Survey= path is unaffected
3076+
# (column references resolved internally on data_filtered).
3077+
weights_for_resolve = weights
3078+
if weights is not None and len(data_filtered) != len(data):
3079+
pos_idx = data.index.get_indexer(data_filtered.index)
3080+
if (pos_idx < 0).any():
3081+
raise ValueError(
3082+
"joint_pretrends_test: cannot align row-level weights to "
3083+
"the staggered-filtered panel; some data_filtered rows do "
3084+
"not appear in original data.index."
3085+
)
3086+
weights_for_resolve = np.asarray(weights, dtype=np.float64)[pos_idx]
30673087
weights_unit, resolved_unit = _resolve_pretest_unit_weights(
3068-
data_filtered, unit_col, weights, survey, "joint_pretrends_test"
3088+
data_filtered, unit_col, weights_for_resolve, survey, "joint_pretrends_test"
30693089
)
30703090
# Reorder per-unit weights to match d_arr/dy_by_horizon ordering.
30713091
# _aggregate_for_joint_test sorts the wide pivot by index (unit_col),
@@ -3265,8 +3285,21 @@ def joint_homogeneity_test(
32653285
)
32663286

32673287
# Phase 4.5 C: aggregate weights/survey to per-unit; thread through.
3288+
# R2 P1 fix: subset row-level `weights` to data_filtered's rows BEFORE
3289+
# resolution, mirroring did_had_pretest_workflow / joint_pretrends_test
3290+
# for staggered last-cohort filtering.
3291+
weights_for_resolve = weights
3292+
if weights is not None and len(data_filtered) != len(data):
3293+
pos_idx = data.index.get_indexer(data_filtered.index)
3294+
if (pos_idx < 0).any():
3295+
raise ValueError(
3296+
"joint_homogeneity_test: cannot align row-level weights to "
3297+
"the staggered-filtered panel; some data_filtered rows do "
3298+
"not appear in original data.index."
3299+
)
3300+
weights_for_resolve = np.asarray(weights, dtype=np.float64)[pos_idx]
32683301
weights_unit, resolved_unit = _resolve_pretest_unit_weights(
3269-
data_filtered, unit_col, weights, survey, "joint_homogeneity_test"
3302+
data_filtered, unit_col, weights_for_resolve, survey, "joint_homogeneity_test"
32703303
)
32713304
w_eff = resolved_unit.weights if resolved_unit is not None else weights_unit
32723305

docs/methodology/REGISTRY.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2432,7 +2432,7 @@ Tuning-parameter-free test of `H_0: d̲ = 0` versus `H_1: d̲ > 0`. Shipped in `
24322432
The survey-compatible alternative for HAD pretesting is **joint Stute** (a CvM cusum of regression residuals) — a smooth functional of the empirical CDF for which Krieger-Pfeffermann (1997) + a survey-aware multiplier bootstrap give a calibrated test. Phase 4.5 C (PR #370) ships survey support for the linearity family — the **PSU-level Mammen multiplier bootstrap** for `stute_test` and the joint variants (NOT Rao-Wu rescaling — multiplier bootstrap is a different mechanism), and **closed-form weighted OLS + pweight-sandwich variance components** for `yatchew_hr_test`. See the dedicated Note (Phase 4.5 C) below for the full algorithm.
24332433
**Research direction (out of scope for diff-diff):** the bridge IS sketchable by combining (a) endpoint-estimation EVT under iid (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999, Beirlant-de Wet-Goegebeur 2006); (b) survey-aware functional CLT for the empirical process (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017); and (c) tail-empirical-process theory (Drees 2003) to define a "design-effective boundary intensity" `λ_eff = Σ_h W_h · f_h(0+)`. Under a "no boundary clumping" assumption (`P(D_{(1)}, D_{(2)}` in same PSU `| both ≤ δ) → 0`), the `Exp(1)/Exp(1)` limit law's pivotality is preserved and only the calibration needs a survey-aware bootstrap (subsampling within strata per Politis-Romano-Wolf, or Bertail et al.'s design-aware bootstrap). This is publishable methodology research — one paper, ~6-12 months for a methods PhD student. If the bridge gets built and published externally, this gate can be revisited.
24342434
- **Note (Phase 4.5 C):** `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` accept `weights=` and `survey=ResolvedSurveyDesign` kwargs (or `survey=SurveyDesign` for the data-in entries). Mechanism varies by test:
2435-
- **Stute family** (`stute_test`, `stute_joint_pretest`, joint wrappers) uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap). Each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix; multipliers broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`. The bootstrap residual perturbation is `dy_b = fitted + eps * w * eta_obs`, followed by weighted OLS refit and weighted CvM recompute via `_cvm_statistic_weighted`. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence (Delgado 1993; Escanciano 2006) AND PSU clustering (Krieger-Pfeffermann 1997). PSU-shared multipliers are conservative under no-within-PSU outcome correlation (over-clustering gives conservative size in finite samples), asymptotically correct under the standard survey assumption that PSU is the ultimate sampling unit AND outcomes correlate within PSU. The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (constructed via `survey._make_trivial_resolved`) so the kernel is shared across both entry paths. NOT "Rao-Wu rescaled bootstrap" — different mechanism (the Rao-Wu kernel rescales per-unit weights via stratified PSU resampling, while this kernel applies multipliers without resampling).
2435+
- **Stute family** (`stute_test`, `stute_joint_pretest`, joint wrappers) uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap). Each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix; multipliers broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`. The bootstrap residual perturbation is `dy_b = fitted + eps * eta_obs` (paper Appendix D wild-bootstrap form — multipliers attach to UNWEIGHTED residuals; the weighting flows through the OLS refit + the weighted CvM, NOT through the perturbation step). Followed by weighted OLS refit (`_fit_weighted_ols_intercept_slope`) and weighted CvM recompute via `_cvm_statistic_weighted`. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence (Delgado 1993; Escanciano 2006) AND PSU clustering (Krieger-Pfeffermann 1997). PSU-shared multipliers are conservative under no-within-PSU outcome correlation (over-clustering gives conservative size in finite samples), asymptotically correct under the standard survey assumption that PSU is the ultimate sampling unit AND outcomes correlate within PSU. The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (constructed via `survey._make_trivial_resolved`) so the kernel is shared across both entry paths. NOT "Rao-Wu rescaled bootstrap" — different mechanism (the Rao-Wu kernel rescales per-unit weights via stratified PSU resampling, while this kernel applies multipliers without resampling).
24362436
- **Yatchew** (`yatchew_hr_test`) uses **closed-form weighted OLS + pweight-sandwich variance components** (no bootstrap). All three components reduce bit-exactly to the unweighted formulas at `w=ones(G)` (locked at `atol=1e-14` in `TestYatchewHRTestSurvey::test_weighted_reduces_to_unweighted_at_uniform_weights`):
24372437
- `sigma2_lin = sum(w * eps^2) / sum(w)` (weighted OLS residual variance).
24382438
- `sigma2_diff = sum(w_avg * (dy_g - dy_{g-1})^2) / (2 * sum(w))` with arithmetic-mean pair weights `w_avg_g = (w_g + w_{g-1})/2`. Divisor uses `sum(w)` (=G at `w=1`), NOT `sum(w_avg)`, to match the existing `(1/(2G))` unweighted formula at `had_pretests.py:1635`.

tests/test_had_pretests.py

Lines changed: 74 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3547,3 +3547,77 @@ def test_workflow_staggered_event_study_weights_subset_correctly(self):
35473547
assert report.aggregate == "event_study"
35483548
assert report.qug is None
35493549
assert report.homogeneity_joint is not None
3550+
3551+
# --- R2 P1: direct-wrapper staggered weights= subsetting ---------------
3552+
3553+
def test_joint_pretrends_test_staggered_weights_subset(self):
3554+
"""R2 P1: joint_pretrends_test direct call must subset row-level
3555+
weights= when its own _validate_had_panel_event_study filtering
3556+
triggers on staggered panels. Pre-fix this crashed with a length-
3557+
mismatch ValueError because the wrapper passed the full-panel
3558+
weights array into _resolve_pretest_unit_weights(data_filtered, ...)."""
3559+
df = self._make_staggered_panel(G_per_cohort=10)
3560+
n_rows = 2 * 10 * 4
3561+
weights_per_row = np.ones(n_rows) * 1.5
3562+
with pytest.warns(UserWarning):
3563+
r = joint_pretrends_test(
3564+
df,
3565+
"y",
3566+
"d",
3567+
"time",
3568+
"unit",
3569+
pre_periods=[0, 1],
3570+
base_period=2,
3571+
first_treat_col="F",
3572+
n_bootstrap=199,
3573+
seed=0,
3574+
weights=weights_per_row,
3575+
)
3576+
assert np.isfinite(r.cvm_stat_joint)
3577+
3578+
def test_joint_homogeneity_test_staggered_weights_subset(self):
3579+
df = self._make_staggered_panel(G_per_cohort=10)
3580+
n_rows = 2 * 10 * 4
3581+
weights_per_row = np.ones(n_rows) * 1.5
3582+
with pytest.warns(UserWarning):
3583+
r = joint_homogeneity_test(
3584+
df,
3585+
"y",
3586+
"d",
3587+
"time",
3588+
"unit",
3589+
post_periods=[3],
3590+
base_period=2,
3591+
first_treat_col="F",
3592+
n_bootstrap=199,
3593+
seed=0,
3594+
weights=weights_per_row,
3595+
)
3596+
assert np.isfinite(r.cvm_stat_joint)
3597+
3598+
# --- R2 P1: bootstrap perturbation form lock ---------------------------
3599+
3600+
def test_stute_survey_perturbation_does_not_double_weight(self):
3601+
"""R2 P1: bootstrap perturbation is `dy_b = fitted + eps * eta_obs`
3602+
(paper Appendix D form), NOT `eps * w * eta_obs`. Adding `* w` to
3603+
the perturbation would over-weight by w² (weighting flows through
3604+
weighted OLS refit + weighted CvM, NOT through the multiplier).
3605+
3606+
Lock test: cvm_stat at uniform weights matches between paths
3607+
bit-exactly (W=G under uniform weights so 1/W² = 1/G²); the
3608+
bootstrap p-value distributions agree within Monte-Carlo noise
3609+
(RNG draw ordering differs between batched survey-aware path and
3610+
per-iteration unweighted path; numerical equivalence is unreachable).
3611+
"""
3612+
d, dy = _linear_dgp(G=50, beta=2.0, sigma=0.3)
3613+
r_unweighted = stute_test(d, dy, n_bootstrap=999, seed=0)
3614+
r_weighted = stute_test(d, dy, weights=np.ones(50), n_bootstrap=999, seed=0)
3615+
# cvm_stat: bit-exact reduction at w=1 (W=G, weighted CvM ≡ unweighted).
3616+
np.testing.assert_allclose(
3617+
r_unweighted.cvm_stat, r_weighted.cvm_stat, atol=1e-14, rtol=1e-14
3618+
)
3619+
# p_value: distributional agreement at large B; Monte-Carlo noise.
3620+
# If the survey path were over-weighting (w² instead of w), the
3621+
# bootstrap distribution would be inflated and the survey p-value
3622+
# would systematically deviate. With the correct form, |diff| < 0.10.
3623+
assert abs(r_unweighted.p_value - r_weighted.p_value) < 0.10

0 commit comments

Comments
 (0)