Commit d8a7353

igerber and claude committed
Address PR #370 R10 review (1 P1 + 1 P3)
R10 P1 (Methodology) -- Stute survey bootstrap was silently miscalibrated on stratified designs.

The HAD sup-t bootstrap (had.py:2120+) applies a within-stratum demean + sqrt(n_h/(n_h-1)) small-sample correction AFTER generate_survey_multiplier_weights_batch returns, to make the bootstrap variance match the Binder-TSL stratified target. The same correction has NOT been derived for the Stute CvM functional, so applying the helper's raw multipliers directly to residual perturbations on stratified designs left the bootstrap p-value silently miscalibrated.

Per the reviewer's offered "narrow support" path: Phase 4.5 C now explicitly rejects stratified designs on the Stute family with NotImplementedError. Pweight-only, PSU-only, and FPC-only designs remain supported (the helper's output is appropriately scaled for those without further correction). Stratified is a follow-up after the matching Stute-CvM stratified-correction derivation lands.

Mirrors the lonely_psu='adjust' rejection pattern (R5 P1): both are methodology-gap-driven explicit NotImplementedErrors with documented follow-up. The strata guard supersedes the lonely_psu='adjust' singleton-strata guard for any stratified design (the latter is now defense-in-depth for the unstratified residual case).

R10 P3 -- Added "stratified-design rejection" entry to REGISTRY's Note (Phase 4.5 C). Also updated CHANGELOG to narrow the documented survey contract.

Tests updated:
- test_stute_test_lonely_psu_adjust_singletons_raises -> test_stute_test_stratified_design_raises (the strata guard fires first; the test is still meaningful but on a strata key match).
- Same renaming for stute_joint_pretest variant.
- test_stute_test_lonely_psu_remove_singletons_returns_nan REMOVED (singleton strata under lonely_psu='remove' now hits the strata guard instead of the df_survey<=0 guard).
- test_joint_homogeneity_test_psu_strata_survey_smoke -> test_joint_homogeneity_test_psu_only_survey_smoke (positive coverage on PSU-only design) + new test_joint_homogeneity_test_stratified_raises.
- test_workflow_event_study_psu_strata_survey_smoke -> test_workflow_event_study_psu_only_survey_smoke.
- test_workflow_event_study_survey_pass_does_not_say_inconclusive switched from strata to PSU-only.

191 pretest tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 856e342 commit d8a7353
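The within-stratum correction that the commit message attributes to the HAD sup-t bootstrap can be sketched roughly as follows. This is an illustration only, not the code at had.py:2120+: the function name `center_and_rescale_within_strata`, the `(n_bootstrap, n_psu)` layout with one stratum label per PSU, and the zeroing of singleton strata are all assumptions made for the example.

```python
import numpy as np


def center_and_rescale_within_strata(multipliers, strata):
    """Demean multipliers within each stratum, then scale by sqrt(n_h / (n_h - 1)).

    multipliers: (n_bootstrap, n_psu) raw PSU-level multipliers.
    strata:      (n_psu,) stratum label for each PSU.
    Hypothetical helper for illustration; not part of the library.
    """
    multipliers = np.asarray(multipliers, dtype=float)
    strata = np.asarray(strata)
    out = np.empty_like(multipliers)
    for h in np.unique(strata):
        cols = np.flatnonzero(strata == h)
        n_h = cols.size
        if n_h < 2:
            # Assumption: a singleton stratum carries no identifiable
            # variance, so its multipliers are zeroed in this sketch.
            out[:, cols] = 0.0
            continue
        block = multipliers[:, cols]
        centered = block - block.mean(axis=1, keepdims=True)
        out[:, cols] = centered * np.sqrt(n_h / (n_h - 1))
    return out


rng = np.random.default_rng(0)
raw = rng.standard_normal((4, 5))      # 4 replicates, 5 PSUs
strata = np.array([0, 0, 0, 1, 1])     # 3 PSUs in stratum 0, 2 in stratum 1
adj = center_and_rescale_within_strata(raw, strata)
```

After the transform, each replicate's multipliers sum to zero within every stratum. Whether the same transform is the right calibration for the Stute CvM functional is exactly the open derivation this commit defers.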

4 files changed

Lines changed: 95 additions & 46 deletions

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
-- **HAD linearity-family pretests under survey (Phase 4.5 C).** `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` now accept `weights=` / `survey=` keyword-only kwargs. Stute family uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap): each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix, broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`, weighted OLS refit, weighted CvM via new `_cvm_statistic_weighted` helper. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence AND PSU clustering. Yatchew uses **closed-form weighted OLS + pweight-sandwich variance components** (no bootstrap): `sigma2_lin = sum(w·eps²)/sum(w)`, `sigma2_diff = sum(w_avg·diff²)/(2·sum(w))` with arithmetic-mean pair weights `w_avg_g = (w_g+w_{g-1})/2`, `sigma4_W = sum(w_avg·prod)/sum(w_avg)`, `T_hr = sqrt(sum(w))·(sigma2_lin-sigma2_diff)/sigma2_W`. All three Yatchew components reduce bit-exactly to the unweighted formulas at `w=ones(G)` (locked at `atol=1e-14` by direct helper test). The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (new `survey._make_trivial_resolved` helper) so the same kernel handles both entry paths. `did_had_pretest_workflow(..., survey=, weights=)` removes the Phase 4.5 C0 `NotImplementedError`, dispatches to the survey-aware sub-tests, **skips the QUG step with `UserWarning`** (per C0 deferral), sets `qug=None` on the report, and appends a `"linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0"` suffix to the verdict. `HADPretestReport.qug` retyped from `QUGTestResults` to `Optional[QUGTestResults]`; `summary()` / `to_dict()` / `to_dataframe()` updated to None-tolerant rendering. 
Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise `NotImplementedError` at every entry point (defense in depth, reciprocal-guard discipline) — parallel follow-up after this PR. Strictly positive weights required on Yatchew (the adjacent-difference variance is undefined under contiguous-zero blocks). Per-row `weights=` / `survey=col` aggregated to per-unit via existing HAD helpers `_aggregate_unit_weights` / `_aggregate_unit_resolved_survey` (constant-within-unit invariant enforced). Unweighted code paths preserved bit-exactly. Patch-level addition (additive on stable surfaces). See `docs/methodology/REGISTRY.md` § "QUG Null Test" — Note (Phase 4.5 C) for the full methodology.
+- **HAD linearity-family pretests under survey (Phase 4.5 C).** `stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, and `did_had_pretest_workflow` now accept `weights=` / `survey=` keyword-only kwargs. Stute family uses **PSU-level Mammen multiplier bootstrap** via `bootstrap_utils.generate_survey_multiplier_weights_batch` (the same kernel as PR #363's HAD event-study sup-t bootstrap): each replicate draws an `(n_bootstrap, n_psu)` Mammen multiplier matrix, broadcast to per-obs perturbation `eta_obs[g] = eta_psu[psu(g)]`, weighted OLS refit, weighted CvM via new `_cvm_statistic_weighted` helper. Joint Stute SHARES the multiplier matrix across horizons within each replicate, preserving both the vector-valued empirical-process unit-level dependence AND PSU clustering. Yatchew uses **closed-form weighted OLS + pweight-sandwich variance components** (no bootstrap): `sigma2_lin = sum(w·eps²)/sum(w)`, `sigma2_diff = sum(w_avg·diff²)/(2·sum(w))` with arithmetic-mean pair weights `w_avg_g = (w_g+w_{g-1})/2`, `sigma4_W = sum(w_avg·prod)/sum(w_avg)`, `T_hr = sqrt(sum(w))·(sigma2_lin-sigma2_diff)/sigma2_W`. All three Yatchew components reduce bit-exactly to the unweighted formulas at `w=ones(G)` (locked at `atol=1e-14` by direct helper test). The pweight `weights=` shortcut routes through a synthetic trivial `ResolvedSurveyDesign` (new `survey._make_trivial_resolved` helper) so the same kernel handles both entry paths. `did_had_pretest_workflow(..., survey=, weights=)` removes the Phase 4.5 C0 `NotImplementedError`, dispatches to the survey-aware sub-tests, **skips the QUG step with `UserWarning`** (per C0 deferral), sets `qug=None` on the report, and appends a `"linearity-conditional verdict; QUG-under-survey deferred per Phase 4.5 C0"` suffix to the verdict. `HADPretestReport.qug` retyped from `QUGTestResults` to `Optional[QUGTestResults]`; `summary()` / `to_dict()` / `to_dataframe()` updated to None-tolerant rendering. 
Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) raise `NotImplementedError` at every entry point (defense in depth, reciprocal-guard discipline) — parallel follow-up after this PR. **Stratified designs (`SurveyDesign(strata=...)`) also raise `NotImplementedError` on the Stute family** — the within-stratum demean + `sqrt(n_h/(n_h-1))` correction that the HAD sup-t bootstrap applies to match the Binder-TSL stratified target has not been derived for the Stute CvM functional, so applying raw multipliers from `generate_survey_multiplier_weights_batch` directly to residual perturbations would leave the bootstrap p-value silently miscalibrated. Phase 4.5 C narrows survey support to **pweight-only**, **PSU-only** (`SurveyDesign(weights=, psu=)`), and **FPC-only** (`SurveyDesign(weights=, fpc=)`) designs; stratified is a follow-up after the matching Stute-CvM stratified-correction derivation lands. Strictly positive weights required on Yatchew (the adjacent-difference variance is undefined under contiguous-zero blocks). Per-row `weights=` / `survey=col` aggregated to per-unit via existing HAD helpers `_aggregate_unit_weights` / `_aggregate_unit_resolved_survey` (constant-within-unit invariant enforced). Unweighted code paths preserved bit-exactly. Patch-level addition (additive on stable surfaces). See `docs/methodology/REGISTRY.md` § "QUG Null Test" — Note (Phase 4.5 C) for the full methodology.
- **`ChaisemartinDHaultfoeuille.by_path` + `placebo=True`** — per-path backward-horizon placebos `DID^{pl}_{path, l}` for `l = 1..L_max`. The same per-path SE convention used for the event-study (joiners/leavers IF precedent: switcher-side contributions zeroed for non-path groups; cohort structure and control pool unchanged; plug-in SE with path-specific divisor `N^{pl}_{l, path}`) is applied to backward horizons via the new `switcher_subset_mask` parameter on `_compute_per_group_if_placebo_horizon`. Surfaced on `results.path_placebo_event_study[path][-l]` (negative-int inner keys mirroring `placebo_event_study`); `summary()` renders the rows alongside per-path event-study horizons; `to_dataframe(level="by_path")` emits negative-horizon rows alongside the existing positive-horizon rows. **Bootstrap** (when `n_bootstrap > 0`) propagates per-`(path, lag)` percentile CI / p-value through the same `_bootstrap_one_target` dispatch as the per-path event-study, with the canonical NaN-on-invalid contract enforced on the new surface (PR #364 library-wide invariant). **SE inherits the cross-path cohort-sharing deviation from R** documented for `path_effects` (full-panel cohort-centered plug-in vs R's per-path re-run): tracks R within tolerance on single-path-cohort panels, diverges materially on cohort-mixed panels — the bootstrap SE is a Monte Carlo analog of the analytical SE and inherits the same deviation. R-parity confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathPlacebo` on the new `multi_path_reversible_by_path_placebo` scenario (point estimates exact match; SE within Phase-2 envelope rtol ≤ 5%); positive analytical + bootstrap invariants at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPlacebo` (and the gated `::TestBootstrap` subclass). See `docs/methodology/REGISTRY.md` §ChaisemartinDHaultfoeuille `Note (Phase 3 by_path ...)` → "Per-path placebos" for the full contract.

## [3.3.0] - 2026-04-25
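The weighted Yatchew components listed in the changelog entry can be written out as a small NumPy sketch. Several pieces are assumptions rather than facts from the source: the function name `yatchew_components`, that `diff` means adjacent differences of the outcome after sorting by dose, that `prod` means the product of adjacent squared residuals, and that `sigma2_W` is the square root of `sigma4_W`. The library's actual helper may differ.

```python
import numpy as np


def yatchew_components(eps, dy_sorted, w):
    """Weighted variance components following the changelog formulas.

    eps:       (G,) residuals from the (weighted) linear fit, in dose order.
    dy_sorted: (G,) outcomes in the same dose order.
    w:         (G,) strictly positive weights.
    Hypothetical helper; 'diff', 'prod', and sigma2_W are assumed definitions.
    """
    eps = np.asarray(eps, dtype=float)
    dy_sorted = np.asarray(dy_sorted, dtype=float)
    w = np.asarray(w, dtype=float)

    w_avg = 0.5 * (w[1:] + w[:-1])            # arithmetic-mean pair weights
    diff = np.diff(dy_sorted)                 # assumed: adjacent differences
    prod = eps[1:] ** 2 * eps[:-1] ** 2       # assumed: adjacent squared residuals

    sigma2_lin = np.sum(w * eps**2) / np.sum(w)
    sigma2_diff = np.sum(w_avg * diff**2) / (2.0 * np.sum(w))
    sigma4_w = np.sum(w_avg * prod) / np.sum(w_avg)
    # Assumed: sigma2_W = sqrt(sigma4_W).
    t_hr = np.sqrt(np.sum(w)) * (sigma2_lin - sigma2_diff) / np.sqrt(sigma4_w)
    return sigma2_lin, sigma2_diff, sigma4_w, t_hr


rng = np.random.default_rng(1)
G = 50
eps = rng.standard_normal(G)
dy_sorted = rng.standard_normal(G)
s_lin, s_diff, s4_w, t_hr = yatchew_components(eps, dy_sorted, np.ones(G))
```

With `w = ones(G)` the three components collapse to plain means and sums over `G` and `G - 1` terms, which is the reduction the changelog says is locked by a direct helper test. This sketch checks it only up to floating-point tolerance, not bit-exactly.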

diff_diff/had_pretests.py

Lines changed: 47 additions & 9 deletions
@@ -1722,14 +1722,36 @@ def stute_test(
     # CvM recompute. Routes via synthetic trivial ResolvedSurveyDesign
     # for the weights= shortcut to share the same kernel.
     resolved_for_boot = survey if survey is not None else _make_trivial_resolved(w_arr)
+    # R10 P1: reject stratified designs explicitly until a derived
+    # Stute-specific correction lands. The HAD sup-t bootstrap
+    # (had.py:2120+) applies a within-stratum demean +
+    # sqrt(n_h/(n_h-1)) small-sample correction AFTER
+    # generate_survey_multiplier_weights_batch returns, to make the
+    # bootstrap variance match the Binder-TSL stratified target.
+    # That same correction has NOT been derived for the Stute CvM
+    # functional, so applying the helper's raw multipliers directly
+    # to residual perturbations on stratified designs leaves the
+    # bootstrap p-value silently miscalibrated. Pweight-only,
+    # PSU-only, and FPC-only designs are still supported (the
+    # helper's output is appropriately scaled for those).
+    if resolved_for_boot.strata is not None:
+        raise NotImplementedError(
+            "stute_test: SurveyDesign(strata=...) with stratified "
+            "sampling is not yet supported. The Stute CvM bootstrap "
+            "calibration on stratified designs requires a within-"
+            "stratum demean + sqrt(n_h/(n_h-1)) small-sample "
+            "correction analogous to the HAD sup-t bootstrap, but "
+            "the matching derivation for the Stute functional has "
+            "not been completed. Pweight-only or PSU-only "
+            "(SurveyDesign(weights=..., psu=...)) designs are "
+            "supported; pre-process stratified designs to remove "
+            "the strata column or wait for the derivation in a "
+            "follow-up PR."
+        )
     # R5 P1: reject lonely_psu='adjust' singleton-strata designs
-    # explicitly (mirrors HAD sup-t bootstrap at had.py:2081-2118).
-    # The bootstrap helper pools singletons into a pseudo-stratum with
-    # NONZERO multipliers, but the matching variance target requires
-    # a pseudo-stratum centering transform that is not derived for
-    # the Stute CvM. Other lonely_psu modes ("remove" / "certainty")
-    # produce zero multipliers for singletons and are caught by the
-    # df_survey guard below.
+    # explicitly (now redundant with the strata guard above; kept
+    # for defense in depth and for residual non-stratified
+    # singleton-strata edge cases).
     if _has_lonely_psu_adjust_singletons(resolved_for_boot):
         raise NotImplementedError(
             "stute_test: SurveyDesign(lonely_psu='adjust') with "
@@ -2914,9 +2936,25 @@ def stute_joint_pretest(
     # vector-valued empirical-process unit-level dependence (paper
     # convention) AND PSU clustering (Krieger-Pfeffermann 1997).
     resolved_for_boot = survey if survey is not None else _make_trivial_resolved(w_arr)
+    # R10 P1: reject stratified designs explicitly until a derived
+    # Stute-specific correction lands (mirrors stute_test
+    # single-horizon).
+    if resolved_for_boot.strata is not None:
+        raise NotImplementedError(
+            "stute_joint_pretest: SurveyDesign(strata=...) with "
+            "stratified sampling is not yet supported. The Stute "
+            "CvM bootstrap calibration on stratified designs "
+            "requires a within-stratum demean + sqrt(n_h/(n_h-1)) "
+            "small-sample correction analogous to the HAD sup-t "
+            "bootstrap, but the matching derivation for the joint "
+            "Stute functional has not been completed. Pweight-only "
+            "or PSU-only designs are supported; pre-process "
+            "stratified designs to remove the strata column or wait "
+            "for the derivation in a follow-up PR."
+        )
     # R5 P1: reject lonely_psu='adjust' singleton-strata designs
-    # explicitly (mirrors stute_test single-horizon and HAD sup-t
-    # bootstrap).
+    # explicitly (now redundant with the strata guard above; kept
+    # for defense in depth).
     if _has_lonely_psu_adjust_singletons(resolved_for_boot):
         raise NotImplementedError(
             "stute_joint_pretest: SurveyDesign(lonely_psu='adjust') "

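The guard pattern in the two hunks above, and the kind of test the commit message describes, can be mimicked with a self-contained stand-in. Everything named here (`ToyResolvedDesign`, `check_stute_design_supported`, `raises_on_strata`) is hypothetical and exists only for this sketch; the real `ResolvedSurveyDesign` and test functions are not reproduced.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class ToyResolvedDesign:
    """Hypothetical stand-in exposing only the fields the guard inspects."""
    weights: np.ndarray
    strata: Optional[np.ndarray] = None
    psu: Optional[np.ndarray] = None


def check_stute_design_supported(design, caller="stute_test"):
    """Reject stratified designs up front instead of returning a miscalibrated p-value."""
    if design.strata is not None:
        raise NotImplementedError(
            f"{caller}: SurveyDesign(strata=...) with stratified "
            "sampling is not yet supported."
        )


def raises_on_strata(design):
    """Return True if the guard fires with a message mentioning 'strata'."""
    try:
        check_stute_design_supported(design)
    except NotImplementedError as exc:
        return "strata" in str(exc)
    return False


w = np.ones(6)
psu = np.array([0, 0, 1, 1, 2, 2])
psu_only = ToyResolvedDesign(weights=w, psu=psu)
stratified = ToyResolvedDesign(weights=w, strata=np.array([0, 0, 0, 1, 1, 1]), psu=psu)
```

Placing the strata check before the `lonely_psu='adjust'` check means any stratified design fails on the strata message first, which is why the renamed tests match on a strata key.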
docs/methodology/REGISTRY.md

Lines changed: 1 addition & 0 deletions
@@ -2444,6 +2444,7 @@ Tuning-parameter-free test of `H_0: d̲ = 0` versus `H_1: d̲ > 0`. Shipped in `
- `aggregate="event_study"`: `True` iff `pretrends_joint` is non-None and conclusive, `homogeneity_joint` is conclusive, AND neither rejects. Both joint variants must be conclusive on the event-study path (same step-2 + step-3 closure as the unweighted aggregate, just without the QUG step).
- **Replicate-weight survey designs (BRR/Fay/JK1/JKn/SDR) deferred** to a parallel follow-up. Each helper raises `NotImplementedError` on `survey.replicate_weights is not None` (defense in depth: workflow + every direct-helper entry rejects, mirroring the reciprocal-guard discipline from PR #346). The per-replicate weight-ratio rescaling for the OLS-on-residuals refit step is not covered by the multiplier-bootstrap composition above.
- **`lonely_psu='adjust'` with singleton strata is rejected** with `NotImplementedError` on the Stute family (mirrors HAD sup-t bootstrap at `had.py:2081-2118`). The bootstrap multiplier helper pools singleton strata into a pseudo-stratum with nonzero multipliers, but the analytical variance target requires a pseudo-stratum centering transform that has not been derived for the Stute CvM. Use `lonely_psu='remove'` (drops singleton contributions) or `'certainty'` (zero-variance singletons); both produce all-zero singleton multipliers that match a well-defined analytical target. Variance-unidentified designs (`df_survey <= 0` after the adjust+singleton case is handled) return `NaN` with a `UserWarning` (single-PSU unstratified or one-PSU-per-stratum under remove/certainty).
+- **Stratified designs (`SurveyDesign(strata=...)`) are rejected** with `NotImplementedError` on `stute_test` and `stute_joint_pretest` (and propagate to `did_had_pretest_workflow`). The HAD sup-t bootstrap (had.py:2120+) applies a within-stratum demean + `sqrt(n_h/(n_h-1))` small-sample correction AFTER `generate_survey_multiplier_weights_batch` to make the bootstrap variance match the Binder-TSL stratified target. That same correction has NOT been derived for the Stute CvM functional, so applying the helper's raw multipliers directly to residual perturbations on stratified designs would leave the bootstrap p-value silently miscalibrated. Phase 4.5 C narrows Stute-family survey support to **pweight-only** (no strata, no PSU), **PSU-only** (`SurveyDesign(weights=, psu=)`), and **FPC-only** (`SurveyDesign(weights=, fpc=)`) designs. Stratified designs are a follow-up after the matching Stute-CvM stratified-correction derivation lands.
- **Constant-within-unit invariant**: per-row `weights=` / `survey=col` are aggregated to per-unit `(G,)` arrays via the existing HAD helpers `_aggregate_unit_weights` / `_aggregate_unit_resolved_survey` (had.py:1604, :1671); these enforce constant-within-unit invariant on weights and on every survey design column (strata, psu, fpc) and raise on violation. Direct callers passing already-resolved `ResolvedSurveyDesign` (or per-unit `weights` array) bypass this aggregation; the invariant is the caller's responsibility on that path.
- **Distributional parity, NOT bit-exact**: at `weights=ones(G)` the survey path produces a different bootstrap p-value than the unweighted path because RNG consumption differs (batched `generate_survey_multiplier_weights_batch` vs per-iteration `_generate_mammen_weights`). The two paths agree DISTRIBUTIONALLY at large B (`|p_avg_diff| < 0.03` over 100 reps at `B=5000`); they DO NOT agree numerically at `atol=1e-10`. The unweighted code path is preserved bit-exactly (stability invariant; the new `weights=`/`survey=` arms are separate `if` branches).
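For orientation, a batched PSU-level Mammen draw and its broadcast to observations (`eta_obs[g] = eta_psu[psu(g)]`) might look like the sketch below. It uses the standard two-point Mammen distribution; the function names and the integer `psu_index` encoding are assumptions, and this is not the implementation of `generate_survey_multiplier_weights_batch`.

```python
import numpy as np


def mammen_multipliers(n_bootstrap, n_psu, rng):
    """Draw an (n_bootstrap, n_psu) matrix of two-point Mammen multipliers.

    Values are -(sqrt(5)-1)/2 with probability (sqrt(5)+1)/(2*sqrt(5)) and
    (sqrt(5)+1)/2 otherwise, giving mean 0 and variance 1.
    """
    sqrt5 = np.sqrt(5.0)
    low, high = -(sqrt5 - 1.0) / 2.0, (sqrt5 + 1.0) / 2.0
    p_low = (sqrt5 + 1.0) / (2.0 * sqrt5)
    draws = rng.random((n_bootstrap, n_psu))
    return np.where(draws < p_low, low, high)


def broadcast_to_obs(eta_psu, psu_index):
    """Give every observation the multiplier of its PSU (integer-coded 0..n_psu-1)."""
    return eta_psu[:, psu_index]


rng = np.random.default_rng(42)
psu_index = np.array([0, 0, 1, 2, 2, 2])   # 6 observations in 3 PSUs
eta_psu = mammen_multipliers(1000, 3, rng)
eta_obs = broadcast_to_obs(eta_psu, psu_index)
```

Drawing the whole matrix in one call consumes the random stream in a different order than a per-iteration loop would, which is one way two paths with the same seed can agree in distribution without matching numerically.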
