Skip to content

Commit 6bd07e7

Browse files
igerberclaude
andcommitted
Address PR #359 CI review round 5 (1 P0 + 1 P3)
P0 — zero-weight units must not drive HAD design resolution. Previously ``fit()`` aggregated all units first, then ran ``_detect_design``, ``d_lower`` resolution, the 2% mass-point threshold, and treated/control counts on the full ``d_arr`` — so a zero-weight unit at ``d.min() = 0`` (the standard SurveyDesign.subpopulation encoding) could flip a weighted sample from ``continuous_near_d_lower`` to ``continuous_at_zero``, shift ``d_lower``, misstate cohort counts, or trigger the wrong ``NotImplementedError``. The weighted kernel already dropped those units at fit time via lprobust's ``w > 0`` selector, so the fit NUMBERS were correct; the DESIGN DECISION was contaminated. Fixed by filtering ``d_arr`` / ``dy_arr`` / ``weights_unit`` / ``raw_weights_unit`` / ``resolved_survey_unit`` to ``weights_unit > 0`` immediately after survey/weights resolution and before any design-resolution logic. New ``_filter_resolved_survey`` helper rebuilds the ResolvedSurveyDesign on the positive-weight subset (recomputing ``n_strata`` / ``n_psu`` for compute_survey_if_variance). A UserWarning fires when units are dropped so the behavior is introspectable. Three new regression tests lock the fix: - ``test_zero_weight_unit_at_d_min_does_not_flip_design``: full panel with zero-weight unit at ``d=0`` vs physically-dropped panel — design, d_lower, att, and se all match to 1e-10. - ``test_zero_weight_filter_warns_user``: UserWarning emitted. - ``test_zero_weight_counts_reflect_positive_subset``: ``n_obs`` on the result is the positive-weight unit count, not the full panel size. P3 — CHANGELOG + ``to_dict()`` docstring accuracy. CHANGELOG claimed ``to_dict`` / ``summary`` / ``__repr__`` render the survey metadata consistently including ``weight_sum`` / ``n_strata`` / ``n_psu``; in practice each surface renders a different subset. Rewrote the CHANGELOG entry to enumerate exactly what each surface prints, and rewrote ``HeterogeneousAdoptionDiDResults.to_dict()`` docstring to name the weighted-path keys and clarify that ``survey_metadata`` is a SurveyMetadata object (not a dict). All 363 tests pass. Ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 188021b commit 6bd07e7

3 files changed

Lines changed: 207 additions & 7 deletions

File tree

CHANGELOG.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
88
## [Unreleased]
99

1010
### Added
11-
- **`HeterogeneousAdoptionDiD.fit(survey=..., weights=...)` on continuous-dose paths (Phase 4.5 survey support).** The `continuous_at_zero` (paper Design 1') and `continuous_near_d_lower` (Design 1 continuous-near-d̲) designs accept survey weights through two interchangeable kwargs: `weights=<array>` (pweight shortcut, weighted-robust SE from the CCT-2014 lprobust port) and `survey=SurveyDesign(weights, strata, psu, fpc)` (design-based inference via Binder-TSL variance using the existing `compute_survey_if_variance` helper at `diff_diff/survey.py:1802`). Point estimates match across both entry paths; SE diverges by design (pweight-only vs PSU-aggregated). `HeterogeneousAdoptionDiDResults.survey_metadata` surfaces method / variance-formula / weight-sum / effective-sample-size / n_strata / n_psu; `to_dict` / `summary` / `__repr__` render it consistently. The HAD `mass_point` design and `aggregate="event_study"` path raise `NotImplementedError` under survey/weights (deferred to Phase 4.5 B: weighted 2SLS + event-study survey composition); the HAD pretests stay unweighted in this release (Phase 4.5 C). Parity ceiling acknowledged — no public weighted-CCF bias-corrected local-linear reference exists in any language; methodology confidence comes from (1) uniform-weights bit-parity at `atol=1e-14` on the full lprobust output struct, (2) cross-language weighted-OLS parity (manual R reference) at `atol=1e-12`, and (3) Monte Carlo oracle consistency on known-τ DGPs. `_nprobust_port.lprobust` gains `weights=` and `return_influence=` (used internally by the Binder-TSL path); `bias_corrected_local_linear` removes the Phase 1c `NotImplementedError` on `weights=` and forwards. Auto-bandwidth selection remains unweighted in this release — pass `h`/`b` explicitly for weight-aware bandwidths. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Weighted extension (Phase 4.5 survey support)".
11+
- **`HeterogeneousAdoptionDiD.fit(survey=..., weights=...)` on continuous-dose paths (Phase 4.5 survey support).** The `continuous_at_zero` (paper Design 1') and `continuous_near_d_lower` (Design 1 continuous-near-d̲) designs accept survey weights through two interchangeable kwargs: `weights=<array>` (pweight shortcut, weighted-robust SE from the CCT-2014 lprobust port) and `survey=SurveyDesign(weights, strata, psu, fpc)` (design-based inference via Binder-TSL variance using the existing `compute_survey_if_variance` helper at `diff_diff/survey.py:1802`). Point estimates match across both entry paths; SE diverges by design (pweight-only vs PSU-aggregated). `HeterogeneousAdoptionDiDResults.survey_metadata` is a repo-standard `SurveyMetadata` dataclass (weight_type / effective_n / design_effect / sum_weights / weight_range / n_strata / n_psu / df_survey); HAD-specific extras (`variance_formula` label, `effective_dose_mean`) are separate top-level result fields. `to_dict()` surfaces the full `SurveyMetadata` object plus `variance_formula` + `effective_dose_mean`; `summary()` renders `variance_formula`, `effective_n`, `effective_dose_mean`, and (when the survey= path is used) `df_survey`; `__repr__` surfaces `variance_formula` + `effective_dose_mean` when present. The HAD `mass_point` design and `aggregate="event_study"` path raise `NotImplementedError` under survey/weights (deferred to Phase 4.5 B: weighted 2SLS + event-study survey composition); the HAD pretests stay unweighted in this release (Phase 4.5 C). Parity ceiling acknowledged — no public weighted-CCF bias-corrected local-linear reference exists in any language; methodology confidence comes from (1) uniform-weights bit-parity at `atol=1e-14` on the full lprobust output struct, (2) cross-language weighted-OLS parity (manual R reference) at `atol=1e-12`, and (3) Monte Carlo oracle consistency on known-τ DGPs. `_nprobust_port.lprobust` gains `weights=` and `return_influence=` (used internally by the Binder-TSL path); `bias_corrected_local_linear` removes the Phase 1c `NotImplementedError` on `weights=` and forwards. Auto-bandwidth selection remains unweighted in this release — pass `h`/`b` explicitly for weight-aware bandwidths. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Weighted extension (Phase 4.5 survey support)".
1212
- **`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test` + `StuteJointResult`** (HeterogeneousAdoptionDiD Phase 3 follow-up). Joint Cramér-von Mises pretests across K horizons with shared-η Mammen wild bootstrap (preserves vector-valued empirical-process unit-level dependence per Delgado-Manteiga 2001 / Hlávka-Hušková 2020). The core `stute_joint_pretest` is residuals-in; two thin data-in wrappers construct per-horizon residuals for the two nulls the paper spells out: mean-independence (step 2 pre-trends, `OLS(Y_t − Y_base ~ 1)` per pre-period) and linearity (step 3 joint, `OLS(Y_t − Y_base ~ 1 + D)` per post-period). Sum-of-CvMs aggregation (`S_joint = Σ_k S_k`); per-horizon scale-invariant exact-linear short-circuit. Closes the paper Section 4.2 step-2 gap that Phase 3 `did_had_pretest_workflow` previously flagged with an "Assumption 7 pre-trends test NOT run" caveat. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Joint Stute tests" for algorithm, invariants, and scope exclusion of Eq 18 linear-trend detrending (deferred to Phase 4 Pierce-Schott replication).
1313
- **`did_had_pretest_workflow(aggregate="event_study")`**: multi-period dispatch on balanced ≥3-period panels. Runs QUG at `F` + joint pre-trends Stute across earlier pre-periods + joint homogeneity-linearity Stute across post-periods. Step 2 closure requires ≥2 pre-periods; with only a single pre-period (the base `F-1`) `pretrends_joint=None` and the verdict flags the skip. Reuses the Phase 2b event-study panel validator (last-cohort auto-filter under staggered timing with `UserWarning`; `ValueError` when `first_treat_col=None` and the panel is staggered). The data-in wrappers `joint_pretrends_test` and `joint_homogeneity_test` also route through that same validator internally, so direct wrapper calls inherit the last-cohort filter and constant-post-dose invariant. `HADPretestReport` extended with `pretrends_joint`, `homogeneity_joint`, and `aggregate` fields; serialization methods (`summary`, `to_dict`, `to_dataframe`, `__repr__`) preserve the Phase 3 output bit-exactly on `aggregate="overall"` — no `aggregate` key, no header row, no schema drift — and only surface the new fields on `aggregate="event_study"`.
1414
- **`target_parameter` block in BR/DR schemas (experimental; schema version bumped to 2.0)** — `BUSINESS_REPORT_SCHEMA_VERSION` and `DIAGNOSTIC_REPORT_SCHEMA_VERSION` bumped from `"1.0"` to `"2.0"` because the new `"no_scalar_by_design"` value on the `headline.status` / `headline_metric.status` enum (dCDH `trends_linear=True, L_max>=2` configuration) is a breaking change per the REPORTING.md stability policy. BusinessReport and DiagnosticReport now emit a top-level `target_parameter` block naming what the headline scalar actually represents for each of the 16 result classes. Closes BR/DR foundation gap #6 (target-parameter clarity). Fields: `name`, `definition`, `aggregation` (machine-readable dispatch tag), `headline_attribute` (raw result attribute), `reference` (citation pointer). BR's summary emits the short `name` right after the headline; DR's overall-interpretation paragraph does the same; both full reports carry a "## Target Parameter" section with the full definition. Per-estimator dispatch is sourced from REGISTRY.md and lives in the new `diff_diff/_reporting_helpers.py::describe_target_parameter`. A few branches read fit-time config (`EfficientDiDResults.pt_assumption`, `StackedDiDResults.clean_control`, `ChaisemartinDHaultfoeuilleResults.L_max` / `covariate_residuals` / `linear_trends_effects`); others emit a fixed tag (the fit-time `aggregate` kwarg on CS / Imputation / TwoStage / Wooldridge does not change the `overall_att` scalar — disambiguating horizon / group tables is tracked under gap #9). See `docs/methodology/REPORTING.md` "Target parameter" section.

diff_diff/had.py

Lines changed: 112 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -455,12 +455,26 @@ def print_summary(self) -> None:
455455
print(self.summary())
456456

457457
def to_dict(self) -> Dict[str, Any]:
458-
"""Return results as a dict of scalars + ``survey_metadata`` (dict
459-
or ``None``). When ``survey=`` / ``weights=`` is supplied to
460-
``fit()``, ``survey_metadata`` carries the weighted-sample
461-
diagnostic (method, weight sum, effective sample size) so
462-
downstream consumers can inspect how the fit was weighted without
463-
digging into the estimator object."""
458+
"""Return results as a dict of scalars + weighted-path surfaces.
459+
460+
Always-present keys mirror the dataclass fields: ``att``, ``se``,
461+
``t_stat``, ``p_value``, ``conf_int_lower`` / ``conf_int_upper``,
462+
``alpha``, ``design``, ``target_parameter``, ``d_lower``,
463+
``dose_mean``, ``n_obs`` / ``n_treated`` / ``n_control`` /
464+
``n_mass_point`` / ``n_above_d_lower``, ``inference_method``,
465+
``vcov_type``, ``cluster_name``.
466+
467+
Weighted-path keys (``None`` on unweighted fits):
468+
469+
- ``survey_metadata``: repo-standard
470+
:class:`diff_diff.survey.SurveyMetadata` dataclass (object, not
471+
dict) carrying ``weight_type`` / ``effective_n`` /
472+
``design_effect`` / ``sum_weights`` / ``weight_range`` +
473+
``n_strata`` / ``n_psu`` / ``df_survey`` (latter three
474+
``None`` on the ``weights=`` shortcut).
475+
- ``variance_formula``: ``"pweight"`` or ``"survey_binder_tsl"``.
476+
- ``effective_dose_mean``: weighted denominator used by the
477+
beta-scale rescaling."""
464478
return {
465479
"att": self.att,
466480
"se": self.se,
@@ -1659,6 +1673,63 @@ def _collapse(arr: Optional[np.ndarray], name: str) -> Optional[np.ndarray]:
16591673
)
16601674

16611675

1676+
def _filter_resolved_survey(resolved: Any, keep_mask: np.ndarray) -> Any:
1677+
"""Filter a ResolvedSurveyDesign to a boolean unit-level subset.
1678+
1679+
Used by HAD's continuous path to drop zero-weight units from the
1680+
design-resolution sub-population while preserving the attribute shape
1681+
that ``compute_survey_if_variance`` expects. PSU/strata counts are
1682+
recomputed on the positive-weight subset so degenerate singleton
1683+
strata (after filtering) are counted correctly.
1684+
1685+
Parameters
1686+
----------
1687+
resolved : ResolvedSurveyDesign
1688+
Unit-level resolved design (typically from
1689+
``_aggregate_unit_resolved_survey``).
1690+
keep_mask : np.ndarray, shape (G,), bool
1691+
True for units to keep (e.g., ``weights > 0``).
1692+
1693+
Returns
1694+
-------
1695+
ResolvedSurveyDesign with all (G,) arrays filtered by ``keep_mask``.
1696+
"""
1697+
from diff_diff.survey import ResolvedSurveyDesign
1698+
1699+
def _f(arr: Optional[np.ndarray]) -> Optional[np.ndarray]:
1700+
return arr[keep_mask] if arr is not None else None
1701+
1702+
strata_f = _f(resolved.strata)
1703+
psu_f = _f(resolved.psu)
1704+
n_strata_f = (
1705+
int(np.unique(strata_f).shape[0]) if strata_f is not None else 1
1706+
)
1707+
n_psu_f = (
1708+
int(np.unique(psu_f).shape[0])
1709+
if psu_f is not None
1710+
else int(keep_mask.sum())
1711+
)
1712+
return ResolvedSurveyDesign(
1713+
weights=resolved.weights[keep_mask],
1714+
weight_type=resolved.weight_type,
1715+
strata=strata_f,
1716+
psu=psu_f,
1717+
fpc=_f(resolved.fpc),
1718+
n_strata=n_strata_f,
1719+
n_psu=n_psu_f,
1720+
lonely_psu=resolved.lonely_psu,
1721+
replicate_weights=None,
1722+
replicate_method=None,
1723+
fay_rho=0.0,
1724+
n_replicates=0,
1725+
replicate_strata=None,
1726+
combined_weights=resolved.combined_weights,
1727+
replicate_scale=None,
1728+
replicate_rscales=None,
1729+
mse=resolved.mse,
1730+
)
1731+
1732+
16621733
def _aggregate_multi_period_first_differences(
16631734
data: pd.DataFrame,
16641735
outcome_col: str,
@@ -2492,6 +2563,41 @@ def fit(
24922563
resolved_survey_unit.weights, dtype=np.float64
24932564
)
24942565

2566+
# Zero-weight units (e.g., from SurveyDesign.subpopulation(), or
2567+
# a user-supplied pweight column with excluded observations) must
2568+
# not drive design resolution. Filter d_arr, dy_arr, weights_unit,
2569+
# raw_weights_unit, and resolved_survey_unit to the positive-
2570+
# weight subset BEFORE _detect_design / d_lower / mass-point
2571+
# threshold / treated+control counts / bandwidth selection run.
2572+
# The weighted kernel already drops zero-weight observations via
2573+
# the ``w > 0`` selector in lprobust, so the FIT is unchanged;
2574+
# only the design-decision logic was previously contaminated
2575+
# (CI review PR #359 round 5, P0).
2576+
if weights_unit is not None:
2577+
positive_mask = weights_unit > 0.0
2578+
if not bool(positive_mask.all()):
2579+
n_dropped = int((~positive_mask).sum())
2580+
warnings.warn(
2581+
f"HAD continuous path: {n_dropped} unit(s) have "
2582+
f"weight == 0 and are excluded from design resolution "
2583+
f"(auto-detect design, d_lower, mass-point threshold, "
2584+
f"cohort counts) + the weighted fit. Standard survey "
2585+
f"subpopulation designs (SurveyDesign.subpopulation) "
2586+
f"zero-out excluded units by design; the estimator "
2587+
f"treats them as absent from the analysis sample.",
2588+
UserWarning,
2589+
stacklevel=2,
2590+
)
2591+
d_arr = d_arr[positive_mask]
2592+
dy_arr = dy_arr[positive_mask]
2593+
weights_unit = weights_unit[positive_mask]
2594+
if raw_weights_unit is not None:
2595+
raw_weights_unit = raw_weights_unit[positive_mask]
2596+
if resolved_survey_unit is not None:
2597+
resolved_survey_unit = _filter_resolved_survey(
2598+
resolved_survey_unit, positive_mask
2599+
)
2600+
24952601
n_obs = int(d_arr.shape[0])
24962602
if n_obs < 3:
24972603
raise ValueError(

tests/test_had.py

Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3803,6 +3803,100 @@ def test_effective_dose_mean_none_when_unweighted(self):
38033803
r = est.fit(panel, "outcome", "dose", "period", "unit")
38043804
assert r.effective_dose_mean is None
38053805

3806+
# ---------- Round 5 P0: zero-weight units don't drive design ----------
3807+
3808+
def test_zero_weight_unit_at_d_min_does_not_flip_design(self):
3809+
"""Round 5 P0: a zero-weight unit sitting at ``d.min() = 0``
3810+
must not flip the auto-detect design from
3811+
``continuous_near_d_lower`` (correct on the positive-weight
3812+
subpop) to ``continuous_at_zero`` (wrong, boundary=0 chosen from
3813+
an excluded unit). Previously design detection ran on the full
3814+
unit set, so a subpopulation-style zero-weight unit at d=0
3815+
silently mistargeted."""
3816+
rng = np.random.default_rng(42)
3817+
G_pop = 200
3818+
# Full population: one zero-weight unit at d=0; rest positive
3819+
# weights with d in [0.1, 1.0] (so positive-weight support min = 0.1).
3820+
d = np.concatenate([[0.0], rng.uniform(0.1, 1.0, G_pop - 1)])
3821+
dy = 2.0 * (d - 0.1) + rng.normal(0, 0.2, G_pop)
3822+
w_unit = np.concatenate([[0.0], rng.uniform(0.5, 1.5, G_pop - 1)])
3823+
panel = _make_panel(d, dy)
3824+
row_w = np.zeros(panel.shape[0])
3825+
for g in range(G_pop):
3826+
row_w[panel["unit"].to_numpy() == g] = w_unit[g]
3827+
with warnings.catch_warnings():
3828+
warnings.simplefilter("ignore", UserWarning)
3829+
# Full panel with zero-weight unit at d=0: auto-detect.
3830+
est = HeterogeneousAdoptionDiD(design="auto")
3831+
r_full = est.fit(
3832+
panel, "outcome", "dose", "period", "unit", weights=row_w
3833+
)
3834+
# Physically drop the zero-weight unit and refit.
3835+
panel_dropped = panel[panel["unit"] != 0].reset_index(drop=True)
3836+
w_dropped = row_w[panel["unit"].to_numpy() != 0]
3837+
r_dropped = est.fit(
3838+
panel_dropped,
3839+
"outcome",
3840+
"dose",
3841+
"period",
3842+
"unit",
3843+
weights=w_dropped,
3844+
)
3845+
# Both paths resolve to the SAME design (the positive-weight
3846+
# support, not the contaminated d=0 boundary).
3847+
assert r_full.design == r_dropped.design
3848+
# Both paths produce the same ATT (lprobust already ignored the
3849+
# zero-weight unit's kernel contribution; filtering earlier
3850+
# doesn't change the fit numerically).
3851+
np.testing.assert_allclose(r_full.att, r_dropped.att, atol=1e-10, rtol=1e-10)
3852+
np.testing.assert_allclose(r_full.se, r_dropped.se, atol=1e-10, rtol=1e-10)
3853+
# d_lower set by the positive-weight subpopulation (d.min() of
3854+
# the kept units), NOT the contaminated full d.min()=0.
3855+
assert r_full.d_lower > 0.0
3856+
np.testing.assert_allclose(
3857+
r_full.d_lower, r_dropped.d_lower, atol=1e-12, rtol=1e-12
3858+
)
3859+
3860+
def test_zero_weight_filter_warns_user(self):
3861+
"""Dropping zero-weight units from design resolution should
3862+
emit a UserWarning so the behavior is visible."""
3863+
rng = np.random.default_rng(5)
3864+
G = 150
3865+
d = rng.uniform(0.0, 1.0, G)
3866+
dy = 2.0 * d + rng.normal(0, 0.25, G)
3867+
w_unit = rng.uniform(0.5, 1.5, G)
3868+
# Zero out 5 units.
3869+
w_unit[:5] = 0.0
3870+
panel = _make_panel(d, dy)
3871+
row_w = np.zeros(panel.shape[0])
3872+
for g in range(G):
3873+
row_w[panel["unit"].to_numpy() == g] = w_unit[g]
3874+
est = HeterogeneousAdoptionDiD(design="continuous_at_zero")
3875+
with pytest.warns(UserWarning, match="weight == 0"):
3876+
est.fit(
3877+
panel, "outcome", "dose", "period", "unit", weights=row_w
3878+
)
3879+
3880+
def test_zero_weight_counts_reflect_positive_subset(self):
3881+
"""``n_obs`` / ``n_treated`` / ``n_control`` on the result must
3882+
reflect the positive-weight sub-population, not the full panel."""
3883+
rng = np.random.default_rng(7)
3884+
G = 120
3885+
d = rng.uniform(0.0, 1.0, G)
3886+
dy = 2.0 * d + rng.normal(0, 0.25, G)
3887+
w_unit = np.ones(G)
3888+
w_unit[:20] = 0.0 # 20 zero-weight units
3889+
panel = _make_panel(d, dy)
3890+
row_w = np.zeros(panel.shape[0])
3891+
for g in range(G):
3892+
row_w[panel["unit"].to_numpy() == g] = w_unit[g]
3893+
est = HeterogeneousAdoptionDiD(design="continuous_at_zero")
3894+
with warnings.catch_warnings():
3895+
warnings.simplefilter("ignore", UserWarning)
3896+
r = est.fit(panel, "outcome", "dose", "period", "unit", weights=row_w)
3897+
# 100 positive-weight units, not 120.
3898+
assert r.n_obs == 100
3899+
38063900
def test_survey_metadata_raw_weights_match_shortcut(self):
38073901
"""Round 4 P2: on the ``survey=SurveyDesign(weights="col")``
38083902
path, ``SurveyMetadata.sum_weights`` and ``weight_range`` must

0 commit comments

Comments
 (0)