Address PR #359 CI review round 5 (1 P0 + 1 P3)

igerber · claude · igerber · commit 6bd07e7d11d0 · 2026-04-24T11:15:20.000-04:00
P0 — zero-weight units must not drive HAD design resolution. Previously
``fit()`` aggregated all units first, then ran ``_detect_design``,
``d_lower`` resolution, the 2% mass-point threshold, and treated/control
counts on the full ``d_arr`` — so a zero-weight unit at ``d.min() = 0``
(the standard SurveyDesign.subpopulation encoding) could flip a
weighted sample from ``continuous_near_d_lower`` to
``continuous_at_zero``, shift ``d_lower``, misstate cohort counts, or
trigger the wrong ``NotImplementedError``. The weighted kernel already
dropped those units at fit time via lprobust's ``w > 0`` selector, so
the fit NUMBERS were correct; the DESIGN DECISION was contaminated.
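
To make the distinction concrete, here is a minimal stdlib-only sketch
(toy values and the ``weighted_mean`` helper are illustrative
assumptions, not code from ``had.py``). It shows why a weighted
statistic is unaffected by a zero-weight unit while an unweighted
support statistic such as ``d.min()`` is not:

```python
# Toy data: unit 0 carries zero weight and sits at dose 0 (hypothetical values).
doses = [0.0, 0.1, 0.4, 0.9]
outcomes = [5.0, 0.2, 0.8, 1.6]
weights = [0.0, 1.0, 2.0, 1.0]


def weighted_mean(values, w):
    """Weighted mean; zero-weight entries contribute nothing to the numerator."""
    return sum(v * wi for v, wi in zip(values, w)) / sum(w)


# A weighted statistic is the same whether or not unit 0 is physically present.
mean_full = weighted_mean(outcomes, weights)
mean_dropped = weighted_mean(outcomes[1:], weights[1:])

# An unweighted support statistic is not: the excluded unit sets the minimum.
d_min_full = min(doses)
d_min_positive = min(d for d, w in zip(doses, weights) if w > 0)
```

Any decision keyed on ``d_min_full`` (boundary at zero or not) would
therefore be made from a unit that the weighted fit never sees.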

Fixed by filtering ``d_arr`` / ``dy_arr`` / ``weights_unit`` /
``raw_weights_unit`` / ``resolved_survey_unit`` to
``weights_unit > 0`` immediately after survey/weights resolution and
before any design-resolution logic. New ``_filter_resolved_survey``
helper rebuilds the ResolvedSurveyDesign on the positive-weight subset
(recomputing ``n_strata`` / ``n_psu`` for compute_survey_if_variance).
A UserWarning fires when units are dropped so the behavior is
introspectable.
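
The filter-then-warn pattern can be sketched as follows. This is a
simplified stdlib-only illustration: ``drop_zero_weight_units`` is a
hypothetical name, and the real code operates on NumPy arrays with a
boolean mask and additionally rebuilds the resolved survey design.

```python
import warnings


def drop_zero_weight_units(doses, outcomes, weights):
    """Keep only units with strictly positive weight (hypothetical helper)."""
    keep = [w > 0 for w in weights]
    n_dropped = keep.count(False)
    if n_dropped == 0:
        # Nothing to filter: return the inputs untouched and stay silent.
        return doses, outcomes, weights
    warnings.warn(
        f"{n_dropped} unit(s) have weight == 0 and are excluded",
        UserWarning,
        stacklevel=2,
    )

    def select(values):
        # Apply the same mask to every parallel sequence so they stay aligned.
        return [v for v, k in zip(values, keep) if k]

    return select(doses), select(outcomes), select(weights)


# Usage: capture the warning so the drop is visible to the caller.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    d, dy, w = drop_zero_weight_units(
        [0.0, 0.1, 0.4], [5.0, 0.2, 0.8], [0.0, 1.0, 2.0]
    )
```

Applying one mask to all parallel inputs before any downstream logic
runs is what keeps the design decision and the fit on the same sample.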

Three new regression tests lock the fix:
- ``test_zero_weight_unit_at_d_min_does_not_flip_design``: full panel
  with zero-weight unit at ``d=0`` vs physically-dropped panel;
  design matches exactly, att and se to 1e-10, d_lower to 1e-12.
- ``test_zero_weight_filter_warns_user``: UserWarning emitted.
- ``test_zero_weight_counts_reflect_positive_subset``: ``n_obs`` on the
  result is the positive-weight unit count, not the full panel size.

P3 — CHANGELOG + ``to_dict()`` docstring accuracy. CHANGELOG claimed
``to_dict`` / ``summary`` / ``__repr__`` render the survey metadata
consistently including ``weight_sum`` / ``n_strata`` / ``n_psu``; in
practice each surface renders a different subset. Rewrote the CHANGELOG
entry to enumerate exactly what each surface prints, and rewrote
``HeterogeneousAdoptionDiDResults.to_dict()`` docstring to name the
weighted-path keys and clarify that ``survey_metadata`` is a
SurveyMetadata object (not a dict).

All 363 tests pass. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
-- **`HeterogeneousAdoptionDiD.fit(survey=..., weights=...)` on continuous-dose paths (Phase 4.5 survey support).** The `continuous_at_zero` (paper Design 1') and `continuous_near_d_lower` (Design 1 continuous-near-d̲) designs accept survey weights through two interchangeable kwargs: `weights=<array>` (pweight shortcut, weighted-robust SE from the CCT-2014 lprobust port) and `survey=SurveyDesign(weights, strata, psu, fpc)` (design-based inference via Binder-TSL variance using the existing `compute_survey_if_variance` helper at `diff_diff/survey.py:1802`). Point estimates match across both entry paths; SE diverges by design (pweight-only vs PSU-aggregated). `HeterogeneousAdoptionDiDResults.survey_metadata` surfaces method / variance-formula / weight-sum / effective-sample-size / n_strata / n_psu; `to_dict` / `summary` / `__repr__` render it consistently. The HAD `mass_point` design and `aggregate="event_study"` path raise `NotImplementedError` under survey/weights (deferred to Phase 4.5 B: weighted 2SLS + event-study survey composition); the HAD pretests stay unweighted in this release (Phase 4.5 C). Parity ceiling acknowledged — no public weighted-CCF bias-corrected local-linear reference exists in any language; methodology confidence comes from (1) uniform-weights bit-parity at `atol=1e-14` on the full lprobust output struct, (2) cross-language weighted-OLS parity (manual R reference) at `atol=1e-12`, and (3) Monte Carlo oracle consistency on known-τ DGPs. `_nprobust_port.lprobust` gains `weights=` and `return_influence=` (used internally by the Binder-TSL path); `bias_corrected_local_linear` removes the Phase 1c `NotImplementedError` on `weights=` and forwards. Auto-bandwidth selection remains unweighted in this release — pass `h`/`b` explicitly for weight-aware bandwidths. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Weighted extension (Phase 4.5 survey support)".
+- **`HeterogeneousAdoptionDiD.fit(survey=..., weights=...)` on continuous-dose paths (Phase 4.5 survey support).** The `continuous_at_zero` (paper Design 1') and `continuous_near_d_lower` (Design 1 continuous-near-d̲) designs accept survey weights through two interchangeable kwargs: `weights=<array>` (pweight shortcut, weighted-robust SE from the CCT-2014 lprobust port) and `survey=SurveyDesign(weights, strata, psu, fpc)` (design-based inference via Binder-TSL variance using the existing `compute_survey_if_variance` helper at `diff_diff/survey.py:1802`). Point estimates match across both entry paths; SE diverges by design (pweight-only vs PSU-aggregated). `HeterogeneousAdoptionDiDResults.survey_metadata` is a repo-standard `SurveyMetadata` dataclass (weight_type / effective_n / design_effect / sum_weights / weight_range / n_strata / n_psu / df_survey); HAD-specific extras (`variance_formula` label, `effective_dose_mean`) are separate top-level result fields. `to_dict()` surfaces the full `SurveyMetadata` object plus `variance_formula` + `effective_dose_mean`; `summary()` renders `variance_formula`, `effective_n`, `effective_dose_mean`, and (when the survey= path is used) `df_survey`; `__repr__` surfaces `variance_formula` + `effective_dose_mean` when present. The HAD `mass_point` design and `aggregate="event_study"` path raise `NotImplementedError` under survey/weights (deferred to Phase 4.5 B: weighted 2SLS + event-study survey composition); the HAD pretests stay unweighted in this release (Phase 4.5 C). Parity ceiling acknowledged — no public weighted-CCF bias-corrected local-linear reference exists in any language; methodology confidence comes from (1) uniform-weights bit-parity at `atol=1e-14` on the full lprobust output struct, (2) cross-language weighted-OLS parity (manual R reference) at `atol=1e-12`, and (3) Monte Carlo oracle consistency on known-τ DGPs. `_nprobust_port.lprobust` gains `weights=` and `return_influence=` (used internally by the Binder-TSL path); `bias_corrected_local_linear` removes the Phase 1c `NotImplementedError` on `weights=` and forwards. Auto-bandwidth selection remains unweighted in this release — pass `h`/`b` explicitly for weight-aware bandwidths. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Weighted extension (Phase 4.5 survey support)".
 - **`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test` + `StuteJointResult`** (HeterogeneousAdoptionDiD Phase 3 follow-up). Joint Cramér-von Mises pretests across K horizons with shared-η Mammen wild bootstrap (preserves vector-valued empirical-process unit-level dependence per Delgado-Manteiga 2001 / Hlávka-Hušková 2020). The core `stute_joint_pretest` is residuals-in; two thin data-in wrappers construct per-horizon residuals for the two nulls the paper spells out: mean-independence (step 2 pre-trends, `OLS(Y_t − Y_base ~ 1)` per pre-period) and linearity (step 3 joint, `OLS(Y_t − Y_base ~ 1 + D)` per post-period). Sum-of-CvMs aggregation (`S_joint = Σ_k S_k`); per-horizon scale-invariant exact-linear short-circuit. Closes the paper Section 4.2 step-2 gap that Phase 3 `did_had_pretest_workflow` previously flagged with an "Assumption 7 pre-trends test NOT run" caveat. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Joint Stute tests" for algorithm, invariants, and scope exclusion of Eq 18 linear-trend detrending (deferred to Phase 4 Pierce-Schott replication).
 - **`did_had_pretest_workflow(aggregate="event_study")`**: multi-period dispatch on balanced ≥3-period panels. Runs QUG at `F` + joint pre-trends Stute across earlier pre-periods + joint homogeneity-linearity Stute across post-periods. Step 2 closure requires ≥2 pre-periods; with only a single pre-period (the base `F-1`) `pretrends_joint=None` and the verdict flags the skip. Reuses the Phase 2b event-study panel validator (last-cohort auto-filter under staggered timing with `UserWarning`; `ValueError` when `first_treat_col=None` and the panel is staggered). The data-in wrappers `joint_pretrends_test` and `joint_homogeneity_test` also route through that same validator internally, so direct wrapper calls inherit the last-cohort filter and constant-post-dose invariant. `HADPretestReport` extended with `pretrends_joint`, `homogeneity_joint`, and `aggregate` fields; serialization methods (`summary`, `to_dict`, `to_dataframe`, `__repr__`) preserve the Phase 3 output bit-exactly on `aggregate="overall"` — no `aggregate` key, no header row, no schema drift — and only surface the new fields on `aggregate="event_study"`.
 - **`target_parameter` block in BR/DR schemas (experimental; schema version bumped to 2.0)** — `BUSINESS_REPORT_SCHEMA_VERSION` and `DIAGNOSTIC_REPORT_SCHEMA_VERSION` bumped from `"1.0"` to `"2.0"` because the new `"no_scalar_by_design"` value on the `headline.status` / `headline_metric.status` enum (dCDH `trends_linear=True, L_max>=2` configuration) is a breaking change per the REPORTING.md stability policy. BusinessReport and DiagnosticReport now emit a top-level `target_parameter` block naming what the headline scalar actually represents for each of the 16 result classes. Closes BR/DR foundation gap #6 (target-parameter clarity). Fields: `name`, `definition`, `aggregation` (machine-readable dispatch tag), `headline_attribute` (raw result attribute), `reference` (citation pointer). BR's summary emits the short `name` right after the headline; DR's overall-interpretation paragraph does the same; both full reports carry a "## Target Parameter" section with the full definition. Per-estimator dispatch is sourced from REGISTRY.md and lives in the new `diff_diff/_reporting_helpers.py::describe_target_parameter`. A few branches read fit-time config (`EfficientDiDResults.pt_assumption`, `StackedDiDResults.clean_control`, `ChaisemartinDHaultfoeuilleResults.L_max` / `covariate_residuals` / `linear_trends_effects`); others emit a fixed tag (the fit-time `aggregate` kwarg on CS / Imputation / TwoStage / Wooldridge does not change the `overall_att` scalar — disambiguating horizon / group tables is tracked under gap #9). See `docs/methodology/REPORTING.md` "Target parameter" section.
diff --git a/diff_diff/had.py b/diff_diff/had.py
@@ -455,12 +455,26 @@ def print_summary(self) -> None:
         print(self.summary())
 
     def to_dict(self) -> Dict[str, Any]:
-        """Return results as a dict of scalars + ``survey_metadata`` (dict
-        or ``None``). When ``survey=`` / ``weights=`` is supplied to
-        ``fit()``, ``survey_metadata`` carries the weighted-sample
-        diagnostic (method, weight sum, effective sample size) so
-        downstream consumers can inspect how the fit was weighted without
-        digging into the estimator object."""
+        """Return results as a dict of scalars + weighted-path surfaces.
+
+        Always-present keys mirror the dataclass fields: ``att``, ``se``,
+        ``t_stat``, ``p_value``, ``conf_int_lower`` / ``conf_int_upper``,
+        ``alpha``, ``design``, ``target_parameter``, ``d_lower``,
+        ``dose_mean``, ``n_obs`` / ``n_treated`` / ``n_control`` /
+        ``n_mass_point`` / ``n_above_d_lower``, ``inference_method``,
+        ``vcov_type``, ``cluster_name``.
+
+        Weighted-path keys (``None`` on unweighted fits):
+
+        - ``survey_metadata``: repo-standard
+          :class:`diff_diff.survey.SurveyMetadata` dataclass (object, not
+          dict) carrying ``weight_type`` / ``effective_n`` /
+          ``design_effect`` / ``sum_weights`` / ``weight_range`` +
+          ``n_strata`` / ``n_psu`` / ``df_survey`` (latter three
+          ``None`` on the ``weights=`` shortcut).
+        - ``variance_formula``: ``"pweight"`` or ``"survey_binder_tsl"``.
+        - ``effective_dose_mean``: weighted denominator used by the
+          beta-scale rescaling."""
         return {
             "att": self.att,
             "se": self.se,
@@ -1659,6 +1673,63 @@ def _collapse(arr: Optional[np.ndarray], name: str) -> Optional[np.ndarray]:
     )
 
 
+def _filter_resolved_survey(resolved: Any, keep_mask: np.ndarray) -> Any:
+    """Filter a ResolvedSurveyDesign to a boolean unit-level subset.
+
+    Used by HAD's continuous path to drop zero-weight units from the
+    design-resolution sub-population while preserving the attribute shape
+    that ``compute_survey_if_variance`` expects. PSU/strata counts are
+    recomputed on the positive-weight subset so degenerate singleton
+    strata (after filtering) are counted correctly.
+
+    Parameters
+    ----------
+    resolved : ResolvedSurveyDesign
+        Unit-level resolved design (typically from
+        ``_aggregate_unit_resolved_survey``).
+    keep_mask : np.ndarray, shape (G,), bool
+        True for units to keep (e.g., ``weights > 0``).
+
+    Returns
+    -------
+    ResolvedSurveyDesign with all (G,) arrays filtered by ``keep_mask``.
+    """
+    from diff_diff.survey import ResolvedSurveyDesign
+
+    def _f(arr: Optional[np.ndarray]) -> Optional[np.ndarray]:
+        return arr[keep_mask] if arr is not None else None
+
+    strata_f = _f(resolved.strata)
+    psu_f = _f(resolved.psu)
+    n_strata_f = (
+        int(np.unique(strata_f).shape[0]) if strata_f is not None else 1
+    )
+    n_psu_f = (
+        int(np.unique(psu_f).shape[0])
+        if psu_f is not None
+        else int(keep_mask.sum())
+    )
+    return ResolvedSurveyDesign(
+        weights=resolved.weights[keep_mask],
+        weight_type=resolved.weight_type,
+        strata=strata_f,
+        psu=psu_f,
+        fpc=_f(resolved.fpc),
+        n_strata=n_strata_f,
+        n_psu=n_psu_f,
+        lonely_psu=resolved.lonely_psu,
+        replicate_weights=None,
+        replicate_method=None,
+        fay_rho=0.0,
+        n_replicates=0,
+        replicate_strata=None,
+        combined_weights=resolved.combined_weights,
+        replicate_scale=None,
+        replicate_rscales=None,
+        mse=resolved.mse,
+    )
+
+
 def _aggregate_multi_period_first_differences(
     data: pd.DataFrame,
     outcome_col: str,
@@ -2492,6 +2563,41 @@ def fit(
                 resolved_survey_unit.weights, dtype=np.float64
             )
 
+        # Zero-weight units (e.g., from SurveyDesign.subpopulation(), or
+        # a user-supplied pweight column with excluded observations) must
+        # not drive design resolution. Filter d_arr, dy_arr, weights_unit,
+        # raw_weights_unit, and resolved_survey_unit to the positive-
+        # weight subset BEFORE _detect_design / d_lower / mass-point
+        # threshold / treated+control counts / bandwidth selection run.
+        # The weighted kernel already drops zero-weight observations via
+        # the ``w > 0`` selector in lprobust, so the FIT is unchanged;
+        # only the design-decision logic was previously contaminated
+        # (CI review PR #359 round 5, P0).
+        if weights_unit is not None:
+            positive_mask = weights_unit > 0.0
+            if not bool(positive_mask.all()):
+                n_dropped = int((~positive_mask).sum())
+                warnings.warn(
+                    f"HAD continuous path: {n_dropped} unit(s) have "
+                    f"weight == 0 and are excluded from design resolution "
+                    f"(auto-detect design, d_lower, mass-point threshold, "
+                    f"cohort counts) + the weighted fit. Standard survey "
+                    f"subpopulation designs (SurveyDesign.subpopulation) "
+                    f"zero-out excluded units by design; the estimator "
+                    f"treats them as absent from the analysis sample.",
+                    UserWarning,
+                    stacklevel=2,
+                )
+                d_arr = d_arr[positive_mask]
+                dy_arr = dy_arr[positive_mask]
+                weights_unit = weights_unit[positive_mask]
+                if raw_weights_unit is not None:
+                    raw_weights_unit = raw_weights_unit[positive_mask]
+                if resolved_survey_unit is not None:
+                    resolved_survey_unit = _filter_resolved_survey(
+                        resolved_survey_unit, positive_mask
+                    )
+
         n_obs = int(d_arr.shape[0])
         if n_obs < 3:
             raise ValueError(
diff --git a/tests/test_had.py b/tests/test_had.py
@@ -3803,6 +3803,100 @@ def test_effective_dose_mean_none_when_unweighted(self):
         r = est.fit(panel, "outcome", "dose", "period", "unit")
         assert r.effective_dose_mean is None
 
+    # ---------- Round 5 P0: zero-weight units don't drive design ----------
+
+    def test_zero_weight_unit_at_d_min_does_not_flip_design(self):
+        """Round 5 P0: a zero-weight unit sitting at ``d.min() = 0``
+        must not flip the auto-detect design from
+        ``continuous_near_d_lower`` (correct on the positive-weight
+        subpop) to ``continuous_at_zero`` (wrong, boundary=0 chosen from
+        an excluded unit). Previously design detection ran on the full
+        unit set, so a subpopulation-style zero-weight unit at d=0
+        silently mistargeted."""
+        rng = np.random.default_rng(42)
+        G_pop = 200
+        # Full population: one zero-weight unit at d=0; rest positive
+        # weights with d in [0.1, 1.0] (so positive-weight support min = 0.1).
+        d = np.concatenate([[0.0], rng.uniform(0.1, 1.0, G_pop - 1)])
+        dy = 2.0 * (d - 0.1) + rng.normal(0, 0.2, G_pop)
+        w_unit = np.concatenate([[0.0], rng.uniform(0.5, 1.5, G_pop - 1)])
+        panel = _make_panel(d, dy)
+        row_w = np.zeros(panel.shape[0])
+        for g in range(G_pop):
+            row_w[panel["unit"].to_numpy() == g] = w_unit[g]
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", UserWarning)
+            # Full panel with zero-weight unit at d=0: auto-detect.
+            est = HeterogeneousAdoptionDiD(design="auto")
+            r_full = est.fit(
+                panel, "outcome", "dose", "period", "unit", weights=row_w
+            )
+            # Physically drop the zero-weight unit and refit.
+            panel_dropped = panel[panel["unit"] != 0].reset_index(drop=True)
+            w_dropped = row_w[panel["unit"].to_numpy() != 0]
+            r_dropped = est.fit(
+                panel_dropped,
+                "outcome",
+                "dose",
+                "period",
+                "unit",
+                weights=w_dropped,
+            )
+        # Both paths resolve to the SAME design (the positive-weight
+        # support, not the contaminated d=0 boundary).
+        assert r_full.design == r_dropped.design
+        # Both paths produce the same ATT (lprobust already ignored the
+        # zero-weight unit's kernel contribution; filtering earlier
+        # doesn't change the fit numerically).
+        np.testing.assert_allclose(r_full.att, r_dropped.att, atol=1e-10, rtol=1e-10)
+        np.testing.assert_allclose(r_full.se, r_dropped.se, atol=1e-10, rtol=1e-10)
+        # d_lower set by the positive-weight subpopulation (d.min() of
+        # the kept units), NOT the contaminated full d.min()=0.
+        assert r_full.d_lower > 0.0
+        np.testing.assert_allclose(
+            r_full.d_lower, r_dropped.d_lower, atol=1e-12, rtol=1e-12
+        )
+
+    def test_zero_weight_filter_warns_user(self):
+        """Dropping zero-weight units from design resolution should
+        emit a UserWarning so the behavior is visible."""
+        rng = np.random.default_rng(5)
+        G = 150
+        d = rng.uniform(0.0, 1.0, G)
+        dy = 2.0 * d + rng.normal(0, 0.25, G)
+        w_unit = rng.uniform(0.5, 1.5, G)
+        # Zero out 5 units.
+        w_unit[:5] = 0.0
+        panel = _make_panel(d, dy)
+        row_w = np.zeros(panel.shape[0])
+        for g in range(G):
+            row_w[panel["unit"].to_numpy() == g] = w_unit[g]
+        est = HeterogeneousAdoptionDiD(design="continuous_at_zero")
+        with pytest.warns(UserWarning, match="weight == 0"):
+            est.fit(
+                panel, "outcome", "dose", "period", "unit", weights=row_w
+            )
+
+    def test_zero_weight_counts_reflect_positive_subset(self):
+        """``n_obs`` / ``n_treated`` / ``n_control`` on the result must
+        reflect the positive-weight sub-population, not the full panel."""
+        rng = np.random.default_rng(7)
+        G = 120
+        d = rng.uniform(0.0, 1.0, G)
+        dy = 2.0 * d + rng.normal(0, 0.25, G)
+        w_unit = np.ones(G)
+        w_unit[:20] = 0.0  # 20 zero-weight units
+        panel = _make_panel(d, dy)
+        row_w = np.zeros(panel.shape[0])
+        for g in range(G):
+            row_w[panel["unit"].to_numpy() == g] = w_unit[g]
+        est = HeterogeneousAdoptionDiD(design="continuous_at_zero")
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", UserWarning)
+            r = est.fit(panel, "outcome", "dose", "period", "unit", weights=row_w)
+        # 100 positive-weight units, not 120.
+        assert r.n_obs == 100
+
     def test_survey_metadata_raw_weights_match_shortcut(self):
         """Round 4 P2: on the ``survey=SurveyDesign(weights="col")``
         path, ``SurveyMetadata.sum_weights`` and ``weight_range`` must