Address PR #359 CI review round 1 (1 P0 + 2 P1 + 2 P2/P3)

igerber · claude · igerber · commit 6125fda109df · 2026-04-24T09:51:27.000-04:00
P0 — bias-corrected survey IF: the survey variance path was building
the influence function from classical residuals (``res_h`` + ``R_p·W_h``,
aligned with ``V_Y_cl``) while the returned ATT uses the bias-corrected
``tau_bc``. Under compute_survey_if_variance this silently
under-estimated the survey SE by ignoring the CCT-2014 bias-correction
variance inflation. Fixed in ``_nprobust_port.lprobust``: IF now uses
``Q_q`` + ``res_b`` so ``sum(IF^2) == V_Y_bc[0,0]`` (verified in new
white-box test), matching the estimator scale of the ATT. Uniform-
weights bit-parity (weights=np.ones ≡ unweighted at 1e-14) preserved
across the new IF formula. The ``TestWeightedLprobust`` HC1 bit-parity
test still passes because under weights=ones the classical and bias-
corrected IFs differ only by the ``Q_q`` bias-correction term, which is
deterministic and cancels in the weighted-vs-unweighted difference.
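
The identity behind the white-box check can be sketched on a plain
weighted least-squares fit. This is a generic illustration, not the
``lprobust`` code: an ordinary WLS design matrix and residuals stand in
for ``Q_q`` and ``res_b``, and every name below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
w = rng.uniform(0.5, 2.0, size=n)  # hypothetical sampling weights

X = np.column_stack([np.ones(n), x])            # design matrix, shape (n, 2)
invG = np.linalg.inv(X.T @ (w[:, None] * X))    # (X' W X)^{-1}
beta = invG @ (X.T @ (w * y))                   # WLS coefficients
res = y - X @ beta                              # residuals

# Per-observation scores x_i * w_i * res_i, shape (n, 2).
score = X * (w * res)[:, None]

# Unclustered HC0 sandwich: invG @ (sum_i score_i score_i') @ invG.
V_hc0 = invG @ (score.T @ score) @ invG

# Influence function of the intercept: psi_i = invG[0, :] @ score_i.
psi = score @ invG[0, :]

# Self-check: the squared IF sums to the matching sandwich diagonal entry.
assert np.isclose(np.sum(psi**2), V_hc0[0, 0], rtol=1e-10)
```

The check only holds when the IF and the variance matrix are built from
the same scores, which is why pairing a classical IF with a
bias-corrected point estimate goes unnoticed unless it is tested.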

P1a — df_survey threading: survey fits previously used Normal-theory
inference regardless of PSU count. ``resolved_survey_unit.df_survey``
(n_psu − n_strata, or replicate QR rank − 1) now routes through
``safe_inference(..., df=...)`` on the survey path, producing t-based
p-values and CIs. Also surfaced in ``survey_metadata["df_survey"]`` for
introspection. Under small-PSU designs the t-critical exceeds the
Normal z-critical, widening the CI vs the prior (wrong) output. The
``weights=`` shortcut continues to use Normal inference since there's
no PSU structure to produce a finite df.
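
A minimal sketch of the widening effect, assuming SciPy is available.
``critical_value`` is a hypothetical helper, not the signature of
``safe_inference``, and the PSU/strata counts and SE are made-up inputs.

```python
from scipy import stats


def critical_value(alpha, df=None):
    """Two-sided critical value: Student t when df is given, Normal otherwise."""
    if df is None:
        return float(stats.norm.ppf(1.0 - alpha / 2.0))
    return float(stats.t.ppf(1.0 - alpha / 2.0, df))


n_psu, n_strata = 6, 2           # hypothetical small-PSU design
df_survey = n_psu - n_strata     # 4 degrees of freedom
se = 0.10                        # hypothetical standard error

z_half = critical_value(0.05) * se             # Normal-theory CI half-width
t_half = critical_value(0.05, df_survey) * se  # t-based CI half-width

print(f"Normal: {z_half:.3f}  t(df={df_survey}): {t_half:.3f}")
```

With 4 degrees of freedom the 95% half-width grows from about 0.196 to
about 0.278 for the same SE.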

P1b — reject non-pweight SurveyDesigns: HAD's weighted kernel composition
``W_combined = k((D-d̲)/h) · w`` implements inverse-probability weighting
semantics. ``SurveyDesign(weight_type="aweight"|"fweight")`` is now
rejected with ``NotImplementedError`` at fit-time — aweight (analytic)
implies a different inferential target (weighted regression, not
design-based inference), and fweight (frequency) implies observation
replication. Neither has been derived for HAD's continuous-dose path;
deferred as a follow-up.
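
The front-door guard can be sketched as below. ``SurveyDesignStub`` and
``require_pweight`` are hypothetical stand-ins for illustration; only the
``weight_type`` attribute and the ``NotImplementedError`` behaviour come
from this change.

```python
from dataclasses import dataclass


@dataclass
class SurveyDesignStub:
    """Hypothetical stand-in for SurveyDesign; only the fields used here."""
    weights: str
    weight_type: str = "pweight"


def require_pweight(survey):
    """Reject any weight type other than sampling (probability) weights."""
    weight_type = getattr(survey, "weight_type", "pweight")
    if weight_type != "pweight":
        raise NotImplementedError(
            f"weight_type={weight_type!r} is not supported; "
            "only 'pweight' is implemented."
        )


require_pweight(SurveyDesignStub(weights="w"))  # pweight passes silently

rejected = []
for wt in ("aweight", "fweight"):
    try:
        require_pweight(SurveyDesignStub(weights="w", weight_type=wt))
    except NotImplementedError:
        rejected.append(wt)
```

Raising before the design is resolved keeps an unsupported weight type
from being silently reinterpreted as a sampling weight.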

P2 — test coverage: six new ``TestHADSurvey`` tests locking in the
three fixes above:
  - ``test_survey_if_uses_bias_corrected_scale``: white-box
    ``sum(IF^2) == V_Y_bc[0,0]`` under nonlinear DGP where
    V_Y_bc ≠ V_Y_cl (teeth).
  - ``test_survey_df_widens_ci_vs_normal`` +
    ``test_survey_df_threaded_into_inference_via_t_distribution``:
    assert df_survey surfaces in metadata and produces t-CI half-width
    matching ``t_crit(df) · se`` exactly.
  - ``test_survey_aweight_raises_not_implemented`` +
    ``test_survey_fweight_raises_not_implemented``: front-door rejection.
  - ``test_survey_no_psu_no_strata_se_matches_weights_hc1``: SRS
    equivalence: survey SE within 10-15% of weights-shortcut SE (the
    tolerance allows for the n/(n-1)-style HC1 correction), ruling out
    a silent V_Y_cl-vs-V_Y_bc mismatch.

P3 — docstring refresh: the stale fit() docstring calling survey/weights
"Reserved for a follow-up PR" is replaced with the actual Phase 4.5 A
contract (pweight-only, constant-within-unit, replicate-weight deferred,
mass-point/event-study/pretests deferred). ``survey_metadata`` docstring
on ``HeterogeneousAdoptionDiDResults`` now enumerates the actual dict
keys and their semantics.

All 348 tests (across test_had, test_nprobust_port,
test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the
slow MC suite) pass after the cascade. Ruff clean on modified files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/diff_diff/_nprobust_port.py b/diff_diff/_nprobust_port.py
@@ -1005,15 +1005,19 @@ class LprobustResult:
     V_Y_cl: np.ndarray
     V_Y_bc: np.ndarray
     influence_function: Optional[np.ndarray] = None
-    """Per-observation influence function of the CLASSICAL intercept
-    ``mu_hat`` (deriv=0 case), aligned with the ORIGINAL x ordering. Shape
-    ``(N,)``. Set only when ``return_influence=True``; ``None`` otherwise.
-    Observations outside the active kernel window have IF=0.
+    """Per-observation influence function of the BIAS-CORRECTED point
+    estimate at ``deriv`` (``tau_bc`` for the deriv=0 case), aligned
+    with the ORIGINAL x ordering. Shape ``(N,)``. Set only when
+    ``return_influence=True``; ``None`` otherwise. Observations outside
+    the active kernel window have IF=0.
 
-    Surface used by estimator-level Binder (1983) TSL composition for
-    survey-design variance (Phase 4.5 HAD continuous path). The variance
-    check ``sum(IF^2) == V_Y_cl[0, 0]`` (up to BLAS ordering) holds when
-    weights are uniform and cluster=None."""
+    Derived from ``Q_q`` + ``res_b``, so the variance self-check is
+    ``sum(IF^2) == V_Y_bc[deriv, deriv]`` (up to BLAS ordering) under
+    unclustered HC0. Used by estimator-level Binder (1983) TSL
+    composition for survey-design variance on HAD's continuous path — the
+    bias-corrected scale matches the ATT which itself uses ``tau_bc``.
+    Using the classical IF here would silently under-estimate survey SE
+    by ignoring the bias-correction variance inflation."""
 
 
 def lprobust(
@@ -1352,23 +1356,30 @@ def lprobust(
     se_cl = float(np.sqrt((deriv_fact**2) * V_Y_cl[deriv, deriv]))
     se_rb = float(np.sqrt((deriv_fact**2) * V_Y_bc[deriv, deriv]))
 
-    # --- Per-observation influence function for the classical point
+    # --- Per-observation influence function for the BIAS-CORRECTED point
     # estimate at ``deriv`` (Phase 4.5 survey composition).
-    # Weighted OLS IF decomposition:
-    #   beta_hat - beta  =  invG_p @ sum_g [(R_p[g] * W_h[g]) * res[g]]
-    # so psi_g = invG_p[deriv, :] @ (R_p[g] * W_h[g]) * res[g] scaled by
-    # deriv_fact. Observations outside the active kernel window have
-    # W_h[g]=0 and contribute IF=0. Length-N, aligned with the ORIGINAL
-    # x ordering (inverse-permuted when vce="nn" sorts). ---
+    # Aligned with ``V_Y_bc`` (NOT ``V_Y_cl``) so survey-composed variance
+    # through ``compute_survey_if_variance`` targets the same estimator
+    # scale that HAD's beta-scale ATT uses (``tau_bc``-based). Using the
+    # classical IF here would under-estimate the variance by ignoring the
+    # bias-correction inflation, producing a silently wrong survey SE.
+    #
+    # Bias-corrected WLS IF decomposition (mirrors the ``V_Y_bc`` sandwich
+    # inner at lprobust.R:244): beta_bc - beta  =  invG_p @ sum_g [ Q_q[g] · res_b[g] ],
+    # so psi_g = deriv_fact · invG_p[deriv, :] · Q_q[g, :] · res_b[g].
+    # The self-check ``sum(psi^2) == V_Y_bc[deriv, deriv]`` holds under
+    # unclustered HC0; under clustering, compute_survey_if_variance
+    # aggregates by PSU, which is what the survey path wants.
+    # Observations outside the active window have Q_q[g, :]=0 row and
+    # contribute IF=0. Length-N, aligned with ORIGINAL x ordering (inverse-
+    # permuted when vce="nn" sorts). ---
     influence_function: Optional[np.ndarray] = None
     if return_influence:
-        # For active-window observations only.
-        row_deriv = invG_p[deriv, :]  # (p+1,)
-        # (R_p * W_h)[g, :] has shape (p+1,); einsum for clarity.
-        coeff_active = (R_p_W_h @ row_deriv).ravel()  # (eN,)
-        # res_h is shape (eN, 1); squeeze + multiply.
-        res_flat = np.asarray(res_h).ravel()
-        if_active = deriv_fact * coeff_active * res_flat  # (eN,)
+        # Bias-corrected IF using Q_q + res_b (active-window only).
+        row_deriv_bc = invG_p[deriv, :]  # (p+1,)
+        coeff_active_bc = (Q_q @ row_deriv_bc).ravel()  # (eN,)
+        res_b_flat = np.asarray(res_b).ravel()
+        if_active = deriv_fact * coeff_active_bc * res_b_flat  # (eN,)
         # Map back to full N; zeros for obs outside window.
         if_full_sorted = np.zeros(N, dtype=np.float64)
         if_full_sorted[ind] = if_active
diff --git a/diff_diff/had.py b/diff_diff/had.py
@@ -277,9 +277,15 @@ class HeterogeneousAdoptionDiDResults:
     cluster_name : str or None
         Column name of the cluster variable on the mass-point path when
         cluster-robust SE is requested. ``None`` otherwise.
-    survey_metadata : object or None
-        Always ``None`` in Phase 2a. Field shape kept for future-compat
-        with a planned survey integration PR.
+    survey_metadata : dict or None
+        ``None`` when ``fit()`` was called without ``survey=`` or
+        ``weights=``. Under weighted fits (continuous-dose paths only,
+        per Phase 4.5 A), carries a dict with keys ``method`` ('pweight'
+        vs 'survey_binder_tsl'), ``source``, ``variance_formula``,
+        ``n_units_weighted``, ``weight_sum``, ``effective_sample_size``,
+        ``n_strata`` / ``n_psu`` (int or None), and ``df_survey`` (int
+        or None — the survey t-distribution degrees of freedom, routed
+        through inference under the SurveyDesign path only).
     bandwidth_diagnostics : BandwidthResult or None
         Full Phase 1b MSE-DPI selector output on the continuous paths
         (when bandwidths were auto-selected). ``None`` on the mass-point
@@ -2262,10 +2268,28 @@ def fit(
             to a follow-up PR. Staggered-timing panels are auto-filtered
             to the last-treatment cohort with a ``UserWarning``.
         survey : SurveyDesign or None
-            Reserved for a follow-up survey-integration PR. Must be
-            ``None`` in Phase 2a.
+            Survey design (sampling weights + optional strata / PSU / FPC)
+            for design-based inference on the two continuous-dose paths
+            (``continuous_at_zero``, ``continuous_near_d_lower``). Passes
+            through :func:`compute_survey_if_variance` (Binder 1983 TSL)
+            for the SE; weights propagate pointwise into the lprobust
+            kernel composition. Only ``weight_type="pweight"`` is
+            supported in Phase 4.5 A — ``aweight`` / ``fweight`` raise
+            ``NotImplementedError``. Survey design columns (strata / PSU /
+            FPC) must be constant within unit (sampling-unit-level
+            assignment); within-unit variance raises ``ValueError``.
+            Replicate-weight designs raise ``NotImplementedError``
+            (Phase 4.5 C). ``design="mass_point"`` and
+            ``aggregate="event_study"`` raise ``NotImplementedError`` on
+            survey/weights (Phase 4.5 B).
         weights : np.ndarray or None
-            Reserved for a follow-up PR. Must be ``None`` in Phase 2a.
+            Per-row sampling weights as a lightweight shortcut equivalent
+            to ``survey=SurveyDesign(weights=<col>)``. Produces the same
+            ATT; the SE uses lprobust's weighted-robust CCT-2014 formula
+            rather than Binder-TSL (no PSU/strata composition). Mutually
+            exclusive with ``survey=`` — passing both raises
+            ``ValueError``. Must be constant within each unit (same
+            invariant as ``survey=``).
 
         Returns
         -------
@@ -2363,6 +2387,25 @@ def fit(
                     "survey=SurveyDesign(weights='<col>', ...) with a "
                     "per-row weight column."
                 )
+            # HAD's weighted local-linear treats ``weights`` as sampling
+            # (probability) weights: the kernel-composition formula
+            # ``W_combined = k((D-d̲)/h) · w`` is the inverse-probability
+            # weighting convention. Frequency weights (``fweight``)
+            # would imply replicating observations, and analytic weights
+            # (``aweight``, inverse-variance) would imply a different
+            # inferential target. Reject those up front rather than
+            # silently reinterpreting.
+            weight_type = getattr(survey, "weight_type", "pweight")
+            if weight_type != "pweight":
+                raise NotImplementedError(
+                    f"survey=SurveyDesign(weight_type={weight_type!r}) is "
+                    f"not supported on HeterogeneousAdoptionDiD's "
+                    f"continuous path. Only ``weight_type='pweight'`` "
+                    f"(sampling / inverse-probability weights) is "
+                    f"implemented in Phase 4.5 A. Frequency weights "
+                    f"(fweight) and analytic weights (aweight) would "
+                    f"imply different estimands and are not yet derived."
+                )
             # Resolve the SurveyDesign against the long-panel data. This
             # validates column names, applies pweight/aweight normalization
             # to mean=1, and extracts numpy arrays for all design columns.
@@ -2665,7 +2708,17 @@ def fit(
             raise ValueError(f"Internal error: unhandled design={resolved_design!r}.")
 
         # ---- Route all inference fields through safe_inference ----
-        t_stat, p_value, conf_int = safe_inference(att, se, alpha=float(self.alpha))
+        # Survey path: use t-distribution with ``df_survey = n_psu -
+        # n_strata`` (or replicate-QR rank − 1) so small-PSU designs
+        # don't get Normal-theory inference that overstates precision.
+        # Non-survey path (``weights=`` shortcut or unweighted): use
+        # the existing Normal-theory default.
+        df_infer: Optional[int] = None
+        if resolved_survey_unit is not None:
+            df_infer = resolved_survey_unit.df_survey
+        t_stat, p_value, conf_int = safe_inference(
+            att, se, alpha=float(self.alpha), df=df_infer
+        )
 
         # Build survey metadata when weights/survey were supplied. When a
         # ResolvedSurveyDesign is available (full survey= path), surface
@@ -2681,12 +2734,14 @@ def fit(
                 source = "SurveyDesign"
                 n_strata = int(resolved_survey_unit.n_strata)
                 n_psu = int(resolved_survey_unit.n_psu)
+                df_survey_meta: Optional[int] = resolved_survey_unit.df_survey
             else:
                 method = "pweight"
                 variance_formula = "weighted-robust (CCT 2014)"
                 source = "weights_arr"
                 n_strata = None
                 n_psu = None
+                df_survey_meta = None
             survey_metadata = {
                 "method": method,
                 "source": source,
@@ -2696,6 +2751,7 @@ def fit(
                 "effective_sample_size": float(ess),
                 "n_strata": n_strata,
                 "n_psu": n_psu,
+                "df_survey": df_survey_meta,
             }
 
         return HeterogeneousAdoptionDiDResults(
diff --git a/tests/test_had.py b/tests/test_had.py