Address third round of CI review findings on PR #318

igerber · claude · igerber · commit 959f84e80d60 · 2026-04-18T16:47:04.000-04:00
P0 fix:

* **Alpha override was inference-contract-blind.** Previously, whenever
  the caller's ``alpha`` differed from the result's, ``BusinessReport``
  (BR) recomputed the displayed CI via
  ``safe_inference(att, se, alpha=alpha)`` with no
  ``df`` and no bootstrap handling — silently discarding the
  ``bootstrap_distribution`` / finite-df inference contracts used by
  TROP, ContinuousDiD, dCDH-bootstrap, survey fits, SDiD jackknife,
  etc. BR now detects bootstrap-backed (``inference_method='bootstrap'``
  or non-None ``bootstrap_distribution`` or ``variance_method in
  {bootstrap, jackknife, placebo}``) and finite-df (``df_survey > 0``)
  inference paths and preserves the fitted CI at its native level in
  those cases, recording an informational caveat noting that the
  caller's alpha still drives phrasing but the native interval is
  shown. Regressions in ``TestAlphaOverrideBootstrapAndFiniteDF``
  cover both the bootstrap and finite-df survey paths.
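
The detection rule above can be sketched as a standalone predicate. This is
only an illustration: ``preserves_native_ci`` is a hypothetical helper name
rather than a function in the library, and the stubs stand in for real result
objects. The attribute lookups follow the ones named in the bullet.

```python
from types import SimpleNamespace
from typing import Any


def preserves_native_ci(result: Any) -> bool:
    """Return True when the fitted CI should be kept at its native level.

    Hypothetical helper for illustration; not part of the library.
    """
    inference_method = getattr(result, "inference_method", "analytical")
    has_bootstrap_dist = getattr(result, "bootstrap_distribution", None) is not None
    variance_method = getattr(result, "variance_method", None)
    # df_survey may sit on the result itself or on a nested survey_metadata.
    df_survey = getattr(
        result,
        "df_survey",
        getattr(getattr(result, "survey_metadata", None), "df_survey", None),
    )

    bootstrap_like = (
        inference_method == "bootstrap"
        or has_bootstrap_dist
        or variance_method in {"bootstrap", "jackknife", "placebo"}
    )
    finite_df = isinstance(df_survey, (int, float)) and df_survey > 0
    return bootstrap_like or finite_df


# Stub results (assumed shapes, for demonstration only).
analytic = SimpleNamespace(inference_method="analytical")
jackknife = SimpleNamespace(variance_method="jackknife")
survey = SimpleNamespace(survey_metadata=SimpleNamespace(df_survey=8))

print(preserves_native_ci(analytic))   # False
print(preserves_native_ci(jackknife))  # True
print(preserves_native_ci(survey))     # True
```

Only the first stub would go through the ``safe_inference`` recomputation
path; the other two keep the stored interval and its native level.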

P1 fixes:

* **``pretrends_power`` over-broad applicability.** The matrix had
  marked the check applicable for ImputationDiD, TwoStage, Stacked,
  EfficientDiD, StaggeredTripleDiff, Wooldridge, and dCDH, but
  ``compute_pretrends_power`` only has adapters for MultiPeriod, CS,
  and SA; the other families were landing in ``error``. Narrowed the
  applicability matrix to match the real helper support.

* **``sensitivity`` over-broad applicability.** HonestDiD only adapts
  MultiPeriod, CS, and dCDH (via ``placebo_event_study``). The matrix
  had also included SA / Imputation / Stacked / EfficientDiD /
  StaggeredTripleDiff / Wooldridge. Narrowed to the supported set. The
  dCDH-specific instance gate now checks ``placebo_event_study`` rather
  than the generic ``event_study_effects`` so HonestDiD's dCDH branch
  is reached instead of the generic event-study collector.

* **``n_obs == 0`` reference-marker filter.** Stacked / TwoStage /
  Imputation emit synthetic reference-period markers using ``n_obs=0``
  rather than CS / SA's ``n_groups=0`` flag. ``_collect_pre_period_coefs``
  now drops rows with either sentinel so the Bonferroni denominator
  and joint-Wald index are not inflated by non-informative rows.
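
A minimal sketch of the two-sentinel filter from the last bullet, assuming
event-study rows are plain dicts. ``drop_reference_markers`` and the sample
rows are hypothetical and serve only to show why the Bonferroni denominator
changes.

```python
from typing import Any, Dict, List


def drop_reference_markers(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only rows not flagged by either sentinel (hypothetical helper)."""
    return [
        e for e in entries
        if e.get("n_groups") != 0 and e.get("n_obs") != 0
    ]


nan = float("nan")
rows = [
    {"effect": 0.10, "se": 0.05, "n_groups": 3},   # informative row
    {"effect": 0.0, "se": nan, "n_groups": 0},     # CS / SA reference marker
    {"effect": -0.02, "se": 0.04, "n_obs": 250},   # informative row
    {"effect": 0.0, "se": nan, "n_obs": 0},        # Stacked / TwoStage / Imputation marker
]

kept = drop_reference_markers(rows)
alpha = 0.05
# The denominator counts informative pre-period rows only.
bonferroni_threshold = alpha / len(kept)
print(len(kept), bonferroni_threshold)  # 2 0.025
```

Without the ``n_obs`` check, three rows would survive and the per-test
threshold would shrink to ``alpha / 3``.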

P2 fix:

* **``placebo`` schema inconsistency.** ``REPORTING.md`` said
  ``placebo`` is always rendered as ``{"status": "skipped"}`` in MVP,
  but no result type had ``placebo`` in its applicability frozenset, so
  the implementation fell through to ``"not_applicable"``. Now every
  ``DiagnosticReport.to_dict()`` call returns ``placebo`` with ``status="skipped"``
  regardless of estimator, matching the stated contract.
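
The stable-schema contract can be illustrated with a trimmed-down sketch.
The table contents, the helper names ``compute_checks`` and
``placebo_section``, and the ``reason`` key are assumptions made for the
example, not the library's actual structures.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical, trimmed-down applicability table for illustration only.
APPLICABILITY: Dict[str, FrozenSet[str]] = {
    "DiDResults": frozenset({"parallel_trends", "design_effect"}),
    "ContinuousDiDResults": frozenset({"design_effect", "heterogeneity"}),
}

PLACEBO_REASON = "Placebo battery runs on opt-in only; not yet implemented in MVP."


def compute_checks(result_type: str) -> Tuple[Set[str], Dict[str, str]]:
    """Return (applicable checks, skipped checks with reasons)."""
    applicable = set(APPLICABILITY.get(result_type, frozenset()))
    skipped: Dict[str, str] = {}
    # Reserve placebo unconditionally so the schema shape is stable.
    skipped.setdefault("placebo", PLACEBO_REASON)
    return applicable, skipped


def placebo_section(result_type: str) -> Dict[str, str]:
    """Render the placebo entry the way a report schema might."""
    _, skipped = compute_checks(result_type)
    if "placebo" in skipped:
        return {"status": "skipped", "reason": skipped["placebo"]}
    return {"status": "not_applicable"}


for name in ("DiDResults", "ContinuousDiDResults", "SomeFutureResults"):
    print(name, placebo_section(name)["status"])
```

Because the reservation no longer depends on the per-type frozenset, every
result type reports ``skipped``, including ones missing from the table.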

Regression tests for each finding added in
``TestNarrowedApplicabilityAndPlaceboSchema`` and
``TestAlphaOverrideBootstrapAndFiniteDF``. 146 targeted tests pass;
black, ruff, mypy clean on the new modules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/diff_diff/business_report.py b/diff_diff/business_report.py
@@ -331,13 +331,24 @@ def _extract_headline(self, dr_schema: Optional[Dict[str, Any]]) -> Dict[str, An
             _name, att, se, p, ci, result_alpha = extracted
 
         # If the caller asked for a different alpha than the result was fit
-        # at, recompute the CI from (att, se) using ``safe_inference`` so the
-        # labeled CI level matches the interval actually shown. Without this
-        # the stored interval (e.g. 95%) would be relabeled to the caller's
-        # level (e.g. 90%) — the documented single-knob contract requires
-        # them to agree. SE is a scale parameter independent of alpha, so
-        # recomputation is safe; the result's original t-stat / p-value do
-        # not change either.
+        # at, the displayed CI needs to match the label. Naive recomputation
+        # via ``safe_inference(att, se, alpha=alpha)`` would use a normal
+        # distribution with no df, which silently discards finite-df /
+        # bootstrap / percentile inference contracts used by TROP,
+        # ContinuousDiD, dCDH-bootstrap, survey fits, etc. Rules:
+        #   1. If the result has an analytic inference contract we can
+        #      reproduce (no bootstrap distribution, no finite df we don't
+        #      know about), recompute via ``safe_inference`` — this covers
+        #      the common case of normal-approximation CIs.
+        #   2. Otherwise (bootstrap / percentile / finite-df / survey d.f.),
+        #      preserve the fitted CI and its native level so the displayed
+        #      interval keeps matching the stored p-value and inference
+        #      contract. The ``ci_level`` field will reflect the result's
+        #      own alpha, and a caveat is appended below noting that the
+        #      caller's alpha drives phrasing but the native interval is
+        #      shown.
+        alpha_was_honored = True
+        alpha_override_caveat: Optional[str] = None
         if (
             result_alpha is not None
             and not np.isclose(alpha, result_alpha)
@@ -346,13 +357,40 @@ def _extract_headline(self, dr_schema: Optional[Dict[str, Any]]) -> Dict[str, An
             and np.isfinite(att)
             and np.isfinite(se)
         ):
-            from diff_diff.utils import safe_inference
+            inference_method = getattr(r, "inference_method", "analytical")
+            has_bootstrap_dist = getattr(r, "bootstrap_distribution", None) is not None
+            df_survey = getattr(
+                r, "df_survey", getattr(getattr(r, "survey_metadata", None), "df_survey", None)
+            )
+            variance_method = getattr(r, "variance_method", None)
+
+            bootstrap_like = (
+                inference_method == "bootstrap"
+                or has_bootstrap_dist
+                or variance_method in {"bootstrap", "jackknife", "placebo"}
+            )
+            finite_df = isinstance(df_survey, (int, float)) and df_survey > 0
+
+            if bootstrap_like or finite_df:
+                # Preserve the fitted CI at its native level.
+                alpha_was_honored = False
+                alpha = float(result_alpha)
+                alpha_override_caveat = (
+                    f"Requested alpha was not honored for the confidence "
+                    f"interval because this fit uses "
+                    f"{'bootstrap' if bootstrap_like else 'finite-df'} "
+                    f"inference; the displayed CI remains at the fit's "
+                    f"native level ({int(round((1.0 - result_alpha) * 100))}%). "
+                    f"The significance phrasing still uses the requested alpha."
+                )
+            else:
+                from diff_diff.utils import safe_inference
 
-            _t, _p, recomputed_ci = safe_inference(att, se, alpha=alpha)
-            if recomputed_ci is not None and all(
-                x is not None and np.isfinite(x) for x in recomputed_ci
-            ):
-                ci = [float(recomputed_ci[0]), float(recomputed_ci[1])]
+                _t, _p, recomputed_ci = safe_inference(att, se, alpha=alpha)
+                if recomputed_ci is not None and all(
+                    x is not None and np.isfinite(x) for x in recomputed_ci
+                ):
+                    ci = [float(recomputed_ci[0]), float(recomputed_ci[1])]
 
         unit = self._context.outcome_unit
         unit_kind = _UNIT_KINDS.get(unit.lower() if unit else "", "unknown")
@@ -382,6 +420,8 @@ def _extract_headline(self, dr_schema: Optional[Dict[str, Any]]) -> Dict[str, An
             "se": se,
             "ci_lower": ci[0] if ci else None,
             "ci_upper": ci[1] if ci else None,
+            "alpha_was_honored": alpha_was_honored,
+            "alpha_override_caveat": alpha_override_caveat,
             "ci_level": ci_level,
             "p_value": p,
             "is_significant": is_significant,
@@ -623,6 +663,17 @@ def _build_caveats(
             }
         )
 
+    # Alpha override could not be honored (bootstrap / finite-df inference).
+    alpha_override_msg = headline.get("alpha_override_caveat")
+    if isinstance(alpha_override_msg, str) and alpha_override_msg:
+        caveats.append(
+            {
+                "severity": "info",
+                "topic": "alpha_override_preserved",
+                "message": alpha_override_msg,
+            }
+        )
+
     # Near-threshold p-value.
     if headline.get("near_significance_threshold"):
         caveats.append(
diff --git a/diff_diff/diagnostic_report.py b/diff_diff/diagnostic_report.py
@@ -69,6 +69,18 @@
 # requires updating both this table and ``_PT_METHOD`` below; the
 # applicability-matrix test parametrized over all result types serves as the
 # regression guard.
+# ``pretrends_power`` is restricted to the result families for which
+# ``compute_pretrends_power`` has an explicit adapter — see
+# ``diff_diff/pretrends.py`` around the result-type dispatch. Expanding
+# beyond this set (Imputation / Stacked / TwoStage / EfficientDiD /
+# StaggeredTripleDiff / Wooldridge / dCDH) would cause the helper to
+# raise ``TypeError("Unsupported results type ...")`` and mark the check
+# as ``error``, so the narrower set is the right contract.
+#
+# ``sensitivity`` is restricted to families with a ``HonestDiD``
+# adapter: MultiPeriod, CS, dCDH (via ``placebo_event_study``). SDiD
+# and TROP use their own native paths (``estimator_native``) instead
+# of HonestDiD.
 _APPLICABILITY: Dict[str, FrozenSet[str]] = {
     "DiDResults": frozenset({"parallel_trends", "design_effect"}),
     "MultiPeriodDiDResults": frozenset(
@@ -89,7 +101,6 @@
         {
             "parallel_trends",
             "pretrends_power",
-            "sensitivity",
             "bacon",
             "design_effect",
             "heterogeneity",
@@ -98,8 +109,6 @@
     "ImputationDiDResults": frozenset(
         {
             "parallel_trends",
-            "pretrends_power",
-            "sensitivity",
             "bacon",
             "design_effect",
             "heterogeneity",
@@ -108,7 +117,6 @@
     "TwoStageDiDResults": frozenset(
         {
             "parallel_trends",
-            "pretrends_power",
             "bacon",
             "design_effect",
             "heterogeneity",
@@ -117,8 +125,6 @@
     "StackedDiDResults": frozenset(
         {
             "parallel_trends",
-            "pretrends_power",
-            "sensitivity",
             "bacon",
             "design_effect",
             "heterogeneity",
@@ -139,8 +145,6 @@
     "EfficientDiDResults": frozenset(
         {
             "parallel_trends",
-            "pretrends_power",
-            "sensitivity",
             "bacon",
             "design_effect",
             "heterogeneity",
@@ -149,14 +153,10 @@
     ),
     "ContinuousDiDResults": frozenset({"design_effect", "heterogeneity"}),
     "TripleDifferenceResults": frozenset({"design_effect", "epv"}),
-    "StaggeredTripleDiffResults": frozenset(
-        {"parallel_trends", "pretrends_power", "sensitivity", "design_effect"}
-    ),
+    "StaggeredTripleDiffResults": frozenset({"parallel_trends", "design_effect"}),
     "WooldridgeDiDResults": frozenset(
         {
             "parallel_trends",
-            "pretrends_power",
-            "sensitivity",
             "bacon",
             "design_effect",
             "heterogeneity",
@@ -165,7 +165,6 @@
     "ChaisemartinDHaultfoeuilleResults": frozenset(
         {
             "parallel_trends",
-            "pretrends_power",
             "sensitivity",
             "bacon",
             "design_effect",
@@ -432,13 +431,15 @@ def _compute_applicable_checks(self) -> Tuple[set, Dict[str, str]]:
                 continue
             applicable.add(check)
 
-        # Placebo is always skipped in MVP (opt-in path deferred)
-        if "placebo" in type_level and "placebo" not in applicable:
-            skipped.setdefault(
-                "placebo",
-                "Placebo battery runs on opt-in only; not yet implemented in MVP. "
-                "Reserved in the schema for forward compatibility.",
-            )
+        # Placebo is reserved for every result type in MVP so the schema
+        # shape is stable: ``schema["placebo"]["status"] == "skipped"``
+        # always holds regardless of estimator. The opt-in execution path
+        # is deferred to a follow-up; ``REPORTING.md`` documents this.
+        skipped.setdefault(
+            "placebo",
+            "Placebo battery runs on opt-in only; not yet implemented in MVP. "
+            "Reserved in the schema for forward compatibility.",
+        )
 
         return applicable, skipped
 
@@ -499,11 +500,22 @@ def _instance_skip_reason(self, check: str) -> Optional[str]:
             # Precomputed sensitivity always unlocks this check.
             if "sensitivity" in self._precomputed:
                 return None
-            # ``HonestDiD.sensitivity_analysis`` handles CS / SA /
-            # ImputationDiD internally via ``event_study_effects`` +
-            # ``event_study_vcov`` (or per-SE diagonal fallback), so we
-            # accept any of: top-level vcov, event_study_vcov, or a
-            # populated event_study_effects surface.
+            # dCDH uses ``placebo_event_study`` as its pre-period surface,
+            # which HonestDiD consumes via a dedicated branch. Accept the
+            # fit when that attribute is populated.
+            if name == "ChaisemartinDHaultfoeuilleResults":
+                pes = getattr(r, "placebo_event_study", None)
+                if pes is None:
+                    return (
+                        "HonestDiD on dCDH requires results.placebo_event_study "
+                        "(re-fit with a placebo-producing configuration)."
+                    )
+                return None
+            # MultiPeriod / CS path: ``HonestDiD.sensitivity_analysis``
+            # consumes ``event_study_effects`` plus either ``vcov`` +
+            # ``interaction_indices`` (MultiPeriod) or ``event_study_vcov``
+            # + ``event_study_vcov_index`` (CS), with a per-SE diagonal
+            # fallback otherwise.
             has_vcov = getattr(r, "vcov", None) is not None
             has_event_vcov = getattr(r, "event_study_vcov", None) is not None
             has_event_es = getattr(r, "event_study_effects", None) is not None
@@ -1728,12 +1740,14 @@ def _collect_pre_period_coefs(results: Any) -> List[Tuple[Any, float, float, Opt
                 continue
             if not isinstance(entry, dict):
                 continue
-            # Drop universal-base reference markers. See
-            # ``staggered_aggregation.py`` around the reference-period
-            # injection: ``n_groups == 0`` flags the synthetic marker row
-            # with NaN SE and p-value.
-            n_groups = entry.get("n_groups")
-            if n_groups is not None and n_groups == 0:
+            # Drop universal-base reference markers. Different estimator
+            # aggregations use different flags for the synthetic marker row
+            # (all of which carry NaN SE and p-value):
+            #   * CS / SA: ``n_groups == 0``
+            #   * Stacked / TwoStage / Imputation: ``n_obs == 0``
+            # Treat either as a disqualifier so the Bonferroni denominator
+            # and joint-Wald index are not inflated by non-informative rows.
+            if entry.get("n_groups") == 0 or entry.get("n_obs") == 0:
                 continue
             eff = entry.get("effect")
             se = entry.get("se")
diff --git a/tests/test_business_report.py b/tests/test_business_report.py
@@ -520,6 +520,92 @@ def test_ci_bounds_recomputed_when_alpha_differs_from_result(self, event_study_f
         assert h90["ci_level"] == 90
 
 
+class TestAlphaOverrideBootstrapAndFiniteDF:
+    """Regression for the P0 finding that ``safe_inference(att, se, alpha)``
+    silently discards bootstrap / finite-df inference contracts on results
+    that use them (TROP, ContinuousDiD, dCDH-bootstrap, survey fits).
+
+    Rule: when the caller's alpha differs from the fit's alpha AND the
+    result's inference contract is bootstrap-backed or uses finite df,
+    BR preserves the fitted CI at the fit's native level rather than
+    recomputing with a normal approximation. The override is recorded as
+    an informational caveat.
+    """
+
+    class _BootstrapResultStub:
+        """Minimal stub shaped like a bootstrap-inferred result."""
+
+        def __init__(self):
+            self.att = 1.0
+            self.se = 0.5
+            self.p_value = 0.04
+            # Original 95% CI from the bootstrap distribution.
+            self.conf_int = (0.05, 1.95)
+            self.alpha = 0.05
+            self.n_obs = 100
+            self.n_treated = 40
+            self.n_control = 60
+            self.inference_method = "bootstrap"
+            self.survey_metadata = None
+            # Presence of a bootstrap distribution triggers the preserve path.
+            import numpy as np
+
+            self.bootstrap_distribution = np.random.default_rng(0).normal(1.0, 0.5, 200)
+
+    def test_bootstrap_fit_preserves_fitted_ci_on_alpha_mismatch(self):
+        stub = self._BootstrapResultStub()
+        br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False)
+        h = br.to_dict()["headline"]
+        # Native fit was at 95%; requested 90% should NOT be reflected in the label.
+        assert h["ci_level"] == 95, (
+            "Bootstrap fit must preserve fitted CI level (95) when caller "
+            f"requests a different alpha; got {h['ci_level']}"
+        )
+        # Bounds should match the stored bootstrap interval, not a normal-z
+        # recomputation at 90%.
+        assert h["ci_lower"] == pytest.approx(0.05)
+        assert h["ci_upper"] == pytest.approx(1.95)
+        # A caveat records the override.
+        caveat_topics = {c.get("topic") for c in br.caveats()}
+        assert "alpha_override_preserved" in caveat_topics
+
+    class _FiniteDfSurveyStub:
+        def __init__(self):
+            from types import SimpleNamespace
+
+            self.att = 2.0
+            self.se = 0.4
+            self.p_value = 0.001
+            self.conf_int = (1.22, 2.78)  # stored 95% CI (illustrative values)
+            self.alpha = 0.05
+            self.n_obs = 120
+            self.n_treated = 50
+            self.n_control = 70
+            self.inference_method = "analytical"
+            # Finite survey d.f. triggers the preserve path — normal approx
+            # would widen / narrow incorrectly.
+            self.survey_metadata = SimpleNamespace(
+                weight_type="pweight",
+                effective_n=110.0,
+                design_effect=1.2,
+                sum_weights=120.0,
+                n_strata=4,
+                n_psu=12,
+                df_survey=8,
+                replicate_method=None,
+            )
+
+    def test_finite_df_fit_preserves_fitted_ci_on_alpha_mismatch(self):
+        stub = self._FiniteDfSurveyStub()
+        br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False)
+        h = br.to_dict()["headline"]
+        assert h["ci_level"] == 95
+        assert h["ci_lower"] == pytest.approx(1.22)
+        assert h["ci_upper"] == pytest.approx(2.78)
+        caveat_topics = {c.get("topic") for c in br.caveats()}
+        assert "alpha_override_preserved" in caveat_topics
+
+
 class TestFullReportSingleM:
     """Regression: ``full_report()`` must not claim full-grid robustness for a
     single-M HonestDiDResults passthrough. The summary path was fixed earlier;
diff --git a/tests/test_diagnostic_report.py b/tests/test_diagnostic_report.py