Skip to content

Commit 959f84e

Browse files
igerberclaude
andcommitted
Address third round of CI review findings on PR #318
P0 fix: * **Alpha override was inference-contract-blind.** Previously, whenever the caller's ``alpha`` differed from the result's, BR recomputed the displayed CI via ``safe_inference(att, se, alpha=alpha)`` with no ``df`` and no bootstrap handling — silently discarding the ``bootstrap_distribution`` / finite-df inference contracts used by TROP, ContinuousDiD, dCDH-bootstrap, survey fits, SDiD jackknife, etc. BR now detects bootstrap-backed (``inference_method='bootstrap'`` or non-None ``bootstrap_distribution`` or ``variance_method in {bootstrap, jackknife, placebo}``) and finite-df (``df_survey > 0``) inference paths and preserves the fitted CI at its native level in those cases, recording an informational caveat noting that the caller's alpha still drives phrasing but the native interval is shown. Regressions in ``TestAlphaOverrideBootstrapAndFiniteDF`` cover both the bootstrap and finite-df survey paths. P1 fixes: * **``pretrends_power`` over-broad applicability.** The matrix had marked the check applicable for ImputationDiD, TwoStage, Stacked, EfficientDiD, StaggeredTripleDiff, Wooldridge, and dCDH, but ``compute_pretrends_power`` only has adapters for MultiPeriod, CS, and SA; the other families were landing in ``error``. Narrowed the applicability matrix to match the real helper support. * **``sensitivity`` over-broad applicability.** HonestDiD only adapts MultiPeriod, CS, and dCDH (via ``placebo_event_study``). The matrix had also included SA / Imputation / Stacked / EfficientDiD / StaggeredTripleDiff / Wooldridge. Narrowed to the supported set. The dCDH-specific instance gate now checks ``placebo_event_study`` rather than the generic ``event_study_effects`` so HonestDiD's dCDH branch is reached instead of the generic event-study collector. * **``n_obs == 0`` reference-marker filter.** Stacked / TwoStage / Imputation emit synthetic reference-period markers using ``n_obs=0`` rather than CS / SA's ``n_groups=0`` flag. ``_collect_pre_period_coefs`` now drops rows with either sentinel so the Bonferroni denominator and joint-Wald index are not inflated by non-informative rows. P2 fix: * **``placebo`` schema inconsistency.** ``REPORTING.md`` said ``placebo`` is always rendered as ``{"status": "skipped"}`` in MVP, but no result type had ``placebo`` in its applicability frozenset, so implementation fell through to ``"not_applicable"``. Now every DiagnosticReport.to_dict() returns ``placebo`` with ``status="skipped"`` regardless of estimator, matching the stated contract. Regression tests for each finding added in ``TestNarrowedApplicabilityAndPlaceboSchema`` and ``TestAlphaOverrideBootstrapAndFiniteDF``. 146 targeted tests pass; black, ruff, mypy clean on the new modules. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 989f71a commit 959f84e

4 files changed

Lines changed: 275 additions & 45 deletions

File tree

diff_diff/business_report.py

Lines changed: 64 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -331,13 +331,24 @@ def _extract_headline(self, dr_schema: Optional[Dict[str, Any]]) -> Dict[str, An
331331
_name, att, se, p, ci, result_alpha = extracted
332332

333333
# If the caller asked for a different alpha than the result was fit
334-
# at, recompute the CI from (att, se) using ``safe_inference`` so the
335-
# labeled CI level matches the interval actually shown. Without this
336-
# the stored interval (e.g. 95%) would be relabeled to the caller's
337-
# level (e.g. 90%) — the documented single-knob contract requires
338-
# them to agree. SE is a scale parameter independent of alpha, so
339-
# recomputation is safe; the result's original t-stat / p-value do
340-
# not change either.
334+
# at, the displayed CI needs to match the label. Naive recomputation
335+
# via ``safe_inference(att, se, alpha=alpha)`` would use a normal
336+
# distribution with no df, which silently discards finite-df /
337+
# bootstrap / percentile inference contracts used by TROP,
338+
# ContinuousDiD, dCDH-bootstrap, survey fits, etc. Rules:
339+
# 1. If the result has an analytic inference contract we can
340+
# reproduce (no bootstrap distribution, no finite df we don't
341+
# know about), recompute via ``safe_inference`` — this covers
342+
# the common case of normal-approximation CIs.
343+
# 2. Otherwise (bootstrap / percentile / finite-df / survey d.f.),
344+
# preserve the fitted CI and its native level so the displayed
345+
# interval keeps matching the stored p-value and inference
346+
# contract. The ``ci_level`` field will reflect the result's
347+
# own alpha, and a caveat is appended below noting that the
348+
# caller's alpha drives phrasing but the native interval is
349+
# shown.
350+
alpha_was_honored = True
351+
alpha_override_caveat: Optional[str] = None
341352
if (
342353
result_alpha is not None
343354
and not np.isclose(alpha, result_alpha)
@@ -346,13 +357,40 @@ def _extract_headline(self, dr_schema: Optional[Dict[str, Any]]) -> Dict[str, An
346357
and np.isfinite(att)
347358
and np.isfinite(se)
348359
):
349-
from diff_diff.utils import safe_inference
360+
inference_method = getattr(r, "inference_method", "analytical")
361+
has_bootstrap_dist = getattr(r, "bootstrap_distribution", None) is not None
362+
df_survey = getattr(
363+
r, "df_survey", getattr(getattr(r, "survey_metadata", None), "df_survey", None)
364+
)
365+
variance_method = getattr(r, "variance_method", None)
366+
367+
bootstrap_like = (
368+
inference_method == "bootstrap"
369+
or has_bootstrap_dist
370+
or variance_method in {"bootstrap", "jackknife", "placebo"}
371+
)
372+
finite_df = isinstance(df_survey, (int, float)) and df_survey > 0
373+
374+
if bootstrap_like or finite_df:
375+
# Preserve the fitted CI at its native level.
376+
alpha_was_honored = False
377+
alpha = float(result_alpha)
378+
alpha_override_caveat = (
379+
f"Requested alpha was not honored for the confidence "
380+
f"interval because this fit uses "
381+
f"{'bootstrap' if bootstrap_like else 'finite-df'} "
382+
f"inference; the displayed CI remains at the fit's "
383+
f"native level ({int(round((1.0 - result_alpha) * 100))}%). "
384+
f"The significance phrasing still uses the requested alpha."
385+
)
386+
else:
387+
from diff_diff.utils import safe_inference
350388

351-
_t, _p, recomputed_ci = safe_inference(att, se, alpha=alpha)
352-
if recomputed_ci is not None and all(
353-
x is not None and np.isfinite(x) for x in recomputed_ci
354-
):
355-
ci = [float(recomputed_ci[0]), float(recomputed_ci[1])]
389+
_t, _p, recomputed_ci = safe_inference(att, se, alpha=alpha)
390+
if recomputed_ci is not None and all(
391+
x is not None and np.isfinite(x) for x in recomputed_ci
392+
):
393+
ci = [float(recomputed_ci[0]), float(recomputed_ci[1])]
356394

357395
unit = self._context.outcome_unit
358396
unit_kind = _UNIT_KINDS.get(unit.lower() if unit else "", "unknown")
@@ -382,6 +420,8 @@ def _extract_headline(self, dr_schema: Optional[Dict[str, Any]]) -> Dict[str, An
382420
"se": se,
383421
"ci_lower": ci[0] if ci else None,
384422
"ci_upper": ci[1] if ci else None,
423+
"alpha_was_honored": alpha_was_honored,
424+
"alpha_override_caveat": alpha_override_caveat,
385425
"ci_level": ci_level,
386426
"p_value": p,
387427
"is_significant": is_significant,
@@ -623,6 +663,17 @@ def _build_caveats(
623663
}
624664
)
625665

666+
# Alpha override could not be honored (bootstrap / finite-df inference).
667+
alpha_override_msg = headline.get("alpha_override_caveat")
668+
if isinstance(alpha_override_msg, str) and alpha_override_msg:
669+
caveats.append(
670+
{
671+
"severity": "info",
672+
"topic": "alpha_override_preserved",
673+
"message": alpha_override_msg,
674+
}
675+
)
676+
626677
# Near-threshold p-value.
627678
if headline.get("near_significance_threshold"):
628679
caveats.append(

diff_diff/diagnostic_report.py

Lines changed: 46 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -69,6 +69,18 @@
6969
# requires updating both this table and ``_PT_METHOD`` below; the
7070
# applicability-matrix test parametrized over all result types serves as the
7171
# regression guard.
72+
# ``pretrends_power`` is restricted to the result families for which
73+
# ``compute_pretrends_power`` has an explicit adapter — see
74+
# ``diff_diff/pretrends.py`` around the result-type dispatch. Expanding
75+
# beyond this set (Imputation / Stacked / TwoStage / EfficientDiD /
76+
# StaggeredTripleDiff / Wooldridge / dCDH) would cause the helper to
77+
# raise ``TypeError("Unsupported results type ...")`` and mark the check
78+
# as ``error``, so the narrower set is the right contract.
79+
#
80+
# ``sensitivity`` is restricted to families with a ``HonestDiD``
81+
# adapter: MultiPeriod, CS, dCDH (via ``placebo_event_study``). SDiD
82+
# and TROP use their own native paths (``estimator_native``) instead
83+
# of HonestDiD.
7284
_APPLICABILITY: Dict[str, FrozenSet[str]] = {
7385
"DiDResults": frozenset({"parallel_trends", "design_effect"}),
7486
"MultiPeriodDiDResults": frozenset(
@@ -89,7 +101,6 @@
89101
{
90102
"parallel_trends",
91103
"pretrends_power",
92-
"sensitivity",
93104
"bacon",
94105
"design_effect",
95106
"heterogeneity",
@@ -98,8 +109,6 @@
98109
"ImputationDiDResults": frozenset(
99110
{
100111
"parallel_trends",
101-
"pretrends_power",
102-
"sensitivity",
103112
"bacon",
104113
"design_effect",
105114
"heterogeneity",
@@ -108,7 +117,6 @@
108117
"TwoStageDiDResults": frozenset(
109118
{
110119
"parallel_trends",
111-
"pretrends_power",
112120
"bacon",
113121
"design_effect",
114122
"heterogeneity",
@@ -117,8 +125,6 @@
117125
"StackedDiDResults": frozenset(
118126
{
119127
"parallel_trends",
120-
"pretrends_power",
121-
"sensitivity",
122128
"bacon",
123129
"design_effect",
124130
"heterogeneity",
@@ -139,8 +145,6 @@
139145
"EfficientDiDResults": frozenset(
140146
{
141147
"parallel_trends",
142-
"pretrends_power",
143-
"sensitivity",
144148
"bacon",
145149
"design_effect",
146150
"heterogeneity",
@@ -149,14 +153,10 @@
149153
),
150154
"ContinuousDiDResults": frozenset({"design_effect", "heterogeneity"}),
151155
"TripleDifferenceResults": frozenset({"design_effect", "epv"}),
152-
"StaggeredTripleDiffResults": frozenset(
153-
{"parallel_trends", "pretrends_power", "sensitivity", "design_effect"}
154-
),
156+
"StaggeredTripleDiffResults": frozenset({"parallel_trends", "design_effect"}),
155157
"WooldridgeDiDResults": frozenset(
156158
{
157159
"parallel_trends",
158-
"pretrends_power",
159-
"sensitivity",
160160
"bacon",
161161
"design_effect",
162162
"heterogeneity",
@@ -165,7 +165,6 @@
165165
"ChaisemartinDHaultfoeuilleResults": frozenset(
166166
{
167167
"parallel_trends",
168-
"pretrends_power",
169168
"sensitivity",
170169
"bacon",
171170
"design_effect",
@@ -432,13 +431,15 @@ def _compute_applicable_checks(self) -> Tuple[set, Dict[str, str]]:
432431
continue
433432
applicable.add(check)
434433

435-
# Placebo is always skipped in MVP (opt-in path deferred)
436-
if "placebo" in type_level and "placebo" not in applicable:
437-
skipped.setdefault(
438-
"placebo",
439-
"Placebo battery runs on opt-in only; not yet implemented in MVP. "
440-
"Reserved in the schema for forward compatibility.",
441-
)
434+
# Placebo is reserved for every result type in MVP so the schema
435+
# shape is stable: ``schema["placebo"]["status"] == "skipped"``
436+
# always holds regardless of estimator. The opt-in execution path
437+
# is deferred to a follow-up; ``REPORTING.md`` documents this.
438+
skipped.setdefault(
439+
"placebo",
440+
"Placebo battery runs on opt-in only; not yet implemented in MVP. "
441+
"Reserved in the schema for forward compatibility.",
442+
)
442443

443444
return applicable, skipped
444445

@@ -499,11 +500,22 @@ def _instance_skip_reason(self, check: str) -> Optional[str]:
499500
# Precomputed sensitivity always unlocks this check.
500501
if "sensitivity" in self._precomputed:
501502
return None
502-
# ``HonestDiD.sensitivity_analysis`` handles CS / SA /
503-
# ImputationDiD internally via ``event_study_effects`` +
504-
# ``event_study_vcov`` (or per-SE diagonal fallback), so we
505-
# accept any of: top-level vcov, event_study_vcov, or a
506-
# populated event_study_effects surface.
503+
# dCDH uses ``placebo_event_study`` as its pre-period surface,
504+
# which HonestDiD consumes via a dedicated branch. Accept the
505+
# fit when that attribute is populated.
506+
if name == "ChaisemartinDHaultfoeuilleResults":
507+
pes = getattr(r, "placebo_event_study", None)
508+
if pes is None:
509+
return (
510+
"HonestDiD on dCDH requires results.placebo_event_study "
511+
"(re-fit with a placebo-producing configuration)."
512+
)
513+
return None
514+
# MultiPeriod / CS path: ``HonestDiD.sensitivity_analysis``
515+
# consumes ``event_study_effects`` plus either ``vcov`` +
516+
# ``interaction_indices`` (MultiPeriod) or ``event_study_vcov``
517+
# + ``event_study_vcov_index`` (CS), with a per-SE diagonal
518+
# fallback otherwise.
507519
has_vcov = getattr(r, "vcov", None) is not None
508520
has_event_vcov = getattr(r, "event_study_vcov", None) is not None
509521
has_event_es = getattr(r, "event_study_effects", None) is not None
@@ -1728,12 +1740,14 @@ def _collect_pre_period_coefs(results: Any) -> List[Tuple[Any, float, float, Opt
17281740
continue
17291741
if not isinstance(entry, dict):
17301742
continue
1731-
# Drop universal-base reference markers. See
1732-
# ``staggered_aggregation.py`` around the reference-period
1733-
# injection: ``n_groups == 0`` flags the synthetic marker row
1734-
# with NaN SE and p-value.
1735-
n_groups = entry.get("n_groups")
1736-
if n_groups is not None and n_groups == 0:
1743+
# Drop universal-base reference markers. Different estimator
1744+
# aggregations use different flags for the synthetic marker row
1745+
# (all of which carry NaN SE and p-value):
1746+
# * CS / SA: ``n_groups == 0``
1747+
# * Stacked / TwoStage / Imputation: ``n_obs == 0``
1748+
# Treat either as a disqualifier so the Bonferroni denominator
1749+
# and joint-Wald index are not inflated by non-informative rows.
1750+
if entry.get("n_groups") == 0 or entry.get("n_obs") == 0:
17371751
continue
17381752
eff = entry.get("effect")
17391753
se = entry.get("se")

tests/test_business_report.py

Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -520,6 +520,92 @@ def test_ci_bounds_recomputed_when_alpha_differs_from_result(self, event_study_f
520520
assert h90["ci_level"] == 90
521521

522522

523+
class TestAlphaOverrideBootstrapAndFiniteDF:
524+
"""Regression for the P0 finding that ``safe_inference(att, se, alpha)``
525+
silently discards bootstrap / finite-df inference contracts on results
526+
that use them (TROP, ContinuousDiD, dCDH-bootstrap, survey fits).
527+
528+
Rule: when the caller's alpha differs from the fit's alpha AND the
529+
result's inference contract is bootstrap-backed or uses finite df,
530+
BR preserves the fitted CI at the fit's native level rather than
531+
recomputing with a normal approximation. The override is recorded as
532+
an informational caveat.
533+
"""
534+
535+
class _BootstrapResultStub:
536+
"""Minimal stub shaped like a bootstrap-inferred result."""
537+
538+
def __init__(self):
539+
self.att = 1.0
540+
self.se = 0.5
541+
self.p_value = 0.04
542+
# Original 95% CI from the bootstrap distribution.
543+
self.conf_int = (0.05, 1.95)
544+
self.alpha = 0.05
545+
self.n_obs = 100
546+
self.n_treated = 40
547+
self.n_control = 60
548+
self.inference_method = "bootstrap"
549+
self.survey_metadata = None
550+
# Presence of a bootstrap distribution triggers the preserve path.
551+
import numpy as np
552+
553+
self.bootstrap_distribution = np.random.default_rng(0).normal(1.0, 0.5, 200)
554+
555+
def test_bootstrap_fit_preserves_fitted_ci_on_alpha_mismatch(self):
556+
stub = self._BootstrapResultStub()
557+
br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False)
558+
h = br.to_dict()["headline"]
559+
# Native fit was at 95%; requested 90% should NOT be reflected in the label.
560+
assert h["ci_level"] == 95, (
561+
"Bootstrap fit must preserve fitted CI level (95) when caller "
562+
f"requests a different alpha; got {h['ci_level']}"
563+
)
564+
# Bounds should match the stored bootstrap interval, not a normal-z
565+
# recomputation at 90%.
566+
assert h["ci_lower"] == pytest.approx(0.05)
567+
assert h["ci_upper"] == pytest.approx(1.95)
568+
# A caveat records the override.
569+
caveat_topics = {c.get("topic") for c in br.caveats()}
570+
assert "alpha_override_preserved" in caveat_topics
571+
572+
class _FiniteDfSurveyStub:
573+
def __init__(self):
574+
from types import SimpleNamespace
575+
576+
self.att = 2.0
577+
self.se = 0.4
578+
self.p_value = 0.001
579+
self.conf_int = (1.22, 2.78) # 95% via survey t-quantile
580+
self.alpha = 0.05
581+
self.n_obs = 120
582+
self.n_treated = 50
583+
self.n_control = 70
584+
self.inference_method = "analytical"
585+
# Finite survey d.f. triggers the preserve path — normal approx
586+
# would widen / narrow incorrectly.
587+
self.survey_metadata = SimpleNamespace(
588+
weight_type="pweight",
589+
effective_n=110.0,
590+
design_effect=1.2,
591+
sum_weights=120.0,
592+
n_strata=4,
593+
n_psu=12,
594+
df_survey=8,
595+
replicate_method=None,
596+
)
597+
598+
def test_finite_df_fit_preserves_fitted_ci_on_alpha_mismatch(self):
599+
stub = self._FiniteDfSurveyStub()
600+
br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False)
601+
h = br.to_dict()["headline"]
602+
assert h["ci_level"] == 95
603+
assert h["ci_lower"] == pytest.approx(1.22)
604+
assert h["ci_upper"] == pytest.approx(2.78)
605+
caveat_topics = {c.get("topic") for c in br.caveats()}
606+
assert "alpha_override_preserved" in caveat_topics
607+
608+
523609
class TestFullReportSingleM:
524610
"""Regression: ``full_report()`` must not claim full-grid robustness for a
525611
single-M HonestDiDResults passthrough. The summary path was fixed earlier;

0 commit comments

Comments
 (0)