Skip to content

Commit 7743b5c

Browse files
igerberclaude
andcommitted
Address thirty-fifth round of CI review findings on PR #318
P1 methodology (inconclusive PT prose missing). Rounds 33-34 made the event-study PT schema emit ``verdict="inconclusive"`` whenever pre-period inference is undefined (zero / negative SE, non-finite per-period p-value). But neither ``BusinessReport.summary()`` nor ``DiagnosticReport.summary()`` / ``overall_interpretation`` had an ``elif verdict == "inconclusive"`` branch, so the PT sentence was silently omitted from the primary prose output. A missing sentence is indistinguishable from "PT did not run" and drops the identifying-assumption diagnostic from stakeholder output. Add explicit inconclusive branches on both surfaces. When ``n_dropped_undefined`` is available, the sentence quotes the count ("3 pre-period rows had undefined inference"); otherwise falls back to a generic "pre-period inference was undefined" clause. Both surfaces now close with "Treat parallel trends as unassessed" so the stakeholder takeaway is explicit. P2 code quality (DEFF ``deff < 0.95`` directional bug). The ``is_trivial`` flag required ``0.95 <= deff <= 1.05`` while ``band_label`` treated anything ``< 1.05`` as trivial. BR's summary keyed off ``not is_trivial`` and narrated "Survey design reduces effective sample size" for ``deff < 0.95``, which is directionally wrong — a precision-improving design has LARGER effective N than nominal N. Two fixes: * Add a dedicated ``band_label="improves_precision"`` enum value for ``deff < 0.95`` so the schema carries the direction explicitly; * Split BR's summary rendering: ``deff < 1.0`` -> "improves effective sample size"; ``deff >= 1.0`` -> "reduces effective sample size". ``is_trivial`` stays at ``0.95 <= deff <= 1.05`` (the tight "effectively no effect" window). P2 coverage. Round-33/34 regressions only asserted absence of false-clean "do not reject" wording; that assertion still passes even when the PT sentence disappears entirely. Added positive regressions: * ``test_summary_prose_surfaces_inconclusive_pt_explicitly`` asserts both ``DiagnosticReport.summary()`` and ``BusinessReport.summary()`` contain the word "inconclusive" on a Bonferroni-only surface with a NaN per-period p-value; * ``test_design_effect_deff_below_95_uses_improves_precision_wording`` pins the new ``band_label`` enum value AND the BR summary "improves effective sample size" wording. 332 BR / DR / practitioner / pretrends tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 94e9110 commit 7743b5c

3 files changed

Lines changed: 164 additions & 4 deletions

File tree

diff_diff/business_report.py

Lines changed: 42 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1758,6 +1758,33 @@ def _render_summary(schema: Dict[str, Any]) -> str:
17581758
"group's pre-period trajectory (SDiD's weighted-parallel-"
17591759
"trends analogue)."
17601760
)
1761+
elif verdict == "inconclusive":
1762+
# Round-35 P1 CI review on PR #318: a ``verdict=="inconclusive"``
1763+
# state means one or more pre-period coefficients had
1764+
# undefined inference (zero SE, NaN p-value) and the joint
1765+
# test cannot be formed. BR previously omitted the sentence
1766+
# entirely, so stakeholder prose silently skipped the
1767+
# identifying-assumption diagnostic. Name the state
1768+
# explicitly and quote the undefined-row count when
1769+
# available.
1770+
n_dropped = pt.get("n_dropped_undefined")
1771+
if isinstance(n_dropped, int) and n_dropped > 0:
1772+
rows_word = "row" if n_dropped == 1 else "rows"
1773+
sentences.append(
1774+
f"The pre-trends test is inconclusive on this fit: "
1775+
f"{n_dropped} pre-period {rows_word} had undefined "
1776+
"inference (zero / negative SE or a non-finite "
1777+
"per-period p-value), so the joint test cannot be "
1778+
"formed. Treat parallel trends as unassessed rather "
1779+
"than supported."
1780+
)
1781+
else:
1782+
sentences.append(
1783+
"The pre-trends test is inconclusive on this fit: "
1784+
"pre-period inference was undefined, so the joint "
1785+
"test cannot be formed. Treat parallel trends as "
1786+
"unassessed rather than supported."
1787+
)
17611788

17621789
# Sensitivity. A ``single_M_precomputed`` sensitivity block has
17631790
# ``breakdown_M=None`` by construction because only one M was evaluated;
@@ -1877,10 +1904,21 @@ def _render_summary(schema: Dict[str, Any]) -> str:
18771904
deff = survey.get("design_effect")
18781905
eff_n = survey.get("effective_n")
18791906
if isinstance(deff, (int, float)) and isinstance(eff_n, (int, float)):
1880-
sentences.append(
1881-
f"Survey design reduces effective sample size to "
1882-
f"~{eff_n:,.0f} (DEFF = {deff:.2g})."
1883-
)
1907+
# Round-35 P2 CI review on PR #318: ``deff < 0.95`` is a
1908+
# precision-improving design (effective N is LARGER than
1909+
# nominal N). Narrating that as "reduces effective sample
1910+
# size" is directionally wrong. Branch on the sign of
1911+
# the departure from 1.
1912+
if deff < 1.0:
1913+
sentences.append(
1914+
f"Survey design improves effective sample size to "
1915+
f"~{eff_n:,.0f} (DEFF = {deff:.2g})."
1916+
)
1917+
else:
1918+
sentences.append(
1919+
f"Survey design reduces effective sample size to "
1920+
f"~{eff_n:,.0f} (DEFF = {deff:.2g})."
1921+
)
18841922

18851923
# Highest-severity caveat (if any).
18861924
caveats = schema.get("caveats", [])

diff_diff/diagnostic_report.py

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1585,9 +1585,22 @@ def _check_design_effect(self) -> Dict[str, Any]:
15851585
}
15861586
deff = _to_python_float(getattr(sm, "design_effect", None))
15871587
eff_n = _to_python_float(getattr(sm, "effective_n", None))
1588+
# Round-35 P2 CI review on PR #318: ``is_trivial`` used to be
1589+
# ``0.95 <= deff <= 1.05`` while ``band_label`` treated
1590+
# anything ``< 1.05`` as trivial. On a precision-improving
1591+
# design (``deff < 0.95``) BR's summary keyed off
1592+
# ``not is_trivial`` and narrated "Survey design reduces
1593+
# effective sample size", which is directionally wrong — the
1594+
# effective N is LARGER than the nominal N. Split the band
1595+
# into a dedicated ``improves_precision`` label for
1596+
# ``deff < 0.95`` and keep ``is_trivial`` restricted to the
1597+
# tight "effectively no effect" window so the schema
1598+
# carries the precision-improving signal explicitly.
15881599
is_trivial = deff is not None and 0.95 <= deff <= 1.05
15891600
if deff is None or not np.isfinite(deff):
15901601
band_label: Optional[str] = None
1602+
elif deff < 0.95:
1603+
band_label = "improves_precision"
15911604
elif deff < 1.05:
15921605
band_label = "trivial"
15931606
elif deff < 2.0:
@@ -2846,6 +2859,30 @@ def _render_overall_interpretation(schema: Dict[str, Any], labels: Dict[str, str
28462859
else "SDiD's synthetic control is designed to satisfy the "
28472860
"weighted parallel-trends analogue."
28482861
)
2862+
elif verdict == "inconclusive":
2863+
# Round-35 P1 CI review on PR #318: DR summary / overall
2864+
# interpretation must surface the inconclusive state
2865+
# explicitly rather than omitting the PT sentence. A missing
2866+
# sentence was indistinguishable from "PT did not run", and
2867+
# stakeholders reading the summary could not tell that the
2868+
# joint test had been attempted but yielded undefined
2869+
# inference.
2870+
n_dropped = pt.get("n_dropped_undefined")
2871+
if isinstance(n_dropped, int) and n_dropped > 0:
2872+
rows_word = "row" if n_dropped == 1 else "rows"
2873+
sentences.append(
2874+
f"Pre-trends is inconclusive on this fit: "
2875+
f"{n_dropped} pre-period {rows_word} had undefined "
2876+
"inference (zero / negative SE or a non-finite "
2877+
"per-period p-value), so the joint test cannot be "
2878+
"formed. Treat parallel trends as unassessed."
2879+
)
2880+
else:
2881+
sentences.append(
2882+
"Pre-trends is inconclusive on this fit: pre-period "
2883+
"inference was undefined, so the joint test cannot "
2884+
"be formed. Treat parallel trends as unassessed."
2885+
)
28492886

28502887
# Sentence 3: sensitivity. The "robust across the grid" phrasing is reserved
28512888
# for genuine SensitivityResults grids; a precomputed single-M HonestDiDResults

tests/test_diagnostic_report.py

Lines changed: 85 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1168,6 +1168,91 @@ class MultiPeriodDiDResults:
11681168
assert pt["n_dropped_undefined"] == 1
11691169
assert "undefined inference" in pt["reason"]
11701170

1171+
def test_summary_prose_surfaces_inconclusive_pt_explicitly(self):
1172+
"""Round-35 P1 regression: when pre-trends is inconclusive
1173+
(undefined pre-period inference), both ``BusinessReport.summary()``
1174+
and ``DiagnosticReport.summary()`` must emit explicit inconclusive
1175+
prose — not merely omit the PT sentence. A missing sentence was
1176+
indistinguishable from "PT did not run" and would silently drop
1177+
the identifying-assumption diagnostic from stakeholder output.
1178+
"""
1179+
from diff_diff import BusinessReport
1180+
1181+
class StackedDiDResults:
1182+
pass
1183+
1184+
obj = StackedDiDResults()
1185+
obj.overall_att = 1.0
1186+
obj.overall_se = 0.2
1187+
obj.overall_p_value = 0.001
1188+
obj.overall_conf_int = (0.6, 1.4)
1189+
obj.alpha = 0.05
1190+
obj.n_obs = 400
1191+
obj.n_treated_units = 100
1192+
obj.n_control_units = 300
1193+
obj.survey_metadata = None
1194+
obj.event_study_effects = {
1195+
-2: {"effect": 0.1, "se": 0.2, "p_value": 0.62, "n_obs": 400},
1196+
-1: {"effect": 0.05, "se": 0.3, "p_value": float("nan"), "n_obs": 400},
1197+
}
1198+
1199+
dr_summary = DiagnosticReport(
1200+
obj, run_sensitivity=False, run_bacon=False
1201+
).summary()
1202+
br_summary = BusinessReport(obj).summary()
1203+
1204+
# Both summaries must explicitly name the inconclusive state.
1205+
for label, prose in [("DR", dr_summary), ("BR", br_summary)]:
1206+
assert "inconclusive" in prose.lower(), (
1207+
f"{label}.summary() must surface the inconclusive PT "
1208+
f"state explicitly; got: {prose!r}"
1209+
)
1210+
# And must not offer false-clean "do not reject" wording.
1211+
assert "do not reject parallel trends" not in prose.lower()
1212+
assert "consistent with parallel trends" not in prose.lower()
1213+
1214+
def test_design_effect_deff_below_95_uses_improves_precision_wording(self):
1215+
"""Round-35 P2 regression: ``deff < 0.95`` is a precision-
1216+
improving survey design — effective N is LARGER than nominal
1217+
N. DR emits ``band_label="improves_precision"`` and BR narrates
1218+
"improves effective sample size" instead of "reduces".
1219+
"""
1220+
from types import SimpleNamespace
1221+
1222+
from diff_diff import BusinessReport
1223+
1224+
class CallawaySantAnnaResults:
1225+
pass
1226+
1227+
obj = CallawaySantAnnaResults()
1228+
obj.overall_att = 1.0
1229+
obj.overall_se = 0.2
1230+
obj.overall_p_value = 0.001
1231+
obj.overall_conf_int = (0.6, 1.4)
1232+
obj.alpha = 0.05
1233+
obj.n_obs = 500
1234+
obj.n_treated = 100
1235+
obj.n_control_units = 400
1236+
obj.event_study_effects = None
1237+
obj.survey_metadata = SimpleNamespace(
1238+
design_effect=0.80,
1239+
effective_n=625.0,
1240+
weight_type="pweight",
1241+
n_strata=None,
1242+
n_psu=None,
1243+
df_survey=None,
1244+
replicate_method=None,
1245+
)
1246+
1247+
# Schema: band_label surfaces the precision-improving state.
1248+
deff_block = DiagnosticReport(obj).to_dict()["design_effect"]
1249+
assert deff_block["band_label"] == "improves_precision"
1250+
1251+
# Prose: BR says "improves", not "reduces".
1252+
summary = BusinessReport(obj).summary().lower()
1253+
assert "improves effective sample size" in summary
1254+
assert "reduces effective sample size" not in summary
1255+
11711256
def test_finite_se_nan_p_value_yields_inconclusive_on_bonferroni_only_surface(self):
11721257
"""Round-34 P0 regression: replicate-weight survey fits can emit
11731258
event-study rows with finite ``effect`` / ``se`` but

0 commit comments

Comments
 (0)