Address thirty-fifth round of CI review findings on PR #318

igerber · claude · igerber · commit 7743b5cb6fcc · 2026-04-19T14:00:13.000-04:00
P1 methodology (inconclusive PT prose missing). Rounds 33-34 made
the event-study PT schema emit ``verdict="inconclusive"`` whenever
pre-period inference is undefined (zero / negative SE, non-finite
per-period p-value). But neither ``BusinessReport.summary()`` nor
``DiagnosticReport.summary()`` / ``overall_interpretation`` had an
``elif verdict == "inconclusive"`` branch, so the PT sentence was
silently omitted from the primary prose output. A missing sentence
is indistinguishable from "PT did not run" and drops the
identifying-assumption diagnostic from stakeholder output.

Add explicit inconclusive branches on both surfaces. When
``n_dropped_undefined`` is a positive integer, the sentence quotes
the count ("3 pre-period rows had undefined inference"); otherwise
it falls back to a generic "pre-period inference was undefined"
clause. Both surfaces now close with "Treat parallel trends as
unassessed" so the stakeholder takeaway is explicit.
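
For reference, the branch logic described above can be sketched as a
standalone function. This is a minimal illustration, not the code in
the diff: ``inconclusive_pt_sentence`` is a hypothetical helper name,
and the wording follows the DR variant of the sentence.

```python
from typing import Any, Dict


def inconclusive_pt_sentence(pt: Dict[str, Any]) -> str:
    """Hypothetical helper: build the DR-style inconclusive PT sentence."""
    n_dropped = pt.get("n_dropped_undefined")
    if isinstance(n_dropped, int) and n_dropped > 0:
        # Quote the count and pluralise "row" when it is known.
        rows_word = "row" if n_dropped == 1 else "rows"
        detail = (
            f"{n_dropped} pre-period {rows_word} had undefined inference "
            "(zero / negative SE or a non-finite per-period p-value)"
        )
    else:
        # Generic fallback when no usable count is on the schema.
        detail = "pre-period inference was undefined"
    return (
        f"Pre-trends is inconclusive on this fit: {detail}, so the joint "
        "test cannot be formed. Treat parallel trends as unassessed."
    )


if __name__ == "__main__":
    print(inconclusive_pt_sentence({"n_dropped_undefined": 3}))
    print(inconclusive_pt_sentence({}))
```

Both paths end on the same closing clause, which is the property the
new regression relies on.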

P2 code quality (DEFF ``deff < 0.95`` directional bug). The
``is_trivial`` flag required ``0.95 <= deff <= 1.05`` while
``band_label`` treated anything ``< 1.05`` as trivial. BR's
summary keyed off ``not is_trivial`` and narrated "Survey design
reduces effective sample size" for ``deff < 0.95``, which is
directionally wrong: a precision-improving design has LARGER
effective N than nominal N. Two fixes:

  * Add a dedicated ``band_label="improves_precision"`` enum
    value for ``deff < 0.95`` so the schema carries the direction
    explicitly;
  * Split BR's summary rendering: ``deff < 1.0`` -> "improves
    effective sample size"; ``deff >= 1.0`` -> "reduces effective
    sample size".

``is_trivial`` stays at ``0.95 <= deff <= 1.05`` (the tight
"effectively no effect" window).

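The DEFF banding and wording split can be sketched in isolation as
follows. This is an illustrative sketch under stated assumptions:
``classify_deff`` and ``deff_sentence`` are hypothetical names, and
the label returned for ``deff >= 1.05`` is a placeholder because the
upper band labels are not shown in this commit.

```python
import math
from typing import Optional, Tuple


def classify_deff(deff: Optional[float]) -> Tuple[bool, Optional[str]]:
    """Hypothetical helper: return (is_trivial, band_label) for a DEFF."""
    is_trivial = deff is not None and 0.95 <= deff <= 1.05
    if deff is None or not math.isfinite(deff):
        return is_trivial, None
    if deff < 0.95:
        # Precision-improving design: effective N exceeds nominal N.
        return is_trivial, "improves_precision"
    if deff < 1.05:
        return is_trivial, "trivial"
    # Placeholder only: the real labels for larger DEFF bands are not
    # visible in this diff.
    return is_trivial, "placeholder_larger_band"


def deff_sentence(deff: float, eff_n: float) -> str:
    """Hypothetical helper: pick the verb from the direction of DEFF."""
    verb = "improves" if deff < 1.0 else "reduces"
    return (
        f"Survey design {verb} effective sample size to "
        f"~{eff_n:,.0f} (DEFF = {deff:.2g})."
    )


if __name__ == "__main__":
    print(classify_deff(0.80))
    print(deff_sentence(0.80, 625.0))
```

Keeping the band label and the prose verb as separate decisions
mirrors the two fixes listed above: the schema carries the direction,
and the summary narrates it.
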
P2 coverage. The round-33/34 regressions asserted only the absence
of false-clean "do not reject" wording; that assertion still passes
even when the PT sentence disappears entirely. Add positive
regressions:

  * ``test_summary_prose_surfaces_inconclusive_pt_explicitly``
    asserts both ``DiagnosticReport.summary()`` and
    ``BusinessReport.summary()`` contain the word "inconclusive"
    on a Bonferroni-only surface with a NaN per-period p-value;
  * ``test_design_effect_deff_below_95_uses_improves_precision_wording``
    pins the new ``band_label`` enum value AND the BR summary
    "improves effective sample size" wording.

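The coverage gap can be shown with a small sketch. The two checker
functions and both summary strings are made up for illustration and
are not taken from the test suite.

```python
def has_no_false_clean_wording(prose: str) -> bool:
    """Negative check: no false-clean PT wording appears."""
    text = prose.lower()
    return (
        "do not reject parallel trends" not in text
        and "consistent with parallel trends" not in text
    )


def names_inconclusive_state(prose: str) -> bool:
    """Positive check: the inconclusive state is named explicitly."""
    return "inconclusive" in prose.lower()


# Made-up summaries: one drops the PT sentence, one surfaces it.
summary_missing_pt = "The estimated effect is 1.0 (95% CI 0.6 to 1.4)."
summary_with_pt = (
    summary_missing_pt
    + " Pre-trends is inconclusive on this fit."
    + " Treat parallel trends as unassessed."
)

if __name__ == "__main__":
    for label, prose in [
        ("missing", summary_missing_pt),
        ("present", summary_with_pt),
    ]:
        print(
            label,
            has_no_false_clean_wording(prose),
            names_inconclusive_state(prose),
        )
```

The negative check alone passes on the summary that omits the PT
sentence, which is why the new regressions pair it with a positive
assertion.
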
332 BR / DR / practitioner / pretrends tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/diff_diff/business_report.py b/diff_diff/business_report.py
@@ -1758,6 +1758,33 @@ def _render_summary(schema: Dict[str, Any]) -> str:
                 "group's pre-period trajectory (SDiD's weighted-parallel-"
                 "trends analogue)."
             )
+        elif verdict == "inconclusive":
+            # Round-35 P1 CI review on PR #318: a ``verdict=="inconclusive"``
+            # state means one or more pre-period coefficients had
+            # undefined inference (zero SE, NaN p-value) and the joint
+            # test cannot be formed. BR previously omitted the sentence
+            # entirely, so stakeholder prose silently skipped the
+            # identifying-assumption diagnostic. Name the state
+            # explicitly and quote the undefined-row count when
+            # available.
+            n_dropped = pt.get("n_dropped_undefined")
+            if isinstance(n_dropped, int) and n_dropped > 0:
+                rows_word = "row" if n_dropped == 1 else "rows"
+                sentences.append(
+                    f"The pre-trends test is inconclusive on this fit: "
+                    f"{n_dropped} pre-period {rows_word} had undefined "
+                    "inference (zero / negative SE or a non-finite "
+                    "per-period p-value), so the joint test cannot be "
+                    "formed. Treat parallel trends as unassessed rather "
+                    "than supported."
+                )
+            else:
+                sentences.append(
+                    "The pre-trends test is inconclusive on this fit: "
+                    "pre-period inference was undefined, so the joint "
+                    "test cannot be formed. Treat parallel trends as "
+                    "unassessed rather than supported."
+                )
 
     # Sensitivity. A ``single_M_precomputed`` sensitivity block has
     # ``breakdown_M=None`` by construction because only one M was evaluated;
@@ -1877,10 +1904,21 @@ def _render_summary(schema: Dict[str, Any]) -> str:
             deff = survey.get("design_effect")
             eff_n = survey.get("effective_n")
             if isinstance(deff, (int, float)) and isinstance(eff_n, (int, float)):
-                sentences.append(
-                    f"Survey design reduces effective sample size to "
-                    f"~{eff_n:,.0f} (DEFF = {deff:.2g})."
-                )
+                # Round-35 P2 CI review on PR #318: ``deff < 0.95`` is a
+                # precision-improving design (effective N is LARGER than
+                # nominal N). Narrating that as "reduces effective sample
+                # size" is directionally wrong. Branch on the sign of
+                # the departure from 1.
+                if deff < 1.0:
+                    sentences.append(
+                        f"Survey design improves effective sample size to "
+                        f"~{eff_n:,.0f} (DEFF = {deff:.2g})."
+                    )
+                else:
+                    sentences.append(
+                        f"Survey design reduces effective sample size to "
+                        f"~{eff_n:,.0f} (DEFF = {deff:.2g})."
+                    )
 
     # Highest-severity caveat (if any).
     caveats = schema.get("caveats", [])
diff --git a/diff_diff/diagnostic_report.py b/diff_diff/diagnostic_report.py
@@ -1585,9 +1585,22 @@ def _check_design_effect(self) -> Dict[str, Any]:
             }
         deff = _to_python_float(getattr(sm, "design_effect", None))
         eff_n = _to_python_float(getattr(sm, "effective_n", None))
+        # Round-35 P2 CI review on PR #318: ``is_trivial`` used to be
+        # ``0.95 <= deff <= 1.05`` while ``band_label`` treated
+        # anything ``< 1.05`` as trivial. On a precision-improving
+        # design (``deff < 0.95``) BR's summary keyed off
+        # ``not is_trivial`` and narrated "Survey design reduces
+        # effective sample size", which is directionally wrong — the
+        # effective N is LARGER than the nominal N. Split the band
+        # into a dedicated ``improves_precision`` label for
+        # ``deff < 0.95`` and keep ``is_trivial`` restricted to the
+        # tight "effectively no effect" window so the schema
+        # carries the precision-improving signal explicitly.
         is_trivial = deff is not None and 0.95 <= deff <= 1.05
         if deff is None or not np.isfinite(deff):
             band_label: Optional[str] = None
+        elif deff < 0.95:
+            band_label = "improves_precision"
         elif deff < 1.05:
             band_label = "trivial"
         elif deff < 2.0:
@@ -2846,6 +2859,30 @@ def _render_overall_interpretation(schema: Dict[str, Any], labels: Dict[str, str
                 else "SDiD's synthetic control is designed to satisfy the "
                 "weighted parallel-trends analogue."
             )
+        elif verdict == "inconclusive":
+            # Round-35 P1 CI review on PR #318: DR summary / overall
+            # interpretation must surface the inconclusive state
+            # explicitly rather than omitting the PT sentence. A missing
+            # sentence was indistinguishable from "PT did not run", and
+            # stakeholders reading the summary could not tell that the
+            # joint test had been attempted but yielded undefined
+            # inference.
+            n_dropped = pt.get("n_dropped_undefined")
+            if isinstance(n_dropped, int) and n_dropped > 0:
+                rows_word = "row" if n_dropped == 1 else "rows"
+                sentences.append(
+                    f"Pre-trends is inconclusive on this fit: "
+                    f"{n_dropped} pre-period {rows_word} had undefined "
+                    "inference (zero / negative SE or a non-finite "
+                    "per-period p-value), so the joint test cannot be "
+                    "formed. Treat parallel trends as unassessed."
+                )
+            else:
+                sentences.append(
+                    "Pre-trends is inconclusive on this fit: pre-period "
+                    "inference was undefined, so the joint test cannot "
+                    "be formed. Treat parallel trends as unassessed."
+                )
 
     # Sentence 3: sensitivity. The "robust across the grid" phrasing is reserved
     # for genuine SensitivityResults grids; a precomputed single-M HonestDiDResults
diff --git a/tests/test_diagnostic_report.py b/tests/test_diagnostic_report.py
@@ -1168,6 +1168,91 @@ class MultiPeriodDiDResults:
         assert pt["n_dropped_undefined"] == 1
         assert "undefined inference" in pt["reason"]
 
+    def test_summary_prose_surfaces_inconclusive_pt_explicitly(self):
+        """Round-35 P1 regression: when pre-trends is inconclusive
+        (undefined pre-period inference), both ``BusinessReport.summary()``
+        and ``DiagnosticReport.summary()`` must emit explicit inconclusive
+        prose — not merely omit the PT sentence. A missing sentence was
+        indistinguishable from "PT did not run" and would silently drop
+        the identifying-assumption diagnostic from stakeholder output.
+        """
+        from diff_diff import BusinessReport
+
+        class StackedDiDResults:
+            pass
+
+        obj = StackedDiDResults()
+        obj.overall_att = 1.0
+        obj.overall_se = 0.2
+        obj.overall_p_value = 0.001
+        obj.overall_conf_int = (0.6, 1.4)
+        obj.alpha = 0.05
+        obj.n_obs = 400
+        obj.n_treated_units = 100
+        obj.n_control_units = 300
+        obj.survey_metadata = None
+        obj.event_study_effects = {
+            -2: {"effect": 0.1, "se": 0.2, "p_value": 0.62, "n_obs": 400},
+            -1: {"effect": 0.05, "se": 0.3, "p_value": float("nan"), "n_obs": 400},
+        }
+
+        dr_summary = DiagnosticReport(
+            obj, run_sensitivity=False, run_bacon=False
+        ).summary()
+        br_summary = BusinessReport(obj).summary()
+
+        # Both summaries must explicitly name the inconclusive state.
+        for label, prose in [("DR", dr_summary), ("BR", br_summary)]:
+            assert "inconclusive" in prose.lower(), (
+                f"{label}.summary() must surface the inconclusive PT "
+                f"state explicitly; got: {prose!r}"
+            )
+            # And must not offer false-clean "do not reject" wording.
+            assert "do not reject parallel trends" not in prose.lower()
+            assert "consistent with parallel trends" not in prose.lower()
+
+    def test_design_effect_deff_below_95_uses_improves_precision_wording(self):
+        """Round-35 P2 regression: ``deff < 0.95`` is a precision-
+        improving survey design — effective N is LARGER than nominal
+        N. DR emits ``band_label="improves_precision"`` and BR narrates
+        "improves effective sample size" instead of "reduces".
+        """
+        from types import SimpleNamespace
+
+        from diff_diff import BusinessReport
+
+        class CallawaySantAnnaResults:
+            pass
+
+        obj = CallawaySantAnnaResults()
+        obj.overall_att = 1.0
+        obj.overall_se = 0.2
+        obj.overall_p_value = 0.001
+        obj.overall_conf_int = (0.6, 1.4)
+        obj.alpha = 0.05
+        obj.n_obs = 500
+        obj.n_treated = 100
+        obj.n_control_units = 400
+        obj.event_study_effects = None
+        obj.survey_metadata = SimpleNamespace(
+            design_effect=0.80,
+            effective_n=625.0,
+            weight_type="pweight",
+            n_strata=None,
+            n_psu=None,
+            df_survey=None,
+            replicate_method=None,
+        )
+
+        # Schema: band_label surfaces the precision-improving state.
+        deff_block = DiagnosticReport(obj).to_dict()["design_effect"]
+        assert deff_block["band_label"] == "improves_precision"
+
+        # Prose: BR says "improves", not "reduces".
+        summary = BusinessReport(obj).summary().lower()
+        assert "improves effective sample size" in summary
+        assert "reduces effective sample size" not in summary
+
     def test_finite_se_nan_p_value_yields_inconclusive_on_bonferroni_only_surface(self):
         """Round-34 P0 regression: replicate-weight survey fits can emit
         event-study rows with finite ``effect`` / ``se`` but