Skip to content

Commit cd85eda

Browse files
igerberclaude
andcommitted
Address twenty-eighth round of CI review findings on PR #318
P2 code quality (TROP applicable_checks mismatch). TROP identification is factor-model-based, not PT-based; the estimator-native ``_pt_factor()`` handler returns ``status="not_applicable"`` and REPORTING.md routes TROP PT to factor-model diagnostics. Exposing ``parallel_trends`` in ``_APPLICABILITY["TROPResults"]`` advertised a handler that never runs, leaving callers who gate workflows on ``applicable_checks`` with a contract mismatch. Remove PT from the TROP applicability set. P2 methodology (CS repeated-cross-section count labels). ``CallawaySantAnna(panel=False)`` stores treated / control counts as OBSERVATIONS rather than units (``staggered_results.py`` lines 183-184 render them as "obs:" in that mode). BR previously labeled them "units" / "present in the panel", which misstates the sample composition on RCS fits. Add a ``count_unit`` field to the BR sample schema (derived from ``results.panel``) and branch the summary / full-report rendering: RCS fits render "never-treated observations" and "present in the repeated cross-section sample" instead of the panel-mode phrasing. P3 coverage (survey PT prose / replay propagation). The round-27 fix added the ``_survey`` method suffix and ``df_denom`` schema field but did not carry the provenance through the prose / replay helpers: * ``_pt_method_subject`` and ``_pt_method_stat_label`` didn't recognize ``joint_wald_survey`` / ``joint_wald_event_study_survey``, so BR prose fell through to the generic "Pre-treatment data" / "joint p" default; * ``_lift_pre_trends`` didn't preserve ``df_denom`` in the BR schema, so downstream consumers couldn't see the finite-sample correction without re-consulting the DR schema; * ``_format_precomputed_pt`` didn't carry ``df_denom`` on replay, so a survey-aware DR block round-tripped as a chi-square-style passthrough. All three helpers now recognize / preserve the survey variants. Tests: 7 new regressions. * ``TestCSRepeatedCrossSectionCountLabels`` (3 tests): schema flag, panel-mode wording, RCS-mode wording; * ``TestTROPApplicableChecksExcludesParallelTrends`` (1 test): TROP DR exposes no PT in applicable_checks; * ``TestSurveyPTProsePropagation`` (2 tests): ``_lift_pre_trends`` preserves ``df_denom``, and method helpers return "joint p" + event-study subject for both survey variants; * ``test_precomputed_survey_pt_replay_preserves_df_denom`` (DR): round-trip replay of a ``joint_wald_event_study_survey`` block preserves ``method``, ``df_denom``, and ``df``. 249 BR / DR / practitioner tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent e3f9bfe commit cd85eda

4 files changed

Lines changed: 237 additions & 10 deletions

File tree

diff_diff/business_report.py

Lines changed: 69 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -585,6 +585,18 @@ def _extract_sample(self) -> Dict[str, Any]:
585585
n_never_treated = n_control_units
586586
n_control = None
587587

588+
# Panel-vs-RCS count semantics. CallawaySantAnnaResults stores
589+
# treated/control counts as OBSERVATIONS (not units) when the
590+
# fit used ``panel=False`` — ``staggered_results.py L183-L184``
591+
# renders those counts as "obs:" rather than "units:". BR
592+
# previously labeled them as "units" / "present in the panel",
593+
# which misstates the sample composition for repeated cross-
594+
# section fits. Carry the flag into the schema so rendering can
595+
# branch. Round-28 P2 CI review on PR #318.
596+
count_unit = (
597+
"observations" if getattr(r, "panel", True) is False else "units"
598+
)
599+
588600
sample_block: Dict[str, Any] = {
589601
"n_obs": _safe_int(getattr(r, "n_obs", None)),
590602
"n_treated": n_treated,
@@ -595,6 +607,7 @@ def _extract_sample(self) -> Dict[str, Any]:
595607
"n_periods": _safe_int(getattr(r, "n_periods", None)),
596608
"pre_periods": _safe_list_len(getattr(r, "pre_periods", None)),
597609
"post_periods": _safe_list_len(getattr(r, "post_periods", None)),
610+
"count_unit": count_unit,
598611
"survey": survey,
599612
}
600613
if n_never_enabled is not None:
@@ -685,6 +698,11 @@ def _lift_pre_trends(dr: Optional[Dict[str, Any]]) -> Dict[str, Any]:
685698
"joint_p_value": pt.get("joint_p_value"),
686699
"verdict": pt.get("verdict"),
687700
"n_pre_periods": pt.get("n_pre_periods"),
701+
# Carry the denominator df through when the survey F-reference
702+
# branch was used so BR consumers can flag the finite-sample
703+
# correction without re-consulting the DR schema (round-28 P3
704+
# CI review on PR #318).
705+
"df_denom": pt.get("df_denom"),
688706
"power_status": pp.get("status"),
689707
"power_tier": pp.get("tier"),
690708
"mdv": pp.get("mdv"),
@@ -1356,7 +1374,19 @@ def _pt_method_subject(method: Optional[str]) -> str:
13561374
return "The pre-period slope-difference test"
13571375
if method == "hausman":
13581376
return "The Hausman PT-All vs PT-Post pretest"
1359-
if method in {"joint_wald", "joint_wald_event_study", "joint_wald_no_vcov", "bonferroni"}:
1377+
if method in {
1378+
"joint_wald",
1379+
"joint_wald_event_study",
1380+
"joint_wald_no_vcov",
1381+
"bonferroni",
1382+
# Survey-aware event-study PT variants use an F reference
1383+
# distribution with denominator df = ``survey_metadata.df_survey``
1384+
# (round-27 P1 fix, documented in REPORTING.md). The subject
1385+
# remains the pre-period event-study coefficients; prose elsewhere
1386+
# flags the finite-sample correction via ``df_denom``.
1387+
"joint_wald_survey",
1388+
"joint_wald_event_study_survey",
1389+
}:
13601390
return "Pre-treatment event-study coefficients"
13611391
if method == "synthetic_fit":
13621392
return "The synthetic-control pre-treatment fit"
@@ -1368,11 +1398,21 @@ def _pt_method_subject(method: Optional[str]) -> str:
13681398
def _pt_method_stat_label(method: Optional[str]) -> Optional[str]:
13691399
"""Return the joint-statistic label appropriate to the PT method.
13701400
1371-
Returns ``"joint p"`` for Wald / Bonferroni paths, ``"p"`` for the
1372-
2x2 slope-difference and Hausman paths (which are single-statistic
1373-
tests), and ``None`` for design-enforced paths that have no p-value.
1401+
Returns ``"joint p"`` for Wald / Bonferroni paths (including the
1402+
survey-aware F-reference variants, which remain joint tests on the
1403+
pre-period coefficient vector — only the reference distribution
1404+
changes), ``"p"`` for the 2x2 slope-difference and Hausman paths
1405+
(single-statistic tests), and ``None`` for design-enforced paths
1406+
that have no p-value.
13741407
"""
1375-
if method in {"joint_wald", "joint_wald_event_study", "joint_wald_no_vcov", "bonferroni"}:
1408+
if method in {
1409+
"joint_wald",
1410+
"joint_wald_event_study",
1411+
"joint_wald_no_vcov",
1412+
"bonferroni",
1413+
"joint_wald_survey",
1414+
"joint_wald_event_study_survey",
1415+
}:
13761416
return "joint p"
13771417
if method in {"slope_difference", "hausman"}:
13781418
return "p"
@@ -1716,14 +1756,22 @@ def _render_summary(schema: Dict[str, Any]) -> str:
17161756
n_ne = sample.get("n_never_enabled")
17171757
is_dynamic = sample.get("dynamic_control")
17181758
cg = sample.get("control_group")
1759+
# Panel-vs-RCS count-unit label. For repeated cross-section fits
1760+
# (``panel=False`` on CallawaySantAnna), treated / never-treated
1761+
# tallies are observation counts, not unit counts. Keep the
1762+
# "N treated" phrasing (the N is still correct), but adjust the
1763+
# never-treated clause so it does not claim "units present in
1764+
# the panel" for an RCS sample.
1765+
count_unit = sample.get("count_unit", "units")
1766+
ne_unit_word = "observations" if count_unit == "observations" else "units"
17191767
if isinstance(n_obs, int):
17201768
if isinstance(n_t, int) and isinstance(n_c, int):
17211769
sentences.append(f"Sample: {n_obs:,} observations ({n_t:,} treated, {n_c:,} control).")
17221770
elif is_dynamic and isinstance(n_t, int):
17231771
if isinstance(n_ne, int) and n_ne > 0:
1724-
subset_clause = f"; {n_ne:,} never-enabled units are also present"
1772+
subset_clause = f"; {n_ne:,} never-enabled {ne_unit_word} are also present"
17251773
elif isinstance(n_nt, int) and n_nt > 0:
1726-
subset_clause = f"; {n_nt:,} never-treated units are also present"
1774+
subset_clause = f"; {n_nt:,} never-treated {ne_unit_word} are also present"
17271775
else:
17281776
subset_clause = ""
17291777
# Estimator-specific dynamic-comparison phrasing. StackedDiD
@@ -1904,16 +1952,28 @@ def _render_full_report(schema: Dict[str, Any]) -> str:
19041952
estimator_block.get("class_name") if isinstance(estimator_block, dict) else None
19051953
)
19061954
cg = sample.get("control_group")
1955+
# Panel-vs-RCS count-unit label for the full report. Mirrors the
1956+
# summary path: CallawaySantAnna's ``panel=False`` mode stores
1957+
# counts as observations, not units (round-28 P2).
1958+
md_count_unit = sample.get("count_unit", "units")
1959+
md_ne_unit_word = "observations" if md_count_unit == "observations" else "units"
1960+
md_sample_location = (
1961+
"in the repeated cross-section sample"
1962+
if md_count_unit == "observations"
1963+
else "in the panel"
1964+
)
19071965
if isinstance(sample.get("n_control"), int):
19081966
lines.append(f"- Control: {sample['n_control']:,}")
19091967
elif sample.get("dynamic_control"):
19101968
if isinstance(sample.get("n_never_enabled"), int) and sample["n_never_enabled"] > 0:
19111969
lines.append(
1912-
f"- Never-enabled units present in the panel: {sample['n_never_enabled']:,}"
1970+
f"- Never-enabled {md_ne_unit_word} present "
1971+
f"{md_sample_location}: {sample['n_never_enabled']:,}"
19131972
)
19141973
elif isinstance(sample.get("n_never_treated"), int) and sample["n_never_treated"] > 0:
19151974
lines.append(
1916-
f"- Never-treated units present in the panel: {sample['n_never_treated']:,}"
1975+
f"- Never-treated {md_ne_unit_word} present "
1976+
f"{md_sample_location}: {sample['n_never_treated']:,}"
19171977
)
19181978
if estimator_name == "StackedDiDResults":
19191979
n_distinct = sample.get("n_distinct_controls_trimmed")

diff_diff/diagnostic_report.py

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -141,8 +141,14 @@
141141
{"parallel_trends", "sensitivity", "design_effect", "estimator_native"}
142142
),
143143
"TROPResults": frozenset(
144+
# TROP identification is factor-model-based, not parallel-trends-
145+
# based: the estimator native ``_pt_factor()`` handler returns
146+
# ``status="not_applicable"``, and REPORTING.md routes TROP PT
147+
# to factor-model diagnostics instead. Exposing PT in
148+
# ``applicable_checks`` advertised a handler that never runs —
149+
# round-28 P2 CI review on PR #318 flagged the contract mismatch
150+
# for callers who gate workflows on ``applicable_checks``.
144151
{
145-
"parallel_trends",
146152
"sensitivity",
147153
"design_effect",
148154
"heterogeneity",
@@ -2113,6 +2119,13 @@ def _read(name: str) -> Any:
21132119
out["test_statistic"] = test_statistic
21142120
if df is not None:
21152121
out["df"] = df
2122+
# Preserve the survey-F denominator df when replaying a schema-
2123+
# shaped PT block from the default path (round-28 P3 CI review
2124+
# on PR #318). Without this, the finite-sample correction
2125+
# recorded on the source block is silently dropped at replay.
2126+
df_denom = _to_python_float(_read("df_denom"))
2127+
if df_denom is not None:
2128+
out["df_denom"] = df_denom
21162129
return out
21172130

21182131
# -- Headline metric extraction ----------------------------------------

tests/test_business_report.py

Lines changed: 131 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2396,6 +2396,137 @@ def test_efficient_compare_control_groups_persists_after_sensitivity_runs(self):
23962396
)
23972397

23982398

2399+
class TestCSRepeatedCrossSectionCountLabels:
2400+
"""Round-28 P2 CI review on PR #318: ``CallawaySantAnna(panel=False)``
2401+
stores treated / control counts as OBSERVATIONS, not units
2402+
(``staggered_results.py L183-L184`` renders them as "obs:" in that
2403+
mode). BR previously labeled them as "units" / "present in the
2404+
panel", which misstates the sample composition on repeated-cross-
2405+
section fits. The schema now carries a ``count_unit`` flag and the
2406+
rendering branches on it.
2407+
"""
2408+
2409+
@staticmethod
2410+
def _stub(panel: bool):
2411+
class CallawaySantAnnaResults:
2412+
pass
2413+
2414+
stub = CallawaySantAnnaResults()
2415+
stub.overall_att = 1.0
2416+
stub.overall_se = 0.2
2417+
stub.overall_p_value = 0.001
2418+
stub.overall_conf_int = (0.6, 1.4)
2419+
stub.alpha = 0.05
2420+
stub.n_obs = 1000
2421+
stub.n_treated_units = 200
2422+
stub.n_control_units = 800
2423+
stub.survey_metadata = None
2424+
stub.event_study_effects = None
2425+
stub.control_group = "not_yet_treated"
2426+
stub.panel = panel
2427+
return stub
2428+
2429+
def test_schema_exposes_count_unit(self):
2430+
for panel, expected in [(True, "units"), (False, "observations")]:
2431+
sample = BusinessReport(
2432+
self._stub(panel), auto_diagnostics=False
2433+
).to_dict()["sample"]
2434+
assert sample["count_unit"] == expected
2435+
2436+
def test_panel_true_renders_unit_wording(self):
2437+
br = BusinessReport(self._stub(panel=True), auto_diagnostics=False)
2438+
summary = br.summary()
2439+
md = br.full_report()
2440+
assert "never-treated units" in summary
2441+
assert "present in the panel" in md
2442+
assert "repeated cross-section sample" not in md
2443+
2444+
def test_panel_false_renders_rcs_wording(self):
2445+
br = BusinessReport(self._stub(panel=False), auto_diagnostics=False)
2446+
summary = br.summary()
2447+
md = br.full_report()
2448+
# RCS-specific wording in both surfaces.
2449+
assert "never-treated observations" in summary
2450+
assert "repeated cross-section sample" in md
2451+
# No misleading "units" or "panel" claims.
2452+
assert "never-treated units" not in summary
2453+
assert "present in the panel" not in md
2454+
2455+
2456+
class TestTROPApplicableChecksExcludesParallelTrends:
2457+
"""Round-28 P2 CI review on PR #318: TROP identification is
2458+
factor-model-based; its native PT handler returns
2459+
``status="not_applicable"``. Advertising ``parallel_trends`` in
2460+
``DiagnosticReport.applicable_checks`` for TROP was a contract
2461+
mismatch for callers using that set to gate workflows or UI.
2462+
"""
2463+
2464+
def test_trop_applicable_checks_omits_parallel_trends(self):
2465+
from diff_diff import DiagnosticReport
2466+
2467+
class TROPResults:
2468+
pass
2469+
2470+
stub = TROPResults()
2471+
stub.overall_att = 1.0
2472+
stub.overall_se = 0.2
2473+
stub.alpha = 0.05
2474+
stub.n_obs = 100
2475+
2476+
dr = DiagnosticReport(stub)
2477+
assert "parallel_trends" not in dr.applicable_checks, (
2478+
"TROP PT routes to factor-model diagnostics and is "
2479+
"not_applicable; it must not appear in applicable_checks."
2480+
)
2481+
2482+
2483+
class TestSurveyPTProsePropagation:
2484+
"""Round-28 P3 CI review on PR #318: the survey F-reference PT
2485+
variants (``joint_wald_survey``, ``joint_wald_event_study_survey``)
2486+
must carry through BR's method-aware label helpers so prose uses
2487+
"joint p" (not the fall-through default) and preserves the
2488+
``df_denom`` provenance in the BR schema.
2489+
"""
2490+
2491+
def test_lift_pre_trends_preserves_df_denom(self):
2492+
from diff_diff.business_report import _lift_pre_trends
2493+
2494+
fake_dr = {
2495+
"parallel_trends": {
2496+
"status": "ran",
2497+
"method": "joint_wald_event_study_survey",
2498+
"joint_p_value": 0.35,
2499+
"df_denom": 30.0,
2500+
"n_pre_periods": 3,
2501+
"verdict": "no_detected_violation",
2502+
},
2503+
"pretrends_power": {"status": "not_applicable"},
2504+
}
2505+
lifted = _lift_pre_trends(fake_dr)
2506+
assert lifted["method"] == "joint_wald_event_study_survey"
2507+
assert lifted["df_denom"] == 30.0
2508+
2509+
def test_survey_pt_method_stat_label_uses_joint_p(self):
2510+
from diff_diff.business_report import (
2511+
_pt_method_stat_label,
2512+
_pt_method_subject,
2513+
)
2514+
2515+
for method in ("joint_wald_survey", "joint_wald_event_study_survey"):
2516+
assert _pt_method_stat_label(method) == "joint p", (
2517+
f"Survey PT variant {method!r} must map to 'joint p' "
2518+
f"(the joint test remains; only the reference "
2519+
f"distribution changes)."
2520+
)
2521+
assert (
2522+
_pt_method_subject(method)
2523+
== "Pre-treatment event-study coefficients"
2524+
), (
2525+
f"Survey PT variant {method!r} must use the event-study "
2526+
f"subject phrase, not the generic fall-through."
2527+
)
2528+
2529+
23992530
class TestSDiDJackknifeStepPersistsAfterNativeSensitivity:
24002531
"""Round-24 P2 CI review on PR #318: the SyntheticDiD practitioner
24012532
step "Leave-one-out influence (jackknife)" must persist after

tests/test_diagnostic_report.py

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -864,6 +864,29 @@ def test_joint_wald_uses_F_reference_when_survey_df_is_finite(self):
864864
# back to chi-square.
865865
assert expected_p_survey > expected_p_chi2
866866

867+
def test_precomputed_survey_pt_replay_preserves_df_denom(self, cs_fit):
868+
"""Round-28 P3 regression: a schema-shaped PT block carrying the
869+
survey ``df_denom`` and ``_survey`` method suffix must round-trip
870+
through ``precomputed={"parallel_trends": ...}`` without losing
871+
the finite-sample provenance. Previously ``_format_precomputed_pt``
872+
dropped ``df_denom``, so replaying a survey-aware DR block
873+
silently demoted it to a chi-square-style passthrough.
874+
"""
875+
fit, _ = cs_fit
876+
survey_pt = {
877+
"method": "joint_wald_event_study_survey",
878+
"joint_p_value": 0.18,
879+
"test_statistic": 5.2,
880+
"df": 3,
881+
"df_denom": 20.0,
882+
}
883+
dr = DiagnosticReport(fit, precomputed={"parallel_trends": survey_pt})
884+
pt = dr.to_dict()["parallel_trends"]
885+
assert pt["status"] == "ran"
886+
assert pt["method"] == "joint_wald_event_study_survey"
887+
assert pt["df_denom"] == 20.0
888+
assert pt["df"] == 3
889+
867890
def test_joint_wald_ignores_non_finite_survey_df(self):
868891
"""If ``df_survey`` is NaN / inf / non-positive, fall back to
869892
chi-square (no finite-sample correction available).

0 commit comments

Comments
 (0)