
Commit 73ef44c

igerber and claude committed
Add BR/DR canonical-dataset validation + two wording fixes
Closes BR/DR foundation gap #4 (real-dataset validation) from the external-positioning gap list in ``project_br_dr_foundation.md``.

Validation artifacts:

- ``docs/validation/validate_br_dr_canonical.py`` runs BusinessReport / DiagnosticReport on Card-Krueger (1994), mpdta (Callaway-Sant'Anna 2021 benchmark), and Castle Doctrine (Cheng-Hoekstra 2013, under both CS and SA), dumping summary + full_report + selected to_dict blocks for each.
- ``docs/validation/br_dr_canonical_validation.md`` is the regenerable raw output.
- ``docs/validation/br_dr_canonical_findings.md`` is the hand-written synthesis: direction / verdict / sensitivity tier all match canonical interpretations, with two small wording bugs surfaced and fixed in this PR and two larger gaps queued as follow-up (SA HonestDiD applicability, target-parameter disambiguation).

Wording fixes:

1. Treatment-label capitalization. ``str.capitalize()`` lowercased every character after the first, flattening embedded abbreviations (``"the NJ minimum-wage increase"`` → ``"The nj minimum-wage increase"``) and proper-noun phrases (``"Castle Doctrine law adoption"`` → ``"Castle doctrine law adoption"``). Replaced with a ``_sentence_first_upper`` helper that preserves user-supplied casing.
2. ``breakdown_M == 0`` phrasing. The HonestDiD fragile sentence quoted ``{breakdown_M:.2g}x the pre-period variation``, which renders as a degenerate ``0x`` in the exact-zero case surfaced by Cheng-Hoekstra. At ``breakdown_M <= 0.05`` (covering 0 and near-zero values), both BR's summary and DR's overall_interpretation now say "includes zero even at the smallest parallel-trends violations on the sensitivity grid" instead.

Tests: 5 new regressions in ``TestCanonicalValidationSurfaceFixes`` covering both fixes plus three boundary cases (exact zero, small positive, normal fragile value).
Not in scope: Favara-Imbs (dCDH reversible-treatment dataset not bundled), ImputationDiD / TwoStageDiD on canonical data (needed to exercise the R42 untreated-outcome FE assumption branch on real data), and the SA HonestDiD applicability gap. All tracked in the findings doc for follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent cbb8814 commit 73ef44c

6 files changed

Lines changed: 1446 additions & 13 deletions

File tree

diff_diff/business_report.py

Lines changed: 48 additions & 6 deletions
```diff
@@ -1854,6 +1854,24 @@ def _significance_phrase(p: Optional[float], alpha: float) -> str:
     return "the confidence interval includes zero; the data are consistent with no effect"
 
 
+def _sentence_first_upper(text: str) -> str:
+    """Uppercase only the first character of ``text``, preserving all
+    other casing. Unlike ``str.capitalize()``, which lowercases every
+    character after the first, this keeps user-supplied abbreviations
+    and proper nouns intact.
+
+    Examples
+    --------
+    >>> _sentence_first_upper("the NJ minimum-wage increase")
+    'The NJ minimum-wage increase'
+    >>> _sentence_first_upper("Castle Doctrine law adoption")
+    'Castle Doctrine law adoption'
+    """
+    if not text:
+        return text
+    return text[0].upper() + text[1:]
+
+
 def _direction_verb(effect: float, outcome_direction: Optional[str]) -> str:
     """Return a direction-aware verb for the headline sentence.
@@ -1929,7 +1947,16 @@ def _render_headline_sentence(schema: Dict[str, Any]) -> str:
         # is not actually available.
         ci_str = " (inference unavailable: confidence interval is undefined for this fit)"
     by_clause = f" by {magnitude}" if effect != 0 else ""
-    return f"{treatment.capitalize()} {verb} {outcome}{by_clause}{ci_str}."
+    # Round-1 BR/DR canonical-validation (2026-04-19): Python's
+    # ``str.capitalize()`` lowercases everything except the first
+    # character, so ``"the NJ minimum-wage increase".capitalize()``
+    # returns ``"The nj minimum-wage increase"`` — flattening the
+    # ``NJ`` abbreviation. Real canonical datasets (Card-Krueger,
+    # Castle Doctrine) carry proper-noun / acronym tokens in the
+    # user-supplied ``treatment_label``, so preserve user casing and
+    # only ensure the first character is uppercase.
+    treatment_sentence = _sentence_first_upper(treatment)
+    return f"{treatment_sentence} {verb} {outcome}{by_clause}{ci_str}."
 
 
 def _render_summary(schema: Dict[str, Any]) -> str:
@@ -2088,11 +2115,26 @@ def _render_summary(schema: Dict[str, Any]) -> str:
                 f"pre-period variation."
             )
         elif isinstance(bkd, (int, float)):
-            sentences.append(
-                f"HonestDiD: the result is fragile — the confidence interval "
-                f"includes zero once violations reach {bkd:.2g}x the "
-                f"pre-period variation."
-            )
+            # Round-1 BR/DR canonical-validation (2026-04-19):
+            # ``breakdown_M`` at or near zero reads as "0x the
+            # pre-period variation" which is a degenerate sentence
+            # (zero-times-anything is zero). The correct wording when
+            # the CI includes zero at the smallest grid point is to
+            # say the result is fragile to essentially any nonzero
+            # violation, not to quote the ``0x`` multiplier.
+            if bkd <= 0.05:
+                sentences.append(
+                    "HonestDiD: the result is fragile — the confidence "
+                    "interval includes zero even at the smallest "
+                    "parallel-trends violations on the sensitivity "
+                    "grid."
+                )
+            else:
+                sentences.append(
+                    f"HonestDiD: the result is fragile — the confidence "
+                    f"interval includes zero once violations reach {bkd:.2g}x "
+                    f"the pre-period variation."
+                )
 
     # Sample sentence. For fits with a dynamic comparison set (CS /
     # ContinuousDiD / StaggeredTripleDiff / EfficientDiD /
```

diff_diff/diagnostic_report.py

Lines changed: 24 additions & 7 deletions
```diff
@@ -3118,13 +3118,30 @@ def _render_overall_interpretation(schema: Dict[str, Any], labels: Dict[str, str
                 f"pre-period variation."
             )
         else:
-            sentences.append(
-                f"HonestDiD sensitivity: the result is fragile — the "
-                f"confidence interval includes zero once violations reach "
-                f"{bkd:.2g}x the pre-period variation."
-                if isinstance(bkd, (int, float))
-                else ""
-            )
+            # Round-1 BR/DR canonical-validation (2026-04-19): the
+            # "fragile — CI includes zero once violations reach 0x
+            # the pre-period variation" wording is a degenerate
+            # sentence at the ``breakdown_M == 0`` edge case
+            # surfaced by the Cheng-Hoekstra (2013) Castle Doctrine
+            # dataset. Mirror BR's fix: when the breakdown value is
+            # at or near zero, say the CI includes zero at the
+            # smallest grid point rather than quoting a ``0x``
+            # multiplier.
+            if isinstance(bkd, (int, float)):
+                if bkd <= 0.05:
+                    sentences.append(
+                        "HonestDiD sensitivity: the result is fragile — "
+                        "the confidence interval includes zero even at "
+                        "the smallest parallel-trends violations on the "
+                        "sensitivity grid."
+                    )
+                else:
+                    sentences.append(
+                        f"HonestDiD sensitivity: the result is fragile — "
+                        f"the confidence interval includes zero once "
+                        f"violations reach {bkd:.2g}x the pre-period "
+                        f"variation."
+                    )
 
     # Sentence 4: one secondary caveat if present.
     bacon = schema.get("bacon") or {}
```
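The BR and DR hunks implement the same thresholded wording. A minimal standalone sketch of that branch logic (the function name and `prefix` parameter are mine; the strings and the 0.05 threshold follow the diff):

```python
from typing import Any

def fragile_sentence(bkd: Any, prefix: str = "HonestDiD") -> str:
    # Sketch of the breakdown_M wording branch from this commit: at or
    # near zero (<= 0.05), avoid quoting a degenerate "0x" multiplier.
    if not isinstance(bkd, (int, float)):
        return ""
    if bkd <= 0.05:
        return (f"{prefix}: the result is fragile — the confidence interval "
                "includes zero even at the smallest parallel-trends "
                "violations on the sensitivity grid.")
    return (f"{prefix}: the result is fragile — the confidence interval "
            f"includes zero once violations reach {bkd:.2g}x "
            f"the pre-period variation.")

print(fragile_sentence(0.0))   # grid wording, no "0x" multiplier
print(fragile_sentence(1.28))  # quotes "1.3x the pre-period variation"
```

Note that `{bkd:.2g}` renders `0` as plain `0`, which is exactly how the degenerate `0x` surfaced on the Castle Doctrine run.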
docs/validation/br_dr_canonical_findings.md

Lines changed: 175 additions & 0 deletions
# BR / DR canonical-dataset validation findings

This file records divergences observed in ``br_dr_canonical_validation.md`` against canonical literature interpretations. Generated by running ``docs/validation/validate_br_dr_canonical.py`` on the bundled datasets (Card-Krueger 1994, the Callaway-Sant'Anna mpdta benchmark, Castle Doctrine / Cheng-Hoekstra 2013). This closes BR/DR foundation gap #4 — real-dataset validation — from the external-positioning gap list in ``project_br_dr_foundation.md``.

The goal of the validation exercise is to stress-test BR's prose on fits that published applied work has already interpreted, not to reproduce their point estimates exactly (the bundled datasets are either the R `did` package simulated benchmark or the causaldata mirrors, which may differ from the original author data).

## Headline assessment

BR's prose direction, verdicts, and caveat framing match canonical interpretations across all four runs:

- **Card-Krueger**: positive sign, CI includes zero, "data consistent with no effect." Matches the famous Card-Krueger finding of no disemployment.
- **mpdta (CS)**: aggregate ATT negative (-0.021 log-points), pre-trends `no_detected_violation`, HonestDiD `robust_to_M_1.28`. Matches CS tutorial expectations that the fit is robust.
- **Castle Doctrine (CS)**: positive sign (homicides went up), pre-trends `clear_violation` (joint p = 0.003), HonestDiD `fragile` (breakdown_M = 0). Matches Cheng-Hoekstra's escalation finding AND correctly flags the identifying-assumption fragility the staggered rollout produces.
- **Castle Doctrine (SA)**: identical point estimates (as expected — CS and SA are algebraically consistent on this data), same clear PT-violation verdict.

No wrong-sign or wrong-verdict findings surfaced on any of the four runs. The Bacon "already-robust" framing lifted from round-45 reads correctly on the staggered fits (CS and SA on Castle Doctrine and mpdta): the caveat is scoped as a statement about the rollout design, not a switch-estimator recommendation.

## Issues fixed in this PR

Small prose bugs surfaced by the real-data output. Each is a wording fix, not a methodology defect. Both fixes are regression-tested under ``tests/test_business_report.py::TestCanonicalValidationSurfaceFixes``.

### Issue 1 (FIXED): Treatment-label first-word capitalization eats abbreviations

Card-Krueger output:

> The **nj** minimum-wage increase lifted FTE employment by 1.47 FTE …

Castle Doctrine output (CS and SA):

> **Castle doctrine** law adoption worsened Homicide rate (per 100k) …

BR used ``str.capitalize()``, which lowercases every character after the first. For labels starting with an abbreviation (``"the NJ minimum-wage increase"``) or a proper-noun phrase (``"Castle Doctrine law adoption"``), this flattened case in a way that looked wrong in stakeholder-facing prose.

**Fix**: replaced ``str.capitalize()`` with a new ``_sentence_first_upper`` helper that uppercases only the first character and preserves user-supplied casing for everything else.

### Issue 2 (FIXED): ``breakdown_M = 0`` phrasing reads as "0x" (zero times something)

Castle Doctrine (CS):

> HonestDiD: the result is fragile — the confidence interval includes zero once violations reach **0x** the pre-period variation.

When the breakdown M is exactly 0, "reach 0x the pre-period variation" is a degenerate reading — the CI already includes zero under any nonzero pre-trend violation (or even with zero violation, depending on the grid).

**Fix**: when ``breakdown_M <= 0.05``, both BR's summary and DR's overall-interpretation sentence emit "the confidence interval includes zero even at the smallest parallel-trends violations on the sensitivity grid" instead of quoting the ``0x`` multiplier. The 0.05 threshold also covers near-zero values (e.g., 0.03) where the multiplier is equally uninformative to stakeholders.

### Issue 3 (deferred): Outcome-label capitalization mid-sentence

mpdta output:

> … reduced **Log employment** by 0.0214 log-points …

The user-supplied ``outcome_label="Log employment"`` is rendered as-is, which looks awkward mid-sentence. This is stylistic and arguably user-controllable (the user could pass ``outcome_label="log employment"``). Deprioritized unless the Issue 1 fix extends to it trivially. Noted here for follow-up.

## Issues to track as follow-up (out of scope for this PR)

### Follow-up A: ``SunAbrahamResults`` excluded from HonestDiD applicability

The applicability matrix (``diagnostic_report.py``, line ~107) lists ``SunAbrahamResults`` with ``{parallel_trends, pretrends_power, bacon, design_effect, heterogeneity}`` — no ``sensitivity``. But the original plan's applicability matrix in ``project_br_dr_foundation.md`` and the SA methodology surface (event-study coefficients + VCov) both support HonestDiD in principle.

Observed on Castle Doctrine (SA):

> ## Sensitivity (HonestDiD)
>
> - Sensitivity not computed: sensitivity is not applicable to SunAbrahamResults.

Given that SA shows the same PT violation that CS does on this dataset, not having HonestDiD sensitivity on SA is a real usability gap. Fixing it requires adding an SA adapter to ``compute_honest_did`` and expanding the applicability matrix; that is library work beyond BR/DR prose. Belongs in the BR/DR gap-list expansion.

### Follow-up B: Target-parameter clarity (gap-list item #6)

The assumption block on every staggered fit still reads:

> Identification relies on parallel trends across treatment cohorts and time periods (group-time ATT), plus no anticipation.

But the CS ``overall_att`` for mpdta is a specific weighted average of ``ATT(g, t)`` cells, SA is an IW average, Stacked is a sub-experiment-weighted average, and dCDH is a switchers average. BR's headline reports a single number without disambiguating the estimand. For Baker et al. (2025) practitioner-guide parity, the assumption block should carry the target-parameter clause. Already tracked as gap #6.

### Follow-up C: Card-Krueger effect size differs from the published ATT

Our bundled ``load_card_krueger`` returns an ATT of +1.47 FTE; the published Card-Krueger ATT is ~+0.59 FTE. The direction and the CI-includes-zero verdict match the canonical reading, but the magnitude does not. This is a ``datasets.py`` data-loading question (the causaldata mirror may aggregate differently than the original author sample), not a BR prose bug. Noted here so a future data-validation PR can address it upstream.

## What was validated (summary)

- The end-to-end BR / DR flow runs without errors on 4 canonical datasets.
- The direction of the effect matches the canonical interpretation on all 4.
- The pre-trends verdict tier (no_detected_violation / clear_violation) matches the literature's reading.
- The HonestDiD sensitivity tier (robust vs fragile) matches.
- The Bacon "already-robust" framing from round-45 reads correctly on real staggered data.
- The identifying-assumption source-faithfulness retags from round-42 (BJS / Gardner untreated-outcome FE model) did not surface on these runs because none of the datasets was run through ImputationDiD or TwoStageDiD — follow-up validation should add those.

## Regeneration

```bash
python docs/validation/validate_br_dr_canonical.py
```

The script writes ``br_dr_canonical_validation.md`` (the raw output artifact); this file is the findings synthesis, written by hand from that artifact.
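The three boundary cases named in the commit message (exact zero, small positive, normal fragile value) could be sketched as self-contained regressions like the following. This is a hypothetical illustration: the threshold logic is inlined rather than imported, and the test names are mine; the real suite lives in ``tests/test_business_report.py::TestCanonicalValidationSurfaceFixes`` and exercises the library renderers directly.

```python
import unittest

def fragile_clause(bkd: float) -> str:
    # Inlined copy of the breakdown_M threshold from this commit,
    # not an import from the library.
    if bkd <= 0.05:
        return ("the confidence interval includes zero even at the smallest "
                "parallel-trends violations on the sensitivity grid")
    return (f"the confidence interval includes zero once violations "
            f"reach {bkd:.2g}x the pre-period variation")

class TestBreakdownBoundaries(unittest.TestCase):
    def test_exact_zero_avoids_degenerate_multiplier(self):
        self.assertNotIn("0x", fragile_clause(0.0))

    def test_small_positive_uses_grid_wording(self):
        self.assertIn("smallest parallel-trends violations",
                      fragile_clause(0.03))

    def test_normal_fragile_value_quotes_multiplier(self):
        self.assertIn("0.8x the pre-period variation", fragile_clause(0.8))

# run with: python -m unittest <this module>
```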
