Commit f3c17ff

igerber and claude committed
Fix staggered pre-trends guidance and add with/without covariates example
P1: For staggered designs, replace generic check_parallel_trends() with CS event-study pre-period inspection. The generic helper assumes a single binary treatment with universal pre-periods, which is invalid when cohorts adopt at different times (early-cohort post-treatment observations contaminate the "pre-trend"). The guide and practitioner_next_steps() now emit staggered-specific guidance using event_study_effects pre-periods.

P2: Complete example now includes REQUIRED with/without covariates comparison (was missing despite Step 8 marking it mandatory).

P2: Evaluation doc model-mix caveat added, residual "Step 4" references changed to "Step 5" to match canonical numbering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
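
The P1 rationale above can be illustrated with a small sketch of the pre-period inspection. It assumes only what the guide text in this commit shows: `event_study_effects` maps a relative period to a dict with `'effect'` and `'se'` keys. The helper name `flag_pre_trends`, the mock numbers, and the pointwise normal-approximation p-value are assumptions made for illustration and are not part of `diff_diff`.

```python
from math import erf, sqrt


def flag_pre_trends(event_study_effects, alpha=0.05):
    """Return (rel_t, effect, se, p_value, flagged) for each pre-period.

    Hypothetical helper, not part of diff_diff. It uses a pointwise,
    two-sided normal approximation, which is an assumption of this sketch.
    """
    rows = []
    for rel_t, eff in sorted(event_study_effects.items()):
        if rel_t >= 0:
            continue  # keep only periods before treatment
        effect, se = eff["effect"], eff["se"]
        if se <= 0:
            # No usable standard error: report NaN and do not flag.
            rows.append((rel_t, effect, se, float("nan"), False))
            continue
        z = effect / se
        normal_cdf = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))
        p_value = 2.0 * (1.0 - normal_cdf)
        rows.append((rel_t, effect, se, p_value, p_value < alpha))
    return rows


# Made-up numbers for illustration only; not results from any dataset.
mock_effects = {
    -3: {"effect": 0.004, "se": 0.010},
    -2: {"effect": -0.030, "se": 0.010},
    -1: {"effect": 0.001, "se": 0.012},
    0: {"effect": -0.020, "se": 0.011},
    1: {"effect": -0.050, "se": 0.013},
}

for rel_t, effect, se, p_value, flagged in flag_pre_trends(mock_effects):
    note = "possible pre-trend" if flagged else "ok"
    print(f"Pre-period {rel_t}: ATT={effect:.4f}, SE={se:.4f}, p={p_value:.4f} ({note})")
```

Pointwise tests over several pre-periods raise the chance of a false alarm, and a clean result does not establish parallel trends after treatment, so a check like this complements rather than replaces the HonestDiD step the guide describes.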
1 parent d350850 commit f3c17ff

3 files changed

Lines changed: 72 additions & 22 deletions

diff_diff/practitioner.py

Lines changed: 25 additions & 7 deletions
@@ -134,7 +134,25 @@ def _step(
 # ---------------------------------------------------------------------------
 # Common steps reused across handlers
 # ---------------------------------------------------------------------------
-def _parallel_trends_step() -> Dict[str, Any]:
+def _parallel_trends_step(staggered: bool = False) -> Dict[str, Any]:
+    if staggered:
+        return _step(
+            baker_step=3,
+            label="Test parallel trends (event-study pre-periods)",
+            why=(
+                "For staggered designs, inspect CS event-study pre-period "
+                "coefficients rather than the generic check_parallel_trends() "
+                "which assumes a single binary treatment with universal "
+                "pre-periods. Pre-treatment ATTs should be near zero."
+            ),
+            code=(
+                "# Fit with aggregate='event_study' or 'all', then inspect:\n"
+                "for rel_t, eff in sorted(results.event_study_effects.items()):\n"
+                "    if rel_t < 0:\n"
+                "        print(f'Pre {rel_t}: ATT={eff[\"effect\"]:.4f}')"
+            ),
+            step_name="parallel_trends",
+        )
     return _step(
         baker_step=3,
         label="Test parallel trends assumption",
@@ -265,7 +283,7 @@ def _handle_multi_period(results: Any):
 
 def _handle_cs(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _step(
             baker_step=6,
             label="Run HonestDiD sensitivity analysis",
@@ -308,7 +326,7 @@ def _handle_cs(results: Any):
 
 def _handle_sa(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _placebo_step(),
         _robustness_compare_step("CS, BJS, or Gardner"),
         _covariates_step(),
@@ -319,7 +337,7 @@ def _handle_sa(results: Any):
 
 def _handle_imputation(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _placebo_step(),
         _robustness_compare_step("CS, SA, or Gardner"),
         _covariates_step(),
@@ -330,7 +348,7 @@ def _handle_imputation(results: Any):
 
 def _handle_two_stage(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _placebo_step(),
         _robustness_compare_step("CS, BJS, or SA"),
         _covariates_step(),
@@ -341,7 +359,7 @@ def _handle_two_stage(results: Any):
 
 def _handle_stacked(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _placebo_step(),
         _step(
             baker_step=7,
@@ -423,7 +441,7 @@ def _handle_trop(results: Any):
 
 def _handle_efficient(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _placebo_step(),
         _step(
             baker_step=7,

docs/llms-practitioner.txt

Lines changed: 42 additions & 12 deletions
@@ -79,14 +79,14 @@ to allow k periods of anticipation.
 
 ## Step 3: Test Parallel Trends
 
-Test the parallel trends assumption empirically BEFORE estimation. This step
-is separated from Step 2 because it requires code execution, not just stating
-assumptions.
+Test the parallel trends assumption empirically. This step is separated from
+Step 2 because it requires code execution, not just stating assumptions.
 
+**For simple 2x2 designs** (single treatment timing):
 ```python
 from diff_diff import check_parallel_trends, equivalence_test_trends
 
-# Simple pre-trends test (compares slopes)
+# Simple pre-trends test (compares slopes for a binary treatment indicator)
 pt_result = check_parallel_trends(
     data, outcome='y', time='period', treatment_group='treated',
     pre_periods=[1, 2, 3]
@@ -102,6 +102,24 @@ equiv = equivalence_test_trends(
 )
 ```
 
+**For staggered designs** (multiple cohorts adopting at different times):
+The generic `check_parallel_trends()` assumes a single binary treatment with
+universal pre-periods, which is invalid when cohorts adopt at different times
+(some "pre-periods" are post-treatment for early cohorts). Instead, use the
+CS event-study pre-period coefficients as the pre-trends diagnostic:
+```python
+# Fit CS with event_study aggregation, then inspect pre-periods
+cs = CallawaySantAnna(control_group='not_yet_treated', cluster='unit_id')
+results = cs.fit(data, ..., aggregate='event_study')
+
+# Pre-treatment relative-time ATTs should be near zero
+if results.event_study_effects:
+    for rel_t, eff in sorted(results.event_study_effects.items()):
+        if rel_t < 0:
+            print(f"Pre-period {rel_t}: ATT={eff['effect']:.4f}, SE={eff['se']:.4f}")
+# Significant pre-treatment effects → parallel trends may be violated
+```
+
 CAUTION: Small, statistically insignificant pre-trends do NOT guarantee
 parallel trends holds post-treatment. Use HonestDiD (Step 6) to bound how
 large a violation would need to be to overturn your results.
@@ -390,14 +408,12 @@ data = load_mpdta()
 # no anticipation, doubly robust for conditional PT
 
 # Step 3: Test parallel trends
-# Note: check_parallel_trends expects a binary treatment indicator, not cohort years.
-# Use 'treat' (ever-treated binary) with pre-periods before any cohort adopts.
-pt = check_parallel_trends(data, outcome='lemp', time='year',
-                           treatment_group='treat',
-                           pre_periods=[2003, 2004, 2005])
-print(f"Pre-trends p-value: {pt['p_value']:.4f}")
-
-# Step 4: Choose estimator — staggered adoption → CS (primary), SA (robustness)
+# For staggered designs, use the CS event-study pre-period coefficients
+# as the pre-trends diagnostic (NOT the generic check_parallel_trends,
+# which assumes a single binary treatment and universal pre-periods).
+# We estimate CS with event_study aggregation first, then inspect pre-periods.
+
+# Step 4: Choose estimator — staggered adoption -> CS (primary), SA (robustness)
 # Diagnose TWFE bias first:
 bacon = BaconDecomposition()
 bacon_result = bacon.fit(data, outcome='lemp', unit='countyreal',
@@ -416,6 +432,14 @@ results = cs.fit(data, outcome='lemp', unit='countyreal', time='year',
                  aggregate='all')
 print(results.summary())
 
+# Step 3 (continued): Inspect CS event-study pre-period coefficients
+# Pre-treatment relative-time ATTs should be near zero and insignificant.
+if results.event_study_effects:
+    for rel_t, eff in sorted(results.event_study_effects.items()):
+        if rel_t < 0:
+            print(f" Pre-period {rel_t}: ATT={eff['effect']:.4f}, "
+                  f"SE={eff['se']:.4f}")
+
 # Step 6: Sensitivity — HonestDiD bounds
 honest = compute_honest_did(results, method='relative_magnitude', M=1.0)
 print(honest.summary())
@@ -437,6 +461,12 @@ print(f"CS ATT: {results.overall_att:.4f} (SE: {results.overall_se:.4f})")
 print(f"SA ATT: {sa_result.overall_att:.4f} (SE: {sa_result.overall_se:.4f})")
 print(f"BJS ATT: {bjs_result.overall_att:.4f} (SE: {bjs_result.overall_se:.4f})")
 
+# Step 8 (continued): REQUIRED with/without covariates comparison
+results_no_cov = cs.fit(data, outcome='lemp', unit='countyreal', time='year',
+                        first_treat='first_treat', aggregate='all')
+print(f"Without covariates: ATT={results_no_cov.overall_att:.4f}")
+print(f"With covariates: ATT={results.overall_att:.4f}")
+
 # Context-aware next steps
 guidance = practitioner_next_steps(results)
 ```

docs/practitioner-guide-evaluation.md

Lines changed: 5 additions & 3 deletions
@@ -17,7 +17,9 @@ difference-in-differences design using the load_mpdta() dataset."
 (with practitioner workflow header) + key sections of `docs/llms-practitioner.txt`.
 
 **Model**: 1 Opus + 9 Sonnet (before), 10 Sonnet (after). All agents are fresh
-instances with no shared context.
+instances with no shared context. Note: the before arm includes one Opus run;
+this is a minor confound but the Opus run scored 8/16 (below the Sonnet mean
+of 9.6), so the model mix does not inflate the reported improvement.
 
 ## Scoring Rubric (0-2 per step, 16 total)
 
@@ -108,7 +110,7 @@ After v1, the remaining 0.75 point gap was concentrated in:
 
 ### Targeted Changes
 
-1. **Step 4**: Added "You MUST check the cluster count before choosing inference"
+1. **Step 5**: Added "You MUST check the cluster count before choosing inference"
    with explicit code: `n_clusters = data[cluster_col].nunique()` + if/else branch.
 2. **Step 8**: Strengthened "Report with and without covariates" from a checklist
    item to "REQUIRED — This is not optional" with explanation of why it matters.
@@ -150,6 +152,6 @@ and practically massive. Key results:
 - **Variance collapsed** from SD=0.84 to SD=0.0 — the guide standardized
   behavior so completely that all agents produce the same high-quality workflow
 - **Two iterations sufficed**: v1 closed 79% of the gap; targeted v2 fixes
-  to Step 4 (cluster count) and Step 8 (covariates) closed the remaining 21%
+  to Step 5 (cluster count) and Step 8 (covariates) closed the remaining 21%
 - **Documentation alone** drove these results — no runtime enforcement was
   needed beyond the `practitioner_next_steps()` function
