Fix staggered pre-trends guidance and add with/without covariates example

igerber · claude · igerber · commit f3c17ffe5251 · 2026-03-28T16:40:31.000-04:00
P1: For staggered designs, replace generic check_parallel_trends() with
CS event-study pre-period inspection. The generic helper assumes a single
binary treatment with universal pre-periods, which is invalid when
cohorts adopt at different times (early-cohort post-treatment observations
contaminate the "pre-trend"). The guide and practitioner_next_steps()
now emit staggered-specific guidance using event_study_effects pre-periods.
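
A toy sketch of the contamination described above, in plain Python rather
than the diff_diff API. Every name and number is hypothetical: three cohorts
share the same linear trend, so parallel trends holds by construction, yet
the pooled "ever-treated" gap still moves before the late cohort adopts.

```python
# Toy illustration only: plain Python, not the diff_diff API.
# All names and numbers here are hypothetical.
EFFECT = 2.0  # constant treatment effect after adoption

# Adoption period per cohort; None means never treated.
FIRST_TREAT = {"early": 3, "late": 5, "never": None}


def outcome(cohort, t):
    """Common linear trend plus the effect once the cohort has adopted."""
    first = FIRST_TREAT[cohort]
    treated_now = first is not None and t >= first
    return 1.0 * t + (EFFECT if treated_now else 0.0)


def gap(cohorts, t):
    """Mean outcome of the given cohorts minus the never-treated outcome."""
    mean_treated = sum(outcome(c, t) for c in cohorts) / len(cohorts)
    return mean_treated - outcome("never", t)


# Pooled "ever-treated" group over periods 1-4 (all before the late cohort adopts).
pooled_gaps = [gap(["early", "late"], t) for t in (1, 2, 3, 4)]

# Cohort-specific check: each cohort only over periods before its own adoption.
early_gaps = [gap(["early"], t) for t in (1, 2)]
late_gaps = [gap(["late"], t) for t in (1, 2, 3, 4)]

print("pooled:", pooled_gaps)
print("early: ", early_gaps)
print("late:  ", late_gaps)
```

The pooled gap jumps at period 3 only because the early cohort is already
treated; the cohort-specific gaps stay flat, which is the comparison that
relative-time pre-period coefficients are meant to capture.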

P2: Complete example now includes REQUIRED with/without covariates
comparison (was missing despite Step 8 marking it mandatory).
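
One way to read the with/without comparison, sketched under assumptions:
compare_specs() is a hypothetical helper, not part of diff_diff, and the
inputs are made-up numbers rather than results from load_mpdta(). The ratio
is a rough yardstick, not a formal test, since the two estimates come from
the same sample and are correlated.

```python
def compare_specs(att_cov, se_cov, att_nocov, se_nocov):
    """Summarize how much the ATT moves when covariates are added.

    Hypothetical helper for illustration; not a diff_diff function.
    """
    shift = att_cov - att_nocov
    scale = max(se_cov, se_nocov)  # conservative yardstick for the shift
    return {
        "shift": shift,
        "shift_in_se": shift / scale,
        "sign_flip": (att_cov > 0) != (att_nocov > 0),
    }


# Made-up inputs, standing in for overall_att / overall_se from two fits.
summary = compare_specs(att_cov=-0.040, se_cov=0.012,
                        att_nocov=-0.050, se_nocov=0.013)
print(f"shift={summary['shift']:.4f}, "
      f"in SE units={summary['shift_in_se']:.2f}, "
      f"sign flip={summary['sign_flip']}")
```

A shift well below one standard error with no sign flip suggests the estimate
is not driven by the covariate adjustment; a large shift or a flip deserves
discussion in the write-up.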

P2: Evaluation doc now carries a model-mix caveat, and residual "Step 4"
references are changed to "Step 5" to match canonical numbering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py
@@ -134,7 +134,25 @@ def _step(
 # ---------------------------------------------------------------------------
 # Common steps reused across handlers
 # ---------------------------------------------------------------------------
-def _parallel_trends_step() -> Dict[str, Any]:
+def _parallel_trends_step(staggered: bool = False) -> Dict[str, Any]:
+    if staggered:
+        return _step(
+            baker_step=3,
+            label="Test parallel trends (event-study pre-periods)",
+            why=(
+                "For staggered designs, inspect CS event-study pre-period "
+                "coefficients rather than the generic check_parallel_trends(), "
+                "which assumes a single binary treatment with universal "
+                "pre-periods. Pre-treatment ATTs should be near zero."
+            ),
+            code=(
+                "# Fit with aggregate='event_study' or 'all', then inspect:\n"
+                "for rel_t, eff in sorted(results.event_study_effects.items()):\n"
+                "    if rel_t < 0:\n"
+                "        print(f'Pre {rel_t}: ATT={eff[\"effect\"]:.4f}')"
+            ),
+            step_name="parallel_trends",
+        )
     return _step(
         baker_step=3,
         label="Test parallel trends assumption",
@@ -265,7 +283,7 @@ def _handle_multi_period(results: Any):
 
 def _handle_cs(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _step(
             baker_step=6,
             label="Run HonestDiD sensitivity analysis",
@@ -308,7 +326,7 @@ def _handle_cs(results: Any):
 
 def _handle_sa(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _placebo_step(),
         _robustness_compare_step("CS, BJS, or Gardner"),
         _covariates_step(),
@@ -319,7 +337,7 @@ def _handle_sa(results: Any):
 
 def _handle_imputation(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _placebo_step(),
         _robustness_compare_step("CS, SA, or Gardner"),
         _covariates_step(),
@@ -330,7 +348,7 @@ def _handle_imputation(results: Any):
 
 def _handle_two_stage(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _placebo_step(),
         _robustness_compare_step("CS, BJS, or SA"),
         _covariates_step(),
@@ -341,7 +359,7 @@ def _handle_two_stage(results: Any):
 
 def _handle_stacked(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _placebo_step(),
         _step(
             baker_step=7,
@@ -423,7 +441,7 @@ def _handle_trop(results: Any):
 
 def _handle_efficient(results: Any):
     steps = [
-        _parallel_trends_step(),
+        _parallel_trends_step(staggered=True),
         _placebo_step(),
         _step(
             baker_step=7,
diff --git a/docs/llms-practitioner.txt b/docs/llms-practitioner.txt
@@ -79,14 +79,14 @@ to allow k periods of anticipation.
 
 ## Step 3: Test Parallel Trends
 
-Test the parallel trends assumption empirically BEFORE estimation. This step
-is separated from Step 2 because it requires code execution, not just stating
-assumptions.
+Test the parallel trends assumption empirically. This step is separated from
+Step 2 because it requires code execution, not just stating assumptions.
 
+**For simple 2x2 designs** (single treatment timing):
 ```python
 from diff_diff import check_parallel_trends, equivalence_test_trends
 
-# Simple pre-trends test (compares slopes)
+# Simple pre-trends test (compares slopes for a binary treatment indicator)
 pt_result = check_parallel_trends(
     data, outcome='y', time='period', treatment_group='treated',
     pre_periods=[1, 2, 3]
@@ -102,6 +102,24 @@ equiv = equivalence_test_trends(
 )
 ```
 
+**For staggered designs** (multiple cohorts adopting at different times):
+The generic `check_parallel_trends()` assumes a single binary treatment with
+universal pre-periods, which is invalid when cohorts adopt at different times
+(some "pre-periods" are post-treatment for early cohorts). Instead, use the
+CS event-study pre-period coefficients as the pre-trends diagnostic:
+```python
+# Fit CS with event_study aggregation, then inspect pre-periods
+cs = CallawaySantAnna(control_group='not_yet_treated', cluster='unit_id')
+results = cs.fit(data, ..., aggregate='event_study')
+
+# Pre-treatment relative-time ATTs should be near zero
+if results.event_study_effects:
+    for rel_t, eff in sorted(results.event_study_effects.items()):
+        if rel_t < 0:
+            print(f"Pre-period {rel_t}: ATT={eff['effect']:.4f}, SE={eff['se']:.4f}")
+# Significant pre-treatment effects → parallel trends may be violated
+```
+
 CAUTION: Small, statistically insignificant pre-trends do NOT guarantee
 parallel trends holds post-treatment. Use HonestDiD (Step 6) to bound how
 large a violation would need to be to overturn your results.
@@ -390,14 +408,12 @@ data = load_mpdta()
 #          no anticipation, doubly robust for conditional PT
 
 # Step 3: Test parallel trends
-# Note: check_parallel_trends expects a binary treatment indicator, not cohort years.
-# Use 'treat' (ever-treated binary) with pre-periods before any cohort adopts.
-pt = check_parallel_trends(data, outcome='lemp', time='year',
-                           treatment_group='treat',
-                           pre_periods=[2003, 2004, 2005])
-print(f"Pre-trends p-value: {pt['p_value']:.4f}")
-
-# Step 4: Choose estimator — staggered adoption → CS (primary), SA (robustness)
+# For staggered designs, use the CS event-study pre-period coefficients
+# as the pre-trends diagnostic (NOT the generic check_parallel_trends,
+# which assumes a single binary treatment and universal pre-periods).
+# We fit CS in Step 5 (aggregate='all' includes the event study), then inspect pre-periods.
+
+# Step 4: Choose estimator — staggered adoption -> CS (primary), SA (robustness)
 # Diagnose TWFE bias first:
 bacon = BaconDecomposition()
 bacon_result = bacon.fit(data, outcome='lemp', unit='countyreal',
@@ -416,6 +432,14 @@ results = cs.fit(data, outcome='lemp', unit='countyreal', time='year',
                  aggregate='all')
 print(results.summary())
 
+# Step 3 (continued): Inspect CS event-study pre-period coefficients
+# Pre-treatment relative-time ATTs should be near zero and insignificant.
+if results.event_study_effects:
+    for rel_t, eff in sorted(results.event_study_effects.items()):
+        if rel_t < 0:
+            print(f"  Pre-period {rel_t}: ATT={eff['effect']:.4f}, "
+                  f"SE={eff['se']:.4f}")
+
 # Step 6: Sensitivity — HonestDiD bounds
 honest = compute_honest_did(results, method='relative_magnitude', M=1.0)
 print(honest.summary())
@@ -437,6 +461,12 @@ print(f"CS ATT:  {results.overall_att:.4f} (SE: {results.overall_se:.4f})")
 print(f"SA ATT:  {sa_result.overall_att:.4f} (SE: {sa_result.overall_se:.4f})")
 print(f"BJS ATT: {bjs_result.overall_att:.4f} (SE: {bjs_result.overall_se:.4f})")
 
+# Step 8 (continued): REQUIRED with/without covariates comparison
+results_no_cov = cs.fit(data, outcome='lemp', unit='countyreal', time='year',
+                        first_treat='first_treat', aggregate='all')
+print(f"Without covariates: ATT={results_no_cov.overall_att:.4f}")
+print(f"With covariates:    ATT={results.overall_att:.4f}")
+
 # Context-aware next steps
 guidance = practitioner_next_steps(results)
 ```
diff --git a/docs/practitioner-guide-evaluation.md b/docs/practitioner-guide-evaluation.md
@@ -17,7 +17,9 @@ difference-in-differences design using the load_mpdta() dataset."
 (with practitioner workflow header) + key sections of `docs/llms-practitioner.txt`.
 
 **Model**: 1 Opus + 9 Sonnet (before), 10 Sonnet (after). All agents are fresh
-instances with no shared context.
+instances with no shared context. Note: the before arm includes one Opus run;
+this is a minor confound but the Opus run scored 8/16 (below the Sonnet mean
+of 9.6), so the model mix does not inflate the reported improvement.
 
 ## Scoring Rubric (0-2 per step, 16 total)
 
@@ -108,7 +110,7 @@ After v1, the remaining 0.75 point gap was concentrated in:
 
 ### Targeted Changes
 
-1. **Step 4**: Added "You MUST check the cluster count before choosing inference"
+1. **Step 5**: Added "You MUST check the cluster count before choosing inference"
    with explicit code: `n_clusters = data[cluster_col].nunique()` + if/else branch.
 2. **Step 8**: Strengthened "Report with and without covariates" from a checklist
    item to "REQUIRED — This is not optional" with explanation of why it matters.
@@ -150,6 +152,6 @@ and practically massive. Key results:
 - **Variance collapsed** from SD=0.84 to SD=0.0 — the guide standardized
   behavior so completely that all agents produce the same high-quality workflow
 - **Two iterations sufficed**: v1 closed 79% of the gap; targeted v2 fixes
-  to Step 4 (cluster count) and Step 8 (covariates) closed the remaining 21%
+  to Step 5 (cluster count) and Step 8 (covariates) closed the remaining 21%
 - **Documentation alone** drove these results — no runtime enforcement was
   needed beyond the `practitioner_next_steps()` function