Skip to content

Commit 528c241

Browse files
igerberclaude
andcommitted
Fix Step 3 example column and align evaluation doc labels
P1: Complete example Step 3 now uses treatment_group='treat' (binary ever-treated indicator) instead of 'first_treat' (cohort year), with explicit pre_periods=[2003, 2004, 2005]. The previous version would produce an empty treated subset since check_parallel_trends() splits on == 1 / == 0. P2: Evaluation doc comparison tables and narrative now use canonical step labels (S3=Test PT, S4=Choose estimator, S5=Estimate) matching REGISTRY.md and the practitioner guide. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 6ca617e commit 528c241

2 files changed

Lines changed: 15 additions & 12 deletions

File tree

docs/llms-practitioner.txt

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -390,8 +390,11 @@ data = load_mpdta()
390390
# no anticipation, doubly robust for conditional PT
391391

392392
# Step 3: Test parallel trends
393+
# Note: check_parallel_trends expects a binary treatment indicator, not cohort years.
394+
# Use 'treat' (ever-treated binary) with pre-periods before any cohort adopts.
393395
pt = check_parallel_trends(data, outcome='lemp', time='year',
394-
treatment_group='first_treat')
396+
treatment_group='treat',
397+
pre_periods=[2003, 2004, 2005])
395398
print(f"Pre-trends p-value: {pt['p_value']:.4f}")
396399

397400
# Step 4: Choose estimator — staggered adoption → CS (primary), SA (robustness)

docs/practitioner-guide-evaluation.md

Lines changed: 11 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -51,9 +51,9 @@ instances with no shared context.
5151
|------|------------|-----------|--------|----------------|
5252
| S1: Target parameter | 1.0 | 2.0 | **+1.0** | Agents now explicitly define weighted/unweighted target |
5353
| S2: Assumptions | 1.0 | 2.0 | **+1.0** | Agents now formally name PT variant (PT-GT-NYT) |
54-
| S3: Estimator choice | 2.0 | 2.0 | 0.0 | Already perfect before |
55-
| S4: Inference | 1.0 | 1.5 | +0.5 | Now discuss wild bootstrap alternative |
56-
| S5: Estimation | 2.0 | 2.0 | 0.0 | Already perfect before |
54+
| S3: Test parallel trends | 0.1 | 2.0 | **+1.9** | From near-zero to universal formal PT testing |
55+
| S4: Choose estimator | 2.0 | 2.0 | 0.0 | Already perfect before |
56+
| S5: Estimate (cluster check) | 1.0 | 1.5 | +0.5 | Now discuss wild bootstrap alternative |
5757
| S6: Sensitivity | **0.1** | **2.0** | **+1.9** | From near-zero to universal HonestDiD + placebo |
5858
| S7: Heterogeneity | 1.4 | 2.0 | +0.6 | Now consistently do group + event study |
5959
| S8: Robustness | 0.9 | 1.75 | +0.85 | Now compare 3 estimators; ~50% add with/without covariates |
@@ -77,9 +77,9 @@ instances with no shared context.
7777
standardized behavior — agents now consistently follow the same high-quality
7878
workflow rather than producing variable-quality ad hoc analyses.
7979

80-
5. **Steps 3 and 5 were already perfect**: Agents already knew to use CS for
81-
staggered data and could produce working code. The gap was never in mechanics
82-
but in empiricist reasoning.
80+
5. **Steps 4 and 5 (estimator choice + estimation) were already perfect**:
81+
Agents already knew to use CS for staggered data and could produce working
82+
code. The gap was never in mechanics but in empiricist reasoning.
8383

8484
## Qualitative Observations
8585

@@ -101,8 +101,8 @@ instances with no shared context.
101101
## Iteration 2: Targeted Fixes
102102

103103
After v1, the remaining 0.75 point gap was concentrated in:
104-
- **Step 4 (Inference)**: Agents mentioned wild bootstrap generically but never
105-
checked the actual cluster count in the data (1.5/2 across all runs).
104+
- **Step 5 (Estimate/Inference)**: Agents mentioned wild bootstrap generically but
105+
never checked the actual cluster count in the data (1.5/2 across all runs).
106106
- **Step 8 (Robustness)**: ~50% of agents skipped with/without covariates
107107
comparison despite the guide listing it (mean 1.75/2).
108108

@@ -129,9 +129,9 @@ After v1, the remaining 0.75 point gap was concentrated in:
129129
|------|--------|----------|----------|
130130
| S1: Target parameter | 1.0 | 2.0 | 2.0 |
131131
| S2: Assumptions | 1.0 | 2.0 | 2.0 |
132-
| S3: Estimator choice | 2.0 | 2.0 | 2.0 |
133-
| S4: Inference | 1.0 | 1.5 | **2.0** |
134-
| S5: Estimation | 2.0 | 2.0 | 2.0 |
132+
| S3: Test parallel trends | 0.1 | 2.0 | 2.0 |
133+
| S4: Choose estimator | 2.0 | 2.0 | 2.0 |
134+
| S5: Estimate (cluster check) | 1.0 | 1.5 | **2.0** |
135135
| S6: Sensitivity | 0.1 | 2.0 | 2.0 |
136136
| S7: Heterogeneity | 1.4 | 2.0 | 2.0 |
137137
| S8: Robustness | 0.9 | 1.75 | **2.0** |

0 commit comments

Comments
 (0)