Add Staggered Triple Difference estimator #245
…'Anna 2025)

Implements the staggered DDD estimator with group-time ATT(g,t), GMM-optimal combination across comparison cohorts, event study aggregation, and multiplier bootstrap. Core pairwise DiD computation matches R triplediff package exactly (Riesz/Hajek normalization, PS + OR influence function corrections).

Validated against R: group-time ATT and SE match within 0.001% across 10 scenarios.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
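The "GMM-optimal combination across comparison cohorts" mentioned in this commit can be illustrated with a minimal sketch. It assumes the combination uses the standard minimum-variance weights w = Ω⁻¹1 / (1'Ω⁻¹1), where Ω is the covariance matrix of the per-cohort estimates of the same ATT(g,t). The function names `solve_linear`, `gmm_weights`, and `combine` are hypothetical and are not taken from the PR.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not mutated.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def gmm_weights(omega):
    """Minimum-variance weights w = Omega^{-1} 1 / (1' Omega^{-1} 1)."""
    ones = [1.0] * len(omega)
    raw = solve_linear(omega, ones)
    total = sum(raw)
    return [x / total for x in raw]


def combine(estimates, omega):
    """Weighted combination of several estimates of the same parameter."""
    weights = gmm_weights(omega)
    return sum(w * e for w, e in zip(weights, estimates))
```

With a diagonal Ω the weights reduce to inverse-variance weights, so the noisier comparison cohort contributes less to the combined estimate.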
|
Overall Assessment
This PR adds
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
- Add balanced-panel validation: unique (unit, time), equal periods per unit, time-invariant first_treat and covariates
- Propagate bootstrap group-effect SEs/CIs/p-values into returned results
- Add control_group to StaggeredTripleDiffResults and surface in summary/to_dict
- Add Deviation from R label for pscore_trim in REGISTRY.md
- Warn when cluster parameter is set (analytical cluster SEs not yet wired)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
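The balanced-panel checks listed in this commit can be sketched as follows. This is a minimal illustration, assuming the panel is available as (unit, time, first_treat) tuples. The PR itself works on a DataFrame, and `validate_panel` is a hypothetical name, not the PR's function.

```python
from collections import defaultdict


def validate_panel(rows):
    """Check a long-format panel given as (unit, time, first_treat) tuples.

    Raises ValueError on duplicate (unit, time) pairs, on a first_treat
    value that changes within a unit, or on units observed for different
    numbers of periods.
    """
    seen = set()
    periods = defaultdict(set)
    cohort = {}
    for unit, time, first_treat in rows:
        if (unit, time) in seen:
            raise ValueError(f"duplicate observation for unit={unit}, time={time}")
        seen.add((unit, time))
        periods[unit].add(time)
        # setdefault stores the first value seen and returns it afterwards.
        if cohort.setdefault(unit, first_treat) != first_treat:
            raise ValueError(f"first_treat varies over time for unit={unit}")
    counts = {len(p) for p in periods.values()}
    if len(counts) > 1:
        raise ValueError("units are observed for different numbers of periods")
```

Time-invariant covariates could be checked the same way as first_treat, by remembering the first value per unit and comparing later rows against it.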
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
⛔ Blocker
The re-review shows that several prior issues are fixed, but one unmitigated P0 remains in the new estimator path, along with multiple P1s. The blocking problem is incorrect
Executive Summary
Methodology
Code Quality
No additional findings beyond the methodology issues above.
Performance
No findings in the reviewed diff.
Maintainability
No additional findings beyond the missing bootstrap propagation branch already noted above.
Tech Debt
No findings.
Security
No findings.
Documentation/Tests
Path to Approval
Verification note: I could not run the test suite in this environment because |
- P0: Propagate anticipation into notyettreated comparison-group threshold (use max(t, base_period) + anticipation, not just max(g, t))
- P1: Propagate bootstrap group-time SEs/CIs/p-values into group_time_effects
- P1: Tighten balanced-panel check to require full common time support
- P1: Add Deviation from R label for aggregation weight difference
- P3: CI fix: cast first_treat to float before assigning inf in test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
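The P0 fix concerns which cohorts may act as not-yet-treated comparisons for a given (g, t) cell. A minimal sketch of the threshold named in the commit, max(t, base_period) + anticipation, is shown below. The function name, the strict inequality, and the use of infinity to encode never-treated cohorts are assumptions made for illustration.

```python
import math


def is_valid_control_cohort(gc, g, t, base_period, anticipation=0):
    """Return True if cohort gc may serve as a comparison for (g, t).

    Assumed rule: never-treated cohorts (gc == math.inf) always qualify;
    a later-treated cohort qualifies only if it is still untreated at both
    the evaluation period and the base period, allowing for anticipation.
    """
    if gc == g:
        return False  # a cohort is never its own comparison group
    if math.isinf(gc):
        return True
    threshold = max(t, base_period) + anticipation
    return gc > threshold
```

For example, with g=3, t=4, base_period=2 and no anticipation, cohort 5 qualifies and cohort 4 does not. With anticipation=1 the threshold rises to 5, so cohort 5 is excluded as well.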
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
The prior re-review P0/P1 items around
Executive Summary
Methodology
Code Quality
Performance
No findings.
Maintainability
No findings.
Tech Debt
No findings. TODO.md does not currently track the P1/P2 items above, so no mitigation applies.
Security
No findings.
Documentation/Tests
Path to Approval
|
…docs

- Generate CSV fixtures via R at test time (skip if R unavailable)
- Guard empty cohorts in generate_staggered_ddd_data
- Fix REGISTRY: relabel overall_att (not Eq 4.14), document conservative per-cohort group-effect SEs (WIF included vs R's wif=NULL)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
The PR closes some of the earlier review surface: the new estimator now validates balanced panels and time-invariant eligibility, and the synthetic DGP no longer crashes on empty cohort masks. But there are still unmitigated P1 issues in the new staggered DDD inference paths. Static review only: I could not run
Executive Summary
Methodology
Code Quality
Performance
No findings.
Maintainability
No findings beyond the issues already called out in Methodology and Code Quality.
Tech Debt
No findings. TODO.md does not track the P1/P2 items above, so no mitigation applies.
Security
No findings.
Documentation/Tests
Path to Approval
|
- Use eligible-treated-only cohort assignments for aggregation WIF, matching the paper's Eq 4.13 (G_i defined only for Q=1 units). Ineligible units get cohort=0 so they don't contribute to pg.
- Honor rank_deficient_action in _compute_or via solve_ols fallback
- Add proper Deviation from R / Note labels in REGISTRY for aggregation weight choice and per-cohort WIF

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
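The effect of recoding ineligible units to cohort 0 can be seen in a small sketch of the group shares pg. It assumes that cohort 0 also marks never-treated units and that the shares are normalised over eventually treated cohorts; both are assumptions of this illustration, and `cohort_shares` is a hypothetical name.

```python
from collections import Counter


def cohort_shares(first_treat, eligible):
    """Cohort probabilities p_g among eligible, eventually treated units.

    Ineligible units (Q = 0) are recoded to cohort 0, the value this
    sketch also uses for never-treated units, so they drop out of every
    share.
    """
    recoded = [g if q == 1 else 0 for g, q in zip(first_treat, eligible)]
    counts = Counter(g for g in recoded if g > 0)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}
```

With first_treat = [3, 3, 4, 4, 0] and eligible = [1, 0, 1, 1, 1], the shares are 1/3 for cohort 3 and 2/3 for cohort 4. Without the recoding, the ineligible unit in cohort 3 would push both shares to 1/2.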
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only; I could not run
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
The non-survey WIF path reads group sizes from df["first_treat"] directly (staggered_aggregation.py:292), bypassing precomputed["unit_cohorts"]. Zero out df["first_treat"] for ineligible units alongside precomputed_agg so both code paths use the eligible-treated-only population definition.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only; I could not run
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
- Skip reference period (t == g-1-anticipation) in universal mode, matching CallawaySantAnna behavior. Event-study mixin injects synthetic reference row with effect=0.
- Update REGISTRY comparison-group rule to match code: uses max(t, base_period) + anticipation for notyettreated threshold.
- Add TODO.md entries for deferred P2 items (CSV fixtures, covariate parity, group-effect WIF).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
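The reference-period rule in the first bullet can be sketched as a loop over (g, t) cells. This is an illustration only: `group_time_pairs` and the string values of `base_period` are hypothetical, and the sketch ignores every other reason a cell might be dropped.

```python
def group_time_pairs(cohorts, periods, anticipation=0, base_period="universal"):
    """List the (g, t) cells to estimate.

    In universal mode the reference period t == g - 1 - anticipation is
    skipped, because the effect there is zero by construction.
    """
    pairs = []
    for g in cohorts:
        reference = g - 1 - anticipation
        for t in periods:
            if base_period == "universal" and t == reference:
                continue  # the event-study step can re-insert this row as 0
            pairs.append((g, t))
    return pairs
```

For cohort g=3 and periods 1 to 4 with no anticipation, universal mode yields (3, 1), (3, 3), and (3, 4); period 2 is the reference and is left out.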
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only. I could not run the targeted test modules here because
Executive Summary
Methodology
Code Quality
No findings.
Performance
No findings.
Maintainability
No findings.
Tech Debt
Security
No findings.
Documentation/Tests
No unmitigated findings. The remaining non-blocking parity gaps are tracked in
Path to Approval
|
The implementation follows R's triplediff (max(t, base_period) + anticipation) rather than the paper's max(g, t). Add Deviation from paper label in REGISTRY.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only. I could not execute the new test modules here because this environment does not have
Executive Summary
Methodology Affected methods:
Code Quality
Performance
No findings.
Maintainability
No findings.
Tech Debt
Security
No findings.
Documentation/Tests
No unmitigated findings beyond the tracked
Path to Approval
|
- Warn when base period is outside observed panel (previously silent skip)
- Warn when DDD subgroups are empty with details of which cells are missing
- Zero-fill NaN logistic coefficients before caching to prevent NaN propagation on cache reuse with rank-deficient covariates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
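The zero-fill step in the last bullet can be illustrated with a short sketch. It assumes the propensity-score coefficients are cached per key in a plain dictionary and that a NaN marks a column dropped for rank deficiency; `cache_ps_coefficients` and `linear_predictor` are hypothetical names.

```python
import math


def cache_ps_coefficients(cache, key, coefficients):
    """Store logistic coefficients, replacing NaN entries with 0.0.

    A NaN coefficient (for example for a dropped collinear column) would
    turn every linear predictor computed from the cached vector into NaN.
    """
    cleaned = [0.0 if math.isnan(b) else b for b in coefficients]
    cache[key] = cleaned
    return cleaned


def linear_predictor(coefficients, x):
    """Inner product of coefficients and covariate values."""
    return sum(b * v for b, v in zip(coefficients, x))
```

Setting the coefficient to zero makes the dropped column contribute nothing to the index, which is equivalent to omitting it from the model.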
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good
Static review only. This environment does not have
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
- Test base period outside panel: warns and produces finite results
- Test empty DDD subgroup: warns with cell details, still estimates
- Test collinear covariates with PS cache reuse: all (g,t) finite

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only; I could not execute the new tests in this workspace because the available Python environment is missing
Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
- Validate exact period-set equality (not just counts) for balanced panel
- Reject non-finite outcomes (Inf) and covariates up front
- R parity tests now assert GT vector lengths and (g,t) label identity before comparing ATT/SE values

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
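Why period counts alone are not enough can be seen in a minimal sketch. Two units can each have three observations yet cover different periods, so the check below compares the period sets themselves. The function names are hypothetical, and `math.isfinite` also rejects NaN, which may or may not match how the PR treats missing values.

```python
import math


def check_common_periods(periods_by_unit):
    """Require every unit to be observed in exactly the same periods."""
    period_sets = {frozenset(p) for p in periods_by_unit.values()}
    if len(period_sets) > 1:
        raise ValueError("units do not share a common set of periods")


def check_finite(values, name):
    """Reject NaN and +/-Inf before estimation starts."""
    if any(not math.isfinite(v) for v in values):
        raise ValueError(f"{name} contains non-finite values")
```

A panel where unit 1 is observed in periods {1, 2, 3} and unit 2 in {1, 2, 4} passes a count-based check but fails `check_common_periods`.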
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
Static review only. I could not run the new tests in this workspace because
Executive Summary
Methodology
No other unmitigated methodology mismatch stood out beyond the documented notes on admissible comparison cohorts, aggregation weighting/WIF, and propensity-score trimming in docs/methodology/REGISTRY.md:L1301, docs/methodology/REGISTRY.md:L1342, and docs/methodology/REGISTRY.md:L1370.
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
Move size_gt computation to after the inner gc loop so it reflects only the comparison cohorts that actually survived identification. Prevents incorrect IF rescaling and analytical SE when some gc's are skipped due to empty subgroups.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
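The ordering issue described in this commit can be sketched as follows. The sketch assumes size_gt counts the units in the treated cohort plus the comparison cohorts actually used, which is a guess at its definition. `surviving_size` and the `is_identified` callback are hypothetical.

```python
def surviving_size(cohort_sizes, g, candidates, is_identified):
    """Count units behind a (g, t) cell after dropping unusable cohorts.

    cohort_sizes maps cohort -> number of units; is_identified(gc) is a
    hypothetical callback that returns False when a comparison cohort
    must be skipped (for example because a DDD subgroup is empty).
    """
    survivors = []
    for gc in candidates:
        if not is_identified(gc):
            continue  # skipped cohorts must not enter the size below
        survivors.append(gc)
    # Computed after the loop, from the surviving cohorts only.
    size_gt = cohort_sizes[g] + sum(cohort_sizes[gc] for gc in survivors)
    return size_gt, survivors
```

If the size were computed before the loop from all candidate cohorts, any later rescaling by that size would be based on units that never entered the estimate.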
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good
Static review only: this workspace still lacks
Executive Summary
Methodology Affected method(s):
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
Flip PS nuisance correction sign in panel survey IPW and RCS IPW paths. R adds the correction to inf.control then subtracts (att = treat - control), so the net effect on ATT IF is subtraction. DR paths are unaffected because their M2 residual (m_control - control_change) already has opposite sign.

Also merges main (StaggeredTripleDifference from #245), resolves TODO.md conflict, and updates survey-roadmap.md to reflect Phase 7a implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
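The sign argument in this commit is plain arithmetic and can be written out in a few lines. The sketch assumes per-unit influence terms held in lists; `att_influence` is a hypothetical name and does not reproduce the PR's IPW code.

```python
def att_influence(inf_treat, inf_control, ps_correction):
    """Combine per-unit influence terms for an IPW ATT.

    The propensity-score correction is added to the control component,
    which is then subtracted (att = treat - control), so its net sign on
    the ATT influence function is negative.
    """
    return [
        it - (ic + c)
        for it, ic, c in zip(inf_treat, inf_control, ps_correction)
    ]
```

Adding the correction directly to the ATT influence function instead would flip its sign relative to this convention, which is the error the commit describes fixing.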
Summary
- StaggeredTripleDifference estimator for staggered adoption DDD designs (Ortiz-Villavicencio & Sant'Anna 2025)
- control_group parameter supporting both "nevertreated" and "notyettreated" modes

Methodology references (required if estimator / math changes)
- agg_ddd() group-probability weights: documented in REGISTRY.md
- NotImplementedError)

Validation
- tests/test_staggered_triple_diff.py: 40 unit tests (init, fit, recovery, methods, event study, GMM, bootstrap, edge cases)
- tests/test_methodology_staggered_triple_diff.py: 40 R cross-validation tests against pre-computed triplediff golden values
- docs/methodology/papers/ortiz-villavicencio-santanna-2025-review.md (not committed, gitignored)
- benchmarks/R/benchmark_staggered_triplediff.R

Security / privacy
Generated with Claude Code