|
2 | 2 |
|
3 | 3 | This document outlines the feature roadmap for diff-diff, prioritized by practitioner value and academic credibility. |
4 | 4 |
|
5 | | -## What Makes a Credible 1.0? |
| 5 | +For past changes and release history, see [CHANGELOG.md](CHANGELOG.md). |
6 | 6 |
|
7 | | -A production-ready DiD library needs: |
| 7 | +--- |
| 8 | + |
| 9 | +## Current Status (v1.0.2) |
8 | 10 |
|
9 | | -1. ✅ **Core estimators** - Basic DiD, TWFE, MultiPeriod, Staggered (Callaway-Sant'Anna), Synthetic DiD |
10 | | -2. ✅ **Valid inference** - Robust SEs, cluster SEs, wild bootstrap for few clusters |
11 | | -3. ✅ **Assumption diagnostics** - Parallel trends tests, placebo tests |
12 | | -4. ✅ **Sensitivity analysis** - What if parallel trends is violated? (Rambachan-Roth) |
13 | | -5. ✅ **Conditional parallel trends** - Covariate adjustment for staggered DiD |
14 | | -6. ✅ **Documentation** - API reference site for discoverability |
| 11 | +diff-diff is a **production-ready** DiD library that has feature parity with R's `did` + `HonestDiD` ecosystem for core DiD analysis: |
15 | 12 |
|
16 | | -**All 1.0 blockers are complete.** diff-diff has feature parity with R's `did` + `HonestDiD` ecosystem for core DiD analysis. |
| 13 | +- **Core estimators**: Basic DiD, TWFE, MultiPeriod, Callaway-Sant'Anna, Synthetic DiD |
| 14 | +- **Valid inference**: Robust SEs, cluster SEs, wild bootstrap, multiplier bootstrap |
| 15 | +- **Assumption diagnostics**: Parallel trends tests, placebo tests, Goodman-Bacon decomposition |
| 16 | +- **Sensitivity analysis**: Honest DiD (Rambachan-Roth) |
| 17 | +- **Study design**: Power analysis tools |
17 | 18 |
|
18 | 19 | --- |
19 | 20 |
|
20 | | -## Status Overview |
21 | | - |
22 | | -| Feature | Status | Priority | Why It Matters | |
23 | | -|---------|--------|----------|----------------| |
24 | | -| Honest DiD (Rambachan-Roth) | ✅ Done | — | Reviewers expect sensitivity analysis | |
25 | | -| CallawaySantAnna Covariates | ✅ Done | — | Conditional PT often required in practice | |
26 | | -| API Documentation Site | ✅ Done | — | Credibility and discoverability | |
27 | | -| Goodman-Bacon Decomposition | ✅ Done | — | Explains when TWFE fails | |
28 | | -| Power Analysis | ✅ Done | — | Study design tool | |
29 | | -| CallawaySantAnna Bootstrap | ✅ Done | — | Valid inference with few clusters | |
30 | | -| Sun-Abraham Estimator | Not Started | Post-1.0 | Alternative to CS, some prefer it | |
31 | | -| Gardner's did2s | Not Started | Post-1.0 | Two-stage approach, available in pyfixest | |
32 | | -| Local Projections DiD | Not Started | Post-1.0 | Dynamic effects (Dube et al. 2023) | |
33 | | -| Borusyak-Jaravel-Spiess | Not Started | Post-1.0 | More efficient under homogeneous effects | |
34 | | -| Double/Debiased ML | Not Started | Post-1.0 | High-dimensional covariates | |
| 21 | +## Near-Term Enhancements (v1.1–v1.2) |
35 | 22 |
|
36 | | ---- |
| 23 | +High-value additions building on our existing foundation. |
| 24 | + |
| 25 | +### Sun-Abraham Estimator |
| 26 | + |
| 27 | +Interaction-weighted estimator providing an alternative to Callaway-Sant'Anna. Many practitioners run both as a robustness check. |
| 28 | + |
| 29 | +- Event-study coefficients via saturated regression with cohort-time interactions |
| 30 | +- Different weighting scheme than CS; can give different results under heterogeneous effects |
| 31 | +- Agreement between CS and SA estimates serves as a useful robustness check |
| 32 | + |
| 33 | +**Reference**: Sun & Abraham (2021). *Journal of Econometrics*. |
37 | 34 |
|
38 | | -## 1.0 Target Features |
| 35 | +### Borusyak-Jaravel-Spiess Imputation Estimator |
| 36 | + |
| 37 | +More efficient than Callaway-Sant'Anna when treatment effects are homogeneous across groups/time. Uses imputation rather than aggregation. |
39 | 38 |
|
40 | | -These would strengthen the 1.0 release but aren't strictly blocking. |
| 39 | +- Imputes untreated potential outcomes using pre-treatment data |
| 40 | +- More efficient under homogeneous effects assumption |
| 41 | +- Can handle unbalanced panels more naturally |
41 | 42 |
|
42 | | -### ✅ Goodman-Bacon Decomposition (Done) |
| 43 | +**Reference**: Borusyak, Jaravel, and Spiess (2024). *Review of Economic Studies*. |
43 | 44 |
|
44 | | -Helps users understand *why* TWFE can be biased with staggered adoption. Shows weights on "forbidden comparisons" (already-treated as controls). Essential diagnostic before deciding whether to use Callaway-Sant'Anna. |
| 45 | +### Gardner's Two-Stage DiD (did2s) |
45 | 46 |
|
46 | | -- ✅ Decompose TWFE into 2x2 comparisons |
47 | | -- ✅ Show weights by comparison type (clean vs. forbidden) |
48 | | -- ✅ Visualization of decomposition (scatter and bar charts) |
49 | | -- ✅ Integration with `TwoWayFixedEffects.decompose()` method |
50 | | -- ✅ Automatic warning when TWFE detects staggered treatment timing |
| 47 | +Two-stage approach gaining traction in applied work. First residualizes outcomes, then estimates effects. |
51 | 48 |
|
52 | | -**Reference**: Goodman-Bacon (2021). *Journal of Econometrics*. |
| 49 | +- Stage 1: Estimate unit and time FEs using only untreated observations |
| 50 | +- Stage 2: Regress residualized outcomes on treatment indicators |
| 51 | +- Clean separation of identification and estimation |
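
The two stages listed above can be sketched on a toy panel. This is a minimal illustration in plain Python, not the planned diff-diff API: the panel, the helper name `fit_untreated_fes`, and the no-intercept second stage are assumptions made for the example, and it produces a point estimate only (the standard-error adjustment that a two-stage procedure needs is omitted).

```python
# Hypothetical noise-free panel: y = unit effect + time effect + 2.0 * treated.
unit_fx = [1.0, 2.0, 3.0, 4.0]
time_fx = [0.0, 0.5, 1.0, 1.5, 2.0]
first_treated = [None, None, 3, 2]  # None = never treated
TRUE_ATT = 2.0

rows = []  # (unit, period, treated, outcome)
for i, a in enumerate(unit_fx):
    for t, g in enumerate(time_fx):
        d = int(first_treated[i] is not None and t >= first_treated[i])
        rows.append((i, t, d, a + g + TRUE_ATT * d))


def fit_untreated_fes(rows, n_units, n_periods, n_iter=500):
    """Stage 1: fit y = alpha_i + gamma_t on untreated rows only.

    Uses alternating unit/period means (block coordinate descent on the
    least-squares objective), which avoids any matrix algebra.
    """
    alpha = [0.0] * n_units
    gamma = [0.0] * n_periods
    untreated = [r for r in rows if r[2] == 0]
    for _ in range(n_iter):
        for i in range(n_units):
            vals = [y - gamma[t] for (u, t, _, y) in untreated if u == i]
            alpha[i] = sum(vals) / len(vals)
        for t in range(n_periods):
            vals = [y - alpha[u] for (u, s, _, y) in untreated if s == t]
            gamma[t] = sum(vals) / len(vals)
    return alpha, gamma


alpha, gamma = fit_untreated_fes(rows, len(unit_fx), len(time_fx))

# Stage 2: regress residualized outcomes on the treatment indicator
# (no intercept), i.e. sum(d * resid) / sum(d * d).
resid = [(d, y - alpha[u] - gamma[t]) for (u, t, d, y) in rows]
att = sum(d * r for d, r in resid) / sum(d * d for d, _ in resid)
print(round(att, 6))
```

Because the fixed effects are learned only from untreated observations, already-treated units never act as controls, which is the "clean separation" the last bullet refers to.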
53 | 52 |
|
54 | | -### ✅ Power Analysis Tools (Done) |
| 53 | +**Reference**: Gardner (2022). *Working Paper*. |
55 | 54 |
|
56 | | -Practitioners need to know "how many units/periods do I need to detect an effect of size X?" Now available in diff-diff. |
| 55 | +### Triple Difference (DDD) Estimators |
57 | 56 |
|
58 | | -- ✅ Minimum detectable effect given sample size |
59 | | -- ✅ Required sample size for target power |
60 | | -- ✅ Simulation-based power for any estimator (including staggered designs) |
61 | | -- ✅ Visualization of power curves |
62 | | -- ✅ Panel data considerations (ICC, multiple periods) |
| 57 | +Extends DiD to settings requiring a third differencing dimension. Common DDD implementations are invalid when covariates are needed for identification. |
63 | 58 |
|
64 | | -**References**: Bloom (1995); Burlig, Preonas, & Woerman (2020). |
| 59 | +- Regression adjustment, IPW, and doubly robust DDD estimators |
| 60 | +- Staggered adoption support with multiple comparison groups |
| 61 | +- Proper covariate integration (naive "two DiD difference" approaches fail) |
| 62 | +- Bias reduction and precision gains over standard approaches |
65 | 63 |
|
66 | | -### ✅ CallawaySantAnna Bootstrap Inference (Done) |
| 64 | +**Reference**: [Ortiz-Villavicencio & Sant'Anna (2025)](https://arxiv.org/abs/2505.09942). *Working Paper*. R package: `triplediff`. |
67 | 65 |
|
68 | | -With few clusters or groups, analytical SEs may be unreliable. Multiplier bootstrap provides valid inference following the R `did` package approach. |
| 66 | +### Pre-Trends Power Analysis |
69 | 67 |
|
70 | | -- ✅ Multiplier bootstrap at unit level with influence function perturbation |
71 | | -- ✅ Aggregate bootstrap samples for overall ATT, event study, and group effects |
72 | | -- ✅ Rademacher, Mammen, and Webb weight distributions |
73 | | -- ✅ Percentile confidence intervals and bootstrap p-values |
| 68 | +Assess whether pre-trends tests have adequate power to detect meaningful parallel trends violations. Complements our Honest DiD implementation. |
74 | 69 |
|
75 | | -**Reference**: Callaway & Sant'Anna (2021). *Journal of Econometrics*. |
| 70 | +- Minimum detectable violation size for pre-trends tests |
| 71 | +- Visualization of power against various violation magnitudes |
| 72 | +- Integration with existing parallel trends diagnostics |
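
As a rough illustration of a "minimum detectable violation", the sketch below computes the power of a two-sided z-test on a single pre-period coefficient and inverts it by bisection. This is a deliberate simplification (one coefficient, known standard error, normal approximation) and not the joint-test procedure of the referenced paper; the function names and the standard error value are hypothetical.

```python
from statistics import NormalDist


def pretest_power(violation, se, alpha=0.05):
    """Power of a two-sided z-test of one pre-period coefficient
    when the true coefficient equals `violation`."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = violation / se
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


def min_detectable_violation(se, target_power=0.8, alpha=0.05):
    """Smallest positive violation rejected with `target_power` (bisection)."""
    lo, hi = 0.0, 10.0 * se
    for _ in range(100):
        mid = (lo + hi) / 2
        if pretest_power(mid, se, alpha) < target_power:
            lo = mid
        else:
            hi = mid
    return hi


se = 0.5  # hypothetical standard error of a pre-period event-study coefficient
mdv = min_detectable_violation(se)
print(f"violations below about {mdv:.2f} are detected less than 80% of the time")
```

Under these assumptions the threshold is roughly 2.8 standard errors, so a pre-trends test with noisy pre-period estimates can easily fail to reject violations that are large relative to the treatment effect.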
| 73 | + |
| 74 | +**Reference**: [Roth (2022)](https://www.aeaweb.org/articles?id=10.1257/aeri.20210236). *AER: Insights*. R package: `pretrends`. |
76 | 75 |
|
77 | 76 | ### Enhanced Visualization |
78 | 77 |
|
79 | 78 | - Synthetic control weight visualization (bar chart of unit weights) |
80 | | -- ✅ Bacon decomposition visualization (scatter and bar charts) |
81 | | -- Treatment adoption "staircase" plot |
| 79 | +- Treatment adoption "staircase" plot for staggered designs |
| 80 | +- Interactive plots with plotly backend option |
82 | 81 |
|
83 | 82 | --- |
84 | 83 |
|
85 | | -## Post-1.0 Features |
| 84 | +## Medium-Term Enhancements (v1.3+) |
86 | 85 |
|
87 | | -These are valuable but can wait for future versions. |
| 86 | +Extending diff-diff to handle more complex settings. |
88 | 87 |
|
89 | | -### Sun-Abraham Estimator |
| 88 | +### Continuous Treatment DiD |
90 | 89 |
|
91 | | -Alternative to Callaway-Sant'Anna using interaction-weighted approach. Some practitioners prefer it; provides a robustness check. |
| 90 | +Many treatments vary in dose or intensity rather than being binary on/off. Active research area with recent breakthroughs. |
92 | 91 |
|
93 | | -**Reference**: Sun & Abraham (2021). *Journal of Econometrics*. |
| 92 | +- Average treatment effect on the treated (ATT) parameters under generalized parallel trends |
| 93 | +- Dose-response curves and marginal effects |
| 94 | +- Handle settings where "dose" varies across units and time |
| 95 | +- Event studies with continuous treatments |
94 | 96 |
|
95 | | -### Gardner's Two-Stage DiD (did2s) |
| 97 | +**References**: |
| 98 | +- [Callaway, Goodman-Bacon & Sant'Anna (2024)](https://arxiv.org/abs/2107.02637). *NBER Working Paper*. |
| 99 | +- [de Chaisemartin, D'Haultfœuille & Vazquez-Bare (2024)](https://arxiv.org/abs/2402.05432). *AEA Papers and Proceedings*. |
| 100 | + |
| 101 | +### de Chaisemartin-D'Haultfœuille Estimator |
| 102 | + |
| 103 | +Handles treatment that switches on and off (reversible treatments), unlike most other methods. |
96 | 104 |
|
97 | | -Two-stage approach to staggered DiD that first residualizes outcomes using untreated observations, then estimates treatment effects. Available in pyfixest (Python) and did2s (R). |
| 105 | +- Allows units to move into and out of treatment |
| 106 | +- Time-varying, heterogeneous treatment effects |
| 107 | +- Comparison with never-switchers or flexible control groups |
| 108 | +- Different assumptions than CS/SA—useful for different settings |
98 | 109 |
|
99 | | -**Reference**: Gardner (2022). *Two-stage differences in differences*. |
| 110 | +**Reference**: [de Chaisemartin & D'Haultfœuille (2020, 2024)](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3980758). *American Economic Review*. |
100 | 111 |
|
101 | 112 | ### Local Projections DiD |
102 | 113 |
|
103 | | -Implements local projections for dynamic treatment effects. Flexible approach that doesn't require specifying the full dynamic structure. Gaining traction in applied work. |
| 114 | +Implements local projections for dynamic treatment effects. Does not require specifying the full dynamic structure. |
| 115 | + |
| 116 | +- Flexible impulse response estimation |
| 117 | +- Robust to misspecification of dynamics |
| 118 | +- Natural handling of anticipation effects |
| 119 | +- Growing use in macroeconomics and policy evaluation |
104 | 120 |
|
105 | 121 | **Reference**: Dube, Girardi, Jordà, and Taylor (2023). |
106 | 122 |
|
107 | | -### Borusyak-Jaravel-Spiess Imputation Estimator |
| 123 | +### Nonlinear DiD |
108 | 124 |
|
109 | | -More efficient than Callaway-Sant'Anna when parallel trends holds across all periods. Uses imputation approach. |
| 125 | +For outcomes where linear models are inappropriate (binary, count, bounded). |
110 | 126 |
|
111 | | -**Reference**: Borusyak, Jaravel, and Spiess (2024). |
| 127 | +- Logit/probit DiD for binary outcomes |
| 128 | +- Poisson DiD for count outcomes |
| 129 | +- Flexible strategies for staggered designs with nonlinear models |
| 130 | +- Proper handling of incidence rate ratios and odds ratios |
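
For intuition on the incidence rate ratio bullet: in the simplest two-group, two-period case, a saturated Poisson (log-link) model reproduces the four cell means exactly, so the exponentiated interaction coefficient equals a ratio of ratios. The sketch below works that arithmetic through with hypothetical counts; it is not a staggered-design estimator and does not use any diff-diff API.

```python
import math

# Hypothetical event counts per unit in each group-period cell.
cells = {
    ("treated", "pre"): [9, 11, 10],
    ("treated", "post"): [17, 19, 18],
    ("control", "pre"): [7, 9, 8],
    ("control", "post"): [11, 13, 12],
}
mean = {cell: sum(v) / len(v) for cell, v in cells.items()}

# Multiplicative (ratio) version of parallel trends: absent treatment, the
# treated group's mean would have grown by the same factor as the control's.
treated_ratio = mean[("treated", "post")] / mean[("treated", "pre")]
control_ratio = mean[("control", "post")] / mean[("control", "pre")]

irr = treated_ratio / control_ratio  # incidence rate ratio
log_irr = math.log(irr)              # interaction coefficient on the log scale
print(round(irr, 4), round(log_irr, 4))
```

Note that the effect is identified on the ratio scale, so the parallel trends assumption here is about proportional growth rather than additive changes.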
112 | 131 |
|
113 | | -### Double/Debiased ML for DiD |
| 132 | +**Reference**: [Wooldridge (2023)](https://academic.oup.com/ectj/article/26/3/C31/7250479). *The Econometrics Journal*. |
114 | 133 |
|
115 | | -For high-dimensional settings with many covariates. Uses ML for nuisance parameter estimation with cross-fitting. |
| 134 | +### Doubly Robust DiD + Synthetic Control |
116 | 135 |
|
117 | | -**Reference**: Chernozhukov et al. (2018), Chang (2020). |
| 136 | +Unified framework combining DiD and synthetic control with doubly robust identification—valid under *either* parallel trends or synthetic control assumptions. |
118 | 137 |
|
119 | | -### Alternative Inference Methods |
| 138 | +- ATT identified under parallel trends OR group-level SC condition |
| 139 | +- Semiparametric estimation framework |
| 140 | +- Multiplier bootstrap for valid inference under either assumption |
| 141 | +- Strengthens credibility by avoiding the DiD vs. SC trade-off |
| 142 | + |
| 143 | +**Reference**: [Sun, Xie & Zhang (2025)](https://arxiv.org/abs/2503.11375). *Working Paper*. |
120 | 144 |
|
121 | | -- Randomization inference for small samples |
122 | | -- Bayesian DiD with prior on parallel trends |
123 | | -- Conformal inference for prediction intervals |
| 145 | +### Causal Duration Analysis with DiD |
| 146 | + |
| 147 | +Extends DiD to duration/survival outcomes where standard methods fail (hazard rates, time-to-event). |
| 148 | + |
| 149 | +- Duration analogue of parallel trends on hazard rates |
| 150 | +- Avoids distributional assumptions and hazard function specification |
| 151 | +- Visual and formal pre-trends assessment for duration data |
| 152 | +- Handles absorbing states approaching probability bounds |
| 153 | + |
| 154 | +**Reference**: [Deaner & Ku (2025)](https://www.aeaweb.org/conference/2025/program/paper/k77Kh8iS). *AEA Conference Paper*. |
124 | 155 |
|
125 | 156 | --- |
126 | 157 |
|
127 | | -## Release History |
| 158 | +## Long-Term Research Directions (v2.0+) |
| 159 | + |
| 160 | +Frontier methods requiring more research investment. |
128 | 161 |
|
129 | | -### v0.9.0 (Current) |
| 162 | +### Matrix Completion Methods |
130 | 163 |
|
131 | | -- ✅ Callaway-Sant'Anna multiplier bootstrap inference |
132 | | -- ✅ Rademacher, Mammen, and Webb weight distributions |
133 | | -- ✅ Bootstrap SEs, CIs, and p-values for all aggregations (overall ATT, event study, group effects) |
134 | | -- ✅ `CSBootstrapResults` dataclass for bootstrap results |
| 164 | +Unified framework encompassing synthetic control and regression approaches. Moves seamlessly between cross-sectional and time-series patterns. |
135 | 165 |
|
136 | | -### v0.8.0 |
| 166 | +- Nuclear norm regularization for low-rank structure |
| 167 | +- Handles missing data patterns common in panel settings |
| 168 | +- Bridges synthetic control (few units, many periods) and regression (many units, few periods) |
| 169 | +- Confidence intervals via debiasing |
137 | 170 |
|
138 | | -- ✅ Power analysis tools (`PowerAnalysis`, `simulate_power`) |
139 | | -- ✅ MDE, sample size, and power calculations |
140 | | -- ✅ Simulation-based power for any DiD estimator |
141 | | -- ✅ Power curve visualization (`plot_power_curve`) |
142 | | -- ✅ Panel data support with ICC adjustment |
| 171 | +**Reference**: [Athey et al. (2021)](https://arxiv.org/abs/1710.10251). *Journal of the American Statistical Association*. |
143 | 172 |
|
144 | | -### v0.7.0 |
| 173 | +### Causal Forests for DiD |
145 | 174 |
|
146 | | -- ✅ Goodman-Bacon decomposition for TWFE diagnostics |
147 | | -- ✅ `plot_bacon()` visualization (scatter and bar charts) |
148 | | -- ✅ `TwoWayFixedEffects.decompose()` integration |
149 | | -- ✅ Automatic staggered treatment warning in TWFE |
| 175 | +Machine learning methods for discovering heterogeneous treatment effects in DiD settings. |
150 | 176 |
|
151 | | -### v0.6.0 |
| 177 | +- Estimate treatment effect heterogeneity across covariates |
| 178 | +- Data-driven subgroup discovery |
| 179 | +- Combine with DiD identification for observational data |
| 180 | +- Honest confidence intervals for discovered heterogeneity |
152 | 181 |
|
153 | | -- ✅ **All 1.0 Blockers Complete** |
154 | | -- ✅ Honest DiD sensitivity analysis (Rambachan & Roth 2023) |
155 | | -- ✅ CallawaySantAnna covariate adjustment (DR, IPW, Reg) |
156 | | -- ✅ API documentation site with Sphinx |
| 182 | +**References**: |
| 183 | +- [Kattenberg, Scheer & Thiel (2023)](https://ideas.repec.org/p/cpb/discus/452.html). *CPB Discussion Paper*. |
| 184 | +- Athey, Tibshirani & Wager (2019). *Annals of Statistics*. |
157 | 185 |
|
158 | | -### v0.5.0 |
| 186 | +### Double/Debiased ML for DiD |
| 187 | + |
| 188 | +For high-dimensional settings with many potential confounders. |
159 | 189 |
|
160 | | -- Wild cluster bootstrap (Rademacher, Webb, Mammen weights) |
161 | | -- Placebo tests module |
162 | | -- Tutorial notebooks |
| 190 | +- ML for nuisance parameter estimation (propensity, outcome models) |
| 191 | +- Cross-fitting for valid inference |
| 192 | +- Handles many covariates without overfitting concerns |
| 193 | +- Doubly-robust estimation with ML flexibility |
163 | 194 |
|
164 | | -### v0.4.0 |
| 195 | +**Reference**: Chernozhukov et al. (2018). *The Econometrics Journal*. |
165 | 196 |
|
166 | | -- Callaway-Sant'Anna estimator for staggered DiD |
167 | | -- Event study and group effects visualization |
168 | | -- Parallel trends testing utilities |
| 197 | +### Alternative Inference Methods |
169 | 198 |
|
170 | | -### v0.3.0 |
| 199 | +- **Randomization inference**: Exact p-values for small samples |
| 200 | +- **Bayesian DiD**: Priors on parallel trends violations |
| 201 | +- **Conformal inference**: Prediction intervals with finite-sample guarantees |
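
To make the randomization inference item concrete, here is a minimal sketch that enumerates every possible assignment of treatment across a handful of units and computes an exact two-sided p-value for a difference in mean outcome changes. The data, the test statistic, and the equal-probability assignment mechanism are all assumptions of the example.

```python
from itertools import combinations

# Hypothetical post-minus-pre outcome changes for six units.
changes = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
treated = {0, 1, 2}  # indices of the units actually treated


def diff_in_means(treated_idx):
    """Mean change among treated units minus mean change among controls."""
    t = [changes[i] for i in treated_idx]
    c = [changes[i] for i in range(len(changes)) if i not in treated_idx]
    return sum(t) / len(t) - sum(c) / len(c)


observed = diff_in_means(treated)

# Under the sharp null of no effect for any unit, every way of choosing
# three treated units out of six is equally likely.
assignments = list(combinations(range(len(changes)), len(treated)))
extreme = sum(
    abs(diff_in_means(set(a))) >= abs(observed) - 1e-12 for a in assignments
)
p_value = extreme / len(assignments)
print(observed, p_value)
```

With only 20 possible assignments the smallest attainable two-sided p-value is 0.1, which shows why exact inference in very small samples has limited resolution.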
| 202 | + |
| 203 | +--- |
171 | 204 |
|
172 | | -- Synthetic Difference-in-Differences |
173 | | -- Multi-period DiD with event study |
174 | | -- Data preparation utilities |
| 205 | +## Infrastructure Improvements |
175 | 206 |
|
176 | | -### v0.2.0 |
| 207 | +Ongoing maintenance and developer experience. |
177 | 208 |
|
178 | | -- Two-Way Fixed Effects estimator |
179 | | -- Fixed effects support (absorb parameter) |
180 | | -- Cluster-robust standard errors |
181 | | -- Formula interface |
| 209 | +### Performance |
182 | 210 |
|
183 | | -### v0.1.0 |
| 211 | +- JIT compilation for bootstrap loops (numba) |
| 212 | +- Parallel bootstrap iterations |
| 213 | +- Sparse matrix handling for large fixed effects |
| 214 | +- Memory-efficient estimation for large panels |
184 | 215 |
|
185 | | -- Initial release with basic DiD estimator |
| 216 | +### Code Quality |
| 217 | + |
| 218 | +- Extract shared within-transformation logic to utils |
| 219 | +- Consolidate linear regression helpers |
| 220 | +- Consider splitting `staggered.py` (1800+ lines) |
| 221 | + |
| 222 | +### Documentation |
| 223 | + |
| 224 | +- Real-world data examples (beyond synthetic) |
| 225 | +- Performance benchmarks vs. R packages |
| 226 | +- Video tutorials and worked examples |
186 | 227 |
|
187 | 228 | --- |
188 | 229 |
|
189 | 230 | ## Contributing |
190 | 231 |
|
191 | | -Interested in contributing? See the [GitHub repository](https://github.com/igerber/diff-diff) for open issues. Features marked "Not Started" are good candidates for contributions. |
| 232 | +Interested in contributing? Features in the "Near-Term" and "Medium-Term" sections are good candidates. See the [GitHub repository](https://github.com/igerber/diff-diff) for open issues. |
| 233 | + |
| 234 | +Key references for implementation: |
| 235 | +- [Roth et al. (2023)](https://www.sciencedirect.com/science/article/abs/pii/S0304407623001318). "What's Trending in Difference-in-Differences?" *Journal of Econometrics*. |
| 236 | +- [Baker et al. (2025)](https://arxiv.org/pdf/2503.13323). "Difference-in-Differences Designs: A Practitioner's Guide." |