This document tracks the progress of reviewing each estimator's implementation against the Methodology Registry and academic references. It ensures that implementations are correct, consistent, and well-documented.
For the methodology registry with academic foundations and key equations, see docs/methodology/REGISTRY.md.
Each estimator in diff-diff should be periodically reviewed to ensure:
- Correctness: Implementation matches the academic paper's equations
- Reference alignment: Behavior matches reference implementations (R packages, Stata commands)
- Edge case handling: Documented edge cases are handled correctly
- Standard errors: SE formulas match the documented approach
A Complete entry has a documented review pass against the primary academic source captured in this file. The minimum content is:
- A "Corrections Made" block listing every implementation fix the review uncovered, or `(None — implementation verified correct)`.
- An explicit statement of deviations from the reference implementation, or `(None)`. Format varies: some entries use a dedicated "Deviations" / "Deviations from R" block, others surface deviations inline in "Corrections Made" or "Outstanding Concerns".
- Verification evidence: a "Verified Components" checklist, an "Edge Cases Verified" enumeration, an "R Comparison Results" table, or some combination of these.
The catalog grew incrementally over several quarters, so formats vary across the existing Complete entries; the consistent invariant is that someone walked through the implementation against the academic source and captured the result here. New reviews going forward should aim for the fuller structure (Verified Components + Corrections Made + Deviations + dedicated methodology test file) used by the more recent entries.
In Progress entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures (e.g., DCDH has a methodology file, R parity, and a companion-paper review for the 2026 universal-rollout extension; HAD has its primary-source paper review and R parity but no dedicated methodology file; ContinuousDiD has the methodology file but no paper review); others have only the REGISTRY entry and unit tests (e.g., PowerAnalysis). The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete.
Not Started entries have neither a tracker walk-through nor a REGISTRY.md section. This tracker no longer carries any Not Started rows; new estimators are expected to enter as In Progress when their REGISTRY entry lands.
| Estimator | Module | R / Stata Reference | Status | Last Review |
|---|---|---|---|---|
| DifferenceInDifferences | estimators.py | fixest::feols() | Complete | 2026-01-24 |
| MultiPeriodDiD | estimators.py | fixest::feols() | Complete | 2026-02-02 |
| TwoWayFixedEffects | twfe.py | fixest::feols() | Complete | 2026-02-08 |
| Estimator | Module | R / Stata Reference | Status | Last Review |
|---|---|---|---|---|
| CallawaySantAnna | staggered.py | did::att_gt() | Complete | 2026-01-24 |
| SunAbraham | sun_abraham.py | fixest::sunab() | Complete | 2026-02-15 |
| StackedDiD | stacked_did.py | stacked-did-weights (Wing-Freedman-Hollingsworth code) | Complete | 2026-02-19 |
| ImputationDiD | imputation.py | didimputation | In Progress | — |
| TwoStageDiD | two_stage.py | did2s | In Progress | — |
| WooldridgeDiD (ETWFE) | wooldridge.py | etwfe (R) / jwdid (Stata) | In Progress | — |
| EfficientDiD | efficient_did.py | (no canonical R package) | In Progress | — |
| Estimator | Module | R / Stata Reference | Status | Last Review |
|---|---|---|---|---|
| ContinuousDiD | continuous_did.py | contdid v0.1.0 | In Progress | — |
| ChaisemartinDHaultfoeuille (DCDH) | chaisemartin_dhaultfoeuille.py | DIDmultiplegtDYN | In Progress | — |
| HeterogeneousAdoptionDiD (HAD) | had.py, had_pretests.py | (paper-direct; nprobust for bandwidth) | In Progress | — |
| TROP | trop.py, trop_local.py, trop_global.py | (forthcoming; paper-author reference implementation) | In Progress | — |
| Estimator | Module | R Reference | Status | Last Review |
|---|---|---|---|---|
| TripleDifference | triple_diff.py | triplediff::ddd() | Complete | 2026-02-18 |
| StaggeredTripleDifference | staggered_triple_diff.py | triplediff::ddd(panel=TRUE) + agg_ddd() | In Progress | — |
| Estimator | Module | R Reference | Status | Last Review |
|---|---|---|---|---|
| SyntheticDiD | synthetic_did.py | synthdid::synthdid_estimate() | Complete | 2026-04-23 |
| Tool | Module | R Reference | Status | Last Review |
|---|---|---|---|---|
| BaconDecomposition | bacon.py | bacondecomp::bacon() | Complete | 2026-05-16 |
| HonestDiD | honest_did.py | HonestDiD package | Complete | 2026-04-01 |
| PreTrendsPower | pretrends.py | pretrends package | In Progress | — |
| PowerAnalysis | power.py | pwr / DeclareDesign | In Progress | — |
| PlaceboTests | diagnostics.py | (no canonical reference) | In Progress | — |
| Feature | Module | Reference | Status | Last Review |
|---|---|---|---|---|
| ConleySpatialHAC | conley.py, linalg.py | conleyreg (R) / acreg (Stata) | In Progress | — |
| Survey Data Support | survey.py, bootstrap_utils.py | survey package (R) | In Progress | — |
Status legend (matches the contract in § What "Complete" means in this tracker above):
- Not Started: No REGISTRY.md entry yet. Reserved for future surfaces; this tracker currently carries no Not Started rows.
- In Progress: REGISTRY.md entry and unit-test coverage exist, but no formal walk-through has been captured in this document yet. The band is wide — see each entry's "Documentation in place" / "Outstanding for promotion" sub-sections for specifics.
- Complete: A documented review pass against the primary academic source is captured here (minimum: Corrections Made, Deviations or `(None)`, and Verified Components / Edge Cases Verified / R Comparison Results in some form).
| Field | Value |
|---|---|
| Module | estimators.py |
| Primary Reference | Wooldridge (2010), Angrist & Pischke (2009) |
| R Reference | fixest::feols() |
| Status | Complete |
| Last Review | 2026-01-24 |
Verified Components:
- ATT formula: Double-difference of cell means matches regression interaction coefficient
- R comparison: ATT matches `fixest::feols()` within 1e-3 tolerance
- R comparison: SE (HC1 robust) matches within 5%
- R comparison: P-value matches within 0.01
- R comparison: Confidence intervals overlap
- R comparison: Cluster-robust SE matches within 10%
- R comparison: Fixed effects (absorb) matches `feols(...|unit)` within 1%
- Wild bootstrap inference (Rademacher, Mammen, Webb weights)
- Formula interface (`y ~ treated * post`)
- All REGISTRY.md edge cases tested
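The first item in the checklist above can be illustrated with a minimal sketch: the double-difference of cell means should equal the interaction coefficient of a saturated 2×2 regression. The toy data and the helper names (`cell_mean`, `solve`) are illustrative assumptions and not part of diff-diff; the regression is solved through the normal equations with the standard library only, so the snippet stays self-contained.

```python
from statistics import fmean

# Hypothetical toy data: (treated, post, outcome), two observations per cell.
rows = [
    (0, 0, 1.0), (0, 0, 1.4),
    (0, 1, 2.0), (0, 1, 2.2),
    (1, 0, 3.0), (1, 0, 3.4),
    (1, 1, 6.1), (1, 1, 6.5),
]


def cell_mean(group, period):
    """Mean outcome in one (treated, post) cell."""
    return fmean(y for (d, t, y) in rows if d == group and t == period)


# Double-difference of cell means.
att_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Saturated regression: y = b0 + b1*treated + b2*post + b3*treated*post.
X = [[1.0, d, t, d * t] for (d, t, _) in rows]
y = [row[2] for row in rows]
k = 4
xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
att_ols = solve(xtx, xty)[3]  # coefficient on treated*post

print(att_means, att_ols)
```

Both numbers agree up to floating-point error because the saturated model fits each of the four cell means exactly.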
Test Coverage:
- 51 methodology verification tests in `tests/test_methodology_did.py`
- Existing unit-test coverage in `tests/test_estimators.py` (`TestDifferenceInDifferences` class plus shared estimator-API classes)
- R benchmark tests (skip if R not available)
R Comparison Results:
- ATT matches within 1e-3 (R JSON truncation limits precision)
- HC1 SE matches within 5%
- Cluster-robust SE matches within 10%
- Fixed effects results match within 1%
Corrections Made:
- (None — implementation verified correct)
Outstanding Concerns:
- R comparison precision limited by JSON output truncation (4 decimal places)
- Consider improving R script to output full precision for tighter tolerances
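A small sketch of why truncated reference output forces a loose tolerance. The estimate below is a made-up number and `truncate` is a hypothetical helper; the only facts taken from this entry are the 4-decimal truncation and the 1e-3 tolerance.

```python
import math


def truncate(value: float, decimals: int = 4) -> float:
    """Drop digits beyond `decimals` places, mimicking truncated JSON output."""
    scale = 10 ** decimals
    return math.trunc(value * scale) / scale


python_att = 2.3456789              # hypothetical full-precision Python estimate
r_att_json = truncate(python_att)   # what a 4-decimal export would retain

# Truncation to 4 decimals can move the value by up to 1e-4.
gap = abs(python_att - r_att_json)

# The documented 1e-3 tolerance absorbs that gap; a tight tolerance does not.
passes_loose = math.isclose(python_att, r_att_json, rel_tol=0.0, abs_tol=1e-3)
passes_tight = math.isclose(python_att, r_att_json, rel_tol=0.0, abs_tol=1e-6)
print(gap, passes_loose, passes_tight)
```

Exporting full precision from the R script would let the comparison use a tolerance several orders of magnitude tighter.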
Edge Cases Verified:
- Empty cells: Produces rank deficiency warning (expected behavior)
- Singleton clusters: Included in variance estimation, contribute via residuals (corrected REGISTRY.md)
- Rank deficiency: All three modes (warn/error/silent) working
- Non-binary treatment/time: Raises ValueError as expected
- No variation in treatment/time: Raises ValueError as expected
- Missing values: Raises ValueError as expected
Deviations from R's fixest::feols(): (None — point estimates and SEs match within
documented tolerances; cluster-robust and absorbed-FE behavior verified.)
| Field | Value |
|---|---|
| Module | estimators.py |
| Primary Reference | Freyaldenhoven et al. (2021), Wooldridge (2010), Angrist & Pischke (2009) |
| R Reference | fixest::feols() |
| Status | Complete |
| Last Review | 2026-02-02 |
Verified Components:
- Full event-study specification: treatment × period interactions for ALL non-reference periods (pre and post)
- Reference period coefficient is zero (normalized by omission from design matrix)
- Default reference period is last pre-period (e=-1 convention, matches fixest/did)
- Pre-period coefficients available for parallel trends assessment
- Average ATT computed from post-treatment effects only, with covariance-aware SE
- Returns PeriodEffect objects with confidence intervals for all periods
- Supports balanced and unbalanced panels
- NaN inference: t_stat/p_value/CI use NaN when SE is non-finite or zero
- R-style NA propagation: avg_att is NaN if any post-period effect is unidentified
- Rank-deficient design matrix: warns and sets NaN for dropped coefficients (R-style)
- Staggered adoption detection warning (via `unit` parameter)
- Treatment reversal detection warning
- Time-varying D_it detection warning (advises creating ever-treated indicator)
- Single pre-period warning (ATT valid but pre-trends assessment unavailable)
- Post-period reference_period raises ValueError (would bias avg_att)
- HonestDiD/PreTrendsPower integration uses interaction sub-VCV (not full regression VCV)
- All REGISTRY.md edge cases tested
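The NaN-inference and NA-propagation items above can be sketched as follows. This is a hedged illustration, not the library's code: `nan_safe_inference` and `average_att` are hypothetical names, the normal approximation is an assumption made to keep the example dependency-free, and the simple unweighted mean is likewise an assumption about how post-period effects are averaged.

```python
import math
from statistics import NormalDist


def nan_safe_inference(estimate: float, se: float, alpha: float = 0.05) -> dict:
    """Return t_stat, p_value and conf_int, all NaN when SE is non-finite or zero."""
    if not math.isfinite(se) or se == 0:
        nan = math.nan
        return {"t_stat": nan, "p_value": nan, "conf_int": (nan, nan)}
    normal = NormalDist()  # assumption: normal reference distribution
    t_stat = estimate / se
    p_value = 2.0 * (1.0 - normal.cdf(abs(t_stat)))
    crit = normal.inv_cdf(1.0 - alpha / 2.0)
    return {
        "t_stat": t_stat,
        "p_value": p_value,
        "conf_int": (estimate - crit * se, estimate + crit * se),
    }


def average_att(post_effects: list) -> float:
    """R-style NA propagation: one unidentified effect makes the average NaN."""
    if any(math.isnan(effect) for effect in post_effects):
        return math.nan
    return sum(post_effects) / len(post_effects)


ok = nan_safe_inference(1.96, 1.0)
degenerate = nan_safe_inference(1.96, 0.0)
avg_identified = average_att([1.0, 2.0, 3.0])
avg_unidentified = average_att([1.0, math.nan, 3.0])
print(ok, degenerate, avg_identified, avg_unidentified)
```

Returning NaN rather than raising keeps result objects usable in loops and tables while making it obvious that no valid inference exists.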
Test Coverage:
- 50 tests across `TestMultiPeriodDiD` and `TestMultiPeriodDiDEventStudy` in `tests/test_estimators.py`
- 18 new event-study specification tests added in PR #125
Corrections Made:
- PR #125 (2026-02-02): Transformed from post-period-only estimator into full event-study specification with pre-period coefficients. Reference period default changed from first pre-period to last pre-period (e=-1 convention). HonestDiD/PreTrendsPower VCV extraction fixed to use interaction sub-VCV instead of full regression VCV.
Outstanding Concerns:
- R comparison benchmark via `benchmarks/R/benchmark_multiperiod.R` using `fixest::feols(outcome ~ treated * time_f | unit)`. ATT diff < 1e-11, SE diff 0.0%, period-effects correlation 1.0. Validated at small (200 units) and 1k scales.
- Endpoint binning for distant event times not yet implemented.
- FutureWarning for reference_period default change should eventually be removed once the transition is complete.
Deviations from R's fixest::feols():
- Default SE is HC1, not cluster-robust at unit level (the `fixest` default for panel data). Cluster-robust available via `cluster` parameter but not the default.
- Reference period default is last pre-period (e=-1 convention, matches `fixest`/`did`); prior Python releases used first pre-period and the change is gated by a `FutureWarning` until the deprecation window closes.
| Field | Value |
|---|---|
| Module | twfe.py |
| Primary Reference | Wooldridge (2010), Ch. 10 |
| R Reference | fixest::feols() |
| Status | Complete |
| Last Review | 2026-02-08 |
Verified Components:
- Within-transformation algebra: `y_it - ȳ_i - ȳ_t + ȳ` matches hand calculation (rtol=1e-12)
- ATT matches manual demeaned OLS (rtol=1e-10)
- ATT matches `DifferenceInDifferences` on 2-period data (rtol=1e-10)
- Covariates are also within-transformed (sum to zero within unit/time groups)
- R comparison: ATT matches `fixest::feols(y ~ treated:post | unit + post, cluster=~unit)` (rtol<0.1%)
- R comparison: Cluster-robust SE match (rtol<1%)
- R comparison: P-value match (atol<0.01)
- R comparison: CI bounds match (rtol<1%)
- R comparison: ATT and SE match with covariate (same tolerances)
- Edge case: Staggered treatment triggers `UserWarning`
- Edge case: Auto-clusters at unit level (SE matches explicit `cluster="unit"`)
- Edge case: DF adjustment for absorbed FE matches manual `solve_ols()` with `df_adjustment`
- Edge case: Covariate collinear with interaction raises `ValueError` ("cannot be identified")
- Edge case: Covariate collinearity warns but ATT remains finite
- Edge case: `rank_deficient_action="error"` raises `ValueError`
- Edge case: `rank_deficient_action="silent"` emits no warnings
- Edge case: Unbalanced panel produces valid results (finite ATT, positive SE)
- Edge case: Missing unit column raises `ValueError`
- Integration: `decompose()` returns `BaconDecompositionResults`
- SE: Cluster-robust SE >= HC1 SE
- SE: VCoV positive semi-definite
- Wild bootstrap: Valid inference (finite SE, p-value in [0,1])
- Wild bootstrap: All weight types (rademacher, mammen, webb) produce valid inference
- Wild bootstrap: `inference="wild_bootstrap"` routes correctly
- Params: `get_params()` returns all inherited parameters
- Params: `set_params()` modifies attributes
- Results: `summary()` contains "ATT"
- Results: `to_dict()` contains att, se, t_stat, p_value, n_obs
- Results: residuals + fitted = demeaned outcome (not raw)
- Edge case: Multi-period time emits UserWarning advising binary post indicator
- Edge case: Non-{0,1} binary time emits UserWarning (ATT still correct)
- Edge case: ATT invariant to time encoding ({0,1} vs {2020,2021} produces identical results)
Key Implementation Detail:
The interaction term D_i × Post_t must be within-transformed (demeaned) alongside the outcome,
consistent with the Frisch-Waugh-Lovell (FWL) theorem: the fixed effects must be partialled out
of the outcome and of every regressor, not of the outcome alone. R's fixest::feols() does this
automatically when variables appear to the left of the | separator.
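A minimal sketch of this point, assuming a balanced, noise-free panel with 4 units, 4 periods, half the units treated and two post periods (all values are made up). The helpers `two_way_demean` and `slope` are illustrative and not the library's functions. Demeaning both the outcome and the interaction recovers the true effect, while regressing the demeaned outcome on the raw interaction does not.

```python
from statistics import fmean


def two_way_demean(x, n_units, n_times):
    """Apply x_it - mean_i - mean_t + grand mean on a balanced panel dict {(i, t): value}."""
    unit_mean = [fmean(x[i, t] for t in range(n_times)) for i in range(n_units)]
    time_mean = [fmean(x[i, t] for i in range(n_units)) for t in range(n_times)]
    grand = fmean(x.values())
    return {(i, t): v - unit_mean[i] - time_mean[t] + grand for (i, t), v in x.items()}


def slope(x, y):
    """OLS slope of y on x with an intercept: cov(x, y) / var(x)."""
    keys = list(x)
    mx = fmean(x[k] for k in keys)
    my = fmean(y[k] for k in keys)
    sxy = sum((x[k] - mx) * (y[k] - my) for k in keys)
    sxx = sum((x[k] - mx) ** 2 for k in keys)
    return sxy / sxx


n_units, n_times, tau = 4, 4, 2.0
alpha = [1.0, -0.5, 3.0, 0.25]   # unit fixed effects (arbitrary)
gamma = [0.0, 0.4, 0.9, 1.5]     # time fixed effects (arbitrary)

# Units 2 and 3 are treated; periods 2 and 3 are post.
d = {(i, t): float(i >= 2 and t >= 2) for i in range(n_units) for t in range(n_times)}
y = {(i, t): alpha[i] + gamma[t] + tau * d[i, t] for (i, t) in d}

y_dm = two_way_demean(y, n_units, n_times)
d_dm = two_way_demean(d, n_units, n_times)

att_fwl = slope(d_dm, y_dm)  # interaction demeaned alongside the outcome
att_raw = slope(d, y_dm)     # raw interaction against the demeaned outcome
print(att_fwl, att_raw)
```

In this particular design the raw-interaction slope equals one third of the true effect, which is in line with the roughly 1/3 attenuation described under Corrections Made; other designs would give a different ratio.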
Corrections Made:
- Bug fix: interaction term must be within-transformed (found during review). The previous implementation used raw (un-demeaned) `D_i × Post_t` in the demeaned regression. This gave correct results only for 2-period panels where `post == period`. For multi-period panels (e.g., 4 periods with binary `post`), the raw interaction had incorrect correlation with demeaned Y, producing ATT approximately 1/3 of the true value. Fixed by applying the same within-transformation to the interaction term before regression. This matches R's `fixest::feols()` behavior. (`twfe.py` lines 99-113)
Outstanding Concerns:
- Multi-period `time` parameter: Multi-period time values (e.g., 1,2,3,4) produce `treated × period_number` instead of `treated × post_indicator`, which is not the standard D_it treatment indicator. A `UserWarning` is emitted when `time` has >2 unique values. For binary time with non-{0,1} values (e.g., {2020, 2021}), the ATT is mathematically correct (the within-transformation absorbs the scaling), but a warning recommends 0/1 encoding for clarity. Users with multi-period data should create a binary `post` column.
- Staggered treatment warning: The warning only fires when `time` has >2 unique values (i.e., actual period numbers). With binary `time="post"`, all treated units appear to start treatment at `time=1`, making staggering undetectable. Users with staggered designs should use `decompose()` or `CallawaySantAnna` directly for proper diagnostics.
Deviations from R's fixest::feols(): (None — point estimates, cluster-robust SEs,
CI bounds, and absorbed-FE results all match within documented tolerances on both bare
and covariate-adjusted specifications.)
| Field | Value |
|---|---|
| Module | staggered.py |
| Primary Reference | Callaway & Sant'Anna (2021) |
| R Reference | did::att_gt() |
| Status | Complete |
| Last Review | 2026-01-24 |
Verified Components:
- ATT(g,t) basic formula (hand-calculated exact match)
- Doubly robust estimator
- IPW estimator
- Outcome regression
- Base period selection (varying/universal)
- Anticipation parameter handling
- Simple/event-study/group aggregation
- Analytical SE with weight influence function
- Bootstrap SE (Rademacher/Mammen/Webb)
- Control group composition (never_treated/not_yet_treated)
- All documented edge cases from REGISTRY.md
Test Coverage:
- 61 methodology verification tests in `tests/test_methodology_callaway.py`
- Existing unit-test coverage in `tests/test_staggered.py`
- R benchmark tests (skip if R not available)
R Comparison Results:
- Overall ATT matches within 20% (difference due to dynamic effects in generated data)
- Post-treatment ATT(g,t) values match within 20%
- Pre-treatment effects may differ due to base_period handling differences
Corrections Made:
- (None — implementation verified correct)
Outstanding Concerns:
- R comparison shows ~20% difference in overall ATT with generated data
  - Likely due to differences in how dynamic effects are handled in data generation
  - Individual ATT(g,t) values match closely for post-treatment periods
  - Further investigation recommended with real-world data
- Pre-treatment ATT(g,t) may differ from R due to base_period="varying" semantics
  - Python uses t-1 as base for pre-treatment
  - R's behavior requires verification
Deviations from R's did::att_gt():
- NaN for invalid inference: When SE is non-finite or zero, Python returns NaN for t_stat/p_value rather than potentially erroring. This is a defensive enhancement.
Alignment with R's did::att_gt() (as of v2.1.5):
- Webb weights: Webb's 6-point distribution with values ±√(3/2), ±1, ±√(1/2) uses equal probabilities (1/6 each) matching R's `did` package. This gives E[w]=0, Var(w)=1.0, consistent with other bootstrap weight distributions.

  Verification: Our implementation matches the well-established `fwildclusterboot` R package (C++ source: wildboottest.cpp). The implementation uses `sqrt(1.5)`, `1`, `sqrt(0.5)` (and negatives) with equal 1/6 probabilities, identical to our values.

  Note on documentation discrepancy: Some documentation (e.g., fwildclusterboot vignette) describes Webb weights as "±1.5, ±1, ±0.5". This appears to be a simplification for readability. The actual implementations use ±√1.5, ±1, ±√0.5, which provide the required unit variance (Var(w) = 1.0).
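The moment claims for the Webb weights are easy to check directly. The sketch below is a standalone calculation, not the library's sampler; it compares the square-root values with the simplified "±1.5, ±1, ±0.5" description.

```python
import math

# Six-point Webb distribution, each value drawn with probability 1/6.
webb_points = [
    -math.sqrt(1.5), -1.0, -math.sqrt(0.5),
    math.sqrt(0.5), 1.0, math.sqrt(1.5),
]
prob = 1.0 / 6.0

webb_mean = sum(prob * w for w in webb_points)
webb_var = sum(prob * (w - webb_mean) ** 2 for w in webb_points)

# The simplified values quoted in some documentation, for comparison.
simplified_points = [-1.5, -1.0, -0.5, 0.5, 1.0, 1.5]
simplified_var = sum(prob * w ** 2 for w in simplified_points)

print(webb_mean, webb_var, simplified_var)
```

The square-root values give mean 0 and variance 1 (since 1.5 + 1 + 0.5 = 3 and 2 · 3 / 6 = 1), whereas the simplified values would give variance 7/6, so they cannot be what the implementations actually draw.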
| Field | Value |
|---|---|
| Module | sun_abraham.py |
| Primary Reference | Sun & Abraham (2021) |
| R Reference | fixest::sunab() |
| Status | Complete |
| Last Review | 2026-02-15 |
Verified Components:
- Saturated TWFE regression with cohort × relative-time interactions
- Within-transformation for unit and time fixed effects
- Interaction-weighted event study effects (δ̂_e = Σ_g ŵ_{g,e} × δ̂_{g,e})
- IW weights match event-time sample shares (n_{g,e} / Σ_g n_{g,e})
- Overall ATT as weighted average of post-treatment effects
- Delta method SE for aggregated effects (Var = w' Σ w)
- Cluster-robust SEs at unit level
- Reference period normalized to zero (e=-1 excluded from design matrix)
- R comparison: ATT matches `fixest::sunab()` within machine precision (<1e-11)
- R comparison: SE matches within 0.3% (small scale) / 0.1% (1k scale)
- R comparison: Event study effects correlation = 1.000000
- R comparison: Event study max diff < 1e-11
- Bootstrap inference (pairs bootstrap)
- Rank deficiency handling (warn/error/silent)
- All REGISTRY.md edge cases tested
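The interaction-weighted aggregation and its delta-method variance from the list above can be sketched for a single event time. The cohort estimates, counts, and covariance matrix are made-up numbers, and the weights are treated as fixed, which is an assumption of this illustration.

```python
import math

# Hypothetical inputs for one event time e and three cohorts g.
delta_ge = [1.0, 2.0, 4.0]   # cohort-specific effects
n_ge = [50, 30, 20]          # observation counts n_{g,e}
vcov = [                     # covariance matrix of the three estimates
    [0.04, 0.01, 0.00],
    [0.01, 0.09, 0.02],
    [0.00, 0.02, 0.16],
]

# IW weights: event-time sample shares n_{g,e} / sum_g n_{g,e}.
total = sum(n_ge)
weights = [n / total for n in n_ge]

# Aggregated effect: sum_g w_{g,e} * delta_{g,e}.
delta_e = sum(w * d for w, d in zip(weights, delta_ge))

# Delta-method variance: w' Sigma w, including the off-diagonal terms.
k = len(weights)
var_e = sum(weights[i] * vcov[i][j] * weights[j] for i in range(k) for j in range(k))
se_e = math.sqrt(var_e)

# Simplified variance that ignores covariances, for comparison only.
var_diag_only = sum(weights[i] ** 2 * vcov[i][i] for i in range(k))

print(delta_e, se_e, var_diag_only)
```

With positive covariances the diagonal-only variance understates the full one (0.0245 versus 0.0299 here), which shows how much a covariance-ignoring shortcut can matter.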
Test Coverage:
- Combined methodology + unit tests in `tests/test_sun_abraham.py` (the methodology verification block grew incrementally from the original 7 review tests as edge cases were added)
- R benchmark tests via `benchmarks/run_benchmarks.py --estimator sunab`
R Comparison Results:
- Overall ATT matches within machine precision (diff < 1e-11 at both scales)
- Cluster-robust SE matches within 0.3% (well within 1% threshold)
- Event study effects match perfectly (correlation 1.0, max diff < 1e-11)
- Validated at small (200 units) and 1k (1000 units) scales
Corrections Made:
- DF adjustment for absorbed FE (`sun_abraham.py`, `_fit_saturated_regression()`): Added `df_adjustment = n_units + n_times - 1` to `LinearRegression.fit()` to account for absorbed unit and time fixed effects in degrees of freedom. Unlike TWFE (which uses `-2` plus an explicit intercept column), SunAbraham's saturated regression has no intercept, so all absorbed df must come from the adjustment. Affects t-distribution DoF for cohort-level p-values/CIs (slightly larger p-values, slightly wider CIs) but does NOT change VCV or SE values.
- NaN return for no post-treatment effects (`sun_abraham.py`, `_compute_overall_att()`): Changed return from `(0.0, 0.0)` to `(np.nan, np.nan)` when no post-treatment effects exist. All downstream inference fields (t_stat, p_value, conf_int) correctly propagate NaN via existing guards in `fit()`.
- Deprecation warnings for unused parameters (`sun_abraham.py`, `fit()`): Added `FutureWarning` for `min_pre_periods` and `min_post_periods` parameters that are accepted but never used (no-op). These will be removed in a future version.
- Removed event-time truncation at [-20, 20] (`sun_abraham.py`): Removed the hardcoded cap `max(min(...), -20)` / `min(max(...), 20)` to match R's `fixest::sunab()` which has no such limit. All available relative times are now estimated.
- Warning for variance fallback path (`sun_abraham.py`, `_compute_overall_att()`): Added `UserWarning` when the full weight vector cannot be constructed and a simplified variance (ignoring covariances between periods) is used as fallback.
- IW weights use event-time sample shares (`sun_abraham.py`, `_compute_iw_effects()`): Changed IW weights from `n_g / Σ_g n_g` (cohort sizes) to `n_{g,e} / Σ_g n_{g,e}` (per-event-time observation counts) to match the REGISTRY.md formula. For balanced panels these are identical; for unbalanced panels the new formula correctly reflects actual sample composition at each event-time. Added unbalanced panel test.
- Normalize `np.inf` never-treated encoding (`sun_abraham.py`, `fit()`): `first_treat=np.inf` (documented as valid for never-treated) was included in `treatment_groups` and `_rel_time` via `> 0` checks, producing `-inf` event times. Fixed by normalizing `np.inf` to `0` immediately after computing `_never_treated`. Same fix applied to `staggered.py` (CallawaySantAnna).
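The last correction can be illustrated with a short sketch of the failure mode and the normalization. `normalize_first_treat` is a hypothetical helper written for this example (the real fix lives inside `fit()`), and plain lists stand in for the data frame columns.

```python
import math


def normalize_first_treat(first_treat):
    """Map both never-treated encodings (0 and +inf) to 0. Hypothetical helper."""
    return [0.0 if (ft == 0 or ft == math.inf) else float(ft) for ft in first_treat]


first_treat = [2004, math.inf, 2006, 0, math.inf]

# Without normalization, a `> 0` check admits inf as if it were a cohort,
# and relative time t - first_treat evaluates to -inf for those units.
buggy_groups = sorted({ft for ft in first_treat if ft > 0})
buggy_rel_time = 2005 - math.inf

# With normalization, only genuine adoption cohorts remain.
normalized = normalize_first_treat(first_treat)
treatment_groups = sorted({ft for ft in normalized if ft > 0})

print(buggy_groups, buggy_rel_time, treatment_groups)
```

Normalizing once, right after the never-treated mask is computed, means every later `> 0` check can stay unchanged.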
Outstanding Concerns:
- Inference distribution: Cohort-level p-values use t-distribution (via `LinearRegression.get_inference()`), while aggregated event study and overall ATT p-values use normal distribution (via `compute_p_value()`). This is asymptotically equivalent and standard for delta-method-aggregated quantities. R's fixest uses t-distribution at all levels, so aggregated p-values may differ slightly for small samples; this is a documented deviation.
Deviations from R's fixest::sunab():
- NaN for no post-treatment effects: Python returns `(NaN, NaN)` for overall ATT/SE when no post-treatment effects exist. R would error.
- Normal distribution for aggregated inference: Aggregated p-values use normal distribution (asymptotically equivalent). R uses t-distribution.
| Field | Value |
|---|---|
| Module | stacked_did.py |
| Primary Reference | Wing, Freedman & Hollingsworth (2024), NBER WP 32054 |
| R Reference | stacked-did-weights (create_sub_exp() + compute_weights()) |
| Status | Complete |
| Last Review | 2026-02-19 |
Verified Components:
- IC1 trimming: `a - kappa_pre >= T_min AND a + kappa_post <= T_max` (matches R reference)
- IC2 trimming: Three clean control modes (not_yet_treated, strict, never_treated)
- Sub-experiment construction: treated + clean controls within `[a - kappa_pre, a + kappa_post]`
- Q-weights aggregate: treated Q=1, control `Q = (sub_treat_n/stack_treat_n) / (sub_control_n/stack_control_n)` per (event_time, sub_exp), matching R `compute_weights()`
- Q-weights population: `Q_a = (Pop_a^D / Pop^D) / (N_a^C / N^C)` (Table 1, Row 2)
- Q-weights sample_share: `Q_a = ((N_a^D + N_a^C)/(N^D+N^C)) / (N_a^C / N^C)` (Table 1, Row 3)
- WLS via sqrt(w) transformation (numerically equivalent to weighted regression)
- Event study regression: `Y = α_0 + α_1·D_sa + Σ_{h≠-1}[λ_h·1(e=h) + δ_h·D_sa·1(e=h)] + U` (Eq. 3)
- Reference period e=-1-anticipation normalized to zero (omitted from design matrix)
- Delta-method SE for overall ATT: `SE = sqrt(ones' @ sub_vcv @ ones) / K`
- Cluster-robust SEs at unit level (default) and unit×sub-experiment level
- Anticipation parameter: reference period shifts to e=-1-anticipation, post-treatment includes anticipation periods
- Rank deficiency handling (warn/error/silent via `solve_ols()`)
- Never-treated encoding: both `first_treat=0` and `first_treat=inf` handled
- R comparison: ATT matches within machine precision (diff < 2.1e-11)
- R comparison: SE matches within machine precision (diff < 4.0e-10)
- R comparison: Event study effects correlation = 1.000000, max diff < 4.5e-11
- `safe_inference()` used for all inference fields
- All REGISTRY.md edge cases tested
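As a worked example of the aggregate Q-weight formula in the list above, the sketch below computes control weights at one event time for three sub-experiments. The counts are invented and `q_control` is an illustrative helper; only the formula itself comes from this entry.

```python
# Hypothetical observation counts at a single event time:
# sub-experiment (adoption year) -> (treated obs, control obs).
counts = {
    2004: (10, 40),
    2005: (30, 60),
    2006: (20, 100),
}

stack_treat_n = sum(treated for treated, _ in counts.values())
stack_control_n = sum(control for _, control in counts.values())


def q_control(sub_exp):
    """Control weight: treated share of the stack over control share of the stack."""
    sub_treat_n, sub_control_n = counts[sub_exp]
    return (sub_treat_n / stack_treat_n) / (sub_control_n / stack_control_n)


# Treated observations keep Q = 1; controls get the ratio of shares.
q = {sub_exp: q_control(sub_exp) for sub_exp in counts}

# After weighting, each sub-experiment's share of control mass
# equals its share of treated observations.
weighted_control_share = {
    sub_exp: counts[sub_exp][1] * q[sub_exp] / stack_control_n for sub_exp in counts
}
treated_share = {sub_exp: counts[sub_exp][0] / stack_treat_n for sub_exp in counts}

print(q)
print(weighted_control_share, treated_share)
```

The weights rebalance the controls so that each sub-experiment contributes to the control side in the same proportion as it contributes to the treated side, while the total weighted control count stays at `stack_control_n`.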
Test Coverage:
- `tests/test_stacked_did.py`: 10 test classes (basic, trimming, Q-weights, clean-control, clustering, stacked-data shape, edge cases, sklearn interface, results methods, validation)
- R benchmark tests via `benchmarks/run_benchmarks.py --estimator stacked`
R Comparison Results (200 units, 8 periods, kappa_pre=2, kappa_post=2):
| Metric | Python | R | Diff |
|---|---|---|---|
| Overall ATT | 2.277699574579 | 2.2776995746 | 2.1e-11 |
| Overall SE | 0.062045687626 | 0.062045688027 | 4.0e-10 |
| ES e=-2 ATT | 0.044517975379 | 0.044517975379 | <1e-12 |
| ES e=0 ATT | 2.104181683763 | 2.104181683800 | 3.7e-11 |
| ES e=1 ATT | 2.209990715130 | 2.209990715100 | 3.0e-11 |
| ES e=2 ATT | 2.518926324845 | 2.518926324800 | 4.5e-11 |
| Stacked obs | 1600 | 1600 | exact |
| Sub-experiments | 3 | 3 | exact |
Corrections Made:
- IC1 lower bound and time window aligned with R reference (`stacked_did.py`, `_trim_adoption_events()` and `_build_sub_experiment()`): The paper text specifies time window `[a - kappa_pre - 1, a + kappa_post]` (including an extra pre-period), but the R reference implementation by co-author Hollingsworth uses `[a - kappa_pre, a + kappa_post]`. The extra period had no event-study dummy, altering the baseline regression. Fixed to match R: removed `-1` from both IC1 check (`a - kappa_pre >= T_min`) and time window start. Discrepancy documented in `docs/methodology/papers/wing-2024-review.md` Gaps section.
- Q-weight computation: event-time-specific for aggregate weighting (`stacked_did.py`, `_compute_q_weights()`): Changed aggregate Q-weights from unit counts per sub-experiment to observation counts per (event_time, sub_exp), matching R reference `compute_weights()`. For balanced panels, results are unchanged. For unbalanced panels, weights now adjust for varying observation density. Population/sample_share retain unit-count formulas (paper notation).
- Anticipation parameter: reference period and dummies (`stacked_did.py`, `fit()`): Reference period now shifts to `e = -1 - anticipation`. Event-time dummies cover the full window `[-kappa_pre - anticipation, ..., kappa_post]`. Post-treatment effects include anticipation periods. Consistent with ImputationDiD, TwoStageDiD, SunAbraham.
- Group aggregation removed (`stacked_did.py`): `aggregate="group"` and `aggregate="all"` removed. The pooled stacked regression cannot produce cohort-specific effects without cohort×event-time interactions. Use CallawaySantAnna or ImputationDiD for cohort-level estimates.
- n_sub_experiments metadata (`stacked_did.py`, `fit()`): Now tracks actual built sub-experiments, not all events in omega_kappa. Warns if any sub-experiments are empty after data filtering.
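The first correction above is easiest to see with concrete numbers. In this sketch, `ic1_events` and `event_window` are hypothetical helpers, the adoption years are invented, and the `extra_pre` argument exists only to contrast the paper-text window with the R-reference window.

```python
def ic1_events(adoption_times, t_min, t_max, kappa_pre, kappa_post, extra_pre=0):
    """Keep adoption events whose full event window lies inside [t_min, t_max]."""
    return [
        a for a in adoption_times
        if a - kappa_pre - extra_pre >= t_min and a + kappa_post <= t_max
    ]


def event_window(a, kappa_pre, kappa_post, extra_pre=0):
    """Calendar periods included in the sub-experiment for adoption time a."""
    return list(range(a - kappa_pre - extra_pre, a + kappa_post + 1))


adoptions = [2002, 2003, 2005, 2006, 2007]
t_min, t_max = 2001, 2008
kappa_pre, kappa_post = 2, 2

# R-reference convention: window [a - kappa_pre, a + kappa_post].
kept_r = ic1_events(adoptions, t_min, t_max, kappa_pre, kappa_post)
window_r = event_window(2005, kappa_pre, kappa_post)

# Paper-text convention: one extra pre-period, [a - kappa_pre - 1, a + kappa_post].
kept_paper = ic1_events(adoptions, t_min, t_max, kappa_pre, kappa_post, extra_pre=1)
window_paper = event_window(2005, kappa_pre, kappa_post, extra_pre=1)

print(kept_r, window_r)
print(kept_paper, window_paper)
```

With these inputs the paper-text convention drops the 2003 adoption event and adds the year 2002 to the 2005 window, so the two conventions can differ in both the number of sub-experiments and the stacked observation count.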
Outstanding Concerns:
- Population/sample_share Q-weights use paper's unit-count formulas (no R reference to validate)
- Anticipation not validated against R (R reference doesn't test anticipation > 0)
Deviations from R's stacked-did-weights:
- NaN for invalid inference: Python returns NaN for t_stat/p_value/conf_int when SE is non-finite or zero. R would propagate through `fixest::feols()` error handling.
| Field | Value |
|---|---|
| Module | imputation.py, imputation_bootstrap.py |
| Primary Reference | Borusyak, Jaravel & Spiess (2024), Revisiting Event-Study Designs: Robust and Efficient Estimation, REStud 91(6) |
| R Reference | didimputation |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## ImputationDiD` (paper-direct equations, edge cases, three-step algorithm)
- Implementation: 87 unit tests in `tests/test_imputation.py` (basic fit, event study, group aggregation, conservative variance, auxiliary partition, unidentified-estimand handling, balanced/unbalanced panels)
- Bootstrap path: `imputation_bootstrap.py` with multiplier-weight resampling
- Survey support: pweight + strata/PSU/FPC via TSL (Phase 6) with PSU-bootstrap path
Outstanding for promotion:
- Dedicated `tests/test_methodology_imputation.py` with paper-equation-numbered Verified Components walk-through
- R parity benchmark against `didimputation` (none on file)
- Formal enumeration of deviations from `didimputation` (NaN inference, refused-to-estimate behavior for unidentified estimands per Proposition 5)
- "Corrections Made" listing for any implementation fixes uncovered during the walk-through
| Field | Value |
|---|---|
| Module | two_stage.py, two_stage_bootstrap.py |
| Primary Reference | Gardner (2022), Two-stage differences in differences, arXiv:2207.05943 |
| R Reference | did2s |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## TwoStageDiD` (Stage 1 unit+time FE on untreated, Stage 2 OLS on residualized outcomes, GMM sandwich variance per Newey-McFadden Theorem 6.1)
- Implementation: 76 unit tests in `tests/test_two_stage.py` (matches ImputationDiD point estimates, R `did2s` global `(D'D)^{-1}` variance, always-treated unit exclusion, multiplier bootstrap)
- Documented R alignment: uses global `(D'D)^{-1}` matching `did2s` (not paper Eq. 6)
Outstanding for promotion:
- Dedicated `tests/test_methodology_two_stage.py` with paper-equation-numbered Verified Components walk-through
- R parity benchmark fixture against `did2s` (none on file)
- Documented deviation: Newey-McFadden Theorem 6.1 sandwich vs paper's Eq. 6 (already noted in REGISTRY but not formalized in this tracker)
- "Corrections Made" listing
| Field | Value |
|---|---|
| Module | wooldridge.py, wooldridge_results.py |
| Primary Reference | Wooldridge (2025), Two-way fixed effects, the two-way Mundlak regression, and difference-in-differences estimators, Empirical Economics 69(5), 2545–2587 |
| R Reference | etwfe (McDermott 2023); Stata jwdid (Rios-Avila 2021) |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## WooldridgeDiD (ETWFE)` (saturated cohort×time interactions, OLS/logit/Poisson via IRLS, ASF-based ATT for nonlinear methods with delta-method SEs, four aggregations, survey support)
- Companion-paper review on file: `docs/methodology/papers/wooldridge-2023-review.md` covers Wooldridge (2023) Simple approaches to nonlinear difference-in-differences with panel data, Econometrics Journal 26(3), the nonlinear extension that the logit/Poisson paths implement (retrospective, merged PR #443 on 2026-05-13). A dedicated review for the primary ETWFE source (Wooldridge 2025, Empirical Economics 69(5)) is not yet on file.
- Implementation: `tests/test_wooldridge.py` (covers OLS, logit, and Poisson paths plus the four aggregation types)
Outstanding for promotion:
- Dedicated paper review for the primary ETWFE source: write `docs/methodology/papers/wooldridge-2025-review.md` covering Wooldridge (2025) Empirical Economics 69(5), 2545–2587 (published version of the 2021 SSRN working paper / NBER WP 29154)
- Dedicated `tests/test_methodology_wooldridge.py` with paper-equation-numbered Verified Components walk-through
- R parity fixture against `etwfe` (and ideally Stata `jwdid`) covering OLS, logit, and Poisson paths
- Verified Components for nonlinear-method ASF / delta-method SE invariants
- "Corrections Made" listing
| Field | Value |
|---|---|
| Module | efficient_did.py, efficient_did_bootstrap.py, efficient_did_covariates.py, efficient_did_weights.py |
| Primary Reference | Chen, Sant'Anna & Xie (2025), Efficient Difference-in-Differences and Event Study Estimators |
| R Reference | (no canonical R package; paper compares against did / DIDmultiplegt / BJS / Gardner / Wooldridge as benchmarks rather than providing a reference implementation) |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## EfficientDiD` (full Theorem 4.1 EIF, sieve-based propensity-ratio estimation with AIC/BIC, kernel-smoothed conditional covariance, Hausman pretest for PT-All vs PT-Post, survey support)
- Implementation: 130 unit tests in `tests/test_efficient_did.py` + 12 validation tests in `tests/test_efficient_did_validation.py`
- Hausman pretest: implemented per Theorem A.1 with Moore-Penrose pseudoinverse for finite-sample non-PSD variance-difference matrix
- Survey support: pweight + strata/PSU/FPC via TSL on EIF scores; covariates DR path with WLS outcome regression and weighted sieve normal equations
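The pseudoinverse step in the Hausman pretest can be illustrated with a generic Hausman-type quadratic form. This is a minimal sketch under stated assumptions, not the code in `efficient_did.py`: the function name `hausman_statistic`, the argument names, and the choice of which estimator is treated as "robust" versus "efficient" are hypothetical, and the exact construction in Theorem A.1 is not reproduced here.

```python
import numpy as np

def hausman_statistic(theta_robust, theta_efficient, var_robust, var_efficient):
    """Generic Hausman-type statistic d' V^+ d (hypothetical helper).

    V = var_robust - var_efficient can fail to be positive semi-definite in
    finite samples, so the Moore-Penrose pseudoinverse replaces the inverse.
    """
    d = np.asarray(theta_robust, dtype=float) - np.asarray(theta_efficient, dtype=float)
    v_diff = np.asarray(var_robust, dtype=float) - np.asarray(var_efficient, dtype=float)
    v_pinv = np.linalg.pinv(v_diff)                 # well defined even if v_diff is singular
    stat = float(d @ v_pinv @ d)
    dof = int(np.linalg.matrix_rank(v_diff))        # assumed degrees of freedom: rank of V
    return stat, dof
```

Using the rank of the variance difference as the degrees of freedom is the usual convention when a pseudoinverse is involved; whether the library does the same is an assumption here.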
Outstanding for promotion:
- No paper review on file under `docs/methodology/papers/`; write one
- Dedicated `tests/test_methodology_efficient_did.py` with Theorem 3.2 / Equation 3.5 / Equation 4.3 numbered Verified Components walk-through
- Cross-language anchor: the paper's empirical replication uses HRS data following Sun-Abraham (2021); a same-data benchmark against the paper's reported numbers (or a same-DGP MC against R alternatives) would substantiate the EIF construction
- Documented deviations: linear OLS working models for outcome regressions vs. paper's general nonparametric specification (DR safety net acknowledged but not separately validated); fixed-weight bootstrap aggregation vs. WIF-corrected analytical aggregation
| Field | Value |
|---|---|
| Module | continuous_did.py, continuous_did_bspline.py, continuous_did_results.py |
| Primary Reference | Callaway, Goodman-Bacon & Sant'Anna (2024), Difference-in-Differences with a Continuous Treatment, NBER WP 32117 |
| R Reference | contdid v0.1.0 (CRAN) |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## ContinuousDiD` plus dedicated theory note in `docs/methodology/continuous-did.md` (PT vs SPT identification, ATT(d|d) / ATT(d) / ACRT(d) / ATT^{loc} / ATT^{glob} / ACRT^{glob} estimands, B-spline OLS, multiplier bootstrap)
- `tests/test_methodology_continuous_did.py`: 15 tests across 5 classes (linear dose response, quadratic with cubic basis, multi-period aggregation, edge cases, R benchmark)
- Implementation: 80 unit tests in `tests/test_continuous_did.py`
- Survey support: weighted B-spline OLS, TSL on influence functions, bootstrap+survey (Phase 6)
Outstanding for promotion:
- Detailed Verified Components block here mirroring REGISTRY's Implementation Checklist (B-spline basis matching `splines2::bSpline`, multi-period cell iteration, dose-response and event-study aggregation, multiplier bootstrap, analytical SE via influence functions)
- Document the boundary-knots deviation from R `contdid` v0.1.0 (Python uses `range(dose)`; R uses `range(dvals)`, which can produce extrapolation artifacts) in a formal Deviations block here
- Formalize the `+inf` recoding and zero-dose silent-zeroing warnings (currently in REGISTRY) into a Verified Components row
| Field | Value |
|---|---|
| Module | chaisemartin_dhaultfoeuille.py, chaisemartin_dhaultfoeuille_bootstrap.py, chaisemartin_dhaultfoeuille_results.py |
| Primary References | (a) de Chaisemartin & D'Haultfœuille (2020), Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects, AER 110(9), 2964-2996. (b) de Chaisemartin & D'Haultfœuille (2022, revised 2024), Difference-in-Differences Estimators of Intertemporal Treatment Effects, NBER WP 29873 — Web Appendix Section 3.7.3 for cohort-recentered plug-in variance. (c) de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026) for the universal-rollout case. |
| R Reference | DIDmultiplegtDYN |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## ChaisemartinDHaultfoeuille` (DID_M, DID_+, DID_-, single-lag placebo, TWFE-weights diagnostic, multiplier bootstrap, DID^X / DID^{fd} / state-set-specific trends / heterogeneity testing / Design-2 / by_path / HonestDiD integration, survey design + replicate weights + HM wild bootstrap)
- Companion-paper review on file: `docs/methodology/papers/dechaisemartin-2026-review.md` covers the 2026 universal-rollout extension (Knau et al.), which is the primary source for HAD rather than for DCDH. The 2020 AER and 2022/2024 NBER WP 29873 papers that define DCDH's core DID_M / DID_+ / DID_- and dynamic estimators do not yet have dedicated review files on disk.
- `tests/test_methodology_chaisemartin_dhaultfoeuille.py`: 12 tests across 4 classes (worked example, cohort recentering, TWFE diagnostic, large-N recovery)
- `tests/test_chaisemartin_dhaultfoeuille_parity.py`: 24 R parity tests against `DIDmultiplegtDYN`
- Implementation: 347 unit tests in `tests/test_chaisemartin_dhaultfoeuille.py`
- Survey-specific: `tests/test_survey_dcdh.py`, `tests/test_survey_dcdh_replicate_psu.py`, plus three dCDH cell-period coverage suites
Outstanding for promotion:
- Primary-source paper reviews: write `docs/methodology/papers/dechaisemartin-dhaultfoeuille-2020-review.md` covering the 2020 AER and a companion review covering 2022/2024 NBER WP 29873 (intertemporal treatment effects). The existing 2026 review covers the universal-rollout extension only.
- Formal Verified Components block here matching REGISTRY's exhaustive Implementation Checklist
- Consolidated Deviations summary (currently scattered across REGISTRY Notes): equal-cell weighting vs R cell-size weighting, terminal-missingness retention, A11 zero-retention convention, `<50%` switcher warning at far horizons
- Documented R parity tolerance bands at `l=1` (existing parity fixture in `test_chaisemartin_dhaultfoeuille_parity.py`)
- "Corrections Made" listing for the Round 2 full-IF fix (never-switching groups now participate in variance via stable-control roles)
| Field | Value |
|---|---|
| Module | had.py, had_pretests.py |
| Primary Reference | de Chaisemartin, Ciccia, D'Haultfœuille & Knau (2026), Difference-in-Differences Estimators When No Unit Remains Untreated, arXiv:2405.04465v6 |
| R Reference | None (paper-direct implementation); nprobust (Calonico-Cattaneo-Farrell) used for bandwidth selection only |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## HeterogeneousAdoptionDiD` (~330 lines covering Phases 1a-5: Epanechnikov/triangular/uniform kernels, HC2+Bell-McCaffrey, CR2 Imbens-Kolesar Satterthwaite DOF, Calonico-Cattaneo-Farrell MSE-DPI bandwidth, bias-corrected local-linear, three design paths (continuous_at_zero / continuous_near_d_lower / mass_point), multi-period event-study via Appendix B.2, three pretest helpers `qug_test` / `stute_test` / `yatchew_hr_test`, composite `did_had_pretest_workflow`, survey support including PSU-level Mammen wild bootstrap for Stute family)
- Paper review on file: shares `dechaisemartin-2026-review.md` with DCDH (universal-rollout coverage)
- Implementation: comprehensive coverage in `tests/test_had.py` (HAD estimator) and `tests/test_had_pretests.py` (`qug_test` / `stute_test` / `yatchew_hr_test` and the composite workflow); Monte-Carlo coverage in `tests/test_had_mc.py`; dual-knob deprecation in `tests/test_had_dual_knob_deprecation.py`
- Bandwidth port: `tests/test_bandwidth_selector.py` (public-API wrapper, HAD configuration) and `tests/test_nprobust_port.py` (full `lprobust` / `lpbwselect_mse_dpi` port surface); bias-corrected `lprobust` parity in `tests/test_bias_corrected_lprobust.py`
- R parity: 5 R-direct parity tests in `tests/test_did_had_parity.py`; `nprobust` golden fixtures in `benchmarks/data/nprobust_*_golden.json` validated at `0.0000%` relative error
- Two dedicated tutorials: T21 (`docs/tutorials/21_had_pretest_workflow.ipynb`) and T22 (`docs/tutorials/22_had_survey_design.ipynb`) with companion `tests/test_t21_had_pretest_workflow_drift.py` and `tests/test_t22_had_survey_design_drift.py` drift-test files
Outstanding for promotion:
- Dedicated `tests/test_methodology_had.py` (versus the existing implementation-detail-heavy `test_had.py`) with paper-equation-numbered Verified Components walk-through (Equations 3, 7, 11, 18, 29 for Theorems 1, 3, 4, 7)
- Documented deviations: equal-vs-cell-size weighting conventions; HAD sup-t bootstrap behavior when not gated by `cband=True` and `aggregate="event_study"`
- Resolution / waiver for the four unchecked Phase-4 items (Pierce-Schott 2016 Figure 2 replication, Table 1 coverage-rate reproduction, Assumption 5/6 non-testability documentation, staggered-timing warning that redirects to DCDH)
| Field | Value |
|---|---|
| Module | trop.py, trop_local.py, trop_global.py, trop_results.py |
| Primary Reference | Athey, Imbens, Qu & Viviano (2025), Triply Robust Panel Estimators, arXiv:2508.21536 |
| R Reference | Paper-author reference implementation (not yet released as CRAN package) |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## TROP` (local: factor matrix via soft-threshold SVD, exponential-decay unit weights matching paper Eq. 2, LOOCV per Eq. 5, multiple rank-selection methods cv/ic/elbow; global: alternating minimization for nuclear-norm penalty with hard-coded inner-FISTA 20-iteration loop, ATT averaging over D==1 cells, Rust-accelerated LOOCV and bootstrap)
- Paper review on file: `docs/methodology/papers/athey-2025-review.md` (retrospective, merged PR #443 on 2026-05-13)
- Implementation: 120 unit tests in `tests/test_trop.py`
- Survey support: Rao-Wu rescaled bootstrap with cross-classified pseudo-strata; Rust backend remains pweight-only
Outstanding for promotion:
- Dedicated `tests/test_methodology_trop.py` with paper-equation-numbered Verified Components walk-through
- Cross-validation against the paper-author reference implementation (when it becomes available) or against the paper's reported numbers on the empirical applications
- Documented deviations: bootstrap proportional-failure warnings (5% threshold), alternating-minimization convergence warnings, Rust backend's pweight-only limitation vs. Python's full survey-design support
| Field | Value |
|---|---|
| Module | triple_diff.py |
| Primary Reference | Ortiz-Villavicencio & Sant'Anna (2025), Better Understanding Triple Differences Estimators, arXiv:2505.09942 |
| R Reference | triplediff::ddd() (v0.2.1, CRAN) |
| Status | Complete |
| Last Review | 2026-02-18 |
Verified Components:
- ATT matches R `triplediff::ddd()` for all 3 methods (DR, RA, IPW): <0.001% relative difference
- SE matches R `triplediff::ddd()` for all 3 methods: <0.001% relative difference
- With-covariates ATT matches R: <0.001% relative difference
- With-covariates SE matches R: <0.001% relative difference
- Verified across all 4 DGP types from `gen_dgp_2periods()` (different model misspecification scenarios)
- Influence function-based SE: `SE = std(w3*IF_3 + w2*IF_2 - w1*IF_1, ddof=1) / sqrt(n)`
- Three-DiD decomposition: `DDD = DiD_3 + DiD_2 - DiD_1` matching R's approach
- `safe_inference()` used for all inference fields (t_stat, p_value, conf_int)
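The decomposition and the influence-function SE listed above can be sketched with NumPy. This is an illustration under stated assumptions rather than the code in `triple_diff.py`: the function name `ddd_from_pairwise` is hypothetical, each `IF_j` is assumed to already be a length-`n` array (zero for observations outside pairwise comparison j), and the weights follow the `w_j = n / n_j` definition recorded under Corrections Made.

```python
import numpy as np

def ddd_from_pairwise(did, inf_funcs, n_sub):
    """Combine three pairwise DiD estimates into a DDD estimate and its SE.

    did:       dict {1, 2, 3} -> scalar pairwise DiD estimate
    inf_funcs: dict {1, 2, 3} -> length-n influence-function array
               (assumed zero outside comparison j; hypothetical layout)
    n_sub:     dict {1, 2, 3} -> number of observations in comparison j
    """
    n = len(inf_funcs[3])
    w = {j: n / n_sub[j] for j in (1, 2, 3)}          # w_j = n / n_j
    att = did[3] + did[2] - did[1]                    # DDD = DiD_3 + DiD_2 - DiD_1
    combined_if = (w[3] * np.asarray(inf_funcs[3], dtype=float)
                   + w[2] * np.asarray(inf_funcs[2], dtype=float)
                   - w[1] * np.asarray(inf_funcs[1], dtype=float))
    se = np.std(combined_if, ddof=1) / np.sqrt(n)     # SE = std(...) / sqrt(n)
    return att, se
```

The sign pattern of the point estimate and of the combined influence function must agree (plus, plus, minus); a mismatch between the two is the kind of error this review pass is meant to catch.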
Test Coverage:
- 45 methodology tests in `tests/test_methodology_triple_diff.py`
Corrections Made:
- Complete rewrite of estimation methods (was naive cell-mean approach, now three-DiD decomposition). The original implementation computed DDD directly from 8 cell means with a naive cell-variance SE. Replaced with R's decomposition into three pairwise DiD comparisons (subgroup j vs reference subgroup 4), each using DR/IPW/RA methodology from Callaway & Sant'Anna. This fixed:
  - DR SE: was off by >100% (naive cell variance vs influence function)
  - IPW SE: was off by >200% (incorrect cell-probability-ratio weights)
  - With-covariates ATT: was off by >1000% for all methods (incorrect cell-by-cell regression)
- Influence function SE replaces naive cell variance for all methods: `SE = std(w3*IF_3 + w2*IF_2 - w1*IF_1, ddof=1) / sqrt(n)` where `w_j = n / n_j` and `IF_j` is the per-observation influence function for pairwise DiD j.
- Propensity score estimation now runs per-pairwise-comparison (P(subgroup=4|X) within {j, 4} subset) instead of global P(G=1|X).
- Outcome regression now fits separate OLS per subgroup-time cell within each pairwise comparison, matching R's `compute_outcome_regression_rc()`.
Outstanding Concerns:
- Panel mode (`panel=TRUE`) with differenced outcomes not yet implemented (see Deviations).
Deviations from R's `triplediff::ddd()`:
- Repeated cross-section mode only: Implementation uses `panel=FALSE`. Panel mode with differenced outcomes is not yet implemented; users with balanced panel data and time-invariant covariates should compute first differences manually before fitting.
R Comparison Results (panel=FALSE, n=500 per DGP):
| DGP | Method | Covariates | ATT Diff | SE Diff |
|---|---|---|---|---|
| 1 | DR | No | <0.001% | <0.001% |
| 1 | DR | Yes | <0.001% | <0.001% |
| 1 | REG | No | <0.001% | <0.001% |
| 1 | REG | Yes | <0.001% | <0.001% |
| 1 | IPW | No | <0.001% | <0.001% |
| 1 | IPW | Yes | <0.001% | <0.001% |
| 2-4 | All | Both | <0.001% | <0.001% |
| Field | Value |
|---|---|
| Module | staggered_triple_diff.py, staggered_triple_diff_results.py |
| Primary Reference | Ortiz-Villavicencio & Sant'Anna (2025) — same paper as TripleDifference, staggered case |
| R Reference | triplediff::ddd(panel=TRUE) + agg_ddd() (per benchmarks/R/benchmark_staggered_triplediff.R) |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## StaggeredTripleDifference` (per-cohort comparisons against three sub-groups, DR/RA/IPW per component, GMM-optimal closed-form inverse-variance weighting, event-study via CS mixin, IF-based SEs, multiplier bootstrap for simultaneous bands, survey support)
- `tests/test_methodology_staggered_triple_diff.py`: 6 tests across 3 classes (never-treated comparison, not-yet-treated comparison, aggregation)
- Dedicated unit-test suite: `tests/test_staggered_triple_diff.py` (~680 lines, full coverage of DR/RA/IPW paths, both control-group modes, GMM weighting, event-study aggregation, edge cases)
- Survey-specific: `tests/test_survey_staggered_ddd.py`
Outstanding for promotion:
- Paper review under `docs/methodology/papers/` covering Ortiz-Villavicencio & Sant'Anna (2025) for the staggered case (the primary paper is shared with TripleDifference, but no dedicated review file exists on disk yet)
- R parity validation against `triplediff::ddd(panel=TRUE)` + `agg_ddd()` (per `benchmarks/R/benchmark_staggered_triplediff.R`); CSV fixtures not committed (gitignored); tests skip without local R + `triplediff` (tracked in TODO.md row, PR #245)
- Per-cohort group-effect SE convention: implementation includes WIF (conservative vs R's `wif=NULL`); documented in REGISTRY, deferred decision on whether to add an opt-in WIF-disable path (tracked in TODO.md row, PR #245)
- Formal Verified Components walk-through here
- Cluster-robust analytical SEs accepted but not wired (deferred per REGISTRY)
| Field | Value |
|---|---|
| Module | synthetic_did.py |
| Primary Reference | Arkhangelsky et al. (2021) |
| R Reference | synthdid::synthdid_estimate() |
| Status | Complete |
| Last Review | 2026-04-23 |
Verified Components:
- Frank-Wolfe on the collapsed (N_co × T_pre) problem (Algorithm 1 of Arkhangelsky et al. 2021), matching R's `synthdid::fw.step()`
- Unit weights: Frank-Wolfe with two-pass sparsification, matching R's `synthdid::sc.weight.fw()` and `sparsify_function()`
- Time weights: Frank-Wolfe on collapsed form, matching R's `fw.step()`
- Auto-computed `zeta_omega` / `zeta_lambda` from data noise level `N_tr × σ²` (Appendix D), matching R's default behavior
- Pairs-bootstrap refit per Algorithm 2 step 2, warm-started from fit-time ω/λ via the new `init_weights=` kwargs on `compute_sdid_unit_weights` / `compute_time_weights`, matching R's `bootstrap_sample` which rebinds `attr(estimate, "opts")` per `update.omega=TRUE` / `update.lambda=TRUE`
- Placebo variance (library default) and jackknife variance methods
- Same-library validation: placebo-SE tracking vs. bootstrap-SE, AER §6.3 Monte Carlo truth
- All REGISTRY.md SyntheticDiD edge cases tested
Test Coverage:
- 157 methodology tests in `tests/test_methodology_sdid.py`
Corrections Made:
- Time weights: Frank-Wolfe on collapsed form (was heuristic inverse-distance). Replaced ad-hoc inverse-distance weighting with the Frank-Wolfe algorithm operating on the collapsed (N_co × T_pre) problem as specified in Algorithm 1 of Arkhangelsky et al. (2021), matching R's `synthdid::fw.step()`.
- Unit weights: Frank-Wolfe with two-pass sparsification (was projected gradient descent with wrong penalty). Replaced projected gradient descent (which used an incorrect penalty formulation) with Frank-Wolfe optimization followed by two-pass sparsification, matching R's `synthdid::sc.weight.fw()` and `sparsify_function()`.
- Auto-computed regularization from data noise level (was `lambda_reg=0.0`, `zeta=1.0`). Regularization parameters `zeta_omega` and `zeta_lambda` are now computed automatically from the data noise level (`N_tr × σ²`) as specified in Appendix D of Arkhangelsky et al. (2021), matching R's default behavior.
- Bootstrap SE is paper-faithful refit (Algorithm 2 step 2), matching R's default `synthdid::vcov(method="bootstrap")` including its warm-start shape. On each pairs-bootstrap draw, ω and λ are re-estimated via Frank-Wolfe on the resampled panel using the fit-time normalized-scale zeta. The Frank-Wolfe first pass is warm-started from the fit-time ω (renormalized over the resampled controls via `_sum_normalize`) and the fit-time λ (unchanged), matching R's `bootstrap_sample` which rebinds `attr(estimate, "opts")` so those weights serve as the FW initialization per `update.omega=TRUE` / `update.lambda=TRUE`. (Historical note: an earlier release shipped a fixed-weight shortcut here that matched neither the paper nor R's default vcov; that path was removed in PR #351 along with its R-parity fixture, which had also been mis-anchored. The same PR added the warm-start plumbing to `compute_sdid_unit_weights` / `compute_time_weights` via new `init_weights=` kwargs.)
- Default `variance_method` changed to `"placebo"`: intentional deviation from R's default (R's `synthdid::vcov()` defaults to `"bootstrap"`). The library default is placebo for two reasons: (a) placebo is unconditionally available on pweight-only survey designs, whereas refit bootstrap rejects every survey design in this release; (b) placebo sidesteps the ~5–30× slowdown of per-draw Frank-Wolfe re-estimation in refit bootstrap. See REGISTRY.md §SyntheticDiD `Note (default variance_method deviation from R)` for details.
- Deprecated `lambda_reg` and `zeta` params; new params are `zeta_omega` and `zeta_lambda`. The old parameters had unclear semantics and did not correspond to the paper's notation. The new parameters directly match the paper and R package naming conventions. `lambda_reg` and `zeta` are deprecated with warnings and will be removed in a future release.
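The warm-start renormalization described in the bootstrap correction can be sketched as follows. This is a minimal illustration, not the library's `_sum_normalize`: the function name `warm_start_unit_weights`, the index-array argument, and the uniform fallback when the selected weights sum to zero are all assumptions made for the example.

```python
import numpy as np

def warm_start_unit_weights(omega_fit, control_idx):
    """Restrict fit-time unit weights to resampled controls and renormalize.

    omega_fit:   fit-time unit weights over the original controls
    control_idx: integer positions of the controls drawn in this bootstrap
                 sample (drawn with replacement, so positions may repeat)
    """
    omega = np.asarray(omega_fit, dtype=float)[np.asarray(control_idx)]
    total = omega.sum()
    if total > 0:
        return omega / total                       # weights sum to one again
    # Assumed fallback: every drawn control had zero weight, start uniform.
    return np.full(omega.shape, 1.0 / omega.size)
```

The result is only an initialization for the Frank-Wolfe first pass; the refit still re-estimates the weights on the resampled panel.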
Outstanding Concerns:
- Cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` is desirable to bolster the methodology contract. Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; cross-language anchor tracked in TODO.md. The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path.
Deviations from R's `synthdid::synthdid_estimate()`:
- Default `variance_method` is `"placebo"` (R defaults to `"bootstrap"`). Rationale: (a) placebo is unconditionally available on pweight-only survey designs, whereas refit bootstrap rejects every survey design in this release; (b) placebo sidesteps the ~5–30× slowdown of per-draw Frank-Wolfe re-estimation in refit bootstrap. Documented in REGISTRY.md §SyntheticDiD `Note (default variance_method deviation from R)`.
- Parameter names: `zeta_omega` / `zeta_lambda` (matching the paper's notation); R uses `eta.omega` / `eta.lambda`. The deprecated Python aliases `lambda_reg` / `zeta` from prior releases emit `DeprecationWarning` and will be removed in a future release.
| Field | Value |
|---|---|
| Module | bacon.py |
| Primary Reference | Goodman-Bacon (2021), Difference-in-differences with variation in treatment timing, J. Econometrics 225(2), 254-277 |
| R Reference | bacondecomp::bacon() |
| Status | Complete |
| Last Review | 2026-05-16 |
Verified Components:
- Theorem 1 decomposition identity: `β̂^DD = Σ s · β̂^{2x2}` at `atol=1e-10` (hand-calculable + noisy DGPs)
- Weight sum-to-1: `Σ s = 1.0` at `atol=1e-10` under `weights="exact"`
- Three comparison types correctly classified: `treated_vs_never`, `earlier_vs_later`, `later_vs_earlier`
- Eq. 7 hand-checked: `V̂_{kU}^D = n_{kU}(1-n_{kU}) · D̄_k(1-D̄_k)` (via weight-ratio test, `atol=1e-10`)
- Eq. 8 hand-checked: `V̂_{kℓ}^{D,k} = n_{kℓ}(1-n_{kℓ}) · (D̄_k-D̄_ℓ)/(1-D̄_ℓ) · (1-D̄_k)/(1-D̄_ℓ)`
- Eq. 9 hand-checked: `V̂_{kℓ}^{D,ℓ} = n_{kℓ}(1-n_{kℓ}) · D̄_ℓ/D̄_k · (D̄_k-D̄_ℓ)/D̄_k`
- Eq. 10b 2x2 estimator value: hand-calculable panel → β̂_{kU}^{2x2} = ATT exactly
- Always-treated remap to U (paper footnote 11): `first_treat <= min(time)` (excluding never-treated sentinels `0` and `np.inf`) units auto-remapped via internal column, user's data preserved, count exposed on result
- `weights="exact"` is the default (PR-B 2026-05-16); `weights="approximate"` retained as opt-in
- Unbalanced panel: accepted with `UserWarning` (paper assumes balanced; library extension)
- No untreated group: `s_{kU}` terms drop, weights renormalize, sum-to-1 still holds
- Single timing group with U: only `treated_vs_never` comparisons
- Survey design composes cleanly with exact mode and warn+remap
- R `bacondecomp::bacon()` parity at `atol=1e-6`: 3 fixtures (`uniform_3groups_with_never_treated`, `two_groups_no_never_treated`, `always_treated_remapped`); TWFE coefficient + weights-sum match across all 3 fixtures; per-component estimate + weight parity locked on the 2 non-remap fixtures and on the 6 timing-vs-timing rows of `always_treated_remapped` (carve-out narrowed to U-bucket rows only); R→Python U-bucket fold-back asserted by a dedicated `test_always_treated_remapped_fold_back_matches_r` test that aggregates R's split `Later vs Always Treated` + `Treated vs Untreated` rows per cohort and compares to Python's single `treated_vs_never` cell at `atol=1e-6`. See `benchmarks/data/r_bacondecomp_golden.json` + `TestBaconParityR`.
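The hand-checked variance terms and the two identities in the checklist above can be written out directly. The sketch below transcribes Eqs. 7-9 exactly as listed and adds two small checking helpers; all function names are hypothetical and none of this is taken from `bacon.py`. Here `k` is the earlier-treated group, `l` the later-treated group, `n_*` the relative size of the first group within the pair, and `D_*` the share of periods a group spends treated.

```python
def var_treated_vs_untreated(n_kU, D_k):
    """Eq. 7: n_kU (1 - n_kU) * D_k (1 - D_k)."""
    return n_kU * (1 - n_kU) * D_k * (1 - D_k)

def var_timing_early_treated(n_kl, D_k, D_l):
    """Eq. 8: the earlier group k acts as the treated group."""
    return n_kl * (1 - n_kl) * (D_k - D_l) / (1 - D_l) * (1 - D_k) / (1 - D_l)

def var_timing_late_treated(n_kl, D_k, D_l):
    """Eq. 9: the later group l acts as the treated group."""
    return n_kl * (1 - n_kl) * D_l / D_k * (D_k - D_l) / D_k

def weights_sum_to_one(weights, atol=1e-10):
    """Sum-to-1 check at the tolerance quoted in the checklist."""
    return abs(sum(weights) - 1.0) <= atol

def decomposition_error(beta_twfe, weights, estimates):
    """Gap between the TWFE coefficient and the weighted sum of 2x2 estimates."""
    return beta_twfe - sum(s * b for s, b in zip(weights, estimates))
```

A hand check with `n_kU = 0.5` and `D_k = 0.5` gives `0.25 * 0.25 = 0.0625` for Eq. 7, which is the style of hand-calculable case the methodology tests rely on.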
Test Coverage:
- 34 methodology tests in `tests/test_methodology_bacon.py` across 6 classes, all active, including the 4 R-parity tests (3 aggregate/per-component + 1 always-treated fold-back; goldens committed at `benchmarks/data/r_bacondecomp_golden.json`)
- 32 existing tests in `tests/test_bacon.py` (basic decomposition, weight properties, weights-parameter API, TWFE integration, visualization, balanced-panel warnings, edge cases)
R Comparison Results:
- Validated at `atol=1e-6` against `bacondecomp::bacon()` (version 0.1.1, R 4.5.2). Goldens at `benchmarks/data/r_bacondecomp_golden.json`; generator at `benchmarks/R/generate_bacon_golden.R`. Three DGP fixtures:
  - `uniform_3groups_with_never_treated`: 9 components covering all three comparison types; full per-component parity (estimate + weight at `atol=1e-6`).
  - `two_groups_no_never_treated`: 2 components, timing-only decomposition; full per-component parity.
  - `always_treated_remapped`: TWFE coefficient + weights-sum match at `atol=1e-6`; the 6 timing-vs-timing rows (between cohorts 3/4/5) also satisfy direct per-component parity at `atol=1e-6` (carve-out narrowed to U-bucket rows only). The U-bucket breakdown diverges by convention (Python's paper-footnote-11 U-remap vs R's distinct `Later vs Always Treated` cohort decomposition); the aggregate is invariant to the re-bucketing per Theorem 1, and the R→Python fold-back is pinned by `test_always_treated_remapped_fold_back_matches_r`, which aggregates R's split `Later vs Always Treated` + `Treated vs Untreated` rows per cohort and compares to Python's single `treated_vs_never` cell.
Corrections Made:
- Theorem 1 exact-weights rewrite (`bacon.py:_recompute_exact_weights`, lines ~740-880). The previous "exact" mode implementation did not actually compute Eqs. 7-9 / 10e-g: it was missing the `(1 - n_kU)` factor in the within-subsample treatment variance, did not square the sample share, and added an extraneous `unit_share` factor not present in the paper. The post-hoc sum-to-1 normalization masked the relative-weight error but produced a decomposition error of ~0.3% (0.007 absolute) against TWFE on a 3-cohort + never-treated DGP. Rewrote the function to compute the exact numerators of Eqs. 10e/f/g (with proper Eqs. 7-9 variances) and let the post-hoc normalization handle the `V̂^D` denominator (Theorem 1 identity guarantees `V̂^D = Σ numerators`). Now matches TWFE at `atol=1e-10`. The existing `test_weighted_sum_equals_twfe` tolerance was tightened from `< 0.1` to `< 1e-10` to lock the contract.
- Default `weights` flipped from `"approximate"` to `"exact"` at three entry points: `BaconDecomposition.__init__()` (`bacon.py:397`), `bacon_decompose()` convenience function (`bacon.py:1064`), `TwoWayFixedEffects.decompose()` (`twfe.py:684`). The paper-faithful Theorem 1 weights are now the default; the simplified approximate path remains opt-in via explicit `weights="approximate"`. `diff_diff/diagnostic_report.py:1740` (production diagnostic surface) was updated to pass explicit `weights="exact"`.
- Always-treated warn+remap via internal column (`bacon.py:fit()`, lines ~487-525). Paper footnote 11 puts units with `t_i < 1` in `U`, but `bacon.py` previously only mapped `first_treat ∈ {0, np.inf}` into U. Added detection using ordered-time logic on the time axis (`first_treat <= min(time)` while excluding the never-treated sentinels `0` and `np.inf`) with `UserWarning` and automatic remap via an internal column (`__bacon_first_treat_internal__`), preserving the user's `first_treat` column unchanged. Detection handles event-time-encoded panels (`time ∈ [-2,..,3]`) correctly; the `0` sentinel restriction applies only to `first_treat`. Count exposed via new `BaconDecompositionResults.n_always_treated_remapped` field.
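The always-treated detection rule in the last correction can be sketched with pandas. This is an illustration of the stated rule only, not the code in `bacon.py:fit()`: the function name `remap_always_treated`, the default column names, and the choice of `np.inf` as the value written for remapped units are assumptions; only the internal column name is taken from the text above, and the `UserWarning` is omitted for brevity.

```python
import numpy as np
import pandas as pd

INTERNAL_COL = "__bacon_first_treat_internal__"

def remap_always_treated(df, unit="unit", time="time", first_treat="first_treat"):
    """Fold always-treated units into the untreated bucket U (hypothetical helper).

    A unit counts as always-treated when first_treat <= min(time), unless
    first_treat is one of the never-treated sentinels 0 or np.inf.
    """
    t_min = df[time].min()
    ft = df[first_treat].astype(float)
    never = ft.isin([0.0, np.inf])                  # sentinels apply to first_treat only
    always = ~never & (ft <= t_min)
    out = df.copy()                                 # the user's first_treat column is untouched
    out[INTERNAL_COL] = ft.where(~always, np.inf)   # assumed U encoding: np.inf
    n_remapped = int(out.loc[always, unit].nunique())
    return out, n_remapped
```

Writing the remap to a separate column keeps the input frame unchanged, which is the property the review records as "user's data preserved".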
Deviations from R's bacondecomp::bacon() and from the paper:
- First-period boundary extension on always-treated remap (library convention, deviation from paper footnote 11 strict rule and from R): Goodman-Bacon (2021) footnote 11 uses strict `t_i < 1` for the always-treated bucket (units treated before the first observable period). The library applies the inclusive `first_treat <= min(time)` rule, additionally folding units treated at the first observable period (`first_treat == min(time)`) into `U`. Rationale: such units have no untreated cell in-panel and cannot contribute as a treated cohort, so folding them into U mirrors the always-treated handling rather than dropping them silently. R `bacondecomp::bacon()` does NOT apply this boundary fold-back; it keeps `first_treat == min(time)` cohorts in their own bucket and emits `Later vs Always Treated` comparisons. When `min(time) > 1` (no first-period-treated cohorts) the library rule reduces to the paper's strict rule. Documented in REGISTRY `**Deviation (first-period boundary extension on always-treated remap)**`.
- Unbalanced panel acceptance (library extension): R errors on unbalanced panels; Python emits a `UserWarning` and decomposes. The paper's Appendix A proof assumes balanced panels, so decomposition on unbalanced panels is approximate to Theorem 1.
- Approximate weight mode (Python-only optimization): `weights="approximate"` is a library-only fast path with simplified variance computation, not present in R. Users who want Python-R numerical parity should pass `weights="exact"` (the new default).
- NaN for invalid inference fields not applicable: the decomposition is deterministic; there are no SE/p-value fields on the comparison output. The `decomposition_error` field is a finite float (zero in well-conditioned cases).
| Field | Value |
|---|---|
| Module | honest_did.py |
| Primary Reference | Rambachan & Roth (2023), A More Credible Approach to Parallel Trends, RES 90(5), 2555-2591 |
| R Reference | HonestDiD package |
| Status | Complete |
| Last Review | 2026-04-01 |
Verified Components:
- Delta^SD: second-difference constraints [1,-2,1] with delta_0=0 boundary handling
- Delta^SD: T+Tbar-1 constraint rows (bridge constraint at t=0)
- Delta^RM: constrains first differences (not levels), union of polyhedra per Lemma 2.2
- Identified set LP: pins delta_pre = beta_pre via equality constraints (Equations 5-6)
- M=0 for Delta^SD: linear extrapolation gives finite point-identified bounds
- Mbar=0 for Delta^RM: point identification (all post first-diffs = 0)
- Optimal FLCI for Delta^SD: folded normal cv_alpha, Nelder-Mead over pre-period weights
- Sensitivity grid: bounds computed for each M in grid, breakdown value via binary search
- Survey variance (RM, M=0 smoothness): t-distribution critical values from df_survey
- Survey variance (M>0 smoothness): optimal FLCI uses asymptotic normal only; df_survey=0 → NaN
- CallawaySantAnna integration: universal base period, reference period filtering
- Three-period analytical case matches paper Section 2.3
- ARP hybrid for Delta^RM: infrastructure implemented, moment inequality transformation needs calibration
- R comparison: pending (benchmark scripts need updating)
Test Coverage:
- Comprehensive unit-test coverage in `tests/test_honest_did.py` (15 test classes spanning DeltaSD/DeltaRM/DeltaSDRM bounds, FLCI, ARP infrastructure, CS integration, edge cases), all passing
- 27 methodology verification tests in `tests/test_methodology_honest_did.py`
- R benchmark tests (pending)
- Paper review on file: `docs/methodology/papers/rambachan-roth-2023-review.md`
Corrections Made:
- DeltaRM: first differences, not levels (`honest_did.py`, `_construct_constraints_rm_component`): The paper's Delta^RM constrains `|delta_{t+1} - delta_t|` (consecutive first differences) bounded by Mbar × max pre-treatment first difference. The code constrained `|delta_post|` (absolute levels) bounded by Mbar × max `|beta_pre|`. Completely rewritten using union-of-polyhedra decomposition per Lemma 2.2.
- LP pins delta_pre = beta_pre (`honest_did.py`, `_solve_bounds_lp`): The paper's identified set LP (Equations 5-6) fixes pre-treatment violations to the observed pre-treatment coefficients. The code had no equality constraint, so delta_pre was unconstrained. For Delta^SD(M=0), this made the LP unbounded. Added A_eq/b_eq equality constraints.
- DeltaSD constraint matrix: delta_0=0 boundary (`honest_did.py`, `_construct_A_sd`): The code built second-difference matrices treating [delta_{-T},...,delta_{-1},delta_1,...,delta_{Tbar}] as consecutive, missing delta_0=0 at the boundary. Three boundary rows were wrong:
  - t=-1: `d_{-2} - 2*d_{-1} + 0` (uses delta_0=0)
  - t=0: `d_{-1} + d_1` (bridge constraint, was missing)
  - t=1: `0 - 2*d_1 + d_2` (uses delta_0=0)

  Now produces T+Tbar-1 rows (was T+Tbar-2).
- Optimal FLCI for Delta^SD (`honest_did.py`, `_compute_optimal_flci`): Replaced naive FLCI (`lb - z*se, ub + z*se`) with the paper's optimal FLCI (Section 4.1): jointly optimizes affine estimator direction v and half-length chi using folded normal critical values cv_alpha(bias/se). Significantly narrower CIs.
- REGISTRY.md equations (`docs/methodology/REGISTRY.md`): DeltaSD equation was first differences (should be second differences). DeltaRM equation was absolute levels (should be first differences). Both corrected with full formulations.
- Performance (`honest_did.py`): Sensitivity grid reduced from ~9 minutes to 0.1 seconds via: Newton's method for cv_alpha (5 iterations vs 100), centrosymmetric bias LP (1 solve vs 2), M=0 short-circuit, looser Nelder-Mead tolerances.
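The delta_0=0 boundary correction is easy to see by building the second-difference matrix over the full sequence (including delta_0) and then dropping the delta_0 column. The sketch below does exactly that; it is an illustration with a hypothetical function name, not the library's `_construct_A_sd`, and it assumes at least two periods in total.

```python
import numpy as np

def second_difference_matrix(n_pre, n_post):
    """Second-difference rows [1, -2, 1] with delta_0 = 0 imposed.

    Columns are ordered [delta_{-T}, ..., delta_{-1}, delta_1, ..., delta_{Tbar}]
    with T = n_pre and Tbar = n_post; the result has T + Tbar - 1 rows.
    """
    n_full = n_pre + n_post + 1                     # full sequence including delta_0
    full = np.zeros((n_full - 2, n_full))
    for r in range(n_full - 2):
        full[r, r:r + 3] = [1.0, -2.0, 1.0]
    # delta_0 sits at position n_pre; it is fixed at zero, so its column is dropped.
    return np.delete(full, n_pre, axis=1)

A = second_difference_matrix(2, 2)
# Rows: d_{-2} - 2*d_{-1},  d_{-1} + d_1 (bridge),  -2*d_1 + d_2
```

With T = Tbar = 2 this yields the three boundary rows listed in the correction, which is a quick way to confirm the row count and the bridge constraint at t=0.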
Outstanding Concerns:
- Delta^RM CI: uses naive FLCI (conservative) instead of the paper's ARP conditional/hybrid confidence sets. ARP infrastructure exists but moment inequality transformation needs calibration. Tracked in TODO.md.
- R benchmark comparison not yet run (Python benchmark needs API update)
- Combined method uses single M for both SD and RM (DeltaSDRM dataclass has separate M/Mbar)
Deviations from R's HonestDiD:
- Deviation from R: Delta^RM CIs use naive FLCI (`lb - z*se, ub + z*se`) instead of ARP conditional/hybrid. Conservative (wider CIs, valid coverage). ARP deferred.
- Note: Delta^SD optimal FLCI matches the paper's Section 4.1 methodology: first-difference reparameterization, slope weights with sum(w)=sum_j j*l_j constraint (Eq. 17), bias LP in fd-space, folded normal (or folded non-central t for survey df). Nelder-Mead optimizer vs R's custom solver may produce numerical differences at tolerance level.
- Note: `method="combined"` (Delta^SDRM) uses naive FLCI on the intersection of SD and RM bounds. The paper proves FLCI is not consistent for Delta^SDRM (Proposition 4.2). A runtime UserWarning is emitted. Use `method="smoothness"` or `method="relative_magnitude"` separately for paper-supported inference.
- Note (deviation from R): Python warns (doesn't error) when CallawaySantAnna results use `base_period != "universal"`. R's HonestDiD requires universal base period.
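The naive FLCI referenced in these notes pads the identified-set bounds by a normal critical value. A minimal sketch, assuming `lb` and `ub` are the lower and upper bounds of the identified set and `se` is a single standard error applied to both ends (the helper name is hypothetical and not part of the library):

```python
from statistics import NormalDist


def naive_flci(lb: float, ub: float, se: float, alpha: float = 0.05):
    """Return (lb - z*se, ub + z*se) with z the two-sided normal critical value."""
    if se < 0.0:
        raise ValueError("se must be non-negative")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return lb - z * se, ub + z * se
```

Because the full two-sided critical value is applied on both ends regardless of the width of the identified set, the interval errs on the wide side, which matches the "conservative" label used above.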
| Field | Value |
|---|---|
| Module | pretrends.py |
| Primary Reference | Roth (2022), Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends, AER:I 4(3), 305-322 |
| R Reference | pretrends package |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## PreTrendsPower` (MDV at target power; four violation types: linear/constant/last_period/custom; power curve plotting; HonestDiD integration)
- Implementation: `tests/test_pretrends.py` (point-estimator, MDV, power curve, sensitivity) plus event-study coverage in `tests/test_pretrends_event_study.py`
- Paper review on file: `docs/methodology/papers/roth-2022-review.md` (added 2026-05-17; non-authoritative source audit; registry entry remains authoritative until the follow-up audit PR)
Outstanding for promotion:
- Dedicated `tests/test_methodology_pretrends.py` with paper-equation-numbered Verified Components walk-through
- R parity fixture against the `pretrends` R package at a pinned revision (TODO.md tracks the revision-pin follow-up; until that lands, the R-package surface claims in `docs/methodology/papers/roth-2022-review.md` are provisional). Covers the four power calculations: linear, constant, last-period, custom. Note that `compute_pretrends_power` does not accept `violation_weights` today, so `"custom"` parity has to run through `PreTrendsPower(..., violation_weights=...)` directly until the helper is extended (TODO.md tracks the helper-extension follow-up); helper-only parity is limited to `linear`/`constant`/`last_period`.
- Verify the REGISTRY Implementation Checklist (all four items currently unchecked)
| Field | Value |
|---|---|
| Module | power.py |
| Primary References | Bloom (1995); Burlig, Preonas & Woerman (2020) — clustered DiD power (both listed in REGISTRY) |
| R Reference | pwr (basic) / DeclareDesign (design-based simulation) |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## PowerAnalysis` (MDE / power / sample size / simulation-based power / cluster adjustment); primary sources Bloom (1995) and Burlig et al. (2020) listed
- Implementation: `tests/test_power.py` (MDE / power / sample-size / simulation paths plus cluster adjustment)
Outstanding for promotion:
- Paper review under `docs/methodology/papers/` (likely a combined review covering Bloom 1995 + Burlig et al. 2020)
- Dedicated `tests/test_methodology_power.py` with closed-form walk-through against `pwr::pwr.t.test()` and Burlig et al.'s clustered-DiD power formula
- Documented reference-validation harness against `pwr`/`DeclareDesign`
- Verify the REGISTRY Implementation Checklist (all five items currently unchecked)
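A closed-form check of the kind such a walk-through could assert is sketched below under a normal approximation. Bloom's (1995) minimum detectable effect is the standard error times the sum of the size and power critical values, and equal-sized clusters inflate the variance by the standard design effect 1 + (m - 1) * rho. The function names are hypothetical and do not mirror the API in `power.py`; Burlig et al.'s serial-correlation-aware formula is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def mde_multiplier(alpha: float = 0.05, power: float = 0.80) -> float:
    """Sum of the two-sided size critical value and the power critical value."""
    nd = NormalDist()
    return nd.inv_cdf(1.0 - alpha / 2.0) + nd.inv_cdf(power)


def design_effect(cluster_size: float, icc: float) -> float:
    """Variance inflation for equal-sized clusters with intra-cluster correlation icc."""
    return 1.0 + (cluster_size - 1.0) * icc


def mde(se: float, alpha: float = 0.05, power: float = 0.80,
        cluster_size: float = 1.0, icc: float = 0.0) -> float:
    """Minimum detectable effect for an estimator with unclustered standard error se."""
    se_adjusted = se * sqrt(design_effect(cluster_size, icc))
    return mde_multiplier(alpha, power) * se_adjusted
```

At alpha = 0.05 and 80% power the multiplier is roughly 2.8, the figure commonly quoted from Bloom's paper, which makes it a convenient anchor for a first assertion.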
| Field | Value |
|---|---|
| Module | diagnostics.py |
| Primary Reference | None canonical (general permutation / leave-one-out diagnostic) |
| R Reference | None canonical |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## PlaceboTests` (NaN-inference edge cases for `permutation_test` and `leave_one_out_test`)
- Implementation: tests embedded in `tests/test_diagnostics.py`
Outstanding for promotion:
- Decide whether this surface warrants a standalone methodology review or whether the brief Verified Components walk-through + NaN-inference deviation log should live as a sub-section under each per-estimator diagnostic block instead
- If kept standalone: brief Verified Components block + Deviations block for the NaN-inference convention
These are not estimators but variance/inference plumbing used across many estimators. They warrant their own methodology reviews because the implementation details (kernel choice, weight rescaling, df adjustment) are independently citable.
| Field | Value |
|---|---|
| Module | conley.py, linalg.py (_validate_vcov_args, kernel construction) |
| Primary Reference | Conley (1999), GMM Estimation with Cross-Sectional Dependence, J. Econometrics 92(1), 1-45 |
| Secondary References | Andrews (1991) HAC theory; Colella, Lalive, Sakalli & Thoenig (2019) for the Stata acreg parallel; Düsterhöft (2021) conleyreg (CRAN) parity target |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md section: `## ConleySpatialHAC` plus three sub-sections (combined spatial + cluster product kernel, Wave A #119; performance/scale, Wave A #120; callable `conley_metric` validation, Wave A #123)
- Paper review on file: `docs/methodology/papers/conley-1999-review.md` (review date 2026-05-09); plus four adjacent paper reviews for the spillover initiative: `butts-2021-review.md`, `butts-2023-review.md` (JUE Insight), `clarke-2017-review.md`, `colella-et-al-2019-review.md`
- Implementation: 162 tests in `tests/test_conley_vcov.py` (Phase 1 + Phase 2 space-time HAC)
- Wired through `DifferenceInDifferences`, `MultiPeriodDiD`, `TwoWayFixedEffects` via `vcov_type="conley"` enum
Documentation in place (R parity):
- R `conleyreg` goldens committed: `benchmarks/data/r_conleyreg_conley_golden.json`, generator `benchmarks/R/generate_conley_golden.R`
- Cross-sectional R parity at `atol=1e-6`: `tests/test_conley_vcov.py::TestConleyParityR`
- Panel (space-time) R parity at `atol=1e-6`: `TestConleyParitySpacetime` (dense path) and `TestConleySparseRParityForced` (sparse path forced)
- Internal block-decomposition cross-check at machine precision (matches `conleyreg::time_dist.cpp`): `TestConleyParitySpacetime::test_panel_matches_block_decomposed_reference` (inner tolerance `atol=1e-12`)
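The tolerance checks listed above share one shape: take committed reference values and compare elementwise against an absolute tolerance. A minimal standard-library sketch follows. The helper names, the nested-list layout, and the idea of treating NaN as a mismatch are assumptions made for illustration; they do not describe the structure of `r_conleyreg_conley_golden.json` or the assertions in `tests/test_conley_vcov.py`.

```python
import math


def max_abs_diff(actual, expected) -> float:
    """Largest elementwise absolute difference between two equally shaped matrices.

    A NaN on either side is reported as an infinite difference so it cannot
    slip through a tolerance check unnoticed.
    """
    if len(actual) != len(expected):
        raise ValueError("row count mismatch")
    worst = 0.0
    for row_a, row_e in zip(actual, expected):
        if len(row_a) != len(row_e):
            raise ValueError("column count mismatch")
        for a, e in zip(row_a, row_e):
            diff = abs(a - e)
            if math.isnan(diff):
                return math.inf
            worst = max(worst, diff)
    return worst


def assert_parity(actual, expected, atol: float = 1e-6) -> float:
    """Raise AssertionError if any element differs by more than atol."""
    diff = max_abs_diff(actual, expected)
    assert diff <= atol, f"max abs diff {diff:.3e} exceeds atol={atol:.1e}"
    return diff
```

Returning the observed difference makes it straightforward to record the achieved tolerance in a summary table rather than only pass or fail.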
Outstanding for promotion:
- Dedicated `tests/test_methodology_conley.py` with paper-equation-numbered Verified Components walk-through (Equation 8 score-covariance, Bartlett kernel, Andrews-style truncation) consolidating the parity tests into a methodology checklist
- Summary R-parity table in this tracker (currently the parity results are scattered across class-level docstrings in `tests/test_conley_vcov.py`)
- Document deviation: indefiniteness guard applied to both spatial and cluster kernels (vs. Bartlett's PSD property)
- Resolution for the Phase 5 spillover-conley dependency on survey-weights interaction (currently raises `NotImplementedError` at the linalg validator)
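As a reference point for the Bartlett-kernel item, the sketch below computes a spatial HAC "meat" term for a single regressor: the sum over all pairs of units of K(d_ij / cutoff) * s_i * s_j, with K(u) = max(0, 1 - u) and s_i the score (regressor times residual). It assumes Euclidean distance on planar coordinates and plain Python lists. All names are hypothetical, and it makes no attempt to reproduce `conley.py`, `conleyreg`, or the paper's notation.

```python
import math


def bartlett_weight(distance: float, cutoff: float) -> float:
    """Bartlett kernel: linear decay from 1 at distance 0 to 0 at the cutoff."""
    if cutoff <= 0.0:
        raise ValueError("cutoff must be positive")
    return max(0.0, 1.0 - distance / cutoff)


def conley_meat_scalar(scores, coords, cutoff: float) -> float:
    """Kernel-weighted sum of score cross-products over all pairs of units."""
    if len(scores) != len(coords):
        raise ValueError("scores and coords must have the same length")
    total = 0.0
    for s_i, p_i in zip(scores, coords):
        for s_j, p_j in zip(scores, coords):
            # Pairs farther apart than the cutoff get weight 0 and drop out.
            total += bartlett_weight(math.dist(p_i, p_j), cutoff) * s_i * s_j
    return total
```

Two limiting cases are useful sanity checks: a cutoff smaller than every pairwise distance leaves only the diagonal (the heteroskedasticity-robust sum of squared scores), while a very large cutoff approaches the square of the summed scores.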
| Field | Value |
|---|---|
| Module | survey.py, bootstrap_utils.py (plus per-estimator hooks) |
| Primary References | Binder (1983) for TSL variance; Lumley (2004) for the R survey package; Solon, Haider & Wooldridge (2015) for the "when to weight" framework |
| R Reference | survey R package |
| Status | In Progress |
| Last Review | — |
Documentation in place:
- REGISTRY.md sub-sections (under `## Survey Data Support`): Weighted Estimation, TSL Variance, Weight Type Effects on Inference, Absorbed FE with Survey Weights, Survey Degrees of Freedom, Survey Aggregation (`aggregate_survey`), Survey-Aware Bootstrap (Phase 6), Replicate Weight Variance (Phase 6), DEFF Diagnostics (Phase 6), Subpopulation Analysis (Phase 6), Survey DGP (`generate_survey_did_data`)
- Theory document: `docs/methodology/survey-theory.md`, the full Binder-Lumley derivation of design-based variance for modern DiD estimators, including influence-function machinery
- 13 dedicated `tests/test_survey*.py` files: `test_survey.py`, `test_survey_dcdh.py`, `test_survey_dcdh_replicate_psu.py`, `test_survey_estimator_validation.py`, `test_survey_phase3.py`, `test_survey_phase4.py`, `test_survey_phase5.py`, `test_survey_phase6.py`, `test_survey_phase7a.py`, `test_survey_phase8.py`, `test_survey_r_crossvalidation.py`, `test_survey_real_data.py`, `test_survey_staggered_ddd.py`
- Per-estimator survey hooks documented in the REGISTRY sections of every estimator that supports survey design (DiD/TWFE/MultiPeriodDiD, CS, SunAbraham, StackedDiD, ImputationDiD, TwoStageDiD, WooldridgeDiD, EfficientDiD, ContinuousDiD, DCDH, HAD, TripleDifference, StaggeredTripleDifference, TROP, SyntheticDiD). Scope is estimators; survey-capable diagnostics (e.g., `BaconDecomposition` Phase 3, `HonestDiD` survey-df handling) are tracked in their own sections.
Outstanding for promotion:
- Dedicated `tests/test_methodology_survey.py` (or split between TSL and replicate-weight surfaces) with Binder-equation-numbered Verified Components walk-through
- R parity benchmark against `survey::svyglm`/`survey::svycontrast` for the linear DiD case (`tests/test_survey_r_crossvalidation.py` exists; needs to be wired into a documented "Reference results" table here)
- Document deviations: PSU-level Hall-Mammen wild clustering as the bootstrap path when survey design is present (vs. R `survey`'s default analytical TSL); strata-vs-no-strata bit-equality not achievable due to RNG-path divergence between the per-stratum numpy loop and the batched `generate_survey_multiplier_weights_batch` call (see `docs/methodology/REGISTRY.md` HAD Stute survey-bootstrap section, "Distributional parity, NOT bit-exact" note, for the documented impossibility: distributional parity holds at large B, exact agreement at `atol=1e-10` does not)
- Consolidated "Outstanding cross-estimator gaps" enumerating which estimators still raise `NotImplementedError` on which survey-design combinations (e.g., Conley + survey, SyntheticDiD + Conley, HAD replicate weights on Stute family)
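For orientation, the with-replacement "ultimate cluster" form of the linearization variance for an estimated total fits in a few lines: aggregate weighted values to PSU totals, then sum n_h / (n_h - 1) times the within-stratum sum of squared deviations of those totals. For a regression coefficient the same formula is applied to influence-function values rather than raw outcomes. This sketch ignores finite-population corrections, raises on single-PSU strata (a simplifying assumption made here), and uses hypothetical names; it is not the code in `survey.py`.

```python
from collections import defaultdict


def tsl_variance_of_total(rows) -> float:
    """Linearization variance of a weighted total.

    rows: iterable of (stratum, psu, weight, value) tuples, one per observation.
    """
    # Weighted total of the values within each PSU, grouped by stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, weight, value in rows:
        psu_totals[stratum][psu] += weight * value

    variance = 0.0
    for stratum, totals in psu_totals.items():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError(f"stratum {stratum!r} has fewer than two PSUs")
        mean_h = sum(totals.values()) / n_h
        sum_sq = sum((z - mean_h) ** 2 for z in totals.values())
        variance += n_h / (n_h - 1) * sum_sq
    return variance
```

A hand-computable fixture of this size is a natural first entry for a Verified Components walk-through, since the expected value can be derived on paper and then cross-checked against the R `survey` package.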
For each estimator, complete the following steps:
- Read primary academic source - Review the key paper(s) cited in REGISTRY.md and write a `docs/methodology/papers/<name>-review.md` review if one doesn't exist
- Compare key equations - Verify implementation matches equations in REGISTRY.md
- Run benchmark against reference implementation - Execute `benchmarks/run_benchmarks.py --estimator <name>` if available; otherwise generate fixtures and document parity tolerances
- Verify edge case handling - Check behavior matches REGISTRY.md documentation
- Check standard error formula - Confirm SE computation matches reference (analytical, bootstrap, cluster-robust, survey-aware)
- Write dedicated methodology test file - `tests/test_methodology_<name>.py` with paper-equation-numbered assertions that correspond 1:1 to the Verified Components list
- Document deviations - Add notes explaining intentional differences with rationale, using one of the REGISTRY.md labels (`- **Note:**`, `- **Deviation from R:**`, `**Note (deviation from R):**`)
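The methodology-test step above can be pictured with a small skeleton in which each test maps to one Verified Components item. Everything here is hypothetical: the stand-in `did_2x2` function, the class and test names, and the numbering are illustrations of the layout, not an existing file in the repository.

```python
# Illustrative layout for a tests/test_methodology_<name>.py file.
# All names below are hypothetical stand-ins.


def did_2x2(y_treat_pre, y_treat_post, y_ctrl_pre, y_ctrl_post):
    """Stand-in estimator: canonical 2x2 difference-in-differences of group means."""
    def mean(values):
        return sum(values) / len(values)

    treated_change = mean(y_treat_post) - mean(y_treat_pre)
    control_change = mean(y_ctrl_post) - mean(y_ctrl_pre)
    return treated_change - control_change


class TestVerifiedComponents:
    def test_vc1_point_estimate_matches_difference_of_mean_changes(self):
        # Verified Component 1: treated change (6 - 2) minus control change (3 - 2).
        estimate = did_2x2([1.0, 3.0], [5.0, 7.0], [2.0, 2.0], [3.0, 3.0])
        assert abs(estimate - 3.0) < 1e-12

    def test_vc2_common_post_period_shift_leaves_estimate_unchanged(self):
        # Verified Component 2: a shock shared by both groups differences out.
        base = did_2x2([1.0, 3.0], [5.0, 7.0], [2.0, 2.0], [3.0, 3.0])
        shifted = did_2x2([1.0, 3.0], [15.0, 17.0], [2.0, 2.0], [13.0, 13.0])
        assert abs(base - shifted) < 1e-12
```

Keeping the component number in the test name lets a reviewer tick off the tracker's checklist directly from the test report.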
- After completing a review: Update status to "Complete" and add date, populate Verified Components / Corrections Made / Deviations sections
- When making corrections: Document what was fixed in the "Corrections Made" section with file path and line number
- When identifying issues: Add to "Outstanding Concerns" for future investigation
- When deviating from reference: Document the deviation and rationale; cross-reference the REGISTRY.md `Note (deviation from R)` block
- When promoting from In Progress to Complete: Replace the "Documentation in place" / "Outstanding for promotion" pair with the full Verified Components / Corrections Made / Deviations structure used by Complete entries
- When adding a new estimator to the library: Add a row to the appropriate Status Summary table marked In Progress and a stub section under the matching category in Detailed Review Notes (Documentation in place / Outstanding for promotion) — same PR that introduces the estimator. New surfaces enter as In Progress because they ship with a REGISTRY.md entry and unit tests by definition.
When our implementation intentionally differs from the reference implementation, document:
- What differs: Specific behavior or formula that differs
- Why: Rationale (e.g., "defensive enhancement", "bug in R package", "follows updated paper")
- Impact: Whether results differ in practice
- Cross-reference: Update REGISTRY.md edge cases section using one of the recognized labels
Example:
**Deviation (2025-01-15)**: CallawaySantAnna returns NaN for t_stat when SE is non-finite,
whereas R's `did::att_gt` would error. This is a defensive enhancement that provides
more graceful handling of edge cases while still signaling invalid inference to users.
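A minimal sketch of the convention described in this example, assuming inference fields are set to NaN whenever the standard error is non-finite. The helper name is hypothetical, and treating a non-positive SE the same way is an added assumption that the example itself does not state.

```python
import math


def safe_t_stat(estimate: float, se: float) -> float:
    """Return estimate / se, or NaN when se cannot support valid inference."""
    # Non-finite SE follows the example; the se <= 0 guard is an assumption.
    if not math.isfinite(se) or se <= 0.0:
        return math.nan
    return estimate / se
```

Returning NaN instead of raising keeps a batch of estimates flowing while still marking the affected row as unusable for inference.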
Promotion priority for the In Progress entries, ordered by what's blocked on substantive review work (top of list = needs review next) vs. consolidation pass (bottom of list = mostly tracker walk-through):
Substantive-review-blocked (no methodology test file, no paper review, no R parity):
- PreTrendsPower: small surface, established R package (`pretrends`), Roth (2022) is short.
- PowerAnalysis: larger surface (MDE / power / sample size / simulation paths); REGISTRY already lists Bloom (1995) and Burlig et al. (2020) as primary sources; least urgent if the library's power-analysis utilities are not heavily used.
- PlaceboTests: decide first whether to keep standalone or absorb into per-estimator diagnostic sections; methodologically lightweight either way.
- EfficientDiD: no paper review on file; substantial implementation work (`tests/test_efficient_did.py` + validation tests) needs paper-vs-code audit against Chen, Sant'Anna & Xie (2025).
- ImputationDiD / TwoStageDiD: natural pair (both single-treatment-effect-imputation methods). Each needs paper review, methodology file, R parity fixture against `didimputation`/`did2s`.
Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):
- HeterogeneousAdoptionDiD (HAD): largest current surface, Phase 4.5 just shipped; shares the de Chaisemartin (2026) paper review with DCDH; needs a dedicated Verified Components block.
- ChaisemartinDHaultfoeuille (DCDH): methodology test file + 24 R parity tests + 347 unit tests + a companion-paper review for the 2026 universal-rollout extension. Primary-source reviews for the 2020 AER and 2022/2024 NBER WP 29873 papers are still outstanding alongside the Verified Components walk-through.
- WooldridgeDiD (ETWFE): companion-paper review (Wooldridge 2023 nonlinear extension) merged in PR #443; primary-source review for Wooldridge (2025) ETWFE not yet on file, and no dedicated methodology test file.
- ContinuousDiD: 15 methodology tests already in place; mostly a consolidation pass with a documented boundary-knots deviation from R `contdid` v0.1.0.
- TROP: paper review recently merged (PR #443); needs methodology file and cross-language anchor (when paper-author reference becomes available).
- StaggeredTripleDifference: shares the primary paper (Ortiz-Villavicencio & Sant'Anna 2025) with TripleDifference, but no dedicated paper review on file yet; needs R parity (R fixtures gitignored; tracked in TODO.md, PR #245).
- ConleySpatialHAC: paper review + committed R `conleyreg` goldens; needs dedicated methodology test file + summary R-parity table in this tracker.
- Survey Data Support: cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first.
- REGISTRY.md — Academic foundations and key equations
- docs/methodology/papers/ — Per-paper retrospective reviews (Athey 2025, Butts 2021/2023, Clarke 2017, Colella et al. 2019, Conley 1999, de Chaisemartin 2026, Rambachan-Roth 2023, Wooldridge 2023)
- docs/methodology/continuous-did.md — ContinuousDiD theory note
- docs/methodology/survey-theory.md — Design-based variance estimation for modern DiD estimators
- docs/methodology/REPORTING.md — Reporting conventions across estimators
- ROADMAP.md — Feature roadmap
- TODO.md — Technical debt tracking, including deferred methodology items from code reviews
- CLAUDE.md — Development guidelines