Development TODO

Internal tracking for technical debt, known limitations, and maintenance tasks.

For the public feature roadmap, see ROADMAP.md.


Known Limitations

Current limitations that may affect users:

| Issue | Location | Priority | Notes |
| --- | --- | --- | --- |
| MultiPeriodDiD wild bootstrap not supported (falls back to analytical) | estimators.py:1647 | Low | Edge case |
| predict() raises NotImplementedError | estimators.py:890-911 | Low | Rarely needed |

For survey-specific limitations (NotImplementedError paths), see the Current Limitations section of survey-roadmap.md.

Code Quality

Large Module Files

Target: ideally < 1000 lines per module; modules ≥3000 lines are candidates for splitting, 2000-3000 are monitored, 1000-2000 are accepted as a cohesion / scope trade-off. Updated 2026-05-15.

| File | Lines | Action |
| --- | --- | --- |
| chaisemartin_dhaultfoeuille.py | 8636 | Consider splitting (per-path / placebos / survey IF / aggregation) |
| had_pretests.py | 4951 | Consider splitting (Stute / Yatchew / QUG / joint pretests) |
| had.py | 4593 | Consider splitting (continuous / mass-point / event-study / survey paths) |
| staggered.py | 3963 | Consider splitting; grew through survey + aggregation features |
| linalg.py | 3601 | Consider splitting (vcov surfaces) only if cohesion can be preserved: unified backend; vcov / solver paths are tightly coupled |
| diagnostic_report.py | 3380 | Consider splitting (per-method renderers + provenance) |
| power.py | 3196 | Consider splitting (power analysis + MDE + sample size) |
| synthetic_did.py | 2819 | Monitor (variance methods + survey paths) |
| honest_did.py | 2785 | Monitor |
| business_report.py | 2653 | Monitor (per-method narrative renderers) |
| imputation.py | 2475 | Monitor |
| survey.py | 2466 | Monitor (grew with Phase 6 features) |
| utils.py | 2396 | Monitor |
| prep_dgp.py | 2057 | Monitor |
| triple_diff.py | 2053 | Monitor |
| estimators.py | 1991 | Acceptable |
| two_stage.py | 1985 | Acceptable |
| chaisemartin_dhaultfoeuille_results.py | 1981 | Acceptable |
| prep.py | 1876 | Acceptable |
| efficient_did.py | 1793 | Acceptable |
| sun_abraham.py | 1713 | Acceptable |
| continuous_did.py | 1682 | Acceptable |
| results.py | 1676 | Acceptable |
| staggered_triple_diff.py | 1619 | Acceptable |
| _nprobust_port.py | 1412 | Acceptable |
| practitioner.py | 1402 | Acceptable |
| trop_global.py | 1350 | Acceptable |
| trop_local.py | 1339 | Acceptable |
| local_linear.py | 1332 | Acceptable |
| wooldridge.py | 1305 | Acceptable |
| chaisemartin_dhaultfoeuille_bootstrap.py | 1175 | Acceptable |
| bacon.py | 1144 | Acceptable |
| pretrends.py | 1133 | Acceptable |
| stacked_did.py | 1050 | Acceptable |
| conley.py | 1006 | Acceptable |
| visualization/ | 4316 | Subpackage (split across 7 files); OK |
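
The size thresholds stated above the table can be sketched as a small classifier, which may be handy when refreshing the table. This is an illustrative helper only: `classify_module` and the "On target" label are hypothetical names, and treating exactly 2000 and exactly 3000 lines as belonging to the higher band is an assumption, since the stated ranges overlap at their endpoints. Subpackages such as visualization/ are judged per file and are not covered here.

```python
def classify_module(line_count: int) -> str:
    """Map a module's line count to the action bands used in the table.

    Hypothetical helper; boundary handling at 2000 and 3000 is an assumption.
    """
    if line_count >= 3000:
        return "Consider splitting"
    if line_count >= 2000:
        return "Monitor"
    if line_count >= 1000:
        return "Acceptable"
    return "On target"  # label invented here for the < 1000 target band


# A few line counts copied from the table above.
sizes = {
    "chaisemartin_dhaultfoeuille.py": 8636,
    "synthetic_did.py": 2819,
    "estimators.py": 1991,
    "conley.py": 1006,
}
actions = {name: classify_module(n) for name, n in sizes.items()}

for name, action in actions.items():
    print(f"{name}: {action}")
```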

Tech Debt from Code Reviews

Deferred items from PR reviews that were not addressed before merge.

Methodology/Correctness

Issue Location PR Priority
dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE. chaisemartin_dhaultfoeuille.py #294 Low
dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. chaisemartin_dhaultfoeuille.py, docs/methodology/REGISTRY.md #408 Medium
dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting). test_chaisemartin_dhaultfoeuille_parity.py #294 Low
dCDH by_path: survey-aware backward-horizon (placebo + predict_het + survey_design) raises NotImplementedError because the Binder TSL cell-period allocator's REGISTRY justification is tied to post-period attribution. Backward horizons would put ψ_g mass on a pre-period cell. Deriving the pre-period cell allocator (or adding a covariance-aware two-cell alternative) is deferred to a follow-up methodology PR. diff_diff/chaisemartin_dhaultfoeuille.py, docs/methodology/REGISTRY.md follow-up Medium
CallawaySantAnna: consider materializing NaN entries for non-estimable (g,t) cells in group_time_effects dict (currently omitted with consolidated warning); would require updating downstream consumers (event study, balance_e, aggregation) staggered.py #256 Low
ImputationDiD dense (A0'A0).toarray() scales O((U+T+K)^2), OOM risk on large panels imputation.py #141 Medium (deferred — only triggers when sparse solver fails)
Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) estimators.py #218 Medium
Survey design resolution/collapse patterns are inconsistent across panel estimators — ContinuousDiD rebuilds unit-level design in SE code, EfficientDiD builds once in fit(), StackedDiD re-resolves on stacked data; extract shared helpers for panel-to-unit collapse, post-filter re-resolution, and metadata recomputation continuous_did.py, efficient_did.py, stacked_did.py #226 Low
Survey-weighted Silverman bandwidth in EfficientDiD conditional Omega*: _silverman_bandwidth() uses unweighted mean/std for bandwidth selection; survey-weighted statistics would better reflect the population distribution, but this is a second-order refinement efficient_did_covariates.py Low
TROP: extend Wave 4's _setup_trop_data helper to also cover the duplicated bootstrap resampling loop in _bootstrap_variance / _bootstrap_variance_global (~40 LoC dedup; mirrors the data-setup helper pattern with a fit_callable parameter for the per-draw refit step). trop_local.py, trop_global.py follow-up Low
TripleDifference power auto-routing: power.simulate_power ignores n_periods for DDD because _ddd_dgp_kwargs is hard-coded to the cross-sectional generate_ddd_data. Now that generate_ddd_panel_data exists (Wave 4), add a new _EstimatorProfile registry entry (or extend the existing one) to route to the panel DGP when n_periods > 2. power.py, prep_dgp.py follow-up Low
StaggeredTripleDifference R cross-validation: CSV fixtures not committed (gitignored); tests skip without local R + triplediff. Commit fixtures or generate deterministically. tests/test_methodology_staggered_triple_diff.py #245 Medium
StaggeredTripleDifference R parity: benchmark only tests no-covariate path (xformla=~1). Add covariate-adjusted scenarios and aggregation SE parity assertions. benchmarks/R/benchmark_staggered_triplediff.R #245 Medium
StaggeredTripleDifference: per-cohort group-effect SEs include WIF (conservative vs R's wif=NULL). Documented in REGISTRY. Could override mixin for exact R match. staggered_triple_diff.py #245 Low
HonestDiD Delta^RM: uses naive FLCI instead of paper's ARP conditional/hybrid confidence sets (Sections 3.2.1-3.2.2). ARP infrastructure exists but moment inequality transformation needs calibration. CIs are conservative (wider, valid coverage). honest_did.py #248 Medium
Replicate weight tests use Fay-like BRR perturbations (0.5/1.5), not true half-sample BRR. Add true BRR regressions per estimator family. Existing test_survey_phase6.py covers true BRR at the helper level. tests/test_replicate_weight_expansion.py #253 Low
WooldridgeDiD: QMLE sandwich uses aweight cluster-robust adjustment (G/(G-1))*(n-1)/(n-k) vs Stata's G/(G-1) only. Conservative (inflates SEs). Add qmle weight type if Stata golden values confirm material difference. wooldridge.py, linalg.py #216 Medium
WooldridgeDiD: aggregation weights use cell-level n_{g,t} counts. Paper (W2025 Eqs. 7.2-7.4) defines cohort-share weights. Add optional weights="cohort_share" parameter to aggregate(). wooldridge_results.py #216 Medium
WooldridgeDiD: optional efficiency hint (NOT a canonical-link violation per W2023 Prop 3.1) when method/outcome pairing is sub-optimal — e.g., method="ols" on binary data is consistent under QMLE, but method="logit" is typically more efficient. The original framing in this row as a "canonical link requirement" tied to Prop 3.1 was incorrect: Wooldridge (2023) Table 1 lists Gaussian/OLS for "any response" and logistic-Bernoulli for "binary OR fractional". A useful hint exists (efficiency), but should not be framed as a methodology violation. See PR #453 R1 review for the corrected reading. wooldridge.py #216 Low
WooldridgeDiD: Stata jwdid golden value tests — add R/Stata reference script and TestReferenceValues class. tests/test_wooldridge.py #216 Medium
PreTrendsPower: compute_pretrends_power adapter uses diag(ses^2) instead of the full pre-period covariance block Σ_22 for CallawaySantAnnaResults (deliberate — non-bootstrap CS persists event_study_vcov; bootstrap CS fits clear it at staggered.py:2032-2036) and SunAbrahamResults (forced — SA does not expose an event-study/cohort VCV at all). Roth (2022)'s NIS box probability and the library's Wald object both depend on Σ_22 off-diagonals; diag fallback is not provably conservative. For non-bootstrap CS fits, route through event_study_vcov; for bootstrap CS fits the diag fallback is the only path. For SA, extend SunAbrahamResults to persist a cohort/event-study VCV (then route the adapter likewise). Or formally retain the diag fallback with explicit miscalibration framing. See REGISTRY.md ## PreTrendsPower Note (deviation from paper) + docs/methodology/papers/roth-2022-review.md. diff_diff/pretrends.py:609-687, diff_diff/sun_abraham.py:30-88, docs/methodology/REGISTRY.md, docs/methodology/papers/roth-2022-review.md PR-A (Roth paper review, 2026-05-17) Medium
PreTrendsPower: pin the R pretrends package commit/release before building the R-parity fixture. The paper review's R-package surface claims (pretrends(), slope_for_power(), NIS-only API, no joint-Wald target) are provisional pending a pinned revision; the audited revision should be recorded either in the review file's Gaps section or in this TODO row before any parity assertions are committed. docs/methodology/papers/roth-2022-review.md, METHODOLOGY_REVIEW.md (PreTrendsPower row) PR-A (Roth paper review, 2026-05-17) Low
PreTrendsPower: helper compute_pretrends_power(results, M, alpha, target_power, violation_type, pre_periods) does NOT accept violation_weights, so violation_type="custom" is unusable from the helper (class-only today via PreTrendsPower(..., violation_weights=...)). Either add violation_weights to the helper signature and forward to the class, or document the helper as supporting only linear / constant / last_period. diff_diff/pretrends.py:1048-1095, 442-466 PR-A (Roth paper review, 2026-05-17) Low
PreTrendsPower: PreTrendsPowerResults.power_at() does not yet support violation_type="custom". Silent-failure path was mitigated in PR-A (2026-05-17, R18 of the codex review): power_at() now raises NotImplementedError for custom fits rather than returning equal-weights output, locked in by test_power_at_raises_on_custom_violation_type. Remaining follow-up: persist the normalized fitted violation_weights on PreTrendsPowerResults (currently absent at pretrends.py:77-90) and re-enable power_at() for custom fits, with a parity test comparing results.power_at(M) to a fresh PreTrendsPower(...).fit(..., M=M).power on a custom-weights fixture. diff_diff/pretrends.py:77-90, ~196-235, ~878-892 PR-A (Roth paper review, 2026-05-17) Medium
PreTrendsPower: linear violation pattern does NOT implement Roth's δ_t = γ·t. _get_violation_weights(violation_type="linear") constructs a shifted, normalized [n-1, ..., 1, 0] direction from n_pre only (pretrends.py:510-515), and fit() never threads actual relative-time labels into that construction (pretrends.py:862-866). For irregular pre-period grids (e.g., anticipation-shifted t ∈ {-5, -3, -1}) this means the slope reported as MDV is not in Roth's γ units. Fix: build linear weights from the sorted actual relative-time values used in the fit, define the exposed parameter in γ units, persist any normalization separately, and add a regression test using anticipation-shifted / irregular pre-periods. If the shifted convention is intentional, add a **Note (deviation from paper):** to REGISTRY.md and convert reported MDV back to Roth's slope scale before exposing it. diff_diff/pretrends.py:488-531, 862-866, docs/methodology/REGISTRY.md:2786-2789 PR-A (Roth paper review, 2026-05-17; surfaced by R17 of the iterative codex review on the paper review file) High
Thread vcov_type (classical / hc1 / hc2 / hc2_bm) through the 8 standalone estimators that expose cluster=: CallawaySantAnna, SunAbraham, ImputationDiD, TwoStageDiD, TripleDifference, StackedDiD, WooldridgeDiD, EfficientDiD. Phase 1a added vcov_type to the DifferenceInDifferences inheritance chain only. multiple Phase 1a Medium
Weighted one-way Bell-McCaffrey (vcov_type="hc2_bm" + weights, no cluster) currently raises NotImplementedError. _compute_bm_dof_from_contrasts builds its hat matrix from the unscaled design via X (X'WX)^{-1} X' W, but solve_ols solves the WLS problem by transforming to X* = sqrt(w) X, so the correct symmetric idempotent residual-maker is M* = I - sqrt(W) X (X'WX)^{-1} X' sqrt(W). Rederive the Satterthwaite (tr G)^2 / tr(G^2) ratio on the transformed design and add weighted parity tests before lifting the guard. linalg.py::_compute_bm_dof_from_contrasts, linalg.py::_validate_vcov_args Phase 1a Medium
HC2 / HC2 + Bell-McCaffrey on absorbed-FE fits — REMAINING sub-gate: TwoWayFixedEffects (twfe.py:154 rejects unconditionally). The DiD sub-gate and the MultiPeriodDiD sub-gate were both lifted via auto-route to fixed_effects= internally (DiD: PR #458, ~1e-10 vs clubSandwich; MPD: this release, ~1e-10 vs sandwich::vcovHC and clubSandwich::vcovCR). TWFE has no equivalent fixed_effects= code path (always within-transforms), so the same auto-route surgery is not directly applicable — lifting requires either building the full-dummy design inline or refactoring TWFE to delegate to DiD. Within-transformation preserves coefficients and residuals under FWL but not the hat matrix; HC1/CR1 are unaffected (no leverage term). twfe.py::fit follow-up Medium
Weighted CR2 Bell-McCaffrey cluster-robust (vcov_type="hc2_bm" + cluster_ids + weights) currently raises NotImplementedError. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. linalg.py::_compute_cr2_bm Phase 1a Medium
Unify Rust local-method estimate_model solver path to solve_wls_svd (the same SVD helper used by the global-method since PR #348) for sub-1e-14 bootstrap SE parity. Current local-method bootstrap parity test (tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local) passes at atol=1e-5 — the residual ~1e-7 gap is roundoff between Rust's estimate_model matrix factorization and numpy's lstsq, which accumulates differently across per-replicate bootstrap fits. Main-fit ATT parity is regime-dependent (atol=1e-14 for lambda_nn=inf, atol=1e-10 for finite lambda_nn — see test_local_method_main_fit_parity); the bootstrap gap is a same-solver-path roundoff concern and not a user-visible correctness bug. rust/src/trop.rs::estimate_model, rust/src/linalg.rs::solve_wls_svd follow-up Low
Rust multiplier-bootstrap weight RNG (generate_bootstrap_weights_batch in rust/src/bootstrap.rs:9-10, 57-75) uses Xoshiro256PlusPlus::seed_from_u64(seed + i) per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (diff_diff/sdid.py, diff_diff/efficient_did_bootstrap.py, diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy) for parity-test gaps. Same fix shape as TROP RNG parity (PR #354): pre-generate weights in Python via numpy and pass them to Rust through PyO3. rust/src/bootstrap.rs, diff_diff/bootstrap_utils.py follow-up Medium
bias_corrected_local_linear: extend golden parity to kernel="triangular" and kernel="uniform" (currently epa-only; all three kernels share kernel_W and the lprobust math, so parity is expected but not separately asserted). benchmarks/R/generate_nprobust_lprobust_golden.R, tests/test_bias_corrected_lprobust.py Phase 1c Low
bias_corrected_local_linear: expose vce in {"hc0", "hc1", "hc2", "hc3"} on the public wrapper once R parity goldens exist (currently raises NotImplementedError). The port-level lprobust and lprobust_res already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit hii for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). diff_diff/local_linear.py::bias_corrected_local_linear, benchmarks/R/generate_nprobust_lprobust_golden.R, tests/test_bias_corrected_lprobust.py Phase 1c Medium
bias_corrected_local_linear: support weights= once survey-design adaptation lands. nprobust's lprobust has no weight argument so there is no parity anchor; derivation needed. diff_diff/local_linear.py, diff_diff/_nprobust_port.py::lprobust Phase 1c Medium
bias_corrected_local_linear: support multi-eval grid (neval > 1) with cross-covariance (covgrid=TRUE branch of lprobust.R:253-378). Not needed for HAD but useful for multi-dose diagnostics. diff_diff/_nprobust_port.py::lprobust Phase 1c Low
Clustered-DGP parity: Phase 1c's DGP 4 uses manual h=b=0.3 to sidestep an nprobust-internal singleton-cluster bug in lpbwselect.mse.dpi's pilot fits. Once nprobust ships a fix (or we derive one independently), add a clustered-auto-bandwidth parity test. benchmarks/R/generate_nprobust_lprobust_golden.R Phase 1c Low
HeterogeneousAdoptionDiD joint cross-horizon covariance on event study: per-horizon SEs use INDEPENDENT sandwiches in Phase 2b (paper-faithful pointwise CIs per Pierce-Schott Figure 2). A follow-up could derive an IF-based stacking of per-horizon scores for joint cross-horizon inference (needed for joint hypothesis tests across event-time horizons). Block-bootstrap is a reasonable alternative. diff_diff/had.py::_fit_event_study Phase 2b Low
HeterogeneousAdoptionDiD event-study staggered-timing beyond last cohort: Phase 2b auto-filters staggered panels to the last cohort per paper Appendix B.2. Earlier-cohort treatment effects are not identified by HAD; redirecting to ChaisemartinDHaultfoeuille / did_multiplegt_dyn is the paper's prescription. A full staggered HAD would require a different identification path (out of paper scope). diff_diff/had.py::_validate_had_panel_event_study Phase 2b Low
HeterogeneousAdoptionDiD joint cross-horizon analytical covariance on the weighted event-study path: Phase 4.5 B ships multiplier-bootstrap sup-t simultaneous CIs on the weighted event-study path but pointwise analytical variance is still independent across horizons. A follow-up could derive the full H × H analytical covariance from the per-horizon IF matrix (Psi.T @ Psi under survey weighting) for an analytical alternative to the bootstrap. Would also let the unweighted event-study path ship a sup-t band. diff_diff/had.py::_fit_event_study follow-up Low
HeterogeneousAdoptionDiD unweighted event-study sup-t band: Phase 4.5 B ships sup-t only on the WEIGHTED event-study path (to preserve pre-PR bit-exact output on unweighted). Extending sup-t to unweighted event-study (either via the multiplier bootstrap with unit-level iid multipliers or via analytical joint cross-horizon covariance) is a symmetric follow-up. diff_diff/had.py::_fit_event_study follow-up Low
HeterogeneousAdoptionDiD survey-aware support-endpoint test (research, not engineering): if the academic literature ever publishes a calibrated support-infimum test under complex sampling — combining endpoint-estimation EVT (Hall 1982, Aarssen-de Haan 1994, Hall-Wang 1999) with survey-aware functional CLTs for the empirical process (Boistard-Lopuhaä-Ruiz-Gazen 2017, Bertail-Chautru-Clémençon 2017) and tail-empirical-process theory (Drees 2003) — Phase 4.5 C0's permanent NotImplementedError on qug_test(..., survey=...) / weights= can be revisited and the bridge implemented against the published recipe. See docs/methodology/REGISTRY.md § "QUG Null Test" — Note (Phase 4.5 C0) for the decision rationale and the research-direction sketch. diff_diff/had_pretests.py::qug_test Phase 4.5 C0 (2026-04, decision shipped) Low
HeterogeneousAdoptionDiD survey-aware pretests Phase 4.5 C still-open follow-ups: (a) replicate-weight designs (BRR/Fay/JK1/JKn/SDR) — the per-replicate weight-ratio rescaling for the OLS-on-residuals refit step is not covered by the multiplier-bootstrap composition; each linearity-family helper raises NotImplementedError on survey.replicate_weights is not None. (b) lonely_psu='adjust' + singleton-strata on the Stute family — the pseudo-stratum centering transform has not been derived for the Stute CvM functional (same gap as the HAD sup-t deviation at REGISTRY:2382). Stratified-design support on the Stute family SHIPPED in the Phase 4.5 C strata extension PR (within-stratum demean + sqrt(n_h/(n_h-1)) Bessel rescale on PSU multipliers via bootstrap_utils.apply_stratum_centering; see REGISTRY § "Note (Stute stratified survey-bootstrap calibration)"). Phase 4.5 C now ships pweight + PSU + FPC + strata support via PSU-level Mammen multiplier bootstrap (Stute family) + closed-form weighted variance components (Yatchew). Replicate-weight pretests = bootstrap-composition work; lonely_psu='adjust'+singleton on Stute = pseudo-stratum centering derivation. diff_diff/had_pretests.py Phase 4.5 C follow-up Low
HeterogeneousAdoptionDiD Phase 4.5: weight-aware auto-bandwidth MSE-DPI selector. Phase 4.5 A ships weighted lprobust with an unweighted DPI selector; users who want a weight-aware bandwidth must pass h/b explicitly. Extending lpbwselect_mse_dpi to propagate weights through density, second-derivative, and variance stages is ~300 LoC of methodology and was out of scope. diff_diff/_nprobust_port.py::lpbwselect_mse_dpi Phase 4.5 Low
HeterogeneousAdoptionDiD Phase 4.5 C: replicate-weight SurveyDesigns (BRR / Fay / JK1 / JKn / SDR) on the continuous-dose paths. Phase 4.5 A raises NotImplementedError on replicate designs in _aggregate_unit_resolved_survey. Rao-Wu-style replicate bootstrap for HAD paths requires deriving the per-replicate weight-ratio rescaling for the local-linear intercept IF. diff_diff/had.py::_aggregate_unit_resolved_survey Phase 4.5 C Low
HeterogeneousAdoptionDiD mass-point: vcov_type in {"hc2", "hc2_bm"} raises NotImplementedError pending a 2SLS-specific leverage derivation. The OLS leverage x_i' (X'X)^{-1} x_i is wrong for 2SLS; the correct finite-sample correction uses x_i' (Z'X)^{-1} (...) (X'Z)^{-1} x_i. Needs derivation plus an R / Stata (ivreg2 small robust) parity anchor. diff_diff/had.py::_fit_mass_point_2sls Phase 2a Medium
HeterogeneousAdoptionDiD survey-design API consolidation, next minor bump: drop the deprecated survey= and weights= kwargs on all 8 HAD surfaces (HeterogeneousAdoptionDiD.fit, did_had_pretest_workflow, qug_test, stute_test, yatchew_hr_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test); only survey_design= remains. Also fold the legacy back-end weights= paths (e.g. _aggregate_unit_weights ad-hoc routing) into the unified _resolve_survey_for_fit-driven path. The _make_trivial_resolved underscore alias on survey.py stays (one-line, harmless). DeprecationWarning ships in this PR; the removal PR is ~50 LoC of cleanup. diff_diff/had.py, diff_diff/had_pretests.py next minor bump Medium
HeterogeneousAdoptionDiD continuous paths: thread cluster= through bias_corrected_local_linear (Phase 1c's wrapper already supports cluster; Phase 2a ignores it with a UserWarning on the continuous path to keep scope tight). diff_diff/had.py, diff_diff/local_linear.py Phase 2a Low
HeterogeneousAdoptionDiD trends_lin × survey_design follow-up: per-group linear-trend slope under survey weighting (weighted slope estimator? per-PSU slope?) is not derived from the paper. PR #389 raises NotImplementedError on the combination across all 3 trends_lin surfaces. If user demand emerges, derive the weighted variant and lift the gate. diff_diff/had.py::HeterogeneousAdoptionDiD.fit, diff_diff/had_pretests.py::joint_pretrends_test, diff_diff/had_pretests.py::joint_homogeneity_test follow-up Low
HeterogeneousAdoptionDiD Stute family Stata-bridge parity: PR #389 R-parity covers the full HAD fit + Yatchew surfaces but skips Stute family (stute_test, stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test) because no R Stutetest package exists publicly (chaisemartinPackages publishes only the Stata stute_test module; the paper cites a 2024c R Stutetest module that is not on GitHub or CRAN). Stata-bridge parity would add benchmarks/stata/generate_stute_golden.do + a Stata installation requirement. Low priority unless user demand emerges. benchmarks/stata/, tests/test_stute_test_parity.py follow-up Low
HeterogeneousAdoptionDiD Phase 3 Stute performance: Appendix D vectorized matrix form replaces the per-iteration OLS refit with a single precomputed M = I - X(X'X)^{-1}X' applied to eps * eta. Functionally identical, ~2x faster. Shipped literal-refit form in Phase 3 to match paper text and keep reviewer surface small. diff_diff/had_pretests.py::stute_test Phase 3 Low
HeterogeneousAdoptionDiD Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against chaisemartin::stute_test / yatchew_test). Tight numerical parity requires aligning bootstrap seed semantics and B across numpy/R and is deferred. tests/test_had_pretests.py Phase 3 Low
HeterogeneousAdoptionDiD Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. diff_diff/had_pretests.py::stute_test Phase 3 Low
HeterogeneousAdoptionDiD Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. benchmarks/, tests/ Phase 2a Low
HeterogeneousAdoptionDiD time-varying dose on event study: Phase 2b REJECTS panels where D_{g,t} varies within a unit for t >= F (the aggregation uses D_{g, F} as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to ChaisemartinDHaultfoeuille. diff_diff/had.py::_validate_had_panel_event_study Phase 2b Low
HeterogeneousAdoptionDiD repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct data_mode / API surface. diff_diff/had.py::_validate_had_panel, diff_diff/had.py::_aggregate_first_difference Phase 2a Medium
SyntheticDiD: bootstrap cross-language parity anchor against R's default synthdid::vcov(method="bootstrap") (refit; rebinds opts per draw) or Julia Synthdid.jl::src/vcov.jl::bootstrap_se (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. benchmarks/R/, benchmarks/julia/, tests/ follow-up Low
Conley + survey weights / survey_design. Score-reweighted meat s_i = w_i · X_i · ε_i is mechanical, but PSU clustering interaction with the spatial kernel and replicate-weights variance under spatial correlation are non-trivial (Bertanha-Imbens 2014 covers cluster-sample but not the explicit Conley case). Phase 5 of the spillover-conley initiative; paper review prerequisite. Currently raises NotImplementedError at the linalg validator. linalg.py::_validate_vcov_args Phase 5 (spillover-conley) Medium
SyntheticDiD(vcov_type="conley") support. Currently raises TypeError at __init__ because SyntheticDiD uses variance_method ∈ {bootstrap, jackknife, placebo} rather than the analytical sandwich that Conley plugs into. Wiring would require either reimplementing an analytical sandwich path for SyntheticDiD or designing a spatial-block bootstrap (new methodology, Politis-Romano 1994 territory). synthetic_did.py::SyntheticDiD follow-up (spillover-conley) Low
SpilloverDiD(survey_design=...) integration. Currently raises NotImplementedError. Requires threading survey weights through the inline stage 1 + stage 2 and lifting two_stage.py's survey path patterns. spillover.py::SpilloverDiD.fit follow-up (Wave B) Low
SpilloverDiD(ring_method="count") extension. Currently only the nearest-treated-ring specification is exposed. Count-of-treated-in-ring (paper Section 3.2 end) is methodologically supported by Butts but re-introduces functional-form dependence; expose with an explicit kwarg gate and documentation warning. spillover.py::SpilloverDiD.fit follow-up Low
SpilloverDiD data-driven d_bar selection (Butts 2021b / Butts 2023 JUE Insight cross-validation). spillover.py::SpilloverDiD follow-up Low
SpilloverDiD T22 TVA tutorial (docs/tutorials/22_spillover_did.ipynb): synthetic TVA-style DGP reproducing Butts (2021) Section 4 Table 1 Panel A bias-correction direction (~40% understatement). Split from the methodology PR per user-confirmed scope split (2026-05-15). docs/tutorials/, tests/test_t22_*_drift.py follow-up (Wave B) Medium
Extend TwoStageDiD with Conley vcov as a first-class feature (mirrors Wave A's TWFE/MPD/DiD extension). Currently TwoStageDiD.__init__ lacks vcov_type / conley_* kwargs; SpilloverDiD works around this by threading Conley directly via solve_ols at stage 2. Promoting Conley to TwoStageDiD's API removes the workaround and lets non-spillover users access Conley + Gardner two-stage. diff_diff/two_stage.py follow-up Medium
SpilloverDiD sparse cKDTree path for the staggered nearest-treated-distance helper (mirrors the static helper's sparse branch). Currently _compute_nearest_treated_distance_staggered always builds dense (n_units, n_treated_by_onset) pairwise distance matrices per cohort; on large staggered panels with many cohorts this is avoidable memory/runtime. Add a sparse k-d-tree branch analogous to _compute_nearest_treated_distance_sparse, gated on n > _CONLEY_SPARSE_N_THRESHOLD. spillover.py::_compute_nearest_treated_distance_staggered follow-up (Wave B) Low
SpilloverDiDResults in DiagnosticReport dispatch tables. Wave C event-study emits a TwoStageDiD-compatible event_study_effects: Dict[int, Dict] alias that plot_event_study consumes via the new reference_period attribute fallback in _extract_plot_data, but SpilloverDiDResults is NOT registered in DiagnosticReport's _APPLICABILITY / _PT_METHOD tables — so DiagnosticReport(spillover_result) doesn't currently route to event-study diagnostics. Registering requires (a) deciding which diagnostics apply (parallel trends, pre-trends power, heterogeneity, design-effect) AND (b) adding an end-to-end test. diff_diff/diagnostic_report.py::_APPLICABILITY, _PT_METHOD follow-up (Wave C) Low
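
The weighted one-way Bell-McCaffrey row above contrasts two hat matrices: the unscaled form X (X'WX)^{-1} X' W and the transformed-design form sqrt(W) X (X'WX)^{-1} X' sqrt(W). A minimal NumPy sketch of that distinction, on made-up data, is below. It does not use any library code; all variable names are illustrative. It only checks the algebraic properties named in the row (symmetry and idempotency) and does not attempt the Satterthwaite rederivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 12, 3

# Toy design with an intercept and non-constant positive weights.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
w = rng.uniform(0.5, 2.0, size=n)
W = np.diag(w)
sqrt_W = np.diag(np.sqrt(w))

bread_inv = np.linalg.inv(X.T @ W @ X)  # (X'WX)^{-1}

# Hat matrix built from the unscaled design: a projection, but oblique.
H_unscaled = X @ bread_inv @ X.T @ W

# Hat matrix of the transformed problem X* = sqrt(W) X, and its residual-maker.
H_star = sqrt_W @ X @ bread_inv @ X.T @ sqrt_W
M_star = np.eye(n) - H_star

unscaled_is_symmetric = np.allclose(H_unscaled, H_unscaled.T)
unscaled_is_idempotent = np.allclose(H_unscaled @ H_unscaled, H_unscaled)
star_is_symmetric = np.allclose(M_star, M_star.T)
star_is_idempotent = np.allclose(M_star @ M_star, M_star)

print("unscaled hat symmetric:", unscaled_is_symmetric)
print("unscaled hat idempotent:", unscaled_is_idempotent)
print("M* symmetric:", star_is_symmetric)
print("M* idempotent:", star_is_idempotent)
print("trace(M*):", np.trace(M_star))  # n - k for a full-rank design
```

With non-constant weights the unscaled hat matrix is idempotent but not symmetric, whereas M* is both, which is why degrees-of-freedom formulas that assume an orthogonal projection need the transformed design.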

Performance

| Issue | Location | PR | Priority |
|---|---|---|---|
| ImputationDiD event-study SEs recompute full conservative variance per horizon (should cache A0/A1 factorization) | imputation.py | #141 | Low |
| Rust faer SVD ndarray-to-faer conversion overhead (minimal vs SVD cost) | rust/src/linalg.rs:67 | #115 | Low |
| Unrelated label events (e.g., adding bug label) re-trigger CI workflows when ready-for-ci is already present; filter labeled/unlabeled events to only ready-for-ci transitions | .github/workflows/rust-test.yml, notebooks.yml, docs-tests.yml | #269 | Low |
| bread_inv as a performance kwarg on compute_robust_vcov to avoid re-inverting (X'WX) when the caller already has it. Deferred from Phase 1a for scope. HC2 and HC2+BM both need the bread inverse, so a shared hint would save one np.linalg.solve per sandwich. | linalg.py::compute_robust_vcov | Phase 1a | Low |
| Rust-backend HC2 implementation. Current Rust path only supports HC1; HC2 and CR2 Bell-McCaffrey fall through to the NumPy backend. For large-n fits this is noticeable. | rust/src/linalg.rs | Phase 1a | Low |
| CR2 Bell-McCaffrey DOF uses a naive O(n² k) per-coefficient loop over cluster pairs. Pustejovsky-Tipton (2018) Appendix B has a scores-based formulation that avoids the full n × n M matrix. Switch when a user hits a large-n cluster-robust design. | linalg.py::_compute_cr2_bm | Phase 1a | Low |
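
The bread_inv row above can be illustrated with a minimal sketch. It assumes an unweighted design (W = I), and the function name hc2_vcov and its signature are hypothetical; this is not the signature of compute_robust_vcov. The point is only that the leverage terms and the sandwich both reuse the same inverse, so a caller who already holds it can skip one solve.

```python
import numpy as np


def hc2_vcov(X, resid, bread_inv=None):
    """HC2 sandwich variance (unweighted sketch).

    `bread_inv` is a hypothetical optional hint holding (X'X)^{-1}.
    When it is supplied, the inverse is not recomputed.
    """
    if bread_inv is None:
        k = X.shape[1]
        bread_inv = np.linalg.solve(X.T @ X, np.eye(k))
    # Leverage h_ii = x_i' (X'X)^{-1} x_i, computed row by row.
    leverage = np.einsum("ij,jk,ik->i", X, bread_inv, X)
    # HC2 rescales each residual by 1 / sqrt(1 - h_ii).
    scaled = resid / np.sqrt(1.0 - leverage)
    scores = X * scaled[:, None]
    meat = scores.T @ scores
    return bread_inv @ meat @ bread_inv
```

A caller that has already inverted X'X for the point estimates would pass that matrix as bread_inv and get the same variance matrix as the default path.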

Testing/Docs

| Issue | Location | PR | Priority |
|---|---|---|---|
| R comparison tests spawn separate Rscript per test (slow CI) | tests/test_methodology_twfe.py:294 | #139 | Low |
| CS R helpers hard-code xformla = ~ 1; no covariate-adjusted R benchmark for IRLS path | tests/test_methodology_callaway.py | #202 | Low |
| Doc-snippet smoke tests only cover .rst files; .txt AI guides outside CI validation | tests/test_doc_snippets.py | #239 | Low |
| Add CI validation for docs/doc-deps.yaml integrity (stale paths, unmapped source files) | docs/doc-deps.yaml | #269 | Low |
| SyntheticDiD: rename internal placebo_effects variable to variance_effects (or resampled_effects). Misleading name across the placebo/bootstrap/jackknife dispatch paths: it holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve placebo_effects as a deprecated alias for one release. | synthetic_did.py, results.py | follow-up | Medium |
| AI review CI: pin workflow contract via test (uses openai/codex-action@v1, passes prompt-file, reads steps.run_codex.outputs.final-message, preserves diff-exclude paths and comment markers). Currently only the wrapper-tag and closing-tag-escape strings are asserted. | tests/test_openai_review.py, .github/workflows/ai_pr_review.yml | #416 | Low |
| TestWorkflowDoesNotExecutePRHeadCode (CodeQL #14 dismissal guard) does not model: bash <script> / sh <script> / ./<script> / source <script> / . <script> direct shell-script execution; multi-line python3 -c bodies (line-by-line shlex can't reassemble across newlines; the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (SCRIPT="$X"; python3 "$SCRIPT"); eval; find -exec; xargs -I {}. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists, which is significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in .github/workflows/ai_pr_review.yml comment block. | tests/test_openai_review.py, .github/workflows/ai_pr_review.yml | #436 | Low |
| Render docs/methodology/REPORTING.md and docs/methodology/REGISTRY.md as in-site Sphinx pages so cross-references can use :doc: instead of off-site GitHub blob/main URLs. Current state (#410 fix-audit-r2) restores navigable links via blob/main, but stable-docs readers can land on a different revision than the package version they are reading. Two viable paths: (a) add myst-parser to docs/conf.py extensions + docs extras and link with :doc:, or (b) convert both files to .rst. | docs/conf.py, docs/api/business_report.rst, docs/api/diagnostic_report.rst, docs/tutorials/18_geo_experiments.ipynb, docs/tutorials/19_dcdh_marketing_pulse.ipynb | follow-up | Low |

Prioritized Tech-Debt Backlog

Ordered paydown view across the tables above. Tier A → D is by effort × risk, not severity — every item here already carries its own Low / Medium priority in the source-of-truth tables. The intent is to give a flat ordering to draw from wave-by-wave without re-litigating priority each time. Anchors point to the location reference of the originating row.

Tier A — Quick wins (≤1 day, ≤3 CI rounds expected)

  • WooldridgeDiD: optional efficiency hint when method/outcome pairing is sub-optimal (NOT a canonical-link violation per W2023 Prop 3.1 — see Methodology/Correctness row for the corrected framing)

(SyntheticDiD placebo_effects → variance_effects rename moved to Tier B: the user-facing field rename + one-release deprecation alias is too large for ≤1 day / ≤3 CI rounds.)

Tier B — Mid-size methodology (5-10 CI rounds expected, per memory cascade priors)

  • Thread vcov_type through 8 standalone estimators: CallawaySantAnna, SunAbraham, ImputationDiD, TwoStageDiD, TripleDifference, StackedDiD, WooldridgeDiD, EfficientDiD (none currently expose self.vcov_type)
  • SyntheticDiD: rename both the internal placebo_effects variable and the public placebo_effects field to variance_effects, retaining placebo_effects as a deprecated alias for one release (synthetic_did.py, results.py)
  • StaggeredTripleDifference R parity: commit CSV fixtures + add covariate-adjusted scenarios + aggregation-SE assertions (tests/test_methodology_staggered_triple_diff.py, benchmarks/R/benchmark_staggered_triplediff.R)
  • StaggeredTripleDifference: per-cohort group-effect SE WIF override for exact R triplediff match (staggered_triple_diff.py)
  • WooldridgeDiD: QMLE Stata-parity qmle weight type + Stata golden values (wooldridge.py, linalg.py, tests/test_wooldridge.py)
  • WooldridgeDiD: optional weights="cohort_share" on aggregate() (wooldridge_results.py)
  • HAD survey-design API consolidation: drop deprecated survey=/weights= kwargs (had.py, had_pretests.py; gated on next minor bump)
  • Survey-design resolution / collapse helper extraction across continuous_did.py, efficient_did.py, stacked_did.py
  • dCDH survey + backward-horizon predict_het allocator derivation: lift the warn-and-skip fallback at _compute_heterogeneity_test once the pre-period Binder TSL cell-period allocator is derived (currently the gate emits a UserWarning and falls back to forward-horizon-only heterogeneity under survey_design + placebo + heterogeneity) (chaisemartin_dhaultfoeuille.py, docs/methodology/REGISTRY.md)
  • Rust local-method solver path unification to solve_wls_svd + bootstrap-weight RNG parity audit (rust/src/trop.rs, rust/src/bootstrap.rs)
  • AI review CI workflow-contract pin test expansion (tests/test_openai_review.py)
  • In-site Sphinx render of REPORTING.md and REGISTRY.md (docs/conf.py + :doc: link migration)
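
For the SyntheticDiD rename item in this tier, one common way to keep an old field readable for a release is a property that forwards to the new name and emits a DeprecationWarning. The sketch below is illustrative only: ResultsSketch is a hypothetical stand-in, not the actual results class, and the real field types may differ.

```python
import warnings
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResultsSketch:
    """Hypothetical results container using the new field name."""

    variance_effects: List[float] = field(default_factory=list)

    @property
    def placebo_effects(self) -> List[float]:
        # Deprecated alias: forwards to the renamed field and warns once per access.
        warnings.warn(
            "placebo_effects is deprecated; use variance_effects instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.variance_effects
```

Because the alias is read-only, code that assigns to placebo_effects would fail loudly rather than silently diverge from variance_effects; whether that is acceptable for one release is a design choice left to the rename PR.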

Tier C — Heavy / derivation required

  • HonestDiD Δ^RM ARP conditional/hybrid confidence sets (honest_did.py)
  • Weighted one-way Bell-McCaffrey + weighted CR2 Bell-McCaffrey + HC2/CR2 on absorbed-FE (linalg derivations + R parity harness) (linalg.py, estimators.py::DifferenceInDifferences.fit, estimators.py::MultiPeriodDiD.fit, twfe.py::fit)
  • Multi-absorb weighted demeaning: alternating-projection iteration for N>1 absorb + weights (estimators.py)
  • ImputationDiD dense (A0'A0).toarray() OOM: alternative dense fallback or richer sparse strategy (imputation.py:1531)
  • HAD mass-point vcov_type ∈ {hc2, hc2_bm}: 2SLS-specific leverage derivation (had.py::_fit_mass_point_2sls)
  • HAD repeated-cross-section identification path (had.py::_validate_had_panel)
  • HAD time-varying-dose event study estimator (had.py::_validate_had_panel_event_study)
  • Conley + survey_design (linalg.py::_validate_vcov_args, conley.py)
  • SyntheticDiD vcov_type="conley" (synthetic_did.py::SyntheticDiD — new analytical sandwich path OR spatial-block bootstrap)

Tier D — Deferred / research (no active action planned)

  • HAD survey-aware support-endpoint test (had_pretests.py::qug_test; waits on literature — endpoint EVT × survey-aware functional CLT)
  • HAD joint cross-horizon analytical covariance / unweighted event-study sup-t band (low user demand)
  • HAD Phase 4.5 replicate-weight pretests (BRR/Fay/JK1/JKn/SDR composition derivation)
  • HAD Stute family Stata-bridge parity (no R Stutetest package exists publicly)
  • HAD trends_lin × survey_design weighted-slope derivation
  • Phase 1c lprobust follow-ups (vce modes, weights, multi-eval grid, clustered-DGP auto-bandwidth) — deferred to Phase 2+ of bias_corrected_local_linear
  • TestWorkflowDoesNotExecutePRHeadCode (CodeQL #14) residual bypass paths — diminishing return given documented threat model
  • All remaining Low-priority Performance and Testing/Docs rows (R-script-per-test, CS R covariate-adjusted IRLS benchmark, doc-deps integrity CI, Rust faer SVD overhead, etc.)

Standard Error Consistency

vcov_type has subsumed the previously proposed se_type knob. DifferenceInDifferences and TwoWayFixedEffects accept vcov_type ∈ {"classical", "hc1", "hc2", "hc2_bm", "conley"} (the validated set in linalg.py::_VALID_VCOV_TYPES). Cluster-robust variance is obtained by passing cluster= alongside the heteroscedasticity kind: hc1 + cluster ⇒ CR1 Liang-Zeger; hc2_bm + cluster ⇒ CR2 Bell-McCaffrey, gated by the open weighted-CR2 / absorbed-FE rows in the table above. Wild cluster bootstrap is a separate inference="wild_bootstrap" path on the same estimator. Threading vcov_type through the 8 standalone estimators (CallawaySantAnna, SunAbraham, ImputationDiD, TwoStageDiD, TripleDifference, StackedDiD, WooldridgeDiD, EfficientDiD) remains open and is tracked as a single methodology row in the table above (Phase 1a row).
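
The vcov_type and cluster combinations named above can be summarized as a small lookup. This is a sketch for orientation, not library code: describe_variance and both constants are hypothetical names, and only the two clustered combinations stated in this section are mapped. Other combinations are reported as undocumented rather than guessed.

```python
from typing import Optional

# The five values listed in this section.
VALID_VCOV_TYPES = ("classical", "hc1", "hc2", "hc2_bm", "conley")

# Only the clustered combinations this section documents.
CLUSTERED_LABELS = {
    "hc1": "CR1 (Liang-Zeger)",
    "hc2_bm": "CR2 (Bell-McCaffrey)",
}


def describe_variance(vcov_type: str, cluster: Optional[str] = None) -> str:
    """Return a human-readable label for a vcov_type / cluster pairing."""
    if vcov_type not in VALID_VCOV_TYPES:
        raise ValueError(f"unknown vcov_type: {vcov_type!r}")
    if cluster is None:
        return vcov_type
    # Combinations not described in this section are flagged, not invented.
    return CLUSTERED_LABELS.get(vcov_type, f"{vcov_type} + cluster (not documented here)")
```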

Type Annotations

Mypy reports 0 errors. All mixin attr-defined errors were resolved via TYPE_CHECKING-guarded method stubs in the bootstrap mixin classes.
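
The stub pattern mentioned above looks roughly like the following. Class and method names here are hypothetical and chosen for illustration; the real mixins declare whichever host-class methods they call. The stub exists only for the type checker, so at runtime the mixin gains no extra attribute and the host class supplies the real method.

```python
from typing import TYPE_CHECKING, List


class BootstrapMixinSketch:
    """Mixin that calls a method implemented by the host estimator class."""

    if TYPE_CHECKING:
        # Seen only by the type checker; avoids an attr-defined error below.
        def _point_estimate(self, values: List[float]) -> float: ...

    def mean_of_estimates(self, samples: List[List[float]]) -> float:
        estimates = [self._point_estimate(s) for s in samples]
        return sum(estimates) / len(estimates)


class EstimatorSketch(BootstrapMixinSketch):
    """Host class providing the method the mixin relies on."""

    def _point_estimate(self, values: List[float]) -> float:
        return sum(values) / len(values)
```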

Deprecated Code

Deprecated parameters still present for backward compatibility:

  • lambda_reg and zeta in SyntheticDiD (synthetic_did.py)
    • Deprecated in favor of zeta_omega/zeta_lambda parameters
    • Remove in v4.0.0 (SemVer-safe: public kwarg removal requires a major bump)

Test Coverage

Visualization tests skip when matplotlib / plotly are not installed (see pytest.importorskip markers in tests/test_visualization*.py).


Honest DiD Improvements

Enhancements for honest_did.py:

  • Improved C-LF implementation with direct optimization instead of grid search (current implementation uses simplified FLCI approach with estimation uncertainty adjustment; see honest_did.py:947)
  • Support for CallawaySantAnnaResults (implemented in honest_did.py:612-653; requires aggregate='event_study' when calling CallawaySantAnna.fit())
  • Event-study-specific bounds for each post-period
  • Hybrid inference methods
  • Simulation-based power analysis for honest bounds

CallawaySantAnna Bootstrap Improvements

  • Consider aligning p-value computation with R did package (symmetric percentile method)

RuntimeWarnings in Linear Algebra Operations

Apple Silicon M4 BLAS Bug (numpy < 2.3)

Spurious RuntimeWarnings ("divide by zero", "overflow", "invalid value") are emitted by np.matmul/@ on Apple Silicon M4 + macOS Sequoia with numpy < 2.3. The warnings appear for matrices with ≥260 rows but do not affect result correctness — coefficients and fitted values are valid (no NaN/Inf), and the design matrices are full rank.

Root cause: Apple's BLAS SME (Scalable Matrix Extension) kernels corrupt the floating-point status register, causing spurious FPE signals. Tracked in numpy#28687 and numpy#29820. Fixed in numpy ≥ 2.3 via PR #29223.

Not reproducible on M3, Intel, or Linux.

  • linalg.py:162 - Warnings in fitted value computation (X @ coefficients)

    • Caused by M4 BLAS bug, not extreme coefficient values
    • Seen in test_prep.py during treatment effect recovery tests (n > 260)
  • triple_diff.py:307,323 - Warnings in propensity score computation

    • Occurs in IPW and DR estimation methods with covariates
    • Related to logistic regression overflow in edge cases (separate from BLAS bug)
  • Long-term: Revert to @ operator when numpy ≥ 2.3 becomes the minimum supported version.
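
Since the spurious warnings do not affect results, one generic way to quiet them is to silence NumPy's floating-point error reporting around the product and then check the output explicitly. This sketch is an assumption about a possible approach, not a description of the workaround currently used in linalg.py, and the helper name is hypothetical.

```python
import numpy as np


def matmul_ignoring_fp_warnings(X, coef):
    """Compute X @ coef without emitting floating-point RuntimeWarnings.

    The explicit finiteness check replaces the warning as the safety net,
    so genuinely bad values still surface as an error.
    """
    with np.errstate(divide="ignore", over="ignore", invalid="ignore"):
        fitted = X @ coef
    if not np.all(np.isfinite(fitted)):
        raise FloatingPointError("non-finite values in fitted output")
    return fitted
```

The trade-off is that real overflow in the product no longer warns; the finiteness check catches the cases that end in NaN or Inf, but not intermediate precision loss.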


Feature Gaps (from R did package comparison)

Features in R's did package that block porting additional tests:

| Feature | R tests blocked | Priority | Status |
|---|---|---|---|
| Calendar time aggregation | 1 test in test-att_gt.R | Low | |

Performance Optimizations

Potential future optimizations:

  • JIT compilation for bootstrap loops (numba)
  • Sparse matrix handling for large fixed effects

QR+SVD Redundancy in Rank Detection

Background: The current solve_ols() implementation performs both QR (for rank detection) and SVD (for solving) decompositions on rank-deficient matrices. This is technically redundant since SVD can determine rank directly.

Current approach (R-style, chosen for robustness):

  1. QR with pivoting for rank detection (_detect_rank_deficiency())
  2. scipy's lstsq with 'gelsd' driver (SVD-based) for solving

Why we use QR for rank detection:

  • QR with pivoting provides the canonical ordering of linearly dependent columns
  • R's lm() uses this approach for consistent dropped-column reporting
  • Ensures consistent column dropping across runs (SVD column selection can vary)
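
The two-step approach described above can be sketched with SciPy as follows. This is a simplified illustration, not the body of solve_ols() or _detect_rank_deficiency(): the function names, the relative tolerance of 1e-07, and the NaN convention for dropped coefficients are all assumptions made for the example.

```python
import numpy as np
from scipy.linalg import lstsq, qr


def detect_dropped_columns(X, rtol=1e-07):
    """Return (rank, sorted indices of dropped columns) from pivoted QR."""
    _, R, pivot = qr(X, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    if diag.size == 0 or diag[0] == 0.0:
        return 0, np.sort(pivot)
    # Pivoting orders |R_ii| from largest to smallest; small trailing
    # entries mark columns that depend linearly on earlier pivots.
    rank = int(np.sum(diag > rtol * diag[0]))
    return rank, np.sort(pivot[rank:])


def solve_ols_sketch(X, y):
    """Drop dependent columns via QR, then solve with the SVD-based driver."""
    _, dropped = detect_dropped_columns(X)
    kept = np.setdiff1d(np.arange(X.shape[1]), dropped)
    coef = np.full(X.shape[1], np.nan)  # assumed convention: NaN for dropped columns
    solution, _, _, _ = lstsq(X[:, kept], y, lapack_driver="gelsd")
    coef[kept] = solution
    return coef, dropped
```

The pivot vector is what gives a reproducible answer to the question of which column was dropped; an SVD alone reports the rank but not a canonical choice of columns.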

Potential optimization (future work):

  • Skip QR when rank_deficient_action="silent" since we don't need column names
  • Use SVD rank directly in the Rust backend (already implemented)
  • Add skip_rank_check parameter for hot paths where matrix is known to be full-rank (implemented in v2.2.0)

Priority: Low - the QR overhead is minimal compared to SVD solve, and correctness is more important than micro-optimization.

Incomplete check_finite Bypass

Background: The solve_ols() function accepts a check_finite=False parameter intended to skip NaN/Inf validation for performance in hot paths where data is known to be clean.

Current limitation: When check_finite=False, our explicit validation is skipped, but scipy's internal QR decomposition in _detect_rank_deficiency() still validates finite values. This means callers cannot fully bypass all finite checks.

Impact: Minimal - the scipy check is fast and only affects edge cases where users explicitly pass check_finite=False with non-finite data (which would be a bug in their code anyway).

Potential fix (future work):

  • Pass check_finite=False through to scipy's QR call (requires scipy >= 1.9.0)
  • Or skip _detect_rank_deficiency() entirely when check_finite=False and _skip_rank_check=True
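
The first option amounts to forwarding the caller's flag to SciPy, which accepts check_finite on scipy.linalg.qr. The sketch below shows the idea on a stand-alone rank helper; the function name and tolerance are hypothetical, and it is not the project's _detect_rank_deficiency().

```python
import numpy as np
from scipy.linalg import qr


def rank_via_pivoted_qr(X, check_finite=True, rtol=1e-07):
    """Numerical rank from pivoted QR, forwarding check_finite to SciPy.

    With check_finite=False, SciPy skips its NaN/Inf scan; the caller is
    then responsible for passing clean data.
    """
    _, R, _ = qr(X, mode="economic", pivoting=True, check_finite=check_finite)
    diag = np.abs(np.diag(R))
    if diag.size == 0 or diag[0] == 0.0:
        return 0
    return int(np.sum(diag > rtol * diag[0]))
```

With the default check_finite=True, SciPy raises ValueError on NaN or Inf input, which is the residual validation the limitation above describes.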

Priority: Low - this is an edge case optimization that doesn't affect correctness.