Skip to content

Commit 62b663b

Browse files
authored
Merge pull request #163 from igerber/address-todo-items
Address TODO items: safe_inference migration, module splits, np.dot, and bug fixes
2 parents a39e0f2 + 089117f commit 62b663b

24 files changed

Lines changed: 2166 additions & 2094 deletions

CLAUDE.md

Lines changed: 23 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -103,22 +103,35 @@ cross-platform compilation - no OpenBLAS or Intel MKL installation required.
103103

104104
- **`diff_diff/imputation.py`** - Borusyak-Jaravel-Spiess imputation DiD estimator:
105105
- `ImputationDiD` - Borusyak et al. (2024) efficient imputation estimator for staggered DiD
106-
- `ImputationDiDResults` - Results with overall ATT, event study, group effects, pre-trend test
107-
- `ImputationBootstrapResults` - Multiplier bootstrap inference results
108106
- `imputation_did()` - Convenience function
109107
- Steps: (1) OLS on untreated obs for unit+time FE, (2) impute counterfactual Y(0), (3) aggregate
110108
- Conservative variance (Theorem 3) with `aux_partition` parameter for SE tightness
111109
- Pre-trend test (Equation 9) via `results.pretrend_test()`
112110
- Proposition 5: NaN for unidentified long-run horizons without never-treated units
111+
- Re-exports result and bootstrap classes for backward compatibility
112+
113+
- **`diff_diff/imputation_results.py`** - Result container classes:
114+
- `ImputationBootstrapResults` - Multiplier bootstrap inference results
115+
- `ImputationDiDResults` - Results with overall ATT, event study, group effects, pre-trend test
116+
117+
- **`diff_diff/imputation_bootstrap.py`** - Bootstrap inference:
118+
- `ImputationDiDBootstrapMixin` - Mixin with multiplier bootstrap methods
119+
- Cluster-level influence function sums (Theorem 3) with Rademacher weights
113120

114121
- **`diff_diff/two_stage.py`** - Gardner (2022) Two-Stage DiD estimator:
115122
- `TwoStageDiD` - Two-stage estimator: (1) estimate unit+time FE on untreated obs, (2) regress residualized outcomes on treatment indicators
116-
- `TwoStageDiDResults` - Results with overall ATT, event study, group effects, per-observation treatment effects
117-
- `TwoStageBootstrapResults` - Multiplier bootstrap inference on GMM influence function
118123
- `two_stage_did()` - Convenience function
119124
- Point estimates identical to ImputationDiD; different variance estimator (GMM sandwich vs. conservative)
120125
- Custom `_compute_gmm_variance()` — cannot reuse `compute_robust_vcov()` because correction term uses GLOBAL cross-moment
121126
- No finite-sample adjustments (raw asymptotic sandwich, matching R `did2s`)
127+
- Re-exports result and bootstrap classes for backward compatibility
128+
129+
- **`diff_diff/two_stage_results.py`** - Result container classes:
130+
- `TwoStageBootstrapResults` - Multiplier bootstrap inference on GMM influence function
131+
- `TwoStageDiDResults` - Results with overall ATT, event study, group effects, per-observation treatment effects
132+
133+
- **`diff_diff/two_stage_bootstrap.py`** - Bootstrap inference:
134+
- `TwoStageDiDBootstrapMixin` - Mixin with GMM influence function bootstrap methods
122135

123136
- **`diff_diff/triple_diff.py`** - Triple Difference (DDD) estimator:
124137
- `TripleDifference` - Ortiz-Villavicencio & Sant'Anna (2025) estimator for DDD designs
@@ -129,14 +142,19 @@ cross-platform compilation - no OpenBLAS or Intel MKL installation required.
129142

130143
- **`diff_diff/trop.py`** - Triply Robust Panel (TROP) estimator (v2.1.0):
131144
- `TROP` - Athey, Imbens, Qu & Viviano (2025) estimator with factor model adjustment
132-
- `TROPResults` - Results with ATT, factors, loadings, unit/time weights
133145
- `trop()` - Convenience function for quick estimation
134146
- Three robustness components: factor adjustment, unit weights, time weights
135147
- Two estimation methods via `method` parameter:
136148
- `"twostep"` (default): Per-observation model fitting (Algorithm 2 of paper)
137149
- `"joint"`: Weighted least squares with homogeneous treatment effect (faster)
138150
- Automatic rank selection via cross-validation, information criterion, or elbow detection
139151
- Bootstrap and placebo-based variance estimation
152+
- Re-exports result classes for backward compatibility
153+
154+
- **`diff_diff/trop_results.py`** - Result container classes:
155+
- `_LAMBDA_INF` - Sentinel value for disabled factor model (λ_nn=∞)
156+
- `_PrecomputedStructures` - TypedDict for cached matrices
157+
- `TROPResults` - Results with ATT, factors, loadings, unit/time weights
140158

141159
- **`diff_diff/bacon.py`** - Goodman-Bacon decomposition for TWFE diagnostics:
142160
- `BaconDecomposition` - Decompose TWFE into weighted 2x2 comparisons (Goodman-Bacon 2021)

TODO.md

Lines changed: 16 additions & 64 deletions
Original file line numberDiff line numberDiff line change
@@ -25,10 +25,10 @@ Target: < 1000 lines per module for maintainability.
2525
|------|-------|--------|
2626
| ~~`staggered.py`~~ | ~~2301~~ 1120 | ✅ Split into staggered.py, staggered_bootstrap.py, staggered_aggregation.py, staggered_results.py |
2727
| ~~`prep.py`~~ | ~~1993~~ 1241 | ✅ Split: DGP functions moved to `prep_dgp.py` (777 lines) |
28-
| `trop.py` | 2904 | **Needs split** -- nearly 3x the 1000-line target |
29-
| `imputation.py` | 2480 | **Needs split** -- results, bootstrap, aggregation like staggered |
30-
| `two_stage.py` | 2209 | **Needs split** -- same pattern as imputation |
31-
| `utils.py` | 1879 | **Needs split** -- legacy placebo functions could move to diagnostics |
28+
| ~~`trop.py`~~ | ~~2904~~ ~2560 | ✅ Partially split: results extracted to `trop_results.py` (~340 lines) |
29+
| ~~`imputation.py`~~ | ~~2480~~ ~1740 | ✅ Split into imputation.py, imputation_results.py, imputation_bootstrap.py |
30+
| ~~`two_stage.py`~~ | ~~2209~~ ~1490 | ✅ Split into two_stage.py, two_stage_results.py, two_stage_bootstrap.py |
31+
| `utils.py` | 1879 | Monitor -- legacy placebo functions stay to avoid circular imports |
3232
| `visualization.py` | 1678 | Monitor -- growing but cohesive |
3333
| `linalg.py` | 1537 | Monitor -- unified backend, splitting would hurt cohesion |
3434
| `honest_did.py` | 1511 | Acceptable |
@@ -42,49 +42,11 @@ Target: < 1000 lines per module for maintainability.
4242

4343
All 7 t_stat locations fixed (diagnostics.py, sun_abraham.py, triple_diff.py) -- all now use `np.nan` or `np.isfinite()` guards. Fixed in PR #118 and follow-up PRs.
4444

45-
**Remaining nuance**: `diagnostics.py:785` still has `se = ... else 0.0` for the SE variable itself (not t_stat). The downstream t_stat line correctly returns `np.nan`, so inference is safe, but the SE value of 0.0 is technically incorrect for an undefined SE.
45+
~~**Remaining nuance**: `diagnostics.py:785` SE = 0.0~~ — ✅ Fixed: SE now returns `np.nan` when undefined, and all downstream inference uses `safe_inference()`.
4646

47-
### Migrate Existing Inference Call Sites to `safe_inference()`
47+
### ~~Migrate Existing Inference Call Sites to `safe_inference()`~~ -- DONE
4848

49-
`safe_inference()` was added to `diff_diff/utils.py` to compute t_stat, p_value, and CI together with a NaN gate at the top. It is now the prescribed pattern for all new code (see CLAUDE.md design pattern #7). However, ~26 existing inline inference computations across 12 files have **not** been migrated yet.
50-
51-
**Files with inline inference to migrate:**
52-
53-
| File | Approx. Locations |
54-
|------|-------------------|
55-
| `estimators.py` | 2 (lines 1038, 1089) |
56-
| `sun_abraham.py` | 4 (lines 621, 644, 661, 905) |
57-
| `staggered.py` | 6 (lines 696, 725, 763, 777, 792, 806) |
58-
| `staggered_aggregation.py` | 2 (lines 407, 479) |
59-
| `triple_diff.py` | 1 (line 601) |
60-
| `imputation.py` | 2 (lines 1806, 1927) |
61-
| `two_stage.py` | 2 (lines 1325, 1431) |
62-
| `diagnostics.py` | 2 (lines 665, 786) |
63-
| `synthetic_did.py` | 1 (line 426) |
64-
| `trop.py` | 2 (lines 1474, 2054) |
65-
| `utils.py` | 1 (line 641) |
66-
| `linalg.py` | 1 (line 1310) |
67-
68-
**How to find them:**
69-
```bash
70-
grep -En "(t_stat|overall_t)\s*=\s*[^#]*/|\[.t_stat.\]\s*=\s*[^#]*/" diff_diff/*.py | grep -v "def safe_inference"
71-
```
72-
73-
**Note**: This command has one false positive (`utils.py:178`, inside the `safe_inference()` body) and misses multi-line expressions (e.g., `sun_abraham.py:660-661`). The table above is the authoritative list.
74-
75-
**Migration pattern:**
76-
```python
77-
# Before (inline, error-prone)
78-
t_stat = effect / se if se > 0 else 0.0
79-
p_value = compute_p_value(t_stat)
80-
ci = compute_confidence_interval(effect, se, alpha)
81-
82-
# After (NaN-safe, consistent)
83-
from diff_diff.utils import safe_inference
84-
t_stat, p_value, ci = safe_inference(effect, se, alpha=alpha, df=df)
85-
```
86-
87-
**Priority**: Medium — the NaN-handling table above covers the worst cases (those using `0.0`). The remaining sites may use partial guards but should still be migrated for consistency and to prevent regressions.
49+
✅ All ~32 inline inference call sites migrated to `safe_inference()` across 11 source files: `estimators.py`, `sun_abraham.py`, `staggered.py`, `staggered_aggregation.py`, `triple_diff.py`, `imputation.py`, `two_stage.py`, `diagnostics.py`, `synthetic_did.py`, `trop.py`, `utils.py`. Two sites left as-is with comments: `diagnostics.py:665` (permutation-based p_value) and `linalg.py:1310` (deliberately uses ±inf for zero-SE).
8850

8951
---
9052

@@ -96,17 +58,17 @@ Deferred items from PR reviews that were not addressed before merge.
9658

9759
| Issue | Location | PR | Priority |
9860
|-------|----------|----|----------|
99-
| TwoStageDiD & ImputationDiD bootstrap hardcodes Rademacher only; no `bootstrap_weights` parameter unlike CallawaySantAnna | `two_stage.py:1860`, `imputation.py:2363` | #156, #141 | Medium |
100-
| TwoStageDiD GMM score logic duplicated between analytic/bootstrap with inconsistent NaN/overflow handling | `two_stage.py:1454-1784` | #156 | Medium |
101-
| ImputationDiD weight construction duplicated between aggregation and bootstrap (drift risk) -- has explicit code comment acknowledging duplication | `imputation.py:1777-1786`, `imputation.py:2216-2221` | #141 | Medium |
102-
| ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py:1564` | #141 | Medium |
61+
| TwoStageDiD & ImputationDiD bootstrap hardcodes Rademacher only; no `bootstrap_weights` parameter unlike CallawaySantAnna | `two_stage_bootstrap.py`, `imputation_bootstrap.py` | #156, #141 | Medium |
62+
| TwoStageDiD GMM score logic duplicated between analytic/bootstrap with inconsistent NaN/overflow handling | `two_stage.py`, `two_stage_bootstrap.py` | #156 | Medium |
63+
| ImputationDiD weight construction duplicated between aggregation and bootstrap (drift risk) -- has explicit code comment acknowledging duplication | `imputation.py`, `imputation_bootstrap.py` | #141 | Medium |
64+
| ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium |
10365

10466
#### Performance
10567

10668
| Issue | Location | PR | Priority |
10769
|-------|----------|----|----------|
108-
| TwoStageDiD per-column `.toarray()` in loop for cluster scores | `two_stage.py:1766-1767` | #156 | Medium |
109-
| ImputationDiD event-study SEs recompute full conservative variance per horizon (should cache A0/A1 factorization) | `imputation.py:1772-1804` | #141 | Low |
70+
| TwoStageDiD per-column `.toarray()` in loop for cluster scores | `two_stage_bootstrap.py` | #156 | Medium |
71+
| ImputationDiD event-study SEs recompute full conservative variance per horizon (should cache A0/A1 factorization) | `imputation.py` | #141 | Low |
11072
| Legacy `compute_placebo_effects` uses deprecated projected-gradient weights (marked deprecated, users directed to `SyntheticDiD`) | `utils.py:1689-1691` | #145 | Low |
11173
| Rust faer SVD ndarray-to-faer conversion overhead (minimal vs SVD cost) | `rust/src/linalg.rs:67` | #115 | Low |
11274

@@ -115,7 +77,7 @@ Deferred items from PR reviews that were not addressed before merge.
11577
| Issue | Location | PR | Priority |
11678
|-------|----------|----|----------|
11779
| Tutorial notebooks not executed in CI | `docs/tutorials/*.ipynb` | #159 | Low |
118-
| TwoStageDiD `test_nan_propagation` is a no-op (only `pass`) | `tests/test_two_stage.py:643-652` | #156 | Low |
80+
| ~~TwoStageDiD `test_nan_propagation` is a no-op~~ | ~~`tests/test_two_stage.py:643-652`~~ | ~~#156~~ | ✅ Fixed |
11981
| ImputationDiD bootstrap + covariate path untested | `tests/test_imputation.py` | #141 | Low |
12082
| TROP `n_bootstrap >= 2` validation missing (can yield 0/NaN SE silently) | `trop.py:462` | #124 | Low |
12183
| SunAbraham deprecated `min_pre_periods`/`min_post_periods` still in `fit()` docstring | `sun_abraham.py:458-487` | #153 | Low |
@@ -209,19 +171,9 @@ Spurious RuntimeWarnings ("divide by zero", "overflow", "invalid value") are emi
209171
- Occurs in IPW and DR estimation methods with covariates
210172
- Related to logistic regression overflow in edge cases (separate from BLAS bug)
211173

212-
### Fix Plan (follow-up PR)
213-
214-
Replace `@` operator with `np.dot()` at affected call sites. `np.dot()` bypasses the ufunc FPE dispatch layer and produces identical results with zero spurious warnings on M4.
215-
216-
**Affected files and lines:**
217-
- `linalg.py`: 332, 690, 704, 829, 1463
218-
- `staggered.py`: 78, 85, 102, 860, 1024-1025
219-
- `triple_diff.py`: 301, 307, 323
220-
- `utils.py`: 515
221-
- `imputation.py`: 1253, 1301, 1602, 1662
222-
- `trop.py`: 1098
174+
### ~~Fix Plan (follow-up PR)~~ -- DONE
223175

224-
**Regression test:** Assert no RuntimeWarnings from `solve_ols()` with n ≥ 500 on all platforms.
176+
✅ Replaced `@` operator with `np.dot()` at all 19 affected call sites across 6 files: `linalg.py` (5), `staggered.py` (5), `triple_diff.py` (3), `utils.py` (1), `imputation.py` (4), `trop.py` (1). Regression test added in `test_linalg.py::TestNoDotRuntimeWarnings`.
225177

226178
**Long-term:** Revert to `@` operator when numpy ≥ 2.3 becomes the minimum supported version.
227179

diff_diff/diagnostics.py

Lines changed: 4 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@
1919

2020
from diff_diff.estimators import DifferenceInDifferences
2121
from diff_diff.results import _get_significance_stars
22-
from diff_diff.utils import compute_confidence_interval, compute_p_value
22+
from diff_diff.utils import safe_inference
2323

2424

2525
@dataclass
@@ -661,7 +661,7 @@ def permutation_test(
661661
ci_lower = np.percentile(valid_effects, alpha / 2 * 100)
662662
ci_upper = np.percentile(valid_effects, (1 - alpha / 2) * 100)
663663

664-
# T-stat from original estimate
664+
# NOTE: Not using safe_inference — p_value is permutation-based, CI is percentile-based.
665665
t_stat = original_att / se if np.isfinite(se) and se > 0 else np.nan
666666

667667
return PlaceboTestResults(
@@ -782,15 +782,9 @@ def leave_one_out_test(
782782

783783
# Statistics of LOO distribution
784784
mean_effect = np.mean(valid_effects)
785-
se = np.std(valid_effects, ddof=1) if len(valid_effects) > 1 else 0.0
786-
t_stat = mean_effect / se if np.isfinite(se) and se > 0 else np.nan
787-
788-
# Use t-distribution for p-value
785+
se = np.std(valid_effects, ddof=1) if len(valid_effects) > 1 else np.nan
789786
df = len(valid_effects) - 1 if len(valid_effects) > 1 else 1
790-
p_value = compute_p_value(t_stat, df=df)
791-
792-
# CI
793-
conf_int = compute_confidence_interval(mean_effect, se, alpha, df=df) if np.isfinite(se) and se > 0 else (np.nan, np.nan)
787+
t_stat, p_value, conf_int = safe_inference(mean_effect, se, alpha=alpha, df=df)
794788

795789
return PlaceboTestResults(
796790
test_type="leave_one_out",

diff_diff/estimators.py

Lines changed: 5 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -27,9 +27,8 @@
2727
from diff_diff.results import DiDResults, MultiPeriodDiDResults, PeriodEffect
2828
from diff_diff.utils import (
2929
WildBootstrapResults,
30-
compute_confidence_interval,
31-
compute_p_value,
3230
demean_by_group,
31+
safe_inference,
3332
validate_binary,
3433
wild_bootstrap_se,
3534
)
@@ -1034,14 +1033,7 @@ def fit( # type: ignore[override]
10341033
idx = interaction_indices[period]
10351034
effect = coefficients[idx]
10361035
se = np.sqrt(vcov[idx, idx])
1037-
if np.isfinite(se) and se > 0:
1038-
t_stat = effect / se
1039-
p_value = compute_p_value(t_stat, df=df)
1040-
conf_int = compute_confidence_interval(effect, se, self.alpha, df=df)
1041-
else:
1042-
t_stat = np.nan
1043-
p_value = np.nan
1044-
conf_int = (np.nan, np.nan)
1036+
t_stat, p_value, conf_int = safe_inference(effect, se, alpha=self.alpha, df=df)
10451037

10461038
period_effects[period] = PeriodEffect(
10471039
period=period,
@@ -1085,15 +1077,9 @@ def fit( # type: ignore[override]
10851077
avg_conf_int = (np.nan, np.nan)
10861078
else:
10871079
avg_se = float(np.sqrt(avg_var))
1088-
if np.isfinite(avg_se) and avg_se > 0:
1089-
avg_t_stat = avg_att / avg_se
1090-
avg_p_value = compute_p_value(avg_t_stat, df=df)
1091-
avg_conf_int = compute_confidence_interval(avg_att, avg_se, self.alpha, df=df)
1092-
else:
1093-
# Zero SE (degenerate case)
1094-
avg_t_stat = np.nan
1095-
avg_p_value = np.nan
1096-
avg_conf_int = (np.nan, np.nan)
1080+
avg_t_stat, avg_p_value, avg_conf_int = safe_inference(
1081+
avg_att, avg_se, alpha=self.alpha, df=df
1082+
)
10971083

10981084
# Count observations
10991085
n_treated = int(np.sum(d))

0 commit comments

Comments
 (0)