igerber
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 34 additions & 0 deletions
diff --git a/ROADMAP.md b/ROADMAP.md
Lines changed: 18 additions & 15 deletions
diff --git a/diff_diff/__init__.py b/diff_diff/__init__.py
Lines changed: 1 addition & 1 deletion
diff --git a/diff_diff/estimators.py b/diff_diff/estimators.py
Lines changed: 29 additions & 38 deletions
@@ -5,6 +5,39 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.4.0] - 2026-01-11
+
+### Added
+- **Unified linear algebra backend** (`diff_diff/linalg.py`)
+  - `solve_ols()` - Optimized OLS solver using scipy's gelsy LAPACK driver
+  - `compute_robust_vcov()` - Vectorized (clustered) robust variance-covariance
+  - Single optimization point for all estimators; prepares for future Rust backend
+  - New `tests/test_linalg.py` with comprehensive tests
+
+### Changed
+- **Major performance improvements** - All estimators now significantly faster
+  - BasicDiD/TWFE @ 10K: 0.835s → 0.011s (76x faster, now 4.2x faster than R)
+  - CallawaySantAnna @ 10K: 2.234s → 0.109s (20x faster, now 7.2x faster than R)
+  - All results numerically identical to previous versions
+- **CallawaySantAnna optimizations** (`staggered.py`)
+  - Pre-computed wide-format outcome matrix and cohort masks
+  - Vectorized ATT(g,t) computation using numpy operations (23x faster)
+  - Batch bootstrap weight generation
+  - Vectorized multiplier bootstrap using matrix operations (26x faster)
+- **TWFE optimization** (`twfe.py`)
+  - Cached groupby indexes for within-transformation
+- **All estimators migrated** to unified `linalg.py` backend
+  - `estimators.py`, `twfe.py`, `staggered.py`, `triple_diff.py`,
+    `synthetic_did.py`, `sun_abraham.py`, `utils.py`
+
+### Behavioral Changes
+- **Rank-deficient design matrices**: The new `gelsy` LAPACK driver handles
+  rank-deficient matrices gracefully (returning a least-norm solution) rather
+  than raising an explicit error. Previously, `DifferenceInDifferences` would
+  raise `ValueError("Design matrix is rank-deficient")`. Users relying on this
+  error for collinearity detection should validate their design matrices
+  separately. Results remain numerically correct for well-specified models.
+
 ## [1.3.1] - 2026-01-10
 
 ### Added
@@ -282,6 +315,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - `to_dict()` and `to_dataframe()` export methods
   - `is_significant` and `significance_stars` properties
 
+[1.4.0]: https://github.com/igerber/diff-diff/compare/v1.3.1...v1.4.0
 [1.3.1]: https://github.com/igerber/diff-diff/compare/v1.3.0...v1.3.1
 [1.3.0]: https://github.com/igerber/diff-diff/compare/v1.2.1...v1.3.0
 [1.2.1]: https://github.com/igerber/diff-diff/compare/v1.2.0...v1.2.1
 
@@ -6,7 +6,7 @@ For past changes and release history, see [CHANGELOG.md](CHANGELOG.md).
 
 ---
 
-## Current Status (v1.3.1)
+## Current Status (v1.4.0)
 
 diff-diff is a **production-ready** DiD library with feature parity with R's `did` + `HonestDiD` + `synthdid` ecosystem for core DiD analysis:
 
@@ -15,35 +15,38 @@ diff-diff is a **production-ready** DiD library with feature parity with R's `di
 - **Assumption diagnostics**: Parallel trends tests, placebo tests, Goodman-Bacon decomposition
 - **Sensitivity analysis**: Honest DiD (Rambachan-Roth), Pre-trends power analysis (Roth 2022)
 - **Study design**: Power analysis tools
+- **Performance**: Now faster than R at scale (see below)
 
 ---
 
 ## Priority: Performance Improvements
 
-**Status:** Planning complete, implementation pending
+**Status:** ✅ Phase 1 Complete (v1.4.0)
 
-Benchmarks show diff-diff is 3-17x slower than R's fixest for BasicDiD/TWFE at large scales (10K+ units). This is our top priority for v1.4.
+Phase 1 pure Python optimizations exceeded targets. diff-diff now **beats R** at scale:
 
-### Summary
+| Estimator | v1.3 (10K scale) | v1.4 (10K scale) | vs R |
+|-----------|------------------|------------------|------|
+| BasicDiD/TWFE | 0.835s | **0.011s** | **4.2x faster than R** |
+| CallawaySantAnna | 2.234s | **0.109s** | **7.2x faster than R** |
+| SyntheticDiD | Already 37x faster | N/A | **37x faster than R** |
 
-| Estimator | Current (10K scale) | Target | Approach |
-|-----------|---------------------|--------|----------|
-| BasicDiD/TWFE | 0.835s (R: 0.049s) | Match R | Rust backend |
-| CallawaySantAnna | 2.234s (R: 0.816s) | Match R | Vectorization + Rust |
-| SyntheticDiD | Already 37-1600x faster than R | Maintain | N/A |
+### What Was Done (v1.4.0)
 
-### Approach
+1. **Unified `linalg.py` backend** - Single OLS/SE implementation for all estimators
+2. **Vectorized cluster-robust SE** - Eliminated O(n × clusters) loop
+3. **Pre-computed data structures** - Wide-format outcome matrix, cohort masks
+4. **Vectorized bootstrap** - Matrix operations instead of nested loops
 
-1. **Phase 1:** Pure Python optimizations (vectorized cluster SE, scipy lstsq, cached groupby)
-2. **Phase 2:** Rust backend via PyO3 for performance-critical paths (cluster SE, demeaning, bootstrap)
+### Phase 2 (Future)
 
-The Rust backend will be optional with graceful fallback to pure Python.
+A Rust backend remains an option if further optimization is needed, but pure Python now exceeds R's performance on these benchmarks.
 
 **Full details:** [docs/performance-plan.md](docs/performance-plan.md)
 
 ---
 
-## Near-Term Enhancements (v1.4)
+## Near-Term Enhancements (v1.5)
 
 High-value additions building on our existing foundation.
 
@@ -99,7 +102,7 @@ Extend the existing `TripleDifference` estimator to handle staggered adoption se
 
 ---
 
-## Medium-Term Enhancements (v1.5+)
+## Medium-Term Enhancements (v1.6+)
 
 Extending diff-diff to handle more complex settings.
 
 
@@ -103,7 +103,7 @@
     plot_sensitivity,
 )
 
-__version__ = "1.3.1"
+__version__ = "1.4.0"
 __all__ = [
     # Estimators
     "DifferenceInDifferences",
 
@@ -17,12 +17,12 @@
 import numpy as np
 import pandas as pd
 
+from diff_diff.linalg import compute_r_squared, compute_robust_vcov, solve_ols
 from diff_diff.results import DiDResults, MultiPeriodDiDResults, PeriodEffect
 from diff_diff.utils import (
     WildBootstrapResults,
     compute_confidence_interval,
     compute_p_value,
-    compute_robust_se,
     validate_binary,
     wild_bootstrap_se,
 )
@@ -261,8 +261,11 @@ def fit(
                     X = np.column_stack([X, dummies[col].values.astype(float)])
                     var_names.append(col)
 
-        # Fit OLS
-        coefficients, residuals, fitted, r_squared = self._fit_ols(X, y)
+        # Fit OLS using unified backend
+        coefficients, residuals, fitted, vcov = solve_ols(
+            X, y, return_fitted=True, return_vcov=False
+        )
+        r_squared = compute_r_squared(y, residuals)
 
         # Extract ATT (coefficient on interaction term)
         att_idx = 3  # Index of interaction term
@@ -285,13 +288,13 @@ def fit(
             )
         elif self.cluster is not None:
             cluster_ids = data[self.cluster].values
-            vcov = compute_robust_se(X, residuals, cluster_ids)
+            vcov = compute_robust_vcov(X, residuals, cluster_ids)
             se = np.sqrt(vcov[att_idx, att_idx])
             t_stat = att / se
             p_value = compute_p_value(t_stat, df=df)
             conf_int = compute_confidence_interval(att, se, self.alpha, df=df)
         elif self.robust:
-            vcov = compute_robust_se(X, residuals)
+            vcov = compute_robust_vcov(X, residuals)
             se = np.sqrt(vcov[att_idx, att_idx])
             t_stat = att / se
             p_value = compute_p_value(t_stat, df=df)
@@ -300,7 +303,7 @@ def fit(
             # Classical OLS standard errors
             n = len(y)
             k = X.shape[1]
-            mse = np.sum(residuals ** 2) / (n - k)
+            mse = np.sum(residuals**2) / (n - k)
             # Use solve() instead of inv() for numerical stability
             # solve(A, B) computes X where AX=B, so this yields (X'X)^{-1} * mse
             vcov = np.linalg.solve(X.T @ X, mse * np.eye(k))
@@ -352,10 +355,15 @@ def fit(
 
         return self.results_
 
-    def _fit_ols(self, X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float]:
+    def _fit_ols(
+        self, X: np.ndarray, y: np.ndarray
+    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float]:
         """
         Fit OLS regression.
 
+        This method is kept for backwards compatibility. Internally uses the
+        unified solve_ols from diff_diff.linalg for optimized computation.
+
         Parameters
         ----------
         X : np.ndarray
@@ -367,32 +375,12 @@ def _fit_ols(self, X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray
         -------
         tuple
             (coefficients, residuals, fitted_values, r_squared)
-
-        Raises
-        ------
-        ValueError
-            If design matrix is rank-deficient (perfect multicollinearity).
         """
-        # Check for rank deficiency (perfect multicollinearity)
-        rank = np.linalg.matrix_rank(X)
-        if rank < X.shape[1]:
-            raise ValueError(
-                f"Design matrix is rank-deficient (rank {rank} < {X.shape[1]} columns). "
-                "This indicates perfect multicollinearity. Check your fixed effects "
-                "and covariates for linear dependencies."
-            )
-
-        # Solve normal equations: β = (X'X)^(-1) X'y
-        coefficients = np.linalg.lstsq(X, y, rcond=None)[0]
-
-        # Compute fitted values and residuals
-        fitted = X @ coefficients
-        residuals = y - fitted
-
-        # Compute R-squared
-        ss_res = np.sum(residuals ** 2)
-        ss_tot = np.sum((y - np.mean(y)) ** 2)
-        r_squared = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0.0
+        # Use unified OLS backend
+        coefficients, residuals, fitted, _ = solve_ols(
+            X, y, return_fitted=True, return_vcov=False
+        )
+        r_squared = compute_r_squared(y, residuals)
 
         return coefficients, residuals, fitted, r_squared
 
@@ -442,7 +430,7 @@ def _run_wild_bootstrap_inference(
         t_stat = bootstrap_results.t_stat_original
 
         # Also compute vcov for storage (using cluster-robust for consistency)
-        vcov = compute_robust_se(X, residuals, cluster_ids)
+        vcov = compute_robust_vcov(X, residuals, cluster_ids)
 
         return se, p_value, conf_int, t_stat, vcov, bootstrap_results
 
@@ -889,8 +877,11 @@ def fit(  # type: ignore[override]
                     X = np.column_stack([X, dummies[col].values.astype(float)])
                     var_names.append(col)
 
-        # Fit OLS
-        coefficients, residuals, fitted, r_squared = self._fit_ols(X, y)
+        # Fit OLS using unified backend
+        coefficients, residuals, fitted, _ = solve_ols(
+            X, y, return_fitted=True, return_vcov=False
+        )
+        r_squared = compute_r_squared(y, residuals)
 
         # Degrees of freedom
         df = len(y) - X.shape[1] - n_absorbed_effects
@@ -900,13 +891,13 @@ def fit(  # type: ignore[override]
         # For now, we use analytical inference even if inference="wild_bootstrap"
         if self.cluster is not None:
             cluster_ids = data[self.cluster].values
-            vcov = compute_robust_se(X, residuals, cluster_ids)
+            vcov = compute_robust_vcov(X, residuals, cluster_ids)
         elif self.robust:
-            vcov = compute_robust_se(X, residuals)
+            vcov = compute_robust_vcov(X, residuals)
         else:
             n = len(y)
             k = X.shape[1]
-            mse = np.sum(residuals ** 2) / (n - k)
+            mse = np.sum(residuals**2) / (n - k)
             # Use solve() instead of inv() for numerical stability
             # solve(A, B) computes X where AX=B, so this yields (X'X)^{-1} * mse
             vcov = np.linalg.solve(X.T @ X, mse * np.eye(k))
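Note on the behavioral change: the CHANGELOG entry asks users who relied on the old `ValueError` to validate their design matrices separately. A minimal sketch of such a check follows, mirroring the `np.linalg.matrix_rank` test that this diff removes from `_fit_ols`. The helper name `check_full_column_rank` is hypothetical and not part of `diff_diff`; callers would run it on `X` before fitting.

```python
import numpy as np


def check_full_column_rank(X: np.ndarray) -> None:
    """Raise ValueError if X has linearly dependent columns.

    Hypothetical helper (not part of diff_diff). It reproduces the rank
    check that was removed from _fit_ols in this change.
    """
    rank = np.linalg.matrix_rank(X)
    if rank < X.shape[1]:
        raise ValueError(
            f"Design matrix is rank-deficient (rank {rank} < {X.shape[1]} columns)."
        )


# Full-rank design: intercept plus one regressor.
X_ok = np.column_stack([np.ones(4), np.array([0.0, 1.0, 2.0, 3.0])])
check_full_column_rank(X_ok)  # passes silently

# Collinear design: the third column duplicates the second.
X_bad = np.column_stack([X_ok, X_ok[:, 1]])
try:
    check_full_column_rank(X_bad)
except ValueError as exc:
    message = str(exc)
```

Keeping the check outside the solver leaves the `gelsy` fast path untouched for well-specified models, at the cost of one extra rank computation for callers who opt in.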