Commit 0e6524e

igerber and claude committed
Fix rank-deficient matrix handling in OLS solver
MultiPeriodDiD was producing astronomically wrong estimates (~252 trillion instead of ~2-5) due to rank-deficient design matrices being solved incorrectly by the gelsy LAPACK driver.

Changes:
- Python: Switch from gelsy to gelsd driver (SVD-based with truncation)
- Rust: Replace least_squares() with explicit SVD + truncated pseudoinverse
- Add comprehensive tests for rank-deficient matrices in both backends
- Add Rust vs NumPy equivalence tests for rank-deficient cases
- Document NaN standard errors limitation in TODO.md

The gelsd driver properly handles rank-deficient matrices by truncating small singular values below rcond * max(S), producing valid minimum-norm solutions instead of garbage coefficients.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
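The driver switch can be sanity-checked outside the library. Below is a minimal SciPy-only sketch (not code from this repo) that forces rank deficiency by duplicating a column:

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)
a = rng.standard_normal((100, 2))
# Duplicating a column makes X rank-deficient (3 columns, rank 2)
X = np.column_stack([a, a[:, 0]])
y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.standard_normal(100)

# 'gelsd' truncates small singular values and returns the finite,
# minimum-norm solution; it also reports the effective rank
coef, _, rank, sv = lstsq(X, y, lapack_driver="gelsd")
print(rank, np.all(np.isfinite(coef)))
```

Under 'gelsd' the minimum-norm solution splits the duplicated column's weight across both copies, so `coef[0] + coef[2]` recovers the combined effect of roughly 1.0 rather than producing the huge offsetting coefficients described above.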
1 parent 942feb7 commit 0e6524e

6 files changed

Lines changed: 367 additions & 30 deletions

CLAUDE.md

Lines changed: 2 additions & 2 deletions
@@ -119,7 +119,7 @@ pytest tests/test_rust_backend.py -v
   - Integrated with `TwoWayFixedEffects.decompose()` method

 - **`diff_diff/linalg.py`** - Unified linear algebra backend (v1.4.0+):
-  - `solve_ols()` - OLS solver using scipy's gelsy LAPACK driver (QR-based, faster than SVD)
+  - `solve_ols()` - OLS solver using scipy's gelsd LAPACK driver (SVD-based, handles rank-deficient matrices)
   - `compute_robust_vcov()` - Vectorized HC1 and cluster-robust variance-covariance estimation
   - `compute_r_squared()` - R-squared and adjusted R-squared computation
   - `LinearRegression` - High-level OLS helper class with unified coefficient extraction and inference
@@ -240,7 +240,7 @@ diff-diff achieved significant performance improvements in v1.4.0, now **faster

 All estimators use a single optimized OLS/SE implementation:

-- **scipy.linalg.lstsq with 'gelsy' driver**: QR-based solving, faster than NumPy's default SVD-based solver
+- **scipy.linalg.lstsq with 'gelsd' driver**: SVD-based solving that properly handles rank-deficient matrices (critical for MultiPeriodDiD and other estimators with potentially redundant columns)
 - **Vectorized cluster-robust SE**: Uses pandas groupby aggregation instead of O(n × clusters) Python loop
 - **Single optimization point**: Changes to `linalg.py` benefit all estimators

TODO.md

Lines changed: 15 additions & 0 deletions
@@ -12,9 +12,24 @@ Current limitations that may affect users:

 | Issue | Location | Priority | Notes |
 |-------|----------|----------|-------|
+| NaN standard errors for rank-deficient matrices | `linalg.py:330-345` | Medium | See details below |
 | MultiPeriodDiD wild bootstrap not supported | `estimators.py:1068-1074` | Low | Edge case |
 | `predict()` raises NotImplementedError | `estimators.py:532-554` | Low | Rarely needed |

+### NaN Standard Errors for Rank-Deficient Matrices
+
+**Problem**: When the design matrix is rank-deficient (e.g., MultiPeriodDiD with redundant period dummies + treatment interactions), the coefficients are now computed correctly via SVD truncation, but the variance-covariance matrix computation produces NaN values.
+
+**Root cause**: The vcov computation in `compute_robust_vcov()` computes `(X'X)^{-1}`, which doesn't exist for rank-deficient matrices. The current implementation uses Cholesky factorization, which fails silently, producing NaN values.
+
+**Affected estimators**:
+- `MultiPeriodDiD` - when the design matrix has redundant columns
+- Any estimator using `solve_ols()` with rank-deficient X
+
+**Potential fix**: Use the Moore-Penrose pseudoinverse `(X'X)^+` instead of `(X'X)^{-1}` for the bread matrix in the sandwich estimator. This would provide valid (though potentially conservative) standard errors for the identifiable parameters.
+
+**Workaround**: Users can use bootstrap inference, which doesn't rely on the analytical vcov.
+
 ---

 ## Code Quality
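The pseudoinverse fix proposed in the TODO entry above can be prototyped in a few lines of NumPy. This is an illustrative sketch, not the library's implementation; the helper name `hc1_vcov_pinv` is invented here, and only the plain HC1 (non-clustered) sandwich is shown:

```python
import numpy as np

def hc1_vcov_pinv(X, resid):
    """HC1 sandwich with a pseudoinverse bread, tolerant of rank deficiency."""
    n, k = X.shape
    bread = np.linalg.pinv(X.T @ X)               # (X'X)^+ instead of (X'X)^{-1}
    meat = (X * resid[:, None] ** 2).T @ X        # X' diag(e_i^2) X
    return (n / (n - k)) * bread @ meat @ bread   # HC1 small-sample scaling

rng = np.random.default_rng(0)
a = rng.standard_normal((50, 2))
# Duplicated column -> X has 4 columns but rank 3
X = np.column_stack([np.ones(50), a, a[:, 0]])
y = X[:, 1] + rng.standard_normal(50)

coef = np.linalg.pinv(X) @ y                      # minimum-norm coefficients
vcov = hc1_vcov_pinv(X, y - X @ coef)
print(np.all(np.isfinite(vcov)))                  # finite vcov despite rank deficiency
```

A plain inverse (or Cholesky solve) of `X.T @ X` would fail or return NaN here; the pseudoinverse yields finite, if conservative, variances for the identifiable directions.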

diff_diff/linalg.py

Lines changed: 14 additions & 14 deletions
@@ -5,7 +5,7 @@
 Rust backend for maximum performance.

 The key optimizations are:
-1. scipy.linalg.lstsq with 'gelsy' driver (QR-based, faster than SVD)
+1. scipy.linalg.lstsq with 'gelsd' driver (SVD-based, handles rank-deficient matrices)
 2. Vectorized cluster-robust SE via groupby (eliminates O(n*clusters) loop)
 3. Single interface for all estimators (reduces code duplication)
 4. Optional Rust backend for additional speedup (when available)
@@ -80,9 +80,9 @@ def solve_ols(
     Notes
     -----
-    This function uses scipy.linalg.lstsq with the 'gelsy' driver, which is
-    QR-based and typically faster than NumPy's default SVD-based solver for
-    well-conditioned matrices.
+    This function uses scipy.linalg.lstsq with the 'gelsd' driver, which is
+    SVD-based and handles rank-deficient matrices correctly by truncating
+    small singular values.

     The cluster-robust standard errors use the sandwich estimator with the
     standard small-sample adjustment: (G/(G-1)) * ((n-1)/(n-k)).
@@ -184,11 +184,11 @@ def _solve_ols_numpy(
     """
     NumPy/SciPy fallback implementation of solve_ols.

-    Uses scipy.linalg.lstsq with 'gelsy' driver (QR with column pivoting)
-    for numerically stable least squares solving. QR decomposition is preferred
-    over normal equations because it doesn't square the condition number of X,
-    making it more robust for ill-conditioned matrices common in DiD designs
-    (e.g., many unit/time fixed effects).
+    Uses scipy.linalg.lstsq with 'gelsd' driver (SVD-based with divide-and-conquer)
+    for numerically stable least squares solving. SVD decomposition properly handles
+    rank-deficient matrices by truncating small singular values, which is critical
+    for DiD designs that may have redundant columns (e.g., period dummies + treatment
+    interactions in MultiPeriodDiD).

     Parameters
     ----------
@@ -214,11 +214,11 @@ def _solve_ols_numpy(
     vcov : np.ndarray, optional
         Variance-covariance matrix if return_vcov=True.
     """
-    # Solve OLS using QR decomposition via scipy's optimized LAPACK routines
-    # 'gelsy' uses QR with column pivoting, which is numerically stable even
-    # for ill-conditioned matrices (doesn't square the condition number like
-    # normal equations would)
-    coefficients = scipy_lstsq(X, y, lapack_driver="gelsy", check_finite=False)[0]
+    # Solve OLS using SVD via scipy's optimized LAPACK routines
+    # 'gelsd' uses divide-and-conquer SVD, which properly handles rank-deficient
+    # matrices by truncating small singular values (unlike 'gelsy' which can
+    # produce garbage coefficients for nearly rank-deficient matrices)
+    coefficients = scipy_lstsq(X, y, lapack_driver="gelsd", check_finite=False)[0]

     # Compute residuals and fitted values
     fitted = X @ coefficients
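The cluster-robust sandwich with the small-sample adjustment (G/(G-1)) * ((n-1)/(n-k)) quoted in the docstring above can be sketched in plain NumPy. The library itself vectorizes via pandas groupby; the helper name `cluster_vcov` and the use of `np.add.at` here are illustrative choices, not the shipped code:

```python
import numpy as np

def cluster_vcov(X, resid, clusters):
    """Cluster-robust sandwich estimator, vectorized over clusters."""
    n, k = X.shape
    ids, inv = np.unique(clusters, return_inverse=True)
    G = len(ids)
    scores = X * resid[:, None]                    # per-observation scores X_i * e_i
    cluster_scores = np.zeros((G, k))
    np.add.at(cluster_scores, inv, scores)         # sum scores within each cluster
    meat = cluster_scores.T @ cluster_scores
    bread = np.linalg.pinv(X.T @ X)
    adj = (G / (G - 1)) * ((n - 1) / (n - k))      # small-sample adjustment
    return adj * bread @ meat @ bread

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(120), rng.standard_normal(120)])
y = X @ np.array([0.5, 1.5]) + rng.standard_normal(120)
coef = np.linalg.lstsq(X, y, rcond=None)[0]
clusters = np.repeat(np.arange(12), 10)            # 12 clusters of 10 observations

V = cluster_vcov(X, y - X @ coef, clusters)
se = np.sqrt(np.diag(V))
```

Summing scores per cluster once and taking an outer product replaces the O(n × clusters) Python loop the commit's CLAUDE.md notes mention eliminating.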

rust/src/linalg.rs

Lines changed: 47 additions & 6 deletions
@@ -6,13 +6,22 @@
 //! - Cluster-robust variance-covariance estimation

 use ndarray::{Array1, Array2, ArrayView1, ArrayView2, Axis};
-use ndarray_linalg::{FactorizeC, LeastSquaresSvd, Solve, SolveC, UPLO};
+use ndarray_linalg::{FactorizeC, Solve, SolveC, SVD, UPLO};
 use numpy::{IntoPyArray, PyArray1, PyArray2, PyReadonlyArray1, PyReadonlyArray2};
 use pyo3::prelude::*;
 use std::collections::HashMap;

 /// Solve OLS regression: β = (X'X)^{-1} X'y
 ///
+/// Uses SVD with truncation for rank-deficient matrices:
+/// - Computes SVD: X = U * S * V^T
+/// - Truncates singular values below rcond * max(S)
+/// - Computes solution: β = V * S^{-1}_truncated * U^T * y
+///
+/// This matches scipy's 'gelsd' driver behavior for handling rank-deficient
+/// design matrices that can occur in DiD estimation (e.g., MultiPeriodDiD
+/// with redundant period dummies + treatment interactions).
+///
 /// # Arguments
 /// * `x` - Design matrix (n, k)
 /// * `y` - Response vector (n,)
@@ -37,15 +46,47 @@ pub fn solve_ols<'py>(
     let x_arr = x.as_array();
     let y_arr = y.as_array();

-    // Solve least squares using SVD (more stable than normal equations)
+    let n = x_arr.nrows();
+    let k = x_arr.ncols();
+
+    // Solve using SVD with truncation for rank-deficient matrices
+    // This matches scipy's 'gelsd' behavior
     let x_owned = x_arr.to_owned();
     let y_owned = y_arr.to_owned();

-    let result = x_owned
-        .least_squares(&y_owned)
-        .map_err(|e| PyErr::new::<pyo3::exceptions::PyValueError, _>(format!("Least squares failed: {}", e)))?;
+    // Compute SVD: X = U * S * V^T
+    let (u_opt, s, vt_opt) = x_owned
+        .svd(true, true)
+        .map_err(|e| PyErr::new::<pyo3::exceptions::PyValueError, _>(format!("SVD failed: {}", e)))?;
+
+    let u = u_opt.ok_or_else(|| {
+        PyErr::new::<pyo3::exceptions::PyValueError, _>("SVD did not return U matrix")
+    })?;
+    let vt = vt_opt.ok_or_else(|| {
+        PyErr::new::<pyo3::exceptions::PyValueError, _>("SVD did not return V^T matrix")
+    })?;
+
+    // Compute rcond threshold (matches numpy/scipy default)
+    // rcond = max(n, k) * machine_epsilon
+    let rcond = (n.max(k) as f64) * f64::EPSILON;
+    let s_max = s.iter().cloned().fold(0.0_f64, f64::max);
+    let threshold = s_max * rcond;
+
+    // Compute truncated pseudoinverse solution: β = V * S^{-1} * U^T * y
+    // Singular values below threshold are treated as zero (truncated)
+    let uty = u.t().dot(&y_owned); // (min(n,k),)
+
+    // Build S^{-1} with truncation
+    let mut s_inv_uty = Array1::<f64>::zeros(k);
+    for i in 0..s.len().min(k) {
+        if s[i] > threshold {
+            s_inv_uty[i] = uty[i] / s[i];
+        }
+        // else: leave as 0 (truncate this singular value)
+    }

-    let coefficients = result.solution;
+    // Compute coefficients: β = V * (S^{-1} * U^T * y)
+    let coefficients = vt.t().dot(&s_inv_uty);

     // Compute fitted values and residuals
     let fitted = x_arr.dot(&coefficients);
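The Rust routine above can be cross-checked with a NumPy mirror of the same truncation rule, assuming scipy's 'gelsd' uses the same max(n, k) * eps default threshold (a sketch for verification, not shipped code):

```python
import numpy as np
from scipy.linalg import lstsq

def svd_solve(X, y):
    """Minimum-norm OLS via truncated SVD, mirroring the Rust routine."""
    n, k = X.shape
    u, s, vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T
    threshold = max(n, k) * np.finfo(float).eps * s.max()
    s_inv = np.zeros_like(s)
    keep = s > threshold                               # truncate small singular values
    s_inv[keep] = 1.0 / s[keep]
    return vt.T @ (s_inv * (u.T @ y))                  # beta = V S^+ U^T y

rng = np.random.default_rng(2)
a = rng.standard_normal((80, 3))
# Fourth column is a linear combination of the others -> rank 3
X = np.column_stack([a, a @ np.array([1.0, -1.0, 2.0])])
y = rng.standard_normal(80)

beta = svd_solve(X, y)
beta_scipy = lstsq(X, y, lapack_driver="gelsd")[0]
print(np.allclose(beta, beta_scipy, atol=1e-6))
```

Since the redundant singular value is truncated in both implementations, both return the unique minimum-norm solution, so the two vectors should agree to rounding error.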

tests/test_linalg.py

Lines changed: 92 additions & 8 deletions
@@ -186,21 +186,32 @@ def test_inf_in_x_raises_error(self):
             solve_ols(X, y)

     def test_check_finite_false_skips_validation(self):
-        """Test that check_finite=False skips NaN/Inf validation."""
+        """Test that check_finite=False skips the upfront NaN/Inf validation.
+
+        Note: With the 'gelsd' driver, LAPACK may still error on NaN values
+        during computation, which is actually safer than producing garbage.
+        """
         X = np.random.randn(100, 2)
         X[50, 0] = np.nan
         y = np.random.randn(100)

-        # Should not raise, but will return garbage results
-        coef, resid, vcov = solve_ols(X, y, check_finite=False)
-        # Coefficients will contain NaN due to bad input
-        assert np.isnan(coef).any() or np.isinf(coef).any()
+        # The gelsd driver may raise an error when encountering NaN during
+        # computation, or produce garbage results. Either is acceptable
+        # (the key is that we don't raise the "X contains NaN" user-friendly error)
+        try:
+            coef, resid, vcov = solve_ols(X, y, check_finite=False)
+            # If it completed, coefficients should contain NaN/Inf due to bad input
+            assert np.isnan(coef).any() or np.isinf(coef).any()
+        except ValueError as e:
+            # LAPACK may raise an error on NaN values (gelsd behavior)
+            # This is acceptable - the key is we skipped our own validation
+            assert "X contains NaN" not in str(e) and "y contains NaN" not in str(e)

     def test_rank_deficient_still_solves(self):
-        """Test that rank-deficient matrix still returns a solution.
+        """Test that rank-deficient matrix returns a valid solution.

-        Note: The gelsy driver doesn't always detect rank deficiency,
-        but it still returns a valid least-squares solution.
+        The 'gelsd' driver uses SVD with truncation to properly handle
+        rank-deficient matrices, producing finite and reasonable coefficients.
         """
         np.random.seed(42)
         X = np.random.randn(100, 3)
@@ -212,9 +223,82 @@ def test_rank_deficient_still_solves(self):
         assert coef.shape == (3,)
         assert resid.shape == (100,)
+
+        # Coefficients must be finite (not NaN or Inf)
+        assert np.all(np.isfinite(coef)), f"Coefficients contain non-finite values: {coef}"
+
+        # Coefficients should be reasonable (not astronomically large)
+        # For a rank-deficient system, coefficients should still be bounded
+        assert np.all(np.abs(coef) < 1e6), f"Coefficients are unreasonably large: {coef}"
+
         # Residuals should still be valid (y - X @ coef)
         np.testing.assert_allclose(resid, y - X @ coef, rtol=1e-10)

+    def test_multiperiod_like_rank_deficiency(self):
+        """Test that MultiPeriodDiD-like design matrices are handled correctly.
+
+        MultiPeriodDiD creates design matrices with intercept + period dummies +
+        treatment × post interactions, which can have redundant columns and be
+        rank-deficient. This test mimics that structure.
+        """
+        np.random.seed(42)
+        n = 200
+        n_periods = 5
+
+        # Create a design matrix similar to MultiPeriodDiD:
+        # [intercept, period_1, period_2, ..., period_k, treated*post_1, ...]
+
+        # Intercept
+        intercept = np.ones(n)
+
+        # Period dummies (one-hot encoding for periods 1 to n_periods-1)
+        # Period 0 is the reference
+        period_assignment = np.random.randint(0, n_periods, n)
+        period_dummies = np.zeros((n, n_periods - 1))
+        for i in range(1, n_periods):
+            period_dummies[:, i - 1] = (period_assignment == i).astype(float)
+
+        # Treatment indicator
+        treated = np.random.binomial(1, 0.5, n)
+
+        # Post indicator (periods >= 3 are post)
+        post = (period_assignment >= 3).astype(float)
+
+        # Treatment × post interaction
+        treat_post = treated * post
+
+        # Build design matrix
+        # Note: This creates a rank-deficient matrix because the period dummies
+        # and treat_post are not all linearly independent when combined
+        X = np.column_stack([intercept, period_dummies, treat_post])
+
+        # True effect
+        true_effect = 2.5
+        y = (
+            1.0  # intercept effect
+            + 0.5 * period_dummies[:, 0]  # period 1 effect
+            + 0.3 * period_dummies[:, 1]  # period 2 effect
+            + 0.7 * period_dummies[:, 2]  # period 3 effect
+            + 0.9 * period_dummies[:, 3]  # period 4 effect
+            + true_effect * treat_post  # treatment effect
+            + np.random.randn(n) * 0.5  # noise
+        )
+
+        # Fit with solve_ols
+        coef, resid, vcov = solve_ols(X, y)
+
+        # Coefficients must be finite
+        assert np.all(np.isfinite(coef)), f"Coefficients contain non-finite values: {coef}"
+
+        # Coefficients should be reasonable (not trillions)
+        assert np.all(np.abs(coef) < 1e6), f"Coefficients are unreasonably large: {coef}"
+
+        # The treatment effect coefficient (last one) should be close to true effect
+        # Allow for sampling variation and potential multicollinearity effects
+        assert abs(coef[-1] - true_effect) < 2.0, (
+            f"Treatment effect coefficient {coef[-1]} is too far from true effect {true_effect}"
+        )
+
     def test_single_cluster_error(self):
         """Test that single cluster raises error."""
         X = np.random.randn(100, 2)
