Unify Rust TROP inner solver to SVD; close finding #23 grid-search divergence

igerber · claude · igerber · commit e59610bf0bd7 · 2026-04-20T20:03:19.000-04:00
Closes the grid-search half of silent-failures finding #23 (TODO row 87). The `xfail(strict=True)` regression `test_grid_search_rank_deficient_Y` baselined a ~6% ATT divergence between Rust and Python on two near-parallel control units. Root cause: Rust's `solve_joint_no_lowrank` used iterative block coordinate descent (50 iter, tol=1e-8) while Python used SVD-based minimum-norm least squares. On rank-deficient Y the two solvers converge to different stationary points of the same objective. Python is canonical (SVD / minimum-norm least squares per Golub & Van Loan). Rust's iterative solver was a speed optimization, not a methodology choice. Port the Rust inner TWFE step to SVD-based WLS that mirrors Python's `np.linalg.lstsq(rcond=None)` step-for-step, with numpy-compatible `rcond = eps * max(n, k)`. Changes - rust/src/linalg.rs: promote ndarray_to_faer to pub(crate) so trop.rs can reuse it. - rust/src/trop.rs: new module-private solve_wls_svd helper — thin-SVD + rcond truncation, matches numpy's minimum-norm semantics. Rewrite solve_joint_no_lowrank body to flatten y/weights row-major, build the [intercept | unit_dummies[1..] | time_dummies[1..]] design matrix, apply sqrt-weights, and solve via solve_wls_svd. Function signature unchanged — all 4 call sites (LOOCV, FISTA TWFE step x2, bootstrap) benefit transitively. - tests/test_rust_backend.py: remove @pytest.mark.xfail from test_grid_search_rank_deficient_Y; the gap is closed. Bootstrap-seed test retains its xfail (row 87 RNG mismatch, out of scope). - docs/methodology/REGISTRY.md: update the TROP Global Estimation bullet at the existing `np.linalg.lstsq` line to note Rust and Python now both use SVD-based minimum-norm WLS with numpy-compatible rcond. - TODO.md: delete row 87 (grid-search divergence entry). Verification - maturin develop --release --features accelerate: clean build, no warnings. - pytest tests/test_rust_backend.py::TestTROPRustEdgeCaseParity: grid-search test now passes; bootstrap-seed test correctly xfails. - pytest tests/test_rust_backend.py -k TROP -m '': 23 passed, 1 xfailed, no regressions. - pytest tests/test_trop.py: 83 passed, 37 deselected (slow). - TestTROPGlobalRustVsNumpy (incl. lambda_nn=0 low-rank FISTA path): 8 passed — FISTA TWFE step unchanged in behavior on well-conditioned data. - grep for other 'for _ in 0..50' coordinate-descent patterns in rust/src/*.rs: none found. Non-goals - No changes to row 87 (bootstrap RNG mismatch — Rust rand crate vs numpy default_rng ~28% SE gap on seed=42). Separate PR. - No changes to linalg.rs::solve_ols (rcond=1e-7 is load-bearing for MultiPeriodDiD / DiD / TWFE). - No public API changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/TODO.md b/TODO.md
@@ -83,7 +83,6 @@ Deferred items from PR reviews that were not addressed before merge.
 | Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium |
 | Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R (`Rscript benchmarks/R/generate_clubsandwich_golden.R`). Current JSON has `source: python_self_reference` as a stability anchor until an authoritative R run. | `benchmarks/R/generate_clubsandwich_golden.R` | Phase 1a | Medium |
 | `honest_did.py:1907` `np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | `honest_did.py` | #334 | Low |
-| TROP Rust vs Python grid-search divergence on rank-deficient Y: on two near-parallel control units, LOOCV grid-search ATT diverges ~6% between Rust (`trop_global.py:688`) and Python fallback (`trop_global.py:753`). Either grid-winner ties are broken differently or the per-λ solver reaches different stationary points under rank deficiency. Audit finding #23 flagged this surface. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_grid_search_rank_deficient_Y` baselines the gap. | `trop_global.py`, `rust/` | follow-up | Medium |
 | TROP Rust vs Python bootstrap SE divergence under fixed seed: `seed=42` on a tiny panel produces ~28% bootstrap-SE gap. Root cause: Rust bootstrap uses its own RNG (`rand` crate) while Python uses `numpy.random.default_rng`; same seed value maps to different bytestreams across backends. Audit axis-H (RNG/seed) adjacent. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility` baselines the gap. Unifying RNG (threading a numpy-generated seed-sequence into Rust, or porting Python to ChaCha) would close it. | `trop_global.py`, `rust/` | follow-up | Medium |
 | `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
 | `bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | `diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Medium |
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
@@ -2059,6 +2059,10 @@ Treatment effects are **heterogeneous** per-observation values. ATT is their mea
 1. **Without low-rank (λ_nn = ∞)**: Standard weighted least squares
    - Build design matrix with unit/time dummies (no treatment indicator)
    - Solve via np.linalg.lstsq for (μ, α, β) using (1-W)-masked weights
+   - Both the Python fallback and the Rust acceleration path use SVD-based
+     minimum-norm least squares with numpy-compatible rcond = eps × max(n, k),
+     so they return the canonical minimum-norm solution on rank-deficient Y
+     (e.g., two near-parallel control units)
 
 2. **With low-rank (finite λ_nn)**: Alternating minimization
    - Alternate between:
diff --git a/rust/src/linalg.rs b/rust/src/linalg.rs
@@ -277,7 +277,7 @@ fn compute_robust_vcov_internal(
 }
 
 /// Convert ndarray Array2 to faer Mat
-fn ndarray_to_faer(arr: &Array2<f64>) -> faer::Mat<f64> {
+pub(crate) fn ndarray_to_faer(arr: &Array2<f64>) -> faer::Mat<f64> {
     let nrows = arr.nrows();
     let ncols = arr.ncols();
     faer::Mat::from_fn(nrows, ncols, |i, j| arr[[i, j]])
diff --git a/rust/src/trop.rs b/rust/src/trop.rs
@@ -14,6 +14,8 @@ use numpy::{PyArray1, PyArray2, PyReadonlyArray1, PyReadonlyArray2, ToPyArray};
 use pyo3::prelude::*;
 use rayon::prelude::*;
 
+use crate::linalg::ndarray_to_faer;
+
 /// Minimum chunk size for parallel distance computation.
 /// Reduces scheduling overhead for small matrices.
 const MIN_CHUNK_SIZE: usize = 16;
@@ -1233,111 +1235,164 @@ fn solve_joint_no_lowrank(
     y: &ArrayView2<f64>,
     delta: &ArrayView2<f64>,
 ) -> Option<(f64, Array1<f64>, Array1<f64>)> {
+    // SVD-based minimum-norm weighted least-squares fit — mirrors Python's
+    // `_solve_global_no_lowrank` at `diff_diff/trop_global.py:340-412`
+    // step-for-step so Rust and Python paths produce the same canonical
+    // solution on rank-deficient Y (silent-failures finding #23).
+    //
+    // Model: Y_it = μ + α_i + β_t + ε_it, with α_0 = β_0 = 0 for
+    // identification. Weights: δ_it. Flatten row-major with
+    // idx = t * n_units + i (matches Python's Y.flatten() C-order).
     let n_periods = y.nrows();
     let n_units = y.ncols();
+    let n_obs = n_periods * n_units;
+    let n_params = 1 + (n_units - 1) + (n_periods - 1);
 
-    // We solve using normal equations with the design matrix structure
-    // Rather than build full X matrix, use block structure for efficiency
-    //
-    // The model: Y_it = μ + α_i + β_t + ε_it
-    // With identification: α_0 = β_0 = 0
-
-    // Compute weighted sums needed for normal equations
+    // Flatten y + weights with NaN masking — matches trop_global.py:354-360.
+    let mut y_flat = Array1::<f64>::zeros(n_obs);
+    let mut w_flat = Array1::<f64>::zeros(n_obs);
     let mut sum_w = 0.0;
-    let mut sum_wy = 0.0;
-
-    // Per-unit and per-period weighted sums
-    let mut sum_w_by_unit = Array1::<f64>::zeros(n_units);
-    let mut sum_wy_by_unit = Array1::<f64>::zeros(n_units);
-    let mut sum_w_by_period = Array1::<f64>::zeros(n_periods);
-    let mut sum_wy_by_period = Array1::<f64>::zeros(n_periods);
-
     for t in 0..n_periods {
         for i in 0..n_units {
-            // NaN outcomes get zero weight (not imputed to 0.0 with active weight)
-            let w = if y[[t, i]].is_finite() { delta[[t, i]] } else { 0.0 };
-            let y_ti = if y[[t, i]].is_finite() { y[[t, i]] } else { 0.0 };
-
+            let idx = t * n_units + i;
+            let y_ti = y[[t, i]];
+            let w_ti = delta[[t, i]];
+            let valid = y_ti.is_finite() && w_ti.is_finite();
+            let w = if valid { w_ti.max(0.0) } else { 0.0 };
+            y_flat[idx] = if valid { y_ti } else { 0.0 };
+            w_flat[idx] = w;
             sum_w += w;
-            sum_wy += w * y_ti;
-
-            sum_w_by_unit[i] += w;
-            sum_wy_by_unit[i] += w * y_ti;
-            sum_w_by_period[t] += w;
-            sum_wy_by_period[t] += w * y_ti;
         }
     }
 
+    // All-zero weights short-circuit — matches trop_global.py:366.
     if sum_w < 1e-10 {
         return None;
     }
 
-    // Use iterative approach: alternate between (alpha, beta) and mu
-    // until convergence (simpler than full normal equations)
-    let mut mu = sum_wy / sum_w;
+    // Build design matrix X = [intercept | unit_dummies[1..] | time_dummies[1..]]
+    // — matches trop_global.py:374-385. Explicit nested loops so the
+    // index correspondence with Python is unambiguous.
+    let mut x = Array2::<f64>::zeros((n_obs, n_params));
+    for t in 0..n_periods {
+        for i in 0..n_units {
+            let idx = t * n_units + i;
+            x[[idx, 0]] = 1.0; // intercept
+            if i >= 1 {
+                x[[idx, i]] = 1.0; // unit dummy (unit 0 dropped)
+            }
+            if t >= 1 {
+                x[[idx, (n_units - 1) + t]] = 1.0; // time dummy (period 0 dropped)
+            }
+        }
+    }
+
+    // Apply sqrt-weights: X_w = X * sqrt(w)[:, None], y_w = y * sqrt(w).
+    // Matches trop_global.py:387-389.
+    let sqrt_w: Array1<f64> = w_flat.mapv(|w| w.sqrt());
+    for r in 0..n_obs {
+        let s = sqrt_w[r];
+        for c in 0..n_params {
+            x[[r, c]] *= s;
+        }
+        y_flat[r] *= s;
+    }
+
+    // Solve via SVD with numpy-compatible rcond truncation.
+    let coeffs = solve_wls_svd(&x.view(), &y_flat.view())?;
+
+    // Unpack: mu = coeffs[0], alpha[1..] = coeffs[1..n_units],
+    // beta[1..] = coeffs[n_units..]. Matches trop_global.py:406-410.
+    let mu = coeffs[0];
     let mut alpha = Array1::<f64>::zeros(n_units);
+    for i in 1..n_units {
+        alpha[i] = coeffs[i];
+    }
     let mut beta = Array1::<f64>::zeros(n_periods);
+    for t in 1..n_periods {
+        beta[t] = coeffs[(n_units - 1) + t];
+    }
 
-    for _ in 0..50 {
-        let mu_old = mu;
-        let alpha_old = alpha.clone();
-        let beta_old = beta.clone();
+    Some((mu, alpha, beta))
+}
 
-        // Update alpha (fixing beta, mu)
-        for i in 1..n_units {  // α_0 = 0 for identification
-            if sum_w_by_unit[i] > 1e-10 {
-                let mut num = 0.0;
-                for t in 0..n_periods {
-                    // NaN outcomes get zero weight
-                    let w = if y[[t, i]].is_finite() { delta[[t, i]] } else { 0.0 };
-                    let y_ti = if y[[t, i]].is_finite() { y[[t, i]] } else { 0.0 };
-                    num += w * (y_ti - mu - beta[t]);
-                }
-                alpha[i] = num / sum_w_by_unit[i];
-            }
-        }
+/// Minimum-norm least-squares solution via faer thin SVD with rcond truncation.
+///
+/// Mirrors `np.linalg.lstsq(A, b, rcond=None)` from numpy: singular values
+/// below `rcond * max(S)` with `rcond = eps * max(n_rows, n_cols)` are
+/// treated as zero. On rank-deficient A this returns the unique
+/// minimum-norm least-squares solution.
+///
+/// This helper intentionally does NOT reuse `rust/src/linalg.rs::solve_ols`
+/// because `solve_ols` hard-codes `rcond = 1e-7` (R's `lm()` default) which
+/// would truncate singular values that numpy's default keeps. TROP's
+/// canonical Python path is numpy-compatible; Rust must match.
+///
+/// Returns `None` only when the SVD itself fails (rare on finite inputs);
+/// the caller (LOOCV / FISTA / bootstrap) interprets `None` as an
+/// unsuccessful fit.
+fn solve_wls_svd(a: &ArrayView2<f64>, b: &ArrayView1<f64>) -> Option<Array1<f64>> {
+    let n_rows = a.nrows();
+    let n_cols = a.ncols();
+    let a_owned = a.to_owned();
+    let b_owned = b.to_owned();
 
-        // Update beta (fixing alpha, mu)
-        for t in 1..n_periods {  // β_0 = 0 for identification
-            if sum_w_by_period[t] > 1e-10 {
-                let mut num = 0.0;
-                for i in 0..n_units {
-                    // NaN outcomes get zero weight
-                    let w = if y[[t, i]].is_finite() { delta[[t, i]] } else { 0.0 };
-                    let y_ti = if y[[t, i]].is_finite() { y[[t, i]] } else { 0.0 };
-                    num += w * (y_ti - mu - alpha[i]);
-                }
-                beta[t] = num / sum_w_by_period[t];
-            }
+    // Convert ndarray -> faer for SVD.
+    let a_faer = ndarray_to_faer(&a_owned);
+
+    let svd = a_faer.thin_svd().ok()?;
+
+    let u_faer = svd.U();
+    let s_diag = svd.S();
+    let s_col = s_diag.column_vector();
+    let v_faer = svd.V();
+
+    // Extract U (n_rows x min(n,k)) back to ndarray.
+    let u_rows = u_faer.nrows();
+    let u_cols = u_faer.ncols();
+    let mut u = Array2::<f64>::zeros((u_rows, u_cols));
+    for i in 0..u_rows {
+        for j in 0..u_cols {
+            u[[i, j]] = u_faer[(i, j)];
         }
+    }
 
-        // Update mu (fixing alpha, beta)
-        let mut num_mu = 0.0;
-        for t in 0..n_periods {
-            for i in 0..n_units {
-                // NaN outcomes get zero weight
-                let w = if y[[t, i]].is_finite() { delta[[t, i]] } else { 0.0 };
-                let y_ti = if y[[t, i]].is_finite() { y[[t, i]] } else { 0.0 };
-                num_mu += w * (y_ti - alpha[i] - beta[t]);
-            }
+    // Extract singular values.
+    let s_len = s_col.nrows();
+    let mut s = Array1::<f64>::zeros(s_len);
+    for i in 0..s_len {
+        s[i] = s_col[i];
+    }
+
+    // Extract V (k x min(n,k)) back to ndarray. faer's V is not V^T.
+    let v_rows = v_faer.nrows();
+    let v_cols = v_faer.ncols();
+    let mut v = Array2::<f64>::zeros((v_rows, v_cols));
+    for i in 0..v_rows {
+        for j in 0..v_cols {
+            v[[i, j]] = v_faer[(i, j)];
         }
-        mu = num_mu / sum_w;
-
-        // Check convergence across ALL parameters (not just mu)
-        let mu_diff = (mu - mu_old).abs();
-        let alpha_diff = alpha.iter().zip(alpha_old.iter())
-            .map(|(a, b)| (a - b).abs())
-            .fold(0.0_f64, f64::max);
-        let beta_diff = beta.iter().zip(beta_old.iter())
-            .map(|(a, b)| (a - b).abs())
-            .fold(0.0_f64, f64::max);
-        let max_diff = mu_diff.max(alpha_diff).max(beta_diff);
-        if max_diff < 1e-8 {
-            break;
+    }
+
+    // numpy rcond = eps * max(n_rows, n_cols); truncate s[i] <= rcond * max(s).
+    let rcond = f64::EPSILON * (n_rows.max(n_cols) as f64);
+    let s_max = s.iter().cloned().fold(0.0_f64, f64::max);
+    let threshold = s_max * rcond;
+
+    // Compute β = V * S^{-1}_truncated * U^T * y.
+    let uty = u.t().dot(&b_owned); // (min(n,k),)
+    let mut s_inv_uty = Array1::<f64>::zeros(s_len);
+    for i in 0..s_len {
+        if s[i] > threshold {
+            s_inv_uty[i] = uty[i] / s[i];
         }
+        // else: leave 0 — this is the pseudo-inverse / minimum-norm step
+        // that also covers Python's `except LinAlgError: pinv(...)` fallback
+        // tier, since faer thin_svd is numerically robust on finite inputs.
     }
+    let coeffs = v.dot(&s_inv_uty);
 
-    Some((mu, alpha, beta))
+    Some(coeffs)
 }
 
 /// Solve global TWFE + low-rank via alternating minimization (no tau).
diff --git a/tests/test_rust_backend.py b/tests/test_rust_backend.py
@@ -2204,23 +2204,17 @@ def _make_correlated_panel(n_units=6, n_periods=5, n_treated=2):
                 data.append({"unit": i, "time": t, "outcome": y, "treated": treated})
         return pd.DataFrame(data)
 
-    @pytest.mark.xfail(
-        strict=True,
-        reason="TROP Rust grid-search at trop_global.py:688 and Python fallback "
-        "at trop_global.py:753 diverge on rank-deficient Y: empirical ATT "
-        "gap ~6% on two near-parallel control units. Either (a) grid-winner "
-        "ties break differently between backends, or (b) the per-λ solver "
-        "itself reaches different stationary points under rank deficiency. "
-        "Finding #23 flagged this exact surface as the Phase-2 gap. Rust "
-        "vs Python unification is a P1 follow-up (TODO.md). This xfail "
-        "baselines the divergence so we notice when/if the backends align.",
-    )
     def test_grid_search_rank_deficient_Y(self):
         """Grid-search ATT parity on rank-deficient Y.
 
-        Known to fail: two near-parallel control units produce ~6% ATT
-        divergence between Rust and Python LOOCV grid-search. See xfail
-        reason for follow-up plan.
+        Finding #23 / TODO row 87 regression guard. Previously a ~6%
+        ATT divergence on two near-parallel control units because the
+        Rust inner solver used iterative block coordinate descent while
+        the Python fallback used SVD-based minimum-norm least squares.
+        Fixed by porting the Rust inner solver to an SVD-based WLS path
+        (numpy-compatible rcond = eps*max(n,k)) that mirrors Python's
+        `np.linalg.lstsq(rcond=None)` step-for-step. This test asserts
+        the backends now agree at atol=1e-6 on rank-deficient Y.
         """
         import sys
         from unittest.mock import patch

Original file line number	Diff line number	Diff line change
`@@ -277,7 +277,7 @@ fn compute_robust_vcov_internal(`
`277`	`277`	`}`
`278`	`278`
`279`	`279`	`/// Convert ndarray Array2 to faer Mat`
`280`		`-fn ndarray_to_faer(arr: &Array2<f64>) -> faer::Mat<f64> {`
	`280`	`+pub(crate) fn ndarray_to_faer(arr: &Array2<f64>) -> faer::Mat<f64> {`
`281`	`281`	`let nrows = arr.nrows();`
`282`	`282`	`let ncols = arr.ncols();`
`283`	`283`	`faer::Mat::from_fn(nrows, ncols, \|i, j\| arr[[i, j]])`