perf: fuse banded TriMultiply neighbour-correction loops by wegamekinglc · Pull Request #183 · wegamekinglc/Derivatives-Algorithms-Lib

wegamekinglc · 2026-06-28T09:29:20Z

Summary

- P8: Banded TriMultiply hybrid fusion (dal-cpp/dal/math/matrix/banded.cpp). Keeps the dominant diag*x Transform pass (vectorization-identical to the unfused code) and fuses the two strided neighbour-correction loops into one sweep. A full single-pass interleave was tested and rejected because it changes auto-vectorization scheduling and cascades FP drift through the joint calibration.
- P7: CG/BCG Krylov fusion reverted. The fused AXPY sweeps pass on GCC (4.58e-6 drift, under the 1e-5 bar) but fail on clang (2.17e-5), because clang's auto-vectorizer produces a different instruction stream from GCC's even with -ffp-contract=fast on both. This is not an FMA issue; it is a cross-compiler auto-vectorization divergence near a knife-edge tolerance.
- Adds an asymmetric TriMultiply regression test covering all three bands and both boundaries.
- No volatile, no mutable (both are banned per .claude/rules/code-style.md).
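For readers without the diff at hand, the hybrid shape described for P8 can be sketched roughly as below. This is an illustration under stated assumptions, not the code in banded.cpp: the function name `TriMultiplyHybridSketch`, the use of `std::vector<double>`, and the band layout (`lower[i]` couples row i+1 to x[i], `upper[i]` couples row i to x[i+1]) are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch of a hybrid tridiagonal multiply y = T * x.
// Assumed layout (not taken from DAL): diag has n entries, lower and upper
// have n - 1 entries; lower[i] sits at (i + 1, i), upper[i] at (i, i + 1).
std::vector<double> TriMultiplyHybridSketch(const std::vector<double>& lower,
                                            const std::vector<double>& diag,
                                            const std::vector<double>& upper,
                                            const std::vector<double>& x) {
    const std::size_t n = diag.size();
    assert(x.size() == n);
    assert(n == 0 || (lower.size() == n - 1 && upper.size() == n - 1));

    std::vector<double> y(n);

    // Pass 1: the dominant diag * x term stays a separate element-wise
    // transform, so the compiler sees the same simple loop as before.
    std::transform(diag.begin(), diag.end(), x.begin(), y.begin(),
                   std::multiplies<double>());

    // Pass 2: both neighbour corrections in one sweep instead of two.
    for (std::size_t i = 0; i + 1 < n; ++i) {
        y[i] += upper[i] * x[i + 1];      // super-diagonal contribution
        y[i + 1] += lower[i] * x[i];      // sub-diagonal contribution
    }
    return y;
}
```

With distinct values in each band (for example lower = {1, 2, 3}, diag = {4, 5, 6, 7}, upper = {8, 9, 10}), a swapped or dropped band shows up immediately in the first, last, or interior rows, which is the kind of coverage the asymmetric regression test is described as providing.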

Test plan

Benchmarks (GCC -O3 -march=native -ffp-contract=fast, median of 20/1000 reps):

Benchmark	Before	After	Delta
TriDecomp MultiplyLeft (10K)	5.533 us	4.337 us	-21.6%
TriDiagonal MultiplyLeft (10K)	5.325 us	4.274 us	-19.7%
TriDiagonal Decompose (10K)	83.070 us	76.021 us	-8.5%
CGSolve (500x500 tridiag)	15.241 us	15.241 us	~0%
BCGSolve (500x500 tridiag)	19.396 us	19.396 us	~0%

JointCalibrationTest.TestJointOisCurveAgreesWithStagedOis passes on both GCC (4.64e-6) and clang (4.64e-6) against the 1e-5 bar.
Full dal_cpp_tests on GCC native: 774/774 pass.
Full dal_cpp_tests on clang (local): 774/774 pass.
Full dal_cpp_tests on Adept AAD backend: 773/773 pass.
CI matrix: 45/45 green (gcc-13/14, clang-18/19, msvc × all AAD backends).
All 12 other JointCalibrationTest.* pass.
All 8 UnderdeterminedTest.* pass.

Co-Authored-By: Claude <noreply@anthropic.com>

- P7: PrepareDirection_ and UpdateSolution_ in bcg.cpp collapse their separate scale/add (and LinearIncrement) sweeps into single AXPY passes over p/z/x/r (plus the bi-conjugate pp/zz/rr shadows). Operation order is preserved (p*multiply + z; dst + scale*src) so -ffp-contract=fast forms the same FMAs as the unfused two-pass code, keeping the result numerically faithful on the FP-sensitive joint-calibration path.
- P8: TriMultiply in banded.cpp keeps the dominant diag*x Transform pass (vectorization-identical to the unfused code) and fuses the two strided neighbour-correction loops into one sweep. A full single-pass interleave was tested and rejected: it re-vectorizes the diag*x term, perturbing FP rounding enough to flip JointCalibrationTest.TestJointOisCurveAgreesWithStagedOis from 7e-6 (unfused) to 5.7e-4 (past the 1e-5 bar); the hybrid stays at 4.6e-6.
- Add residual-bound regression tests for CG (symmetric) and BCG (asymmetric) Krylov solves, and an asymmetric TriMultiply test covering all three bands and both boundaries.

Benchmarks (GCC -O3 -march=native -ffp-contract=fast):

CGSolve 500x500: 15.24us -> 13.95us (-8.5%)
BCGSolve 500x500: 19.40us -> 17.64us (-9.0%)
TriDecomp MultiplyLeft 10K: 5.53us -> 4.34us (-21.6%)
TriDiagonal MultiplyLeft 10K: 5.33us -> 4.27us (-19.7%)

Co-Authored-By: Claude <noreply@anthropic.com>
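The P7 idea in this commit message (later reverted, per the PR summary) amounts to replacing a scale sweep followed by an add sweep with one pass that evaluates the same expression in the same order. A minimal sketch follows; the function names are hypothetical and the code is not taken from bcg.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical unfused form: scale p in one sweep, add z in a second sweep.
void ScaleThenAddTwoPass(std::vector<double>& p, const std::vector<double>& z,
                         double multiply) {
    assert(p.size() == z.size());
    for (double& pi : p) pi *= multiply;
    for (std::size_t i = 0; i < p.size(); ++i) p[i] += z[i];
}

// Hypothetical fused form: one sweep, same order (p * multiply + z).
void ScaleThenAddFused(std::vector<double>& p, const std::vector<double>& z,
                       double multiply) {
    assert(p.size() == z.size());
    for (std::size_t i = 0; i < p.size(); ++i) p[i] = p[i] * multiply + z[i];
}

// Hypothetical fused AXPY-style update: dst + scale * src in one sweep.
void AxpyFused(std::vector<double>& dst, const std::vector<double>& src,
               double scale) {
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < dst.size(); ++i) dst[i] = dst[i] + scale * src[i];
}
```

On exactly representable inputs the two forms agree bit for bit. On general inputs the compiler may contract the fused expression into an FMA or vectorize it differently, which is the kind of last-bit difference the PR summary reports as the reason the fused sweeps cleared the 1e-5 bar on GCC but not on clang.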

codacy-production · 2026-06-28T09:30:14Z

Up to standards ✅

🟢 Issues 0 issues

🟢 Metrics 13 complexity · 2 duplication


coveralls · 2026-06-28T09:36:54Z

Coverage Report for CI Build 28319341870

Coverage increased (+0.004%) to 81.495%

Details

Coverage increased (+0.004%) from the base build.
Patch coverage: 7 of 7 lines across 1 file are fully covered (100%).
No coverage regressions found.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

No coverage regressions found.

Coverage Stats


Relevant Lines:	7798
Covered Lines:	6355
Line Coverage:	81.5%
Coverage Strength:	3230533.61 hits per line

💛 - Coveralls

Copilot

Pull request overview

Performance-focused refactor in DAL’s linear-algebra solvers that reduces memory bandwidth by fusing Krylov (CG/BCG) vector sweeps and partially fusing banded tridiagonal TriMultiply, plus regression tests to guard residual and band-handling correctness.

Changes:

- Fuse CG/BCG Krylov update passes in bcg.cpp into single AXPY-style sweeps while preserving operation order for consistent FP contraction.
- Keep diag*x as a vectorizable pass in TriMultiply and fuse the two neighbor-correction loops into one sweep.
- Add residual-bound regression tests for CG/BCG solves and an asymmetric tridiagonal MultiplyLeft test covering both boundaries and interior rows.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
dal-cpp/dal/math/matrix/bcg.cpp	Fuses Krylov vector updates (CG/BCG) into fewer passes while preserving arithmetic order.
dal-cpp/dal/math/matrix/banded.cpp	Refactors tridiagonal multiply to keep the dominant vectorizable pass and fuse neighbor corrections.
dal-cpp/tests/math/matrix/test_bcg.cpp	Adds residual-based regression tests for CG/BCG on symmetric/asymmetric tridiagonal systems.
dal-cpp/tests/math/matrix/test_banded.cpp	Adds an asymmetric tridiagonal multiply test that exercises all bands and boundary behavior.
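The residual-bound idea behind the CG test listed in the table can be sketched without DAL types. Everything below is an assumption made for illustration: a textbook conjugate-gradient loop, a constant-coefficient symmetric tridiagonal matrix, and hypothetical names such as `SolveCGSketch` and `TriResidualInfNorm`. It is not the solver or the test in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// y = T * x for a symmetric tridiagonal T with constant diagonal d and
// constant off-diagonal e (illustrative matrix, not a DAL type).
Vec SymTriMultiply(double d, double e, const Vec& x) {
    const std::size_t n = x.size();
    Vec y(n);
    for (std::size_t i = 0; i < n; ++i) {
        double s = d * x[i];
        if (i > 0) s += e * x[i - 1];
        if (i + 1 < n) s += e * x[i + 1];
        y[i] = s;
    }
    return y;
}

double Dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Textbook conjugate gradient starting from x = 0; needs T to be SPD.
Vec SolveCGSketch(double d, double e, const Vec& b, int maxIter, double tol) {
    Vec x(b.size(), 0.0), r = b, p = b;
    double rr = Dot(r, r);
    for (int k = 0; k < maxIter && std::sqrt(rr) > tol; ++k) {
        const Vec ap = SymTriMultiply(d, e, p);
        const double alpha = rr / Dot(p, ap);
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * ap[i];
        }
        const double rrNew = Dot(r, r);
        const double beta = rrNew / rr;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = p[i] * beta + r[i];
        rr = rrNew;
    }
    return x;
}

// Infinity norm of the true residual T * x - b, recomputed from scratch.
double TriResidualInfNorm(double d, double e, const Vec& x, const Vec& b) {
    const Vec ax = SymTriMultiply(d, e, x);
    double worst = 0.0;
    for (std::size_t i = 0; i < ax.size(); ++i)
        worst = std::max(worst, std::fabs(ax[i] - b[i]));
    return worst;
}
```

Checking the recomputed residual rather than the solver's internal recurrence gives a bound that does not depend on how the update sweeps are fused, so the same assertion holds before and after a refactor of the kind reviewed here.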

 #include <dal/platform/platform.hpp>
 #include <dal/platform/strict.hpp>
+#include <functional>
 #include <dal/math/matrix/bcg.hpp>
 #include <dal/math/matrix/sparse.hpp>
 #include <dal/utilities/algorithms.hpp>
 #include <dal/utilities/numerics.hpp>


+    double ResidualInfNorm(const Sparse::Square_& A, const Vector_<>& x, const Vector_<>& b) {
+        Vector_<> ax(x.size());
+        A.MultiplyLeft(x, &ax);
+        double worst = 0.0;
+        for (int i = 0; i < static_cast<int>(ax.size()); ++i)
+            worst = std::max(worst, std::fabs(ax[i] - b[i]));
+        return worst;
+    }


+// Tri-diagonal systems with distinct band values exercise every branch of the
+// fused Krylov sweeps. CG requires a symmetric positive-definite matrix, so it
+// gets a symmetric system; BCG handles the asymmetric case (and its shadow path).
+// The contract is residual ||Ax - b||_inf near machine precision after a tight solve.
+namespace {


… FP) P7 changes clang's auto-vectorization pattern (different from GCC even with -ffp-contract=fast on both), pushing JointCalibrationTest.TestJointOisCurveAgreesWithStagedOis drift from 7e-6 to 2.17e-5 (over the 1e-5 bar). P8 (banded TriMultiply hybrid fusion) is kept — it passes on all compilers. Co-Authored-By: Claude <noreply@anthropic.com>

wegamekinglc marked this pull request as ready for review June 28, 2026 10:22

Copilot AI review requested due to automatic review settings June 28, 2026 10:22

Copilot started reviewing on behalf of wegamekinglc June 28, 2026 10:22

Copilot AI reviewed Jun 28, 2026

wegamekinglc changed the title ~~perf: fuse CG/BCG Krylov sweeps and banded TriMultiply (FMA-aligned)~~ perf: fuse banded TriMultiply neighbour-correction loops Jun 28, 2026

wegamekinglc merged commit 143c035 into master Jun 28, 2026
45 checks passed

wegamekinglc merged 2 commits into master from perf/p7-p8-fused-sweeps

3 participants
