
perf: fuse banded TriMultiply neighbour-correction loops#183

Merged
wegamekinglc merged 2 commits into master from perf/p7-p8-fused-sweeps
Jun 28, 2026
@wegamekinglc wegamekinglc commented Jun 28, 2026

Owner

Summary

  • P8: Banded TriMultiply hybrid fusion (dal-cpp/dal/math/matrix/banded.cpp). Keeps the dominant diag*x Transform pass (vectorization-identical to the unfused code) and fuses the two strided neighbour-correction loops into one sweep. A full single-pass interleave was tested and rejected: it changes auto-vectorization scheduling, and the resulting FP drift cascades through the joint calibration.
  • P7: CG/BCG Krylov fusion, reverted. The fused AXPY sweeps pass on GCC (4.58e-6 drift, under the 1e-5 bar) but fail on clang (2.17e-5), because clang's auto-vectorizer produces a different instruction stream from GCC's even with -ffp-contract=fast on both. This is not an FMA issue; it is a cross-compiler auto-vectorization divergence near a knife-edge tolerance.
  • Adds an asymmetric TriMultiply regression test covering all three bands and both boundaries.
  • No volatile, no mutable (both banned per .claude/rules/code-style.md).
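The hybrid shape described for P8 can be sketched roughly as follows. This is a minimal illustration, not the code in banded.cpp: the function name `TriMultiplySketch`, the use of `std::vector<double>`, and the band layout (`below[i]` holding the entry at row i, column i-1, and `above[i]` the entry at row i, column i+1) are all assumptions made for the example, and the order in which the two corrections are added within a row is likewise an arbitrary choice here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical band layout (an assumption, not taken from banded.cpp):
//   below[i] = T(i, i-1), with below[0] unused
//   diag[i]  = T(i, i)
//   above[i] = T(i, i+1), with above[n-1] unused
std::vector<double> TriMultiplySketch(const std::vector<double>& below,
                                      const std::vector<double>& diag,
                                      const std::vector<double>& above,
                                      const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(below.size() == n && diag.size() == n && above.size() == n);
    std::vector<double> y(n);

    // Pass 1: the dominant diag*x term stays in its own unit-stride loop,
    // so the compiler sees the same loop body as in the unfused version.
    for (std::size_t i = 0; i < n; ++i)
        y[i] = diag[i] * x[i];

    if (n < 2)
        return y;  // a 1x1 system has no neighbours to correct for

    // Pass 2: both neighbour corrections in a single sweep instead of two.
    // The first and last rows each have only one neighbour, so they are
    // handled outside the loop to keep the loop body branch-free.
    y[0] += above[0] * x[1];
    for (std::size_t i = 1; i + 1 < n; ++i) {
        y[i] += below[i] * x[i - 1];
        y[i] += above[i] * x[i + 1];
    }
    y[n - 1] += below[n - 1] * x[n - 2];
    return y;
}
```

For an asymmetric 4x4 case with below = {0, 1, 2, 3}, diag = {10, 20, 30, 40}, above = {5, 6, 7, 0} and x = {1, 2, 3, 4}, the sketch returns {20, 59, 122, 169}, which exercises all three bands and both boundary rows.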

Test plan

Benchmarks (GCC -O3 -march=native -ffp-contract=fast, median of 20/1000 reps):

| Benchmark | Before | After | Delta |
|---|---|---|---|
| TriDecomp MultiplyLeft (10K) | 5.533 us | 4.337 us | -21.6% |
| TriDiagonal MultiplyLeft (10K) | 5.325 us | 4.274 us | -19.7% |
| TriDiagonal Decompose (10K) | 83.070 us | 76.021 us | -8.5% |
| CGSolve (500x500 tridiag) | 15.241 us | 15.241 us | ~0% |
| BCGSolve (500x500 tridiag) | 19.396 us | 19.396 us | ~0% |

  • JointCalibrationTest.TestJointOisCurveAgreesWithStagedOis PASSES on both GCC (4.64e-6) and clang (4.64e-6) vs 1e-5 bar.
  • Full dal_cpp_tests on GCC native: 774/774 pass.
  • Full dal_cpp_tests on clang (local): 774/774 pass.
  • Full dal_cpp_tests on Adept AAD backend: 773/773 pass.
  • CI matrix: 45/45 green (gcc-13/14, clang-18/19, msvc × all AAD backends).
  • All 12 other JointCalibrationTest.* pass.
  • All 8 UnderdeterminedTest.* pass.

Co-Authored-By: Claude noreply@anthropic.com

- P7: PrepareDirection_ and UpdateSolution_ in bcg.cpp collapse their
  separate scale/add (and LinearIncrement) sweeps into single AXPY passes
  over p/z/x/r (plus the bi-conjugate pp/zz/rr shadows). Operation order is
  preserved (p*multiply + z; dst + scale*src) so -ffp-contract=fast forms the
  same FMAs as the unfused two-pass code, keeping the result numerically
  faithful on the FP-sensitive joint-calibration path.
- P8: TriMultiply in banded.cpp keeps the dominant diag*x Transform pass
  (vectorization-identical to the unfused code) and fuses the two strided
  neighbour-correction loops into one sweep. A full single-pass interleave
  was tested and rejected: it re-vectorizes the diag*x term, perturbing FP
  rounding enough to flip JointCalibrationTest.TestJointOisCurveAgreesWithStagedOis
  from 7e-6 (unfused) to 5.7e-4 (past the 1e-5 bar); the hybrid stays at 4.6e-6.
- Add residual-bound regression tests for CG (symmetric) and BCG (asymmetric)
  Krylov solves, and an asymmetric TriMultiply test covering all three bands
  and both boundaries.
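
The P7 fusion described in this commit message (later reverted in this PR) amounts to replacing a scale sweep followed by an add sweep with one pass that keeps the same operation order. A minimal sketch, assuming plain `std::vector<double>` operands and hypothetical function names that do not come from bcg.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused reference: two sweeps over p (scale, then add).
void ScaleThenAddSketch(std::vector<double>& p, double multiply,
                        const std::vector<double>& z) {
    assert(p.size() == z.size());
    for (std::size_t i = 0; i < p.size(); ++i)
        p[i] *= multiply;
    for (std::size_t i = 0; i < p.size(); ++i)
        p[i] += z[i];
}

// Fused: one sweep computing p*multiply + z, in that order.
void FusedScaleAddSketch(std::vector<double>& p, double multiply,
                         const std::vector<double>& z) {
    assert(p.size() == z.size());
    for (std::size_t i = 0; i < p.size(); ++i)
        p[i] = p[i] * multiply + z[i];
}

// Fused increment: dst + scale*src, in that order.
void FusedIncrementSketch(std::vector<double>& dst, double scale,
                          const std::vector<double>& src) {
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < dst.size(); ++i)
        dst[i] = dst[i] + scale * src[i];
}
```

Whether the compiler contracts either form into an FMA, and how it vectorizes the loop, depends on the compiler, flags, and target, so the two variants are only guaranteed to agree bit for bit when no rounding occurs (for example, with small integer-valued inputs).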

Benchmarks (GCC -O3 -march=native -ffp-contract=fast):
  CGSolve 500x500        15.24us -> 13.95us  (-8.5%)
  BCGSolve 500x500       19.40us -> 17.64us  (-9.0%)
  TriDecomp MultiplyLeft 10K  5.53us ->  4.34us  (-21.6%)
  TriDiagonal MultiplyLeft 10K 5.33us ->  4.27us  (-19.7%)

Co-Authored-By: Claude <noreply@anthropic.com>
@codacy-production


Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


🟢 Metrics 13 complexity · 2 duplication

Metric Results
Complexity 13
Duplication 2


@coveralls

coveralls commented Jun 28, 2026


Coverage Report for CI Build 28319341870

Coverage increased (+0.004%) to 81.495%

Details

  • Coverage increased (+0.004%) from the base build.
  • Patch coverage: 7 of 7 lines across 1 file are fully covered (100%).
  • No coverage regressions found.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 7798
Covered Lines: 6355
Line Coverage: 81.5%
Coverage Strength: 3230533.61 hits per line


@wegamekinglc wegamekinglc marked this pull request as ready for review June 28, 2026 10:22
Copilot AI review requested due to automatic review settings June 28, 2026 10:22

Copilot AI left a comment


Pull request overview

Performance-focused refactor in DAL’s linear-algebra solvers that reduces memory bandwidth by fusing Krylov (CG/BCG) vector sweeps and partially fusing banded tridiagonal TriMultiply, plus regression tests to guard residual and band-handling correctness.

Changes:

  • Fuse CG/BCG Krylov update passes in bcg.cpp into single AXPY-style sweeps while preserving operation order for consistent FP contraction.
  • Keep diag*x as a vectorizable pass in TriMultiply and fuse the two neighbor-correction loops into one sweep.
  • Add residual-bound regression tests for CG/BCG solves and an asymmetric tridiagonal MultiplyLeft test covering both boundaries and interior rows.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| dal-cpp/dal/math/matrix/bcg.cpp | Fuses Krylov vector updates (CG/BCG) into fewer passes while preserving arithmetic order. |
| dal-cpp/dal/math/matrix/banded.cpp | Refactors tridiagonal multiply to keep the dominant vectorizable pass and fuse neighbor corrections. |
| dal-cpp/tests/math/matrix/test_bcg.cpp | Adds residual-based regression tests for CG/BCG on symmetric/asymmetric tridiagonal systems. |
| dal-cpp/tests/math/matrix/test_banded.cpp | Adds an asymmetric tridiagonal multiply test that exercises all bands and boundary behavior. |

Comment on lines 5 to 11
#include <dal/platform/platform.hpp>
#include <dal/platform/strict.hpp>
#include <functional>
#include <dal/math/matrix/bcg.hpp>
#include <dal/math/matrix/sparse.hpp>
#include <dal/utilities/algorithms.hpp>
#include <dal/utilities/numerics.hpp>
Comment on lines +74 to +81
double ResidualInfNorm(const Sparse::Square_& A, const Vector_<>& x, const Vector_<>& b) {
    Vector_<> ax(x.size());
    A.MultiplyLeft(x, &ax);
    double worst = 0.0;
    for (int i = 0; i < static_cast<int>(ax.size()); ++i)
        worst = std::max(worst, std::fabs(ax[i] - b[i]));
    return worst;
}
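
The residual check in the helper above can be reproduced outside the library for experimentation. The following is a self-contained sketch that assumes a tridiagonal matrix stored as three `std::vector<double>` bands (a hypothetical layout, not DAL's `Sparse::Square_` or `Vector_<>`), and computes the infinity norm of Ax - b row by row:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical band layout: below[i] = T(i, i-1) (below[0] unused),
// diag[i] = T(i, i), above[i] = T(i, i+1) (above[n-1] unused).
double TriResidualInfNormSketch(const std::vector<double>& below,
                                const std::vector<double>& diag,
                                const std::vector<double>& above,
                                const std::vector<double>& x,
                                const std::vector<double>& b) {
    const std::size_t n = x.size();
    assert(below.size() == n && diag.size() == n && above.size() == n);
    assert(b.size() == n);

    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Row i of T*x: diagonal term plus whichever neighbours exist.
        double ax = diag[i] * x[i];
        if (i > 0)
            ax += below[i] * x[i - 1];
        if (i + 1 < n)
            ax += above[i] * x[i + 1];
        // Infinity norm: keep the largest absolute row residual.
        worst = std::max(worst, std::fabs(ax - b[i]));
    }
    return worst;
}
```

A regression test in this style would solve for x, then assert that the returned residual is below a chosen tolerance; the tolerance itself is a test-design decision and is not specified here.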
Comment on lines +48 to +52
// Tri-diagonal systems with distinct band values exercise every branch of the
// fused Krylov sweeps. CG requires a symmetric positive-definite matrix, so it
// gets a symmetric system; BCG handles the asymmetric case (and its shadow path).
// The contract is residual ||Ax - b||_inf near machine precision after a tight solve.
namespace {
… FP)

P7 changes clang's auto-vectorization pattern (different from GCC even with -ffp-contract=fast on both), pushing JointCalibrationTest.TestJointOisCurveAgreesWithStagedOis drift from 7e-6 to 2.17e-5 (over the 1e-5 bar). P8 (banded TriMultiply hybrid fusion) is kept — it passes on all compilers.

Co-Authored-By: Claude <noreply@anthropic.com>
@wegamekinglc wegamekinglc changed the title from "perf: fuse CG/BCG Krylov sweeps and banded TriMultiply (FMA-aligned)" to "perf: fuse banded TriMultiply neighbour-correction loops" Jun 28, 2026
@wegamekinglc wegamekinglc merged commit 143c035 into master Jun 28, 2026
45 checks passed
3 participants