perf: fuse banded TriMultiply neighbour-correction loops#183
Merged
Conversation
- P7: PrepareDirection_ and UpdateSolution_ in bcg.cpp collapse their separate scale/add (and LinearIncrement) sweeps into single AXPY passes over p/z/x/r (plus the bi-conjugate pp/zz/rr shadows). Operation order is preserved (p*multiply + z; dst + scale*src) so -ffp-contract=fast forms the same FMAs as the unfused two-pass code, keeping the result numerically faithful on the FP-sensitive joint-calibration path. - P8: TriMultiply in banded.cpp keeps the dominant diag*x Transform pass (vectorization-identical to the unfused code) and fuses the two strided neighbour-correction loops into one sweep. A full single-pass interleave was tested and rejected: it re-vectorizes the diag*x term, perturbing FP rounding enough to flip JointCalibrationTest.TestJointOisCurveAgreesWithStagedOis from 7e-6 (unfused) to 5.7e-4 (past the 1e-5 bar); the hybrid stays at 4.6e-6. - Add residual-bound regression tests for CG (symmetric) and BCG (asymmetric) Krylov solves, and an asymmetric TriMultiply test covering all three bands and both boundaries. Benchmarks (GCC -O3 -march=native -ffp-contract=fast): CGSolve 500x500 15.24us -> 13.95us (-8.5%) BCGSolve 500x500 19.40us -> 17.64us (-9.0%) TriDecomp MultiplyLeft 10K 5.53us -> 4.34us (-21.6%) TriDiagonal MultiplyLeft 10K 5.33us -> 4.27us (-19.7%) Co-Authored-By: Claude <noreply@anthropic.com>
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 13 |
| Duplication | 2 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Coverage Report for CI Build 28319341870Coverage increased (+0.004%) to 81.495%Details
Uncovered ChangesNo uncovered changes found. Coverage RegressionsNo coverage regressions found. Coverage Stats
💛 - Coveralls |
There was a problem hiding this comment.
Pull request overview
Performance-focused refactor in DAL’s linear-algebra solvers that reduces memory bandwidth by fusing Krylov (CG/BCG) vector sweeps and partially fusing banded tridiagonal TriMultiply, plus regression tests to guard residual and band-handling correctness.
Changes:
- Fuse CG/BCG Krylov update passes in
bcg.cppinto single AXPY-style sweeps while preserving operation order for consistent FP contraction. - Keep
diag*xas a vectorizable pass inTriMultiplyand fuse the two neighbor-correction loops into one sweep. - Add residual-bound regression tests for CG/BCG solves and an asymmetric tridiagonal
MultiplyLefttest covering both boundaries and interior rows.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| dal-cpp/dal/math/matrix/bcg.cpp | Fuses Krylov vector updates (CG/BCG) into fewer passes while preserving arithmetic order. |
| dal-cpp/dal/math/matrix/banded.cpp | Refactors tridiagonal multiply to keep the dominant vectorizable pass and fuse neighbor corrections. |
| dal-cpp/tests/math/matrix/test_bcg.cpp | Adds residual-based regression tests for CG/BCG on symmetric/asymmetric tridiagonal systems. |
| dal-cpp/tests/math/matrix/test_banded.cpp | Adds an asymmetric tridiagonal multiply test that exercises all bands and boundary behavior. |
Comment on lines
5
to
11
| #include <dal/platform/platform.hpp> | ||
| #include <dal/platform/strict.hpp> | ||
| #include <functional> | ||
| #include <dal/math/matrix/bcg.hpp> | ||
| #include <dal/math/matrix/sparse.hpp> | ||
| #include <dal/utilities/algorithms.hpp> | ||
| #include <dal/utilities/numerics.hpp> |
Comment on lines
+74
to
+81
| double ResidualInfNorm(const Sparse::Square_& A, const Vector_<>& x, const Vector_<>& b) { | ||
| Vector_<> ax(x.size()); | ||
| A.MultiplyLeft(x, &ax); | ||
| double worst = 0.0; | ||
| for (int i = 0; i < static_cast<int>(ax.size()); ++i) | ||
| worst = std::max(worst, std::fabs(ax[i] - b[i])); | ||
| return worst; | ||
| } |
Comment on lines
+48
to
+52
| // Tri-diagonal systems with distinct band values exercise every branch of the | ||
| // fused Krylov sweeps. CG requires a symmetric positive-definite matrix, so it | ||
| // gets a symmetric system; BCG handles the asymmetric case (and its shadow path). | ||
| // The contract is residual ||Ax - b||_inf near machine precision after a tight solve. | ||
| namespace { |
… FP) P7 changes clang's auto-vectorization pattern (different from GCC even with -ffp-contract=fast on both), pushing JointCalibrationTest.TestJointOisCurveAgreesWithStagedOis drift from 7e-6 to 2.17e-5 (over the 1e-5 bar). P8 (banded TriMultiply hybrid fusion) is kept — it passes on all compilers. Co-Authored-By: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
dal-cpp/dal/math/matrix/banded.cpp): keeps the dominantdiag*xTransformpass (vectorization-identical to the unfused code) and fuses the two strided neighbour-correction loops into one sweep. A full single-pass interleave was tested and rejected — it changes auto-vectorization scheduling and cascades FP drift through the joint calibration.-ffp-contract=faston both. This is not an FMA issue — it's a cross-compiler auto-vectorization divergence near a knife-edge tolerance.TriMultiplyregression test covering all three bands and both boundaries.volatile, nomutable(both banned per.claude/rules/code-style.md).Test plan
Benchmarks (GCC
-O3 -march=native -ffp-contract=fast, median of 20/1000 reps):JointCalibrationTest.TestJointOisCurveAgreesWithStagedOisPASSES on both GCC (4.64e-6) and clang (4.64e-6) vs 1e-5 bar.dal_cpp_testson GCC native: 774/774 pass.dal_cpp_testson clang (local): 774/774 pass.dal_cpp_testson Adept AAD backend: 773/773 pass.JointCalibrationTest.*pass.UnderdeterminedTest.*pass.Co-Authored-By: Claude noreply@anthropic.com