Add LoopVectorization tensor-product fast path extension #369
Merged
ChrisRackauckas merged 1 commit on May 6, 2026
Comment on lines +416 to +420
_tensor_outer_mul_fast!(w, outer, C, mi::Int, mo::Int, no::Int, k::Int) = false
function _tensor_outer_mul_fast!(w, outer, C, mi::Int, mo::Int, no::Int, k::Int, α, β)
    return false
end
Member
seems weird to mix the control flow with the actual call?
Contributor
Author
this has been separated
9fe4176 to 2253d76
    C1 = reshape(C1, (mi, no))
    mul!(transpose(W), outer, transpose(C1))
    return w
elseif _has_tensor_outer_mul_fast(outer)
Member
if it's the elseif, isn't it bypassed?
Contributor
Author
yes, good catch
2253d76 to bd70324
bd70324 to aff55b2
Revision note
Thanks for the early review comments. I revised the PR to separate fast-path availability/control flow from the mutating implementation call, rewrote the `k == 1`/batched fast-path branch so the flow is explicit, and removed unrelated cache-slice edits plus a redundant test block. Hopefully this is now back on track as a minimal extension-backed optimization.
Summary
This PR adds an optional fast path for batched `TensorProductOperator` multiplication when the outer operator is a `MatrixOperator` wrapping a CPU `StridedMatrix`.
The core tensor implementation now exposes a small internal backend-neutral hook, `_tensor_outer_mul_fast!`, guarded by a separate `_has_tensor_outer_mul_fast` predicate. The existing generic implementation remains the fallback and still uses only the operator interface (`mul!`) after reshaping/permuting into the layout it needs. That keeps arbitrary matrix-free operators, sparse matrices, GPU arrays, and other non-strided operators on the abstraction-preserving path.
A new `SciMLOperatorsLoopVectorizationExt` extension overloads this hook with an `@turbo` loop for the narrow dense-strided `MatrixOperator` case. `LoopVectorization` is a weak dependency and a test dependency, not a hard dependency of SciMLOperators.
Backend design
The hook is intentionally not named after LoopVectorization. Core owns a generic availability predicate plus an implementation hook:
The predicate keeps control flow separate from the mutating fast-path call. Backend extensions opt in by overloading the predicate for supported operator types and providing the corresponding `_tensor_outer_mul_fast!` methods.
That means LoopVectorization is only the first backend. If LoopModels, or another successor backend, becomes the preferred implementation later, it can live in its own extension and overload the same hook without changing the core tensor-product algorithm or public API.
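The predicate-plus-hook arrangement described above can be sketched in plain Julia as follows. This is an illustration only, not the PR's code: `DenseOp`, `has_outer_mul_fast`, `outer_mul_fast!`, `outer_mul_generic!`, and `outer_mul!` are hypothetical stand-ins, and the fast path uses an ordinary loop rather than `@turbo` so the snippet needs nothing beyond `LinearAlgebra`. It assumes the batched contraction is `W[i, a, b] = sum_j A[a, j] * C[i, j, b]`, which is what the `k == 1` branch quoted in the review computes for a single batch.

```julia
using LinearAlgebra

# Hypothetical stand-in for an operator wrapping a matrix (not the real MatrixOperator).
struct DenseOp{M <: AbstractMatrix}
    A::M
end

# "Core" side: the predicate defaults to false, so every operator takes the
# generic path unless a backend explicitly opts in.
has_outer_mul_fast(outer) = false

# "Core" side: generic fallback that only needs mul! once the data has been
# permuted into matrix layout (permutedims + mul! + permutedims!).
function outer_mul_generic!(W, outer::DenseOp, C)
    mi, no, k = size(C)
    mo = size(outer.A, 1)
    Cp = permutedims(C, (2, 1, 3))                      # (no, mi, k), contiguous copy
    Wp = similar(W, mo, mi * k)
    mul!(Wp, outer.A, reshape(Cp, no, mi * k))          # (mo, mi * k)
    permutedims!(W, reshape(Wp, mo, mi, k), (2, 1, 3))  # back to (mi, mo, k)
    return W
end

# "Backend" side (would live in a package extension): opt in for strided
# matrices only, and provide the matching mutating implementation.
has_outer_mul_fast(::DenseOp{<:StridedMatrix}) = true

function outer_mul_fast!(W, outer::DenseOp{<:StridedMatrix}, C)
    A = outer.A
    mi, no, k = size(C)
    mo = size(A, 1)
    # Direct indexed contraction; no temporary permuted copies are made.
    @inbounds for b in 1:k, a in 1:mo, i in 1:mi
        acc = zero(eltype(W))
        for j in 1:no
            acc += A[a, j] * C[i, j, b]
        end
        W[i, a, b] = acc
    end
    return W
end

# Caller: the predicate decides the branch, the hook does the work.
function outer_mul!(W, outer, C)
    if has_outer_mul_fast(outer)
        return outer_mul_fast!(W, outer, C)
    end
    return outer_mul_generic!(W, outer, C)
end

A = rand(3, 4)
C = rand(5, 4, 2)
W_fast = outer_mul!(zeros(5, 3, 2), DenseOp(A), C)

# A lazy transpose wrapper holds the same values as A but is not a
# StridedMatrix, so this call goes through the generic path.
A_lazy = transpose(permutedims(A))
W_generic = outer_mul!(zeros(5, 3, 2), DenseOp(A_lazy), C)
```

Both calls should produce the same array up to floating-point rounding; the only difference is which method did the work. Because the predicate is a separate function, the caller's branch stays readable and a backend cannot be reached for an operator type it did not opt in for.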
The intended backend-switching model is:
- `SciMLOperatorsLoopVectorizationExt` owns today’s `@turbo` dense-strided implementation.
- An `@tturbo` method could be added later, for users who want to experiment with threaded LoopVectorization behavior, without changing the core tensor code. This PR intentionally uses only `@turbo` as the safer default because it avoids thread oversubscription concerns with threaded BLAS.
- `SciMLOperatorsLoopModelsExt` can provide the same hook for LoopModels once its API is ready.
GPU compatibility
This PR should not change existing GPU behavior by dispatch. The new LoopVectorization method only applies to `MatrixOperator` wrapping a `StridedMatrix`, so GPU arrays such as `CuArray` should not be caught by the CPU scalar-indexing `@turbo` path. GPU-backed operators continue to use the existing generic fallback, which relies on the current `permutedims!`/`mul!` behavior for those array/operator types.
I did not run GPU tests locally, so this is a dispatch/abstraction argument rather than a measured GPU validation.
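A dispatch argument like this can be spot-checked for CPU array types without running any benchmark. The sketch below uses a hypothetical predicate, `takes_fast_path`, that is not part of the PR, and only standard-library types. It does not check any GPU array type: whether a particular GPU array is a `StridedMatrix` depends on how its package declares the type, so that would need to be verified separately with the GPU package loaded.

```julia
using LinearAlgebra

# Hypothetical predicate: strided matrices opt in, everything else falls
# through to the more general AbstractMatrix method.
takes_fast_path(::StridedMatrix) = true
takes_fast_path(::AbstractMatrix) = false

A = rand(4, 4)

dense_ok     = takes_fast_path(A)                 # true: Matrix is a DenseArray
view_ok      = takes_fast_path(view(A, 1:2, :))   # true: range-indexed view of a Matrix
diagonal_ok  = takes_fast_path(Diagonal(A))       # false: structured matrix type
transpose_ok = takes_fast_path(transpose(A))      # false: lazy wrapper type
```

The same kind of check, run against the actual operator and array types a user cares about, would show which method a given call selects.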
Performance notes
Local benchmarks on the existing `benchmarks/tensor.jl` setup showed that the extension is active only when `LoopVectorization` is loaded. These numbers were collected on Apple Silicon, so they should be treated as a conservative local check rather than the expected best case. The improvement may be larger on Intel AVX/AVX2 machines, which are closer to the hardware context discussed in #58 and where LoopVectorization's SIMD code generation has historically shown larger gains.
I also tried the other idea from #58: replacing explicit `permutedims!` calls with `PermutedDimsArray`. That benchmarked substantially worse for the dense batched cases, likely because BLAS then sees a lazy/strided permuted input instead of contiguous storage. This PR therefore keeps the existing explicit permutation fallback and only bypasses it for the narrow `@turbo` extension path.
Without `LoopVectorization`, the fallback path remains at baseline:
With `LoopVectorization` loaded:
This follows the discussion in #58: avoid the generic `permutedims! + mul! + permutedims!` path only where fast indexing is actually valid, rather than assuming all operators are indexable.
Tests
- Adds `LoopVectorization` to the test environment so the existing dense batched `TensorProductOperator` plain/scaled `mul!` tests exercise the extension path.
- `Pkg.test()` passes locally with `LoopVectorization` in the test environment: