Add lane construction and composition APIs #127690
hez2010 wants to merge 31 commits into dotnet:main
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds new vector sequence-generation helpers (geometric/alternating/harmonic/cauchy), sign-sequence helpers, and lane-manipulation operations (zip/unzip/concat/reverse) across Vector<T> and Vector{64,128,256,512}<T>, including JIT recognition and test coverage.
Changes:
- Introduces new public APIs in the ref assemblies for sequence creation + lane operations and `SignSequence`.
- Implements the APIs in CoreLib for `Vector<T>` and `Vector{64,128,256,512}<T>`, with some JIT fast-paths.
- Adds unit tests validating the new behaviors across vector widths.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/libraries/System.Runtime.Intrinsics/tests/Vectors/Vector64Tests.cs | Adds tests for new Vector64 sequence + lane APIs |
| src/libraries/System.Runtime.Intrinsics/tests/Vectors/Vector128Tests.cs | Adds tests for new Vector128 sequence + lane APIs |
| src/libraries/System.Runtime.Intrinsics/tests/Vectors/Vector256Tests.cs | Adds tests for new Vector256 sequence + lane APIs |
| src/libraries/System.Runtime.Intrinsics/tests/Vectors/Vector512Tests.cs | Adds tests for new Vector512 sequence + lane APIs |
| src/libraries/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs | Exposes new Vector{64,128,256,512} APIs in the reference contract |
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector64_1.cs | Adds Vector64<T>.SignSequence |
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector64.cs | Implements Vector64 sequence + lane APIs |
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128_1.cs | Adds Vector128<T>.SignSequence |
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs | Implements Vector128 sequence + lane APIs |
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector256_1.cs | Adds Vector256<T>.SignSequence |
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector256.cs | Implements Vector256 sequence + lane APIs |
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector512_1.cs | Adds Vector512<T>.SignSequence |
| src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector512.cs | Implements Vector512 sequence + lane APIs + AVX-512 special-cases |
| src/libraries/System.Private.CoreLib/src/System/Numerics/Vector_1.cs | Adds Vector<T>.SignSequence |
| src/libraries/System.Private.CoreLib/src/System/Numerics/Vector.cs | Implements Vector sequence + lane APIs |
| src/libraries/System.Numerics.Vectors/tests/GenericVectorTests.cs | Adds tests for new System.Numerics.Vector APIs |
| src/libraries/System.Numerics.Vectors/ref/System.Numerics.Vectors.cs | Exposes new System.Numerics.Vector APIs in the reference contract |
| src/coreclr/jit/hwintrinsicxarch.cpp | Adds xarch JIT special-import support for new intrinsics |
| src/coreclr/jit/hwintrinsiclistxarch.h | Registers new xarch HW intrinsic IDs |
| src/coreclr/jit/hwintrinsicarm64.cpp | Adds arm64 JIT special-import support for new intrinsics |
| src/coreclr/jit/hwintrinsiclistarm64.h | Registers new arm64 HW intrinsic IDs |
| src/coreclr/jit/compiler.h | Declares new SIMD IR node builders used by importer/lowering |
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (4)
src/coreclr/jit/gentree.cpp:1
- Several simdCount==1 special-cases use gtWrapWithSideEffects in a way that can reverse the required left-to-right argument evaluation order (op1 then op2) for intrinsics like Concat/Zip/Unzip, and for CreateAlternatingSequence. This is observable for Vector64/Vector64 (and similar) where the vector length is 1 and arguments may have side-effects. Consider materializing op1 into a temp before sequencing op2, or otherwise constructing the tree so op1 is evaluated before op2 while still returning op1 (or the correct constant for UnzipOdd).
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
    if (simdSize == 16)
    {
        bool supportsX86BaseShuffle =
            (simdBaseType == TYP_INT) || (simdBaseType == TYP_UINT) || (simdBaseType == TYP_FLOAT);

        if (!supportsX86BaseShuffle && !compOpportunisticallyDependsOn(InstructionSet_AVX2))
        {
            break;
        }
    }
    /// <summary>Creates a new <see cref="Vector64{T}" /> instance whose elements are the reciprocal of an arithmetic sequence.</summary>
    /// <typeparam name="T">The type of the elements in the vector.</typeparam>
    /// <param name="start">The value that element 0 of the arithmetic sequence will be initialized to.</param>
    /// <param name="step">The value that indicates how far apart each element of the arithmetic sequence should be from the previous.</param>
    /// <returns>A new <see cref="Vector64{T}" /> instance whose elements are initialized to one divided by the corresponding element of the arithmetic sequence.</returns>
    /// <exception cref="NotSupportedException">The type of <paramref name="start"/> and <paramref name="step"/> (<typeparamref name="T" />) is not supported.</exception>
    [Intrinsic]
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector64<T> CreateHarmonicSequence<T>(T start, T step) => Vector64<T>.One / CreateSequence(start, step);

    /// <summary>Creates a new <see cref="Vector64{T}" /> instance whose elements are the square root of an arithmetic sequence.</summary>
    /// <typeparam name="T">The type of the elements in the vector.</typeparam>
    /// <param name="start">The value that element 0 of the arithmetic sequence will be initialized to.</param>
    /// <param name="step">The value that indicates how far apart each element of the arithmetic sequence should be from the previous.</param>
    /// <returns>A new <see cref="Vector64{T}" /> instance whose elements are initialized to the square root of the corresponding element of the arithmetic sequence.</returns>
    /// <exception cref="NotSupportedException">The type of <paramref name="start"/> and <paramref name="step"/> (<typeparamref name="T" />) is not supported.</exception>
    [Intrinsic]
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector64<T> CreateCauchySequence<T>(T start, T step) => Sqrt(CreateSequence(start, step));
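
As a rough illustration of what the two one-liners above compute, the sketch below reproduces them on `Vector128<float>` using only APIs that already ship. `SequenceSketch`, `Arithmetic`, `Harmonic` and `SquareRoot` are hypothetical names, not part of the PR, and the arithmetic sequence is built by hand as a stand-in for `CreateSequence`.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class SequenceSketch
{
    // Stand-in for CreateSequence(start, step): lane i holds start + i * step.
    public static Vector128<float> Arithmetic(float start, float step) =>
        Vector128.Create(start, start + step, start + (2 * step), start + (3 * step));

    // Mirrors CreateHarmonicSequence: one divided by each lane of the arithmetic sequence.
    public static Vector128<float> Harmonic(float start, float step) =>
        Vector128<float>.One / Arithmetic(start, step);

    // Mirrors CreateCauchySequence as defined in this PR: square root of each lane.
    public static Vector128<float> SquareRoot(float start, float step) =>
        Vector128.Sqrt(Arithmetic(start, step));
}
```

For example, `SequenceSketch.Harmonic(1f, 1f)` yields roughly `<1, 0.5, 0.333, 0.25>`, and `SequenceSketch.SquareRoot(1f, 3f)` takes the square roots of `<1, 4, 7, 10>`, so its first two lanes are `1` and `2`.
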
    int count = Vector64<T>.Count;
    Unsafe.SkipInit(out Vector64<T> result);

    for (int index = 0; index < count; index++)
    {
        result.SetElementUnsafe(index, vector.GetElementUnsafe(count - 1 - index));
    }

    return result;
This can just be `Shuffle(vector, CreateSequence(Vector<T>.Count - 1, -1))`.

`Shuffle` doesn't have a generic variant.

You could define an internal one constrained to `where T : IBinaryInteger`.
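
To make the suggestion concrete for one element type, here is a minimal sketch that reverses a `Vector128<int>` with the existing non-generic `Vector128.Shuffle` overload and checks it against an element-by-element loop. `ReverseSketch` and both method names are hypothetical. The descending index vector is written out by hand, and the loop uses the public `GetElement`/`WithElement` rather than the internal unsafe accessors from the PR.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class ReverseSketch
{
    // Reverses the four int lanes by shuffling with descending indices [3, 2, 1, 0].
    public static Vector128<int> ReverseViaShuffle(Vector128<int> vector)
    {
        Vector128<int> indices = Vector128.Create(3, 2, 1, 0);
        return Vector128.Shuffle(vector, indices);
    }

    // Element-by-element reference with the same shape as the loop under review.
    public static Vector128<int> ReverseViaLoop(Vector128<int> vector)
    {
        int count = Vector128<int>.Count;
        Vector128<int> result = Vector128<int>.Zero;

        for (int index = 0; index < count; index++)
        {
            result = result.WithElement(index, vector.GetElement(count - 1 - index));
        }

        return result;
    }
}
```

Calling `ReverseSketch.ReverseViaShuffle(Vector128.Create(10, 20, 30, 40))` gives `<40, 30, 20, 10>`, the same as the loop. A generic version would still need the generic `Shuffle` helper discussed above.
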
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (2)
src/coreclr/jit/gentree.cpp:1
- For floating-point base types, the `isPartial` lowering changes the evaluation order compared to the documented recurrence (each element = previous * multiplier). Building a constant power vector `[1, m, m^2, ...]` and doing a single broadcast-multiply computes `initial * (m^n)` with different rounding than repeated multiplication of the previous element, so results can diverge for `float`/`double` when `initial` is not constant. To preserve semantics, either (mandatory) avoid the `isPartial` transformation for `TYP_FLOAT`/`TYP_DOUBLE` (require `op1` constant / bail out to the managed implementation), or generate a lowering that preserves the stepwise rounding behavior for FP.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (2)
src/coreclr/jit/gentree.cpp:1
- The `isPartial` expansion for CreateGeometricSequence computes `[1, m, m^2, ...]` as a constant vector and then does a single vector multiply by `initial`. For floating-point, this changes rounding compared to the per-lane recurrence implied by the API docs (and by the Vector64 implementation’s sequential multiply), because each lane becomes `initial * (m^k)` rather than repeated multiplications from the previous element. Consider either (1) defining & documenting the FP semantics explicitly as `initial * (multiplier^index)` and updating managed fallbacks/tests accordingly, or (2) emitting an expansion that preserves the recurrence for FP (even if it is more instructions), or (3) restricting this partial form to integral types where associativity holds.
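
The two evaluation orders can be compared in scalar code. The sketch below is only an illustration of the concern, not the JIT expansion itself; `GeometricRoundingSketch`, `ByRecurrence` and `ByPowerTable` are hypothetical names.

```csharp
using System;

public static class GeometricRoundingSketch
{
    // Recurrence: each lane is the previous lane times the multiplier.
    public static float[] ByRecurrence(float initial, float multiplier, int count)
    {
        var lanes = new float[count];
        float current = initial;

        for (int i = 0; i < count; i++)
        {
            lanes[i] = current;
            current *= multiplier;
        }

        return lanes;
    }

    // Partial form: build the powers [1, m, m^2, ...] first, then scale each by initial.
    public static float[] ByPowerTable(float initial, float multiplier, int count)
    {
        var lanes = new float[count];
        float power = 1.0f;

        for (int i = 0; i < count; i++)
        {
            lanes[i] = initial * power;
            power *= multiplier;
        }

        return lanes;
    }
}
```

With exactly representable inputs such as `initial = 3` and `multiplier = 2`, both forms agree in every lane. With inputs like `0.1f` and `1.1f`, lanes from index 2 onward may differ in the last bit, because the intermediate products are rounded at different points.
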
    for (int index = 0; index < Vector128<float>.Count; index++)
    {
        AssertExtensions.Equal(expected, sequence.GetElement(index));
        expected *= multiplier;
    }

    for (int index = 0; index < Vector128<double>.Count; index++)
    {
        AssertExtensions.Equal(expected, sequence.GetElement(index));
        expected *= multiplier;
    }
Test failures are expected for now, as I haven't yet added a tolerance for the rounding difference that the pow-based expansion introduces by reassociating the multiplications.
This PR adds lane construction and composition APIs approved in #122557, and the corresponding JIT intrinsics.
The JIT now recognizes the new vector APIs and expands them using existing SIMD nodes. The managed implementation allows decomposition through smaller vector widths when wider hardware support is unavailable.
The xarch lowering uses fixed shuffle forms where profitable:
- `vpbroadcast*` for sequence and alternating construction
- `vshufps` for 128-bit concat/unzip patterns
- `vperm2i128` for 256-bit zip/unzip

The ARM64 lowering avoids table-lookup forms for small fixed concat/reverse operations and uses direct element moves where applicable, such as `ins` and `rev64`.

`CreateCauchySequence` requires constant folding of `sqrt` in the JIT to produce optimal code, but I would like to leave that for now as it is out of scope for this PR.

Codegen:
Vector128
Vector256
Vector512
ARM64 (Vector64 + Vector128)
Codegen for constant input:
Vector256:
Vector512:
Vector512 (without AVX512 - Vector256 decomposition path):
Closes #122557
cc: @tannergooding