Expose various integer intrinsics for Avx512F, Avx512BW, and Avx512CD #85833

tannergooding · 2023-05-05T18:11:00Z

This exposes some instructions unique to the AVX512 family of instructions making progress towards completing:

Remaining after this are the Scatter/Gather and Shuffle/Permute intrinsics

dotnet-issue-labeler · 2023-05-05T18:11:08Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

ghost · 2023-05-05T18:11:15Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	tannergooding
Assignees:	tannergooding
Labels:	`area-CodeGen-coreclr`, `new-api-needs-documentation`
Milestone:	-

tannergooding · 2023-05-05T18:46:52Z

src/coreclr/jit/hwintrinsic.cpp

+const TernaryLogicInfo& TernaryLogicInfo::lookup(uint8_t control)
+{
+    // clang-format off
+    static const TernaryLogicInfo ternaryLogicFlags[256] = {


This table is 768 bytes and is about as small as we can make it.

The way the constants work is we have three keys:

A: 0xF0

B: 0xCC

C: 0xAA

To compute the correct control byte, you simply perform the corresponding operation on these keys. So, if you wanted to do (A & B) ^ C, you would compute (0xF0 & 0xCC) ^ 0xAA or 0x6A.
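A minimal sketch of the key arithmetic above, using only standard C++. The names `KeyA`, `KeyB`, `KeyC`, and `evalTernary` are illustrative and are not taken from the JIT. The emulation treats the control byte as an 8-entry truth table indexed by the three input bits, which is why applying an operation to the keys yields the control byte for that operation.

```cpp
#include <cstdint>

// The three keys from the comment above (constant names are hypothetical).
constexpr uint8_t KeyA = 0xF0;
constexpr uint8_t KeyB = 0xCC;
constexpr uint8_t KeyC = 0xAA;

// Applying the desired operation to the keys yields the control byte.
constexpr uint8_t ControlAndABXorC = static_cast<uint8_t>((KeyA & KeyB) ^ KeyC);
static_assert(ControlAndABXorC == 0x6A, "(A & B) ^ C encodes as 0x6A");

// Software emulation of a ternary logic operation on 32-bit values.
// For each bit position, the bits of a, b, and c form a 3-bit index
// (a is the most significant bit), and the result bit is the bit of
// the control byte at that index.
constexpr uint32_t evalTernary(uint32_t a, uint32_t b, uint32_t c, uint8_t control)
{
    uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit)
    {
        const uint32_t index = (((a >> bit) & 1u) << 2) |
                               (((b >> bit) & 1u) << 1) |
                               ((c >> bit) & 1u);
        result |= static_cast<uint32_t>((control >> index) & 1u) << bit;
    }
    return result;
}
```

Passing `0x6A` as the control to `evalTernary` should give the same result as computing `(a & b) ^ c` directly.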

This table allows us to compute the inverse information, that is given a control, what are the operations it performs. This allows us to determine things like what operands are actually used (so we can correctly compute isRMW) and what operations are performed and in what order (such that we can do constant folding in the future).
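As a rough illustration of the "which operands are actually used" part, the sketch below derives that information arithmetically rather than from a table. This is not the PR's approach (the PR uses the 256-entry lookup); `UseFlags` and `usedOperands` are hypothetical names. An operand is unused when flipping it never changes the truth table, which can be checked by comparing the two halves of the control byte that differ only in that operand.

```cpp
#include <cstdint>

// Hypothetical flags for "which operands does this control byte read".
enum UseFlags : uint8_t
{
    UseNone = 0,
    UseA    = 1,
    UseB    = 2,
    UseC    = 4,
};

constexpr uint8_t usedOperands(uint8_t control)
{
    uint8_t flags = UseNone;

    // A is used if the a=1 entries (bits 4-7) differ from the a=0 entries (bits 0-3).
    if (((control >> 4) & 0x0F) != (control & 0x0F))
        flags = static_cast<uint8_t>(flags | UseA);

    // B is used if the b=1 entries (bits 2,3,6,7) differ from the b=0 entries (bits 0,1,4,5).
    if (((control >> 2) & 0x33) != (control & 0x33))
        flags = static_cast<uint8_t>(flags | UseB);

    // C is used if the c=1 entries (odd bits) differ from the c=0 entries (even bits).
    if (((control >> 1) & 0x55) != (control & 0x55))
        flags = static_cast<uint8_t>(flags | UseC);

    return flags;
}
```

For example, `usedOperands(0xC0)` (the control for `A & B`) reports only A and B, while `0x6A` reports all three.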

The total set of operations supported are:

true: AllBitsSet

false: Zero

not: ~value

and: left & right

nand: ~(left & right)

or: left | right

nor: ~(left | right)

xor: left ^ right

xnor: ~(left ^ right)

cndsel: a ? b : c; aka (B & A) | (C & ~A)

major: 0 if two+ input bits are 0

minor: 1 if two+ input bits are 0
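The operations listed above can each be written as a control byte by applying them to the keys. The sketch below does this for a few of them; the constant names are illustrative, and treating `minor` as the complement of `major` is read off the two definitions in the list rather than stated in the thread.

```cpp
#include <cstdint>

constexpr uint8_t KeyA = 0xF0;
constexpr uint8_t KeyB = 0xCC;
constexpr uint8_t KeyC = 0xAA;

// Truncates the promoted int result of a bitwise expression back to 8 bits.
constexpr uint8_t u8(int value)
{
    return static_cast<uint8_t>(value);
}

constexpr uint8_t CtrlTrue   = 0xFF;                       // AllBitsSet
constexpr uint8_t CtrlFalse  = 0x00;                       // Zero
constexpr uint8_t CtrlNotA   = u8(~KeyA);                  // ~A
constexpr uint8_t CtrlAndAB  = u8(KeyA & KeyB);            // A & B
constexpr uint8_t CtrlNandAB = u8(~(KeyA & KeyB));         // ~(A & B)
constexpr uint8_t CtrlOrAB   = u8(KeyA | KeyB);            // A | B
constexpr uint8_t CtrlNorAB  = u8(~(KeyA | KeyB));         // ~(A | B)
constexpr uint8_t CtrlXorAB  = u8(KeyA ^ KeyB);            // A ^ B
constexpr uint8_t CtrlXnorAB = u8(~(KeyA ^ KeyB));         // ~(A ^ B)
constexpr uint8_t CtrlCndSel = u8((KeyB & KeyA) | (KeyC & ~KeyA)); // A ? B : C
constexpr uint8_t CtrlMajor  = u8((KeyA & KeyB) | (KeyA & KeyC) | (KeyB & KeyC));
constexpr uint8_t CtrlMinor  = u8(~CtrlMajor);

static_assert(CtrlNotA == 0x0F, "not A");
static_assert(CtrlAndAB == 0xC0 && CtrlNandAB == 0x3F, "and / nand");
static_assert(CtrlOrAB == 0xFC && CtrlNorAB == 0x03, "or / nor");
static_assert(CtrlXorAB == 0x3C && CtrlXnorAB == 0xC3, "xor / xnor");
static_assert(CtrlCndSel == 0xCA, "conditional select");
static_assert(CtrlMajor == 0xE8 && CtrlMinor == 0x17, "major / minor");
```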

Put this comment in the source?

tannergooding · 2023-05-09T17:13:53Z

Resolved feedback and fixed an issue where we were double encoding a vvvv register for emitOutputAM. Expecting everything to pass CI again this time.

tannergooding · 2023-05-09T18:19:01Z

Initial diffs from the explicit vpternlog usage and containment fixes are good.

We see 9k saved bytes in minopts and 10k saved bytes in fullopts for Windows x64.

There is a small throughput (TP) hit (up to 0.1%) that shows up for SIMD-heavy code paths. This primarily comes from updating the IF_RRD_*RD_CNS (for ARD, MRD, SRD) paths to handle hasCodeMI(ins). This ends up impacting several intrinsics, including Extract, Shift, and Shuffle, as they now need to execute more code and are prominently used in various library paths.

jakobbotsch

Still not super comfortable with the complexity of what happens in the importer, but I don't have a good way of simplifying and don't see anything obviously wrong anymore, so LGTM.

You may want to get a review from @BruceForstall or @kunalspathak for the emitter changes. And perhaps @EgorBo wants to look at the vectorization changes.

EgorBo · 2023-05-09T22:55:50Z

src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/Avx512F.cs

+            ///   VPTERNLOGD xmm1 {k1}{z}, xmm2, xmm3/m128, imm8
+            /// The above native signature does not exist. We provide this additional overload for consistency with the other bitwise APIs.
+            /// </summary>
+            public static Vector128<sbyte> TernaryLogic(Vector128<sbyte> a, Vector128<sbyte> b, Vector128<sbyte> c, [ConstantExpected] byte control) => TernaryLogic(a, b, c, control);


What's [ConstantExpected] by the way and do we switch-expand if it's called with a variable?

It's an attribute backed by an analyzer that flags cases where users aren't passing in a constant (or a value that is itself marked [ConstantExpected]). It's not perfect, but it gets most of the job done and flags the obvious cases where we'll have to pessimize.

Yes, we fall back to the recursive call and switch-expand in that. Moving that expansion to a late phase (such as lowering) is desirable, but it requires various ABI fixes so the call is happy with the SIMD args again.
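The fallback described above can be pictured with a loose C++ analogy: an operation that needs its control as a compile-time constant, plus a dispatcher that selects one of 256 constant instantiations when the control is only known at run time. This is only an illustration of the idea; it is not how the JIT implements the expansion, and `ternaryLogicConst`, `makeDispatchTable`, and `ternaryLogicDynamic` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// Stand-in for an operation whose control byte must be a compile-time constant.
template <uint8_t Control>
uint32_t ternaryLogicConst(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t result = 0;
    for (unsigned i = 0; i < 8; ++i)
    {
        if ((Control >> i) & 1u)
        {
            // Minterm i is set in the bit positions where (a, b, c) match the bits of i.
            const uint32_t ta = (i & 4u) ? a : ~a;
            const uint32_t tb = (i & 2u) ? b : ~b;
            const uint32_t tc = (i & 1u) ? c : ~c;
            result |= ta & tb & tc;
        }
    }
    return result;
}

using TernaryFn = uint32_t (*)(uint32_t, uint32_t, uint32_t);

// Builds one table entry per possible control byte.
template <std::size_t... I>
constexpr std::array<TernaryFn, sizeof...(I)> makeDispatchTable(std::index_sequence<I...>)
{
    return {{&ternaryLogicConst<static_cast<uint8_t>(I)>...}};
}

// Pessimized path for a non-constant control: dispatch over all 256 constants.
inline uint32_t ternaryLogicDynamic(uint32_t a, uint32_t b, uint32_t c, uint8_t control)
{
    static constexpr auto table = makeDispatchTable(std::make_index_sequence<256>{});
    return table[control](a, b, c);
}
```

The dispatch costs an indirect call and 256 instantiations, which gives a feel for why a non-constant control is the case the analyzer tries to flag.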

EgorBo · 2023-05-09T23:04:52Z

src/coreclr/jit/importervectorization.cpp

+            GenTree* control;
+
+            control = gtNewIconNode(static_cast<uint8_t>((0xF0 | 0xCC) ^ 0xAA)); // (A | B) ^ C
+            xor1    = gtNewSimdTernaryLogicNode(simdType, vec1, toLowerVec1, cnsVec1, control, baseType, simdSize);


Shouldn't this be a sort of morph/lower pattern-match optimization instead, so we can catch user patterns as well? This pattern doesn't look uncommon, and TernaryLogic seems to allow folding many other patterns, so we would want a single place to do them all.

We're catching this one early because it's simple/obvious. We do want to add the morph stuff as well, but that's a much more complex/involved PR. Handling these explicit cases early will also help with JIT throughput and avoid us needing to do the much more expensive general handling logic later.

If we decide just handling it in morph is better, then I can remove these explicit cases once that support is in place.
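A scalar sketch of what the explicit case in the diff above fuses: an OR followed by an XOR becomes a single ternary operation with control `(0xF0 | 0xCC) ^ 0xAA`, which is `0x56`. The reading that the second operand is a mask setting the ASCII case bit and the third is a lowercase constant is inferred from the variable names (`toLowerVec1`, `cnsVec1`), not stated in the thread; `evalTernary`, `orThenXor`, and `fusedOrXor` are hypothetical helpers.

```cpp
#include <cstdint>

constexpr uint8_t ControlOrABXorC = static_cast<uint8_t>((0xF0 | 0xCC) ^ 0xAA);
static_assert(ControlOrABXorC == 0x56, "(A | B) ^ C encodes as 0x56");

// Per-bit emulation: the bits of a, b, c index into the control byte.
constexpr uint32_t evalTernary(uint32_t a, uint32_t b, uint32_t c, uint8_t control)
{
    uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit)
    {
        const uint32_t index = (((a >> bit) & 1u) << 2) |
                               (((b >> bit) & 1u) << 1) |
                               ((c >> bit) & 1u);
        result |= static_cast<uint32_t>((control >> index) & 1u) << bit;
    }
    return result;
}

// The two-operation form that the single ternary operation replaces.
constexpr uint32_t orThenXor(uint32_t value, uint32_t mask, uint32_t cns)
{
    return (value | mask) ^ cns;
}

// The fused form: one ternary logic operation with the computed control.
constexpr uint32_t fusedOrXor(uint32_t value, uint32_t mask, uint32_t cns)
{
    return evalTernary(value, mask, cns, ControlOrABXorC);
}
```

With four packed ASCII bytes, a mask of `0x20202020`, and the packed lowercase constant, both forms give zero exactly when the input matches the constant ignoring the case bit.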

kunalspathak

Added some questions/comments. I assume you verified that the latest tests are executing in CI, right?

kunalspathak · 2023-05-09T23:24:23Z

src/coreclr/jit/hwintrinsic.cpp

+    // This table is 768 bytes and is about as small as we can make it.
+    //
+    // The way the constants work is we have three keys:
+    // * A: 0xF0


curious, how did you come up with the values of these keys? Are they related to the encoding or just something you picked?

They are directly part of the encoding and are covered in the architecture manuals.

kunalspathak · 2023-05-09T23:28:46Z

src/coreclr/jit/hwintrinsic.cpp

+    // clang-format off
+    static const TernaryLogicInfo ternaryLogicFlags[256] = {
+        /* FALSE */           { TernaryLogicOperKind::False,  TernaryLogicUseFlags::None, TernaryLogicOperKind::None,   TernaryLogicUseFlags::None, TernaryLogicOperKind::None,   TernaryLogicUseFlags::None },
+        /* norABC */          { TernaryLogicOperKind::Nor,    TernaryLogicUseFlags::ABC,  TernaryLogicOperKind::None,   TernaryLogicUseFlags::None, TernaryLogicOperKind::None,   TernaryLogicUseFlags::None },


how does this work? When I read norABC I see that as A nor B nor C and having TernaryLogicUseFlags::ABC as op1Use while all the others are left None is slightly confusing to me.

norABC is ~(A | B | C). They shorthand the ones that do the same operation on all three, whereas the full name would've been something like norAorBC.

Most of these control bytes do exactly 2 operations at once and they are always read as oper Input1 Input2. So andAorBC is And(A, Or(B, C)) (aka. A & (B | C)).

The exception is the conditional select instructions which do 3 operations (with the third one always being the condition code that "picks" bits from the first or second operation).
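To make the naming scheme concrete, the sketch below evaluates a few of the names from this thread against the keys. The constant names mirror the table comments; the observation that the first quoted rows (FALSE, norABC, andCnorBA, norBA) land on control bytes 0 through 3 is consistent with the table being indexed by control byte, though that is inferred here rather than stated.

```cpp
#include <cstdint>

constexpr uint8_t KeyA = 0xF0;
constexpr uint8_t KeyB = 0xCC;
constexpr uint8_t KeyC = 0xAA;

// Truncates the promoted int result of a bitwise expression back to 8 bits.
constexpr uint8_t u8(int value)
{
    return static_cast<uint8_t>(value);
}

// Names read as "oper Input1 Input2", where an input may itself be an operation.
constexpr uint8_t norABC    = u8(~(KeyA | KeyB | KeyC));  // Nor(A, B, C)
constexpr uint8_t andCnorBA = u8(KeyC & ~(KeyB | KeyA));  // And(C, Nor(B, A))
constexpr uint8_t norBA     = u8(~(KeyB | KeyA));         // Nor(B, A)
constexpr uint8_t andAorBC  = u8(KeyA & (KeyB | KeyC));   // And(A, Or(B, C))

static_assert(norABC == 0x01, "norABC");
static_assert(andCnorBA == 0x02, "andCnorBA");
static_assert(norBA == 0x03, "norBA");
static_assert(andAorBC == 0xE0, "andAorBC");
```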

kunalspathak · 2023-05-09T23:30:03Z

src/coreclr/jit/hwintrinsic.cpp

+        /* FALSE */           { TernaryLogicOperKind::False,  TernaryLogicUseFlags::None, TernaryLogicOperKind::None,   TernaryLogicUseFlags::None, TernaryLogicOperKind::None,   TernaryLogicUseFlags::None },
+        /* norABC */          { TernaryLogicOperKind::Nor,    TernaryLogicUseFlags::ABC,  TernaryLogicOperKind::None,   TernaryLogicUseFlags::None, TernaryLogicOperKind::None,   TernaryLogicUseFlags::None },
+        /* andCnorBA */       { TernaryLogicOperKind::Nor,    TernaryLogicUseFlags::AB,   TernaryLogicOperKind::And,    TernaryLogicUseFlags::C,    TernaryLogicOperKind::None,   TernaryLogicUseFlags::None },
+        /* norBA */           { TernaryLogicOperKind::Nor,    TernaryLogicUseFlags::AB,   TernaryLogicOperKind::None,   TernaryLogicUseFlags::None, TernaryLogicOperKind::None,   TernaryLogicUseFlags::None },


Just to keep the ordering of use flags representation you have here?

Suggested change

-        /* norBA */           { TernaryLogicOperKind::Nor,    TernaryLogicUseFlags::AB,   TernaryLogicOperKind::None,   TernaryLogicUseFlags::None, TernaryLogicOperKind::None,   TernaryLogicUseFlags::None },
+        /* norAB */           { TernaryLogicOperKind::Nor,    TernaryLogicUseFlags::AB,   TernaryLogicOperKind::None,   TernaryLogicUseFlags::None, TernaryLogicOperKind::None,   TernaryLogicUseFlags::None },

The names are taken from the architecture docs, so it's helpful to keep them matching as it makes it possible to search for the name in the manual.

We use a normalized form in our flags to keep things simpler.

kunalspathak · 2023-05-09T23:36:32Z

src/coreclr/jit/lowerxarch.cpp

+                            const TernaryLogicInfo& info     = TernaryLogicInfo::lookup(control);
+                            TernaryLogicUseFlags    useFlags = info.GetAllUseFlags();
+
+                            if (useFlags != TernaryLogicUseFlags::ABC)


I am not sure I completely follow what happens when useFlag == ABC. Can you elaborate?

When all three operands are used we have a real ternary instruction and don't have any optimization or normalization to do. This is because the instruction is already "complete" (aside from constant folding) and can't represent any more state.

When not all operands are used, we have the ability to represent the operation more optimally. This happens by either converting it to the regular unary/binary node (e.g. gtNewSimdBinOpNode(GT_AND, ...)) or by normalizing the format so that the used operands are B/C (which allows us to not be marked RMW and allows the greatest chance for containment). It will also simplify the logic that will be added in a future PR to combine unary/binary ops into ternary logic nodes.
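At the control-byte level, "normalizing so that the used operands are B/C" amounts to rewriting the truth table for a new operand order. The sketch below handles one such case, a control that depends only on A and B, and produces the control that computes the same function when those two values are supplied in slots B and C instead. The helper `moveABToBC` is hypothetical; the JIT's actual normalization logic is not shown in this thread.

```cpp
#include <cstdint>

// Hypothetical helper: given a control byte that only depends on A and B,
// returns the control byte computing the same function from slots B and C.
constexpr uint8_t moveABToBC(uint8_t control)
{
    uint8_t result = 0;
    for (unsigned j = 0; j < 8; ++j)
    {
        // In the new layout, slot B carries the old A and slot C carries the old B.
        const unsigned newB = (j >> 1) & 1u;
        const unsigned newC = j & 1u;

        // Old truth-table index with a = newB, b = newC, c = 0 (c is ignored).
        const unsigned src = (newB << 2) | (newC << 1);

        if ((control >> src) & 1u)
        {
            result = static_cast<uint8_t>(result | (1u << j));
        }
    }
    return result;
}

static_assert(moveABToBC(0xC0) == 0x88, "A & B becomes B & C");
static_assert(moveABToBC(0x3C) == 0x66, "A ^ B becomes B ^ C");
```

After the rewrite, the new control no longer depends on slot A, which matches the stated goal of not needing the operation to be marked RMW.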

kunalspathak

LGTM

tannergooding · 2023-05-10T05:30:13Z

Found a pre-existing bug in the scalar ROL/ROR logic for x86. I've logged #86027 and worked around it in the tests for the time being.

tannergooding added 4 commits May 4, 2023 09:07

Expose AlignRight32 and AlignRight64 on Avx512F

3652db3

Expose RotateLeft and RotateRight for Avx512F

b9c9690

Expose SumAbsoluteDifferencesInBlock32 for Avx512BW + DetectConflicts and LeadingZeroCOunt for Avx512DQ

f9158a8

Exponse TernaryLogic for Avx512F

2487a19

ghost assigned tannergooding May 5, 2023

dotnet-issue-labeler bot added the area-CodeGen-coreclr and new-api-needs-documentation labels May 5, 2023

tannergooding added 2 commits May 5, 2023 11:47

Apply formatting patch

6e72be6

Merge remote-tracking branch 'dotnet/main' into avx512-5

6046e65

jakobbotsch reviewed May 5, 2023

jakobbotsch reviewed May 5, 2023

tannergooding added 3 commits May 5, 2023 14:14

Ensure side effects are preserved when optimizing certain intrinsic imports

ea83d7b

Ensure the instruction code has the SIMD prefix before trying to encode the register

e58abbb

Ensure side effects have been accounted for before swapping operands

a5323e2

tannergooding force-pushed the avx512-5 branch from c449c37 to a5323e2 Compare May 5, 2023 22:03

build-analysis bot mentioned this pull request May 6, 2023

IOException running NuGet-Migrations during tests in dotnet CLI first run #80619

Closed

jakobbotsch reviewed May 6, 2023

tannergooding added 4 commits May 8, 2023 09:06

Move the complex ternary logic simplification logic to import, since the JIT will never produce such nodes itself

2584e7d

Ensure gtNewSimdUnOpNode(GT_NOT) uses an in range constant for TernaryLogic

6636fba

Remove a new assert added to AND_NOT, logging an issue instead

acdfb9b

Add a missing break; statement

49646ce

build-analysis bot mentioned this pull request May 9, 2023

Could not find chrome snapshot folder for Linux_x64 #85949

Closed

jakobbotsch reviewed May 9, 2023

jakobbotsch reviewed May 9, 2023

jakobbotsch reviewed May 9, 2023

jakobbotsch reviewed May 9, 2023

jakobbotsch reviewed May 9, 2023

tannergooding added 4 commits May 9, 2023 07:43

Ensure val1/2/3 are GenTree** so swapping works and add a comment explaining the TernaryLogic table

1420642

Fix formatting of a comment

1af17d8

Don't double encode the 'vvvv' bits for emitOutputAM

f2d8378

Merge remote-tracking branch 'dotnet/main' into avx512-5

bc081b6

Avoid an assert in gtNewSimdCreateBroadcastNode for TYP_LONG on 32-bit

94b5e53

tannergooding force-pushed the avx512-5 branch from 44a376a to 94b5e53 Compare May 9, 2023 19:43

jakobbotsch reviewed May 9, 2023

Ensure we use CHECK_SPILL_ALL

5d0a740

jakobbotsch approved these changes May 9, 2023

EgorBo reviewed May 9, 2023

EgorBo approved these changes May 9, 2023

EgorBo reviewed May 9, 2023

kunalspathak reviewed May 9, 2023

kunalspathak approved these changes May 10, 2023

Ensure mustExpand is handled for RotateLeft(Vector###<long>) on 32-bit

d07bc37

kunalspathak approved these changes May 10, 2023

build-analysis bot mentioned this pull request May 10, 2023

Tracking issue for CI build timeouts #76454

Closed

tannergooding added 2 commits May 9, 2023 20:50

Make sure all tests are actually running and handle the "maybe no jmp table fallback"

afccab8

Handle a couple test issues and ensure we set the constant when normalizing ~B | C

10c74fc

BruceForstall approved these changes May 10, 2023

Ensure ValidateRemaining uses firstOp[i]

e751e9c

tannergooding merged commit 16559f9 into dotnet:main May 10, 2023

tannergooding deleted the avx512-5 branch May 10, 2023 16:04

JulieLeeMSFT mentioned this pull request Jun 9, 2023

What's new in .NET 8 Preview 5 [WIP] dotnet/core#8436

Closed

3 tasks

ghost locked as resolved and limited conversation to collaborators Jun 9, 2023
