AVX-512 support in System.Runtime.Intrinsics.X86 #35773

twest820 · 2020-05-03T15:14:54Z

I presume supporting AVX-512 intrinsics is in plan somewhere, but couldn't find an existing issue tracking their addition. There seem to be two parts to this.

Support for EVEX encoding and use of zmm registers. I'm not entirely clear on compiler versus jit distinctions but perhaps this would allow jit to update existing 128 and 256 bit wide code using the Sse*, Avx*, or other System.Runtime.Intrinsics.X86 classes to EVEX.
Addition of Avx512 classes with the new instructions at 128, 256, and 512 bit widths.

There is some interface complexity with the (as of this writing) 17 AVX-512 subsets since Knights Landing/Mill, Skylake, Cannon Lake, Cascade Lake, Cooper Lake, and Ice/Tiger Lake all support different variations. To me, it seems most natural to deprioritize support for the Knights (they're no longer in production, so presumably nearly all code targeting them has already been written) and implement something in the direction of

class Avx512FCD : Avx2 // minimum common set across all Intel CPUs with AVX-512
class Avx512VLDQBW : Avx512FCD // common set for enabled Skylake μarch cores and Sunny Cove

plus non-inheriting classes for BITALG, IMFA52, VBMI, VBMI2, VNNI, BF16, and VP2INTERSECT (the remaining four subsets—4FMAPS, 4NNIW, ER, and PF—are specific to Knights). This is similar to the existing model for Bmi1, Bmi2, and Lzcnt and aligns to current hardware in a way which composes with existing inheritance and IsSupported properties. It also helps with incremental roll out.

Finding naming for code readability that's still clear as to which instructions are available where seems somewhat tricky. Personally, I'd be content with idioms like

using Avx512 = System.Runtime.Intrinsics.X86.Avx512VLDQBW; // loose terminology

but hopefully others will have better ideas.

The text was updated successfully, but these errors were encountered:

ghost · 2020-05-03T15:14:58Z

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

Symbai · 2020-05-03T16:33:40Z

#8264 & #31420 but looks like a tracking issue is still missing.

tannergooding · 2020-05-03T18:31:25Z

There isn't an explicit tracking issue right now.

AVX-512 represents a significant investment as it nearly triples the current surface area (from ~1500 APIs to ~4500 APIs). It additionally adds a new encoding, additional registers that would require support (this is extending to 512 bits, supporting 16 more registers, and adding 8 mask registers), a new SIMD type (TYP_SIMD64 and Vector512<T>), and more. While this support could be added piece by piece, I'm not sure if this meets the bar for trying to drive through API review any time soon (.NET 5) and so I won't have time to create the relevant API proposals, etc. I do imagine that will change as the hardware support starts becoming more prevalent and the scenarios it can be used and will be beneficial increases.

If someone does want to create a rough proposal of what AVX-512F would look like (since that is the base for the rest of the AVX-512 support), then I'd be happy to provide feedback and continue the discussion until it does bubble up.

CC. @CarolEidt, @echesakovMSFT, @BruceForstall as they may have additional or different thoughts/opinions

twest820 · 2020-05-03T23:54:13Z

Totally agree. Visual C++'s main AVX-512 roll out seems to have spanned the entire Visual Studio 2017 lifecycle and is still receiving attention in recent VS 2019 updates. It seems to me an initial question here could be what an AVX-512 roadmap might look like across multiple .NET annual releases. In the meantime, there is the workaround of calling intrinsics from C++, C++ from C++/CLI, and C++/CLI from C#. But I wouldn't have opened this issue if that layering was a great developer experience compared to intrinsics from C#. :-)

+3000 APIs is maybe ultimately on the low side. My current scrape of the Intel Intrinsics Guide lists 4255 AVX-512 intrinsics and 540 instructions. Only 380 of the intrinsics are not in the F+CD+VL+DQ+BW group supported from Skylake-SP and X and Ice Lake supports 4124 of the 4255 (give or take errors in the Guide I haven't caught or just on my part). Depending how exactly AVX-512F is defined I count it as totaling either 1435 or 2654 intrinsics. So it might make more sense to try to start with the initial 1500 intrinsics prioritized for Visual C++ 2017. Or even some subset thereof. I don't have that list, though.

Within this context, @tannergooding, if you can give me some more definition of what you're looking for in an AVX-512F sketch I can probably put something together.

I touched on this in #226, but the ability to jit existing 128 and 256 bit System.Runtime.Intrinsics.X86 APIs to EVEX for access to zmm registers 16-31 would be a valuable minimum increment even if not headlined by the addition of an Avx512 class. Definitely for most of the kernels in the various numerical codes I've written and perhaps also for the CLR's internal use of SIMD. (I can suggest some other pragmatically minded clickstops if there's interest.)

tannergooding · 2020-05-04T16:14:10Z

At the most basic level, there would need to be a Vector512<T> type that mirrors the Vector64/128/256 types and a new Avx512F class to contain the methods.

The methods proposed would likely have signatures like the following at a minimum (essentially mirroring SSE/AVX, but extending to V512):

/// <summary>
/// __m512d _mm512_add_pd (__m512d a, __m512d b);
///   VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Vector512<double> left, Vector512<double> right)

On top of that minimum, there would need to be a proposal for a new x86 specific Mask8 register and overloads using the mask would need to be provided in order to fully support the EVEX encoding:

/// <summary>
/// __m512d _mm512_mask_add_pd (__m512d s, __mmask8 k, __m512d a, __m512d b);
///   VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Vector512<double> value, Mask8 mask, Vector512<double> left, Vector512<double> right); // This overload merges values not written to by the mask

/// <summary>
/// __m512d _mm512_maskz_add_pd (__mmask8 k, __m512d a, __m512d b);
///   VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Mask8 mask, Vector512<double> left, Vector512<double> right); // This overload zeros values not written to by the mask

EVEX additionally has support for broadcast versions which take right as a T* and broadcast the value to all elements of the V512, but I'm not sure those are explicitly needed and warrant further discussion. I imagine the JIT could recognize a Vector128.Create(value) call and optimize it to generate the ideal code (noting C++ does similar).

EVEX additionally has support for rounding versions which take an immediate that specifies the rounding behavior done for the given operation. This would likewise require some additional thought and consideration.

Then there are 128-bit and 256-bit versions for most of these, but they fall under AVX512VL which would require its own thought into how to expose. My initial thought is that we would likely try to follow the normal hierarchy, but where something inherits from multiple classes we would need additional consideration in how they get exposed. This would require at least a breakdown of what ISAs exist and what their dependencies are.

john-h-k · 2020-05-05T16:21:38Z

EVEX additionally has support for rounding versions which take an immediate that specifies the rounding behavior done for the given operation. This would likewise require some additional thought and consideration.

Does this mean rounding immediates on operations like add/sub rather than explicit rounding instructions?

tannergooding · 2020-05-05T16:31:59Z

No, the rounding instructions convert floats to integrals, while the rounding mode impacts the returned result for x + y (for example).

IEEE 754 floating-point arithmetic is performed taking the inputs as given, computing the "infinitely precise result" and then rounding to the nearest representable result.
When the "infinitely precise result" is equally close to two representable values, you need a tie breaker to determine which to choose.
The default tie breaker is "ToEven", but you can (not .NET, but in other languages or in hardware) set the rounding mode to do something like AwayFromZero, ToZero, ToPositiveInfinity, or ToNegativeInfinity instead.
EVEX supports doing this on a per operation basis without having to use explicit instructions to modify and restore the floating-point control state

john-h-k · 2020-05-05T16:37:55Z

Ah, brilliant, I think that is what I meant but I didn't word it great 😄

scalablecory · 2020-05-05T17:59:32Z

I would appreciate the "compare into mask" instructions in AVX-512BW to speed up parsing and IndexOf.

saucecontrol · 2020-05-05T20:00:20Z

there would need to be a proposal for a new x86 specific Mask8 register and overloads using the mask would need to be provided in order to fully support the EVEX encoding:

For the new mask and maskz instruction variants, couldn't the JIT recognize VectorNNN<T>.Zero in the value arg and use the zero source encoding? That would mean only doubling the API surface area instead of tripling it 😄

tannergooding · 2020-05-05T20:33:28Z

Yes, there are likely some tricks we can do to help limit the number of exposed APIs and/or the number of APIs we need to review.

twest820 · 2020-05-09T18:29:05Z

This would require at least a breakdown of what ISAs exist and what their dependencies are.

Dependencies exist only on F and VL and seem unlikely to be concerns on Intel hardware. Probably not AMD either if they implement AVX-512. It seems github doesn't support tables in comments so I made a small repo with details.

At the most basic level, there would need to be a Vector512 type that mirrors the Vector64/128/256 types and a new Avx512F class to contain the methods.

Actually, if I had to pick just one width for initial EVEX and new intrinsic support it'd be 128.

there would need to be a proposal for a new x86 specific Mask8

Also 16, 32, and 64 bit masks. And rounding, comparison, minmax, and mantissa norm and sign enums. The BF16 subset planned for Tiger Lake would require #936 but that's unimportant at this point. I'll see about getting something sketched, hopefully in the next week or so.

tannergooding · 2020-05-09T19:36:16Z

Dependencies exist only on F and VL and seem unlikely to be concerns on Intel hardware

It is a bit more in depth than this...

ANDPD for example depends on AVX512DQ for the 512-bit variant. The 128 and 256-bit variant depend on both AVX512DQ and AVX512VL.
Since this has two dependencies, it can't be modeled using the existing inheritance hierarchy.

Now, given how VL works, it might be feasible to expose it as the following:

public abstract class AVX512F : ??
{
    public abstract class VL
    {
    }
}

public abstract class AVX512DQ : AVX512F
{
    public abstract class VL : AVX512F.VL
    {
    }
}

This would key off the existing model we have used for 64-bit extensions (e.g. Sse41.X64) and still maintains the rule that AVX512F.VL.IsSupported means AVX512F.IsSupported, etc.
There are also other considerations that need to be taken into account, such as what AVX512F depends on (iirc, it is more than just AVX2 and also includes FMA, which needs to be appropriately exposed).

Actually, if I had to pick just one width for initial EVEX and new intrinsic support it'd be 128.

I think this is a non-starter. The 128-bit EVEX support is not baseline, it (and the 256-bit support) is part of the AVX512VL extension and so the AVX512F class would need to be vetted first.

twest820 · 2020-05-10T11:47:25Z

It is a bit more in depth than this...

Hi Tanner, yes, it is. That's why the classes suggested as a starting point when this issue was opened don't attempt to model every CPUID flag individually. It's also why I posted the tabulation in the repo linked above.

While there are lots of possible factorings, it seems to me they're all going to be less than ideal in some way because class inheritance is an inexact match to the CPUID flags. My thoughts have gone in the same direction as you're exploring but I landed in a little bit different place. I'm not sure how abstract classes would work with the current static method model for intrinsics but one option might be

public class Avx512F : Avx2 // inheriting from Avx2 captures more surface than Fma?
{
    public static bool IsSupported // checks OSXSAVE and F CPUID

    // eventually builds out to 1435 F intrinsics

    public class VL // eventually has all of the 1208 VL subset intrinsics which depend on F
    {
        public static bool IsSupported // checks OSXSAVE and VL CPUID but not F
    }
}

public class Avx512DQ : Avx512F // Intrinsics Guide says no DQ instructions have CPUID dependencies on F but arch manual says F must be checked before checking for DQ
{
    public static bool IsSupported // checks OSXSAVE, F and DQ CPUIDs

    // eventually has 223 DQ intrinsics which do not depend on VL

    public class VL // has the 176 DQ intrinsics which do depend on VL
    {
        public static bool IsSupported // checks OSXSAVE, F, VL, and DQ
    }
}

Presumably CD and BW would look much like DQ. My thinking for BITALG, IMFA52, VBMI, VBMI2, VNNI, BF16, and VP2INTERSECT when opening this issue was similar.

It seems to me the advantage to this approach is it's more robust to Intel or AMD maybe deciding do something different with CPUID flags in the future. It might also be more friendly to intellisense performance requirements during coding. The disadvantage is the CPUID structure would constantly be restated in code. This doesn't seem helpful to readability and forces developers to think a lot about which intrinsics are in which subsets while coding. That seems more distracting than necessary and probably occasionally frustrating. So I'm unsure this is the best available tradeoff.

These ideas can be expressed without nested classes, which I think might be a little more friendly to coding. I'll leave those variants for a later reply, though.

There are also other considerations that need to be taken into account, such as what AVX512F depends on

Technically, F doesn't depend on anything but silicon. Just like Sse2 doesn't depend on Sse. The reason the class hierarchy below Avx2 and Fma works is because Intel and AMD have always shipped expanding instruction sets. In this sense, the Fma-Avx2 fork probably wasn't great for continuing the derivation chain. But all we can do now is to make our best attempt at not creating similar problems in the AVX-512 surface.

This particular bit of learning with Avx2 and Fma is one of the reasons why I'm a little hesitant about individually modeling CPUID flags explicitly in a class hierarchy.

I think this is a non-starter. The 128-bit EVEX support is not baseline, it (and the 256-bit support) is part of the AVX512VL extension and so the AVX512F class would need to be vetted first.

I'm sorry, but I'm not understanding why such an implementation constraint would need to be imposed. Yes, Intel made an F subset and named it foundation and, yes, VL depends on F. But Intel's decisions about CPUID flag details don't need to control the order in which Microsoft ships intrinsics to customers.

If you're saying Microsoft's internal .NET review process is such that architects and similar would want to see an Avx512 class hierarchy, including Avx512F, laid out before approving work on a VL implementation that seems fair. However if they'd insist you (or another developer) code F before VL I think that's more than a bit strange. And maybe also somewhat disconnected from early Avx512 adoption, where adjusting existing 128 and 256 bit kernels to use masks or take advantage of certain additional instructions might be common.

tannergooding · 2020-05-10T14:30:53Z

It seems to me the advantage to this approach is it's more robust to Intel or AMD maybe deciding do something different with CPUID flags in the future
...
Technically, F doesn't depend on anything but silicon. Just like Sse2 doesn't depend on Sse.

The architecture manuals both indicate that you cannot just check for x. Before using Sse2 you must first check that:

CPUID is supported by querying bit 21 of the EFLAGS register
Checking that CPUID.01H:EDX.SSE[bit 25] is 1
Checking that CPUID.01H:EDX.SSE2[bit26] is 1

The same goes for all of the hierarchies we've modeled in the exposed API surface, for example SSE4.2:

We also followed up with Intel/AMD for places where the expectation and the manual had discrepancies. For example, SSE4.1 strictly indicates checking SSSE3, SSE3, SSE2, and SSE. While SSE4.2 only strictly indicates SSE4.1, SSSE3, SSE2, and SSE (missing SSE3). These are just an oversight in the specification and the intent is that SSE4.2 requires checking SSE4.1, which requires SSSE3, which requires SSE3, which requires SSE2, which requires SSE, which requires CPUID.

Many applications don't do this full checking (even if just once on startup) and instead only check the relevant CPUID bit and assume the others are correct. Its the same as AVX and AVX512F both requiring you to check the OSXSAVE bit and that the OS has stated it supports the relevant YMM or ZMM register before checking the relevant feature bit. Likewise, extensions (AVX2, FMA, AVX512DQ) are all spec'd as requiring you to check the baseline flag as well (AVX or AVX512F).

But, due to the strict specification of the architecture manuals, there is a strict hierarchy of checks that exists and which can't change without breaking existing applications. So there will never be a case, for example, where an x86 CPU shipped SSE4.1 support without SSE3 support, etc

This particular bit of learning with Avx2 and Fma is one of the reasons why I'm a little hesitant about individually modeling CPUID flags explicitly in a class hierarchy.

Yes this is an edge case where the hierarchy can't be cleanly modeled. However, the hierarchy in general helps indicate what actual APIs are available and therefore removes duplication and confusion for the user in general.
If the worst case scenario here is we have 32 APIs which can't be modeled due to not having multiple inheritance; then I think we did alright 😄

I'm sorry, but I'm not understanding why such an implementation constraint would need to be imposed. Yes, Intel made an F subset and named it foundation and, yes, VL depends on F. But Intel's decisions about CPUID flag details don't need to control the order in which Microsoft ships intrinsics to customers.

The required work can actually be largely broken down into 3 stages:

Adding EVEX support
Adding TYP_SIMD64 support
Adding TYP_MASK8/16/32/64 support

The first is the biggest blocker today and must be done before either stage 2 or 3. It would require updating the register allocator to be aware of the 16 new registers so they can be used, saved, and restored and the emitter to be able to successfully encode them when encountered.
This would, in theory, require no public API surface changes and assuming it is always beneficial like using the VEX encoding is, it could automatically light up for the existing 128-bit/256-bit APIs when AVX512VL is supported.

The second is a smaller chunk of work, it's a natural extension on top of what we already have and the bulk of the work would actually be in API review just ensuring we are exposing the full surface and doing it correctly. Actually implementing support for these APIs shouldn't be too difficult as we have a table driven approach already, so it should just be adding new table entries and mapping them to the existing instructions just taking/returning TYP_SIMD64. Support for TYP_SIMD64 in the JIT should just be expanding the checks we do in a few places and again ensuring that the upper 256-bits are properly used/saved/restored.

The third is the biggest work item. It would require doing everything in both 1 and 2 for a new set of types. That is, the register allocator needs to be aware of these new registers so they can be used, saved, and restored. Likewise, the emitter needs to be able to successfully encode them. Support for the new types would also have to be integrated with the other various stages as well. We then also need to have an API review for the entire surface area which is, at a minimum, effectively everything we've already exposed for 128/256-bits but with an additional mask parameter. It explodes more if we include extensions to the Vector512 versions. Actually implementing them will likely be largely table driven but will require various parts of the table driven infrastructure to be updated and new support adding in lowering and other stages to account for optimizes that can or should be done.

The second or third could technically be done in either order and yes there may be a larger use case for having 3 first, as it is an extension to existing algorithms and avoids needing to manually mask and blend. However, doing 3 first impacts confidence that the API surface we are exposing/shipping is correct and that we don't hit any gotchas that would prevent us from properly implementing F after VL.

There, of course, may be other gotchas or surprises not listed above that would be encountered when actually implementing these. It would also be much harder to test since the amount of AVX512 hardware available today is limited in comparison to the amount with AVX2 support which needs to be taken under consideration.

twest820 · 2020-05-10T16:38:49Z

If the worst case scenario here is we have 32 APIs which can't be modeled due to not having multiple inheritance; then I think we did alright

I think so too. 😄 It's also why I proposed some things aligned with Knights and Skylake. While we don't know if, how, or when AMD might implement AVX-512, Intel is done with those two microarchitectures and we know Sunny Cove doesn't backtrack from Skylake instructions. So looking at how .NET might support the 96% of Ice Lake intrinsics which have been consistently available since Skylake is hopefully a pretty safe target.

Some of Intel's blog posts from years ago indicate CD will always be present with F, which is where the class Avx512FCD above comes from. Confirming this might be a good follow up question for them as it allows some simplification of C# inheritance hierarchies, reducing risk of orphaning CD like FMA. It's a helpful simplification if a similar assumption can be made for BW, DQ, and VL.

However, doing 3 first impacts confidence that the API surface we are exposing/shipping is correct and that we don't hit any gotchas that would prevent us from properly implementing F after VL

Thanks for explaining! I'm not sure I entirely follow the table structure but am I correct in getting the impression it makes the cost of adding intrinsics fairly low? If so, that implies the distinction I was trying to make about please consider unlocking some of 3 before finishing everything in 2 might not be large.

My test situation's even worse until either desktop Ice Lakes or expanded Ice Lake laptop availability so I totally get the challenges there. I also appreciate EVEX support is a substantial effort.

But, due to the strict specification of the architecture manuals, there is a strict hierarchy of checks that exists and which can't change without breaking existing applications.

Oh excellent, appreciate the catch (we have an unmanaged class I should correct as it's not honoring the SSE hierarchy). Fixed up the code comments in my previous.

Curiously, the Intrinsics Guide typically does not indicate dependencies on AVX-512F even though sections 15.2.1, 15.3, and 15.4 of the Intel 64 and IA-32 Architectures Software Development Manual all indicate software must check F before checking other subset flags. I'll ask about this on the Intrinsics Guide bug thread over in Intel's ISA forum. I think there's also a typo in figure 15-5 of the arch manual as it should indicate table 15-2 rather than 2-2.

tannergooding · 2020-05-10T16:56:21Z

Confirming this might be a good follow up question for them as it allows some simplification of C# inheritance hierarchies, reducing risk of orphaning CD like FMA

Even if Intel would be unlikely to ever ship F without CD, the documented checks is that they are distinct ISAs and an implementation is allowed to provide F without CD (they would be different ISAs otherwise) and so we wouldn't provide them as part of the same class (especially considering how new the instructions are, relatively speaking).

I'm not sure I entirely follow the table structure but am I correct in getting the impression it makes the cost of adding intrinsics fairly low

It varies from intrinsic to intrinsic, but in general the intrinsics are table driven and so if it doesn't expose any new "concepts" then it is just adding a new entry to https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsiclistxarch.h with the appropriate flags. The various paths know to lookup this information in the table to determine how it should be handled.

When it does introduce a new concept or if it requires specialized handling, then it requires a table entry and the relevant logic to be added to the various locations in the JIT (generally importation, lowering, register allocation, and codegen). In the ideal scenario, the new concept/handling is more generally applicable and so it is a one time cost for the first intrinsic that uses it and subsequent usages are then able to go down the simple table driven route.

The tests are largely table driven as well and are generated from the templates and metadata in https://github.com/dotnet/runtime/blob/master/src/coreclr/tests/src/JIT/HardwareIntrinsics/X86/Shared/GenerateTests.csx. This ensures the various relevant code paths are covered without having to explicitly codify the logic every time.

For 1, it is ideally just an encoding difference like the legacy vs VEX encoding was in which case there aren't really any new tests or APIs to expose.
For 2, it is just extending the APIs to support 512-bit versions and so it, for the vast majority, is just reusing the existing concepts and will just be adding table entries.
For 3, it is introducing a number of new concepts and so it will require quite a bit of revision to the intrinsic infrastructure to account for the mask operands and the various optimizations that can happen with them.

twest820 · 2020-06-15T18:21:06Z

Minor status bump: Intel's never been particularly active on their instruction set extensions forum but they've recently stopped responding entirely. So no update from Intel on the questions about the arch manual and intrinsics guide that were raised here a month ago.

the documented checks is that they are distinct ISAs and an implementation is allowed to provide F without CD

Interesting. The arch manual states software must also check for F when checking for CD (and strongly recommends checking F before CD). You've more privileged access to what Intel really meant and context on how to resolve conflicts between the arch manual and intrinsics guide than most of us. Thanks for sharing.

tannergooding · 2020-06-15T18:39:45Z

I think my statement might have been misinterpreted.

I was indicating that the following should be possible (where + indicates supported and - indicates unsupported):

+F, -CD
+F, +CD

The following should never be possible:

-F, +CD

AFAIK, there has never been a CPU that has shipped as +F, -CD, but given the spec it should be possible for some CPU to ship with such support.

hanblee · 2020-10-14T18:09:29Z

The required work can actually be largely broken down into 3 stages:

Adding EVEX support

Adding TYP_SIMD64 support

Adding TYP_MASK8/16/32/64 support

@tannergooding Have you considered staged approach to 1 above by first adding EVEX encoding without ZMM or mask support? This would allow use of AVX-512* instructions that operate on XMM and YMM without introducing Vector512<T> or Mask8 types and their underlying support in the JIT. For example, the following would then become possible:

/// <summary>
/// __m256i _mm256_popcnt_epi32 (__m256i a)
///   VPOPCNTD ymm, ymm
/// </summary>
public static Vector256<uint> PopCount(Vector256<uint> value)

tannergooding · 2020-10-14T18:16:16Z

AVX512-F is the "baseline" instruction set and doesn't expose any 128-bit or 256-bit variants it exposes the 512-bit and mask variants. The 128-bit and 256-bit variants are part of the separate AVX512-VL instruction set (which depends on AVX512-F).

In order to support the encoding correctly, we need to be aware of the full 512-bit state and appropriately save/restore the upper bits across call boundaries among other things.

ArnimSchinz · 2021-08-23T06:59:47Z

Vector<T> support would also be very nice.

saucecontrol · 2021-08-23T08:10:08Z

Vector<T> support would also be very nice.

Variable size for Vector<T> already results in unpredictable performance between AVX2 and non-AVX2 hardware due to the cost of cross-lane operations and the larger minimum vector size being useful in fewer places. Auto-extending Vector<T> to 64 bytes on AVX-512 hardware would aggravate the situation.

However, the common API defined for cross-platform vector helpers (#49397) plus Static Abstracts in Interfaces would allow the best of both worlds: shared logic where vector size doesn't matter, plus ISA- or size-specific logic where it does.

ArnimSchinz · 2021-08-23T08:22:32Z

I like how the usage of Vector<T> makes the code forward compatible and hardware independant. Predictable performance is nice, but just having the best possible performance on every underlying hardware is more important .

tannergooding · 2021-08-23T16:01:07Z

Predictable performance is nice, but just having the best possible performance on every underlying hardware is more important .

"best possible performance" isn't always the same as using the largest available vector size. It is often the case that larger vectors come with increased costs for small inputs or for handling various checks to see which path needs to be taken.

Support for 512-bits in Vector needs to be considered, profiled, and potentially left as an opt-in AppContext switch to ensure that various apps can choose and use what is right for them.

eladmarg · 2022-05-27T07:48:26Z

@tannergooding thanks for the detailed answer.

I do believe there will be a benefit in the long run after avx512 will become mainstream as technology continue improving

@lemire gained 40% performance improvement for json parsing thanks to avx512

So currently there are other priorities, hope this will catch up in net 8

lemire · 2022-05-27T13:43:17Z

Like ARM's SVE and SVE2, AVX-512 is not merely 'same as before but with wider registers'. It requires extensive work at the software level because it is a very different paradigm. On the plus side, recent Intel processors (Ice Lake and Tiger Lake) have good AVX-512 support, without downclocking and with highly useful instructions. And the good results are there: we parse JSON at record-breaking speeds. AVX-512 allows you to do base64 encoding/decoding at the speed of a memory copy. Crypto, machine learning, compression...

@tannergooding is of course correct that it is not likely that most programmers will directly benefit from AVX-512 in the short term, but I would argue that many more programmers would benefit indirectly if AVX-512 was used in core libraries. E.g., we are currently working on how to use AVX-512 for processing unicode.

On the downside, AMD is unlikely to support widely AVX-512 in the near future, and Intel is still putting out laptop processors without AVX-512 support...

HighPerfDotNet · 2022-05-27T14:23:37Z

CC: @tannergooding

AMD confirmed that consumer level Zen 4 (due out in fall) will support AVX-512, source:

https://videocardz.com/newz/amd-confirms-ryzen-7000-is-up-to-16-cores-and-170w-tdp-rdna2-integrated-gpu-a-standard-ai-acceleration-based-on-avx512

So that means even consumer level chip will support it, meaning it will also be in server Genoa chip due Q4 this year. This also means Intel will have to enable AVX-512 in their consumer chips too.

Perhaps implementing C style ASM keyword in C# could be alternative to supporting specific intrinsics...

tannergooding · 2022-05-27T14:34:50Z

AMD confirmed that consumer level Zen 4 (due out in fall) will support AVX-512, source:

There will need to be a more definitive source, preferably directly from the AMD website or developer docs.

This also means Intel will have to enable AVX-512 in their consumer chips too.

It does not mean or imply that. Different hardware manufacturers may have competing design goals or ideologies about where it makes sense to expose different ISAs.

Historically they have not always aligned or agreed and it is incorrect to speculate here.

Perhaps implementing C style ASM keyword in C# could be alternative to supporting specific intrinsics...

The amount of work required to support such a feature is greater than simply adding the direct hardware intrinsic support for AVX-512 instructions.

It requires all the same JIT changes around handling EVEX, the additional 16 registers, having some TYP_SIMD64, and the op-mask registers. Plus it would also require language support, a full fledged assembly lexer/parser, and more.

More generally, AVX-512 support will likely happen eventually. But even partial support is a non-trivial amount of work, particularly in the register allocator, debugger, and in the context save/restore logic.

tannergooding · 2022-05-27T14:46:51Z

The work required here can effectively be broken down into a few categories:

The first step is to update the VM to query CPUID and track the available ISAs. Then the basis of any additional work is adding support for EVEX encoded instructions but limiting it only to AVX-512VL with no masking support and no support for XMM16-XMM31. This would allow access to new 128-bit and 256-bit instructions but not access to any of the more complex functionality. It would be akin to exposing some new AVX3 ISA in complexity.

Then there are three more complex work items that could be done in any order.

Extend the register support to the additional 16 registers AVX-512 makes available. These are XMM16-XMM31 and would require work in the thread, callee, and caller save/restore contexts, work in the debugger, and some minimal work in the register allocator to indicate they are available but only on 64-bit and only when AVX-512 is supported.
Extend the register support to the upper 256-bits of the registers. This involves exposing and integrating a TYP_SIMD64 throughout the JIT as well as work in the thread, callee, and caller save/restore contexts, work in the debugger
Extend the register support to the KMASK registers. This would require more significant work in the register allocator and potentially the JIT to support the "entirely new" concept and registers.

There is then library work required to expose and support Vector512<T> and the various AVX-512 ISAs. This work could be done incrementally alongside the other work.

Conceivably the VM and basic EVEX encoding work are "any time". It would require review from the JIT team but is not complex enough that it would be impractical to consider incrementally. The same goes for any library work exposed around this, with an annotation that API review would likely want to see more concrete numbers on the number of APIs exposed, where they are distributed, etc.

The latter three work items would touch larger amounts of JIT code however and could only be worked on if the JIT team knows they have the time and resources to review and ensure everything is working as expected. For some of these more complex work items it may even be desirable to have a small design doc laying out how its expected to work, particularly for KMASK registers.

tannergooding · 2022-05-27T14:50:45Z

Also noting that the JIT team has the final say on when feature work can go in, even for the items I called out as seemingly anytime.

HighPerfDotNet · 2022-05-27T15:10:41Z

There will need to be a more definitive source, preferably directly from the AMD website or developer docs.

Well, we'll know in a few months, but right now it seems that it's 99.9% happening, even in cheap mass produced consumer chips.

Historically they have not always aligned or agreed and it is incorrect to speculate here.

It seems inevitable now since Intel can't afford AMD getting massive speed up using Intel developed instruction set, there are also lots of Intel servers out there with AVX-512.

tannergooding · 2022-05-27T15:32:53Z

It is pretty definitive since it comes from interview given by Director of Technical Marketing (AMD) Robert Hallock, you can view it here:

They explicitly state "not AMD proprietary, that's all I can say" and then "I can't here you" in response to "can we say anything like AVX-512, or anything like that?". That does not sound like confirmation to me, rather it is explicitly not disclosing additional details at this time and we will get additional details in the future.

An example is this could simply be AVX-VNNI and/or AMX which are both explicitly non AVX-512 based ISAs that supports AI/ML scenarios.

We will ultimately have to wait and see what official documentation is provided in the future.

It seems inevitable now since Intel can't afford AMD getting massive speed up using Intel developed instruction set, there are also lots of Intel servers out there with AVX-512.

It continues to be incorrect to speculate here or make presumptions about what the hardware vendors can or cannot do based on what other hardware vendors are doing.

As I mentioned AVX-512 will likely come in due time but there needs to be sufficient justification for the non-trivial amount of work. Another hardware vendor getting support might give more weight to that justification but there needs to be definitive response and documentation from said vendors covering exactly what ISAs will be supported (AVX-512 is a large foundational ISA and roughly 15 other sub-ISAs), where that support will exist, etc.

HighPerfDotNet · 2022-06-09T22:34:44Z

Official from AMD today - support for AVX 512 in Zen 4

Source: https://www.servethehome.com/amd-technology-roadmap-from-amd-financial-analyst-day-2022/

tannergooding · 2022-06-09T23:05:12Z

Said slides should be available from an official source about 4 hours after the event ends: https://www.amd.com/en/press-releases/2022-06-02-amd-to-host-financial-analyst-day-june-9-2022

I'll take a closer look when that happens, but it doesn't look like it goes any more in depth into what ISAs are covered vs not.

tannergooding · 2022-06-10T18:09:34Z

The raw slides aren't available, but its covered by the recorded and publicly available webcast: https://ir.amd.com/news-events/financial-analyst-day

Skip to 45:35 for the relevant portion and slide.

Edit: Raw slides are under the Technology Leadership link.

lemire · 2022-06-10T18:21:14Z

AMD is vague. He did refer to HPC which suggests it might be more than BF16 and VNNI.

tannergooding · 2022-06-10T18:26:37Z

Yes, we'll need to continue waiting for more details. AVX-512, per specification, requires at least AVX-512F which includes the EVEX encoding, the additional 16-registers, 512-bit register support, and the kmask register support.

The "ideal" scenario is that this also includes AVX-512VL and therefore the EVEX encoding, the additional 16-registers, and the masking support are available to all 128-bit and 256-bit instructions. This would allow better optimizations for existing code paths, use of the new instructions including full width permute and vptern, and isn't only going to light up for large inputs and HPC scenarios.

However, having official confirmation that the ISA (even if only AVX-512F) is now going to be cross-vendor does help justify work done towards supporting this.

HighPerfDotNet · 2022-06-10T20:53:41Z

I expect to see a CPU-Z screenshot of Zen 4 with details of supported ISAs soon - but given what was said so far it does FEEL to me that the support will be pretty extensive (Genoa version needs it and chiplets are the same as consumer anyway), so that should be including new AVX-512 VNNI

Symbai · 2022-09-18T05:19:09Z

Only says AVX-512F. But according to Wikipedia it also supports VL. MSVC compiler already supports AVX-512 since 2020. I really hope seeing this in .NET 8.

filipnavara · 2022-09-26T20:25:01Z

Zen4's AVX512 flavors is all of Ice Lake plus AVX512-BF16.

So Zen4 has:
AVX512-F
AVX512-CD
AVX512-VL
AVX512-BW
AVX512-DQ
AVX512-IFMA
AVX512-VBMI
AVX512-VNNI
AVX512-BF16
AVX512-VPOPCNTDQ
AVX512-VBMI2
AVX512-VPCLMULQDQ
AVX512-BITALG
AVX512-GFNI
AVX512-VAES

The ones it is missing are:
The Xeon Phi ones: AVX512-PF, AVX512-ER, AVX512-4FMAPS, AVX512-4VNNIW
AVX512-VP2INTERSECT (from Tiger Lake)
AVX512-FP16 (from Sapphire Rapids and AVX512-enabled Alder Lake)

Source: https://www.mersenneforum.org/showthread.php?p=614191

lemire · 2022-09-26T21:09:53Z

The gist of it is that @HighPerfDotNet was right. AMD Zen 4 has full AVX-512 support (full in the sense that it is competitive with the best Intel offerings).

I submit to you that this makes supporting AVX-512 much more compelling.

tannergooding · 2022-09-26T21:21:07Z

We're already working on adding AVX-512 support in .NET 8, a few foundational PRs have already been merged ;)

eladmarg · 2022-09-27T07:08:00Z

Awesome!
AVX-512F sometimes can be even faster by software than other missing hardware instructions

HighPerfDotNet · 2022-09-27T14:31:02Z

Zen 4 support looks better than I expected, despite being "double pumped", turns out it was a great design decision on AMDs part, very happy to see that finally AVX512 is getting addded to .NET!

tannergooding · 2022-10-02T07:22:20Z

Now that Zen4 is officialy out, I can confirm the AVX-512 supported ISAs:

This is AVX512:

F
CD
BW
DQ
IFMA
VL
VBMI
VBMI2
GFNI
VAES
VNNI
BITALG
VPCLMULQDQ
VPOPCNTDQ
BF16

That is everything except VP2INTERSECT and FP16 (the latter of which hasn't shipped officially supported anywhere yet).

This does not include ER, PF, 4FMAPS, and 4VNNIW that were only in Knight's Landing/Mill that were originally known as IMCI and have been effectively deprecated

Symbai · 2022-10-04T07:34:30Z

Are CompareGreaterThan CompareEqual CompareLessThan and so on in VL?

tannergooding · 2023-01-03T18:51:24Z

Closing this as the actual work is being tracked by #73262, #73604, #76579, and any future issues we open.

Dotnet-GitSync-Bot added area-System.Runtime.Intrinsics untriaged New issue has not been triaged by the area owner labels May 3, 2020

BruceForstall added this to the Future milestone May 11, 2020

BruceForstall removed the untriaged New issue has not been triaged by the area owner label May 11, 2020

Symbai mentioned this issue Aug 3, 2022

Expose Vector512<T> to support the x86/x64 AVX-512 instruction set #73262

Closed

BruceForstall added the avx512 Related to the AVX-512 architecture label Oct 13, 2022

tannergooding closed this as completed Jan 3, 2023

ghost locked as resolved and limited conversation to collaborators Feb 2, 2023

teo-tsirpanis modified the milestones: Future, 8.0.0 Aug 27, 2023

AVX-512 support in System.Runtime.Intrinsics.X86 #35773

AVX-512 support in System.Runtime.Intrinsics.X86 #35773

Comments

twest820 commented May 3, 2020 • edited Loading

ghost commented May 3, 2020

Symbai commented May 3, 2020 • edited Loading

tannergooding commented May 3, 2020

twest820 commented May 3, 2020 • edited Loading

tannergooding commented May 4, 2020

john-h-k commented May 5, 2020

tannergooding commented May 5, 2020

john-h-k commented May 5, 2020

scalablecory commented May 5, 2020

saucecontrol commented May 5, 2020

tannergooding commented May 5, 2020

twest820 commented May 9, 2020

tannergooding commented May 9, 2020

twest820 commented May 10, 2020 • edited Loading

tannergooding commented May 10, 2020

twest820 commented May 10, 2020

tannergooding commented May 10, 2020

twest820 commented Jun 15, 2020

tannergooding commented Jun 15, 2020

hanblee commented Oct 14, 2020 • edited Loading

tannergooding commented Oct 14, 2020 • edited Loading

ArnimSchinz commented Aug 23, 2021 • edited Loading

saucecontrol commented Aug 23, 2021

ArnimSchinz commented Aug 23, 2021

tannergooding commented Aug 23, 2021 • edited Loading

eladmarg commented May 27, 2022

lemire commented May 27, 2022

HighPerfDotNet commented May 27, 2022

tannergooding commented May 27, 2022

tannergooding commented May 27, 2022

tannergooding commented May 27, 2022

HighPerfDotNet commented May 27, 2022 • edited Loading

tannergooding commented May 27, 2022

HighPerfDotNet commented Jun 9, 2022

tannergooding commented Jun 9, 2022

tannergooding commented Jun 10, 2022 • edited Loading

lemire commented Jun 10, 2022 • edited Loading

tannergooding commented Jun 10, 2022 • edited Loading

HighPerfDotNet commented Jun 10, 2022

Symbai commented Sep 18, 2022

filipnavara commented Sep 26, 2022

lemire commented Sep 26, 2022

tannergooding commented Sep 26, 2022

eladmarg commented Sep 27, 2022

HighPerfDotNet commented Sep 27, 2022

tannergooding commented Oct 2, 2022 • edited Loading

Symbai commented Oct 4, 2022

tannergooding commented Jan 3, 2023

twest820 commented May 3, 2020 •

edited

Loading

Symbai commented May 3, 2020 •

edited

Loading

twest820 commented May 3, 2020 •

edited

Loading

twest820 commented May 10, 2020 •

edited

Loading

hanblee commented Oct 14, 2020 •

edited

Loading

tannergooding commented Oct 14, 2020 •

edited

Loading

ArnimSchinz commented Aug 23, 2021 •

edited

Loading

tannergooding commented Aug 23, 2021 •

edited

Loading

HighPerfDotNet commented May 27, 2022 •

edited

Loading

tannergooding commented Jun 10, 2022 •

edited

Loading

lemire commented Jun 10, 2022 •

edited

Loading

tannergooding commented Jun 10, 2022 •

edited

Loading

tannergooding commented Oct 2, 2022 •

edited

Loading