
AVX-512 support in System.Runtime.Intrinsics.X86 #35773

Closed
twest820 opened this issue May 3, 2020 · 55 comments
Labels
area-System.Runtime.Intrinsics avx512 Related to the AVX-512 architecture
Milestone

Comments

@twest820

twest820 commented May 3, 2020

I presume supporting AVX-512 intrinsics is in plan somewhere, but couldn't find an existing issue tracking their addition. There seem to be two parts to this.

  1. Support for EVEX encoding and use of zmm registers. I'm not entirely clear on compiler versus JIT distinctions, but perhaps this would allow the JIT to update existing 128- and 256-bit-wide code using the Sse*, Avx*, or other System.Runtime.Intrinsics.X86 classes to EVEX.
  2. Addition of Avx512 classes with the new instructions at 128, 256, and 512 bit widths.

There is some interface complexity with the (as of this writing) 17 AVX-512 subsets since Knights Landing/Mill, Skylake, Cannon Lake, Cascade Lake, Cooper Lake, and Ice/Tiger Lake all support different variations. To me, it seems most natural to deprioritize support for the Knights (they're no longer in production, so presumably nearly all code targeting them has already been written) and implement something in the direction of

class Avx512FCD : Avx2 // minimum common set across all Intel CPUs with AVX-512
class Avx512VLDQBW : Avx512FCD // common set for enabled Skylake μarch cores and Sunny Cove

plus non-inheriting classes for BITALG, IFMA52, VBMI, VBMI2, VNNI, BF16, and VP2INTERSECT (the remaining four subsets—4FMAPS, 4NNIW, ER, and PF—are specific to Knights). This is similar to the existing model for Bmi1, Bmi2, and Lzcnt and aligns to current hardware in a way which composes with existing inheritance and IsSupported properties. It also helps with incremental rollout.

Finding naming that reads well in code yet stays clear about which instructions are available where seems somewhat tricky. Personally, I'd be content with idioms like

using Avx512 = System.Runtime.Intrinsics.X86.Avx512VLDQBW; // loose terminology

but hopefully others will have better ideas.
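For concreteness, the guard-and-fallback idiom such classes would slot into already exists today. A minimal runnable sketch using only current APIs (Sse is standing in here for a future Avx512 class, which is an assumption on my part):

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class IsaLadder
{
    // The guard-and-fallback idiom a future Avx512 class would slot into,
    // shown with today's Sse class for illustration.
    public static Vector128<float> AddFloats(Vector128<float> a, Vector128<float> b)
    {
        if (Sse.IsSupported)
            return Sse.Add(a, b); // ADDPS xmm, xmm
        // Portable software fallback for targets without SSE.
        return Vector128.Create(
            a.GetElement(0) + b.GetElement(0),
            a.GetElement(1) + b.GetElement(1),
            a.GetElement(2) + b.GetElement(2),
            a.GetElement(3) + b.GetElement(3));
    }

    static void Main()
    {
        var r = AddFloats(Vector128.Create(1f, 2f, 3f, 4f),
                          Vector128.Create(10f, 20f, 30f, 40f));
        Console.WriteLine(r.GetElement(3)); // 44
    }
}
```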

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Runtime.Intrinsics untriaged New issue has not been triaged by the area owner labels May 3, 2020
@ghost

ghost commented May 3, 2020

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

@Symbai

Symbai commented May 3, 2020

#8264 & #31420 but looks like a tracking issue is still missing.

@tannergooding
Member

There isn't an explicit tracking issue right now.

AVX-512 represents a significant investment as it nearly triples the current surface area (from ~1500 APIs to ~4500 APIs). It additionally adds a new encoding, additional registers that would require support (this is extending to 512 bits, supporting 16 more registers, and adding 8 mask registers), a new SIMD type (TYP_SIMD64 and Vector512<T>), and more. While this support could be added piece by piece, I'm not sure if this meets the bar for trying to drive through API review any time soon (.NET 5) and so I won't have time to create the relevant API proposals, etc. I do imagine that will change as the hardware support becomes more prevalent and the scenarios where it can be used beneficially increase.

If someone does want to create a rough proposal of what AVX-512F would look like (since that is the base for the rest of the AVX-512 support), then I'd be happy to provide feedback and continue the discussion until it does bubble up.

CC. @CarolEidt, @echesakovMSFT, @BruceForstall as they may have additional or different thoughts/opinions

@twest820
Author

twest820 commented May 3, 2020

Totally agree. Visual C++'s main AVX-512 roll out seems to have spanned the entire Visual Studio 2017 lifecycle and is still receiving attention in recent VS 2019 updates. It seems to me an initial question here could be what an AVX-512 roadmap might look like across multiple .NET annual releases. In the meantime, there is the workaround of calling intrinsics from C++, C++ from C++/CLI, and C++/CLI from C#. But I wouldn't have opened this issue if that layering was a great developer experience compared to intrinsics from C#. :-)

+3000 APIs is maybe ultimately on the low side. My current scrape of the Intel Intrinsics Guide lists 4255 AVX-512 intrinsics and 540 instructions. Only 380 of the intrinsics are not in the F+CD+VL+DQ+BW group supported from Skylake-SP and -X, and Ice Lake supports 4124 of the 4255 (allowing for errors in the Guide, or on my part, that I haven't caught). Depending how exactly AVX-512F is defined, I count it as totaling either 1435 or 2654 intrinsics. So it might make more sense to start with the initial 1500 intrinsics prioritized for Visual C++ 2017, or even some subset thereof. I don't have that list, though.

Within this context, @tannergooding, if you can give me some more definition of what you're looking for in an AVX-512F sketch I can probably put something together.

I touched on this in #226, but the ability to jit existing 128 and 256 bit System.Runtime.Intrinsics.X86 APIs to EVEX for access to zmm registers 16-31 would be a valuable minimum increment even if not headlined by the addition of an Avx512 class. Definitely for most of the kernels in the various numerical codes I've written and perhaps also for the CLR's internal use of SIMD. (I can suggest some other pragmatically minded clickstops if there's interest.)

@tannergooding
Member

At the most basic level, there would need to be a Vector512<T> type that mirrors the Vector64/128/256 types and a new Avx512F class to contain the methods.

The methods proposed would likely have signatures like the following at a minimum (essentially mirroring SSE/AVX, but extending to V512):

/// <summary>
/// __m512d _mm512_add_pd (__m512d a, __m512d b);
///   VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Vector512<double> left, Vector512<double> right)

On top of that minimum, there would need to be a proposal for a new x86-specific Mask8 type, and overloads using the mask would need to be provided in order to fully support the EVEX encoding:

/// <summary>
/// __m512d _mm512_mask_add_pd (__m512d s, __mmask8 k, __m512d a, __m512d b);
///   VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Vector512<double> value, Mask8 mask, Vector512<double> left, Vector512<double> right); // This overload merges values not written to by the mask

/// <summary>
/// __m512d _mm512_maskz_add_pd (__mmask8 k, __m512d a, __m512d b);
///   VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Mask8 mask, Vector512<double> left, Vector512<double> right); // This overload zeros values not written to by the mask

EVEX additionally has support for broadcast versions which take right as a T* and broadcast the value to all elements of the V512, but I'm not sure those are explicitly needed, and they warrant further discussion. I imagine the JIT could recognize a Vector512.Create(value) call and optimize it to generate the ideal code (noting C++ does similar).

EVEX additionally has support for rounding versions which take an immediate that specifies the rounding behavior done for the given operation. This would likewise require some additional thought and consideration.

Then there are 128-bit and 256-bit versions for most of these, but they fall under AVX512VL which would require its own thought into how to expose. My initial thought is that we would likely try to follow the normal hierarchy, but where something inherits from multiple classes we would need additional consideration in how they get exposed. This would require at least a breakdown of what ISAs exist and what their dependencies are.

@john-h-k
Contributor

john-h-k commented May 5, 2020

EVEX additionally has support for rounding versions which take an immediate that specifies the rounding behavior done for the given operation. This would likewise require some additional thought and consideration.

Does this mean rounding immediates on operations like add/sub rather than explicit rounding instructions?

@tannergooding
Member

No, the rounding instructions convert floats to integrals, while the rounding mode impacts the returned result for x + y (for example).

IEEE 754 floating-point arithmetic is performed taking the inputs as given, computing the "infinitely precise result" and then rounding to the nearest representable result.
When the "infinitely precise result" is equally close to two representable values, you need a tie breaker to determine which to choose.
The default tie breaker is "ToEven", but you can (not .NET, but in other languages or in hardware) set the rounding mode to do something like AwayFromZero, ToZero, ToPositiveInfinity, or ToNegativeInfinity instead.
EVEX supports doing this on a per-operation basis without having to use explicit instructions to modify and restore the floating-point control state.
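A small runnable illustration of the tie-breaker using scalar Math.Round, which exposes the midpoint rule directly; EVEX embedded rounding applies the analogous choice per instruction rather than per process:

```csharp
using System;

class RoundingDemo
{
    static void Main()
    {
        // Default IEEE 754 tie-breaker is "to even": 2.5 and 3.5 are both
        // midpoints, but they round in different directions.
        Console.WriteLine(Math.Round(2.5)); // 2
        Console.WriteLine(Math.Round(3.5)); // 4
        // Other modes pick a different representable value at the midpoint.
        Console.WriteLine(Math.Round(2.5, MidpointRounding.AwayFromZero)); // 3
        Console.WriteLine(Math.Round(2.5, MidpointRounding.ToZero));       // 2
    }
}
```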

@john-h-k
Contributor

john-h-k commented May 5, 2020

Ah, brilliant, I think that is what I meant but I didn't word it great 😄

@scalablecory
Contributor

I would appreciate the "compare into mask" instructions in AVX-512BW to speed up parsing and IndexOf.

@saucecontrol
Member

there would need to be a proposal for a new x86 specific Mask8 register and overloads using the mask would need to be provided in order to fully support the EVEX encoding:

For the new mask and maskz instruction variants, couldn't the JIT recognize VectorNNN<T>.Zero in the value arg and use the zero source encoding? That would mean only doubling the API surface area instead of tripling it 😄
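Sketched against the hypothetical signatures proposed earlier (Avx512F, Mask8, and Vector512 don't exist yet, so this is illustrative pseudocode only):

```csharp
// Hypothetical API: the dedicated zero-masking overload...
// Vector512<double> r = Avx512F.Add(mask, left, right);
// ...could instead be expressed through the merge-masking overload with a
// zero merge source, which the JIT would pattern-match:
Vector512<double> r = Avx512F.Add(Vector512<double>.Zero, mask, left, right);
// ...and emit using the zero-masking ({z}) form of the EVEX encoding.
```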

@tannergooding
Member

Yes, there are likely some tricks we can do to help limit the number of exposed APIs and/or the number of APIs we need to review.

@twest820
Author

twest820 commented May 9, 2020

This would require at least a breakdown of what ISAs exist and what their dependencies are.

Dependencies exist only on F and VL and seem unlikely to be concerns on Intel hardware. Probably not AMD either if they implement AVX-512. It seems github doesn't support tables in comments so I made a small repo with details.

At the most basic level, there would need to be a Vector512 type that mirrors the Vector64/128/256 types and a new Avx512F class to contain the methods.

Actually, if I had to pick just one width for initial EVEX and new intrinsic support it'd be 128.

there would need to be a proposal for a new x86 specific Mask8

Also 16, 32, and 64 bit masks. And rounding, comparison, minmax, and mantissa norm and sign enums. The BF16 subset planned for Tiger Lake would require #936 but that's unimportant at this point. I'll see about getting something sketched, hopefully in the next week or so.

@tannergooding
Member

Dependencies exist only on F and VL and seem unlikely to be concerns on Intel hardware

It is a bit more in depth than this...

ANDPD, for example, depends on AVX512DQ for the 512-bit variant. The 128- and 256-bit variants depend on both AVX512DQ and AVX512VL.
Since this has two dependencies, it can't be modeled using the existing inheritance hierarchy.

Now, given how VL works, it might be feasible to expose it as the following:

public abstract class AVX512F : ??
{
    public abstract class VL
    {
    }
}

public abstract class AVX512DQ : AVX512F
{
    public abstract class VL : AVX512F.VL
    {
    }
}

This would key off the existing model we have used for 64-bit extensions (e.g. Sse41.X64) and still maintains the rule that AVX512F.VL.IsSupported means AVX512F.IsSupported, etc.
There are also other considerations that need to be taken into account, such as what AVX512F depends on (iirc, it is more than just AVX2 and also includes FMA, which needs to be appropriately exposed).
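The existing Sse41.X64 nesting referred to above can be seen in a small runnable example; the implication rule (inner IsSupported implies outer IsSupported) is exactly what the proposed Avx512F.VL would mirror:

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class NestedIsaDemo
{
    // The existing nested-class model: Sse41.X64.IsSupported implies
    // Sse41.IsSupported, just as the proposed Avx512F.VL.IsSupported
    // would imply Avx512F.IsSupported.
    public static long ExtractUpper(Vector128<long> v)
    {
        return Sse41.X64.IsSupported
            ? Sse41.X64.Extract(v, 1)   // PEXTRQ, 64-bit x86 only
            : v.GetElement(1);          // portable fallback
    }

    static void Main()
    {
        Console.WriteLine(ExtractUpper(Vector128.Create(10L, 20L))); // 20
    }
}
```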

Actually, if I had to pick just one width for initial EVEX and new intrinsic support it'd be 128.

I think this is a non-starter. The 128-bit EVEX support is not baseline, it (and the 256-bit support) is part of the AVX512VL extension and so the AVX512F class would need to be vetted first.

@twest820
Author

twest820 commented May 10, 2020

It is a bit more in depth than this...

Hi Tanner, yes, it is. That's why the classes suggested as a starting point when this issue was opened don't attempt to model every CPUID flag individually. It's also why I posted the tabulation in the repo linked above.

While there are lots of possible factorings, it seems to me they're all going to be less than ideal in some way because class inheritance is an inexact match to the CPUID flags. My thoughts have gone in the same direction as you're exploring, but I landed in a slightly different place. I'm not sure how abstract classes would work with the current static-method model for intrinsics, but one option might be

public class Avx512F : Avx2 // inheriting from Avx2 captures more surface than Fma?
{
    public static bool IsSupported // checks OSXSAVE and F CPUID

    // eventually builds out to 1435 F intrinsics

    public class VL // eventually has all of the 1208 VL subset intrinsics which depend on F
    {
        public static bool IsSupported // checks OSXSAVE and VL CPUID but not F
    }
}

public class Avx512DQ : Avx512F // Intrinsics Guide says no DQ instructions have CPUID dependencies on F but arch manual says F must be checked before checking for DQ
{
    public static bool IsSupported // checks OSXSAVE, F and DQ CPUIDs

    // eventually has 223 DQ intrinsics which do not depend on VL

    public class VL // has the 176 DQ intrinsics which do depend on VL
    {
        public static bool IsSupported // checks OSXSAVE, F, VL, and DQ
    }
}

Presumably CD and BW would look much like DQ. My thinking for BITALG, IFMA52, VBMI, VBMI2, VNNI, BF16, and VP2INTERSECT when opening this issue was similar.

It seems to me the advantage to this approach is that it's more robust to Intel or AMD maybe deciding to do something different with CPUID flags in the future. It might also be more friendly to IntelliSense performance requirements during coding. The disadvantage is the CPUID structure would constantly be restated in code. This doesn't seem helpful to readability and forces developers to think a lot about which intrinsics are in which subsets while coding. That seems more distracting than necessary and probably occasionally frustrating. So I'm unsure this is the best available tradeoff.

These ideas can be expressed without nested classes, which I think might be a little more friendly to coding. I'll leave those variants for a later reply, though.

There are also other considerations that need to be taken into account, such as what AVX512F depends on

Technically, F doesn't depend on anything but silicon. Just like Sse2 doesn't depend on Sse. The reason the class hierarchy below Avx2 and Fma works is because Intel and AMD have always shipped expanding instruction sets. In this sense, the Fma-Avx2 fork probably wasn't great for continuing the derivation chain. But all we can do now is to make our best attempt at not creating similar problems in the AVX-512 surface.

This particular bit of learning with Avx2 and Fma is one of the reasons why I'm a little hesitant about individually modeling CPUID flags explicitly in a class hierarchy.

I think this is a non-starter. The 128-bit EVEX support is not baseline, it (and the 256-bit support) is part of the AVX512VL extension and so the AVX512F class would need to be vetted first.

I'm sorry, but I'm not understanding why such an implementation constraint would need to be imposed. Yes, Intel made an F subset and named it foundation and, yes, VL depends on F. But Intel's decisions about CPUID flag details don't need to control the order in which Microsoft ships intrinsics to customers.

If you're saying Microsoft's internal .NET review process is such that architects and similar would want to see an Avx512 class hierarchy, including Avx512F, laid out before approving work on a VL implementation that seems fair. However if they'd insist you (or another developer) code F before VL I think that's more than a bit strange. And maybe also somewhat disconnected from early Avx512 adoption, where adjusting existing 128 and 256 bit kernels to use masks or take advantage of certain additional instructions might be common.

@tannergooding
Member

It seems to me the advantage to this approach is that it's more robust to Intel or AMD maybe deciding to do something different with CPUID flags in the future
...
Technically, F doesn't depend on anything but silicon. Just like Sse2 doesn't depend on Sse.

The architecture manuals both indicate that you cannot just check for x. Before using Sse2 you must first check that:

  1. CPUID is supported by querying bit 21 of the EFLAGS register
  2. Checking that CPUID.01H:EDX.SSE[bit 25] is 1
  3. Checking that CPUID.01H:EDX.SSE2[bit26] is 1

The same goes for all of the hierarchies we've modeled in the exposed API surface, for example SSE4.2:
(screenshot of the SSE4.2 CPUID dependency checks from the Intel architecture manual)

We also followed up with Intel/AMD for places where the expectation and the manual had discrepancies. For example, SSE4.1 strictly indicates checking SSSE3, SSE3, SSE2, and SSE. While SSE4.2 only strictly indicates SSE4.1, SSSE3, SSE2, and SSE (missing SSE3). These are just an oversight in the specification and the intent is that SSE4.2 requires checking SSE4.1, which requires SSSE3, which requires SSE3, which requires SSE2, which requires SSE, which requires CPUID.

Many applications don't do this full checking (even if just once on startup) and instead only check the relevant CPUID bit and assume the others are correct. It's the same as AVX and AVX512F both requiring you to check the OSXSAVE bit and that the OS has stated it supports the relevant YMM or ZMM registers before checking the relevant feature bit. Likewise, extensions (AVX2, FMA, AVX512DQ) are all spec'd as requiring you to check the baseline flag as well (AVX or AVX512F).

But, due to the strict specification of the architecture manuals, there is a strict hierarchy of checks that exists and which can't change without breaking existing applications. So there will never be a case, for example, where an x86 CPU shipped SSE4.1 support without SSE3 support, etc
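A sketch of the feature-bit checks described above using X86Base.CpuId (an intrinsic exposed in later .NET versions; on non-x86 it falls back to the IsSupported properties). The bit positions are the documented CPUID.01H:EDX positions:

```csharp
using System;
using System.Runtime.Intrinsics.X86;

class CpuidChecks
{
    // Queries CPUID leaf 01H directly, mirroring the manual's check order.
    public static (bool Sse, bool Sse2) QuerySseBits()
    {
        if (!X86Base.IsSupported)
            return (Sse.IsSupported, Sse2.IsSupported); // non-x86 fallback
        var (_, _, _, edx) = X86Base.CpuId(1, 0);
        bool sse  = (edx & (1 << 25)) != 0; // CPUID.01H:EDX.SSE[bit 25]
        bool sse2 = (edx & (1 << 26)) != 0; // CPUID.01H:EDX.SSE2[bit 26]
        return (sse, sse2);
    }

    static void Main()
    {
        var (sse, sse2) = QuerySseBits();
        // The strict hierarchy: SSE2 support implies SSE support.
        Console.WriteLine(!sse2 || sse); // True
    }
}
```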

This particular bit of learning with Avx2 and Fma is one of the reasons why I'm a little hesitant about individually modeling CPUID flags explicitly in a class hierarchy.

Yes, this is an edge case where the hierarchy can't be cleanly modeled. However, the hierarchy in general helps indicate what actual APIs are available and therefore removes duplication and confusion for the user in general.
If the worst-case scenario here is that we have 32 APIs which can't be modeled due to not having multiple inheritance, then I think we did alright 😄

I'm sorry, but I'm not understanding why such an implementation constraint would need to be imposed. Yes, Intel made an F subset and named it foundation and, yes, VL depends on F. But Intel's decisions about CPUID flag details don't need to control the order in which Microsoft ships intrinsics to customers.

The required work can actually be largely broken down into 3 stages:

  1. Adding EVEX support
  2. Adding TYP_SIMD64 support
  3. Adding TYP_MASK8/16/32/64 support

The first is the biggest blocker today and must be done before either stage 2 or 3. It would require updating the register allocator to be aware of the 16 new registers so they can be used, saved, and restored and the emitter to be able to successfully encode them when encountered.
This would, in theory, require no public API surface changes and assuming it is always beneficial like using the VEX encoding is, it could automatically light up for the existing 128-bit/256-bit APIs when AVX512VL is supported.

The second is a smaller chunk of work; it's a natural extension on top of what we already have, and the bulk of the work would actually be in API review, just ensuring we are exposing the full surface and doing it correctly. Actually implementing support for these APIs shouldn't be too difficult as we have a table-driven approach already, so it should just be adding new table entries and mapping them to the existing instructions, just taking/returning TYP_SIMD64. Support for TYP_SIMD64 in the JIT should just be expanding the checks we do in a few places and again ensuring that the upper 256 bits are properly used/saved/restored.

The third is the biggest work item. It would require doing everything in both 1 and 2 for a new set of types. That is, the register allocator needs to be aware of these new registers so they can be used, saved, and restored. Likewise, the emitter needs to be able to successfully encode them. Support for the new types would also have to be integrated with the various other stages as well. We then also need to have an API review for the entire surface area, which is, at a minimum, effectively everything we've already exposed for 128/256-bits but with an additional mask parameter. It explodes more if we include extensions to the Vector512 versions. Actually implementing them will likely be largely table driven but will require various parts of the table-driven infrastructure to be updated and new support added in lowering and other stages to account for optimizations that can or should be done.

The second or third could technically be done in either order and yes there may be a larger use case for having 3 first, as it is an extension to existing algorithms and avoids needing to manually mask and blend. However, doing 3 first impacts confidence that the API surface we are exposing/shipping is correct and that we don't hit any gotchas that would prevent us from properly implementing F after VL.

There, of course, may be other gotchas or surprises not listed above that would be encountered when actually implementing these. It would also be much harder to test since the amount of AVX-512 hardware available today is limited in comparison to the amount with AVX2 support, which needs to be taken into consideration.

@twest820
Author

If the worst case scenario here is we have 32 APIs which can't be modeled due to not having multiple inheritance; then I think we did alright

I think so too. 😄 It's also why I proposed some things aligned with Knights and Skylake. While we don't know if, how, or when AMD might implement AVX-512, Intel is done with those two microarchitectures and we know Sunny Cove doesn't backtrack from Skylake instructions. So looking at how .NET might support the 96% of Ice Lake intrinsics which have been consistently available since Skylake is hopefully a pretty safe target.

Some of Intel's blog posts from years ago indicate CD will always be present with F, which is where the class Avx512FCD above comes from. Confirming this might be a good follow up question for them as it allows some simplification of C# inheritance hierarchies, reducing risk of orphaning CD like FMA. It's a helpful simplification if a similar assumption can be made for BW, DQ, and VL.

However, doing 3 first impacts confidence that the API surface we are exposing/shipping is correct and that we don't hit any gotchas that would prevent us from properly implementing F after VL

Thanks for explaining! I'm not sure I entirely follow the table structure, but am I correct in getting the impression that it makes the cost of adding intrinsics fairly low? If so, that implies the distinction I was trying to make, about considering unlocking some of 3 before finishing everything in 2, might not be large.

My test situation's even worse until either desktop Ice Lakes or expanded Ice Lake laptop availability so I totally get the challenges there. I also appreciate EVEX support is a substantial effort.

But, due to the strict specification of the architecture manuals, there is a strict hierarchy of checks that exists and which can't change without breaking existing applications.

Oh excellent, appreciate the catch (we have an unmanaged class I should correct as it's not honoring the SSE hierarchy). I've fixed up the code comments in my previous comment.

Curiously, the Intrinsics Guide typically does not indicate dependencies on AVX-512F even though sections 15.2.1, 15.3, and 15.4 of the Intel 64 and IA-32 Architectures Software Development Manual all indicate software must check F before checking other subset flags. I'll ask about this on the Intrinsics Guide bug thread over in Intel's ISA forum. I think there's also a typo in figure 15-5 of the arch manual as it should indicate table 15-2 rather than 2-2.

@tannergooding
Member

Confirming this might be a good follow up question for them as it allows some simplification of C# inheritance hierarchies, reducing risk of orphaning CD like FMA

Even if Intel would be unlikely to ever ship F without CD, the documented checks indicate that they are distinct ISAs and an implementation is allowed to provide F without CD (they would not be distinct ISAs otherwise), and so we wouldn't provide them as part of the same class (especially considering how new the instructions are, relatively speaking).

I'm not sure I entirely follow the table structure but am I correct in getting the impression it makes the cost of adding intrinsics fairly low

It varies from intrinsic to intrinsic, but in general the intrinsics are table driven and so if it doesn't expose any new "concepts" then it is just adding a new entry to https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsiclistxarch.h with the appropriate flags. The various paths know to lookup this information in the table to determine how it should be handled.

When it does introduce a new concept or if it requires specialized handling, then it requires a table entry and the relevant logic to be added to the various locations in the JIT (generally importation, lowering, register allocation, and codegen). In the ideal scenario, the new concept/handling is more generally applicable and so it is a one time cost for the first intrinsic that uses it and subsequent usages are then able to go down the simple table driven route.

The tests are largely table driven as well and are generated from the templates and metadata in https://github.com/dotnet/runtime/blob/master/src/coreclr/tests/src/JIT/HardwareIntrinsics/X86/Shared/GenerateTests.csx. This ensures the various relevant code paths are covered without having to explicitly codify the logic every time.

For 1, it is ideally just an encoding difference like the legacy vs VEX encoding was, in which case there aren't really any new tests or APIs to expose.
For 2, it is just extending the APIs to support 512-bit versions and so it, for the vast majority, is just reusing the existing concepts and will just be adding table entries.
For 3, it is introducing a number of new concepts and so it will require quite a bit of revision to the intrinsic infrastructure to account for the mask operands and the various optimizations that can happen with them.

@BruceForstall BruceForstall added this to the Future milestone May 11, 2020
@BruceForstall BruceForstall removed the untriaged New issue has not been triaged by the area owner label May 11, 2020
@twest820
Author

Minor status bump: Intel's never been particularly active on their instruction set extensions forum but they've recently stopped responding entirely. So no update from Intel on the questions about the arch manual and intrinsics guide that were raised here a month ago.

the documented checks is that they are distinct ISAs and an implementation is allowed to provide F without CD

Interesting. The arch manual states software must also check for F when checking for CD (and strongly recommends checking F before CD). You have more privileged access to what Intel really meant, and more context on how to resolve conflicts between the arch manual and the intrinsics guide, than most of us. Thanks for sharing.

@tannergooding
Member

I think my statement might have been misinterpreted.

I was indicating that the following should be possible (where + indicates supported and - indicates unsupported):

  • +F, -CD
  • +F, +CD

The following should never be possible:

  • -F, +CD

AFAIK, there has never been a CPU that has shipped as +F, -CD, but given the spec it should be possible for some CPU to ship with such support.

@hanblee
Contributor

hanblee commented Oct 14, 2020

The required work can actually be largely broken down into 3 stages:

  1. Adding EVEX support
  2. Adding TYP_SIMD64 support
  3. Adding TYP_MASK8/16/32/64 support

@tannergooding Have you considered a staged approach to 1 above by first adding EVEX encoding without ZMM or mask support? This would allow use of AVX-512* instructions that operate on XMM and YMM without introducing Vector512<T> or Mask8 types and their underlying support in the JIT. For example, the following would then become possible:

/// <summary>
/// __m256i _mm256_popcnt_epi32 (__m256i a)
///   VPOPCNTD ymm, ymm
/// </summary>
public static Vector256<uint> PopCount(Vector256<uint> value)
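Until such an intrinsic is exposed, an element-wise stand-in is possible with today's scalar BitOperations.PopCount; a minimal sketch (the method name and shape here are illustrative only, not a proposed API):

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

class PopCountFallback
{
    // Element-wise stand-in for a vectorized VPOPCNTD, built on the scalar
    // BitOperations.PopCount that .NET already exposes.
    public static Vector256<uint> PopCount(Vector256<uint> value)
    {
        Span<uint> r = stackalloc uint[8];
        for (int i = 0; i < 8; i++)
            r[i] = (uint)BitOperations.PopCount(value.GetElement(i));
        return Vector256.Create(r[0], r[1], r[2], r[3], r[4], r[5], r[6], r[7]);
    }

    static void Main()
    {
        var v = PopCount(Vector256.Create(0u, 1u, 3u, 7u, 15u, 255u, uint.MaxValue, 0u));
        Console.WriteLine(v.GetElement(6)); // 32
    }
}
```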

@tannergooding
Member

tannergooding commented Oct 14, 2020

AVX512-F is the "baseline" instruction set and doesn't expose any 128-bit or 256-bit variants; it exposes the 512-bit and mask variants. The 128-bit and 256-bit variants are part of the separate AVX512-VL instruction set (which depends on AVX512-F).

In order to support the encoding correctly, we need to be aware of the full 512-bit state and appropriately save/restore the upper bits across call boundaries among other things.

@ArnimSchinz

ArnimSchinz commented Aug 23, 2021

Vector<T> support would also be very nice.

@saucecontrol
Member

Vector<T> support would also be very nice.

Variable size for Vector<T> already results in unpredictable performance between AVX2 and non-AVX2 hardware due to the cost of cross-lane operations and the larger minimum vector size being useful in fewer places. Auto-extending Vector<T> to 64 bytes on AVX-512 hardware would aggravate the situation.

However, the common API defined for cross-platform vector helpers (#49397) plus Static Abstracts in Interfaces would allow the best of both worlds: shared logic where vector size doesn't matter, plus ISA- or size-specific logic where it does.

@ArnimSchinz

I like how the usage of Vector<T> makes the code forward compatible and hardware independent. Predictable performance is nice, but just having the best possible performance on every underlying hardware is more important.

@tannergooding
Member

tannergooding commented Aug 23, 2021

Predictable performance is nice, but just having the best possible performance on every underlying hardware is more important.

"best possible performance" isn't always the same as using the largest available vector size. It is often the case that larger vectors come with increased costs for small inputs or for handling various checks to see which path needs to be taken.

Support for 512-bits in Vector needs to be considered, profiled, and potentially left as an opt-in AppContext switch to ensure that various apps can choose and use what is right for them.

@eladmarg

@tannergooding thanks for the detailed answer.

I do believe there will be a benefit in the long run after avx512 will become mainstream as technology continue improving

@lemire gained 40% performance improvement for json parsing thanks to avx512

So currently there are other priorities; hopefully this will catch up in .NET 8.

@lemire

lemire commented May 27, 2022

Like ARM's SVE and SVE2, AVX-512 is not merely 'same as before but with wider registers'. It requires extensive work at the software level because it is a very different paradigm. On the plus side, recent Intel processors (Ice Lake and Tiger Lake) have good AVX-512 support, without downclocking and with highly useful instructions. And the good results are there: we parse JSON at record-breaking speeds. AVX-512 allows you to do base64 encoding/decoding at the speed of a memory copy. Crypto, machine learning, compression...

@tannergooding is of course correct that it is not likely that most programmers will directly benefit from AVX-512 in the short term, but I would argue that many more programmers would benefit indirectly if AVX-512 was used in core libraries. E.g., we are currently working on how to use AVX-512 for processing unicode.

On the downside, AMD is unlikely to support widely AVX-512 in the near future, and Intel is still putting out laptop processors without AVX-512 support...

@HighPerfDotNet

CC: @tannergooding

AMD confirmed that consumer level Zen 4 (due out in fall) will support AVX-512, source:

https://videocardz.com/newz/amd-confirms-ryzen-7000-is-up-to-16-cores-and-170w-tdp-rdna2-integrated-gpu-a-standard-ai-acceleration-based-on-avx512

So that means even consumer-level chips will support it, meaning it will also be in the server Genoa chips due Q4 this year. This also means Intel will have to enable AVX-512 in their consumer chips too.

Perhaps implementing a C-style asm keyword in C# could be an alternative to supporting specific intrinsics...

@tannergooding
Member

AMD confirmed that consumer level Zen 4 (due out in fall) will support AVX-512, source:

There will need to be a more definitive source, preferably directly from the AMD website or developer docs.

This also means Intel will have to enable AVX-512 in their consumer chips too.

It does not mean or imply that. Different hardware manufacturers may have competing design goals or ideologies about where it makes sense to expose different ISAs.

Historically they have not always aligned or agreed and it is incorrect to speculate here.

Perhaps implementing a C-style asm keyword in C# could be an alternative to supporting specific intrinsics...

The amount of work required to support such a feature is greater than simply adding the direct hardware intrinsic support for AVX-512 instructions.

It requires all the same JIT changes around handling EVEX, the additional 16 registers, having some TYP_SIMD64, and the op-mask registers. Plus it would also require language support, a full fledged assembly lexer/parser, and more.


More generally, AVX-512 support will likely happen eventually. But even partial support is a non-trivial amount of work, particularly in the register allocator, debugger, and in the context save/restore logic.

@tannergooding
Member

The work required here can effectively be broken down into a few categories:

The first step is to update the VM to query CPUID and track the available ISAs. Then the basis of any additional work is adding support for EVEX encoded instructions but limiting it only to AVX-512VL with no masking support and no support for XMM16-XMM31. This would allow access to new 128-bit and 256-bit instructions but not access to any of the more complex functionality. It would be akin to exposing some new AVX3 ISA in complexity.
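For context, the kind of CPUID query the VM performs at startup can even be sketched in C# itself via the existing X86Base.CpuId intrinsic. The bit positions below come from the Intel SDM (leaf 7, sub-leaf 0: EBX bit 16 reports AVX512F, EBX bit 31 reports AVX512VL) and should be double-checked against the manual; the class and method names are illustrative:

```csharp
using System.Runtime.Intrinsics.X86;

static class Avx512Detection
{
    // Mirrors the VM-side CPUID query: leaf 7, sub-leaf 0. Per the Intel SDM,
    // EBX bit 16 reports AVX512F and EBX bit 31 reports AVX512VL.
    public static (bool F, bool VL) Query()
    {
        if (!X86Base.IsSupported)
            return (false, false); // not on x86/x64 at all

        (_, int ebx, _, _) = X86Base.CpuId(7, 0);
        return ((ebx & (1 << 16)) != 0, (ebx & (1 << 31)) != 0);
    }
}
```

The real VM check is more involved (it must also verify via XGETBV that the OS saves the full 512-bit and opmask state), but the CPUID leaf is the starting point.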

Then there are three more complex work items that could be done in any order.

  1. Extend the register support to the additional 16 registers AVX-512 makes available. These are XMM16-XMM31 and would require work in the thread, callee, and caller save/restore contexts, work in the debugger, and some minimal work in the register allocator to indicate they are available but only on 64-bit and only when AVX-512 is supported.
  2. Extend the register support to the upper 256 bits of the registers. This involves exposing and integrating a TYP_SIMD64 throughout the JIT, as well as work in the thread, callee, and caller save/restore contexts, and work in the debugger.
  3. Extend the register support to the KMASK registers. This would require more significant work in the register allocator and potentially the JIT to support the "entirely new" concept and registers.

There is then library work required to expose and support Vector512<T> and the various AVX-512 ISAs. This work could be done incrementally alongside the other work.
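As a rough sketch of what that library surface enables, the code below follows the Vector512&lt;T&gt; shape that eventually shipped in .NET 8; treat the names as illustrative of the work item rather than final API:

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

static class ByteCount
{
    // Counts occurrences of 'needle' 64 bytes at a time when 512-bit vectors
    // are accelerated, falling back to scalar code otherwise.
    public static int CountMatches(ReadOnlySpan<byte> data, byte needle)
    {
        int count = 0, i = 0;
        if (Vector512.IsHardwareAccelerated)
        {
            var target = Vector512.Create(needle);
            for (; i <= data.Length - Vector512<byte>.Count; i += Vector512<byte>.Count)
            {
                var eq = Vector512.Equals(
                    Vector512.Create(data.Slice(i, Vector512<byte>.Count)), target);
                count += BitOperations.PopCount(eq.ExtractMostSignificantBits());
            }
        }
        for (; i < data.Length; i++)
            if (data[i] == needle) count++;
        return count;
    }
}
```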


Conceivably the VM and basic EVEX encoding work are "any time". It would require review from the JIT team but is not complex enough that it would be impractical to consider incrementally. The same goes for any library work exposed around this, with an annotation that API review would likely want to see more concrete numbers on the number of APIs exposed, where they are distributed, etc.

The latter three work items would touch larger amounts of JIT code however and could only be worked on if the JIT team knows they have the time and resources to review and ensure everything is working as expected. For some of these more complex work items it may even be desirable to have a small design doc laying out how it's expected to work, particularly for KMASK registers.

@tannergooding
Member

Also noting that the JIT team has the final say on when feature work can go in, even for the items I called out as seemingly anytime.

@HighPerfDotNet

HighPerfDotNet commented May 27, 2022

There will need to be a more definitive source, preferably directly from the AMD website or developer docs.

Well, we'll know in a few months, but right now it seems that it's 99.9% happening, even in cheap mass-produced consumer chips.

Historically they have not always aligned or agreed and it is incorrect to speculate here.

It seems inevitable now since Intel can't afford AMD getting a massive speed-up using an Intel-developed instruction set; there are also lots of Intel servers out there with AVX-512.

@tannergooding
Member

It is pretty definitive since it comes from an interview given by AMD Director of Technical Marketing Robert Hallock; you can view it here:

They explicitly state "not AMD proprietary, that's all I can say" and then "I can't hear you" in response to "can we say anything like AVX-512, or anything like that?". That does not sound like confirmation to me; rather, it is explicitly not disclosing additional details at this time, and we will get additional details in the future.

For example, this could simply be AVX-VNNI and/or AMX, which are both explicitly non-AVX-512 ISAs that support AI/ML scenarios.

We will ultimately have to wait and see what official documentation is provided in the future.

It seems inevitable now since Intel can't afford AMD getting a massive speed-up using an Intel-developed instruction set; there are also lots of Intel servers out there with AVX-512.

It continues to be incorrect to speculate here or make presumptions about what the hardware vendors can or cannot do based on what other hardware vendors are doing.

As I mentioned, AVX-512 will likely come in due time, but there needs to be sufficient justification for the non-trivial amount of work. Another hardware vendor getting support might give more weight to that justification, but there needs to be a definitive response and documentation from said vendors covering exactly which ISAs will be supported (AVX-512 is a large foundational ISA plus roughly 15 sub-ISAs), where that support will exist, etc.

@HighPerfDotNet

Official from AMD today: support for AVX-512 in Zen 4.

[Slide: AMD FAD 2022, Zen 4 improvements on 5nm]

Source: https://www.servethehome.com/amd-technology-roadmap-from-amd-financial-analyst-day-2022/

@tannergooding
Member

Said slides should be available from an official source about 4 hours after the event ends: https://www.amd.com/en/press-releases/2022-06-02-amd-to-host-financial-analyst-day-june-9-2022

I'll take a closer look when that happens, but it doesn't look like it goes any more in depth into what ISAs are covered vs not.

@tannergooding
Member

tannergooding commented Jun 10, 2022

The raw slides aren't available, but its covered by the recorded and publicly available webcast: https://ir.amd.com/news-events/financial-analyst-day

Skip to 45:35 for the relevant portion and slide.

Edit: Raw slides are under the Technology Leadership link.

@lemire

lemire commented Jun 10, 2022

AMD is vague. He did refer to HPC, which suggests it might be more than BF16 and VNNI.

@tannergooding
Member

tannergooding commented Jun 10, 2022

Yes, we'll need to continue waiting for more details. AVX-512, per specification, requires at least AVX-512F which includes the EVEX encoding, the additional 16-registers, 512-bit register support, and the kmask register support.

The "ideal" scenario is that this also includes AVX-512VL and therefore the EVEX encoding, the additional 16-registers, and the masking support are available to all 128-bit and 256-bit instructions. This would allow better optimizations for existing code paths, use of the new instructions including full width permute and vptern, and isn't only going to light up for large inputs and HPC scenarios.
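To illustrate why vptern is attractive: vpternlog evaluates any three-input boolean function, selected by an 8-bit truth-table immediate, in a single instruction. Below is a hedged sketch against the Avx512F.TernaryLogic shape that later shipped in .NET 8; the 0xEA immediate (an assumption worth verifying against the Intel SDM) encodes (a & b) | c:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class TernaryLogicExample
{
    // (a & b) | c in a single vpternlogd. The 0xEA immediate is the 8-bit
    // truth table of the function, indexed by (a << 2) | (b << 1) | c.
    public static Vector512<uint> AndThenOr(Vector512<uint> a, Vector512<uint> b, Vector512<uint> c)
    {
        if (Avx512F.IsSupported)
            return Avx512F.TernaryLogic(a, b, c, 0xEA); // one instruction
        return (a & b) | c; // portable two-operation fallback
    }
}
```

With AVX-512VL the same instruction is available at 128-bit and 256-bit widths, which is why VL support matters for existing code paths and not only for 512-bit ones.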

However, having official confirmation that the ISA (even if only AVX-512F) is now going to be cross-vendor does help justify work done towards supporting this.

@HighPerfDotNet

I expect to see a CPU-Z screenshot of Zen 4 with details of supported ISAs soon, but given what was said so far it does FEEL to me that the support will be pretty extensive (the Genoa version needs it, and the chiplets are the same as consumer anyway), so that should include the new AVX-512 VNNI.

@Symbai

Symbai commented Sep 18, 2022

[CPU-Z screenshot]

It only says AVX-512F, but according to Wikipedia it also supports VL. The MSVC compiler has supported AVX-512 since 2020. I really hope to see this in .NET 8.

@filipnavara
Member

Zen4's AVX512 flavors are all of Ice Lake plus AVX512-BF16.

So Zen4 has:
AVX512-F
AVX512-CD
AVX512-VL
AVX512-BW
AVX512-DQ
AVX512-IFMA
AVX512-VBMI
AVX512-VNNI
AVX512-BF16
AVX512-VPOPCNTDQ
AVX512-VBMI2
AVX512-VPCLMULQDQ
AVX512-BITALG
AVX512-GFNI
AVX512-VAES

The ones it is missing are:
The Xeon Phi ones: AVX512-PF, AVX512-ER, AVX512-4FMAPS, AVX512-4VNNIW
AVX512-VP2INTERSECT (from Tiger Lake)
AVX512-FP16 (from Sapphire Rapids and AVX512-enabled Alder Lake)

Source: https://www.mersenneforum.org/showthread.php?p=614191

@lemire

lemire commented Sep 26, 2022

The gist of it is that @HighPerfDotNet was right. AMD Zen 4 has full AVX-512 support (full in the sense that it is competitive with the best Intel offerings).

I submit to you that this makes supporting AVX-512 much more compelling.

@tannergooding
Member

We're already working on adding AVX-512 support in .NET 8, a few foundational PRs have already been merged ;)

@eladmarg

Awesome!
With AVX-512F alone, the missing instructions can sometimes be emulated in software and still come out faster.

@HighPerfDotNet

Zen 4 support looks better than I expected; despite being "double pumped", it turns out that was a great design decision on AMD's part. Very happy to see that AVX-512 is finally getting added to .NET!

@tannergooding
Member

tannergooding commented Oct 2, 2022

Now that Zen4 is officially out, I can confirm the AVX-512 supported ISAs:
[CPU-Z screenshot of the supported instruction sets]

This is AVX512:

  • F
  • CD
  • BW
  • DQ
  • IFMA
  • VL
  • VBMI
  • VBMI2
  • GFNI
  • VAES
  • VNNI
  • BITALG
  • VPCLMULQDQ
  • VPOPCNTDQ
  • BF16

That is everything except VP2INTERSECT and FP16 (the latter of which hasn't shipped with official support anywhere yet).

  • This does not include ER, PF, 4FMAPS, and 4VNNIW, which were only in Knights Landing/Mill (originally known as IMCI) and have been effectively deprecated.

@Symbai

Symbai commented Oct 4, 2022

Are CompareGreaterThan, CompareEqual, CompareLessThan, and so on in VL?

@BruceForstall BruceForstall added the avx512 Related to the AVX-512 architecture label Oct 13, 2022
@tannergooding
Member

Closing this as the actual work is being tracked by #73262, #73604, #76579, and any future issues we open.

@ghost ghost locked as resolved and limited conversation to collaborators Feb 2, 2023
@teo-tsirpanis teo-tsirpanis modified the milestones: Future, 8.0.0 Aug 27, 2023