AVX-512 support in System.Runtime.Intrinsics.X86 #35773
Tagging subscribers to this area: @tannergooding
There isn't an explicit tracking issue right now. AVX-512 represents a significant investment as it nearly triples the current surface area (from ~1500 APIs to ~4500 APIs). It additionally adds a new encoding, additional registers that would require support (this is extending to 512 bits, supporting 16 more registers, and adding 8 mask registers), and a new SIMD type (`Vector512<T>`).

If someone does want to create a rough proposal of what AVX-512F would look like (since that is the base for the rest of the AVX-512 support), then I'd be happy to provide feedback and continue the discussion until it does bubble up.

CC. @CarolEidt, @echesakovMSFT, @BruceForstall as they may have additional or different thoughts/opinions
Totally agree. Visual C++'s main AVX-512 roll out seems to have spanned the entire Visual Studio 2017 lifecycle and is still receiving attention in recent VS 2019 updates. It seems to me an initial question here could be what an AVX-512 roadmap might look like across multiple annual .NET releases. In the meantime, there is the workaround of calling intrinsics from C++, C++ from C++/CLI, and C++/CLI from C#. But I wouldn't have opened this issue if that layering was a great developer experience compared to intrinsics from C#. :-)

+3000 APIs may ultimately be on the low side. My current scrape of the Intel Intrinsics Guide lists 4255 AVX-512 intrinsics and 540 instructions. Only 380 of the intrinsics fall outside the F+CD+VL+DQ+BW group supported from Skylake-SP and -X onward, and Ice Lake supports 4124 of the 4255 (give or take errors in the Guide I haven't caught, or errors on my part). Depending how exactly AVX-512F is defined, I count it as totaling either 1435 or 2654 intrinsics. So it might make more sense to try to start with the initial 1500 intrinsics prioritized for Visual C++ 2017, or even some subset thereof. I don't have that list, though. Within this context, @tannergooding, if you can give me some more definition of what you're looking for in an AVX-512F sketch, I can probably put something together.

I touched on this in #226, but the ability to jit existing 128 and 256 bit System.Runtime.Intrinsics.X86 APIs to EVEX for access to zmm registers 16-31 would be a valuable minimum increment, even if not headlined by the addition of an Avx512 class. Definitely for most of the kernels in the various numerical codes I've written, and perhaps also for the CLR's internal use of SIMD. (I can suggest some other pragmatically minded clickstops if there's interest.)
At the most basic level, there would need to be a `Vector512<T>` type and an `Avx512F` class exposing the new intrinsics.

The methods proposed would likely have signatures like the following at a minimum (essentially mirroring SSE/AVX, but extending to V512):

```csharp
/// <summary>
/// __m512d _mm512_add_pd (__m512d a, __m512d b);
/// VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Vector512<double> left, Vector512<double> right)
```

On top of that minimum, there would need to be a proposal for a new x86 specific mask type (shown as `Mask8` here) and the masked overloads:

```csharp
/// <summary>
/// __m512d _mm512_mask_add_pd (__m512d s, __mmask8 k, __m512d a, __m512d b);
/// VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Vector512<double> value, Mask8 mask, Vector512<double> left, Vector512<double> right); // This overload merges values not written to by the mask

/// <summary>
/// __m512d _mm512_maskz_add_pd (__mmask8 k, __m512d a, __m512d b);
/// VADDPD zmm, zmm, zmm/m512
/// </summary>
public static Vector512<double> Add(Mask8 mask, Vector512<double> left, Vector512<double> right); // This overload zeros values not written to by the mask
```

EVEX additionally has support for embedded broadcast, where a scalar memory operand is broadcast to all elements. EVEX additionally has support for embedded rounding control on many floating-point operations.

Then there are 128-bit and 256-bit versions for most of these, but they fall under `Avx512VL`.
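To make the two masking behaviors concrete, here is a scalar model of an 8-lane masked add over doubles (purely illustrative of EVEX merge-masking vs zero-masking semantics, not the proposed API):

```csharp
// One Mask8 bit per lane: a set bit means "compute this lane".
static double[] AddMergeMask(double[] s, byte mask, double[] a, double[] b)
{
    var result = new double[8];
    for (int i = 0; i < 8; i++)
        result[i] = ((mask >> i) & 1) != 0
            ? a[i] + b[i]  // lane selected by the mask: compute
            : s[i];        // merge-masking: keep the passthrough value
    return result;
}

static double[] AddZeroMask(byte mask, double[] a, double[] b)
{
    var result = new double[8];
    for (int i = 0; i < 8; i++)
        result[i] = ((mask >> i) & 1) != 0
            ? a[i] + b[i]  // lane selected by the mask: compute
            : 0.0;         // zero-masking: clear the lane
    return result;
}
```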
Does this mean rounding immediates on operations like these would also be exposed?
No, the rounding instructions convert floats to integrals, while the rounding mode impacts how the result of an arithmetic operation is rounded. IEEE 754 floating-point arithmetic is performed taking the inputs as given, computing the "infinitely precise result", and then rounding to the nearest representable result.
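A small self-contained example of that "compute exactly, then round to nearest representable" behavior (plain IEEE 754 `float`, nothing AVX-512 specific):

```csharp
// 2^24 is the limit below which float represents every integer exactly.
// 16777216f + 1f has the infinitely precise result 16777217, which is not
// representable as a float, so under the default round-to-nearest-even mode
// it rounds back to 16777216f.
float x = 16777216f;
Console.WriteLine(x + 1f == x); // True
```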
Ah, brilliant, I think that is what I meant but I didn't word it well 😄
I would appreciate the "compare into mask" instructions in AVX-512BW to speed up parsing and IndexOf.
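For context, a sketch of the AVX2 idiom such code uses today (the threshold and helper shape are chosen for illustration); AVX-512BW's compare-into-mask (`VPCMPEQB` writing straight to a `k` register) would fuse the compare and `MoveMask` steps and widen them to 64 bytes:

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static unsafe int IndexOfByteAvx2(byte* data, int length, byte value)
{
    Vector256<byte> target = Vector256.Create(value);
    int i = 0;
    for (; i <= length - 32; i += 32)
    {
        Vector256<byte> eq = Avx2.CompareEqual(Avx.LoadVector256(data + i), target);
        int mask = Avx2.MoveMask(eq);                         // one bit per byte lane
        if (mask != 0)
            return i + BitOperations.TrailingZeroCount(mask); // first matching lane
    }
    for (; i < length; i++)                                   // scalar tail
        if (data[i] == value)
            return i;
    return -1;
}
```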
For the new surface area, is there anything that could be done to limit the number of APIs that have to be exposed and reviewed?
Yes, there are likely some tricks we can do to help limit the number of exposed APIs and/or the number of APIs we need to review.
Dependencies exist only on F and VL, and seem unlikely to be concerns on Intel hardware. Probably not on AMD either, if they implement AVX-512. It seems GitHub doesn't support tables in comments, so I made a small repo with the details.
Actually, if I had to pick just one width for initial EVEX and new intrinsic support, it'd be 128.
Also 16-, 32-, and 64-bit masks. And rounding, comparison, min/max, and mantissa norm and sign enums. The BF16 subset planned for Tiger Lake would require #936, but that's unimportant at this point. I'll see about getting something sketched, hopefully in the next week or so.
It is a bit more in depth than this...

Now, given how the `VL` subset extends each of the other subsets, the shape might look something like:

```csharp
public abstract class AVX512F : ??
{
    public abstract class VL
    {
    }
}

public abstract class AVX512DQ : AVX512F
{
    public abstract class VL : AVX512F.VL
    {
    }
}
```

This would key off the existing model we have used for 64-bit extensions (e.g. `Sse2.X64`).
I think this is a non-starter. The 128-bit EVEX support is not baseline; it (and the 256-bit support) is part of the `AVX512VL` subset.
Hi Tanner, yes, it is. That's why the classes suggested as a starting point when this issue was opened don't attempt to model every CPUID flag individually. It's also why I posted the tabulation in the repo linked above. While there are lots of possible factorings, it seems to me they're all going to be less than ideal in some way, because class inheritance is an inexact match to the CPUID flags.

My thoughts have gone in the same direction as you're exploring, but I landed in a somewhat different place. I'm not sure how abstract classes would work with the current static method model for intrinsics, but one option might be:

```csharp
public class Avx512F : Avx2 // inheriting from Avx2 captures more surface than Fma?
{
    public static bool IsSupported // checks OSXSAVE and F CPUID
    // eventually builds out to 1435 F intrinsics

    public class VL // eventually has all of the 1208 VL subset intrinsics which depend on F
    {
        public static bool IsSupported // checks OSXSAVE and VL CPUID but not F
    }
}

public class Avx512DQ : Avx512F // Intrinsics Guide says no DQ instructions have CPUID dependencies on F but arch manual says F must be checked before checking for DQ
{
    public static bool IsSupported // checks OSXSAVE, F and DQ CPUIDs
    // eventually has 223 DQ intrinsics which do not depend on VL

    public class VL // has the 176 DQ intrinsics which do depend on VL
    {
        public static bool IsSupported // checks OSXSAVE, F, VL, and DQ
    }
}
```

Presumably CD and BW would look much like DQ. My thinking for BITALG, IFMA52, VBMI, VBMI2, VNNI, BF16, and VP2INTERSECT when opening this issue was similar.

It seems to me the advantage of this approach is that it's more robust to Intel or AMD maybe deciding to do something different with CPUID flags in the future. It might also be friendlier to intellisense performance during coding. The disadvantage is the CPUID structure would constantly be restated in code. This doesn't seem helpful to readability, and it forces developers to think a lot about which intrinsics are in which subsets while coding. That seems more distracting than necessary and probably occasionally frustrating, so I'm unsure this is the best available tradeoff.

These ideas can be expressed without nested classes, which I think might be a little more friendly to coding. I'll leave those variants for a later reply, though.
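Under a nested-class factoring like this, user code would presumably guard with the combined checks (hypothetical, since none of these classes exist yet):

```csharp
// Hypothetical guards under the sketched hierarchy: the nested VL class has
// its own IsSupported, so 128/256-bit AVX-512DQ code gates on the VL check.
if (Avx512DQ.VL.IsSupported)
{
    // may use DQ instructions on 128-bit and 256-bit vectors
}
else if (Avx2.IsSupported)
{
    // AVX2 fallback path
}
```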
Technically, F doesn't depend on anything but silicon. Just like Sse2 doesn't depend on Sse. The reason the class hierarchy below Avx2 and Fma works is because Intel and AMD have always shipped expanding instruction sets. In this sense, the Fma-Avx2 fork probably wasn't great for continuing the derivation chain. But all we can do now is to make our best attempt at not creating similar problems in the AVX-512 surface. This particular bit of learning with Avx2 and Fma is one of the reasons why I'm a little hesitant about individually modeling CPUID flags explicitly in a class hierarchy.
I'm sorry, but I'm not understanding why such an implementation constraint would need to be imposed. Yes, Intel made an F subset and named it foundation, and, yes, VL depends on F. But Intel's decisions about CPUID flag details don't need to control the order in which Microsoft ships intrinsics to customers. If you're saying Microsoft's internal .NET review process is such that architects and similar would want to see an Avx512 class hierarchy, including Avx512F, laid out before approving work on a VL implementation, that seems fair. However, if they'd insist you (or another developer) code F before VL, I think that's more than a bit strange. And maybe also somewhat disconnected from early AVX-512 adoption, where adjusting existing 128 and 256 bit kernels to use masks or take advantage of certain additional instructions might be common.
I think so too. 😄 It's also why I proposed some things aligned with Knights and Skylake. While we don't know if, how, or when AMD might implement AVX-512, Intel is done with those two microarchitectures, and we know Sunny Cove doesn't backtrack from Skylake instructions. So looking at how .NET might support the 96% of Ice Lake intrinsics which have been consistently available since Skylake is hopefully a pretty safe target. Some of Intel's blog posts from years ago indicate CD will always be present with F, which is where the F+CD grouping above comes from.
Thanks for explaining! I'm not sure I entirely follow the table structure, but am I correct in getting the impression it makes the cost of adding intrinsics fairly low? If so, that implies the distinction I was trying to make ("please consider unlocking some of 3 before finishing everything in 2") might not be large. My test situation's even worse until either desktop Ice Lakes or expanded Ice Lake laptop availability, so I totally get the challenges there. I also appreciate that EVEX support is a substantial effort.
Oh excellent, appreciate the catch (we have an unmanaged class I should correct, as it's not honoring the SSE hierarchy). Fixed up the code comments in my previous post. Curiously, the Intrinsics Guide typically does not indicate dependencies on AVX-512F, even though sections 15.2.1, 15.3, and 15.4 of the Intel 64 and IA-32 Architectures Software Developer's Manual all indicate software must check F before checking other subset flags. I'll ask about this on the Intrinsics Guide bug thread over in Intel's ISA forum. I think there's also a typo in figure 15-5 of the arch manual, as it should indicate table 15-2 rather than 2-2.
Even if Intel would be unlikely to ever ship CD without F, the CPUID bits are technically independent.
It varies from intrinsic to intrinsic, but in general the intrinsics are table driven, so if one doesn't expose any new "concepts" then it is just adding a new entry to https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/hwintrinsiclistxarch.h with the appropriate flags. The various paths know to look up this information in the table to determine how it should be handled. When an intrinsic does introduce a new concept, or if it requires specialized handling, then it requires a table entry plus the relevant logic added to the various locations in the JIT (generally importation, lowering, register allocation, and codegen). In the ideal scenario, the new concept/handling is more generally applicable, so it is a one-time cost for the first intrinsic that uses it, and subsequent usages are then able to go down the simple table-driven route.

The tests are largely table driven as well and are generated from the templates and metadata in https://github.com/dotnet/runtime/blob/master/src/coreclr/tests/src/JIT/HardwareIntrinsics/X86/Shared/GenerateTests.csx. This ensures the various relevant code paths are covered without having to explicitly codify the logic every time.

For 1, it is ideally just an encoding difference like the legacy vs VEX encoding was, in which case there aren't really any new tests or APIs to expose.
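For a sense of what "just adding a table entry" means, rows in that header look roughly like this (the AVX2 `Add` row, paraphrased; the exact macro fields have shifted between runtime versions):

```cpp
// HARDWARE_INTRINSIC(isa, name, simdSize, argCount, instructions[10], category, flags)
// The instruction array holds one instruction per base element type, ordered
// byte..double; INS_invalid marks element types the intrinsic doesn't support.
HARDWARE_INTRINSIC(AVX2, Add, 32, 2,
    {INS_paddb, INS_paddb, INS_paddw, INS_paddw, INS_paddd, INS_paddd, INS_paddq, INS_paddq, INS_invalid, INS_invalid},
    HW_Category_SimpleSIMD, HW_Flag_NoFlag)
```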
Minor status bump: Intel's never been particularly active on their instruction set extensions forum, but they've recently stopped responding entirely. So no update from Intel on the questions about the arch manual and intrinsics guide that were raised here a month ago.
Interesting. The arch manual states software must also check for F when checking for CD (and strongly recommends checking F before CD). You have more privileged access to what Intel really meant, and to context on how to resolve conflicts between the arch manual and the Intrinsics Guide, than most of us. Thanks for sharing.
I think my statement might have been misinterpreted. I was indicating that the following should be possible (where the hardware exposes F but not VL): `Avx512F.IsSupported == true` while `Avx512F.VL.IsSupported == false`.

The following should never be possible: `Avx512F.VL.IsSupported == true` while `Avx512F.IsSupported == false`.

AFAIK, there has never been a CPU that has shipped with VL support but without F support.
@tannergooding Have you considered a staged approach to 1 above, by first adding EVEX encoding without ZMM or mask support? This would allow use of AVX-512* instructions that operate on XMM and YMM without introducing `Vector512<T>` or the mask registers.
AVX512-F is the "baseline" instruction set and doesn't expose any 128-bit or 256-bit variants; it exposes the 512-bit and mask variants. The 128-bit and 256-bit variants are part of the separate AVX512-VL instruction set (which depends on AVX512-F). In order to support the encoding correctly, we need to be aware of the full 512-bit state and appropriately save/restore the upper bits across call boundaries, among other things.
Variable size for `Vector<T>` cuts both ways. However, the common API defined for cross-platform vector helpers (#49397) plus Static Abstracts in Interfaces would allow the best of both worlds: shared logic where vector size doesn't matter, plus ISA- or size-specific logic where it does.
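As a sketch of that idea (the interface shape here is hypothetical, loosely in the spirit of #49397 plus static abstract interface members, not a shipped API):

```csharp
using System;

// A width-agnostic kernel written once against a hypothetical vector
// abstraction; 128/256/512-bit vector types would each implement it.
interface ISimdVector<TSelf> where TSelf : ISimdVector<TSelf>
{
    static abstract int Count { get; }                       // elements per vector
    static abstract TSelf Load(ReadOnlySpan<float> source);  // reads Count elements
    static abstract TSelf Add(TSelf left, TSelf right);
    static abstract void Store(TSelf value, Span<float> destination);
}

static class Kernels
{
    // Shared logic: the loop body never mentions a concrete vector width.
    public static void Add<TVector>(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result)
        where TVector : ISimdVector<TVector>
    {
        int i = 0;
        for (; i <= a.Length - TVector.Count; i += TVector.Count)
            TVector.Store(TVector.Add(TVector.Load(a[i..]), TVector.Load(b[i..])), result[i..]);
        for (; i < a.Length; i++)   // scalar tail
            result[i] = a[i] + b[i];
    }
}
```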
I like how the usage of `Vector<T>` automatically gives you the best possible performance available on the hardware.
"best possible performance" isn't always the same as using the largest available vector size. It is often the case that larger vectors come with increased costs for small inputs or for handling various checks to see which path needs to be taken. Support for 512-bits in Vector needs to be considered, profiled, and potentially left as an opt-in AppContext switch to ensure that various apps can choose and use what is right for them. |
@tannergooding thanks for the detailed answer. I do believe there will be a benefit in the long run, once AVX-512 becomes mainstream as the technology continues improving; @lemire gained a 40% performance improvement for JSON parsing thanks to AVX-512. So currently there are other priorities; hopefully this will catch up in .NET 8.
Like ARM's SVE and SVE2, AVX-512 is not merely 'same as before but with wider registers'. It requires extensive work at the software level because it is a very different paradigm. On the plus side, recent Intel processors (Ice Lake and Tiger Lake) have good AVX-512 support, without downclocking and with highly useful instructions. And the good results are there: we parse JSON at record-breaking speeds, and AVX-512 allows you to do base64 encoding/decoding at the speed of a memory copy. Crypto, machine learning, compression... @tannergooding is of course correct that it is not likely that most programmers will directly benefit from AVX-512 in the short term, but I would argue that many more programmers would benefit indirectly if AVX-512 was used in core libraries. E.g., we are currently working on how to use AVX-512 for processing Unicode. On the downside, AMD is unlikely to widely support AVX-512 in the near future, and Intel is still putting out laptop processors without AVX-512 support...
CC: @tannergooding. AMD confirmed that consumer-level Zen 4 (due out in fall) will support AVX-512, source: So that means even the consumer-level chips will support it, meaning it will also be in the server Genoa chips due Q4 this year. This also means Intel will have to enable AVX-512 in their consumer chips too. Perhaps implementing a C-style `asm` keyword in C# could be an alternative to supporting specific intrinsics...
There will need to be a more definitive source, preferably directly from the AMD website or developer docs.
It does not mean or imply that. Different hardware manufacturers may have competing design goals or ideologies about where it makes sense to expose different ISAs. Historically they have not always aligned or agreed, and it is incorrect to speculate here.
The amount of work required to support such a feature is greater than simply adding the direct hardware intrinsic support for AVX-512 instructions. It requires all the same JIT changes around handling EVEX, the additional 16 registers, having some representation for the 512-bit and mask register state, and so on.

More generally, AVX-512 support will likely happen eventually. But even partial support is a non-trivial amount of work, particularly in the register allocator, debugger, and in the context save/restore logic.
The work required here can effectively be broken down into a few categories.

The first step is to update the VM to query CPUID and track the available ISAs. Then the basis of any additional work is adding support for EVEX encoded instructions, but limiting it only to the existing 128-bit and 256-bit (XMM/YMM) registers.

Then there are three more complex work items that could be done in any order: support for the full 512-bit (ZMM) registers, support for the 16 additional registers, and support for the KMASK registers.
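For the CPUID step, the managed `X86Base.CpuId` wrapper (shipped in .NET 5) shows the shape of the check involved; a sketch only, since the real VM-side work is native and must also validate OS state:

```csharp
using System.Runtime.Intrinsics.X86;

// AVX-512 feature bits live in CPUID leaf 7, sub-leaf 0, EBX: bit 16 is
// AVX512F, bit 17 AVX512DQ, bit 28 AVX512CD, bit 30 AVX512BW, bit 31 AVX512VL.
static bool HasAvx512F()
{
    if (!X86Base.IsSupported)
        return false;
    (_, int ebx, _, _) = X86Base.CpuId(7, 0);
    // A real implementation must also check OSXSAVE/XGETBV to confirm the OS
    // saves and restores the ZMM and mask register state.
    return (ebx & (1 << 16)) != 0;
}
```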
There is then library work required to expose and support `Vector512<T>` and the new intrinsics APIs.

Conceivably the VM and basic EVEX encoding work are "any time". It would require review from the JIT team but is not complex enough that it would be impractical to consider incrementally. The same goes for any library work exposed around this, with the note that API review would likely want to see more concrete numbers on the number of APIs exposed, where they are distributed, etc.

The latter three work items would touch larger amounts of JIT code, however, and could only be worked on if the JIT team knows they have the time and resources to review and ensure everything is working as expected. For some of these more complex work items it may even be desirable to have a small design doc laying out how it's expected to work, particularly for the KMASK registers.
Also noting that the JIT team has the final say on when feature work can go in, even for the items I called out as seemingly "any time".
Well, we'll know in a few months, but right now it seems that it's 99.9% happening, even in cheap mass-produced consumer chips.
It seems inevitable now, since Intel can't afford AMD getting a massive speedup using an Intel-developed instruction set; there are also lots of Intel servers out there with AVX-512.
They explicitly state "not AMD proprietary, that's all I can say" and then "I can't hear you" in response to "can we say anything like AVX-512, or anything like that?". That does not sound like confirmation to me; rather, it is explicitly not disclosing additional details at this time, and we will get additional details in the future. An example is this could simply be the base AVX512F subset and nothing more. We will ultimately have to wait and see what official documentation is provided in the future.
It continues to be incorrect to speculate here or make presumptions about what the hardware vendors can or cannot do based on what other hardware vendors are doing. As I mentioned, AVX-512 will likely come in due time, but there needs to be sufficient justification for the non-trivial amount of work. Another hardware vendor getting support might give more weight to that justification, but there needs to be a definitive response and documentation from said vendors covering exactly what ISAs will be supported (AVX-512 is a large foundational ISA plus roughly 15 other sub-ISAs), where that support will exist, etc.
Official from AMD today: support for AVX-512 in Zen 4. Source: https://www.servethehome.com/amd-technology-roadmap-from-amd-financial-analyst-day-2022/
Said slides should be available from an official source about 4 hours after the event ends: https://www.amd.com/en/press-releases/2022-06-02-amd-to-host-financial-analyst-day-june-9-2022 I'll take a closer look when that happens, but it doesn't look like it goes any more in depth into which ISAs are covered vs not.
The raw slides aren't available, but it's covered by the recorded and publicly available webcast: https://ir.amd.com/news-events/financial-analyst-day Skip to 45:35 for the relevant portion and slide. Edit: Raw slides are under the Technology Leadership link.
AMD is vague. He did refer to HPC, which suggests it might be more than BF16 and VNNI.
Yes, we'll need to continue waiting for more details. AVX-512, per specification, requires at least the F subset. The "ideal" scenario is that this also includes CD, VL, DQ, and BW, matching what Intel has shipped since Skylake-X. However, having official confirmation that the ISA (even if only F) is coming gives more weight to prioritizing the work.
I expect to see a CPU-Z screenshot of Zen 4 with details of supported ISAs soon. But given what was said so far, it does FEEL to me that the support will be pretty extensive (the Genoa version needs it, and the chiplets are the same as consumer anyway), so that should include the new AVX-512 VNNI.
Only says AVX-512F, but according to Wikipedia it also supports VL. The MSVC compiler has supported AVX-512 since 2020. I really hope to see this in .NET 8.
Source: https://www.mersenneforum.org/showthread.php?p=614191
The gist of it is that @HighPerfDotNet was right. AMD Zen 4 has full AVX-512 support (full in the sense that it is competitive with the best Intel offerings). I submit to you that this makes supporting AVX-512 much more compelling.
We're already working on adding AVX-512 support in .NET 8; a few foundational PRs have already been merged ;)
Awesome!
Zen 4 support looks better than I expected; despite being "double pumped", it turns out that was a great design decision on AMD's part. Very happy to see that AVX-512 is finally getting added to .NET!
I presume supporting AVX-512 intrinsics is planned somewhere, but I couldn't find an existing issue tracking their addition. There seem to be two parts to this: jitting the EVEX encoding and exposing the new API surface.
There is some interface complexity with the (as of this writing) 17 AVX-512 subsets, since Knights Landing/Mill, Skylake, Cannon Lake, Cascade Lake, Cooper Lake, and Ice/Tiger Lake all support different variations. To me, it seems most natural to deprioritize support for the Knights (they're no longer in production, so presumably nearly all code targeting them has already been written) and implement something in the direction of an `Avx512F` base class with `Avx512CD`, `Avx512BW`, and `Avx512DQ` deriving from it, each paired with its VL variant,
plus non-inheriting classes for BITALG, IFMA52, VBMI, VBMI2, VNNI, BF16, and VP2INTERSECT (the remaining four subsets, 4FMAPS, 4NNIW, ER, and PF, are specific to Knights). This is similar to the existing model for Bmi1, Bmi2, and Lzcnt, and it aligns to current hardware in a way which composes with existing inheritance and IsSupported properties. It also helps with incremental roll out.
Finding naming that keeps code readable while staying clear about which instructions are available where seems somewhat tricky. Personally, I'd be content with fairly verbose idioms here, but hopefully others will have better ideas.