Commit

no longer require opt-in for AVX3_DL
PiperOrigin-RevId: 720480038
jan-wassenberg authored and copybara-github committed Jan 28, 2025
1 parent 0b69663 commit d0c73de
Showing 4 changed files with 16 additions and 31 deletions.
9 changes: 4 additions & 5 deletions README.md
@@ -161,11 +161,10 @@ Highway supports 24 targets, listed in alphabetical order of platform:
 - `SSE4` (~Nehalem, also includes AES + CLMUL).
 - `AVX2` (~Haswell, also includes BMI2 + F16 + FMA)
 - `AVX3` (~Skylake, AVX-512F/BW/CD/DQ/VL)
-- `AVX3_DL` (~Icelake, includes BitAlg + CLMUL + GFNI + VAES + VBMI +
-VBMI2 + VNNI + VPOPCNT; requires opt-in by defining `HWY_WANT_AVX3_DL`
-unless compiling for static dispatch),
-- `AVX3_ZEN4` (like AVX3_DL but optimized for AMD Zen4; requires opt-in by
-defining `HWY_WANT_AVX3_ZEN4` if compiling for static dispatch, but
+- `AVX3_DL` (~Icelake, includes `BitAlg` + `CLMUL` + `GFNI` + `VAES` +
+`VBMI` + `VBMI2` + `VNNI` + `VPOPCNT`),
+- `AVX3_ZEN4` (AVX3_DL plus BF16, optimized for AMD Zen4; requires opt-in
+by defining `HWY_WANT_AVX3_ZEN4` if compiling for static dispatch, but
 enabled by default for runtime dispatch),
 - `AVX3_SPR` (~Sapphire Rapids, includes AVX-512FP16)

3 changes: 1 addition & 2 deletions g3doc/design_philosophy.md
@@ -67,8 +67,7 @@
 * Not every CPU need be supported. To reduce code size and compile time, we
 group x86 targets into clusters. In particular, SSE3 instructions are only
 used/available if S-SSE3 is also available, and AVX only if AVX2 is also
-supported. Code generation for AVX3_DL also requires opting-in by defining
-HWY_WANT_AVX3_DL.
+supported.

 * Access to platform-specific intrinsics is necessary for acceptance in
 performance-critical projects. We provide conversions to and from intrinsics
7 changes: 2 additions & 5 deletions g3doc/quick_reference.md
@@ -304,8 +304,8 @@ Store(v, d2, ptr); // Use d2, NOT DFromV<decltype(v)>()
 ## Targets

 Let `Target` denote an instruction set, one of `SCALAR/EMU128`, `RVV`,
-`SSE2/SSSE3/SSE4/AVX2/AVX3/AVX3_DL/AVX3_ZEN4/AVX3_SPR` (x86),
-`PPC8/PPC9/PPC10/Z14/Z15` (POWER), `WASM/WASM_EMU256` (WebAssembly),
+`SSE2/SSSE3/SSE4/AVX2/AVX3/AVX3_DL/AVX3_ZEN4/AVX3_SPR` (x86), `PPC8/PPC9/PPC10`
+(POWER), `Z14/Z15` (IBM Z), `WASM/WASM_EMU256` (WebAssembly),
 `NEON_WITHOUT_AES/NEON/NEON_BF16/SVE/SVE2/SVE_256/SVE2_128` (Arm).

 Note that x86 CPUs are segmented into dozens of feature flags and capabilities,
@@ -349,9 +349,6 @@ instructions (implying the target CPU must support them).
 if they are not marked as available by the compiler. On MSVC, the only ways
 to enable SSSE3 and SSE4 are defining these, or enabling AVX.

-* `HWY_WANT_AVX3_DL`: opt-in for dynamic dispatch to `HWY_AVX3_DL`. This is
-unnecessary if the baseline already includes AVX3_DL.
-
 You can detect and influence the set of supported targets:

 * `TargetName(t)` returns a string literal identifying the single target `t`,
28 changes: 9 additions & 19 deletions hwy/detect_targets.h
@@ -63,15 +63,14 @@
 #define HWY_AVX10_2_512 (1LL << 3) // AVX10.2 with 512-bit vectors
 #define HWY_AVX3_SPR (1LL << 4)
 #define HWY_AVX10_2 (1LL << 5) // AVX10.2 with 256-bit vectors
-// Currently HWY_AVX3_DL plus AVX512BF16 and a special case for CompressStore
-// (10x as fast).
-// We may later also use VPCONFLICT.
+// Currently `HWY_AVX3_DL` plus `AVX512BF16` and a special case for
+// `CompressStore` (10x as fast, still useful on Zen5). We may later also use
+// `VPCONFLICT`. Note that `VP2INTERSECT` is available in Zen5.
 #define HWY_AVX3_ZEN4 (1LL << 6) // see HWY_WANT_AVX3_ZEN4 below

-// Currently satisfiable by Ice Lake (VNNI, VPCLMULQDQ, VPOPCNTDQ, VBMI, VBMI2,
-// VAES, BITALG, GFNI). Later to be added: BF16 (Cooper Lake). VP2INTERSECT is
-// only in Tiger Lake?
-#define HWY_AVX3_DL (1LL << 7) // see HWY_WANT_AVX3_DL below
+// Currently satisfiable by Ice Lake (`VNNI`, `VPCLMULQDQ`, `VPOPCNTDQ`,
+// `VBMI`, `VBMI2`, `VAES`, `BITALG`, `GFNI`).
+#define HWY_AVX3_DL (1LL << 7)
 #define HWY_AVX3 (1LL << 8) // HWY_AVX2 plus AVX-512F/BW/CD/DQ/VL
 #define HWY_AVX2 (1LL << 9) // HWY_SSE4 plus BMI2 + F16 + FMA
 // Bit 10: reserved
@@ -726,15 +725,6 @@
 #endif
 #endif // HWY_HAVE_RUNTIME_DISPATCH

-// AVX3_DL is not widely available yet. To reduce code size and compile time,
-// only include it in the set of attainable targets (for dynamic dispatch) if
-// the user opts in, OR it is in the baseline (we check whether enabled below).
-#if defined(HWY_WANT_AVX3_DL) || (HWY_BASELINE_TARGETS & HWY_AVX3_DL)
-#define HWY_ATTAINABLE_AVX3_DL (HWY_AVX3_DL)
-#else
-#define HWY_ATTAINABLE_AVX3_DL 0
-#endif
-
 #if HWY_ARCH_ARM_A64 && HWY_HAVE_RUNTIME_DISPATCH
 #define HWY_ATTAINABLE_NEON HWY_ALL_NEON
 #elif HWY_ARCH_ARM // static dispatch, or HWY_ARCH_ARM_V7
@@ -803,9 +793,9 @@
 #define HWY_ATTAINABLE_TARGETS_X86 \
 HWY_ENABLED(HWY_BASELINE_SCALAR | HWY_STATIC_TARGET | HWY_AVX2)
 #else // !HWY_COMPILER_MSVC
-#define HWY_ATTAINABLE_TARGETS_X86 \
-HWY_ENABLED(HWY_BASELINE_SCALAR | HWY_SSE2 | HWY_SSSE3 | HWY_SSE4 | \
-HWY_AVX2 | HWY_AVX3 | HWY_ATTAINABLE_AVX3_DL | HWY_AVX3_ZEN4 | \
+#define HWY_ATTAINABLE_TARGETS_X86 \
+HWY_ENABLED(HWY_BASELINE_SCALAR | HWY_SSE2 | HWY_SSSE3 | HWY_SSE4 | \
+HWY_AVX2 | HWY_AVX3 | HWY_AVX3_DL | HWY_AVX3_ZEN4 | \
 HWY_AVX3_SPR)
 #endif // !HWY_COMPILER_MSVC
 #endif // HWY_ATTAINABLE_TARGETS_X86
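
Reviewer note: the net effect of the `hwy/detect_targets.h` hunks on the attainable-target bitmask can be sketched in isolation. The snippet below is a simplified model, not the library's code. It copies the four bit positions from the first hunk, omits `HWY_ENABLED`, the baseline/scalar bits and the remaining x86 targets, and every `k...` constant and function name in it is hypothetical.

```cpp
#include <cstdint>

// Bit positions copied from the hwy/detect_targets.h hunk; the k* names are
// illustrative stand-ins for the HWY_* macros.
constexpr int64_t kAVX3_ZEN4 = 1LL << 6;
constexpr int64_t kAVX3_DL = 1LL << 7;
constexpr int64_t kAVX3 = 1LL << 8;
constexpr int64_t kAVX2 = 1LL << 9;

// Old behavior (removed HWY_ATTAINABLE_AVX3_DL block): AVX3_DL contributes
// its bit only if the user opted in or the baseline already includes it.
constexpr int64_t AttainableAvx3DlBefore(bool want_avx3_dl, int64_t baseline) {
  return (want_avx3_dl || (baseline & kAVX3_DL) != 0) ? kAVX3_DL : 0;
}

// Simplified attainable set before this commit (subset of x86 targets only).
constexpr int64_t AttainableBefore(bool want_avx3_dl, int64_t baseline) {
  return kAVX2 | kAVX3 | AttainableAvx3DlBefore(want_avx3_dl, baseline) |
         kAVX3_ZEN4;
}

// Simplified attainable set after this commit: AVX3_DL is unconditional.
constexpr int64_t AttainableAfter() {
  return kAVX2 | kAVX3 | kAVX3_DL | kAVX3_ZEN4;
}

// Without opt-in and with an AVX2 baseline, the old mask lacked AVX3_DL.
static_assert((AttainableBefore(false, kAVX2) & kAVX3_DL) == 0, "");
// The new mask equals what the old code produced with the opt-in defined.
static_assert(AttainableAfter() == AttainableBefore(true, kAVX2), "");
```

Under this model, builds that previously defined `HWY_WANT_AVX3_DL` see the same mask as before, while builds that did not now also compile the AVX3_DL target for dynamic dispatch.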
