Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable AVX512VL + AVX512DQ #5694

Open
wants to merge 1 commit into
base: bleeding-jumbo
Choose a base branch
from

Conversation

claudioandre-br
Copy link
Member

@claudioandre-br claudioandre-br commented Mar 10, 2025

Let's hear bots.

I'll remove runstatedir after testing.


checking for AVX2... yes
checking for AVX512BW + AVX512VL + AVX512DQ... yes
checking if gcc supports -maes -mpclmul... yes

OR:

checking for AVX2... yes
checking for AVX512BW + AVX512VL + AVX512DQ... no
checking for AVX512F... no
checking if gcc supports -maes -mpclmul... yes
Target CPU ......................................... x86_64 AVX512BW, 64-bit LE
Target OS .......................................... linux-gnu
Version: 1.9.0-jumbo-1+bleeding-60f3614a06 2025-03-10 07:27:48 -0300
Build: linux-gnu 64-bit x86_64 AVX512(BW+VL+DQ) AC OMP
SIMD: AVX512BW, interleaving: MD4:3 MD5:3 SHA1:1 SHA256:1 SHA512:1
AES hardware acceleration: AES-NI
CPU tests: AVX512(BW+VL+DQ)
$JOHN is ../run/

@claudioandre-br
Copy link
Member Author

CI is happy. Nothing bad so far.

@claudioandre-br
Copy link
Member Author

It seems it tests CPU support twice (as seen below):

configure: Trying to force avx512bw using default method (--enable-simd=avx512bw).
checking if gcc supports -mavx512bw -mavx512vl -mavx512dq w/ linking... yes
checking for extra ASFLAGS... None needed
checking for X32 ABI... no
checking special compiler flags... Intel x86
configure: Testing tool-chain's CPU support with given options
checking for MMX... yes
checking for SSE2... yes
checking for SSSE3... yes
checking for SSE4.1... yes
checking for SSE4.2... yes
checking for AVX... yes
checking for XOP... no
checking for AVX2... yes
checking for AVX512BW + AVX512VL + AVX512DQ... yes
checking if gcc supports -maes -mpclmul... yes

It doesn't hurt.

doc/NEWS Outdated
@@ -420,6 +420,7 @@ Major changes from 1.9.0-jumbo-1 (May 2019) in this bleeding-edge version:
- Add Oubliette Password Manager support (two formats and oubliette2john.py).
[DavideDG; 2025]

- Turn AVX512 into AVX512BW + AVX512VL + AVX512DQ. [Claudio André; 2025]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's confusing. I suggest:

- Use AVX512VL XOP-like bit rotates for scrypt's Salsa20.  [Solar; 2025]

- When we use AVX512BW, also enable usage of AVX512VL and AVX512DQ.  [Claudio André; 2025]

src/configure.ac Outdated
done
else
CPU_BEST_FLAGS_MAIN=-DJOHN_$(echo ${SIMD_NAME} | tr .a-z _A-Z)
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt we need this complication. Can't we just continue with JOHN_AVX512BW alone, but understand that it implies VL and DQ? I also don't know whether the += syntax works with other shells.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is possible if everyone understands that BW implies the rest.

src/configure.ac Outdated
CPU_NAME="$host_cpu AVX512BW"
else
CPU_NAME="$host_cpu $SIMD_NAME"
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like unneeded complication as well.

@@ -249,37 +249,38 @@ if test "x$simd" != xno; then
AS_IF([test "x$CPU_NOTFOUND" = x0],
[
CFLAGS="$CFLAGS_BACKUP -mavx512f -P $EXTRA_AS_FLAGS $CPPFLAGS $CFLAGS_EXTRA $CPUID_ASM"
CFLAGS="$CFLAGS_BACKUP -mavx512bw -mavx512vl -mavx512dq -P $EXTRA_AS_FLAGS $CPPFLAGS $CFLAGS_EXTRA $CPUID_ASM"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not implementing the full reverse order of checks + optimization, then maybe let's not reorder F vs. BW here? If we were checking F first, then continue to check it first. This PR's changes would be smaller then.

[CPU_BEST_FLAGS="-mavx512f"]
[SIMD_NAME="AVX512F"]
[CPU_BEST_FLAGS="-mavx512bw -mavx512vl -mavx512dq"]
[SIMD_NAME="AVX512(BW+VL+DQ)"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe continue to say just AVX512BW here.

AC_LINK_IFELSE(
[
AC_LANG_SOURCE(
[[#include <immintrin.h>
#include <stdio.h>
extern void exit(int);
int main(){__m512i t, t1;*((long long*)&t)=1;t1=t;t=_mm512_mul_epi32(t1,t);if((*(long long*)&t)==88)printf(".");exit(0);}]]
int main(){__m128i t, t1;*((long long*)&t)=1;t1=t;t=_mm_rol_epi32(t1,1);if((*(long long*)&t)==88)printf(".");exit(0);}]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did suggest using the same intrinsic we actually use, but I didn't mean to test it instead of testing any 512-bit BW intrinsic. I think we should either revert this change entirely or test both _mm_rol_epi32 and _mm512_mul_epi32.

While there are no current nor planned CPUs that have BW without VL nor vice versa, there may be future CPUs supporting AVX10/256 where the 128-bit VL intrinsic would compile and run yet this wouldn't imply support for 512-bit BW. Such future CPUs wouldn't set the CPUID bit corresponding to VL, but here we're not checking CPUID at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't realize you previously got the _mm512_mul_epi32 from the section for F, not for BW. Then revert to what we were checking for BW, please.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there may be future CPUs supporting AVX10/256 where the 128-bit VL intrinsic would compile and run yet this wouldn't imply support for 512-bit BW. Such future CPUs wouldn't set the CPUID bit corresponding to VL

Upon a second thought, actually maybe they would set that CPUID bit. It's no problem, and no reason to change anything in this PR - I am just correcting what I wrote for the sake of it. We may want to add AVX10/256 support later, with a separate PR, and maybe when such CPUs actually appear and can be tested. As a guess, maybe we'll be checking for VL alone as a separate configure test from BW+VL+DQ, and would need to treat it differently in code (in many ways, including CPUID check and non-usage of 512-bit vectors).

#define CPU_NAME "AVX512BW"
#define CPU_REQ_AVX512VL 1
#define CPU_REQ_AVX512DQ 1
#define CPU_NAME "AVX512(BW+VL+DQ)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep all 3 mentioned in CPU_NAME, for reporting in the "Sorry" line. (No further change is needed here.)

Copy link
Member Author

@claudioandre-br claudioandre-br Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addition of CPU_REQ_AVX512VL (+DQ) is also required. At least desired.

Copy link
Member

@solardiz solardiz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks almost good enough to merge, with only trivial cleanups maybe left. Thank you, @claudioandre-br!


AC_MSG_CHECKING([for AVX512BW])
AC_MSG_CHECKING([for AVX512BW + AVX512VL + AVX512DQ])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, the test program we run only checks BW and VL, and then we assume DQ is implied. So we could want to make it just for AVX512BW + AVX512VL here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it gets confusing.

The else part (runs when --enable-simd=avx512bw) has a test program that only tests AVX512BW + AVX512VL.

The if part (runs when --native-tests=true) does not use a test program. It uses CPU_detect().

  1. In any case, both use the -mavx512dq flag.
  2. CPU_detect without setting a value for CPU_REQ_* seems wrong to me. It should be like this:

I added a #define CPU_REQ_AVX512BW 1

#define CPU_REQ_AVX512BW		1
extern int CPU_detect(void); extern char CPU_req_name[];
      unsigned int nt_buffer8x[4], output8x[4];
      int main(int argc, char **argv) { return !CPU_detect(); }

Anyway, should I remove DQ string ???? Is a new commit with a fix for cpu_detection required?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. I think it's OK to leave this as you have it for this PR, no further change needed. Thank you!

@@ -464,16 +464,17 @@ dnl ======================================================================

AS_IF([test "x$CPU_NOTFOUND" = x0],
[
AC_MSG_CHECKING([for AVX512BW])
AC_MSG_CHECKING([for AVX512BW + AVX512VL + AVX512DQ])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and here.

(I don't get why we have this in two places.)

#define C7_AVX2 $0x00000020 /* AVX2 */
#define C7_AVX512F $0x00010000
#define C7_AVX512BW $0x40010000 /* AVX512BW + AVX512F */
#define C7_AVX512VL $0xC0010000 /* AVX512BW + AVX512VL + AVX512F */
#define C7_AVX512DQ $0xC0030000 /* AVX512BW + AVX512DQ + AVX512VL + AVX512F */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I didn't review the specific bitmasks against the documentation. I just hope they're correct.)

Binary john needs AVX512VL's XOP-like bit rotates for faster Salsa20
in yescrypt.

Without `VL` enabled compilers don't use mnemonics at all.

As it stands now, the possible binaries are:
- AVX512BW + AVX512VL + AVX512DQ
- AVX512F
- AVX2
- And so on.

There is no AVX512BW only binary.

See: openwall#5691.

Signed-off-by: Claudio André <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants