Armory format: Add SIMD support #5423

solardiz · 2024-01-12T19:01:54Z

This adds SIMD support to the Armory wallet format, then makes some other optimizations. There's approximately a 10% further speedup from the "Armory format: Add SIMD support" commit to the final commit of this PR.

Not included here are many other experiments I tried, which didn't help improve speeds. These included:

Prefetching. Apparently, it mostly happens on its own due to looping over 8 inputs with speculative execution when preparing the input blocks for SHA-512 on AVX-512. It could be worth retesting on more systems and in combination with different hardware prefetcher settings.
Modifying our SIMD SHA-512 code to let it work on and clobber the input buffer instead of using memcpy, then restore just the second half (as the first half is rightly overwritten by the output hashes) from a copy saved out of the loop in the calling code. I don't know why this didn't help (and actually hurt a bit on my machine). A guess is it could have caused a detrimental change in addressing modes, as when we're copying we then work on a buffer local to the function, so easily addressable relative to the current stack frame rather than the caller's. If so, maybe the approach would be worth revisiting for inlined code, rather than a shared function.

solardiz · 2024-01-12T19:48:42Z

On large systems, explicitly allocating huge pages with sysctl -w vm.nr_hugepages=24576 or such and setting GOMP_CPU_AFFINITY significantly improves speeds. Here's AWS c7i.48xlarge when using the two largest test vectors (~6x more work than our default benchmark test vectors in this format):

# sysctl -w vm.nr_hugepages=24576

$ GOMP_CPU_AFFINITY=0-191 ./john -te -form=armory
Will run 192 OpenMP threads
Benchmarking: armory, Armory wallet [SHA512/AES/secp256k1/SHA256/RIPEMD160 512/512 AVX512BW 8x]... (192xOMP) DONE
Speed for cost 1 (memory) of 33554432, cost 2 (iterations) of 3
Raw:    738 c/s real, 3.9 c/s virtual

$ GOMP_CPU_AFFINITY=0-191 ./john -w=w pw
Using default input encoding: UTF-8
Loaded 1 password hash (armory, Armory wallet [SHA512/AES/secp256k1/SHA256/RIPEMD160 512/512 AVX512BW 8x])
Cost 1 (memory) is 33554432 for all loaded hashes
Cost 2 (iterations) is 3 for all loaded hashes
Will run 192 OpenMP threads
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
a3q92wz8         (?)
1g 0:00:01:00 DONE (2024-01-12 19:42) 0.01646g/s 758.6p/s 758.6c/s 758.6C/s milkmilk..160582
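
One plausible reading of the vm.nr_hugepages value used above is that it covers the format's whole working set. The sketch below checks that arithmetic. It assumes 2 MiB huge pages (the usual x86-64 Linux default) and one 32 MiB buffer per SIMD lane per thread; neither assumption is stated in this thread, and huge_pages_needed is a hypothetical helper, not part of John the Ripper.

```python
# Assumptions (not stated in the thread): 2 MiB huge pages and one
# memory-cost-sized buffer per SIMD lane per OpenMP thread.
HUGE_PAGE_BYTES = 2 * 1024**2

THREADS = 192                  # "Will run 192 OpenMP threads"
SIMD_LANES = 8                 # "AVX512BW 8x"
MEMORY_COST_BYTES = 33554432   # "Cost 1 (memory) is 33554432"


def huge_pages_needed(threads, lanes, bytes_per_candidate,
                      page_bytes=HUGE_PAGE_BYTES):
    """Return the number of huge pages covering all per-candidate buffers."""
    total_bytes = threads * lanes * bytes_per_candidate
    return -(-total_bytes // page_bytes)  # ceiling division


pages = huge_pages_needed(THREADS, SIMD_LANES, MEMORY_COST_BYTES)
print(pages)  # 24576
```

Under these assumptions the working set is 48 GiB, which is exactly 24576 pages of 2 MiB. A different thread count or memory cost would need a proportionally different setting.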

solardiz · 2024-01-12T22:58:41Z

Also on the same AWS instance:

Benchmarking: Bitcoin, Bitcoin Core [SHA512 AES 512/512 AVX512BW 8x]... (192xOMP) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    10338 c/s real, 54.0 c/s virtual

This gives us the following efficiency calculation in terms of Armory format's SHA-512 speed vs. Bitcoin format's:

750*3*1.5*32768*1024/64/200460/10338 = 85%

This means that we spend about 15% of CPU time on memory access, scatter/gather, XOR'ing, and derive_address.
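
The efficiency figure can be reproduced with a few lines of Python. The numbers are the ones quoted above (750 c/s is a rounded Armory speed). The variable names are mine, and reading the 1.5 factor as "one fill pass plus half as many lookups, each costing one SHA-512" is my assumption about the KDF rather than something stated here.

```python
MEMORY_BYTES = 32768 * 1024      # 32 MiB memory cost
SHA512_OUTPUT_BYTES = 64         # one SHA-512 output per 64 bytes of memory

ARMORY_CPS = 750                 # rounded Armory speed, candidates/s
ARMORY_ITERATIONS = 3            # "cost 2 (iterations) of 3"
HASHES_PER_SLOT = 1.5            # assumed meaning: fill pass + half-size lookup pass

BITCOIN_CPS = 10338              # Bitcoin Core benchmark, candidates/s
BITCOIN_ITERATIONS = 200460      # SHA-512 iterations per candidate

# SHA-512 computations per second implied by each benchmark.
armory_sha512_rate = (ARMORY_CPS * ARMORY_ITERATIONS * HASHES_PER_SLOT
                      * MEMORY_BYTES / SHA512_OUTPUT_BYTES)
bitcoin_sha512_rate = BITCOIN_CPS * BITCOIN_ITERATIONS

efficiency = armory_sha512_rate / bitcoin_sha512_rate
print(f"{efficiency:.1%}")  # 85.4%
```

The Bitcoin format serves as the baseline here because it is almost pure SHA-512 on the same SIMD code path, so the ratio isolates what Armory spends on everything else.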

Also, here are memory bandwidth usage estimates for Armory, scrypt, and Argon2, respectively:

750*3*1.5*32768*1024/10^9 = 113 GB/s
5037*32768*1024/10^9 = 169 GB/s
12741*4096*1024*(2+2*3)/10^9 = 427 GB/s
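
The same three estimates as a small Python sketch, in case someone wants to plug in numbers from another machine. The helper name gb_per_s is hypothetical; the per-candidate traffic factors are copied from the expressions above without further interpretation.

```python
def gb_per_s(candidates_per_s, bytes_per_candidate):
    """Estimated memory traffic in GB/s (decimal gigabytes)."""
    return candidates_per_s * bytes_per_candidate / 10**9


# Speeds (c/s) and per-candidate traffic taken from the expressions above.
armory = gb_per_s(750, 3 * 1.5 * 32768 * 1024)
scrypt = gb_per_s(5037, 32768 * 1024)
argon2 = gb_per_s(12741, 4096 * 1024 * (2 + 2 * 3))

for name, value in (("Armory", armory), ("scrypt", scrypt), ("Argon2", argon2)):
    print(f"{name}: {value:.1f} GB/s")
# Armory: 113.2 GB/s
# scrypt: 169.0 GB/s
# Argon2: 427.5 GB/s
```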

The full memory bandwidth for this machine might be 450 GB/s, as 2 sockets times 6 channels (as guessed from 384 GiB total) times DDR5-4800 (as Intel specifies for a similar CPU in 1 DIMM per channel configuration).
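
For reference, the theoretical peak behind that guess can be computed as below. The 8 bytes per transfer is the standard 64-bit data width of a DDR5 channel; the socket and channel counts are the guesses from the paragraph above, and peak_bandwidth_gb_s is a hypothetical helper.

```python
def peak_bandwidth_gb_s(sockets, channels_per_socket, mega_transfers_per_s,
                        bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    transfers_per_s = mega_transfers_per_s * 1e6
    return (sockets * channels_per_socket * transfers_per_s
            * bytes_per_transfer / 1e9)


# 2 sockets, a guessed 6 channels each, DDR5-4800.
peak = peak_bandwidth_gb_s(2, 6, 4800)
print(f"{peak:.1f} GB/s")  # 460.8 GB/s
```

This is a theoretical ceiling, so a working estimate of roughly 450 GB/s is in the same range. If the machine actually populates more channels per socket, the ceiling rises accordingly.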

So the maximum potential further improvement from interleaving/prefetching is 1/0.854 - 1 = 17% (using the unrounded 85.4% efficiency), but this is of course unrealistic since it'd assume no "overhead" at all (over SHA-512 computation).

solardiz added 8 commits January 11, 2024 20:45

Armory format: Prepare for deriving multiple keys/addresses per thread

c6d073d

Armory format: Prepare derive_keys() for working with 2D LUT

27bc2b8

Armory format: Add SIMD support

357a098

Armory format: Avoid SIMDSHA512body() gather/scatter in second loop

83293b4

Armory format: Avoid SIMDSHA512body() gather/scatter in first loop

0130ffd

Armory format: Use AVX-512 or MIC scatter in first loop

8a27516

Armory format: Unroll some loops

8184d98

Armory format: Add a second 32 MiB, 3 iterations test vector

3436461

solardiz added 2 commits January 12, 2024 21:37

Add doc/README-Armory

2955417

Armory format: Free the tests' memory allocation before actual cracking

1c3cbfd

solardiz merged commit 1d7397a into openwall:bleeding-jumbo Jan 12, 2024
31 of 32 checks passed
