Armory format: Add SIMD support #5423

Merged: 10 commits, Jan 12, 2024

solardiz
Member

This adds SIMD support to the Armory wallet format, then makes some other optimizations. There's approximately a 10% further speedup from the "Armory format: Add SIMD support" commit to the final commit of this PR.

Not included here are many other experiments I tried, which didn't help improve speeds. These included:

  1. Prefetching. Apparently, it mostly happens on its own: when preparing the input blocks for SHA-512 on AVX-512, we loop over 8 inputs, and speculative execution takes care of the rest. It could be worth retesting on more systems and in combination with different hardware prefetcher settings.
  2. Modifying our SIMD SHA-512 code to work on and clobber the input buffer in place instead of using memcpy, then restoring just the second half (the first half is rightly overwritten by the output hashes) from a copy saved outside the loop in the calling code. I don't know why this didn't help (it actually hurt a bit on my machine). My guess is that it caused a detrimental change in addressing modes: when we copy, we work on a buffer local to the function, which is easily addressable relative to the current stack frame rather than the caller's. If so, the approach may be worth revisiting for inlined code rather than a shared function.
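
For readers unfamiliar with the "8 inputs" mentioned in item 1: an 8-way SIMD SHA-512 typically wants word w of all eight inputs side by side in one vector. The following is only a minimal sketch of that kind of interleaved layout, assuming eight 128-byte blocks of big-endian 64-bit words. The function names are hypothetical and this is not the code from this PR.

```python
import struct

LANES = 8    # one 512-bit vector holds eight 64-bit words
WORDS = 16   # one 128-byte SHA-512 block is sixteen 64-bit words

def interleave(blocks):
    """Return WORDS vectors; vector w holds word w of every lane."""
    assert len(blocks) == LANES and all(len(b) == 128 for b in blocks)
    lanes = [struct.unpack(">16Q", b) for b in blocks]
    return [tuple(lane[w] for lane in lanes) for w in range(WORDS)]

def deinterleave(vectors):
    """Inverse of interleave(): rebuild the eight per-lane blocks."""
    return [struct.pack(">16Q", *lane) for lane in zip(*vectors)]

# Eight dummy blocks, one per lane, each filled with its lane number.
blocks = [bytes([i]) * 128 for i in range(LANES)]
vectors = interleave(blocks)
assert deinterleave(vectors) == blocks
```

Building each vector touches all eight inputs in turn, which is the access pattern item 1 refers to.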

@solardiz
Member Author

On large systems, explicitly allocating huge pages with sysctl -w vm.nr_hugepages=24576 or such and setting GOMP_CPU_AFFINITY significantly improves speeds. Here's AWS c7i.48xlarge when using the two largest test vectors (~6x more work than our default benchmark test vectors in this format):

# sysctl -w vm.nr_hugepages=24576

$ GOMP_CPU_AFFINITY=0-191 ./john -te -form=armory
Will run 192 OpenMP threads
Benchmarking: armory, Armory wallet [SHA512/AES/secp256k1/SHA256/RIPEMD160 512/512 AVX512BW 8x]... (192xOMP) DONE
Speed for cost 1 (memory) of 33554432, cost 2 (iterations) of 3
Raw:    738 c/s real, 3.9 c/s virtual

$ GOMP_CPU_AFFINITY=0-191 ./john -w=w pw
Using default input encoding: UTF-8
Loaded 1 password hash (armory, Armory wallet [SHA512/AES/secp256k1/SHA256/RIPEMD160 512/512 AVX512BW 8x])
Cost 1 (memory) is 33554432 for all loaded hashes
Cost 2 (iterations) is 3 for all loaded hashes
Will run 192 OpenMP threads
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
a3q92wz8         (?)
1g 0:00:01:00 DONE (2024-01-12 19:42) 0.01646g/s 758.6p/s 758.6c/s 758.6C/s milkmilk..160582
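
The value 24576 is consistent with backing every candidate's memory region with huge pages. A rough check, assuming 2 MiB huge pages (the usual x86-64 default) and one 32 MiB region per SIMD lane per thread; both are assumptions, only the memory cost, thread count, and "8x" come from the output above:

```python
MEMORY_COST = 33554432        # bytes, "Cost 1 (memory)" from the output above
THREADS = 192                 # "Will run 192 OpenMP threads"
SIMD_LANES = 8                # "AVX512BW 8x"; one region per lane is an assumption
HUGE_PAGE = 2 * 1024 * 1024   # assumption: 2 MiB huge pages

def hugepages_needed(memory_cost=MEMORY_COST, threads=THREADS,
                     lanes=SIMD_LANES, page=HUGE_PAGE):
    """Number of huge pages needed to back all regions, rounded up."""
    total = memory_cost * threads * lanes
    return -(-total // page)  # ceiling division

print(hugepages_needed())  # 24576, i.e. 48 GiB in total
```

On a different thread count or memory cost the same arithmetic gives a starting point for vm.nr_hugepages.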

@solardiz solardiz merged commit 1d7397a into openwall:bleeding-jumbo Jan 12, 2024
31 of 32 checks passed
@solardiz
Member Author

Also on the same AWS instance:

Benchmarking: Bitcoin, Bitcoin Core [SHA512 AES 512/512 AVX512BW 8x]... (192xOMP) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    10338 c/s real, 54.0 c/s virtual

This gives us the following efficiency calculation in terms of Armory format's SHA-512 speed vs. Bitcoin format's:

750*3*1.5*32768*1024/64/200460/10338 = 85%

This means that we spend about 15% of CPU time on memory access, scatter/gather, XOR'ing, and derive_address.
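
The same calculation spelled out, as a sketch. The variable names and the reading of each factor are assumptions: one SHA-512 per 64 bytes of memory traffic for Armory, one SHA-512 per iteration for Bitcoin, and the 1.5 factor taken as given from the formula above.

```python
ARMORY_CPS = 750               # approximate Armory speed (738 to 758.6 c/s measured above)
ARMORY_ITERATIONS = 3          # cost 2 (iterations)
PASS_FACTOR = 1.5              # taken from the formula above; its meaning is not stated there
MEMORY_BYTES = 32768 * 1024    # cost 1 (memory), 33554432 bytes
SHA512_OUTPUT_BYTES = 64

BITCOIN_CPS = 10338
BITCOIN_ITERATIONS = 200460    # cost 1 (iteration count)

# SHA-512 computations per second implied by each benchmark.
armory_sha512 = (ARMORY_CPS * ARMORY_ITERATIONS * PASS_FACTOR
                 * MEMORY_BYTES / SHA512_OUTPUT_BYTES)
bitcoin_sha512 = BITCOIN_CPS * BITCOIN_ITERATIONS

efficiency = armory_sha512 / bitcoin_sha512
print(f"{efficiency:.1%}")     # 85.4%
```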

Also, these memory bandwidth usage estimates for Armory, scrypt, Argon2:

750*3*1.5*32768*1024/10^9 = 113 GB/s
5037*32768*1024/10^9 = 169 GB/s
12741*4096*1024*(2+2*3)/10^9 = 427 GB/s

The full memory bandwidth for this machine might be 450 GB/s, as 2 sockets times 6 channels (as guessed from 384 GiB total) times DDR5-4800 (as Intel specifies for a similar CPU in 1 DIMM per channel configuration).
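
These estimates can be reproduced with a few lines. This is a sketch using only the figures quoted above, plus one assumption: 8 bytes per transfer per DDR5 channel for the theoretical peak.

```python
GB = 10**9

def bandwidth_gb_s(candidates_per_second, bytes_per_candidate):
    """Estimated memory traffic in GB/s."""
    return candidates_per_second * bytes_per_candidate / GB

armory = bandwidth_gb_s(750, 3 * 1.5 * 32768 * 1024)
scrypt = bandwidth_gb_s(5037, 32768 * 1024)
argon2 = bandwidth_gb_s(12741, 4096 * 1024 * (2 + 2 * 3))

# Theoretical peak: sockets * channels * transfers/s * bytes per transfer.
# The 8 bytes per transfer (64-bit data path per channel) is an assumption.
peak = 2 * 6 * 4800e6 * 8 / GB

for name, value in [("Armory", armory), ("scrypt", scrypt),
                    ("Argon2", argon2), ("peak", peak)]:
    print(f"{name}: {value:.1f} GB/s")
```

Under that assumption the peak comes out near 461 GB/s, in line with the rough 450 GB/s guess.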

So the maximum potential further improvement from interleaving/prefetching is about 17% (100/0.854 - 100, using the efficiency before rounding to 85%), but this is of course unrealistic since it'd assume no "overhead" at all (over SHA-512 computation).
