Armory format: Add SIMD support #5423
Conversation
On large systems, explicitly allocating huge pages with
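For illustration, one common way to explicitly allocate huge pages on Linux is mmap() with MAP_HUGETLB; this is a minimal sketch, not necessarily the mechanism behind the numbers above:

```c
/*
 * Minimal sketch, not necessarily the mechanism referred to above: request
 * explicit huge pages with mmap() + MAP_HUGETLB, and fall back to a normal
 * mapping plus an madvise() hint for transparent huge pages if the explicit
 * reservation fails.  For MAP_HUGETLB, size should be a multiple of the
 * huge page size.
 */
#include <stddef.h>
#include <sys/mman.h>

static void *alloc_maybe_huge(size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (p == MAP_FAILED) {
		/* No pre-allocated huge pages available: fall back, hint THP */
		p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return NULL;
		madvise(p, size, MADV_HUGEPAGE);
	}
	return p;
}
```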
Also on the same AWS instance:
This gives us the following efficiency calculation in terms of Armory format's SHA-512 speed vs. Bitcoin format's:
This means that we spend about 15% of CPU time on memory access, scatter/gather, XOR'ing, and other overhead. Also, here are memory bandwidth usage estimates for Armory, scrypt, and Argon2:
The full memory bandwidth of this machine might be about 450 GB/s: 2 sockets times 6 channels (as guessed from the 384 GiB total) times DDR5-4800 (which Intel specifies for a similar CPU in a 1 DIMM per channel configuration). So the maximum potential further improvement from interleaving/prefetching is 100/85 − 100% ≈ 17%, but this is of course unrealistic since it'd assume no "overhead" at all (beyond the SHA-512 computation).
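Spelling out the arithmetic behind those two figures (the per-channel DDR5-4800 rate of 38.4 GB/s is assumed from the memory standard, not taken from the comment above):

```c
/*
 * Arithmetic behind the estimates above.  38.4 GB/s per channel is the
 * nominal DDR5-4800 rate; 85 is 100 - 15 from the CPU-time split quoted
 * earlier; the text rounds the bandwidth total down to "450 GB/s".
 */
#include <stdio.h>

int main(void)
{
	double per_channel = 38.4;              /* GB/s, DDR5-4800 */
	double total = 2 * 6 * per_channel;     /* 2 sockets * 6 channels = 460.8 */
	double headroom = 100.0 / 85.0 - 1.0;   /* bound if memory stalls vanished */

	printf("peak bandwidth estimate: %.1f GB/s\n", total);
	printf("max further speedup:     %.1f%%\n", headroom * 100.0);  /* ~17.6 */
	return 0;
}
```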
This adds SIMD support to the Armory wallet format, then makes some other optimizations. There's approximately a 10% further speedup from the "Armory format: Add SIMD support" commit to the final commit of this PR.
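For readers unfamiliar with where the SIMD gain comes from, the idea is to compute several candidates' SHA-512 in parallel vector lanes rather than one at a time. The following is a conceptual sketch only; `sha512_lanes()` and the lane count are hypothetical placeholders, not the interfaces actually used in this PR:

```c
/*
 * Conceptual sketch only, not the code added by this PR.  Several
 * candidates' SHA-512 states are laid out side by side and advanced with
 * one vectorized call per KDF iteration, instead of one scalar SHA-512
 * call per candidate per iteration.  sha512_lanes() is a hypothetical
 * helper (prototype only, body omitted): it hashes LANES independent
 * 64-byte states in parallel vector lanes.
 */
#include <stdint.h>

#define LANES 8  /* e.g. eight 64-bit lanes with AVX-512 */

void sha512_lanes(uint8_t state[LANES][64]);  /* hypothetical, in-place */

static void kdf_iterations_simd(uint8_t state[LANES][64], unsigned iterations)
{
	unsigned i;

	for (i = 0; i < iterations; i++) {
		/* One vectorized call advances all LANES candidates at once. */
		sha512_lanes(state);
	}
}
```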
Not included here are many other experiments I tried, which didn't help improve speeds. These included:
- Reworking the per-iteration memcpy to restore just the second half of the buffer (as the first half is rightly overwritten by the output hashes) from a copy saved out of the loop in the calling code; see the sketch below. I don't know why this didn't help (and actually hurt a bit on my machine). A guess is that it could have caused a detrimental change in addressing modes: when we're copying, we then work on a buffer local to the function, which is easily addressable relative to the current stack frame rather than the caller's. If so, maybe the approach would be worth revisiting for inlined code rather than a shared function.
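For concreteness, here is a hedged sketch of what that experiment looked like; the buffer layout, sizes, and `hash_step()` are placeholders rather than the format's actual code:

```c
/*
 * Rough sketch of the abandoned experiment described above; buffer and
 * size names, and hash_step(), are hypothetical.  The idea: save the
 * second half of the working buffer once, outside the loop, then per
 * iteration restore only that half (the first half is overwritten by
 * the output hashes anyway), instead of memcpy'ing the whole buffer.
 */
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 128            /* hypothetical working-buffer size */
#define HALF     (BUF_SIZE / 2)

void hash_step(uint8_t *buf);   /* hypothetical: overwrites first HALF bytes */

static void kdf_loop(uint8_t *buf, unsigned iterations)
{
	uint8_t saved[HALF];
	unsigned i;

	memcpy(saved, buf + HALF, HALF);         /* saved once, out of the loop */

	for (i = 0; i < iterations; i++) {
		hash_step(buf);                  /* clobbers only the first half */
		memcpy(buf + HALF, saved, HALF); /* restore just the second half */
	}
}
```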