-
Notifications
You must be signed in to change notification settings - Fork 38
Added optimized ppc64le support functions for ML-KEM. #1184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @dannytsen, this is an exciting contribution 🎉
I think as the first stage of review, the goal should be to get your changes through CI, and extend it so that the PPC64 backend is exercised (to this end: do you know if your assembly works with qemu-ppc64le, and what flags are needed?).
In a second phase, we can dive into the backend itself and hopefully convince ourselves that it is functionally correct and upholds the assumptions made by the frontend.
I left a few comments to kick things off, but additionally I can see that there are failures related to autogen and format, so a good starting point would be to resolve those. You should be able to run simpasm with a PPC cross compiler to get simplified assembly that you can check in to main source tree.
|
@hanno-becker I believe the code will work on qemu-ppc64le even though I did not run on it. My testing platform are p9 and p10 systems. I will go thru the comments and fix issues. Thanks. |
|
@dannytsen Please see https://github.com/pq-code-package/mlkem-native/commits/ppc64le_backend for the changes to get the asm through the usual format/autogen/simpasm pipeline. Feel free to amend your commit(s). At least the base CI is happy with this: https://github.com/pq-code-package/mlkem-native/actions/runs/17640154327 NOTE: The resulting ASM in mlkem/* is currently unusable because the references to the .data section have been messed up during simpasm. As mentioned above, please see if you can follow the approach from the AArch64 backend: Define the NTT and invNTT twiddle tables in *.c and pass them to the ASM routines as arguments. The other constants can be generated in the code itself, as in https://github.com/pq-code-package/mlkem-native/blob/main/mlkem/src/native/aarch64/src/ntt.S#L79 for example. If it's inconvenient to do this, you can also go with a single large constant table including all constants you need, pass the pointer to that to each ASM function, and load from a suitable offset in the ASM. This is the approach used in the x86_64 backend, see dev/x86_64/src/consts.c. |
|
@hanno-becker Thanks for the pointer. But I am not a python programmer and don't really can comprehend python so changes scripts will not be my first choice. I just want to get the simpasm work on my code. I can change my code to use data array from a C file. But I need an example (a command line example) to generate a simplified assembly. So, where do you run simpasm from? from scripts directory or dev directory? And what are the options I need to pass thru? Like simpasm -???? Just a example for x86 or arm will be fine. I just want to know how to run it so I can fix my assembly code accordingly. Thanks. I have a t.S file with .data section stripped. And here is the output for your reference. So, you know what I am talking about. [07:06] danny@ltc-zz4-lp9 dev % ../scripts/simpasm -i ${PWD}/ppc64le/src/t.S Traceback (most recent call last): The above exception was the direct cause of the following exception: Traceback (most recent call last): |
|
@dannytsen I have basically done this for you in the branch (atop of your changes), so you won't need to fiddle with Python anymore. But you will need to change the ASM to pass in the constants as arguments, rather than having If you checkout the branch, enter the |
@hanno-becker Sure. I can do that. |
@hanno-becker BTW, there is no nix for ppc. |
|
@dannytsen You should work in an x86_64 or AArch64 Linux/Mac environment and use the PPC64Le cross compiler, which is already part of the environment established by |
@hanno-becker Ok. I'll check that. Thanks. |
|
@dannytsen What indicators/assurances have you obtained so far that the assembly is correct? Also, have you successfully run the code on QEMU, or real HW only? |
@hanno-becker The code was run successfully in liboqs and mlkem-native project on HW. The code was originally written for liboqs. |
|
@dannytsen Independent of the work of separating the twiddles from the assembly: I ran the code in a ppc64le emulator, but it fails as soon as I start to use the NTT or invNTT. Specifically, in a Linux/Mac environment, and using your current This gives: If I comment out It could just be some CPU configuration missing. Are you assuming a particular vector length, for example? I can also see the code failing when running under Bottomline: I'll help with the integration details, but you'd need to find out / demonstrate that/how the code works in an emulated QEMU environment so we can test it in CI -- can you do that please? |
@hanno-becker Which means that it doesn't work with p8 or some instructions was not supported in qmenu. |
|
@hanno-becker Here is my output from p9. [00:01] danny@ltc-zz4-lp9 mlkem-native_dev % make test |
|
@dannytsen Thank you. As mentioned, can you please find out how to test the code using qemu? The
Can you find out which one it is? The PR documentation states that the ASM works P8 upwards. |
@hanno-becker It looks like qmenu cross compiler soen't support "xxpermdi" instruction. I'll check.
|
|
@dannytsen I don't know. You should be able to find out assembling a minimal example and using |
I'll check. |
|
@hanno-becker I don't have qemu on my system. But I installed nix on my Mac and run the following command under nix, These are the final build output. FUNC ML-KEM-1024: test/build/mlkem1024/bin/test_mlkem1024 I'll check about qemu. |
Please see #1184 (comment) again -- once in the |
|
@hanno-becker Thanks for taking so much on ppc64le integration. I do learn a lot from here in all aspects. I don't know how much time will take me but I'll work on it. |
@dannytsen This is great to hear! Please don't hesitate to ask if anything is unclear about the algorithmic aspects of the AArch64 implementation, or how to translate it. Otherwise, I'm looking forward to seeing the updated code! |
|
Thanks @dannytsen! Feel free to let me know if there’s any area where I can jump in and help out with coding. |
|
@hanno-becker @bhess Thanks. |
|
@dannytsen Any update? |
|
@hanno-becker Still working on fixing current code before the other. |
Re-arranged zeta array for NTT/INTT for Len 2 and 4. Signed-off-by: Danny Tsen <[email protected]>
|
@dannytsen Ack. Let me know when you're done with the rework or have any questions. |
@hanno-becker @bhess My fix of my implementation to work with your new backend unit test is very straight forward and simple, just matched work flow of your C-implementation. This implementation worked as is. So, I don't plan on re-work for any time soon. Thanks. |
|
@bhess @dannytsen Could you provide performance improvement data for the backend as it stands? What is the plan towards integrating assembly that leverages lazy reduction and layer merging? |
@hanno-becker @bhess Here are the benchmark data. This benchmark run on p10 HW. Will discuss with Basil for further optimization if needed. [13:51] danny-mlkem-native_dev % ./scripts/tests bench -c PERF keypair percentiles: 66527 66731 66830 66901 66951 67005 67059 67117 67201 67334 71937 INFO > Benchmark ML-KEM-768 (native no_opt): make run_bench_768 keypair percentiles: 110666 110991 111129 111242 111357 111498 111593 111720 111907 112334 116861 INFO > Benchmark ML-KEM-1024 (native no_opt): make run_bench_1024 keypair percentiles: 165446 166045 166300 166520 166673 166899 167079 167293 167716 171222 175975 INFO > Benchmark Compile (native opt): make bench OPT=1 AUTO=1 CYCLES=PERF -j40 keypair percentiles: 45583 45794 45895 45969 46055 46111 46168 46249 46344 46447 51088 INFO > Benchmark ML-KEM-768 (native opt): make run_bench_768 keypair percentiles: 78671 78938 79091 79241 79333 79408 79516 79631 79828 80295 84793 INFO > Benchmark ML-KEM-1024 (native opt): make run_bench_1024 keypair percentiles: 122971 123552 123774 123995 124157 124334 124544 124782 125043 128570 133437 All good! |
|
@dannytsen Thanks. As a table:
|
|
@dannytsen Can you clarify the architectural requirements please (by updating this PR)? The PR documents Power8 and above, but we don't seem to be able to emulate the code for Power8, nor have you (as I remember) tested it on such. @dannytsen Can you also provide benchmarks on p9 please? |
@hanno-becker I'll find p8/9 system to test. |
Also fixed instruction byte orerding mismatch for p8 and p9/10, lxvx/stxvx and lxvd2x/stxvd2x. Used lxvd2x and stxvd2x for consistant byte ordering. Signed-off-by: Danny Tsen <[email protected]>
@hanno-becker p8 issue fixed. Here are the benchmark for p8, p9 and p10. Thanks. |
dev/ppc64le/src/ntt_ppc.S
Outdated
| lxvd2x 32+13, 3, 10 # r[j+len] | ||
| lxvd2x 32+18, 3, 17 # r[j+len] | ||
| lxvd2x 32+23, 3, 19 # r[j+len] | ||
| lxvd2x 32+28, 3, 21 # r[j+len] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comments are misleading, we are loading from different offsets here. The comments should indicate the offset.
dev/ppc64le/src/ntt_ppc.S
Outdated
| addi 16, 9, \next | ||
| addi 17, 10, \step | ||
| addi 18, 16, \next | ||
| addi 19, 17, \step | ||
| addi 20, 18, \next | ||
| addi 21, 19, \step |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Half of these offsets are not used here, but further down in Loda_4AJ, which is confusing. The code should either be moved, or a comment added.
dev/ppc64le/src/ntt_ppc.S
Outdated
| vadduhm 30, 28, 27 # r + t | ||
| .endm | ||
|
|
||
| .macro NTT_MREDUCE_4X start next step _vz0 _vz1 _vz2 _vz3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
next and step are always the same. Merge the two arguments?
dev/ppc64le/src/ntt_ppc.S
Outdated
| vsubuhm 16, 12, 13 # r - t | ||
| vadduhm 15, 13, 12 # r + t | ||
| vsubuhm 21, 17, 18 # r - t | ||
| vadduhm 20, 18, 17 # r + t | ||
| vsubuhm 26, 22, 23 # r - t | ||
| vadduhm 25, 23, 22 # r + t | ||
| vsubuhm 31, 27, 28 # r - t | ||
| vadduhm 30, 28, 27 # r + t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comments here need refining, too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dannytsen Thank you again for your work.
I continued reviewing, but quickly found myself stuck without extensive help from an LLM commenting and rewriting the code for me. As it stands, the code is too difficult to understand for @mkannwischer and myself to commit to maintaining it.
The following steps would greatly help to make the code more understandable and maintainable:
- Addition of
#define ...abbreviations for all the registers in use. Following the raw numeric register identifiers is difficult, as the reader does not (yet) have the mental model that you had when writing the code. This is exacerbated by the fact that the there is no way to tell (without knowledge of the instruction signature) whether a numeric value is a register identifier or an immediate. For example, inNTT_MREDUCE_4X 5, 16, 16, ...,5is a register identifier but16is an immediate. - More extensive comments, and fixing of existing comments. There are a number of comments which give a sense of what happens, but are not precise enough to exactly understand what's happening; see the review comments.
Finally, can you also switch to using /* ... */ style comments rather than # ..., matching other assembly files in mlkem-native? Also, could you untabify the source (use spaces instead of tabs exclusively)?
Signed-off-by: Danny Tsen <[email protected]>
@hanno-becker Sure. I'll fix as much as I can. |
1. De-tabified. 2. Merged next and step used in macro. 3. Used immediate values for offsets used in macro. 4. More comments explaining the operation and contents. 5. Changed the comment style. In this commit, numeric register identifiers have not been fixed yet. will do that next. Signed-off-by: Danny Tsen <[email protected]>

Added optimized ppc64le support functions for ML-KEM.
The supported native functions include:
And other interface functions and headers.
Signed-off-by: Danny Tsen [email protected]" .