@dannytsen

Added optimized ppc64le support functions for ML-KEM.

The supported native functions include:

  1. MLK_USE_NATIVE_NTT (ntt_ppc.S)
  2. MLK_USE_NATIVE_INTT (intt_ppc.S)
  3. MLK_USE_NATIVE_POLY_REDUCE (reduce.S)
  4. MLK_USE_NATIVE_POLY_TOMONT (poly_tomont.S)

And other interface functions and headers.

Signed-off-by: Danny Tsen [email protected]

@dannytsen dannytsen requested a review from a team as a code owner September 9, 2025 15:06
Contributor

@hanno-becker hanno-becker left a comment

Thank you, @dannytsen, this is an exciting contribution 🎉

I think as the first stage of review, the goal should be to get your changes through CI, and extend it so that the PPC64 backend is exercised (to this end: do you know if your assembly works with qemu-ppc64le, and what flags are needed?).

In a second phase, we can dive into the backend itself and hopefully convince ourselves that it is functionally correct and upholds the assumptions made by the frontend.

I left a few comments to kick things off, but additionally I can see that there are failures related to autogen and format, so a good starting point would be to resolve those. You should be able to run simpasm with a PPC cross compiler to get simplified assembly that you can check in to main source tree.

@dannytsen
Author

@hanno-becker I believe the code will work on qemu-ppc64le, even though I have not run it there. My testing platforms are p9 and p10 systems. I will go through the comments and fix the issues. Thanks.

@hanno-becker
Contributor

hanno-becker commented Sep 11, 2025

@dannytsen Please see https://github.com/pq-code-package/mlkem-native/commits/ppc64le_backend for the changes to get the asm through the usual format/autogen/simpasm pipeline. Feel free to amend your commit(s). At least the base CI is happy with this: https://github.com/pq-code-package/mlkem-native/actions/runs/17640154327

NOTE: The resulting ASM in mlkem/* is currently unusable because the references to the .data section have been messed up during simpasm. As mentioned above, please see if you can follow the approach from the AArch64 backend: Define the NTT and invNTT twiddle tables in *.c and pass them to the ASM routines as arguments. The other constants can be generated in the code itself, as in https://github.com/pq-code-package/mlkem-native/blob/main/mlkem/src/native/aarch64/src/ntt.S#L79 for example. If it's inconvenient to do this, you can also go with a single large constant table including all constants you need, pass the pointer to that to each ASM function, and load from a suitable offset in the ASM. This is the approach used in the x86_64 backend, see dev/x86_64/src/consts.c.
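
As a rough illustration of the two options described above (a dedicated table defined in C and passed as an argument, or one shared table that each routine indexes by offset), here is a minimal C sketch. All `ex_*` names, the offsets, and the table contents are hypothetical, and the routine body is a plain-C stand-in for what would be an assembly routine. It is not the actual mlkem-native interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_Q 3329 /* ML-KEM modulus */

/* Option 1: a table defined in C and handed to the routine as an argument.
 * The values are small powers of 17 mod q, for illustration only; they are
 * not the twiddle layout of any real backend. */
const int16_t ex_twiddles[4] = {1, 17, 289, 1584};

/* Option 2: one shared table; each routine loads from a fixed offset.
 * Offsets are counted in int16_t units and are hypothetical. */
enum { EX_OFF_Q = 0, EX_OFF_TWIDDLES = 1, EX_NUM_TWIDDLES = 4 };
const int16_t ex_consts[1 + EX_NUM_TWIDDLES] = {EX_Q, 1, 17, 289, 1584};

/* C stand-in for an assembly routine. In a real backend only the prototype
 * would live in C; the body would be in a .S file without a .data section. */
void ex_mul_twiddles(int16_t *r, size_t n, const int16_t *tw, size_t n_tw)
{
  assert(r != NULL && tw != NULL && n_tw > 0);
  for (size_t i = 0; i < n; i++)
  {
    int32_t t = ((int32_t)r[i] * tw[i % n_tw]) % EX_Q;
    r[i] = (int16_t)(t < 0 ? t + EX_Q : t); /* canonical range [0, q) */
  }
}

/* Same operation, but the caller passes the base of the shared table. */
void ex_mul_twiddles_shared(int16_t *r, size_t n, const int16_t *consts)
{
  assert(consts != NULL && consts[EX_OFF_Q] == EX_Q);
  ex_mul_twiddles(r, n, consts + EX_OFF_TWIDDLES, EX_NUM_TWIDDLES);
}
```

Either way, the assembly only ever sees a pointer in an argument register, so no relocation against a data section is needed inside the assembly file.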

@dannytsen
Author

dannytsen commented Sep 11, 2025

@hanno-becker Thanks for the pointer. But I am not a Python programmer and can't really comprehend Python, so changing scripts would not be my first choice. I just want to get simpasm to work on my code. I can change my code to use a data array from a C file. But I need an example (a command-line example) of generating simplified assembly. So, where do you run simpasm from: the scripts directory or the dev directory? And what options do I need to pass? Like simpasm -???? Just an example for x86 or Arm will be fine. I just want to know how to run it so I can fix my assembly code accordingly. Thanks.

I have a t.S file with the .data section stripped. Here is the output for your reference, so you know what I am talking about.

[07:06] danny@ltc-zz4-lp9 dev % ../scripts/simpasm -i ${PWD}/ppc64le/src/t.S
simpasm: Command failed: gcc -c -x assembler-with-cpp -o /tmp/tmpitbxzagr.o -
simpasm: Exit code: 1
simpasm: stderr: :13:10: fatal error: ../../../common.h: No such file or directory
compilation terminated.

Traceback (most recent call last):
File "/home/danny/mlkem-native_dev/dev/../scripts/simpasm", line 158, in run_cmd
r = subprocess.run(
File "/usr/lib64/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['gcc', '-c', '-x', 'assembler-with-cpp', '-o', '/tmp/tmpitbxzagr.o', '-']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/danny/mlkem-native_dev/dev/../scripts/simpasm", line 445, in
_main()
File "/home/danny/mlkem-native_dev/dev/../scripts/simpasm", line 436, in _main
simplify(logger, args, args.input, args.output)
File "/home/danny/mlkem-native_dev/dev/../scripts/simpasm", line 206, in simplify
run_cmd(cmd, input=asm_no_if)
File "/home/danny/mlkem-native_dev/dev/../scripts/simpasm", line 166, in run_cmd
raise Exception("simpasm failed") from e
Exception: simpasm failed
[07:06] danny@ltc-zz4-lp9 dev %

@hanno-becker
Contributor

hanno-becker commented Sep 11, 2025

@dannytsen I have basically done this for you in the branch (atop of your changes), so you won't need to fiddle with Python anymore. But you will need to change the ASM to pass in the constants as arguments, rather than having .data sections in the ASM.

If you checkout the branch, enter the nix develop .#ci-cross shell (this will take a very long time initially, unfortunately), and do autogen --force-cross, everything will just work: It'll run simpasm on your dev/ppc64le backend and update the code in mlkem/src/native/ppc64le/ accordingly.

@dannytsen
Author

> @dannytsen I have basically done this for you in the branch (atop of your changes), so you won't need to fiddle with Python anymore. But you will need to change the ASM to pass in the constants as arguments, rather than having .data sections in the ASM.

@hanno-becker Sure. I can do that.

@dannytsen
Author

> @dannytsen I have basically done this for you in the branch (atop of your changes), so you won't need to fiddle with Python anymore. But you will need to change the ASM to pass in the constants as arguments, rather than having .data sections in the ASM.
>
> If you checkout the branch, enter the nix develop .#ci-cross shell (this will take a very long time initially, unfortunately), and do autogen, everything will just work.

@hanno-becker BTW, there is no nix for ppc.

@hanno-becker
Contributor

hanno-becker commented Sep 11, 2025

@dannytsen You should work in an x86_64 or AArch64 Linux/Mac environment and use the PPC64Le cross compiler, which is already part of the environment established by nix develop .#ci-cross.

@dannytsen
Author

> @dannytsen You should work in an x86_64 or AArch64 Linux/Mac environment and use the PPC64Le cross compiler, which is already part of the environment established by nix develop .#ci-cross.

@hanno-becker Ok. I'll check that. Thanks.

@hanno-becker
Contributor

@dannytsen What indicators/assurances have you obtained so far that the assembly is correct? Also, have you successfully run the code on QEMU, or real HW only?

@dannytsen
Author

> @dannytsen What indicators/assurances have you obtained so far that the assembly is correct? Also, have you successfully run the code on QEMU, or real HW only?

@hanno-becker The code has been run successfully in the liboqs and mlkem-native projects on real HW. It was originally written for liboqs.

@hanno-becker
Contributor

hanno-becker commented Sep 12, 2025

@dannytsen Independent of the work of separating the twiddles from the assembly:

I ran the code in a ppc64le emulator, but it fails as soon as I start to use the NTT or invNTT. Specifically, in a Linux/Mac environment, and using your current dannytsen:main:

nix develop --extra-experimental-features 'nix-command flakes'  .#ci-cross
make clean
tests func --cross-prefix=powerpc64le-unknown-linux-gnu- --exec-wrapper="qemu-ppc64le -cpu power9" --opt=opt

This gives:

INFO  > Functional Test    Compile     (cross opt):      CROSS_PREFIX=powerpc64le-unknown-linux-gnu- make func OPT=1 AUTO=1 -j32
INFO  > Functional Test    ML-KEM-512  (cross opt):      EXEC_WRAPPER=qemu-ppc64le -cpu power9 make run_func_512 -j32
ERROR > Functional Test    ML-KEM-512  (cross opt):      'EXEC_WRAPPER=qemu-ppc64le -cpu power9 make run_func_512 -j32' failed with with 2
ERROR > Functional Test    ML-KEM-512  (cross opt):      ERROR (test/test_mlkem.c,49)
ERROR (test/test_mlkem.c,225)
make: *** [Makefile:58: run_func_512] Error 1

If I comment out MLK_USE_NATIVE_NTT and MLK_USE_NATIVE_INTT, it works, so something must be off related to the [inv]NTT.
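
One way to narrow down such a failure is to run the C reference and the native routine on the same input and report the first coefficient that differs. The sketch below assumes the two results only need to agree modulo q = 3329 (a native routine may return a different representative). The `ex_*` names and the toy routines are hypothetical, and this is not the project's own backend unit test.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_N 256  /* coefficients per polynomial */
#define EX_Q 3329 /* ML-KEM modulus */

typedef void (*ex_poly_fn)(int16_t r[EX_N]);

/* Returns -1 if both routines agree modulo q on `in`,
 * otherwise the index of the first coefficient that differs. */
int ex_first_mismatch(ex_poly_fn ref, ex_poly_fn native, const int16_t in[EX_N])
{
  int16_t a[EX_N], b[EX_N];
  memcpy(a, in, sizeof a);
  memcpy(b, in, sizeof b);
  ref(a);
  native(b);
  for (size_t i = 0; i < EX_N; i++)
  {
    int32_t diff = (int32_t)a[i] - (int32_t)b[i];
    if (diff % EX_Q != 0)
    {
      return (int)i;
    }
  }
  return -1;
}

/* Toy routines used only to exercise the harness (inputs assumed in [0, q)). */
void ex_toy_ref(int16_t r[EX_N])
{
  for (size_t i = 0; i < EX_N; i++) r[i] = (int16_t)((r[i] * 2) % EX_Q);
}

void ex_toy_alt(int16_t r[EX_N]) /* unreduced, but congruent to ex_toy_ref */
{
  for (size_t i = 0; i < EX_N; i++) r[i] = (int16_t)(r[i] * 2);
}

void ex_toy_bad(int16_t r[EX_N]) /* deliberately wrong at index 7 */
{
  ex_toy_ref(r);
  r[7] = (int16_t)(r[7] + 1);
}
```

Run under the same `--exec-wrapper`, a check of this shape would show whether the native routine computes wrong values or never returns at all (for example, because of an illegal instruction).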

It could just be some CPU configuration missing. Are you assuming a particular vector length, for example?

I can also see the code failing when running under qemu-ppc64le -cpu power8, with Illegal instruction aborts. From the documentation, I was expecting it to work from power8 upwards.
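
To see which ISA level a build actually targets, one option is to inspect the compiler's predefined macros. This sketch assumes the GCC-style macros `_ARCH_PWR8`, `_ARCH_PWR9`, and `_ARCH_PWR10`; `ex_ppc_target_isa` is a hypothetical helper. It only reflects compile-time flags such as `-mcpu` and says nothing about which instructions hand-written assembly uses.

```c
/* Reports the Power ISA level the compiler was asked to target.
 * Checks the newest level first, since newer targets also define
 * the macros of older ones. */
const char *ex_ppc_target_isa(void)
{
#if defined(_ARCH_PWR10)
  return "power10 (ISA 3.1)";
#elif defined(_ARCH_PWR9)
  return "power9 (ISA 3.0)";
#elif defined(_ARCH_PWR8)
  return "power8 (ISA 2.07)";
#elif defined(__powerpc64__)
  return "ppc64, older than power8";
#else
  return "not a ppc64 target";
#endif
}
```

If this reports power8 while the assembly contains instructions introduced in ISA 3.0, an Illegal instruction abort under `-cpu power8` would be the expected outcome.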

Bottomline: I'll help with the integration details, but you'd need to find out / demonstrate that/how the code works in an emulated QEMU environment so we can test it in CI -- can you do that please?

@dannytsen
Author

dannytsen commented Sep 12, 2025

> @dannytsen Independent of the work of separating the twiddles from the assembly:
>
> I ran the code in a ppc64le emulator, but it fails as soon as I start to use the NTT or invNTT. Specifically, in a Linux/Mac environment, and using your current dannytsen:main:
>
> nix develop --extra-experimental-features 'nix-command flakes'  .#ci-cross
> make clean
> tests func --cross-prefix=powerpc64le-unknown-linux-gnu- --exec-wrapper="qemu-ppc64le -cpu power9" --opt=opt
>
> This gives:
>
> INFO  > Functional Test    Compile     (cross opt):      CROSS_PREFIX=powerpc64le-unknown-linux-gnu- make func OPT=1 AUTO=1 -j32
> INFO  > Functional Test    ML-KEM-512  (cross opt):      EXEC_WRAPPER=qemu-ppc64le -cpu power9 make run_func_512 -j32
> ERROR > Functional Test    ML-KEM-512  (cross opt):      'EXEC_WRAPPER=qemu-ppc64le -cpu power9 make run_func_512 -j32' failed with with 2
> ERROR > Functional Test    ML-KEM-512  (cross opt):      ERROR (test/test_mlkem.c,49)
> ERROR (test/test_mlkem.c,225)
> make: *** [Makefile:58: run_func_512] Error 1
>
> If I comment out MLK_USE_NATIVE_NTT and MLK_USE_NATIVE_INTT, it works, so something must be off related to the [inv]NTT.
>
> It could just be some CPU configuration missing. Are you assuming a particular vector length, for example?
>
> I can also see the code failing when running under qemu-ppc64le -cpu power8, with Illegal instruction aborts. From the documentation, I was expecting it to work from power8 upwards.
>
> Bottomline: I'll help with the integration details, but you'd need to find out / demonstrate that/how the code works in an emulated QEMU environment so we can test it in CI -- can you do that please?

@hanno-becker Which means that it doesn't work with p8 or some instructions were not supported in QEMU.
I only tested on p9 and p10 HW platforms.

@dannytsen
Author

@hanno-becker Here is my output from p9.

[00:01] danny@ltc-zz4-lp9 mlkem-native_dev % make test
AS test/build/mlkem512/mlkem/src/native/ppc64le/src/t.S.o
AR test/build/libmlkem512.a
LD test/build/mlkem512/bin/gen_KAT512
KAT ML-KEM-512: test/build/mlkem512/bin/gen_KAT512
set -o pipefail; test/build/mlkem512/bin/gen_KAT512 | shasum -a 256 | cut -d " " -f 1 | xargs ./META.sh ML-KEM-512 kat-sha256
/usr/bin/which: no yq in (/home/danny/.local/bin:/home/danny/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
META.yml ML-KEM-512 kat-sha256: OK
AS test/build/mlkem768/mlkem/src/native/ppc64le/src/t.S.o
AR test/build/libmlkem768.a
LD test/build/mlkem768/bin/gen_KAT768
KAT ML-KEM-768: test/build/mlkem768/bin/gen_KAT768
set -o pipefail; test/build/mlkem768/bin/gen_KAT768 | shasum -a 256 | cut -d " " -f 1 | xargs ./META.sh ML-KEM-768 kat-sha256
/usr/bin/which: no yq in (/home/danny/.local/bin:/home/danny/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
META.yml ML-KEM-768 kat-sha256: OK
AS test/build/mlkem1024/mlkem/src/native/ppc64le/src/t.S.o
AR test/build/libmlkem1024.a
LD test/build/mlkem1024/bin/gen_KAT1024
KAT ML-KEM-1024: test/build/mlkem1024/bin/gen_KAT1024
set -o pipefail; test/build/mlkem1024/bin/gen_KAT1024 | shasum -a 256 | cut -d " " -f 1 | xargs ./META.sh ML-KEM-1024 kat-sha256
/usr/bin/which: no yq in (/home/danny/.local/bin:/home/danny/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin)
META.yml ML-KEM-1024 kat-sha256: OK
LD test/build/mlkem512/bin/test_mlkem512
FUNC ML-KEM-512: test/build/mlkem512/bin/test_mlkem512
test/build/mlkem512/bin/test_mlkem512
CRYPTO_SECRETKEYBYTES: 1632
CRYPTO_PUBLICKEYBYTES: 800
CRYPTO_CIPHERTEXTBYTES: 768
LD test/build/mlkem768/bin/test_mlkem768
FUNC ML-KEM-768: test/build/mlkem768/bin/test_mlkem768
test/build/mlkem768/bin/test_mlkem768
CRYPTO_SECRETKEYBYTES: 2400
CRYPTO_PUBLICKEYBYTES: 1184
CRYPTO_CIPHERTEXTBYTES: 1088
LD test/build/mlkem1024/bin/test_mlkem1024
FUNC ML-KEM-1024: test/build/mlkem1024/bin/test_mlkem1024
test/build/mlkem1024/bin/test_mlkem1024
CRYPTO_SECRETKEYBYTES: 3168
CRYPTO_PUBLICKEYBYTES: 1568
CRYPTO_CIPHERTEXTBYTES: 1568
LD test/build/mlkem512/bin/acvp_mlkem512
ACVP ML-KEM-512: test/build/mlkem512/bin/acvp_mlkem512
LD test/build/mlkem768/bin/acvp_mlkem768
ACVP ML-KEM-768: test/build/mlkem768/bin/acvp_mlkem768
LD test/build/mlkem1024/bin/acvp_mlkem1024
ACVP ML-KEM-1024: test/build/mlkem1024/bin/acvp_mlkem1024
python3 ./test/acvp_client.py
Using ACVP test vectors version v1.1.0.40
Running ACVP tests for test/.acvp-data/v1.1.0.40/files/ML-KEM-keyGen-FIPS203/prompt.json
Running keyGen test case 1 ... done
Running keyGen test case 2 ... done
Running keyGen test case 3 ... done
Running keyGen test case 4 ... done
Running keyGen test case 5 ... done
Running keyGen test case 6 ... done
Running keyGen test case 7 ... done
Running keyGen test case 8 ... done
Running keyGen test case 9 ... done
Running keyGen test case 10 ... done
Running keyGen test case 11 ... done
Running keyGen test case 12 ... done
Running keyGen test case 13 ... done
Running keyGen test case 14 ... done
Running keyGen test case 15 ... done
Running keyGen test case 16 ... done
Running keyGen test case 17 ... done
Running keyGen test case 18 ... done
Running keyGen test case 19 ... done
Running keyGen test case 20 ... done
Running keyGen test case 21 ... done
Running keyGen test case 22 ... done
Running keyGen test case 23 ... done
Running keyGen test case 24 ... done
Running keyGen test case 25 ... done
Running keyGen test case 26 ... done
Running keyGen test case 27 ... done
Running keyGen test case 28 ... done
Running keyGen test case 29 ... done
Running keyGen test case 30 ... done
Running keyGen test case 31 ... done
Running keyGen test case 32 ... done
Running keyGen test case 33 ... done
Running keyGen test case 34 ... done
Running keyGen test case 35 ... done
Running keyGen test case 36 ... done
Running keyGen test case 37 ... done
Running keyGen test case 38 ... done
Running keyGen test case 39 ... done
Running keyGen test case 40 ... done
Running keyGen test case 41 ... done
Running keyGen test case 42 ... done
Running keyGen test case 43 ... done
Running keyGen test case 44 ... done
Running keyGen test case 45 ... done
Running keyGen test case 46 ... done
Running keyGen test case 47 ... done
Running keyGen test case 48 ... done
Running keyGen test case 49 ... done
Running keyGen test case 50 ... done
Running keyGen test case 51 ... done
Running keyGen test case 52 ... done
Running keyGen test case 53 ... done
Running keyGen test case 54 ... done
Running keyGen test case 55 ... done
Running keyGen test case 56 ... done
Running keyGen test case 57 ... done
Running keyGen test case 58 ... done
Running keyGen test case 59 ... done
Running keyGen test case 60 ... done
Running keyGen test case 61 ... done
Running keyGen test case 62 ... done
Running keyGen test case 63 ... done
Running keyGen test case 64 ... done
Running keyGen test case 65 ... done
Running keyGen test case 66 ... done
Running keyGen test case 67 ... done
Running keyGen test case 68 ... done
Running keyGen test case 69 ... done
Running keyGen test case 70 ... done
Running keyGen test case 71 ... done
Running keyGen test case 72 ... done
Running keyGen test case 73 ... done
Running keyGen test case 74 ... done
Running keyGen test case 75 ... done
Comparing results with test/.acvp-data/v1.1.0.40/files/ML-KEM-keyGen-FIPS203/expectedResults.json
OK
Running ACVP tests for test/.acvp-data/v1.1.0.40/files/ML-KEM-encapDecap-FIPS203/prompt.json
Running encapDecap test case 1 (encapsulation) ... done
Running encapDecap test case 2 (encapsulation) ... done
Running encapDecap test case 3 (encapsulation) ... done
Running encapDecap test case 4 (encapsulation) ... done
Running encapDecap test case 5 (encapsulation) ... done
Running encapDecap test case 6 (encapsulation) ... done
Running encapDecap test case 7 (encapsulation) ... done
Running encapDecap test case 8 (encapsulation) ... done
Running encapDecap test case 9 (encapsulation) ... done
Running encapDecap test case 10 (encapsulation) ... done
Running encapDecap test case 11 (encapsulation) ... done
Running encapDecap test case 12 (encapsulation) ... done
Running encapDecap test case 13 (encapsulation) ... done
Running encapDecap test case 14 (encapsulation) ... done
Running encapDecap test case 15 (encapsulation) ... done
Running encapDecap test case 16 (encapsulation) ... done
Running encapDecap test case 17 (encapsulation) ... done
Running encapDecap test case 18 (encapsulation) ... done
Running encapDecap test case 19 (encapsulation) ... done
Running encapDecap test case 20 (encapsulation) ... done
Running encapDecap test case 21 (encapsulation) ... done
Running encapDecap test case 22 (encapsulation) ... done
Running encapDecap test case 23 (encapsulation) ... done
Running encapDecap test case 24 (encapsulation) ... done
Running encapDecap test case 25 (encapsulation) ... done
Running encapDecap test case 26 (encapsulation) ... done
Running encapDecap test case 27 (encapsulation) ... done
Running encapDecap test case 28 (encapsulation) ... done
Running encapDecap test case 29 (encapsulation) ... done
Running encapDecap test case 30 (encapsulation) ... done
Running encapDecap test case 31 (encapsulation) ... done
Running encapDecap test case 32 (encapsulation) ... done
Running encapDecap test case 33 (encapsulation) ... done
Running encapDecap test case 34 (encapsulation) ... done
Running encapDecap test case 35 (encapsulation) ... done
Running encapDecap test case 36 (encapsulation) ... done
Running encapDecap test case 37 (encapsulation) ... done
Running encapDecap test case 38 (encapsulation) ... done
Running encapDecap test case 39 (encapsulation) ... done
Running encapDecap test case 40 (encapsulation) ... done
Running encapDecap test case 41 (encapsulation) ... done
Running encapDecap test case 42 (encapsulation) ... done
Running encapDecap test case 43 (encapsulation) ... done
Running encapDecap test case 44 (encapsulation) ... done
Running encapDecap test case 45 (encapsulation) ... done
Running encapDecap test case 46 (encapsulation) ... done
Running encapDecap test case 47 (encapsulation) ... done
Running encapDecap test case 48 (encapsulation) ... done
Running encapDecap test case 49 (encapsulation) ... done
Running encapDecap test case 50 (encapsulation) ... done
Running encapDecap test case 51 (encapsulation) ... done
Running encapDecap test case 52 (encapsulation) ... done
Running encapDecap test case 53 (encapsulation) ... done
Running encapDecap test case 54 (encapsulation) ... done
Running encapDecap test case 55 (encapsulation) ... done
Running encapDecap test case 56 (encapsulation) ... done
Running encapDecap test case 57 (encapsulation) ... done
Running encapDecap test case 58 (encapsulation) ... done
Running encapDecap test case 59 (encapsulation) ... done
Running encapDecap test case 60 (encapsulation) ... done
Running encapDecap test case 61 (encapsulation) ... done
Running encapDecap test case 62 (encapsulation) ... done
Running encapDecap test case 63 (encapsulation) ... done
Running encapDecap test case 64 (encapsulation) ... done
Running encapDecap test case 65 (encapsulation) ... done
Running encapDecap test case 66 (encapsulation) ... done
Running encapDecap test case 67 (encapsulation) ... done
Running encapDecap test case 68 (encapsulation) ... done
Running encapDecap test case 69 (encapsulation) ... done
Running encapDecap test case 70 (encapsulation) ... done
Running encapDecap test case 71 (encapsulation) ... done
Running encapDecap test case 72 (encapsulation) ... done
Running encapDecap test case 73 (encapsulation) ... done
Running encapDecap test case 74 (encapsulation) ... done
Running encapDecap test case 75 (encapsulation) ... done
Running encapDecap test case 76 (decapsulation) ... done
Running encapDecap test case 77 (decapsulation) ... done
Running encapDecap test case 78 (decapsulation) ... done
Running encapDecap test case 79 (decapsulation) ... done
Running encapDecap test case 80 (decapsulation) ... done
Running encapDecap test case 81 (decapsulation) ... done
Running encapDecap test case 82 (decapsulation) ... done
Running encapDecap test case 83 (decapsulation) ... done
Running encapDecap test case 84 (decapsulation) ... done
Running encapDecap test case 85 (decapsulation) ... done
Running encapDecap test case 86 (decapsulation) ... done
Running encapDecap test case 87 (decapsulation) ... done
Running encapDecap test case 88 (decapsulation) ... done
Running encapDecap test case 89 (decapsulation) ... done
Running encapDecap test case 90 (decapsulation) ... done
Running encapDecap test case 91 (decapsulation) ... done
Running encapDecap test case 92 (decapsulation) ... done
Running encapDecap test case 93 (decapsulation) ... done
Running encapDecap test case 94 (decapsulation) ... done
Running encapDecap test case 95 (decapsulation) ... done
Running encapDecap test case 96 (decapsulation) ... done
Running encapDecap test case 97 (decapsulation) ... done
Running encapDecap test case 98 (decapsulation) ... done
Running encapDecap test case 99 (decapsulation) ... done
Running encapDecap test case 100 (decapsulation) ... done
Running encapDecap test case 101 (decapsulation) ... done
Running encapDecap test case 102 (decapsulation) ... done
Running encapDecap test case 103 (decapsulation) ... done
Running encapDecap test case 104 (decapsulation) ... done
Running encapDecap test case 105 (decapsulation) ... done
Running encapDecap test case 106 (decapsulationKeyCheck) ... done
Running encapDecap test case 107 (decapsulationKeyCheck) ... done
Running encapDecap test case 108 (decapsulationKeyCheck) ... done
Running encapDecap test case 109 (decapsulationKeyCheck) ... done
Running encapDecap test case 110 (decapsulationKeyCheck) ... done
Running encapDecap test case 111 (decapsulationKeyCheck) ... done
Running encapDecap test case 112 (decapsulationKeyCheck) ... done
Running encapDecap test case 113 (decapsulationKeyCheck) ... done
Running encapDecap test case 114 (decapsulationKeyCheck) ... done
Running encapDecap test case 115 (decapsulationKeyCheck) ... done
Running encapDecap test case 116 (encapsulationKeyCheck) ... done
Running encapDecap test case 117 (encapsulationKeyCheck) ... done
Running encapDecap test case 118 (encapsulationKeyCheck) ... done
Running encapDecap test case 119 (encapsulationKeyCheck) ... done
Running encapDecap test case 120 (encapsulationKeyCheck) ... done
Running encapDecap test case 121 (encapsulationKeyCheck) ... done
Running encapDecap test case 122 (encapsulationKeyCheck) ... done
Running encapDecap test case 123 (encapsulationKeyCheck) ... done
Running encapDecap test case 124 (encapsulationKeyCheck) ... done
Running encapDecap test case 125 (encapsulationKeyCheck) ... done
Running encapDecap test case 126 (decapsulationKeyCheck) ... done
Running encapDecap test case 127 (decapsulationKeyCheck) ... done
Running encapDecap test case 128 (decapsulationKeyCheck) ... done
Running encapDecap test case 129 (decapsulationKeyCheck) ... done
Running encapDecap test case 130 (decapsulationKeyCheck) ... done
Running encapDecap test case 131 (decapsulationKeyCheck) ... done
Running encapDecap test case 132 (decapsulationKeyCheck) ... done
Running encapDecap test case 133 (decapsulationKeyCheck) ... done
Running encapDecap test case 134 (decapsulationKeyCheck) ... done
Running encapDecap test case 135 (decapsulationKeyCheck) ... done
Running encapDecap test case 136 (encapsulationKeyCheck) ... done
Running encapDecap test case 137 (encapsulationKeyCheck) ... done
Running encapDecap test case 138 (encapsulationKeyCheck) ... done
Running encapDecap test case 139 (encapsulationKeyCheck) ... done
Running encapDecap test case 140 (encapsulationKeyCheck) ... done
Running encapDecap test case 141 (encapsulationKeyCheck) ... done
Running encapDecap test case 142 (encapsulationKeyCheck) ... done
Running encapDecap test case 143 (encapsulationKeyCheck) ... done
Running encapDecap test case 144 (encapsulationKeyCheck) ... done
Running encapDecap test case 145 (encapsulationKeyCheck) ... done
Running encapDecap test case 146 (decapsulationKeyCheck) ... done
Running encapDecap test case 147 (decapsulationKeyCheck) ... done
Running encapDecap test case 148 (decapsulationKeyCheck) ... done
Running encapDecap test case 149 (decapsulationKeyCheck) ... done
Running encapDecap test case 150 (decapsulationKeyCheck) ... done
Running encapDecap test case 151 (decapsulationKeyCheck) ... done
Running encapDecap test case 152 (decapsulationKeyCheck) ... done
Running encapDecap test case 153 (decapsulationKeyCheck) ... done
Running encapDecap test case 154 (decapsulationKeyCheck) ... done
Running encapDecap test case 155 (decapsulationKeyCheck) ... done
Running encapDecap test case 156 (encapsulationKeyCheck) ... done
Running encapDecap test case 157 (encapsulationKeyCheck) ... done
Running encapDecap test case 158 (encapsulationKeyCheck) ... done
Running encapDecap test case 159 (encapsulationKeyCheck) ... done
Running encapDecap test case 160 (encapsulationKeyCheck) ... done
Running encapDecap test case 161 (encapsulationKeyCheck) ... done
Running encapDecap test case 162 (encapsulationKeyCheck) ... done
Running encapDecap test case 163 (encapsulationKeyCheck) ... done
Running encapDecap test case 164 (encapsulationKeyCheck) ... done
Running encapDecap test case 165 (encapsulationKeyCheck) ... done
Comparing results with test/.acvp-data/v1.1.0.40/files/ML-KEM-encapDecap-FIPS203/expectedResults.json
OK
ALL GOOD!
Everything checks fine!
[00:01] danny@ltc-zz4-lp9 mlkem-native_dev %

@hanno-becker
Contributor

hanno-becker commented Sep 12, 2025

@dannytsen Thank you. As mentioned, can you please find out how to test the code using qemu? The ci-cross shell already provides you with a cross compiler and an emulator (and a test script to use them, e.g. tests func --cross-prefix=powerpc64le-unknown-linux-gnu- --exec-wrapper="qemu-ppc64le -cpu power9" --opt=opt), but it appears that some configuration options are missing. We don't have PPC machines in CI, so the only way to test is via QEMU.

> Which means that it doesn't work with p8 or some instructions were not supported in QEMU. I only tested on p9 and p10 HW platforms.

Can you find out which one it is? The PR documentation states that the ASM works P8 upwards.

@dannytsen
Author

> @dannytsen Thank you. As mentioned, can you please find out how to test the code using qemu? The ci-cross shell already provides you with a cross compiler and an emulator, but it appears that some configuration options are missing. We don't have PPC machines in CI, so the only way to test is via QEMU.
>
> > Which means that it doesn't work with p8 or some instructions were not supported in QEMU. I only tested on p9 and p10 HW platforms.
>
> Can you find out which one it is? The PR documentation states that the ASM works P8 upwards.

@hanno-becker It looks like the QEMU cross compiler doesn't support the "xxpermdi" instruction. I'll check.
Does your cross compiler support ISA 2.07?

xxpermdi

@hanno-becker
Contributor

@dannytsen I don't know. You should be able to find out assembling a minimal example and using powerpc64le-unknown-linux-gnu-objdump -d your_object_file.o for the disassembly.

@dannytsen
Author

> @dannytsen I don't know. You should be able to find out assembling a minimal example and using powerpc64le-unknown-linux-gnu-objdump -d your_object_file.o for the disassembly.

I'll check.

@dannytsen
Author

@hanno-becker I don't have qemu on my system. But I installed nix on my Mac and ran the following command under nix,
(CC=powerpc64le-unknown-linux-gnu-gcc make build) and it compiled fine. The objdump output also looks fine (powerpc64le-unknown-linux-gnu-objdump -d ntt_ppc.S.o). So, the assembly file should be good. But I can't run tests on my Mac with the cross-compiled binary.

This is the final build output.

FUNC ML-KEM-1024: test/build/mlkem1024/bin/test_mlkem1024
CC test/build/mlkem512/test/gen_KAT.c.o
LD test/build/mlkem512/bin/gen_KAT512
KAT ML-KEM-512: test/build/mlkem512/bin/gen_KAT512
CC test/build/mlkem768/test/gen_KAT.c.o
LD test/build/mlkem768/bin/gen_KAT768
KAT ML-KEM-768: test/build/mlkem768/bin/gen_KAT768
CC test/build/mlkem1024/test/gen_KAT.c.o
LD test/build/mlkem1024/bin/gen_KAT1024
KAT ML-KEM-1024: test/build/mlkem1024/bin/gen_KAT1024
CC test/build/mlkem512/test/acvp_mlkem.c.o
LD test/build/mlkem512/bin/acvp_mlkem512
ACVP ML-KEM-512: test/build/mlkem512/bin/acvp_mlkem512
CC test/build/mlkem768/test/acvp_mlkem.c.o
LD test/build/mlkem768/bin/acvp_mlkem768
ACVP ML-KEM-768: test/build/mlkem768/bin/acvp_mlkem768
CC test/build/mlkem1024/test/acvp_mlkem.c.o
LD test/build/mlkem1024/bin/acvp_mlkem1024
ACVP ML-KEM-1024: test/build/mlkem1024/bin/acvp_mlkem1024
Everything builds fine!

I'll check about qemu.

@hanno-becker
Contributor

hanno-becker commented Sep 12, 2025

> I don't have qemu on my system. But I installed nix on my Mac and ran the following command under nix, (CC=powerpc64le-unknown-linux-gnu-gcc make build) and it compiled fine.

Please see #1184 (comment) again -- once in the ci-cross shell, you can use the tests script to build and run with cross-compiler / QEMU emulation.

@dannytsen
Author

@hanno-becker Thanks for taking on so much of the ppc64le integration. I am learning a lot here in all aspects. I don't know how much time it will take me, but I'll work on it.

@hanno-becker
Contributor

> @hanno-becker Thanks for taking on so much of the ppc64le integration. I am learning a lot here in all aspects. I don't know how much time it will take me, but I'll work on it.

@dannytsen This is great to hear! Please don't hesitate to ask if anything is unclear about the algorithmic aspects of the AArch64 implementation, or how to translate it. Otherwise, I'm looking forward to seeing the updated code!

@bhess
Contributor

bhess commented Sep 29, 2025

Thanks @dannytsen! Feel free to let me know if there’s any area where I can jump in and help out with coding.

@dannytsen
Author

@hanno-becker @bhess Thanks.

@hanno-becker
Contributor

@dannytsen Any update?

@dannytsen
Author

@hanno-becker Still working on fixing the current code before the other changes.

@hanno-becker
Contributor

@dannytsen Ack. Let me know when you're done with the rework or have any questions.

@dannytsen
Author

> @dannytsen Ack. Let me know when you're done with the rework or have any questions.

@hanno-becker @bhess The fix to make my implementation work with your new backend unit test is very straightforward and simple: it just matches the workflow of your C implementation. The implementation worked as is. So, I don't plan on a rework any time soon. Thanks.

@hanno-becker
Contributor

@bhess @dannytsen Could you provide performance improvement data for the backend as it stands? What is the plan towards integrating assembly that leverages lazy reduction and layer merging?

@dannytsen
Author

> @bhess @dannytsen Could you provide performance improvement data for the backend as it stands? What is the plan towards integrating assembly that leverages lazy reduction and layer merging?

@hanno-becker @bhess Here is the benchmark data. This benchmark was run on p10 hardware. I will discuss further optimization with Basil if needed.

[13:51] danny-mlkem-native_dev % ./scripts/tests bench -c PERF
INFO > Benchmark Compile (native no_opt): make bench OPT=0 AUTO=1 CYCLES=PERF -j40
INFO > Benchmark ML-KEM-512 (native no_opt): make run_bench_512
INFO > Benchmark ML-KEM-512 (native no_opt): test/build/mlkem512/bin/bench_mlkem512
keypair cycles = 67005
encaps cycles = 78777
decaps cycles = 100820

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 66527 66731 66830 66901 66951 67005 67059 67117 67201 67334 71937
encaps percentiles: 78277 78485 78595 78668 78722 78777 78819 78873 78942 79044 83703
decaps percentiles: 100339 100522 100624 100692 100759 100820 100867 100917 100979 101088 105799

INFO > Benchmark ML-KEM-768 (native no_opt): make run_bench_768
INFO > Benchmark ML-KEM-768 (native no_opt): test/build/mlkem768/bin/bench_mlkem768
keypair cycles = 111498
encaps cycles = 125744
decaps cycles = 154363

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 110666 110991 111129 111242 111357 111498 111593 111720 111907 112334 116861
encaps percentiles: 124889 125249 125404 125535 125648 125744 125832 125964 126174 126722 131183
decaps percentiles: 153475 153874 154030 154166 154263 154363 154471 154572 154785 155226 159882

INFO > Benchmark ML-KEM-1024 (native no_opt): make run_bench_1024
INFO > Benchmark ML-KEM-1024 (native no_opt): test/build/mlkem1024/bin/bench_mlkem1024
keypair cycles = 166899
encaps cycles = 185232
decaps cycles = 220194

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 165446 166045 166300 166520 166673 166899 167079 167293 167716 171222 175975
encaps percentiles: 183678 184443 184701 184895 185063 185232 185408 185646 186018 189577 192379
decaps percentiles: 218738 219419 219672 219852 220031 220194 220422 220610 221003 224567 227994

INFO > Benchmark Compile (native opt): make bench OPT=1 AUTO=1 CYCLES=PERF -j40
INFO > Benchmark ML-KEM-512 (native opt): make run_bench_512
INFO > Benchmark ML-KEM-512 (native opt): test/build/mlkem512/bin/bench_mlkem512
keypair cycles = 46111
encaps cycles = 50872
decaps cycles = 63686

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 45583 45794 45895 45969 46055 46111 46168 46249 46344 46447 51088
encaps percentiles: 50379 50590 50674 50750 50809 50872 50929 51000 51095 51218 55833
decaps percentiles: 63191 63388 63472 63547 63619 63686 63746 63810 63904 64013 68660

INFO > Benchmark ML-KEM-768 (native opt): make run_bench_768
INFO > Benchmark ML-KEM-768 (native opt): test/build/mlkem768/bin/bench_mlkem768
keypair cycles = 79408
encaps cycles = 86751
decaps cycles = 104108

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 78671 78938 79091 79241 79333 79408 79516 79631 79828 80295 84793
encaps percentiles: 85989 86262 86396 86505 86602 86751 86837 86941 87098 87585 92347
decaps percentiles: 103310 103597 103749 103869 103998 104108 104212 104329 104508 104964 109715

INFO > Benchmark ML-KEM-1024 (native opt): make run_bench_1024
INFO > Benchmark ML-KEM-1024 (native opt): test/build/mlkem1024/bin/bench_mlkem1024
keypair cycles = 124334
encaps cycles = 134802
decaps cycles = 157430

       percentile      1     10     20     30     40     50     60     70     80     90     99

keypair percentiles: 122971 123552 123774 123995 124157 124334 124544 124782 125043 128570 133437
encaps percentiles: 133358 134027 134280 134484 134664 134802 135008 135222 135594 139137 141347
decaps percentiles: 156034 156641 156913 157074 157249 157430 157611 157879 158231 161818 164032

All good!
[13:52] danny-mlkem-native_dev %

@hanno-becker
Contributor

hanno-becker commented Oct 14, 2025

@dannytsen Thanks. As a table:

| Parameter Set | Operation | No Opt (cycles) | Optimized (cycles) | Speedup |
|---|---|---|---|---|
| ML-KEM-512 | Keypair | 67,005 | 46,111 | 1.45x |
| ML-KEM-512 | Encaps | 78,777 | 50,872 | 1.55x |
| ML-KEM-512 | Decaps | 100,820 | 63,686 | 1.58x |
| ML-KEM-768 | Keypair | 111,498 | 79,408 | 1.40x |
| ML-KEM-768 | Encaps | 125,744 | 86,751 | 1.45x |
| ML-KEM-768 | Decaps | 154,363 | 104,108 | 1.48x |
| ML-KEM-1024 | Keypair | 166,899 | 124,334 | 1.34x |
| ML-KEM-1024 | Encaps | 185,232 | 134,802 | 1.37x |
| ML-KEM-1024 | Decaps | 220,194 | 157,430 | 1.40x |
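
For anyone who wants to re-derive the speedup column, here is a minimal Python sketch. It assumes the speedup is simply the ratio of the median (50th percentile) cycle counts from the p10 benchmark output above; `MEDIAN_CYCLES`, `speedup`, and `speedup_table` are illustrative names and not part of mlkem-native.

```python
# Median cycle counts copied from the p10 benchmark log above.
# Key: (parameter set, operation); value: (no_opt cycles, opt cycles).
MEDIAN_CYCLES = {
    ("ML-KEM-512", "Keypair"): (67005, 46111),
    ("ML-KEM-512", "Encaps"): (78777, 50872),
    ("ML-KEM-512", "Decaps"): (100820, 63686),
    ("ML-KEM-768", "Keypair"): (111498, 79408),
    ("ML-KEM-768", "Encaps"): (125744, 86751),
    ("ML-KEM-768", "Decaps"): (154363, 104108),
    ("ML-KEM-1024", "Keypair"): (166899, 124334),
    ("ML-KEM-1024", "Encaps"): (185232, 134802),
    ("ML-KEM-1024", "Decaps"): (220194, 157430),
}


def speedup(no_opt: int, opt: int) -> float:
    """Ratio of unoptimized to optimized cycle counts (assumed definition)."""
    return no_opt / opt


def speedup_table(cycles: dict) -> dict:
    """Map each (parameter set, operation) to its speedup, rounded to 2 places."""
    return {key: round(speedup(*pair), 2) for key, pair in cycles.items()}


if __name__ == "__main__":
    for (pset, op), s in speedup_table(MEDIAN_CYCLES).items():
        print(f"{pset:<12} {op:<8} {s:.2f}x")
```

The rounded ratios agree with the Speedup column in the table.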

@hanno-becker
Contributor

hanno-becker commented Oct 14, 2025

@dannytsen Can you clarify the architectural requirements please (by updating this PR)? The PR documents Power8 and above, but we don't seem to be able to emulate the code for Power8, nor have you (as I remember) tested it on such.

@dannytsen Can you also provide benchmarks on p9 please?

@dannytsen
Author

> @dannytsen Can you clarify the architectural requirements please (by updating this PR)? The PR documents Power8 and above, but we don't seem to be able to emulate the code for Power8, nor have you (as I remember) tested it on such.
>
> @dannytsen Can you also provide benchmarks on p9 please?

@hanno-becker I'll find a p8/p9 system to test on.

dannytsen and others added 2 commits October 17, 2025 19:35
Also fixed instruction byte ordering mismatch for p8 and p9/10, lxvx/stxvx
and lxvd2x/stxvd2x.  Used lxvd2x and stxvd2x for consistent byte ordering.

Signed-off-by: Danny Tsen <[email protected]>
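
To illustrate the kind of mismatch this commit message refers to, here is a small Python model of how the two load/store pairs map 16 bytes of memory to a vector register in little-endian mode, following the Power ISA description of these instructions. It is a sketch, not the PR's code: the function names are made up, a register is modelled as a 128-bit integer, and the thread does not show exactly where the original mismatch occurred. Note also that `lxvx`/`stxvx` were introduced with ISA 3.0 (POWER9), so they are not available on p8.

```python
MASK64 = (1 << 64) - 1


def lxvx_le(mem: bytes) -> int:
    """Model of lxvx in LE mode: a single 16-byte little-endian load."""
    return int.from_bytes(mem[:16], "little")


def stxvx_le(reg: int) -> bytes:
    """Model of stxvx in LE mode: a single 16-byte little-endian store."""
    return reg.to_bytes(16, "little")


def lxvd2x_le(mem: bytes) -> int:
    """Model of lxvd2x in LE mode: two 8-byte little-endian loads.

    The doubleword at the lower address lands in the high half of the register.
    """
    hi = int.from_bytes(mem[0:8], "little")
    lo = int.from_bytes(mem[8:16], "little")
    return (hi << 64) | lo


def stxvd2x_le(reg: int) -> bytes:
    """Model of stxvd2x in LE mode: the high doubleword is stored first."""
    hi, lo = reg >> 64, reg & MASK64
    return hi.to_bytes(8, "little") + lo.to_bytes(8, "little")


if __name__ == "__main__":
    mem = bytes(range(16))
    # Matching pairs round-trip memory unchanged.
    print(stxvd2x_le(lxvd2x_le(mem)) == mem)
    print(stxvx_le(lxvx_le(mem)) == mem)
    # Mixing the pairs swaps the two 8-byte halves.
    print(stxvd2x_le(lxvx_le(mem)).hex())
```

In this model, lane-wise halfword arithmetic is unaffected by the doubleword swap as long as every load, store, and constant table uses the same pair, which is consistent with the choice to use `lxvd2x`/`stxvd2x` throughout.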
@dannytsen
Author

> @dannytsen Can you clarify the architectural requirements please (by updating this PR)? The PR documents Power8 and above, but we don't seem to be able to emulate the code for Power8, nor have you (as I remember) tested it on such.
>
> @dannytsen Can you also provide benchmarks on p9 please?
>
> @hanno-becker I'll find a p8/p9 system to test on.

@hanno-becker The p8 issue is fixed. Here are the benchmarks for p8, p9, and p10. Thanks.
benchmark_all.txt

Comment on lines 115 to 118
lxvd2x 32+13, 3, 10 # r[j+len]
lxvd2x 32+18, 3, 17 # r[j+len]
lxvd2x 32+23, 3, 19 # r[j+len]
lxvd2x 32+28, 3, 21 # r[j+len]
Contributor

The comments are misleading: we are loading from different offsets here. The comments should indicate the offset.

Comment on lines 109 to 114
addi 16, 9, \next
addi 17, 10, \step
addi 18, 16, \next
addi 19, 17, \step
addi 20, 18, \next
addi 21, 19, \step
Contributor

Half of these offsets are not used here, but further down in Loda_4AJ, which is confusing. The code should either be moved, or a comment added.

vadduhm 30, 28, 27 # r + t
.endm

.macro NTT_MREDUCE_4X start next step _vz0 _vz1 _vz2 _vz3
Contributor

next and step are always the same. Merge the two arguments?

Comment on lines 228 to 235
vsubuhm 16, 12, 13 # r - t
vadduhm 15, 13, 12 # r + t
vsubuhm 21, 17, 18 # r - t
vadduhm 20, 18, 17 # r + t
vsubuhm 26, 22, 23 # r - t
vadduhm 25, 23, 22 # r + t
vsubuhm 31, 27, 28 # r - t
vadduhm 30, 28, 27 # r + t
Contributor

The comments here need refining, too
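
For reference when refining these comments, here is a scalar Python sketch of the Cooley-Tukey butterfly as it appears in the reference ML-KEM C code, which is what the `r - t` / `r + t` comments seem to allude to. It assumes `t` is the Montgomery product of a twiddle factor with `r[j + len]`; the helper names (`to_int16`, `montgomery_reduce`, `fqmul`, `ct_butterfly`) are illustrative, and the sketch says nothing about which vector register holds which value in this PR.

```python
Q = 3329        # ML-KEM modulus
QINV = 62209    # q^-1 mod 2^16


def to_int16(x: int) -> int:
    """Wrap an integer to a signed 16-bit value (models halfword modulo arithmetic)."""
    x &= 0xFFFF
    return x - 0x10000 if x >= 0x8000 else x


def montgomery_reduce(a: int) -> int:
    """Return a value congruent to a * 2^-16 modulo Q."""
    t = to_int16(a * QINV)
    # a - t*Q is divisible by 2^16, so the shift is exact.
    return (a - t * Q) >> 16


def fqmul(a: int, b: int) -> int:
    """Montgomery multiplication of two coefficients."""
    return montgomery_reduce(a * b)


def ct_butterfly(r: list, j: int, length: int, zeta_mont: int) -> None:
    """In-place butterfly on r[j] and r[j + length].

    zeta_mont is the twiddle factor in Montgomery form (zeta * 2^16 mod Q).
    """
    t = fqmul(zeta_mont, r[j + length])   # t = zeta * r[j + len]
    r[j + length] = to_int16(r[j] - t)    # r[j + len] = r[j] - t
    r[j] = to_int16(r[j] + t)             # r[j]       = r[j] + t


if __name__ == "__main__":
    zeta = 17
    r = [5, 7]
    ct_butterfly(r, 0, 1, (zeta << 16) % Q)
    print([x % Q for x in r])
```

Comments in the assembly that name the array index or offset for each register, in the style of the three commented lines in `ct_butterfly`, would make the correspondence to this scalar form easier to check.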

Contributor

@hanno-becker hanno-becker left a comment

@dannytsen Thank you again for your work.

I continued reviewing, but quickly found myself stuck without extensive help from an LLM commenting and rewriting the code for me. As it stands, the code is too difficult to understand for @mkannwischer and myself to commit to maintaining it.

The following steps would greatly help to make the code more understandable and maintainable:

  • Addition of `#define ...` abbreviations for all the registers in use. Following the raw numeric register identifiers is difficult, as the reader does not (yet) have the mental model that you had when writing the code. This is exacerbated by the fact that there is no way to tell (without knowledge of the instruction signature) whether a numeric value is a register identifier or an immediate. For example, in `NTT_MREDUCE_4X 5, 16, 16, ...`, `5` is a register identifier but `16` is an immediate.
  • More extensive comments, and fixing of existing comments. There are a number of comments which give a sense of what happens, but are not precise enough to exactly understand what's happening; see the review comments.

Finally, can you also switch to using /* ... */ style comments rather than # ..., matching other assembly files in mlkem-native? Also, could you untabify the source (use spaces instead of tabs exclusively)?

@dannytsen
Copy link
Author

> @dannytsen Thank you again for your work.
>
> I continued reviewing, but quickly found myself stuck without extensive help from an LLM commenting and rewriting the code for me. As it stands, the code is too difficult to understand for @mkannwischer and myself to commit to maintaining it.
>
> The following steps would greatly help to make the code more understandable and maintainable:
>
> * Addition of `#define ...` abbreviations for all the registers in use. Following the raw numeric register identifiers is difficult, as the reader does not (yet) have the mental model that you had when writing the code. This is exacerbated by the fact that there is no way to tell (without knowledge of the instruction signature) whether a numeric value is a register identifier or an immediate. For example, in `NTT_MREDUCE_4X 5, 16, 16, ...`, `5` is a register identifier but `16` is an immediate.
>
> * More extensive comments, and fixing of existing comments. There are a number of comments which give a sense of what happens, but are not precise enough to _exactly_ understand what's happening; see the review comments.
>
> Finally, can you also switch to using /* ... */ style comments rather than # ..., matching other assembly files in mlkem-native? Also, could you untabify the source (use spaces instead of tabs exclusively)?

@hanno-becker Sure. I'll fix as much as I can.

dannytsen and others added 4 commits October 26, 2025 19:35
1. De-tabified.
2. Merged next and step used in macro.
3. Used immediate values for offsets used in macro.
4. More comments explaining the operation and contents.
5. Changed the comment style.

In this commit, numeric register identifiers have not been fixed yet.
Will do that next.

Signed-off-by: Danny Tsen <[email protected]>