
ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions #12154

Open · wants to merge 1 commit into master
Conversation

remyoudompheng (Contributor)

AFAIK the CPU backend does not contain any x86 BMI2 instructions yet.
Is it fine to introduce code using BMI2 instructions?
Is it fine to simply check the __BMI2__ macro, since the "NATIVE" build is now the standard?
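
For context, here is the kind of trick involved (a hedged illustration, not the PR's actual kernel): BMI2's PDEP/PEXT scatter or gather bits through a mask in a single instruction, so a chain of shifts, masks, and table lookups for unpacking 1-bit quant data can collapse into one _pdep_u64. The helper name below is made up for the example.

  // Illustrative sketch only -- not the PR's actual IQ1 kernel.
  // Compile with -mbmi2 (or a -march that implies it).
  #include <immintrin.h>
  #include <cstdint>
  #include <cstdio>

  // Deposit the low 8 bits of `packed` into the least significant bit of each
  // of the 8 bytes of a uint64_t, using a single PDEP instruction.
  static inline uint64_t expand_8x1_to_8x8(uint8_t packed) {
      return _pdep_u64(packed, 0x0101010101010101ULL);
  }

  int main() {
      // Each result byte holds one input bit (0x00 or 0x01), ready for SIMD
      // byte-wise sign selection or scaling.
      std::printf("%016llx\n", (unsigned long long) expand_8x1_to_8x8(0xB1));
      return 0;
  }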

Some numbers on Zen 4 (the new code is roughly 50% faster):

master (gcc 14.2):
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1725 runs -   588.98 us/run - 117.44 MFLOP/run - 199.40 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1242 runs -   821.32 us/run - 117.44 MFLOP/run - 142.99 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    875 runs -  1161.02 us/run - 234.88 MFLOP/run - 202.31 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    630 runs -  1630.03 us/run - 234.88 MFLOP/run - 144.10 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 292929.00 us/run -  60.13 GFLOP/run - 205.27 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    3 runs - 412216.00 us/run -  60.13 GFLOP/run - 145.87 GFLOPS

master (clang 19):
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1725 runs -   585.87 us/run - 117.44 MFLOP/run - 200.45 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1449 runs -   721.34 us/run - 117.44 MFLOP/run - 162.81 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1015 runs -  1013.40 us/run - 234.88 MFLOP/run - 231.78 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    700 runs -  1490.70 us/run - 234.88 MFLOP/run - 157.56 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 267433.50 us/run -  60.13 GFLOP/run - 224.84 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    3 runs - 375016.67 us/run -  60.13 GFLOP/run - 160.34 GFLOPS

This PR (gcc 14.2):   
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   2622 runs -   388.58 us/run - 117.44 MFLOP/run - 302.23 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1932 runs -   532.51 us/run - 117.44 MFLOP/run - 220.54 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1295 runs -   783.70 us/run - 234.88 MFLOP/run - 299.71 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    980 runs -  1057.22 us/run - 234.88 MFLOP/run - 222.17 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    6 runs - 195505.17 us/run -  60.13 GFLOP/run - 307.56 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 271548.50 us/run -  60.13 GFLOP/run - 221.43 GFLOPS

This PR (clang 19):   
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   2070 runs -   490.61 us/run - 117.44 MFLOP/run - 239.38 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1656 runs -   613.57 us/run - 117.44 MFLOP/run - 191.41 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1015 runs -  1009.67 us/run - 234.88 MFLOP/run - 232.63 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    945 runs -  1071.45 us/run - 234.88 MFLOP/run - 219.22 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    5 runs - 247839.00 us/run -  60.13 GFLOP/run - 242.62 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 264696.00 us/run -  60.13 GFLOP/run - 227.16 GFLOPS

Note that some older CPUs (AMD Zen 2 and earlier) support BMI2 but emulate PDEP/PEXT in microcode, resulting in catastrophic slowdowns; owners of such hardware would need to manually disable BMI2 with the compiler flag -mno-bmi2. For example, on such a CPU:

Before (master):
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  966 runs -  1076.29 us/run - 117.44 MFLOP/run - 109.12 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  690 runs -  1596.64 us/run - 117.44 MFLOP/run -  73.55 GFLOPS

After (this PR):
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  138 runs - 11684.07 us/run - 117.44 MFLOP/run -  10.05 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   69 runs - 16669.00 us/run - 117.44 MFLOP/run -   7.05 GFLOPS

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Mar 2, 2025
slaren (Member) commented on Mar 2, 2025:

> Is it fine to simply check the __BMI2__ macro, since the "NATIVE" build is now the standard?

Please also add an option to enable it manually, add a check in cpu-feats-x86.cpp, and add it to the CPU variant list in:

ggml_add_cpu_backend_variant(sandybridge AVX)
ggml_add_cpu_backend_variant(haswell AVX F16C AVX2 FMA)
ggml_add_cpu_backend_variant(skylakex AVX F16C AVX2 FMA AVX512)
ggml_add_cpu_backend_variant(icelake AVX F16C AVX2 FMA AVX512 AVX512_VBMI AVX512_VNNI)
ggml_add_cpu_backend_variant(alderlake AVX F16C AVX2 FMA AVX_VNNI)

You could also check for Zen 2 in cpu-feats-x86.cpp, and if necessary add a variant for Zen 2 that excludes this feature.
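
For illustration, a hedged standalone sketch of the CPUID queries such a check needs (the function names below are hypothetical; the real cpu-feats-x86.cpp has its own helpers and naming): BMI2 support is reported in CPUID leaf 7, subleaf 0, EBX bit 8, and the AMD CPUs that microcode PDEP/PEXT are family 0x17 (Zen through Zen 2) and older, while family 0x19 (Zen 3) and later execute them natively.

  // Hedged sketch only -- not code from cpu-feats-x86.cpp. GCC/Clang <cpuid.h>.
  #include <cpuid.h>
  #include <cstring>

  // BMI2 support: CPUID leaf 7, subleaf 0, EBX bit 8.
  static bool cpu_has_bmi2() {
      unsigned eax, ebx, ecx, edx;
      if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
      return (ebx >> 8) & 1;
  }

  // "Fast" BMI2: exclude AMD families before Zen 3, which microcode PDEP/PEXT.
  static bool cpu_has_fast_bmi2() {
      if (!cpu_has_bmi2()) return false;
      unsigned eax, ebx, ecx, edx;
      char vendor[13] = {0};
      __get_cpuid(0, &eax, &ebx, &ecx, &edx);
      std::memcpy(vendor + 0, &ebx, 4);   // vendor string order: EBX, EDX, ECX
      std::memcpy(vendor + 4, &edx, 4);
      std::memcpy(vendor + 8, &ecx, 4);
      if (std::strcmp(vendor, "AuthenticAMD") != 0) return true;
      __get_cpuid(1, &eax, &ebx, &ecx, &edx);
      unsigned family = (eax >> 8) & 0xF;
      if (family == 0xF) family += (eax >> 20) & 0xFF;   // extended family
      return family >= 0x19;              // Zen 3 (0x19) and newer
  }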

JohnLoveJoy commented:

https://github.com/zwegner/zp7

Integrating something like ZP7 (Zach's Peppy Parallel-Prefix-Popcountin' PEXT/PDEP Polyfill) into llama.cpp could be a smart way to address the performance issues with PDEP and PEXT on AMD Zen 2 and earlier CPUs while maintaining compatibility and efficiency across platforms. Just a polite suggestion; a sketch of the general shape of such a polyfill follows below.
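
To make the suggestion concrete, here is a hedged sketch of the dispatch shape a polyfill enables. It is not ZP7 itself (ZP7 uses a much faster parallel-prefix algorithm), and pdep64/pdep_u64_soft are hypothetical names; the bit-by-bit fallback only illustrates the PDEP semantics and the wrapper interface.

  // Hedged sketch -- not ZP7 itself. Shows PDEP semantics and a wrapper that
  // hides the hardware/software choice from the kernels.
  #include <cstdint>
  #if defined(__BMI2__)
  #include <immintrin.h>
  #endif

  // Portable bit-by-bit PDEP: the k-th low bit of src lands at the position
  // of the k-th set bit of mask. ZP7 computes the same result much faster.
  static inline uint64_t pdep_u64_soft(uint64_t src, uint64_t mask) {
      uint64_t out = 0;
      for (uint64_t m = mask; m != 0; m &= m - 1) {  // walk set bits of mask
          if (src & 1) out |= m & (~m + 1);          // isolate lowest set bit
          src >>= 1;
      }
      return out;
  }

  static inline uint64_t pdep64(uint64_t src, uint64_t mask) {
  #if defined(__BMI2__)
      return _pdep_u64(src, mask);      // 1 instruction on Haswell+ / Zen 3+
  #else
      return pdep_u64_soft(src, mask);  // portable fallback
  #endif
  }

A real integration would also route BMI2-capable but slow-PDEP CPUs (Zen 2 and older) to the software path at runtime, rather than relying on the compile-time __BMI2__ macro alone.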
