
ggml-cpu: Faster IQ1 mul_mat_vec on AVX2 using BMI2 instructions #12154

Open · wants to merge 1 commit into master
Conversation

remyoudompheng (Contributor)

AFAIK the CPU backend does not contain any x86 BMI2 instructions yet.
Is it fine to introduce code using BMI2 instructions?
Is it fine to simply check the __BMI2__ macro, since the "NATIVE" build is now the standard?
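
For context, here is the kind of trick involved (a hedged illustration, not the PR's actual kernel): BMI2's PDEP/PEXT scatter or gather bits through a mask in a single instruction, so a chain of shifts, masks, and table lookups for unpacking 1-bit quant data can collapse into one _pdep_u64. The helper name below is made up for the example.

  // Illustrative sketch only -- not the PR's actual IQ1 kernel.
  // Compile with -mbmi2 (or a -march that implies it).
  #include <immintrin.h>
  #include <cstdint>
  #include <cstdio>

  // Deposit the low 8 bits of `packed` into the least significant bit of each
  // of the 8 bytes of a uint64_t, using a single PDEP instruction.
  static inline uint64_t expand_8x1_to_8x8(uint8_t packed) {
      return _pdep_u64(packed, 0x0101010101010101ULL);
  }

  int main() {
      // Each result byte holds one input bit (0x00 or 0x01), ready for SIMD
      // byte-wise sign selection or scaling.
      std::printf("%016llx\n", (unsigned long long) expand_8x1_to_8x8(0xB1));
      return 0;
  }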

Some numbers on Zen 4 (the new code is roughly 50% faster):

master (gcc 14.2):
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1725 runs -   588.98 us/run - 117.44 MFLOP/run - 199.40 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1242 runs -   821.32 us/run - 117.44 MFLOP/run - 142.99 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    875 runs -  1161.02 us/run - 234.88 MFLOP/run - 202.31 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    630 runs -  1630.03 us/run - 234.88 MFLOP/run - 144.10 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 292929.00 us/run -  60.13 GFLOP/run - 205.27 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    3 runs - 412216.00 us/run -  60.13 GFLOP/run - 145.87 GFLOPS

master (clang 19):
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1725 runs -   585.87 us/run - 117.44 MFLOP/run - 200.45 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1449 runs -   721.34 us/run - 117.44 MFLOP/run - 162.81 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1015 runs -  1013.40 us/run - 234.88 MFLOP/run - 231.78 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    700 runs -  1490.70 us/run - 234.88 MFLOP/run - 157.56 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 267433.50 us/run -  60.13 GFLOP/run - 224.84 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    3 runs - 375016.67 us/run -  60.13 GFLOP/run - 160.34 GFLOPS

This PR (gcc 14.2):   
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   2622 runs -   388.58 us/run - 117.44 MFLOP/run - 302.23 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1932 runs -   532.51 us/run - 117.44 MFLOP/run - 220.54 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1295 runs -   783.70 us/run - 234.88 MFLOP/run - 299.71 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    980 runs -  1057.22 us/run - 234.88 MFLOP/run - 222.17 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    6 runs - 195505.17 us/run -  60.13 GFLOP/run - 307.56 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 271548.50 us/run -  60.13 GFLOP/run - 221.43 GFLOPS

This PR (clang 19):   
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   2070 runs -   490.61 us/run - 117.44 MFLOP/run - 239.38 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1656 runs -   613.57 us/run - 117.44 MFLOP/run - 191.41 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   1015 runs -  1009.67 us/run - 234.88 MFLOP/run - 232.63 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=2,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    945 runs -  1071.45 us/run - 234.88 MFLOP/run - 219.22 GFLOPS
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    5 runs - 247839.00 us/run -  60.13 GFLOP/run - 242.62 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):    4 runs - 264696.00 us/run -  60.13 GFLOP/run - 227.16 GFLOPS

Note that some older CPUs (AMD Zen 2 and earlier) support BMI2 but emulate PDEP/PEXT in microcode, resulting in catastrophic slowdowns; owners of such hardware would need to manually disable BMI2 with the compiler flag -mno-bmi2. For example, on such a CPU:

Before (master):
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  966 runs -  1076.29 us/run - 117.44 MFLOP/run - 109.12 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  690 runs -  1596.64 us/run - 117.44 MFLOP/run -  73.55 GFLOPS

After (this PR):
  MUL_MAT(type_a=iq1_s,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):  138 runs - 11684.07 us/run - 117.44 MFLOP/run -  10.05 GFLOPS
  MUL_MAT(type_a=iq1_m,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):   69 runs - 16669.00 us/run - 117.44 MFLOP/run -   7.05 GFLOPS

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Mar 2, 2025
slaren (Member) commented on Mar 2, 2025:

> Is it fine to simply check the __BMI2__ macro, since the "NATIVE" build is now the standard?

Please also add an option to enable it manually, add a check in cpu-feats-x86.cpp, and add it to the CPU variant list in:

ggml_add_cpu_backend_variant(sandybridge AVX)
ggml_add_cpu_backend_variant(haswell AVX F16C AVX2 FMA)
ggml_add_cpu_backend_variant(skylakex AVX F16C AVX2 FMA AVX512)
ggml_add_cpu_backend_variant(icelake AVX F16C AVX2 FMA AVX512 AVX512_VBMI AVX512_VNNI)
ggml_add_cpu_backend_variant(alderlake AVX F16C AVX2 FMA AVX_VNNI)

You could also check for Zen 2 in cpu-feats-x86.cpp, and if necessary add a variant for Zen 2 that excludes this feature.
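
For illustration, a hedged standalone sketch of the CPUID queries such a check needs (the function names below are hypothetical; the real cpu-feats-x86.cpp has its own helpers and naming): BMI2 support is reported in CPUID leaf 7, subleaf 0, EBX bit 8, and the AMD CPUs that microcode PDEP/PEXT are family 0x17 (Zen through Zen 2) and older, while family 0x19 (Zen 3) and later execute them natively.

  // Hedged sketch only -- not code from cpu-feats-x86.cpp. GCC/Clang <cpuid.h>.
  #include <cpuid.h>
  #include <cstring>

  // BMI2 support: CPUID leaf 7, subleaf 0, EBX bit 8.
  static bool cpu_has_bmi2() {
      unsigned eax, ebx, ecx, edx;
      if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
      return (ebx >> 8) & 1;
  }

  // "Fast" BMI2: exclude AMD families before Zen 3, which microcode PDEP/PEXT.
  static bool cpu_has_fast_bmi2() {
      if (!cpu_has_bmi2()) return false;
      unsigned eax, ebx, ecx, edx;
      char vendor[13] = {0};
      __get_cpuid(0, &eax, &ebx, &ecx, &edx);
      std::memcpy(vendor + 0, &ebx, 4);   // vendor string order: EBX, EDX, ECX
      std::memcpy(vendor + 4, &edx, 4);
      std::memcpy(vendor + 8, &ecx, 4);
      if (std::strcmp(vendor, "AuthenticAMD") != 0) return true;
      __get_cpuid(1, &eax, &ebx, &ecx, &edx);
      unsigned family = (eax >> 8) & 0xF;
      if (family == 0xF) family += (eax >> 20) & 0xFF;   // extended family
      return family >= 0x19;              // Zen 3 (0x19) and newer
  }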

JohnLoveJoy commented:

https://github.com/zwegner/zp7

Integrating something like ZP7 (Zach's Peppy Parallel-Prefix-Popcountin' PEXT/PDEP Polyfill) into llama.cpp could be a smart way to address the performance issues with PDEP and PEXT on AMD Zen 2 and earlier CPUs while maintaining compatibility and efficiency across platforms. Just a polite suggestion; a sketch of the general shape of such a polyfill follows below.
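
To make the suggestion concrete, here is a hedged sketch of the dispatch shape a polyfill enables. It is not ZP7 itself (ZP7 uses a much faster parallel-prefix algorithm), and pdep64/pdep_u64_soft are hypothetical names; the bit-by-bit fallback only illustrates the PDEP semantics and the wrapper interface.

  // Hedged sketch -- not ZP7 itself. Shows PDEP semantics and a wrapper that
  // hides the hardware/software choice from the kernels.
  #include <cstdint>
  #if defined(__BMI2__)
  #include <immintrin.h>
  #endif

  // Portable bit-by-bit PDEP: the k-th low bit of src lands at the position
  // of the k-th set bit of mask. ZP7 computes the same result much faster.
  static inline uint64_t pdep_u64_soft(uint64_t src, uint64_t mask) {
      uint64_t out = 0;
      for (uint64_t m = mask; m != 0; m &= m - 1) {  // walk set bits of mask
          if (src & 1) out |= m & (~m + 1);          // isolate lowest set bit
          src >>= 1;
      }
      return out;
  }

  static inline uint64_t pdep64(uint64_t src, uint64_t mask) {
  #if defined(__BMI2__)
      return _pdep_u64(src, mask);      // 1 instruction on Haswell+ / Zen 3+
  #else
      return pdep_u64_soft(src, mask);  // portable fallback
  #endif
  }

A real integration would also route BMI2-capable but slow-PDEP CPUs (Zen 2 and older) to the software path at runtime, rather than relying on the compile-time __BMI2__ macro alone.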
