ggml: optimize ggml_vec_dot_mxfp4_q8_0 dot product on ARM SVE #19171

jiangshhh · 2026-01-29T06:48:18Z

Proposal

This proposal introduces an ARM SVE-optimized implementation of ggml_vec_dot_mxfp4_q8_0 for the ggml/llama.cpp CPU backend.
The current implementation relies on scalar or NEON-based code paths, which do not fully utilize the wide vector registers available on modern ARM CPUs that implement the Scalable Vector Extension (SVE). By using SVE intrinsics, this proposal aims to:

Improve utilization of vector registers on SVE-capable platforms, independent of fixed vector widths
Maintain numerical equivalence with the existing reference implementation
Ensure portability across different SVE vector lengths
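
For readers unfamiliar with the kernel, the computation that any optimized path has to reproduce can be sketched in portable C. This is a simplified illustration, not the code from this PR: the struct and function names are hypothetical, both block scales are stored as plain floats (an assumption made to keep the sketch self-contained, so the real scale encodings are not modeled), and the nibble layout and doubled lookup table reflect my reading of the generic ggml path.

```c
#include <stdint.h>

#define SKETCH_QK 32  /* elements per block; assumed equal for both formats */

/* 4-bit code -> integer value. The values are assumed to be stored doubled,
   with the matching factor of 1/2 folded into the block scale. */
static const int8_t sketch_kvalues_fp4[16] = {
    0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12
};

/* Hypothetical simplified blocks: scales are plain floats here. */
typedef struct { float scale; uint8_t qs[SKETCH_QK / 2]; } sketch_mxfp4_block;
typedef struct { float scale; int8_t  qs[SKETCH_QK];     } sketch_q8_block;

/* Dot product of n elements (n is assumed to be a multiple of SKETCH_QK). */
static float sketch_vec_dot(int n, const sketch_mxfp4_block *x,
                            const sketch_q8_block *y) {
    float sumf = 0.0f;
    for (int ib = 0; ib < n / SKETCH_QK; ++ib) {
        int sumi = 0;
        for (int j = 0; j < SKETCH_QK / 2; ++j) {
            /* low nibble pairs with element j, high nibble with j + 16 */
            sumi += y[ib].qs[j] * sketch_kvalues_fp4[x[ib].qs[j] & 0x0F];
            sumi += y[ib].qs[j + SKETCH_QK / 2] * sketch_kvalues_fp4[x[ib].qs[j] >> 4];
        }
        /* integer accumulation per block, then one scaling step */
        sumf += x[ib].scale * y[ib].scale * (float) sumi;
    }
    return sumf;
}
```

Accumulating in integers within a block and applying the two scales once per block is what makes it practical to compare a vectorized path against the reference on the same inputs.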

Verifying Features

The proposed SVE implementation was verified with the following considerations:

Functional Correctness
Accumulation logic and scaling factors follow the original ggml_vec_dot_mxfp4_q8_0 definition.
Architectural Safety
The implementation uses SVE intrinsics only, without assuming a fixed vector length.
The SVE path is guarded by __ARM_FEATURE_SVE, so it is compiled only when the build targets SVE-capable hardware.
Fallback Compatibility
Non-SVE platforms continue to use the existing scalar or NEON implementations without modification.
The change does not affect other quantization paths.
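
The guard and the vector-length-agnostic style described above can be illustrated with a much simpler kernel. The sketch below is a plain float dot product, not the mxfp4/q8_0 kernel from this PR, and the function names are hypothetical. It assumes the usual ACLE behavior: __ARM_FEATURE_SVE is a compile-time macro, and the loop asks the hardware for its lane count instead of hard-coding a vector width.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: reports which path this translation unit compiled in. */
static const char *sketch_dot_path(void) {
#if defined(__ARM_FEATURE_SVE)
    return "sve";
#else
    return "fallback";
#endif
}

/* Float dot product with an SVE path and a scalar fallback. */
static float sketch_dot_f32(const float *a, const float *b, int32_t n) {
#if defined(__ARM_FEATURE_SVE)
    svfloat32_t acc = svdup_n_f32(0.0f);
    /* svcntw() returns the number of 32-bit lanes per vector at run time,
       so the same binary works for any SVE vector length. */
    for (int32_t i = 0; i < n; i += (int32_t) svcntw()) {
        svbool_t pg = svwhilelt_b32_s32(i, n);  /* only lanes i..n-1 are active */
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        acc = svmla_f32_m(pg, acc, va, vb);     /* inactive lanes keep acc */
    }
    return svaddv_f32(svptrue_b32(), acc);      /* horizontal sum of all lanes */
#else
    float sum = 0.0f;
    for (int32_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
#endif
}
```

The predicate from svwhilelt handles the tail, so no separate remainder loop is needed. On a build without SVE, only the fallback branch is compiled and the behavior of other targets is unchanged.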

Performance check

Performance was measured on an FX700 system.
Throughput improved as shown below. All values are in tokens per second.

Batch size	Original (NEON)	This PR (SVE)	Ratio (SVE / NEON)
1	3.66	8.60	2.35
2	3.73	9.04	2.42
4	3.76	9.25	2.46
8	3.75	9.08	2.42

The following command was used to measure performance:
llama-batched --model ${PATH_TO_MODEL} --prompt 'AI is going to' --parallel 8 --predict 128 --seed 0 --threads 48

jiangshhh · 2026-01-29T06:59:05Z

@ggerganov @slaren
Hi,

This PR introduces an ARM SVE optimization for ggml_vec_dot_mxfp4_q8_0. I have verified its correctness and performance on an SVE-capable platform.

This is my first PR to llama.cpp, so I would like to check if there are any additional steps that I should follow for the review.
Please let me know if I need to do something to start the review/approval process.

Thank you very much for your time and for maintaining this project.

taronaeo · 2026-01-29T12:31:24Z

@Alcpz By any chance do you have ARM SVE hardware to test and review this? :)

Alcpz · 2026-01-29T12:37:07Z

> @Alcpz By any chance do you have ARM SVE hardware to test and review this? :)

Unfortunately no, I would be happy to help otherwise.

Optimize ggml_vec_dot_mxfp4_q8_0 dot product on ARM SVE

18ad28c

jiangshhh requested a review from ggerganov as a code owner January 29, 2026 06:48

github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Jan 29, 2026

jiangshhh changed the title ~~ggml: pptimize ggml_vec_dot_mxfp4_q8_0 dot product on ARM SVE~~ ggml: optimize ggml_vec_dot_mxfp4_q8_0 dot product on ARM SVE Jan 29, 2026

3 participants