Conversation

@jiangshhh jiangshhh commented Jan 29, 2026

Proposal

This proposal introduces an ARM SVE-optimized implementation of ggml_vec_dot_mxfp4_q8_0 for the ggml/llama.cpp CPU backend.
The current implementation relies on scalar or NEON-based code paths, which do not fully utilize the wide vector capabilities of modern ARM CPUs equipped with the Scalable Vector Extension (SVE). By leveraging SVE intrinsics, this proposal aims to:

  1. Improve utilization of vector registers on SVE-capable platforms, independent of fixed vector widths
  2. Maintain numerical equivalence with the existing reference implementation
  3. Ensure portability across different SVE vector lengths (a sketch of this pattern follows below)
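
To make the approach concrete, below is a minimal, vector-length-agnostic sketch of what such a kernel can look like. This is not the PR's code: the block layouts, the kvalues_mxfp4 lookup table, and the function name are assumptions mirroring ggml's reference definitions, and a production kernel would likely process several blocks per iteration to fill wider registers.

```c
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#include <math.h>
#include <stdint.h>

#define QK_MXFP4 32  // values per MXFP4 block (ggml convention)

// Assumed block layouts, mirroring ggml's definitions.
// __fp16 is available on AArch64 targets.
typedef struct { uint8_t e; uint8_t qs[QK_MXFP4/2]; } block_mxfp4; // E8M0 scale + 32 packed nibbles
typedef struct { __fp16  d; int8_t  qs[QK_MXFP4];   } block_q8_0;  // fp16 scale + 32 int8 values

// E2M1 code points, doubled so they stay integral (ggml's kvalues_mxfp4 table).
static const int8_t kvalues_mxfp4[16] = {
    0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12,
};

// Hypothetical standalone kernel; n must be a multiple of QK_MXFP4.
static float vec_dot_mxfp4_q8_0_sve_sketch(int n, const block_mxfp4 * x, const block_q8_0 * y) {
    const int nb = n / QK_MXFP4;

    // Replicate the 16-entry table into every 128-bit quadword so TBL
    // resolves nibble indices 0..15 at any vector length.
    const svint8_t tbl = svld1rq_s8(svptrue_b8(), kvalues_mxfp4);

    // Predicate over the 16 packed bytes of one block. SVE guarantees at
    // least 128-bit vectors, and inactive lanes load as zero, so the
    // unpredicated SDOT below stays correct at any vector length.
    const svbool_t pg = svwhilelt_b8_s32(0, QK_MXFP4/2);

    float sumf = 0.0f;
    for (int ib = 0; ib < nb; ++ib) {
        const svuint8_t packed = svld1_u8(pg, x[ib].qs);
        const svuint8_t idx_lo = svand_n_u8_x(pg, packed, 0x0f); // low nibbles
        const svuint8_t idx_hi = svlsr_n_u8_x(pg, packed, 4);    // high nibbles

        // Expand nibbles to int8 via table lookup, then 4-way int8 dot products.
        svint32_t acc = svdup_n_s32(0);
        acc = svdot_s32(acc, svtbl_s8(tbl, idx_lo), svld1_s8(pg, y[ib].qs));
        acc = svdot_s32(acc, svtbl_s8(tbl, idx_hi), svld1_s8(pg, y[ib].qs + QK_MXFP4/2));

        // E8M0 scale halved (2^(e-128)) to undo the doubled table entries.
        const float d = (float) y[ib].d * ldexpf(1.0f, (int) x[ib].e - 128);
        sumf += d * (float) svaddv_s32(svptrue_b32(), acc);
    }
    return sumf;
}
#endif // __ARM_FEATURE_SVE
```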

Verification

The proposed SVE implementation was verified with the following considerations:

  1. Functional Correctness
    Accumulation logic and scaling factors follow the original ggml_vec_dot_mxfp4_q8_0 definition.
  2. Architectural Safety
    The implementation uses SVE intrinsics only, without assuming a fixed vector length.
    The SVE path is guarded by __ARM_FEATURE_SVE to ensure it is executed only on supported hardware.
  3. Fallback Compatibility
    Non-SVE platforms continue to use the existing scalar or NEON implementations without modification (see the guard/fallback sketch after this list).
    The change does not affect other quantization paths.
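
For illustration, here is how a generic int8 dot product can be structured along these lines. This is a minimal sketch with a hypothetical function name, not the PR's diff; it shows the __ARM_FEATURE_SVE guard, the length-agnostic predicated loop, and the untouched scalar fallback.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Illustrative int8 dot product; not part of the PR.
static int64_t dot_s8(const int8_t * a, const int8_t * b, size_t n) {
#if defined(__ARM_FEATURE_SVE)
    // Length-agnostic SVE path: svwhilelt builds the loop predicate from
    // the remaining element count, so no fixed vector width is assumed.
    // (For very large n, the s32 accumulator would need periodic widening.)
    svint32_t acc = svdup_n_s32(0);
    for (size_t i = 0; i < n; i += svcntb()) {
        const svbool_t pg = svwhilelt_b8_u64(i, n);
        // Inactive lanes load as zero, so the unpredicated SDOT is safe.
        acc = svdot_s32(acc, svld1_s8(pg, a + i), svld1_s8(pg, b + i));
    }
    return svaddv_s32(svptrue_b32(), acc);
#else
    // Scalar fallback: non-SVE platforms keep their existing path.
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += (int64_t) a[i] * b[i];
    }
    return acc;
#endif
}
```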

Performance check

Performance was measured on a Fujitsu FX700 (A64FX, 512-bit SVE) and improved as follows; values are tokens per second.

| Batch size | Original (NEON) | This PR (SVE) | Ratio |
|-----------:|----------------:|--------------:|------:|
| 1          | 3.66            | 8.60          | 2.35  |
| 2          | 3.73            | 9.04          | 2.42  |
| 4          | 3.76            | 9.25          | 2.46  |
| 8          | 3.75            | 9.08          | 2.42  |

The command used to measure performance was:
`llama-batched --model ${PATH_TO_MODEL} --prompt 'AI is going to' --parallel 8 --predict 128 --seed 0 --threads 48`

@jiangshhh jiangshhh requested a review from ggerganov as a code owner January 29, 2026 06:48
@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) Jan 29, 2026
@jiangshhh jiangshhh changed the title from "ggml: pptimize ggml_vec_dot_mxfp4_q8_0 dot product on ARM SVE" to "ggml: optimize ggml_vec_dot_mxfp4_q8_0 dot product on ARM SVE" Jan 29, 2026
@jiangshhh (Author)

@ggerganov @slaren
Hi,

This PR introduces an ARM SVE optimization for ggml_vec_dot_mxfp4_q8_0, and I have verified correctness and performance on an SVE-capable platform.

This is my first PR to llama.cpp, so I would like to check whether there are any additional steps I should follow for the review. Please let me know if I need to do anything to start the review/approval process.

Thank you very much for your time and for maintaining this project.

@taronaeo (Collaborator)

@Alcpz By any chance do you have ARM SVE hardware to test and review this? :)

@Alcpz (Contributor) commented Jan 29, 2026

> @Alcpz By any chance do you have ARM SVE hardware to test and review this? :)

Unfortunately no; I would be happy to help otherwise.
