Conversation

@chraac chraac commented Jan 27, 2026

Following up on the discussion in PR #19025, this implements the dual-row dot product for flash attention.

Key changes

HVX Vector Math Optimizations

  • Added hvx_vec_reduce_sum_qf32x2, a helper that reduces and accumulates two HVX vectors of qf32 values in a single pass, and refactored the dual-accumulation call sites in the codebase to use it.

  • Introduced new "rx2" (dual-accumulation) variants of the dot product functions for the f32-f16 and f16-f16 cases (hvx_dot_f32_f16_aa_rx2, hvx_dot_f16_f16_aa_rx2), which improve performance by processing two accumulations in parallel; a scalar sketch of the idea follows this list.

  • Refactored the main attention kernel (flash_attn_ext_f16_thread) to use the new "rx2" dot product functions when possible, improving block processing efficiency.

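For intuition, here is a plain-C scalar sketch of the dual-accumulation idea (hypothetical names, not the actual HVX implementation): one pass over the query feeds two independent accumulators, so each query element is loaded once per pair of KV rows instead of once per row.

#include <stddef.h>

// Hypothetical scalar sketch of an "rx2" dot product (illustrative names only).
static void dot_rx2_sketch(const float * q, const float * row0, const float * row1,
                           size_t n, float * out0, float * out1) {
    float sum0 = 0.0f;
    float sum1 = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float qi = q[i];   // shared load of the query element
        sum0 += qi * row0[i];    // accumulation for row 0
        sum1 += qi * row1[i];    // accumulation for row 1
    }
    *out0 = sum0;
    *out1 = sum1;
}

The HVX rx2 kernels do roughly the same with vector accumulators, which the x2 reduce-sum helper then collapses at the end.
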
Performance

  • Device: 8Gen2
  • Baseline: 0c21677e4
  • Optimization: 2610805c4
  • Model: llama3-1b-q4
stage         Baseline (tok/s)   Optimization (tok/s)   Speedup
prompt eval   61.78              72.44                  1.18x
eval          28.74              29.40                  1.02x

@chraac chraac marked this pull request as draft January 27, 2026 16:12
// Rotate-and-add tree reduction; offsets are in bytes (VLEN = HVX vector size).
sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 4));   // fold lanes VLEN/4 bytes apart
sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 8));   // fold lanes VLEN/8 bytes apart
sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 16));  // stop at a 2-float stride, keeping the two accumulations in separate lanes
return sum01;
Review comment from @chraac (author):

Optimize reduction sum by processing two vectors simultaneously.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jan 27, 2026
@max-krasnyansky
Collaborator

Overall, good idea and you gave me more ideas to implement/cleanup :).
The reduce_sum_qf32x2 is not correct for V75 and newer (qf32 can't be shifted/rotated, etc).
The good news is that we can just replace all instances with f32 version and optimize that version for V75 and up.
I have a branch with the changes. Will share tomorrow (falling asleep right now) so that you can pull it in.

chraac commented Jan 30, 2026

The reduce_sum_qf32x2 is not correct for V75 and newer (qf32 can't be shifted/rotated, etc).
The good news is that we can just replace all instances with f32 version and optimize that version for V75 and up.

Interesting. That explains why we need the extra Q6_Vsf_equals_Vqf32 calls in hvx_vec_reduce_sum_n_qf32. Also curious if there are hidden bits involved in the HVX_Vector fp ops.

max-krasnyansky commented Jan 31, 2026

The reduce_sum_qf32x2 is not correct for V75 and newer (qf32 can't be shifted/rotated, etc).
The good news is that we can just replace all instances with f32 version and optimize that version for V75 and up.

Interesting. That explains why we need the extra Q6_Vsf_equals_Vqf32 calls in hvx_vec_reduce_sum_n_qf32. Also curious if there are hidden bits involved in the HVX_Vector fp ops.

Yep. QF32 and QF16 have extra bits that are not visible to the SW.

Here is the branch where I fixed this issue and also went through and made everything consistently use reduce_sum_f32.
vec_dot_mxfp4_rx2 was giving the compiler a hard time with the reduce_sum_x2. The only way I could fix it was to keep the row_sum in f32. That turned out to be better for all the other cases too, so I updated all of them. The qf32 -> sf conversion can generally be done in the same instruction packet, so it's essentially free.
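For reference, a minimal sketch of that shape (not the code in the branch; it assumes the v68+ intrinsics Q6_Vqf32_vadd_VsfVsf and Q6_Vsf_equals_Vqf32, and a 128-byte VLEN): the running value stays in IEEE sf, the byte rotate only ever touches a plain sf bit pattern, and qf32 is used transiently for the add.

// Sketch only: assumes the Hexagon SDK HVX headers and VLEN = 128 bytes.
#include <hexagon_types.h>
#include <hvx_hexagon_protos.h>

#ifndef VLEN
#define VLEN 128  /* HVX vector length in bytes */
#endif

static inline HVX_Vector hvx_reduce_sum_f32_sketch(HVX_Vector sum_sf) {
    for (int off = VLEN / 2; off >= 4; off >>= 1) {          // byte offsets: 64, 32, 16, 8, 4
        HVX_Vector rot = Q6_V_vror_VR(sum_sf, off);          // rotate sf lanes (bit pattern only)
        HVX_Vector qf  = Q6_Vqf32_vadd_VsfVsf(sum_sf, rot);  // sf + sf -> qf32
        sum_sf = Q6_Vsf_equals_Vqf32(qf);                    // qf32 -> sf, can pack with the add
    }
    return sum_sf;                                           // every lane holds the total
}
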

https://github.com/qualcomm/llama.cpp/tree/hexagon-fa-and-reduce-sum

Tested on Gen3,4,5 and X-Elite.

I'm seeing a nice bump in perf across the board. Not huge but significant.
+5-10 T/S in some cases.
FA performance is steadily catching up to the CPU. Hopefully a few more iterations and we can enable it by default.

Please pull/merge/rebase, see how it does on your setup and I think we're good to merge.

file(TO_CMAKE_PATH "${HEXAGON_TOOLS_ROOT}" HEXAGON_TOOLS_ROOT)
if (NOT IS_DIRECTORY "${HEXAGON_TOOLS_ROOT}")
message(FATAL_ERROR "Make sure HEXAGON_TOOLS_ROOT point to the correct Hexagon SDK installation.")
endif()
Review comment from @chraac (author):

It may be worth deriving HEXAGON_TOOLS_ROOT from the hexagon_sdk.json in HEXAGON_SDK_ROOT rather than requiring it to be set separately.

@chraac chraac marked this pull request as ready for review January 31, 2026 05:04
@chraac chraac changed the title [WIP]ggml-hexagon: flash-attn opt - part2 ggml-hexagon: flash-attn opt - part2 Jan 31, 2026
@max-krasnyansky max-krasnyansky changed the title ggml-hexagon: flash-attn opt - part2 ggml-hexagon: flash-attention and reduce-sum optimizations Jan 31, 2026
@max-krasnyansky max-krasnyansky merged commit 89f10ba into ggml-org:master Jan 31, 2026
74 of 75 checks passed
@chraac chraac deleted the dev-fa-opt-part2 branch January 31, 2026 05:50