ggml-hexagon: flash-attention and reduce-sum optimizations #19141
Conversation
- …ew vectorized implementations
- …x2 functions for improved performance
- …function for improved readability
- …proved performance
```c
sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 4));
sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 8));
sum01 = Q6_Vqf32_vadd_Vqf32Vqf32(sum01, Q6_V_vror_VR(sum01, VLEN / 16));
return sum01;
```
Optimize reduction sum by processing two vectors simultaneously.
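For context, the snippet above is a log2 rotate-and-add tree: each step adds the vector to a copy of itself rotated by half the previous span, so partial sums fan in until a lane holds the full reduction. Below is a minimal scalar model of the single-vector version of that tree (illustrative only: the lane count, the spans, and the `reduce_sum_model` name are assumptions, and the real helper operates on qf32 data and reduces two vectors at once).

```c
#include <stdio.h>

#define NLANES 32  /* a 128-byte HVX vector holds 32 float lanes */

/* Scalar model of a rotate-and-add tree reduction: after log2(NLANES)
 * rotate-by-span-and-add steps, every lane holds the sum of all lanes,
 * so lane 0 can be extracted as the result. */
static float reduce_sum_model(const float *v) {
    float cur[NLANES], rot[NLANES];
    for (int i = 0; i < NLANES; ++i) cur[i] = v[i];
    for (int span = NLANES / 2; span >= 1; span /= 2) {
        for (int i = 0; i < NLANES; ++i) rot[i] = cur[(i + span) % NLANES];
        for (int i = 0; i < NLANES; ++i) cur[i] += rot[i];
    }
    return cur[0];
}

int main(void) {
    float a[NLANES];
    for (int i = 0; i < NLANES; ++i) a[i] = (float)(i + 1);
    printf("%.1f\n", reduce_sum_model(a));  /* 1 + 2 + ... + 32 = 528.0 */
    return 0;
}
```

The x2 variant amortizes one such tree across two accumulators, which is presumably why `sum01` in the snippet packs partial sums for both vectors.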
Overall, good idea and you gave me more ideas to implement/cleanup :).
Interesting. That explains why we need the extra
Yep. QF32 and QF16 have extra bits that are not visible to the SW. Here is the branch where I fixed this issue and also went through and made everything consistently use it: https://github.com/qualcomm/llama.cpp/tree/hexagon-fa-and-reduce-sum

Tested on Gen3, Gen4, Gen5 and X-Elite. I'm seeing a nice bump in perf across the board; not huge, but significant. Please pull/merge/rebase, see how it does on your setup, and I think we're good to merge.
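To make the precision point concrete, here is a small HVX sketch (assumes a Hexagon v68+ toolchain; `accumulate_then_convert` is a hypothetical name, not code from the branch): keep the accumulator in qf32 throughout and convert to IEEE `sf` exactly once, since every qf32-to-sf conversion rounds away the extra bits.

```c
#include <hexagon_types.h>
#include <hexagon_protos.h>

/* Sketch (hypothetical helper): accumulate in qf32, which carries guard
 * bits that IEEE binary32 cannot represent, and round only once at the
 * end via the qf32 -> sf conversion intrinsic. Inputs assumed qf32. */
static inline HVX_Vector accumulate_then_convert(const HVX_Vector *v, int n) {
    HVX_Vector acc = Q6_V_vzero();
    for (int i = 0; i < n; ++i)
        acc = Q6_Vqf32_vadd_Vqf32Vqf32(acc, v[i]);  /* stays in qf32 */
    return Q6_Vsf_equals_Vqf32(acc);                /* single final rounding */
}
```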
```
# Conflicts:
#	ggml/src/ggml-hexagon/htp/hvx-reduce.h
#	ggml/src/ggml-hexagon/htp/matmul-ops.c
```
```cmake
file(TO_CMAKE_PATH "${HEXAGON_TOOLS_ROOT}" HEXAGON_TOOLS_ROOT)
if (NOT IS_DIRECTORY "${HEXAGON_TOOLS_ROOT}")
    message(FATAL_ERROR "Make sure HEXAGON_TOOLS_ROOT points to the correct Hexagon SDK installation.")
endif()
```
Thinking it may be good to derive HEXAGON_TOOLS_ROOT from hexagon_sdk.json in HEXAGON_SDK_ROOT.
Further to the discussion in PR #19025, this implements the dual-row dot product for flash attention.
Key changes

HVX Vector Math Optimizations

- Added `hvx_vec_reduce_sum_qf32x2`, a helper function for efficiently reducing and accumulating two HVX vectors of qf32 values, and refactored several places in the codebase to use it for dual-accumulation scenarios.
- Introduced new "rx2" (dual-accumulation) versions of the dot product functions for both f32-f16 and f16-f16 cases (`hvx_dot_f32_f16_aa_rx2`, `hvx_dot_f16_f16_aa_rx2`), improving performance by processing two accumulations in parallel (see the scalar sketch after this list).
- Refactored the main attention kernel (`flash_attn_ext_f16_thread`) to use the new "rx2" dot product functions when possible, improving block-processing efficiency.
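The rx2 idea, modeled in plain scalar C (an illustrative sketch, not the PR's HVX code; `dot_rx2` and its layout are hypothetical): one pass over the query row feeds two accumulators against two adjacent K rows, so each `q[i]` load is shared and the two multiply-add chains can overlap in the pipeline.

```c
#include <stdio.h>
#include <stddef.h>

/* Scalar model of a dual-accumulation ("rx2") dot product: one sweep over
 * q produces dot(q, k0) and dot(q, k1) together, sharing the q loads. */
static void dot_rx2(const float *q, const float *k0, const float *k1,
                    size_t n, float *out0, float *out1) {
    float acc0 = 0.0f, acc1 = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const float qi = q[i];   /* single load feeds both accumulators */
        acc0 += qi * k0[i];
        acc1 += qi * k1[i];
    }
    *out0 = acc0;
    *out1 = acc1;
}

int main(void) {
    const float q[4]  = {1, 2, 3, 4};
    const float k0[4] = {1, 1, 1, 1};
    const float k1[4] = {2, 0, 2, 0};
    float s0, s1;
    dot_rx2(q, k0, k1, 4, &s0, &s1);
    printf("%.1f %.1f\n", s0, s1);  /* 10.0 8.0 */
    return 0;
}
```

In the vectorized version, the same structure is what lets the two qf32 accumulators be reduced together by `hvx_vec_reduce_sum_qf32x2`.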
Performance

Tested on 8Gen2 with llama3-1b-q4 (benchmark table not recoverable).