
PR: Refine ggml-hexagon backend (Qualcomm Hexagon NPU backend) for latest ggml, whisper.cpp, llama.cpp #12326


Open: wants to merge 146 commits into master from pr_to_upstream

Conversation

zhouwg (Contributor) commented Mar 11, 2025

  • I have read the contributing guidelines
  • Self-reported review complexity:
    * [ ] Low
    * [x] Medium (complexity of the code on the ARM-AP side is medium; complexity of the code on the cDSP side (hexagon-kernels) is high)
    * [ ] High
  • Testing Done
    * [x] test-backend-ops and llama-cli through HWACCEL_QNN on Android phones equipped with Qualcomm Snapdragon 8 Gen3 & 8 Elite
    * [x] test-backend-ops and llama-cli through HWACCEL_CDSP on Android phones equipped with Qualcomm Snapdragon 8 Gen3 & 8 Elite
    * [x] the major features of the ggml backend subsystem through HWACCEL_CDSP (the main approach in this PR) have been verified on Android phones equipped with Qualcomm Snapdragon 8 Gen3 & 8 Elite

PR Description

This PR is a continuation of my original PR #6869 from April 2024, focused on the final mission:

  • how to maximally utilize the Qualcomm Hexagon NPU with the well-designed and highly compact ggml machine learning framework.

The full description (and TL;DR) of this PR can be found at my forked llama.cpp project: zhouwg#30.

The high-level data path (the high-level architecture) of ggml-hexagon can be found at my forked llama.cpp project: high-level data path of ggml-hexagon.

Features

  • Provide a concise reference implementation of HWACCEL_QNN in this PR: offload ggml ops to QNN.

  • Provide a very fast approach (HWACCEL_CDSP), similar in spirit to Intel's ggml-sycl or Qualcomm's ggml-opencl, in this PR: offload some performance-sensitive ggml ops directly to the Hexagon cDSP.

  • Provide a computation visualization approach in this PR so that other developers and AI experts can easily compare Hexagon NPU performance between the HWACCEL_QNN and HWACCEL_CDSP approaches.

  • Provide dynamic runtime parameter adjustment through ggml-hexagon.cfg (this idea comes from @ngxson in his draft AI-dedicated PR; more parameters can be added to this configuration file).

  • Probe/detect the Snapdragon SoC information at runtime; accordingly, the code might/should also run well on the following Qualcomm DSP versions:
    #v68 --- Snapdragon 888
    #v69 --- Snapdragon 8 Gen1
    #v73 --- Snapdragon 8 Gen2
    #v75 --- Snapdragon 8 Gen3 (verified)
    #v79 --- Snapdragon 8 Elite (aka 8 Gen4) (verified)

  • Provide a customized tiny ggml-dsp, ported from the original ggml, that runs well on the Hexagon cDSP side. This feature will be very helpful for domain experts or AI experts, who can pursue any AI innovation directly on the cDSP side with Qualcomm's amazing lightweight, low-level Hexagon SDK (C/C++ and HVX assembly, with direct hardware access) rather than learning Qualcomm's heavyweight, high-level QNN SDK API on the ARM-AP side.

  • Provide the big picture of the ggml-hexagon backend in this PR for further related development activity in this great pure-tech community.
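As an illustration of the configurable-parameter idea above, a ggml-hexagon.cfg might look like the sketch below. The key names here are invented for illustration and are not necessarily the ones shipped in this PR:

```ini
; hypothetical sketch of ggml-hexagon.cfg (key names invented for illustration)
[general]
hwaccel_approach = 2   ; 0 = HWACCEL_QNN, 1 = HWACCEL_QNN_SINGLEGRAPH, 2 = HWACCEL_CDSP
enable_perf      = 1   ; collect per-op performance data

[cdsp]
thread_counts    = 4   ; worker threads on the Hexagon cDSP side
```

The point of such a file is that behavior can be tuned on-device by editing a plain-text file, without rebuilding the backend.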

How to build ggml-hexagon source code for Android and verify the ggml-hexagon backend on a Snapdragon-based phone

Ubuntu 20.04/22.04 is validated and recommended as the host machine (other Linux distributions, a Linux VM, or WSL on Windows 10/11 might also work):

  • Use build-run-android.sh to download the Android NDK and the Qualcomm QNN SDK automatically; the Qualcomm Hexagon SDK must be obtained with a Qualcomm developer account and cannot be downloaded automatically by this script.

  • You will need an adb-connected Android smartphone running on one of the Qualcomm SoCs below:

    SM8450 (Snapdragon 8 Gen 1+)
    SM8550 (Snapdragon 8 Gen 2)
    SM8650 (Snapdragon 8 Gen 3)
    SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)

```shell
git clone https://github.com/zhouwg/ggml-hexagon
cd ggml-hexagon
git checkout pr_to_upstream
```

```shell
./scripts/build-run-android.sh
Usage:
  ./scripts/build-run-android.sh help
  ./scripts/build-run-android.sh print_oplist
  ./scripts/build-run-android.sh build
  ./scripts/build-run-android.sh updateqnnlib
  ./scripts/build-run-android.sh run_testops
  ./scripts/build-run-android.sh run_testop          [ADD/MUL_MAT]
  ./scripts/build-run-android.sh run_llamacli
  ./scripts/build-run-android.sh run_llamabench
```

You can confirm that this backend works as expected from the log output of `adb logcat | grep ggml-hexagon`.

Hexagon NPU Performance

The test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. QNN SDK is v2.32.0.250228, Hexagon SDK is v6.2.0.1.

case-1: GGML_OP_ADD performance comparison between QNN-NPU and cDSP in real LLM inference


case-2: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP(small matrix mulmat through test-backend-ops)


[updated on 04/09/2025, 09:19] I suddenly found that QNN-NPU's performance was significantly improved after I upgraded the QNN SDK to v2.33.0.250327.

The test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. QNN SDK is v2.33.0.250327, Hexagon SDK is v6.2.0.1.


The details and how to reproduce the above results can be found at my forked llama.cpp project: zhouwg#28.

Big picture of ggml-hexagon backend

There are three technical approaches to implementing the ggml-hexagon backend for Qualcomm's Hexagon NPU:

  • General approach through the Qualcomm QNN SDK: offload ggml ops to QNN (QNN internally transfers them to the Hexagon cDSP).
  • General approach through the Qualcomm Hexagon SDK: offload ggml ops directly to the Hexagon cDSP, which is exactly similar to Qualcomm's ggml-opencl or Intel's ggml-sycl.
  • Special approach through the Qualcomm QNN SDK: map the entire ggml cgraph to a single QNN graph. This technique of mapping the entire ggml computational graph to a single QNN graph was already discovered in April 2024.
```c
enum hwaccel_approach_type {
    HWACCEL_QNN             = 0, // C API; before 03/11/2025; not easy, because the QNN SDK is a heavyweight black box with many, many tricks
    HWACCEL_QNN_SINGLEGRAPH = 1, // C API; before 03/18/2025; very hard, because the mechanism is a black box and the workload is massive
    HWACCEL_CDSP            = 2, // C and assembly API; after 03/24/2025; hard, but we can do anything directly on the cDSP, because the Hexagon SDK is a very lightweight/thin SDK with direct hardware access
    HWACCEL_SYCL            = 3, // personal proposal/assumption; general and modern C++ API; N/A at the moment because an essential adaptation layer would have to be provided by Qualcomm
};
```

The tech details of the "special approach through QNN" can be found at my forked llama.cpp project: zhouwg#24.
10+ reasons why I think HWACCEL_CDSP is the correct direction can be found at my forked llama.cpp project: zhouwg#28.

Acknowledgement

  1. The implementation of HWACCEL_QNN is mainly ported/reverse-engineered from ExecuTorch (the QNN backend in ExecuTorch comes from Qualcomm), and the implementation of HWACCEL_CDSP borrows some code from Qualcomm's Hexagon SDK. One more important thing: I got breakthrough help from @chiwwang at Qualcomm Technologies Inc/Qualcomm Innovation Center in April 2024. All in all, the fundamental techniques of this topic (a dedicated ggml/llama.cpp backend for Qualcomm's Hexagon NPU) come from Qualcomm.
  2. Huge thanks to the excellent maintainers and original authors of ggml & llama.cpp; I learnt so much from ggml & llama.cpp. Their open-minded spirit and standout contributions are a great public good for the open-source community and our planet. One more important thing: the tiny ggml-dsp on the Hexagon cDSP side (i.e., the existing implementation of the hexagon kernels on the cDSP side; I'm not an AI expert, and this was the practical way for me) is completely ported from the original ggml.
  3. Huge thanks to @max-krasnyansky, a senior staff technical expert from Qualcomm headquarters, who gave important, valuable, breakthrough guidance on direction on 03/18/2025: QNN is not the right solution here.

Conclusion

After spending a great deal of effort on the ggml-hexagon backend, I personally think:

  • AI experts must be involved in the remaining parts of the hexagon-kernels. They only need to focus on the hexagon-kernels: AI experts and other domain experts around the world can help improve the hexagon-kernels (various mulmat kernels, norm/rmsnorm/softmax/...), operating the cDSP hardware directly and pursuing any AI innovation through the lightweight and amazing Hexagon SDK on the cDSP side.

[updated on 04/02/2025, 22:18] @ggerganov @slaren, sorry to bother you; I understand your time is valuable. Could you help change the label of this PR to "Qualcomm NPU" and remove the labels "testing", "script", and "build"? Thanks so much!

@github-actions github-actions bot added build Compilation issues script Script related ggml changes relating to the ggml tensor library for machine learning labels Mar 11, 2025
@github-actions github-actions bot added the testing Everything test related label Mar 11, 2025
@zhouwg zhouwg force-pushed the pr_to_upstream branch 7 times, most recently from 2ceaaf5 to 3402e2c Compare March 11, 2025 14:22
@Dampfinchen

Nice job. NPU support is huge for this project. Do you think it's also possible to make it work on Exynos 2200 and 2400 NPUs?

@zhouwg (Contributor, Author) commented Mar 12, 2025

> Nice job. NPU support is huge for this project. Do you think it's also possible to make it work on Exynos 2200 and 2400 NPUs?

Thanks for your kind comment.

  1. Qualcomm's Hexagon NPU support really is a huge amount of work for this project, even though we now clearly understand the principle. Qualcomm provides dedicated binary tools for LLM model conversion in its dedicated AI software stacks, and some other closed-source implementations use exactly this approach; so, if programmers choose the second technical approach in the ggml-qnn backend ("mapping the complete ggml cgraph to a single QNN graph"), they must manually compose an ideal QNN graph from the complete ggml cgraph. There are 800+ cgraph nodes and 50+ ops in qwen1_5-1_8b-chat-q4_0.gguf; accordingly, "(Hexagon) NPU support is huge for this project", and real AI experts must be involved in the remaining parts of ggml-qnn.
  2. I think I can make it (ggml-exynos or ggml-samsung) work on the Exynos 2200 if I can get the necessary phone (I can try to buy one) and the SDK and technical docs (this might not be easy because of strict IPR policies in some big IT companies, as far as I currently understand), following the principle "make it run, then make it right, and finally make it fast"; this is one of my areas of expertise.

zhouwg

This comment was marked as resolved.

@zhouwg zhouwg force-pushed the pr_to_upstream branch 2 times, most recently from 0065122 to 1f702df Compare March 16, 2025 08:12
zhouwg

This comment was marked as resolved.

@zhouwg zhouwg force-pushed the pr_to_upstream branch 2 times, most recently from 1e98561 to e4b0d8c Compare March 16, 2025 09:51
@zhouwg zhouwg force-pushed the pr_to_upstream branch 3 times, most recently from 967be44 to a26806a Compare March 18, 2025 03:34
zhouwg

This comment was marked as resolved.

@zhouwg zhouwg force-pushed the pr_to_upstream branch 3 times, most recently from 6c6e3e8 to c1973bd Compare March 18, 2025 10:05
zhouwg added 27 commits July 3, 2025 14:41
…accel_approach) in ggml-hexagon.h for further usage
Labels
build Compilation issues ggml changes relating to the ggml tensor library for machine learning Qualcomm NPU script Script related


9 participants