
Misc. bug: vulkan on Adreno GPU #12139

Open
Theodoree opened this issue Mar 2, 2025 · 0 comments
Theodoree commented Mar 2, 2025

Name and Version

llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | matrix cores: none
version: 4798 (1782cdf)
built with Android (12896553, +pgo, -bolt, +lto, -mlgo, based on r530567c) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-apple-darwin24.3.0

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-bench

Command line

llama-bench  -m lb-reranker-0.5B-v1.0-Q4_0.gguf   -v --progress
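To pin down the exact failure threshold, the command above can be swept over batch sizes using llama-bench's `-b`/`--batch-size` flag. A minimal sketch (printed as a dry run so it is safe to paste; remove the `echo` to actually run it on-device — the model path is the one from this report):

```shell
#!/bin/sh
# Sweep batch sizes around the reported threshold of 32 to find the first
# value that triggers vk::DeviceLostError. Dry run: commands are printed,
# not executed.
MODEL=lb-reranker-0.5B-v1.0-Q4_0.gguf
for b in 8 16 32 33 48 64 128; do
    echo "llama-bench -m $MODEL -b $b -v --progress"
done
```

If the crash first appears at `-b 33`, that narrows the bug to the pipeline selected for batches above 32 rather than a general driver instability.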

Problem description & steps to reproduce

model: lb-reranker-0.5B-v1.0-Q4_0.gguf

Problem Description

When running llama-bench, the following error occurs once the batch size exceeds 32:

libc++abi: terminating due to uncaught exception of type vk::DeviceLostError: vk::Queue::submit: ErrorDeviceLost

Expected Behavior

The benchmark should run successfully at any batch size, including values larger than 32.

Actual Behavior

The program crashes with a DeviceLost error when the batch size is greater than 32.

Environment Information

  • Hardware:
    • Device: Xiaomi Pad 7
    • SoC: Snapdragon 7+ Gen 3 (SM7675)
  • Tool: llama-bench
  • Batch size: greater than 32
  • Error: vk::DeviceLostError
  • NDK: 28.0.13004108
  • Vulkan-Headers: 1.3.276

Build Script Used

cmake \
  -DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=32 \
  -DCMAKE_C_FLAGS="-march=armv8.7a+dotprod+noi8mm+nosve+i8mm" \
  -DCMAKE_CXX_FLAGS="-march=armv8.7a+dotprod+noi8mm+nosve+i8mm" \
  -DCMAKE_CXX_STANDARD=17 \
  -DGGML_OPENMP=OFF \
  -DGGML_LLAMAFILE=OFF \
  -DCMAKE_RUNTIME_OUTPUT_DIRECTORY=${PWD}/build-android/lib \
  -DCMAKE_LIBRARY_OUTPUT_DIRECTORY=${PWD}/build-android/lib \
  -DCMAKE_ARCHIVE_OUTPUT_DIRECTORY=${PWD}/build-android/lib \
  -DGGML_VULKAN=ON \
  -DGGML_VULKAN_DEBUG=ON \
  -DVulkan_GLSLC_EXECUTABLE=$ANDROID_NDK/shader-tools/darwin-x86_64/glslc \
  -DVulkan_INCLUDE_DIR=/Users/ted/workspace/Vulkan-Headers/include \
  -B build-android
cmake --build build-android --config Release -j 8
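For reference, a hedged sketch of pushing the resulting binaries to the tablet over adb. The device directory and the set of shared libraries are assumptions (copy whatever actually lands in `build-android/lib`); commands are printed as a dry run — drop the `echo` to execute:

```shell
#!/bin/sh
# Deploy the cross-compiled llama-bench to an Android device via adb.
# OUT matches the CMAKE_*_OUTPUT_DIRECTORY set in the cmake script above;
# the .so glob is an assumption and may differ between llama.cpp versions.
OUT=build-android/lib
DEV=/data/local/tmp/llama
echo "adb shell mkdir -p $DEV"
echo "adb push $OUT/llama-bench $OUT/*.so $DEV"
echo "adb shell chmod +x $DEV/llama-bench"
echo "adb shell LD_LIBRARY_PATH=$DEV $DEV/llama-bench -m $DEV/model.gguf -v --progress"
```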

First Bad Commit

No response

Relevant log output

ggml_vk_instance_init()
ggml_vulkan: Found 1 Vulkan devices:
ggml_vk_print_gpu_info(0)
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | matrix cores: none
llama-bench: benchmark 1/2: starting
llama_model_load_from_file_impl: using device Vulkan0 (Adreno (TM) 732) - 7494 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 290 tensors from ../../lb-reranker-0.5B-v1.0-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Reranker_0.5_Cont_Filt_7Max
llama_model_loader: - kv   3:                            general.version str              = v1.0
llama_model_loader: - kv   4:                       general.organization str              = Lightblue
llama_model_loader: - kv   5:                           general.basename str              = lb-reranker
llama_model_loader: - kv   6:                         general.size_label str              = 0.5B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 0.5B Instruct
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-0...
llama_model_loader: - kv  12:                      general.dataset.count u32              = 1
llama_model_loader: - kv  13:                     general.dataset.0.name str              = Reranker_Continuous_Filt_Max7_Train
llama_model_loader: - kv  14:             general.dataset.0.organization str              = Lightblue
llama_model_loader: - kv  15:                 general.dataset.0.repo_url str              = https://huggingface.co/lightblue/rera...
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["reranker", "text-generation"]
llama_model_loader: - kv  17:                          general.languages arr[str,96]      = ["en", "zh", "es", "de", "ar", "ru", ...
llama_model_loader: - kv  18:                          qwen2.block_count u32              = 24
llama_model_loader: - kv  19:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  20:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv  21:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  22:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  23:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  24:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  25:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - kv  37:                          general.file_type u32              = 2
llama_model_loader: - kv  38:                      quantize.imatrix.file str              = /models_out/lb-reranker-0.5B-v1.0-GGU...
llama_model_loader: - kv  39:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  40:             quantize.imatrix.entries_count i32              = 168
llama_model_loader: - kv  41:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  165 tensors
llama_model_loader: - type q4_1:    3 tensors
llama_model_loader: - type q8_0:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 330.95 MiB (5.62 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 896
print_info: n_layer          = 24
print_info: n_head           = 14
print_info: n_head_kv        = 2
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 128
print_info: n_embd_v_gqa     = 128
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 4864
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 494.03 M
print_info: general.name     = Reranker_0.5_Cont_Filt_7Max
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
.....
ggml_vk_get_mul_mat_mat_pipeline(q4_0, f32)
ggml_vk_guess_matmul_pipeline_align(4864, 512, q4_0)
ggml_vk_guess_matmul_pipeline(4864, 512, 1, q4_0)
ggml_vk_align_size(896, 128)
ggml_vk_guess_matmul_pipeline(4864, 512, 1, q4_0)
ggml_vk_guess_split_k(4864, 512, 896)
ggml_vk_get_to_fp16()
ggml_vk_get_to_fp16()
ggml_vk_matmul(a: (0xb40000766d2f49f0, 145690624, 2451456), b: (0xb40000766d390f30, 0, 1835008), d: (0xb40000766d390f30, 4722688, 9961472), split_k: (0x0, 0, 9961472), m: 4864, n: 512, k: 896, stride_a: 896, stride_b: 896, stride_d: 4864, batch_stride_a: 4358144, batch_stride_b: 458752, batch_stride_d: 2490368, split_k: 1, batch: 1, ne02: 1, ne12: 1, broadcast2: 1, broadcast3: 1)
ggml_vk_sync_buffers()
ggml_vk_dispatch_pipeline(matmul_q4_0_f32_f16acc_aligned_l, {(0xb40000766d2f49f0, 145690624, 2451456), (0xb40000766d390f30, 0, 1835008), (0xb40000766d390f30, 4722688, 9961472), }, (38,4,1))
ggml_vk_ctx_end(0xb4000076bd2ec048, 1)
ggml_vk_compute_forward(0xb4000075f3f545e0, name=norm-0, op=RMS_NORM, type=0, ne0=896, ne1=512, ne2=1, ne3=1, nb0=4, nb1=3584, nb2=1835008, nb3=1835008, view_src=0x0, view_offs=0)
ggml_vk_submit(0xb4000076bd2ec048, 0x0)
libc++abi: terminating due to uncaught exception of type vk::DeviceLostError: vk::Queue::submit: ErrorDeviceLost
Aborted
Theodoree changed the title "Misc. bug: runing failure on Adreno devices using Vulkan for large batch size" → "Misc. bug: Vulkan Error: Device Lost on Adreno GPUs with Large Batch Sizes" (Mar 2, 2025)
Theodoree changed the title "Misc. bug: Vulkan Error: Device Lost on Adreno GPUs with Large Batch Sizes" → "Misc. bug: Device Lost on Adreno GPUs with Large Batch Sizes" (Mar 2, 2025)
Theodoree changed the title "Misc. bug: Device Lost on Adreno GPUs with Large Batch Sizes" → "Misc. bug: vulkan on Adreno GPU" (Mar 2, 2025)