
Misc. bug: vulkan on Adreno GPU #12139

Open
Theodoree opened this issue Mar 2, 2025 · 0 comments
Theodoree commented Mar 2, 2025

Name and Version

llama-cli --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | matrix cores: none
version: 4798 (1782cdf)
built with Android (12896553, +pgo, -bolt, +lto, -mlgo, based on r530567c) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-apple-darwin24.3.0

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-bench

Command line

llama-bench  -m lb-reranker-0.5B-v1.0-Q4_0.gguf   -v --progress
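To pin down the exact failure threshold, the command above can be swept over batch sizes using llama-bench's `-b`/`--batch-size` flag. A minimal sketch (printed as a dry run so it is safe to paste; remove the `echo` to actually run it on-device — the model path is the one from this report):

```shell
#!/bin/sh
# Sweep batch sizes around the reported threshold of 32 to find the first
# value that triggers vk::DeviceLostError. Dry run: commands are printed,
# not executed.
MODEL=lb-reranker-0.5B-v1.0-Q4_0.gguf
for b in 8 16 32 33 48 64 128; do
    echo "llama-bench -m $MODEL -b $b -v --progress"
done
```

If the crash first appears at `-b 33`, that narrows the bug to the pipeline selected for batches above 32 rather than a general driver instability.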

Problem description & steps to reproduce

model: lb-reranker-0.5B-v1.0-Q4_0.gguf

Problem Description

When running llama-bench, the following error occurs once the batch size exceeds 32:

libc++abi: terminating due to uncaught exception of type vk::DeviceLostError: vk::Queue::submit: ErrorDeviceLost

Expected Behavior

The benchmark should run successfully at any batch size, including values larger than 32.

Actual Behavior

The program crashes with a DeviceLost error when the batch size is greater than 32.

Environment Information

  • Hardware:
    • Device: Xiaomi Pad 7
    • SoC: Snapdragon 7+ Gen 3 (SM7675)
  • Tool: llama-bench
  • Batch size: greater than 32
  • Error: vk::DeviceLostError
  • NDK: 28.0.13004108
  • Vulkan-Headers: 1.3.276

Build Script Used

cmake \
  -DCMAKE_TOOLCHAIN_FILE=${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=32 \
  -DCMAKE_C_FLAGS="-march=armv8.7a+dotprod+noi8mm+nosve+i8mm" \
  -DCMAKE_CXX_FLAGS="-march=armv8.7a+dotprod+noi8mm+nosve+i8mm" \
  -DCMAKE_CXX_STANDARD=17 \
  -DGGML_OPENMP=OFF \
  -DGGML_LLAMAFILE=OFF \
  -DCMAKE_RUNTIME_OUTPUT_DIRECTORY=${PWD}/build-android/lib \
  -DCMAKE_LIBRARY_OUTPUT_DIRECTORY=${PWD}/build-android/lib \
  -DCMAKE_ARCHIVE_OUTPUT_DIRECTORY=${PWD}/build-android/lib \
  -DGGML_VULKAN=ON \
  -DGGML_VULKAN_DEBUG=ON \
  -DVulkan_GLSLC_EXECUTABLE=$ANDROID_NDK/shader-tools/darwin-x86_64/glslc \
  -DVulkan_INCLUDE_DIR=/Users/ted/workspace/Vulkan-Headers/include \
  -B build-android
cmake --build build-android --config Release -j 8
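For reference, a hedged sketch of pushing the resulting binaries to the tablet over adb. The device directory and the set of shared libraries are assumptions (copy whatever actually lands in `build-android/lib`); commands are printed as a dry run — drop the `echo` to execute:

```shell
#!/bin/sh
# Deploy the cross-compiled llama-bench to an Android device via adb.
# OUT matches the CMAKE_*_OUTPUT_DIRECTORY set in the cmake script above;
# the .so glob is an assumption and may differ between llama.cpp versions.
OUT=build-android/lib
DEV=/data/local/tmp/llama
echo "adb shell mkdir -p $DEV"
echo "adb push $OUT/llama-bench $OUT/*.so $DEV"
echo "adb shell chmod +x $DEV/llama-bench"
echo "adb shell LD_LIBRARY_PATH=$DEV $DEV/llama-bench -m $DEV/model.gguf -v --progress"
```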

First Bad Commit

No response

Relevant log output

ggml_vk_instance_init()
ggml_vulkan: Found 1 Vulkan devices:
ggml_vk_print_gpu_info(0)
ggml_vulkan: 0 = Adreno (TM) 732 (Qualcomm Technologies Inc. Adreno Vulkan Driver) | uma: 1 | fp16: 1 | warp size: 64 | shared memory: 32768 | matrix cores: none
llama-bench: benchmark 1/2: starting
llama_model_load_from_file_impl: using device Vulkan0 (Adreno (TM) 732) - 7494 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 290 tensors from ../../lb-reranker-0.5B-v1.0-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Reranker_0.5_Cont_Filt_7Max
llama_model_loader: - kv   3:                            general.version str              = v1.0
llama_model_loader: - kv   4:                       general.organization str              = Lightblue
llama_model_loader: - kv   5:                           general.basename str              = lb-reranker
llama_model_loader: - kv   6:                         general.size_label str              = 0.5B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Qwen2.5 0.5B Instruct
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-0...
llama_model_loader: - kv  12:                      general.dataset.count u32              = 1
llama_model_loader: - kv  13:                     general.dataset.0.name str              = Reranker_Continuous_Filt_Max7_Train
llama_model_loader: - kv  14:             general.dataset.0.organization str              = Lightblue
llama_model_loader: - kv  15:                 general.dataset.0.repo_url str              = https://huggingface.co/lightblue/rera...
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["reranker", "text-generation"]
llama_model_loader: - kv  17:                          general.languages arr[str,96]      = ["en", "zh", "es", "de", "ar", "ru", ...
llama_model_loader: - kv  18:                          qwen2.block_count u32              = 24
llama_model_loader: - kv  19:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  20:                     qwen2.embedding_length u32              = 896
llama_model_loader: - kv  21:                  qwen2.feed_forward_length u32              = 4864
llama_model_loader: - kv  22:                 qwen2.attention.head_count u32              = 14
llama_model_loader: - kv  23:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  24:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  25:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  33:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - kv  37:                          general.file_type u32              = 2
llama_model_loader: - kv  38:                      quantize.imatrix.file str              = /models_out/lb-reranker-0.5B-v1.0-GGU...
llama_model_loader: - kv  39:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  40:             quantize.imatrix.entries_count i32              = 168
llama_model_loader: - kv  41:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q4_0:  165 tensors
llama_model_loader: - type q4_1:    3 tensors
llama_model_loader: - type q8_0:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 330.95 MiB (5.62 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 896
print_info: n_layer          = 24
print_info: n_head           = 14
print_info: n_head_kv        = 2
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 128
print_info: n_embd_v_gqa     = 128
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 4864
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1B
print_info: model params     = 494.03 M
print_info: general.name     = Reranker_0.5_Cont_Filt_7Max
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
.....
ggml_vk_get_mul_mat_mat_pipeline(q4_0, f32)
ggml_vk_guess_matmul_pipeline_align(4864, 512, q4_0)
ggml_vk_guess_matmul_pipeline(4864, 512, 1, q4_0)
ggml_vk_align_size(896, 128)
ggml_vk_guess_matmul_pipeline(4864, 512, 1, q4_0)
ggml_vk_guess_split_k(4864, 512, 896)
ggml_vk_get_to_fp16()
ggml_vk_get_to_fp16()
ggml_vk_matmul(a: (0xb40000766d2f49f0, 145690624, 2451456), b: (0xb40000766d390f30, 0, 1835008), d: (0xb40000766d390f30, 4722688, 9961472), split_k: (0x0, 0, 9961472), m: 4864, n: 512, k: 896, stride_a: 896, stride_b: 896, stride_d: 4864, batch_stride_a: 4358144, batch_stride_b: 458752, batch_stride_d: 2490368, split_k: 1, batch: 1, ne02: 1, ne12: 1, broadcast2: 1, broadcast3: 1)
ggml_vk_sync_buffers()
ggml_vk_dispatch_pipeline(matmul_q4_0_f32_f16acc_aligned_l, {(0xb40000766d2f49f0, 145690624, 2451456), (0xb40000766d390f30, 0, 1835008), (0xb40000766d390f30, 4722688, 9961472), }, (38,4,1))
ggml_vk_ctx_end(0xb4000076bd2ec048, 1)
ggml_vk_compute_forward(0xb4000075f3f545e0, name=norm-0, op=RMS_NORM, type=0, ne0=896, ne1=512, ne2=1, ne3=1, nb0=4, nb1=3584, nb2=1835008, nb3=1835008, view_src=0x0, view_offs=0)
ggml_vk_submit(0xb4000076bd2ec048, 0x0)
libc++abi: terminating due to uncaught exception of type vk::DeviceLostError: vk::Queue::submit: ErrorDeviceLost
Aborted
Theodoree changed the title "Misc. bug: runing failure on Adreno devices using Vulkan for large batch size" → "Misc. bug: Vulkan Error: Device Lost on Adreno GPUs with Large Batch Sizes" (Mar 2, 2025)
Theodoree changed the title "Misc. bug: Vulkan Error: Device Lost on Adreno GPUs with Large Batch Sizes" → "Misc. bug: Device Lost on Adreno GPUs with Large Batch Sizes" (Mar 2, 2025)
Theodoree changed the title "Misc. bug: Device Lost on Adreno GPUs with Large Batch Sizes" → "Misc. bug: vulkan on Adreno GPU" (Mar 2, 2025)