Eval bug: llama.cpp returns gibberish on Intel Core Ultra 7 (155H) with ARC iGPU #12096
Comments
@cgruver
Let's check whether it is related to the data type or the model.
Results on the Intel Arc -
I see similar issues on my M2 MacBook too... This is the result on my M2 MacBook -
Yes, I see the same result on Ubuntu with an Arc 770. It is caused by the quantized model, not by the llama.cpp code.
I also meet a similar problem. I run llama-server version r4784 with the deepseek-r1:14b GGUF model downloaded by Ollama:

/root/git/llama.cpp/build/bin/llama-server \
  -m /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e \
  -ngl 99 \
  --temp 0.6 \
  --no-webui \
  --host 0.0.0.0 \
  --port 20004 \
  --jinja \
  -fa \
  --chat-template-file /root/git/llama.cpp/models/templates/llama-cpp-deepseek-r1.jinja \
  --reasoning-format deepseek

Then I use httpie to send a streaming chat completion. It should give a correct answer, but it replies with gibberish.
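For reference, a streaming request of the kind described could be sent with httpie roughly as sketched below; the exact request body was not included above, so the message content and the use of the OpenAI-compatible /v1/chat/completions endpoint are assumptions.

# hypothetical httpie request against the server started above (localhost:20004)
http --stream POST :20004/v1/chat/completions \
  messages:='[{"role":"user","content":"hello"}]' \
  stream:=true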
I build with this option:
If I remove the ...
@Sherlock-Holo CUDA and Vulkan can't work together. This issue is about the Intel iGPU, not a CUDA GPU. :)
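As an illustration only (not taken from this report), the usual approach is to enable a single GPU backend per build, for example:

# build with CUDA only
cmake -B build-cuda -DGGML_CUDA=ON
cmake --build build-cuda --config Release -j

# or build with Vulkan only
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j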
@NeoZhangJianyu Do you believe that the issue is with the model itself? Or is llama.cpp not compatible with the quantization? I don't see any issues with the older My lab is currently down for a bit so I'll have to retest |
You could check with llama.cpp on CPU only. If the result is still wrong there, the model is not compatible with the quantization.
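A minimal CPU-only sanity check, assuming a local copy of the GGUF file (the path below is a placeholder), is to offload zero layers:

# run entirely on CPU by offloading 0 layers to the GPU; model path is hypothetical
llama-cli -m /path/to/granite3.1-moe-3b.gguf -ngl 0 -p "hello"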
Here are some logs that I captured earlier.
Working example - no GPU:
Broken example on Intel Arc:
In the working example, you can see the conversation logged correctly. In the broken example, the prompt "hello" is not logged.
Working -
Broken -
Comparing the two, you can see that with the same prompt of "Hello", the test on Intel Arc sent "\n" in the data stream...
Yes, something is wrong.
Yes, from Ollama: https://ollama.com/library/granite3.1-moe (granite3.1-moe:3b)
Yes, the result would be wrong with this model file.
Name and Version
llama-cli --version
version: 4784 (b95c8af3)
built with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu
Steps to Build
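A typical SYCL build with the Intel oneAPI toolchain (matching the compiler shown in the version string) looks roughly like the sketch below; the exact steps and flags used for this report may have differed.

# typical llama.cpp SYCL build with Intel oneAPI (icx/icpx); assumed, not taken from the report
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j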
Operating systems
Linux
GGML backends
SYCL
Hardware
Intel Core Ultra 7 155H with Arc iGPU
Models
granite3.1-moe:3b
granite3.1-dense:8b
Problem description & steps to reproduce
llama-run --ngl 999 --jinja ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello

Loading model
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
0 0 The answer is: 1 The answer is 1 The answer is: 1 The answer is: 1 The answer is: 1 The answer is: 1 The answer is: 1 The answer is: 1 The answer^C
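The repeated warning above concerns querying free GPU memory and is likely a separate issue from the gibberish output, but it can be addressed as the message itself suggests when reproducing:

# enable Level Zero SYSMAN so ext_intel_free_memory can be queried, then rerun
export ZES_ENABLE_SYSMAN=1
llama-run --ngl 999 --jinja ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello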
First Bad Commit
No response
Relevant log output