Eval bug: llama.cpp returns gibberish on Intel Core Ultra 7 (155H) with ARC iGPU #12096

Open
cgruver opened this issue Feb 27, 2025 · 15 comments

@cgruver

cgruver commented Feb 27, 2025

Name and Version

llama-cli --version
version: 4784 (b95c8af3)
built with Intel(R) oneAPI DPC++/C++ Compiler 2025.0.4 (2025.0.4.20241205) for x86_64-unknown-linux-gnu

Steps to Build

cat << EOF > /etc/yum.repos.d/oneAPI.repo
[oneAPI]
name=Intel® oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF

dnf install -y procps-ng g++ cmake git libcurl-devel intel-oneapi-mkl-sycl-devel intel-oneapi-dnnl-devel intel-oneapi-compiler-dpcpp-cpp intel-level-zero oneapi-level-zero oneapi-level-zero-devel intel-compute-runtime ; \
    source /opt/intel/oneapi/setvars.sh ; \
    git clone https://github.com/ggerganov/llama.cpp.git -b ${LLAMA_CPP_VER} ; \
    cd llama.cpp ; \
    mkdir -p build ; \
    cd build ; \
    cmake .. -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_CURL=ON -DGGML_CCACHE=OFF -DGGML_NATIVE=OFF ; \
    cmake --build . --config Release -j -v ; \
    cmake --install . --prefix /llama-cpp ; \
    cd ../.. ; \
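
After building, it may be worth confirming that the oneAPI runtime actually sees the iGPU before loading any model. A minimal check, assuming the oneAPI packages installed above (sycl-ls ships with the DPC++ runtime) and the /llama-cpp install prefix from the last step:

source /opt/intel/oneapi/setvars.sh
sycl-ls
/llama-cpp/bin/llama-cli --version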

Operating systems

Linux

GGML backends

SYCL

Hardware

00:02.0 VGA compatible controller [0300]: Intel Corporation Meteor Lake-P [Intel Arc Graphics] [8086:7d55] (rev 08)
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|  12.71|    128|    1024|   32| 94111M|     1.5.30872.320000|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|

Models

granite3.1-moe:3b
granite3.1-dense:8b

Problem description & steps to reproduce

llama-run --ngl 0 --jinja ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello
Loading modelget_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
Hello! How can I assist you today?
llama-run --ngl 999 --jinja ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello
Loading modelget_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
0

0

The answer is: 1

The answer is 1

The answer is: 1

The answer is: 1

The answer is: 1

The answer is: 1

The answer is: 1

The answer is: 1

The answer^C
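
Incidentally, the repeated get_memory_info warning can be silenced as the message itself suggests; this only affects the free-memory query and is not expected to change the gibberish output:

export ZES_ENABLE_SYSMAN=1
llama-run --ngl 999 --jinja ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello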

First Bad Commit

No response

Relevant log output

llama-run --ngl 999 --jinja --verbose ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello 


Loading modelget_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) Graphics) - 89752 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 323 tensors from /model-dir/models/ollama/granite3-moe:3b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = granitemoe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Granite 3.0 3b A800M Instruct
llama_model_loader: - kv   3:                           general.finetune str              = instruct
llama_model_loader: - kv   4:                           general.basename str              = granite-3.0
llama_model_loader: - kv   5:                         general.size_label str              = 3B-a800M
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,3]       = ["language", "granite-3.0", "text-gen...
llama_model_loader: - kv   8:                     granitemoe.block_count u32              = 32
llama_model_loader: - kv   9:                  granitemoe.context_length u32              = 4096
llama_model_loader: - kv  10:                granitemoe.embedding_length u32              = 1536
llama_model_loader: - kv  11:             granitemoe.feed_forward_length u32              = 512
llama_model_loader: - kv  12:            granitemoe.attention.head_count u32              = 24
llama_model_loader: - kv  13:         granitemoe.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                  granitemoe.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  15: granitemoe.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                    granitemoe.expert_count u32              = 40
llama_model_loader: - kv  17:               granitemoe.expert_used_count u32              = 8
llama_model_loader: - kv  18:                          general.file_type u32              = 15
llama_model_loader: - kv  19:                      granitemoe.vocab_size u32              = 49155
llama_model_loader: - kv  20:            granitemoe.rope.dimension_count u32              = 64
llama_model_loader: - kv  21:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  22:                 granitemoe.attention.scale f32              = 0.015625
llama_model_loader: - kv  23:                 granitemoe.embedding_scale f32              = 12.000000
llama_model_loader: - kv  24:                  granitemoe.residual_scale f32              = 0.220000
llama_model_loader: - kv  25:                     granitemoe.logit_scale f32              = 6.000000
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = refact
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,49155]   = ["<|end_of_text|>", "<fim_prefix>", "...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,49155]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,48891]   = ["Ġ Ġ", "ĠĠ ĠĠ", "ĠĠĠĠ ĠĠ...
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  32:                tokenizer.ggml.eos_token_id u32              = 0
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|start_of_r...
llama_model_loader: - kv  36:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   97 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 1.92 GiB (4.88 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token:      2 '<fim_middle>' is not marked as EOG
load: control token:     13 '<jupyter_output>' is not marked as EOG
load: control token:      9 '<issue_closed>' is not marked as EOG
load: control token:      6 '<gh_stars>' is not marked as EOG
load: control token:     10 '<jupyter_start>' is not marked as EOG
load: control token:     14 '<empty_output>' is not marked as EOG
load: control token:     15 '<commit_before>' is not marked as EOG
load: control token:      5 '<filename>' is not marked as EOG
load: control token:     12 '<jupyter_code>' is not marked as EOG
load: control token:      4 '<fim_pad>' is not marked as EOG
load: control token:     18 '<reponame>' is not marked as EOG
load: control token:      7 '<issue_start>' is not marked as EOG
load: control token:      3 '<fim_suffix>' is not marked as EOG
load: control token:      1 '<fim_prefix>' is not marked as EOG
load: control token:      0 '<|end_of_text|>' is not marked as EOG
load: control token:      8 '<issue_comment>' is not marked as EOG
load: control token:     11 '<jupyter_text>' is not marked as EOG
load: control token:     16 '<commit_msg>' is not marked as EOG
load: control token:  49152 '<|start_of_role|>' is not marked as EOG
load: control token:  49154 '<|tool_call|>' is not marked as EOG
load: control token:  49153 '<|end_of_role|>' is not marked as EOG
load: control token:     17 '<commit_after>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.2826 MB
print_info: arch             = granitemoe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 4096
print_info: n_embd           = 1536
print_info: n_layer          = 32
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 6.0e+00
print_info: n_ff             = 512
print_info: n_expert         = 40
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 3.37 B
print_info: general.name     = Granite 3.0 3b A800M Instruct
print_info: f_embedding_scale = 12.000000
print_info: f_residual_scale  = 0.220000
print_info: f_attention_scale = 0.015625
print_info: vocab type       = BPE
print_info: n_vocab          = 49155
print_info: n_merges         = 48891
print_info: BOS token        = 0 '<|end_of_text|>'
print_info: EOS token        = 0 '<|end_of_text|>'
print_info: PAD token        = 0 '<|end_of_text|>'
print_info: LF token         = 203 'Ċ'
print_info: EOG token        = 0 '<|end_of_text|>'
print_info: max token length = 512
load_tensors: loading model tensors, this can take a while... (mmap = true)
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
load_tensors: layer   0 assigned to device SYCL0
load_tensors: layer   1 assigned to device SYCL0
load_tensors: layer   2 assigned to device SYCL0
load_tensors: layer   3 assigned to device SYCL0
load_tensors: layer   4 assigned to device SYCL0
load_tensors: layer   5 assigned to device SYCL0
load_tensors: layer   6 assigned to device SYCL0
load_tensors: layer   7 assigned to device SYCL0
load_tensors: layer   8 assigned to device SYCL0
load_tensors: layer   9 assigned to device SYCL0
load_tensors: layer  10 assigned to device SYCL0
load_tensors: layer  11 assigned to device SYCL0
load_tensors: layer  12 assigned to device SYCL0
load_tensors: layer  13 assigned to device SYCL0
load_tensors: layer  14 assigned to device SYCL0
load_tensors: layer  15 assigned to device SYCL0
load_tensors: layer  16 assigned to device SYCL0
load_tensors: layer  17 assigned to device SYCL0
load_tensors: layer  18 assigned to device SYCL0
load_tensors: layer  19 assigned to device SYCL0
load_tensors: layer  20 assigned to device SYCL0
load_tensors: layer  21 assigned to device SYCL0
load_tensors: layer  22 assigned to device SYCL0
load_tensors: layer  23 assigned to device SYCL0
load_tensors: layer  24 assigned to device SYCL0
load_tensors: layer  25 assigned to device SYCL0
load_tensors: layer  26 assigned to device SYCL0
load_tensors: layer  27 assigned to device SYCL0
load_tensors: layer  28 assigned to device SYCL0
load_tensors: layer  29 assigned to device SYCL0
load_tensors: layer  30 assigned to device SYCL0
load_tensors: layer  31 assigned to device SYCL0
load_tensors: layer  32 assigned to device SYCL0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        SYCL0 model buffer size =  1921.79 MiB
load_tensors:   CPU_Mapped model buffer size =    40.50 MiB
.............................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: no
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                     Intel Arc Graphics|  12.71|    128|    1024|   32| 94111M|     1.5.30872.320000|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 1: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 2: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 3: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 4: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 5: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 6: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 7: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 8: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 9: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 10: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 11: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 12: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 13: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 14: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 15: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 16: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 17: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 18: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 19: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 20: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 21: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 22: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 23: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 24: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 25: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 26: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 27: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 28: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 29: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 30: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 31: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init:      SYCL0 KV buffer size =   128.00 MiB
llama_init_from_model: KV self size  =  128.00 MiB, K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_init_from_model:  SYCL_Host  output buffer size =     0.19 MiB
llama_init_from_model:      SYCL0 compute buffer size =   112.00 MiB
llama_init_from_model:  SYCL_Host compute buffer size =     7.01 MiB
llama_init_from_model: graph nodes  = 1960
llama_init_from_model: graph splits = 2
001261, "The Church of England", "001261"

"The Church of England" is a term that refers to the state church of England, which is the official church of the United Kingdom. It is also known as the "Church of England" or "the Anglican Church". The Church of England is a member of the worldwide Anglican Communion. It is headed by the King (or Queen) of England, who is also the head of state. The Church of England is also a member of the United Nations and the Commonwealth of Nations.

The Church of England has a long history and has been influential in the development of Christianity in England. It was established in the 6th century by King Constantine I, and has since been led by many notable figures, including St. Augustine, St. Bede, and King Henry VIII. The Church of England has also been a place of refuge for many people throughout history, including those who were persecuted for their faith.

Today, the Church of England is a vibrant and diverse community, with a wide range of worship services, ministries, and programs. It is also a place of
@NeoZhangJianyu
Collaborator

@cgruver
This issue looks like incorrect results from some OPs.
I see the model uses Q4_K and Q6_K types, which seem to appear less often in user cases.

  1. Could you first try a model quantized with Q4_0?
  2. Then download and try llama2-7b-int4: https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf.

Let's check whether the problem is related to the data type or to the model.
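
For reference, one way to fetch and test the Q4_0 file from point 2 (a sketch; the resolve URL follows the usual Hugging Face download pattern, and llama-run accepts a local GGUF path as shown later in this thread):

curl -L -o llama-2-7b.Q4_0.gguf \
    https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf
llama-run --ngl 999 llama-2-7b.Q4_0.gguf hello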

@cgruver
Author

cgruver commented Feb 27, 2025

@NeoZhangJianyu

Results on the Intel Arc -

llama-run --ngl 999  llama-2-7b.Q4_0.gguf hello 
Loading modelget_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
<|im_end|>
<|im_start|>user
<|im_end|>
<|im_start|>assistant
<|im_end|>
<|im_start|>user
<|im_end|>
<|im_start|>assistant
<|im_end|>
<|im_start|>user^C

I see similar issues on my M2 MacBook too...

This is the result on my M2 MacBook -

llama-run --ngl 999 --jinja llama-2-7b.Q4_0.gguf hello 
hello<|im_end|>
</>

<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
hello<|im_end|>
</>

<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
hello<|im_end|>
</>

<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
hello<|im_end|>
</>

<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
hello<|im_end|>
</>

<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
hello<|im_end|>
</>

<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
hello<|im_end|^C

@NeoZhangJianyu
Collaborator

Yes, I see the same result on Ubuntu with an Arc 770.
It should be OK.

It's determined by the quantized model, not by the llama.cpp code.
You could check "granite3-moe" in the same way.

@Sherlock-Holo

I also hit a similar problem.

I am running llama-server build r4784 with the deepseek-r1:14b GGUF model downloaded by ollama:

/root/git/llama.cpp/build/bin/llama-server \
    -m /usr/share/ollama/.ollama/models/blobs/sha256-6e9f90f02bb3b39b59e81916e8cfce9deb45aeaeb9a54a5be4414486b907dc1e \
    -ngl 99 \
    --temp 0.6 \
    --no-webui \
    --host 0.0.0.0 \
    --port 20004 \
    --jinja \
    -fa \
    --chat-template-file /root/git/llama.cpp/models/templates/llama-cpp-deepseek-r1.jinja \
    --reasoning-format deepseek

I then used httpie to send a streaming chat completion request. It should return a coherent answer, but it replies with gibberish:

data: {
    "choices": [
        {
            "delta": {
                "content": "彘"
            },
            "finish_reason": null,
            "index": 0
        }
    ],
    "created": 1740731319,
    "id": "chatcmpl-NkOKAOkdIucmoT9GiI44n88gfbQDBsEJ",
    "model": "deepseek-r1-14b-cpp",
    "object": "chat.completion.chunk",
    "system_fingerprint": "b4784-b95c8af3"
}

data: {
    "choices": [
        {
            "delta": {
                "content": "regor"
            },
            "finish_reason": null,
            "index": 0
        }
    ],
    "created": 1740731319,
    "id": "chatcmpl-NkOKAOkdIucmoT9GiI44n88gfbQDBsEJ",
    "model": "deepseek-r1-14b-cpp",
    "object": "chat.completion.chunk",
    "system_fingerprint": "b4784-b95c8af3"
}

data: {
    "choices": [
        {
            "delta": {
                "content": " Pear"
            },
            "finish_reason": null,
            "index": 0
        }
    ],
    "created": 1740731319,
    "id": "chatcmpl-NkOKAOkdIucmoT9GiI44n88gfbQDBsEJ",
    "model": "deepseek-r1-14b-cpp",
    "object": "chat.completion.chunk",
    "system_fingerprint": "b4784-b95c8af3"
}

data: {
    "choices": [
        {
            "delta": {
                "content": "planes"
            },
            "finish_reason": null,
            "index": 0
        }
    ],
    "created": 1740731319,
    "id": "chatcmpl-NkOKAOkdIucmoT9GiI44n88gfbQDBsEJ",
    "model": "deepseek-r1-14b-cpp",
    "object": "chat.completion.chunk",
    "system_fingerprint": "b4784-b95c8af3"
}
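
For anyone trying to reproduce this, an equivalent request with curl against llama-server's OpenAI-compatible endpoint (a sketch; host and port taken from the command above, the message content is arbitrary) should produce the same streaming chunks:

curl -N http://localhost:20004/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"hello"}],"stream":true}'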

@Sherlock-Holo

I built with these options:

cmake -B build \ 
    -DGGML_CUDA=ON \
    -DGGML_VULKAN=1 \
    -DCMAKE_INSTALL_PREFIX='/usr/local' \
    -DGGML_ALL_WARNINGS=OFF \
    -DGGML_ALL_WARNINGS_3RD_PARTY=OFF \
    -DBUILD_SHARED_LIBS=ON \
    -DGGML_STATIC=OFF \
    -DGGML_LTO=ON \
    -DGGML_RPC=ON \
    -DLLAMA_CURL=ON \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES=native \
    -Wno-dev

If I remove -DGGML_VULKAN=1, the problem goes away.

@NeoZhangJianyu
Collaborator

@Sherlock-Holo The CUDA and Vulkan backends can't work together; keep only one.
That's why the problem goes away after you remove Vulkan.

This issue is about the Intel iGPU, not a CUDA GPU. :)
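
For reference, a single-backend configuration reusing only the options already shown in the build command above would look roughly like this (a sketch; pick either CUDA or Vulkan, not both):

cmake -B build \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES=native \
    -DLLAMA_CURL=ON \
    -DBUILD_SHARED_LIBS=ON \
    -Wno-dev
cmake --build build --config Release -j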

@cgruver
Author

cgruver commented Feb 28, 2025

Yes, I see the same result on Ubuntu with an Arc 770. It should be OK.

It's determined by the quantized model, not by the llama.cpp code. You could check "granite3-moe" in the same way.

@NeoZhangJianyu Do you believe that the issue is with the model itself? Or is llama.cpp not compatible with the quantization?

I don't see any issues with the older granite-code models.

My lab is currently down for a bit so I'll have to retest granite3, but I'm pretty sure I saw similar behavior with it.

@NeoZhangJianyu
Collaborator

You could check with the llama.cpp CPU backend.
It produces the correct OP results.

If the result is still wrong there, the model is not compatible with that quantization;
otherwise, it's an issue in the SYCL backend kernel functions.
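
A quick way to compare, reusing the commands from the original report (the model path is the reporter's): run the same prompt once entirely on the CPU and once fully offloaded to the SYCL device, then compare the outputs.

# CPU only: no layers offloaded
llama-run --ngl 0 --jinja ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello
# Full offload to the SYCL backend
llama-run --ngl 999 --jinja ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello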

@cgruver
Author

cgruver commented Feb 28, 2025

Here are some logs that I captured earlier -

Working example - No GPU

request: {"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"hello"}],"stream":true,"cache_prompt":true,"samplers":"edkypmxt","temperature":0.8,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"typical_p":1,"xtc_probability":0,"xtc_threshold":0.1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"max_tokens":-1,"timings_per_token":false}
srv  params_from_: Grammar: 
srv  params_from_: Grammar lazy: false
srv  params_from_: Chat format: Content-only
srv  add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que          post: new task, id = 0/1, front = 0
que    start_loop: processing new tasks
que    start_loop: processing task, id = 0
slot get_availabl: id  0 | task -1 | selected slot by lru, t_last = -1
slot        reset: id  0 | task -1 | 
slot launch_slot_: id  0 | task 0 | launching slot : {"id":0,"id_task":0,"n_ctx":131072,"speculative":false,"is_processing":false,"non_causal":false,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":131072,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"\u0001","grammar_trigger_words":[],"grammar_trigger_tokens":[],"preserved_tokens":[],"chat_format":"Content-only","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":5,"speculative.p_min":0.8999999761581421,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"<|start_of_role|>system<|end_of_role|>You are a helpful assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>hello<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>","next_token":{"has_next_token":true,"has_new_line":false,"n_remain":-1,"n_decoded":0,"stopping_word":""}}
parse: error parsing grammar: expecting name at 


slot launch_slot_: id  0 | task 0 | processing task
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 1, front = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 20
slot update_slots: id  0 | task 0 | prompt token   0:  49152 '<|start_of_role|>'
slot update_slots: id  0 | task 0 | prompt token   1:   2946 'system'
slot update_slots: id  0 | task 0 | prompt token   2:  49153 '<|end_of_role|>'
slot update_slots: id  0 | task 0 | prompt token   3:   4282 'You'
slot update_slots: id  0 | task 0 | prompt token   4:    884 ' are'
slot update_slots: id  0 | task 0 | prompt token   5:    312 ' a'
slot update_slots: id  0 | task 0 | prompt token   6:  17247 ' helpful'
slot update_slots: id  0 | task 0 | prompt token   7:  47330 ' assistant'
slot update_slots: id  0 | task 0 | prompt token   8:     32 '.'
slot update_slots: id  0 | task 0 | prompt token   9:      0 '<|end_of_text|>'
slot update_slots: id  0 | task 0 | prompt token  10:    203 '
'
slot update_slots: id  0 | task 0 | prompt token  11:  49152 '<|start_of_role|>'
slot update_slots: id  0 | task 0 | prompt token  12:    496 'user'
slot update_slots: id  0 | task 0 | prompt token  13:  49153 '<|end_of_role|>'
slot update_slots: id  0 | task 0 | prompt token  14:   7656 'hello'
slot update_slots: id  0 | task 0 | prompt token  15:      0 '<|end_of_text|>'
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 20, n_tokens = 20
srv  update_slots: decoding batch, n_tokens = 20
srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = -1, next token:  8279 'Hello'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 1
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 2, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 21, n_cache_tokens = 21, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"Hello"}}],"created":1740495926,"id":"chatcmpl-o3rmRXhJPHD5qBwgA4JFy6PuokzEb1bd","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 2, n_remaining = -1, next token:    19 '!'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 2
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 3, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 22, n_cache_tokens = 22, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"!"}}],"created":1740495926,"id":"chatcmpl-o3rmRXhJPHD5qBwgA4JFy6PuokzEb1bd","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 3, n_remaining = -1, next token:  4971 ' How'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 3
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 4, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 23, n_cache_tokens = 23, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" How"}}],"created":1740495926,"id":"chatcmpl-o3rmRXhJPHD5qBwgA4JFy6PuokzEb1bd","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 4, n_remaining = -1, next token:   883 ' can'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 4
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 5, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 24, n_cache_tokens = 24, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" can"}}],"created":1740495926,"id":"chatcmpl-o3rmRXhJPHD5qBwgA4JFy6PuokzEb1bd","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 5, n_remaining = -1, next token:   439 ' I'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 5
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 6, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 25, n_cache_tokens = 25, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" I"}}],"created":1740495926,"id":"chatcmpl-o3rmRXhJPHD5qBwgA4JFy6PuokzEb1bd","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 6, n_remaining = -1, next token: 34916 ' assist'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 6
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 7, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 26, n_cache_tokens = 26, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" assist"}}],"created":1740495926,"id":"chatcmpl-o3rmRXhJPHD5qBwgA4JFy6PuokzEb1bd","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 7, n_remaining = -1, next token:   844 ' you'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 7
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 8, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 27, n_cache_tokens = 27, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" you"}}],"created":1740495926,"id":"chatcmpl-o3rmRXhJPHD5qBwgA4JFy6PuokzEb1bd","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 8, n_remaining = -1, next token: 11610 ' today'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 8
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 9, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 28, n_cache_tokens = 28, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" today"}}],"created":1740495926,"id":"chatcmpl-o3rmRXhJPHD5qBwgA4JFy6PuokzEb1bd","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 9, n_remaining = -1, next token:    49 '?'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 9
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 10, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 29, n_cache_tokens = 29, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"?"}}],"created":1740495926,"id":"chatcmpl-o3rmRXhJPHD5qBwgA4JFy6PuokzEb1bd","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

Broken Example on Intel Arc -

request: {"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"hello"}],"stream":true,"cache_prompt":true,"samplers":"edkypmxt","temperature":0.8,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"typical_p":1,"xtc_probability":0,"xtc_threshold":0.1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"max_tokens":-1,"timings_per_token":false}
srv  params_from_: Grammar: 
srv  params_from_: Grammar lazy: false
srv  params_from_: Chat format: Content-only
srv  add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que          post: new task, id = 0/1, front = 0
que    start_loop: processing new tasks
que    start_loop: processing task, id = 0
slot get_availabl: id  0 | task -1 | selected slot by lru, t_last = -1
slot        reset: id  0 | task -1 | 
slot launch_slot_: id  0 | task 0 | launching slot : {"id":0,"id_task":0,"n_ctx":131072,"speculative":false,"is_processing":false,"non_causal":false,"params":{"n_predict":-1,"seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":131072,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":true,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"\u0001","grammar_trigger_words":[],"grammar_trigger_tokens":[],"preserved_tokens":[],"chat_format":"Content-only","samplers":["penalties","dry","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.n_max":16,"speculative.n_min":5,"speculative.p_min":0.8999999761581421,"timings_per_token":false,"post_sampling_probs":false,"lora":[]},"prompt":"<|start_of_role|>system<|end_of_role|>You are a helpful assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|>hello<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>","next_token":{"has_next_token":true,"has_new_line":false,"n_remain":-1,"n_decoded":0,"stopping_word":""}}
parse: error parsing grammar: expecting name at 


slot launch_slot_: id  0 | task 0 | processing task
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 1, front = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, n_prompt_tokens = 20
slot update_slots: id  0 | task 0 | prompt token   0:  49152 '<|start_of_role|>'
slot update_slots: id  0 | task 0 | prompt token   1:   2946 'system'
slot update_slots: id  0 | task 0 | prompt token   2:  49153 '<|end_of_role|>'
slot update_slots: id  0 | task 0 | prompt token   3:   4282 'You'
slot update_slots: id  0 | task 0 | prompt token   4:    884 ' are'
slot update_slots: id  0 | task 0 | prompt token   5:    312 ' a'
slot update_slots: id  0 | task 0 | prompt token   6:  17247 ' helpful'
slot update_slots: id  0 | task 0 | prompt token   7:  47330 ' assistant'
slot update_slots: id  0 | task 0 | prompt token   8:     32 '.'
slot update_slots: id  0 | task 0 | prompt token   9:      0 '<|end_of_text|>'
slot update_slots: id  0 | task 0 | prompt token  10:    203 '
'
slot update_slots: id  0 | task 0 | prompt token  11:  49152 '<|start_of_role|>'
slot update_slots: id  0 | task 0 | prompt token  12:    496 'user'
slot update_slots: id  0 | task 0 | prompt token  13:  49153 '<|end_of_role|>'
slot update_slots: id  0 | task 0 | prompt token  14:   7656 'hello'
slot update_slots: id  0 | task 0 | prompt token  15:      0 '<|end_of_text|>'
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 20, n_tokens = 20
srv  update_slots: decoding batch, n_tokens = 20
srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = -1, next token:   203 '
'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 1
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 2, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 21, n_cache_tokens = 21, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"\n"}}],"created":1740495743,"id":"chatcmpl-XYvlVtTEzSdhuOvzmZaWTkhZywCI5ALZ","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 2, n_remaining = -1, next token:   203 '
'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 2
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 3, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 22, n_cache_tokens = 22, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"\n"}}],"created":1740495743,"id":"chatcmpl-XYvlVtTEzSdhuOvzmZaWTkhZywCI5ALZ","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 3, n_remaining = -1, next token:    35 '1'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 3
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 4, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 23, n_cache_tokens = 23, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"1"}}],"created":1740495743,"id":"chatcmpl-XYvlVtTEzSdhuOvzmZaWTkhZywCI5ALZ","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 4, n_remaining = -1, next token:    32 '.'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 4
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 5, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 24, n_cache_tokens = 24, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"."}}],"created":1740495743,"id":"chatcmpl-XYvlVtTEzSdhuOvzmZaWTkhZywCI5ALZ","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 5, n_remaining = -1, next token:   399 ' A'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 5
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 6, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 25, n_cache_tokens = 25, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" A"}}],"created":1740495743,"id":"chatcmpl-XYvlVtTEzSdhuOvzmZaWTkhZywCI5ALZ","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 6, n_remaining = -1, next token:  2334 ' man'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 6
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 7, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 26, n_cache_tokens = 26, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" man"}}],"created":1740495743,"id":"chatcmpl-XYvlVtTEzSdhuOvzmZaWTkhZywCI5ALZ","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 7, n_remaining = -1, next token:  1597 ' was'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 7
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 8, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 27, n_cache_tokens = 27, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" was"}}],"created":1740495744,"id":"chatcmpl-XYvlVtTEzSdhuOvzmZaWTkhZywCI5ALZ","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

srv          send: sending result for task id = 0
srv          send: task id = 0 pushed to result queue
slot process_toke: id  0 | task 0 | n_decoded = 8, n_remaining = -1, next token: 47212 ' born'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 8
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 9, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 28, n_cache_tokens = 28, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":" born"}}],"created":1740495744,"id":"chatcmpl-XYvlVtTEzSdhuOvzmZaWTkhZywCI5ALZ","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

@cgruver
Author

cgruver commented Feb 28, 2025

In the working example, you can see the conversation logged correctly.

In the broken example, the prompt "hello" is not logged.

@cgruver
Author

cgruver commented Feb 28, 2025

Working -

slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = -1, next token:  8279 'Hello'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 1
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 2, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 21, n_cache_tokens = 21, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"Hello"}}],"created":1740495926,"id":"chatcmpl-o3rmRXhJPHD5qBwgA4JFy6PuokzEb1bd","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

@cgruver
Author

cgruver commented Feb 28, 2025

Broken -

slot process_toke: id  0 | task 0 | n_decoded = 1, n_remaining = -1, next token:   203 '
'
srv  update_slots: run slots completed
que    start_loop: waiting for new tasks
que    start_loop: processing new tasks
que    start_loop: processing task, id = 1
que    start_loop: update slots
srv  update_slots: posting NEXT_RESPONSE
que          post: new task, id = 2, front = 0
slot update_slots: id  0 | task 0 | slot decode token, n_ctx = 131072, n_past = 21, n_cache_tokens = 21, truncated = 0
srv  update_slots: decoding batch, n_tokens = 1
data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"\n"}}],"created":1740495743,"id":"chatcmpl-XYvlVtTEzSdhuOvzmZaWTkhZywCI5ALZ","model":"gpt-3.5-turbo","system_fingerprint":"b4713-a4f011e8","object":"chat.completion.chunk"}

You can see by comparing the two that, with the same prompt of "Hello", the test on Intel Arc sent "\n" in the data stream...

@NeoZhangJianyu
Collaborator

Yes, something is wrong. Let me check.
Did you get the model file through ollama? Or could you share the GGUF link?

@cgruver
Author

cgruver commented Feb 28, 2025

Yes, from Ollama: https://ollama.com/library/granite3.1-moe

granite3.1-moe:3b

@NeoZhangJianyu
Collaborator

Yes, the result is indeed wrong with this model file.
I tested with the model file from Hugging Face: granite-3.0-3b-a800m-instruct-Q4_0.gguf.
The result differs from the CPU output.
Let me check.
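
As a possible next debugging step (a sketch based only on the environment variables printed in the verbose log above, not a confirmed fix), rerunning with the SYCL reorder optimization disabled and debug output enabled may show whether the optimized kernels are the source of the difference:

GGML_SYCL_DISABLE_OPT=1 GGML_SYCL_DEBUG=1 \
    llama-run --ngl 999 --jinja ${RAMALAMA_STORE}/models/ollama/granite3-moe:3b hello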
