Bug: Lower performance in pre-built binary llama-server since llama-b3681-bin-win-cuda-cu12.2.0-x64 #9530

Open
tobchef opened this issue Sep 18, 2024 · 4 comments
Labels
bug-unconfirmed medium severity Used to report medium severity bugs in llama.cpp (e.g. Malfunctioning Features but still useable)

Comments

tobchef commented Sep 18, 2024

What happened?

The generation speed of llama-server has dropped significantly since b3681, and the issue persists in the latest b3779.
For the same task with the parameters "-ngl 99 -fa -c 2048", the generation speeds are:
b3680: 60 t/s
b3681: 40 t/s
b3779: 40 t/s
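
For reference, a minimal server invocation matching these parameters would look roughly like the line below (a sketch; the model path is taken from the logs further down, and the test prompts themselves are not included in this report):

llama-server -m E:\ai\models\qwen2-7b-instruct-q5_k_m-i_wikicn.gguf -ngl 99 -fa -c 2048 --host 127.0.0.1 --port 8080

The t/s figures above correspond to the "generation eval time" tokens-per-second values that the server prints after each /v1/chat/completions request.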

Name and Version

llama-server -v

build: 3779 (7be099f) with MSVC 19.29.30154.0 for x64
system info: n_threads = 10, n_threads_batch = 10, total_threads = 16

system_info: n_threads = 10 (n_threads_batch = 10) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 15

What operating system are you seeing the problem on?

No response

Relevant log output

# b3680

INFO [                    main] build info | tid="19948" timestamp=1726628445 build=3680 commit="947538ac"
INFO [                    main] system info | tid="19948" timestamp=1726628445 n_threads=10 n_threads_batch=10 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
INFO [                    main] HTTP server is listening | tid="19948" timestamp=1726628445 hostname="127.0.0.1" port="8080" n_threads_http="15"
INFO [                    main] loading model | tid="19948" timestamp=1726628445 hostname="127.0.0.1" port="8080" n_threads_http="15"
llama_model_loader: loaded meta data with 36 key-value pairs and 339 tensors from E:\ai\models\qwen2-7b-instruct-q5_k_m-i_wikicn.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2 7B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2 7B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-7B
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  14:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  15:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv  16:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  17:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  18:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  19:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  20:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                          general.file_type u32              = 17
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                      quantize.imatrix.file str              = llama.cpp/imatrix.dat
llama_model_loader: - kv  33:                   quantize.imatrix.dataset str              = /tmp/gradio/8869463bfcb11aaa5cc03db92...
llama_model_loader: - kv  34:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  35:              quantize.imatrix.chunks_count i32              = 50
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.9308 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name     = Qwen2 7B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   357.33 MiB
llm_load_tensors:      CUDA0 buffer size =  4829.59 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   112.00 MiB
llama_new_context_with_model: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.16 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   304.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    11.01 MiB
llama_new_context_with_model: graph nodes  = 875
llama_new_context_with_model: graph splits = 2
INFO [                    init] initializing slots | tid="19948" timestamp=1726628447 n_slots=1
INFO [                    init] new slot | tid="19948" timestamp=1726628447 id_slot=0 n_ctx_slot=2048
INFO [                    main] model loaded | tid="19948" timestamp=1726628447
INFO [                    main] chat template | tid="19948" timestamp=1726628447 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [            update_slots] all slots are idle | tid="19948" timestamp=1726628447
INFO [   launch_slot_with_task] slot is processing task | tid="19948" timestamp=1726628470 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="19948" timestamp=1726628470 id_slot=0 id_task=0 p0=0
INFO [            update_slots] slot context shift | tid="19948" timestamp=1726628481 id_slot=0 id_task=0 n_keep=0 n_left=2047 n_discard=1023 n_ctx=2048 n_past=2047 n_system_tokens=0 n_cache_tokens=0
INFO [                 release] slot released | tid="19948" timestamp=1726628484 id_slot=0 id_task=0 n_past=1204 truncated=true
INFO [           print_timings] prompt eval time     =     826.23 ms /  1373 tokens (    0.60 ms per token,  1661.77 tokens per second) | tid="19948" timestamp=1726628484 id_slot=0 id_task=0 t_prompt_processing=826.226 n_prompt_tokens_processed=1373 t_token=0.6017669337217771 n_tokens_second=1661.7729289564843
INFO [           print_timings] generation eval time =   13805.91 ms /   855 runs   (   16.15 ms per token,    61.93 tokens per second) | tid="19948" timestamp=1726628484 id_slot=0 id_task=0 t_token_generation=13805.908 n_decoded=855 t_token=16.14726081871345 n_tokens_second=61.93000851519509
INFO [           print_timings]           total time =   14632.13 ms | tid="19948" timestamp=1726628484 id_slot=0 id_task=0 t_prompt_processing=826.226 t_token_generation=13805.908 t_total=14632.134
INFO [   launch_slot_with_task] slot is processing task | tid="19948" timestamp=1726628484 id_slot=0 id_task=1
INFO [      log_server_request] request | tid="7736" timestamp=1726628484 remote_addr="127.0.0.1" remote_port=2469 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] kv cache rm [p0, end) | tid="19948" timestamp=1726628484 id_slot=0 id_task=1 p0=0
INFO [                 release] slot released | tid="19948" timestamp=1726628486 id_slot=0 id_task=1 n_past=1303 truncated=false
INFO [           print_timings] prompt eval time     =     505.69 ms /  1248 tokens (    0.41 ms per token,  2467.89 tokens per second) | tid="19948" timestamp=1726628486 id_slot=0 id_task=1 t_prompt_processing=505.695 n_prompt_tokens_processed=1248 t_token=0.4052043269230769 n_tokens_second=2467.890724646279
INFO [           print_timings] generation eval time =     870.15 ms /    56 runs   (   15.54 ms per token,    64.36 tokens per second) | tid="19948" timestamp=1726628486 id_slot=0 id_task=1 t_token_generation=870.155 n_decoded=56 t_token=15.538482142857143 n_tokens_second=64.35635030540536
INFO [           print_timings]           total time =    1375.85 ms | tid="19948" timestamp=1726628486 id_slot=0 id_task=1 t_prompt_processing=505.695 t_token_generation=870.155 t_total=1375.85
INFO [            update_slots] all slots are idle | tid="19948" timestamp=1726628486
INFO [      log_server_request] request | tid="13284" timestamp=1726628486 remote_addr="127.0.0.1" remote_port=2468 status=200 method="POST" path="/v1/chat/completions" params={}




# b3681
INFO [                    main] build info | tid="18284" timestamp=1726628566 build=3681 commit="df270ef7"
INFO [                    main] system info | tid="18284" timestamp=1726628566 n_threads=10 n_threads_batch=10 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
INFO [                    main] HTTP server is listening | tid="18284" timestamp=1726628566 hostname="127.0.0.1" port="8080" n_threads_http="15"
INFO [                    main] loading model | tid="18284" timestamp=1726628566 hostname="127.0.0.1" port="8080" n_threads_http="15"
llama_model_loader: loaded meta data with 36 key-value pairs and 339 tensors from E:\ai\models\qwen2-7b-instruct-q5_k_m-i_wikicn.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2 7B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2 7B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-7B
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  14:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  15:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv  16:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  17:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  18:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  19:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  20:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                          general.file_type u32              = 17
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                      quantize.imatrix.file str              = llama.cpp/imatrix.dat
llama_model_loader: - kv  33:                   quantize.imatrix.dataset str              = /tmp/gradio/8869463bfcb11aaa5cc03db92...
llama_model_loader: - kv  34:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  35:              quantize.imatrix.chunks_count i32              = 50
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.9308 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name     = Qwen2 7B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   357.33 MiB
llm_load_tensors:      CUDA0 buffer size =  4829.59 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   112.00 MiB
llama_new_context_with_model: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.16 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   304.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    11.01 MiB
llama_new_context_with_model: graph nodes  = 875
llama_new_context_with_model: graph splits = 2
INFO [                    init] initializing slots | tid="18284" timestamp=1726628568 n_slots=1
INFO [                    init] new slot | tid="18284" timestamp=1726628568 id_slot=0 n_ctx_slot=2048
INFO [                    main] model loaded | tid="18284" timestamp=1726628568
INFO [                    main] chat template | tid="18284" timestamp=1726628568 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [            update_slots] all slots are idle | tid="18284" timestamp=1726628568
INFO [   launch_slot_with_task] slot is processing task | tid="18284" timestamp=1726628585 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="18284" timestamp=1726628585 id_slot=0 id_task=0 p0=0
INFO [            update_slots] slot context shift | tid="18284" timestamp=1726628604 id_slot=0 id_task=0 n_keep=0 n_left=2047 n_discard=1023 n_ctx=2048 n_past=2047 n_system_tokens=0 n_cache_tokens=0
INFO [                 release] slot released | tid="18284" timestamp=1726628608 id_slot=0 id_task=0 n_past=1204 truncated=true
INFO [           print_timings] prompt eval time     =     965.25 ms /  1373 tokens (    0.70 ms per token,  1422.43 tokens per second) | tid="18284" timestamp=1726628608 id_slot=0 id_task=0 t_prompt_processing=965.253 n_prompt_tokens_processed=1373 t_token=0.7030247632920612 n_tokens_second=1422.4250015280968
INFO [           print_timings] generation eval time =   22429.05 ms /   855 runs   (   26.23 ms per token,    38.12 tokens per second) | tid="18284" timestamp=1726628608 id_slot=0 id_task=0 t_token_generation=22429.053 n_decoded=855 t_token=26.232810526315788 n_tokens_second=38.120200616584214
INFO [           print_timings]           total time =   23394.31 ms | tid="18284" timestamp=1726628608 id_slot=0 id_task=0 t_prompt_processing=965.253 t_token_generation=22429.053 t_total=23394.306
INFO [   launch_slot_with_task] slot is processing task | tid="18284" timestamp=1726628608 id_slot=0 id_task=2
INFO [      log_server_request] request | tid="17004" timestamp=1726628608 remote_addr="127.0.0.1" remote_port=2672 status=200 method="POST" path="/v1/chat/completions" params={}
INFO [            update_slots] kv cache rm [p0, end) | tid="18284" timestamp=1726628608 id_slot=0 id_task=2 p0=0
INFO [                 release] slot released | tid="18284" timestamp=1726628610 id_slot=0 id_task=2 n_past=1303 truncated=false
INFO [           print_timings] prompt eval time     =     520.49 ms /  1248 tokens (    0.42 ms per token,  2397.76 tokens per second) | tid="18284" timestamp=1726628610 id_slot=0 id_task=2 t_prompt_processing=520.485 n_prompt_tokens_processed=1248 t_token=0.4170552884615385 n_tokens_second=2397.7636243119396
INFO [           print_timings] generation eval time =    1423.78 ms /    56 runs   (   25.42 ms per token,    39.33 tokens per second) | tid="18284" timestamp=1726628610 id_slot=0 id_task=2 t_token_generation=1423.777 n_decoded=56 t_token=25.424589285714287 n_tokens_second=39.3320021323564
INFO [           print_timings]           total time =    1944.26 ms | tid="18284" timestamp=1726628610 id_slot=0 id_task=2 t_prompt_processing=520.485 t_token_generation=1423.777 t_total=1944.2620000000002
INFO [            update_slots] all slots are idle | tid="18284" timestamp=1726628610
INFO [      log_server_request] request | tid="20504" timestamp=1726628610 remote_addr="127.0.0.1" remote_port=2673 status=200 method="POST" path="/v1/chat/completions" params={}




# b3779
build: 3779 (7be099fa) with MSVC 19.29.30154.0 for x64
system info: n_threads = 10, n_threads_batch = 10, total_threads = 16

system_info: n_threads = 10 (n_threads_batch = 10) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 15
main: loading model
llama_model_loader: loaded meta data with 36 key-value pairs and 339 tensors from E:\ai\models\qwen2-7b-instruct-q5_k_m-i_wikicn.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2 7B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2 7B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2-7B
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  14:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  15:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv  16:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  17:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  18:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  19:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  20:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                          general.file_type u32              = 17
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                      quantize.imatrix.file str              = llama.cpp/imatrix.dat
llama_model_loader: - kv  33:                   quantize.imatrix.dataset str              = /tmp/gradio/8869463bfcb11aaa5cc03db92...
llama_model_loader: - kv  34:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  35:              quantize.imatrix.chunks_count i32              = 50
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q5_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.9308 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.07 GiB (5.71 BPW)
llm_load_print_meta: general.name     = Qwen2 7B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   357.33 MiB
llm_load_tensors:      CUDA0 buffer size =  4829.59 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   112.00 MiB
llama_new_context_with_model: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     1.16 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   304.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    11.01 MiB
llama_new_context_with_model: graph nodes  = 875
llama_new_context_with_model: graph splits = 2
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 2048
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

'
main: server is listening on 127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | tokenizing prompt, len = 1
slot update_slots: id  0 | task 0 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 1248
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 1248, n_tokens = 1248, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 1248, n_tokens = 1248
slot      release: id  0 | task 0 | stop processing: n_past = 1303, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     783.29 ms /  1248 tokens (    0.63 ms per token,  1593.27 tokens per second)
       eval time =    1409.38 ms /    56 tokens (   25.17 ms per token,    39.73 tokens per second)
      total time =    2192.67 ms /  1304 tokens
slot launch_slot_: id  0 | task 2 | processing task
slot update_slots: id  0 | task 2 | tokenizing prompt, len = 1
request: POST /v1/chat/completions 127.0.0.1 200
slot update_slots: id  0 | task 2 | prompt tokenized, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 1373
slot update_slots: id  0 | task 2 | kv cache rm [0, end)
slot update_slots: id  0 | task 2 | prompt processing progress, n_past = 1373, n_tokens = 1373, progress = 1.000000
slot update_slots: id  0 | task 2 | prompt done, n_past = 1373, n_tokens = 1373
slot update_slots: id  0 | task 2 | slot context shift, n_keep = 0, n_left = 2047, n_discard = 1023
slot      release: id  0 | task 2 | stop processing: n_past = 1204, truncated = 1
slot print_timing: id  0 | task 2 |
prompt eval time =     565.05 ms /  1373 tokens (    0.41 ms per token,  2429.88 tokens per second)
       eval time =   21562.66 ms /   855 tokens (   25.22 ms per token,    39.65 tokens per second)
      total time =   22127.71 ms /  2228 tokens
srv  update_slots: all slots are idle
request: POST /v1/chat/completions 127.0.0.1 200
ggerganov (Owner) commented:

I cannot reproduce building from source:

GGML_CUDA=1 make -j && ./llama-server \
    -m models/qwen2-7b-instruct/ggml-model-q4_0.gguf \
    -c 2048 -ngl 99 --host 127.0.0.1 --port 8012 -fa
curl \
    --request POST --url http://127.0.0.1:8012/v1/chat/completions \
    --header "Content-Type: application/json" \
    --data '{"messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": [ { "type": "text", "text": "How exactly is a plumbus made? Explain in details." } ] } ], "n_predict": 512}' | jq
0.00.211.412 I ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
0.00.211.415 I ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
0.00.211.415 I ggml_cuda_init: found 1 CUDA devices:
0.00.212.273 I   Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes

b3680

INFO [           print_timings] prompt eval time     =      41.57 ms /    32 tokens (    1.30 ms per token,   769.79 tokens per second) | tid="140542972510208" timestamp=1726642089 id_slot=0 id_task=513 t_prompt_processing=41.57 n_prompt_tokens_processed=32 t_token=1.2990625 n_tokens_second=769.7859032956459
INFO [           print_timings] generation eval time =    7491.90 ms /   512 runs   (   14.63 ms per token,    68.34 tokens per second) | tid="140542972510208" timestamp=1726642089 id_slot=0 id_task=513 t_token_generation=7491.9 n_decoded=512 t_token=14.6326171875 n_tokens_second=68.34047437899599
INFO [           print_timings]           total time =    7533.47 ms | tid="140542972510208" timestamp=1726642089 id_slot=0 id_task=513 t_prompt_processing=41.57 t_token_generation=7491.9 t_total=7533.469999999999

master

3.22.930.056 I slot print_timing: id  0 | task 1554 | 
prompt eval time =      41.08 ms /    32 tokens (    1.28 ms per token,   779.02 tokens per second)
       eval time =    7419.22 ms /   512 tokens (   14.49 ms per token,    69.01 tokens per second)
      total time =    7460.30 ms /   544 tokens

tobchef (Author) commented Sep 18, 2024

I'm a beginner and not very good at compiling on Windows, so I've always used "Download pre-built binary from releases."
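
(For anyone who wants to rule out a packaging problem by building locally: a CUDA build on Windows typically only needs CMake, Visual Studio, and the CUDA toolkit. A minimal sketch, not verified on this machine:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

llama-server.exe should then end up under build\bin\Release.)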

I've tried many versions, and all the precompiled versions from b3681 onward show the llama-server slowdown described above, while b3680 and earlier run at normal llama-server speed.

It's worth noting that the llama-bench speed for all these versions is OK.
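
A llama-bench cross-check with matching settings would look roughly like this (a sketch; note that in llama-bench the flash-attention flag takes an explicit value):

llama-bench -m E:\ai\models\qwen2-7b-instruct-q5_k_m-i_wikicn.gguf -ngl 99 -fa 1

It reports prompt-processing (pp) and text-generation (tg) tokens per second directly, which makes it a convenient way to compare builds outside the server.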


gebteus commented Sep 19, 2024

I confirm. This problem exists 😵

slaren (Collaborator) commented Sep 19, 2024

I cannot reproduce this either, even when using the pre-built binaries under Windows.
