Bug: Lower performance in pre-built binary llama-server, Since llama-b3681-bin-win-cuda-cu12.2.0-x64 #9530
Comments
I cannot reproduce building from source:

```shell
GGML_CUDA=1 make -j && ./llama-server \
  -m models/qwen2-7b-instruct/ggml-model-q4_0.gguf \
  -c 2048 -ngl 99 --host 127.0.0.1 --port 8012 -fa
```

```shell
curl \
  --request POST --url http://127.0.0.1:8012/v1/chat/completions \
  --header "Content-Type: application/json" \
  --data '{"messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": [ { "type": "text", "text": "How exactly is a plumbus made? Explain in details." } ] } ], "n_predict": 512}' | jq
```
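To compare generation speed across builds, the server's reported timings can be read from the response instead of eyeballing the log. A minimal sketch (an assumption for illustration: the response carries llama-server's `timings` object with `predicted_per_second`, as the native `/completion` endpoint reports):

```python
def extract_tps(response_json: dict):
    """Return generation speed (tokens/sec) from a llama-server response.

    Assumes the response includes a "timings" object containing
    "predicted_per_second" (as llama-server's /completion endpoint reports);
    returns None if the field is absent.
    """
    return response_json.get("timings", {}).get("predicted_per_second")


# Example with a response shaped like the b3680 measurement:
sample = {"content": "...", "timings": {"predicted_per_second": 60.0}}
print(extract_tps(sample))  # → 60.0
```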
(attached llama-server outputs for b3680 and master)
I'm a beginner and not very good at compiling on Windows, so I've always used the pre-built binaries from the releases page. I've tried many versions: every pre-built version from b3681 onward shows the llama-server slowdown described above, while versions up to b3680 run llama-server at normal speed. Notably, llama-bench speed is normal for all of these versions.
I confirm. This problem exists 😵
I cannot reproduce this either, even when using the pre-built binaries on Windows.
What happened?
The generation speed of llama-server has significantly decreased since b3681, and this issue persists in the latest b3779 without improvement.
For the same task and parameters (`-ngl 99 -fa -c 2048`), the generation speeds are:
- b3680: 60 t/s
- b3681: 40 t/s
- b3779: 40 t/s
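For scale, the numbers above amount to roughly a one-third drop in generation speed. A quick sketch of the arithmetic (`slowdown_pct` is a hypothetical helper for this issue, not part of llama.cpp):

```python
def slowdown_pct(before_tps: float, after_tps: float) -> float:
    """Percentage drop in generation speed between two builds."""
    return (before_tps - after_tps) / before_tps * 100.0


# b3680 -> b3681 as reported in this issue:
print(round(slowdown_pct(60.0, 40.0), 1))  # → 33.3
```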
Name and Version
llama-server -v
build: 3779 (7be099f) with MSVC 19.29.30154.0 for x64
system info: n_threads = 10, n_threads_batch = 10, total_threads = 16
system_info: n_threads = 10 (n_threads_batch = 10) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 15
What operating system are you seeing the problem on?
No response
Relevant log output