Bug: llama-server api first query very slow #9492
Comments
I have the same problem, but only with CUDA.
Could you try without CUDA graphs? Set …
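The exact setting was lost in the page capture; in recent llama.cpp builds, CUDA graphs can be disabled at runtime via the `GGML_CUDA_DISABLE_GRAPHS` environment variable (worth verifying against the build in use — the model path below is a placeholder):

```shell
# Disable CUDA graphs for this run (variable name as used in recent
# llama.cpp builds; verify it matches your version). Model path is
# a placeholder.
GGML_CUDA_DISABLE_GRAPHS=1 ./llama-server -m models/model.gguf --host 0.0.0.0 --port 8080
```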
Doesn't seem to be making any difference:
Has this started happening recently? Does it happen without …? I can't reproduce on my CUDA workstation.
Happens both with and without docker. I wasn't using llama-server before, so I can't say if it's new or not.
Hm, does adding … help?
Doesn't help unfortunately. |
@ggerganov I was able to reproduce the problem on HF endpoints with an A10G GPU (I hadn't noticed this issue before). The first … Here is the log with …
I just had it happen on an A10 machine as well, using …
What happened?
I'm using the `openai` library to interact with the `llama-server` docker image on an A6000:

```shell
docker run -p 8080:8080 --name llama-server \
  -v ~/gguf_models:/models --gpus all \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m models/Meta-Llama-3.1-70B-Instruct-Q4_K_L.gguf \
  -c 65536 -fa --host 0.0.0.0 --port 8080 \
  --n-gpu-layers 99 -ctk q4_0 -ctv q4_0 -t 4
```
The first request I send takes about 80 seconds, during which a single CPU core initially runs at 100% load for ~55s (with GPU usage at 0%), and only then does the GPU kick in. When I execute the exact same call a second time, it takes ~26s to respond and starts with both CPU (one core at 100%) and GPU (~87%) working at the same time.
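To quantify the cold-request vs. warm-request gap described above, one can time two consecutive identical requests. This is a minimal sketch with a stand-in callable; in practice, `send` would be the actual HTTP request to the server (not shown, since the original call was not captured):

```python
import time

def time_request(send):
    """Return the wall-clock seconds taken by one request callable."""
    start = time.perf_counter()
    send()
    return time.perf_counter() - start

# Stand-in for the real HTTP call; replace with the actual request.
cold = time_request(lambda: time.sleep(0.05))  # first (cold) request
warm = time_request(lambda: time.sleep(0.05))  # repeated (warm) request
print(f"cold: {cold:.2f}s, warm: {warm:.2f}s")
```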
The API call itself is:
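The request body itself was not captured in this page. As an illustration only, a chat request against `llama-server`'s OpenAI-compatible `/v1/chat/completions` endpoint might carry a payload like the following (the model name, prompt, and `max_tokens` value are placeholders, not from the report):

```python
import json

# Hypothetical payload for llama-server's OpenAI-compatible
# /v1/chat/completions endpoint; model name, prompt, and max_tokens
# are illustrative placeholders, not taken from the original report.
payload = {
    "model": "Meta-Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256,
}
body = json.dumps(payload)
print(body)
```

This body would be POSTed to `http://localhost:8080/v1/chat/completions` with `Content-Type: application/json`, e.g. via the `openai` client or `curl`.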
Name and Version
```shell
$ ./llama-server --version
version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
```
^^^ Not very helpful, but I have just pulled a fresh docker image today, i.e. 15/09/2024:

```shell
docker pull ghcr.io/ggerganov/llama.cpp:server-cuda
```
What operating system are you seeing the problem on?
Linux
Relevant log output