Bug: KV quantization fails when using vulkan #9551
Labels
bug-unconfirmed
medium severity
What happened?
I'm trying to run a model with a large context (128k) split across two GPUs (an A770 and a P40), but running with `-c 131072 -ctk q4_0 -ctv q4_0` fails with "V cache quantization requires flash_attn". If I then add `-fa`, it fails with `ggml-backend.c:1174: pre-allocated tensor in a backend that cannot run the operation`.
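For reference, the invocation looks roughly like this (the binary name, model path, and layer count are placeholders for my actual setup):

```sh
# Hypothetical model path; -ctk/-ctv quantize the K/V cache to q4_0,
# and adding -fa (flash attention) is what triggers the second error.
llama-cli -m models/my-model.gguf -ngl 99 -c 131072 -ctk q4_0 -ctv q4_0 -fa
```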
Is this possible to support? If so, what would be required? Does someone (possibly me) just need to implement FA for the Vulkan backend?
Name and Version
Version: 3758 (3c7989f)
What operating system are you seeing the problem on?
Windows
Relevant log output
No response