Bug: KV quantization fails when using vulkan #9551

Open
jmars opened this issue Sep 19, 2024 · 2 comments
Labels
bug-unconfirmed, medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features that are still usable)

Comments

jmars commented Sep 19, 2024

What happened?

I'm trying to run a model with a large context (128k) split across 2 GPUs (an A770 and a P40), but running with -c 131072 -ctk q4_0 -ctv q4_0 fails with "V cache quantization requires flash_attn". If I then add -fa, it fails with "ggml-backend.c:1174: pre-allocated tensor in a backend that cannot run the operation".
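
For reference, the two invocations look roughly like this (llama-cli and the model path are placeholders standing in for my actual command):

llama-cli -m model.gguf -c 131072 -ctk q4_0 -ctv q4_0
llama-cli -m model.gguf -c 131072 -ctk q4_0 -ctv q4_0 -fa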

Is this possible to support? If so, what would be required? Would someone (possibly myself) just need to implement FA for the Vulkan backend?

Name and Version

Version: 3758 (3c7989f)

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

jmars added the bug-unconfirmed and medium severity labels on Sep 19, 2024
JohannesGaessler (Collaborator) commented Sep 19, 2024

> Is this possible to support? If so, what would be required? Would someone (possibly myself) just need to implement FA for the Vulkan backend?

That would be the most straightforward solution. Alternatively, support for V cache quantization without FA would also work (though this is currently not supported by any backend, would be more work to implement than FA, and would yield worse results).

JohannesGaessler (Collaborator) commented Sep 19, 2024

I forgot: if you end up deciding to implement FA for Vulkan, take a look at the corresponding tests in tests/test-backend-ops.cpp. You don't have to implement support for all of those cases, but for the cases where ggml_backend_vk_supports_op returns true, the tests should pass (defined as giving the same results as the CPU backend within some numerical tolerance).
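
For example, the FA tests can be run in isolation against the Vulkan backend with something like the following (the exact binary path, op filter name, and backend name may differ depending on your build):

./tests/test-backend-ops test -o FLASH_ATTN_EXT -b Vulkan0

The test harness compares the backend's output against the CPU backend internally, so a passing run there is the bar described above.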
