Add OpenVINO backend #15307
Conversation
- …e model
- Add OpenVINO ADD operator to Llama.cpp. The output is somewhat abnormal and needs further debugging.
- …ontend-utils, GraphIterator, Decoder
- …on openvino device
Hello, in this repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models, only a small set of models is supported. Will this feature in llama.cpp offer wider GGUF coverage via something like the parameter mapping described there? A few other questions:
Thank you for your work!
Hi @SearchSavior ,
Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to translate GGML computation graphs directly into OpenVINO operations at runtime. The translation happens through an operation-mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming all the GGML operators are mapped/translated to OpenVINO).
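To make that idea concrete, here is a minimal, hypothetical sketch (not code from this PR; the names `kOpTranslators` and `NodeOutput` are invented for illustration) of how a GGML-op-to-OpenVINO translation table could be organized:

```cpp
// Hypothetical sketch: a translation table from GGML ops to functions that
// emit equivalent OpenVINO graph nodes from already-translated inputs.
#include <functional>
#include <map>
#include <memory>
#include <vector>

#include <openvino/op/add.hpp>
#include <openvino/op/matmul.hpp>
#include <openvino/op/multiply.hpp>
#include "ggml.h"   // ggml_op enum: GGML_OP_ADD, GGML_OP_MUL, GGML_OP_MUL_MAT, ...

using NodeOutput = ov::Output<ov::Node>;
using Translator = std::function<NodeOutput(const std::vector<NodeOutput>&)>;

// Each GGML operation maps to a small builder that creates the matching
// OpenVINO op; unmapped ops would fall back or fail at graph-build time.
static const std::map<ggml_op, Translator> kOpTranslators = {
    {GGML_OP_ADD, [](const std::vector<NodeOutput>& in) -> NodeOutput {
        return std::make_shared<ov::op::v1::Add>(in[0], in[1]);
    }},
    {GGML_OP_MUL, [](const std::vector<NodeOutput>& in) -> NodeOutput {
        return std::make_shared<ov::op::v1::Multiply>(in[0], in[1]);
    }},
    {GGML_OP_MUL_MAT, [](const std::vector<NodeOutput>& in) -> NodeOutput {
        // transpose_b is set here purely for illustration of how layout
        // differences between ggml and OpenVINO could be handled.
        return std::make_shared<ov::op::v0::MatMul>(in[0], in[1], false, true);
    }},
    // ... the remaining GGML operators would be registered the same way.
};
```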
The immediate focus is on runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.
The scope of this PR is primarily performance enablement using OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It’s not introducing a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.
We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm integration details.
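For context, below is a small, hedged sketch (not code from this PR) of OpenVINO's standard asynchronous inference API (ov::InferRequest::start_async / wait), which is the kind of primitive such async operations would map onto:

```cpp
// Hedged sketch: OpenVINO's own async execution primitives that a backend
// could build on for overlapping work across micro-batches or devices.
#include <memory>
#include <openvino/openvino.hpp>

void run_async_example(const std::shared_ptr<ov::Model>& model) {
    ov::Core core;
    // Device name is an example; "CPU", "GPU", and "NPU" are valid OpenVINO targets.
    ov::CompiledModel compiled = core.compile_model(model, "GPU");
    ov::InferRequest request = compiled.create_infer_request();

    request.start_async();   // enqueue inference without blocking the host thread
    // ... the caller could prepare or submit the next chunk of work here ...
    request.wait();          // block until this request's results are ready
}
```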
Hey @ravi9, thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.
Overview
This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference with the existing llama.cpp GGUF model ecosystem, and enables performance improvements via OpenVINO's graph compilation and kernel fusion.
Key Features:
- New backend implementation in ggml/src/ggml-openvino (see the enumeration sketch after this list)
- Supported precisions: FP16 (see Tested Models below)
- Supported devices: Intel® CPUs, GPUs, and NPUs
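As a quick sanity check that the new backend is visible at runtime, a sketch along these lines (using ggml's backend device registry; exact function availability depends on the ggml version and how backends are built) could enumerate the registered devices:

```cpp
// Hedged sketch: enumerate ggml backend devices to check that the OpenVINO
// backend registered itself. Uses the device registry API from ggml-backend.h.
#include <cstdio>
#include "ggml-backend.h"

int main() {
    ggml_backend_load_all();   // loads dynamically built backends, if any
    size_t n = ggml_backend_dev_count();
    for (size_t i = 0; i < n; ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        printf("device %zu: %s (%s)\n", i,
               ggml_backend_dev_name(dev),
               ggml_backend_dev_description(dev));
    }
    return 0;
}
```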
Tested Models
In this PR, the OpenVINO backend supports FP16 GGUF models. The following models have been validated (additional GGUF FP16 models may also be compatible):
- Llama-3.2-1B-Instruct-GGUF
- microsoft/Phi-3-mini-4k-instruct-gguf
- Qwen/Qwen2.5-1.5B-Instruct-GGUF
Work in Progress