
Add OpenVINO backend #15307


Draft: wants to merge 114 commits into master


Conversation

@wine99 wine99 commented Aug 14, 2025

Overview

This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference for the existing llama.cpp GGUF model ecosystem, with performance improvements coming from OpenVINO's graph compilation and kernel fusion.

Key Features:

  • New backend implementation

    • Added OpenVINO backend in ggml/src/ggml-openvino.
    • Implemented translations for core GGML operations.
  • Supported precisions

    • FP16 GGUF models supported.
    • Initial quantization support available in quant branch.
  • Supported devices

    • Intel CPUs
    • Intel integrated and discrete GPUs
    • Intel NPUs (requires the UD32+ driver; available on Windows now, with Linux support coming soon).

Tested Models

In this PR, the OpenVINO backend supports FP16 GGUF models. The following models have been validated; additional FP16 GGUF models may also be compatible:

Work in Progress

  • Performance and memory optimizations.
  • Broader quantization coverage.
  • Support for additional model architectures.

YangleiZouIntel and others added 30 commits August 14, 2025 17:00
…e model

 * Add OpenVINO ADD operator to llama.cpp. The output is somewhat abnormal and needs further debugging.
@wine99 wine99 marked this pull request as draft August 14, 2025 09:09

SearchSavior commented Aug 19, 2025

Hello,

In this repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models, only a small set of models is supported.

Will this feature in llama.cpp offer wider GGUF coverage via something like the parameter mapping described here:

https://github.com/yangsu2022/GGUF-to-OpenVINO/blob/405a95e300f8307fb4b779a12d46cf86adf5a441/convert_llama3.1_gguf_to_torch.py#L14

A few other questions:

  • What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

  • Is this PR trying to bring in only performance from the OpenVINO runtime to support the llama.cpp use case?

  • Pipeline parallelism is coming in the next release (I think); will that be implemented here for heterogeneous execution in llama.cpp?

Thank you for your work!


ravi9 commented Aug 21, 2025

Hi @SearchSavior ,

Q: Will this feature in llama.cpp offer wider GGUF coverage via something like parameter mapping?

Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to directly translate GGML computation graphs to OpenVINO operations at runtime. The translation happens through a comprehensive operation mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming all of the GGML operators are mapped/translated to OpenVINO).
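
To make that concrete, the following is a heavily simplified sketch of the translation idea. It is illustrative only, not code from this PR: the `translated` map and `resolve_input` helper are hypothetical stand-ins for however the backend tracks tensors it has already converted.

```cpp
// Simplified illustration of GGML-op -> OpenVINO-op translation (sketch only).
// The real backend walks the whole ggml cgraph and assembles an ov::Model
// from nodes like these.
#include <memory>
#include <stdexcept>
#include <unordered_map>

#include <openvino/op/add.hpp>
#include <openvino/op/multiply.hpp>
#include <openvino/openvino.hpp>

#include "ggml.h"

// Hypothetical lookup table filled while walking the graph: each ggml tensor
// (parameter, constant, or previously translated op) maps to its OpenVINO node.
static std::unordered_map<const ggml_tensor *, std::shared_ptr<ov::Node>> translated;

static std::shared_ptr<ov::Node> resolve_input(const ggml_tensor * t) {
    return translated.at(t);
}

static std::shared_ptr<ov::Node> translate_node(const ggml_tensor * node) {
    switch (node->op) {
        case GGML_OP_ADD:
            return std::make_shared<ov::op::v1::Add>(resolve_input(node->src[0]),
                                                     resolve_input(node->src[1]));
        case GGML_OP_MUL:
            return std::make_shared<ov::op::v1::Multiply>(resolve_input(node->src[0]),
                                                          resolve_input(node->src[1]));
        // ... one case per supported op (GGML_OP_MUL_MAT, GGML_OP_SOFT_MAX, GGML_OP_ROPE, ...)
        default:
            throw std::runtime_error("GGML op not supported by the OpenVINO translation");
    }
}
```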

Q: What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

The immediate focus is on runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.
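
For reference, the snippet below shows the standard public OpenVINO compile-and-run flow that provides this acceleration. It is not code from this PR; the tiny a + b model and the "CPU" device string are placeholders for a translated GGML graph and whichever device is targeted.

```cpp
// Illustration of the standard OpenVINO compile-and-run flow (not PR code).
// A trivial a + b model stands in for a translated GGML graph.
#include <memory>

#include <openvino/op/add.hpp>
#include <openvino/op/parameter.hpp>
#include <openvino/openvino.hpp>

int main() {
    auto a = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{1, 4});
    auto b = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{1, 4});
    auto sum = std::make_shared<ov::op::v1::Add>(a, b);
    auto model = std::make_shared<ov::Model>(ov::OutputVector{sum}, ov::ParameterVector{a, b});

    ov::Core core;
    // Compiling for "CPU", "GPU" or "NPU" is the step where OpenVINO applies
    // kernel fusion, memory/layout optimizations and device-specific scheduling.
    auto compiled = core.compile_model(
        model, "CPU", ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));

    auto request = compiled.create_infer_request();
    request.infer();  // inputs left at their default-allocated tensors for brevity
    return 0;
}
```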

Q: Is this PR trying to bring in only performance from the OpenVINO runtime to support the llama.cpp use case?

The scope of this PR is primarily performance enablement: using the OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It's not introducing a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.

Q: Will pipeline parallel / heterogeneous execution be supported here?

We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm integration details.
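
For context, here is a small illustrative sketch (again, not this PR's code) of the OpenVINO async primitives involved; a toy a + b model stands in for one pipeline stage.

```cpp
// Sketch of the OpenVINO async primitives mentioned above (not PR code).
// Two infer requests on a toy a + b model stand in for overlapping work.
#include <memory>

#include <openvino/op/add.hpp>
#include <openvino/op/parameter.hpp>
#include <openvino/openvino.hpp>

int main() {
    auto a = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{1, 4});
    auto b = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{1, 4});
    auto model = std::make_shared<ov::Model>(
        ov::OutputVector{std::make_shared<ov::op::v1::Add>(a, b)},
        ov::ParameterVector{a, b});

    ov::Core core;
    auto compiled = core.compile_model(model, "CPU");

    // Two requests can be in flight at once: submit one, keep submitting (or do
    // CPU-side work), then wait. Primitives like these are what a pipeline-
    // parallel integration would build on.
    auto req0 = compiled.create_infer_request();
    auto req1 = compiled.create_infer_request();
    req0.start_async();
    req1.start_async();
    req0.wait();
    req1.wait();
    return 0;
}
```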

@SearchSavior

Hey @ravi9 ,

Thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.
