
Add OpenVINO backend #15307


Draft: wants to merge 114 commits into master


Conversation

@wine99 wine99 commented Aug 14, 2025

Overview

This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference for the existing llama.cpp GGUF model ecosystem, with performance improvements coming from OpenVINO's graph compilation and kernel fusion.

Key Features:

  • New backend implementation

    • Added OpenVINO backend in ggml/src/ggml-openvino.
    • Implemented translations for core GGML operations.
  • Supported precisions

    • FP16 GGUF models supported.
    • Initial quantization support available in quant branch.
  • Supported devices

    • Intel CPUs
    • Intel integrated and discrete GPUs
    • Intel NPUs (requires the UD32+ driver; available on Windows now, with Linux support coming soon).

Tested Models

In this PR, the OpenVINO backend supports FP16 GGUF models. The following models have been validated; additional FP16 GGUF models may also be compatible:

Work in Progress

  • Performance and memory optimizations.
  • Broader quantization coverage.
  • Support for additional model architectures.

YangleiZouIntel and others added 30 commits August 14, 2025 17:00
…e model

 * Add OpenVINO ADD operator to llama.cpp. The output is somewhat abnormal and needs further debugging.
@wine99 wine99 marked this pull request as draft August 14, 2025 09:09

SearchSavior commented Aug 19, 2025

Hello,

In this repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models, only a small set of models is supported.

Will this feature in llama.cpp offer wider GGUF coverage via something like the parameter mapping described here:

https://github.com/yangsu2022/GGUF-to-OpenVINO/blob/405a95e300f8307fb4b779a12d46cf86adf5a441/convert_llama3.1_gguf_to_torch.py#L14

A few other questions:

  • What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

  • Is this PR trying to bring in only performance from the OpenVINO runtime to support the llama.cpp use case?

  • Pipeline parallelism is coming in the next release (I think); will that be implemented here for heterogeneous execution in llama.cpp?

Thank you for your work!


ravi9 commented Aug 21, 2025

Hi @SearchSavior ,

Q: Will this feature in llama.cpp offer wider GGUF coverage via something like parameter mapping?

Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to directly translate GGML computation graphs to OpenVINO operations at runtime. The translation happens through a comprehensive operation mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming all of the GGML operators are mapped/translated to OpenVINO).
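
To make that concrete, the following is a heavily simplified sketch of the translation idea. It is illustrative only, not code from this PR: the `translated` map and `resolve_input` helper are hypothetical stand-ins for however the backend tracks tensors it has already converted.

```cpp
// Simplified illustration of GGML-op -> OpenVINO-op translation (sketch only).
// The real backend walks the whole ggml cgraph and assembles an ov::Model
// from nodes like these.
#include <memory>
#include <stdexcept>
#include <unordered_map>

#include <openvino/op/add.hpp>
#include <openvino/op/multiply.hpp>
#include <openvino/openvino.hpp>

#include "ggml.h"

// Hypothetical lookup table filled while walking the graph: each ggml tensor
// (parameter, constant, or previously translated op) maps to its OpenVINO node.
static std::unordered_map<const ggml_tensor *, std::shared_ptr<ov::Node>> translated;

static std::shared_ptr<ov::Node> resolve_input(const ggml_tensor * t) {
    return translated.at(t);
}

static std::shared_ptr<ov::Node> translate_node(const ggml_tensor * node) {
    switch (node->op) {
        case GGML_OP_ADD:
            return std::make_shared<ov::op::v1::Add>(resolve_input(node->src[0]),
                                                     resolve_input(node->src[1]));
        case GGML_OP_MUL:
            return std::make_shared<ov::op::v1::Multiply>(resolve_input(node->src[0]),
                                                          resolve_input(node->src[1]));
        // ... one case per supported op (GGML_OP_MUL_MAT, GGML_OP_SOFT_MAX, GGML_OP_ROPE, ...)
        default:
            throw std::runtime_error("GGML op not supported by the OpenVINO translation");
    }
}
```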

Q: What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

The immediate focus is on runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.
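
For reference, the snippet below shows the standard public OpenVINO compile-and-run flow that provides this acceleration. It is not code from this PR; the tiny a + b model and the "CPU" device string are placeholders for a translated GGML graph and whichever device is targeted.

```cpp
// Illustration of the standard OpenVINO compile-and-run flow (not PR code).
// A trivial a + b model stands in for a translated GGML graph.
#include <memory>

#include <openvino/op/add.hpp>
#include <openvino/op/parameter.hpp>
#include <openvino/openvino.hpp>

int main() {
    auto a = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{1, 4});
    auto b = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{1, 4});
    auto sum = std::make_shared<ov::op::v1::Add>(a, b);
    auto model = std::make_shared<ov::Model>(ov::OutputVector{sum}, ov::ParameterVector{a, b});

    ov::Core core;
    // Compiling for "CPU", "GPU" or "NPU" is the step where OpenVINO applies
    // kernel fusion, memory/layout optimizations and device-specific scheduling.
    auto compiled = core.compile_model(
        model, "CPU", ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));

    auto request = compiled.create_infer_request();
    request.infer();  // inputs left at their default-allocated tensors for brevity
    return 0;
}
```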

Q: Is this PR trying to bring in only performance from the OpenVINO runtime to support the llama.cpp use case?

The scope of this PR is primarily performance enablement: using the OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It's not introducing a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.

Q: Will pipeline parallel / heterogeneous execution be supported here?

We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm integration details.
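
For context, here is a small illustrative sketch (again, not this PR's code) of the OpenVINO async primitives involved; a toy a + b model stands in for one pipeline stage.

```cpp
// Sketch of the OpenVINO async primitives mentioned above (not PR code).
// Two infer requests on a toy a + b model stand in for overlapping work.
#include <memory>

#include <openvino/op/add.hpp>
#include <openvino/op/parameter.hpp>
#include <openvino/openvino.hpp>

int main() {
    auto a = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{1, 4});
    auto b = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{1, 4});
    auto model = std::make_shared<ov::Model>(
        ov::OutputVector{std::make_shared<ov::op::v1::Add>(a, b)},
        ov::ParameterVector{a, b});

    ov::Core core;
    auto compiled = core.compile_model(model, "CPU");

    // Two requests can be in flight at once: submit one, keep submitting (or do
    // CPU-side work), then wait. Primitives like these are what a pipeline-
    // parallel integration would build on.
    auto req0 = compiled.create_infer_request();
    auto req1 = compiled.create_infer_request();
    req0.start_async();
    req1.start_async();
    req0.wait();
    req1.wait();
    return 0;
}
```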

@SearchSavior

Hey @ravi9 ,

Thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.
