
[GGUF] Serialize Generated OV Model for Faster LLMPipeline Init #2218


Open
wants to merge 51 commits into master

Conversation

sammysun0711
Collaborator

@sammysun0711 sammysun0711 commented May 15, 2025

Details:
This PR aims to cache the OV model generated from a GGUF model on disk, for faster subsequent pipeline initialization together with the OpenVINO model cache.

  • Serialize the OV model generated from the GGUF model with the GGUF Reader.
  • If a generated OV model already exists in the same folder as the GGUF model, skip creating it with the GGUF Reader (a sketch of this flow follows below).
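
For illustration, a minimal C++ sketch of the flow described above, assuming a converter like create_from_gguf() from src/cpp/src/gguf_utils/gguf_modeling.cpp (the function name, signature, and helper below are assumptions, not the PR's exact code):

```cpp
#include <filesystem>
#include <memory>
#include <string>
#include <openvino/openvino.hpp>

// Assumed GGUF->OV converter (see gguf_modeling.cpp); the real signature may differ.
std::shared_ptr<ov::Model> create_from_gguf(const std::string& gguf_path);

std::shared_ptr<ov::Model> load_or_create_model(const std::filesystem::path& gguf_path) {
    ov::Core core;
    // Derive the serialized model path from the GGUF file name, e.g.
    // qwen2.5-0.5b-instruct-q4_0.gguf -> qwen2.5-0.5b-instruct-q4_0_openvino_model.xml
    auto xml_path = gguf_path.parent_path() /
                    (gguf_path.stem().string() + "_openvino_model.xml");

    if (std::filesystem::exists(xml_path)) {
        // Reuse the previously serialized model; GGUF load & unpack is skipped.
        return core.read_model(xml_path.string());
    }
    auto model = create_from_gguf(gguf_path.string());
    // ov::serialize writes the .xml plus a companion .bin file next to it.
    ov::serialize(model, xml_path.string());
    return model;
}
```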

Expected behavior:

  • First run w/ GGUF model:

    • build/samples/cpp/text_generation/greedy_causal_lm gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf "Who are you?" gguf_models/qwen2.5_openvino_tokenizer

    • Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
      Loading and unpacking model done. Time: 245ms
      Start generating OpenVINO model...
      Save generated OpenVINO model to: gguf_models/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml done. Time: 423 ms
      Model generation done. Time: 968ms
      I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and

  • 2nd run w/ GGUF model:

    • build/samples/cpp/text_generation/greedy_causal_lm gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf "Who are you?" gguf_models/qwen2.5_openvino_tokenizer
    • Found generated OpenVINO model: gguf_models/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml, skip creating from GGUF model.
      I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and

@sammysun0711 sammysun0711 changed the title from "[GGUF] Cache Generated OV Model for Faster Initialization" to "[GGUF] Serialize Generated OV Model for Faster Pipeline Initialization" May 15, 2025
@sammysun0711 sammysun0711 changed the title from "[GGUF] Serialize Generated OV Model for Faster Pipeline Initialization" to "[GGUF] Serialize Generated OV Model for Faster Pipeline Init" May 15, 2025
@sammysun0711 sammysun0711 changed the title from "[GGUF] Serialize Generated OV Model for Faster Pipeline Init" to "[GGUF] Serialize Generated OV Model for Faster LLMPipeline Init" May 15, 2025
@sammysun0711 sammysun0711 requested review from Copilot and Wovchena May 15, 2025 14:30
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR improves model initialization performance by caching generated OpenVINO models on disk and reusing them in subsequent runs. Key changes include:

  • In src/cpp/src/utils.cpp, adding a check for an existing cached OpenVINO model based on the GGUF model location.
  • In src/cpp/src/gguf_utils/gguf_modeling.cpp, introducing a new function to serialize and save the generated OpenVINO model, and invoking it during model creation.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
src/cpp/src/utils.cpp | Added logic to reuse a cached OpenVINO model if it exists.
src/cpp/src/gguf_utils/gguf_modeling.cpp | Added a serialization function and integrated it into the model creation flow.

@as-suvorov
Collaborator

@sammysun0711
This proposal looks like implicit model caching: the OpenVINO model is implicitly serialized to disk, and if the GGUF model changes, it seems the outdated serialized OpenVINO model will still be loaded.
Can we reuse the OpenVINO cache_dir property for this scenario?

From the logs I see the GGUF model load time is 245ms. Could you please also provide the loading time for the serialized OpenVINO model?
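
For reference, a minimal sketch of how an application enables OpenVINO's built-in cache via the ov::cache_dir property when constructing a GenAI pipeline (the GGUF path and cache directory are illustrative, and the exact constructor overload used for GGUF models may differ):

```cpp
#include <iostream>
#include <openvino/genai/llm_pipeline.hpp>
#include <openvino/runtime/properties.hpp>

int main() {
    // ov::cache_dir enables OpenVINO's compiled-model cache; the question here
    // is whether the GGUF->OV serialization step can reuse the same directory.
    ov::genai::LLMPipeline pipe(
        "gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf",
        "CPU",
        ov::cache_dir("model_cache"));
    std::cout << pipe.generate("Who are you?", ov::genai::max_new_tokens(100)) << '\n';
}
```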

@sammysun0711
Collaborator Author

Hi @as-suvorov, thanks for your suggestion.

Can we reuse the OpenVINO cache_dir property for this scenario?

Do you mean that if the cache_dir property is set, we can save the generated OV model in cache_dir for reuse? I think it is a good proposal: the user can explicitly control from the application whether to serialize the OV model to disk.

Could you please also provide the loading time for the serialized OpenVINO model?

Sure, I will add the loading time for the serialized OpenVINO model.

@as-suvorov
Collaborator

@sammysun0711 yes, correct. I guess we also need to check if and how OpenVINO invalidates a cached model, and implement the same approach for the GGUF format.

@sammysun0711
Collaborator Author

  • Added support to save the OV model explicitly, based on the ov::cache_dir property.
  • Added time measurement for loading the OV model (a minimal timing sketch follows below).
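
A minimal sketch of the timing measurement, matching the log format in the runs below (the helper name is illustrative, not the PR's code):

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <string>
#include <openvino/openvino.hpp>

// Reads a serialized OV model and reports the elapsed time, mirroring the
// "Loading OpenVINO model done. Time: ..." log line shown below.
std::shared_ptr<ov::Model> read_model_timed(ov::Core& core, const std::string& xml_path) {
    auto start = std::chrono::steady_clock::now();
    auto model = core.read_model(xml_path);
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::cout << "Loading OpenVINO model done. Time: " << ms << "ms" << std::endl;
    return model;
}
```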

1st Run w/o model cache:

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 202ms
Start generating OpenVINO model...
Model generation done. Time: 292ms

2nd Run w/ model cache + serialize OV model:

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 189ms
Start generating OpenVINO model...
Save generated OpenVINO model to: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml done. Time: 379 ms
Model generation done. Time: 647ms

3rd Run w/ model cache, using generated OV model:

Found generated OpenVINO model: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml, skip creating from GGUF model.
Loading OpenVINO model done. Time: 71ms

check if and how OpenVINO invalidates a cached model and implement the same approach for the GGUF format

The current naive method uses the GGUF file name to decide whether a new OV model needs to be generated. OpenVINO invalidates an outdated model cache via a hash calculated in compute_hash.hpp, but compute_hash.hpp seems to belong to the dev API, which is not accessible unless openvino.genai is statically linked to openvino. @as-suvorov, may I know if you have any suggestions?
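
One cheaper alternative to content hashing, sketched below for discussion only (this is not what the PR implements): key the cache on file size and last-write time instead of a SHA1 over the whole file. It reads no file contents, so it costs microseconds regardless of model size, at the price of missing edits that preserve both size and mtime:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Cheap fingerprint of the GGUF file based on metadata only.
std::string gguf_fingerprint(const std::filesystem::path& gguf_path) {
    auto size = std::filesystem::file_size(gguf_path);
    auto mtime = std::filesystem::last_write_time(gguf_path)
                     .time_since_epoch().count();
    return std::to_string(size) + "_" + std::to_string(mtime);
}

// Compares a stored fingerprint (written next to the serialized model)
// against the current GGUF file to decide whether the cache is still valid.
bool cached_model_is_fresh(const std::filesystem::path& gguf_path,
                           const std::filesystem::path& stamp_file) {
    std::ifstream in(stamp_file);
    std::string stored;
    return in && std::getline(in, stored) && stored == gguf_fingerprint(gguf_path);
}
```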

@sammysun0711
Collaborator Author

sammysun0711 commented May 19, 2025

I added an initial hash function in a test branch, but found that the hash function's overhead is higher than the model construction itself:

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 185ms
Compute GGUF hash with SHA1 done. Time: 856ms
Start generating OpenVINO model...
Save generated OpenVINO model to: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml done. Time: 567 ms
Model generation done. Time: 1702ms

Although we could try to optimize the hash method with parallelism, the hash function would still be called on every GGUF load, which is not ideal if we want to skip the GGUF load & unpack step entirely when an OV model already exists in the model cache.

I would suggest serializing the OV model in cache_dir only when the user sets the cache_dir property explicitly, and logging that an existing cached OV model skips the GGUF->OV conversion without cache invalidation. The user may need to clean up the model cache and regenerate the OV model if the GGUF model is updated. A sketch of this behavior follows below.

We can add a hash-based check once compute_hash.hpp is exposed as a public API for reuse.
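
A sketch of the proposed behavior, with illustrative names (get_model and create_from_gguf are assumptions, not the PR's code): serialize only when ov::cache_dir was passed explicitly, and reuse a pre-existing cached model without any invalidation check:

```cpp
#include <filesystem>
#include <iostream>
#include <memory>
#include <string>
#include <openvino/openvino.hpp>

std::shared_ptr<ov::Model> create_from_gguf(const std::string& gguf_path);  // assumed converter

std::shared_ptr<ov::Model> get_model(const std::string& gguf_path, const ov::AnyMap& properties) {
    ov::Core core;
    auto it = properties.find(ov::cache_dir.name());
    if (it == properties.end()) {
        // No explicit cache_dir: convert every time, never serialize implicitly.
        return create_from_gguf(gguf_path);
    }
    auto xml = std::filesystem::path(it->second.as<std::string>()) /
               (std::filesystem::path(gguf_path).stem().string() + "_openvino_model.xml");
    if (std::filesystem::exists(xml)) {
        // No invalidation: the user must clear the cache after updating the GGUF file.
        std::cout << "Found generated OpenVINO model: " << xml.string()
                  << ", skip creating from GGUF model." << std::endl;
        return core.read_model(xml.string());
    }
    auto model = create_from_gguf(gguf_path);
    ov::serialize(model, xml.string());
    return model;
}
```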

@as-suvorov
Collaborator

> I would suggest serializing the OV model in cache_dir only when the user sets the cache_dir property explicitly […] We can add a hash-based check once compute_hash.hpp is exposed as a public API for reuse.

@Wovchena What do you think?

@sammysun0711 sammysun0711 requested a review from Wovchena May 20, 2025 05:09
@sammysun0711 sammysun0711 added this to the 2025.2 milestone May 20, 2025
@github-actions github-actions bot added the "category: speculative decoding" label May 30, 2025
@github-actions github-actions bot removed the "category: speculative decoding" label May 30, 2025
@@ -846,8 +846,9 @@ def test_pipelines_with_gguf_generate(pipeline_type, model_ids):

@pytest.mark.parametrize("pipeline_type", get_gguf_pipeline_types())
@pytest.mark.parametrize("model_ids", get_gguf_model_list())
@pytest.mark.parametrize("enable_save_ov_model", [False])
Member


why?

Collaborator Author

@sammysun0711 sammysun0711 Jun 4, 2025


For test purposes only: trying to figure out why the MacOS test failed with a segmentation fault: https://github.com/openvinotoolkit/openvino.genai/actions/runs/15420607456/job/43434100193

It seems macos-13 has a smaller memory size: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

Does it make sense to create a separate test case, reducing the check from 3 pipeline tests (HF/GGUF/OV native) to only 2 (GGUF/OV native), to save memory?

Member


@mryzhov, please advise another MacOS runner with more memory.


Collaborator Author


The test results show that this is not related to insufficient memory (https://github.com/openvinotoolkit/openvino.genai/actions/runs/15436902324/job/43447403549?pr=2218); I will revert back to macos-13 and continue investigating.

Collaborator Author

@sammysun0711 sammysun0711 Jun 4, 2025


@akladiev provided me with a local MacOS machine (macos-12 + 64GB memory; during the test, memory usage was ~30GB plus ~15GB of swap), and I cannot reproduce the same segmentation fault with the GGUF-related tests in test_llm_pipeline.py.

I found the same issue in another PR (Bump actions/download-artifact from 4.1.9 to 4.3.0 · openvinotoolkit/openvino.genai@c3fe438), so I think it is not introduced by my PR. Could you please help to investigate this issue further?

@rkazants rkazants self-requested a review June 4, 2025 06:39
Member

@rkazants rkazants left a comment


need review again

@github-actions github-actions bot added the "category: GHA" label Jun 4, 2025
@github-actions github-actions bot removed the "category: GHA" label Jun 6, 2025
Labels
category: continuous batching · category: CPP API · category: GGUF · category: LLM · category: tokenizers · no-match-files