
[GGUF] Serialize Generated OV Model for Faster LLMPipeline Init #2218


Open
wants to merge 51 commits into master

Conversation

sammysun0711
Collaborator

@sammysun0711 sammysun0711 commented May 15, 2025

Details:
This PR aims to cache the OV model generated from a GGUF model on disk, for faster subsequent pipeline initialization together with the OpenVINO model cache.

  • Serialize the OV model generated from the GGUF model with the GGUF Reader.
  • If a generated OV model already exists in the same folder as the GGUF model, skip creating it with the GGUF Reader (a sketch of this flow follows below).
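
For illustration, a minimal C++ sketch of the flow described above, assuming a converter like create_from_gguf() from src/cpp/src/gguf_utils/gguf_modeling.cpp (the function name, signature, and helper below are assumptions, not the PR's exact code):

```cpp
#include <filesystem>
#include <memory>
#include <string>
#include <openvino/openvino.hpp>

// Assumed GGUF->OV converter (see gguf_modeling.cpp); the real signature may differ.
std::shared_ptr<ov::Model> create_from_gguf(const std::string& gguf_path);

std::shared_ptr<ov::Model> load_or_create_model(const std::filesystem::path& gguf_path) {
    ov::Core core;
    // Derive the serialized model path from the GGUF file name, e.g.
    // qwen2.5-0.5b-instruct-q4_0.gguf -> qwen2.5-0.5b-instruct-q4_0_openvino_model.xml
    auto xml_path = gguf_path.parent_path() /
                    (gguf_path.stem().string() + "_openvino_model.xml");

    if (std::filesystem::exists(xml_path)) {
        // Reuse the previously serialized model; GGUF load & unpack is skipped.
        return core.read_model(xml_path.string());
    }
    auto model = create_from_gguf(gguf_path.string());
    // ov::serialize writes the .xml plus a companion .bin file next to it.
    ov::serialize(model, xml_path.string());
    return model;
}
```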

Expected behavior:

  • First run w/ GGUF model:

    • build/samples/cpp/text_generation/greedy_causal_lm gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf "Who are you?" gguf_models/qwen2.5_openvino_tokenizer

    • Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
      Loading and unpacking model done. Time: 245ms
      Start generating OpenVINO model...
      Save generated OpenVINO model to: gguf_models/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml done. Time: 423 ms
      Model generation done. Time: 968ms
      I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and

  • 2nd run w/ GGUF model:

    • build/samples/cpp/text_generation/greedy_causal_lm gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf "Who are you?" gguf_models/qwen2.5_openvino_tokenizer
    • Found generated OpenVINO model: gguf_models/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml, skip creating from GGUF model.
      I am Qwen, a large language model created by Alibaba Cloud. I am a language model designed to assist users in generating human-like text, such as writing articles, stories, and even writing books. I am trained on a vast corpus of text data, including books, articles, and other written works. I am also trained on a large corpus of human language data, including written and spoken language. I am designed to provide information and insights to users, and to assist them in their tasks and

@sammysun0711 sammysun0711 changed the title from "[GGUF] Cache Generated OV Model for Faster Initialization" to "[GGUF] Serialize Generated OV Model for Faster Pipeline Initialization" May 15, 2025
@sammysun0711 sammysun0711 changed the title from "[GGUF] Serialize Generated OV Model for Faster Pipeline Initialization" to "[GGUF] Serialize Generated OV Model for Faster Pipeline Init" May 15, 2025
@sammysun0711 sammysun0711 changed the title from "[GGUF] Serialize Generated OV Model for Faster Pipeline Init" to "[GGUF] Serialize Generated OV Model for Faster LLMPipeline Init" May 15, 2025
@sammysun0711 sammysun0711 requested review from Copilot and Wovchena May 15, 2025 14:30
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR improves model initialization performance by caching generated OpenVINO models on disk and reusing them in subsequent runs. Key changes include:

  • In src/cpp/src/utils.cpp, adding a check for an existing cached OpenVINO model based on the GGUF model location.
  • In src/cpp/src/gguf_utils/gguf_modeling.cpp, introducing a new function to serialize and save the generated OpenVINO model, and invoking it during model creation.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
src/cpp/src/utils.cpp | Added logic to reuse a cached OpenVINO model if it exists.
src/cpp/src/gguf_utils/gguf_modeling.cpp | Added a serialization function and integrated it into the model creation flow.

@as-suvorov
Collaborator

@sammysun0711
This proposal looks like implicit model caching: the OpenVINO model is implicitly serialized to disk, and if the GGUF model changes, it seems the outdated serialized OpenVINO model will still be loaded.
Can we reuse the OpenVINO cache_dir property for this scenario?

From the logs I see the GGUF model load time is 245ms. Could you please also provide the loading time for the serialized OpenVINO model?
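
For reference, a minimal sketch of how an application enables OpenVINO's built-in cache via the ov::cache_dir property when constructing a GenAI pipeline (the GGUF path and cache directory are illustrative, and the exact constructor overload used for GGUF models may differ):

```cpp
#include <iostream>
#include <openvino/genai/llm_pipeline.hpp>
#include <openvino/runtime/properties.hpp>

int main() {
    // ov::cache_dir enables OpenVINO's compiled-model cache; the question here
    // is whether the GGUF->OV serialization step can reuse the same directory.
    ov::genai::LLMPipeline pipe(
        "gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf",
        "CPU",
        ov::cache_dir("model_cache"));
    std::cout << pipe.generate("Who are you?", ov::genai::max_new_tokens(100)) << '\n';
}
```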

@sammysun0711
Collaborator Author

Hi @as-suvorov, thanks for your suggestion.

Can we reuse the OpenVINO cache_dir property for this scenario?

Do you mean that if the cache_dir property is set, we can save the generated OV model in cache_dir for reuse? I think it is a good proposal: the user can explicitly control from the application whether to serialize the OV model to disk.

Could you please also provide the loading time for the serialized OpenVINO model?

Sure, I will add the loading time for the serialized OpenVINO model.

@as-suvorov
Collaborator

@sammysun0711 yes, correct. I guess we also need to check if and how OpenVINO invalidates a cached model, and implement the same approach for the GGUF format.

@sammysun0711
Collaborator Author

  • Added support to save the OV model explicitly, based on the ov::cache_dir property.
  • Added time measurement for loading the OV model (a minimal timing sketch follows below).
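
A minimal sketch of the timing measurement, matching the log format in the runs below (the helper name is illustrative, not the PR's code):

```cpp
#include <chrono>
#include <iostream>
#include <memory>
#include <string>
#include <openvino/openvino.hpp>

// Reads a serialized OV model and reports the elapsed time, mirroring the
// "Loading OpenVINO model done. Time: ..." log line shown below.
std::shared_ptr<ov::Model> read_model_timed(ov::Core& core, const std::string& xml_path) {
    auto start = std::chrono::steady_clock::now();
    auto model = core.read_model(xml_path);
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::cout << "Loading OpenVINO model done. Time: " << ms << "ms" << std::endl;
    return model;
}
```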

1st Run w/o model cache:

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 202ms
Start generating OpenVINO model...
Model generation done. Time: 292ms

2nd Run w/ model cache + serialize OV model:

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 189ms
Start generating OpenVINO model...
Save generated OpenVINO model to: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml done. Time: 379 ms
Model generation done. Time: 647ms

3rd Run w/ model cache, using generated OV model:

Found generated OpenVINO model: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml, skip creating from GGUF model.
Loading OpenVINO model done. Time: 71ms

check if and how OpenVINO invalidates a cached model and implement the same approach for the GGUF format

The current naive method uses the GGUF file name to decide whether a new OV model needs to be generated. OpenVINO invalidates an outdated model cache via a hash calculated in compute_hash.hpp, but compute_hash.hpp seems to belong to the dev API, which is not accessible unless openvino.genai is statically linked to openvino. @as-suvorov, may I know if you have any suggestions?
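
One cheaper alternative to content hashing, sketched below for discussion only (this is not what the PR implements): key the cache on file size and last-write time instead of a SHA1 over the whole file. It reads no file contents, so it costs microseconds regardless of model size, at the price of missing edits that preserve both size and mtime:

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Cheap fingerprint of the GGUF file based on metadata only.
std::string gguf_fingerprint(const std::filesystem::path& gguf_path) {
    auto size = std::filesystem::file_size(gguf_path);
    auto mtime = std::filesystem::last_write_time(gguf_path)
                     .time_since_epoch().count();
    return std::to_string(size) + "_" + std::to_string(mtime);
}

// Compares a stored fingerprint (written next to the serialized model)
// against the current GGUF file to decide whether the cache is still valid.
bool cached_model_is_fresh(const std::filesystem::path& gguf_path,
                           const std::filesystem::path& stamp_file) {
    std::ifstream in(stamp_file);
    std::string stored;
    return in && std::getline(in, stored) && stored == gguf_fingerprint(gguf_path);
}
```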

@sammysun0711
Collaborator Author

sammysun0711 commented May 19, 2025

I added an initial hash function in a test branch, but found that the hash function's overhead is higher than the model construction itself:

Loading and unpacking model from: gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf
Loading and unpacking model done. Time: 185ms
Compute GGUF hash with SHA1 done. Time: 856ms
Start generating OpenVINO model...
Save generated OpenVINO model to: model_cache/qwen2.5-0.5b-instruct-q4_0_openvino_model.xml done. Time: 567 ms
Model generation done. Time: 1702ms

Although we could try to optimize the hash method with parallelism, the hash function would still be called on every GGUF load, which is not ideal if we want to skip the GGUF load & unpack step entirely when an OV model already exists in the model cache.

I would suggest serializing the OV model in cache_dir only when the user sets the cache_dir property explicitly, and logging that an existing cached OV model skips the GGUF->OV conversion without cache invalidation. The user may need to clean up the model cache and regenerate the OV model if the GGUF model is updated. A sketch of this behavior follows below.

We can add a hash-based check once compute_hash.hpp is exposed as a public API for reuse.
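
A sketch of the proposed behavior, with illustrative names (get_model and create_from_gguf are assumptions, not the PR's code): serialize only when ov::cache_dir was passed explicitly, and reuse a pre-existing cached model without any invalidation check:

```cpp
#include <filesystem>
#include <iostream>
#include <memory>
#include <string>
#include <openvino/openvino.hpp>

std::shared_ptr<ov::Model> create_from_gguf(const std::string& gguf_path);  // assumed converter

std::shared_ptr<ov::Model> get_model(const std::string& gguf_path, const ov::AnyMap& properties) {
    ov::Core core;
    auto it = properties.find(ov::cache_dir.name());
    if (it == properties.end()) {
        // No explicit cache_dir: convert every time, never serialize implicitly.
        return create_from_gguf(gguf_path);
    }
    auto xml = std::filesystem::path(it->second.as<std::string>()) /
               (std::filesystem::path(gguf_path).stem().string() + "_openvino_model.xml");
    if (std::filesystem::exists(xml)) {
        // No invalidation: the user must clear the cache after updating the GGUF file.
        std::cout << "Found generated OpenVINO model: " << xml.string()
                  << ", skip creating from GGUF model." << std::endl;
        return core.read_model(xml.string());
    }
    auto model = create_from_gguf(gguf_path);
    ov::serialize(model, xml.string());
    return model;
}
```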

@as-suvorov
Collaborator

> I would suggest serializing the OV model in cache_dir only when the user sets the cache_dir property explicitly […] We can add a hash-based check once compute_hash.hpp is exposed as a public API for reuse.

@Wovchena What do you think?

@sammysun0711 sammysun0711 requested a review from Wovchena May 20, 2025 05:09
@sammysun0711 sammysun0711 added this to the 2025.2 milestone May 20, 2025
@github-actions github-actions bot added the "category: speculative decoding" label May 30, 2025
@github-actions github-actions bot removed the "category: speculative decoding" label May 30, 2025
@@ -846,8 +846,9 @@ def test_pipelines_with_gguf_generate(pipeline_type, model_ids):

@pytest.mark.parametrize("pipeline_type", get_gguf_pipeline_types())
@pytest.mark.parametrize("model_ids", get_gguf_model_list())
@pytest.mark.parametrize("enable_save_ov_model", [False])
Member


why?

Collaborator Author

@sammysun0711 sammysun0711 Jun 4, 2025


For test purposes only: trying to figure out why the MacOS test failed with a segmentation fault: https://github.com/openvinotoolkit/openvino.genai/actions/runs/15420607456/job/43434100193

It seems macos-13 has a smaller memory size: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories

Does it make sense to create a separate test case, reducing the check from 3 pipeline tests (HF/GGUF/OV native) to only 2 (GGUF/OV native), to save memory?

Member


@mryzhov, please advise another MacOS runner with more memory.


Collaborator Author


The test results show that this is not related to insufficient memory (https://github.com/openvinotoolkit/openvino.genai/actions/runs/15436902324/job/43447403549?pr=2218); I will revert back to macos-13 and continue investigating.

Collaborator Author

@sammysun0711 sammysun0711 Jun 4, 2025


@akladiev provided me with a local MacOS machine (macos-12 + 64GB memory; during the test, memory usage was ~30GB plus ~15GB of swap), and I cannot reproduce the same segmentation fault with the GGUF-related tests in test_llm_pipeline.py.

I found the same issue in another PR (Bump actions/download-artifact from 4.1.9 to 4.3.0 · openvinotoolkit/openvino.genai@c3fe438), so I think it is not introduced by my PR. Could you please help to investigate this issue further?

@rkazants rkazants self-requested a review June 4, 2025 06:39
Member

@rkazants rkazants left a comment


need review again

@github-actions github-actions bot added the "category: GHA" label Jun 4, 2025
@github-actions github-actions bot removed the "category: GHA" label Jun 6, 2025
Labels
category: continuous batching · category: CPP API · category: GGUF · category: LLM · category: tokenizers · no-match-files