[GGUF] Serialize Generated OV Model for Faster LLMPipeline Init #2218
base: master
Conversation
Pull Request Overview
This PR enhances the model initialization performance by caching generated OpenVINO models on disk and reusing them for subsequent runs. Key changes include:
- In src/cpp/src/utils.cpp, adding a check for an existing cached OpenVINO model based on the GGUF model location.
- In src/cpp/src/gguf_utils/gguf_modeling.cpp, introducing a new function to serialize and save the generated OpenVINO model, and invoking it during model creation.
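The cache-lookup idea from src/cpp/src/utils.cpp can be sketched in Python (the naming scheme below is a hypothetical illustration, not the exact C++ logic): the pipeline looks for a serialized OpenVINO IR next to the GGUF file and reuses it when both IR parts are present.

```python
from pathlib import Path
from typing import Optional

def find_cached_ov_model(gguf_path: str) -> Optional[Path]:
    """Return the serialized OV model path next to the GGUF file, if present.

    Hypothetical naming scheme: the cached IR shares the GGUF file's stem.
    An OpenVINO IR consists of two files, model.xml and model.bin, so both
    must exist before the cache can be reused.
    """
    candidate = Path(gguf_path).with_suffix(".xml")
    if candidate.exists() and candidate.with_suffix(".bin").exists():
        return candidate
    return None
```

If the lookup returns a path, the GGUF load-and-unpack step can be skipped entirely on subsequent runs.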
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/cpp/src/utils.cpp | Added logic to reuse a cached OpenVINO model if it exists. |
| src/cpp/src/gguf_utils/gguf_modeling.cpp | Added a serialization function and integrated it into the model creation flow. |
@sammysun0711 From the logs I see the GGUF model load time is 245ms. Could you please also provide the loading time for the OpenVINO serialized model?
Hi @as-suvorov, thanks for your suggestion.
Do you mean if
Sure, I will add the loading time for the OpenVINO serialized model.
@sammysun0711 yes, correct. I guess we also need to check if and how OpenVINO invalidates the cached model, and implement the same approach for the GGUF format.
…rement for loading OV model.
1st Run w/o model cache:
2nd Run w/ model cache + serialize OV model:
3rd Run w/ model cache using generated OV model:
The current naive method uses the GGUF file name to check whether a new OV model needs to be regenerated. OpenVINO invalidates an outdated model cache via a hash computed by compute_hash.hpp, but compute_hash.hpp appears to belong to the dev API, which is not accessible unless openvino.genai links statically to openvino. @as-suvorov, may I know if you have any suggestions?
I added an initial hash function in a test branch, but found that the hash function's overhead is higher than the model construction itself.
Although the hash method could be optimized with parallelism, the hash function would still be called on every GGUF load, which is not ideal if we want to skip the GGUF load & unpack process when an OV model already exists in the model cache. I would like to suggest serializing the OV model into cache_dir when the user sets the cache_dir property explicitly, and adding a verbose message that an existing cached OV model will skip the GGUF->OV conversion without cache invalidation. Users may need to clean up the model cache and regenerate the OV model if the GGUF model is updated. We can add a hash-based check once compute_hash.hpp is exposed as a public API for reuse.
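The trade-off discussed above can be illustrated with a stdlib Python sketch (hashlib here is only a stand-in for compute_hash.hpp; the function name and fast-path scheme are hypothetical): a full content hash is accurate but must read the entire multi-gigabyte GGUF file, while a size + mtime fingerprint is nearly free but only approximate.

```python
import hashlib
import os

def gguf_fingerprint(path: str, full_hash: bool = False) -> str:
    """Cache-invalidation key for a GGUF file.

    full_hash=True hashes the whole file (accurate, but the read cost scales
    with file size, which is the overhead problem discussed above);
    full_hash=False uses size + mtime as a fast, approximate stand-in.
    """
    st = os.stat(path)
    if not full_hash:
        return f"{st.st_size}-{int(st.st_mtime)}"
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in 1 MiB chunks to avoid loading GB-scale models into memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

The fast path never opens the file, which is why a stat-based check avoids the cost that makes full hashing slower than regenerating the model.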
@Wovchena What do you think?
…ig to utils for re-use, add extract_npu_properties
…rom_config to utils for re-use, add extract_npu_properties" This reverts commit f8c84f5.
@@ -846,8 +846,9 @@ def test_pipelines_with_gguf_generate(pipeline_type, model_ids):

@pytest.mark.parametrize("pipeline_type", get_gguf_pipeline_types())
@pytest.mark.parametrize("model_ids", get_gguf_model_list())
@pytest.mark.parametrize("enable_save_ov_model", [False])
why?
For test purposes only: trying to figure out why the macOS test failed with a segmentation fault: https://github.com/openvinotoolkit/openvino.genai/actions/runs/15420607456/job/43434100193
It seems macos-13 has a smaller memory size: https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-public-repositories
Does it make sense to create a separate test case, reducing the check from 3 pipeline tests (HF/GGUF/OV native) to 2 (GGUF/OV native) to save memory?
@mryzhov, please advise another macOS runner with more memory.
@mryzhov, can we try macos-13-large (30GB memory) for debugging purposes first? https://docs.github.com/en/actions/using-github-hosted-runners/using-larger-runners/running-jobs-on-larger-runners?platform=mac#available-macos-larger-runners
The test results show the failure is not related to insufficient memory: https://github.com/openvinotoolkit/openvino.genai/actions/runs/15436902324/job/43447403549?pr=2218. I will revert back to macos-13 and continue the investigation.
@akladiev provided me a local macOS machine with macOS 12 + 64GB memory (during the test, memory usage was ~30GB plus ~15GB of swap); I cannot reproduce the same segmentation fault with the GGUF-related tests in test_llm_pipeline.py.
I found the same issue in another PR: Bump actions/download-artifact from 4.1.9 to 4.3.0 · openvinotoolkit/openvino.genai@c3fe438, so I think it is not introduced by my PR. Could you please help investigate this issue further?
Needs review again.
…se by limited memory storage" This reverts commit 3ef08bf.
Details:
This PR aims to cache the OV model generated from a GGUF model on disk, for faster subsequent pipeline initialization with the OpenVINO model cache.
Expected behavior:
First run w/ GGUF model:
build/samples/cpp/text_generation/greedy_causal_lm gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf "Who are you?" gguf_models/qwen2.5_openvino_tokenizer
2nd run w/ GGUF model:
build/samples/cpp/text_generation/greedy_causal_lm gguf_models/qwen2.5-0.5b-instruct-q4_0.gguf "Who are you?" gguf_models/qwen2.5_openvino_tokenizer
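The first-run/second-run behavior above can be sketched as a load-or-build helper (stdlib Python; the cache file name and JSON stand-in are hypothetical, whereas the real code serializes an OpenVINO IR):

```python
import json
import os

def load_or_build(gguf_path: str, build_fn):
    """Reuse a serialized model beside the GGUF file; build and save it otherwise."""
    cache_path = gguf_path + ".ov.json"  # hypothetical on-disk artifact
    if os.path.exists(cache_path):       # 2nd run: skip GGUF -> OV conversion
        with open(cache_path) as f:
            return json.load(f)
    model = build_fn(gguf_path)          # 1st run: full conversion
    with open(cache_path, "w") as f:
        json.dump(model, f)              # serialize for the next run
    return model
```

On the second invocation the expensive build function is never called, which is where the reported init-time improvement comes from.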