[VLM] Optimize video frame preprocessing for LLaVA-NeXT-Video-7B on GPU #3097
base: master
Conversation
Pull request overview
This PR optimizes video frame preprocessing for the LLaVA-NeXT-Video-7B model by implementing GPU-accelerated preprocessing using OpenVINO operations instead of CPU-based preprocessing. The change provides significant performance improvements, reducing first token latency from ~15s (CPU) to ~845ms (GPU with OV preprocessing).
Key changes:
- Added an OpenVINO-based preprocessing model that performs resize, crop, and normalization on GPU (see the sketch after this list)
- Implemented environment variable control to switch between CPU and GPU preprocessing
- Refactored the preprocessing logic to support both CPU (preprocess_frames_cpp) and GPU (preprocess_frames_ov) paths
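To make the first key change concrete, here is a minimal sketch of how a GPU-side resize-and-normalize step can be expressed with OpenVINO's `ov::preprocess::PrePostProcessor`. The PR builds its own dedicated preprocessing model rather than necessarily using this helper, so the helper name, layouts, the CLIP-style mean/scale constants, and the omission of the crop step are assumptions made purely for illustration.

```cpp
#include <memory>
#include <openvino/openvino.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>

// Hypothetical helper (not from the PR): fold resize + normalization into a
// vision model so these steps run on the same device (e.g. GPU) as inference.
std::shared_ptr<ov::Model> add_gpu_preprocessing(std::shared_ptr<ov::Model> model) {
    ov::preprocess::PrePostProcessor ppp(model);
    ppp.input().tensor()
        .set_element_type(ov::element::u8)   // raw video frames arrive as U8
        .set_layout("NHWC")
        .set_spatial_dynamic_shape();        // accept arbitrary frame sizes
    ppp.input().preprocess()
        .convert_element_type(ov::element::f32)
        .resize(ov::preprocess::ResizeAlgorithm::RESIZE_LINEAR)  // to the model's H x W
        .mean({122.77f, 116.75f, 104.09f})   // assumed CLIP mean * 255
        .scale({68.50f, 66.63f, 70.32f});    // assumed CLIP std * 255
    ppp.input().model().set_layout("NCHW");
    return ppp.build();  // preprocessing now runs as part of the compiled graph
}
```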
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/cpp/src/visual_language/llava_next_video/classes.hpp | Added new methods for CPU and GPU preprocessing, added preprocessing model infrastructure and use flag |
| src/cpp/src/visual_language/llava_next_video/classes.cpp | Implemented OpenVINO preprocessing model creation and GPU-accelerated frame preprocessing logic |
```cpp
size_t num_video_tokens = ((float)config.crop_size_height / m_patch_size) *
                          ((float)config.crop_size_width / m_patch_size);
num_video_tokens = num_video_tokens / 4 * num_frames;
```
Copilot AI commented on Dec 4, 2025
The explicit casts to float in the division operations are unnecessary. Both config.crop_size_height and config.crop_size_width will be integer-divided by m_patch_size (which is size_t), and the result is multiplied together. The float casts don't change the outcome since intermediate results are still integers. Consider removing the casts or applying them to the entire expression if floating-point division is intended.
Suggested change:
```diff
- size_t num_video_tokens = ((float)config.crop_size_height / m_patch_size) *
-                           ((float)config.crop_size_width / m_patch_size);
- num_video_tokens = num_video_tokens / 4 * num_frames;
+ size_t num_video_tokens = (static_cast<float>(config.crop_size_height) / static_cast<float>(m_patch_size)) *
+                           (static_cast<float>(config.crop_size_width) / static_cast<float>(m_patch_size));
+ num_video_tokens = static_cast<size_t>(num_video_tokens / 4 * num_frames);
```
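As a quick sanity check on this comment, the small program below shows that the integer and float variants agree whenever the crop size divides evenly by the patch size. The 336x336 crop, 14-pixel patches, and 32 frames are assumed values typical of LLaVA-style vision towers, not numbers taken from this diff.

```cpp
#include <cstdio>
#include <cstddef>

int main() {
    // Assumed values for illustration only (not taken from this diff).
    const std::size_t crop_h = 336, crop_w = 336, patch = 14, num_frames = 32;

    const std::size_t tokens_int = (crop_h / patch) * (crop_w / patch);  // 24 * 24 = 576
    const std::size_t tokens_flt = static_cast<std::size_t>(
        (static_cast<float>(crop_h) / patch) * (static_cast<float>(crop_w) / patch));

    // Both forms give 576 tokens per frame when the crop divides evenly by the
    // patch size, and 576 / 4 * 32 = 4608 video tokens after spatial pooling.
    std::printf("%zu %zu %zu\n", tokens_int, tokens_flt, tokens_int / 4 * num_frames);
    return 0;
}
```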
```cpp
bool can_use_ov_video_preprocess() {
    const char* env = std::getenv("VIDEO_PREPROCESS");
    return !(env && std::string(env) == "CPP");
```
Copilot AI commented on Dec 4, 2025
[nitpick] The double negation logic is harder to read. Consider inverting the condition for clarity: `return !env || std::string(env) != "CPP";` or better yet, `return env == nullptr || std::string(env) != "CPP";`
Suggested change:
```diff
- return !(env && std::string(env) == "CPP");
+ return env == nullptr || std::string(env) != "CPP";
```
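For illustration, here is a standalone sketch of how the environment switch behaves. The function body mirrors the suggested version above, but the surrounding `main()` and the way you run it are assumptions, not part of the PR.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Standalone copy of the check for illustration only; in the PR the real
// method lives on the vision encoder class.
bool can_use_ov_video_preprocess() {
    const char* env = std::getenv("VIDEO_PREPROCESS");
    return env == nullptr || std::string(env) != "CPP";
}

int main() {
    // Unset or any value other than "CPP" -> OV/GPU preprocessing (true);
    // VIDEO_PREPROCESS=CPP -> fall back to the CPU path (false).
    std::cout << std::boolalpha << can_use_ov_video_preprocess() << std::endl;
    return 0;
}
```

Running it as `VIDEO_PREPROCESS=CPP ./check` would print `false` (CPU path), while leaving the variable unset prints `true` (OV/GPU path).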
```cpp
float* frames_data = concatinated_frames.data<float>();
```
Copilot AI commented on Dec 4, 2025
Corrected spelling of 'concatinated_frames' to 'concatenated_frames'.
```cpp
// concat preprocessed frames to single tensor
ov::Shape concat_shape = prepprocessed_frames[0].get_shape();
concat_shape[0] = prepprocessed_frames.size();
ov::Tensor concatinated_frames = ov::Tensor(prepprocessed_frames[0].get_element_type(), concat_shape);
```
Copilot AI commented on Dec 4, 2025
Corrected spelling of 'concatinated_frames' to 'concatenated_frames'.
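For context, this is a minimal sketch of how per-frame tensors might be packed into the single concatenated tensor the snippet above sets up. It assumes contiguous f32 frames of identical shape whose leading dimension is the per-frame batch of 1; it is an illustration, not the PR's exact code.

```cpp
#include <cstring>
#include <vector>
#include <openvino/openvino.hpp>

// Sketch: pack N identically-shaped f32 frame tensors into one tensor whose
// first dimension is the frame count (assumes contiguous, host-accessible data).
ov::Tensor concatenate_frames(const std::vector<ov::Tensor>& frames) {
    ov::Shape concat_shape = frames[0].get_shape();
    concat_shape[0] = frames.size();
    ov::Tensor concatenated(frames[0].get_element_type(), concat_shape);

    float* dst = concatenated.data<float>();          // valid because frames are f32
    const size_t frame_elems = frames[0].get_size();  // elements per frame
    for (size_t i = 0; i < frames.size(); ++i) {
        std::memcpy(dst + i * frame_elems, frames[i].data<float>(), frames[i].get_byte_size());
    }
    return concatenated;
}
```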
```cpp
auto [prepprocessed_frames, num_video_tokens] = vision_encoder->preprocess_frames(frames);

// Use OV or CPU preprocessing based on configuration
auto [prepprocessed_frames, num_video_tokens] = vision_encoder->get_use_ov_preprocess()
```
Copilot AI commented on Dec 4, 2025
Corrected spelling of 'prepprocessed_frames' to 'preprocessed_frames'.
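The diff snippet above is truncated, so for readability here is a hedged sketch of the dispatch it implies. The method names come from the key-changes list earlier in the review; the exact expression in the PR may differ.

```cpp
// Hypothetical reconstruction for illustration only (the quoted diff is truncated):
auto [preprocessed_frames, num_video_tokens] = vision_encoder->get_use_ov_preprocess()
    ? vision_encoder->preprocess_frames_ov(frames)    // OpenVINO (GPU) preprocessing path
    : vision_encoder->preprocess_frames_cpp(frames);  // original CPU path
```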
Force-pushed from 0cd1af1 to 8037bec.
Signed-off-by: Andrew Park <[email protected]>
Description
Optimizes video frame preprocessing for the LLaVA-NeXT-Video-7B model on GPU by creating an OpenVINO preprocessing model that moves the preprocessing operations from CPU to GPU.
Ticket: CVS-177558
Average first token latency (1280x720, 5 s video / 32 frames + 100 input tokens -> generate 128 tokens):
WWB results with video input:
Checklist: