[VLM] Optimize video frame preprocessing for LLaVA-NeXT-Video-7B on GPU #3097
base: master
Conversation
Pull request overview
This PR optimizes video frame preprocessing for the LLaVA-NeXT-Video-7B model by implementing GPU-accelerated preprocessing using OpenVINO operations instead of CPU-based preprocessing. The change provides significant performance improvements, reducing first token latency from ~15s (CPU) to ~845ms (GPU with OV preprocessing).
Key changes:
- Added an OpenVINO-based preprocessing model that performs resize, crop, and normalization on GPU (see the sketch after this list)
- Implemented environment variable control to switch between CPU and GPU preprocessing
- Refactored the preprocessing logic to support both CPU (preprocess_frames_cpp) and GPU (preprocess_frames_ov) paths
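To make the first key change concrete, here is a minimal sketch of how a GPU-side resize-and-normalize step can be expressed with OpenVINO's `ov::preprocess::PrePostProcessor`. The PR builds its own dedicated preprocessing model rather than necessarily using this helper, so the helper name, layouts, the CLIP-style mean/scale constants, and the omission of the crop step are assumptions made purely for illustration.

```cpp
#include <memory>
#include <openvino/openvino.hpp>
#include <openvino/core/preprocess/pre_post_process.hpp>

// Hypothetical helper (not from the PR): fold resize + normalization into a
// vision model so these steps run on the same device (e.g. GPU) as inference.
std::shared_ptr<ov::Model> add_gpu_preprocessing(std::shared_ptr<ov::Model> model) {
    ov::preprocess::PrePostProcessor ppp(model);
    ppp.input().tensor()
        .set_element_type(ov::element::u8)   // raw video frames arrive as U8
        .set_layout("NHWC")
        .set_spatial_dynamic_shape();        // accept arbitrary frame sizes
    ppp.input().preprocess()
        .convert_element_type(ov::element::f32)
        .resize(ov::preprocess::ResizeAlgorithm::RESIZE_LINEAR)  // to the model's H x W
        .mean({122.77f, 116.75f, 104.09f})   // assumed CLIP mean * 255
        .scale({68.50f, 66.63f, 70.32f});    // assumed CLIP std * 255
    ppp.input().model().set_layout("NCHW");
    return ppp.build();  // preprocessing now runs as part of the compiled graph
}
```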
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/cpp/src/visual_language/llava_next_video/classes.hpp | Added new methods for CPU and GPU preprocessing, added preprocessing model infrastructure and use flag |
| src/cpp/src/visual_language/llava_next_video/classes.cpp | Implemented OpenVINO preprocessing model creation and GPU-accelerated frame preprocessing logic |
```cpp
size_t num_video_tokens = ((float)config.crop_size_height / m_patch_size) *
                          ((float)config.crop_size_width / m_patch_size);
num_video_tokens = num_video_tokens / 4 * num_frames;
```
Copilot AI commented on Dec 4, 2025
The explicit casts to float in the division operations are unnecessary. Both config.crop_size_height and config.crop_size_width will be integer-divided by m_patch_size (which is size_t), and the result is multiplied together. The float casts don't change the outcome since intermediate results are still integers. Consider removing the casts or applying them to the entire expression if floating-point division is intended.
Suggested change:
```diff
- size_t num_video_tokens = ((float)config.crop_size_height / m_patch_size) *
-                           ((float)config.crop_size_width / m_patch_size);
- num_video_tokens = num_video_tokens / 4 * num_frames;
+ size_t num_video_tokens = (static_cast<float>(config.crop_size_height) / static_cast<float>(m_patch_size)) *
+                           (static_cast<float>(config.crop_size_width) / static_cast<float>(m_patch_size));
+ num_video_tokens = static_cast<size_t>(num_video_tokens / 4 * num_frames);
```
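As a quick sanity check on this comment, the small program below shows that the integer and float variants agree whenever the crop size divides evenly by the patch size. The 336x336 crop, 14-pixel patches, and 32 frames are assumed values typical of LLaVA-style vision towers, not numbers taken from this diff.

```cpp
#include <cstdio>
#include <cstddef>

int main() {
    // Assumed values for illustration only (not taken from this diff).
    const std::size_t crop_h = 336, crop_w = 336, patch = 14, num_frames = 32;

    const std::size_t tokens_int = (crop_h / patch) * (crop_w / patch);  // 24 * 24 = 576
    const std::size_t tokens_flt = static_cast<std::size_t>(
        (static_cast<float>(crop_h) / patch) * (static_cast<float>(crop_w) / patch));

    // Both forms give 576 tokens per frame when the crop divides evenly by the
    // patch size, and 576 / 4 * 32 = 4608 video tokens after spatial pooling.
    std::printf("%zu %zu %zu\n", tokens_int, tokens_flt, tokens_int / 4 * num_frames);
    return 0;
}
```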
```cpp
bool can_use_ov_video_preprocess() {
    const char* env = std::getenv("VIDEO_PREPROCESS");
    return !(env && std::string(env) == "CPP");
```
Copilot AI commented on Dec 4, 2025
[nitpick] The double negation logic is harder to read. Consider inverting the condition for clarity: `return !env || std::string(env) != "CPP";` or better yet, `return env == nullptr || std::string(env) != "CPP";`
Suggested change:
```diff
- return !(env && std::string(env) == "CPP");
+ return env == nullptr || std::string(env) != "CPP";
```
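For illustration, here is a standalone sketch of how the environment switch behaves. The function body mirrors the suggested version above, but the surrounding `main()` and the way you run it are assumptions, not part of the PR.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Standalone copy of the check for illustration only; in the PR the real
// method lives on the vision encoder class.
bool can_use_ov_video_preprocess() {
    const char* env = std::getenv("VIDEO_PREPROCESS");
    return env == nullptr || std::string(env) != "CPP";
}

int main() {
    // Unset or any value other than "CPP" -> OV/GPU preprocessing (true);
    // VIDEO_PREPROCESS=CPP -> fall back to the CPU path (false).
    std::cout << std::boolalpha << can_use_ov_video_preprocess() << std::endl;
    return 0;
}
```

Running it as `VIDEO_PREPROCESS=CPP ./check` would print `false` (CPU path), while leaving the variable unset prints `true` (OV/GPU path).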
```cpp
float* frames_data = concatinated_frames.data<float>();
```
Copilot AI commented on Dec 4, 2025
Corrected spelling of 'concatinated_frames' to 'concatenated_frames'.
```cpp
// concat preprocessed frames to single tensor
ov::Shape concat_shape = prepprocessed_frames[0].get_shape();
concat_shape[0] = prepprocessed_frames.size();
ov::Tensor concatinated_frames = ov::Tensor(prepprocessed_frames[0].get_element_type(), concat_shape);
```
Copilot AI commented on Dec 4, 2025
Corrected spelling of 'concatinated_frames' to 'concatenated_frames'.
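For context, this is a minimal sketch of how per-frame tensors might be packed into the single concatenated tensor the snippet above sets up. It assumes contiguous f32 frames of identical shape whose leading dimension is the per-frame batch of 1; it is an illustration, not the PR's exact code.

```cpp
#include <cstring>
#include <vector>
#include <openvino/openvino.hpp>

// Sketch: pack N identically-shaped f32 frame tensors into one tensor whose
// first dimension is the frame count (assumes contiguous, host-accessible data).
ov::Tensor concatenate_frames(const std::vector<ov::Tensor>& frames) {
    ov::Shape concat_shape = frames[0].get_shape();
    concat_shape[0] = frames.size();
    ov::Tensor concatenated(frames[0].get_element_type(), concat_shape);

    float* dst = concatenated.data<float>();          // valid because frames are f32
    const size_t frame_elems = frames[0].get_size();  // elements per frame
    for (size_t i = 0; i < frames.size(); ++i) {
        std::memcpy(dst + i * frame_elems, frames[i].data<float>(), frames[i].get_byte_size());
    }
    return concatenated;
}
```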
```cpp
auto [prepprocessed_frames, num_video_tokens] = vision_encoder->preprocess_frames(frames);

// Use OV or CPU preprocessing based on configuration
auto [prepprocessed_frames, num_video_tokens] = vision_encoder->get_use_ov_preprocess()
```
Copilot AI commented on Dec 4, 2025
Corrected spelling of 'prepprocessed_frames' to 'preprocessed_frames'.
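The diff snippet above is truncated, so for readability here is a hedged sketch of the dispatch it implies. The method names come from the key-changes list earlier in the review; the exact expression in the PR may differ.

```cpp
// Hypothetical reconstruction for illustration only (the quoted diff is truncated):
auto [preprocessed_frames, num_video_tokens] = vision_encoder->get_use_ov_preprocess()
    ? vision_encoder->preprocess_frames_ov(frames)    // OpenVINO (GPU) preprocessing path
    : vision_encoder->preprocess_frames_cpp(frames);  // original CPU path
```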
Force-pushed from 0cd1af1 to 8037bec.
Signed-off-by: Andrew Park <[email protected]>
Description
Optimizes video frame preprocessing for the LLaVA-NeXT-Video-7B model on GPU by creating an OpenVINO preprocessing model that moves the preprocessing operations from CPU to GPU.
Ticket: CVS-177558
Average first token latency (1280x720, 5 s video / 32 frames + 100 input tokens -> generate 128 tokens):
WWB results with video input:
Checklist: