Conversation

@andrew-k-park

Description

Optimize video frame preprocessing for the LLaVA-NeXT-Video-7B model on GPU by creating an OpenVINO preprocessing model that moves the preprocessing operations from the CPU to the GPU.

Ticket: CVS-177558

Average first-token latency (1280x720 5 s video, 32 frames, plus 100 input tokens -> generate 128 tokens):

CPP preprocessing (GPU)  2906.118 ms
OV preprocessing (GPU)   845.6711 ms
CPP preprocessing (CPU)  15321.59 ms
OV preprocessing (CPU)   14327.6  ms

WWB results with video input:

CPP preprocessing (GPU)  0.880806
OV preprocessing (GPU)   0.860603
CPP preprocessing (CPU)  0.918247
OV preprocessing (CPU)   0.906167

Checklist:

  • Tests have been updated or added to cover the new code.
  • This patch fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings December 4, 2025 13:23
@github-actions github-actions bot added the "category: visual language" (Visual language pipeline) label on Dec 4, 2025

Copilot AI left a comment


Pull request overview

This PR optimizes video frame preprocessing for the LLaVA-NeXT-Video-7B model by implementing GPU-accelerated preprocessing with OpenVINO operations instead of CPU-based preprocessing. The change yields a significant performance improvement, reducing first-token latency from ~15 s (CPU preprocessing) to ~845 ms (GPU with OV preprocessing).

Key changes:

  • Added OpenVINO-based preprocessing model that performs resize, crop, and normalization on GPU
  • Implemented environment variable control to switch between CPU and GPU preprocessing
  • Refactored preprocessing logic to support both CPU (preprocess_frames_cpp) and GPU (preprocess_frames_ov) paths

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
src/cpp/src/visual_language/llava_next_video/classes.hpp Added new methods for CPU and GPU preprocessing, added preprocessing model infrastructure and use flag
src/cpp/src/visual_language/llava_next_video/classes.cpp Implemented OpenVINO preprocessing model creation and GPU-accelerated frame preprocessing logic


Comment on lines +511 to +513
size_t num_video_tokens = ((float)config.crop_size_height / m_patch_size) *
((float)config.crop_size_width / m_patch_size);
num_video_tokens = num_video_tokens / 4 * num_frames;

Copilot AI Dec 4, 2025


The explicit casts to float in the division operations are unnecessary. The result is assigned to size_t, so any fractional part is truncated anyway, and when config.crop_size_height and config.crop_size_width are exact multiples of m_patch_size the float casts don't change the outcome. Consider removing the casts, or applying them to the entire expression if floating-point division is intended.

Suggested change
size_t num_video_tokens = ((float)config.crop_size_height / m_patch_size) *
((float)config.crop_size_width / m_patch_size);
num_video_tokens = num_video_tokens / 4 * num_frames;
size_t num_video_tokens = (static_cast<float>(config.crop_size_height) / static_cast<float>(m_patch_size)) *
(static_cast<float>(config.crop_size_width) / static_cast<float>(m_patch_size));
num_video_tokens = static_cast<size_t>(num_video_tokens / 4 * num_frames);


bool can_use_ov_video_preprocess() {
const char* env = std::getenv("VIDEO_PREPROCESS");
return !(env && std::string(env) == "CPP");

Copilot AI Dec 4, 2025


[nitpick] The double-negation logic is harder to read. Consider inverting the condition for clarity: return !env || std::string(env) != "CPP"; or, better yet, return env == nullptr || std::string(env) != "CPP";

Suggested change
return !(env && std::string(env) == "CPP");
return env == nullptr || std::string(env) != "CPP";

Comment on lines +533 to 534

float* frames_data = concatinated_frames.data<float>();

Copilot AI Dec 4, 2025


Corrected spelling of 'concatinated_frames' to 'concatenated_frames'.

// concat preprocessed frames to single tensor
ov::Shape concat_shape = prepprocessed_frames[0].get_shape();
concat_shape[0] = prepprocessed_frames.size();
ov::Tensor concatinated_frames = ov::Tensor(prepprocessed_frames[0].get_element_type(), concat_shape);

Copilot AI Dec 4, 2025


Corrected spelling of 'concatinated_frames' to 'concatenated_frames'.

auto [prepprocessed_frames, num_video_tokens] = vision_encoder->preprocess_frames(frames);

// Use OV or CPU preprocessing based on configuration
auto [prepprocessed_frames, num_video_tokens] = vision_encoder->get_use_ov_preprocess()

Copilot AI Dec 4, 2025


Corrected spelling of 'prepprocessed_frames' to 'preprocessed_frames'.

@andrew-k-park force-pushed the preproc_opt_for_llava_next_video branch from 0cd1af1 to 8037bec on December 4, 2025 13:26