@milesial milesial commented Oct 15, 2025

(WIP)

Overview:

Media decoding in the frontend for VLMs (images, videos).

Details:

Decodes multimodal data from the OpenAI-compatible (OAI) chat request (image_url, video_url) in the frontend processor into decoded tensors (pixel values).
Passes the decoded data to the next step in the graph (the backend) via NIXL readable descriptors, which can be consumed from Python with nixl_connect.

Decoding data involves:

  • Potentially fetching the data from the web
  • Potentially decoding base64
  • Running the actual media decoding (JPEG, H264, ...)

These last two steps can be CPU-heavy, so they run on the rayon thread pool.
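
As a rough illustration of the "potentially" in the steps above, here is a minimal sketch of how a media URL could be routed to either the fetch step or the base64 step. The names `MediaSource` and `classify_url` are hypothetical and are not taken from this PR; the sketch only assumes the standard `data:[<mediatype>][;base64],<data>` form from RFC 2397 and uses the standard library alone.

```rust
/// Hypothetical classification of a media URL; not the PR's actual types.
#[derive(Debug, PartialEq)]
enum MediaSource<'a> {
    /// Remote resource that must be fetched over HTTP(S) before decoding.
    Remote(&'a str),
    /// Inline `data:` URL whose payload still needs base64 decoding.
    InlineBase64 { mime: &'a str, payload: &'a str },
}

/// Decide which preparation step a URL needs.
/// Returns `None` for anything this sketch does not handle
/// (other schemes, or `data:` URLs that are not base64-encoded).
/// Scheme matching is case-sensitive here for simplicity.
fn classify_url(url: &str) -> Option<MediaSource<'_>> {
    if url.starts_with("http://") || url.starts_with("https://") {
        return Some(MediaSource::Remote(url));
    }
    // RFC 2397 form: data:[<mediatype>][;base64],<data>
    let rest = url.strip_prefix("data:")?;
    let (header, payload) = rest.split_once(',')?;
    let mime = header.strip_suffix(";base64")?;
    Some(MediaSource::InlineBase64 { mime, payload })
}
```

In this sketch, only the `Remote` case would touch the network; both cases would then hand bytes to the CPU-bound decoding stage.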
This decoding is optional: if dynamo was not built with this feature, or if no decoding configuration is passed, the unprocessed URLs are passed through unchanged.
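
A minimal sketch of that fallback, assuming a Cargo feature called `media-decoding` (the feature name, `MediaPayload`, `DecoderConfig`, `try_decode`, and `prepare` are all hypothetical, not names from this PR). The real decoding is replaced by a placeholder so the sketch stays self-contained.

```rust
/// What the frontend hands to the backend for one media item (hypothetical).
#[derive(Debug, PartialEq)]
enum MediaPayload {
    /// Decoded pixel data.
    Decoded(Vec<u8>),
    /// The original URL, untouched.
    RawUrl(String),
}

/// Hypothetical decoder configuration; in the PR it is passed via the MDC.
struct DecoderConfig;

/// Built only when the (assumed) `media-decoding` feature is enabled.
#[cfg(feature = "media-decoding")]
fn try_decode(url: &str, _cfg: &DecoderConfig) -> Option<Vec<u8>> {
    // Placeholder for fetch + base64 + media decoding.
    Some(url.as_bytes().to_vec())
}

/// Stub used when the feature is compiled out: never decodes.
#[cfg(not(feature = "media-decoding"))]
fn try_decode(_url: &str, _cfg: &DecoderConfig) -> Option<Vec<u8>> {
    None
}

/// Decode when both the feature and a configuration are present;
/// otherwise pass the URL through unchanged.
fn prepare(url: &str, cfg: Option<&DecoderConfig>) -> MediaPayload {
    match cfg.and_then(|c| try_decode(url, c)) {
        Some(pixels) => MediaPayload::Decoded(pixels),
        None => MediaPayload::RawUrl(url.to_owned()),
    }
}
```

Keeping a stub with the same signature in the feature-off build means callers such as `prepare` need no `cfg` attributes of their own.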

The preprocessor holds a MediaLoader, which has an HTTP client and a media decoder for each modality. Decoder configuration is passed via the MDC. In the future, per-request or even per-item options could override this default configuration. MediaLoader also holds a NIXL agent to handle registration of the storage. The underlying data is only freed once the request object is dropped on the frontend, which happens at the end of generate().
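
The lifetime rule in the last sentence can be sketched with plain Rust ownership. Everything below is a hypothetical stand-in: `RegisteredBuffer`, `Request`, and this `generate` are not the PR's types, and a shared counter takes the place of NIXL registration and deregistration.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a decoded buffer registered with an agent.
struct RegisteredBuffer {
    pixels: Vec<u8>,
    /// Shared count of live registrations (stands in for the agent's state).
    live: Rc<Cell<usize>>,
}

impl RegisteredBuffer {
    fn new(pixels: Vec<u8>, live: Rc<Cell<usize>>) -> Self {
        live.set(live.get() + 1); // stands in for "register with the agent"
        Self { pixels, live }
    }
}

impl Drop for RegisteredBuffer {
    fn drop(&mut self) {
        self.live.set(self.live.get() - 1); // stands in for deregistration
    }
}

/// Hypothetical request object that owns its decoded media.
struct Request {
    media: Vec<RegisteredBuffer>,
}

/// Takes the request by value, so its buffers are released on return.
fn generate(request: Request) -> usize {
    // The backend would read the buffers through their descriptors here.
    let total = request.media.iter().map(|b| b.pixels.len()).sum();
    total
    // `request` is dropped here, which drops every RegisteredBuffer.
}
```

Because the buffers live exactly as long as the request, nothing is released early; this matches the "Early-free decoded memory data once read" item listed as future work.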

TODOs:

This MR:

  • Have media decoding code under a feature flag

Future work:

  • Microbench tests
  • Per-request decoder options
  • HW decoding
  • Seek-based video decoding for sparse sampling
  • Parallel HTTP fetch and decoding
  • Early-free decoded memory data once read
  • Pre-allocate RAM slab to share a unique NIXL metadata

Where should the reviewer start?

Follow the flow starting from gather_multi_modal_data in preprocessor.rs.

copy-pr-bot bot commented Oct 15, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the feat label Oct 15, 2025
@milesial milesial force-pushed the alexandrem/frontend-media-decoding branch from 6df1e40 to 80594ff Compare October 23, 2025 00:10
@milesial milesial force-pushed the alexandrem/frontend-media-decoding branch 2 times, most recently from 6a44d3d to f4edee8 Compare October 28, 2025 16:10
Signed-off-by: Alexandre Milesi <[email protected]>

This decoding is optional, if dynamo was not built with this feature, or if no decoding configuration is passed, unprocessed URLs will be passed.

If the feature is gated behind a compile-time feature flag, I think it will be difficult for most users to consume, since they'll need to build from source one way or another. Can this be controlled with a frontend flag, an environment variable, or something similar instead? What do you think about how to control the frontend-side media decoding feature, @grahamking @krishung5 @indrajit96?

Hi @rmccorm4
I have a draft PR (#3929), still WIP, that adds this compile-time flag to Alexandre's branch.
I modeled the workflow on enable_kvbm and the block-manager feature group:

if [ "$ENABLE_KVBM" = "true" ]; then \

Do you think that workflow is too tedious on the user side for a frontend change? Using KVBM also requires the user to compile or build the wheel again.

Contributor Author

At the end of the day, what we want is to not prevent people from running the regular frontend if they don't need media decoding and don't have the required media loading system dependencies at runtime (mostly ffmpeg).

So the solution we are working on is a build-time flag. That way, building without ffmpeg is possible too. But yes, this means having different wheels for different features.

If we are in charge of the build and don't mind having ffmpeg on our side during the build, then another solution could be to require ffmpeg at build time but, at runtime, disable video decoding if dynamic linking fails to find ffmpeg. We need to see how doable this is with Rust.
