feat: Media processing in the frontend - 1st pass #3630
    Signed-off-by: Alexandre Milesi <[email protected]>
This decoding is optional: if dynamo was not built with this feature, or if no decoding configuration is passed, unprocessed URLs will be passed.
If the feature is gated behind a compile-time feature flag, I think it will be difficult for most users to consume, since they'll need to build from source one way or another. Could this be set as a frontend flag, an environment variable, or something similar instead? What do you think about how to control the frontend-side media decoding feature, @grahamking @krishung5 @indrajit96?
Hi @rmccorm4
I have a WIP draft PR (#3929) into Alexandre's branch for this compile-time flag.
I modeled the workflow on ENABLE_KVBM and the block-manager feature group:
Line 358 in b1732a5:
    if [ "$ENABLE_KVBM" = "true" ]; then \
Do you think that workflow is too tedious on the user side for a frontend change? To use KVBM, the user also has to compile or rebuild the wheel.
So at the end of the day, what we want is to avoid blocking people from running the regular frontend when they don't need media decoding and don't have the required media-loading system dependencies at runtime (mostly ffmpeg).
The solution we are working on is a build-time flag. That way, building without ffmpeg is also possible. But yes, this means having different wheels for different features.
If we are in charge of the build and don't mind having ffmpeg on our side during the build, then another solution could be to require ffmpeg at build time but, at runtime, disable video decoding if dynamic linking fails to find ffmpeg. We need to see how doable this is in Rust.
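A rough sketch of what that runtime fallback could look like, assuming the ffmpeg libraries are loaded lazily (dlopen-style) rather than resolved when the process starts, since a missing library at load time would stop the binary from starting at all. The `probe` closure is a hypothetical placeholder for whatever check attempts to load ffmpeg; `MediaCapabilities`, `detect_capabilities`, and `capabilities` are illustrative names, not part of dynamo.

```rust
use std::sync::OnceLock;

/// Capabilities detected once at startup (illustrative type, not from the PR).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MediaCapabilities {
    pub video_decoding: bool,
}

/// Runs `probe` (a hypothetical check that the ffmpeg shared libraries can be
/// loaded) and turns a failure into "video decoding disabled" instead of a
/// startup error.
pub fn detect_capabilities<F>(probe: F) -> MediaCapabilities
where
    F: FnOnce() -> Result<(), String>,
{
    match probe() {
        Ok(()) => MediaCapabilities { video_decoding: true },
        Err(reason) => {
            eprintln!("video decoding disabled: {}", reason);
            MediaCapabilities { video_decoding: false }
        }
    }
}

/// Process-wide cache so the probe runs at most once.
static CAPS: OnceLock<MediaCapabilities> = OnceLock::new();

/// Returns the cached capabilities, probing only on the first call.
pub fn capabilities<F>(probe: F) -> MediaCapabilities
where
    F: FnOnce() -> Result<(), String>,
{
    *CAPS.get_or_init(|| detect_capabilities(probe))
}
```

With this shape, the frontend would consult `capabilities(...)` before attempting video decoding and pass the URL through unprocessed when `video_decoding` is false.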
(WIP)
Overview:
Media decoding in the frontend for VLMs (images, videos).
Details:
Decodes multimodal data from the OAI chat request (image_url, video_url) in the frontend processor into decoded tensors (pixel values).
Passes the decoded data to the next step in the graph (the backend) via NIXL readable descriptors, which can be consumed from Python with nixl_connect.
Decoding data involves:
These last two steps can be CPU-heavy and are done in the rayon runtime.
This decoding is optional: if dynamo was not built with this feature, or if no decoding configuration is passed, unprocessed URLs will be passed.
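The passthrough rule above can be sketched as a single decision function. This is a minimal illustration, not the code in this PR: `DecoderConfig`, `MediaPayload`, and `plan_media` are hypothetical names, and `built_with_decoding` stands in for a compile-time check such as `cfg!(feature = "...")`, whose feature name is not given here.

```rust
/// Hypothetical decoder settings; the real configuration arrives via the MDC.
#[derive(Debug, Clone, PartialEq)]
pub struct DecoderConfig {
    pub max_pixels: usize,
}

/// What the preprocessor forwards for one media item (illustrative type).
#[derive(Debug, PartialEq)]
pub enum MediaPayload {
    /// Decoding is unavailable or not configured: forward the URL untouched.
    RawUrl(String),
    /// Decoding is available and configured: the frontend would fetch and
    /// decode the item using `config`.
    Decode { url: String, config: DecoderConfig },
}

/// Chooses between decoding and URL passthrough.
/// `built_with_decoding` is a stand-in for a compile-time feature check.
pub fn plan_media(
    url: &str,
    built_with_decoding: bool,
    config: Option<&DecoderConfig>,
) -> MediaPayload {
    match (built_with_decoding, config) {
        (true, Some(cfg)) => MediaPayload::Decode {
            url: url.to_string(),
            config: cfg.clone(),
        },
        // Either the feature was not compiled in or no config was passed.
        _ => MediaPayload::RawUrl(url.to_string()),
    }
}
```

Both conditions must hold for decoding to happen; every other combination falls back to the raw URL, so a frontend built without the feature behaves exactly like one with no decoding configuration.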
The preprocessor holds a MediaLoader, which has an HTTP client and a media decoder for each modality. The decoder configuration is passed via the MDC. In the future, per-request or even per-item options could override this default configuration. MediaLoader also holds a NIXL agent that handles storage registration. The underlying data is only freed once the request object is dropped on the frontend, which happens at the end of generate().
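To illustrate the lifetime rule (data stays registered until the request is dropped), here is a small sketch using Rust's `Drop`. Everything in it is hypothetical: `FakeAgent` only counts registrations and is not the NIXL agent API, and `RegisteredMedia` and `Request` are illustrative types. A real implementation would likely need `Arc` and thread-safe state instead of `Rc` and `RefCell`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for the NIXL agent: it only tracks which buffer ids
/// are currently registered.
#[derive(Default)]
pub struct FakeAgent {
    registered: RefCell<Vec<u64>>,
}

impl FakeAgent {
    pub fn register(&self, id: u64) {
        self.registered.borrow_mut().push(id);
    }
    pub fn deregister(&self, id: u64) {
        self.registered.borrow_mut().retain(|r| *r != id);
    }
    pub fn registered_count(&self) -> usize {
        self.registered.borrow().len()
    }
}

/// Decoded media owned by a request. The buffer stays registered for as long
/// as this value is alive.
pub struct RegisteredMedia {
    id: u64,
    pub pixels: Vec<u8>,
    agent: Rc<FakeAgent>,
}

impl RegisteredMedia {
    /// Registers the buffer with the agent on construction.
    pub fn new(agent: Rc<FakeAgent>, id: u64, pixels: Vec<u8>) -> Self {
        agent.register(id);
        Self { id, pixels, agent }
    }
}

impl Drop for RegisteredMedia {
    /// Deregisters the buffer when the owning request goes away.
    fn drop(&mut self) {
        self.agent.deregister(self.id);
    }
}

/// Hypothetical request object that owns its decoded media.
pub struct Request {
    pub media: Vec<RegisteredMedia>,
}
```

Dropping a `Request` drops its `media` vector, which deregisters each buffer in turn, so a remote reader can rely on the descriptors for as long as the request is alive.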
TODOs:
This MR:
Future work:
Where should the reviewer start?
Follow the flow starting from gather_multi_modal_data in preprocessor.rs.