-
Notifications
You must be signed in to change notification settings - Fork 507
[Multimodal] make multimodal processing robust #1516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
slime/utils/processing_utils.py
Outdated
| from qwen_vl_utils import process_vision_info | ||
| # TODO: temporary solution, will write image utils for slime later | ||
| if _qwen_process_vision_info is None: | ||
| raise ImportError("qwen_vl_utils is not installed. Install it with: pip install qwen-vl-utils") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm... I don't get why we need to move the import to the function above...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
| image.save(buffer, format="PNG") | ||
| return base64.b64encode(buffer.getvalue()).decode("utf-8") | ||
| image_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8") | ||
| return f"data:image/png;base64,{image_base64}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we move the f"data:image/png;base64,{image_base64}" template into sglang_rollout.py? It seems like a template that is tightly connect to http payload.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about potential future modalities (audio, video, etc.) that may have different MIME types. Keeping the data formatting in each encode functions make sglang_rollout.py doesn't need to handle different MIME types for each modality.
(Although SGLang actually just matches data: and , without parsing the MIME type, but including it makes the format less confusing.)
slime/utils/processing_utils.py
Outdated
| # force return_tensors to None for input_ids | ||
| "return_tensors": None, | ||
| # have been resized by qwen_vl_utils, update this when supporting other models | ||
| "do_resize": False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this for now. Since SGLang re-processes images internally and doesn't expose a do_resize option.
Modifications:
return_tensorsto None and setreturn_tensors='pt'for multimodal inputs.qwen_vl_utilswhen loading processor.