[Multimodal] make multimodal processing robust #1516

coding-famer · 2026-01-29T09:13:33Z

Modifications:

Use explicit base64 encoding.
Force return_tensors to None for input_ids, and set return_tensors='pt' for multimodal inputs.
Lazily import qwen_vl_utils when loading the processor.

zhuzilin · 2026-02-03T09:21:10Z

slime/utils/processing_utils.py

-    from qwen_vl_utils import process_vision_info
+    # TODO: temporary solution, will write image utils for slime later
+    if _qwen_process_vision_info is None:
+        raise ImportError("qwen_vl_utils is not installed. Install it with: pip install qwen-vl-utils")


hmm... I don't get why we need to move the import to the function above...
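For context, here is a minimal sketch of the optional-import pattern this diff appears to use: the import is attempted once (for example while the processor is loaded), and the guard raises only when the helper is actually needed. The names `load_vision_helper`, `get_vision_helper`, and the `module_name` parameter are hypothetical and are not taken from the PR; only the module-level `None` check and the error message mirror the diff.

```python
import importlib

# Stays None until load_vision_helper() succeeds (hypothetical global).
_process_vision_info = None


def load_vision_helper(module_name="qwen_vl_utils"):
    """Attempt the optional import once, e.g. while loading the processor."""
    global _process_vision_info
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The dependency is optional: remember that it is missing, do not fail yet.
        _process_vision_info = None
    else:
        _process_vision_info = getattr(module, "process_vision_info", None)
    return _process_vision_info


def get_vision_helper():
    """Return the helper, failing loudly only at the point of use."""
    if _process_vision_info is None:
        raise ImportError(
            "qwen_vl_utils is not installed. Install it with: pip install qwen-vl-utils"
        )
    return _process_vision_info
```

With this layout, text-only runs never touch `qwen_vl_utils`, and multimodal runs get an actionable error instead of a failure at module import time.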

zhuzilin · 2026-02-03T09:22:30Z

slime/utils/processing_utils.py

    image.save(buffer, format="PNG")
-    return base64.b64encode(buffer.getvalue()).decode("utf-8")
+    image_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
+    return f"data:image/png;base64,{image_base64}"


Shall we move the f"data:image/png;base64,{image_base64}" template into sglang_rollout.py? It seems like a template that is tightly coupled to the HTTP payload.

I was thinking about potential future modalities (audio, video, etc.) that may have different MIME types. Keeping the data formatting in each encode function means sglang_rollout.py doesn't need to handle a different MIME type for each modality.
(SGLang actually just matches the data: prefix and the , separator without parsing the MIME type, but including it makes the format less confusing.)
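The trade-off discussed here can be sketched as follows, assuming each modality's encode function owns its MIME type. `to_data_uri` and `split_data_uri` are hypothetical helpers, not functions from slime or SGLang; the parser only imitates the "match `data:` and `,`" behavior described in the reply and is not SGLang's actual code.

```python
import base64


def to_data_uri(payload: bytes, mime_type: str) -> str:
    """Wrap raw bytes in a data URI; the caller chooses the MIME type."""
    encoded = base64.b64encode(payload).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def split_data_uri(uri: str) -> bytes:
    """Decode a data URI while ignoring the MIME type entirely."""
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")
    # Everything after the first comma is the base64 payload.
    _, _, encoded = uri.partition(",")
    return base64.b64decode(encoded)


image_uri = to_data_uri(b"hi", "image/png")   # "data:image/png;base64,aGk="
audio_uri = to_data_uri(b"hi", "audio/wav")   # same payload, different MIME type
```

Because the rollout side would only forward the finished string, adding a new modality under this layout means adding one encode function rather than teaching the HTTP layer about another MIME type.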

coding-famer · 2026-02-03T12:06:46Z

slime/utils/processing_utils.py

+        # force return_tensors to None for input_ids
+        "return_tensors": None,
+        # have been resized by qwen_vl_utils, update this when supporting other models
+        "do_resize": False,


Removing this for now, since SGLang re-processes images internally and doesn't expose a do_resize option.

coding-famer added 2 commits January 29, 2026 16:46

fix mm processor

1d2a0d4

update

c0d1f3d

zhuzilin reviewed Feb 3, 2026

coding-famer commented Feb 3, 2026

update

870a011


2 participants
