Conversation

@mohammed840 commented:

Summary

This PR adds multimodal capabilities to RLM, enabling vision (image analysis) and audio (transcription + TTS) support in the REPL environment.

Changes

Gemini Client (rlm/clients/gemini.py)

  • Added _load_image_as_part() to handle image loading from files, URLs, and base64 (see the sketch after this list)
  • Added _load_audio_as_part() to handle audio file loading
  • Extended _get_mime_type() with audio and video MIME types
  • Updated _content_to_parts() to process multimodal content (images + audio)
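
For illustration, here is a minimal sketch of how _load_image_as_part() might dispatch on its input. This is not the PR's actual code; it assumes the google-genai SDK, where types.Part.from_bytes(data=..., mime_type=...) builds an inline media part:

```python
# A minimal sketch, not the PR's actual implementation.
import base64
import mimetypes
import os
import urllib.request

from google.genai import types


def _load_image_as_part(image: str) -> types.Part:
    """Accept a local file path, an http(s) URL, or a base64-encoded string."""
    if image.startswith(("http://", "https://")):
        with urllib.request.urlopen(image) as resp:
            data = resp.read()
            mime = resp.headers.get_content_type()
    elif os.path.isfile(image):
        with open(image, "rb") as f:
            data = f.read()
        mime = mimetypes.guess_type(image)[0] or "image/png"
    else:
        # Neither a URL nor an existing file: treat the string as raw base64 (PNG assumed).
        data = base64.b64decode(image, validate=True)
        mime = "image/png"
    return types.Part.from_bytes(data=data, mime_type=mime)
```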

REPL Environment (rlm/environments/local_repl.py)

  • Added vision_query(prompt, images) - Analyze images with vision-capable LLMs
  • Added vision_query_batched(prompts, images_list) - Batch image analysis (contract sketched below)
  • Added audio_query(prompt, audio_files) - Transcribe/analyze audio files
  • Added speak(text, output_path) - Text-to-speech generation
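
The batched variant pairs prompts and image lists positionally. A self-contained sketch of that contract follows; the placeholder body stands in for the real LLM call, which is an assumption here:

```python
# Sketch of the batched helper's semantics; the placeholder body replaces the real client call.
def vision_query(prompt: str, images: list[str]) -> str:
    return f"<answer for {prompt!r} over {len(images)} image(s)>"  # placeholder


def vision_query_batched(prompts: list[str], images_list: list[list[str]]) -> list[str]:
    """Each prompt is paired positionally with its own list of images."""
    if len(prompts) != len(images_list):
        raise ValueError("prompts and images_list must have the same length")
    return [vision_query(p, imgs) for p, imgs in zip(prompts, images_list)]
```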

System Prompt (rlm/utils/prompts.py)

  • Documented new multimodal functions in the REPL environment description

New REPL Functions

| Function | Purpose | Example |
|---|---|---|
| `vision_query(prompt, images)` | Analyze images | `vision_query("What's in the image?", ["photo.jpg"])` |
| `vision_query_batched(prompts, images_list)` | Batch image analysis | `vision_query_batched([...], [[img1], [img2]])` |
| `audio_query(prompt, audio_files)` | Transcribe/analyze audio | `audio_query("Transcribe this", ["speech.mp3"])` |
| `speak(text, output_path)` | Text-to-speech | `speak("Hello world", "output.aiff")` |

Usage Examples

Vision

```python
# In REPL code
description = vision_query("What objects are in this image?", ["photo.jpg"])
print(description)
```
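
And a hypothetical batched call, following the same pattern (the file names are placeholders, not from this PR):

```python
# Prompts and image lists are paired positionally.
captions = vision_query_batched(
    ["Describe this image.", "Is there a person in this photo?"],
    [["cat.jpg"], ["street.jpg"]],
)
print(captions)
```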

Audio

```python
# Transcribe audio
transcript = audio_query("Transcribe this audio", ["recording.mp3"])

# Text-to-speech
audio_path = speak("Hello, this is the RLM speaking!", "output.aiff")
```

Testing

  • Tested with Gemini 2.5 Flash
  • Vision query successfully analyzes images
  • TTS successfully generates audio files (falls back to the macOS say command; sketch below)
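
The say fallback could look roughly like this. A sketch of the fallback path only, assuming the PR shells out to macOS's built-in say; the primary TTS path is not shown:

```python
import subprocess


def speak(text: str, output_path: str = "output.aiff") -> str:
    """Generate speech with the macOS `say` command and return the output path."""
    subprocess.run(["say", "-o", output_path, text], check=True)
    return output_path
```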

This addresses the open request for multimodal support in RLM.

@alexzhang13 (Owner) commented:

Love this, @mohammed840. Before I make minor changes, can you add a flag to RLM(...) that enables multimodal and routes to this prompt? For now, you can just make a separate "multimodal" system prompt containing what you currently have, with the old prompt keeping the original content. Same goes for the REPL, which should only have access to these functions when multimodal is enabled (the multimodal flag should pass down to the Env as well).

@alexzhang13 self-requested a review on January 15, 2026.

@alexzhang13 (Owner) left a comment:

Make the changes minimal, but add an enable_multimodal: bool = False flag to RLM(...) (off by default) that enables these functions. Use a separate system prompt as well.

Addresses PR feedback:
- Added enable_multimodal: bool = False flag to RLM constructor
- Created separate RLM_SYSTEM_PROMPT (base) and RLM_MULTIMODAL_SYSTEM_PROMPT
- Multimodal REPL functions only registered when enable_multimodal=True
- Updated examples to use new flag
@mohammed840 (Author) commented:

@alexzhang13 Thanks for the feedback! I've implemented all your requested changes:

Changes Made:

  1. Added enable_multimodal: bool = False flag to the RLM(...) constructor

  2. Created separate system prompts:

    • RLM_SYSTEM_PROMPT - Base prompt (text-only, no multimodal functions documented)
    • RLM_MULTIMODAL_SYSTEM_PROMPT - Full prompt with vision_query, audio_query, speak

  3. Conditional prompting: The library now uses the multimodal prompt only when enable_multimodal=True; otherwise it falls back to the original base prompt.

  4. Function access control: The multimodal REPL functions (vision_query, vision_query_batched, audio_query, speak) are only registered in the LocalREPL environment when the enable_multimodal flag is enabled. The flag is passed from RLM -> Environment (see the sketch below).

  5. Updated examples: Both multimodal_example.py and audio_example.py now use enable_multimodal=True.
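
A minimal sketch of how that routing might look. The class internals, attribute names, and stub bodies here are assumptions for illustration, not the PR's actual code:

```python
# Illustrative sketch only; the real classes and prompts live in the rlm package.
RLM_SYSTEM_PROMPT = "<base prompt>"           # real one is in rlm/utils/prompts.py
RLM_MULTIMODAL_SYSTEM_PROMPT = "<mm prompt>"  # real one is in rlm/utils/prompts.py


class LocalREPL:
    def __init__(self, enable_multimodal: bool = False):
        self.functions = {}  # names exposed to REPL code (structure assumed)
        if enable_multimodal:
            self.functions.update({
                "vision_query": self.vision_query,
                "vision_query_batched": self.vision_query_batched,
                "audio_query": self.audio_query,
                "speak": self.speak,
            })

    # Stubs standing in for the real implementations.
    def vision_query(self, prompt, images): ...
    def vision_query_batched(self, prompts, images_list): ...
    def audio_query(self, prompt, audio_files): ...
    def speak(self, text, output_path): ...


class RLM:
    def __init__(self, backend: str, enable_multimodal: bool = False, **kwargs):
        # Route to the multimodal prompt only when the flag is set.
        self.system_prompt = (
            RLM_MULTIMODAL_SYSTEM_PROMPT if enable_multimodal else RLM_SYSTEM_PROMPT
        )
        # Pass the flag down so the environment gates its function table.
        self.environment = LocalREPL(enable_multimodal=enable_multimodal)
```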

Usage:

```python
# Default: no multimodal (backward compatible)
rlm = RLM(backend="gemini", ...)

# With multimodal enabled
rlm = RLM(backend="gemini", ..., enable_multimodal=True)
```

Let me know if you'd like any additional changes!
