Conversation

@mohammed840 commented:

Summary

This PR adds multimodal capabilities to RLM, enabling vision (image analysis) and audio (transcription + TTS) support in the REPL environment.

Changes

Gemini Client (rlm/clients/gemini.py)

  • Added _load_image_as_part() to handle image loading from files, URLs, and base64 (see the sketch after this list)
  • Added _load_audio_as_part() to handle audio file loading
  • Extended _get_mime_type() with audio and video MIME types
  • Updated _content_to_parts() to process multimodal content (images + audio)
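
For illustration, here is a minimal sketch of how _load_image_as_part() might dispatch on its input. This is not the PR's actual code; it assumes the google-genai SDK, where types.Part.from_bytes(data=..., mime_type=...) builds an inline media part:

```python
# A minimal sketch, not the PR's actual implementation.
import base64
import mimetypes
import os
import urllib.request

from google.genai import types


def _load_image_as_part(image: str) -> types.Part:
    """Accept a local file path, an http(s) URL, or a base64-encoded string."""
    if image.startswith(("http://", "https://")):
        with urllib.request.urlopen(image) as resp:
            data = resp.read()
            mime = resp.headers.get_content_type()
    elif os.path.isfile(image):
        with open(image, "rb") as f:
            data = f.read()
        mime = mimetypes.guess_type(image)[0] or "image/png"
    else:
        # Neither a URL nor an existing file: treat the string as raw base64 (PNG assumed).
        data = base64.b64decode(image, validate=True)
        mime = "image/png"
    return types.Part.from_bytes(data=data, mime_type=mime)
```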

REPL Environment (rlm/environments/local_repl.py)

  • Added vision_query(prompt, images) - Analyze images with vision-capable LLMs
  • Added vision_query_batched(prompts, images_list) - Batch image analysis (contract sketched below)
  • Added audio_query(prompt, audio_files) - Transcribe/analyze audio files
  • Added speak(text, output_path) - Text-to-speech generation
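
The batched variant pairs prompts and image lists positionally. A self-contained sketch of that contract follows; the placeholder body stands in for the real LLM call, which is an assumption here:

```python
# Sketch of the batched helper's semantics; the placeholder body replaces the real client call.
def vision_query(prompt: str, images: list[str]) -> str:
    return f"<answer for {prompt!r} over {len(images)} image(s)>"  # placeholder


def vision_query_batched(prompts: list[str], images_list: list[list[str]]) -> list[str]:
    """Each prompt is paired positionally with its own list of images."""
    if len(prompts) != len(images_list):
        raise ValueError("prompts and images_list must have the same length")
    return [vision_query(p, imgs) for p, imgs in zip(prompts, images_list)]
```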

System Prompt (rlm/utils/prompts.py)

  • Documented new multimodal functions in the REPL environment description

New REPL Functions

| Function | Purpose | Example |
|---|---|---|
| `vision_query(prompt, images)` | Analyze images | `vision_query("What's in the image?", ["photo.jpg"])` |
| `vision_query_batched(prompts, images_list)` | Batch image analysis | `vision_query_batched([...], [[img1], [img2]])` |
| `audio_query(prompt, audio_files)` | Transcribe/analyze audio | `audio_query("Transcribe this", ["speech.mp3"])` |
| `speak(text, output_path)` | Text-to-speech | `speak("Hello world", "output.aiff")` |

Usage Examples

Vision

```python
# In REPL code
description = vision_query("What objects are in this image?", ["photo.jpg"])
print(description)
```
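
And a hypothetical batched call, following the same pattern (the file names are placeholders, not from this PR):

```python
# Prompts and image lists are paired positionally.
captions = vision_query_batched(
    ["Describe this image.", "Is there a person in this photo?"],
    [["cat.jpg"], ["street.jpg"]],
)
print(captions)
```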

Audio

```python
# Transcribe audio
transcript = audio_query("Transcribe this audio", ["recording.mp3"])

# Text-to-speech
audio_path = speak("Hello, this is the RLM speaking!", "output.aiff")
```

Testing

  • Tested with Gemini 2.5 Flash
  • Vision query successfully analyzes images
  • TTS successfully generates audio files (falls back to the macOS say command; sketch below)
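
The say fallback could look roughly like this. A sketch of the fallback path only, assuming the PR shells out to macOS's built-in say; the primary TTS path is not shown:

```python
import subprocess


def speak(text: str, output_path: str = "output.aiff") -> str:
    """Generate speech with the macOS `say` command and return the output path."""
    subprocess.run(["say", "-o", output_path, text], check=True)
    return output_path
```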

This addresses the open request for multimodal support in RLM.

@alexzhang13 (Owner) commented:

Love this, @mohammed840. Before I make minor changes, can you add a flag to RLM(...) that enables multimodal and routes to this prompt? For now, you can just make a separate "multimodal" system prompt containing what you currently have, with the old prompt keeping the original content. Same goes for the REPL, which should only have access to these functions when multimodal is enabled (the multimodal flag should pass down to the Env as well).

@alexzhang13 self-requested a review on January 15, 2026.

@alexzhang13 (Owner) left a comment:

Make the changes minimal, but add an enable_multimodal: bool = False flag to RLM(...) (off by default) that enables these functions. Use a separate system prompt as well.

Addresses PR feedback:
- Added enable_multimodal: bool = False flag to RLM constructor
- Created separate RLM_SYSTEM_PROMPT (base) and RLM_MULTIMODAL_SYSTEM_PROMPT
- Multimodal REPL functions only registered when enable_multimodal=True
- Updated examples to use new flag
@mohammed840 (Author) commented:

@alexzhang13 Thanks for the feedback! I've implemented all your requested changes:

Changes Made:

  1. Added enable_multimodal: bool = False flag to the RLM(...) constructor

  2. Created separate system prompts:

    • RLM_SYSTEM_PROMPT - Base prompt (text-only, no multimodal functions documented)
    • RLM_MULTIMODAL_SYSTEM_PROMPT - Full prompt with vision_query, audio_query, speak

  3. Conditional prompting: The library now uses the multimodal prompt only when enable_multimodal=True; otherwise it falls back to the original base prompt.

  4. Function access control: The multimodal REPL functions (vision_query, vision_query_batched, audio_query, speak) are only registered in the LocalREPL environment when the enable_multimodal flag is enabled. The flag is passed from RLM -> Environment (see the sketch below).

  5. Updated examples: Both multimodal_example.py and audio_example.py now use enable_multimodal=True.
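
A minimal sketch of how that routing might look. The class internals, attribute names, and stub bodies here are assumptions for illustration, not the PR's actual code:

```python
# Illustrative sketch only; the real classes and prompts live in the rlm package.
RLM_SYSTEM_PROMPT = "<base prompt>"           # real one is in rlm/utils/prompts.py
RLM_MULTIMODAL_SYSTEM_PROMPT = "<mm prompt>"  # real one is in rlm/utils/prompts.py


class LocalREPL:
    def __init__(self, enable_multimodal: bool = False):
        self.functions = {}  # names exposed to REPL code (structure assumed)
        if enable_multimodal:
            self.functions.update({
                "vision_query": self.vision_query,
                "vision_query_batched": self.vision_query_batched,
                "audio_query": self.audio_query,
                "speak": self.speak,
            })

    # Stubs standing in for the real implementations.
    def vision_query(self, prompt, images): ...
    def vision_query_batched(self, prompts, images_list): ...
    def audio_query(self, prompt, audio_files): ...
    def speak(self, text, output_path): ...


class RLM:
    def __init__(self, backend: str, enable_multimodal: bool = False, **kwargs):
        # Route to the multimodal prompt only when the flag is set.
        self.system_prompt = (
            RLM_MULTIMODAL_SYSTEM_PROMPT if enable_multimodal else RLM_SYSTEM_PROMPT
        )
        # Pass the flag down so the environment gates its function table.
        self.environment = LocalREPL(enable_multimodal=enable_multimodal)
```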

Usage:

```python
# Default: no multimodal (backward compatible)
rlm = RLM(backend="gemini", ...)

# With multimodal enabled
rlm = RLM(backend="gemini", ..., enable_multimodal=True)
```

Let me know if you'd like any additional changes!
