A research-grade implementation of the Recursive Language Model (RLM) paradigm for long video understanding, based on the paper "Recursive Language Models" (Zhang, Kraska, Khattab, 2025).
This repository presents RVAA (Recursive Vision-Action Agent), an implementation of the Recursive Language Model framework applied to long-form video understanding. Following the core principle established in the RLM paper, we treat video content as an external environment rather than attempting to process entire videos within a single context window. The agent programmatically explores video content through temporal slicing, frame sampling, and vision-language captioning, then recursively invokes sub-models for local semantic analysis before synthesizing a global response.
Long-form video understanding presents significant challenges for traditional LLM-based approaches. A 21-minute video at 30 FPS contains over 38,000 frames, far exceeding the practical limits of even million-token context windows. Naive approaches that attempt to encode all visual information into a single prompt suffer from:
- Context fragmentation: Important temporal relationships are lost during chunking
- Information overload: Models struggle to identify relevant content in massive contexts
- Cost inefficiency: Processing irrelevant content wastes computational resources
The Recursive Language Model paradigm (Zhang et al., 2025, Section 3) proposes a fundamentally different approach:
"Treat extremely long context as part of an external environment, not something to stuff into an LLM context window."
This is achieved through three key mechanisms:
- REPL-based interaction: The agent writes executable code to explore the environment
- Recursive sub-calls: Local understanding is delegated to specialized sub-models
- Programmatic composition: Global answers are synthesized from local evidence
The RVAA system consists of the following components, mapping directly to RLM paper concepts (Table 1 in the original paper):
```text
+------------------------------------------------------------------+
|                       ROOT AGENT (Root-LM)                        |
|  +------------------------------------------------------------+  |
|  |  REPL Environment                                          |  |
|  |                                                            |  |
|  |  context = VideoEnv(video_path)  # External environment    |  |
|  |  llm_query(prompt)               # Recursive sub-calls     |  |
|  |  get_segment_captions(segment)   # Vision-language API     |  |
|  |  print(...)                      # Observation feedback    |  |
|  |                                                            |  |
|  |  # Agent-generated exploration code:                       |  |
|  |  for segment in context.iter_segments(duration=60):        |  |
|  |      captions = get_segment_captions(segment)              |  |
|  |      summary = llm_query(f"Analyze: {captions}")           |  |
|  |      evidence.append(summary)                              |  |
|  |                                                            |  |
|  |  FINAL(synthesize(evidence))     # Termination             |  |
|  +------------------------------------------------------------+  |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|                        SUB-AGENT (Sub-LM)                         |
|  - Processes segment-level captions                               |
|  - Extracts semantic evidence                                     |
|  - Performs local reasoning tasks                                 |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|               VISION CAPTIONER (Llama 3.2 Vision)                 |
|  - Generates natural language descriptions of video frames        |
|  - Converts visual information to text for LLM processing         |
+------------------------------------------------------------------+
```
| RLM Paper Concept (Section 3) | RVAA Implementation |
|---|---|
| `context` (string buffer) | `VideoEnv` object with temporal slicing |
| `llm_query(prompt)` | Asynchronous sub-calls to Sub-LM |
| REPL environment | Sandboxed Python runtime with restricted builtins |
| Chunking strategies | `context[t0:t1]` temporal segmentation |
| `FINAL(answer)` | Termination token with variable extraction |
| Cost tracking | Token/USD accounting per API call |
| Batching optimization (D.1) | Configurable segment duration |
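To make the mapping concrete, the following is a hedged sketch of how these primitives could be wired into the REPL namespace handed to the Root-LM. The `build_namespace` helper and the `run_sub_lm`/`captioner` parameters are illustrative names, not the repository's actual API; the import path follows the project layout shown later.

```python
from rvaa.env.video_env import VideoEnv  # assumed module path (see project layout)

def build_namespace(video_path, run_sub_lm, captioner, final_answers):
    """Illustrative sketch: bind the RLM primitives into a REPL namespace."""

    def llm_query(prompt: str) -> str:
        # Recursive sub-call: delegate local reasoning to the Sub-LM.
        return run_sub_lm(prompt)

    def get_segment_captions(segment) -> list[str]:
        # Vision-language API: caption sampled frames from a segment.
        return captioner(segment)

    def FINAL(answer: str) -> None:
        # Termination token: record the answer so the driver loop can stop.
        final_answers.append(answer)

    return {
        "context": VideoEnv(video_path),  # the external environment
        "llm_query": llm_query,
        "get_segment_captions": get_segment_captions,
        "FINAL": FINAL,
        "print": print,                   # observation feedback channel
    }
```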
The perception layer employs the Llama 3.2 11B Vision Instruct model via OpenRouter for frame captioning. This approach addresses the key challenge identified in the RLM paper (Section 4.2): converting non-textual modalities into language that can be processed by the reasoning system.
Frame Sampling Strategy:
- Uniform temporal sampling within each segment
- 1-3 frames per segment based on segment duration
- Image resizing to 512px for API efficiency
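A minimal sketch of this sampling policy follows, using Pillow for the resize step. The exact frames-per-segment heuristic (the 20-second threshold below) is an assumption based on the 1-3 frame bullet above, not the repository's exact rule.

```python
from PIL import Image

def sample_timestamps(t0: float, t1: float, n: int) -> list[float]:
    """Uniform temporal sampling: n timestamps evenly spaced inside [t0, t1]."""
    step = (t1 - t0) / (n + 1)
    return [t0 + step * (i + 1) for i in range(n)]

def frames_per_segment(duration: float) -> int:
    """Map segment duration to a 1-3 frame budget (threshold is illustrative)."""
    return max(1, min(3, int(duration // 20)))

def resize_for_api(img: Image.Image, max_side: int = 512) -> Image.Image:
    """Downscale so the longer side is at most max_side pixels."""
    w, h = img.size
    scale = max_side / max(w, h)
    if scale >= 1.0:
        return img  # already small enough
    return img.resize((int(w * scale), int(h * scale)))
```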
Caption Generation:

```python
prompt = (
    "Describe what you see in this video frame in 1-2 sentences. "
    "Focus on: people, actions, text on screen, and setting."
)
```

Following the RLM paradigm (Algorithm 1 in the paper), the root agent implements a multi-step reasoning loop (sketched after the list below):
- Observation: Inspect video metadata and sample frames
- Action: Execute code to segment video and extract captions
- Sub-query: Invoke Sub-LM for local semantic analysis
- Synthesis: Combine local findings into global understanding
- Termination: Return final answer via FINAL() token
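A hedged sketch of this outer observe-act loop is below. The `root_lm` and `sandbox_exec` callables stand in for the actual model client and REPL runtime, and the `action.final_answer`/`action.code` fields are illustrative; the real orchestration lives in `rvaa.agent.root_agent`.

```python
from typing import Callable

def run_root_agent(question: str, env, root_lm: Callable,
                   sandbox_exec: Callable, max_steps: int = 10) -> str:
    # Observation: seed the history with the question and video metadata.
    history = [f"Question: {question}\n"
               f"Video: {env.duration:.0f}s, {env.total_frames} frames"]
    executed_code = False
    for _ in range(max_steps):
        action = root_lm(history)                # propose code or FINAL(...)
        if action.final_answer is not None:
            if executed_code:
                return action.final_answer       # Termination
            history.append("You must explore the video before answering.")
            continue                             # reject premature FINAL()
        observation = sandbox_exec(action.code)  # Action / Sub-query / Synthesis
        executed_code = True
        history.append(observation)              # feed the observation back
    return "Step budget exhausted."
```

Note that the premature-termination guard described next is folded into the loop above as the `executed_code` check.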
To prevent premature termination (a failure mode noted in Section 5.3 of the paper), we implement a validation layer that rejects `FINAL()` tokens if no code execution has occurred:

```python
if not trajectory.has_code_execution:
    return "You must explore the video content before providing a final answer."
```

We evaluated RVAA on a 21-minute news broadcast video (1269 seconds, 38,031 frames, 1280x720 resolution).
Query: "What topics are discussed in this meeting?"
System Configuration:
- Root LM: GPT-5 via OpenRouter
- Sub LM: GPT-5-mini via OpenRouter
- Vision Model: meta-llama/llama-3.2-11b-vision-instruct
Results:
| Metric | Value |
|---|---|
| Total Steps | 7 |
| Code Executions | 1 |
| LLM Sub-Calls | 5 |
| Vision API Calls | 12 (3 per segment x 4 segments) |
| Total Cost | $0.0023 |
| Execution Time | 336.9 seconds |
Extracted Topics:
- U.S. military action in Venezuela and the capture of Nicolas Maduro
- President Trump's formal address to the nation regarding the operation
- ABC News special report coverage and media framing
- Symbolic political imagery (flags, government insignias, formal attire)
The agent demonstrated the expected RLM behavior pattern:
Steps 1-4 (LLM Sub-Calls): Segment-by-segment caption analysis
- Each segment processed independently by Sub-LM
- Vision model generated descriptive captions from sampled frames
- Example caption: "This image depicts former President Donald Trump delivering a speech in front of a microphone, addressing the nation..."
Step 5 (Synthesis): Cross-segment topic synthesis
- Sub-LM aggregated findings from all segments
- Identified recurring themes across temporal boundaries
Step 6 (Code Execution): Verification and metadata extraction
- Confirmed video properties (duration, frame count, resolution)
- Validated segment boundaries
Step 7 (Final Answer): Structured response generation
| Approach | Accuracy | Cost | Notes |
|---|---|---|---|
| Direct Context (baseline) | N/A | N/A | Exceeds context window limits |
| Summarize-then-Answer | Low | Low | Loses temporal detail |
| Top-K Retrieval Only | Medium | Low | No semantic verification |
| RVAA (this work) | High | $0.002-0.005 | Full recursive exploration |
The `VideoEnv` class implements the external environment abstraction (Section 3.1 of the paper):

```python
from typing import Iterator
# VideoMetadata, VideoView, and FrameData are the project's own types.

class VideoEnv:
    # Properties
    duration: float          # Total video duration in seconds
    total_frames: int        # Total frame count
    fps: float               # Frames per second
    metadata: VideoMetadata  # Resolution, codec, etc.

    # Slicing operations
    def __getitem__(self, key: slice) -> VideoView:
        """Temporal slicing: context[10.0:30.0]"""

    # Frame access
    def get_frame(self, timestamp: float) -> FrameData:
        """Get a single frame at the given timestamp."""

    def sample_frames_uniform(self, t0: float, t1: float, n: int) -> list[FrameData]:
        """Sample n frames uniformly from the time range [t0, t1]."""

    # Iteration
    def iter_segments(self, duration: float) -> Iterator[VideoView]:
        """Iterate over fixed-duration segments."""
```
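A usage sketch against this interface, as it might appear inside the agent's REPL (where `context` and `get_segment_captions` are pre-bound; the file name is a placeholder):

```python
context = VideoEnv("broadcast.mp4")
print(context.duration, context.total_frames, context.fps)

intro = context[0.0:30.0]                        # VideoView over the first 30 s
frames = context.sample_frames_uniform(0.0, 60.0, n=3)

for segment in context.iter_segments(duration=60.0):
    captions = get_segment_captions(segment)     # vision-language captions
    print(captions)                              # observation fed back to the agent
```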
"""Iterate over fixed-duration segments"""The sandboxed execution environment provides:
- Restricted Python builtins (no file I/O, network, or system access)
- Automatic output truncation (configurable, default 4000 chars)
- Error capture and recovery
- Variable namespace persistence across executions
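The pattern behind these guarantees can be sketched as below. The allow-list is illustrative, and `exec`-based restriction is not a hardened security boundary; this is a minimal sketch of the mechanism, not the repository's exact runtime.

```python
import builtins
import contextlib
import io

# Illustrative allow-list; the real runtime's set may differ.
ALLOWED = {"print", "len", "range", "min", "max", "sum", "enumerate", "sorted"}
SAFE_BUILTINS = {name: getattr(builtins, name) for name in ALLOWED}

def sandbox_exec(code: str, namespace: dict, max_chars: int = 4000) -> str:
    """Run agent code with restricted builtins; return the captured observation."""
    namespace["__builtins__"] = SAFE_BUILTINS   # no open(), no __import__
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)               # namespace persists across calls
    except Exception as exc:                    # error capture and recovery
        buf.write(f"[error] {type(exc).__name__}: {exc}")
    out = buf.getvalue()
    if len(out) > max_chars:                    # automatic output truncation
        out = out[:max_chars] + "\n[truncated]"
    return out
```

Passing the same `namespace` dict on every call is what gives variable persistence across executions.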
Real-time trajectory visualization is implemented via Server-Sent Events (SSE):
```text
/query (POST)          -> {run_id}
/stream/{run_id} (GET) -> SSE event stream
```
Event types conform to the protocol defined in Section 6.2 of the paper:
- `code_execution`: Agent executed REPL code
- `llm_query`: Sub-LM invocation with prompt/response
- `final_answer`: Termination with extracted answer
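As a sketch, the corresponding payloads might look like the following; the field names beyond `type` are assumptions, not the exact wire schema (the wire format is JSON carried in SSE `data:` lines):

```python
from typing import TypedDict

class CodeExecutionEvent(TypedDict):
    type: str     # "code_execution"
    code: str     # REPL code the agent ran
    output: str   # captured (truncated) stdout

class LLMQueryEvent(TypedDict):
    type: str     # "llm_query"
    prompt: str
    response: str

class FinalAnswerEvent(TypedDict):
    type: str     # "final_answer"
    answer: str
```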
Requirements:

- Python 3.10+
- OpenRouter API key (for LLM and vision model access)
- FFmpeg (for video decoding)
Installation:

```bash
git clone https://github.com/rvaa/rvaa.git
cd rvaa
python -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
```

Create a `.env` file with your API credentials:
```bash
OPENROUTER_API_KEY=your-api-key
RVAA_ROOT_MODEL=openai/gpt-5
RVAA_SUB_MODEL=openai/gpt-5-mini
```

Run the server:

```bash
python -m rvaa.server.api
# Server available at http://localhost:8000
```

| Endpoint | Method | Description |
|---|---|---|
| `/query` | POST | Submit video query, returns `run_id` |
| `/stream/{run_id}` | GET | SSE stream of agent trajectory |
| `/video` | GET | Stream video file for preview |
| `/health` | GET | Health check |
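A hypothetical client for these endpoints; the request body fields (`video_path`, `question`) are assumptions about the schema, and `requests` is used here for illustration rather than being a project dependency:

```python
import json
import requests

BASE = "http://localhost:8000"

# Submit a query and obtain the run_id.
run = requests.post(f"{BASE}/query", json={
    "video_path": "broadcast.mp4",
    "question": "What topics are discussed in this meeting?",
}).json()

# Follow the SSE stream; events arrive as `data: {...}` lines.
with requests.get(f"{BASE}/stream/{run['run_id']}", stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            event = json.loads(line[len("data:"):])
            print(event.get("type"), event)
```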
- Recursion Depth: Limited to depth 1 (sub-calls invoke LMs, not recursive RLMs)
- Vision Model Latency: Per-frame API calls introduce significant latency
- No Training: Uses frozen frontier models without task-specific fine-tuning
- Single Video: No multi-video reasoning or cross-reference capability
- Hierarchical RLM: Implement deeper recursion for multi-scale reasoning
- Cached Perception: Pre-compute frame captions for frequently accessed videos
- Audio Integration: Incorporate speech-to-text for dialogue understanding
- Fine-tuned Models: Train specialized sub-models for video understanding tasks
```text
src/rvaa/
├── agent/
│   ├── root_agent.py        # RLM orchestrator with REPL runtime
│   └── trajectory.py        # Step tracking and cost accounting
├── env/
│   └── video_env.py         # Video environment abstraction
├── tools/
│   ├── llm_backends.py      # OpenAI, Qwen, Claude API clients
│   ├── vision_captioner.py  # Llama 3.2 Vision integration
│   └── video_io.py          # Efficient video decoding (PyAV)
├── server/
│   ├── api.py               # FastAPI server
│   ├── streaming.py         # SSE event streaming
│   └── static/              # Web UI
└── eval/
    ├── metrics.py           # Evaluation metrics
    └── ablations.py         # Baseline implementations
```
- Zhang, M., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv preprint arXiv:2512.24601.
- OpenRouter API Documentation. https://openrouter.ai/docs
- Meta AI. (2024). Llama 3.2 Vision Models. https://ai.meta.com/llama/

```bibtex
@article{zhang2025recursive,
  title={Recursive Language Models},
  author={Zhang, Michael and Kraska, Tim and Khattab, Omar},
  journal={arXiv preprint arXiv:2512.24601},
  year={2025}
}
```

MIT License. See LICENSE for details.