Skip to content

RVAA: Recursive Vision-Action Agent for Long Video Understanding. Implementation of the RLM paradigm (Zhang, Kraska, Khattab 2025)

License

Notifications You must be signed in to change notification settings

mohammed840/RLM-implementation

Repository files navigation

RVAA: Recursive Vision-Action Agent for Long Video Understanding

Python 3.10+ License: MIT

A research-grade implementation of the Recursive Language Model (RLM) paradigm for long video understanding, based on the paper "Recursive Language Models" (Zhang, Kraska, Khattab, 2025).


Abstract

This repository presents RVAA (Recursive Vision-Action Agent), an implementation of the Recursive Language Model framework applied to long-form video understanding. Following the core principle established in the RLM paper, we treat video content as an external environment rather than attempting to process entire videos within a single context window. The agent programmatically explores video content through temporal slicing, frame sampling, and vision-language captioning, then recursively invokes sub-models for local semantic analysis before synthesizing a global response.


1. Introduction

1.1 Problem Statement

Long-form video understanding presents significant challenges for traditional LLM-based approaches. A 21-minute video at 30 FPS contains over 38,000 frames, far exceeding the practical limits of even million-token context windows. Naive approaches that attempt to encode all visual information into a single prompt suffer from:

  1. Context fragmentation: Important temporal relationships are lost during chunking
  2. Information overload: Models struggle to identify relevant content in massive contexts
  3. Cost inefficiency: Processing irrelevant content wastes computational resources

1.2 The RLM Paradigm

The Recursive Language Model paradigm (Zhang et al., 2025, Section 3) proposes a fundamentally different approach:

"Treat extremely long context as part of an external environment, not something to stuff into an LLM context window."

This is achieved through three key mechanisms:

  1. REPL-based interaction: The agent writes executable code to explore the environment
  2. Recursive sub-calls: Local understanding is delegated to specialized sub-models
  3. Programmatic composition: Global answers are synthesized from local evidence

2. Architecture

2.1 System Overview

The RVAA system consists of the following components, mapping directly to RLM paper concepts (Table 1 in the original paper):

+------------------------------------------------------------------+
|                      ROOT AGENT (Root-LM)                        |
|  +------------------------------------------------------------+  |
|  |                    REPL Environment                        |  |
|  |                                                            |  |
|  |  context = VideoEnv(video_path)   # External environment  |  |
|  |  llm_query(prompt)                # Recursive sub-calls   |  |
|  |  get_segment_captions(segment)    # Vision-language API   |  |
|  |  print(...)                       # Observation feedback  |  |
|  |                                                            |  |
|  |  # Agent-generated exploration code:                      |  |
|  |  for segment in context.iter_segments(duration=60):       |  |
|  |      captions = get_segment_captions(segment)             |  |
|  |      summary = llm_query(f"Analyze: {captions}")          |  |
|  |      evidence.append(summary)                             |  |
|  |                                                            |  |
|  |  FINAL(synthesize(evidence))      # Termination           |  |
|  +------------------------------------------------------------+  |
+------------------------------------------------------------------+
                              |
                              v
+------------------------------------------------------------------+
|                    SUB-AGENT (Sub-LM)                            |
|  - Processes segment-level captions                              |
|  - Extracts semantic evidence                                    |
|  - Performs local reasoning tasks                                |
+------------------------------------------------------------------+
                              |
                              v
+------------------------------------------------------------------+
|                 VISION CAPTIONER (Llama 3.2 Vision)              |
|  - Generates natural language descriptions of video frames      |
|  - Converts visual information to text for LLM processing       |
+------------------------------------------------------------------+

2.2 Concept Mapping

RLM Paper Concept (Section 3) RVAA Implementation
context (string buffer) VideoEnv object with temporal slicing
llm_query(prompt) Asynchronous sub-calls to Sub-LM
REPL environment Sandboxed Python runtime with restricted builtins
Chunking strategies context[t0:t1] temporal segmentation
FINAL(answer) Termination token with variable extraction
Cost tracking Token/USD accounting per API call
Batching optimization (D.1) Configurable segment duration

3. Machine Learning Methodology

3.1 Vision-Language Integration

The perception layer employs the Llama 3.2 11B Vision Instruct model via OpenRouter for frame captioning. This approach addresses the key challenge identified in the RLM paper (Section 4.2): converting non-textual modalities into language that can be processed by the reasoning system.

Frame Sampling Strategy:

  • Uniform temporal sampling within each segment
  • 1-3 frames per segment based on segment duration
  • Image resizing to 512px for API efficiency

Caption Generation:

prompt = "Describe what you see in this video frame in 1-2 sentences. 
         Focus on: people, actions, text on screen, and setting."

3.2 Recursive Reasoning

Following the RLM paradigm (Algorithm 1 in the paper), the root agent implements a multi-step reasoning loop:

  1. Observation: Inspect video metadata and sample frames
  2. Action: Execute code to segment video and extract captions
  3. Sub-query: Invoke Sub-LM for local semantic analysis
  4. Synthesis: Combine local findings into global understanding
  5. Termination: Return final answer via FINAL() token

3.3 Forced Exploration Mechanism

To prevent premature termination (a failure mode noted in Section 5.3 of the paper), we implement a validation layer that rejects FINAL() tokens if no code execution has occurred:

if not trajectory.has_code_execution:
    return "You must explore the video content before providing a final answer."

4. Experimental Results

4.1 Qualitative Evaluation

We evaluated RVAA on a 21-minute news broadcast video (1269 seconds, 38,031 frames, 1280x720 resolution).

Query: "What topics are discussed in this meeting?"

System Configuration:

  • Root LM: GPT-5 via OpenRouter
  • Sub LM: GPT-5-mini via OpenRouter
  • Vision Model: meta-llama/llama-3.2-11b-vision-instruct

Results:

Metric Value
Total Steps 7
Code Executions 1
LLM Sub-Calls 5
Vision API Calls 12 (3 per segment x 4 segments)
Total Cost $0.0023
Execution Time 336.9 seconds

Extracted Topics:

  1. U.S. military action in Venezuela and the capture of Nicolas Maduro
  2. President Trump's formal address to the nation regarding the operation
  3. ABC News special report coverage and media framing
  4. Symbolic political imagery (flags, government insignias, formal attire)

4.2 Agent Trajectory Analysis

The agent demonstrated the expected RLM behavior pattern:

Step 1-4 (LLM Sub-Calls): Segment-by-segment caption analysis

  • Each segment processed independently by Sub-LM
  • Vision model generated descriptive captions from sampled frames
  • Example caption: "This image depicts former President Donald Trump delivering a speech in front of a microphone, addressing the nation..."

Step 5 (Synthesis): Cross-segment topic synthesis

  • Sub-LM aggregated findings from all segments
  • Identified recurring themes across temporal boundaries

Step 6 (Code Execution): Verification and metadata extraction

  • Confirmed video properties (duration, frame count, resolution)
  • Validated segment boundaries

Step 7 (Final Answer): Structured response generation

4.3 Comparison with Baseline Approaches

Approach Accuracy Cost Notes
Direct Context (baseline) N/A N/A Exceeds context window limits
Summarize-then-Answer Low Low Loses temporal detail
Top-K Retrieval Only Medium Low No semantic verification
RVAA (this work) High $0.002-0.005 Full recursive exploration

5. Implementation Details

5.1 Video Environment API

The VideoEnv class implements the external environment abstraction (Section 3.1 of the paper):

class VideoEnv:
    # Properties
    duration: float          # Total video duration in seconds
    total_frames: int        # Total frame count
    fps: float               # Frames per second
    metadata: VideoMetadata  # Resolution, codec, etc.
    
    # Slicing operations
    def __getitem__(self, slice) -> VideoView:
        """Temporal slicing: context[10.0:30.0]"""
    
    # Frame access
    def get_frame(self, timestamp: float) -> FrameData:
        """Get single frame at timestamp"""
    
    def sample_frames_uniform(self, t0, t1, n) -> list[FrameData]:
        """Sample n frames uniformly from time range"""
    
    # Iteration
    def iter_segments(self, duration: float) -> Iterator[VideoView]:
        """Iterate over fixed-duration segments"""

5.2 REPL Runtime

The sandboxed execution environment provides:

  • Restricted Python builtins (no file I/O, network, or system access)
  • Automatic output truncation (configurable, default 4000 chars)
  • Error capture and recovery
  • Variable namespace persistence across executions

5.3 Streaming Architecture

Real-time trajectory visualization is implemented via Server-Sent Events (SSE):

/query (POST) -> {run_id}
/stream/{run_id} (GET/SSE) -> Event stream

Event types conform to the protocol defined in Section 6.2 of the paper:

  • code_execution: Agent executed REPL code
  • llm_query: Sub-LM invocation with prompt/response
  • final_answer: Termination with extracted answer

6. Installation and Usage

6.1 Requirements

  • Python 3.10+
  • OpenRouter API key (for LLM and vision model access)
  • FFmpeg (for video decoding)

6.2 Installation

git clone https://github.com/rvaa/rvaa.git
cd rvaa
python -m venv venv
source venv/bin/activate
pip install -e ".[dev]"

6.3 Configuration

Create a .env file with your API credentials:

OPENROUTER_API_KEY=your-api-key
RVAA_ROOT_MODEL=openai/gpt-5
RVAA_SUB_MODEL=openai/gpt-5-mini

6.4 Running the Server

python -m rvaa.server.api
# Server available at http://localhost:8000

6.5 API Endpoints

Endpoint Method Description
/query POST Submit video query, returns run_id
/stream/{run_id} GET SSE stream of agent trajectory
/video GET Stream video file for preview
/health GET Health check

7. Limitations and Future Work

7.1 Current Limitations

  1. Recursion Depth: Limited to depth 1 (sub-calls invoke LMs, not recursive RLMs)
  2. Vision Model Latency: Per-frame API calls introduce significant latency
  3. No Training: Uses frozen frontier models without task-specific fine-tuning
  4. Single Video: No multi-video reasoning or cross-reference capability

7.2 Future Directions

  1. Hierarchical RLM: Implement deeper recursion for multi-scale reasoning
  2. Cached Perception: Pre-compute frame captions for frequently accessed videos
  3. Audio Integration: Incorporate speech-to-text for dialogue understanding
  4. Fine-tuned Models: Train specialized sub-models for video understanding tasks

8. Project Structure

src/rvaa/
├── agent/
│   ├── root_agent.py          # RLM orchestrator with REPL runtime
│   └── trajectory.py          # Step tracking and cost accounting
├── env/
│   └── video_env.py           # Video environment abstraction
├── tools/
│   ├── llm_backends.py        # OpenAI, Qwen, Claude API clients
│   ├── vision_captioner.py    # Llama 3.2 Vision integration
│   └── video_io.py            # Efficient video decoding (PyAV)
├── server/
│   ├── api.py                 # FastAPI server
│   ├── streaming.py           # SSE event streaming
│   └── static/                # Web UI
└── eval/
    ├── metrics.py             # Evaluation metrics
    └── ablations.py           # Baseline implementations

9. References

  1. Zhang, M., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv preprint arXiv:2512.24601.

  2. OpenRouter API Documentation. https://openrouter.ai/docs

  3. Meta AI. (2024). Llama 3.2 Vision Models. https://ai.meta.com/llama/


10. Citation

@article{zhang2025recursive,
  title={Recursive Language Models},
  author={Zhang, Michael and Kraska, Tim and Khattab, Omar},
  journal={arXiv preprint arXiv:2512.24601},
  year={2025}
}

License

MIT License. See LICENSE for details.

About

RVAA: Recursive Vision-Action Agent for Long Video Understanding. Implementation of the RLM paradigm (Zhang, Kraska, Khattab 2025)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published