RVAA: Recursive Vision-Action Agent for Long Video Understanding

A research-grade implementation of the Recursive Language Model (RLM) paradigm for long video understanding, based on the paper "Recursive Language Models" (Zhang, Kraska, Khattab, 2025).

Abstract

This repository presents RVAA (Recursive Vision-Action Agent), an implementation of the Recursive Language Model framework applied to long-form video understanding. Following the core principle established in the RLM paper, we treat video content as an external environment rather than attempting to process entire videos within a single context window. The agent programmatically explores video content through temporal slicing, frame sampling, and vision-language captioning, then recursively invokes sub-models for local semantic analysis before synthesizing a global response.

1. Introduction

1.1 Problem Statement

Long-form video understanding presents significant challenges for traditional LLM-based approaches. A 21-minute video at 30 FPS contains over 38,000 frames, far exceeding the practical limits of even million-token context windows. Naive approaches that attempt to encode all visual information into a single prompt suffer from:

Context fragmentation: Important temporal relationships are lost during chunking
Information overload: Models struggle to identify relevant content in massive contexts
Cost inefficiency: Processing irrelevant content wastes computational resources

1.2 The RLM Paradigm

The Recursive Language Model paradigm (Zhang et al., 2025, Section 3) proposes a fundamentally different approach:

"Treat extremely long context as part of an external environment, not something to stuff into an LLM context window."

This is achieved through three key mechanisms:

REPL-based interaction: The agent writes executable code to explore the environment
Recursive sub-calls: Local understanding is delegated to specialized sub-models
Programmatic composition: Global answers are synthesized from local evidence

2. Architecture

2.1 System Overview

The RVAA system consists of the following components, mapping directly to RLM paper concepts (Table 1 in the original paper):

+------------------------------------------------------------------+
|                      ROOT AGENT (Root-LM)                        |
|  +------------------------------------------------------------+  |
|  |                    REPL Environment                        |  |
|  |                                                            |  |
|  |  context = VideoEnv(video_path)   # External environment  |  |
|  |  llm_query(prompt)                # Recursive sub-calls   |  |
|  |  get_segment_captions(segment)    # Vision-language API   |  |
|  |  print(...)                       # Observation feedback  |  |
|  |                                                            |  |
|  |  # Agent-generated exploration code:                      |  |
|  |  for segment in context.iter_segments(duration=60):       |  |
|  |      captions = get_segment_captions(segment)             |  |
|  |      summary = llm_query(f"Analyze: {captions}")          |  |
|  |      evidence.append(summary)                             |  |
|  |                                                            |  |
|  |  FINAL(synthesize(evidence))      # Termination           |  |
|  +------------------------------------------------------------+  |
+------------------------------------------------------------------+
                              |
                              v
+------------------------------------------------------------------+
|                    SUB-AGENT (Sub-LM)                            |
|  - Processes segment-level captions                              |
|  - Extracts semantic evidence                                    |
|  - Performs local reasoning tasks                                |
+------------------------------------------------------------------+
                              |
                              v
+------------------------------------------------------------------+
|                 VISION CAPTIONER (Llama 3.2 Vision)              |
|  - Generates natural language descriptions of video frames      |
|  - Converts visual information to text for LLM processing       |
+------------------------------------------------------------------+

2.2 Concept Mapping

RLM Paper Concept (Section 3)	RVAA Implementation
`context` (string buffer)	`VideoEnv` object with temporal slicing
`llm_query(prompt)`	Asynchronous sub-calls to Sub-LM
REPL environment	Sandboxed Python runtime with restricted builtins
Chunking strategies	`context[t0:t1]` temporal segmentation
`FINAL(answer)`	Termination token with variable extraction
Cost tracking	Token/USD accounting per API call
Batching optimization (D.1)	Configurable segment duration

3. Machine Learning Methodology

3.1 Vision-Language Integration

The perception layer employs the Llama 3.2 11B Vision Instruct model via OpenRouter for frame captioning. This approach addresses the key challenge identified in the RLM paper (Section 4.2): converting non-textual modalities into language that can be processed by the reasoning system.

Frame Sampling Strategy:

Uniform temporal sampling within each segment
1-3 frames per segment based on segment duration
Image resizing to 512px for API efficiency

Caption Generation:

prompt = "Describe what you see in this video frame in 1-2 sentences. 
         Focus on: people, actions, text on screen, and setting."

3.2 Recursive Reasoning

Following the RLM paradigm (Algorithm 1 in the paper), the root agent implements a multi-step reasoning loop:

Observation: Inspect video metadata and sample frames
Action: Execute code to segment video and extract captions
Sub-query: Invoke Sub-LM for local semantic analysis
Synthesis: Combine local findings into global understanding
Termination: Return final answer via FINAL() token

3.3 Forced Exploration Mechanism

To prevent premature termination (a failure mode noted in Section 5.3 of the paper), we implement a validation layer that rejects FINAL() tokens if no code execution has occurred:

if not trajectory.has_code_execution:
    return "You must explore the video content before providing a final answer."

4. Experimental Results

4.1 Qualitative Evaluation

We evaluated RVAA on a 21-minute news broadcast video (1269 seconds, 38,031 frames, 1280x720 resolution).

Query: "What topics are discussed in this meeting?"

System Configuration:

Root LM: GPT-5 via OpenRouter
Sub LM: GPT-5-mini via OpenRouter
Vision Model: meta-llama/llama-3.2-11b-vision-instruct

Results:

Metric	Value
Total Steps	7
Code Executions	1
LLM Sub-Calls	5
Vision API Calls	12 (3 per segment x 4 segments)
Total Cost	$0.0023
Execution Time	336.9 seconds

Extracted Topics:

U.S. military action in Venezuela and the capture of Nicolas Maduro
President Trump's formal address to the nation regarding the operation
ABC News special report coverage and media framing
Symbolic political imagery (flags, government insignias, formal attire)

4.2 Agent Trajectory Analysis

The agent demonstrated the expected RLM behavior pattern:

Step 1-4 (LLM Sub-Calls): Segment-by-segment caption analysis

Each segment processed independently by Sub-LM
Vision model generated descriptive captions from sampled frames
Example caption: "This image depicts former President Donald Trump delivering a speech in front of a microphone, addressing the nation..."

Step 5 (Synthesis): Cross-segment topic synthesis

Sub-LM aggregated findings from all segments
Identified recurring themes across temporal boundaries

Step 6 (Code Execution): Verification and metadata extraction

Confirmed video properties (duration, frame count, resolution)
Validated segment boundaries

Step 7 (Final Answer): Structured response generation

4.3 Comparison with Baseline Approaches

Approach	Accuracy	Cost	Notes
Direct Context (baseline)	N/A	N/A	Exceeds context window limits
Summarize-then-Answer	Low	Low	Loses temporal detail
Top-K Retrieval Only	Medium	Low	No semantic verification
RVAA (this work)	High	$0.002-0.005	Full recursive exploration

5. Implementation Details

5.1 Video Environment API

The VideoEnv class implements the external environment abstraction (Section 3.1 of the paper):

class VideoEnv:
    # Properties
    duration: float          # Total video duration in seconds
    total_frames: int        # Total frame count
    fps: float               # Frames per second
    metadata: VideoMetadata  # Resolution, codec, etc.
    
    # Slicing operations
    def __getitem__(self, slice) -> VideoView:
        """Temporal slicing: context[10.0:30.0]"""
    
    # Frame access
    def get_frame(self, timestamp: float) -> FrameData:
        """Get single frame at timestamp"""
    
    def sample_frames_uniform(self, t0, t1, n) -> list[FrameData]:
        """Sample n frames uniformly from time range"""
    
    # Iteration
    def iter_segments(self, duration: float) -> Iterator[VideoView]:
        """Iterate over fixed-duration segments"""

5.2 REPL Runtime

The sandboxed execution environment provides:

Restricted Python builtins (no file I/O, network, or system access)
Automatic output truncation (configurable, default 4000 chars)
Error capture and recovery
Variable namespace persistence across executions

5.3 Streaming Architecture

Real-time trajectory visualization is implemented via Server-Sent Events (SSE):

/query (POST) -> {run_id}
/stream/{run_id} (GET/SSE) -> Event stream

Event types conform to the protocol defined in Section 6.2 of the paper:

code_execution: Agent executed REPL code
llm_query: Sub-LM invocation with prompt/response
final_answer: Termination with extracted answer

6. Installation and Usage

6.1 Requirements

Python 3.10+
OpenRouter API key (for LLM and vision model access)
FFmpeg (for video decoding)

6.2 Installation

git clone https://github.com/rvaa/rvaa.git
cd rvaa
python -m venv venv
source venv/bin/activate
pip install -e ".[dev]"

6.3 Configuration

Create a .env file with your API credentials:

OPENROUTER_API_KEY=your-api-key
RVAA_ROOT_MODEL=openai/gpt-5
RVAA_SUB_MODEL=openai/gpt-5-mini

6.4 Running the Server

python -m rvaa.server.api
# Server available at http://localhost:8000

6.5 API Endpoints

Endpoint	Method	Description
`/query`	POST	Submit video query, returns run_id
`/stream/{run_id}`	GET	SSE stream of agent trajectory
`/video`	GET	Stream video file for preview
`/health`	GET	Health check

7. Limitations and Future Work

7.1 Current Limitations

Recursion Depth: Limited to depth 1 (sub-calls invoke LMs, not recursive RLMs)
Vision Model Latency: Per-frame API calls introduce significant latency
No Training: Uses frozen frontier models without task-specific fine-tuning
Single Video: No multi-video reasoning or cross-reference capability

7.2 Future Directions

Hierarchical RLM: Implement deeper recursion for multi-scale reasoning
Cached Perception: Pre-compute frame captions for frequently accessed videos
Audio Integration: Incorporate speech-to-text for dialogue understanding
Fine-tuned Models: Train specialized sub-models for video understanding tasks

8. Project Structure

src/rvaa/
├── agent/
│   ├── root_agent.py          # RLM orchestrator with REPL runtime
│   └── trajectory.py          # Step tracking and cost accounting
├── env/
│   └── video_env.py           # Video environment abstraction
├── tools/
│   ├── llm_backends.py        # OpenAI, Qwen, Claude API clients
│   ├── vision_captioner.py    # Llama 3.2 Vision integration
│   └── video_io.py            # Efficient video decoding (PyAV)
├── server/
│   ├── api.py                 # FastAPI server
│   ├── streaming.py           # SSE event streaming
│   └── static/                # Web UI
└── eval/
    ├── metrics.py             # Evaluation metrics
    └── ablations.py           # Baseline implementations

9. References

Zhang, M., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv preprint arXiv:2512.24601.
OpenRouter API Documentation. https://openrouter.ai/docs
Meta AI. (2024). Llama 3.2 Vision Models. https://ai.meta.com/llama/

10. Citation

@article{zhang2025recursive,
  title={Recursive Language Models},
  author={Zhang, Michael and Kraska, Tim and Khattab, Omar},
  journal={arXiv preprint arXiv:2512.24601},
  year={2025}
}

License

MIT License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
src/rvaa		src/rvaa
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
run_server.py		run_server.py
test_agent.py		test_agent.py
test_video_agent.py		test_video_agent.py

License

mohammed840/RLM-implementation

Folders and files

Latest commit

History

Repository files navigation

RVAA: Recursive Vision-Action Agent for Long Video Understanding

Abstract

1. Introduction

1.1 Problem Statement

1.2 The RLM Paradigm

2. Architecture

2.1 System Overview

2.2 Concept Mapping

3. Machine Learning Methodology

3.1 Vision-Language Integration

3.2 Recursive Reasoning

3.3 Forced Exploration Mechanism

4. Experimental Results

4.1 Qualitative Evaluation

4.2 Agent Trajectory Analysis

4.3 Comparison with Baseline Approaches

5. Implementation Details

5.1 Video Environment API

5.2 REPL Runtime

5.3 Streaming Architecture

6. Installation and Usage

6.1 Requirements

6.2 Installation

6.3 Configuration

6.4 Running the Server

6.5 API Endpoints

7. Limitations and Future Work

7.1 Current Limitations

7.2 Future Directions

8. Project Structure

9. References

10. Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages