A Recursive Language Model (RLM) implementation for analyzing massive datasets with virtually unlimited context windows
This project demonstrates the Recursive Language Model (RLM) inference paradigm—a technique that allows LLMs to process datasets far exceeding their native context window limits by treating the LLM as a programmer rather than a reader.
- What is RLM?
- Architecture Overview
- How It Works: Step-by-Step
- Project Structure
- Quick Start
- Configuration
- Understanding the Cost Model
- Example Output
- Technical Deep Dive
## What is RLM?

Recursive Language Models (RLMs) are not a new neural network architecture—they are an inference strategy that transforms how LLMs interact with large contexts.
| Approach | Limitation |
|---|---|
| Direct Context | Model forgets details as context grows ("context rot") |
| RAG (Retrieval) | May miss relevant data if semantic search fails |
| Summarization | Loses granular details needed for complex analysis |
Instead of feeding the entire dataset into the LLM's context window:
- Load data as a variable in a Python REPL environment
- Instruct the LLM to write code to programmatically explore the data
- Allow recursive delegation to cheaper/smaller LLMs for chunk processing
- Aggregate results through the code the LLM writes itself
```
Traditional: LLM ← [ENTIRE DATASET] ← User Question

RLM:         LLM ← [Code Environment + Variable Reference] ← User Question
                     └── LLM writes code to read/analyze chunks as needed
```
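The loop behind this is small enough to sketch directly. The helper callables below (`root_lm`, `extract_code`, `run_in_repl`) are hypothetical stand-ins for the project's actual components, so treat this as a conceptual sketch rather than the implementation in `rlm/rlm_repl.py`:

```python
# Conceptual sketch of the RLM control loop; helper callables are hypothetical stand-ins.
from typing import Callable

def rlm_loop(
    query: str,
    root_lm: Callable[[list], str],       # Root LM chat call: messages -> reply text
    extract_code: Callable[[str], str],   # pulls the fenced repl block out of a reply
    run_in_repl: Callable[[str], str],    # executes code, returns captured print() output
    system_prompt: str,
    max_iterations: int = 15,
) -> str:
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": query}]
    for _ in range(max_iterations):
        reply = root_lm(messages)                  # Root LM plans and writes code
        if reply.strip().startswith("FINAL"):
            return reply                           # Root LM has compiled its final answer
        messages.append({"role": "assistant", "content": reply})
        output = run_in_repl(extract_code(reply))  # may fan out to the Sub LM via llm_query
        messages.append({"role": "user", "content": output})  # only printed output flows back
    return "Stopped: max_iterations reached without a FINAL answer."
```

Everything else in this repository is plumbing around this loop: prompting the Root LM well, sandboxing the code it writes, and routing `llm_query` calls to the cheaper Sub LM.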
## Architecture Overview

```mermaid
flowchart TB
subgraph User["👤 User"]
Q[Query: Find opinion changes in meetings]
D[(Dataset: 8.5 MB<br/>162 meetings)]
end
subgraph System["🔧 System Layer"]
REPL["Python REPL Environment<br/><code>context</code> variable loaded"]
EXEC["Code Executor<br/>Sandboxed Python"]
end
subgraph RootLM["🧠 Root LM (GPT-4o)"]
THINK[Reasoning & Planning]
CODE[Generate Python Code]
FINAL[Compile Final Answer]
end
subgraph SubLM["⚡ Sub LM (GPT-4o-mini)"]
READ[Read Data Chunks]
EXTRACT[Extract Information]
RETURN[Return Processed Results]
end
Q --> RootLM
D --> REPL
REPL --> |"context available"| EXEC
THINK --> CODE
CODE --> |"repl code block"| EXEC
EXEC --> |"Executes code"| SubLM
SubLM --> |"Extracted insights"| EXEC
EXEC --> |"print output"| RootLM
RootLM --> |"FINAL"| FINAL
style RootLM fill:#4CAF50,color:#fff
style SubLM fill:#2196F3,color:#fff
style REPL fill:#FF9800,color:#fff
```
| Component | Role | Example Model |
|---|---|---|
| Root LM | Orchestrator—plans strategy, writes code | GPT-4o |
| Sub LM | Worker—reads chunks, extracts information | GPT-4o-mini |
| REPL Environment | Executes code, manages data | Python 3.12 |
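In code, the two tiers are simply two model names on the same API; the project wraps this in its own `OpenAIClient` (see `llm.py`), and only the Sub LM is exposed inside the REPL as `llm_query`. A rough sketch using the `openai` Python client directly, with illustrative prompts:

```python
# Illustrative two-tier setup; requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def chat(model: str, prompt: str) -> str:
    """Single-turn completion helper shared by both tiers."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Root LM plans and writes code; Sub LM reads raw chunks from inside the REPL.
plan = chat("gpt-4o", "Outline a strategy to scan 162 meeting transcripts for opinion changes.")
chunk_summary = chat("gpt-4o-mini", "List any opinion changes in this transcript chunk: ...")
```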
## How It Works: Step-by-Step

```mermaid
sequenceDiagram
participant U as User
participant S as System
participant R as Root LM (GPT-4o)
participant P as Python REPL
participant Sub as Sub LM (GPT-4o-mini)
U->>S: Query + Dataset (8.5 MB)
S->>P: Load dataset as `context` variable
S->>R: System Prompt + Query
Note over R: "I need to explore the context first"
R->>P: ```repl<br/>print(context[:5000])
P-->>R: "=== Meeting 1 ===<br/>ID: Bdb001..."
Note over R: "Ah, meetings are separated by '=== Meeting'"
R->>P: ```repl<br/>meetings = context.split('=== Meeting')<br/>len(meetings)
P-->>R: 162
Note over R: "162 meetings! I'll analyze in batches using Sub LM"
loop For each batch of meetings
R->>P: ```repl<br/>for m in meetings[0:10]:<br/> result = llm_query(f"Analyze: {m}")
P->>Sub: "Analyze this meeting for opinion changes..."
Sub-->>P: "Participant X changed opinion on topic Y..."
P-->>R: [Analysis results printed]
end
R->>S: FINAL("Here are all opinion changes found...")
S->>U: Final Answer
```
The dataset is NOT sent to the Root LM. Instead:
```python
# System writes data to a temp file
context_path = "/tmp/repl_env_xxx/context.txt"
with open(context_path, "w") as f:
    f.write(dataset)  # 8.5 MB of meeting transcripts

# Then loads it into the REPL namespace
repl_globals = {}
exec(f"context = open({context_path!r}).read()", repl_globals)
```

The Root LM receives a system prompt that says:
"You have access to a
contextvariable. Use Python code to explore it."
The Root LM writes code like:
```python
# Root LM's first move: understand the data structure
print(context[:5000])  # See first 5000 chars
```

After seeing the structure, the Root LM decides HOW to process the data:
```python
# Root LM decides to split by the meeting delimiter
meetings = context.split('=== Meeting')
print(f"Found {len(meetings)} meetings")
```

The Root LM uses `llm_query()` to send chunks to the Sub LM:
```python
results = []
for i, meeting in enumerate(meetings[:10]):
    # This calls the Sub LM (GPT-4o-mini)
    analysis = llm_query(f"""
    Analyze this meeting transcript for opinion changes.
    Return: participant name, topic, initial position, final position.

    Transcript:
    {meeting}
    """)
    results.append(f"Meeting {i}: {analysis}")
```

After processing, the Root LM compiles the results:
```python
FINAL("""
Based on my analysis of 162 meetings, I found these opinion changes:

1. **Grad C** (Meeting 1)
   - Topic: XML format for data representation
   - Initial: "XML might not be suitable for sub-word data"
   - Final: "We should explore ATLAS format for flexibility"

2. **PhD F** (Meeting 1)
   - Topic: Data format standards
   - Initial: Preferred standard formats only
   - Final: Open to custom formats if well-documented
...
""")
```

## Project Structure

```
meeting_analyst/
├── main.py              # Entry point
├── config_loader.py     # YAML configuration loader
├── data_loader.py       # Dataset downloader (HuggingFace)
└── rlm/
    ├── rlm.py           # Abstract base class
    ├── rlm_repl.py      # Root LM orchestrator
    ├── repl.py          # Python REPL environment + Sub LM
    ├── logger/          # Colorful terminal logging
    │   ├── root_logger.py
    │   └── repl_logger.py
    └── utils/
        ├── llm.py       # OpenAI API client with metrics
        ├── prompts.py   # System prompts and templates
        └── utils.py     # Code parsing and execution
```
| File | Purpose |
|---|---|
| `rlm_repl.py` | Manages the Root LM conversation loop |
| `repl.py` | Provides sandboxed Python execution + Sub LM |
| `prompts.py` | Contains the critical system prompt that teaches RLM behavior |
| `llm.py` | OpenAI client with token/latency tracking |
## Quick Start

- Python 3.12+
- OpenAI API key
- uv package manager (recommended)
```bash
# Clone the repository
git clone https://github.com/your-username/meeting-analyst.git
cd meeting-analyst
# Install dependencies
uv sync
# Configure environment
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
```

```bash
# Run the full RLM analysis
uv run python -m meeting_analyst.main
# Or test data extraction only (no API calls)
uv run python -m meeting_analyst.main --test-extraction
```

```
🔍 Log Detective 2000
============================================================
Running RLM Analysis
------------------------------------------------------------
Target data file: data/processed/qmsum.txt
Loading data into memory...
Data loaded (8,561,364 chars).
🔍 Query: Analyze the meeting transcripts. Find cases where a participant
changed their opinion during the discussion...
--------------------------------------------------
================================================================================
STARTING NEW QUERY | 18:00:32
================================================================================
18:00:32 🔄 Calling API: gpt-4o | Prompt size: 4,464 chars
18:00:39 ✅ Responded in 6.87s | Model: gpt-4o
📊 Tokens: Prompt: 984 | Completion: 201 | Total: 1,185
╭─── In [1]: ────────────────────────────────────────────────╮
│ print(context[:5000]) │
╰────────────────────────────────────────────────────────────╯
╭─── Out [1]: ───────────────────────────────────────────────╮
│ === Meeting 1 === │
│ ID: Bdb001 │
│ Topic: Academic │
│ ... │
╰────────────────────────────────────────────────────────────╯
```
## Configuration

All settings are in `config.yaml`:
```yaml
project:
  name: "Log Detective 2000"
  debug_mode: true

models:
  root_model: "gpt-4o"          # Smart orchestrator
  sub_model: "gpt-4o-mini"      # Fast/cheap worker

paths:
  data_file: "data/processed/qmsum.txt"

data_source:
  type: "huggingface"
  dataset_id: "Ahren09/QMSum"
  split: "train"

rlm_settings:
  max_iterations: 15            # Max Root LM thinking cycles
  enable_logging: true          # Show detailed execution logs

providers:
  openai:
    base_url: "https://api.openai.com/v1"
```

| Use Case | Root LM | Sub LM | Cost/Speed |
|---|---|---|---|
| Best Quality | gpt-4o | gpt-4o | $$$, Slow |
| Balanced | gpt-4o | gpt-4o-mini | $$, Medium |
| Budget | gpt-4o-mini | gpt-4o-mini | $, Fast |
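Whichever pair you pick ends up in `config.yaml`, and `config_loader.py` reads it at startup. A minimal sketch of what such a loader can look like, assuming PyYAML (the actual module may differ):

```python
# Hypothetical config loader sketch; the project's config_loader.py may differ.
import yaml  # PyYAML

def load_config(path: str = "config.yaml") -> dict:
    """Read the YAML configuration and return it as a plain dict."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

config = load_config()
root_model = config["models"]["root_model"]                 # "gpt-4o"
sub_model = config["models"]["sub_model"]                   # "gpt-4o-mini"
max_iterations = config["rlm_settings"]["max_iterations"]   # 15
```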
## Understanding the Cost Model

```mermaid
pie title Token Distribution (30 meetings analyzed)
    "Root LM Orchestration" : 67334
    "Sub LM Processing" : 665220
```
| Component | Tokens | Cost (USD) |
|---|---|---|
| Root LM (GPT-4o) | ~67,000 | ~$0.19 |
| Sub LM (GPT-4o-mini) | ~665,000 | ~$0.10 |
| Total | ~732,000 | ~$0.29 |
The Root LM (expensive) only:
- Plans the analysis strategy
- Writes code
- Compiles final results
The Sub LM (cheap) handles:
- Reading raw data chunks
- Extracting specific information
- 90% of total tokens
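The arithmetic behind the table is easy to reproduce. The per-million-token prices below are assumptions for illustration (check current OpenAI pricing), and completion tokens are ignored for simplicity, which is why the Root LM figure lands slightly under the ~$0.19 shown above:

```python
# Back-of-the-envelope cost check. Prices are illustrative assumptions (USD per 1M prompt tokens).
GPT_4O_PRICE = 2.50
GPT_4O_MINI_PRICE = 0.15

root_tokens = 67_000    # Root LM: planning, code generation, final answer
sub_tokens = 665_000    # Sub LM: reading raw meeting chunks

root_cost = root_tokens / 1_000_000 * GPT_4O_PRICE       # ~$0.17 before completion tokens
sub_cost = sub_tokens / 1_000_000 * GPT_4O_MINI_PRICE    # ~$0.10
print(f"Root ~${root_cost:.2f} | Sub ~${sub_cost:.2f} | Total ~${root_cost + sub_cost:.2f}")
```

The exact prices will drift, but the conclusion does not: as long as the Sub LM is roughly an order of magnitude cheaper per token, delegating ~90% of the tokens to it keeps the total bill close to what the Sub LM alone would cost.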
## Example Output

After running the analysis on 162 meeting transcripts (~8.5 MB):
Found Opinion Changes:

Meeting 1:

1. **Participant**: Grad C
   - **Topic**: Database format for annotations
   - **Initial**: "XML format won't work for sub-word data"
   - **Final**: "ATLAS format seems reasonable for flexibility"

2. **Participant**: PhD F
   - **Topic**: Data representation standards
   - **Initial**: Uncertain about non-standard formats
   - **Final**: Willing to explore ATLAS if it provides flexibility

Meeting 3:

1. **Participant**: Professor B
   - **Topic**: Project scope
   - **Initial**: Suggested broad exploration of complexities
   - **Final**: Narrowed focus to "tourists in Heidelberg" only

...
## Technical Deep Dive

The key to RLM behavior is the system prompt in `prompts.py`:
````python
REPL_SYSTEM_PROMPT = """
You are tasked with answering a query with associated context.
You can access, transform, and analyze this context interactively
in a REPL environment that can recursively query sub-LLMs.

The REPL environment is initialized with:
1. A `context` variable containing the data
2. A `llm_query` function to query sub-LLMs
3. The ability to use `print()` to view outputs

When you want to execute Python code, wrap it in:
```repl
# Your code here
```

When finished, use FINAL(your answer) to provide the result.
"""
````
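On the other side of this contract, the orchestrator has to pull the fenced `repl` code block out of each Root LM reply and detect the `FINAL(...)` call; this is the "code parsing" that `utils.py` is responsible for. A minimal illustrative version, with regexes that are assumptions rather than the project's actual patterns:

```python
import re

# Illustrative parsing helpers; the real logic lives in rlm/utils/utils.py and may differ.
REPL_BLOCK = re.compile(r"```repl\s*\n(.*?)```", re.DOTALL)
FINAL_CALL = re.compile(r"FINAL\((.*)\)\s*$", re.DOTALL)

def parse_reply(reply: str) -> tuple[str, str]:
    """Classify a Root LM reply as a final answer, runnable code, or plain text."""
    final = FINAL_CALL.search(reply)
    if final:
        return "final", final.group(1).strip().strip('"').strip()
    block = REPL_BLOCK.search(reply)
    if block:
        return "code", block.group(1)
    return "text", reply
```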
### The REPL Environment
The `REPLEnv` class in `repl.py` provides:
1. **Sandboxed Execution**: Only safe Python builtins allowed
2. **Context Injection**: Data loaded as `context` variable
3. **LLM Query Function**: `llm_query()` calls the Sub LM
4. **Output Capture**: Captures `print()` output for feedback
```python
import contextlib
import io

class REPLEnv:
    def __init__(self, context_str, recursive_model):
        # Load context into a temp file and expose it as the `context` variable
        self.load_context(context_str)
        # Create safe globals with the llm_query function bound to the Sub LM
        self.globals['llm_query'] = lambda prompt: self.sub_rlm.completion(prompt)

    def code_execution(self, code) -> REPLResult:
        # Execute in the sandboxed environment, capturing print() output
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.globals, self.locals)
        captured_output = buffer.getvalue()
        return captured_output
```
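Usage then looks roughly like this (argument names taken from the excerpt above; treat it as a sketch rather than the exact API):

```python
# Hypothetical usage of REPLEnv; argument names follow the excerpt above.
data = open("data/processed/qmsum.txt").read()
env = REPLEnv(context_str=data, recursive_model="gpt-4o-mini")

# Code the Root LM might emit, executed inside the sandbox:
output = env.code_execution("print(len(context))")
print(output)  # e.g. "8561364"
```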
The `OpenAIClient` in `llm.py` tracks all API calls:
```python
class OpenAIClient:
    def completion(self, messages):
        start = time.time()
        response = self.client.chat.completions.create(...)

        # Track metrics
        self.total_prompt_tokens += response.usage.prompt_tokens
        self.total_completion_tokens += response.usage.completion_tokens
        self.total_latency += time.time() - start
```
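These counters make it easy to report totals at the end of a run. A sketch, assuming the attribute names from the excerpt above and a no-argument constructor (both assumptions):

```python
# Hypothetical end-of-run summary built from the counters above.
client = OpenAIClient()
# ... run the RLM analysis through this client ...
print(f"Prompt tokens:     {client.total_prompt_tokens:,}")
print(f"Completion tokens: {client.total_completion_tokens:,}")
print(f"Total latency:     {client.total_latency:.1f}s")
```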
### Key Takeaways

- **RLM is a paradigm, not a model** - Works with any capable LLM (GPT, Claude, Gemini)
- **Code quality matters** - The Root LM must be good at coding for efficient analysis
- **Hierarchical delegation is key** - Use the smart (expensive) model for planning and the cheap one for reading
- **Context rot is avoided** - The Root LM never sees raw data, only processed results
- **Linear complexity is acceptable** - O(n) scanning is fine when precision matters