gist-rs/reev

Reev 🪸: Production-Ready Framework for Solana LLM Agent Evaluation

🎯 Production Status: Complete & Fully Functional

reev is a mature, production-ready Rust framework for rigorously evaluating Solana-native LLM agents. After extensive development and testing, the framework now provides a complete, reliable platform for assessing autonomous agents in realistic blockchain environments.

🏗️ Architecture Flow Diagrams

TUI Interface Flow

┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│      TUI       │───▶│  reev-runner   │───▶│   reev-agent   │───▶│    AI Agent    │───▶│    Jupiter     │───▶│  Transaction   │───▶│     Score      │
│   (Cockpit)    │    │ (Orchestrator) │    │   (Service)    │    │ (LLM/GPT/ZAI)  │    │      SDK       │    │   Execution    │    │  Calculation   │
└────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘
        │                     │                     │                     │                     │                     │                     │
        │                     │                     │                     │                     │                     │                     │
        ▼                     ▼                     ▼                     ▼                     ▼                     ▼                     ▼
┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│  Interactive   │    │   Dependency   │    │ OpenTelemetry  │    │  Tool Calling  │    │    Protocol    │    │    Surfpool    │    │   75% Inst +   │
│    Terminal    │    │   Management   │    │    Tracing     │    │  & Reasoning   │    │    Handler     │    │   Simulation   │    │  25% On-Chain  │
│    Display     │    │  (Agent/Pool)  │    │   & Logging    │    │     (Rig)      │    │  (reev-tools)  │    │   (Mock RPC)   │    │   Weighting    │
└────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘

Web API Flow

┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│     Web UI     │───▶│    reev-api    │───▶│  reev-runner   │───▶│   reev-agent   │───▶│    AI Agent    │───▶│    Jupiter     │───▶│  Transaction   │
│   (Browser)    │    │   (REST API)   │    │ (Orchestrator) │    │   (Service)    │    │ (LLM/GPT/ZAI)  │    │      SDK       │    │   Execution    │
└────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘
        │                     │                     │                     │                     │                     │                     │
        │                     │                     │                     │                     │                     │                     │
        ▼                     ▼                     ▼                     ▼                     ▼                     ▼                     ▼
┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│   HTTP/HTTPS   │    │    Database    │    │   Dependency   │    │ OpenTelemetry  │    │  Tool Calling  │    │    Protocol    │    │    Surfpool    │
│    Requests    │    │  Persistence   │    │   Management   │    │    Tracing     │    │  & Reasoning   │    │    Handler     │    │   Simulation   │
│     (JSON)     │    │   (Sessions)   │    │  (Agent/Pool)  │    │   & Logging    │    │     (Rig)      │    │  (reev-tools)  │    │   (Mock RPC)   │
└────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘    └────────────────┘

Component Dependencies

┌──────────────────────────────────────────────────────────────────────┐
│                             ENTRY POINTS                             │
├──────────────────────┬──────────────────────┬────────────────────────┤
│ reev-tui             │ reev-api             │ reev-runner            │
│ (Interactive UI)     │ (Web REST API)       │ (CLI Orchestrator)     │
└──────────────────────┴──────────────────────┴────────────────────────┘
           │                      │                       │
           │                      │                       │
           ▼                      ▼                       ▼
┌──────────────────────────────────────────────────────────────────────┐
│                             CORE RUNNER                              │
├──────────────────────────────────────────────────────────────────────┤
│  reev-runner                                                         │
│    • Dependency Management (Agent + Surfpool)                        │
│    • Benchmark Execution & Session Logging                           │
│    • Flow Orchestration (Multi-step)                                 │
└──────────────────────────────────────────────────────────────────────┘
           │
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                            AGENT SERVICE                             │
├──────────────────────────────────────────────────────────────────────┤
│  reev-agent                                                          │
│    • LLM Routing (OpenAI/GLM/Local/ZAI)                              │
│    • Tool Provisioning (Jupiter, Native, SPL)                        │
│    • OpenTelemetry Integration & Flow Tracking                       │
└──────────────────────────────────────────────────────────────────────┘
           │
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                            PROTOCOL LAYER                            │
├──────────────────────────────────────────────────────────────────────┤
│  reev-tools → reev-protocols → Jupiter SDK → surfpool                │
│    • Jupiter Swap/Lend/Earn Operations                               │
│    • SPL Token Operations                                            │
│    • Native SOL Transfers                                            │
└──────────────────────────────────────────────────────────────────────┘
           │
           │
           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                         EXECUTION & SCORING                          │
├──────────────────────────────────────────────────────────────────────┤
│  surfpool → SolanaEnv → reev-lib (Scoring) → Database                │
│    • Mainnet Fork Simulation                                         │
│    • Transaction Execution & State Management                        │
│    • Two-Tier Scoring (75% Instruction + 25% On-Chain)               │
└──────────────────────────────────────────────────────────────────────┘

✅ Current Capabilities: Production Ready

All benchmark categories currently pass (100% success rate with the local model). The framework provides:

  • 🔄 Real Jupiter Integration: Full swap, lending, and mint/redeem operations with the Jupiter SDK
  • 🤖 Advanced Agent Support: Both deterministic (ground truth) and AI agents are supported
  • 🔄 Multi-Step Workflows: Complex DeFi flows with step-by-step orchestration (200-series)
  • 📊 Comprehensive Scoring: Granular instruction quality evaluation plus on-chain execution metrics
  • 🎮 Professional Tooling: Interactive TUI cockpit, database persistence, detailed logging
  • 🔬 Real-World Testing: Mainnet fork validation with actual deployed programs
  • ✅ Scoring System Validation: Complete test suite covering 0%, 50%, 75%, and 100% score scenarios
  • 🌊 Flow Support: Step-by-step flow execution with proper transaction isolation
  • 📊 OpenTelemetry Integration: Automatic tool call tracking with Mermaid diagram generation

🚀 Core Architecture: Real Programs, Controlled State

The framework operates on surfpool, a high-performance in-memory fork of Solana mainnet, providing:

  • 🌍 Real-World Logic: Agents interact with actual deployed programs (Jupiter, SPL Token, etc.)
  • 🔒 Controlled Environment: Precise state management via RPC cheat codes for reproducible testing
  • ⚡ High Performance: In-memory execution with fast state manipulation and transaction simulation
  • 🔄 Hermetic Testing: Every test run starts from identical, controlled initial conditions
Core Principles

  • Reproducibility: The primary goal. Every test run is hermetic, guaranteeing that a given benchmark will produce the exact same result every time.
  • Service-Oriented Environment: The Solana test validator (surfpool) is treated as a managed, external service that the environment connects to and configures via RPC. This ensures a clean architectural boundary and prevents dependency conflicts.
  • Gymnasium-Inspired API: The agent-environment interaction is modeled via a standard Rust trait (GymEnv) inspired by the Gymnasium API, promoting a clear separation of concerns.
  • OpenTelemetry Observability: Automatic tool call extraction from rig's OpenTelemetry traces for flow visualization and debugging.
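
The Gymnasium-inspired interaction can be pictured with a small trait sketch. Only the name GymEnv comes from this README; the methods (reset, step), the associated types, the Step struct, and the toy CounterEnv are hypothetical and are not taken from reev-lib.

```rust
/// Result of one environment step (hypothetical shape, modeled on Gymnasium's step result).
#[derive(Debug, Clone, PartialEq)]
pub struct Step<O> {
    pub observation: O,
    pub reward: f64,
    pub terminated: bool,
}

/// A Gymnasium-inspired environment trait. Only the name `GymEnv` comes from
/// the README; the methods and associated types are illustrative assumptions.
pub trait GymEnv {
    type Action;
    type Observation;

    /// Restore the environment to a known initial state.
    fn reset(&mut self) -> Self::Observation;

    /// Apply one action and report what happened.
    fn step(&mut self, action: Self::Action) -> Step<Self::Observation>;
}

/// Toy environment: the episode ends once the counter reaches `target`.
pub struct CounterEnv {
    pub value: u32,
    pub target: u32,
}

impl GymEnv for CounterEnv {
    type Action = u32;
    type Observation = u32;

    fn reset(&mut self) -> u32 {
        self.value = 0;
        self.value
    }

    fn step(&mut self, action: u32) -> Step<u32> {
        self.value += action;
        let done = self.value >= self.target;
        Step {
            observation: self.value,
            reward: if done { 1.0 } else { 0.0 },
            terminated: done,
        }
    }
}

/// Drive any environment with a fixed list of actions and return the total reward.
pub fn run_episode<E: GymEnv>(env: &mut E, actions: Vec<E::Action>) -> f64 {
    env.reset();
    let mut total = 0.0;
    for action in actions {
        let step = env.step(action);
        total += step.reward;
        if step.terminated {
            break;
        }
    }
    total
}
```

Because run_episode always calls reset first, every episode starts from the same state regardless of what ran before, which is the property the reproducibility principle relies on.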

Key Components

  1. reev-lib (Core Library):

    • SolanaEnv: A custom, hermetic evaluation environment that connects to an external surfpool process. It handles state setup, transaction execution, and observation generation.
    • Agent Interface: Defines a simple Agent trait and provides an LlmAgent that can reason about prompts.
    • Benchmark Structs: Rust types that define the structure of a benchmark YAML file, enabling strongly-typed parsing.
  2. reev-runner (CLI Orchestrator):

    • The command-line tool for loading and running benchmarks.
    • Orchestrates the entire evaluation loop, from setting up the environment to calculating metrics and reporting results.
  3. reev-agent (LLM Service):

    • A standalone server that exposes an LLM's reasoning capabilities over an API.
    • Can be configured to use different models (local, Gemini, GLM, etc.) and includes a deterministic agent for generating ground-truth instructions.
    • Features OpenTelemetry integration for automatic tool call tracking and Mermaid diagram generation.
  4. reev-api (Web API & Flow Visualization):

    • RESTful API for benchmark execution and flow diagram generation.
    • Automatic tool call extraction from OpenTelemetry traces.
    • Mermaid diagram generation for visualizing agent execution flows.
  5. Benchmark Suite:

    • A suite of evaluation tasks defined in YAML files located in the benchmarks/ directory.
    • Each test case includes a declarative initial_state, a natural language prompt, and ground_truth criteria for success.
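
As a rough illustration of strongly-typed benchmark definitions, the sketch below models the three sections named above (initial_state, prompt, ground_truth) as plain Rust types. The individual fields (pubkey, lamports, expected_program_ids, min_score), the validate method, and the placeholder values are assumptions made for this example; the real structs in reev-lib and the real YAML schema may differ, and YAML parsing itself is left out to keep the example dependency-free.

```rust
/// One account to create before the run (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct AccountState {
    pub pubkey: String,
    pub lamports: u64,
}

/// Success criteria for a benchmark (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct GroundTruth {
    pub expected_program_ids: Vec<String>,
    pub min_score: f64,
}

/// In-memory form of a benchmark file. The three sections mirror the README:
/// a declarative `initial_state`, a natural-language `prompt`, and `ground_truth`.
#[derive(Debug, Clone, PartialEq)]
pub struct Benchmark {
    pub id: String,
    pub initial_state: Vec<AccountState>,
    pub prompt: String,
    pub ground_truth: GroundTruth,
}

impl Benchmark {
    /// Reject benchmarks that could not be evaluated meaningfully.
    pub fn validate(&self) -> Result<(), String> {
        if self.prompt.trim().is_empty() {
            return Err(format!("{}: prompt is empty", self.id));
        }
        if self.initial_state.is_empty() {
            return Err(format!("{}: initial_state has no accounts", self.id));
        }
        if !(0.0..=1.0).contains(&self.ground_truth.min_score) {
            return Err(format!("{}: min_score must be within 0.0..=1.0", self.id));
        }
        Ok(())
    }
}

/// Build a small example benchmark in code (all values are placeholders).
pub fn example_benchmark() -> Benchmark {
    Benchmark {
        id: "001-sol-transfer".to_string(),
        initial_state: vec![AccountState {
            pubkey: "USER_1".to_string(),
            lamports: 1_000_000_000,
        }],
        prompt: "Send 0.1 SOL from USER_1 to RECIPIENT".to_string(),
        ground_truth: GroundTruth {
            expected_program_ids: vec!["SYSTEM_PROGRAM".to_string()],
            min_score: 1.0,
        },
    }
}
```

Typed structs like these let a malformed benchmark fail at load time with a clear message instead of surfacing as a confusing score later in the run.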

🚀 Quick Start

Prerequisites

  1. Rust Toolchain: Install Rust (latest stable recommended)

  2. Git: Clone the repository

  3. Optional LLM: Install LM Studio or have a Gemini API key for AI agents

  4. GLM API Setup:

    Regular GLM API (OpenAI-compatible, highest priority):

    export ZAI_API_KEY="your-glm-api-key"
    export ZAI_API_URL="https://api.z.ai/api/paas/v4"  # optional

    GLM Coding API (for coding-specific tasks):

    export GLM_CODING_API_KEY="your-glm-coding-api-key"
    export GLM_CODING_API_URL="https://api.z.ai/api/coding/paas/v4"  # optional
  5. OpenTelemetry Setup (Tool call tracking always enabled):

    export REEV_TRACE_FILE=traces.log
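
One way to read the "highest priority" note for the GLM variables is sketched below. The variable names and URLs are the ones listed above; the fallback order (regular GLM first, then GLM Coding), the use of the listed URLs as defaults, and the names ApiConfig and resolve_glm_config are assumptions for illustration, not reev-agent's actual code.

```rust
/// Resolved LLM endpoint settings (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct ApiConfig {
    pub api_key: String,
    pub api_url: String,
}

// URLs shown in the setup steps, assumed here to be the defaults.
const ZAI_DEFAULT_URL: &str = "https://api.z.ai/api/paas/v4";
const GLM_CODING_DEFAULT_URL: &str = "https://api.z.ai/api/coding/paas/v4";

/// Pick an API configuration from environment-style variables.
/// `lookup` abstracts over `std::env::var` so the logic can be exercised
/// without touching the real process environment.
pub fn resolve_glm_config<F>(lookup: F) -> Option<ApiConfig>
where
    F: Fn(&str) -> Option<String>,
{
    // Regular GLM API first ("highest priority" in the README).
    if let Some(key) = lookup("ZAI_API_KEY") {
        let url = lookup("ZAI_API_URL").unwrap_or_else(|| ZAI_DEFAULT_URL.to_string());
        return Some(ApiConfig { api_key: key, api_url: url });
    }
    // Assumed fallback: the GLM Coding API.
    if let Some(key) = lookup("GLM_CODING_API_KEY") {
        let url = lookup("GLM_CODING_API_URL")
            .unwrap_or_else(|| GLM_CODING_DEFAULT_URL.to_string());
        return Some(ApiConfig { api_key: key, api_url: url });
    }
    None
}

/// Convenience wrapper that reads the real process environment.
pub fn resolve_glm_config_from_env() -> Option<ApiConfig> {
    resolve_glm_config(|name| std::env::var(name).ok())
}
```

With only GLM_CODING_API_KEY set, this sketch resolves to the coding endpoint; once ZAI_API_KEY is also set, the regular endpoint wins.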

🎯 Running Benchmarks

The framework manages surfpool automatically, so no manual setup is required:

# All benchmarks work out of the box
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent deterministic
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6

# Jupiter protocols (swap, lending, mint/redeem)
cargo run -p reev-runner -- benchmarks/115-jup-lend-mint-usdc.yml --agent local
cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

# Multi-step flows (swap + lend) with OpenTelemetry tracking
# Enhanced otel logging is enabled by default
export REEV_TRACE_FILE=traces.log
cargo run -p reev-runner -- benchmarks/200-jup-swap-then-lend-deposit.yml --agent glm-4.6

# API benchmarks (positions, earnings)
cargo run -p reev-runner -- benchmarks/114-jup-positions-and-earnings.yml --agent deterministic

# Scoring validation tests
cargo run -p reev-runner -- benchmarks/003-spl-transfer-fail.yml --agent deterministic  # ~75% score (correct instruction, on-chain failure)
cargo run -p reev-runner -- benchmarks/004-partial-score-spl-transfer.yml --agent deterministic  # ~78.6% score (partial credit)

# View OpenTelemetry traces and tool calls
cat traces.log

🤖 Agent Options

Deterministic Agent (Ground Truth):

cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent deterministic

🌊 OpenTelemetry-Enabled Agents:

# Enhanced otel logging is enabled by default
export REEV_TRACE_FILE=traces.log

# Run with automatic tool call extraction (enhanced logging included)
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6

# View extracted tool calls for Mermaid diagrams
curl http://localhost:3001/api/v1/flows/{session_id}

# Disable enhanced otel logging if needed
REEV_ENHANCED_OTEL=0 cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6

Local Model Agent:

cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

GLM Agent (with info-level logging):

RUST_LOG=info cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6

🎮 Interactive TUI

Launch the interactive cockpit for real-time monitoring:

cargo run -p reev-tui

Features:

  • 📊 Live benchmark execution with status updates
  • 🔍 Detailed execution trace analysis
  • 🏷️ Agent selection (deterministic, local, glm-4.6, gemini)
  • 📈 Real-time scoring and metrics

🌊 OpenTelemetry Integration & Flow Visualization

The framework now includes automatic OpenTelemetry integration for tool call tracking and Mermaid diagram generation. This provides real-time observability into agent execution flows without manual instrumentation.

🔧 OpenTelemetry Setup

# OpenTelemetry tracing with enhanced logging (enabled by default)
export REEV_TRACE_FILE=traces.log
export RUST_LOG=info

# Run any agent with automatic enhanced tool call tracking
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6

# View captured traces with detailed tool info
cat traces.log

# Disable enhanced logging for minimal output
REEV_ENHANCED_OTEL=0 cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6

📊 Flow Diagram Generation

Tool calls are automatically extracted from rig's OpenTelemetry spans and converted to session format for Mermaid diagrams:

# Start reev-api for flow visualization
cargo run --bin reev-api

# Run benchmark with tool tracking
curl -X POST http://localhost:3001/api/v1/benchmarks/001-sol-transfer/run \
  -H "Content-Type: application/json" \
  -d '{"agent": "glm-4.6"}'

# Get flow diagram
curl http://localhost:3001/api/v1/flows/{session_id}

🎯 Session Format for Mermaid

The system automatically converts OpenTelemetry traces to the session format required by FLOW.md:

{
  "session_id": "uuid-here",
  "benchmark_id": "001-sol-transfer",
  "tools": [
    {
      "tool_name": "sol_transfer",
      "start_time": "2024-01-15T10:30:01.456Z",
      "end_time": "2024-01-15T10:30:02.789Z",
      "params": {"pubkey": "USER_1", "amount": "0.1"},
      "result": {"signatures": ["abc123"]},
      "status": "success"
    }
  ]
}
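
To show how a session record like the one above could become a Mermaid diagram, here is a minimal sketch. The fields tool_name and status follow the JSON above; the ToolCall struct, the to_mermaid function, and the choice of a Mermaid state diagram are assumptions for illustration, since the layout that FLOW.md actually specifies is not reproduced in this README.

```rust
/// One tool call from a session record; field names follow the JSON above.
pub struct ToolCall {
    pub tool_name: String,
    pub status: String,
}

/// Render tool calls as a Mermaid state diagram:
/// start -> each tool call in order -> end.
/// Tool names are inserted as-is, so names containing double quotes
/// would need escaping first.
pub fn to_mermaid(tools: &[ToolCall]) -> String {
    let mut out = String::from("stateDiagram-v2\n");
    let mut previous = String::from("[*]");
    for (index, tool) in tools.iter().enumerate() {
        // State ids must stay unique even when the same tool is called twice.
        let id = format!("step{}", index + 1);
        out.push_str(&format!(
            "    state \"{} ({})\" as {}\n",
            tool.tool_name, tool.status, id
        ));
        out.push_str(&format!("    {} --> {}\n", previous, id));
        previous = id;
    }
    out.push_str(&format!("    {} --> [*]\n", previous));
    out
}
```

For the single sol_transfer call in the example session, this produces one labeled state connected to the start and end markers.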

🏗️ Architecture

rig tool execution → OpenTelemetry spans → trace extraction → session format → Mermaid diagrams
  • No Manual Tracking: Uses rig's built-in OpenTelemetry automatically
  • Clean Integration: No HTTP request/response wrapping or tool interception
  • Session Format: Matches FLOW.md specification exactly
  • Real-time Extraction: Tool calls captured during agent execution

📊 Benchmark Categories

🔧 Transaction Benchmarks (100-series)

Real on-chain operations with Jupiter protocols:

# Jupiter swap
cargo run -p reev-runner -- benchmarks/100-jup-swap-sol-usdc.yml --agent local

# Jupiter lending (mint/redeem)
cargo run -p reev-runner -- benchmarks/115-jup-lend-mint-usdc.yml --agent local
cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

🌊 Flow Benchmarks (200-series)

Multi-step DeFi workflows with step-by-step execution:

# Swap then lend (2 steps: swap SOL→USDC, then deposit USDC)
cargo run -p reev-runner -- benchmarks/200-jup-swap-then-lend-deposit.yml --agent deterministic

# More flow benchmarks coming soon...

Flow Execution Features:

  • ✅ Step-by-Step Processing: Each flow step executes as a separate transaction
  • ✅ Transaction Isolation: Proper error handling per step, no cascading failures
  • ✅ State Management: Account state flows between steps automatically
  • ✅ Agent Consistency: Both deterministic and AI agents handle flows identically

📡 API Benchmarks (100-series)

Data retrieval and portfolio management:

# Positions and earnings
cargo run -p reev-runner -- benchmarks/114-jup-positions-and-earnings.yml --agent deterministic

🎯 Success Metrics

Current Performance:

  • ✅ 100% Success Rate: All benchmarks passing with local model
  • ✅ Real Jupiter Integration: Full protocol stack working
  • ✅ Multi-Step Flows: Complex workflows executing step-by-step successfully
  • ✅ Production Infrastructure: TUI, database, logging all operational
  • ✅ Scoring System Validation: Comprehensive test suite covering full score spectrum
  • ✅ Anti-False-Positive Protection: Differentiates failure modes accurately
  • ✅ Flow Framework: Robust step-by-step execution with proper error handling

Scoring System:

The framework implements a sophisticated two-tiered scoring system:

Component Breakdown:

  • Instruction Quality (75%): Granular evaluation of generated transactions
    • Program ID matching (configurable weight)
    • Instruction data validation (configurable weight)
    • Account metadata verification (signer/writable flags)
  • On-Chain Execution (25%): Binary success/failure on surfpool
  • Composite Scoring: Weighted average for final assessment
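
The two-tier weighting can be written out as a short calculation. The 75% / 25% split comes from the breakdown above; the struct names, the normalization of the component weights, and the example weights are assumptions for illustration, since the README only says the component weights are configurable.

```rust
/// Weights of the two tiers, taken from the README (75% / 25%).
pub const INSTRUCTION_WEIGHT: f64 = 0.75;
pub const ONCHAIN_WEIGHT: f64 = 0.25;

/// Per-component match results for one instruction, each in 0.0..=1.0.
pub struct InstructionMatch {
    pub program_id: f64,
    pub data: f64,
    pub accounts: f64,
}

/// Hypothetical component weights (the README calls them configurable).
pub struct ComponentWeights {
    pub program_id: f64,
    pub data: f64,
    pub accounts: f64,
}

/// Weighted mean of the three instruction components.
/// Returns 0.0 when the weights sum to zero or less.
pub fn instruction_score(m: &InstructionMatch, w: &ComponentWeights) -> f64 {
    let total = w.program_id + w.data + w.accounts;
    if total <= 0.0 {
        return 0.0;
    }
    (m.program_id * w.program_id + m.data * w.data + m.accounts * w.accounts) / total
}

/// Composite score: 75% instruction quality plus 25% binary on-chain result.
pub fn composite_score(instruction_quality: f64, executed_ok: bool) -> f64 {
    let onchain = if executed_ok { 1.0 } else { 0.0 };
    INSTRUCTION_WEIGHT * instruction_quality.clamp(0.0, 1.0) + ONCHAIN_WEIGHT * onchain
}
```

Under these assumptions, a fully correct instruction that fails on-chain scores 0.75 and a fully correct, successful one scores 1.0, which lines up with the ~75% and 100% rows in the validated scenarios.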

Flow Scoring:

  • Per-Step Evaluation: Each flow step is scored individually
  • Combined Results: Step scores aggregated for final flow assessment
  • Partial Credit: Successful steps count even if later steps fail
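
Per-step scoring with partial credit might look like the sketch below. The README does not say how step scores are aggregated, so the plain arithmetic mean used here is an assumption, as are the names StepResult, flow_score, and failed_steps.

```rust
/// Score of a single flow step, in 0.0..=1.0 (hypothetical type).
pub struct StepResult {
    pub name: String,
    pub score: f64,
}

/// Aggregate step scores into one flow score.
/// Assumption: a plain mean, so earlier successes still count when a
/// later step fails. An empty flow scores 0.0.
pub fn flow_score(steps: &[StepResult]) -> f64 {
    if steps.is_empty() {
        return 0.0;
    }
    steps.iter().map(|s| s.score).sum::<f64>() / steps.len() as f64
}

/// Names of the steps that did not reach a perfect score.
pub fn failed_steps(steps: &[StepResult]) -> Vec<&str> {
    steps
        .iter()
        .filter(|s| s.score < 1.0)
        .map(|s| s.name.as_str())
        .collect()
}
```

With a successful swap step (1.0) followed by a failed lend step (0.0), this mean gives the flow 0.5 instead of 0, which is the partial-credit behavior described above.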

Validated Score Scenarios:

| Score Range | Test Case | Purpose | Status |
| --- | --- | --- | --- |
| ~75% | 003-spl-transfer-fail | Correct instruction, on-chain failure | ✅ Validated |
| ~78.6% | 004-partial-score-spl-transfer | Partial credit (correct ID, some errors) | ✅ Validated |
| ~75% | 100-jup-swap-sol-usdc (pre-fix) | Good reasoning, execution failure | ✅ Validated |
| 100% | 001-sol-transfer, 002-spl-transfer | Perfect execution | ✅ Validated |

Anti-False-Positive Testing:

  • Differentiates between "no attempt" (0%) vs "attempted but failed" (partial credit)
  • Validates granular component scoring (program ID vs data vs accounts)
  • Ensures weighted scoring prevents gaming the system

🔧 Development & Testing

Integration Tests:

# Full test suite (deterministic + AI agents)
cargo test -p reev-runner

# Specific agent testing
cargo test -p reev-runner --test deterministic_agent_test
cargo test -p reev-runner --test llm_agent_test

Example Testing:

# Protocol examples
cargo run -p reev-agent --example 115-jup-lend-mint-usdc

# Flow examples
cargo run -p reev-agent --example 200-jup-swap-then-lend-deposit

Debugging:

# Enable verbose logging
RUST_LOG=debug cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml

# Check surfpool status
curl http://localhost:8899/health

About

🪸 Re-Eval: A Framework for the Reproducible Evaluation of LLM Agents
