reev 🪸

Reev 🪸: Production-Ready Framework for Solana LLM Agent Evaluation

🎯 Production Status: Complete & Fully Functional

reev is a mature, production-ready Rust framework for rigorously evaluating Solana-native LLM agents. After extensive development and testing, the framework now provides a complete, reliable platform for assessing autonomous agents in realistic blockchain environments.

🏗️ Architecture Flow Diagrams

TUI Interface Flow

┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    TUI      │───▶│  reev-runner  │───▶│  reev-agent  │───▶│  AI Agent    │───▶│   Jupiter    │───▶│ Transaction  │───▶│   Score      │
│  (Cockpit)  │    │ (Orchestrator)│    │  (Service)   │    │ (LLM/GPT/ZAI)│    │    SDK       │    │   Execution  │    │  Calculation │
└─────────────┘    └──────────────┘    └─────────────┘    └──────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
     │                   │                   │                   │                   │                   │                   │
     │                   │                   │                   │                   │                   │                   │
     ▼                   ▼                   ▼                   ▼                   ▼                   ▼                   ▼
┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Interactive │    │ Dependency   │    │ OpenTelemetry│    │ Tool Calling │    │ Protocol     │    │ Surfpool     │    │ 75% Inst +   │
│   Terminal  │    │ Management   │    │   Tracing   │    │ & Reasoning  │    │   Handler   │    │  Simulation  │    │ 25% On-Chain │
│   Display   │    │ (Agent/Pool) │    │   & Logging │    │   (Rig)      │    │ (reev-tools)│    │  (Mock RPC)  │    │  Weighting   │
└─────────────┘    └──────────────┘    └─────────────┘    └──────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

Web API Flow

┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Web UI    │───▶│  reev-api    │───▶│  reev-runner  │───▶│  reev-agent  │───▶│  AI Agent    │───▶│   Jupiter    │───▶│ Transaction  │
│  (Browser)  │    │ (REST API)   │    │ (Orchestrator)│    │  (Service)   │    │ (LLM/GPT/ZAI)│    │    SDK       │    │   Execution  │
└─────────────┘    └──────────────┘    └──────────────┘    └─────────────┘    └──────────────┘    └─────────────┘    └─────────────┘
     │                   │                   │                   │                   │                   │                   │
     │                   │                   │                   │                   │                   │                   │
     ▼                   ▼                   ▼                   ▼                   ▼                   ▼                   ▼
┌─────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────────────┐    ┌──────────────┐    ┌─────────────┐    ┌─────────────┐
│ HTTP/HTTPS  │    │ Database     │    │ Dependency   │    │ OpenTelemetry│    │ Tool Calling │    │ Protocol     │    │ Surfpool     │
│   Requests  │    │ Persistence  │    │ Management   │    │   Tracing   │    │ & Reasoning  │    │   Handler   │    │  Simulation  │
│   (JSON)    │    │ (Sessions)   │    │ (Agent/Pool) │    │   & Logging │    │   (Rig)      │    │ (reev-tools)│    │  (Mock RPC)  │
└─────────────┘    └──────────────┘    └──────────────┘    └─────────────┘    └──────────────┘    └─────────────┘    └─────────────┘

Component Dependencies

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                      ENTRY POINTS                                                        │
├─────────────────────┬─────────────────────┬───────────────────────────────────────────────────────────────┤
│ reev-tui            │ reev-api            │ reev-runner                                                    │
│ (Interactive UI)    │ (Web REST API)      │ (CLI Orchestrator)                                            │
└─────────────────────┴─────────────────────┴───────────────────────────────────────────────────────────────┘
          │                     │                           │
          │                     │                           │
          ▼                     ▼                           ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    CORE RUNNER                                                          │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│                                  reev-runner                                                             │
│                    • Dependency Management (Agent + Surfpool)                                           │
│                    • Benchmark Execution & Session Logging                                               │
│                    • Flow Orchestration (Multi-step)                                                     │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
          │
          │
          ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                    AGENT SERVICE                                                         │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│                                  reev-agent                                                              │
│                    • LLM Routing (OpenAI/GLM/Local/ZAI)                                                  │
│                    • Tool Provisioning (Jupiter, Native, SPL)                                            │
│                    • OpenTelemetry Integration & Flow Tracking                                            │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
          │
          │
          ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                   PROTOCOL LAYER                                                         │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│              reev-tools → reev-protocols → Jupiter SDK → surfpool                                        │
│                    • Jupiter Swap/Lend/Earn Operations                                                  │
│                    • SPL Token Operations                                                               │
│                    • Native SOL Transfers                                                               │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
          │
          │
          ▼
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                             EXECUTION & SCORING                                                          │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│                    surfpool → SolanaEnv → reev-lib (Scoring) → Database                                   │
│                    • Mainnet Fork Simulation                                                            │
│                    • Transaction Execution & State Management                                            │
│                    • Two-Tier Scoring (75% Instruction + 25% On-Chain)                                   │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

✅ Current Capabilities: Production Ready

With the local model, the framework passes every benchmark in all categories (100% success rate). Current capabilities:

🔄 Real Jupiter Integration: Full swap, lending, mint/redeem operations with Jupiter SDK
🤖 Advanced Agent Support: Both deterministic (ground truth) and AI agents working perfectly
🔄 Multi-Step Workflows: Complex DeFi flows with step-by-step orchestration (200-series)
📊 Comprehensive Scoring: Granular instruction quality evaluation + on-chain execution metrics
🎮 Professional Tooling: Interactive TUI cockpit, database persistence, detailed logging
🔬 Real-World Testing: Mainnet fork validation with actual deployed programs
✅ Scoring System Validation: Complete test suite covering 0%, 50%, 75%, and 100% score scenarios
🌊 Flow Support: Step-by-step flow execution with proper transaction isolation
📊 OpenTelemetry Integration: Automatic tool call tracking with Mermaid diagram generation

🚀 Core Architecture: Real Programs, Controlled State

The framework operates on surfpool, a high-performance in-memory fork of Solana mainnet, providing:

🌐 Real-World Logic: Agents interact with actual deployed programs (Jupiter, SPL Token, etc.)
🔒 Controlled Environment: Precise state management via RPC cheat codes for reproducible testing
⚡ High Performance: In-memory execution with fast state manipulation and transaction simulation
🔄 Hermetic Testing: Every test run starts from identical, controlled initial conditions

Core Principles

Reproducibility: The primary goal. Every test run is hermetic, guaranteeing that a given benchmark will produce the exact same result every time.
Service-Oriented Environment: The Solana test validator (surfpool) is treated as a managed, external service that the environment connects to and configures via RPC. This ensures a clean architectural boundary and prevents dependency conflicts.
Gymnasium-Inspired API: The agent-environment interaction is modeled via a standard Rust trait (GymEnv) inspired by the Gymnasium API, promoting a clear separation of concerns.
OpenTelemetry Observability: Automatic tool call extraction from rig's OpenTelemetry traces for flow visualization and debugging.
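
To make the Gymnasium-inspired contract more concrete, here is a minimal sketch of what such a trait could look like. The method names (`reset`, `step`), the `Step` struct, and the toy `CounterEnv` are illustrative assumptions modeled on the Gymnasium API, not the actual `GymEnv` definition in reev-lib.

```rust
/// Result of one environment step (hypothetical shape, modeled on Gymnasium).
#[derive(Debug, PartialEq)]
pub struct Step<O> {
    pub observation: O,
    pub reward: f64,
    pub terminated: bool,
}

/// Illustrative Gymnasium-style trait; the real `GymEnv` may differ.
pub trait GymEnv {
    type Action;
    type Observation;

    /// Restore the initial state and return the first observation.
    fn reset(&mut self) -> Self::Observation;

    /// Apply one action and report observation, reward, and done flag.
    fn step(&mut self, action: Self::Action) -> Step<Self::Observation>;
}

/// Toy environment: the episode ends once the counter reaches `target`.
pub struct CounterEnv {
    pub count: u32,
    pub target: u32,
}

impl GymEnv for CounterEnv {
    type Action = u32;
    type Observation = u32;

    fn reset(&mut self) -> u32 {
        self.count = 0;
        self.count
    }

    fn step(&mut self, action: u32) -> Step<u32> {
        self.count += action;
        let done = self.count >= self.target;
        Step {
            observation: self.count,
            reward: if done { 1.0 } else { 0.0 },
            terminated: done,
        }
    }
}
```

Calling `reset` before every run is what keeps the toy environment hermetic: whatever state a previous episode left behind is discarded. In the real framework the equivalent step would reconfigure surfpool over RPC.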

Key Components

reev-lib (Core Library):
- SolanaEnv: A custom, hermetic evaluation environment that connects to an external surfpool process. It handles state setup, transaction execution, and observation generation.
- Agent Interface: Defines a simple Agent trait and provides an LlmAgent that can reason about prompts.
- Benchmark Structs: Rust types that define the structure of a benchmark YAML file, enabling strongly-typed parsing.
reev-runner (CLI Orchestrator):
- The command-line tool for loading and running benchmarks.
- Orchestrates the entire evaluation loop, from setting up the environment to calculating metrics and reporting results.
reev-agent (LLM Service):
- A standalone server that exposes an LLM's reasoning capabilities over an API.
- Can be configured to use different models (local, Gemini, GLM, etc.) and includes a deterministic agent for generating ground-truth instructions.
- Features OpenTelemetry integration for automatic tool call tracking and Mermaid diagram generation.
reev-api (Web API & Flow Visualization):
- RESTful API for benchmark execution and flow diagram generation.
- Automatic tool call extraction from OpenTelemetry traces.
- Mermaid diagram generation for visualizing agent execution flows.
Benchmark Suite:
- A suite of evaluation tasks defined in YAML files located in the benchmarks/ directory.
- Each test case includes a declarative initial_state, a natural language prompt, and ground_truth criteria for success.
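
As a rough illustration of the strongly-typed benchmark structs, the sketch below mirrors the three sections named above (`initial_state`, `prompt`, `ground_truth`). Every other field, the `validate` helper, and the sample values are hypothetical; the actual types in reev-lib are likely richer and would typically derive a deserializer for YAML.

```rust
/// One account to create before the run (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct InitialAccount {
    pub pubkey: String,
    pub lamports: u64,
}

/// Success criteria for a benchmark (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
pub struct GroundTruth {
    pub expected_program_id: String,
    pub min_score: f64,
}

/// Mirrors the three top-level YAML sections named in the text.
#[derive(Debug, Clone, PartialEq)]
pub struct Benchmark {
    pub id: String,
    pub initial_state: Vec<InitialAccount>,
    pub prompt: String,
    pub ground_truth: GroundTruth,
}

impl Benchmark {
    /// Basic sanity checks a loader might run after parsing.
    pub fn validate(&self) -> Result<(), String> {
        if self.prompt.trim().is_empty() {
            return Err(format!("benchmark {}: prompt is empty", self.id));
        }
        if self.initial_state.is_empty() {
            return Err(format!("benchmark {}: initial_state is empty", self.id));
        }
        Ok(())
    }
}
```

Rejecting an empty prompt or empty initial state at load time is one way to surface a malformed benchmark file before any transaction is simulated.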

🚀 Quick Start

Prerequisites

Rust Toolchain: Install Rust (latest stable recommended)
Git: Clone the repository
Optional LLM: Install LM Studio or obtain a Gemini API key to run AI agents

GLM API Setup:

Regular GLM API (OpenAI-compatible, highest priority):

export ZAI_API_KEY="your-glm-api-key"
export ZAI_API_URL="https://api.z.ai/api/paas/v4"  # optional

GLM Coding API (for coding-specific tasks):

export GLM_CODING_API_KEY="your-glm-coding-api-key"
export GLM_CODING_API_URL="https://api.z.ai/api/coding/paas/v4"  # optional
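
The priority between the two key pairs can be sketched as follows. This is only an assumption about how the selection might work: the `GlmEndpoint` enum and `select_endpoint` function are hypothetical names, and treating the URLs shown above as the defaults when the optional `*_URL` variables are unset is also an assumption.

```rust
/// Which GLM endpoint was selected (names are illustrative).
#[derive(Debug, PartialEq)]
pub enum GlmEndpoint {
    Regular { key: String, url: String },
    Coding { key: String, url: String },
}

// Assumed defaults: the URLs from the setup snippet above.
const DEFAULT_ZAI_URL: &str = "https://api.z.ai/api/paas/v4";
const DEFAULT_CODING_URL: &str = "https://api.z.ai/api/coding/paas/v4";

/// Pick an endpoint from an environment lookup.
/// `ZAI_API_KEY` wins when both keys are set (highest priority).
pub fn select_endpoint(get: impl Fn(&str) -> Option<String>) -> Option<GlmEndpoint> {
    if let Some(key) = get("ZAI_API_KEY") {
        let url = get("ZAI_API_URL").unwrap_or_else(|| DEFAULT_ZAI_URL.to_string());
        return Some(GlmEndpoint::Regular { key, url });
    }
    if let Some(key) = get("GLM_CODING_API_KEY") {
        let url = get("GLM_CODING_API_URL").unwrap_or_else(|| DEFAULT_CODING_URL.to_string());
        return Some(GlmEndpoint::Coding { key, url });
    }
    None
}

/// Convenience wrapper that reads the real process environment.
pub fn select_endpoint_from_env() -> Option<GlmEndpoint> {
    select_endpoint(|name| std::env::var(name).ok())
}
```

Passing the lookup as a closure keeps the priority logic testable without mutating the process environment.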

OpenTelemetry Setup (Tool call tracking always enabled):
```
export REEV_TRACE_FILE=traces.log
```

🎯 Running Benchmarks

The framework now provides automatic surfpool management - no manual setup required:

# All benchmarks work out of the box
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent deterministic
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6

# Jupiter protocols (swap, lending, mint/redeem)
cargo run -p reev-runner -- benchmarks/115-jup-lend-mint-usdc.yml --agent local
cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

# Multi-step flows (swap + lend) with OpenTelemetry tracking
# Enhanced otel logging is enabled by default
export REEV_TRACE_FILE=traces.log
cargo run -p reev-runner -- benchmarks/200-jup-swap-then-lend-deposit.yml --agent glm-4.6

# API benchmarks (positions, earnings)
cargo run -p reev-runner -- benchmarks/114-jup-positions-and-earnings.yml --agent deterministic

# Scoring validation tests
cargo run -p reev-runner -- benchmarks/003-spl-transfer-fail.yml --agent deterministic  # ~75% score (correct instruction, on-chain failure)
cargo run -p reev-runner -- benchmarks/004-partial-score-spl-transfer.yml --agent deterministic  # ~78.6% score (partial credit)

# View OpenTelemetry traces and tool calls
cat traces.log

🤖 Agent Options

Deterministic Agent (Ground Truth):

cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent deterministic

🌊 OpenTelemetry-Enabled Agents:

# Enhanced otel logging is enabled by default
export REEV_TRACE_FILE=traces.log

# Run with automatic tool call extraction (enhanced logging included)
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6

# View extracted tool calls for Mermaid diagrams
curl http://localhost:3001/api/v1/flows/{session_id}

# Disable enhanced otel logging if needed
REEV_ENHANCED_OTEL=0 cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6

Local Model Agent:

cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

GLM Agent (with info-level logging):

RUST_LOG=info cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6

🎮 Interactive TUI

Launch the interactive cockpit for real-time monitoring:

cargo run -p reev-tui

Features:

📊 Live benchmark execution with status updates
🔍 Detailed execution trace analysis
🏷️ Agent selection (deterministic, local, glm-4.6, gemini)
📈 Real-time scoring and metrics

🌊 OpenTelemetry Integration & Flow Visualization

The framework now includes automatic OpenTelemetry integration for tool call tracking and Mermaid diagram generation. This provides real-time observability into agent execution flows without manual instrumentation.

🔧 OpenTelemetry Setup

# OpenTelemetry tracing with enhanced logging (enabled by default)
export REEV_TRACE_FILE=traces.log
export RUST_LOG=info

# Run any agent with automatic enhanced tool call tracking
cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6

# View captured traces with detailed tool info
cat traces.log

# Disable enhanced logging for minimal output
REEV_ENHANCED_OTEL=0 cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml --agent glm-4.6
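
The two environment variables above could be interpreted along these lines. This is a sketch under assumptions: the `OtelSettings` struct and `otel_settings` function are hypothetical, and treating only the exact value `0` as "disabled" is a guess based on the command shown.

```rust
use std::path::PathBuf;

/// Tracing settings derived from the two variables shown above.
#[derive(Debug, PartialEq)]
pub struct OtelSettings {
    pub trace_file: Option<PathBuf>,
    pub enhanced: bool,
}

/// Enhanced logging stays on unless `REEV_ENHANCED_OTEL` is exactly "0",
/// matching the "enabled by default" behavior described in the text.
pub fn otel_settings(get: impl Fn(&str) -> Option<String>) -> OtelSettings {
    OtelSettings {
        trace_file: get("REEV_TRACE_FILE").map(PathBuf::from),
        enhanced: get("REEV_ENHANCED_OTEL").as_deref() != Some("0"),
    }
}
```

In a real binary the lookup would be `|name| std::env::var(name).ok()`; the closure parameter only exists to keep the example testable.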

📊 Flow Diagram Generation

Tool calls are automatically extracted from rig's OpenTelemetry spans and converted to session format for Mermaid diagrams:

# Start reev-api for flow visualization
cargo run --bin reev-api

# Run benchmark with tool tracking
curl -X POST http://localhost:3001/api/v1/benchmarks/001-sol-transfer/run \
  -H "Content-Type: application/json" \
  -d '{"agent": "glm-4.6"}'

# Get flow diagram
curl http://localhost:3001/api/v1/flows/{session_id}

🎯 Session Format for Mermaid

The system automatically converts OpenTelemetry traces to the session format required by FLOW.md:

{
  "session_id": "uuid-here",
  "benchmark_id": "001-sol-transfer",
  "tools": [
    {
      "tool_name": "sol_transfer",
      "start_time": "2024-01-15T10:30:01.456Z",
      "end_time": "2024-01-15T10:30:02.789Z",
      "params": {"pubkey": "USER_1", "amount": "0.1"},
      "result": {"signatures": ["abc123"]},
      "status": "success"
    }
  ]
}

🏗️ Architecture

rig tool execution → OpenTelemetry spans → trace extraction → session format → Mermaid diagrams

No Manual Tracking: Uses rig's built-in OpenTelemetry automatically
Clean Integration: No HTTP request/response wrapping or tool interception
Session Format: Matches FLOW.md specification exactly
Real-time Extraction: Tool calls captured during agent execution
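
The last stage of that pipeline (session format to Mermaid) might look roughly like the sketch below. The `ToolCall` struct keeps only two fields from the session JSON, and the choice of a Mermaid state diagram with index-suffixed state names is an assumption; FLOW.md may specify a different diagram type or layout.

```rust
/// One tool call, reduced to the fields needed for a diagram.
pub struct ToolCall {
    pub tool_name: String,
    pub status: String,
}

/// Render tool calls as a Mermaid state diagram, one state per call in order.
/// Returns only the diagram header when there are no tool calls.
pub fn to_mermaid(benchmark_id: &str, tools: &[ToolCall]) -> String {
    let mut out = String::from("stateDiagram-v2\n");
    if tools.is_empty() {
        return out;
    }
    let mut prev = String::from("[*]");
    for (i, tool) in tools.iter().enumerate() {
        // Suffix the index so repeated calls to one tool get distinct states.
        let state = format!("{}_{}", tool.tool_name, i);
        out.push_str(&format!("    {} --> {} : {}\n", prev, state, tool.status));
        prev = state;
    }
    // Close the diagram, labeling the final edge with the benchmark id.
    out.push_str(&format!("    {} --> [*] : {}\n", prev, benchmark_id));
    out
}
```

For the single `sol_transfer` call in the session example, this yields a start state, one `sol_transfer_0` state reached via a `success` edge, and an end state.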

📊 Benchmark Categories

🔧 Transaction Benchmarks (100-series)

Real on-chain operations with Jupiter protocols:

# Jupiter swap
cargo run -p reev-runner -- benchmarks/100-jup-swap-sol-usdc.yml --agent local

# Jupiter lending (mint/redeem)
cargo run -p reev-runner -- benchmarks/115-jup-lend-mint-usdc.yml --agent local
cargo run -p reev-runner -- benchmarks/116-jup-lend-redeem-usdc.yml --agent local

🌊 Flow Benchmarks (200-series)

Multi-step DeFi workflows with step-by-step execution:

# Swap then lend (2 steps: swap SOL→USDC, then deposit USDC)
cargo run -p reev-runner -- benchmarks/200-jup-swap-then-lend-deposit.yml --agent deterministic

# More flow benchmarks coming soon...

Flow Execution Features:

✅ Step-by-Step Processing: Each flow step executes as a separate transaction
✅ Transaction Isolation: Proper error handling per step, no cascading failures
✅ State Management: Account state flows between steps automatically
✅ Agent Consistency: Both deterministic and AI agents handle flows identically

📡 API Benchmarks (100-series)

Data retrieval and portfolio management:

# Positions and earnings
cargo run -p reev-runner -- benchmarks/114-jup-positions-and-earnings.yml --agent deterministic

🎯 Success Metrics

Current Performance:

✅ 100% Success Rate: All benchmarks passing with local model
✅ Real Jupiter Integration: Full protocol stack working
✅ Multi-Step Flows: Complex workflows executing step-by-step successfully
✅ Production Infrastructure: TUI, database, logging all operational
✅ Scoring System Validation: Comprehensive test suite covering full score spectrum
✅ Anti-False-Positive Protection: Differentiates failure modes accurately
✅ Flow Framework: Robust step-by-step execution with proper error handling

Scoring System:

The framework implements a sophisticated two-tiered scoring system:

Component Breakdown:

Instruction Quality (75%): Granular evaluation of generated transactions
- Program ID matching (configurable weight)
- Instruction data validation (configurable weight)
- Account metadata verification (signer/writable flags)
On-Chain Execution (25%): Binary success/failure on surfpool
Composite Scoring: Weighted average for final assessment
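
The weighting above can be written out as a small sketch. The 75/25 split and the binary on-chain component come from the text; the function names and the `(passed, weight)` representation of the configurable instruction checks are hypothetical, not the scoring code in reev-lib.

```rust
/// Weights from the text: 75% instruction quality, 25% on-chain execution.
const INSTRUCTION_WEIGHT: f64 = 0.75;
const ONCHAIN_WEIGHT: f64 = 0.25;

/// Weighted fraction of instruction checks that passed.
/// Each entry is (passed, weight), e.g. program ID, data, account metadata.
pub fn instruction_quality(checks: &[(bool, f64)]) -> f64 {
    let total: f64 = checks.iter().map(|(_, w)| w).sum();
    if total <= 0.0 {
        return 0.0;
    }
    let earned: f64 = checks.iter().filter(|(ok, _)| *ok).map(|(_, w)| w).sum();
    earned / total
}

/// Composite score in [0, 1]. `quality` is clamped to [0, 1];
/// on-chain execution is binary (1.0 on success, 0.0 on failure).
pub fn composite_score(quality: f64, executed_on_chain: bool) -> f64 {
    let quality = quality.clamp(0.0, 1.0);
    let onchain = if executed_on_chain { 1.0 } else { 0.0 };
    INSTRUCTION_WEIGHT * quality + ONCHAIN_WEIGHT * onchain
}
```

Under this formula a perfect instruction that fails on-chain scores 0.75 × 1.0 + 0.25 × 0 = 0.75, and a run with no correct instruction and no successful execution scores 0.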

Flow Scoring:

Per-Step Evaluation: Each flow step is scored individually
Combined Results: Step scores aggregated for final flow assessment
Partial Credit: Successful steps count even if later steps fail
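
One plausible way to aggregate per-step scores is an unweighted mean, sketched below. The text does not say how step scores are combined, so the averaging rule, the `StepResult` struct, and the `flow_score` function are all assumptions.

```rust
/// Outcome of one flow step; `score` is that step's score in [0, 1].
pub struct StepResult {
    pub name: String,
    pub score: f64,
}

/// Aggregate per-step scores with an unweighted mean (an assumption).
/// Failed or skipped steps are passed with a score of 0.0, so earlier
/// successes still earn partial credit.
pub fn flow_score(steps: &[StepResult]) -> f64 {
    if steps.is_empty() {
        return 0.0;
    }
    steps.iter().map(|s| s.score).sum::<f64>() / steps.len() as f64
}
```

With this rule, a two-step flow whose swap succeeds fully (1.0) but whose deposit fails entirely (0.0) ends at 0.5 rather than 0.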

Validated Score Scenarios:

| Score Range | Test Case | Purpose | Status |
|---|---|---|---|
| ~75% | `003-spl-transfer-fail` | Correct instruction, on-chain failure | ✅ Validated |
| ~78.6% | `004-partial-score-spl-transfer` | Partial credit (correct ID, some errors) | ✅ Validated |
| ~75% | `100-jup-swap-sol-usdc` (pre-fix) | Good reasoning, execution failure | ✅ Validated |
| 100% | `001-sol-transfer`, `002-spl-transfer` | Perfect execution | ✅ Validated |

Anti-False-Positive Testing:

Differentiates between "no attempt" (0%) and "attempted but failed" (partial credit)
Validates granular component scoring (program ID vs data vs accounts)
Ensures weighted scoring prevents gaming the system

🔧 Development & Testing

Integration Tests:

# Full test suite (deterministic + AI agents)
cargo test -p reev-runner

# Specific agent testing
cargo test -p reev-runner --test deterministic_agent_test
cargo test -p reev-runner --test llm_agent_test

Example Testing:

# Protocol examples
cargo run -p reev-agent --example 115-jup-lend-mint-usdc

# Flow examples
cargo run -p reev-agent --example 200-jup-swap-then-lend-deposit

Debugging:

# Enable verbose logging
RUST_LOG=debug cargo run -p reev-runner -- benchmarks/001-sol-transfer.yml

# Check surfpool status
curl http://localhost:8899/health


gist-rs/reev
