RunLens

Debug AI agents like systems, not black boxes.

RunLens helps you understand why an agent run failed or became expensive — and what to fix. It captures runs step by step, shows cost per step, and lets you compare two runs side by side to see exactly what changed.

👉 Live demo


The problem

When an agent fails or costs too much, existing tools show you what happened — logs, traces, token counts. But they don't tell you why this run was different from the one that worked yesterday, or which specific decision made it 3x more expensive.

RunLens captures a full snapshot of every run — model, prompt version, tools, config — and lets you compare any two runs side by side. You see exactly what changed and where the cost went.


How it works

your agent → RunLens SDK → RunLens API → RunLens UI

You add three function calls to your agent: start_run, record_step, and end_run. RunLens records every step, stores it, and lets you compare runs visually.


Quickstart

1. Install the SDK

pip install runlens-sdk

2. Instrument your agent

from runlens import start_run, record_step, end_run

# Start a run — capture your execution context
run = start_run(
    task="answer customer question",
    context={
        "model": "gpt-4o",
        "prompt_version": "v2",
        "tools": ["search", "calculator"],
        "temperature": 0.7,
    },
    api_url="https://runlens-api.onrender.com",
)

# Record each step — cost is calculated automatically for known models,
# or pass cost= explicitly for custom models.
record_step(
    run_id=run.id,
    step_type="llm_call",
    name="classify intent",
    input={"prompt": "..."},
    output={"intent": "refund_request"},
    model="gpt-4o",
    tokens=150,
)

# End the run
end_run(run.id)
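If the agent raises an exception between start_run and end_run, the run may never be closed. One way to guard against that is a small context manager, sketched below. `tracked_run` is a hypothetical helper, not part of the SDK; it receives the SDK's start_run and end_run as arguments and assumes only the signatures listed in the SDK reference.

```python
from contextlib import contextmanager


@contextmanager
def tracked_run(start_run, end_run, task, context=None, **start_kwargs):
    """Hypothetical helper: open a run and always close it, even on errors."""
    run = start_run(task=task, context=context, **start_kwargs)
    try:
        yield run
    finally:
        # Executes whether the body finished normally or raised.
        end_run(run.id)


# Usage with the SDK (assumes `from runlens import start_run, record_step, end_run`):
#
# with tracked_run(start_run, end_run, task="answer customer question",
#                  context={"model": "gpt-4o"}) as run:
#     record_step(run_id=run.id, step_type="llm_call",
#                 input={"prompt": "..."}, output={"intent": "refund_request"})
```

Passing the two functions in keeps the helper independent of how the SDK is imported, which also makes it easy to exercise with stand-ins in tests.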

3. Open the UI

Go to runlens-api.onrender.com. Select two runs and click Compare.


The comparison view

When you compare two runs, RunLens shows:

Context diff — what was different between the two runs:

Key              Run A     Run B
model            gpt-4o    gpt-4o-mini
prompt_version   v1        v2
temperature      0.7       0.3
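To make the idea of a context diff concrete, here is a minimal sketch that compares two context dicts and keeps only the keys whose values differ. It illustrates the concept only and is not RunLens's actual implementation; `diff_context` and the sample dicts are hypothetical.

```python
def diff_context(a: dict, b: dict) -> list:
    """Return (key, value_in_a, value_in_b) for every key whose value differs."""
    missing = object()  # sentinel: a key absent on one side counts as a difference
    rows = []
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key, missing), b.get(key, missing)
        if left != right:
            rows.append((
                key,
                None if left is missing else left,
                None if right is missing else right,
            ))
    return rows


# Sample contexts mirroring the table above; "tools" is identical, so it is omitted.
run_a = {"model": "gpt-4o", "prompt_version": "v1", "temperature": 0.7, "tools": ["search"]}
run_b = {"model": "gpt-4o-mini", "prompt_version": "v2", "temperature": 0.3, "tools": ["search"]}

for key, left, right in diff_context(run_a, run_b):
    print(f"{key:<16}{left!s:<10}{right}")
```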

Summary — steps, cost, tokens, duration delta at a glance.
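A summary like this can be derived by totalling each run's steps and subtracting. The sketch below assumes each step is a dict with `cost`, `tokens`, and `duration_ms` keys, mirroring the record_step parameters; the stored record format is not documented here, so treat the shape and the sample numbers as made up for illustration.

```python
def summarize(steps: list) -> dict:
    """Total the per-step fields of one run."""
    return {
        "steps": len(steps),
        "cost": sum(s.get("cost", 0.0) for s in steps),
        "tokens": sum(s.get("tokens", 0) for s in steps),
        # duration_ms may be None when it was not recorded; count it as 0.
        "duration_ms": sum(s.get("duration_ms") or 0 for s in steps),
    }


def summary_delta(steps_a: list, steps_b: list) -> dict:
    """Return run B's totals minus run A's totals, field by field."""
    a, b = summarize(steps_a), summarize(steps_b)
    return {key: b[key] - a[key] for key in a}


# Illustrative numbers only.
lean = [{"cost": 0.25, "tokens": 150, "duration_ms": 400}]
heavy = [
    {"cost": 0.5, "tokens": 300, "duration_ms": 900},
    {"cost": 0.75, "tokens": 450, "duration_ms": None},
]
delta = summary_delta(lean, heavy)
```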

Step diff — side by side steps, with extra/missing steps flagged in red.
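One plausible way to flag extra or missing steps is to align the two runs' step names as sequences. The sketch below uses `difflib.SequenceMatcher` from the standard library; whether RunLens aligns steps this way is an assumption, and `diff_steps` and the step names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def diff_steps(names_a: List[str], names_b: List[str]) -> List[Tuple[Optional[str], Optional[str]]]:
    """Pair up matching steps; a None on one side marks a step missing from that run."""
    rows = []
    matcher = SequenceMatcher(a=names_a, b=names_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            rows.extend(zip(names_a[i1:i2], names_b[j1:j2]))
        else:
            # "delete", "insert", or "replace": list each side's steps unpaired.
            rows.extend((name, None) for name in names_a[i1:i2])
            rows.extend((None, name) for name in names_b[j1:j2])
    return rows


over_engineered = ["classify intent", "search docs", "rerank", "draft reply"]
lean = ["classify intent", "draft reply"]

for left, right in diff_steps(over_engineered, lean):
    marker = " " if left == right else "!"
    print(f"{marker} {left or '-':<18}{right or '-'}")
```

Rows where one side is None are the ones a UI would highlight as extra or missing.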


Run the demo

See a concrete example: a support bot runs the same task twice, once over-engineered and once lean. The result is a 5x cost difference for the same output.

pip install runlens-sdk requests
RUNLENS_API=https://runlens-api.onrender.com python examples/demo_agent.py

Then go to runlens-api.onrender.com, select both runs, and click Compare.


Project structure

runlens/
├── apps/
│   ├── api/          — FastAPI backend (SQLite)
│   └── web/          — Frontend (plain HTML + JS)
├── packages/
│   └── sdk-python/   — Python SDK
├── examples/
│   └── demo_agent.py — Demo: support bot comparison
└── CLAUDE.md         — Project brief for AI-assisted development

SDK reference

start_run(task, context=None, api_url=None, storage_path=None)

Starts a new run. Returns a RunHandle with an .id.

  • task — short description of what the agent is doing
  • context — dict with execution state: model, prompt version, tools, etc.
  • api_url — RunLens API base URL (optional, streams data to API)
  • storage_path — local JSON file path (optional, saves data locally)

record_step(run_id, step_type, input, output, cost=0.0, tokens=0, model=None, name=None, duration_ms=None)

Records a single step within a run.

  • step_type — e.g. "llm_call", "tool_call", "retrieval"
  • cost — cost in USD for this step (calculated automatically for known models; pass it explicitly for custom models)
  • tokens — token count for this step
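The quickstart notes that cost is calculated automatically for known models. How the SDK does this is not shown here; a common approach is a per-model price table, sketched below. The model names and rates are placeholders, not real pricing, and `estimate_cost` is a hypothetical function rather than an SDK call.

```python
# Placeholder rates in USD per 1,000 tokens. NOT real provider pricing.
PRICE_PER_1K_TOKENS = {
    "example-large": 0.01,
    "example-small": 0.001,
}


def estimate_cost(model, tokens, explicit_cost=None):
    """Prefer an explicit cost; otherwise look up the model's placeholder rate."""
    if explicit_cost is not None:
        return explicit_cost
    rate = PRICE_PER_1K_TOKENS.get(model)
    if rate is None:
        # Unknown model and no explicit cost: fall back to 0.0,
        # matching the cost=0.0 default in the record_step signature.
        return 0.0
    return tokens / 1000 * rate
```

Real pricing usually distinguishes input and output tokens; this sketch uses one blended rate because record_step takes a single tokens count.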

end_run(run_id)

Ends the run and returns the complete run record.


Self-hosting

The API is a standard FastAPI app. Deploy anywhere that runs Python:

uvicorn main:app --host 0.0.0.0 --port $PORT

The API uses SQLite by default. Set the DATABASE_URL environment variable to use a different database.
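The usual pattern behind a setting like this is to read the environment variable and fall back to a SQLite URL. The sketch below shows that pattern only; the default file name and the `resolve_database_url` helper are assumptions, not taken from the RunLens source.

```python
import os

# Assumed default; the actual SQLite file name used by the API may differ.
DEFAULT_DATABASE_URL = "sqlite:///./runlens.db"


def resolve_database_url(environ=None):
    """Return DATABASE_URL if set and non-empty, else the SQLite default."""
    env = os.environ if environ is None else environ
    return env.get("DATABASE_URL") or DEFAULT_DATABASE_URL
```

Pointing DATABASE_URL at another database will typically also require installing that database's Python driver.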


Tech stack

  • SDK — pure Python, zero dependencies
  • Backend — FastAPI + SQLModel + SQLite
  • Frontend — plain HTML + vanilla JS, no framework

License

MIT