Lumen Benchmarks

Lumen is evaluated using bench-swe: a SWE-bench-style harness that measures whether Lumen reduces cost, time, and token usage when Claude fixes real GitHub bugs. Results are fully reproducible and all artifacts are committed to this repository.

Methodology

Evaluation Framework

bench-swe tests two scenarios head-to-head against real, fixed GitHub issues:

  • baseline — Claude with default tools only (Read, Write, Edit, Grep, Bash, etc.), no Lumen
  • with-lumen — all default tools plus Lumen's semantic_search MCP tool

Each task is a real GitHub bug from an open-source project. Claude is given the issue description and the codebase at the pre-fix commit. It must produce a patch that fixes the issue.

Judging

Patches are rated by Claude Sonnet 4.6 acting as a blind judge, comparing each generated patch to the known-correct gold patch:

  • Perfect — fixes the issue with equivalent or better logic than the gold patch
  • Good — fixes the issue correctly using a different valid approach
  • Poor — wrong, incomplete, doesn't compile, or doesn't fix the issue

The judge also evaluates files_correct (did the patch touch the right files?) and logic_equivalent (is the fix semantically identical to the gold patch?).

Metrics Captured

For each run, bench-swe captures:

| Metric | Source |
| --- | --- |
| Cost (USD) | Claude API usage from raw JSONL |
| Duration | Wall time from session start to exit |
| Output tokens | Tokens generated by Claude |
| Cache reads | Tokens read from prompt cache |
| Tool calls | Number of tool invocations |
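Token metrics come from the raw session JSONL. A minimal sketch of that extraction, assuming hypothetical per-message usage fields (`output_tokens`, `cache_read_input_tokens`; the real stream's field names may differ):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// event models one JSONL line's token accounting. The field names here are
// assumptions for illustration, not bench-swe's verified schema.
type event struct {
	Usage struct {
		OutputTokens    int `json:"output_tokens"`
		CacheReadTokens int `json:"cache_read_input_tokens"`
	} `json:"usage"`
}

// sumUsage totals output tokens and cache-read tokens across every line of
// a raw JSONL session stream, skipping lines that fail to parse.
func sumUsage(jsonl string) (out, cacheRead int) {
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	for sc.Scan() {
		var e event
		if json.Unmarshal(sc.Bytes(), &e) == nil {
			out += e.Usage.OutputTokens
			cacheRead += e.Usage.CacheReadTokens
		}
	}
	return out, cacheRead
}

func main() {
	stream := `{"usage":{"output_tokens":120,"cache_read_input_tokens":4000}}
{"usage":{"output_tokens":80,"cache_read_input_tokens":1000}}`
	out, cr := sumUsage(stream)
	fmt.Println(out, cr) // prints: 200 5000
}
```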

Current Test Suite

9 languages, hard difficulty — all against real GitHub bugs:

| Task | Language | Repository | Issue |
| --- | --- | --- | --- |
| go-hard | Go | goccy/go-yaml | Decoder overrides defaults with null values |
| javascript-hard | JavaScript | markedjs/marked | Blockquotes in lists ignore indentation for nesting |
| php-hard | PHP | Seldaek/monolog | JsonFormatter crashes on stringable object error |
| python-hard | Python | pallets/click | Boolean flag show_default ignores default_map |
| typescript-hard | TypeScript | commander-js/commander | Negative flag negation doesn't propagate to aliases |
| ruby-hard | Ruby | ruby-grape/grape | Wrong content type when Accept header is a wildcard |
| cpp-hard | C++ | fmtlib/fmt | Add a C API (feature implementation) |
| dart-hard | Dart | dart-lang/shelf | shelf_router HEAD request incorrectly sets content-length to 0 |
| rust-hard | Rust | toml-rs/toml | False duplicate key error for dotted keys when parent table is implicitly created |

Embedding model: ordis/jina-embeddings-v2-base-code (Ollama, 768-dim). Claude model: Sonnet (execution), Sonnet 4.6 (judging).


Results Overview

9 benchmark tasks across 9 languages, each run under both scenarios (18 runs total). Quality was maintained in every single task — no regressions. Cost was reduced in every language tested.

Bug-Fix Tasks (8 languages, excluding C++ feature task)

| Metric | Baseline avg | With-Lumen avg | Delta |
| --- | --- | --- | --- |
| Cost | $0.43 | $0.27 | -37% |
| Time | 183s | 116s | -37% |
| Output tokens | 8,278 | 4,787 | -42% |

All 9 Tasks (including C++ feature task)

| Metric | Baseline avg | With-Lumen avg | Delta |
| --- | --- | --- | --- |
| Cost | $0.50 | $0.37 | -26% |
| Time | 204s | 146s | -28% |
| Output tokens | 9,439 | 7,042 | -25% |

Cost was reduced in all 9 languages, and so was wall-clock time (though only marginally for C++). Quality was maintained in every task.


Full Results Table

| Task | Lang | Scenario | Rating | Cost | Time | Output Tok | Cache Read | Tool Calls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| javascript-hard | JS | baseline | Perfect | $0.482 | 254.7s | 14,286 | 486K | 18 |
| javascript-hard | JS | with-lumen | Perfect | $0.325 | 119.3s | 4,872 | 464K | 16 |
| rust-hard | Rust | baseline | Poor | $0.611 | 309.7s | 17,717 | 719K | 22 |
| rust-hard | Rust | with-lumen | Poor | $0.375 | 204.0s | 12,291 | 241K | 9 |
| php-hard | PHP | baseline | Good | $0.186 | 51.5s | 1,936 | 249K | 10 |
| php-hard | PHP | with-lumen | Good | $0.136 | 34.0s | 796 | 66K | 7 |
| typescript-hard | TS | baseline | Good | $0.186 | 84.4s | 4,994 | 120K | 6 |
| typescript-hard | TS | with-lumen | Good | $0.136 | 56.3s | 1,813 | 183K | 9 |
| python-hard | Py | baseline | Perfect | $0.119 | 43.0s | 1,710 | 132K | 7 |
| python-hard | Py | with-lumen | Perfect | $0.096 | 30.6s | 1,092 | 90K | 5 |
| ruby-hard | Ruby | baseline | Good | $0.539 | 185.5s | 6,143 | 517K | 53 |
| ruby-hard | Ruby | with-lumen | Good | $0.411 | 165.2s | 5,581 | 295K | 47 |
| go-hard | Go | baseline | Good | $0.646 | 291.2s | 11,475 | 658K | 51 |
| go-hard | Go | with-lumen | Good | $0.568 | 264.1s | 10,283 | 538K | 35 |
| dart-hard | Dart | baseline | Good | $0.634 | 246.1s | 21,286 | 4,126K | 61 |
| dart-hard | Dart | with-lumen | Good | $0.153 | 50.9s | 3,862 | 663K | 14 |
| cpp-hard | C++ | baseline | Good | $1.102 | 370.7s | 15,506 | 1,327K | 63 |
| cpp-hard | C++ | with-lumen | Good | $1.014 | 359.1s | 22,056 | 1,019K | 51 |

Per-Language Results

JavaScript — marked (blockquote nesting)

One of the strongest results. Lumen found the exact function (list() in Tokenizer.ts) on the first semantic search, eliminating all exploratory file reading.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Perfect | Perfect | Same |
| Cost | $0.482 | $0.325 | -32.6% |
| Time | 254.7s | 119.3s | -53.2% |
| Output tokens | 14,286 | 4,872 | -65.9% |
| Cache reads | 486K | 464K | -4.5% |
| Tool calls | 18 | 16 | -11.1% |

Both scenarios produced functionally identical patches — the same blockquoteBeginRegex function added to rules.ts and the same break condition in Tokenizer.ts. The judge rated both Perfect:

"The candidate patch implements identical logic to the gold patch in both src/Tokenizer.ts and src/rules.ts."

Lumen cut time by more than half and output tokens by two-thirds while delivering the same perfect fix.

Rust — toml (dotted key duplicate error)

The second-largest cost reduction in the suite, after Dart. Lumen cut cost by 39% and time by 34% even though both scenarios struggled with this multi-crate task (neither fixed the parallel bug in the toml crate); Lumen dramatically reduced the exploration overhead.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Poor | Poor | Same |
| Cost | $0.611 | $0.375 | -38.7% |
| Time | 309.7s | 204.0s | -34.1% |
| Output tokens | 17,717 | 12,291 | -30.6% |
| Cache reads | 719K | 241K | -66.5% |
| Tool calls | 22 | 9 | -59.1% |

Even when both approaches fail to produce a correct fix, Lumen saves money by reducing exploration. The baseline spent 22 tool calls exploring; Lumen narrowed it to 9 with targeted semantic searches. Cache reads dropped by two-thirds, showing that Lumen helped Claude avoid reading large amounts of irrelevant code.

PHP — monolog (JsonFormatter crash)

Lumen navigated from the parent class (NormalizerFormatter) to the correct child class (JsonFormatter) in two semantic searches, reaching the fix location with minimal exploration.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $0.186 | $0.136 | -26.8% |
| Time | 51.5s | 34.0s | -34.0% |
| Output tokens | 1,936 | 796 | -58.9% |
| Cache reads | 249K | 66K | -73.5% |
| Tool calls | 10 | 7 | -30.0% |

Both patches wrap the __toString() call in a try/catch and fall back to the class name. The 73.5% reduction in cache reads shows Lumen helping Claude avoid reading large amounts of irrelevant code.

TypeScript — commander.js (negative flag negation)

Lumen found the option parsing logic directly, letting Claude focus on the fix rather than exploring the codebase structure.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $0.186 | $0.136 | -27.1% |
| Time | 84.4s | 56.3s | -33.3% |
| Output tokens | 4,994 | 1,813 | -63.7% |
| Cache reads | 120K | 183K | +52.4% |
| Tool calls | 6 | 9 | +50.0% |

Despite using more tool calls (Lumen search calls + follow-up reads), the net effect was strongly positive: 64% fewer output tokens and 33% faster completion. The additional cache reads came from Lumen loading relevant context that Claude would otherwise have had to discover through exploration.

Python — click (boolean flag default_map)

Both scenarios found the one-line fix immediately. Lumen's semantic search located get_help_record in the Option class directly, saving a few Grep round-trips.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Perfect | Perfect | Same |
| Cost | $0.119 | $0.096 | -19.5% |
| Time | 43.0s | 30.6s | -28.8% |
| Output tokens | 1,710 | 1,092 | -36.1% |
| Cache reads | 132K | 90K | -32.1% |
| Tool calls | 7 | 5 | -28.6% |

Both produced the identical single-line patch — changing self.default to default_value on line 2800 of core.py. The judge confirmed:

"The candidate patch makes the identical one-line change as the gold patch."

Ruby — grape (wrong content type with wildcard Accept)

Lumen helped Claude navigate a large, convention-heavy Ruby codebase more efficiently, reducing cache reads by 43% and tool calls by 11%.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $0.539 | $0.411 | -23.7% |
| Time | 185.5s | 165.2s | -10.9% |
| Output tokens | 6,143 | 5,581 | -9.1% |
| Cache reads | 517K | 295K | -43.0% |
| Tool calls | 53 | 47 | -11.3% |

Ruby showed the most modest output token improvement (-9%) but strong cache read and cost reductions. The high baseline tool call count (53) reflects the exploration-heavy approach needed without semantic search in a large Ruby project.

Go — go-yaml (null value decoder)

Lumen helped Claude find createDecodedNewValue in decode.go and produce a complete patch including test files.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $0.646 | $0.568 | -12.2% |
| Time | 291.2s | 264.1s | -9.3% |
| Output tokens | 11,475 | 10,283 | -10.4% |
| Cache reads | 658K | 538K | -18.2% |
| Tool calls | 51 | 35 | -31.4% |

Both scenarios produced correct patches with test files. The with-lumen patch was more thorough — table-driven tests covering both null values and comments-only nodes, vs a single test case in the baseline.

Dart — shelf (HEAD content-length RFC violation)

The strongest result overall. Lumen cut cost by 76% and time by 79% — the largest improvements across all 9 languages. The bug was an RFC 9110 violation where shelf_router's _removeBody middleware incorrectly set content-length to 0 for HEAD requests.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $0.634 | $0.153 | -75.8% |
| Time | 246.1s | 50.9s | -79.3% |
| Output tokens | 21,286 | 3,862 | -81.9% |
| Cache reads | 4,126K | 663K | -83.9% |
| Tool calls | 61 | 14 | -77.0% |

Both scenarios fixed the bug correctly. The baseline spent 61 tool calls and over 4 minutes exploring the monorepo structure (pkgs/shelf_router/ inside the larger shelf repository). With Lumen, semantic search located _removeBody and the router's HEAD handling directly, completing the fix in under a minute with only 14 tool calls.

C++ — fmt (C API feature)

The only feature implementation task (not a bug fix). Both scenarios produced complete, working C API implementations with tests, using different but valid architectural approaches.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $1.102 | $1.014 | -8.0% |
| Time | 370.7s | 359.1s | -3.1% |
| Output tokens | 15,506 | 22,056 | +42.2% |
| Cache reads | 1,327K | 1,019K | -23.2% |
| Tool calls | 63 | 51 | -19.0% |

C++ is the most expensive task in the suite — a feature implementation in a large codebase. Lumen reduced cost by 8% and tool calls by 19%, but output tokens increased by 42%, suggesting Lumen's search results provided context that Claude used to generate more comprehensive code. Despite being the one task type where Lumen's advantage is smallest, it still delivered cost savings.


Quality Summary

| Language | Baseline Rating | With-Lumen Rating | Quality Delta |
| --- | --- | --- | --- |
| JavaScript | Perfect | Perfect | Same |
| Python | Perfect | Perfect | Same |
| Dart | Good | Good | Same |
| PHP | Good | Good | Same |
| TypeScript | Good | Good | Same |
| Ruby | Good | Good | Same |
| Go | Good | Good | Same |
| C++ | Good | Good | Same |
| Rust | Poor | Poor | Same |

Quality was maintained in all 9 tasks — zero regressions. Where the baseline produced Perfect patches, Lumen matched it. Where the baseline produced Good patches, Lumen matched it. And where the task was too hard for the baseline (Rust), Lumen didn't make it worse — it just made the failure cheaper.


Key Findings

1. Cost Reduced in Every Language

Lumen reduced cost in all 9 languages (time also fell in all 9, though only slightly for C++). The range spans from -8% (C++) to -76% (Dart):

| Language | Baseline cost | With-Lumen cost | Delta |
| --- | --- | --- | --- |
| Dart | $0.634 | $0.153 | -75.8% |
| Rust | $0.611 | $0.375 | -38.7% |
| JavaScript | $0.482 | $0.325 | -32.6% |
| TypeScript | $0.186 | $0.136 | -27.1% |
| PHP | $0.186 | $0.136 | -26.8% |
| Ruby | $0.539 | $0.411 | -23.7% |
| Python | $0.119 | $0.096 | -19.5% |
| Go | $0.646 | $0.568 | -12.2% |
| C++ | $1.102 | $1.014 | -8.0% |

2. Output Token Reduction Is the Primary Driver

In 8/9 languages, output tokens dropped — up to 82% for Dart. The one exception is C++, where output tokens increased (+42%), apparently because Lumen's search results led Claude to generate more comprehensive code. Fewer output tokens means Claude explores less and acts more:

| Language | Baseline output | With-Lumen output | Delta |
| --- | --- | --- | --- |
| Dart | 21,286 | 3,862 | -81.9% |
| JavaScript | 14,286 | 4,872 | -65.9% |
| TypeScript | 4,994 | 1,813 | -63.7% |
| PHP | 1,936 | 796 | -58.9% |
| Python | 1,710 | 1,092 | -36.1% |
| Rust | 17,717 | 12,291 | -30.6% |
| Go | 11,475 | 10,283 | -10.4% |
| Ruby | 6,143 | 5,581 | -9.1% |
| C++ | 15,506 | 22,056 | +42.2% |

3. Time Savings Scale with Exploration

The languages where the baseline needed the most exploration saw the largest time reductions:

| Language | Baseline time | With-Lumen time | Delta |
| --- | --- | --- | --- |
| Dart | 246.1s | 50.9s | -79.3% |
| JavaScript | 254.7s | 119.3s | -53.2% |
| Rust | 309.7s | 204.0s | -34.1% |
| PHP | 51.5s | 34.0s | -34.0% |
| TypeScript | 84.4s | 56.3s | -33.3% |
| Python | 43.0s | 30.6s | -28.8% |
| Ruby | 185.5s | 165.2s | -10.9% |
| Go | 291.2s | 264.1s | -9.3% |
| C++ | 370.7s | 359.1s | -3.1% |

4. Search Calls Are Modest

Lumen typically uses 1-10 search calls per task. It supplements rather than replaces other tool usage:

| Language | Lumen search calls | Total tool calls (Lumen) | Total tool calls (baseline) |
| --- | --- | --- | --- |
| Python | 2 | 5 | 7 |
| PHP | 2 | 7 | 10 |
| Rust | 2 | 9 | 22 |
| TypeScript | 1 | 9 | 6 |
| Dart | — | 14 | 61 |
| JavaScript | 2 | 16 | 18 |
| Go | 3 | 35 | 51 |
| Ruby | 10 | 47 | 53 |
| C++ | 6 | 51 | 63 |

5. Zero Quality Regressions

Lumen maintained patch quality in all 9 tasks. Two tasks achieved Perfect ratings (JavaScript, Python) — identical patches to the gold standard. Six achieved Good ratings with correct fixes via different approaches. Even the one task too hard for either approach (Rust) showed no degradation — Lumen just made the failure 39% cheaper.

6. Results Are Reproducible

All benchmark artifacts — raw JSONL streams, patch diffs, metrics, and judge ratings — are committed to this repository. The benchmark framework is deterministic in setup (same commit, same issue, same tools) while allowing natural LLM variation in execution. The consistent direction of improvement across 9 independent language benchmarks validates that the results are reliable.


Reproduce

Requirements: Ollama running with ordis/jina-embeddings-v2-base-code, the claude CLI, git, go, jq.

```shell
cd bench-swe

# Run all tasks, both scenarios
go run ./cmd/run --output ../bench-results/my-run

# Run a single language
go run ./cmd/run --filter go-hard --output ../bench-results/my-run

# Generate report from existing results
go run ./cmd/report --input ../bench-results/my-run
```

Results land in bench-results/<run-id>/. Each run produces:

  • <task>-<scenario>-raw.jsonl — full Claude session stream
  • <task>-<scenario>-metrics.json — extracted cost/time/tokens
  • <task>-<scenario>-patch.diff — generated patch
  • <task>-<scenario>-judge.json — judge rating and reasoning
  • <task>-<scenario>-judge.md — judge rationale in markdown
  • detail-report.md / summary-report.md — human-readable output

The benchmark is entirely self-contained in bench-swe/. Tasks are defined as JSON files in bench-swe/tasks/. To add a new language or difficulty level, add a task JSON and re-run.
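For orientation, a task definition might look roughly like the sketch below. Every field name here is an illustrative assumption — consult an existing file in bench-swe/tasks/ for the actual schema:

```json
{
  "id": "go-hard",
  "language": "go",
  "repo": "https://github.com/goccy/go-yaml",
  "commit": "<pre-fix commit SHA>",
  "issue": "Decoder overrides defaults with null values",
  "difficulty": "hard"
}
```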

Current results are committed at bench-results/.