Lumen Benchmarks

Lumen is evaluated using bench-swe: a SWE-bench-style harness that measures whether Lumen reduces cost, time, and token usage when Claude fixes real GitHub bugs. Results are fully reproducible and all artifacts are committed to this repository.

Methodology

Evaluation Framework

bench-swe tests two scenarios head-to-head against real, fixed GitHub issues:

  • baseline — Claude with default tools only (Read, Write, Edit, Grep, Bash, etc.), no Lumen
  • with-lumen — all default tools plus Lumen's semantic_search MCP tool

Each task is a real GitHub bug from an open-source project. Claude is given the issue description and the codebase at the pre-fix commit. It must produce a patch that fixes the issue.

Judging

Patches are rated by Claude Sonnet 4.6 acting as a blind judge, comparing each generated patch to the known-correct gold patch:

  • Perfect — fixes the issue with equivalent or better logic than the gold patch
  • Good — fixes the issue correctly using a different valid approach
  • Poor — wrong, incomplete, doesn't compile, or doesn't fix the issue

The judge also evaluates files_correct (did the patch touch the right files?) and logic_equivalent (is the fix semantically identical to the gold patch?).

Metrics Captured

For each run, bench-swe captures:

| Metric | Source |
| --- | --- |
| Cost (USD) | Claude API usage from raw JSONL |
| Duration | Wall time from session start to exit |
| Output tokens | Tokens generated by Claude |
| Cache reads | Tokens read from prompt cache |
| Tool calls | Number of tool invocations |
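Token metrics come from the raw session JSONL. A minimal sketch of that extraction, assuming hypothetical per-message usage fields (`output_tokens`, `cache_read_input_tokens`; the real stream's field names may differ):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// event models one JSONL line's token accounting. The field names here are
// assumptions for illustration, not bench-swe's verified schema.
type event struct {
	Usage struct {
		OutputTokens    int `json:"output_tokens"`
		CacheReadTokens int `json:"cache_read_input_tokens"`
	} `json:"usage"`
}

// sumUsage totals output tokens and cache-read tokens across every line of
// a raw JSONL session stream, skipping lines that fail to parse.
func sumUsage(jsonl string) (out, cacheRead int) {
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	for sc.Scan() {
		var e event
		if json.Unmarshal(sc.Bytes(), &e) == nil {
			out += e.Usage.OutputTokens
			cacheRead += e.Usage.CacheReadTokens
		}
	}
	return out, cacheRead
}

func main() {
	stream := `{"usage":{"output_tokens":120,"cache_read_input_tokens":4000}}
{"usage":{"output_tokens":80,"cache_read_input_tokens":1000}}`
	out, cr := sumUsage(stream)
	fmt.Println(out, cr) // prints: 200 5000
}
```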

Current Test Suite

9 languages, hard difficulty — all against real GitHub bugs:

| Task | Language | Repository | Issue |
| --- | --- | --- | --- |
| go-hard | Go | goccy/go-yaml | Decoder overrides defaults with null values |
| javascript-hard | JavaScript | markedjs/marked | Blockquotes in lists ignore indentation for nesting |
| php-hard | PHP | Seldaek/monolog | JsonFormatter crashes on stringable object error |
| python-hard | Python | pallets/click | Boolean flag show_default ignores default_map |
| typescript-hard | TypeScript | commander-js/commander | Negative flag negation doesn't propagate to aliases |
| ruby-hard | Ruby | ruby-grape/grape | Wrong content type when Accept header is a wildcard |
| cpp-hard | C++ | fmtlib/fmt | Add a C API (feature implementation) |
| dart-hard | Dart | dart-lang/shelf | shelf_router HEAD request incorrectly sets content-length to 0 |
| rust-hard | Rust | toml-rs/toml | False duplicate key error for dotted keys when parent table is implicitly created |

Embedding model: ordis/jina-embeddings-v2-base-code (Ollama, 768-dim). Claude model: Sonnet (execution), Sonnet 4.6 (judging).


Results Overview

9 benchmark tasks across 9 languages, each run under both scenarios (18 runs total). Quality was maintained in every single task — no regressions. Cost was reduced in every language tested.

Bug-Fix Tasks (8 languages, excluding C++ feature task)

| Metric | Baseline avg | With-Lumen avg | Delta |
| --- | --- | --- | --- |
| Cost | $0.43 | $0.27 | -37% |
| Time | 183s | 116s | -37% |
| Output tokens | 8,278 | 4,787 | -42% |

All 9 Tasks (including C++ feature task)

| Metric | Baseline avg | With-Lumen avg | Delta |
| --- | --- | --- | --- |
| Cost | $0.50 | $0.37 | -26% |
| Time | 204s | 146s | -28% |
| Output tokens | 9,439 | 7,042 | -25% |

Cost was reduced in all 9 languages, and so was wall-clock time (though only marginally for C++). Quality was maintained in every task.


Full Results Table

| Task | Lang | Scenario | Rating | Cost | Time | Output Tok | Cache Read | Tool Calls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| javascript-hard | JS | baseline | Perfect | $0.482 | 254.7s | 14,286 | 486K | 18 |
| javascript-hard | JS | with-lumen | Perfect | $0.325 | 119.3s | 4,872 | 464K | 16 |
| rust-hard | Rust | baseline | Poor | $0.611 | 309.7s | 17,717 | 719K | 22 |
| rust-hard | Rust | with-lumen | Poor | $0.375 | 204.0s | 12,291 | 241K | 9 |
| php-hard | PHP | baseline | Good | $0.186 | 51.5s | 1,936 | 249K | 10 |
| php-hard | PHP | with-lumen | Good | $0.136 | 34.0s | 796 | 66K | 7 |
| typescript-hard | TS | baseline | Good | $0.186 | 84.4s | 4,994 | 120K | 6 |
| typescript-hard | TS | with-lumen | Good | $0.136 | 56.3s | 1,813 | 183K | 9 |
| python-hard | Py | baseline | Perfect | $0.119 | 43.0s | 1,710 | 132K | 7 |
| python-hard | Py | with-lumen | Perfect | $0.096 | 30.6s | 1,092 | 90K | 5 |
| ruby-hard | Ruby | baseline | Good | $0.539 | 185.5s | 6,143 | 517K | 53 |
| ruby-hard | Ruby | with-lumen | Good | $0.411 | 165.2s | 5,581 | 295K | 47 |
| go-hard | Go | baseline | Good | $0.646 | 291.2s | 11,475 | 658K | 51 |
| go-hard | Go | with-lumen | Good | $0.568 | 264.1s | 10,283 | 538K | 35 |
| dart-hard | Dart | baseline | Good | $0.634 | 246.1s | 21,286 | 4,126K | 61 |
| dart-hard | Dart | with-lumen | Good | $0.153 | 50.9s | 3,862 | 663K | 14 |
| cpp-hard | C++ | baseline | Good | $1.102 | 370.7s | 15,506 | 1,327K | 63 |
| cpp-hard | C++ | with-lumen | Good | $1.014 | 359.1s | 22,056 | 1,019K | 51 |

Per-Language Results

JavaScript — marked (blockquote nesting)

One of the strongest results. Lumen found the exact function (list() in Tokenizer.ts) on the first semantic search, eliminating all exploratory file reading.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Perfect | Perfect | Same |
| Cost | $0.482 | $0.325 | -32.6% |
| Time | 254.7s | 119.3s | -53.2% |
| Output tokens | 14,286 | 4,872 | -65.9% |
| Cache reads | 486K | 464K | -4.5% |
| Tool calls | 18 | 16 | -11.1% |

Both scenarios produced functionally identical patches — the same blockquoteBeginRegex function added to rules.ts and the same break condition in Tokenizer.ts. The judge rated both Perfect:

"The candidate patch implements identical logic to the gold patch in both src/Tokenizer.ts and src/rules.ts."

Lumen cut time by more than half and output tokens by two-thirds while delivering the same perfect fix.

Rust — toml (dotted key duplicate error)

The second-largest cost reduction in the suite, after Dart. Lumen cut cost by 39% and time by 34% even though both scenarios struggled with this multi-crate task (neither fixed the parallel bug in the toml crate); Lumen dramatically reduced the exploration overhead.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Poor | Poor | Same |
| Cost | $0.611 | $0.375 | -38.7% |
| Time | 309.7s | 204.0s | -34.1% |
| Output tokens | 17,717 | 12,291 | -30.6% |
| Cache reads | 719K | 241K | -66.5% |
| Tool calls | 22 | 9 | -59.1% |

Even when both approaches fail to produce a correct fix, Lumen saves money by reducing exploration. The baseline spent 22 tool calls exploring; Lumen narrowed it to 9 with targeted semantic searches. Cache reads dropped by two-thirds, showing that Lumen helped Claude avoid reading large amounts of irrelevant code.

PHP — monolog (JsonFormatter crash)

Lumen navigated from the parent class (NormalizerFormatter) to the correct child class (JsonFormatter) in two semantic searches, reaching the fix location with minimal exploration.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $0.186 | $0.136 | -26.8% |
| Time | 51.5s | 34.0s | -34.0% |
| Output tokens | 1,936 | 796 | -58.9% |
| Cache reads | 249K | 66K | -73.5% |
| Tool calls | 10 | 7 | -30.0% |

Both patches wrap the __toString() call in a try/catch and fall back to the class name. The 73.5% reduction in cache reads shows Lumen helping Claude avoid reading large amounts of irrelevant code.

TypeScript — commander.js (negative flag negation)

Lumen found the option parsing logic directly, letting Claude focus on the fix rather than exploring the codebase structure.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $0.186 | $0.136 | -27.1% |
| Time | 84.4s | 56.3s | -33.3% |
| Output tokens | 4,994 | 1,813 | -63.7% |
| Cache reads | 120K | 183K | +52.4% |
| Tool calls | 6 | 9 | +50.0% |

Despite using more tool calls (Lumen search calls + follow-up reads), the net effect was strongly positive: 64% fewer output tokens and 33% faster completion. The additional cache reads came from Lumen loading relevant context that Claude would otherwise have had to discover through exploration.

Python — click (boolean flag default_map)

Both scenarios found the one-line fix immediately. Lumen's semantic search located get_help_record in the Option class directly, saving a few Grep round-trips.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Perfect | Perfect | Same |
| Cost | $0.119 | $0.096 | -19.5% |
| Time | 43.0s | 30.6s | -28.8% |
| Output tokens | 1,710 | 1,092 | -36.1% |
| Cache reads | 132K | 90K | -32.1% |
| Tool calls | 7 | 5 | -28.6% |

Both produced the identical single-line patch — changing self.default to default_value on line 2800 of core.py. The judge confirmed:

"The candidate patch makes the identical one-line change as the gold patch."

Ruby — grape (wrong content type with wildcard Accept)

Lumen helped Claude navigate a large, convention-heavy Ruby codebase more efficiently, reducing cache reads by 43% and tool calls by 11%.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $0.539 | $0.411 | -23.7% |
| Time | 185.5s | 165.2s | -10.9% |
| Output tokens | 6,143 | 5,581 | -9.1% |
| Cache reads | 517K | 295K | -43.0% |
| Tool calls | 53 | 47 | -11.3% |

Ruby showed the most modest output token improvement (-9%) but strong cache read and cost reductions. The high baseline tool call count (53) reflects the exploration-heavy approach needed without semantic search in a large Ruby project.

Go — go-yaml (null value decoder)

Lumen helped Claude find createDecodedNewValue in decode.go and produce a complete patch including test files.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $0.646 | $0.568 | -12.2% |
| Time | 291.2s | 264.1s | -9.3% |
| Output tokens | 11,475 | 10,283 | -10.4% |
| Cache reads | 658K | 538K | -18.2% |
| Tool calls | 51 | 35 | -31.4% |

Both scenarios produced correct patches with test files. The with-lumen patch was more thorough — table-driven tests covering both null values and comments-only nodes, vs a single test case in the baseline.

Dart — shelf (HEAD content-length RFC violation)

The strongest result overall. Lumen cut cost by 76% and time by 79% — the largest improvements across all 9 languages. The bug was an RFC 9110 violation where shelf_router's _removeBody middleware incorrectly set content-length to 0 for HEAD requests.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $0.634 | $0.153 | -75.8% |
| Time | 246.1s | 50.9s | -79.3% |
| Output tokens | 21,286 | 3,862 | -81.9% |
| Cache reads | 4,126K | 663K | -83.9% |
| Tool calls | 61 | 14 | -77.0% |

Both scenarios fixed the bug correctly. The baseline spent 61 tool calls and over 4 minutes exploring the monorepo structure (pkgs/shelf_router/ inside the larger shelf repository). With Lumen, semantic search located _removeBody and the router's HEAD handling directly, completing the fix in under a minute with only 14 tool calls.

C++ — fmt (C API feature)

The only feature implementation task (not a bug fix). Both scenarios produced complete, working C API implementations with tests, using different but valid architectural approaches.

| Metric | Baseline | With Lumen | Delta |
| --- | --- | --- | --- |
| Rating | Good | Good | Same |
| Cost | $1.102 | $1.014 | -8.0% |
| Time | 370.7s | 359.1s | -3.1% |
| Output tokens | 15,506 | 22,056 | +42.2% |
| Cache reads | 1,327K | 1,019K | -23.2% |
| Tool calls | 63 | 51 | -19.0% |

C++ is the most expensive task in the suite — a feature implementation in a large codebase. Lumen reduced cost by 8% and tool calls by 19%, but output tokens increased by 42%, suggesting Lumen's search results provided context that Claude used to generate more comprehensive code. Despite being the one task type where Lumen's advantage is smallest, it still delivered cost savings.


Quality Summary

| Language | Baseline Rating | With-Lumen Rating | Quality Delta |
| --- | --- | --- | --- |
| JavaScript | Perfect | Perfect | Same |
| Python | Perfect | Perfect | Same |
| Dart | Good | Good | Same |
| PHP | Good | Good | Same |
| TypeScript | Good | Good | Same |
| Ruby | Good | Good | Same |
| Go | Good | Good | Same |
| C++ | Good | Good | Same |
| Rust | Poor | Poor | Same |

Quality was maintained in all 9 tasks — zero regressions. Where the baseline produced Perfect patches, Lumen matched it. Where the baseline produced Good patches, Lumen matched it. And where the task was too hard for the baseline (Rust), Lumen didn't make it worse — it just made the failure cheaper.


Key Findings

1. Cost Reduced in Every Language

Lumen reduced cost in all 9 languages (time also fell in all 9, though only slightly for C++). The range spans from -8% (C++) to -76% (Dart):

| Language | Baseline cost | With-Lumen cost | Delta |
| --- | --- | --- | --- |
| Dart | $0.634 | $0.153 | -75.8% |
| Rust | $0.611 | $0.375 | -38.7% |
| JavaScript | $0.482 | $0.325 | -32.6% |
| TypeScript | $0.186 | $0.136 | -27.1% |
| PHP | $0.186 | $0.136 | -26.8% |
| Ruby | $0.539 | $0.411 | -23.7% |
| Python | $0.119 | $0.096 | -19.5% |
| Go | $0.646 | $0.568 | -12.2% |
| C++ | $1.102 | $1.014 | -8.0% |

2. Output Token Reduction Is the Primary Driver

In 8/9 languages, output tokens dropped — up to 82% for Dart. The one exception is C++, where output tokens increased (+42%), apparently because Lumen's search results led Claude to generate more comprehensive code. Fewer output tokens means Claude explores less and acts more:

| Language | Baseline output | With-Lumen output | Delta |
| --- | --- | --- | --- |
| Dart | 21,286 | 3,862 | -81.9% |
| JavaScript | 14,286 | 4,872 | -65.9% |
| TypeScript | 4,994 | 1,813 | -63.7% |
| PHP | 1,936 | 796 | -58.9% |
| Python | 1,710 | 1,092 | -36.1% |
| Rust | 17,717 | 12,291 | -30.6% |
| Go | 11,475 | 10,283 | -10.4% |
| Ruby | 6,143 | 5,581 | -9.1% |
| C++ | 15,506 | 22,056 | +42.2% |

3. Time Savings Scale with Exploration

The languages where the baseline needed the most exploration saw the largest time reductions:

| Language | Baseline time | With-Lumen time | Delta |
| --- | --- | --- | --- |
| Dart | 246.1s | 50.9s | -79.3% |
| JavaScript | 254.7s | 119.3s | -53.2% |
| Rust | 309.7s | 204.0s | -34.1% |
| PHP | 51.5s | 34.0s | -34.0% |
| TypeScript | 84.4s | 56.3s | -33.3% |
| Python | 43.0s | 30.6s | -28.8% |
| Ruby | 185.5s | 165.2s | -10.9% |
| Go | 291.2s | 264.1s | -9.3% |
| C++ | 370.7s | 359.1s | -3.1% |

4. Search Calls Are Modest

Lumen typically uses 1-10 search calls per task. It supplements rather than replaces other tool usage:

| Language | Lumen search calls | Total tool calls (Lumen) | Total tool calls (baseline) |
| --- | --- | --- | --- |
| Python | 2 | 5 | 7 |
| PHP | 2 | 7 | 10 |
| Rust | 2 | 9 | 22 |
| TypeScript | 1 | 9 | 6 |
| Dart | — | 14 | 61 |
| JavaScript | 2 | 16 | 18 |
| Go | 3 | 35 | 51 |
| Ruby | 10 | 47 | 53 |
| C++ | 6 | 51 | 63 |

5. Zero Quality Regressions

Lumen maintained patch quality in all 9 tasks. Two tasks achieved Perfect ratings (JavaScript, Python) — identical patches to the gold standard. Six achieved Good ratings with correct fixes via different approaches. Even the one task too hard for either approach (Rust) showed no degradation — Lumen just made the failure 39% cheaper.

6. Results Are Reproducible

All benchmark artifacts — raw JSONL streams, patch diffs, metrics, and judge ratings — are committed to this repository. The benchmark framework is deterministic in setup (same commit, same issue, same tools) while allowing natural LLM variation in execution. The consistent direction of improvement across 9 independent language benchmarks validates that the results are reliable.


Reproduce

Requirements: Ollama running with ordis/jina-embeddings-v2-base-code, the claude CLI, git, go, jq.

```shell
cd bench-swe

# Run all tasks, both scenarios
go run ./cmd/run --output ../bench-results/my-run

# Run a single language
go run ./cmd/run --filter go-hard --output ../bench-results/my-run

# Generate report from existing results
go run ./cmd/report --input ../bench-results/my-run
```

Results land in bench-results/<run-id>/. Each run produces:

  • <task>-<scenario>-raw.jsonl — full Claude session stream
  • <task>-<scenario>-metrics.json — extracted cost/time/tokens
  • <task>-<scenario>-patch.diff — generated patch
  • <task>-<scenario>-judge.json — judge rating and reasoning
  • <task>-<scenario>-judge.md — judge rationale in markdown
  • detail-report.md / summary-report.md — human-readable output

The benchmark is entirely self-contained in bench-swe/. Tasks are defined as JSON files in bench-swe/tasks/. To add a new language or difficulty level, add a task JSON and re-run.
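For orientation, a task definition might look roughly like the sketch below. Every field name here is an illustrative assumption — consult an existing file in bench-swe/tasks/ for the actual schema:

```json
{
  "id": "go-hard",
  "language": "go",
  "repo": "https://github.com/goccy/go-yaml",
  "commit": "<pre-fix commit SHA>",
  "issue": "Decoder overrides defaults with null values",
  "difficulty": "hard"
}
```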

Current results are committed at bench-results/.