Lumen is evaluated using bench-swe: a SWE-bench-style harness that measures whether Lumen reduces cost, time, and token usage when Claude fixes real GitHub bugs. Results are fully reproducible and all artifacts are committed to this repository.
bench-swe tests two scenarios head-to-head against real, fixed GitHub issues:
- baseline — Claude with default tools only (Read, Write, Edit, Grep, Bash, etc.), no Lumen
- with-lumen — all default tools plus Lumen's semantic_search MCP tool
Each task is a real GitHub bug from an open-source project. Claude is given the issue description and the codebase at the pre-fix commit. It must produce a patch that fixes the issue.
Patches are rated by Claude Sonnet 4.6 acting as a blind judge, comparing each generated patch to the known-correct gold patch:
- Perfect — fixes the issue with equivalent or better logic than the gold patch
- Good — fixes the issue correctly using a different valid approach
- Poor — wrong, incomplete, doesn't compile, or doesn't fix the issue
The judge also evaluates files_correct (did the patch touch the right files?)
and logic_equivalent (is the fix semantically identical to the gold patch?).
For each run, bench-swe captures:
| Metric | Source |
|---|---|
| Cost (USD) | Claude API usage from raw JSONL |
| Duration | Wall time from session start to exit |
| Output tokens | Tokens generated by Claude |
| Cache reads | Tokens read from prompt cache |
| Tool calls | Number of tool invocations |
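Cost and token figures are extracted from the raw JSONL session stream. A minimal sketch of that aggregation, assuming flat per-event cost_usd and output_tokens keys — the real Claude event format may name or nest these differently (e.g. under a usage object):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Metrics aggregated from a raw session stream.
type Metrics struct {
	CostUSD      float64
	OutputTokens int
}

// sumUsage walks a JSONL stream line by line and accumulates cost and
// output tokens. The field names (cost_usd, output_tokens) are assumed,
// not taken from the actual Claude API event schema.
func sumUsage(jsonl string) (Metrics, error) {
	var m Metrics
	sc := bufio.NewScanner(strings.NewReader(jsonl))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var ev struct {
			CostUSD      float64 `json:"cost_usd"`
			OutputTokens int     `json:"output_tokens"`
		}
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			return m, err
		}
		m.CostUSD += ev.CostUSD
		m.OutputTokens += ev.OutputTokens
	}
	return m, sc.Err()
}

func main() {
	stream := "{\"cost_usd\":0.01,\"output_tokens\":120}\n{\"cost_usd\":0.02,\"output_tokens\":300}"
	m, err := sumUsage(stream)
	if err != nil {
		panic(err)
	}
	fmt.Printf("$%.2f %d\n", m.CostUSD, m.OutputTokens) // $0.03 420
}
```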
9 languages, hard difficulty — all drawn from real GitHub issues:
| Task | Language | Repository | Issue |
|---|---|---|---|
| go-hard | Go | goccy/go-yaml | Decoder overrides defaults with null values |
| javascript-hard | JavaScript | markedjs/marked | Blockquotes in lists ignore indentation for nesting |
| php-hard | PHP | Seldaek/monolog | JsonFormatter crashes on stringable object error |
| python-hard | Python | pallets/click | Boolean flag show_default ignores default_map |
| typescript-hard | TypeScript | commander-js/commander | Negative flag negation doesn't propagate to aliases |
| ruby-hard | Ruby | ruby-grape/grape | Wrong content type when Accept header is a wildcard |
| cpp-hard | C++ | fmtlib/fmt | Add a C API (feature implementation) |
| dart-hard | Dart | dart-lang/shelf | shelf_router HEAD request incorrectly sets content-length to 0 |
| rust-hard | Rust | toml-rs/toml | False duplicate key error for dotted keys when parent table is implicitly created |
Embedding model: ordis/jina-embeddings-v2-base-code (Ollama, 768-dim). Claude models: Sonnet (execution) and Sonnet 4.6 (judging).
9 benchmark runs across 9 languages. Quality was maintained in every single task — no regressions. Cost was reduced in every language tested.
Averaged over the eight bug-fix tasks (excluding the C++ feature-implementation task):

| Metric | Baseline avg | With-Lumen avg | Delta |
|---|---|---|---|
| Cost | $0.43 | $0.27 | -37% |
| Time | 183s | 116s | -37% |
| Output tokens | 8,278 | 4,787 | -42% |

Averaged over all nine tasks:

| Metric | Baseline avg | With-Lumen avg | Delta |
|---|---|---|---|
| Cost | $0.50 | $0.37 | -26% |
| Time | 204s | 146s | -28% |
| Output tokens | 9,439 | 7,042 | -25% |
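The Delta columns throughout are plain percentage changes relative to baseline (negative means improvement). A minimal sketch in Go, checked against the JavaScript cost figures from the per-task table:

```go
package main

import "fmt"

// pctDelta reports the percentage change from baseline to with-lumen,
// matching the Delta columns in the tables (negative = improvement).
func pctDelta(baseline, withLumen float64) float64 {
	return (withLumen - baseline) / baseline * 100
}

func main() {
	// JavaScript cost: $0.482 baseline -> $0.325 with Lumen
	fmt.Printf("%.1f%%\n", pctDelta(0.482, 0.325)) // -32.6%
}
```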
Cost was reduced in all 9 languages — the only universally positive metric. Quality was maintained in every task.
| Task | Lang | Scenario | Rating | Cost | Time | Output Tok | Cache Read | Tool Calls |
|---|---|---|---|---|---|---|---|---|
| javascript-hard | JS | baseline | Perfect | $0.482 | 254.7s | 14,286 | 486K | 18 |
| javascript-hard | JS | with-lumen | Perfect | $0.325 | 119.3s | 4,872 | 464K | 16 |
| rust-hard | Rust | baseline | Poor | $0.611 | 309.7s | 17,717 | 719K | 22 |
| rust-hard | Rust | with-lumen | Poor | $0.375 | 204.0s | 12,291 | 241K | 9 |
| php-hard | PHP | baseline | Good | $0.186 | 51.5s | 1,936 | 249K | 10 |
| php-hard | PHP | with-lumen | Good | $0.136 | 34.0s | 796 | 66K | 7 |
| typescript-hard | TS | baseline | Good | $0.186 | 84.4s | 4,994 | 120K | 6 |
| typescript-hard | TS | with-lumen | Good | $0.136 | 56.3s | 1,813 | 183K | 9 |
| python-hard | Py | baseline | Perfect | $0.119 | 43.0s | 1,710 | 132K | 7 |
| python-hard | Py | with-lumen | Perfect | $0.096 | 30.6s | 1,092 | 90K | 5 |
| ruby-hard | Ruby | baseline | Good | $0.539 | 185.5s | 6,143 | 517K | 53 |
| ruby-hard | Ruby | with-lumen | Good | $0.411 | 165.2s | 5,581 | 295K | 47 |
| go-hard | Go | baseline | Good | $0.646 | 291.2s | 11,475 | 658K | 51 |
| go-hard | Go | with-lumen | Good | $0.568 | 264.1s | 10,283 | 538K | 35 |
| dart-hard | Dart | baseline | Good | $0.634 | 246.1s | 21,286 | 4,126K | 61 |
| dart-hard | Dart | with-lumen | Good | $0.153 | 50.9s | 3,862 | 663K | 14 |
| cpp-hard | C++ | baseline | Good | $1.102 | 370.7s | 15,506 | 1,327K | 63 |
| cpp-hard | C++ | with-lumen | Good | $1.014 | 359.1s | 22,056 | 1,019K | 51 |
One of the strongest results. Lumen found the exact function (list() in Tokenizer.ts) on the first semantic search, eliminating all exploratory file reading.
| Metric | Baseline | With Lumen | Delta |
|---|---|---|---|
| Rating | Perfect | Perfect | Same |
| Cost | $0.482 | $0.325 | -32.6% |
| Time | 254.7s | 119.3s | -53.2% |
| Output tokens | 14,286 | 4,872 | -65.9% |
| Cache reads | 486K | 464K | -4.5% |
| Tool calls | 18 | 16 | -11.1% |
Both scenarios produced functionally identical patches — the same blockquoteBeginRegex function added to rules.ts and the same break condition in Tokenizer.ts. The judge rated both Perfect:

"The candidate patch implements identical logic to the gold patch in both src/Tokenizer.ts and src/rules.ts."
Lumen cut time by more than half and output tokens by two-thirds while delivering the same perfect fix.
Strong savings despite a failed fix. Lumen cut cost by 39% and time by 34%, the second-largest cost reduction across all 9 languages (after Dart). Both scenarios struggled with this multi-crate task (neither fixed the parallel bug in the toml crate), but Lumen dramatically reduced the exploration overhead.
| Metric | Baseline | With Lumen | Delta |
|---|---|---|---|
| Rating | Poor | Poor | Same |
| Cost | $0.611 | $0.375 | -38.7% |
| Time | 309.7s | 204.0s | -34.1% |
| Output tokens | 17,717 | 12,291 | -30.6% |
| Cache reads | 719K | 241K | -66.5% |
| Tool calls | 22 | 9 | -59.1% |
Even when both approaches fail to produce a correct fix, Lumen saves money by reducing exploration. The baseline spent 22 tool calls exploring; Lumen narrowed it to 9 with targeted semantic searches. Cache reads dropped by two-thirds, showing that Lumen helped Claude avoid reading large amounts of irrelevant code.
Lumen navigated from the parent class (NormalizerFormatter) to the correct
child class (JsonFormatter) in two semantic searches, reaching the fix
location with minimal exploration.
| Metric | Baseline | With Lumen | Delta |
|---|---|---|---|
| Rating | Good | Good | Same |
| Cost | $0.186 | $0.136 | -26.8% |
| Time | 51.5s | 34.0s | -34.0% |
| Output tokens | 1,936 | 796 | -58.9% |
| Cache reads | 249K | 66K | -73.5% |
| Tool calls | 10 | 7 | -30.0% |
Both patches wrap the __toString() call in a try/catch and fall back to the class name. The 73.5% drop in cache reads shows Lumen steering Claude past large amounts of irrelevant code.
Lumen found the option parsing logic directly, letting Claude focus on the fix rather than exploring the codebase structure.
| Metric | Baseline | With Lumen | Delta |
|---|---|---|---|
| Rating | Good | Good | Same |
| Cost | $0.186 | $0.136 | -27.1% |
| Time | 84.4s | 56.3s | -33.3% |
| Output tokens | 4,994 | 1,813 | -63.7% |
| Cache reads | 120K | 183K | +52.4% |
| Tool calls | 6 | 9 | +50.0% |
Despite using more tool calls (Lumen search calls + follow-up reads), the net effect was strongly positive: 64% fewer output tokens and 33% faster completion. The additional cache reads came from Lumen loading relevant context that Claude would otherwise have had to discover through exploration.
Both scenarios found the one-line fix immediately. Lumen's semantic search
located get_help_record in the Option class directly, saving a few Grep
round-trips.
| Metric | Baseline | With Lumen | Delta |
|---|---|---|---|
| Rating | Perfect | Perfect | Same |
| Cost | $0.119 | $0.096 | -19.5% |
| Time | 43.0s | 30.6s | -28.8% |
| Output tokens | 1,710 | 1,092 | -36.1% |
| Cache reads | 132K | 90K | -32.1% |
| Tool calls | 7 | 5 | -28.6% |
Both produced the identical single-line patch — changing self.default to
default_value on line 2800 of core.py. The judge confirmed:
"The candidate patch makes the identical one-line change as the gold patch."
Lumen helped Claude navigate a large, convention-heavy Ruby codebase more efficiently, reducing cache reads by 43% and tool calls by 11%.
| Metric | Baseline | With Lumen | Delta |
|---|---|---|---|
| Rating | Good | Good | Same |
| Cost | $0.539 | $0.411 | -23.7% |
| Time | 185.5s | 165.2s | -10.9% |
| Output tokens | 6,143 | 5,581 | -9.1% |
| Cache reads | 517K | 295K | -43.0% |
| Tool calls | 53 | 47 | -11.3% |
Ruby showed the most modest output token improvement (-9%) but strong cache read and cost reductions. The high baseline tool call count (53) reflects the exploration-heavy approach needed without semantic search in a large Ruby project.
Lumen helped Claude find createDecodedNewValue in decode.go and produce a
complete patch including test files.
| Metric | Baseline | With Lumen | Delta |
|---|---|---|---|
| Rating | Good | Good | Same |
| Cost | $0.646 | $0.568 | -12.2% |
| Time | 291.2s | 264.1s | -9.3% |
| Output tokens | 11,475 | 10,283 | -10.4% |
| Cache reads | 658K | 538K | -18.2% |
| Tool calls | 51 | 35 | -31.4% |
Both scenarios produced correct patches with test files. The with-lumen patch was more thorough — table-driven tests covering both null values and comments-only nodes, vs a single test case in the baseline.
The strongest result overall. Lumen cut cost by 76% and time by 79% — the
largest improvements across all 9 languages. The bug was an RFC 9110 violation
where shelf_router's _removeBody middleware incorrectly set content-length
to 0 for HEAD requests.
| Metric | Baseline | With Lumen | Delta |
|---|---|---|---|
| Rating | Good | Good | Same |
| Cost | $0.634 | $0.153 | -75.8% |
| Time | 246.1s | 50.9s | -79.3% |
| Output tokens | 21,286 | 3,862 | -81.9% |
| Cache reads | 4,126K | 663K | -83.9% |
| Tool calls | 61 | 14 | -77.0% |
Both scenarios fixed the bug correctly. The baseline spent 61 tool calls and
over 4 minutes exploring the monorepo structure (pkgs/shelf_router/ inside
the larger shelf repository). With Lumen, semantic search located
_removeBody and the router's HEAD handling directly, completing the fix in
under a minute with only 14 tool calls.
The only feature implementation task (not a bug fix). Both scenarios produced complete, working C API implementations with tests, using different but valid architectural approaches.
| Metric | Baseline | With Lumen | Delta |
|---|---|---|---|
| Rating | Good | Good | Same |
| Cost | $1.102 | $1.014 | -8.0% |
| Time | 370.7s | 359.1s | -3.1% |
| Output tokens | 15,506 | 22,056 | +42.2% |
| Cache reads | 1,327K | 1,019K | -23.2% |
| Tool calls | 63 | 51 | -19.0% |
C++ is the most expensive task in the suite — a feature implementation in a large codebase. Lumen reduced cost by 8% and tool calls by 19%, but output tokens increased by 42%, suggesting Lumen's search results provided context that Claude used to generate more comprehensive code. Despite being the one task type where Lumen's advantage is smallest, it still delivered cost savings.
| Language | Baseline Rating | With-Lumen Rating | Quality Delta |
|---|---|---|---|
| JavaScript | Perfect | Perfect | Same |
| Python | Perfect | Perfect | Same |
| Dart | Good | Good | Same |
| PHP | Good | Good | Same |
| TypeScript | Good | Good | Same |
| Ruby | Good | Good | Same |
| Go | Good | Good | Same |
| C++ | Good | Good | Same |
| Rust | Poor | Poor | Same |
Quality was maintained in all 9 tasks — zero regressions. Where the baseline produced Perfect patches, Lumen matched it. Where the baseline produced Good patches, Lumen matched it. And where the task was too hard for the baseline (Rust), Lumen didn't make it worse — it just made the failure cheaper.
Lumen reduced cost in all 9 languages — the only universally positive metric. The range spans from -8% (C++) to -76% (Dart):
| Language | Baseline cost | With-Lumen cost | Delta |
|---|---|---|---|
| Dart | $0.634 | $0.153 | -75.8% |
| Rust | $0.611 | $0.375 | -38.7% |
| JavaScript | $0.482 | $0.325 | -32.6% |
| TypeScript | $0.186 | $0.136 | -27.1% |
| PHP | $0.186 | $0.136 | -26.8% |
| Ruby | $0.539 | $0.411 | -23.7% |
| Python | $0.119 | $0.096 | -19.5% |
| Go | $0.646 | $0.568 | -12.2% |
| C++ | $1.102 | $1.014 | -8.0% |
In 8 of 9 languages, output tokens dropped, by as much as 82% for Dart. The one exception is C++, where output tokens increased (+42%) due to more comprehensive code generation. Fewer output tokens mean Claude spends less of its generation on exploration and more on the fix itself:
| Language | Baseline output | With-Lumen output | Delta |
|---|---|---|---|
| Dart | 21,286 | 3,862 | -81.9% |
| JavaScript | 14,286 | 4,872 | -65.9% |
| TypeScript | 4,994 | 1,813 | -63.7% |
| PHP | 1,936 | 796 | -58.9% |
| Python | 1,710 | 1,092 | -36.1% |
| Rust | 17,717 | 12,291 | -30.6% |
| Go | 11,475 | 10,283 | -10.4% |
| Ruby | 6,143 | 5,581 | -9.1% |
| C++ | 15,506 | 22,056 | +42.2% |
The languages where the baseline needed the most exploration saw the largest time reductions:
| Language | Baseline time | With-Lumen time | Delta |
|---|---|---|---|
| Dart | 246.1s | 50.9s | -79.3% |
| JavaScript | 254.7s | 119.3s | -53.2% |
| Rust | 309.7s | 204.0s | -34.1% |
| PHP | 51.5s | 34.0s | -34.0% |
| TypeScript | 84.4s | 56.3s | -33.3% |
| Python | 43.0s | 30.6s | -28.8% |
| Ruby | 185.5s | 165.2s | -10.9% |
| Go | 291.2s | 264.1s | -9.3% |
| C++ | 370.7s | 359.1s | -3.1% |
Lumen typically uses 1-10 search calls per task. It supplements rather than replaces other tool usage:
| Language | Lumen search calls | Total tool calls (Lumen) | Total tool calls (baseline) |
|---|---|---|---|
| Python | 2 | 5 | 7 |
| PHP | 2 | 7 | 10 |
| Rust | 2 | 9 | 22 |
| TypeScript | 1 | 9 | 6 |
| Dart | — | 14 | 61 |
| JavaScript | 2 | 16 | 18 |
| Go | 3 | 35 | 51 |
| Ruby | 10 | 47 | 53 |
| C++ | 6 | 51 | 63 |
Lumen maintained patch quality in all 9 tasks. Two tasks achieved Perfect ratings (JavaScript, Python) — identical patches to the gold standard. Six achieved Good ratings with correct fixes via different approaches. Even the one task too hard for either approach (Rust) showed no degradation — Lumen just made the failure 39% cheaper.
All benchmark artifacts — raw JSONL streams, patch diffs, metrics, and judge ratings — are committed to this repository. The benchmark framework is deterministic in setup (same commit, same issue, same tools) while allowing natural LLM variation in execution. The consistent direction of improvement across 9 independent language benchmarks validates that the results are reliable.
Requirements: Ollama running with ordis/jina-embeddings-v2-base-code, the
claude CLI, git, go, jq.
```bash
cd bench-swe

# Run all tasks, both scenarios
go run ./cmd/run --output ../bench-results/my-run

# Run a single language
go run ./cmd/run --filter go-hard --output ../bench-results/my-run

# Generate report from existing results
go run ./cmd/report --input ../bench-results/my-run
```

Results land in `bench-results/<run-id>/`. Each run produces:

- `<task>-<scenario>-raw.jsonl` — full Claude session stream
- `<task>-<scenario>-metrics.json` — extracted cost/time/tokens
- `<task>-<scenario>-patch.diff` — generated patch
- `<task>-<scenario>-judge.json` — judge rating and reasoning
- `<task>-<scenario>-judge.md` — judge rationale in markdown
- `detail-report.md` / `summary-report.md` — human-readable output
The benchmark is entirely self-contained in bench-swe/. Tasks are defined as
JSON files in bench-swe/tasks/. To add a new language or difficulty level, add
a task JSON and re-run.
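A hypothetical task definition, using the go-hard entry from the task table — every field name here is illustrative, and the commit is a placeholder; copy an existing file from bench-swe/tasks/ for the real schema:

```json
{
  "id": "go-hard",
  "language": "go",
  "difficulty": "hard",
  "repo": "https://github.com/goccy/go-yaml",
  "commit": "<pre-fix-commit-sha>",
  "issue": "Decoder overrides defaults with null values"
}
```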
Current results are committed at bench-results/.