README benchmark chart: not reproducible from committed harness, no variance shown

> Written with AI (Claude, Opus 4.8 via Claude Code), reviewed by a human before posting.

First: thanks for fff. The search core is genuinely fast, and the library-not-CLI framing is the right call.

This is about the **"Opus 4.6 - Feature Completion" chart** in the README (`chart.png`), which is the headline argument for the MCP server. A few things make it hard to take at face value, and I think fixing them would make the claim much stronger rather than weaker.

### 1. The chart doesn't appear to be produced by the committed harness
`chart.png` and `scripts/benchmark-claude.sh` + `scripts/analyze-results.py` landed in the same commit (`feat: MCP (#272)`), but they measure different things:

- The **chart** plots cumulative `tokens (input+output)` vs `wall time` for a single *feature-completion* task, with an "auto compaction" band.
- The **committed harness** runs **11 one-shot file-*search* concepts** ("find the function that loads X"), scored by **USD cost** with a 5–15% tie band, on a **private repo** (`~/dev/lightsource`, 194k files).

The repo contains no chart-generating code and no raw run data (no `benchmark-results/`, no `iter*.json`, no `.stream.jsonl`), and the string "20 runs" appears nowhere in it. So the published chart isn't reproducible from what's checked in.

### 2. "Average of 20 runs" with no spread shown
The reported deltas are roughly **-8% wall time / -17% tokens** (per the promo video). Opus has high run-to-run variance, so without error bars / std-dev / CIs across the 20 runs, it's hard to tell the effect from noise. `analyze-results.py` computes only averages, no variance.
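
For illustration, here is a minimal sketch of the kind of spread reporting I mean, using only the Python standard library. The function names (`summarize`, `bootstrap_diff_ci`) are hypothetical and not part of `analyze-results.py`; the inputs are assumed to be one number per run (e.g. total tokens or wall seconds) for each condition.

```python
import random
import statistics


def summarize(samples):
    """Mean, sample standard deviation (n - 1), and standard error of one condition."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    sem = stdev / len(samples) ** 0.5
    return {"n": len(samples), "mean": mean, "stdev": stdev, "sem": sem}


def bootstrap_diff_ci(baseline, treatment, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the relative change in means.

    Returns (lo, hi) where e.g. -0.17 means "treatment is 17% lower".
    Runs are resampled with replacement within each condition.
    """
    rng = random.Random(seed)  # fixed seed so the interval is repeatable
    diffs = []
    for _ in range(n_resamples):
        b = [rng.choice(baseline) for _ in baseline]
        t = [rng.choice(treatment) for _ in treatment]
        diffs.append(statistics.mean(t) / statistics.mean(b) - 1.0)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the interval returned for the token delta includes 0, the headline number can't be distinguished from run-to-run noise at that sample size.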

### 3. Presentation choices that inflate the visual
- **Combined input+output y-axis** obscures that the savings are input-side (the cheap, cache-friendly tokens).
- The **"auto compaction" band** is drawn as the narrative, but both lines end *above* it, so fff doesn't actually keep a run under compaction in this chart.
- The two lines overlap for ~75% of the run and diverge only at the very end.
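
To make the first point checkable, a small sketch of how the token delta could be decomposed by side. `split_savings` and the `"input"` / `"output"` keys are hypothetical names for illustration; the values are assumed to be per-condition mean token counts taken from the raw run data.

```python
def split_savings(baseline, treatment):
    """Split the total token delta into input-side and output-side parts.

    baseline / treatment: dicts with mean "input" and "output" token counts.
    Negative deltas mean the treatment used fewer tokens.
    """
    d_in = treatment["input"] - baseline["input"]
    d_out = treatment["output"] - baseline["output"]
    total = d_in + d_out
    return {
        "input_delta": d_in,
        "output_delta": d_out,
        # Fraction of the total change that comes from input tokens.
        "input_share": d_in / total if total else 0.0,
    }
```

Reporting `input_share` next to the chart would show directly how much of the saving lands on the cheaper, cacheable side.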

### Asks (any one would help)
1. Publish the raw per-run data + the script that generates `chart.png`.
2. Add error bars / variance for the 20 runs.
3. Use a public repo so the feature-completion benchmark is reproducible.
4. Ideally split input vs output tokens on the chart.

If useful, paste this into an AI coding agent to generate a reproducible harness:

> Build a benchmark that runs one feature-completion task in a **public** repo N times, with and without the fff MCP server, recording cumulative (input, output) tokens and wall time per turn. Emit a chart with mean ± std-dev bands per condition, and commit the raw per-run JSON next to the chart-generation script.
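
As a starting point for the aggregation step in that prompt, here is a minimal sketch that turns per-run curves into mean ± std-dev bands. It assumes each run is a time-sorted list of `(wall_time_s, cumulative_tokens)` points and that a finished run holds its final value; both the data shape and the names (`value_at`, `mean_std_band`) are my assumptions, not something in the repo.

```python
import bisect
import statistics


def value_at(run, t):
    """Cumulative tokens of one run at time t (step function, last value held)."""
    times = [point[0] for point in run]
    i = bisect.bisect_right(times, t)
    return run[i - 1][1] if i > 0 else 0


def mean_std_band(runs, step_s=10.0):
    """Resample all runs onto a shared time grid.

    Returns a list of (t, mean, stdev) rows, one per grid point, which can be
    plotted as a mean line with a shaded +/- stdev band per condition.
    """
    horizon = max(run[-1][0] for run in runs)
    grid = [k * step_s for k in range(int(horizon / step_s) + 1)]
    rows = []
    for t in grid:
        values = [value_at(run, t) for run in runs]
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        rows.append((t, statistics.mean(values), spread))
    return rows
```

Calling this once per condition (with and without the MCP server) and committing the input JSON alongside the output would make the chart regenerable from the repo.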

Not trying to dunk on the project, just want the headline benchmark to be as solid as the underlying tool.

