README benchmark chart: not reproducible from committed harness, no variance shown #540

@andrewgazelka

Written with AI (Claude, Opus 4.8 via Claude Code), reviewed by a human before posting.

First: thanks for fff. The search core is genuinely fast, and the library-not-CLI framing is the right call.

This is about the "Opus 4.6 - Feature Completion" chart in the README (chart.png), which is the headline argument for the MCP server. A few things make it hard to take at face value, and I think fixing them would make the claim much stronger rather than weaker.

1. The chart doesn't appear to be produced by the committed harness

chart.png and scripts/benchmark-claude.sh + scripts/analyze-results.py landed in the same commit (feat: MCP (#272)), but they measure different things:

  • The chart plots cumulative tokens (input+output) vs wall time for a single feature-completion task, with an "auto compaction" band.
  • The committed harness runs 11 one-shot file-search concepts ("find the function that loads X"), scored by USD cost with a 5–15% tie band, on a private repo (~/dev/lightsource, 194k files).

The repo contains no chart-generating code and no raw run data (no benchmark-results/, no iter*.json, no .stream.jsonl), and the string "20 runs" appears nowhere in it. So the published chart isn't reproducible from what's checked in.

2. "Average of 20 runs" with no spread shown

The reported deltas are roughly -8% wall time / -17% tokens (per the promo video). Opus has high run-to-run variance, so without error bars, std-dev, or CIs across the 20 runs, it's hard to tell the effect from noise. analyze-results.py computes only averages, not variance.
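To make the point concrete, here is a minimal sketch of the spread statistics that could sit next to the averages. It is not taken from analyze-results.py; the function names `summarize` and `relative_delta` are hypothetical, and the inputs are assumed to be plain lists of per-run totals (tokens or seconds).

```python
import math
import statistics


def summarize(samples):
    """Return n, mean, sample std-dev, and an approximate 95% CI half-width."""
    n = len(samples)
    mean = statistics.fmean(samples)
    # Sample std-dev needs at least two runs; report 0.0 otherwise.
    sd = statistics.stdev(samples) if n > 1 else 0.0
    # Normal approximation (z = 1.96). For n = 20 a t-quantile (about 2.09)
    # would give a slightly wider interval.
    ci95 = 1.96 * sd / math.sqrt(n)
    return {"n": n, "mean": mean, "stdev": sd, "ci95": ci95}


def relative_delta(baseline, treatment):
    """Relative change in means, treatment vs baseline (negative = savings)."""
    b = statistics.fmean(baseline)
    t = statistics.fmean(treatment)
    return (t - b) / b
```

If the CI half-widths of the two conditions are comparable in size to the difference in means, a -8% or -17% delta is not distinguishable from noise at that sample size.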

3. Presentation choices that inflate the visual

  • Combined input+output y-axis obscures that the savings are input-side (the cheap, cache-friendly tokens).
  • The "auto compaction" band is drawn as the narrative, but both lines end above it, so fff doesn't actually keep a run under compaction in this chart.
  • The two lines overlap for ~75% of the run and diverge only at the very end.
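As an illustration of the first bullet, a rough sketch of how input and output tokens could be tracked separately per turn. The record shape (dicts with `input_tokens` and `output_tokens` keys) and both function names are assumptions for this example, not something the committed harness emits.

```python
from itertools import accumulate


def cumulative_split(turns):
    """Return (cumulative input tokens, cumulative output tokens) per turn."""
    cum_in = list(accumulate(t["input_tokens"] for t in turns))
    cum_out = list(accumulate(t["output_tokens"] for t in turns))
    return cum_in, cum_out


def input_share_of_savings(baseline_turns, treatment_turns):
    """Fraction of total token savings that comes from the input side.

    Returns None when there is no net difference between the two runs.
    """
    saved_in = (sum(t["input_tokens"] for t in baseline_turns)
                - sum(t["input_tokens"] for t in treatment_turns))
    saved_out = (sum(t["output_tokens"] for t in baseline_turns)
                 - sum(t["output_tokens"] for t in treatment_turns))
    total = saved_in + saved_out
    return saved_in / total if total else None
```

Plotting the two cumulative series as separate lines (or reporting the input share) would show directly whether the savings are input-side.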

Asks (any one would help)

  1. Publish the raw per-run data + the script that generates chart.png.
  2. Add error bars / variance for the 20 runs.
  3. Use a public repo so the feature-completion benchmark is reproducible.
  4. Ideally split input vs output tokens on the chart.

If useful, paste this into an AI coding agent to generate a reproducible harness:

Build a benchmark that runs one feature-completion task in a public repo N times, with and without the fff MCP server, recording cumulative (input, output) tokens and wall time per turn. Emit a chart with mean ± std-dev bands per condition, and commit the raw per-run JSON next to the chart-generation script.
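For the "mean ± std-dev bands" part of that prompt, a minimal sketch of the aggregation step, under stated assumptions: each run is a time-sorted list of `(elapsed_seconds, cumulative_tokens)` points, and runs are compared on a shared time grid by holding each run's last recorded value. The names `cumulative_at` and `mean_std_band` are hypothetical, and the plotting itself is left out.

```python
import bisect
import statistics


def cumulative_at(run, t):
    """Cumulative tokens of one run at time t (step function, 0 before the first point)."""
    times = [point[0] for point in run]
    i = bisect.bisect_right(times, t)
    return run[i - 1][1] if i > 0 else 0


def mean_std_band(runs, grid):
    """For each grid time, return (t, mean, sample std-dev) across all runs."""
    band = []
    for t in grid:
        values = [cumulative_at(run, t) for run in runs]
        mean = statistics.fmean(values)
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        band.append((t, mean, sd))
    return band
```

Calling `mean_std_band` once per condition yields the centre line and the band half-width for each; committing the per-run point lists as JSON alongside this script is what would make the chart regenerable.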

Not trying to dunk on the project; I just want the headline benchmark to be as solid as the underlying tool.
