scripts/benchmark.py: cross-commit benchmark harness by tony · Pull Request #9 · tony/agentgrep

tony · 2026-05-23T09:46:38Z

Summary

Adds the cross-commit benchmark harness and refreshes several docs surfaces. Three categories of change in one PR — happy to split if you'd prefer separate reviews, but they're tightly related (the harness IS the dev-tooling page that motivated the Development section).

Benchmark harness (`scripts/benchmark.py`)

New PEP 723 self-contained harness (typer + rich + pydantic) for hyperfine-driven benchmarks across HEAD, trunk, ranges, lookback, from-trunk-back, tags, or explicit commit lists. Subcommands: `run`, `compare`, `list-commits`, `list-commands`, `show-config`.
Output formats: rich table (default), JSON, NDJSON, Markdown, CSV. Raw samples preserved in machine-readable outputs so downstream tooling can compute its own percentiles.
TOML config layering — `scripts/benchmark.toml` ships the agentgrep defaults; `scripts/benchmark.local.toml` (gitignored) carries per-machine overrides; CLI flags trump both.
Safety guards: dirty-tree refusal (`--allow-dirty` to override), atexit + signal trap restoring HEAD on every exit path (incl. SIGINT), pre-flight `--output` path validation, pydantic-validated `Settings` bounds (`runs >= 1`, `warmup >= 0`).
Hyperfine preferred with `-N` (no shell). Auto-fallback to pure-Python `time.perf_counter` when hyperfine isn't on PATH or `--no-hyperfine` is passed.

The script surfaced a real +7% cold-start regression at the query-language landing commit on the cli-normalization branch — that's what motivated promoting the prototype `/tmp/bench-*.sh` to a first-class repo tool.

Docs reorganization

New `docs/dev/` section for contributing/internals references. `scripts/benchmark.py` is dev-only (not part of the published wheel), so its reference page lives under Development. `docs/storage-catalog.md` relocated to `docs/dev/` alongside it — it's an internals reference for adapter authors.
`docs/history.md` now uses the standard `{include} ../CHANGES` pattern to render the changelog inline; previously it was a stale "no releases yet" stub even though four releases had shipped.
Sidebar flattened to the libtmux pattern: one entry per section with child pages expanded under it, instead of seven caption-grouped flat lists.
Top four landing pages (Quickstart, Installation, MCP Clients, Configuration) collapsed into one Getting Started section. The quickstart page is now `docs/getting-started/index.md` (the section landing); installation / clients / configuration are children.

Performance

Preload `IBM Plex Mono 400 italic` in the page ``. Furo's Pygments style renders comment tokens (`.c`, `.c1`, `.cm`) italic, so every code block with a `# comment` was lazy-loading the face. Fixed locally on agentgrep; an upstream PR (git-pull/gp-sphinx#43) adds the same preload to `DEFAULT_SPHINX_FONT_PRELOAD` so every gp-sphinx consumer benefits.

Test plan

`uv run ruff check .` passes
`uv run ruff format --check .` passes
`uv run ty check` passes
`uv run pytest --reruns 0` passes (272 tests, including 26 new `tests/test_benchmark_script.py` cases)
`just build-docs` passes; redirects render for `/benchmark/`, `/storage-catalog/`, `/quickstart/`, `/installation/`, `/clients/`, `/configuration/`
Benchmark harness verified across 5 sweeps and ~160 distinct invocations covering every subcommand, target selector, output format, real timing path, and error/safety path
Sidebar verified locally — top-level sections expand to children; redirects from old slugs land on new pages

Notes

The CHANGES entry for these additions will land via the `cli-normalization` branch's merge, where the `0.1.0a5` placeholder already exists. Happy to add a standalone CHANGES update on master if you'd prefer it lands in this PR.
The font preload override in `docs/conf.py` becomes redundant once gp-sphinx#43 merges and a new gp-sphinx version is pinned. Easy follow-up to remove the local override at that point.

tony · 2026-05-23T11:14:58Z

Code review

Found 2 issues:

Module docstring directs readers to docs/library/benchmark.md, which doesn't exist on this branch (the file is at docs/dev/benchmark.md after the move-to-Project commit). Anyone reading --help output or the source header and following the link will hit a 404.

agentgrep/scripts/benchmark.py

Lines 15 to 20 in 8bfedfd

    
           be overridden per-machine via ``scripts/benchmark.local.toml`` (gitignored). 
        
           Run ``uv run scripts/benchmark.py --help`` for the subcommand list, or see 
        
           ``docs/library/benchmark.md`` for invocation recipes. 
        
           """

Status literal advertises a "sync_fail" value, but no code path can ever produce it. _maybe_sync calls subprocess.run(..., check=False) and returns None; the caller in _run_one_commit discards the result. A failing uv sync is silently absorbed and the benchmark proceeds against a potentially broken venv — the user never sees the failure marked as sync_fail even though the Status type promises it.

agentgrep/scripts/benchmark.py

Lines 48 to 52 in 8bfedfd

    
           LOCAL_CONFIG = REPO_ROOT / "scripts" / "benchmark.local.toml" 
        
           Status = t.Literal["ok", "checkout_fail", "sync_fail", "command_missing", "bench_fail"] 
        
           OutputFormat = t.Literal["rich", "json", "ndjson", "md", "csv"] 
        
           PERCENTILE_LABELS: tuple[str, ...] = ("min", "max", "avg", "p50", "p90", "p95", "p99")

agentgrep/scripts/benchmark.py

Lines 640 to 652 in 8bfedfd

    
           def _maybe_sync(settings: Settings, repo: pathlib.Path) -> None: 
        
               if not settings.sync_command.strip(): 
        
                   return 
        
               subprocess.run( 
        
                   shlex.split(settings.sync_command), 
        
                   cwd=repo, 
        
                   check=False, 
        
                   capture_output=True, 
        
                   text=True, 
        
               )

agentgrep/scripts/benchmark.py

Lines 708 to 713 in 8bfedfd

    
               return results 
        
           if not no_sync and not dry_run: 
        
               _maybe_sync(config.settings, repo) 
        
           venv = (repo / config.settings.venv).resolve()

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

why: Hyperfine sweeps across the last 50 commits surfaced a real +7% cold-start regression at the query-language landing commit that was invisible to ad-hoc eyeballing. The capability lived in hand-edited /tmp/bench-*.sh prototypes; promoting it to a first-class repo tool makes "where did grep slow down?" a one- liner from any branch. what: - Add scripts/benchmark.py as a PEP 723 self-contained script (typer + rich + pydantic). Subcommands: run / compare / list-commits / list-commands / show-config. - Target selectors: --target HEAD|trunk|REF, --range A..B, --lookback N, --from-trunk-back N, --tags, --commits SHA1,SHA2, --head-vs-trunk. Mutually exclusive, all resolve to a chronologically-ordered list of CommitRef tuples. - Per-commit isolation: dirty-tree refusal (with --allow-dirty override), atexit + signal-trap HEAD restore, optional --keep-checkout for iteration, per-iteration git checkout + uv sync + bench probe. - Timing engine: hyperfine -N preferred (with --export-json sample preservation); pure-Python perf_counter fallback when hyperfine isn't on PATH or --no-hyperfine is passed. - Renderers: rich tables (default), json, ndjson, markdown, csv. Raw samples preserved in every machine-readable format so consumers can compute their own percentiles. - Config: scripts/benchmark.toml ships agentgrep defaults (grep/find/search/import-time benches against libtmux). scripts/benchmark.local.toml provides a gitignored per-machine override layer; .example template shows the shape. Layering: built-in defaults -> benchmark.toml -> benchmark.local.toml -> CLI flags. - Templating: command strings substitute {venv} / {query} / {sha} / {short_sha} / {repo}. Unknown placeholders raise. - pyproject.toml: per-file ignore for scripts/benchmark.py (B008 / E501) -- typer.Option metadata is not a real callable default, and --help text shouldn't be wrapped mid-string.

why: The benchmark harness has several pure-Python pieces (percentile math, deep-merge config layering, git target resolution, command templating, JSON record shape) that are easy to regress without noticing. Lock them down with a parametrised pytest suite mirroring the NamedTuple+test_id convention the rest of the project uses. what: - Add tests/test_benchmark_script.py loading the PEP 723 script via importlib.util.spec_from_file_location -- same pattern that tests/test_mcp_swap.py uses for scripts/mcp_swap.py. - StatsCase (6 cases): single sample, all-equal, odd/even count, single-outlier max-pull, hundred-sample percentile alignment. Plus a "unknown label" guard and an empty-list NaN check. - TargetResolutionCase (6 cases): one per selector -- single-target HEAD, --head-vs-trunk, --commits A,B, --range, --lookback (asserts the newest-first -> oldest-first reversal), --tags. Each feeds a mocked git_runner lookup table that fails loud on unexpected invocations. - ConfigLayeringCase (4 cases): defaults-only, primary-TOML, primary+local overlay, CLI-override trumps all. Driven by tmp_path TOML fixtures and the existing load_config(config_path=..., local_path=..., cli_overrides=...) injection points. - TemplatingCase (4 cases): bare query, multi-token grep shape, empty-string query, unknown-token KeyError. - JSON-shape and stats-property tests for Measurement: model_dump keys match the documented schema (samples preserved, computed properties absent), and the property stats (min_s / max_s / avg_s / stddev_s) match a hand-rolled calculation. - pyproject dev group: add `typer>=0.12` alongside `tomlkit>=0.13` with a comment explaining the same PEP 723 importability rationale -- pydantic and rich already live in the runtime graph via the project itself and textual respectively.

why: scripts/benchmark.py is a developer-only tool -- not part of the published wheel -- and the on-disk-storage catalogue is an internals reference for adapter authors. They belong in a sidebar section labelled for that audience, not bundled with the user- facing release changelog and GitHub link. This commit introduces docs/dev/ as the new home for contributing/internals reference. what: - Add docs/dev/benchmark.md: invocation recipes for each target selector, discovery subcommands (list-commits, list-commands, show-config), the five output formats (rich / json / ndjson / md / csv), the four-layer config precedence with deep-merge semantics, the templating tokens ({venv} / {query} / {sha} / {short_sha} / {repo}), and the safety guards (dirty-tree refusal, HEAD restore on exit, per-commit isolation). - Relocate docs/storage-catalog.md to docs/dev/storage-catalog.md alongside the new benchmark page; it documents internal store layouts for adapter authors and sits naturally with the dev- facing material. - docs/index.md: introduce a new toctree with :caption: Development for dev/benchmark and dev/storage-catalog, placed between the Reference and Project sections. The Project section now carries only the release history and the GitHub link. - docs/redirects.txt: add benchmark -> dev/benchmark and storage-catalog -> dev/storage-catalog so external links to the old root-level slugs keep resolving. - Ref labels (\`benchmark\`, \`storage-catalog\`) are unchanged, so {ref}\`...\` cross-references in historical CHANGES entries and any other surface keep working.

why: /history/ rendered an empty placeholder ("This page will track published release notes once the project begins publishing end-user releases") even though CHANGES already carries four published releases (0.1.0a1 -> 0.1.0a4) and a 0.1.0a5 unreleased block. The stub was stale. what: - Replace the placeholder with the `{include} ../CHANGES` pattern used across every other gp-libs / libtmux / libvcs sibling project. MyST resolves the include at parse time so Sphinx roles (`{ref}`, `{class}`, `{func}`) inside CHANGES continue to resolve from the rendered history page. - Add `(changes)=` and `(changelog)=` alias anchors next to the existing `(history)=` label so cross-references match the naming used across the sibling docs ecosystem.

why: The Sphinx sidebar previously rendered seven caption-grouped toctrees (Get started / CLI / Library / MCP / Reference / Development / Project), each flattening its children to the same indentation level. The result was a long single-depth list where "agentgrep ui" sat next to "agentgrep search" next to "Tutorial" next to "Tools" without visual grouping -- and selecting "CLI" showed the same six rows whether you wanted the section overview or one of its pages. The libtmux docs use a different pattern -- a single flat root toctree pointing at section landing pages, with each landing page owning its own child toctree. The sidebar then shows one entry per section that expands to reveal children only when navigated into. Closer to how a typical multi-section docs site reads. what: - docs/index.md: collapse the seven captioned toctrees into one hidden toctree with no caption. Leaf pages (quickstart, installation, clients, configuration, history) stay at the top level; CLI/Library/MCP/Reference/Development each appear as a single sidebar entry that opens to its children. - docs/cli/index.md, docs/library/index.md, docs/mcp/index.md, docs/reference/api/index.md: each gets its own hidden toctree listing the section's child pages. The existing grid-item-card navigation on the visible page is unchanged. - docs/dev/index.md (new): Development section landing page with grid cards for the benchmark harness and storage catalogue, plus the hidden toctree pointing at both. The root index references dev/index instead of listing dev/benchmark and dev/storage-catalog directly. The sidebar now reads: Quickstart Installation MCP Clients Configuration CLI (expands -> ui / search / find) Library (expands -> tutorial / how-to / reference / examples) MCP (expands -> tools / resources / prompts) API Reference (expands -> agentgrep / agentgrep.ui / agentgrep.mcp) Development (expands -> benchmark / storage catalogue) History GitHub

…nto a Getting Started section why: The four top-level entries (Quickstart, Installation, MCP Clients, Configuration) all answered "how do I start using agentgrep?" but occupied four separate rows in the sidebar with no grouping. Folding them under a single "Getting Started" section (with the quickstart prose as the section's landing page) mirrors the libtmux pattern -- one collapsible sidebar entry per discipline, expanding to its children when the reader navigates in. what: - Move docs/quickstart.md to docs/getting-started/index.md; the page is the section landing. Title changes from Quickstart to Getting Started; the `(quickstart)=` label survives alongside a new `(getting-started)=` label so any external link or {ref}`quickstart` keeps resolving. - Move docs/installation.md, docs/clients.md, docs/configuration.md to docs/getting-started/ alongside the landing page. - docs/getting-started/index.md gains a hidden toctree listing installation, clients, configuration as children. The two {doc}`library/tutorial` and {doc}`mcp/tools` "next steps" links now use the `../` prefix since the page sits one level deeper. - docs/index.md root toctree: replace the four flat entries with one entry pointing at getting-started/index. Grid-item-card 🔗 paths updated for Quickstart, Client Setup, and Configuration cards. - docs/redirects.txt: four new redirects so external bookmarks for /quickstart/, /installation/, /clients/, and /configuration/ land on the new pages. The sidebar now reads: Getting Started (expands -> Installation / MCP Clients / Configuration) CLI (expands -> ui / search / find) Library (expands -> tutorial / how-to / reference / examples) MCP (expands -> tools / resources / prompts) API Reference (expands -> agentgrep / agentgrep.ui / agentgrep.mcp) Development (expands -> Benchmark harness / Storage catalogue) History GitHub

…e comments why: Furo's Pygments style renders comment tokens (.c / .c1 / .cm / etc.) in italic. Every code block with a `# ...` comment triggers a network fetch for IBM Plex Mono 400 italic on first encounter because the gp-sphinx DEFAULT_SPHINX_FONT_PRELOAD list only covers Mono 400/700 normal and Sans 400 italic. The browser shows the face loaded from network on each page until cached. what: - Extend conf["sphinx_font_preload"] in docs/conf.py with the ("IBM Plex Mono", 400, "italic") tuple so the ibm-plex-mono-latin-400-italic.woff2 face renders as a <link rel="preload" as="font" crossorigin> tag in the page head alongside the existing mono normal preloads. Comment glyphs in syntax-highlighted code paint without a font swap. - Comment in conf.py points at the upstream defaults so a future reader can see why this single tuple is project-overridden rather than upstream-fixed.

why: Regression #9 (--output bad-path traceback at end of run) had no test. The pre-flight checks in cmd_run that call typer.Exit(code=2) before any git checkout could be removed silently. what: - test_main_exits_2_when_output_parent_missing: calls main(["run", "--output", "/nonexistent/dir/x.md", ...]) and asserts rc == 2. The pre-flight fires at step 5 of cmd_run, before the dirty-tree check or any git interaction. - test_main_exits_2_when_output_is_directory: passes tmp_path (an existing directory) as --output and asserts rc == 2.

Follow-up to #9's code review. Wires the previously-unreachable `sync_fail` Status literal so a failing `uv sync` between commits is recorded and surfaced rather than silently absorbed, and locks down every prior bug-fix surface with regression tests. - **sync_fail now reachable.** `_maybe_sync` returns the `CompletedProcess` so `_run_one_commit` can inspect `returncode`. Non-zero → every bench for that commit is recorded as `status="sync_fail"` with the truncated stderr (or a "sync failed" fallback when the process produces no output), and the run moves to the next commit without benchmarking a broken venv. - **Docstring path.** Module docstring pointed at the pre-move `docs/library/benchmark.md`; updated to `docs/dev/benchmark.md`. - **Regression tests.** Fifteen new parametrized cases covering validation rejection, exit-code propagation, bad-ref translation, aggregate-column uniqueness, config-error surfacing, pydantic bound enforcement, cli-override routing, and output pre-flight.

tony temporarily deployed to docs May 23, 2026 09:46 — with GitHub Actions Inactive

tony force-pushed the benchmark-script branch from eefcd45 to d7b0895 Compare May 23, 2026 10:27

tony temporarily deployed to docs May 23, 2026 10:27 — with GitHub Actions Inactive

tony force-pushed the benchmark-script branch from d7b0895 to 90a598d Compare May 23, 2026 10:38

tony temporarily deployed to docs May 23, 2026 10:38 — with GitHub Actions Inactive

tony temporarily deployed to docs May 23, 2026 10:40 — with GitHub Actions Inactive

tony temporarily deployed to docs May 23, 2026 10:45 — with GitHub Actions Inactive

tony temporarily deployed to docs May 23, 2026 10:50 — with GitHub Actions Inactive

tony temporarily deployed to docs May 23, 2026 10:53 — with GitHub Actions Inactive

tony temporarily deployed to docs May 23, 2026 11:04 — with GitHub Actions Inactive

tony added 8 commits May 23, 2026 06:23

docs(CHANGES) Benchmark harness and docs site refresh

0a7a9d9

tony force-pushed the benchmark-script branch from 8bfedfd to 0a7a9d9 Compare May 23, 2026 11:23

tony temporarily deployed to docs May 23, 2026 11:24 — with GitHub Actions Inactive

tony merged commit 7a2b526 into master May 23, 2026
3 checks passed

tony deleted the benchmark-script branch May 23, 2026 11:31

tony mentioned this pull request May 23, 2026

scripts(benchmark[fix]) Surface uv sync failures + docstring path #10

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scripts/benchmark.py: cross-commit benchmark harness#9

scripts/benchmark.py: cross-commit benchmark harness#9
tony merged 8 commits into
masterfrom
benchmark-script

tony commented May 23, 2026 •

edited

Loading

Uh oh!

tony commented May 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tony commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Benchmark harness (scripts/benchmark.py)

Docs reorganization

Performance

Test plan

Notes

Uh oh!

tony commented May 23, 2026

Code review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tony commented May 23, 2026 •

edited

Loading

Benchmark harness (`scripts/benchmark.py`)