scripts/benchmark.py: cross-commit benchmark harness#9
Merged
Conversation
Owner
Author
Code reviewFound 2 issues:
agentgrep/scripts/benchmark.py Lines 15 to 20 in 8bfedfd
agentgrep/scripts/benchmark.py Lines 48 to 52 in 8bfedfd agentgrep/scripts/benchmark.py Lines 640 to 652 in 8bfedfd agentgrep/scripts/benchmark.py Lines 708 to 713 in 8bfedfd 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
why: Hyperfine sweeps across the last 50 commits surfaced a real
+7% cold-start regression at the query-language landing commit
that was invisible to ad-hoc eyeballing. The capability lived in
hand-edited /tmp/bench-*.sh prototypes; promoting it to a
first-class repo tool makes "where did grep slow down?" a one-
liner from any branch.
what:
- Add scripts/benchmark.py as a PEP 723 self-contained script
(typer + rich + pydantic). Subcommands: run / compare /
list-commits / list-commands / show-config.
- Target selectors: --target HEAD|trunk|REF, --range A..B,
--lookback N, --from-trunk-back N, --tags, --commits
SHA1,SHA2, --head-vs-trunk. Mutually exclusive, all resolve
to a chronologically-ordered list of CommitRef tuples.
- Per-commit isolation: dirty-tree refusal (with --allow-dirty
override), atexit + signal-trap HEAD restore, optional
--keep-checkout for iteration, per-iteration git checkout +
uv sync + bench probe.
- Timing engine: hyperfine -N preferred (with --export-json
sample preservation); pure-Python perf_counter fallback when
hyperfine isn't on PATH or --no-hyperfine is passed.
- Renderers: rich tables (default), json, ndjson, markdown,
csv. Raw samples preserved in every machine-readable format
so consumers can compute their own percentiles.
- Config: scripts/benchmark.toml ships agentgrep defaults
(grep/find/search/import-time benches against libtmux).
scripts/benchmark.local.toml provides a gitignored
per-machine override layer; .example template shows the
shape. Layering: built-in defaults -> benchmark.toml ->
benchmark.local.toml -> CLI flags.
- Templating: command strings substitute {venv} / {query} /
{sha} / {short_sha} / {repo}. Unknown placeholders raise.
- pyproject.toml: per-file ignore for scripts/benchmark.py
(B008 / E501) -- typer.Option metadata is not a real
callable default, and --help text shouldn't be wrapped
mid-string.
why: The benchmark harness has several pure-Python pieces (percentile math, deep-merge config layering, git target resolution, command templating, JSON record shape) that are easy to regress without noticing. Lock them down with a parametrised pytest suite mirroring the NamedTuple+test_id convention the rest of the project uses. what: - Add tests/test_benchmark_script.py loading the PEP 723 script via importlib.util.spec_from_file_location -- same pattern that tests/test_mcp_swap.py uses for scripts/mcp_swap.py. - StatsCase (6 cases): single sample, all-equal, odd/even count, single-outlier max-pull, hundred-sample percentile alignment. Plus a "unknown label" guard and an empty-list NaN check. - TargetResolutionCase (6 cases): one per selector -- single-target HEAD, --head-vs-trunk, --commits A,B, --range, --lookback (asserts the newest-first -> oldest-first reversal), --tags. Each feeds a mocked git_runner lookup table that fails loud on unexpected invocations. - ConfigLayeringCase (4 cases): defaults-only, primary-TOML, primary+local overlay, CLI-override trumps all. Driven by tmp_path TOML fixtures and the existing load_config(config_path=..., local_path=..., cli_overrides=...) injection points. - TemplatingCase (4 cases): bare query, multi-token grep shape, empty-string query, unknown-token KeyError. - JSON-shape and stats-property tests for Measurement: model_dump keys match the documented schema (samples preserved, computed properties absent), and the property stats (min_s / max_s / avg_s / stddev_s) match a hand-rolled calculation. - pyproject dev group: add `typer>=0.12` alongside `tomlkit>=0.13` with a comment explaining the same PEP 723 importability rationale -- pydantic and rich already live in the runtime graph via the project itself and textual respectively.
why: scripts/benchmark.py is a developer-only tool -- not part of
the published wheel -- and the on-disk-storage catalogue is an
internals reference for adapter authors. They belong in a sidebar
section labelled for that audience, not bundled with the user-
facing release changelog and GitHub link. This commit introduces
docs/dev/ as the new home for contributing/internals reference.
what:
- Add docs/dev/benchmark.md: invocation recipes for each target
selector, discovery subcommands (list-commits, list-commands,
show-config), the five output formats (rich / json / ndjson /
md / csv), the four-layer config precedence with deep-merge
semantics, the templating tokens ({venv} / {query} / {sha} /
{short_sha} / {repo}), and the safety guards (dirty-tree
refusal, HEAD restore on exit, per-commit isolation).
- Relocate docs/storage-catalog.md to docs/dev/storage-catalog.md
alongside the new benchmark page; it documents internal store
layouts for adapter authors and sits naturally with the dev-
facing material.
- docs/index.md: introduce a new toctree with :caption: Development
for dev/benchmark and dev/storage-catalog, placed between the
Reference and Project sections. The Project section now carries
only the release history and the GitHub link.
- docs/redirects.txt: add benchmark -> dev/benchmark and
storage-catalog -> dev/storage-catalog so external links to the
old root-level slugs keep resolving.
- Ref labels (\`benchmark\`, \`storage-catalog\`) are unchanged, so
{ref}\`...\` cross-references in historical CHANGES entries and
any other surface keep working.
why: /history/ rendered an empty placeholder ("This page will track
published release notes once the project begins publishing
end-user releases") even though CHANGES already carries four
published releases (0.1.0a1 -> 0.1.0a4) and a 0.1.0a5 unreleased
block. The stub was stale.
what:
- Replace the placeholder with the `{include} ../CHANGES`
pattern used across every other gp-libs / libtmux / libvcs
sibling project. MyST resolves the include at parse time so
Sphinx roles (`{ref}`, `{class}`, `{func}`) inside CHANGES
continue to resolve from the rendered history page.
- Add `(changes)=` and `(changelog)=` alias anchors next to the
existing `(history)=` label so cross-references match the
naming used across the sibling docs ecosystem.
why: The Sphinx sidebar previously rendered seven caption-grouped toctrees (Get started / CLI / Library / MCP / Reference / Development / Project), each flattening its children to the same indentation level. The result was a long single-depth list where "agentgrep ui" sat next to "agentgrep search" next to "Tutorial" next to "Tools" without visual grouping -- and selecting "CLI" showed the same six rows whether you wanted the section overview or one of its pages. The libtmux docs use a different pattern -- a single flat root toctree pointing at section landing pages, with each landing page owning its own child toctree. The sidebar then shows one entry per section that expands to reveal children only when navigated into. Closer to how a typical multi-section docs site reads. what: - docs/index.md: collapse the seven captioned toctrees into one hidden toctree with no caption. Leaf pages (quickstart, installation, clients, configuration, history) stay at the top level; CLI/Library/MCP/Reference/Development each appear as a single sidebar entry that opens to its children. - docs/cli/index.md, docs/library/index.md, docs/mcp/index.md, docs/reference/api/index.md: each gets its own hidden toctree listing the section's child pages. The existing grid-item-card navigation on the visible page is unchanged. - docs/dev/index.md (new): Development section landing page with grid cards for the benchmark harness and storage catalogue, plus the hidden toctree pointing at both. The root index references dev/index instead of listing dev/benchmark and dev/storage-catalog directly. The sidebar now reads: Quickstart Installation MCP Clients Configuration CLI (expands -> ui / search / find) Library (expands -> tutorial / how-to / reference / examples) MCP (expands -> tools / resources / prompts) API Reference (expands -> agentgrep / agentgrep.ui / agentgrep.mcp) Development (expands -> benchmark / storage catalogue) History GitHub
…nto a Getting Started section
why: The four top-level entries (Quickstart, Installation, MCP
Clients, Configuration) all answered "how do I start using
agentgrep?" but occupied four separate rows in the sidebar with no
grouping. Folding them under a single "Getting Started" section
(with the quickstart prose as the section's landing page) mirrors
the libtmux pattern -- one collapsible sidebar entry per
discipline, expanding to its children when the reader navigates
in.
what:
- Move docs/quickstart.md to docs/getting-started/index.md; the
page is the section landing. Title changes from Quickstart to
Getting Started; the `(quickstart)=` label survives alongside a
new `(getting-started)=` label so any external link or
{ref}`quickstart` keeps resolving.
- Move docs/installation.md, docs/clients.md, docs/configuration.md
to docs/getting-started/ alongside the landing page.
- docs/getting-started/index.md gains a hidden toctree listing
installation, clients, configuration as children. The two
{doc}`library/tutorial` and {doc}`mcp/tools` "next steps" links
now use the `../` prefix since the page sits one level deeper.
- docs/index.md root toctree: replace the four flat entries with
one entry pointing at getting-started/index. Grid-item-card
🔗 paths updated for Quickstart, Client Setup, and
Configuration cards.
- docs/redirects.txt: four new redirects so external bookmarks for
/quickstart/, /installation/, /clients/, and /configuration/
land on the new pages.
The sidebar now reads:
Getting Started (expands -> Installation / MCP Clients / Configuration)
CLI (expands -> ui / search / find)
Library (expands -> tutorial / how-to / reference / examples)
MCP (expands -> tools / resources / prompts)
API Reference (expands -> agentgrep / agentgrep.ui / agentgrep.mcp)
Development (expands -> Benchmark harness / Storage catalogue)
History
GitHub
…e comments
why: Furo's Pygments style renders comment tokens (.c / .c1 /
.cm / etc.) in italic. Every code block with a `# ...` comment
triggers a network fetch for IBM Plex Mono 400 italic on first
encounter because the gp-sphinx DEFAULT_SPHINX_FONT_PRELOAD list
only covers Mono 400/700 normal and Sans 400 italic. The browser
shows the face loaded from network on each page until cached.
what:
- Extend conf["sphinx_font_preload"] in docs/conf.py with the
("IBM Plex Mono", 400, "italic") tuple so the
ibm-plex-mono-latin-400-italic.woff2 face renders as a
<link rel="preload" as="font" crossorigin> tag in the page
head alongside the existing mono normal preloads. Comment
glyphs in syntax-highlighted code paint without a font swap.
- Comment in conf.py points at the upstream defaults so a future
reader can see why this single tuple is project-overridden
rather than upstream-fixed.
6 tasks
tony
added a commit
that referenced
this pull request
May 23, 2026
why: Regression #9 (--output bad-path traceback at end of run) had no test. The pre-flight checks in cmd_run that call typer.Exit(code=2) before any git checkout could be removed silently. what: - test_main_exits_2_when_output_parent_missing: calls main(["run", "--output", "/nonexistent/dir/x.md", ...]) and asserts rc == 2. The pre-flight fires at step 5 of cmd_run, before the dirty-tree check or any git interaction. - test_main_exits_2_when_output_is_directory: passes tmp_path (an existing directory) as --output and asserts rc == 2.
tony
added a commit
that referenced
this pull request
May 23, 2026
Follow-up to #9's code review. Wires the previously-unreachable `sync_fail` Status literal so a failing `uv sync` between commits is recorded and surfaced rather than silently absorbed, and locks down every prior bug-fix surface with regression tests. - **sync_fail now reachable.** `_maybe_sync` returns the `CompletedProcess` so `_run_one_commit` can inspect `returncode`. Non-zero → every bench for that commit is recorded as `status="sync_fail"` with the truncated stderr (or a "sync failed" fallback when the process produces no output), and the run moves to the next commit without benchmarking a broken venv. - **Docstring path.** Module docstring pointed at the pre-move `docs/library/benchmark.md`; updated to `docs/dev/benchmark.md`. - **Regression tests.** Fifteen new parametrized cases covering validation rejection, exit-code propagation, bad-ref translation, aggregate-column uniqueness, config-error surfacing, pydantic bound enforcement, cli-override routing, and output pre-flight.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds the cross-commit benchmark harness and refreshes several docs surfaces. Three categories of change in one PR — happy to split if you'd prefer separate reviews, but they're tightly related (the harness IS the dev-tooling page that motivated the Development section).
Benchmark harness (
scripts/benchmark.py)The script surfaced a real +7% cold-start regression at the query-language landing commit on the cli-normalization branch — that's what motivated promoting the prototype `/tmp/bench-*.sh` to a first-class repo tool.
Docs reorganization
Performance
Test plan
Notes