Skip to content

scripts/benchmark.py: cross-commit benchmark harness#9

Merged
tony merged 8 commits into
masterfrom
benchmark-script
May 23, 2026
Merged

scripts/benchmark.py: cross-commit benchmark harness#9
tony merged 8 commits into
masterfrom
benchmark-script

Conversation

@tony
Copy link
Copy Markdown
Owner

@tony tony commented May 23, 2026

Summary

Adds the cross-commit benchmark harness and refreshes several docs surfaces. Three categories of change in one PR — happy to split if you'd prefer separate reviews, but they're tightly related (the harness IS the dev-tooling page that motivated the Development section).

Benchmark harness (scripts/benchmark.py)

  • New PEP 723 self-contained harness (typer + rich + pydantic) for hyperfine-driven benchmarks across HEAD, trunk, ranges, lookback, from-trunk-back, tags, or explicit commit lists. Subcommands: `run`, `compare`, `list-commits`, `list-commands`, `show-config`.
  • Output formats: rich table (default), JSON, NDJSON, Markdown, CSV. Raw samples preserved in machine-readable outputs so downstream tooling can compute its own percentiles.
  • TOML config layering — `scripts/benchmark.toml` ships the agentgrep defaults; `scripts/benchmark.local.toml` (gitignored) carries per-machine overrides; CLI flags trump both.
  • Safety guards: dirty-tree refusal (`--allow-dirty` to override), atexit + signal trap restoring HEAD on every exit path (incl. SIGINT), pre-flight `--output` path validation, pydantic-validated `Settings` bounds (`runs >= 1`, `warmup >= 0`).
  • Hyperfine preferred with `-N` (no shell). Auto-fallback to pure-Python `time.perf_counter` when hyperfine isn't on PATH or `--no-hyperfine` is passed.

The script surfaced a real +7% cold-start regression at the query-language landing commit on the cli-normalization branch — that's what motivated promoting the prototype `/tmp/bench-*.sh` to a first-class repo tool.

Docs reorganization

  • New `docs/dev/` section for contributing/internals references. `scripts/benchmark.py` is dev-only (not part of the published wheel), so its reference page lives under Development. `docs/storage-catalog.md` relocated to `docs/dev/` alongside it — it's an internals reference for adapter authors.
  • `docs/history.md` now uses the standard `{include} ../CHANGES` pattern to render the changelog inline; previously it was a stale "no releases yet" stub even though four releases had shipped.
  • Sidebar flattened to the libtmux pattern: one entry per section with child pages expanded under it, instead of seven caption-grouped flat lists.
  • Top four landing pages (Quickstart, Installation, MCP Clients, Configuration) collapsed into one Getting Started section. The quickstart page is now `docs/getting-started/index.md` (the section landing); installation / clients / configuration are children.

Performance

  • Preload `IBM Plex Mono 400 italic` in the page ``. Furo's Pygments style renders comment tokens (`.c`, `.c1`, `.cm`) italic, so every code block with a `# comment` was lazy-loading the face. Fixed locally on agentgrep; an upstream PR (git-pull/gp-sphinx#43) adds the same preload to `DEFAULT_SPHINX_FONT_PRELOAD` so every gp-sphinx consumer benefits.

Test plan

  • `uv run ruff check .` passes
  • `uv run ruff format --check .` passes
  • `uv run ty check` passes
  • `uv run pytest --reruns 0` passes (272 tests, including 26 new `tests/test_benchmark_script.py` cases)
  • `just build-docs` passes; redirects render for `/benchmark/`, `/storage-catalog/`, `/quickstart/`, `/installation/`, `/clients/`, `/configuration/`
  • Benchmark harness verified across 5 sweeps and ~160 distinct invocations covering every subcommand, target selector, output format, real timing path, and error/safety path
  • Sidebar verified locally — top-level sections expand to children; redirects from old slugs land on new pages

Notes

  • The CHANGES entry for these additions will land via the `cli-normalization` branch's merge, where the `0.1.0a5` placeholder already exists. Happy to add a standalone CHANGES update on master if you'd prefer it lands in this PR.
  • The font preload override in `docs/conf.py` becomes redundant once gp-sphinx#43 merges and a new gp-sphinx version is pinned. Easy follow-up to remove the local override at that point.

@tony
Copy link
Copy Markdown
Owner Author

tony commented May 23, 2026

Code review

Found 2 issues:

  1. Module docstring directs readers to docs/library/benchmark.md, which doesn't exist on this branch (the file is at docs/dev/benchmark.md after the move-to-Project commit). Anyone reading --help output or the source header and following the link will hit a 404.

be overridden per-machine via ``scripts/benchmark.local.toml`` (gitignored).
Run ``uv run scripts/benchmark.py --help`` for the subcommand list, or see
``docs/library/benchmark.md`` for invocation recipes.
"""

  1. Status literal advertises a "sync_fail" value, but no code path can ever produce it. _maybe_sync calls subprocess.run(..., check=False) and returns None; the caller in _run_one_commit discards the result. A failing uv sync is silently absorbed and the benchmark proceeds against a potentially broken venv — the user never sees the failure marked as sync_fail even though the Status type promises it.

LOCAL_CONFIG = REPO_ROOT / "scripts" / "benchmark.local.toml"
Status = t.Literal["ok", "checkout_fail", "sync_fail", "command_missing", "bench_fail"]
OutputFormat = t.Literal["rich", "json", "ndjson", "md", "csv"]
PERCENTILE_LABELS: tuple[str, ...] = ("min", "max", "avg", "p50", "p90", "p95", "p99")

def _maybe_sync(settings: Settings, repo: pathlib.Path) -> None:
if not settings.sync_command.strip():
return
subprocess.run(
shlex.split(settings.sync_command),
cwd=repo,
check=False,
capture_output=True,
text=True,
)

return results
if not no_sync and not dry_run:
_maybe_sync(config.settings, repo)
venv = (repo / config.settings.venv).resolve()

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

tony added 8 commits May 23, 2026 06:23
why: Hyperfine sweeps across the last 50 commits surfaced a real
+7% cold-start regression at the query-language landing commit
that was invisible to ad-hoc eyeballing. The capability lived in
hand-edited /tmp/bench-*.sh prototypes; promoting it to a
first-class repo tool makes "where did grep slow down?" a one-
liner from any branch.

what:
- Add scripts/benchmark.py as a PEP 723 self-contained script
  (typer + rich + pydantic). Subcommands: run / compare /
  list-commits / list-commands / show-config.
- Target selectors: --target HEAD|trunk|REF, --range A..B,
  --lookback N, --from-trunk-back N, --tags, --commits
  SHA1,SHA2, --head-vs-trunk. Mutually exclusive, all resolve
  to a chronologically-ordered list of CommitRef tuples.
- Per-commit isolation: dirty-tree refusal (with --allow-dirty
  override), atexit + signal-trap HEAD restore, optional
  --keep-checkout for iteration, per-iteration git checkout +
  uv sync + bench probe.
- Timing engine: hyperfine -N preferred (with --export-json
  sample preservation); pure-Python perf_counter fallback when
  hyperfine isn't on PATH or --no-hyperfine is passed.
- Renderers: rich tables (default), json, ndjson, markdown,
  csv. Raw samples preserved in every machine-readable format
  so consumers can compute their own percentiles.
- Config: scripts/benchmark.toml ships agentgrep defaults
  (grep/find/search/import-time benches against libtmux).
  scripts/benchmark.local.toml provides a gitignored
  per-machine override layer; .example template shows the
  shape. Layering: built-in defaults -> benchmark.toml ->
  benchmark.local.toml -> CLI flags.
- Templating: command strings substitute {venv} / {query} /
  {sha} / {short_sha} / {repo}. Unknown placeholders raise.
- pyproject.toml: per-file ignore for scripts/benchmark.py
  (B008 / E501) -- typer.Option metadata is not a real
  callable default, and --help text shouldn't be wrapped
  mid-string.
why: The benchmark harness has several pure-Python pieces (percentile
math, deep-merge config layering, git target resolution, command
templating, JSON record shape) that are easy to regress without
noticing. Lock them down with a parametrised pytest suite mirroring
the NamedTuple+test_id convention the rest of the project uses.

what:
- Add tests/test_benchmark_script.py loading the PEP 723 script via
  importlib.util.spec_from_file_location -- same pattern that
  tests/test_mcp_swap.py uses for scripts/mcp_swap.py.
- StatsCase (6 cases): single sample, all-equal, odd/even count,
  single-outlier max-pull, hundred-sample percentile alignment.
  Plus a "unknown label" guard and an empty-list NaN check.
- TargetResolutionCase (6 cases): one per selector --
  single-target HEAD, --head-vs-trunk, --commits A,B, --range,
  --lookback (asserts the newest-first -> oldest-first reversal),
  --tags. Each feeds a mocked git_runner lookup table that fails
  loud on unexpected invocations.
- ConfigLayeringCase (4 cases): defaults-only, primary-TOML,
  primary+local overlay, CLI-override trumps all. Driven by tmp_path
  TOML fixtures and the existing load_config(config_path=...,
  local_path=..., cli_overrides=...) injection points.
- TemplatingCase (4 cases): bare query, multi-token grep shape,
  empty-string query, unknown-token KeyError.
- JSON-shape and stats-property tests for Measurement: model_dump
  keys match the documented schema (samples preserved, computed
  properties absent), and the property stats (min_s / max_s /
  avg_s / stddev_s) match a hand-rolled calculation.
- pyproject dev group: add `typer>=0.12` alongside `tomlkit>=0.13`
  with a comment explaining the same PEP 723 importability
  rationale -- pydantic and rich already live in the runtime
  graph via the project itself and textual respectively.
why: scripts/benchmark.py is a developer-only tool -- not part of
the published wheel -- and the on-disk-storage catalogue is an
internals reference for adapter authors. They belong in a sidebar
section labelled for that audience, not bundled with the user-
facing release changelog and GitHub link. This commit introduces
docs/dev/ as the new home for contributing/internals reference.

what:
- Add docs/dev/benchmark.md: invocation recipes for each target
  selector, discovery subcommands (list-commits, list-commands,
  show-config), the five output formats (rich / json / ndjson /
  md / csv), the four-layer config precedence with deep-merge
  semantics, the templating tokens ({venv} / {query} / {sha} /
  {short_sha} / {repo}), and the safety guards (dirty-tree
  refusal, HEAD restore on exit, per-commit isolation).
- Relocate docs/storage-catalog.md to docs/dev/storage-catalog.md
  alongside the new benchmark page; it documents internal store
  layouts for adapter authors and sits naturally with the dev-
  facing material.
- docs/index.md: introduce a new toctree with :caption: Development
  for dev/benchmark and dev/storage-catalog, placed between the
  Reference and Project sections. The Project section now carries
  only the release history and the GitHub link.
- docs/redirects.txt: add benchmark -> dev/benchmark and
  storage-catalog -> dev/storage-catalog so external links to the
  old root-level slugs keep resolving.
- Ref labels (\`benchmark\`, \`storage-catalog\`) are unchanged, so
  {ref}\`...\` cross-references in historical CHANGES entries and
  any other surface keep working.
why: /history/ rendered an empty placeholder ("This page will track
published release notes once the project begins publishing
end-user releases") even though CHANGES already carries four
published releases (0.1.0a1 -> 0.1.0a4) and a 0.1.0a5 unreleased
block. The stub was stale.

what:
- Replace the placeholder with the `{include} ../CHANGES`
  pattern used across every other gp-libs / libtmux / libvcs
  sibling project. MyST resolves the include at parse time so
  Sphinx roles (`{ref}`, `{class}`, `{func}`) inside CHANGES
  continue to resolve from the rendered history page.
- Add `(changes)=` and `(changelog)=` alias anchors next to the
  existing `(history)=` label so cross-references match the
  naming used across the sibling docs ecosystem.
why: The Sphinx sidebar previously rendered seven caption-grouped
toctrees (Get started / CLI / Library / MCP / Reference /
Development / Project), each flattening its children to the same
indentation level. The result was a long single-depth list where
"agentgrep ui" sat next to "agentgrep search" next to "Tutorial"
next to "Tools" without visual grouping -- and selecting "CLI"
showed the same six rows whether you wanted the section overview
or one of its pages.

The libtmux docs use a different pattern -- a single flat root
toctree pointing at section landing pages, with each landing page
owning its own child toctree. The sidebar then shows one entry per
section that expands to reveal children only when navigated into.
Closer to how a typical multi-section docs site reads.

what:
- docs/index.md: collapse the seven captioned toctrees into one
  hidden toctree with no caption. Leaf pages (quickstart,
  installation, clients, configuration, history) stay at the top
  level; CLI/Library/MCP/Reference/Development each appear as a
  single sidebar entry that opens to its children.
- docs/cli/index.md, docs/library/index.md, docs/mcp/index.md,
  docs/reference/api/index.md: each gets its own hidden toctree
  listing the section's child pages. The existing grid-item-card
  navigation on the visible page is unchanged.
- docs/dev/index.md (new): Development section landing page with
  grid cards for the benchmark harness and storage catalogue, plus
  the hidden toctree pointing at both. The root index references
  dev/index instead of listing dev/benchmark and dev/storage-catalog
  directly.

The sidebar now reads:

  Quickstart
  Installation
  MCP Clients
  Configuration
  CLI                  (expands -> ui / search / find)
  Library              (expands -> tutorial / how-to / reference / examples)
  MCP                  (expands -> tools / resources / prompts)
  API Reference        (expands -> agentgrep / agentgrep.ui / agentgrep.mcp)
  Development          (expands -> benchmark / storage catalogue)
  History
  GitHub
…nto a Getting Started section

why: The four top-level entries (Quickstart, Installation, MCP
Clients, Configuration) all answered "how do I start using
agentgrep?" but occupied four separate rows in the sidebar with no
grouping. Folding them under a single "Getting Started" section
(with the quickstart prose as the section's landing page) mirrors
the libtmux pattern -- one collapsible sidebar entry per
discipline, expanding to its children when the reader navigates
in.

what:
- Move docs/quickstart.md to docs/getting-started/index.md; the
  page is the section landing. Title changes from Quickstart to
  Getting Started; the `(quickstart)=` label survives alongside a
  new `(getting-started)=` label so any external link or
  {ref}`quickstart` keeps resolving.
- Move docs/installation.md, docs/clients.md, docs/configuration.md
  to docs/getting-started/ alongside the landing page.
- docs/getting-started/index.md gains a hidden toctree listing
  installation, clients, configuration as children. The two
  {doc}`library/tutorial` and {doc}`mcp/tools` "next steps" links
  now use the `../` prefix since the page sits one level deeper.
- docs/index.md root toctree: replace the four flat entries with
  one entry pointing at getting-started/index. Grid-item-card
  🔗 paths updated for Quickstart, Client Setup, and
  Configuration cards.
- docs/redirects.txt: four new redirects so external bookmarks for
  /quickstart/, /installation/, /clients/, and /configuration/
  land on the new pages.

The sidebar now reads:

  Getting Started      (expands -> Installation / MCP Clients / Configuration)
  CLI                  (expands -> ui / search / find)
  Library              (expands -> tutorial / how-to / reference / examples)
  MCP                  (expands -> tools / resources / prompts)
  API Reference        (expands -> agentgrep / agentgrep.ui / agentgrep.mcp)
  Development          (expands -> Benchmark harness / Storage catalogue)
  History
  GitHub
…e comments

why: Furo's Pygments style renders comment tokens (.c / .c1 /
.cm / etc.) in italic. Every code block with a `# ...` comment
triggers a network fetch for IBM Plex Mono 400 italic on first
encounter because the gp-sphinx DEFAULT_SPHINX_FONT_PRELOAD list
only covers Mono 400/700 normal and Sans 400 italic. The browser
shows the face loaded from network on each page until cached.

what:
- Extend conf["sphinx_font_preload"] in docs/conf.py with the
  ("IBM Plex Mono", 400, "italic") tuple so the
  ibm-plex-mono-latin-400-italic.woff2 face renders as a
  <link rel="preload" as="font" crossorigin> tag in the page
  head alongside the existing mono normal preloads. Comment
  glyphs in syntax-highlighted code paint without a font swap.
- Comment in conf.py points at the upstream defaults so a future
  reader can see why this single tuple is project-overridden
  rather than upstream-fixed.
@tony tony force-pushed the benchmark-script branch from 8bfedfd to 0a7a9d9 Compare May 23, 2026 11:23
@tony tony merged commit 7a2b526 into master May 23, 2026
3 checks passed
@tony tony deleted the benchmark-script branch May 23, 2026 11:31
tony added a commit that referenced this pull request May 23, 2026
why: Regression #9 (--output bad-path traceback at end of run)
had no test. The pre-flight checks in cmd_run that call
typer.Exit(code=2) before any git checkout could be removed
silently.

what:
- test_main_exits_2_when_output_parent_missing: calls
  main(["run", "--output", "/nonexistent/dir/x.md", ...]) and
  asserts rc == 2. The pre-flight fires at step 5 of cmd_run,
  before the dirty-tree check or any git interaction.
- test_main_exits_2_when_output_is_directory: passes tmp_path
  (an existing directory) as --output and asserts rc == 2.
tony added a commit that referenced this pull request May 23, 2026
Follow-up to #9's code review. Wires the previously-unreachable
`sync_fail` Status literal so a failing `uv sync` between commits
is recorded and surfaced rather than silently absorbed, and locks
down every prior bug-fix surface with regression tests.

- **sync_fail now reachable.** `_maybe_sync` returns the
  `CompletedProcess` so `_run_one_commit` can inspect `returncode`.
  Non-zero → every bench for that commit is recorded as
  `status="sync_fail"` with the truncated stderr (or a "sync
  failed" fallback when the process produces no output), and the
  run moves to the next commit without benchmarking a broken venv.

- **Docstring path.** Module docstring pointed at the pre-move
  `docs/library/benchmark.md`; updated to `docs/dev/benchmark.md`.

- **Regression tests.** Fifteen new parametrized cases covering
  validation rejection, exit-code propagation, bad-ref translation,
  aggregate-column uniqueness, config-error surfacing, pydantic
  bound enforcement, cli-override routing, and output pre-flight.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant