scripts(benchmark[fix]) Surface uv sync failures + docstring path by tony · Pull Request #10 · tony/agentgrep

tony · 2026-05-23T11:36:42Z

Summary

Two code-review follow-ups from #9, plus regression tests for all 9 bug-fix surfaces that were previously only validated manually.

Fixes

sync_fail Status was dead code. _maybe_sync now returns the CompletedProcess; _run_one_commit records every bench for the commit as status="sync_fail" (with truncated stderr) and returns early when sync exits non-zero.
Stale docstring path. Module docstring pointed at docs/library/benchmark.md; updated to docs/dev/benchmark.md.

Regression tests (15 new, 26 → 41 total)

Regression	Test
`_parse_percentile_labels` / `_select_bench_names` reject bad input	2 parametrized cases
`main()` exit-code propagation for `typer.Exit` paths	2 parametrized cases
`_select_targets` converts `CalledProcessError` → `BadParameter`	1 monkeypatched case
Renderers produce no duplicate `max` column	2 (markdown + rich)
`load_config` rejects malformed TOML, extra keys, missing fields	3 parametrized cases
`load_config` rejects `runs=0` via `Field(ge=1)`	1 case
`load_config` rejects invalid `cli_overrides` (warmup, timeout)	2 parametrized cases
`main()` exits 2 on bad `--output` (missing parent, is-directory)	2 cases

Test plan

uv run ruff check . passes
uv run ruff format --check . passes
uv run ty check passes
uv run pytest --reruns 0 passes (287 tests total, 41 in benchmark suite)
just build-docs passes
Each of the 8 test commits ran the full pre-commit gate independently

why: Two follow-ups from the code review on #9. 1. The `Status` literal advertised a `sync_fail` value that no code path could produce. `_maybe_sync` called subprocess.run with check=False and discarded the result, so a failing `uv sync` was silently absorbed and benches ran against a half-resolved venv. The output never showed a sync_fail row even though the type said it could. 2. The module docstring directed readers to `docs/library/benchmark.md`, but the page lives at `docs/dev/benchmark.md` after the move-to-Project decision. what: - _maybe_sync returns subprocess.CompletedProcess[str] | None so the caller can inspect returncode. None signals "sync disabled" (empty sync_command); a CompletedProcess with a non-zero returncode signals failure. - _run_one_commit checks the sync result and, on non-zero, records every bench for the commit as status="sync_fail" with the truncated stderr embedded in the error field, then returns early to avoid running benches on a broken venv. A notify() line surfaces the failure on stderr too. - Module docstring's docs reference now points at docs/dev/benchmark.md.

…names reject bad input why: Regression #1 (typer.MissingParameter → AttributeError) had no test coverage; any future change to the except clause in main() could silently reintroduce the crash. what: - Add ValidationRejectCase NamedTuple with two parametrized cases: bad-percentile-label ("min,wat,p99") and unknown-bench-name ("bogus" against a Config with only grep). - Each asserts typer.BadParameter with a descriptive match string. - Import typer at the top of the test module (available via the dev dependency group, same as tomlkit for mcp_swap tests).

…aths why: Regression #2 (Click standalone_mode=False exit-code wiring) had no test — main() silently returned 0 for every typer.Exit until the fix, and nothing would catch a revert. what: - MainExitCodeCase NamedTuple with two parametrized cases: show-config happy path returns 0, and run with an empty TOML config (no [bench.*] entries) returns 2. - The "__EMPTY_TOML__" sentinel in argv is replaced at test time with a real tmp_path file, keeping the NamedTuple declarative.

…o BadParameter why: Regression #3 (bad-ref traceback) had no test; the try/except CalledProcessError → typer.BadParameter translation in _select_targets could be removed without CI noticing. what: - Monkeypatch benchmark.resolve_target to unconditionally raise CalledProcessError(128, ("git", "rev-parse", "bogus"), stderr= "fatal: ambiguous argument 'bogus'"). - Call _select_targets(target="bogus", ...) and assert it raises typer.BadParameter matching "git failed resolving target".

why: Regression #4 (render_rich added add_column("max") on top of the one from the percentile_labels loop) had no test; a stray column addition in the aggregate table section would go unnoticed. what: - _make_multi_commit_measurements helper constructs two ok Measurement objects for the same command name (different SHAs, with real samples) to trigger the aggregate-table code path. - test_render_markdown_no_duplicate_max_column: parses the header row and asserts "max" appears exactly once. - test_render_rich_no_duplicate_max_column: extracts the "Distribution across commits" section and asserts "max" appears exactly once in the header line of the aggregate table.

…keys why: Regression #5 (pydantic/TOML errors as uncaught tracebacks) had no test. The try/except in load_config that wraps ValidationError and TOMLDecodeError as BadParameter could be removed and CI would not notice. what: - ConfigErrorCase NamedTuple with three parametrized cases: malformed TOML ("[settings" without closing bracket), extra settings key (extra="forbid" rejects unknown keys), and missing required bench field (command is mandatory in BenchCommand). - Each writes toml_content to a tmp_path file, calls load_config(config_path=..., local_path=<missing>), and asserts typer.BadParameter with a match string.

…constraint why: Regression #6 (--runs 0 deadlocked hyperfine) depended on Field(ge=1) + the cli_overrides routing through model_validate. Neither constraint had a test, so removing either would not break CI. what: - test_load_config_rejects_runs_zero_via_cli_overrides: writes a valid TOML (one bench), passes cli_overrides={"settings": {"runs": 0}}, asserts typer.BadParameter matching "greater than or equal to 1". Exercises both the pydantic bound and the cli_overrides-through-model_validate routing in one assertion.

… validator why: Regression #7 (CLI overrides bypassed pydantic validators via model_copy) had no test beyond the runs=0 case in commit 6. The warmup and timeout_seconds bounds were equally exposed. what: - CliOverrideRejectCase NamedTuple with two parametrized cases: warmup=-1 (ge=0 constraint) and timeout_seconds=0 (ge=1 constraint). - Each writes a valid TOML, passes the invalid cli_overrides dict, and asserts typer.BadParameter with the expected pydantic message fragment.

why: Regression #9 (--output bad-path traceback at end of run) had no test. The pre-flight checks in cmd_run that call typer.Exit(code=2) before any git checkout could be removed silently. what: - test_main_exits_2_when_output_parent_missing: calls main(["run", "--output", "/nonexistent/dir/x.md", ...]) and asserts rc == 2. The pre-flight fires at step 5 of cmd_run, before the dirty-tree check or any git interaction. - test_main_exits_2_when_output_is_directory: passes tmp_path (an existing directory) as --output and asserts rc == 2.

tony · 2026-05-23T12:10:39Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

why: stdlib imports belong at the module top per the namespace- import convention, not buried in a function body where they're harder to scan. what: - Add `import subprocess` alongside the other stdlib imports at the top of tests/test_benchmark_script.py. - Remove the local `import subprocess` from inside test_select_targets_converts_git_error_to_bad_parameter.

…ners why: Each regression-test section banner carried a `(was: <prior broken behavior>)` parenthetical describing a state that never shipped in a published release. Per AGENTS.md "Shipped vs. Branch-Internal Narrative", within-branch tactical decisions belong in commit messages, not shipped artifacts. what: - Remove eight `(was: ...)` lines from the section-divider comments in tests/test_benchmark_script.py. The section titles ("Regression: _parse_percentile_labels reject bad input", "Regression: main() exit-code propagation", etc.) already describe what the test verifies without the branch-history context.

why: AGENTS.md prescribes NumPy docstring style for all functions. _maybe_sync had a prose-only docstring despite having typed parameters and a meaningful return value. what: - Rewrite _maybe_sync's docstring with explicit Parameters and Returns sections using the NumPy underline convention.

…field why: When a sync command is killed by signal and produces no stderr or stdout, the error string ended with a dangling `: ` ("uv sync exited 1: "). The checkout_fail path guarded against this with `or "checkout failed"`; the sync_fail path did not. what: - Add `or "sync failed"` fallback after the stderr/stdout strip, matching the checkout_fail guard's shape.

tony temporarily deployed to docs May 23, 2026 11:36 — with GitHub Actions Inactive

tony added 8 commits May 23, 2026 06:51

tony temporarily deployed to docs May 23, 2026 11:58 — with GitHub Actions Inactive

tony added 3 commits May 23, 2026 07:12

tony temporarily deployed to docs May 23, 2026 12:15 — with GitHub Actions Inactive

tony had a problem deploying to docs May 23, 2026 12:15 — with GitHub Actions Failure

tony temporarily deployed to docs May 23, 2026 12:16 — with GitHub Actions Inactive

tony merged commit be402a8 into master May 23, 2026
3 checks passed

tony deleted the benchmark-followup branch May 23, 2026 12:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scripts(benchmark[fix]) Surface uv sync failures + docstring path#10

scripts(benchmark[fix]) Surface uv sync failures + docstring path#10
tony merged 13 commits into
masterfrom
benchmark-followup

tony commented May 23, 2026 •

edited

Loading

Uh oh!

tony commented May 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tony commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fixes

Regression tests (15 new, 26 → 41 total)

Test plan

Uh oh!

tony commented May 23, 2026

Code review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tony commented May 23, 2026 •

edited

Loading