Skip to content

scripts(benchmark[fix]) Surface uv sync failures + docstring path#10

Merged
tony merged 13 commits into
masterfrom
benchmark-followup
May 23, 2026
Merged

scripts(benchmark[fix]) Surface uv sync failures + docstring path#10
tony merged 13 commits into
masterfrom
benchmark-followup

Conversation

@tony
Copy link
Copy Markdown
Owner

@tony tony commented May 23, 2026

Summary

Two code-review follow-ups from #9, plus regression tests for all 9 bug-fix surfaces that were previously only validated manually.

Fixes

  • sync_fail Status was dead code. _maybe_sync now returns the CompletedProcess; _run_one_commit records every bench for the commit as status="sync_fail" (with truncated stderr) and returns early when sync exits non-zero.
  • Stale docstring path. Module docstring pointed at docs/library/benchmark.md; updated to docs/dev/benchmark.md.

Regression tests (15 new, 26 → 41 total)

Regression Test
_parse_percentile_labels / _select_bench_names reject bad input 2 parametrized cases
main() exit-code propagation for typer.Exit paths 2 parametrized cases
_select_targets converts CalledProcessErrorBadParameter 1 monkeypatched case
Renderers produce no duplicate max column 2 (markdown + rich)
load_config rejects malformed TOML, extra keys, missing fields 3 parametrized cases
load_config rejects runs=0 via Field(ge=1) 1 case
load_config rejects invalid cli_overrides (warmup, timeout) 2 parametrized cases
main() exits 2 on bad --output (missing parent, is-directory) 2 cases

Test plan

  • uv run ruff check . passes
  • uv run ruff format --check . passes
  • uv run ty check passes
  • uv run pytest --reruns 0 passes (287 tests total, 41 in benchmark suite)
  • just build-docs passes
  • Each of the 8 test commits ran the full pre-commit gate independently

why: Two follow-ups from the code review on #9.

1. The `Status` literal advertised a `sync_fail` value that no
   code path could produce. `_maybe_sync` called subprocess.run
   with check=False and discarded the result, so a failing
   `uv sync` was silently absorbed and benches ran against a
   half-resolved venv. The output never showed a sync_fail row
   even though the type said it could.

2. The module docstring directed readers to
   `docs/library/benchmark.md`, but the page lives at
   `docs/dev/benchmark.md` after the move-to-Project decision.

what:
- _maybe_sync returns subprocess.CompletedProcess[str] | None
  so the caller can inspect returncode. None signals "sync
  disabled" (empty sync_command); a CompletedProcess with a
  non-zero returncode signals failure.
- _run_one_commit checks the sync result and, on non-zero,
  records every bench for the commit as status="sync_fail"
  with the truncated stderr embedded in the error field, then
  returns early to avoid running benches on a broken venv. A
  notify() line surfaces the failure on stderr too.
- Module docstring's docs reference now points at
  docs/dev/benchmark.md.
tony added 8 commits May 23, 2026 06:51
…names reject bad input

why: Regression #1 (typer.MissingParameter → AttributeError) had
no test coverage; any future change to the except clause in
main() could silently reintroduce the crash.

what:
- Add ValidationRejectCase NamedTuple with two parametrized
  cases: bad-percentile-label ("min,wat,p99") and
  unknown-bench-name ("bogus" against a Config with only grep).
- Each asserts typer.BadParameter with a descriptive match
  string.
- Import typer at the top of the test module (available via the
  dev dependency group, same as tomlkit for mcp_swap tests).
…aths

why: Regression #2 (Click standalone_mode=False exit-code wiring)
had no test — main() silently returned 0 for every typer.Exit
until the fix, and nothing would catch a revert.

what:
- MainExitCodeCase NamedTuple with two parametrized cases:
  show-config happy path returns 0, and run with an empty TOML
  config (no [bench.*] entries) returns 2.
- The "__EMPTY_TOML__" sentinel in argv is replaced at test time
  with a real tmp_path file, keeping the NamedTuple declarative.
…o BadParameter

why: Regression #3 (bad-ref traceback) had no test; the
try/except CalledProcessError → typer.BadParameter translation
in _select_targets could be removed without CI noticing.

what:
- Monkeypatch benchmark.resolve_target to unconditionally raise
  CalledProcessError(128, ("git", "rev-parse", "bogus"), stderr=
  "fatal: ambiguous argument 'bogus'").
- Call _select_targets(target="bogus", ...) and assert it raises
  typer.BadParameter matching "git failed resolving target".
why: Regression #4 (render_rich added add_column("max") on top of
the one from the percentile_labels loop) had no test; a stray
column addition in the aggregate table section would go unnoticed.

what:
- _make_multi_commit_measurements helper constructs two ok
  Measurement objects for the same command name (different SHAs,
  with real samples) to trigger the aggregate-table code path.
- test_render_markdown_no_duplicate_max_column: parses the header
  row and asserts "max" appears exactly once.
- test_render_rich_no_duplicate_max_column: extracts the
  "Distribution across commits" section and asserts "max" appears
  exactly once in the header line of the aggregate table.
…keys

why: Regression #5 (pydantic/TOML errors as uncaught tracebacks)
had no test. The try/except in load_config that wraps
ValidationError and TOMLDecodeError as BadParameter could be
removed and CI would not notice.

what:
- ConfigErrorCase NamedTuple with three parametrized cases:
  malformed TOML ("[settings" without closing bracket),
  extra settings key (extra="forbid" rejects unknown keys), and
  missing required bench field (command is mandatory in
  BenchCommand).
- Each writes toml_content to a tmp_path file, calls
  load_config(config_path=..., local_path=<missing>), and
  asserts typer.BadParameter with a match string.
…constraint

why: Regression #6 (--runs 0 deadlocked hyperfine) depended on
Field(ge=1) + the cli_overrides routing through model_validate.
Neither constraint had a test, so removing either would not
break CI.

what:
- test_load_config_rejects_runs_zero_via_cli_overrides: writes a
  valid TOML (one bench), passes cli_overrides={"settings":
  {"runs": 0}}, asserts typer.BadParameter matching "greater
  than or equal to 1". Exercises both the pydantic bound and the
  cli_overrides-through-model_validate routing in one assertion.
… validator

why: Regression #7 (CLI overrides bypassed pydantic validators
via model_copy) had no test beyond the runs=0 case in commit 6.
The warmup and timeout_seconds bounds were equally exposed.

what:
- CliOverrideRejectCase NamedTuple with two parametrized cases:
  warmup=-1 (ge=0 constraint) and timeout_seconds=0 (ge=1
  constraint).
- Each writes a valid TOML, passes the invalid cli_overrides
  dict, and asserts typer.BadParameter with the expected
  pydantic message fragment.
why: Regression #9 (--output bad-path traceback at end of run)
had no test. The pre-flight checks in cmd_run that call
typer.Exit(code=2) before any git checkout could be removed
silently.

what:
- test_main_exits_2_when_output_parent_missing: calls
  main(["run", "--output", "/nonexistent/dir/x.md", ...]) and
  asserts rc == 2. The pre-flight fires at step 5 of cmd_run,
  before the dirty-tree check or any git interaction.
- test_main_exits_2_when_output_is_directory: passes tmp_path
  (an existing directory) as --output and asserts rc == 2.
@tony
Copy link
Copy Markdown
Owner Author

tony commented May 23, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

tony added 3 commits May 23, 2026 07:12
why: stdlib imports belong at the module top per the namespace-
import convention, not buried in a function body where they're
harder to scan.

what:
- Add `import subprocess` alongside the other stdlib imports at
  the top of tests/test_benchmark_script.py.
- Remove the local `import subprocess` from inside
  test_select_targets_converts_git_error_to_bad_parameter.
…ners

why: Each regression-test section banner carried a
`(was: <prior broken behavior>)` parenthetical describing a state
that never shipped in a published release. Per AGENTS.md
"Shipped vs. Branch-Internal Narrative", within-branch tactical
decisions belong in commit messages, not shipped artifacts.

what:
- Remove eight `(was: ...)` lines from the section-divider
  comments in tests/test_benchmark_script.py. The section titles
  ("Regression: _parse_percentile_labels reject bad input",
  "Regression: main() exit-code propagation", etc.) already
  describe what the test verifies without the branch-history
  context.
why: AGENTS.md prescribes NumPy docstring style for all functions.
_maybe_sync had a prose-only docstring despite having typed
parameters and a meaningful return value.

what:
- Rewrite _maybe_sync's docstring with explicit Parameters and
  Returns sections using the NumPy underline convention.
…field

why: When a sync command is killed by signal and produces no
stderr or stdout, the error string ended with a dangling `: `
("uv sync exited 1: "). The checkout_fail path guarded against
this with `or "checkout failed"`; the sync_fail path did not.

what:
- Add `or "sync failed"` fallback after the stderr/stdout
  strip, matching the checkout_fail guard's shape.
@tony tony merged commit be402a8 into master May 23, 2026
3 checks passed
@tony tony deleted the benchmark-followup branch May 23, 2026 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant