Commit 3eaed16

feat(benchmarks): Add Claude UI benchmark harness (#427)
Add a local Claude UI benchmark harness for measuring simulator UI automation behavior against the development MCP server. The harness runs deterministic app tasks from Markdown prompts, creates fresh temporary simulators, writes isolated MCP configuration, parses Claude Code transcripts, and reports tool counts, wall-clock timing, failures, and sequence drift. This gives us a repeatable way to catch regressions in agent efficiency and UI automation behavior across Weather, Contacts, and Reminders. The benchmark setup also keeps simulator boot/open and first-run prompt cleanup outside the measured Claude task, so baselines reflect the actual app work rather than transient Apple setup screens. Mutating UI actions now wait for settled post-action runtime snapshots so the next agent step receives stable refs.
1 parent 857954e commit 3eaed16

34 files changed

Lines changed: 5546 additions & 47 deletions

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -4,6 +4,7 @@

 ### Added

+- Added `--from-result` to the Claude UI benchmark harness so existing `result.json` artifacts can be rendered as text or JSON without rerunning Claude.
 - Added `nextSteps` hint lines to MCP `structuredContent` and CLI `--output json` envelopes so agents can consume follow-up actions without scraping text. CLI JSON renders shell command lines; MCP structured content renders MCP tool-call hints. Structured result schemas that include `nextSteps` now use schema version 2; existing version 1 schema files remain available for current validators.
 - Added `snapshot_ui sinceScreenHash` / CLI `--since-screen-hash` so callers can skip full runtime snapshot output when the screen hash is unchanged.
 - Added `batch` for executing multiple AXe UI automation steps in one simulator session.
@@ -14,11 +15,19 @@

 ### Changed

+- Changed Claude UI benchmark suite runs to create a temporary simulator by default and delete only that harness-created simulator after the suite finishes.
+- Changed Claude UI benchmark exact tool sequence drift to warn by default, with `sequence.mode: fail` available for strict suites.
 - Successful mutating UI automation calls now always attempt to refresh the runtime snapshot after the action instead of preserving or patching cached switch state.
 - Runtime snapshot guidance no longer advertises synthetic sheet swipe targets for foreground sheets. Agents should use real sheet grabber expansion and real descendant scroll/list targets with `drag` instead of inferred app/window-root sheet swipes.

 ### Fixed

+- Fixed Claude UI benchmark preflight so transient malformed or still-loading UI snapshots no longer crash the harness or finish before app UI is observable.
+- Fixed Claude UI benchmark preflight so configured first-run dismissals require a concrete simulator ID, suite-provided simulator IDs are recorded in command logs, and preflight-launched apps are terminated after post-launch failures.
+- Fixed Claude UI benchmark config handling so invalid `failurePatterns` regexes and runtime-incompatible `sessionDefaults` fail before a suite starts and partial `allowedVariance` overrides preserve defaults for omitted metrics.
+- Fixed Claude UI benchmark temporary simulator cleanup so simulators created by the harness are deleted even when post-creation setup fails.
+- Fixed UI action snapshot refreshes so timeout while waiting for a settled post-action snapshot returns a recoverable warning instead of unstable element refs.
+- Fixed Claude UI benchmark suite runs so temporary simulators are applied through an isolated per-run MCP config instead of being overridden by repo or example-project config defaults.
 - Fixed simulator launch failures before simulator-name resolution so they are not reported as macOS launch failures.
 - Fixed CLI JSON output so simulator-name resolution failures return the structured error envelope instead of plain stderr.
 - Fixed accessibility hierarchy tips so UI automation guidance prefers runtime element refs over raw coordinate guessing.

benchmarks/claude-ui/README.md

Lines changed: 229 additions & 0 deletions
@@ -0,0 +1,229 @@
# Claude UI benchmark harness

Local/manual harness for running Claude Code against the development XcodeBuildMCP MCP server and auditing UI automation behavior.

The harness:

- reads a suite YAML file from `benchmarks/claude-ui/suites/`
- reads the referenced prompt Markdown file from disk and feeds it to `claude -p`
- creates, boots, waits for, and opens a fresh temporary simulator before Claude launches for each suite run by default
- writes an isolated per-run MCP workspace config with the suite defaults and temporary `simulatorId`
- generates a Claude MCP config pointing at `node build/cli.js mcp` with `XCODEBUILDMCP_CWD` set to that isolated workspace
- optionally preflights configured first-run prompts before Claude launches, outside the measured run
- deletes the temporary simulator at the end of the suite, best effort, using only the ID created by the harness
- writes artifacts under `out.nosync/claude-benchmarks/<suite>/<timestamp>/`
- runs the bundled `parse_claude_conversation.py` parser against Claude's stream JSONL
- audits tool counts, MCP calls, UI automation calls, wall clock, failures/stumbles, and expected tool sequence drift
- prints a structured per-suite report and (for `--all`) an aggregate summary
- optionally prints machine-readable JSON with `--json`
- can render an existing `result.json` or artifact directory with `--from-result` without rerunning Claude

This is intentionally not part of the normal test suite because it launches Claude and drives local simulators/apps.

## Commands

Build first, then run a suite:

```bash
npm run build
npx tsx benchmarks/claude-ui/run.ts --suite weather
```

Shortcut:

```bash
npm run bench:claude-ui -- --suite weather
```

Run every suite YAML:

```bash
npm run bench:claude-ui -- --all
```

Print machine-readable output from a new run:

```bash
npm run bench:claude-ui -- --suite reminders --json
```

Render an existing result without rerunning Claude:

```bash
npm run bench:claude-ui -- --from-result out.nosync/claude-benchmarks/reminders/20260522T130926Z
npm run bench:claude-ui -- --from-result out.nosync/claude-benchmarks/reminders/20260522T130926Z/result.json --json
```

New runs use the bundled parser at `benchmarks/claude-ui/parse_claude_conversation.py`. Pass `--parser /path/to/parse_claude_conversation.py` or set `CLAUDE_UI_BENCHMARK_PARSER` only when testing a different parser. `--from-result` does not need a parser because it only re-renders existing artifacts.

## Suite YAML shape

```yaml
name: weather
prompt: ../prompts/weather.md
workingDirectory: example_projects/Weather
sessionDefaults:
  projectPath: Weather.xcodeproj
  scheme: Weather
  simulatorName: iPhone 17 Pro Max
temporarySimulator: true
firstRunPromptDismissals:
  labels:
    - Continue
    - Not Now
  timeoutSeconds: 12
baseline:
  totalToolCalls: 19
  mcpToolCalls: 18
  uiAutomationCalls: 16
  wallClockSeconds: 125
  tools:
    snapshot_ui: 1
    tap: 9
allowedVariance:
  totalToolCalls: 2
  mcpToolCalls: 2
  uiAutomationCalls: 2
  wallClockSeconds: 45
  toolCalls: 2
expectedToolSequence:
  - session_show_defaults
  - build_run_sim
  - snapshot_ui
sequence:
  mode: warn
failurePatterns:
  - STALE_ELEMENT_REF
  - SNAPSHOT_MISSING
  - WAIT_TIMEOUT
```

Variance is an upper bound: lower tool counts or faster runs are accepted, while values above `baseline + allowedVariance` fail. Defaults are `totalToolCalls: 0`, `mcpToolCalls: 0`, `uiAutomationCalls: 0`, `toolCalls: 0`, and `wallClockSeconds: 30`.
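
The upper-bound rule and the default merge can be sketched as follows. This is a minimal illustration, not the harness's code: the metric keys and default values come from the text above, while `resolveVariance`, `checkMetric`, and the `Variance` type are hypothetical names.

```typescript
// Hypothetical types and helpers; only the metric keys and defaults are from the README.
type Variance = {
  totalToolCalls: number;
  mcpToolCalls: number;
  uiAutomationCalls: number;
  toolCalls: number;
  wallClockSeconds: number;
};

const DEFAULT_VARIANCE: Variance = {
  totalToolCalls: 0,
  mcpToolCalls: 0,
  uiAutomationCalls: 0,
  toolCalls: 0,
  wallClockSeconds: 30,
};

// A partial override keeps the default for every metric the suite omits.
function resolveVariance(overrides: Partial<Variance> = {}): Variance {
  return { ...DEFAULT_VARIANCE, ...overrides };
}

// Variance is an upper bound only: anything at or below baseline + variance passes.
function checkMetric(actual: number, baseline: number, variance: number): "PASS" | "FAIL" {
  return actual <= baseline + variance ? "PASS" : "FAIL";
}
```

With the weather baseline of 19 total tool calls and a variance of 2, an actual count of 13 or 21 would pass under this sketch and 22 would fail.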

Tool sequence drift is warning-only by default (`sequence.mode: warn`) because real Claude runs can choose equally valid UI paths. Use `sequence.mode: fail` only for suites where exact MCP call order is part of the contract.

`sessionDefaults` are written to a harness-owned config at `<run>/mcp-workspace/.xcodebuildmcp/config.yaml`. The generated Claude MCP config sets `XCODEBUILDMCP_CWD` to `<run>/mcp-workspace`, so the dev MCP server reads only the benchmark config instead of any repo or example-project `.xcodebuildmcp/config.yaml`. Unknown keys fail fast. Relative path defaults such as `projectPath`, `workspacePath`, and `derivedDataPath` are resolved against the suite `workingDirectory` before being written because the MCP server cwd is the isolated workspace.

## Temporary simulator lifecycle

By default, each suite creates a fresh simulator before Claude launches. The harness uses `sessionDefaults.simulatorName` as the `simctl create` device type name, captures the returned simulator ID, boots that simulator, waits for `simctl bootstatus <id> -b`, opens Simulator.app to that device, applies a short UI-readiness delay, and writes the simulator ID as `sessionDefaults.simulatorId` in the isolated MCP workspace config. This makes Claude and the dev MCP server target a visible, booted, isolated simulator instead of reusing a previous run's state or spending benchmark calls on simulator boot/open setup.
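
The setup order described above could be laid out as a command plan like the one below. It is a sketch under assumptions: the helper names and the temporary simulator name are hypothetical, and the `open -a Simulator --args -CurrentDeviceUDID <id>` form is one common way to open Simulator.app on a specific device, not necessarily what the harness runs. Nothing is executed here; the functions only build argument vectors.

```typescript
// Hypothetical helpers that describe each lifecycle step as an argv array.
type Command = { step: string; argv: string[] };

// `simctl create <name> <device type>` prints the new simulator ID on stdout.
function planCreate(temporaryName: string, deviceTypeName: string): Command {
  return { step: "create", argv: ["xcrun", "simctl", "create", temporaryName, deviceTypeName] };
}

// Steps that need the simulator ID captured from the create step.
function planAfterCreate(simulatorId: string): Command[] {
  return [
    { step: "boot", argv: ["xcrun", "simctl", "boot", simulatorId] },
    { step: "bootstatus", argv: ["xcrun", "simctl", "bootstatus", simulatorId, "-b"] },
    // Assumption: Simulator.app is pointed at the device through -CurrentDeviceUDID.
    { step: "open", argv: ["open", "-a", "Simulator", "--args", "-CurrentDeviceUDID", simulatorId] },
  ];
}
```

Keeping the plan as data makes it straightforward to write each step to a lifecycle log before running it.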

Simulator setup is deliberately outside the benchmark measurement boundary. The measured `wallClockSeconds` starts when the harness spawns Claude and stops when Claude exits. Tool-call counts are parsed only from Claude's JSONL transcript. The result JSON still records temporary simulator `setupDurationSeconds` under `run.temporarySimulator` so setup cost is visible without being compared against Claude task-efficiency baselines.

Config contract:

- Omit `temporarySimulator` for the default behavior: create and later delete a temporary simulator.
- Set `temporarySimulator: false` to opt out and use the suite/project defaults as-is.
- Set `sessionDefaults.simulatorId` to use an existing simulator. In this case the harness does not create or delete a simulator.
- Do not set both `temporarySimulator: true` and `sessionDefaults.simulatorId`; the harness fails fast because deleting a user-provided simulator would be unsafe.
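
The contract above amounts to a small decision function. The sketch below is illustrative only: `resolveSimulatorPlan`, `SuiteSimulatorConfig`, and the plan names are hypothetical, and the error message is not the harness's actual wording.

```typescript
// Hypothetical shape covering only the two keys the contract mentions.
interface SuiteSimulatorConfig {
  temporarySimulator?: boolean;
  sessionDefaults?: { simulatorId?: string };
}

type SimulatorPlan = "create-temporary" | "use-existing" | "use-defaults";

function resolveSimulatorPlan(suite: SuiteSimulatorConfig): SimulatorPlan {
  const simulatorId = suite.sessionDefaults?.simulatorId;
  // Fail fast: the harness must never delete a simulator the user supplied.
  if (suite.temporarySimulator === true && simulatorId !== undefined) {
    throw new Error("temporarySimulator: true conflicts with sessionDefaults.simulatorId");
  }
  if (simulatorId !== undefined) return "use-existing"; // no create, no delete
  if (suite.temporarySimulator === false) return "use-defaults"; // explicit opt-out
  return "create-temporary"; // default when temporarySimulator is omitted or true
}
```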

Temporary simulator setup is required when enabled. If creation, boot, bootstatus, or Simulator.app opening fails, the suite fails loudly before Claude starts. Deletion is best effort in a `finally` block: failures are logged but do not mask the benchmark result or original error.
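
A best-effort `finally` cleanup of this kind might look like the sketch below. It assumes synchronous callbacks for brevity, and `withTemporarySimulator` and its parameters are hypothetical names rather than the harness's API.

```typescript
// Hypothetical wrapper: create is required, deletion is best effort.
function withTemporarySimulator<T>(
  create: () => string,
  run: (simulatorId: string) => T,
  destroy: (simulatorId: string) => void,
  log: (message: string) => void,
): T {
  // A creation failure propagates immediately; there is nothing to delete yet.
  const simulatorId = create();
  try {
    // Post-creation setup and the benchmark itself both happen inside the try,
    // so the simulator is deleted even when setup fails after creation.
    return run(simulatorId);
  } finally {
    try {
      destroy(simulatorId);
    } catch (error) {
      // Log and swallow: a cleanup failure must not replace the result or the original error.
      log(`delete ${simulatorId} failed: ${String(error)}`);
    }
  }
}
```

The inner `try`/`catch` is what keeps a deletion failure from overriding whatever `run` returned or threw.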

`firstRunPromptDismissals` is an optional suite-level preflight for fresh simulator noise such as Apple first-run sheets. When configured, the harness launches `sessionDefaults.bundleId` before Claude starts, retries through transient UI-inspection failures, looks for any listed button labels, taps matching labels with AXe, then terminates the app. If the prompt state cannot be inspected or dismissed before `timeoutSeconds`, the suite fails before Claude starts. These preflight interactions are logged in `simulator-lifecycle.log`, but they are outside Claude's wall-clock measurement and do not appear in tool-call counts. Keep the labels generic and non-destructive, for example `Continue`, `Not Now`, or `OK`; do not configure sign-in, sync enablement, Settings, destructive, or data-deletion actions.

Lifecycle details are written to `simulator-lifecycle.log`, including the `create`, `boot`, `bootstatus`, `open`, readiness delay, optional first-run prompt preflight, and deletion steps. `claude-command.log` also records the simulator ID used for the run. The terminal report shows the temporary simulator ID plus setup duration as `setup ... before Claude` when a temporary simulator is used.

## Terminal report

Each suite renders as a structured report with a status banner, aligned metric and tool tables, a failures/stumbles section (only when non-zero), and a sequence diff. When run with `--all`, an aggregate summary follows the per-suite reports.

### Single suite

```text
────────────────────────────────────────────────────────────────────────
PASS  weather  1m 38.6s
suite      benchmarks/claude-ui/suites/weather.yml
artifacts  out.nosync/claude-benchmarks/weather/20260522T214044Z
exit       claude=0 parser=0

Metrics
METRIC             ACTUAL  BASELINE  VARIANCE  DELTA   STATUS
totalToolCalls     13      19        +2        −6      PASS
mcpToolCalls       12      18        +2        −6      PASS
uiAutomationCalls  10      16        +2        −6      PASS
wallClockSeconds   98.62   125.00    +45.00    −26.38  PASS

Tool calls (baseline-tracked)
TOOL                   ACTUAL  BASELINE  DELTA  STATUS
session_show_defaults  1       1         0      PASS
build_run_sim          1       1         0      PASS
snapshot_ui            1       1         0      PASS
tap                    6       9         −3     PASS
batch                  1       1         0      PASS

PASS failures/stumbles: 0
```

### Sequence drift

When the tool sequence drifts, the report includes unified-diff style hunks with expected/actual index columns. Drift is warning-only by default, so the overall status stays `WARN` rather than `FAIL`:

```text
WARN tool sequence (warn): drift: 4 missing, 0 additional
@@ expected[8..15] actual[8..11] @@
 8   8  tap
 9   9  tap
10   −  tap
11  10  swipe
12  11  tap
13   −  swipe
14   −  tap
15   −  tap
```

`−` lines are expected calls Claude skipped; `+` lines are calls Claude made that were not expected. Dim lines are surrounding context.
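
One way to arrive at "missing" and "additional" counts like these is a longest-common-subsequence comparison. The README does not say which diff algorithm the harness uses, so treat the sketch below as an assumption; `countDrift` is a hypothetical name.

```typescript
// Hypothetical drift counter based on the longest common subsequence (LCS).
function countDrift(expected: string[], actual: string[]): { missing: number; additional: number } {
  // lcs[i][j] holds the LCS length of expected[i..] and actual[j..].
  const lcs: number[][] = Array.from({ length: expected.length + 1 }, () =>
    new Array<number>(actual.length + 1).fill(0),
  );
  for (let i = expected.length - 1; i >= 0; i--) {
    for (let j = actual.length - 1; j >= 0; j--) {
      lcs[i][j] =
        expected[i] === actual[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  const matched = lcs[0][0];
  // Expected calls with no match were skipped; actual calls with no match were extra.
  return { missing: expected.length - matched, additional: actual.length - matched };
}
```

Applied to the hunk above (eight expected calls, four actual calls that all match in order), this yields 4 missing and 0 additional.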

### Failures and inspect hints

When `failures/stumbles` is non-zero the report lists the first few tool failures and pattern matches, and surfaces an `Inspect` block with the relevant artifact paths:

```text
FAIL failures/stumbles: 1
• tool failures: 1
  boot_sim @ line 9: Boot failed: device not found

Inspect
result.json  out.nosync/claude-benchmarks/reminders/20260522T213905Z/result.json
transcript   out.nosync/claude-benchmarks/reminders/20260522T213905Z/claude.jsonl
stderr       out.nosync/claude-benchmarks/reminders/20260522T213905Z/claude.stderr
run dir      out.nosync/claude-benchmarks/reminders/20260522T213905Z
```

### Aggregate summary

After `--all` (or multi-result `--from-result`) the harness appends:

```text
════════════════════════════════════════════════════════════════════════
Claude UI Benchmarks · Summary
════════════════════════════════════════════════════════════════════════
Suites: 3 total · 2 passed · 1 failed · 2 sequence warnings
Duration: total 4m 49.8s · slowest reminders (1m 39.8s)
Artifacts: out.nosync/claude-benchmarks/

! WARN  weather    1m 38.6s  sequence warn: 4m/0a
✗ FAIL  reminders  1m 39.8s  1 stumble · sequence warn: 7m/4a
! WARN  contacts   1m 31.4s  sequence warn: 2m/2a
════════════════════════════════════════════════════════════════════════
```

`Nm/Ka` denotes "N missing / K additional" calls vs. `expectedToolSequence`.

The renderer auto-detects TTY and adds ANSI color when stdout is a terminal and `NO_COLOR` is unset. Plain-text output (e.g. when piping to a file or under `NO_COLOR=1`) carries the same information without color codes.
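
The color rule can be expressed as a small predicate. This is a sketch of the stated condition only; `shouldUseColor` is a hypothetical name, and the inputs are passed in explicitly so the rule stays easy to test.

```typescript
// Hypothetical predicate: color only when stdout is a terminal and NO_COLOR is unset.
function shouldUseColor(isTTY: boolean, env: Record<string, string | undefined>): boolean {
  return isTTY && env.NO_COLOR === undefined;
}
```

In a Node process a caller might pass `Boolean(process.stdout.isTTY)` and `process.env`.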

`--json` output is unchanged by this renderer: the JSON payload remains a single `BenchmarkResult` for `--suite` / single-result `--from-result`, and an array for `--all` / multi-result `--from-result`.
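
Because the payload is either one object or an array, a consumer may want to normalize it before processing. The sketch below assumes nothing about the fields of `BenchmarkResult`, so it works with `unknown` values; `parseBenchmarkJson` is a hypothetical helper, not part of the harness.

```typescript
// Hypothetical consumer-side helper: always returns a list of results.
function parseBenchmarkJson(text: string): unknown[] {
  const payload: unknown = JSON.parse(text);
  // --all and multi-result --from-result emit an array; single runs emit one object.
  return Array.isArray(payload) ? payload : [payload];
}
```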

## Artifacts

Each run writes:

- `prompt.md`: exact suite prompt fed to Claude
- `mcp-config.json`: generated Claude MCP config
- `mcp-workspace/.xcodebuildmcp/config.yaml`: isolated MCP server config with effective suite defaults
- `claude.jsonl`: Claude stream JSON output
- `claude.stderr`: Claude stderr
- `claude-command.log`: command, cwd, simulator ID, exit status, wall clock
- `simulator-lifecycle.log`: temporary simulator create, boot, bootstatus, open, readiness, deletion commands, and simulator ID
- `parsed/`: files written by `parse_claude_conversation.py`
- `parse.log` / `parse.log.stderr`: parser output
- `result.json`: full benchmark result
