| 1 | +# Claude UI benchmark harness |
| 2 | + |
| 3 | +Local/manual harness for running Claude Code against the development XcodeBuildMCP MCP server and auditing UI automation behavior. |
| 4 | + |
| 5 | +The harness: |
| 6 | + |
| 7 | +- reads a suite YAML file from `benchmarks/claude-ui/suites/` |
| 8 | +- reads the referenced prompt Markdown file from disk and feeds it to `claude -p` |
| 9 | +- by default, creates, boots, waits for, and opens a fresh temporary simulator for each suite run before Claude launches |
| 10 | +- writes an isolated per-run MCP workspace config with the suite defaults and temporary `simulatorId` |
| 11 | +- generates a Claude MCP config pointing at `node build/cli.js mcp` with `XCODEBUILDMCP_CWD` set to that isolated workspace |
| 12 | +- optionally preflights configured first-run prompts before Claude launches, outside the measured run |
| 13 | +- deletes the temporary simulator at the end of the suite on a best-effort basis, targeting only the simulator ID the harness created |
| 14 | +- writes artifacts under `out.nosync/claude-benchmarks/<suite>/<timestamp>/` |
| 15 | +- runs `/Volumes/Developer/parse_claude_conversation.py` against Claude's stream JSONL |
| 16 | +- audits tool counts, MCP calls, UI automation calls, wall clock, failures/stumbles, and expected tool sequence drift |
| 17 | +- prints a structured per-suite report and (for `--all`) an aggregate summary |
| 18 | +- optionally prints machine-readable JSON with `--json` |
| 19 | +- can render an existing `result.json` or artifact directory with `--from-result` without rerunning Claude |
| 20 | + |
| 21 | +This is intentionally not part of the normal test suite because it launches Claude and drives local simulators/apps. |
| 22 | + |
| 23 | +## Commands |
| 24 | + |
| 25 | +Build first, then run a suite: |
| 26 | + |
| 27 | +```bash |
| 28 | +npm run build |
| 29 | +npx tsx benchmarks/claude-ui/run.ts --suite weather |
| 30 | +``` |
| 31 | + |
| 32 | +Shortcut: |
| 33 | + |
| 34 | +```bash |
| 35 | +npm run bench:claude-ui -- --suite weather |
| 36 | +``` |
| 37 | + |
| 38 | +Run every suite YAML: |
| 39 | + |
| 40 | +```bash |
| 41 | +npm run bench:claude-ui -- --all |
| 42 | +``` |
| 43 | + |
| 44 | +Print machine-readable output from a new run: |
| 45 | + |
| 46 | +```bash |
| 47 | +npm run bench:claude-ui -- --suite reminders --json |
| 48 | +``` |
| 49 | + |
| 50 | +Render an existing result without rerunning Claude: |
| 51 | + |
| 52 | +```bash |
| 53 | +npm run bench:claude-ui -- --from-result out.nosync/claude-benchmarks/reminders/20260522T130926Z |
| 54 | +npm run bench:claude-ui -- --from-result out.nosync/claude-benchmarks/reminders/20260522T130926Z/result.json --json |
| 55 | +``` |
| 56 | + |
| 57 | +## Suite YAML shape |
| 58 | + |
| 59 | +```yaml |
| 60 | +name: weather |
| 61 | +prompt: ../prompts/weather.md |
| 62 | +workingDirectory: example_projects/Weather |
| 63 | +sessionDefaults: |
| 64 | +  projectPath: Weather.xcodeproj |
| 65 | +  scheme: Weather |
| 66 | +  simulatorName: iPhone 17 Pro Max |
| 67 | +temporarySimulator: true |
| 68 | +firstRunPromptDismissals: |
| 69 | +  labels: |
| 70 | +    - Continue |
| 71 | +    - Not Now |
| 72 | +  timeoutSeconds: 12 |
| 73 | +baseline: |
| 74 | +  totalToolCalls: 19 |
| 75 | +  mcpToolCalls: 18 |
| 76 | +  uiAutomationCalls: 16 |
| 77 | +  wallClockSeconds: 125 |
| 78 | +  tools: |
| 79 | +    snapshot_ui: 1 |
| 80 | +    tap: 9 |
| 81 | +allowedVariance: |
| 82 | +  totalToolCalls: 2 |
| 83 | +  mcpToolCalls: 2 |
| 84 | +  uiAutomationCalls: 2 |
| 85 | +  wallClockSeconds: 45 |
| 86 | +  toolCalls: 2 |
| 87 | +expectedToolSequence: |
| 88 | +  - session_show_defaults |
| 89 | +  - build_run_sim |
| 90 | +  - snapshot_ui |
| 91 | +sequence: |
| 92 | +  mode: warn |
| 93 | +failurePatterns: |
| 94 | +  - STALE_ELEMENT_REF |
| 95 | +  - SNAPSHOT_MISSING |
| 96 | +  - WAIT_TIMEOUT |
| 97 | +``` |
| 98 | + |
| 99 | +Variance is an upper bound: lower tool counts or faster runs are accepted, while values above `baseline + allowedVariance` fail. When a key is omitted, `allowedVariance` defaults to `totalToolCalls: 0`, `mcpToolCalls: 0`, `uiAutomationCalls: 0`, `toolCalls: 0`, and `wallClockSeconds: 30`. |
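The upper-bound rule can be sketched as a small pure function. This is an illustrative sketch only: `checkMetric` and the `MetricCheck` shape are hypothetical names, not taken from the harness source.

```typescript
// Hypothetical helper illustrating the upper-bound variance rule.
type MetricStatus = "PASS" | "FAIL";

interface MetricCheck {
  actual: number;
  baseline: number;
  allowedVariance: number;
  delta: number; // actual - baseline; negative means fewer calls or a faster run
  status: MetricStatus;
}

function checkMetric(actual: number, baseline: number, allowedVariance = 0): MetricCheck {
  const delta = actual - baseline;
  // Only the upper bound is enforced: anything at or below
  // baseline + allowedVariance passes, anything above it fails.
  const status: MetricStatus = actual <= baseline + allowedVariance ? "PASS" : "FAIL";
  return { actual, baseline, allowedVariance, delta, status };
}
```

With the example suite above, `checkMetric(13, 19, 2)` passes with a delta of -6, while `checkMetric(22, 19, 2)` fails because 22 exceeds 19 + 2.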
| 100 | + |
| 101 | +Tool sequence drift is warning-only by default (`sequence.mode: warn`) because real Claude runs can choose equally valid UI paths. Use `sequence.mode: fail` only for suites where exact MCP call order is part of the contract. |
| 102 | + |
| 103 | +`sessionDefaults` are written to a harness-owned config at `<run>/mcp-workspace/.xcodebuildmcp/config.yaml`. The generated Claude MCP config sets `XCODEBUILDMCP_CWD` to `<run>/mcp-workspace`, so the dev MCP server reads only the benchmark config instead of any repo or example-project `.xcodebuildmcp/config.yaml`. Unknown keys fail fast. Relative path defaults such as `projectPath`, `workspacePath`, and `derivedDataPath` are resolved against the suite `workingDirectory` before being written because the MCP server cwd is the isolated workspace. |
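A minimal sketch of the path-resolution step, assuming the set of path-like keys is exactly the three named above (the real list may be longer). `resolveSessionDefaults` and `PATH_KEYS` are hypothetical names used for illustration.

```typescript
import * as path from "node:path";

// Assumption: only these keys are treated as paths. The README lists them
// as examples ("such as"), so the harness may handle more.
const PATH_KEYS = ["projectPath", "workspacePath", "derivedDataPath"] as const;

function resolveSessionDefaults(
  defaults: Record<string, unknown>,
  workingDirectory: string,
): Record<string, unknown> {
  const resolved: Record<string, unknown> = { ...defaults };
  for (const key of PATH_KEYS) {
    const value = resolved[key];
    // Absolute paths are left alone; relative ones are anchored at the suite
    // workingDirectory, because the MCP server cwd is the isolated workspace.
    if (typeof value === "string" && !path.isAbsolute(value)) {
      resolved[key] = path.resolve(workingDirectory, value);
    }
  }
  return resolved;
}
```

Non-path keys such as `scheme` pass through unchanged.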
| 104 | + |
| 105 | +## Temporary simulator lifecycle |
| 106 | + |
| 107 | +By default, each suite creates a fresh simulator before Claude launches. The harness uses `sessionDefaults.simulatorName` as the `simctl create` device type name, captures the returned simulator ID, boots that simulator, waits for `simctl bootstatus <id> -b`, opens Simulator.app to that device, applies a short UI-readiness delay, and writes the simulator ID as `sessionDefaults.simulatorId` in the isolated MCP workspace config. This makes Claude and the dev MCP server target a visible, booted, isolated simulator instead of reusing a previous run's state or spending benchmark calls on simulator boot/open setup. |
| 108 | + |
| 109 | +Simulator setup is deliberately outside the benchmark measurement boundary. The measured `wallClockSeconds` starts when the harness spawns Claude and stops when Claude exits. Tool-call counts are parsed only from Claude's JSONL transcript. The result JSON still records temporary simulator `setupDurationSeconds` under `run.temporarySimulator` so setup cost is visible without being compared against Claude task-efficiency baselines. |
| 110 | + |
| 111 | +Config contract: |
| 112 | + |
| 113 | +- Omit `temporarySimulator` for the default behavior: create and later delete a temporary simulator. |
| 114 | +- Set `temporarySimulator: false` to opt out and use the suite/project defaults as-is. |
| 115 | +- Set `sessionDefaults.simulatorId` to use an existing simulator. In this case the harness does not create or delete a simulator. |
| 116 | +- Do not set both `temporarySimulator: true` and `sessionDefaults.simulatorId`; the harness fails fast because deleting a user-provided simulator would be unsafe. |
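The contract above can be expressed as a small decision function. This is a sketch under the assumption that the four bullets are the complete rule set; `planSimulator` and `SimulatorPlan` are hypothetical names.

```typescript
// Hypothetical types: only the fields relevant to the contract are modeled.
interface SuiteSimulatorConfig {
  temporarySimulator?: boolean;
  sessionDefaults?: { simulatorId?: string };
}

type SimulatorPlan =
  | { kind: "temporary" } // create before Claude, delete afterwards
  | { kind: "existing"; simulatorId: string } // never created or deleted by the harness
  | { kind: "defaults" }; // use suite/project defaults as-is

function planSimulator(suite: SuiteSimulatorConfig): SimulatorPlan {
  const simulatorId = suite.sessionDefaults?.simulatorId;
  if (suite.temporarySimulator === true && simulatorId !== undefined) {
    // Fail fast: deleting a user-provided simulator would be unsafe.
    throw new Error("temporarySimulator: true cannot be combined with sessionDefaults.simulatorId");
  }
  if (simulatorId !== undefined) {
    return { kind: "existing", simulatorId };
  }
  if (suite.temporarySimulator === false) {
    return { kind: "defaults" };
  }
  return { kind: "temporary" };
}
```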
| 117 | + |
| 118 | +When a temporary simulator is enabled, its setup must succeed. If creation, boot, bootstatus, or Simulator.app opening fails, the suite fails loudly before Claude starts. Deletion is best effort in a `finally` block: failures are logged but do not mask the benchmark result or original error. |
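The setup-then-`finally` shape might look like the following sketch. The `SimulatorOps` interface is a hypothetical stand-in for the real `simctl` calls, which are injected here so that the control flow is the only thing shown.

```typescript
// Hypothetical abstraction over the simulator commands used by the harness.
interface SimulatorOps {
  setup(): Promise<string>; // create, boot, bootstatus, open; resolves to the simulator ID
  remove(simulatorId: string): Promise<void>;
  log(message: string): void;
}

async function withTemporarySimulator<T>(
  ops: SimulatorOps,
  run: (simulatorId: string) => Promise<T>,
): Promise<T> {
  // A setup failure propagates immediately, so the run callback never starts.
  const simulatorId = await ops.setup();
  try {
    return await run(simulatorId);
  } finally {
    try {
      await ops.remove(simulatorId);
    } catch (error) {
      // Best effort: log and swallow, so the result or original error survives.
      ops.log(`delete failed for ${simulatorId}: ${String(error)}`);
    }
  }
}
```

Because the deletion error is caught inside `finally`, a failed cleanup cannot replace the value or exception produced by `run`.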
| 119 | + |
| 120 | +`firstRunPromptDismissals` is an optional suite-level preflight for fresh simulator noise such as Apple first-run sheets. When configured, the harness launches `sessionDefaults.bundleId` before Claude starts, retries through transient UI-inspection failures, looks for any listed button labels, taps matching labels with AXe, then terminates the app. If the prompt state cannot be inspected or dismissed before `timeoutSeconds`, the suite fails before Claude starts. These preflight interactions are logged in `simulator-lifecycle.log`, but they are outside Claude's wall-clock measurement and do not appear in tool-call counts. Keep the labels generic and non-destructive, for example `Continue`, `Not Now`, or `OK`; do not configure sign-in, sync enablement, Settings, destructive, or data-deletion actions. |
| 121 | + |
| 122 | +Lifecycle details are written to `simulator-lifecycle.log`, including the `create`, `boot`, `bootstatus`, `open`, readiness delay, optional first-run prompt preflight, and deletion steps. `claude-command.log` also records the simulator ID used for the run. The terminal report shows the temporary simulator ID plus setup duration as `setup ... before Claude` when a temporary simulator is used. |
| 123 | + |
| 124 | +## Terminal report |
| 125 | + |
| 126 | +Each suite renders as a structured report with a status banner, aligned metric and tool tables, a failures/stumbles section (only when non-zero), and a sequence diff. When run with `--all`, an aggregate summary follows the per-suite reports. |
| 127 | + |
| 128 | +### Single suite |
| 129 | + |
| 130 | +```text |
| 131 | +──────────────────────────────────────────────────────────────────────── |
| 132 | +PASS weather 1m 38.6s |
| 133 | +  suite      benchmarks/claude-ui/suites/weather.yml |
| 134 | +  artifacts  out.nosync/claude-benchmarks/weather/20260522T214044Z |
| 135 | +  exit       claude=0 parser=0 |
| 136 | + |
| 137 | +Metrics |
| 138 | +  METRIC             ACTUAL  BASELINE  VARIANCE   DELTA  STATUS |
| 139 | +  totalToolCalls         13        19        +2      −6  PASS |
| 140 | +  mcpToolCalls           12        18        +2      −6  PASS |
| 141 | +  uiAutomationCalls      10        16        +2      −6  PASS |
| 142 | +  wallClockSeconds    98.62    125.00    +45.00  −26.38  PASS |
| 143 | + |
| 144 | +Tool calls (baseline-tracked) |
| 145 | +  TOOL                   ACTUAL  BASELINE  DELTA  STATUS |
| 146 | +  session_show_defaults       1         1      0  PASS |
| 147 | +  build_run_sim               1         1      0  PASS |
| 148 | +  snapshot_ui                 1         1      0  PASS |
| 149 | +  tap                         6         9     −3  PASS |
| 150 | +  batch                       1         1      0  PASS |
| 151 | + |
| 152 | +PASS failures/stumbles: 0 |
| 153 | +``` |
| 154 | + |
| 155 | +### Sequence drift |
| 156 | + |
| 157 | +When the tool sequence drifts, the report includes unified-diff style hunks with expected/actual index columns. Drift is warning-only by default, so the overall status stays `WARN` rather than `FAIL`: |
| 158 | + |
| 159 | +```text |
| 160 | +WARN tool sequence (warn): drift: 4 missing, 0 additional |
| 161 | + @@ expected[8..15] actual[8..11] @@ |
| 162 | + 8 8 tap |
| 163 | + 9 9 tap |
| 164 | + 10 − tap |
| 165 | + 11 10 swipe |
| 166 | + 12 11 tap |
| 167 | + 13 − swipe |
| 168 | + 14 − tap |
| 169 | + 15 − tap |
| 170 | +``` |
| 171 | + |
| 172 | +`−` lines are expected calls Claude skipped; `+` lines are calls Claude made that were not expected. Lines that carry both an expected and an actual index are surrounding context and are dimmed in color output. |
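One way to derive the missing/additional counts is a longest-common-subsequence comparison, sketched below. This is an assumption about the technique: the README does not say which diff algorithm the harness uses, and `sequenceDrift` is a hypothetical name.

```typescript
// LCS-based drift count (assumed technique, not confirmed by the harness source).
function sequenceDrift(
  expected: string[],
  actual: string[],
): { missing: number; additional: number } {
  // lcs[i][j] = length of the longest common subsequence of
  // expected[i..] and actual[j..].
  const lcs: number[][] = Array.from({ length: expected.length + 1 }, () =>
    new Array<number>(actual.length + 1).fill(0),
  );
  for (let i = expected.length - 1; i >= 0; i--) {
    for (let j = actual.length - 1; j >= 0; j--) {
      lcs[i][j] =
        expected[i] === actual[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  const common = lcs[0][0];
  // Expected calls outside the common subsequence are "missing";
  // actual calls outside it are "additional".
  return { missing: expected.length - common, additional: actual.length - common };
}
```

Applied to the hunk above, the eight expected calls and four actual calls share a common subsequence of length four, which gives 4 missing and 0 additional.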
| 173 | + |
| 174 | +### Failures and inspect hints |
| 175 | + |
| 176 | +When `failures/stumbles` is non-zero the report lists the first few tool failures and pattern matches, and surfaces an `Inspect` block with the relevant artifact paths: |
| 177 | + |
| 178 | +```text |
| 179 | +FAIL failures/stumbles: 1 |
| 180 | + • tool failures: 1 |
| 181 | + boot_sim @ line 9: Boot failed: device not found |
| 182 | + |
| 183 | +Inspect |
| 184 | + result.json out.nosync/claude-benchmarks/reminders/20260522T213905Z/result.json |
| 185 | + transcript out.nosync/claude-benchmarks/reminders/20260522T213905Z/claude.jsonl |
| 186 | + stderr out.nosync/claude-benchmarks/reminders/20260522T213905Z/claude.stderr |
| 187 | + run dir out.nosync/claude-benchmarks/reminders/20260522T213905Z |
| 188 | +``` |
| 189 | + |
| 190 | +### Aggregate summary |
| 191 | + |
| 192 | +After `--all` (or multi-result `--from-result`) the harness appends: |
| 193 | + |
| 194 | +```text |
| 195 | +════════════════════════════════════════════════════════════════════════ |
| 196 | + Claude UI Benchmarks · Summary |
| 197 | +════════════════════════════════════════════════════════════════════════ |
| 198 | + Suites: 3 total · 2 passed · 1 failed · 2 sequence warnings |
| 199 | + Duration: total 4m 49.8s · slowest reminders (1m 39.8s) |
| 200 | + Artifacts: out.nosync/claude-benchmarks/ |
| 201 | + |
| 202 | + ! WARN weather 1m 38.6s sequence warn: 4m/0a |
| 203 | + ✗ FAIL reminders 1m 39.8s 1 stumble · sequence warn: 7m/4a |
| 204 | + ! WARN contacts 1m 31.4s sequence warn: 2m/2a |
| 205 | +════════════════════════════════════════════════════════════════════════ |
| 206 | +``` |
| 207 | + |
| 208 | +`Nm/Ka` denotes "N missing / K additional" calls vs. `expectedToolSequence`. |
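The compact notation is simple enough to sketch as a formatter. `formatDrift` is a hypothetical name, and the input shape is an assumption.

```typescript
// Hypothetical formatter for the "N missing / K additional" shorthand.
function formatDrift(drift: { missing: number; additional: number }): string {
  return `${drift.missing}m/${drift.additional}a`;
}
```

For the `reminders` row above, `formatDrift({ missing: 7, additional: 4 })` yields `7m/4a`.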
| 209 | + |
| 210 | +The renderer auto-detects TTY and adds ANSI color when stdout is a terminal and `NO_COLOR` is unset. Plain-text output (e.g. when piping to a file or under `NO_COLOR=1`) carries the same information without color codes. |
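The color decision can be sketched as a pure function of the two inputs named above. `shouldUseColor` is a hypothetical name. The sketch follows the rule as stated here (color only when `NO_COLOR` is unset), so it treats an empty `NO_COLOR` as set; the real renderer may differ on that edge case.

```typescript
// Hypothetical helper: color is enabled only for a TTY with NO_COLOR unset.
function shouldUseColor(
  isTTY: boolean,
  env: Record<string, string | undefined>,
): boolean {
  return isTTY && env["NO_COLOR"] === undefined;
}
```

In a Node process this would typically be called as `shouldUseColor(Boolean(process.stdout.isTTY), process.env)`.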
| 211 | + |
| 212 | +`--json` output is unchanged by this renderer: the JSON payload remains a single `BenchmarkResult` for `--suite` / single-result `--from-result`, and an array for `--all` / multi-result `--from-result`. |
| 213 | + |
| 214 | +## Artifacts |
| 215 | + |
| 216 | +Each run writes: |
| 217 | + |
| 218 | +- `prompt.md` — exact suite prompt fed to Claude |
| 219 | +- `mcp-config.json` — generated Claude MCP config |
| 220 | +- `mcp-workspace/.xcodebuildmcp/config.yaml` — isolated MCP server config with effective suite defaults |
| 221 | +- `claude.jsonl` — Claude stream JSON output |
| 222 | +- `claude.stderr` — Claude stderr |
| 223 | +- `claude-command.log` — command, cwd, simulator ID, exit status, wall clock |
| 224 | +- `simulator-lifecycle.log` — temporary simulator create, boot, bootstatus, open, readiness, deletion commands, and simulator ID |
| 225 | +- `parsed/` — files written by `parse_claude_conversation.py` |
| 226 | +- `parse.log` / `parse.log.stderr` — parser output |
| 227 | +- `result.json` — full benchmark result |
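The run directory names in the examples (for instance `20260522T130926Z`) look like compact UTC timestamps. A minimal sketch of producing one, assuming that format is ISO 8601 basic with second precision; `runTimestamp` and `runDirectory` are hypothetical names.

```typescript
// Assumption: the timestamp is UTC in the form YYYYMMDDTHHMMSSZ, inferred
// from the example paths rather than from the harness source.
function runTimestamp(date: Date): string {
  return date
    .toISOString() // e.g. 2026-05-22T13:09:26.000Z
    .replace(/\.\d{3}Z$/, "Z") // drop milliseconds
    .replace(/[-:]/g, ""); // drop date and time separators
}

function runDirectory(suite: string, date: Date): string {
  return `out.nosync/claude-benchmarks/${suite}/${runTimestamp(date)}`;
}
```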