bench(vision): visual grounding and wandering regression harness (OmniParser adoption E) #1056

**Tier**: `core` / harness evaluation (fixtures and scripts; no runtime behavior change unless called by tests)
**PR target**: `develop`
**Series**: OmniParser adoption E
**Priority**: P1 — required to prove visual grounding reduces wandering and does not harm long-running stability

## Related / Sequencing

- Verifies #1052, #1053, #1054, #1055, and #1057 together through live OpenChrome MCP scenarios.
- Mock provider coverage is required; real Microsoft OmniParser/GPU coverage is optional and must not gate CI.

## Background

Before visual parsing is promoted beyond optional fallback, OpenChrome needs a reproducible evaluation harness. OmniParser reports grounding benchmark wins, but OpenChrome's product claim is broader: fewer wandering calls, stronger recovery, and stable long-running browser sessions.

This issue adds fixture-based and live OpenChrome verification scripts that measure whether perception/visual fallback actually improves outcomes without increasing flakiness, latency, or memory growth beyond acceptable limits.

## Proposed Implementation

Add benchmark fixtures and scripts under a narrow path, for example:

- `tests/fixtures/sites/visual-grounding-bench/`
- `scripts/bench/visual-grounding/`
- `tests/bench/visual-grounding.test.ts` or a documented manual/live script if full automation is not CI-safe

### Required scenarios

1. **DOM/AX normal**: a regular button/input should succeed without a visual provider.
2. **DOM/AX poor labels**: the visible target has a poor accessible name but a discoverable visual label.
3. **Canvas/pseudo-control**: a visual-only click target updates page state.
4. **Ambiguous visual candidates**: the system must refuse or escalate to HITL (human-in-the-loop) instead of wrong-clicking.
5. **Unsafe visual target**: a destructive-looking label must not be clicked by visual-only fallback.
6. **Parser unavailable**: when the optional provider times out or returns a malformed response, the run must fall back and keep the server healthy.
7. **Long-running soak**: repeated `vision_find`/`interact` cycles must not grow memory without bound or leave tabs/processes behind.
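
For the soak scenario, one way to turn periodic health probes into a single pass/fail number is sketched below. The `MemorySample` shape and the warmup parameter are assumptions made for illustration, and the actual `/health` payload may look different. The 75 MB default mirrors the threshold used in the pass checks of this issue.

```typescript
// Hypothetical sample shape; the real health/metrics payload may differ.
interface MemorySample {
  cycle: number;    // soak iteration at which the sample was taken
  rssBytes: number; // resident memory reported by the probe
}

const BYTES_PER_MB = 1024 * 1024;

// Growth is measured from the first post-warmup sample to the last sample,
// so one-time startup allocations are not counted as a leak.
function memoryGrowthMb(samples: MemorySample[], warmupCycles = 0): number {
  const steady = samples.filter((s) => s.cycle >= warmupCycles);
  const first = steady[0];
  const last = steady[steady.length - 1];
  if (steady.length < 2 || first === undefined || last === undefined) {
    throw new Error("need at least two post-warmup samples");
  }
  return (last.rssBytes - first.rssBytes) / BYTES_PER_MB;
}

// True when growth stays within the budget (75 MB by default).
function soakWithinBudget(
  samples: MemorySample[],
  budgetMb = 75,
  warmupCycles = 0,
): boolean {
  return memoryGrowthMb(samples, warmupCycles) <= budgetMb;
}
```

Comparing only the first and last samples is deliberately simple. A slope fit over all samples would be more robust to a noisy final probe, at the cost of a less obvious threshold.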

### Metrics to collect

For each scenario:

- success/failure
- tool call count
- non-progress/stuck hint count if visible
- selected strategy/provider
- latency per tool and total latency
- parser latency when applicable
- wrong-click count
- memory/process health summary from `/health` or existing metrics endpoint when enabled
- artifact paths for screenshots/trajectory when enabled
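
The call-count and latency metrics above could be collected by wrapping each tool invocation, roughly as follows. This is a sketch under assumptions: `ToolCallRecord`, `timed`, and `summarizeCalls` are hypothetical names, and the wrapper takes an arbitrary async function rather than a specific MCP client API.

```typescript
interface ToolCallRecord {
  tool: string;      // tool name, e.g. "vision_find"
  latencyMs: number; // wall-clock duration of this call
  ok: boolean;       // false when the call threw
}

interface CallSummary {
  toolCalls: number;
  latencyMs: number;                       // total across all calls
  latencyByToolMs: Record<string, number>; // summed per tool name
  failedCalls: number;
}

// Wraps any async tool invocation and appends a record, even when it throws.
// The clock is injectable so tests can use a deterministic time source.
async function timed<T>(
  tool: string,
  records: ToolCallRecord[],
  call: () => Promise<T>,
  now: () => number = Date.now,
): Promise<T> {
  const start = now();
  let ok = false;
  try {
    const result = await call();
    ok = true;
    return result;
  } finally {
    records.push({ tool, latencyMs: now() - start, ok });
  }
}

// Reduces the raw records to the per-scenario numbers the report needs.
function summarizeCalls(records: ToolCallRecord[]): CallSummary {
  const latencyByToolMs: Record<string, number> = {};
  let latencyMs = 0;
  let failedCalls = 0;
  for (const r of records) {
    latencyByToolMs[r.tool] = (latencyByToolMs[r.tool] ?? 0) + r.latencyMs;
    latencyMs += r.latencyMs;
    if (!r.ok) failedCalls += 1;
  }
  return { toolCalls: records.length, latencyMs, latencyByToolMs, failedCalls };
}
```

Recording in a `finally` block keeps failed calls in the count, which matters here because wandering often shows up as repeated failing calls.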

### Output format

Write a JSON report:

```json
{
  "version": 1,
  "runId": "...",
  "openchromeVersion": "...",
  "scenarios": [
    {
      "name": "canvas-visual-only",
      "provider": "omniparser-http-mock",
      "success": true,
      "toolCalls": 4,
      "wrongClicks": 0,
      "stuckHints": 0,
      "latencyMs": 1234,
      "strategyUsed": "S7_VISUAL_GROUNDING",
      "artifacts": ["..."],
      "health": { "memoryGrowthMb": 0 }
    }
  ],
  "summary": {
    "pass": true,
    "failures": []
  }
}
```
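
A report parser for CI could start from a structural check like the sketch below. It covers only the fields shown in the example report; the function name `validateReport` and the choice to return a list of problems rather than throw are assumptions, not an existing OpenChrome API.

```typescript
type Json = Record<string, unknown>;

const isObject = (v: unknown): v is Json =>
  typeof v === "object" && v !== null && !Array.isArray(v);

// Scalar fields of one scenario entry and their expected JSON types.
const SCENARIO_FIELDS: Array<[string, "string" | "number" | "boolean"]> = [
  ["name", "string"],
  ["provider", "string"],
  ["success", "boolean"],
  ["toolCalls", "number"],
  ["wrongClicks", "number"],
  ["stuckHints", "number"],
  ["latencyMs", "number"],
  ["strategyUsed", "string"],
];

// Returns a list of human-readable problems; an empty list means valid.
function validateReport(value: unknown): string[] {
  if (!isObject(value)) return ["report is not a JSON object"];
  const problems: string[] = [];
  if (value["version"] !== 1) problems.push("version must be 1");
  for (const key of ["runId", "openchromeVersion"]) {
    if (typeof value[key] !== "string") problems.push(`${key} must be a string`);
  }
  const scenarios = value["scenarios"];
  if (!Array.isArray(scenarios)) {
    problems.push("scenarios must be an array");
  } else {
    scenarios.forEach((s: unknown, i: number) => {
      if (!isObject(s)) {
        problems.push(`scenarios[${i}] must be an object`);
        return;
      }
      for (const [field, kind] of SCENARIO_FIELDS) {
        if (typeof s[field] !== kind) {
          problems.push(`scenarios[${i}].${field} must be a ${kind}`);
        }
      }
      if (!Array.isArray(s["artifacts"])) {
        problems.push(`scenarios[${i}].artifacts must be an array`);
      }
    });
  }
  const summary = value["summary"];
  if (
    !isObject(summary) ||
    typeof summary["pass"] !== "boolean" ||
    !Array.isArray(summary["failures"])
  ) {
    problems.push("summary must contain pass (boolean) and failures (array)");
  }
  return problems;
}
```

Returning every problem at once, instead of failing on the first, tends to make a broken report easier to fix in a single pass.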

## Non-goals

- Do not require real Microsoft OmniParser or GPU in CI.
- Do not benchmark third-party websites.
- Do not make benchmark scripts mutate user profiles.
- Do not promote visual grounding defaults based only on mocked tests.

## Acceptance Criteria

- [ ] Bench fixtures cover all seven scenarios above.
- [ ] A script can run against built `dist/index.js` over OpenChrome MCP and produce a JSON report.
- [ ] Mock OmniParser provider supports success, timeout, malformed, ambiguous, and unsafe fixture modes.
- [ ] Report includes success, tool calls, wrong clicks, strategy/provider, latency, and health/memory summary.
- [ ] CI-safe tests validate the report parser and at least mock-level scenario execution.
- [ ] Documentation explains how maintainers run the live verification locally.
- [ ] `npm run build && npm test -- --runInBand visual-grounding` passes, and the full `npm run build && npm test && npm run lint:tier` passes before PR completion.
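
The five mock provider modes named in the criteria above could be driven by one pure function that decides what the mock HTTP server sends back. Everything in this sketch is hypothetical: the `elements`/`bbox`/`confidence` response shape is a stand-in chosen for illustration and is not the real OmniParser wire format, and the labels and timeout value are placeholders.

```typescript
type MockMode = "success" | "timeout" | "malformed" | "ambiguous" | "unsafe";

// Hypothetical element shape; not the real OmniParser response format.
interface MockElement {
  label: string;
  bbox: [number, number, number, number]; // x1, y1, x2, y2 in pixels
  confidence: number;
}

interface MockReply {
  delayMs: number; // how long the mock server waits before responding
  status: number;
  body: string;    // raw body, so malformed JSON can be represented
}

const element = (label: string, x: number, confidence: number): MockElement => ({
  label,
  bbox: [x, 100, x + 120, 140],
  confidence,
});

const okReply = (elements: MockElement[], delayMs = 0): MockReply => ({
  delayMs,
  status: 200,
  body: JSON.stringify({ elements }),
});

function mockReply(mode: MockMode, providerTimeoutMs = 2000): MockReply {
  const replies: Record<MockMode, MockReply> = {
    success: okReply([element("Start", 40, 0.95)]),
    // Responds only after the client should have given up.
    timeout: okReply([element("Start", 40, 0.95)], providerTimeoutMs * 2),
    // Truncated JSON that no parser can accept.
    malformed: { delayMs: 0, status: 200, body: '{"elements": [' },
    // Two equally confident candidates with the same label.
    ambiguous: okReply([element("Submit", 40, 0.9), element("Submit", 240, 0.9)]),
    // A single confident match on a destructive-looking label.
    unsafe: okReply([element("Delete account", 40, 0.97)]),
  };
  return replies[mode];
}
```

Keeping the mode logic free of I/O means the CI-safe tests can assert on replies directly, while the live runner wraps the same function in an HTTP server.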

## Verification (post-merge, via OpenChrome MCP)

Record artifacts under `scripts/verify/omniparser-adoption-E-visual-bench/`.

### Live benchmark command

The benchmark runner must drive OpenChrome through MCP tool calls (`navigate`, `vision_find`, `interact`, `read_page`, and health/metrics probes where available). It must not inspect fixture state directly except to serve fixture pages and mock parser responses.

```bash
npm ci
npm run build
mkdir -p scripts/verify/omniparser-adoption-E-visual-bench
node scripts/bench/visual-grounding/run.mjs \
  --openchrome-command "node dist/index.js --http 9897" \
  --fixture-port 9997 \
  --mock-omniparser-port 9907 \
  --out scripts/verify/omniparser-adoption-E-visual-bench/report.json \
  --record-artifacts scripts/verify/omniparser-adoption-E-visual-bench/artifacts
jq . scripts/verify/omniparser-adoption-E-visual-bench/report.json
```
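
The flags in the command above could be parsed along these lines. This is only a sketch: which flags are mandatory is an assumption (here everything except `--record-artifacts`), and `BenchOptions` and `parseBenchArgs` are hypothetical names rather than the contents of `run.mjs`.

```typescript
interface BenchOptions {
  openchromeCommand: string;
  fixturePort: number;
  mockOmniparserPort: number;
  out: string;
  recordArtifacts?: string; // assumed optional
}

// Parses a flat ["--flag", "value", ...] list such as process.argv.slice(2).
function parseBenchArgs(argv: string[]): BenchOptions {
  const values = new Map<string, string>();
  for (let i = 0; i < argv.length; i += 2) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (flag === undefined || !flag.startsWith("--") || value === undefined) {
      throw new Error(`expected "--flag value" at position ${i}`);
    }
    values.set(flag.slice(2), value);
  }
  const required = (name: string): string => {
    const v = values.get(name);
    if (v === undefined) throw new Error(`missing --${name}`);
    return v;
  };
  const port = (name: string): number => {
    const n = Number(required(name));
    if (!Number.isInteger(n) || n < 1 || n > 65535) {
      throw new Error(`--${name} must be a port between 1 and 65535`);
    }
    return n;
  };
  return {
    openchromeCommand: required("openchrome-command"),
    fixturePort: port("fixture-port"),
    mockOmniparserPort: port("mock-omniparser-port"),
    out: required("out"),
    recordArtifacts: values.get("record-artifacts"),
  };
}
```

Validating ports up front lets the runner fail before it starts any browser or mock server process.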

### Required pass checks

```bash
REPORT=scripts/verify/omniparser-adoption-E-visual-bench/report.json
jq -e '.summary.pass == true' "$REPORT" >/dev/null
jq -e 'all(.scenarios[]; .wrongClicks == 0)' "$REPORT" >/dev/null
jq -e 'any(.scenarios[]; .name == "canvas-visual-only" and .success == true and .strategyUsed == "S7_VISUAL_GROUNDING")' "$REPORT" >/dev/null
jq -e 'any(.scenarios[]; .name == "ambiguous-visual" and .success == true and (.strategyUsed | test("HITL|blocked|rejected")))' "$REPORT" >/dev/null
jq -e 'any(.scenarios[]; .name == "provider-timeout" and .success == true and (.provider | test("fallback|dom")))' "$REPORT" >/dev/null
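# Require a numeric value: in jq, a missing (null) field sorts below every number and would pass a bare `<= 75`.
jq -e 'any(.scenarios[]; .name == "long-running-soak" and .success == true and (.health.memoryGrowthMb | type == "number" and . <= 75))' "$REPORT" >/dev/null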
```

**Pass**: report passes, wrong-click count is zero, visual fallback helps the visual-only scenario, ambiguous/unsafe cases do not click blindly, parser failure falls back, soak memory growth stays bounded, and the report records the MCP calls used for each scenario.
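
For the CI-safe test suite, the same pass conditions could be expressed in TypeScript so that a failing run names the check that broke. The sketch below assumes the scenario names and field paths used in the `jq` checks; `failedChecks` and the check labels are hypothetical. It treats a missing `health.memoryGrowthMb` as a failure.

```typescript
interface ScenarioLike {
  name: string;
  success: boolean;
  wrongClicks: number;
  strategyUsed: string;
  provider: string;
  health?: { memoryGrowthMb?: number };
}

interface ReportLike {
  summary: { pass: boolean };
  scenarios: ScenarioLike[];
}

// Returns the labels of all failed checks; an empty list means the run passes.
function failedChecks(report: ReportLike, memoryBudgetMb = 75): string[] {
  // Successful entries for one scenario name.
  const passed = (name: string): ScenarioLike[] =>
    report.scenarios.filter((s) => s.name === name && s.success);
  const growthOk = (s: ScenarioLike): boolean => {
    const growth = s.health?.memoryGrowthMb;
    return typeof growth === "number" && growth <= memoryBudgetMb;
  };
  const checks: Array<[string, boolean]> = [
    ["summary.pass", report.summary.pass === true],
    ["no wrong clicks", report.scenarios.every((s) => s.wrongClicks === 0)],
    [
      "canvas uses visual grounding",
      passed("canvas-visual-only").some((s) => s.strategyUsed === "S7_VISUAL_GROUNDING"),
    ],
    [
      "ambiguous target is refused",
      passed("ambiguous-visual").some((s) => /HITL|blocked|rejected/.test(s.strategyUsed)),
    ],
    [
      "provider timeout falls back",
      passed("provider-timeout").some((s) => /fallback|dom/.test(s.provider)),
    ],
    ["soak memory bounded", passed("long-running-soak").some(growthOk)],
  ];
  return checks.filter(([, ok]) => !ok).map(([label]) => label);
}
```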

## Directionality / Fit Check

This issue is necessary because the proposed OmniParser-inspired changes are valuable only if they measurably reduce wandering and preserve OpenChrome's long-running reliability. It is a harness/evaluation issue, not feature creep.



## Curated scope, overlap handling, and verification checklist

### Scope classification
- **Canonical lane:** vision harness evaluation.
- **Primary deliverable:** visual grounding and wandering regression harness for OmniParser adoption series.
- **Open PR:** none currently linked; create a new PR only after checking for newer overlapping PRs.
- **Non-goal:** runtime behavior change, real GPU/OmniParser requirement in CI, or promotion of visual fallback without evidence.

### Overlap and conflict resolution
- [ ] Verify #1052/#1053/#1054/#1055 together without implementing their runtime features.
- [ ] Mock provider coverage is required; real OmniParser coverage is optional and non-blocking.
- [ ] Keep local fixtures deterministic to avoid visual flake.

### Implementation checklist
- [ ] Add fixture scenarios for visual grounding success, ambiguous UI, DOM-vs-vision fallback, and wandering regression.
- [ ] Run against mock provider and current DOM/default provider; optionally document real OmniParser manual path.
- [ ] Measure success rate, wrong-target rate, extra tool calls, and wandering hints.
- [ ] Emit reports with screenshots/perception artifacts where enabled.
- [ ] Add CI-safe tests for mock provider path and artifact schema.
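
The measurements in this checklist could be derived from two runs of the same fixtures, one with the DOM/default provider and one with the mock visual provider, as sketched below. The definitions are assumptions made for illustration: wrong-target rate is taken as the fraction of scenarios with at least one wrong click, and extra tool calls as the per-scenario difference between the two runs.

```typescript
interface RunScenario {
  name: string;
  success: boolean;
  toolCalls: number;
  wrongClicks: number;
  stuckHints: number;
}

interface RunMetrics {
  successRate: number;     // successful scenarios / all scenarios
  wrongTargetRate: number; // scenarios with >= 1 wrong click / all scenarios
  totalToolCalls: number;
  totalStuckHints: number;
}

function runMetrics(scenarios: RunScenario[]): RunMetrics {
  if (scenarios.length === 0) throw new Error("no scenarios to summarize");
  const n = scenarios.length;
  return {
    successRate: scenarios.filter((s) => s.success).length / n,
    wrongTargetRate: scenarios.filter((s) => s.wrongClicks > 0).length / n,
    totalToolCalls: scenarios.reduce((sum, s) => sum + s.toolCalls, 0),
    totalStuckHints: scenarios.reduce((sum, s) => sum + s.stuckHints, 0),
  };
}

// Per-scenario difference in tool calls (candidate minus baseline) for
// scenarios present in both runs. Negative values mean fewer calls.
function extraToolCalls(
  baseline: RunScenario[],
  candidate: RunScenario[],
): Record<string, number> {
  const baseCalls = new Map<string, number>();
  for (const s of baseline) baseCalls.set(s.name, s.toolCalls);
  const delta: Record<string, number> = {};
  for (const s of candidate) {
    const base = baseCalls.get(s.name);
    if (base !== undefined) delta[s.name] = s.toolCalls - base;
  }
  return delta;
}
```

A negative delta on the visual-only scenarios combined with a zero wrong-target rate is the kind of evidence the success criteria ask for; a positive delta would suggest the visual path adds wandering instead of removing it.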

### Success criteria
- [ ] Harness proves whether visual grounding reduces wandering without harming stability.
- [ ] CI does not require GPU or real OmniParser.
- [ ] Reports are deterministic and actionable for #1052-#1055.
- [ ] No runtime defaults are changed by the benchmark alone.

### Post-merge OpenChrome live verification checklist
- [ ] Run mock-provider visual fixture suite through OpenChrome MCP.
- [ ] Verify report includes scenario results, wrong-target/wandering metrics, and artifact references.
- [ ] Run a DOM/default provider comparison if available.
- [ ] Attach report path and representative scenario output.
