
bench(vision): visual grounding and wandering regression harness (OmniParser adoption E) #1056

@shaun0927


Tier: core / harness evaluation (fixtures and scripts; no runtime behavior change unless called by tests)
PR target: develop
Series: OmniParser adoption E
Priority: P1 — required to prove visual grounding reduces wandering and does not harm long-running stability

Related / Sequencing

Background

Before visual parsing is promoted beyond an optional fallback, OpenChrome needs a reproducible evaluation harness. OmniParser reports grounding benchmark wins, but OpenChrome's product claim is broader: fewer wandering calls, stronger recovery, and stable long-running browser sessions.

This issue adds fixture-based and live OpenChrome verification scripts that measure whether perception/visual fallback actually improves outcomes without increasing flakiness, latency, or memory growth beyond acceptable limits.

Proposed Implementation

Add benchmark fixtures and scripts under a narrow path, for example:

  • tests/fixtures/sites/visual-grounding-bench/
  • scripts/bench/visual-grounding/
  • tests/bench/visual-grounding.test.ts or a documented manual/live script if full automation is not CI-safe

Required scenarios

  1. DOM/AX normal: a regular button or input should succeed without a visual provider.
  2. DOM/AX poor labels: the visible target has a poor accessible name but a discoverable visual label.
  3. Canvas/pseudo-control: a visual-only click target updates page state.
  4. Ambiguous visual candidates: the system must refuse or escalate to human-in-the-loop (HITL) instead of wrong-clicking.
  5. Unsafe visual target: a destructive-looking label must not be clicked by visual-only fallback.
  6. Parser unavailable: an optional-provider timeout or malformed response must fall back and keep the server healthy.
  7. Long-running soak: repeated vision_find/interact cycles should not leak unbounded memory or leave tabs/processes behind.
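One way to keep the seven scenarios explicit in the harness is a small catalogue that records what each one expects. The sketch below is illustrative only: the names canvas-visual-only, ambiguous-visual, provider-timeout, and long-running-soak come from the report example and pass checks in this issue, while the other three names, the needsVisualProvider flag, and the ExpectedOutcome values are assumptions.

```typescript
// Hypothetical scenario catalogue; field names are assumptions, not an existing OpenChrome API.
type ExpectedOutcome = "click" | "refuse" | "fallback" | "stable";

interface BenchScenario {
  name: string;
  needsVisualProvider: boolean;
  expected: ExpectedOutcome;
}

const SCENARIOS: readonly BenchScenario[] = [
  { name: "dom-ax-normal", needsVisualProvider: false, expected: "click" }, // assumed name
  { name: "dom-ax-poor-labels", needsVisualProvider: true, expected: "click" }, // assumed name
  { name: "canvas-visual-only", needsVisualProvider: true, expected: "click" },
  { name: "ambiguous-visual", needsVisualProvider: true, expected: "refuse" },
  { name: "unsafe-visual", needsVisualProvider: true, expected: "refuse" }, // assumed name
  { name: "provider-timeout", needsVisualProvider: true, expected: "fallback" },
  { name: "long-running-soak", needsVisualProvider: true, expected: "stable" },
];

// Scenarios in which any click at all counts as a wrong click.
function mustNotClick(scenario: BenchScenario): boolean {
  return scenario.expected === "refuse";
}
```

A catalogue like this lets a CI-safe test assert that all seven scenarios are present before any browser is started.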

Metrics to collect

For each scenario:

  • success/failure
  • tool call count
  • non-progress/stuck hint count if visible
  • selected strategy/provider
  • latency per tool and total latency
  • parser latency when applicable
  • wrong-click count
  • memory/process health summary from /health or existing metrics endpoint when enabled
  • artifact paths for screenshots/trajectory when enabled
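The per-scenario numbers above could be derived from a flat log of tool calls. The following is a minimal sketch under that assumption; ToolCallRecord, ScenarioMetrics, and summarizeCalls are hypothetical names, and the record fields are not taken from any existing OpenChrome type.

```typescript
// Hypothetical per-call record written by the bench runner.
interface ToolCallRecord {
  tool: string; // e.g. "navigate", "vision_find", "interact"
  latencyMs: number;
  parserLatencyMs?: number; // set only when a visual parser was consulted
  wrongClick?: boolean;
}

interface ScenarioMetrics {
  toolCalls: number;
  wrongClicks: number;
  latencyMs: number; // total latency across all calls
  parserLatencyMs: number; // total time attributed to the parser
  latencyByTool: Record<string, number>;
}

// Fold a call log into the metrics listed for each scenario.
function summarizeCalls(calls: readonly ToolCallRecord[]): ScenarioMetrics {
  const latencyByTool: Record<string, number> = {};
  let latencyMs = 0;
  let parserLatencyMs = 0;
  let wrongClicks = 0;
  for (const call of calls) {
    latencyByTool[call.tool] = (latencyByTool[call.tool] ?? 0) + call.latencyMs;
    latencyMs += call.latencyMs;
    parserLatencyMs += call.parserLatencyMs ?? 0;
    if (call.wrongClick) wrongClicks += 1;
  }
  return { toolCalls: calls.length, wrongClicks, latencyMs, parserLatencyMs, latencyByTool };
}
```

Keeping the summary a pure function of the call log means it can be unit-tested in CI without a browser or a parser.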

Output format

Write a JSON report:

{
  "version": 1,
  "runId": "...",
  "openchromeVersion": "...",
  "scenarios": [
    {
      "name": "canvas-visual-only",
      "provider": "omniparser-http-mock",
      "success": true,
      "toolCalls": 4,
      "wrongClicks": 0,
      "stuckHints": 0,
      "latencyMs": 1234,
      "strategyUsed": "S7_VISUAL_GROUNDING",
      "artifacts": ["..."],
      "health": { "memoryGrowthMb": 12 }
    }
  ],
  "summary": {
    "pass": true,
    "failures": []
  }
}
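Since the acceptance criteria call for a CI-tested report parser, a validator for the shape above might look roughly like this. It is a sketch, not the intended implementation: parseReport and the interface names are hypothetical, and it checks only the fields shown in the example.

```typescript
interface ScenarioResult {
  name: string;
  provider: string;
  success: boolean;
  toolCalls: number;
  wrongClicks: number;
  stuckHints: number;
  latencyMs: number;
  strategyUsed: string;
  artifacts: string[];
}

interface BenchReport {
  version: 1;
  runId: string;
  openchromeVersion: string;
  scenarios: ScenarioResult[];
  summary: { pass: boolean; failures: string[] };
}

const STRING_FIELDS = ["name", "provider", "strategyUsed"] as const;
const NUMBER_FIELDS = ["toolCalls", "wrongClicks", "stuckHints", "latencyMs"] as const;

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function parseScenario(v: unknown, index: number): ScenarioResult {
  if (!isRecord(v)) throw new Error(`scenarios[${index}] must be an object`);
  for (const key of STRING_FIELDS) {
    if (typeof v[key] !== "string") throw new Error(`scenarios[${index}].${key} must be a string`);
  }
  for (const key of NUMBER_FIELDS) {
    const n = v[key];
    if (typeof n !== "number" || !Number.isFinite(n) || n < 0) {
      throw new Error(`scenarios[${index}].${key} must be a non-negative number`);
    }
  }
  if (typeof v["success"] !== "boolean") throw new Error(`scenarios[${index}].success must be a boolean`);
  const artifacts = v["artifacts"];
  if (!Array.isArray(artifacts) || !artifacts.every((a) => typeof a === "string")) {
    throw new Error(`scenarios[${index}].artifacts must be an array of strings`);
  }
  return v as unknown as ScenarioResult; // every listed field was checked above
}

function parseReport(json: string): BenchReport {
  const data: unknown = JSON.parse(json);
  if (!isRecord(data)) throw new Error("report must be a JSON object");
  if (data["version"] !== 1) throw new Error("unsupported report version");
  const runId = data["runId"];
  const openchromeVersion = data["openchromeVersion"];
  if (typeof runId !== "string") throw new Error("runId must be a string");
  if (typeof openchromeVersion !== "string") throw new Error("openchromeVersion must be a string");
  const scenarios = data["scenarios"];
  if (!Array.isArray(scenarios)) throw new Error("scenarios must be an array");
  const summary = data["summary"];
  if (!isRecord(summary)) throw new Error("summary must be an object");
  const pass = summary["pass"];
  const failures = summary["failures"];
  if (typeof pass !== "boolean") throw new Error("summary.pass must be a boolean");
  if (!Array.isArray(failures)) throw new Error("summary.failures must be an array");
  return {
    version: 1,
    runId,
    openchromeVersion,
    scenarios: scenarios.map(parseScenario),
    summary: { pass, failures: failures.map(String) },
  };
}
```

Extra fields on a scenario are ignored by this sketch, so the schema can grow without breaking older parsers.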

Non-goals

  • Do not require real Microsoft OmniParser or GPU in CI.
  • Do not benchmark third-party websites.
  • Do not make benchmark scripts mutate user profiles.
  • Do not promote visual grounding defaults based only on mocked tests.

Acceptance Criteria

  • Bench fixtures cover all seven scenarios above.
  • A script can run against the built dist/index.js over OpenChrome MCP and produce a JSON report.
  • Mock OmniParser provider supports success, timeout, malformed, ambiguous, and unsafe fixture modes.
  • Report includes success, tool calls, wrong clicks, strategy/provider, latency, and health/memory summary.
  • CI-safe tests validate the report parser and at least mock-level scenario execution.
  • Documentation explains how maintainers run the live verification locally.
  • npm run build && npm test -- --runInBand visual-grounding passes, and the full npm run build && npm test && npm run lint:tier passes before PR completion.
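The five mock provider modes named in the criteria could be modelled as a pure function that decides what the mock HTTP handler should send. This is a sketch under loud assumptions: the response body shape (an elements array with label, bbox, and confidence) is invented for illustration and is not the real OmniParser wire format, and planMockResponse is a hypothetical helper.

```typescript
type MockMode = "success" | "timeout" | "malformed" | "ambiguous" | "unsafe";

// Assumed element shape, for illustration only.
interface MockElement {
  label: string;
  bbox: [number, number, number, number]; // x1, y1, x2, y2 in pixels
  confidence: number;
}

interface MockPlan {
  delayMs: number; // how long the handler waits before replying
  status: number;
  body: string;
}

function planMockResponse(mode: MockMode, clientTimeoutMs = 2000): MockPlan {
  const ok = (elements: MockElement[]): MockPlan => ({
    delayMs: 0,
    status: 200,
    body: JSON.stringify({ elements }),
  });
  switch (mode) {
    case "success":
      return ok([{ label: "Start", bbox: [10, 10, 110, 50], confidence: 0.97 }]);
    case "ambiguous":
      // Two equally plausible candidates with the same label.
      return ok([
        { label: "Submit", bbox: [10, 10, 110, 50], confidence: 0.81 },
        { label: "Submit", bbox: [10, 80, 110, 120], confidence: 0.8 },
      ]);
    case "unsafe":
      return ok([{ label: "Delete account", bbox: [10, 10, 160, 50], confidence: 0.95 }]);
    case "malformed":
      // Truncated JSON so the client's parse step fails.
      return { delayMs: 0, status: 200, body: '{"elements": [' };
    case "timeout":
      // Reply well after the client is expected to give up.
      return { delayMs: clientTimeoutMs * 2, status: 200, body: JSON.stringify({ elements: [] }) };
  }
}
```

An HTTP handler on the mock port would wait delayMs and then write status and body, which keeps the mode logic testable without opening a socket.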

Verification (post-merge, via OpenChrome MCP)

Record artifacts under scripts/verify/omniparser-adoption-E-visual-bench/.

Live benchmark command

The benchmark runner must drive OpenChrome through MCP tool calls (navigate, vision_find, interact, read_page, and health/metrics probes where available). It must not inspect fixture state directly except to serve fixture pages and mock parser responses.
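To make the "MCP tool calls only" rule enforceable, the runner could route every call through a thin wrapper that rejects anything outside an allow-list and logs what was called. The sketch below assumes an injected ToolCaller function standing in for whatever MCP client the runner uses; recordingCaller, CallLogEntry, and ALLOWED_TOOLS are hypothetical names. The health/metrics probe tools are not named in this issue, so they are left out of the list here and would need to be added.

```typescript
type ToolArgs = Record<string, unknown>;
// Stand-in for the real MCP client call; the actual client API is not assumed here.
type ToolCaller = (tool: string, args: ToolArgs) => Promise<unknown>;

interface CallLogEntry {
  tool: string;
  args: ToolArgs;
  latencyMs: number;
  ok: boolean;
}

const ALLOWED_TOOLS: ReadonlySet<string> = new Set([
  "navigate",
  "vision_find",
  "interact",
  "read_page",
]);

// Wrap a caller so every call is allow-listed, timed, and appended to the log.
function recordingCaller(
  inner: ToolCaller,
  log: CallLogEntry[],
  now: () => number = Date.now,
): ToolCaller {
  return async (tool, args) => {
    if (!ALLOWED_TOOLS.has(tool)) {
      throw new Error(`tool not allowed in bench: ${tool}`);
    }
    const started = now();
    let ok = false;
    try {
      const result = await inner(tool, args);
      ok = true;
      return result;
    } finally {
      // Failed calls are logged too, so tool-call counts stay honest.
      log.push({ tool, args, latencyMs: now() - started, ok });
    }
  };
}
```

The resulting log is also a natural place to satisfy the requirement that the report records the MCP calls used for each scenario.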

npm ci
npm run build
mkdir -p scripts/verify/omniparser-adoption-E-visual-bench
node scripts/bench/visual-grounding/run.mjs \
  --openchrome-command "node dist/index.js --http 9897" \
  --fixture-port 9997 \
  --mock-omniparser-port 9907 \
  --out scripts/verify/omniparser-adoption-E-visual-bench/report.json \
  --record-artifacts scripts/verify/omniparser-adoption-E-visual-bench/artifacts
jq . scripts/verify/omniparser-adoption-E-visual-bench/report.json

Required pass checks

REPORT=scripts/verify/omniparser-adoption-E-visual-bench/report.json
jq -e '.summary.pass == true' "$REPORT" >/dev/null
jq -e 'all(.scenarios[]; .wrongClicks == 0)' "$REPORT" >/dev/null
jq -e 'any(.scenarios[]; .name == "canvas-visual-only" and .success == true and .strategyUsed == "S7_VISUAL_GROUNDING")' "$REPORT" >/dev/null
jq -e 'any(.scenarios[]; .name == "ambiguous-visual" and .success == true and (.strategyUsed | test("HITL|blocked|rejected")))' "$REPORT" >/dev/null
jq -e 'any(.scenarios[]; .name == "provider-timeout" and .success == true and (.provider | test("fallback|dom")))' "$REPORT" >/dev/null
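# Require a numeric value: in jq, null <= 75 is true, so a missing health block would otherwise pass.
jq -e 'any(.scenarios[]; .name == "long-running-soak" and .success == true and (.health.memoryGrowthMb | type == "number" and . <= 75))' "$REPORT" >/dev/null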

Pass: report passes, wrong-click count is zero, visual fallback helps the visual-only scenario, ambiguous/unsafe cases do not click blindly, parser failure falls back, soak memory growth stays bounded, and the report records the MCP calls used for each scenario.
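For the CI-safe tests, the same pass conditions could be expressed as a function that returns a list of failure messages. This is a sketch only: checkReport and the interface names are hypothetical, the 75 MB threshold is the one used in the checks above, and the sketch requires memoryGrowthMb to be a number rather than treating a missing value as acceptable.

```typescript
interface CheckedScenario {
  name: string;
  success: boolean;
  wrongClicks: number;
  strategyUsed: string;
  provider: string;
  health?: { memoryGrowthMb?: number };
}

interface CheckedReport {
  summary: { pass: boolean };
  scenarios: CheckedScenario[];
}

// Returns an empty array when every required pass check holds.
function checkReport(report: CheckedReport, maxGrowthMb = 75): string[] {
  const failures: string[] = [];
  const passed = (name: string, extra: (s: CheckedScenario) => boolean): boolean =>
    report.scenarios.some((s) => s.name === name && s.success && extra(s));

  if (!report.summary.pass) failures.push("summary.pass is not true");
  if (!report.scenarios.every((s) => s.wrongClicks === 0)) {
    failures.push("at least one scenario recorded a wrong click");
  }
  if (!passed("canvas-visual-only", (s) => s.strategyUsed === "S7_VISUAL_GROUNDING")) {
    failures.push("canvas-visual-only did not succeed via S7_VISUAL_GROUNDING");
  }
  if (!passed("ambiguous-visual", (s) => /HITL|blocked|rejected/.test(s.strategyUsed))) {
    failures.push("ambiguous-visual was not refused or escalated");
  }
  if (!passed("provider-timeout", (s) => /fallback|dom/.test(s.provider))) {
    failures.push("provider-timeout did not fall back");
  }
  const soakOk = passed("long-running-soak", (s) => {
    const growth = s.health?.memoryGrowthMb;
    return typeof growth === "number" && growth <= maxGrowthMb;
  });
  if (!soakOk) failures.push("long-running-soak memory growth missing or above limit");
  return failures;
}
```

Returning messages instead of a boolean lets the runner copy them straight into summary.failures.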

Directionality / Fit Check

This issue is necessary because the proposed OmniParser-inspired changes are valuable only if they measurably reduce wandering and preserve OpenChrome's long-running reliability. It is a harness/evaluation issue, not feature creep.

Curated scope, overlap handling, and verification checklist

Scope classification

  • Canonical lane: vision harness evaluation.
  • Primary deliverable: visual grounding and wandering regression harness for OmniParser adoption series.
  • Open PR: none currently linked; create a new PR only after checking for newer overlapping PRs.
  • Non-goal: runtime behavior change, real GPU/OmniParser requirement in CI, or promotion of visual fallback without evidence.

Overlap and conflict resolution

Implementation checklist

  • Add fixture scenarios for visual grounding success, ambiguous UI, DOM-vs-vision fallback, and wandering regression.
  • Run against the mock provider and the current DOM/default provider; optionally document a manual path for real OmniParser.
  • Measure success rate, wrong-target rate, extra tool calls, and wandering hints.
  • Emit reports with screenshots/perception artifacts where enabled.
  • Add CI-safe tests for mock provider path and artifact schema.

Success criteria

Post-merge OpenChrome live verification checklist

  • Run mock-provider visual fixture suite through OpenChrome MCP.
  • Verify report includes scenario results, wrong-target/wandering metrics, and artifact references.
  • Run a DOM/default provider comparison if available.
  • Attach report path and representative scenario output.


Labels: P1, enhancement, harness, live-verification, observability, performance
