Tier: core / harness evaluation (fixtures and scripts; no runtime behavior change unless called by tests)
PR target: develop
Series: OmniParser adoption E
Priority: P1, required to prove that visual grounding reduces wandering and does not harm long-running stability
Mock provider coverage is required; real Microsoft OmniParser/GPU coverage is optional and must not gate CI.
Background
Before visual parsing is promoted beyond optional fallback, OpenChrome needs a reproducible evaluation harness. OmniParser reports grounding benchmark wins, but OpenChrome's product claim is broader: fewer wandering calls, stronger recovery, and stable long-running browser sessions.
This issue adds fixture-based and live OpenChrome verification scripts that measure whether perception/visual fallback actually improves outcomes without increasing flakiness, latency, or memory growth beyond acceptable limits.
Proposed Implementation
Add benchmark fixtures and scripts under a narrow path, for example:
tests/fixtures/sites/visual-grounding-bench/
scripts/bench/visual-grounding/
tests/bench/visual-grounding.test.ts or a documented manual/live script if full automation is not CI-safe
Required scenarios
DOM/AX normal: regular button/input should succeed without visual provider.
DOM/AX poor labels: visible target has poor accessible name but discoverable visual label.
Report includes success, tool calls, wrong clicks, strategy/provider, latency, and health/memory summary.
CI-safe tests validate the report parser and at least mock-level scenario execution.
Documentation explains how maintainers run the live verification locally.
npm run build && npm test -- --runInBand visual-grounding must pass, plus the full npm run build && npm test && npm run lint:tier before PR completion.
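The CI-safe report-parser check mentioned above could be sketched as follows. The field names come from the JSON example in this issue; the function names (`parseReport`, `checkScenario`, `isRecord`) and the decision to treat every listed field as required are assumptions made for illustration, not part of the OpenChrome codebase.

```typescript
// Shape of one scenario entry, following the JSON example in this issue.
interface ScenarioResult {
  name: string;
  provider: string;
  success: boolean;
  toolCalls: number;
  wrongClicks: number;
  stuckHints: number;
  latencyMs: number;
  strategyUsed: string;
  artifacts: string[];
  // Referenced by the soak pass check; treated as optional here (assumption).
  health?: { memoryGrowthMb?: number };
}

interface BenchReport {
  version: number;
  runId: string;
  openchromeVersion: string;
  scenarios: ScenarioResult[];
  summary: { pass: boolean; failures: string[] };
}

// Hypothetical helper: narrows an unknown value to a plain object.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical helper: throws with a field path when a scenario entry is malformed.
function checkScenario(s: unknown, path: string): void {
  if (!isRecord(s)) throw new Error(`${path}: expected an object`);
  for (const key of ["name", "provider", "strategyUsed"]) {
    if (typeof s[key] !== "string") throw new Error(`${path}.${key}: expected string`);
  }
  for (const key of ["toolCalls", "wrongClicks", "stuckHints", "latencyMs"]) {
    if (typeof s[key] !== "number") throw new Error(`${path}.${key}: expected number`);
  }
  if (typeof s.success !== "boolean") throw new Error(`${path}.success: expected boolean`);
  if (!Array.isArray(s.artifacts)) throw new Error(`${path}.artifacts: expected array`);
}

// Hypothetical parser: validates the top-level shape, then each scenario.
function parseReport(raw: string): BenchReport {
  const data: unknown = JSON.parse(raw);
  if (!isRecord(data)) throw new Error("report: expected an object");
  if (data.version !== 1) throw new Error("report.version: expected 1");
  if (typeof data.runId !== "string") throw new Error("report.runId: expected string");
  if (typeof data.openchromeVersion !== "string") {
    throw new Error("report.openchromeVersion: expected string");
  }
  if (!Array.isArray(data.scenarios)) throw new Error("report.scenarios: expected array");
  data.scenarios.forEach((s: unknown, i: number) => checkScenario(s, `report.scenarios[${i}]`));
  const summary = data.summary;
  if (!isRecord(summary) || typeof summary.pass !== "boolean" || !Array.isArray(summary.failures)) {
    throw new Error("report.summary: expected { pass: boolean, failures: array }");
  }
  return data as unknown as BenchReport;
}
```

Failing with a field path keeps a broken mock run easy to diagnose in CI without needing a browser or a real parser provider.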
Verification (post-merge, via OpenChrome MCP)
Record artifacts under scripts/verify/omniparser-adoption-E-visual-bench/.
Live benchmark command
The benchmark runner must drive OpenChrome through MCP tool calls (navigate, vision_find, interact, read_page, and health/metrics probes where available). It must not inspect fixture state directly except to serve fixture pages and mock parser responses.
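One way to satisfy the requirement that the runner only goes through MCP tool calls, and that those calls are recorded per scenario, is to wrap the client in a recording layer. The sketch below is a minimal illustration: the tool names are the ones listed above, but `ToolClient`, `RecordingClient`, `summarizeCalls`, and the argument shapes are hypothetical and do not describe the actual OpenChrome or MCP SDK interfaces.

```typescript
// Tool names taken from this issue; argument shapes are assumptions.
type ToolName = "navigate" | "vision_find" | "interact" | "read_page";
type ToolArgs = Record<string, unknown>;

// Hypothetical minimal client interface; a real MCP client exposes more.
interface ToolClient {
  callTool(name: ToolName, args: ToolArgs): Promise<unknown>;
}

interface CallRecord {
  tool: ToolName;
  args: ToolArgs;
  durationMs: number;
  ok: boolean;
}

// Wraps a client so every call is logged, including calls that throw.
class RecordingClient implements ToolClient {
  readonly calls: CallRecord[] = [];
  private readonly inner: ToolClient;
  private readonly now: () => number;

  constructor(inner: ToolClient, now: () => number = () => Date.now()) {
    this.inner = inner;
    this.now = now;
  }

  async callTool(name: ToolName, args: ToolArgs): Promise<unknown> {
    const start = this.now();
    let ok = false;
    try {
      const result = await this.inner.callTool(name, args);
      ok = true;
      return result;
    } finally {
      // Runs on success and on failure, so failed calls still count.
      this.calls.push({ tool: name, args, durationMs: this.now() - start, ok });
    }
  }
}

// Aggregates a call log into per-scenario numbers for the report.
function summarizeCalls(calls: CallRecord[]): {
  toolCalls: number;
  latencyMs: number;
  failedCalls: number;
} {
  return {
    toolCalls: calls.length,
    latencyMs: calls.reduce((sum, c) => sum + c.durationMs, 0),
    failedCalls: calls.filter((c) => !c.ok).length,
  };
}
```

Because the scenario code only ever sees a `ToolClient`, it has no path to fixture state other than what the tools return, and the recorded log can be attached to the scenario's report entry.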
Required pass checks
REPORT=scripts/verify/omniparser-adoption-E-visual-bench/report.json
jq -e '.summary.pass == true' "$REPORT" >/dev/null
jq -e 'all(.scenarios[]; .wrongClicks == 0)' "$REPORT" >/dev/null
jq -e 'any(.scenarios[]; .name == "canvas-visual-only" and .success == true and .strategyUsed == "S7_VISUAL_GROUNDING")' "$REPORT" >/dev/null
jq -e 'any(.scenarios[]; .name == "ambiguous-visual" and .success == true and (.strategyUsed | test("HITL|blocked|rejected")))' "$REPORT" >/dev/null
jq -e 'any(.scenarios[]; .name == "provider-timeout" and .success == true and (.provider | test("fallback|dom")))' "$REPORT" >/dev/null
jq -e 'any(.scenarios[]; .name == "long-running-soak" and .success == true and (.health.memoryGrowthMb <= 75))' "$REPORT" >/dev/null
Pass: report passes, wrong-click count is zero, visual fallback helps the visual-only scenario, ambiguous/unsafe cases do not click blindly, parser failure falls back, soak memory growth stays bounded, and the report records the MCP calls used for each scenario.
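The same pass conditions can also be evaluated inside a test, which may be convenient for the mock-level CI run. The following is a sketch under assumptions: the scenario names, field names, and the 75 MB bound are taken from the checks in this issue, while `findFailedChecks`, the input interfaces, and the failure labels are hypothetical.

```typescript
// Only the fields the pass checks read; names follow the report format.
interface ScenarioCheckInput {
  name: string;
  provider: string;
  success: boolean;
  wrongClicks: number;
  strategyUsed: string;
  health?: { memoryGrowthMb: number };
}

interface ReportCheckInput {
  scenarios: ScenarioCheckInput[];
  summary: { pass: boolean };
}

// Hypothetical helper: returns a label for every failed check; empty means pass.
function findFailedChecks(report: ReportCheckInput): string[] {
  const failures: string[] = [];
  // True when a successful scenario with this name also satisfies `pred`.
  const has = (name: string, pred: (s: ScenarioCheckInput) => boolean): boolean =>
    report.scenarios.some((s) => s.name === name && s.success && pred(s));

  if (report.summary.pass !== true) failures.push("summary.pass");
  if (!report.scenarios.every((s) => s.wrongClicks === 0)) failures.push("wrongClicks");
  if (!has("canvas-visual-only", (s) => s.strategyUsed === "S7_VISUAL_GROUNDING")) {
    failures.push("canvas-visual-only");
  }
  if (!has("ambiguous-visual", (s) => /HITL|blocked|rejected/.test(s.strategyUsed))) {
    failures.push("ambiguous-visual");
  }
  if (!has("provider-timeout", (s) => /fallback|dom/.test(s.provider))) {
    failures.push("provider-timeout");
  }
  // Assumption: a missing health summary counts as a failure.
  if (!has("long-running-soak", (s) => s.health !== undefined && s.health.memoryGrowthMb <= 75)) {
    failures.push("long-running-soak");
  }
  return failures;
}
```

One deliberate difference from a jq comparison: in jq a missing `.health.memoryGrowthMb` evaluates to null, which orders below every number, so `null <= 75` is true. This sketch treats a missing value as a failure so that a soak run without a health summary cannot pass silently.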
Directionality / Fit Check
This issue is necessary because the proposed OmniParser-inspired changes are valuable only if they measurably reduce wandering and preserve OpenChrome's long-running reliability. It is a harness/evaluation issue, not feature creep.
Curated scope, overlap handling, and verification checklist
Scope classification
Canonical lane: vision harness evaluation.
Primary deliverable: visual grounding and wandering regression harness for OmniParser adoption series.
Open PR: none currently linked; create a new PR only after checking for newer overlapping PRs.
Non-goal: runtime behavior change, real GPU/OmniParser requirement in CI, or promotion of visual fallback without evidence.
Related / Sequencing
Required scenarios
vision_find/interact cycles should not leak unbounded memory or leave tabs/processes behind.
Metrics to collect
For each scenario:
/health or existing metrics endpoint when enabled
Output format
Write a JSON report:
{
  "version": 1,
  "runId": "...",
  "openchromeVersion": "...",
  "scenarios": [
    {
      "name": "canvas-visual-only",
      "provider": "omniparser-http-mock",
      "success": true,
      "toolCalls": 4,
      "wrongClicks": 0,
      "stuckHints": 0,
      "latencyMs": 1234,
      "strategyUsed": "S7_VISUAL_GROUNDING",
      "artifacts": ["..."]
    }
  ],
  "summary": { "pass": true, "failures": [] }
}
Non-goals
Acceptance Criteria
dist/index.js over OpenChrome MCP and produce a JSON report.
Curated scope, overlap handling, and verification checklist
Scope classification
Overlap and conflict resolution
Implementation checklist
Success criteria
Post-merge OpenChrome live verification checklist