feat(vision): optional OmniParser-compatible HTTP perception provider (OmniParser adoption B) #1053

@shaun0927

Tier: pilot / optional integration (external service only; core stays dependency-light)
PR target: develop
Series: OmniParser adoption B
Priority: P2 — useful for canvas/native-like UIs after the perception snapshot contract exists
Depends on: OmniParser adoption A (PerceptionSnapshot provider contract)

Background

OmniParser's strongest transferable capability is screenshot-only UI grounding: given a screenshot, it returns structured text/icon elements with bounding boxes and captions. OpenChrome should support that capability as an optional provider, not as a bundled dependency.

This issue adds an OmniParser-compatible HTTP adapter that talks to an already-running parser service. The adapter must be safe-by-default: disabled unless configured, bounded by timeout/budget, and never required for normal OpenChrome operation.

Proposed Implementation

Add an optional perception provider:

  • src/vision/providers/omniparser-http-provider.ts
  • config/env support in existing config surfaces:
    • OPENCHROME_VISION_PROVIDER=dom|omniparser-http
    • OPENCHROME_OMNIPARSER_URL=http://127.0.0.1:8000/parse/
    • OPENCHROME_OMNIPARSER_TIMEOUT_MS=3000 (default; clamped lower when the global tool deadline is tighter)
    • OPENCHROME_OMNIPARSER_MAX_ELEMENTS=200
  • provider health/diagnostic metadata in vision_find(format='snapshot'|'both')
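
A minimal sketch of how these variables might be resolved. The helper name loadOmniParserConfig, the OmniParserConfig shape, and the rule that a missing URL leaves the provider on dom are illustrative assumptions, not OpenChrome's actual config API. Only the variable names and defaults come from the list above.

```typescript
// Hypothetical config shape; field names are illustrative only.
type VisionProvider = "dom" | "omniparser-http";

interface OmniParserConfig {
  provider: VisionProvider;
  url: string | null;
  timeoutMs: number;
  maxElements: number;
}

// Parse a positive integer, using the fallback on missing or invalid input.
function positiveInt(raw: string | undefined, fallback: number): number {
  const n = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

function loadOmniParserConfig(
  env: Record<string, string | undefined>,
  remainingDeadlineMs?: number,
): OmniParserConfig {
  const url = env["OPENCHROME_OMNIPARSER_URL"] ?? null;
  // Opt-in only: the provider must be requested explicitly and have a URL (assumption).
  const enabled =
    env["OPENCHROME_VISION_PROVIDER"] === "omniparser-http" && url !== null && url !== "";
  let timeoutMs = positiveInt(env["OPENCHROME_OMNIPARSER_TIMEOUT_MS"], 3000);
  if (remainingDeadlineMs !== undefined) {
    // Never wait longer than the enclosing tool deadline allows.
    timeoutMs = Math.min(timeoutMs, Math.max(0, remainingDeadlineMs));
  }
  return {
    provider: enabled ? "omniparser-http" : "dom",
    url: enabled ? url : null,
    timeoutMs,
    maxElements: positiveInt(env["OPENCHROME_OMNIPARSER_MAX_ELEMENTS"], 200),
  };
}
```

Passing the environment in as a plain record, rather than reading it globally, keeps the resolution logic easy to unit-test without touching the real process environment.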

Adapter behavior

  1. Capture screenshot through existing guarded screenshot path.
  2. POST { "base64_image": "..." } to the configured OmniParser-compatible endpoint.
  3. Accept response shapes compatible with OmniParser:
    • parsed_content_list[]
    • optional som_image_base64
    • optional latency
  4. Convert ratio bbox coordinates into PerceptionElement.bbox (CSS pixels) and keep the original ratios in bboxRatio.
  5. Map element types:
    • OmniParser text -> text, interactive=false unless response says otherwise
    • OmniParser icon -> icon by default, or control when the response reports interactivity as true
  6. Truncate labels and cap the element count (OPENCHROME_OMNIPARSER_MAX_ELEMENTS).
  7. On timeout/unavailable/malformed response, return a warning and fall back to the DOM provider when fallback is enabled.
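
Steps 4 to 6 can be sketched as a pure mapping function. This assumes each parsed_content_list entry carries type, bbox as [x1, y1, x2, y2] ratios, interactivity, and content. SketchElement, toElements, and the 120-character label cap are hypothetical; the real PerceptionElement shape is defined by adoption A.

```typescript
// Hypothetical element shape; the real PerceptionElement comes from adoption A.
interface SketchElement {
  source: "omniparser-http";
  kind: "text" | "icon" | "control";
  label: string;
  interactive: boolean;
  bbox: { x: number; y: number; width: number; height: number }; // CSS pixels
  bboxRatio: [number, number, number, number]; // [x1, y1, x2, y2] in 0..1
}

interface Viewport { width: number; height: number }

const clamp01 = (v: number): number => Math.min(1, Math.max(0, v));

function isRatioBox(v: unknown): v is [number, number, number, number] {
  return Array.isArray(v) && v.length === 4 &&
    v.every((n) => typeof n === "number" && Number.isFinite(n));
}

function toElements(
  parsedContentList: unknown,
  viewport: Viewport,
  maxElements = 200,
  maxLabelChars = 120, // assumption: the issue does not name a label limit
): { elements: SketchElement[]; warnings: string[] } {
  if (!Array.isArray(parsedContentList)) {
    return { elements: [], warnings: ["parsed_content_list missing or not an array"] };
  }
  const elements: SketchElement[] = [];
  const warnings: string[] = [];
  let skipped = 0;
  for (const item of parsedContentList) {
    if (elements.length >= maxElements) {
      warnings.push(`element count truncated to ${maxElements}`);
      break;
    }
    const rec = (typeof item === "object" && item !== null ? item : {}) as Record<string, unknown>;
    const type = rec["type"];
    const box = rec["bbox"];
    if ((type !== "text" && type !== "icon") || !isRatioBox(box)) { skipped++; continue; }
    // Normalize corner order and clamp to the viewport before scaling.
    const x1 = clamp01(Math.min(box[0], box[2]));
    const y1 = clamp01(Math.min(box[1], box[3]));
    const x2 = clamp01(Math.max(box[0], box[2]));
    const y2 = clamp01(Math.max(box[1], box[3]));
    const interactive = rec["interactivity"] === true;
    const kind = type === "text" ? "text" : interactive ? "control" : "icon";
    const content = typeof rec["content"] === "string" ? rec["content"] : "";
    elements.push({
      source: "omniparser-http",
      kind,
      label: content.slice(0, maxLabelChars),
      interactive,
      bbox: {
        x: x1 * viewport.width,
        y: y1 * viewport.height,
        width: (x2 - x1) * viewport.width,
        height: (y2 - y1) * viewport.height,
      },
      bboxRatio: [x1, y1, x2, y2],
    });
  }
  if (skipped > 0) warnings.push(`skipped ${skipped} malformed entries`);
  return { elements, warnings };
}
```

Malformed entries are skipped and counted rather than thrown on, so one bad box from the parser degrades the snapshot instead of failing the whole call.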

Fallback policy

  • Default provider remains dom.
  • If omniparser-http is configured and fails, vision_find should return a clear warning and optionally include DOM fallback results.
  • A failing external parser must not fail unrelated MCP tools or session creation.
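
The policy above could be expressed as a small pure function that turns a provider outcome into the snapshot vision_find returns. All names here (ProviderOutcome, resolveSnapshot, SketchSnapshot, the 200-character warning cap) are hypothetical; the DOM snapshot is supplied by the caller, so no assumption is made about how the DOM provider works.

```typescript
// Hypothetical, simplified snapshot shape for illustration.
interface SketchSnapshot {
  provider: string;
  elements: { source: string; label: string }[];
  warnings: string[];
}

type ProviderOutcome =
  | { ok: true; snapshot: SketchSnapshot }
  | { ok: false; reason: "timeout" | "unavailable" | "malformed"; detail?: string };

// Assumption: cap warning length so parser output cannot flood tool responses.
const MAX_WARNING_CHARS = 200;

function resolveSnapshot(
  outcome: ProviderOutcome,
  domFallback: () => SketchSnapshot,
  fallbackEnabled: boolean,
): SketchSnapshot {
  if (outcome.ok) return outcome.snapshot;
  const suffix = outcome.detail ? `: ${outcome.detail}` : "";
  const warning = `omniparser-http ${outcome.reason}${suffix}`.slice(0, MAX_WARNING_CHARS);
  if (!fallbackEnabled) {
    // Report the failure as data; never throw into unrelated tools or session setup.
    return { provider: "omniparser-http", elements: [], warnings: [warning] };
  }
  const dom = domFallback();
  // Mark the fallback explicitly so it is never silent.
  return { ...dom, warnings: [warning, "fell back to DOM provider", ...dom.warnings] };
}
```

Because the failure is returned as a value with warnings rather than raised, a parser outage stays local to the vision_find response.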

Non-goals

  • Do not vendor Microsoft OmniParser code, model weights, Python, Torch, or Docker setup.
  • Do not make network calls to public third-party parser services by default.
  • Do not add a built-in Windows VM or OmniTool clone.
  • Do not use visual-only candidates for automatic clicking in this issue.

Acceptance Criteria

  • OmniParser-compatible HTTP provider is disabled unless explicitly configured.
  • Provider calls respect tool deadlines and OPENCHROME_OMNIPARSER_TIMEOUT_MS.
  • Malformed/unavailable parser responses produce bounded warnings and do not crash the server.
  • parsed_content_list entries are converted into valid PerceptionSnapshot elements.
  • Element count and label size are bounded by configuration/defaults.
  • DOM fallback remains available and is clearly marked in warnings/metadata.
  • No OmniParser model/runtime dependency is added to package.json.
  • Unit tests use a mocked HTTP parser and cover success, timeout, malformed response, bounds, and fallback.
  • npm run build && npm test -- --runInBand omniparser vision passes, and the full npm run build && npm test && npm run lint:tier passes before PR completion.

Verification (post-merge, via OpenChrome MCP)

Record artifacts under scripts/verify/omniparser-adoption-B-http-provider/.

Setup

npm ci
npm run build
mkdir -p scripts/verify/omniparser-adoption-B-http-provider
node tests/fixtures/sites/vision-perception/serve.mjs &
FIX_PID=$!
node tests/fixtures/omniparser-mock/server.mjs --port 9901 --mode success > /tmp/omniparser-mock.log 2>&1 &
MOCK_PID=$!
PORT=9892
OPENCHROME_VISION_PROVIDER=omniparser-http \
OPENCHROME_OMNIPARSER_URL=http://127.0.0.1:9901/parse/ \
node dist/index.js --http "$PORT" > /tmp/openchrome-omniparser-provider.log 2>&1 &
OC_PID=$!
sleep 1
mcp() { curl -s -H 'content-type: application/json' -d "$1" "http://localhost:$PORT/mcp"; }

Mock parser requirements:

  • /parse/ returns two elements: Search icon and Continue button, with ratio bboxes.
  • It can switch to timeout/malformed mode through a test endpoint or restart flag.
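
One way to picture the mock's behavior per mode is a function that picks the reply for POST /parse/. This is a sketch only: mockParseReply, MockMode, and the concrete bbox values are invented for illustration and say nothing about how tests/fixtures/omniparser-mock/server.mjs is actually written.

```typescript
type MockMode = "success" | "timeout" | "malformed";

interface MockReply { status: number; body: string }

// Returns the reply for POST /parse/, or null when the mock should hold the
// connection open so the client runs into its own timeout.
function mockParseReply(mode: MockMode): MockReply | null {
  switch (mode) {
    case "success":
      return {
        status: 200,
        body: JSON.stringify({
          parsed_content_list: [
            // Ratio bboxes as [x1, y1, x2, y2]; the values are arbitrary fixtures.
            { type: "icon", bbox: [0.05, 0.05, 0.1, 0.1], interactivity: true, content: "Search" },
            { type: "icon", bbox: [0.4, 0.8, 0.6, 0.9], interactivity: true, content: "Continue" },
          ],
          latency: 0.01,
        }),
      };
    case "malformed":
      // Deliberately invalid JSON to exercise the adapter's error path.
      return { status: 200, body: "{not json" };
    case "timeout":
      return null;
  }
}
```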

Scenario 1 — configured provider returns OmniParser-sourced elements

mcp '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"navigate","arguments":{"url":"http://localhost:9991/perception.html"}}}' >/tmp/oc-nav-B.json
TAB=$(jq -r '.result.content[0].text | fromjson | .tabId' /tmp/oc-nav-B.json)
RESP=$(mcp "$(jq -nc --arg tab "$TAB" '{jsonrpc:"2.0",id:2,method:"tools/call",params:{name:"vision_find",arguments:{tabId:$tab,format:"snapshot",includeImage:false}}}')")
echo "$RESP" | tee scripts/verify/omniparser-adoption-B-http-provider/omniparser-success.json
BODY=$(echo "$RESP" | jq -r '.result.content[0].text | (fromjson? // .)')
echo "$BODY" | jq -e '.provider == "omniparser-http" and any(.elements[]; .source == "omniparser-http" and (.label | test("Continue|Search")))' >/dev/null

Pass: snapshot is provider-marked and includes mock OmniParser elements.

Scenario 2 — parser timeout falls back without crashing

curl -s -X POST http://127.0.0.1:9901/mode/timeout >/dev/null
RESP2=$(mcp "$(jq -nc --arg tab "$TAB" '{jsonrpc:"2.0",id:3,method:"tools/call",params:{name:"vision_find",arguments:{tabId:$tab,format:"snapshot",includeImage:false}}}')")
echo "$RESP2" | tee scripts/verify/omniparser-adoption-B-http-provider/omniparser-timeout-fallback.json
BODY2=$(echo "$RESP2" | jq -r '.result.content[0].text | (fromjson? // .)')
echo "$BODY2" | jq -e '(.warnings | length) >= 1 and (.provider == "dom-annotator" or any(.elements[]; .source == "dom-annotator"))' >/dev/null

Pass: timeout is visible in warnings and OpenChrome returns a fallback result instead of a server failure.

Scenario 3 — provider is opt-in

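kill $OC_PID
wait $OC_PID 2>/dev/null || true
PORT=9893
node dist/index.js --http "$PORT" > /tmp/openchrome-omniparser-provider-default.log 2>&1 &
OC_PID=$!
sleep 1
# Navigate again: the restarted server may not recognize the $TAB from the previous process.
mcp '{"jsonrpc":"2.0","id":4,"method":"tools/call","params":{"name":"navigate","arguments":{"url":"http://localhost:9991/perception.html"}}}' >/tmp/oc-nav-B-default.json
TAB=$(jq -r '.result.content[0].text | fromjson | .tabId' /tmp/oc-nav-B-default.json)
RESP3=$(curl -s -H 'content-type: application/json' -d "$(jq -nc --arg tab "$TAB" '{jsonrpc:"2.0",id:5,method:"tools/call",params:{name:"vision_find",arguments:{tabId:$tab,format:"snapshot",includeImage:false}}}')" "http://localhost:$PORT/mcp")
echo "$RESP3" | tee scripts/verify/omniparser-adoption-B-http-provider/default-provider.json
! grep -q "omniparser-http" scripts/verify/omniparser-adoption-B-http-provider/default-provider.json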

Pass: without env/config, the default provider is not OmniParser.

Cleanup

kill $MOCK_PID $FIX_PID $OC_PID
wait $MOCK_PID $FIX_PID $OC_PID 2>/dev/null || true

Directionality / Fit Check

This adds only an adapter boundary. It intentionally avoids making OmniParser a transitive dependency or changing OpenChrome's default lightweight browser harness behavior.

Curated scope, overlap handling, and verification checklist

Implementation checklist

  • Implement provider registration/config behind an explicit opt-in flag/env/config value.
  • Map screenshots/request metadata to the OmniParser-compatible HTTP payload and map response boxes/labels/confidence into PerceptionSnapshot elements.
  • Enforce timeouts, response-size bounds, schema validation, and clear error messages for unavailable/malformed services.
  • Keep dependencies optional and lightweight; core install and core CI must pass without a real OmniParser service.
  • Add mocked HTTP provider tests for success, timeout, non-2xx response, malformed JSON, large response, and disabled config.
  • Document local setup using a mock or separately managed OmniParser-compatible endpoint.
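
The validation items in this checklist (non-2xx, oversized body, malformed JSON, missing list) might be checked before any mapping happens. A minimal sketch, assuming the adapter already has the HTTP status and the body as text; checkParserResponse, ParseCheck, and the 8,000,000-character default bound are hypothetical.

```typescript
type ParseCheck =
  | { ok: true; parsedContentList: unknown[] }
  | { ok: false; error: string };

function checkParserResponse(
  status: number,
  rawBody: string,
  maxBodyChars = 8000000, // assumption: generous enough for an optional som_image_base64
): ParseCheck {
  if (status < 200 || status >= 300) {
    return { ok: false, error: `parser returned HTTP ${status}` };
  }
  // Check size before parsing so a huge body is rejected cheaply.
  if (rawBody.length > maxBodyChars) {
    return { ok: false, error: `parser response exceeds ${maxBodyChars} characters` };
  }
  let data: unknown;
  try {
    data = JSON.parse(rawBody);
  } catch {
    return { ok: false, error: "parser response is not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, error: "parser response is not a JSON object" };
  }
  const list = (data as Record<string, unknown>)["parsed_content_list"];
  if (!Array.isArray(list)) {
    return { ok: false, error: "parsed_content_list is missing or not an array" };
  }
  return { ok: true, parsedContentList: list };
}
```

Each failure maps to a short, fixed-format message, which lines up with the mocked test cases listed above: non-2xx response, malformed JSON, and large response.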

Post-merge OpenChrome live verification checklist

  • Start a local mock OmniParser-compatible HTTP server returning a known fixture response.
  • Configure OpenChrome to use the optional provider and call the relevant perception/vision surface against a local fixture page.
  • Verify returned element boxes/labels/confidence match the mock response and conform to the contract from #1052 (feat(vision): provider-neutral perception snapshots for vision_find, OmniParser adoption A).
  • Stop the mock server and verify OpenChrome reports a bounded provider error rather than hanging or falling back silently.

Labels

  • P2: Medium priority
  • enhancement: New feature or request
  • host-integration: Wires module cores into host (CDP, MCP, tools, transports, OS APIs)
  • live-verification: Requires live OpenChrome/browser validation after implementation
  • performance: Performance, latency, throughput, or resource-use improvement
  • reliability: Reliability and stability improvement
