Tier: tests / tooling first; no production src/** dependency unless explicitly called out below
PR target: develop
Source analysis: CUA-Bench-style reset → step → evaluate episode harness, adapted to OpenChrome's portability-harness contract.
Why
OpenChrome has strong tool-call reliability primitives, Outcome Contracts, recording, checkpoints, journals, and hinting. What is still missing is a small, repeatable episode harness that measures whether those primitives improve full-task outcomes over time.
CUA-Bench's transferable idea is not its VM/sandbox stack. The useful piece is the benchmark shape:
task spec → reset browser state → run bounded steps → evaluate with deterministic criteria → emit per-episode metrics
OpenChrome should adopt that shape as a tests/tooling layer so reliability work can be measured without turning the MCP server into an autonomous agent runtime.
This issue is intentionally narrower than #851.
Integration rule with #851: whichever PR lands second must reuse the shared EpisodeTaskSpec, EpisodeResult, reporter, and mock-adapter contract from this issue. If #851 has already created overlapping runner code, this issue must refactor that code into tests/benchmark/episode-harness/ rather than adding a second runner.
Directionality / fit check
This aligns with docs/roadmap/portability-harness-contract.md:
No server-side LLM calls.
No new production tool behavior.
No mandatory credentials.
No platform-specific runtime dependency.
The harness is a client/test utility of OpenChrome, not a continuous autonomous loop inside the server.
Proposed implementation
Add a reusable benchmark package under tests/benchmark/episode-harness/.
1. Task spec
Define a JSON/TS task format:
export interface EpisodeTaskSpec {
  id: string;
  title: string;
  startUrl: string;
  goal: string;
  maxSteps: number; // hard cap, default 30, max 100
  maxDurationMs: number; // hard cap, default 120_000, max 600_000
  success: Assertion; // existing src/contracts/types.ts DSL only
  setup?: {
    clearCookies?: boolean;
    viewport?: { width: number; height: number };
  };
  tags?: string[];
}
Rules:
success reuses the existing Outcome Contract DSL. No new assertion operators in this issue.
maxSteps and maxDurationMs are required after defaults are applied.
Public tasks must avoid login, payment, captcha, live prices, news, current dates, or user-specific state.
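The defaults and caps above could be applied by a small normalizer. This is a minimal sketch, not the harness's actual validator: `normalizeTaskSpec`, `resolveBudget`, and `KNOWN_FIELDS` are hypothetical names, `Assertion` is a stand-in for the real type in src/contracts/types.ts, and rejecting over-cap values (rather than clamping them) is an assumption.

```typescript
// Placeholder only: the real Assertion type lives in src/contracts/types.ts.
type Assertion = Record<string, unknown>;

export interface EpisodeTaskSpec {
  id: string;
  title: string;
  startUrl: string;
  goal: string;
  maxSteps: number;
  maxDurationMs: number;
  success: Assertion;
  setup?: { clearCookies?: boolean; viewport?: { width: number; height: number } };
  tags?: string[];
}

const KNOWN_FIELDS = new Set([
  "id", "title", "startUrl", "goal", "maxSteps", "maxDurationMs", "success", "setup", "tags",
]);

// Apply the default, then enforce "positive integer" and the hard cap.
function resolveBudget(name: string, raw: unknown, fallback: number, cap: number): number {
  const value = raw === undefined ? fallback : raw;
  if (typeof value !== "number" || !Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer`);
  }
  if (value > cap) {
    // Assumption: over-cap values are rejected rather than silently clamped.
    throw new Error(`${name} exceeds the hard cap of ${cap}`);
  }
  return value;
}

export function normalizeTaskSpec(raw: Record<string, unknown>): EpisodeTaskSpec {
  for (const key of Object.keys(raw)) {
    if (!KNOWN_FIELDS.has(key)) throw new Error(`unknown field: ${key}`);
  }
  for (const key of ["id", "title", "startUrl", "goal"]) {
    if (typeof raw[key] !== "string" || raw[key] === "") {
      throw new Error(`${key} must be a non-empty string`);
    }
  }
  const success = raw["success"];
  if (typeof success !== "object" || success === null) {
    throw new Error("success must be an Outcome Contract assertion object");
  }
  const setup = raw["setup"];
  const tags = raw["tags"];
  return {
    id: raw["id"] as string,
    title: raw["title"] as string,
    startUrl: raw["startUrl"] as string,
    goal: raw["goal"] as string,
    maxSteps: resolveBudget("maxSteps", raw["maxSteps"], 30, 100),
    maxDurationMs: resolveBudget("maxDurationMs", raw["maxDurationMs"], 120_000, 600_000),
    success: success as Assertion,
    ...(setup !== undefined ? { setup: setup as NonNullable<EpisodeTaskSpec["setup"]> } : {}),
    ...(tags !== undefined ? { tags: tags as string[] } : {}),
  };
}
```

After normalization both budgets are always present, which is what lets the runner treat them as required. Checking the success shape against the existing contract DSL is left out here because that type is not reproduced in this issue.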
2. Episode runner
Create runEpisode(task, adapter, openchromeClient) with this lifecycle:
Start or connect to an OpenChrome MCP server.
Reset browser state for the task as specified.
Navigate to startUrl.
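Repeatedly ask the adapter for the next tool call until:
maxSteps reached,
maxDurationMs reached.
Evaluate success via the existing contract evaluator against the live page.
Emit EpisodeResult.
The default CI adapter must be deterministic and not call an LLM. Real LLM adapters remain opt-in and credential-gated.
3. Result schema
The noProgressEpisodes calculation is deterministic: if oc_progress_status exists, count each step whose returned status is stalling or stuck; otherwise use the fallback rule: count consecutive tool errors >= 3, OR the same tool called successfully >= 3 times with an unchanged final URL, as one no-progress episode.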
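One possible reading of the noProgressEpisodes rule is sketched below. `StepRecord` and `countNoProgressEpisodes` are hypothetical names, and two details are assumptions rather than part of the rule as written: the "same tool" repeats are treated as consecutive, and a run longer than three counts as a single episode until it is broken.

```typescript
export interface StepRecord {
  tool: string;
  ok: boolean; // false when the tool call returned an error
  finalUrl: string; // page URL after the step
  progressStatus?: string; // status returned by oc_progress_status, when available
}

export function countNoProgressEpisodes(
  steps: readonly StepRecord[],
  hasProgressTool: boolean,
): number {
  if (hasProgressTool) {
    // Primary rule: one count per step reported as stalling or stuck.
    return steps.filter(
      (s) => s.progressStatus === "stalling" || s.progressStatus === "stuck",
    ).length;
  }

  // Fallback rule: a run of 3 errors, or 3 successful calls of the same tool
  // with an unchanged final URL, is counted once when the run reaches 3.
  let episodes = 0;
  let errorRun = 0;
  let repeatRun = 0;
  let prev: StepRecord | undefined;
  for (const step of steps) {
    if (!step.ok) {
      errorRun += 1;
      repeatRun = 0;
      if (errorRun === 3) episodes += 1;
    } else {
      errorRun = 0;
      const sameAsPrev =
        prev !== undefined &&
        prev.ok &&
        prev.tool === step.tool &&
        prev.finalUrl === step.finalUrl;
      repeatRun = sameAsPrev ? repeatRun + 1 : 1;
      if (repeatRun === 3) episodes += 1;
    }
    prev = step;
  }
  return episodes;
}
```

Because the function only reads the recorded steps, two runs over the same event log give the same count, which is the property the rule asks for.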
4. CLI / npm scripts
Add:
npm run bench:episode -- --tasks tests/benchmark/episode-harness/fixtures --adapter mock
npm run bench:episode:mock
The command writes:
reports/<run-id>.json
reports/<run-id>.md
events/<run-id>.jsonl
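A reporter that produces these three artifacts might look roughly like the sketch below. It only builds paths and file contents and leaves the actual writing to the caller. `EpisodeSummary`, `outputPaths`, `renderMarkdownReport`, and `renderEventsJsonl` are hypothetical names, the column set mirrors the report fields listed under the acceptance criteria, and resolving the paths relative to an output directory is an assumption.

```typescript
export interface EpisodeSummary {
  taskId: string;
  passed: boolean;
  steps: number;
  durationMs: number;
  toolCalls: number;
  openchromeErrors: number;
  finalUrl: string;
  failedContract?: string; // evidence for the failed assertion, if any
}

// Assumption: all three artifacts live under one output directory.
export function outputPaths(outDir: string, runId: string) {
  return {
    json: `${outDir}/reports/${runId}.json`,
    markdown: `${outDir}/reports/${runId}.md`,
    events: `${outDir}/events/${runId}.jsonl`,
  };
}

export function renderMarkdownReport(
  runId: string,
  episodes: readonly EpisodeSummary[],
): string {
  const header = [
    `# Episode report ${runId}`,
    "",
    "| task | result | steps | duration (ms) | tool calls | errors | final URL | failed contract |",
    "| --- | --- | --- | --- | --- | --- | --- | --- |",
  ];
  const rows = episodes.map((e) =>
    [
      "",
      e.taskId,
      e.passed ? "pass" : "fail",
      String(e.steps),
      String(e.durationMs),
      String(e.toolCalls),
      String(e.openchromeErrors),
      e.finalUrl,
      e.failedContract ?? "",
      "",
    ].join(" | ").trim(),
  );
  return [...header, ...rows].join("\n") + "\n";
}

// JSON Lines: one serialized event per line, each line newline-terminated.
export function renderEventsJsonl(events: readonly object[]): string {
  return events.map((event) => JSON.stringify(event) + "\n").join("");
}
```

Keeping rendering free of file-system calls makes the reporter easy to compare across runs, since its output is plain strings.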
5. Fixture tasks
Add 3 deterministic fixture tasks, all hosted locally or on immutable public pages:
example-h1 — navigate to https://example.com, assert h1 contains Example Domain.
local-form-submit — local fixture page, fill a form, assert success text.
local-recovery-stall — local fixture with one intentionally wrong first action in the mock transcript, used to verify failure metrics and bounded stopping.
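For fixtures like these, a mock adapter can simply replay a fixed transcript. The sketch below is illustrative only: `ToolCall`, `EpisodeAdapter`, `createMockAdapter`, and `recoveryStallTranscript` are hypothetical names, and the `click` tool and its selector arguments are placeholders rather than confirmed OpenChrome tools. Only `read_page` is taken from this issue.

```typescript
export interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

export interface EpisodeAdapter {
  // Returns the next tool call, or null when the transcript is exhausted.
  next(): Promise<ToolCall | null>;
}

// Replays a fixed list of calls in order; no randomness, no LLM.
export function createMockAdapter(transcript: readonly ToolCall[]): EpisodeAdapter {
  let cursor = 0;
  return {
    async next() {
      const call = transcript[cursor];
      if (call === undefined) return null;
      cursor += 1;
      return call;
    },
  };
}

// Hypothetical transcript in the spirit of local-recovery-stall:
// the first action is intentionally wrong, later ones recover.
export const recoveryStallTranscript: readonly ToolCall[] = [
  { tool: "click", args: { selector: "#wrong-button" } }, // placeholder tool and selector
  { tool: "read_page", args: {} },
  { tool: "click", args: { selector: "#submit" } }, // placeholder tool and selector
];
```

Since the adapter holds no state beyond a cursor, creating a fresh adapter per episode is enough to make repeated runs identical.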
Acceptance criteria
tests/benchmark/episode-harness/ contains typed task spec, runner, result schema, reporter, and mock adapter.
The task spec validator rejects unknown fields, missing caps, unsupported contract shapes, and non-positive step/time budgets.
npm run bench:episode:mock completes the three fixture tasks deterministically in ≤ 90 seconds on a developer machine.
Reports include pass/fail, steps, duration, tool calls, OpenChrome errors, final URL, and failed contract evidence.
The harness never imports or invokes an LLM provider in mock mode.
No production OpenChrome MCP tool behavior changes.
No new runtime dependency in dependencies; dev-only dependencies are allowed only if justified in the PR.
docs/benchmarks/episode-harness.md explains task authoring, caps, adapter boundaries, and how this relates to #851 (test(bench): WebVoyager-style contract-eval benchmark).
Real verification after merge using OpenChrome
Run these commands from a clean checkout:
Scenario 1 — deterministic mock reproducibility
Pass: empty diff.
Scenario 2 — real OpenChrome tool path is used
Run with verbose logging:
OPENCHROME_BENCH_VERBOSE=1 npm run bench:episode:mock -- --task example-h1 --out /tmp/episode-real-tool-path
Pass: events.jsonl contains MCP navigate, at least one read_page or tabs_context page-state call, and a final Outcome Contract evaluation record. The task passes.
Scenario 3 — max-step guard works
Run the local-recovery-stall fixture with --max-steps 2.
Pass: result status is max_steps, steps === 2, and report names the task and last tool call. The process exits cleanly without hanging.
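The guard that Scenario 3 exercises can be sketched as a bounded loop. This is an illustration under stated assumptions, not the runner itself: `runBounded`, `StepSource`, `ToolClient`, and the `passed` and `timeout` status names are hypothetical (only `max_steps` and `failed` appear in this issue), and evaluating the contract even after a budget is exhausted is an assumption.

```typescript
export interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}
export interface StepSource {
  next(): Promise<ToolCall | null>; // null: no further actions
}
export interface ToolClient {
  callTool(call: ToolCall): Promise<void>;
  evaluateSuccess(): Promise<boolean>; // stands in for the contract evaluator
}
export interface Budget {
  maxSteps: number;
  maxDurationMs: number;
}
export interface BoundedResult {
  status: "passed" | "failed" | "max_steps" | "timeout";
  steps: number;
  lastToolCall: ToolCall | null;
}

export async function runBounded(
  budget: Budget,
  adapter: StepSource,
  client: ToolClient,
  now: () => number = Date.now,
): Promise<BoundedResult> {
  const deadline = now() + budget.maxDurationMs;
  let steps = 0;
  let lastToolCall: ToolCall | null = null;
  let stopReason: "max_steps" | "timeout" | null = null;

  for (;;) {
    // Budgets are checked before each step, so the loop cannot overrun maxSteps.
    if (steps >= budget.maxSteps) { stopReason = "max_steps"; break; }
    if (now() >= deadline) { stopReason = "timeout"; break; }
    const call = await adapter.next();
    if (call === null) break;
    await client.callTool(call);
    steps += 1;
    lastToolCall = call;
  }

  const passed = await client.evaluateSuccess();
  const status = passed ? "passed" : (stopReason ?? "failed");
  return { status, steps, lastToolCall };
}
```

With `maxSteps: 2` and an adapter that never stops, this returns `status: "max_steps"` and `steps: 2`. Note that the time budget is only checked between steps, so a single tool call that hangs would still need its own timeout.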
Scenario 4 — contract failure is visible
Temporarily change the example-h1 expected text to NotPresent in a throwaway branch and rerun.
Pass: status is failed, failedContract identifies the dom_text assertion, and the report includes the observed text preview.
Out of scope
Dependencies / references
src/contracts/types.ts and src/contracts/evaluate.ts.
Reviewer checklist for ambiguity
Curated scope, overlap handling, and verification checklist
Scope classification
Overlap and conflict resolution
Implementation checklist
Success criteria
No src/** dependency is introduced unless explicitly justified.
Post-merge OpenChrome live verification checklist