test(harness): shared episode runner with contract-eval results and anti-wandering metrics #1058

@shaun0927

Tier: tests / tooling first; no production src/** dependency unless explicitly called out below
PR target: develop
Source analysis: CUA-Bench-style reset → step → evaluate episode harness, adapted to OpenChrome's portability-harness contract.

Why

OpenChrome has strong tool-call reliability primitives, Outcome Contracts, recording, checkpoints, journals, and hinting. What is still missing is a small, repeatable episode harness that measures whether those primitives improve full-task outcomes over time.

CUA-Bench's transferable idea is not its VM/sandbox stack. The useful piece is the benchmark shape:

task spec → reset browser state → run bounded steps → evaluate with deterministic criteria → emit per-episode metrics

OpenChrome should adopt that shape as a tests/tooling layer so reliability work can be measured without turning the MCP server into an autonomous agent runtime.

This issue is intentionally narrower than #851:

Integration rule with #851: whichever PR lands second must reuse the shared EpisodeTaskSpec, EpisodeResult, reporter, and mock-adapter contract from this issue. If #851 has already created overlapping runner code, this issue must refactor that code into tests/benchmark/episode-harness/ rather than adding a second runner.

Directionality / fit check

This aligns with docs/roadmap/portability-harness-contract.md:

  • No server-side LLM calls.
  • No new production tool behavior.
  • No mandatory credentials.
  • No platform-specific runtime dependency.
  • The harness is a client/test utility of OpenChrome, not a continuous autonomous loop inside the server.

Proposed implementation

Add a reusable benchmark package under tests/benchmark/episode-harness/.

1. Task spec

Define a JSON/TS task format:

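// Type-only import; the relative path assumes this file sits directly in tests/benchmark/episode-harness/.
import type { Assertion } from '../../../src/contracts/types';

export interface EpisodeTaskSpec {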
  id: string;
  title: string;
  startUrl: string;
  goal: string;
  maxSteps: number;          // hard cap, default 30, max 100
  maxDurationMs: number;     // hard cap, default 120_000, max 600_000
  success: Assertion;        // existing src/contracts/types.ts DSL only
  setup?: {
    clearCookies?: boolean;
    viewport?: { width: number; height: number };
  };
  tags?: string[];
}

Rules:

  • success reuses the existing Outcome Contract DSL. No new assertion operators in this issue.
  • maxSteps and maxDurationMs are required after defaults are applied.
  • Public tasks must avoid login, payment, captcha, live prices, news, current dates, or user-specific state.
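
The defaults and hard caps above could be enforced by a small normalizer. The following is a minimal sketch, not the proposed implementation: `normalizeTaskSpec`, `resolveBudget`, `LIMITS`, and `ALLOWED_KEYS` are hypothetical names, `success` is treated as an opaque value rather than validated against the real Outcome Contract DSL, and the sketch assumes an omitted budget falls back to its default rather than being rejected.

```typescript
// Hypothetical helper: applies the defaults and hard caps from the task spec rules.
// `success` is treated as opaque here; the real type is the Outcome Contract DSL.
type RawTaskSpec = Record<string, unknown>;

const LIMITS = {
  maxSteps: { def: 30, max: 100 },
  maxDurationMs: { def: 120_000, max: 600_000 },
} as const;

const ALLOWED_KEYS = new Set([
  "id", "title", "startUrl", "goal", "maxSteps", "maxDurationMs", "success", "setup", "tags",
]);

// Returns the default when the budget is omitted; rejects non-positive or over-cap values.
function resolveBudget(name: keyof typeof LIMITS, value: unknown): number {
  const { def, max } = LIMITS[name];
  if (value === undefined) return def;
  if (typeof value !== "number" || !Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer`);
  }
  if (value > max) throw new Error(`${name} exceeds hard cap ${max}`);
  return value;
}

function normalizeTaskSpec(raw: RawTaskSpec) {
  for (const key of Object.keys(raw)) {
    if (!ALLOWED_KEYS.has(key)) throw new Error(`unknown field: ${key}`);
  }
  for (const key of ["id", "title", "startUrl", "goal"]) {
    if (typeof raw[key] !== "string" || raw[key] === "") {
      throw new Error(`missing or empty field: ${key}`);
    }
  }
  if (raw.success === undefined || raw.success === null) {
    throw new Error("missing success contract");
  }
  return {
    ...raw,
    maxSteps: resolveBudget("maxSteps", raw.maxSteps),
    maxDurationMs: resolveBudget("maxDurationMs", raw.maxDurationMs),
  };
}
```

After normalization both budgets are always present, which is one way to read "required after defaults are applied".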

2. Episode runner

Create runEpisode(task, adapter, openchromeClient) with this lifecycle:

  1. Start or connect to an OpenChrome MCP server.
  2. Reset browser state as specified by the task's setup (for example clearCookies and viewport).
  3. Navigate to startUrl.
  4. Repeatedly ask the adapter for the next tool call until:
    • success contract passes,
    • adapter emits done,
    • maxSteps reached,
    • maxDurationMs reached,
    • OpenChrome returns unrecoverable transport/tool failure.
  5. Evaluate success via the existing contract evaluator against the live page.
  6. Emit a normalized EpisodeResult.

The default CI adapter must be deterministic and not call an LLM. Real LLM adapters remain opt-in and credential-gated.
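
The lifecycle above can be sketched as a bounded loop. This is an illustration under stated assumptions, not the runner itself: the `Adapter`, `Client`, and `Budget` interfaces and the `runEpisodeSketch` name are hypothetical stand-ins, `evaluateSuccess` stands in for the existing contract evaluator, and the mapping from stop reason to status is one possible choice.

```typescript
// Sketch of the bounded episode loop. All interfaces here are hypothetical
// stand-ins; the real MCP client and contract evaluator are not shown.
type Status = "passed" | "failed" | "timeout" | "max_steps" | "adapter_error" | "tool_error";

interface ToolCall { tool: string; args: Record<string, unknown> }
interface Adapter { next(step: number): Promise<ToolCall | "done"> }
interface Client {
  reset(): Promise<void>;
  callTool(call: ToolCall): Promise<{ ok: boolean; fatal?: boolean }>;
  evaluateSuccess(): Promise<boolean>; // wraps the existing contract evaluator
}
interface Budget { startUrl: string; maxSteps: number; maxDurationMs: number }

async function runEpisodeSketch(task: Budget, adapter: Adapter, client: Client) {
  const started = Date.now();
  let steps = 0;
  let openchromeErrors = 0;
  let stop: Status | "done" | "contract" = "done";

  await client.reset();
  await client.callTool({ tool: "navigate", args: { url: task.startUrl } });

  while (true) {
    if (await client.evaluateSuccess()) { stop = "contract"; break; }
    if (steps >= task.maxSteps) { stop = "max_steps"; break; }
    if (Date.now() - started >= task.maxDurationMs) { stop = "timeout"; break; }

    // A rejected adapter promise is mapped to adapter_error.
    const call = await adapter.next(steps).catch(() => null);
    if (call === null) { stop = "adapter_error"; break; }
    if (call === "done") { stop = "done"; break; }

    steps += 1;
    const res = await client.callTool(call);
    if (!res.ok) openchromeErrors += 1;
    if (res.fatal) { stop = "tool_error"; break; }
  }

  // The final evaluation always runs against the live page (lifecycle step 5).
  const success = await client.evaluateSuccess();
  const status: Status =
    stop === "contract" || stop === "done" ? (success ? "passed" : "failed") : stop;
  return { status, success, steps, openchromeErrors, durationMs: Date.now() - started };
}
```

Checking the step and time budgets before asking the adapter for another call keeps the loop bounded even if the adapter never emits done.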

3. Result schema

export interface EpisodeResult {
  runId: string;
  taskId: string;
  status: 'passed' | 'failed' | 'timeout' | 'max_steps' | 'adapter_error' | 'tool_error';
  success: boolean;
  steps: number;
  durationMs: number;
  toolCalls: number;
  openchromeErrors: number;
  noProgressEpisodes: number;
  finalUrl: string;
  failedContract?: unknown;
  artifacts: {
    eventsJsonl: string;
    reportJson: string;
    screenshotDir?: string;
  };
}

The noProgressEpisodes calculation is deterministic. If oc_progress_status exists, count each step whose returned status is stalling or stuck. Otherwise use the fallback rule and treat either of the following as one no-progress episode: 3 or more consecutive tool errors, or the same tool called successfully 3 or more times with an unchanged final URL.
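
The fallback rule could be computed from the per-step event log roughly as follows. This is a sketch under assumptions: `StepRecord` and its field names are hypothetical, the "same tool" condition is read as consecutive calls, and each maximal run that reaches the threshold is counted once.

```typescript
// Hypothetical per-step record; field names are illustrative.
interface StepRecord { tool: string; ok: boolean; urlAfter: string }

// Fallback rule: each maximal run of >= 3 consecutive tool errors, or of >= 3
// consecutive successful calls to the same tool ending on the same URL,
// counts as one no-progress episode.
function countNoProgressEpisodes(steps: StepRecord[], threshold = 3): number {
  let episodes = 0;
  let runKey = "";
  let runLength = 0;

  // Closes the current run and counts it if it reached the threshold.
  const flush = () => {
    if (runLength >= threshold) episodes += 1;
  };

  for (const step of steps) {
    const key = step.ok ? `ok:${step.tool}:${step.urlAfter}` : "error";
    if (key === runKey) {
      runLength += 1;
    } else {
      flush();
      runKey = key;
      runLength = 1;
    }
  }
  flush();
  return episodes;
}
```

Because the function depends only on the recorded steps, two runs with identical event logs yield the same count.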

4. CLI / npm scripts

Add:

npm run bench:episode -- --tasks tests/benchmark/episode-harness/fixtures --adapter mock
npm run bench:episode:mock

The command writes:

  • reports/<run-id>.json
  • reports/<run-id>.md
  • events/<run-id>.jsonl

5. Fixture tasks

Add 3 deterministic fixture tasks, all hosted locally or on immutable public pages:

  1. example-h1 — navigate to https://example.com, assert h1 contains Example Domain.
  2. local-form-submit — local fixture page, fill a form, assert success text.
  3. local-recovery-stall — local fixture with one intentionally wrong first action in the mock transcript, used to verify failure metrics and bounded stopping.
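
A deterministic mock adapter for these fixtures could simply replay a fixed transcript. The sketch below assumes a hypothetical `createTranscriptAdapter` factory and `ToolCall` shape; the `click` tool name and the selectors are illustrative placeholders, not confirmed OpenChrome tool names or fixture markup.

```typescript
// Deterministic mock adapter sketch: replays a fixed transcript, then emits "done".
// The ToolCall shape and the tool names below are illustrative assumptions.
interface ToolCall { tool: string; args: Record<string, unknown> }

function createTranscriptAdapter(transcript: readonly ToolCall[]) {
  let cursor = 0;
  return {
    // Returns the next scripted call, or "done" once the transcript is exhausted.
    next(): ToolCall | "done" {
      if (cursor >= transcript.length) return "done";
      const call = transcript[cursor];
      cursor += 1;
      return call;
    },
  };
}

// local-recovery-stall style transcript: the first action is deliberately wrong.
const recoveryStallAdapter = createTranscriptAdapter([
  { tool: "click", args: { selector: "#wrong-button" } },
  { tool: "click", args: { selector: "#submit" } },
]);
```

Since the adapter holds no randomness and makes no network or LLM calls, repeated runs produce the same sequence of tool calls.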

Acceptance criteria

  • tests/benchmark/episode-harness/ contains typed task spec, runner, result schema, reporter, and mock adapter.
  • The task spec validator rejects unknown fields, missing caps, unsupported contract shapes, and non-positive step/time budgets.
  • npm run bench:episode:mock completes the three fixture tasks deterministically in ≤ 90 seconds on a developer machine.
  • Reports include pass/fail, steps, duration, tool calls, OpenChrome errors, final URL, and failed contract evidence.
  • The harness never imports or invokes an LLM provider in mock mode.
  • No production OpenChrome MCP tool behavior changes.
  • No new runtime dependency in dependencies; dev-only dependencies are allowed only if justified in the PR.
  • Documentation added at docs/benchmarks/episode-harness.md explaining task authoring, caps, adapter boundaries, and how this harness relates to #851 (test(bench): WebVoyager-style contract-eval benchmark, 10 tasks, mock+real adapters, no src/ changes).

Real verification after merge using OpenChrome

Run these commands from a clean checkout:

npm ci
npm run build
npm run bench:episode:mock -- --out /tmp/openchrome-episode-harness

Scenario 1 — deterministic mock reproducibility

npm run bench:episode:mock -- --out /tmp/episode-run-1
npm run bench:episode:mock -- --out /tmp/episode-run-2
jq -S 'del(.runId, .startedAt, .endedAt, .durationMs)' /tmp/episode-run-1/report.json > /tmp/r1.norm.json
jq -S 'del(.runId, .startedAt, .endedAt, .durationMs)' /tmp/episode-run-2/report.json > /tmp/r2.norm.json
diff -u /tmp/r1.norm.json /tmp/r2.norm.json

Pass: empty diff.
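
The jq filter above removes only top-level keys. If the report nests per-episode results (an assumption; the report layout is not fixed in this issue), volatile fields such as durationMs would also appear inside each episode. A recursive comparison could look like the sketch below, where `stripVolatile` and `reportsMatch` are hypothetical helpers. Artifact paths that embed the run id or output directory are not handled here and would need their own normalization.

```typescript
// Recursively drops volatile keys so two reports can be compared structurally.
const VOLATILE = new Set(["runId", "startedAt", "endedAt", "durationMs"]);

function stripVolatile(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => stripVolatile(item));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    // Sorted keys give a stable serialization, similar to jq -S.
    for (const key of Object.keys(source).sort()) {
      if (!VOLATILE.has(key)) out[key] = stripVolatile(source[key]);
    }
    return out;
  }
  return value;
}

function reportsMatch(a: unknown, b: unknown): boolean {
  return JSON.stringify(stripVolatile(a)) === JSON.stringify(stripVolatile(b));
}
```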

Scenario 2 — real OpenChrome tool path is used

Run with verbose logging:

OPENCHROME_BENCH_VERBOSE=1 npm run bench:episode:mock -- --task example-h1 --out /tmp/episode-real-tool-path

Pass: events.jsonl contains MCP navigate, at least one read_page or tabs_context page-state call, and a final Outcome Contract evaluation record. The task passes.

Scenario 3 — max-step guard works

Run the local-recovery-stall fixture with --max-steps 2.

Pass: result status is max_steps, steps === 2, and report names the task and last tool call. The process exits cleanly without hanging.

Scenario 4 — contract failure is visible

Temporarily change the example-h1 expected text to NotPresent in a throwaway branch and rerun.

Pass: status is failed, failedContract identifies the dom_text assertion, and the report includes the observed text preview.

Out of scope

Dependencies / references

Reviewer checklist for ambiguity

Curated scope, overlap handling, and verification checklist

Scope classification

  • Canonical lane: test harness episode evaluation.
  • Primary deliverable: shared episode runner producing contract-eval results and anti-wandering metrics.
  • Open PR: none currently linked; create a new PR only after checking for newer overlapping PRs.
  • Non-goal: production runtime dependency, VM/sandbox stack, server-side LLM optimizer, or replacing unit tests.

Overlap and conflict resolution

Implementation checklist

  • Define reset -> step -> evaluate episode API/fixtures with deterministic local pages.
  • Collect success/failure, contract result, tool count, repeated loop count, recovery latency, and artifact pointers.
  • Emit machine-readable results and concise summaries.
  • Add tests for passing episode, failing contract, wandering/repeated calls, reset isolation, and artifact generation.
  • Document how to add new episodes.

Success criteria

  • Episodes are repeatable and local-first.
  • Results quantify task outcome and wandering metrics.
  • No production src/** dependency is introduced unless explicitly justified.
  • Artifacts are usable by certification/benchmark workflows.

Post-merge OpenChrome live verification checklist

  • Run a small episode suite and verify JSON result artifacts.
  • Inspect a passing and failing episode result for contract-eval and anti-wandering fields.
  • Verify reset isolation between episodes.
  • Attach command, exit code, and artifact path.

Metadata

    Labels

    P1: P1 high
    enhancement: New feature or request
    observability: Observability
    outcome-contracts: Verifiable execution via pre/post-condition contracts (Q2)
