test(harness): shared episode runner with contract-eval results and anti-wandering metrics

**Tier:** tests / tooling first; no production `src/**` dependency unless explicitly called out below  
**PR target:** `develop`  
**Source analysis:** CUA-Bench-style `reset → step → evaluate` episode harness, adapted to OpenChrome's portability-harness contract.

## Why

OpenChrome has strong tool-call reliability primitives, Outcome Contracts, recording, checkpoints, journals, and hinting. What is still missing is a **small, repeatable episode harness** that measures whether those primitives improve full-task outcomes over time.

CUA-Bench's transferable idea is not its VM/sandbox stack. The useful piece is the benchmark shape:

`task spec → reset browser state → run bounded steps → evaluate with deterministic criteria → emit per-episode metrics`

OpenChrome should adopt that shape as a tests/tooling layer so reliability work can be measured without turning the MCP server into an autonomous agent runtime.

This issue is intentionally narrower than #851:

- #851 owns a WebVoyager-style benchmark dataset and real/mock LLM adapters.
- This issue owns the **generic episode harness substrate and result schema** that #851 and future benchmarks should reuse.

Integration rule with #851: whichever PR lands second must reuse the shared `EpisodeTaskSpec`, `EpisodeResult`, reporter, and mock-adapter contract from this issue. If #851 has already created overlapping runner code, this issue must refactor that code into `tests/benchmark/episode-harness/` rather than adding a second runner.

## Directionality / fit check

This aligns with `docs/roadmap/portability-harness-contract.md`:

- No server-side LLM calls.
- No new production tool behavior.
- No mandatory credentials.
- No platform-specific runtime dependency.
- The harness is a client/test utility of OpenChrome, not a continuous autonomous loop inside the server.

## Proposed implementation

Add a reusable benchmark package under `tests/benchmark/episode-harness/`.

### 1. Task spec

Define a JSON/TS task format:

```ts
export interface EpisodeTaskSpec {
  id: string;
  title: string;
  startUrl: string;
  goal: string;
  maxSteps: number;          // hard cap, default 30, max 100
  maxDurationMs: number;     // hard cap, default 120_000, max 600_000
  success: Assertion;        // existing src/contracts/types.ts DSL only
  setup?: {
    clearCookies?: boolean;
    viewport?: { width: number; height: number };
  };
  tags?: string[];
}
```

Rules:

- `success` reuses the existing Outcome Contract DSL. No new assertion operators in this issue.
- `maxSteps` and `maxDurationMs` are required after defaults are applied.
- Public tasks must avoid login, payment, captcha, live prices, news, current dates, or user-specific state.

### 2. Episode runner

Create `runEpisode(task, adapter, openchromeClient)` with this lifecycle:

1. Start or connect to an OpenChrome MCP server.
2. Reset browser state as specified by the task's `setup`.
3. Navigate to `startUrl`.
4. Repeatedly ask the adapter for the next tool call until:
   - success contract passes,
   - adapter emits done,
   - `maxSteps` reached,
   - `maxDurationMs` reached,
   - OpenChrome returns unrecoverable transport/tool failure.
5. Evaluate `success` via the existing contract evaluator against the live page.
6. Emit a normalized `EpisodeResult`.

The default CI adapter must be deterministic and not call an LLM. Real LLM adapters remain opt-in and credential-gated.
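As a rough illustration of the lifecycle above, the bounded loop might look like the following. This is a minimal sketch, not the harness's confirmed API: `Adapter`, `Client`, `budgetStop`, and `runLoop` are hypothetical names, `evaluateSuccess` stands in for the existing contract evaluator, and mapping an adapter `done` to a final pass/fail evaluation is an assumption, since the issue does not assign that stop condition a status.

```ts
type EpisodeStatus =
  | "passed" | "failed" | "timeout" | "max_steps" | "adapter_error" | "tool_error";

// Hypothetical shapes for illustration only.
interface ToolCall { tool: string; args: Record<string, unknown> }
type AdapterTurn = { kind: "call"; call: ToolCall } | { kind: "done" };
interface Adapter { next(stepIndex: number): Promise<AdapterTurn> }
interface Client {
  callTool(call: ToolCall): Promise<{ ok: boolean; fatal?: boolean }>;
  evaluateSuccess(): Promise<boolean>; // stand-in for the contract evaluator
}
interface Budget { maxSteps: number; maxDurationMs: number }

// Pure budget check, evaluated before every step so the loop cannot overrun its caps.
function budgetStop(steps: number, elapsedMs: number, budget: Budget): EpisodeStatus | null {
  if (elapsedMs >= budget.maxDurationMs) return "timeout";
  if (steps >= budget.maxSteps) return "max_steps";
  return null;
}

async function runLoop(
  budget: Budget,
  adapter: Adapter,
  client: Client,
  now: () => number = Date.now, // injectable clock keeps tests deterministic
): Promise<{ status: EpisodeStatus; steps: number; durationMs: number }> {
  const startedAt = now();
  let steps = 0;
  let status: EpisodeStatus | null = null;

  while (status === null) {
    status = budgetStop(steps, now() - startedAt, budget);
    if (status !== null) break;

    // A rejected adapter promise is treated as an adapter error.
    const turn = await adapter.next(steps).catch(() => null);
    if (turn === null) { status = "adapter_error"; break; }
    if (turn.kind === "done") break; // status is decided by the final evaluation

    steps += 1;
    const result = await client.callTool(turn.call);
    if (!result.ok && result.fatal) { status = "tool_error"; break; }

    // Stop early as soon as the success contract passes.
    if (await client.evaluateSuccess()) status = "passed";
  }

  // Assumption: an adapter "done" resolves to passed/failed via one last evaluation.
  if (status === null) status = (await client.evaluateSuccess()) ? "passed" : "failed";
  return { status, steps, durationMs: now() - startedAt };
}
```

Keeping the budget check in a pure function makes the max-step and timeout guards unit-testable without a browser, and the injected clock avoids wall-clock flakiness in CI.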

### 3. Result schema

```ts
export interface EpisodeResult {
  runId: string;
  taskId: string;
  status: 'passed' | 'failed' | 'timeout' | 'max_steps' | 'adapter_error' | 'tool_error';
  success: boolean;
  steps: number;
  durationMs: number;
  toolCalls: number;
  openchromeErrors: number;
  noProgressEpisodes: number;
  finalUrl: string;
  failedContract?: unknown;
  artifacts: {
    eventsJsonl: string;
    reportJson: string;
    screenshotDir?: string;
  };
}
```

`noProgressEpisodes` is computed deterministically:

- If `oc_progress_status` exists, count each step whose returned `status` is `stalling` or `stuck`.
- Otherwise, apply the fallback rule: each run of `consecutive tool errors >= 3`, or of `same tool called successfully >= 3 times with unchanged final URL`, counts as one no-progress episode.
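The counting rule above can be sketched as a pure function over recorded steps. This is one possible reading, stated as an assumption: a run is counted once, at the moment it reaches the threshold of 3, and "unchanged final URL" is taken to mean the URL observed after each call. `StepRecord` and `countNoProgressEpisodes` are hypothetical names.

```ts
// Hypothetical per-step record; field names are illustrative.
interface StepRecord {
  tool: string;
  ok: boolean;
  urlAfter: string;
  progressStatus?: string; // value returned by oc_progress_status, when available
}

function countNoProgressEpisodes(steps: readonly StepRecord[], threshold = 3): number {
  // Primary rule: if progress statuses were recorded, count stalling/stuck steps.
  if (steps.some((s) => s.progressStatus !== undefined)) {
    return steps.filter(
      (s) => s.progressStatus === "stalling" || s.progressStatus === "stuck",
    ).length;
  }

  // Fallback rule: count each qualifying run once, when it first reaches the threshold.
  let episodes = 0;
  let errorRun = 0;   // consecutive failed tool calls
  let repeatRun = 0;  // consecutive successful calls to the same tool at the same URL
  let prev: StepRecord | undefined;

  for (const step of steps) {
    errorRun = step.ok ? 0 : errorRun + 1;

    const repeatsPrev =
      prev !== undefined && prev.ok && prev.tool === step.tool && prev.urlAfter === step.urlAfter;
    repeatRun = step.ok ? (repeatsPrev ? repeatRun + 1 : 1) : 0;

    if (errorRun === threshold || repeatRun === threshold) episodes += 1;
    prev = step;
  }
  return episodes;
}
```

Under this reading, four consecutive tool errors count as one no-progress episode rather than two.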

### 4. CLI / npm scripts

Add:

```bash
npm run bench:episode -- --tasks tests/benchmark/episode-harness/fixtures --adapter mock
npm run bench:episode:mock
```

The command writes:

- `reports/<run-id>.json`
- `reports/<run-id>.md`
- `events/<run-id>.jsonl`
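As an illustration of what the reporter might produce for the Markdown and JSONL outputs listed above, here is a minimal sketch. The function names, the `EpisodeSummary` subset, and the table layout are assumptions, not a confirmed report format.

```ts
// Subset of EpisodeResult fields used for the summary; illustrative only.
interface EpisodeSummary {
  taskId: string;
  status: string;
  steps: number;
  durationMs: number;
  toolCalls: number;
  openchromeErrors: number;
  finalUrl: string;
}

// Render a concise Markdown summary: one headline count plus one table row per episode.
function renderMarkdownReport(runId: string, results: readonly EpisodeSummary[]): string {
  const passed = results.filter((r) => r.status === "passed").length;
  const rows = results.map(
    (r) =>
      `| ${r.taskId} | ${r.status} | ${r.steps} | ${r.durationMs} | ` +
      `${r.toolCalls} | ${r.openchromeErrors} | ${r.finalUrl} |`,
  );
  return [
    `# Episode report ${runId}`,
    "",
    `Passed ${passed} of ${results.length} episodes.`,
    "",
    "| Task | Status | Steps | Duration (ms) | Tool calls | OpenChrome errors | Final URL |",
    "| --- | --- | --- | --- | --- | --- | --- |",
    ...rows,
  ].join("\n") + "\n";
}

// Serialize events as JSON Lines: one JSON object per line, newline-terminated.
function toJsonl(events: readonly object[]): string {
  return events.map((e) => JSON.stringify(e)).join("\n") + "\n";
}
```

Both helpers return strings rather than writing files, so the file layout stays a separate decision in the CLI.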

### 5. Fixture tasks

Add 3 deterministic fixture tasks, all hosted locally or on immutable public pages:

1. `example-h1` — navigate to `https://example.com`, assert `h1` contains `Example Domain`.
2. `local-form-submit` — local fixture page, fill a form, assert success text.
3. `local-recovery-stall` — local fixture with one intentionally wrong first action in the mock transcript, used to verify failure metrics and bounded stopping.
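A deterministic mock adapter for fixtures like these could simply replay a fixed transcript of tool calls. The sketch below is an assumption about how that might look: `createTranscriptAdapter`, the `click` tool name, and the selectors are hypothetical, and the real adapter contract may well be asynchronous.

```ts
// Hypothetical shapes for illustration; not the harness's confirmed adapter contract.
interface ToolCall { tool: string; args: Record<string, unknown> }
type AdapterTurn = { kind: "call"; call: ToolCall } | { kind: "done" };

// Replays a fixed list of tool calls in order, then reports "done".
// No LLM and no randomness, so every run yields the same sequence.
function createTranscriptAdapter(transcript: readonly ToolCall[]) {
  let cursor = 0;
  return {
    next(): AdapterTurn {
      const call = transcript[cursor];
      if (call === undefined) return { kind: "done" };
      cursor += 1;
      // Copy args so callers cannot mutate the shared transcript.
      return { kind: "call", call: { tool: call.tool, args: { ...call.args } } };
    },
    reset(): void {
      cursor = 0;
    },
  };
}

// Illustrative transcript for `local-recovery-stall`: the first action is wrong on purpose.
// Tool names other than `read_page` and all selectors are made up for this example.
const recoveryStallTranscript: ToolCall[] = [
  { tool: "click", args: { selector: "#wrong-button" } },
  { tool: "click", args: { selector: "#submit" } },
  { tool: "read_page", args: {} },
];
```

Running this fixture with a step budget of 2 would exhaust the budget before the transcript finishes, which is the situation the max-step guard is meant to catch.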

## Acceptance criteria

- [ ] `tests/benchmark/episode-harness/` contains typed task spec, runner, result schema, reporter, and mock adapter.
- [ ] The task spec validator rejects unknown fields, missing caps, unsupported contract shapes, and non-positive step/time budgets.
- [ ] `npm run bench:episode:mock` completes the three fixture tasks deterministically in ≤ 90 seconds on a developer machine.
- [ ] Reports include pass/fail, steps, duration, tool calls, OpenChrome errors, final URL, and failed contract evidence.
- [ ] The harness never imports or invokes an LLM provider in mock mode.
- [ ] No production OpenChrome MCP tool behavior changes.
- [ ] No new runtime dependency in `dependencies`; dev-only dependencies are allowed only if justified in the PR.
- [ ] Documentation added at `docs/benchmarks/episode-harness.md` explaining task authoring, caps, adapter boundaries, and how this relates to #851.

## Real verification after merge using OpenChrome

Run these commands from a clean checkout:

```bash
npm ci
npm run build
npm run bench:episode:mock -- --out /tmp/openchrome-episode-harness
```

### Scenario 1 — deterministic mock reproducibility

```bash
npm run bench:episode:mock -- --out /tmp/episode-run-1
npm run bench:episode:mock -- --out /tmp/episode-run-2
jq -S 'del(.runId, .startedAt, .endedAt, .durationMs)' /tmp/episode-run-1/report.json > /tmp/r1.norm.json
jq -S 'del(.runId, .startedAt, .endedAt, .durationMs)' /tmp/episode-run-2/report.json > /tmp/r2.norm.json
diff -u /tmp/r1.norm.json /tmp/r2.norm.json
```

**Pass:** empty diff.

### Scenario 2 — real OpenChrome tool path is used

Run with verbose logging:

```bash
OPENCHROME_BENCH_VERBOSE=1 npm run bench:episode:mock -- --task example-h1 --out /tmp/episode-real-tool-path
```

**Pass:** `events.jsonl` contains MCP `navigate`, at least one `read_page` or `tabs_context` page-state call, and a final Outcome Contract evaluation record. The task passes.

### Scenario 3 — max-step guard works

Run the `local-recovery-stall` fixture with `--max-steps 2`.

**Pass:** the result status is `max_steps`, `steps === 2`, and the report names the task and the last tool call. The process exits cleanly without hanging.

### Scenario 4 — contract failure is visible

Temporarily change the `example-h1` expected text to `NotPresent` in a throwaway branch and rerun.

**Pass:** status is `failed`, `failedContract` identifies the `dom_text` assertion, and the report includes the observed text preview.

## Out of scope

- New OpenChrome MCP tools.
- Server-side LLM decisions.
- Browser VM/container management.
- Large public benchmark datasets; #851 owns the first WebVoyager-style suite.
- Statistical multi-run studies.

## Dependencies / references

- Complements #851.
- Reuses `src/contracts/types.ts` and `src/contracts/evaluate.ts`.
- Inspired by CUA-Bench's episode shape, not by its sandbox/cloud/runtime stack.

## Reviewer checklist for ambiguity

- [ ] Does every stop condition have an explicit status?
- [ ] Can a reviewer run the harness without API keys?
- [ ] Are task-authoring restrictions concrete enough to avoid flaky live-web tasks?
- [ ] Is the division of work with #851 clear enough to avoid duplicate work?



## Curated scope, overlap handling, and verification checklist

### Scope classification
- **Canonical lane:** test harness episode evaluation.
- **Primary deliverable:** shared episode runner producing contract-eval results and anti-wandering metrics.
- **Open PR:** none currently linked; create a new PR only after checking for newer overlapping PRs.
- **Non-goal:** production runtime dependency, VM/sandbox stack, server-side LLM optimizer, or replacing unit tests.

### Overlap and conflict resolution
- [ ] Complements #1051 parallel runner; this issue owns episode semantics and metrics, not runner mechanics.
- [ ] Consumes Outcome Contracts, hints, journals, and recordings without changing their runtime behavior.
- [ ] Can feed #1047 certification but should remain reusable test tooling.

### Implementation checklist
- [ ] Define reset -> step -> evaluate episode API/fixtures with deterministic local pages.
- [ ] Collect success/failure, contract result, tool count, repeated loop count, recovery latency, and artifact pointers.
- [ ] Emit machine-readable results and concise summaries.
- [ ] Add tests for passing episode, failing contract, wandering/repeated calls, reset isolation, and artifact generation.
- [ ] Document how to add new episodes.

### Success criteria
- [ ] Episodes are repeatable and local-first.
- [ ] Results quantify task outcome and wandering metrics.
- [ ] No production `src/**` dependency is introduced unless explicitly justified.
- [ ] Artifacts are usable by certification/benchmark workflows.

### Post-merge OpenChrome live verification checklist
- [ ] Run a small episode suite and verify JSON result artifacts.
- [ ] Inspect a passing and failing episode result for contract-eval and anti-wandering fields.
- [ ] Verify reset isolation between episodes.
- [ ] Attach command, exit code, and artifact path.
