benchmark: episode budget shaper and early-stop using Webwright's marginal-utility curve #1428

## Why

Webwright's [Microsoft Research blog](https://www.microsoft.com/en-us/research/articles/webwright-a-terminal-is-all-you-need-for-web-agents/) reports a concrete cost-curve datapoint on Online-Mind2Web:

> "First 50 steps cost delivers 82% accuracy and the next 50 steps delivers 3–4 additional points."

GPT-5.4 averages **\$2.37 / task** and Claude **\$6.09 / task**. This is exactly the kind of marginal-utility signal we need to operationalize cost discipline on top of the task envelope budgets landed in #1082 (#1034).

## Scope

- Extend the task envelope budget machinery to expose a **marginal-utility tracker**: per-step success-probability estimate (from oc_assert deltas + journal signals).
- Add an **early-stop rule**: stop when `Δsuccess_probability / Δstep < threshold` for N consecutive steps, even if the hard step budget is not exhausted.
- Emit a per-episode cost ledger (step, tokens, latency) so we can publish the same 50→100 step curve for our agent on the Online-Mind2Web corpus (sister issue).
- Stay independent of the broker, lease, and shared-profile paths: this is a pure agent-side budget shaper, so there is no SSOT (#1359) conflict.
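
The tracker, early-stop rule, and ledger described above could fit together roughly as in the sketch below. Everything in it is an assumption made for illustration: `createBudgetShaper`, its option names (`hardStepBudget`, `threshold`, `patience`), and the ledger field names do not exist in the repo, and the per-step `successProbability` is taken as an input rather than derived from oc_assert deltas or journal signals.

```javascript
// Hypothetical sketch of an agent-side budget shaper; all names are illustrative.
function createBudgetShaper({ hardStepBudget, threshold, patience }) {
  const ledger = [];            // per-episode cost ledger, one entry per step
  let lowGainStreak = 0;        // consecutive steps with marginal gain below threshold
  let previousProbability = 0;  // success-probability estimate after the previous step

  return {
    // Record one step and report whether the episode should stop.
    recordStep({ tokens, latencyMs, successProbability }) {
      const step = ledger.length + 1;
      // One call is one step, so the per-step gain is just the probability delta.
      const marginalGain = successProbability - previousProbability;
      previousProbability = successProbability;
      ledger.push({ step, tokens, latencyMs, successProbability, marginalGain });

      lowGainStreak = marginalGain < threshold ? lowGainStreak + 1 : 0;

      if (step >= hardStepBudget) return { stop: true, reason: 'hard-budget' };
      if (lowGainStreak >= patience) return { stop: true, reason: 'early-stop' };
      return { stop: false, reason: null };
    },
    // Return a copy so callers cannot mutate the shaper's internal state.
    getLedger() {
      return ledger.slice();
    },
  };
}

// Usage: gains flatten after step 3, so the rule fires on the third flat step.
const shaper = createBudgetShaper({ hardStepBudget: 100, threshold: 0.02, patience: 3 });
const estimates = [0.2, 0.5, 0.7, 0.71, 0.715, 0.716];
let decision = { stop: false, reason: null };
for (const successProbability of estimates) {
  decision = shaper.recordStep({ tokens: 1000, latencyMs: 800, successProbability });
  if (decision.stop) break;
}
console.log(decision.reason, shaper.getLedger().length);
```

Resetting the streak on any step that clears the threshold keeps a single productive step from being cut off by earlier flat ones; whether that is the right policy for noisy estimates is an open design question.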

## Acceptance

- [ ] New `benchmark/episode-budget-shaper.mjs` (or equivalent) wired into the live runner.
- [ ] Per-task cost-vs-success curve emitted into the JSON results and rendered in BENCHMARK-REPORT.md.
- [ ] Headline gate updated to require the curve before claiming any cost-efficiency headline (#1310 parity).
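
As a rough illustration of the curve and gate criteria above, the sketch below aggregates per-episode ledgers into a cost-vs-success curve and refuses a cost-efficiency headline when the curve is missing. The function names, the `succeededAtStep` field, and the `costSuccessCurve` key in the results JSON are all hypothetical; the real schema and the #1310 gate may look different.

```javascript
// Hypothetical aggregation: for each step budget, what fraction of tasks had
// succeeded by then, and how many tokens did the average task spend to get there?
function buildCostSuccessCurve(episodes, stepBudgets) {
  return stepBudgets.map((stepBudget) => {
    let successes = 0;
    let totalTokens = 0;
    for (const episode of episodes) {
      // succeededAtStep is null when the task never succeeded.
      if (episode.succeededAtStep !== null && episode.succeededAtStep <= stepBudget) {
        successes += 1;
      }
      for (const entry of episode.ledger) {
        if (entry.step <= stepBudget) totalTokens += entry.tokens;
      }
    }
    const count = episodes.length;
    return {
      stepBudget,
      successRate: count ? successes / count : 0,
      meanTokens: count ? totalTokens / count : 0,
    };
  });
}

// Hypothetical gate: a cost-efficiency headline requires a non-empty curve.
function canClaimCostHeadline(results) {
  return Array.isArray(results.costSuccessCurve) && results.costSuccessCurve.length > 0;
}

// Usage with two toy episodes (made-up numbers, not benchmark results).
const episodes = [
  { succeededAtStep: 2, ledger: [{ step: 1, tokens: 100 }, { step: 2, tokens: 100 }] },
  {
    succeededAtStep: null,
    ledger: [{ step: 1, tokens: 100 }, { step: 2, tokens: 100 }, { step: 3, tokens: 100 }],
  },
];
const curve = buildCostSuccessCurve(episodes, [1, 3]);
console.log(JSON.stringify(curve));
```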
