Skip to content

benchmark: episode budget shaper and early-stop using Webwright's marginal-utility curve #1428

@shaun0927

Description

@shaun0927

Why

Webwright's Microsoft Research blog reports a concrete cost-curve datapoint on Online-Mind2Web:

"First 50 steps cost delivers 82% accuracy and the next 50 steps delivers 3–4 additional points."

GPT-5.4 average $2.37 / task, Claude $6.09 / task. This is exactly the kind of marginal-utility signal we need to operationalize cost discipline on top of the task envelope budgets landed in #1082 (#1034).

Scope

  • Extend the task envelope budget machinery to expose a marginal-utility tracker: per-step success-probability estimate (from oc_assert deltas + journal signals).
  • Add an early-stop rule: stop when Δsuccess_probability / Δstep < threshold for N consecutive steps, even if the hard step budget is not exhausted.
  • Emit a per-episode cost ledger (step, tokens, latency) so we can publish the same 50→100 step curve for our agent on the Online-Mind2Web corpus (sister issue).
  • Independence from broker / lease / shared profile: this is a pure agent-side budget shaper, no SSOT (SSOT: OpenChrome as a host-neutral MCP browser harness #1359) conflict.

Acceptance

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestperformancePerformance, latency, throughput, or resource-use improvementtaskImplementation task

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions