You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Webwright's Microsoft Research blog reports a concrete cost-curve datapoint on Online-Mind2Web:
"First 50 steps cost delivers 82% accuracy and the next 50 steps delivers 3–4 additional points."
GPT-5.4 average $2.37 / task, Claude $6.09 / task. This is exactly the kind of marginal-utility signal we need to operationalize cost discipline on top of the task envelope budgets landed in #1082 (#1034).
Scope
Extend the task envelope budget machinery to expose a marginal-utility tracker: per-step success-probability estimate (from oc_assert deltas + journal signals).
Add an early-stop rule: stop when Δsuccess_probability / Δstep < threshold for N consecutive steps, even if the hard step budget is not exhausted.
Emit a per-episode cost ledger (step, tokens, latency) so we can publish the same 50→100 step curve for our agent on the Online-Mind2Web corpus (sister issue).
Why
Webwright's Microsoft Research blog reports a concrete cost-curve datapoint on Online-Mind2Web:
GPT-5.4 average $2.37 / task, Claude $6.09 / task. This is exactly the kind of marginal-utility signal we need to operationalize cost discipline on top of the task envelope budgets landed in #1082 (#1034).
Scope
Δsuccess_probability / Δstep < thresholdfor N consecutive steps, even if the hard step budget is not exhausted.Acceptance