feat(harness): task envelope budgets to bound browser-agent wandering #1034

@shaun0927

Why

The Goose comparison identified a gap that is not covered by OpenChrome's existing browser-level resilience: OpenChrome can keep Chrome/CDP healthy, but it does not yet expose a task-level harness envelope that bounds a host agent's browser work across many tool calls.

This is intentionally not an agent loop inside OpenChrome. The server must continue to satisfy the portability-harness contract:

  • no server-side LLM calls
  • no autonomous task planning
  • no change to existing tool behavior unless the caller opts into a task envelope
  • facts and guardrails only; the host agent still decides what to do next

Related work:

Scope

Add an opt-in task envelope that lets a host agent declare objective, budgets, and policy constraints for a browser task, then lets OpenChrome record per-tool progress and return deterministic budget/wandering signals.

New/updated tool surface

Add or extend task tools from #855 with these fields. If #855 has not landed yet, implement behind the same task-ledger storage module rather than adding a second ledger.

export interface TaskEnvelopePolicy {
  maxToolCalls?: number;              // default: unset; hard upper bound when set
  maxWallMs?: number;                  // default: unset
  maxConsecutiveSameTool?: number;     // default: 5
  maxObservationStreak?: number;       // default: 6; read_page/find/tabs_context/screenshot only
  maxFailureStreak?: number;           // default: 4
  maxSameUrlNavigations?: number;      // default: 3 per URL within task
  allowedDomains?: string[];           // optional additive narrowing over existing domain guard
  checkpointEveryCalls?: number;       // optional; used by the follow-up checkpoint issue
}

export interface TaskEnvelope {
  task_id: string;
  objective: string;
  phase: 'explore' | 'act' | 'verify' | 'recover' | 'done';
  policy: TaskEnvelopePolicy;
}
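
The defaults named in the comments above could be applied when a task starts. The following is a minimal sketch, not the project's implementation: `resolvePolicy` is a hypothetical helper name, and the interface is repeated only so the block compiles on its own. Budgets documented as "unset" stay unset.

```typescript
// Repeated from the interface above so this sketch is self-contained.
interface TaskEnvelopePolicy {
  maxToolCalls?: number;
  maxWallMs?: number;
  maxConsecutiveSameTool?: number;
  maxObservationStreak?: number;
  maxFailureStreak?: number;
  maxSameUrlNavigations?: number;
  allowedDomains?: string[];
  checkpointEveryCalls?: number;
}

// Hypothetical helper: fills in the documented defaults (5, 6, 4, 3) and
// leaves every other field exactly as the caller supplied it.
function resolvePolicy(policy: TaskEnvelopePolicy = {}): TaskEnvelopePolicy {
  return {
    ...policy,
    maxConsecutiveSameTool: policy.maxConsecutiveSameTool ?? 5,
    maxObservationStreak: policy.maxObservationStreak ?? 6,
    maxFailureStreak: policy.maxFailureStreak ?? 4,
    maxSameUrlNavigations: policy.maxSameUrlNavigations ?? 3,
  };
}

// Usage: the caller tightens one budget and sets a hard call limit.
const examplePolicy = resolvePolicy({ maxToolCalls: 40, maxObservationStreak: 3 });
console.log(examplePolicy.maxObservationStreak, examplePolicy.maxFailureStreak);
```

Using `??` rather than object spread over a defaults object means an explicitly passed `undefined` still falls back to the default instead of erasing it.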

Minimum tool operations:

  1. oc_task_start accepts objective, optional policy, optional initial phase.
  2. Every OpenChrome tool call may accept an optional taskId. When present, the call is recorded against that task.
  3. oc_task_get returns:
    • current counters
    • latest phase
    • budget status
    • last 10 meaningful events
    • deterministic recommended_next when a budget is near/exceeded
  4. oc_task_update updates phase and optional notes without executing browser actions.
  5. oc_task_finish closes the envelope with a status of completed | failed | cancelled and a final note.
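
To make the oc_task_get fields concrete, here is one possible response shape and a deterministic way to derive it. This is a sketch under stated assumptions: `TaskSnapshot`, `budgetStatus`, and `buildSnapshot` are hypothetical names, the `'ok'` and `'near'` labels and the 80% "near" threshold are assumptions, and only `'exceeded'` and `'change_strategy_or_verify'` come from the issue text. Because the issue names no recommendation for the "near" case, the sketch fills `recommended_next` only when a budget is exceeded.

```typescript
type Phase = 'explore' | 'act' | 'verify' | 'recover' | 'done';
// Only 'exceeded' appears in the issue; 'ok' and 'near' are assumed labels.
type BudgetStatus = 'ok' | 'near' | 'exceeded';

// Hypothetical response shape for oc_task_get.
interface TaskSnapshot {
  task_id: string;
  phase: Phase;
  counters: { toolCalls: number; observationStreak: number };
  budget_status: BudgetStatus;
  recommended_next?: string;
  events: string[]; // last 10 meaningful events, oldest first
}

// Assumption: "near" means at least nearRatio of the limit has been used.
function budgetStatus(used: number, limit: number | undefined, nearRatio = 0.8): BudgetStatus {
  if (limit === undefined) return 'ok'; // unset budgets never trip
  if (used > limit) return 'exceeded';
  return used >= limit * nearRatio ? 'near' : 'ok';
}

// Pure function of recorded facts: the same inputs always give the same
// snapshot, so no LLM or planner is involved.
function buildSnapshot(
  task_id: string,
  phase: Phase,
  counters: { toolCalls: number; observationStreak: number },
  maxObservationStreak: number | undefined,
  allEvents: readonly string[],
): TaskSnapshot {
  const budget_status = budgetStatus(counters.observationStreak, maxObservationStreak);
  return {
    task_id,
    phase,
    counters,
    budget_status,
    ...(budget_status === 'exceeded' ? { recommended_next: 'change_strategy_or_verify' } : {}),
    events: allEvents.slice(-10),
  };
}
```

A real implementation would presumably fold several budgets into one status; this sketch checks only the observation streak to keep the derivation easy to follow.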

Non-goals

  • Do not add an LLM planner, prompt generator, or automatic browser action executor.
  • Do not block normal tools when taskId is absent.
  • Do not replace HintEngine, ProgressTracker, Ralph, or circuit breakers; aggregate their signals at task level.
  • Do not introduce new native dependencies or network services.

Implementation notes

  • Store task envelopes under the task ledger root from #855 (feat(core): oc_task_ledger, the persistent async task table with cancel & wait from mcp-browser-use adoption A), or under ~/.openchrome/tasks/ if this issue is implemented first.
  • Use atomic JSON/JSONL writes and existing writeFileAtomicSafe / lock helpers.
  • Add a small shared helper, for example src/core/task-envelope/budget.ts, that receives a normalized ToolCallEvent and returns budget transitions.
  • Observation tool classification must be table-driven and tested. Start with read_page, find, tabs_context, page_screenshot, and computer (only when called with the screenshot action).
  • Budget exceedance should return a structured warning/error in the task state, not kill the MCP server or Chrome.
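
The budget helper and the table-driven classification described in these notes might look roughly like the sketch below. The issue does not define the fields of ToolCallEvent or the shape of a budget transition, so `ToolCallEvent`, `StreakState`, `StreakLimits`, `isObservation`, and `applyEvent` are all hypothetical here. The function is pure: it returns data describing which budgets were crossed and never throws, which fits the requirement that exceedance is reported in task state rather than killing the server.

```typescript
// Hypothetical normalized event; the issue does not define its fields.
interface ToolCallEvent {
  tool: string;
  action?: string; // sub-action, e.g. for the `computer` tool
  ok: boolean;     // whether the tool call succeeded
}

interface StreakLimits {
  maxConsecutiveSameTool: number;
  maxObservationStreak: number;
  maxFailureStreak: number;
}

interface StreakState {
  lastTool?: string;
  sameTool: number;
  observation: number;
  failure: number;
}

// Classification table: tool name -> predicate. Adding a tool is one line.
const OBSERVATION_TOOLS = new Map<string, (e: ToolCallEvent) => boolean>([
  ['read_page', () => true],
  ['find', () => true],
  ['tabs_context', () => true],
  ['page_screenshot', () => true],
  ['computer', (e) => e.action === 'screenshot'],
]);

function isObservation(e: ToolCallEvent): boolean {
  return OBSERVATION_TOOLS.get(e.tool)?.(e) ?? false;
}

// Returns the next streak state plus the budgets crossed by this event.
// Each crossing is reported once, on the call that first exceeds the limit.
function applyEvent(
  prev: StreakState,
  e: ToolCallEvent,
  limits: StreakLimits,
): { state: StreakState; exceeded: (keyof StreakLimits)[] } {
  const state: StreakState = {
    lastTool: e.tool,
    sameTool: e.tool === prev.lastTool ? prev.sameTool + 1 : 1,
    observation: isObservation(e) ? prev.observation + 1 : 0,
    failure: e.ok ? 0 : prev.failure + 1,
  };
  const crossed = (before: number, after: number, limit: number): boolean =>
    before <= limit && after > limit;
  const exceeded: (keyof StreakLimits)[] = [];
  if (crossed(prev.sameTool, state.sameTool, limits.maxConsecutiveSameTool)) {
    exceeded.push('maxConsecutiveSameTool');
  }
  if (crossed(prev.observation, state.observation, limits.maxObservationStreak)) {
    exceeded.push('maxObservationStreak');
  }
  if (crossed(prev.failure, state.failure, limits.maxFailureStreak)) {
    exceeded.push('maxFailureStreak');
  }
  return { state, exceeded };
}
```

With the default limits (5, 6, 4), seven consecutive successful read_page calls would report `maxConsecutiveSameTool` on the sixth call and `maxObservationStreak` on the seventh; any action call in between resets the observation streak to zero.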

Acceptance criteria

  • oc_task_start/get/update/finish are available and documented.
  • At least navigate, read_page, find, interact, act, javascript_tool, page_screenshot, and tabs_context record to the envelope when taskId is provided.
  • Counters distinguish action calls from observation calls.
  • Repeating the same observation tool beyond maxObservationStreak produces budget_status: 'exceeded' and recommended_next: 'change_strategy_or_verify'.
  • Repeating the same navigation URL beyond maxSameUrlNavigations produces a task-level warning.
  • With no taskId, existing tool outputs remain byte-compatible except for already-approved global metadata changes from other issues.
  • Unit tests cover each budget type and the absent-taskId no-op path.
  • npm run build, targeted Jest tests, and npm run lint:tier pass.
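
Two of these criteria, the same-URL navigation warning and the absent-taskId no-op path, can be illustrated together. This is a sketch only: `recordNavigation`, the module-level `navCounts` map, and the warning string format are hypothetical, and URLs are compared as exact strings because the issue does not say how they should be normalized.

```typescript
// Per-task navigation counts: taskId -> (url -> count). In-memory only;
// the issue's implementation notes call for persisting under the task ledger.
const navCounts = new Map<string, Map<string, number>>();

// Returns a warning string when the URL has been navigated to more than
// maxSameUrlNavigations times within the task, otherwise undefined.
function recordNavigation(
  taskId: string | undefined,
  url: string,
  maxSameUrlNavigations = 3,
): string | undefined {
  // Absent taskId: no bookkeeping at all, so existing tool output is unaffected.
  if (taskId === undefined) return undefined;

  const perTask = navCounts.get(taskId) ?? new Map<string, number>();
  navCounts.set(taskId, perTask);

  const count = (perTask.get(url) ?? 0) + 1;
  perTask.set(url, count);

  return count > maxSameUrlNavigations
    ? `same_url_navigation_exceeded: ${url} visited ${count} times (limit ${maxSameUrlNavigations})`
    : undefined;
}
```

A unit test for the no-op path can assert both that the return value is undefined and that `navCounts` gained no entries, which checks that calls without a taskId leave no trace in task state.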

Self-review checklist for implementer

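OpenChrome live verification checklist

Re-verified on 2026-05-14 after applying the latest merged version. Only items that can be proven directly from OpenChrome responses, local fixtures, and build/test artifacts remain as pass criteria. Conditions such as human review, external site stability, and unverified PR status are excluded from the pass criteria.

Verification targets

Latest version / common runtime verification

  • Applied the latest develop source and confirmed that npm run build passes.
  • Confirmed that npm run lint:tier passes.
  • Confirmed the npm test -- --runInBand result: 504/507 suites passed with 3 skipped, and 6429/6525 tests passed with 96 skipped. The Jest open-handle warning was recorded as a separate runtime risk.
  • oc_connection_health returned a connected status.
  • On a local fixture, observed DOM state changes through the OpenChrome navigate/read_page/interact/javascript_tool path.
  • Confirmed that the core results are reproducible with the same fixture and the same settings.

Per-issue resolution evidence

  • Implementation PR linked to the latest develop: 1082
  • Related test/source evidence exists in the latest tree:
    • src/core/task-ledger/types.ts
    • src/mcp-server.ts
    • src/tools/index.ts
    • src/tools/orchestration.ts
    • src/pilot/dynamic-skills/attachment-defaults.ts
    • src/pilot/dynamic-skills/replay.ts
  • The checklist retains no pass criteria that cannot be reproduced from OpenChrome responses, fixtures, or local artifacts.

Failure / hold criteria

  • If even one check is unmet, the issue is not closed.
  • If a failure reproduces as a defect in the latest code, record the failing OpenChrome call, a response excerpt, and the fixture state as evidence, and open a separate fix PR.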

Metadata

Assignees

No one assigned

Labels

  • P1: P1 high
  • enhancement: New feature or request
  • harness: Execution harness, run lifecycle, recovery, and verification
  • live-verification: Requires live OpenChrome/browser validation after implementation
  • observability: Observability
  • reliability: Reliability and stability improvement
