feat(recovery): deterministic evidence reward scorer for progress and recovery #1019

@shaun0927

Context

LATS (Language Agent Tree Search) uses reward/value estimates to select better branches. For OpenChrome, a raw LLM value function is too risky. OpenChrome already has better primitives: outcome contracts, DOM/network/screenshot evidence, progress tracking, and tool result classification. This issue defines a deterministic, evidence-based reward scorer that future recovery ranking and bounded recovery search can use.

Implementation order / dependencies

This should land before #1020 and #1022. #1018 can initially use simple heuristics, but should prefer this scorer once available. The scorer must stay pure/deterministic so it is safe for hot paths and tests.

Relationship to existing issues

This issue should be checked against open issues such as structured recovery hints, action replay/cache, outcome contracts, and observability work before implementation. If an existing issue already covers part of this scope, keep this issue limited to the LATS-inspired recovery/trajectory behavior described here and cross-link rather than duplicate implementation.

Goal

Introduce a deterministic RecoveryRewardScorer that converts tool outcomes and evidence changes into a bounded numeric score for progress, no-op, failure, and recovery. This scorer should be reusable by HintEngine, PlanExecutor recovery, PatternLearner, and future trajectory ledger analysis.

Non-goals / safety constraints

  • Do not call an LLM for reward scoring.
  • Do not change tool success/failure semantics in this issue.
  • Do not auto-execute recovery paths.
  • Do not make screenshot/DOM capture mandatory for every tool call.
  • Do not store large evidence payloads in memory solely for scoring.

Proposed scoring shape

The exact constants can be adjusted during implementation, but the scorer should support these categories:

  • strong positive:
    • outcome contract passed
    • target URL/DOM/network state reached
    • expected data extracted
  • weak positive:
    • page state changed in the intended direction
    • fresh actionable refs discovered after stale-ref failure
  • neutral / low:
    • observation-only call with new information
  • negative:
    • repeated observation without new information
    • stale ref, element not found, timeout
    • auth redirect, blocking page, CAPTCHA
    • repeated same failed tool/ref
  • hard negative / blocked:
    • destructive or transactional action attempted without required gate
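
One way to picture the categories above is as a lookup from an observed signal to a category, and from a category to a bounded score. The following is a minimal sketch only: the signal names, category names, and constants are all hypothetical placeholders (the issue leaves the exact values open), and the real scorer may combine several signals rather than score one at a time.

```typescript
// Hypothetical category names; the issue only describes them in prose.
type RewardCategory = "strong_positive" | "weak_positive" | "neutral" | "negative" | "blocked";

// Placeholder constants, bounded to [-1, 1]. Tune during implementation.
const CATEGORY_SCORE: Readonly<Record<RewardCategory, number>> = {
  strong_positive: 1.0,
  weak_positive: 0.4,
  neutral: 0.1,
  negative: -0.5,
  blocked: -1.0,
};

// Hypothetical signal identifiers, one per bullet in the proposed scoring shape.
type Signal =
  | "contract_passed" | "target_state_reached" | "data_extracted"
  | "page_changed_as_intended" | "fresh_refs_after_stale"
  | "observation_new_info"
  | "observation_repeated" | "stale_ref" | "element_not_found" | "timeout"
  | "auth_redirect" | "blocking_page" | "captcha" | "repeated_failed_call"
  | "ungated_destructive_action";

const SIGNAL_CATEGORY: Readonly<Record<Signal, RewardCategory>> = {
  contract_passed: "strong_positive",
  target_state_reached: "strong_positive",
  data_extracted: "strong_positive",
  page_changed_as_intended: "weak_positive",
  fresh_refs_after_stale: "weak_positive",
  observation_new_info: "neutral",
  observation_repeated: "negative",
  stale_ref: "negative",
  element_not_found: "negative",
  timeout: "negative",
  auth_redirect: "negative",
  blocking_page: "negative",
  captcha: "negative",
  repeated_failed_call: "negative",
  ungated_destructive_action: "blocked",
};

// Pure lookup: the same signal always yields the same category and score.
function scoreSignal(signal: Signal): { category: RewardCategory; score: number } {
  const category = SIGNAL_CATEGORY[signal];
  return { category, score: CATEGORY_SCORE[category] };
}
```

Keeping the constants in one table makes it cheap to adjust them later without touching the classification logic.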

Proposed implementation

  1. Add a small scorer module with typed inputs and outputs:
    • input: previous/current lightweight page/evidence metadata, tool result, progress status, optional contract result, recent-call summary
    • output: score, classification, reasons[], confidence
  2. Prefer existing contract evaluator results when available.
  3. Use hashes/metadata for DOM/screenshot/network deltas rather than raw payloads.
  4. Make the scorer pure and easy to unit test.
  5. Wire it into telemetry only at first: the trajectory ledger and recovery ranking can consume it, but normal tool behavior should not change.
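
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the planned module: every type name, field name, and constant here is hypothetical, contract-failure handling and the separate "recovery" classification are left out, and only the general input/output shape (lightweight metadata in; score, classification, reasons, confidence out) comes from the issue.

```typescript
// Hypothetical lightweight evidence metadata: hashes instead of raw payloads.
interface EvidenceSnapshot {
  url?: string;
  domHash?: string;
  screenshotHash?: string;
  networkHash?: string;
}

// Hypothetical tool outcome labels.
type ToolOutcome =
  | "success" | "stale_ref" | "element_not_found" | "timeout" | "auth_redirect" | "blocking_page";

interface ScorerInput {
  previous?: EvidenceSnapshot;
  current?: EvidenceSnapshot;
  toolOutcome: ToolOutcome;
  contractPassed?: boolean;   // optional contract evaluator result
  repeatedCallCount?: number; // recent-call summary: prior attempts of the same tool/ref
}

interface ScorerOutput {
  score: number;              // bounded to [-1, 1]
  classification: "progress" | "no_progress" | "failure";
  reasons: string[];
  confidence: number;         // bounded to [0, 1]
}

const EVIDENCE_KEYS: ReadonlyArray<keyof EvidenceSnapshot> =
  ["url", "domHash", "screenshotHash", "networkHash"];

// Pure function: no I/O, no clock, no randomness, so equal inputs give equal outputs.
function scoreStep(input: ScorerInput): ScorerOutput {
  // A passing contract is preferred over any heuristic signal.
  if (input.contractPassed === true) {
    return { score: 1, classification: "progress", reasons: ["contract:passed"], confidence: 1 };
  }
  // Tool failures are negative; repeating the same failed call is penalized further.
  if (input.toolOutcome !== "success") {
    const repeats = input.repeatedCallCount ?? 0;
    const reasons = [`tool:${input.toolOutcome}`];
    if (repeats > 0) reasons.push("repeated_call");
    return { score: Math.max(-1, -0.5 - 0.1 * repeats), classification: "failure", reasons, confidence: 0.9 };
  }
  // Compare only the fields present on both sides; missing evidence lowers confidence.
  const prev: EvidenceSnapshot = input.previous ?? {};
  const cur: EvidenceSnapshot = input.current ?? {};
  const comparable = EVIDENCE_KEYS.filter((k) => prev[k] !== undefined && cur[k] !== undefined);
  if (comparable.length === 0) {
    return { score: 0, classification: "no_progress", reasons: ["evidence:missing"], confidence: 0.2 };
  }
  const changed = comparable.filter((k) => prev[k] !== cur[k]);
  if (changed.length > 0) {
    return { score: 0.4, classification: "progress", reasons: changed.map((k) => `changed:${k}`), confidence: 0.6 };
  }
  return { score: -0.2, classification: "no_progress", reasons: ["evidence:unchanged"], confidence: 0.6 };
}
```

Because the function takes plain data and returns plain data, it can be called from telemetry code without importing anything from the contracts, hints, or orchestration modules.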

Acceptance criteria

  • The scorer returns deterministic scores for the same input.
  • Contract pass outranks heuristic page-change signals.
  • Repeated no-progress observations are penalized.
  • Known blocking/auth/CAPTCHA signals receive negative classification.
  • Missing evidence is handled gracefully with lower confidence, not thrown errors.
  • The scorer can be consumed without creating cycles across contracts/hints/orchestration modules.

Required automated verification

  • Unit tests for:
    • contract pass/fail scoring
    • DOM/content delta positive scoring
    • repeated no-progress negative scoring
    • stale ref/timeout/auth/blocking classifications
    • missing evidence fallback
  • Integration test where a failed action followed by a successful fresh read produces a higher recovery score than repeating the failed action.
  • Dependency-cruiser or existing tier lint remains clean if applicable.
  • npm run build and targeted Jest tests pass.
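
The determinism and recovery-ranking checks listed above might take a shape like the following. It is a sketch under assumptions: `isDeterministic`, `ToyStep`, and `toyScorer` are hypothetical names, the toy scorer is a stand-in for the real RecoveryRewardScorer, and the real tests would presumably be written as Jest cases rather than plain comparisons.

```typescript
// Any function from an input to an object with a numeric score.
type Scorer<I> = (input: I) => { score: number };

// Calls the scorer several times on the same input and compares serialized outputs.
function isDeterministic<I>(scorer: Scorer<I>, input: I, runs = 5): boolean {
  const first = JSON.stringify(scorer(input));
  for (let i = 1; i < runs; i++) {
    if (JSON.stringify(scorer(input)) !== first) return false;
  }
  return true;
}

// Toy stand-in for the real scorer (hypothetical fields and constants).
type ToyStep = { outcome: "success" | "stale_ref"; freshRead: boolean; sameAsPreviousCall: boolean };

const toyScorer: Scorer<ToyStep> = (step) => {
  if (step.outcome === "stale_ref") return { score: step.sameAsPreviousCall ? -0.7 : -0.5 };
  return { score: step.freshRead ? 0.4 : 0.1 };
};

// After a failed click, compare two candidate next steps.
const failedClick: ToyStep = { outcome: "stale_ref", freshRead: false, sameAsPreviousCall: false };
const repeatedClick: ToyStep = { ...failedClick, sameAsPreviousCall: true };
const freshRead: ToyStep = { outcome: "success", freshRead: true, sameAsPreviousCall: false };

const recoveryScore = toyScorer(freshRead).score;  // fresh read after the failure
const repeatScore = toyScorer(repeatedClick).score; // same failed call again
const recoveryRanksHigher = recoveryScore > repeatScore;
```

Comparing serialized outputs rather than only the score also catches nondeterministic ordering in fields such as reasons[].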

Fixture requirements

Add or reuse controlled routes in tests/e2e/harness/fixture-server.ts:

  • /recovery/progress-positive: safe button changes DOM text or URL fragment.
  • /recovery/no-progress: repeated observation returns same state.
  • /recovery/blocking-page: auth/blocking signal page.
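
As a rough illustration of the three routes, the sketch below returns static HTML per path. The API of tests/e2e/harness/fixture-server.ts is not shown in this issue, so `recoveryFixture`, `FixtureResponse`, the element ids, and the 403 status on the blocking page are all assumptions; the handler is written as a pure function so it could be adapted to whatever routing the harness already uses.

```typescript
// Hypothetical response shape; adapt to the harness's real route handler type.
interface FixtureResponse {
  status: number;
  contentType: string;
  body: string;
}

function page(status: number, inner: string): FixtureResponse {
  return {
    status,
    contentType: "text/html; charset=utf-8",
    body: `<!doctype html><html><body>${inner}</body></html>`,
  };
}

// Returns the fixture page for a recovery route, or undefined for unknown paths.
function recoveryFixture(path: string): FixtureResponse | undefined {
  switch (path) {
    case "/recovery/progress-positive":
      // Safe button: clicking changes both the DOM text and the URL fragment.
      return page(
        200,
        '<p id="status">idle</p>' +
          '<button id="advance" onclick="document.getElementById(\'status\').textContent=\'advanced\';' +
          'location.hash=\'advanced\'">Advance</button>',
      );
    case "/recovery/no-progress":
      // Static content: repeated observation sees the same state every time.
      return page(200, '<p id="status">unchanged</p>');
    case "/recovery/blocking-page":
      // Assumed auth/blocking signal: a sign-in form served with a 403 status.
      return page(
        403,
        '<h1>Sign in required</h1><form id="login"><input name="password" type="password"></form>',
      );
    default:
      return undefined;
  }
}
```

Which markers OpenChrome actually treats as an auth or blocking signal is not stated here, so the blocking page content should be aligned with the existing classifier before relying on it.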

Required real OpenChrome verification after implementation

Use OpenChrome against controlled fixture pages:

  1. Positive path:
    • navigate to fixture
    • perform an action that visibly changes DOM or URL
    • run/collect the scorer output through the integrated telemetry path
    • verify positive classification and reasons mention the observed evidence type
  2. Negative path:
    • repeat an observation-only loop or stale interaction
    • verify score decreases or classifies as no-progress/failure
  3. Contract path:
    • run an existing outcome contract assertion that passes
    • verify contract pass dominates heuristic scoring

Merge evidence required in PR

  • Test output for scorer unit/integration tests.
  • A real OpenChrome transcript/log showing positive, negative, and contract-backed scoring.
  • A note confirming no LLM calls and no automatic recovery behavior were added.

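OpenChrome live verification checklist

Re-verified on 2026-05-14 after applying the latest merged version. Only items that can be proven directly from OpenChrome responses, local fixtures, and build/test artifacts remain as pass criteria. Conditions such as human review, external site stability, and unconfirmed PR status are excluded from the pass criteria.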

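Verification targets

Latest version / common runtime verification

  • Applied the latest develop source and confirmed that npm run build passes.
  • Confirmed that npm run lint:tier passes.
  • Confirmed that npm test -- --runInBand reports 504/507 suites passed with 3 skipped, and 6429/6525 tests passed with 96 skipped. The Jest open-handle warning was recorded as a separate runtime risk.
  • oc_connection_health returned the connected status.
  • Observed DOM state changes on the local fixture through the OpenChrome navigate/read_page/interact/javascript_tool path.
  • Confirmed that the key results are reproducible with the same fixture and the same settings.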

Per-issue resolution evidence

  • Implementation PRs linked to the latest develop: 1214, 1078
  • Related test/source evidence exists in the latest tree:
    • docs/recovery/reward-scorer.md
    • docs/recovery/trajectory-ledger.md
    • src/recovery/reward-scorer.ts
    • src/tools/index.ts
    • tests/recovery/reward-scorer.test.ts
    • src/core/trace/recovery-feedback.ts
  • The checklist retains no pass criteria that cannot be reproduced from OpenChrome responses, fixtures, or local artifacts.

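Failure / hold criteria

  • If even one check is unmet, the issue is not closed.
  • If a failure reproduces as a defect in the latest code, record the failed OpenChrome call, the response excerpt, and the fixture state as evidence, and open a separate fix PR.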

Labels

  • P1: P1 high
  • enhancement: New feature or request
  • harness: Execution harness, run lifecycle, recovery, and verification
  • lats-learnings: Improvements inspired by LanguageAgentTreeSearch analysis
  • live-verification: Requires live OpenChrome/browser validation after implementation
  • observability: Observability
  • outcome-contracts: Verifiable execution via pre/post-condition contracts (Q2)
  • reliability: Reliability and stability improvement
