feat(core): deterministic selector-chain replay layer for skill memory #875

@shaun0927

Tier: core (additive; new opt-in tool, augmented oc_skill_record schema with backfill)
PR target: develop

Why

oc_skill_record / oc_skill_recall (#785, #807) store skill intent — step ordering and the contract id — but not the artifacts needed to re-execute deterministically. Today the only way to re-run a recalled skill is for the host LLM to read the steps, re-resolve each target via read_page + find, and re-issue every action. Every successful re-run therefore costs the same token budget as the first run.

HyperAgent (@hyperbrowser/agent) addresses the same gap with an "action cache": after a successful agent run it persists (xpath, frameIndex, method, args, elementId) per step, and runFromActionCache() replays via XPath without LLM calls, falling back to LLM only when XPath fails. The result is 0 LLM calls on the happy path for recurring tasks.

OpenChrome cannot embed an LLM (P3) and cannot orchestrate (P1), so it cannot port the HyperAgent loop verbatim. The portable piece is the artifact format and a stateless replay executor that returns structured errors when resolution fails, letting the host LLM decide what to do next.

This issue closes a measurable gap distinct from #820 (graph state-hash skipper, higher layer) and #824 (auto-recall on navigate, recall surface). #820 decides "which steps to skip"; this issue decides "given a step, how to re-execute it without re-resolving from scratch".

What

Three additive changes inside the existing skill-memory module, all tier:core, all opt-in by per-call argument:

  1. Enrich every recorded step with a replay_artifact: an ordered list of selector strategies captured at record time, plus optional backendNodeId hint.
  2. Add a new tool oc_skill_replay that, given a skill_id, executes each step's artifact deterministically — no host round-trip per step — and returns a structured result envelope. On any artifact-resolution failure, it returns control to the host with code: "ARTIFACT_RESOLUTION_FAILED" and the offending step index, so the host LLM can fall back to standard read_page + interact.
  3. Wire oc_assert into the replay loop so that, if the recorded contract id is reachable, the run terminates with a verdict regardless of replay path.

oc_skill_record schema gains the optional replay_artifact per step (default absent → existing behaviour). Recording the artifact is opt-in via a new capture_artifact: true arg on the existing actions that already feed the recorder (interact, fill_form, form_input, navigate, tabs_create), or via the upcoming codegen aggregator from #836 when that lands.

Background — verified facts in repo

  • src/tools/oc-skill-record.ts — idempotent on (domain, name), JSON-per-domain store at ~/.openchrome/skill-memory/<encodedDomain>/skills.json (src/core/skill-memory/store.ts).
  • src/tools/oc-skill-recall.ts — recency-sorted, no LLM ranking.
  • src/core/contracts/evidence-bundle.ts — already supplies a structured failure-capture format usable here.
  • src/utils/ralph/ralph-engine.ts — S1–S7 strategy fallback; the artifact format MUST be a strict subset of the strategies Ralph can issue, so replay reuses the same execution path.
  • src/core/perception/backend-node-registry.ts (introduced by feat(core): stable backend-node uid contract across snapshots (chrome-devtools-mcp adoption A.1) #844) — replay artifact MAY reference a nodeRef for the first attempt; on miss, falls through to the selector chain.

Contract

// src/core/skill-memory/replay-artifact.ts (new)
export interface ReplayArtifactStep {
  /** Strict subset of Ralph S1–S6; HITL (S7) is never persisted. */
  kind: 'click' | 'fill' | 'navigate' | 'press' | 'select' | 'submit' | 'scroll';
  /** Tried in order. First successful resolution wins. */
  selectors: Array<
    | { type: 'node_ref'; value: string }          // #844 nodeRef hint
    | { type: 'xpath'; value: string }
    | { type: 'css'; value: string }
    | { type: 'role_name'; role: string; name: string }
    | { type: 'accessible_name'; value: string }
    | { type: 'text'; value: string }
  >;
  /**
   * 0 = main frame; non-zero values are per-target ordinals assigned at first
   * observation and used only internally by the replay artifact (NOT surfaced
   * in tools/list responses). If a follow-up issue formalizes frame addressing
   * in the public surface, this field re-aligns there.
   */
  frameOrdinal?: number;
  args?: Record<string, unknown>;
  /** Optional inline contract check; if present, evaluated via oc_assert after the step. */
  post_assert?: { contract_id: string };
}

export interface ReplayArtifact {
  schema_version: 1;
  recorded_at: number;
  recorder: { openchrome_version: string };
  steps: ReplayArtifactStep[];
}

// src/tools/oc-skill-replay.ts (new)
interface OcSkillReplayArgs {
  skill_id: string;
  /** Optional override; default = all steps. */
  step_range?: { from: number; to: number };
  /** If true, stop on first step that fails the embedded contract check. Default true. */
  stop_on_contract_failure?: boolean;
  /** Default 5s per step; honors src/config/defaults.ts existing budgets. */
  step_timeout_ms?: number;
}

interface OcSkillReplayResult {
  ok: boolean;
  steps_executed: number;
  steps_total: number;
  /** Set when ok === false. */
  failure?: {
    code:
      | 'ARTIFACT_MISSING'
      | 'ARTIFACT_RESOLUTION_FAILED'
      | 'CONTRACT_FAILED'
      | 'STEP_TIMEOUT'
      | 'TARGET_NAVIGATED_AWAY'
      | 'DISABLED';
    step_index: number;
    detail: string;
    evidence_bundle_path?: string; // populated for CONTRACT_FAILED via existing evidence-bundle module
  };
  /** Per-step resolution telemetry, for curator promote signals. */
  step_results: Array<{
    index: number;
    resolved_via: 'node_ref' | 'xpath' | 'css' | 'role_name' | 'accessible_name' | 'text';
    selector_attempts: number;
    elapsed_ms: number;
  }>;
}
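
For orientation, here is what a persisted artifact might look like for a two-step flow. This is a sketch only: the types are trimmed copies of the contract above so the block compiles alone, and every value (selectors, version string, secret name, contract id, and reading recorded_at as epoch milliseconds) is an illustrative assumption rather than output from a real recording.

```typescript
// Trimmed copies of the contract types, repeated so the sketch is self-contained.
type Selector =
  | { type: 'node_ref' | 'xpath' | 'css' | 'accessible_name' | 'text'; value: string }
  | { type: 'role_name'; role: string; name: string };

interface ReplayArtifactStep {
  kind: 'click' | 'fill' | 'navigate' | 'press' | 'select' | 'submit' | 'scroll';
  selectors: Selector[];
  frameOrdinal?: number;
  args?: Record<string, unknown>;
  post_assert?: { contract_id: string };
}

interface ReplayArtifact {
  schema_version: 1;
  recorded_at: number;
  recorder: { openchrome_version: string };
  steps: ReplayArtifactStep[];
}

// All values below are made up for illustration.
const exampleArtifact: ReplayArtifact = {
  schema_version: 1,
  recorded_at: 1700000000000, // assumed to be epoch milliseconds
  recorder: { openchrome_version: '0.0.0-example' },
  steps: [
    {
      kind: 'fill',
      // Ordered from most to least robust; the first selector that resolves wins.
      selectors: [
        { type: 'role_name', role: 'textbox', name: 'Email' },
        { type: 'xpath', value: '//form//input[@name="email"]' },
        { type: 'css', value: 'form input[name="email"]' },
      ],
      // Placeholder form only; the raw secret is never stored in the artifact.
      args: { value: '${SECRET:EMAIL}' },
    },
    {
      kind: 'click',
      selectors: [{ type: 'role_name', role: 'button', name: 'Submit' }],
      post_assert: { contract_id: 'example-contract' },
    },
  ],
};
```

The artifact is plain JSON-serializable data, so it can sit inside the existing per-domain skills.json without a separate store.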

Invariants

  1. oc_skill_replay MUST NOT call any LLM and MUST NOT orchestrate beyond the persisted step list (P1, P3).
  2. Skills recorded under v1.11 (no replay_artifact) return code: "ARTIFACT_MISSING" immediately; no implicit upgrade attempt.
  3. The selector list is tried in order and stops on first resolution; this guarantees deterministic replay across runs given an unchanged DOM.
  4. The artifact format never carries raw secrets — args for fill actions store the substituted-placeholder form ${SECRET:NAME} whenever the existing secrets layer (feat(core): first-class secrets masking (--secrets <dotenv>) with ${SECRET:name} substitution #834) is in play.
  5. When the feature flag is off, tools/list includes oc_skill_replay (P2 schema parity) but invocations return a { ok: false, failure: { code: "DISABLED" } } fact. New replay_artifact fields on oc_skill_record responses are null at runtime when off.
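
Invariant 3 (try selectors in order, stop at the first resolution) can be sketched as below. tryResolve is a hypothetical stand-in for the Ralph building blocks, and it is synchronous here only to keep the example short; the real resolution path is presumably asynchronous.

```typescript
type Selector =
  | { type: 'node_ref' | 'xpath' | 'css' | 'accessible_name' | 'text'; value: string }
  | { type: 'role_name'; role: string; name: string };

interface Resolution<T> {
  target: T;
  resolved_via: Selector['type'];
  selector_attempts: number;
}

// Walks the chain in recorded order and returns on the first hit.
// A null result is what the caller would map to ARTIFACT_RESOLUTION_FAILED.
function resolveFirst<T>(
  selectors: Selector[],
  tryResolve: (selector: Selector) => T | null, // hypothetical resolver
): Resolution<T> | null {
  let attempts = 0;
  for (const selector of selectors) {
    attempts += 1;
    const target = tryResolve(selector);
    if (target !== null) {
      return { target, resolved_via: selector.type, selector_attempts: attempts };
    }
  }
  return null;
}

// Usage against a fake DOM index in which only the CSS selector still matches.
const fakeDom = new Map<string, string>([['css:#email', 'node-42']]);
const lookup = (s: Selector): string | null =>
  s.type === 'role_name' ? null : fakeDom.get(`${s.type}:${s.value}`) ?? null;

const hit = resolveFirst(
  [
    { type: 'xpath', value: '//input[@id="gone"]' },
    { type: 'css', value: '#email' },
  ],
  lookup,
);
const miss = resolveFirst([{ type: 'text', value: 'Captcha' }], lookup);
```

Because the loop has no scoring or retries, the same artifact against the same DOM always yields the same resolved_via and selector_attempts values, which is what makes the per-step telemetry comparable across runs.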

Proposed Implementation

  1. Artifact module (src/core/skill-memory/replay-artifact.ts):

    • Pure types + validator. Schema version pinned at 1.
    • JSON-Schema dump used by both oc_skill_record and oc_skill_replay for input validation.
  2. Recorder hook (modifications to src/tools/interact.ts, src/tools/fill-form.ts, src/tools/form-input.ts, src/tools/navigate.ts):

    • New optional input arg capture_artifact: true.
    • When set, after a successful Ralph resolution, write the winning selector strategy + 2 sibling candidates (in order of robustness: role+name → accessible name → xpath → css) into a session-scoped buffer.
    • Buffer scope: per CDP target. Bounded: at most 100 step entries (FIFO eviction beyond that). Flushed destructively by oc_skill_record or on target close — no cross-skill leakage.
  3. Replay tool (src/tools/oc-skill-replay.ts):

    • For each step, resolve via the artifact's selector list using existing src/utils/ralph/ralph-engine.ts building blocks (no new locator code).
    • On success, dispatch the action via the same CDP path the original tool uses.
    • On embedded post_assert, call oc_assert against the recorded contract_id.
    • Emit step telemetry via src/core/trace/storage.ts so the pilot curator (src/pilot/curator/) can read replay-success rate per skill and use it as a promote-pass signal — this connects the artifact loop to the existing curator without changing curator code in this PR.
  4. Schema migration: bump the SkillRecord JSON shape from schema_version: 1 to 2 (additive replay_artifact on each step). Existing files load as schema_version: 1 and remain read-compatible; on the next idempotent re-record they upgrade in place. No destructive migration script.

  5. Registration: register oc_skill_replay in src/tools/index.ts. Add to tools/list unconditionally. No new env flag needed for the recorder side (per-call arg); for the replay tool, gate full execution behind OPENCHROME_SKILL_REPLAY (default on, opt-out for parity testing) via the new isCoreFeatureEnabled helper introduced by #844. When the flag is off the tool returns a DISABLED fact.
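
The recorder buffer from item 2 above (per CDP target, at most 100 entries, FIFO eviction, destructive flush) could look roughly like this. The class name, method names, and the use of a string target id as the key are assumptions made for illustration.

```typescript
const MAX_STEPS_PER_TARGET = 100;

// Hypothetical session-scoped buffer: one bounded FIFO queue per CDP target.
class CaptureBuffer<Step> {
  private readonly byTarget = new Map<string, Step[]>();

  /** Append a captured step; evict the oldest entry once the bound is exceeded. */
  push(targetId: string, step: Step): void {
    const steps = this.byTarget.get(targetId) ?? [];
    steps.push(step);
    if (steps.length > MAX_STEPS_PER_TARGET) steps.shift();
    this.byTarget.set(targetId, steps);
  }

  /** Destructive flush: return the buffered steps and forget them. */
  flush(targetId: string): Step[] {
    const steps = this.byTarget.get(targetId) ?? [];
    this.byTarget.delete(targetId);
    return steps;
  }
}

// Usage: 101 pushes on one target keep only the newest 100.
const buffer = new CaptureBuffer<number>();
for (let i = 0; i <= 100; i += 1) buffer.push('target-A', i);
buffer.push('target-B', 999);
const flushedA = buffer.flush('target-A');
const flushedAgain = buffer.flush('target-A');
const flushedB = buffer.flush('target-B');
```

Keying by target and deleting on flush is what prevents steps from one skill from leaking into the next record call on the same session.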

Boundary

New:

  • src/core/skill-memory/replay-artifact.ts
  • src/tools/oc-skill-replay.ts
  • tests/core/skill-memory/replay-artifact.test.ts
  • tests/tools/oc-skill-replay.test.ts
  • tests/e2e/scenarios/skill-replay.e2e.ts
  • scripts/verify/skill-replay.mjs
  • tests/fixtures/skill-replay/index.html

Modified:

Acceptance Criteria

  • replay-artifact.ts exports types + a strict JSON-Schema validator; unit tests reject malformed artifacts.
  • oc_skill_record accepts replay_artifact per step; idempotent re-record preserves skill_id and updates the artifact when supplied; back-compat with v1 records (no replay_artifact field).
  • oc_skill_replay lands; calls return one of { ok: true, ... } or { ok: false, failure: {...} }. Never throws.
  • interact / fill_form / form_input / navigate accept capture_artifact; default false; when false, responses are byte-identical to v1.11.0 on a frozen fixture (P2 zero-impact test).
  • OPENCHROME_SKILL_REPLAY=0 → oc_skill_replay returns code: "DISABLED"; tools/list includes the tool either way (P2 schema parity).
  • No outbound HTTP, no LLM API (P3) — covered by tests/core/skill-memory/no-network.test.ts blocking fetch / http / https.
  • Trace storage records { skill_id, step_index, resolved_via, selector_attempts, elapsed_ms, ok } per step.
  • npm run build && npm test && npm run lint && npm run lint:tier green.
  • PR targets develop.
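
As a rough picture of the "reject malformed artifacts" criterion, the sketch below uses hand-written structural checks. Note the swap: the issue calls for a JSON-Schema validator, and this is a plain type-guard stand-in that checks the same required fields; function names are hypothetical.

```typescript
const KINDS = ['click', 'fill', 'navigate', 'press', 'select', 'submit', 'scroll'];
const VALUE_SELECTORS = ['node_ref', 'xpath', 'css', 'accessible_name', 'text'];

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v);
}

function isSelector(v: unknown): boolean {
  if (!isRecord(v)) return false;
  const type = v['type'];
  if (typeof type !== 'string') return false;
  if (type === 'role_name') {
    return typeof v['role'] === 'string' && typeof v['name'] === 'string';
  }
  return VALUE_SELECTORS.includes(type) && typeof v['value'] === 'string';
}

function isStep(v: unknown): boolean {
  if (!isRecord(v)) return false;
  const kind = v['kind'];
  const selectors = v['selectors'];
  return (
    typeof kind === 'string' &&
    KINDS.includes(kind) &&
    Array.isArray(selectors) &&
    selectors.every((s) => isSelector(s))
  );
}

function isReplayArtifact(v: unknown): boolean {
  if (!isRecord(v)) return false;
  const recorder = v['recorder'];
  const steps = v['steps'];
  return (
    v['schema_version'] === 1 &&
    typeof v['recorded_at'] === 'number' &&
    isRecord(recorder) &&
    typeof recorder['openchrome_version'] === 'string' &&
    Array.isArray(steps) &&
    steps.every((s) => isStep(s))
  );
}

// A well-formed artifact and two malformed variants.
const good = {
  schema_version: 1,
  recorded_at: 1700000000000,
  recorder: { openchrome_version: '0.0.0-example' },
  steps: [{ kind: 'click', selectors: [{ type: 'css', value: '#submit' }] }],
};
const wrongVersion = { ...good, schema_version: 2 };
const badSelector = {
  ...good,
  steps: [{ kind: 'click', selectors: [{ type: 'role_name', role: 'button' }] }],
};
```

The unit tests named in the first criterion would exercise cases like wrongVersion and badSelector against the real schema.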

Verification (post-merge, via openchrome MCP)

Setup

Bundled fixture page at tests/fixtures/skill-replay/index.html — a 4-step form (name → email → captcha-text → submit). Served via npm run fixture-serve (port 4173). Avoids public-site flakiness.

Scenario 1 — record-then-replay happy path

  1. mcp__openchrome__navigate { url: "http://localhost:4173/skill-replay/" }
  2. mcp__openchrome__interact { action: "fill", target: { text: "Name" }, value: "Alice", capture_artifact: true }
  3. mcp__openchrome__interact { action: "fill", target: { text: "Email" }, value: "a@b.co", capture_artifact: true }
  4. mcp__openchrome__interact { action: "fill", target: { text: "Captcha" }, value: "1234", capture_artifact: true }
  5. mcp__openchrome__interact { action: "click", target: { text: "Submit" }, capture_artifact: true }
  6. mcp__openchrome__oc_skill_record { domain: "localhost", name: "form-flow", contract_id: "<id>" }
  7. Reset: mcp__openchrome__page_reload, then mcp__openchrome__navigate back to the fixture.
  8. mcp__openchrome__oc_skill_recall { domain: "localhost", name: "form-flow" } → returns the skill with replay_artifact populated for all 4 steps.
  9. mcp__openchrome__oc_skill_replay { skill_id: <id> } → { ok: true, steps_executed: 4, steps_total: 4 }. Each step_results[i].resolved_via is not text (proves a more-robust strategy than the recorded text-match was used).

Pass: step 9 returns ok; the form is filled and submitted; zero mcp__openchrome__find and zero mcp__openchrome__read_page calls between step 8 and the end (verified via mcp__openchrome__oc_journal filter=tool=read_page,find since=<step9_start>).

Scenario 2 — artifact-resolution failure returns control to host

  1. Repeat Scenario 1 steps 1–8.
  2. Mutate the fixture page: rename the "Captcha" label to "Verification code".
  3. mcp__openchrome__oc_skill_replay { skill_id: <id> }.

Pass: returns { ok: false, failure: { code: "ARTIFACT_RESOLUTION_FAILED", step_index: 2 } }. Steps 0–1 executed (form has "Alice" and "a@b.co"). evidence_bundle_path populated. No exception thrown.
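
On the host side, "returns control to host" might be consumed along these lines. The failure codes come from the contract; the HostAction type and the policy chosen for each code are purely illustrative assumptions about what a host could do, not behaviour OpenChrome prescribes.

```typescript
type FailureCode =
  | 'ARTIFACT_MISSING'
  | 'ARTIFACT_RESOLUTION_FAILED'
  | 'CONTRACT_FAILED'
  | 'STEP_TIMEOUT'
  | 'TARGET_NAVIGATED_AWAY'
  | 'DISABLED';

interface ReplayOutcome {
  ok: boolean;
  steps_executed: number;
  steps_total: number;
  failure?: { code: FailureCode; step_index: number; detail: string };
}

// Hypothetical host-side decisions.
type HostAction =
  | { action: 'done' }
  | { action: 'resume_manually'; from_step: number }
  | { action: 'run_all_manually' }
  | { action: 'report'; reason: string };

function planNextAction(result: ReplayOutcome): HostAction {
  if (result.ok) return { action: 'done' };
  if (!result.failure) return { action: 'report', reason: 'malformed result' };
  const { code, step_index } = result.failure;
  switch (code) {
    case 'ARTIFACT_RESOLUTION_FAILED':
    case 'STEP_TIMEOUT':
      // Earlier steps already ran; fall back to read_page + interact from here.
      return { action: 'resume_manually', from_step: step_index };
    case 'ARTIFACT_MISSING':
    case 'DISABLED':
      // Nothing was replayed; follow the recalled steps the usual way.
      return { action: 'run_all_manually' };
    case 'CONTRACT_FAILED':
    case 'TARGET_NAVIGATED_AWAY':
      return { action: 'report', reason: code };
  }
}

// The Scenario 2 result: steps 0 and 1 executed, step 2 failed to resolve.
const next = planNextAction({
  ok: false,
  steps_executed: 2,
  steps_total: 4,
  failure: { code: 'ARTIFACT_RESOLUTION_FAILED', step_index: 2, detail: 'no selector matched' },
});
```

The point of the structured envelope is that this decision stays with the host; the replay tool itself never retries or re-plans.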

Scenario 3 — P3 compliance (zero outbound)

  1. Block all egress at the OS level (pf on macOS / iptables on Linux to a 127.0.0.1-only allow list).
  2. Replay the skill from Scenario 1 (cached storage is local).

Pass: completes successfully. No DNS lookup, no TCP to non-loopback (verify via tcpdump -n -i any capture pinned in PR description as evidence).

Scenario 4 — P2 byte-parity when capture_artifact omitted

  1. Record a baseline trace running an automation without capture_artifact on any call.
  2. Run the same automation against the merged PR build, also without capture_artifact.

Pass: diff on the result.* payload sections of the JSONL traces is empty.
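
One way to automate the Scenario 4 comparison is sketched below. It assumes each JSONL trace line is an object with a result key, which is a guess at the trace shape, and it compares re-serialized payloads, so it checks payload equality (including key order) rather than raw bytes on disk.

```typescript
// Extract the `result` payload of every non-empty JSONL line as a canonical string.
function resultPayloads(jsonl: string): string[] {
  return jsonl
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => {
      const entry = JSON.parse(line) as { result?: unknown };
      return JSON.stringify(entry.result ?? null);
    });
}

// Index of the first trace entry whose result payload differs, or -1 if none does.
function firstDivergence(baseline: string, candidate: string): number {
  const a = resultPayloads(baseline);
  const b = resultPayloads(candidate);
  const n = Math.max(a.length, b.length);
  for (let i = 0; i < n; i += 1) {
    if (a[i] !== b[i]) return i;
  }
  return -1;
}

// Timestamps outside `result` differ between runs and are ignored.
const baselineTrace = '{"ts":1,"result":{"ok":true}}\n{"ts":2,"result":{"n":1}}\n';
const sameTrace = '{"ts":9,"result":{"ok":true}}\n{"ts":10,"result":{"n":1}}\n';
const changedTrace = '{"ts":9,"result":{"ok":true}}\n{"ts":10,"result":{"n":2}}\n';
```

A strict byte-level check would instead slice the raw result substring out of each line; the version above is the simpler variant.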

Scenario 5 — kill-switch

  1. OPENCHROME_SKILL_REPLAY=0 node dist/cli/index.js serve.
  2. mcp__openchrome__oc_skill_replay { skill_id: <id> }.

Pass: { ok: false, failure: { code: "DISABLED" } }. tools/list still includes oc_skill_replay (schema parity).
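
The kill-switch behaviour can be pictured as a thin gate in front of the replay executor. This sketch deliberately does not use the isCoreFeatureEnabled helper from #844, whose signature is not shown here; isReplayEnabled and gateReplay are hypothetical names, only the literal "0" is treated as off, and step_index: 0 for DISABLED is an assumption because the contract does not say which index applies.

```typescript
type Env = Record<string, string | undefined>;

interface DisabledResult {
  ok: false;
  steps_executed: 0;
  steps_total: number;
  failure: { code: 'DISABLED'; step_index: number; detail: string };
  step_results: never[];
}

// Default on: the flag is off only when explicitly set to "0".
function isReplayEnabled(env: Env): boolean {
  return env['OPENCHROME_SKILL_REPLAY'] !== '0';
}

// Run the executor when enabled; otherwise return a DISABLED envelope without touching it.
function gateReplay<T>(env: Env, stepsTotal: number, run: () => T): T | DisabledResult {
  if (isReplayEnabled(env)) return run();
  const disabled: DisabledResult = {
    ok: false,
    steps_executed: 0,
    steps_total: stepsTotal,
    failure: { code: 'DISABLED', step_index: 0, detail: 'OPENCHROME_SKILL_REPLAY=0' },
    step_results: [],
  };
  return disabled;
}

// Usage: the executor callback must not run when the flag is off.
const calls = { count: 0 };
const executor = (): string => {
  calls.count += 1;
  return 'replayed';
};
const whenOff = gateReplay({ OPENCHROME_SKILL_REPLAY: '0' }, 4, executor);
const callsAfterOff = calls.count;
const whenOn = gateReplay({}, 4, executor);
```

Passing the environment in as an argument keeps the gate trivially testable; in the server it would presumably read from the process environment.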

Scenario 6 — schema migration (v1 → v2 records)

  1. Pre-seed ~/.openchrome/skill-memory/localhost/skills.json with a v1.11 record (no replay_artifact).
  2. mcp__openchrome__oc_skill_recall { domain: "localhost" } → returns the record with replay_artifact: null per step (no synthesis).
  3. mcp__openchrome__oc_skill_replay { skill_id: <v1_id> }.

Pass: step 3 returns code: "ARTIFACT_MISSING". Storage file remains valid JSON; no destructive write.
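
The read-compatibility behaviour in this scenario can be sketched as two pure functions. The StoredSkill and StoredStep shapes are assumptions (the on-disk record format is not reproduced in this issue); what matters is that recall normalizes without writing and replay refuses without synthesizing.

```typescript
// Hypothetical on-disk shapes; field names other than replay_artifact are guesses.
interface StoredStep {
  description?: string;
  replay_artifact?: unknown;
}

interface StoredSkill {
  skill_id: string;
  schema_version?: number;
  steps: StoredStep[];
}

// Recall view: a missing artifact is surfaced as null. The input is never mutated.
function normalizeForRecall(skill: StoredSkill) {
  return {
    ...skill,
    steps: skill.steps.map((step) => ({
      ...step,
      replay_artifact: step.replay_artifact ?? null,
    })),
  };
}

// Replay precheck: every step needs an artifact; no implicit upgrade is attempted.
function replayPrecheck(skill: StoredSkill): 'OK' | 'ARTIFACT_MISSING' {
  const complete =
    skill.steps.length > 0 &&
    skill.steps.every((step) => step.replay_artifact !== undefined && step.replay_artifact !== null);
  return complete ? 'OK' : 'ARTIFACT_MISSING';
}

// A v1-style record with no replay_artifact on any step.
const legacySkill: StoredSkill = {
  skill_id: 'legacy-1',
  schema_version: 1,
  steps: [{ description: 'fill name' }, { description: 'submit' }],
};
const recalled = normalizeForRecall(legacySkill);
const verdict = replayPrecheck(legacySkill);
```

Because both functions return new values and leave the loaded record alone, the storage file is only rewritten by an explicit re-record.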

A reproducer for all six scenarios lives at scripts/verify/skill-replay.mjs and is referenced in the PR description.

Out of scope

Dependencies

Effort

M (~5–7 dev days). Artifact validator + recorder buffer + new tool + 4 modified action tools + 6 verification scenarios. No new native dep, no Chrome-launch changes.

References

Curated scope, overlap handling, and verification checklist

Scope classification

  • Canonical lane: deterministic skill replay artifacts.
  • Primary deliverable: selector-chain replay layer for skill memory with recorded stable selectors and evidence-backed replay.
  • Open PR: feat(core): deterministic selector-chain replay layer for skill memory (closes #875) #925 (feat/875-replay-layer). Continue there; avoid duplicate work.
  • Non-goal: blind replay, replacing skill intent memory, unsafe destructive re-execution, or requiring host LLM to re-resolve every target.

Overlap and conflict resolution

Implementation checklist

  • Extend skill record schema/backfill with selector chains, page signatures, action metadata, and evidence handles needed for deterministic replay.
  • Add opt-in replay tool/path with safety checks for page signature, selector resolution, and risky actions.
  • Reject replay with clear reason when selectors/page signatures do not match.
  • Add tests for record/backfill, successful replay, stale selector rejection, page mismatch, risky action gate, and token savings evidence.
  • Document replay eligibility and fallback to normal recall steps.

Success criteria

  • A recalled skill can replay deterministically when page and safety evidence match.
  • Replay refuses stale/mismatched/risky conditions instead of guessing.
  • Existing skill recall remains available for non-replayable skills.
  • Token/latency cost of successful repeat workflows is reduced.

Post-merge OpenChrome live verification checklist

  • Record a local fixture skill, replay it on the same page, and verify deterministic success.
  • Change page structure and verify replay is rejected with selector/page mismatch.
  • Exercise a risky action fixture and verify safety gate/fallback.
  • Capture replay artifact, reject reason, and before/after tool-call count.

Metadata

Assignees

No one assigned

    Labels

    P0 (P0 critical), enhancement (New feature or request), performance (Performance, latency, throughput, or resource-use improvement), reliability (Reliability and stability improvement), verified-skill-memory (Contract-backed skill auto-curation (Q3-Q4))
