Skip to content

feat(core): fast/standard extraction modes with benchmark evidence #989

@shaun0927

Description

@shaun0927

Summary

Introduce an explicit mode: "fast" | "standard" policy for structured extraction paths so callers can choose between low-latency/local-heuristic extraction and deeper, more expensive page analysis.

This issue covers the mode contract and the first implementation target: extract_data query/schema extraction. It should not broaden into all browser tools in the first PR.

Why

AgentQL's fast vs standard distinction is useful because it makes the latency/accuracy tradeoff explicit. OpenChrome already follows a similar philosophy in places such as read_page AX overflow fallback being opt-in. This issue makes that pattern consistent for structured extraction.

Expected impact:

  • Reduce accidental heavy page traversals for simple extraction.
  • Give agents a deterministic way to request deeper analysis only when needed.
  • Make performance benchmarks easier to compare across PRs.

Scope

Add mode to extract_data with the following semantics:

fast mode

Default for query-based extraction.

Allowed work:

  • JSON-LD, Microdata, OpenGraph.
  • Bounded visible DOM scan.
  • Existing CSS heuristic extraction with strict node/time budgets.
  • No full AX traversal.
  • No screenshot or visual analysis.

standard mode

Opt-in when fast mode misses required fields.

Allowed additional work:

  • Broader DOM scan with higher but bounded node/time limits.
  • Parent/container context for lists.
  • Existing pagination/list heuristics if already available locally.
  • Still no outbound vendor/LLM calls.

Error/output contract

  • Response must include modeUsed.
  • If mode: "fast" misses required fields, response should suggest retrying with mode: "standard" only when evidence supports that suggestion.
  • Invalid mode returns an actionable error.

Non-goals

  • Applying mode to act, find, read_page, or vision_find in this issue.
  • Adding LLM-based semantic extraction.
  • Making standard the default.

Implementation milestones

  1. Define shared ExtractionMode constants and per-mode time/node budgets.
  2. Gate extraction strategies by mode without changing existing schema-call defaults.
  3. Add modeUsed, duration, output-size, and fields-found metrics to extraction responses or benchmark artifacts.
  4. Add tests that prove fast-only and standard-only strategy paths are distinct.
  5. Add real OpenChrome median-latency and output-size comparison evidence.

Acceptance criteria

  • extract_data accepts mode: "fast" | "standard"; missing mode uses current behavior for schema calls and fast for query calls unless that would break existing tests.
  • modeUsed is present in successful responses.
  • fast mode has explicit max node and timeout budgets documented in code constants.
  • standard mode has explicit max node and timeout budgets documented in code constants.
  • The implementation has a kill-switch or safe fallback path if a mode-specific strategy fails.
  • Existing schema-based behavior remains backward compatible.
  • Benchmark output records mode, duration, output size, and fields found.

Required tests

  • Unit tests for mode validation and defaulting.
  • Handler tests proving fast does not invoke standard-only strategy hooks.
  • Handler tests proving standard can recover a field intentionally omitted from fast-only metadata.
  • Regression test ensuring existing extract_data schema calls without mode still pass.

Real OpenChrome verification after implementation

Add a fixture and real MCP scenario with a page where:

  • JSON-LD contains title/price but not all visible card fields.
  • Some fields require visible DOM scanning.

Run OpenChrome real mode and perform:

  1. navigate to the fixture.
  2. extract_data with mode: "fast".
  3. extract_data with mode: "standard".
  4. Compare responses.

Required verification assertions:

  • fast.modeUsed === "fast".
  • standard.modeUsed === "standard".
  • fast completes faster than or equal to standard on median of 5 runs, or the PR explains why the fixture is too small to show a difference.
  • standard finds at least as many fields as fast.
  • Neither mode emits a full read_page-sized DOM dump.
  • Both modes work without external API keys.

The PR should include the real verification command and a small JSON summary with median latency and output character count.

Repo-fit check

This is a small, explicit performance contract. It avoids a broad refactor by starting with extract_data only and preserves OpenChrome's local-first, opt-in-expensive-work design.

AgentQL comparison review hardening (2026-05-12)

This issue is intentionally scoped to extraction modes. Interaction/query resolver modes for oc_query are part of #1045 and should reuse the same naming only if the behavior is documented consistently.

Additional merge-verification requirements:

  • The PR must publish a small benchmark artifact with 5-run median latency, p95 latency, output character count, and fields-found count for both fast and standard.
  • fast must never silently escalate to standard; any escalation must require an explicit option or be reported as modeUsed / escalatedFrom.
  • standard must remain bounded by documented node/time budgets and must settle within the configured MCP tool timeout.
  • If standard finds no additional data over fast, the response should explain that retrying with a broader selector or query may be needed rather than encouraging blind retries.

Curated scope, overlap handling, and verification checklist

Scope classification

  • Canonical lane: structured extraction mode policy.
  • Primary deliverable: explicit mode: "fast" | "standard" support for extract_data query/schema paths with benchmark evidence.
  • Open PR: feat(core): add fast and standard extract_data modes (#989) #1104 (feat/989-extraction-modes). Continue there; do not duplicate the PR.
  • Non-goal: expanding modes to all tools in v1, adding vendor calls, or changing default extraction behavior without compatibility gates.

Overlap and conflict resolution

Implementation checklist

  • Define mode contract, defaults, and errors for unsupported mode combinations.
  • Implement fast path for low-latency local heuristic extraction and standard path for deeper bounded analysis.
  • Add benchmark/fixture evidence comparing latency, output size, and fields found for both modes.
  • Add tests for default mode compatibility, fast/standard behavior, invalid mode, and schema/query extraction parity.
  • Document how callers choose fast vs standard and when to escalate.

Success criteria

  • Callers can explicitly trade speed for depth on extract_data.
  • Default behavior remains backward-compatible.
  • Fast mode avoids accidental heavy traversal for simple extraction.
  • Benchmark evidence shows latency/token impact and quality tradeoff.

Post-merge OpenChrome live verification checklist

  • Run extraction fixtures in fast and standard modes and compare latency/output/fields-found metrics.
  • Verify default calls match pre-existing behavior or documented default.
  • Run an invalid mode call and verify bounded validation error.
  • Attach benchmark summary and fixture names to merge notes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1P1 highenhancementNew feature or requestperformancePerformance, latency, throughput, or resource-use improvement

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions