feat(extraction): local query DSL for extract_data (#986) by shaun0927 · Pull Request #1099 · shaun0927/openchrome

shaun0927 · 2026-05-12T15:24:43Z

Summary

Implements the smallest safe slice of #986 as a stacked PR on top of #1067 (feat/974-schema-aware-extract):

- adds an OpenChrome-local query parser for extract_data
- supports flat object queries and one root list block with scalar child fields
- compiles query input into the existing ExtractionSchema path
- routes query-derived schemas through #1067's schema-aware buildExtractionPlan / fieldPlans
- preserves existing schema-based extract_data behavior
- rejects nested non-root query shapes instead of accepting semantics the extractor cannot satisfy
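
To make the accepted subset concrete, here is a minimal sketch of a bounded parser for it. It is not the code in src/extraction/query-parser.ts: `QueryPlan`, `FieldSpec`, `parseQuery`, and `MAX_QUERY_LENGTH` are hypothetical names, field descriptions are left out, and rejecting a root that mixes scalar fields with a list block is an assumption of this sketch. Only the `name(type)` hint syntax, the six type hints, and the `name[] { ... }` root list form come from this PR.

```typescript
// Illustrative sketch only; names and shapes are assumptions, not the PR's code.
type TypeHint = "string" | "number" | "integer" | "boolean" | "url" | "date";
const TYPE_HINTS: readonly string[] = ["string", "number", "integer", "boolean", "url", "date"];
const MAX_QUERY_LENGTH = 4096; // hypothetical bound that keeps parsing cheap

interface FieldSpec { name: string; type: TypeHint }
interface QueryPlan { multiple: boolean; listName?: string; fields: FieldSpec[] }

const TOKEN = /\[\]|[{}()]|[A-Za-z_][A-Za-z0-9_]*/g;

function tokenize(query: string): string[] {
  if (query.length > MAX_QUERY_LENGTH) throw new Error("Query too long");
  // Anything left after removing tokens and whitespace is not part of the subset.
  const leftover = query.replace(TOKEN, "").replace(/\s+/g, "");
  if (leftover.length > 0) throw new Error(`Unexpected characters: ${leftover}`);
  return query.match(TOKEN) ?? [];
}

function parseQuery(query: string): QueryPlan {
  const tokens = tokenize(query);
  let pos = 0;
  const peek = (): string | undefined => tokens[pos];
  const next = (): string => {
    const t = tokens[pos++];
    if (t === undefined) throw new Error("Unexpected end of query");
    return t;
  };
  const expect = (want: string): void => {
    const got = next();
    if (got !== want) throw new Error(`Expected "${want}" but found "${got}"`);
  };
  const isName = (t: string | undefined): t is string =>
    t !== undefined && /^[A-Za-z_]/.test(t);

  // Parses `{ name name(type) ... }` and rejects any block nested inside it.
  const parseScalarBlock = (): FieldSpec[] => {
    expect("{");
    const fields: FieldSpec[] = [];
    while (peek() !== "}") {
      const name = next();
      if (!isName(name)) throw new Error(`Expected a field name but found "${name}"`);
      if (peek() === "{" || peek() === "[]") {
        throw new Error(`Nested block "${name}" is not supported`);
      }
      let type: TypeHint = "string"; // untyped fields default to string in this sketch
      if (peek() === "(") {
        next();
        const hint = next();
        if (!TYPE_HINTS.includes(hint)) throw new Error(`Unknown type hint "${hint}"`);
        type = hint as TypeHint;
        expect(")");
      }
      fields.push({ name, type });
    }
    expect("}");
    if (fields.length === 0) throw new Error("Query block has no fields");
    return fields;
  };

  // Root is either `{ list[] { ... } }` or a flat `{ a b(type) }` block.
  const [first, second, third] = tokens;
  let plan: QueryPlan;
  if (first === "{" && isName(second) && third === "[]") {
    pos = 3;
    const fields = parseScalarBlock();
    expect("}");
    plan = { multiple: true, listName: second, fields };
  } else {
    plan = { multiple: false, fields: parseScalarBlock() };
  }
  if (pos !== tokens.length) throw new Error("Unexpected trailing tokens");
  return plan;
}
```

In this sketch the parser only ever produces plain data (`QueryPlan`), so a later step can map it onto the schema path without the query text reaching a browser script.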

This intentionally does not add AgentQL, API keys, outbound vendor calls, or full AgentQL language compatibility.

Scope / Repo-Fit Review

Before implementation I checked open PRs and found #1067 already owns schema-aware deterministic extraction planning. This PR is therefore stacked on #1067 rather than targeting develop directly, which avoids duplicating or bypassing that work.

The implemented surface remains aligned with OpenChrome's direction:

- local-first MCP tool behavior
- no external LLM/vendor dependency inside the server
- bounded, deterministic parser
- query text is converted to schema/field plans, not interpolated into browser scripts
- richer standard mode and nested query semantics are left to follow-up issues (#989+)

Implementation

src/extraction/query-parser.ts
- tokenizes/parses the local query subset
- supports type hints: string, number, integer, boolean, url, date
- supports field descriptions
- builds an ExtractionQueryPlan
- rejects unsupported nested query shapes
src/tools/extract-data.ts
- accepts query as an alternative to schema
- accepts mode: "fast" only, as a placeholder
- rejects schema + query together
- infers multiple for { products[] { ... } }
- records query-specific domain-memory keys
- keeps schema mode backward compatible
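
The argument rules above can be sketched as a small validation step. This is an illustration under assumptions, not the handler in src/tools/extract-data.ts: `ExtractDataArgs`, `ValidatedInput`, and `validateExtractArgs` are hypothetical names, and requiring at least one of schema or query is an assumption of the sketch.

```typescript
// Illustrative sketch only; the real tool's argument types may differ.
interface ExtractDataArgs {
  tabId: string;
  schema?: Record<string, unknown>;
  query?: string;
  mode?: string;
}

type ValidatedInput =
  | { kind: "schema"; schema: Record<string, unknown> }
  | { kind: "query"; query: string };

function validateExtractArgs(args: ExtractDataArgs): ValidatedInput {
  const { schema, query, mode } = args;

  // schema and query are alternatives; passing both is rejected.
  if (schema !== undefined && query !== undefined) {
    throw new Error("Provide either schema or query, not both");
  }
  // "fast" is the only accepted mode value for now.
  if (mode !== undefined && mode !== "fast") {
    throw new Error(`Unsupported mode "${mode}"`);
  }
  if (query !== undefined) {
    if (query.trim() === "") throw new Error("query must not be empty");
    return { kind: "query", query };
  }
  if (schema !== undefined) {
    // Existing schema callers take the same path as before.
    return { kind: "schema", schema };
  }
  throw new Error("Provide schema or query"); // assumption: one of the two is required
}
```

Returning a tagged union keeps the schema path untouched while letting the caller branch once on `kind` before any query parsing happens.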

Verification

npx jest tests/extraction/query-parser.test.ts tests/tools/extract-data.test.ts tests/extraction/plan.test.ts tests/extraction/strategies.test.ts --runInBand
npm run build
npm run lint:changed
Code review lane: APPROVE
Architecture lane: CLEAR

Real OpenChrome Verification

Not run in this PR because the branch is stacked on #1067 and should be real-verified after #1067 lands or in CI against the stacked branch. The required real verification from #986 remains:

Start real OpenChrome MCP.
Navigate to a local product/listing fixture.
Call extract_data with:

{
  "tabId": "<fixture tab>",
  "query": "{ products[] { product_name product_price(number) product_url(url) } }",
  "mode": "fast"
}

Verify valid JSON, numeric price coercion, URL-like output, no external API key, and output smaller than a full read_page mode="dom" depth=8 response.
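
The checks in that last step could be scripted roughly as follows. This is a sketch under assumptions: the response shape `{ products: [...] }`, the helper names `checkExtraction` and `isUrlLike`, and treating "URL-like" as "parses as an absolute URL" are all hypothetical, and the size comparison simply compares string lengths of the two raw responses.

```typescript
// Illustrative sketch only; the actual extract_data response shape is not confirmed here.
interface CheckResult { ok: boolean; problems: string[] }

function isUrlLike(value: string): boolean {
  try {
    new URL(value); // throws on strings that are not absolute URLs
    return true;
  } catch {
    return false;
  }
}

function checkExtraction(rawJson: string, domResponseLength: number): CheckResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return { ok: false, problems: ["output is not valid JSON"] };
  }

  const problems: string[] = [];
  const products =
    typeof parsed === "object" && parsed !== null
      ? (parsed as { products?: unknown }).products
      : undefined;
  if (!Array.isArray(products)) {
    return { ok: false, problems: ["missing products array"] };
  }

  products.forEach((item: unknown, i: number) => {
    const p = (item ?? {}) as { product_price?: unknown; product_url?: unknown };
    // product_price(number) should arrive coerced to a JSON number.
    if (typeof p.product_price !== "number" || Number.isNaN(p.product_price)) {
      problems.push(`products[${i}].product_price is not numeric`);
    }
    if (typeof p.product_url !== "string" || !isUrlLike(p.product_url)) {
      problems.push(`products[${i}].product_url is not URL-like`);
    }
  });

  // The extraction output should be smaller than the read_page DOM response.
  if (rawJson.length >= domResponseLength) {
    problems.push("output is not smaller than the DOM response");
  }
  return { ok: problems.length === 0, problems };
}
```

The "no external API key" condition is not covered by this sketch; it would have to be confirmed from the server configuration and network activity rather than from the tool output.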

Stacking / Merge Notes

Base branch: feat/974-schema-aware-extract (#1067)

Merge order should be #1067 first, then this PR (#1099).

After #1067 merges into develop, this branch can be rebased onto updated develop with the same single commit.

Closes #986 after #1067 is merged and this stacked PR lands.

Use schema-derived aliases and safe description tokens so extract_data can resolve semantically named fields without adding model calls or dependencies. Keep diagnostics opt-in to preserve compact default output.

Constraint: Core extraction must remain LLM-free, dependency-free, and read-only.
Rejected: Default LLM fallback | deferred to #976 because it would add latency, provider configuration, and security surface.
Confidence: high
Scope-risk: moderate
Directive: Keep future semantic extraction additions behind explicit opt-in flags unless they preserve the deterministic fast path.
Tested: npm test -- --runTestsByPath tests/extraction/plan.test.ts tests/extraction/strategies.test.ts; npm run build; npm run lint:changed; npm run lint:tier; git diff --check
Not-tested: Full npm run lint currently fails on unrelated develop baseline issues in src/session-manager.ts and src/tools/connect.ts.

Add a bounded OpenChrome-native query parser that compiles flat object fields and a single root list block into the existing extraction schema path. Keep schema mode compatible and route query-derived schemas through the schema-aware field plan from #974.

Constraint: no AgentQL dependency, API key, or outbound vendor call; this PR is stacked on feat/974-schema-aware-extract / #1067 to avoid duplicating extraction-planning changes.
Rejected: full nested AgentQL compatibility | current extraction strategies are flat-field oriented, so nested non-root blocks are rejected instead of accepted with incomplete semantics.
Confidence: high
Scope-risk: moderate
Directive: keep future standard-mode and richer query semantics in follow-up issues (#989+) rather than expanding this v1 surface.
Tested: npx jest tests/extraction/query-parser.test.ts tests/tools/extract-data.test.ts tests/extraction/plan.test.ts tests/extraction/strategies.test.ts --runInBand; npm run build; npm run lint:changed
Not-tested: real Chrome MCP extraction fixture; PR documents this as follow-up verification evidence.
Co-authored-by: OmX <omx@oh-my-codex.dev>

shaun0927 and others added 2 commits May 12, 2026 23:53

shaun0927 mentioned this pull request May 12, 2026

feat(core): bounded query debug trace (#992) #1115

Open

shaun0927 force-pushed the feat/974-schema-aware-extract branch from 3d60901 to 0791f56 Compare May 12, 2026 16:40

shaun0927 wants to merge 2 commits into feat/974-schema-aware-extract from feat/986-agentql-query-dsl
