Skip to content

feat(extraction): local query DSL for extract_data (#986)#1099

Open
shaun0927 wants to merge 2 commits into
feat/974-schema-aware-extractfrom
feat/986-agentql-query-dsl
Open

feat(extraction): local query DSL for extract_data (#986)#1099
shaun0927 wants to merge 2 commits into
feat/974-schema-aware-extractfrom
feat/986-agentql-query-dsl

Conversation

@shaun0927
Copy link
Copy Markdown
Owner

Summary

Implements the smallest safe slice of #986 as a stacked PR on top of #1067 (feat/974-schema-aware-extract):

  • adds an OpenChrome-local query parser for extract_data
  • supports flat object queries and one root list block with scalar child fields
  • compiles query input into the existing ExtractionSchema path
  • routes query-derived schemas through feat(extraction): schema-aware deterministic extract_data planning (#974) #1067's schema-aware buildExtractionPlan / fieldPlans
  • preserves existing schema-based extract_data behavior
  • rejects nested non-root query shapes instead of accepting semantics the extractor cannot satisfy

This intentionally does not add AgentQL, API keys, outbound vendor calls, or full AgentQL language compatibility.

Scope / Repo-Fit Review

Before implementation I checked open PRs and found #1067 already owns schema-aware deterministic extraction planning. This PR is therefore stacked on #1067 rather than targeting develop directly, which avoids duplicating or bypassing that work.

The implemented surface remains aligned with OpenChrome's direction:

Implementation

  • src/extraction/query-parser.ts

    • tokenizes/parses the local query subset
    • supports type hints: string, number, integer, boolean, url, date
    • supports field descriptions
    • builds an ExtractionQueryPlan
    • rejects unsupported nested query shapes
  • src/tools/extract-data.ts

    • accepts query as an alternative to schema
    • accepts mode: "fast" placeholder only
    • rejects schema + query together
    • infers multiple for { products[] { ... } }
    • records query-specific domain-memory keys
    • keeps schema mode backward compatible

Verification

  • npx jest tests/extraction/query-parser.test.ts tests/tools/extract-data.test.ts tests/extraction/plan.test.ts tests/extraction/strategies.test.ts --runInBand
  • npm run build
  • npm run lint:changed
  • Code review lane: APPROVE
  • Architecture lane: CLEAR

Real OpenChrome Verification

Not run in this PR because the branch is stacked on #1067 and should be real-verified after #1067 lands or in CI against the stacked branch. The required real verification from #986 remains:

  1. Start real OpenChrome MCP.
  2. Navigate to a local product/listing fixture.
  3. Call extract_data with:
{
  "tabId": "<fixture tab>",
  "query": "{ products[] { product_name product_price(number) product_url(url) } }",
  "mode": "fast"
}
  1. Verify valid JSON, numeric price coercion, URL-like output, no external API key, and output smaller than a full read_page mode="dom" depth=8 response.

Stacking / Merge Notes

Base branch: feat/974-schema-aware-extract (#1067)

Merge order should be:

  1. feat(extraction): schema-aware deterministic extract_data planning (#974) #1067
  2. this PR

After #1067 merges into develop, this branch can be rebased onto updated develop with the same single commit.

Closes #986 after #1067 is merged and this stacked PR lands.

shaun0927 and others added 2 commits May 12, 2026 23:53
Use schema-derived aliases and safe description tokens so extract_data can resolve semantically named fields without adding model calls or dependencies. Keep diagnostics opt-in to preserve compact default output.

Constraint: Core extraction must remain LLM-free, dependency-free, and read-only.

Rejected: Default LLM fallback | deferred to #976 because it would add latency, provider configuration, and security surface.

Confidence: high

Scope-risk: moderate

Directive: Keep future semantic extraction additions behind explicit opt-in flags unless they preserve the deterministic fast path.

Tested: npm test -- --runTestsByPath tests/extraction/plan.test.ts tests/extraction/strategies.test.ts; npm run build; npm run lint:changed; npm run lint:tier; git diff --check

Not-tested: Full npm run lint currently fails on unrelated develop baseline issues in src/session-manager.ts and src/tools/connect.ts.
Add a bounded OpenChrome-native query parser that compiles flat object fields and a single root list block into the existing extraction schema path. Keep schema mode compatible and route query-derived schemas through the schema-aware field plan from #974.

Constraint: no AgentQL dependency, API key, or outbound vendor call; this PR is stacked on feat/974-schema-aware-extract / #1067 to avoid duplicating extraction-planning changes.

Rejected: full nested AgentQL compatibility | current extraction strategies are flat-field oriented, so nested non-root blocks are rejected instead of accepted with incomplete semantics.

Confidence: high

Scope-risk: moderate

Directive: keep future standard-mode and richer query semantics in follow-up issues (#989+) rather than expanding this v1 surface.

Tested: npx jest tests/extraction/query-parser.test.ts tests/tools/extract-data.test.ts tests/extraction/plan.test.ts tests/extraction/strategies.test.ts --runInBand; npm run build; npm run lint:changed

Not-tested: real Chrome MCP extraction fixture; PR documents this as follow-up verification evidence.

Co-authored-by: OmX <omx@oh-my-codex.dev>
@gemini-code-assist
Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant