feat(extraction): local query DSL for extract_data (#986) #1099
Open
shaun0927 wants to merge 2 commits into
Use schema-derived aliases and safe description tokens so extract_data can resolve semantically named fields without adding model calls or dependencies. Keep diagnostics opt-in to preserve compact default output.

Constraint: Core extraction must remain LLM-free, dependency-free, and read-only.
Rejected: Default LLM fallback | deferred to #976 because it would add latency, provider configuration, and security surface.
Confidence: high
Scope-risk: moderate
Directive: Keep future semantic extraction additions behind explicit opt-in flags unless they preserve the deterministic fast path.
Tested: npm test -- --runTestsByPath tests/extraction/plan.test.ts tests/extraction/strategies.test.ts; npm run build; npm run lint:changed; npm run lint:tier; git diff --check
Not-tested: Full npm run lint currently fails on unrelated develop baseline issues in src/session-manager.ts and src/tools/connect.ts.
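To make "schema-derived aliases and safe description tokens" concrete, here is a minimal sketch of one way such aliases could be derived without any model call. It is not the code from this PR: `deriveAliases`, `splitFieldName`, the stop-word list, and the length bounds that define a "safe" token are all hypothetical choices made for illustration.

```typescript
// Hypothetical sketch: none of these names come from the PR's source.
const STOP_WORDS = new Set(["the", "a", "an", "of", "for", "and", "or", "to", "in"]);

/** Split snake_case, kebab-case, and camelCase names into lowercase words. */
function splitFieldName(name: string): string[] {
  return name
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // camelCase boundary -> space
    .split(/[^A-Za-z0-9]+/)
    .filter((word) => word.length > 0)
    .map((word) => word.toLowerCase());
}

/** Derive deterministic lookup aliases from a field name and optional description. */
function deriveAliases(name: string, description: string = ""): string[] {
  const words = splitFieldName(name);
  const aliases = new Set<string>();
  if (words.length > 0) {
    aliases.add(words.join(" "));         // "product_price" -> "product price"
    aliases.add(words[words.length - 1]); // last word alone -> "price"
  }
  // Assumption: a "safe" token is purely alphabetic, 3 to 24 characters long,
  // and not a stop word. The real rule in the PR may differ.
  for (const token of description.toLowerCase().split(/[^a-z]+/)) {
    if (token.length >= 3 && token.length <= 24 && !STOP_WORDS.has(token)) {
      aliases.add(token);
    }
  }
  return Array.from(aliases);
}
```

Under these assumptions, `deriveAliases("product_price", "The listed cost of the item")` yields `["product price", "price", "listed", "cost", "item"]`. Everything is plain string processing, so the path stays deterministic and dependency-free.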
Add a bounded OpenChrome-native query parser that compiles flat object fields and a single root list block into the existing extraction schema path. Keep schema mode compatible and route query-derived schemas through the schema-aware field plan from #974.

Constraint: no AgentQL dependency, API key, or outbound vendor call; this PR is stacked on feat/974-schema-aware-extract / #1067 to avoid duplicating extraction-planning changes.
Rejected: full nested AgentQL compatibility | current extraction strategies are flat-field oriented, so nested non-root blocks are rejected instead of accepted with incomplete semantics.
Confidence: high
Scope-risk: moderate
Directive: keep future standard-mode and richer query semantics in follow-up issues (#989+) rather than expanding this v1 surface.
Tested: npx jest tests/extraction/query-parser.test.ts tests/tools/extract-data.test.ts tests/extraction/plan.test.ts tests/extraction/strategies.test.ts --runInBand; npm run build; npm run lint:changed
Not-tested: real Chrome MCP extraction fixture; PR documents this as follow-up verification evidence.
Co-authored-by: OmX <omx@oh-my-codex.dev>
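The grammar described above (flat fields, optional type hints, at most one root list block, nested blocks rejected) can be sketched as a small recursive-descent parser. This is an illustrative sketch, not the contents of `src/extraction/query-parser.ts`: `parseQuery`, `CompiledQuery`, `MAX_QUERY_LENGTH`, and the choice of `string` as the default type for untyped fields are assumptions.

```typescript
type FieldType = "string" | "number" | "integer" | "boolean" | "url" | "date";
const FIELD_TYPES: readonly string[] = ["string", "number", "integer", "boolean", "url", "date"];
const IDENT = /^[A-Za-z_]\w*$/;
const MAX_QUERY_LENGTH = 2000; // hypothetical bound on input size

// Hypothetical output shape; the PR compiles into its own ExtractionSchema type.
interface CompiledQuery {
  multiple: boolean;
  listName?: string;
  fields: Record<string, FieldType>;
}

function isFieldType(value: string): value is FieldType {
  return FIELD_TYPES.indexOf(value) !== -1;
}

function parseQuery(query: string): CompiledQuery {
  if (query.length > MAX_QUERY_LENGTH) throw new Error("query too long");
  // Tokens: "[]", braces, parens, identifiers; any other non-space char becomes
  // a one-character token that the parser rejects below.
  const tokens: string[] = query.match(/\[\]|[{}()]|[A-Za-z_]\w*|\S/g) || [];
  let pos = 0;

  const next = (): string => {
    if (pos >= tokens.length) throw new Error("unexpected end of query");
    return tokens[pos++];
  };
  const expect = (token: string): void => {
    const got = next();
    if (got !== token) throw new Error(`expected "${token}" but found "${got}"`);
  };

  // Parses `{ name name(type) ... }`. Nested blocks are rejected, not ignored.
  const parseFields = (): Record<string, FieldType> => {
    const fields: Record<string, FieldType> = {};
    expect("{");
    while (tokens[pos] !== "}") {
      const name = next();
      if (!IDENT.test(name)) throw new Error(`unexpected token "${name}"`);
      let type: FieldType = "string"; // assumed default for untyped fields
      if (tokens[pos] === "(") {
        pos++;
        const hint = next();
        if (!isFieldType(hint)) throw new Error(`unknown type "${hint}"`);
        type = hint;
        expect(")");
      }
      if (tokens[pos] === "[]" || tokens[pos] === "{") {
        throw new Error("nested blocks are not supported");
      }
      fields[name] = type;
    }
    expect("}");
    if (Object.keys(fields).length === 0) throw new Error("empty field block");
    return fields;
  };

  let result: CompiledQuery;
  if (tokens[0] === "{" && tokens[2] === "[]") {
    // Root list form: `{ listName[] { ...fields } }`
    pos = 1;
    const listName = next();
    if (!IDENT.test(listName)) throw new Error(`unexpected token "${listName}"`);
    pos++; // skip "[]"
    const fields = parseFields();
    expect("}");
    result = { multiple: true, listName, fields };
  } else {
    result = { multiple: false, fields: parseFields() };
  }
  if (pos !== tokens.length) throw new Error("unexpected trailing input");
  return result;
}
```

With this sketch, `parseQuery("{ products[] { product_name product_price(number) product_url(url) } }")` returns `multiple: true`, `listName: "products"`, and three fields typed `string`, `number`, and `url`. A query such as `{ a { b } }` throws instead of being accepted with incomplete semantics, which mirrors the rejection rationale in the commit message.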
3d60901 to
0791f56
Summary
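Implements the smallest safe slice of #986 as a stacked PR on top of #1067 (`feat/974-schema-aware-extract`):

- adds a bounded local query parser for `extract_data`
- compiles queries into the existing `ExtractionSchema` path
- routes query-derived schemas through `buildExtractionPlan` / `fieldPlans`
- keeps existing `extract_data` behavior compatible

This intentionally does not add AgentQL, API keys, outbound vendor calls, or full AgentQL language compatibility.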
Scope / Repo-Fit Review
Before implementation I checked open PRs and found that #1067 already owns schema-aware deterministic extraction planning. This PR is therefore stacked on #1067 rather than targeting `develop` directly, which avoids duplicating or bypassing that work.

The implemented surface remains aligned with OpenChrome's direction:

- `standard` mode and nested query semantics are left to follow-up issues (feat(core): fast/standard extraction modes with benchmark evidence, #989+)

Implementation
`src/extraction/query-parser.ts`

- supported field types: `string`, `number`, `integer`, `boolean`, `url`, `date`
- `ExtractionQueryPlan`

`src/tools/extract-data.ts`

- `query` as an alternative to `schema`
- `mode: "fast"` placeholder only
- `schema` + `query` together
- `multiple` for `{ products[] { ... } }`

Verification
- `npx jest tests/extraction/query-parser.test.ts tests/tools/extract-data.test.ts tests/extraction/plan.test.ts tests/extraction/strategies.test.ts --runInBand`
- `npm run build`
- `npm run lint:changed`

Real OpenChrome Verification
Not run in this PR because the branch is stacked on #1067; it should be real-verified after #1067 lands or in CI against the stacked branch. The required real verification from #986 remains:

- `extract_data` with: `{ "tabId": "<fixture tab>", "query": "{ products[] { product_name product_price(number) product_url(url) } }", "mode": "fast" }`
- `read_page mode="dom" depth=8` response.

Stacking / Merge Notes
Base branch: `feat/974-schema-aware-extract` (#1067)

Merge order should be:

1. #1067
2. this PR (#1099)

After #1067 merges into `develop`, this branch can be rebased onto the updated `develop` with the same single commit.

Closes #986 after #1067 is merged and this stacked PR lands.