Summary
Add an opt-in strategy: "best_first" mode to crawl and future crawl jobs so OpenChrome can prioritize the most relevant discovered URLs before spending browser, network, and token budget.
Default remains the current BFS traversal.
Why
crawl currently performs breadth-first traversal with max_depth, max_pages, scope, include/exclude filters, robots handling, and a fixed concurrency limiter. This is predictable, but it wastes crawl budget on low-value pages when the user has a clear information need such as pricing, API docs, auth setup, changelog, or security policy.
Crawl4AI's BestFirstCrawlingStrategy is a strong fit for OpenChrome if implemented as deterministic URL scoring rather than content-level LLM reasoning.
Non-goals
- No changes to scope, include_patterns, exclude_patterns, respect_robots, or max_pages constraints.
Proposed API

```
crawl({
  "url": "https://example.com/docs/",
  "strategy": "bfs" | "best_first",
  "query": "enterprise pricing api limits",
  "url_score": {
    "keywords": ["pricing", "enterprise", "limits"],
    "prefer_paths": ["/pricing", "/docs", "/reference"],
    "exclude_paths": ["/blog", "/careers", "/legal"],
    "same_depth_bias": 0.1
  },
  "max_pages": 20
})
```
Result additions when strategy: "best_first":

```
{
  "summary": {
    "strategy": "best_first",
    "scored_urls": 137,
    "skipped_below_threshold": 22
  },
  "pages": [
    { "url": "...", "score": 2.34, "score_reasons": ["keyword:pricing", "path:/pricing"] }
  ]
}
```
Deterministic scoring rules
Create src/core/crawl/url-scorer.ts with a pure scoreUrl(url, context) function.
Initial score components:
- +1.0 for each query keyword found in the pathname, slug, title-like URL segment, or search params.
- +1.5 for each prefer_paths prefix match.
- -2.0 for each exclude_paths prefix match.
- +0.3 for URLs closer to the start path.
- -0.2 * depth as a depth penalty.
- -1.0 for obvious low-signal paths: /tag/, /category/, /author/, /feed, /rss, /login, /signup, unless the query explicitly includes those terms.
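A minimal sketch of how these components might combine in a pure scoreUrl function. The ScoreContext and ScoredUrl shapes are assumptions, not the actual OpenChrome API, and the +0.3 start-path proximity term is omitted because it needs the start URL:

```typescript
// Hypothetical sketch; ScoreContext, ScoredUrl, and LOW_SIGNAL are illustrative
// names, and the +0.3 start-path proximity component is omitted for brevity.
interface ScoreContext {
  keywords: string[];
  preferPaths: string[];
  excludePaths: string[];
  depth: number; // link depth from the start URL
}

interface ScoredUrl {
  score: number;
  reasons: string[];
}

const LOW_SIGNAL = ["/tag/", "/category/", "/author/", "/feed", "/rss", "/login", "/signup"];

function scoreUrl(url: string, ctx: ScoreContext): ScoredUrl {
  const { pathname, searchParams } = new URL(url);
  const haystack = `${pathname}?${searchParams.toString()}`.toLowerCase();
  const reasons: string[] = [];
  let score = 0;

  // +1.0 per query keyword found in the path or search params.
  for (const kw of ctx.keywords) {
    if (haystack.includes(kw.toLowerCase())) {
      score += 1.0;
      reasons.push(`keyword:${kw}`);
    }
  }
  // +1.5 per prefer_paths prefix match, -2.0 per exclude_paths prefix match.
  for (const p of ctx.preferPaths) {
    if (pathname.startsWith(p)) {
      score += 1.5;
      reasons.push(`path:${p}`);
    }
  }
  for (const p of ctx.excludePaths) {
    if (pathname.startsWith(p)) {
      score -= 2.0;
      reasons.push(`exclude:${p}`);
    }
  }
  // -0.2 * depth so shallow pages win against deep ones at equal relevance.
  score -= 0.2 * ctx.depth;

  // -1.0 for low-signal paths unless the query explicitly asks for them.
  for (const marker of LOW_SIGNAL) {
    const term = marker.replace(/\//g, "");
    const queried = ctx.keywords.some((k) => k.toLowerCase() === term);
    if (pathname.includes(marker) && !queried) {
      score -= 1.0;
      reasons.push(`low-signal:${marker}`);
    }
  }
  return { score, reasons };
}
```

Because the function depends only on the URL string and the context, it is trivially unit-testable, which is what the acceptance criteria below rely on.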
Use a priority queue for best-first mode; tie-break by discovery order and normalized URL for deterministic output. Keep BFS queue implementation untouched for the default path.
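The tie-break rule above could look like the following comparator. QueueEntry and the sorted-array frontier are illustrative assumptions; a production version would use a binary heap, but heaps alone are not stable, which is exactly why the explicit tie-breaks matter:

```typescript
// Hypothetical sketch of the deterministic best-first frontier ordering.
interface QueueEntry {
  url: string;          // normalized URL
  score: number;
  discoveredAt: number; // monotonically increasing discovery index
}

// Higher score first; ties broken by discovery order, then normalized URL,
// so two runs over the same site produce the same visit order.
function compareEntries(a: QueueEntry, b: QueueEntry): number {
  if (a.score !== b.score) return b.score - a.score;
  if (a.discoveredAt !== b.discoveredAt) return a.discoveredAt - b.discoveredAt;
  return a.url < b.url ? -1 : a.url > b.url ? 1 : 0;
}

// Sorted-array frontier: enough for a sketch, O(n log n) per pop.
function nextUrl(frontier: QueueEntry[]): QueueEntry | undefined {
  frontier.sort(compareEntries);
  return frontier.shift();
}
```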
Acceptance criteria
- strategy is optional and defaults to current BFS behavior.
- strategy: "best_first" visits higher-scoring URLs before lower-scoring URLs on a deterministic fixture site.
- max_pages is still a hard cap.
- Best-first page results include score and score_reasons.
- url-scorer.ts unit tests cover keyword, prefer path, exclude path, depth penalty, and low-signal penalties.
- npm run build && npm test && npm run lint && npm run lint:tier pass.
OpenChrome real verification after merge
- Start OpenChrome HTTP server after build.
- Run BFS baseline:

  ```
  crawl({ "url": "https://docs.github.com/en", "max_pages": 10, "strategy": "bfs", "output_format": "text" })
  ```

- Run best-first query:

  ```
  crawl({
    "url": "https://docs.github.com/en",
    "max_pages": 10,
    "strategy": "best_first",
    "query": "actions workflow secrets permissions",
    "url_score": {
      "keywords": ["actions", "workflow", "secrets", "permissions"],
      "prefer_paths": ["/en/actions"],
      "exclude_paths": ["/en/billing", "/en/copilot"]
    },
    "output_format": "text"
  })
  ```
Pass if:
- More best-first results contain /actions in the URL than BFS results at the same max_pages.
- First three best-first pages have score and non-empty score_reasons.
- No result violates scope or exclude paths.
- Regression:

  ```
  crawl({ "url": "https://example.com", "max_pages": 3 })
  ```

  Pass if the legacy BFS result shape remains unchanged when strategy is omitted.
Self-review tightening
- This is intentionally URL-only scoring; content scoring and LLM ranking are excluded.
- The implementation must be small enough to live in src/core/crawl/url-scorer.ts plus crawl wiring; if it needs broad crawl refactoring, split that refactor into a prerequisite.
- The live verification is advisory; CI acceptance must rely on deterministic fixture sites to avoid public-site drift.
Fit with OpenChrome direction
This improves crawl relevance and token efficiency without changing browser-control tools, adding LLM dependencies, or weakening safety boundaries. It is a small deterministic scheduler extension.
Related
- crawl_status once the job runner exists.
OpenChrome real-verification checklist
Re-verified on 2026-05-14 after applying the latest merged version. Only items that can be proven directly from OpenChrome responses, local fixtures, and build/test artifacts are kept as pass conditions. Conditions such as human review, external-site stability, and unverified PR status are excluded from the pass conditions.
Verification targets
Latest version / common runtime verification
- Confirmed that npm run build passes.
- Confirmed that npm run lint:tier passes.
- Confirmed npm test -- --runInBand results: 504/507 suites passed (3 skipped), 6429/6525 tests passed (96 skipped). The Jest open-handle warning is recorded separately as a runtime risk.
- oc_connection_health returned a connected status.
- Observed DOM state changes through the navigate/read_page/interact/javascript_tool path.
Per-issue resolution evidence
Failure/hold criteria
- Do not close the issue if even one check is unmet.
- If a failure reproduces as a defect in the latest code, record the failing OpenChrome call, a response excerpt, and the fixture state as evidence, and open a separate fix PR.