
feat(core): best-first URL scoring strategy for crawl workloads #983

@shaun0927

Description

Summary

Add an opt-in "best_first" strategy to crawl (and future crawl jobs) so OpenChrome can prioritize the most relevant discovered URLs before spending browser, network, and token budget.

Default remains the current BFS traversal.

Why

crawl currently performs breadth-first traversal with max_depth, max_pages, scope, include/exclude filters, robots handling, and a fixed concurrency limiter. This is predictable, but it wastes crawl budget on low-value pages when the user has a clear information need such as pricing, API docs, auth setup, changelog, or security policy.

Crawl4AI's BestFirstCrawlingStrategy is a strong fit for OpenChrome if implemented as deterministic URL scoring rather than content-level LLM reasoning.

Non-goals

  • Content-level scoring and LLM-based ranking; this issue covers deterministic URL-only scoring.
  • Changes to the default BFS traversal or its output shape.

Proposed API

crawl({
  "url": "https://example.com/docs/",
  "strategy": "bfs" | "best_first",
  "query": "enterprise pricing api limits",
  "url_score": {
    "keywords": ["pricing", "enterprise", "limits"],
    "prefer_paths": ["/pricing", "/docs", "/reference"],
    "exclude_paths": ["/blog", "/careers", "/legal"],
    "same_depth_bias": 0.1
  },
  "max_pages": 20
})
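
A rough TypeScript shape for these options (a sketch; `CrawlOptions`, `CrawlStrategy`, and `UrlScoreOptions` are illustrative names, not existing OpenChrome types):

```typescript
// Illustrative types for the proposed options; field names mirror the JSON above.
type CrawlStrategy = "bfs" | "best_first";

interface UrlScoreOptions {
  /** Keywords matched against pathname, slug, and search params (+1.0 each). */
  keywords?: string[];
  /** Path prefixes to boost (+1.5 per match). */
  prefer_paths?: string[];
  /** Path prefixes to penalize (-2.0 per match). */
  exclude_paths?: string[];
  /** Bonus applied to URLs at the same depth as the start path. */
  same_depth_bias?: number;
}

interface CrawlOptions {
  url: string;
  strategy?: CrawlStrategy; // defaults to "bfs"
  query?: string;
  url_score?: UrlScoreOptions;
  max_pages?: number; // still a hard cap
}
```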

Result additions when strategy: "best_first":

{
  "summary": {
    "strategy": "best_first",
    "scored_urls": 137,
    "skipped_below_threshold": 22
  },
  "pages": [
    { "url": "...", "score": 2.34, "score_reasons": ["keyword:pricing", "path:/pricing"] }
  ]
}

Deterministic scoring rules

Create src/core/crawl/url-scorer.ts with a pure scoreUrl(url, context) function.

Initial score components:

  • +1.0 for each query keyword found in pathname, slug, title-like URL segment, or search params.
  • +1.5 for each prefer_paths prefix match.
  • -2.0 for each exclude_paths prefix match.
  • +0.3 for URLs closer to the start path.
  • -0.2 × depth as a depth penalty.
  • -1.0 for obvious low-signal paths: /tag/, /category/, /author/, /feed, /rss, /login, /signup unless query explicitly includes those terms.
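
A minimal sketch of the pure scorer implementing the components above (the `ScoreContext` shape is an assumption; the start-path proximity bonus and `same_depth_bias` are omitted for brevity):

```typescript
// Hypothetical sketch of src/core/crawl/url-scorer.ts; the final context shape may differ.
export interface ScoreContext {
  keywords: string[];
  preferPaths: string[];
  excludePaths: string[];
  depth: number; // link depth from the start URL
}

const LOW_SIGNAL = ["/tag/", "/category/", "/author/", "/feed", "/rss", "/login", "/signup"];

export function scoreUrl(rawUrl: string, ctx: ScoreContext): { score: number; reasons: string[] } {
  const url = new URL(rawUrl);
  const haystack = (url.pathname + url.search).toLowerCase();
  let score = 0;
  const reasons: string[] = [];

  for (const kw of ctx.keywords) {
    if (haystack.includes(kw.toLowerCase())) {
      score += 1.0;
      reasons.push(`keyword:${kw}`);
    }
  }
  for (const p of ctx.preferPaths) {
    if (url.pathname.startsWith(p)) {
      score += 1.5;
      reasons.push(`path:${p}`);
    }
  }
  for (const p of ctx.excludePaths) {
    if (url.pathname.startsWith(p)) {
      score -= 2.0;
      reasons.push(`exclude:${p}`);
    }
  }
  score -= 0.2 * ctx.depth; // depth penalty

  for (const sig of LOW_SIGNAL) {
    // Penalize low-signal paths unless the query explicitly asks for them.
    if (haystack.includes(sig) && !ctx.keywords.some((k) => sig.includes(k.toLowerCase()))) {
      score -= 1.0;
      reasons.push(`low_signal:${sig}`);
      break; // apply the penalty once
    }
  }
  return { score, reasons };
}
```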

Use a priority queue for best-first mode; tie-break by discovery order and normalized URL for deterministic output. Keep BFS queue implementation untouched for the default path.
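
The tie-break above can be expressed as a single comparator; this sketch uses a sorted array rather than a real heap (`QueueEntry` and `BestFirstQueue` are illustrative names, not the proposed implementation):

```typescript
// Hypothetical queue entry; the comparator keeps pop order stable across runs.
interface QueueEntry {
  url: string;            // normalized URL
  score: number;
  discoveryOrder: number; // monotonically increasing discovery counter
}

function compareEntries(a: QueueEntry, b: QueueEntry): number {
  if (a.score !== b.score) return b.score - a.score;        // higher score first
  if (a.discoveryOrder !== b.discoveryOrder) {
    return a.discoveryOrder - b.discoveryOrder;             // earlier discovery first
  }
  return a.url < b.url ? -1 : a.url > b.url ? 1 : 0;        // normalized URL as final tie-break
}

// Simple sorted-array queue for clarity; a binary heap would be the production choice.
class BestFirstQueue {
  private entries: QueueEntry[] = [];
  push(e: QueueEntry): void {
    this.entries.push(e);
    this.entries.sort(compareEntries);
  }
  pop(): QueueEntry | undefined {
    return this.entries.shift();
  }
}
```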

Acceptance criteria

  • strategy is optional and defaults to current BFS behavior.
  • BFS output remains byte-identical for existing crawl fixture tests.
  • strategy: "best_first" visits higher-scoring URLs before lower-scoring URLs on a deterministic fixture site.
  • Scope/include/exclude/robots checks are applied before enqueueing or before fetch exactly as they are in BFS.
  • max_pages is still a hard cap.
  • Each crawled page in best-first mode includes score and score_reasons.
  • Tie-breaking is stable across repeated test runs.
  • url-scorer.ts unit tests cover keyword, prefer path, exclude path, depth penalty, and low-signal penalties.
  • npm run build && npm test && npm run lint && npm run lint:tier pass.

OpenChrome real verification after merge

  1. Start OpenChrome HTTP server after build.
  2. Run BFS baseline:
    crawl({ "url": "https://docs.github.com/en", "max_pages": 10, "strategy": "bfs", "output_format": "text" })
  3. Run best-first query:
    crawl({
      "url": "https://docs.github.com/en",
      "max_pages": 10,
      "strategy": "best_first",
      "query": "actions workflow secrets permissions",
      "url_score": {
        "keywords": ["actions", "workflow", "secrets", "permissions"],
        "prefer_paths": ["/en/actions"],
        "exclude_paths": ["/en/billing", "/en/copilot"]
      },
      "output_format": "text"
    })
    Pass if:
    • More best-first results contain /actions in the URL than BFS results at the same max_pages.
    • First three best-first pages have score and non-empty score_reasons.
    • No result violates scope or exclude paths.
  4. Regression:
    crawl({ "url": "https://example.com", "max_pages": 3 })
    Pass if legacy BFS shape remains unchanged when strategy is omitted.
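
The /actions comparison in step 3 can be checked mechanically with a small helper (hypothetical; assumes both runs expose a pages array of objects with a url field):

```typescript
// Count how many crawled URLs contain a marker substring; used to compare
// BFS vs best-first relevance at the same max_pages.
function countMatching(pages: { url: string }[], marker: string): number {
  return pages.filter((p) => p.url.includes(marker)).length;
}

function bestFirstBeatsBfs(
  bfsPages: { url: string }[],
  bestFirstPages: { url: string }[],
  marker = "/actions",
): boolean {
  return countMatching(bestFirstPages, marker) > countMatching(bfsPages, marker);
}
```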

Self-review tightening

  • This is intentionally URL-only scoring; content scoring and LLM ranking are excluded.
  • The implementation must be small enough to live in src/core/crawl/url-scorer.ts plus crawl wiring; if it needs broad crawl refactoring, split that refactor into a prerequisite.
  • The live verification is advisory; CI acceptance must rely on deterministic fixture sites to avoid public-site drift.

Fit with OpenChrome direction

This improves crawl relevance and token efficiency without changing browser-control tools, adding LLM dependencies, or weakening safety boundaries. It is a small deterministic scheduler extension.

Related

OpenChrome real-verification checklist

Re-verified on 2026-05-14 after applying the latest merged version. Only items that can be proven directly from OpenChrome responses, local fixtures, and build/test artifacts are kept as pass conditions. Conditions such as human review, external-site stability, or unverified PR status are excluded from the pass criteria.

Verification targets

Latest-version / common runtime verification

  • Applied the latest develop source and confirmed npm run build passes.
  • Confirmed npm run lint:tier passes.
  • Confirmed npm test -- --runInBand results: 504/507 suites passed (3 skipped) and 6429/6525 tests passed (96 skipped). The Jest open-handle warning is recorded separately as a runtime risk.
  • oc_connection_health returned a connected state.
  • Observed DOM state changes on a local fixture via the OpenChrome navigate/read_page/interact/javascript_tool paths.
  • Confirmed that the key results are reproducible on the same fixture with the same settings.

Per-issue resolution evidence

  • Implementation PR linked to the latest develop: 1065
  • Related test/source evidence exists in the latest tree:
    • src/tools/crawl.ts
    • tests/core/tools/crawl.engine.test.ts
    • src/core/skill-memory/replay-artifact.ts
    • docs/agent/capability-map.md
    • docs/harness/run-events.md
    • docs/mcp/pagination.md
  • The checklist keeps no pass condition that cannot be reproduced from OpenChrome responses, fixtures, or local artifacts.

Failure / hold criteria

  • If any check is unmet, the issue is not closed.
  • If a failure reproduces as a defect in the latest code, record the failing OpenChrome call, the response excerpt, and the fixture state as evidence, and open a separate fix PR.

Metadata

Assignees

No one assigned

Labels

P2 (Medium priority), enhancement (New feature or request), performance (Performance, latency, throughput, or resource-use improvement)

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests