
feat(core): best-first URL scoring strategy for crawl workloads #983

@shaun0927

Description

Summary

Add an opt-in "best_first" strategy to crawl (and future crawl jobs) so OpenChrome can prioritize the most relevant discovered URLs before spending browser, network, and token budget.

Default remains the current BFS traversal.

Why

crawl currently performs breadth-first traversal with max_depth, max_pages, scope, include/exclude filters, robots handling, and a fixed concurrency limiter. This is predictable, but it wastes crawl budget on low-value pages when the user has a clear information need such as pricing, API docs, auth setup, changelog, or security policy.

Crawl4AI's BestFirstCrawlingStrategy is a strong fit for OpenChrome if implemented as deterministic URL scoring rather than content-level LLM reasoning.

Non-goals

  • Content-level scoring and LLM-based ranking; this issue covers deterministic URL-only scoring.
  • Changes to the default BFS traversal or its output shape.

Proposed API

crawl({
  "url": "https://example.com/docs/",
  "strategy": "bfs" | "best_first",
  "query": "enterprise pricing api limits",
  "url_score": {
    "keywords": ["pricing", "enterprise", "limits"],
    "prefer_paths": ["/pricing", "/docs", "/reference"],
    "exclude_paths": ["/blog", "/careers", "/legal"],
    "same_depth_bias": 0.1
  },
  "max_pages": 20
})
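
A rough TypeScript shape for these options (a sketch; `CrawlOptions`, `CrawlStrategy`, and `UrlScoreOptions` are illustrative names, not existing OpenChrome types):

```typescript
// Illustrative types for the proposed options; field names mirror the JSON above.
type CrawlStrategy = "bfs" | "best_first";

interface UrlScoreOptions {
  /** Keywords matched against pathname, slug, and search params (+1.0 each). */
  keywords?: string[];
  /** Path prefixes to boost (+1.5 per match). */
  prefer_paths?: string[];
  /** Path prefixes to penalize (-2.0 per match). */
  exclude_paths?: string[];
  /** Bonus applied to URLs at the same depth as the start path. */
  same_depth_bias?: number;
}

interface CrawlOptions {
  url: string;
  strategy?: CrawlStrategy; // defaults to "bfs"
  query?: string;
  url_score?: UrlScoreOptions;
  max_pages?: number; // still a hard cap
}
```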

Result additions when strategy: "best_first":

{
  "summary": {
    "strategy": "best_first",
    "scored_urls": 137,
    "skipped_below_threshold": 22
  },
  "pages": [
    { "url": "...", "score": 2.34, "score_reasons": ["keyword:pricing", "path:/pricing"] }
  ]
}

Deterministic scoring rules

Create src/core/crawl/url-scorer.ts with a pure scoreUrl(url, context) function.

Initial score components:

  • +1.0 for each query keyword found in pathname, slug, title-like URL segment, or search params.
  • +1.5 for each prefer_paths prefix match.
  • -2.0 for each exclude_paths prefix match.
  • +0.3 for URLs closer to the start path.
  • -0.2 × depth as a depth penalty.
  • -1.0 for obvious low-signal paths: /tag/, /category/, /author/, /feed, /rss, /login, /signup unless query explicitly includes those terms.
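
A minimal sketch of the pure scorer implementing the components above (the `ScoreContext` shape is an assumption; the start-path proximity bonus and `same_depth_bias` are omitted for brevity):

```typescript
// Hypothetical sketch of src/core/crawl/url-scorer.ts; the final context shape may differ.
export interface ScoreContext {
  keywords: string[];
  preferPaths: string[];
  excludePaths: string[];
  depth: number; // link depth from the start URL
}

const LOW_SIGNAL = ["/tag/", "/category/", "/author/", "/feed", "/rss", "/login", "/signup"];

export function scoreUrl(rawUrl: string, ctx: ScoreContext): { score: number; reasons: string[] } {
  const url = new URL(rawUrl);
  const haystack = (url.pathname + url.search).toLowerCase();
  let score = 0;
  const reasons: string[] = [];

  for (const kw of ctx.keywords) {
    if (haystack.includes(kw.toLowerCase())) {
      score += 1.0;
      reasons.push(`keyword:${kw}`);
    }
  }
  for (const p of ctx.preferPaths) {
    if (url.pathname.startsWith(p)) {
      score += 1.5;
      reasons.push(`path:${p}`);
    }
  }
  for (const p of ctx.excludePaths) {
    if (url.pathname.startsWith(p)) {
      score -= 2.0;
      reasons.push(`exclude:${p}`);
    }
  }
  score -= 0.2 * ctx.depth; // depth penalty

  for (const sig of LOW_SIGNAL) {
    // Penalize low-signal paths unless the query explicitly asks for them.
    if (haystack.includes(sig) && !ctx.keywords.some((k) => sig.includes(k.toLowerCase()))) {
      score -= 1.0;
      reasons.push(`low_signal:${sig}`);
      break; // apply the penalty once
    }
  }
  return { score, reasons };
}
```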

Use a priority queue for best-first mode; tie-break by discovery order and normalized URL for deterministic output. Keep BFS queue implementation untouched for the default path.
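
The tie-break above can be expressed as a single comparator; this sketch uses a sorted array rather than a real heap (`QueueEntry` and `BestFirstQueue` are illustrative names, not the proposed implementation):

```typescript
// Hypothetical queue entry; the comparator keeps pop order stable across runs.
interface QueueEntry {
  url: string;            // normalized URL
  score: number;
  discoveryOrder: number; // monotonically increasing discovery counter
}

function compareEntries(a: QueueEntry, b: QueueEntry): number {
  if (a.score !== b.score) return b.score - a.score;        // higher score first
  if (a.discoveryOrder !== b.discoveryOrder) {
    return a.discoveryOrder - b.discoveryOrder;             // earlier discovery first
  }
  return a.url < b.url ? -1 : a.url > b.url ? 1 : 0;        // normalized URL as final tie-break
}

// Simple sorted-array queue for clarity; a binary heap would be the production choice.
class BestFirstQueue {
  private entries: QueueEntry[] = [];
  push(e: QueueEntry): void {
    this.entries.push(e);
    this.entries.sort(compareEntries);
  }
  pop(): QueueEntry | undefined {
    return this.entries.shift();
  }
}
```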

Acceptance criteria

  • strategy is optional and defaults to current BFS behavior.
  • BFS output remains byte-identical for existing crawl fixture tests.
  • strategy: "best_first" visits higher-scoring URLs before lower-scoring URLs on a deterministic fixture site.
  • Scope/include/exclude/robots checks are applied before enqueueing or before fetch exactly as they are in BFS.
  • max_pages is still a hard cap.
  • Each crawled page in best-first mode includes score and score_reasons.
  • Tie-breaking is stable across repeated test runs.
  • url-scorer.ts unit tests cover keyword, prefer path, exclude path, depth penalty, and low-signal penalties.
  • npm run build && npm test && npm run lint && npm run lint:tier pass.

OpenChrome real verification after merge

  1. Start OpenChrome HTTP server after build.
  2. Run BFS baseline:
    crawl({ "url": "https://docs.github.com/en", "max_pages": 10, "strategy": "bfs", "output_format": "text" })
  3. Run best-first query:
    crawl({
      "url": "https://docs.github.com/en",
      "max_pages": 10,
      "strategy": "best_first",
      "query": "actions workflow secrets permissions",
      "url_score": {
        "keywords": ["actions", "workflow", "secrets", "permissions"],
        "prefer_paths": ["/en/actions"],
        "exclude_paths": ["/en/billing", "/en/copilot"]
      },
      "output_format": "text"
    })
    Pass if:
    • More best-first results contain /actions in the URL than BFS results at the same max_pages.
    • First three best-first pages have score and non-empty score_reasons.
    • No result violates scope or exclude paths.
  4. Regression:
    crawl({ "url": "https://example.com", "max_pages": 3 })
    Pass if legacy BFS shape remains unchanged when strategy is omitted.
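
The /actions comparison in step 3 can be checked mechanically with a small helper (hypothetical; assumes both runs expose a pages array of objects with a url field):

```typescript
// Count how many crawled URLs contain a marker substring; used to compare
// BFS vs best-first relevance at the same max_pages.
function countMatching(pages: { url: string }[], marker: string): number {
  return pages.filter((p) => p.url.includes(marker)).length;
}

function bestFirstBeatsBfs(
  bfsPages: { url: string }[],
  bestFirstPages: { url: string }[],
  marker = "/actions",
): boolean {
  return countMatching(bestFirstPages, marker) > countMatching(bfsPages, marker);
}
```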

Self-review tightening

  • This is intentionally URL-only scoring; content scoring and LLM ranking are excluded.
  • The implementation must be small enough to live in src/core/crawl/url-scorer.ts plus crawl wiring; if it needs broad crawl refactoring, split that refactor into a prerequisite.
  • The live verification is advisory; CI acceptance must rely on deterministic fixture sites to avoid public-site drift.

Fit with OpenChrome direction

This improves crawl relevance and token efficiency without changing browser-control tools, adding LLM dependencies, or weakening safety boundaries. It is a small deterministic scheduler extension.

Related

OpenChrome real-verification checklist

Re-verified on 2026-05-14 after applying the latest merged version. Only items that can be proven directly from OpenChrome responses, local fixtures, and build/test artifacts are kept as pass conditions. Conditions such as human review, external-site stability, or unverified PR status are excluded from the pass criteria.

Verification targets

Latest-version / common runtime verification

  • Applied the latest develop source and confirmed npm run build passes.
  • Confirmed npm run lint:tier passes.
  • Confirmed npm test -- --runInBand results: 504/507 suites passed (3 skipped) and 6429/6525 tests passed (96 skipped). The Jest open-handle warning is recorded separately as a runtime risk.
  • oc_connection_health returned a connected state.
  • Observed DOM state changes on a local fixture via the OpenChrome navigate/read_page/interact/javascript_tool paths.
  • Confirmed that the key results are reproducible on the same fixture with the same settings.

Per-issue resolution evidence

  • Implementation PR linked to the latest develop: 1065
  • Related test/source evidence exists in the latest tree:
    • src/tools/crawl.ts
    • tests/core/tools/crawl.engine.test.ts
    • src/core/skill-memory/replay-artifact.ts
    • docs/agent/capability-map.md
    • docs/harness/run-events.md
    • docs/mcp/pagination.md
  • The checklist keeps no pass condition that cannot be reproduced from OpenChrome responses, fixtures, or local artifacts.

Failure / hold criteria

  • If any check is unmet, the issue is not closed.
  • If a failure reproduces as a defect in the latest code, record the failing OpenChrome call, the response excerpt, and the fixture state as evidence, and open a separate fix PR.

Metadata

Assignees

No one assigned

Labels

P2 (Medium priority), enhancement (New feature or request), performance (Performance, latency, throughput, or resource-use improvement)

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests