Merged
19 commits
ef02fb5
Prioritize relevant crawl URLs without changing default BFS
shaun0927 May 12, 2026
e0bb883
Keep best-first crawling robust around bad URLs
shaun0927 May 12, 2026
cbbf4af
Align best-first PR with fixed develop baseline
shaun0927 May 12, 2026
5d40456
Keep Cursor verification tolerant of Tier 1 growth
shaun0927 May 12, 2026
ac3de5d
Make console baseline portable across line endings
shaun0927 May 12, 2026
d7d6f86
Accept clean SIGTERM reports in health gating tests
shaun0927 May 12, 2026
050b68c
Give health shutdown CI enough time
shaun0927 May 12, 2026
b051df8
Keep admin key stdout assertion noise tolerant
shaun0927 May 12, 2026
f74d7cb
Wait for domain memory persistence before sizing
shaun0927 May 12, 2026
f72b035
Refresh lower-bound fixture after s2c merge
shaun0927 May 12, 2026
6161b00
Refresh admin-key CLI test conflict after server merges
shaun0927 May 13, 2026
1ed6069
Make best-first CI deterministic after develop merge
shaun0927 May 13, 2026
3a84903
Keep best-first current with crawl tools
shaun0927 May 13, 2026
60a0727
Isolate HTTP auth ports across test workers
shaun0927 May 13, 2026
370e51f
Parse admin key JSON through worker noise
shaun0927 May 13, 2026
09f6025
Preserve shallow best-first crawl candidates
shaun0927 May 13, 2026
c4c8ee0
Merge remote-tracking branch 'origin/develop' into feat/983-best-first
shaun0927 May 13, 2026
32e2c5a
Merge develop into feat/983-best-first (resolve crawl.ts)
shaun0927 May 13, 2026
bf7c4b4
Merge develop into feat/983-best-first
shaun0927 May 13, 2026
171 changes: 171 additions & 0 deletions src/core/crawl/url-scorer.ts
@@ -0,0 +1,171 @@
export interface UrlScoreOptions {
  query?: string;
  keywords?: string[];
  preferPaths?: string[];
  excludePaths?: string[];
  sameDepthBias?: number;
  startUrl?: string;
}

export interface UrlScoreResult {
  score: number;
  reasons: string[];
}
const LOW_SIGNAL_SEGMENTS = new Set([
  'tag',
  'tags',
  'category',
  'categories',
  'author',
  'authors',
  'feed',
  'rss',
  'login',
  'signin',
  'signup',
  'register',
]);

function normalizeTerm(term: string): string {
  return term.trim().toLowerCase().replace(/^\/+|\/+$/g, '');
}

function queryTerms(query?: string): string[] {
  if (!query) return [];
  const seen = new Set<string>();
  for (const raw of query.split(/[^\p{L}\p{N}_-]+/u)) {
    const term = normalizeTerm(raw);
    if (term.length >= 2) seen.add(term);
  }
  return Array.from(seen);
}

function safeDecodePathname(pathname: string): string {
  try {
    return decodeURIComponent(pathname);
  } catch {
    return pathname;
  }
}

function normalizePathPrefix(path: string): string {
  const trimmed = path.trim();
  if (!trimmed) return '';
  return trimmed.startsWith('/') ? trimmed.toLowerCase() : `/${trimmed.toLowerCase()}`;
}

function pathDistance(startPath: string, candidatePath: string): number {
  const startSegments = startPath.split('/').filter(Boolean);
  const candidateSegments = candidatePath.split('/').filter(Boolean);
  let shared = 0;
  while (
    shared < startSegments.length &&
    shared < candidateSegments.length &&
    startSegments[shared] === candidateSegments[shared]
  ) {
    shared++;
  }
  return Math.max(startSegments.length, candidateSegments.length) - shared;
}

export function buildUrlScoreOptions(input: {
  query?: unknown;
  url_score?: unknown;
  startUrl?: string;
}): UrlScoreOptions {
  const raw = input.url_score && typeof input.url_score === 'object'
    ? input.url_score as Record<string, unknown>
    : {};
  const toStringArray = (value: unknown): string[] | undefined => {
    if (!Array.isArray(value)) return undefined;
    return value.filter((v): v is string => typeof v === 'string' && v.trim().length > 0);
  };
  return {
    query: typeof input.query === 'string' ? input.query : undefined,
    keywords: toStringArray(raw.keywords),
    preferPaths: toStringArray(raw.prefer_paths),
    excludePaths: toStringArray(raw.exclude_paths),
    sameDepthBias: typeof raw.same_depth_bias === 'number' && Number.isFinite(raw.same_depth_bias)
      ? raw.same_depth_bias
      : undefined,
    startUrl: input.startUrl,
  };
}

export function scoreUrl(candidateUrl: string, depth: number, options: UrlScoreOptions = {}): UrlScoreResult {
  const reasons: string[] = [];
  let score = 0;
  let parsed: URL;
  try {
    parsed = new URL(candidateUrl);
  } catch {
    return { score: -100, reasons: ['invalid-url'] };
  }

  const explicitKeywords = (options.keywords || []).map(normalizeTerm).filter(Boolean);
  const terms = Array.from(new Set([...queryTerms(options.query), ...explicitKeywords]));
  const decodedPathname = safeDecodePathname(parsed.pathname);
  const haystack = `${decodedPathname} ${parsed.searchParams.toString()}`.toLowerCase();

  for (const term of terms) {
    if (!term) continue;
    if (haystack.includes(term)) {
      score += 1.0;
      reasons.push(`keyword:${term}`);
    }
  }

  for (const prefix of options.preferPaths || []) {
    const normalized = normalizePathPrefix(prefix);
    if (normalized && parsed.pathname.toLowerCase().startsWith(normalized)) {
P2 Badge Match path hints on segment boundaries

Using startsWith for prefer_paths/exclude_paths makes /blog also match unrelated paths like /blogging or /blog-roll, so best-first can boost or suppress the wrong URLs on real sites with similarly prefixed routes. This changes crawl ordering and can spend the page budget on less relevant pages even when users provide precise path hints.
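
One way to address this is to compare on whole path segments rather than raw string prefixes. The sketch below is only an illustration: `matchesPathPrefix` is a hypothetical helper that does not exist in this PR, and treating a root hint (`/`) as matching every path is an assumption chosen to mirror what `startsWith('/')` does today.

```typescript
// Hypothetical helper (not part of this PR): returns true only when `prefix`
// matches `pathname` on a whole-segment boundary.
function matchesPathPrefix(pathname: string, prefix: string): boolean {
  // Lowercase, drop trailing slashes, and ensure a single leading slash.
  const normalize = (p: string): string => {
    const trimmed = p.trim().toLowerCase().replace(/\/+$/, '');
    return trimmed.startsWith('/') ? trimmed : `/${trimmed}`;
  };
  if (!prefix.trim()) return false; // empty hints never match
  const path = normalize(pathname);
  const hint = normalize(prefix);
  if (hint === '/') return true; // assumption: a root hint matches every path
  // Either the exact path, or the hint followed by a segment separator.
  return path === hint || path.startsWith(`${hint}/`);
}

console.log(matchesPathPrefix('/blog/post-1', '/blog')); // true
console.log(matchesPathPrefix('/blogging', '/blog'));    // false
```

With a check like this, `/blog` would still boost `/blog` and `/blog/post-1` but would leave `/blogging` and `/blog-roll` untouched. The same helper could serve both `prefer_paths` and `exclude_paths`.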

      score += 1.5;
      reasons.push(`path:${normalized}`);
    }
  }

  for (const prefix of options.excludePaths || []) {
    const normalized = normalizePathPrefix(prefix);
    if (normalized && parsed.pathname.toLowerCase().startsWith(normalized)) {
      score -= 2.0;
      reasons.push(`exclude:${normalized}`);
    }
  }

  if (options.startUrl) {
    try {
      const start = new URL(options.startUrl);
      if (start.origin === parsed.origin) {
        const distance = pathDistance(start.pathname.toLowerCase(), parsed.pathname.toLowerCase());
        const proximity = Math.max(0, 3 - distance) * 0.1;
        if (proximity > 0) {
          score += proximity;
          reasons.push(`proximity:${proximity.toFixed(1)}`);
        }
      }
    } catch {
      // ignore malformed start URL
    }
  }

  if (options.sameDepthBias && Number.isFinite(options.sameDepthBias)) {
    score += options.sameDepthBias;
    reasons.push(`bias:${options.sameDepthBias}`);
Comment on lines +151 to +153
P2 Badge Apply same_depth_bias conditionally by candidate depth

same_depth_bias is currently added to every scored URL unconditionally, so it shifts all scores by the same constant and does not change best_first ordering at all. In practice, users who provide this option get different numeric score values but no traversal effect, which makes the knob effectively non-functional for crawl prioritization.

Comment on lines +151 to +153
P2 Badge Make same_depth_bias affect ranking

same_depth_bias is currently added as a constant to every scored URL, so it does not change relative scores and therefore cannot influence best_first crawl ordering at all. In crawl, URLs are prioritized strictly by score, so this option is effectively a no-op for its intended purpose (changing traversal priority) and users who set it will see no behavioral difference beyond score labels.

Comment on lines +151 to +153
P2 Badge Make same_depth_bias affect relative URL ordering

same_depth_bias is currently added as a flat constant to every scored URL, so it cancels out in all pairwise comparisons and cannot change which pages are crawled first in strategy="best_first". In practice, callers who set url_score.same_depth_bias get different score values but identical traversal order, which makes this advertised scoring hint ineffective for crawl prioritization.

Comment on lines +151 to +153
P2 Badge Make same_depth_bias affect relative URL ranking

same_depth_bias is currently added as a constant to every scored URL, so it never changes ordering in strategy: "best_first" (the queue sort uses score differences, and a uniform offset cancels out). This makes a documented scoring hint effectively inert for traversal decisions, so callers cannot actually tune depth preference with this field.

Comment on lines +151 to +153
P2 Badge Make same_depth_bias influence best-first ordering

same_depth_bias is added as a constant for every candidate URL, so it cancels out in pairwise comparisons and cannot change crawl priority at all. In strategy: "best_first", users can set this option expecting different traversal, but the ordering is identical regardless of the value because every score is shifted equally. Apply the bias conditionally (for the intended depth relation) or remove/explain the option to avoid a no-op tuning knob.
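
The cancellation described above can be reproduced in isolation. The sketch below is standalone and does not call the PR's `scoreUrl`: `Candidate`, `rank`, `uniformBias`, and `conditionalBias` are hypothetical names, the scores are made-up inputs, and "apply the bias only at a reference depth" is just one possible reading of the option, since the PR does not define which depth relation is intended.

```typescript
// Hypothetical shapes and helpers for illustration only (not from this PR).
interface Candidate {
  url: string;
  depth: number;
  baseScore: number; // score before any bias is applied
}

// Rank URLs by descending score.
function rank(candidates: Candidate[], score: (c: Candidate) => number): string[] {
  return [...candidates].sort((a, b) => score(b) - score(a)).map((c) => c.url);
}

// Behaviour flagged in review: the same constant is added to every candidate.
const uniformBias = (bias: number) => (c: Candidate): number => c.baseScore + bias;

// Assumed alternative: add the bias only to candidates at `referenceDepth`.
const conditionalBias = (bias: number, referenceDepth: number) => (c: Candidate): number =>
  c.baseScore + (c.depth === referenceDepth ? bias : 0);

const candidates: Candidate[] = [
  { url: 'https://example.com/docs/api', depth: 2, baseScore: 1.0 },
  { url: 'https://example.com/docs', depth: 1, baseScore: 0.8 },
  { url: 'https://example.com/blog', depth: 1, baseScore: 0.4 },
];

const withoutBias = rank(candidates, uniformBias(0));
const withUniform = rank(candidates, uniformBias(5));
const withConditional = rank(candidates, conditionalBias(0.5, 1));

console.log(withoutBias.join() === withUniform.join()); // true: a flat offset cancels out
console.log(withConditional[0]); // https://example.com/docs
```

In this sketch the uniform offset leaves the order unchanged, while the depth-conditional variant promotes the depth-1 `/docs` page ahead of the deeper `/docs/api` page. Whether that is the intended meaning of `same_depth_bias` is a question for the PR author.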

  }

  if (depth > 0) {
    const penalty = 0.2 * depth;
    score -= penalty;
    reasons.push(`depth:-${penalty.toFixed(1)}`);
  }

  const querySet = new Set(terms);
  for (const segment of parsed.pathname.toLowerCase().split('/').filter(Boolean)) {
    if (LOW_SIGNAL_SEGMENTS.has(segment) && !querySet.has(segment)) {
      score -= 1.0;
      reasons.push(`low-signal:${segment}`);
    }
  }

  return { score: Number(score.toFixed(3)), reasons };
}