You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Companion to #6 (multi-page crawl mode). Once we have a BFS crawl graph, a site-wide pattern scan over the collected HTML bodies surfaces internal link hygiene issues that a single-page audit can't catch.
A Codex-run audit of almostimpossible.agency used this approach to find three distinct problems:
href="/contact" links across multiple pages that resolve to a soft-404 (no dedicated contact page exists)
Internal links to the apex host (href="https://almostimpossible.agency/...") instead of the canonical www. host — each causing an avoidable 301 redirect hop
Cloudflare email-protection URLs (/cdn-cgi/l/email-protection) that 404 when crawled by bots that ignore the fragment
None of these are detectable from a single-page audit. They require scanning every sitemap page + every BFS-discovered page for bad patterns.
Proposal
New script: scripts/scan-link-patterns.sh <urls-file> [--patterns-file <file>]
The Codex audit session used this exact approach, scanning all 23 sitemap URLs for three patterns in a tight Python loop. Bash port is straightforward — basically while read url; do grep -l -F -f patterns $url.html; done.
Context
Companion to #6 (multi-page crawl mode). Once we have a BFS crawl graph, a site-wide pattern scan over the collected HTML bodies surfaces internal link hygiene issues that a single-page audit can't catch.
A Codex-run audit of almostimpossible.agency used this approach to find three distinct problems:
href="/contact"links across multiple pages that resolve to a soft-404 (no dedicated contact page exists)href="https://almostimpossible.agency/...") instead of the canonicalwww.host — each causing an avoidable 301 redirect hop/cdn-cgi/l/email-protection) that 404 when crawled by bots that ignore the fragmentNone of these are detectable from a single-page audit. They require scanning every sitemap page + every BFS-discovered page for bad patterns.
Proposal
New script:
scripts/scan-link-patterns.sh <urls-file> [--patterns-file <file>]Default patterns (configurable)
Custom patterns
Users can supply
--patterns-file my-patterns.txtwith one regex per line. Example:Output
{ "scannedPages": 30, "patterns": [ { "pattern": "href=\"/cdn-cgi/l/email-protection", "hitCount": 12, "hitPages": ["https://...", "https://..."] } ] }Implementation
$RUN_DIR)grep -F(not regex — faster, safer) for each patternAcceptance criteria
scripts/scan-link-patterns.shexists, outputs JSONscripts/_default-link-patterns.txt--patterns-file--crawlmode calls this automatically after the crawlDepends on
Prior art
The Codex audit session used this exact approach, scanning all 23 sitemap URLs for three patterns in a tight Python loop. Bash port is straightforward — basically
while read url; do grep -l -F -f patterns $url.html; done.