Multi-page mode: site-wide link-pattern scanning #7

@BraedenBDev

Context

Companion to #6 (multi-page crawl mode). Once we have a BFS crawl graph, a site-wide pattern scan over the collected HTML bodies surfaces internal link hygiene issues that a single-page audit can't catch.

A Codex-run audit of almostimpossible.agency used this approach to find three distinct problems:

  1. href="/contact" links across multiple pages that resolve to a soft-404 (no dedicated contact page exists)
  2. Internal links to the apex host (href="https://almostimpossible.agency/...") instead of the canonical www. host — each causing an avoidable 301 redirect hop
  3. Cloudflare email-protection URLs (/cdn-cgi/l/email-protection) that 404 when crawled by bots that ignore the fragment

None of these are detectable from a single-page audit. They require scanning every sitemap page + every BFS-discovered page for bad patterns.

Proposal

New script: scripts/scan-link-patterns.sh <urls-file> [--patterns-file <file>]
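
The command line above could be parsed roughly as follows. This is only a sketch: the function name parse_args, the variable names, and the exit code 2 are assumptions, and the default patterns path is taken from the acceptance criteria.

```shell
# Sketch of argument handling for:
#   scan-link-patterns.sh <urls-file> [--patterns-file <file>]
# parse_args, urls_file and patterns_file are hypothetical names.
parse_args() {
  urls_file=""
  patterns_file="scripts/_default-link-patterns.txt"
  while [ $# -gt 0 ]; do
    case $1 in
      --patterns-file)
        [ $# -ge 2 ] || { echo "missing value for --patterns-file" >&2; return 2; }
        patterns_file=$2
        shift 2
        ;;
      -*)
        echo "unknown option: $1" >&2
        return 2
        ;;
      *)
        urls_file=$1
        shift
        ;;
    esac
  done
  # The URL list is the only required argument.
  [ -n "$urls_file" ] || {
    echo "usage: scan-link-patterns.sh <urls-file> [--patterns-file <file>]" >&2
    return 2
  }
}
```

Calling parse_args "$@" at the top of the script would leave the two paths in urls_file and patterns_file.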

Default patterns (configurable)

href="http://           # insecure internal links
href="/cdn-cgi/l/email-protection
href="https://[non-canonical-host]    # detected from input canonicalization
href="/404
href="/undefined
href=""                 # empty href
href="#"                # placeholder anchor
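
The non-canonical-host entry is the only default that has to be computed rather than shipped as a literal. One way this could work, assuming the canonical host is known up front and the only variant of interest is its apex/www sibling (the helper name noncanonical_pattern is hypothetical):

```shell
# Given the canonical host, print the fixed-string pattern that matches
# links to its apex/www sibling. Assumption: only the www <-> apex pair
# matters; other subdomains are not considered.
noncanonical_pattern() {
  canonical_host=$1
  case $canonical_host in
    www.*) other=${canonical_host#www.} ;;   # canonical is www, flag apex
    *)     other="www.$canonical_host" ;;    # canonical is apex, flag www
  esac
  printf 'href="https://%s\n' "$other"
}
```

For the audited site, noncanonical_pattern www.almostimpossible.agency prints href="https://almostimpossible.agency, which a fixed-string match will not confuse with the www. form because the text after https:// differs.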

Custom patterns

Users can supply --patterns-file my-patterns.txt with one pattern per line. Patterns are matched as fixed strings (grep -F, see Implementation), not as regexes. Example:

href="/contact   # if /contact is soft-404, flag it everywhere
href="https://old-domain.com
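
The examples carry trailing # comments, while href="#" is itself a pattern, so the loader needs some rule to tell the two apart. A minimal sketch, assuming a comment starts at whitespace followed by "# " (both this rule and the name load_patterns are assumptions, not part of the proposal):

```shell
# Print the usable patterns from a patterns file, one per line.
# Assumed comment rule: a comment begins at whitespace followed by "# ".
load_patterns() {
  sed -e '/^[[:space:]]*#/d' \
      -e 's/[[:space:]]\{1,\}#[[:space:]].*$//' \
      -e 's/[[:space:]]*$//' \
      -e '/^$/d' "$1"
}
```

Under this rule href="#" survives intact, because its # follows a quote rather than whitespace. Blank lines are dropped because an empty pattern passed to grep -F -f matches every line.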

Output

{
  "scannedPages": 30,
  "patterns": [
    {
      "pattern": "href=\"/cdn-cgi/l/email-protection",
      "hitCount": 12,
      "hitPages": ["https://...", "https://..."]
    }
  ]
}
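
Since every pattern contains a double quote, the JSON writer has to escape it. A sketch of emitting one entry of the "patterns" array without depending on jq (the function names json_escape and pattern_entry are hypothetical, and the single-line layout is just one possible formatting):

```shell
# Escape backslashes and double quotes for use inside a JSON string.
# Control characters are not handled in this sketch.
json_escape() {
  printf '%s\n' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Usage: pattern_entry <pattern> <hit-count> [page-url ...]
# Prints one object of the "patterns" array on a single line.
pattern_entry() {
  pattern=$1
  count=$2
  shift 2
  pages=""
  for page in "$@"; do
    # Append a quoted URL, comma-separated after the first one.
    pages="${pages}${pages:+, }\"$(json_escape "$page")\""
  done
  printf '{"pattern": "%s", "hitCount": %s, "hitPages": [%s]}\n' \
    "$(json_escape "$pattern")" "$count" "$pages"
}
```

If jq is already a dependency of the project, building the object with jq would be more robust than hand-written escaping.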

Implementation

  • Takes a list of URLs (typically from the crawl graph — see Multi-page mode: sitemap vs crawl-graph diff #6)
  • For each URL, fetches HTML with curl (cached if already in $RUN_DIR)
  • Runs grep -F (not regex — faster, safer) for each pattern
  • Aggregates hits + reports the pages where each pattern appears
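
The steps above can be sketched as a per-pattern pass over the URL list. The cache layout (one file per URL, named by the cksum of the URL) and the function names are assumptions made for illustration; the issue only says pages are cached in $RUN_DIR.

```shell
RUN_DIR=${RUN_DIR:-./run}

# Hypothetical cache layout: $RUN_DIR/<cksum-of-url>.html
cache_path() {
  printf '%s/%s.html\n' "$RUN_DIR" "$(printf '%s' "$1" | cksum | cut -d' ' -f1)"
}

# Print the cached file for a URL, fetching it with curl only if missing.
fetch_cached() {
  file=$(cache_path "$1")
  [ -s "$file" ] || curl -fsSL --max-time 20 -o "$file" "$1" || return 1
  printf '%s\n' "$file"
}

# Usage: pages_matching <fixed-string-pattern> <urls-file>
# Prints every URL whose HTML contains the pattern.
pages_matching() {
  pattern=$1
  urls_file=$2
  while IFS= read -r url; do
    [ -n "$url" ] || continue
    file=$(fetch_cached "$url") || continue   # skip pages that fail to fetch
    # -F: fixed string, -q: exit status only, -e: safe for patterns starting with "-"
    grep -qF -e "$pattern" "$file" && printf '%s\n' "$url"
  done < "$urls_file"
  return 0
}
```

Running pages_matching once per pattern keeps the hits separated by pattern, which is what the output format needs.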

Acceptance criteria

  • scripts/scan-link-patterns.sh exists, outputs JSON
  • Ships a default pattern list at scripts/_default-link-patterns.txt
  • Accepts custom patterns via --patterns-file
  • Integration test with a fixture showing at least 3 different pattern categories
  • Per-pattern output lists up to 20 sample URLs and a total count
  • SKILL.md --crawl mode calls this automatically after the crawl
  • New scoring finding: "Internal link hygiene issues" weighted by pattern count
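
The "up to 20 sample URLs and a total count" criterion could be met by a small filter over the list of matching pages. A sketch, assuming the count means the number of matching pages rather than the number of individual links (the name summarize_hits is hypothetical):

```shell
# Reads matching URLs on stdin, one per line.
# Prints the total number of lines first, then at most <max> sample URLs.
# Usage: summarize_hits [max]   (max defaults to 20)
summarize_hits() {
  awk -v max="${1:-20}" '
    NR <= max { sample[NR] = $0 }
    END {
      print NR
      n = (NR < max) ? NR : max
      for (i = 1; i <= n; i++) print sample[i]
    }'
}
```

The first output line would feed hitCount and the remaining lines hitPages.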

Depends on

Prior art

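The Codex audit session used this exact approach, scanning all 23 sitemap URLs for three patterns in a tight Python loop. A Bash port is straightforward: basically while read -r url; do grep -l -F -f patterns "$file_for_url"; done, where $file_for_url is the cached HTML for that URL (a raw URL contains slashes, so $url.html cannot be used as a filename directly). Note that grep -l with -f only reports that some pattern matched, so per-pattern counts need one grep pass per pattern.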

Labels: enhancement (New feature or request), v2-future (Scheduled for post-v1 release)