Multi-page mode: site-wide link-pattern scanning #7

## Context

Companion to #6 (multi-page crawl mode). Once we have a BFS crawl graph, a **site-wide pattern scan** over the collected HTML bodies surfaces internal link hygiene issues that a single-page audit can't catch.

A Codex-run audit of [almostimpossible.agency](https://www.almostimpossible.agency/) used this approach to find three distinct problems:

1. `href="/contact"` links across multiple pages that resolve to a soft-404 (no dedicated contact page exists)
2. Internal links to the **apex host** (`href="https://almostimpossible.agency/..."`) instead of the canonical `www.` host — each causing an avoidable 301 redirect hop
3. Cloudflare email-protection URLs (`/cdn-cgi/l/email-protection`) that return 404 when fetched by crawlers that drop the URL fragment

None of these are detectable from a single-page audit. They require scanning **every sitemap page + every BFS-discovered page** for bad patterns.

## Proposal

New script: `scripts/scan-link-patterns.sh <urls-file> [--patterns-file <file>]`

### Default patterns (configurable)

```
href="http://           # insecure internal links
href="/cdn-cgi/l/email-protection
href="https://[non-canonical-host]    # detected from input canonicalization
href="/404
href="/undefined
href=""                 # empty href
href="#"                # placeholder anchor
```
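The `[non-canonical-host]` placeholder could be filled in along these lines. This is a minimal sketch, assuming the canonical host is already known and that the only variant of interest is its `www.`/apex counterpart; `noncanonical_patterns` is a hypothetical helper name, not something the issue defines.

```shell
# Hypothetical helper: print fixed-string href patterns for the www/apex
# counterpart of a canonical host. How the canonical host is detected is
# out of scope here and is assumed to be supplied by the caller.
noncanonical_patterns() {
  local canonical="$1" other
  case "$canonical" in
    www.*) other="${canonical#www.}" ;;   # canonical is www -> flag apex
    *)     other="www.${canonical}" ;;    # canonical is apex -> flag www
  esac
  # Two variants so that both "https://host/path" and a bare "https://host"
  # are caught without also matching hosts that merely share the prefix.
  printf 'href="https://%s/\n' "$other"
  printf 'href="https://%s"\n' "$other"
}
```

For example, `noncanonical_patterns www.almostimpossible.agency` prints patterns for the apex host. Plain `http://` links to either host are already covered by the `href="http://` default.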

### Custom patterns

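Users can supply `--patterns-file my-patterns.txt` with one pattern per line. Patterns are matched as fixed strings (`grep -F`, see Implementation), not as regular expressions. Trailing `# ...` annotations like the one shown here are not part of the pattern and must be stripped before matching. Example: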

```
href="/contact   # if /contact is soft-404, flag it everywhere
href="https://old-domain.com
```
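Because `grep -F` treats every character literally, annotated pattern files like the one above need normalising first. A minimal sketch, assuming a comment is a `#` at the start of a line or preceded by whitespace (so that a pattern such as `href="#"` survives); `load_patterns` is a hypothetical helper name.

```shell
# Hypothetical helper: turn an annotated patterns file into clean,
# one-per-line fixed strings suitable for `grep -F`.
# Assumption: a comment starts at the beginning of a line or after
# whitespace, so patterns that contain " #" themselves are not supported.
load_patterns() {
  local file="$1"
  sed -e '/^[[:space:]]*#/d' \
      -e 's/[[:space:]]\{1,\}#.*$//' \
      -e 's/[[:space:]]*$//' \
      -e '/^$/d' "$file"
  # 1st expression: drop whole-line comments
  # 2nd expression: strip trailing "   # ..." annotations
  # 3rd expression: trim trailing whitespace
  # 4th expression: drop blank lines
}
```

Usage might look like `load_patterns my-patterns.txt > "$RUN_DIR/patterns.clean"`, with the cleaned file then passed to the scan step.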

### Output

```json
{
  "scannedPages": 30,
  "patterns": [
    {
      "pattern": "href=\"/cdn-cgi/l/email-protection",
      "hitCount": 12,
      "hitPages": ["https://...", "https://..."]
    }
  ]
}
```

## Implementation

- Takes a list of URLs (typically from the crawl graph — see #6)
- For each URL, fetches HTML with curl (cached if already in `$RUN_DIR`)
- Runs `grep -F` (not regex — faster, safer) for each pattern
- Aggregates hits + reports the pages where each pattern appears
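The steps above could be sketched roughly as follows. Several things here are assumptions rather than decisions from this issue: the fetch step is taken as already done, the cache is described by a hypothetical tab-separated manifest (`url<TAB>path-to-cached-html`), `hitCount` is interpreted as the number of pages containing the pattern, and JSON is emitted by hand rather than with a tool such as `jq`.

```shell
# Sketch only. Assumed inputs (not specified in the issue):
#   $1 manifest: lines of "url<TAB>cached-html-path"
#   $2 patterns: one fixed-string pattern per line, comments already removed
#   $3 optional cap on sample URLs per pattern (default 20)

json_escape() {
  # Escape backslashes and double quotes for use inside a JSON string.
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

scan_patterns() {
  local manifest="$1" patterns="$2" max_samples="${3:-20}"
  local pattern url file hits count pages tab sep_p="" sep_u
  tab=$(printf '\t')
  pages=$(grep -c . "$manifest")          # non-empty manifest lines
  printf '{"scannedPages":%s,"patterns":[' "$pages"
  while IFS= read -r pattern; do
    [ -n "$pattern" ] || continue
    count=0; hits=""; sep_u=""
    while IFS="$tab" read -r url file; do
      # -F: literal match, -q: only the exit status matters here
      if grep -qF -- "$pattern" "$file" 2>/dev/null; then
        count=$((count + 1))
        if [ "$count" -le "$max_samples" ]; then
          hits="${hits}${sep_u}\"$(json_escape "$url")\""
          sep_u=","
        fi
      fi
    done < "$manifest"
    [ "$count" -gt 0 ] || continue        # report only patterns with hits
    printf '%s{"pattern":"%s","hitCount":%s,"hitPages":[%s]}' \
      "$sep_p" "$(json_escape "$pattern")" "$count" "$hits"
    sep_p=","
  done < "$patterns"
  printf ']}\n'
}
```

Running one `grep -F` per pattern per page is slower than a single `grep -F -f patterns`, but it keeps hits attributable to individual patterns, which the per-pattern output format needs.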

## Acceptance criteria

- [ ] `scripts/scan-link-patterns.sh` exists, outputs JSON
- [ ] Ships a default pattern list at `scripts/_default-link-patterns.txt`
- [ ] Accepts custom patterns via `--patterns-file`
- [ ] Integration test with a fixture showing at least 3 different pattern categories
- [ ] Per-pattern output lists up to 20 sample URLs and a total count
- [ ] SKILL.md `--crawl` mode calls this automatically after the crawl
- [ ] New scoring finding: "Internal link hygiene issues" weighted by pattern count

## Depends on

- #6 (multi-page crawl mode) — reuses the BFS crawl graph as input

## Prior art

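The Codex audit session used this exact approach, scanning all 23 sitemap URLs for three patterns in a tight Python loop. A Bash port is straightforward: roughly `while read -r url; do grep -l -F -f patterns "$html"; done`, where `$html` stands in for the cached body of `$url`. Note that `grep -f` reports a file as soon as any pattern matches, so per-pattern counts need a separate `grep -F` pass per pattern.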
