1 change: 1 addition & 0 deletions .eleventyignore
@@ -1,3 +1,4 @@
./README.md
./_image_sources
./_drafts
./.github
15 changes: 15 additions & 0 deletions .github/linkchecker/Dockerfile
@@ -0,0 +1,15 @@
FROM ubuntu:22.04

Member:
Can we remove the copy of this linkchecker folder from the _drafts folder?

Contributor Author:
Done! Removed the outdated linkchecker copy from _drafts/linkchecker/ since the comprehensive implementation is now in .github/linkchecker/. Commit c0a7eed

RUN apt-get -y update && \
apt-get install -y ca-certificates linkchecker python3-pip curl --no-install-recommends \
&& apt-get clean && \
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
RUN pip3 install --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org jinja2

WORKDIR /linkchecker
COPY filter_csv.py output_template.html linkchecker.conf run_linkcheck.sh ./

# Make script executable
RUN chmod +x run_linkcheck.sh

# Default command to run linkchecker
CMD ["linkchecker", "--config=linkchecker.conf"]
138 changes: 138 additions & 0 deletions .github/linkchecker/README.md
@@ -0,0 +1,138 @@
# OrionRobots Link Checker

This directory contains the link-checking functionality for the OrionRobots website. It is designed to detect broken links, with a particular focus on broken images and broken internal links.

## 🎯 Features

- **Image-focused checking**: Prioritizes broken image links that affect visual content
- **Categorized results**: Separates internal, external, image, and email links
- **HTML reports**: Generates detailed, styled reports with priority indicators
- **Docker integration**: Runs in isolated containers for consistency
- **CI/CD integration**: Automated nightly checks and PR-based checks

## πŸš€ Usage

### Local Usage

Run the link checker locally using the provided script:

```bash
./.github/scripts/local_linkcheck.sh
```

This will do the following (a command-level sketch follows the list):
1. Build the site
2. Start a local HTTP server
3. Run the link checker
4. Generate a report in `./linkchecker_reports/`
5. Clean up containers
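
For orientation, here is a minimal sketch of that flow using the docker compose services described in the next section. It is an illustration only; the real logic lives in `./.github/scripts/local_linkcheck.sh` and may differ in detail.

```bash
#!/usr/bin/env bash
# Sketch of the local link-check flow; the authoritative script is
# .github/scripts/local_linkcheck.sh.
set -euo pipefail

# Reports are written to a mounted volume, so create the directory first
mkdir -p ./linkchecker_reports

# Steps 1-2: build the site and start the local HTTP server (manual profile)
docker compose --profile manual up -d http_serve

# Step 3: run the link checker against the served site
docker compose --profile manual up broken_links

# Step 4: the report lands in ./linkchecker_reports/
ls ./linkchecker_reports/

# Step 5: clean up containers
docker compose down
```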

### Manual Docker Compose

You can also run individual services manually:

```bash
# Build and serve the site
docker compose --profile manual up -d http_serve

# Run link checker
docker compose --profile manual up broken_links

# View logs
docker compose logs broken_links

# Cleanup
docker compose down
```

### GitHub Actions Integration

#### Nightly Checks
- Runs every night at 2 AM UTC
- Checks the production site (https://orionrobots.co.uk)
- Creates warnings for broken links
- Uploads detailed reports as artifacts
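
To approximate the nightly production check locally, you can point the containerised checker at the live site. This is a sketch under assumptions: the authoritative steps live in the GitHub Actions workflow, and the image tag `orion-linkchecker` is made up for the example.

```bash
# Build the checker image from this directory (tag name is illustrative)
docker build -t orion-linkchecker .github/linkchecker

# Run it against production, mounting a host directory because the
# configuration writes its CSV report to /linkchecker_reports.
# Passing the URL on the command line overrides the image's default CMD.
mkdir -p ./linkchecker_reports
docker run --rm \
  -v "$(pwd)/linkchecker_reports:/linkchecker_reports" \
  orion-linkchecker \
  linkchecker --config=linkchecker.conf https://orionrobots.co.uk
```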

#### PR-based Checks
- Triggered when a PR is labeled with `link-check`
- Deploys a staging version of the PR
- Runs link checker on the staging deployment
- Comments results on the PR
- Automatically cleans up staging deployment

To run link checking on a PR:
1. Add the `link-check` label to the PR
2. The workflow will automatically deploy staging and run checks
3. Results will be commented on the PR
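
If you use the GitHub CLI, the label can be applied from the command line (assumes `gh` is installed and authenticated; replace `<pr-number>` with the PR number):

```bash
# Applying the label triggers the link-check workflow on the PR
gh pr edit <pr-number> --add-label link-check
```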

## πŸ“ Files

- `Dockerfile`: Container definition for the link checker
- `linkchecker.conf`: Configuration for linkchecker tool
- `filter_csv.py`: Python script to process and categorize results
- `output_template.html`: HTML template for generating reports
- `run_linkcheck.sh`: Main script that orchestrates the checking process

## πŸ“Š Report Categories

The generated reports categorize broken links by priority:

1. **πŸ–ΌοΈ Images** (High Priority): Broken image links that affect visual content
2. **🏠 Internal Links** (High Priority): Broken internal links under our control
3. **🌐 External Links** (Medium Priority): Broken external links (may be temporary)
4. **πŸ“§ Email Links** (Low Priority): Broken email links (complex to validate)

## βš™οΈ Configuration

The link checker configuration in `linkchecker.conf` includes:

- **Recursion**: Limited depth (`recursionlevel=2`) to keep runs fast
- **Output**: CSV format for easy processing
- **Filtering**: Ignores common social media sites that block crawlers
- **Anchor checking**: Validates internal page anchors
- **Warning handling**: Configurable warning levels

## πŸ”§ Customization

To modify the link checking behavior:

1. **Change checking depth**: Edit `recursionlevel` in `linkchecker.conf`
2. **Add ignored URLs**: Add patterns to the `ignore` section in `linkchecker.conf`
3. **Modify report styling**: Edit `output_template.html`
4. **Change categorization**: Modify `filter_csv.py`
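
After changing any of these files, rebuild the checker container so the changes are picked up. This assumes the `broken_links` service builds its image from this directory:

```bash
# Rebuild the checker image and rerun it against the locally served site
docker compose --profile manual build broken_links
docker compose --profile manual up broken_links
```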

## 🐳 Docker Integration

The link checker integrates with the existing Docker Compose setup:

- Uses the `http_serve` service as the target
- Depends on health checks to ensure site availability
- Outputs reports to a mounted volume for persistence
- Runs in the `manual` profile to avoid automatic execution

## πŸ“‹ Requirements

- Docker and Docker Compose
- Python 3 with Jinja2 (handled in container)
- linkchecker tool (handled in container)
- curl for health checks (handled in container)

## πŸ” Troubleshooting

### Site not available
If you get "Site not available" errors:
1. Ensure the site builds successfully first
2. Check that the HTTP server is running
3. Verify port 8082 is not in use
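
A few quick checks usually narrow this down (the port and service names follow the compose setup described above):

```bash
# Are the containers up, and is the HTTP server healthy?
docker compose ps

# Does the site respond locally?
curl -I http://localhost:8082/

# Is something else already bound to port 8082?
lsof -i :8082
```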

### Permission errors
If you get permission errors with volumes:
1. Check Docker permissions
2. Ensure the linkchecker_reports directory exists
3. Try running with sudo (not recommended for production)
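
In practice, creating the reports directory as your own user before the first run avoids most volume-permission surprises (a sketch; the path matches the local script's default):

```bash
# Create the reports directory before any container tries to write to it
mkdir -p ./linkchecker_reports

# If an earlier root-owned run left files behind, reclaim ownership
sudo chown -R "$(id -u):$(id -g)" ./linkchecker_reports
```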

### Missing dependencies
If linkchecker fails to run:
1. Check the Dockerfile builds successfully
2. Verify Python dependencies are installed
3. Check linkchecker configuration syntax
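
These can be checked directly against the image (the `orion-linkchecker` tag is illustrative, as above):

```bash
# 1. Confirm the image builds cleanly
docker build -t orion-linkchecker .github/linkchecker

# 2. Verify the Python dependency and the linkchecker tool inside the image
docker run --rm orion-linkchecker python3 -c "import jinja2; print(jinja2.__version__)"
docker run --rm orion-linkchecker linkchecker --version

# 3. Sanity-check the configuration against a single URL
#    (mount a reports directory because the config writes its CSV there)
mkdir -p ./linkchecker_reports
docker run --rm -v "$(pwd)/linkchecker_reports:/linkchecker_reports" \
  orion-linkchecker linkchecker --config=linkchecker.conf https://example.com
```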
80 changes: 80 additions & 0 deletions .github/linkchecker/filter_csv.py
@@ -0,0 +1,80 @@
# -*- coding: utf-8 -*-
import csv
import sys
import os
from urllib.parse import urlparse

from jinja2 import Environment, FileSystemLoader, select_autoescape


def is_image_url(url):
"""Check if URL points to an image file"""
image_extensions = {'.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.ico', '.bmp'}
parsed = urlparse(url)
path = parsed.path.lower()
return any(path.endswith(ext) for ext in image_extensions)


def categorize_link(item):
"""Categorize link by type"""
url = item['url']
if is_image_url(url):
return 'image'
elif url.startswith('mailto:'):
return 'email'
elif url.startswith('http'):
return 'external'
else:
return 'internal'


def output_file(items):
# Get the directory where this script is located
script_dir = os.path.dirname(os.path.abspath(__file__))
env = Environment(
loader=FileSystemLoader(script_dir),
autoescape=select_autoescape(['html', 'xml'])
)
template = env.get_template('output_template.html')

# Categorize items
categorized = {}
for item in items:
category = categorize_link(item)
if category not in categorized:
categorized[category] = []
categorized[category].append(item)

print(template.render(
categorized=categorized,
total_count=len(items),
image_count=len(categorized.get('image', [])),
internal_count=len(categorized.get('internal', [])),
external_count=len(categorized.get('external', [])),
email_count=len(categorized.get('email', []))
))


def main():
filename = sys.argv[1] if len(sys.argv) > 1 else '/linkchecker/output.csv'

if not os.path.exists(filename):
print(f"Error: CSV file {filename} not found")
sys.exit(1)

with open(filename, encoding='utf-8') as csv_file:
data = csv_file.readlines()
reader = csv.DictReader((row for row in data if not row.startswith('#')), delimiter=';')

# Filter out successful links and redirects
non_200 = (item for item in reader if 'OK' not in item['result'])
non_redirect = (item for item in non_200 if '307' not in item['result'] and '301' not in item['result'] and '302' not in item['result'])
non_ssl = (item for item in non_redirect if 'ssl' not in item['result'].lower())

total_list = sorted(list(non_ssl), key=lambda item: (categorize_link(item), item['parentname']))

output_file(total_list)


if __name__ == '__main__':
main()
44 changes: 44 additions & 0 deletions .github/linkchecker/linkchecker.conf
@@ -0,0 +1,44 @@
[checking]
# Check links with limited recursion for faster execution
recursionlevel=2
# Focus on internal links
allowedschemes=http,https,file
# Also check external links (needed to catch externally hosted images)
checkextern=1
# Limit the request rate to keep runs fast and polite
maxrequestspersecond=10
# Timeout for each request
timeout=10
# Hard time limit - 2 minutes maximum for PR checks
maxrunseconds=120
threads=4

[output]
# Output in CSV format for easier processing
log=csv
filename=/linkchecker_reports/output.csv
# Also output to console
verbose=1
warnings=1

[filtering]
# Ignore certain file types that might cause issues
ignorewarnings=url-whitespace,url-content-size-zero,url-content-too-large
# Skip external social media links that often block crawlers
ignore=
url:facebook\.com
url:twitter\.com
url:instagram\.com
url:linkedin\.com
url:youtube\.com
url:tiktok\.com

[AnchorCheck]
# Check for broken internal anchors
add=1

[authentication]
# No authentication required for most checks

[plugins]
# No additional plugins needed for basic checking