Scraper

A flexible web crawler that recursively crawls websites, respects robots.txt, and provides various output options.

Documentation

Installation

  1. Clone the repository:
git clone https://github.com/spiralhouse/scraper.git
cd scraper
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

Requirements

  • Python: Compatible with Python 3.9, 3.10, 3.11, and 3.12
  • All runtime dependencies are listed in requirements.txt and installed in step 3 above.
  • Optional development dependencies are listed in requirements-dev.txt.

Basic Usage

To start crawling a website:

python main.py https://example.com

This will crawl the website with default settings (depth of 3, respecting robots.txt, not following external links).
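
Here, "respecting robots.txt" means the crawler checks a site's robots.txt rules before fetching a page. The snippet below is not taken from the scraper itself; it is a minimal illustration of that check using Python's standard urllib.robotparser, with example.com as a placeholder:

from urllib import robotparser

# Load the site's robots.txt (placeholder URL for illustration)
rp = robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

# Ask whether a generic crawler may fetch a given page
url = "https://example.com/some/page"
if rp.can_fetch("*", url):
    print(f"Allowed to crawl: {url}")
else:
    print(f"Blocked by robots.txt: {url}")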

Command Line Options

The scraper supports the following command-line arguments:

Option                              Description
url                                 The URL to start crawling from (required)
-h, --help                          Show help message and exit
-d, --depth DEPTH                   Maximum recursion depth (default: 3)
--allow-external                    Allow crawling external domains
--no-subdomains                     Disallow crawling subdomains
-c, --concurrency CONCURRENCY       Maximum concurrent requests (default: 10)
--no-cache                          Disable caching
--cache-dir CACHE_DIR               Directory for cache storage
--delay DELAY                       Delay between requests in seconds (default: 0.1)
-v, --verbose                       Enable verbose logging
--output-dir OUTPUT_DIR             Directory to save results as JSON files
--print-pages                       Print scraped pages to console
--ignore-robots                     Ignore robots.txt rules
--use-sitemap                       Use sitemap.xml for URL discovery
--max-subsitemaps MAX_SUBSITEMAPS   Maximum number of sub-sitemaps to process (default: 5)
--sitemap-timeout SITEMAP_TIMEOUT   Timeout in seconds for sitemap processing (default: 30)

Examples

Crawl with a specific depth limit:

python main.py https://example.com --depth 5
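
The depth limit caps how many link hops the crawler follows from the starting URL. As a rough conceptual sketch (not the project's actual implementation), a depth-limited crawl can be pictured as a breadth-first traversal that stops expanding once the limit is reached; the toy link graph below is purely hypothetical:

from collections import deque

# Toy link graph standing in for real pages (purely illustrative)
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
}

def crawl(start, max_depth):
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        print(f"depth {depth}: {url}")
        if depth >= max_depth:
            continue                      # do not expand past the limit
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

crawl("https://example.com/", max_depth=5)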

Allow crawling external domains:

python main.py https://example.com --allow-external

Save crawled pages to a specific directory:

python main.py https://example.com --output-dir results
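
Each crawled page is then written to the results directory as a JSON file. The exact field layout is not documented here, so the sketch below makes no assumptions about it; it simply loads every JSON file in the directory and prints each file's top-level keys for a quick look:

import json
from pathlib import Path

# Inspect whatever JSON files the crawl produced in ./results
for path in Path("results").glob("*.json"):
    with path.open(encoding="utf-8") as f:
        data = json.load(f)
    # Show the file name and its top-level structure without
    # assuming any particular field names
    keys = list(data) if isinstance(data, dict) else type(data).__name__
    print(path.name, keys)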

Use sitemap for discovery with a longer timeout:

python main.py https://example.com --use-sitemap --sitemap-timeout 60
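
Sitemap-based discovery reads the URLs a site advertises in its sitemap.xml (or sitemap index) instead of relying only on followed links. The sketch below is a generic illustration using Python's standard library, not the scraper's own sitemap code, and uses example.com as a placeholder:

import urllib.request
import xml.etree.ElementTree as ET

# Fetch a sitemap and collect the URLs it lists (placeholder site)
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
with urllib.request.urlopen("https://example.com/sitemap.xml", timeout=60) as resp:
    tree = ET.fromstring(resp.read())

# <loc> entries hold page URLs (or sub-sitemap URLs in a sitemap index)
urls = [loc.text for loc in tree.iter(f"{SITEMAP_NS}loc")]
print(f"Discovered {len(urls)} URLs")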

Maximum performance for a large site:

python main.py https://example.com --depth 4 --concurrency 20 --ignore-robots

Crawl a site slowly to avoid rate limiting:

python main.py https://example.com --delay 1.0
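
The --delay and --concurrency options together control how politely the crawler hits a site. The sketch below shows the general pattern (not the project's own code), assuming the third-party aiohttp package is available: a semaphore caps how many requests are in flight while a fixed delay spaces them out.

import asyncio
import aiohttp

CONCURRENCY = 10   # mirrors --concurrency
DELAY = 1.0        # mirrors --delay, in seconds

async def fetch(session, semaphore, url):
    async with semaphore:           # at most CONCURRENCY requests in flight
        async with session.get(url) as resp:
            body = await resp.text()
        await asyncio.sleep(DELAY)  # pause before releasing the slot
        return url, len(body)

async def main(urls):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))
    for url, size in results:
        print(url, size)

asyncio.run(main(["https://example.com/"]))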

About

An experimental project created with Windsurf and Claude 3.5 Sonnet.
