A flexible web crawler that recursively crawls websites, respects robots.txt, and provides various output options.
- Clone the repository:

  ```bash
  git clone https://github.com/spiralhouse/scraper.git
  cd scraper
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Python: compatible with Python 3.9, 3.10, 3.11, and 3.12
- All runtime dependencies are listed in `requirements.txt` and are installed automatically during the steps above.
- Optional development dependencies are listed in `requirements-dev.txt`.
To start crawling a website:

```bash
python main.py https://example.com
```

This crawls the website with default settings: a maximum depth of 3, robots.txt respected, and external links not followed.
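The robots.txt handling mentioned above can be sketched with the standard library's `urllib.robotparser`. This is an illustration, not the project's actual code; the helper name, user agent, and rules are made up for the example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical helper: decide whether a URL may be crawled under robots.txt.
# Rules are parsed from raw text here so the example runs offline.
def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = """User-agent: *
Disallow: /private/
"""
print(is_allowed(rules, "scraper", "https://example.com/private/page"))  # False
print(is_allowed(rules, "scraper", "https://example.com/public"))        # True
```

Passing `--ignore-robots` would skip a check like this entirely.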
The scraper supports the following command-line arguments:

| Option | Description |
|---|---|
| `url` | The URL to start crawling from (required) |
| `-h, --help` | Show help message and exit |
| `-d, --depth DEPTH` | Maximum recursion depth (default: 3) |
| `--allow-external` | Allow crawling external domains |
| `--no-subdomains` | Disallow crawling subdomains |
| `-c, --concurrency CONCURRENCY` | Maximum concurrent requests (default: 10) |
| `--no-cache` | Disable caching |
| `--cache-dir CACHE_DIR` | Directory for cache storage |
| `--delay DELAY` | Delay between requests in seconds (default: 0.1) |
| `-v, --verbose` | Enable verbose logging |
| `--output-dir OUTPUT_DIR` | Directory to save results as JSON files |
| `--print-pages` | Print scraped pages to the console |
| `--ignore-robots` | Ignore robots.txt rules |
| `--use-sitemap` | Use sitemap.xml for URL discovery |
| `--max-subsitemaps MAX_SUBSITEMAPS` | Maximum number of sub-sitemaps to process (default: 5) |
| `--sitemap-timeout SITEMAP_TIMEOUT` | Timeout in seconds for sitemap processing (default: 30) |
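The option table above maps naturally onto `argparse`. The sketch below mirrors the documented flags and defaults; it is an assumption about how such a parser could look, not the project's actual `main.py`:

```python
import argparse

# Sketch of a parser matching the option table above (illustrative only).
def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Recursive web crawler")
    p.add_argument("url", help="The URL to start crawling from")
    p.add_argument("-d", "--depth", type=int, default=3,
                   help="Maximum recursion depth")
    p.add_argument("--allow-external", action="store_true",
                   help="Allow crawling external domains")
    p.add_argument("--no-subdomains", action="store_true",
                   help="Disallow crawling subdomains")
    p.add_argument("-c", "--concurrency", type=int, default=10,
                   help="Maximum concurrent requests")
    p.add_argument("--no-cache", action="store_true", help="Disable caching")
    p.add_argument("--cache-dir", help="Directory for cache storage")
    p.add_argument("--delay", type=float, default=0.1,
                   help="Delay between requests in seconds")
    p.add_argument("-v", "--verbose", action="store_true",
                   help="Enable verbose logging")
    p.add_argument("--output-dir", help="Directory to save results as JSON")
    p.add_argument("--print-pages", action="store_true",
                   help="Print scraped pages to the console")
    p.add_argument("--ignore-robots", action="store_true",
                   help="Ignore robots.txt rules")
    p.add_argument("--use-sitemap", action="store_true",
                   help="Use sitemap.xml for URL discovery")
    p.add_argument("--max-subsitemaps", type=int, default=5,
                   help="Maximum number of sub-sitemaps to process")
    p.add_argument("--sitemap-timeout", type=int, default=30,
                   help="Timeout in seconds for sitemap processing")
    return p

args = build_parser().parse_args(["https://example.com", "--depth", "5"])
print(args.depth, args.concurrency)  # 5 10
```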
```bash
# Crawl to a depth of 5
python main.py https://example.com --depth 5

# Follow links to external domains
python main.py https://example.com --allow-external

# Save results as JSON files in the results directory
python main.py https://example.com --output-dir results

# Use sitemap.xml for URL discovery with a 60-second timeout
python main.py https://example.com --use-sitemap --sitemap-timeout 60

# Combine options: depth 4, 20 concurrent requests, ignore robots.txt
python main.py https://example.com --depth 4 --concurrency 20 --ignore-robots

# Wait 1 second between requests
python main.py https://example.com --delay 1.0
```
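How `--concurrency` and `--delay` could interact is sketched below with an `asyncio.Semaphore`: the semaphore caps in-flight requests while each worker sleeps before completing. This is a minimal assumption about the mechanism, and `fetch` is a stand-in for a real HTTP request:

```python
import asyncio

# Sketch: cap concurrent fetches with a semaphore and add a per-request
# politeness delay. Not the project's actual implementation.
async def crawl(urls, concurrency=10, delay=0.1):
    sem = asyncio.Semaphore(concurrency)
    results = []

    async def fetch(url):
        async with sem:                  # at most `concurrency` in flight
            await asyncio.sleep(delay)   # stand-in for delay + HTTP request
            results.append(url)

    await asyncio.gather(*(fetch(u) for u in urls))
    return results

urls = [f"https://example.com/page{i}" for i in range(5)]
print(asyncio.run(crawl(urls, concurrency=2, delay=0.01)))
```

With `concurrency=2`, only two `fetch` coroutines run at a time, so five URLs take roughly three delay periods instead of one.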