A flexible web crawler that recursively crawls websites, respects robots.txt, and provides various output options.
- Clone the repository:

  ```bash
  git clone https://github.com/spiralhouse/scraper.git
  cd scraper
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Python: compatible with Python 3.9, 3.10, 3.11, and 3.12
- All runtime dependencies are listed in `requirements.txt` and are installed automatically during the steps above.
- Optional development dependencies are listed in `requirements-dev.txt`.
To start crawling a website:

```bash
python main.py https://example.com
```

This crawls the website with default settings: a maximum depth of 3, robots.txt respected, and external links not followed.
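The robots.txt handling mentioned above can be sketched with the standard library's `urllib.robotparser`. This is an illustration, not the project's actual code; the helper name, user agent, and rules are made up for the example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical helper: decide whether a URL may be crawled under robots.txt.
# Rules are parsed from raw text here so the example runs offline.
def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = """User-agent: *
Disallow: /private/
"""
print(is_allowed(rules, "scraper", "https://example.com/private/page"))  # False
print(is_allowed(rules, "scraper", "https://example.com/public"))        # True
```

Passing `--ignore-robots` would skip a check like this entirely.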
The scraper supports the following command-line arguments:

| Option | Description |
|---|---|
| `url` | The URL to start crawling from (required) |
| `-h, --help` | Show help message and exit |
| `-d, --depth DEPTH` | Maximum recursion depth (default: 3) |
| `--allow-external` | Allow crawling external domains |
| `--no-subdomains` | Disallow crawling subdomains |
| `-c, --concurrency CONCURRENCY` | Maximum concurrent requests (default: 10) |
| `--no-cache` | Disable caching |
| `--cache-dir CACHE_DIR` | Directory for cache storage |
| `--delay DELAY` | Delay between requests in seconds (default: 0.1) |
| `-v, --verbose` | Enable verbose logging |
| `--output-dir OUTPUT_DIR` | Directory to save results as JSON files |
| `--print-pages` | Print scraped pages to the console |
| `--ignore-robots` | Ignore robots.txt rules |
| `--use-sitemap` | Use sitemap.xml for URL discovery |
| `--max-subsitemaps MAX_SUBSITEMAPS` | Maximum number of sub-sitemaps to process (default: 5) |
| `--sitemap-timeout SITEMAP_TIMEOUT` | Timeout in seconds for sitemap processing (default: 30) |
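The option table above maps naturally onto `argparse`. The sketch below mirrors the documented flags and defaults; it is an assumption about how such a parser could look, not the project's actual `main.py`:

```python
import argparse

# Sketch of a parser matching the option table above (illustrative only).
def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Recursive web crawler")
    p.add_argument("url", help="The URL to start crawling from")
    p.add_argument("-d", "--depth", type=int, default=3,
                   help="Maximum recursion depth")
    p.add_argument("--allow-external", action="store_true",
                   help="Allow crawling external domains")
    p.add_argument("--no-subdomains", action="store_true",
                   help="Disallow crawling subdomains")
    p.add_argument("-c", "--concurrency", type=int, default=10,
                   help="Maximum concurrent requests")
    p.add_argument("--no-cache", action="store_true", help="Disable caching")
    p.add_argument("--cache-dir", help="Directory for cache storage")
    p.add_argument("--delay", type=float, default=0.1,
                   help="Delay between requests in seconds")
    p.add_argument("-v", "--verbose", action="store_true",
                   help="Enable verbose logging")
    p.add_argument("--output-dir", help="Directory to save results as JSON")
    p.add_argument("--print-pages", action="store_true",
                   help="Print scraped pages to the console")
    p.add_argument("--ignore-robots", action="store_true",
                   help="Ignore robots.txt rules")
    p.add_argument("--use-sitemap", action="store_true",
                   help="Use sitemap.xml for URL discovery")
    p.add_argument("--max-subsitemaps", type=int, default=5,
                   help="Maximum number of sub-sitemaps to process")
    p.add_argument("--sitemap-timeout", type=int, default=30,
                   help="Timeout in seconds for sitemap processing")
    return p

args = build_parser().parse_args(["https://example.com", "--depth", "5"])
print(args.depth, args.concurrency)  # 5 10
```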
```bash
# Crawl to a depth of 5
python main.py https://example.com --depth 5

# Follow links to external domains
python main.py https://example.com --allow-external

# Save results as JSON files in the results directory
python main.py https://example.com --output-dir results

# Use sitemap.xml for URL discovery with a 60-second timeout
python main.py https://example.com --use-sitemap --sitemap-timeout 60

# Combine options: depth 4, 20 concurrent requests, ignore robots.txt
python main.py https://example.com --depth 4 --concurrency 20 --ignore-robots

# Wait 1 second between requests
python main.py https://example.com --delay 1.0
```
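How `--concurrency` and `--delay` could interact is sketched below with an `asyncio.Semaphore`: the semaphore caps in-flight requests while each worker sleeps before completing. This is a minimal assumption about the mechanism, and `fetch` is a stand-in for a real HTTP request:

```python
import asyncio

# Sketch: cap concurrent fetches with a semaphore and add a per-request
# politeness delay. Not the project's actual implementation.
async def crawl(urls, concurrency=10, delay=0.1):
    sem = asyncio.Semaphore(concurrency)
    results = []

    async def fetch(url):
        async with sem:                  # at most `concurrency` in flight
            await asyncio.sleep(delay)   # stand-in for delay + HTTP request
            results.append(url)

    await asyncio.gather(*(fetch(u) for u in urls))
    return results

urls = [f"https://example.com/page{i}" for i in range(5)]
print(asyncio.run(crawl(urls, concurrency=2, delay=0.01)))
```

With `concurrency=2`, only two `fetch` coroutines run at a time, so five URLs take roughly three delay periods instead of one.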