A Python-based web crawler that systematically browses websites and generates a comprehensive CSV report of all discovered URLs along with their HTTP status codes.
- Crawls all pages within a specified domain
- Restricts crawling to the starting domain (same-origin URLs only)
- Generates a CSV report with URLs and their HTTP status codes
- Handles relative and absolute URLs
- Implements polite crawling with built-in delays
- Filters out non-web schemes and fragments
- Robust error handling for failed requests (see the helper sketch after this list)
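
The last two features might look roughly like the helpers below. This is only a sketch: the function names and the use of the `requests` library are assumptions, not the project's actual code.

```python
# Hypothetical helpers for scheme filtering and robust fetching.
# The use of the `requests` library is an assumption about the project's dependencies.
from urllib.parse import urlparse

import requests


def is_crawlable(url):
    """Keep only http/https links; mailto:, javascript:, tel:, etc. are skipped."""
    return urlparse(url).scheme in ("http", "https")


def fetch_status(url, timeout=10):
    """Return the HTTP status code, or None if the request failed entirely."""
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.RequestException as exc:
        print(f"Request failed for {url}: {exc}")
        return None
```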
- Python 3.x
- Clone the repository:
git clone <repository-url>
cd <repository-name>
- Create and activate a virtual environment:
sudo apt-get install python3-venv
python3 -m venv venv
source venv/bin/activate
- Install the required dependencies:
pip install --upgrade pip
pip install -r requirements.txt
- Run the script:
python index.py
When prompted:
- Enter the website URL you want to crawl (e.g., https://example.com)
- Specify the output CSV filename (e.g., links.csv)
The crawler will:
- Start crawling from the provided URL
- Save discovered URLs to the specified CSV file
- Record HTTP status codes for each URL
- Print progress information to the console (a simplified sketch of this process follows below)
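
A simplified version of such a crawl loop is sketched below. It assumes `requests` for fetching and `BeautifulSoup` for link extraction; the function and variable names are illustrative, not taken from index.py.

```python
# Illustrative breadth-first crawl loop; names and structure are assumptions,
# not the contents of index.py.
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(start_url, delay=1.0):
    """Yield (url, status_code) pairs; status_code is None when a request fails."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}

    while queue:
        url = queue.popleft()
        print(f"Crawling: {url}")  # progress information on the console
        try:
            response = requests.get(url, timeout=10)
            status, html = response.status_code, response.text
        except requests.RequestException:
            status, html = None, ""
        yield url, status

        # Queue same-domain links, dropping fragments and query strings.
        for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, anchor["href"]).split("#")[0].split("?")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)

        time.sleep(delay)  # polite one-second delay between requests
```

The interactive prompts described earlier would simply supply `start_url` and the output filename to a loop like this.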
The script generates a CSV file with the following columns (an illustrative writer snippet follows the list):
- URL: The discovered URL
- Status_Code: HTTP status code (only recorded if ≥ 300 or if the request failed)
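
One way to write that format is sketched below. How the real script handles codes below 300 is an assumption here (the row is kept with an empty status).

```python
# Illustrative report writer; the exact recording rule used by index.py may differ.
import csv


def write_report(rows, output_csv):
    """rows: iterable of (url, status) pairs; status is None for failed requests."""
    with open(output_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["URL", "Status_Code"])
        for url, status in rows:
            if status is None:
                writer.writerow([url, "FAILED"])  # request failed outright
            elif status >= 300:
                writer.writerow([url, status])    # redirect or error code
            else:
                writer.writerow([url, ""])        # healthy URL, code not recorded
```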
- Removes URL fragments (#) and query parameters (?)
- Converts relative URLs to absolute URLs
- Validates URLs against the original domain (see the helper sketch after this list)
- Implements a 1-second delay between requests to prevent server overload
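
A minimal sketch of the normalization and domain check, using only the standard library, might look like this; the helper names are assumptions.

```python
# Hypothetical URL normalization and same-domain validation helpers.
from urllib.parse import urljoin, urlparse


def normalize(base_url, href):
    """Resolve a link against its page and strip the query string and fragment."""
    parts = urlparse(urljoin(base_url, href))  # relative -> absolute
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


def same_domain(url, start_url):
    """True when the URL belongs to the domain being crawled."""
    return urlparse(url).netloc == urlparse(start_url).netloc
```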
Feel free to submit issues and enhancement requests.