
Web Crawler

A Python-based web crawler that systematically browses websites and generates a comprehensive CSV report of all discovered URLs along with their HTTP status codes.

Features

  • Crawls all pages within a specified domain
  • Respects same-origin policy (only crawls URLs from the same domain)
  • Generates a CSV report with URLs and their HTTP status codes
  • Handles relative and absolute URLs
  • Implements polite crawling with built-in delays
  • Filters out non-web schemes and fragments
  • Robust error handling for failed requests
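
The overall crawl flow can be pictured roughly as follows. This is a minimal sketch, not the project's actual index.py: it assumes the requests and beautifulsoup4 packages and simplifies error handling, but it illustrates the same-origin filtering, the CSV report, and the polite 1-second delay described above.

import csv
import time
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, output_file):
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    with open(output_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["URL", "Status_Code"])
        while queue:
            url = queue.popleft()
            try:
                response = requests.get(url, timeout=10)
                status = response.status_code
            except requests.RequestException:
                writer.writerow([url, "FAILED"])
                continue
            # Record every URL; the status code column is filled in only
            # when it is 300 or greater (see Output Format below)
            writer.writerow([url, status if status >= 300 else ""])
            for tag in BeautifulSoup(response.text, "html.parser").find_all("a", href=True):
                # Resolve relative links and strip fragments/query strings
                link = urljoin(url, tag["href"]).split("#")[0].split("?")[0]
                # Same-origin policy: only follow links on the starting domain
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)
            time.sleep(1)  # polite crawling: pause between requests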

Prerequisites

  • Python 3.x

Installation

  1. Clone the repository:
git clone <repository-url>
cd <repository-name>
  2. Create and activate a virtual environment (the apt-get step is only needed on Debian/Ubuntu systems):
sudo apt-get install python3-venv
python3 -m venv venv
source venv/bin/activate
  3. Install the required dependencies:
pip install --upgrade pip
pip install -r requirements.txt
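
The exact contents of requirements.txt are not reproduced here; for a crawler like this it would typically list an HTTP client and an HTML parser, for example (assumed, not verified against the repository):

requests
beautifulsoup4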

Usage

  1. Run the script:
python index.py
  2. When prompted:

    • Enter the website URL you want to crawl (e.g., https://example.com)
    • Specify the output CSV filename (e.g., links.csv)
  3. The crawler will:

    • Start crawling from the provided URL
    • Save discovered URLs to the specified CSV file
    • Record HTTP status codes for each URL
    • Print progress information to the console
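
A session might look roughly like this (the prompt wording and progress output are illustrative, not the script's literal text):

$ python index.py
Enter the website URL to crawl: https://example.com
Enter the output CSV filename: links.csv
Crawling: https://example.com
Crawling: https://example.com/about
...
Done. Results written to links.csv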

Output Format

The script generates a CSV file with the following columns:

  • URL: The discovered URL
  • Status_Code: HTTP status code (only recorded if ≥ 300 or if the request failed)
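
For example, a report for a small site might contain rows like these (illustrative values):

URL,Status_Code
https://example.com/,
https://example.com/about,
https://example.com/old-page,301
https://example.com/missing,404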

Features in Detail

URL Processing

  • Removes URL fragments (#) and query parameters (?)
  • Converts relative URLs to absolute URLs
  • Validates URLs against the original domain
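
In Python, this kind of normalization is typically done with urllib.parse; a minimal sketch of the idea (not necessarily the exact code in index.py):

from urllib.parse import urljoin, urlparse

def normalize(base_url, href, allowed_domain):
    """Resolve a link found on base_url and decide whether to crawl it."""
    absolute = urljoin(base_url, href)          # relative -> absolute
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):  # drop mailto:, javascript:, etc.
        return None
    if parsed.netloc != allowed_domain:         # same-origin policy
        return None
    # Strip the fragment and query string before recording the URL
    return parsed._replace(fragment="", query="").geturl()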

Rate Limiting

  • Implements a 1-second delay between requests to prevent server overload
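
In practice this amounts to a single time.sleep call in the request loop, along these lines (the queue and helper names below are hypothetical):

import time

for url in urls_to_visit:      # hypothetical crawl queue
    fetch_and_parse(url)       # hypothetical helper that performs the request
    time.sleep(1)              # wait 1 second before the next request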

Contributing

Feel free to submit issues and enhancement requests.
