This Python-based web scraper extracts business details (name, address, phone number, and email) from Goudengids.nl. It uses proxy servers and random user agents to avoid blocking and rate limiting, and saves the collected data in structured CSV files.
- Scrapes business information, including:
  - Name
  - Address
  - Phone number
  - Email
- Handles retries and avoids duplicate entries.
- Saves data into organized folders:
  - `All Data`: all scraped data in a single CSV file.
  - `Phone Numbers`: phone numbers grouped in a separate CSV file.
  - `Emails`: emails grouped in a separate CSV file.
  - `Addresses`: addresses grouped in a separate CSV file.
- Python 3.7 or higher
- Required Python libraries:
  - `requests`
  - `beautifulsoup4`
  - `fake-useragent`
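A minimal `requirements.txt` covering these dependencies would look like the following (the repository's actual file may pin specific versions):

```text
requests
beautifulsoup4
fake-useragent
```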
- **Clone the Repository**

  ```bash
  git clone https://github.com/your-username/goudengids-scraper.git
  cd goudengids-scraper
  ```
- **Install Dependencies**

  Install the required packages using the `requirements.txt` file:

  ```bash
  pip3 install -r requirements.txt
  ```
- **Proxy Configuration**

  Update the `BRIGHT_DATA_PROXY`, `BRIGHT_DATA_USERNAME`, and `BRIGHT_DATA_PASSWORD` variables in the script to match your proxy service credentials.
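  As a minimal sketch of how those credentials and the `fake-useragent` library could be combined for a request (the host, port, and credential values below are placeholders, not the script's actual configuration):

  ```python
  import requests
  from fake_useragent import UserAgent

  # Placeholder values -- replace with your own proxy service credentials.
  BRIGHT_DATA_PROXY = "brd.superproxy.io:22225"
  BRIGHT_DATA_USERNAME = "your-username"
  BRIGHT_DATA_PASSWORD = "your-password"

  ua = UserAgent()

  def fetch(url: str) -> requests.Response:
      """Fetch a page through the proxy with a random User-Agent."""
      proxy_url = f"http://{BRIGHT_DATA_USERNAME}:{BRIGHT_DATA_PASSWORD}@{BRIGHT_DATA_PROXY}"
      return requests.get(
          url,
          headers={"User-Agent": ua.random},
          proxies={"http": proxy_url, "https": proxy_url},
          timeout=30,
      )
  ```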
- **Run the Scraper**

  Execute the script to start scraping:

  ```bash
  python3 scraper.py
  ```
- **Provide Input**

  The script will prompt for:

  - **Category**: The business type to scrape (e.g., `Accountants` or `Aannemers`).
  - **Start Page**: The page number to begin scraping.
  - **Number of Pages**: The total number of pages to scrape.
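  As a rough illustration, these values could be collected with plain `input()` calls like the sketch below; the exact prompt wording and variable names in `scraper.py` are assumptions here:

  ```python
  # Hypothetical prompt handling -- the real script's wording may differ.
  category = input("Category (e.g. Accountants): ").strip()
  start_page = int(input("Start page: "))
  num_pages = int(input("Number of pages: "))
  end_page = start_page + num_pages - 1  # used in the output file names
  ```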
- **Data Output**

  The scraped data is saved in the `Collected Data` folder, organized by type (`All Data`, `Phone Numbers`, etc.):
  ```text
  Collected Data/
  ├── All Data/
  │   └── {category}_pages_{start}_to_{end}_all_data.csv
  ├── Phone Numbers/
  │   └── {category}_pages_{start}_to_{end}_phone.csv
  ├── Emails/
  │   └── {category}_pages_{start}_to_{end}_email.csv
  └── Addresses/
      └── {category}_pages_{start}_to_{end}_address.csv
  ```
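  A minimal sketch of how one of these files might be produced, assuming the scraper collects one dict per business (the folder layout follows the tree above; the column names and helper function are illustrative, not taken from `scraper.py`):

  ```python
  import csv
  import os

  def save_all_data(records, category, start_page, end_page):
      """Write records to the 'All Data' CSV, skipping duplicate entries."""
      folder = os.path.join("Collected Data", "All Data")
      os.makedirs(folder, exist_ok=True)
      path = os.path.join(folder, f"{category}_pages_{start_page}_to_{end_page}_all_data.csv")

      seen = set()
      with open(path, "w", newline="", encoding="utf-8") as f:
          writer = csv.DictWriter(f, fieldnames=["name", "address", "phone", "email"])
          writer.writeheader()
          for record in records:
              key = (record.get("name"), record.get("phone"))
              if key in seen:  # avoid duplicate entries
                  continue
              seen.add(key)
              writer.writerow(record)
  ```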
- 403 Forbidden or IP Blacklisting: The script retries with a new proxy or user agent. Ensure your proxy credentials are valid.
- Connection Issues: If multiple retries fail, the script skips the page and logs an error.
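The retry behaviour described above could look roughly like the sketch below; `fetch` refers to the hypothetical proxy helper shown earlier, and the retry count is illustrative:

```python
import logging
import requests

MAX_RETRIES = 3  # illustrative value

def fetch_with_retries(url: str):
    """Retry on 403s or connection errors; return None (skip the page) if all attempts fail."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            response = fetch(url)  # picks a fresh random User-Agent on every call
            if response.status_code == 403:
                logging.warning("403 on %s (attempt %d), retrying", url, attempt)
                continue
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            logging.warning("Request failed for %s (attempt %d): %s", url, attempt, exc)
    logging.error("Giving up on %s after %d attempts", url, MAX_RETRIES)
    return None
```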