A Python-based web scraper that uses Selenium and OpenPyXL to extract data (text, images, and links) from websites listed in an Excel file. The scraper automatically reads the URLs, scrapes the content, and stores it in a new Excel file.
Make sure you have the following installed:
- Python 3.x
- Selenium for browser automation
- WebDriver Manager for managing the Chrome WebDriver
- OpenPyXL for handling Excel files
pip install selenium webdriver-manager openpyxl

Clone the repository or download the script files to your local machine.
Create an Excel file named `LinksSeleniumWEBSC.xlsx` and place it in your working directory, e.g., `C:\Users\ACER\Desktop\Web Scraper selenium\`.
- The file should contain a sheet named "Sheet1".
- List the website URLs in column A, starting from cell A2 (then A3, A4, and so on).
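If you prefer to create the input file programmatically, here is a minimal sketch using OpenPyXL; the file name, sheet name, and cell layout match what is described above, and the URLs are placeholders:

```python
from openpyxl import Workbook

# Build the input workbook expected by the scraper.
wb = Workbook()
ws = wb.active
ws.title = "Sheet1"               # the scraper reads from "Sheet1"
ws["A1"] = "URL"                  # header row
ws["A2"] = "https://example.com"  # URLs start in cell A2
ws["A3"] = "https://another.com"
wb.save("LinksSeleniumWEBSC.xlsx")
```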
Run the Script:
- Open a terminal or command prompt.
- Navigate to the folder containing the script.
- Execute the script using Python:
python main.py
What Happens Next:
- The script will read all the URLs from the input file (`LinksSeleniumWEBSC.xlsx`).
- It will scrape the text, images, and links from each website.
- A new Excel file called `link2.xlsx` will be generated with the scraped data.

Example input (`LinksSeleniumWEBSC.xlsx`):
| URL |
|---|
| https://example.com |
| https://another.com |
Example output (`link2.xlsx`):

| Text | Images | Links |
|---|---|---|
| Example Domain | example image link | example page link |
| Another Example | another image link | another page link |
The data is organized into three columns:
- Text: The extracted text from the webpage.
- Images: Links to all the images found on the page.
- Links: All internal and external links found on the page.
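To inspect the generated file programmatically, a small reader sketch (assuming the three-column layout above, with a header in row 1) could look like this:

```python
from openpyxl import load_workbook

# Print every scraped row from the output workbook.
wb = load_workbook("link2.xlsx")
for text, images, links in wb.active.iter_rows(min_row=2, values_only=True):
    print(text, images, links)
```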
The page-scraping function:
- Loads the provided URL using Selenium.
- Waits for the page to load completely.
- Scrapes the text, images, and links from the page.
- Returns a dictionary with the scraped data.
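A minimal sketch of such a function, using Selenium's explicit-wait API; the name `scrape_page` and the 10-second timeout are illustrative, not necessarily what `main.py` uses:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def scrape_page(driver, url):
    """Load a URL and collect its text, image sources, and link targets."""
    driver.get(url)
    # Wait (up to 10 s) until the <body> element is present, i.e. the page has loaded.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "body"))
    )
    text = driver.find_element(By.TAG_NAME, "body").text
    images = [img.get_attribute("src")
              for img in driver.find_elements(By.TAG_NAME, "img")
              if img.get_attribute("src")]
    links = [a.get_attribute("href")
             for a in driver.find_elements(By.TAG_NAME, "a")
             if a.get_attribute("href")]
    return {"text": text, "images": images, "links": links}
```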
The main function:
- Configures the Selenium WebDriver to run in headless mode (no UI).
- Reads URLs from the input Excel file.
- Scrapes data from each URL and stores the results.
- Saves the data into a new Excel file (`link2.xlsx`).
- The script handles errors caused by missing or unreadable Excel files and prints an error message when one occurs.
- If scraping a particular URL fails, the script will continue with the next URL, ensuring the process doesn't halt.
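Putting these pieces together, a sketch of such a main routine might look like this; it reuses the hypothetical `scrape_page` function shown earlier, and all names and details are illustrative rather than copied from `main.py`:

```python
from openpyxl import Workbook, load_workbook
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def main():
    # Configure Chrome to run headless (no visible browser window).
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=options
    )

    try:
        # Read URLs from column A of "Sheet1", starting at row 2.
        in_wb = load_workbook("LinksSeleniumWEBSC.xlsx")
        urls = [row[0]
                for row in in_wb["Sheet1"].iter_rows(min_row=2, values_only=True)
                if row[0]]
    except Exception as exc:
        print(f"Could not read the input file: {exc}")
        driver.quit()
        return

    out_wb = Workbook()
    out_ws = out_wb.active
    out_ws.append(["Text", "Images", "Links"])
    for url in urls:
        try:
            data = scrape_page(driver, url)  # sketch shown earlier
            out_ws.append([data["text"],
                           "\n".join(data["images"]),
                           "\n".join(data["links"])])
        except Exception as exc:
            # Log the failure and continue with the next URL.
            print(f"Failed to scrape {url}: {exc}")

    driver.quit()
    out_wb.save("link2.xlsx")

if __name__ == "__main__":
    main()
```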
Ensure that the `LinksSeleniumWEBSC.xlsx` file has a structure like this:
| URL |
|---|
| https://example.com |
| https://another.com |
- Dynamic Websites: The scraper handles basic static pages, and because Selenium drives a real browser, it also captures content that a website renders dynamically with JavaScript.
- Customization: Feel free to extend the functionality to scrape additional data such as headings, tables, or specific sections of a webpage (see the sketch after this list).
- Contributing: Feel free to fork this project, submit pull requests, or open issues if you have any suggestions for improvement or bug reports.
- Questions?: If you encounter any problems or have any questions, don't hesitate to reach out!
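As a starting point for the customization note above, a hypothetical extension that collects the headings of the current page might look like:

```python
from selenium.webdriver.common.by import By

def scrape_headings(driver):
    # Hypothetical extension: collect all <h1>-<h3> headings from the current page.
    return [h.text for h in driver.find_elements(By.CSS_SELECTOR, "h1, h2, h3")]
```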