This Python script scrapes a XenForo-based forum, extracting thread titles and the content of the first post from specified forum sections. The collected data is saved in a JSON file.
The script uses Selenium to automate browsing a XenForo forum. It can be configured to run in either headless mode or with a visible browser UI. The script is designed to mimic human behavior by applying random delays between requests to avoid overwhelming the server.
- Scrapes all forum sections or a specific section.
- Extracts thread titles and the content of the first post.
- Saves data in a structured JSON format.
- Supports both headless and UI browser modes.
- Mimics human behavior with random delays between requests.
- Handles pagination within forum sections.
- Avoids processing duplicate threads.
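The duplicate-avoidance feature can be pictured with a minimal sketch. It assumes threads are identified by their URL and that URLs differing only by a trailing slash point to the same thread. The helper name `unique_threads` is hypothetical and is not taken from `app.py`; the script's actual approach may differ.

```python
from typing import Dict, Iterable, Iterator, Set


def unique_threads(threads: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each thread record once, keyed by its URL (hypothetical helper)."""
    seen: Set[str] = set()
    for thread in threads:
        # Assumption: a trailing slash does not change which thread a URL names.
        key = thread["thread_url"].rstrip("/")
        if key in seen:
            continue  # already processed, skip the duplicate
        seen.add(key)
        yield thread
```

Keeping the seen URLs in a set makes each membership check constant time on average, which matters when the same thread shows up on several pages of a section.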
- Python 3.6+
- Google Chrome browser installed
- Clone the repository:

      git clone https://github.com/Vakood/XenForo_Parser.git
      cd XenForo_Parser

- Create a virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

- Install the required dependencies:

      pip install -r requirements.txt

- Configure the script: open `app.py` and modify the configuration variables at the top of the file:
  - `FORUM_BASE_URL`: This is the most important setting. Replace `"URL"` with the full URL of the XenForo forum you want to scrape. You can also use a URL to a specific forum section.
  - `OUTPUT_FILENAME`: The name of the output JSON file (default: `"xenforo_parsed_data.json"`).
  - `MIN_DELAY` and `MAX_DELAY`: The minimum and maximum delay in seconds between requests (default: 2 and 5 seconds).
  - `HEADLESS_MODE`: Set to `True` to run the browser in the background, or `False` to see the browser UI (default: `False`).

- Run the script:

      python app.py

The script will start, and you will see progress messages in the console.
| Variable | Description |
|---|---|
| `FORUM_BASE_URL` | The URL of the XenForo forum to scrape. Can be the main page or a specific forum section. |
| `OUTPUT_FILENAME` | The name of the JSON file where the parsed data will be saved. |
| `MIN_DELAY` | The minimum delay (in seconds) between HTTP requests to the server. |
| `MAX_DELAY` | The maximum delay (in seconds) between HTTP requests to the server. |
| `HEADLESS_MODE` | If `True`, the script runs Chrome in headless mode (no UI). If `False`, the browser window will be visible during the scraping process. |
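As a rough illustration of how `MIN_DELAY` and `MAX_DELAY` might translate into a pause between requests, here is a minimal sketch. The function name `polite_pause` and its injectable `sleep` parameter are assumptions made for this example, and the choice of a uniform distribution is also an assumption; `app.py` may pick its delays differently.

```python
import random
import time

MIN_DELAY = 2  # seconds (documented default)
MAX_DELAY = 5  # seconds (documented default)


def polite_pause(min_delay: float = MIN_DELAY,
                 max_delay: float = MAX_DELAY,
                 sleep=time.sleep) -> float:
    """Sleep for a random duration between the two bounds and return it.

    Hypothetical helper: `sleep` is injectable so the pause can be
    skipped in tests.
    """
    if min_delay > max_delay:
        raise ValueError("MIN_DELAY must not exceed MAX_DELAY")
    # Assumption: delays are drawn uniformly from [min_delay, max_delay].
    delay = random.uniform(min_delay, max_delay)
    sleep(delay)
    return delay
```

Calling `polite_pause()` before each page load would space requests out by two to five seconds with the default settings.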
The script generates a JSON file (e.g., `xenforo_parsed_data.json`) containing an array of objects. Each object represents a scraped thread and has the following structure:

    [
      {
        "forum_name": "General Discussion",
        "thread_title": "Welcome to our community!",
        "thread_url": "https://example.com/threads/welcome-to-our-community.123/",
        "thread_content": "This is the content of the first post..."
      },
      {
        "forum_name": "Support",
        "thread_title": "How to use the new feature",
        "thread_url": "https://example.com/threads/how-to-use-the-new-feature.456/",
        "thread_content": "Here are the steps to use the new feature..."
      }
    ]
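Once the file exists, it can be consumed with the standard library alone. The sketch below loads the array and groups thread titles by forum section. The function names `load_records` and `titles_by_forum` are hypothetical, and UTF-8 encoding of the output file is an assumption.

```python
import json
from collections import defaultdict
from typing import Dict, List


def load_records(path: str = "xenforo_parsed_data.json") -> List[Dict[str, str]]:
    """Read the scraped threads from the output file (assumed UTF-8)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def titles_by_forum(records: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group thread titles under the forum section they were scraped from."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["forum_name"]].append(record["thread_title"])
    return dict(grouped)
```

For example, `titles_by_forum(load_records())` applied to the sample above would map `"General Discussion"` and `"Support"` to one title each.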