XenForo Forum Scraper

This Python script scrapes a XenForo-based forum, extracting thread titles and the content of the first post from specified forum sections. The collected data is saved in a JSON file.

Description

The script uses Selenium to automate browsing a XenForo forum. It can be configured to run in either headless mode or with a visible browser UI. The script is designed to mimic human behavior by applying random delays between requests to avoid overwhelming the server.
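
The random delay described above can be sketched as follows. This is a minimal illustration, not the code in app.py: the helper name `polite_pause` and its injectable `sleep` parameter are hypothetical, and the defaults simply mirror the documented MIN_DELAY and MAX_DELAY values.

```python
import random
import time


def polite_pause(min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Sleep for a random duration between min_delay and max_delay seconds.

    Hypothetical helper for illustration only. `sleep` is injectable so the
    function can be exercised without actually waiting.
    """
    if min_delay < 0 or max_delay < min_delay:
        raise ValueError("expected 0 <= min_delay <= max_delay")
    # random.uniform returns a float in the closed range [min_delay, max_delay].
    delay = random.uniform(min_delay, max_delay)
    sleep(delay)
    return delay
```

Calling `polite_pause()` before each page load spaces requests irregularly, which is gentler on the server than a fixed interval.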

Features

- Scrapes all forum sections or a specific section.
- Extracts thread titles and the content of the first post.
- Saves data in a structured JSON format.
- Supports both headless and UI browser modes.
- Mimics human behavior with random delays between requests.
- Handles pagination within forum sections.
- Avoids processing duplicate threads.
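
One common way to avoid processing duplicate threads is to track normalized thread URLs in a set. The sketch below shows that idea under stated assumptions: the function names `normalize_thread_url` and `unique_threads` are hypothetical, and app.py may detect duplicates differently.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_thread_url(url):
    """Return a canonical form of a thread URL (hypothetical helper).

    Drops the query string and fragment, lowercases the host, and ensures
    a trailing slash so that links to the same thread compare equal.
    """
    parts = urlsplit(url)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))


def unique_threads(urls):
    """Yield each normalized thread URL once, in first-seen order."""
    seen = set()
    for url in urls:
        key = normalize_thread_url(url)
        if key not in seen:
            seen.add(key)
            yield key
```

The same thread often appears on several listing pages (for example when it is pinned), so checking the set before opening a thread avoids a redundant page load.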

Prerequisites

- Python 3.6+
- Google Chrome browser installed

Installation

Clone the repository:

git clone https://github.com/Vakood/XenForo_Parser.git
cd XenForo_Parser

Create a virtual environment (recommended):

python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

Configure the script: Open app.py and modify the configuration variables at the top of the file:
- FORUM_BASE_URL: This is the most important setting. Replace "URL" with the full URL of the XenForo forum you want to scrape. You can also use a URL to a specific forum section.
- OUTPUT_FILENAME: The name of the output JSON file (default: "xenforo_parsed_data.json").
- MIN_DELAY and MAX_DELAY: The minimum and maximum delay in seconds between requests (default: 2 and 5 seconds).
- HEADLESS_MODE: Set to True to run the browser in the background, or False to see the browser UI (default: False).

Run the script:
```
python app.py
```
The script will start, and you will see progress messages in the console.

Configuration

| Variable | Description |
| --- | --- |
| `FORUM_BASE_URL` | The URL of the XenForo forum to scrape. Can be the main page or a specific forum section. |
| `OUTPUT_FILENAME` | The name of the JSON file where the parsed data will be saved. |
| `MIN_DELAY` | The minimum delay (in seconds) between HTTP requests to the server. |
| `MAX_DELAY` | The maximum delay (in seconds) between HTTP requests to the server. |
| `HEADLESS_MODE` | If `True`, the script runs Chrome in headless mode (no UI). If `False`, the browser window will be visible during the scraping process. |
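
For orientation, the configuration block at the top of app.py might look roughly like the sketch below. The variable names and defaults come from the table above; the URL is a placeholder, and the helpers `check_config` and `build_chrome_options` are hypothetical additions showing one way the values could be validated and how `HEADLESS_MODE` could map onto Selenium's Chrome options.

```python
FORUM_BASE_URL = "https://example.com/"  # placeholder; use your forum's URL
OUTPUT_FILENAME = "xenforo_parsed_data.json"
MIN_DELAY = 2  # seconds
MAX_DELAY = 5  # seconds
HEADLESS_MODE = False


def check_config(base_url, min_delay, max_delay):
    """Return a list of human-readable problems (hypothetical helper)."""
    problems = []
    if not base_url.startswith(("http://", "https://")):
        problems.append("FORUM_BASE_URL must be a full http(s) URL")
    if min_delay < 0:
        problems.append("MIN_DELAY must not be negative")
    if max_delay < min_delay:
        problems.append("MAX_DELAY must be at least MIN_DELAY")
    return problems


def build_chrome_options(headless):
    """Build Chrome options for Selenium (hypothetical helper)."""
    # Imported lazily so check_config can run without Selenium installed.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    if headless:
        # Assumes a recent Chrome; older versions use plain "--headless".
        options.add_argument("--headless=new")
    return options
```

Running `check_config(FORUM_BASE_URL, MIN_DELAY, MAX_DELAY)` before starting the browser catches a URL left at its `"URL"` placeholder or a swapped delay range early.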

Output

The script generates a JSON file (e.g., xenforo_parsed_data.json) containing an array of objects. Each object represents a scraped thread and has the following structure:

[
    {
        "forum_name": "General Discussion",
        "thread_title": "Welcome to our community!",
        "thread_url": "https://example.com/threads/welcome-to-our-community.123/",
        "thread_content": "This is the content of the first post..."
    },
    {
        "forum_name": "Support",
        "thread_title": "How to use the new feature",
        "thread_url": "https://example.com/threads/how-to-use-the-new-feature.456/",
        "thread_content": "Here are the steps to use the new feature..."
    }
]
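
Because the output is a plain JSON array, it can be loaded with the standard library. The sketch below counts scraped threads per forum; the function names are hypothetical, and the field names are taken from the structure shown above.

```python
import json
from collections import Counter


def load_threads(path="xenforo_parsed_data.json"):
    """Read the scraper's output file and return its list of thread records."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def threads_per_forum(records):
    """Count how many scraped threads belong to each forum section."""
    return Counter(record["forum_name"] for record in records)
```

For example, `threads_per_forum(load_threads()).most_common(5)` lists the five forum sections with the most scraped threads.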
