XenForo Forum Scraper

This Python script scrapes a XenForo-based forum, extracting each thread's title and the content of its first post from the specified forum sections. The collected data is saved to a JSON file.

Description

The script uses Selenium to automate browsing a XenForo forum. It can be configured to run in either headless mode or with a visible browser UI. The script is designed to mimic human behavior by applying random delays between requests to avoid overwhelming the server.
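
As a rough illustration of the headless/UI switch described above, the sketch below builds a Chrome driver from a boolean flag. It is not taken from app.py: the function names `chrome_arguments` and `build_driver` and the window size are assumptions made for this example, and Selenium is imported lazily so the argument helper can be inspected without a browser.

```python
from typing import List


def chrome_arguments(headless: bool) -> List[str]:
    """Return Chrome command-line arguments for the chosen mode (hypothetical helper)."""
    # The window size is an arbitrary illustrative value, not a setting from the script.
    args = ["--window-size=1280,900"]
    if headless:
        # "--headless=new" selects Chrome's current headless implementation.
        args.append("--headless=new")
    return args


def build_driver(headless: bool):
    """Create a Chrome WebDriver, with or without a visible window."""
    # Imported here so chrome_arguments() works even if Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling `build_driver(True)` would start Chrome in the background, while `build_driver(False)` opens a visible window. Remember to call `driver.quit()` when finished.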

Features

  • Scrapes all forum sections or a specific section.
  • Extracts thread titles and the content of the first post.
  • Saves data in a structured JSON format.
  • Supports both headless and UI browser modes.
  • Mimics human behavior with random delays between requests.
  • Handles pagination within forum sections.
  • Avoids processing duplicate threads.
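
The delay, pagination, and duplicate-handling features listed above could fit together roughly as follows. This is a minimal sketch under stated assumptions, not the script's actual code: `fetch_page` is a hypothetical callback that stands in for the Selenium page-loading logic, and `polite_sleep` and `collect_thread_urls` are names invented for this example.

```python
import random
import time
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical callback type: given a page URL, return the thread URLs found
# on that page and the URL of the next page (None on the last page).
PageFetcher = Callable[[str], Tuple[List[str], Optional[str]]]


def polite_sleep(min_delay: float, max_delay: float) -> float:
    """Sleep for a random duration in [min_delay, max_delay] and return it."""
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    return delay


def collect_thread_urls(fetch_page: PageFetcher, first_page_url: str,
                        min_delay: float = 2.0,
                        max_delay: float = 5.0) -> List[str]:
    """Walk a section page by page and return each thread URL once, in order."""
    seen_threads: Set[str] = set()
    visited_pages: Set[str] = set()
    ordered: List[str] = []

    page_url: Optional[str] = first_page_url
    # Stop on the last page, or if a "next" link points back to a visited page.
    while page_url is not None and page_url not in visited_pages:
        visited_pages.add(page_url)
        thread_urls, next_url = fetch_page(page_url)
        for url in thread_urls:
            if url not in seen_threads:  # skip duplicate threads
                seen_threads.add(url)
                ordered.append(url)
        page_url = next_url
        if page_url is not None and page_url not in visited_pages:
            # Pause before requesting the next page.
            polite_sleep(min_delay, max_delay)
    return ordered
```

Passing the page-loading step in as a callback keeps the loop independent of the browser, so it can be exercised with a dictionary of fake pages before being pointed at a real forum.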

Prerequisites

  • Python 3.6+
  • Google Chrome browser installed

Installation

  1. Clone the repository:

    git clone https://github.com/Vakood/XenForo_Parser.git
    cd XenForo_Parser
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  3. Install the required dependencies:

    pip install -r requirements.txt

Usage

  1. Configure the script: Open app.py and modify the configuration variables at the top of the file:

    • FORUM_BASE_URL: This is the most important setting. Replace "URL" with the full URL of the XenForo forum you want to scrape. You can also use a URL to a specific forum section.
    • OUTPUT_FILENAME: The name of the output JSON file (default: "xenforo_parsed_data.json").
    • MIN_DELAY and MAX_DELAY: The minimum and maximum delay in seconds between requests (default: 2 and 5 seconds).
    • HEADLESS_MODE: Set to True to run the browser in the background, or False to see the browser UI (default: False).
  2. Run the script:

    python app.py

    The script will start, and you will see progress messages in the console.

Configuration

| Variable | Description |
| --- | --- |
| FORUM_BASE_URL | The URL of the XenForo forum to scrape. Can be the main page or a specific forum section. |
| OUTPUT_FILENAME | The name of the JSON file where the parsed data will be saved. |
| MIN_DELAY | The minimum delay (in seconds) between HTTP requests to the server. |
| MAX_DELAY | The maximum delay (in seconds) between HTTP requests to the server. |
| HEADLESS_MODE | If True, the script runs Chrome in headless mode (no UI). If False, the browser window will be visible during the scraping process. |
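
For orientation, the configuration block at the top of app.py might look something like the sketch below. The variable names and default values come from this README. The exact layout in app.py may differ, and `validate_config` is a hypothetical helper added here only to show a sanity check you could run before starting a long scrape.

```python
from typing import List
from urllib.parse import urlparse

# Documented settings with their documented defaults.
FORUM_BASE_URL = "URL"  # placeholder: replace with the forum or section URL
OUTPUT_FILENAME = "xenforo_parsed_data.json"
MIN_DELAY = 2           # seconds
MAX_DELAY = 5           # seconds
HEADLESS_MODE = False   # True runs Chrome without a visible window


def validate_config(base_url: str, min_delay: float, max_delay: float) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper, not part of the original script.
    """
    problems = []
    parsed = urlparse(base_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("FORUM_BASE_URL must be a full http(s) URL")
    if min_delay < 0 or max_delay < min_delay:
        problems.append("delays must satisfy 0 <= MIN_DELAY <= MAX_DELAY")
    return problems
```

With the placeholder still in place, `validate_config(FORUM_BASE_URL, MIN_DELAY, MAX_DELAY)` reports the URL problem, which is a quick reminder to edit the setting before running the script.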

Output

The script generates a JSON file (e.g., xenforo_parsed_data.json) containing an array of objects. Each object represents a scraped thread and has the following structure:

[
    {
        "forum_name": "General Discussion",
        "thread_title": "Welcome to our community!",
        "thread_url": "https://example.com/threads/welcome-to-our-community.123/",
        "thread_content": "This is the content of the first post..."
    },
    {
        "forum_name": "Support",
        "thread_title": "How to use the new feature",
        "thread_url": "https://example.com/threads/how-to-use-the-new-feature.456/",
        "thread_content": "Here are the steps to use the new feature..."
    }
]
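
Since the output is a plain JSON array, it can be post-processed with the standard library alone. The sketch below loads the file and counts threads per forum section. The function names are illustrative and not part of the project. The field names are the ones shown in the structure above.

```python
import json
from collections import Counter
from typing import Dict, List


def load_records(path: str) -> List[Dict[str, str]]:
    """Read the scraper's output file and return its list of thread records."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def threads_per_forum(records: List[Dict[str, str]]) -> Counter:
    """Count how many scraped threads belong to each forum section."""
    return Counter(record["forum_name"] for record in records)


if __name__ == "__main__":
    # Assumes the default output filename from the configuration.
    records = load_records("xenforo_parsed_data.json")
    for forum, count in threads_per_forum(records).most_common():
        print(f"{forum}: {count} thread(s)")
```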
