WeChat Posts Crawler

A comprehensive Python web crawler for WeChat public account posts with anti-detection measures and multiple crawling strategies.

Features

  • 🔍 Multiple Search Methods: Find WeChat accounts via Sogou search or crawl posts directly
  • 🛡️ Anti-Detection: Randomized delays, rotating user agents, and stealth browsing
  • ⚡ Async Support: Concurrent crawling for better performance
  • 📊 Data Export: Save results to CSV and JSON formats
  • 🔧 Configurable: Extensive configuration options
  • 📈 Statistics: Built-in analytics and reporting
  • 🗃️ Data Management: Automatic backups and file compression
  • 🌐 Proxy Support: Built-in proxy rotation support

Installation

Prerequisites

  • Python 3.8 or higher
  • Chrome browser (for Selenium)

Setup

  1. Clone the repository:
     git clone <repository-url>
     cd wechat-crawler
  2. Install dependencies:
     pip install -r requirements.txt
  3. (Optional) Set up environment variables:
     export WECHAT_HEADLESS=true
     export WECHAT_OUTPUT_DIR=./data
     export WECHAT_PROXY=http://your-proxy:port

Quick Start

Basic Usage

from wechat_crawler import WeChatCrawler

# Initialize crawler
crawler = WeChatCrawler(headless=True)

try:
    # Search for WeChat accounts
    accounts = crawler.search_wechat_accounts("人工智能", page=1)
    print(f"Found {len(accounts)} accounts")
    
    # Get posts from first account
    if accounts:
        posts = crawler.get_account_posts(accounts[0]['profile_link'], max_posts=5)
        
        # Get detailed content for posts
        for post in posts:
            content = crawler.get_post_content(post['link'])
            if content:
                print(f"Title: {content['title']}")
                print(f"Read count: {content['read_count']}")
        
        # Save data
        crawler.save_to_csv(posts, "wechat_posts.csv")
        crawler.save_to_json(posts, "wechat_posts.json")

finally:
    crawler.close()

Context Manager Usage

from wechat_crawler import WeChatCrawler

with WeChatCrawler(headless=True) as crawler:
    accounts = crawler.search_wechat_accounts("科技新闻")
    posts = crawler.get_account_posts(accounts[0]['profile_link'])
    crawler.save_to_csv(posts, "tech_news.csv")

Async Crawling

import asyncio
from wechat_crawler import AsyncWeChatCrawler

async def async_crawl():
    async_crawler = AsyncWeChatCrawler(max_concurrent=5)
    
    post_urls = [
        "https://mp.weixin.qq.com/s/url1",
        "https://mp.weixin.qq.com/s/url2",
        "https://mp.weixin.qq.com/s/url3",
    ]
    
    posts = await async_crawler.batch_crawl_posts(post_urls)
    return posts

# Run async crawling
posts = asyncio.run(async_crawl())

Advanced Usage

Batch Crawling

keywords = ["机器学习", "深度学习", "自然语言处理"]
all_posts = []

with WeChatCrawler() as crawler:
    for keyword in keywords:
        accounts = crawler.search_wechat_accounts(keyword)
        
        for account in accounts[:3]:  # Top 3 accounts per keyword
            posts = crawler.get_account_posts(account['profile_link'], max_posts=5)
            
            # Tag posts with search keyword
            for post in posts:
                post['search_keyword'] = keyword
            
            all_posts.extend(posts)
    
    # Save all collected data
    crawler.save_to_csv(all_posts, "ai_research_posts.csv")

Custom Configuration

from wechat_crawler import WeChatCrawler
from config import CRAWLER_CONFIG

# Customize settings
config = CRAWLER_CONFIG.copy()
config.update({
    'max_posts_per_account': 20,
    'random_delay_min': 2,
    'random_delay_max': 5,
})

# Note: how the custom settings above get applied depends on the implementation
# (for example, via config.py); the constructor itself takes its own keyword arguments.
crawler = WeChatCrawler(
    headless=False,  # Show browser for debugging
    proxy="http://proxy-server:8080"  # Use proxy
)

Using Utilities

from utils import (
    filter_duplicate_posts,
    validate_post_data,
    create_summary_report,
    calculate_crawling_stats
)

# Remove duplicates
unique_posts = filter_duplicate_posts(posts, key='link')

# Validate data
valid_posts = [post for post in posts if validate_post_data(post)]

# Generate report
report = create_summary_report(posts, "crawling_report.txt")
print(report)

# Get statistics
stats = calculate_crawling_stats(posts)
print(f"Total posts: {stats['total_posts']}")
print(f"Date range: {stats['date_range']}")

Configuration

Environment Variables

Variable            Description                     Default
WECHAT_HEADLESS     Run browser in headless mode    true
WECHAT_OUTPUT_DIR   Output directory for files      ./
WECHAT_PROXY        Proxy server URL                None
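
A sketch of how these variables might be consumed (the actual lookup lives in config.py and may differ):

import os

# Illustrative defaults mirroring the table above
HEADLESS = os.getenv("WECHAT_HEADLESS", "true").lower() == "true"
OUTPUT_DIR = os.getenv("WECHAT_OUTPUT_DIR", "./")
PROXY = os.getenv("WECHAT_PROXY")  # None when the variable is unset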

Configuration File

Edit config.py to customize:

  • Browser settings (headless mode, window size)
  • Anti-detection measures (delays, retry attempts)
  • Rate limiting and concurrent requests
  • Output formats and file handling
  • Proxy configuration
  • Content filtering rules

Key Configuration Options

CRAWLER_CONFIG = {
    'headless': True,                    # Headless browser mode
    'timeout': 30,                       # Request timeout
    'random_delay_min': 1,               # Min delay between requests
    'random_delay_max': 3,               # Max delay between requests
    'max_concurrent_requests': 5,        # Concurrent requests limit
    'max_posts_per_account': 10,         # Posts per account
    'retry_attempts': 3,                 # Retry failed requests
}

Output Formats

CSV Output

Posts are saved with the following columns:

  • title: Post title
  • link: Post URL
  • author: Account name
  • publish_date: Publication date
  • description: Post summary
  • content: Full post content
  • read_count: Number of reads
  • like_count: Number of likes
  • images: List of image URLs
  • crawled_at: Timestamp when crawled

JSON Output

Structured JSON with the same fields as CSV, preserving data types and nested structures.
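
An illustrative record, with made-up values; exact field formatting depends on the implementation:

{
  "title": "Example article title",
  "link": "https://mp.weixin.qq.com/s/example",
  "author": "Example Account",
  "publish_date": "2024-01-01",
  "description": "Short summary of the post",
  "content": "Full post text ...",
  "read_count": 1200,
  "like_count": 35,
  "images": ["https://example.com/image1.jpg"],
  "crawled_at": "2024-01-02T10:00:00"
}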

Error Handling

The crawler includes comprehensive error handling:

  • Rate Limiting: Automatic delays and retry mechanisms
  • Anti-Detection: Multiple strategies to avoid blocking
  • Network Errors: Retry failed requests with exponential backoff (see the sketch after this list)
  • Data Validation: Ensure data quality and completeness
  • Logging: Detailed logs for debugging and monitoring
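
A minimal sketch of the retry-with-exponential-backoff pattern mentioned above; the helper name and the use of requests here are illustrative, not the crawler's internal API:

import random
import time

import requests

def fetch_with_backoff(url, retry_attempts=3, base_delay=1.0, timeout=30):
    """Retry a GET request, waiting roughly 1s, 2s, 4s, ... plus jitter between attempts."""
    for attempt in range(retry_attempts):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retry_attempts - 1:
                raise
            # Exponential backoff with jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))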

Legal and Ethical Considerations

⚠️ Important Notice:

  1. Respect robots.txt: Always check and respect the target site's robots.txt
  2. Rate Limiting: Use appropriate delays to avoid overwhelming servers
  3. Terms of Service: Ensure compliance with WeChat's Terms of Service
  4. Data Privacy: Handle crawled data responsibly and in compliance with privacy laws
  5. Fair Use: Use crawled data for legitimate research, analysis, or personal use only

Best Practices

  • Start with small-scale testing
  • Use reasonable request intervals (1-3 seconds; see the sketch after this list)
  • Monitor for rate limiting or blocking
  • Implement proper error handling
  • Respect copyright and intellectual property rights
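
The 1-3 second interval above can be applied with a small helper like this (a sketch only; the crawler's own random_delay_min/random_delay_max settings serve the same purpose):

import random
import time

def polite_delay(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random interval so requests are not sent at a fixed rhythm."""
    time.sleep(random.uniform(min_seconds, max_seconds))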

Troubleshooting

Common Issues

  1. Chrome Driver Issues:

    # Update Chrome driver
    pip install --upgrade webdriver-manager
  2. Permission Errors:

    # Linux: Install Chrome
    sudo apt-get update
    sudo apt-get install -y google-chrome-stable
  3. Rate Limiting:

    • Increase delays in configuration
    • Use proxy rotation
    • Reduce concurrent requests
  4. Element Not Found:

    • WeChat may update their HTML structure
    • Check and update CSS selectors in config.py
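
If config.py centralizes selectors, updating them might look like the following. The key names and CSS strings here are hypothetical placeholders, not the project's actual values:

# Hypothetical example only -- check config.py for the real keys and values
SELECTORS = {
    "search_result_title": "div.txt-box h3 a",
    "post_title": "h1.rich_media_title",
    "post_content": "div.rich_media_content",
}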

Debug Mode

Run with debug logging:

import logging
logging.basicConfig(level=logging.DEBUG)

# Your crawler code here
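
To also capture the wechat_crawler.log file mentioned under Support, a file handler can be added, assuming the crawler logs through Python's standard logging module:

import logging

logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    handlers=[
        logging.StreamHandler(),                    # console output
        logging.FileHandler("wechat_crawler.log"),  # log file named under Support
    ],
)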

Performance Tips

  1. Use Async Crawling: For multiple posts simultaneously
  2. Optimize Browser Settings: Disable images and JavaScript when not needed (see the sketch after this list)
  3. Batch Processing: Process multiple accounts in batches
  4. Caching: Avoid re-crawling the same content
  5. Memory Management: Close browser instances properly
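
A sketch of tip 2 with raw Selenium options (assuming the crawler does not already expose this; the settings below are standard Selenium/Chrome options, not project-specific):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
# Ask Chrome not to load images, which speeds up page loads considerably
options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
finally:
    driver.quit()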

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

This project is for educational and research purposes only. Please ensure compliance with applicable laws and regulations.

Support

If you encounter issues:

  1. Check the troubleshooting section
  2. Review the logs in wechat_crawler.log
  3. Open an issue with detailed error information
  4. Provide your configuration and environment details

Changelog

v1.0.0 (Current)

  • Initial release with basic crawling functionality
  • Support for account search and post extraction
  • Async crawling capabilities
  • Comprehensive configuration system
  • Data export in multiple formats
  • Anti-detection measures

Disclaimer: This tool is provided for educational and research purposes only. Users are responsible for ensuring their use complies with applicable laws, regulations, and terms of service. The authors are not responsible for any misuse of this software.
