Privacy Policy Analyzer

Analyze a website's privacy policy end-to-end: auto-discover the policy URL, fetch clean text (HTTP first, Selenium fallback), chunk the content, evaluate it via an LLM with a structured rubric, and aggregate category scores into an overall score with strengths, risks, red flags, and recommendations.

Part of Happy Hacking Space - A community-driven organization focused on security, AI, and software development.

Features

Auto-discovery: Common paths → robots.txt/sitemaps → footer links.
HTTP-first extraction: trafilatura (clean text) or BeautifulSoup fallback; Selenium for dynamic pages.
Structured scoring (JSON): Per-category (0–10) scores + rationales; aggregated to 0–100 overall in scoring.py.
Configurable chunking: Paragraph-aware recursive splitting; --max-chunks hard cap to control cost/latency.
Simple CLI: Choose summary, detailed, or full reports.

Project Layout

privacy-policy-analyzer/
├── src/
│   ├── __init__.py
│   ├── main.py                    # Main CLI application
│   └── analyzer/
│       ├── __init__.py
│       ├── prompts.py             # LLM prompts for analysis
│       └── scoring.py             # Scoring algorithms
├── docs/                          # Documentation
│   ├── index.md
│   ├── user-guide.md
│   ├── api.md
│   ├── contributing.md
│   └── changelog.md
├── .github/
│   └── workflows/                 # CI/CD pipelines
│       ├── ci.yaml
│       └── release.yml
├── pyproject.toml                 # Project configuration
├── requirements.txt               # Legacy requirements
├── .env.example                   # Environment template
├── .gitignore
├── LICENSE
└── README.md

Requirements

Python 3.10.11 or higher
An OpenAI API key
(Optional) Chrome/Chromium on the machine (Selenium fallback; driver auto-installs)

Installation

Using uv (Recommended)

# Clone the repository
git clone https://github.com/HappyHackingSpace/privacy-policy-analyzer.git
cd privacy-policy-analyzer

# Install dependencies
uv sync

# Activate the virtual environment
source .venv/bin/activate  # On macOS/Linux
# or
.venv\Scripts\activate     # On Windows

Using pip

# Clone the repository
git clone https://github.com/HappyHackingSpace/privacy-policy-analyzer.git
cd privacy-policy-analyzer

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate     # On Windows

# Install dependencies
pip install -e .

Configuration

Copy .env.example → .env and set your credentials:

OPENAI_API_KEY=sk-************************
# Optional (overrides default):
OPENAI_MODEL=gpt-4o

Usage

Quick start (auto-discovery, summary report)

# Using uv
uv run python src/main.py --url https://www.example.com/ --report summary

# Using pip
python src/main.py --url https://www.example.com/ --report summary

Detailed JSON (category scores, rationales, red flags)

uv run python src/main.py --url https://www.example.com/ --report detailed

Force a specific fetch method

HTTP only (faster/cleaner when available):

uv run python src/main.py --url https://www.example.com/ --fetch http --report detailed

Selenium only (for heavily dynamic pages):

uv run python src/main.py --url https://www.example.com/ --fetch selenium --report detailed

Skip auto-discovery (analyze a known policy URL as-is)

uv run python src/main.py --url https://www.example.com/legal/privacy --no-discover --report detailed

Tuning chunking and cost/latency

# Larger chunks = fewer requests (cheaper/faster), but slightly coarser analysis
uv run python src/main.py --url https://example.com --chunk-size 3500 --chunk-overlap 350 --max-chunks 30 --report summary

CLI Options (summary)

--url (required): Site homepage or direct privacy policy URL.
--model (default: env OPENAI_MODEL or gpt-4o): OpenAI chat model name.
--fetch (default: auto): auto | http | selenium.
--no-discover: Analyze the given URL without discovery.
--chunk-size (default: 3500) and --chunk-overlap (default: 350).
--max-chunks (default: 30): Hard cap; tail chunks are merged to keep requests bounded.
--report (default: summary): summary | detailed | full.

Output

summary: overall score, confidence, top strengths/risks, red-flags count.
detailed: adds per-category scores (0–10), rationales, deduped red flags, recommendations.
full: includes all per-chunk JSON items along with the aggregated report.

Notes & Tips

Determinism: For consistent runs, pin --fetch http or --fetch selenium and/or use --no-discover with a fixed policy URL.
International sites: The HTTP client sets Accept-Language: en-US,en;q=0.9 to reduce locale variance.
Selenium: Ensure Chrome/Chromium exists; chromedriver-autoinstaller will fetch a matching driver automatically.

Troubleshooting

ImportError: lxml.html.clean ... Ensure lxml[html_clean] is installed (it’s included in requirements.txt).
Very low or inconsistent scores Try --fetch selenium or analyze the explicit policy URL with --no-discover. Some sites serve different content per region/session.

Security & Ethics

This tool provides automated analysis heuristics and LLM-generated assessments. Treat results as decision support, not legal advice. Always review the original policy and consult qualified counsel for compliance-critical use cases.

Documentation

📖 Full Documentation - Complete user guide and API reference
🚀 Quick Start Guide - Get up and running quickly
🔧 API Reference - Detailed API documentation
🤝 Contributing - How to contribute to the project

About Happy Hacking Space

This project is part of Happy Hacking Space, a community-driven organization focused on:

🔒 Security Research - Tools and techniques for security professionals
🤖 AI & Machine Learning - Practical AI applications and research
💻 Software Development - Open source tools and libraries
🌐 Community Building - Bringing together developers, researchers, and security experts

Other Projects

vulnerable-target - Intentionally vulnerable environments for security training
events - Community events directory
site - Happy Hacking Space website

Contributing

We welcome contributions! Please see our Contributing Guide for details.

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Support

🐛 Bug Reports: GitHub Issues
💬 Discussions: GitHub Discussions
📧 Contact: Happy Hacking Space

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built with ❤️ by the Happy Hacking Space community
Powered by OpenAI's GPT models
Uses trafilatura for web content extraction
Inspired by the need for better privacy policy transparency

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Privacy Policy Analyzer

Features

Project Layout

Requirements

Installation

Using uv (Recommended)

Using pip

Configuration

Usage

Quick start (auto-discovery, summary report)

Detailed JSON (category scores, rationales, red flags)

Force a specific fetch method

Skip auto-discovery (analyze a known policy URL as-is)

Tuning chunking and cost/latency

CLI Options (summary)

Output

Notes & Tips

Troubleshooting

Security & Ethics

Documentation

About Happy Hacking Space

Other Projects

Contributing

Support

License

Acknowledgments

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Privacy Policy Analyzer

Features

Project Layout

Requirements

Installation

Using uv (Recommended)

Using pip

Configuration

Usage

Quick start (auto-discovery, summary report)

Detailed JSON (category scores, rationales, red flags)

Force a specific fetch method

Skip auto-discovery (analyze a known policy URL as-is)

Tuning chunking and cost/latency

CLI Options (summary)

Output

Notes & Tips

Troubleshooting

Security & Ethics

Documentation

About Happy Hacking Space

Other Projects

Contributing

Support

License

Acknowledgments