
DiegoRibeirodeSouza/privacy-scanner-cloud


🟒 Live Demo

Try the production version here: https://vitalglobalhub.com/privacy-scanner (Generates a full legal analysis PDF in ~45 seconds)

PrivAI Sentinel: AI-Powered GDPR Compliance Auditor

Serverless Privacy-as-a-Service (PaaS) engine for automated forensic auditing and legal analysis.

License: GPL v3 Node.js Google Gemini Puppeteer Docker


⚑ The Problem

Modern web ecosystems are plagued by "Shadow IT". Marketing teams inject third-party scripts (Facebook Pixel, TikTok Ads, Hotjar) without legal oversight, causing companies to unknowingly violate GDPR (Art. 25) and CCPA regulations. Manual auditing is slow, expensive, and technically shallow.

🛡️ The Solution

PrivAI Sentinel is a high-performance, containerized microservice that performs deep forensic scanning of websites. It combines Puppeteer Stealth, tuned to bypass bot detection, with Google Gemini 2.5 Pro to generate lawyer-grade compliance reports in under a minute.

It serves as a powerful Lead Magnet, requiring an email exchange to release the high-value PDF report.


🚀 Key Features

  • 🕵️ Deep Forensic Crawling: Uses recursive logic to scan up to 12 pages per session, simulating real user behavior (scrolling, clicking) to trigger lazy-loaded trackers.
  • 🤖 Generative Legal Analysis: Integrates Gemini 2.5 Pro to interpret technical findings (cookies/scripts) and write a contextualized legal opinion based on GDPR standards.
  • 👻 Stealth Mode: Powered by puppeteer-extra-plugin-stealth to evade anti-bot systems (Cloudflare/WAF) and detect hidden trackers.
  • 🎣 Automated Lead Capture: Synchronous integration with Mailchimp API to store leads before processing the audit (guaranteed ROI).
  • 📄 Dynamic PDF Generation: Creates a professional, branded PDF with "Privacy Trust Scores", evidence screenshots, and categorized cookie lists.
  • ☁️ Serverless Optimized: Architecture tuned for Google Cloud Run (stateless, memory-efficient, --disable-gpu flags).
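
The page-capped crawl from the first bullet could be sketched roughly as follows. This is an illustration and not the project's code: `crawlSite` and `getLinks` are hypothetical names, `getLinks` stands in for whatever Puppeteer page visit returns the links on a page, and the sketch uses an iterative breadth-first queue where the README describes recursive logic.

```javascript
// Minimal sketch of a page-capped, same-origin crawl.
// Assumption: `getLinks(url)` is a hypothetical async function that visits a
// page and resolves to the href strings found on it.
const MAX_PAGES = 12; // page budget per session, taken from the feature list

// Resolve an href against a base and drop the #fragment; null if malformed.
function normalize(href, base) {
  try {
    const parsed = new URL(href, base);
    parsed.hash = '';
    return parsed;
  } catch {
    return null;
  }
}

async function crawlSite(startUrl, getLinks, maxPages = MAX_PAGES) {
  const start = normalize(startUrl);
  if (!start) throw new Error(`Invalid start URL: ${startUrl}`);

  const queue = [start.href];
  const seen = new Set(queue);
  const visited = [];

  while (queue.length > 0 && visited.length < maxPages) {
    const current = queue.shift();
    visited.push(current);

    for (const href of await getLinks(current)) {
      const next = normalize(href, current);
      // Stay on the audited site and never queue the same page twice.
      if (!next || next.origin !== start.origin || seen.has(next.href)) continue;
      seen.add(next.href);
      queue.push(next.href);
    }
  }
  return visited;
}
```

Injecting `getLinks` keeps the traversal logic independent of the browser, so the page budget and same-origin rule can be checked without launching Chrome.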

🏗️ Architecture & Data Flow

The system follows a strict linear pipeline to ensure data integrity and resource optimization:

graph TD
    A[Client Request /scan] -->|URL + Email| B(Mailchimp Lead Capture)
    B -->|Success| C{Puppeteer Cluster}
    C -->|Recursive Crawl| D[Extract Cookies & Scripts]
    C -->|Screenshot| E[Visual Evidence]
    D -->|JSON Payload| F[Gemini 2.5 Pro AI]
    F -->|Legal Opinion| G[PDF Generation Engine]
    G -->|Stream Buffer| H[Downloadable Report]
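
Read as code, the diagram amounts to a sequence of awaited stages in which a failed lead capture stops everything downstream. The sketch below only illustrates that ordering: `runScan` and the four stage names are hypothetical stand-ins, and each stage is passed in so the flow can be read without the real Mailchimp, Puppeteer, or Gemini calls.

```javascript
// Sketch of the linear pipeline. Every stage is injected, so the names
// captureLead, crawl, analyze and renderPdf are hypothetical stand-ins,
// not the project's real function names.
async function runScan({ url, email }, stages) {
  const { captureLead, crawl, analyze, renderPdf } = stages;

  // 1. Store the lead first; if this throws, nothing else runs.
  await captureLead(email);

  // 2. Crawl once, collecting technical findings and screenshots together.
  const { findings, screenshots } = await crawl(url);

  // 3. Only the JSON findings are sent to the model.
  const legalOpinion = await analyze(findings);

  // 4. Combine everything into the report and hand back the PDF buffer.
  return renderPdf({ url, findings, screenshots, legalOpinion });
}
```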

Memory Management Strategy

To run Puppeteer in serverless environments (like Cloud Run or AWS Lambda) with limited RAM:

  1. Flag Optimization: Uses --disable-dev-shm-usage and --disable-gpu.
  2. Resource Cleanup: Explicit stream closing and temporary file deletion (fs.unlinkSync) immediately after response delivery.
  3. Scoped Variables: Database variables are strictly scoped to prevent memory leaks during high concurrency.
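
The first two points might look like this in practice. This is a sketch under assumptions: `launchOptions` has the shape accepted by `puppeteer.launch()` but is not wired to a browser here, and `withTempFile` is a hypothetical helper showing the delete-after-delivery pattern with `fs.unlinkSync`.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const crypto = require('node:crypto');

// Chrome flags named in the list above, in the shape puppeteer.launch() accepts.
const launchOptions = {
  headless: true,
  args: ['--disable-dev-shm-usage', '--disable-gpu'],
};

// Hypothetical helper: hands a unique temp file path to `work` and always
// deletes the file afterwards, even when `work` throws.
async function withTempFile(prefix, work) {
  const file = path.join(os.tmpdir(), `${prefix}-${crypto.randomUUID()}.pdf`);
  try {
    return await work(file);
  } finally {
    if (fs.existsSync(file)) fs.unlinkSync(file);
  }
}
```

Putting the cleanup in `finally` means a failed scan cannot leave report files behind in the container's temp directory, which matters on platforms where that directory is backed by RAM.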

🛠️ Installation & Setup

Prerequisites

  • Node.js 18+
  • Docker (optional, for containerization)
  • Google Cloud Platform Account (for Gemini API)
  • Mailchimp Account

1. Clone & Install

git clone https://github.com/DiegoRibeirodeSouza/privacy-scanner-cloud.git
cd privacy-scanner-cloud
npm install

2. Database Setup

Ensure the open-source cookie database is present in the root directory:

Note: This project relies on open-cookie-database.json for heuristic classification.
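
To give a sense of what heuristic classification against such a database can involve, here is a small lookup sketch. The entry shape (`cookie`, `category`, `wildcardMatch`) is an assumption made for this example and may not match the real open-cookie-database.json; `buildClassifier` and the sample categories are likewise illustrative.

```javascript
// Assumed, simplified entry shape: { cookie, category, wildcardMatch }.
// The real open-cookie-database.json may be structured differently.
function buildClassifier(entries) {
  const exact = new Map();
  const prefixes = [];
  for (const entry of entries) {
    if (entry.wildcardMatch) prefixes.push(entry);
    else exact.set(entry.cookie, entry.category);
  }
  // Longest prefix first, so the most specific wildcard wins.
  prefixes.sort((a, b) => b.cookie.length - a.cookie.length);

  return function classify(cookieName) {
    if (exact.has(cookieName)) return exact.get(cookieName);
    const hit = prefixes.find((p) => cookieName.startsWith(p.cookie));
    return hit ? hit.category : 'Unknown';
  };
}

// Illustrative entries only; categories here are examples, not database values.
const classify = buildClassifier([
  { cookie: '_fbp', category: 'Marketing', wildcardMatch: false },
  { cookie: '_ga', category: 'Analytics', wildcardMatch: false },
  { cookie: '_ga_', category: 'Analytics', wildcardMatch: true },
]);
```

Cookies that match nothing fall into an explicit `Unknown` bucket, so they can be surfaced in a report instead of being silently dropped.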

3. Environment Variables

Create a .env file in the root:

PORT=8080
# AI Configuration
GEMINI_API_KEY=your_google_gemini_key

# Marketing Integration
MAILCHIMP_API_KEY=your_mailchimp_key
MAILCHIMP_LIST_ID=your_list_id
# Mailchimp data center prefix (e.g., us6, us7)
MAILCHIMP_SERVER=us7

# Puppeteer (Optional for local dev)
# PUPPETEER_EXECUTABLE_PATH=/path/to/chrome
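
A service like this usually benefits from failing fast when a variable is missing. The following is a minimal sketch of such a startup check, not the project's actual code: `loadConfig` and the returned object shape are hypothetical, and the data center pattern (letters followed by digits, as in `us7`) is an assumption.

```javascript
// Variables from the .env template that the service cannot run without.
const REQUIRED_VARS = [
  'GEMINI_API_KEY',
  'MAILCHIMP_API_KEY',
  'MAILCHIMP_LIST_ID',
  'MAILCHIMP_SERVER',
];

// Hypothetical startup check: throws with a readable message instead of
// letting a scan fail halfway through.
function loadConfig(env = process.env) {
  const missing = REQUIRED_VARS.filter((name) => !(env[name] || '').trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }

  // Assumed format: letters followed by digits, e.g. "us7".
  const server = env.MAILCHIMP_SERVER.trim();
  if (!/^[a-z]+\d+$/.test(server)) {
    throw new Error(`MAILCHIMP_SERVER should look like "us7", got "${server}"`);
  }

  return {
    port: Number(env.PORT) || 8080,
    geminiApiKey: env.GEMINI_API_KEY,
    mailchimp: {
      apiKey: env.MAILCHIMP_API_KEY,
      listId: env.MAILCHIMP_LIST_ID,
      server,
    },
    chromePath: env.PUPPETEER_EXECUTABLE_PATH || undefined,
  };
}
```

The pattern check also rejects a data center value that carries stray text after the prefix, a mistake that is easy to make when a comment ends up on the same line as the value.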

4. Run Locally

npm start
# Server starts at http://localhost:8080

5. Docker Deployment (Recommended)

This project includes a production-ready Dockerfile.

# Build the image
docker build -t privai-sentinel .

# Run container
docker run -p 8080:8080 --env-file .env privai-sentinel

🔌 API Usage

Endpoint: GET /scan

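| Parameter | Type   | Description                                         |
|-----------|--------|-----------------------------------------------------|
| url       | string | The target website URL (e.g., https://example.com)  |
| email     | string | User email for the report delivery (Lead Capture)   |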

Example Request:

http://localhost:8080/scan?url=https://techcrunch.com&email=audit@test.com
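
When calling the endpoint from code, it is safer to percent-encode the target URL so that any query string it carries is not mistaken for parameters of /scan. The sketch below assumes Node.js 18+ (for the global `fetch`) and assumes the endpoint answers with the PDF bytes; `buildScanUrl` and `downloadReport` are hypothetical helper names.

```javascript
const fs = require('node:fs/promises');

// Build the /scan request URL with both parameters percent-encoded.
function buildScanUrl(baseUrl, targetUrl, email) {
  const endpoint = new URL('/scan', baseUrl);
  endpoint.searchParams.set('url', targetUrl);
  endpoint.searchParams.set('email', email);
  return endpoint.toString();
}

// Hypothetical client helper; assumes the response body is the PDF report.
async function downloadReport(baseUrl, targetUrl, email, outFile) {
  const res = await fetch(buildScanUrl(baseUrl, targetUrl, email));
  if (!res.ok) throw new Error(`Scan failed with HTTP ${res.status}`);
  await fs.writeFile(outFile, Buffer.from(await res.arrayBuffer()));
  return outFile;
}

// buildScanUrl('http://localhost:8080', 'https://techcrunch.com', 'audit@test.com')
// → 'http://localhost:8080/scan?url=https%3A%2F%2Ftechcrunch.com&email=audit%40test.com'
```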

☁️ Deployment (Google Cloud Run)

This project is optimized for Google Cloud Run (Serverless). It requires at least 2Gi of RAM to run Puppeteer and headless Chrome reliably.

To deploy your own instance using the Google Cloud CLI, run:

gcloud run deploy vgh-scanner \
  --source . \
  --region us-central1 \
  --memory 2Gi \
  --cpu 1 \
  --timeout 300 \
  --set-env-vars GEMINI_API_KEY="YOUR_GEMINI_KEY_HERE" \
  --set-env-vars MAILCHIMP_API_KEY="YOUR_MAILCHIMP_KEY_HERE" \
  --set-env-vars MAILCHIMP_LIST_ID="YOUR_LIST_ID" \
  --set-env-vars MAILCHIMP_SERVER="us7"

Note: Make sure you have the Google Cloud SDK installed and authenticated (gcloud auth login).


🀝 How to Contribute

We welcome contributions from the community! This project aims to democratize privacy auditing, and we need help in several areas.

🚩 Priority Areas for Contribution:

  1. Cookie Database Expansion:
    • We use open-cookie-database.json to identify trackers.
    • Help needed: Add more known cookies (TikTok, LinkedIn, Taboola) to the JSON file to improve detection accuracy.
  2. AI Prompt Engineering:
    • Refine the prompt in index.js to generate even more precise legal insights for GDPR/LGPD/CCPA.
  3. Frontend Improvements:
    • The current frontend is a simple HTML/CSS page. We are looking for React/Vue.js developers to build a dashboard with scan history.
  4. Performance:
    • Optimize Puppeteer memory usage to run faster on serverless environments.

🛠️ Local Development

  1. Fork the repo.
  2. Create a branch (git checkout -b feature/amazing-feature).
  3. Commit your changes.
  4. Open a Pull Request.

⚖️ Legal Disclaimer

DISCLAIMER: This software generates a technical analysis based on momentary sampling of a website's public-facing code. The "Legal Analysis" provided by the AI component is for informational purposes only and does not constitute legal advice. The generated PDF report does not guarantee immunity from regulatory fines (GDPR/CCPA/LGPD). Full compliance requires a comprehensive audit by a certified legal professional.

📄 License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.