A comprehensive email parsing and consumer intelligence system that analyzes Gmail and Outlook emails using multiple LLM providers to build detailed IAB Taxonomy consumer profiles. The system provides advanced analytics including demographic classification, interest profiling, purchase intent prediction, and household analysis.
- Overview
- Architecture
- Prerequisites
- Installation
- Configuration
- Starting the Application
- How It Works
- Usage Examples
- Development
- Troubleshooting
The OwnYou Consumer Application is a privacy-first email analysis system that:
- Downloads emails from Gmail and Outlook via OAuth2
- Processes emails using multiple LLM providers (OpenAI, Claude, Google Gemini, Ollama)
- Classifies users according to IAB Audience Taxonomy 1.1
- Builds consumer profiles with demographics, interests, purchase intent, and household data
- Provides analytics dashboard for visualizing consumer insights
- Maintains privacy through local processing and encrypted storage
- Multi-Provider Email Support: Gmail and Outlook integration with OAuth2
- Multi-LLM Processing: OpenAI GPT-5, Claude Sonnet-4, Google Gemini, Ollama (local)
- Batch Processing: Intelligent batching for 20-30x faster processing
- IAB Taxonomy Mapping: 1,600+ categories across demographics, interests, and purchase intent
- Visual Dashboard: React/Next.js frontend with real-time analytics
- LangGraph Workflow: Agentic workflow with evidence validation
- Privacy-First: No cloud storage, local SQLite persistence
┌─────────────────────────────────────────────────────────────┐
│ Frontend (Next.js) │
│ - Dashboard UI (React + Tailwind CSS) │
│ - Classification Viewer │
│ - Analytics & Visualizations (Recharts) │
│ - Real-time Updates │
└───────────────────────────┬─────────────────────────────────┘
│ HTTP/REST API
┌───────────────────────────▼─────────────────────────────────┐
│ Backend (Flask API) │
│ - Authentication & Session Management │
│ - Profile & Analytics Endpoints │
│ - Evidence Retrieval │
│ - Model Selection & Analysis Triggers │
└───────────────────────────┬─────────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────────┐
│ Email Processing Pipeline │
│ │
│ Step 1: Email Download (OAuth2) │
│ ├─ Gmail Provider │
│ └─ Outlook Provider │
│ │
│ Step 2: Email Summarization (EMAIL_MODEL) │
│ └─ Fast LLM processing to extract key information │
│ │
│ Step 3: IAB Classification (TAXONOMY_MODEL) │
│ ├─ LangGraph Agentic Workflow │
│ ├─ Batch Optimizer (10-20 emails per batch) │
│ ├─ Category-Specific Agents │
│ ├─ Evidence Judge (LLM-as-Judge validation) │
│ └─ Memory Manager (LangMem + SQLite) │
└───────────────────────────┬─────────────────────────────────┘
│
┌───────────────────────────▼─────────────────────────────────┐
│ Data Persistence │
│ - SQLite Database (LangMem storage) │
│ - User Profiles (JSON exports) │
│ - Email Summaries (CSV) │
│ - Classification History │
└─────────────────────────────────────────────────────────────┘
Three-Stage Independent Pipeline:
- Email Download → Raw emails CSV
- Email Summarization → Summaries CSV (with EMAIL_MODEL)
- IAB Classification → User profile JSON (with TAXONOMY_MODEL)
Each stage can be run independently, allowing for:
- Resilience (re-run failed steps)
- Iteration (test different models)
- Cost savings (skip expensive LLM calls)
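The stage independence can be sketched in plain Python: each stage consumes only the previous stage's artifact, so any stage can be re-run in isolation. Function names and data shapes below are illustrative, not the project's actual API.

```python
# Hypothetical sketch of the three independent pipeline stages.
# Each stage's output is persisted in the real system (CSV/JSON),
# which is what makes re-running a single failed stage possible.

def download_emails(provider: str, max_emails: int) -> list[dict]:
    # Real stage: OAuth2 + provider API -> emails_raw_<timestamp>.csv
    return [{"subject": f"Email {i}", "body": "..."} for i in range(max_emails)]

def summarize_emails(raw: list[dict]) -> list[dict]:
    # Real stage: EMAIL_MODEL call -> emails_summarized_<timestamp>.csv
    return [{**e, "summary": e["subject"].lower()} for e in raw]

def classify_emails(summaries: list[dict]) -> dict:
    # Real stage: TAXONOMY_MODEL + LangGraph -> profile_<user>_<timestamp>.json
    return {"classifications": len(summaries)}

raw = download_emails("gmail", 3)
profile = classify_emails(summarize_emails(raw))
print(profile)  # → {'classifications': 3}
```

Because each stage only needs its input artifact, you could, for example, re-run `classify_emails` with a different model against an existing summaries file without paying for download or summarization again.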
The IAB Classification stage uses intelligent batching:
- Dynamically calculates batch size based on model context window
- Processes 10-20 emails per LLM call
- 20-30x faster than single-email processing
- Evidence validation for each classification
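The batch-size calculation could look something like the following sketch. The 4-characters-per-token heuristic, the reserved-token budget, and the clamp bounds are illustrative assumptions, not the project's actual optimizer:

```python
def calculate_batch_size(emails, context_window_tokens,
                         reserved_tokens=4000, min_size=1, max_size=20):
    """Estimate how many emails fit in one LLM call.

    Illustrative sketch: assumes ~4 characters per token and reserves
    room for the prompt template and the model's response.
    """
    budget = context_window_tokens - reserved_tokens
    # Average tokens per email, estimated from character counts
    avg_tokens = max(1, sum(len(e) for e in emails) // (4 * max(1, len(emails))))
    return max(min_size, min(max_size, budget // avg_tokens))

emails = ["subject: sale\nbody: 20% off laptops"] * 100
size = calculate_batch_size(emails, context_window_tokens=128_000)
batches = [emails[i:i + size] for i in range(0, len(emails), size)]
print(size, len(batches))  # → 20 5
```

With a large context window the batch size is capped (here at 20), so 100 emails become 5 LLM calls instead of 100, which is where the 20-30x speedup comes from.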
- Python: 3.8 or higher
- Node.js: 18.x or higher
- npm: 9.x or higher
- Operating System: macOS, Linux, or Windows
- LLM Provider (choose at least one):
- OpenAI API key (recommended)
- Anthropic API key (Claude)
- Google AI API key (Gemini)
- Local Ollama (no key required)
- Email Providers (choose at least one):
- Gmail: Google Cloud project with Gmail API enabled
- Outlook: Microsoft Azure app registration
cd /path/to/your/workspace
git clone <repository-url>
cd ownyou_consumer_application

# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Optional: Install development dependencies
pip install -e ".[dev]"

# Verify installation
python -m src.email_parser.main --version

cd dashboard/frontend
# Install dependencies
npm install
# Verify installation
npm run build

Create a .env file in the project root:
cp .env.example .env  # If example exists, or create manually

Minimal .env Configuration:
# =============================================================================
# LLM Provider Configuration
# =============================================================================
# Primary provider: openai, claude, google, or ollama
LLM_PROVIDER=openai
# OpenAI Configuration (Recommended)
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_MODEL=gpt-4o-mini
OPENAI_TEMPERATURE=1.0
# Claude (Anthropic) Configuration (Optional)
ANTHROPIC_API_KEY=your_anthropic_api_key_here
ANTHROPIC_MODEL=claude-sonnet-4-20250514
# Google Gemini Configuration (Optional)
GOOGLE_API_KEY=your_google_api_key_here
# Ollama Configuration (Local, Optional)
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=deepseek-r1:70b
# =============================================================================
# Stage-Specific Model Configuration
# =============================================================================
# Format: provider:model
EMAIL_MODEL=openai:gpt-4o-mini # Fast model for email summarization
TAXONOMY_MODEL=openai:gpt-4o # Accurate model for classification
# =============================================================================
# Memory Backend Configuration
# =============================================================================
MEMORY_BACKEND=sqlite
MEMORY_DATABASE_PATH=data/email_parser_memory.db
# =============================================================================
# LangGraph Studio Configuration (Optional)
# =============================================================================
# LangSmith Project Deep-Linking
# These enable direct links to your specific LangSmith project from the dashboard
# Find these values in your LangSmith project URL:
# https://smith.langchain.com/o/{ORG_ID}/projects/p/{PROJECT_ID}
LANGSMITH_ORG_ID=your_organization_id_here
LANGSMITH_PROJECT_ID=your_project_id_here
# =============================================================================
# Email Provider Configuration
# =============================================================================
# Gmail Configuration
GMAIL_CREDENTIALS_FILE=credentials.json
GMAIL_TOKEN_FILE=token.json
# Microsoft Graph (Outlook) Configuration
MICROSOFT_CLIENT_ID=your_client_id_here
MICROSOFT_CLIENT_SECRET=your_client_secret_here
MICROSOFT_TENANT_ID=common
MICROSOFT_TOKEN_FILE=ms_token.json
# =============================================================================
# Processing Configuration
# =============================================================================
MAX_EMAILS=500
BATCH_SIZE=50
LOG_LEVEL=INFO

# Interactive setup wizard
python -m src.email_parser.main setup gmail
# Manual setup:
# 1. Create Google Cloud project
# 2. Enable Gmail API
# 3. Create OAuth 2.0 credentials
# 4. Download credentials as credentials.json
# 5. Place in project root

# Interactive setup wizard
python -m src.email_parser.main setup outlook
# Manual setup:
# 1. Register app in Azure Portal
# 2. Add Microsoft Graph Mail.Read permission
# 3. Copy Client ID and Client Secret to .env

# Check setup status
python -m src.email_parser.main setup status
# Test database connection
python -m src.email_parser.main --test-db

Use the provided start script, which handles cleanup and startup:
# From project root
./start_app.sh

# Kill backend (Flask)
lsof -ti:5001 | xargs kill -9 2>/dev/null || true
# Kill frontend (Next.js)
lsof -ti:3000 | xargs kill -9 2>/dev/null || true
# Verify ports are free
lsof -i:5001
lsof -i:3000

# Option A: Using Python module
cd /path/to/ownyou_consumer_application
python3 dashboard/backend/run.py
# Option B: Using Flask directly
cd dashboard/backend
python3 -m flask run --host=0.0.0.0 --port=5001
# Backend will start on: http://localhost:5001

Expected output:
* Serving Flask app 'app'
* Debug mode: on
INFO:werkzeug:WARNING: This is a development server.
* Running on http://0.0.0.0:5001
IMPORTANT: Before starting the frontend, verify the API URL configuration to avoid CORS issues.
# Navigate to frontend directory
cd /path/to/ownyou_consumer_application/dashboard/frontend
# Check .env.local file exists
cat .env.local

The .env.local file MUST have an empty NEXT_PUBLIC_API_URL to use the Next.js API proxy:
# Backend API URL
# IMPORTANT: Leave empty to use Next.js proxy (avoids CORS issues)
# This routes requests through /api which handles session cookies properly
NEXT_PUBLIC_API_URL=

DO NOT set it to http://localhost:5001: this bypasses the proxy and causes CORS errors.
# Start development server (already in dashboard/frontend)
npm run dev
# Frontend will start on: http://localhost:3000

Expected output:
▲ Next.js 14.2.0
- Local: http://localhost:3000
- Network: http://192.168.1.x:3000
✓ Ready in 2.3s
Open your browser and navigate to:
http://localhost:3000
Press Ctrl+C in each terminal window running backend/frontend.
# Kill all processes on ports
lsof -ti:5001 | xargs kill -9
lsof -ti:3000 | xargs kill -9
# Or kill by process name
pkill -f "flask run"
pkill -f "next dev"

Note: For development, use the Quick Start method above. Production mode is for deployment only.
Prerequisites:
# Ensure you're in the virtual environment
source venv/bin/activate # or .venv_dashboard/bin/activate
# Install production dependencies
pip install -r requirements.txt  # Includes gunicorn

Step 1: Build Frontend
cd dashboard/frontend
npm run build
# Verify build succeeded (should create .next directory)
ls -la .next/

Step 2: Start Backend (Terminal 1)
# From project root
cd /path/to/ownyou_consumer_application
# Activate virtual environment
source venv/bin/activate # or .venv_dashboard/bin/activate
# Start with gunicorn using wsgi.py entry point
gunicorn -w 4 -b 0.0.0.0:5001 wsgi:app
# Backend will run on http://localhost:5001
# Press Ctrl+C to stop

Step 3: Start Frontend (Terminal 2)
cd /path/to/ownyou_consumer_application/dashboard/frontend
# Start production frontend
npm start
# Frontend will run on http://localhost:3000
# Press Ctrl+C to stop

Production Notes:
- The wsgi.py file in the project root is the production entry point
- For background processes, use a process manager:
  - PM2 (Node.js): pm2 start npm --name "frontend" -- start
  - systemd (Linux): create service files for both backend and frontend
  - supervisor: alternative process manager
- Set FLASK_ENV=production in .env for production mode
- Use nginx or Apache as a reverse proxy in front of gunicorn
- Set up proper logging and monitoring
Alternative: Background Processes
# Start backend in background
nohup gunicorn -w 4 -b 0.0.0.0:5001 wsgi:app > backend.log 2>&1 &
# Start frontend in background
cd dashboard/frontend
nohup npm start > ../../frontend.log 2>&1 &
# View logs
tail -f backend.log
tail -f frontend.log
# Stop processes
pkill -f gunicorn
pkill -f "next start"

- User Authentication: OAuth2 flow for Gmail/Outlook
- Email Download: Fetch emails via provider APIs
- Email Summarization: Extract key information with LLM
- IAB Classification: Multi-agent workflow classifies emails
- Profile Building: Aggregate classifications into user profile
- Dashboard Display: Visualize insights in real-time
# Command
python -m src.email_parser.main --provider gmail --max-emails 100
# What happens:
# - OAuth2 authentication
# - API calls to Gmail/Outlook
# - Download emails (subject, body, metadata)
# - Save to CSV: data/emails_raw_<timestamp>.csv

# Triggered automatically or manually
python -m src.email_parser.main --summarize emails_raw.csv
# What happens:
# - Load raw emails
# - Call EMAIL_MODEL (fast, cheap model)
# - Extract: sender, category, key topics, intent
# - Save to CSV: data/emails_summarized_<timestamp>.csv

# Start classification
python -m src.email_parser.main --classify emails_summarized.csv
# What happens:
# 1. Load summarized emails
# 2. Retrieve existing user profile from LangMem
# 3. Batch optimizer groups emails (10-20 per batch)
# 4. For each batch:
# a. Demographics agent analyzes (age, gender, education)
# b. Household agent analyzes (size, income, location)
# c. Interests agent analyzes (hobbies, preferences)
# d. Purchase intent agent analyzes (shopping behavior)
# 5. Evidence judge validates each classification
# 6. Update LangMem semantic memory
# 7. Save profile JSON: data/profile_<user>_<timestamp>.json

Batch Processing Example:
Input: 100 emails
Context Window: 128,000 tokens
Batch Size: 15 emails
Process:
├─ Batch 1 (emails 1-15) → 42 classifications
├─ Batch 2 (emails 16-30) → 38 classifications
├─ Batch 3 (emails 31-45) → 51 classifications
└─ ... (7 batches total)
Result: Profile with 287 validated classifications
Time: ~6 minutes (vs 3 hours single-email)
{
"schema_version": "2.0",
"user_id": "nick",
"generated_at": "2025-01-28T12:34:56Z",
"demographics": {
"age": {
"primary": {
"taxonomy_id": 12,
"value": "35-44",
"confidence": 0.92,
"evidence_count": 15
}
},
"gender": {
"primary": {
"taxonomy_id": 59,
"value": "Male",
"confidence": 0.88,
"evidence_count": 23
}
}
},
"interests": [
{
"taxonomy_id": 342,
"category": "Technology & Computing",
"subcategory": "Software Development",
"confidence": 0.95,
"evidence_count": 47,
"evidence": [
"GitHub notifications about pull requests",
"Stack Overflow digest emails"
]
}
],
"purchase_intent": [
{
"taxonomy_id": 1234,
"category": "Consumer Electronics",
"subcategory": "Laptops",
"confidence": 0.78,
"evidence_count": 5,
"purchase_intent_flag": true
}
]
}

The classification uses a sophisticated LangGraph workflow:
┌──────────────┐
│ Load Emails │
└──────┬───────┘
│
┌──────▼────────┐
│ Retrieve │
│ Profile │
└──────┬────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│Demo │ │House │ │Interest │
│Agent │ │Agent │ │Agent │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└───────────────┼───────────────┘
│
┌──────▼────────┐
│ Evidence │
│ Judge │
└──────┬────────┘
│
┌──────▼────────┐
│ Reconcile │
│ Results │
└──────┬────────┘
│
┌──────▼────────┐
│ Update │
│ Memory │
└───────────────┘
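The fan-out/judge shape above can be approximated in plain Python. This is a hedged sketch: the real implementation runs these as LangGraph nodes backed by LLM calls, while here the agents are stubs and the judge is a simple threshold filter with made-up cutoffs.

```python
# Plain-Python approximation of the workflow graph: category agents
# fan out over a batch, then an evidence judge validates candidates
# before they are reconciled into the profile.

def demographics_agent(batch):
    return [{"category": "age 35-44", "confidence": 0.9, "evidence_count": 3}]

def household_agent(batch):
    return [{"category": "household of 2", "confidence": 0.4, "evidence_count": 1}]

def interests_agent(batch):
    return [{"category": "Software Development", "confidence": 0.95, "evidence_count": 7}]

def evidence_judge(candidates, min_confidence=0.5, min_evidence=2):
    # LLM-as-Judge in the real system; a threshold filter here
    return [c for c in candidates
            if c["confidence"] >= min_confidence
            and c["evidence_count"] >= min_evidence]

def classify_batch(batch):
    candidates = []
    for agent in (demographics_agent, household_agent, interests_agent):
        candidates.extend(agent(batch))   # fan-out to category agents
    return evidence_judge(candidates)     # validate before reconciling

profile = classify_batch(["email 1", "email 2"])
print([c["category"] for c in profile])  # → ['age 35-44', 'Software Development']
```

Note how the weakly evidenced household candidate (confidence 0.4, one piece of evidence) is dropped by the judge; only validated classifications reach the Reconcile/Update Memory steps.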
Visual Debugging with LangGraph Studio:
# Start LangGraph Studio
langgraph dev
# Open in browser
http://127.0.0.1:2024
# Features:
# - Visual workflow graph
# - State inspection at each node
# - Time-travel debugging
# - Replay past executions

# Download and analyze 50 emails from Gmail
python -m src.email_parser.main --pull 50 --model openai

This single command:
- Downloads 50 emails
- Summarizes them
- Classifies into IAB taxonomy
- Saves profile to data/
# Analyze emails from both Gmail and Outlook
python -m src.email_parser.main --provider gmail outlook --max-emails 100

# Step 1: Download only
python -m src.email_parser.main --provider gmail --max-emails 200 --download-only
# Step 2: Summarize
python -m src.email_parser.main --summarize data/emails_raw_20250128.csv
# Step 3: Classify (use different model)
python -m src.email_parser.main --classify data/emails_summarized_20250128.csv --model claude

- Start backend and frontend (see Starting the Application)
- Navigate to http://localhost:3000
- Click "New Analysis"
- Select provider (Gmail/Outlook)
- Choose models:
- Email Model: Fast/cheap (gpt-4o-mini)
- Taxonomy Model: Accurate (gpt-4o or claude-sonnet-4)
- Set email count (50-500)
- Click "Start Analysis"
- Monitor progress in real-time
- View results in Classifications tab
# OpenAI (Recommended - fastest, cost-effective)
python -m src.email_parser.main --pull 100 --model openai
# Claude (Best quality, more expensive)
python -m src.email_parser.main --pull 100 --model claude
# Google Gemini (Good balance)
python -m src.email_parser.main --pull 100 --model google
# Ollama (Local, free, slower)
python -m src.email_parser.main --pull 100 --model ollama

# Use cheap model for summarization, premium for classification
EMAIL_MODEL=openai:gpt-4o-mini \
TAXONOMY_MODEL=claude:claude-sonnet-4 \
python -m src.email_parser.main --pull 100

ownyou_consumer_application/
├── src/
│ └── email_parser/
│ ├── main.py # CLI entry point
│ ├── providers/ # Email providers
│ │ ├── gmail_provider.py
│ │ └── outlook_provider.py
│ ├── llm_clients/ # LLM integrations
│ │ ├── openai_client.py
│ │ ├── claude_client.py
│ │ └── google_client.py
│ ├── workflow/ # LangGraph workflow
│ │ ├── graph.py # Workflow definition
│ │ ├── nodes/ # Workflow nodes
│ │ │ ├── analyzers.py # Agent nodes
│ │ │ ├── reconcile.py # Reconciliation
│ │ │ └── update_memory.py # Memory updates
│ │ ├── batch_optimizer.py # Batching logic
│ │ └── state.py # Workflow state
│ ├── memory/ # LangMem integration
│ │ └── manager.py
│ ├── analysis/ # Legacy analyzers
│ ├── models/ # Pydantic models
│ └── utils/ # Utilities
├── dashboard/
│ ├── backend/ # Flask API
│ │ ├── app.py # Flask app
│ │ ├── api/ # API endpoints
│ │ │ ├── analyze.py # Analysis triggers
│ │ │ ├── profile.py # Profile retrieval
│ │ │ └── evidence.py # Evidence endpoints
│ │ └── db/
│ │ └── queries.py # Database queries
│ └── frontend/ # Next.js app
│ ├── app/ # App router
│ ├── components/ # React components
│ └── lib/ # Utilities
├── data/ # Data directory
│ ├── email_parser_memory.db # SQLite database
│ └── profile_*.json # Profile exports
├── logs/ # Log files
├── tests/ # Test suite
├── .env # Configuration
├── requirements.txt # Python deps
└── README.md # This file
# Run all tests
pytest
# Run specific test suite
pytest tests/unit/
pytest tests/integration/
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test
pytest tests/unit/test_batch_optimizer.py::test_calculate_batch_size

# Format code
black src/ tests/
# Lint
flake8 src/
# Type checking
mypy src/
# Sort imports
isort src/ tests/

Enable Debug Logging:
# In .env
LOG_LEVEL=DEBUG
# Or via command line
python -m src.email_parser.main --pull 50 --debug

LangGraph Studio Debugging:
LangGraph Studio provides visual workflow debugging and real-time state inspection.
Option 1: Auto-Start via Dashboard (Recommended)
The dashboard can automatically start Studio when you enable visualization:
- Navigate to http://localhost:3000/analyze
- Check "Enable LangGraph Studio visualization" checkbox
- Studio server automatically starts on port 2024
- During analysis, click "View workflow in LangGraph Studio →"
- Studio UI opens with direct link to your project
Option 2: Manual Start via CLI
# Start Studio manually
langgraph dev
# Set debug mode (optional)
export LANGGRAPH_STUDIO_DEBUG=true
# Run workflow
python -m src.email_parser.main --pull 10
# View in Studio at http://127.0.0.1:2024

Features:
- Visual workflow graph with node inspection
- Time-travel debugging (replay past executions)
- State inspection at each workflow step
- Real-time execution monitoring
- Evidence trail visualization
# Open SQLite database
sqlite3 data/email_parser_memory.db
# View tables
.tables
# Query memories
SELECT * FROM memories WHERE namespace LIKE '%nick%' LIMIT 10;
# Count classifications
SELECT COUNT(*) FROM memories WHERE key LIKE 'semantic_%';

Problem: Browser console shows CORS errors like:
Access to fetch at 'http://localhost:5001/api/...' from origin 'http://localhost:3000'
has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present
Root Cause: The frontend is making requests directly to Flask instead of using the Next.js API proxy.
Solution:
- Check the frontend environment configuration:

  cd dashboard/frontend
  cat .env.local

- Ensure NEXT_PUBLIC_API_URL is empty:

  # Correct configuration (empty = use Next.js proxy)
  NEXT_PUBLIC_API_URL=

  # WRONG - causes CORS errors
  # NEXT_PUBLIC_API_URL=http://localhost:5001

- Restart the frontend to pick up changes:

  # Kill frontend
  lsof -ti:3000 | xargs kill -9

  # Restart
  cd dashboard/frontend
  npm run dev
- Verify the fix:
  - Open browser DevTools (F12)
  - Go to the Network tab
  - Refresh the page
  - API requests should go to /api/... (not http://localhost:5001/api/...)
Why This Works:
- An empty NEXT_PUBLIC_API_URL makes requests use relative paths (/api/...)
- The Next.js API proxy (app/api/[...path]/route.ts) forwards requests to Flask
- The proxy handles CORS headers and session cookies automatically
- No cross-origin requests means no CORS issues
Prevention: Never set NEXT_PUBLIC_API_URL=http://localhost:5001 in development.
Problem: Address already in use error
Solution:
# Kill processes on ports
lsof -ti:5001 | xargs kill -9 # Backend
lsof -ti:3000 | xargs kill -9 # Frontend
# Verify ports are free
lsof -i:5001
lsof -i:3000

Problem: Gmail/Outlook authentication fails
Solution:
# Re-run setup wizard
python -m src.email_parser.main setup gmail
# Delete cached tokens
rm token.json # Gmail
rm ms_token.json # Outlook
# Check credentials file exists
ls -la credentials.json

Problem: database is locked
Solution:
# Stop all processes
pkill -f "python.*email_parser"
# Check for locks
fuser data/email_parser_memory.db
# Remove lock file if exists
rm data/email_parser_memory.db-shm
rm data/email_parser_memory.db-wal

Problem: Rate limit exceeded or Invalid API key
Solution:
# Check API key in .env
cat .env | grep API_KEY
# Test API connection
python -c "
from openai import OpenAI
client = OpenAI()
print(client.models.list())
"
# Use different provider
python -m src.email_parser.main --pull 50 --model claude

Problem: npm run build fails
Solution:
# Clear cache
rm -rf dashboard/frontend/.next
rm -rf dashboard/frontend/node_modules
# Reinstall
cd dashboard/frontend
npm install
# Rebuild
npm run build

Problem: Email download returns 0 emails
Solution:
# Check authentication
python -m src.email_parser.main setup status
# Test provider connection
python -m src.email_parser.main --provider gmail --max-emails 1 --debug
# Verify email access
# - Gmail: Check Gmail API is enabled in Google Cloud Console
# - Outlook: Check Mail.Read permission in Azure Portal

Problem: High memory usage or slow processing
Solution:
# Reduce batch size
export BATCH_SIZE=25
# Use faster model for EMAIL_MODEL
export EMAIL_MODEL=openai:gpt-4o-mini
# Process fewer emails
python -m src.email_parser.main --pull 50 --model openai
# Clear old data
rm data/emails_*.csv
rm data/profile_*.json

- CLAUDE.md: Development guidelines and architecture details
- docs/: Comprehensive documentation
  - requirements/: Feature specifications
  - reference/: Technical references
  - STUDIO_QUICKSTART.md: LangGraph Studio guide
- tests/: Test suite with examples
- _archive/: Historical documentation
For issues, questions, or contributions:
- Check Troubleshooting section
- Review existing issues in repository
- Create new issue with:
- Error message
- Steps to reproduce
- Environment details (OS, Python version)
- Relevant logs from the logs/ directory
[Specify your license here]
- IAB Tech Lab for Audience Taxonomy 1.1
- LangChain/LangGraph for workflow orchestration
- OpenAI, Anthropic, Google for LLM APIs
- Email provider APIs (Gmail, Microsoft Graph)