Email Threat Classifier

A machine learning-powered system that classifies emails as NORMAL, SPAM, or FRAUD using a fine-tuned DistilBERT model and Gmail API integration.

Features

  • Fine-tuned DistilBERT transformer model for email classification
  • Real-time Gmail API integration
  • Interactive Streamlit dashboard
  • REST API with FastAPI
  • Three-class classification: NORMAL, SPAM, FRAUD

Prerequisites

  • Python 3.8+
  • Gmail API credentials (OAuth 2.0)
  • Weights & Biases account (optional, for training)

Installation

  1. Clone the repository and navigate to the project directory:
cd UDBHAV
  2. Install uv if not already installed:
curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Install dependencies using uv:
uv sync
  4. Set up environment variables by creating a .env file:
GOOGLE_CLIENT_ID=your_client_id
GOOGLE_CLIENT_SECRET=your_client_secret
GOOGLE_REFRESH_TOKEN=your_refresh_token
EMAIL_ADDRESS=your_email@gmail.com
MODEL_PATH=./email_model
WANDB_API_KEY=your_wandb_key
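
Before starting the app, it can help to confirm that the .env file contains every key listed above. The following is a minimal sketch using only the standard library. The helper names `parse_env_text` and `missing_keys` are hypothetical and not part of this project, and `WANDB_API_KEY` is treated as optional because Weights & Biases is listed as optional under Prerequisites.

```python
import os

# Keys copied from the .env example above. WANDB_API_KEY is left out on the
# assumption that it is only needed for training with Weights & Biases.
REQUIRED_KEYS = [
    "GOOGLE_CLIENT_ID",
    "GOOGLE_CLIENT_SECRET",
    "GOOGLE_REFRESH_TOKEN",
    "EMAIL_ADDRESS",
    "MODEL_PATH",
]


def parse_env_text(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = ".env"
    if os.path.exists(env_path):
        with open(env_path, encoding="utf-8") as handle:
            missing = missing_keys(parse_env_text(handle.read()))
        print("Missing keys:", missing or "none")
    else:
        print("No .env file found in the current directory")
```

Save the sketch under any file name in the project directory and run it with `uv run` to list the keys that still need values.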

Setup

1. Train the Model

Train the email classifier on your dataset:

uv run train.py

This will:

  • Load and preprocess final_dataset.csv
  • Fine-tune DistilBERT on email data
  • Save the trained model to ./email_model/
  • Generate evaluation metrics
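
This README does not say which evaluation metrics train.py reports. As an illustration only, the sketch below computes a support-weighted F1 score in plain Python, assuming "weighted" means each class's F1 is averaged in proportion to its number of true examples (the same convention as scikit-learn's `average="weighted"`). The function name `weighted_f1` is hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its true-label count."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label in sorted(support):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        score += f1 * support[label] / total
    return score


if __name__ == "__main__":
    # Labels follow the mapping in this README: NORMAL=0, SPAM=1, FRAUD=2.
    true_labels = [0, 0, 1, 2]
    predicted = [0, 0, 1, 1]
    print(f"weighted F1: {weighted_f1(true_labels, predicted):.3f}")
```

Weighting by support keeps a rare class such as FRAUD from dominating the average, but it also means poor performance on that class can be hidden, so per-class scores are worth checking as well.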

2. Run the Streamlit Dashboard

Launch the interactive web interface:

uv run streamlit run app.py

Access the dashboard at http://localhost:8501

3. Run the FastAPI Server (Optional)

Start the REST API server:

uv run uvicorn server:app --reload

API endpoints:

  • POST /predict_email - Classify a single email
  • GET /scan_gmail - Fetch and classify recent Gmail messages

Project Structure

.
├── app.py              # Streamlit dashboard
├── server.py           # FastAPI REST API
├── train.py            # Model training script
├── utils.py            # Helper functions
├── final_dataset.csv   # Training dataset
├── email_model/        # Trained model directory
└── .env                # Environment variables

Usage

Dashboard

  1. Click "Fetch & Classify Last 10 Emails"
  2. View classified emails in categorized tabs
  3. Review confidence scores and labels
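
The grouping behind the categorized tabs can be pictured with a short sketch. It is not taken from app.py: the result shape (a list of dicts with `subject`, `label`, and `confidence` keys) and the function name `group_by_label` are assumptions made for illustration.

```python
def group_by_label(results):
    """Bucket classified emails by label, most confident first."""
    groups = {"NORMAL": [], "SPAM": [], "FRAUD": []}
    for item in results:
        # setdefault keeps the sketch from failing on an unexpected label.
        groups.setdefault(item["label"], []).append(item)
    for items in groups.values():
        items.sort(key=lambda item: item["confidence"], reverse=True)
    return groups


if __name__ == "__main__":
    # Hypothetical classifier output, not real data.
    sample = [
        {"subject": "Team lunch", "label": "NORMAL", "confidence": 0.98},
        {"subject": "You won!", "label": "FRAUD", "confidence": 0.91},
        {"subject": "Verify your account", "label": "FRAUD", "confidence": 0.97},
    ]
    for label, items in group_by_label(sample).items():
        print(label, [item["subject"] for item in items])
```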

API

curl -X POST "http://localhost:8000/predict_email" \
  -H "Content-Type: application/json" \
  -d '{"text": "Congratulations! You won $1000000"}'
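
The same request can be sent from Python. This is a minimal standard-library sketch that mirrors the curl command above. The function names are hypothetical, and the response is returned as parsed JSON without assuming any particular field names, since the response schema is not documented here.

```python
import json
import urllib.request


def build_predict_request(text, base_url="http://localhost:8000"):
    """Build the POST /predict_email request without sending it."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict_email",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text, base_url="http://localhost:8000", timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires the FastAPI server from the Setup section to be running.
    print(classify("Congratulations! You won $1000000"))
```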

Model Details

  • Architecture: DistilBERT (distilbert-base-uncased)
  • Classes: NORMAL (0), SPAM (1), FRAUD (2)
  • Max Token Length: 256
  • Training: 3 epochs, evaluated with weighted metrics
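
To show how the class indices above relate to a label and a confidence score, here is a small sketch that applies a softmax to three raw model outputs (logits) and picks the most likely class. It assumes the confidence shown to the user is the softmax probability of the predicted class, which this README does not confirm, and the function names are hypothetical.

```python
import math

# Index-to-label mapping as listed under Model Details.
ID_TO_LABEL = {0: "NORMAL", 1: "SPAM", 2: "FRAUD"}


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decode_prediction(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID_TO_LABEL[best], probs[best]


if __name__ == "__main__":
    # Made-up logits for illustration.
    label, confidence = decode_prediction([0.1, 0.2, 3.0])
    print(f"{label} ({confidence:.2%})")
```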

Gmail API Setup

  1. Create a project in Google Cloud Console
  2. Enable the Gmail API
  3. Create OAuth 2.0 credentials
  4. Generate a refresh token using the OAuth 2.0 Playground
  5. Add the credentials to the .env file
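
To check that the credentials work, the refresh token can be exchanged for a short-lived access token at Google's OAuth 2.0 token endpoint. The sketch below uses only the standard library and is not necessarily how this project authenticates. The function names are hypothetical, and the environment variable names are taken from the .env example in the Installation section.

```python
import json
import os
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"


def build_token_request(client_id, client_secret, refresh_token):
    """Build the form-encoded refresh-token request without sending it."""
    body = urllib.parse.urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_access_token(client_id, client_secret, refresh_token, timeout=10):
    """Send the request and return the access token string."""
    request = build_token_request(client_id, client_secret, refresh_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["access_token"]


if __name__ == "__main__":
    # Assumes the .env values have been exported into the environment.
    token = fetch_access_token(
        os.environ["GOOGLE_CLIENT_ID"],
        os.environ["GOOGLE_CLIENT_SECRET"],
        os.environ["GOOGLE_REFRESH_TOKEN"],
    )
    print("Access token received, length:", len(token))
```

An HTTP 400 or 401 error from this call usually points to a mistyped client ID or secret, or to a refresh token that has expired or been revoked.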