A machine learning-powered email classification system that identifies NORMAL, SPAM, and FRAUD emails using a fine-tuned DistilBERT model and Gmail API integration.
- Fine-tuned DistilBERT transformer model for email classification
- Real-time Gmail API integration
- Interactive Streamlit dashboard
- REST API with FastAPI
- Three-class classification: NORMAL, SPAM, FRAUD
- Python 3.8+
- Gmail API credentials (OAuth 2.0)
- Weights & Biases account (optional, for training)
- Clone the repository and navigate to the project directory:
cd UDBHAV
- Install uv if not already installed:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Install dependencies using uv:
uv sync
- Set up environment variables by creating a .env file:
GOOGLE_CLIENT_ID=your_client_id
GOOGLE_CLIENT_SECRET=your_client_secret
GOOGLE_REFRESH_TOKEN=your_refresh_token
EMAIL_ADDRESS=your_email@gmail.com
MODEL_PATH=./email_model
WANDB_API_KEY=your_wandb_key
Train the email classifier on your dataset:
uv run train.py
This will:
- Load and preprocess final_dataset.csv
- Fine-tune DistilBERT on email data
- Save the trained model to ./email_model/
- Generate evaluation metrics
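The load-and-preprocess step might look roughly like the sketch below. It is not taken from train.py: the column names `text` and `label`, the helper `load_examples`, and the rule of skipping rows with empty text or unknown labels are all assumptions made for illustration. Only the label ids come from this README.

```python
import csv
import io

# Label ids as listed in this README: NORMAL (0), SPAM (1), FRAUD (2).
LABEL2ID = {"NORMAL": 0, "SPAM": 1, "FRAUD": 2}


def load_examples(handle, text_col="text", label_col="label"):
    """Read (text, label_id) pairs, skipping rows with empty text or unknown labels.

    The column names are assumptions; adjust them to match final_dataset.csv.
    """
    examples = []
    for row in csv.DictReader(handle):
        text = (row.get(text_col) or "").strip()
        label = (row.get(label_col) or "").strip().upper()
        if text and label in LABEL2ID:
            examples.append((text, LABEL2ID[label]))
    return examples


# Tiny in-memory stand-in for final_dataset.csv.
sample = io.StringIO(
    "text,label\n"
    "Meeting moved to 3pm,normal\n"
    "Cheap pills online,SPAM\n"
    ",FRAUD\n"
    "Send your bank PIN to claim the prize,fraud\n"
)
examples = load_examples(sample)
```

With a real file, pass `open("final_dataset.csv", newline="", encoding="utf-8")` in place of the in-memory sample.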
Launch the interactive web interface:
uv run streamlit run app.py
Access the dashboard at http://localhost:8501
Start the REST API server:
uv run uvicorn server:app --reload
API endpoints:
- POST /predict_email: Classify a single email
- GET /scan_gmail: Fetch and classify recent Gmail messages
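The POST /predict_email endpoint can also be called from Python. The sketch below uses only the standard library and assumes the server runs on uvicorn's default port 8000 and accepts a JSON body with a `text` field. The helper names `build_predict_request` and `predict_email` are hypothetical, and the response is returned as parsed JSON without assuming its shape.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict_email"  # uvicorn's default host and port


def build_predict_request(text, url=API_URL):
    """Build (but do not send) a JSON POST request for a single email body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_email(text, url=API_URL, timeout=10):
    """Send the request and return the decoded JSON response, whatever its shape."""
    request = build_predict_request(text, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no running server; predict_email() does.
request = build_predict_request("Congratulations! You won $1000000")
```

Calling `predict_email("...")` requires the API server to be running locally.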
.
├── app.py # Streamlit dashboard
├── server.py # FastAPI REST API
├── train.py # Model training script
├── utils.py # Helper functions
├── final_dataset.csv # Training dataset
├── email_model/ # Trained model directory
└── .env # Environment variables
- Click "Fetch & Classify Last 10 Emails"
- View classified emails in categorized tabs
- Review confidence scores and labels
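One common way to derive a label and confidence score from a three-class classifier is to apply softmax to the model's output logits and report the top probability. The sketch below assumes that interpretation; the dashboard may compute confidence differently, and the function names and sample logits are illustrative only.

```python
import math

# Class ids as listed in this README.
ID2LABEL = {0: "NORMAL", 1: "SPAM", 2: "FRAUD"}


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def decode(logits):
    """Return (label, confidence) for the highest-probability class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits for one email, in class-id order.
label, confidence = decode([0.2, 1.1, 3.4])
```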
curl -X POST "http://localhost:8000/predict_email" \
-H "Content-Type: application/json" \
-d '{"text": "Congratulations! You won $1000000"}'
- Architecture: DistilBERT (distilbert-base-uncased)
- Classes: NORMAL (0), SPAM (1), FRAUD (2)
- Max Token Length: 256
- Training: 3 epochs with weighted metrics
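"Weighted metrics" most likely means per-class scores averaged by class support, as in scikit-learn's `average="weighted"`. That reading is an assumption, not something train.py is confirmed to do. Under it, a weighted F1 score can be sketched in plain Python as follows; the sample labels are invented.

```python
from collections import Counter


def weighted_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Each class contributes in proportion to how often it truly occurs.
        score += f1 * support[label] / total
    return score


# Invented labels: 0 = NORMAL, 1 = SPAM, 2 = FRAUD.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
score = weighted_f1(y_true, y_pred)
```

Weighting by support keeps a rare class such as FRAUD from dominating the average, but it can also hide poor performance on that class, so per-class scores are worth checking alongside it.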
- Create a project in Google Cloud Console
- Enable Gmail API
- Create OAuth 2.0 credentials
- Generate refresh token using OAuth 2.0 Playground
- Add credentials to the .env file
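Once the credentials are in place, a refresh token is normally exchanged for a short-lived access token at Google's OAuth 2.0 token endpoint. The sketch below only builds that request using the standard library. It assumes the .env values have already been loaded into the process environment, and the helpers `read_credentials` and `build_refresh_request` are hypothetical, not part of this project.

```python
import os
import urllib.parse
import urllib.request

TOKEN_URL = "https://oauth2.googleapis.com/token"  # Google's OAuth 2.0 token endpoint
REQUIRED = ("GOOGLE_CLIENT_ID", "GOOGLE_CLIENT_SECRET", "GOOGLE_REFRESH_TOKEN")


def read_credentials(env=os.environ):
    """Collect the OAuth values, failing early if any are missing or empty."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing .env values: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}


def build_refresh_request(creds):
    """Build (but do not send) the form-encoded refresh-token request."""
    form = urllib.parse.urlencode(
        {
            "client_id": creds["GOOGLE_CLIENT_ID"],
            "client_secret": creds["GOOGLE_CLIENT_SECRET"],
            "refresh_token": creds["GOOGLE_REFRESH_TOKEN"],
            "grant_type": "refresh_token",
        }
    ).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=form, method="POST")


# Placeholder values matching the .env template above.
demo_env = {
    "GOOGLE_CLIENT_ID": "your_client_id",
    "GOOGLE_CLIENT_SECRET": "your_client_secret",
    "GOOGLE_REFRESH_TOKEN": "your_refresh_token",
}
request = build_refresh_request(read_credentials(demo_env))
```

Checking for missing values up front gives a clearer error than a rejected token request would.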