Legal Document Classifier Using Legal-BERT
An end-to-end AI system for real-time multi-class legal document classification using Legal-BERT, FastAPI, and Streamlit.
Overview
This project implements a production-grade legal document classification system powered by Legal-BERT. It classifies long legal documents into five legal categories:
- Contract
- Judgment
- Petition
- FIR / Criminal
- Legal Notice
The system supports:
- Long-document handling using sliding-window chunking
- A GPU-trained Legal-BERT model
- Document-level probability aggregation
- A FastAPI backend for real-time inference
- A Streamlit web app frontend for interactive use
Key Features
- Transformer-based Legal NLP using nlpaueb/legal-bert-base-uncased
- Automatic long-document chunking
- Exact chunk → document aggregation using doc_id mapping
- Multi-class classifier with 5 legal categories
- GPU-accelerated training via Google Colab
- FastAPI REST API for inference
- Streamlit-based web interface
- Professional evaluation with confusion matrix & macro-F1
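The chunk-to-document aggregation can be sketched roughly as follows. This is a minimal illustration, not the project's actual code in src/predict.py: the function names, the label order, and the sample probabilities are assumptions made for the example. Each chunk carries the doc_id of the document it came from, and the per-chunk class probabilities are averaged per document.

```python
from collections import defaultdict

# Assumed label order; the real order depends on how the model was trained.
LABELS = ["Contract", "Judgment", "Petition", "FIR / Criminal", "Legal Notice"]


def aggregate_by_doc(doc_ids, chunk_probs):
    """Average per-chunk class probabilities into one vector per document."""
    grouped = defaultdict(list)
    for doc_id, probs in zip(doc_ids, chunk_probs):
        grouped[doc_id].append(probs)

    doc_probs = {}
    for doc_id, rows in grouped.items():
        n = len(rows)
        # Column-wise mean over all chunks that belong to this document.
        doc_probs[doc_id] = [sum(col) / n for col in zip(*rows)]
    return doc_probs


def predict_labels(doc_probs, labels):
    """Pick the highest-probability class for each document."""
    return {
        doc_id: labels[max(range(len(probs)), key=probs.__getitem__)]
        for doc_id, probs in doc_probs.items()
    }


# Hypothetical chunk-level outputs: document 0 has two chunks, document 1 has one.
doc_ids = [0, 0, 1]
chunk_probs = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.50, 0.30, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.80, 0.05, 0.05],
]

doc_probs = aggregate_by_doc(doc_ids, chunk_probs)
predictions = predict_labels(doc_probs, LABELS)
```

Averaging probabilities (rather than taking a majority vote over chunk labels) lets confident chunks outweigh ambiguous ones, which is one plausible reason for the choice.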
Model Architecture
Base Model: Legal-BERT (nlpaueb/legal-bert-base-uncased)
Input Handling: Sliding window chunking (512 tokens, stride=128)
Aggregation Strategy: Probability averaging across chunks
Loss Function: Cross-Entropy (multi-class classification)
Training: Fine-tuned on domain-specific legal text
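As a rough sketch of the sliding-window input handling, the function below splits a token sequence into windows of at most 512 tokens. It assumes that stride=128 means consecutive windows overlap by 128 tokens, which is how the `stride` argument of Hugging Face tokenizers behaves when overflowing tokens are returned; the function name is hypothetical, and special tokens such as [CLS] and [SEP] are ignored for simplicity.

```python
def sliding_window_chunks(token_ids, max_len=512, overlap=128):
    """Split a token sequence into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens, so each window starts
    max_len - overlap tokens after the previous one.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if not token_ids:
        return []

    step = max_len - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += step
    return chunks


# Hypothetical document of 1000 token IDs.
tokens = list(range(1000))
chunks = sliding_window_chunks(tokens)
chunk_lengths = [len(c) for c in chunks]
```

In the real pipeline each chunk would also be tagged with the doc_id of its source document so that predictions can be aggregated later.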
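Final Evaluation Results (Document-Level)
Accuracy: 96.18%
Macro F1-Score: 0.7619
Weighted F1-Score: 0.9618
The macro F1-score averages over all five classes, including Legal Notice, which has no documents in the test set and therefore contributes an F1 of 0. Every class with test documents scores an F1 above 0.90.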
Per-Class Performance
Class            Precision  Recall   F1-Score  Support
Contract         0.9355     0.8788   0.9062    33
Judgment         1.0000     1.0000   1.0000    23
Petition         1.0000     1.0000   1.0000    71
FIR / Criminal   0.8750     0.9333   0.9032    30
Legal Notice     -          -        -         0
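Confusion Matrix (Document-Level)
Rows are true classes and columns are predicted classes, in the order Contract, Judgment, Petition, FIR / Criminal, Legal Notice.
[[29  0  0  4  0]
 [ 0 23  0  0  0]
 [ 0  0 71  0  0]
 [ 2  0  0 28  0]
 [ 0  0  0  0  0]]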
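The headline metrics can be recomputed directly from the confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes, and that a class with no test documents is scored as F1 = 0 (the convention scikit-learn falls back to by default). It is a worked check, not the project's evaluate_doc_level.py.

```python
# Confusion matrix as reported; assumed order:
# Contract, Judgment, Petition, FIR / Criminal, Legal Notice.
CM = [
    [29, 0, 0, 4, 0],
    [0, 23, 0, 0, 0],
    [0, 0, 71, 0, 0],
    [2, 0, 0, 28, 0],
    [0, 0, 0, 0, 0],
]


def per_class_f1(cm):
    """Return the F1 score of each class, using 0.0 where it is undefined."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        actual = sum(cm[k])                          # row sum (support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


f1 = per_class_f1(CM)
support = [sum(row) for row in CM]
total = sum(support)

accuracy = sum(CM[k][k] for k in range(len(CM))) / total
macro_f1 = sum(f1) / len(f1)  # unweighted mean over all five classes
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total

print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
# prints: accuracy=0.9618 macro_f1=0.7619 weighted_f1=0.9618
```

Under these assumptions the sketch reproduces the reported figures, and it shows why macro F1 sits well below accuracy: the empty Legal Notice class adds a zero to the five-way average.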
Project Structure
legal-document-classifier/
├── app.py                              # Streamlit Web UI
├── src/
│   ├── api.py                          # FastAPI server
│   ├── predict.py                      # Inference engine
│   ├── train.py                        # Transformer training script
│   ├── evaluate_doc_level.py           # Document-level evaluation
│   ├── tokenize_dataset.py
│   └── tokenize_dataset_with_docid.py
├── models/
│   └── legal-bert-final/               # Trained Legal-BERT model
├── data/
│   ├── train_chunks/
│   ├── val_chunks/
│   └── test_chunks/
├── requirements.txt
└── README.md
Installation & Setup
1. Clone the Repository
git clone <repository-url>
cd legal-document-classifier
2. Create Virtual Environment
python -m venv env
env\Scripts\activate
The activation command above is for Windows. On macOS or Linux, use source env/bin/activate instead.
3. Install Dependencies
pip install -r requirements.txt
pip install fastapi uvicorn streamlit requests
Training (GPU Recommended)
Training runs on a Google Colab GPU for speed.
# Assumes `trainer` is an already configured Hugging Face Trainer.
trainer.train()
trainer.save_model("legal-bert-final")
Extract the trained model into:
models/legal-bert-final/
Evaluation
Run document-level evaluation with:
python src/evaluate_doc_level.py
This produces:
- Classification report
- Confusion matrix
- Macro-F1 score
API Deployment (FastAPI)
Start FastAPI Server
uvicorn src.api:app --reload
Access API Docs:
Sample JSON Input
{ "text": "The petitioner prays before this Honorable Court..." }
Sample Output
{ "label": "Judgment", "confidence": 0.9981 }
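A small client for this request and response shape might look like the sketch below. It uses only the standard library. The URL is an assumption: 127.0.0.1:8000 is uvicorn's default address, but the /predict route is hypothetical, so check src/api.py for the actual path. The helper names are also invented for the example.

```python
import json
import urllib.request

# Hypothetical endpoint; the real route is defined in src/api.py.
DEFAULT_URL = "http://127.0.0.1:8000/predict"


def build_request(text, url=DEFAULT_URL):
    """Build a POST request carrying the document text as JSON."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(body):
    """Extract (label, confidence) from the JSON response body."""
    data = json.loads(body)
    return data["label"], float(data["confidence"])


def classify(text, url=DEFAULT_URL):
    """Send the text to a running API server and return (label, confidence)."""
    with urllib.request.urlopen(build_request(text, url), timeout=30) as resp:
        return parse_prediction(resp.read())
```

With the server running, `classify("The petitioner prays before this Honorable Court...")` would return a `(label, confidence)` tuple matching the sample output format above.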
Web App Deployment (Streamlit)
Start the frontend UI:
streamlit run app.py
Open in browser:
Paste document → Click Predict → View class + confidence.
Tech Stack
Python 3.10+
PyTorch
HuggingFace Transformers
Datasets
Legal-BERT
FastAPI
Streamlit
Google Colab (GPU Training)
scikit-learn
NumPy, Pandas
Future Improvements
Class-balanced training with weighted loss
OCR integration for scanned legal documents
Multilingual legal text support
Vector search using embeddings (FAISS)
Integration with court management systems