Legal Document Classifier Using Legal-BERT

An end-to-end AI system for real-time multi-class legal document classification using Legal-BERT, FastAPI, and Streamlit.

🔍 Overview

This project implements a production-grade Legal Document Classification System powered by Legal-BERT. It assigns each long legal document to one of five legal categories:

📜 Contract

⚖️ Judgment

📝 Petition

🚨 FIR / Criminal

📩 Legal Notice

The system supports:

✅ Long-document handling using sliding-window chunking

✅ GPU-trained Legal-BERT model

✅ Document-level probability aggregation

✅ FastAPI backend for real-time inference

✅ Streamlit web app frontend for interactive usage

🚀 Key Features

✅ Transformer-based Legal NLP using nlpaueb/legal-bert-base-uncased

✅ Automatic long-document chunking

✅ Exact chunk ➝ document aggregation using doc_id mapping

✅ Multi-class classifier with 5 legal categories

✅ GPU-accelerated training via Google Colab

✅ FastAPI REST API for inference

✅ Streamlit-based web interface

✅ Professional evaluation with confusion matrix & macro-F1

🧠 Model Architecture

Base Model: Legal-BERT (nlpaueb/legal-bert-base-uncased)

Input Handling: Sliding window chunking (512 tokens, stride=128)

Aggregation Strategy: Probability averaging across chunks

Loss Function: Cross-Entropy (multi-class classification)

Training: Fine-tuned on domain-specific legal text
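The input handling above can be sketched in a few lines. This is a minimal illustration, not the code in `src/tokenize_dataset.py`: the function name `sliding_windows` is hypothetical, special tokens such as `[CLS]` and `[SEP]` are ignored, and "stride=128" is assumed to mean that consecutive windows share 128 tokens (the meaning of `stride` in Hugging Face tokenizers). If the project instead advances each window by 128 tokens, the step would change accordingly.

```python
def sliding_windows(token_ids, window=512, overlap=128):
    """Split a token-id sequence into overlapping windows.

    Assumption: `overlap` plays the role of "stride=128", i.e. the number of
    tokens shared by consecutive windows, so each window starts
    `window - overlap` tokens after the previous one.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        # Stop once the current window reaches the end of the document.
        if start + window >= len(token_ids):
            break
        start += step
    return windows


# Stand-in for a tokenized document of 1000 tokens.
doc_tokens = list(range(1000))
chunks = sliding_windows(doc_tokens)
for i, chunk in enumerate(chunks):
    print(i, chunk[0], chunk[-1], len(chunk))
```

Under these assumptions a 1000-token document yields three chunks starting at tokens 0, 384, and 768, with the last chunk shorter than 512 tokens. Each chunk would then be classified separately before aggregation.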

📊 Final Evaluation Results (Document-Level)

✔️ Accuracy: 96.18%

✔️ Macro F1-Score: 0.7619

✔️ Weighted F1-Score: 0.9618

✅ Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Contract | 0.9355 | 0.8788 | 0.9062 | 33 |
| Judgment | 1.0000 | 1.0000 | 1.0000 | 23 |
| Petition | 1.0000 | 1.0000 | 1.0000 | 71 |
| FIR / Criminal | 0.8750 | 0.9333 | 0.9032 | 30 |
| Legal Notice | — | — | — | 0 |

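⚠️ Note: The Legal Notice class had zero samples in the test split, so its performance could not be evaluated. Its F1 counts as 0 in the macro average, which is why the Macro F1-Score (0.7619) sits well below accuracy; averaged over only the four classes that have test samples, the per-class F1 scores give about 0.952.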

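✅ Confusion Matrix (Document-Level)

Rows are true classes and columns are predicted classes, in the order Contract, Judgment, Petition, FIR / Criminal, Legal Notice.

    [[29  0  0  4  0]
     [ 0 23  0  0  0]
     [ 0  0 71  0  0]
     [ 2  0  0 28  0]
     [ 0  0  0  0  0]]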
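The reported scores can be re-derived from the confusion matrix alone. The sketch below is a plain-Python check, not the project's evaluation script. It assumes that rows are true classes and columns are predictions (the row sums match the Support column), and that a class with no true or predicted samples gets an F1 of 0.

```python
LABELS = ["Contract", "Judgment", "Petition", "FIR / Criminal", "Legal Notice"]

# Document-level confusion matrix: rows = true class, columns = predicted class.
CONFUSION = [
    [29, 0, 0, 4, 0],
    [0, 23, 0, 0, 0],
    [0, 0, 71, 0, 0],
    [2, 0, 0, 28, 0],
    [0, 0, 0, 0, 0],
]


def per_class_f1(cm):
    """Return one F1 score per class; empty classes are scored 0.0."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        fp = sum(cm[r][k] for r in range(n)) - tp  # predicted k, true otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


f1 = per_class_f1(CONFUSION)
support = [sum(row) for row in CONFUSION]
total = sum(support)

accuracy = sum(CONFUSION[k][k] for k in range(len(CONFUSION))) / total
macro_f1 = sum(f1) / len(f1)
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total
# Macro average restricted to classes that actually appear in the test split.
macro_f1_seen = sum(f for f, s in zip(f1, support) if s) / sum(1 for s in support if s)

print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} "
      f"weighted_f1={weighted_f1:.4f} macro_f1_seen={macro_f1_seen:.4f}")
```

This reproduces the accuracy, macro F1, and weighted F1 listed above, and shows that the gap between macro F1 and accuracy comes entirely from the empty Legal Notice class.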

🏗️ Project Structure

    legal-document-classifier/
    ├── app.py                              # Streamlit Web UI
    ├── src/
    │   ├── api.py                          # FastAPI server
    │   ├── predict.py                      # Inference engine
    │   ├── train.py                        # Transformer training script
    │   ├── evaluate_doc_level.py           # Document-level evaluation
    │   ├── tokenize_dataset.py
    │   └── tokenize_dataset_with_docid.py
    ├── models/
    │   └── legal-bert-final/               # Trained Legal-BERT model
    ├── data/
    │   ├── train_chunks/
    │   ├── val_chunks/
    │   └── test_chunks/
    ├── requirements.txt
    └── README.md

⚙️ Installation & Setup

✅ 1. Clone the Repository

    git clone <repository-url>
    cd legal-document-classifier

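✅ 2. Create Virtual Environment

    python -m venv env
    env\Scripts\activate

The activation command above is for Windows. On macOS or Linux, use source env/bin/activate instead.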

✅ 3. Install Dependencies

    pip install -r requirements.txt
    pip install fastapi uvicorn streamlit requests

🧪 Training (GPU Recommended)

Training is performed on Google Colab GPU for speed.

Run train.py on Colab using:

    trainer.train()
    trainer.save_model("legal-bert-final")

Extract the trained model into:

models/legal-bert-final/

📈 Evaluation

Run document-level evaluation with:

python src/evaluate_doc_level.py

This produces:

✅ Classification report

✅ Confusion matrix

✅ Macro-F1 score
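Document-level evaluation depends on mapping each chunk back to its source document. The following is a minimal sketch of that step under stated assumptions, not the contents of `src/evaluate_doc_level.py`: the function name `aggregate_by_doc_id`, the `(doc_id, probabilities)` record shape, and the label order are all hypothetical, and the chunk probabilities are made-up numbers for illustration.

```python
from collections import defaultdict

# Assumed label order; the real index-to-label mapping lives in the project code.
LABELS = ["Contract", "Judgment", "Petition", "FIR / Criminal", "Legal Notice"]


def aggregate_by_doc_id(chunk_records, labels=LABELS):
    """Average chunk-level class probabilities per document.

    chunk_records: iterable of (doc_id, probs), where probs holds one
    probability per label for a single chunk.
    Returns {doc_id: (predicted_label, mean_probability)}.
    """
    sums = {}
    counts = defaultdict(int)
    for doc_id, probs in chunk_records:
        if doc_id not in sums:
            sums[doc_id] = [0.0] * len(labels)
        for i, p in enumerate(probs):
            sums[doc_id][i] += p
        counts[doc_id] += 1

    results = {}
    for doc_id, total in sums.items():
        mean = [s / counts[doc_id] for s in total]
        best = max(range(len(mean)), key=mean.__getitem__)
        results[doc_id] = (labels[best], mean[best])
    return results


# Illustrative chunk outputs: two chunks from one document, one from another.
records = [
    ("doc-a", [0.70, 0.10, 0.10, 0.05, 0.05]),
    ("doc-a", [0.50, 0.30, 0.10, 0.05, 0.05]),
    ("doc-b", [0.10, 0.10, 0.70, 0.05, 0.05]),
]
doc_predictions = aggregate_by_doc_id(records)
print(doc_predictions)
```

The resulting per-document labels are what a classification report and confusion matrix would be computed from, so each document counts once regardless of how many chunks it produced.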

🌐 API Deployment (FastAPI)

✅ Start FastAPI Server

    uvicorn src.api:app --reload

Access API Docs:

http://127.0.0.1:8000/docs

✅ Sample JSON Input

    { "text": "The petitioner prays before this Honorable Court..." }

✅ Sample Output

    { "label": "Judgment", "confidence": 0.9981 }
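A small client for the JSON shapes above might look like the sketch below. It uses only the Python standard library. The route `/predict` is an assumption, since this README does not name the endpoint; check the generated docs at `/docs` for the real path. The helper names `build_request`, `parse_prediction`, and `classify` are hypothetical.

```python
import json
import urllib.request

# Hypothetical route: the README only gives the host and port, not the path.
API_URL = "http://127.0.0.1:8000/predict"


def build_request(text, url=API_URL):
    """Build a POST request carrying the document text as JSON."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(body):
    """Extract (label, confidence) from a response body shaped like the sample output."""
    data = json.loads(body)
    return data["label"], float(data["confidence"])


def classify(text, url=API_URL):
    """Send the text to a running server and return (label, confidence)."""
    with urllib.request.urlopen(build_request(text, url), timeout=30) as resp:
        return parse_prediction(resp.read())


if __name__ == "__main__":
    # Requires the FastAPI server to be running locally.
    print(classify("The petitioner prays before this Honorable Court..."))
```

Splitting request building and response parsing from the network call keeps both testable without a running server.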

🖥️ Web App Deployment (Streamlit)

Start the frontend UI:

streamlit run app.py

Open in browser:

http://localhost:8501

Paste document → Click Predict → View class + confidence.

🛠️ Tech Stack

Python 3.10+

PyTorch

HuggingFace Transformers

Datasets

Legal-BERT

FastAPI

Streamlit

Google Colab (GPU Training)

scikit-learn

NumPy, Pandas

Future Improvements

Class-balanced training with weighted loss

OCR integration for scanned legal documents

Multilingual legal text support

Vector search using embeddings (FAISS)

Integration with court management systems
