Aditya-9215/Legal-Document-Classifier-Using-Legal-BERT


Legal Document Classifier Using Legal-BERT

An end-to-end AI system for real-time multi-class legal document classification using Legal-BERT, FastAPI, and Streamlit.

🔍 Overview

This project implements a production-grade legal document classification system powered by Legal-BERT. It classifies long legal documents into five legal categories:

📜 Contract

⚖️ Judgment

📝 Petition

🚨 FIR / Criminal

📩 Legal Notice

The system supports:

✅ Long-document handling using sliding-window chunking

✅ GPU-trained Legal-BERT model

✅ Document-level probability aggregation

✅ FastAPI backend for real-time inference

✅ Streamlit web app frontend for interactive usage

🚀 Key Features

✅ Transformer-based legal NLP using nlpaueb/legal-bert-base-uncased

✅ Automatic long-document chunking

✅ Exact chunk → document aggregation using doc_id mapping

✅ Multi-class classifier with 5 legal categories

✅ GPU-accelerated training via Google Colab

✅ FastAPI REST API for inference

✅ Streamlit-based web interface

✅ Professional evaluation with confusion matrix & macro-F1

🧠 Model Architecture

Base Model: Legal-BERT (nlpaueb/legal-bert-base-uncased)

Input Handling: Sliding window chunking (512 tokens, stride=128)

Aggregation Strategy: Probability averaging across chunks

Loss Function: Cross-Entropy (multi-class classification)

Training: Fine-tuned on domain-specific legal text
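The sliding-window input handling listed above can be sketched in plain Python. This is a minimal illustration, not the project's tokenization script: the function name `chunk_token_ids` is hypothetical, and it assumes that `stride=128` means a 128-token overlap between consecutive windows (the reading Hugging Face tokenizers use for their `stride` argument).

```python
from typing import List


def chunk_token_ids(token_ids: List[int], max_len: int = 512, overlap: int = 128) -> List[List[int]]:
    """Split a token sequence into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens (assumed meaning of stride=128).
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap  # how far the window start advances each time
    chunks: List[List[int]] = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += step
    return chunks


# Illustrative input: 1000 dummy token ids standing in for a long document.
windows = chunk_token_ids(list(range(1000)))
print([len(w) for w in windows])
```

With a Hugging Face tokenizer, similar windows can be requested directly by passing `truncation=True`, `max_length=512`, `stride=128`, and `return_overflowing_tokens=True`; whether this project does so is not stated here.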

📊 Final Evaluation Results (Document-Level)

✔️ Accuracy: 96.18%

✔️ Macro F1-Score: 0.7619

✔️ Weighted F1-Score: 0.9618

✅ Per-Class Performance

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Contract | 0.9355 | 0.8788 | 0.9062 | 33 |
| Judgment | 1.0000 | 1.0000 | 1.0000 | 23 |
| Petition | 1.0000 | 1.0000 | 1.0000 | 71 |
| FIR / Criminal | 0.8750 | 0.9333 | 0.9032 | 30 |
| Legal Notice | n/a | n/a | n/a | 0 |

⚠️ Note: The Legal Notice class had zero samples in the test split, hence its performance could not be evaluated.

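✅ Confusion Matrix (Document-Level)

Rows are true classes and columns are predicted classes, in the order Contract, Judgment, Petition, FIR / Criminal, Legal Notice.

[[29  0  0  4  0]
 [ 0 23  0  0  0]
 [ 0  0 71  0  0]
 [ 2  0  0 28  0]
 [ 0  0  0  0  0]]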
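The headline metrics can be re-derived from the confusion matrix, which makes the gap between macro and weighted F1 easier to see. The sketch below uses only the matrix values reported above and treats rows as true classes (the row sums match the reported supports); the helper name `summarize` is hypothetical. A class with no true or predicted samples gets an F1 of 0 here, which is what pulls the five-class macro average down.

```python
from typing import List, Tuple

# Confusion matrix as reported; class order: Contract, Judgment, Petition,
# FIR / Criminal, Legal Notice (rows = true class, columns = predicted class).
CM = [
    [29, 0, 0, 4, 0],
    [0, 23, 0, 0, 0],
    [0, 0, 71, 0, 0],
    [2, 0, 0, 28, 0],
    [0, 0, 0, 0, 0],
]


def summarize(cm: List[List[int]]) -> Tuple[float, float, float, List[float]]:
    """Return (accuracy, macro F1, weighted F1, per-class F1) for a square matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    f1s, supports = [], []
    for i in range(n):
        tp = cm[i][i]
        support = sum(cm[i])                      # true samples of class i
        predicted = sum(cm[r][i] for r in range(n))  # samples predicted as class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        supports.append(support)
    macro_f1 = sum(f1s) / n
    weighted_f1 = sum(f * s for f, s in zip(f1s, supports)) / total
    return accuracy, macro_f1, weighted_f1, f1s


accuracy, macro_f1, weighted_f1, per_class = summarize(CM)
# After rounding, these match the reported 96.18%, 0.7619, and 0.9618.
print(round(accuracy, 4), round(macro_f1, 4), round(weighted_f1, 4))
```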

🏗️ Project Structure

legal-document-classifier/
├── app.py                              # Streamlit Web UI
├── src/
│   ├── api.py                          # FastAPI server
│   ├── predict.py                      # Inference engine
│   ├── train.py                        # Transformer training script
│   ├── evaluate_doc_level.py           # Document-level evaluation
│   ├── tokenize_dataset.py
│   └── tokenize_dataset_with_docid.py
├── models/
│   └── legal-bert-final/               # Trained Legal-BERT model
├── data/
│   ├── train_chunks/
│   ├── val_chunks/
│   └── test_chunks/
├── requirements.txt
└── README.md

⚙️ Installation & Setup

✅ 1. Clone the Repository

git clone <repository-url>
cd legal-document-classifier

✅ 2. Create Virtual Environment

python -m venv env
env\Scripts\activate

The activation command above is for Windows. On Linux or macOS, use: source env/bin/activate

✅ 3. Install Dependencies

pip install -r requirements.txt
pip install fastapi uvicorn streamlit requests

🧪 Training (GPU Recommended)

Training is performed on a Google Colab GPU for speed.

Run train.py on Colab using:

trainer.train()
trainer.save_model("legal-bert-final")

Extract the trained model into:

models/legal-bert-final/

📈 Evaluation

Run document-level evaluation with:

python src/evaluate_doc_level.py

This produces:

✅ Classification report

✅ Confusion matrix

✅ Macro-F1 score
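Document-level scoring relies on mapping each chunk back to its source document and averaging the chunk probabilities. The following is a minimal sketch of that idea, not the contents of evaluate_doc_level.py: the function name `aggregate_by_doc`, the label order in `LABELS`, and the example probabilities are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Assumed label order (taken from the per-class table); the real mapping may differ.
LABELS = ["Contract", "Judgment", "Petition", "FIR / Criminal", "Legal Notice"]


def aggregate_by_doc(
    doc_ids: Sequence[str],
    chunk_probs: Sequence[Sequence[float]],
) -> Dict[str, Tuple[str, float]]:
    """Average chunk-level class probabilities per doc_id and pick the top class."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for doc_id, probs in zip(doc_ids, chunk_probs):
        if doc_id not in sums:
            sums[doc_id] = [0.0] * len(probs)
        for i, p in enumerate(probs):
            sums[doc_id][i] += p
        counts[doc_id] += 1

    result: Dict[str, Tuple[str, float]] = {}
    for doc_id, total in sums.items():
        mean = [s / counts[doc_id] for s in total]
        best = max(range(len(mean)), key=mean.__getitem__)
        result[doc_id] = (LABELS[best], mean[best])
    return result


# Made-up probabilities: two chunks from document "a", one chunk from document "b".
predictions = aggregate_by_doc(
    ["a", "a", "b"],
    [
        [0.6, 0.1, 0.1, 0.1, 0.1],
        [0.2, 0.5, 0.1, 0.1, 0.1],
        [0.0, 0.0, 1.0, 0.0, 0.0],
    ],
)
print(predictions)
```

Averaging probabilities rather than taking a majority vote over chunk labels lets a few confident chunks outweigh several uncertain ones.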

🌐 API Deployment (FastAPI)

✅ Start FastAPI Server

uvicorn src.api:app --reload

Access API Docs:

http://127.0.0.1:8000/docs

✅ Sample JSON Input

{ "text": "The petitioner prays before this Honorable Court..." }

✅ Sample Output

{ "label": "Judgment", "confidence": 0.9981 }
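A client call against the running server might look like the sketch below. The endpoint path `/predict` is an assumption, since the route name is not given here; the interactive docs at /docs list the real one. The helper names `build_request` and `classify` are hypothetical, and the standard library is used so the snippet has no extra dependencies.

```python
import json
import urllib.request

# Hypothetical endpoint path; confirm the real route in the FastAPI docs page.
API_URL = "http://127.0.0.1:8000/predict"


def build_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request carrying the document text as a JSON body."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text: str, url: str = API_URL) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_request(text, url), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (only works while the server is running):
# result = classify("The petitioner prays before this Honorable Court...")
# print(result["label"], result["confidence"])
```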

🖥️ Web App Deployment (Streamlit)

Start the frontend UI:

streamlit run app.py

Open in browser:

http://localhost:8501

Paste document → Click Predict → View class + confidence.

🛠️ Tech Stack

Python 3.10+

PyTorch

HuggingFace Transformers

Datasets

Legal-BERT

FastAPI

Streamlit

Google Colab (GPU Training)

scikit-learn

NumPy, Pandas

Future Improvements

Class-balanced training with weighted loss

OCR integration for scanned legal documents

Multilingual legal text support

Vector search using embeddings (FAISS)

Integration with court management systems
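As an illustration of the first item, class-balanced training usually starts from per-class weights that grow as a class gets rarer. The sketch below uses the common inverse-frequency heuristic (total samples divided by the number of classes times the class count). The function name `balanced_class_weights` and the training counts are hypothetical, since the real training distribution is not given here.

```python
from typing import Dict


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Inverse-frequency weights: total / (num_classes * class_count).

    Classes with zero samples are skipped, since no finite weight is defined.
    """
    present = {label: n for label, n in counts.items() if n > 0}
    total = sum(present.values())
    k = len(present)
    return {label: total / (k * n) for label, n in present.items()}


# Hypothetical training-set counts, for illustration only.
example_counts = {
    "Contract": 100,
    "Judgment": 50,
    "Petition": 200,
    "FIR / Criminal": 100,
    "Legal Notice": 50,
}
print(balanced_class_weights(example_counts))
```

Weights like these can be supplied to a weighted cross-entropy loss, for example through the `weight` argument of `torch.nn.CrossEntropyLoss`, ordered to match the model's label indices.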
