KagajatAI is a full-featured, portfolio-ready project that demonstrates the entire lifecycle of a modern AI application. It provides a web-based interface to "chat" with your documents, powered by a sophisticated Retrieval-Augmented Generation (RAG) pipeline.
This project goes beyond a simple proof-of-concept by incorporating advanced techniques such as model finetuning, rigorous benchmarking, and a flexible architecture that supports both local open-source models and powerful cloud APIs.
- Interactive Chat Interface: A user-friendly web application built with Streamlit to upload documents and ask questions in natural language.
- Advanced RAG Pipeline: Implements a robust RAG system using a high-performance embedding model (BAAI/bge-large-en-v1.5) and a persistent vector store (ChromaDB).
- Flexible LLM Backend: Seamlessly switch between Google's Gemini Pro API for top-tier performance or a local, finetuned Llama-3-8B model for privacy and customization.
- Efficient Finetuning (QLoRA): Includes a Jupyter notebook to finetune the local LLM on a custom dataset using QLoRA, a state-of-the-art technique for memory-efficient training.
- Synthetic Dataset Generation: Demonstrates how to use a powerful model (Gemini Pro) to programmatically create a high-quality instruction dataset for finetuning.
- Rigorous Benchmarking: Provides a dedicated notebook to compare the performance of the base model, the finetuned model, and Gemini Pro, allowing for quantitative and qualitative analysis of the finetuning process.
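To make the retrieval step concrete, here is a dependency-free sketch of the idea behind the vector search: document chunks and the user's question are embedded as vectors, and the chunk whose embedding is closest (by cosine similarity) to the query embedding is retrieved as context. The toy 3-dimensional vectors below stand in for real bge-large-en-v1.5 embeddings; ChromaDB performs this search at scale in the actual pipeline.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings stand in for bge-large-en-v1.5 vectors.
store = {
    "Invoices are due in 30 days.": [0.9, 0.1, 0.0],
    "The warranty lasts one year.": [0.1, 0.9, 0.0],
}

query_vec = [0.8, 0.2, 0.0]  # pretend embedding of "When is payment due?"
best = max(store, key=lambda doc: cosine(query_vec, store[doc]))
print(best)  # → Invoices are due in 30 days.
```

The real pipeline feeds the retrieved chunk into the LLM's prompt as grounding context; this sketch only shows the similarity search itself.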
Clone the repository. From inside the repo folder, install the dependencies:

```shell
python -m pip install --upgrade -r .\requirements.txt
```

Configure API keys:
- The application can use Google's Gemini API. Place the key directly in config/config.yaml.
- You can also download model checkpoints (in my case, I downloaded the lightweight yet powerful, open-source Llama-3.1-8B-Instruct model). Set `generation_model_name` in config/config.yaml to the absolute path of the downloaded model.
- Run the Streamlit application:

  ```shell
  streamlit run .\app\main_app.py
  ```

  The application should now be open and accessible in your web browser.
OR
- You can also run a specific file from the terminal as its own standalone script:

  ```shell
  python -u .\src\rag_pipeline.py
  ```

  Ensure you have some sample PDF documents in .\data\source_documents\
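For reference, the config/config.yaml used in the steps above might look roughly like this. Only `generation_model_name` is named in this README; the other keys are illustrative and should be checked against the shipped file:

```yaml
# Illustrative sketch of config/config.yaml -- only generation_model_name
# is confirmed by this README; the other key names may differ.
gemini_api_key: "YOUR_GEMINI_API_KEY"           # used when the Gemini backend is selected
embedding_model_name: "BAAI/bge-large-en-v1.5"  # embedding model for the vector store
# Absolute path to a locally downloaded checkpoint (or a Hugging Face model id):
generation_model_name: "C:/models/Llama-3.1-8B-Instruct"
```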
- To showcase the full capabilities of the project, run the Jupyter notebooks in the following order.
- notebooks/1_Dataset_Creation.ipynb: This notebook uses the Gemini API to generate a finetuning_data.jsonl file from your source document.
- notebooks/2_Finetuning_with_LoRA.ipynb: This notebook uses the generated dataset to finetune the Llama-3-8B model. This requires a CUDA-enabled GPU.
- notebooks/3_Benchmarking.ipynb: After finetuning, run this notebook to compare the responses from the base model, your new finetuned model, and Gemini Pro. Due to hardware limitations, I was unable to finetune and run inference on a large dataset, so I did not implement metrics like ROUGE or BLEU for a quantitative assessment of model performance. Once opened, the notebooks are self-explanatory.
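As an illustration of what notebooks/1_Dataset_Creation.ipynb produces, each line of finetuning_data.jsonl is one self-contained JSON object. The field names below are hypothetical; the actual schema depends on the prompt template used in the notebook:

```python
import json

# Hypothetical record schema; the notebook's actual fields may differ.
records = [
    {
        "instruction": "When is payment due according to the invoice?",
        "context": "Invoices are due within 30 days of issue.",
        "response": "Payment is due within 30 days of the invoice date.",
    }
]

# Write one JSON object per line (the "jsonl" format).
with open("finetuning_data.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading the file back line by line recovers each record independently,
# which is why jsonl is convenient for streaming large finetuning datasets.
with open("finetuning_data.jsonl", encoding="utf-8") as f:
    first = json.loads(f.readline())
print(first["response"])  # → Payment is due within 30 days of the invoice date.
```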
- Agentic Capabilities: Expand the system into a multi-tool agent that can not only read documents but also fetch real-time data from external APIs (e.g., stock prices).
- Broader Document Support: Enable compatibility with additional file types such as .docx, .txt, and .html. Incorporate Deep Document Understanding to more effectively parse complex formats like CVs, resumes, journal papers, novels, and presentations.
- UI Enhancements: Add features to the Streamlit app to manage multiple vector stores or highlight the source text in the original document.
- Multi-Model Support: Introduce the ability to seamlessly switch between multiple local models and proprietary LLM API providers with a single click.
*This README.md file has been improved for overall readability (grammar, sentence structure, and organization) using AI tools.*