Built with 💡 by Kavya Bhardwaj
The PDF Insight Engine is an AI-powered tool that answers questions about PDF documents. It combines Sentence Transformers embeddings, a Pinecone vector store, and Cohere's language-generation API to give interactive, detailed answers grounded in your uploaded documents.
Given any PDF document, the engine:
- 📄 Extracts text accurately from PDFs.
- 🔖 Breaks documents into semantic chunks.
- 🧠 Generates embeddings using Sentence Transformers.
- 🔍 Stores and retrieves information efficiently via Pinecone Vectorstore.
- ✨ Provides refined, detailed answers using Cohere’s generative models.
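The chunk, embed, and retrieve steps above can be illustrated with a small, dependency-free sketch. This is not the notebook's code: the word-count vectors stand in for Sentence Transformers embeddings, the in-memory list stands in for the Pinecone index, and the names `split_into_chunks`, `embed`, `cosine`, and `top_k` (along with the default sizes) are hypothetical.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=40, overlap=10):
    """Split text into overlapping windows of words (sizes are assumptions)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def embed(text):
    """Toy embedding: lowercase word counts. A real model returns dense vectors."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

In the real pipeline, the chunks returned by the retrieval step would be passed to the generative model as context for the answer.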
| Component | Technology Used |
|---|---|
| PDF Text Extraction | ✅ LangChain (PyPDFLoader) |
| Semantic Embeddings | ✅ SentenceTransformers (all-MiniLM-L6-v2) |
| Vector Database | ✅ Pinecone Vectorstore |
| Enhanced Responses | ✅ Cohere API (command-xlarge) |
| Text Processing | ✅ LangChain Text Splitter, NLTK |
| OCR (Optional) | ✅ Poppler, Tesseract |
- ✅ Intelligent semantic chunking of text.
- ✅ Rapid similarity-based search for document querying.
- ✅ Cohere API integration for contextually refined answers.
- ✅ Supports fallback and retry logic for robust processing.
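The README does not describe how the fallback and retry logic is implemented. One common pattern, shown here only as a sketch under that assumption, is to retry a failing call with exponential backoff and then hand over to a fallback. The helper `call_with_retry` and its parameters are hypothetical and not taken from the notebook.

```python
import time


def call_with_retry(primary, fallback=None, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call primary(); retry with exponential backoff, then try fallback()."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return primary()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
            # Wait 1x, 2x, 4x ... base_delay between attempts, not after the last.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    if fallback is not None:
        return fallback()
    raise last_error
```

For example, `primary` could wrap the generation request and `fallback` could return the retrieved chunks unrefined, so a query still gets an answer when the API is unavailable.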
1. Clone the Repository

       git clone https://github.com/Kavya071/PDF-Insight-Engine.git
       cd PDF-Insight-Engine

2. Install Dependencies

       pip install openai==0.27.2 langchain-community sentence_transformers pinecone-client cohere nltk unstructured
       sudo apt-get install poppler-utils tesseract-ocr

3. Set Up API Keys

   Replace the placeholders with your actual API keys:

       PINECONE_API_KEY = "your_pinecone_api_key"
       COHERE_API_KEY = "your_cohere_api_key"

4. Run the Application

   Launch the notebook (PDF_Insight.ipynb) in Google Colab or locally:

       jupyter notebook PDF_Insight.ipynb

   Then follow the prompts to upload PDFs and enter queries interactively.
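As an alternative to pasting keys into the notebook in step 3, the keys can be read from environment variables so they are not saved with the file. This is a minimal sketch, not the notebook's code: the helper `load_api_key` is hypothetical, and it assumes the variables carry the same names as the placeholders above.

```python
import os


def load_api_key(name):
    """Read a key from the environment and fail early if it is missing."""
    value = os.environ.get(name, "").strip()
    # Treat an unset variable or a leftover "your_..." placeholder as missing.
    if not value or value.startswith("your_"):
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return value
```

With the variables exported in the shell, the notebook cell would then read `PINECONE_API_KEY = load_api_key("PINECONE_API_KEY")` and `COHERE_API_KEY = load_api_key("COHERE_API_KEY")`.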