cat > README.md << 'ENDOFFILE'
# DocSearch - AI-Powered Document Search Engine
DocSearch is a document search engine that lets you upload documents and search them with natural language queries. Unlike keyword-based search, it matches on the **meaning** and **context** of a query, so it can return relevant passages even when the exact keywords don't appear in the text.

---
## Problem Statement
Organizations and individuals deal with large volumes of documents daily. Finding specific information across multiple documents is time-consuming and inefficient. Traditional search methods rely on exact keyword matching, which often fails to find relevant content when different words are used to express the same concept.

**DocSearch solves this by:**
- Understanding the semantic meaning of your queries
- Searching across multiple document formats simultaneously
- Returning the most contextually relevant passages
- Providing AI-generated answers based on document content
---
## Features
### Core Features
- **Multi-Format Document Upload** - Support for PDF, DOCX, and TXT files
- **AI-Powered Semantic Search** - Search using natural language instead of exact keywords
- **Context-Aware Results** - Returns relevant passages with source document references
- **AI-Generated Answers** - Uses Groq LLM to generate human-readable answers from document content
### Technical Features
- **Vector Embeddings** - Converts document text into high-dimensional vectors using Sentence Transformers
- **Efficient Storage** - ChromaDB vector database for fast similarity search
- **Document Chunking** - Splits documents into overlapping chunks (by default, 500 characters per chunk with an overlap of 50)
- **RESTful API** - Clean, well-documented API endpoints
- **Interactive API Documentation** - Auto-generated Swagger UI for testing endpoints
- **Responsive Frontend** - Modern, mobile-friendly user interface
---
## Tech Stack
| Component | Technology | Purpose |
|-----------|------------|---------|
| **Backend Framework** | FastAPI | High-performance async web framework |
| **AI Embeddings** | Sentence Transformers | Convert text to semantic vectors |
| **Vector Database** | ChromaDB | Store and query document embeddings |
| **LLM Integration** | Groq API | Generate natural language answers |
| **PDF Parsing** | PyPDF2 | Extract text from PDF documents |
| **DOCX Parsing** | python-docx | Extract text from Word documents |
| **Frontend** | HTML, CSS, JavaScript | User interface |
| **Server** | Uvicorn | ASGI server for FastAPI |
| **Language** | Python 3.10+ | Backend programming language |
---
## Architecture
DocSearch is organized in three layers:
- **Frontend (UI)** - HTML / CSS / JavaScript. Sends upload requests and search queries to the backend.
- **FastAPI Backend** - Contains three components: the Document Processor, the Search Engine, and the Answer Generator.
- **Dependencies** - The Document Processor connects to Sentence Transformers (embeddings), the Search Engine to ChromaDB (vector DB), and the Answer Generator to the Groq API (LLM).

### How It Works
1. **Document Upload**: User uploads a PDF, DOCX, or TXT file
2. **Text Extraction**: The system extracts raw text from the document
3. **Chunking**: Text is split into smaller, overlapping chunks for better search granularity
4. **Embedding Generation**: Each chunk is converted into a vector embedding using Sentence Transformers
5. **Storage**: Embeddings are stored in ChromaDB along with the original text and metadata
6. **Search Query**: User enters a natural language question
7. **Query Embedding**: The question is converted into a vector embedding
8. **Similarity Search**: ChromaDB finds the most similar document chunks using cosine similarity
9. **Answer Generation**: Retrieved chunks are sent to Groq LLM to generate a coherent answer
10. **Response**: The answer along with source references is returned to the user
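
Steps 7 and 8 above can be illustrated with a small, dependency-free sketch. It is not DocSearch's actual code: in the real pipeline the vectors come from Sentence Transformers and the nearest-neighbour lookup is delegated to ChromaDB. The function names (`cosine_similarity`, `top_k`), the `(text, vector)` pair layout, and the toy 3-dimensional vectors are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=5):
    """Return the k (score, text) pairs most similar to the query, best first.

    `chunks` is a list of (text, vector) pairs (hypothetical layout).
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
chunks = [
    ("Revenue recognition follows...", [0.9, 0.1, 0.0]),
    ("Employee benefits include...", [0.0, 0.2, 0.9]),
    ("Budget allocation for Q3...", [0.6, 0.7, 0.1]),
]
query = [1.0, 0.0, 0.0]
results = top_k(query, chunks, k=2)
for score, text in results:
    print(f"{score:.2f}  {text}")
```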
---
## Prerequisites
Before you begin, ensure you have the following installed:
- **Python 3.10 or higher** - [Download Python](https://www.python.org/downloads/)
- **pip** - Python package manager (comes with Python)
- **Git** - [Download Git](https://git-scm.com/downloads/)
- **Groq API Key** - [Get free API key](https://console.groq.com/keys)
---
## Installation
### Step 1: Clone the Repository
```bash
git clone https://github.com/Shreeshail-sp/docsearch.git
cd docsearch
```
### Step 2: Create a Virtual Environment
```bash
# Create virtual environment
python -m venv venv
# Activate it
source venv/bin/activate   # Linux/Mac
venv\Scripts\activate      # Windows
```
### Step 3: Install Dependencies
```bash
pip install -r requirements.txt
```
**Note:** The first run will download the Sentence Transformer model (~90MB). This is a one-time download.

### Step 4: Configure Environment Variables
Create a `.env` file in the project root:
```bash
cp .env.example .env
```
Edit the `.env` file and add your API key:
```env
GROQ_API_KEY=your_groq_api_key_here
```
To get a Groq API key:
1. Go to https://console.groq.com
2. Sign up for a free account
3. Navigate to the API Keys section
4. Click **Create API Key**
5. Copy and paste it into your `.env` file
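
For reference, the sketch below shows one way an application could read `GROQ_API_KEY` from a `.env` file using only the standard library. How DocSearch itself loads the file is not shown in this README (a package such as python-dotenv is a common choice), so the helper names `load_env_file` and `require_api_key` are hypothetical.

```python
import os
import tempfile

def load_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def require_api_key(env):
    """Return the Groq API key, or fail early with a clear message."""
    key = env.get("GROQ_API_KEY", "")
    if not key or key == "your_groq_api_key_here":
        raise RuntimeError("GROQ_API_KEY not set; add it to your .env file")
    return key

# Demonstration with a throwaway file instead of a real .env
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    tmp.write("# example\nGROQ_API_KEY=abc123\n")
env = load_env_file(tmp.name)
os.remove(tmp.name)
print(require_api_key(env))
```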

---
## Running the Application
```bash
uvicorn main:app --reload
```
Once the server is running, the following URLs are available:

| URL | Description |
|---|---|
| http://localhost:8000 | Main Application |
| http://localhost:8000/docs | Swagger API Documentation |
| http://localhost:8000/redoc | ReDoc API Documentation |

### Using the Application
1. Open your browser and go to http://localhost:8000
2. Upload one or more documents (PDF, DOCX, or TXT)
3. Wait for the documents to be processed (you'll see a success message)
4. Type your question in the search bar
5. Get AI-powered answers with source references

---
## Project Structure
```text
docsearch/
│
├── main.py              # FastAPI application entry point & API routes
├── requirements.txt     # Python package dependencies
├── .env                 # Environment variables (not tracked by git)
├── .env.example         # Example environment variables template
├── .gitignore           # Git ignore rules
├── README.md            # Project documentation (this file)
│
├── static/              # Frontend files
│   ├── index.html       # Main HTML page
│   ├── style.css        # CSS styles
│   └── script.js        # Frontend JavaScript logic
│
├── uploads/             # Uploaded documents storage (auto-created)
│
└── chroma_db/           # ChromaDB vector database (auto-created)
```

---
## API Endpoints
### Upload a Document
```http
POST /upload
Content-Type: multipart/form-data
```

| Parameter | Type | Description |
|---|---|---|
| `file` | File | Document file (PDF, DOCX, or TXT) |

**Response:**
```json
{
  "message": "Document uploaded and processed successfully",
  "filename": "document.pdf",
  "chunks": 15
}
```
### Search
```http
POST /search
Content-Type: application/json
```
**Request Body:**
```json
{
  "query": "What is the company's revenue policy?"
}
```
**Response:**
```json
{
  "answer": "Based on the documents, the company's revenue policy...",
  "sources": [
    {
      "document": "policy.pdf",
      "chunk": "Revenue recognition follows...",
      "relevance_score": 0.92
    }
  ]
}
```
### List All Documents
```http
GET /documents
```
**Response:**
```json
{
  "documents": [
    {
      "id": "1",
      "filename": "policy.pdf",
      "upload_date": "2024-01-15T10:30:00",
      "chunks": 15
    }
  ]
}
```
### Delete a Document
```http
DELETE /documents/{document_id}
```
**Response:**
```json
{
  "message": "Document deleted successfully"
}
```

---
## Usage Examples
### Using the Web Interface
1. Navigate to http://localhost:8000
2. Click the upload button and select a document
3. Once processed, type a question like:
   - "What are the main findings in the report?"
   - "Summarize the key points about budget allocation"
   - "What does the document say about employee benefits?"

### Using cURL
```bash
# Upload a document
curl -X POST "http://localhost:8000/upload" \
  -F "file=@/path/to/your/document.pdf"

# Search
curl -X POST "http://localhost:8000/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the main topic?"}'

# List all documents
curl -X GET "http://localhost:8000/documents"

# Delete a document
curl -X DELETE "http://localhost:8000/documents/1"
```
### Using Python
```python
import requests

# Upload a document
with open("document.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/upload",
        files={"file": f}
    )
print(response.json())

# Search
response = requests.post(
    "http://localhost:8000/search",
    json={"query": "What is the main topic?"}
)
print(response.json())
```

---
## Performance
| Metric | Value |
|---|---|
| Document Processing | ~2-5 seconds per page |
| Search Query | ~0.5-2 seconds |
| Embedding Model | all-MiniLM-L6-v2 (fast & accurate) |
| Max File Size | 10 MB per document |
| Supported Formats | PDF, DOCX, TXT |
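
The format and size limits in the table above suggest a check like the following before a file is processed. This is a minimal sketch under assumptions: the function name `validate_upload`, the constants, and the error messages are hypothetical and are not taken from `main.py`, and 10 MB is interpreted as 10 × 1024 × 1024 bytes.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}  # from the Supported Formats row
MAX_FILE_SIZE = 10 * 1024 * 1024                # 10 MB (assumed to mean MiB)

def validate_upload(filename, size_bytes):
    """Return None if the upload is acceptable, otherwise an error message."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        return f"Unsupported file type: {suffix or '(none)'}"
    if size_bytes > MAX_FILE_SIZE:
        return "File exceeds the 10 MB limit"
    return None

print(validate_upload("policy.pdf", 2_000_000))
print(validate_upload("slides.pptx", 2_000_000))
print(validate_upload("scan.PDF", 11 * 1024 * 1024))
```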

---
## Configuration
You can customize the application by modifying these settings:

| Setting | Default | Description |
|---|---|---|
| `CHUNK_SIZE` | 500 | Number of characters per text chunk |
| `CHUNK_OVERLAP` | 50 | Overlap between consecutive chunks |
| `TOP_K_RESULTS` | 5 | Number of search results to return |
| `EMBEDDING_MODEL` | all-MiniLM-L6-v2 | Sentence transformer model |
| `MAX_FILE_SIZE` | 10 MB | Maximum upload file size |
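
To make `CHUNK_SIZE` and `CHUNK_OVERLAP` concrete, here is a minimal character-based chunker. It assumes the overlap is measured in characters, like the chunk size, and that chunks are cut at fixed offsets; the project's own splitter may differ (for example, by respecting sentence boundaries). `chunk_text` is a hypothetical name.

```python
CHUNK_SIZE = 500     # characters per chunk (default from the table)
CHUNK_OVERLAP = 50   # characters shared by consecutive chunks (assumed unit)

def chunk_text(text, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

sample = "x" * 1200
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

With the defaults, a 1,200-character input yields three chunks, and the last 50 characters of each chunk reappear at the start of the next one.
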

---
## Troubleshooting
### 1. "Module not found" error
```bash
# Make sure virtual environment is activated
source venv/bin/activate
pip install -r requirements.txt
```
### 2. "GROQ_API_KEY not set" error
```bash
# Check if .env file exists
cat .env
# Make sure it contains: GROQ_API_KEY=your_key_here
```
### 3. "Port already in use" error
```bash
# Use a different port
uvicorn main:app --reload --port 8001
```
### 4. Slow first startup
- The first run downloads the embedding model (~90MB)
- Subsequent startups will be much faster
### 5. Large PDF processing fails
- Ensure the PDF is not password-protected
- Try with a smaller document first
- Check if the PDF contains extractable text (not scanned images)
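
For the "Port already in use" case, one quick way to confirm that something is already listening is to try connecting to the port. The helper below is a sketch using only the standard library; `port_in_use` is a hypothetical name, and the check covers the local loopback interface only.

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0

# Demonstration: open a listener on a free port chosen by the OS, then probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 asks the OS for any free port
listener.listen(1)
busy_port = listener.getsockname()[1]
print(port_in_use(busy_port))
listener.close()
```

If `port_in_use(8000)` reports that the port is taken, start Uvicorn on another port with `--port`.
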

---
## Future Enhancements
- Support for more file formats (XLSX, PPTX, CSV)
- Multi-language document support
- User authentication and document access control
- Batch document upload
- Document summarization feature
- Chat-based interface with conversation history
- Docker containerization
- Cloud deployment (AWS/GCP)
- OCR support for scanned documents

---
## Contributing
Contributions are welcome! Here's how you can help:
1. Fork the repository
2. Clone your fork
   ```bash
   git clone https://github.com/your-username/docsearch.git
   ```
3. Create a feature branch
   ```bash
   git checkout -b feature/amazing-feature
   ```
4. Make your changes
5. Test your changes thoroughly
6. Commit your changes
   ```bash
   git commit -m "Add amazing feature"
   ```
7. Push to your branch
   ```bash
   git push origin feature/amazing-feature
   ```
8. Open a Pull Request

### Contribution Guidelines
- Follow PEP 8 coding standards
- Add comments for complex logic
- Update documentation for new features
- Write meaningful commit messages

---
## License
This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2024 Shreeshail SP

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

---
## Author
**Shreeshail SP**
- GitHub: [@Shreeshail-sp](https://github.com/Shreeshail-sp)

---
## Acknowledgments
- FastAPI - Modern Python web framework
- Sentence Transformers - State-of-the-art text embeddings
- Endee - AI-native vector database
- Groq - Ultra-fast LLM inference
- Uvicorn - Lightning-fast ASGI server

---
If you found this project useful, please give it a star!
ENDOFFILE
git add README.md
git commit -m "Add detailed README documentation"
git push
Copy and paste this entire block into your terminal. It will create the detailed README, commit, and push it to GitHub.