This project creates a vector database from PDF textbooks and provides a question-answering interface using Claude AI.
- Extract text from PDF textbooks
- Create semantic embeddings and store in vector database
- Answer questions using Claude AI with context from textbooks
- Interactive command-line interface
- Source citation and reference tracking
textbook_ai/
├── data/ # PDF textbooks and processed data
├── models/ # Vector database and embeddings
├── src/ # Source code
│ ├── pdf_processor.py # PDF text extraction
│ ├── vector_db.py # Vector database operations
│ ├── qa_system.py # Question-answering system
│ └── cli.py # Command-line interface
├── config/ # Configuration files
├── requirements.txt # Python dependencies
└── README.md # This file
- Run the setup script:
      python setup.py
- Get API keys:
  - Claude API key: https://console.anthropic.com/
  - OpenAI API key: https://platform.openai.com/ (optional, for embeddings)
- Edit the .env file with your API keys
- Add PDF textbooks to the data/pdfs/ directory
- Process the PDFs:
      python src/cli.py process-pdfs
- Start asking questions:
      python src/cli.py interactive

If you prefer manual setup:
- Install dependencies:
      pip install -r requirements.txt
- Download spaCy model:
      python -m spacy download en_core_web_sm
- Create .env file:
      cp config.env.example .env
      # Edit .env with your API keys
- Process PDFs and create vector database:
      python src/cli.py process-pdfs --input-dir data/pdfs --output-dir models/
- Ask questions:
      python src/cli.py ask --question "What is the definition of machine learning?"
- Interactive mode:
      python src/cli.py interactive
- Web interface:
      streamlit run web_app.py

Create a .env file with:
ANTHROPIC_API_KEY=your_claude_api_key
OPENAI_API_KEY=your_openai_api_key # Optional for embeddings
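Before processing any PDFs, it can help to confirm that the required key is actually set. The sketch below is one way such a check might look; `check_api_keys` is a hypothetical helper and not part of this project's code. It takes any mapping, so it works with `os.environ` or a plain dict.

```python
import os
from typing import List, Mapping

# Key names come from the .env example above.
REQUIRED_KEYS = ["ANTHROPIC_API_KEY"]   # needed for Claude
OPTIONAL_KEYS = ["OPENAI_API_KEY"]      # only for alternative embeddings


def check_api_keys(env: Mapping[str, str]) -> List[str]:
    """Return the names of required keys that are missing or blank.

    Hypothetical helper for illustration only.
    """
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = check_api_keys(os.environ)
    if missing:
        print("Missing required keys:", ", ".join(missing))
    else:
        print("All required keys are set.")
```

Optional keys are deliberately not reported as missing, since the OpenAI key is only needed for alternative embeddings.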
- Python 3.8+
- PDF textbooks in the data/pdfs/ directory
- Claude API key from Anthropic
- OpenAI API key (optional, for alternative embeddings)
- PDF Processing: Extract and clean text from PDF textbooks
- Vector Database: Store text chunks with semantic embeddings using ChromaDB
- Semantic Search: Find relevant content using sentence transformers
- AI Q&A: Answer questions using Claude AI with textbook context
- Source Citation: Track which books and sections were used
- Multiple Interfaces: CLI and web interface options
- Book Filtering: Search within specific textbooks
- Confidence Scoring: Rate answer quality based on similarity scores
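To make the semantic search, book filtering, and confidence scoring features more concrete, here is a minimal pure-Python sketch. It is not the project's implementation, which uses ChromaDB and sentence transformers. The toy two-dimensional embeddings, the `search` and `confidence` helpers, the chunk dictionary layout, and the choice of mean similarity as the confidence score are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Chunk = Dict[str, object]  # assumed layout: {"book": str, "text": str, "embedding": list}


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_vec: Sequence[float], chunks: List[Chunk],
           top_k: int = 3, book: Optional[str] = None) -> List[Tuple[float, Chunk]]:
    """Return the top_k (score, chunk) pairs, optionally limited to one book."""
    candidates = [c for c in chunks if book is None or c["book"] == book]
    scored = [(cosine_similarity(query_vec, c["embedding"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def confidence(results: List[Tuple[float, Chunk]]) -> float:
    """Assumed scoring rule: the mean similarity of the retrieved chunks."""
    if not results:
        return 0.0
    return sum(score for score, _ in results) / len(results)


# Toy data with hand-written 2-D "embeddings" (real embeddings are much larger).
chunks = [
    {"book": "ml_intro.pdf", "text": "Supervised learning", "embedding": [1.0, 0.0]},
    {"book": "ml_intro.pdf", "text": "Matrix algebra", "embedding": [0.0, 1.0]},
    {"book": "stats.pdf", "text": "Regression basics", "embedding": [1.0, 1.0]},
]

results = search([1.0, 0.0], chunks, top_k=2)
stats_only = search([1.0, 0.0], chunks, top_k=2, book="stats.pdf")
```

In this toy example, `results` ranks "Supervised learning" first and "Regression basics" second, while `stats_only` contains only the chunk from stats.pdf.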
The system consists of four main components:
- PDF Processor (src/pdf_processor.py): Extracts and chunks text from PDFs
- Vector Database (src/vector_db.py): Manages embeddings and similarity search
- Q&A System (src/qa_system.py): Integrates Claude AI for answering questions
- CLI Interface (src/cli.py): Command-line interface for all operations
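The hand-off between the vector database and the Q&A system can be pictured as turning retrieved chunks into a prompt that carries numbered, attributable excerpts. The sketch below illustrates that idea only; `build_prompt`, the chunk dictionary layout, and the prompt wording are hypothetical and may differ from what `src/qa_system.py` actually does. No API call is made here.

```python
from typing import Dict, List


def build_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Assemble a prompt with numbered excerpts so answers can cite sources.

    Hypothetical helper; each chunk is assumed to have "book" and "text" keys.
    """
    context_parts = []
    for i, chunk in enumerate(chunks, start=1):
        # Number each excerpt and tag it with its source book.
        context_parts.append(f"[{i}] ({chunk['book']}) {chunk['text']}")
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the textbook excerpts below. "
        "Cite excerpts by their bracketed number.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}"
    )


retrieved = [
    {"book": "ml_intro.pdf", "text": "Machine learning is the study of ..."},
    {"book": "stats.pdf", "text": "Regression models estimate ..."},
]
prompt = build_prompt("What is machine learning?", retrieved)
```

Numbering the excerpts gives the model a compact way to cite sources, and the same numbers can be mapped back to book names when the answer is displayed.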
Key configuration options in .env:
- CHUNK_SIZE: Number of words per text chunk (default: 1000)
- CHUNK_OVERLAP: Overlap between chunks (default: 200)
- EMBEDDING_MODEL: Sentence transformer model (default: all-MiniLM-L6-v2)
- CLAUDE_MODEL: Claude model to use (default: claude-3-sonnet-20240229)
- MAX_TOKENS: Maximum tokens for Claude responses (default: 4000)
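As a rough illustration of how these options might be read and how CHUNK_SIZE and CHUNK_OVERLAP interact, here is a minimal sketch. `load_settings` and `chunk_words` are hypothetical helpers, not the project's code, and the sketch assumes the overlap is measured in words like the chunk size, which the option list does not state.

```python
import os
from typing import Dict, List, Mapping


def load_settings(env: Mapping[str, str]) -> Dict[str, object]:
    """Read options from a mapping such as os.environ, using the listed defaults."""
    return {
        "chunk_size": int(env.get("CHUNK_SIZE", "1000")),
        "chunk_overlap": int(env.get("CHUNK_OVERLAP", "200")),
        "embedding_model": env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        "claude_model": env.get("CLAUDE_MODEL", "claude-3-sonnet-20240229"),
        "max_tokens": int(env.get("MAX_TOKENS", "4000")),
    }


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into word chunks; consecutive chunks share `overlap` words.

    Assumption: overlap is counted in words.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


settings = load_settings(os.environ)
small_example = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
# small_example == ["a b c d", "d e f g", "g h i j"]
```

With the defaults, each window advances by 800 words, so the last 200 words of one chunk reappear at the start of the next. That overlap helps a sentence that straddles a chunk boundary stay retrievable.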