
Textbook AI - Vector Database and Q&A System

This project creates a vector database from PDF textbooks and provides a question-answering interface using Claude AI.

Features

  • Extract text from PDF textbooks
  • Create semantic embeddings and store in vector database
  • Answer questions using Claude AI with context from textbooks
  • Interactive command-line interface
  • Source citation and reference tracking

Project Structure

textbook_ai/
├── data/                   # PDF textbooks and processed data
├── models/                 # Vector database and embeddings
├── src/                    # Source code
│   ├── pdf_processor.py    # PDF text extraction
│   ├── vector_db.py        # Vector database operations
│   ├── qa_system.py        # Question-answering system
│   └── cli.py              # Command-line interface
├── config/                 # Configuration files
├── requirements.txt        # Python dependencies
└── README.md               # This file

Quick Start

  1. Run the setup script:
python setup.py
  2. Get API keys: a Claude API key from Anthropic (required) and, optionally, an OpenAI API key for alternative embeddings

  3. Edit the .env file with your API keys

  4. Add PDF textbooks to the data/pdfs/ directory

  5. Process the PDFs:
python src/cli.py process-pdfs
  6. Start asking questions:
python src/cli.py interactive

Manual Setup

If you prefer manual setup:

  1. Install dependencies:
pip install -r requirements.txt
  2. Download spaCy model:
python -m spacy download en_core_web_sm
  3. Create .env file:
cp config.env.example .env
# Edit .env with your API keys

Usage

  1. Process PDFs and create vector database:
python src/cli.py process-pdfs --input-dir data/pdfs --output-dir models/
  2. Ask questions:
python src/cli.py ask --question "What is the definition of machine learning?"
  3. Interactive mode:
python src/cli.py interactive
  4. Web interface:
streamlit run web_app.py
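
The subcommands above could be wired up roughly as follows. This is only a sketch: it assumes argparse (the actual src/cli.py may use a different library), and the default values for --input-dir and --output-dir are assumptions taken from the example invocation, not confirmed defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the three subcommands shown in Usage."""
    parser = argparse.ArgumentParser(
        prog="cli.py", description="Textbook AI command-line interface"
    )
    sub = parser.add_subparsers(dest="command", required=True)

    # process-pdfs: the defaults below are assumed, not taken from the project.
    process = sub.add_parser("process-pdfs", help="Extract text and build the vector database")
    process.add_argument("--input-dir", default="data/pdfs")
    process.add_argument("--output-dir", default="models/")

    # ask: answer a single question and exit.
    ask = sub.add_parser("ask", help="Answer a single question")
    ask.add_argument("--question", required=True)

    # interactive: takes no arguments in this sketch.
    sub.add_parser("interactive", help="Start an interactive session")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Selected command: {args.command}")
```

Each subcommand would then dispatch to the matching component, for example process-pdfs to the PDF processor and vector database, and ask to the Q&A system.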

Configuration

Create a .env file with:

ANTHROPIC_API_KEY=your_claude_api_key
OPENAI_API_KEY=your_openai_api_key  # Optional for embeddings
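
To illustrate how these values might be read, here is a minimal standard-library sketch that parses KEY=VALUE lines and checks that the required key is present. The function name parse_env_text is hypothetical; the project may well load .env through a dedicated library instead.

```python
def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and comment lines."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop an inline comment that starts with a space followed by '#'.
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


sample = """ANTHROPIC_API_KEY=your_claude_api_key
OPENAI_API_KEY=your_openai_api_key  # Optional for embeddings
"""
config = parse_env_text(sample)

# Only the Claude key is required; the OpenAI key is optional.
missing = [name for name in ("ANTHROPIC_API_KEY",) if not config.get(name)]
print("Missing required keys:", missing)
```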

Requirements

  • Python 3.8+
  • PDF textbooks in the data/pdfs/ directory
  • Claude API key from Anthropic
  • OpenAI API key (optional, for alternative embeddings)

Features

  • PDF Processing: Extract and clean text from PDF textbooks
  • Vector Database: Store text chunks with semantic embeddings using ChromaDB
  • Semantic Search: Find relevant content using sentence transformers
  • AI Q&A: Answer questions using Claude AI with textbook context
  • Source Citation: Track which books and sections were used
  • Multiple Interfaces: CLI and web interface options
  • Book Filtering: Search within specific textbooks
  • Confidence Scoring: Rate answer quality based on similarity scores
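
Semantic search, book filtering, and confidence scoring can be sketched together on toy data. Everything here is illustrative: the real system uses ChromaDB and sentence-transformer embeddings, whereas this example uses hand-written 2-D vectors, and the confidence formula (mean similarity of the returned chunks) is an assumption rather than the project's actual scoring rule.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, chunks, top_k=2, book=None):
    """Return the top_k (score, chunk) pairs, optionally restricted to one book."""
    candidates = [c for c in chunks if book is None or c["book"] == book]
    scored = [(cosine_similarity(query_vec, c["embedding"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def confidence(scored):
    """Hypothetical confidence: mean similarity of the retrieved chunks."""
    if not scored:
        return 0.0
    return sum(score for score, _ in scored) / len(scored)


# Toy index: file names, sections, and embeddings are made up for illustration.
chunks = [
    {"book": "ml_intro.pdf", "section": "1.1", "embedding": [1.0, 0.0]},
    {"book": "ml_intro.pdf", "section": "2.3", "embedding": [0.0, 1.0]},
    {"book": "stats.pdf", "section": "4.2", "embedding": [1.0, 1.0]},
]

results = search([1.0, 0.0], chunks)
for score, chunk in results:
    print(f"{chunk['book']} section {chunk['section']}: {score:.3f}")
print(f"confidence: {confidence(results):.3f}")
```

Passing book="stats.pdf" restricts the candidates before scoring, which is one simple way to search within a specific textbook.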

Architecture

The system consists of four main components:

  1. PDF Processor (src/pdf_processor.py): Extracts and chunks text from PDFs
  2. Vector Database (src/vector_db.py): Manages embeddings and similarity search
  3. Q&A System (src/qa_system.py): Integrates Claude AI for answering questions
  4. CLI Interface (src/cli.py): Command-line interface for all operations
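
One way to picture how these components fit together at question time is the sketch below. All names are hypothetical, the retrieval and generation steps are passed in as plain callables so the example runs without ChromaDB or the Claude API, and the chunk fields (book, page, text) are assumed rather than taken from the source code.

```python
def build_prompt(question, retrieved):
    """Format retrieved chunks as numbered excerpts followed by the question."""
    parts = []
    for i, chunk in enumerate(retrieved, start=1):
        parts.append(f"[{i}] ({chunk['book']}, p. {chunk['page']})\n{chunk['text']}")
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the textbook excerpts below. "
        "Cite excerpts by their bracketed number.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


def answer_question(question, retrieve, generate):
    """Retrieve context, build the prompt, and return the answer with its sources."""
    retrieved = retrieve(question)
    prompt = build_prompt(question, retrieved)
    return {
        "answer": generate(prompt),
        "sources": [(c["book"], c["page"]) for c in retrieved],
    }


# Stand-ins for the vector database lookup and the Claude call.
def fake_retrieve(question):
    return [{"book": "ml_intro.pdf", "page": 12, "text": "Machine learning is ..."}]


def fake_generate(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"


result = answer_question("What is machine learning?", fake_retrieve, fake_generate)
print(result["sources"])
```

Keeping retrieval and generation behind simple callables makes the flow easy to test without network access.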

Configuration

Key configuration options in .env:

  • CHUNK_SIZE: Number of words per text chunk (default: 1000)
  • CHUNK_OVERLAP: Overlap between consecutive chunks (default: 200)
  • EMBEDDING_MODEL: Sentence transformer model (default: all-MiniLM-L6-v2)
  • CLAUDE_MODEL: Claude model to use (default: claude-3-sonnet-20240229)
  • MAX_TOKENS: Maximum tokens for Claude responses (default: 4000)
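
As an illustration of how CHUNK_SIZE and CHUNK_OVERLAP interact, the sketch below splits text into overlapping word windows. It assumes the overlap is measured in words, like the chunk size; the function names are hypothetical and the actual chunking in src/pdf_processor.py may differ.

```python
import os


def load_chunk_settings(env=None):
    """Read CHUNK_SIZE and CHUNK_OVERLAP, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return int(env.get("CHUNK_SIZE", "1000")), int(env.get("CHUNK_OVERLAP", "200"))


def chunk_words(text, chunk_size=1000, overlap=200):
    """Split text into chunks of chunk_size words, each sharing overlap words with the previous one."""
    if chunk_size <= 0 or overlap < 0 or overlap >= chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


size, overlap = load_chunk_settings({"CHUNK_SIZE": "4", "CHUNK_OVERLAP": "2"})
for chunk in chunk_words("a b c d e f g h i j", size, overlap):
    print(chunk)
```

With the defaults, each new chunk starts 800 words after the previous one, so a passage that straddles a chunk boundary still appears whole in at least one chunk as long as it is no longer than the overlap.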
