This application allows users to upload PDF files or images, extract text, generate summaries, break text into paragraphs, and create questions/answers for selected paragraphs.
- PDF and image text extraction
- Text summarization using a T5 model
- Paragraph segmentation
- Question and answer generation
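As an illustration of the paragraph segmentation feature, here is a minimal sketch that splits extracted text on blank lines. The function name `split_paragraphs` and the blank-line rule are assumptions for illustration; the app's actual segmentation logic may differ.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split text into paragraphs wherever one or more blank lines occur."""
    # Normalize Windows (\r\n) and bare \r line endings to \n.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is a newline, optional whitespace, then another newline.
    blocks = re.split(r"\n\s*\n", normalized)
    # Join wrapped lines inside each paragraph and drop empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]


if __name__ == "__main__":
    sample = "First line\nwrapped.\n\nSecond paragraph."
    for number, paragraph in enumerate(split_paragraphs(sample), start=1):
        print(number, paragraph)
```

Numbering the paragraphs from 1, as in the usage loop, matches the way a paragraph number is selected in the app.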
- Python 3.7+
- Tesseract OCR must be installed for image text extraction
- Clone this repository or download the files
- Install the required dependencies:
pip install -r requirements.txt
- Install Tesseract OCR:
- Windows: Download and install from https://github.com/UB-Mannheim/tesseract/wiki
- macOS (with Homebrew):
brew install tesseract
- Linux (Debian/Ubuntu):
sudo apt install tesseract-ocr
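To confirm that Tesseract is reachable after installation, a quick check along these lines may help. It is a sketch using only the Python standard library, and the helper name `find_tesseract` is hypothetical. It only looks for a `tesseract` executable on your `PATH`, so an installation in a directory that is not on `PATH` will be reported as missing.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the full path of the tesseract executable, or None if not on PATH."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; image text extraction will likely fail.")
    else:
        print(f"tesseract found at: {path}")
```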
- Run the Streamlit app:
streamlit run app.py
- Upload a PDF or image file
- View the extracted text and summary
- Explore the paragraphs
- Select a paragraph number and click "Generate Q&A" to create questions and answers based on that paragraph
- For large files, processing may take some time
- The quality of text extraction from images depends on the clarity of the image
- For best results, use PDFs that contain selectable text rather than scanned page images
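One way to tell a scanned PDF from one with selectable text is to look at how much text direct extraction returns. The sketch below assumes that a page yielding almost no visible characters is a scanned image that would need OCR instead. The helper name `needs_ocr` and the `min_chars` threshold are illustrative assumptions, not part of this app.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Guess whether a page is a scanned image based on its extracted text.

    Assumption: a page with fewer than `min_chars` non-whitespace characters
    has no usable text layer. The threshold is arbitrary and may need tuning.
    """
    # Remove all whitespace so blank lines and padding do not count as text.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


if __name__ == "__main__":
    print(needs_ocr(""))
    print(needs_ocr("This page has a normal, selectable text layer."))
```

A pure function like this stays independent of whichever PDF library performs the extraction; only the extracted string is passed in.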