pushkar2510/Doc_Sum

PDF and Image Text Analyzer

This application allows users to upload PDF files or images, extract text, generate summaries, break text into paragraphs, and create questions/answers for selected paragraphs.

Features

  • PDF and image text extraction
  • Text summarization using a T5 model
  • Paragraph segmentation
  • Question and answer generation
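As an illustration of the paragraph segmentation feature, here is a minimal sketch that splits extracted text on blank lines. The function name `split_paragraphs` and the `min_chars` parameter are hypothetical, and the app itself may segment text differently.

```python
import re


def split_paragraphs(text, min_chars=1):
    """Split text into paragraphs on blank lines (hypothetical helper)."""
    # Normalize Windows and old Mac line endings to "\n".
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is one or more blank (or whitespace-only) lines.
    blocks = re.split(r"\n\s*\n", normalized)
    paragraphs = []
    for block in blocks:
        # Collapse line wraps and repeated spaces inside a paragraph.
        cleaned = " ".join(block.split())
        if len(cleaned) >= min_chars:
            paragraphs.append(cleaned)
    return paragraphs


sample = "First paragraph,\nwrapped over two lines.\n\n\nSecond paragraph."
for number, paragraph in enumerate(split_paragraphs(sample), start=1):
    print(number, paragraph)
```

Numbering the paragraphs from 1 matches the "select a paragraph number" step in the usage instructions.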

Requirements

  • Python 3.7+
  • Tesseract OCR must be installed for image text extraction

Installation

  1. Clone this repository or download the files
  2. Install the required dependencies:
     pip install -r requirements.txt
  3. Install Tesseract OCR (required for image text extraction)
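To confirm that the Tesseract step worked, one option is to check whether the `tesseract` executable is on your PATH. This is a sketch that assumes the binary has its default name; the helper `tesseract_available` is hypothetical, and if the app is configured with an explicit path to the binary, this check will not reflect that.

```python
import shutil


def tesseract_available(binary_name="tesseract"):
    """Return True if an executable with this name is found on PATH."""
    # shutil.which returns the full path to the executable, or None.
    return shutil.which(binary_name) is not None


if tesseract_available():
    print("Tesseract found on PATH")
else:
    print("Tesseract not found; image text extraction will likely fail")
```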

Usage

  1. Run the Streamlit app:
     streamlit run app.py
  2. Upload a PDF or image file
  3. View the extracted text and summary
  4. Explore the paragraphs
  5. Select a paragraph number and click "Generate Q&A" to create questions and answers based on that paragraph

Important Notes

  • For large files, processing may take some time
  • The quality of text extraction from images depends on the clarity of the image
  • For optimal performance, ensure your PDF contains selectable text (not scanned images)
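One rough way to tell whether a PDF contains selectable text is to look at how much text each page yields after extraction. The sketch below assumes you already have one extracted string per page from whatever PDF library you use. The function `looks_scanned` and both thresholds are hypothetical and would need tuning for real documents.

```python
def looks_scanned(page_texts, min_chars_per_page=20, max_empty_ratio=0.5):
    """Guess whether a PDF is mostly scanned images (hypothetical heuristic).

    page_texts: list of strings, one per page, as returned by a PDF text extractor.
    """
    if not page_texts:
        # No pages or no extractable text at all: treat as scanned.
        return True
    # Count pages that yield almost no text after stripping whitespace.
    empty_pages = sum(
        1 for text in page_texts if len(text.strip()) < min_chars_per_page
    )
    # Flag the document when more than the allowed share of pages is empty.
    return empty_pages / len(page_texts) > max_empty_ratio


pages = ["This page has plenty of selectable text.", "", "   "]
if looks_scanned(pages):
    print("Little selectable text found; this PDF may be scanned images")
```

A PDF flagged this way is likely to give poor results unless its pages are processed as images through OCR.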
