Local LLaMA RAG to help with homework through locally-saved textbooks. The script extracts text from a PDF file and generates a summary using AI models (GPT or LLaMA).
Ensure you have the required dependencies installed:
pip install langchain_community langchain_openai pytesseract pdf2image tqdm
Additionally, install Tesseract OCR and Poppler:
- Linux (Ubuntu/Debian):
sudo apt install tesseract-ocr poppler-utils
- MacOS:
brew install tesseract poppler
- Windows:
- Download and install Tesseract from Tesseract GitHub.
- Download and install Poppler from Poppler for Windows.
- Add both to your system PATH.
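The dependencies above hint at the extraction pipeline: pdf2image (backed by Poppler) rasterizes each PDF page, and pytesseract (backed by Tesseract) runs OCR on the resulting images. A minimal sketch, assuming that workflow — the function name `extract_text_from_pdf` and the 200 dpi setting are illustrative, not taken from the script:

```python
def extract_text_from_pdf(pdf_path, dpi=200):
    """Rasterize each PDF page with Poppler, then OCR it with Tesseract."""
    # Imported lazily so this module loads even before the OCR
    # dependencies (pdf2image, pytesseract) are installed.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```

If Tesseract or Poppler is missing from your PATH, this is the step that fails, which is why the installation notes above matter.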
Run the script using the following command:
python script.py --pdf_file path/to/input.pdf --output_file path/to/output.txt
| Argument | Description | Default Value |
|---|---|---|
| --summary_format_files | Path to the summary format file | prompts/summary_format.txt |
| --prompt_file | Path to the prompt template file | prompts/story_summary.txt |
| --use_gpt | Use GPT-based model (True) or LLaMA (False) | True |
| --pdf_file | Path to the input PDF file | No default (required) |
| --output_file | Path to save the output text summary | outputs/summary.txt |
| --chunk_size | Number of pages to process at once | 5 |
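The table above maps onto a standard argparse setup. A sketch of what that interface might look like (the actual script's parser may differ; `build_parser` and the boolean parsing of --use_gpt are assumptions):

```python
import argparse

def build_parser():
    # Mirrors the argument table above; defaults are as documented.
    parser = argparse.ArgumentParser(
        description="Summarize a PDF with GPT or LLaMA.")
    parser.add_argument("--summary_format_files",
                        default="prompts/summary_format.txt")
    parser.add_argument("--prompt_file",
                        default="prompts/story_summary.txt")
    # Accept "--use_gpt False" on the command line as a false value.
    parser.add_argument("--use_gpt", default=True,
                        type=lambda s: str(s).lower() != "false")
    parser.add_argument("--pdf_file", required=True)
    parser.add_argument("--output_file", default="outputs/summary.txt")
    parser.add_argument("--chunk_size", type=int, default=5)
    return parser
```

Note that argparse does not parse "False" as a boolean by itself, hence the lambda converter for --use_gpt.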
- Summarize using GPT (default):
  python script.py --pdf_file myfile.pdf --output_file mysummary.txt
- Summarize using LLaMA:
  python script.py --pdf_file myfile.pdf --output_file mysummary.txt --use_gpt False
- Process PDF in chunks of 10 pages:
  python script.py --pdf_file myfile.pdf --chunk_size 10
- Ensure Tesseract and Poppler are installed and properly configured in your environment.
- For large PDFs, increase system memory or reduce --chunk_size.
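The --chunk_size behavior can be illustrated with a small batching helper (a hypothetical helper for illustration, not code from the script):

```python
def chunk_pages(num_pages, chunk_size=5):
    """Group 1-based page numbers into batches of at most chunk_size."""
    pages = list(range(1, num_pages + 1))
    return [pages[i:i + chunk_size]
            for i in range(0, len(pages), chunk_size)]
```

Smaller batches mean fewer page images held in memory at once, which is why reducing --chunk_size helps on large PDFs.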