A project that explores libraries for extracting text from PDFs, such as:
- pdfminer
- PyMuPDF
- PyPDF2
- pypdfium2
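
To compare these libraries on the same file, a small harness can help. The sketch below is only an illustration, not this repository's code: the helper names (`with_pdfminer`, `with_pymupdf`, `with_pypdf2`, `compare_extractors`) are hypothetical, and it assumes the packages are installed as `pdfminer.six`, `PyMuPDF` and `PyPDF2`. Each library is imported lazily, so a missing package is reported as an error instead of crashing the run.

```python
import time
from typing import Callable, Dict


def with_pdfminer(path: str) -> str:
    # Assumes the pdfminer.six distribution is installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def with_pymupdf(path: str) -> str:
    # PyMuPDF is imported under the name "fitz".
    import fitz
    doc = fitz.open(path)
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()


def with_pypdf2(path: str) -> str:
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    # extract_text() may return None for pages without a text layer.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def compare_extractors(
    path: str, extractors: Dict[str, Callable[[str], str]]
) -> Dict[str, dict]:
    """Run every extractor on the same file and record size and timing."""
    results = {}
    for name, extract in extractors.items():
        start = time.perf_counter()
        try:
            text = extract(path)
            results[name] = {
                "ok": True,
                "chars": len(text),
                "seconds": time.perf_counter() - start,
            }
        except Exception as exc:  # missing library, unreadable file, etc.
            results[name] = {"ok": False, "error": repr(exc)}
    return results
```

A call such as `compare_extractors("sample.pdf", {"pdfminer": with_pdfminer, "PyMuPDF": with_pymupdf, "PyPDF2": with_pypdf2})` returns one entry per library; `sample.pdf` is a placeholder path.
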
I also explore libraries for extracting text from images, such as:
- pytesseract
- easyocr
- transformers models from Hugging Face
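
As a rough illustration of the first two OCR options, the sketch below dispatches to either engine by name. The function `ocr_image` is hypothetical and not taken from this repository. It assumes `pytesseract` (with the Tesseract binary installed), `Pillow` and `easyocr` are available, and that the text is English. The transformers route is left out because the source does not say which model is used.

```python
def ocr_image(path: str, engine: str = "pytesseract") -> str:
    """Return the text found in an image using the chosen OCR engine."""
    if engine == "pytesseract":
        # Requires the Tesseract executable to be installed on the system.
        from PIL import Image
        import pytesseract
        return pytesseract.image_to_string(Image.open(path))
    if engine == "easyocr":
        import easyocr
        # Assumption: English text; model weights are downloaded on first use.
        reader = easyocr.Reader(["en"])
        # detail=0 makes readtext return plain strings instead of boxes.
        return "\n".join(reader.readtext(path, detail=0))
    raise ValueError(f"Unknown OCR engine: {engine!r}")
```

Running both engines on the same image, for example `ocr_image("page.png", "pytesseract")` and `ocr_image("page.png", "easyocr")`, makes their differences easy to inspect; `page.png` is a placeholder.
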
Additionally, I explore how to extract text from PDFs using LLMs:
- Gemini
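
One possible way to send a PDF to Gemini is sketched below. Everything specific here is an assumption rather than this project's method: it uses the `google-genai` SDK, reads the key from a `GEMINI_API_KEY` environment variable, and the model name and prompt are placeholders to adjust. The helper `extract_text_with_gemini` is hypothetical.

```python
import os
from pathlib import Path


def extract_text_with_gemini(
    pdf_path: str,
    model: str = "gemini-2.0-flash",  # assumption: substitute any available model
    prompt: str = "Extract all text from this document as plain text.",
) -> str:
    # Read the file first so a wrong path fails before any API setup.
    pdf_bytes = Path(pdf_path).read_bytes()

    # Imported lazily; assumes the google-genai package is installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    response = client.models.generate_content(
        model=model,
        contents=[
            types.Part.from_bytes(data=pdf_bytes, mime_type="application/pdf"),
            prompt,
        ],
    )
    return response.text
```

Unlike the parsing libraries above, the output depends on the model and prompt, so it is worth comparing against a parser's output on the same file.
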
Step 1. Navigate to the root directory of the repository and create a new virtual environment for development with uv:
uv venv .venv
Step 2. Activate the environment (the path below is for Git Bash on Windows; on Linux or macOS, use .venv/bin/activate instead):
source .venv/Scripts/activate
Step 3. Install the dependencies:
uv pip install -e .
Open the notebook, select this environment as the kernel, and run the cells.
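
Before running the cells, a quick check like the one below can confirm that the notebook is using the virtual environment and show which libraries are importable. It is a sketch using only the standard library; `in_virtualenv` and `check_imports` are hypothetical helpers, and the module names in the usage note are the usual import names for the libraries listed above (for example `fitz` for PyMuPDF), which may differ from what this project's dependencies install.

```python
import importlib.util
import sys
from typing import Dict, Iterable


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment, not the base install.
    return sys.prefix != sys.base_prefix


def check_imports(modules: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}
```

For example, `check_imports(["pdfminer", "fitz", "PyPDF2", "pypdfium2", "pytesseract", "easyocr", "transformers"])` returns a dictionary whose `False` entries point at packages missing from the selected kernel.
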