Having fun with pdf document processing libraries 🧐

pilarcode/pdf_lab

Pdf & images lab

A project that explores Python libraries for extracting text from PDFs, such as:

  • pdfminer
  • PyMuPDF
  • PyPDF2
  • pypdfium2
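
To make the comparison concrete, here is a minimal sketch of how the four libraries could be called side by side. The function names and the `compare_extractors` helper are hypothetical and not taken from this repository; the library calls reflect the public APIs of recent releases and may differ in older versions. Imports are done inside each function so that a missing library only disables its own extractor.

```python
from typing import Callable, Dict, Optional


def extract_pdfminer(path: str) -> str:
    # pdfminer.six exposes a one-call, high-level text extractor.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def extract_pymupdf(path: str) -> str:
    # PyMuPDF is imported as `fitz`; pages are iterable.
    import fitz
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_pypdf2(path: str) -> str:
    # extract_text() can return an empty value for image-only pages.
    from PyPDF2 import PdfReader
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_pypdfium2(path: str) -> str:
    # Assumes pypdfium2 v4 or later for get_text_range().
    import pypdfium2 as pdfium
    pdf = pdfium.PdfDocument(path)
    texts = []
    for index in range(len(pdf)):
        textpage = pdf[index].get_textpage()
        texts.append(textpage.get_text_range())
    return "\n".join(texts)


# Hypothetical registry; the keys are labels chosen for this sketch.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    "pdfminer": extract_pdfminer,
    "pymupdf": extract_pymupdf,
    "pypdf2": extract_pypdf2,
    "pypdfium2": extract_pypdfium2,
}


def compare_extractors(
    path: str,
    extractors: Dict[str, Callable[[str], str]] = EXTRACTORS,
) -> Dict[str, Optional[int]]:
    """Return the number of extracted characters per library.

    A value of None means the library is not installed.
    """
    results: Dict[str, Optional[int]] = {}
    for name, extract in extractors.items():
        try:
            results[name] = len(extract(path))
        except ImportError:
            results[name] = None
    return results
```

Calling `compare_extractors("sample.pdf")` (with `sample.pdf` standing in for any local file) gives a quick first signal of how much text each library recovers. Character counts are only a rough proxy, so the extracted text itself is still worth reading when judging quality.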

It also explores libraries for extracting text from images, such as:

  • pytesseract
  • easyocr
  • transformers models from huggingface
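
As an illustration of the OCR side, the sketch below wraps pytesseract and easyocr behind two small functions and adds a whitespace clean-up step. The function names, the default language list, and the clean-up helper are assumptions made for this example, not code from the repository. pytesseract additionally needs the Tesseract binary installed on the system.

```python
from typing import Sequence


def ocr_tesseract(image_path: str) -> str:
    # pytesseract is a wrapper around the Tesseract command-line tool.
    import pytesseract
    from PIL import Image
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def ocr_easyocr(image_path: str, languages: Sequence[str] = ("en",)) -> str:
    # detail=0 makes readtext() return plain strings instead of
    # (bounding box, text, confidence) tuples.
    import easyocr
    reader = easyocr.Reader(list(languages))
    fragments = reader.readtext(image_path, detail=0)
    return "\n".join(fragments)


def clean_ocr_text(text: str) -> str:
    """Collapse runs of whitespace and drop empty lines from OCR output."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Passing the raw output of both engines through `clean_ocr_text` makes them easier to compare, since they differ in how they place spaces and line breaks.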

Additionally, it explores extracting text from PDFs with LLMs:

  • Gemini

Setup

Step 1. Navigate to the root directory of the repository and create a new virtual environment for development with uv:

uv venv .venv

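Step 2. Activate the environment (the path below applies to Git Bash on Windows; on Linux or macOS the script is at .venv/bin/activate):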

source .venv/Scripts/activate

Step 3. Install the dependencies:

uv pip install -e .

Usage

Open the notebook, select the .venv environment as its kernel, and run the cells.
