A Python tool for extracting tables from images and PDF files using computer vision and OCR techniques.
- Extract tables from both PDF and image files
- Support for multiple OCR engines (Tesseract and PaddleOCR)
- Multi-language OCR support
- Automatic PDF to image conversion
- Debug visualization output
- Customizable image preprocessing
pip install -r requirements.txt
- Poppler: Required for PDF processing
- Ubuntu/Debian:
sudo apt-get install poppler-utils
- macOS:
brew install poppler
- Windows: Download and install from poppler releases, then add the binary location to your PATH
- Tesseract OCR: Optional, only required if you want to use Tesseract as the OCR engine
- Ubuntu/Debian:
sudo apt-get install tesseract-ocr
- macOS:
brew install tesseract
- Windows:
- Chocolatey:
choco install tesseract
- Manual: Download and install from Tesseract OCR releases, then add the binary location to your PATH
python main.py input_file [options]
input_file
: Path to the input file (image or PDF)
--output_dir
: Output directory for results (default: "output")
--scale_factor
: Scale factor for image processing (default: 2)
--ocr_engine
: OCR engine to use ("tesseract" or "paddle") (default: "paddle")
--ocr_lang
: Language to use for OCR (default: "pt")
- For Tesseract: "eng", "por", etc.
- For PaddleOCR: "en", "pt", etc.
--preprocess
: Preprocessing profile for the image before OCR (default: "default")
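The options above map naturally onto an `argparse` parser. The following is a minimal sketch of how such a parser could look, not the actual contents of `main.py`. The function name `build_parser`, the `int` type for `--scale_factor`, and the restriction of `--ocr_engine` to two choices are assumptions made for illustration; the option names and defaults are taken from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Extract tables from images and PDF files."
    )
    parser.add_argument("input_file", help="Path to the input file (image or PDF)")
    parser.add_argument("--output_dir", default="output",
                        help="Output directory for results")
    # Assumption: the scale factor is an integer; the real tool may accept floats.
    parser.add_argument("--scale_factor", type=int, default=2,
                        help="Scale factor for image processing")
    parser.add_argument("--ocr_engine", choices=["tesseract", "paddle"],
                        default="paddle", help="OCR engine to use")
    parser.add_argument("--ocr_lang", default="pt",
                        help='Language for OCR, e.g. "eng" (Tesseract) or "en" (PaddleOCR)')
    parser.add_argument("--preprocess", default="default",
                        help="Preprocessing profile for the image before OCR")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["table.png", "--ocr_engine", "tesseract",
                                  "--ocr_lang", "eng"])
print(args.ocr_engine, args.ocr_lang, args.output_dir)
```

Note that the language code depends on the engine, so a value such as "pt" is only meaningful when the PaddleOCR engine is selected.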
# Process a PDF file
python main.py document.pdf --ocr_engine paddle --ocr_lang en
# Process an image with custom settings
python main.py table.png --output_dir results --scale_factor 3 --ocr_engine tesseract --ocr_lang eng
The tool creates the following directory structure for outputs:
output/
├── debug/
│   └── page_name/
│       └── [debug images and intermediate results]
├── csv/
│   └── page_name/
│       └── [extracted table data in CSV format]
└── pdf_pages/
    └── [converted PDF pages as images] (only for PDF inputs)
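If you post-process the results from a script, the layout above can be reproduced with `pathlib`. This is a sketch under the assumption that the per-page folders sit directly under `debug/` and `csv/` as drawn; the helper name `make_page_dirs` is hypothetical and is not part of the tool.

```python
from pathlib import Path


def make_page_dirs(output_dir: str, page_name: str) -> dict[str, Path]:
    """Create and return the per-page debug and csv directories (hypothetical helper)."""
    root = Path(output_dir)
    dirs = {
        "debug": root / "debug" / page_name,
        "csv": root / "csv" / page_name,
    }
    for path in dirs.values():
        # parents=True builds output/debug/... in one call; exist_ok avoids errors on reruns.
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

For example, `make_page_dirs("output", "page_1")["csv"]` points at `output/csv/page_1`, where the CSV files for that page would be expected.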
- When processing PDFs, the tool automatically converts each page to an image before extraction
- If the output directory already exists, the tool will prompt for permission to delete it
- Debug output includes visualization of the table detection process
- The tool will create separate output directories for each page when processing multi-page PDFs
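The confirmation step for an existing output directory could be implemented roughly as follows. This is an illustrative sketch, not the tool's actual code: the function name `prepare_output_dir`, the prompt wording, and the injectable `confirm` callable are all assumptions, the last one added so the logic can be exercised without a terminal.

```python
import shutil
from pathlib import Path
from typing import Callable


def prepare_output_dir(output_dir: str,
                       confirm: Callable[[str], str] = input) -> bool:
    """Return True if a fresh, empty output directory is ready (hypothetical helper)."""
    path = Path(output_dir)
    if path.exists():
        answer = confirm(f"Output directory '{path}' already exists. Delete it? [y/N] ")
        if answer.strip().lower() not in ("y", "yes"):
            # The user declined, so leave the existing results untouched.
            return False
        shutil.rmtree(path)
    path.mkdir(parents=True)
    return True
```

Defaulting to "no" on anything other than an explicit `y` or `yes` keeps earlier results safe if the prompt is answered by accident.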