ja0n/table-ocrer

Table Extractor

A Python tool for extracting tables from images and PDF files using computer vision and OCR techniques.

Features

  • Extract tables from both PDF and image files
  • Support for multiple OCR engines (Tesseract and PaddleOCR)
  • Multi-language OCR support
  • Automatic PDF to image conversion
  • Debug visualization output
  • Customizable image preprocessing

Setup

Python Dependencies

pip install -r requirements.txt

System Dependencies

  • Poppler: Required for PDF processing
    • Ubuntu/Debian: sudo apt-get install poppler-utils
    • macOS: brew install poppler
    • Windows: Download and install from the Poppler releases, then add the binary location to your PATH
  • Tesseract OCR: Optional, only required if you want to use Tesseract as the OCR engine
    • Ubuntu/Debian: sudo apt-get install tesseract-ocr
    • macOS: brew install tesseract
    • Windows:
      • Chocolatey: choco install tesseract
      • Manual: Download and install from Tesseract OCR releases, then add the binary location to your PATH
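A quick way to confirm the system dependencies above are reachable is to look for their executables on PATH. The following is a minimal sketch, not part of the repository. It assumes that Poppler exposes the `pdftoppm` executable and that Tesseract exposes `tesseract`; the helper name `find_missing` is hypothetical.

```python
import shutil

# Assumption: "pdftoppm" is one of the executables installed by Poppler,
# and "tesseract" is the Tesseract command-line binary.
REQUIRED_TOOLS = {"pdftoppm": "Poppler (PDF processing)"}
OPTIONAL_TOOLS = {"tesseract": "Tesseract OCR (only for --ocr_engine tesseract)"}


def find_missing(tools, which=shutil.which):
    """Return the labels of tools whose executable is not found on PATH."""
    return [label for exe, label in tools.items() if which(exe) is None]


if __name__ == "__main__":
    for label in find_missing(REQUIRED_TOOLS):
        print(f"missing (required): {label}")
    for label in find_missing(OPTIONAL_TOOLS):
        print(f"missing (optional): {label}")
```

If the script prints nothing, both executables were found. On Windows, a "missing" line usually means the binary location has not been added to PATH yet.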

Usage

python main.py input_file [options]

Arguments

  • input_file: Path to the input file (image or PDF)

Optional Arguments

  • --output_dir: Output directory for results (default: "output")
  • --scale_factor: Scale factor for image processing (default: 2)
  • --ocr_engine: OCR engine to use ("tesseract" or "paddle") (default: "paddle")
  • --ocr_lang: Language to use for OCR (default: "pt")
    • For Tesseract: "eng", "por", etc.
    • For PaddleOCR: "en", "pt", etc.
  • --preprocess: Preprocessing profile for the image before OCR (default: "default")
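For readers who want to see how the options above fit together, here is a rough `argparse` sketch that mirrors the documented flags and defaults. It is an illustration only and not the actual contents of `main.py`. The function name `build_parser`, the help strings, and the choice of `float` for `--scale_factor` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented CLI (illustrative, not main.py itself)."""
    parser = argparse.ArgumentParser(
        description="Extract tables from an image or PDF file."
    )
    parser.add_argument("input_file", help="Path to the input file (image or PDF)")
    parser.add_argument("--output_dir", default="output",
                        help="Output directory for results")
    # Assumption: the scale factor is numeric; the README only states the default of 2.
    parser.add_argument("--scale_factor", type=float, default=2,
                        help="Scale factor for image processing")
    parser.add_argument("--ocr_engine", choices=["tesseract", "paddle"],
                        default="paddle", help="OCR engine to use")
    # Language codes differ per engine: "eng"/"por" for Tesseract, "en"/"pt" for PaddleOCR.
    parser.add_argument("--ocr_lang", default="pt", help="Language to use for OCR")
    parser.add_argument("--preprocess", default="default",
                        help="Preprocessing profile for the image before OCR")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["table.png", "--scale_factor", "3", "--ocr_engine", "tesseract", "--ocr_lang", "eng"]
    )
    print(args)
```

Note that the language code has to match the chosen engine: the default "pt" is a PaddleOCR code, so a Tesseract run would pass something like `--ocr_lang por` instead.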

Example Commands

# Process a PDF file
python main.py document.pdf --ocr_engine paddle --ocr_lang en

# Process an image with custom settings
python main.py table.png --output_dir results --scale_factor 3 --ocr_engine tesseract --ocr_lang eng

Output Structure

The tool creates the following directory structure for outputs:

output/
├── debug/
│   └── page_name/
│       └── [debug images and intermediate results]
├── csv/
│   └── page_name/
│       └── [extracted table data in CSV format]
└── pdf_pages/
    └── [converted PDF pages as images] (only for PDF inputs)
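Given the layout above, the extracted tables can be collected for further processing with a few lines of Python. This is a sketch under assumptions the README does not confirm: that the files under `csv/page_name/` carry a `.csv` extension and are UTF-8 encoded. The helper name `load_tables` is hypothetical.

```python
import csv
from pathlib import Path


def load_tables(output_dir="output"):
    """Map each CSV file under <output_dir>/csv/<page_name>/ to its rows."""
    tables = {}
    csv_root = Path(output_dir) / "csv"
    # Assumption: one directory level per page, files ending in ".csv".
    for csv_path in sorted(csv_root.glob("*/*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            key = csv_path.relative_to(csv_root).as_posix()
            tables[key] = list(csv.reader(handle))
    return tables


if __name__ == "__main__":
    for name, rows in load_tables("output").items():
        print(f"{name}: {len(rows)} rows")
```

Each key is the path relative to `csv/` (page directory plus file name), and each value is a list of rows, with every cell kept as a string.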

Notes

  • When processing PDFs, the tool automatically converts each page to an image before extraction
  • If the output directory already exists, the tool will prompt for permission to delete it
  • Debug output includes visualization of the table detection process
  • The tool will create separate output directories for each page when processing multi-page PDFs
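The confirm-before-delete behavior described in the notes could look roughly like the sketch below. It illustrates the pattern and is not the tool's actual code: the function name `prepare_output_dir`, the prompt wording, and the accepted answers ("y" or "yes") are all assumptions.

```python
import shutil
from pathlib import Path


def prepare_output_dir(output_dir, ask=input):
    """Create output_dir, asking before deleting an existing one.

    Returns True when the directory is ready, False if the user declined.
    """
    path = Path(output_dir)
    if path.exists():
        # Assumption: anything other than "y"/"yes" keeps the existing directory.
        answer = ask(f"'{path}' already exists. Delete it? [y/N] ")
        if answer.strip().lower() not in {"y", "yes"}:
            return False
        shutil.rmtree(path)
    path.mkdir(parents=True)
    return True
```

Passing the prompt function as a parameter keeps the sketch usable in non-interactive runs, for example `prepare_output_dir("output", ask=lambda _: "y")` in a batch script.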
