Table Extractor

A Python tool for extracting tables from images and PDF files using computer vision and OCR techniques.

Features

- Extract tables from both PDF and image files
- Support for multiple OCR engines (Tesseract and PaddleOCR)
- Multi-language OCR support
- Automatic PDF to image conversion
- Debug visualization output
- Customizable image preprocessing

Setup

Python Dependencies

pip install -r requirements.txt

System Dependencies

- Poppler: Required for PDF processing
  - Ubuntu/Debian: sudo apt-get install poppler-utils
  - macOS: brew install poppler
  - Windows: Download and install from poppler releases, then add the binary location to your PATH
- Tesseract OCR: Optional, only required if you want to use Tesseract as the OCR engine
  - Ubuntu/Debian: sudo apt-get install tesseract-ocr
  - macOS: brew install tesseract
  - Windows:
    - Chocolatey: choco install tesseract
    - Manual: Download and install from Tesseract OCR releases, then add the binary location to your PATH
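
Before running the tool, it can help to confirm that the system binaries are reachable. The sketch below is not part of the repository. It assumes Poppler is detected through its `pdftoppm` executable (shipped with poppler-utils) and Tesseract through its `tesseract` executable; the helper name `find_missing_binaries` is hypothetical.

```python
import shutil


def find_missing_binaries(names):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # pdftoppm ships with Poppler; tesseract is only needed
    # when running with --ocr_engine tesseract.
    missing = find_missing_binaries(["pdftoppm", "tesseract"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

If `tesseract` is reported missing but you only plan to use PaddleOCR, the message can be ignored.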

Usage

python main.py input_file [options]

Arguments

input_file: Path to the input file (image or PDF)

Optional Arguments

- --output_dir: Output directory for results (default: "output")
- --scale_factor: Scale factor for image processing (default: 2)
- --ocr_engine: OCR engine to use, either "tesseract" or "paddle" (default: "paddle")
- --ocr_lang: Language to use for OCR (default: "pt"). The code format depends on the engine:
  - For Tesseract: "eng", "por", etc.
  - For PaddleOCR: "en", "pt", etc.
- --preprocess: Preprocessing profile for the image before OCR (default: "default")
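
For readers who want to see how these options fit together, here is a minimal `argparse` sketch that mirrors the documented names and defaults. It is an illustration, not the actual contents of main.py; the `build_parser` name, the help strings, and the choice of `float` for `--scale_factor` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options documented above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Extract tables from images and PDF files."
    )
    parser.add_argument("input_file", help="Path to the input file (image or PDF)")
    parser.add_argument("--output_dir", default="output",
                        help="Output directory for results")
    # Assumption: the scale factor is numeric; the README does not state its type.
    parser.add_argument("--scale_factor", type=float, default=2,
                        help="Scale factor for image processing")
    parser.add_argument("--ocr_engine", choices=["tesseract", "paddle"],
                        default="paddle", help="OCR engine to use")
    parser.add_argument("--ocr_lang", default="pt",
                        help="OCR language code in the selected engine's format")
    parser.add_argument("--preprocess", default="default",
                        help="Preprocessing profile applied before OCR")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(
    ["table.png", "--ocr_engine", "tesseract", "--ocr_lang", "eng"]
)
print(args.input_file, args.output_dir, args.ocr_engine, args.ocr_lang)
```

Note that the default language "pt" is written in PaddleOCR's format, so when switching to Tesseract it is safest to pass `--ocr_lang` explicitly (for example "por").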

Example Commands

# Process a PDF file
python main.py document.pdf --ocr_engine paddle --ocr_lang en

# Process an image with custom settings
python main.py table.png --output_dir results --scale_factor 3 --ocr_engine tesseract --ocr_lang eng

Output Structure

The tool creates the following directory structure for outputs:

output/
├── debug/
│   └── page_name/
│       └── [debug images and intermediate results]
├── csv/
│   └── page_name/
│       └── [extracted table data in CSV format]
└── pdf_pages/
    └── [converted PDF pages as images] (only for PDF inputs)
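
Given the layout above, the extracted tables can be gathered per page with a few lines of Python. This is a sketch under assumptions: it supposes the files under `csv/page_name/` carry a `.csv` extension, and the helper name `list_csv_outputs` is hypothetical rather than something the tool provides.

```python
from pathlib import Path


def list_csv_outputs(output_dir="output"):
    """Map each page directory under <output_dir>/csv to its sorted CSV files."""
    csv_root = Path(output_dir) / "csv"
    if not csv_root.is_dir():
        # Nothing has been extracted yet, or the directory name differs.
        return {}
    return {
        page_dir.name: sorted(page_dir.glob("*.csv"))
        for page_dir in sorted(csv_root.iterdir())
        if page_dir.is_dir()
    }


if __name__ == "__main__":
    for page, files in list_csv_outputs("output").items():
        print(page, [f.name for f in files])
```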

Notes

- When processing PDFs, the tool automatically converts each page to an image before extraction.
- If the output directory already exists, the tool will prompt for permission to delete it.
- Debug output includes visualization of the table detection process.
- The tool will create separate output directories for each page when processing multi-page PDFs.
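
The prompt-before-delete behavior mentioned above could look roughly like the following. This is a hedged sketch, not the tool's actual code: the function name `prepare_output_dir`, the prompt wording, and the choice to return `False` when the user declines are all assumptions.

```python
import shutil
from pathlib import Path


def prepare_output_dir(output_dir, ask=input):
    """Create a fresh output directory, asking before deleting an existing one.

    `ask` is injectable so the prompt can be replaced in scripts or tests.
    Returns True if the directory is ready, False if the user declined.
    """
    path = Path(output_dir)
    if path.exists():
        answer = ask(f"'{path}' already exists. Delete it? [y/N] ")
        if answer.strip().lower() not in ("y", "yes"):
            return False
        shutil.rmtree(path)
    path.mkdir(parents=True)
    return True
```

Passing the prompt function as a parameter keeps the sketch usable in non-interactive runs, for example `prepare_output_dir("output", ask=lambda _: "y")`.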
