exaOCR - Fast OCR to Markdown Pipeline

Overview

exaOCR is a production-ready OCR pipeline that converts PDFs, images, and office documents into clean Markdown. Built with FastAPI and Streamlit, it is optimized for CPU-only systems and preserves tables, forms, and layout structure.

API Documentation Page: https://ikantkode.github.io/exaOCR
Live Demo: http://localhost:7601
Next Evolution: pdfLLM

Key Results

Metric	Large Document	Small Document
Wall Time	~250 s	~15 s
Parallel Pages	8 cores	8 cores
Memory Peak	<2 GB	<500 MB
Table Accuracy	95%+	95%+

Supported Formats

Category	Extensions	Conversion Path
PDF	`.pdf`	Direct
Images	`.jpg .jpeg .png .tiff .bmp`	`img2pdf` → PDF
Office	`.doc .docx .txt .csv`	LibreOffice → PDF
Future	`.xlsx .pptx .rtf`	Planned
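
The routing in the table above can be sketched as a simple lookup on the file extension. This is a minimal illustration, not the project's actual code: the `CONVERSION_PATHS` mapping, the path labels, and the `conversion_path` helper are hypothetical names introduced here.

```python
from pathlib import Path

# Extension groups copied from the Supported Formats table.
# The labels ("direct", "img2pdf", ...) are illustrative, not exaOCR identifiers.
CONVERSION_PATHS = {
    "direct": {".pdf"},
    "img2pdf": {".jpg", ".jpeg", ".png", ".tiff", ".bmp"},
    "libreoffice": {".doc", ".docx", ".txt", ".csv"},
    "planned": {".xlsx", ".pptx", ".rtf"},
}


def conversion_path(filename: str) -> str:
    """Return the conversion path label for a filename, or 'unsupported'."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match on the extension
    for label, extensions in CONVERSION_PATHS.items():
        if suffix in extensions:
            return label
    return "unsupported"
```

A caller could reject files early when the result is `"planned"` or `"unsupported"` instead of sending them through the OCR stage.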

Architecture

Streamlit <--> FastAPI <--> OCR Core

- FastAPI handles uploads, progress, and downloads
- OCRmyPDF + Tesseract adds searchable text
- PyMuPDF4LLM extracts Markdown with table preservation
- Pages processed in parallel across CPU cores
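
As a rough sketch of the parallel page step, the snippet below fans pages out over a `concurrent.futures` thread pool and joins the results in page order. The `process_page` function is a hypothetical placeholder for the real OCR and Markdown extraction work, and `process_document` is a name assumed for this example only.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page_number: int) -> str:
    # Hypothetical stand-in for the per-page OCR + Markdown extraction step.
    return f"<!-- page {page_number} -->"


def process_document(page_count: int, max_workers: int = 8) -> str:
    """Process pages concurrently and join the results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in input order, regardless of
        # which page finishes first.
        pages = list(executor.map(process_page, range(1, page_count + 1)))
    return "\n\n".join(pages)
```

Using `executor.map` rather than collecting futures as they complete keeps the output in document order without an extra sort.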

Quick Start

1. Clone & Run

git clone https://github.com/ikantkode/exaOCR.git
cd exaOCR
docker compose up --build

Open http://localhost:7601 in a browser.
Upload files, watch progress, and download the Markdown ZIP.

2. Production

docker compose up -d --build

Hardware Recommendations

CPU / RAM	Max Workers	Batch Size
4-core / 8GB	4	5 files
8-core / 16GB	8	10 files
24-core / 64GB	12	25 files

Set the worker count in app.py:

executor = ThreadPoolExecutor(max_workers=12)

Monitor resources with:

htop
free -m
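
Instead of hard-coding the worker count, it could be derived from the host's core count. The heuristic below (use every core, capped at 12) happens to reproduce all three rows of the Hardware Recommendations table, but it is an assumption for illustration; `WORKER_CAP` and `recommended_workers` are hypothetical names, not part of app.py.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Assumption: 12 is the largest worker count in the table, used here as a cap.
WORKER_CAP = 12


def recommended_workers(cpu_count=None):
    """Return a worker count: one per core, capped at WORKER_CAP, at least 1."""
    # os.cpu_count() can return None, so fall back to a single worker.
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(cores, WORKER_CAP))


executor = ThreadPoolExecutor(max_workers=recommended_workers())
```

The table also ties worker counts to RAM, which this sketch ignores; on memory-constrained hosts a lower value may be safer.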

API Endpoints

View Documentation: https://ikantkode.github.io/exaOCR

Endpoint	Method	Purpose
`/upload/`	POST	Upload files
`/progress/{file_id}`	GET	Real-time progress
`/download-markdown/{id}`	GET	Download Markdown
`/health`	GET	Health check
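
A small client helper for the GET endpoints might look like the sketch below, using only the Python standard library. The base URL assumes the FastAPI port from the Docker Compose file, the helper names are hypothetical, and response bodies are not parsed because their format is not specified here; see the API documentation page for the actual schemas.

```python
from urllib.parse import quote, urljoin
from urllib.request import urlopen

# Assumption: the API is reachable on the FastAPI port from docker-compose.yml.
BASE_URL = "http://localhost:8000/"


def progress_url(file_id: str, base: str = BASE_URL) -> str:
    """Build the URL for GET /progress/{file_id}."""
    return urljoin(base, f"progress/{quote(file_id, safe='')}")


def download_url(file_id: str, base: str = BASE_URL) -> str:
    """Build the URL for GET /download-markdown/{id}."""
    return urljoin(base, f"download-markdown/{quote(file_id, safe='')}")


def is_healthy(base: str = BASE_URL, timeout: float = 5.0) -> bool:
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urlopen(urljoin(base, "health"), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False
```

Calling `is_healthy()` before uploading is a cheap way to confirm the containers are up.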

Docker Compose (Production)

version: "3.8"
services:
  fastapi:
    build: .
    ports:
      - "8000:8000"
    environment:
      - PYTHONUNBUFFERED=1
    deploy:
      resources:
        limits:
          memory: 4G
  streamlit:
    build: .
    ports:
      - "7601:7601"
    depends_on:
      - fastapi

Performance Baselines

Test Case	Pages	Time	CPU Threads
10 PDFs (avg 50 pages)	500	45 s	8
1 × 800-page contract	800	250 s	8
50 images	50	30 s	8

Tech Stack

Layer	Technology
Frontend	Streamlit 1.38
API	FastAPI
OCR	OCRmyPDF + Tesseract
Markdown	PyMuPDF4LLM
Parallelism	concurrent.futures
Container	Ubuntu 24.04, Python 3.12

License

MIT – free for personal and commercial use. Dependencies follow their own licenses.

Issues or PRs: GitHub Issues
