A Swiss Army knife document processor with OCR fallback. It handles PDFs, Excel, CSV, Word, and images with automatic retry logic and quality scoring.
- Multi-format support: PDF, XLSX, CSV, DOCX, JPG, PNG, TIFF, and more
- Intelligent extraction ladder: Tries cheap methods first, escalates on failure
- Quality scoring: Automatically detects failed extractions and retries
- LightOnOCR integration: Local AI-powered OCR for scanned documents
- Web UI: Drag-and-drop processing, document viewer, and settings
- REST API: Easy integration with existing systems
- Docker ready: CPU and GPU images available
- Fully permissive licensing: All dependencies are MIT/BSD/Apache 2.0
Document In
│
▼
┌─────────────────────────────────────┐
│ Level 1: Native Extraction (FREE) │ PDF, Excel, CSV, Word
│ pdfplumber, pandas, python-docx │
│ Confidence check → pass? → Done ✓ │
└─────────────────────────────────────┘
│ fail/low confidence
▼
┌─────────────────────────────────────┐
│ Level 2: LightOnOCR (Local GPU/CPU) │ Images, scanned PDFs
│ Retry up to 2x per level │
│ Confidence check → pass? → Done ✓ │
└─────────────────────────────────────┘
│ fail
▼
┌─────────────────────────────────────┐
│ Level 3: AWS Textract (DISABLED) │ Optional paid fallback
└─────────────────────────────────────┘
│ fail
▼
┌─────────────────────────────────────┐
│ Dead Letter Queue │ Manual review
└─────────────────────────────────────┘
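The ladder above can be sketched as a loop over extraction levels. This is an illustrative model only; the level names, toy extractors, and confidence scores are hypothetical, not the real `extraction_pipeline.py` API:

```python
# Illustrative model of the extraction ladder: try each level in order,
# retry up to the per-level limit, escalate when confidence stays low.
MIN_CONFIDENCE = 60
MAX_RETRIES_PER_LEVEL = 2

def run_ladder(levels, document):
    """levels: list of (name, extract_fn); extract_fn returns (text, confidence)."""
    for name, extract in levels:
        for _attempt in range(MAX_RETRIES_PER_LEVEL):
            text, confidence = extract(document)
            if confidence >= MIN_CONFIDENCE:
                return {"level": name, "text": text, "confidence": confidence}
    # Every level exhausted its retries: route to the dead letter queue.
    return {"level": "dlq", "text": None, "confidence": 0}

# Toy extractors: native parsing yields almost nothing (a scanned PDF),
# local OCR recovers the text with high confidence.
native = lambda doc: ("", 10)
local_ocr = lambda doc: ("Hello", 92)

result = run_ladder([("native", native), ("local_ocr", local_ocr)], b"%PDF...")
```

Each level gets a bounded number of attempts before the document escalates, so a transient failure at Level 1 never blocks the cheaper path permanently.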
# Clone the repository
git clone https://github.com/NotADevIAmaMeatPopsicle/DTAT-OCR.git
cd DTAT-OCR
# Create virtual environment
uv venv --python 3.12 --seed
# Windows
.venv\Scripts\activate
# Linux/Mac
source .venv/bin/activate
# Install dependencies
uv pip install -r requirements.txt
# Initialize database
python worker.py init

# Start the API server
python -m uvicorn api:app --host 0.0.0.0 --port 8000

Open http://localhost:8000 in your browser.
| Page | Description |
|---|---|
| `/` | Process documents (drag & drop upload) |
| `/ui/documents` | View all processed documents |
| `/ui/settings` | Configure extraction pipeline |
| `/docs` | API documentation (Swagger) |
# Initialize database (required first time)
python worker.py init
# Process single document
python worker.py process document.pdf
python worker.py process receipt.jpg --json
# Batch process pending documents
python worker.py batch --limit 20
# Run continuous worker (for async queue)
python worker.py worker --interval 10 --batch-size 5
# View statistics
python worker.py stats
# View failed documents (DLQ)
python worker.py dlq
# View/modify configuration
python worker.py config
python worker.py config --enable-textract
python worker.py config --disable-textract

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/stats` | GET | Processing statistics |
| `/process` | POST | Upload & process (sync) |
| `/process/async` | POST | Upload & queue (async) |
| `/documents` | GET | List all documents |
| `/documents/{id}` | GET | Get document metadata |
| `/documents/{id}/content` | GET | Get extracted text/tables |
| `/documents/{id}/retry` | POST | Retry failed document |
| `/dlq` | GET | View dead letter queue |
# Health check
curl http://localhost:8000/health
# Process document (sync)
curl -X POST http://localhost:8000/process \
-F "file=@document.pdf"
# Process document (async - uses persistent queue)
curl -X POST http://localhost:8000/process/async \
-F "file=@document.pdf"
# Get extracted content
curl http://localhost:8000/documents/1/content

- First build: Takes 5-10 minutes (downloads ~2GB model weights)
- Subsequent builds: Much faster due to Docker layer caching
- Data persistence: Documents are stored in `./data/documents.db`; back up this directory
docker build -t dtat-ocr:cpu .
docker run -p 8000:8000 -v $(pwd)/data:/app/data dtat-ocr:cpu

# Build GPU image
docker build -f Dockerfile.gpu -t dtat-ocr:gpu .

# Run with NVIDIA GPU support
docker run --gpus all -p 8000:8000 -v $(pwd)/data:/app/data dtat-ocr:gpu

Note: Requires the NVIDIA Container Toolkit.
docker-compose up --build # Build and run
docker-compose up -d # Run in background
docker-compose logs -f # View logs
docker-compose down          # Stop

Edit config.py or use the Web UI at /ui/settings:
| Setting | Default | Description |
|---|---|---|
| `enable_native_extraction` | `True` | Level 1: Free parsing |
| `enable_local_ocr` | `True` | Level 2: LightOnOCR |
| `enable_textract` | `False` | Level 3: AWS Textract |
| `ocr_offline_mode` | `True` | Don't call HF Hub |
| `min_confidence_score` | `60` | Threshold to escalate |
| `max_retries_per_level` | `2` | Retries before escalating |
| Format | Method | Notes |
|---|---|---|
| PDF (digital) | Native | pdfplumber - text + tables |
| PDF (scanned) | OCR | LightOnOCR |
| Excel (.xlsx, .xls) | Native | pandas + openpyxl |
| CSV | Native | pandas |
| Word (.docx, .doc) | Native | python-docx |
| Images | OCR | JPG, JPEG, PNG, TIFF, TIF, BMP, GIF, WebP |
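Routing by file type can be sketched as a simple extension lookup mirroring the table above; the real dispatch in `document_processor.py` may differ, and `route` is a hypothetical helper name:

```python
from pathlib import Path

# Extension sets taken from the format table above.
NATIVE = {".pdf", ".xlsx", ".xls", ".csv", ".docx", ".doc"}
OCR = {".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp", ".gif", ".webp"}

def route(path: str) -> str:
    """Pick the starting extraction method for a file by its extension."""
    ext = Path(path).suffix.lower()
    if ext in NATIVE:
        # Scanned PDFs still start here and escalate to OCR on low confidence.
        return "native"
    if ext in OCR:
        return "ocr"
    return "unsupported"
```

Note that PDFs always enter at the native level; the confidence check, not the extension, decides whether a scanned PDF escalates to OCR.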
- Model: LightOnOCR-1B-1025
- License: Apache 2.0
- Size: ~2GB (bfloat16)
- Performance:
- GPU (H100): ~5.7 pages/sec
- CPU: ~0.5-1 pages/min
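A back-of-envelope conversion of the throughput figures above, for capacity planning (the 10,000-page workload and the CPU midpoint are assumptions):

```python
# Throughput figures from the performance notes above.
gpu_pages_per_sec = 5.7    # H100
cpu_pages_per_min = 0.75   # midpoint of the 0.5-1 pages/min range

pages = 10_000             # hypothetical backlog
gpu_hours = pages / gpu_pages_per_sec / 3600   # ~0.5 hours
cpu_hours = pages / cpu_pages_per_min / 60     # ~220+ hours
```

The roughly 450x gap is why the GPU image exists: CPU inference is fine for a trickle of scanned pages but not for bulk backfills.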
DTAT-OCR/
├── api.py # FastAPI REST endpoints + Web UI
├── config.py # Configuration and feature toggles
├── database.py # SQLAlchemy models, base64 storage
├── extraction_pipeline.py # Retry logic, quality scoring, escalation
├── worker.py # CLI for processing
├── document_processor.py # Multi-format processor
│
├── templates/ # Web UI templates
│ ├── base.html # Base layout with nav
│ ├── index.html # Document processing page
│ ├── documents.html # Document list page
│ └── settings.html # Configuration page
│
├── docs/adr/ # Architecture Decision Records
│ └── 001-replace-pymupdf-with-pdfplumber.md
│
├── Dockerfile # CPU Docker image
├── Dockerfile.gpu # GPU Docker image (CUDA)
├── docker-compose.yml # Local development
└── requirements.txt # Python dependencies
The core document processing pipeline is fully functional for local and Docker deployments.
The following features are planned for enterprise-scale AWS deployment:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ External │────▶│ SQS Queue │────▶│ DTAT OCR │────▶│ Results │
│ System │ │ (intake) │ │ Workers │ │ S3 + RDS │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
- Purpose: Decouple document intake from processing
- Benefits:
- Handle traffic spikes without losing documents
- Multiple workers can pull from the same queue
- Failed jobs return to queue automatically
- Fire-and-forget from upstream systems
- Current: SQLite (single-file, good for dev/small scale)
- Planned: Amazon RDS PostgreSQL
- Benefits:
- Handle concurrent connections from multiple workers
- Automatic backups and point-in-time recovery
- Multi-AZ failover for high availability
- Connection pooling for better performance
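The pooling benefit is that workers reuse a fixed set of open connections instead of paying connection setup per request. A minimal toy illustration, using stdlib `sqlite3` so it runs anywhere; a real deployment would use SQLAlchemy's pooling against RDS:

```python
import queue
import sqlite3

class TinyPool:
    """Toy fixed-size connection pool: acquire/release instead of open/close."""

    def __init__(self, dsn: str, size: int):
        self._pool: queue.Queue = queue.Queue()
        for _ in range(size):
            # Connections are created once, up front, and then recycled.
            self._pool.put(sqlite3.connect(dsn, check_same_thread=False))

    def acquire(self) -> sqlite3.Connection:
        return self._pool.get()        # blocks when all connections are in use

    def release(self, conn: sqlite3.Connection) -> None:
        self._pool.put(conn)

pool = TinyPool(":memory:", size=3)
conn = pool.acquire()
one = conn.execute("SELECT 1").fetchone()[0]
pool.release(conn)
```

Blocking in `acquire` also acts as natural backpressure: a burst of workers queues up for connections rather than overwhelming the database.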
┌─────────────────────────────────────────────────────────────┐
│ AWS VPC │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ ECS Task │ │ ECS Task │ │ ECS Task │ │
│ │ (GPU) │ │ (GPU) │ │ (GPU) │ │
│ │ g4dn.xlarge │ │ g4dn.xlarge │ │ g4dn.xlarge │ ... │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ └─────────────────┴─────────────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ RDS │ │
│ │ PostgreSQL │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
- Compute: ECS Fargate with GPU instances (g4dn.xlarge)
- Model weights: Baked into Docker image (no download on startup)
- Scaling: Auto-scale based on SQS queue depth
| Metric | Scale Up | Scale Down |
|---|---|---|
| SQS Queue Depth | > 100 messages | < 10 messages |
| Min Instances | 1 | - |
| Max Instances | 10 | - |
| Cooldown | 60 seconds | 300 seconds |
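The scaling policy in the table above can be expressed as a small decision function. The thresholds and instance bounds come from the table; the one-instance-per-step behavior is an assumption about how the policy would be configured:

```python
# Thresholds and bounds from the auto-scaling table above.
SCALE_UP_DEPTH = 100     # queue depth that triggers scale up
SCALE_DOWN_DEPTH = 10    # queue depth that triggers scale down
MIN_INSTANCES = 1
MAX_INSTANCES = 10

def desired_instances(queue_depth: int, current: int) -> int:
    """Return the next worker count for a given SQS queue depth."""
    if queue_depth > SCALE_UP_DEPTH:
        return min(current + 1, MAX_INSTANCES)
    if queue_depth < SCALE_DOWN_DEPTH:
        return max(current - 1, MIN_INSTANCES)
    return current  # inside the dead band: hold steady
```

The gap between the two thresholds (10-100 messages) is a deliberate dead band that, together with the cooldowns, prevents the fleet from oscillating on a noisy queue.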
- Metrics: Processing time, success rate, queue depth, error rate
- Logs: Structured JSON logging for all processing events
- Alarms: Alert on DLQ growth, high error rate, processing delays
- Status: Code ready, disabled by default
- Purpose: Paid fallback for documents that fail local OCR
- Cost: ~$0.015/page
- Enable: Set `ENABLE_TEXTRACT=true` in the environment
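One way such an environment toggle is commonly read; the exact parsing in `config.py` may differ, and `env_flag` is a hypothetical helper:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean feature flag."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    # Accept the usual truthy spellings; everything else is False.
    return raw.strip().lower() in {"1", "true", "yes", "on"}

os.environ["ENABLE_TEXTRACT"] = "true"   # as set in the deployment environment
textract_enabled = env_flag("ENABLE_TEXTRACT")
```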
┌─────────────────┐
│ CloudWatch │
│ Logs/Metrics │
└────────▲────────┘
│
┌──────────┐ ┌──────────┐ ┌────────────┴────────────┐ ┌──────────┐
│ External │───▶│ SQS │───▶│ ECS Fargate │───▶│ S3 │
│ System │ │ Queue │ │ (GPU Workers x N) │ │ Output │
└──────────┘ └──────────┘ └────────────┬────────────┘ └──────────┘
│
┌────────▼────────┐
│ RDS Postgres │
│ (metadata) │
└─────────────────┘
| Component | Estimated Cost |
|---|---|
| ECS Fargate (g4dn.xlarge, ~100 hrs) | ~$150-200 |
| RDS PostgreSQL (db.t3.medium) | ~$30 |
| SQS (100K messages) | < $1 |
| S3 Storage (100GB) | ~$2 |
| Total | ~$200/month |
Note: Using local OCR instead of Textract saves ~$1,500/month at this volume.
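Both figures check out arithmetically. Taking the midpoint of the Fargate range (an assumption) and one page per SQS message:

```python
# Line items from the cost table above (USD/month).
fargate = 175   # midpoint of the $150-200 range for ~100 GPU hours
rds = 30        # db.t3.medium
sqs = 1         # 100K messages, rounded up from < $1
s3 = 2          # 100GB storage

total = fargate + rds + sqs + s3            # ~$208, i.e. the ~$200/month figure

# Avoided Textract spend at the same volume, at $0.015/page.
pages = 100_000
textract_cost = pages * 0.015               # the ~$1,500/month saved
```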
This project uses only permissively licensed dependencies:
| Component | License |
|---|---|
| FastAPI | MIT |
| pdfplumber | MIT |
| SQLAlchemy | MIT |
| PyTorch | BSD |
| Transformers | Apache 2.0 |
| LightOnOCR Model | Apache 2.0 |
All dependencies are safe for commercial use.
Contributions are welcome! Please read the ADRs in docs/adr/ before making architectural changes.
- LightOn AI for the excellent LightOnOCR model
- pdfplumber for PDF extraction