TypeScript library for extracting structured data from archaeological excavation report PDFs
English | 한국어
⚠️ macOS Only: This project currently supports only macOS (Apple Silicon or Intel). See @heripo/pdf-parser README for detailed system requirements.
ℹ️ Notes (v0.1.x):
- Korean Report Correction: Korean reports are automatically detected and corrected via VLM (Vision Language Model)
- TOC Dependency: Reports without a TOC fail by design. Rare extraction failures will be handled through human intervention
- Vertical Text: Old vertical-text documents with Chinese numeral page numbers are a long-term goal, not currently scheduled
🌐 Online Demo: Try it without local installation → engine-demo.heripo.org
- Introduction
- Key Features
- Architecture
- Installation
- Packages
- Usage Examples
- Demo Application
- Documentation
- Roadmap
- Contributing
- Citation and Attribution
- Sponsor
- License
heripo engine is a collection of tools for analyzing archaeological excavation report PDFs and extracting structured data. It is designed to effectively process documents that span hundreds of pages and contain complex layouts, tables, diagrams, and photographs.
heripo lab is an open-source R&D group that combines archaeological domain knowledge with software engineering expertise to make research more efficient in practice.
- Role: Design of LLM-based unstructured data extraction pipeline and system implementation
- Background: Software Engineer (B.S. in Computer Science and B.A. in Archaeology)
- Research:
- A Study on Archaeological Informatization Using Large Language Models (LLMs): Proof of Concept for an Automated Metadata Extraction Pipeline from Archaeological Excavation Reports (2025, Heritage: History and Science Vol. 58 No. 3, KCI Listed)
- Role: Archaeological data ontology design, data schema definition, and academic validation
- Background: Ph.D. Student in Archaeology, M.A. in Cultural Informatics
- Research:
- Considerations for Structuring Maritime Cultural Heritage Data (2025, Journal of the Island Culture No. 66, KCI Listed)
- Semantic Data Design for Maritime Cultural Heritage: Focusing on Ancient Shipwrecks and Wooden Tablets Excavated from the Taean Mado waters (2025, Master's Thesis)
- Role: Development of archaeology research platforms
- Background:
- Software Engineer
- M.A. in Archaeology (Coursework Completed)
- B.A. in Archaeology
- B.A. in Library and Information Science
Archaeological excavation reports contain valuable cultural heritage information but are often available only as PDFs, which makes systematic analysis and reuse difficult. heripo engine addresses this in three ways:
- OCR Quality: High-accuracy recognition of scanned documents using Docling SDK
- Structure Extraction: Automatic identification of document structure, including the table of contents, chapters/sections, images, and tables
- Cost Efficiency: OCR runs locally instead of through a cloud OCR service, so it is free
Beyond Archaeology: While heripo engine is optimized for archaeological reports, its PDF structuring capabilities (text, tables, images, TOC extraction) work well with heavily damaged scanned PDFs and documents from other domains (architecture, history, etc.). Feel free to fork and adapt it to your needs.
Raw Data Extraction → Archaeological Data Ledger → Archaeological Data Standard → Domain Ontology → DB Storage
| Stage | Description |
|---|---|
| Raw Data Extraction | Document data structurally extracted in the original format of PDF reports (no archaeological interpretation) |
| Data Ledger | Immutable ledger structured using a universal model covering global archaeology |
| Data Standard | Extensible standard model (base standard → country-specific → domain-specific extensions) |
| Ontology | Domain-specific semantic models and knowledge graphs |
| DB Storage | Independent storage and utilization for each pipeline stage |
Current Implementation (v0.1.x):
- ✅ PDF parsing and OCR (Docling SDK)
- ✅ Document structure extraction (TOC, chapters/sections, page mapping)
- ✅ Image/table extraction and caption parsing
Planned Stages:
- 🔜 Immutable Ledger (universal archaeological model, concept extraction)
- 🔜 Extensible Standardization (hierarchical standard model, normalization)
- 🔜 Ontology (semantic model, knowledge graph)
- 🔜 Production Ready (performance optimization, API stability)
For a detailed roadmap, see docs/roadmap.md.
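As a rough illustration, the pipeline order and the v0.1.x status lists above can be written down as a small typed table. This is only a sketch: `StageStatus`, `PipelineStage`, `PIPELINE`, and `stagesWithStatus` are hypothetical names invented for this example and are not exported by any heripo package.

```typescript
// Hypothetical names throughout; this is not an API of any heripo package.
type StageStatus = 'implemented' | 'planned' | 'unspecified';

interface PipelineStage {
  readonly name: string;
  readonly status: StageStatus;
}

// Order follows the pipeline diagram; statuses follow the v0.1.x lists above.
// DB Storage appears in neither list, so it is marked 'unspecified' here.
const PIPELINE: readonly PipelineStage[] = [
  { name: 'Raw Data Extraction', status: 'implemented' },
  { name: 'Data Ledger', status: 'planned' },
  { name: 'Data Standard', status: 'planned' },
  { name: 'Ontology', status: 'planned' },
  { name: 'DB Storage', status: 'unspecified' },
];

// Returns the names of all stages with the given status, in pipeline order.
function stagesWithStatus(status: StageStatus): string[] {
  return PIPELINE.filter((stage) => stage.status === status).map(
    (stage) => stage.name,
  );
}

console.log('Implemented:', stagesWithStatus('implemented'));
console.log('Planned:', stagesWithStatus('planned'));
```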
- High-Quality OCR: Document recognition using Docling SDK (ocrmac / Apple Vision Framework)
- Korean Report VLM Correction: Automatically detects Korean reports and applies VLM text correction to all pages. ocrmac excels at speed and quality for large-scale processing, but Korean archaeological reports often require Hanja restoration and script-aware correction
- Apple Silicon Optimized: GPU acceleration on M1/M2/M3/M4/M5 chips
- Automatic Environment Setup: Automatic Python virtual environment and docling-serve installation
- Image Extraction: Automatic extraction and saving of images from PDFs
- Review Assistance: Optional page-level VLM review with audit proposals and high-confidence auto-fixes
- TOC Extraction: Automatic TOC recognition with rule-based + LLM fallback
- Hierarchical Structure: Automatic generation of chapter/section/subsection hierarchy
- Page Mapping: Actual page number mapping using Vision LLM
- Caption Parsing: Automatic parsing of image and table captions
- Source Provenance: Optional Docling source metadata and node-level source references
- Table Grid Normalization: Preserves row/column spans and removes merged-cell shadow entries
- LLM Flexibility: Support for various LLMs including OpenAI, Anthropic, Google
- ProcessedDocument: Intermediate data model optimized for LLM analysis
- DoclingDocument: Raw output format from Docling SDK
- ReviewAssistanceReport: Optional page-level review assistance report model
- Type Safety: Complete TypeScript type definitions
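The "merged-cell shadow entries" mentioned under Table Grid Normalization can be illustrated with a minimal sketch. It assumes a grid in which a merged cell is repeated at every position it covers, and keeps the cell only at its origin position. The `GridCell` shape and the `removeShadowEntries` function are hypothetical illustrations, not the actual types or implementation in @heripo/document-processor.

```typescript
// Hypothetical cell shape for illustration; not the heripo or Docling type.
interface GridCell {
  text: string;
  startRow: number; // row index where the (possibly merged) cell begins
  startCol: number; // column index where the cell begins
  rowSpan: number;
  colSpan: number;
}

// Keeps each cell only at its origin position and replaces the repeated
// "shadow" copies of merged cells with null. Span values stay untouched.
function removeShadowEntries(grid: GridCell[][]): (GridCell | null)[][] {
  return grid.map((row, rowIndex) =>
    row.map((cell, colIndex) =>
      cell.startRow === rowIndex && cell.startCol === colIndex ? cell : null,
    ),
  );
}

// A header merged across two columns appears twice in the raw grid.
const mergedHeader: GridCell = {
  text: 'Artifacts',
  startRow: 0,
  startCol: 0,
  rowSpan: 1,
  colSpan: 2,
};

const rawGrid: GridCell[][] = [
  [mergedHeader, mergedHeader],
  [
    { text: 'Pottery', startRow: 1, startCol: 0, rowSpan: 1, colSpan: 1 },
    { text: '12', startRow: 1, startCol: 1, rowSpan: 1, colSpan: 1 },
  ],
];

const normalizedGrid = removeShadowEntries(rawGrid);
console.log(normalizedGrid);
```

After normalization the header survives once with its `colSpan` of 2 intact, so a consumer can still rebuild the merged layout without counting the same text twice.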
heripo engine is organized as a pnpm workspace-based monorepo.
heripo-engine/
├── packages/ # Core libraries
│ ├── pdf-parser/ # PDF → DoclingDocument
│ ├── document-processor/ # DoclingDocument → ProcessedDocument
│ ├── model/ # Data models and type definitions
│ ├── logger/ # Logging adapter package
│ └── shared/ # Internal utilities (not published)
├── apps/ # Applications
│ └── demo-web/ # Next.js web demo
└── tools/ # Build tool configurations
    ├── tsconfig/ # Shared TypeScript config
    ├── tsup-config/ # Build config
    └── vitest-config/ # Test config
For detailed architecture explanation, see docs/architecture.md.
- macOS (Apple Silicon or Intel)
- Node.js >= 24.0.0
- pnpm >= 10.0.0
- Python 3.9 - 3.12 (⚠️ Python 3.13+ is not supported)
- jq (JSON processing tool)
- poppler (PDF text extraction tools)
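The Python constraint is the one most likely to fail on a fresh machine, so it can be worth checking before setup. The following is a minimal sketch of such a check. `isSupportedPython` is a hypothetical helper written for this example, not something the heripo packages provide, and it only encodes the 3.9 to 3.12 range stated above.

```typescript
// Supported range taken from the requirements list: Python 3.9 through 3.12.
const MIN_SUPPORTED_MINOR = 9;
const MAX_SUPPORTED_MINOR = 12;

// Accepts a bare version ("3.11.4") or raw `python3 --version` output
// ("Python 3.11.4"). Returns false if no major.minor pair can be found.
function isSupportedPython(versionText: string): boolean {
  const match = /(\d+)\.(\d+)/.exec(versionText);
  if (!match) return false;
  const major = Number(match[1]);
  const minor = Number(match[2]);
  return (
    major === 3 &&
    minor >= MIN_SUPPORTED_MINOR &&
    minor <= MAX_SUPPORTED_MINOR
  );
}

console.log(isSupportedPython('Python 3.11.4'));
console.log(isSupportedPython('Python 3.13.0'));
```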
# Install Python 3.11 (recommended)
brew install python@3.11
# Install jq
brew install jq
# Install poppler
brew install poppler
# Install Node.js and pnpm
brew install node
npm install -g pnpm
For a detailed installation guide, see @heripo/pdf-parser README.
# Install individual packages
pnpm add @heripo/pdf-parser
pnpm add @heripo/document-processor
pnpm add @heripo/model
pnpm add @heripo/logger
# Or install all at once
pnpm add @heripo/pdf-parser @heripo/document-processor @heripo/model @heripo/logger
| Package | Version | Description |
|---|---|---|
| @heripo/pdf-parser | 0.1.x | PDF parsing and OCR |
| @heripo/document-processor | 0.1.x | Document structure analysis and LLM processing |
| @heripo/model | 0.1.x | Data models and type definitions |
| @heripo/logger | 0.1.x | Logger interface and adapter |
import type { DoclingDocument } from '@heripo/model';
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';
import { DocumentProcessor } from '@heripo/document-processor';
import { Logger } from '@heripo/logger';
import { PDFParser } from '@heripo/pdf-parser';
import { readFile } from 'node:fs/promises';
const logger = new Logger({
  debug: (...args) => console.debug('[heripo]', ...args),
  info: (...args) => console.info('[heripo]', ...args),
  warn: (...args) => console.warn('[heripo]', ...args),
  error: (...args) => console.error('[heripo]', ...args),
});
// 1. PDF Parsing
const pdfParser = new PDFParser({
  port: 5001,
  logger,
});
await pdfParser.init();
const tokenUsageReport = await pdfParser.parse(
  'file:///path/to/report.pdf',
  'report-001',
  async (artifactDir) => {
    const doclingDocument = JSON.parse(
      await readFile(`${artifactDir}/result.json`, 'utf8'),
    ) as DoclingDocument;
    // 2. Document Processing (inside callback)
    const processor = new DocumentProcessor({
      logger,
      fallbackModel: anthropic('claude-opus-4-5'),
      pageRangeParserModel: openai('gpt-5.2'),
      tocExtractorModel: openai('gpt-5.1'),
      captionParserModel: openai('gpt-5-mini'),
      textCleanerBatchSize: 10,
      captionParserBatchSize: 5,
      captionValidatorBatchSize: 5,
    });
    const { document, usage } = await processor.process(
      doclingDocument,
      'report-001',
      artifactDir,
    );
    // 3. Use Results
    console.log('TOC:', document.chapters);
    console.log('Images:', document.images);
    console.log('Tables:', document.tables);
    console.log('Footnotes:', document.footnotes);
    console.log('Token Usage:', usage.total);
  },
  true, // cleanupAfterCallback
  {}, // PDFConvertOptions
);
// Cleanup
await pdfParser.dispose();
// Specify LLM models per component + fallback retry
const processor = new DocumentProcessor({
  logger,
  fallbackModel: anthropic('claude-opus-4-5'), // For retry on failure
  pageRangeParserModel: openai('gpt-5.2'),
  tocExtractorModel: openai('gpt-5.1'),
  validatorModel: openai('gpt-5.2'),
  visionTocExtractorModel: openai('gpt-5-mini'),
  captionParserModel: openai('gpt-5-nano'),
  textCleanerBatchSize: 20,
  captionParserBatchSize: 10,
  captionValidatorBatchSize: 10,
  maxRetries: 3,
  maxValidationRetries: 3,
  enableFallbackRetry: true, // Automatically retry with fallbackModel on failure (default: false)
  onTokenUsage: (report) => console.log('Token usage:', report.total),
});
Try it without local installation:
🔗 https://engine-demo.heripo.org
The online demo is limited to 3 uses per day. For full functionality, running it locally is recommended.
A web application providing real-time PDF processing monitoring:
cd apps/demo-web
cp .env.example .env
# Set LLM API keys in .env file
pnpm install
pnpm dev
Access http://localhost:3000 in your browser.
Key Features:
- PDF upload and processing option configuration
- Real-time processing status monitoring (SSE)
- Processing result visualization (TOC, images, tables)
- Job queue management
For detailed usage, see apps/demo-web/README.md.
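For readers unfamiliar with SSE (Server-Sent Events), the sketch below shows how text in the standard SSE wire format splits into individual events. It is an illustration only. The `progress` event name and the JSON payload are made up for this example, the demo's actual endpoints and event shapes are not described here, and in a browser the built-in `EventSource` would normally do this parsing.

```typescript
interface SseEvent {
  event: string; // event name; the SSE default is "message"
  data: string; // payload, with multiple data lines joined by "\n"
}

// Parses a complete chunk of SSE text into events. Events are separated by a
// blank line; lines starting with ":" are comments and are ignored.
function parseSseChunk(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of chunk.split(/\r?\n\r?\n/)) {
    let event = 'message';
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('event:')) {
        event = line.slice('event:'.length).trim();
      } else if (line.startsWith('data:')) {
        // A single leading space after the colon is not part of the value.
        dataLines.push(line.slice('data:'.length).replace(/^ /, ''));
      }
    }
    if (dataLines.length > 0) {
      events.push({ event, data: dataLines.join('\n') });
    }
  }
  return events;
}

// Hypothetical stream content: the event name and payload are invented.
const sampleStream = 'event: progress\ndata: {"step":"ocr"}\n\ndata: done\n\n';
const parsedEvents = parseSseChunk(sampleStream);
console.log(parsedEvents);
```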
- Architecture Document - System design and structure
- Roadmap - Development plans and vision
- Contributing Guide - How to contribute
- Security Policy - Vulnerability reporting procedure
- Code of Conduct - Community code of conduct
Current version: v0.1.x (Initial Release)
- ✅ PDF parsing with OCR
- ✅ Document structure extraction (TOC, chapters/sections)
- ✅ Image/table extraction
- ✅ Page mapping
- ✅ Caption parsing
- Universal data model design covering global archaeology
- Archaeological concept extraction (features, artifacts, strata, excavation units)
- LLM-based information extraction pipeline
- Hierarchical standard model design (base → country-specific → domain-specific)
- Normalization pipeline
- Data validation
- Domain-specific semantic models
- Knowledge graph construction
- Performance optimization
- API stability guarantee
- Comprehensive testing
For details, see docs/roadmap.md.
# Install dependencies
pnpm install
# Build all
pnpm build
# Type check
pnpm typecheck
# Lint
pnpm lint
pnpm lint:fix
# Format
pnpm format
pnpm format:check
# Run all tests
pnpm test
pnpm test:coverage
pnpm test:ci
# Test specific package
pnpm --filter @heripo/pdf-parser test
pnpm --filter @heripo/document-processor test
# Build specific package
pnpm --filter @heripo/pdf-parser build
# Test specific package (with coverage)
pnpm --filter @heripo/pdf-parser test:coverage
# Watch mode for specific package
pnpm --filter @heripo/pdf-parser dev
Thank you for contributing to the heripo engine project! For contribution guidelines, see CONTRIBUTING.md.
- Fork this repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'feat: add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Create a Pull Request
- All tests must pass (pnpm test)
- 100% code coverage must be maintained
- ESLint and Prettier rules must be followed
- Commit messages must follow Conventional Commits
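To check a commit message header against the Conventional Commits rule before pushing, a simple pattern match is often enough. The sketch below makes two assumptions: the list of allowed types is the commonly used Angular-style set, and `isConventionalCommitHeader` is a hypothetical helper. The project's actual lint configuration may accept a different set.

```typescript
// Assumed type list (Angular-style); the project's real configuration may differ.
const COMMIT_TYPES = [
  'feat', 'fix', 'docs', 'style', 'refactor',
  'perf', 'test', 'build', 'ci', 'chore', 'revert',
];

// Header shape: type, optional "(scope)", optional "!", then ": " and a description.
const HEADER_PATTERN = new RegExp(
  `^(${COMMIT_TYPES.join('|')})(\\([^()\\s]+\\))?!?: \\S.*$`,
);

// Checks only the first line of the commit message.
function isConventionalCommitHeader(message: string): boolean {
  const header = message.split('\n')[0] ?? '';
  return HEADER_PATTERN.test(header);
}

console.log(isConventionalCommitHeader('feat: add amazing feature'));
console.log(isConventionalCommitHeader('Add amazing feature'));
```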
- Issue Tracker: GitHub Issues
- Discussions: GitHub Discussions
- Security Vulnerabilities: See Security Policy
If you use this project in research, services, or derivative works, please include the following attribution:
Powered by heripo engine
Such attribution helps support the open-source project and gives credit to contributors.
For academic papers or research documents, you may use the following BibTeX entry:
@software{heripo_engine,
author = {Kim, Hongyeon and Cho, Hayoung and Kim, Gaeun},
title = {heripo engine: TypeScript Library for Extracting Structured Data from Archaeological Excavation Report PDFs},
year = {2026},
url = {https://github.com/heripo-lab/heripo-engine},
note = {Apache License 2.0}
}
If you'd like to support heripo lab's open-source research, you can sponsor us through:
- Open Collective for general project sponsorship.
- fairy.hada.io/@heripo for Korean individual supporters who prefer KRW payments.
This project is distributed under the Apache License 2.0.
This project uses the following open-source projects:
- Docling SDK - PDF parsing and OCR
- Vercel AI SDK - LLM integration
heripo lab | GitHub | heripo engine