Releases: neoncapy/doc2md
v3.5.0 — Marker as Default Extractor
What's New in v3.5.0
Marker is now the default PDF extractor
The pipeline now uses marker-pdf as the default extractor for digital PDFs. Marker produces significantly better output for academic papers — especially multi-column layouts, complex tables, and mathematical notation.
New fallback chain: marker → docling → pymupdf4llm → mineru → tesseract
| Extractor | Used for |
|---|---|
| marker (new default) | Digital PDFs |
| docling | Scanned PDFs |
| pymupdf4llm | Fallback |
| MinerU | Complex layouts |
| tesseract | Last resort OCR |
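
The fallback chain can be pictured as a loop over extractors in priority order, moving on whenever one fails or returns almost nothing. This is only a rough sketch: `extract_with_fallback`, the `extractors` mapping, and the `min_chars` threshold are hypothetical names chosen for illustration, not doc2md's actual code, and the real pipeline may route documents differently (for example, sending scanned PDFs to docling first).

```python
from typing import Callable, Dict, Optional, Tuple

# Order taken from the release notes; everything else here is an assumption.
FALLBACK_CHAIN = ["marker", "docling", "pymupdf4llm", "mineru", "tesseract"]


def extract_with_fallback(
    pdf_path: str,
    extractors: Dict[str, Callable[[str], str]],
    min_chars: int = 200,  # hypothetical threshold for "critically empty"
) -> Optional[Tuple[str, str]]:
    """Return (extractor_name, markdown) from the first extractor that passes."""
    for name in FALLBACK_CHAIN:
        run = extractors.get(name)
        if run is None:
            continue  # extractor not installed or not configured
        try:
            text = run(pdf_path)
        except Exception:
            continue  # a crash counts as a failure; try the next extractor
        # An extractor can succeed yet return almost nothing, so check the output too.
        if len(text.strip()) >= min_chars:
            return name, text
    return None
```

Passing the extractors in as plain callables keeps the ordering logic independent of how each tool is actually invoked.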
New: Step 3b — Image analysis after deferred extraction
Some extractors (like marker) defer image extraction to a later pipeline step. Previously, this meant prepare-image-analysis.py would skip because no image manifest existed yet. Now Step 3b automatically re-runs image analysis preparation after Step 6c creates the manifest. This unblocks AI expert persona descriptions for all extractor paths.
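
The ordering problem described above can be sketched as follows. The function and callback names (`run_image_steps`, `deferred_extract`, `prepare_analysis`) are hypothetical stand-ins for the real pipeline steps, and the assumption that the first preparation attempt happens at "Step 3" is inferred from the "3b" name rather than stated in these notes.

```python
from pathlib import Path
from typing import Callable, List


def run_image_steps(
    manifest: Path,
    deferred_extract: Callable[[Path], None],
    prepare_analysis: Callable[[Path], None],
) -> List[str]:
    """Run the image-related steps in order and return a log of what happened."""
    log: List[str] = []

    # Step 3 (assumed): prepare image analysis only if a manifest already exists.
    prepared = manifest.is_file()
    if prepared:
        prepare_analysis(manifest)
        log.append("step3: prepared")
    else:
        log.append("step3: skipped (no manifest yet)")

    # Step 6c: deferred image extraction creates the manifest when it is missing.
    if not manifest.is_file():
        deferred_extract(manifest)
        log.append("step6c: manifest written")

    # Step 3b: re-run preparation if the first attempt had to skip.
    if not prepared and manifest.is_file():
        prepare_analysis(manifest)
        log.append("step3b: prepared")

    return log
```

With an extractor that writes its manifest up front, the first branch handles everything and the re-run never fires.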
New file: convert-paper-marker.py
Standalone marker wrapper (~630 lines) with:
- Page-count-based timeout (scales with document size)
- CPU retry with configurable timeout
- YAML title/author enrichment via fitz
- Journal name title filtering
- Hyphen-compound preservation
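
As an illustration of the first two items, a page-count-based timeout with a single retry might look like the sketch below. The constants (`base`, `per_page`, `cap`) and the function names are assumptions made for this example and are not taken from convert-paper-marker.py. How the wrapper actually switches marker to CPU is not described here, so the retry simply takes a separate, caller-supplied command.

```python
import subprocess
from typing import List, Optional


def page_timeout(pages: int, base: float = 120.0,
                 per_page: float = 15.0, cap: float = 3600.0) -> float:
    """Timeout in seconds that grows with page count, up to a hard cap."""
    return min(cap, base + per_page * max(pages, 0))


def run_with_timeout(cmd: List[str], timeout: float) -> Optional[int]:
    """Return the exit code, or None if the command exceeded the timeout."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        return subprocess.run(cmd, timeout=timeout, check=False).returncode
    except subprocess.TimeoutExpired:
        return None


def run_with_retry(cmd: List[str], retry_cmd: List[str],
                   first_timeout: float, retry_timeout: float) -> Optional[int]:
    """Try cmd once; on timeout or non-zero exit, try retry_cmd once."""
    code = run_with_timeout(cmd, first_timeout)
    if code is None or code != 0:
        # In the real wrapper this second attempt would be the CPU run.
        code = run_with_timeout(retry_cmd, retry_timeout)
    return code
```

Capping the timeout means a very long document cannot hold the pipeline indefinitely.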
Quality & reliability fixes
- `--no-images` flag now fully respected across all extractor paths (was leaking through for marker)
- `run_command()` timeout parameter: outer timeout guard prevents infinite hangs
- Quality gate fallback: when an extractor exits 0 but produces critically empty output, the pipeline automatically falls back to the next extractor and re-checks quality
- Registry locking: `fcntl.flock` prevents corruption from concurrent pipeline runs
- Checkpoint recovery: quality gate fallback now correctly updates checkpoint state for crash recovery
- Type safety: `figure_num` handling works with mixed int/string values from split-panel images
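
For readers unfamiliar with the registry locking mentioned above, here is a minimal sketch of guarding a read-modify-write cycle with `fcntl.flock`. The JSON registry format, the sidecar `.lock` file, and the `update_registry` name are assumptions for illustration only, not the pipeline's actual layout. `fcntl` is available on Unix-like systems only.

```python
import fcntl
import json
from pathlib import Path


def update_registry(registry: Path, key: str, value: str) -> dict:
    """Add or overwrite one entry while holding an exclusive advisory lock."""
    lock_path = Path(str(registry) + ".lock")  # hypothetical sidecar lock file
    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding the lock releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            data = json.loads(registry.read_text()) if registry.exists() else {}
            data[key] = value
            registry.write_text(json.dumps(data, indent=2))
            return data
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Because the lock is advisory, it only protects against other processes that take the same lock before touching the registry.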
Upgrade
cd your-doc2md-directory
git pull origin main
pip install marker-pdf # new required dependency
pip install symspellpy wordsegment # optional, improves post-processing
Full QC history
All changes went through adversarial QC loops (fix → QC → fix → QC) until zero issues at all severity levels:
- Deferred audit fixes: 2 QC rounds → 0/0/0
- Step 3b + manifest fixes: 3 QC rounds → 0/0/0
- Integration tested on academic papers end-to-end