Releases: neoncapy/doc2md

v3.5.0 — Marker as Default Extractor

01 Mar 00:06

What's New in v3.5.0

Marker is now the default PDF extractor

The pipeline now uses marker-pdf as the default extractor for digital PDFs. Marker produces significantly better output for academic papers — especially multi-column layouts, complex tables, and mathematical notation.

New fallback chain: marker → docling → pymupdf4llm → mineru → tesseract

Extractor             Used for
marker (new default)  Digital PDFs
docling               Scanned PDFs
pymupdf4llm           Fallback
MinerU                Complex layouts
tesseract             Last-resort OCR
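
As a rough illustration of how a chain like this can be walked, here is a minimal sketch. It is not the pipeline's actual code: the function names, the `extractors` mapping, and the 200-character "critically empty" threshold are all assumptions made for the example. Only the extractor order comes from the notes above.

```python
from typing import Callable, Optional, Sequence

# Order taken from the release notes; everything else here is hypothetical.
FALLBACK_CHAIN = ("marker", "docling", "pymupdf4llm", "mineru", "tesseract")

MIN_CHARS = 200  # assumed threshold for "critically empty" output


def passes_quality_gate(markdown: str, min_chars: int = MIN_CHARS) -> bool:
    """Return True when the output has enough non-whitespace text to keep."""
    return len("".join(markdown.split())) >= min_chars


def extract_with_fallback(
    pdf_path: str,
    extractors: dict[str, Callable[[str], str]],
    chain: Sequence[str] = FALLBACK_CHAIN,
) -> tuple[Optional[str], Optional[str]]:
    """Try each extractor in order.

    Returns (name, markdown) for the first extractor that both succeeds and
    passes the quality gate, or (None, None) if every extractor fails.
    """
    for name in chain:
        run = extractors.get(name)
        if run is None:
            continue  # extractor not installed or not configured
        try:
            markdown = run(pdf_path)
        except Exception:
            continue  # a crash moves on to the next extractor
        if passes_quality_gate(markdown):
            return name, markdown
        # Succeeded but near-empty output: fall through to the next extractor,
        # whose output is checked against the same gate.
    return None, None
```

With this shape, an extractor that returns successfully but yields only whitespace is treated the same as one that crashed, which is the behaviour the quality gate fallback in this release is described as providing.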

New: Step 3b — Image analysis after deferred extraction

Some extractors (like marker) defer image extraction to a later pipeline step. Previously, this meant prepare-image-analysis.py would skip because no image manifest existed yet. Now Step 3b automatically re-runs image analysis preparation after Step 6c creates the manifest. This unblocks AI expert persona descriptions for all extractor paths.
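
The ordering problem can be pictured with a small sketch. This is only an illustration under assumptions: the `prepare` and `deferred_image_extraction` callables stand in for prepare-image-analysis.py and Step 6c, and the function names and step labels are invented for the example.

```python
from pathlib import Path
from typing import Callable


def prepare_if_manifest_exists(manifest: Path,
                               prepare: Callable[[Path], None]) -> bool:
    """Run image-analysis preparation only when the manifest is present."""
    if not manifest.is_file():
        return False  # nothing to analyse yet; the caller may retry later
    prepare(manifest)
    return True


def run_image_steps(manifest: Path,
                    prepare: Callable[[Path], None],
                    deferred_image_extraction: Callable[[Path], None]) -> list[str]:
    """Return the labels of the steps that actually did work, in order."""
    ran = []
    # Initial preparation: skipped when the extractor deferred its images.
    prepared = prepare_if_manifest_exists(manifest, prepare)
    if prepared:
        ran.append("initial")
    # Step 6c (stand-in): deferred extraction writes the manifest.
    if not manifest.is_file():
        deferred_image_extraction(manifest)
        ran.append("step6c")
    # Step 3b: re-run preparation only if the first attempt was skipped.
    if not prepared and prepare_if_manifest_exists(manifest, prepare):
        ran.append("step3b")
    return ran
```

For an extractor that writes the manifest up front, only the initial preparation runs; for a deferring extractor, the sequence becomes Step 6c followed by Step 3b.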

New file: convert-paper-marker.py

Standalone marker wrapper (~630 lines) with:

  • Page-count-based timeout (scales with document size)
  • CPU retry with configurable timeout
  • YAML title/author enrichment via fitz
  • Journal name title filtering
  • Hyphen-compound preservation
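
A page-count-based timeout can be as simple as a linear scale with a cap. The sketch below is an assumption about the general shape, not the wrapper's real formula: the function name and the constants (120 s base, 15 s per page, one-hour cap) are made up for illustration.

```python
def timeout_for_pages(page_count: int,
                      base_seconds: float = 120.0,
                      seconds_per_page: float = 15.0,
                      max_seconds: float = 3600.0) -> float:
    """Scale the extractor timeout with document size, up to a hard cap.

    All default values are illustrative placeholders.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return min(base_seconds + seconds_per_page * page_count, max_seconds)
```

With these placeholder defaults, a 10-page paper would get 270 seconds and a 1000-page document would hit the 3600-second cap. The result could then be passed as the timeout to whatever runs the extractor subprocess.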

Quality & reliability fixes

  • --no-images flag now fully respected across all extractor paths (was leaking through for marker)
  • run_command() timeout parameter: an outer timeout guard prevents infinite hangs
  • Quality gate fallback: when an extractor exits 0 but produces critically empty output, the pipeline automatically falls back to the next extractor and re-checks the quality of its output
  • Registry locking: fcntl.flock prevents corruption from concurrent pipeline runs
  • Checkpoint recovery: quality gate fallback now correctly updates checkpoint state for crash recovery
  • Type safety: figure_num handling works with mixed int/string values from split-panel images
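
For readers unfamiliar with fcntl.flock, here is a minimal sketch of the registry-locking idea. It assumes the registry is a single JSON file and uses a hypothetical `locked_registry` helper; the project's real registry format and function names may differ. Note that fcntl is available on Unix-like systems only.

```python
import fcntl
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def locked_registry(path: Path):
    """Yield the registry dict under an exclusive lock; write it back on exit.

    Hypothetical helper: assumes the registry is one JSON object in one file.
    """
    path.touch(exist_ok=True)
    with open(path, "r+", encoding="utf-8") as fh:
        # Blocks until any other process holding the lock releases it.
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            raw = fh.read()
            registry = json.loads(raw) if raw.strip() else {}
            yield registry
            # Reached only if the caller's block did not raise.
            fh.seek(0)
            fh.truncate()
            json.dump(registry, fh, indent=2)
            fh.flush()
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

A caller would wrap each read-modify-write in `with locked_registry(Path("registry.json")) as reg:` and mutate `reg` inside the block. Because the read and the write happen under the same lock, two concurrent pipeline runs cannot interleave their updates.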

Upgrade

cd your-doc2md-directory
git pull origin main
pip install marker-pdf  # new required dependency
pip install symspellpy wordsegment  # optional, improves post-processing

Full QC history

All changes went through adversarial QC loops (fix → QC → fix → QC) until zero issues at all severity levels:

  • Deferred audit fixes: 2 QC rounds → 0/0/0
  • Step 3b + manifest fixes: 3 QC rounds → 0/0/0
  • Integration tested on academic papers end-to-end