Releases · neoncapy/doc2md

What's New in v3.5.0

Marker is now the default PDF extractor

The pipeline now uses marker-pdf as the default extractor for digital PDFs. Marker produces significantly better output for academic papers — especially multi-column layouts, complex tables, and mathematical notation.

New fallback chain: marker → docling → pymupdf4llm → mineru → tesseract

Extractor	Used for
marker (new default)	Digital PDFs
docling	Scanned PDFs
pymupdf4llm	Fallback
MinerU	Complex layouts
tesseract	Last resort OCR
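
The fallback order above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `extractors` mapping, the `passes_quality_gate` helper, and the `MIN_CHARS` threshold are hypothetical names introduced here, and only the chain order comes from the release notes.

```python
from typing import Callable, Dict, List, Tuple

# Order taken from the release notes; everything else is illustrative.
FALLBACK_CHAIN = ["marker", "docling", "pymupdf4llm", "mineru", "tesseract"]

MIN_CHARS = 200  # assumed threshold for "critically empty" output


def passes_quality_gate(markdown: str, min_chars: int = MIN_CHARS) -> bool:
    """Hypothetical gate: reject output that is empty or nearly empty."""
    return len(markdown.strip()) >= min_chars


def extract_with_fallback(
    pdf_path: str,
    extractors: Dict[str, Callable[[str], str]],
    chain: List[str] = FALLBACK_CHAIN,
) -> Tuple[str, str]:
    """Try each extractor in order and return (name, markdown) for the first
    one that both succeeds and passes the quality gate."""
    errors = []
    for name in chain:
        extractor = extractors.get(name)
        if extractor is None:
            continue  # extractor not installed or not configured
        try:
            text = extractor(pdf_path)
        except Exception as exc:  # a crash counts as a failed attempt
            errors.append(f"{name}: {exc}")
            continue
        if passes_quality_gate(text):
            return name, text
        errors.append(f"{name}: output failed quality gate")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

Re-checking the gate after every attempt means an extractor that exits cleanly but returns almost nothing is treated the same as one that crashes.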

New: Step 3b — Image analysis after deferred extraction

Some extractors (like marker) defer image extraction to a later pipeline step. Previously, prepare-image-analysis.py was skipped on those paths because no image manifest existed yet when it ran. Now Step 3b automatically re-runs image analysis preparation after Step 6c creates the manifest. This unblocks AI expert persona descriptions for all extractor paths.
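
The ordering problem and its fix can be sketched as below. This is a toy model under stated assumptions: the manifest filename `image-manifest.json`, the function names, and the log strings are hypothetical and do not reflect the real scripts' interfaces.

```python
from pathlib import Path
from typing import List


def prepare_image_analysis(manifest_path: Path) -> str:
    """Stand-in for prepare-image-analysis.py: it can only do work once
    an image manifest exists."""
    if not manifest_path.exists():
        return "skipped"
    return "prepared"


def run_pipeline(workdir: Path, deferred_images: bool) -> List[str]:
    """Record what each image-related step does, in execution order."""
    manifest = workdir / "image-manifest.json"  # hypothetical filename
    log = []

    # Step 3: with a deferring extractor, no manifest exists yet.
    log.append(f"step3:{prepare_image_analysis(manifest)}")

    if deferred_images:
        # Step 6c: deferred image extraction writes the manifest.
        manifest.write_text("[]")
        log.append("step6c:manifest-created")

        # Step 3b: re-run preparation now that the manifest exists.
        log.append(f"step3b:{prepare_image_analysis(manifest)}")

    return log
```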

New file: `convert-paper-marker.py`

Standalone marker wrapper (~630 lines) with:

Page-count-based timeout (scales with document size)
CPU retry with configurable timeout
YAML title/author enrichment via fitz
Journal name title filtering
Hyphen-compound preservation
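
A page-count-based timeout with a CPU retry might look like the sketch below. The constants, function names, and the idea of passing a separate CPU command are assumptions made for illustration; the wrapper's real values and retry mechanism are not stated in these notes.

```python
import subprocess
from typing import List

BASE_TIMEOUT_S = 120   # assumed fixed startup allowance
PER_PAGE_S = 15        # assumed per-page budget
MAX_TIMEOUT_S = 3600   # assumed hard cap


def page_count_timeout(
    pages: int,
    base: int = BASE_TIMEOUT_S,
    per_page: int = PER_PAGE_S,
    cap: int = MAX_TIMEOUT_S,
) -> int:
    """Scale the timeout linearly with page count, up to a cap."""
    return min(cap, base + per_page * max(pages, 0))


def run_command(cmd: List[str], timeout: float) -> bool:
    """Return True only if the command exits 0 within the timeout.
    subprocess.run kills the child process when the timeout expires."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def convert_with_cpu_retry(
    primary_cmd: List[str],
    cpu_cmd: List[str],
    pages: int,
    cpu_timeout: float,
) -> str:
    """Try the primary command first, then a CPU-only variant with its
    own configurable timeout. Returns 'primary', 'cpu', or 'failed'."""
    if run_command(primary_cmd, page_count_timeout(pages)):
        return "primary"
    if run_command(cpu_cmd, cpu_timeout):
        return "cpu"
    return "failed"
```

With these assumed constants, a 10-page paper gets 270 seconds, and very large documents are held at the one-hour cap.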

Quality & reliability fixes

--no-images flag now fully respected across all extractor paths (was leaking through for marker)
run_command() timeout parameter — outer timeout guard prevents infinite hangs
Quality gate fallback — when an extractor exits 0 but produces critically empty output, the pipeline automatically falls back to the next extractor AND re-checks quality
Registry locking — fcntl.flock prevents corruption from concurrent pipeline runs
Checkpoint recovery — quality gate fallback now correctly updates checkpoint state for crash recovery
Type safety — figure_num handling works with mixed int/string values from split-panel images
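
As an illustration of the registry locking item, here is a minimal sketch of guarding a JSON registry with `fcntl.flock`. The registry format, the `.lock` sidecar file, and the function names are assumptions rather than the project's actual layout, and `fcntl` is available on Unix-like systems only.

```python
import fcntl
import json
import os
from contextlib import contextmanager


@contextmanager
def locked_registry(path: str):
    """Hold an exclusive advisory lock on a sidecar lock file while the
    registry is read and rewritten."""
    lock_path = path + ".lock"  # hypothetical sidecar file
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until acquired
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def register_document(path: str, key: str, entry: dict) -> dict:
    """Read-modify-write the registry under the lock, so concurrent
    pipeline runs cannot interleave their updates."""
    with locked_registry(path):
        registry = {}
        if os.path.exists(path):
            with open(path) as f:
                registry = json.load(f)
        registry[key] = entry
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(registry, f, indent=2)
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    return registry
```

Writing to a temporary file and renaming it means a crash in the middle of a write leaves the previous registry intact.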

Upgrade

cd your-doc2md-directory
git pull origin main
pip install marker-pdf  # new required dependency
pip install symspellpy wordsegment  # optional, improves post-processing

Full QC history

All changes went through adversarial QC loops (fix → QC → fix → QC) until zero issues at all severity levels:

Deferred audit fixes: 2 QC rounds → 0/0/0
Step 3b + manifest fixes: 3 QC rounds → 0/0/0
Integration tested on academic papers end-to-end
