
UPSTREAM PR #17400: mtmd: Add DeepSeekOCR Support#1144

Open
loci-dev wants to merge 126 commits into main from loci/pr-17400-sf-deepseek-ocr

Conversation


@loci-dev loci-dev commented Feb 3, 2026

Note

Source pull request: ggml-org/llama.cpp#17400

Feature Request: ggml-org/llama.cpp#16676

Make sure to read the contributing guidelines before submitting a PR

GGUF Models

sabafallah/DeepSeek-OCR-GGUF

deepseek-ocr-f32.gguf

mmproj-deepseek-ocr-f32.gguf

Running the Model

Build llama.cpp (Mac)

cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --config Release

Running llama-mtmd-cli

DeepSeekOCR Paper (First page)
build/bin/llama-mtmd-cli \
-m gguf_models/deepseek-ai/deepseek-ocr-f16.gguf \
--mmproj gguf_models/deepseek-ai/mmproj-deepseek-ocr-f16.gguf \
--image tmp/mtmd_test_data/Deepseek-OCR-2510.18234v1_page1.png \
-p "<|grounding|>Convert the document to markdown." \
--chat-template deepseek-ocr --temp 0

Hard Test (Old Newspaper Image)
build/bin/llama-mtmd-cli \
-m gguf_models/deepseek-ai/deepseek-ocr-f16.gguf \
--mmproj gguf_models/deepseek-ai/mmproj-deepseek-ocr-f16.gguf \
--image tools/mtmd/test-1.jpeg \
-p "<|grounding|>Convert the document to markdown." \
--chat-template deepseek-ocr --temp 0

sfallah and others added 30 commits November 14, 2025 12:40
mtmd: fix vision model processing
testing Vision model loading
mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 10f8f26 to a6ecec6 on February 20, 2026 02:17
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 9ea4a65 to c001e9f on February 22, 2026 02:17

loci-review bot commented Feb 22, 2026

Overview

The DeepSeek-OCR integration spans 126 commits and adds multimodal OCR capabilities to llama.cpp. Of 111,736 analyzed functions across 15 binaries, 217 were modified (0.19%), 60 were added, and none were removed. Overall system power consumption increased by 0.039% (597.52 nJ), with the impact isolated to the new multimodal library.

Power Consumption by Binary:

  • build.bin.libmtmd.so: +792.42 nJ (+0.419%) — new multimodal library
  • build.bin.llama-tts: +153.55 nJ (+0.043%)
  • build.bin.libggml-base.so: +60.41 nJ (+0.082%)
  • build.bin.libllama.so: -57.47 nJ (-0.022%)
  • build.bin.llama-cvector-generator: -309.57 nJ (-0.088%)
  • build.bin.llama-bench: -42.01 nJ (-0.071%)
  • build.bin.llama-quantize: 0.00 nJ (0.000%)
  • build.bin.llama-qwen2vl-cli: 0.00 nJ (0.000%)
  • build.bin.llama-tokenize: 0.00 nJ (0.000%)
  • build.bin.llama-gemma3-cli: 0.00 nJ (0.000%)
  • build.bin.llama-gguf-split: 0.00 nJ (0.000%)
  • build.bin.llama-llava-cli: 0.00 nJ (0.000%)
  • build.bin.llama-minicpmv-cli: 0.00 nJ (0.000%)
  • build.bin.libggml-cpu.so: 0.00 nJ (0.000%)
  • build.bin.libggml.so: 0.00 nJ (0.000%)
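
As a quick sanity check on the figures above (a sketch, not part of the report), the non-zero per-binary deltas should sum to roughly the reported system-wide increase of 597.52 nJ, with small rounding differences expected:

```python
# Per-binary power deltas (nJ) copied from the list above; binaries with
# a 0.00 nJ delta are omitted since they do not affect the sum.
deltas = {
    "libmtmd.so": +792.42,
    "llama-tts": +153.55,
    "libggml-base.so": +60.41,
    "libllama.so": -57.47,
    "llama-cvector-generator": -309.57,
    "llama-bench": -42.01,
}

total = sum(deltas.values())
print(f"{total:.2f} nJ")  # ~597.33 nJ, matching the reported 597.52 nJ up to rounding
```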

Function Analysis

Critical Path Function:

  • std::vector<bool>::resize (libggml-base.so): Response time 10,741ns → 10,850ns (+1.0%), throughput time 103ns → 181ns (+75.7%, +78ns). Called once per batch in llama_batch_allocr::split_reset(). No source code changes; regression stems from compiler optimization differences in std::vector bit-packing implementation. Moderate cumulative impact at high batch frequencies but <1% of overall inference time.
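
To put the +78 ns per-call regression in context, a back-of-the-envelope estimate (the batch rate here is an assumed, purely illustrative figure) shows the cumulative cost stays far below 1% of wall time:

```python
extra_ns_per_call = 78      # throughput regression per call, from the analysis above
calls_per_second = 2000     # assumed batch rate; illustrative, not measured

overhead_ns = extra_ns_per_call * calls_per_second  # extra ns per second of inference
overhead_pct = overhead_ns / 1e9 * 100              # as a percentage of wall time
print(f"{overhead_pct:.4f}%")                       # 0.0156%, consistent with the "<1%" claim
```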

Improved Functions (libllama.so):

  • std::vector::end() (KV cache): Response -69.2% (-183ns), throughput -75.4% (-183ns)
  • char_traits::length: Response -53.5% (-164ns), throughput -58.0% (-164ns)
  • __make_move_if_noexcept_iterator: Response -66.3% (-185ns), throughput -76.0% (-185ns)

Degraded Functions (llama-tts):

  • std::vector::end() (PEG parser): Response +224% (+183ns), throughput +307% (+183ns)
  • std::vector::begin(): Response +214% (+181ns), throughput +289% (+181ns)

All performance variations stem from compiler optimization differences between builds, not source code changes. No modifications detected in application code for analyzed functions. Core inference paths (GEMM, attention, KV cache hot paths) remain unchanged.

Additional Findings

The integration adds GPU-accelerated vision encoders (CLIP-ViT, SAM) using standard GGML operations, enabling cross-platform acceleration across CUDA, Metal, HIP, and Vulkan backends. Flash attention support reduces memory footprint for vision processing. Combined QKV projections improve GPU utilization. Vision encoding adds 105-410ms first-token latency overhead but preserves text-only inference performance. The modular architecture (libmtmd.so) successfully isolates multimodal functionality with minimal impact on core inference operations.
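
The combined-QKV point can be illustrated with a toy sketch (plain Python, not the actual GGML code): concatenating the three projection matrices lets one matrix multiply replace three, so the GPU launches a single larger kernel instead of three small ones, and splitting the result afterwards recovers identical Q, K, and V.

```python
def matmul(a, b):
    # Naive dense matmul over nested lists: a is m*k, b is k*n.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Toy hidden state (1 token, dim 2) and tiny 2x2 projection weights.
x  = [[1.0, 2.0]]
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[2.0, 0.0], [0.0, 2.0]]
wv = [[0.5, 0.0], [0.0, 0.5]]

# Separate projections: three matmuls.
q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)

# Fused projection: concatenate the weights column-wise, run one matmul, split.
w_qkv = [rq + rk + rv for rq, rk, rv in zip(wq, wk, wv)]
qkv = matmul(x, w_qkv)[0]
q2, k2, v2 = [qkv[0:2]], [qkv[2:4]], [qkv[4:6]]

assert (q2, k2, v2) == (q, k, v)  # identical results, one launch instead of three
```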

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 8c889a6 to 13648e6 on March 2, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 7 times, most recently from 8019888 to 17452e3 on March 9, 2026 02:17
