
UPSTREAM PR #17400: mtmd: Add DeepSeekOCR Support#1144

Open
loci-dev wants to merge 126 commits into main from loci/pr-17400-sf-deepseek-ocr

Conversation


@loci-dev loci-dev commented Feb 3, 2026

Note

Source pull request: ggml-org/llama.cpp#17400

Feature Request: ggml-org/llama.cpp#16676

Make sure to read the contributing guidelines before submitting a PR

GGUF Models

sabafallah/DeepSeek-OCR-GGUF

deepseek-ocr-f32.gguf

mmproj-deepseek-ocr-f32.gguf

Running the Model

Build llama.cpp (Mac)

cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --config Release

Running llama-mtmd-cli

DeepSeekOCR Paper (First page)
build/bin/llama-mtmd-cli \
-m gguf_models/deepseek-ai/deepseek-ocr-f16.gguf \
--mmproj gguf_models/deepseek-ai/mmproj-deepseek-ocr-f16.gguf \
--image tmp/mtmd_test_data/Deepseek-OCR-2510.18234v1_page1.png \
-p "<|grounding|>Convert the document to markdown." \
--chat-template deepseek-ocr --temp 0

Hard Test (Old Newspaper Image)
build/bin/llama-mtmd-cli \
-m gguf_models/deepseek-ai/deepseek-ocr-f16.gguf \
--mmproj gguf_models/deepseek-ai/mmproj-deepseek-ocr-f16.gguf \
--image tools/mtmd/test-1.jpeg \
-p "<|grounding|>Convert the document to markdown." \
--chat-template deepseek-ocr --temp 0

sfallah and others added 30 commits November 14, 2025 12:40
mtmd: fix vision model processing
testing Vision model loading
mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 10f8f26 to a6ecec6 on February 20, 2026 02:17
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from 9ea4a65 to c001e9f on February 22, 2026 02:17

loci-review bot commented Feb 22, 2026

Overview

The DeepSeek-OCR integration spans 126 commits and adds multimodal OCR capabilities to llama.cpp. Of 111,736 analyzed functions across 15 binaries, 217 were modified (0.19%), 60 were added, and none were removed. Overall system power consumption increased by 0.039% (597.52 nJ), with the impact isolated to the new multimodal library.

Power Consumption by Binary:

  • build.bin.libmtmd.so: +792.42 nJ (+0.419%) — new multimodal library
  • build.bin.llama-tts: +153.55 nJ (+0.043%)
  • build.bin.libggml-base.so: +60.41 nJ (+0.082%)
  • build.bin.libllama.so: -57.47 nJ (-0.022%)
  • build.bin.llama-cvector-generator: -309.57 nJ (-0.088%)
  • build.bin.llama-bench: -42.01 nJ (-0.071%)
  • build.bin.llama-quantize: 0.00 nJ (0.000%)
  • build.bin.llama-qwen2vl-cli: 0.00 nJ (0.000%)
  • build.bin.llama-tokenize: 0.00 nJ (0.000%)
  • build.bin.llama-gemma3-cli: 0.00 nJ (0.000%)
  • build.bin.llama-gguf-split: 0.00 nJ (0.000%)
  • build.bin.llama-llava-cli: 0.00 nJ (0.000%)
  • build.bin.llama-minicpmv-cli: 0.00 nJ (0.000%)
  • build.bin.libggml-cpu.so: 0.00 nJ (0.000%)
  • build.bin.libggml.so: 0.00 nJ (0.000%)
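
As a quick sanity check on the figures above (a sketch, not part of the report), the non-zero per-binary deltas should sum to roughly the reported system-wide increase of 597.52 nJ, with small rounding differences expected:

```python
# Per-binary power deltas (nJ) copied from the list above; binaries with
# a 0.00 nJ delta are omitted since they do not affect the sum.
deltas = {
    "libmtmd.so": +792.42,
    "llama-tts": +153.55,
    "libggml-base.so": +60.41,
    "libllama.so": -57.47,
    "llama-cvector-generator": -309.57,
    "llama-bench": -42.01,
}

total = sum(deltas.values())
print(f"{total:.2f} nJ")  # ~597.33 nJ, matching the reported 597.52 nJ up to rounding
```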

Function Analysis

Critical Path Function:

  • std::vector<bool>::resize (libggml-base.so): Response time 10,741ns → 10,850ns (+1.0%), throughput time 103ns → 181ns (+75.7%, +78ns). Called once per batch in llama_batch_allocr::split_reset(). No source code changes; regression stems from compiler optimization differences in std::vector bit-packing implementation. Moderate cumulative impact at high batch frequencies but <1% of overall inference time.
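
To put the +78 ns per-call regression in context, a back-of-the-envelope estimate (the batch rate here is an assumed, purely illustrative figure) shows the cumulative cost stays far below 1% of wall time:

```python
extra_ns_per_call = 78      # throughput regression per call, from the analysis above
calls_per_second = 2000     # assumed batch rate; illustrative, not measured

overhead_ns = extra_ns_per_call * calls_per_second  # extra ns per second of inference
overhead_pct = overhead_ns / 1e9 * 100              # as a percentage of wall time
print(f"{overhead_pct:.4f}%")                       # 0.0156%, consistent with the "<1%" claim
```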

Improved Functions (libllama.so):

  • std::vector::end() (KV cache): Response -69.2% (-183ns), throughput -75.4% (-183ns)
  • char_traits::length: Response -53.5% (-164ns), throughput -58.0% (-164ns)
  • __make_move_if_noexcept_iterator: Response -66.3% (-185ns), throughput -76.0% (-185ns)

Degraded Functions (llama-tts):

  • std::vector::end() (PEG parser): Response +224% (+183ns), throughput +307% (+183ns)
  • std::vector::begin(): Response +214% (+181ns), throughput +289% (+181ns)

All performance variations stem from compiler optimization differences between builds, not source code changes. No modifications detected in application code for analyzed functions. Core inference paths (GEMM, attention, KV cache hot paths) remain unchanged.

Additional Findings

The integration adds GPU-accelerated vision encoders (CLIP-ViT, SAM) using standard GGML operations, enabling cross-platform acceleration across CUDA, Metal, HIP, and Vulkan backends. Flash attention support reduces memory footprint for vision processing. Combined QKV projections improve GPU utilization. Vision encoding adds 105-410ms first-token latency overhead but preserves text-only inference performance. The modular architecture (libmtmd.so) successfully isolates multimodal functionality with minimal impact on core inference operations.
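
The combined-QKV point can be illustrated with a toy sketch (plain Python, not the actual GGML code): concatenating the three projection matrices lets one matrix multiply replace three, so the GPU launches a single larger kernel instead of three small ones, and splitting the result afterwards recovers identical Q, K, and V.

```python
def matmul(a, b):
    # Naive dense matmul over nested lists: a is m*k, b is k*n.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Toy hidden state (1 token, dim 2) and tiny 2x2 projection weights.
x  = [[1.0, 2.0]]
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[2.0, 0.0], [0.0, 2.0]]
wv = [[0.5, 0.0], [0.0, 0.5]]

# Separate projections: three matmuls.
q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)

# Fused projection: concatenate the weights column-wise, run one matmul, split.
w_qkv = [rq + rk + rv for rq, rk, rv in zip(wq, wk, wv)]
qkv = matmul(x, w_qkv)[0]
q2, k2, v2 = [qkv[0:2]], [qkv[2:4]], [qkv[4:6]]

assert (q2, k2, v2) == (q, k, v)  # identical results, one launch instead of three
```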

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 8c889a6 to 13648e6 on March 2, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 7 times, most recently from 8019888 to 17452e3 on March 9, 2026 02:17
