UPSTREAM PR #17400: mtmd: Add DeepSeekOCR Support#1144
Conversation
- init commit
- mtmd: fix vision model processing
- testing Vision model loading
- mtmd: DeepseekOCR Implement DeepSeek3B-MoE-A570M (LM component)
- …ut in deepseek2 model
- …e image decoding fails
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
# Conflicts:
#	tools/mtmd/clip.cpp
Force-pushed from 10f8f26 to a6ecec6
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize algorithm
- simplified image preprocessing
- removed/simplified debug functions
Force-pushed from 9ea4a65 to c001e9f
Overview

The DeepSeek-OCR integration spans 126 commits adding multimodal OCR capabilities to llama.cpp. Analysis of 111,736 functions reveals 217 modified (0.19%), 60 new, and 0 removed functions across 15 binaries. Overall system power consumption increased by 0.039% (597.52 nJ), with the impact isolated to the new multimodal library.

Power Consumption by Binary:

Function Analysis

Critical Path Function:

Improved Functions (libllama.so):

Degraded Functions (llama-tts):

All performance variations stem from compiler optimization differences between builds, not source code changes. No modifications were detected in application code for the analyzed functions. Core inference paths (GEMM, attention, KV cache hot paths) remain unchanged.

Additional Findings

The integration adds GPU-accelerated vision encoders (CLIP-ViT, SAM) using standard GGML operations, enabling cross-platform acceleration across the CUDA, Metal, HIP, and Vulkan backends. Flash attention support reduces the memory footprint of vision processing, and combined QKV projections improve GPU utilization. Vision encoding adds 105-410 ms of first-token latency overhead but preserves text-only inference performance. The modular architecture (libmtmd.so) successfully isolates multimodal functionality with minimal impact on core inference operations. 🔎 Full breakdown: Loci Inspector.
Force-pushed from 8c889a6 to 13648e6
Force-pushed from 8019888 to 17452e3
Note
Source pull request: ggml-org/llama.cpp#17400
Feature Request: ggml-org/llama.cpp#16676
Make sure to read the contributing guidelines before submitting a PR
GGUF Models
sabafallah/DeepSeek-OCR-GGUF
deepseek-ocr-f32.gguf
mmproj-deepseek-ocr-f32.gguf
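The PR does not spell out how to fetch the files; one common way is the `huggingface-cli download` command (the local directory `.` here is just an example):

```shell
# Download both GGUF files from the Hugging Face repo into the current directory.
huggingface-cli download sabafallah/DeepSeek-OCR-GGUF \
    deepseek-ocr-f32.gguf mmproj-deepseek-ocr-f32.gguf \
    --local-dir .
```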
Running the Model
Build llama.cpp (Mac)
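The exact build commands are not shown in this excerpt; a typical CMake build of llama.cpp on a Mac looks like the following (Metal acceleration is enabled by default on Apple Silicon):

```shell
# Clone the repository and build the release binaries, including llama-mtmd-cli.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```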
Running llama-mtmd-cli
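The invocation itself is elided here; a sketch using the standard `llama-mtmd-cli` flags would pair the language model with the multimodal projector (the image filename and prompt below are placeholders, not from the PR):

```shell
# Run OCR on a single image: -m is the language model, --mmproj the
# vision projector, --image the input, -p the text prompt.
./build/bin/llama-mtmd-cli \
    -m deepseek-ocr-f32.gguf \
    --mmproj mmproj-deepseek-ocr-f32.gguf \
    --image page-1.png \
    -p "Transcribe the text in this image."
```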
DeepSeekOCR Paper (First page)
Hard Test (Old Newspaper Image)