feat(multimodal): add Kimi-K2.5 vision support for gRPC router#1026

Closed
Kangyan-Zhou wants to merge 23 commits into lightseekorg:main from Kangyan-Zhou:fix_kimi_k25_tokenizer_sglang_binding
Conversation

Contributor

@Kangyan-Zhou commented Apr 2, 2026

Summary

Add multimodal (image) support for moonshotai/Kimi-K2.5 in the gRPC PD router, matching the HTTP path's accuracy and improving TTFT at high concurrency.

  • KimiK25VisionSpec: model registry spec with <|media_pad|> placeholder, grid_thws field layout, media_placeholder_token_id from config
  • KimiK25Processor: standalone image preprocessor matching HF's navit_resize + zero-pad pipeline, producing 4D [N, 3, 14, 14] patches for MoonViT's Conv2d
  • PreProcessorConfig: parse nested media_proc_cfg from Kimi's non-standard preprocessor_config.json
  • Tiktoken encoding: use encode_with_special_tokens so chat template special tokens (e.g., <|media_pad|>) are recognized as single token IDs
  • HF Hub config download: download_model_configs_from_hf fetches config.json + preprocessor_config.json when not available locally
  • Performance: fused resize+pad+normalize, SIMD resize (fast_image_resize), spawn_blocking for preprocessing, strip mm_inputs from decode worker

Validation

  • Accuracy: MMMU-Pro 1730 samples — gRPC 78.3% (KVV thinking mode)
  • TTFT (MMMU images, bench_serving):
| Concurrency | gRPC Median TTFT | HTTP Median TTFT | gRPC P99 TTFT | HTTP P99 TTFT |
|------------:|-----------------:|-----------------:|--------------:|--------------:|
| 1           | 185ms            | 212ms            | 617ms         | 566ms         |
| 10          | 488ms            | 524ms            | 795ms         | 797ms         |
| 50          | 1,759ms          | 2,135ms          | 2,360ms       | 2,825ms       |
| 100         | 4,262ms          | 5,105ms          | 4,855ms       | 6,333ms       |
  • Throughput (MMMU, concurrency 100): gRPC 3,741 tok/s vs HTTP 3,257 tok/s (+15%)

Test plan

  • cargo test -p llm-multimodal -- kimi (17 tests)
  • cargo test -p llm-tokenizer (103 tests including special token encoding)
  • pre-commit run --all-files clean
  • Smoke test: image correctly identified via gRPC
  • 50-sample MMMU-Pro accuracy check (80%+)
  • Full 1730-sample MMMU-Pro eval via KVV (78.3%)
  • TTFT benchmark at concurrency 1/10/30/50/100

🤖 Generated with Claude Code

Kangyan-Zhou and others added 20 commits April 1, 2026 11:28
Add ModelProcessorSpec and ImagePreProcessor for moonshotai/Kimi-K2.5
so the gRPC PD router can handle multimodal (image) requests.

- KimiK25VisionSpec: matches "kimi" + "k2" model IDs, uses
  <|media_pad|> placeholder (media_placeholder_token_id from config),
  NaViT-style field layouts identical to Qwen-VL family
- KimiK25Processor: wraps QwenVLProcessorBase with Kimi-specific
  defaults (patch_size=14, merge_size=2, normalization=[0.5,0.5,0.5],
  max_pixels=3,211,264 from in_patch_limit=16384)
- Fix get_zmq_socket import path for sglang main compat

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
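
As a sanity check, the max_pixels figure above follows directly from the patch limit: one 14x14 patch covers 196 pixels, so 16,384 patches cap the input at 3,211,264 pixels. A minimal sketch (constant and function names are illustrative, not the crate's actual identifiers):

```rust
// Illustrative constants; names are not the crate's real identifiers.
const PATCH_SIZE: usize = 14; // ViT patch edge in pixels
const IN_PATCH_LIMIT: usize = 16_384; // max patches per image

/// Each 14x14 patch covers 196 pixels, so 16,384 patches
/// cap the image at 16,384 * 196 = 3,211,264 pixels.
fn max_pixels() -> usize {
    IN_PATCH_LIMIT * PATCH_SIZE * PATCH_SIZE
}

fn main() {
    assert_eq!(max_pixels(), 3_211_264);
}
```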
…locally

When the tokenizer source is a HuggingFace model ID (e.g.,
"moonshotai/Kimi-K2.5") rather than a local directory, the gRPC router
cannot read config.json and preprocessor_config.json from disk. This
causes multimodal requests to fail with "Failed to read config.json".

Make get_or_load_config async and fall back to downloading the two
config files from HF Hub via the new download_files_from_hf helper
when the local path doesn't exist.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address review findings:
- Log errors from HF Hub downloads instead of silently swallowing them
- Add explicit error when local model directory exists but config.json
  is missing (prevents misleading fallback to HF Hub)
- Upgrade fallback log from debug to warn for production visibility

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TiktokenTokenizer::encode() was using encode_ordinary() which ignores
special tokens in the input text. This caused chat-template-rendered
special tokens like <|media_pad|> to be split into BPE sub-tokens
instead of being recognized as single token IDs.

Switch to encode_with_special_tokens() unconditionally, matching
HuggingFace tokenizer behavior where added special tokens are always
recognized in input text. This fixes Kimi-K2.5 multimodal where the
chat template inserts <|media_pad|> (ID 163605) but the tokenizer
was producing sub-tokens that expand_tokens couldn't find.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
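
The failure mode can be illustrated with a dependency-free toy encoder. This is not tiktoken-rs's API; real BPE merges are replaced here with per-byte IDs for illustration. The point is that a special-token-aware pass emits exactly one ID for a registered string like <|media_pad|>, where an ordinary pass would fall through to sub-tokens:

```rust
use std::collections::HashMap;

/// Toy encoder: split the input on registered special-token strings and
/// emit their single IDs; everything else falls back to per-byte IDs
/// (a stand-in for real BPE merges).
fn encode_with_specials(text: &str, specials: &HashMap<&str, u32>) -> Vec<u32> {
    let mut out = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Find the earliest occurrence of any special token.
        let hit = specials
            .iter()
            .filter_map(|(s, id)| rest.find(s).map(|pos| (pos, *s, *id)))
            .min_by_key(|(pos, _, _)| *pos);
        match hit {
            Some((pos, s, id)) => {
                out.extend(rest[..pos].bytes().map(u32::from)); // ordinary text
                out.push(id); // single special-token ID
                rest = &rest[pos + s.len()..];
            }
            None => {
                out.extend(rest.bytes().map(u32::from));
                rest = "";
            }
        }
    }
    out
}

fn main() {
    let mut specials = HashMap::new();
    specials.insert("<|media_pad|>", 163_605u32); // ID from the Kimi-K2.5 config
    let ids = encode_with_specials("a<|media_pad|>b", &specials);
    assert_eq!(ids, vec![97, 163_605, 98]); // 'a', one special ID, 'b'
}
```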
Kimi-K2.5 engine accesses `item.grid_thws` (plural) on
MultimodalDataItem, but the gateway was sending `image_grid_thw`
(Qwen-VL convention). Rename the key in the processor output and
update field_layouts/keep_on_cpu_keys to match.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove QwenVLProcessorBase dependency. Kimi-K2.5's MoonViT expects
pixel_values as [N, C, patch_size, patch_size] (4D), not flattened
[N, C*T*patch_size*patch_size] (2D) like Qwen-VL. The model's
PatchEmbed3d applies Conv2d on each patch directly.

Implement smart_resize and extract_patches independently, producing
[total_patches, 3*14*14] = [N, 588] patches that the engine
reconstructs as [N, 3, 14, 14] for Conv2d input.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
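
The [N, 588] layout can be sketched in std-only Rust. The function signature and planar [C, H, W] buffer layout are assumptions for illustration; the real code operates on ndarray types:

```rust
/// Extract non-overlapping patches from a planar [C, H, W] f32 buffer,
/// producing a flat [grid_h * grid_w, C * P * P] layout. With P = 14 and
/// C = 3, each patch carries 3 * 14 * 14 = 588 features. Rows are copied
/// with extend_from_slice (a 14-element memcpy per row).
fn extract_patches(planes: &[f32], c: usize, h: usize, w: usize, p: usize) -> Vec<f32> {
    let (grid_h, grid_w) = (h / p, w / p);
    let mut patches = Vec::with_capacity(grid_h * grid_w * c * p * p);
    for gy in 0..grid_h {
        for gx in 0..grid_w {
            for ch in 0..c {
                for row in 0..p {
                    let y = gy * p + row;
                    let start = ch * h * w + y * w + gx * p;
                    patches.extend_from_slice(&planes[start..start + p]);
                }
            }
        }
    }
    patches
}

fn main() {
    // A 3x28x28 planar image yields a 2x2 grid of 588-feature patches.
    let planes = vec![0.0f32; 3 * 28 * 28];
    let patches = extract_patches(&planes, 3, 28, 28, 14);
    assert_eq!(patches.len(), 4 * 588);
}
```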
The engine's PatchEmbed3d Conv2d expects 4D input [N, C, H, W] but the
gateway was serializing pixel_values as 2D [N, C*patch_size*patch_size].
Store as ndarray::Array4 so the proto shape field is [N, 3, 14, 14],
which the engine reconstructs correctly for Conv2d.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous smart_resize (from Qwen-VL) resized images directly to
factor-aligned dimensions, stretching the content. The HF Kimi
preprocessor instead:
1. Computes scale capped at 1.0 (never upscales)
2. Resizes with BICUBIC interpolation
3. Zero-pads to factor-aligned dimensions

This mismatch caused degraded image quality — the model was trained
with zero-padded images, not stretched ones. Rewrite to match the
HF navit_resize_image + resize_image pipeline exactly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
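
The dimension math behind the three steps can be sketched as follows. Only the scale cap at 1.0 and the zero-pad-to-factor behavior are taken from the description above; the exact scale rule (square root of the pixel-budget ratio) and the function name are assumptions for illustration, and the BICUBIC interpolation itself is omitted:

```rust
/// Compute (resized_w, resized_h, padded_w, padded_h): scale is capped at
/// 1.0 so content is never upscaled, and each side is zero-padded up to a
/// multiple of `factor` (patch_size * merge_size) instead of stretched.
fn navit_dims(w: u32, h: u32, max_pixels: u32, factor: u32) -> (u32, u32, u32, u32) {
    let pixels = (w as f64) * (h as f64);
    let scale = (max_pixels as f64 / pixels).sqrt().min(1.0); // never upscale
    let rw = ((w as f64) * scale).round().max(1.0) as u32;
    let rh = ((h as f64) * scale).round().max(1.0) as u32;
    let pad_up = |x: u32| ((x + factor - 1) / factor) * factor; // pad, don't crop
    (rw, rh, pad_up(rw), pad_up(rh))
}

fn main() {
    // A small image fits the pixel budget: no resize, only padding
    // to the 28-pixel factor grid.
    let (rw, rh, pw, ph) = navit_dims(100, 50, 3_211_264, 28);
    assert_eq!((rw, rh), (100, 50));
    assert_eq!((pw, ph), (112, 56));
}
```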
download_files_from_hf was silently failing in production (likely
hf-hub crate issue). Switch to download_tokenizer_from_hf which
already works for tokenizer loading and returns the HF cache
directory containing config.json and preprocessor_config.json.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
download_tokenizer_from_hf only downloads tokenizer files (filtered by
is_tokenizer_file), not config.json or preprocessor_config.json. Add a
dedicated download_model_configs_from_hf that fetches these two files
on the first multimodal request.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add detailed logging at key points:
- Image dimensions, color type, raw bytes size after fetch
- Pixel values shape, token counts, first/last pixels, min/max
- Serialized pixel_values bytes and shape
- Token expansion details (search_token_id, placeholders, offsets)

Also use download_model_configs_from_hf and remove dead code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
KimiK25Processor::preprocess() was reading mean/std from
PreProcessorConfig which falls back to CLIP values when the config
can't be parsed (Kimi's preprocessor_config.json nests values under
media_proc_cfg). This caused images to be normalized with CLIP
mean=[0.48,0.46,0.41] std=[0.27,0.26,0.28] instead of Kimi's
mean=[0.5,0.5,0.5] std=[0.5,0.5,0.5], producing wrong pixel values
that made the model misinterpret images entirely.

Use self.default_mean()/default_std() which are hardcoded to the
correct Kimi values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of hardcoding normalization values, parse the nested
media_proc_cfg structure in Kimi's preprocessor_config.json to extract
image_mean, image_std, patch_size, and merge_kernel_size. This ensures
the correct values are used regardless of how the config is structured.

The previous fix hardcoded [0.5,0.5,0.5] in the processor, which
worked but would break if values changed. Now from_json() checks for
media_proc_cfg when top-level fields are missing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
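
The fallback order can be sketched with a toy flat-map stand-in for the JSON config. The real code parses serde_json values; the types and helper name here are illustrative only:

```rust
use std::collections::HashMap;

type Section = HashMap<&'static str, f64>;

/// Prefer a top-level field; otherwise fall back to the nested
/// media_proc_cfg section, mirroring the from_json() lookup order.
fn field(top: &Section, media_proc_cfg: &Section, key: &str) -> Option<f64> {
    top.get(key).copied().or_else(|| media_proc_cfg.get(key).copied())
}

fn main() {
    let top = Section::new(); // Kimi's config has no top-level image_mean
    let mut media_proc_cfg = Section::new();
    media_proc_cfg.insert("image_mean", 0.5);
    // The nested value is found even though the top level is empty.
    assert_eq!(field(&top, &media_proc_cfg, "image_mean"), Some(0.5));
    // A missing key stays missing rather than silently defaulting.
    assert_eq!(field(&top, &media_proc_cfg, "image_std"), None);
}
```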
Revert info-level diagnostic logging back to debug level now that the
normalization root cause has been identified and fixed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Make from_value consistent with from_json by delegating to it,
  ensuring nested media_proc_cfg extraction applies to both paths
- Add test for encode_with_special_tokens verifying special token
  strings in input produce single token IDs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

Two optimizations for the Kimi-K2.5 image preprocessing pipeline:

1. Fuse resize + pad + normalize into a single pass using
   deinterleave_rgb_to_planes with precomputed scale/bias. Eliminates
   2 intermediate Array3 allocations and 2 extra passes over pixel data.

2. Replace per-element scalar indexing in extract_patches with
   row-based extend_from_slice (14-element memcpy per row), enabling
   compiler auto-vectorization.

Also take upstream multimodal.rs which has resolve_model_config_dir
and updated image_processor_registry.find() API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
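
The precomputed scale/bias collapse the two-pass normalization (p/255 - mean)/std into a single multiply-add per element, since (p/255 - mean)/std = p/(255*std) - mean/std. A small sketch verifying the algebra (the helper name is illustrative):

```rust
/// Precompute per-channel scale = 1/(255*std) and bias = -mean/std so that
/// p * scale + bias == (p/255 - mean)/std for any u8 pixel value p.
fn fused_params(mean: &[f32; 3], std: &[f32; 3]) -> ([f32; 3], [f32; 3]) {
    let scale: [f32; 3] = std::array::from_fn(|c| 1.0 / (255.0 * std[c]));
    let bias: [f32; 3] = std::array::from_fn(|c| -mean[c] / std[c]);
    (scale, bias)
}

fn main() {
    // Kimi's mean=std=[0.5; 3] maps mid-gray near 0 and extremes to +/-1.
    let (scale, bias) = fused_params(&[0.5; 3], &[0.5; 3]);
    for &p in &[0u8, 127, 255] {
        for c in 0..3 {
            let two_pass = (p as f32 / 255.0 - 0.5) / 0.5; // original form
            let fused = p as f32 * scale[c] + bias[c]; // single multiply-add
            assert!((two_pass - fused).abs() < 1e-6);
        }
    }
}
```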
Log timing breakdown: image fetch, config load, preprocessing,
token expansion, and assembly/serialization. This helps identify
which step dominates TTFT for multimodal gRPC requests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…inputs

Two optimizations to reduce gRPC multimodal TTFT:

1. Move image preprocessing (resize + pad + normalize + patchify) to
   tokio::task::spawn_blocking so CPU-intensive work doesn't block
   the async runtime. Under 200 concurrent requests, this prevents
   serialized preprocessing from inflating tail latencies.

2. Strip mm_inputs from decode worker requests in PD dual dispatch.
   The decode worker only needs the KV cache from prefill — sending
   ~40MB of pixel tensors to it was pure waste.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
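
The decode-side stripping can be sketched with hypothetical stand-ins for the generated proto types; in the real router, clear_mm_inputs comes from the prost-generated gRPC code:

```rust
// Hypothetical stand-ins for the generated proto types.
#[allow(dead_code)]
#[derive(Clone, Default)]
struct MmInputs {
    pixel_values: Vec<u8>, // serialized pixel tensors (tens of MB for images)
}

#[allow(dead_code)]
#[derive(Clone, Default)]
struct GenerateRequest {
    prompt_ids: Vec<u32>,
    mm_inputs: Option<MmInputs>,
}

impl GenerateRequest {
    /// Drop multimodal payloads; the decode worker only replays the
    /// KV cache produced by prefill and never reads the pixels.
    fn clear_mm_inputs(&mut self) {
        self.mm_inputs = None;
    }
}

fn main() {
    let prefill_request = GenerateRequest {
        prompt_ids: vec![1, 2, 3],
        mm_inputs: Some(MmInputs { pixel_values: vec![0; 1024] }),
    };
    // PD dual dispatch: prefill keeps the tensors, decode does not.
    let mut decode_request = prefill_request.clone();
    decode_request.clear_mm_inputs();
    assert!(prefill_request.mm_inputs.is_some());
    assert!(decode_request.mm_inputs.is_none());
}
```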
Replace image::resize_exact(CatmullRom) with transforms::resize()
which uses fast_image_resize (AVX2/SSE4 SIMD). This is a drop-in
replacement that gives 3-5x faster BICUBIC resize — the dominant
CPU cost in the preprocessing pipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove info-level timing logs (fetch_ms, config_ms, preprocess_ms,
expand_ms, serialize_ms) now that performance analysis is complete.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions bot added labels tokenizer (Tokenizer related changes), grpc (gRPC client and router changes), multimodal (Multimodal crate changes), model-gateway (Model gateway crate changes) on Apr 2, 2026
@coderabbitai bot commented Apr 2, 2026

Review skipped: draft detected.

@gemini-code-assist bot left a comment
Code Review

This pull request introduces support for the Kimi-K2.5 (MoonViT) model, implementing a specialized image processor that handles its specific resizing and zero-padding requirements. Key changes include updates to the model registry, preprocessor configuration parsing for nested formats, and tokenizer encoding to ensure special tokens are correctly recognized. Performance optimizations were also added to the model gateway, such as offloading image preprocessing to a blocking thread pool and stripping multimodal data from decode requests to reduce memory overhead. Review feedback focuses on memory efficiency, specifically suggesting the use of reference-counted pointers to avoid deep clones of image data and cautioning against large vector allocations during image processing.

```rust
let registry = components.image_processor_registry.clone();
let model_id_owned = model_id.to_string();
let model_type_owned = model_type.map(String::from);
let image_clones: Vec<image::DynamicImage> = images.iter().map(|f| f.image.clone()).collect();
```
Severity: high

Cloning all images into a Vec<DynamicImage> before spawning the blocking task creates a full copy of the image data in memory for every request. Since data passed to spawned background tasks must have a 'static lifetime, use reference-counted pointers like Arc to share the data efficiently instead of performing deep clones or attempting to pass references.


```rust
let scale: [f32; 3] = std::array::from_fn(|c| 1.0 / (255.0 * std[c] as f32));
let bias: [f32; 3] = std::array::from_fn(|c| -(mean[c] as f32) / (std[c] as f32));

let mut data = vec![0.0f32; 3 * canvas_pixels];
```
Severity: medium

The vector allocation vec![0.0f32; 3 * canvas_pixels] is potentially large. Given that canvas_pixels can be up to 512 * 512 (or more depending on input), this could lead to memory allocation failures or fragmentation. Consider using a pre-allocated buffer or a more memory-efficient approach if this is called frequently in a high-concurrency environment.

```rust
let num_patches = grid_h * grid_w;
let patch_features = channels * patch_size * patch_size;

let mut patches = Vec::with_capacity(num_patches * patch_features);
```
Severity: medium

Similar to the allocation in resize_pad_and_normalize, Vec::with_capacity(num_patches * patch_features) can be very large. If num_patches is high, this allocation might fail. Consider processing patches in smaller chunks or using a streaming approach to reduce peak memory usage.

```rust
// Strip multimodal data from decode request — the decode worker only
// needs the KV cache from prefill, not the pixel tensors (~40MB saved).
let mut decode_request = proto_request;
decode_request.clear_mm_inputs();
```
Member
Nice!

Kangyan-Zhou and others added 3 commits April 4, 2026 00:19
- Remove dead download_model_configs_from_hf (replaced by upstream
  resolve_model_config_dir)
- Extract in_patch_limit/patch_limit_on_one_side from media_proc_cfg
  into config.extra, read in from_preprocessor_config
- Always check media_proc_cfg for all fields, not just when
  image_mean/std are missing (fixes partial config overlap)
- Log warning when placeholder_token_id fails instead of silent .ok()
- Add config_model_type fallback to KimiK25VisionSpec::matches
- Add tests: 1x1 image, empty batch, from_preprocessor_config limits
- Improve tiktoken encode comment explaining why
  encode_with_special_tokens is used

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make Kimi-K2.5 code comments self-contained instead of comparing
against Qwen-VL internals.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…arse_mm_inputs

sglang v0.5.10 mm_utils.has_shm_features() accesses req.mm_inputs.mm_items
via attribute, which fails when mm_inputs is a plain dict. Return a proper
MultimodalInputs dataclass to fix the AttributeError crash on VLM requests
in gRPC mode.
@Kangyan-Zhou
Contributor Author

Closing to reopen with correct branch naming convention (feat/kimi-k25-vision-grpc) and DCO sign-offs.

@Kangyan-Zhou Kangyan-Zhou deleted the fix_kimi_k25_tokenizer_sglang_binding branch April 7, 2026 01:04

Labels

grpc (gRPC client and router changes), model-gateway (Model gateway crate changes), multimodal (Multimodal crate changes), tokenizer (Tokenizer related changes)
