[WIP] MiniMax-M3 runtime: sparse MSA cache + indexer decode #75
Draft
jjang-ai wants to merge 6 commits into
added 4 commits
June 20, 2026 20:44
…ng, protocol-driven) Cache-first foundation for the MM3 port. Wraps KVCacheSimple + an append-only idx_keys lane (Lightning-Indexer keys); keys/values/idx_keys move together through copy/state/trim/metaState so the cache stays first-class through every reuse path (no downcast to plain KVCache = the Python repetition-loop root cause). Mirrors the ZayaCCACache composite precedent. DEBUG asserts idx_keys length == offset (wrong restored offset corrupts attention even with right shapes).
…nerState() for Evaluatable/Updatable
…max_m3 Full text-decode port of vllm-mlx minimax_m3.py: GemmaRMSNorm (reused), gpt_oss clamped swiglu (SwiGLUOAIMLP + SwitchGLU glue closure for routed experts), Lightning Indexer block selection, MSA attention with update-before-indexer ordering, DeepseekV3-style sigmoid router with e_score_correction_bias, and the MiniMaxM3SparseCache dispatch (sparse layers 3-59 / dense 0-2). sanitize remaps the JANG checkpoint keys and drops the VL vision stack for the text-only build. Compiles green; sanitize validated by header-only weight probe over all 23 shards (3277 tensors) — every key maps to the module tree, quantized/fp16 split exact.
Make MiniMax-M3 (REAP/JANG) load, decode, and reuse caches correctly. Proven by 17 tests (10 no-model cache/topology, 6 tool-parser unit, 1 gated live single-load smoke covering load, MSA fire >2048, reasoning split, eos, and scanned coherent multiturn). minimax_m2 / M2.7 left untouched.
Quant loader (Load.swift, JangLoader.swift): resolve the bits*group_size packing ambiguity, where (6,64) and (3,128) yield identical weight/scales shapes, by deriving bits/gs from each placeholder module's true input dim at the quantize site. Fixes the embed/expert dim-doubling quantized_matmul crash.
Reasoning (ReasoningParser.swift): route minimax_m3* to a <mm:think>...</mm:think> parser; minimax_m2 stays on <think>. Reasoning knob is thinking_mode (enabled/disabled/adaptive), not enable_thinking.
Tool calls (Tool/ToolCallFormat.swift, Tool/Parsers/MiniMaxM3ToolCallParser.swift): new .minimaxM3 format + parser, a faithful port of vllm-mlx minimax_m3_tool_parser.py (tag-named-param XML: <tool_call><invoke name><key>v</key>, namespace-token + <mm:think> stripping, nested object/array args, scalar coercion). Routed from model_type and capability name.
Cache (MiniMaxM3SparseCache 3-lane composite already on branch):
- ModelContainer: classify M3 sparse layers (miniMaxM3SparseLayerCount / requiresMiniMaxM3SparseState / disk-backed restore), not plain KV.
- CacheHelpers: M3 forces paged-incompatible (idx_keys can't ride the paged tier).
- TQDiskSerializer: LayerKind.miniMaxM3Sparse serialize/restore of keys/values/idx_keys.
- KVCache.swift: register MiniMaxM3SparseCache in the prompt-cache class map; maybeQuantizeKVCache only touches KVCacheSimple so MSA caches are never TQ-encoded.
- innerState() returns all 3 lanes so eval(cache) materializes idx_keys with K/V.
MiniMaxM3.swift: env-gated one-shot MSA-fire trace (MM3_MSA_TRACE).
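The packing ambiguity named in the quant-loader note can be checked with a little arithmetic. The sketch below is illustrative Python, not code from Load.swift or JangLoader.swift: the function names are hypothetical, the input dim of 3072 is an arbitrary example value, and it assumes MLX-style packing of quantized weights into 32-bit words with one scale per group. Under that assumption, a (6,64) layer with input dim D and a (3,128) layer with input dim 2D produce the same weight and scales shapes, which matches the dim-doubling symptom; knowing the true input dim removes the ambiguity.

```python
def packed_shapes(out_dim, in_dim, bits, group_size):
    """Return (weight_shape, scales_shape) under the assumed 32-bit packing."""
    assert (in_dim * bits) % 32 == 0, "row must fill whole 32-bit words"
    assert in_dim % group_size == 0, "in_dim must be a multiple of group_size"
    weight_shape = (out_dim, in_dim * bits // 32)
    scales_shape = (out_dim, in_dim // group_size)
    return weight_shape, scales_shape


def infer_bits_and_group_size(weight_shape, scales_shape, in_dim):
    """Recover (bits, group_size) from stored shapes plus the true input dim."""
    packed_cols = weight_shape[1]
    groups = scales_shape[1]
    bits, bits_rem = divmod(packed_cols * 32, in_dim)
    group_size, gs_rem = divmod(in_dim, groups)
    if bits_rem or gs_rem:
        raise ValueError("stored shapes are inconsistent with in_dim")
    return bits, group_size


# Two different layers that look identical on disk (example dims only).
a = packed_shapes(out_dim=8, in_dim=3072, bits=6, group_size=64)
b = packed_shapes(out_dim=8, in_dim=6144, bits=3, group_size=128)
print(a == b)  # True: shapes alone cannot tell the two settings apart

# The true input dim of the placeholder module picks the right setting.
print(infer_bits_and_group_size(*a, in_dim=3072))  # (6, 64)
print(infer_bits_and_group_size(*a, in_dim=6144))  # (3, 128)
```

Both settings share the product bits*group_size = 384, so shape inspection alone cannot separate them.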
687fd05 to 5139e84
WIP — engine lane for MiniMax-M3 (mm3). Built in lockstep with the osaurus integration PR (osaurus-ai/osaurus#1576). Do not merge until the multiturn-no-loop gate is met.
Why M3 is special
Layers 0-2 dense full-attn (stock KV); layers 3-59 block-sparse MSA / Lightning-Indexer (GQA n_kv=4, head_dim=128). Sparse layers carry a 3rd append-only cache lane idx_keys [B,1,S,128]; the indexer max-pools per 128-token block and selects top-k blocks, recomputed every step. Blocks anchored to absolute pos/128 → append-only / trim-replay only.
This PR so far
MiniMaxM3SparseCache (Libraries/MLXLMCommon/Cache/): self-cloning 3-lane composite cache mirroring ZayaCCACache. copy()/state/trim/metaState carry keys, values, idx_keys together so the cache stays first-class through every reuse path (the Python loop bug was a reuse layer downcasting it to plain KVCache and dropping idx_keys). DEBUG asserts idx_keys.len == offset.
Next (in lockstep with osaurus#1576)
make_cache dispatch.
Gate
Multiturn cache-reuse on MiniMax-M3-REAP40-d3-JANG_2L: zero loops, zero incoherency, cache-hit turns coherent. (Paged off, TQ-KV skip, JIT off for M3.)
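The cache contract and block selection described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Swift implementation: ThreeLaneCache and select_blocks are hypothetical names, each lane is a plain list with one entry per token, and the per-token indexer scores are taken as given rather than computed from idx_keys. Only the 128-token block size, the per-block max-pool, the top-k selection, and the idx_keys length == offset assertion come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List

BLOCK = 128  # tokens per indexer block, as stated in the description


@dataclass
class ThreeLaneCache:
    """Toy stand-in for a keys/values/idx_keys composite (hypothetical type)."""
    keys: List[float] = field(default_factory=list)
    values: List[float] = field(default_factory=list)
    idx_keys: List[float] = field(default_factory=list)
    offset: int = 0

    def append(self, k: float, v: float, idx_k: float) -> None:
        # All three lanes grow by exactly one token per step.
        self.keys.append(k)
        self.values.append(v)
        self.idx_keys.append(idx_k)
        self.offset += 1
        self._check()

    def trim(self, n: int) -> int:
        # Drop the last n tokens from every lane, never from K/V alone.
        n = min(n, self.offset)
        keep = self.offset - n
        self.keys = self.keys[:keep]
        self.values = self.values[:keep]
        self.idx_keys = self.idx_keys[:keep]
        self.offset = keep
        self._check()
        return n

    def copy(self) -> "ThreeLaneCache":
        # The clone keeps its own type and all three lanes (no downcast).
        return ThreeLaneCache(list(self.keys), list(self.values),
                              list(self.idx_keys), self.offset)

    def _check(self) -> None:
        assert len(self.idx_keys) == self.offset, "idx_keys out of sync with offset"


def select_blocks(token_scores: List[float], top_k: int, block: int = BLOCK) -> List[int]:
    """Max-pool scores per block of absolute positions; return top_k block ids."""
    pooled: Dict[int, float] = {}
    for pos, score in enumerate(token_scores):
        b = pos // block  # block id is tied to the absolute position
        pooled[b] = max(pooled.get(b, float("-inf")), score)
    # Highest pooled score first; ties broken by lower block id.
    ranked = sorted(pooled, key=lambda b: (-pooled[b], b))
    return sorted(ranked[:top_k])
```

Because block ids are derived from absolute position, a cache restored with the wrong offset would pool the wrong tokens together even when every lane has a plausible shape, which is the failure the DEBUG assertion is meant to catch early.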