[WIP] MiniMax-M3 runtime: sparse MSA cache + indexer decode by jjang-ai · Pull Request #75 · osaurus-ai/vmlx-swift

jjang-ai · 2026-06-18T03:44:57Z

WIP — engine lane for MiniMax-M3 (mm3). Built in lockstep with the osaurus integration PR (osaurus-ai/osaurus#1576). Do not merge until the multiturn-no-loop gate is met.

Why M3 is special

Layers 0-2 use dense full attention (stock KV). Layers 3-59 use block-sparse MSA / Lightning-Indexer (GQA, n_kv=4, head_dim=128). Each sparse layer carries a third, append-only cache lane, idx_keys [B,1,S,128]; the indexer max-pools per 128-token block and selects the top-k blocks, recomputed every step. Blocks are anchored to the absolute position (pos/128), so the cache supports only append and trim-replay.
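The block selection described above can be sketched in plain Swift. This is only an illustration, not the PR's code: the function name `selectBlocks` is hypothetical, plain arrays stand in for MLX tensors, and it assumes the max-pool runs over per-token indexer scores (the description does not say exactly what is pooled).

```swift
/// Hypothetical sketch: max-pools per-token indexer scores into one score per
/// 128-token block, then returns the indices of the `topK` best blocks in
/// ascending order. Pooling over scores is an assumption, not taken from the PR.
func selectBlocks(tokenScores: [Float], topK: Int, blockSize: Int = 128) -> [Int] {
    guard !tokenScores.isEmpty, topK > 0 else { return [] }
    let blockCount = (tokenScores.count + blockSize - 1) / blockSize
    var blockScores = [Float](repeating: -Float.infinity, count: blockCount)
    for (pos, score) in tokenScores.enumerated() {
        // Blocks are anchored to the absolute position, so block = pos / blockSize.
        let block = pos / blockSize
        blockScores[block] = max(blockScores[block], score)
    }
    // Rank blocks by pooled score (ties broken by lower index), keep the top-k.
    let ranked = blockScores.indices.sorted {
        blockScores[$0] != blockScores[$1] ? blockScores[$0] > blockScores[$1] : $0 < $1
    }
    return Array(ranked.prefix(topK)).sorted()
}

// Usage: 300 tokens span three blocks (0-127, 128-255, 256-299).
var demoScores = [Float](repeating: 0, count: 300)
demoScores[5] = 0.9     // lands in block 0
demoScores[200] = 0.5   // lands in block 1
demoScores[290] = 0.7   // lands in block 2
let demoSelected = selectBlocks(tokenScores: demoScores, topK: 2)
```

Because the block index depends only on the absolute position, removing tokens from the middle of the sequence would shift every later block boundary, which is consistent with the append-only / trim-replay restriction.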

This PR so far

MiniMaxM3SparseCache (Libraries/MLXLMCommon/Cache/): self-cloning 3-lane composite cache mirroring ZayaCCACache. copy()/state/trim/metaState carry keys,values,idx_keys together so the cache stays first-class through every reuse path (the Python loop bug was a reuse layer downcasting it to plain KVCache and dropping idx_keys). DEBUG asserts idx_keys.len == offset.
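A minimal sketch of the "three lanes move together" invariant, assuming plain Swift arrays in place of MLX tensors. The type `ThreeLaneCacheSketch` and all of its members are hypothetical and do not reproduce the real `MiniMaxM3SparseCache` API; the point is only that append, trim, and copy touch keys, values, and idx_keys as a unit and that the lane length is checked against the offset.

```swift
/// Hypothetical stand-in for a 3-lane composite cache. One inner array per token.
struct ThreeLaneCacheSketch {
    private(set) var keys: [[Float]] = []
    private(set) var values: [[Float]] = []
    private(set) var idxKeys: [[Float]] = []
    private(set) var offset = 0

    /// Appends one token's rows to all three lanes at once.
    mutating func append(key: [Float], value: [Float], idxKey: [Float]) {
        keys.append(key)
        values.append(value)
        idxKeys.append(idxKey)
        offset += 1
        checkInvariant()
    }

    /// Drops the last `n` tokens from every lane and returns how many were dropped.
    @discardableResult
    mutating func trim(_ n: Int) -> Int {
        let dropped = max(0, min(n, offset))
        keys.removeLast(dropped)
        values.removeLast(dropped)
        idxKeys.removeLast(dropped)
        offset -= dropped
        checkInvariant()
        return dropped
    }

    /// Value-type copy: all three lanes are duplicated together.
    func copy() -> ThreeLaneCacheSketch { self }

    /// Debug-only check that no lane has drifted from the offset.
    private func checkInvariant() {
        assert(keys.count == offset && values.count == offset && idxKeys.count == offset,
               "cache lanes out of sync with offset")
    }
}

// Usage: snapshot after three tokens, then trim the live cache back to one.
var liveCache = ThreeLaneCacheSketch()
for t in 0..<3 {
    liveCache.append(key: [Float(t)], value: [Float(t)], idxKey: [Float(t)])
}
let snapshot = liveCache.copy()
liveCache.trim(2)
```

A reuse layer that rebuilt only `keys` and `values` from this struct would silently lose `idxKeys`, which is the failure mode the description attributes to the Python loop bug.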

Next (in lockstep with osaurus#1576)

MSA / Lightning-Indexer attention + block selection; MoE; RoPE/iRoPE; make_cache dispatch.
Decode loop; reasoning/tool parser hooks.
Cache reuse stays protocol-driven (no type allow-lists) so M3 is first-class by construction.

Gate

Multiturn cache-reuse on MiniMax-M3-REAP40-d3-JANG_2L: zero loops, zero incoherency, cache-hit turns coherent. (Paged off, TQ-KV skip, JIT off for M3.)

…ng, protocol-driven) Cache-first foundation for the MM3 port. Wraps KVCacheSimple + an append-only idx_keys lane (Lightning-Indexer keys); keys/values/idx_keys move together through copy/state/trim/metaState so the cache stays first-class through every reuse path (no downcast to plain KVCache = the Python repetition-loop root cause). Mirrors the ZayaCCACache composite precedent. DEBUG asserts idx_keys length == offset (wrong restored offset corrupts attention even with right shapes).

…max_m3 Full text-decode port of vllm-mlx minimax_m3.py: GemmaRMSNorm (reused), gpt_oss clamped swiglu (SwiGLUOAIMLP + SwitchGLU glue closure for routed experts), Lightning Indexer block selection, MSA attention with update-before-indexer ordering, DeepseekV3-style sigmoid router with e_score_correction_bias, and the MiniMaxM3SparseCache dispatch (sparse layers 3-59 / dense 0-2). sanitize remaps the JANG checkpoint keys and drops the VL vision stack for the text-only build. Compiles green; sanitize validated by header-only weight probe over all 23 shards (3277 tensors) — every key maps to the module tree, quantized/fp16 split exact.
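For readers unfamiliar with the router style named in this commit, here is a rough sketch of sigmoid routing with a selection-only correction bias. It reflects the general DeepseekV3-style scheme as commonly described, not the ported code: the function name `routeSketch` is hypothetical, the weight normalization is an assumption, and group-limited selection and any scaling factor are left out.

```swift
import Foundation

/// Hypothetical sketch: scores are sigmoid(logit); the correction bias is added
/// only when choosing experts, while the returned weights come from the unbiased
/// scores, normalized over the chosen experts (normalization is an assumption).
func routeSketch(logits: [Float], correctionBias: [Float], topK: Int) -> [(expert: Int, weight: Float)] {
    precondition(logits.count == correctionBias.count, "one bias per expert")
    let scores = logits.map { 1 / (1 + exp(-$0)) }
    let biased = zip(scores, correctionBias).map { $0 + $1 }
    // Choose experts by biased score, highest first.
    let chosen = biased.indices.sorted { biased[$0] > biased[$1] }.prefix(topK)
    let total = chosen.reduce(Float(0)) { $0 + scores[$1] }
    return chosen.map { (expert: $0, weight: scores[$0] / total) }
}

// Usage: the bias lifts expert 2 into the selection, yet its weight still
// reflects its unbiased score, so expert 0 keeps the larger share.
let demoRouting = routeSketch(logits: [2, 0, 0], correctionBias: [0, 0, 3], topK: 2)
```

Keeping the bias out of the weights means it can steer which experts are picked without changing how much each picked expert contributes.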

Make MiniMax-M3 (REAP/JANG) load, decode, and reuse caches correctly. Proven by 17 tests (10 no-model cache/topology, 6 tool-parser unit, 1 gated live single-load smoke covering load, MSA fire >2048, reasoning split, eos, and scanned coherent multiturn). minimax_m2 / M2.7 left untouched.

Quant loader (Load.swift, JangLoader.swift): resolve the bits*group_size packing ambiguity, where (6,64) and (3,128) yield identical weight/scales shapes, by deriving bits/gs from each placeholder module's true input dim at the quantize site. Fixes the embed/expert dim-doubling quantized_matmul crash.

Reasoning (ReasoningParser.swift): route minimax_m3* to a <mm:think>...</mm:think> parser; minimax_m2 stays on <think>. Reasoning knob is thinking_mode (enabled/disabled/adaptive), not enable_thinking.

Tool calls (Tool/ToolCallFormat.swift, Tool/Parsers/MiniMaxM3ToolCallParser.swift): new .minimaxM3 format + parser, a faithful port of vllm-mlx minimax_m3_tool_parser.py (tag-named-param XML: <tool_call><invoke name><key>v</key>, namespace-token + <mm:think> stripping, nested object/array args, scalar coercion). Routed from model_type and capability name.

Cache (MiniMaxM3SparseCache 3-lane composite already on branch):
- ModelContainer: classify M3 sparse layers (miniMaxM3SparseLayerCount / requiresMiniMaxM3SparseState / disk-backed restore), not plain KV.
- CacheHelpers: M3 forces paged-incompatible (idx_keys can't ride the paged tier).
- TQDiskSerializer: LayerKind.miniMaxM3Sparse serialize/restore of keys/values/idx_keys.
- KVCache.swift: register MiniMaxM3SparseCache in the prompt-cache class map; maybeQuantizeKVCache only touches KVCacheSimple so MSA caches are never TQ-encoded.
- innerState() returns all 3 lanes so eval(cache) materializes idx_keys with K/V.

MiniMaxM3.swift: env-gated one-shot MSA-fire trace (MM3_MSA_TRACE).
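The packing ambiguity in the quant-loader note can be made concrete with a small sketch. It assumes MLX-style packing into 32-bit words, so that a quantized weight has `inputDim * bits / 32` columns and its scales have `inputDim / groupSize` columns. The helper `inferQuantParams` is hypothetical and is not the loader's actual function.

```swift
/// Hypothetical helper: recovers (bits, groupSize) from the packed shapes plus
/// the module's true input dimension. Assumes 32-bit word packing, i.e.
/// packedCols = inputDim * bits / 32 and scaleCols = inputDim / groupSize.
func inferQuantParams(packedCols: Int, scaleCols: Int, inputDim: Int) -> (bits: Int, groupSize: Int)? {
    guard packedCols > 0, scaleCols > 0, inputDim > 0,
          inputDim % scaleCols == 0,
          (packedCols * 32) % inputDim == 0 else { return nil }
    return (bits: packedCols * 32 / inputDim, groupSize: inputDim / scaleCols)
}

// The same stored shapes (768 packed columns, 64 scale columns) fit two readings:
let asDim4096 = inferQuantParams(packedCols: 768, scaleCols: 64, inputDim: 4096)
let asDim8192 = inferQuantParams(packedCols: 768, scaleCols: 64, inputDim: 8192)
```

With an input dimension of 4096 the shapes decode as (6,64); with 8192 they decode as (3,128). The shapes alone cannot tell the two apart, which is why the true input dimension has to break the tie, and why guessing wrong doubles the apparent dimension.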

Eric added 4 commits June 20, 2026 20:44

MiniMaxM3SparseCache: fix imports (MLXNN re-exports MLXFast) + add innerState() for Evaluatable/Updatable

930b0bf

jjang-ai force-pushed the codex/mm3-runtime branch from 687fd05 to 5139e84 Compare June 21, 2026 03:44

Eric added 2 commits June 20, 2026 20:48

MiniMax-M3: add E2E verification + multiturn live-test spec

409d2c4

MM3: status banner — wired + build green, ready for live test

57c4688

jjang-ai wants to merge 6 commits into main from codex/mm3-runtime
