
[WIP] MiniMax-M3 runtime: sparse MSA cache + indexer decode#75

Draft
jjang-ai wants to merge 6 commits into main from codex/mm3-runtime

@jjang-ai

WIP — engine lane for MiniMax-M3 (mm3). Built in lockstep with the osaurus integration PR (osaurus-ai/osaurus#1576). Do not merge until the multiturn-no-loop gate is met.

Why M3 is special

Layers 0-2 are dense full attention (stock KV); layers 3-59 are block-sparse MSA with a Lightning Indexer (GQA, n_kv=4, head_dim=128). Sparse layers carry a third, append-only cache lane, idx_keys [B,1,S,128]; the indexer max-pools per 128-token block and selects the top-k blocks, recomputed every step. Blocks are anchored to the absolute position (pos/128), so the cache supports only append and trim-then-replay.
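
A minimal sketch of the block-selection step described above, in plain Python rather than the Swift/MLX code in this PR. It assumes the per-token indexer scores have already been computed (how they are derived from idx_keys is not shown here), and `select_blocks` and `top_k` are hypothetical names used only for illustration.

```python
BLOCK = 128  # block size stated in the description above


def select_blocks(token_scores, top_k):
    """Max-pool per-token scores into 128-token blocks and return the
    ids of the top-k blocks in ascending order.

    Block ids are absolute: token position p always lives in block
    p // BLOCK, which is why the cache can only be appended to or trimmed.
    """
    n_blocks = (len(token_scores) + BLOCK - 1) // BLOCK
    block_scores = [
        max(token_scores[b * BLOCK:(b + 1) * BLOCK]) for b in range(n_blocks)
    ]
    k = min(top_k, n_blocks)
    ranked = sorted(range(n_blocks), key=lambda b: block_scores[b], reverse=True)
    return sorted(ranked[:k])


# 300 cached tokens span three blocks: 0-127, 128-255, 256-299.
scores = [0.0] * 300
scores[5] = 0.9     # block 0
scores[130] = 0.1   # block 1
scores[290] = 0.7   # block 2
selected = select_blocks(scores, top_k=2)
```

Because the selection is recomputed at every decode step, nothing about it is cached; only the idx_keys lane that feeds the scores has to survive cache reuse.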

This PR so far

  • MiniMaxM3SparseCache (Libraries/MLXLMCommon/Cache/): self-cloning 3-lane composite cache mirroring ZayaCCACache. copy()/state/trim/metaState carry keys, values, and idx_keys together so the cache stays first-class through every reuse path (the Python loop bug was caused by a reuse layer downcasting it to a plain KVCache and dropping idx_keys). A DEBUG assert checks idx_keys.len == offset.
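
The invariant can be illustrated with a toy model. This is a sketch in plain Python, not the Swift MiniMaxM3SparseCache: `ThreeLaneCache` and its methods are hypothetical, and Python lists stand in for the MLX arrays. It shows the three lanes moving together through append, copy, and trim, with the length check run after every mutation.

```python
from dataclasses import dataclass, field


@dataclass
class ThreeLaneCache:
    """Toy stand-in: keys, values and idx_keys always move together."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)
    idx_keys: list = field(default_factory=list)
    offset: int = 0

    def _check(self):
        # Mirrors the DEBUG assert: a wrong offset corrupts attention
        # even when the array shapes look right.
        assert len(self.idx_keys) == self.offset
        assert len(self.keys) == len(self.values) == self.offset

    def append(self, k, v, ik):
        self.keys.append(k)
        self.values.append(v)
        self.idx_keys.append(ik)
        self.offset += 1
        self._check()

    def trim(self, n):
        """Drop the last n tokens from all three lanes; return how many were dropped."""
        n = max(0, min(n, self.offset))
        keep = self.offset - n
        self.keys = self.keys[:keep]
        self.values = self.values[:keep]
        self.idx_keys = self.idx_keys[:keep]
        self.offset = keep
        self._check()
        return n

    def copy(self):
        # Self-cloning: the copy is the same type, never a plain KV cache.
        return ThreeLaneCache(
            list(self.keys), list(self.values), list(self.idx_keys), self.offset
        )


cache = ThreeLaneCache()
for t in range(5):
    cache.append(f"k{t}", f"v{t}", f"ik{t}")
clone = cache.copy()
trimmed = clone.trim(2)
```

A reuse layer that copied only `keys` and `values` would leave `idx_keys` shorter than `offset`, which is exactly the state the assert is meant to catch.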

Next (in lockstep with osaurus#1576)

  • MSA / Lightning-Indexer attention + block selection; MoE; RoPE/iRoPE; make_cache dispatch.
  • Decode loop; reasoning/tool parser hooks.
  • Cache reuse stays protocol-driven (no type allow-lists) so M3 is first-class by construction.

Gate

Multiturn cache reuse on MiniMax-M3-REAP40-d3-JANG_2L: zero loops, zero incoherence, and coherent output on cache-hit turns. (For M3: paged cache off, TQ-KV skipped, JIT off.)

Eric added 4 commits June 20, 2026 20:44
…ng, protocol-driven)

Cache-first foundation for the MM3 port. Wraps KVCacheSimple + an append-only
idx_keys lane (Lightning-Indexer keys); keys/values/idx_keys move together
through copy/state/trim/metaState so the cache stays first-class through every
reuse path (no downcast to plain KVCache = the Python repetition-loop root cause).
Mirrors the ZayaCCACache composite precedent. DEBUG asserts idx_keys length ==
offset (wrong restored offset corrupts attention even with right shapes).
…max_m3

Full text-decode port of vllm-mlx minimax_m3.py: GemmaRMSNorm (reused), gpt_oss
clamped swiglu (SwiGLUOAIMLP + SwitchGLU glue closure for routed experts),
Lightning Indexer block selection, MSA attention with update-before-indexer
ordering, DeepseekV3-style sigmoid router with e_score_correction_bias, and the
MiniMaxM3SparseCache dispatch (sparse layers 3-59 / dense 0-2). sanitize remaps
the JANG checkpoint keys and drops the VL vision stack for the text-only build.
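
For readers unfamiliar with the router style named above, here is a rough sketch of DeepseekV3-style sigmoid routing in plain Python. It assumes the usual convention that the correction bias influences only which experts are chosen, while the mixing weights come from the unbiased sigmoid scores renormalized over the chosen experts. Whether the M3 port renormalizes or applies an extra scaling factor is not stated here, and `route` is a hypothetical name.

```python
import math


def route(logits, correction_bias, top_k):
    """Return {expert_id: weight} for the top_k experts.

    Selection uses sigmoid(logit) + bias; the weights use the unbiased
    sigmoid scores, renormalized over the selected experts (assumption).
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, correction_bias)]
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


# Four experts with equal logits: the bias alone decides the selection.
weights = route([0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.5], top_k=2)
```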

Compiles green; sanitize validated by header-only weight probe over all 23 shards
(3277 tensors): every key maps to the module tree, quantized/fp16 split exact.

Make MiniMax-M3 (REAP/JANG) load, decode, and reuse caches correctly. Proven
by 17 tests (10 no-model cache/topology, 6 tool-parser unit, 1 gated live
single-load smoke covering load, MSA fire >2048, reasoning split, eos, and
scanned coherent multiturn). minimax_m2 / M2.7 left untouched.

Quant loader (Load.swift, JangLoader.swift): resolve the bits*group_size
packing ambiguity — (6,64) and (3,128) yield identical weight/scales shapes —
by deriving bits/gs from each placeholder module's true input dim at the
quantize site. Fixes the embed/expert dim-doubling quantized_matmul crash.
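
The ambiguity can be worked through with a small sketch. It assumes MLX-style packing, where quantized weights are stored in 32-bit words with `in_dim * bits / 32` columns and scales with `in_dim / group_size` columns. The function name is hypothetical and this is not the code in Load.swift.

```python
def infer_bits_and_group_size(packed_cols, scale_cols, in_dim):
    """Recover (bits, group_size) from tensor shapes plus the true input dim.

    The shapes alone only fix the product bits * group_size, because
    packed_cols / scale_cols == bits * group_size / 32.
    """
    if in_dim % scale_cols != 0:
        raise ValueError("in_dim is not a multiple of the scales column count")
    if (packed_cols * 32) % in_dim != 0:
        raise ValueError("packed width does not match in_dim")
    group_size = in_dim // scale_cols
    bits = packed_cols * 32 // in_dim
    return bits, group_size


# Same shapes (576 packed columns, 48 scale columns), two readings:
true_dim = infer_bits_and_group_size(576, 48, in_dim=3072)     # 6-bit, group 64
doubled_dim = infer_bits_and_group_size(576, 48, in_dim=6144)  # 3-bit, group 128
```

Reading the tensors as (3,128) instead of (6,64) implies an input dimension twice the real one, which is consistent with the dim-doubling crash described above.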

Reasoning (ReasoningParser.swift): route minimax_m3* to a <mm:think>...</mm:think>
parser; minimax_m2 stays on <think>. Reasoning knob is thinking_mode
(enabled/disabled/adaptive), not enable_thinking.
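
As an illustration of the tag split only, a non-streaming sketch in plain Python. The real parser in ReasoningParser.swift presumably also handles streaming and unterminated tags, which this does not attempt, and `split_reasoning` is a hypothetical name.

```python
import re

_THINK = re.compile(r"<mm:think>(.*?)</mm:think>", re.DOTALL)


def split_reasoning(text):
    """Split a completed response into (reasoning, answer)."""
    reasoning = "\n".join(part.strip() for part in _THINK.findall(text))
    answer = _THINK.sub("", text).strip()
    return reasoning, answer


result = split_reasoning("<mm:think>check units first</mm:think>The answer is 42.")
```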

Tool calls (Tool/ToolCallFormat.swift, Tool/Parsers/MiniMaxM3ToolCallParser.swift):
new .minimaxM3 format + parser, a faithful port of vllm-mlx
minimax_m3_tool_parser.py (tag-named-param XML: <tool_call><invoke name><key>v</key>,
namespace-token + <mm:think> stripping, nested object/array args, scalar
coercion). Routed from model_type and capability name.
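
A rough sketch of the tag-named-parameter layout and scalar coercion, in plain Python. The exact wire format is an assumption: the `name="..."` attribute syntax is guessed from the abbreviated example above, and namespace-token stripping and nested-tag arguments are left out. None of the names below come from the actual parser.

```python
import json
import re

# Assumed layout: <invoke name="fn"><param>value</param>...</invoke>
_INVOKE = re.compile(r'<invoke name="([^"]+)">(.*?)</invoke>', re.DOTALL)
_PARAM = re.compile(r"<([A-Za-z_][\w.-]*)>(.*?)</\1>", re.DOTALL)


def coerce(raw):
    """Parse numbers, booleans, null and JSON objects/arrays; else keep the string."""
    raw = raw.strip()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def parse_tool_calls(text):
    calls = []
    for name, body in _INVOKE.findall(text):
        args = {key: coerce(value) for key, value in _PARAM.findall(body)}
        calls.append({"name": name, "arguments": args})
    return calls


sample = (
    '<tool_call><invoke name="get_weather">'
    "<city>Paris</city><days>3</days><metric>true</metric>"
    "</invoke></tool_call>"
)
calls = parse_tool_calls(sample)
```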

Cache (MiniMaxM3SparseCache 3-lane composite already on branch):
- ModelContainer: classify M3 sparse layers (miniMaxM3SparseLayerCount /
  requiresMiniMaxM3SparseState / disk-backed restore), not plain KV.
- CacheHelpers: M3 forces paged-incompatible (idx_keys can't ride the paged tier).
- TQDiskSerializer: LayerKind.miniMaxM3Sparse serialize/restore of keys/values/idx_keys.
- KVCache.swift: register MiniMaxM3SparseCache in the prompt-cache class map;
  maybeQuantizeKVCache only touches KVCacheSimple so MSA caches are never TQ-encoded.
- innerState() returns all 3 lanes so eval(cache) materializes idx_keys with K/V.

MiniMaxM3.swift: env-gated one-shot MSA-fire trace (MM3_MSA_TRACE).
@jjang-ai jjang-ai force-pushed the codex/mm3-runtime branch from 687fd05 to 5139e84 Compare June 21, 2026 03:44