Difficulty: 🔴 Advanced
Scope: Large — touches the hot path and the CUDA-graph capture story; needs a clean abstraction + perf parity.
Subsystems: engine/cache_manager.py · utils/flashinfer_utils.py · engine/kv_cache_engine.py · engine/cuda_graph_runner.py
Prerequisites: Attention kernels (FlashInfer / FlashAttention), paged KV cache, and CUDA-graph-capturable launch patterns.
Problem
There's effectively one KV-cache attention backend: FlashInfer paged attention.
It's a good default but not always the most performant choice. The wrappers
FlashInferPrefillWrapper / FlashInferDecodeWrapper (utils/flashinfer_utils.py)
are wired directly into the cache manager's plan_attention and stored on
_PlanState.wrapper (engine/cache_manager.py). The key constraint is that
whatever backend we use must support the plan-then-replay scheme so it's
CUDA-graph capturable: persistent wrappers whose static buffers are updated by
plan(). See the CUDA-graph-mode notes in cache_manager.py and the double-buffer
logic in kv_cache_engine.py.
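One way to picture the plan-then-replay constraint is as a small backend protocol. The sketch below is hypothetical: AttentionBackend, ToyBackend, and the attribute qo_indptr_buf are illustrative stand-ins rather than the project's API, and plain Python lists stand in for preallocated device tensors. What it tries to show is that plan() rewrites a fixed-size buffer in place, so anything holding a reference to that buffer (as a captured CUDA graph holds device pointers) sees the new metadata on replay.

```python
from typing import List, Protocol, Sequence


class AttentionBackend(Protocol):
    """Hypothetical interface: plan() runs on the host, run() is what gets replayed."""

    def plan(self, qo_indptr: Sequence[int]) -> None: ...
    def run(self, q: Sequence[float]) -> List[float]: ...


class ToyBackend:
    """Stand-in backend; a list plays the role of a static device buffer."""

    def __init__(self, max_batch: int) -> None:
        # Allocated once. A captured graph would hold this buffer's address.
        self.qo_indptr_buf: List[int] = [0] * (max_batch + 1)
        self.batch_size = 0

    def plan(self, qo_indptr: Sequence[int]) -> None:
        n = len(qo_indptr)
        if n < 1 or n > len(self.qo_indptr_buf):
            raise ValueError("batch exceeds the capacity fixed at allocation time")
        # In-place update: same buffer object, new contents.
        self.qo_indptr_buf[:n] = qo_indptr
        self.batch_size = n - 1

    def run(self, q: Sequence[float]) -> List[float]:
        # Placeholder "kernel": per-request sum of q, read through the static buffer.
        buf = self.qo_indptr_buf
        return [sum(q[buf[i]:buf[i + 1]]) for i in range(self.batch_size)]


backend: AttentionBackend = ToyBackend(max_batch=4)
backend.plan([0, 2, 5])                         # two requests: 2 and 3 query tokens
print(backend.run([1.0, 2.0, 3.0, 4.0, 5.0]))   # [3.0, 12.0]
```

A second backend would only need to satisfy the same two methods; whether the real protocol also has to expose buffer accessors is a design question this sketch leaves open.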
This matters in practice: for some shapes a non-paged kernel (e.g.
flash_attn_varlen, or FlashInfer's ragged prefill) beats paged attention,
because the paged path also pays to scatter-write K/V into the page table. Part
of this issue is building a small microbenchmark that compares paged vs. ragged
vs. flash_attn_varlen per shape to quantify the gap.
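A harness for that microbenchmark could be as small as the sketch below. Everything in it is an assumption for illustration: the function names, the (batch_size, seq_len) shape key, and the CPU stand-in callables, which would be replaced by the real paged, ragged, and flash_attn_varlen calls. It uses host-side time.perf_counter; for GPU kernels the timed callable would have to synchronize the device (or the timing would need CUDA events), otherwise only launch overhead is measured.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, Tuple

Shape = Tuple[int, int]  # (batch_size, seq_len); hypothetical shape key
MakeFn = Callable[[Shape], Callable[[], object]]


def time_ms(fn: Callable[[], object], warmup: int = 3, iters: int = 10) -> float:
    """Median wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def compare(shapes: Iterable[Shape], backends: Dict[str, MakeFn]) -> Dict[Shape, Dict[str, float]]:
    """For each shape, build each backend's callable and time it."""
    results: Dict[Shape, Dict[str, float]] = {}
    for shape in shapes:
        results[shape] = {name: time_ms(make(shape)) for name, make in backends.items()}
    return results


def fastest(row: Dict[str, float]) -> str:
    """Name of the backend with the lowest median time."""
    return min(row, key=row.get)


if __name__ == "__main__":
    # CPU stand-ins so the harness runs anywhere; swap in real kernel launches.
    def fake(cost: int) -> MakeFn:
        return lambda shape: (lambda: sum(range(cost * shape[0] * shape[1])))

    table = compare([(1, 128), (8, 512)], {"paged": fake(3), "ragged": fake(2), "varlen": fake(2)})
    for shape, row in table.items():
        print(shape, {k: round(v, 3) for k, v in row.items()}, "->", fastest(row))
```

Reporting the median per shape rather than one aggregate number keeps the "which backend wins where" question visible, which is what the comparison is meant to answer.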
Suggested tasks
- […] (plan(), run(), and the static-buffer accessors like _qo_indptr_buf that cache_manager.py reads) into an explicit backend protocol.
- […] plan + CUDA-graph-capturable forward scheme.
- […] so existing deployments are unchanged.
- […] document when each wins.
Stretch / open
- Some submodules may not benefit from paged attention at all (e.g. short,
fixed-shape full-attention blocks). Investigate a non-paged path for those; this can be scoped as a follow-up.
- An "in-house" kernel set is out of scope for a first PR, but feel free to open separate PRs for any kernel development.
Acceptance criteria
- FlashInfer remains the default with no perf regression.
- The FlashAttention backend produces correct results, is CUDA-graph capturable,
and a benchmark shows the regime where it's competitive/better.
New to M*? Skim How it works and the Contributing guide first.