Add support for TurboQuant kv cache, with custom fused kernel for MLA #13

Draft
jondurbin wants to merge 16 commits into chutes from turboquant

@jondurbin
Collaborator

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Cherry-pick from upstream PR sgl-project#21419. Implements Google's TurboQuant
algorithm for KV cache quantization, enabled via --kv-cache-dtype turboquant,
achieving up to 4.92x compression with near-zero accuracy loss.

Reference: Zandieh et al., TurboQuant: Online Vector Quantization
with Near-optimal Distortion Rate (arXiv:2504.19874)
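
To put the 4.92x figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the ratio is measured against a 16-bit (fp16/bf16) KV cache, which the description does not state, and the 40 GiB cache size is a made-up example value. The helper names are hypothetical and are not part of this PR.

```python
def effective_bits(baseline_bits: float, compression_ratio: float) -> float:
    """Average stored bits per cached element implied by a compression ratio.

    This is an average that would include any per-block metadata
    (scales, norms, etc.), not the nominal codebook width.
    """
    return baseline_bits / compression_ratio


def compressed_gib(baseline_gib: float, compression_ratio: float) -> float:
    """KV cache footprint after compression, in GiB."""
    return baseline_gib / compression_ratio


BASELINE_BITS = 16   # assumption: ratio is relative to an fp16/bf16 cache
RATIO = 4.92         # "up to" figure quoted in the PR description

bits = effective_bits(BASELINE_BITS, RATIO)
print(f"~{bits:.2f} bits per cached element")

# Hypothetical 40 GiB uncompressed KV cache, for illustration only.
print(f"~{compressed_gib(40.0, RATIO):.2f} GiB after compression")
```

Under that 16-bit assumption, the quoted ratio works out to a little over 3 bits per cached element on average. Since the figure is stated as "up to", actual savings may be lower depending on model and configuration.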