Refactor DeepSeek Sparse Attention content for clarity and conciseness
- Removed detailed explanations of the Multi-Head Latent Attention (MLA) architecture to streamline the document.
- Updated the research section to reflect ongoing efforts and improved phrasing for clarity.
- Enhanced the summary of key findings to better convey the results of experiments.
public/content/deepseek-sparse-attention/deepseek-sparse-attention-content.md
+2 −118 (2 additions & 118 deletions)
@@ -120,127 +120,11 @@ They didn't train this model from scratch. They cleverly adapted an existing, po
#### Stage 2: Post-Training
After the pre-training was done, they fine-tuned the model for specific tasks (like coding, math, reasoning, and following instructions) using Reinforcement Learning (RL). Crucially, they used the **exact same data and methods** as they did for the original DeepSeek-V3.1-Terminus model. This ensures a fair comparison between the dense and sparse models.
## Deep Dive: Multi-Head Latent Attention (MLA) Architecture
Let's break down the Multi-Head Latent Attention (MLA) architecture step by step, following the formulas from the paper.
The core goal of MLA is to dramatically reduce the size of the Key-Value (KV) cache, which is the main memory bottleneck when processing long sequences. It achieves this through a clever "compress-then-decompress" strategy.
The process can be split into two main parts:
1. Creating the Keys and Values (for the cache).
2. Creating the Queries (to interact with the cache).
---
### Step 1: Processing Keys and Values (Formulas 1-5)
This section explains how the model takes the input for the current token ($h_t$) and creates the Key and Value vectors that will be stored (in a compressed form) and used by future tokens.
#### Formula (1): The Compression Step
$$
c_t^{KV} = W^{DKV} \cdot h_t
$$
*Note: The superscript $KV$ indicates this compressed vector will be used to create both Key and Value.*
**What it does:** This is the most critical step for saving memory. It takes the large, high-dimensional input vector for the current token ($h_t$) and projects it down into a much smaller, low-dimensional vector called the **compressed latent vector** ($c_t^{KV}$).

**$W^{DKV}$:** This is a learned "Down-projection" matrix. The model learns how to best squish the information from $h_t$ into $c_t^{KV}$ during training.

**Analogy:** Think of $h_t$ as a high-resolution image and $c_t^{KV}$ as a highly compressed JPEG. The JPEG is much smaller to store but retains the most important visual information. $c_t^{KV}$ is the only part related to the token's *content* that gets stored in the cache.
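
To make the shapes concrete, here is a minimal PyTorch sketch of this down-projection. The dimensions (`d_model`, `d_latent`) and the `W_DKV` module name are illustrative assumptions, not DeepSeek-V3.2's actual sizes or code.

```python
# Illustrative sketch of formula (1): the learned down-projection that
# produces the compressed latent vector c_t^{KV}.
import torch
import torch.nn as nn

d_model = 1024   # size of the token's hidden state h_t (hypothetical)
d_latent = 128   # size of the compressed latent c_t^{KV} (hypothetical)

W_DKV = nn.Linear(d_model, d_latent, bias=False)  # the learned W^{DKV}

h_t = torch.randn(d_model)     # hidden state of the current token
c_t_kv = W_DKV(h_t)            # c_t^{KV} = W^{DKV} · h_t

# Only this small vector (plus the small positional key below) goes into
# the cache, instead of full per-head Keys and Values.
print(h_t.shape, c_t_kv.shape)  # torch.Size([1024]) torch.Size([128])
```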
---
#### Formulas (2), (3), and (4): Reconstructing the Final Key
The final Key for each attention head is constructed from two separate pieces: a "content" part and a "positional" part.

**Formula (2): Decompressing the "Content" Key**

$$
k_t^C = W^{UK} \cdot c_t^{KV}
$$

*Note: The subscript $i$ (ranging from 1 to $n_h$) represents the attention head index, and the superscript $C$ indicates the "Content" part of the key.*

* This takes the small latent vector $c_t^{KV}$ and projects it *back up* to the full dimension, creating the "content" part of the key ($k_t^C$) for all $n_h$ attention heads.

**$W^{UK}$:** This is a learned "Up-projection" matrix for Keys. It's the decompressor.
**Formula (3): Creating the "Positional" Key**

$$
k_t^R = \text{RoPE}(W^{KR} \cdot h_t)
$$

*Note: The superscript $R$ indicates the "Rotational/Positional" part of the key.*

* This part handles the token's position in the sequence. It takes the *original* high-dimensional input $h_t$ and applies a transformation ($W^{KR}$) followed by **Rotary Positional Embedding (RoPE)**.

* This creates a "decoupled" key $k_t^R$ that purely encodes positional information. This is the second and final piece that gets stored in the cache.
**Formula (4): Assembling the Final Key**

$$
k_{t,i} = [k_{t,i}^C; k_t^R]
$$

*Note: Here $i$ represents the specific attention head index (1 to $n_h$).*

* The final key for a specific attention head $i$ ($k_{t,i}$) is formed by simply concatenating (sticking together) the content part ($k_{t,i}^C$) and the positional part ($k_t^R$). A rough code sketch of this reconstruction follows below.
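
Here is a rough PyTorch sketch of formulas (2)-(4): up-project the cached latent into per-head content keys, build the shared RoPE key from $h_t$, and concatenate. The tiny `rope` helper and all dimensions are simplified assumptions for illustration only.

```python
# Sketch of formulas (2)-(4): reconstruct the per-head keys from the cached
# latent plus one shared positional key.
import torch
import torch.nn as nn

d_model, d_latent = 1024, 128
n_heads, d_head, d_rope = 8, 64, 32          # hypothetical sizes

W_UK = nn.Linear(d_latent, n_heads * d_head, bias=False)   # W^{UK}, the "decompressor"
W_KR = nn.Linear(d_model, d_rope, bias=False)              # W^{KR} for the positional key

def rope(x: torch.Tensor, pos: int) -> torch.Tensor:
    """Minimal rotary embedding: rotate paired dimensions by a position-dependent angle."""
    half = x.shape[-1] // 2
    freq = 1.0 / (10000 ** (torch.arange(half) / half))
    angle = pos * freq
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * torch.cos(angle) - x2 * torch.sin(angle),
                      x1 * torch.sin(angle) + x2 * torch.cos(angle)], dim=-1)

t = 5                                        # position of the current token
h_t = torch.randn(d_model)
c_t_kv = torch.randn(d_latent)               # cached latent from formula (1)

k_C = W_UK(c_t_kv).view(n_heads, d_head)     # formula (2): per-head content keys
k_R = rope(W_KR(h_t), pos=t)                 # formula (3): shared positional key
k = torch.cat([k_C, k_R.expand(n_heads, -1)], dim=-1)   # formula (4): concat per head
print(k.shape)                               # torch.Size([8, 96]) = d_head + d_rope
```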
**Formula (5): Decompressing the Value**

$$
v_t^C = W^{UV} \cdot c_t^{KV}
$$

* This is very similar to the key decompression. It uses the *same* small latent vector $c_t^{KV}$ but a *different* up-projection matrix ($W^{UV}$) to reconstruct the full-size Value vectors for all $n_h$ heads.

* This shows that $c_t^{KV}$ is a **joint** compression of both Key and Value information.
**Key Takeaway for KV Cache:**
The paper explicitly states that **only $c_t^{KV}$ and $k_t^R$ (the blue-boxed vectors in its figure) need to be cached.** This is the magic of MLA. Instead of storing massive Key and Value vectors for every head, you only store one tiny latent vector ($c_t^{KV}$) and one positional vector ($k_t^R$). The full Keys and Values are reconstructed on the fly when needed.
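
As a back-of-the-envelope illustration of this saving, the sketch below compares what a dense cache would hold per token with what MLA holds, and shows the on-the-fly Value reconstruction via $W^{UV}$. All sizes are toy numbers, not the model's real dimensions.

```python
# Toy comparison of cache size per token, plus formula (5) on the fly.
import torch
import torch.nn as nn

d_latent, d_rope = 128, 32
n_heads, d_head = 8, 64

W_UV = nn.Linear(d_latent, n_heads * d_head, bias=False)   # W^{UV}

# What a standard dense cache would hold per token vs. what MLA holds:
dense_floats_per_token = n_heads * d_head * 2        # full K and V for every head
mla_floats_per_token = d_latent + d_rope             # just c_t^{KV} and k_t^R
print(dense_floats_per_token, mla_floats_per_token)  # 1024 vs. 160 in this toy setup

# Reconstructing the per-head Values on the fly from the cached latent:
c_t_kv = torch.randn(d_latent)
v = W_UV(c_t_kv).view(n_heads, d_head)               # formula (5)
print(v.shape)                                       # torch.Size([8, 64])
```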
---
### Step 2: Processing Queries (Formulas 6-9)
This process mirrors the key generation, but it's for the Queries of the *current* token that will attend to the past keys in the cache.
**Formula (6): Compressing the Query**

$$
c_t^Q = W^{DQ} \cdot h_t
$$

*Note: The superscript $Q$ indicates this compressed vector is specifically for Query information.*
* Just like for the KV, the input $h_t$ is compressed into a small latent query vector $c_t^Q$ using a separate down-projection matrix ($W^{DQ}$).
**Formula (7): Decompressing the "Content" Query**

* Mirroring the key path, the small latent query $c_t^Q$ is projected back up to full dimension to create the "content" part of the query for each head.

* The final query for each head $i$ is formed by concatenating its content and positional parts. A sketch of this query path follows below.
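
A sketch of the query path under the same toy dimensions is below. Exactly how the positional query is produced (from $c_t^Q$ or from $h_t$) is not spelled out above, so the `W_QR` wiring here is an assumption made purely for illustration.

```python
# Sketch of the query path (formulas 6-9), mirroring the key path.
import torch
import torch.nn as nn

d_model, d_q_latent = 1024, 192              # hypothetical sizes
n_heads, d_head, d_rope = 8, 64, 32

W_DQ = nn.Linear(d_model, d_q_latent, bias=False)            # formula (6): compress
W_UQ = nn.Linear(d_q_latent, n_heads * d_head, bias=False)   # formula (7): decompress content
W_QR = nn.Linear(d_q_latent, n_heads * d_rope, bias=False)   # positional query projection (assumed)

def rope(x, pos):
    half = x.shape[-1] // 2
    freq = 1.0 / (10000 ** (torch.arange(half) / half))
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * torch.cos(pos * freq) - x2 * torch.sin(pos * freq),
                      x1 * torch.sin(pos * freq) + x2 * torch.cos(pos * freq)], dim=-1)

h_t = torch.randn(d_model)
c_t_q = W_DQ(h_t)                                            # small latent query
q_C = W_UQ(c_t_q).view(n_heads, d_head)                      # content part per head
q_R = rope(W_QR(c_t_q).view(n_heads, d_rope), pos=5)         # positional part per head
q = torch.cat([q_C, q_R], dim=-1)                            # final per-head queries
print(q.shape)                                               # torch.Size([8, 96])
```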
### Summary of the Entire MLA Flow
1. **For each token $t$:** Take its input embedding $h_t$.
2. **Compress:** Create a tiny latent vector $c_t^{KV}$ that jointly represents Keys and Values.
3. **Get Position:** Create a positional key $k_t^R$ from $h_t$.
4. **Cache:** Store **only** $c_t^{KV}$ and $k_t^R$ in the KV cache. This is the **memory saving** step.
5. **Attend:** When a new token needs to perform attention, it generates its query ($q_{t,i}$). It then retrieves the cached $c_s^{KV}$ and $k_s^R$ for all previous tokens $s$, reconstructs their full Keys and Values on the fly using the up-projection matrices, and computes the attention scores (a toy version of this step is sketched below).
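
To tie the flow together, here is a toy version of the attend step: Keys and Values are rebuilt from the cached latents and scored against the current token's per-head queries. This is a simplified single-sequence sketch, not the paper's optimized implementation.

```python
# Toy "attend" step: rebuild K and V from the cache, then score the query.
import math
import torch
import torch.nn as nn

d_latent, d_rope = 128, 32
n_heads, d_head = 8, 64
seq_len = 16                                 # previous tokens already in the cache

W_UK = nn.Linear(d_latent, n_heads * d_head, bias=False)
W_UV = nn.Linear(d_latent, n_heads * d_head, bias=False)

# The cache: one tiny latent and one RoPE key per past token.
cached_c_kv = torch.randn(seq_len, d_latent)
cached_k_R = torch.randn(seq_len, d_rope)

# Current token's per-head queries (content + positional parts concatenated).
q = torch.randn(n_heads, d_head + d_rope)

# Reconstruct full Keys and Values on the fly.
k_C = W_UK(cached_c_kv).view(seq_len, n_heads, d_head)
k = torch.cat([k_C, cached_k_R[:, None, :].expand(-1, n_heads, -1)], dim=-1)
v = W_UV(cached_c_kv).view(seq_len, n_heads, d_head)

# Attention scores and weighted sum, head by head.
scores = torch.einsum('hd,shd->hs', q, k) / math.sqrt(d_head + d_rope)
weights = scores.softmax(dim=-1)
out = torch.einsum('hs,shd->hd', weights, v)
print(out.shape)                             # torch.Size([8, 64])
```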
### How MLA Integrates with DeepSeek Sparse Attention
The beauty of this architecture is how MLA works seamlessly with DSA:
1. **DSA selects the relevant tokens:** The Lightning Indexer identifies the top-k most important previous tokens.
2. **MLA processes only the selected tokens:** Instead of reconstructing Keys and Values for all 128,000 previous tokens, MLA only needs to decompress the cached $c_s^{KV}$ and $k_s^R$ for the selected $\text{top-k}$ tokens.
3. **Memory efficiency is multiplied:** DSA reduces the number of tokens to process, while MLA reduces the memory footprint of each token (a toy sketch of this combination follows below).
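
The sketch below illustrates this combination with a stand-in indexer: score every cached position, keep the top-k, and decompress only those latents. The random `index_scores` are a placeholder for the Lightning Indexer, included purely to show the mechanics.

```python
# Toy DSA + MLA combination: select top-k positions, decompress only those.
import torch
import torch.nn as nn

d_latent, d_rope = 128, 32
n_heads, d_head = 8, 64
seq_len, top_k = 1024, 64                    # 1024 cached tokens, keep 64

W_UK = nn.Linear(d_latent, n_heads * d_head, bias=False)
W_UV = nn.Linear(d_latent, n_heads * d_head, bias=False)

cached_c_kv = torch.randn(seq_len, d_latent)
cached_k_R = torch.randn(seq_len, d_rope)

# 1. Indexer stand-in: one relevance score per past token (random here).
index_scores = torch.randn(seq_len)
selected = index_scores.topk(top_k).indices  # positions the query will attend to

# 2. Decompress Keys/Values only for the selected positions.
k_C = W_UK(cached_c_kv[selected]).view(top_k, n_heads, d_head)
k = torch.cat([k_C, cached_k_R[selected][:, None, :].expand(-1, n_heads, -1)], dim=-1)
v = W_UV(cached_c_kv[selected]).view(top_k, n_heads, d_head)

print(k.shape, v.shape)   # torch.Size([64, 8, 96]) torch.Size([64, 8, 64])
# The up-projection cost now scales with top_k, not with the full sequence length.
```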
---
- *We also did some [research](https://github.com/Open-Superintelligence-Lab/deepseek-sparse-attention-research) ourselves
+ *We are actively doing research on this ourselves - [contribute here](https://github.com/Open-Superintelligence-Lab/deepseek-sparse-attention-research)*
### Research Questions
@@ -272,7 +156,7 @@ Our experiments aimed to answer:
- **Key Finding**: Mixed results - sparse helped short sequences but hurt long sequences on MHLA. Might be due to implementation. Research in progress...
+ **Key Finding**: Mixed results - sparse helped short sequences but hurt long sequences on MHLA. Might be due to implementation (research in progress)