Update DeepSeek Sparse Attention content for improved mathematical clarity
- Enhanced mathematical expressions in the DeepSeek Sparse Attention documentation by replacing inline text with LaTeX formatting for better readability.
- Updated key formulas to use proper mathematical notation, improving the overall presentation and understanding of the content.
- Clarified descriptions of the Lightning Indexer and attention calculations to ensure accurate representation of the model's functionality and performance.
Standard Transformers use an "attention" mechanism where every new token being generated looks back at all the previous tokens in the sequence.
This is computationally very expensive. If you have a sequence of length $L$, the complexity is $O(L^2)$, meaning the computation and memory required grow quadratically.
Doubling the text length from 10,000 to 20,000 tokens doesn't just double the cost—it quadruples it. This makes processing very long documents (like books or large codebases) prohibitively slow and expensive.
Instead of having each token attend to all previous tokens, DeepSeek Sparse Attention (DSA) intelligently selects a small, fixed-size subset ($k$) of the most relevant previous tokens to attend to. This changes the complexity from $O(L^2)$ to $O(L \cdot k)$, which is much more manageable since $k$ is a small constant (e.g., 2048) and $L$ can be very large (e.g., 128,000).
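To make the savings concrete, here is a rough back-of-the-envelope count in Python using the example numbers above ($L$ = 128,000 and $k$ = 2048). It only counts pairwise attention scores and ignores constant factors and memory effects.

```python
# Rough count of attention-score computations, using the example numbers above.
L = 128_000   # sequence length
k = 2_048     # tokens each query attends to under DSA

dense_pairs = L * L    # full attention: O(L^2)
sparse_pairs = L * k   # sparse attention: O(L * k)

print(f"dense:  {dense_pairs:,}")                     # 16,384,000,000
print(f"sparse: {sparse_pairs:,}")                    # 262,144,000
print(f"ratio:  {dense_pairs / sparse_pairs:.1f}x")   # 62.5x fewer scores
```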
DSA is made of two main components:
This is a fast and lightweight mechanism whose only job is to figure out which past tokens are important for the current token.
* **How it works:** For the current token ($h_t$), the indexer quickly calculates an "index score" ($I_{t,s}$) for every previous token ($h_s$). This score represents the predicted relevance of token $s$ to token $t$.
* **Formula (1):** This is essentially a simplified attention calculation. It uses its own small set of queries ($q^I$) and keys ($k^I$) to compute these scores.
* **Why it's "Lightning":** It's designed for speed. It uses a simple $\text{ReLU}$ activation function and can be run with low-precision numbers (FP8), making it computationally very cheap, even though it still technically looks at all previous tokens (an $O(L^2)$ operation, but a very, very fast one). A minimal sketch of this scoring-and-selection step is shown below.
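Concretely, here is a toy PyTorch sketch of that scoring-and-selection step: small indexer queries and keys produce ReLU-gated scores for every past token, and only the top-$k$ indices are kept. The single indexer head, the tensor sizes, and the exact placement of the ReLU are simplifications for illustration, not the paper's precise formulation.

```python
import torch

def lightning_indexer_topk(h_t, H_past, W_qI, W_kI, k=2048):
    """Toy index-score computation: score every past token for the current token
    with a cheap ReLU-gated dot product, then keep only the top-k indices.

    h_t:        (d,)      hidden state of the current token
    H_past:     (L, d)    hidden states of all previous tokens
    W_qI, W_kI: (d, d_I)  small projections for indexer queries / keys
    """
    q_I = h_t @ W_qI                      # (d_I,)   indexer query
    K_I = H_past @ W_kI                   # (L, d_I) indexer keys
    scores = torch.relu(K_I @ q_I)        # (L,)     index scores I_{t,s}
    k = min(k, H_past.shape[0])           # cannot select more tokens than exist
    top_scores, top_idx = scores.topk(k)  # indices of the k most relevant tokens
    return top_idx, top_scores

# Tiny usage example with made-up sizes.
d, d_I, L = 64, 16, 512
h_t, H_past = torch.randn(d), torch.randn(L, d)
W_qI, W_kI = torch.randn(d, d_I), torch.randn(d, d_I)
top_idx, _ = lightning_indexer_topk(h_t, H_past, W_qI, W_kI, k=128)
print(top_idx.shape)  # torch.Size([128])
```

The selected indices are what the main attention step then consumes, so the expensive part of attention only ever sees $k$ tokens per query.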
### 1. The Formulas Explained (The "What")
The paper provides two key formulas that describe this two-step process.
This formula calculates the **index score** ($I_{t,s}$), which represents the "relevance" of a past token $s$ to the current token $t$. Let's break it down:
* `I_t,s`: The final importance score. A higher score means token `s` is more important for token `t`.
* `h_t` and `h_s`: These are the vector representations (hidden states) of the current token (`t`) and a previous token (`s`).
#### **Formula (2): The Main Attention Calculation**
This formula describes how the final output (`u_t`) is computed after the selection is done.
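As a rough illustration of that step, the sketch below gathers only the selected keys and values and runs a plain single-head softmax attention over them. The selection indices could come from the indexer sketch earlier (here a random stand-in is used); the real computation is multi-head and has more detail than this.

```python
import torch
import torch.nn.functional as F

def sparse_attention_output(q_t, K, V, top_idx):
    """Toy version of the main attention step: attend only to the key/value
    entries whose indices were selected, instead of all L of them.

    q_t:     (d,)    query for the current token
    K, V:    (L, d)  full key / value caches
    top_idx: (k,)    indices chosen by the lightning indexer
    """
    K_sel = K[top_idx]                                          # (k, d) gathered keys
    V_sel = V[top_idx]                                          # (k, d) gathered values
    attn = F.softmax(K_sel @ q_t / K.shape[-1] ** 0.5, dim=-1)  # (k,) attention weights
    return attn @ V_sel                                         # (d,)  output u_t

d, L, k = 64, 512, 128
q_t, K, V = torch.randn(d), torch.randn(L, d), torch.randn(L, d)
top_idx = torch.randperm(L)[:k]   # stand-in for the indexer's selection
u_t = sparse_attention_output(q_t, K, V, top_idx)
print(u_t.shape)  # torch.Size([64])
```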
This section explains how the model takes the input for the current token (`h_t`) and creates the Key and Value vectors that will be stored (in a compressed form) and used by future tokens.
#### Formula (1): The Compression Step
$$
c_t^{KV} = W^{DKV} \cdot h_t
$$
* **What it does:** This is the most critical step for saving memory. It takes the large, high-dimensional input vector for the current token (`h_t`) and projects it down into a much smaller, low-dimensional vector called the **compressed latent vector** (`c_t^KV`).
* **`W^DKV`:** This is a learned "Down-projection" matrix. The model learns how to best squish the information from `h_t` into `c_t^KV` during training. A minimal sketch of this projection is shown below.
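Here is that sketch. The sizes are made up for illustration and are not the model's actual configuration.

```python
import torch
import torch.nn as nn

d_model, d_latent = 1024, 128   # illustrative sizes only

# Learned down-projection W^DKV: squeezes h_t into the latent c_t^KV.
W_DKV = nn.Linear(d_model, d_latent, bias=False)

h_t = torch.randn(d_model)   # full hidden state of the current token
c_t_KV = W_DKV(h_t)          # compressed latent vector (8x smaller here)

print(h_t.numel(), "->", c_t_KV.numel())   # 1024 -> 128
```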
The final Key for each attention head is constructed from two separate pieces: a "content" part and a "positional" part.
* This takes the small latent vector `c_t^KV` and projects it *back up* to the full dimension, creating the "content" part of the key (`k_t^C`) for all `n_h` attention heads.
* **`W^UK`:** This is a learned "Up-projection" matrix for Keys. It's the decompressor.
* **Formula (3): Creating the "Positional" Key**

$$
k_t^R = \text{RoPE}(W^{KR} \cdot h_t)
$$
* This part handles the token's position in the sequence. It takes the *original* high-dimensional input `h_t` and applies a transformation (`W^KR`) followed by **Rotary Positional Embedding (RoPE)**.
* This creates a "decoupled" key `k_t^R` that purely encodes positional information. This is the second and final piece that gets stored in the cache.
* The final key for a specific attention head `i` (`k_t,i`) is formed by simply concatenating (sticking together) the content part (`k_t,i^C`) and the positional part (`k_t^R`).
* This is very similar to the key decompression. It uses the *same* small latent vector `c_t^KV` but a *different* up-projection matrix (`W^UV`) to reconstruct the full-size Value vectors for all `n_h` heads.
* This shows that `c_t^KV` is a **joint** compression of both Key and Value information.
The text explicitly states that **only the blue-boxed vectors (`c_t^KV` and `k_t^R`) need to be stored in the cache**.
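Putting those pieces together, here is a toy end-to-end sketch of what gets cached and how per-head keys and values could be rebuilt from it: the joint latent `c_t^KV` is up-projected into content keys and values, and the shared positional key `k_t^R` is concatenated onto each head's content key. All sizes are illustrative, and the `rope` helper is a simplified stand-in rather than a faithful rotary-embedding implementation.

```python
import torch
import torch.nn as nn

def rope(x, pos):
    """Simplified rotate-half rotary embedding for a single vector (a stand-in)."""
    half = x.shape[-1] // 2
    freqs = 10000.0 ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = pos * freqs
    x1, x2 = x[:half], x[half:]
    return torch.cat([x1 * torch.cos(angles) - x2 * torch.sin(angles),
                      x1 * torch.sin(angles) + x2 * torch.cos(angles)])

# Illustrative sizes only.
d_model, d_latent, n_h, d_head, d_rope = 1024, 128, 8, 64, 32

W_DKV = nn.Linear(d_model, d_latent, bias=False)       # down-projection (Formula 1)
W_UK  = nn.Linear(d_latent, n_h * d_head, bias=False)  # up-projection for content keys
W_UV  = nn.Linear(d_latent, n_h * d_head, bias=False)  # up-projection for values
W_KR  = nn.Linear(d_model, d_rope, bias=False)         # projection for the positional key

h_t, pos = torch.randn(d_model), 42

# Only these two vectors need to be cached for this token.
c_t_KV = W_DKV(h_t)            # joint latent for keys and values
k_t_R  = rope(W_KR(h_t), pos)  # decoupled positional key (Formula 3)

# Everything else is reconstructed on the fly from the cached pieces.
k_t_C = W_UK(c_t_KV).view(n_h, d_head)                         # per-head content keys
v_t   = W_UV(c_t_KV).view(n_h, d_head)                         # per-head values
k_t   = torch.cat([k_t_C, k_t_R.expand(n_h, d_rope)], dim=-1)  # content + positional
print(k_t.shape, v_t.shape)    # torch.Size([8, 96]) torch.Size([8, 64])
```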
This process mirrors the key generation, but it's for the Queries of the *current* token that will attend to the past keys in the cache.
* **Formula (6): Compressing the Query**

$$
c_t^Q = W^{DQ} \cdot h_t
$$
* Just like for the KV, the input `h_t` is compressed into a small latent query vector `c_t^Q` using a separate down-projection matrix (`W^DQ`).
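For completeness, here is a small sketch of the query path mirroring the key/value path above. It also includes a guessed decompression step of the kind Formula (7) below describes; the matrix name `W_UQ` is assumed by analogy with `W^UK`/`W^UV` rather than quoted from the text, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
d_model, d_q_latent, n_h, d_head = 1024, 192, 8, 64

W_DQ = nn.Linear(d_model, d_q_latent, bias=False)       # Formula (6): query down-projection
W_UQ = nn.Linear(d_q_latent, n_h * d_head, bias=False)  # assumed up-projection for "content" queries

h_t = torch.randn(d_model)
c_t_Q = W_DQ(h_t)                      # small latent query vector c_t^Q
q_t_C = W_UQ(c_t_Q).view(n_h, d_head)  # per-head "content" queries (by analogy with the keys)

print(c_t_Q.shape, q_t_C.shape)   # torch.Size([192]) torch.Size([8, 64])
```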
* **Formula (7): Decompressing the "Content" Query**