Commit bebbb9c

Add diagram to DeepSeek Sparse Attention content for clarity
- Included an illustration of the Attention Architecture to enhance understanding of the DeepSeek Sparse Attention (DSA) mechanism.
- Removed the duplicate image to streamline the document and maintain focus on the new diagram.
- Updated accompanying text to clarify the relationship between DSA and Multi-Head Latent Attention (MLA).
1 parent c230add commit bebbb9c

File tree

1 file changed: +3 -1 lines changed

public/content/deepseek-sparse-attention/deepseek-sparse-attention-content.md

Lines changed: 3 additions & 1 deletion
@@ -37,6 +37,9 @@ Doubling the text length from 10,000 to 20,000 tokens doesn't just double the co
 
 Instead of having each token attend to all previous tokens, DeepSeek Sparse Attention (DSA) intelligently selects a small, fixed-size subset ($k$) of the most relevant previous tokens to attend to. This changes the complexity from $O(L^2)$ to $O(L \cdot k)$, which is much more manageable since $k$ is a small constant (e.g., 2048) and $L$ can be very large (e.g., 128,000 or 2,000,000).
 
+![Attention Architecture](/content/deepseek-sparse-attention/Attention-architecture.png)
+
+*Let's explain how DSA (marked in green) works with MLA (Multi-Head Latent Attention).*
 
 DSA is made of two main components:
 

@@ -120,7 +123,6 @@ They didn't train this model from scratch. They cleverly adapted an existing, po
 #### Stage 2: Post-Training
 After the pre-training was done, they fine-tuned the model for specific tasks (like coding, math, reasoning, and following instructions) using Reinforcement Learning (RL). Crucially, they used the **exact same data and methods** as they did for the original DeepSeek-V3.1-Terminus model. This ensures a fair comparison between the dense and sparse models.
 
-![Attention Architecture](/content/deepseek-sparse-attention/Attention-architecture.png)
 
 ---
 
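The DSA paragraph shown as context in the first hunk describes attending to only the top-$k$ most relevant previous tokens, which turns $O(L^2)$ into $O(L \cdot k)$. Below is a minimal sketch of that idea for a single query step. It is an illustration only, not DeepSeek's implementation: the function name `sparse_attention_step`, the dot-product scorer, and the NumPy shapes are assumptions made for the example.

```python
# Minimal sketch (not DeepSeek's implementation): naive top-k sparse attention
# for a single query token. The scorer and names are illustrative assumptions;
# the point is the O(L*k) shape of the attention work.
import numpy as np

def sparse_attention_step(q, keys, values, k=2048):
    """Attend from one query to only the top-k highest-scoring previous tokens.

    q:      (d,)   query vector for the current token
    keys:   (L, d) keys of all previous tokens
    values: (L, d) values of all previous tokens
    k:      number of previous tokens actually attended to
    """
    L, d = keys.shape
    k = min(k, L)

    # 1) Relevance score for every previous token. Note: this naive scorer still
    #    touches all L tokens; in practice the selection step has to be much
    #    cheaper than full attention for the overall savings to hold.
    scores = keys @ q                      # (L,)

    # 2) Keep only the indices of the k highest scores.
    top_idx = np.argpartition(scores, -k)[-k:]

    # 3) Softmax attention restricted to those k tokens: O(k*d) work per query
    #    instead of O(L*d).
    sel = scores[top_idx] / np.sqrt(d)
    weights = np.exp(sel - sel.max())
    weights /= weights.sum()
    return weights @ values[top_idx]       # (d,)

# Arithmetic from the paragraph above: L = 128,000 and k = 2,048.
L, k = 128_000, 2048
print(f"dense  ~ L*L = {L * L:,} attention scores")
print(f"sparse ~ L*k = {L * k:,} attention scores ({(L * L) // (L * k)}x fewer)")
```

With $L = 128{,}000$ and $k = 2{,}048$, the ratio $L^2 / (L \cdot k) = L / k \approx 62$, which is the reduction the added paragraph is pointing at.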