Commit 632eafb

Revise DeepSeek Sparse Attention training process description for clarity
- Enhanced the explanation of the training phases for DeepSeek-V3.2-Exp, detailing the adaptation from DeepSeek-V3.1-Terminus and the multi-stage training process.
- Clarified the goals and methods of the Dense Warm-up and Sparse Training stages, emphasizing the use of KL-divergence loss and the optimization of the indexer and main model.
- Updated the post-training section to reflect the consistency with the original model's training methods, ensuring a fair comparison.
1 parent bebbb9c commit 632eafb

1 file changed: +35 -13 lines changed


public/content/deepseek-sparse-attention/deepseek-sparse-attention-content.md

Lines changed: 35 additions & 13 deletions
@@ -105,24 +105,46 @@ The final attention output ($u_t$) is then calculated by the main attention module

### Step 3: How The Model Was Trained

The creation of DeepSeek-V3.2-Exp was not a matter of starting from scratch. Instead, researchers cleverly adapted an existing, powerful model, **DeepSeek-V3.1-Terminus**, which was already proficient in handling long contexts of 128K tokens. This adaptation involved a multi-stage training process designed to seamlessly integrate the new sparse attention mechanism while ensuring a fair comparison by using the same data distribution as the original model.

#### Phase 1: Continued Pre-Training

The first phase focused on teaching the model to use its new sparse attention architecture.

**Dense Warm-up Stage: An Initial Crash Course**

> **Goal:** To teach the brand-new **Lightning Indexer** what "important" tokens look like by having it mimic the original model's attention.

This was a short but critical initial stage lasting just 1,000 steps (2.1B tokens). The researchers froze the main model and kept the standard (dense) attention active. They then trained *only* the Lightning Indexer, tasking it with predicting the attention patterns of the powerful, pre-trained main model.

- **Method:** A **KL-divergence loss** was used to measure how closely the indexer's predictions matched the main model's attention scores (see the sketch below).
- **Key Stats:**
  - **Learning Rate:** $10^{-3}$
  - **Batch Size:** 16 sequences of 128K tokens.
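To make the warm-up objective concrete, here is a minimal PyTorch-style sketch of training an indexer to mimic a frozen model's attention with a KL-divergence loss. The toy sizes, the single-projection `indexer_proj`, and the random stand-in for the frozen model's attention are illustrative assumptions, not the actual DeepSeek implementation.

```python
import torch
import torch.nn.functional as F

# Toy sizes: 2 sequences of 16 tokens, 64-dim hidden states.
# (The real warm-up uses 128K-token sequences and the multi-head Lightning Indexer.)
B, T, D = 2, 16, 64
hidden = torch.randn(B, T, D)

# Stand-in for the Lightning Indexer: one learned projection whose dot products
# score how relevant each earlier token s is to each query token t.
indexer_proj = torch.nn.Linear(D, D, bias=False)
optimizer = torch.optim.Adam(indexer_proj.parameters(), lr=1e-3)

causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Stand-in for the frozen main model's dense attention distribution.
# (In the real warm-up this comes from DeepSeek-V3.1-Terminus, kept frozen.)
with torch.no_grad():
    main_logits = torch.randn(B, T, T).masked_fill(causal_mask, float("-inf"))
    target_probs = main_logits.softmax(dim=-1)

# Indexer scores for every (query t, key s) pair, restricted to s <= t.
index_scores = torch.einsum("btd,bsd->bts", indexer_proj(hidden), hidden)
index_scores = index_scores.masked_fill(causal_mask, float("-inf"))
index_logprobs = index_scores.log_softmax(dim=-1)

# KL-divergence loss: pull the indexer's distribution toward the frozen model's
# attention. Only indexer_proj is in the optimizer, so only the indexer learns.
loss = F.kl_div(index_logprobs, target_probs, reduction="batchmean")
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"warm-up KL loss: {loss.item():.4f}")
```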
**Sparse Training Stage: Adapting to a New Reality**

> **Goal:** To adapt the entire model to work with the sparse attention pattern selected by the indexer.

This was the main training phase, lasting 15,000 steps (943.7B tokens). Here, the researchers "switched on" the sparse mechanism, unfroze the main model, and trained everything together.

- **Method:** The model was now forced to predict the next word using only the **top-k** ($k=2048$) tokens identified by the indexer.
- **Key Innovation:** The indexer and the main model were optimized separately by detaching the indexer from the main computational graph. This prevented their training signals from interfering (see the sketch after this list).
  - The **main model** was trained solely on language modeling loss (predicting the next word).
  - The **Lightning Indexer** was trained solely on the KL-divergence loss to keep it aligned with the main model's attention, but only on the selected $k$ tokens.
- **Key Stats:**
  - **Learning Rate:** $7.3 \times 10^{-6}$
  - **Batch Size:** 480 sequences of 128K tokens.
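Continuing the toy setup from the warm-up sketch, the example below illustrates how the sparse stage could keep the two training signals separate: the indexer's top-k choice feeds the main model's sparse attention, the main model is updated with a language-modeling loss, and the indexer is updated with a KL loss computed only over the selected tokens against the detached main attention. The shapes, module names, and $k=4$ are illustrative assumptions; the paper uses $k=2048$ on 128K-token sequences with a full Transformer.

```python
import torch
import torch.nn.functional as F

B, T, D, V, k = 2, 16, 64, 100, 4           # toy sizes; the paper uses k = 2048
hidden = torch.randn(B, T, D)
next_tokens = torch.randint(0, V, (B, T))    # toy language-modeling targets

# Toy stand-ins for the components trained jointly in this stage.
indexer_proj = torch.nn.Linear(D, D, bias=False)   # Lightning Indexer
attn_q = torch.nn.Linear(D, D)
attn_k = torch.nn.Linear(D, D)
attn_v = torch.nn.Linear(D, D)
lm_head = torch.nn.Linear(D, V)

main_params = (list(attn_q.parameters()) + list(attn_k.parameters()) +
               list(attn_v.parameters()) + list(lm_head.parameters()))
opt_main = torch.optim.Adam(main_params, lr=7.3e-6)
opt_indexer = torch.optim.Adam(indexer_proj.parameters(), lr=7.3e-6)

causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# 1) The indexer scores every (t, s) pair and keeps the top-k keys per query.
#    The selection is kept outside the main graph (indices carry no gradient).
index_scores = torch.einsum("btd,bsd->bts", indexer_proj(hidden), hidden)
index_scores = index_scores.masked_fill(causal, float("-inf"))
topk_idx = index_scores.detach().topk(k, dim=-1).indices

# 2) The main model attends only over the selected tokens (sparse attention).
attn_logits = torch.einsum("btd,bsd->bts", attn_q(hidden), attn_k(hidden)) / D ** 0.5
keep = torch.zeros(B, T, T, dtype=torch.bool).scatter_(-1, topk_idx, True)
attn_logits = attn_logits.masked_fill(~keep | causal, float("-inf"))
attn_probs = attn_logits.softmax(dim=-1)
out = torch.einsum("bts,bsd->btd", attn_probs, attn_v(hidden))

# 3) Main model objective: plain next-token (language modeling) loss.
lm_loss = F.cross_entropy(lm_head(out).reshape(-1, V), next_tokens.reshape(-1))

# 4) Indexer objective: KL-divergence over the selected k tokens only, against
#    the main model's attention, detached so the main model is not disturbed.
sel_logprobs = index_scores.gather(-1, topk_idx).log_softmax(dim=-1)
sel_target = attn_probs.detach().gather(-1, topk_idx)
sel_target = sel_target / sel_target.sum(dim=-1, keepdim=True).clamp_min(1e-9)
kl_loss = F.kl_div(sel_logprobs, sel_target, reduction="batchmean")

# The two losses touch disjoint parameter sets, so their gradients never mix.
opt_main.zero_grad()
opt_indexer.zero_grad()
(lm_loss + kl_loss).backward()
opt_main.step()
opt_indexer.step()
```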
#### Phase 2: Post-Training - Fine-Tuning for a Fair Fight

To ensure a rigorous and fair assessment, the post-training pipeline—including algorithms and data—was kept identical to that of the original DeepSeek-V3.1-Terminus.

**Specialist Distillation**

First, the team developed specialized models for domains like mathematics, competitive programming, and agentic coding. Each specialist was fine-tuned from the same pre-trained DeepSeek-V3.2 base checkpoint. Using large-scale Reinforcement Learning (RL), these models generated high-quality, domain-specific data that was "distilled" to train the final model.
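As a purely illustrative sketch of that distillation data flow (not code from the paper), one could imagine each RL-trained specialist answering prompts from its own domain, with the collected answers becoming supervised training data for the final model. The `generate`/`score` interface and the best-of-n filtering below are assumptions introduced for the example.

```python
from dataclasses import dataclass

@dataclass
class DistilledExample:
    prompt: str
    response: str
    domain: str

def build_distillation_set(specialists, prompts_by_domain, n_samples=4):
    """Each RL-trained specialist answers prompts from its own domain; the
    collected answers become supervised training data for the final model.
    `generate` and `score` are hypothetical interfaces, and keeping only the
    best-of-n sample is an assumed quality filter, not a detail from the paper."""
    dataset = []
    for domain, specialist in specialists.items():
        for prompt in prompts_by_domain[domain]:
            candidates = [specialist.generate(prompt) for _ in range(n_samples)]
            best = max(candidates, key=lambda r: specialist.score(prompt, r))
            dataset.append(DistilledExample(prompt, best, domain))
    return dataset
```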
**Mixed RL Training**

Finally, the model was fine-tuned using **Group Relative Policy Optimization (GRPO)**. In a key strategic shift, the team merged reasoning, agent, and human alignment training into a single RL stage. This unified approach balanced performance across diverse skills while avoiding the "catastrophic forgetting" common in multi-stage training. The results were promising: the RL training curves of the new sparse model closely matched the original, demonstrating that DSA is a stable and effective addition.
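GRPO itself is not spelled out in this section, so the following is a generic, minimal sketch of its core idea: sample a group of responses per prompt, use the group's mean and standard deviation as the baseline instead of a learned value network, and apply a PPO-style clipped update. The clip value and the omission of the KL-to-reference penalty are simplifications, not details taken from the paper.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, G) for G sampled responses per prompt.
    GRPO normalizes each reward by its group's own statistics (no value network)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def grpo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate on per-response log-probabilities.
    (Token-level details and the KL penalty to a reference policy are omitted.)"""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

# Tiny example: 2 prompts, 4 sampled responses each, 0/1-style rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.5]])
advantages = grpo_advantages(rewards)
logp_old = torch.randn(2, 4)
logp_new = (logp_old + 0.05 * torch.randn(2, 4)).requires_grad_()
loss = grpo_policy_loss(logp_new, logp_old, advantages)
loss.backward()
```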
---