Revise DeepSeek Sparse Attention training process description for clarity
- Enhanced the explanation of the training phases for DeepSeek-V3.2-Exp, detailing the adaptation from DeepSeek-V3.1-Terminus and the multi-stage training process.
- Clarified the goals and methods of the Dense Warm-up and Sparse Training stages, emphasizing the use of KL-divergence loss and the optimization of the indexer and main model.
- Updated the post-training section to reflect the consistency with the original model's training methods, ensuring a fair comparison.
public/content/deepseek-sparse-attention/deepseek-sparse-attention-content.md
+35 -13 (35 additions, 13 deletions)
@@ -105,24 +105,46 @@ The final attention output ($u_t$) is then calculated by the main attention module
### Step 3: How The Model Was Trained
- They didn't train this model from scratch. They cleverly adapted an existing, powerful model (**DeepSeek-V3.1-Terminus**) that was already trained on long contexts. The training happened in several stages.
+ The creation of DeepSeek-V3.2-Exp was not a matter of starting from scratch. Instead, researchers cleverly adapted an existing, powerful model, **DeepSeek-V3.1-Terminus**, which was already proficient in handling long contexts of 128K tokens. This adaptation involved a multi-stage training process designed to seamlessly integrate the new sparse attention mechanism while ensuring a fair comparison by using the same data distribution as the original model.
- #### Stage 1: Continued Pre-Training (Two Phases)
+ #### Phase 1: Continued Pre-Training
- 1. **Dense Warm-up Stage:**
-    - **Goal:** To teach the brand-new Lightning Indexer what "important" tokens look like.
-    - **Method:** They froze the main model and kept the standard (dense = each token with every previous token) attention active. They then trained *only* the Lightning Indexer. The indexer's objective was to make its importance scores match the attention scores from the powerful, pre-trained main model. They used a KL-divergence loss, which is a way of measuring how similar two probability distributions are. In essence, they told the indexer: "Learn to predict what the main model *would have* paid attention to." This phase was very short (1,000 steps).
+ The first phase focused on teaching the model to use its new sparse attention architecture.
- 2. **Sparse Training Stage:**
-    - **Goal:** To adapt the entire model to work with the sparse attention pattern.
-    - **Method:** They "switched on" the $\text{top-k}$ selector, making the attention sparse. They unfroze the main model and trained everything together.
-      * The **main model** was trained on its usual task: predicting the next word (language modeling loss). It had to learn to perform well with only the limited context provided by the selector.
-      * The **Lightning Indexer** continued to be trained with the KL-divergence loss to align with the main model's attention, but now only on the selected $k$ tokens.
-      * This was the main training phase (15,000 steps, using 943.7 billion tokens).
+ **Dense Warm-up Stage: An Initial Crash Course**
- #### Stage 2: Post-Training
- After the pre-training was done, they fine-tuned the model for specific tasks (like coding, math, reasoning, and following instructions) using Reinforcement Learning (RL). Crucially, they used the **exact same data and methods** as they did for the original DeepSeek-V3.1-Terminus model. This ensures a fair comparison between the dense and sparse models.
+ > **Goal:** To teach the brand-new **Lightning Indexer** what "important" tokens look like by having it mimic the original model's attention.
+
+ This was a short but critical initial stage lasting just 1,000 steps (2.1B tokens). The researchers froze the main model and kept the standard (dense) attention active. They then trained *only* the Lightning Indexer, tasking it with predicting the attention patterns of the powerful, pre-trained main model.
+
+ - **Method:** A **KL-divergence loss** was used to measure how closely the indexer's predictions matched the main model's attention scores.
+ - **Key Stats:**
+   - **Learning Rate:** $10^{-3}$
+   - **Batch Size:** 16 sequences of 128K tokens.
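To make the warm-up objective concrete, here is a minimal PyTorch sketch of the idea, not DeepSeek's code: the frozen main model's dense attention distribution is treated as a fixed target, and only a toy bilinear indexer is updated under a KL-divergence loss. The tensor shapes, the `indexer_proj` scorer, and the step count are illustrative assumptions; only the $10^{-3}$ learning rate comes from the stats above.

```python
# Sketch of the dense warm-up: main model frozen, attention still dense,
# and only the (toy) indexer is trained to mimic the main model's attention.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 512, 64
neg = -1e9  # large negative instead of -inf so the KL stays finite at masked slots

# Stand-in for the frozen main model's causal attention distribution.
with torch.no_grad():
    q = torch.randn(seq_len, d_model)
    k = torch.randn(seq_len, d_model)
    causal = torch.triu(torch.full((seq_len, seq_len), neg), diagonal=1)
    target_attn = ((q @ k.T) / d_model ** 0.5 + causal).softmax(dim=-1)

# Toy Lightning Indexer: a cheap bilinear scorer over token embeddings.
token_emb = torch.randn(seq_len, d_model)
indexer_proj = torch.nn.Linear(d_model, d_model, bias=False)   # the only trainable part
opt = torch.optim.Adam(indexer_proj.parameters(), lr=1e-3)     # warm-up LR from the text

for step in range(200):                                        # 1,000 steps in the real run
    index_scores = token_emb @ indexer_proj(token_emb).T + causal
    log_pred = index_scores.log_softmax(dim=-1)
    # KL(main-model attention || indexer prediction), averaged over query positions.
    loss = F.kl_div(log_pred, target_attn, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```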
+ **Sparse Training Stage: Adapting to a New Reality**
+
+ > **Goal:** To adapt the entire model to work with the sparse attention pattern selected by the indexer.
+
+ This was the main training phase, lasting 15,000 steps (943.7B tokens). Here, the researchers "switched on" the sparse mechanism, unfroze the main model, and trained everything together.
+
+ - **Method:** The model was now forced to predict the next word using only the **top-k** ($k=2048$) tokens identified by the indexer.
+ - **Key Innovation:** The indexer and the main model were optimized separately by detaching the indexer from the main computational graph. This prevented their training signals from interfering.
+   - The **main model** was trained solely on language modeling loss (predicting the next word).
+   - The **Lightning Indexer** was trained solely on the KL-divergence loss to keep it aligned with the main model's attention, but only on the selected $k$ tokens.
+ - **Key Stats:**
+   - **Learning Rate:** $7.3 \times 10^{-6}$
+   - **Batch Size:** 480 sequences of 128K tokens.
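The sparse stage follows the same pattern, so here is a matching sketch, again a toy illustration rather than the actual implementation: the indexer's top-k choice gates the main attention, the main model is updated only by the language-modeling loss, and the indexer only by a KL loss over the selected tokens, with `detach()` standing in for the described separation of the two computational graphs. All modules and shapes are stand-ins, and a small $k$ is used so the example runs quickly; only $k=2048$ and the $7.3 \times 10^{-6}$ learning rate come from the stats above.

```python
# Sketch of the sparse training stage: top-k selection by the indexer, a
# language-modeling update for the main model, and a separate KL update for
# the indexer computed only over the selected tokens.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, vocab, k = 128, 64, 1000, 16            # the real run uses k = 2048

tokens = torch.randint(0, vocab, (seq_len,))
embed = torch.nn.Embedding(vocab, d_model)
attn_qkv = torch.nn.Linear(d_model, 3 * d_model)
lm_head = torch.nn.Linear(d_model, vocab)
indexer = torch.nn.Linear(d_model, d_model, bias=False)   # toy Lightning Indexer

main_opt = torch.optim.Adam(
    list(embed.parameters()) + list(attn_qkv.parameters()) + list(lm_head.parameters()),
    lr=7.3e-6)                                             # main-model LR from the text
indexer_opt = torch.optim.Adam(indexer.parameters(), lr=1e-3)

x = embed(tokens)                                          # [seq_len, d_model]
q, key, v = attn_qkv(x).chunk(3, dim=-1)
causal = torch.triu(torch.full((seq_len, seq_len), -1e9), diagonal=1)

# 1) The indexer scores every (query, past token) pair and keeps the top-k.
#    detach() keeps the indexer out of the main model's computational graph.
index_scores = x.detach() @ indexer(x.detach()).T + causal
topk_idx = index_scores.topk(k, dim=-1).indices            # [seq_len, k]

# 2) Main attention is restricted to the selected tokens.
keep = torch.full((seq_len, seq_len), -1e9).scatter(1, topk_idx, 0.0)
attn = ((q @ key.T) / d_model ** 0.5 + causal + keep).softmax(dim=-1)
logits = lm_head(attn @ v)

# Main model: ordinary next-token prediction; no gradient reaches the indexer.
lm_loss = F.cross_entropy(logits[:-1], tokens[1:])
main_opt.zero_grad()
lm_loss.backward()
main_opt.step()

# 3) Indexer: KL toward the main model's attention, only on the selected tokens.
target = attn.detach().gather(1, topk_idx)
target = target / target.sum(dim=-1, keepdim=True).clamp_min(1e-9)
pred = (x.detach() @ indexer(x.detach()).T + causal).gather(1, topk_idx)
kl_loss = F.kl_div(pred.log_softmax(dim=-1), target, reduction="batchmean")
indexer_opt.zero_grad()
kl_loss.backward()
indexer_opt.step()
```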
+ #### Phase 2: Post-Training - Fine-Tuning for a Fair Fight
+
+ To ensure a rigorous and fair assessment, the post-training pipeline, including algorithms and data, was kept identical to that of the original DeepSeek-V3.1-Terminus.
+ **Specialist Distillation**
+
+ First, the team developed specialized models for domains like mathematics, competitive programming, and agentic coding. Each specialist was fine-tuned from the same pre-trained DeepSeek-V3.2 base checkpoint. Using large-scale Reinforcement Learning (RL), these models generated high-quality, domain-specific data that was "distilled" to train the final model.
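The distillation flow can be outlined at a high level; this is a hypothetical sketch with placeholder helpers (the `specialists`, `prompts`, and `keep` arguments), not any real DeepSeek API: RL-tuned specialists generate candidate answers for their domain, a quality filter keeps the good ones, and the pooled pairs become supervised fine-tuning data for the final checkpoint.

```python
# Hypothetical outline of specialist distillation: collect filtered
# (prompt, response) pairs from per-domain RL-tuned specialists, then use the
# pooled set to fine-tune the final model. Placeholders only, not a real API.
from typing import Callable, Dict, List

def distill_from_specialists(
    specialists: Dict[str, Callable[[str], str]],   # domain -> RL-tuned generator
    prompts: Dict[str, List[str]],                  # domain -> domain-specific prompts
    keep: Callable[[str, str], bool],               # quality filter (verifier / reward check)
) -> List[dict]:
    """Pool high-quality, domain-specific samples across all specialists."""
    dataset = []
    for domain, generate in specialists.items():
        for prompt in prompts.get(domain, []):
            response = generate(prompt)
            if keep(prompt, response):
                dataset.append({"domain": domain, "prompt": prompt, "response": response})
    return dataset
```

The pooled dataset would then drive the supervised "distillation" step described above, before the final RL stage that follows.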
+ **Mixed RL Training**
+
+ Finally, the model was fine-tuned using **Group Relative Policy Optimization (GRPO)**. In a key strategic shift, the team merged reasoning, agent, and human alignment training into a single RL stage. This unified approach balanced performance across diverse skills while avoiding the "catastrophic forgetting" common in multi-stage training. The results were promising: the RL training curves of the new sparse model closely matched the original, demonstrating that DSA is a stable and effective addition.
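Since GRPO's defining feature is computing advantages relative to a group of sampled responses rather than from a learned critic, a small generic illustration of that step may help; the rewards below are made up and this is not DeepSeek's training code.

```python
# Generic illustration of GRPO's group-relative advantage: each prompt gets a
# group of sampled responses, and each response's reward is standardized
# within its own group (no value/critic network).
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [num_prompts, group_size] rewards for the sampled responses."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: two prompts, four sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.0],
                        [0.2, 0.9, 0.4, 0.1]])
advantages = group_relative_advantages(rewards)
# These advantages then weight the usual clipped policy-gradient objective,
# typically together with a KL penalty toward a reference policy.
```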