Update content for NVFP4, MobileLLM-R1, and DeepSeek pages to use HTML entities for apostrophes
- Replaced apostrophes with HTML entities in the NVFP4, MobileLLM-R1, and DeepSeek pages to ensure proper rendering in the browser.
- Enhanced the user experience by maintaining consistent formatting across the documentation.
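The rendered diff below decodes the entities, so each changed line shows the old and new text as identical. Here is a minimal sketch of the before/after pattern; it is not taken from this commit: the component name and copy are invented, and the entity is assumed to be &apos; (the rendered diff does not show which entity was actually used; &#39; would behave the same way).

// Hypothetical sketch of the escaping pattern; nothing here is from the repo.
export function ExampleParagraph() {
  return (
    <p className="text-slate-300 leading-relaxed">
      {/* Before: "Meta's ..." with a raw apostrophe, which trips lint rules such as react/no-unescaped-entities */}
      {/* After: the HTML entity renders as the same character in the browser */}
      Meta&apos;s MobileLLM-R1 challenges two assumptions about reasoning.
    </p>
  );
}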
Meta's MobileLLM-R1 challenges two fundamental assumptions about reasoning in language models: (1) that reasoning only emerges in large models, and (2) that it requires massive datasets. They demonstrate that <strong className="text-blue-400">sub-billion parameter models can achieve strong reasoning</strong> with just 2T tokens of carefully curated data.
</p>
<p className="text-slate-300 leading-relaxed">
Their <strong className="text-purple-400">950M parameter model achieves an AIME score of 15.5</strong>, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM2-1.7B. Remarkably, despite being trained on only 11.7% of the tokens compared to Qwen3's 36T-token corpus, MobileLLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks.
</p>
</div>
</div>
@@ -253,7 +253,7 @@ export default function MobileLLMR1Project() {
<p className="text-slate-300 mb-3">
Adaptive training strategy where the data mixture evolves alongside the model's growing capacity, ensuring optimal challenge levels throughout training.
NVIDIA has figured out how to train massive LLMs using a new <strong className="text-green-400">4-bit number format called NVFP4</strong>, which is a huge deal for efficiency. Training in 4-bit is much faster and uses less memory than the current 8-bit standard (FP8), but it's very difficult to do without the model's performance collapsing.
</p>
<p className="text-slate-300 leading-relaxed">
Their solution combines four key techniques to train a <strong className="text-emerald-400">12-billion-parameter hybrid Mamba-Transformer model on 10 trillion tokens</strong> with performance nearly identical to FP8 training. This marks the first successful demonstration of training billion-parameter language models with 4-bit precision over a multi-trillion-token horizon.
@@ -201,7 +201,7 @@ export default function NVFP4Project() {
NVFP4 vs MXFP4
</h2>
<p className="text-slate-400 text-lg">
How NVIDIA's format improves on the standard
</p>
</div>
@@ -310,7 +310,7 @@ export default function NVFP4Project() {
The 4 Key Techniques
</h2>
<p className="text-slate-400 text-lg">
The "secret sauce" that makes NVFP4 work
</p>
</div>
@@ -399,7 +399,7 @@ export default function NVFP4Project() {
@@ -544,7 +544,7 @@ export default function NVFP4Project() {
NVFP4 vs MXFP4
</h3>
<p className="text-slate-300 mb-4">
In direct comparison on an 8B model, MXFP4 needed <strong className="text-green-400">36% more training data</strong> (1.36T vs 1T tokens) to match NVFP4's performance. This proves NVFP4's superior design.
</p>
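For context on the comparison above, here is a minimal TypeScript sketch of block-scaled 4-bit quantization. It is not part of this commit or the pages it edits, and it assumes the publicly described layouts: NVFP4 scales 16-element blocks with a fractional FP8 (E4M3) factor, while MXFP4 scales 32-element blocks with a power-of-two (E8M0) factor.

// Hypothetical sketch, not from this commit or the pages it edits.
// Magnitudes representable by an E2M1 (FP4) value.
const FP4_GRID = [0, 0.5, 1, 1.5, 2, 3, 4, 6];

// Snap a real number to the nearest FP4 grid point, keeping its sign.
function nearestFp4(x: number): number {
  const mag = Math.abs(x);
  let best = FP4_GRID[0];
  for (const g of FP4_GRID) {
    if (Math.abs(g - mag) < Math.abs(best - mag)) best = g;
  }
  return x < 0 ? -best : best;
}

// Quantize one block and rescale back so the rounding error is easy to inspect.
// The real formats store 4-bit codes plus the block scale; this sketch keeps the
// scale in full precision (NVFP4 would store it in FP8 E4M3) for clarity.
function quantizeBlock(block: number[], powerOfTwoScale: boolean): number[] {
  const amax = Math.max(...block.map((v) => Math.abs(v)));
  let scale = amax > 0 ? amax / 6 : 1; // map the block's largest magnitude onto FP4's max (6)
  if (powerOfTwoScale) scale = 2 ** Math.ceil(Math.log2(scale)); // MXFP4-style coarse scale
  return block.map((x) => nearestFp4(x / scale) * scale);
}

// NVFP4-style usage: quantizeBlock(values.slice(i, i + 16), false)
// MXFP4-style usage: quantizeBlock(values.slice(i, i + 32), true)
// Smaller blocks with finer-grained scales mean a single outlier degrades only
// 15 neighbors instead of 31, which is the intuition behind the data-efficiency
// gap reported above.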