Update smollm3.md -- missing citation for intra-document masking (#3125)
* Update smollm3.md
The citation for intra-document masking was missing; fixed it
* Update smollm3.md
Using https://huggingface.co/papers/2402.13991 instead of the arXiv link
smollm3.md (1 addition, 1 deletion)
@@ -73,7 +73,7 @@ SmolLM3 follows a transformer decoder architecture with tied embedding similar t
**NoPE:** We implemented NoPE from "[RoPE to NoRoPE and Back Again: A New Hybrid Attention Strategy](https://huggingface.co/papers/2501.18795)" (Yang et al., 2025), selectively removing rotary position embeddings from every 4th layer. This approach improves long context performance without affecting short context capabilities, as confirmed by our ablations.
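To make the NoPE pattern concrete, here is a minimal single-head sketch in which every 4th layer skips rotary embeddings and attends on content alone. The `apply_rope` helper, the 0-indexed layer convention, and the `NOPE_INTERVAL` name are illustrative assumptions, not SmolLM3's actual code.

```python
import torch

def apply_rope(q, k, position_ids):
    # Simplified rotary position embedding (single head, even head dim):
    # rotate channel pairs of q/k by a position-dependent angle.
    dim = q.shape[-1]
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.float) / dim))
    angles = position_ids[:, None].float() * inv_freq  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()

    def rotate(x):
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return out.flatten(-2)

    return rotate(q), rotate(k)

NOPE_INTERVAL = 4  # every 4th layer carries no positional information

def uses_rope(layer_idx: int) -> bool:
    # With 0-indexed layers, indices 3, 7, 11, ... become NoPE layers
    # (the exact offset is an assumed implementation detail).
    return (layer_idx + 1) % NOPE_INTERVAL != 0

def attention(q, k, v, layer_idx, position_ids):
    # q, k, v: (seq, dim) single-head tensors for illustration.
    if uses_rope(layer_idx):
        q, k = apply_rope(q, k, position_ids)
    # On NoPE layers, q/k are used unrotated: attention is purely
    # content-based, which is what helps long-context extrapolation.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    causal = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```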
-**Intra-Document Masking:** During training, we use attention masking to ensure tokens from different documents in the same training sequence don't attend to each other. Similar to Llama 3, this helps with faster and more stable long context training while maintaining short context performance.
+**Intra-Document Masking:** Following "[Analysing The Impact of Sequence Composition on Language Model Pre-Training](https://huggingface.co/papers/2402.13991)", during training we use attention masking to ensure tokens from different documents in the same training sequence don't attend to each other. Similar to Llama 3, this helps with faster and more stable long context training while maintaining short context performance.
**Training Stability:** Following OLMo 2, we remove weight decay from embedding layers to improve training stability. This modification contributed to more stable training dynamics, with embedding norms naturally stabilizing at healthier values during training without impacting overall performance in our ablations.
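As a sketch of the no-weight-decay-on-embeddings setup, the snippet below puts `nn.Embedding` parameters into a zero-decay optimizer group. The learning rate, decay value, and grouping-by-module approach are assumptions for illustration, not SmolLM3's actual training configuration.

```python
import torch
from torch import nn

def build_optimizer(model: nn.Module, lr: float = 3e-4, wd: float = 0.1):
    # Split parameters into two groups: embedding weights train without
    # weight decay, everything else keeps the usual decay.
    # (lr and wd values here are placeholders, not SmolLM3's settings.)
    embedding_params = {
        id(p)
        for m in model.modules()
        if isinstance(m, nn.Embedding)
        for p in m.parameters()
    }
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # With tied embeddings, the lm_head shares the same tensor, so it
        # is yielded once by model.parameters() and lands in no_decay too.
        (no_decay if id(p) in embedding_params else decay).append(p)
    return torch.optim.AdamW(
        [
            {"params": decay, "weight_decay": wd},
            {"params": no_decay, "weight_decay": 0.0},
        ],
        lr=lr,
    )
```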