Commit df73c9b

update

1 parent 17fdeae commit df73c9b

3 files changed: +166 -0 lines changed

3 files changed

+166
-0
lines changed
Lines changed: 12 additions & 0 deletions

@@ -0,0 +1,12 @@

```tsx
import { LessonPage } from "@/components/lesson-page";

export default function BatchSizeVsSequenceLengthPage() {
  return (
    <LessonPage
      contentPath="llm-fundamentals/batch-size-vs-sequence-length"
      prevLink={{ href: "/learn", label: "← Back to Course" }}
      nextLink={{ href: "/learn", label: "Continue Learning →" }}
    />
  );
}
```
lib/course-structure.tsx

Lines changed: 16 additions & 0 deletions

```diff
@@ -365,6 +365,22 @@ export const getCourseModules = (): ModuleData[] => [
         href: "/learn/building-a-transformer/training-a-transformer"
       }
     ]
+  },
+  {
+    title: "LLM Training Fundamentals",
+    titleZh: "LLM训练基础",
+    icon: (
+      <svg className="w-5 h-5" fill="none" stroke="currentColor" viewBox="0 0 24 24">
+        <path strokeLinecap="round" strokeLinejoin="round" strokeWidth={2} d="M9 3v2m6-2v2M9 19v2m6-2v2M5 9H3m2 6H3m18-6h-2m2 6h-2M7 19h10a2 2 0 002-2V7a2 2 0 00-2-2H7a2 2 0 00-2 2v10a2 2 0 002 2zM9 9h6v6H9V9z" />
+      </svg>
+    ),
+    lessons: [
+      {
+        title: "Batch Size vs Sequence Length",
+        titleZh: "批量大小与序列长度",
+        href: "/learn/llm-fundamentals/batch-size-vs-sequence-length"
+      }
+    ]
   }
 ];
```
Lines changed: 138 additions & 0 deletions

@@ -0,0 +1,138 @@

---
hero:
  title: "Batch Size vs Sequence Length"
  subtitle: "Understanding Two Critical Training Parameters in LLMs"
  tags:
    - "🤖 LLM Training"
    - "⏱️ 12 min read"
---

# Batch Size vs Sequence Length in LLM Training

When training large language models (LLMs), two of the most important hyperparameters you'll configure are **batch size** and **sequence length**. While they might seem similar at first (both involve "how much data" the model sees), they serve fundamentally different purposes and have distinct impacts on training.
## What is Batch Size?

**Batch size** is the number of independent training examples your model processes in parallel before updating its weights.

### How It Works

Think of batch size as "how many different conversations" your model reads simultaneously:

```python
# Example with batch_size = 4
batch = [
    "The cat sat on the mat",            # Example 1
    "Machine learning is fascinating",   # Example 2
    "Python is a great language",        # Example 3
    "Transformers revolutionized NLP"    # Example 4
]
```
Each example in the batch is processed independently, and the gradients from all examples are averaged before updating the model weights.
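To make the averaging concrete, here is a minimal PyTorch sketch (the toy model, random token IDs, and shapes are hypothetical, for illustration only). Because the loss is the mean over all positions in the batch, the backward pass produces gradients that average the contributions of the four independent examples:

```python
import torch
import torch.nn as nn

batch_size, seq_len, vocab_size = 4, 8, 100

# Toy stand-in for a language model: embedding + linear head.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))

# Token IDs of shape (batch_size, seq_len); real training would tokenize text.
inputs = torch.randint(0, vocab_size, (batch_size, seq_len))
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

logits = model(inputs)  # (batch_size, seq_len, vocab_size)

# Mean cross-entropy over every position in the batch: the backward pass
# therefore averages gradients across the 4 independent examples.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()  # one averaged gradient -> one weight update
```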
To understand the trade-offs, imagine trying to find the best path down a mountain (minimizing loss).

**Large Batches (asking a large crowd for directions):** You average the advice from many people. The resulting direction is reliable and stable, preventing you from overreacting to any single bad piece of advice (this is "low noise"). However, this consensus path might be a slow, winding road that misses clever shortcuts; the updates are less "exploratory."

**Small Batches (asking one or two hikers):** Their advice is less reliable and more random (this is "high noise"). That randomness can be beneficial: it might lead you to a hidden, faster trail that the large crowd would have averaged out. The noise helps the model make more diverse updates, explore different solutions, and escape suboptimal valleys (local minima).

In short: larger batches average gradients over more examples, which reduces noise and stabilizes training, but makes each update less exploratory; smaller batches introduce more gradient noise, which can push the model toward more diverse solutions.
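One quick way to see the noise effect numerically: treat each example's gradient as a noisy estimate of the true gradient and watch the spread of the batch average shrink as the batch grows. This is an illustrative NumPy simulation (all values invented), not actual LLM training:

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0   # the "true" downhill direction (a scalar, for simplicity)
noise_std = 2.0   # per-example gradient noise

for batch_size in [1, 4, 16, 64, 256]:
    # 10,000 simulated batches, each averaging `batch_size` noisy gradients.
    per_example = true_grad + noise_std * rng.standard_normal((10_000, batch_size))
    batch_grad = per_example.mean(axis=1)
    # Spread shrinks roughly like noise_std / sqrt(batch_size).
    print(f"batch_size={batch_size:>3}  std of averaged gradient = {batch_grad.std():.3f}")
```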
## What is Sequence Length?

**Sequence length** is the maximum number of tokens (words/subwords) that the model processes in a single example.

### How It Works

Think of sequence length as "how long of a conversation" your model can read at once:

```python
# Short sequence (seq_len = 256 tokens)
"The cat sat on the mat. It was a sunny day..."

# Long sequence (seq_len = 4096 tokens)
"""The cat sat on the mat. It was a sunny day. The birds were
singing in the trees. A gentle breeze rustled the leaves...
[continues for 4000+ more tokens, could be an entire chapter]"""
```
Longer sequences give the model more context to learn from, enabling it to capture long-range dependencies and relationships. The catch is that the attention mechanism has O(n²) memory complexity in sequence length: the attention-score matrix has n² entries for a sequence of n tokens, so doubling the length roughly quadruples the memory attention needs.
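Rough numbers make the quadratic growth tangible. This sketch counts entries in a single head's attention-score matrix, assuming fp16 (2 bytes per entry) and ignoring batch size, layer count, and optimizations such as FlashAttention:

```python
# Attention scores form an (n x n) matrix per head: memory grows as n^2.
for seq_len in [256, 1024, 4096]:
    entries = seq_len * seq_len            # one head's score matrix
    mib = entries * 2 / 2**20              # fp16: 2 bytes per entry
    print(f"seq_len={seq_len:>5}: {entries:>12,} entries ≈ {mib:6.1f} MiB per head")
```

Going from 256 to 4096 tokens multiplies the length by 16 but the score-matrix memory by 256.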
### Impact on Training

**Longer Sequence Length (e.g., 4096):**
- ✅ Model can learn long-range dependencies
- ✅ Better understanding of extended context
- ✅ More information per training example
- ❌ Quadratic memory growth (attention is expensive!)
- ❌ Slower per-step training time

**Shorter Sequence Length (e.g., 256):**
- ✅ Faster training steps
- ✅ Less memory required
- ❌ Limited context window
- ❌ Cannot learn long-range patterns
## The Key Difference

The fundamental difference between these two parameters:

| Aspect | Batch Size | Sequence Length |
|--------|-----------|-----------------|
| **What it controls** | Number of independent examples | Length of each example |
| **Relationship between data** | Examples are unrelated | Tokens are sequential and dependent |
| **Memory scaling** | Linear (2x batch = 2x memory) | Quadratic for attention (2x length = 4x memory) |
| **Learning impact** | Affects gradient stability | Affects context understanding |
| **Trade-off** | Stability vs exploration | Context vs speed |
### Visual Comparison

```
Batch Size = 4, Sequence Length = 8:
┌─────────────────────────────────────┐
│ Example 1: [A, B, C, D, E, F, G, H] │
│ Example 2: [I, J, K, L, M, N, O, P] │
│ Example 3: [Q, R, S, T, U, V, W, X] │
│ Example 4: [Y, Z, A, B, C, D, E, F] │
└─────────────────────────────────────┘

Averaged Gradients → Weight Update

Batch Size = 2, Sequence Length = 16:
┌─────────────────────────────────────────────────────┐
│ Example 1: [A, B, C, D, E, F, G, H, I, J, K, L...]  │
│ Example 2: [M, N, O, P, Q, R, S, T, U, V, W, X...]  │
└─────────────────────────────────────────────────────┘

Averaged Gradients → Weight Update
```

Both layouts process the same 32 tokens per update (4 × 8 = 2 × 16); they differ in how many independent gradient samples are averaged and how much context each token can attend to.
## Real-World Experiment Results

In a recent ablation study on MoE transformers, three strategies were tested with equal GPU memory usage:
| Strategy | Batch Size | Seq Length | Final Val Loss | Val Accuracy | Training Time |
|----------|------------|------------|----------------|--------------|---------------|
| **Balanced** | 26 | 1024 | 0.0636 | 98.73% | 7.04 min |
| **Long Seq** | 6 | 4096 | 0.0745 | 98.45% | 6.99 min |
| **Large Batch** | 104 | 256 | 0.1025 | 98.00% | 6.97 min |
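As a quick sanity check on the comparison, all three configurations see a similar number of tokens per optimizer step, so they differ mainly in how those tokens are arranged into examples:

```python
# Tokens per optimizer step = batch_size * seq_len (values from the table above).
strategies = {"Balanced": (26, 1024), "Long Seq": (6, 4096), "Large Batch": (104, 256)}
for name, (batch_size, seq_len) in strategies.items():
    print(f"{name:>11}: {batch_size:>3} x {seq_len:>4} = {batch_size * seq_len:>6,} tokens/step")
```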
### Key Findings

1. **Balanced approach won** on validation loss metrics
2. **Large batch trained fastest** per step but reached the highest final loss
3. **Long sequence** showed promise but didn't win on short-term metrics
### Important Caveat ⚠️

**Validation loss doesn't tell the whole story!**

While the large-batch configuration converged quickly in validation loss, longer sequences provide more context and should, in theory, let the model learn more complex patterns over time. Validation loss may reward fast convergence without reflecting the model's ability to leverage extended context windows.

For applications requiring deep contextual understanding, such as analyzing long documents or multi-turn dialogues, longer sequence lengths are more valuable, even at the cost of a somewhat higher validation loss.

**In practice**, pretraining sequence length is often set between 1024 and 4096 tokens, with context-length extension training applied later.
