---
hero:
  title: "Batch Size vs Sequence Length"
  subtitle: "Understanding Two Critical Training Parameters in LLMs"
  tags:
    - "🤖 LLM Training"
    - "⏱️ 12 min read"
---

# Batch Size vs Sequence Length in LLM Training

When training large language models (LLMs), two of the most important hyperparameters you'll configure are **batch size** and **sequence length**. While they might seem similar at first (both involve "how much data" the model sees), they serve fundamentally different purposes and have distinct impacts on training.

## What is Batch Size?

**Batch size** is the number of independent training examples your model processes in parallel before updating its weights.

### How It Works

Think of batch size as "how many different conversations" your model reads simultaneously:

```python
# Example with batch_size = 4
batch = [
    "The cat sat on the mat",            # Example 1
    "Machine learning is fascinating",   # Example 2
    "Python is a great language",        # Example 3
    "Transformers revolutionized NLP"    # Example 4
]
```

Each example in the batch is processed independently, and the gradients from all examples are averaged before updating the model weights.
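
To make that averaging concrete, here is a minimal PyTorch sketch. The toy model, random token IDs, and hyperparameters are stand-ins invented for illustration, not taken from any real training setup:

```python
import torch
import torch.nn as nn

# Hypothetical toy setup: token IDs of shape (batch_size, seq_len)
batch_size, seq_len, vocab_size = 4, 8, 100
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# A stand-in "language model": embedding + linear head
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

logits = model(input_ids)  # (batch_size, seq_len, vocab_size)

# reduction="mean" averages the loss (and therefore the gradients)
# over every example and token position in the batch.
loss = nn.functional.cross_entropy(
    logits.view(-1, vocab_size), targets.view(-1), reduction="mean"
)
loss.backward()        # one gradient, already averaged over the batch
optimizer.step()       # one weight update for the whole batch
optimizer.zero_grad()
```

However many examples sit in the batch, the optimizer sees a single averaged gradient and performs a single update.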

To understand the trade-offs, imagine trying to find the best path down a mountain (minimizing loss).

**Large Batches (asking a large crowd for directions):** You average the advice from many people. The resulting direction is very reliable and stable, preventing you from overreacting to any single bad piece of advice (this is "low noise"). However, this consensus path might be a slow, winding road that misses clever shortcuts. The updates are less "exploratory."

**Small Batches (asking one or two hikers):** Their advice is less reliable and more random (this is "high noise"). This randomness can be beneficial—it might lead you to discover a hidden, faster trail that the large crowd would have averaged out. This "noise" helps the model explore more diverse solutions and can help it escape from getting stuck in suboptimal valleys (local minima).

In short: larger batches average the gradient over more examples, which reduces noise and stabilizes training, but each update is less "exploratory." Smaller batches introduce more gradient noise, which can push the model toward more diverse updates and help it explore different solutions.
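
The noise argument is easy to check numerically by treating each per-example gradient as a noisy estimate of the "true" gradient and measuring how the batch average behaves. This is a self-contained simulation, not data from any training run:

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0   # pretend the "true" gradient is a single scalar
noise_std = 1.0   # per-example gradient noise

for batch_size in [1, 4, 16, 64, 256]:
    # Draw many batches, average the per-example gradients within each batch
    per_example = true_grad + noise_std * rng.standard_normal((10_000, batch_size))
    batch_grads = per_example.mean(axis=1)
    print(f"batch_size={batch_size:>3}  gradient std ≈ {batch_grads.std():.3f}")
```

The printed spread shrinks roughly as 1/sqrt(batch_size): exactly the stability-versus-exploration trade-off described above.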

## What is Sequence Length?

**Sequence length** is the maximum number of tokens (words/subwords) that the model processes in a single example.

### How It Works

Think of sequence length as "how long of a conversation" your model can read at once:

```python
# Short sequence (seq_len = 256 tokens)
"The cat sat on the mat. It was a sunny day..."

# Long sequence (seq_len = 4096 tokens)
"""The cat sat on the mat. It was a sunny day. The birds were
singing in the trees. A gentle breeze rustled the leaves...
[continues for 4000+ more tokens, could be an entire chapter]"""
```

Longer sequences give the model more context to learn from, enabling it to capture long-range dependencies and relationships. The trade-off is that the attention mechanism's memory cost grows quadratically with sequence length: O(n²) for a sequence of n tokens.
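
A back-of-the-envelope sketch of that quadratic growth, counting only the n × n attention score matrices (the layer count, head count, and fp16 assumption below are illustrative values, not a specific model):

```python
def attention_score_memory_mb(seq_len: int, n_layers: int = 12,
                              n_heads: int = 12, bytes_per_value: int = 2) -> float:
    """Memory for the n x n attention score matrices alone, per example."""
    return n_layers * n_heads * seq_len * seq_len * bytes_per_value / 1e6

for seq_len in [256, 1024, 4096]:
    print(f"seq_len={seq_len:>5}: ~{attention_score_memory_mb(seq_len):,.0f} MB per example")
# Doubling seq_len multiplies this term by 4; going 256 -> 4096 (16x) multiplies it by 256x.
```

In practice, fused kernels such as FlashAttention avoid materializing these matrices, but the quadratic amount of attention compute remains.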

### Impact on Training

**Longer Sequence Length (e.g., 4096):**
- ✅ Model can learn long-range dependencies
- ✅ Better understanding of extended context
- ✅ More information per training example
- ❌ Quadratic memory growth (attention is expensive!)
- ❌ Slower per-step training time

**Shorter Sequence Length (e.g., 256):**
- ✅ Faster training steps
- ✅ Less memory required
- ❌ Limited context window
- ❌ Cannot learn long-range patterns

## The Key Difference

The fundamental difference between these two parameters:

| Aspect | Batch Size | Sequence Length |
|--------|-----------|-----------------|
| **What it controls** | Number of independent examples | Length of each example |
| **Relationship between data** | Examples are unrelated | Tokens are sequential and dependent |
| **Memory scaling** | Linear (2x batch = 2x memory) | Quadratic for attention (2x length = 4x memory) |
| **Learning impact** | Affects gradient stability | Affects context understanding |
| **Trade-off** | Stability vs exploration | Context vs speed |
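
The memory-scaling row can be sanity-checked with a rough per-layer activation estimate. The hidden size and head count below are illustrative assumptions, and real frameworks add overhead this ignores:

```python
def rough_activation_elements(batch_size: int, seq_len: int,
                              hidden: int = 1024, n_heads: int = 16) -> int:
    """Very rough per-layer activation element count (illustrative, not exact)."""
    linear_term = batch_size * seq_len * hidden            # hidden states: linear in seq_len
    attention_term = batch_size * n_heads * seq_len ** 2   # score matrices: quadratic in seq_len
    return linear_term + attention_term

base = rough_activation_elements(batch_size=8, seq_len=1024)
print(f"2x batch : {rough_activation_elements(16, 1024) / base:.2f}x memory")  # exactly 2x
print(f"2x seqlen: {rough_activation_elements(8, 2048) / base:.2f}x memory")   # between 2x and 4x
```

Doubling the batch doubles both terms, while doubling the sequence length quadruples the attention term, which dominates at long contexts: that asymmetry is the "2x vs 4x" entry in the table.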

### Visual Comparison

```
Batch Size = 4, Sequence Length = 8:
┌─────────────────────────────────────┐
│ Example 1: [A, B, C, D, E, F, G, H] │
│ Example 2: [I, J, K, L, M, N, O, P] │
│ Example 3: [Q, R, S, T, U, V, W, X] │
│ Example 4: [Y, Z, A, B, C, D, E, F] │
└─────────────────────────────────────┘
                  ↓
   Averaged Gradients → Weight Update

Batch Size = 2, Sequence Length = 16:
┌─────────────────────────────────────────────────────┐
│ Example 1: [A, B, C, D, E, F, G, H, I, J, K, L...]  │
│ Example 2: [M, N, O, P, Q, R, S, T, U, V, W, X...]  │
└─────────────────────────────────────────────────────┘
                          ↓
           Averaged Gradients → Weight Update
```
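
Both layouts above carry the same 32 tokens per step; what changes is how they are arranged. A small sketch with plain Python lists (hypothetical token IDs) makes the reshaping explicit:

```python
tokens = list(range(32))   # a hypothetical stream of 32 token IDs

def make_batch(token_stream, batch_size, seq_len):
    """Slice a flat token stream into a (batch_size, seq_len) batch."""
    assert batch_size * seq_len <= len(token_stream)
    return [token_stream[i * seq_len:(i + 1) * seq_len] for i in range(batch_size)]

many_short = make_batch(tokens, batch_size=4, seq_len=8)    # 4 short examples
few_long   = make_batch(tokens, batch_size=2, seq_len=16)   # 2 long examples

# Same 32 tokens per optimizer step either way, but attention only operates
# within each row: the long layout lets token 0 attend to token 15, while the
# short layout cuts that dependency off at position 7.
print(len(many_short), len(many_short[0]))   # 4 8
print(len(few_long), len(few_long[0]))       # 2 16
```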

## Real-World Experiment Results

In a recent ablation study on MoE transformers, three strategies were tested with equal GPU memory usage:

| Strategy | Batch Size | Seq Length | Final Val Loss | Val Accuracy | Training Time |
|----------|------------|------------|----------------|--------------|---------------|
| **Balanced** | 26 | 1024 | 0.0636 | 98.73% | 7.04 min |
| **Long Seq** | 6 | 4096 | 0.0745 | 98.45% | 6.99 min |
| **Large Batch** | 104 | 256 | 0.1025 | 98.00% | 6.97 min |
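
One way to see why these configurations are comparable is to compute the token budget each one processes per optimizer step (the batch sizes and sequence lengths come from the table above; the script itself is just arithmetic):

```python
configs = {
    "Balanced":    {"batch_size": 26,  "seq_len": 1024},
    "Long Seq":    {"batch_size": 6,   "seq_len": 4096},
    "Large Batch": {"batch_size": 104, "seq_len": 256},
}

for name, cfg in configs.items():
    tokens_per_step = cfg["batch_size"] * cfg["seq_len"]
    print(f"{name:<12} {tokens_per_step:,} tokens/step")
# Balanced     26,624 tokens/step
# Long Seq     24,576 tokens/step
# Large Batch  26,624 tokens/step
```

Each run sees a roughly similar number of tokens per step, so the differences in final loss mostly reflect how those tokens are arranged rather than raw throughput.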

### Key Findings

1. **Balanced approach won** on validation loss metrics
2. **Large batch trained fastest** per-step but achieved higher final loss
3. **Long sequence** showed promise but didn't win on short-term metrics

### Important Caveat ⚠️

**Validation loss doesn't tell the whole story!**

While large batch size showed faster convergence in validation loss, longer sequences provide more context and should theoretically enable the model to learn more complex patterns over time. The validation loss metric may favor faster convergence but doesn't necessarily reflect the model's ability to leverage extended context windows.

For applications requiring deep contextual understanding, such as analyzing long documents or multi-turn dialogues, longer sequence lengths are more valuable, even at the cost of a higher validation loss.

**In practice**, pretraining sequence lengths typically fall between 1024 and 4096 tokens, with longer contexts introduced later through dedicated context-extension training.
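
For a sense of what that staged approach can look like, here is a purely hypothetical two-stage configuration; the stage names, lengths, and batch sizes are invented for illustration, not a recipe from any specific model:

```python
# Hypothetical two-stage schedule: pretrain at a moderate context, then extend.
training_stages = [
    {"stage": "main pretraining",  "seq_len": 2048, "batch_size": 64},
    {"stage": "context extension", "seq_len": 8192, "batch_size": 16},
]

for s in training_stages:
    print(f'{s["stage"]:<18} seq_len={s["seq_len"]:<5} batch_size={s["batch_size"]}')
```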