What the helly is Mixture of Experts?
Mixture of Experts (MoE) is a neural network architecture that uses multiple specialized sub-networks (experts) to process different parts of the input. Instead of routing all data through the same network, a gating network learns to dynamically select which experts should process each token.
Key Benefits

- Scalability: Increase model capacity without proportionally increasing compute
- Specialization: Experts specialize in different syntactic patterns (e.g., verbs, nouns, adjectives)
- Sparse activation: Only a subset of experts is activated per token
MoE consists of two key components:
- Router (Gate Network): A learned network that decides which experts should process each token. For every input, it computes a score for each expert and selects the top-K highest-scoring ones to handle that token.
- Experts: A set of specialized Feed-Forward Neural Networks (FFNNs). Instead of one shared FFNN processing all tokens, MoE has multiple expert FFNNs, each learning to handle different patterns or input types (usually syntactic patterns such as verbs, nouns, or adjectives).

Figure: Each expert specializes in different syntactic patterns during training
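To make the two components concrete, here is a minimal PyTorch sketch of a sparse MoE layer. The class name, dimensions, and the simple per-expert loop are illustrative choices, not the demo's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Minimal sparse MoE layer: a router picks the top-K expert FFNNs per token."""
    def __init__(self, d_model=512, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router (gate network): produces one score per expert for each token
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Experts: independent feed-forward networks
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: [num_tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)           # [num_tokens, num_experts]
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += top_scores[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Calling `SimpleMoELayer()(torch.randn(4, 512))` routes each of the 4 tokens to its top-2 experts and returns a [4, 512] tensor.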
Sparse vs. Dense Models

To understand MoE, it's important to contrast it with traditional dense models:
Dense Model (Traditional)

In a standard transformer, every token passes through the same Feed-Forward Neural Network (FFNN) at each layer. This means:

- All parameters are activated for every token
- Computation scales linearly with model size
- Simple and stable, but inefficient for very large models

Figure: Dense MoE architecture, where all experts are selected and activated per token
Sparse Model (MoE)

In MoE, each FFNN layer is replaced by multiple expert FFNNs, but only a subset of experts processes each token. This means:

- Only the top-K experts are activated per token (sparse activation)
- Computation remains constant regardless of total expert count
- More complex to train, but enables massive model scaling

Figure: Sparse MoE architecture, where only the selected experts (highlighted) are activated per token
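A quick back-of-the-envelope comparison makes the scaling argument concrete. The sizes below (512-dimensional embeddings, a 2048-dimensional hidden layer, 8 experts, K=2) are illustrative assumptions, not the demo's configuration:

```python
# Hypothetical FFN sizes, chosen only to illustrate the scaling argument.
d_model, d_hidden = 512, 2048
ffn_params = 2 * d_model * d_hidden        # W1 and W2, ignoring biases

num_experts, top_k = 8, 2
dense_active = ffn_params                  # dense layer: every parameter is used for every token
moe_total    = num_experts * ffn_params    # sparse MoE: capacity grows with the expert count...
moe_active   = top_k * ffn_params          # ...but per-token compute depends only on K

print(f"Dense FFN params (all active):  {dense_active:,}")
print(f"MoE total params (capacity):    {moe_total:,}")
print(f"MoE active params per token:    {moe_active:,}")
```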
Step-by-Step Process

1. Gating Network (Router)

Figure: Token routing process: Gating → Top-K Selection → Expert Processing

For each input token, the gating network computes a score for every expert.
Step 1a: Linear Transformation

First, the token embedding is multiplied by the gating weight matrix W_gate:

    h = token_embedding * W_gate

Figure: Linear transformation: token embedding (x) multiplied by weight matrix (W)
Where:

- token_embedding: Vector representation of the input token (e.g., 512 dimensions)
- W_gate: Learned weight matrix (e.g., 512 × num_experts)
- h: Raw logits/scores for each expert (one score per expert)
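In code, Step 1a is a single matrix multiply. A minimal sketch, assuming the example sizes above (a 512-dimensional embedding and 8 experts):

```python
import torch

d_model, num_experts = 512, 8                  # example sizes from the text
token_embedding = torch.randn(d_model)         # one token, shape [512]
W_gate = torch.randn(d_model, num_experts)     # learned gating weights, shape [512, 8]

h = token_embedding @ W_gate                   # raw logits: one score per expert
print(h.shape)                                 # torch.Size([8])
```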
Step 1b: Softmax Normalization

The raw scores are then normalized using the softmax function to produce probabilities:

    scores = softmax(h)

Figure: Softmax converts raw scores into a probability distribution that sums to 1

Softmax ensures that every score lies between 0 and 1 and that all scores sum to exactly 1. The result is a probability distribution over all experts, indicating how suitable each expert is for processing this particular token.
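Step 1b in code, using made-up logits for `h`:

```python
import torch

h = torch.tensor([2.0, -1.0, 0.5, 3.0])    # example raw expert logits from Step 1a
scores = torch.softmax(h, dim=-1)           # probability distribution over experts
print(scores)                               # every entry lies in (0, 1)
print(scores.sum())                         # tensor(1.)
```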
Step 1c: Repeat!

Figure: Multi-layer MoE architecture: each token goes through multiple MoE layers

At each MoE layer, the router independently computes scores and selects experts for every token. This means a token may be routed to different experts at different layers. Each layer's routing decision is independent and learned during training.
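A small sketch of this per-layer independence, assuming top-1 routing and randomly initialized gates (so the printed choices are arbitrary):

```python
import torch
import torch.nn as nn

# Illustrative only: each layer owns its own gate, so the same token can be
# routed to different experts at different depths.
d_model, num_experts, num_layers = 512, 8, 4
gates = nn.ModuleList([nn.Linear(d_model, num_experts) for _ in range(num_layers)])

token = torch.randn(1, d_model)
for layer, gate in enumerate(gates):
    expert_choice = gate(token).argmax(dim=-1).item()   # top-1 routing, for brevity
    print(f"layer {layer}: routed to expert {expert_choice}")
```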
2. Top-K Selection

Instead of using all experts, we select only the top-K experts with the highest scores. Common values: K=1 or K=2.
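For example, with the made-up scores below, top-K selection with K=2 keeps experts 1 and 3 and discards the rest:

```python
import torch

scores = torch.tensor([0.05, 0.40, 0.10, 0.35, 0.10])   # softmax output from Step 1b
top_scores, top_experts = torch.topk(scores, k=2)
print(top_experts)   # tensor([1, 3]) -> indices of the two highest-scoring experts
print(top_scores)    # tensor([0.4000, 0.3500])
```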
3. Token Routing

Each token is routed to its top-K selected experts. Tokens assigned to the same expert are batched together for efficient processing.
Batch Processing

Tokens routed to the same expert are batched together for efficiency (see the sketch after this list):

- Input shape changes from [1, 512] to [batch_size, 512]
- Processing time scales with batch size
- All tokens in a batch complete simultaneously
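One way to picture the batching step (a sketch with a hypothetical top-1 assignment, not the demo's actual implementation): tokens are grouped by their assigned expert index so each expert runs once over its whole batch.

```python
import torch

num_tokens, d_model, num_experts = 6, 512, 4
tokens = torch.randn(num_tokens, d_model)
assigned = torch.tensor([2, 0, 2, 1, 0, 2])    # top-1 expert index per token (illustrative)

for e in range(num_experts):
    batch = tokens[assigned == e]              # all tokens routed to expert e
    print(f"expert {e}: batch shape {tuple(batch.shape)}")
# expert 0 sees a [2, 512] batch, expert 1 a [1, 512] batch,
# expert 2 a [3, 512] batch, and expert 3 receives no tokens.
```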
4. Expert Processing (FFN)

Each expert is a Feed-Forward Network (FFN) that transforms the input:

    FFN(x) = W₂ × ReLU(W₁ × x)

Where:

- W₁: First linear layer (token_embedding → hidden_dimensions)
- ReLU: Activation function (element-wise)
- W₂: Second linear layer (hidden_dimensions → token_embedding)
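In PyTorch terms, a single expert could be sketched as follows; the 512 and 2048 dimensions are the illustrative sizes used earlier:

```python
import torch
import torch.nn as nn

d_model, d_hidden = 512, 2048              # example dimensions

# One expert = an ordinary two-layer FFN: expand, apply ReLU, project back.
expert = nn.Sequential(
    nn.Linear(d_model, d_hidden),   # W1: token_embedding -> hidden_dimensions
    nn.ReLU(),                      # element-wise activation
    nn.Linear(d_hidden, d_model),   # W2: hidden_dimensions -> token_embedding
)

x = torch.randn(3, d_model)         # a small batch of tokens routed to this expert
print(expert(x).shape)              # torch.Size([3, 512])
```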
5. Output Combination

The outputs from the selected experts are weighted by their gating scores and summed:

    output = Σ (score_i × expert_i(token))
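A minimal sketch of the weighted combination, assuming K=2 and made-up gating scores:

```python
import torch

d_model = 512
expert_outputs = [torch.randn(d_model), torch.randn(d_model)]   # outputs of the top-2 experts
gate_scores = torch.tensor([0.62, 0.38])                        # their gating scores (illustrative)

# output = sum_i score_i * expert_i(token)
output = sum(s * out for s, out in zip(gate_scores, expert_outputs))
print(output.shape)   # torch.Size([512])
```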
Load Balancing Challenge

A key challenge in MoE is load balancing. Without constraints, the gating network often learns to overuse a few "favorite" experts while ignoring others.
Why This Happens

- The gating network optimizes for accuracy, not balance
- Popular experts get more gradient updates, improving faster
- This creates a feedback loop: good experts → more use → better experts

Solutions include auxiliary losses, capacity constraints, and expert dropout to encourage more balanced routing.
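As one concrete example of an auxiliary loss, here is a simplified sketch in the style of the Switch Transformer load-balancing term: per expert, it multiplies the fraction of tokens routed to that expert by its mean router probability, so the loss is smallest when routing is uniform. The function name and the top-1 routing assumption are illustrative:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top1_idx, num_experts):
    """Simplified Switch-Transformer-style auxiliary loss (sketch).

    f_i = fraction of tokens routed (top-1) to expert i
    P_i = mean router probability assigned to expert i
    The loss is minimized when both are uniform across experts.
    """
    probs = F.softmax(router_logits, dim=-1)                    # [num_tokens, num_experts]
    f = torch.bincount(top1_idx, minlength=num_experts).float() / top1_idx.numel()
    P = probs.mean(dim=0)
    return num_experts * torch.sum(f * P)

# Tiny illustrative example: 4 tokens, 4 experts, all routed to expert 0 (imbalanced),
# which yields a loss well above the balanced value of 1.0.
logits = torch.tensor([[4.0, 0.0, 0.0, 0.0]] * 4)
print(load_balancing_loss(logits, logits.argmax(dim=-1), num_experts=4))
```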
Getting Started

- Navigate to the Demo
- Enter a text prompt or word in the input box
- Click "Process Tokens" or press Enter
- Watch the token get scored, routed, and processed
- Click on experts to see FFN internals
- Adjust controls to experiment with different configurations
- Add more tokens to see batch processing and load distribution
Further Reading

- A Visual Guide to Mixture of Experts (Maarten Grootendorst): comprehensive visual explanations with 50+ diagrams; the diagrams in this documentation are sourced from this excellent guide.
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (the original MoE paper)
- Switch Transformers: Scaling to Trillion Parameter Models (Google's Switch Transformer)
- Mixture of Experts Explained (Hugging Face blog)