Decoding the internal logic of Large Language Models through Mechanistic Interpretability.
We focus on the Indirect Object Identification (IOI) task, using GPT-2 Small and Pythia-410M as our models.
How does a model decide that "John" is the answer in the sentence "When John and Mary went to the store, Mary gave a drink to..."?
We intervene using activation steering to establish causal relationships.
| Model | Size | Key Architecture |
|---|---|---|
| GPT-2 Small | 124M | Learned absolute positional embeddings, LayerNorm |
| Pythia-410M | 410M | Rotary positional embeddings (RoPE) |
We identify the heads responsible for IOI by surgical intervention: corrupting a prompt (swapping the names) and then restoring specific activations from the clean run lets us isolate the heads that causally drive the prediction.
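The standard way to score such an intervention is a recovery (patching) metric on the logit difference between the two names. A minimal sketch, with hypothetical logit-difference values standing in for real clean/corrupted/patched runs:

```python
def recovery_score(clean_ld: float, corrupted_ld: float, patched_ld: float) -> float:
    """Fraction of the clean logit difference restored by patching one activation.
    0 = patching had no effect (matches corrupted run); 1 = full restoration."""
    return (patched_ld - corrupted_ld) / (clean_ld - corrupted_ld)

# Hypothetical logit differences, logit(" John") - logit(" Mary"), from three runs:
clean_ld = 3.2       # original prompt
corrupted_ld = -3.1  # names swapped
patched_ld = 2.4     # corrupted run with one head's output restored from the clean run

print(round(recovery_score(clean_ld, corrupted_ld, patched_ld), 3))  # -> 0.873
```

Heads whose individual patches recover a large fraction of the clean logit difference are the candidates for the IOI circuit.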
Using TransformerLens hooks, we extract raw attention patterns to "see" where the model is looking.
- Query: The last token (e.g., "to").
- Key: The target name (e.g., "John").
- Insight: IOI heads consistently show high attention from the end of the sentence back to the correct indirect object.
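This attention signature can be scored directly from a head's [query, key] pattern matrix. A toy sketch (the 5x5 pattern and positions below are illustrative, not extracted from a real model):

```python
import numpy as np

def ioi_attention_score(pattern: np.ndarray, end_pos: int, io_pos: int) -> float:
    """Attention weight that the query at end_pos places on the key at io_pos.
    `pattern` is one head's [query, key] attention matrix (each row sums to 1)."""
    return float(pattern[end_pos, io_pos])

# Toy pattern for a head whose final-token query attends back to position 1:
pattern = np.full((5, 5), 0.05)
pattern[4, 1] = 0.8  # heavy attention from the last token to the indirect object

print(ioi_attention_score(pattern, end_pos=4, io_pos=1))  # -> 0.8
```

With TransformerLens, the real pattern would come from the cached `hook_pattern` activation of each attention layer; ranking heads by this score surfaces the same candidates found via patching.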
How do we transfer logic between models with different architectures?
- Linear Alignment: Finding a coordinate mapping $W$ between GPT-2 and Pythia residual streams.
- Functional Patching: Swapping the output of functionally equivalent heads (e.g., GPT-2 IOI -> Pythia IOI).
- Logit-Space Translation: Using the shared vocabulary space as a universal bridge for communication.
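The linear-alignment step can be sketched as an ordinary least-squares fit between paired residual states. The dimensions below match the two models (768 for GPT-2 Small, 1024 for Pythia-410M), but the activations here are synthetic stand-ins:

```python
import numpy as np

# Learn W mapping GPT-2 residual vectors (d=768) to Pythia residual vectors
# (d=1024) by least squares over paired activations from the same prompts.
rng = np.random.default_rng(0)
d_gpt2, d_pythia, n = 768, 1024, 2000

X = rng.normal(size=(n, d_gpt2))                         # stand-in GPT-2 states
W_true = rng.normal(size=(d_gpt2, d_pythia)) / np.sqrt(d_gpt2)
Y = X @ W_true + 0.01 * rng.normal(size=(n, d_pythia))   # paired Pythia states

W, *_ = np.linalg.lstsq(X, Y, rcond=None)                # minimizes ||XW - Y||_F
err = np.linalg.norm(X @ W - Y) / np.linalg.norm(Y)
print(err < 0.05)  # -> True: the fit recovers the mapping on synthetic data
```

In practice $X$ and $Y$ would be residual-stream activations collected at matched token positions; how faithfully a single linear map transfers IOI behavior is exactly what the experiments test.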
```
├── src/                  # Core logic
│   ├── patch.py          # Patching tools
│   ├── steer.py          # Steering mechanisms
│   ├── ioi.py            # Dataset generation
│   ├── viz.py            # Plotting utilities
│   └── lens.py           # Logit lens and prob analysis
├── experiments/          # Research scripts & notebooks
│   ├── gpt2.py           # GPT-2 Small circuit discovery
│   ├── pythia.py         # Pythia evaluation
│   └── steer.py          # Steering tests
├── assets/
├── results/
├── pipeline.py           # Main entry point
└── requirements.txt
```

The dashboard above showcases the logit-lens analysis for a sample IOI prompt: "When James and Robert went to the park, James gave a ball to", where the correct completion is "Robert".
- Top-Left (Original: P(Correct)): Displays the probability of the correct indirect object Robert evolving across layers. The bright signal at position 15 in final layers (L10+) indicates a successful prediction.
- Top-Right (Original: P(Incorrect)): Shows the probability of the subject name James. Note the high signal in early layers where the name actually appears (pos 2 & 10), which is then suppressed by the IOI circuit.
- Bottom-Left & Bottom-Right: These represent the Corrupted Prompt logic (where names are swapped). They allow us to verify that the circuit is truly identifying the relationship between names rather than just memorizing position.
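The per-layer probabilities behind these panels come from projecting intermediate residual states through the unembedding. A minimal logit-lens sketch on synthetic data (a real implementation would also apply the model's final LayerNorm before unembedding):

```python
import numpy as np

def logit_lens_probs(resid_per_layer: np.ndarray, W_U: np.ndarray, token_id: int) -> np.ndarray:
    """P(token_id) at each layer, read by projecting the residual stream
    through the unembedding W_U and softmaxing (final LayerNorm omitted)."""
    logits = resid_per_layer @ W_U                  # [layer, vocab]
    logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs[:, token_id]

# Toy setup: 4 layers, vocab of 10; the residual drifts toward token 3's direction.
rng = np.random.default_rng(1)
W_U = rng.normal(size=(16, 10))
resid = np.stack([layer * 0.5 * W_U[:, 3] for layer in range(4)])
p = logit_lens_probs(resid, W_U, token_id=3)
print(np.all(np.diff(p) > 0))  # -> True: P(correct token) rises layer-by-layer
```

The dashboard's bright late-layer signal corresponds to exactly this rise in P(correct) as depth increases.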
Logit Difference vs Scale: Shows how injecting a steering vector into the residual stream linearly shifts the model's preference between the two names.
Probability Sweep: As we increase the steering scale, the model's confidence in the correct answer rises, indicating that we have isolated a direction in latent space responsible for the IOI task.
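The linear relationship between steering scale and logit difference follows directly from the unembedding being linear. A toy sketch (dimensions and the steering direction are illustrative; in the real pipeline the vector is added to a residual-stream activation mid-network):

```python
import numpy as np

rng = np.random.default_rng(2)
d, vocab = 16, 10
W_U = rng.normal(size=(d, vocab))
correct, incorrect = 3, 7
v = W_U[:, correct] - W_U[:, incorrect]   # steering direction between two tokens

resid = rng.normal(size=d)                # stand-in final residual state

def logit_diff(resid_vec: np.ndarray) -> float:
    """logit(correct) - logit(incorrect) after unembedding."""
    logits = resid_vec @ W_U
    return float(logits[correct] - logits[incorrect])

# Sweep the steering scale: adding scale * v shifts the logit diff linearly.
diffs = [logit_diff(resid + s * v) for s in (0.0, 0.5, 1.0, 1.5)]
print(np.allclose(np.diff(diffs), np.diff(diffs)[0]))  # -> True: constant slope
```

The constant slope here mirrors the straight line in the Logit Difference vs Scale plot; the probability sweep is just this logit shift passed through a softmax.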
- Logit Lens: Projects intermediate residual stream states onto the vocabulary to track how the model's prediction evolves layer-by-layer.
- Recovery Scores: Causal importance of specific heads identified via activation patching.
- Attention Grid: Real-time visualization of what the model is paying attention to at each layer.
- Wang et al. (2022): Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.
- Neel Nanda: For the TransformerLens library and the IOI Walkthrough.

