kushalgarg101/Attention_head_steering

Mechanistic Interpretability & Activation Steering

Decoding the internal logic of Large Language Models through Mechanistic Interpretability.

We focus on the Indirect Object Identification (IOI) task, using GPT-2 Small and Pythia-410M as our models.


🚀 Overview

How does a model decide that "John" is the answer in the sentence "When John and Mary went to the store, Mary gave a drink to..."?

We intervene using activation patching and steering to establish causal relationships.


The Models

| Model | Size | Key Architecture |
| --- | --- | --- |
| GPT-2 Small | 124M | Absolute positional embeddings, LayerNorm |
| Pythia-410M | 410M | Rotary positional embeddings (RoPE) |

🛠️ Method

1. Activation Patching

We identify heads responsible for IOI via surgical intervention: corrupting a prompt (swapping the names) and then restoring specific clean activations lets us isolate the causally important heads.
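The core patching operation can be sketched as a hook that overwrites one head's output in the corrupted run with its cached clean value. This is a minimal illustration on toy tensors; the function name and setup are ours, not code from this repo, and the arrays follow the TransformerLens shape convention `[batch, seq, n_heads, d_head]`.

```python
import numpy as np

def patch_head_z(corrupted_z, clean_z, head_idx):
    """Replace one head's output in a corrupted run with its clean value.

    corrupted_z, clean_z: arrays of shape [batch, seq, n_heads, d_head].
    Only the chosen head is restored; all other heads keep their
    corrupted activations, isolating that head's causal contribution.
    """
    patched = corrupted_z.copy()
    patched[:, :, head_idx, :] = clean_z[:, :, head_idx, :]
    return patched

# Toy demo with GPT-2 Small head dimensions (12 heads, d_head = 64):
rng = np.random.default_rng(0)
clean_z = rng.normal(size=(1, 8, 12, 64))
corrupted_z = rng.normal(size=(1, 8, 12, 64))
patched_z = patch_head_z(corrupted_z, clean_z, head_idx=2)
```

In a real run, this function body would sit inside a TransformerLens forward hook on the head-output activation, and the change in logit difference after patching gives the head's recovery score.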

2. Attention Mapping

Using TransformerLens Hooks, we extract raw attention patterns to "see" where the model is looking.

  • Query: The last token (e.g., "to").
  • Key: The target name (e.g., "John").
  • Insight: IOI heads consistently show high attention from the end of the sentence back to the correct indirect object.
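The attention pattern inspected above is just the causally masked softmax of query-key scores. A minimal numpy sketch (TransformerLens caches the equivalent tensor under the name `pattern`; the function here is illustrative):

```python
import numpy as np

def attention_pattern(q, k):
    """Causal attention weights: softmax(q @ k.T / sqrt(d_head)).

    q, k: arrays of shape [seq, d_head] for a single head.
    Returns [seq, seq]; row i shows where query position i attends,
    with future positions masked out. Row 15 ("to") attending strongly
    to the indirect-object name is the IOI signature described above.
    """
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[future] = -np.inf                     # causal mask
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
pattern = attention_pattern(rng.normal(size=(6, 16)), rng.normal(size=(6, 16)))
```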

3. Cross-Model Steering

How do we transfer logic between models with different architectures?

  • Linear Alignment: Finding a coordinate mapping $W$ between GPT-2 and Pythia residual streams.
  • Functional Patching: Swapping the output of functionally equivalent heads (e.g., GPT-2 IOI -> Pythia IOI).
  • Logit-Space Translation: Using the shared vocabulary space as a universal bridge for communication.
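The linear-alignment step can be sketched as an ordinary least-squares fit of $W$ on paired residual-stream activations. The dimensions and synthetic data below are illustrative (GPT-2 Small uses `d_model=768`, Pythia-410M `d_model=1024`; we use smaller toy sizes here):

```python
import numpy as np

def fit_alignment(src_resid, tgt_resid):
    """Fit W minimizing ||src_resid @ W - tgt_resid||^2 (least squares).

    src_resid: [n_samples, d_src] residual-stream vectors from the source
    model; tgt_resid: [n_samples, d_tgt] from the target model, collected
    on the same token positions of the same prompts.
    """
    W, *_ = np.linalg.lstsq(src_resid, tgt_resid, rcond=None)
    return W

# Synthetic sanity check: recover a known linear map from paired samples.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(64, 96))
src = rng.normal(size=(500, 64))
tgt = src @ W_true
W_hat = fit_alignment(src, tgt)
```

On real activations the map is only approximate, so the fit quality (e.g. residual variance) is itself a measure of how transferable the representation is.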

📁 Project Structure

├── src/            # Core logic
│   ├── patch.py    # Patching tools
│   ├── steer.py    # Steering mechanisms
│   ├── ioi.py      # Dataset generation
│   ├── viz.py      # Plotting utilities
│   └── lens.py     # Logit lens and prob analysis
├── experiments/    # Research scripts & notebooks
│   ├── gpt2.py     # GPT-2 Small circuit discovery
│   ├── pythia.py   # Pythia evaluation
│   └── steer.py    # Steering tests
├── assets/
├── results/
├── pipeline.py     # Main entry point
└── requirements.txt

📊 Visual Results

Logit Lens Evolution

The dashboard above showcases the Logit Lens analysis for a sample IOI prompt: "When James and Robert went to the park, James gave a ball to Robert".

  • Top-Left (Original: P(Correct)): Displays the probability of the correct indirect object Robert evolving across layers. The bright signal at position 15 in final layers (L10+) indicates a successful prediction.
  • Top-Right (Original: P(Incorrect)): Shows the probability of the subject name James. Note the high signal in early layers where the name actually appears (pos 2 & 10), which is then suppressed by the IOI circuit.
  • Bottom-Left & Bottom-Right: These represent the Corrupted Prompt logic (where names are swapped). They allow us to verify that the circuit is truly identifying the relationship between names rather than just memorizing position.

Activation Steering

Activation Steering Results

Logit Difference vs Scale: Shows how injecting a steering vector into the residual stream linearly shifts the model's preference between names.

Probability Sweep: As we increase the steering scale, the model's confidence in the correct answer increases, indicating that we have isolated a direction in latent space responsible for the IOI behavior.
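The mechanism behind both plots can be sketched in a few lines: steering adds a scaled vector to the residual stream, `h' = h + alpha * v`, and because the readout to logits is linear, the logit difference between the two names shifts linearly in the scale `alpha`. All names and dimensions below are illustrative, not taken from the repo.

```python
import numpy as np

def steer(resid, v, alpha):
    """Inject a scaled steering vector into a residual-stream vector."""
    return resid + alpha * v

rng = np.random.default_rng(1)
d_model = 64
h = rng.normal(size=d_model)              # toy residual stream at the last token
W_U = rng.normal(size=(d_model, 2))       # toy unembedding columns for the two names
v = W_U[:, 0] - W_U[:, 1]                 # direction favoring the correct name

def logit_diff(resid):
    """Logit of the correct name minus logit of the incorrect name."""
    logits = resid @ W_U
    return logits[0] - logits[1]

# Sweep the steering scale, as in the "Logit Difference vs Scale" plot.
diffs = [logit_diff(steer(h, v, alpha)) for alpha in (0.0, 0.5, 1.0)]
```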


  • Logit Lens: Projects intermediate residual stream states onto the vocabulary to track how the model's prediction evolves layer-by-layer.
  • Recovery Scores: Causal importance of specific heads identified via activation patching.
  • Attention Grid: Real-time visualization of what the model is paying attention to at each layer.
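The logit-lens projection described above can be sketched as: apply the final LayerNorm to an intermediate residual-stream vector, multiply by the unembedding matrix, and softmax over the vocabulary. A minimal sketch with illustrative names and toy dimensions (`gamma`/`beta` stand in for the final LayerNorm's scale and bias):

```python
import numpy as np

def logit_lens(resid, W_U, gamma, beta, eps=1e-5):
    """Project an intermediate residual-stream vector onto the vocabulary.

    resid: [d_model] residual stream at one layer and position.
    W_U:   [d_model, d_vocab] unembedding matrix.
    Returns a probability distribution over the vocabulary, showing what
    the model would predict if decoding stopped at this layer.
    """
    x = (resid - resid.mean()) / (resid.std() + eps)   # LayerNorm
    x = x * gamma + beta
    logits = x @ W_U
    e = np.exp(logits - logits.max())                  # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, d_vocab = 768, 50
probs = logit_lens(rng.normal(size=d_model), rng.normal(size=(d_model, d_vocab)),
                   np.ones(d_model), np.zeros(d_model))
```

Running this at every layer for the correct-name token yields the per-layer probability curves shown in the dashboard.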

📚 References & Credits
