kushalgarg101/Attention_head_steering

Mechanistic Interpretability & Activation Steering

Decoding the internal logic of Large Language Models through Mechanistic Interpretability.

We focus on the Indirect Object Identification (IOI) task, using GPT-2 Small and Pythia-410M as our models.


🚀 Overview

How does a model decide that "John" is the answer in the sentence "When John and Mary went to the store, Mary gave a drink to..."?

We intervene using activation patching and steering to establish causal relationships.


The Models

| Model | Size | Key Architecture |
| --- | --- | --- |
| GPT-2 Small | 124M | Absolute positional embeddings, LayerNorm |
| Pythia-410M | 410M | Rotary positional embeddings (RoPE) |

🛠️ Method

1. Activation Patching

We identify heads responsible for IOI via surgical intervention: corrupting a prompt (swapping the names) and then restoring specific clean activations lets us isolate the causally important heads.
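The core patching operation can be sketched as a hook that overwrites one head's output in the corrupted run with its cached clean value. This is a minimal illustration on toy tensors; the function name and setup are ours, not code from this repo, and the arrays follow the TransformerLens shape convention `[batch, seq, n_heads, d_head]`.

```python
import numpy as np

def patch_head_z(corrupted_z, clean_z, head_idx):
    """Replace one head's output in a corrupted run with its clean value.

    corrupted_z, clean_z: arrays of shape [batch, seq, n_heads, d_head].
    Only the chosen head is restored; all other heads keep their
    corrupted activations, isolating that head's causal contribution.
    """
    patched = corrupted_z.copy()
    patched[:, :, head_idx, :] = clean_z[:, :, head_idx, :]
    return patched

# Toy demo with GPT-2 Small head dimensions (12 heads, d_head = 64):
rng = np.random.default_rng(0)
clean_z = rng.normal(size=(1, 8, 12, 64))
corrupted_z = rng.normal(size=(1, 8, 12, 64))
patched_z = patch_head_z(corrupted_z, clean_z, head_idx=2)
```

In a real run, this function body would sit inside a TransformerLens forward hook on the head-output activation, and the change in logit difference after patching gives the head's recovery score.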

2. Attention Mapping

Using TransformerLens Hooks, we extract raw attention patterns to "see" where the model is looking.

  • Query: The last token (e.g., "to").
  • Key: The target name (e.g., "John").
  • Insight: IOI heads consistently show high attention from the end of the sentence back to the correct indirect object.
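The attention pattern inspected above is just the causally masked softmax of query-key scores. A minimal numpy sketch (TransformerLens caches the equivalent tensor under the name `pattern`; the function here is illustrative):

```python
import numpy as np

def attention_pattern(q, k):
    """Causal attention weights: softmax(q @ k.T / sqrt(d_head)).

    q, k: arrays of shape [seq, d_head] for a single head.
    Returns [seq, seq]; row i shows where query position i attends,
    with future positions masked out. Row 15 ("to") attending strongly
    to the indirect-object name is the IOI signature described above.
    """
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[future] = -np.inf                     # causal mask
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
pattern = attention_pattern(rng.normal(size=(6, 16)), rng.normal(size=(6, 16)))
```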

3. Cross-Model Steering

How do we transfer logic between models with different architectures?

  • Linear Alignment: Finding a coordinate mapping $W$ between GPT-2 and Pythia residual streams.
  • Functional Patching: Swapping the output of functionally equivalent heads (e.g., GPT-2 IOI -> Pythia IOI).
  • Logit-Space Translation: Using the shared vocabulary space as a universal bridge for communication.
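The linear-alignment step can be sketched as an ordinary least-squares fit of $W$ on paired residual-stream activations. The dimensions and synthetic data below are illustrative (GPT-2 Small uses `d_model=768`, Pythia-410M `d_model=1024`; we use smaller toy sizes here):

```python
import numpy as np

def fit_alignment(src_resid, tgt_resid):
    """Fit W minimizing ||src_resid @ W - tgt_resid||^2 (least squares).

    src_resid: [n_samples, d_src] residual-stream vectors from the source
    model; tgt_resid: [n_samples, d_tgt] from the target model, collected
    on the same token positions of the same prompts.
    """
    W, *_ = np.linalg.lstsq(src_resid, tgt_resid, rcond=None)
    return W

# Synthetic sanity check: recover a known linear map from paired samples.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(64, 96))
src = rng.normal(size=(500, 64))
tgt = src @ W_true
W_hat = fit_alignment(src, tgt)
```

On real activations the map is only approximate, so the fit quality (e.g. residual variance) is itself a measure of how transferable the representation is.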

📁 Project Structure

├── src/            # Core logic
│   ├── patch.py    # Patching tools
│   ├── steer.py    # Steering mechanisms
│   ├── ioi.py      # Dataset generation
│   ├── viz.py      # Plotting utilities
│   └── lens.py     # Logit lens and prob analysis
├── experiments/    # Research scripts & notebooks
│   ├── gpt2.py     # GPT-2 Small circuit discovery
│   ├── pythia.py   # Pythia evaluation
│   └── steer.py    # Steering tests
├── assets/
├── results/
├── pipeline.py     # Main entry point
└── requirements.txt

📊 Visual Results

Logit Lens Evolution

The dashboard above showcases the Logit Lens analysis for a sample IOI prompt: "When James and Robert went to the park, James gave a ball to Robert".

  • Top-Left (Original: P(Correct)): Displays the probability of the correct indirect object Robert evolving across layers. The bright signal at position 15 in final layers (L10+) indicates a successful prediction.
  • Top-Right (Original: P(Incorrect)): Shows the probability of the subject name James. Note the high signal in early layers where the name actually appears (pos 2 & 10), which is then suppressed by the IOI circuit.
  • Bottom-Left & Bottom-Right: These represent the Corrupted Prompt logic (where names are swapped). They allow us to verify that the circuit is truly identifying the relationship between names rather than just memorizing position.

Activation Steering

Activation Steering Results

Logit Difference vs Scale: Shows how injecting a steering vector into the residual stream linearly shifts the model's preference between names.

Probability Sweep: As we increase the steering scale, the model's confidence in the correct answer increases, indicating that we have isolated a direction in latent space responsible for the IOI behavior.
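The mechanism behind both plots can be sketched in a few lines: steering adds a scaled vector to the residual stream, `h' = h + alpha * v`, and because the readout to logits is linear, the logit difference between the two names shifts linearly in the scale `alpha`. All names and dimensions below are illustrative, not taken from the repo.

```python
import numpy as np

def steer(resid, v, alpha):
    """Inject a scaled steering vector into a residual-stream vector."""
    return resid + alpha * v

rng = np.random.default_rng(1)
d_model = 64
h = rng.normal(size=d_model)              # toy residual stream at the last token
W_U = rng.normal(size=(d_model, 2))       # toy unembedding columns for the two names
v = W_U[:, 0] - W_U[:, 1]                 # direction favoring the correct name

def logit_diff(resid):
    """Logit of the correct name minus logit of the incorrect name."""
    logits = resid @ W_U
    return logits[0] - logits[1]

# Sweep the steering scale, as in the "Logit Difference vs Scale" plot.
diffs = [logit_diff(steer(h, v, alpha)) for alpha in (0.0, 0.5, 1.0)]
```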


  • Logit Lens: Projects intermediate residual stream states onto the vocabulary to track how the model's prediction evolves layer-by-layer.
  • Recovery Scores: Causal importance of specific heads identified via activation patching.
  • Attention Grid: Real-time visualization of what the model is paying attention to at each layer.
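The logit-lens projection described above can be sketched as: apply the final LayerNorm to an intermediate residual-stream vector, multiply by the unembedding matrix, and softmax over the vocabulary. A minimal sketch with illustrative names and toy dimensions (`gamma`/`beta` stand in for the final LayerNorm's scale and bias):

```python
import numpy as np

def logit_lens(resid, W_U, gamma, beta, eps=1e-5):
    """Project an intermediate residual-stream vector onto the vocabulary.

    resid: [d_model] residual stream at one layer and position.
    W_U:   [d_model, d_vocab] unembedding matrix.
    Returns a probability distribution over the vocabulary, showing what
    the model would predict if decoding stopped at this layer.
    """
    x = (resid - resid.mean()) / (resid.std() + eps)   # LayerNorm
    x = x * gamma + beta
    logits = x @ W_U
    e = np.exp(logits - logits.max())                  # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, d_vocab = 768, 50
probs = logit_lens(rng.normal(size=d_model), rng.normal(size=(d_model, d_vocab)),
                   np.ones(d_model), np.zeros(d_model))
```

Running this at every layer for the correct-name token yields the per-layer probability curves shown in the dashboard.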

📚 References & Credits
