Finding a self-preference vector in Llama-3.1-8B-Instruct residual stream activations

s1monFu/self-preference-activation-steering

Manipulating Self-Preference for Large Language Models

This is the official repository for "Manipulating Self-Preference for Large Language Models", the first-place submission to the Apart Research x Martian Mechanistic Router Interpretability Hackathon, by Matthew Nguyen, Dani Roytburg, Matthew Bozoukov, Jou Barzdukas, and Hongyu Fu.

Check out our writeup here.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.
