This is the official repository for "Manipulating Self-Preference for Large Language Models" by Matthew Nguyen, Dani Roytburg, Matthew Bozoukov, Jou Barzdukas, and Hongyu Fu, the first-place submission to the Apart Research x Martian Mechanistic Router Interpretability Hackathon.
Check out our writeup here.
This project is licensed under the MIT License; see the LICENSE.md file for details.