Yuchen Zhang, Anietta Weckauff, Diego Garcia-Olano, Maksym Andriushchenko
ELLIS Institute Tübingen · Max Planck Institute for Intelligent Systems · Tübingen AI Center
data.zip - contains the full dataset. The archive is password-protected (password: em).
None of the training data contains system prompts. We use the default system prompts from each model's tokenizer.
We did not use all of the training data in this folder.
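A minimal sketch for unpacking the archive from Python, using the password given above. The function name `extract_dataset` and the output folder `data` are illustrative and not part of this repo. Note that Python's standard `zipfile` module can only decrypt the legacy ZIP encryption scheme; if the archive was created with AES encryption, a tool such as 7-Zip is needed instead.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path="data.zip", out_dir="data", password="em"):
    """Extract the archive into out_dir and return the list of member names.

    out_dir is a hypothetical destination; choose any folder you like.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        # zipfile expects the password as bytes
        zf.extractall(path=out, pwd=password.encode("utf-8"))
        return zf.namelist()


if __name__ == "__main__":
    for name in extract_dataset():
        print(name)
```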
Code and results are in emergent-misalignment. This roughly follows the original EM repo structure, with some small fixes to the eval. Some large files (eval results) are excluded but can be shared upon request.
Code and results are in activation_analysis. Activations are not included because the files are too large.
- get_activations: contains code to obtain activations
- analysis: contains code that uses the model's eval activations prior to narrow finetuning to predict its harmlessness level after narrow finetuning.
- pca: contains code to fit PCA and save the directions, then project activations onto these directions and save the projections.
- prompt_direction_change: contains code that compares the deltas of train and eval prompts before and after narrow finetuning (data element/part 3 of the paper).
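
For readers without access to the activations, the fit-and-project step described for the pca folder can be illustrated roughly as follows. This is a sketch under assumptions, not the code in this repo: the function names `fit_pca` and `project`, the array shapes, and the use of a plain NumPy SVD are all illustrative choices.

```python
import numpy as np


def fit_pca(acts, n_components):
    """Fit PCA on activations of shape (n_samples, hidden_dim).

    Returns the mean, the top principal directions (one per row),
    and the variance explained by each direction.
    """
    mean = acts.mean(axis=0)
    centered = acts - mean
    # Rows of vt are the principal directions, sorted by singular value
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    directions = vt[:n_components]
    explained_var = (s[:n_components] ** 2) / (acts.shape[0] - 1)
    return mean, directions, explained_var


def project(acts, mean, directions):
    """Project activations onto previously fitted directions."""
    return (acts - mean) @ directions.T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_acts = rng.normal(size=(100, 16))  # stand-in for real activations
    mean, directions, var = fit_pca(fake_acts, n_components=4)
    projections = project(fake_acts, mean, directions)
    print(projections.shape)
```

Saving `mean` and `directions` (for example with `np.savez`) lets the same directions be reused to project activations from other checkpoints, so projections before and after finetuning stay comparable.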