What Shapes Emergent Misalignment? Insights from Training Dynamics, Model Priors, and Data

Yuchen Zhang, Anietta Weckauff, Diego Garcia-Olano, Maksym Andriushchenko
ELLIS Institute Tübingen · Max Planck Institute for Intelligent Systems · Tübingen AI Center

SFT Datasets

data.zip - contains the full dataset, password-protected with the password em. None of the training data contains system prompts; we use the default system prompts from each model's tokenizer. Not all of the training data in this folder was used.
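A minimal sketch of unpacking the archive from Python, assuming data.zip sits in the working directory and uses the legacy ZIP encryption that the standard-library zipfile module can decrypt. The function name extract_dataset and the output folder data are illustrative, not part of the repo. If the archive is AES-encrypted, zipfile cannot read it and an external tool such as 7-Zip would be needed.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive="data.zip", out_dir="data", password="em"):
    """Extract a password-protected zip and return its member names.

    The helper name and default output directory are hypothetical; only the
    archive name and password come from the README.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        # zipfile expects the password as bytes and supports only the
        # legacy ZipCrypto scheme, not AES-encrypted archives.
        zf.extractall(path=out, pwd=password.encode("utf-8"))
        return zf.namelist()
```

Calling extract_dataset() with the defaults would unpack data.zip into data/ and return the list of extracted paths.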

Train and evals

Code and results are in emergent-misalignment. This roughly follows the structure of the original EM repo, with some small fixes to the evaluation. Some large files (eval results) are excluded but can be shared upon request.

Activation analysis

Code and results are in activation_analysis. Activations are not included because the files are too large.

get_activations: contains code to obtain activations.
analysis: contains code that uses model-prior eval activations to predict the harmlessness level after narrow finetuning.
pca: contains code to fit PCA and save the directions, then project activations onto these directions and save the projections.
prompt_direction_change: contains code that compares the deltas of train and eval prompts before and after narrow finetuning (the data element, part 3 of the paper).
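The pca and prompt_direction_change steps listed above can be sketched roughly as follows. This is an illustration under assumptions, not the repo's implementation: activations are taken to be NumPy arrays of shape (n_prompts, hidden_dim), PCA is done with a plain SVD, and the function names fit_pca_directions, project, and projected_delta are hypothetical.

```python
import numpy as np


def fit_pca_directions(acts, k):
    """Fit PCA on activations of shape (n, d); return (mean, directions).

    directions has shape (k, d) with orthonormal rows, ordered by
    explained variance.
    """
    mean = acts.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(acts - mean, full_matrices=False)
    return mean, vt[:k]


def project(acts, mean, directions):
    """Project activations onto the saved directions; result is (n, k)."""
    return (acts - mean) @ directions.T


def projected_delta(before, after, mean, directions):
    """Per-prompt change along each direction, before vs. after finetuning."""
    return project(after, mean, directions) - project(before, mean, directions)
```

The returned mean and directions could be stored with np.savez and reloaded later, so that train and eval prompts are projected onto the same directions before their deltas are compared.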

