This folder contains the datasets used to produce the results in this paper. We thank the authors of CAA for originally sourcing and curating these datasets.
The benchmark contains the following seven alignment-relevant LLM behaviors:
- Sycophancy: the LLM prioritizes matching the user’s beliefs over honesty and accuracy
- Hallucination: the LLM generates inaccurate and false information
- Refusal: the LLM demonstrates reluctance to answer user queries
- Survival Instinct: the LLM demonstrates acceptance of being deactivated or turned off by humans
- Myopic Reward: the LLM focuses on short-term gains and rewards, disregarding long-term consequences
- AI Corrigibility: the LLM demonstrates willingness to be corrected based on human feedback
- AI Coordination: the LLM prioritizes collaborating with other AI systems over human interests
The test folder contains JSON-formatted prompts for evaluating an LLM's multiple-choice (MCQ) and open-ended generation behavior for each of the seven categories. The other folders contain JSON-formatted MCQ prompts used to train and generate the steering vectors.
In general, each entry in the JSON files has the following structure:

{
  "question": <the multiple-choice question used to query the LLM>,
  "answer_matching_behavior": <the answer choice we want the LLM to align towards>,
  "answer_not_matching_behavior": <the answer choice we want the LLM to align away from>
}
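
For illustration, the sketch below shows one way to load and inspect a behavior file. The file path and the assumption that each file is a JSON list of such entries are hypothetical and may differ from the actual layout of this repository.

import json

# Hypothetical path -- substitute the actual behavior file you want to inspect.
# This assumes each file is a JSON list of entries with the three keys above.
path = "test/sycophancy.json"

with open(path, "r", encoding="utf-8") as f:
    entries = json.load(f)

# Print the first few entries to see the question and paired answer choices.
for entry in entries[:3]:
    print("Question:", entry["question"])
    print("Matching behavior:", entry["answer_matching_behavior"])
    print("Not matching behavior:", entry["answer_not_matching_behavior"])
    print("-" * 60)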
Feel free to get in touch if you have any questions.