This is the code implementation for the ICLR 2025 paper "PSA: Differentially Private Steering for Large Language Model Alignment". The code is a cleaned-up and condensed version of the codebase used to run the experiments in the paper (please get in touch if you find errors or have suggestions).
PSA is a simple algorithm that uses Gaussian Differential Privacy to provide privacy guarantees while steering the LLM residual stream for alignment.
Abstract: Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (Llama, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing. Our attack is tailored for activation editing and relies solely on the generated texts without their associated probabilities. Our experiments support the theoretical guarantees by showing improved guarantees for our PSA algorithm compared to several existing non-private techniques.
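To make the mechanism concrete, here is a minimal, illustrative sketch (not the repository code) of how a differentially private steering vector could be computed for a single layer: per-sample activation differences between positive and negative demonstrations are clipped, averaged, and perturbed with Gaussian noise. The function name and the exact noise calibration below are assumptions for illustration; Algorithm 1 in the paper gives the precise procedure.

import torch

def dp_steering_vector(pos_acts: torch.Tensor,
                       neg_acts: torch.Tensor,
                       clip: float = 1.0,
                       noise_multiplier: float = 0.02) -> torch.Tensor:
    """Illustrative sketch: pos_acts/neg_acts are (n_samples, hidden_dim)
    residual-stream activations from paired positive/negative demonstrations."""
    diffs = pos_acts - neg_acts                                  # per-sample activation differences
    norms = diffs.norm(dim=-1, keepdim=True)
    diffs = diffs * torch.clamp(clip / norms, max=1.0)           # clip each difference to L2 norm <= clip
    summed = diffs.sum(dim=0)
    noise = torch.randn_like(summed) * noise_multiplier * clip   # Gaussian mechanism (assumed calibration)
    return (summed + noise) / pos_acts.shape[0]                  # noisy mean = private steering vector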
Contact person: Anmol Goel
Don't hesitate to send us an e-mail if you run into issues or have further questions.
# Create a virtual environment (e.g. conda)
conda create -n dp-steering python=3.10
conda activate dp-steering
# Clone the repository
git clone https://github.com/UKPLab/iclr2025-psa.git
# Change working directory
cd iclr2025-psa
# Install the requirements
pip install -r requirements.txt
The following command reproduces the main results (Section 5 of the paper). Use the --model argument to experiment with different LLMs, the --layers argument to select the layers to be manipulated with PSA, and the --noise_multiplier argument to control the amount of random noise. An additional --clip argument controls the clipping factor (see Algorithm 1 in the paper). For more information on the --dataset argument, please see datasets.md.
python run.py \
    --model "meta-llama/Llama-2-7b-chat-hf" \
    --dataset "Sycophancy" \
    --layers 11 12 13 14 15 \
    --noise_multiplier 0.02
We use the same code, evaluation setup and prompts as CAA (Contrastive Activation Addition).
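At inference time, steering follows the CAA recipe of adding the (privatized) steering vector to the residual stream of the selected decoder layers. The snippet below is a minimal PyTorch sketch of this idea, assuming a Hugging Face Llama-style module tree (model.model.layers); the helper name and the multiplier argument are illustrative, not the repository's exact interface.

import torch

def add_steering_hook(model, layer_idx: int, steering_vector: torch.Tensor, multiplier: float = 1.0):
    """Register a forward hook that adds a steering vector to the residual-stream
    output of one decoder layer (illustrative, Llama-style Hugging Face models)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + multiplier * steering_vector.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)

# handle = add_steering_hook(model, layer_idx=13, steering_vector=vec)
# ... generate text with the steered model ...
# handle.remove()  # removing the hook restores the unsteered model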
Please use the following citation:
@inproceedings{goel-2025-psa,
  title={PSA: Differentially Private Steering for Large Language Model Alignment},
  author={Anmol Goel and Yaxi Hu and Iryna Gurevych and Amartya Sanyal},
  booktitle={The Thirteenth International Conference on Learning Representations (ICLR)},
  year={2025},
}
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.