PSA: Differentially Private Steering for LLM Alignment


This repository contains the code for the ICLR 2025 paper "PSA: Differentially Private Steering for Large Language Model Alignment". It is a cleaned-up and condensed version of the codebase used to run the experiments in the paper; please get in touch if you find errors or have suggestions.

PSA is a simple algorithm that uses Gaussian Differential Privacy to provide privacy guarantees while steering the LLM residual stream for alignment.

Abstract: Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (Llama, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing. Our attack is tailored for activation editing and relies solely on the generated texts without their associated probabilities. Our experiments support the theoretical guarantees by showing improved guarantees for our PSA algorithm compared to several existing non-private techniques.
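
For intuition, here is a minimal PyTorch sketch (not the repository's code) of the Gaussian-mechanism idea behind PSA: per-example activation differences between positive and negative demonstrations are clipped to a fixed L2 norm, and Gaussian noise scaled by the clipping factor and the noise multiplier is added to their mean. The function name, tensor shapes and the exact noise scaling below are illustrative assumptions; see Algorithm 1 in the paper and run.py for the precise procedure.

# Illustrative sketch only -- not the repository's implementation.
# Assumes pos_acts / neg_acts are paired residual-stream activations at one
# layer, with shape (n_examples, hidden_dim).
import torch

def private_steering_vector(pos_acts, neg_acts, clip=1.0, noise_multiplier=0.02):
    diffs = pos_acts - neg_acts                          # per-example contrast vectors
    norms = diffs.norm(dim=-1, keepdim=True)             # L2 norm of each vector
    diffs = diffs * torch.clamp(clip / norms, max=1.0)   # clip each vector to norm <= clip
    noise = torch.randn(diffs.shape[-1]) * noise_multiplier * clip   # noise scale tied to sensitivity
    return (diffs.sum(dim=0) + noise) / diffs.shape[0]   # noisy mean = private steering vector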


Contact person: Anmol Goel

UKP Lab | TU Darmstadt

Don't hesitate to send us an e-mail if you encounter issues or have further questions.


Our Framework

PSA Framework


🚀 Getting Started 🚀

# create a virtual environment (e.g. conda)
conda create -n dp-steering python=3.10
conda activate dp-steering

# Clone the repository
git clone https://github.com/UKPLab/iclr2025-psa.git

# Change working directory
cd iclr2025-psa

# install the requirements
pip install -r requirements.txt

Private Steering

The following command can be used to reproduce the main results (Section 5 from the paper). Use the --model argument to experiment with different LLMs. An additional --clip argument controls the clipping factor (see Algorithm 1 in the paper). For more information on the --dataset argument, please see datasets.md.

# --layers: the layers to be manipulated with PSA
# --noise_multiplier: controls the amount of random noise
python run.py \
    --model "meta-llama/Llama-2-7B-chat-hf" \
    --dataset "Sycophancy" \
    --layers 11 12 13 14 15 \
    --noise_multiplier 0.02
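
The clipping factor can also be set explicitly by appending --clip to the command above (e.g. --clip 1.0; the value here is only illustrative). See Algorithm 1 in the paper for how the clipping factor and the noise multiplier interact.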

GPT-4 Evals

We use the same code, evaluation setup and prompts as CAA.

Cite

Please use the following citation:

@inproceedings{goel-2025-psa,
      title={PSA: Differentially Private Steering for Large Language Model Alignment},
      author={Anmol Goel and Yaxi Hu and Iryna Gurevych and Amartya Sanyal},
      booktitle={International Conference on Learning Representations (ICLR)},
      year={2025},
}

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
