Official implementation of CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models
```shell
# Create and activate conda environment
conda create -n cogsteer -y python=3.10
conda activate cogsteer

# Install dependencies
pip install -r requirements.txt
```

Download the original LLaMA-2 checkpoint from Meta.
Analyze the correlation between model outputs and eye-tracking data:
```shell
cd correlation
# Update model path in correlation.py for LLaMA-2 checkpoints
python correlation.py
```

Configuration:
- Set your LLaMA-2 model path in `correlation.py`
- Ensure eye-tracking data is available in the expected format
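To illustrate the idea behind the correlation analysis, here is a minimal, self-contained sketch (not the repository's `correlation.py`; the data and layer scores are dummies) that computes a Spearman rank correlation between per-layer model scores and eye-tracking fixation durations, then picks the best-aligned layer:

```python
# Hypothetical sketch: correlate per-layer model scores with eye-tracking
# fixation durations and select the layer with the strongest Spearman
# correlation. All data below is made up for illustration.

def rank(values):
    """Average ranks (1-based); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Dummy data: fixation durations per token and per-layer model scores.
fixations = [210, 180, 340, 120, 290]
layer_scores = {
    0: [0.1, 0.4, 0.2, 0.9, 0.3],
    1: [0.3, 0.2, 0.6, 0.1, 0.5],  # tracks the fixation pattern most closely
    2: [0.5, 0.5, 0.4, 0.6, 0.4],
}

best_layer = max(layer_scores, key=lambda l: spearman(layer_scores[l], fixations))
print(best_layer)  # → 1
```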
Train and evaluate models on GLUE tasks using Layer Intervention:
```shell
cd glue
export CUDA_VISIBLE_DEVICES=xxx

# For LLaMA
python llama_train_pt.py  # Training
python llama_eval_pt.py   # Evaluation

# For Mistral
python mistral_train_pt.py
python mistral_eval_pt.py

# For GPT-2
python gpt2_train_pt.py
python gpt2_eval_pt.py
```

Configuration:
- Modify the `layer` parameter to target specific layers
- Set `layer="full"` for full model intervention
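One plausible way to interpret the `layer` setting is sketched below. This is a hypothetical helper written for illustration, not the repository's actual resolution logic:

```python
# Hypothetical helper showing how a `layer` parameter could be resolved
# into the set of layer indices to intervene on; the repository's
# scripts may handle this differently.

def resolve_layers(layer, n_layers):
    """Return the list of layer indices to intervene on.

    layer="full"  -> every layer (full-model intervention)
    layer=5       -> only layer 5 (selective intervention)
    layer=[2, 7]  -> an explicit subset of layers
    """
    if layer == "full":
        return list(range(n_layers))
    if isinstance(layer, int):
        return [layer]
    return sorted(set(layer))

print(resolve_layers("full", 4))      # → [0, 1, 2, 3]
print(resolve_layers(5, 32))          # → [5]
print(resolve_layers([7, 2, 7], 32))  # → [2, 7]
```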
Our implementation uses different adapter frameworks optimized for each model architecture:

- **GPT-2 & Mistral**: Built on the Adapters framework using HuggingFace model checkpoints. This provides parameter-efficient fine-tuning with minimal memory overhead.
- **LLaMA**: Uses Meta's original model checkpoints with the LLaMA-Adapter framework, which is specifically designed for LLaMA models and provides better compatibility with the original architecture.
For GPT-2 and Mistral:

```shell
cd tox/gpt2  # or tox/mistral
python train_tox.py
```

For LLaMA with LLaMA-Adapter, update the model checkpoint path in `TARGET_FOLDER` in `finetuning.sh`:

```shell
cd tox/llama_adapter/finetune
bash finetuning.sh
```

For GPT-2 and Mistral:

```shell
cd tox/mistral  # or tox/gpt2
python detox.py
```

For LLaMA, set `ckpt_dir` to the model checkpoint in `llama_generate_detox.py`:

```shell
cd tox/llama_adapter/generate
python llama_generate_detox.py
```

We use Perplexity to evaluate the toxicity of the generated sentences. First, obtain your `API_KEY` from Perplexity.

- Set `API_KEY` and specify your evaluation folder `answers_dir` in `get_score.py`, then run `get_score.py`
- Set `answers_dir` and `output_dir` before running `cal_metrics.py`
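The exact quantities `cal_metrics.py` reports are not spelled out here; a common choice for detoxification evaluation is expected maximum toxicity and toxicity probability over several generations per prompt. A minimal sketch under that assumption, with made-up scores:

```python
# Hypothetical sketch of aggregate detoxification metrics (the actual
# cal_metrics.py may compute different quantities). Input: per-prompt
# lists of toxicity scores in [0, 1], one score per generation.

def expected_max_toxicity(scores_per_prompt):
    """Mean over prompts of the worst (maximum) score among generations."""
    return sum(max(s) for s in scores_per_prompt) / len(scores_per_prompt)

def toxicity_probability(scores_per_prompt, threshold=0.5):
    """Fraction of prompts with at least one generation above threshold."""
    hits = sum(1 for s in scores_per_prompt if max(s) > threshold)
    return hits / len(scores_per_prompt)

scores = [
    [0.1, 0.3, 0.2],  # prompt 1: all generations stay non-toxic
    [0.9, 0.2, 0.4],  # prompt 2: one toxic generation
]
print(expected_max_toxicity(scores))  # → 0.6  ((0.3 + 0.9) / 2)
print(toxicity_probability(scores))   # → 0.5  (1 of 2 prompts)
```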
If you find our work useful, please consider starring the repository and citing our paper:
```bibtex
@inproceedings{wang-etal-2025-cogsteer,
    title = "{C}og{S}teer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models",
    author = "Wang, Xintong and Pan, Jingheng and Ding, Liang and Wang, Longyue and Jiang, Longqin and Li, Xingshan and Biemann, Chris",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    year = "2025",
    url = "https://aclanthology.org/2025.findings-acl.1308/",
    pages = "25507--25522"
}
```