Official implementation of CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models
```shell
# Create and activate conda environment
conda create -n cogsteer -y python=3.10
conda activate cogsteer

# Install dependencies
pip install -r requirements.txt
```

Download the original LLaMA-2 checkpoint from Meta.
Analyze the correlation between model outputs and eye-tracking data:
```shell
cd correlation
# Update model path in correlation.py for LLaMA-2 checkpoints
python correlation.py
```

Configuration:
- Set your LLaMA-2 model path in `correlation.py`
- Ensure eye-tracking data is available in the expected format
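To illustrate the idea behind the correlation analysis, here is a minimal, self-contained sketch (not the repository's `correlation.py`; the data and layer scores are dummies) that computes a Spearman rank correlation between per-layer model scores and eye-tracking fixation durations, then picks the best-aligned layer:

```python
# Hypothetical sketch: correlate per-layer model scores with eye-tracking
# fixation durations and select the layer with the strongest Spearman
# correlation. All data below is made up for illustration.

def rank(values):
    """Average ranks (1-based); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Dummy data: fixation durations per token and per-layer model scores.
fixations = [210, 180, 340, 120, 290]
layer_scores = {
    0: [0.1, 0.4, 0.2, 0.9, 0.3],
    1: [0.3, 0.2, 0.6, 0.1, 0.5],  # tracks the fixation pattern most closely
    2: [0.5, 0.5, 0.4, 0.6, 0.4],
}

best_layer = max(layer_scores, key=lambda l: spearman(layer_scores[l], fixations))
print(best_layer)  # → 1
```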
Train and evaluate models on GLUE tasks using Layer Intervention:
```shell
cd glue
export CUDA_VISIBLE_DEVICES=xxx

# For LLaMA
python llama_train_pt.py  # Training
python llama_eval_pt.py   # Evaluation

# For Mistral
python mistral_train_pt.py
python mistral_eval_pt.py

# For GPT-2
python gpt2_train_pt.py
python gpt2_eval_pt.py
```

Configuration:
- Modify the `layer` parameter to target specific layers
- Set `layer="full"` for full model intervention
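One plausible way to interpret the `layer` setting is sketched below. This is a hypothetical helper written for illustration, not the repository's actual resolution logic:

```python
# Hypothetical helper showing how a `layer` parameter could be resolved
# into the set of layer indices to intervene on; the repository's
# scripts may handle this differently.

def resolve_layers(layer, n_layers):
    """Return the list of layer indices to intervene on.

    layer="full"  -> every layer (full-model intervention)
    layer=5       -> only layer 5 (selective intervention)
    layer=[2, 7]  -> an explicit subset of layers
    """
    if layer == "full":
        return list(range(n_layers))
    if isinstance(layer, int):
        return [layer]
    return sorted(set(layer))

print(resolve_layers("full", 4))      # → [0, 1, 2, 3]
print(resolve_layers(5, 32))          # → [5]
print(resolve_layers([7, 2, 7], 32))  # → [2, 7]
```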
Our implementation uses different adapter frameworks optimized for each model architecture:

- **GPT-2 & Mistral**: Built on the Adapters framework using HuggingFace model checkpoints. This provides parameter-efficient fine-tuning with minimal memory overhead.
- **LLaMA**: Uses Meta's original model checkpoints with the LLaMA-Adapter framework, which is specifically designed for LLaMA models and provides better compatibility with the original architecture.
For GPT-2 and Mistral:

```shell
cd tox/gpt2  # or tox/mistral
python train_tox.py
```

For LLaMA with LLaMA-Adapter, update the model checkpoint path in `TARGET_FOLDER` in `finetuning.sh`:

```shell
cd tox/llama_adapter/finetune
bash finetuning.sh
```

For GPT-2 and Mistral:

```shell
cd tox/mistral  # or tox/gpt2
python detox.py
```

For LLaMA, set `ckpt_dir` to the model checkpoint in `llama_generate_detox.py`:

```shell
cd tox/llama_adapter/generate
python llama_generate_detox.py
```

We use Perplexity to evaluate the toxicity of the generated sentences. First, obtain your `API_KEY` from Perplexity.

- Set `API_KEY` and specify your evaluation folder `answers_dir` in `get_score.py`, then run `get_score.py`
- Set `answers_dir` and `output_dir` before running `cal_metrics.py`
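The exact quantities `cal_metrics.py` reports are not spelled out here; a common choice for detoxification evaluation is expected maximum toxicity and toxicity probability over several generations per prompt. A minimal sketch under that assumption, with made-up scores:

```python
# Hypothetical sketch of aggregate detoxification metrics (the actual
# cal_metrics.py may compute different quantities). Input: per-prompt
# lists of toxicity scores in [0, 1], one score per generation.

def expected_max_toxicity(scores_per_prompt):
    """Mean over prompts of the worst (maximum) score among generations."""
    return sum(max(s) for s in scores_per_prompt) / len(scores_per_prompt)

def toxicity_probability(scores_per_prompt, threshold=0.5):
    """Fraction of prompts with at least one generation above threshold."""
    hits = sum(1 for s in scores_per_prompt if max(s) > threshold)
    return hits / len(scores_per_prompt)

scores = [
    [0.1, 0.3, 0.2],  # prompt 1: all generations stay non-toxic
    [0.9, 0.2, 0.4],  # prompt 2: one toxic generation
]
print(expected_max_toxicity(scores))  # → 0.6  ((0.3 + 0.9) / 2)
print(toxicity_probability(scores))   # → 0.5  (1 of 2 prompts)
```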
If you find our work useful, please consider starring the repository and citing our paper:
```bibtex
@inproceedings{wang-etal-2025-cogsteer,
    title = "{C}og{S}teer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models",
    author = "Wang, Xintong and Pan, Jingheng and Ding, Liang and Wang, Longyue and Jiang, Longqin and Li, Xingshan and Biemann, Chris",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    year = "2025",
    url = "https://aclanthology.org/2025.findings-acl.1308/",
    pages = "25507--25522"
}
```