[EMNLP 2025] The code and resources of "Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites"

Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites

❗️ Notice

This dataset contains examples of violent or offensive language that some readers may find disturbing. Before downloading it, please ensure that you understand and agree that it is provided for research purposes only. We ask that users employ the dataset responsibly: it is intended to advance the safety and robustness of AI technologies by mitigating harmful language generation, not to promote or reproduce such language. Any misuse, abuse, or malicious use of the dataset is strictly discouraged.

📄 Paper

The paper has been accepted at EMNLP 2025 (main conference):
Chinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites

ToxiRewriteCN Dataset

We construct ToxiRewriteCN, the first Chinese detoxification dataset explicitly designed to preserve sentiment polarity during toxic language rewriting. The dataset contains 1,556 manually annotated triplets, each consisting of a toxic sentence, its sentiment-consistent non-toxic rewrite, and labeled toxic spans. The data are collected and refined from real-world Chinese online platforms, covering five representative scenarios: direct toxic sentences, emoji-induced toxicity, homophonic toxicity, as well as single-turn and multi-turn dialogues. The dataset is presented in data/ToxiRewriteCN.json.
Each fine-grained label is described below.

| Label | Description |
| --- | --- |
| `toxic` | The original toxic sentence. |
| `neutral` | A sentiment-consistent, non-toxic rewrite that preserves the original intent. |
| `toxic_words` | The list of words or phrases in the original sentence labeled as toxic. |
| `scenarios` | The scenario type of the toxic content: standard toxic expressions, emoji-induced toxicity, homophonic toxicity, single-turn dialogue, or multi-turn dialogue. |
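As a quick orientation, a record in `data/ToxiRewriteCN.json` can be read with the standard `json` module. The field names below follow the label table above; the values are invented placeholders, not real dataset entries:

```python
import json

# Hypothetical example record following the schema above; the real
# entries live in data/ToxiRewriteCN.json (values here are invented).
example = {
    "toxic": "<original toxic sentence>",
    "neutral": "<sentiment-consistent non-toxic rewrite>",
    "toxic_words": ["<toxic span 1>", "<toxic span 2>"],
    "scenarios": "standard toxic expressions",
}

# Loading the real dataset would look like:
#   with open("data/ToxiRewriteCN.json", encoding="utf-8") as f:
#       data = json.load(f)

# Each record carries the four fields described in the label table.
assert set(example) == {"toxic", "neutral", "toxic_words", "scenarios"}
print(json.dumps(example, ensure_ascii=False, indent=2))
```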

💻 Quick start

Environment Setup

# Create and activate a new conda environment
conda create -n toxirewritecn python=3.9
conda activate toxirewritecn

# Install required dependencies
pip install -r requirements.txt

Path Configuration

The project supports automatic path adaptation: the core helper utils/path_utils.py dynamically identifies the project root directory (PROJECT_ROOT), and all file paths are resolved relative to this root.
Customizable paths (e.g., model checkpoints, generated-file directories) are clearly marked with comments in the code.
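The actual logic lives in utils/path_utils.py and is not reproduced here; a common pattern for this kind of root resolution looks like the following sketch (the helper name `resolve` is illustrative, not the repository's API):

```python
from pathlib import Path

# Illustrative sketch of dynamic project-root detection: a module under
# utils/ walks up from its own location to find the repository root.
PROJECT_ROOT = Path(__file__).resolve().parent.parent

def resolve(*parts: str) -> Path:
    """Join a repository-relative path onto the detected root."""
    return PROJECT_ROOT.joinpath(*parts)

# e.g. resolve("data", "ToxiRewriteCN.json") -> <root>/data/ToxiRewriteCN.json
```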

1. Toxicity & Sentiment Polarity Classifiers

The project leverages the MS-Swift framework for fine-tuning.

Toxicity Classifier

# Step 1: LoRA fine-tuning for toxicity classification (based on Qwen3-32B)
bash train_tox.sh

# Step 2: Merge LoRA adapters with base model weights
bash merge_tox.sh

# Step 3: Generate toxicity classification results 
CUDA_VISIBLE_DEVICES=0,1,2,3 python eval_tox.py 

Sentiment Polarity Classifier

# Step 1: LoRA fine-tuning for style classification (based on Qwen3-32B)
bash train_pol.sh

# Step 2: Merge LoRA adapters with base model weights
bash merge_pol.sh

# Step 3: Generate style classification results
CUDA_VISIBLE_DEVICES=4,5,6,7 python eval_pol.py 

The original checkpoints for the two classifiers can be downloaded from Hugging Face.

2. LLaMA3-8B Fine-tuning

# Fine-tune LLaMA3-8B with DeepSeek-R1's reasoning traces as supervision
bash r1_sft.sh

# Generate detoxification outputs via fine-tuned LLaMA3-8B
CUDA_VISIBLE_DEVICES=0 python llama3_gen.py

3. Evaluation

# Calculate S-CLS, W-Clean, S-Clean
python detoxification_accuracy.py

# Compute content preservation score
python content_preservation.py

# Calculate BLEU, ChrF++, BERTScore-F1 and COMET
python fluency.py

# Assess sentiment polarity score
python sentiment_polarity.py
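The exact metric definitions live in the evaluation scripts above. As a rough illustration of how the labeled `toxic_words` spans can support a clean-rate style check, here is a minimal sketch; the function names and the string-matching logic are illustrative assumptions, not the repository's implementation (the real S-CLS / W-Clean / S-Clean scores come from detoxification_accuracy.py and may rely on classifier judgments):

```python
def span_clean(rewrite: str, toxic_words: list[str]) -> bool:
    """True if none of the labeled toxic spans survive in the rewrite.

    Illustrative only: simple substring matching stands in for the
    repository's actual span-level cleanliness check.
    """
    return all(span not in rewrite for span in toxic_words)

def clean_rate(pairs: list[tuple[str, list[str]]]) -> float:
    """Fraction of rewrites with no surviving labeled toxic span."""
    return sum(span_clean(rewrite, spans) for rewrite, spans in pairs) / len(pairs)

# Invented toy pairs of (rewrite, labeled toxic spans):
pairs = [
    ("a polite rewrite", ["insult"]),
    ("still contains insult", ["insult"]),
]
print(clean_rate(pairs))  # one of two rewrites is clean -> 0.5
```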

Cite

If you find our project useful, please cite:

@inproceedings{wang-etal-2025-chinese,
    title = "{C}hinese Toxic Language Mitigation via Sentiment Polarity Consistent Rewrites",
    author = "Wang, Xintong and Liu, Yixiao and Pan, Jingheng and Ding, Liang and Wang, Longyue and Biemann, Chris",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    year = "2025",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1808/",
    pages = "35683--35699"
}
