
SoE: Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs

Honghao Liu1,2, Xuhui Jiang1,4, Chengjin Xu1,4, Cehao Yang1,2, Yiran Cheng3, Lionel Ni2,3, Jian Guo1,2

1 International Digital Economy Academy, 2 Hong Kong University of Science and Technology (Guangzhou), 3 Hong Kong University of Science and Technology, 4 DataArc Tech Ltd.

EACL 2026 Findings


If you have any questions, feel free to contact us 📧.

Overview

SoE explores privacy-preserving continual pretraining by combining weighted entity-graph–based data synthesis with deterministic encryption, enabling LLMs to learn from small domain-specific corpora while retaining controlled access to sensitive information.
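To make the deterministic-encryption idea concrete: with a deterministic cipher, the same entity string always encrypts to the same ciphertext, so entity co-occurrence structure survives encryption, while the key holder can still decrypt (the "controlled access" part). The sketch below uses AES-SIV from the cryptography package purely as an illustration; the repo's actual scheme lives in crypto/crypto_entity.py and may differ.

import base64
from cryptography.hazmat.primitives.ciphers.aead import AESSIV

key = AESSIV.generate_key(bit_length=512)  # in practice, load a fixed key (see Step 2)
cipher = AESSIV(key)

def encrypt_entity(entity: str) -> str:
    # AES-SIV with no nonce is deterministic: identical entities always
    # yield identical tokens, so synthetic text stays internally consistent.
    return base64.urlsafe_b64encode(cipher.encrypt(entity.encode(), None)).decode()

def decrypt_entity(token: str) -> str:
    # Controlled access: anyone holding the key can recover the entity.
    return cipher.decrypt(base64.urlsafe_b64decode(token), None).decode()

assert encrypt_entity("Alice") == encrypt_entity("Alice")
assert decrypt_entity(encrypt_entity("Alice")) == "Alice"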

Installation

git clone https://github.com/DataArcTech/SoE.git
cd SoE
pip install -r requirements.txt
huggingface-cli login --token <huggingface token>

Quick Start

Step 1: Data Synthesis

  1. Set your OpenAI/DeepSeek API key in inference/devapi.py.
  2. Generate entities and questions for the i-th article.
  3. Generate the weighted graph and entity relations, where i is the article index (a batch-driver sketch follows the commands below):
python data/entigraph.py i
python data/edge_weight_generation.py 1
python data/edge_weight_generation.py 2 i
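To run the whole synthesis stage over many articles, a small driver can loop over indices and shell out to the scripts above. This is a convenience sketch, not part of the repo, and it assumes the positional arguments behave exactly as in the three commands shown.

import subprocess, sys

def synthesize(article_indices):
    # Mirrors the three commands above, once per article index i.
    for i in article_indices:
        subprocess.run([sys.executable, "data/entigraph.py", str(i)], check=True)
        subprocess.run([sys.executable, "data/edge_weight_generation.py", "1"], check=True)
        subprocess.run([sys.executable, "data/edge_weight_generation.py", "2", str(i)], check=True)

synthesize(range(10))  # e.g., the first ten articles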

Step 2: Encryption

  1. Set the encryption key in crypto/crypto_entity.py.
  2. Encrypt the synthetic data for articles with indices from start to end:
python crypto/main.py --start start --end end --lang 'zh'

For a more secure pipeline, perform encryption before Step 1 and manually check the encrypted original data before running synthesis.
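For example, assuming zero-based article indices, encrypting the first ten articles of a Chinese-language corpus would be:

python crypto/main.py --start 0 --end 10 --lang 'zh'

(The inclusive/exclusive semantics of --start and --end are an assumption here; check crypto/main.py.)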

Step 3: Training

If training with LlamaFactory, first generate the JSON file it expects:

python utils/io_utils.py
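LlamaFactory also needs the generated file registered in its data/dataset_info.json. The snippet below is a hypothetical example of that registration; the dataset name, file name, and column mapping are assumptions about what utils/io_utils.py emits, not guaranteed by the repo.

import json

# Hypothetical LlamaFactory registration; adjust names to the real output.
entry = {
    "soe_synthetic": {
        "file_name": "soe_synthetic.json",
        "columns": {"prompt": "instruction", "response": "output"},
    }
}
path = "LLaMA-Factory/data/dataset_info.json"  # example path to your LlamaFactory checkout
with open(path, encoding="utf-8") as f:
    info = json.load(f)
info.update(entry)
with open(path, "w", encoding="utf-8") as f:
    json.dump(info, f, ensure_ascii=False, indent=2)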

Otherwise, train under the SoE codebase directly (this may require more computational resources):

mkdir -p data/dataset/bins/
python data/tokenize_entigraph.py
python data/tokenize_redpj.py
bash scripts/train.sh
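For orientation, tokenizer scripts of this kind usually flatten documents into one binary token stream that the trainer can memory-map. The sketch below shows that general pattern only; the tokenizer name, dtype, and output file are assumptions, not necessarily what data/tokenize_entigraph.py does.

import numpy as np
from transformers import AutoTokenizer

# Generic token-packing pattern: concatenate tokenized docs, EOS-separated,
# into a flat uint16 stream under data/dataset/bins/.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model
ids = []
for doc in ["encrypted synthetic document 1", "encrypted synthetic document 2"]:
    ids.extend(tok(doc)["input_ids"] + [tok.eos_token_id])
np.array(ids, dtype=np.uint16).tofile("data/dataset/bins/example.bin")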

Step 4: Evaluation

  1. Generate responses with the continually pretrained models, with or without RAG.
  2. Calculate the accuracy of the pretrained models:
bash scripts/eval.sh # or scripts/eval_rag.sh
python calcu_score.py
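As a rough picture of the scoring step, an exact-match accuracy over a file of model responses might look like the sketch below; the file path and field names are hypothetical, and calcu_score.py may use a different metric.

import json

correct = total = 0
with open("results/responses.jsonl", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        rec = json.loads(line)
        total += 1
        correct += rec["prediction"].strip() == rec["answer"].strip()
print(f"accuracy: {correct / total:.4f}")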

Step 5: Demo

The code for the demo is under demo/.

python demo/demo.py

Acknowledgement

Thanks to Entigraph for releasing their model weights and source code, and to the QuALITY dataset for releasing high-quality data!

Citation

If you use this code in your research, please cite our paper:

@misc{liu2026continualpretrainingencryptedsynthetic,
      title={Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs},
      author={Honghao Liu and Xuhui Jiang and Chengjin Xu and Cehao Yang and Yiran Cheng and Lionel Ni and Jian Guo},
      year={2026},
      eprint={2601.05635},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2601.05635},
}
