Honghao Liu1,2, Xuhui Jiang1,4, Chengjin Xu1,4, Cehao Yang1,2, Yiran Cheng3, Lionel Ni2,3, Jian Guo1,2

1 International Digital Economy Academy, 2 Hong Kong University of Science and Technology (Guangzhou), 3 Hong Kong University of Science and Technology, 4 DataArc Tech Ltd.
EACL 2026 Findings
If you have any questions, feel free to contact us 📧.
## Overview

SoE explores privacy-preserving continual pretraining by combining weighted entity-graph–based data synthesis with deterministic encryption, enabling LLMs to learn from small domain-specific corpora while retaining controlled access to sensitive information.
## Setup

```bash
git clone https://github.com/DataArcTech/SoE.git
cd SoE
pip install -r requirements.txt
```
Log in to Hugging Face:

```bash
huggingface-cli login --token <huggingface token>
```

## Data Synthesis

- Set your OpenAI/DeepSeek API key in `inference/devapi.py`.
- Generate entities and questions for the i-th article.
- Generate the weighted graph and entity relations (a minimal sketch of the edge-weighting idea follows the commands):

```bash
python data/entigraph.py i
python data/edge_weight_generation.py 1
python data/edge_weight_generation.py 2 i
```
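For intuition, here is a minimal sketch of the weighted-graph step under one common assumption: sentence-level co-occurrence counts serve as edge weights, and heavier edges are sampled more often when choosing entity pairs to synthesize about. The helper names are hypothetical; this is not `data/edge_weight_generation.py` itself.

```python
import itertools
import random
from collections import Counter

def build_weighted_graph(sentences: list[str], entities: list[str]) -> Counter:
    """Edge weight = number of sentences in which two entities co-occur."""
    weights: Counter = Counter()
    for sent in sentences:
        present = sorted({e for e in entities if e in sent})
        for a, b in itertools.combinations(present, 2):
            weights[(a, b)] += 1
    return weights

def sample_entity_pair(weights: Counter) -> tuple[str, str]:
    """Heavier edges are proposed more often for synthetic passages."""
    pairs = list(weights)
    return random.choices(pairs, weights=[weights[p] for p in pairs], k=1)[0]

graph = build_weighted_graph(
    ["Alice met Bob in Paris.", "Bob wrote to Alice.", "Carol lives in Paris."],
    ["Alice", "Bob", "Carol", "Paris"],
)
print(sample_entity_pair(graph))  # e.g. ('Alice', 'Bob')
```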
## Encryption

- Set the encryption key in `crypto/crypto_entity.py`.
- Encrypt the synthetic data for articles from `start` to `end`:

```bash
python crypto/main.py --start start --end end --lang 'zh'
```

Note: for more secure synthesis, perform the encryption before step 1 and manually check the encrypted original data.
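As a rough picture of deterministic entity encryption (an assumption about the approach, not the exact logic of `crypto/crypto_entity.py`), the sketch below uses AES-SIV from the `cryptography` package. Used without a nonce, AES-SIV maps the same entity string to the same ciphertext every time, so the encrypted corpus stays self-consistent while the key holder can still decrypt:

```python
import base64
from cryptography.hazmat.primitives.ciphers.aead import AESSIV

# In practice, load the key configured in crypto/crypto_entity.py.
key = AESSIV.generate_key(512)
siv = AESSIV(key)

def encrypt_entity(entity: str) -> str:
    # No nonce: identical entities always yield identical ciphertexts.
    ct = siv.encrypt(entity.encode("utf-8"), None)
    return base64.urlsafe_b64encode(ct).decode("ascii")

def decrypt_entity(token: str) -> str:
    ct = base64.urlsafe_b64decode(token.encode("ascii"))
    return siv.decrypt(ct, None).decode("utf-8")

assert encrypt_entity("Alice") == encrypt_entity("Alice")   # determinism
assert decrypt_entity(encrypt_entity("Alice")) == "Alice"   # controlled access
```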
## Training

If you train with LlamaFactory, generate the dataset JSON file with `utils/io_utils.py`; a hedged sketch of the expected record format follows.
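The sketch below is an assumption about what `utils/io_utils.py` produces, following LlamaFactory's documented pre-training data format of plain `text` records:

```python
import json

def to_llamafactory_json(articles: list[str], out_path: str) -> None:
    """Write one {"text": ...} record per corpus chunk."""
    records = [{"text": a} for a in articles]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

to_llamafactory_json(["Encrypted synthetic passage ..."], "soe_pretrain.json")
# Then register the file in LLaMA-Factory's data/dataset_info.json, e.g.:
# "soe_pretrain": {"file_name": "soe_pretrain.json", "columns": {"prompt": "text"}}
```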
Otherwise, train under SoE directly (this may require more computational resources); the tokenization step is sketched after the commands:

```bash
mkdir -p data/dataset/bins/
python data/tokenize_entigraph.py
python data/tokenize_redpj.py
bash scripts/train.sh
```
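As a rough picture of what the `tokenize_*` scripts are assumed to do (file name and dtype here are illustrative, not the repo's exact code; assumes `numpy` and `transformers`), each corpus is packed into a flat binary of token ids under `data/dataset/bins/` that the training loop can read:

```python
import numpy as np
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in; use the model's own tokenizer
docs = ["Encrypted synthetic passage ...", "Another passage ..."]

ids: list[int] = []
for doc in docs:
    ids.extend(tok.encode(doc))
    ids.append(tok.eos_token_id)  # mark document boundaries

# A flat uint32 array suitable for memory-mapping during training.
np.array(ids, dtype=np.uint32).tofile("data/dataset/bins/entigraph.bin")
```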
## Evaluation

- Generate responses with the continually pretrained models, with or without RAG.
- Calculate the accuracy of the pretrained models (a hedged sketch of the scoring follows the commands):

```bash
bash scripts/eval.sh  # or scripts/eval_rag.sh
python calcu_score.py
```
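The scoring is assumed to be simple answer matching over the generated responses; the I/O format below is hypothetical, and `calcu_score.py` itself may differ:

```python
import json

def accuracy(pred_path: str) -> float:
    """Fraction of predictions whose response matches the gold choice."""
    with open(pred_path, encoding="utf-8") as f:
        preds = json.load(f)  # assumed format: [{"response": "B", "answer": "B"}, ...]
    hits = sum(p["response"].strip().upper().startswith(p["answer"].upper()) for p in preds)
    return hits / len(preds)

print(f"accuracy: {accuracy('predictions.json'):.3f}")
```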
## Demo

The demo code is under `demo/`:

```bash
python demo/demo.py
```

## Acknowledgments

Thanks to EntiGraph for releasing their model weights and source code, and thanks to the QuALITY dataset for releasing high-quality data!
## Citation

If you use this code in your research, please cite our paper:
```bibtex
@misc{liu2026continualpretrainingencryptedsynthetic,
      title={Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs},
      author={Honghao Liu and Xuhui Jiang and Chengjin Xu and Cehao Yang and Yiran Cheng and Lionel Ni and Jian Guo},
      year={2026},
      eprint={2601.05635},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2601.05635},
}
```