
SoE: Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs

Honghao Liu1,2, Xuhui Jiang1,4, Chengjin Xu1,4, Cehao Yang1,2, Yiran Cheng3, Lionel Ni2,3, Jian Guo1,2

1 International Digital Economy Academy, 2 Hong Kong University of Science and Technology (Guangzhou), 3 Hong Kong University of Science and Technology, 4 DataArc Tech Ltd.

EACL 2026 Findings


If you have any questions, feel free to contact us 📧.

Overview

SoE explores privacy-preserving continual pretraining by combining weighted entity-graph–based data synthesis with deterministic encryption, enabling LLMs to learn from small domain-specific corpora while retaining controlled access to sensitive information.
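To make the deterministic-encryption idea concrete: with a deterministic cipher, the same entity string always encrypts to the same ciphertext, so entity co-occurrence structure survives encryption, while the key holder can still decrypt (the "controlled access" part). The sketch below uses AES-SIV from the cryptography package purely as an illustration; the repo's actual scheme lives in crypto/crypto_entity.py and may differ.

import base64
from cryptography.hazmat.primitives.ciphers.aead import AESSIV

key = AESSIV.generate_key(bit_length=512)  # in practice, load a fixed key (see Step 2)
cipher = AESSIV(key)

def encrypt_entity(entity: str) -> str:
    # AES-SIV with no nonce is deterministic: identical entities always
    # yield identical tokens, so synthetic text stays internally consistent.
    return base64.urlsafe_b64encode(cipher.encrypt(entity.encode(), None)).decode()

def decrypt_entity(token: str) -> str:
    # Controlled access: anyone holding the key can recover the entity.
    return cipher.decrypt(base64.urlsafe_b64decode(token), None).decode()

assert encrypt_entity("Alice") == encrypt_entity("Alice")
assert decrypt_entity(encrypt_entity("Alice")) == "Alice"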

Installation

git clone https://github.com/DataArcTech/SoE.git
cd SoE
pip install -r requirements.txt
huggingface-cli login --token <huggingface token>

Quick Start

Step 1: Data Synthesis

  1. Set your OpenAI/DeepSeek API key in inference/devapi.py.
  2. Generate entities and questions for the i-th article.
  3. Generate the weighted graph and entity relations, where i is the article index (a batch-driver sketch follows the commands below):
python data/entigraph.py i
python data/edge_weight_generation.py 1
python data/edge_weight_generation.py 2 i
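To run the whole synthesis stage over many articles, a small driver can loop over indices and shell out to the scripts above. This is a convenience sketch, not part of the repo, and it assumes the positional arguments behave exactly as in the three commands shown.

import subprocess, sys

def synthesize(article_indices):
    # Mirrors the three commands above, once per article index i.
    for i in article_indices:
        subprocess.run([sys.executable, "data/entigraph.py", str(i)], check=True)
        subprocess.run([sys.executable, "data/edge_weight_generation.py", "1"], check=True)
        subprocess.run([sys.executable, "data/edge_weight_generation.py", "2", str(i)], check=True)

synthesize(range(10))  # e.g., the first ten articles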

Step 2: Encryption

  1. Set the encryption key in crypto/crypto_entity.py.
  2. Encrypt the synthetic data for articles with indices from start to end:
python crypto/main.py --start start --end end --lang 'zh'

For a more secure pipeline, perform encryption before Step 1 and manually check the encrypted original data before running synthesis.
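For example, assuming zero-based article indices, encrypting the first ten articles of a Chinese-language corpus would be:

python crypto/main.py --start 0 --end 10 --lang 'zh'

(The inclusive/exclusive semantics of --start and --end are an assumption here; check crypto/main.py.)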

Step 3: Training

If training with LlamaFactory, first generate the JSON file it expects:

python utils/io_utils.py
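LlamaFactory also needs the generated file registered in its data/dataset_info.json. The snippet below is a hypothetical example of that registration; the dataset name, file name, and column mapping are assumptions about what utils/io_utils.py emits, not guaranteed by the repo.

import json

# Hypothetical LlamaFactory registration; adjust names to the real output.
entry = {
    "soe_synthetic": {
        "file_name": "soe_synthetic.json",
        "columns": {"prompt": "instruction", "response": "output"},
    }
}
path = "LLaMA-Factory/data/dataset_info.json"  # example path to your LlamaFactory checkout
with open(path, encoding="utf-8") as f:
    info = json.load(f)
info.update(entry)
with open(path, "w", encoding="utf-8") as f:
    json.dump(info, f, ensure_ascii=False, indent=2)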

Otherwise, train under the SoE codebase directly (this may require more computational resources):

mkdir -p data/dataset/bins/
python data/tokenize_entigraph.py
python data/tokenize_redpj.py
bash scripts/train.sh
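For orientation, tokenizer scripts of this kind usually flatten documents into one binary token stream that the trainer can memory-map. The sketch below shows that general pattern only; the tokenizer name, dtype, and output file are assumptions, not necessarily what data/tokenize_entigraph.py does.

import numpy as np
from transformers import AutoTokenizer

# Generic token-packing pattern: concatenate tokenized docs, EOS-separated,
# into a flat uint16 stream under data/dataset/bins/.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model
ids = []
for doc in ["encrypted synthetic document 1", "encrypted synthetic document 2"]:
    ids.extend(tok(doc)["input_ids"] + [tok.eos_token_id])
np.array(ids, dtype=np.uint16).tofile("data/dataset/bins/example.bin")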

Step 4: Evaluation

  1. Generate responses with the continually pretrained models, with or without RAG.
  2. Calculate the accuracy of the pretrained models:
bash scripts/eval.sh # or scripts/eval_rag.sh
python calcu_score.py
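As a rough picture of the scoring step, an exact-match accuracy over a file of model responses might look like the sketch below; the file path and field names are hypothetical, and calcu_score.py may use a different metric.

import json

correct = total = 0
with open("results/responses.jsonl", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        rec = json.loads(line)
        total += 1
        correct += rec["prediction"].strip() == rec["answer"].strip()
print(f"accuracy: {correct / total:.4f}")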

Step 5: Demo

The code for the demo is under demo/.

python demo/demo.py

Acknowledgement

Thanks to Entigraph for releasing their model weights and source code, and to the QuALITY dataset for releasing high-quality data!

Citation

If you use this code in your research, please cite our paper:

@misc{liu2026continualpretrainingencryptedsynthetic,
      title={Continual Pretraining on Encrypted Synthetic Data for Privacy-Preserving LLMs},
      author={Honghao Liu and Xuhui Jiang and Chengjin Xu and Cehao Yang and Yiran Cheng and Lionel Ni and Jian Guo},
      year={2026},
      eprint={2601.05635},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2601.05635},
}
