I have reorganized the code and tested it recently. The code should be able to reproduce the results presented in the paper. (2023/08/27)
This is the repository for the code and datasets used in the paper "BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection", accepted by the ACM Web Conference (WWW) 2023.
Here you can find our slides.
- Python >= 3.6
- TensorFlow >= 1.4.0

I use Python 3.9, TensorFlow 2.9.2 with CUDA 11.2, and NumPy 1.19.5.
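A quick optional sanity check of the environment (not part of the pipeline itself) is to print the versions TensorFlow reports and whether it sees a CUDA-enabled GPU:

```python
# Optional environment check: confirm TensorFlow/NumPy versions and GPU visibility.
import tensorflow as tf
import numpy as np

print("TensorFlow:", tf.__version__)      # tested with 2.9.2
print("NumPy:", np.__version__)           # tested with 1.19.5
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPUs:", tf.config.list_physical_devices("GPU"))
```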
Transaction Dataset:
 
The master branch hosts the basic BERT4ETH model; if you only wish to run the basic model, there is no need to download the ERC-20 log dataset. Advanced features such as in/out separation and the ERC-20 log can be found in the old branch.
```
cd BERT4ETH/Data;   # Labels are already included
unzip ...;
cd ../Model;
```
```
python gen_seq.py --bizdate=bert4eth_exp
```

```
python gen_pretrain_data.py --bizdate=bert4eth_exp \
                            --max_seq_length=100 \
                            --dupe_factor=10 \
                            --masked_lm_prob=0.8
```

```
python run_pretrain.py --bizdate=bert4eth_exp \
                       --max_seq_length=100 \
                       --epoch=5 \
                       --batch_size=256 \
                       --learning_rate=1e-4 \
                       --num_train_steps=1000000 \
                       --save_checkpoints_steps=8000 \
                       --neg_strategy=zip \
                       --neg_sample_num=5000 \
                       --neg_share=True \
                       --checkpointDir=bert4eth_exp
```

| Parameter | Description |
|---|---|
| bizdate | The signature for this experiment run. |
| max_seq_length | The maximum sequence length of BERT4ETH. |
| masked_lm_prob | The probability of masking an address. |
| epoch | Number of training epochs, default = 5. |
| batch_size | Batch size, default = 256. |
| learning_rate | Learning rate for the optimizer (Adam), default = 1e-4. |
| num_train_steps | The maximum number of training steps, default = 1000000. |
| save_checkpoints_steps | How often (in steps) to save a checkpoint, default = 8000. |
| neg_strategy | Strategy for negative sampling, default = zip; options: uniform, zip, freq. |
| neg_share | Whether to enable the in-batch sharing strategy, default = True. |
| neg_sample_num | The number of negative samples per batch, default = 5000. |
| checkpointDir | The directory in which to save checkpoints. |
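For intuition on `masked_lm_prob`: when `gen_pretrain_data.py` builds masked-LM instances, roughly this fraction of the addresses in each transaction sequence is replaced by a mask token (and `dupe_factor` repeats the process with different random masks). The snippet below is only an illustrative sketch of that idea, not the repository's actual preprocessing code:

```python
import random

def create_masked_instance(address_seq, masked_lm_prob=0.8, mask_token="[MASK]"):
    """Illustrative sketch: mask a fraction of addresses in one transaction sequence."""
    seq = list(address_seq)
    num_to_mask = max(1, int(round(len(seq) * masked_lm_prob)))
    positions = random.sample(range(len(seq)), num_to_mask)
    labels = {}
    for pos in positions:
        labels[pos] = seq[pos]   # remember the original address as the LM target
        seq[pos] = mask_token    # replace it with the mask token
    return seq, labels

masked, labels = create_masked_instance(["0xA1", "0xB2", "0xC3", "0xD4", "0xE5"])
print(masked, labels)
```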
```
python output_embed.py --bizdate=bert4eth_exp \
                       --init_checkpoint=bert4eth_exp/model_104000 \
                       --max_seq_length=100 \
                       --neg_sample_num=5000 \
                       --neg_strategy=zip \
                       --neg_share=True
```

I have generated a version of the embedding file; you can unzip it under the "Model/inter_data/" directory and test the results.
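Once unzipped, the embeddings can be inspected with a few lines of Python. The filenames and format below (a NumPy array of address embeddings plus a matching address list) are assumptions for illustration only; check the files actually produced under "Model/inter_data/" and adjust accordingly:

```python
import numpy as np

# Hypothetical filenames -- inspect Model/inter_data/ for the actual ones.
embeddings = np.load("inter_data/address_embeddings.npy")            # (num_addresses, hidden_dim)
addresses = np.load("inter_data/address_list.npy", allow_pickle=True)

print(embeddings.shape)
print(addresses[:5])
```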
Phishing Account Detection:

```
python run_phishing_detection.py --init_checkpoint=bert4eth_exp/model_104000      # Random Forest (RF)
python run_phishing_detection_dnn.py --init_checkpoint=bert4eth_exp/model_104000  # DNN, better than RF
```
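Conceptually, the RF variant classifies addresses from their embeddings. The following is a simplified, self-contained scikit-learn sketch of that idea using stand-in random data; it is not the repository's actual evaluation code, and you would substitute the real BERT4ETH embeddings and phishing labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Stand-in data: replace with the real BERT4ETH embeddings and phishing labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))      # (num_addresses, hidden_dim)
y = rng.integers(0, 2, size=1000)    # 1 = phishing account, 0 = normal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```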
De-anonymization (ENS):

```
python run_dean_ENS.py --metric=euclidean \
                       --init_checkpoint=bert4eth_exp/model_104000
```
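Roughly speaking, de-anonymization matches accounts whose embeddings are close under the chosen `--metric`. The sketch below shows that general idea as a Euclidean nearest-neighbour lookup over two sets of stand-in embeddings; it is an illustration only, not the script's actual pairing logic:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in embeddings: query addresses vs. candidate addresses.
rng = np.random.default_rng(0)
query_emb = rng.normal(size=(10, 64))        # embeddings of "query" addresses
candidate_emb = rng.normal(size=(500, 64))   # embeddings of candidate addresses

nn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(candidate_emb)
dist, idx = nn.kneighbors(query_emb)         # top-5 candidate indices per query address
print(idx[0], dist[0])
```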
Fine-tuning for Phishing Account Detection:

```
python gen_finetune_phisher_data.py --bizdate=bert4eth_exp \
                                    --max_seq_length=100
```

```
python run_finetune_phisher.py --init_checkpoint=bert4eth_exp/model_104000 \
                               --bizdate=bert4eth_exp \
                               --max_seq_length=100 \
                               --checkpointDir=tmp
```
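Fine-tuning attaches a classification head to the pre-trained model and trains on phishing labels end to end. As a rough Keras illustration of such a binary head only (the repository's run_finetune_phisher.py uses its own TensorFlow training code and checkpoint format, and the hidden size below is an assumption):

```python
import tensorflow as tf

# Illustrative head only: pooled BERT4ETH sequence representation -> phishing probability.
hidden_dim = 64  # assumed embedding size; check the model config for the real value
pooled = tf.keras.Input(shape=(hidden_dim,), name="pooled_sequence_embedding")
x = tf.keras.layers.Dropout(0.1)(pooled)
prob = tf.keras.layers.Dense(1, activation="sigmoid")(x)
head = tf.keras.Model(pooled, prob)
head.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
             loss="binary_crossentropy", metrics=["accuracy"])
head.summary()
```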
If you find this repository useful, please give us a star and cite our paper : ) Thank you!
```
@inproceedings{hu2023bert4eth,
  title={BERT4ETH: A Pre-trained Transformer for Ethereum Fraud Detection},
  author={Hu, Sihao and Zhang, Zhen and Luo, Bingqiao and Lu, Shengliang and He, Bingsheng and Liu, Ling},
  booktitle={Proceedings of the ACM Web Conference 2023},
  pages={2189--2197},
  year={2023}
}
```
If you have any questions, you can either open an issue or contact me ([email protected]), and I will reply as soon as I see the issue or email.