Install the required Python packages:

```
pip install -r requirements.txt
```
Source data: SAML-D.csv (950 MB), https://www.kaggle.com/datasets/berkanoztas/synthetic-transaction-monitoring-dataset-aml
Configure your data path in ./ipmn_proflow/config_82d.json:

```
{
    "DATAPATH": "E:/default_download/IPMN_pro/IPMN_1/data/",
    "ORI_ALL_CSV": "SAML-D.csv",
    ...
}
```

The following command runs on the SAML-D dataset with an 8:2 train/test split and multi-window features (14d_4d and 7d_7d), building basic, graph, and pattern features:
```
python ./ipmn_proflow/main.py --config_path ./config_82d.json
```
You can build your own .json config; just follow the same pattern.
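For orientation, here is a config skeleton assembled purely from the default values documented in the tables below (STANDARD_INPUT_PARAM is omitted for brevity; its default is listed after the key tables). Treat it as a sketch, not a verbatim copy of config_82d.json:

```json
{
    "DATAPATH": "E:/default_download/IPMN_pro/IPMN_1/data/",
    "ORI_ALL_CSV": "SAML-D.csv",
    "IBM_CSV": "sampled_IBM.csv",
    "SAVE_TRANS": "saved_transformer.pkl",
    "SAVE_MODEL": "saved_model.pkl",
    "RANDOM_SEED": 42,
    "SAVE_LEVEL": -1,
    "SHOW_LEVEL": 1,
    "STANDARD_INPUT_LABEL": "Is_laundering",
    "MULTI_CLASS_LABEL": "Laundering_type",
    "STANDARD_DROP_PARAM": ["Date", "Timestamp", "Year", "Month"],
    "STEP_UNIT": "d",
    "WINDOW_SIZE": 10,
    "SLIDER_STEP": 1,
    "PARAM_GRID": { "max_depth": [14, 16], "eta": [0.12, 0.14] },
    "TPR": 0.95,
    "TPR_SET": 0,
    "DATASET_MODES": "quick_test",
    "PARAMETER_MODES": "basic",
    "QT_TRAIN_START": "2022/11/01",
    "QT_TRAIN_END": "2022/11/30",
    "QT_TEST_START": "2023/04/01",
    "QT_TEST_END": "2023/04/30",
    "SP_TRAIN_FILE": "2022-11.csv",
    "SP_TEST_FILE": "2023-06.csv"
}
```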
| Key | Type | Default | Description |
|---|---|---|---|
| DATAPATH | string | "E:/default_download/IPMN_pro/IPMN_1/data/" | Root directory for data files |
| ORI_ALL_CSV | string | "SAML-D.csv" | Original dataset filename |
| IBM_CSV | string | "sampled_IBM.csv" | IBM-specific dataset filename |
| SAVE_TRANS | string | "saved_transformer.pkl" | Filename to save the feature transformer |
| SAVE_MODEL | string | "saved_model.pkl" | Filename to save the trained model |
| RANDOM_SEED | int | 42 | Random seed for reproducibility |
| SAVE_LEVEL | int | -1 | Save level: -1 = no model save, 0 = model save only, 1 = also save dataset, 2 = also save data with predictions |
| SHOW_LEVEL | int | 1 | Verbosity of logging/output: 0 = show nothing, 1 = feature importance, 2 = save tree, 3 = permutation importance, 4 = SHAP summary, 5 = SHAP waterfall |
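The pipeline is driven entirely by this file. As a minimal sketch of what loading such a config can look like (the project's actual loader presumably lives in ipmn_proflow/config.py; the --config_path flag matches the documented CLI):

```python
import argparse
import json

def load_config():
    """Parse --config_path and return the JSON config as a plain dict."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config_path", default="./config_82d.json")
    args = parser.parse_args()
    with open(args.config_path, "r", encoding="utf-8") as f:
        return json.load(f)
```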
| Key | Type | Default | Description |
|---|---|---|---|
| STANDARD_INPUT_PARAM | list[string] | See below | List of standard input features |
| STANDARD_INPUT_LABEL | string | "Is_laundering" | Binary classification label |
| MULTI_CLASS_LABEL | string | "Laundering_type" | Multi-class classification label |
| STANDARD_DROP_PARAM | list[string] | ["Date", "Timestamp", "Year", "Month"] | Columns to drop during preprocessing |
Default STANDARD_INPUT_PARAM:

```json
[
"Is_laundering",
"Laundering_type",
"Date",
"Time",
"Sender_account",
"Receiver_account",
"Amount",
"Payment_currency",
"Received_currency",
"Payment_type"
]
```
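To illustrate how these label keys are typically consumed, here is a hypothetical split_label matching the signature used in main.py further below; the real implementation may differ:

```python
def split_label(config, train_set, test_set):
    """Pop the binary label off each pandas DataFrame and drop non-features."""
    label = config["STANDARD_INPUT_LABEL"]                # "Is_laundering"
    extra = [config["MULTI_CLASS_LABEL"], *config["STANDARD_DROP_PARAM"]]
    y_train, y_test = train_set[label], test_set[label]
    X_train = train_set.drop(columns=[label, *extra], errors="ignore")
    X_test = test_set.drop(columns=[label, *extra], errors="ignore")
    return y_train, y_test, X_train, X_test
```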
| Key | Type | Default | Description |
|---|---|---|---|
| STEP_UNIT | string | "d" | Unit of the time step (e.g., "d" = day) |
| WINDOW_SIZE | int | 10 | Size of the sliding window |
| SLIDER_STEP | int | 1 | Step size of the sliding window |
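A hypothetical helper showing what a sliding window with these defaults means (a 10-day window advancing one day at a time); the repo's real logic lives in parameter_handler/:

```python
import pandas as pd

def iter_windows(df, date_col="Date", window_size=10, slider_step=1, step_unit="d"):
    """Yield (start, end, frame) for each sliding-window position."""
    size = pd.Timedelta(window_size, unit=step_unit)
    step = pd.Timedelta(slider_step, unit=step_unit)
    start = df[date_col].min()
    while start <= df[date_col].max():
        end = start + size
        yield start, end, df[(df[date_col] >= start) & (df[date_col] < end)]
        start += step
```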
| Key | Type | Default | Description |
|---|---|---|---|
| PARAM_GRID | dict | { "max_depth": [14, 16], "eta": [0.12, 0.14] } | Grid for hyperparameter tuning |
| TPR | float | 0.95 | Target true positive rate |
| TPR_SET | int | 0 | Flag to enable TPR adjustment (0 = disabled) |
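A sketch of how PARAM_GRID, RANDOM_SEED, and the TPR target plausibly combine, assuming an XGBoost classifier tuned with scikit-learn grid search (the actual wiring is in model.py; "eta" is XGBoost's native name for the learning rate):

```python
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {"max_depth": [14, 16], "eta": [0.12, 0.14]}  # PARAM_GRID default
search = GridSearchCV(XGBClassifier(random_state=42), param_grid, cv=3)
# search.fit(X_train, y_train); best_model = search.best_estimator_

def threshold_for_tpr(y_true, probs, target_tpr=0.95):
    """With TPR_SET enabled, pick the highest threshold reaching the TPR target."""
    fpr, tpr, thresholds = roc_curve(y_true, probs)
    return thresholds[tpr >= target_tpr][0]
```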
| Key | Type | Default | Description |
|---|---|---|---|
| DATASET_MODES | string | "quick_test" | Mode for selecting training/testing data |

Valid values: `quick_test`, `all_d73`, `all_d82`, `first_2_d73`, `first_4_d73`, `IBM_d73`, `specific_train_specific_test`
| Key | Type | Default | Description |
|---|---|---|---|
| PARAMETER_MODES | string | "basic" | Mode for feature processing |

Valid values: `origin`, `basic`, `window_graph`, `multi_window_graph`, `window_all`, `multi_window_all`
The following keys are used only when DATASET_MODES = "quick_test".
| Key | Type | Default | Description |
|---|---|---|---|
| QT_TRAIN_START | string | "2022/11/01" | Start date for training data |
| QT_TRAIN_END | string | "2022/11/30" | End date for training data |
| QT_TEST_START | string | "2023/04/01" | Start date for testing data |
| QT_TEST_END | string | "2023/04/30" | End date for testing data |
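For concreteness, this cut amounts to a date filter like the following (illustrative only; assumes the Date column parses cleanly, and the real selection presumably lives in dataloader.py):

```python
import pandas as pd

df = pd.read_csv("SAML-D.csv", parse_dates=["Date"])
train_set = df[(df["Date"] >= "2022/11/01") & (df["Date"] <= "2022/11/30")]
test_set = df[(df["Date"] >= "2023/04/01") & (df["Date"] <= "2023/04/30")]
```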
The following keys are used only when DATASET_MODES = "specific_train_specific_test".
| Key | Type | Default | Description |
|---|---|---|---|
| SP_TRAIN_FILE | string | "2022-11.csv" | Specific training file |
| SP_TEST_FILE | string | "2023-06.csv" | Specific testing file |
The pipeline in main.py is a straight chain; to customize it, just match each stage's inputs and outputs:

```python
config = load_config()
train_set, test_set = load_dataset(config)
y_train, y_test, X_train, X_test = split_label(config, train_set, test_set)
X_train, X_test = add_parameter(config, X_train, X_test)
save_feature_data2csv(config, y_train, y_test, X_train, X_test)
X_train, X_test, model_save_id = encode_feature(config, X_train, X_test)
grid_search_model = config_model(config)
trained_grid_search_model = train_model(grid_search_model, X_train, y_train)
best_model = search_best_save(config, trained_grid_search_model, model_save_id)
feature_analysis(config, best_model, X_test, y_test)
test_probabilities = test_model(best_model, X_test)
save_predict_data2csv_float(config, test_probabilities, X_test, y_test)
y_pred = analysis_performance(config, y_test, test_probabilities)
save_predict_data2csv_bool(config, y_pred, X_test, y_test)
```
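Once SAVE_LEVEL >= 0 has written the pickles named by SAVE_TRANS and SAVE_MODEL, they can be reloaded to score new data. A hypothetical sketch (whether predictor.py does exactly this is an assumption, as is the sklearn-style transform API):

```python
import pickle

with open("saved_transformer.pkl", "rb") as f:  # SAVE_TRANS
    transformer = pickle.load(f)
with open("saved_model.pkl", "rb") as f:        # SAVE_MODEL
    model = pickle.load(f)

# X_new = transformer.transform(new_transactions)  # hypothetical new data
# probs = model.predict_proba(X_new)[:, 1]
```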
Second source dataset: HI-Medium_Trans.csv (2.82 GB), https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml. Before using it, you need to run ./data/sample.py, which splits it into sampled_IBM.csv and sampled_IBM_pred.csv.
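The real sampling script ships at ./data/sample.py; purely as an illustration of the kind of split it produces (the 5% and 20% fractions here are invented for the sketch):

```python
import pandas as pd

# Read the 2.82 GB file in chunks and keep a manageable random sample.
parts = [chunk.sample(frac=0.05, random_state=42)
         for chunk in pd.read_csv("HI-Medium_Trans.csv", chunksize=1_000_000)]
sample = pd.concat(parts, ignore_index=True)

pred = sample.sample(frac=0.2, random_state=42)   # held out for prediction
sample.drop(pred.index).to_csv("sampled_IBM.csv", index=False)
pred.to_csv("sampled_IBM_pred.csv", index=False)
```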
```
IPMN_1/
├── data/
│   ├── SAML-D.csv            (download required)
│   ├── HI-Medium_Trans.csv   (download required)
│   ├── sampled_IBM.csv       (created by sample.py)
│   ├── sampled_IBM_pred.csv  (created by sample.py)
│   └── sample.py
├── ipmn_proflow/
│   ├── parameter_handler/    (window and feature handling)
│   ├── xgb_trees/            (output path for exported tree structures)
│   ├── main.py
│   ├── config_82d.json
│   ├── __init__.py
│   ├── analysis.py
│   ├── config.py
│   ├── dataloader.py
│   ├── datasaver.py
│   ├── imports.py            ("from imports import *")
│   ├── model.py
│   ├── param_feature.py
│   └── predictor.py
├── utili/                    (assorted utility tools)
├── requirements.txt
└── README.md
```
https://github.com/dyinjin/IPMN_1
This project is open-sourced under the MIT License.