Commit fc755bb

Christoph Stumpf (cstub) authored and committed

Build & Deployment Documentation

* Add build and deployment documentation to README.md file.

1 parent e565930

File tree: 1 file changed — README.md (+136, −0 lines changed)
@@ -58,3 +58,139 @@ The requirements regarding the computational resources to train the classifiers
| RAM | 32 GB |
| GPU | 1 GPU, 8 GB RAM |
| HDD | 100 GB |
## Classifier

The machine learning estimator created in this project follows a supervised approach and is trained using the [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting) algorithm. Employing the [CatBoost](https://catboost.ai/) library, a binary classifier is created that is capable of classifying network flows as either benign or malicious. The chosen parameters of the classifier and its performance metrics can be examined in the following [notebook](https://github.com/cstub/ml-ids/blob/master/notebooks/07_binary_classifier_comparison/binary-classifier-comparison.ipynb).
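The training setup can be sketched as follows. This is a minimal illustration only: scikit-learn's `GradientBoostingClassifier` stands in for CatBoost (both implement the gradient-boosting approach described above), and the two flow features and their values are hypothetical, not the project's real feature set.

```python
# Minimal gradient-boosting sketch; scikit-learn stands in for CatBoost,
# and the feature values below are purely illustrative.
from sklearn.ensemble import GradientBoostingClassifier

# Toy training data: two hypothetical flow features (e.g. flow duration,
# forward packet count); label 0 = benign, 1 = malicious.
X_train = [[100, 2], [120, 3], [110, 2],
           [90000, 700], [85000, 650], [95000, 720]]
y_train = [0, 0, 0, 1, 1, 1]

clf = GradientBoostingClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

# Map the numeric class to the labels used throughout this document.
labels = {0: "benign", 1: "malicious"}
prediction = labels[clf.predict([[88000, 680]])[0]]
```

The real classifier is trained on the full CIC-IDS-2018 feature set listed in the REST API example below.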
## Deployment Architecture

The deployment architecture of the complete ML-IDS system is explained in detail in the [system architecture document](https://docs.google.com/document/d/1s_EBMTid4gdrsQU_xOCAYK1BzxkhhnYl6wHFSZo_9Tw/edit?usp=sharing).
## Model Training and Deployment

The model can be trained and deployed either locally or via [Amazon SageMaker](https://aws.amazon.com/sagemaker/). In either case, the [MLflow](https://www.mlflow.org/docs/latest/index.html) framework is used to train the model and create the model artifacts.
### Installation

To install the necessary dependencies, check out the project and create a new Anaconda environment from the `environment.yml` file.

```
conda env create -f environment.yml
```
Afterwards, activate the environment and install the project resources.

```
conda activate ml-ids
pip install -e .
```
### Dataset Creation

To create the dataset for training, use the following command:

```
make split_dataset \
  DATASET_PATH={path-to-source-dataset}
```

This command reads the source dataset and splits it into separate train/validation/test sets with a sample ratio of 80%/10%/10%. The specified source dataset should be a folder containing multiple `.csv` files. You can use the [CIC-IDS-2018 dataset](https://www.unb.ca/cic/datasets/ids-2018.html) provided via [Google Drive](https://drive.google.com/open?id=1HrTPh0YRSZ4T9DLa_c47lubheKUcPl0r) for this purpose. Once the command completes, a new folder `dataset` is created that contains the split datasets in `.h5` format.
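The split logic can be sketched in a few lines. This is an illustrative standard-library version of an 80%/10%/10% partition, not the project's actual implementation (which reads `.csv` files and writes `.h5` outputs):

```python
# Illustrative 80%/10%/10% train/validation/test split of sample indices
# (a sketch of what `make split_dataset` does, not the actual code).
import random

def split_indices(n_samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle sample indices and partition them by the given ratios."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # deterministic shuffle
    n_train = int(n_samples * ratios[0])
    n_val = int(n_samples * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

train, val, test = split_indices(1000)
```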
### Local Mode

To train the model in local mode, using the default parameters and dataset locations created by `split_dataset`, use the following command:

```
make train_local
```
If the datasets are stored in a different location, or if you want to specify different training parameters, you can optionally supply the dataset locations and a training parameter file:

```
make train_local \
  TRAIN_PATH={path-to-train-dataset} \
  VAL_PATH={path-to-validation-dataset} \
  TEST_PATH={path-to-test-dataset} \
  TRAIN_PARAM_PATH={path-to-param-file}
```
Upon completion of the training process, the model artifacts can be found in the `build/models/gradient_boost` directory.

To deploy the model locally, the MLflow CLI can be used:

```
mlflow models serve -m build/models/gradient_boost -p 5000
```
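A locally served MLflow model can then be queried at its standard scoring route (`/invocations` on the port given above) with a JSON body in `pandas-split` format, as also used by the REST API example below. The sketch only builds the request; the two feature columns shown are an illustrative subset of the real flow features expected by the model.

```python
# Build a prediction request for the locally served model
# (MLflow's /invocations scoring endpoint, pandas-split format).
import json
import urllib.request

payload = {
    "columns": ["dst_port", "protocol"],  # illustrative subset of features
    "data": [[80, 17]],
}
request = urllib.request.Request(
    "http://localhost:5000/invocations",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json; format=pandas-split"},
)

# To actually send the request (requires the server to be running):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```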
The model can also be deployed as a Docker container using the following commands:

```
mlflow models build-docker -m build/models/gradient_boost -n ml-ids-classifier:1.0

docker run -p 5001:8080 ml-ids-classifier:1.0
```
### Amazon SageMaker

To train the model on Amazon SageMaker, the following command sequence is used:

```
# build a new docker container for model training
make sagemaker_build_image \
  TAG=1.0

# upload the container to AWS ECR
make sagemaker_push_image \
  TAG=1.0

# execute the training container on Amazon SageMaker
make sagemaker_train_aws \
  SAGEMAKER_IMAGE_NAME={ecr-image-name}:1.0 \
  JOB_ID=ml-ids-job-0001
```
These commands require a valid AWS account with the appropriate permissions to be configured locally via the [AWS CLI](https://aws.amazon.com/cli/). Furthermore, [AWS ECR](https://aws.amazon.com/ecr/) and Amazon SageMaker must be configured for the account.

When using this repository, manual invocation of the aforementioned commands is not necessary, as training on Amazon SageMaker is supported via a [GitHub workflow](https://github.com/cstub/ml-ids/blob/master/.github/workflows/train.yml) that is triggered upon creation of a new tag of the form `m*` (e.g. `m1.0`).

To deploy a trained model on Amazon SageMaker, a [GitHub Deployment request](https://developer.github.com/v3/repos/deployments/) must be issued via the GitHub API, specifying the tag of the model.
```
{
  "ref": "refs/tags/m1.0",
  "payload": {},
  "description": "Deploy request for model version m1.0",
  "auto_merge": false
}
```
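Issuing this request amounts to a `POST` against the repository's deployments endpoint of the GitHub REST API. The sketch below only constructs the request; the authorization token is a placeholder you must supply yourself.

```python
# Build a GitHub Deployment request for the model tag
# (POST /repos/{owner}/{repo}/deployments); the token is a placeholder.
import json
import urllib.request

deployment = {
    "ref": "refs/tags/m1.0",
    "payload": {},
    "description": "Deploy request for model version m1.0",
    "auto_merge": False,
}
request = urllib.request.Request(
    "https://api.github.com/repos/cstub/ml-ids/deployments",
    data=json.dumps(deployment).encode("utf-8"),
    headers={
        "Authorization": "token {github-token}",  # placeholder token
        "Accept": "application/vnd.github.v3+json",
    },
    method="POST",
)

# To actually issue the request (requires a valid token):
# urllib.request.urlopen(request)
```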
This deployment request triggers a [GitHub workflow](https://github.com/cstub/ml-ids/blob/master/.github/workflows/deployment.yml) that deploys the model to SageMaker. After successful deployment, the model is accessible via the SageMaker HTTP API.
## Using the Classifier

The classifier deployed on Amazon SageMaker is not directly available to the public, but it can be accessed using the [ML-IDS REST API](https://github.com/cstub/ml-ids-api).
### REST API

To invoke the REST API, the following command can be used to submit a prediction request for a given network flow:

```
curl -X POST \
  http://ml-ids-cluster-lb-1096011980.eu-west-1.elb.amazonaws.com/api/predictions \
  -H 'Accept: */*' \
  -H 'Content-Type: application/json; format=pandas-split' \
  -H 'Host: ml-ids-cluster-lb-1096011980.eu-west-1.elb.amazonaws.com' \
  -H 'cache-control: no-cache' \
  -d '{"columns":["dst_port","protocol","timestamp","flow_duration","tot_fwd_pkts","tot_bwd_pkts","totlen_fwd_pkts","totlen_bwd_pkts","fwd_pkt_len_max","fwd_pkt_len_min","fwd_pkt_len_mean","fwd_pkt_len_std","bwd_pkt_len_max","bwd_pkt_len_min","bwd_pkt_len_mean","bwd_pkt_len_std","flow_byts_s","flow_pkts_s","flow_iat_mean","flow_iat_std","flow_iat_max","flow_iat_min","fwd_iat_tot","fwd_iat_mean","fwd_iat_std","fwd_iat_max","fwd_iat_min","bwd_iat_tot","bwd_iat_mean","bwd_iat_std","bwd_iat_max","bwd_iat_min","fwd_psh_flags","bwd_psh_flags","fwd_urg_flags","bwd_urg_flags","fwd_header_len","bwd_header_len","fwd_pkts_s","bwd_pkts_s","pkt_len_min","pkt_len_max","pkt_len_mean","pkt_len_std","pkt_len_var","fin_flag_cnt","syn_flag_cnt","rst_flag_cnt","psh_flag_cnt","ack_flag_cnt","urg_flag_cnt","cwe_flag_count","ece_flag_cnt","down_up_ratio","pkt_size_avg","fwd_seg_size_avg","bwd_seg_size_avg","fwd_byts_b_avg","fwd_pkts_b_avg","fwd_blk_rate_avg","bwd_byts_b_avg","bwd_pkts_b_avg","bwd_blk_rate_avg","subflow_fwd_pkts","subflow_fwd_byts","subflow_bwd_pkts","subflow_bwd_byts","init_fwd_win_byts","init_bwd_win_byts","fwd_act_data_pkts","fwd_seg_size_min","active_mean","active_std","active_max","active_min","idle_mean","idle_std","idle_max","idle_min"],"data":[[80,17,"21\\/02\\/2018 10:15:06",119759145,75837,0,2426784,0,32,32,32.0,0.0,0,0,0.0,0.0,20263.87212,633.2460039,1579.1859130859,31767.046875,920247,1,120000000,1579.1859130859,31767.046875,920247,1,0,0.0,0.0,0,0,0,0,0,0,606696,0,633.2460327148,0.0,32,32,32.0,0.0,0.0,0,0,0,0,0,0,0,0,0,32.0004234314,32.0,0.0,0,0,0,0,0,0,75837,2426784,0,0,-1,-1,75836,8,0.0,0.0,0,0,0.0,0.0,0,0]]}'
```
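Interpreting the response can be sketched as follows, under the assumption that the endpoint returns a JSON list with one numeric prediction per submitted flow; the exact response schema and decision threshold should be verified against the deployed model.

```python
# Map raw model outputs to benign/malicious labels (assumes the response
# is a JSON list of scores; verify this against the deployed model).
import json

def interpret_predictions(response_body, threshold=0.5):
    """Return a benign/malicious label for each prediction in the response."""
    predictions = json.loads(response_body)
    return ["malicious" if p >= threshold else "benign" for p in predictions]

labels = interpret_predictions("[0.97, 0.02]")
```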
### ML-IDS API Clients

For convenience, the Python clients implemented in the [ML-IDS API Clients project](https://github.com/cstub/ml-ids-api-client) can be used to submit new prediction requests to the API and to receive real-time notifications upon detection of malicious network flows.
