Comparison of Models (SMOTE, GAN) for Data Augmentation with the UNSW-NB15 Dataset
- Proposed Model: GAN Data Augmentation + Security-related Feature elimination
- Compared Models:
- No Data Augmentation
- SMOTE Data Augmentation
- GAN Data Augmentation
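As a rough illustration of the SMOTE baseline listed above, the sketch below implements only SMOTE's core idea: a synthetic sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote_like_oversample` and the toy points are assumptions made for this example and are not taken from this repository. In practice a library implementation such as `SMOTE` from imbalanced-learn would normally be used instead.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the remaining minority samples by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority-class feature vectors (synthetic, not taken from UNSW-NB15).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(minority, n_new=6)
print(len(new_points))  # 6
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already covered by the minority class.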
UNSW-NB15
UNSW-NB15 is a network intrusion dataset. It contains nine different attack types, including DoS, Worms, Backdoors, and Fuzzers. The dataset contains raw network packets. The training set has 175,341 records and the testing set has 82,332 records, covering both attack and normal traffic.
Link: https://paperswithcode.com/dataset/unsw-nb15
- In this experiment, we used the same amount of data as shown in the image above. (*The training set was used for data augmentation.)
(1) Data Analysis
In this dataset, some attack categories (Backdoor, Analysis, Shellcode, Worms) have significantly fewer records than the others. When the number of instances per category is highly imbalanced, several problems can arise during classification, such as:
- 1. Model Bias: The model may become biased towards the majority class, leading to poor performance on minority classes.
- 2. Poor Generalization: The model might not learn the characteristics of the minority classes well, resulting in poor generalization when making predictions on new data.
- 3. Skewed Metrics: Evaluation metrics such as accuracy may be misleading, as a high accuracy can be achieved by simply predicting the majority class.
- 4. Overfitting: The model may overfit the majority class data, capturing noise instead of the underlying patterns.
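The "Skewed Metrics" point can be made concrete with a small sketch. The labels below are toy data invented for this example (95 normal records, 5 attack records), not counts from UNSW-NB15, and the helper functions are illustrative names. A classifier that always predicts the majority class reaches high accuracy while never detecting the minority class.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / true instances."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels (synthetic): 95 "Normal" records and 5 "Worms" records.
y_true = ["Normal"] * 95 + ["Worms"] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = ["Normal"] * 100

acc = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
macro_recall = sum(recall.values()) / len(recall)

print(f"accuracy     = {acc:.2f}")           # 0.95
print(f"recall       = {recall}")            # {'Normal': 1.0, 'Worms': 0.0}
print(f"macro recall = {macro_recall:.2f}")  # 0.50
```

Per-class or macro-averaged metrics expose the failure on the minority class that plain accuracy hides.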
(2) Proposed Solution
- Balance the dataset through data augmentation
- To enhance security, remove some security-related features during training
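Feature elimination can be sketched as dropping a fixed set of columns before training. This README does not list which features are treated as security-related, so the set `SECURITY_RELATED` below is a hypothetical choice made only for illustration, and the records are toy values rather than real UNSW-NB15 rows.

```python
# Hypothetical choice of columns to drop; the README does not say which
# features the authors treat as security-related.
SECURITY_RELATED = {"srcip", "sport", "dstip", "dsport"}


def drop_features(rows, to_drop):
    """Return copies of the rows without the columns named in to_drop."""
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]


# Toy records (synthetic values, not real UNSW-NB15 rows).
rows = [
    {"srcip": "10.0.0.1", "sport": 1390, "dstip": "10.0.0.2", "dsport": 53,
     "dur": 0.001, "sbytes": 132, "label": 0},
    {"srcip": "10.0.0.3", "sport": 4431, "dstip": "10.0.0.2", "dsport": 80,
     "dur": 0.250, "sbytes": 918, "label": 1},
]

reduced = drop_features(rows, SECURITY_RELATED)
print(sorted(reduced[0]))  # ['dur', 'label', 'sbytes']
```

The original rows are left untouched, so the same data can be used to train models both with and without the eliminated features.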
*Note: the linked Notion page is written in Korean.
- Research on Generative AI for Data Augmentation -> select Generative AI Model
- Related Work
- Network Intrusion Detection Based on Supervised Adversarial Variational Auto-Encoder With Regularization
1. No Data Augmentation -> Train
- Data: ['SMOTE oversampled Data', 'GAN oversampled Data']
- Training Accuracy: [99.9045918367347, 100.0]
- Test Accuracy: [85.42678571428571, 99.995290349927]
2. GAN & SMOTE Data Augmentation -> Train
3. Security-related Feature Elimination -> GAN & SMOTE Data Augmentation -> Train
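One way to read the accuracies reported above is to compare the gap between training and test accuracy for each augmentation method. The short sketch below only does that subtraction on the numbers copied from the output; the variable names are chosen for this example.

```python
# Accuracies copied from the reported output, in the order of the "Data" list.
data = ["SMOTE oversampled Data", "GAN oversampled Data"]
train_acc = [99.9045918367347, 100.0]
test_acc = [85.42678571428571, 99.995290349927]

# Train-test gap in percentage points for each augmentation method.
gaps = {name: tr - te for name, tr, te in zip(data, train_acc, test_acc)}
for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.4f} percentage points")
```

With these numbers the gap is about 14.48 percentage points for the SMOTE-oversampled data and under 0.01 for the GAN-oversampled data.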
The preprocessed UNSW-NB15 dataset is used as datset.csv.
To run: python main.py