
1 Introduction

Intrusion Detection Systems (IDS) are used to detect malicious activities in networks and information systems. Due to increasing network scale and traffic, large volumes of network data are generated almost every second. However, intrusion activities are relatively rare compared to the overall traffic volume, leaving the network data unbalanced.

Using such data to train detection models for IDS is difficult because learning algorithms usually favour the large classes to maximise accuracy and may have difficulty detecting the minority intrusions. Further, minority intrusions may not form proper decision boundaries for the learning algorithms. Decision boundaries are important because they are the regions in a feature space that separate the classes of a data set, allowing the learning algorithms to learn the classes effectively.

In this study, we attempted to improve the detection rates for minority intrusions by balancing the data set involved. The data set used in this study was CICIDS 2017 [1]. First, we attempted under-sampling on the large class. Second, we attempted over-sampling using the Synthetic Minority Over-sampling Technique (SMOTE) [2] on the seven weak intrusion classes that usually give weak detection rates. Finally, we combined both sampling techniques to seek further improvement in intrusion detection.

2 Literature Review

2.1 CICIDS2017 Data Set Overview

The CICIDS2017 data set [1] consists of eight files containing network activities collected over five days. Table 1 shows the class distribution of the data set after combining the eight files. It comprises 2,830,743 instances, 78 features and 15 classes, with no duplicated data. The data set is highly unbalanced, as the BENIGN class takes up 80.3% of the instances.

Table 1 The class distribution of the CICIDS2017 data set

2.2 Sampling Techniques

Unbalanced class distribution is a common problem for real-world data sets such as those in network intrusion detection [1] and credit card fraud detection [3]. The rare classes are often the primary interest of classification [4]. Researchers have proposed several sampling techniques to tackle unbalanced class distributions and improve classification performance, i.e., over-sampling, under-sampling, and combining sampling [5].

2.2.1 Over-Sampling

Over-sampling duplicates instances of minority classes or generates new instances based on the characteristics of the minority classes. This decreases the rarity of the minority classes, thereby decreasing the overall level of class imbalance [4]. A basic over-sampling method is Random Over-Sampling (ROS), which duplicates minority instances randomly [6]. Increasing the size of a minority class using ROS can increase the time taken to build a model and may lead to overfitting [7]. Further, the lack of minority information may persist even after duplicating existing instances using ROS. Studies [7] show that ROS is less effective at improving the detection of minority classes. Therefore, Chawla et al. [2] proposed an advanced over-sampling method, the Synthetic Minority Over-sampling Technique (SMOTE), which creates new minority instances rather than duplicating existing ones. This technique creates synthetic instances using the nearest neighbour rule in the feature space. However, SMOTE considers only the minority classes without accounting for the majority classes. Therefore, increasing the size of the minority classes may increase the chance of overlap among classes [8].
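The nearest-neighbour interpolation at the heart of SMOTE can be sketched in a few lines. The function below is our own minimal illustration (names and parameters are not from the original paper): each synthetic instance lies on the line segment between a minority instance and one of its k nearest minority neighbours.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def smote_sketch(minority, n_new, k=3, seed=42):
    """Generate n_new synthetic samples by interpolating between a
    minority instance and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority instances
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, nb)))
    return synthetic
```

Because each synthetic point is a convex combination of two existing minority points, all new instances stay inside the convex hull of the minority class, which is also why overlap with neighbouring majority regions can occur.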

2.2.2 Under-Sampling

Under-sampling removes existing majority instances to balance a data set. A basic under-sampling technique is Random Under-Sampling (RUS), which removes majority instances randomly. However, this may remove potentially useful information from a data set and degrade classification performance [4, 9].
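A minimal RUS sketch (our own illustration, not code from the paper) keeps every minority instance and retains only a random fraction of the majority class:

```python
import random

def random_under_sample(X, y, majority_label, keep_fraction, seed=42):
    """Keep all minority instances and a random keep_fraction of the
    majority class (simplified RUS sketch)."""
    rng = random.Random(seed)
    majority = [i for i, lbl in enumerate(y) if lbl == majority_label]
    minority = [i for i, lbl in enumerate(y) if lbl != majority_label]
    kept = rng.sample(majority, int(len(majority) * keep_fraction))
    idx = sorted(minority + kept)  # preserve the original ordering
    return [X[i] for i in idx], [y[i] for i in idx]
```

The discarded majority indices are exactly the "potentially useful information" that RUS risks losing.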

2.2.3 Combining Sampling

Combining sampling applies a combination of sampling techniques to an unbalanced data set to improve the classification performance [10]. One example is to combine under-sampling and over-sampling. Das et al. [6] stated that under-sampling should be applied before over-sampling as a data cleaning step because it helps reduce the effect of overlapping classes.
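The under-then-over ordering can be sketched as below. This is our own self-contained illustration; for brevity it uses random duplication (ROS) as a stand-in for SMOTE in the over-sampling step, and all names and parameters are hypothetical:

```python
import random

def under_then_over(X, y, majority, keep_frac, minority, factor, seed=42):
    """Under-sample the majority class first, then over-sample a minority
    class by random duplication (ROS stands in for SMOTE here)."""
    rng = random.Random(seed)
    # step 1: random under-sampling of the majority class
    keep = [i for i, lbl in enumerate(y) if lbl != majority]
    maj = [i for i, lbl in enumerate(y) if lbl == majority]
    keep += rng.sample(maj, int(len(maj) * keep_frac))
    Xs, ys = [X[i] for i in keep], [y[i] for i in keep]
    # step 2: duplicate minority instances until the class reaches
    # `factor` times its current size
    minr = [i for i, lbl in enumerate(ys) if lbl == minority]
    for _ in range(int(len(minr) * (factor - 1))):
        j = rng.choice(minr)
        Xs.append(Xs[j])
        ys.append(ys[j])
    return Xs, ys
```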

3 Methodology

We transformed the CICIDS2017 data set through data pre-processing into a format understandable by the data mining algorithms used. We replaced the missing values with the mean values of the corresponding features. Infinity values were then replaced by values ten times the maximum feature value. We also used Z-score normalisation to standardise all the features because the original ranges of their values varied widely.
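The three pre-processing steps described above can be sketched for a single feature column as follows (a minimal illustration of the described steps, not the paper's actual code; `None` stands in for a missing value):

```python
from statistics import mean, pstdev

def preprocess_feature(values):
    """Mean-impute missing values, cap infinities at 10x the max finite
    value, then z-score normalise one feature column."""
    finite = [v for v in values if v is not None and v != float("inf")]
    m = mean(finite)
    vals = [m if v is None else v for v in values]        # mean imputation
    cap = 10 * max(finite)                                # infinity replacement
    vals = [cap if v == float("inf") else v for v in vals]
    mu, sd = mean(vals), pstdev(vals)                     # z-score normalisation
    return [(v - mu) / sd for v in vals]
```

After normalisation each feature has zero mean and unit standard deviation, so features with originally large ranges no longer dominate distance-based learners such as KNN.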

The unbalanced class distribution biases the learning algorithms towards the majority classes and may produce low detection rates for the minority classes. We used three sampling approaches to address the problem, namely over-sampling, under-sampling and hybrid sampling.

Five learning algorithms were used for intrusion detection, i.e., Gaussian Naïve Bayes (GNB) [11], C4.5 [11], Neural Network (NN-MLP) [12], K-Nearest Neighbour (KNN) [13], and Logistic Regression (LR) [14]. We used tenfold cross-validation to evaluate the performance of the learning algorithms; the data set was split into ten groups for training and testing.
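The tenfold splitting scheme can be sketched as below (our own generic k-fold index generator, not the paper's code; in practice a stratified split is preferable for rare classes):

```python
import random

def kfold_indices(n, k=10, seed=42):
    """Yield (train, test) index lists for k-fold cross-validation:
    each instance appears in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```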

The CICIDS2017 data set is unbalanced, so accuracy is a less suitable metric for evaluating learning algorithms. If the majority class is correctly classified, the accuracy will be high even if the rare classes are wrongly classified. Complementing accuracy with the True Positive Rate (TPR) is a better option because TPR examines the detection performance for each class in the data set.
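Per-class TPR is simply the fraction of each class's instances that are classified correctly. A small sketch (our own helper, not from the paper):

```python
from collections import defaultdict

def per_class_tpr(y_true, y_pred):
    """TPR per class: correctly detected instances divided by the total
    number of instances of that class."""
    tp, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            tp[t] += 1
    return {c: tp[c] / total[c] for c in total}
```

Unlike overall accuracy, a rare class with zero correct detections immediately shows up as a TPR of 0.0 in this dictionary.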

4 Results and Discussion

Table 2 shows the detection results using the learning algorithms, i.e., Gaussian Naïve Bayes (GNB), C4.5, Neural Network (NN-MLP), K-Nearest Neighbour (KNN), and Logistic Regression (LR). Comparing the average TPRs, C4.5 was the best performer among the single classifiers, with an average TPR of 0.9927. We noticed seven weak intrusion classes (bold classes in Table 2) that were hard to detect; below-average TPRs (less than 0.8) were obtained using some of the learning algorithms. They were Bot, DoS Slowloris, Heartbleed, Infiltration, and three Web Attacks (WAs): Brute Force, Sql Injection and XSS. Such performance could be caused by the unbalanced class distribution of the data set, as BENIGN is the overwhelming majority in the data set. The learning algorithms' tendency to favour the majority class (BENIGN) also contributes to this performance.

Table 2 The intrusion detection results for the full CICIDS 2017 data set using the learning algorithms. The classes in bold are the weak intrusion classes that give below-average TPRs

To improve the detection rate overall and for these weak intrusion classes, we attempted under-sampling, over-sampling and a combination of them to balance the data set.

First, we attempted random under-sampling (RUS) on the majority class, BENIGN, to balance the data set and reduce the effect of the BENIGN majority. Table 3 shows the RUS results using C4.5, which was chosen because it gave the best performance among the single classifiers, as shown in Table 2. The best overall accuracy (average TPR of 0.9985) was achieved by reducing BENIGN to between 30 and 90% of its original size. The result was not much different from that of the full data set. However, the TPRs for 12 of the classes were improved, including four of the weak intrusion classes.

Table 3 The intrusion detection results for the CICIDS 2017 data set resampled using RUS on BENIGN. The numbers in bold show better TPRs than the results obtained using the full data set

We then attempted the over-sampling technique, Synthetic Minority Over-sampling Technique (SMOTE), to increase the size of the seven weak intrusion classes. Table 4 shows the results of the over-sampling. The best average TPR (0.9900) was achieved by increasing the size of these minority classes to 250% of their original size, and the result was slightly weaker than that of the full data set. Improvements were noticed for some of the classes, but only three of the weak intrusion classes among them. Overall, the detection performance was slightly weaker than with RUS.

Table 4 The results for the data set yielded using SMOTE on the seven weak intrusion classes (bold). The numbers in bold show better TPRs than those obtained using the full data set

Finally, we combined RUS and SMOTE to seek further improvement in detection. Figure 1 shows the TPRs achieved using the combination of these two sampling techniques. The x-axis represents the percentage of majority class (BENIGN) samples remaining after under-sampling, while the y-axis represents the TPRs. The four line plots represent the percentages of over-sampling applied to the seven minority classes. The best result was achieved with 30% under-sampling on BENIGN and 300% over-sampling on the seven weak intrusion classes, giving an average TPR of 0.9934.

Fig. 1
figure 1

The detection results achieved using the combination of under-sampling and over-sampling. RUS (30%) + SMOTE (300%) gives the best result, an average TPR of 0.9934

Table 5 compares the results for the full data set and the resampled data sets. RUS (30%) achieved the best average TPR (0.9985) among the sampling techniques. Using RUS, we achieved the best TPRs for 11 out of 15 classes compared with the full data set and the data sets produced using SMOTE and RUS (30%) + SMOTE (300%). To conclude, RUS gave a slight improvement in detection overall and for most classes of the CICIDS 2017 data set.

Table 5 The result comparison using the full and resampled data sets. The numbers in bold show the best TPRs, obtained using RUS

5 Conclusion and Future Work

This paper aimed to use sampling techniques to improve the intrusion detection rate on the CICIDS 2017 data set. We attempted under-sampling, over-sampling, and a hybrid of the two to balance the data set, as learning algorithms assume that the data sets involved have balanced class distributions. RUS gave the best overall accuracy, measured using the average TPR. The average TPR obtained was 0.9985, a slight improvement over the full data set and the data sets resampled using SMOTE and RUS + SMOTE. We also noticed an improvement in detecting most of the classes, including some of the weak intrusion classes.