1 Introduction

5G networks provide higher bandwidth and lower latency for edge IoT devices to access the core network, which improves the efficiency of collaboration between edge-side applications and cloud-side services [61]. At the same time, however, they expand the attack surface of the core network and expose enterprise networks to greater security threats. Due to their limited computing capability, IoT devices can hardly deploy heavyweight security protection mechanisms [27, 47, 65]. Security incidents in recent years have shown that, by compromising IoT devices, attackers can launch attacks on enterprise core networks or Internet infrastructure. For example, on Friday, October 21, 2016, cybercriminals launched DDoS attacks on the DNS system in the US-East region via 30,000 maliciously manipulated Wi-Fi cameras, an event known as the Dyn cyberattack [20]. The attack made major Internet platforms and services unavailable to large swathes of users in Europe and North America. A similar incident is the Ukrainian power center attack, in which hackers invaded Ukraine's power center through IoT devices [36].

As a network security protection technology, Intrusion Detection Systems (IDSs) are widely used to protect the core network against external intrusions. Intrusion detection was first proposed in 1980, with the goal of determining whether a malicious intrusion has occurred by monitoring and analyzing behavioral characteristics [6]. Early intrusion detection technology focused on host security. With the rapid development of network technology, the research focus turned to identifying intrusion behavior by analyzing network traffic [30]. Broadly, intrusion detection includes misuse detection and anomaly detection [1, 5]. In the anomaly detection model, when a user's behavior pattern deviates from the normal baseline by more than a threshold, it is regarded as abnormal behavior. In the misuse detection model, when a user's behavior pattern matches an existing malicious behavior pattern, it is regarded as misuse. Therefore, the key to improving the accuracy of intrusion detection lies in the recognition of network traffic patterns. Unfortunately, with the rapid development of operating systems, application software, and network technologies, both normal user behavior and attack behavior are constantly changing. In particular, a steady stream of system vulnerabilities leads to an ever-evolving variety of network attack methods, so the signature database of malicious behaviors cannot be updated fast enough to meet detection requirements.

Over the last three decades, numerous machine learning algorithms have been applied to network intrusion detection to make up for this deficiency of manual analysis, such as support vector machines (SVM) [60], artificial neural networks (ANN) [29], and decision trees [44]. These works show that machine learning methods can indeed improve the efficiency of abnormal traffic analysis and can find abnormal behaviors that manual analysis cannot identify [38, 52, 58, 64, 67,68,69]. However, judging from recent research results, several challenges still need further exploration. First, how can intrusion behavior be perceived without a known traffic signature database? As far as we know, most intrusion detection methods based on supervised or semi-supervised learning require prior data for training, and their detection accuracy for unknown intrusions is generally low. Second, most traffic sampling data contain many different attributes, for example, DARPA KDD CUP99 and NSL-KDD. Some research efforts try to improve the accuracy of anomaly detection by optimizing feature selection [24]. Is there another method that can reduce the sensitivity of detection accuracy to sample feature selection? Finally, when some packet attributes of the sampled network traffic are missing, how can anomaly analysis be performed on such incomplete data?

1.1 Motivation

The 5G network introduces new application scenarios such as enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable low-latency communications (uRLLC), making it the mobile communication infrastructure for a new generation of IoT information systems [19]. While the 5G network provides high-bandwidth and low-latency services, it also brings more severe security challenges to the core network [41, 42, 46, 57, 63], which endows the intrusion detection of IoT systems with characteristics different from those of traditional network systems.

First, in the massive machine-type communications (mMTC) scenario, massive numbers of IoT devices generate a large volume of network packets, putting enormous analysis pressure on the intrusion detection system. According to statistics, data center traffic in 2019 was three times that of 2014, an average annual growth rate of 30%, and the 2014 baseline was already very large at 2.1 ZB. As far as we know, no machine learning algorithm can analyze such huge volumes of network traffic at line speed. Under normal circumstances, intrusion detection systems can only select a subset of network traffic for anomaly analysis. Therefore, how to select the sampled data and the packet properties of network traffic becomes an important technical issue.

Second, the variety of IoT network protocols and emerging computing models makes traffic analysis more complex. Figure 1 shows a typical 5G IoT application scenario. In this scenario, as data traffic travels from the IoT terminal through the edge gateway and transmission network to the edge of the core network, the data link layer and application layer protocols change. Most existing research on intrusion detection focuses on analyzing the characteristics of transport layer protocols. In fact, malicious attackers can implement network intrusions through data link layer and application layer protocols. Moreover, the diversity of protocols results in a very rich set of protocol field types and values, which complicates feature selection and raises the computational complexity of the analysis algorithm. Many IoT intrusion detection tasks need to analyze not only the network layer protocol but also the link layer and physical layer protocols, which further increases the complexity of selecting sampled data and packet properties [10].

Fig. 1 Protocol stack of 5G IoT network

Finally, since the access points of IoT terminals are scattered and numerous, the sample data provided by different access points may lack the values of some protocol properties. For example, the Aegean Wi-Fi Intrusion Dataset (AWID) is a comprehensive 802.11 network dataset derived from real Wi-Fi traffic traces in 2015. Figure 2 shows the results of our analysis of the AWID data: among the 575,643 samples, 329,821 have missing attributes, more than 57%. Without a corresponding data filling mechanism, cluster analysis is difficult to conduct.

Fig. 2 Incomplete sampling data of AWID
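As a concrete illustration of the incompleteness check behind Fig. 2, the following sketch counts samples with at least one missing attribute; the column names and values are hypothetical stand-ins, not the actual AWID schema.

```python
import pandas as pd

# Toy stand-in for AWID-style 802.11 records; the real dataset has many more
# fields. These column names are illustrative only, not the AWID schema.
frames = pd.DataFrame({
    "frame_len":     [342, 128, None, 512, 96],
    "wlan_duration": [44, None, 30, None, 52],
    "signal_dbm":    [-61, -70, -55, None, -48],
})

incomplete = frames.isna().any(axis=1)   # rows missing at least one attribute
ratio = incomplete.mean()                # fraction of incomplete samples
print(f"{incomplete.sum()} of {len(frames)} samples incomplete ({ratio:.0%})")
```

Applied to the full AWID traces, the same check yields the incompleteness ratio reported above.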

In short, the particularities of 5G IoT networks make intrusion detection and analysis more difficult and complicated, especially in terms of traffic property selection and in the case of absent properties. Traditional misuse detection-based IDSs use supervised or semi-supervised learning methods to recognize malicious behavior. Most of these methods rely on the selection of protocol properties and massive amounts of prior data. Research results show that misuse detection-based IDSs are very successful at detecting known intrusions, but poor at detecting unknown abnormal behaviors and 0-day attacks.

Although the composition of the IoT protocol stack is relatively complex, the behavior of IoT applications is not complicated, due to the resource constraints of IoT devices. At the same time, the distribution of IoT devices is scattered, making it difficult to organize complex collusion attacks. Therefore, there are obvious differences between abnormal and normal behaviors: it is possible to separate these behaviors with a clustering method and then determine which clusters are abnormal. Based on this observation and assumption, in this study we choose a clustering method based on unsupervised learning for anomaly detection. Moreover, we use multi-view learning methods to reduce the influence of any single attribute on the detection results. To further enhance the practicality of the algorithm, the proposed clustering analysis algorithm considers, for the first time, sample data with missing attributes.

1.2 Contribution

Based on the above considerations, we propose an anomaly detection method based on unsupervised learning in this paper. The major contributions of this work are summarized as follows:

  • Aiming at the difficulty of selecting traffic attributes in anomaly detection, we propose an analysis method based on the multiple kernel clustering (MKC) algorithm. To reduce the sensitivity of anomaly detection accuracy to single feature selection, our method constructs multiple base kernels from different feature properties and combines these kernels to improve clustering performance.

  • We further consider the pre-processing of sampled data with incomplete attributes. To our knowledge, existing multiple kernel clustering methods cannot address the situation in which some feature properties of the traffic are absent. Most traditional solutions fill missing values with the mean or with zero, or even discard the absent properties, which may result in a lower detection rate. Our method instead supplements the incomplete base kernel with approximate values calculated from the sample data.

  • Existing multiple kernel clustering methods are mainly used for image recognition, since they can only handle continuous numerical data. This paper proposes a method to deal with character and non-continuous data, such as enumerated types and IP addresses, which expands the application field of the multiple kernel clustering method. We also evaluate the performance of the proposed model on multiple benchmark data sets.

1.3 Organization

The remainder of this paper is organized as follows: Sect. 2 discusses related studies and analyses their limitations. The proposed anomaly detection based on multiple kernel K-means clustering algorithm is described in Sect. 3. Section 4 presents the results of experiments and evaluations. Section 5 presents the conclusions and future work.

2 Related works

The 5th generation of mobile communication technology (5G for short), as an extension of 4G (LTE-A, WiMAX-A) systems, provides three types of services: enhanced mobile broadband (eMBB), massive machine-type communications (mMTC), and ultra-reliable and low-latency communications (uRLLC) (ITU 2017). These services support Internet of Things applications such as machine-to-machine (M2M), vehicle-to-everything (V2X), and device-to-everything (D2E) communication, and provide the same user experience as wired networks. Unfortunately, in the 5G era, IoT networks face much greater security risks. The attack surface of 5G IoT involves not only terminal devices but also communication channels and application software. This expansion of the attack surface has made intrusion detection for IoT networks a research hotspot. Many surveys summarize the types of IoT intrusions and the corresponding research directions [3, 14, 35, 54]. According to these surveys, research on IoT intrusion detection mainly covers intrusion detection methods, intrusion detection system deployment, security threat models, and verification methods.

With the rapid development of machine learning technology, research on intrusion detection based on machine learning has attracted widespread attention in recent years. In general, IoT intrusion detection methods include misuse detection, anomaly detection, and hybrid detection. Because this article mainly addresses network intrusion detection for IoT, terminal intrusion detection is not discussed here. The key idea of intrusion detection technology is to determine whether an intrusion has occurred by analyzing the features hidden in sampled data such as system logs and network traffic. Because machine learning has inherent advantages in data analysis, a large number of intrusion detection techniques based on machine learning algorithms have been proposed, mainly including unsupervised machine learning, supervised machine learning, and deep learning [22].

2.1 Supervised learning based intrusion detection

Before deep learning technology was proposed, intrusion detection technology based on supervised learning was the main research direction.

  1. K-nearest Neighbor

K-nearest neighbor (k-NN) is a non-parametric sample classification technique that classifies data by calculating the Euclidean distance between the input sample and its neighbors (Soucy and Mineau). The k-NN classifier is widely used in the field of intrusion detection. For example, Liang et al. [39] use the Minimum Dependence Maximum Significance (MDMS) algorithm to select 6 features from the KDD1999 data set and use k-NN to classify network traffic; the proposed method identifies probe attacks and denial-of-service attacks well. The accuracy of the k-NN classifier is mainly affected by the value of k [17].
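As a hedged illustration (not the exact setup of [39]; the MDMS feature selection step is omitted and the data is synthetic), a k-NN traffic classifier on six features might look as follows:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for 6 selected traffic features: normal vs DoS-like flows.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 6))
attack = rng.normal(loc=3.0, scale=1.0, size=(200, 6))
X = np.vstack([normal, attack])
y = np.array([0] * 200 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)   # accuracy is sensitive to k [17]
knn.fit(X_tr, y_tr)
acc = knn.score(X_te, y_te)
```

On such well-separated synthetic classes the classifier is nearly perfect; real traffic features overlap far more, which is why k must be tuned.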

  2. Support Vector Machines

Compared with other algorithms, support vector machines (SVM) can handle small-sample problems and have better generalization ability. SVM is well suited to classifying data sets with many features, is simple to implement, easy to extend, and can perform anomaly detection in real time. Therefore, a large number of SVM-based intrusion detection methods have been proposed. For example, Ahmim et al. [2] use the Z-score to normalize KDD1999 data, apply a compressed sampling method for feature compression, and classify the compressed results with an SVM. The proposed method has a low false positive rate (FPR) and can effectively detect denial-of-service attacks, probe attacks, and other attacks. Chen et al. [13] use the logarithm of marginal density ratios (LMDRT) as a feature transformation technique to construct an SVM-based IDS.
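A minimal sketch in the spirit of [2], with z-score normalization feeding an SVM; the compressed-sampling step is omitted and the features are synthetic stand-ins:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler   # z-score normalization
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Synthetic traffic features on very different scales (e.g. bytes vs. a rate),
# which is exactly when z-score normalization matters for an SVM.
normal = np.column_stack([rng.normal(500, 100, 300), rng.normal(0.1, 0.05, 300)])
attack = np.column_stack([rng.normal(1500, 100, 300), rng.normal(0.9, 0.05, 300)])
X = np.vstack([normal, attack])
y = np.array([0] * 300 + [1] * 300)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
acc = clf.score(X, y)
```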

  3. Decision Trees

The decision tree (DT) has low computational complexity, and the rules it constructs are easy to understand, so it is also widely used in intrusion detection. For example, Senthilnayaki et al. [51] proposed a smart grid advanced metering infrastructure IDS based on a CART decision tree, with a reported accuracy of 99.66% on the CICIDS2017 data set. The method reaches a highest accuracy of 96.665% on CICIDS2017 with a lowest false alarm rate (FAR) of 1.145%; its accuracy is higher than that of Naive Bayes (74.528%) and other algorithms, and it effectively distinguishes normal traffic from abnormal traffic.

  4. Naive Bayes Networks

The Naive Bayes (NB) network is a probabilistic graphical model that predicts the probability of events based on prior observations of similar events [16]. In supervised learning, Naive Bayes networks are mainly used to classify normal and abnormal behavior based on previous observations. The logic of the NB classifier is simple and easy to implement; it needs only a few training samples to obtain satisfactory results [56]. For example, Nuo [45] proposes a classification method based on the Naive Bayes model, tested on the KDD1999 data set, which can effectively detect Trojan horse attacks, fake message attacks, denial-of-service attacks, and remote unauthorized user access attacks, with a detection rate (DR) of 87–97%.

  5. Ensemble Classifiers

To improve upon the performance of a single classifier, ensemble classifiers were proposed. The main idea is to combine multiple weak learners and produce a classification by majority voting [32]. Bosman et al. [11] show that an ensemble learning (EL) algorithm produces more accurate results than each member classifier, but at the same time, because multiple classifiers run in parallel, the improved accuracy of EL comes at the cost of increased time complexity [9].

2.2 Unsupervised learning based intrusion detection

Intrusion detection technology based on unsupervised learning performs intrusion detection on sample data without reference classifiers. The k-means algorithm has strong interpretability and fast convergence, and, combined with other classification algorithms, it can effectively improve the detection rate. For example, Shah et al. [53] use an improved k-means algorithm to construct a high-quality training data set and combine an SVM with an extreme learning machine (ELM) to construct an IDS that can effectively identify denial-of-service attacks. The traditional k-means algorithm is sensitive to the initial cluster centers, and its accuracy is easily affected by noisy and incomplete data [4]. Since sample data can be represented in different units/scales, most existing distance- or density-based anomaly detection algorithms are sensitive to how the data is expressed. To avoid this problem, literature [7] proposed an unsupervised stochastic-forest-based anomaly detection algorithm called usfAD. Noisy data has a strong negative impact on the accuracy of clustering algorithms, yet most existing clustering algorithms adopt a noise-free assumption. Iam-On [31] uses the multi-kernel k-means clustering method to analyze noisy data; the experimental results show that the approach is robust to low levels of noise. Guo et al. [28] study unsupervised anomaly detection in IoT systems and develop a GRU-based Gaussian Mixture VAE scheme called GGM-VAE; in simulation experiments, the proposed scheme achieves a 47.88% improvement in F1 score on average.

Intrusion detection methods based on unsupervised algorithms can effectively handle the ever-growing volume of network traffic data, reduce computational overhead, and improve detection accuracy. Therefore, as the amount of data in networks keeps increasing, unsupervised machine learning algorithms will be more widely used; however, their sensitivity to noise and outliers remains a challenge for unsupervised machine learning in the field of intrusion detection.

2.3 Deep learning based intrusion detection

Deep learning can use a hierarchical structure to perform unsupervised feature learning and pattern classification, integrating feature extraction and classification into one framework without manual feature engineering. Deep learning can effectively process large-scale network traffic data and achieves higher efficiency and detection rates than traditional machine learning methods, but its training process is more complicated and its models are less interpretable. Intrusion detection technologies based on deep learning mainly include deep autoencoders (AEs) [66], restricted Boltzmann machines (RBMs) [26], deep belief networks (DBNs) [23], and recurrent neural networks (RNNs) [34].

With the tremendous enrichment of machine learning theories, techniques such as reinforcement learning [12] and extreme learning [21] have also been applied to network intrusion detection.

3 A multiple-Kernel clustering-based anomaly detection scheme

3.1 Preliminary

3.1.1 Kernel K-means clustering (KKM)

Clustering is an unsupervised machine learning method that groups observations into classes according to their features. By analyzing a large body of intrusion detection research, we found that network behaviors with different purposes often cause network traffic with different characteristics. Based on this assumption, network traffic caused by abnormal behavior can be distinguished from normal network traffic. Here, we classify network traffic through clustering methods.

K-means is a distance-based clustering algorithm that is widely used due to its simplicity and ease of implementation. However, K-means does not perform well on linearly inseparable data. For example, to distinguish abnormal traffic from normal traffic, we treat each IP packet as a feature vector with multiple attributes, such as IP address, port, and protocol type. Because these attributes interact, it is difficult to achieve linear separation in a low-dimensional space, and thus difficult to obtain satisfactory clustering results directly with the K-means algorithm.

To make up for this shortcoming of K-means, the kernel K-means algorithm was proposed. Its assumption is as follows: a set of points that cannot be linearly separated in a low-dimensional space is more likely to become linearly separable when mapped into a high-dimensional space.

The key idea of the kernel clustering method is to map the data points of the input set into a high-dimensional feature space through a non-linear mapping and perform clustering in a new feature space. Because the nonlinear mapping increases the probability that the data points are linearly separable, a more accurate clustering result could be achieved.

The mapping function \({\varvec{\varPhi}}\) is defined as follows:

$${\varvec{\varPhi}}:{\varvec{x}} \mapsto{\varvec{\varPhi}}\left( {\varvec{x}} \right) \in {\varvec{F}},{\varvec{x}} \in {\varvec{X}}$$
(1)

where X is the original input data set, and \({\varvec{F}}\) is the high-dimensional feature space.

For example, if we want to map feature x into a three-dimensional space, the mapping function \({\varvec{\varPhi}}\) can be represented as follows.

$${\varvec{\varPhi}}\left( {\varvec{x}} \right) = \left( {{\varvec{x}},{\varvec{x}}^{2} ,{\varvec{x}}^{3} } \right)^{{\varvec{T}}}$$
(2)

For all x, z ∊ X, the inner product in the original space is \(\langle {\varvec{x}},{\varvec{z}}\rangle\), and the inner product in the feature space is \(\langle{\varvec{\varPhi}}\left( {\varvec{x}} \right),{\varvec{\varPhi}}\left( {\varvec{z}} \right)\rangle\). Since computing \(\langle{\varvec{\varPhi}}\left( {\varvec{x}} \right),{\varvec{\varPhi}}\left( {\varvec{z}} \right)\rangle\) directly in the high-dimensional feature space is expensive, to improve efficiency we define a kernel function \({\varvec{K}}\) such that, for all x, z ∊ X,

$${\varvec{K}}\left( {{\varvec{x}},{\varvec{z}}} \right) ={\varvec{\varPhi}}\left( {\varvec{x}} \right)^{{\varvec{T}}}{\varvec{\varPhi}}\left( {\varvec{z}} \right)$$
(3)

Obviously, the computational complexity of calculating the kernel function in the low-dimensional space is lower than the complexity of directly calculating the vector inner product in the high-dimensional space.

Given a kernel function \({\varvec{K}}:{\varvec{R}}^{{\varvec{N}}} \times {\varvec{R}}^{{\varvec{N}}} \mapsto {\varvec{R}}\) and a data set \(\left\{ {{\varvec{x}}^{\left( 1 \right)} , \ldots ,{\varvec{x}}^{{\left( {\varvec{M}} \right)}} } \right\}\), where \({\varvec{x}}^{{\left( {\varvec{i}} \right)}} \in {\varvec{R}}^{{\varvec{N}}}\) and \({\varvec{i}} = 1, \ldots ,{\varvec{M}}\), for each pair \({\varvec{x}}^{{\left( {\varvec{i}} \right)}}\) and \({\varvec{x}}^{{\left( {\varvec{j}} \right)}}\) we calculate \({\varvec{K}}_{{{\varvec{ij}}}} = {\varvec{K}}\left( {{\varvec{x}}^{{\left( {\varvec{i}} \right)}} ,{\varvec{x}}^{{\left( {\varvec{j}} \right)}} } \right)\) and obtain a kernel matrix \({\varvec{MK}}_{{{\varvec{M}} \times {\varvec{M}}}}\). According to Mercer's theorem [43], K is a valid kernel function if and only if the kernel matrix \({\varvec{MK}}_{{{\varvec{M}} \times {\varvec{M}}}}\) is a positive semidefinite matrix. By this result, for a given data set we do not need to find the mapping function \({\varvec{\varPhi}}\) explicitly; we only need to construct the kernel matrix from the training set and check whether it is positive semidefinite.
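This check is easy to carry out numerically: build the kernel matrix for a candidate kernel (here a Gaussian kernel on random data, as a sketch) and verify that it is symmetric and positive semidefinite via its eigenvalues.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 5))                      # M = 20 samples, N = 5 features
M = len(X)
MK = np.array([[gaussian_kernel(X[i], X[j]) for j in range(M)] for i in range(M)])

# Mercer check: symmetric and all eigenvalues >= 0 (up to numerical tolerance).
eigvals = np.linalg.eigvalsh(MK)
is_psd = np.allclose(MK, MK.T) and eigvals.min() >= -1e-10
```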

As one of the most important machine learning techniques, the kernel method provides a powerful and unified learning framework. It allows researchers to focus on algorithm design without considering the nature of the data itself, such as strings, vectors, text, or graphs. The key idea of the kernel k-means [50] clustering algorithm is to map data from the input space to a higher-dimensional feature space through a kernel function, such as the polynomial, Gaussian, or sigmoid kernel, and then run the k-means algorithm in that feature space.

The Kernel K-Means clustering model training algorithm is described in Table 1.

Table 1 Kernel K-Means Clustering Algorithm

For kernel K-means clustering, the problem-solving model is described as follows:

Given data set \(\left\{ {{\varvec{x}}^{\left( 1 \right)} , \ldots ,{\varvec{x}}^{{\left( {\varvec{M}} \right)}} } \right\}\), our objective is to partition this dataset into N disjoint clusters \(\left\{ {{\varvec{C}}_{1} ,{\varvec{C}}_{2} ,...,{\varvec{C}}_{{\varvec{N}}} } \right\}\user2{ }\) by using the kernel K-means method. First, we use the function \({\varvec{\varPhi}}\left( x \right)\) to map each \({\varvec{x}}^{{\left( {\varvec{i}} \right)}}\) into a reproducing kernel Hilbert space H [18].

Let \(\overline{{{\varvec{m}}_{{\varvec{k}}} }}\) be the mean of the k-th cluster. The optimization objective of kernel K-means clustering is to minimize the sum of squared within-cluster distances. Using an indicator function \({\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{i}} \right)}} \in {\varvec{C}}_{{\varvec{n}}} } \right) \to \left\{ {0,1} \right\}\), where \({\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{i}} \right)}} \in {\varvec{C}}_{{\varvec{n}}} } \right) = 1\) if \({\varvec{x}}^{{\left( {\varvec{i}} \right)}} \in {\varvec{C}}_{{\varvec{n}}}\) holds and \({\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{i}} \right)}} \in {\varvec{C}}_{{\varvec{n}}} } \right) = 0\) otherwise, the objective can be represented as Eq. (4),

$$\min \left( {\mathop \sum \limits_{{{\varvec{i}} = 1}}^{{\varvec{M}}} \mathop \sum \limits_{{{\varvec{k}} = 1}}^{{\mathbf{N}}} {\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{i}} \right)}} \in {\varvec{C}}_{{\varvec{k}}} } \right)\|{\varvec{\varPhi}}\left( {{\varvec{x}}^{{\left( {\varvec{i}} \right)}} } \right) - \overline{{{\varvec{m}}_{{\varvec{k}}} }}\|^{2} } \right)$$
(4)

where \(\overline{{{\varvec{m}}_{{\varvec{k}}} }} = \frac{{\mathop \sum \nolimits_{{{\varvec{i}} = 1}}^{{\varvec{M}}} {\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{i}} \right)}} \in {\varvec{C}}_{{\varvec{k}}} } \right){\varvec{\varPhi}}\left( {{\varvec{x}}^{{\left( {\varvec{i}} \right)}} } \right)}}{{\mathop \sum \nolimits_{{{\varvec{i}} = 1}}^{{\varvec{M}}} {\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{i}} \right)}} \in {\varvec{C}}_{{\varvec{k}}} } \right)}}\).

In Eq. (4), \(\|{\varvec{\varPhi}}\left( {{\varvec{x}}^{{\left( {\varvec{i}} \right)}} } \right) - \overline{{{\varvec{m}}_{{\varvec{k}}} }}\|^{2}\) can be calculated as follows:

$${\varvec{K}}\left( {{\varvec{x}}^{{\left( {\varvec{i}} \right)}} ,{\varvec{x}}^{{\left( {\varvec{i}} \right)}} } \right) - \frac{{2\mathop \sum \nolimits_{{{\varvec{j}} = 1}}^{{\varvec{M}}} {\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{j}} \right)}} \in {\varvec{C}}_{{\varvec{k}}} } \right){\varvec{K}}\left( {{\varvec{x}}^{{\varvec{i}}} ,{\varvec{x}}^{{\mathbf{j}}} } \right)}}{{\mathop \sum \nolimits_{{{\varvec{j}} = 1}}^{{\varvec{M}}} {\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{j}} \right)}} \in {\varvec{C}}_{{\varvec{k}}} } \right)}} + \frac{{\mathop \sum \nolimits_{{{\varvec{j}} = 1}}^{{\varvec{M}}} \mathop \sum \nolimits_{{{\varvec{l}} = 1}}^{{\varvec{M}}} {\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{j}} \right)}} \in {\varvec{C}}_{{\varvec{k}}} } \right){\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{l}} \right)}} \in {\varvec{C}}_{{\varvec{k}}} } \right){\varvec{K}}\left( {{\varvec{x}}^{{\varvec{j}}} ,{\varvec{x}}^{{\varvec{l}}} } \right)}}{{\mathop \sum \nolimits_{{{\varvec{j}} = 1}}^{{\varvec{M}}} \mathop \sum \nolimits_{{{\varvec{l}} = 1}}^{{\varvec{M}}} {\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{j}} \right)}} \in {\varvec{C}}_{{\varvec{k}}} } \right){\varvec{I}}\left( {{\varvec{x}}^{{\left( {\varvec{l}} \right)}} \in {\varvec{C}}_{{\varvec{k}}} } \right)}}$$
(5)
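Equation (5) is what makes kernel K-means practical: every point-to-centroid distance is computed from kernel matrix entries alone, so the mapping \(\varPhi\) is never evaluated explicitly. A minimal sketch of the resulting assignment loop, assuming a precomputed kernel matrix:

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=50, seed=0):
    """Kernel K-means: distances to cluster means via Eq. (5), kernel entries only."""
    M = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=M)     # random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((M, n_clusters))
        for k in range(n_clusters):
            mask = labels == k
            nk = mask.sum()
            if nk == 0:
                dist[:, k] = np.inf
                continue
            # Eq. (5): K(x,x) - 2*sum_j K(x,xj)/|Ck| + sum_{j,l} K(xj,xl)/|Ck|^2
            dist[:, k] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / nk
                          + K[np.ix_(mask, mask)].sum() / nk ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two well-separated Gaussian blobs, clustered through an RBF kernel matrix.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
labels = kernel_kmeans(K, n_clusters=2)
```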

3.1.2 Multiple-kernel K-means clustering (MKKM)

The traditional kernel method is a single-kernel method based on a single feature space and cannot effectively process huge and heterogeneous information. Its limitations are significant, and the construction and selection of the kernel function remain open problems. To address these problems, multiple kernel learning was proposed [25].

In the multi-kernel case, let X = \(\left\{ {{\varvec{x}}^{\left( 1 \right)} , \ldots ,{\varvec{x}}^{{\left( {\varvec{M}} \right)}} } \right\}\) be a data set in which each \({\varvec{x}}^{{\left( {\varvec{i}} \right)}}\) has m properties, and let \({\varvec{\varPhi}}_{k} \left( \cdot \right):{\varvec{x}} \in {\varvec{X}} \mapsto {\varvec{H}}_{{\varvec{k}}}\) be the k-th feature mapping function, which maps X into a reproducing kernel Hilbert space \({\varvec{H}}_{{\varvec{k}}} \left( {1 \le {\varvec{k}} \le {\varvec{m}}} \right)\). Each \({\varvec{x}}^{{\left( {\varvec{i}} \right)}}\) can then be represented as \({\varvec{\varPhi}}_{{\varvec{\beta}}} \left( {\varvec{x}} \right) = \left[ {{\varvec{\beta}}_{1}{\varvec{\varPhi}}_{1} \left( {\varvec{x}} \right)^{{\varvec{T}}} , \ldots ,{\varvec{\beta}}_{{\varvec{m}}}{\varvec{\varPhi}}_{{\varvec{m}}} \left( {\varvec{x}} \right)^{{\varvec{T}}} } \right]^{{\varvec{T}}}\), where \({\varvec{\beta}} = \left[ {{\varvec{\beta}}_{1} , \ldots ,{\varvec{\beta}}_{{\varvec{m}}} } \right]^{{\varvec{T}}}\) is the coefficient vector of the m base kernels \({\varvec{k}}_{{\varvec{p}}} \left( { \cdot , \cdot } \right)_{{{\varvec{p}} = 1}}^{{\varvec{m}}}\). The multi-kernel function can then be defined as follows:

$${\varvec{K}}_{{\varvec{\beta}}} \left( {{\varvec{x}}_{{\varvec{i}}} ,{\varvec{x}}_{{\varvec{j}}} } \right) = {{\varvec{\Phi}}}_{{\varvec{\beta}}} \left( {{\varvec{x}}_{{\varvec{i}}} } \right)^{{\varvec{T}}} {{\varvec{\Phi}}}_{{\varvec{\beta}}} \left( {{\varvec{x}}_{{\varvec{j}}} } \right) = \mathop \sum \limits_{{{\varvec{p}} = 1}}^{{\varvec{m}}} {\varvec{\beta}}_{{\varvec{p}}}^{2} {\varvec{k}}_{{\varvec{p}}} \left( {{\varvec{x}}_{{\varvec{i}}} ,{\varvec{x}}_{{\varvec{j}}} } \right)$$
(6)

The optimization problem of multiple kernel K-means can then be written as Eq. (7):

$$\begin{gathered} \mathop {\min }\limits_{{{\varvec{H}},{\varvec{\beta}}}} {\varvec{Tr}}\left( {{\varvec{K}}_{{\varvec{\beta}}} \left( {{\varvec{I}}_{{\varvec{n}}} - {\varvec{HH}}^{{\varvec{T}}} } \right)} \right)\user2{ } \hfill \\ {\varvec{s}}.{\varvec{t}}.\user2{ H} \in {\varvec{R}}^{{{\varvec{n}} \times {\varvec{k}}}} ,{\varvec{H}}^{{\varvec{T}}} {\varvec{H}} = {\varvec{I}}_{{\varvec{k}}} ,{\varvec{\beta}}^{{\varvec{T}}} 1_{{\varvec{m}}} = 1,\user2{ } \hfill \\ {\varvec{\beta}}_{{\varvec{p}}} \ge 0,\user2{ }\forall {\varvec{p}} \hfill \\ \end{gathered}$$
(7)

In Eq. (7), \({{\varvec{I}}}_{{\varvec{k}}}\) is the identity matrix of size \({\varvec{k}}\times {\varvec{k}}\). By alternately adjusting H and β, the optimization in Eq. (7) proceeds in two steps: (1) optimizing H with β fixed, in which case H is obtained by solving the kernel k-means clustering optimization problem in Eq. (7); and (2) optimizing β with H fixed, in which case β is obtained by solving a quadratic program with linear constraints.
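The two alternating steps can be sketched as follows. For the H-step we use the standard spectral relaxation (top-k eigenvectors of K_β); for the β-step, note that with K_β = Σ_p β_p² k_p the objective separates into Σ_p β_p² c_p with c_p = Tr(K_p(I − HHᵀ)), so under Σ_p β_p = 1, β_p ≥ 0 the quadratic program has the closed-form solution β_p ∝ 1/c_p. This is a simplified illustration of MKKM, not the full algorithm proposed in this paper:

```python
import numpy as np

def mkkm(kernels, n_clusters, n_iter=10):
    """Alternate the two steps of Eq. (7): H-step (spectral), beta-step (QP)."""
    m = len(kernels)
    n = kernels[0].shape[0]
    beta = np.full(m, 1.0 / m)
    I = np.eye(n)
    for _ in range(n_iter):
        K_beta = sum(b ** 2 * K for b, K in zip(beta, kernels))
        # H-step: spectral relaxation of kernel k-means on the combined kernel.
        eigvals, eigvecs = np.linalg.eigh(K_beta)
        H = eigvecs[:, -n_clusters:]              # top-k eigenvectors, H^T H = I_k
        # beta-step: closed-form minimizer of sum_p beta_p^2 * c_p on the simplex.
        c = np.array([np.trace(K @ (I - H @ H.T)) for K in kernels])
        beta = 1.0 / np.maximum(c, 1e-12)
        beta /= beta.sum()
    return H, beta

# Two base kernels built from two "views" (feature subsets) of the same data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (25, 4)), rng.normal(4, 0.5, (25, 4))])
kernels = []
for V in (X[:, :2], X[:, 2:]):
    sq = ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)
    kernels.append(np.exp(-sq / 2.0))
H, beta = mkkm(kernels, n_clusters=2)
```

The final cluster labels would be obtained by running k-means on the rows of H, as in standard spectral clustering.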

3.2 The proposed method

3.2.1 Basic idea

The effectiveness of the multi-kernel clustering algorithm has been demonstrated not only in theoretical research but also in many application fields, such as image analysis and pattern recognition. For IoT network intrusion detection, in the absence of prior data and a classification specification, we can first perform cluster analysis on the sampled data and then verify the clustering results. Based on this consideration, we apply the multi-kernel clustering method to network traffic anomaly detection.

Unfortunately, because of problems such as incomplete data and diverse data types, the multi-kernel clustering method is difficult to apply directly to network anomaly analysis. In this paper, we propose a method that addresses the following two problems:

  1. Incomplete data. Most existing multi-kernel clustering algorithms cannot perform cluster analysis when the kernel matrix is incomplete. Because some attributes are missing from the sampled IoT network data, it is difficult to construct a complete kernel matrix. We therefore adopt a method that integrates the information-source filling task and the clustering task into the same optimization objective, so that the filling process and the clustering process reinforce each other and the performance of the algorithm improves.

  2. Diverse data. In image recognition applications, the similarity of data can be obtained by calculating the Euclidean distance between pixels. In network traffic analysis, the situation is completely different: the diversity of IoT protocols causes the protocol attributes of messages to differ not only in data type but also in value range. Some attributes would therefore carry excessive weight in the clustering process and distort the final clustering results, so the data must be preprocessed before clustering.

3.2.2 Multiple kernel K-means with incomplete kernels

The multiple kernel K-means clustering method is suitable for cluster analysis of data with multiple attributes, but it requires a complete kernel matrix. Due to the complexity and diversity of the IoT network environment, the sampled data available for intrusion detection often has missing attributes, as in the example in Fig. 2. Most existing solutions use mean-value filling or zero-value filling to complete the kernel matrix. These approaches do not fully exploit the potential correlations between the sample data and are not conducive to improving detection accuracy. Inspired by the work in [40] and [62], this section proposes a method for constructing a kernel matrix from incomplete data, filling in missing attribute values based on the similarity of the sampled data and thereby improving detection accuracy. The main principle of this method is described as follows:

For ease of explanation, we assume that there is a multi-view dataset \({\varvec{X}} = \left\{ {{\varvec{X}}_{1} ,{\varvec{X}}_{2} , \ldots ,{\varvec{X}}_{m} } \right\}\) with \(m\) views, and that each view \({\varvec{X}}_{p} \left( {1 \le p \le m} \right)\) can be partitioned into an observed part and an unobserved part, denoted as follows:

$${\varvec{X}}_{p} = \left[ {{\varvec{X}}_{p}^{{\left( {\varvec{o}} \right)}} ,{\varvec{X}}_{p}^{{\left( {\varvec{u}} \right)}} } \right]^{{\text{T}}}$$

We assume that at least one sample in each view is observable. Suppose that the part \({\varvec{X}}_{p}^{{\left( {\varvec{o}} \right)}}\) is observed and the part \({\varvec{X}}_{p}^{{\left( {\varvec{u}} \right)}}\) is missing. Before cluster analysis, the absent information must be filled in. The key idea is as follows:

To compute the kernel matrix between samples, we need a positive definite kernel function \({\varvec{\kappa}}\left( { \cdot , \cdot } \right)\). We first find the k nearest neighbors of the observed part \({\varvec{X}}_{p}^{{\left( {\varvec{o}} \right)}}\) and then compute the kernel matrix between \({\varvec{X}}_{p}^{{\left( {\varvec{o}} \right)}}\) and its k nearest neighbors, denoted \({\mathbf{K}}_{p}^{{\left( {{\varvec{oo}}} \right)}}\). At the same time, the constraint that must be met is that the completed kernel matrix, restricted to the observed samples, equals this known kernel matrix. The complete kernel matrices \(\{ {\mathbf{K}}_{p} \}_{p = 1}^{m}\) should therefore minimize the following formulation:

$$\begin{gathered} \mathop {\min }\limits_{{\{ {\mathbf{K}}_{p} \}_{p = 1}^{m} ,{\mathbf{H}}}} \mathop \sum \limits_{p = 1}^{m} {\text{Tr}}\left( {{\mathbf{K}}_{p} \left( {{\mathbf{I}}_{n} - {\mathbf{HH}}^{{\text{T}}} } \right)} \right) \hfill \\ s.t. {\mathbf{K}}_{p} \left( {{\varvec{s}}_{p} ,{\varvec{s}}_{p} } \right) = {\mathbf{K}}_{p}^{{\left( {{\varvec{oo}}} \right)}} ,{\mathbf{K}}_{p}{ \succcurlyeq }0,\forall p,{\mathbf{H}} \in {\mathbb{R}}^{n \times k} ,{\mathbf{H}}^{{\text{T}}} {\mathbf{H}} = {\mathbf{I}}_{k} , \hfill \\ \end{gathered}$$
(8)

where \({\varvec{s}}_{p}\) are the indices of the observed instances of the \(p\)-th view.
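The known block \({\mathbf{K}}_{p}^{(oo)}\) and the equality constraint of Eq. (8) can be illustrated as follows (a minimal sketch that omits the k-nearest-neighbor step for brevity; the Gaussian kernel and the toy data are assumptions):

```python
import numpy as np

def gaussian_kernel(A, sigma=1.0):
    # a positive definite kernel kappa(., .) evaluated on the rows of A
    d2 = ((A[:, None, :] - A[None, :, :])**2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(1)
X_p = rng.normal(size=(10, 5))          # samples of the p-th view
obs = np.array([0, 1, 2, 3, 4, 5, 6])   # indices of observed samples (s_p)
K_oo = gaussian_kernel(X_p[obs])        # known block K_p^(oo)

# any completed kernel K_p must agree with K_oo on the observed block;
# the missing rows/columns (here left at zero) are what Eq. (8) fills in
K_p = np.zeros((10, 10))
K_p[np.ix_(obs, obs)] = K_oo
```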

The optimization w.r.t. \(\{{\mathbf{K}}_{p}{\}}_{p=1}^{m}\) in Eq. (8) is a programming problem with positive semi-definite constraints, whose computational efficiency is rather low for large-scale problems. To overcome this defect, we instead optimize the objective function:

$$\mathop {\min }\limits_{{{\mathbf{K}}_{p} }} {\text{Tr}}({\mathbf{K}}_{p} {\mathbf{U}}), s.t.{\mathbf{K}}_{p} \left( {{\varvec{s}}_{p} ,{\varvec{s}}_{p} } \right) = {\mathbf{K}}_{p}^{{\left( {{\varvec{oo}}} \right)}} ,{\mathbf{K}}_{p}{ \succcurlyeq }0$$
(9)

where \({\mathbf{U}} = {\mathbf{I}}_{n} - {\mathbf{HH}}^{{\text{T}}}\). Consider the decomposition \({\mathbf{K}}_{p} = {\mathbf{A}}_{p} {\mathbf{A}}_{p}^{{\text{T}}}\), where \({\mathbf{A}}_{p}^{\left( o \right)}\) is the observable part and \({\mathbf{A}}_{p}^{\left( u \right)}\) is the unobservable part, i.e., \({\mathbf{A}}_{p} = \left[ {{\mathbf{A}}_{p}^{\left( o \right)} ;{\mathbf{A}}_{p}^{\left( u \right)} } \right]\). We can then transform Eq. (9) into:

$$\mathop {\min }\limits_{{{\mathbf{A}}_{p}^{{\left( u \right)}} }} {\text{Tr}}\left( {[{\mathbf{A}}_{p}^{{\left( o \right)}} ;{\mathbf{A}}_{p}^{{\left( u \right)}} ]^{{\text{T}}} \left[ {\begin{array}{*{20}c} {{\mathbf{U}}^{{\left( {oo} \right)}} } & {{\mathbf{U}}^{{\left( {ou} \right)}} } \\ {{\mathbf{U}}^{{\left( {ou} \right){\text{T}}}} } & {{\mathbf{U}}^{{\left( {uu} \right)}} } \\ \end{array} } \right]\left[ {{\mathbf{A}}_{p}^{{\left( o \right)}} ;{\mathbf{A}}_{p}^{{\left( u \right)}} } \right]} \right)$$
(10)

where \(\left[ {\begin{array}{*{20}c} {{\mathbf{U}}^{{\left( {oo} \right)}} } & {{\mathbf{U}}^{{\left( {ou} \right)}} } \\ {{\mathbf{U}}^{{\left( {ou} \right){\text{T}}}} } & {{\mathbf{U}}^{{\left( {uu} \right)}} } \\ \end{array} } \right]\) is a blocked form of \({\mathbf{U}}\).

By setting the derivative of Eq. (10) with respect to \({\mathbf{A}}_{p}^{\left( u \right)}\) to zero, we obtain the closed-form solution for \({\mathbf{A}}_{p}^{\left( u \right)}\):

$${\mathbf{A}}_{p}^{{\left( u \right)}} = - \left( {{\mathbf{U}}^{{\left( {uu} \right)}} } \right)^{{ - 1}} {\mathbf{U}}^{{\left( {ou} \right){\text{T}}}} {\mathbf{A}}_{p}^{{\left( o \right)}}$$
(11)

Moreover, the optimal \({\mathbf{H}}\) can be obtained by taking the \(k\) eigenvectors corresponding to the \(k\) largest eigenvalues of \(\mathop \sum \limits_{p = 1}^{m} {\mathbf{K}}_{p}\).

The filling algorithm for the incomplete kernel matrix is described in Table 2, and the symbols it uses are described in Table 3.

Table 2 Filling Algorithm for Incomplete Kernel Matrix
Table 3 Symbol description of algorithm 2
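The alternating fill-and-cluster loop of Eqs. (8)–(11) can be sketched in Python as follows. This is a minimal illustration rather than the exact algorithm of Table 2: factorizing the observed block via eigendecomposition and adding a small ridge term before inverting \({\mathbf{U}}^{(uu)}\) are assumptions of this sketch.

```python
import numpy as np

def fill_kernels(K_list, obs_list, k, iters=5):
    """Alternately fill incomplete kernels (Eq. 11) and update H.

    K_list:   initial n x n kernel matrices (missing entries may start at 0)
    obs_list: per-view index arrays of observed samples (s_p)
    """
    n = K_list[0].shape[0]
    K_list = [K.copy() for K in K_list]
    for _ in range(iters):
        # H: top-k eigenvectors of the sum of the current kernels
        w, V = np.linalg.eigh(sum(K_list))
        H = V[:, np.argsort(w)[::-1][:k]]
        U = np.eye(n) - H @ H.T
        for K, obs in zip(K_list, obs_list):
            mis = np.setdiff1d(np.arange(n), obs)
            if mis.size == 0:
                continue  # this view is complete
            # factor the observed block: K^(oo) = A_o A_o^T
            w_o, V_o = np.linalg.eigh(K[np.ix_(obs, obs)])
            A_o = V_o * np.sqrt(np.clip(w_o, 0, None))
            # closed-form fill (Eq. 11): A_u = -(U^(uu))^-1 U^(ou)T A_o
            U_uu = U[np.ix_(mis, mis)]
            U_ou = U[np.ix_(obs, mis)]
            A_u = -np.linalg.solve(U_uu + 1e-8 * np.eye(mis.size), U_ou.T @ A_o)
            A = np.zeros((n, A_o.shape[1]))
            A[obs], A[mis] = A_o, A_u
            K[:] = A @ A.T  # K_p = A_p A_p^T; observed block is preserved
    return K_list, H

# toy usage: view 0 misses samples 6 and 7; view 1 is complete
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))
K_full = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1))
obs0 = np.arange(6)
K0 = np.zeros((8, 8))
K0[np.ix_(obs0, obs0)] = K_full[np.ix_(obs0, obs0)]
filled, H = fill_kernels([K0, K_full.copy()], [obs0, np.arange(8)], k=2)
```

Note how the constraint of Eq. (8) holds by construction: the observed block of each filled kernel equals the known \({\mathbf{K}}_{p}^{(oo)}\) after every iteration.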

3.2.3 Algorithm computation complexity

The construction of the kernel matrix of the observable samples in the \({\varvec{p}}\)-th view costs \({\mathcal{O}}\left( {n^{2} } \right)\), where \(n\) is the number of samples. When \({\mathbf{H}}\) is fixed, updating each kernel matrix takes \({\mathcal{O}}\left( {n^{3} } \right)\) time, due to the calculation of the inverse of \({\mathbf{U}}^{{\left( {uu} \right)}}\); updating \(\{ {\mathbf{K}}_{p} \}_{p = 1}^{m}\) therefore takes \({\mathcal{O}}\left( {mn^{3} } \right)\) time in total. Updating \({\mathbf{H}}\) requires an eigendecomposition of an \(n \times n\) matrix, which costs \({\mathcal{O}}\left( {n^{3} } \right)\) time. Assuming \(T\) iterations, the overall computational complexity of our algorithm is \({\mathcal{O}}\left( {n^{2} + T\left( {m + 1} \right)n^{3} } \right)\).

3.2.4 Data pre-process

In machine learning tasks, the attributes of sample data are not always continuous; they may also be discrete, such as the various attributes of IP packets. There are mainly two situations: (1) the values of a discrete feature carry no ordinal meaning, such as the protocol type; (2) the values of a discrete feature do carry magnitude, such as the message length. Since we calculate the similarity between samples as a distance in vector space, we use one-hot encoding for discrete attributes without ordinal meaning, so as to preserve their unordered character. After one-hot encoding, the feature in each dimension can be regarded as continuous and can be normalized, for example to [−1, 1], or standardized to zero mean and unit variance.

For example, consider the IP protocol type {“TCP”, “UDP”, “ICMP”}: if we directly use numbers to represent the values, we destroy the distribution characteristics of the attribute, because protocol types have no numerical order. To solve this problem, we use one-hot encoding, which encodes N states with an N-bit status register; each state has its own independent register bit, and only one bit is set at any time. For {“TCP”, “UDP”, “ICMP”}, the one-hot codes can be {“001”, “010”, “100”}. This encoding maps the data into a sparse space and allows algorithms that expect numeric input to handle categorical attributes.
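The encoding described above can be sketched as follows (the packet list is a hypothetical example; identity-matrix rows serve as the status-register bits):

```python
import numpy as np

# one-hot codes for an attribute whose values have no ordinal meaning
protocols = ["TCP", "UDP", "ICMP"]
codes = {p: np.eye(len(protocols), dtype=int)[i]
         for i, p in enumerate(protocols)}

packets = ["UDP", "TCP", "UDP", "ICMP"]       # hypothetical sampled field values
encoded = np.array([codes[p] for p in packets])

# the Hamming distance between any two distinct one-hot codes is always 2,
# so no artificial ordering is imposed between protocol types
hamming = lambda a, b: int((a != b).sum())
```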

Usually, the attributes of IP packets include integer/real types and enumeration types. Since the clustering algorithm operates on numeric data, the traffic data must first be preprocessed. The main steps are encoding, normalization, and dimensionality reduction.

Step1: Encoding.

Sampled network traffic usually consists of a group of IP packets whose protocol fields include numeric values and enumerated values. Numeric values are kept as they are. Enumerated values are one-hot encoded, and the Hamming distance is then used to calculate the standard deviation during normalization. Taking the NSL-KDD data set as an example, it contains seven enumerated attributes, whose one-hot codes are listed in Table 4.

Table 4 The one-hot code for IP packet attribute

Step2: Standardization and normalization.

Due to the large differences in the value ranges of different protocol attributes, a single scale cannot be used in the similarity calculation: attributes with larger values would receive higher weight in the clustering process, affecting the final clustering accuracy. To solve this problem, we define a standardization function \(\user2{ S}:{\mathbf{x}} \mapsto {\varvec{S}}\left( {\varvec{x}} \right) \in {\varvec{R}}\). Suppose \(\{ {\varvec{x}}_{{\varvec{i}}} \}_{{{\varvec{i}} = 1}}^{{\varvec{M}}} \subset {\varvec{X}}\) is a sample data set of size M in which each sample has N attributes.

$$x_{ij}^{\prime } = \frac{{x_{ij} - \overline{x}}}{{\frac{1}{M}\left( {\left| {x_{1j} - \overline{x}} \right| + \cdots + \left| {x_{Mj} - \overline{x}} \right|} \right)}}$$
(13)
$${\text{where }} \overline{x} = \frac{1}{M}\left( {x_{1j} + \ldots + x_{Mj} } \right)$$

For each attribute value \(x_{ij}\) of the sampled data, we use the function S, shown in Eq. (13), to calculate \(x_{ij}^{\prime}\). If \(x_{ij}\) is numeric, \(x_{ij} - \overline{x}\) is the algebraic difference between \(x_{ij}\) and \(\overline{x}\); if \(x_{ij}\) is enumerated, \(x_{ij} - \overline{x}\) is the Hamming difference between \(x_{ij}\) and \(\overline{x}\).

After the data has been standardized, it is further normalized and mapped to the [0, 1] interval. The calculation method is as follows:

$$x_{ij}^{\prime } = \frac{{x_{ij}^{\prime } - \min \left\{ {x_{1j}^{\prime } ,x_{2j}^{\prime } , \ldots ,x_{Mj}^{\prime } } \right\}}}{{\max \left\{ {x_{1j}^{\prime } ,x_{2j}^{\prime } , \ldots ,x_{Mj}^{\prime } } \right\} - \min \left\{ {x_{1j}^{\prime } ,x_{2j}^{\prime } , \ldots ,x_{Mj}^{\prime } } \right\}}}$$
(14)
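For numeric attributes, Eqs. (13) and (14) can be sketched as follows (a minimal numpy version; the guards against zero denominators are additions of this sketch, and the toy matrix is an assumption):

```python
import numpy as np

def standardize(X):
    # Eq. (13): center by the column mean and scale by the
    # mean absolute deviation of each attribute
    mean = X.mean(axis=0)
    mad = np.abs(X - mean).mean(axis=0)
    return (X - mean) / np.where(mad > 0, mad, 1.0)

def min_max(X):
    # Eq. (14): map each standardized attribute to [0, 1]
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

# toy data: two attributes with very different value ranges
X = np.array([[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]])
Xn = min_max(standardize(X))
```

After both steps, every attribute lies in [0, 1], so no attribute dominates the similarity calculation by virtue of its raw scale alone.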

Step3: Dimensionality reduction.

Although the multiple kernel learning algorithm is not sensitive to changes in individual attributes during clustering and achieves a good clustering effect, its computational complexity is high, at O(N³). It is therefore necessary to reduce the dimensionality of the data before cluster analysis. Inspecting the original NSL-KDD, AWID and other test data, we found that these data sets contain many records with identical or highly correlated features. Such records have no significant effect on the clustering results but increase the computational cost, so we use principal component analysis to reduce the dimensionality.
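The dimensionality-reduction step can be sketched with a plain SVD-based PCA as follows (the synthetic low-rank data stands in for redundant traffic features and is an assumption of this sketch):

```python
import numpy as np

def pca_reduce(X, d):
    # project centered data onto its top-d principal components
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T

rng = np.random.default_rng(3)
# 41-dimensional records (as in NSL-KDD) built from only 5 latent factors,
# i.e. the columns are heavily redundant
base = rng.normal(size=(200, 5))
X = base @ rng.normal(size=(5, 41))
X_red = pca_reduce(X, d=5)
```

Because the toy data has rank 5, reducing to 5 components loses no variance; on real traffic data, d would be chosen from the explained-variance spectrum.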

3.2.5 Anomaly detection based on multiple kernel K-means clustering

So far, having addressed the problems of incomplete data and data diversity, we can perform anomaly detection on IoT network traffic with the multi-kernel K-means clustering method. The whole procedure is as follows: the first step is feature selection on the sampled data; by exploiting the advantages of multi-kernel learning, the selected traffic features can cover different protocol layers, such as the data link layer, network layer, transport layer, and application layer. The second step is data normalization. Finally, the multi-kernel K-means clustering method clusters the data and yields the traffic classification result. The flow of the whole algorithm is shown in Fig. 3, and its pseudo code is shown in Table 5.

Fig. 3
figure 3

Anomaly detection based on Multiple Kernel K-means Clustering

Table 5 Anomaly detection Algorithm based on MKKC

4 Experiments

4.1 Datasets used

To verify the effectiveness of the proposed method, we select several data sets for testing: NSL-KDD (University of New Brunswick), UNSW_NB15 (Australian Centre for Cyber Security) and AWID (Aegean Wi-Fi Intrusion Dataset).

The NSL-KDD data set is a de-duplicated refinement of KDD CUP99: part of the duplicate data is removed from the training and test data, but the format is the same as that of KDD CUP99. Each record in NSL-KDD contains 41 attribute features (32 continuous and 9 discrete), plus a class label appended to each record indicating whether it is normal or an attack. There are four attack types: DoS, Probing, R2L, and U2R.

The UNSW_NB15 data set was collected by the Australian Centre for Cyber Security (ACCS) Cyber Range Laboratory in 2015. The laboratory used the IXIA PerfectStorm tool to generate new and updated attack behaviors based on information from the CVE site, and used the tcpdump tool to capture network traffic, ultimately obtaining 100 GB of traffic that mixes normal behavior with contemporary attacks. Compared with NSL-KDD, the biggest advantage of this data set is that it contains contemporary stealthy attack methods, which more accurately reflect real network traffic. UNSW_NB15 contains more than 2.54 million records, whose abnormal behaviors fall into 9 categories: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.

The Aegean Wi-Fi Intrusion Dataset (AWID) is a comprehensive 802.11 network dataset derived from real Wi-Fi traffic traces in 2015, collected via network equipment in an actual network environment. Unlike NSL-KDD and UNSW_NB15, each AWID record has 155 attributes and contains link-layer protocol information. AWID covers three attack types: Flooding, Impersonation, and Injection.

4.2 Experiment setup

The proposed method is experimentally evaluated on the three widely used intrusion detection benchmark data sets listed in Sect. 4.1. We constructed two experiments. The first evaluates the effectiveness of the filling algorithm for an incomplete kernel matrix; the second evaluates the performance of anomaly detection based on MKKC. The experimental environment is built on the Linux operating system running on a host with an Intel Core i7 CPU at 3.6 GHz and 16 GB RAM. The development environment is MATLAB 2014a with the SimpleMKL toolbox.

4.3 Experiment 1: effectiveness of the filling algorithm for incomplete kernel matrix

For ease of description, we name Algorithm 2 proposed in Sect. 3 MKKC-IC. We choose two other multi-kernel k-means methods for comparison, MKKC-MF and MKKC-ZF, which fill in the missing attributes in the sample data with the mean value and with zero, respectively. We use clustering accuracy (CA), normalized mutual information (NMI) and purity to evaluate the performance of the three methods.
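The purity and NMI metrics used here can be sketched as follows (CA additionally requires an optimal cluster-to-class matching, e.g. via the Hungarian algorithm, and is omitted; the label vectors are toy examples):

```python
import numpy as np
from collections import Counter

def purity(labels_true, labels_pred):
    # fraction of samples assigned to the majority true class of their cluster
    total = 0
    for c in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels_true)

def nmi(labels_true, labels_pred):
    # normalized mutual information: I(T;P) / sqrt(H(T) * H(P))
    t, p = np.asarray(labels_true), np.asarray(labels_pred)
    n = len(t)
    def H(x):
        _, cnt = np.unique(x, return_counts=True)
        q = cnt / n
        return -(q * np.log(q)).sum()
    I = 0.0
    for a in np.unique(t):
        for b in np.unique(p):
            pab = ((t == a) & (p == b)).mean()
            if pab > 0:
                I += pab * np.log(pab / ((t == a).mean() * (p == b).mean()))
    return I / max(np.sqrt(H(t) * H(p)), 1e-12)

# identical partitions under different cluster labels score perfectly
true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]
```

Both metrics are invariant to cluster relabeling, which is why they suit unsupervised evaluation.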

Considering the high complexity of the algorithm, we randomly selected 500 sample records and 15 features from the benchmark data sets for analysis. In fact, when detecting unknown abnormal behavior in a real network environment, large amounts of sampled data are often lacking, so our data selection strategy is reasonable. We apply both a polynomial kernel and a Gaussian kernel to the features. In the experiment, we randomly select samples and vary the proportion of missing data while observing the changes in CA, NMI and purity of the three methods above.

The experimental results are shown in Figs. 4, 5 and 6. The horizontal axis represents the missing ratio, and the vertical axes represent clustering accuracy (CA), normalized mutual information (NMI) and purity, respectively.

Fig. 4
figure 4

Experimental result of Clustering Accuracy (CA)

Figures 4, 5 and 6 show the performance of the four multi-kernel methods when processing incomplete sample data. In this experiment, multiple kernel k-means (MKKM) on the complete kernel matrix serves as the reference target; since no attribute is missing, its performance is the highest. As the missing ratio increases, the method whose performance stays closest to MKKM is MKKC-IC. This result confirms our assumption about data filling: filling the missing kernel matrix based on similarity helps increase the accuracy of clustering. Although multi-kernel methods have mainly been applied to image recognition, speech recognition, etc., and have mostly been tested on image data, the results of this experiment show that multi-kernel k-means clustering can also capture the inherent characteristics of network traffic data and achieves a good clustering effect.

Fig. 5
figure 5

Experimental result of Normalized mutual information

Fig. 6
figure 6

Experimental result of Purity

Table 6 lists the average performance of the four clustering algorithms when the sample missing rate is 10%, with the highest performance shown in bold. It can be seen from Table 6 that filling the incomplete kernel matrix with the MKKC-IC algorithm further improves the accuracy of the clustering results; in this experiment, MKKC-IC improves on the traditional MKKC-ZF and MKKC-MF algorithms by 4%. In addition, Figs. 4, 5 and 6 show that when the MKKC-IC algorithm is used to cluster sampled data with missing attributes, the overall effect is better than that of the traditional MKKC-ZF and MKKC-MF algorithms.

Table 6 ACC, NMI and Purity comparison (mean_std)

4.4 Experiment 2: intrusion detection

To verify the effectiveness of the proposed intrusion detection method, this section uses true positive rate (TPR), false positive rate (FPR), precision, accuracy and F-score as evaluation metrics. Since the proposed method is based on a multi-kernel clustering algorithm, three typical clustering algorithms are selected for comparison: the density peaks (DP) algorithm [49], the K-means algorithm and the Gaussian Mixture Model (GMM) algorithm [48]. We first use these algorithms to cluster NSL-KDD, UNSW_NB15, and AWID, and then compute statistics on the clustering results. In the statistical process, only normal and abnormal behaviors are distinguished; the clustering results are not further classified into specific attack types.
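The evaluation metrics can be computed from the binary confusion counts as follows (a plain-Python sketch; the toy label vectors are assumptions, with 1 denoting abnormal traffic):

```python
def binary_metrics(y_true, y_pred):
    # y = 1 for abnormal traffic, 0 for normal traffic
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0    # true positive rate (recall)
    fpr = fp / (fp + tn) if fp + tn else 0.0    # false positive rate
    prec = tp / (tp + fp) if tp + fp else 0.0   # precision
    acc = (tp + tn) / len(y_true)               # accuracy
    f1 = 2 * prec * tpr / (prec + tpr) if prec + tpr else 0.0  # F-score
    return dict(TPR=tpr, FPR=fpr, precision=prec, accuracy=acc, F1=f1)

# toy example: 3 abnormal and 3 normal records
m = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```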

To test the algorithms' ability to identify abnormal traffic in small batches of data, 1000 records were selected from each data set for multiple tests. The number of abnormal traffic packets is increased from 100 to 500 in steps of 100. Finally, we average the test results. Table 7 shows the results of the performance test.

Table 7 Result of performance test

The experimental results in Table 7 show that the multi-kernel clustering method helps to obtain more stable and accurate anomaly detection results. In addition, the DP, K-means and GMM algorithms are susceptible to the influence of feature selection, which leads to large swings in TPR, precision and accuracy. In Table 7, the highest values of the experimental results are marked in bold.

5 Conclusion

The kernel method is a powerful tool for overcoming the linear inseparability of low-dimensional vector spaces, but a single kernel is not good at clustering high-dimensional vectors; the multi-kernel method has proved to be a more advanced and effective solution. In practice, however, missing sampled data hinders the use of multi-kernel clustering algorithms. In this paper, we analyze the problem of missing attributes in sample data for intrusion detection, and we propose an intrusion detection framework for 5G and IoT networks based on multiple kernel k-means with incomplete kernels. The experimental results show that the proposed method achieves high-accuracy clustering even when the sampled data is incomplete. Nevertheless, the method still has shortcomings in processing performance, especially when handling massive amounts of data; we strive to address these issues in follow-up work.

Existing multi-kernel clustering methods generally suffer from high computational complexity and cannot process large-scale sampled data in real time, and the method proposed in this article is no exception. In follow-up research, we will therefore further explore the topological characteristics of the 5G IoT and design a hierarchical clustering method to mitigate the problem of massive data volumes.