1 Introduction

The experts have already anticipated the growth of wireless network traffics [1]. Mobile traffics, including mobile 5G and Wi-Fi traffic are believed to increase tremendously, in the next 10 years [1]. Unfortunately, as the traffic increases, a number of malicious attacks by adversaries are have jumped accordingly [2]. An impersonation attack is one of common Wi-Fi attacks [3]. In this attack, adversaries can impersonate themselves as legitimate clients to gain unauthorized access. The impersonation attack also has a severe impact due to allowing unauthorized users to access the network as a security breach [4].

Intrusion Detection System (IDS) become a promising countermeasure for network attacks by leveraging machine learning application occasionally. Machine learning, based on data label availability, can be divided into two approaches: supervised and unsupervised learning. A supervised learning needs prior information about the class label data. The supervised learning fits for the classification task, including attack detection. In the latter case, an unsupervised learning learns without any prior information about the class label of raw data. Therefore, the unsupervised learning fits for clustering task, which makes an efficient way to group similar data. In terms of attack detection, we may leverage unsupervised approach by claiming the outlier data from big clusters as attacks, since benign data usually form a big cluster. Besides that, unsupervised learning is suitable for huge Wi-Fi networks as labeling training data may be infeasible.

There are numerous famous unsupervised learning methods such as k-means clustering [5], Principal Component Analysis (PCA) [6] and Independent Component Analysis (ICA) [7]. The key characteristics of the three methods are: k number of class partitioning, orthogonal transformation and reveal hidden independent factors, respectively [8]. However, since we are facing huge and complex Wi-Fi data, the three traditional unsupervised learning methods are insufficient because the data might be not well distributed [9]. In order to overcome this problem, we venture to transform raw data into another form of data, which can lead to better unsupervised learning result.

One acceptable candidate for the transformations is Stacked Autoencoder (SAE) which transforms original features into more meaningful representation by reconstructing its input with the decoder. It provides an efficient way to validate that the important information in the data has been captured. The SAE as a deep learning method, can be efficiently used for unsupervised learning on a complex dataset. By stacking several unsupervised feature learning layers, and greedy method training for each layer, we can consider extracted features on each hidden layer as a new space with better form for clustering task. However, SAE is originally designed for capturing complex information in lower-dimensional features than the original features, not for clustering task. Therefore, we see SAE for assisting traditional clustering algorithm to achieve better clustering result. We then forward the newly formed features from non-linear SAE transformation into k-means clustering algorithm to improve k-means clustering performance.

We implement and test our work using a comprehensive Wi-Fi network benchmark dataset, called AWID dataset [3]. Besides this dataset, Kolias et al. [3] also tested a series of existing machine learning models on the dataset in a heuristic manner. The lowest detection rate is observed on impersonation attack by detection rate of 22% only while our proposed approach outperforms on impersonation attack detection achieving a detection rate of 92%. Clearly, the novel way of combining deep learning transformation and traditional k-means clustering method improves the performance of impersonation attack detector and can be further generalized for different attack types in large scale Wi-Fi networks.

This paper is organized as follows: Sect. 2 reviews several related work. We provide our proposed approach along preliminaries in Sect. 3. Section 4 gives our experimental results and analysis. Conclusion and future work of this paper will be suggested in Sect. 5.

2 Related Work

There are several previous work which leverages deep learning techniques as a clustering method. Song et al. [9] proposed an autoencoder-based data clustering. They mapped original data space to a new space using autoencoder, which is more suitable for clustering, and claimed that by applying a non-linear transformation, the data become compact with respect to their corresponding cluster center in the new space. They modified original autoencoder by adding two new objective functions during training: minimize reconstruction error and distance. Comparing with Song et al. [9] which needs to modify the original autoencoder, our proposed method does not need to modify the original autoencoder and improves traditional clustering algorithm. While Saito and Tan [10] proposed similar work by mapping the input data to an embedded space, using autoencoder. They concatenated the learned representations of all intermediate layers. All features learned by each neural network layer were used to generate a combined representation which is useful for effective cluster analysis. Different from Saito and Tan [10], we leverage unsupervised k-means clustering algorithm instead of supervised k-Nearest Neighbor (k-NN) classification algorithm. We also use a single hidden layer only without concatenation of all hidden layers to maintain the process still lightweight.

A lot of proposed approaches for detecting impersonation attacks [11,12,13,14]. [11,12,13] were designed to detect one particular impersonation attack by adding or modifying specific protocols. However, in particular, Aminanto and Kim [14] proposed one general model that can detect an impersonation attack by reducing the features dimensionalities and adopting SAE at final stage. Unfortunately, their approach needs data labels which is supervised learning algorithm. Therefore, we propose a fully unsupervised approach for coping with huge and complex Wi-Fi network traffics.

3 Our Approach

In this Section, we briefly describe our novel fully unsupervised deep learning-based Wi-Fi impersonation attack detector. For clarity, we firstly introduce preliminaries about SAE and k-means clustering, and then explain how the overall scheme works for detecting attacks.

3.1 Stacked Autoencoder (SAE)

An autoencoder is a symmetric neural network model as shown in Fig. 1, which belongs to unsupervised learning in the sense that a model could be built from non-labeled data. To extract new-lower dimensional features, autoencoder uses an encoder-decoder paradigm as shown in Fig. 1, which can capture relevant data from the original data. The encoder is a function that maps an input X to a representation layer H as expressed by Eq. (1).

Fig. 1.
figure 1

Autoencoder network with symmetric input-output layers

$$\begin{aligned} H=s_f \left( W X+b_f \right) , \end{aligned}$$
(1)

where \(s_f\) is a non-linear activation function, in this case of a logistic sigmoid, \(s_f(t) = \dfrac{1}{1+e^{-t}}\), where t is the function input. The W and \(b_f\) denote a weight matrix for features and a bias vector for encoding, respectively. The decoder function expressed in Eq. (2) maps representation layer H back to a reconstruction \(X'\) as an output.

$$\begin{aligned} X'=s_g \left( V H+b_g \right) , \end{aligned}$$
(2)

where \(s_g\) is the activation function of the decoder, which is a sigmoid function too. The V and \(b_g\) denote a weight matrix for features and a bias vector for decoding, respectively. Autoencoder training phase finds optimal parameters \(\theta = \lbrace W, V, b_f, b_g \rbrace \) which minimize the reconstruction error E between the input data X and its reconstruction output \(X'\) on a training set as shown in Eq. (3).

$$\begin{aligned} E=\frac{1}{N}\sum _{n=1}^{N}\sum _{k=1}^{K}\left( X'_{kn}-X_{kn}\right) ^2 + \lambda \cdot \varOmega _{weights} + \beta \cdot \varOmega _{sparsity}, \end{aligned}$$
(3)

where N and K denote the total number of training data and the number of variables for each data, respectively. \(\varOmega _{weights}\) represents \(L_2\) regularization, while \(\varOmega _{sparsity}\) denotes sparsity regularization, which evaluates how close the average output activation value and the desired value. The coefficient of \(L_2\) regularization term \(\lambda \) and the coefficient of sparsity regularization term \(\beta \) are specified during autoencoder training.

Autoencoder can be used as a deep learning technique by unsupervised greedy layer wise pre-training algorithm as depicted in Fig. 2, which is called Stacked Autoencoder (SAE). In this algorithm, all layers except the last layer are initialized in a multi-layer neural network. Each layer is then trained in an unsupervised manner as autoencoder which constructs new representations of the input.

Fig. 2.
figure 2

Stacked autoencoder (SAE) network with three hidden layers

The final layer implements the softmax for the classification the deep neural network. Softmax function is a generalized term of the logistic function that squashes the K-dimensional vector \(\mathbf v \in \mathbb {R}^K\) into K-dimensional vector \(\mathbf v ^* \in {\left( 0, 1\right) }^K\) which adds up to 1. The softmax layer minimizes the loss function, which is the cross entropy function.

3.2 K-means Clustering

K-means clustering algorithm groups all observations data into k clusters iteratively until convergence will be reached. In the end, one cluster contains similar data since each data enters to the nearest cluster. K-means algorithm assigns a mean value of the cluster members as a cluster centroid. In every iteration, it calculates the shortest Euclidean distance from an observation data into any cluster centroid. Besides that, the intra-variances inside the cluster are also minimized by updating the cluster centroid iteratively. The algorithm would terminate when convergence is achieved, which the recent clusters are the same as the previous iteration clusters [8].

Fig. 3.
figure 3

Our proposed scheme contains feature extraction and clustering tasks

figure a

3.3 Fully Unsupervised Deep Learning

In this subsection, we describe our novel fully unsupervised deep learning-based IDS for detecting impersonation attacks. There are two main tasks, feature extraction and clustering tasks. Figure 3 shows our proposed scheme which contains two main tasks in cascade. We use a real Wi-Fi networks-trace, AWID dataset [3], which contains 154 original features. Before the scheme starts, normalizing and balancing process should be done in order to achieve best training performance. Algorithm 1 explains the procedure of the proposed scheme in detail.

The scheme starts with two cascading encoders, and the output features from the second layer then forwarded to the clustering algorithm. The first encoder has 100 neurons as the first hidden layer while the second encoder comes with 50 neurons only. We follow a common rule for choosing the number of neurons in a hidden layer by using 70% to 90% of the previous layer. In this paper, we define k = 2 since we consider two classes only. The scheme ends by two clusters formed by k-means clustering algorithm. These clusters represent benign and malicious data.

4 Evaluation

We evaluate the proposed scheme on AWID dataset. We firstly show the effectiveness of leveraging second layer representation of SAE training result compared to original data. We implement SAE and k-means clustering algorithm using MATLAB R2016b running on an Intel Xeon E-3-1230v3 CPU @3.30 GHz with 32 GB RAM. We verify our proposed scheme by comparing the proposed scheme with the previous work. We introduce the dataset has been used and evaluation metrics in the next subsections.

4.1 Dataset

We use AWID dataset as a benchmark dataset since the dataset might become a common benchmark dataset for wireless network research due to its comprehensiveness and real world-alike characteristics. Regarding the number of classes, the dataset has two types of attack classes: “ATK” and “CLS”. The “ATK” dataset consists of 16 classes including one benign class, while the “CLS” data contains four classes only. The 16 classes of the “ATK” dataset can be classified to four attack categories in the “CLS” dataset. In this paper, we use the “CLS” dataset which contains benign, impersonation, injection and flooding classes. However, we consider two classes only among four classes. Besides that, the AWID dataset is also divided into two types based on the size of data instances included, namely, full and reduced datasets. There are 1,795,595 data instances in the full dataset, with 1,633,190 and 162,385 benign and attack instances, respectively. While the reduced dataset contains only 575,643 instances, with 530,785 and 44,858 benign and attack instances, respectively. In this paper, we used the reduced “CLS” AWID dataset for the sake of simplicity.

The dataset expresses the nature of a network, where the number of benign instances is larger than attack instances [3]. The ratio between benign and attack instances are 10:1 and 11:1 for training and test datasets, respectively. This situation might cause a bias during training, and infer machine learning performance. Therefore, we balance the dataset for training purpose. The ratio between benign and attack instances then become 1:1 for both balanced training and test datasets. The benign instances are randomly reduced into 163,319 data instances for training dataset while 53,078 data instances for test dataset.

The AWID dataset not only consists of discrete data but also continuous and symbolic data types, with flexible value ranges. This situation might confuse any machine learning during training. The dataset preparation should be done in advance, which contains two main tasks: mapping symbolic-valued attributes to integer values and normalizing tasks. First, target classes would be mapped to integer type: 1 for benign instances and 2 for impersonation attack. Second, symbolic attributes, such as a receiver, destination, transmitter, and source address, would be mapped to integer values with a minimum value of 1 and a maximum value of i, where i is the number of all symbols. Third, some attributes that have a hexadecimal data type, such as WEP Initialization Vector (IV) and Integrity Check Value (ICV), need to be cast into integer values too. Also, there are some attributes left with a continue data type, like timestamps. Last, the dataset also contains a question mark (“?”) for unavailable values for the corresponding attributes. The question marks are assigned to zero value. After all attribute values are cast into integer values, each of the attributes is linearly normalized between zero and one. Equation (4) shows the normalizing formula:

$$\begin{aligned} z_{i}=\frac{x_{i}-min (x)}{max(x)-min(x)}, \end{aligned}$$
(4)

where \(z_i\) denotes the normalized value, \(x_i\) refers to the corresponding attribute value, and min(x) and max(x) are the minimum and maximum values of the attribute, respectively.

4.2 Evaluation Metrics

We use several metrics that commonly used for measuring IDS performance [15]: classification accuracy (Acc), Detection Rate (DR), False Alarm Rate (FAR). Acc shows the overall effectiveness of an algorithm. DR, also known as Recall, refers to the number of attacks detected divided by the total number of attack instances in the test dataset. FAR is the number of normal instances classified as an attack divided by the total number of normal instances in the test dataset. \(F_1\) score measures a harmonic mean of precision and recall, where Precision shows the number of attacks compared to the total of classified instances as an attack. Intuitively, our goal is to achieve a high Acc, DR, Precision and \(F_1\) score, and at the same time, maintain low FAR. The above measures can be defined as shown in Eqs. (5), (6), (7), (8) and (9):

$$\begin{aligned} Acc=\frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(5)
$$\begin{aligned} DR=Recall=\frac{TP}{TP+FN} \end{aligned}$$
(6)
$$\begin{aligned} FAR=\frac{FP}{TN+FP} \end{aligned}$$
(7)
$$\begin{aligned} Precision=\frac{TP}{TP+FP} \end{aligned}$$
(8)
$$\begin{aligned} F_1=2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}, \end{aligned}$$
(9)

where True Positive (TP) is the number of intrusions correctly classified as an attack. True Negative (TN) is the number of normal instances correctly classified as a benign packet. False Negative (FN) is the number of intrusions incorrectly classified as a benign packet. False Positive (FP) is the number of normal instances incorrectly classified as an attack.

4.3 Experimental Results

We implement our proposed scheme as shown in Algorithm 1. There are two hidden layers in the SAE network with 100 and 50 neurons accordingly. The encoder in the second layer fed with features formed by the first layer of encoder. The softmax activation function is implemented in the final stage of the SAE in order to optimize the SAE training. The 50 features extracted from the SAE are then forwarded to k-means clustering algorithm as an input. We use random initialization for k-means clustering algorithm. However, we set a certain value as a random number seed for reproducibility purpose. We compare clustering results from three inputs: original data, features from the first hidden layer of the SAE and features from the second hidden layer of the SAE as shown in Table 1.

Table 1. The evaluation of our proposed scheme

We observe the limitation of a traditional k-means algorithm, which unable to clusters complex and high dimensional data of AWID dataset, as expressed by 55.93% of accuracy only. Although 100 features coming from the 1st hidden layer achieved 100% of detection rate, the false alarm rate is still unacceptable with 57.48%. The k-means algorithm fed by 50 features from the 2nd hidden layer achieved the best performance among all as shown by the highest \(F_1\) score (89.06%) and Acc (94.81%), also the lowest FAR (4.40%). Despite a bit lower detection rate, our proposed scheme improves the traditional k-means algorithm in overall by almost twice \(F_1\) score and accuracy.

Figure 4 shows cluster assignment result in Euclidean space, by our proposed scheme. Black dots represent attack instances, while gray dots represent benign instances. The location of cluster centroid for each cluster is expressed by X mark.

Fig. 4.
figure 4

Cluster assignment result in Euclidean space by our proposed scheme

We also compare the performance of our proposed scheme against two related previous work by Kolias et al. [3] and Aminanto and Kim [14] as shown in Table 2. Our proposed scheme is able to classify impersonation attack instances with a detection rate of 92.18% while maintaining low FAR, 4.40%. Kolias et al. [3] tested various classification algorithms such as Random Tree, Random Forest, J48, Naive Bayes, etc., on AWID dataset. Among all methods, Naive Bayes algorithm showed the best performance by correctly classifying 4,419 out of 20,079 impersonation instances. It achieved approximately 22% DR only, which is unsatisfactory. Aminanto and Kim [14] proposed another impersonation detector by combining Artificial Neural Network (ANN) with SAE. They successfully improved the IDS model for impersonation attack detection task by achieving a DR of 65.18% and a FAR of 0.14%. In this study, we leverage SAE for assisting traditional k-means clustering with extracted features. We still have a high false alarm rate, which leads to a severe impact of IDS [16]. However, we can accept false alarm rate value about 4% since we use fully unsupervised approach here. We can adjust the parameters and cut the FAR down, but, less FAR or high DR remains a tradeoff for users and will be discussed in further work. We observe the advantage of SAE for abstracting a complex and high dimensional data to assist traditional clustering algorithm which is shown by reliable DR and \(F_1\) score achieved by our proposed scheme.

Table 2. Comparison with previous work

5 Conclusion and Future Work

In this paper, we improve traditional k-means clustering algorithm by proposing a novel fully unsupervised-based intrusion detection system incorporating deep learning technique, a stacked autoencoder. We implement SAE to achieve high level abstraction of complex and huge Wi-Fi network data. The SAE has important features: model-free and learnability on large-scale data, which is suitable for the open nature of Wi-Fi networks where attackers can easily impersonate as legitimate users. We believe that the extracted features by SAE are in the new space that can improve clustering algorithm performance. Our experiments show significant improvements compared to previous work with notably 94.81% of accuracy.

In the near future, we will further investigate and propose a method to reduce false alarm rate in order to achieve a reliable IDS. In addition, we will discuss using deep learning techniques, especially stacked autoencoder as an outlier detection for detecting unknown attacks.