1 Introduction

At present, the Internet of Things (IoT) has penetrated all levels of social life and is deeply applied in intelligent environments as well as personal and social fields [1]. While the IoT brings convenience to social production, it also introduces security risks that cannot be ignored. The existing approach is to find abnormal device behaviors or network attacks by detecting IoT network traffic, so as to protect the security of the IoT to the maximum extent.

With the development of intelligent devices, IoT traffic detection technology based on machine learning has been widely studied [2]. Traditional traffic detection methods, such as Bayesian methods and support vector machines [3], are mostly based on statistics, requiring experts to label traffic data and extract statistically significant features, including the transmission rate, byte changes and time intervals of network traffic. Kong et al. [4] proposed an abnormal traffic identification system based on a multi-class support vector machine, which can classify and identify various attack traffic and showed good performance in experiments on the KDDCUP99 dataset. Shafiq et al. [5] used NetMate to extract packet features from the HIT and MIMS datasets. Vu et al. [6] proposed a feature engineering technique that extracts important attributes of network traffic by analyzing data packets and mining their correlations; experimental results show that this method significantly improves the identification accuracy for abnormal traffic and the computational efficiency of the model. With the exponential growth of IoT device traffic, however, these methods face challenges that are difficult to overcome when extracting statistical features from massive data.

In recent studies, some researchers have tried to apply deep learning to network traffic anomaly detection. Radford et al. [7] used an improved recurrent neural network (RNN) to learn computer network traffic sequences, and experiments proved that the model could detect malicious traffic patterns in computer systems. Zou [8] proposed a new method to identify network traffic with deep neural networks, improving classification accuracy by combining a convolutional neural network (CNN) and an RNN; experimental results show a great improvement in efficiency and reliability. However, these solutions are based on supervised learning and require a large amount of labeled data, while in actual scenarios it is extremely difficult and time-consuming to obtain labels for abnormal samples. In particular, with the rapid development of network attacks, such models need to be retrained to detect new attacks, resulting in low detection efficiency for unknown attacks.

In order to break through the bottleneck of supervised learning, researchers have turned their attention to semi-supervised and unsupervised methods. In recent years, unsupervised algorithms based on the Auto-Encoder (AE) [9] have attracted extensive attention and in-depth research. Nguyen et al. [10] proposed a network framework based on the variational AE, which can effectively detect and interpret various network traffic anomalies. Yisroel et al. [11] proposed a plug-and-play Network Intrusion Detection System (NIDS) named Kitsune, which extracts features from the data stream with a damped incremental statistics method and then uses the core algorithm KitNET to detect abnormal traffic in real time, realizing online, efficient detection of network attacks under unsupervised learning. Akcay et al. [12] used GAN (Generative Adversarial Networks) to train an encoder model with skip connections, which can learn the normal distribution of samples; experiments proved that this model has a high detection rate for images in different fields. Zenati et al. [13] achieved effective results by jointly training two sub-networks to capture the normal data distribution. In general, AE-based anomaly detection is judged by the reconstruction loss between the original sample and the decoder output: normally, the reconstruction loss of abnormal samples is larger and that of normal samples is smaller. However, the decoder's reconstruction is limited by a fixed-length latent vector, leaving little information available in the decoding process. Therefore, in IoT scenarios with stronger data diversity and more data, the reconstruction loss of some normal samples is large while that of some abnormal samples is small, which reduces anomaly detection accuracy.

However, careful feature selection combined with a well-designed neural network structure can help anomaly detection. In this paper, an unsupervised anomaly detection model named HaarAE is proposed. Specifically, HaarAE first uses the Haar wavelet transform to enhance the input features of the original data; the wavelet transform retains the characteristic information of the original data in both the time and frequency domains. The data is then fed into a convolutional encoder, and the convolution result of each layer in the encoding stage is fed not only to the next convolutional layer but also to a ConvLSTM layer. In the decoding stage, the result of each ConvLSTM layer is cascaded with the corresponding deconvolution result and fed to the next deconvolution layer, increasing the information the decoder can capture and strengthening the reconstruction of normal samples. At the same time, a memory module is added to the AE, in which a small number of normal-sample latent vectors are stored.

The contributions and innovations are as follows:

  1) An unsupervised IoT traffic anomaly detection model named HaarAE is proposed. With this method, the reconstruction error for normal samples is kept at a low level, while the reconstruction error for abnormal samples is kept at a high level.

  2) The combination of the Haar wavelet transform and AE is applied to traffic anomaly detection for IoT devices. Besides the timing characteristics of the traffic data, the frequency-domain characteristics are also analyzed to obtain the periodicity and fluctuation information of the traffic data.

The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 proposes the HaarAE model and introduces its structural design. Section 4 verifies the effectiveness of the method through experiments. Conclusions and prospective research directions are given in Section 5.

2 Related work and theories

2.1 Spectrum analysis

With the development of deep learning, RNN [14], Convolutional Neural Network (CNN) [15] and other methods can use their network structures to autonomously learn sample features. However, feature analysis still plays a significant role in obtaining desirable results from network models. Spectrum analysis is one of the typical representative techniques: it analyzes the frequency properties of a time series to find hidden periodicity, and has been widely used in time series processing, acoustics, computer vision, biomedicine and other fields [16, 17]. According to Fourier theory [18], every time-domain signal has a corresponding frequency-domain signal; traditional spectrum analysis transforms a time series from the time domain to the frequency domain and reveals information that is difficult to find in the time domain. Livera et al. [19] proposed a state space modeling framework based on the Fourier transform for predicting complex seasonal time series, such as series with high-frequency seasonality and dual calendar effects; experimental results show that this framework reduces the computational burden of maximum likelihood estimation and effectively identifies and extracts seasonal features. Wang [20] proposed a face recognition algorithm based on the fractional Fourier transform, which improved robustness to illumination, noise and other factors and effectively improved the face recognition rate. Due to its inherent limitations in processing discrete signals, the Fourier transform can no longer meet practical needs. The wavelet transform, in contrast, retains information in both the time and frequency domains and can process the local similarity of signals and data, showing great advantages in feature extraction and data mining [21]. Zhao et al. [22] used the wavelet transform to reveal the frequency-domain information of univariate time series and used different neural networks to simultaneously capture time-frequency features and long-term trends; an attention mechanism blending local and global features further improved prediction accuracy. Yuan et al. [23] proposed a fully convolutional neural network based on the wavelet transform to capture frequency-domain information of multivariate time series; experimental results show that adding the wavelet transform effectively improves the detection performance of the CNN. Ma et al. [24] proposed an anomaly detection method named WAGAN for industrial sensors, which removes noise and enhances data features by decomposing and recombining a multi-level discrete wavelet transform; an attention mechanism introduced into WAGAN further improves detection accuracy. Zhang et al. [25] proposed a new unsupervised learning framework, P2GAN, which maps input samples to Gaussian distribution factors through discriminators so as to fully extract the true distribution information, and verified its effectiveness both theoretically and experimentally. Hou et al. [26] proposed a framework called Divide-and-Assemble Anomaly Detection (DAAD), which interprets image reconstruction as a divide-and-assemble process and adds a memory module to adjust the reconstruction ability of the model.

It is difficult for common network models to capture the frequency-domain information of data, which prevents them from learning more data features. However, a full understanding of IoT traffic features is of great significance for analyzing device anomalies and judging network attacks. Therefore, in real IoT device traffic environments, using the wavelet transform to expand data features brings more advantages.

2.2 Autoencoder

AE is an unsupervised neural network model widely used in anomaly detection, data denoising, data dimensionality reduction, image repair, information retrieval and other fields. The AE learns the hidden features of the input data, a process called encoding, and the learned hidden features can reconstruct the original input data, a process called decoding. The basic AE consists of two parts, an encoder and a decoder, as shown in Fig. 1.

Fig. 1 Structure of AE

In the training phase, the encoder learns a nonlinear mapping between the original data space and the latent vector H, mapping the high-dimensional data to a low-dimensional representation to reduce the amount of data. The process can be expressed by the following formula:

$$ H = f(x) = s(Wx + p) $$
(1)

Here, s is the nonlinear activation function of the encoder, W the weight matrix, and p the bias vector. After the latent vector is obtained, the decoder learns the nonlinear mapping from the latent vector back to the original input space to realize data reconstruction, where α is the decoder's activation function. The formula used in the decoding phase is defined as:

$$ x^{\prime} = g(H) = \alpha(WH + p) $$
(2)

The main task of the AE is to accurately reconstruct the input data, and its objective functions mainly include the Mean Square Error (MSE) and the Kullback-Leibler (KL) divergence. MSE is defined as:

$$ MSE = \frac{1}{N}\sum_{i = 1}^{N} (y_{i} - y_{i}^{\prime})^{2} $$
(3)

KL divergence describes the relative entropy between two probability distributions p and q. Its formula can be defined as:

$$ {D_{KL}}(p||q) = - \sum_{x} {p(x)\log (\frac{{q(x)}}{{p(x)}})} $$
(4)

Finally, the back-propagation algorithm is used to propagate the error back to the hidden layer, so as to optimize the loss function and model parameters.
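For concreteness, the following is a minimal sketch of the basic AE of Fig. 1 in Keras (the framework used later in this paper); the layer sizes, sigmoid activations and the use of 41 input features are illustrative assumptions, not the paper's configuration.

```python
# A minimal AE sketch in Keras; sizes and activations are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

input_dim, latent_dim = 41, 8                # e.g. the 41 KDDCUP99 features

x = layers.Input(shape=(input_dim,))
h = layers.Dense(latent_dim, activation="sigmoid")(x)       # H = s(Wx + p), formula (1)
x_rec = layers.Dense(input_dim, activation="sigmoid")(h)    # x' = alpha(WH + p), formula (2)

ae = Model(x, x_rec)
ae.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
           loss="mse")                        # MSE objective, formula (3)
# ae.fit(X_normal, X_normal, epochs=50)      # train to reconstruct normal samples
```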

2.3 Summary

The traffic of IoT devices is time series data, so it carries frequency-domain information like other time series. In our work, the Haar wavelet transform is introduced to bring more diverse features to IoT device traffic data, effectively compensating for the weakness of neural networks in feature engineering. An unsupervised AE network structure is built to avoid labeling large datasets. At the same time, to address the limited fitting ability of the model and the insufficient information available to the decoder, a memory module is added to the AE to increase the reconstruction error of abnormal samples. In addition, ConvLSTM layers are added, and the output of each layer of the decoder is cascaded with the output of the corresponding ConvLSTM layer, which not only captures the timing features of the data but also lets the decoder make full use of the coding information of each layer to reconstruct the original data, enhancing the fitting ability and detection accuracy of the model.

3 Unsupervised anomaly detection model based on Haar wavelet transform

3.1 Wavelet transform of original data

Wavelet transform can analyze signals in different frequency bands at different resolutions, so it can find the complex features of signals. The idea is to use a scale function to represent the original data and obtain its low-frequency part, which contains the global features of the original data. As the scale becomes larger, the scale-function representation of the original signal becomes coarser and its difference from the original signal grows, so a wavelet function is introduced to represent this difference. The wavelet function yields the high-frequency part of the data, which contains the details. Specifically, the wavelet transform decomposes the original data into approximate coefficients and detail coefficients under the action of the wavelet function and the scale function [27]. Figure 2 describes the process of the wavelet transform. Given a sequence of raw traffic data X = (x0,x1,…,xN−1), where N is the length of the data, and given the scale function β = {β1,β2,…,βN−1} and the wavelet function ε = {ε1,ε2,…,εN−1}, the traffic sequence X is projected onto the two functions β and ε to obtain the approximate coefficients and detail coefficients, as shown in formulas (5) and (6).

$$ W_{\beta}(f_{0}, k) = (x, \beta_{f_{0},k}) = \frac{1}{\sqrt{N}} \sum_{m = 0}^{N - 1} x[m]\, \beta_{f_{0},k}[m] $$
(5)
$$ W_{\varepsilon}(f, k) = (x, \varepsilon_{f,k}) = \frac{1}{\sqrt{N}} \sum_{m = 0}^{N - 1} x[m]\, \varepsilon_{f,k}[m] $$
(6)

Wβ is the approximate coefficient, Wε is the detail coefficient, k is the translation parameter of the wavelet transform, and f0 and f are the scale levels. The detail coefficient reveals the variance of the data at different scales, and the approximate coefficient gives the smoothed average at that scale. Furthermore, at each level of the wavelet transform, the low-frequency part obtained from the previous decomposition is decomposed into a low-frequency and a high-frequency part. After l-level decomposition, the source signal X can be expressed by the following formula:

$$ X = {W_{{\varepsilon_{1}}}} + {W_{{\varepsilon_{2}}}} + {W_{{\varepsilon_{3}}}} + {\ldots} + {W_{{\varepsilon_{l}}}} + {W_{{\beta_{l}}}} $$
(7)
Fig. 2 Process of wavelet transform

Among them, \(W_{\varepsilon_{1}}, W_{\varepsilon_{2}}, W_{\varepsilon_{3}}, \ldots, W_{\varepsilon_{l}}\) are the high-frequency signals obtained from the first level through the l-th level of decomposition, and \(W_{\beta_{l}}\) is the low-frequency signal obtained at the l-th level. The detail coefficients and approximate coefficients obtained through the wavelet transform provide multi-scale data features for the model, which is difficult for ordinary neural networks to achieve.

Yuan et al. [23] pointed out that, as the smoothed average of the original data, the approximate coefficient is easy for a CNN to learn, and that due to its non-orthogonality the additional input brings redundant parameters to the convolution calculation. In fact, for IoT device traffic data, the low-frequency information in normal samples is still important, because it represents the normal pattern of the sample. Although a convolutional neural network can learn part of it, experiments show that combining well-designed feature data with the neural network brings better results.
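As an illustration, the level-l decomposition of formula (7) can be computed with the PyWavelets library; the library choice and the data shape are assumptions, since the paper does not name its wavelet implementation.

```python
# A minimal sketch of an l-level Haar decomposition (formula (7)),
# assuming the PyWavelets library.
import numpy as np
import pywt

x = np.random.randn(1024)                  # one univariate traffic series
level = 4                                  # decomposition level l
coeffs = pywt.wavedec(x, "haar", level=level)
w_beta_l = coeffs[0]                       # approximate coefficients W_beta_l
w_eps = coeffs[1:]                         # detail coefficients W_eps_l ... W_eps_1
# Each level halves the length of the approximate coefficients:
assert w_beta_l.shape == (1024 // 2**level,)
```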

3.2 Convolutional AE based on memory module

Figure 3 shows the overall structure, which includes a wavelet transform module, a memory module and a ConvLSTM cascade module. First, each input sample xd, d ∈ {0,1,2,…}, where d is the data dimension, is decomposed by the wavelet transform to a specific level L to obtain feature information at different scales. Formally, at scale level l ∈ {1,…,L}, the detail coefficients obtained by the wavelet transform are \(W_{\varepsilon} = W_{\varepsilon_{1}}(l) + W_{\varepsilon_{2}}(l) + \ldots + W_{\varepsilon_{d}}(l)\) and the approximate coefficients are \(W_{\beta} = W_{\beta_{1}}(l) + W_{\beta_{2}}(l) + \ldots + W_{\beta_{d}}(l)\). Next, the original data, detail coefficients and approximate coefficients are each fed into an independent convolutional encoder in order to fully extract features in the time and frequency domains. The input of each convolutional encoder is \(x \in R^{D \times N}\) or \(W(l) \in R^{D \times \frac{N}{2^{l}}}\); setting the kernel size of the first convolutional layer to D forces the encoder to merge all dimensions and capture the global correlation between them. Letting Xin denote the input data and Xout the output data, the output of the convolutional encoder can be expressed as:

$$ {X_{out}} = f(A*{X_{in}} + b) $$
(8)
Fig. 3 The overall structure of the model

Where f represents the nonlinear mapping function, ∗ is the convolution operation, and A and b are parameters learned in the convolutional encoder. Generally, the encoder and decoder form a symmetric structure by stacking neural network layers in reverse. For an input sample x, the encoder \({f_{encoder}}\left (\cdot \right ):x \to z\) encodes the data to obtain the latent vector z, the decoder then restores the original data, and the reconstruction error between the input and the restored data is used to judge whether there is an anomaly. However, in practice, data imbalance and diversity give the AE a strong generalization ability, so the reconstruction error of some abnormal samples is very small. In this paper, a memory module [28] is introduced after the encoder, storing a small number of latent vectors that represent the normal sample distribution. In the test stage, the encoding of the test sample is used to retrieve the most similar latent vector z′ for reconstruction; therefore, when a test sample is abnormal, its reconstruction error increases. In essence, the memory module stores a matrix of size N × M, where N is the number of stored items and M is the feature dimension of a sample. A memory addressing step computes the attention weights β of the latent vector, as shown in formula (9):

$$ {\beta_{i}} = \frac{{{e^{\cos \left( {z,{m_{i}}} \right)}}}}{{\sum\limits_{j = 1}^{N} {{e^{\cos \left( {z,{m_{j}}} \right)}}} }} $$
(9)

Where \({\cos \limits } (\cdot , \cdot )\) is the cosine similarity between any two vectors and mi is a storage item in memory. The specific formula of \({\cos \limits } (\cdot , \cdot )\) is shown in (10).

$$ cos (z,{m_{i}}) = \frac{{z \cdot {m_{i}^{T}}}}{{{{\left\| z \right\|}_{2}}{{\left\| {{m_{i}}} \right\|}_{2}}}} $$
(10)

In practice, some anomalies are still reconstructed well, so a hard shrinkage method is used to constrain β, as shown in formula (11), where λ is a set threshold: an attention weight is valid only when it is greater than λ, and is otherwise set to 0.

$$ \beta^{\prime}_{i} = \begin{cases} \dfrac{\max \left( \beta_{i} - \lambda, 0 \right)}{\left| \beta_{i} - \lambda \right| + \varepsilon} \cdot \beta_{i}, & \beta_{i} > \lambda \\ 0, & \beta_{i} \le \lambda \end{cases} \qquad \lambda \in \left[ \frac{1}{N}, \frac{3}{N} \right] $$
(11)

Finally, the constrained attention weights and the latent feature vectors stored in the memory module are used to calculate \({z^{\prime }}\), as shown in formula (12).

$$ {z^{\prime}} =\beta \cdot M =\sum_{i = 1}^{N} {{\beta^{\prime}_{i}}{m_{i}}} $$
(12)
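To make the addressing step concrete, the following NumPy sketch implements formulas (9)-(12); the memory size, latent dimension and variable names are illustrative assumptions.

```python
# A minimal NumPy sketch of memory addressing, formulas (9)-(12);
# shapes and names are illustrative assumptions.
import numpy as np

def memory_read(z, M, lam, eps=1e-12):
    """z: latent vector of shape (m,); M: memory of N normal latents, shape (N, m)."""
    # formula (10): cosine similarity between z and every memory item m_i
    sim = (M @ z) / (np.linalg.norm(M, axis=1) * np.linalg.norm(z) + eps)
    # formula (9): softmax over the similarities gives attention weights beta
    beta = np.exp(sim) / np.exp(sim).sum()
    # formula (11): hard shrinkage zeroes weights not exceeding the threshold lambda
    beta = np.where(beta > lam,
                    np.maximum(beta - lam, 0.0) / (np.abs(beta - lam) + eps) * beta,
                    0.0)
    # formula (12): the retrieved latent z' is the weighted sum of memory items
    return beta @ M

N, m = 100, 32
M = np.random.randn(N, m)
z_prime = memory_read(np.random.randn(m), M, lam=2.0 / N)   # lambda in [1/N, 3/N]
```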

3.3 ConvLSTM cascaded structure

The main benefit of the memory module is to increase the reconstruction error of abnormal samples. Since the latent vector generated by the encoder is fixed, the decoding effect is limited by the expressiveness of that vector, and performance may degrade as the sequence length increases. A cascade structure of ConvLSTM [29] layers is therefore constructed to improve decoder performance. As shown in Fig. 3, the output of each layer in the encoding stage is not only fed to the next convolutional layer but also to a ConvLSTM layer. The result of each deconvolution layer in the decoding stage is cascaded with the result generated by the ConvLSTM and fed to the next deconvolution layer. Given the output XT,l of the l-th convolutional layer and the previous hidden state ZT−1,l, the current hidden state is updated to ZT,l = ConvLSTM(XT,l, ZT−1,l). The specific formulas for the ConvLSTM layer are as follows:

$$ i_{T,l} = \sigma(b_{i,l} + W_{xi,l} * X_{T,l} + W_{zi,l} * Z_{T - 1,l} + W_{ci,l} \circ C_{T - 1,l}) $$
(13)
$$ f_{T,l} = \sigma(b_{f,l} + W_{xf,l} * X_{T,l} + W_{zf,l} * Z_{T - 1,l} + W_{cf,l} \circ C_{T - 1,l}) $$
(14)
$$ C_{T,l} = f_{T,l} \circ C_{T - 1,l} + i_{T,l} \circ \tanh(b_{c,l} + W_{xc,l} * X_{T,l} + W_{zc,l} * Z_{T - 1,l}) $$
(15)
$$ o_{T,l} = \sigma(b_{o,l} + W_{xo,l} * X_{T,l} + W_{zo,l} * Z_{T - 1,l} + W_{co,l} \circ C_{T,l}) $$
(16)
$$ Z_{T,l} = o_{T,l} \circ \tanh(C_{T,l}) $$
(17)

In this paper, the notation of the ConvLSTM formulas from [29] is adapted so that formulas (13)-(17) are consistent with the rest of the paper. "∘" denotes the Hadamard product. Wxi,l, Wzi,l, Wci,l, Wxf,l, Wzf,l, Wcf,l, Wxc,l, Wzc,l, Wxo,l, Wzo,l, Wco,l are the parameters of the ConvLSTM, and XT,l, CT,l, ZT−1,l, iT,l, oT,l, fT,l are all three-dimensional tensors. Finally, in order to reconstruct the original data, detail coefficients and approximate coefficients, the latent vector output by the encoder needs to be decoded. We define the deconvolution operation as follows:

$$ \tilde{x}_{T,l - 1} = f(W_{T,l} \otimes Z_{T,l} + b_{T,l}), \quad l = 3 $$
(18)

Where f is the same activation function as in the convolutional encoder, "⊗" is the deconvolution operation, and W and b are the learnable parameters of the convolutional decoder. In order to cascade with the ConvLSTM layers, the deconvolution operation is updated to:

$$ \tilde{x}_{T,l - 1} = f(W_{T,l} \otimes [Z_{T,l} \oplus \tilde{x}_{T,l}] + b_{T,l}), \quad l = 1,2 $$
(19)

"⊕" is the concatenation operation. Specifically, when l = 3, the deconvolution layer accepts only the information in the latent vector; when l = 1,2, the deconvolution layer accepts not only the deconvolution output \(\tilde{x}_{T,l}\) of the previous layer but also the output ZT,l of the ConvLSTM, concatenates the two parts, and feeds the result to the next deconvolution layer. The decoder is thus able to combine the outputs of the different deconvolution and ConvLSTM layers; the ConvLSTM layers capture data features at different time scales. In this way, the decoder relies not only on the latent vector but also on the features provided by the ConvLSTM layers, enabling the model to comprehensively utilize information at different scales to reconstruct data, effectively exploiting the feature extraction capability of ConvLSTM and improving anomaly detection performance. The reconstruction loss of HaarAE is defined in formula (20); anomaly detection is then carried out based on the reconstruction errors of the original data, detail coefficients and approximate coefficients. The specific results are described in detail in the next section.

$$ loss = \left\| x - \tilde{x} \right\|_{2} + \sum\limits_{l = 1}^{L} \left\| W_{\beta}(l) - \tilde{W}_{\beta}(l) \right\|_{2} + \sum\limits_{l = 1}^{L} \left\| W_{\varepsilon}(l) - \tilde{W}_{\varepsilon}(l) \right\|_{2} $$
(20)
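A minimal NumPy sketch of this loss follows; the function name and the list-of-arrays layout for the per-level coefficients are assumptions.

```python
# A minimal sketch of the HaarAE reconstruction loss, formula (20);
# the list-of-arrays layout per level is an assumption.
import numpy as np

def haarae_loss(x, x_rec, w_beta, w_beta_rec, w_eps, w_eps_rec):
    """w_beta/w_eps: per-level approximate/detail coefficients and reconstructions."""
    loss = np.linalg.norm(x - x_rec, 2)                    # original-data term
    loss += sum(np.linalg.norm(b - br, 2)                  # approximate-coefficient terms
                for b, br in zip(w_beta, w_beta_rec))
    loss += sum(np.linalg.norm(e - er, 2)                  # detail-coefficient terms
                for e, er in zip(w_eps, w_eps_rec))
    return loss
```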

4 Experiment

This section describes the experimental evaluation of the HaarAE model. A common benchmark dataset and a real IoT traffic dataset are used to evaluate HaarAE, with the aim of answering the following questions.

Question 1: Is the anomaly detection performance of HaarAE on IoT device traffic superior to that of mainstream unsupervised methods?

Question 2: How does each component of HaarAE affect its performance (ablation study)?

Question 3: Is HaarAE robust to input noise?

HaarAE is implemented with the Keras [30] framework, and Adam [31] with a learning rate of 0.001 is used to optimize the model.

4.1 Indicators

In the experiments, the anomaly detection performance of the proposed model is evaluated with three metrics, defined as follows:

$$ \text{Precision} = \frac{TP}{TP + FP} $$
(21)
$$ \text{Recall} = \frac{TP}{TP + FN} $$
(22)
$$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
(23)

In this paper, traffic samples containing attacks are treated as positive samples and normal samples as negative samples. In formulas (21), (22) and (23), TP is the number of attack samples predicted as attacks, FN is the number of attack samples predicted as normal, FP is the number of normal samples predicted as attacks, and TN is the number of normal samples predicted as normal. Precision measures how many of the predicted attack samples are actually attacks, and recall measures how many of the actual attack samples are covered by the predictions. Since precision and recall alone can hardly describe model performance objectively, the F1 value, the harmonic mean of the two, is also used.
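These metrics can be computed, for example, with scikit-learn; the library choice and the toy labels below are assumptions, as the paper does not name its metrics implementation.

```python
# A minimal sketch of formulas (21)-(23), assuming scikit-learn;
# attacks are the positive class (label 1), matching the convention above.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 0, 0, 1]          # 1 = attack, 0 = normal (toy labels)
y_pred = [1, 0, 0, 1, 1]          # e.g. reconstruction error above a threshold
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```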

4.2 Dataset

This paper uses the following two datasets: KDDCUP99 and the comprehensive dataset.

  1. KDDCUP99: KDDCUP99 can be obtained from the UCI repository. In KDDCUP99, each sample in the training dataset contains 41 fixed feature attributes and one label.

  2. Comprehensive data: the comprehensive data consists of the real IoT device traffic collected by Ayyoob [32], who captured 16 days of packets from a test platform, including both benign and attack traffic. This paper collates these real traffic data and balances them.

Details about the datasets are shown in Table 1.

Table 1 Statistics of the datasets

4.3 Baseline methods

This paper uses several mainstream unsupervised learning methods as baseline methods:

  1. OC-SVM: OC-SVM is a popular kernel-based anomaly detection method.

  2. AE: AE is an unsupervised learning algorithm that compresses the input into a latent spatial representation and reconstructs the output from that representation. It is mainly used for data dimensionality reduction or feature extraction.

  3. DCN: Deep Clustering Network (DCN) is a state-of-the-art clustering algorithm that tunes the performance of the autoencoder through k-means.

  4. DAGMM [33]: a deep autoencoding Gaussian mixture model for unsupervised anomaly detection, which organically combines dimensionality reduction and density estimation for end-to-end joint training, avoiding the local optima caused by training the two steps independently.

  5. MemAE [28]: MemAE adds a memory module to the AE. In the training stage, the contents of the memory module are updated to construct prototype elements of normal samples.

  6. Kitsune [11]: a network intrusion detection system that extracts features from the data stream with a damped incremental statistics method and then uses the core algorithm KitNET to detect abnormal traffic in real time, realizing online, efficient detection of network attacks under unsupervised learning.

  7. InterFusion [34]: an unsupervised detection method for multi-dimensional time series, which simultaneously models the dependencies between the different metrics of a multi-dimensional time series and the dependencies along the time axis.

In addition, in order to answer question 2, the following variants of HaarAE are used as baselines to demonstrate the impact of individual components on the accuracy of the model.

  8. HaarAE-HN: the HaarAE-HN (HaarAE-HaarNone) model does not process the original data with the wavelet transform, but retains the cascade structure and memory module.

  9. HaarAE-CN: HaarAE-CN (HaarAE-CascadeNone) adopts the traditional autoencoder structure to reconstruct the data, without the ConvLSTM cascade structure, and retains the memory module and the wavelet transform of the original data.

4.4 Experimental evaluation

  1) Unsupervised anomaly detection performance experiment

    The ultimate goal of well-designed feature engineering and an appropriate network model is to achieve higher accuracy in anomaly detection for IoT devices. In this experiment, a training dataset containing only normal samples is used to construct a HaarAE model that can capture the normal mode: 70% of the normal samples are used as the training set, 5% as the validation set, and 25% as the test set.

    The level of the wavelet transform is the key factor in expanding the data features, so the relationship between model performance and wavelet transform level is studied first. For huge amounts of data, a high-level wavelet transform produces a high time overhead; therefore, where performance permits, this paper adopts four levels of wavelet transform, represented by HaarAE-1, HaarAE-2, HaarAE-3 and HaarAE-4 respectively. HaarAE-1 means that only one level of wavelet transform is performed on the original data, i.e. \(X = {W_{{\varepsilon _{1}}}} + {W_{{\beta _{1}}}}\); HaarAE-2 represents a two-level wavelet transform of the original data, i.e. \(X = {W_{{\varepsilon _{1}}}} + {W_{{\varepsilon _{2}}}} + {W_{{\beta _{2}}}}\); the same goes for HaarAE-3 and HaarAE-4. The effect of the wavelet transform level on model performance is shown in Table 2. It can be seen from the table that when the wavelet transform level is four, the best anomaly detection results are obtained on the KDDCUP99 dataset and the comprehensive dataset, namely recall = 0.9975/0.9445, precision = 0.9758/0.9599 and F1 = 0.9863/0.9522 respectively. Therefore, the wavelet transform level used in the experimental part of this paper is 4, which can meet most anomaly detection requirements.

    Next, for question 1 and question 2, we evaluate the performance of the model on the two datasets. In the training phase, only normal samples are used as the training and validation sets. In the test phase, the reconstruction MSE of test-set samples is compared with that of training-set samples to determine anomalies. In the experiment, a 1D-CNN (one-dimensional convolutional neural network) is used as the basic structure of the encoder and decoder: three groups of symmetric 1D-CNN layers are set, with ReLU [35] as the activation function. At the end of the encoder, we further add a global average pooling layer [36]; compared with a fully connected layer, global average pooling prevents overfitting by reducing the total number of parameters in the model (a sketch of this encoder is given after this list). The specific results are shown in Table 3.

    Table 3 reports the recall, precision and F1 of HaarAE and the baseline models on the two datasets. Generally speaking, the model proposed in this paper achieves the highest recall, precision and F1 on both datasets, while the two variant models of HaarAE are less effective than the latest unsupervised models, which shows that combining effective feature extraction with an appropriate network model is very important. It is worth noting that although KDDCUP99 is not an IoT traffic dataset, the model also performs well on it; therefore, the proposed model has a certain universality in the field of traffic anomaly detection.

  2) Effect of cascade structure on model accuracy

    The purpose of the cascade structure is to provide more information for the decoder so that it can better reconstruct normal samples. Therefore, this paper studies the performance of the model under four different cascade structures. The schematic diagram of the networks is shown in Fig. 4.

    Cascade-Full refers to the complete cascade structure in the HaarAE model. Cascade-None refers to the AE model without the cascade structure; Cascade-None is consistent with HaarAE-CN. Cascade-1 (Conv2-Deconv1) connects a ConvLSTM layer after the conv2 layer and cascades it with the output of the deconv1 layer; Cascade-2 (Conv1-Deconv2) connects a ConvLSTM layer after the conv1 layer and cascades it with the output of the deconv2 layer. Experiments were conducted for these four network structures, as shown in Table 4.

    Table 4 shows the recall, precision and F1 of the four cascade structures on the two datasets. From the table, it can be seen that Cascade-Full gives the best result, and the results of Cascade-1 (Conv2-Deconv1) and Cascade-2 (Conv1-Deconv2) lie between Cascade-Full and Cascade-None, which is consistent with what we expect. This shows that the position of the cascade layer has little effect on model performance, but increasing the number of ConvLSTM layers improves the effect of anomaly detection. To further observe the effect of the cascade structure and memory module on the reconstruction of normal and abnormal samples, the reconstruction errors of the different models are computed on the comprehensive data, as shown in Figs. 5 and 6.

    It can be seen that adding the ConvLSTM layers and memory module greatly enlarges the gap between the reconstruction errors of abnormal and normal samples, and the fluctuation range of the normal-sample reconstruction error is also very small. This shows that the HaarAE model can reconstruct most of the data in normal samples.

  3) HaarAE robustness experiment

    In real environments, the traffic data of IoT devices usually contain noise; therefore, whether an anomaly detection algorithm is robust to input noise is of great significance. For question 3, in order to study the robustness of HaarAE in anomaly detection, different proportions of abnormal samples are added in the training stage to the training data (which otherwise contains only normal samples) to simulate noise, and the model is trained on the mixture. The detailed results of the robustness experiment are shown in Table 5.

    Table 5 reports the precision, recall and F1 values of HaarAE, HaarAE-HN, OC-SVM and DCN on KDDCUP99. It can be observed that the precision, recall and F1 of all four models decrease as the anomaly ratio increases, which means that noisy data degrades anomaly detection performance. It is worth noting that although the performance of the HaarAE model decreases in the face of noisy data, it still maintains high anomaly detection performance compared with the other models. In particular, the results of HaarAE-HN versus HaarAE show that the data features obtained by the wavelet transform enhance the robustness of the model.

  4) Some problems of HaarAE

    However, HaarAE performs the wavelet transform on a large amount of data, which brings a certain time overhead. Figure 7 compares the time overhead at different wavelet transform levels. It can be clearly seen that the wavelet transform level is positively correlated with the time overhead, which inevitably increases the time cost of HaarAE. Therefore, future work will study performing the wavelet transform on only part of the data while achieving the same anomaly detection performance, to further reduce the time overhead.
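As referenced in experiment 1), the following Keras sketch mirrors the encoder described there (three 1D-CNN layers with ReLU activations and a global average pooling head); the filter counts, kernel sizes and input length are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of the 1D-CNN encoder used in the experiments;
# filter counts, kernel sizes and input length are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

x_in = layers.Input(shape=(64, 1))     # (sequence length, channels)
h = layers.Conv1D(32, 5, padding="same", activation="relu")(x_in)
h = layers.Conv1D(16, 5, padding="same", activation="relu")(h)
h = layers.Conv1D(8, 5, padding="same", activation="relu")(h)
z = layers.GlobalAveragePooling1D()(h) # pooling head instead of a dense layer,
                                       # reducing parameters to curb overfitting
encoder = Model(x_in, z)
encoder.summary()
```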

Table 2 Wavelet transform level comparison
Table 3 Precision, recall, and F1 from HaarAE and the baseline methods
Fig. 4 Model sketch of different cascade structures

Table 4 Effects of different cascade structures on Model
Fig. 5 Reconstruction error of ordinary AE

Fig. 6 Reconstruction error of HaarAE

Table 5 Anomaly detection results on contaminated training data from KDDCUP99
Fig. 7 Time overhead at different wavelet transform levels

5 Conclusion

In order to solve the problems existing in traffic anomaly detection for IoT equipment, an unsupervised anomaly detection model called HaarAE is proposed in this paper. First, features of the original data are extracted with the Haar wavelet transform to capture time- and frequency-domain characteristics and enhance the feature expression of the original data. Next, the original data is fed into the model together with the data from the wavelet transform. In the encoding phase, ConvLSTM layers are added to capture the temporal features of the data. In the decoding phase, the ConvLSTM results are cascaded with each layer of the decoder to provide more effective feature information for the decoder and improve the fitting ability of the model. Finally, a memory module is added between the encoder and decoder to increase the reconstruction error of abnormal samples. The model has been validated on different types of datasets for its effectiveness and versatility. Experiments show that HaarAE improves the effect of anomaly detection while retaining a certain universality.