Keywords

1 Introduction

The arrival of the era of big data promoted the development of information retrieval and data mining technology [1]. Detection of volume abnormal information is also becoming more and more important. There are a lot of detecting problem of large data in many practical applications. Furthemore, the exception will make network congestion and will cause serious influence to the user, thus analysis of abnormal problem is very important for us [2].

In recent work demonstrated a useful role for principal component analysis (PCA) to detect network anomalies. They showed that the minor components of PCA (the subspace obtained after removing the components with largest eigenvalues) revealed anomalies that were not detectable in any single node-level trace. This work assumed an environment in which all the data is continuously pushed to a centralsite for off-line analysis. Such a solution cannot scale either for networks with a large number of monitors nor for networks seeking to track and detect anomalies at very small time scales. Thus, anomaly detection in large data is still a problem to be studied.

In this paper we propose a general method to diagnose anomalies. This method is based on PCA (principal components analysis) algorithm and the CS (compression sensing) theory to realize the data anomaly detection. The goal of this article is in order to achieve the data of anomaly detection.

The paper is organized as follows. In Sect. 1, we introduced the data of the research status of anomaly detection and research content. In Sect. 2, we introduced anomaly detection theory and separable compression sensing theory. In Sect. 3, we first generate simulation data and then the data for training and testing results are obtained. In Sect. 4, we get the article conclusion.

2 Theory

2.1 Anomaly Detection

According to Lakhina et al. [2], we can learn a lot of anomalies which is very rare, and abnormal will be hidden in the normal data. PCA [3] is a small number of principal components derived from the original variables, so that they can retain the information of the original variables as much as possible [4]. Thus, we can use PCA algorithm which is easier to find abnormal.

First, let the network information by matrix \(X \, = (X_{1} \, ,X_{2} \, , \ldots \, ,X_{L} \, ) \,\) and each of these data \(X_{i} \in R^{N} ,i\text{ = 1}, \ldots ,L\) [5] after the normalization of matrix and we begin to decomposition for the normalized matrix by using PCA algorithm, then get the same covariance matrix:

$$\Sigma x = \left( {\frac{1}{L}} \right)XX^{T} \, = U\Lambda U^{T}$$
(1)

Given that \(U\text{ = [}U_{1} \text{,} \ldots \text{,}U_{k} \text{]}\) are the principal eigenvectors of \(\sum_{x}\) corresponding to the largest K eigenvalues \(\uplambda_{1} , \ldots ,\uplambda_{K}\), the projection onto the residual subspace is \(P = (I - UU^{T} )\) [6]. So, for the checked data X, its protection into the residual subspace is \(Z = PX = (I - UU^{T} )X\). If Z follows a multivariate normal distribution, the squared prediction error (SPE) [7] statistic is given as:

$$tSPE = \left\| Z \right\|_{2}^{2} \, = \left\| {(I - UU^{T} \, )X} \right\|_{2}^{2}$$
(2)

and follows a noncentral chi-square distribution under the null hypothesis that the data is ‘normal’. Hence, rejection of the null hypothesis can be based on whether tSPE exceeds a certain threshold corresponding to a desired false alarm rate \(\upbeta\). In [2], the Q-statistic was identified as threshold, it is usually expressed as:

$$Q_{\upbeta} \, = \theta_{1} \, [\frac{{c_{\upbeta} \sqrt {2\theta_{2} h_{0}^{2} } }}{{\theta_{1} }} + 1 + \frac{{\theta_{2} h_{0} \left( {h_{0} - 1} \right)}}{{\theta_{1}^{2} }}]^{{\frac{1}{{h_{0} }}}}$$
(3)

where \(h_{0} \, = 1 - \frac{{2\theta_{1} \theta_{3} }}{{3\theta_{2}^{2} }}\), \(\theta_{i} \, = \sum\limits_{j = K + 1}^{N} {\upxi_{j}^{i} }\) for \(i\text{ = 1, 2, 3}\), \(c_{\upbeta} = (1 -\upbeta)\) percentile in a standard normal distribution and \(Q_{\upbeta}\), and \({\xi }_{j}\), \(i = \text{1}, \ldots ,M\) are the eigenvalues of \({\sum }_{y}\).

Once the tspe > \(Q_{\upbeta}\), then can estimate the data which is the abnormal state detection data.

2.2 Theory of Separable Compression Sensing

Separate compression sensing theory pointed out that the random measurement matrix through the tensor product can be expressed as \(\Phi =\Phi_{x}\,\otimes\,\Phi_{y}\) [8], Depending on the theory of CS \(X\) Random projection value \(G\) can be represented as:

$$G =\Phi_{x} X\Phi_{y}$$
(4)

The (1) further the available data \(X\) of the sparse coefficient matrix is expressed as A:

$$G =\Phi_{x}\Psi_{x} A(\Psi_{y} )^{T}\Phi_{y}$$
(5)

The \(\Psi =\Psi_{x}\,\otimes\,\Psi_{y}\) is sparse transformation matrix tensor product form. Depending on the random measured value, can reconstruct the original data:

$$\hbox{min} \left\| A \right\|_{1} \quad \quad {\rm subject}\,{\rm to}\,G =\Phi_{x}\Psi_{x} A(\Psi_{y})^{T}\Phi_{y}$$
(6)

According to the Eq. (5), G represents the compressed of testing data, the G instead of type (2) in X, then carries on the data of the anomaly detection [9, 10].

3 Simulation and Experimental Results

3.1 Synthetic Data

There are two main purpose of the experiment. The first one is to prove that our method is feasible for Large N. The second is to show that it has a better time resolution property when processing the data stream in CS domain. First, we consider a model for the network data as follows:

$$x = s + n$$
(7)

where x is the snapshot of network traffic over N links: Which letter s represents original signal and the letter n is a gaussian noise that is added to the original data [6]. We selected matrix which is 500 lines of 2000 columns. In order to make the anomaly in the compression state as much as possible is detected, we selected the multiple compression matrix for the test, at the same time, we add the Gaussian noise average is zero-mean and mean square error is 0.01. To imitate the network abnormal, we added 40 sample sizes of exception information on the original signal following the procedure mentioned in [2].

3.2 Experiments Results

We investigated the uncompressed matrix and the compression degree of different matrix analysis as shown in the results. Horizontal axis represents the \(\upbeta\) from 0.1 to 0.9, ordinate represents anomaly detection accuracy. Specifically, Fig. 1 shows a plot of the eigenvalue distribution between original and compressed data. This is a very encouraging result from the point of view of detecting anomaly in CS domain. Under each beta we all joined the anomaly information with 40, the number of each matrix to detect abnormal is \(Y/40\) and the anomaly detection accuracy can be calculated according to Fig. 2. Uncompressed curve as shown in Fig. 2, it can detect the abnormal information ratio which increases with the beta. As shown in Fig. 2 matrix compression degree can detect abnormal probability which is also different. The smaller the compression degree of anomaly detection the accuracy is higher.

Fig. 1
figure 1

Eigenvalues of original and compressed data

Fig. 2
figure 2

The correct probability of anomaly detection with different matrix

4 Conclusions

Through simulation experiment, we use PCA and separable compression sensing to detect the different matrices, the matrix of uncompression is more easily to detect the abnormal than the matrix of compression. Thus, we have to choose the degree of compression in order to detect the abnormal information more accurately.