1 Introduction

With the rapid development of big data applications, data are accumulating at an alarming rate in all walks of life (Subramaniam et al. 2006; Esmaeili and Almadan 2011; Gaber et al. 2005). For example, Wal-Mart produces about 20 million transaction records each day, Google servers process about 4 million search requests each hour, and AT&T generates on average 3 thousand call records each second. Most of these data are normal and of little value; however, a very small proportion, roughly 0.01–5 %, makes a big difference to decision making. For example, an anomalous credit card transaction in the financial field could signify fraudulent use of credit cards, an anomalous traffic pattern in computer network security could indicate unauthorized access (Gomez et al. 2013), and an anomalous fluctuation of a detection signal in industrial control could indicate a faulty component (Limthong et al. 2014; O’Reilly et al. 2014; Salem et al. 2014; Serdio et al. 2014; Gil et al. 2014; Di Martino et al. 2014). All of these applications require anomaly detection methods that identify anomalous observations promptly, so that efficient measures can be taken to warn about, track or resolve the corresponding problems.

In fact, many anomaly detection methods based on statistics, data mining and machine learning have been proposed in the literature (O’Reilly et al. 2014; Gupta et al. 2014); they usually build a profile from a historical normal dataset, and a newly arriving observation is identified as normal or not according to whether it conforms to that profile. However, anomaly detection in these methods was often a “side effect” of techniques designed for other purposes (classification or clustering) rather than for anomaly detection itself (Liu et al. 2012; Yang et al. 2015; Huang et al. 2015). Besides, most of them were proposed for static datasets and are unfit for the online processing of streaming data. As is well known, the typical characteristics of streaming data, such as continuous generation, massive volume, unknown or dynamically changing data distributions, and the highly unbalanced distribution between normal and anomalous observations, make static, offline or “multiple-pass” methods inappropriate for the streaming setting.

In this paper, a novel online anomaly detection method for streaming data is proposed based on an improved hyper-grid structure and online ensemble learning. To reduce the large search space of the hyper-grid-based anomaly detection method, the \(L_{1}\) neighbor detection region of the initial hyper-grid structure is improved. Besides, considering that the optimal hyper-cube of the hyper-grid structure is hard to obtain in the streaming context, online ensemble learning is employed, which has two advantages: it can adapt to the evolving distribution of streaming data, and obtaining multiple weak hyper-grid-based individual detectors is easier than learning a single optimal strong detector.

The remainder of this paper is organized as follows: Sect. 2 reviews related work on anomaly detection; Sect. 3 presents the online anomaly detection method based on the hyper-grid structure and online ensemble learning; Sect. 4 details the experiments and result analysis; finally, conclusions and an outlook on future work are given.

2 Related works

Up to now, many researchers have focused their attention on anomaly detection in the statistics, data mining and machine learning communities. For static datasets, many anomaly detection methods have been proposed from different perspectives (Lee et al. 2013; Suhailis et al. 2014). For example, from the perspective of dataset characteristics, two categories of methods have been proposed. One category takes the unbalanced data distribution into account and uses only the normal dataset to build the detector model (Segui et al. 2013); the other treats anomaly detection as a binary classification problem, using over-sampling or under-sampling methods (He et al. 2008) to balance the dataset and classification methods to build the detector model (Desir et al. 2013). It should be noted, however, that the former may lose some useful information, while the latter may decrease generalization performance.

From the perspective of the technique adopted, the methods can be roughly categorized as distance-based (Angiulli and Fassetti 2009; Moshtaghi et al. 2011), density-based (Breunig et al. 2000; Huang et al. 2014), model-based (Liu et al. 2012) and so on. Distance-based methods usually take a distance, such as the Euclidean or Mahalanobis distance, as the basic detection metric: the distance between the object sample and the center of the detection model is calculated first, and if it does not exceed a pre-defined threshold, the sample is regarded as normal; otherwise, it is anomalous. Density-based methods are an extension of distance-based methods: if the density of the region where the object sample is located is less than a pre-defined threshold, the sample is judged anomalous; otherwise, it is normal. Density-based methods usually achieve relatively higher detection accuracy than other methods for local anomalies. Model-based methods learn a detector model from the historical dataset; statistical theory or machine learning methods, such as the Gaussian distribution, artificial neural networks and support vector machines, have been used to build such detection models (Schölkopf 2001). If the object sample conforms to the generation mechanism and fits the learned model during detection, it is regarded as normal; otherwise, it is anomalous.

From the perspective of whether sample labels are used during detector building, the methods can be categorized as supervised, unsupervised and semi-supervised (Daneshpazhouh and Sami 2014; Yamanishi et al. 2004; Noto et al. 2012; Zhang et al. 2010). In short, the above taxonomies overlap to some extent, and it is clear that machine learning-based methods are dominant. Recently, starting from the essential characteristics of anomalies, i.e., few and different, an isolation-based method was proposed (Liu et al. 2012); in addition, a mass-based method was proposed based on quantity estimation in the detection region (Ting et al. 2013). Further results have since been obtained based on these two methods (Yu et al. 2009a; Tan et al. 2011).

With the arrival of the big data era, data are accumulating rapidly, which makes anomaly detection methods designed for static datasets improper for online data processing. Storing massive data in main memory is impossible and unnecessary, and methods that scan the training dataset many times to learn the detector are unfit for the one-pass characteristic of online data processing. Recently, researchers have proposed new methods for online anomaly detection (Quinn and Sugiyama 2014). These methods generally fall into two directions. One is the modification of traditional methods, i.e., training the initial detector on all historical data and updating it with each newly arriving data chunk. However, this has an obvious disadvantage: its poor adaptability and scalability do not suit the dynamic distribution of online streaming data; only when the change in data distribution is slow and small can this strategy achieve acceptable detection performance. The other is to build detectors based on incremental learning (He et al. 2011) and online ensemble learning theory (Dietterich 1997; Kolter and Maloof 2007; Breiman 1996, 2001; Bifet et al. 2009a; Ando et al. 2015), which trains multiple individual detectors from different facets of the streaming data. Theory and experiments have shown that online ensemble learning can deal with the dynamic characteristics and concept drift (Minku and Yao 2012) of streaming data to some extent. Nowadays, two ensemble models are widely used, the horizontal ensemble and the vertical ensemble. The former uses the latest several continuous data chunks to train multiple individual detectors; its main advantage is that it can capture the time dependence of streaming data and tolerate noisy data items, while its most obvious disadvantage is that some individual detectors may be too old to fit the current data distribution. The latter uses only the latest chunk to train multiple individual detectors with different learning algorithms, but this strategy requires that the current data chunk be highly accurate; otherwise, the trained detector performs poorly. In general, due to the dynamic characteristics of streaming data, online ensemble learning has become a hot direction in online anomaly detection, with online bagging and online boosting methods (Oza 2005; Qi et al. 2011; Fern and Givan 2003) providing a solid theoretical foundation (Bifet et al. 2009b; Chang and Cho 2010). Many researchers have proposed online methods for the classification of streaming data (Yamanishi et al. 2004; Palshikar 2005; Yu et al. 2009b; Zhou et al. 2013; Sagha et al. 2013; Ding et al. 2015; Ding and Fei 2013), some of which can be used to detect anomalies in streaming data. For example, the experimental results of He et al. (2011) demonstrated that the proposed method could track the data distribution in real time and achieve good detection performance. Tan et al. (2011) proposed a half-space tree algorithm for streaming data based on the isolation principle and ensemble learning, dedicated to online streaming data processing. Its most obvious advantage is that no real data are needed to build the initial detector, so the detector can be built quickly. Its disadvantage is also obvious: it updates the detector continuously whether or not an update is required, which degrades its real-time performance. In this paper, a hyper-grid-based ensemble online anomaly detection method is proposed based on the hyper-grid structure and online ensemble learning theory; the next section presents it in detail.

3 Hyper-grid structure-based ensemble anomaly detection method

3.1 Hyper-grid structure-based anomaly detection method

The hyper-grid-based anomaly detection method was initially designed for offline anomaly detection (Xie et al. 2012) and is therefore inappropriate for online data processing. Here, the hyper-grid structure is first introduced.

For a sample \(x_{i}\) (\(x_i \in X\), \(x_i =(x_{i,1},x_{i,2} ,\ldots ,x_{i,q} ), i=1,\ldots ,m\)), where each \(x_i\) is a q-dimensional vector and X denotes the dataset, the hyper-grid structure G built from X is represented as follows:

$$\begin{aligned} G=\left\{ C_{u1,\ldots uq} \;|\;C_{u1,\ldots uq} \in G\right\} \end{aligned}$$

\(C_{u1,\ldots uq} \) denotes a hyper-cube whose side length and diagonal length are h and d, respectively, with \(h=d{/}(2\sqrt{q})\); the hyper-grid structure G is composed of contiguous hyper-cubes. Any \(x\in X\) can be mapped into one hyper-cube of the hyper-grid structure.

After a hyper-grid structure is built from the training dataset X, whether a newly arriving observation is anomalous is decided by the region into which it is mapped. If the mapping position of the new observation falls in a sparse region, i.e., no or few training data were mapped into that hyper-cube, the new observation is detected as an anomaly; otherwise, it is normal.
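As an illustration, the following minimal Python sketch (ours, not the original implementation; the grid is assumed to be anchored at the origin, and `cell_index` is an illustrative name) shows how an observation is mapped to its hyper-cube:

```python
import math

def cell_index(x, h):
    """Map a q-dimensional observation x to the index of its hyper-cube,
    i.e., the element-wise floor of x / h (grid anchored at the origin)."""
    return tuple(math.floor(xi / h) for xi in x)

# With h = 0.5, the 2D observation (1.2, 0.7) falls into cube (2, 1).
print(cell_index((1.2, 0.7), 0.5))
```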

Unfortunately, using only the exactly mapped region usually does not accurately judge whether an observation is normal, especially for observations mapped onto the boundary of a hyper-cube. Therefore, for the hyper-cube \(C_{u1,\ldots uq} \), its first neighbor detection region, denoted layer-1 \((L_{1})\), is further introduced and defined as follows.

$$\begin{aligned} L_1 (C_{u1,\ldots uq} )=\left\{ C_{v1,\ldots vq} \;|\;v_i \in \{u_i -1,\,u_i ,\,u_i +1\},\; C_{v1,\ldots vq} \ne C_{u1,\ldots uq} \right\} \end{aligned}$$
(1)

To illustrate the above definition, a hyper-grid structure built from a 2-dimensional (2D) dataset is used as an example, and the layer-1 \((L_{1})\) neighbor detection region of the mapped hyper-cube is depicted in Fig. 1.

Fig. 1

Layer-1 \((L_{1})\) neighbor detection region of hyper-grid structure for a 2D observation (black dot). a One grid of 2D hyper-grid structure, b \(L_1\) neighbor detection region of one observation mapped in the hyper-grid, c \(L_1\) neighbor detection region of one observation mapped on the edge of hyper-grid, d \(L_1\) neighbor detection region of one observation mapped on the crossing point of hyper-grid

The black dot in Fig. 1 denotes the mapped observation located at different positions in the corresponding hyper-cube. In Fig. 1b the observation is mapped inside one hyper-cube, and the number of its \(L_{1}\) neighbor detection regions is 8. When an observation is mapped onto the boundary of a hyper-cube, the number of \(L_{1}\) neighbor detection regions is relatively smaller, as depicted in Fig. 1c, d. Here, under the hyper-grid structure, an anomaly is redefined as follows.

Definition 1

For an observation x, if its mapping region is sparse, it may be an anomalous sample; otherwise, it is normal (Xie et al. 2012; Knorr et al. 2000).

The above definition is qualitative and hard to apply to practical anomaly detection. Therefore, a threshold k is introduced, and three heuristic rules are formulated to facilitate anomaly detection.

Rule 1

If more than k observations are mapped into the hyper-cube \(C_{u1,\ldots uq} \), a new observation mapped into \(C_{u1,\ldots uq} \) is usually normal.

Rule 2

If more than k observations are mapped into the hyper-cube \(C_{u1,\ldots uq} \) and its layer-1 \((L_{1})\) region together, i.e., \(C_{u1,\ldots uq} \cup L_1 (C_{u1,\ldots uq} )\), a new observation mapped into \(C_{u1,\ldots uq} \) is usually normal.

Rule 3

If fewer than k observations are mapped into the hyper-cube \(C_{u1,\ldots uq} \) and its layer-1 \((L_{1})\) region together, i.e., \(C_{u1,\ldots uq} \cup L_1 (C_{u1,\ldots uq} )\), a new observation mapped into \(C_{u1,\ldots uq} \) is usually anomalous.
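For illustration, the rules can be sketched as follows, assuming the training counts are kept in a dictionary keyed by hyper-cube index and the full \(L_{1}\) region of formula (1) is enumerated (`counts`, `k` and the function names are illustrative, not from the original implementation):

```python
import itertools
import math
from collections import Counter

def l1_region(u):
    """All cubes C_v with v_i in {u_i - 1, u_i, u_i + 1}, excluding C_u itself
    (formula (1))."""
    ranges = [(ui - 1, ui, ui + 1) for ui in u]
    return [v for v in itertools.product(*ranges) if v != u]

def is_anomalous(x, h, counts, k):
    """Rules 1-3: anomalous iff fewer than k training observations fall into
    the mapped cube together with its L1 neighbor detection region."""
    u = tuple(math.floor(xi / h) for xi in x)
    total = counts[u] + sum(counts[v] for v in l1_region(u))
    return total < k

# Toy usage: train on three points, then test a far-away observation.
train = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4)]
counts = Counter(tuple(math.floor(xi / 0.5) for xi in x) for x in train)
print(is_anomalous((5.0, 5.0), 0.5, counts, k=2))  # True: sparse region
```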

Differently from Knorr et al. (2000), where the initial hyper-grid-based anomaly detection method employed both \(L_1 (C_{u1,\ldots uq} )\) and \(L_2 (C_{u1,\ldots uq} )\), the proposed method employs only the \(L_1\) neighbor detection region, considering that the \(L_2\) neighbor search space is considerable, and introduces ensemble learning (Sect. 3.3) to build multiple hyper-grid structures rather than a single optimized one.

3.2 The improved \({L}_{1}\) neighbor region of hyper-cube

Figure 1 illustrates an obvious fact: the number of neighbor detection regions is closely related to the mapped position of the observation. In fact, it is hard to determine the exact detection region for a given sample according to the aforementioned definition, especially for samples mapped onto or near the boundary, as described in Fig. 2.

Fig. 2

Layer-1 \((L_{1})\) neighbor detection region for two 2D observations (black dots). a \(L_1\) neighbor detection region of two observations (one is close to the left edge, the other is mapped on the edge), b \(L_1\) neighbor detection region of two observations (one is close to the upper-left corner, the other is mapped on the crossing point), c \(L_1\) neighbor detection region of two observations (one is close to the upper edge, the other is close to the lower edge), d \(L_1\) neighbor detection region of two observations (one is close to the upper-left corner, the other is close to the lower-right corner)

From Fig. 2a, b, it is obvious that even though two observations are very close to each other, their neighbor detection regions differ significantly. Moreover, even when two observations are mapped into the same hyper-cube, as in Fig. 2c, d, the defined detection region may not represent the true situation, because it is hard to obtain an accurate grid structure. Consequently, the traditional definition leads to poor detection performance in some cases. To overcome this difficulty and further reduce the search space, the improved \(L_{1}\) neighbor detection region is redefined.

Fig. 3

Proposed layer-1 \((L_{1})\) neighbor detection region of the hyper-grid structure for a 2D observation (black dot). a Improved \(L_1\) neighbor detection region (observation is close to the upper-left corner), b improved \(L_1\) neighbor detection region (observation is close to the lower-right corner), c improved \(L_1\) neighbor detection region (observation is located on the edge), d improved \(L_1\) neighbor detection region (observation is located on the edge), e improved \(L_1\) neighbor detection region (observation is located on the edge), f improved \(L_1\) neighbor detection region (observation is located on the crossing point)

For a sample x, its neighbor space usually refers to the geometrically adjacent space. Therefore, in the hyper-grid structure, the mapped position of the sample is taken into account when judging its \(L_{1}\) neighbor detection region. Compared with the traditional method, this improvement decreases the number of adjacent hyper-cubes to be searched. The improved \(L_{1}\) neighbor detection region \(\bar{{J}}(x)\) is defined as follows:

$$\begin{aligned}&\overline{J} (x)=\left\{ C_{v1,\ldots vq} \;|\;v_i \in \{u_i ,\;u_i +e_i \},\;\;i=1,\ldots ,q\right\} \end{aligned}$$
(2)
$$\begin{aligned}&e_i = \left\{ \begin{array}{ll} +1&{}\quad u_i^*-u_i >0.5 \\ -1&{}\quad u_i^*-u_i <0.5 \\ \{-1,\;+1\}&{}\quad u_i^*-u_i =0.5 \end{array} \right. \end{aligned}$$
(3)
$$\begin{aligned}&u_i^*=\frac{x_i }{h},\;\;\;i=1,\ldots q\end{aligned}$$
(4)
$$\begin{aligned}&u_i =\left\lfloor {u_i^*} \right\rfloor ,\;\;\;i=1,\ldots q \end{aligned}$$
(5)

Based on the above improved definition of the \(L_{1}\) neighbor detection region, Fig. 3 presents part of the proposed layer-1 (\(L_1\)) neighbor detection regions of the hyper-grid structure for a 2D observation.

Comparing Figs. 1 and 3: for observations that are mapped neither onto the boundary nor at the center of a hyper-cube, the number of hyper-cubes to search decreases from 9 to 4 in the 2D hyper-grid structure. In fact, the search region shrinks for most observations. For an n-dimensional hyper-grid structure, when the mapped position of an observation is neither on the boundary nor at the center of a hyper-cube, the number of cells in the detection region decreases from \(3^{n}\) to \(2^{n}\), which significantly saves computation cost; for observations mapped onto a boundary, the number of neighbor detection regions to search decreases to varying degrees.
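A hedged sketch of the improved region of formulas (2)–(5) follows (illustrative code, not the authors' implementation); in the generic case it enumerates only \(2^{q}\) cubes instead of \(3^{q}\):

```python
import itertools
import math

def improved_l1_region(x, h):
    """Improved L1 detection region of formulas (2)-(5): per dimension,
    keep the mapped index u_i and its geometrically nearer neighbor u_i + e_i."""
    per_dim = []
    for xi in x:
        u_star = xi / h            # formula (4)
        u = math.floor(u_star)     # formula (5)
        frac = u_star - u
        if frac > 0.5:             # formula (3): nearer to the upper face
            cand = (u, u + 1)
        elif frac < 0.5:           # nearer to the lower face
            cand = (u, u - 1)
        else:                      # exactly on the mid-line: keep both sides
            cand = (u - 1, u, u + 1)
        per_dim.append(cand)
    return list(itertools.product(*per_dim))

# A 2D observation well inside a cube searches 4 cubes instead of 9.
print(len(improved_l1_region((1.2, 0.7), 0.5)))  # 4
```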

3.3 Hyper-grid-based online anomaly detection algorithm

The initial hyper-grid structure-based anomaly detection method was designed for static datasets and cannot be directly used for anomaly detection in streaming data. Because the boundary of a hyper-grid structure built from a static dataset is fixed, a sample is unconditionally judged anomalous when its mapping position exceeds the hyper-grid boundary. Moreover, concept drift may occur in the streaming environment, making the previous hyper-grid structure unfit for the current data distribution as the stream evolves; that is, a sample may be normal even though it exceeds the hyper-grid boundary. Consequently, the hyper-grid structure should be updated in response to distribution changes in the streaming data. Figure 4 shows the flow chart of the proposed streaming data anomaly detection method, which consists of three critical stages: detector building, anomaly detection and detector updating.

Fig. 4

A flow chart of the streaming data anomaly detection method based on hyper-grid structure

The specific procedures are described as follows.

(1) Building the anomaly detector based on the hyper-grid structure

The procedure of learning the hyper-grid-based anomaly detector is the activity of mapping the observations into hyper-cubes of the hyper-grid space. This process is simple: only a \((q+1)\)-dimensional vector, i.e., \((u_1,u_2,\ldots ,u_q,cnt)\), is required for each observation, where the last item cnt stores the number of observations mapped into the hyper-cube \(C_{u1,\ldots uq}\) and the first q items, i.e., \((u_1,u_2,\ldots ,u_q )\), store the position index to which the observation is mapped. The number of observations mapped to \(C_{u1,\ldots uq} \) is defined as \(S(C_{u1,\ldots uq} )=cnt\), with initial value 0. In fact, if the side length of the hyper-cube is 1, the hyper-grid-based anomaly detection method can to some extent be categorized as a density-based method. Consequently, training the hyper-grid-based anomaly detector is just the process of mapping each observation to a hyper-cube and counting. The pseudo-code for training the hyper-grid-based anomaly detector is described in Algorithm 1.

Algorithm 1 (pseudo-code figure)

The Find() function in Algorithm 1 obtains the number of observations mapped into the hyper-cube \(C_{u1,\ldots uq} \); if the complete hyper-grid space were stored as a sequential structure, random access could be used. Unfortunately, the hyper-grid space is usually sparse, owing to the greatly unbalanced distribution between normal and anomalous observations. In real applications, compressed structures are usually introduced to store such a sparse space and save memory. Consequently, a specially designed search method is implemented in Find() to accelerate lookup.
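Since only non-empty cubes need to be stored, a hash map is one natural sparse realization of the grid and of Find(); the sketch below is an assumption about the storage scheme rather than the paper's exact compressed structure:

```python
import math
from collections import Counter

class HyperGrid:
    """Sketch of Algorithm 1: training just maps each observation to a
    hyper-cube and counts, so only the (u_1, ..., u_q, cnt) records of
    non-empty cubes are kept."""
    def __init__(self, h):
        self.h = h
        self.counts = Counter()      # sparse storage of S(C_u) = cnt

    def fit(self, X):
        for x in X:
            u = tuple(math.floor(xi / self.h) for xi in x)
            self.counts[u] += 1
        return self

    def find(self, u):
        """Find(): number of training observations mapped into cube u.
        Hashing makes the lookup O(1) despite the huge grid space."""
        return self.counts[u]
```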

(2) Anomaly detection based on the hyper-grid method

After the detector is built, online anomaly detection starts for the continuously arriving observations. First, each new observation is mapped to one hyper-cube of the hyper-grid structure; second, whether it is anomalous is decided by the total cnt, i.e., the sum of the numbers of mapped points in the corresponding hyper-cube and its \(L_{1}\) neighbor detection region. The pseudo-code of the anomaly detection method is described in Algorithm 2.

Algorithm 2 (pseudo-code figure)
(3) Updating strategy of the detector

The updating of the hyper-grid structure is a re-learning procedure. To reduce computation cost and alleviate fluctuations in detection performance, a delayed updating strategy is employed (Xie et al. 2012): a probability p and a buffer are specified based on prior knowledge. For each newly arriving observation, p decides whether it is stored in the buffer. When the buffer is full, the detector updating procedure is activated. The magnitude of p is usually determined by the prior knowledge of domain experts: if the dynamics of the streaming data are relatively small, a relatively small value may be specified to decrease computational cost; otherwise, a relatively large value is specified so that distribution changes are reflected quickly. The pseudo-code of the detector updating method is described in Algorithm 3.

Algorithm 3 (pseudo-code figure)
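A minimal sketch of this delayed updating strategy, assuming `rebuild` is a hypothetical callback that re-learns the hyper-grid structure from the buffered observations:

```python
import random

class DelayedUpdater:
    """Delay-updating strategy: each arriving observation enters the buffer
    with probability p; a full buffer triggers re-learning of the detector."""
    def __init__(self, p, buffer_size, rebuild):
        self.p = p
        self.buffer_size = buffer_size
        self.rebuild = rebuild       # hypothetical re-learning callback
        self.buffer = []

    def observe(self, x):
        if random.random() < self.p:
            self.buffer.append(x)
        if len(self.buffer) >= self.buffer_size:
            self.rebuild(self.buffer)
            self.buffer.clear()
```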

Obviously, the two key parameters h and k have an important effect on the performance of the proposed method. For a static dataset with known data distribution, parameter estimation methods can be used to obtain relatively accurate values (Xie et al. 2012; Knorr et al. 2000); in this paper, the initial parameter values are obtained from the historical dataset. To improve the robustness of the above method, online ensemble learning is introduced, as described next.

3.4 Hyper-grid-based ensemble anomaly detection algorithm

Our proposed method consists of three key procedures: training the ensemble detector based on the hyper-grid structure, online anomaly detection and online ensemble detector updating.

(1) Building the ensemble anomaly detector

In fact, it is hard to obtain the optimal side length h of the hyper-cube for X, especially for streaming data. Therefore, ensemble learning is employed to build multiple weak detectors on different hyper-grid structures and combine them into a strong detector; a rough estimation range of h, rather than its exact value, can be obtained as described below.

Based on the results presented in Xie et al. (2012) and Knorr et al. (2000), if \(x\in R^{q}\) and \(d=2\sqrt{q}h\), the mean square error (MSE) of the detection region reaches its minimum. The expected h can then be estimated by the following formula.

$$\begin{aligned} h^{*}=\left( {\frac{Z_1 6q}{Z_2 n\sum _{i=1}^q {R(f_i )} }} \right) ^{\frac{1}{q+2}} \end{aligned}$$
(6)

where \(R(f)=\int _{R^{q}} {f(x)^{2}} \mathrm{d}x\) and f(x) is the probability density function (PDF), which can be estimated by \(\overline{f(x)} =\frac{S(C_{u1,\ldots uq} )}{nh^{q}}\); \(S(C_{u1,\ldots uq} )\) is the number of data mapped into the hyper-cube \(C_{u1,\ldots ,uq} \) and actually follows a binomial distribution. \(h^{*}\) denotes the expected value of h; the approximate h can be obtained from the above formula in batch learning, where the data distribution is known. Unfortunately, for infinite streaming data the probability density function is hard to obtain, and an estimation method must be adopted. Suppose that in the current data chunk the streaming data follow a Gaussian distribution \(N(\mu ,\;\Sigma ) \), where \(\mu =(0,\ldots ,0)\) and \(\Sigma = \mathrm{diag}(1,\ldots ,1)\); if the current distribution is unknown, it can be learned by online or incremental learning methods (He et al. 2011). Under this assumption, \(Z_1 \in \left[ {\left( {\frac{3}{2}} \right) ^{q}h^{2q},2^{q}h^{2q}} \right] \) and \(Z_2 =(2h)^{2q}\) (Xie et al. 2012), from which \(\frac{Z_1 }{Z_2 }\in \left[ {\left( {\frac{3}{8}} \right) ^{q},\left( {\frac{1}{2}} \right) ^{q}} \right] \) and \(R(f_i )=\frac{1}{2^{q+1}\pi ^{\frac{q}{2}}}\) can be derived. Consequently, the rough boundary of h can be calculated by formula (7).

$$\begin{aligned}&h^{*}\in \left[ \left( {\left( {\frac{3}{8}} \right) ^{q}\frac{6q}{n\sum _{i=1}^q {R(f_i )} }} \right) ^{\frac{1}{q+2}},\right. \nonumber \\&\quad \left. \left( {\left( {\frac{1}{2}} \right) ^{q}\frac{6q}{n\sum _{i=1}^q {R(f_i )} }} \right) ^{\frac{1}{q+2}} \right] \end{aligned}$$
(7)
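Under the stated standard Gaussian assumption, the bounds of formula (7) reduce to a short computation, sketched below for given n and q:

```python
import math

def h_bounds(n, q):
    """Rough [h_l*, h_u*] of formula (7) with R(f_i) = 1 / (2^(q+1) * pi^(q/2)),
    i.e., the standard Gaussian case assumed in the text."""
    r_sum = q / (2 ** (q + 1) * math.pi ** (q / 2))   # sum_i R(f_i)
    base = 6 * q / (n * r_sum)
    lower = ((3 / 8) ** q * base) ** (1 / (q + 2))
    upper = ((1 / 2) ** q * base) ** (1 / (q + 2))
    return lower, upper

print(h_bounds(n=10000, q=4))
```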

The pseudo-code for building the hyper-grid-based ensemble anomaly detector is described in Algorithm 4.

Algorithm 4 (pseudo-code figure)
(2) Ensemble anomaly detection

After the ensemble detector is trained, the anomaly detection procedure is similar to that described in Sect. 3.3. It should be noted that the critical parameter k is still unknown for the streaming dataset. From a statistical perspective, the value of k can be designated by prior knowledge in real applications: if \(\int _{J(y)} {f(x)\mathrm{d}x} \) is less than a very small probability, for example 0.01, then y can be regarded as an anomaly; hence, for a dataset consisting of m observations, \(k=0.01*m\). For streaming data, however, such an estimation has poor robustness. In fact, once h is specified, k can be learned from the training dataset.

Suppose that f(x) is continuous in the sample space, i.e., \(x\in X,\;x_i \in [\mathrm{min}_i,\;\mathrm{max}_i ],\;i=1,\ldots ,q\). Let \({\vert }J(y){\vert }\) denote the number of observations in the detection region J(y); if s observations are selected randomly from X, the estimated value of k, denoted \(\overline{k} \), can be calculated by formula (8).

$$\begin{aligned} \overline{k} =\frac{1}{s}\sum \limits _{i=1}^s {\left| {J(y_i )} \right| } \end{aligned}$$
(8)

\(\vert J(y)\vert \) can be obtained from the hyper-grid structure built from X. Unfortunately, f(x) is usually not continuous in a real hyper-grid structure. Therefore, the continuity rate of the hyper-grid space is calculated by formula (9).

$$\begin{aligned} r=\frac{\left| {\mathrm{NP}} \right| }{\prod _{i=1}^q {\left\lceil {\frac{\mathrm{max}_i -\mathrm{min}_i }{h^{*}}} \right\rceil } } \end{aligned}$$
(9)

\(\vert \mathrm{NP} \vert \) denotes the number of non-empty hyper-cubes in the hyper-grid structure built from dataset X, and \(\prod _{i=1}^q {\left\lceil {\frac{\mathrm{max}_i -\mathrm{min}_i }{h^{*}}} \right\rceil } \) denotes the total number of hyper-cubes. Then \(\overline{k} \) can be estimated by formula (10).

$$\begin{aligned} \overline{k} =\frac{r}{s}\sum \limits _{i=1}^s {\left| {J(y_i )} \right| } \end{aligned}$$
(10)
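Putting formulas (8)–(10) together, \(\overline{k}\) can be estimated as in the following sketch, where `region_of` returns the detection region J(y) of a sample, `counts` holds the per-cube occupancies (a Counter, so missing cubes read as 0), and `total_cells` is the denominator of formula (9); all names are illustrative:

```python
import random

def estimate_k(X, counts, region_of, total_cells, s=100):
    """Estimate k by formulas (8)-(10): average the occupancy |J(y)| over s
    random samples, then scale by the continuity rate r of the grid space."""
    sample = random.sample(list(X), min(s, len(X)))
    avg = sum(sum(counts[c] for c in region_of(y)) for y in sample) / len(sample)
    r = sum(1 for v in counts.values() if v > 0) / total_cells   # formula (9)
    return r * avg                                               # formula (10)
```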

After each individual detector produces its result \(y_i (x)\), a combination strategy is used to obtain the final result \(y_{\mathrm{fin}} (x)\). Commonly used methods in the literature are the majority vote (for classification) and the weighted average (for regression). In this paper, the final ensemble detection result is calculated by (11), where \(w_i \) denotes the weight coefficient: \(w_i =1\) yields the simple average, otherwise the weighted average. For the sake of simplicity, the simple average strategy is employed to combine the final result.

$$\begin{aligned} y_{\mathrm{fin}} (x)=\frac{1}{M}\sum \limits _{i=1}^M {y_i (x)} *w_i \end{aligned}$$
(11)
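For completeness, a one-line realization of combination rule (11); interpreting a positive average as an anomaly flag is an assumption of this sketch:

```python
def combine(votes, weights=None):
    """Formula (11): (weighted) average of the individual detector outputs."""
    weights = weights or [1.0] * len(votes)
    return sum(y * w for y, w in zip(votes, weights)) / len(votes)

# Three detectors vote +1 (anomalous) / -1 (normal); a positive average
# means the ensemble flags the observation as anomalous.
print(combine([+1, -1, +1]))
```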

When the data buffer is full, updating of the ensemble detector is triggered. For each individual detector of the ensemble, the updating method is the same as described for Algorithm 3 in Sect. 3.3.

4 Experiment and result analysis

In this section, the datasets, performance evaluation metrics, experimental results and analysis are described. Experiments were conducted on a personal PC with a Core 2 Duo P7450 CPU @ 2.13 GHz and 4 GB memory, running Windows 7 Professional. Data processing was partly done in MATLAB 2010, and the algorithms of Sect. 3 were implemented on the Microsoft Visual C++ platform. Part of the experiments were conducted in the Waikato Environment for Knowledge Analysis software (Weka 2005).

4.1 Dataset

To validate the proposed method, a real-world dataset, Statlog (Shuttle) (UCI Machine Learning Repository 2007), from the UCI machine learning repository was employed, along with two artificially generated datasets, SimData1 and SimData2. Because the hyper-grid-based method is unsupervised, the label attribute was used only to evaluate the proposed algorithm; furthermore, the input order of each dataset was regarded as the arrival order of the streaming data to simulate the online environment.

Shuttle is a widely used dataset from the UCI repository; it contains 9 attributes and exhibits little or no distribution change. The classes are labeled from 1 to 7, and about 80 % of the samples belong to class 1 (the normal class). To validate our method, labels 2, 3, 5, 6 and 7 were selected as the anomaly classes; the total number of samples used amounts to 49,097.

The artificial datasets SimData1 and SimData2 have the form X(x, y), where x denotes a multi-dimensional data sample and y its label, \(y\in \{0,1\}\): 0 denotes a normal sample and 1 an anomalous one. The dimensions of SimData1 and SimData2 are 4 and 5, respectively; each attribute is continuous, and each dataset has 50,000 samples. SimData1 contains 500 anomalies (expected anomaly rate 1 %), and SimData2 contains 2000 anomalies (expected anomaly rate 4 %). For each artificial dataset, the normal data follow a normal (Gaussian) distribution, i.e., \(p(x)\sim N(\mu ,\Sigma )\) (\(\mu \) and \(\Sigma \) denote the mean and covariance matrix), generated with the MATLAB 2010 function normrnd\((\mu ,\Sigma ,m,n)\); the anomalous data follow a uniform distribution. To simulate streaming data whose distribution may change, an evolution strategy was designed for the dataset generation process.

Suppose \(\mu _i \) is the mean of the ith dimension and its value changes dynamically according to \(\mu _i r_i (1+\tau )\), where \(\tau (\tau \in [0,1])\) denotes the evolution degree and \(r_i (r_i \in \{-1,1\})\) denotes the evolution direction, activated with a 5 % probability; furthermore, the dataset evolution follows the three assumptions below.

1. Only the mean of the dataset changes gradually;

2. Only the variance of the dataset changes gradually;

3. Both the mean and the variance change gradually.

The above three assumptions were activated with a specified probability during the dataset generation process; a sketch of such a generator follows. A summary of the real-world and simulated datasets is given in Table 1.
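A hedged sketch of such a generator (NumPy standing in for MATLAB's normrnd; the default \(\tau \) value and the in-place drift rule are illustrative assumptions):

```python
import numpy as np

def next_chunk(mu, cov, m, tau=0.1, drift_prob=0.05):
    """Draw m normal samples from N(mu, cov); with 5 % probability per
    dimension, drift the mean according to mu_i * r_i * (1 + tau)."""
    mu = np.asarray(mu, dtype=float).copy()
    for i in range(len(mu)):
        if np.random.rand() < drift_prob:
            r = np.random.choice([-1, 1])          # evolution direction r_i
            mu[i] = mu[i] * r * (1 + tau)
    return np.random.multivariate_normal(mu, cov, size=m), mu
```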

Table 1 Summary of three datasets

4.2 Performance evaluation metrics

For a test observation, the trained detector labels it as normal or anomalous. Without loss of generality, “1” and “\(-1\)” denote anomalous and normal observations, respectively. The true label of observation x is denoted \(y\in \{-1,\;+1\}\) and the detection result obtained from the detector is denoted \(\hat{y}\in \{-1,\;+1\}\); if \(\hat{y}=y\), the detector made a correct decision; otherwise, it made a mistake.

For anomaly detection, the results of each observation can be classified into four categories:

1. True positive (TP), i.e., \(y=+1\) and \(\hat{y}=+1\);

2. False positive (FP), i.e., \(y=-1\) and \(\hat{y}=+1\);

3. True negative (TN), i.e., \(y=-1\) and \(\hat{y}=-1\);

4. False negative (FN), i.e., \(y=+1\) and \(\hat{y}=-1\).

The commonly used evaluation metrics for anomaly detection, used to assess the performance of the proposed method, are presented below.

1. True positive rate (TPR)

$$\begin{aligned} \mathrm{TPR}=\frac{\# \mathrm{TP}}{\# \mathrm{TP}+\# \mathrm{FN}} \end{aligned}$$
(12)

\(\# \) denotes set cardinality; e.g., \(\# \mathrm{TP}\) denotes the number of true positive observations in the test dataset. This metric is also known as sensitivity or recall and represents the percentage of anomalies that are correctly detected, i.e., the ratio of the number of correctly detected anomalies to the total number of anomalies.

2. Positive predictive rate (PPR)

$$\begin{aligned} \mathrm{PPR}=\frac{\# \mathrm{TP}}{\# \mathrm{TP}+\# \mathrm{FP}} \end{aligned}$$
(13)

It is also known as precision, i.e., the ratio of the number of correctly detected anomalous observations to the number of anomalies identified by the detector.

3. F-measure

$$\begin{aligned} \mathrm{F}\text {-}\text {measure}= & {} \frac{\left( {1+\beta ^{2}} \right) \mathrm{precision}*\mathrm{recall}}{\beta ^{2}\,{*}\,\mathrm{precision}+\mathrm{recall}}\nonumber \\= & {} \frac{\left( {1+\beta ^{2}} \right) \,*\,\# \mathrm{TP}}{(1+\beta ^{2})*\# \mathrm{TP}+\beta ^{2}{*}\# \mathrm{FN}+\# \mathrm{FP}} \end{aligned}$$
(14)

This metric is also known as the F-score; it is an adjustable mean of precision and recall whose value is close to the smaller of the two. Consequently, a relatively large F-measure means that both precision and recall are high. \(\beta \) is a coefficient adjusting the relative importance of precision and recall, usually set to 0.5, 1 or 2; in this paper, \(\beta =1\).
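For concreteness, the three metrics map directly onto the four counts, as the following minimal sketch shows:

```python
def evaluate(tp, fp, fn, beta=1.0):
    """TPR (recall), PPR (precision) and F-measure per formulas (12)-(14)."""
    tpr = tp / (tp + fn)
    ppr = tp / (tp + fp)
    b2 = beta ** 2
    f = (1 + b2) * ppr * tpr / (b2 * ppr + tpr)
    return tpr, ppr, f

print(evaluate(tp=80, fp=20, fn=20))   # (0.8, 0.8, 0.8)
```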

Table 2 The experimental value of h

4.3 Experimental results and analysis

The purpose of the experiments is to validate the performance of the proposed algorithm. First, each dataset was normalized before the experiments; second, one-fifth of each dataset was used to build the hyper-grid-based detector and the rest was used as the test set. For each dataset, h needs to be specified in advance, and its expected value can be obtained with the method proposed in Sect. 3.4. However, since it is usually difficult, and in practice nearly impossible, to obtain an exact value of h, a rough boundary of h, \([h_l^*,\;h_u^*]\), is specified relatively easily from the historical dataset, where \(h_l^*\) and \(h_u^*\) denote the expected lower and upper boundaries, respectively. As the stream evolves, a more relaxed boundary is specified according to the prior knowledge of domain experts; it can be evaluated by the following formula.

$$\begin{aligned} h_u^{\prime }= & {} h_u^*+(h_u^*-h_l^*)*\lambda _1 \nonumber \\ h_l^{\prime }= & {} h_l^*-(h_u^*-h_l^*)*\lambda _2 \end{aligned}$$
(15)

\(\lambda _1 \) and \(\lambda _2 \) denote relax coefficients with value range [0, 1]. The h values of the three datasets are shown in Table 2: the second column \((h^*)\) shows the expected value of h acquired from the formulas in Sect. 3.4, and the third column \((h')\) shows the experimental value of h estimated from the corresponding expected value. In this paper, \(h'\) varies gradually from the lower to the upper boundary with a step length of 0.01, and 25 hyper-grid spaces are built for each dataset. Then, as shown in Table 2, the corresponding k values of the three datasets are acquired from the \(h'\) values.
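The relaxed boundary of formula (15) and the 0.01-step sweep over \(h'\) can be sketched as follows (the relax coefficient defaults are illustrative):

```python
def candidate_h_values(h_l, h_u, lam1=0.2, lam2=0.2, step=0.01):
    """Formula (15): widen [h_l*, h_u*] by the relax coefficients, then
    sweep h' from the lower to the upper boundary in steps of 0.01."""
    width = h_u - h_l
    lo = h_l - width * lam2
    hi = h_u + width * lam1
    values = []
    v = lo
    while v <= hi + 1e-12:
        values.append(round(v, 2))
        v += step
    return values
```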

(1) Algorithm performance under different parameter settings

For each dataset, 75 independent experiments were carried out with different combinations of h and k values. Each experiment was repeated 10 times, and the averages of the performance evaluation metrics are reported. Owing to limited space, for each h only the best experimental results over k are presented in Tables 3, 4 and 5.

Table 3 Performance on the Shuttle dataset
Table 4 Performance on the SimData1 dataset
Table 5 Performance on the SimData2 dataset

From the experimental results presented in Tables 3, 4 and 5, it can be seen that appropriate h and k values affect the algorithm performance. In real applications, based on prior knowledge and for the sake of simplicity, a relatively fixed k value can usually be specified while the hyper-grid detectors are being built. In this paper, the hyper-grid-based ensemble detector combines individual hyper-grid-based detectors with different h values.

The values of the three performance evaluation metrics, i.e., TPR, PPR and F-measure, on the three datasets are not especially high. Nevertheless, the proposed method proves useful, as the comparative experiments between the hyper-grid-based anomaly detector and some existing methods in the next section will show.

(2) Performance comparison of different algorithms

To compare the proposed method with existing methods, the Weka (Waikato Environment for Knowledge Analysis) software, an open experimental platform of the data mining and machine learning community, is introduced. It is worth noting that the Weka platform only serves as a reference for evaluating the proposed method. Two Weka-based classifiers are employed: one based on support vector machines and ensemble learning (BSVME) and one based on multi-layer perceptrons and ensemble learning (BMLPE). Both experiments are conducted using the complete datasets, that is to say, with known data distributions. If the experimental results of the proposed method are close to those obtained from Weka, this indicates that the proposed method is effective to some extent. The Weka experiments follow these strategies:

1. Each dataset was divided into a training subset and a test subset in the proportion 2:1.

2. The bagging ensemble strategy and the simple-voting result combination method were employed.

3. One-class support vector machines and multi-layer perceptrons were used to train the individual detectors.

4. The ensemble size was set to 40, and each reported result is the average of 50 independent experiments.

In addition, the AHIForest (Ding et al. 2015) and HSTree (Tan et al. 2011) algorithms are employed to evaluate the proposed HypGridE algorithm. The window size, ensemble size and statistical histogram size are set to 256, 40 and 10, respectively. The experimental results are shown in Table 6.

Table 6 Performance of different anomaly detection algorithm
Fig. 5

Experimental results of the different evaluation metrics for the different algorithms on the Shuttle dataset

Fig. 6

Experimental results of the different evaluation metrics for the different algorithms on the SimData1 dataset

From the results in Table 6, it can be seen that HypGridE performs similarly to the AHIForest and HSTree algorithms, even though all three are inferior to BSVME and BMLPE. Further, Figs. 5, 6 and 7 show that the three performance evaluation metrics of the different algorithms differ only slightly. It is worth noting that the experiments for BSVME and BMLPE were conducted on static datasets, while those for HypGridE, AHIForest and HSTree were conducted on dynamic datasets, which to some extent demonstrates that the proposed method is effective. Besides, compared with BSVME and BMLPE, HypGridE has the distinct advantage that mapping observations into the hyper-grid space avoids complicated computation and saves cost, making it appropriate for online streaming data processing.

Fig. 7

Experimental results of the different evaluation metrics for the different algorithms on the SimData2 dataset

5 Conclusion and future work

After exploring the drawbacks of existing hyper-grid-based anomaly detection algorithms, an improved version with a redefined \(L_1\) neighbor search space was first introduced. Then, considering that the performance-related parameters are hard to optimize and that a single detector hardly fits the streaming environment, this paper employed the online ensemble learning technique and proposed a hyper-grid-based ensemble anomaly detection method. The greatest advantage of the proposed method is that, unlike existing distance-based or density-based methods, it does not compute distances or densities directly and thus has low computational complexity; the method fits online streaming data processing well, as demonstrated by the experiments conducted on the real and simulated datasets.

Because the space complexity of the hyper-grid-based anomaly detection method is \(O(\mathbb {N}^{q})\), the proposed method demands significantly more memory when dealing with high-dimensional datasets, which may be infeasible in real applications. Hence, our future work will center on using sparse representation theory and compression algorithms to tackle this problem.