1 Introduction

With the proliferation of video in the real world, especially at airports, railway stations, and in urban traffic systems, video analysis is becoming increasingly important for surveillance. For security applications in particular, video anomaly detection algorithms are attracting growing attention [9].

Many approaches have been proposed to improve video anomaly detection. These methods typically detect anomalies in video frames using a regular grid, such as \(4\times 4\) [14], \(4\times 5\) [4], \(5\times 5\) [52], \(10\times 10\) [10, 28], \(13 \times 13\) [7], \(15\times 15\) [4, 5], \(16\times 12\) [46], \(16\times 16\) [12], \(20\times 20\) [47], \(30\times 30\) [20], \(48\times 48\) [30], or \(80\times 80\) [53]. Such a single-scale grid strategy can speed up anomaly detection but decreases detection accuracy. The multi-scale strategy has therefore received much attention [23, 24, 28]. Although it is widely used for anomaly detection, repeated computation of multi-scale features at the same location leads to a large increase in computational cost. To reduce this cost, we first compute the low-scale features and then reuse them to compute the high-scale features, which greatly improves the efficiency of feature extraction.

In our work, scale is defined by combinations of adjacent basic grids. For example, if the basic grid is \(3\times 3\), the second scale is the \(2\times 2\) combination of adjacent basic grids with size \(6\times 6\), the third scale is the \(3\times 3\) combination with size \(9\times 9\), the fourth scale is the \(4\times 4\) combination with size \(12 \times 12\), and so on. These combinations form a multi-scale pyramid. We also develop a fast algorithm that computes the motion features at the other scales from the base-scale features using a convolution operation, which is far more efficient than computing the motion features of each scale directly. In addition, a voting strategy is introduced into the pyramid combination, and two types of scene models are proposed for robust anomaly detection.

The contributions of our work are summarized as follows:

  • We design a normalized histogram motion feature based on optical flow and extend it to multiple scales to capture each region’s motion characteristics at different scales. Furthermore, we propose a fast algorithm that computes the other-scale motion representations from the basic histogram.

  • Since a scene model can feed prior knowledge back into anomaly detection, we propose to use global and online scene models to guide detection.

  • We develop a new method that integrates the voting and pyramid strategies [21] to combine the anomaly detection results from different scales.

The rest of this paper is organized as follows. Section 2 provides a brief overview of previous work on anomaly detection. Our method is explained in detail in Section 3. Section 4 demonstrates the effectiveness of the proposed algorithm on public datasets, followed by the conclusion in Section 5.

2 Related works

Many methods have been proposed for abnormal event detection. Trajectory-based methods were commonly used in early work [2, 13, 18, 32, 34]. For example, Johnson and Hogg [18] use Active Shape Models (ASM) to track trajectories and learn a model from them to detect events. Basharat et al. [2] model the pixel-level probability density functions of object velocity and size from the traces using a multivariate Gaussian Mixture Model (GMM). In [39], clustering with a support vector machine (SVM) is used to detect anomalous trajectories. Mo et al. [34] use joint sparse reconstruction for video anomaly detection. The main problem of trajectory-based methods is the accuracy of the detected trajectories: if the trajectories are accurate, abnormal events can be detected easily. To ensure this accuracy, Hu et al. [13] propose a robust multiple-object tracking algorithm based on a fast, accurate fuzzy K-means algorithm, and use the detected trajectories to find anomalies with statistical theory. However, in complex scenes the trajectories cannot be estimated from unreliable detections of moving objects, leading to anomaly detection failures.

To take full advantage of video content, low-level features are widely used in anomaly detection, such as the Histogram of Gradient (HOG), Histogram of Optical Flow (HOF), and image gradient. For example, Zhao et al. [53] use sparse coding and online reconstructability for anomaly detection based on HOG and HOF. Cong et al. [4] also use a sparse coding strategy for abnormal event detection based on the Multi-scale Histogram of Optical Flow (MHOF), an extension of HOF. The motion features HOG and HOF are also adopted by [50] with a statistical hypothesis test, where abnormal events are identified as those that contain abnormal event patterns and have a high abnormality detector value. In addition, gradient information is used for abnormality detection. In [20], the 3D gradients of Spatio-temporal Volumes (STVs) are used to detect anomalous events with a Hidden Markov Model (HMM). Lu et al. [28] propose a sparse combination learning algorithm for detection using these 3D STVs. Yu et al. [49] also use 3D STVs for anomaly detection with a content-adaptive sparse reconstruction method. A similar gradient feature is used as an appearance feature in [10, 16, 52].

With the development of deep learning, feature maps produced by neural networks are now widely used [11, 16, 17, 27, 36, 37, 43, 54]. Hasan et al. [11] build a fully convolutional feed-forward autoencoder to learn local features and classifiers in an end-to-end system. Ionescu et al. [16] take the activation maps of the last convolutional layer of VGG-f [3] as appearance features and combine them with 3D gradient motion features to detect abnormal events. Liu et al. [27] first use an encoder to encode the first t frames, and then feed the output into a Convolutional LSTM (ConvLSTM) [42] to obtain the final features for anomaly detection by margin learning. In [17], motion and appearance features obtained from three convolutional autoencoders are clustered, and a one-versus-rest scheme is utilized to score anomalous events. Nguyen and Meunier [36] use a convolutional autoencoder and a U-net [41] to learn joint appearance structures and motion patterns for anomaly detection. Most of the above methods focus on a new representation to support anomaly detection, but they separate anomaly detection from representation learning, resulting in a representation that is suboptimal for the specific detection task. Therefore, end-to-end methods are proposed in [11, 37, 54]. Pang et al. [37] use a few labeled anomalies and a prior probability to perform end-to-end learning of anomaly scores. Zhou et al. [54] combine feature learning, sparse representation, and dictionary learning in a new neural network. Although neural networks have made great progress in anomaly detection, some problems remain, such as high computational cost, long training time, and cumbersome network design. Therefore, in this paper we use a traditional anomaly detection method based on normal video motion.

3 Our work

In this work, anomaly detection is treated as a template matching problem: the motion histogram of each testing grid is compared with the trained template to evaluate anomalous events at each scale. A scene model can incorporate knowledge of the scene into detection. We therefore use a trained scene model as a reference to limit the range of anomaly detection, and employ an online scene model to handle detection outside this range. The framework of our method is shown in Fig. 1. The top and bottom subfigures describe the training process and the middle one describes the testing process.

Fig. 1
figure 1

Framework of our method

3.1 Scene model

Inspired by the fact that abnormal events usually occur on moving objects rather than in the background, we use a scene model to describe this property. We have two kinds of scene models: a global reference scene model and an online model. The scene model is built at the pixel level from the optical flow obtained with [8]. Assume that the optical flow at the point \((x,y)\) is \((v_x,v_y)\), where \(v_x\) and \(v_y\) are the horizontal and vertical components of the optical flow. The magnitude M at pixel \((x,y)\) is calculated as follows:

$$\begin{aligned} M_{(x,y)} =\sqrt{v_x^2+v_y^2}. \end{aligned}$$
(1)

The magnitude of motion represents the intensity of object motion. If the magnitudes in a certain region are smaller than a threshold, the region is likely background. The region of moving objects is therefore determined as follows:

$$\begin{aligned} B_{(x,y)}=\left\{ \begin{array}{ll} 1, & M_{(x,y)} \ge \theta \\ 0, & M_{(x,y)} < \theta \end{array}\right. , \end{aligned}$$
(2)

where \(\theta \) is a threshold obtained by an adaptive method [35]. For the global scene model, we first compute the magnitude of each frame in the training samples and use (2) to binarize each frame. Then we average all binarized frames in the training dataset to obtain the scene model. Finally, we estimate a threshold from the global scene model (see Section 4.1, Parameters Setting) and use it to obtain the corresponding binarization of the global scene model. In the testing phase, we take the magnitude obtained from (1) as the online scene model. Figure 2 shows the scene model on the UCSD Ped1 dataset. With the global scene model, we can determine the area that objects usually pass through; with the online scene model, we can determine the area that an object is currently passing through.
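The construction of both models is summarized by the following minimal NumPy sketch of (1), (2) and the averaging step. It assumes the optical flow is supplied per frame as a pair of component arrays; the function names and input format are ours, not those of the released code:

```python
import numpy as np

def magnitude(vx, vy):
    """Optical-flow magnitude M at every pixel, Eq. (1)."""
    return np.sqrt(vx ** 2 + vy ** 2)

def global_scene_model(flow_frames, theta):
    """Average of the binarized magnitude maps over all training frames,
    following Eq. (2). `flow_frames` iterates over (vx, vy) array pairs."""
    acc, n = None, 0
    for vx, vy in flow_frames:
        b = (magnitude(vx, vy) >= theta).astype(np.float64)  # Eq. (2)
        acc = b if acc is None else acc + b
        n += 1
    return acc / n  # per-pixel average appearing times

def online_scene_model(vx, vy):
    """Online scene model of one test frame: its magnitude map (Sec. 3.1)."""
    return magnitude(vx, vy)
```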

Fig. 2
figure 2

Our scene model on the UCSD Ped1 dataset. (a) is the global scene model. (b) is the binarization of (a). (c) is the online scene model of one frame. (d) is the corresponding binarization of (c). (e) shows the trace of a moving object in the background, colored red. The yellow regions in (b) and (d) represent the foreground and the black regions the background

In general, the moving region (foreground) in the global scene model should contain the region covered by the online scene model. If an object appears in the background region of the global scene model, an anomaly is more likely to occur in that region. Figure 2(e) shows an example. The regions of the online scene model (yellow regions in Fig. 2(d)) lie mostly in the foreground of the global scene model. The regions outside the global scene model (red regions in Fig. 2(e)) indicate that few objects in the training set appear there. Therefore, we apply two different criteria inside and outside the global scene model.

3.2 Normalized histogram and fast computation

First, we propose a normalized histogram for representing object motion at one scale. Second, we extend this normalized motion feature to multiple scales and design a fast algorithm that computes a high-scale histogram from the base-scale histogram. Third, we introduce how to obtain the Multi-scale Pyramid Grid Templates (MPGT) from the calculated motion features. Last, we present the anomaly detection score function for the testing frames, as well as the scheme that combines multi-scale detections using voting and pyramid strategies.

3.2.1 Normalized histogram feature

Many works use histograms, such as HOG [6], HOF [53], and MHOF [4]. Our motion feature is also histogram based, and we extend the optical-flow histogram with a per-bin normalization. Suppose each frame is divided into \(s\times s\) grids, and the grids at the same location in t consecutive frames form a cube. A histogram is computed for each cube of size \(s\times s \times t\). Each pixel \((x,y)\) in a cube is quantized into one of d directions based on its angle:

$$\begin{aligned} \theta _{(x,y)}=\tan ^{-1}(v_y/v_x). \end{aligned}$$
(3)

We calculate a histogram over the directions with the magnitude as the weight. Let the weighted histogram at location \((x_{cube},y_{cube})\) be \(H_{(x_{cube},y_{cube})} = [h_1, h_2, \cdots , h_d]\), where \(h_i\) is the sum of the weights in the ith direction. Let the standard histogram be \(B_{(x_{cube},y_{cube})}= [b_1, b_2, \cdots , b_d]\), where \(b_i\) is the count of pixels in the ith direction. The final histogram \(H_{(x_{cube},y_{cube})}^f\) is normalized by \(B_{(x_{cube},y_{cube})}\):

$$\begin{aligned} H_{(x_{cube},y_{cube})}^{f} = H_{(x_{cube},y_{cube})} \;./\; B_{(x_{cube},y_{cube})} = [h_1/b_1,\, h_2/b_2,\, \cdots ,\, h_d/b_d], \end{aligned}$$
(4)

where the operator ./ is element-wise division, assuming \(b_i\ne 0\). If \(b_i\) is zero, the corresponding bin of the normalized histogram is set to 0 as well.
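As a sketch, the feature of one cube can be computed as follows, given the flow components of all its pixels. We quantize the full angular range with arctan2, which is one possible reading of (3); the function name is ours:

```python
import numpy as np

def normalized_histogram(vx, vy, d):
    """Normalized motion histogram of one s x s x t cube, Eq. (4).
    `vx`, `vy` hold the flow components of all pixels in the cube."""
    mag = np.sqrt(vx ** 2 + vy ** 2).ravel()
    # Quantize each pixel direction into one of d bins (Eq. (3); we use
    # arctan2 to cover the full circle, an assumption on our part).
    ang = np.arctan2(vy, vx).ravel()
    bins = np.floor((ang + np.pi) / (2 * np.pi) * d).astype(int) % d
    H = np.bincount(bins, weights=mag, minlength=d)        # weighted histogram
    B = np.bincount(bins, minlength=d).astype(np.float64)  # standard histogram
    # Element-wise division of Eq. (4); bins with b_i = 0 stay 0.
    return np.divide(H, B, out=np.zeros(d), where=B > 0)
```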

3.2.2 Fast algorithm for multi-scale histogram

We define the concept of scale differently from previous works [21, 48], which define scale as the image resolution. We define scale with reference to the receptive field; in Fig. 3, for example, different grid sizes represent different scales. The histogram of each scale can be calculated from (4) if the ith scale is \(s_i \times s_i\) (we drop t from \(s_i \times s_i \times t\) for simplicity). Computing the histograms of every scale directly from the optical flow is extremely expensive and wasteful. We therefore propose a fast algorithm that completes this task with a convolution operation.

Fig. 3
figure 3

Multi-scale pyramid histogram architecture. The video is divided into cubes of different scales, colored orange, green, and red, to extract features at each scale. The motion features of all scales compose the multi-scale pyramid motion features, which are used to train the MPGT

Fig. 4
figure 4

Toy example of our fast multi-scale histogram computation. The yellow square in the top subgraph stands for the base scale (level 0) and the blue square represents the second scale (level 1)

Algorithm 1

Histogram computation using convolution

According to the histogram definition, the value of each bin at each scale is the sum over the pixels in that direction. Figure 4 provides an example of this computation. In the left subfigure, the circle and square stand for two different bins; the yellow square represents a base-scale region (level 0), whose histograms are shown in the red dotted rectangle, and the blue square represents a high-scale region (level 1), whose histogram is shown in the dark dotted rectangle. As the example shows, the high-scale region of level 1 contains four base-scale regions of level 0, so each bin of the level-1 histogram is the sum of the four corresponding level-0 bins. It is thus efficient to calculate the level-1 histogram from level 0 with a simple addition. The calculation is displayed in the right subfigure: the values of the first bin over the four level-0 regions are [2, 2, 1, 3] (see the green square in the left subgraph), so the corresponding level-1 value is 8, which is exactly the convolution of level 0 with a \(2\times 2\) kernel of ones.

For one video, let the base-scale histogram without normalization at location \((x,y)\) be \(H_{(x,y)} = [h_1, h_2, \cdots , h_d]\), and the standard histogram be \(B_{(x,y)}= [b_1, b_2, \cdots , b_d]\). From the base-scale histogram we can calculate the histogram of any scale whose size is an integer multiple of the base scale. That is, if the base scale is \(s_1\times s_1\), we can obtain the histogram of another scale of spatial size \(s_2= I \times s_1\), where I is a positive integer, using the convolution operation. The algorithm is shown in Alg. 1, where w is the multiple of the base-scale size. Suppose the base scale is \(3 \times 3\); to obtain the histogram of \(6 \times 6\), the kernel size should be \(2 \times 2\) with \(w=2\). In the same way, we obtain the motion histograms of the other scales using Alg. 1 and then calculate the normalized motion features with (4).
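Since the higher-scale grids tile the frame, convolving with an all-ones \(w\times w\) kernel at stride w reduces to a block sum. The following is a minimal NumPy sketch of Alg. 1 under this assumption; the array layout is our own:

```python
import numpy as np

def upscale_histograms(H0, w):
    """Alg. 1 sketch: higher-scale histograms from base-scale ones via an
    all-ones w x w convolution with stride w (a block sum, since the
    higher-scale grids tile the frame). H0 has shape (m, n, d): an m x n
    grid of unnormalized d-bin base-scale histograms."""
    m, n, d = H0.shape
    m2, n2 = m // w, n // w
    # Sum every non-overlapping w x w block of base grids, bin by bin.
    return H0[: m2 * w, : n2 * w].reshape(m2, w, n2, w, d).sum(axis=(1, 3))
```

Applying the same routine to the standard histograms gives the higher-scale counts, after which (4) yields the normalized features; for a \(3\times 3\) base grid, `upscale_histograms(H0, 2)` produces the \(6\times 6\) histograms.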

3.3 Multi-scale pyramid grid templates

Based on the normalized histogram, we calculate the MPGT by combining the maximum grid templates of every scale. Suppose there are \(n_h\) normalized histograms at location \((x_i, y_i)\) in the whole training data, i.e. \(H_{(x_i,y_i)}^1, H_{(x_i,y_i)}^2, \cdots , H_{(x_i,y_i)}^{n_h}\). We model the maximum grid template at location \((x_i,y_i)\) as follows:

$$\begin{aligned} g_{(x_i,y_i)}=\max (H_{(x_i,y_i)}^1, H_{(x_i,y_i)}^2, \cdots , H_{(x_i,y_i)}^{n_h}), \end{aligned}$$
(5)

where the max operation selects the maximum value of every bin over the d directions among all training histograms. Our maximum grid template can therefore capture the movement distribution at each location. For the ith scale, suppose one frame is divided into \(m\times n\) non-overlapping grids. With the local maximum grid templates, we obtain the maximum grid template for the ith scale

$$\begin{aligned} G=\begin{bmatrix} g_{(x_1,y_1)} & g_{(x_1,y_2)} & \cdots & g_{(x_1,y_n)}\\ g_{(x_2,y_1)} & g_{(x_2,y_2)} & \cdots & g_{(x_2,y_n)}\\ \vdots & \vdots & \ddots & \vdots \\ g_{(x_m,y_1)} & g_{(x_m,y_2)} & \cdots & g_{(x_m,y_n)} \end{bmatrix}. \end{aligned}$$
(6)

We also extend the template to the multi-scale case. Intuitively, a small cube can capture the details of the motion, while a large cube can capture high-level object knowledge. We therefore describe the motion at several scales simultaneously to improve detection. Suppose the first-scale maximum template is \(G_0\); it is obtained from the basic histograms of \(s_0\times s_0\times t\) cubes. We then obtain the second-scale histograms from the basic histograms using Alg. 1, and from them the second-scale maximum template \(G_1\). The third and further maximum templates \(G_i\) are obtained with the same strategy. All the templates of the different scales compose the MPGT.
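A sketch of this template construction is given below, reusing `upscale_histograms` from the previous sketch; the array layout and helper names are our own assumptions, not the released implementation:

```python
import numpy as np

def maximum_grid_template(hists):
    """Eqs. (5)-(6): element-wise maximum over all training histograms.
    `hists` has shape (n_h, m, n, d); the template G has shape (m, n, d)."""
    return hists.max(axis=0)

def build_mpgt(H0, B0, widths=(1, 2, 3, 4)):
    """Build the MPGT: for every scale multiple w, derive the raw histograms
    and counts with Alg. 1 (`upscale_histograms` above), normalize with
    Eq. (4), and take the per-location maximum. H0 and B0 have shape
    (n_h, m, n, d) and hold the base-scale histograms of the training set."""
    templates = []
    for w in widths:
        Hw = np.stack([upscale_histograms(h, w) for h in H0])
        Bw = np.stack([upscale_histograms(b, w) for b in B0])
        Hf = np.divide(Hw, Bw, out=np.zeros(Hw.shape), where=Bw > 0)
        templates.append(maximum_grid_template(Hf))
    return templates  # [G_0, G_1, ...], one template per scale
```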

3.4 Anomaly detection

With the proposed MPGT, we can perform the anomaly detection task. In this section, we first introduce how the testing frames are scored, and then describe the combination of the multi-scale detection results.

3.4.1 Anomaly detection score function

We use the logistic function to compute the final anomaly detection score. Let the histogram \(H_{(x,y)}^{max}=(h_1^{max},h_2^{max},\cdots , h_d^{max})\) be the maximum template of one scale at location \((x,y)\), and \(H_{(x,y)}=(h_1,h_2,\cdots ,h_d)\) be the corresponding testing histogram at the same location. The score function for the grid at \((x,y)\) is:

$$\begin{aligned} s_{(x,y)}=\frac{1}{1+\exp \left( -\beta \sum \limits _{\begin{array}{c} k=1,\cdots ,d\\ \wedge \, h_k>h_k^{max} \end{array}}\left( h_k-h_k^{max}\right) \right) }, \end{aligned}$$
(7)

where \(\beta \) is the logistic function parameter and \(\wedge \) is the AND operation. Only the bins whose values exceed the corresponding bins of the maximum template contribute to (7). The template \(g_{(x,y)}\) at location \((x,y)\) stands for the maximum motion intensity observed in training; if a bin of a new observation is larger than the corresponding bin of the template, the new observation is abnormal at this location. The more a bin of the new observation exceeds the grid template, the more likely an abnormal event is.

With the global and online scene models we can divide each frame into foreground and background regions. We then detect anomalies in the foreground using (7) and in the background with a weighted MPGT:

$$\begin{aligned} G=\begin{bmatrix} g_{(x_1,y_1)} & g_{(x_1,y_2)} & \cdots & g_{(x_1,y_n)}\\ g_{(x_2,y_1)} & g_{(x_2,y_2)} & \cdots & g_{(x_2,y_n)}\\ \vdots & \vdots & \ddots & \vdots \\ g_{(x_m,y_1)} & g_{(x_m,y_2)} & \cdots & g_{(x_m,y_n)} \end{bmatrix} \cdot c, \end{aligned}$$
(8)

where c is a softening factor that reduces the template strength in the background regions. If \(c=1\), the whole frame uses the same template for anomaly detection. In fact, if an object moves in the background, an anomaly is more likely to occur, so we set \(c<1\) to relax the detection condition and improve the accuracy of detecting abnormal events in the background. We set \(c=0.8\) in all experiments.
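A sketch of the scoring step follows, with the softening of (8) folded in as an argument; this packaging, and the vectorization over the whole grid, are our own:

```python
import numpy as np

def anomaly_score(H, G, beta, c=1.0):
    """Score of Eq. (7), with the background softening of Eq. (8) applied
    as a factor on the template. `H` is the grid of testing histograms and
    `G` the maximum grid template, both of shape (m, n, d); use c < 1 for
    grids lying in the background of the global scene model."""
    # Only bins exceeding the (softened) template contribute to the sum.
    excess = np.clip(H - c * G, 0.0, None)
    return 1.0 / (1.0 + np.exp(-beta * excess.sum(axis=-1)))  # (m, n) scores
```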

3.4.2 Strategy for combination of multi-scale detection

Many applications use the pyramid approach, such as scene category recognition [21], image classification [48], object detection [25] and event detection [51]. In this work, we first calculate the probability of an abnormality at each grid location of each scale using (7). We then use a pyramid strategy to combine the multi-scale detection results and incorporate a voting strategy into the pyramid scheme:

$$\begin{aligned} p(x,y)= \sum _{\begin{array}{c} i=0\\ \wedge \, vote(x,y)>v \end{array}}^{n_s} \frac{1}{2^{n_s-i}}\, s_{(x,y)}^{i}, \end{aligned}$$
(9)

where \(n_s\) is the total number of scale levels, \(s_{(x,y)}^{i}\) is the score (7) at scale i, vote(x,y) stands for the voting strategy at location \((x,y)\), and \(\wedge \) is the AND operation. For example, \(vote(x,y)>2\) indicates that position \((x,y)\) is considered an abnormal candidate only if more than two scales report an abnormal event there. Obviously, if \(vote(x,y)>0\), Eq. (9) is similar to that in [28]. In our work, we put more emphasis on the high scales because they capture more significant holistic features.
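A sketch of this combination is given below. The per-scale decision threshold used for the vote test is our assumption, as the paper does not spell that test out:

```python
import numpy as np

def combine_scales(scores, v, tau=0.5):
    """Pyramid-plus-voting combination of Eq. (9). `scores` is a list of
    per-location score maps s^i, one per scale level i = 0..n_s, each
    already resampled to a common resolution. A location votes at scale i
    when s^i exceeds the (assumed) decision threshold `tau`."""
    n_s = len(scores) - 1
    votes = sum((s > tau).astype(int) for s in scores)
    # Pyramid weights 1/2^(n_s - i): higher scales carry more weight.
    p = sum(s / 2 ** (n_s - i) for i, s in enumerate(scores))
    # Keep only locations supported by more than v scales.
    return np.where(votes > v, p, 0.0)
```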

4 Experiments

In this section, we demonstrate the experimental performance of the proposed algorithm on the public datasets UCSD [31] and Avenue [28]. The video frames are first downscaled to a resolution of \(120\times 160\), and four scales of \(3\times 3\), \(6\times 6\), \(9\times 9\) and \(12\times 12\) are used for the motion features in all experiments. The base scale is set to \(3\times 3\), and the motion features of the other scales are calculated from the first scale using the fast algorithm of Section 3.2.2. Figure 5 shows some detection results of our method. As can be seen, our approach effectively detects various anomalies, such as bicyclists, runners, skaters, and cars.

Fig. 5
figure 5

Detection samples on the public datasets, with a red mask over the anomaly region. (a) shows the original images. (b) is the detection at the first scale. (c) is the detection at the second scale. (d) is the detection at the third scale. (e) is the detection at the last scale. (f) is the detection using our combination strategy

4.1 Parameters setting

There are three types of parameters in our method: the threshold in the scene model, \(\beta \) in (7), and the number of bins d of the histogram in the low-level feature.

Fig. 6
figure 6

Threshold decision for our scene model. (a) is the scene model on the UCSD Ped1 dataset. (b) plots the scene model, with the pixel index on the x-axis and the average appearing times of each pixel on the y-axis

Table 1 Comparison with different histogram bins on UCSD Ped1 dataset using frame-level ground truth

Intuitively, the magnitudes in (1) represent motion intensity, so they should be low in the background and high on moving objects. To determine the threshold in the scene model, we first flatten the scene model and plot it with the pixel index on the x-axis and the average appearing times on the y-axis. We then pick the threshold at a low point of the curve. In our experiments, we set the threshold to 25, 20, and 35 for the UCSD Ped1, Ped2, and Avenue datasets respectively. Figure 6 shows this procedure for the UCSD Ped1 dataset.
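A minimal sketch of this inspection step, assuming the global scene model from Section 3.1 is available as an array; the threshold itself is still read off the curve by eye:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_threshold_curve(scene_model):
    """Sketch of the threshold-picking procedure of Fig. 6: flatten the
    global scene model and plot the per-pixel average appearing times,
    then read a threshold off a low point of the curve."""
    values = scene_model.ravel()
    plt.plot(values)
    plt.xlabel("pixel index")
    plt.ylabel("average appearing times")
    plt.show()
```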

For the other two parameters, we find that both have limited influence on detection performance. To test this, we fix one parameter, vary the other within a certain range, and compute the Area Under the Curve (AUC) on the UCSD Ped1 dataset. The comparison over different numbers of bins with \(\beta =0.01\) is shown in Table 1. As the table shows, the AUC decreases slightly as the number of bins increases, which implies that a small number of bins performs well; with 2 bins the performance is the best. Intuitively, with fewer bins there are more pixels per bin, and more data makes the histogram more robust against noise.

Table 2 shows the AUC for \(\beta =0.0001, 0.001, 0.01, 0.05, 0.1, 1, 5\) on the UCSD Ped1 dataset, with the motion histogram fixed at 2 bins. As the comparison shows, the AUC peaks near \(\beta =0.001\) and changes only slightly on either side; the average AUC hovers around 0.88, indicating that our method is robust to these parameter changes. In the following experiments on the UCSD dataset, we set \(\beta =0.001\) and \(d=2\).

4.2 Experiments on UCSD dataset

The UCSD dataset contains two pedestrian scenes on campus, Ped1 and Ped2, each with a training set and a testing set. For UCSD Ped1, the training set contains 34 short clips for learning normal patterns and the testing set contains 36 short clips. Each clip in Ped1 has 200 frames at a resolution of \(158\times 238\). Anomalies occur over multiple frames of each testing clip, and a subset of 10 clips provides pixel-level binary masks identifying the anomalous regions. The UCSD Ped2 dataset contains scenes of pedestrians moving parallel to the camera at a resolution of \(240\times 360\), with 16 clips for training and 12 for testing. In the following experiments, we evaluate our approach against both frame-level and pixel-level ground truth.

Table 2 Comparison with different \(\beta \) on UCSD Ped1 dataset using frame-level ground truth

4.2.1 Evaluation on UCSD Ped1 dataset

For the frame-level comparison, we compare our approach with the state-of-the-art methods, including Scan Statistic (SS13) [15], Sparse Combination (SC13) [28], Spatio-Temporal Motion Context (STMC) [5], Convolutional Auto-Encoder (Conv-AE) [11], Combining Motion and Appearance (CMA16) [52], Winner-Take-All SVM (WTA-SVM) [44], our former work MAP [24], MultiLevel Anomaly Detector (MLAD) [45], Multivariate Gaussian fully Convolution Adversarial Autoencoder (MGCAA) [22] and Self-trained Deep Ordinal Regression (SDOR) [38]. These methods include both traditional machine learning methods, such as SS13 [15], SC13 [28], STMC [5], MAP [24] and CMA16 [52], and deep learning methods, e.g. Conv-AE [11], WTA-SVM [44], MLAD [45], MGCAA [22] and SDOR [38]. The comparison based on the frame-level ground truth is shown in the second column of Table 3. On the UCSD Ped1 dataset, our method ranks second among the compared methods: it is comparable to SC13 [28] and better than all the others. Although our method is slightly worse than SC13 [28] here, it outperforms SC13 on the Avenue dataset.

To evaluate our method at the pixel level, we use an adaptive threshold to binarize each frame. Following [5, 28, 31, 40], a frame is considered correctly detected if more than 40% of the truly anomalous pixels are detected. The parameters are the same as for the frame-level evaluation. We compare our method with Adam [1], Mixture of Probabilistic Principal Component Analyzers (MPPCA) [19], Social Force (SF) [33], MPPCA+SF [31], Mixture of Dynamic Textures (MDT) [31], Spatio-Temporal Motion Context (STMC) [5], Sparse [4], Stacked RNN (SRNN) [30], Future Frame Prediction (FFP) [26], and AnomalyNet (ANet) [54]. The comparison is shown in the second column of Table 4. Our method achieves the best performance of all compared methods, which shows its effectiveness.

Table 3 Comparison with the state-of-the-art methods on UCSD and Avenue datasets at frame-level
Table 4 Comparison with the state-of-the-art methods on UCSD datasets at pixel-level ground truth

4.2.2 Evaluation on UCSD Ped2 dataset

For the UCSD Ped2 dataset, the parameter settings are the same as for UCSD Ped1. For the frame-level comparison, in addition to the above methods, we also compare our method with Temporally-coherent Sparse Coding (TSC) [30], Unmask [16], and ConvLSTM-AE [29]. The comparison is reported in the third column of Table 3. The performance of our method is similar to SS13 [15] and slightly worse than WTA-SVM [44] and MLAD [45], but better than the other methods. Moreover, although WTA-SVM [44] and MLAD [45] outperform our method on Ped2, they perform worse than our method on UCSD Ped1.

The comparison with state-of-the-art methods based on pixel-level ground truth on the UCSD Ped2 dataset is shown in the third column of Table 4. The compared methods include FFP [26], SRNN [30], and ANet [54]. As the comparison shows, our method ranks first among all compared methods.

4.3 Experiments on Avenue dataset

The videos in the Avenue dataset published by [28] come from an avenue on the CUHK campus, with 30652 frames in total, 16 videos in the training set and 21 in the testing set; pixel-level ground truth is also provided by the authors. This dataset is more challenging in three respects: (1) the camera vibrates in some videos; (2) there are a few outliers in some testing videos; (3) some normal patterns rarely occur in the training data. As in the UCSD experiments, we downscale the sequences to a resolution of \(120\times 160\), set the number of quantized histogram directions to 8, and set \(\beta \) to 20.

In addition to SC13 [28], Conv-AE [11], and MLAD [45], we also compare our method with TSC [30], Unmask [16], and ConvLSTM-AE [29]. The comparison is shown in the fourth column of Table 3. The AUC of our method is 0.88, which is better than SC13, Conv-AE [11], Unmask [16], and MLAD [45], on par with ConvLSTM-AE [29], and slightly worse than TSC [30]. TSC [30] performs best on the Avenue dataset among the compared methods and ours ranks second, but our method outperforms TSC [30] on the UCSD Ped2 dataset. Likewise, ConvLSTM-AE [29] matches our method on Avenue but performs worse than ours on UCSD Ped2.

4.4 Balanced performance comparison

To further demonstrate the superiority of our method, we calculate the average AUC over different dataset combinations: the UCSD subsets Ped1 and Ped2 (Ped12), Ped1 and Avenue (Ped1A), Ped2 and Avenue (Ped2A), and all testing datasets (AllD). The results are shown in the last four columns of Table 3: the fifth column is the average AUC on Ped12, the sixth on Ped1A, the seventh on Ped2A, and the last on all datasets. Our method and SS13 tie for first place on Ped12. Moreover, our method performs best on all combined datasets, Ped12, Ped1A, Ped2A, and AllD, indicating that it performs in a balanced way across all test datasets.

4.5 Validation of our method

In recent years, deep learning-based methods have been applied to anomaly detection with great success [11, 29, 30, 38, 54]. However, their main weakness is the dependence on extensive hardware resources. Unlike deep learning methods, our method requires little GPU hardware and the template model can be trained quickly. It also has low dataset-size requirements and needs only one template in the testing phase, which requires little memory. Moreover, the experiments on public datasets show that its performance is comparable to deep learning-based methods.

4.5.1 Effectiveness of our scene model

The scene model proposed in this paper is simple but effective for detecting anomalies. To demonstrate that it improves detection accuracy, we compare the AUC of our method with and without the scene model on the UCSD and Avenue datasets, using \(\beta =0.001\) and 2 bins for all UCSD evaluations and \(\beta = 20\) and 8 bins on the Avenue dataset. As reported in Table 5 (second and third columns), the performance with the scene model is similar to that without it on UCSD Ped1, but better on UCSD Ped2 and Avenue, indicating that the proposed scene model indeed increases anomaly detection performance.

Table 5 Comparison between the method with scene model (SM) and the one without (the second and third columns), and the comparison between the method with normalization and the one without (the second and last columns)
Table 6 Comparison between the method with multi-scale scheme and the one with only one scale

4.5.2 Necessity of histogram normalization

Normalizing the histogram is necessary for robust anomaly detection: it increases robustness to noise and improves detection accuracy. To test its effectiveness, we calculate the AUC on the UCSD and Avenue datasets with the same parameters as before, i.e., \(\beta =0.001\) and 2 bins for UCSD and \(\beta = 20\) and 8 bins for Avenue. Table 5 compares the histogram with and without normalization (second and last columns). On all test datasets, the AUC with normalization is much higher than without, which means that normalizing the motion feature is essential for anomaly detection.

4.5.3 Validity of multi-scale pyramid histograms

Multi-scale histogram features benefit the accuracy of anomaly detection. To show the effectiveness of the multi-scale strategy, we compare the method using the four-scale histogram features with variants employing only a single scale. The parameter settings are the same as in the above experiments on the UCSD and Avenue datasets.

As above, we employ four scales for the multi-scale experiment and use \(3 \times 3\), \(6 \times 6\), \(9 \times 9\) and \(12 \times 12\) for the single-scale variants. The comparison is shown in Table 6. Although the \(12 \times 12\) single scale performs slightly better than the multi-scale method on UCSD Ped2, the AUC with multi-scale features on UCSD Ped1 and Avenue is significantly higher than every single-scale detection, indicating that the multi-scale strategy does improve detection accuracy.

Figure 5 shows some results at different scales. From left to right, the first column is the original image, the second column is the anomaly detection at the base scale of \(3\times 3\), the third at \(6\times 6\), the fourth at \(9\times 9\), the fifth at \(12\times 12\), and the last column is the combined detection over the four scales. As these examples show, false alarms may occur at the small scales, and the detections may be rough at the large scales; the combined detection effectively eliminates the false alarms and improves detection accuracy.

Fig. 7
figure 7

Comparison of running time between the method using convolution and the method computing histogram directly

4.6 Comparison for running time

In our method, we use the basic motion histogram of the grid to compute the motion histograms of the higher-scale grids, which greatly speeds up feature computation. In this section, we compare the running time of the convolution-based method with that of computing the histograms directly. The experiments are performed on a laptop with an Intel(R) Core i7-10750H CPU and 8 GB of memory. The code is written in Python and is available at https://github.com/limax2008/Anomaly-DetectionpyramidGridTemplates.

As in the above experiments, the basic grid is set to \(3\times 3\), and we compute the \(6\times 6\), \(9\times 9\), and \(12\times 12\) histograms either directly or by convolution. The video is first resized to \(120\times 160\) and the number of motion histogram bins is set to 2. The experiment randomly selects one video from the UCSD Ped1 dataset and is repeated 10 times; the final running time is the average over the 10 repetitions. The comparison is shown in Fig. 7, from which we can see that: (1) for direct computation, the time increases with the grid size; (2) our fast algorithm for calculating all scale histograms is significantly faster than the direct calculation (more than 4 times faster). Moreover, the running time of our fast algorithm is the total time including \(3\times 3\), \(6\times 6\), \(9\times 9\), and \(12\times 12\) (the first bar in Fig. 7). If we subtract the time for computing the \(3\times 3\) grid histograms, the average time for deriving each of the \(6\times 6\), \(9\times 9\), and \(12\times 12\) histograms is about (11.053-10.941)/3 = 0.0373 seconds for 200 frames per scale, indicating that our proposed algorithm is fast enough for computing the motion histograms. Table 7 compares the standard deviation of the running times; that of our proposed method is small, which indicates that our method is stable across videos.
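The measurement protocol can be reproduced with a small harness along the following lines; the function names are ours, and `upscale_histograms` is the sketch from Section 3.2.2:

```python
import time
import numpy as np

def time_fast_multiscale(H0, widths=(2, 3, 4), repeats=10):
    """Rough timing harness matching the protocol above: run the
    convolution-based computation `repeats` times and report the mean
    and standard deviation of the elapsed time in seconds."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        for w in widths:
            upscale_histograms(H0, w)  # 6x6, 9x9, 12x12 from the 3x3 base
        times.append(time.perf_counter() - t0)
    return float(np.mean(times)), float(np.std(times))
```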

Table 7 Comparison of running time standard deviation between the method using convolution and the method computing histogram directly

5 Conclusion

In this work, we first improve the histogram with a normalization that increases the robustness of the motion feature. Then we propose a simple but powerful scene model for anomaly detection, combining the global and online scene models to detect anomalies based on their intersection. Next, we propose a fast motion feature computation algorithm that derives the features of the higher scales from the basic histogram feature with a convolution operation. Finally, we use the proposed MPGT for anomaly detection with the voting and pyramid strategies. Experimental results on public datasets show the effectiveness of the proposed method. Currently, our anomaly detection considers only motion features; incorporating appearance features is a possible direction to further improve accuracy in future work.