1 Introduction

In recent decades, the use of surveillance cameras has grown considerably due to heightened security demands. This has greatly increased the volume of video produced, making the ability to automatically process these videos an urgent issue. A detailed survey [16] of this research field shows the growth in video publications in recent years. The sheer number of videos being produced has turned the detection of abnormal events into a practical and challenging task. Abnormal event detection cannot be treated as a typical classification problem, because abnormal events are too varied to define exhaustively, and negative samples are insufficient for building models. Consequently, building models from normal videos in an unsupervised way has drawn substantial attention: models of normal events are trained from positive samples, and any event that does not fit the normal models is treated as abnormal. The process of running an unsupervised abnormal event detection model can be summarized as, first, extracting features to represent normal video frames; second, building normal models using features of positive samples; and third, checking test frames against the normal models.

Much research on unsupervised abnormal event detection has been published in recent years. In [10], trajectories of objects of interest are extracted and designated as the representation of normal events; trajectories that do not follow the normal patterns are treated as anomalies. How well this method performs depends on the accuracy of the object tracking. In [13], spatio-temporal changes are used to learn a set of sparse basis combinations, and test samples are checked by their reconstruction error.

Lately, the use of deep learning algorithms has taken off in the computer vision community, and the performance of image classification has reached impressive new levels. In the context of abnormal event detection, deep learning tools extract high-level features from videos in order to detect abnormal events. In [19], an auto-encoder is trained to extract features, and Gaussian models are then used to detect anomalies. The method in [8] uses PCANet to extract global features of video frames from saliency information (SI) and multi-scale histograms of optical flow (MHOF), so it is limited in detecting and localizing anomalies in local areas. In order to detect and localize abnormal events simultaneously and make the best of PCANet, a new method is proposed in this paper. PCANet is revised to extract local information from the raw motion information of video sequences rather than from higher-level features, i.e., SI and MHOF. Corresponding one-class classifiers are then used to detect anomalies.

The remainder of the paper is organized as follows: Section 2 covers related work to give background on the proposal. The proposed method is described in Section 3, followed by experimental results and their analysis in Section 4. Finally, conclusions are presented in Section 5.

2 Related work

With the wide use of surveillance systems around the world, computer vision researchers have taken a keen interest in algorithms for intelligent video processing. One practical application of this research is the ability to detect abnormal events in surveillance videos, such that users can be alerted to suspicious activity as it occurs.

The detection of abnormal events in videos involves identifying events with abnormal patterns. The lack of negative samples has led to the use of unsupervised learning methods as a sensible solution. In order to build normal models from training videos, extracting discriminative features becomes crucial to the success of the methods. Algorithms for abnormal event detection can be categorized into one of two types based on their methods of feature extraction: trajectory-based algorithms and spatio-temporal domain-based algorithms.

In [10, 18, 21], extracted trajectories of normal events represent normal patterns, and trajectories that do not follow the normal patterns are treated as anomalies. However, trajectory-based methods have limitations in crowded scenes, because occlusion among multiple moving objects severely degrades the efficiency and accuracy of object tracking and the extraction of trajectories of multiple objects. A social attribute-aware force model (SAFM) is proposed in [26] to incorporate social characteristics of crowd behaviors and improve the description of interactive behaviors, which effectively reduces the influence of occlusion between multiple moving objects. In [24], the 3-D discrete cosine transform (DCT) is used to design a robust multi-object tracker that associates targets across frames.

For spatio-temporal domain-based algorithms, features extracted with spatio-temporal methods represent normal videos, and normal models are then learned in order to detect abnormal events. For example, HOG [7, 9], optical flow [11, 15], or multi-scale histograms of optical flow [6] are extracted to represent videos. Various models are used to describe the features based on spatio-temporal changes. In [1], a location-based approach is proposed for illustrating behavior. In [2], a learned nonparametric model analyzes statistics of a multi-scale local descriptor. The study [15] proposes a social force model that estimates interaction forces between moving particles. The study [12] uses a joint detector of temporal and spatial anomalies based on mixtures of dynamic textures (MDT). In [1, 25], normal patterns are fitted into a Markov random field. Latent Dirichlet allocation models are utilized to detect abnormal events in [15, 20]. In [5, 6, 12, 13], sparse representations of normal events are proposed to reconstruct test frames.

3 Proposed method

Figure 1 shows the overall framework of the proposed method, which can be regarded as a spatio-temporal domain-based method. Abnormal events differ from normal events in their motion patterns, so the first step is to extract optical flows from the video frames. Next, the magnitude of the optical flow at each pixel is calculated to form an optical flow map, which can be regarded as the spatio-temporal representation of a video frame. Because abnormal events generally occur in local regions, optical flow maps are uniformly partitioned into non-overlapping patches to extract local and discriminative information. Patches at the same spatial coordinates across optical flow maps are then used to learn models: a set of PCANet models are learned separately, one per spatial location. For each patch, block-wise histograms are kept separate to represent each block, rather than being concatenated into a single vector. After obtaining training features, one-class classifiers are trained. This method uses simple and efficient classifiers based on k-medoids, as discussed in Section 3.2.

Fig. 1

Framework of the proposed method. Optical flow maps are first calculated from video frames. Then block-wise features are extracted by the revised PCANet from local patches. Finally, detection results are obtained by applying the one-class classifiers and a mean filter
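To make the preprocessing concrete, the following Python sketch computes an optical flow magnitude map and partitions it into sub-maps. The paper does not name its optical flow estimator, so OpenCV's Farneback method stands in here as an assumption; the 40×40 sub-map size is taken from Section 4.1.

```python
# Sketch of the preprocessing: dense optical flow -> magnitude map -> sub-maps.
# Farneback flow is an assumption; the paper does not specify the estimator.
import cv2
import numpy as np

def optical_flow_map(prev_gray, next_gray):
    """Per-pixel optical flow magnitude between two grayscale frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # H x W x 2
    return np.linalg.norm(flow, axis=2)  # magnitude at each pixel

def partition_submaps(flow_map, h=40, w=40):
    """Uniformly partition a flow map into non-overlapping h x w sub-maps.

    Returns shape (rows, cols, h, w); for a 240x360 map and 40x40 sub-maps
    this yields the 6 x 9 = 54 spatial locations used in the paper.
    """
    H, W = flow_map.shape
    rows, cols = H // h, W // w
    return (flow_map[:rows * h, :cols * w]
            .reshape(rows, h, cols, w)
            .swapaxes(1, 2))  # -> (rows, cols, h, w)
```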

For a test frame, block-wise features are extracted first. Then, one-class classifiers calculate an anomaly score for each block to form an anomaly score map. However, in crowded scenes, some normal isolated blocks may receive high anomaly scores due to occlusion and interference among multiple moving objects, which leads to a high false positive rate. This method efficiently eliminates the influence of such misclassified isolated points by processing the anomaly score maps with a mean filter. Experimental results demonstrate the overall effectiveness of the method.
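A minimal sketch of this smoothing step, assuming the 3×3 mean filter size reported in Section 4.1:

```python
# Minimal sketch of the isolated-point suppression: a mean filter over the
# block-wise anomaly score map (3x3 size taken from Section 4.1).
import numpy as np
from scipy.ndimage import uniform_filter

def smooth_scores(score_map, size=3):
    """Average each score with its neighbors, so a single high-scoring block
    surrounded by normal blocks is pulled back toward normal."""
    return uniform_filter(np.asarray(score_map, dtype=float), size=size)
```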

3.1 Feature extraction based on PCANet

As part of the trending deep learning field, PCANet is a deep learning network proposed in [4]. Compared with other deep learning networks, such as convolutional neural networks (ConvNets) that require considerable expertise and large amounts of labeled training data, PCANet trains more easily and better suits the problem at hand. The structure of a two-stage PCANet is illustrated in Fig. 2. Similar to a ConvNet, a two-stage PCANet is a model with two layers: the PCA filters of the two stages are cascaded to extract high-level features. As shown in [4], a two-stage PCANet is more effective than a one-stage PCANet at extracting higher-level features.

Fig. 2

Structure of a Two-Stage PCANet. Input is the optical flow map and output is the extracted feature. Features of input maps are extracted by three operations: PCA filters, binary hashing, and histogram

Optical flow maps are uniformly partitioned into non-overlapping patches, referred to as sub-maps, of size \(h \times w\). Sub-maps at the same spatial coordinates are gathered temporally for training. Suppose there are N sub-maps for each spatial location, denoted as \(X = [X_{1}, X_{2}, ..., X_{N}] \in \mathbb{R}^{hw \times N}\). For a given sub-map \(X_{i}\), a patch of size \(k_{1} \times k_{2}\) is taken around each point, and all overlapping patches of \(X_{i}\) are collected as \(X_{i,1}, X_{i,2}, ..., X_{i,(h-k_{1}+1)\times(w-k_{2}+1)}\), where \(X_{i,j}\) denotes the jth vectorized patch in \(X_{i}\). By subtracting the patch mean from each patch, we obtain \(\bar{X}_{i} = [\bar{X}_{i,1}, \bar{X}_{i,2}, ..., \bar{X}_{i,(h-k_{1}+1)\times(w-k_{2}+1)}]\). Similar matrices for all sub-maps are obtained in the same way. Putting all the matrices together forms:

$$ \bar{X} = [\bar{X}_{1}, \bar{X}_{2}, ..., \bar{X}_{N}] \in \mathbb{R}^{k_{1}k_{2}\times Nq} $$
(1)

where \(q = (h-k_{1}+1)\times(w-k_{2}+1)\). The PCA filters in the first stage are expressed as:

$$ {W_{l}^{1}} = vec2mat_{k_{1},k_{2}}(q_{l}(\bar{X}\bar{X}^{T})) \in \mathbb{R}^{k_{1} \times k_{2}}, l = 1,2,...,L_{1} $$
(2)

where \(L_{1}\) is the number of PCA filters in the first stage, \(q_{l}(\bar{X}\bar{X}^{T})\) denotes the lth principal eigenvector of \(\bar{X}\bar{X}^{T}\), and \(vec2mat_{k_{1},k_{2}}(v)\) is a function that reshapes the vector \(v \in \mathbb{R}^{k_{1}k_{2}}\) into a matrix \(W \in \mathbb{R}^{k_{1} \times k_{2}}\).
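Eqs. (1)-(2) can be sketched as follows; the helper names and the use of numpy's sliding windows and eigendecomposition are our choices, not the paper's.

```python
# Sketch of Eqs. (1)-(2): learn the first-stage PCA filters from mean-removed
# patches of all sub-maps at one spatial location. Helper names are ours.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def mean_removed_patches(submap, k1=5, k2=5):
    """All overlapping k1 x k2 patches of a sub-map, vectorized and mean-removed.
    Shape: (k1*k2, q) with q = (h-k1+1)*(w-k2+1)."""
    patches = sliding_window_view(submap, (k1, k2)).reshape(-1, k1 * k2).T
    return patches - patches.mean(axis=0, keepdims=True)  # subtract each patch's mean

def pca_filters(submaps, L=8, k1=5, k2=5):
    """Top-L principal eigenvectors of X_bar X_bar^T, reshaped to k1 x k2 filters."""
    X_bar = np.hstack([mean_removed_patches(s, k1, k2) for s in submaps])
    cov = X_bar @ X_bar.T                    # (k1*k2) x (k1*k2)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:L]]
    return [top[:, l].reshape(k1, k2) for l in range(L)]  # vec2mat of Eq. (2)
```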

The second stage proceeds like the first. Given the lth filter \(W_{l}^{1}\) and an input \(X_{i}\) from the first stage, the output is \(X_{i}^{l} = X_{i} * W_{l}^{1}\), where \(*\) denotes 2-D convolution. By collecting all patches in \(X_{i}^{l}\) and subtracting the patch mean from each patch, \(Y^{l} = [Y_{1}^{l}, Y_{2}^{l}, ..., Y_{N}^{l}] \in \mathbb{R}^{k_{1}k_{2} \times Nq}\) is obtained; the matrices for all \(L_{1}\) filter outputs are concatenated as:

$$ Y = [Y^{1}, Y^{2}, ..., Y^{L_{1}}] \in \mathbb{R}^{k_{1}k_{2}\times L_{1}Nq} $$
(3)

The PCA filters of the second stage are obtained as:

$$ W_{\ell}^{2} = vec2mat_{k_{1},k_{2}}(q_{\ell}(YY^{T})) \in \mathbb{R}^{k_{1} \times k_{2}}, \quad \ell = 1,2,...,L_{2} $$
(4)

For each input \({X_{i}^{l}}\) in the second stage, a single integer-valued map is obtained with:

$$ {T_{i}^{l}} = \sum\limits_{\ell=1}^{L_{2}}2^{\ell-1}H({X_{i}^{l}}*W_{\ell}^{2}) $$
(5)

where H(⋅) is a function whose value is one for positive entries and zero otherwise. Each map \({T_{i}^{l}}\) is partitioned into B blocks. In the traditional PCANet, histograms of the decimal values in each block are computed and concatenated into one vector. Here, in order to extract local information from optical flow maps and localize abnormal motion patterns, the histograms are not concatenated into one vector; instead, histograms of all the partitioned blocks are obtained as \(BlockHist_{b}({T_{i}^{l}}), b = 1, 2, ..., B\). Given an input sub-map \(X_{i}\), the features are defined as:

$$ f_{i,b} = [BlockHist_{b}({T_{i}^{1}}),...,BlockHist_{b}(T_{i}^{L_{1}})] \in \mathbb{R}^{2^{L_{2}}\times L_{1}}, \quad b = 1,2,...,B $$
(6)
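A sketch of Eqs. (5)-(6) under the same assumptions: second-stage responses are binarized, packed into an integer map T, and summarized by per-block histograms that are deliberately kept separate. The per-block feature \(f_{i,b}\) of Eq. (6) is then the concatenation of these histograms over the \(L_{1}\) first-stage maps.

```python
# Sketch of Eqs. (5)-(6): binary hashing of second-stage responses into an
# integer map T, followed by per-block histograms that are NOT concatenated
# across blocks, so each block keeps its own feature.
import numpy as np
from scipy.signal import convolve2d

def hash_map(X_il, W2_filters):
    """Eq. (5): T = sum_l 2^(l-1) * H(X * W_l^2), with H the unit step."""
    T = np.zeros(X_il.shape, dtype=np.int64)
    for idx, W in enumerate(W2_filters):
        response = convolve2d(X_il, W, mode='same')
        T += (response > 0).astype(np.int64) << idx   # 2^(l-1) weighting
    return T

def block_histograms(T, L2, block=10):
    """Eq. (6) for one map: histogram of decimal values in each block."""
    H, W = T.shape
    feats = []
    for r in range(0, H - block + 1, block):
        for c in range(0, W - block + 1, block):
            hist, _ = np.histogram(T[r:r + block, c:c + block],
                                   bins=np.arange(2 ** L2 + 1))
            feats.append(hist)        # one 2^L2-dim histogram per block
    return np.array(feats)            # shape: (B, 2^L2)
```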

The local blocks can be overlapping or non-overlapping; in this method, the blocks are kept non-overlapping. Assume the optical flow maps have size 240×360 and each sub-map has size 40×40; then a set of PCANet models is obtained as \(PCANets = \{PCANet_{i,j}\}, i = 1 \sim 6, j = 1 \sim 9\). For each sub-map \(X_{k}\), the corresponding PCANet model extracts a set of block-wise features, denoted as \(\{f_{k,b}\}, b = 1, 2, ..., 16\).
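Building on the sketches above (partition_submaps and pca_filters), the per-location model bank might be assembled roughly as follows; training_flow_maps is a hypothetical list of training optical flow maps.

```python
# Sketch of the per-location model bank: one set of PCA filters per sub-map
# coordinate (54 models for 240x360 frames and 40x40 sub-maps). The second
# stage and the per-block classifiers would be attached in the same way.
def train_model_bank(training_flow_maps, rows=6, cols=9):
    bank = [[None] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # all training sub-maps at location (i, j), gathered over time
            submaps_ij = [partition_submaps(m)[i, j] for m in training_flow_maps]
            bank[i][j] = pca_filters(submaps_ij)
    return bank
```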

3.2 Abnormal events detection by one-class classifier

The training dataset contains only normal events, so a one-class classifier is more suitable in this case than other classifiers. Suppose there are N normal training optical flow maps, and each training map is partitioned into M non-overlapping sub-maps. The N sub-maps at the same spatial coordinates are gathered temporally for training, and N feature vectors are extracted by PCANet for each block location, denoted as \(F = \{f_{1}, f_{2}, ..., f_{N}\}\). The K-medoids clustering algorithm is then used to divide F into k clusters, yielding k cluster centers \(C = \{c_{1}, c_{2}, ..., c_{k}\}\) for each block location. Given a test sample, PCANet extracts a feature vector for each block, and anomaly scores are calculated as:

$$ g(x) = \min_{i}\,(x - c_{i})^{T}(x - c_{i}) $$
(7)

The scores are arranged according to the spatial location of each block to construct an anomaly score map. To eliminate the noise of isolated points, anomaly score maps are processed by a mean filter, yielding \(g^{\prime}(x)\). The classifier is then defined as:

$$ h(x) = \left\{\begin{array}{ll} normality &\text{\(g^{\prime}(x)< \theta\)}\\ anomaly &\text{\(g^{\prime}(x)\geq \theta\)} \end{array}\right. $$
(8)
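The classifier of Eqs. (7)-(8) admits a compact sketch. The paper does not fix a k-medoids implementation, so a minimal alternating-update variant is used here; any standard implementation would serve.

```python
# Sketch of the one-class classifier of Eqs. (7)-(8), using a minimal
# alternating-update k-medoids (our choice; the paper leaves this open).
import numpy as np

def k_medoids(F, k=8, n_iter=50, seed=0):
    """Cluster the N x d feature matrix F and return the k medoid vectors."""
    F = np.asarray(F, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = F[rng.choice(len(F), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((F[:, None, :] - medoids[None, :, :]) ** 2).sum(-1)  # N x k
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = F[labels == j]
            if len(members) == 0:
                continue
            # the medoid is the member minimizing total distance to its cluster
            within = ((members[:, None, :] - members[None, :, :]) ** 2).sum(-1)
            medoids[j] = members[within.sum(axis=1).argmin()]
    return medoids

def anomaly_score(x, medoids):
    """Eq. (7): squared distance from feature x to the nearest cluster center."""
    return (((medoids - x) ** 2).sum(-1)).min()

def classify(g_smoothed, theta):
    """Eq. (8): anomaly where the mean-filtered score reaches threshold theta."""
    return g_smoothed >= theta  # True = anomaly, False = normality
```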

4 Experiments

Experiments were carried out on the UCSD Ped2 [14] and UMN [17] datasets. Comparisons with other state-of-the-art methods were also made, and the experimental results demonstrate the superiority of the proposed method.

4.1 Experiments on UCSD Ped2 dataset

The UCSD Ped2 dataset [14] was acquired with a stationary camera at 10 fps, with a resolution of 240×360. The video footage is split into clips of approximately 200 frames each. In the normal setting, the video contains only pedestrians. Events are considered abnormal when abnormal pedestrian motion patterns are present or when non-pedestrian objects appear in the walkways. The training set contains 16 normal short clips, and the testing set contains 12 clips in which some frames include one or more abnormal events. Similar to [14], receiver operating characteristic (ROC) curves and the equal error rate (EER) were analyzed at both the frame and pixel levels, and comparisons with other state-of-the-art methods were examined.

In frame-level evaluation, a frame is considered correctly detected if at least one pixel is judged to be anomalous; the exact location of the anomaly is not required. Therefore, it is possible for some true positive detections to identify abnormal events in the wrong location. In pixel-level evaluation, a frame is considered correctly detected only if at least 40 percent of the truly anomalous pixels are identified. To create an ROC curve, a set of false positive and true positive rates is obtained by repeating the detection procedure with multiple thresholds.
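As an illustration, the frame-level EER might be computed as below; scoring a frame by the maximum of its smoothed anomaly map is our assumption, consistent with the criterion that one anomalous pixel suffices for a frame-level detection.

```python
# Sketch of the frame-level ROC/EER evaluation. Scoring each frame by the
# maximum of its smoothed anomaly map is an assumption on our part.
import numpy as np
from sklearn.metrics import roc_curve

def frame_level_eer(score_maps, frame_labels):
    """score_maps: smoothed anomaly maps, one per frame; frame_labels: 1 = abnormal."""
    frame_scores = np.array([m.max() for m in score_maps])
    fpr, tpr, _ = roc_curve(frame_labels, frame_scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))  # operating point where FPR ~= FNR
    return (fpr[i] + fnr[i]) / 2.0       # EER as their average at that point
```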

For UCSD Ped2, where the frame size is 240×360, optical flow maps of the same size are calculated. The size of sub-maps was set to 40×40, the size of the mean filter was set to 3×3, and the PCANet parameters were set as filter size = 5×5 and block size = 10×10. \(L_{1}\) is set to 8 to preserve 90 % of the variance; the K-medoids parameter k is empirically set equal to \(L_{1}\); and \(L_{2}\) is set to 6 to restrict the dimension of the extracted features while preserving enough variance. Each original optical flow map was partitioned into 54 sub-maps, and the corresponding 54 PCANet models were learned. Each sub-map was divided into 16 blocks, and features of dimension \(2^{L_{2}}L_{1} = 512\) were extracted to represent each block. Finally, one 24×36 anomaly score map was obtained for each frame.

The threshold \(\theta\) was varied to produce ROC curves for frame-level and pixel-level evaluations. In practice, the threshold can be set according to the application's requirements to achieve a reasonable balance of false positive and true positive rates. ROC curves and comparisons with other methods are illustrated in Fig. 3.

Fig. 3

ROC curves of UCSD Ped2. a ROC curves based on frame-level evaluation. b ROC curves based on pixel-level evaluation

The comparisons of EER with other methods are shown in Table 1. Smaller EER indicates better performance.

Table 1 Comparison of EER on UCSD Ped2 (%)

Experimental results show that this method achieves respectable performance in both detecting and localizing abnormal events in surveillance videos. Figure 4 shows a sample of the detection results of this method.

Fig. 4

Examples of anomaly detection and localization. Red regions are abnormal pixels. Each column shows the same frame. The top row shows the ground truth, and the bottom row shows the abnormal events detected by our method

4.2 Experiments on UMN dataset

The UMN dataset [17] consists of three scenes containing crowds executing rapid escape behaviors, which are considered abnormal events. Scene (a) contains 1453 frames, scene (b) contains 4144 frames, and scene (c) contains 2142 frames, all with a resolution of 240×320. Figure 5 shows snapshots from each of the three scenes.

Fig. 5

Examples of Each Scene in UMN Dataset. The frames in the top row show normal crowd behavior, and the frames in the bottom row show abnormal behavior

PCANet models and one-class classifiers are trained using the first 600 normal frames of each scene; the remaining frames are used for testing. The UMN dataset contains no pixel-level ground truth, so the frame-level EER is used to evaluate the method. The parameter settings of the UMN experiments are the same as those for the UCSD Ped2 dataset. In the UMN dataset, rapid escape behaviors of crowds are considered abnormal events, and the escape behavior of each individual can be regarded as a local anomaly. Our method extracts high-level information from local regions, so abnormal events in local areas can be detected, yielding reasonable results on the UMN dataset. ROC curves and comparisons with other methods are illustrated in Fig. 6. The comparisons of EER with other methods are shown in Table 2.

Fig. 6

ROC curves of UMN Dataset (%)

Table 2 Comparison of EER on UMN Dataset (%)

5 Conclusion

The proposed method is an unsupervised learning framework for abnormal event detection and localization in surveillance videos. Optical flows are calculated to represent the motion information in video sequences, and the deep learning model PCANet is used to extract high-level features from low-level optical flows. To localize abnormal events, PCANet is revised to extract local information from video frames. Experimental results show that the method can effectively detect and localize abnormal events in surveillance videos.