1 Introduction

Video surveillance systems are increasingly deployed in public places such as traffic junctions, airports, and railway stations. Thanks to low hardware costs, communities and households are also equipped with surveillance cameras. However, most of these facilities serve merely as recording equipment. In a typical monitoring center, only a few operators watch the screens and analyze human behavior in order to spot abnormal events in time. Such manual monitoring not only costs considerable manpower but also easily misses abnormal events. Intelligent monitoring systems are therefore highly desirable. An intelligent monitoring system observes the visual field and detects and analyzes abnormal events automatically. Compared with traditional video surveillance systems, intelligent monitoring greatly reduces labor costs through 24-hour operation, and helps prevent unexpected dangerous events in advance or keep abnormal events from escalating.

Abnormal event detection in crowded scenes is an important and challenging task in intelligent surveillance video systems. It refers to detecting and responding to abnormal changes or behaviors of humans or objects in videos. Various abnormality detection algorithms have been proposed for crowded scenes, such as abnormal crowd behavior detection [9, 46, 53], human abnormal action detection [38], and traffic incident detection [50].

The study [27] proposes the social force model, which uses the interaction force between pedestrians as a feature for abnormal event detection and achieves good results. The multi-scale histogram of optical flow (MHOF) feature based on sparse representation [10] performs even better than [27]; MHOF is built upon and improves optical flow (OF) [30]. However, these features are extracted from raw pixel values, without considering the characteristics of the human eye and the human visual system. This paper proposes using human attention as a video feature for abnormal event detection. Deep learning simulates the learning process of the human brain and is widely used for recognizing complex objects, yet it had not previously been applied to extract high-level features for abnormal event detection. To connect with the human visual system, this paper uses deep learning to abstract high-level representations of human attention, simulating the process of human visual perception and brain learning.

In this study, we propose a new abnormal event detection model based on the features from saliency information and MHOF. A visual attention model is adopted to extract the saliency features for dynamic scenes. Furthermore, MHOF is used to extract motion features for video sequences. After the saliency and motion features are computed, a deep learning technique is used for abnormal event detection. Experimental results on a public database have demonstrated the promising performance of the proposed method.

The remainder of this paper is organized as follows. Section 2 introduces the related work in the literature. The proposed method is described in Section 3. Section 4 provides comparison experiments to demonstrate the advantages of the proposed method. The final section concludes the study.

2 Related work

2.1 Abnormal event detection

Data mining and machine learning techniques are widely used in intelligent video surveillance systems [28]. Many researchers are exploring how to design effective algorithms or models for real-time video data so as to accurately detect abnormal events in intelligent monitoring.

Abnormal event detection refers to detecting and responding to abnormal changes in videos, as shown in Fig. 1. Various abnormality detection algorithms have been proposed for crowded scenes.

Fig. 1 Examples of the normal and abnormal frames. a A normal frame. b An abnormal frame

One type of abnormal event detection method is designed based on object detection and tracking [18, 23, 29]. These methods have three steps: (1) detection and tracking, i.e., detecting objects in video frames and tracking them; (2) motion extraction [9, 46, 50, 53], i.e., recording the trajectories of the moving objects; and (3) activity analysis [38]. Tracking algorithms usually include object detection. The study [30] presents an object detection method based on sparse representation and histograms of sparse codes (HSC), which outperforms histograms of oriented gradients (HOG) in object detection. Many studies aim at effective tracking algorithms. In the study [52], the authors proposed a model-free tracker for object tracking. To track people and their baggage, the authors of [5] built 3D shape models to obtain stereo depth information. For long-term tracking, a self-paced learning algorithm was proposed in [37]. The authors of [22] designed a robust linear regression algorithm for object tracking.

The other type of abnormal event detection method extracts features from video in the spatiotemporal domain [6, 9, 39], where global and local features are combined for the final detection. Local detection exploits the differences between a target and its surrounding area, whereas the global feature is extracted by analyzing the visual scene as a whole to determine whether an abnormal event is occurring. Various features and models have been used in this type of method, such as mixtures of dynamic textures [24], global cues [25], and the social force model [27]. In the study [11], the authors proposed an interaction energy potential function that represents the relationship between a subject's current behavior state and its actions. Other features have also been proposed, such as the streakline representation of flow [26], optical flow [47], the histogram of optical flow (HOF) [2], the multi-scale histogram of optical flow (MHOF) [10], the global optical flow orientation histogram [48], and the social attribute-aware force model [51]. Based on the features described above, abnormal event detection algorithms have been built with machine learning techniques. In the study [4], the authors used optical flow to design an event detection model by adopting the optimal number of models to represent normal crowd behavior. The study [27] uses Latent Dirichlet Allocation (LDA) to detect abnormal events. In the study [45], the authors used probabilistic Latent Semantic Analysis (pLSA) to detect abnormality. In the study [34], the authors applied a superpixel-based Bag-of-Words (BoW) model to build an event detector.

2.2 Saliency information extraction

The human visual system (HVS) is the mechanism by which the external world is projected into the brain. In recent years, applications that simulate the HVS have become increasingly common [31–33]. When people watch a video, their attention is easily attracted by the abnormal events and behaviors appearing in the frames. Thus, in this paper, we extract saliency information as one feature for abnormal event detection. Visual attention is an important mechanism of the HVS. There are two categories of human visual attention: top-down [3, 49] and bottom-up [8, 12, 14, 15, 40, 42]. Top-down attention is influenced by prior knowledge, such as purposeful tasks, the distribution of target characteristics, and the context of the visual scene. In contrast, bottom-up attention is a spontaneous selection of salient areas driven by the image itself, and it is the main research direction of visual attention. In this paper, we also study visual attention in a bottom-up manner.

Treisman developed the Feature-Integration Theory (FIT) in 1980 [40]. When an observer looks at a scene, his visual attention is easily attracted by low-level features such as color, intensity, and contrast, and he pays attention to the salient areas of the scene. These salient areas in video frames can be computed from the differences between a center region and its surroundings. Over the past decades, various visual attention models have been designed based on FIT [1, 14, 15, 42]; they calculate the saliency map by computing the differences between a center area and its surrounding ones. Achanta et al. designed a saliency detection model using the frequency-domain information of images [1]. Guo et al. proposed a saliency detection model of visual scenes based on the Fourier transform [15, 16, 20]. In the study [12], the authors proposed a saliency detection algorithm that uses the quaternion Fourier transform (QFT) instead of the Fourier transform (FT): the amplitude spectrum of the QFT represents color, contrast, and brightness, and the final saliency map is obtained through weighting. Experimental results show that it outperforms state-of-the-art detection models. In this paper, we extract the saliency map of video frames with the saliency detection method of [12] to simulate visual attention, and treat it as a spatial feature.

2.3 Multi-scale histogram of optical flow

Generally, abnormal events occur across consecutive frames, so an abnormality detection algorithm should consider the saliency value of every frame as well as its continuity. Optical flow is the 2D instantaneous velocity of all pixels in an image; each 2D velocity vector is the projection of the corresponding 3D velocity vector onto the imaging surface. Thus, optical flow includes not only the motion information of the observed object but also information about the 3D structure of the scene. Each pixel (i, j) has a 2D velocity vector $(d_{i,j}^x, d_{i,j}^y)$. If an abnormal event detection task takes the optical flow of every pixel as a feature, the computational complexity is high, and pixel noise also degrades the results. To obtain better performance, the multi-scale histogram of optical flow (MHOF) is proposed in [10]: it first divides each video frame into small image patches, then quantizes the optical flow of each pixel into one of 16 classes, and uses the 16-bin histogram of each patch as its feature instead of raw optical flow. This greatly reduces the computational complexity and suppresses the noise in optical flow.

MHOF preserves more precise motion information than the traditional histogram of optical flow (HOF). As shown in the study [10], MHOF describes the scene changes of the current frame better and thus detects abnormal events in video sequences more accurately.

The MHOF framework for each block is shown in Fig. 2. First, every video frame is divided into image patches of the same size M. Then the optical flow matrix $D_i = (D^x, D^y)$ of each patch is computed, followed by the MHOF of each patch. Eqs. 1 and 2 below compute the class label $class_{i,j}$ of each pixel.

Fig. 2 The framework of MHOF

$$ {C}_{i,j}=\left\{\begin{array}{ll}0& \left\Vert \left({d}_{i,j}^x,{d}_{i,j}^y\right)\right\Vert \le Th\\ {}1& \left\Vert \left({d}_{i,j}^x,{d}_{i,j}^y\right)\right\Vert >Th\end{array}\right. $$
(1)
$$ clas{s}_{i,j}= round\left(\ \uptheta \left({d}_{i,j}^x,{d}_{i,j}^y\right)/\left(\pi /4\right)\right)+8\times {C}_{i,j} $$
(2)

where $(d_{i,j}^x, d_{i,j}^y)$ is the optical flow of pixel (i, j), Th is the magnitude threshold, and θ is the angle of the flow vector $(d_{i,j}^x, d_{i,j}^y)$. Eq. 2 thus yields 8 orientation bins for each of the two magnitude layers, i.e., 16 classes in total.
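To make the binning concrete, the following Python sketch computes the per-patch 16-bin MHOF of a frame following Eqs. 1 and 2. The Farneback optical flow routine, the patch size, and the threshold value are illustrative assumptions, not choices prescribed by [10].

```python
import numpy as np
import cv2

def mhof(prev_gray, gray, patch_size=16, th=1.0):
    """16-bin MHOF per patch (a sketch of Eqs. 1 and 2)."""
    # Dense optical flow gives (dx, dy) for every pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    dx, dy = flow[..., 0], flow[..., 1]
    # Eq. 1: magnitude layer C (0 = small motion, 1 = large motion).
    c = (np.sqrt(dx ** 2 + dy ** 2) > th).astype(int)
    # Eq. 2: 8 orientation bins per layer -> 16 classes in total.
    # The modulo folds the bin at 2*pi back onto bin 0.
    theta = np.mod(np.arctan2(dy, dx), 2 * np.pi)
    cls = (np.round(theta / (np.pi / 4)).astype(int) % 8) + 8 * c
    # Histogram of the per-pixel class labels inside each patch.
    h, w = gray.shape
    hists = []
    for r in range(0, h - patch_size + 1, patch_size):
        for col in range(0, w - patch_size + 1, patch_size):
            patch = cls[r:r + patch_size, col:col + patch_size]
            hists.append(np.bincount(patch.ravel(), minlength=16))
    return np.array(hists)  # one 16-bin histogram per patch
```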

2.4 Deep learning

Features in the spatial and temporal domains, such as color, intensity, and contrast, are manually selected low-level features. The human brain often abstracts high-level features, such as shape, depth, and movement, from low-level ones, and these high-level features can be better perceived by the brain. Deep learning is closely related to artificial intelligence (AI) within the field of machine learning. It has three properties: (1) learning within each layer is unsupervised; (2) the unsupervised output of each layer is used as the input of the next, higher layer; (3) supervised learning is employed to fine-tune all layers. The motivation of deep learning is to simulate the human brain by building neural networks for analysis and learning. In image processing, deep learning is used to discover multi-level, high-level features for image representation, so classification tasks no longer depend only on manually designed low-level features. An early study of deep learning was conducted by Hinton et al. [19]. Recently, deep learning has been widely used in computer vision [13, 21, 36], but these models have mainly been applied to the detection of complex objects, such as faces [35]. Because the computational cost of deep learning is large, existing studies on abnormal event detection based on deep learning are rare. In this paper, we use a very simple deep learning method, PCANet, to simulate the human brain and abstract high-level features from low-level video features.

2.5 SVM

The SVM + HOG combination has become a mainstream architecture for pedestrian detection, and this paper also uses the SVM as the classifier for abnormal event detection. The support vector machine (SVM) is widely used for statistical classification and regression analysis in various applications, and comprises a support vector classifier and a support vector regressor. The SVM was introduced in the 1990s by Vapnik [43, 44] and is an analysis method based on statistical learning theory. It requires relatively few training samples and is not sensitive to the number of attributes. The SVM not only classifies training data well but also achieves good accuracy on test data with the same characteristics. Over the past 20 years, SVM theory and applications have developed quickly.

For a two-class separable learning task, the samples are mapped into a high-dimensional space in which a hyper-plane separates the samples into two classes. To obtain good classification performance, the optimal hyper-plane should be selected. Two side-planes are therefore set up, parallel to the hyper-plane and at equal distances from it; the distance between the two side-planes is called the margin, and the hyper-plane lies midway between them. The larger the margin, the higher the accuracy of the classifier. Thus the purpose of the SVM algorithm is to find the maximum-margin hyper-plane.

Figure 3 shows an example of a two-class linearly separable learning task. The sample points lie in a two-dimensional Cartesian coordinate system, so each sample point has two coordinates, x and y, and in 2D space the hyper-plane is a straight line. The sample points belong to two classes, positive and negative. A hyper-plane is drawn in each of Fig. 3a and b. The aim of the SVM is to find the maximum-margin hyper-plane (MMH), so the hyper-plane of Fig. 3a is the final result of the SVM. To compute the MMH, the classification function can be written as follows:

Fig. 3 The samples of the hyper-plane for a linearly separable case

$$ f(x)={\boldsymbol{\omega}}^T\boldsymbol{x}+b $$
(3)

where ω is the normal vector of the MMH and {(x, y)} is the sample set. The hyper-plane is defined as

$$ {\boldsymbol{\omega}}^T\boldsymbol{x}+b=0 $$
(4)

According to the point-to-plane distance formula, the following equation is obtained:

$$ r=\frac{{\boldsymbol{\omega}}^T\boldsymbol{x}+b}{\left\Vert \boldsymbol{\omega}\right\Vert }=\frac{f\left(\boldsymbol{x}\right)}{\left\Vert \boldsymbol{\omega}\right\Vert } $$
(5)

Suppose r is the distance between a side-plane and the hyper-plane; then the expressions of the two side-planes are

$$ \left\{\begin{array}{l}{\boldsymbol{\omega}}^T\boldsymbol{x}+b=-k\\ {}{\boldsymbol{\omega}}^T\boldsymbol{x}+b=+k\end{array}\right. $$
(6)

Normalizing by k, the above expressions become

$$ \left\{\begin{array}{l}{\boldsymbol{\omega}}^T\boldsymbol{x}+b=-1\\ {}{\boldsymbol{\omega}}^T\boldsymbol{x}+b=+1\end{array}\right. $$
(7)

Figure 3a, b illustrate the three planes. Every sample point in {(x, y)} must then satisfy the following constraints:

$$ \left\{\begin{array}{ll}{\boldsymbol{\omega}}^T{\boldsymbol{x}}_i+b\ge +1& \mathrm{if}\kern0.5em {y}_i=+1\\ {}{\boldsymbol{\omega}}^T{\boldsymbol{x}}_i+b\le -1& \mathrm{if}\kern0.5em {y}_i=-1\end{array}\right. $$
(8)

where $(\boldsymbol{x}_i, y_i) \in \{(\boldsymbol{x}, y)\}$. The sample points $(\boldsymbol{x}^{*}, y^{*})$ for which the equalities in Eq. 8 hold are called support vectors, and they correspond to the distance r*:

$$ {r}^{*}=\frac{{\boldsymbol{\omega}}^T{\boldsymbol{x}}^{*}+b}{\left\Vert \boldsymbol{\omega}\right\Vert }=\frac{f\left({\boldsymbol{x}}^{*}\right)}{\left\Vert \boldsymbol{\omega}\right\Vert }=\left\{\begin{array}{ll}\frac{1}{\left\Vert \boldsymbol{\omega}\right\Vert }& \mathrm{if}\kern0.5em {y}^{*}=+1\\ {}-\frac{1}{\left\Vert \boldsymbol{\omega}\right\Vert }& \mathrm{if}\kern0.5em {y}^{*}=-1\end{array}\right. $$
(9)

The distance d between the two side-planes is

$$ d=2{r}^{*}=\frac{2}{\left\Vert \boldsymbol{\omega}\right\Vert } $$
(10)

When d attains its maximum value, the corresponding hyper-plane is the optimal hyper-plane (MMH). Thus, d is maximized with respect to ω and b:

$$ \begin{array}{c}\max (d)=\max \left(\frac{2}{\left\Vert \boldsymbol{\omega}\right\Vert}\right)\\ {}s.t.\kern0.5em {y}_i\left({\boldsymbol{\omega}}^T{\boldsymbol{x}}_i+b\right)\ge 1,\kern0.5em i=1,2,\dots, n\end{array} $$
(11)
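Since maximizing $2/\left\Vert \boldsymbol{\omega}\right\Vert$ is equivalent to minimizing $\frac{1}{2}{\left\Vert \boldsymbol{\omega}\right\Vert}^2$, the problem is usually restated in the standard convex quadratic form (a standard reformulation added here for completeness):

$$ \begin{array}{c}\underset{\boldsymbol{\omega},b}{ \min}\kern0.5em \frac{1}{2}{\left\Vert \boldsymbol{\omega}\right\Vert}^2\\ {}s.t.\kern0.5em {y}_i\left({\boldsymbol{\omega}}^T{\boldsymbol{x}}_i+b\right)\ge 1,\kern0.5em i=1,2,\dots, n\end{array} $$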

The final solution for ω and b constructs the MMH classifier, namely the SVM classifier. Many improved SVM algorithms now exist. The study [47] proposes a one-class classifier based on the SVM, the online least squares one-class SVM (online LS-OC-SVM), and detects abnormal events with it using the MHOF feature.
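As a minimal illustration of the classifier used in this paper, the sketch below trains a linear SVM on per-frame feature vectors and flags test frames as abnormal. The random arrays and labels are placeholders, and scikit-learn's SVC is one possible implementation, not the specific solver used in the cited studies.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data: one feature vector per frame (e.g., concatenated
# SI + MHOF features); label 1 = abnormal frame, 0 = normal frame.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 320))
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(50, 320))

clf = SVC(kernel="linear", C=1.0)   # maximum-margin linear classifier
clf.fit(X_train, y_train)
pred = clf.predict(X_test)          # 1 where a frame is flagged abnormal
```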

3 The proposed method

3.1 Saliency information extraction in video

According to the selective attention mechanism, human eyes can quickly and effectively focus on important events in complex scenes. Human vision always attends to salient areas, which differ from their neighboring areas. Generally, abnormal events in a video appear as sudden changes along the spatial or temporal dimension of the frames. In this paper, we present an abnormal event detection algorithm based on the saliency information of video frames. The saliency detection model of [12] is used to extract the salient areas in each frame: the frame is first divided into image patches, and the saliency value of each patch is then determined by its difference from all the other patches in terms of color, intensity, and orientation features, as follows:

$$ {S}_i={\displaystyle \sum_{j\ne i}}{w}_{i,j}{D}_{i,j} $$
(12)

where $S_i$ is the saliency value of image patch i, $D_{i,j}$ is the difference between patches i and j, represented by the difference between the QFT amplitude spectra of the two patches, and $w_{i,j}$ is the corresponding weight, determined by human visual sensitivity. The framework is shown in Fig. 4. Figure 5 shows a normal frame and an abnormal frame with their corresponding saliency maps; a simplified code sketch of Eq. 12 follows the figures.

Fig. 4 Framework of the saliency information extraction of video sequences

Fig. 5 a Normal frame. b Abnormal frame. c Saliency map of frame (a). d Saliency map of frame (b)
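Below is a simplified patch-level sketch of Eq. 12. For brevity it measures the patch difference $D_{i,j}$ with the amplitude spectrum of a plain 2D FFT on grayscale patches rather than the quaternion Fourier transform of [12], and it uses a Gaussian of patch distance as the weight $w_{i,j}$; both choices are stand-in assumptions.

```python
import numpy as np

def patch_saliency(gray, patch_size=16, sigma=3.0):
    """Per-patch saliency following Eq. 12: S_i = sum_{j!=i} w_ij * D_ij."""
    h, w = gray.shape
    rows, cols = h // patch_size, w // patch_size
    feats, centers = [], []
    for r in range(rows):
        for c in range(cols):
            p = gray[r * patch_size:(r + 1) * patch_size,
                     c * patch_size:(c + 1) * patch_size]
            # FFT amplitude spectrum as a grayscale stand-in for the
            # QFT amplitude spectrum of [12].
            feats.append(np.abs(np.fft.fft2(p)).ravel())
            centers.append((r, c))
    feats = np.array(feats)
    centers = np.array(centers, dtype=float)
    sal = np.zeros(len(feats))
    for i in range(len(feats)):
        d_feat = np.linalg.norm(feats - feats[i], axis=1)    # D_ij
        d_pos = np.linalg.norm(centers - centers[i], axis=1)
        wgt = np.exp(-d_pos ** 2 / (2 * sigma ** 2))         # w_ij
        wgt[i] = 0.0                                         # enforce j != i
        sal[i] = np.sum(wgt * d_feat)
    return sal.reshape(rows, cols)
```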

3.2 Feature representation of video frames based on PCANet

Deep learning can abstract high-level features from low-level ones. We use a simple deep learning technique to mine the MHOF and SI features for a higher-level feature representation. In [7], the PCANet deep learning algorithm is proposed as a simple and robust model in which multi-level feature representations are discovered by using plain PCA to extract high-level features.

Generally, PCANet has three steps: the high-level features of the training video are extracted by the filters of Steps 1 and 2, and these filters are obtained by multi-layer principal component analysis. The framework of PCANet is shown in Fig. 6.

Fig. 6 The framework of PCANet

Suppose there is a video sequence $S(I_1, I_2, \cdots, I_t, \cdots, I_N)$ with resolution $n_r \times n_c$, where N is the number of frames in S. The filters of Steps 1 and 2 have size $k_1 \times k_2$. The algorithm can be summarized as follows:

Step 1: We extract the $k_1 \times k_2$ pixel block around each pixel of each frame, obtaining $(n_r - k_1 + 1) \times (n_c - k_2 + 1)$ image patches per frame.

For example, assume $P_i(p_1, p_2, \cdots, p_j, \cdots, p_{(n_r - k_1 + 1) \times (n_c - k_2 + 1)})$ is the block set of frame $I_i$, where j indexes the patches of $I_i$. Each patch is flattened into a vector $v \in \mathbb{R}^{k_1 k_2}$, giving a large matrix $M \in \mathbb{R}^{k_1 k_2 \times (n_r - k_1 + 1)(n_c - k_2 + 1) N}$ that is a reconstruction of the video S. The eigenvalues and eigenvectors of the covariance matrix $MM^T$ are computed, and the K eigenvectors with the largest eigenvalues are selected to construct $F_1 \in \mathbb{R}^{k_1 k_2 \times K}$. Each column of $F_1$ is then reshaped into a $k_1 \times k_2$ filter, giving $F_1(f_1, f_2, \cdots, f_h, \cdots, f_K)$ with $f_h \in \mathbb{R}^{k_1 \times k_2}$.

Step 2: K feature images of each frame are obtained with the filter-set $F_1$, so the convolution with the layer-1 filters yields $N \times K$ feature images. The filter-set $F_2$ is then computed by repeating Step 1 on these feature images.

Step 3: Based on $F_1$ and $F_2$, $K^2$ redundant feature images $Feature_t \in \mathbb{R}^{n_r \times n_c \times K^2}$ are obtained; each is converted to a binary image by the Heaviside step function $H(\cdot)$, and the representation image $T_t$ is computed as follows:

$$ {T}_t^l={\displaystyle \sum_{h=1}^K}{2}^{h-1}H\left({I}_t^{l-1}\ast {f}_h^l\right) $$
    (13)

where l is the level of PCANet, $H(\cdot)$ binarizes its argument, and $f_h^l$ is the h-th filter of the layer-l filter-set $F^l$.

Finally, $T_t$ is divided into patches and the histogram of each patch is computed; these histograms form the final multi-level representation of the image features. Different PCANets can be obtained by changing the filter size. A smaller filter size is not necessarily better, since the amount of local information affects the accuracy of feature extraction; we demonstrate this in the experiments below. A compact sketch of the layer-1 filter learning follows.
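To make Step 1 concrete, here is a minimal sketch of the layer-1 filter learning; the patch-mean removal follows standard PCANet practice [7], and the loop-based patch extraction is written for clarity rather than speed.

```python
import numpy as np

def learn_pca_filters(frames, k1=5, k2=5, K=8):
    """Layer-1 PCANet filters (a sketch of Step 1): collect all
    k1 x k2 patches of every frame, remove each patch's mean, and take
    the K leading eigenvectors of the patch covariance as filters."""
    cols = []
    for img in frames:                       # img: 2-D float array
        nr, nc = img.shape
        for r in range(nr - k1 + 1):
            for c in range(nc - k2 + 1):
                v = img[r:r + k1, c:c + k2].ravel()
                cols.append(v - v.mean())    # patch-mean removal
    M = np.array(cols).T                     # shape (k1*k2, num_patches)
    eigvals, eigvecs = np.linalg.eigh(M @ M.T)  # ascending eigenvalues
    top = eigvecs[:, ::-1][:, :K]            # K leading eigenvectors
    return top.T.reshape(K, k1, k2)          # K filters of size k1 x k2
```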

3.3 Spatiotemporal abnormal event detection model based on PCANet

Saliency information (SI) is extracted based on the characteristics of human perception and represents the important information in visual scenes. Optical flow is the velocity vector of pixels, MHOF extracts temporal features from video frames, and PCANet simulates the human brain. In this paper, SI and MHOF are combined to build a spatiotemporal abnormality detection model (the SI + MHOF model) using the PCANet deep learning model; the framework is given in Fig. 7.

Fig. 7 The framework of the proposed SI + MHOF PCANet model

The proposed method for abnormal event detection proceeds as follows; an end-to-end code sketch is given after the steps.

Step 1: Divide each frame of the training video into m image patches.

Step 2: Based on Eq. 12, obtain the saliency values of the training video, S_train(S_1, S_2, ⋯, S_i, ⋯, S_n), where n is the number of training frames, S_i(s_1, s_2, ⋯, s_j, ⋯, s_m) is the saliency vector of frame i, and s_j is the saliency value of patch j.

Step 3: According to Fig. 2 and Eqs. 1 and 2, obtain the MHOF of the training video, H_train(H_1, H_2, ⋯, H_i, ⋯, H_n), where H_i = (h_1, h_2, ⋯, h_j, ⋯, h_m) is the MHOF of frame i and h_j is the MHOF of patch j.

Step 4: Taking (S_i, H_i) as the features of frame i, form the training data sequence Data_train((S_1, H_1), (S_2, H_2), ⋯, (S_i, H_i), ⋯, (S_n, H_n)).

Step 5: Use PCANet to transform Data_train into Sparse_Data_train.

Step 6: Train an SVM on Sparse_Data_train to obtain the corresponding SVM model.

Step 7: Following Steps 1–4, compute the feature vector Data_test(S_k, H_k) of each test frame, and obtain Sparse_Data_test by PCANet.

Step 8: Classify Sparse_Data_test with the trained SVM model to determine whether each test frame is abnormal.
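The following sketch strings the steps together for one training and testing pass. The helpers patch_saliency and mhof refer to the sketches given earlier, and pcanet_features is a hypothetical stand-in for the full two-layer PCANet feature extraction of Section 3.2.

```python
import numpy as np
from sklearn.svm import SVC

def frame_features(prev_gray, gray):
    """Steps 1-4: concatenate per-frame SI and MHOF features."""
    si = patch_saliency(gray).ravel()      # spatial: saliency per patch
    hof = mhof(prev_gray, gray).ravel()    # temporal: 16-bin MHOF per patch
    return np.concatenate([si, hof])

def detect(train_frames, train_labels, test_frames, pcanet_features):
    # Steps 1-5: SI + MHOF features, then the PCANet transformation.
    X_train = np.array([frame_features(a, b)
                        for a, b in zip(train_frames, train_frames[1:])])
    X_train = pcanet_features(X_train)     # hypothetical PCANet stage
    # Step 6: train the SVM on the transformed training features.
    # Labels start at the second frame because optical flow needs a pair.
    clf = SVC(kernel="linear").fit(X_train, train_labels[1:])
    # Steps 7-8: transform the test features and classify them.
    X_test = np.array([frame_features(a, b)
                       for a, b in zip(test_frames, test_frames[1:])])
    return clf.predict(pcanet_features(X_test))   # 1 = abnormal frame
```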

4 Experiments

We use the UMN dataset [41] to conduct the comparison experiment to demonstrate the performance of the proposed method.

In the experiment, we compare algorithms for abnormal event detection based on optical flow (OF), the multi-scale histogram of optical flow (MHOF), saliency information (SI), and SI combined with MHOF (SI + MHOF). In addition, results with and without PCANet are provided for each. In [10], each frame is divided into 20 image patches, so with 16 bins per patch 320 features are extracted in total. To keep the feature dimension consistent, in the OF- and SI-based algorithms each frame is divided into 320 image patches, yielding 320 values per frame. In the learning process, we randomly select a certain proportion of training frames from the video footage of each scene, and the remaining frames are used for testing.

4.1 Evaluation criterion

In this paper, the F-measure is used for evaluation. The F-measure is computed from TP (true positive: a positive sample correctly classified), TN (true negative: a negative sample correctly classified), FP (false positive: a negative sample incorrectly classified as positive), and FN (false negative: a positive sample incorrectly classified as negative). Precision is the proportion of true positives among the samples predicted as positive by the classifier. Recall is the proportion of true positives among the truly positive samples, as shown in Eq. 14.

$$ \begin{array}{c}\hfill Precision=\frac{True\ positive}{True\ positive+ False\ positive}\hfill \\ {}\hfill Recall=\frac{True\ positive}{True\ positive+ False\ negative}=\frac{True\ positive}{positive}\hfill \end{array} $$
(14)

At large recall values the number of false negatives is low, so performance is better, and a larger precision is likewise better. However, it is difficult to guarantee that both values are high simultaneously; building a classifier in which precision and recall are both maximal is challenging [17].

To account for precision and recall simultaneously, they can be combined into a single measure, the F-measure, as follows.

$$ {F}_{\beta }=\frac{\left(1+{\beta}^2\right)\times precision\times recall}{\beta^2\times precision+ recall} $$
(15)

The Fβ-measure is the harmonic combination of precision and recall; β = 1 means precision and recall carry equal weight. In the experiments, we use F1 as the evaluation criterion.
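As a quick sanity check, Eqs. 14 and 15 translate to code directly; the counts below are illustrative only, not experimental results.

```python
# Illustrative counts, not experimental results.
tp, fp, fn = 80, 10, 20
precision = tp / (tp + fp)                     # Eq. 14
recall = tp / (tp + fn)
beta = 1.0                                     # F1: equal weights
f_beta = ((1 + beta ** 2) * precision * recall
          / (beta ** 2 * precision + recall))  # Eq. 15
print(f"P={precision:.2f} R={recall:.2f} F1={f_beta:.2f}")
```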

4.2 Lawn scene

This video footage contains 1453 frames and 2 abnormal events. Frames with abrupt changes in pedestrian motion are labeled as abnormal. The experimental results for this scene are shown in Table 1 and Fig. 8.

Table 1 Experimental results of Lawn Scene
Fig. 8 Experimental results of Lawn Scene

To demonstrate the advantages of the deep learning technique, we compare the algorithms with and without deep learning based on MHOF, SI, and SI + MHOF. In Table 1, the second to fourth columns show the results of PCANet with filter size 3 × 3; the fifth to seventh columns show the results of PCANet with filter size 5 × 5; the last four columns show the results without PCANet. Figure 8 shows the F1-measure values of the different algorithms at different proportions of training samples.

From Table 1 and Fig. 8, we can see:

(1) Without PCANet, the results of MHOF are better than those of OF when the training sample percentage is below 0.5; when it exceeds 0.6, the performance of MHOF decreases. The results of SI and SI + MHOF are better than those of OF and MHOF, and when the training sample percentage exceeds 0.4, SI + MHOF outperforms SI. This demonstrates that saliency information is useful for detecting abnormal events and that spatiotemporal features (SI + MHOF) detect abnormal events in video sequences better.

(2) The results of MHOF with the 5 × 5 filter-size PCANet are better than those without PCANet, whereas the results with the 3 × 3 filter size are poor. The reason might be that noise influences the results when the filter size is small. Thus, for each feature, a PCANet with an appropriate filter size should be used.

(3) With PCANet, the results of SI and SI + MHOF are improved, especially those of SI + MHOF.

(4) When there are few training samples, PCANet cannot obtain good results. This is because the deep network is a feed-forward network without feedback: it decomposes a complex functional relationship into multiple layers of simple functions and therefore needs many training samples. When the training set is small, the relationships between the layers cannot be accurately determined, so the experimental results are not good enough.

4.3 Plaza scene

This video footage contains 2142 frames and 3 abnormal events. The experimental results for this scene are shown in Table 2 and Fig. 9, from which we make the following observations.

Table 2 Experimental results of Plaza Scene
Fig. 9 Experimental results of Plaza Scene

(1) Without PCANet, the performance of MHOF is better than that of OF, and when the training sample percentage is below 0.4, MHOF gives the best results among the compared algorithms. However, when the training sample percentage exceeds 0.5, the results of SI and SI + MHOF are better than those of MHOF.

(2) With PCANet, the performance of MHOF and SI + MHOF improves, especially that of SI + MHOF.

(3) When the training sample percentage is below 0.4, the results of MHOF are the best, as shown in the tenth column of Table 2; when it is above 0.4, the results of SI + MHOF based on PCANet are better than the others.

From the above experimental results, we can conclude that:

(1) In abnormal event detection without PCANet, SI, MHOF, and SI + MHOF all outperform OF. For different scenes, both the MHOF and SI features contribute to abnormal event detection. As the number of training samples increases, SI + MHOF obtains better performance than MHOF or SI alone.

(2) For different video sequences, a suitable PCANet should be selected for abnormal event detection; with different filter sizes in PCANet, the detection accuracy can differ.

(3) PCANet is able to extract better features from complex scenes, which conforms to the original intention of deep learning.

(4) Because PCANet is an unsupervised feed-forward network without feedback, it is more sensitive to the number of training samples in abnormal event detection.

5 Conclusion

In this paper, we propose to use saliency information and MHOF to represent the spatial-domain and temporal-domain features of video sequences, respectively. PCANet is adopted to simulate the human brain and extract high-level features from SI and MHOF for abnormal event detection. Experimental results demonstrate that the SI + MHOF feature outperforms MHOF or SI alone in abnormal event detection, and that the proposed algorithm performs better with PCANet than without it. In the future, we will investigate how deep learning techniques can further improve the performance of abnormal event detection.