1 Introduction

Video-based crowd motion analysis is a fundamental problem in surveillance applications. In particular, anomalous event detection is one of the most popular tasks in the domain of crowd motion analysis. Varadarajan and Odobez [1] defined a crowd event as an action that occurs at both temporal and spatial levels. Anomalous crowd behavior or events denote sudden incidents that capture human attention or exhibit behavior patterns that differ from the regular pattern of behavior. There is an increasing need to monitor public places (shopping malls, transport stations, airports, streets, and other gathering places) to identify anomalous crowd patterns, such as sudden changes in crowd movements or size. Furthermore, there is a need to detect anomalous events in emergencies, such as in public places, and raise alarms accurately and in a timely manner.

Existing research has contributed significantly to monitoring human behavior using large-scale surveillance cameras. The widely used closed-circuit television (CCTV) for crowd monitoring helps to improve security up to a certain extent. However, CCTV alone cannot provide a complete perspective of all places because of its “disjoint and fragmented” coverage [2]. For example, CCTV cameras are usually installed in fixed locations and have limited (fixed field of view) coverage. Moreover, deploying and maintaining ubiquitous surveillance cameras is highly expensive.

Additionally, there is a high chance of missing crowd events at large crowded venues because the security personnel who monitor the CCTV feeds are susceptible to fatigue and concentration loss. Human observers are not always able to detect an event’s sudden occurrence: they cannot pay attention to all of the anomalous objects or behaviors in a scene. Hence, it is challenging to distinguish these unusual events from ordinary activities based solely on human observation.

In order to improve crowd event detection performance and tackle the challenges associated with real-life applications on a large scale, an automated monitoring system is needed to detect, analyse, and predict crowd behavior [3,4,5]. Thus, analysing and understanding large groups of people in crowds has attracted increasing attention. Although crowd analysis covers many tasks, this paper focuses on anomalous event detection.

This paper proposes a novel end-to-end hybrid deep learning framework for highly accurate anomalous object detection and abnormal movement pattern detection. In contrast to the preliminary analysis presented in [6], this work explores a new architecture called the DeepSDAE model, as shown in Fig. 1. The proposed framework employs a Stacked Denoising Auto-Encoder (SDAE) and an improved VGG16 model [7]. In particular, we extract continuous motion trajectories as pre-features and feed them to the SDAE. In contrast to the work presented in [6], we first extract two output channels from the SDAE’s hidden and final layers. We then feed these channel outputs to (1) a Plane-based one-class Support Vector Machine (1SVM), abbreviated as PSVM, and (2) an optimised VGG16 that feeds into another PSVM. Finally, we combine the decisions from these two PSVM outputs with reinforcement learning to detect crowd anomalies using a late fusion mechanism. Our approach detects anomalous crowd movement behaviors, such as those in the baseline datasets, namely the UCSD [8], Avenue [9, 10] and Subway Surveillance [11] datasets. The abnormal activities include non-pedestrian objects, such as people riding bicycles, skaters, carts, and anomalous motions in the UCSD dataset [8]; strange actions, a person walking in the wrong direction and non-pedestrian objects in the Avenue dataset [9, 10]; and people entering and exiting the subway in the wrong direction, loitering and irregular interactions in the Subway Surveillance dataset [11]. In addition, it can detect abnormal activities, such as standing and loitering people, in the densely crowded Melbourne Cricket Ground (MCG) dataset [5, 12]. Our main contributions in this work are:

  1.

    The proposed novel end-to-end deep neural network architecture, called DeepSDAE, enables spatio-temporal crowd motion anomaly detection with long-term clues, where the information from the raw videos is exploited twice via two channels although each video is processed only once, while still achieving superior performance. Moreover, the convergence of the proposed framework is stable and quick because of our optimised deep learning design.

  2.

    We derive a set of optimal parameters for detecting anomalies by modelling the crowd flow process as a Markov Decision Process (MDP) and solving it using a Deep Q-learning (DQN) method [13]. As a result, our approach produces a generalised set of learned parameters, improving the ability to detect similar events in different, new scenarios.

  3.

    We evaluate our proposed framework on the MCG dataset [5, 12], the UCSD Ped1 and Ped2 [8], Avenue [9, 10], and Subway Surveillance [11] datasets. These datasets consist of various abnormal crowd activities, and we comprehensively compare against 13 other approaches. The proposed DeepSDAE framework outperforms existing approaches in detecting anomalies (frame level or pixel level) in crowded scenes.

  4.

    The DeepSDAE is a novel reinforcement learning-deep learning model for detecting crowd anomalies, where RL is introduced for the first time to explore the parameter set. It produces superior results by learning the optimal crowd anomaly detection parameters (i.e., the time window of each tracklet, the neighbouring relationship of individuals, and the decision fusion weights used to arrive at anomaly scores) via Deep Q-learning, by maximising the expected rewards.

Fig. 1

Overview of the proposed end-to-end crowd anomaly detection architecture. We extract the motion features from the raw videos and represent the crowd trajectories as crowd collectiveness using Kanade-Lucas-Tomasi (KLT) tracklets’ manifold, rendering continuous motion maps. The proposed DeepSDAE framework comprises SDAE with an optimised VGG16 model to discern the normal and abnormal movement patterns. Classifying normal and abnormal patterns in highly crowded scenarios is challenging. Therefore, we introduce two information channels in the DeepSDAE framework with PSVMs to produce two anomaly scores and merge the decisions via the late fusion scheme, delivering outstanding crowd anomaly detection results

The structure of this article is as follows: Section 2 provides the literature review on abnormal crowd detection methods. In Section 3, we introduce our proposed methodology. Section 4 provides the evaluation results for the four datasets, and Section 5 provides the conclusion.

2 Related work

Several methods have been proposed to detect anomalous crowd events in video surveillance applications; this section reviews recent abnormal crowd detection approaches. The literature includes crowd tracking algorithms [5, 14], object detection methods [15] and methods for detecting people fighting [16]. However, these approaches work only in specific experimental scenarios; finding a general approach is still challenging. Motion representation methods (with texture and dynamic models) [17, 18] and sparse coding [19] have partially addressed these challenges. These methods treat abnormal event detection as an outlier detection problem: a model is trained on regular image sequences, and each new incoming frame is tested for whether it is normal or abnormal compared with the trained regular pattern.

The Mixture of Dynamic Textures (MDT) [17] captures the crowd dynamics and appearance in crowded scenarios to represent crowd behavior. The hierarchical mixture of dynamic textures (H-MDT) of Li et al. modifies the original MDT to improve the temporal anomaly detection process. This modification involves partitioning each frame into multiple sub-blocks and extracting small patches with spatio-temporal information as MDTs. The resulting H-MDT creates multi-scale anomaly maps at both temporal and spatial levels. An online conditional random field (CRF) scheme then detects anomalous events by fusing the temporal and spatial anomaly maps. The results demonstrate that the H-MDT outperforms the MDT [17]. However, training the CRF requires handcrafted annotated training samples, making it less attractive for real-world applications.

Deep learning has demonstrated exemplary performance in object detection and behavioral analysis. Likewise, deep learning is a promising approach for anomalous event detection. Xu et al. [20] proposed a novel architecture, called Appearance and Motion DeepNet (AMDN), for anomalous event detection. Feng et al. [21] use the SDAE to extract spatial features and implement a time-dependent Long Short-Term Memory (LSTM) network to capture long-term clues. Hasan et al. [22] combine conventional handcrafted spatio-temporal local features with a convolutional fully-connected feed-forward auto-encoder (Conv-AE) to learn local features and classify activities. Chong et al. [23] proposed a convolutional LSTM (ConvLSTM) to combine spatial features with the temporal evolution of those features to detect video anomalies; the authors demonstrated that their framework works better than [22]. Dubey et al. [24] studied a joint learning approach that extracts appearance and motion features using context dependency, called the Deep-network with Multiple Ranking Measures (DMRMs). Morais et al. [25] extract dynamic skeleton features using a message-passing encoder-decoder recurrent network for anomaly detection.

The authors in [19] proposed a stacked Recurrent Neural Network (RNN) based deep model implementing coherent sparse coding. In addition, high-level feature extracting deep learning models have been proposed for anomalous event detection tasks [26, 27], such as Fully Convolutional Neural Networks (FCNs) [26] and Plug-and-play Convolutional Neural Networks (PCCNs) [27]. Recently, the CO-attention Siamese Network (COSNet) [28, 29] has been utilised to tackle this problem, leading to a zero-shot solution with a unified and end-to-end trainable architecture that can capture diverse joint features. By combining optical flow and HOG, Mishra et al. [30] presented a tensor-based model for motion description to identify unusual behavior in crowd scenes. The authors of [31, 32] proposed a two-fold CNN-based framework that provides a complete end-to-end solution and achieves robust classification with a specific and dedicated deep learning heuristic.

3 Methodology

Figure 1 illustrates our proposed end-to-end architecture, which uses continuous motion maps as input to the proposed DeepSDAE model for crowd anomaly detection. The continuous motion maps are learned from the raw video using a deep model and encode crowd movement features [33]. The proposed deep network model includes a four-layer SDAE together with a transfer learning-based VGG16 optimisation model.

We first extract the continuous motion maps from the raw videos. The continuous motion maps help build the relation between appearance and motion features. Creating a motion map involves extracting the motion features, representing them via the collectiveness of KLT tracklets on a manifold, and finally rendering the continuous motion maps [6]. We then input these maps to the DeepSDAE. The denoising characteristic of the DeepSDAE makes the framework more robust to changing crowd motions.
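For illustration, the Python/OpenCV sketch below shows a minimal version of the first step, extracting KLT tracklets from a raw video. The collectiveness computation on the tracklet manifold and the rendering of the continuous motion maps follow [6, 33] and are not shown; the file name and tracking parameters are placeholders rather than the values used in our experiments.

```python
# Minimal KLT-tracklet extraction sketch (OpenCV). Parameters and file name are illustrative.
import cv2

cap = cv2.VideoCapture("crowd_video.avi")                 # placeholder path
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01, minDistance=5)

tracklets = [[tuple(p.ravel())] for p in pts]             # one tracklet per detected corner
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    for track, p, s in zip(tracklets, new_pts, status):
        if s:                                             # keep only successfully tracked points
            track.append(tuple(p.ravel()))
    prev_gray, pts = gray, new_pts
# `tracklets` now holds per-point trajectories, from which the motion maps are rendered.
```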

We use transfer learning to optimise the original VGG model to overcome the overfitting problem and improve the model generalisation. The optimised component comprises one convolution layer and three FC layers. The outputs from the VGG and the hidden layer of SDAE are fed to two PSVMs via two channels, respectively, to produce different anomaly scores. We chose PSVM in our architecture because of its demonstrated high anomaly detection performance in the recent work [7]. We employ the late fusion mechanism to combine the two anomaly decision scores. We derive the weight vectors for combining the decisions using the DQN framework.

3.1 DeepSDAE framework

In previous works, experiments performed on the MCG dataset [5, 12] using SDAE with one-class SVM [20] showed limitations in achieving good performance because the scenarios in the MCG dataset are highly crowded. This section introduces the unsupervised hybrid deep model called DeepSDAE to perform crowd anomaly detection. Figure 1 shows our framework, including an SDAE, an optimised VGG16 model, and a PSVM [7, 34] to efficiently separate the normal and abnormal patterns.

Detecting anomalies involves first extracting the features and then applying the SDAE to learn the embeddings. The SDAE produces two outputs: we feed (1) the first output, from the hidden layer of the SDAE, to a PSVM to produce an anomaly score, and (2) the second output, from the decoder of the SDAE, to the improved VGG model, followed by a PSVM, to produce another anomaly score. Finally, the outputs from the two PSVMs are combined using a deep reinforcement learning-based late fusion [35] mechanism to detect anomalous crowd events.

3.1.1 Learning representations

In this work, we exploit the SDAE’s capability to learn useful representations from motion feature maps. At the pre-training stage, the goal is to find a suitable mapping function by training one single-layer denoising auto-encoder at a time; the output of each trained layer is then fed to the next layer, until a stacked four-layer feed-forward denoising neural network is formed.

For fine-tuning, we use the training data \({\phi ^{c}} = \left \{ {{d_{i}^{c}}} \right \}_{i = 1}^{{M^{c}}}\), where c denotes the motion maps and Mc represents the number of training samples. The objective function of the DeepSDAE model is given by,

$$ J\left( {{\phi^{c}}} \right) = \sum\limits_{i=1}^{{M^{c}}} {\left\| {{d_{i}^{c}} - \hat {d_{i}^{c}}} \right\|_{2}^{2}} + {\tau_{F}}\sum\limits_{i = 1}^{M} {\left( \left\| {{\omega}_{i}^{c}} \right\|_{F}^{2} + \left\| {{{\omega}^{\prime}}_{i}^{c}} \right\|_{F}^{2}\right)} $$
(1)

where \({d_{i}^{c}}\) and \(\hat {d_{i}^{c}}\) represent the input motion features and the reconstructed sample of the SDAE; τF balances the two terms in the objective function by regularising the weights of the encoder and the decoder; \({\omega _{i}^{c}}\) and \(\omega _{i}^{\prime c}\) represent the corresponding weights in the encoder and decoder segments of the SDAE. We apply sparsity constraints on the outputs of the hidden units to obtain a more useful data representation. We use Stochastic Gradient Descent (SGD), which provides stable convergence in practice.
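As an illustration, the following PyTorch sketch implements a four-layer denoising auto-encoder trained with the reconstruction-plus-weight-regularisation objective in (1). The layer sizes and the SGD learning rate follow Section 4.1; the corruption noise level, activation choice and τF value are illustrative assumptions, not our exact settings.

```python
# Hypothetical sketch: four-layer denoising auto-encoder with the objective of Eq. (1).
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, sizes=(1024, 512, 256, 128)):
        super().__init__()
        enc, dec = [], []
        for d_in, d_out in zip(sizes[:-1], sizes[1:]):
            enc += [nn.Linear(d_in, d_out), nn.Sigmoid()]
        for d_in, d_out in zip(sizes[::-1][:-1], sizes[::-1][1:]):
            dec += [nn.Linear(d_in, d_out), nn.Sigmoid()]
        self.encoder, self.decoder = nn.Sequential(*enc), nn.Sequential(*dec)

    def forward(self, x, noise_std=0.1):
        x_noisy = x + noise_std * torch.randn_like(x)    # corrupt the input (denoising)
        h = self.encoder(x_noisy)                        # hidden representation (fed to a PSVM)
        return h, self.decoder(h)                        # reconstruction (fed to the VGG branch)

def sdae_loss(x, x_hat, model, tau_f=1e-4):
    recon = ((x - x_hat) ** 2).sum(dim=1).mean()         # reconstruction term of Eq. (1)
    frob = sum(w.pow(2).sum() for n, w in model.named_parameters() if "weight" in n)
    return recon + tau_f * frob                          # Frobenius-norm weight regularisation

model = DenoisingAE()
opt = torch.optim.SGD(model.parameters(), lr=1e-4)        # SGD, learning rate 0.0001 (Section 4.1)
x = torch.rand(32, 1024)                                   # a batch of motion-map features
h, x_hat = model(x)
opt.zero_grad()
sdae_loss(x, x_hat, model).backward()
opt.step()
```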

3.1.2 Optimising the VGG16 model

The Visual Geometry Group at the University of Oxford proposed the VGG16 architecture in 2014 for deep learning-based classification tasks. The model has 16 layers comprising 13 convolutional layers and three FC layers. The layers can be arranged into five blocks (block 1 to block 5) with different numbers of convolutional and pooling layers. For example, blocks 1 and 2 of the VGG16 model have two convolutional layers followed by a max-pooling layer, whereas blocks 3-5 have three convolutional layers followed by a max-pooling layer. All convolution kernels are of size 3 × 3, and the pooling kernels are of size 2 × 2. The VGG16 network thus replaces convolution kernels of size 5 × 5 and 7 × 7 by stacking 3 × 3 convolution kernels. As a result, the network obtains the same receptive field while significantly reducing the number of parameters and increasing the depth of the network [36]. In addition, the stacked convolutional layers increase the number of non-linear transformation layers and the feature extraction capability of the network. The non-linear transformation here uses the ReLU function, defined as:

$$ {\text{ReLU}}(x) = \max (x,0) $$
(2)

The VGG16 model includes many parameters, most of which are concentrated in the FC layers. To reduce overfitting during training, we use transfer learning and fine-tuning [37] to optimise the VGG16 model. The optimised VGG16 model comprises a pre-trained VGG16 model and a small convolutional neural network, as shown in Fig. 2. We use the binary cross-entropy loss function, defined as

$$ \text{loss} = - \sum\limits_{i=1}^{n} \Big( {y_{i}}\log {{\hat y}_{i}} + (1 - {y_{i}}) \log (1 - {\hat y_{i}}) \Big), $$
(3)

where n is the total number of samples, yi is the true label of sample i, and \({\hat y_{i}}\) is the predicted label.

Fig. 2

The proposed VGG16 optimisation model

The parameters of the five blocks (1 to 5) of the pre-trained VGG-16 network are retained and frozen, reducing the number of trainable parameters, and a simple CNN replaces the FC layers. Since the trainable part of the network contains only one convolutional layer, a relatively small dataset can meet the training requirements. Figure 2 shows the structure of our proposed VGG16 optimisation model.

When constructing the VGG16 model, we first initialise the model parameters from ImageNet pre-training and then freeze the parameters of the five blocks. Then, we connect the simple CNN described above for training. Table 1 shows the changes in parameters after freezing. From Table 1, we can clearly see that the number of trainable parameters is significantly reduced from 15,895,105 in VGG-16 to 1,180,417 in the optimised VGG-16, a 92.6% reduction \(\left(\frac{14,714,688}{15,895,105}\right)\) in trainable network parameters; a corresponding reduction is observed in the training time.
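The following PyTorch/torchvision sketch illustrates this construction: the five convolutional blocks of an ImageNet-pretrained VGG16 are frozen, and a small head consisting of one convolutional layer and three FC layers is trained with the binary cross-entropy loss of (3). The head's channel widths and the optimiser are illustrative assumptions; the exact dimensions of our optimised model are given in Fig. 2 and Table 1.

```python
# Hypothetical sketch: frozen VGG16 backbone + small trainable head, trained with BCE (Eq. 3).
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

backbone = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features  # blocks 1-5 (conv layers only)
for p in backbone.parameters():
    p.requires_grad = False                                      # freeze the pretrained blocks

head = nn.Sequential(
    nn.Conv2d(512, 64, kernel_size=3, padding=1),                # the single added conv layer
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d((7, 7)),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 256), nn.ReLU(inplace=True),           # three FC layers
    nn.Linear(256, 64), nn.ReLU(inplace=True),
    nn.Linear(64, 1),                                            # logit for normal/abnormal
)
model = nn.Sequential(backbone, head)

criterion = nn.BCEWithLogitsLoss()                               # binary cross-entropy (Eq. 3)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)         # only the head is trainable

x = torch.rand(8, 3, 224, 224)                                   # a batch of SDAE reconstructions
y = torch.randint(0, 2, (8, 1)).float()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```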

Table 1 Parameters of the optimised VGG16 model

3.1.3 Classifying anomalies

We implement a PSVM [7, 34] in this step for classifying anomalies. The PSVM aims to find a hyperplane in the high-dimensional feature space that separates normal and abnormal data. We chose the PSVM because it is computationally simpler. Implementing the PSVM involves implicitly projecting the data vectors \(x_{i} \in \mathbb{R}^{d}\) (i = 1,2,...,n) to a high-dimensional feature space via kernel functions and requires solving the following quadratic optimisation problem

$$ \underset{W,\xi}{\min }\ \frac{1}{2}{\left\| W \right\|^{2}} + \frac{1}{{\sigma \cdot n}}\sum\limits_{i = 1}^{n} {{\xi_{i}}}, \quad \text{s.t.}\quad \left( {W \cdot \varphi \left( {{x_{i}}} \right)} \right) + {\xi_{i}} \ge \bar \theta,\ \xi_{i} \ge 0, $$
(4)

where W denotes the weight vector; σ denotes the regularisation parameter indicating the proportion of anomalous data in the entire data set; ξi are the slack variables that allow vectors to lie on either side of the classification hyperplane; and \(\bar \theta \) is a pre-defined offset that aids in classification. We use the radial basis function (RBF) kernel to map the input data to the high-dimensional feature space. The RBF kernel is defined as

$$ K\left( {{x_{i}},{x_{j}}} \right) = \exp\left(\frac{{ - {{\left\| {{x_{i}} - {x_{j}}} \right\|}^{2}}}}{{2{\tau^{2}}}}\right), $$
(5)

where τ denotes the spread of the RBF kernel. For a test sample xt, the anomaly score is computed as \(\delta = W \cdot \varphi (x_{t}) - \bar \theta \). The parameter σ serves as a lower bound on the fraction of support vectors and, correspondingly, an upper bound on the proportion of abnormal data. We adopt the unsupervised PSVM training method of [34].
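For illustration, the scikit-learn sketch below uses a standard one-class SVM with an RBF kernel as a stand-in for the PSVM stage: nu plays the role of σ (the expected proportion of anomalies) and the decision function plays the role of δ. The feature dimensions and parameter values are illustrative assumptions, not our tuned settings or the authors' exact PSVM implementation.

```python
# Hypothetical sketch: one-class SVM with an RBF kernel standing in for the PSVM classifier.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
H_train = rng.normal(size=(500, 128))                      # hidden-layer features of normal frames
H_test = np.vstack([rng.normal(size=(50, 128)),
                    rng.normal(loc=4.0, size=(5, 128))])   # a few synthetic outliers

psvm = OneClassSVM(kernel="rbf", gamma=2 ** -5, nu=0.1)    # gamma would be chosen by cross-validation
psvm.fit(H_train)

delta = psvm.decision_function(H_test)                     # signed score per frame (larger = more normal)
is_anomalous = delta < 0                                   # negative scores fall outside the learned plane
print(is_anomalous.sum(), "frames flagged as anomalous")
```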

3.1.4 Merging decisions via late fusion scheme

The anomaly scores output by the two branches are combined by late fusion. Late fusion can learn semantic representations directly from the unimodal features [35]. We use a weight vector A = [ω, 1 − ω] to merge the two PSVM scores, where ω is the weight of the upper branch (SDAE-VGG-PSVM); we choose this parameter using reinforcement learning. The anomaly score from each branch (\(\delta = W \cdot \varphi (x_{t}) - \bar \theta \)) is calculated, and the final anomaly decision score is given by

$$ \delta_{c} = \omega \delta_{1} + (1-\omega) \delta_{2}. $$
(6)

We use a flexible threshold η to trace out the receiver operating characteristic (ROC) curve in our work. If δc < η, the frame is recognised as normal at the frame level, and the associated pixels are also treated as normal at the pixel level. On the other hand, if δc ≥ η, the frame and the related pixels are treated as anomalies. The ROC curve and the corresponding Area Under the Curve (AUC) are obtained by varying the threshold η. The details are discussed in Section 4 (experimental evaluation).
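A minimal sketch of this decision stage is shown below, assuming the two branch scores and the ground-truth frame labels are already available: the fused score δc of (6) is computed with a branch weight ω, and sweeping the threshold η via scikit-learn's roc_curve yields the ROC curve and AUC. The placeholder arrays and the ω value are illustrative assumptions.

```python
# Hypothetical sketch: late fusion of the two PSVM scores (Eq. 6) and ROC/AUC by sweeping eta.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

delta1 = np.random.rand(200)              # scores from the SDAE hidden layer -> PSVM branch (placeholder)
delta2 = np.random.rand(200)              # scores from the SDAE -> VGG16 -> PSVM branch (placeholder)
labels = np.random.randint(0, 2, 200)     # ground-truth frame labels (1 = anomalous), placeholder

omega = 0.6                               # branch weight learned by the DQN (illustrative value)
delta_c = omega * delta1 + (1 - omega) * delta2   # fused decision score, Eq. (6)

fpr, tpr, eta = roc_curve(labels, delta_c)        # sweeping the threshold eta yields the ROC curve
print("AUC:", roc_auc_score(labels, delta_c))
```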

3.2 OP-RL method

Learning optimal parameters is an essential but difficult task, and such parameters are usually chosen empirically. We need to learn a set of three parameters, N, K and ω, for detecting crowd anomalies. The parameters N and K come from the motion maps: N denotes the time window of each tracklet, and K denotes the number of nearest neighbours in the k-NN graph that represents the interactions among the individuals obtained from the KLT tracklets. The third parameter, ω, is the weight of the first branch in the late fusion scheme, producing the combined anomaly decision score δc. We develop a reinforcement learning approach to select the optimal parameters: we model the parameter learning process as an MDP and propose an optimal parameter learning method (OP-RL) to find the optimal parameters for anomaly detection.

Table 2 The evaluation metric for parameter set

In reinforcement learning, we let the agent learn in an interactive environment by trial and error, using a reward to measure its interactions. Figure 3 illustrates the fundamental elements and the process involved in reinforcement learning. The environment is the world in which the agent operates. State St is the current situation sequence, and St+1 is the subsequent sequence the agent perceives. The reward is the feedback from the environment, and it measures the influence of the action (At), where t represents the current time. The agent’s goal is to interact with the environment using different actions and obtain the maximal reward in the future. The reward Rt at timestep t can be defined as \(R_{t}={\sum }_{t^{\prime }=t}^{T}{\gamma }^{t^{\prime }-t}r_{t^{\prime }}\), where γ is a discount factor. The optimal action–value function, Q∗(s,a), is calculated using \(Q^{*}(s,a)=\max \limits _{\pi }\mathbb {E}[R_{t}{\mid }s_{t}=s,a_{t}=a,\pi ]\), where π is a policy sequence of actions.

Fig. 3

The basic elements and the processes involved in reinforcement learning. The environment is a world in which the agent operates. State St is the current situation sequence, and St+ 1 is the subsequent sequence the agent perceives. The reward is the feedback from the environment, and it also measures the influence of action (At), where t represents the current time

The optimal action–value function follows an identity called the Bellman equation. This equation is based on the intuition that if the optimal value \(Q^{*}(s^{\prime },a^{\prime })\) of the state sequence \(s^{\prime }\) at the next timestep was known for all possible actions \(a^{\prime }\), then the optimal strategy is to select the action \(a^{\prime }\) that maximises the expected value \(r+{\gamma }Q^{*}(s^{\prime },a^{\prime })\), where,

$$ Q^{*}(s,a) = \mathbb{E}_{s'{\sim}\varepsilon}\left[r+{\gamma}\max_{a^{\prime}}Q^{*}(s^{\prime},a^{\prime}){\mid}s,a\right]. $$
(7)

As discussed above, the reinforcement learning algorithm interactively updates the action–value function using the Bellman equation. The action–value function can be denoted as Q(s,a;𝜃), where the weight 𝜃 can be learned with a neural network such as the Q-network. Training the Q-network involves minimising the loss function Li(𝜃i) for each iteration i, given by

$$ L_{i}(\theta_{i})=\mathbb{E}_{s,a{\sim}\rho(\cdot)}\left[(y_{i}-Q(s,a;\theta_{i}))^{2}\right], $$
(8)

where \(y_{i}=\mathbb {E}_{s^{\prime}{\sim }\varepsilon }[r+{\gamma }\max \limits _{a^{\prime }}Q(s^{\prime },a^{\prime };\theta _{i-1}){\mid }s,a]\) is the target for iteration i and ρ(s,a) is a probability distribution over the state sequence s and the action sequence a. When optimising the loss function Li(𝜃i), the parameter 𝜃i− 1 is fixed. The gradient is calculated as \(\nabla _{\theta _{i}}L_{i}(\theta _{i})=\mathbb {E}_{s,a{\sim}\rho(\cdot);\, s^{\prime}{\sim}\varepsilon}\left[\left(r+{\gamma }\max \limits _{a^{\prime }}Q(s^{\prime },a^{\prime };\theta _{i-1})-Q(s,a;\theta _{i})\right)\nabla _{\theta _{i}}Q(s,a;\theta _{i})\right]\). Note that when optimising the loss function using SGD, computing the full gradient is usually computationally expensive.

A Markov decision process (MDP) is defined by the tuple \((\mathcal {S},\mathcal {A},P,R,\gamma )\), where \(\mathcal {S}\) is the state space, \(\mathcal {A}\) is the action space, P is the state transition probability from state s to the next state \(s^{\prime }\) under action a, R is a reward function representing the expected reward for executing action a in state s, and γ ∈ (0,1) is the discount factor. The agent interacts with the environment through its policy π, a mapping from states to actions.

In terms of the detailed algorithm, we adopt Deep Q-learning (DQN) to solve the parameter optimisation problem (refer to Fig. 4). DQN is the first deep reinforcement learning method, proposed by DeepMind [13], to learn control policies directly from high-dimensional sensory inputs. The process of Q-value iteration in the DQN framework is the same as discussed above, and Fig. 5 illustrates our entire approach. In reinforcement learning, the main components of one episode are (state, next state, action, reward). The state is expressed as St, the next state as St+1, the action set as At, and the reward as Rt at the current time t. We use the generated motion maps as the state St. The parameters N, K and ω are chosen from the action set At = (N,K,ω). In the current episode t, the reward Rt is calculated as Rt = δc to measure how well the actions are chosen. The reward Rt in this work is a value between 0 and 1, and a higher reward indicates a better choice of actions. For the next episode t + 1, we use the updated motion maps as the next state St+1.

Fig. 4

Deep Q-learning framework for learning optimised set of parameters

Fig. 5

Our proposed OP-RL model for learning optimal parameters in this work

The optimal action-value function Q∗(St,At) is defined as the maximum expected return achievable by following any strategy after seeing the state sequence S and taking the action a. The objective is to find an optimal policy that maximises the expected discounted long-term reward \(\widetilde {R}=\mathbb {E}_{\pi }[{\sum }_{t=0}^{\infty }\gamma ^{t}R_{t}]\), where π is a policy mapping sequences to actions. Through this iterative process, the RL model converges when the value of \(\widetilde {R}\) is maximal. Finally, the action set At = (N,K,ω) chosen by the converged model corresponds to the optimal parameter set.

The values of N, K and ω are all discrete, which is why we adopt DQN as the reinforcement learning algorithm: the action space in DQN is discrete, whereas it is continuous in some other algorithms, such as DDPG. In detail, for N, the values in the action space range from 3 to 10, rounded to one decimal place, so the size of the action space for this parameter is 70. For K, the values range from 1 to 10, giving an action space of size 10. For ω, the values range from 0 to 1, rounded to one decimal place, giving an action space of size 10.
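The PyTorch sketch below illustrates, under simplifying assumptions, how a DQN can be set up over this discrete action space: the joint actions (N, K, ω) are enumerated, a small Q-network scores them, actions are chosen ε-greedily, and the update minimises the squared Bellman error of (8). The state is represented here by a fixed-size embedding of the motion map, and the environment step (re-rendering the motion maps for the chosen N and K and returning δc as the reward) is assumed rather than implemented.

```python
# Hypothetical sketch: compact DQN over the discrete (N, K, omega) action space (Eqs. 7-8).
import itertools
import random
import torch
import torch.nn as nn

N_vals = [round(3 + 0.1 * i, 1) for i in range(70)]        # 3.0 ... 9.9 in steps of 0.1 (70 values)
K_vals = list(range(1, 11))                                 # 1 ... 10 (10 values)
w_vals = [round(0.1 * i, 1) for i in range(10)]             # 0.0 ... 0.9 (10 values)
actions = list(itertools.product(N_vals, K_vals, w_vals))   # joint discrete action space

q_net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, len(actions)))
target_net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, len(actions)))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)         # optimiser and gamma/eps are illustrative
gamma, eps = 0.9, 0.1

def choose_action(s):
    """Epsilon-greedy selection of a (N, K, omega) triple for the current state embedding."""
    if random.random() < eps:
        return random.randrange(len(actions))
    return int(q_net(s).argmax())

def dqn_update(s, a_idx, r, s_next):
    """One gradient step on (y_i - Q(s, a; theta_i))^2, with the Bellman target of Eq. (7)."""
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max()
    loss = (y - q_net(s)[a_idx]) ** 2
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

s = torch.rand(128)                                          # embedding of the current motion map (assumed)
a_idx = choose_action(s)
N, K, omega = actions[a_idx]                                 # candidate parameter set for this episode
```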

4 Results and discussion

4.1 Experiment setup

We implemented our proposed framework in C++ and Python, using the Anaconda 3.5 package manager in Visual Studio 2015. We used the Windows 10 Operating System (OS) and an NVIDIA® GeForce® GTX 2080Ti graphics card for our experiments.

In the four-layer SDAE architecture, the number of neurons in the first layer is set to 1024 and is halved in each subsequent layer. Precisely, the SDAE encoder consists of 1024 → 512 → 256 → 128 neurons in the four layers, and the corresponding decoder consists of 128 → 256 → 512 → 1024 neurons. The learning rate λF is set to 0.0001. The numbers of epochs at the pre-training and fine-tuning stages are set to 10 and 20, respectively. We used the Quick Model Selection method [34] for tuning the parameters of the PSVMs, where C = (2− 5,2− 3,...,215). The best value of γ for the RBF kernel is selected from \(\gamma = ({2^{- 15}}, {2^{- 13}}, {\dots } , {2^{3}})\) using cross validation. We used the tuning parameters provided in [34].

The optimal parameters learned using reinforcement learning are given in Table 2. The parameters N and K come from the motion maps: N denotes the time window of each tracklet, and K denotes the number of nearest neighbours in the k-NN graph that represents the interactions among the individuals obtained from the KLT tracklets. The parameter ω is the weight of the first branch in the late fusion scheme, producing the combined anomaly decision score δc.

The learned parameters adapt to different types of videos and anomalies, although the changes in these parameters across datasets are slight.

We utilise the SDAE to learn valuable representations in an unsupervised manner. In addition, the PSVM is highly effective in classifying outliers [34]. Therefore, the proposed method achieves better performance by overcoming the poor performance caused by noise and unlabelled training data in the traditional one-class SVM. Furthermore, we implement DQN to learn optimal parameters instead of choosing them manually.

We evaluate our proposed framework on four datasets: UCSD [8], Avenue [9, 10], Subway Surveillance [11] and the densely crowded MCG dataset, and we compare the results with existing state-of-the-art approaches from recently published papers.

4.2 MCG dataset

The MCG dataset is a real-world dataset collected at the Melbourne Cricket Ground, a sports stadium in Melbourne [5, 12]. We use videos from the C2, C5 and C6 cameras installed at different corridors in the MCG. Figure 6 shows sample images from this dataset.

Fig. 6

Sample results obtained by DeepSDAE on the real-world MCG dataset, where the red block denotes the anomalies. Loitering is the most common abnormal pattern in the MCG dataset

The dataset consists of different camera views of actively moving crowds in the MCG corridors, with people standing and loitering. The data includes videos collected on four dates from six different cameras (C1 to C6) during Australian Football League (“footy”) matches at the MCG, summing to approximately 31 hours of data [5, 12]. Specifically, we use the 16-Sep-C2, 16-Sep-C5 and 16-Sep-C6 video sequences from the C2, C5 and C6 cameras to evaluate our framework for detecting loitering events.

4.2.1 Frame-level abnormal crowd detection

We consider loitering events abnormal in densely crowded scenes and aim to detect loitering events at the frame level. A frame is defined as anomalous if it contains at least one ground-truth abnormal pixel (related to a loitering event). We obtain the ROC curve and the AUC by varying the threshold η. The x-axis denotes the False Positive Rate (FPR), and the y-axis denotes the True Positive Rate (TPR); we also note that FNR = 1 − TPR. The FPR corresponds to normal frames incorrectly identified as anomalous, whereas the TPR denotes the proportion of anomalous frames correctly detected as anomalous. Figure 7(a) shows the frame-level abnormal (loitering) events detected in video frames for cameras C2, C5 and C6 of the MCG dataset. For the three video sequences (16-Sep-C2, 16-Sep-C5 and 16-Sep-C6), our proposed DeepSDAE has better AUC (86.4%, 74.6% and 79.3%) and lower EER (18.1%, 29.5% and 23.4%) compared with DBN’s AUC (85.4%, 70.1% and 79.1%) and EER (20%, 35% and 23%), indicating improved frame-level anomaly detection results.

Fig. 7

The evaluation result of MCG dataset: (a) frame-level based abnormal event detection, (b) pixel-level based abnormal event detection

We use the AUC and the Equal Error Rate (EER) to evaluate this work quantitatively. Table 3 presents the performance comparison of the DeepSDAE architecture with previous work. We observe that the proposed architecture outperforms Deep Belief Nets (DBN) [6] when evaluated on the data from the three cameras (C2, C5 and C6) of the MCG dataset.

Table 3 Comparison of performance of deep belief nets (DBN) [6] and the proposed DeepSDAE on MCG dataset (cameras C2, C5 and C6)

4.2.2 Pixel-level abnormal crowd detection

As before, we focus on detecting loitering events in densely crowded scenes. In the pixel-level comparisons, if the detected abnormal pixels cover more than 40% of the ground-truth abnormal pixels, we consider the detection a true positive (TP). In contrast, we count a false positive (FP) even if a single normal pixel is identified as abnormal. Figure 7(b) shows the ROC for pixel-level anomaly detection, and Table 3 presents the quantitative results. Figure 7(b) and Table 3 show that the performance of DeepSDAE is better than the anomaly detection using the DBN architecture [6]. For the three video sequences (16-Sep-C2, 16-Sep-C5 and 16-Sep-C6), our proposed DeepSDAE has better AUC (69%, 68.3% and 73.3%) and lower EER (34.8%, 37.7% and 28.5%) compared with DBN’s AUC (67.6%, 64.1% and 70.4%) and EER (35%, 40.2% and 30.7%), indicating better anomaly detection results at the pixel level.
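For clarity, the sketch below shows one straightforward reading of the pixel-level criterion described above as a per-frame evaluation routine. The masks are boolean arrays over the frame's pixels, and the handling of borderline cases (e.g., frames with no ground-truth anomaly) is an assumption for illustration rather than a specification from our evaluation code.

```python
# Hypothetical sketch: per-frame pixel-level TP/FP decision under the 40%-overlap criterion.
import numpy as np

def pixel_level_outcome(pred_mask: np.ndarray, gt_mask: np.ndarray) -> str:
    """pred_mask / gt_mask are boolean arrays marking detected / ground-truth abnormal pixels."""
    if gt_mask.any():                                    # the frame contains a real anomaly
        overlap = (pred_mask & gt_mask).sum() / gt_mask.sum()
        return "TP" if overlap > 0.4 else "missed"       # >40% of ground-truth pixels covered
    return "FP" if pred_mask.any() else "TN"             # normal frame with any flagged pixel is FP
```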

4.3 UCSD Ped1 and Ped2 dataset

The UCSD Ped1 dataset [8] contains pedestrian walkway video sequences, with 34 videos for training and 36 for testing. Each frame in the video sequences has a resolution of 238 × 158 pixels. The UCSD Ped2 dataset [8] contains crowds of people moving parallel to the camera plane. It comprises 16 video sequences (360 × 240 pixels) for training and 12 videos for testing. Because we do not require pre-training, we only use the test sequences of these two datasets. For both UCSD Ped1 and Ped2, the expected behavior is people walking; the abnormal behavior includes anomalous motion patterns and non-pedestrian objects such as cyclists, people with wheelchairs, cars, and skaters. We kept all the model parameters the same as for the MCG dataset. Figure 8 shows sample results obtained from the proposed DeepSDAE scheme. Figures 9 and 10 show the evaluation results for the UCSD Ped1 and Ped2 datasets.

Fig. 8

Sample results of DeepSDAE on four benchmark datasets, where red coloured pixels denote the detected anomalies in (a) the UCSD Ped1 dataset [8], (b) the UCSD Ped2 dataset [8], (c) the Subway Surveillance dataset [11], and (d) the Avenue dataset [9]

Fig. 9

Evaluation result on UCSD Ped1 dataset: (a) frame-level anomaly detection results, and (b) pixel-level anomaly detection results

Fig. 10

ROC Curve for frame-level anomaly detection on UCSD Ped2 dataset. Pixel-level anomaly detection for this dataset is not available because of the complexity involved and lack of available comparisons in the literature

We compared our quantitative results using DeepSDAE with 13 recently published algorithms. Tables 4 and 5 show the quantitative results on the UCSD Ped1 and UCSD Ped2 datasets. On the UCSD Ped1 dataset, our proposed DeepSDAE achieves AUCs of 91.5% and 65.7% at detecting frame-level and pixel-level anomalies, respectively, outperforming the majority of the existing state-of-the-art approaches; it is inferior only to Feng et al. [21], TCP [27] and sRNN-AE [38]. Likewise, our approach has a lower EER than most algorithms.

Table 4 Comparison of results of DeepSDAE with other existing algorithms evaluated on the UCSD Ped1 dataset
Table 5 Comparison of results of DeepSDAE with other existing algorithms evaluated on the UCSD Ped2 dataset

On the UCSD Ped2 dataset, our proposed DeepSDAE achieves an AUC of 88.9% and an EER of 19.9% at detecting frame-level anomalies, outperforming the existing methods on this dataset by at least 1% in AUC, with a lower EER. In particular, DeepSDAE outperforms Feng et al. [21], TCP [27] and sRNN-AE [38] on this dataset, even though it is inferior to these methods on UCSD Ped1.

4.4 Subway surveillance and avenue datasets

The Subway Surveillance dataset [11] was collected in a subway station, where anomalous behaviors correspond to people moving in the wrong direction. The dataset contains two video sequences, recorded at the entry (144,249 frames) and exit (64,900 frames) gates. The Avenue dataset [9] contains 16 training and 21 testing video clips acquired on The Chinese University of Hong Kong (CUHK) campus, with a total of 30,652 frames (15,328 for training and 15,324 for testing) [10].

We keep the experimental parameters for all the components unchanged (the same as for the previous datasets). Figure 8 shows example results. We evaluated the performance at the frame level for these two datasets because the Subway Surveillance [11] and Avenue [9] datasets do not provide pixel-level ground truth. Tables 6 and 7 show the quantitative results compared with other methods.

Table 6 Comparison of results of DeepSDAE with other existing algorithms evaluated on the Subway dataset
Table 7 Comparison of results of DeepSDAE with other existing algorithms evaluated on the avenue dataset

For the Subway Surveillance dataset, the performance of our approach is second only to the method proposed in [46], and it outperforms the rest of the existing approaches; in addition, our method produces the best performance in terms of EER. Our DeepSDAE is also the second-best on the Avenue dataset compared with the existing approaches, achieving better performance than most previous methods.

The reinforcement learning-based parameter setting model helps the DeepSDAE find the optimal learning parameters. Moreover, combining the SDAE with the optimised VGG16 model (with its reduced number of trainable parameters) speeds up the training, and as a result the networks converge quickly. Our experimental results confirm this finding and show better performance than other methods. Deep Q-learning provides a further optimisation for learning the parameters: our experiments suggest that it improves the overall AUC and ROC in detecting anomalies.

The proposed framework performs well in detecting movement anomalies, but it cannot yet be used as a general approach for diverse types of anomaly detection. Our future work will include applying the late fusion mechanism to different anomaly detection scenarios and generalising the framework to other crowd settings. In addition, we will explore whether a further optimised VGG-16 model or a lightweight model (such as MobileNets [48]) can improve the overall computational time for real-time video surveillance applications.

5 Conclusion

We proposed an end-to-end deep learning model called DeepSDAE to detect anomalous crowd behavior. This framework is a hybrid deep learning architecture comprising motion maps, a four-layer SDAE, an optimised VGG16 model, and a PSVM. Our framework uses a late fusion mechanism to combine decisions from the PSVM channels for detecting crowd anomalies. We derive optimal parameters by modelling the crowd flow process as an MDP and solving it using Deep Q-learning. DeepSDAE is a novel reinforcement learning-deep learning model for detecting crowd anomalies, where RL is introduced for the first time to explore the optimal parameter set. The experimental evaluation on four datasets (MCG, UCSD Ped1 and Ped2, Avenue, and Subway Surveillance) shows that our proposed DeepSDAE surpasses existing approaches in detecting anomalies (at the frame level or pixel level) in crowded scenes.