1 Introduction

Imagine a scenario in which a robot proactively detects suspicious human activities, defuses crowd turbulence and violence before they escalate, acts as a surveillance agent at public and private places, and prevents robbery or theft in sensitive areas by alerting the concerned authorities. Though we may not have reached that stage yet, current technology is moving rapidly towards such an era of self-operated, autonomous robots working continuously without human intervention. This survey discusses the progress made by recently emerged deep learning techniques, as well as non-deep-learning approaches, in the area of video based anomalous activity detection.

Automated visual surveillance, an active area of computer vision, has been one of the most sought-after research domains in academia and industry due to its wide applicability to the monitoring of public and private places, crowd management, elderly health care, defense, and transportation systems. Consequently, installing Closed Circuit Television (CCTV) cameras has become a popular option for monitoring ongoing activities and achieving global security. Owing to low cost, ease of use and customizable camera designs, the global surveillance camera market is anticipated to grow at a compound annual growth rate of 16.6% from 2017 to 2025 [102].

This proliferation of cameras for effective monitoring has resulted in a deluge of video data. In 2015, video surveillance cameras installed worldwide produced around 566 petabytes of data, and this figure was projected to reach 2500 petabytes daily by the end of 2019 [19]. As continuous monitoring of such videos is beyond the capacity of human operators, there is a need for an automated, online visual surveillance system that operates continuously and detects suspicious or anomalous behavior of objects and humans from video in near real time.

Owing to its wide scope for providing global security, the past two decades have witnessed great improvements in video based anomaly detection approaches. Many review papers have been put forth in the domains of human activity recognition, behavior understanding, and crowded scene analysis, which are directly or indirectly relevant to video based anomaly detection [8, 40, 44, 46, 52, 72, 89, 90, 94, 104, 117, 124]. It can be observed from the existing literature that no review paper has assessed deep learning approaches for video based anomalous activity detection. The work by Chong and Tay discussed the use of deep architectures for anomaly detection [14], but it covers only a very brief review of deep learning methods. Therefore, the aim of this survey is to thoroughly analyze the progress made by deep learning techniques in the field of video based anomaly detection.

The contributions of this work are as follows:

  • A graphical taxonomy of video based anomalous activity detection is put forth.

  • A thorough survey of state-of-the-art deep learning approaches for video based anomaly detection is presented.

  • The trade-offs in anomaly detection from video are discussed from the viewpoints of both accuracy-oriented and real-time processing oriented approaches using deep learning techniques.

  • Datasets introduced in the past five years, as well as the need for and issues of anomaly detection, are explored.

  • The current challenges, application domains and possible future directions of deep learning applicable to anomaly detection are thoroughly put forth.

The roadmap of the paper is depicted in Figure 1. Section 2 deals with the taxonomy of anomalous activity detection from videos. Sections 3 and 4 deal with traditional and deep learning approaches for anomaly detection, respectively. Application domains and benchmark datasets are briefly overviewed in Section 5. Research challenges and future directions in anomaly detection are enunciated in Section 6. Section 7 concludes the paper.

Figure 1: Roadmap of the paper (figure to be read from left in clockwise manner)

2 Taxonomy of video based anomalous activity detection

"Anomalies are patterns in data that do not conform to a well-defined notion of normal behavior" [10]. Anomalous activity is also known as irregular behavior, suspicious activity, surprising event [36], unusual activity, and so on. Anomalous events are context and subject dependent, novel, unknown and rare, and are therefore challenging to detect from videos. Figure 2 depicts the taxonomy of video based anomaly detection, organized by the various factors to be considered when performing anomalous activity detection.

Figure 2: Taxonomy of video based anomaly detection

2.1 Tasks

Anomalous activity detection focuses on finding whether a given video frame exhibits an anomaly or not. It addresses the question, “Does the given frame contain an anomaly?” Anomalous activity localization determines the actual location of the anomaly within the given video frame, typically with a bounding box. It addresses the question, “Where is the anomaly occurring in the given frame?” Localizing the groups performing activities has been well handled by Lei et al. [92] using a latent graph model. The tasks of detection and localization have been jointly performed in [12, 16, 66, 79, 81, 114, 115].

2.2 Kinds of anomaly detection

  • Referring to the literature [12, 15, 16], anomalous events can be classified into two classes, viz. local anomalies and global anomalies. A local anomalous event differs from its spatio-temporal neighboring events, and detection deals with finding how the activity of an individual varies from that of its neighbors (for example, driving a vehicle in the wrong direction). Local anomalous activity detection has been well investigated in [2, 45, 57, 84]. In contrast, global anomalous events are those in which entities interact globally in an unusual way, even if each local event is normal in isolation, i.e. multiple events that individually seem normal interact with each other in a suspicious or unusual manner (for example, car accidents, or crowd dispersion due to an explosion). This also covers entities behaving suspiciously whose collective activity is harmful, for example violence and robbery. Joint modeling of local and global anomalous events is done in [12, 78]; both works use spatio-temporal video descriptors for this purpose. Cong et al. [15] used spatio-temporal features extracted with Histograms of Optical Flow to detect anomalies at multiple locations and scales; their approach is based on a sparse coding technique.

  • On similar lines, Yu et al. [118] defined single point anomalies and interaction based anomalies. A point anomaly maps to the anomalous activity of an individual entity, termed a single entity based anomaly, while a group interacting in an unconventional manner maps to an interaction based anomaly. The variants of interaction based anomalies are human-object interaction (for example, a person leaving a bag unattended at a public place), human-human interaction (for example, people fighting) and object-object interaction (for example, vehicles colliding with each other). The complexity of, and time required for, anomaly detection and localization increase from single entity anomalies to interaction based anomalies and finally to crowd anomaly detection and localization, as depicted in Figure 3.

    Figure 3: Kinds of anomaly detection based on the number of entities involved

  • The definition of an anomaly varies with context. For example, a car running on a highway is a normal activity, whereas the same car running on a pedestrian walkway is anomalous. This is known as a contextual anomaly: an activity is anomalous according to one context while the same activity is normal in another. Activities related to each other in space and time form the context [123], and it is therefore necessary to jointly model appearance features obtained from the spatial domain and motion features obtained from the temporal domain. Contextual anomalies are divided into spatial and temporal anomalies [48, 64]. By and large, spatial anomalies can be detected from a single frame, whereas at least two frames (observations collected over consecutive time stamps) are required for temporal anomaly detection.

2.3 Cameras deployed for surveillance

One of the important factors for accurately detecting anomalies is the number of cameras deployed for capturing the video and the view/angle at which the cameras are fixed. It is important to capture videos from multiple views/cameras, since all activities (normal or anomalous) may not be captured by a single camera and the framework may then miss an anomalous activity. A suspicious individual may also deliberately avoid a camera to hide their activities. These issues can be addressed by capturing multiple views of the ongoing activities.

Multiple views can be captured from multiple cameras, sensors, or thermal cameras [20, 87]. Once all the videos are obtained, video summarization can be performed by removing redundant views before the anomaly detection decision is taken. Most research has focused on detecting anomalies from a single camera view [51]. Although multi-camera (multi-view) anomaly detection is a more complex and challenging task than single-view detection, it has the potential to provide more accurate detection, since it helps to capture the spatio-temporal features (context) of the video efficiently.

2.4 Target of interest for anomalous activities

Anomaly detection is applied to both indoor and outdoor environments, and therefore the challenges associated with surveillance videos of such environments need to be handled carefully. Indoor surveillance videos are characterized by changes in room illumination, light perturbation, and reflections from architectural components such as windows or doors; this kind of surveillance mainly covers offices, shops, ATMs and home-based healthcare systems. Outdoor surveillance videos, on the other hand, involve illumination changes depending on the time of day and weather conditions such as rain, snow and fog; this surveillance covers both controlled and uncontrolled environments such as sports arenas, crowded scenes, pedestrian walkways, transportation systems and many more. In summary, the target of anomaly detection can be individuals or groups, crowd scenarios, the transport domain, or natural and man-made calamities such as floods and fires.

2.5 Anomaly measurement and performance metrics

Different performance metrics are used for anomaly detection, including the true positive rate, false positive rate, precision and recall. The confusion matrix depicting the performance measures used for anomaly detection is shown in Figure 4. For this purpose, the ground truth is delineated as follows: the presence of an anomalous activity is treated as “positive” and its absence as “negative” in the confusion matrix. Receiver Operating Characteristic (ROC) curves are the preferred tool for visualizing and comparing the performance of classification methods used in anomaly detection. An ROC curve is a two-dimensional graphical representation with the true positive rate plotted on the Y-axis and the false positive rate on the X-axis. In anomaly detection, it is used to examine the trade-off between the benefit of true positives (correctly predicted events) and the cost of false positives (incorrectly predicted events). Three further metrics are derived from ROC curves, viz. the Area Under the ROC Curve (AUC), the Equal Error Rate (EER), and the Equal Detection Rate (EDR). The EER is the ratio of misclassified frames at the operating point where the false positive rate equals 1 − the true positive rate, and the Detection Rate (DR) calculated at the equal error rate is termed the EDR. For real-time processing oriented anomaly detection approaches, the frames per second (FPS), i.e. the time required to process each frame of the video, is also considered. For evaluating the performance of anomaly detection models, various levels of detection are considered, viz. frame level, pixel level, dual pixel level and object level.

Figure 4: Confusion matrix for anomaly detection

  • Frame-level anomaly detection: The whole frame is considered anomalous if at least one pixel of the test frame is predicted to be anomalous.

  • Pixel-level anomaly localization: This measures the accuracy of the spatial location of the anomalous region predicted by the system. As described by Li, Mahadevan and Vasconcelos [51], the predicted anomalous pixels are compared with the pixel-level ground truth; if there is at least 40% overlap between the predicted anomalous pixels and the ground-truth region, the test frame is counted as a true positive, otherwise it is counted as a false positive.

  • Dual pixel level (DPL): If a region detected by the algorithm (obtained at either frame level or pixel level) happens to overlap the ground-truth anomalous region, the detection may be a “lucky guess”. Frame-level and pixel-level measures do not account for such false regions, so the dual pixel measure was introduced to detect them [79]. A frame is said to contain an anomalous activity if the following conditions are satisfied: (1) the frame contains an anomaly at frame level, and (2) at least β% of the predicted anomalous pixels overlap the ground-truth region. In this way, a detection that additionally marks unimportant regions as anomalous is only credited as a true positive when the overlap condition holds.

  • Object level: Since the pixel-level criterion only checks for 40% overlap of the predicted anomalous pixels with the ground-truth region, pushing for a higher true positive rate may result in a large false positive rate. Therefore, object-level anomaly localization [26] defines a true positive by setting a threshold Θ on the overlap, as shown in Eq. 1 (a short sketch of these evaluation criteria follows the equation).

$$ \frac{\text{Detected anomaly} \cap \text{True anomaly}}{\text{Detected anomaly} \cup \text{True anomaly}} \ge \Theta $$
(1)
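As an illustration of how these evaluation criteria are typically computed in practice, the following minimal Python sketch derives the frame-level AUC and EER from per-frame anomaly scores and applies the object-level overlap test of Eq. 1. The array names, the scikit-learn utilities and the default threshold value are illustrative assumptions rather than code taken from the cited works.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def frame_level_metrics(scores, labels):
    """AUC and EER from per-frame anomaly scores and binary ground truth."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = anomalous frame
    roc_auc = auc(fpr, tpr)
    # EER: operating point where the false positive rate equals 1 - TPR
    eer_idx = np.nanargmin(np.abs(fpr - (1.0 - tpr)))
    eer = (fpr[eer_idx] + (1.0 - tpr[eer_idx])) / 2.0
    return roc_auc, eer

def object_level_hit(pred_mask, gt_mask, theta=0.5):
    """Eq. 1: a detection counts as a true positive if the IoU of the
    predicted and ground-truth anomalous regions is at least theta."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return union > 0 and (inter / union) >= theta
```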

3 Traditional approaches for anomaly detection

Anomaly detection from video has been a widely investigated research topic for over a decade, and various frameworks for tracking, surveillance, and anomaly detection in different domains have been put forth for commercial use. The IBM Smart Surveillance System (S3) [95] is the world’s first event-based, distributed middleware for surveillance systems, supporting video based behavioral analysis, automatic scene monitoring, event based retrieval and real-time event alerts. PFinder [111] and W4 [33] are two systems for human behavior tracking. PFinder tracks and interprets human behavior and is applicable to video databases and wireless interfaces. W4 [33] operates on monocular video imagery in outdoor environments to detect and track multiple people interacting with each other and with objects. Mobileye [116] is a commercially available vehicle tracking system in the transportation domain; it detects large objects at short distances using a monocular camera, although its detection is limited to certain classes of objects (vehicles and pedestrians). Fujitsu’s Intelligent Transportation System [27] runs at 30 FPS on an Intel Xeon 3.2 GHz machine with 4 GB of memory to detect and track vehicles and other entities in real time. Knight [86], an automated surveillance system developed at the University of Central Florida, works with multiple cameras to detect and track objects. Apart from monitoring sterile and dangerous zones, this system also summarizes the key frames of the video and provides textual descriptions of the trajectories for the convenience of monitoring personnel.

Monitoring and tracking human behavior and finding anomalies in surveillance video have been well investigated topics for over a decade. However, this topic has mostly been investigated from the point of view of detecting anomalies, with little regard for how much video must be buffered before processing starts, support for online learning, or the time taken to detect and classify anomalies. Very few papers serve as candidates for real-time processing oriented models [55, 78]. As video based anomalous activity detection is a promising area of computer vision with great applicability to public places, deploying such systems requires accounting for practical constraints such as the amount of video buffered before processing and the time required to detect and classify anomalies, so that detection is performed in a timely manner and mishaps can be proactively avoided. It also requires consideration of how many frames are processed per second, the speed of the streaming video, and the running time of the anomaly detection algorithms. Taking these factors into account, this paper classifies anomaly detection approaches into accuracy-oriented and real-time processing oriented approaches. This classification should help in modifying traditional approaches into real-time ones, or in developing new models focused on real-time processing, so that they are readily deployable in real-life scenarios. State-of-the-art approaches for video based anomalous activity detection are therefore divided into two categories: accuracy-oriented approaches and real-time processing oriented approaches. The aim of accuracy-oriented approaches is to detect and localize anomalies with a focus on accuracy, whereas real-time processing oriented approaches focus on online processing of video frames in order to detect anomalies in real time. The classification of video based anomaly detection approaches is depicted in Figure 5. As the majority of traditional approaches have focused on accuracy-oriented methods, this paper covers real-time processing oriented anomaly detection approaches using both traditional and deep learning methods; however, to compare the trade-off between accuracy and real-time performance, significant accuracy-oriented approaches are also mentioned.

Figure 5: Classification of video based anomaly detection approaches

3.1 Real-time processing oriented approaches

The state-of-the-art traditional anomaly detection approaches are divided into two categories, viz. local feature modeling methods and holistic feature modeling methods. Local feature modeling methods learn a model based on local visual features that represent events and apply statistical or computer vision based techniques to detect anomalies; they treat the video as a collection of entities. Holistic feature modeling methods treat the entities in the video as a whole and perform anomaly detection by modeling holistic features such as motion and density.

3.1.1 Local feature modeling methods

Statistical approach

The statistical formulation of anomalous activity detection is given as follows [85]. Features coming from the nominal distribution are distributed according to a probability density function (pdf) g0(·), whereas anomalous instances are distributed according to a pdf g1(·). The problem of anomalous activity detection then amounts to predicting whether an instance is distributed according to the nominal or the anomalous pdf, as given in Eq. 2.

$$ H_0: \ell \sim g_0(\cdot) \quad \text{versus the alternative (anomaly)} \quad H_1: \ell \sim g_1(\cdot) $$
(2)

If both pdfs are either known or can be estimated from training data, this task reduces to the well-known Likelihood Ratio Test (LRT).
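A toy sketch of this hypothesis test is given below, assuming both densities are (or are approximated by) Gaussians fitted beforehand; the feature dimensionality, the parameters and the threshold τ are purely illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def lrt_anomaly(x, g0, g1, tau=1.0):
    """Likelihood Ratio Test: flag x as anomalous when g1(x)/g0(x) exceeds tau.
    g0 and g1 are fitted nominal and anomalous densities (here Gaussians)."""
    ratio = g1.pdf(x) / (g0.pdf(x) + 1e-12)
    return ratio >= tau

# Toy densities estimated from (hypothetical) training features.
g0 = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))        # nominal
g1 = multivariate_normal(mean=[3.0, 3.0], cov=2.0 * np.eye(2))  # anomalous
print(lrt_anomaly(np.array([2.8, 3.1]), g0, g1))                 # -> True
```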

Non-parametric methods do not depend on a parametric model for learning motion and appearance based features from video; they directly learn scene normality from descriptor instances. This approach trains only on normal data and follows an unsupervised training regime. Bertini et al. [6] put forth a spatio-temporal volume (STV) based non-parametric method for detecting anomalies using unsupervised learning. To support multi-scale analysis, descriptors are computed independently at different scales even if they overlap. For detecting contextual anomalies, the likelihood of a descriptor is calculated based on its neighboring cells. To determine whether a video stream contains an anomalous event, a range query is applied on the training data to check the neighboring cells, using a fast approximate nearest-neighbor search built over k-means trees.

Apart from videos, anomaly detection is also performed on hyperspectral imagery. Most approaches for hyperspectral imagery follow the statistical approach, in which a statistical model is built for the background image and an anomaly score is calculated based on the deviation from the model mean [77]. Though this approach is unsupervised, hyperspectral data only marginally satisfy its requirements. Therefore, Olson and Doster [68] proposed an approach for modeling the background that combines kernel Principal Component Analysis (PCA) with a sub-sampled image and uses the reconstruction error as the anomaly measure. This method could be improved by jointly modeling the spectral and spatial information of the hyperspectral imagery.

Sparse representation approach

Lu et al. [55] put forth a framework based on sparse combination learning for speedily detecting anomalies in surveillance video. The use of small-scale least squares optimization shortens the running time for detecting anomalies.
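The core idea, testing a feature vector against a small set of learned combinations via least squares and flagging it as anomalous when no combination reconstructs it well, can be sketched as follows. This is a simplified reading of sparse combination learning rather than the authors’ implementation; the combination matrices and the threshold are assumed to have been learned offline.

```python
import numpy as np

def reconstruction_error(feature, basis):
    """Least-squares reconstruction error of a feature vector against one
    learned combination of dictionary atoms (basis: d x k matrix)."""
    coeff, *_ = np.linalg.lstsq(basis, feature, rcond=None)
    return np.linalg.norm(feature - basis @ coeff) ** 2

def is_anomalous(feature, combinations, threshold):
    """A frame/patch is normal if at least one learned combination
    reconstructs it with small error; anomalous otherwise."""
    return min(reconstruction_error(feature, B) for B in combinations) > threshold
```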

Bag of words (BOW) approach

Roshtkhari et al. [78] handled the problem of detecting contextual anomalies from video within a probabilistic framework that measures the likelihood of STVs. New normal events are learned incrementally using online, unsupervised learning. For faster detection of anomalies, the redundancy of spatio-temporal volumes is curbed by grouping STVs via codebook construction, which in turn reduces the search time for comparing newly observed data with previously stored STVs.

Contextual information obtained from spatio-temporal volumes of video cubes is used to detect global and local anomalies in [53]. An activity pattern codebook is constructed to infer global information from the video, whereas a composition pattern dictionary is used to infer salient patterns in the STVs. A sparse reconstruction model built over the learned dictionary is used for anomaly detection, and a multi-scale analysis method is used for accurate localization of anomalies in the video.

Cheng et al. [13] applied a one-class Support Vector Machine (SVM) with Bayes probability to detect anomalies from video and a maximum subsequence search method for anomaly localization. Events in the video are represented using “subsequences”, i.e. subsequences of time-series-based spatial windows in close proximity to each other. Though the approach achieves comparable performance with faster processing, small-scale anomalies of short duration are not detected.

Feature learning approach

For real-time detection of anomalies from video, Wang et al. [105] used low-level statistical features instead of relying on complex machine learning and computer vision algorithms. This approach is not suitable for low-density crowd scenes in which the behavior of an individual entity is anomalous.

Leyva et al. [49] used optical flow and foreground occupancy features to extract descriptive features from a cell structure. After extracting a compact set of features, models such as a Gaussian Mixture Model (GMM), Markov chains and BOW are used to identify anomalous video volumes. Finally, an inference mechanism detects anomalous activity using neighborhood cells described by local spatio-temporal features.
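A minimal sketch of the normalcy-modeling step of such feature-based methods is shown below, assuming the optical-flow/occupancy descriptors have already been extracted into a matrix; the file name and the number of mixture components are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical training matrix: one row of optical-flow / occupancy
# descriptors per normal video cell, shape (n_cells, n_features).
normal_descriptors = np.load("normal_cell_features.npy")

gmm = GaussianMixture(n_components=5, covariance_type="full", random_state=0)
gmm.fit(normal_descriptors)

def cell_anomaly_score(descriptor):
    """Negative log-likelihood under the normalcy GMM; high = anomalous."""
    return -gmm.score_samples(descriptor.reshape(1, -1))[0]
```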

3.1.2 Holistic feature modeling methods

Holistic & density based approach

Marsden et al. [59] used a holistic, density based approach for crowd anomaly detection. They put forth scene-level holistic features along four dimensions: crowd conflict, collectiveness, motion speed and density. Two classifiers are used, depending on the availability of anomalous data: a GMM is used for anomaly detection when only normal training data are available, while a discriminative SVM model is used when both normal and anomalous behavior data are available. The authors used cross-scene training, i.e. to detect anomalies in the UMN dataset, training frames from other datasets are used to fit the Gaussian Mixture Model (GMM).

Trajectory based approach

Motion instability, defined in terms of direction randomness and motion intensity, has been used to discriminate anomalous behavior from normal behavior in an unsupervised manner [113]. This framework is useful for understanding how newly observed patterns deviate from previously observed ones, but fails to detect appearance-based anomalies. To achieve faster processing, a feature tracking scheme is employed in this approach.

The majority of anomaly detection algorithms work on decompressed videos containing pixel-level information. In the pixel domain, the complex feature extraction process requires a huge amount of data and lowers execution speed, and this problem worsens when thousands of long-duration decompressed videos are generated. Biswas and Babu [7] came up with a new approach that utilizes motion vector cues from H.264/AVC compressed videos. Hierarchical processing of video frames is carried out using a pyramid structure, namely motion pyramids: initial processing is done at a coarse level, and only if an anomaly is found does processing move to the finer, actual frame level. This hierarchical processing reduces computational overhead and thus detects anomalies in real time. The approach works well only when motion is present in the video; since it is purely based on motion vectors, it cannot detect appearance anomalies.
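The coarse-to-fine principle behind such compressed-domain methods can be sketched as follows: a cheap check on a downsampled motion-magnitude map (e.g. built from motion vectors) gates the expensive fine-level analysis. The pooling factor and thresholds are illustrative assumptions, not values from [7].

```python
import numpy as np

def coarse_to_fine_check(motion_mag, coarse_factor=8, coarse_thr=2.0, fine_thr=4.0):
    """Hierarchical check on a per-pixel motion-magnitude map: a cheap
    coarse pass first, with the fine pass run only when needed."""
    h, w = motion_mag.shape
    coarse = motion_mag[:h - h % coarse_factor, :w - w % coarse_factor]
    coarse = coarse.reshape(h // coarse_factor, coarse_factor,
                            w // coarse_factor, coarse_factor).mean(axis=(1, 3))
    if coarse.max() < coarse_thr:          # nothing unusual at the coarse level
        return False
    return (motion_mag > fine_thr).any()   # refine only suspicious frames
```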

3.2 Accuracy oriented approaches

A substantial amount of work has been put forth in accuracy-oriented approaches for anomalous activity detection. Some significant works are based on Mixtures of Dynamic Textures (MDT) [51, 57], sparse representation techniques [63], Gaussian process regression [12], cascaded Hidden Markov Models [106] and context-dependent approaches [123].

Most of the above papers [12, 51, 57] train the model on normal videos to build a normalcy model; during the testing phase, anomalous videos are introduced to check the effectiveness of the model. This is known as unsupervised learning. Among these, [63] uses both supervised and unsupervised learning for anomaly detection; in the supervised case, anomalous videos are also used during the training phase to improve detection accuracy. Similar to [63], a weakly supervised learning strategy is followed in [35], in which anomalous videos are used in the training phase and both a multi-instance learning model and a dictionary learning approach are used for anomaly detection.

3.3 Comparison of real-time processing oriented approaches and accuracy-oriented approaches

Real-time processing oriented approaches rely on online learning, i.e. the time required to process a frame is shorter than the interval before the next frame in the sequence arrives. Such approaches continuously update themselves to identify whether a newly observed event is anomalous or not; model parameters are updated incrementally based on new training data.

Accuracy-oriented approaches, on the other hand, use offline algorithms that assume all training data are available at the outset. These methods use fixed parameters and predefined or fine-tuned anomaly thresholds obtained from batch training, and therefore cannot be used for real-time detection of anomalies.

Traditional anomaly detection approaches rely on hand-crafted features extracted from video frames and require expertise to design the feature engineering methods. Such hand-crafted features are suboptimal and very specific to the given scenario; they transfer poorly across domains and provide weak support for inferring semantic information. These approaches cannot be generalized to scenes with unknown anomalies, adverse lighting conditions, or drastic variation in the motion and appearance of entities in the video. In contrast, deep learning architectures such as Convolutional Neural Networks (CNNs) automatically learn and select features. Deep learning approaches can generalize across multiple datasets: once learning is done on one dataset, the pre-trained deep neural network can be applied to another dataset, which is referred to as knowledge transfer. Figure 6 shows the difference between the traditional machine learning approach and the deep learning approach.

Figure 6: Traditional machine learning versus deep learning methodology

4 Deep learning approaches for anomalous activity detection

The availability of large datasets and of GPUs at lower cost has resulted in a proliferation of deep learning techniques. State-of-the-art results have been achieved with deep learning for image classification [47], object detection [69, 70, 76], activity recognition [88, 108], egocentric activity recognition [109], video hashing [91] and video captioning [29]. The reason behind the success of deep learning approaches is that their non-linear transformations allow extracting useful, complex features from high-dimensional data such as video. This has triggered the use of deep learning techniques for anomaly detection from videos.

As the use of deep learning for anomaly detection is still at a nascent stage, deep learning approaches are described in this paper from the viewpoints of both accuracy and real-time processing. Various deep architectures have been used for anomaly detection, viz. CNNs [47], Long Short-Term Memory networks (LSTMs) [38], auto-encoders (AEs), and Generative Adversarial Networks (GANs) [31].

Recently, vanilla deep models have been modified to solve the specific problem at hand. For example, 2D or 3D CNNs are used for automatically describing videos. Moreover, to capture the temporal and spatial dynamics of long-duration videos, different variants of LSTMs have been put forth [32, 54]. The Temporal and Spatial LSTM (TS-LSTM) put forth by Guo et al. [32] is a good candidate for detecting anomalies in long-duration videos. It can be noted that CNNs (for their ability to extract features automatically) and generative models (for their ability to reconstruct the input pattern) have been widely used for anomaly detection.

Since most anomaly detection approaches are based on AEs, an overview of the variants of auto-encoders used in generative models is given here.

The beauty of generative models is that they learn the distribution of the data and accordingly predict the future sequence of frames. Based on this principle, the reconstruction error is generally used to calculate an anomaly score. Each data instance xi, with features xij, is reconstructed by the learned network, giving the reconstructed output oij. The reconstruction error is then calculated as follows.

$$ {\delta}_i=\frac{1}{n}\sum \limits_{j=1}^n{\left({x}_{ij}-{o}_{ij}\right)}^2 $$
(3)

In the above equation, the reconstruction error is denoted by δi and n denotes the number of features describing the data instance. The reconstruction error δi serves as the anomaly score. A learned auto-encoder reconstructs the motion signatures of normal videos with low error but cannot accurately reconstruct motion from anomalous videos; in other words, the auto-encoder models the distribution of the regular dynamics of appearance changes. Generative models generally assume that the features come from a predetermined type of distribution and are therefore likely to fail if the feature distribution changes.
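For concreteness, Eq. 3 corresponds directly to the following few lines; this is a minimal sketch, and in practice the score is usually normalized per video before thresholding.

```python
import numpy as np

def anomaly_score(x, o):
    """Eq. 3: mean squared reconstruction error between the input features x
    and the auto-encoder output o for one data instance."""
    x, o = np.asarray(x, dtype=float), np.asarray(o, dtype=float)
    return np.mean((x - o) ** 2)
```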

  • Sparse AE (SAE): The sparse auto-encoder is used for handling the transfer learning problem.

  • Denoising auto-encoder (DAE): The denoising auto-encoder is a stochastic extension of the auto-encoder, obtained by adding a stochastic corruption layer at its input. To discover robust features and prevent the hidden layer from simply learning the identity of the input, this auto-encoder is trained to reconstruct the clean input from its corrupted version, i.e. to predict the missing (corrupted) values from the non-missing (uncorrupted) values of a given pattern. This model requires a manually devised noise model for training, and in an unsupervised setting it is very difficult to choose an effective noise model.

  • Stacked denoising auto-encoders (SDAE): Denoising auto-encoders stacked together form the initialization of deep architectures. They have been used extensively for learning new representations from videos. In order to denoise the corrupted (missing) values of the inputs, the denoising auto-encoders are trained locally.

  • Marginalized denoising auto-encoder (MDA): SDAEs require heavy computational processing and are therefore not efficient for large-scale video analytics. The marginalized SDAE (mSDAE) addresses this issue of computational cost and scalability to high-dimensional data. mSDAEs perform feature learning faster by using a single-layer auto-encoder structure and achieve a balanced trade-off between performance and speed.

  • Cascaded stacked auto-encoder: A stacked auto-encoder with more than one layer is known as a cascaded stacked auto-encoder. Stacked auto-encoders are useful for unsupervised feature learning.

  • Generative Adversarial Networks (GANs): GANs are used for generating data and follow unsupervised learning. A GAN can be considered a zero-sum two-player game between two networks, a generator G and a discriminator D. During training, the generator’s task is to generate data (images), whereas the discriminator’s task is to distinguish the generated data from real data, i.e. to identify whether data are real or produced by G. In this way, the discriminator is trained to output correct decisions.

  • Conditional GANs: A GAN can be turned into a conditional GAN by adding a condition c as an input to both the generator G and the discriminator D.

  • Convolutional Winner-Take-All encoder (CONV-WTA): The CONV-WTA [58] uses an unsupervised approach for learning sparse representations in a hierarchical manner. It is non-symmetric in nature: the encoder is built by stacking multiple CNN based Rectified Linear Unit (ReLU) layers, whereas the decoder is a single linear deconvolutional layer.

4.1 Deep learning based real-time processing oriented approaches

Initially, as the basis of deep learning architectures, neural network models were employed for real-time anomalous activity detection [79, 80]. Sabokrou et al. [79] used an independent feature learning method that models video with local and global descriptors via a sparse auto-encoder in an unsupervised way; they used a Gaussian distribution to model normal video patches and the Mahalanobis distance as the anomaly measure. Along similar lines, Sabokrou et al. [80] put forth two anomaly detectors based on an auto-encoder and on a sparse representation of the video. As the AE yields a higher reconstruction error for anomalous patches and the sparse representation likewise indicates the likelihood of anomalies, the cascaded effect of both detectors is used to detect anomalies in real time, achieving 120 FPS on the UCSD Ped2 dataset using MATLAB 2015 on a 3.5 GHz CPU with 16 GB of memory.
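The Gaussian-plus-Mahalanobis scoring used as the anomaly measure in such detectors can be sketched as below; the class name and the use of a pseudo-inverse for numerical stability are illustrative choices, not details from [79].

```python
import numpy as np

class GaussianNormalcyModel:
    """Fit a Gaussian to descriptors of normal video patches and score new
    patches by Mahalanobis distance (large distance = likely anomalous)."""
    def fit(self, features):                      # features: (n_patches, d)
        self.mean = features.mean(axis=0)
        cov = np.cov(features, rowvar=False)
        self.inv_cov = np.linalg.pinv(cov)        # pseudo-inverse for stability
        return self

    def mahalanobis(self, x):
        d = x - self.mean
        return float(np.sqrt(d @ self.inv_cov @ d))
```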

4.1.1 Deep generative model

Fully convolutional neural networks were first used for anomaly detection by Sabokrou et al. [83]. This method follows a transfer learning approach, extracting features through patch operations on video frames with a CNN pre-trained on AlexNet [47]; the extracted features are represented using sparse auto-encoders, and a Gaussian model is used to evaluate anomalies. For automatically representing video frames and inferring appearance and motion cues, 3D gradients obtained from PCANet are used and normal events are modeled using a deep GMM [26]; the deep GMM, which is scalable and generative in nature, is constructed by stacking multiple layers of GMMs. For time-efficient and accurate anomaly localization, a deep cascade approach based on a competitive cascade of deep neural networks has been put forth by Sabokrou et al. [81]. This approach supports real-time anomaly detection in surveillance systems. It combines two stages, a deep stacked auto-encoder and a CNN, with the intermediate layers of the CNN or stacked auto-encoder acting as sub-stages of a cascaded classifier. To achieve time-efficient anomaly detection, the shallow layers of the cascaded DNN detect background and normal patches, whereas more complex patches in the neighborhood of simple patches are handled by the deeper layers. Rather than merely fine-tuning a pre-trained CNN, the networks in this method are trained from scratch. Following the cubic-patch-based approach with cascaded classifiers, Sabokrou et al. [82] used local and global video descriptors for video representation; the Structural Similarity Metric (SSIM) is employed to check the similarity among patches, and two one-class classifiers, one per descriptor, are used for anomaly detection based on weakly and strongly anomalous patches.
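The transfer-learning step common to these works, using a classification CNN pre-trained on ImageNet purely as a frozen patch-level feature extractor whose outputs feed a normalcy model, can be sketched as follows. For brevity the sketch uses a torchvision ResNet-18 rather than the AlexNet-style backbone of [83]; the preprocessing values are the standard ImageNet statistics.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# A pretrained classification network used only as a frozen feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def patch_features(patch):
    """patch: HxWx3 uint8 array -> 512-D descriptor for the normalcy model."""
    x = preprocess(patch).unsqueeze(0)
    return extractor(x).flatten(1).squeeze(0)
```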

Wu et al. [112] used a two-stream network and a Variational Autoencoder/Generative Adversarial Network for detection and localization of anomalies. A notable feature of this system is its client-server architecture, which provides users with an input channel for uploading local videos and also accepts streaming video input in online mode.

4.1.2 Spatio-temporal model

Owing to the lack of a large set of anomaly representations for training, most anomaly detection approaches follow unsupervised learning [113, 114]. The frameworks put forth by Giorno et al. [21] and Ionescu et al. [99] work with unsupervised learning when no training data are available. Online anomaly detection based on an unmasking technique is done in [99]; it works on the principle of change detection. The unmasking technique involves a binary classifier that iteratively distinguishes between consecutive windows of video frames while repeatedly removing the most discriminating features; a persistently high training accuracy indicates the presence of an anomaly.
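A simplified sketch of the unmasking idea is given below: a linear classifier is repeatedly trained to separate two consecutive windows of frame features, the most discriminative dimensions are discarded each round, and persistently high accuracy signals a change. The classifier choice, window representation and iteration counts are assumptions, not the exact protocol of [99].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def unmasking_score(win_a, win_b, iterations=5, drop_per_iter=10):
    """win_a, win_b: (n_frames, n_features) feature matrices of two consecutive
    windows. Returns the mean training accuracy over unmasking rounds, used as
    the anomaly score for win_b (high = likely anomalous)."""
    X = np.vstack([win_a, win_b])
    y = np.r_[np.zeros(len(win_a)), np.ones(len(win_b))]
    keep = np.arange(X.shape[1])
    accs = []
    for _ in range(iterations):
        clf = LogisticRegression(max_iter=1000).fit(X[:, keep], y)
        accs.append(clf.score(X[:, keep], y))
        # Discard the currently most discriminative feature dimensions.
        top = np.argsort(np.abs(clf.coef_[0]))[-drop_per_iter:]
        keep = np.delete(keep, top)
    return float(np.mean(accs))
```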

4.2 Deep learning based accuracy oriented approaches

4.2.1 Temporal regularity model

This model focuses on evaluating CNN features across time to capture local anomalous events from videos [74]. Ravanbakhsh et al. used a CNN model pre-trained on object recognition to detect anomalies and employed a two-channel approach to represent the video in terms of appearance and motion (optical flow), similar to [43, 57]. They put forth the TCP (Temporal CNN Pattern) network, in which a Binary Quantization layer is placed as the last layer of the CNN to represent temporal motion patterns for anomaly segmentation. However, the TCP network is not end-to-end trainable, suffers from heavy post-processing, and requires a previously computed codebook of convolutional feature maps.

Anomaly detection based on sparse coding [55, 120] involves building a dictionary over normal events such that normal events are associated with a small reconstruction error, whereas anomalous events result in a large reconstruction error. Optimizing the sparse coefficients is time consuming and becomes a bottleneck for dictionary learning. In addition, neighboring (temporally related) frames are assigned different sparse coefficients, which leads to a loss of temporal coherence between those frames [55, 57].

In order to retain the locality information between neighboring frames, a temporally coherent sparse coding (TSC) based method has been put forth in [56], in which similar neighboring frames are encoded with the same sparse coefficients. The TSC is mapped to an equivalent representation using a stacked RNN (sRNN). Optimizing the parameters of the stacked RNN removes the need to select the hyperparameters of TSC and, owing to the shallow architecture, speeds up anomaly prediction.

4.2.2 Spatio-temporal model

Zhou et al. [122] pioneered the use of spatio-temporal CNNs for anomaly detection and localization. Fang et al.’s spatio-temporal anomaly detection model is inspired by saliency information obtained from videos [24]. This model represents spatial information (SI) obtained from the salient regions of a frame, while the temporal motion aspect is represented by a multi-scale histogram of optical flow (MHOF). A deep learning network, PCANet, is used to obtain features from the SI and MHOF for anomaly detection.

Once anomalous events are detected, it is also necessary to explain why an event is judged anomalous; this is called event recounting of anomalous activities. The approach put forth by Hinami et al. [37] performs joint detection and recounting of anomalous events by combining a multi-task Fast R-CNN (MT-FRCN) with an environment-specific anomalous event detector. Currently, the semantic knowledge used to explain anomalies is restricted to actions; this deep knowledge of visual concepts could be extended to explain more complex object interactions occurring in anomalous events. Sun et al. [93] fused a one-class SVM (OC-SVM) with a CNN to design an end-to-end trainable model for anomaly detection. To model the velocity and direction of entities in the video, optical flow features are fed as input to the CNN. Their Deep One-Class (DOC) model, equipped with a Radial Basis Function (RBF) kernel, yields robust anomaly detection.
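The one-class SVM component that recurs in these pipelines can be sketched as follows, assuming frame-level features (e.g. CNN activations or optical-flow histograms) have already been extracted from normal training data; the file name and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical matrix of features extracted from normal training frames only.
normal_feats = np.load("normal_frame_features.npy")   # shape (n_frames, d)

oc_svm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(normal_feats)

def frame_anomaly_score(feat):
    """Signed distance to the one-class boundary; negative = anomalous."""
    return float(oc_svm.decision_function(feat.reshape(1, -1))[0])
```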

4.2.3 Representation learning model

Hu et al. [41] used a deep incremental slow feature analysis network (D-IncSFA) to learn high-level abstractions from video and detect anomalies in one step. Global anomaly detection is performed using temporal modeling, and local anomaly detection using multi-scale analysis based on the summed squared derivative (SSD) value. This approach does not rely on any classifier model and does not use hand-crafted feature representations.

Though deep learning models are good at extracting high-level abstractions from video, it is difficult to model regression tasks since the labels do not carry enough information to fine-tune the learning parameters. This problem has been addressed by deep metric learning (DML) based regression, applicable to density based approaches [107]. DML not only extracts density based features but also learns a better distance measure. Currently, this method has been shown to be applicable to congestion detection and crowd counting, but training such deep networks remains difficult even with guided DML.

4.2.4 Deep generative model

To learn the temporal dynamics in long hours of video, an end-to-end trainable framework was developed using a convolutional auto-encoder capable of learning local features and classifiers [34]. Its working principle is as follows: the auto-encoder learns the complex distribution of normal patterns in the video, reconstructs motion in normal patterns with low error, and fails to reconstruct motion patterns in anomalous frames; the reconstruction error between the real frame and the reconstructed frame gives the anomaly score. Similarly, Medel and Savakis [60] replaced the weights of a fully connected LSTM with convolutional filters to obtain a Conv-LSTM architecture and used it to predict near-future frames by encoding and reconstructing the video sequence; the predictive capability of the network, used in a regularity evaluation algorithm, detects the anomalies. Xu et al. [114] used three stacked denoising auto-encoders (SDAE) to learn a joint representation of appearance and motion, and three one-class SVMs to calculate the anomaly scores. The use of optical flow maps means this method captures only short-term motion and cannot handle long-term temporal motion to infer useful regular patterns from the video; in addition, the contextual information required for relating consecutive video shots is missing from their approach. Therefore, Feng et al. [25] came up with another SDAE-based approach to handle the problems of short-term cues and contextual anomalies: they used an LSTM to capture long-term motion cues from the video and a graph-based manifold ranking scheme over spatial contextual information to reduce false alarms.
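A minimal PyTorch sketch of this reconstruction-based scheme is given below; it is a generic convolutional auto-encoder over a short stack of grayscale frames, not the specific architecture of [34] or [60], and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Minimal convolutional auto-encoder over stacked grayscale frames
    (T frames treated as input channels); the per-clip reconstruction error
    serves as the anomaly score."""
    def __init__(self, in_channels=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, in_channels, 4, stride=2, padding=1),
            nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def regularity_score(model, clip):
    """clip: (1, T, H, W) tensor in [0, 1], T = in_channels;
    a higher reconstruction error indicates a more anomalous clip."""
    with torch.no_grad():
        recon = model(clip)
    return torch.mean((clip - recon) ** 2).item()
```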

In order to automatically learn feature representations from video in an unsupervised manner, Xu et al. extended their previous work [114] and put forth the Appearance and Motion DeepNet (AMDN) approach based on Stacked Denoising Auto-Encoders (SDAE) [115]. The crux of the approach is a double fusion scheme that jointly represents the appearance and motion characteristics of the video without relying on object-level analysis. AMDN is not suitable for real-time applications due to its heavy computational cost. In addition, the scheme uses shallow networks and small image patches as network input, so there is a risk of overfitting on small-scale data. As the multiple one-class SVMs built over the learned features are not jointly optimized with the anomalous activity discrimination task, the learned features may be suboptimal.

The dearth of anomalous ground-truth data and the ambiguous nature of anomalies hinder the development of end-to-end trainable deep learning models. This issue has been addressed using conditional Generative Adversarial Networks [75]; as claimed by the authors, this is the first work to use GANs for anomaly detection. The main features of this end-to-end trainable model are a cross-channel approach that prevents the generator from learning the identity function and a multi-channel representation that fuses appearance and motion information.

Tran and Hogg [96] used a Convolutional Winner-Take-All (WTA) auto-encoder [58] and a one-class SVM for anomaly detection. The convolutional auto-encoder extracts the motion features, and the OC-SVM builds the normalcy model. As the framework is based purely on motion feature representation, it could be extended with a mechanism for appearance feature modeling and with methods for modeling longer motion patterns.

4.2.5 Hybrid model

Inspired by the success of 3-dimensional CNNs [97], Zhao et al. [121] put forth a hybrid approach for anomaly detection. They used 3D convolutions to model the spatio-temporal features of surveillance video and jointly utilized the reconstruction loss (for reconstructing the input frame) and a weight-decreasing prediction loss (for predicting the future frame) of an auto-encoder for detecting anomalies. As the approach relies on predicting frames, the sudden appearance of objects in the field of view may hinder its performance.

4.3 Comparative study

Tables 1 and 2 show the comparison of selected traditional and deep learning approaches. A comparison of real-time anomaly detection approaches using both traditional and deep learning techniques is given in Table 3; please note that the values in Table 3 are taken from the results reported in the corresponding research papers. It can be observed that a deep learning technique achieves the highest frames-per-second performance on the datasets. Though this has not been verified within a single research article, there is great scope for deep learning to excel in anomaly detection tasks.

Table 1 Deep learning approaches for anomalous activity detection
Table 2 Traditional approaches for anomalous activity detection
Table 3 Comparative study of real time performance of anomaly detection approaches based on traditional and Deep Learning (DL) methods (Values mentioned in the table are directly taken from corresponding references)

5 Application domains and benchmarked datasets

5.1 Discussion of application domains

Though there are numerous domains where anomaly detection can be applied, Figure 7 shows some widely investigated scenarios in which research related to anomaly detection can be carried out. These domains include traffic, transportation, sports, crowd scenarios, health care, natural or man-made calamity detection, industrial domains, wildlife scenarios, etc. Some working use cases of anomaly detection are explained here. Periodic railway inspection to avoid railway mishaps is part and parcel of safer railway transportation: obstacles, missing fastening bolts (by which the rail is fixed to the sleepers), the status of switches and other railway defects can be detected in real time [22], eliminating the need for a human expert to walk along the track to identify visual anomalies. Autonomous driving on urban highways or in mountain regions is very challenging; timely detection of anomalous objects helps to curb the chance of accidents and ensures the safety of people on highways [17]. Timely detection of unattended objects is essential for maintaining security at public places and curbing the chances of terrorism [67]. Anomaly detection is also indirectly related to crowded scene analysis, including congestion detection and crowd counting [107]: crowd counting and timely detection of congestion due to traffic, processions or pilgrimages helps avoid mishaps by enabling proactive measures to control the crowd, ultimately helping to prevent crowd disasters [117]. Another application of anomaly detection is taking immediate action in elderly fall incidents [23].

Figure 7: Application domains of anomaly detection

5.2 Public datasets for indoor and outdoor surveillance

Considering the requirements of real-life scenarios, various datasets for anomaly detection, obtained from indoor and outdoor surveillance, have been put forth to date. Table 4 shows the widely used datasets for anomaly detection. The datasets are compared based on their features, the scenarios covered for anomaly detection, the availability of ground truth (GT) and the resolution of the videos.

Table 4 Benchmarked datasets of anomaly detection

6 Research challenges and future directions

Automated video surveillance for detecting anomalous activity has been a topic of great interest in computer vision and cognitive science for enhancing the security of indoor and outdoor places. The following identified research issues are still open in this domain and need to be addressed for efficient detection of anomalous activities from videos. The list of issues and challenges is by no means exhaustive.

  • Challenges related to indoor and outdoor environments: handling noise in video data due to the sensor, camera jitter and various video decoding artifacts; occlusion of independently moving objects; illumination changes; and intra-class and inter-class variation of objects. Camouflage detection is also a major challenge.

  • Challenges related to the scale at which the normalcy model is defined: this relates to the multi-scale (resolution) nature of the normalcy model and to handling variation of the normalcy model according to the anomaly to be detected.

  • Challenges related to the dearth of labeled anomalous behavior training data: due to the scarcity of labeled anomalous behavior training data, unsupervised learning is the natural option. There is a need for anomalous activity detection that relies on less contextual information and on smaller training datasets.

  • Challenges related to trade-offs among performance metrics: achieving a balanced trade-off between real-time processing and the desired level of accuracy is critical.

  • Challenges related to multi-view anomaly detection: a target of interest may appear normal from one view but exhibit abnormality when observed from another. A substantial amount of work has been done in single-view anomaly detection, but much less in multi-view anomaly detection. It is very challenging to incorporate different levels of anomaly detection using multiple views in a single framework.

  • Challenges related to camera anomaly detection: the cameras used for surveillance are the basic sensors that capture the surveillance videos. Detecting tampering with, or malfunctioning of, surveillance cameras in real time, i.e. camera anomaly detection, has become a research topic in recent years. Techniques for camera sabotage detection and self-assessment of camera status need further research investigation.

  • Anomaly detection from videos obtained from 360-degree cameras: to date, most approaches to anomaly detection have used video data obtained from statically positioned cameras [51], multiple cameras [56], or moving cameras [65]. With the advent of new technology, the 360-degree camera market is anticipated to grow at a CAGR of 34.4% from 2017 to 2024 [1]. This creates a need for detecting anomalies in videos obtained from 360-degree cameras, and there is great scope here since existing research has focused on videos from stationary and moving cameras only.

  • Convergence of frameworks aimed at detecting anomalies from multiple domains: though biometric spoof detection (detection of spoofed faces, irises and fingerprints) is a widely researched topic in its own right [4, 61], it could be integrated with video-based anomaly detection frameworks to improve detection accuracy, especially in indoor environments. In addition, frameworks for camera anomaly detection, used for detecting damage to cameras, could be incorporated into anomaly detection frameworks. This can be achieved by implementing interoperability measures among multiple detection frameworks so that they converge into a generalized anomaly detection platform.

  • Event recounting of anomalous activities: the objective of Multimedia Event Recounting (MER) is to generate a summary of the events occurring in a given video clip [98]. Motivated by the success of MER on TRECVID datasets [28, 39], event recounting has been applied to anomaly detection in [37]. Justifying why an event is anomalous, in addition to detecting it, remains an almost untouched research area apart from the work of [37]. Developing an anomaly detection framework together with a visual concept representation for describing anomalous events is truly challenging, due to the ambiguous nature of anomalies and the limited availability of semantic domain knowledge about them, and offers great scope for further research.

  • Use of deep learning techniques to develop real-time systems for anomaly detection: the current literature still lags behind when it comes to processing videos in real time. A possible solution is to follow online learning methods to understand anomalies in real time and to incorporate online learning into deep models.

7 Conclusion

The proliferation of deep learning is changing the way real-world problems are solved, and anomaly detection is no exception. It is only within the last five years that deep learning has been employed for anomalous activity detection. This paper is an attempt to analyze and summarize deep learning techniques for video based anomaly detection in a nutshell, and it is intended to serve as a sound basis for further investigation of deep learning in the automated visual surveillance domain.

It can be seen that the use of deep learning for anomaly detection has achieved remarkable results on both the accuracy-oriented and the real-time processing oriented objectives. This research domain is very promising, since it will act as a foundation stone for many future computer vision based projects such as elderly fall detection (health care) systems, self-driving cars, robotics and many more, alleviating the need for human personnel to continuously monitor sensitive places.

Considering the deep learning aspect, there is much scope for improving anomaly detection approaches by implementing parallel and distributed deep learning architectures. Motivated by the recent success of AlphaGo using deep reinforcement learning, the use of deep reinforcement learning for online learning of anomalous activities and real-time anomaly detection remains largely unexplored research territory.