1 Introduction

Security surveillance systems are employed to prevent violations in private and public areas. The analysis of overcrowded public environments is one of the challenging problems in machine vision and image processing [1]. Large crowds require many personnel and operators to visually monitor and control abnormal events [2]. Human error is one of the challenges that makes routine crowd surveillance difficult and complex [3]. By monitoring human activities in sensitive crowded situations in real time, we can detect abnormal and unconventional behaviors [4]. Such real-time processing improves security and helps prevent abnormal and unconventional behaviors in crowded public environments. Abnormal behaviors are actions that are unexpected and often assessed negatively because they differ from conventional or expected behavior. When the behavior of one or more individuals appears abnormal, it is called an abnormal action [5,6,7].

Abnormal behavior depends strongly on the norms of the environment under consideration and cannot be precisely defined in general. For instance, moving clockwise around the Kaaba is an abnormal behavior, although the same movement may be perfectly normal in other situations. Congestion is generally considered an abnormal category. Sometimes it is a security challenge, meaning that an abnormal behavior has occurred that falls outside the expected framework.

The evolution of the Web of Things (WoT) is associated with machine learning and Web-based computer vision technologies for building fast hybrid decision-making systems. In addition, the WoT is giving us more control over our living conditions and making everyday tasks easier. The WoT is a collection of standards from the W3C for solving the interoperability problems of various Internet of Things (IoT) application fields and principles [8]. The Thing Description is the essential part of the WoT building blocks: it describes a physical or virtual Thing. Its information model is based on a semantic vocabulary, and its serialization is based on JavaScript Object Notation (JSON).
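To make the JSON serialization of a Thing Description more concrete, the following minimal sketch builds one in Python. The camera, its identifier, the URLs, and the property and event names are hypothetical and chosen only for illustration; only the top-level member names follow the W3C Thing Description vocabulary.

```python
import json

# Hypothetical Thing Description for a surveillance camera. The identifier,
# URLs, and the property/event names are invented for illustration.
thing_description = {
    "@context": "https://www.w3.org/2019/wot/td/v1",
    "id": "urn:example:camera-entrance-01",
    "title": "EntranceCamera",
    "securityDefinitions": {"nosec_sc": {"scheme": "nosec"}},
    "security": ["nosec_sc"],
    "properties": {
        "crowdDensity": {
            "type": "number",
            "readOnly": True,
            "forms": [{"href": "https://camera.example.com/properties/crowdDensity"}],
        }
    },
    "events": {
        "abnormalBehavior": {
            "data": {"type": "string"},
            "forms": [{"href": "https://camera.example.com/events/abnormalBehavior"}],
        }
    },
}

# Serialize the description to JSON, the format in which it is exchanged.
td_json = json.dumps(thing_description, indent=2)
print(td_json)
```

A consumer could parse this document to discover that the hypothetical camera exposes a readable crowd density value and an event for abnormal behavior.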

We require robust machine learning algorithms with capabilities such as automatic behavior detection in real-time conditions [9,10,11]. Deep learning-based methods are among the new automated methods that have recently been the main focus of machine learning researchers [12]. Even in less crowded environments, monitoring abnormal human behavior in real time is necessary. Some events also occur unintentionally as a result of challenges inherent in the crowd itself. In 2010, a tragic event occurred during a music festival in Duisburg, Germany, which led to the death of 20 people and the injury of nearly 500 others [13]. In 1989, 96 people died and 766 were injured due to overcrowding during a football match at Hillsborough Stadium, Sheffield, England [14]. The incident started when people were about to leave the stadium, although it would have been possible to control the situation and prevent the event [14]. Similarly, at least 2431 people lost their lives in congestion in Mecca, Saudi Arabia, in 2015 [15]. At the beginning of 2020 in Kerman, Iran, about 58 people died due to overcrowding during a mourning ceremony [16]. In all these cases, observing the crowd’s behavior could have made it possible to prevent these accidents. Furthermore, it has been observed that terrorist attacks can occur with the sudden entry of a person or a runaway vehicle into the crowd, or with the entry of people carrying suspicious tools, equipment, or even bags. If these incidents had been predicted or observed beforehand, they might have been prevented. When an operator monitors crowded environments, information can suddenly be missed because of lapses in attention and accuracy. The purpose of automated security surveillance systems based on IoT or WoT platforms is to minimize errors and to control crowd behavior in unconventional forms of congestion. Hence, the issues involve crowds of people and abrupt changes in their behavior.

2 Security surveillance and real-time processing

Automated security surveillance approaches are necessary to protect a country’s crucial infrastructure and public environments (e.g., metro stations, airports, city centers, shopping malls, and stadiums) against the threat of criminal activity, civil unrest, cyber-attacks, and terrorism. Thus, increased security surveillance is expected for any occasion that involves high-density crowds [17]. Urban security monitoring and real-time detection of crowd behavior rely heavily on the condition of CCTV cameras. Urban security monitoring involves dynamic, evolving crowd scenes that place greater demands on visual search processes. Some studies have shown that although security surveillance has progressed in various respects, it cannot make full decisions on behalf of the operator [17].

Real-time video processing is one of the most important topics in big data analysis. It is required for uninterrupted security surveillance of various events and messages, and for processing and analysis in the network infrastructure [18]. The huge amounts of data that continuously reach processing pipelines can arrive in any format: structured, unstructured, or semi-structured. The information exchanged for video processing therefore includes messages and events. Processing events such as real-time recognition of abnormal crowd behavior will improve the correlation. This makes pattern recognition possible at the scale of observing a large number of events and transmitting information at microsecond speeds. Nevertheless, real-time detection of abnormal events in practical video processing applications has rarely been considered in state-of-the-art abnormal behavior algorithms. Running these algorithms involves transferring high volumes of information, which requires a powerful hardware platform and careful software design. However, powerful hardware is not always available, so abnormal crowd behavior detection methods, like other video analysis methods, are compared on the basis of the algorithms used.

In offline systems, one can afford to spend time to obtain optimal or near-optimal solutions [19]. In contrast, deployed methods operate in online situations and need efficient solutions. Online processing implies some interaction; however, it does not impose a delay limit. A real-time manner, by contrast, means bounded latency, and we can define online processing as the consecutive logging of transactions for real-time computer methods [20].

In real-time abnormal event detection, statistical techniques and fast algorithms with low computational complexity are often used in video frame processing designs because of their computational flexibility [21]. Even so, the running time and computational complexity of some abnormal event detection methods increase significantly, and they cannot generate real-time outputs. Some methods reduce the size of video frames to lower the computational complexity, which can disrupt pattern recognition and even distort the body shape of people in the crowd, resulting in inadequate tracking or high classification error [22, 23].

Evolutionary and meta-heuristic algorithms cannot work as real-time models in many applications. These algorithms need a lot of time to process information because of factors such as population size, parameter initialization, loops over successive population generations, and other similar challenges. Furthermore, converging to the optimal value requires many calls to the cost function. However, given the heavy processing of video information and population analysis, some studies have stated that optimization algorithms can be expected to be used extensively in the network space in the future, and that the challenge of computational complexity on conventional computers will be solved for applications such as video processing [24].

Other abnormal event detection methods utilize structures based on deep learning, which can provide low-error responses even for low-quality images. Nonetheless, deep learning structures are slow in data processing and require a considerable amount of time, especially during the training phase. Putting all these together, one must consider the trade-offs between methods, on the basis of which some optimal outcomes may be sacrificed [24, 25]. For this reason, the strategies designed to identify abnormal events in crowd behavior have paid little attention to the real-time aspect, and when a method is real-time, it is associated with other challenges such as reduced detection accuracy.

3 Overview and motivation

When humans monitor crowded environments, fatigue or a lack of operator focus means that it is always possible to miss a critical event with unpleasant consequences. In this regard, the purpose of security surveillance systems is to minimize the false alarm rate (FAR).

In recent years, security surveillance automation of these places has attracted many researchers in the field of real-time image and video processing [26,27,28,29,30,31]. The designed system must detect abnormal events to make security surveillance systems more intelligent and automated. In addition, other problems such as noise and pixel occlusion, the interaction of objects and people, the simultaneous existence of several unusual events, computational complexity, and unstructured events can affect security surveillance. These challenges are common in all environments, and abnormal behavior detection algorithms must deal with them.

We also need suitable techniques to address the existing challenges and analyze them properly. Many machine vision methods have been used for crowd behavior analysis. Most of them, like heavy deep learning structures, are computationally demanding and therefore not suitable for real-time processing. Moreover, [32] has shown that optical flow is also effective in the analysis of noisy frames. Other methods, such as speeded up robust features (SURF) [33] and scale-invariant feature transform (SIFT) [34], analyze crowd behavior based on features. Solving all the problems mentioned is a difficult and complicated process. Consequently, an algorithm that is robust against these challenges would be a good solution. Various algorithms have been proposed in this field for detecting behavior as an anomaly, and these abnormal behavior detection methods can themselves be classified. They are broadly categorized into supervised and unsupervised methods.

The present study aims to analyze abnormal behavior detection and classification methods based on algorithms that have been proposed as state-of-the-art real-time or near-real-time approaches in security surveillance applications. Automatic detection is considered in environments where there is a high probability of people walking and commuting. The occurrence of abnormal conditions varies, but the presence of bicycles, cars, skates, throwing objects, and the like, which are faster than pedestrians, can usually be identified as abnormal behavior. Even fleeing, fighting, gathering, and moving out of the ordinary are in some ways considered abnormal crowd situations.

The remainder of this paper describes some related methods. Then, some efficient and similar algorithms are introduced, and finally, their results and interpretation are presented. Above all, the analysis relies on novel methods and algorithms such as deep learning. Section 4 discusses similar algorithms for identifying abnormal behavior in crowd scenes. Section 5 provides a general comparison in terms of functional ability, and Section 6 summarizes the conclusions of the performed research.

4 Crowd anomaly behavior detection

Motion detection can also be defined from a visual point of view as a process that combines modeling algorithms and machine vision [35, 36]. The main purpose of human motion video analysis algorithms is to develop and improve systems for human motion detection and analysis. Comprehensive datasets that contain the main human movement patterns allow the systems proposed in this field every day to operate on consistent and common principles.

This section presents the abnormal crowd detection algorithms applicable for automated security surveillance platforms. We scrutinize the recent algorithms such as tracking, classification based on handcrafted extracted features, classification based on deep learning, and hybrid methods.

4.1 Tracking

One of the traditional methods for analyzing crowd behavior is optical flow, the apparent pattern of movement of objects, surfaces, and edges in a visual frame, which arises from the relative movement between the observer and a scene [37, 38]. Some events call for the study of a specific behavior in specific situations; such a study can be found in [39], where the Kalman filter is used to segment the background and foreground of video frames. Optical flow, in turn, can be used to describe the distribution of the apparent velocities of brightness patterns in an image [40].
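As an illustration of how apparent velocities can be estimated from two frames, the sketch below applies a least-squares (Lucas-Kanade style) estimate to a single patch. It is a minimal example under the brightness constancy assumption and is not the implementation used in any of the cited works; the synthetic pattern and the shift values are invented for the demonstration.

```python
import numpy as np

def lucas_kanade_patch(frame0, frame1):
    """Estimate one (u, v) displacement for a whole patch by least squares."""
    frame0 = frame0.astype(float)
    frame1 = frame1.astype(float)
    # Spatial gradients (axis 0 is y/rows, axis 1 is x/columns).
    grad_y, grad_x = np.gradient(frame0)
    # Temporal gradient between the two frames.
    grad_t = frame1 - frame0
    # Brightness constancy: grad_x * u + grad_y * v + grad_t = 0 at every pixel.
    A = np.stack([grad_x.ravel(), grad_y.ravel()], axis=1)
    b = -grad_t.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Synthetic example: a smooth pattern that moves by (0.3, -0.2) pixels.
ys, xs = np.mgrid[0:64, 0:64].astype(float)

def pattern(x, y):
    return np.sin(0.2 * x) + np.cos(0.15 * y)

frame0 = pattern(xs, ys)
frame1 = pattern(xs - 0.3, ys + 0.2)
u, v = lucas_kanade_patch(frame0, frame1)
print(round(u, 2), round(v, 2))
```

In a crowd scene the same estimate would be computed per patch or per pixel neighborhood, so that the resulting velocity field describes how different parts of the crowd move.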

Figure 1 shows an outline of the real-time classification plan, which employs a Gaussian distribution model, used to detect normal and abnormal crowd behaviors in the UCSD ped2 and UMN databases [41]. This schematic includes four sections: input video, global and local descriptors, abnormal behavior classification, and fusion scheme. In the first section, each frame scene is split into non-overlapping cubes. The second section involves global and local descriptors.

Fig. 1

The structure of real-time method in [34] (left to right): input visual image, global and local views of patches, modeling the information employing Gaussian distributions, and the final decision

The local descriptor uses the structural similarity index measure (SSIM) to calculate the similarity between patches. Two types of local descriptors are computed, based on the space-time neighborhood approach and the inner temporal approach (TIA). For the first local descriptor, the space-time neighborhood of each patch consists of its spatial neighborhood, with the patch itself at the center, and the temporal neighbor that follows the patch. The resulting SSIM values form the first local descriptor [d0, …, d9]. For the TIA, the SSIM value for all frames in the patch is computed as [D0, …, Dt-1]. Finally, the SSIM values from both methods are combined to obtain the local descriptor [d0, …, d9, D0, …, Dt-1].
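The construction of the local descriptor can be sketched as follows. This is a simplified reading of the description above, not the implementation in [41]: it assumes a single-window SSIM, a 3 × 3 spatial neighborhood plus the next cube in time for d0 to d9, and, for the TIA, the SSIM of each frame with the first frame of the cube. The function names and the random test video are hypothetical.

```python
import numpy as np

def ssim(a, b, dynamic_range=255.0):
    """Single-window SSIM between two equally shaped arrays."""
    a = a.astype(float).ravel()
    b = b.astype(float).ravel()
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def local_descriptor(video, t0, y, x, size, depth):
    """Build [d0..d9, D0..D_{depth-1}] for the cube starting at (t0, y, x)."""
    def cube(t, yy, xx):
        return video[t:t + depth, yy:yy + size, xx:xx + size]

    center = cube(t0, y, x)
    # d0..d8: SSIM with the 3x3 spatial neighbourhood (the centre included).
    d = [ssim(center, cube(t0, y + dy, x + dx))
         for dy in (-size, 0, size) for dx in (-size, 0, size)]
    # d9: SSIM with the cube that follows in time.
    d.append(ssim(center, cube(t0 + depth, y, x)))
    # D0..D_{depth-1}: SSIM of each frame with the first frame of the cube
    # (an assumed reading of the inner temporal approach).
    D = [ssim(center[0], center[k]) for k in range(depth)]
    return np.array(d + D)

rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(10, 30, 30))   # stand-in for real frames
descriptor = local_descriptor(video, t0=0, y=10, x=10, size=10, depth=5)
print(descriptor.shape)
```

The sketch only handles interior cubes; border cubes would need padding or a reduced neighborhood.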

In the classification process, two Gaussian classifiers, C1 and C2, estimate abnormal behaviors from two sets of features obtained from the global and local models. The classifiers [42] use a Gaussian distribution to model the regular activities in each patch of the video, and the Mahalanobis distance is applied to identify outliers. Their results show that the real-time technique is comparable to state-of-the-art approaches on the UCSD ped2 and UMN benchmarks while requiring far less time to analyze the frames [41]. They compared their work with Li et al. [43] in terms of run time: the processing times of their work and of Li et al.’s study were about 0.04 s per frame and 1.38 s per frame, respectively. They also measured anomaly detection performance using frame-level, pixel-level, and dual pixel-level evaluations, as shown in Fig. 2.

Fig. 2

Measure of anomaly assessment. The blue and red rectangles represent the output of the method and anomaly ground truth, respectively [41]
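The Gaussian classifiers with Mahalanobis distance described above can be sketched in a few lines. The example is a minimal illustration and not the classifier of [41, 42]: the three-dimensional random features stand in for real patch descriptors, and the threshold value is a hypothetical choice.

```python
import numpy as np

def fit_gaussian(features):
    """Return the mean and inverse covariance of normal-activity features."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    # A small ridge keeps the covariance invertible (an illustrative choice).
    inv_cov = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    return mean, inv_cov

def mahalanobis(x, mean, inv_cov):
    """Distance of x from the Gaussian model of normal activity."""
    diff = x - mean
    return float(np.sqrt(diff @ inv_cov @ diff))

rng = np.random.default_rng(0)
normal_features = rng.normal(size=(500, 3))    # stand-in for real descriptors
mean, inv_cov = fit_gaussian(normal_features)

threshold = 4.0                                # hypothetical decision threshold
typical = np.zeros(3)
unusual = np.full(3, 8.0)
is_abnormal = [mahalanobis(p, mean, inv_cov) > threshold for p in (typical, unusual)]
print(is_abnormal)
```

In practice one such model would be fitted per patch location, and the threshold would be tuned on validation data.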

Yu et al. [44] proposed a method for detecting abnormal behaviors using the Gaussian-Poisson mixture model (GPMM).

Inspired by the Gaussian mixture model, this algorithm generates information about the movement pattern of crowd behavior and the number of events with normal and abnormal behaviors. The expectation-maximization (EM) procedure is used to train the GPMM [45]. A predetermined threshold is then used to detect abnormal behavior: if the value obtained is less than the threshold, the event is considered abnormal. They also used a classification scheme that analyzes abnormal behaviors based on their temporal and spatial frequency. Region 1 in Fig. 3 shows a high temporal frequency and low spatial frequency of behavioral patterns; consequently, it is classified as a global abnormal behavior. In region 2, the behavior pattern shows a high temporal frequency and a high spatial frequency; thus, this behavior is regarded as a local abnormal behavior.

Fig. 3

Classification of abnormal behavior based on the spatiotemporal frequencies of behaviors that show the behavioral pattern [44]
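The train-then-threshold idea behind the GPMM can be illustrated with a plain one-dimensional Gaussian mixture fitted by EM. This sketch deliberately omits the Poisson part of the model in [44] and is only meant to show the E-step, the M-step, and the likelihood threshold; the walking velocities and the threshold are invented values.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def fit_gmm_1d(x, k=2, iters=100):
    """EM for a 1-D Gaussian mixture; returns weights, means, variances."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))   # deterministic initialisation
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        dens = w * gaussian_pdf(x[:, None], mu, var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibilities.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

def likelihood(x, w, mu, var):
    """Mixture density evaluated at each value in x."""
    return (w * gaussian_pdf(np.atleast_1d(x)[:, None], mu, var)).sum(axis=1)

# Normal behaviour: people walking in two opposite directions (invented data).
rng = np.random.default_rng(0)
walking_velocity = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)])
w, mu, var = fit_gmm_1d(walking_velocity)

threshold = 1e-3                      # hypothetical, set on held-out normal data
abnormal = likelihood(np.array([-5.0, 0.0]), w, mu, var) < threshold
print(abnormal)
```

A velocity that matches one of the learned modes receives a high likelihood, whereas a value between the modes falls below the threshold and is flagged.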

Other similar studies, such as Sabokrou et al. [41], Leyva et al. [46], Lu et al. [47], and Marsden et al. [48], used tracking-based methods. The last of these employs a real-time method claimed to have a short response time in the analysis of abnormal crowd behavior. Other optical flow-based tracking methods have been proposed that combine optical flow with the histogram of oriented gradients (HOG) or use it together with techniques such as the Gaussian mixture model (GMM) and EM [49,50,51]. The method proposed by Pennisi et al. [50] is a real-time, online method for detecting abnormal crowd behavior. It combines visual feature extraction and image segmentation and operates without a training phase. Other tracking methods of interest include the hidden Markov model (HMM) [52,53,54,55,56,57], which is used as a dynamic probabilistic method in areas where crowd and security surveillance is possible. Its applications include security surveillance [52, 53], path analysis [54, 55], and action recognition [56, 57]. Other research has used a combination of similar tracking methods, including GM-HMM modeling [58]. This method combines tracking with feature extraction by principal component analysis (PCA) and HOG, and is based on the K-means++ clustering method and tracking with a Gaussian mixture model.

The application of other similar methods that use Gaussian mixture models for tracking and for extracting features of abnormal behavior can be found in [59,60,61]; in some of them, principal component analysis plays a key role in feature extraction and dimensionality reduction. Some research has also used multi-object tracking methods that account for people overlapping in the crowd and for non-static, dynamic movement conditions and trajectories [62]. This method analyzes the trajectories of people and detects several targets; it is known as the extended K-shortest path (E-KSP) method and is a type of minimum-cost flow search that greedily searches for paths.

Another important tracking method that is highly effective in detecting abnormal behaviors of individuals and crowds is the Kanade-Lucas-Tomasi (KLT) method, which extracts trajectory information by clustering motion patterns in the crowd [63, 64]. An application of this method [65] is shown in Fig. 4. The method takes input frames of the crowd, extracts features from accelerated segment test (FAST), and uses an optimized KLT method to detect motion trajectories.

Fig. 4

The performance of the KLT method in detecting abnormal behavior: (a) input frames, (b) the FAST features, (c) feature tracks over time, and (d) spatial proximity using Delaunay triangulation [65]

Other similar studies related to crowd abnormal behavior tracking include the method presented by Biswas and Babu [66] and Luo et al. [67].

4.2 Learning models

Learning-based anomaly detection methods rely on either non-automatic or automatic feature extraction. Non-automatic methods obtain features through conventional feature extraction techniques, whereas automatic methods are based on deep learning models. The following describes some of the methods published in recent years for abnormal behavior detection.

4.2.1 Handcrafted features

A real-time crowd anomaly detection algorithm for security surveillance has been proposed by Wang et al. [9]. They developed a spatiotemporal texture model for feature extraction and established a redundant texture feature space using the wavelet transform. The detection algorithm is fast and robust, and the system has shown improved accuracy and performance compared with similar methods.

The method proposed in [68] uses features based on motion information, rather than detecting actions or events, to detect abnormality. The EM algorithm is used to cluster seven-dimensional sample vectors into a predetermined number of clusters. Events that do not belong to any of these clusters are considered unusual. The method introduced in [69] labels individual movements and categorizes events using a two-state Markov chain model. In [70], a statistical framework for modeling activity and discovering anomalies is provided; it describes a family of unsupervised methods for video anomaly recognition based on handcrafted features and statistical activity analysis of video sequences.

In [71], events are modeled using spatiotemporal cubes, and a decision tree classifier is used to identify the type of event. At the core of most of these techniques, probabilistic modeling is based on location and on the data obtained from tracking. The conventional method in transport monitoring is to track identified moving targets and cluster their trajectories. The resulting clusters are used as normal models to detect and estimate abnormal behavior. In [72], a classifier-based approach to recognizing dynamic events in security surveillance sequences has been presented.

They also suggested handcrafted local feature patterns and an ensemble of randomized trees with a spatiotemporal organization of patterns. Most studies have attempted to use the trajectories of people as features for analyzing the type of behavior. Similar methods help create a suitable context for object tracking and trajectory analysis.

These techniques are used for abnormal security events in the traffic category, where various tracking techniques have been proposed that greatly help achieve the ultimate goal of behavior recognition [73]. One common method in this field is to extract the paths that vehicles normally take and to search for deviations from these paths in the received traffic videos [74,75,76,77,78]. In the evaluation step, the trajectory of the tracked vehicle is compared with the normal or conventional features. A trajectory that deviates too much from all of these features is considered an abnormal path.
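The deviation test described above can be sketched as a nearest-path comparison. This is a minimal illustration, not the method of any cited work: it assumes that all trajectories are resampled to the same number of points and uses the mean point-wise Euclidean distance; the lanes, the test paths, and the threshold are invented.

```python
import numpy as np

def trajectory_distance(a, b):
    """Mean point-wise Euclidean distance between two equally sampled paths."""
    return float(np.linalg.norm(a - b, axis=1).mean())

def is_abnormal_path(trajectory, normal_paths, threshold):
    """Flag a path that is far from every learned normal path."""
    return min(trajectory_distance(trajectory, p) for p in normal_paths) > threshold

t = np.linspace(0.0, 1.0, 20)
lane_a = np.stack([100 * t, np.full_like(t, 10.0)], axis=1)   # straight lane
lane_b = np.stack([100 * t, np.full_like(t, 20.0)], axis=1)   # parallel lane
normal_paths = [lane_a, lane_b]

in_lane = lane_a + np.array([0.0, 1.0])                       # small offset
crossing = np.stack([100 * t, 10.0 + 60.0 * t], axis=1)       # drifts across lanes
flags = [is_abnormal_path(p, normal_paths, threshold=5.0) for p in (in_lane, crossing)]
print(flags)
```

The normal paths would typically be cluster centers learned from many tracked vehicles rather than hand-written lanes.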

Since the advent of video processing, machine vision, and behavioral pattern detection, especially of abnormal movements, methods for analyzing the trajectories of moving objects have played a crucial role. By learning from the trajectories of individuals or moving objects, some of the proposed methods operate in real time [79] and track targets, especially humans [80]. In this case, many people or objects are tracked over some time during the training and learning phases. In the next step, the resulting paths are converted into a set of paths and generally displayed as an overview of people’s activities against the background of the frames. In the detection and testing phase, the trajectories obtained from the video are compared with the trajectories examined in the learning phase. Other similar methods can be found in Zou et al. [81], Chaker et al. [82], and Singh et al. [83], which use classifiers such as SVM and similarity measurement to identify abnormal movements of people. In some other studies, tracklets have been used to optimize the classification process [4, 64]. While many methods for crowd anomaly detection offer offline solutions, few studies have considered real-time analysis of crowd behavior. The reason is the dynamics of the crowd, which sometimes requires cumbersome calculations.

One crowd anomaly is panic. Aldissi et al. [21] analyzed it and proposed a real-time method for distinguishing panic that examines the crowd’s movements with a simple and efficient algorithm. The main idea of their method is to study the interactions between moving edges along the video in the frequency domain.

4.2.2 Deep learned features

Deep learning is a specialized form of machine learning that adds depth to simple neural networks [84, 85]. Because the layers of a deep neural network (see Fig. 5 for details) are fully interconnected, processing high-dimensional inputs is computationally complex. Hence, such networks may encounter problems such as over-fitting if they lack access to an appropriate dataset or processing hardware, and thus fail to produce a model capable of real-time decision-making. In Fig. 5, the n-frame sequence is first convolved with 3D filters. After that, m∗c∗k part filters are applied to convolve the k feature maps for a classification task, such as abnormal event detection, to recognize crowd behavior.

Fig. 5

The conventional 3D-CNN used for classification applications [91]
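The first stage of such a network, convolving an n-frame sequence with 3D filters, can be sketched with NumPy alone. The example below is a simplified illustration, not the network of [91]: it performs a valid cross-correlation without padding or stride, and the two hand-written kernels stand in for learned filters.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3d_valid(clip, kernel):
    """Valid 3-D cross-correlation of a (frames, height, width) clip."""
    windows = sliding_window_view(clip, kernel.shape)
    # Sum over the three kernel axes for every window position.
    return np.einsum("thwijk,ijk->thw", windows, kernel)

clip = np.ones((8, 16, 16))                # n = 8 frames of 16 x 16 pixels
average = np.ones((3, 3, 3)) / 27.0        # spatio-temporal average
difference = np.zeros((3, 3, 3))           # temporal difference at the centre pixel
difference[0, 1, 1], difference[2, 1, 1] = -1.0, 1.0

feature_maps = np.stack([conv3d_valid(clip, k) for k in (average, difference)])
print(feature_maps.shape)
```

Each kernel yields one spatio-temporal feature map; on a static clip the temporal-difference map is zero everywhere, which is why such filters respond to motion.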

Various patterns and configurations of deep convolutional neural networks have been considered in [86]. That study analyzes how to separate ordinary surveillance frames from frames recorded when individuals behave abnormally, for example by bringing suitcases or items suspected of posing a risk of terrorist explosions. The network learns from the differences between consecutive frames. Most of the work concerns crowd surveillance procedures and the automatic analysis of uncontrolled environments, which is treated as target detection or as the analysis of individuals moving in public places.

Creating a surveillance environment and visually interpreting the frames received from the crowd is a challenging and important issue in various areas of computer vision. In [87], a new configuration called Deep-Crowd, inspired by the residual neural network (ResNet), was proposed to extract and separate spatial features. A dataset of nearly 6000 image frames was generated to train and evaluate the proposed system. Across the evaluation criteria, the proposed system achieves an accuracy of 83.11%, which is comparable with other efficient methods.

Kotapalle et al. [88] also used deep convolutional neural networks to detect and recognize the movement of individuals in video frames analyzed for security surveillance. In this method, pre-processing is first applied to the images and video frames to enable high-precision detection.

A new model for detecting abnormal movements within the crowd is presented in [89]. This model improves the modeling of behavioral patterns and adopts new similarity criteria. Experiments on image data show that the detection model provides satisfactory detection performance. Some methods detect abnormal movements of individuals by combining deep learning with patterns such as the spatial-temporal volume (STV) [90].

Hu et al. [92] also used a convolutional neural network, called ConvNet, to extract features from the crowd. The ConvNet structure consists of three convolutional layers with three max-pooling layers and one fully connected layer, as shown in Fig. 6. Each convolutional layer is followed by a rectified linear unit (ReLU) non-linearity and then by max-pooling. Their method aggregates and integrates visual features hierarchically, from local low-level features to global high-level features. A slicing convolutional neural network (S-CNN) has been proposed to classify crowd anomaly events [93]. Some popular strategies for optimizing hyper-parameters are applied in related studies of abnormal behavior detection. Manual tuning, grid search, random search, Bayesian optimization, gradient-based optimization, and evolutionary optimization are the traditional techniques widely used in machine learning methods.

Fig. 6

The ConvNet structure for automated feature extraction from crowd scenes [92]

Moreover, learning rate, batch size, momentum, and weight decay are the hyper-parameters typically tuned in deep learning. Tuning these hyper-parameters is considerably complex, yet when it is done well, it can improve the classification process. The S-CNN method is based on 3D feature mapping to display spatial and temporal sections in 2D format. Similarly, in [94, 95], convolutional neural networks have been employed to detect crowd anomaly behavior.
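Two of the tuning strategies mentioned above, grid search and random search, can be contrasted in a short sketch. The `validation_loss` function is a hypothetical stand-in for training a network and measuring its validation loss; its shape and the candidate values are assumptions made only so that the example runs.

```python
import itertools
import random

def validation_loss(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (learning_rate - 0.01) ** 2 * 1e4 + abs(batch_size - 64) / 64

def grid_search(learning_rates, batch_sizes):
    """Evaluate every combination and keep the best one."""
    return min(itertools.product(learning_rates, batch_sizes),
               key=lambda cfg: validation_loss(*cfg))

def random_search(n_trials, seed=0):
    """Sample configurations at random (log-uniform learning rate)."""
    rng = random.Random(seed)
    trials = [(10 ** rng.uniform(-4, -1), rng.choice([16, 32, 64, 128]))
              for _ in range(n_trials)]
    return min(trials, key=lambda cfg: validation_loss(*cfg))

best_grid = grid_search([0.1, 0.01, 0.001], [16, 32, 64, 128])
best_random = random_search(n_trials=20)
print(best_grid, best_random)
```

Grid search evaluates a fixed set of combinations, whereas random search spends the same budget on a wider range of learning rates, which is often useful when only a few hyper-parameters matter.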

Fan et al. [11] proposed an algorithm for detecting abnormal behavior in a set of video frames that uses a spatiotemporal auto-encoder to extract features and improve learning when few negative samples are available. They designed a spatiotemporal convolutional neural network (sCNN) with a simple structure and low computational complexity. Their experiments were carried out on the UCSD and UMN datasets. The algorithm was able to operate in real time using only a CPU.

4.3 Real-time design

Although tracking models have been employed extensively in computer vision applications, they are not always accurate or robust to noise. One obvious problem arises when objects remain static: if the people in the scene stop in place, they become difficult to track. Although suggestions have been proposed to address this problem, a tracking strategy is, in general, currently the most effective way to estimate people’s locations and follow them. However, when the number of objects in the frame increases, tracking is slightly delayed. Hence, to solve this difficulty, we implemented a scheme that reduces processing time. Because the deep learning network takes its input frames from the tracking model, we can reduce the frames’ dimensions and resolution to some extent. Figure 7 illustrates the flow plan for real-time processing of frames received from crowd activities. Let F be the number of frames containing moving people under high or low congestion. The processing time needed to refine the algorithm’s responses (t_R) is on average constant and in the range of milliseconds. However, the time required by the tracking model (t_k) varies with the number of people present in the crowded scenes. The proposed design can operate as a real-time model for detecting abnormal behavior based on people’s movement when inequality (1) holds.

$$ \left({t}_{k_i}+{t}_{R_i}\right)<{t}_p \tag{1} $$
Fig. 7 The flow plan for real-time processing of received frames from the crowd activities
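As an illustration of inequality (1), the following Python sketch checks whether tracking and refinement together stay within a per-frame time budget. It assumes that t_p denotes the time available per frame, taken here as the inverse of the frame rate, which the text does not state explicitly. The function names and all timing values are hypothetical.

```python
def frame_budget(fps):
    """Per-frame time budget in seconds (the assumed meaning of t_p)."""
    return 1.0 / fps


def is_real_time(t_k, t_r, t_p):
    """Return True when every frame i satisfies (t_k[i] + t_r[i]) < t_p."""
    return all(tk + tr < t_p for tk, tr in zip(t_k, t_r))


# Hypothetical per-frame timings in seconds (not measured values).
t_k = [0.012, 0.018, 0.025]  # tracking time grows with the number of people
t_r = [0.004, 0.004, 0.004]  # refinement time is roughly constant

ok_at_25_fps = is_real_time(t_k, t_r, frame_budget(25))  # budget 0.04 s
ok_at_40_fps = is_real_time(t_k, t_r, frame_budget(40))  # budget 0.025 s
print(ok_at_25_fps, ok_at_40_fps)
```

Under these assumed timings the slowest frame needs 0.029 s, so the check passes at 25 frames per second and fails at 40.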

4.4 Deep transfer learning

In many video processing applications, transfer learning is employed to re-train deep learning models [96], and crowd behavior classification is assessed in terms of the efficiency and accuracy of the model [97]. Deep learning approaches use a layered architecture in which successive layers learn different features, and the final layers feed a fully connected layer that generates the output [98, 99]. In transfer learning, a pre-trained model such as VGG, ResNet, or AlexNet can be used without its final classification layer as a feature extractor, which yields more reliable classification performance with less training time. The original AlexNet includes five convolutional layers, three max-pooling layers, and three fully connected layers [100]. Its input layer requires frames of size 227 × 227 × 3. ReLU is applied after each convolutional and fully connected layer, which adds non-linearity to the network. In addition, cross-channel normalization is used, and a dropout ratio of 0.5 is assumed.
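To make the 227 × 227 × 3 input requirement concrete, the sketch below traces the spatial size of the feature maps through the convolutional part of AlexNet. The kernel, stride, and padding values follow the commonly cited AlexNet configuration and are an assumption here; the surveyed works may use a different variant. The helper names are hypothetical.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding), assumed from the usual AlexNet layout.
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]


def trace_sizes(n=227):
    """Return (layer name, spatial size) after each layer."""
    sizes = []
    for name, kernel, stride, padding in ALEXNET_LAYERS:
        n = out_size(n, kernel, stride, padding)
        sizes.append((name, n))
    return sizes


sizes = trace_sizes(227)
# conv5 has 256 output channels, so the flattened feature vector that
# a transfer-learning classifier would consume has 6 * 6 * 256 entries.
feature_length = sizes[-1][1] ** 2 * 256
print(sizes, feature_length)
```

When the network is used as a feature extractor, a new classifier for the normal and abnormal classes would be trained on top of these features instead of the original classification layer.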

In the deep residual network (ResNet) [101], a more direct path for propagating information through the network enables proper performance with very deep architectures. As the number of layers in a plain network increases, accuracy begins to saturate and then degrades, which is known as the degradation problem. To counter this and to mitigate the vanishing gradient problem during backpropagation, ResNet adds shortcut connections that run parallel to the regular convolutional layers. This network helps to extract global features.
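The effect of a shortcut connection on backpropagation can be illustrated with a toy scalar model. This is a simplified sketch, not the ResNet architecture itself: each "layer" is assumed to be a single multiplication by a small weight, and the function names are hypothetical.

```python
def plain_gradient(weight, depth):
    """d(output)/d(input) for a stack of layers y = weight * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight
    return grad


def residual_gradient(weight, depth):
    """d(output)/d(input) for a stack of blocks y = x + weight * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + weight  # the shortcut contributes the constant 1
    return grad


plain = plain_gradient(0.01, 20)
residual = residual_gradient(0.01, 20)
print(plain, residual)
```

With a weight of 0.01 and 20 layers, the plain stack multiplies the gradient by 0.01 at every layer and it effectively vanishes, whereas the identity term in each residual block keeps the gradient at a usable magnitude.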

GoogLeNet [102] is a deep convolutional neural network design that has obtained good classification results with enhanced computational efficiency in various transfer learning applications [103, 104]. The GoogLeNet, or Inception, design is 22 layers deep and includes two convolution layers, four max-pooling layers, nine linearly stacked inception modules, and one average-pooling layer placed after the last inception module.

In the Visual Geometry Group (VGG) network [105], the depth of the design is extended to 16 and 19 layers (VGG-16 and VGG-19). The number of parameters is kept down by using very small (3 × 3) convolution filters. The VGG design consists of blocks of several consecutive 3 × 3 convolutions, each followed by a 2 × 2 max-pooling layer. Two fully connected layers then follow, with the final layer providing the Softmax output.
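The parameter saving from small filters can be checked with a short calculation. The sketch below compares stacks of 3 × 3 convolutions with a single larger filter that covers the same receptive field. It assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the channel count of 256 is only an example, and the function names are hypothetical.

```python
def conv_weights(kernel, channels, layers=1):
    """Weights in `layers` stacked kernel x kernel convolutions (no biases),
    each with `channels` input and `channels` output channels."""
    return layers * kernel * kernel * channels * channels


def receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


channels = 256  # example value, not taken from the paper

two_3x3 = conv_weights(3, channels, layers=2)    # covers a 5 x 5 field
one_5x5 = conv_weights(5, channels)
three_3x3 = conv_weights(3, channels, layers=3)  # covers a 7 x 7 field
one_7x7 = conv_weights(7, channels)

print(receptive_field(3, 2), two_3x3, one_5x5)
print(receptive_field(3, 3), three_3x3, one_7x7)
```

Under these assumptions, two stacked 3 × 3 layers need 18C² weights against 25C² for one 5 × 5 layer, and three stacked layers need 27C² against 49C² for one 7 × 7 layer, where C is the channel count.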

Sánchez et al. [106] proposed a taxonomic organization of current achievements following a pipeline. They discussed crowd behavior interpretation through deep learning and considered crowd emotions, datasets, anomaly detection, and other perspectives in crowd analysis. In [107], violent scene detection is discussed in terms of action anomaly recognition in crowds based on deep transfer learning.

5 Performance comparison

The models that have been proposed so far for crowd anomaly detection have pros and cons.

Most models do not address the uncertainty of their outputs; they only report the output and summarize the results in terms of accuracy, sensitivity, and specificity. A common shortcoming of the algorithms that analyze conditions in a set of video frames is that they do not examine factors such as computational complexity, noise, pixel occlusion, efficiency, and cost.

The visual data are taken from a library of video frames [17]. In addition to the various crowd scenarios used to implement the different behavioral classes, these frames include abnormal objects that can be a threat to a crowd, for example, a motorcycle that suddenly speeds through a crowd of pedestrians or a suspicious backpack dropped by a person in the crowd. In the proposed database, five different behavioral classes are defined. For each behavioral class, at least two different scenarios are extracted that differ in scenario structure, camera angle, and crowd density in the scene. The defined behavioral classes include (1) panic, (2) fighting, (3) congestion, (4) obstacle or abnormal object, (5) natural, and (6) nothing conditions in the crowd. To classify crowd behavior, we categorized natural states versus abnormal states. In total, about 29,220 video frames related to normal conditions and 14,910 video frames related to abnormal conditions were recorded. Table 1 summarizes the attributes of the video data used. Figure 8 illustrates frames from the Motion Emotion Dataset (MED) [108] that include normal and abnormal crowd situations.

Table 1 Details of MED dataset
Fig. 8 Frames from the MED database that contain crowd anomaly behavior: a natural, b obstacle or abnormal object, c panic, d nothing, e fighting, and f congestion [108]
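The frame counts reported for the normal and abnormal conditions imply an imbalanced binary problem. The following sketch works out the class shares from those counts. The inverse-frequency class weights are an assumed, commonly used remedy and are not described in the source.

```python
normal_frames = 29_220    # frames reported for normal conditions
abnormal_frames = 14_910  # frames reported for abnormal conditions
total_frames = normal_frames + abnormal_frames

abnormal_share = abnormal_frames / total_frames
normal_share = normal_frames / total_frames

# Inverse-frequency weights (an assumption, not taken from the paper):
# each class contributes equally to a weighted loss.
class_weights = {
    "normal": total_frames / (2 * normal_frames),
    "abnormal": total_frames / (2 * abnormal_frames),
}

print(total_frames, round(normal_share, 3), round(abnormal_share, 3))
print(class_weights)
```

Roughly one third of the frames are abnormal, so a classifier that always predicts "normal" would already reach about 66% accuracy, which is worth keeping in mind when reading accuracy figures.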

General tracking-based methods from the machine vision perspective, such as optical flow and GMM, which analyze different scenarios, and several similar systems are compared with learning-based methods that use handcrafted features or automated feature extraction (i.e., deep learning) and with hybrid approaches. In general, we have distinguished normal from abnormal behaviors, so the problem here is a binary classification. The criteria for the general assessment of the categorized methods are inverse computational complexity (P1), inefficiency (P2), time (P3), uncertainty (P4), sensitivity to noise and artifacts (P5), and the loss of generalization (P6) in achieving the solution. Figure 9 shows the relative performance estimates for each group of methods, obtained by computing these general evaluation criteria over two iterations.

Fig. 9 Calculation of the overall performance of algorithms in detecting crowd anomaly behavior. a Tracking methods, b classification handcrafted methods, c classification deep methods, and d hybrid model according to evaluating the calculation of the introduced criteria P1 to P6

To assess the performance of the abnormal behavior detection model in crowds, accuracy, sensitivity, and specificity are evaluated. Furthermore, the outcomes are compared with similar methods in Table 2.

Table 2 Assessment of the performance of the automated and hybrid abnormal behavior detection model

In Table 2, the values of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) are computed each time the algorithm is executed. Besides, the accuracy (Acc), sensitivity (Se), and specificity (Sp) outcomes are shown in Table 2 for simple, adjusted, and hybrid models. We have determined the accuracy of abnormal behavior detection in the crowd for various versions of combined models and the datasets employed for low- and high-congestion volumes in scenes. The following items were considered to analyze the performance of the models and compare them with similar methods:

  1. Cost analysis according to the estimation of total costs resulting from model estimation and error analysis

  2. Performance evaluation according to the average weight estimate of precision and recall indices

  3. Analysis of the time according to the estimated time spent to run the algorithm in one model estimation

  4. Analysis of uncertainty based on calculating the difference and dispersion of mean squares of error in different iterations

  5. Investigation of noise sensitivity based on the calculation of the class related to crowd behavior and in the presence of artifacts such as noise application, climate change, pixel occlusion, and poor quality of received frames

  6. Generalization analysis following the application of unseen frames on the classification and detection methods of individual and crowd behavior
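The quantities named above can be computed directly from the TP, TN, FP, and FN counts. The sketch below shows the standard definitions of accuracy, sensitivity, and specificity, together with precision. Using the F1 score to combine precision and recall, and using the mean and standard deviation of the mean squared error across iterations as a dispersion measure, are assumptions about how items 2 and 4 might be realized. All counts and error values are hypothetical.

```python
from statistics import mean, pstdev


def confusion_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumed combination)."""
    return 2 * precision * recall / (precision + recall)


def mse_dispersion(mse_per_iteration):
    """Mean and population standard deviation of per-iteration MSE."""
    return mean(mse_per_iteration), pstdev(mse_per_iteration)


# Hypothetical counts for one run of a detector.
metrics = confusion_metrics(tp=80, tn=90, fp=10, fn=20)
f1 = f1_score(metrics["precision"], metrics["sensitivity"])
mse_mean, mse_std = mse_dispersion([0.10, 0.12, 0.08])
print(metrics, f1, mse_mean, mse_std)
```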

Similar methods either use only video frames containing crowd anomaly conditions or exclusively analyze general behavior, and they do not analyze the influential features that cause a correlation in the response. Figure 9 also compares the abnormal behavior detection methods (tracking, handcrafted-feature classification, deep classification, and hybrid models) according to criteria P1 to P6. Each criterion is scored from 1 to 5, and the smaller the resulting area, the better the performance of the method. Hybrid methods have yielded better results. Despite providing the desired accuracy, deep learning methods sometimes cannot run in real time for detecting crowd behavior. The reason is that operating directly on images without predefined features requires the adjustment of multiple parameters and entails high computational complexity. Moreover, the need for large volumes of training data, the definition of the input data, the need to define a broad set of parameters for the deep classifier, processing time, ambiguity in feature processing, and the strong dependence on how the different combined layers are defined must all be taken into account.
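The area comparison can be sketched as the area of a polygon on a radar chart with one axis per criterion. The code below assumes that the six criteria are placed on equally spaced axes in the order P1 to P6, which the text does not specify. The scores are hypothetical and do not reproduce the values behind Fig. 9.

```python
import math


def radar_area(scores):
    """Area of the polygon formed by scores on equally spaced radar axes."""
    n = len(scores)
    angle = 2 * math.pi / n
    return 0.5 * math.sin(angle) * sum(
        scores[i] * scores[(i + 1) % n] for i in range(n)
    )


# Hypothetical scores for P1..P6, each between 1 and 5 (lower is better).
example_scores = {
    "tracking": [2, 4, 2, 4, 5, 4],
    "hybrid": [3, 2, 3, 2, 2, 2],
}

areas = {name: radar_area(s) for name, s in example_scores.items()}
best = min(areas, key=areas.get)
print(areas, best)
```

Note that the polygon area depends on the order of the axes, so such a comparison is only meaningful when every method is plotted with the same axis order.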

Previous studies reported crowd anomaly classification accuracies ranging from 68.2% to a maximum of 73% on a limited number of frames, and the high standard deviation that accompanies even the maximum accuracy is another drawback [88, 109]. Nonetheless, the lack of a need to define specific feature descriptors is one of their notable merits.

In methods such as those given in [110,111,112], the outputs are calculated with a high standard deviation. These methods focus only on examining the status of the crowd in a limited number of classes. They use the SVM classifier, which requires many parameters to be defined and cannot create a proper hyperplane unless it is precisely tuned. In contrast, approaches such as [41] employ neural networks on their own, which is sometimes associated with over-fitting. Owing to the dynamic nature of the neural network, the dispersion of responses is high; yet, if the parameters are properly adjusted, it can produce better responses.

Some tracking or classification methods are less accurate, and no solution has been adopted for using them in real time. However, it is not possible to draw a clear line between different methods for analyzing crowd anomaly behavior, because the outcome depends on the type of data and the purpose for which the method is used. Some methods differ from other similar techniques in video processing and have a clear function for specific topics. We have compared the automated extracted features-hybrid technique with similar models in Table 3. Deep transfer learning based on the AlexNet structure, with a Kalman filter as the tracking model, has been used to conduct abnormal behavior detection.
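To indicate what a Kalman filter contributes as a tracking model, the sketch below implements a minimal constant-velocity filter along one image axis. It is an illustrative assumption, not the configuration used in the compared methods: the state is position and velocity, only position is measured, and the noise variances q and r as well as the initial covariance are hypothetical choices.

```python
def kalman_cv_1d(measurements, dt=1.0, q=1e-3, r=1.0):
    """Estimate position and velocity from a sequence of positions.

    q and r are assumed process and measurement noise variances.
    """
    p, v = 0.0, 0.0                   # state estimate: position, velocity
    P = [[100.0, 0.0], [0.0, 100.0]]  # large initial uncertainty
    estimates = []
    for z in measurements:
        # Predict: x' = F x and P' = F P F^T + Q, with F = [[1, dt], [0, 1]].
        p, v = p + dt * v, v
        a, b = P[0]
        c, d = P[1]
        P = [
            [a + dt * (b + c) + dt * dt * d + q, b + dt * d],
            [c + dt * d, d + q],
        ]
        # Update with the position measurement z (H = [1, 0]).
        innovation = z - p
        s = P[0][0] + r
        k0, k1 = P[0][0] / s, P[1][0] / s
        p += k0 * innovation
        v += k1 * innovation
        P = [
            [(1 - k0) * P[0][0], (1 - k0) * P[0][1]],
            [P[1][0] - k1 * P[0][0], P[1][1] - k1 * P[0][1]],
        ]
        estimates.append((p, v))
    return estimates


# A person moving 2 pixels per frame along one axis (synthetic, noise-free).
track = kalman_cv_1d([2.0 * t for t in range(1, 51)])
final_position, final_velocity = track[-1]
print(final_position, final_velocity)
```

In a full pipeline, one such filter per axis (or a joint 2D filter) would be run for each tracked person, and the predicted positions would be passed on with the frames to the classifier.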

Table 3 Comparison of the criteria between the automated extracted features-hybrid model and other similar methods

Some abnormal event data have been collected according to users’ needs, to assess the situation, and in response to the lack of efficient methods [64, 86, 94, 113]. With this approach, an experiment is performed between similar scenarios, in which the MED and UCSD datasets are used to analyze the obtained frames. In crowd behavior analysis, criteria such as computational complexity and frame dimensions are compared between different methods in Fig. 10. We have considered three different frame sizes in this experiment for the two datasets. Each method is divided into three variants. For example, tracking methods are divided into simple, adjusted, and hybrid procedures, and the benchmarks mentioned are then estimated for each.

Fig. 10 Computational complexity evaluation for tracking, classification based on handcrafted extracted features, and classification based on deep learning methods of analyzing abnormal behavior in crowds. Dimensions 1, 2, and 3 are frames with sizes equal to half, equal to, and twice the size of the original frames, respectively

Moreover, Fig. 10 depicts experiments for classification based on handcrafted extracted features and classification based on deep learning methods. This comparison is only a relative estimate and should not be taken as an absolute one. However, most tracking-based models and models based on handcrafted extracted features can be implemented in real time or near real time. The same applies to a small portion of deep learning systems and hybrid methods.
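As a rough guide to how the three frame sizes affect processing cost, the sketch below assumes that the work per frame is proportional to the number of pixels and that the scale factor applies to both width and height. Both assumptions are simplifications that the source does not state, and the 320 × 240 frame size is a hypothetical example.

```python
def relative_pixel_cost(scale):
    """Cost relative to the original frame when both sides are scaled."""
    return scale * scale


def scaled_size(width, height, scale):
    """Frame size after scaling both sides by `scale`."""
    return int(width * scale), int(height * scale)


# Dimensions 1, 2, and 3: half, same, and twice the original size.
scales = {"dimension 1": 0.5, "dimension 2": 1.0, "dimension 3": 2.0}
costs = {name: relative_pixel_cost(s) for name, s in scales.items()}
sizes = {name: scaled_size(320, 240, s) for name, s in scales.items()}
print(costs, sizes)
```

Under these assumptions, halving the frame size cuts the per-frame pixel work to a quarter, while doubling it multiplies the work by four.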

6 Conclusion

Automatic security analysis of crowd behavior becomes possible in cases of crowd anomalies or unusual congestion. Automated detection of abnormal behavior in the crowd is of high importance because terrorist activities, fights, and unusual or suspicious movements otherwise all require operators and vigilant personnel to take part in security surveillance. However, such manual surveillance is a considerable challenge and leads to high cost and low accuracy in decision making. Therefore, designing a fatigue-free and error-free system that simultaneously offers real-time capability on a WoT platform will benefit the control of crowd behavior. In this paper, different crowd anomaly detection methods are studied. Various aspects such as individual tracking, classification based on handcrafted extracted features, classification based on deep learning, and hybrid models are examined. It is found that deep learning methods and hybrid models have more satisfactory performance characteristics and can identify and predict crowd anomaly behavior. Nevertheless, computational complexity has been considered in only a few crowd behavior analysis methods, even though reducing the processing time reduces the reaction time to abnormal behavior. In future studies, the authors intend to implement inferred patterns based on hybrid models and WoT platforms and to reduce computational time and complexity while improving accuracy.