1 Introduction

With the massive growth of the world population, efficient crowd management has become essential for public security and safety and for preventing crowd disasters. Analysing and understanding the crowd is the first step towards effective crowd management. Crowd analysis tasks include, but are not limited to, crowd count and density estimation [1], crowd behaviour analysis such as abnormality detection [2], crowd type detection [3], group activity detection [4], and crowd video understanding [5]. Among these, crowd analysis based on crowd count and density estimation has attracted many researchers in recent years. It forms the backbone of related tasks such as abnormal crowd behaviour detection (panic, gathering, running, fighting, over-crowding, and so forth) [6, 7], crowd congestion-level analysis [8], dominant crowd motion direction detection, and many more. Fig. 1 shows a basic workflow for different crowd analysis tasks built on crowd count and density estimation. The focus of the proposed work is to provide an efficient model for crowd congestion-level analysis (CCA). CCA quantifies the degree of congestion in a crowd scene, which helps in crowd disaster management. CCA can be implemented at the global level (frame-level) or the local level (patch-level).

Fig. 1
figure 1

Crowd analysis using crowd count and density estimation

In global-level CCA, the crowd scenes are annotated with several congestion levels/classes using crowd density or crowd flow information, followed by feature extraction and classification. In local-level CCA, the crowd scene is first divided into blocks/patches, these patches are annotated with different congestion classes, and feature extraction and classification are performed at the patch level only. The division into congestion classes is mainly based either on the level-of-service information provided by Polus et al. [9] or on manually defined classes derived from density and crowd flow information [10]. For example, the congestion classes could be free-flow, restricted-flow, jammed-flow and dense-flow [9], or very low (VL), low (L), medium (M), high (H) and very high (VH) [10]. Both conventional machine learning and deep-learning approaches have been proposed to solve this objective. The existing conventional approaches extract spatial (shape, spectral, texture) [11,12,13,14,15,16,17,18,19] or spatial–temporal texture features [20, 21] and treat the objective as a multi-class classification problem. However, these methods fail to extract fine-grained features, which results in a high misclassification rate, and they are computationally very expensive. The existing deep-learning frameworks use CNN architectures [9, 22] to extract spatial features only. The proposed work is based on two intuitions: (a) extracting only spatial features cannot improve accuracy, because the crowd scene is affected by cluttered backgrounds, lighting changes, varying crowd densities and perspective change, and spatial features provide no information about crowd motion; therefore, temporal (motion) features should also be extracted and fused with the spatial ones. And (b) a single-column CNN [9, 22] for CCA cannot capture features invariant to perspective or scene change, whereas a multi-column CNN [23] with different kernel sizes can extract such invariant features. Based on these two intuitions, we propose a two-input stream multi-column multi-stage CNN (TIS-MCMS-CNN) to solve CCA in real time. The two streams extract spatial and temporal features from the crowd scene. These features are fused using a fusion layer, which is followed by two dense layers and a classification layer. We perform end-to-end training, and experiments are carried out on publicly available datasets, namely PETS-2009 [24], UCSD (Ped1 and Ped2) [25] and UMN (Plaza1 and Plaza2) [26]. We divided the datasets into five density classes, namely Very Low (VL), Low (L), Medium (M), High (H) and Very High (VH), following the work of Fu et al. [10]. Each density class represents particular congestion information. The experimental results show the robustness of the model, which outperforms the existing state-of-the-art techniques in terms of accuracy. The proposed model processes nearly 30 test frames per second, showing that it is suitable for real-time applications. The idea for TIS-MCMS-CNN is adapted from the single-input-stream multi-column convolutional neural network (MCNN) [23], which was originally proposed for crowd counting via crowd density map estimation.

The remainder of the paper is organized as follows: Sect. 2 presents a brief literature review of state-of-the-art techniques for CCA, Sect. 3 describes the proposed work, Sect. 4 details the experiments and results, and Sect. 5 presents the conclusion and future work.

2 Literature review

In the literature, crowd density classification and crowd congestion-level analysis (CCA) are used interchangeably. Based on the feature extraction strategy, traditional methods for crowd density classification are mainly divided into two categories, namely spatial approaches and spatial–temporal approaches. Table 1 gives a brief review of state-of-the-art techniques for CCA. In general, crowd density features such as shape, texture, edge, moments, spectral (Fourier) and wavelet features vary significantly between different congestion levels, which has motivated researchers to extract discriminative spatial features to represent crowd congestion levels for classification. Marana et al. [11] observed that the texture of a crowd scene changes significantly as the crowd density increases from very sparse to very dense: a sparse crowd exhibits a coarse texture, whereas a dense crowd exhibits a fine texture. Based on this observation, they extracted texture features using the Grey Level Dependency Matrix (GLDM) for five crowd density levels and classified them with a Self-Organizing Map (SOM) learning algorithm, but obtained poor performance. Marana et al. [12] later proposed a technique in which the Minkowski Fractal Dimension (MFD) was used to characterize crowd densities, achieving 75% correct classification with a SOM. Rahmalan et al. [13] extracted texture, shape and moment features using GLDM, MFD and Translational Invariant Orthonormal Chebyshev Moments (TIOCM), respectively, for different crowd density classes and classified them using a SOM; however, the performance degraded in the presence of cluttered background, shadow and noise. Marana et al. [14] proposed a real-time crowd density classification method based on a texture histogram and a low-pass filter for classification correction in a distributed environment, but obtained only 73.89% accuracy. Su et al. [15] adopted the Maximally Stable Extremal Region (MSER) [27] detector to extract crowd regions, applied a 3D-to-2D projection followed by shadow removal, extracted histogram statistical features, and trained a Support Vector Machine (SVM) to classify four crowd congestion levels, achieving only moderate performance. Ma et al. [16] proposed a patch-level crowd density classification and global-level crowd density estimation technique based on local texture features called the Gradient Orientation Co-occurrence Matrix (GOCM); a bag of visual words was created by normalizing the GOCM descriptors followed by k-means clustering, and the method is capable of handling scene change and background noise. Wang et al. [17] proposed a local density classification approach using texture features based on the Local Binary Pattern Co-occurrence Matrix (LBPCM) computed on the grey image and the image gradient. Their model, trained with an SVM, achieves 94.25% accuracy but requires a long per-frame processing time and is therefore unsuitable for real-time application.

Table 1 A brief comparison of state-of-the-art techniques for the CCA

Kim et al. [20] extracted the normalized moving area and normalized contrast to classify crowd density. The main idea was to find feature descriptors for both the moving crowd and the static crowd. The moving area is obtained by applying Combined Local–Global (CLG) optical flow and an accumulated magnitude map, and a threshold value is used to obtain the normalized moving area. The GLCM is used to capture contrast information. Although the idea is novel, it suffers from a high misclassification rate on datasets with varying scene and lighting changes. Yang et al. [21] extracted spatial–temporal features from moving crowd scenes using a sparse spatial–temporal local binary pattern (SST-LBP) descriptor followed by spectral analysis; an SVM was then used to classify four crowd density levels. Fradi et al. [18] proposed a patch-level crowd density classification technique that extracts LBP features followed by discriminant subspace analysis using Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA); the feature descriptors were fed to a modified multi-class SVM with an RBF kernel. Lamba et al. [28] proposed a technique to extract rotation-invariant spatial–temporal LBP features for the moving crowd: key interest points were first detected using a Hessian detector [29] over a volume of frames, RIST-LBP features were then extracted from the volume of spatial–temporal key points, and an SVM was employed to classify four crowd density levels. The authors did not consider descriptors for the static crowd in the scene. Recently, Alanazi et al. [19] proposed a local crowd density classification technique in which CLBP features describe four density levels and a multi-class SVM performs the classification; although the method shows promising results, it is not applicable in real time since processing a single frame takes 7 s.

A few methods based on deep learning have been proposed for CCA. Fu et al. [10] proposed three deep CNN models, namely a modified multi-stage CNN (MS-CNN), an optimized CNN and a cascaded CNN. The main focus was to extract in-depth spatial features for CCA and to increase the processing speed by deleting weights of neurons with similar receptive fields. However, the proposed models have some limitations: (a) hard samples must be identified manually from the dataset for the cascaded CNN, which is practically impossible in real-world applications, and (b) extracting only spatial features cannot improve performance, because the crowd scene contains both spatial (static crowd) and temporal (moving crowd) information. Pu et al. [22] used transfer learning to solve the objective: GoogleNet [30] and VGGNet-16 [31] were adopted to classify crowd densities into three and five classes. Still, the performance needs to be improved.

In contrast to CCA, some constructive work has been done in the field of social image understanding using deep learning. For example, Li et al. [32] proposed a new distance metric, the weakly supervised deep metric learning (WDML) algorithm, which discovers the relationship between social image visual content and user-provided tags. Li et al. [33] further proposed a deep collaborative embedding (DCE) model to uncover a hidden space for images and their associated tags. The present discussion is limited to CCA, but it can be extended to image understanding tasks such as content-based and tag-based image retrieval in our future work.

From the above literature review on CCA, we can summarize the following:

  • Most of the traditional methods extract either spatial or spatial–temporal texture features from the crowd scene but fail to provide good accuracy in real time.

  • Spatial features alone cannot increase the accuracy of CCA; crowd motion information must also be used.

  • The deep models provide better accuracy as compared to traditional methods.

  • The existing deep approaches extract only in-depth spatial features to solve the objective in real time and fail to extract features invariant to perspective or scene change.

These observations motivate us to develop a deep model that extracts and fuses invariant deep spatial as well as temporal features for CCA. Based on this intuition, we propose a two-input stream multi-column multi-stage CNN (TIS-MCMS-CNN) followed by a fusion layer to perform global-level crowd density classification. The frames are annotated with one of five density classes, namely Very Low (VL), Low (L), Medium (M), High (H) and Very High (VH); we follow the work of Fu et al. [10] to assign these density labels to frames. The inputs to the two streams are the original frame and the flow magnitude map, respectively.

The main contributions are as follows:

  1. (a)

    Designing a spatial–temporal deep model for CCA. To the best of our knowledge, the proposed model is the first for CCA that extracts deep spatial–temporal features.

  2. (b)

    A carefully designed TIS-MCMS-CNN architecture, which processes frames at a rate of around 30 frames per second.

  3. (c)

    Extensive experiments were performed to show the robustness of the proposed model.

Section 3 discusses the proposed work in detail.

3 Proposed work

To develop an efficient model for CCA, the model should extract invariant features representing not only global scene (spatial) information but also moving crowd (motion/temporal) information, because a crowd scene contains cluttered backgrounds, heavy occlusion, dynamic texture, lighting changes, dynamic crowd shapes across frames, and both static and moving crowds. Extracting only spatial features therefore does not capture enough meaningful information to improve accuracy; crowd motion features should also be extracted, and fusing the spatial and temporal (spatial–temporal) features should improve performance. Conventional machine learning is not capable of extracting such discriminant features (as concluded from the literature review), so deep learning is the natural choice. However, existing deep CNN models do not provide good accuracy in the presence of perspective change, cluttered background and illumination change. Zhang et al. [23] argued and showed that a multi-column multi-stage CNN with different receptive fields (kernel sizes) can extract features that are adaptive and invariant to these challenges. Based on this intuition, we propose a two-input stream multi-column multi-stage convolutional neural network (TIS-MCMS-CNN), which not only extracts deep spatial and temporal features but also fuses them to solve the objective function. Figs. 2 and 3 show the overall architecture and the detailed architecture of TIS-MCMS-CNN, respectively.

Fig. 2
figure 2

Overall architecture of the proposed model

Fig. 3
figure 3

Detail architecture of the model TIS-MCMS-CNN

The following subsections explain details of the proposed model and its working principle:

  • Network architecture.

  • Pre-processing and motion magnitude map extraction.

  • Problem formulation and learning algorithm.

  • Precaution to handle overfitting.

3.1 Network Architecture

As shown in Fig. 3, the proposed architecture contains three main modules:

  • Two streams of multi-column multi-stage CNN.

  • A fusion layer.

  • A multi-layer perceptron (MLP) module.

The two streams (see Fig. 3) are named stream-1 (the spatial stream) and stream-2 (the motion stream). Each stream contains three columns of convolution layers, and each column contains three stages of convolution layers with different receptive fields (kernel sizes). The details of the layers are given in Table 2. The activation function for each convolution (Conv) layer is the rectified linear unit (ReLU), which is followed by a max pooling (MP) layer. We use ReLU for the Conv layers because it performs better than the logistic and tanh activation functions and mitigates the vanishing gradient problem. The ReLU for the \(k\)th neuron at level \(l\) is given by Eq. 1:

$$ \text{ReLU}\left( \left( z_{\text{in}_{k}} \right)_{l} \right) = \max \left\{ 0, \left( z_{\text{in}_{k}} \right)_{l} \right\}, $$
(1)

where \(\left( z_{\text{in}_{k}} \right)_{l}\) is the weighted sum of the information transmitted from the neurons of level \(l-1\) to the \(k\)th neuron of level \(l\).

Table 2 TIS-MCMS-CNN layers information

A fusion layer follows the two streams. The feature maps obtained from the third stage of each column are flattened and concatenated in the fusion layer.

Next, an MLP module follows the fusion layer. The MLP module contains three dense layers, namely fully connected layer-1 (FC-1), fully connected layer-2 (FC-2) and an output layer. The activation function for FC-1 and FC-2 is ReLU. The output layer contains five neurons, each responsible for one of the five responses VL, L, M, H and VH. The activation of the output layer is SoftMax, which gives a probability distribution over the target classes. Eq. 2 shows the SoftMax activation for the \(p\)th neuron of the seventh (output) layer of the proposed model.

$$ y_{7_{p\text{out}}} = \text{SoftMax}\left( \left( z_{\text{in}_{p}} \right)_{7} \right) = \frac{e^{\left( z_{\text{in}_{p}} \right)_{7}}}{\sum_{q = 1}^{5} e^{\left( z_{\text{in}_{q}} \right)_{7}}}, \quad \text{for}\; p = 1,2,3,4,5, $$
(2)

where \(\left( z_{\text{in}_{p}} \right)_{7}\) is the weighted information transmitted from the sixth layer to the \(p\)th neuron of the output (seventh) layer.
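To make the architecture concrete, the following Keras sketch builds one possible instance of TIS-MCMS-CNN. The exact filter counts, kernel sizes and dense-layer widths belong to Table 2 and are not reproduced here, so the values used below (9 × 9, 7 × 7 and 5 × 5 kernels, 16/32/16 filters per column, 128/64 dense units) are illustrative assumptions rather than the published configuration.

```python
# Hypothetical sketch of TIS-MCMS-CNN in Keras; kernel sizes, filter counts
# and dense widths are assumptions, not the values from Table 2 of the paper.
from tensorflow.keras import layers, models

def column(x, kernel_size):
    # One column: three stages of Conv + ReLU + MaxPooling (Eqs. 6-7).
    for filters in (16, 32, 16):
        x = layers.Conv2D(filters, kernel_size, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2)(x)
    return layers.Flatten()(x)

def stream(inp):
    # One stream: three columns with different receptive fields.
    return [column(inp, k) for k in (9, 7, 5)]

gray_in = layers.Input(shape=(42, 40, 1))   # resized grayscale frame (42 x 40)
mag_in  = layers.Input(shape=(42, 40, 1))   # motion magnitude map

fused = layers.Concatenate()(stream(gray_in) + stream(mag_in))  # fusion layer (Eq. 8)
x = layers.Dense(128, activation="relu")(fused)   # FC-1 (width assumed)
x = layers.Dense(64, activation="relu")(x)        # FC-2 (width assumed)
out = layers.Dense(5, activation="softmax")(x)    # VL, L, M, H, VH (Eq. 2)

model = models.Model(inputs=[gray_in, mag_in], outputs=out)
```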

3.2 Pre-processing and motion magnitude map extraction

In the pre-processing stage, we convert each colour frame into grayscale and resize it to 42 × 40. Let the RGB video frames and their resized grayscale images be denoted by the sets \(\it {\text{VF}} = \left\{ {vf_{1} ,vf_{2} , \ldots ,vf_{T} } \right\}\) and \(\it {\text{GF}} = \left\{ {gf_{1} ,gf_{2} , \ldots ,gf_{T} } \right\}\), respectively, where T is the total number of frames. Eq. 3 is used to convert the colour frames into grayscale images.

$$ gf_{i} =\, 0.299 \times R\left( {vf_{i} } \right) + 0.587 \times G\left( {vf_{i} } \right) + 0.114 \times B\left( {vf_{i} } \right), \forall i = 1,2, \ldots ,T, $$
(3)

where \(R(), G(), \;{\text{and }} B()\) are the red, green and blue channels of the colour frame, respectively. The motivation behind this pre-processing is to minimize the total memory occupancy of the model; the idea is adopted from Fu et al. [10]. The motion magnitude map of the resized frames is then obtained by applying Lucas–Kanade [34] optical flow. The Lucas–Kanade method does not give dense-flow information, but it provides noise-free motion information. It finds the optical flow between two consecutive frames by solving the following constrained equation:

$$ gf_{x} \times u + gf_{y} \times v + gf_{d} = 0, $$
(4)

where \(\it gf_{x}\) and \(gf_{y}\) are the spatial derivatives of the \(\it d{th}\) frame, \(\it gf_{d}\) is its temporal derivative, and u, v are the horizontal and vertical optical flow components of that frame, respectively. The flow magnitude is obtained from Eq. 5:

$$ {\text{Mag}}\left( {x, y, d} \right) = \sqrt {u\left( {x,y} \right)^{2} + v\left( {x,y} \right)^{2} } $$
(5)

where \(\it {\text{Mag}}\left( {x, y, d} \right)\) is the motion magnitude map for the \(d{th}\) frame. A detailed description and the solution of Eqs. 4 and 5 can be found in [34]. Let the motion magnitude maps for the set GF be denoted by the set \(\it {\text{MF}} = \left\{ {mf_{1} ,mf_{2} , \ldots ,mf_{T} } \right\}.\) The resized grayscale frame set GF and the motion magnitude set MF are fed to the first stream (the spatial stream) and the second stream (the motion stream) of TIS-MCMS-CNN, respectively. The problem formulation and learning algorithm are discussed in the following subsection.
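A minimal pre-processing sketch with OpenCV and NumPy is given below. The paper uses Lucas–Kanade flow; since OpenCV's `cv2.calcOpticalFlowPyrLK` is a sparse tracker, the sketch evaluates it on a full pixel grid to approximate a per-pixel magnitude map. This grid trick, the input file name and the size convention are assumptions for illustration, not details from the paper.

```python
# Hedged sketch: grayscale conversion, resizing and a Lucas-Kanade-based
# motion magnitude map. The dense-grid use of calcOpticalFlowPyrLK is an
# approximation chosen for illustration; the paper does not specify it.
import cv2
import numpy as np

def preprocess(frame, size=(40, 42)):               # cv2.resize takes (width, height)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)  # applies the Eq. 3 weights internally
    return cv2.resize(gray, size)

def motion_magnitude(prev_gray, curr_gray):
    h, w = prev_gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs, ys], axis=-1).reshape(-1, 1, 2).astype(np.float32)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    flow = (nxt - pts).reshape(h, w, 2)
    mag = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)   # Eq. 5
    mag[status.reshape(h, w) == 0] = 0.0                   # drop untracked pixels
    return mag

cap = cv2.VideoCapture("crowd_video.avi")           # hypothetical input file
ok, frame = cap.read()
prev = preprocess(frame)
gray_frames, mag_maps = [], []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    curr = preprocess(frame)
    gray_frames.append(curr)
    mag_maps.append(motion_magnitude(prev, curr))
    prev = curr
```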

3.3 Problem formulation and learning algorithm

We train the network using backpropagation [35] with the Adam optimizer [36]. Algorithm 1 shows the steps followed during training. Training the network broadly requires two phases: forward propagation and backward propagation.

The network is trained until early stopping is triggered or \({\text{itr}} = { \hbox{max} }\_{\text{iteration}}\) is reached. Early stopping is a measure used to minimize overfitting and is available in Keras as a built-in callback. In early stopping, we set a patience parameter p; during training, the model checks whether the validation (or training) loss has stopped decreasing for p consecutive epochs. If the loss is still decreasing, training continues; otherwise, it stops.
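As a concrete illustration, early stopping could be configured in Keras as follows; the monitored quantity and the patience value shown here are assumptions for the sketch, not settings reported in the paper.

```python
# Hedged sketch of the early-stopping callback described above; the monitored
# quantity and the patience value p are assumed, not reported settings.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss",          # watch the validation loss
                           patience=10,                 # the patience parameter p (assumed)
                           restore_best_weights=True)   # keep the best weights seen so far
```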

Let \(\it {\text{GF}}\) and \(\it {\text{MF}}\) be divided into N batches. Let the tuple \(\it S_{{t_{x} }} = \left\langle {{\text{GF}}_{{t_{x} }} ,{\text{MF}}_{{t_{x} }} } \right\rangle\) represent the pre-processed grayscale frames and the corresponding motion magnitude maps of the \(t{th}\) batch. For any value of t (ranging from 1 to N), x ranges from \(\it 1 {\text{ to Batch}}\_{\text{Size}}\), where \(\it {\text{Batch}}\_{\text{Size}}\) is the size of the \(t{th}\) batch. We feed the resized grayscale frames \(\it {\text{GF}}_{{t_{x} }}\) to the first stream and the motion magnitude maps \(\it {\text{MF}}_{{t_{x} }}\) to the second stream.

During forward propagation, for each sample in \(S_{{t_{x} }}\), the feature maps of the two-stream CNNs, the activations of the hidden layers of the MLP and the predicted outputs of the final layer are calculated. The forward propagation through the network is described below.

The convolution layers of the two-stream CNNs convolve the input with filters and generate feature maps. For the proposed network, the convolution operation of the \(\it i{th}\) layer of the \(j{th}\) column of the \(k{th}\) stream is denoted as

$$ \left[ {f_{i,k}^{j} } \right]^{{S_{{t_{x} }} }} = \left[ {{\text{CONV}}\left( {\theta_{{C_{i,k} }}^{j} ,\left[ {fm_{i - 1,k}^{j} } \right]^{T} } \right)} \right]^{{S_{{t_{x} }} }} , $$
(6)

where \(\theta_{C}\) represents the parameters of the two-stream CNNs and \(\theta_{{C_{i,k} }}^{j} = \left[ {W_{i1,k}^{j} ,W_{i2,k}^{j} , \ldots , W_{iM,k}^{j} } \right]\) for i = 1,2,3, j = 1,2,3 and k = 1,2. Each \(W_{im,k}^{j}\) is the weight matrix of the \(m{th}\) convolution kernel of the \(i{th}\) layer of the \(j{th}\) column of the \(k{th}\) stream, with m = 1,2,…,M such kernels per layer; note that the value of M differs between layers. \(f_{i,k}^{j}\) is the pre-activated feature map, and \(fm_{i - 1,k}^{j}\) is the activated and max-pooled feature map obtained from the \(\it (i - 1){th}\) stage of the \(j{th}\) column of the \(k{th}\) stream. Note that \(fm_{0,1}^{0}\) and \(fm_{0,2}^{0}\) represent the inputs to the first and second streams, which are \(\it {\text{GF}}_{{t_{x} }}\) and \(\it {\text{MF}}_{{t_{x} }}\), respectively. Each stage consists of a convolution layer followed by ReLU and max pooling, so the feature map of each stage can be calculated as

$$ \left[ {fm_{i,k}^{j} } \right]^{{S_{{t_{x} }} }} = \left[ {{\text{MP}}\left( {{\text{ReLU}}\left( {f_{i,k}^{j} } \right)} \right)} \right]^{{S_{{t_{x} }} }} . $$
(7)

The feature maps obtained after the third stage of all three columns of two streams are concatenated using a fusion layer, and it can be denoted as

$$ \left[ {\text{Fuse}} \right]^{{ S_{{t_{x} }} }} = \left[ {{\text{CONCATE}}\left( {fm_{3,k}^{j} } \right)} \right]^{{ S_{{t_{x} }} }} . $$
(8)

It should be noted that the fusion layer only concatenates the flattened feature maps of its previous layer; hence, no weights are updated in it during backpropagation. The next stage of forward propagation is to compute the activations of the neurons of the MLP. The pre-activation for the fully connected layers (layers 5–7) can be calculated as

$$ \left[ {y_{i} } \right]^{{S_{{t_{x} }} }} = \left[ {\theta_{{fc_{i} }} \times \left[ {y_{{\left( {i - 1} \right)_{out} }} ,1} \right]^{T} } \right]^{{S_{{t_{x} }} }} , $$
(9)

where \(\theta_{{fc_{i} }} = \left[ {\omega_{i}} \right]\) and \(\omega_{i}\) is the weight matrix connecting the neurons of the \(\left( {i - 1} \right){th}\) layer to the \(i{th}\) layer. \(y_{i}\) and \(y_{{i_{\text{out}} }}\) represent the pre-activation and the activated output of the \(i{th}\) layer, respectively. Note that \(\it y_{4} = y_{{4_{\text{out}} }} = {\text{Fuse}}.\) The activation of the first two fully connected layers of the MLP is ReLU and can be calculated as

$$ \left[ {y_{{i_{\text{out}} }} } \right]^{{S_{{t_{x} }} }} = \left[ {{\text{ReLU}}\left( {y_{i} } \right)} \right]^{{S_{{t_{x} }} }} , \;{\text{for}}\; i = 5,6 . $$
(10)

The last layer of the MLP is the classification layer, which contains five neurons. The SoftMax activation is used in this layer, and it can be represented as

$$ \left[ {y_{{i_{\text{out}} }} } \right]^{{S_{{t_{x} }} }} = \left[ {\mathop {\bigcup }\limits_{p = 1}^{5} \left[ {{\text{SoftMax}}\left( {y_{{i_{{p_{\text{out}} }} }} } \right)} \right]} \right]^{{S_{{t_{x} }} }} , \;{\text{for}}\; i = 7. $$
(11)

Let \(\emptyset_{\text{Net}} = \left[ {\theta_{\text{C}} ,\theta_{\text{fc}} } \right]\) represent all the parameters of the network. Finally, the loss is computed using the cross-entropy between the true distribution \(T_{\text{p}}\) and the predicted distribution \(y_{{i_{\text{out}} }}\), which measures the difference between the two distributions. The loss \(\it L\left( {T_{\text{p}} , y_{{i_{\text{out}} }} } \right)\) is calculated using Eq. 12:

$$ \left[ {L\left( {\emptyset_{\text{Net}} } \right)} \right]^{{S_{{t_{x} }} }} = \left[ {L\left( {T_{p} , y_{{i_{\text{out}} }} } \right)} \right]^{{S_{{t_{x} }} }} = \left[ { - \mathop \sum \limits_{p = 0}^{4} T_{p} \log y_{{7_{{p_{\text{out}} }} }} } \right]^{{S_{{t_{x} }} }} , $$
(12)

where \(\left. {T_{p} } \right|_{p = 0,1,2,3,4}\) is the true distribution over the five density classes, namely VL (0), L (1), M (2), H (3) and VH (4), and \(y_{{7_{\text{out}} }}\) is the predicted distribution. The loss is minimized when the true and predicted distributions coincide, so we seek a predicted distribution that minimizes \(- \mathop \sum \nolimits_{p} T_{p} \log y_{{7_{{p_{\text{out}} }} }}\). The proposed work can therefore be formulated as an optimization problem that minimizes the loss between the true distribution and the predicted distribution:

$$ \left[ {\mathop {\arg \hbox{min} }\limits_{{\emptyset_{\text{Net}} }} - \mathop \sum \limits_{p = 0}^{4} T_{p} \log y_{{7_{{p_{\text{out}} }} }} |\mathop \sum \limits_{p = 1}^{5} y_{{i_{{p_{\text{out}} }} }} = 1} \right]^{{S_{{t_{x} }} }} $$
(13)

Using a Lagrangian multiplier for the constraint, the objective function can be rewritten in the following form:

$$ \left[ {\mathop {\arg \hbox{min} }\limits_{{\emptyset_{\text{Net}} }} \mathop \sum \limits_{p = 0}^{4} - T_{p} \log y_{{7_{{p_{\text{out}} }} }} + \lambda \mathop \sum \limits_{p = 1}^{5} y_{{i_{{p_{\text{out}} }} }} - 1} \right]^{{S_{{t_{x} }} }} , $$
(14)

where \(\lambda\) is the Lagrangian multiplier. If \(\lambda \left( \mathop \sum \nolimits_{p = 1}^{5} y_{{i_{{p_{\text{out}} }} }} - 1 \right) = 0\), the constraint term vanishes and we obtain the absolute minimum.

We use the L2 norm as a regularization term and add it to the optimization function, so the loss function becomes

$$ \left[ {\tilde{L}\left( {\emptyset_{\text{Net}} } \right)} \right]^{{S_{{t_{x} }} }} = \left[ {L\left( {\emptyset_{\text{Net}} } \right) + \frac{\alpha }{2}\emptyset_{\text{Net}}^{2} } \right]^{{S_{{t_{x} }} }} , $$
(15)

where α is the regularized parameter.
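The cross-entropy loss of Eq. 12 combined with the L2 penalty of Eq. 15 maps directly onto standard Keras facilities. The sketch below shows one way to express them; the layer width is illustrative, and only the regularization strength 0.01 is taken from Sect. 4.2.

```python
# Hedged sketch: pairing the cross-entropy loss of Eq. 12 with the L2 penalty
# of Eq. 15. Keras' l2(c) regularizer adds c * sum(w**2) to the loss, so
# choosing c = alpha / 2 reproduces the (alpha/2) * ||phi||^2 term of Eq. 15.
from tensorflow.keras import layers, losses, regularizers

alpha = 0.01                                  # regularization strength from Sect. 4.2
l2_penalty = regularizers.l2(alpha / 2)

# Example of attaching the penalty to a trainable layer (here a Dense layer):
fc1 = layers.Dense(128, activation="relu", kernel_regularizer=l2_penalty)

# The data term of Eq. 12 is the standard categorical cross-entropy:
cross_entropy = losses.CategoricalCrossentropy()
```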

figure a

During backpropagation, we compute the gradients of the loss using the backpropagation algorithm [35] and update the weights and biases using Adam [36]. The gradients with respect to the parameters are calculated at every layer of the proposed model. Let the gradients computed on sample \(S_{{t_{x} }}\) for the whole network be represented as \(\left[ {\nabla \emptyset } \right]^{{S_{{t_{x} }} }} = \left[ {\nabla \theta_{\text{fc}} ,\nabla \theta_{\text{c}} } \right]^{{S_{{t_{x} }} }}\). The gradients with respect to the parameters of the MLP layers and the two-stream CNNs for a given \(\it {\text{Batch}}\_{\text{Size}}\) are obtained by iteratively solving Eqs. 16 and 17, respectively.

$$ \left[ {\nabla \theta_{\text{fc}} } \right]^{{ S_{{t_{x} }} }} = \left[ { \nabla_{{\theta_{\text{fc}} }} \tilde{L}\left( {\emptyset_{\text{Net}} } \right)} \right]^{{S_{{t_{x} }} }} $$
(16)
$$ \left[ {\nabla \theta_{\text{C}} } \right]^{{S_{{t_{x} }} }} = \left[ { \nabla_{{\theta_{\text{C}} }} \tilde{L}\left( {\emptyset_{\text{Net}} } \right)} \right]^{{S_{{t_{x} }} }} $$
(17)

Now, the cumulative gradients \(\left[ {\nabla \emptyset } \right]^{t} = \left[ {\left[ {\nabla \theta_{fc} ,\nabla \theta_{c} } \right]} \right]^{t}\) for a given batch t can be calculated using the following Eqs. 18 and 19.

$$ \left[ {\nabla \theta_{fc} } \right]^{t} = \mathop \sum \limits_{x = 1}^{{{\text{Batch}}\_{\text{Size}}}} \left[ { \nabla_{{\theta_{fc} }} \tilde{L}\left( {\emptyset_{\text{Net}} } \right)} \right]^{{S_{{t_{x} }} }} $$
(18)
$$ \left[ {\nabla \theta_{\text{C}} } \right]^{t} = \mathop \sum \limits_{x = 1}^{{{\text{Batch}}\_{\text{Size}}}} \left[ { \nabla_{{\theta_{\text{C}} }} \tilde{L}\left( {\emptyset_{\text{Net}} } \right)} \right]^{{S_{{t_{x} }} }} . $$
(19)

After computing all the gradients, we update the parameters using the Adaptive Moment (Adam) [36] update rule. To address the learning-rate decay problem, Adam [36] uses the cumulative history of gradients to update the network parameters (weights and biases). For a given iteration t, the cumulative history of gradients is calculated using Eqs. 20–22.

$$ m^{t} = \beta_{1} \times m^{t - 1} + \left( {1 - \beta_{1} } \right) \times \left[ {\nabla \emptyset } \right]^{t} $$
(20)
$$ v^{t} = \beta_{2} \times v^{t - 1} + \left( {1 - \beta_{2} } \right) \times \left( {\left[ {\nabla \emptyset } \right]^{t} } \right)^{2} $$
(21)
$$ \hat{m}^{t} = \frac{{m^{t} }}{{1 - \beta_{1}^{t} }} \;{\text{and}}\;\hat{v}^{t} = \frac{{v^{t} }}{{1 - \beta_{2}^{t} }} , $$
(22)

where \(\beta_{1}\) = 0.9 and \(\beta_{2}\) = 0.999. Now, the parameters are updated using the following equation:

$$ \emptyset_{\text{Net}}^{t + 1} = \emptyset_{\text{Net}}^{t} - \frac{\eta }{{\sqrt {\hat{v}^{t} } + \varepsilon }} \times \hat{m}^{t} , $$
(23)

where \(\eta\) is the learning rate.
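A compact NumPy sketch of the Adam update in Eqs. 20–23 is shown below for a single parameter array. It is a didactic re-implementation under the stated hyperparameters, not the Keras optimizer actually used for training.

```python
# Hedged sketch of the Adam update rule (Eqs. 20-23) for one parameter array.
import numpy as np

def adam_step(params, grad, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # Eq. 20: first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2        # Eq. 21: second-moment estimate
    m_hat = m / (1 - beta1 ** t)                   # Eq. 22: bias correction
    v_hat = v / (1 - beta2 ** t)
    params = params - eta / (np.sqrt(v_hat) + eps) * m_hat   # Eq. 23: parameter update
    return params, m, v

# Usage: initialize m = v = np.zeros_like(params) and call with t = 1, 2, ...
```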

3.4 Precautions to handle overfitting

Overfitting occurs when a model is complex enough to fit the training set too tightly, so that it memorizes the training set instead of learning from it; such a complex model has low bias and high variance. If the model is too simple, underfitting occurs instead, with high bias and low variance, so there is always a trade-off between bias and variance. Why should we care about the bias–variance trade-off and model complexity for deep neural networks? Because:

  • Deep neural networks are highly complex models.

  • They have many parameters and many non-linearities.

  • So, it is easy for them to overfit and drive the training error to zero.

Hence, we need some form of regularization. The following are different forms of regularization that help to handle overfitting:

  • L2 Regularization.

  • Early-stopping.

  • Drop-out.

  • Parameter sharing and tying.

  • Data augmentation.

  • Ensemble.

  • Adding noise to inputs and outputs.

In our proposed model, we implement L2 regularization and early stopping to handle overfitting. Large weights make the network overfit, so that a small change in the input (during testing) produces a large variation in the output; therefore, the weights must be penalized, which can be done with L1 or L2 regularization. As described in Sect. 3.3, we use the L2 norm as the regularization term and add it to the optimization function.

4 Experiments and results

The proposed model is implemented in PyCharm using Keras and TensorFlow and runs on an Intel i7 8th-generation processor with 8 GB RAM. The experiments are conducted on three publicly available datasets, namely PETS-2009 [24], UCSD [25] and UMN [26]. To show the efficiency of the proposed model, three state-of-the-art techniques (MS-CNN [10], CLBP [19] and MLP [20]) are implemented and compared. In addition, we performed an ablation study in which each stream, i.e. the spatial stream and the motion stream, was implemented individually and the results were compared. The following subsections describe the dataset preparation, default network parameter values, performance metrics, result analysis and ablation study in detail.

4.1 Dataset Preparation

For the experiments, we use publicly available datasets, namely PETS-2009 [24], UCSD [25] and UMN [26]. The PETS-2009 [24] dataset is a benchmark dataset for crowd surveillance; PETS-2009 S1 is mainly meant for crowd count and density estimation. It contains three scenarios, L1, L2 and L3, each of which contains sequences recorded from four views at two timestamps. We selected view-1 of all three scenarios, which were recorded under different lighting and weather conditions; each view contains sparse to dense crowds. UCSD [25] is a crowd anomaly dataset that provides sequences from two different scenes, UCSD-Ped1 and UCSD-Ped2, containing 16,000 and 4800 frames, respectively; these sequences include challenging conditions such as varying lighting, occlusion and camera jitter. UMN [26] provides benchmark datasets for crowd monitoring; UMN-Plaza1 and UMN-Plaza2 are two surveillance videos recorded at Plaza1 and Plaza2. These videos are converted into frames, and the frames are manually annotated with one of five density levels according to the degree of crowd congestion. The division of the crowd scenes (PETS-2009, UCSD) into five density levels, VL, L, M, H and VH, follows the work of Fu et al. [10]. The details of the division of crowd density levels are given in Table 3, where each entry shows the range of people for that density class. Since UMN-Plaza1 and Plaza2 contain very low crowds, their density levels are divided according to the values shown in Table 3. Fig. 4 shows examples of the different congestion levels defined on these datasets.

Table 3 Details of five congestion levels
Fig. 4
figure 4

Examples of crowd scenes of different density levels

Recent works on CCA generally do not mention whether a validation set was used during training. To give a clear picture of the proposed work, we therefore created two datasets, namely Dataset-1 and Dataset-2. In Dataset-1, the annotated frames of every class are divided into a training set and a test set: 60% of each class is taken as training and the remaining 40% as test. Dataset-2 contains a training set, a validation set and a test set: each density class is divided into 40% training, 20% validation and 40% testing. Tables 4 and 5 show the division of the datasets into “Dataset-1” and “Dataset-2”.

Table 4 Details of “Dataset-1”
Table 5 Details of “Dataset-2”
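A per-class split like the one just described can be reproduced with scikit-learn's stratified splitting. In the sketch below, the frame array `X` (stacked grayscale frames or magnitude maps) and the integer label vector `y` are placeholder names assumed for illustration, not names from the paper.

```python
# Hedged sketch: stratified 60/40 (Dataset-1) and 40/20/40 (Dataset-2) splits.
from sklearn.model_selection import train_test_split

# Dataset-1: 60% train, 40% test per class.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=0)

# Dataset-2: 40% train, 20% validation, 40% test per class.
X_tr2, X_rest, y_tr2, y_rest = train_test_split(
    X, y, train_size=0.40, stratify=y, random_state=0)
X_val2, X_te2, y_val2, y_te2 = train_test_split(
    X_rest, y_rest, test_size=2/3, stratify=y_rest, random_state=0)  # 20%/40% of the whole
```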

4.2 Initializing network parameters

The model is fine-tuned by setting the parameters and hyperparameters as follows. The learning rate is initialized to 0.01, and the weights are initialized to their default values. The hyperparameters batch-normalization momentum, number of iterations and L2 regularization are set to 0.95, 500 and 0.01, respectively. For Adam, \(\beta_{1}\) = 0.9 and \(\beta_{2}\) = 0.999. Batch sizes of 64, 128, 256 and 64 are used for PETS-2009, UCSD-Ped1, UCSD-Ped2 and UMN, respectively.
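These defaults translate into the following Keras configuration sketch; the exact `fit` arguments are assumptions, and the batch size shown is the PETS-2009 value.

```python
# Hedged sketch of the training configuration described in Sect. 4.2.
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.01, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# history = model.fit([train_gray, train_mag], train_labels,
#                     validation_data=([val_gray, val_mag], val_labels),
#                     epochs=500,          # max_iteration from Sect. 4.2
#                     batch_size=64,       # PETS-2009; 128/256/64 for the other datasets
#                     callbacks=[early_stop])
```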

4.3 Performance metrics

Since the proposed work is a multi-class classification problem, we use macro-averaged performance metrics for the performance analysis. For a binary classification problem, the structure of the confusion matrix is shown in Table 6.

Table 6 Confusion matrix for binary classification

Using the confusion matrix, we can derive the following performance metrics for each individual class \(\it i|_{i = 0,1,2,3,4}\), where 0, 1, 2, 3, 4 stand for VL, L, M, H and VH.

$$ {\text{Recall }}\left( {\text{Ri}} \right) = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}} $$
(24)
$$ {\text{Precision }}\left( {\text{Pi}} \right) = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}} $$
(25)
$$ {\text{False Positive Rate}} \left( {{\text{FPR}}_{i} } \right) = \frac{\text{FP}}{{{\text{FP}} + {\text{TN}}}} $$
(26)
$$ {\text{Specificity}} \left( {{\text{S}}_{i} } \right) = 1 - {\text{FPR}}_{i} $$
(27)
$$ {\text{F}}1 - {\text{Score}}\left( {{\text{F}}1_{i} } \right) = \frac{{2 \times P_{i} \times R_{i} }}{{P_{i} + R_{i} }} $$
(28)

The global performance metrics for multi-class classification are then calculated by taking the macro-average of the per-class metrics, as shown in the following equations.

$$ {\text{Recall}} \left( R \right) = \frac{{\mathop \sum \nolimits_{i = 1}^{5} R_{i} }}{5} $$
(29)
$$ {\text{Precision}} \left( P \right) = \frac{{\mathop \sum \nolimits_{i = 1}^{5} P_{i} }}{5} $$
(30)
$$ {\text{False Positive Rate}} \left( {\text{FPR}} \right) = \frac{{\mathop \sum \nolimits_{i = 1}^{5} {\text{FPR}}_{i} }}{5} $$
(31)
$$ {\text{Specificity}} \left( S \right) = 1 - {\text{FP Rate}} $$
(32)
$$ {\text{F}}1 - {\text{Score}} = \frac{{\mathop \sum \nolimits_{i = 1}^{5} {\text{F}}1_{i} }}{5} . $$
(33)

The accuracy and error rate can be calculated using the following equations.

$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} $$
(34)
$$ {\text{Error}}\;{\text{Rate}} = \frac{{{\text{FP}} + {\text{FN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} $$
(35)
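For reference, the macro-averaged metrics above can be computed from a multi-class confusion matrix as in the sketch below; the 5 × 5 matrix layout (rows = true classes, columns = predicted classes) is an assumption of the sketch.

```python
# Hedged sketch: macro-averaged metrics (Eqs. 24-35) from a 5x5 confusion
# matrix whose rows are true classes and columns are predicted classes.
import numpy as np

def macro_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = cm.sum() - (tp + fp + fn)
    recall = tp / (tp + fn)                       # Eq. 24 per class
    precision = tp / (tp + fp)                    # Eq. 25 per class
    fpr = fp / (fp + tn)                          # Eq. 26 per class
    f1 = 2 * precision * recall / (precision + recall)   # Eq. 28 per class
    accuracy = tp.sum() / cm.sum()                # Eq. 34 (corrected form)
    return {"recall": recall.mean(), "precision": precision.mean(),
            "fpr": fpr.mean(), "specificity": 1 - fpr.mean(),
            "f1": f1.mean(), "accuracy": accuracy, "error_rate": 1 - accuracy}
```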

4.4 Results analysis

We performed the experiments by implementing the proposed TIS-MCMS-CNN as well as three state-of-the-art techniques proposed by Fu et al. [10], Alanazi et al. [19] and Kim et al. [20] on the prepared Dataset-1 and Dataset-2. Figures 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 show the confusion matrix heatmaps of the proposed model for Dataset-1 and Dataset-2.

Fig. 5
figure 5

Confusion Matrix-Heatmap of TIS-MCMS-CNN for PETS-2009 of Dataset-1

Fig. 6
figure 6

Confusion Matrix-Heatmap of TIS-MCMS-CNN for PETS-2009 of Dataset-2

Fig. 7
figure 7

Confusion Matrix-Heatmap of TIS-MCMS-CNN for UCSD-Ped1 of Dataset-1

Fig. 8
figure 8

Confusion Matrix-Heatmap of TIS-MCMS-CNN for UCSD-Ped1 of Dataset-2

Fig. 9
figure 9

Confusion Matrix-Heatmap of TIS-MCMS-CNN for UCSD-Ped2 of Dataset-1

Fig. 10
figure 10

Confusion Matrix-Heatmap of TIS-MCMS-CNN for UCSD-Ped2 of Dataset-2

Fig. 11
figure 11

Confusion Matrix-Heatmap of TIS-MCMS-CNN for UMN-Plaza1 of Dataset-1

Fig. 12
figure 12

Confusion Matrix-Heatmap of TIS-MCMS-CNN for UMN-Plaza1 of Dataset-2

Fig. 13
figure 13

Confusion Matrix-Heatmap of TIS-MCMS-CNN for UMN-Plaza2 of Dataset-1

Fig. 14
figure 14

Confusion Matrix-Heatmap of TIS-MCMS-CNN for UMN-Plaza2 of Dataset-2

4.4.1 PETS-2009

Table 7 shows the results of the different approaches on Dataset-1 and Dataset-2 of PETS-2009. It can be seen that the proposed architecture provides better results than the state-of-the-art techniques. Despite challenges such as lighting changes, dynamic crowd shape and occlusion in the dataset, the accuracy of the proposed model on Dataset-1 is 96.97%, with only 15 misclassified samples, and the accuracy on Dataset-2 is 95.94%, with only 20 misclassified samples. This shows that the proposed architecture can extract efficient features under challenging conditions and outperforms MS-CNN [10], CLBP [19] and MLP [20].

Table 7 Performance Analysis of several approaches using Dataset PETS-2009

4.4.2 UCSD-Ped1

The proposed method achieves the highest accuracies of 97.21% and 96.63% on Dataset-1 and Dataset-2 of UCSD-Ped1, respectively, compared with the state-of-the-art techniques. Out of 5602 test cases, 156 and 189 samples are misclassified on Dataset-1 and Dataset-2, respectively. The proposed method also shows better performance in terms of error, recall, specificity, precision, false positive rate and F1-measure. Table 8 gives the detailed performance measurements for the different approaches.

Table 8 Performance analysis of several approaches on UCSD-Ped1

4.4.3 UCSD-Ped2

Out of 1427 test cases of UCSD-Ped2, the proposed model misclassifies 26 and 21 samples, with accuracies of 98.19% and 98.52% on Dataset-1 and Dataset-2, respectively.

The performance analysis on UCSD-Ped2 is given in Table 9, which shows that the proposed architecture performs better than the other techniques in terms of error, recall, specificity, precision, false positive rate and F1-measure. Thus, it handles the challenging situations present in the dataset.

Table 9 Performance analysis of several approaches on UCSD-Ped2

4.4.4 UMN-Plaza1 and Plaza2

The proposed method achieves accuracies of 98.55% and 97.64%, with 8 and 13 misclassified samples, on Dataset-1 and Dataset-2 of UMN-Plaza1, respectively. Similarly, for UMN-Plaza2, the proposed method again performs well in terms of accuracy and the other measures, as shown in Tables 10 and 11. Although CLBP [19] performs better on Dataset-2, it takes much more time to process a frame; the time analysis for all approaches is discussed in Sect. 4.6. Figs. 15 and 16 show graphical comparisons of the performance of the different methods on Dataset-1 and Dataset-2.

Table 10 Performance analysis of several approaches on UMN-Plaza1
Table 11 Performance analysis of several approaches on UMN-Plaza2
Fig. 15
figure 15

Comparison of accuracies of several approaches for “Dataset-1”

Fig. 16
figure 16

Comparison of accuracies of several approaches for “Dataset-2”

4.5 Ablation study

The ablation study is performed to show the impact of each stream, i.e. the spatial stream and the motion stream, individually. The results are shown in Table 12. The accuracies of the individual streams are much lower than that of the proposed architecture, from which we conclude that neither the spatial stream nor the motion stream alone is capable of capturing discriminant features. In general, cluttered background, dynamic crowd shape and occlusion affect the spatial stream, while static objects such as standing people affect the motion stream. When the two kinds of features are combined, the performance increases.

Table 12 Performance analysis of Ablation Study on Different Datasets

4.6 Time analysis

We also calculated the average time required to process the test frames. The proposed approach processes around 29.45 (≈ 30) frames per second. Although MS-CNN [10] achieves a higher frame rate, the proposed model still provides results in real time. Table 13 shows the average frames per second achieved on the test cases by the different approaches.

Table 13 Test frames processing time of several approaches

5 Conclusion and future work

The proposed architecture (TIS-MCMS-CNN) implements global crowd congestion-level analysis. The two streams of TIS-MCMS-CNN extract spatial and temporal features from the resized frame and the motion magnitude map, respectively. The motion magnitude map is obtained using Lucas–Kanade [34] optical flow; other optical flow algorithms could be used, but the Lucas–Kanade [34] approach provides noise-free temporal information between consecutive frames. TIS-MCMS-CNN can extract features invariant to perspective and scene change. To measure the efficiency of the model, we conducted experiments on three publicly available datasets, namely PETS-2009 [24], UCSD [25] and UMN [26], which contain challenging situations such as illumination changes, occlusion, perspective change and camera jitter. We compared the performance of the proposed model with MS-CNN [10], CLBP [19] and MLP [20], and the proposed model outperforms these state-of-the-art techniques. The achieved accuracy shows that the model is robust and can capture discriminant features even in challenging situations. The architecture processes an average of nearly 30 test frames per second and is therefore applicable in real time. The model fails to classify some samples in which a transition between different congestion classes occurs. In future work, we will focus on developing a real-time implementation of a deep graphical model and on applying image understanding tasks [32, 33] to local-level CCA.