1 Introduction

The study of crowd scenes is becoming a field of considerable interest to researchers, mainly due to the rising number of popular events and public places that facilitate the mass gathering of people. Such occasions and spaces include markets, subways, religious festivals, sporting events and public demonstrations [32, 43]. A crowd may induce a disastrous event due to fighting, congestion, mass panic or various other reasons [18]. Several crowd disasters have occurred in recent years [1, 6, 14].

In an attempt to prevent such deadly disasters, most public areas, including holy places, campuses, residential areas and airports, are now equipped with closed-circuit television (CCTV) surveillance cameras. The incoming video can be analysed automatically to facilitate the early detection of a possible abnormal event. The automatic detection of panic behaviours is the focus of the present study. Panic manifests as a sudden change in the crowd dynamics. It appears in the video feed as atypical behaviour: pedestrians moving in different directions, sudden increases in speed, collective running, grouping in one region and so on.

Many research projects have been conducted to automatically detect panic behaviours [7, 12, 13, 15, 21, 22, 24, 26, 29, 31, 34, 35, 40,41,42]. Despite their good detection performances, the majority of them propose off-line solutions. Although off-line solutions are useful in many situations, such as police investigations, it is important to detect a panic situation as soon as it occurs using a real-time detection approach.

To the best of our knowledge, few real-time techniques have been proposed in the literature [13, 15, 22, 24, 26, 31, 34, 35, 42]. The common scheme of panic detection approaches is mainly composed of three steps. First, the motion field is estimated, since motion is a crucial characteristic of the crowd dynamics. Second, a feature that characterizes the crowd behavior is extracted; it depends on the way a panic behavior is defined. Third, panic is detected as a deviation of the values taken by the selected feature from those obtained during a non-panic situation. For instance, in [35], panic is identified by the presence of atypical motion patterns in the scene. Motion patterns of a non-panic situation are learned by computing representative motion subspaces on videos of normal behaviors. During the testing phase, the motion field of the considered video is estimated and approximated using the representative subspaces determined in the training stage. If the error between the estimated motion and its approximation exceeds a user-adjusted threshold, the presence of an abnormal behavior is concluded. However, the performance of this technique depends on the amount of training data, and the diversity of human behavior in crowded scenes makes it difficult to enumerate all possible normal behaviors.
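The subspace idea used in [35] can be illustrated with a minimal PCA-based sketch of our own (the function names and parameters are ours, not the authors'; flow fields are flattened to one row per frame):

```python
import numpy as np

def train_subspace(normal_flows, n_components=5):
    # Learn a representative motion subspace from flattened flow fields
    # of normal behavior (one row per training frame).
    X = np.asarray(normal_flows, dtype=float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]          # mean flow + top components

def reconstruction_error(flow, mean, basis):
    # Error between a test flow and its approximation in the subspace;
    # exceeding a user-adjusted threshold signals abnormal motion.
    z = np.asarray(flow, dtype=float) - mean
    approx = basis.T @ (basis @ z)
    return float(np.linalg.norm(z - approx))
```

A test flow lying in the learned subspace yields a near-zero error, while motion outside it yields a large error, which is the anomaly criterion described above.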

More recently, a real-time detection technique based on texture modelling was proposed in [34]. It associates the occurrence of a panic behavior with a temporal change of the texture. According to the authors [34], this technique achieves an execution time of 18 ms per image. However, its performance may degrade when the spatial-temporal texture patterns are highly heterogeneous during a non-panic situation.

Detecting grouping and running behaviors is the main focus of the work described in [42]. First, the motion is estimated using an optical flow (OF) estimation approach. This estimation is carried out only on Harris corners in order to reduce the computational cost. Second, two parameters are chosen as features to characterize the two behaviors. The first is a crowd distribution index that reflects the gathering level of people in a local area; the second is the velocity. The combination of the two parameters forms the kinetic energy of the crowd. Running or gathering behaviors are detected if the energy or the crowd distribution index exceeds a threshold. This technique achieves a near-real-time execution of 20 images per second.

In [24], panic is defined as an unexpected change in the spatial occupancy of moving objects. An abnormal event is recognized when a high temporal variation of the space occupancy occurs during a given time interval. However, the space occupancy is computed in terms of the number of pixels, and thus does not take into consideration the way in which the space is occupied along the video. In other words, the space can be occupied by approximately the same number of moving pixels, but not with the same spatial distribution. The spatial positions of moving blobs are more likely to change during a panic situation, as a consequence of the reaction of pedestrians when a dangerous event occurs.

By examining the reported real-time techniques, two main limitations can be pointed out. First, although motion information is important in characterizing the crowd dynamics, OF estimation is computationally heavy. Also, tracking moving objects or restricting the motion estimation to some points of interest may fail in the presence of occlusions or a highly dense crowd. To alleviate this problem, we suggest detecting moving pixels by computing the absolute differences between pairs of successive images. Contrary to what one might expect, this step does not affect the robustness of the whole proposed system to noise, occlusions and illumination variations, as demonstrated later in Section 4.

The second limitation is that considering panic as a temporal change in texture may limit the performance of the system when the scene is highly textured. Likewise, defining panic as a temporal change in the number of moving pixels within the crowded area may degrade the detection accuracy when people cannot move outside the area in the presence of panic. As a solution, and motivated by the fact that the occurrence of a panic situation changes the way people behave with respect to each other, panic is viewed, in the present study, as a sudden change in the interactions between people. The same definition of panic was considered recently in [29] and led to high detection performances in crowded scenes of any density level. However, the approach described in [29] cannot be used for real-time detection due to the computational complexity of the OF applied to all pixels of each image. In the present work, we restrict the analysis of the interactions between moving objects to the interactions between moving edges. Thus, the problem of analysing the spatial interactions between pedestrians is formulated as the problem of analyzing how the spatial distribution of moving edges varies in time. A sudden and remarkable temporal variation is associated with the occurrence of a panic situation. Our rationale is to conduct the analysis of the spatial distribution of moving edges in the frequency domain. This is motivated by the fact that the spatial distribution of edges is easily perceived in the frequency domain through coefficients of high values. Furthermore, the transformation into the frequency domain yields a sparse representation of the spatial discontinuities. Any transformation to the frequency domain could be applied; the fast Fourier transform (FFT), the discrete cosine transform (DCT) and the discrete wavelet transform (DWT) are explored in the present work.

To analyze the crowd behavior, a new feature is proposed based on the coefficients obtained through the transformation. A sudden increase in the value of the feature reveals the presence of a panic. In order to temporally locate the panic behavior, the values taken by the proposed feature are classified into two subsets: the first one is related to the normal instances of the video, while the second one corresponds to the panic instances. We perform this classification using two different techniques, as detailed in Section 3. An experimental comparison between the two is presented in Section 4.

Three main contributions are proposed in this work:

  1.

    The first contribution aims to alleviate the heavy computations resulting from applying a motion estimation technique, by considering the absolute differences between pairs of successive images of the video. This solution allows a fast analysis reaching 406 frames per second (fps) as shown in Table 4.

  2.

    Considering a panic situation as a sudden change in the interactions between people, our second contribution consists of associating the spatial distribution of moving edges with people's interactions. Thus, the problem of analyzing people's interactions is formulated as the problem of analyzing the temporal variation of the distribution of moving edges.

  3.

    As a third contribution, we propose a new feature that characterizes the interactions between pedestrians. Our rationale is to sparsely represent moving edges in the frequency domain where they are expressed by coefficients of high values. When panic starts, the spatial distribution of moving edges suddenly changes implying a remarkable change in the values of the coefficients. Hence, the feature we propose is the sum of the coefficient absolute values at each instant.

The rest of the paper is organized as follows. The datasets used in this study are described in Section 2. Then, the proposed system is detailed in Section 3 and experimentally evaluated in Section 4. Next, the results are discussed in Section 5. Finally, some conclusions are drawn in Section 6.

2 Datasets

A variety of datasets are used in order to deal with various scenes. As depicted in Table 1, videos including artificial and real behaviors with different density levels, and different image sizes are analyzed.

Table 1 Characteristics of the datasets

A brief overview of each dataset is given in what follows.

University of Minnesota (UMN) dataset

It is a public dataset produced by the University of Minnesota, USA [23]. It is composed of 11 video sequences representing escape events, and captured in various contexts: Lawn, Indoor and Plaza. People in these videos walk around normally until an abnormal event occurs which makes them run away. A ground truth (GT) of this dataset is available in [12].

Motion Estimation Dataset (MED)

is a public dataset that includes 11 videos of panic behaviors [25]. Typical scenarios are: putting down a suspicious backpack, an earthquake, a hoodlum attack and a sniper attack. The GT of this dataset is annotated and made publicly available by the authors of [25].

Performance Evaluation of Tracking and Surveillance 2009 (PETS2009) dataset

was recorded at the University of Reading, UK [9]. It includes many scenarios, each captured from four different views. Two scenarios are analyzed in the present study. In the first scenario, of 107 images, people start walking from the left side until an abnormal event occurs which makes them run away. The second scenario is composed of 378 images; people start gathering in the middle until an abnormal event occurs which makes them run away in different directions. The videos of this dataset are challenging as they contain frequent illumination variations.

Festival crowd

[38] This video is a real scene of high people density. It records a festival event and shows people who are initially gathered until an abnormal event occurs. This video is challenging as it includes frequent people interactions, obstacles and occlusions.

Bull-running festival

[37] This video records a bull-running festival in Spain. In the beginning, it shows people walking; then they start freeing space for the coming bulls. After that, some of the bulls enter the scene, which causes people to run. Critical occlusions appear in this video.

3 Method

The proposed approach is composed of four main stages, as shown by the block-diagram of Fig. 1. Given the streaming video transmitted by the CCTV camera, the K images of the video are converted to grayscale. The first step of the proposed method consists of computing the absolute differences \(\{D^{(k)}\}_{k}\) between pairs of successive images I(k) and I(k+ 1) (∀k = 1, … , K − 1). This phase locates the moving edges at each instant. Second, the resulting maps \(\{D^{(k)}\}_{k}\) are transformed into a frequency domain. The obtained coefficients of high absolute value correspond to the spatial discontinuities within the map D(k) and reveal the way the moving edges are distributed within a local area. Likewise, locally homogeneous regions, such as non-moving areas, are represented by coefficients of low absolute value. The absolute values of the coefficients of D(k) are summed, giving rise to S(k). Third, the variation of S(k) along time (∀k = 1, … , K − 1) allows identifying whether a remarkable increase, which may be associated with a panic situation, exists. To detect a panic behavior within the set \(\{S^{(k)}\}_{k}\), two alternatives are explored in this study: a clustering-based approach and a statistical approach. Finally, the detection performances are refined by removing false alarms through a postprocessing phase.

Fig. 1
figure 1

Block-diagram of the proposed approach

3.1 Absolute image differences computation

This step aims to detect the moving edges of the objects present in the video with a minimum number of computations. Hence, the absolute difference D(k) between two successive images I(k) and I(k+ 1) of the video is computed as:

$$ D^{(k)}=|I^{(k+1)}-I^{(k)}|, \quad \forall k=1,\ldots,K-1 $$
(1)

where K is the number of images in the video. The resulting matrices \(\{D^{(k)}\}_{k=1,\ldots ,K-1}\) locate the moving edges between successive instants. Furthermore, they reveal the spatial distribution of the moving pixels at each instant. This distribution varies within a certain range during a non-panic situation. When a panic occurs, it varies largely due to a sudden and remarkable change in people's behavior. An illustration is given in Fig. 2, where the first column depicts an image extracted during a non-panic situation along with the corresponding map D(k), and the second column shows an image extracted during a panic situation and its related map D(k). The bright pixels in D(k) indicate high intensity differences between the successive images. It is noticeable that only small variations exist during a non-panic situation, whereas in a panic situation the number of moving pixels increases and the absolute pixel intensity differences between successive images are higher (shown by bright pixels in D(k)), as a consequence of a faster change in the characteristics of the pedestrian movements during panic.

Fig. 2
figure 2

Distributions of moving pixels during a a non-panic and b a panic situation
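The differencing of Eq. (1) amounts to one vectorized subtraction per frame pair. A minimal sketch of our own, using NumPy on a stack of grayscale frames:

```python
import numpy as np

def frame_differences(frames):
    # D(k) = |I(k+1) - I(k)| for k = 1..K-1 (Eq. 1); `frames` is a
    # (K, H, W) stack of grayscale images.
    frames = np.asarray(frames, dtype=np.int16)   # avoid uint8 wrap-around
    return np.abs(np.diff(frames, axis=0)).astype(np.uint8)
```

The cast to a signed type before subtracting is the only subtlety: differencing unsigned 8-bit images directly would wrap around instead of producing absolute differences.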

In order to detect this remarkable spatial-temporal change, we resort to the analysis of the distribution of moving edges in the frequency domain, as explained in the next subsection.

3.2 Proposed frequency-based feature for the characterization of the crowd dynamics

The aim of this step is to analyze the behavior of the pedestrians at each instant by analyzing the spatial distribution of the corresponding moving edges. For this, the transformation of D(k), ∀k = 1, … , K − 1, into a frequency domain is retained for its efficiency in locating discontinuities on the one hand, and for the sparse representation it offers on the other hand. The FFT [3], the DCT [2] and the DWT [19] are explored in the present work. The transformation \(\mathcal {T}_{F_{d}}(D^{(k)})\) of D(k), ∀k = 1, … , K − 1, into a frequency domain Fd ∈{FFT,DCT,DWT} yields a set of coefficients {c(k)} stored in a matrix C(k). The spatial discontinuities in D(k) are transformed into high-magnitude coefficients in the frequency domain, which represent a minority among the whole set {c(k)}. On the contrary, a majority of low-magnitude coefficients correspond to the local spatial homogeneities, as illustrated by the histograms of Fig. 3. Furthermore, the differences in people's interactions between the non-panic and the panic situation are well highlighted. For instance, in this excerpt, with the same pedestrians present in both situations, the magnitude range of the set of coefficients {c(k)} is larger during a panic than in the case of normal behaviors. In addition, the number of high-magnitude coefficients during a panic situation is greater than during a non-panic situation. It is to be noted that the same behavior is observed regardless of the chosen frequency domain.

Fig. 3
figure 3

Distribution of the coefficient magnitudes during a non-panic (first column) and a panic (second column) situation, using FFT (second row), DCT (third row) and DWT (fourth row)

These observations motivated us to propose a new feature S(k) defined for each D(k) by:

$$ S^{(k)}=\left\{ \begin{array}{ll} \sum\limits_{(r,s) \in C^{(k)}} |c^{(k)}(r,s)|, & \text{if } F_{d} \in \{FFT, DCT\} \\ \sum\limits_{j=1}^{J} \sum\limits_{o=1}^{\mathcal{O}} \sum\limits_{(r,s) \in c_{(j,o)}^{(k)}} |c_{(j,o)}^{(k)}(r,s)|, &\text{if } F_{d}=DWT. \end{array} \right. $$
(2)

where J is the number of wavelet decomposition levels, \(\mathcal {O}\) is the number of orientations at each level and \(c_{(j,o)}^{(k)}\) is the wavelet subband at the resolution level j = 1, … , J and the orientation \(o=1,\ldots ,\mathcal {O}\). For a dyadic wavelet, \(\mathcal {O}=3\) and \(c_{(j,1)}^{(k)}\) denotes the horizontal subband (o = 1) at the resolution level j, \(c_{(j,2)}^{(k)}\) denotes the vertical subband (o = 2) and \(c_{(j,3)}^{(k)}\) is the diagonal subband (o = 3). As explained later in this paper (Section 4), several dyadic wavelet transforms are tested with decomposition levels ranging from J = 1 to J = 3. A 1-level decomposition (J = 1) is found to yield the best detection performances.

The feature S(k) quantifies the discontinuities between moving pixels at each instant. Furthermore, it facilitates the distinction between non-panic and panic behaviors.
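For the FFT case, Eq. (2) reduces to summing the coefficient magnitudes of each difference map. A minimal sketch of ours (for the DWT case one would instead sum the magnitudes over the detail subbands, e.g. with the PyWavelets library):

```python
import numpy as np

def feature_fft(D):
    # S(k): sum of the absolute 2-D FFT coefficients of the difference
    # map D(k) (Eq. 2, FFT case).
    return float(np.abs(np.fft.fft2(np.asarray(D, dtype=float))).sum())
```

For a perfectly homogeneous map the energy concentrates in the single DC coefficient, whereas a map rich in moving edges spreads high magnitudes across many coefficients, which is what the feature captures.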

The examination of the temporal variation of S(k) reveals a sudden change in its values when a panic occurs. An illustration of this behavior is depicted in Fig. 4 where the temporal variation of S(k) along the video 9 of the UMN dataset is displayed.

Fig. 4
figure 4

Temporal variation of the proposed feature along video 9 of the UMN dataset using a FFT, b DCT and c DWT

As can be noticed, the values of S(k) vary within the same range until the 551st image, where a remarkable jump occurs due to the occurrence of a panic behavior and lasts for about 120 images. Then, the curve drops when people leave the scene. Another jump is also noticed at images 301 and 302 when the DWT is applied. However, this peak does not correspond to a panic, given its very short duration, and it is automatically eliminated by the processing described in Section 3.4.

The next step consists of automatically detecting the high values of S(k) as they reveal the presence of a panic situation.

3.3 Panic detection

Two relevant and distinguishable behaviors are present in a video containing a panic situation: non-panic and panic-related behaviors. They are reflected by the presence of two classes of values in the set \(\mathcal {S}=\{S^{(k)}\}_{k}\), respectively low and high values. In this study, we propose to formulate the problem of detecting the high values using two different formulations. The first one considers classifying the set \(\mathcal {S}\) into 2 classes using a clustering technique [16]. The second formulation, proposed in [29], considers the high values as atypical observations that statistically deviate from the distribution followed by the low values. Besides, without loss of generality, the high values are assumed to be a minority within the set \(\mathcal {S}\), and hence are considered as outliers, detected thanks to the use of a statistical test for outlier detection [27]. We investigate the two formulations and compare them in terms of detection performances and execution time.

3.3.1 Clustering based detection

The objective of this step is to differentiate the data observations in \(\mathcal {S}\) that correspond to a panic behavior from those related to a normal behavior, using a clustering technique. The idea is to build clusters of data by grouping in each cluster the data points that are as close as possible to each other with respect to a given distance, on the one hand; on the other hand, the distance between clusters is required to be as large as possible. To detect the values of \(\mathcal {S}\) that correspond to a panic, two clusters have to be identified. The first cluster Snp corresponds to the values obtained during a non-panic situation, while the second cluster Sp includes the high values that are related to a panic situation.

Several clustering techniques have been proposed in the literature [10, 11, 20, 28]. The comparison of their performances in detecting panic is conducted in Section 4.
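As an illustration of the clustering formulation, a plain two-class 1-D k-means pass over the feature values can separate the high, panic-related cluster. This is a sketch of our own, not one of the specific methods compared in Section 4:

```python
import numpy as np

def cluster_detect(s, iters=20):
    # Split the feature values into a low (non-panic) and a high (panic)
    # cluster with 1-D 2-means; returns a boolean mask, True where panic.
    s = np.asarray(s, dtype=float)
    centers = np.array([s.min(), s.max()])
    for _ in range(iters):
        labels = np.abs(s[:, None] - centers[None, :]).argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = s[labels == k].mean()
    return labels == centers.argmax()
```

Initializing the two centers at the minimum and maximum feature values makes the split deterministic, which suits the expected bimodal structure of \(\mathcal {S}\).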

3.3.2 Statistical detection

The aim is to partition \(\mathcal {S}\) into two homogeneous subsets: a subset Snp containing the majority of observations, related to a non-panic situation, and another subset Sp containing a minority of observations with remarkably higher values, related to a panic situation. Motivated by the characteristics of each subset and the differences between them, we emphasize the possibility of identifying them through hypothesis testing. More precisely, the observations in Sp are considered to deviate from the statistical distribution followed by those in Snp. Their detection can therefore be performed in two phases. The first phase aims to estimate the mean and variance of Snp by analyzing \(\mathcal {S}\) in a way that is robust to the presence of the other category of observations (those belonging to Sp). To this aim, the Minimum Covariance Determinant (MCD) estimator is retained for its efficiency and relatively low computational complexity [27]. The second phase aims to deduce the set Sp, given the estimated parameters of the distribution of Snp.

  1.

    Parameter estimation: The key idea of MCD for estimating the mean and the variance of \(\mathcal {S}\) robustly to the presence of the observations of Sp is to look for the most concentrated subset of \(\mathcal {S}\), of size h = (1 − α)(K − 1), given a confidence level 0 < α < 1. Hence, the observations \(s_{i} \in \mathcal {S}\) are first sorted in increasing order. Then, contiguous h-subsets Hi are built as Hi = {s(i), … , s(i+h− 1)}. For each subset, the mean and the variance are computed. The most concentrated subset is the one whose variance \({\sigma ^{2}_{c}}\) is the minimum among the variances of all the subsets Hi. Its mean is denoted by μc.

  2.

    Detection of panic-related observations: As outlined before, panic-related observations have distinguishable values compared to the non-panic-related ones, and hence are considered as outliers. An observation si of \(\mathcal {S}\) is considered an outlier if its distance d(si,μc,σc) from the mean μc relative to σc exceeds a tabulated threshold T derived with respect to a confidence level α. This distance is defined by:

    $$ d(s_{i},\mu_{c},\sigma_{c})=\frac{|s_{i}-\mu_{c}|}{\sigma_{c}}, \quad \forall i=1,\ldots,K-1. $$
    (3)

Hence, the two subsets Snp and Sp of \(\mathcal {S}\) related respectively to non panic and panic situations are deduced by:

$$ \begin{array}{@{}rcl@{}} S_{\text{np}}&=&\{s_{i} \in \mathcal{S}; d(s_{i},\mu_{c},\sigma_{c})<T \},\\ S_{\mathrm{p}}&=&\{s_{i} \in \mathcal{S}; d(s_{i},\mu_{c},\sigma_{c})\geq T \}. \end{array} $$
(4)

The MCD source code is part of the LIBRA package which is available at https://wis.kuleuven.be/stat/robust/LIBRA/LIBRA-home.
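The two phases above can be sketched for the univariate case as follows. This is our own simplified reimplementation for illustration only; the actual experiments use the LIBRA package:

```python
import numpy as np

def mcd_1d(s, alpha=0.01):
    # Phase 1: most concentrated contiguous h-subset of the sorted
    # observations, with h = (1 - alpha)(K - 1); returns its mean/std.
    s = np.sort(np.asarray(s, dtype=float))
    h = max(2, min(len(s), int(round((1 - alpha) * len(s)))))
    best_var, best_mean = np.inf, s.mean()
    for i in range(len(s) - h + 1):
        sub = s[i:i + h]
        if sub.var() < best_var:
            best_var, best_mean = sub.var(), sub.mean()
    return best_mean, np.sqrt(best_var)

def split_sets(s, mu, sigma, T=3.0):
    # Phase 2 (Eqs. 3-4): flag s_i as panic-related when its normalized
    # distance from the robust estimates reaches the threshold T.
    s = np.asarray(s, dtype=float)
    return np.abs(s - mu) / sigma >= T
```

Because the subsets are contiguous in the sorted sequence, only K − h candidate windows need to be scanned, which keeps the search inexpensive.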

Figure 5 shows the detection result of the statistical outlier test when applied to the set \(\mathcal {S}\) of Fig. 4c. As expected, images 301 and 302 are flagged as panic images. Apart from these images, the panic event is detected only one image early. To improve the detection performances, we propose a postprocessing step that aims to reduce the false detections.

Fig. 5
figure 5

Detection using MCD before and after postprocessing

3.4 Postprocessing

The proposed detection technique yields some false detections that should be reduced. To this aim and without loss of generality, the following assumptions are stated:

  • A panic behavior cannot last less than one second.

  • A panic behavior occurs once within a processed video.

The first assumption means that if n successive images are detected as containing a panic behavior and n is less than the number N of images per second (equivalently, the frame rate of the video), then those images are considered as false detections and are discarded from the set of detections. According to the second assumption, it is then possible to identify the sequential number of the image at which panic starts: it is the first image for which all subsequent images are also identified as anomalous. As depicted in Fig. 5, applying this processing effectively eliminates the false detections that lie apart from the sequence of panic images.
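The first rule can be sketched as a run-length filter over the per-frame detection flags (our own illustration; `fps` stands for the frame rate N):

```python
def postprocess(flags, fps):
    # Discard detection runs shorter than one second (< fps frames);
    # what survives the cleaning obeys the assumptions stated above.
    cleaned = [False] * len(flags)
    i = 0
    while i < len(flags):
        if flags[i]:
            j = i
            while j < len(flags) and flags[j]:
                j += 1                     # [i, j) is one detection run
            if j - i >= fps:
                cleaned[i:j] = [True] * (j - i)
            i = j
        else:
            i += 1
    return cleaned
```

Under the second assumption, the panic start is then simply the index of the first `True` entry in the cleaned sequence.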

4 Results

To evaluate the performances of the proposed technique, four rounds of tests are conducted. The aim of the first round is to select the wavelet parameters that yield the best performances. In the second round, we seek to retain the panic detection method that yields the most accurate results; for this, common clustering techniques as well as the MCD test are compared. The selection of the most suitable frequency domain is carried out in the third round. Finally, after retaining the appropriate parameters of the system, the detection performances are evaluated against some highly accurate offline techniques from the literature, and some real-time techniques. In order to quantify the performances of the tested techniques, the correct detection rate Pc (which is the same as the accuracy), the false detection rate Pf, the precision and the recall are computed. They are respectively defined in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) by:

$$ P_{c} = \frac{TP+TN}{K}, P_{f} = \frac{FP+FN}{K} $$
(5)
$$ Recall = \frac{TP}{TP+FN}, \quad Precision= \frac{TP}{TP+FP} $$
(6)
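Assuming K = TP + TN + FP + FN (every frame falls in exactly one category), Eqs. (5) and (6) can be computed with a small helper of ours:

```python
def detection_metrics(tp, tn, fp, fn):
    # Correct/false detection rates, recall and precision (Eqs. 5-6).
    K = tp + tn + fp + fn
    return {
        "Pc": (tp + tn) / K,
        "Pf": (fp + fn) / K,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }
```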

The proposed technique is also evaluated in terms of execution time (in number of frames per second (fps)) when it runs on a PC with a 64 bit Core(TM) i7 2.80 GHz CPU, 16 GB RAM and Windows 10. MATLAB 2017 and the WAVELAB library [33] are used for the implementation.

4.1 Wavelet parameters

In this category of tests, we aim to select the most suitable wavelet function and the optimal number of decomposition levels.

Selection of the wavelet function

Several wavelet functions exist in the literature [4, 5, 8, 17, 30, 36, 39]. The performances of the proposed approach are evaluated with several wavelets in order to retain the most suitable one: Haar [36], Beylkin [4], Vaidyanathan [39], Coiflet [8] of order 1, Daubechies [17] of order 20, Symmlet [5] of order 10 and Battle [30] of order 5. Figure 6 shows the detection rates when the MCD method is applied. Each point of a curve reflects the Pf value (along the x-axis) and the Pc value (along the y-axis) of a specific video in the UMN dataset. The performances are almost the same for all the wavelets except for video 3, where they degrade when using the Symmlet, Beylkin and Battle wavelets. According to these results, the Coiflet wavelet yields the best performances and is retained for the remaining tests.

Fig. 6
figure 6

Detection performances using different wavelet functions and the MCD test

Selection of the number of resolution levels

Tests are conducted in order to determine the optimal number of resolution levels that ensures high detection performances. 1-level, 2-level and 3-level wavelet decompositions are investigated on all the videos of the UMN dataset. The temporal variations of S(k) along video 9 of the UMN dataset, when J = 1, J = 2 and J = 3, are illustrated in Fig. 7a, b and c, respectively. The panic starting at the 551st image according to the ground truth (GT) is visible when J = 1 and J = 3, and two classes of values can be clearly identified in the set \(\mathcal {S}\), unlike when J = 2.

Fig. 7
figure 7

Temporal variation of the proposed feature according to the number of resolution levels: a J = 1, b J = 2, c J = 3. Application to video 9 of the UMN dataset

By applying the same test to all the videos, the detection rates depicted in Fig. 8 show that 1-level and 3-level decompositions yield close performances. Therefore, a 1-level wavelet decomposition is retained, as it requires fewer computations than the 3-level decomposition.

Fig. 8
figure 8

Comparison of the detection performances when J = 1, J = 2 and J = 3. Application to all the videos of the UMN dataset

4.2 Selection of the detection technique

In order to select the most appropriate detection technique, different clustering methods such as k-means [11], the Partitioning Around Medoids (PAM) method [16] and skinny-dip [20] are investigated and compared to the MCD statistical test for outlier detection [27]. Furthermore, different values of the confidence level α (0.01, 0.05, 0.1) are considered to evaluate the performances of the system when the MCD test is used. Figure 9 as well as Table 2 show that the MCD test with α = 0.01 outperforms the other methods, with an average detection rate of 0.98.

Fig. 9
figure 9

Comparison between the detection performances obtained with the PAM, skinny-dip and k-means clustering algorithms and the MCD test. Application to all the videos of the UMN dataset

Table 2 Comparison between MCD, PAM, k-means and skinny-dip methods based on DWT. Application to the UMN dataset

4.3 Comparison between the frequency domains

Using the retained parameters of the system, namely the MCD test with α = 0.01, this round of tests aims to select the most appropriate frequency domain in terms of detection performances and execution time. Therefore, the FFT, the DCT and the DWT, using the Coiflet wavelet function with one level of decomposition, are explored.

Regarding the execution time, Table 3 shows that the FFT yields the fastest execution, followed by the DCT and then the DWT. For all the datasets, the detection rates obtained using each of the three frequency domains are very close, except for the PETS2009 and MED datasets, where the DWT outperforms the DCT and FFT. The execution times indicate that the technique operates in real-time even when the image dimensions of the video are larger than in the other datasets. It is worth pointing out that the use of the DWT requires the dimensions of the images to be of the form \(2^{n}\) with \(n \in \mathbb {N}^{*}\); if this condition is not satisfied, the images are zero-padded. This partly explains the slower computation when the DWT is used compared to the DCT and FFT.

Table 3 Panic detection results based on DWT, DCT and FFT using MCD

4.4 Performances evaluation compared to the state-of-the-art techniques

The performances evaluation of the proposed system compared to the state-of-the-art techniques is carried out in two stages. As it is important to maintain a high detection accuracy while operating in real-time, the objective of the first stage is to compare the accuracy of the system with some offline techniques [7, 12, 21, 29, 40, 41].

In the second stage, comparisons with real-time techniques [13, 22, 24, 26, 31, 34, 35, 42] are performed in terms of accuracy and execution time.

Table 4 shows that, on average, the proposed technique outperforms the technique in [12] and is slightly less accurate than the technique in [29] when tests are performed on the UMN dataset, with an average accuracy of 0.986 against 0.99. These results can be considered excellent, since the proposed system operates in real-time with an average computational speed of 358 fps.

Table 4 Performance comparison between our approach and the offline approaches in [12] and [29] on the UMN dataset

Similarly, Table 5 reports the performance obtained on the PETS2009 dataset. On average, the proposed technique performs better than [7, 21, 40] for both scenarios, outperforms the technique in [41] for the second scenario and is slightly less accurate than [41] for the first scenario.

Table 5 Performance comparison between our approach and offline approaches on the PETS2009 dataset

Moreover, Table 6 shows that the proposed technique outperforms the technique in [25] on the MED dataset, for each of the frequency transforms.

Table 6 Detection performance on the MED dataset: comparison between the proposed approach and [25]

Real-world videos are also tested; Table 7 reports the results alongside those of the technique in [29]. The proposed system performs well, although it is less accurate than [29]. In the second stage, the proposed system is evaluated on the UMN dataset against related real-time techniques. The results, reported in Table 8, show that the proposed system outperforms them in terms of both accuracy and execution time, for all three frequency domains.

Table 7 Performance comparison between our approach and the offline approach in [29]
Table 8 Comparison in terms of accuracy and execution time between the reported real-time detection techniques and the proposed technique. Application to the UMN dataset

5 Discussion

The present study describes a new real-time approach for the detection of panic behaviors in crowded scenes. Three main contributions are proposed, whose efficiency, accuracy and high speed are experimentally demonstrated. The first contribution aims to alleviate the heavy computations of motion estimation techniques by considering the differences between successive images of the video. This solution allows moving edges to be located with a fast execution. Furthermore, panic is defined as a sudden change in the interactions between people. This is reflected by a change in the spatial distribution of the moving edges, in addition to an increase in the number of moving pixels as a consequence of the rapid change in people's behavior. In order to characterize the distribution of moving pixels during panic and normal situations, our second contribution consists of representing the moving edges in a frequency domain, allowing a sparse representation of the spatial discontinuities. The FFT, the DCT and the DWT are explored in the present study. Tests conducted on several challenging videos, with different pedestrian density levels, show the high performance and high speed of the proposed system, as depicted in Tables 3, 4, 5, 6, 7 and 8. The experimental comparison between the three frequency domains shows that all of them perform well and that their detection rates are close. In terms of execution time, the FFT-based system yields the highest speed, followed by the DCT, then the DWT.
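The first two contributions, frame differencing followed by a frequency-domain representation of the moving edges, can be sketched as below. This is an illustrative approximation only: the difference threshold, the FFT-based descriptor and the choice of keeping the k × k low-frequency coefficients are our own assumptions, not the exact feature construction of the paper.

```python
import numpy as np

def moving_edges(prev_frame, curr_frame, threshold=15):
    """Locate moving edges as pixels whose grey-level difference
    between two successive frames exceeds a threshold (assumed value)."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.float64)

def frequency_feature(edge_map, k=32):
    """Sparse frequency-domain descriptor: magnitudes of the k x k
    low-frequency FFT coefficients of the moving-edge map, together with
    the total number of moving pixels (which grows during panic)."""
    spectrum = np.abs(np.fft.fft2(edge_map))
    low_freq = spectrum[:k, :k].ravel()
    return np.concatenate([low_freq, [edge_map.sum()]])

rng = np.random.default_rng(0)
f0 = rng.integers(0, 256, (128, 128), dtype=np.uint8)
f1 = f0.copy()
f1[40:60, 40:60] += 50                 # simulate a moving region
feat = frequency_feature(moving_edges(f0, f1))
print(feat.shape)                      # (1025,): 32*32 coefficients + 1 count
```

The DCT or the DWT would replace `np.fft.fft2` in the same pipeline, with the zero-padding constraint discussed in Section 4 applying to the DWT case.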

Our third contribution concerns the detection of panic-related data by exploring two formulations. The first formulation distinguishes between normal-related data and panic-related data using a clustering technique, whereas the second formulation is based on hypothesis testing, in which panic-related data are considered aberrant compared to the data resulting from a normal situation. Figure 9 and Table 2 illustrate the good performance of the system for both formulations, with the second formulation performing slightly better.
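The second formulation can be sketched as a Mahalanobis-distance outlier test against a model of normal-situation features, with the rejection threshold taken from the chi-square distribution at level α = 0.01 as in Section 4. For brevity this sketch uses the plain sample mean and covariance; the paper's MCD test would instead rely on a robust estimator of these quantities (e.g. scikit-learn's `MinCovDet`), and the synthetic data below are purely illustrative.

```python
import numpy as np
from scipy.stats import chi2

def fit_normal_model(X):
    """Estimate mean and inverse covariance of normal-situation features.
    (A robust MCD estimator would be used in practice; the plain sample
    covariance is shown here for brevity.)"""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    return mu, np.linalg.inv(cov)

def is_panic(x, mu, cov_inv, alpha=0.01):
    """Flag a feature vector as panic-related when its squared Mahalanobis
    distance exceeds the chi-square quantile at level alpha."""
    d2 = (x - mu) @ cov_inv @ (x - mu)
    return bool(d2 > chi2.ppf(1 - alpha, df=len(x)))

rng = np.random.default_rng(1)
normal = rng.normal(0.0, 1.0, size=(500, 3))   # synthetic normal-situation features
mu, cov_inv = fit_normal_model(normal)
print(is_panic(np.zeros(3), mu, cov_inv))      # False: typical point
print(is_panic(np.full(3, 8.0), mu, cov_inv))  # True: aberrant point
```

A frame is thus declared panic-related as soon as its feature vector deviates significantly from the normal-situation distribution, which matches the hypothesis-testing view of the second formulation.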

The proposed system is evaluated against both offline and real-time detection techniques, and the results confirm its high performance.

6 Conclusion and future work

A new panic detection approach is proposed in this study. The aim is to analyze the crowd dynamics and detect a possible panic behavior in real time, without requiring prior knowledge of the video under consideration. To this end, a new feature is proposed, based on the computation of image differences and the analysis of the moving edges in frequency domains. Two formulations of the panic detection problem are then explored and compared in terms of accuracy and execution time. The approach is evaluated on several datasets and demonstrates high performance.

In future work, we will study the effectiveness of other solutions for detecting moving edges, such as foreground extraction, and their impact on the system performance.