1 Introduction

With worldwide urbanization and population growth, public places such as railway stations, subways, and airports become highly crowded [16]. This brings in the risk to incur catastrophic events. For example, in Fig. 1, a surveillance video from Tencent News shows the anti-government activity in Bangkok in January 2014. The video faithfully records an emergency of bomb explosion among the crowd and the consequent abnormal collective behaviors. The frequently occurred safety problems have attracted much attention to crowd surveillance in terms of both sociology [10] and technological issues [7]. Automatic detection of abnormal events has the potential to provide alarm or assistance to human observers, which can greatly save operators’ valuable attention when they face a large number of monitors for long hours [7].

Fig. 1
figure 1

An example of the surveillance video showing the anti-government event at Bangkok with grenade attacks. The explosion position is marked by the red circles. The displayed frames successively show the process of the attack: Normal state of people’s moving, explosion of grenade, people escaping in panic, and the final crowd evacuation

The abnormal behaviors in crowded scenes usually appear as crowd commotion, for example, the crowd commotion following the occurrence of the explosion as shown in Fig. 1, so that anomaly detection is in general a problem of detection of crowd commotion [14, 30]. The traditional object-based approaches treat crowd as a collection of individuals. As this kind of methods describe group behaviors through individuals, its performance is subject to the precision of object segmentation [25] and tracking [2, 19, 27]. In low-density scenes, object-based approaches may perform well. In crowded scenes, such as the case shown in Fig. 1, since a large number of people perform random moving in a small space, the high density of objects and the occlusions in both static and dynamic senses will degrade the accuracy of object segmentation and tracking severely, which makes it impossible to capture the behavior of a single individual [11].

To avoid the difficulty of segmenting individuals, a possible solution is to model the crowd commotion of large population as the spatial-temporal movement of a large number of moving particles. In such a context, the well-known social force model [9] is proposed to simulate the movement of a large number of persons as moving particles by taking into account the constraints arising from the mutual interactions among them [14, 17], where the particles are computed from the optical flows. Although it aims to model the physical nature of crowd commotion, the social force model is still far from revealing the reality of people’s responses under emergent circumstances due to the inaccurate representation of individual objects in terms of computer vision, which deviates from the physical nature of the original social force model [9]. Aside from that, the social force model has the emphasis on the local physical phenomenon such as the interaction between neighboring elements but fails to figure out the evolution of crowd commotion from a global point of view, which accounts for why its performance is not satisfactory.

The main contribution of this study is: By employing optical flows as low-level representation, we characterize complex crowd commotions from a global point of in terms of thermodynamics. Intuitively, crowd motions in surveillance videos are comparable to heat motions of basic particles such as molecules in terms of thermodynamics. This motivates us to investigate into thermodynamics to seek effective measures for characterizing crowd motion. Boltzmann Entropy is originally proposed in thermodynamics to investigate the statistical regularity of heat motion of molecules [8]. It captures the relation between the disorder of the motion of a mass of microscopic molecules and the chaotic degree of the corresponding macrostate, namely, the motion pattern of the group of interest as a whole. Inspired by Boltzmann Entropy, we introduce a computer vision method to detect and localize abnormal crowd commotion. The Boltzmann Entropy-based method applied in this study can figure out the spatiotemporal motion pattern of a set of particles in the sense of statistics so that it avoids segmenting or tracking of objects in high-density crowd videos. Figure 2 summarizes the main steps of the approach. As observed experimentally in this study, when abnormal crowd movement occurs, Boltzmann Entropy will increase correspondingly, so abnormal events can be detected by applying a threshold to the time series of Boltzmann Entropy without complex machine learning or large volume of training data. For decision making, here, we need only a simple Gaussian model established from the mean and standard deviation of the time series of the Boltzmann Entropy in correspondence with normal cases. High detection rate and low false alarm rate are achieved on the University of Minnesota (UMN) and web datasets by using the proposed method. The experiments demonstrate not only the effectiveness of the proposed method in detecting and locating abnormal behaviors in crowded scenes but also the efficiency for online monitoring.

Fig. 2
figure 2

The pipeline of the proposed method

The remainder of the paper is organized as follows. In Section 2, we review the related works in the literature. In Section 3, we introduce Boltzmann Entropy to describe the chaotic degree of the crowded scenes in terms of statistics, and then the anomaly detection based on Boltzmann Entropy. In Section 4, we evaluate the performance of the proposed method in detecting and locating abnormal behaviors. In Section 5, we draw conclusions.

2 Related work

The traditional object-based approaches treat crowd as a collection of individuals and describe the collective behaviors of a group through individuals, so their performance depends highly on object segmentation [25] or trajectory reconstruction [2, 19, 27]. In low-density scenes, object-based approaches can achieve good performance. However, in crowded scenes such as the video frames shown in Fig. 1, when a large number of persons perform random actions in a small space, the high densities of objects and severe occlusions make object segmentation and tracking impossible in terms of capturing the behavior of every individual [11]. Besides, the computational cost should be considerably high.

To tackle the problem of unreliable object segmentation or tracking in crowded scenes, an emerging trend for abnormal event detection is to model global motion patterns from local motions of small cells, namely, pixels, local image blocks/patches, and local 3D cuboids/bricks. According to different features applied, these approaches can broadly be classified into three categories: Cell interaction-based approaches [12, 14, 17, 29], cell trajectory-based approaches [20], and cell property-based approaches [1, 4, 13, 18, 22, 28, 32].

For cell interaction-based approaches, a well-known work is the so-called social force model (SFM) proposed by Mehran et al. [14], where group actions are modeled as interaction forces of subjects computed from optical flows. However, SFM is easy to incur false alarms, which are usually caused by the error-prone computation of local interaction forces. The particle swarm optimization (PSO) method introduced by Raghavendra et al. [17] is a robust algorithm to optimize SFM via selective particles but it is difficult to apply such method to online surveillance due to the time-consuming optimization. Cui et al. define interaction energy potential to represent the current state of a subject based on the positions and velocities of itself and its neighbors [29]. The energy potential among neighbors and the action of the subject of interest in the sense of velocity are used to model abnormal group behaviors. However, the weakness of SFM, say, the unstable relationships in terms of local interactions remains unsolved in [29]. Kaltsa et al. deploy compact swarms over optical flows, whose positions are continuously updated following the rules of interaction with each other as well as relationship with regard to the optical flow magnitude [12]. Then, the histograms of oriented swarms constructed from the swarms’ positions and the histograms of oriented gradients [5] are combined to detect anomalous regions via support vector machine (SVM). The fundamental principle of [12] is similar to SFM as well as PSO optimized SFM such that it inherits the same limits, that is, error-prone local interaction modeling and computationally high cost.

For cell trajectory-based approaches, Wu et al. [20] make use of chaotic invariants of particle trajectories as feature, which is known as maximal Lyapunov exponent and correlation dimension, to detect and localize anomalies in crowded scenes. When people’s movements are spatially constrained such as in corridor and underpass, however, discriminating normal and abnormal events according to trajectories is extremely difficult due to the violation of the assumption in chaos theory.

For cell property-based approaches, the fundamental is to characterize motion patterns of crowded scenes based on optical flow computing [1]. Mahadevan et al. employ a mixture of dynamic textures to represent jointly the appearance and the dynamics of local portions of videos in crowded scenes [13, 28]. This approach has an outstanding performance in detecting temporal and spatial anomalies at the cost of heavy load of computation. Reddy et al. propose multiple features including motion, size, and texture from each foreground cell and the decision resulting from each feature is fused to make the final decision [18]. Here, to guarantee computational efficiency, only foreground pixels are considered in modeling crowd motions. Yang et al. detect abnormal events by computing the sparse reconstruction cost based on the dictionary consisting of multi-scale histograms of optical flows in normal cases [4, 32]. Wu et al. construct the class-conditional probability density functions of optical flows in terms of position, magnitude and direction with regard to non-escape and escape cases, respectively [22]. Then, they detect crowd-escaping behaviors by using the class-conditional probability density functions in the framework of Bayes decision.

The features describing crowd actions used in most of the aforementioned methods are high-dimensional features. This means that a complex learning strategy and a large number of training samples are needed due to the sparseness caused by the high dimensionality of the data. In general, the learning process is time-consuming and the classification performance is subject to what training samples are available. Provided the training samples are inadequate or not representative, the detection precision will become worse remarkably.

In contrast to the high-dimensional features that require complex machine learning for pattern classification, an alternative solution is to develop low-dimensional features [23, 24, 30] that do not need an explicit learning procedure. For such features, the decision-making is based on watching whether the target feature value exceeds a predefined threshold. Cao et al. combine crowd kinetic energy and direction variation computed from optical flows to detect abnormal events in crowded scenes [24]. Xiong et al. propose an energy model to detect two types of abnormal crowd behaviors, people gathering and running [30]. Here, people counting is necessary for detecting the two types of abnormal behaviors but the number of people in crowded scenes is difficult to be estimated. In [15], grid-based motion features addressing the speed and direction of moving objects are utilized to detect abnormal events in crowded scenes. Since grid-based computation of motion features is so coarse and thus error-prone, the crowd flux estimation subject to it should be correspondingly unstable. Gu et al. use the product of Shannon Entropies of particle distributions in the vertical and horizontal directions to estimate the parameters of Gaussian Mixture Model (GMM) over the normal crowd behaviors and apply the GMM to detect anomalies [6]. The limitation of Shannon Entropy is that it concerns only the probability of moving particles’ distribution in the whole plane but misses to count the total amount of the subjects involved in crowd motion. Unfortunately, the chaos degree of crowd motion is subject to the number of the subjects involved. Susan et al. [23] adopt the non-extensive entropy, which is a variant of Shannon Entropy [21], to detect anomaly in motion vector fields over three consecutive frames. Since non-extensive entropy is approximately the exponent of Shannon Entropy, it cannot overcome the limitation of Shannon Entropy, that is, it concerns only the probability of particle distribution but misses the total amount of the subjects involved in crowd motions, which is easy to cause false alarms.

In sum, the problems for the state of art methods are: Some features are not simple enough and complex machine learning on high-dimensional data is needed. This makes such methods not only time-consuming but also not applicable as a general solution due to the example-specific machine learning. Some other features are simple and efficient enough to ensure fast decision making such as the Shannon Entropy feature. However, such features do not promise robust performance under various circumstances. The major reason is that all these features do not reflect the physical nature of crowd motions. In view of the limits of the aforementioned methods, we propose to make use of Boltzmann Entropy in terms of thermodynamics as a feature for anomaly detection on collective human mobility. The initial work of this idea is presented in [33]. Here, we extend the work to include more details with rigorous and extensive validation of the method.

The proposed feature gains advantage over the aforementioned methods in the following aspects: Due to the physical nature of Boltzmann Entropy, the proposed feature describes the chaos degree of crowd motions with explicit physical meanning. This accounts for why it leads to better performance in the experiments. In the meantime, the efficient computation of the feature promises real-time on-line monitoring. Besides, the proposed method works no matter how many persons are in the present of the crowded scenes because it roots in the statistics of optical flows rather than object detection or tracking. In this sense, the proposed feature overcomes the limit of Shannon Entropy and its variants, which miss to distinguish the volume of the crowd involved in abnormal behaviors. To the best of our knowledge, the proposed feature is specific in that it can promise both accuracy in anomaly detection and efficient computation. In contrast to [33], this study includes more details as follows: An extensive survey of the state of the art, a more detailed description of the feature extraction method with formal mathematical definition and the motivation rooting in physics, the decision making method to identify anomaly patterns, evaluation of the discriminant power in terms of outlier degree and feature ranking measure, and more experimental comparisons with more baseline methods based on both the UMN benchmark and some additional real-world videos. To highlight the background as well as the motivation of this study, the state of the art works are summarized in Table 1 for comparison.

Table 1 Summary of the state of the art

3 Boltzmann Entropy to feature crowd motion

Boltzmann Entropy is a quantitative measure of disorder degree of thermal motion of a large number of basic particles such as molecules and atoms. Motivated by the physical nature of Boltzmann Entropy in terms of thermodynamics, when monitoring collective crowd motion in video surveillance, the optical flows computed from continuous frames are regarded as moving particles and Boltzmann Entropy is employed to figure out quantitatively how chaotic or disordered the collective motion of a group of people of interest is. As shown in Fig. 3, the particles computed from the optical flow cover most of the motion field, which characterize the motion pattern of people in a scene. Since the Boltzmann Entropy computed from continuous video frames forms a time series, the anomaly detection problem can thus be solved with ease by applying simply a threshold to the time-varying Boltzmann Entropy measure that characterize the group motion, and capturing the remarkably sudden changes as indication of possible abnormal events.

Fig. 3
figure 3

The up-left images are two continuous frames used to calculate the optical flow. The bottom-left image shows that particles computed from the optical flow cover the motion field. The right shows the enlarged view of the corresponding region

As outlined in Fig. 2, the computation of the feature includes three steps: (1) Obtaining particles as all the moving regions and pixels computed from optical flows in continuous video frames; (2) Building a histogram to figure out the distribution of particles’ moving directions and speeds; (3) Calculating the Boltzmann Entropy by taking into account every possible configuration of particles’ motion states that can result in such a histogram. The scheme is detailed as follows:

3.1 Motion signature of crowd as a histogram of particles’ moving direction and speed

A video is comprised of a series of clips and each clip consists of a couple of frames streaming over time. Here, each clip is represented by a matrix of w × h × t size, where w × h denotes the frame resolution (width by height) and t is the number of sequential frames. We see each clip as a dynamic system, which captures the motion pattern of the crowd of interest in a short time window. In general, an abnormal event will last for a time over a couple of frames at least, so the collective behavior of crowd should be described in the context of a clip. This is why we let t frames constitute a clip and let clip be the basic unit to perform pattern analysis.

In each frame, an isolated pixel or a couple of pixels connected to form a region with nonzero optical flow values are referred to as moving particles or particles for brief. In contrast, the pixels with the optical flows of zero values can be considered as the background. On a micro level, the mobility of a particle is defined as the velocity \( {\mathbf{V}}_i=\left({v}_i^x,{v}_i^y\right) \), where

$$ \left|{\mathbf{V}}_i\right|=\sqrt[2]{{\left({v}_i^x\right)}^2+{\left({v}_i^y\right)}^2} $$
(1)

and

$$ {\theta}_i= \arctan \frac{v_i^y}{v_i^x}\left({\theta}_i\in \left[0,2\pi \right)\right) $$
(2)

represent the magnitude and direction of the velocity of a moving particle, respectively.

To figure out the mobility of the particles in each clip in a statistical sense, we establish a histogram with m M × m A bins of equal intervals in both the magnitude and the angle of the vector of velocity. Here, m M represents the number of the magnitude intervals and m A that of the angle intervals. We assign the particles into such m M × m A bins according to their velocity magnitudes and directions. When computing the histogram, we fix the upper bound of the magnitude of the velocity in accordance with the limited physical capability of human beings’ mobility. If the speed of a moving particle is greater than this value, we will confine it under the upper bound. After accumulating the particles of every frame in a given clip into one histogram, we obtain a histogram with N bins, where we let k n denote the number of the particles falling in the n th bin for 1 ≤ n ≤ N. Figure 4(a) shows an example, where a large number of particles are projected into 3 × 8 bins. To allow a more straightforward view of the histogram, we illustrate a block with deeper blue color in Fig. 4(b) if the corresponding bin contains more particles in Fig. 4(a). Every bin in the histogram as shown in Fig. 4 contains the particles with similar moving speeds and directions, which is one prototype of the particle motion pattern. Intuitively, the histogram can reveal the collective motion pattern of all particles from a global point of view by means of counting in a statistical sense how the moving particles of interest distribute over a couple of motion prototypes. Note that the histogram utilized in this study is different from what is applied in [30], where motion vectors of foreground pixels are decomposed into vertical and horizontal directions such that two histograms are constructed, respectively, for projection along either direction.

Fig. 4
figure 4

a Partition of bins with regard to magnitude and direction of velocity of particles. b Histogram of particle distribution

3.2 Boltzmann Entropy

Although histogram is a widely used feature extraction means in terms of computer vision, the spatial variation and high dimensionality make the aforementioned histogram-based pattern description not suitable for discerning the distinction between normal and abnormal collective motion patterns. According to our observation, in response to emergency events, most people will behave in an abnormal manner with a highly disordered mobility pattern from a global point of view. This motives us to make use of Boltzmann Entropy [8] in distinguishing the collective behavior of people under emergent circumstances from what they behave in normal cases since Boltzmann Entropy is a measure of the disorder degree of a whole system from the perspective of molecule activities in terms of thermodynamics. In the following, we present how to compute Boltzmann Entropy from a given histogram with N bins, namely {k n  : 1 ≤ n ≤ N}, where k n denotes the number of particles falling into the n th bin. Note that for the total K particles, \( K=\sum_{n=1}^N{k}_n \), there are many possible arrangements of them that can lead to the same histogram. For the example shown in Fig. 5, we illustrate the moving directions and speeds of 3 particles as well as the corresponding histogram, where we use red, green, and blue to mark the 3 particles, respectively. It is apparent that the same histogram could result from a couple of different configurations of the particles of interest in terms of moving direction and speed. Counting all the cases to result in the given histogram, we obtain

$$ W={C}_K^{k_1}\cdot {C}_{K-{k}_1}^{k_2}\cdot \cdot \cdot \cdot {C}_{K-{k}_1-\cdots {k}_{N-1}}^{k_N} $$
(3)
Fig. 5
figure 5

In the left part, the arrow in each color represents the moving direction and speed of one particle and there are 3 particles arranged in 6 manners in this example. The histogram in the right part figures out the particles’ moving directions and speeds

where \( {C}_P^Q \) represents the combination of Q elements taken from the P elements in total. As an example illustrated in Fig. 5, in total, six arrangements of the three particles are able to produce the same one histogram, so the total number W of all the possible particle arrangements computed from the histogram is 6. By multiplying the natural logarithm of W with the so-called Boltzmann constant k B , Boltzmann Entropy is then defined as:

$$ BE={k}_B\cdot 1\mathrm{n}\kern0.1em W $$
(4)

It is obvious that the Boltzmann Entropy BE is totally determined by W. W is computed in a statistical sense and it reflects the disorder degree of the mobility of all the particles under surveillance. A larger W indicates high disorder degree of the movements of the particles. In contrast, a small W means highly ordered mobility pattern of the particles. An extreme case is: When all particles move in the same direction with the same speed, the value of W is 1, that is, only one prototype of particles’ motion exists. Based on the definition of Boltzmann Entropy in the context of thermodynamics, we reformulate it for video processing as follows:

$$ BE=k\cdot 1\mathrm{n}\left({\displaystyle \prod_{n=1}^N{C}_{K-{\displaystyle \sum_{j=1}^{n-1}{k}_j}}^{k_n}}\right) $$
(5)

where we let k = 1.38 × 10-5.

3.3 Boltzmann Entropy against Shannon Entropy

It is known that there are many definitions regarding entropy. In the following, we will compare Boltzmann Entropy with Shannon Entropy to justify that the pattern descriptor rooting in thermodynamics simulates and characterizes crowd mobility better due to the similar physical mechanism.

Except for Boltzmann Entropy in the literature of thermodynamics, more widely used entropy in the context of information sciences is Shannon Entropy, also referred to as Information Entropy [21]. Shannon Entropy is a measure of uncertainty regarding a random variable. Consider a given histogram {k n  : 1 ≤ n ≤ N} with the associated probabilities \( \left\{{p}_n=\frac{k_n}{K}:1\le n\le N\right\} \), where k n denotes the number of particles falling into the n th bin, N is the number of histogram bins, \( K=\sum_{n=1}^N{k}_n \) denotes the total number of particles, p n  ∈ [0, 1], and \( \sum_{n=1}^N{p}_n=1 \). The expression of Shannon Entropy is the expected value of -log2 p n . The formal definition is

$$ SE=\mathrm{E}\left(\mathrm{I}\left({p}_n\right)\right)={\displaystyle \sum_{n=1}^N{p}_n}\cdot \mathrm{I}\left({p}_n\right)=-{\displaystyle \sum_{n=1}^N{p}_n}\cdot { \log}_2{p}_n $$
(6)

In the case of p n  = 0, the value of the corresponding item 0log20 is set to be 0. If Shannon Entropy is zero (SE = 0), it means that there is no uncertainty. Because p n  ∈ [0, 1], Shannon Entropy is always non-negative (SE ≥ 0).

Almost all of the previous entropy-related approaches for anomaly detection [6, 23] are based on Shannon Entropy and its variant. An example is as follows: A non-extensive entropy is used in [23] for increasing the nonlinearity of the exponential terms, which is defined in a quadratic exponent form.

$$ NE=\mathrm{E}\left(\mathrm{I}\left({p}_n\right)\right)={\displaystyle \sum_{n=1}^N{p}_n}\cdot \mathrm{I}\left({p}_n\right)={\displaystyle \sum_{n=1}^N{p}_n\cdot {e}^{-{p}_n^2}} $$
(7)

We compare Shannon Entropy with Boltzmann Entropy in the following two tests. First, as shown in Fig. 6(a) and (b), we form two histograms with respect to the magnitude and direction of the velocity of particles and then compute the entropy values from each histogram using Shannon Entropy and Boltzmann Entropy, respectively. Based on this example, we demonstrate the primary distinction between Boltzmann Entropy and Shannon Entropy. Second, we calculate the entropies of every clip in the 11 video sequences of three scenes of the UMN dataset. The range of each abnormal event is labeled as red shadow as shown in Fig. 7. The waveforms of the calculated entropy values are also shown in Fig. 7.

Fig. 6
figure 6

Shannon Entropy and Boltzmann Entropy for two histograms

Fig. 7
figure 7

Curves of entropy values against clips for the three scenes of the UMN dataset. The blue curve represents Boltzmann Entropy and the green one Shannon Entropy. The shadow in red color means anomaly. The online monitoring images are attached to the peaks of the curves

According to the results of the above two tests, we find that Shannon Entropy and its variants are limited in the following aspects: (1) Shannon Entropy is not subject to the total number of particles but fully determined by the ratio of the particle number in every bin to the total number of particles. As shown in Fig. 6(a) and (b), the Shannon Entropy value remains the same as SE(a) = SE(b) for the two cases, where the particles are uniformly distributed in every bin but the total number of particles is different. (2) As shown in Fig. 7, sometimes Shannon Entropy reaches maxima when the pedestrian evacuation is going to finish, which can be observed in the video examples attached to the corresponding green maxima points. The reason is: Due to the limit as shown in Fig. 6, Shannon Entropy and its variants cannot distinguish the histograms resulting from the motion of a small group of people from those corresponding with the motion of a large group of people. That is, even when only a few people are in fast moving state, the corresponding Shannon Entropy could be unreasonably high. This does not meet the requirement of video surveillance since the abnormal activities of a large number of persons should attract much more attention in practice. Therefore, Shannon Entropy and its variants are not suitable for monitoring the crowd movement in a whole scene.

In contrast, Boltzmann Entropy does not suffer from such shortcomings as Shannon Entropy. Boltzmann Entropy is sensitive to the particle number as BE(a) < BE(b) in Fig. 6. Even when the distribution of the particles follows the same probability density in the histograms, more particles taking part in the crowd motion will lead to higher value of Boltzmann Entropy. In Fig. 7, the waveform of Boltzmann Entropy exhibits a Gaussian function like shape when abnormal events occur. At the beginning of the crowd commotion, when some people are trying to leave their positions with atypical motions, correspondingly, the calculated Boltzmann Entropy begins to increase rapidly. When almost all the people under surveillance are taking part in the fast moving, the value of Boltzmann Entropy reaches the peak accordingly. When the pedestrian evacuation is finished, the Boltzmann Entropy value becomes extremely small and tends to be zero. This shows that Boltzmann Entropy can reflect the disorder degree of crowd movement in a whole scene, and the co-occurrence between crowd commotion and peak value of Boltzmann Entropy is obvious. This enables a simple but effective way to detect abnormal crowd commotion by applying just a threshold to the waveform of Boltzmann Entropy, which avoids error-prone and time-consuming object detection and tracking, or modeling complex interactions among objects.

3.4 Anomaly detection

Suppose that BE i is the Boltzmann Entropy value for the i th clip. It is obvious that {BE 1, BE 2,  ⋯ , BE T } forms a time series provided the video sequence has T clips in total. As a result, the difficult problem of abnormal event detection in video surveillance is converted into a very simple problem of identifying outlying values in a time series. As mentioned previously, for a video clip, a greater W or BE corresponds with a higher disorder degree of the group motion, which could indicate the panic behavior of people in emergency cases. The evolution of crowd motion is usually as follows: As normal cases dominate the major portion of surveillance video sequences, people should behave normally with relative low W or BE. When an emergent event takes place, people will react to it and their collective behavior will turn from normal pattern to abnormal pattern in a very short time. This causes the feature value W or BE to increase suddenly and drastically. Correspondingly, there should appear a turning point in the time series {BE 1, BE 2,  ⋯ , BE T }. To capture such turning points, we further process the original time series into a difference sequence

$$ \left\{0,\left(B{E}_2-B{E}_1\right),\cdots, \left(B{E}_T-B{E}_{T-1}\right)\right\}=\left\{{d}_1,{d}_2,\cdots, {d}_T\right\} $$
(8)

where the difference between the entropy of the present clip and that of the previous clip is computed. In normal cases, the difference value between two consecutive clips should be small. In abnormal cases, such difference value will become remarkably higher than that in normal cases. It is known that abnormal events occur rarely. Thus, we need to establish a model corresponding with normal cases at first, and then detect abnormal event as those of low probability under the model. Here, we establish a Gaussian model N(μ, σ 2) from {d 1, d 2,  … , d M } of normal cases with the mean and standard deviation to be

$$ \mu =\frac{1}{M}\cdot {\displaystyle \sum_{i=1}^M{d}_i} $$
(9)

and

$$ {\sigma}^2=\frac{1}{M-1}\cdot {\displaystyle \sum_{i=1}^M{\left({d}_i-\mu \right)}^2} $$
(10)

The Gaussian model N(μ, σ 2) is applied to calculate the probability that every entropy difference value may appear under the constraint of such model. Since the decrease of Boltzmann Entropy means that group motion becomes more ordered, we exclude the clips with negative entropy difference values directly without any further decision making on them. To enable reliable decision making, we perform smoothing as follows:

$$ \mathrm{If}\;5{d}_i<B{E}_i-B{E}_{i-2},\left({d}_i+B{E}_i-B{E}_{i-2}\right)/2\to {d}_i $$
(11)

4 Experiments

The method is tested on the publicly available dataset of normal and abnormal videos from the University of Minnesota (UMN) [26], and the crowd motion videos collected from Internet by us. To quantitatively evaluate different methods in anomaly detection, we manually mark the ground truth for all the videos following the guidelines that abnormal crowd motion is a small-probability event. We take 10 frames as a clip and there is no overlap between two successive clips. Hence, a clip is a spatial-temporal matrix of w × h × 10, where w × h represent the resolution of each image in the video. As observed in the experiments, the proposed method yields similar results when adopting different optical flow algorithms such as Brox, Farneback, PyrLK, and Horn-Schunck. Here, we use the Brox algorithm [3]. In the following, we will evaluate the discriminant power of Boltzmann Entropy [8] as a feature for anomaly detection, and then compare the proposed method with the methods based on pure optical flow, social force model (SFM) [14], particle swarm optimized-social force model (PSO-SFM) [17], chaotic invariants [20], sparse reconstruction cost [32], Shannon Entropy [21], and non-extensive entropy [23] on the UMN dataset. Furthermore, we examine the performances of various low-complexity real-time methods using the more complex web dataset in terms of anomaly detection and localization.

4.1 The power of Boltzmann Entropy as a feature

We evaluate the discriminant power of Boltzmann Entropy as a feature for anomaly detection on the UMN dataset. The dataset comprises 11 video sequences of 3 different indoor and outdoor scenes and each sequence begins with normal behavior, followed by a portion of abnormal panic such as running and dispersion. The duration of the videos is 4.17 min (7739 frames), and the resolution of each frame is 320 × 240 pixels. Given a video sequence of U clips, we denote the Boltzmann Entropy features of all the clips as {F 1, ⋯,F U }. We define the outlier degree of F t as

$$ O{D}_t={\displaystyle \sum_{u=1}^U{\left\Vert {\boldsymbol{F}}_t-{\boldsymbol{F}}_u\right\Vert}_1} $$
(12)

which is the sum of the L1 norm distance between F t and all the other features. Then, we sort the outlier degrees {OD 1, OD 2, ⋯,OD U } in descending order to seek anomalies. We perform the above computations for each of the 11 video sequences in the UMN dataset. As a result, 11 sorted lists are obtained in total. TotalAnomaly represents the total number of frames involving abnormal behaviors in the 11 video sequences of the UMN dataset. Suppose that in the top-N ranked OD values of the i th video sequence of the UMN dataset, there are True i (N) frames of true anomalies. Then, the precision and recall regarding the 11 video sequences can be computed as follows:

$$ P(N)=\frac{{\displaystyle \sum_{i=1}^{11}Tru{e}_i(N)}}{11\times 10\times N} $$
(13)

and

$$ R(N)=\frac{{\displaystyle \sum_{i=1}^{11}Tru{e}_i}(N)}{Total\ Anomaly} $$
(14)

Precision concerns how many frames in the detected list are true anomalies while recall figures out how many true anomalies have been identified from all the true anomalies. The precision and recall rates against N under different parameters are shown in Fig. 8. As can be seen in Fig. 8, the precision and recall rates vary little with different parameter setting of m M and m A. This shows that Boltzmann Entropy is not sensitive to partition of histogram bins. In this sense, Boltzmann Entropy is a robust feature. Figure 8 also shows that Boltzmann Entropy possesses the discrimination power to distinguish anomalies to some extent. Yet, this should be further examined since the time series of Boltzmann Entropy should provide more clues in the sequential arrangement of the values. As shown in Fig. 7, in normal cases, the Boltzmann Entropy will remain at low level continuously until an abnormal event takes place. Then, the Boltzmann Entropy will go up suddenly and keep on high values in a couple of consecutive clips. The consistency of the sequence lasting for a time and the turning point can provide extra information in judging anomalies. Accordingly, we can determine the starting time of an abnormal event via a threshold. The threshold can be applied to detect both suddenly appearing abnormal events and slowly appearing abnormal events.

Fig. 8
figure 8

The precision and recall rates under different parameters on the UMN dataset

As mentioned previously, the abnormal crowd movements are usually visible at the turning points of the time series of feature values. Therefore, we figure out the difference sequences of both Boltzmann Entropy and Shannon Entropy in Fig. 9(a) instead of the original time series to allow an intuitive insight into their pattern classification power. Moreover, in Fig. 9(b), (c), (d), and (e), we illustrate the histograms of the time sequences shown in Fig. 9(a) to reveal the distribution of the differential feature values belonging to normal and abnormal cases in terms of Boltzmann Entropy and Shannon Entropy, respectively. It is observed that the pattern classification power of Boltzmann Entropy is remarkably better than that of Shannon Entropy due to the much less overlap between the two classes in correspondence with normal and abnormal moments.

Fig. 9
figure 9

Difference sequences of Boltzmann Entropy and Shannon Entropy and the corresponding histograms of a video sequence in the UMN dataset. a The differential sequences of 4 cases, where the shadow in red color corresponds with crowd commotion; b, c, d, e The histograms corresponding with the 4 cases, respectively

Furthermore, to examine quantitatively the discriminant power of the proposed feature in comparison with that of the Shannon Entropy feature, here, we employ the measure referred to as R-Metric [31] to evaluate the ability of either feature in distinguishing abnormal patterns from normal ones. Suppose that the two classes of the differential values of any entropy feature corresponding with normal and abnormal events are denoted as \( \left\{{d}_1^x,{d}_2^x,\cdots, {d}_X^x\right\} \) and \( \left\{{d}_1^y,{d}_2^y,\cdots, {d}_Y^y\right\} \), respectively, where X and Y represent the numbers of the samples belonging to the two classes, respectively. Prior to defining R-Metric, the relationship between \( {d}_i^x \) and \( {d}_j^y \) is quantifying through

$$ H\left({d}_i^x,{d}_j^y\right)=\left\{\begin{array}{cc}\hfill 1\hfill & \hfill \left|{d}_i^x\right|<\left|{d}_j^y\right|\hfill \\ {}\hfill 0\hfill & \hfill \left|{d}_i^x\right|\ge \left|{d}_j^y\right|\hfill \end{array}\right. $$
(15)

where i ∈ [1, X] and j ∈ [1, Y], and |∙| means absolute value. Then, R-Metric is defined by taking into account the relationships between all the samples in the two classes, that is,

$$ RM= \max \left\{{\displaystyle \sum_{i=1}^X{\displaystyle \sum_{j=1}^YH\left({d}_i^x,{d}_j^y\right),{\displaystyle \sum_{j=1}^Y{\displaystyle \sum_{i=1}^XH\left({d}_j^y,{d}_i^x\right)}}}}\right\} $$
(16)

A bigger RM value corresponds with better separability between two classes since more samples belonging to the same class are consistently sorted into either of the two classes consisting of small values or big values and there is less overlap between the two classes. In Table 2, we list the comparison of the two features in terms of the RM value. It is notable that the RM values based on Boltzmann Entropy are obviously greater than those based on Shannon Entropy in almost all scenarios, except for the 11th video sequence. In an overall sense, Boltzmann Entropy is significantly superior to Shannon Entropy in discriminating anomalies.

Table 2 R-Metric of Boltzmann Entropy and Shannon Entropy features on each video sequence of the UMN dataset (SN: Serial number of video sequence; SE: Shannon Entropy)

4.2 Evaluation on UMN dataset

Prior to anomaly detection, we establish a Gaussian model for each of the three scenes of the UMN dataset, where we select the first 30 clips from each video sequence to compute the difference sequence of Boltzmann Entropy as described previously. Then, we compute the mean and standard deviation of the difference values to render the corresponding Gaussian model. Once the Gaussian model corresponding with normal cases is established, then, we conduct the anomaly detection test as follows: If two consecutive clips appear with the probability lower than 0.1 under the corresponding Gaussian model, we label them as the start of an abnormal event. We choose the moment when Boltzmann Entropy reduces to 0.1 as the end of an abnormal event.

The detection results by using the proposed method on the 3 scenes composed of the 11 video sequences of the UMN dataset are shown in the 3 rows of Fig. 10 against the ground truth, where each red interval represents the lasting of an abnormal event in each video sequence while the green portions represent the course of normal situations. Note that there is no sole false alarm interval in the detection bars and every abnormal event in ground truth has been detected using the proposed method. In another words, there is no false alarm.

Fig. 10
figure 10

Results of abnormal activity detection on the UMN dataset. Each row represents one scene. The two bars underlying each row are the detection result against the ground truth, where the green color represents the normal frames and the red color corresponds to the abnormal frames. Above the bar, the left column is an example of the normal frames and the right column is an example of the detected abnormal frames

In Table 3, we compare the proposed method with optical flow [1], social force model [14], particle swarm optimized-social force model (PSO-SFM) [17], chaotic invariants [20], sparse reconstruction cost [32], Shannon Entropy [21], and non-extensive entropy [23] on the UMN dataset in terms of both accuracy and speed for abnormal crowd behavior detection. Here, we employ the area under ROC (receiver operating characteristic curve) curve (AUC) to measure the accuracy and the frame rate, say the number of frames processed per second (fps) to indicate the speed of each method in terms of computing the corresponding feature.

Table 3 Comparison of different methods for anomaly detection on the UMN dataset

The baselines methods for comparison fall into 3 categories: (1) The optical flow [1] method detects anomalies from optical flows directly. Since no feature extraction is applied, it is obvious that the AUC of the optical flow method is much inferior to the performances of the other methods, which leads to the lowest AUC of 0.84 while those of the other ones are all above 0.94. The comparison shows that feature extraction is critical in differentiating abnormal and normal crowd motions. This shows why the effort to seek effective features is meaningful. The Shannon Entropy [21] and its variant, namely non-extensive entropy [23], are similar to the proposed method in that all the entropy-based features can be computed efficiently with the speed of roughly 100 fps. However, due to the different physical nature as demonstrated in Section 3.3, Shannon Entropy and non-extensive entropy lead to much poorer performance for scene 2, which are 0.913 and 0.89, respectively, in comparison with the 0.986 AUC of the proposed method. Note that scene 2 is an indoor scene with spatial constraint to force the evacuation behavior of people in panic homogenous, that is, the moving directions and speeds of particles are mostly the same. This is a difficult task for Shannon Entropy, which accounts for why the overall performances of Shannon Entropy and non-extensive entropy are inferior to that of the proposed method, that is, 0.943 and 0.95 against 0.985. As for sparse reconstruction cost [32], although the accuracy is 0.978 AUC, which is a bit lower than the 0.985 AUC of the proposed method, its speed is lower than 20 fps due to the time-consuming complex modeling.

(2) The performance of the social force model is 0.96, which is lower than the 0.985 AUC of the proposed model. Note that the original social force model proposed by Dirk Helbing is focused on modeling the interactions among individual persons [9]. In the computer vision context [14], however, due to the difficulty in segmenting every individual person in crowded scenes, the definition of particles is relaxed to be collections of optical flows instead of persons, which deviates from the original definition due to splitting one person into a couple of particles. This should be the major reason to degrade the performance of the social force model. To solve this problem, particle swarm optimization-based social force model is proposed to reorganize the computation with selective particles. It leads to the highest AUC of 0.996 among all the competitive methods. However, the performance improvement is obtained at the cost of decreasing the speed of the original social force model from 56 fps to the less than 1 fps, which is remarkably out of the scope for real-time video surveillance.

(3) As for the chaotic invariants, 0.99 AUC is achieved in [20] but the authors report only the performance on the 6 selective video sequences, not all the 11 video sequences of the UMN dataset. The chaotic invariants need to solve the evolution equations of the particle positions to obtain particle trajectories. Furthermore, it iteratively clusters the particle trajectories to extract representative trajectories for computing the largest Lyapunov exponent and the correlation dimension, which measure how much neighboring trajectories deviate from their original closeness to each other after a certain steps of evolution. Here, a huge number of particles are involved into the evolution and iteration processes, which makes the computational cost of the chaotic invariants high. So it an impractical solution on account of the heavy computation caused less than 5 fps low speed, which is far from real-time realization. Besides, for the scenes that the people’ movements are constrained in a narrow space such as in corridors and underpasses, the evolution of the trajectories might not follow the precondition of chaos, which assumes that neighboring trajectories will increasingly fall apart accompanying elapse of time. In such a case, Lyapunov exponent may not reflect exactly the chaotic degree of the collective human mobility corresponding with anomaly due to the constrained evolution of the trajectories.

On account of the above comparison, we see that the proposed method promises the highest accuracy among all the real-time workable methods for abnormal crowd behavior detection.

4.3 Anomaly detection and localization on web data

There are many real-world crowd motion videos. We collect 32 challenging video sequences of different scenarios from Internet as additional benchmark to further evaluate the various methods that promise real-time surveillance. Figure 11 shows the snapshots of 6 of the 32 video sequences. The first video provides a top view of the rapid evacuation of the crowd shocked by high-altitude falling. The second one records the terroristic attack in Kunming train station in March 2014, where people are running away during the suddenly happened terroristic attack. The third video depicts the anti-government activities in Bangkok 2014 during the bomb attack. The fourth video records how policemen disperse the crowd using water gun. The fifth one provides a high altitude view of the scenario during the gunfight between arrest resistant and police. The sixth one is a surveillance video showing the fighting in a supermarket. The other videos of the web dataset record more crowd events, which are available online.Footnote 1

Fig. 11
figure 11

Localization of abnormal behaviors. The left column is the original frames. The middle column is the localized abnormal behaviors using Boltzmann Entropy. The right column shows those obtained using Shannon Entropy. Highlighted squares correspond to the highly possible abnormal regions

We locate the positions corresponding with abnormal behaviors in crowded scenes through Boltzmann Entropy and Shannon Entropy for the image patches of uniform size. Some results for the 6 scenes from the web dataset are shown in Fig. 11, where the regions of anomalies that possess higher entropy values in contrast to the surroundings are highlighted with red squares in the frames. It can be observed that the precision to localize anomaly regions based on Boltzmann Entropy is much higher than that based on Shannon Entropy as Shannon Entropy-based localization is sometimes over-segmented. The results demonstrate Boltzmann Entropy is a general model that can be used to detect global abnormal events as well as localize abnormal activities.

In Fig. 12, we show how Boltzmann Entropy changes with time. In the 3 cases shown in Fig. 12, originally, the Boltzmann Entropy remains at a relatively low level but when an abnormal event happens to invoke people’s chaotic movements for escaping, the Boltzmann Entropy goes up suddenly and violently. Such examples show that the proposed feature is effective in dealing with real-world data.

Fig. 12
figure 12

Curves of Boltzmann Entropy values against clips for 3 scenes of the web dataset. The shadow in red color corresponds with the lasting time of crowd commotion. The online monitoring images are also attached to the corresponding clips

Moreover, we compare the proposed method with the optical flow method [1], social force model [14], Shannon Entropy [21], and non-extensive entropy [23] on the total 32 web videos since these methods work in real time on the UMN dataset. In the experimental setting, the same number of normal video frames are used to calculate the features for each method so as to establish the Gaussian model corresponding with the normal cases, where a multivariate Gaussian model is used for the optical flow histogram [1]. When 3 consecutive clips corresponding with a probability lower than 0.1 appear using such Gaussian model, an abnormal event is alarmed. The results are illustrated in Fig. 13, where we see that the proposed feature consistently leads to relatively accurate anomaly detection close to the ground truth in almost all such 32 scenarios while the other methods do not work satisfactorily for some of the 32 cases. The precise performance data are reported in Table 4, where we see the same results as shown in Fig. 13. The AUCs (area under ROC curve) with regard to the 32 video sequences listed in Table 4 show that the proposed method yields the best performance in 14 cases while in the other cases, its performance is very close to the best one in comparison with the optical flow [1], social force model [14], Shannon Entropy [21], and non-extensive entropy [23].

Fig. 13
figure 13

Performance comparison regarding anomaly detection on the web dataset. For each video sequence, the color bars from top to bottom represent the ground truth, the results of the proposed method, Shannon Entropy, non-extensive entropy, optical flow, and social force model, respectively, where the red color corresponds with abnormal cases and the other colors the normal ones

Table 4 AUC by using different methods on the web dataset. SN: Serial number of each video sequence; BE: The proposed method; OF: Optical flow method; SE: Shannon Entropy; NE: Non-extensive entropy; SFM: Social force model

The performance curves of those methods are compared in Fig. 14. The AUC of the proposed method is the highest one and the only one over 0.9. According to Table 3, Table 4, and Fig. 14, the performance of social force model degrades dramatically on the real scenarios since it fails to detect 11 video sequences of the web dataset in Fig. 13 and only 0.736 AUC is achieved in contrast to the 0.96 AUC on the UMN benchmark. Its performance is only better than using optical flow feature directly as the AUC of the optical flow method drops from 0.84 on the UMN benchmark to 0.689 on the real-world data. This manifests that the social force model in the sense of computer vision deviates far from the nature of human dynamics in complex crowded scenarios. As shown in Table 3 and Fig. 14, the AUCs of the Shannon Entropy and its variant, non-extensive entropy decrease from 0.943 and 0.95 to 0.860 and 0.842, respectively, when the benchmark changes from the UMN dataset to the web dataset. Note that in Fig. 13, Shannon Entropy and non-extensive entropy do not work on 4 and 6 video sequences, respectively. Shannon Entropy and non-extensive entropy, which is a variant of it, can only reflect the probability of the distribution of particles but miss the total amount of the particles involved in crowd motion but the chaos degree of crowd motion is in fact subject to the number of active particles. This explains why the performances of the two entropy-based methods are similar and worse than that of the proposed method. The best and stable detection performance on the complex scenarios is obtained by using the proposed feature because Boltzmann Entropy directly characterizes the macroscopic chaos degree of crowd motion through microscopic particle movements in term of thermodynamics, where the AUC remains stable at 0.985 and 0.947 on both benchmarks. In accordance with Fig. 13, it is the only method that works for all the 32 challenging video sequences. It obtains the best performance on nearly half of the 32 video sequences according to Table 4 and the best overall performance in view of Fig. 14. These show that the proposed method is a universally promising feature.

Fig. 14
figure 14

Comparison of the detection results using different methods on the web dataset. OF: Optical flow; SE: Shannon Entropy; NE: Non-extensive entropy; SFM: Social force model

5 Conclusion

Rooting in microscopic statistical theory, we introduce Boltzmann Entropy to capture the chaos degree of crowd behavior for video surveillance. The advantages lie in the following three aspects: (1) The Boltzmann Entropy characterizes the chaos heat motion of molecules, which is similar to the physical nature of crowd motion in surveillance videos. This accounts for why it performs with satisfactory accuracy on anomaly detection. (2) The Boltzmann Entropy feature is extracted from optical flow directly without the need to detect or track objects individually. (3) As the crowd motion is represented in a very simply and efficient way as a single time series, only a threshold is needed for decision making. This avoids complex and time-consuming machine learning on large volume training data and makes online monitoring a practical task. The proposed method promises competitive performance in detecting and localizing abnormal behaviors in crowded scenes while the computation is much more simple and efficient in comparison with the existing methods. A detailed comparison is summarized in Table 5, where we see that the proposed method is the unique one that promises high precision on different scenes and datasets, and can work on all the 32 challenging video sequences in the sense of anomaly detection.

Table 5 Conclusive comparison