1 Introduction

With the development of robot technology, cloud computing, and the Internet of Things, cloud service robots have been deployed in many public places, such as airports, bus stops, shopping malls, and subway stations. They can guide people to public transportation, assist with shopping, and, by detecting abnormal events in real time, serve as safety guards. Real-time crowded scene analysis is a very difficult task for social robots because of the inherent complexity and vast diversity of such scenes, including illumination changes, low resolution, scene depth, and camera position. With the development of computer vision, various vision-based approaches have been proposed to detect abnormal events in surveillance scenes; background modeling [15], sparse representation [4], object tracking [19], face recognition [20], and people counting [10] are considered the fundamental elements of an intelligent surveillance system for anomaly detection. In this paper, we aim to detect abnormal crowd behaviors [7, 9, 12, 16] in real time based on computer vision. A variety of algorithms have been proposed to detect abnormal events in scenes; according to the scene representation, these approaches can be divided into three categories: trajectory based approaches [8, 18, 19, 21], patch based approaches [5, 6, 13, 18], and sparse coding based approaches [2, 7, 11, 14].

In [8], a normal dictionary set was constructed by collecting trajectories of normal behaviors and extracting the control point features of cubic B-spline curves; the dictionary was further divided into Route sets. Sparse reconstruction coefficients and residuals of a test trajectory with respect to the Route sets were calculated with trajectory sparse reconstruction analysis (SRA). A new descriptor named Social Affinity Maps (SAM) [21], together with priors over origin and destination (OD-prior), was proposed to understand crowd behavior at the scale of millions of pedestrians for human mobility in crowded spaces such as city centers or train stations. In [18], a robust algorithm was proposed to detect stationary group activities and understand crowd scenes; a locally shared foreground codebook was used to shape the 3D stationary time map.

A collectiveness descriptor for a crowd and its constituent individuals, along with an efficient computation, was proposed in [6]. In [5], a novel patch entropy approach was introduced to represent the crowd distribution information, and optical flow was used to describe the crowd speed; a Gaussian Mixture Model (GMM) trained over normal crowd behaviors was then used to predict anomalies in the detection stage. A hybrid agent system [13] was used to detect abnormal behaviors in crowded scenes; it included static and dynamic agents to efficiently observe the corresponding individual and interactive behaviors, and the crowd behaviors were represented as a bag of words by integrating the static and dynamic agent information.

The Social Force model [14] treated moving patches as individuals; their interaction forces were estimated and mapped into the image plane to obtain a Force Flow for every pixel in every frame, and randomly selected spatio-temporal volumes of Force Flow were used to model the normal behaviors of the crowd. In [11], exploiting the inherent redundancy of video structures, an efficient sparse combination learning framework was proposed for abnormal behavior detection; it achieved high detection rates on benchmark datasets at an average speed of 140\(\sim \)150 frames per second in MATLAB on an ordinary desktop PC. In [2], unlike existing approaches based on sparse coding, the abnormal event detection model directly sparsely coded the motion features of the center patch with the features of its surrounding patches.

In this paper, we take advantage of the distribution of patches in the frame to simulate the distribution of individuals in the crowd, and of the speed of those patches to simulate the speed of the individuals. We propose a patch moving energy to effectively represent the crowd distribution information. As the number of frames with abnormal crowd behaviors is only a small portion of the entire video, abnormal crowd behavior detection is clearly an unbalanced problem. In this paper we simultaneously use the crowd speed and the distribution information to predict abnormal crowd dispersion behaviors. Comparison experiments conducted on surveillance datasets validate the advantages of the proposed algorithm.

The rest of the paper is organized as follows: Sect. 2 introduces the procedures of the proposed algorithm, Sect. 3 presents the experimental results and comparisons with the state-of-the-art algorithm, and Sect. 4 concludes the paper.

2 The Proposed Algorithm

2.1 Crowd Aggregation Detection

In the field of public security, massive mass incidents often evolve from small crowd gatherings. Therefore, detecting moderate-scale crowd aggregation and raising the corresponding alarms are crucial for social robots used for surveillance. A gathered crowd cannot be measured with optical-flow-based algorithms because the crowd remains relatively static in a particular area for a period of time. Hence, a crowd aggregation detection approach based on real background modeling and a hierarchical alarm algorithm is proposed in this paper. The algorithm is illustrated in Fig. 1 and described in the following subsections.

Fig. 1. The framework of the proposed crowd aggregation detection.

Background Modeling. In order to model the background in the video scenes, a robust Pixel-to-Model (P2M) background modeling and recovery approach [17] is used in our work. Each pixel is represented by a context feature which consists of local compressive descriptors. The novel P2M distance is employed to classify the potential background pixels. Furthermore, the P2M distances are also utilized to adaptively update the background model in the space of local descriptors in a smooth and efficient way. The P2M based background recovery can robustly reconstruct the clean background and suppress real-world noises.

As shown in Fig. 1, two different kinds of background are computed for crowd aggregation detection: a clean background, called the real background, and the dynamic background. At time t, the P2M-based dynamic background of the scene is denoted as \(B_{dy}(t)\). The real background \(B_{real}\) is computed by

$$\begin{aligned} B_{real} = \frac{1}{N} \sum _{n=1}^N b_{n} , \end{aligned}$$
(1)

where \(b_{n}\) is a randomly selected background image from \(B_{dy}(t)\) with \(t \in [0,T]\) and N is the number of randomly selected background images. In practice, the real background can also be selected manually for better performance.
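
As a minimal sketch of Eq. (1), assuming the P2M model exposes its dynamic backgrounds as a list of grayscale frames; the function name, the sampling strategy, and the default number of samples below are illustrative, not part of the original implementation:

```python
import numpy as np

def real_background(dynamic_backgrounds, n_samples=10, rng=None):
    # Eq. (1): B_real = (1/N) * sum_n b_n, with b_n randomly drawn from the
    # P2M dynamic backgrounds B_dy(t), t in [0, T].
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(dynamic_backgrounds), size=n_samples, replace=False)
    samples = np.stack([dynamic_backgrounds[i].astype(np.float32) for i in idx])
    return samples.mean(axis=0)
```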

Event Detection. One of the main characteristics of a gathered crowd is that it stays relatively static within a certain area for a period of time. In order to extract the “static” people from the scene, an \(i \times j\) grid of patches is placed over every difference image between the clean background and the dynamic background, where each patch \(P_{(i,j)}\) has size \(m \times n\). The difference image is binarized, so all pixels with value 1 represent the “static” foreground, and these static pixels describe the gathered crowd in the scene. As we use the “static” patches to represent the gathered crowd, we count them as follows: the difference image before binarization at time t is denoted as D(t) and calculated by Eq. (2), and the value of a patch \(V_{P_{(i,j)}}\) is defined by the proportion of non-zero pixels in it, as in Eq. (3)

$$\begin{aligned} D(t) = | B_{real} - B_{dy}(t) | , \end{aligned}$$
(2)
$$\begin{aligned} V_{P_{(i,j)}} = \left\{ \begin{array}{ll} 1, &{} \ if \sum _{i=1}^m \sum _{j=1}^n P_{i,j}(x,y) \ge \frac{m \times n}{T_p}\\ 0, &{} \ otherwise \end{array} \right. \!\!, \end{aligned}$$
(3)

where \(P_{i,j}(x,y)\) is the value of the pixels in patch \(P_{(i,j)}\) of the binary difference image, and \(T_p\) is the threshold that defines a static patch. The value of \(T_p\) is 8 in this work.
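
The following sketch illustrates Eqs. (2) and (3). The binarization level `bin_thresh` is not specified in the paper and is an assumed value, while \(T_p = 8\) follows the text:

```python
import numpy as np

def static_patch_map(b_real, b_dy, patch_h, patch_w, bin_thresh=30, t_p=8):
    diff = np.abs(b_real.astype(np.float32) - b_dy.astype(np.float32))  # Eq. (2)
    binary = (diff >= bin_thresh).astype(np.uint8)
    rows, cols = binary.shape[0] // patch_h, binary.shape[1] // patch_w
    v = np.zeros((rows, cols), dtype=np.uint8)
    for i in range(rows):
        for j in range(cols):
            patch = binary[i * patch_h:(i + 1) * patch_h,
                           j * patch_w:(j + 1) * patch_w]
            # Eq. (3): the patch is "static" if at least (m*n)/T_p pixels are foreground
            v[i, j] = 1 if patch.sum() >= (patch_h * patch_w) / t_p else 0
    return v
```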

A crowd aggregation area is composed of multiple adjacent static patches, so a weighted sliding window is used to detect crowd aggregation. The size of the window is an integer multiple of \(m \times n\), and its sliding step length is also an integer multiple of the patch size. The value of the window is defined as

$$\begin{aligned} V_{w} = \sum _{a=1}^A \sum _{b=1}^B \lambda _{c1} V_{P_{(a,b)}}, \end{aligned}$$
(4)

where A and B are the numbers of patches covered by the window along each dimension, i.e., the window width and height are integer multiples of m and n, respectively. \(\lambda _{c1}\) is a compensation parameter from camera calibration, which increases the contribution of patches far from the camera. A group of thresholds \(\theta _{l_i}\) is used to classify the different levels of crowd aggregation according to the value of \(V_{w}\). An example of crowd aggregation detection is shown in Fig. 2.
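
A possible realization of the weighted sliding window of Eq. (4) is sketched below; the per-patch weight map `lam_c1`, the alarm thresholds `levels`, and the step size are scene-dependent inputs assumed to be given:

```python
import numpy as np

def aggregation_alarms(v_patches, lam_c1, win_rows, win_cols, levels, step=1):
    # Eq. (4): V_w = sum_a sum_b lambda_c1 * V_P(a,b) over each window,
    # then map V_w to an alarm level via the thresholds theta_l.
    alarms = []
    rows, cols = v_patches.shape
    for i in range(0, rows - win_rows + 1, step):
        for j in range(0, cols - win_cols + 1, step):
            block = v_patches[i:i + win_rows, j:j + win_cols]
            weights = lam_c1[i:i + win_rows, j:j + win_cols]
            v_w = float((weights * block).sum())
            level = int(np.searchsorted(levels, v_w, side='right'))  # 0 = no alarm
            if level > 0:
                alarms.append(((i, j), v_w, level))
    return alarms
```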

2.2 Crowd Escape Detection

In real-life situations, crowd escape is marked by violent movement: sudden speeding up, chaotic movement in a restricted area, or movement contrasting with that of one’s neighbors, as in a panic situation. In statistical mechanics, entropy is used to measure uncertainty: the greater the entropy, the higher the disorder. A patch entropy approach was therefore proposed in [6] to estimate the distribution of the moving patches. Following [6], we propose a patch energy to simulate the distribution of the pedestrians in the crowd. The main steps of the patch energy approach are summarized in Fig. 3 and described in the following subsections.

Fig. 2. An example of crowd aggregation detection.

Fig. 3. The framework of the proposed crowd escape detection.

Calculate the Dense Optical Flow. Since only moving pedestrians can cause abnormal crowd behaviors, only they need to be considered when detecting crowd escape. We use moving patches to represent the moving pedestrians in this work, and extract them as follows. Firstly, the velocity of every pixel is calculated by dense optical flow [3]; in order to reduce the influence of illumination changes, the optical flow is averaged over several consecutive frames. The resulting flow map is shown in Fig. 4(a). Secondly, every map is divided into \(M \times N\) patches, and each patch’s velocity is estimated with the energy of motion computed from the velocities of its pixels, as described in the following subsection.
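
A sketch of the flow-averaging step is given below; it uses OpenCV’s Farnebäck dense optical flow as a stand-in for the method cited as [3], and the 5-frame averaging window is an assumed value:

```python
import cv2
import numpy as np

def averaged_flow(gray_frames, window=5):
    # Per-pixel (vx, vy) flow between consecutive grayscale frames, averaged
    # over the last `window` frame pairs to reduce illumination-induced noise.
    flows = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flows.append(cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0))
    return np.mean(np.stack(flows[-window:]), axis=0)
```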

Patch’s Energy of Motion. Assuming the size of every patch in the optical flow map is \(X \times Y\), a histogram of the patch is calculated from the velocities of its pixels (as shown in Fig. 4(b)), and every patch in the image has an energy of motion defined by Eq. (5). An example of patch energy change is shown in Fig. 4(c).

$$\begin{aligned} E_{p_{(i,j)}} = \frac{1}{2} \sum _{r=1}^H h_r v_r^2,\quad \sum _{r=1}^H h_r = X \times Y , \end{aligned}$$
(5)

where \(v_r\) is the velocity value of the rth bin in the histogram and \(h_r\) is the number of pixels falling in that bin.
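
The patch energy of Eq. (5) can be computed roughly as follows; the number of histogram bins and the maximum speed are illustrative choices not given in the paper:

```python
import numpy as np

def patch_energy(flow_patch, n_bins=16, v_max=20.0):
    # Eq. (5): E = 1/2 * sum_r h_r * v_r^2, where h_r counts the pixels whose
    # speed falls into bin r and v_r is the bin-centre speed.
    speed = np.linalg.norm(flow_patch.reshape(-1, 2), axis=1)  # |v| per pixel
    h, edges = np.histogram(speed, bins=n_bins, range=(0.0, v_max))
    v = 0.5 * (edges[:-1] + edges[1:])
    return 0.5 * np.sum(h * v ** 2)
```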

Moving Patches Extraction. We denote a patch as a “moving patch” if its energy of motion is greater than the threshold \(T_e\). In order to extract the moving patches, we compare every patch’s energy of motion with \(T_e\) in turn; the value of \(T_e\) is set empirically for different video scenes. The value of a patch \(V_{\widetilde{P}_{(i,j)}}\) is defined as

$$\begin{aligned} V_{\widetilde{P}_{(i,j)}} = \left\{ \begin{array}{ll} 1, &{} \ if E_{p_{(i,j)}} \ge T_e\\ 0, &{} \ otherwise \end{array} \right. . \end{aligned}$$
(6)

Event Detection. The more running directions there are, the more moving patches are involved in the escape area, so a weighted sliding window is used to detect crowd escape. The size of the window is an integer multiple of the patch size, and its sliding step length is also an integer multiple of the patch size. The value of the window \(V_{\widetilde{w}}\) is defined as

$$\begin{aligned} V_{\widetilde{w}} = \sum _{i=1}^A \sum _{j=1}^B \lambda _{c2} V_{\widetilde{P}_{(i,j)}} , \end{aligned}$$
(7)

where A and B are the numbers of patches covered by the window along each dimension, i.e., the window width and height are integer multiples of X and Y. \(\lambda _{c2}\) is a compensation parameter from camera calibration, which increases the contribution of patches far from the camera; its value grows with the distance between the camera and the real-world point. If \(V_{\widetilde{w}}\) is greater than the threshold \(\theta _e\), crowd escape behavior happens in this window area. As crowd escape usually involves plenty of people, we require that the behavior occur in more than 2 adjacent window areas; in this way, a lot of false positives are avoided. An example of crowd escape detection is shown in Fig. 4(d), where the blue boxes indicate the areas where the event happens.
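
A sketch of the window-level decision of Eq. (7), including the more-than-two-adjacent-windows rule, is shown below; the grid-aligned step and the shapes of the inputs are assumptions of this sketch:

```python
import numpy as np
from scipy.ndimage import label

def escape_mask(moving_patches, lam_c2, win_rows, win_cols, theta_e):
    # Eq. (7): flag a window when its weighted sum of moving patches exceeds
    # theta_e, then keep only groups of more than 2 adjacent flagged windows.
    grid = np.zeros((moving_patches.shape[0] // win_rows,
                     moving_patches.shape[1] // win_cols), dtype=bool)
    for i in range(grid.shape[0]):
        for j in range(grid.shape[1]):
            r, c = i * win_rows, j * win_cols
            block = moving_patches[r:r + win_rows, c:c + win_cols]
            w = lam_c2[r:r + win_rows, c:c + win_cols]
            grid[i, j] = (w * block).sum() > theta_e
    labels, n = label(grid)                      # connected flagged windows
    keep = [k for k in range(1, n + 1) if (labels == k).sum() > 2]
    return np.isin(labels, keep)                 # True where an alarm is raised
```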

3 Experiment Results and Analysis

3.1 Dataset

To validate the performance of the proposed algorithm, we test it on the UMN and RICOH datasets and compare it with the particle entropy algorithm.

The publicly available dataset of unusual crowd activities from the University of Minnesota (UMN) [1] is used to verify the effectiveness of the proposed abnormal crowd event detection algorithm. The dataset consists of 3 different indoor and outdoor scenes with escape events.

Fig. 4. An example of crowd escape detection.

The RICOH dataset consists of two kinds of videos captured by two different cameras (a TYZX camera and a Point-Gray camera). The videos from the TYZX camera include two dispersing events; they show a low-resolution, complex scene with many moving pedestrians and drastic illumination changes, so it is very difficult to detect the abnormal crowd behaviors. The videos from the Point-Gray camera include crowd aggregation in 6 videos, with different camera views, different directions of crowd aggregation, and different crowd scales. Detection is also difficult here because of the serious occlusion resulting from the low mounting position of the camera. Example images from the RICOH dataset are shown in Fig. 5.

Fig. 5. The forward and backward estimation of optical flow.

3.2 Experiments

The experiments are conducted as follows. Firstly, the crowd aggregation detection experiment is performed on the videos from the Point-Gray camera; the average precision rate is 88 % with a 94 % recall rate. In order to compare our algorithm with the state-of-the-art particle entropy algorithm, we then conduct experiments on the UMN and TYZX datasets for crowd escape detection. Figure 6 shows some results on the UMN scenes.

Fig. 6. The result of crowd escape detection on the UMN dataset.

The crowd escape experiment is conducted second. Table 1 shows the quantitative comparison with the particle entropy algorithm on the UMN and TYZX datasets. The precision and recall of our algorithm are much better than those of the particle entropy algorithm, except for the precision on UMN scene 2, and the best result even achieves a 100 % precision rate with an 82 % recall rate. Hence, the proposed algorithm significantly outperforms the state-of-the-art particle entropy algorithm on most tested datasets.

Table 1. The results of the comparison.
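
For reference, a minimal precision/recall computation is sketched below; whether the reported numbers are scored at the frame level or the event level is not stated in the paper, so the frame-level protocol here is only an assumption:

```python
def precision_recall(pred, truth):
    # pred/truth: equal-length boolean sequences marking frames flagged as
    # abnormal by the detector and by the ground truth, respectively.
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```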

4 Conclusions

In the future, robots will play more and more important roles in our lives. As public security guards, they can sense abnormal crowd behaviors, raise alarms, and help evacuate the stream of people, thus reducing the occurrence of public incidents. In this paper, we first propose a novel crowd aggregation detection algorithm based on background modeling; the algorithm can issue graded warnings about crowd congestion for public security. Secondly, a motion energy approach is proposed to represent the crowd distribution information. The experimental results on the RICOH dataset show the good performance of the proposed approach; in particular, our algorithm is robust to illumination changes, low resolution, scene depth, and camera position. The experiments conducted on the publicly available dataset show the effectiveness of the approach and that our algorithm outperforms the state-of-the-art particle entropy algorithm. In future work, the thresholds in our algorithm will be made self-adaptive to avoid complicated manual tuning for different surveillance scenes.