1 Introduction

Monitoring crowds and understanding their behavior is currently one of the foremost research frontiers. One of the objectives of crowd monitoring is to ensure the safety and security of individuals in public gatherings (music concerts), shared public spaces (shopping malls), roadways (pedestrian crossings, urban streets), transport facilities (airports, public transport terminals), sporting events (stadia, the Olympics), and, importantly, during emergency evacuations (in the event of fire, terror threat, natural disaster, building collapse, etc.). In the event of emergencies such as fire, stampede, or an eruption of violence, an estimate of crowd density in different spaces is required. The current and widely adopted approach, manual counting of people from several live video streams at a centralized location, is tedious. Personnel experience fatigue from continuous monitoring of these feeds, and it becomes impossible to monitor several streams during emergencies.

To help alleviate the problem of manual counting, several people detection, counting and tracking systems have been proposed, some with commercial outcomes. A comprehensive survey of human activity recognition and behavior can be found in [80]. These systems often fail in occluded scenarios, either because information about the occluded object is unavailable during detection or because correspondence cannot be re-established when the same object reappears during tracking. People counting-based crowd monitoring systems detect and locate individuals to arrive at the number of people in a scene, whereas density estimation-based approaches, without locating individuals, map a group of people to the crowd density in the scene (in a specific area) using certain features. For monitoring crowds in public spaces, counting people by detecting individuals and then tracking them is infeasible because of occlusions and the difficulty of matching reappearances.

One of the primary objectives of monitoring crowds in public spaces is to provide a reasonable estimate of the number of people for emergency evacuations and path planning. Zhan et al. [92] and Jacques et al. [41] have discussed the importance of crowd analysis and its applications (public space design, creation of virtual environments, visual surveillance and intelligent environments) in managing large crowds. Crowd monitoring systems face difficulties in detecting motion from the scene due to varying environmental conditions. Nonuniform illumination, cast shadows, nonrigid human movements and occlusions are some of the major challenges [90]. The challenges associated with people detection and tracking are mainly ascribed to nonrigid human body motions. Humans, as nonrigid objects, generally fall under the category of elastic and fluid movements [2, 45]. The challenges in object detection and tracking include [90]: (a) loss of information (from 3D to 2D), (b) presence of video noise, (c) nonrigid (articulated) object motions, (d) inter-object occlusions, self-occlusions, partial occlusions and full occlusions, (e) random object motions, (f) nonuniform scene illumination, and (g) near real-time processing requirements.

Most of the existing visual monitoring systems are deployed by police to detect and track vehicles (particularly cars) while patrolling highways. Many crowd monitoring systems do exist, but they face challenges in high-density crowded scenarios. Detecting and tracking a single object is relatively less challenging than the multi-object and high-density cases, and human detection and tracking adds a further degree of difficulty because of nonrigid body movements. Nonrigid body movement implies that the movements of an object are incoherent and different parts of the body move differently (with varying directions and velocities), whereas rigid body movement (such as that of a vehicle) follows coherent motion (the parts of the object move together). Consequently, the techniques developed for vehicle detection and tracking are inapplicable, without modification, to pedestrian tracking in occluded scenarios. Hence, it is of utmost importance to study, analyze and develop algorithms that are specifically targeted at monitoring crowds.

In this work, we collected data from the Melbourne Cricket Ground (MCG) to estimate crowd density at entry and exit points of the concourse. Six locations were chosen for monitoring crowd movement. Figure 1 shows the schematic map of the MCG camera locations and their views. The goal of this research is to estimate the crowd density without using supervised video subvolume features or classification. In this work, online refers to the way the processing of frames is carried out: the proposed approach does not require all the frames; instead, the density is estimated frame by frame. Offline processing refers to batch processing, where processing commences only after the complete video is available. The key differentiating factor from existing works is that most methods are offline: they use features extracted from the entire video sequence, which are trained and classified to arrive at the estimate.

In this regard, we present an optical flow-based approach for crowd density estimation. Optical flow motion estimation is used as the primary basis for deducing all the outcomes. Different signal processing blocks, filters and mathematical tools are analytically and efficiently combined to provide an estimate of the density in the scene. We believe that this work provides the necessary foundation for motion tracking and analysis in crowded scenarios, and also a basis for future research in crowd behavior analysis. The proposed approach has been tested on the MCG dataset, which exhibits low light, severe occlusions, perspective distortion and low-quality video. To validate the proposed approach, the method was also tested on two popular public crowd datasets—PETS 2009 and the UCSD Pedestrian Database—with medium- to high-density crowds in the presence of shadows and occlusions. These datasets contain only humans, and hence it is assumed that detection of objects refers to humans only. For identification of vehicles, humans, animals and other objects, the proposed approach would require additional detection and classification steps to distinguish these objects, which is not pursued in this work. The advantages of the proposed approach are: (a) use of motion features to calculate the crowd density as opposed to background modeling, model-based approaches and texture methods and (b) online rather than offline estimation of density.

Section 2 reviews relevant background modeling, motion estimation, human modeling, head detection and crowd density estimation approaches. Section 3 presents the flow of the proposed approach. Section 4 describes object detection using motion estimation and contour analysis. Section 5 describes the crowd density estimation problem and the proposed approach. Section 6 provides the complete dataset and implementation details, followed by results and discussion, and Sect. 7 concludes this work.

Fig. 1

Crowd monitoring at MCG: a small section of the concourse was used as the experimental testbed. This figure shows the schematic view of the six camera locations deployed at the MCG

2 Related work

The primary requirement in object detection is to separate the moving objects from the background. Intuitively, one idea is to model the background of the scene and then classify anything that does not belong to the background model as foreground (objects). A review of such background modeling schemes is provided in Sect. 2.1. The other approach to extracting the foreground is motion estimation, which estimates inter-frame motion information based on pixel displacements. Section 2.2 reviews motion estimation schemes. The second step in object detection is to ascertain that the detected objects are indeed humans. A review of model-based human detection is provided in Sect. 2.3. Others prefer head detection to ascertain humans; Sect. 2.4 reviews head detection-based approaches, and finally Sect. 2.5 reviews relevant density estimation approaches.

2.1 Background modeling

In the process of motion detection and estimation for object tracking, separation of the object is the most critical step of all. Frame differencing, a basic approach to detecting changes, can be performed by differencing two consecutive frames or by taking the difference between a reference frame and the current frame [25]. The resulting difference can be binarized using a global threshold [46] or multiple thresholds [62]. In addition, the accumulative difference image is another way to segment the background from the foreground [42]. Pfinder [85] used a single-Gaussian multi-class statistical pixel-based model that maintains the pixel values using the Gaussian model and thus tolerates spontaneous pixel noise. The notion of adaptive background modeling was introduced using the Mixture of Gaussians (MoG) [26, 74]. Each pixel value was modeled as a mixture of Gaussians over recently observed pixel values (the most probable background pixels receive more weight in the mixture). A new Gaussian was created by replacing the least probable Gaussian in an open-ended list of Gaussians to incorporate the new pixel value. Depending on the mean and variance of the pixel values, a pixel that did not fit into any of the Gaussians was labeled as foreground. This is one of the most widely adopted approaches to background modeling.

MoG was generalized using Kernel Density Estimation (KDE) [21], where the recent values of the pixels were modeled as a kernel. Up to this point, only intensity values were used to model the background. Inspired by [85], a color Gaussian distribution of each pixel was modeled to exploit the color information [95]. It is worth noting that color information may change significantly when cast shadows and nonuniform illumination are present, particularly when monitoring crowds in unconstrained environments. Others have used eigenvalue decomposition for background subtraction [60]. Eigenvalues represent variance in decreasing order; hence, the eigenvalues of a region of pixels identify those pixels with the highest fluctuations. Some have adopted the running Gaussian average [98], spatial correlation of pixels [69], or a bimodal distribution of pixel values (either foreground or background) [27–29] for modeling the background. It should be noted that all of these approaches use recent pixel observations. In contrast, in [7], for each pixel (in a given color space), a set of sampled values was taken from past pixel values or from the neighborhood, and the Euclidean distance between the new pixel value and the set of samples was computed and stored. The downside of modeling the background is that the system needs to learn a model of the scene. In general, learning the scene model depends on the learning rate and the sample frames (with no objects). Furthermore, when the background has a wide dynamic range due to the presence of a large number of individual and occluded objects, modeling the background requires a comparatively longer time than when only a few unoccluded objects are present.

2.2 Motion estimation

In contrast to background modeling schemes, motion estimation comprises estimating the 2D pixel velocities between frames. Motion estimation deduces movements in the scene based on the apparent motion of the objects in the scene. Optical flow is one such approach to estimating the motion (velocity) components of objects in the scene. Based on how the 2D flow vectors are estimated, motion estimation techniques can be grouped into four categories [8]: (a) gradient-based techniques, (b) region-based techniques, (c) energy-based techniques and (d) phase-based techniques. Gradient-based techniques [34, 50, 59, 79] derive pixel velocities from spatio-temporal derivatives. Region-based techniques match pixel intensities in a regional window by maximizing similarity scores [5, 70]. Energy-based techniques use spatio-temporal windows and texture features in the Fourier (frequency) domain to estimate the power spectrum of the window and calculate the optical flow velocities [32, 33]. Phase-based techniques compute the velocity vectors by tracking contour phases [23, 82]. Dense optical flow [34] considers the global view of the scene and estimates motion whose vectors are smooth because of regularization. Sparse optical flow [50] averages the pixel values in a window and then estimates the motion. The drawback of motion estimation approaches is that they are applicable only when there is motion: an object positioned at a particular location for a long period of time goes undetected.

2.3 Human modeling

To overcome the difficulty of incoherent human motion [2], shape-based analysis of the detected regions is widely used to address the problem of human detection and classification. In [93] and [49], the problem of segmentation (human crowds with occlusion) was handled using three-dimensional models and shapes to interpret the foreground. In [94] and [96], both shape- and appearance-based (color feature) models were maintained. Templates have been used to represent human shape information [11, 20, 24, 97], and wavelet-based templates have also been employed [61]. Part-based matching methods incorporate prior object shape information such as head, hands, torso and legs to be matched with the detected shapes [16, 18, 78, 91]. Shape models may also include points, silhouettes, articulated shapes [10], and skeleton models. Active Appearance Models (AAM) use both shape (spatial) and local (texture, color, edge) features [9, 43, 67, 88]. AAMs are suitable for face recognition, eye tracking and medical image segmentation. Nevertheless, matching human models requires definite models, object structures, predetermined colors and reproducible features.

2.4 Head detection

In most surveillance and tracking systems, it is assumed that the humans are walking upright. One of the prominent cues for tracking humans is head detection: while most body parts follow articulated motion, the head maintains a stable shape for each individual. In this regard, people tracking and counting systems have employed head detection approaches [29, 36, 39, 40, 55, 57, 58]. Head detection-based methods are also used in estimating crowd density, since head detection assists in localizing individuals in crowded scenarios. Global head detection and 2D Gaussian kernel-based regression were employed in [66] to estimate the density in crowded scenarios, where interest points based on the orientation of the gradients were used to form a binary image. In [75], LUV channels, the intensity of the gradient channels and six orientations of the gradient channels were used as features for head detection. However, the visibility of heads depends on the camera angle and the severity of occlusion. A complete top view provides the best conditions for head detection, whereas tilted cameras suffer from occlusion. Further, density estimation approaches based on head detection depend on the accuracy of the detected heads, and the detection rate is currently low.

2.5 Density estimation

Most crowd density estimation schemes either use texture features at local and global levels or extract the foreground pixels using motion information. The extracted features of the foreground pixels are then mapped to the crowd density in a given region. Texture features such as the Gray Level Dependence Matrix (GLDM), Minkowski Fractal Dimension (MFD), and Translation Invariant Orthonormal Chebyshev Moments (TIOCM) were used to classify the crowd density into five classes (very low, low, moderate, high and very high) using Self Organizing Maps (SOM) [64]. GLDM was used at local and global levels as features with a Support Vector Machine (SVM) classifier for abnormal crowd density detection [86]. Kanade–Lucas–Tomasi (KLT) [50] tracking features were clustered to estimate the crowd density in public venues [3]. Statistical texture measures such as contrast, homogeneity, energy, entropy, and the Gradient Orientation Co-occurrence Matrix (GOCM) were used in conjunction with \(\nu \)-Support Vector Regression (\(\nu \)-SVR) for estimating density [52, 53]. Gaussian process regression for density estimation using geometric, edge and texture features can be found in [12].

Background modeling-based approaches can also be found in the literature. Features such as blob area, Harris corners, KLT feature points, contour number, contour perimeter, ratio of contour perimeter to area, Canny edges and fractal dimension were treated as inputs to a multi-variable linear regression model [54]. Combining neural networks with foreground pixels and morphological operations, the crowd density was estimated in [37]. Density was also estimated by accumulating foreground pixels over time, constructing the Gray Level Co-occurrence Matrix (GLCM) at four orientations (\(0^\circ \), \(45^\circ \), \(90^\circ \) and \(135^\circ \)) and deriving statistical features (energy, entropy, homogeneity, and contrast) [73].

In contrast, crowd modeling was also performed using motion estimation followed by a Markov Random Field (MRF) to locate the objects' positions, with the least squares method applied to reduce the neighborhood search space [37]. Furthermore, motion frequency using Discrete Cosine Transform (DCT) coefficients and an SVM to estimate the density was proposed [38]. Using moving Speeded Up Robust Feature (SURF) points and their clusters as input features for \(\epsilon \)-SVR, density estimation was performed in [17]. Most of the aforementioned approaches to density estimation require training. Consequently, a change in the scene, camera or orientation requires retraining of the system, which is undesirable for an end user.

In this paper, we propose crowd density estimation using motion estimation followed by hierarchical clustering. Willick and Yang [83] indicated that Horn and Schunck's [34] motion estimation scheme performed better than Nagel's work [59]. In addition, since the selection of window size in Lucas and Kanade [50] affects flow vector determination, nonrigid human motion may generate noisier vectors than dense optical flow. Therefore, in this work, we use the dense optical flow of Horn and Schunck [35] for motion estimation. Figure 2 presents an overview of the developed crowd monitoring system. The system is divided into five major blocks (see Fig. 2): (a) camera inputs, (b) preprocessing filters, (c) people tracking, (d) motion tracking, and (e) crowd events. In this work, the path chosen to estimate the crowd density is: input from camera:1 \(\rightarrow \) preprocessing filters:2 \(\rightarrow \) multiple objects:3 \(\rightarrow \) density estimation:4 \(\rightarrow \) people count:5. We assume that the video data from the cameras are directly accessible, the cameras are calibrated and only human objects are present. Sample frames from the three datasets (MCG, PETS 2009, and UCSD) are shown in Figs. 3, 4, and 5.

Fig. 2

Overview of crowd monitoring system

Fig. 3

Sample frames from MCG dataset for different cameras

Fig. 4

Sample frames from PETS 2009 dataset [22]

Fig. 5

Sample frames from UCSD Pedestrian Database dataset [13]

3 Flow of the proposed approach

Let \(\mathbf {I}_{p \times q}\) represent an area in the frame \(\mathbf {I}(x,y)\), where \(p,q \in \mathbb {Z}_{+}\) represent the number of pixels. The problem of crowd density estimation then translates to

$$\begin{aligned} f:\mathbf {I}_{p}\times \mathbf {I}_{q}\in \mathbb {R}^{2} \rightarrow f(\mathbf {I}_{p \times q})\in \mathbb {N}, \end{aligned}$$
(1)

where \(f(\mathbf {I}_{p \times q})\) is the crowd density in \(\mathbf {I}_{p \times q}\) of \(\mathbf {I}(x,y)\). Thus, the goal of crowd density estimation turns to estimating the number of people in a given region. In other words, the function maps the foreground region to the number of moving objects it contains.

  1. First, nonlinear distortions such as barrel or pincushion distortion introduced by low-cost optics are rectified by applying a lens correction profile to each frame. In addition, the perspectivity introduced by central projection when imaging from 3D to 2D is corrected by perspective correction.

  2. Next, the video frames are preprocessed with suitable filters to remove video noise such as chroma noise, blur and speckle noise.

  3. Motion vectors of the scene are then computed from the optical flow to narrow the search space (of foreground pixels) for estimating the crowd density. In other words, further computation is carried out on the motion vectors and their locations in the current scene instead of on the entire video scene.

  4. Motion vectors with zero angular orientation are eliminated; only motion vectors that have both horizontal and vertical motion components are retained. This step filters the speckle-like noise generated by the motion vectors.

  5. Contours are drawn for each of the connected regions. This step demarcates the crowd regions and provides the input to the next level of processing. For these regions to correspond truthfully to crowd regions only, step 2 is essential.

  6. The motion cues are refined by spatially filtering the motion vectors with a median filter to reduce rapid changes in motion information. Motion vectors in a region of the video can possess high variance in crowds due to the articulated movements of people. This variance is reduced by the median filter, whose output is assigned to all elements of the block.

  7. Finally, foreground pixels are mapped to the crowd density by clustering the motion cues hierarchically. The association of the crowd density with a region is a function of the block size and the motion information obtained from the spatially filtered motion vectors. Clustering the motion vectors is tantamount to localizing the movements corresponding to grouped objects, thereby providing an estimate of the crowd density.

The complete flow of the proposed crowd density estimation is provided in Fig. 6.

4 Object detection

In this section, the building blocks for detecting moving objects for density estimation are described. The preprocessing block, motion estimation from two consecutive frames using optical flow, and contour extraction and analysis are presented in the following subsections.

Fig. 6

Flow of crowd density estimation by clustering motion cues. In the Region of Interest (ROI), the actual number of people was 2. One can notice that this is a low-light region. The proposed solution estimated the density to be 3.

4.1 Lens distortion correction

Most optical devices introduce nonlinear distortions. In the case of image formation in a camera, two main types of distortion are introduced: radial distortion and tangential distortion. Radial distortion introduces barrel and pincushion effects, whereas the less severe tangential distortion arises when the image plane and the lens plane are not parallel. Radial distortion distorts images far more severely than tangential distortion, which is attributed to manufacturing defects. Consequently, in this work, only radial distortion is considered. Radial distortion is corrected by determining the coefficients (\(K=\{k_{1}, k_{2}, k_{3}\}\)) of the 6th-order polynomial [30, 51]:

$$\begin{aligned} f(r) = 1 + k_{1}r^{2} + k_{2}r^{4} + k_{3}r^{6} \end{aligned}$$
(2)

where \(f(r)\) is the radial correction factor, \(r^{2} = x^2 + y^2\) and \((x,y)\) are the image coordinates. Barrel distortion arises when \(f(r) <1 \), and the pincushion effect surfaces when \(f(r) >1\). Figure 7 shows the barrel and pincushion distortions, and Fig. 8 shows an example of barrel distortion present in one of the frames of the MCG dataset and its correction.
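For illustration, the radial correction of Eq. (2) can be applied with OpenCV's undistort routine. The sketch below assumes the focal lengths, principal point and radial coefficients have been obtained from an offline calibration of the camera; the numeric values in the usage comment are placeholders, not calibration results from this work.

```python
import cv2
import numpy as np

def correct_radial_distortion(frame, fx, fy, cx, cy, k1, k2, k3):
    """Undo barrel/pincushion distortion following Eq. (2).

    fx, fy, cx, cy and the radial coefficients k1..k3 are assumed to come
    from an offline calibration of the camera (values below are illustrative).
    """
    camera_matrix = np.array([[fx, 0.0, cx],
                              [0.0, fy, cy],
                              [0.0, 0.0, 1.0]])
    # OpenCV's distortion vector is [k1, k2, p1, p2, k3]; the tangential
    # terms p1, p2 are set to zero since only radial distortion is modelled.
    dist_coeffs = np.array([k1, k2, 0.0, 0.0, k3])
    return cv2.undistort(frame, camera_matrix, dist_coeffs)

# Hypothetical usage on one MCG frame (parameter values are placeholders):
# frame = cv2.imread("mcg_c2_frame_0001.jpg")
# corrected = correct_radial_distortion(frame, 700, 700, 320, 240,
#                                        -0.25, 0.05, 0.0)
```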

Fig. 7

a Barrel distortion and b pincushion distortion due to nonlinear optical effects of a camera lens [63]

Fig. 8

a Presence of the barrel distortion in C2 is more clearly visible towards the left edge of the frame where the vertical structure appears to be bent outwards and b distortion-corrected frame

4.2 Perspective correction

During image formation, light from the 3D projective space (\(\mathbb {P}^{3}\)) is projected onto a 2D space \(\mathbb {P}^{2}\) through the fixed center of projection. If we denote a point in \(\mathbb {P}^{3}\) by \((X,Y,Z,1)\) and a point in \(\mathbb {P}^{2}\) by \((x,y,1)\) in homogeneous coordinates, then the perspective projection (camera matrix) can be written as:

$$\begin{aligned} \underbrace{ \begin{bmatrix} x^{\prime } \\ y^{\prime } \\ 1 \\ \end{bmatrix} }_{\text {image plane}}&= \underbrace{ \begin{bmatrix} f_{x}&sk&x_{0} \\ 0&f_{y}&y_{0} \\ 0&0&1 \\ \end{bmatrix} }_{\text {intrinsic parameters}} \underbrace{ \begin{bmatrix} R_{11}&R_{12}&R_{13}&t_{X} \\ R_{21}&R_{22}&R_{23}&t_{Y} \\ R_{31}&R_{32}&R_{33}&t_{Z} \\ \end{bmatrix} }_{\text {extrinsic parameters}} \underbrace{ \begin{bmatrix} X \\ Y \\ Z \\ 1 \\ \end{bmatrix}}_{\text {world plane}}\nonumber \\&= \mathbf {K}[\mathbf {R}|\mathbf {t}] \begin{bmatrix} X \\ Y \\ Z \\ 1 \\ \end{bmatrix} \end{aligned}$$
(3)

where \((f_{x},f_{y})\) are the focal lengths in the \((x,y)\) directions of the image in pixel units, \((x_{0},y_{0})\) is the principal point, and \(sk\) is the skew parameter; together these form the five intrinsic parameters. The extrinsic parameters describe the rotation (\(\mathbf {R}\)) and translation (\(\mathbf {t}\)) between the camera coordinates and the external world.

However, the projective transformation introduces perspective distortion through central projection: objects closer to the camera appear bigger and objects far away from the camera appear smaller. The distortion is corrected by estimating the projectivity (also called collineation or homography) mapping points to points and lines to lines, from \(\mathbb {P}^{2}\) to \(\mathbb {P}^{2}\) [31]. The homography matrix \((\mathbf {H})\) is non-singular and hence invertible. It is estimated using the Direct Linear Transform (DLT) given by:

$$\begin{aligned} s \begin{bmatrix} x \\ y \\ 1 \\ \end{bmatrix}= \begin{bmatrix} h_{11}&\quad \! h_{12}&\quad \! h_{13} \\ h_{21}&\quad \! h_{22}&\quad \! h_{23} \\ h_{31}&\quad \! h_{32}&\quad \! h_{33} \\ \end{bmatrix} \begin{bmatrix} x^{\prime } \\ y^{\prime } \\ 1 \\ \end{bmatrix} =\mathbf {H} \begin{bmatrix} x^{\prime } \\ y^{\prime } \\ 1 \\ \end{bmatrix} \end{aligned}$$
(4)

where \(s\) is a nonzero scalar. The matrix \(\mathbf {H}\) has 8 degrees of freedom and hence requires four point correspondences between images. An example of a perspective-corrected frame from the MCG dataset is shown in Fig. 9.
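As an illustration of this step, the homography for one camera view can be estimated from four point correspondences and applied with OpenCV. The corner coordinates below are hypothetical; in practice they would come from the camera installation and calibration process.

```python
import cv2
import numpy as np

# Four point correspondences (hypothetical ground-plane markers in a frame
# and their target positions in the rectified view) are enough to solve the
# 8 degrees of freedom of H via the DLT, as in Eq. (4).
src_pts = np.float32([[112,  60], [530,  48], [620, 470], [18, 455]])
dst_pts = np.float32([[  0,   0], [640,   0], [640, 480], [ 0, 480]])

H = cv2.getPerspectiveTransform(src_pts, dst_pts)   # 3x3 homography
# rectified = cv2.warpPerspective(frame, H, (640, 480))
```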

Fig. 9

a View of C1 and b its perspective corrected view for the MCG dataset.

4.3 Preprocessing

Let \((x,y)\) denote a pixel in a frame \(\mathbf {I}(x,y)\). In this work, the \(x\) axis indicates the horizontal axis of the frame and the \(y\) axis the vertical axis. At this stage, every frame is first bilaterally filtered with \(\sigma _{d}=5\) and \(\sigma _{r}=10\) to reduce variations among color pixels while preserving edges. Bilateral filtering preserves the sharpness of the frames while maintaining perceptually strong colors and edges. Next, these frames are converted to grayscale. The grayscale frames are then low-pass filtered to eliminate high-frequency noise using a two-dimensional Gaussian filter with \(\mu =0\) and \(\sigma _{x}=\sigma _{y}=0.1\). The filtered frames are then used for motion estimation using optical flow [34].
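A minimal sketch of this preprocessing chain in OpenCV is given below. The bilateral-filter sigmas and the Gaussian sigma follow the values stated above, while the filter diameter and Gaussian kernel size are assumed choices.

```python
import cv2

def preprocess(frame_bgr):
    """Bilateral filter -> grayscale -> Gaussian low-pass, as in Sect. 4.3.

    sigma_d = 5 maps to sigmaSpace and sigma_r = 10 to sigmaColor; the
    filter diameter (5 px) and Gaussian kernel size (3x3) are assumptions.
    """
    smoothed = cv2.bilateralFilter(frame_bgr, d=5, sigmaColor=10, sigmaSpace=5)
    gray = cv2.cvtColor(smoothed, cv2.COLOR_BGR2GRAY)
    return cv2.GaussianBlur(gray, (3, 3), sigmaX=0.1, sigmaY=0.1)
```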

4.4 Motion estimation

For a given sequence of frames, the motion is computed using the optical flow approach. The brightness at point \((x,y)\) of frame \(\mathbf {I}\) is denoted by \(\mathbf {E}(x,y,t)\), where \(t\) represents time. Assuming that a moving object retains constant brightness, the motion between two frames of a video sequence is given by the optical flow constraint [34]:

$$\begin{aligned} \frac{\partial \mathbf {E}}{\partial x} \frac{\mathrm{d}x}{\mathrm{d}t} + \frac{\partial \mathbf {E}}{\partial y} \frac{\mathrm{d}y}{\mathrm{d}t} + \frac{\partial \mathbf {E}}{\partial t} = 0 \end{aligned}$$
(5)

Further,

$$\begin{aligned} \mathbf {E}_{x}u+\mathbf {E}_{y}v+ \mathbf {E}_{t}=0 \end{aligned}$$
(6)

where \(\mathbf {u}=\frac{\mathrm{d}x}{\mathrm{d}t}\) and \(\mathbf {v}=\frac{\mathrm{d}y}{\mathrm{d}t}\) represent the velocities of matching pixels in two frames. Spatially, for each pixel location \((x,y)\), the resultant two-dimensional velocity vector (\(\mathbf {u}(x,y)=\frac{\mathrm{d}x}{\mathrm{d}t}\) and \(\mathbf {v}(x,y)=\frac{\mathrm{d}y}{\mathrm{d}t}\)) over time is calculated. The matrices of the resultant vectors' magnitudes and directions are given by:

$$\begin{aligned} \mathbf {I}_{R}(x,y)= \sqrt{(\mathbf {u}(x,y)^2 + \mathbf {v}(x,y)^2)} \end{aligned}$$
(7)

and

$$\begin{aligned} \mathbf {I}_{D}(x,y)= \displaystyle \arctan \left( \frac{\mathbf {v}(x,y)}{\mathbf {u}(x,y)}\right) \end{aligned}$$
(8)

where \(\mathbf {I}_{R}(x,y)\) and \(\mathbf {I}_{D}(x,y)\) together indicate the presence of moving objects. Matrices \(\mathbf {I}_{R}\) and \(\mathbf {I}_{D}\) are used to detect the object boundaries in contour analysis. Algorithm 1 shows a pseudocode of preprocessing and motion estimation.

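The sketch below illustrates this step. Note that it substitutes OpenCV's Farneback dense flow for the Horn–Schunck estimator used in this work, since the former ships with OpenCV's Python bindings; the pyramid and window parameters are assumed defaults.

```python
import cv2
import numpy as np

def dense_flow_magnitude_direction(prev_gray, curr_gray):
    """Dense optical flow between two preprocessed frames (cf. Algorithm 1).

    The paper uses Horn-Schunck flow; Farneback flow is substituted here
    only as a readily available dense-flow stand-in.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5,
                                        poly_sigma=1.2, flags=0)
    u, v = flow[..., 0], flow[..., 1]
    I_R = np.sqrt(u ** 2 + v ** 2)            # Eq. (7): magnitude
    I_D = np.degrees(np.arctan2(v, u)) % 360  # Eq. (8): direction in [0, 360)
    return I_R, I_D, u, v
```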

4.5 Contour analysis

Contour analysis demarcates the boundaries of the possible objects detected in the previous steps. Both the magnitude (\(\mathbf {I}_{R}\)) and direction (\(\mathbf {I}_{D}\)) matrices are used for this analysis. For each iteration of the optical flow between two frames, the direction matrix \(\mathbf {I}_{D}\) comprises the vector direction of each pixel movement. The angle of these vectors is confined to \([0,360)\) degrees. To segment the motion of objects, \(\mathbf {I}_{D}\) is binarized, retaining only those pixels with \(|\mathbf {I}_{D}(x,y)| > 0\); this eliminates the speckle noise in the foreground pixels that may originate when only a \(\mathbf {u}\) or only a \(\mathbf {v}\) component is present. The binarized \(\mathbf {I}_{D}\) is further smoothed using a (\(7 \times 7\)) median filter to reduce any remaining speckle noise, and morphological closing is performed to enhance the motion segmentation. Next, the contours \(\mathbf {C}=\bigcup _{i}^{n}c_{i}\) are determined for each of the nonzero foreground components. For simplicity, the subscript \(i\) is dropped in what follows. Figure 10 shows examples of detected contours after the preprocessing step (without perspective correction), including motion estimation and filtering of the flow vectors, for the MCG, PETS 2009 and UCSD datasets. Similar results would be obtained with the inclusion of perspective correction; however, the size of the objects would then be normalized. Algorithm 2 shows the extraction of contours.
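A sketch of this contour extraction step is given below. The \(7 \times 7\) median filter follows the text, whereas the size of the morphological structuring element is an assumed choice.

```python
import cv2
import numpy as np

def extract_motion_contours(I_D):
    """Binarize the direction matrix, clean it up and trace contours
    (sketch of Algorithm 2)."""
    mask = (np.abs(I_D) > 0).astype(np.uint8) * 255
    mask = cv2.medianBlur(mask, 7)                                  # 7x7 median
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))   # assumed size
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # findContours returns (contours, hierarchy) in OpenCV 2.x/4.x and
    # (image, contours, hierarchy) in 3.x; [-2] picks the contours either way.
    contours = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    return contours, mask
```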

Fig. 10

Examples of contour analysis: a contours detected for MCG dataset; b, c PETS 2009 dataset; and d, e UCSD dataset (Note: perspective correction is not shown here for image clarity)

5 Crowd density estimation

5.1 Foreground mapping

Each object occupies a certain area (a region of foreground pixels belonging to the object) in the foreground due to its motion. A single object yields a smaller foreground region than multiple objects moving together at the same scale. The contour of multiple unoccluded objects contains gaps (valleys between objects—Zhao and Nevatia [94] provided silhouette projections of multiple objects in which such valleys can be observed), whereas the contour of a partially occluded group fills these gaps to a large extent (this phenomenon was observed in the case of partially occluded objects). The problem is then to learn the crowd density corresponding to the foreground regions. Motion cues are used to map the movements of the objects to the foreground pixels, median filtering of the motion vectors creates the blocks in the foreground regions, and hierarchical clustering maps the foreground regions to the crowd density.



5.2 Selection of motion cues

Most of the density estimation methods described in Sect. 2 require training to deduce a function that provides the crowd density in the scene. In the proposed approach, once the contours are extracted, the region of interest (ROI) is binarized to generate a contour mask \(c_\mathrm{mask}\). The optical flow magnitude \(\mathbf {I}_{R}\) corresponding to \(c_\mathrm{mask}\) is processed for density estimation. Bounding box information is derived for each of the contours, and the bounding box region of \(\mathbf {I}_{R}\) (corresponding to \(c_\mathrm{mask}\)) is divided into \(b \times b\) blocks. To alleviate flow noise, the pixel values in each block are replaced by the output of a (\(b \times b\)) median filter. The resulting matrix, termed the motion matrix \(\mathbf {M}\), consists of \(b \times b\) blocks for every contour, with each block holding the median value of the original \(\mathbf {I}_{R}\). In this approach, since perspective correction is used, the block size is fixed to \(32 \times 32\).

The block size for an uncorrected perspective scene can be determined from the height and width of the objects obtained during the camera installation and calibration process. For each of the contours in \(C\), the contour width \(c_\mathrm{width}\), contour height \(c_\mathrm{height}\), contour orientation \(c_\mathrm{angle}\), and contour starting point \(c_{x}\), \(c_{y}\) are obtained. The average single-object size \(O_{s}\) and object area \(O_{a}\) are determined from the contour dimensions. The block size \(b\) is determined using the contour size of a single object in the scene and is given by:

$$\begin{aligned} b = \frac{1}{N_\mathrm{SO }} \times \sqrt{c_\mathrm{SO _\mathrm{width }} \times c_\mathrm{SO _\mathrm{height }}} \end{aligned}$$
(9)

where \({c_\mathrm{SO _\mathrm{width }}}\) and \(c_\mathrm{SO _\mathrm{height }}\) are the width and height of a single-object contour, and \(N_\mathrm{SO}=3\) is chosen such that at least \(N_\mathrm{SO}\) blocks cover the width of a single object.
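The block-median construction of the motion matrix \(\mathbf {M}\) and the block size of Eq. (9) can be sketched as follows; the handling of partial blocks at the patch border is an implementation assumption.

```python
import numpy as np

def block_size_from_single_object(c_width, c_height, n_so=3):
    """Eq. (9): block size derived from an average single-object contour."""
    return int(round(np.sqrt(c_width * c_height) / n_so))

def block_median_motion(I_R_roi, b):
    """Replace each b x b block of a contour's flow-magnitude patch with
    its median, giving the motion matrix M of Sect. 5.2 (sketch)."""
    h, w = I_R_roi.shape
    M = np.zeros_like(I_R_roi)
    for y in range(0, h, b):
        for x in range(0, w, b):
            block = I_R_roi[y:y + b, x:x + b]   # edge blocks may be partial
            M[y:y + b, x:x + b] = np.median(block)
    return M
```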

5.3 Clustering motion cues

Let \(\mathbf {M}_\mathrm{min}\) and \(\mathbf {M}_\mathrm{max}\) denote the minimum and maximum of \(\mathbf {M}\). The motion cue matrix (motion cues) \(\mathbf {I}_{M}\) is obtained by relatively scaling the motion information as \(\mathbf {I}_{M} = \frac{(\mathbf {M}-\mathbf {M}_\mathrm{min})}{(\mathbf {M}_\mathrm{max}-\mathbf {M}_\mathrm{min})}\). Figure 11 shows an example of the motion information matrix (without blocks). Relatively high-magnitude motion information forms 2D Gaussian-shaped points, and the distinct (highlighted) points correspond to distinct people in the scene. As a result, one can generalize that the motion cue matrix \(\mathbf {I}_{M}\) highlights the regions in the scene that approximately correspond to individual objects, i.e., if there are \(N\) people in a scene, then there will be approximately \(N\) corresponding distinct points. Thus, we can deduce the density of people in the scene from the motion cues.

Fig. 11

Refining and clustering motion cues to estimate density (Note: perspective correction is not shown here for image clarity)

To deduce density, isolation of the distinct points is necessary. A natural way to achieve this is by clustering the motion cues. First, the dissimilarity matrix of the motion cues is obtained as \(\mathbf {D} = {\mathbf {I}_{M}}^T \mathbf {I}_{M}\). Spectral methods such as Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) require the top \(d\) eigenvalues as input, which limits our ability to determine the number of clusters present, and \(K\)-means clustering requires the desired number of clusters, which is unattractive for our purpose. Hierarchical clustering, on the other hand, provides a means to isolate the distinct points that correspond to the number of people in the scene. This forms the basis for the use of hierarchical clustering of motion cues.

Intuitively, the idea is to form clusters of motion vectors representing the foreground objects as patterns of crowd. \(\mathbf {I}_{M}\) provides a natural 2D Gaussian model for forming clusters, with peak values indicating the highest velocities at the center of an object and decreasing values away from the center. In crowded scenes, the inter-cluster variance, a property of the clustering algorithm, is used to separate different objects in the scene. A \(K\)-means tree representing the clusters of motion vectors would be an ideal approach; however, the density of the crowd is unknown and changes with time, so the number of clusters cannot be fixed. Instead, we use hierarchical (agglomerative) clustering to build the clusters bottom-up. The dissimilarity matrix consists of features from the spatially filtered (median) motion vectors. In the clustering process, the most similar blocks are merged and the process continues across the scene. Single-linkage (nearest-neighbor) clustering is used, since the motion vectors belonging to an object are similar to those at its center, while the dissimilarity increases with distance and peaks at the interface between two objects. This process identifies the distinct motion regions in the scene. A minimum spanning tree is constructed using single-linkage clustering, and the computational complexity of the clustering amounts to \(\mathcal {O}(N^{2})\); average-linkage and complete-linkage complexities are of the order \(\mathcal {O}(N^{3})\). Since the scene is analyzed in the 2D plane, the Euclidean distance is used as the metric for cluster analysis, where inter-cluster and intra-cluster variance are used simultaneously to determine the formation of the clusters. Algorithm 3 provides the pseudocode for obtaining motion cues from contours, and Algorithm 4 describes the clustering procedure for extracting distinct points. The nearest-neighbor (single) linkage creates a minimum spanning tree whose clusters have distinct blocks, since the clusters are formed by linking the motion signatures through distinct edges (Fig. 12). Algorithm 5 provides a brief overview of the process for estimating crowd density.
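A simplified sketch of this clustering step is given below using SciPy's single-linkage implementation. The per-block feature construction and the cut distance used to separate clusters are assumptions for illustration; they differ from the dissimilarity matrix \(\mathbf {D} = {\mathbf {I}_{M}}^T \mathbf {I}_{M}\) and the variance criteria described above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def count_motion_clusters(I_M, cut_distance=0.5):
    """Single-linkage agglomerative clustering of non-zero motion cues
    (sketch of Sect. 5.3). The cut distance is an assumed parameter."""
    ys, xs = np.nonzero(I_M)
    if len(xs) < 2:
        return len(xs)
    # Feature per non-zero block: spatial position plus its motion cue value.
    features = np.column_stack([xs, ys, I_M[ys, xs]])
    Z = linkage(features, method='single', metric='euclidean')
    labels = fcluster(Z, t=cut_distance, criterion='distance')
    return int(labels.max())   # number of distinct motion regions ~ density
```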

Fig. 12

Overview of clustering motion cues (Note: perspective correction is not shown here for image clarity)

6 Performance evaluation

The proposed crowd density estimation approach was implemented in OpenCV \(2.3\) on a VirtualBox Linux machine (32-bit Ubuntu 12.04 LTS) with \(1.5\) GB RAM and an Intel® i7-2600 CPU running at \(3.4\) GHz. The proposed approach was tested on the MCG dataset and two popular crowd datasets: PETS 2009 [22] and the UCSD Pedestrian Database [13].

6.1 Dataset

Table 1 provides detailed information about the three datasets. The MCG dataset was originally collected in Advanced Systems Format (asf) from six cameras on four different days of Australian Football League (AFL) matches held at the MCG. The data were collected around the start and end of the games. Nearly 6 hours of data were collected from each camera at \(30\) fps (equivalent to 36 h of data at 30 fps in total). For the purpose of crowd density estimation, four cameras were chosen (C1, C2, C4, and C6). In line with standard public datasets, 1 min of video from each camera was extracted during the peak crowd movements and converted to JPG format for analysis. The frame size is \(640 \times 480\) in RGB format. The ground truth was generated manually using customized software.


PETS 2009 dataset images are in JPG format with a size of \(768 \times 576\). View 001 is considered for density estimation for sequences S1.L1, S1.L2 and S1.L3. Different environmental conditions such as shadows, overcast skies, and medium and high crowd density were present, with both walking and running events. The crowd density is estimated in regions R0, R1 and R2 (as shown in Fig. 10) for sequences S1.L1 and S1.L2; for S1.L3, density is estimated in region R1. The ground truth information for the PETS 2009 dataset is available in [56]. In the PETS 2009 dataset, the regions R0, R1 and R2 were marked by the dataset providers. The dataset contains three roads (surrounded by green lawns on an educational campus) that intersect at a point. From View-001, R1 covers one of the roads and the intersection point; the road from R1 leads to another road overlooked by R2. Region R0 encompasses not only regions R1 and R2, but also other regions including the third road and walkable areas where people may move. The objects in R1 are comparatively larger than in R2. Furthermore, R1 is more occluded than R2 because of the camera view and angle.

The UCSD dataset contains walking events and comparatively smaller objects. The images are grayscale PNGs with a resolution of \(238 \times 158\). The UCSD dataset is challenging because of the small object size (less motion information and reduced texture features), and techniques dependent on color information would perform poorly because of the lack of color. The block size was initially set to \(6\) based on empirical knowledge. The ground truth information for the UCSD Pedestrian Database is available at [13].

Table 1 MCG 2011, PETS 2009 and UCSD Pedestrian Database datasets used in this work

6.2 Results and discussion

Performance of the density estimation is measured using the Mean Absolute Error (MAE), Mean Relative Error (MRE) and Root Mean Squared Error (RMSE), calculated using (10), (11) and (12):

$$\begin{aligned}&\mathrm{MAE} = \frac{1}{N} \displaystyle \sum _{i=1}^{N}{|A(i)-P(i)|}\end{aligned}$$
(10)
$$\begin{aligned}&\mathrm{MRE} = \frac{1}{N} \displaystyle \sum _{i=1}^{N}{\frac{|A(i)-P(i)|}{A(i)}}\end{aligned}$$
(11)
$$\begin{aligned}&\mathrm{RMSE} = \sqrt{\displaystyle \frac{1}{N}\sum _{i=1}^{N}|A(i)-P(i)|^{2}} \end{aligned}$$
(12)

where \(i\) indicates the \(i\mathrm{th }\) frame, \(A(\cdot )\) is the ground truth crowd density and \(P(\cdot )\) is the predicted density. MAE measures performance in terms of absolute error, i.e., the absolute difference between the actual (ground truth) and predicted values for each frame, averaged over the given sequence (all frames). MRE expresses the relative error as a percentage, i.e., the ratio of the absolute error for each frame to the actual value; this indicates how good the prediction is relative to the size of the actual value. RMSE measures the variation in the predictions and satisfies \(\text {MAE} \le \text {RMSE}\le \sqrt{N}\cdot \text {MAE}\) [84]; in other words, the closer the RMSE is to the MAE, the better the predictions. Although these three measures are commonly used in the literature, comparisons among different approaches are mainly based on MAE, because for this application (crowd density) the MAE over a sequence is the most meaningful measure and the way density is estimated differs between approaches.
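For completeness, the three measures of Eqs. (10)-(12) can be computed per sequence as follows (MRE is returned as a fraction; multiply by 100 for a percentage).

```python
import numpy as np

def mae_mre_rmse(actual, predicted):
    """Eqs. (10)-(12): MAE, MRE and RMSE over a sequence of frames."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    abs_err = np.abs(a - p)
    mae = abs_err.mean()
    mre = (abs_err / a).mean()            # assumes no frame has zero ground truth
    rmse = np.sqrt((abs_err ** 2).mean())
    return mae, mre, rmse
```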

The MCG dataset was collected indoors, whereas PETS and UCSD were collected outdoors. To validate the proposed approach, it was also evaluated on the PETS and UCSD datasets, and we compare it with existing methods on the PETS 2009 dataset. Table 2 provides the results for the MCG dataset: for three out of four sequences, the MAE was less than 2, and it was less than 4 for the entire dataset. Table 3 shows the results for the PETS 2009 dataset: the MAE was less than 3 for region R0, less than 4 for region R1 and less than 1 for region R2. Table 4 shows the results for the UCSD dataset: the MAE was less than 1 for four sequences and less than 2 for another three (the MAE and MSE for pedestrian movements using supervised schemes on the UCSD dataset can be found in [14]). The results demonstrate that the proposed method provides a reasonably good estimate of the crowd density.

Table 2 Results for MCG Dataset. MAE, MRE and RMSE measures have been tabulated

The prediction error plots for the three datasets are included in the Online Resource. The plots include the ground truth, predicted results and the error; furthermore, for each sequence, the mean and standard deviation per person are provided. Most methods report the mean error for the entire sequence (over all frames). However, we believe that such a measure does not provide adequate information about the density, since the error during increased or decreased crowd density is averaged over the entire sequence. Instead, we provide the prediction error as a function of density, which clearly highlights the trend of the system error. MCG dataset results are presented in Online Resource Figs. 1 and 2, PETS 2009 results in Online Resource Figs. 3-6, and UCSD dataset results in Online Resource Figs. 7-10. The mean error increases as the density exceeds 10 for the MCG dataset, 15 for the PETS 2009 dataset and 10 for the UCSD dataset. The error is attributed to heavy inter-object occlusions.

Table 3 Results for PETS 2009 dataset (View-001)
Table 4 Results for UCSD Pedestrian Database dataset

Various methods have been proposed by the research community on the PETS 2009 dataset. The proposed method is compared with the results provided in [1, 47, 75], tested on PETS 2009, and [12], tested on the UCSD dataset. Table 5 compares the previous work [75] and the proposed work, highlighting the prediction error in terms of MAE and MRE for regions R0, R1 and R2. The proposed approach achieves measures (MAE and MRE) similar to the others in regions R0 and R1. Since region R2 contains single-object cases for long stretches of time, the MRE measure is skewed and hence shows a higher error rate in R2. Subburaman et al. [75] used gradient features and training to arrive at their results, whereas the proposed approach uses motion features to determine the density.

Table 6 compares the previous work [75] and the proposed work for individual sequences rather than for the regions R0, R1 and R2. Results for [4] were obtained from [17, 75]. For the purpose of comparison, we averaged our measures (MAE and MRE) over the regions R0, R1 and R2 to obtain the MAE and MRE for each sequence. Considering the supervised approaches [4, 17, 75], the proposed approach provides similar results and is the best for S1.L3 (Timestamp: 14-17) and S1.L1 (Timestamp: 13-59), with MAE measures of \(1.40\) and \(1.62\); that is, the proposed method performed better in 2 out of 4 sequences. In sequence S1.L1 (Timestamp: 13-57), people move from right to left, are partially occluded, and the boundaries between groups are not clearly demarcated. In S1.L2 (Timestamp: 14-06), people move from right to left as a single group with occlusions. In contrast, in S1.L1 (Timestamp: 13-59), people move in groups from right to left such that the groups as well as individuals are well separated and occlusions are very limited. In sequence S1.L3 (Timestamp: 14-17), people run across the scene from left to right and are individually separated. The better performance of our proposed approach in 2 out of 4 cases is thus attributed to the fact that the proposed method performs better when occlusions are limited. Occlusion handling will be addressed in future work to enhance the system.

Table 5 Results for PETS 2009 dataset
Table 6 Comparison of results for PETS 2009 dataset

Noise induced at the low level by the nature of digital image formation can often introduce erroneous results in high-level processing. Such noise generally consists of small variations in pixel values within a given neighborhood. Though these variations go unnoticed by human perception (because our visual system filters them out), they result in high-amplitude noise in higher-level processing in the digital image domain. As a result, filtering or reducing noise at the low level of video processing is one of the key building blocks of video analysis. Filtering in this context means reducing variations among perceptually similar colors while maintaining edges to preserve the characteristic features. The bilateral filter is one such state-of-the-art filter: it was first proposed by Aurich and Weule [6] as a “non-Gaussian” filter in 1995 and later reintroduced by Smith and Brady [71] in 1997 and by Tomasi and Manduchi [76] in 1998. Tomasi and Manduchi [76] coined the term “bilateral” filter, and it remains the most popular edge-preserving filter today. The biggest advantage of bilateral filtering is that it maintains colors and edges that are perceptually meaningful to humans, mainly imitating the human visual system; in other words, the filter maintains the geometric and photometric closeness of neighboring pixels. In the proposed approach, it is applied to the input color frames. Applications of the bilateral filter to optical flow estimation can be found in [68, 87]. While preservation of edges is essential, the presence of multiple object boundaries introduces a large number of edges that degrade high-level processing in the case of crowd movements. The Gaussian filter, on the other hand, is a low-pass filter that eliminates high-frequency signals (edges) by averaging the pixels in the local neighborhood according to a 2D Gaussian function [76]. In the proposed approach, a Gaussian filter is applied on top of the bilateral filter to ensure that high-amplitude edges are suppressed. The combination of filters thus yields smooth object surfaces and simultaneously reduces high-amplitude edges for smooth motion estimation.

The proposed approach is suitable for sparse camera (single-view) networks, which form essential components of surveillance systems [72]. Counting people using tracking may be accomplished in sparse or dense camera networks. In dense camera networks, multi-view invariant features, combined with trackers, help resolve occlusions. The challenges in sparse camera networks are that only single-view features are available and that these features must be invariant; gradient-based optical flow, used in this work, is one such invariant feature. Furthermore, perspective normalization in dense camera networks provides accurate object sizes, whereas in sparse camera networks only camera calibration parameters are available. It must be noted that most existing methods weight the features to compensate for perspective distortion. In these cases there is an assumption that features are available for all detected objects and that the features are consistent for objects both near to and far from the camera. Most Optical Character Recognition (OCR) and License Plate Recognition (LPR) systems employ perspective correction [77, 89]; however, the features available for the farthest object decrease as the perspective distance from the camera increases [77]. Indeed, research indicates that perspective-corrected images yield more accurate results. For instance, a study by Clark et al. [15] on visual image registration based on the Scale-Invariant Feature Transform (SIFT), SURF and optical flow indicated that the two natural features, SIFT and SURF, delivered poor performance, whereas the inclusion of optical flow and perspective correction achieved the required invariance properties and delivered superior results. Therefore, detecting prominent features without correcting the perspective distortion will not yield accurate results, which has nevertheless been the major trend in existing approaches. In this work, we have an explicit 2D-2D geometric mapping that compensates for perspectivity.

In the literature, many methods employ background modeling to segment the foreground objects and map the foreground pixels to the number of objects. Others extract texture, color or intensity features and then apply training and classification while assuming that the objects are available. Compared to these methods, the proposed method uses motion estimation and contour analysis to extract the foreground objects without training. For instance, Chan et al. [12] use a temporal subvolume of the video to extract features, and a similar video-temporal subvolume-based approach using motion-only features, in which several frames are considered, was proposed by Rao et al. [65]. An improvement in this work is that the density estimation is based on individual frames rather than on several frames (except for the motion features, where two frames are required), which was the primary goal of the proposed approach.

An analogy can be drawn between the crowd monitoring system shown in Fig. 2 and the human visual system for density estimation. In the human visual system, rods and cones act as sensors capturing the visual images. Rods are sensitive to low-light vision and provide coarse information quickly, whereas cones require a minimum luminance, provide fine details and are slow in response. Ganglion cells aggregate signals from rods (many-to-one mapping) and cones (one-to-one) and send them to the Lateral Geniculate Nucleus (LGN), which acts as a carrier of signals for higher-level processing. The signals from the LGN are projected into the visual cortex for object recognition, registration and tracking. Edge, direction, spatial, temporal and chromaticity information are combined (logical AND/OR) and a decision is made at a higher level to accurately identify objects invariant to scale, space and identity [19, 48, 81]. In the proposed approach, the camera network is analogous to the rods and cones; the preprocessing filters to the ganglion cells and LGN; and people detection, motion tracking and density estimation to the functions of the visual cortex.

Although the overall results are promising, slow movements and shadows limit the effectiveness of the proposed algorithm. If objects move so slowly that the optical flow cannot determine the apparent motion, then object detection, contour extraction and density estimation may not be accurate. One such scenario is when people start to gather at a point and the people at the center do not move: only the exterior of the crowd has movement while the interior shows none. In those cases, the use of area information from motion cues would be valuable for deducing the crowd activity, and we used background modeling based on a Mixture of Gaussians (MoG) [26, 74] to handle such scenarios. Furthermore, dense optical flow was used to obtain globally smooth vectors; sparse optical flow methods are limited by small movements, whereas dense flow handles the articulated movements of people smoothly. The results of density estimation could be greatly improved by incorporating an explicit shadow handling method. However, it should be noted that in occluded scenarios shadows are less likely to form, and their presence has less impact on counting because of the occlusion. There are several instances where occlusion causes the number of detected foreground regions to differ from the actual crowd density of the scene. In such cases, tracking of people coupled with density estimation correction would be an ideal solution. Again, full-body or head detection and tracking depend on the number of distinguishable features and the tracker efficiency, and most trackers assume that the heads of the people are visible. In the case of the UCSD dataset, even head detection may become inefficient because of the lack of features (small objects) and the purely grayscale frames.
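As a hedged sketch of this fallback, a Mixture-of-Gaussians foreground mask can be maintained alongside the optical flow. OpenCV's MOG2 subtractor is used here as a stand-in for the MoG model of [26, 74], and the history and threshold values are assumed.

```python
import cv2

# Fallback for near-static crowds where optical flow sees little motion:
# a Mixture-of-Gaussians background model (MOG2 from a recent OpenCV is
# used as a stand-in; history/threshold values are assumptions).
subtractor = cv2.createBackgroundSubtractorMOG2(history=500,
                                                varThreshold=16,
                                                detectShadows=False)

def static_crowd_foreground(frame_bgr):
    """Foreground mask whose area can supplement the motion cues when
    people gather and stop moving (see the discussion above)."""
    return subtractor.apply(frame_bgr)
```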

Future work includes explicitly handling static crowds, resolving ambiguity during occlusions without tracking, and handling shadows in crowded scenarios. One possibility for handling crowd occlusions would be to adopt the concept of particle video as described by Sand and Teller [68]; however, the dynamics of crowd movements must be taken into account, which necessitates further study in this direction. Adding texture information would add another dimension to addressing problems with static crowds. Whether texture information provides any improvement in handling occlusions and shadows in crowded scenes needs to be investigated, as does the degree to which texture is necessary or redundant with respect to motion (and vice versa), and how well the combined features can be applied to crowd monitoring.

7 Conclusion

Understanding crowd behavior using automated video analytics is a relevant research problem from a video surveillance perspective, and crowd density estimation in a video scene is necessary for understanding crowd behavior. Most existing density estimation methods use supervised training schemes to arrive at the density estimates. In this work, we proposed crowd density estimation based on clustering motion cues. The proposed approach results in accurate estimates of the crowd density and delivered superior performance in \(50~\%\) of cases on the PETS 2009 dataset when compared with existing methods. The maximum mean error was 3.62 for the MCG and PETS datasets and 2.66 for the UCSD dataset, which is nominal for crowd density estimation.