1 Introduction

Automated video content analysis of crowded scenes has been an active research area in computer vision in the last few years. This strong interest is driven by the increased demand for public safety in crowded spaces such as airports, train stations, malls and stadiums. Conventional computer vision techniques for video surveillance cannot be directly applied to such scenes because of large variations in crowd density, complex crowd dynamics and severe occlusions.

Algorithms for people detection, tracking and activity analysis which consider an individual in isolation (i.e., individual object segmentation and tracking) often face difficult situations such as overlapping pedestrians and complex events arising from interactions among pedestrians in a crowd. For this reason, many papers consider the crowd as a single entity and analyse its dynamics. The status of the crowd is updated as normal or abnormal based on the dynamics of the whole crowd. However, a crowd can also be unstructured, with pedestrians relatively free to move in many directions, as opposed to a structured crowd in which each individual moves coherently in one common direction. In an unstructured crowded scene, considering the crowd as one entity will fail to identify abnormal events which arise from an inappropriate action of an individual in the crowd. For instance, a running person can indicate an abnormal event if the rest of the crowd is walking. Thus, considering the crowd as one entity can cause false detections.

Many papers work on modelling crowded scenes to identify different crowd events and/or to detect abnormal events. However, the definition of an abnormal event, or event of interest, has caused much confusion in the literature due to its subjective nature. Some researchers consider a rare and outstanding event as abnormal, while others consider events that have not been observed before as abnormal. The problem becomes more challenging as the density of people increases. As a result, more computer vision algorithms have been explored recently.

Despite the great interest and a large number of methods developed, there is a lack of a comprehensive review on crowd video analysis. As shown in Table 1, most current surveys focus on general human motion analysis [1, 5, 24, 75] of single or a small group of people, rather than addressing a crowded scenario. The survey paper by Zhan et al. [83], to the best of our knowledge, is the only one focusing on crowd video analysis. Zhan et al. reviewed some crowd density estimators and crowd modelling techniques, focusing on pedestrian detections, and tracking in a cluttered scene. However, they did not discuss the topic of crowd behaviour understanding and abnormality detection which is covered in this survey. We also present some advances on crowd motion modelling and multi-target tracking in a crowded scene which are not covered in the previous survey.

Table 1 A comparison of this chapter and previous surveys on human motion analysis and crowd video analysis

The goal of this survey is to review and organise the state-of-the-art methods in the domain of crowd video analysis such that their main focus becomes apparent. To achieve this, we have divided the research on crowd video analysis into three broad categories: macroscopic modelling, microscopic modelling and crowd event detection. The methods related to each task are further divided into sub-categories and a comprehensive description of representative methods is provided. In addition, we identify challenges and future directions for analysing a crowded scene. We believe this will help readers, especially newcomers to this area, to understand the major tasks of a crowded scene analysis system, and we hope it will motivate the development of new methods.

2 Macroscopic Modelling

In order to learn the typical motion patterns in a crowded scene, macroscopic observation-based methods utilise holistic properties of the scene, such as motion in local spatio-temporal cuboids or instantaneous motion. This is also the preferred approach for tracking and analysing the behaviour of both sparse and dense crowds using properties such as density, velocity and flow [31]. Figure 1 depicts the various features available for use in macroscopic modelling and the techniques that use those features.

Fig. 1 A schematic illustration of the topics involved in macroscopic crowd video analysis

2.1 Optical Flow Feature

Optical flow is a dense field of instantaneous velocities computed between two consecutive frames, and is commonly used to extract motion features [23]. Given a video of a crowded scene, the first step is to segment the input video into smaller video clips and compute pixel-wise optical flow between consecutive frames of each clip using the techniques in [11, 23, 49]. The extracted flow vectors may contain noise and redundant information. In order to reduce the computational cost and remove noise, researchers utilise unsupervised (Andrade et al. [6, 7] and Yang et al. [81]) or supervised (Hu and Shah [26, 27]) dimensionality reduction techniques. The next step is to find the representative motion patterns of the scene by merging flow vectors from all video frames. As Fig. 1 shows, the motion features extracted from the optical flow can feed several motion pattern extraction methods: Sink Seeking Process, Optical Flow Clustering, Interaction Force Modelling, Local Spatio-temporal Motion Variation Modelling and Spatio-temporal Gradient features. These methods can be used separately or integrated with one another to obtain the desired crowd analysis.

2.1.1 Sink Seeking Process

In the sink seeking process, a grid of particles is overlaid on the first frame of the video clip and advected using a numerical scheme. The path taken by a particle to its final position is called a sink path; the process of finding sinks (exits) and sink paths is therefore called a sink seeking process. Hu and Shah [26, 27] carry out the sink seeking process for each particle and thus generate one sink path per particle. These sinks and sink paths are later clustered to extract the dominant motion paths of the scene using an iterative clustering algorithm. On the other hand, Ali and Shah [3] generate a static floor field where each particle holds a value that represents the minimum distance to the nearest sink from its current location. Ali and Shah impose the static floor field, together with dynamic and boundary floor fields, as constraints for a tracking algorithm [4].
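The advection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme used in [26, 27]: the flow field is a hypothetical analytic function `flow(x, y)` standing in for measured optical flow, the integrator is plain forward Euler, and the stopping rule (speed below `eps`) is an assumption.

```python
import math

def flow(x, y):
    """Hypothetical flow field: every point drifts towards an exit at (10, 5)."""
    dx, dy = 10.0 - x, 5.0 - y
    dist = math.hypot(dx, dy)
    if dist < 1e-9:
        return 0.0, 0.0
    speed = min(1.0, dist)  # slow down near the exit
    return speed * dx / dist, speed * dy / dist

def sink_path(x, y, step=0.5, eps=1e-3, max_steps=1000):
    """Advect one particle with forward Euler; the last point of the path is its sink."""
    path = [(x, y)]
    for _ in range(max_steps):
        u, v = flow(x, y)
        if math.hypot(u, v) < eps:
            break  # the particle has (almost) stopped: treat this as the sink
        x, y = x + step * u, y + step * v
        path.append((x, y))
    return path

# One sink path per particle of a coarse grid overlaid on the frame.
paths = [sink_path(float(gx), float(gy))
         for gx in range(0, 10, 3) for gy in range(0, 10, 3)]
sinks = [p[-1] for p in paths]
```

With real data, the sinks and sink paths collected this way would be the input to the clustering stage described above.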

2.1.2 Optical Flow Clustering

Andrade et al. [6, 7] model the principal components of the optical flow vectors in each video clip using Hidden Markov Models. Then, video segments which have similar motion pattern are grouped together using the spectral clustering method. The resulting clustered video segments are modelled using a chain of HMMs to represent the typical motion pattern of the scene. The emergency events in the monitored scene are detected by finding deviations from the obtained model.

Rather than segmenting each video frame spatially, another approach is to cluster optical flow vectors by spatial grouping. Imran et al. [64] proposed to cluster the optical flow vectors in each video clip into N Gaussian mixture components. These Gaussian components are then linked over time using a fully connected graph, and connected component analysis of the graph is performed to discover different motion patterns. However, their method still faces the problem of determining how many components the mixture should contain.
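The linking stage can be illustrated with a small sketch. It assumes the mixture components have already been fitted and are summarised by hypothetical tuples (clip index, mean position, mean flow angle); the linking rule in `similar` and its thresholds are assumptions for illustration and are not taken from [64].

```python
import math
from collections import deque

# Hypothetical components: (clip index, mean x, mean y, mean flow angle in radians)
components = [
    (0, 10.0, 10.0, 0.0),
    (1, 12.0, 10.5, 0.1),
    (2, 14.0, 11.0, 0.05),
    (0, 50.0, 40.0, math.pi / 2),
    (1, 50.5, 43.0, math.pi / 2 + 0.1),
]

def similar(a, b, max_dist=5.0, max_angle=0.3):
    """Assumed linking rule: adjacent clips, nearby means, similar flow direction."""
    if abs(a[0] - b[0]) != 1:
        return False
    close = math.hypot(a[1] - b[1], a[2] - b[2]) <= max_dist
    # Wrap the angular difference into [-pi, pi] before comparing.
    dtheta = abs(math.atan2(math.sin(a[3] - b[3]), math.cos(a[3] - b[3])))
    return close and dtheta <= max_angle

def motion_patterns(comps):
    """Group components into motion patterns via connected components (BFS)."""
    n = len(comps)
    adj = {i: [j for j in range(n) if j != i and similar(comps[i], comps[j])]
           for i in range(n)}
    seen, patterns = set(), []
    for start in range(n):
        if start in seen:
            continue
        queue, group = deque([start]), []
        seen.add(start)
        while queue:
            i = queue.popleft()
            group.append(i)
            for j in adj[i]:
                if j not in seen:
                    seen.add(j)
                    queue.append(j)
        patterns.append(sorted(group))
    return patterns

patterns = motion_patterns(components)
```

Each returned group of component indices corresponds to one candidate motion pattern that persists across clips.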

2.1.3 Interaction Force Modelling

In addition to learning dominant motion patterns, the optical flow vectors can also be used to model the interaction forces of a crowd; the model is then used to assess the stability of the crowd. For example, Mehran et al. [53] employ the optical flow vectors to model pedestrian motion dynamics using a social force model. Social force models [22] have been used in many computer graphics studies to create crowd animations [54]. In this model, the motions of pedestrians are modelled with two forces: a personal desire force and an interaction force. The interaction force is defined as an attractive and repulsive force between pedestrians. In [53], the interaction force between pedestrians is estimated from optical flow computed over a grid of particles. The normal pattern of this force is later used to model the dynamics of a crowded scene and detect abnormal behaviours in crowds.
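As a rough sketch of the two-force decomposition, the interaction force can be taken as what remains of a particle's acceleration once the personal desire force is subtracted. The relaxation form of the desire force, the blend used for the desired velocity, and the parameter names `mass`, `tau` and `p` are assumptions made for illustration; they should not be read as the exact formulation of [53].

```python
def interaction_force(v_prev, v_curr, flow_here, flow_avg,
                      dt=1.0, mass=1.0, tau=0.5, p=0.0):
    """Estimate F_int = m * dv/dt - (v_desired - v) / tau for one particle.

    All velocities are 2-D tuples. p blends the local flow with the averaged
    flow to form the desired velocity (assumed form).
    """
    v_des = tuple((1.0 - p) * o + p * oa for o, oa in zip(flow_here, flow_avg))
    accel = tuple((c - q) / dt for c, q in zip(v_curr, v_prev))
    desire = tuple((d - c) / tau for d, c in zip(v_des, v_curr))
    return tuple(mass * a - f for a, f in zip(accel, desire))

# A particle already moving at its desired velocity feels no interaction force.
calm = interaction_force((1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0))
# A particle forced to stop although the flow says "keep going" feels a strong one.
blocked = interaction_force((1.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0))
```

Mapping the magnitude of this force over the particle grid gives a per-frame picture of where the crowd is interacting most strongly.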

2.1.4 Local Spatio-Temporal Motion Variation Modelling

Optical flow data can also be used to model the variations of motion in local spatio-temporal volumes in order to describe the typical motion patterns of the scene [40–42, 50, 52, 79, 81]. In these approaches, the image space is usually divided into cells of a specific size (e.g., 10×10 in [81]) or cuboids (e.g., 30×30×20 in [42]). Then, the optical flow computed in each cell is quantised into different directions. For instance, Yang et al. [81] consider each quantised direction at a given location as a word and group these video words into clusters using a diffusion embedding method. Each node in the graph corresponds to a word, and the clusters extracted in the embedded space represent the typical motion patterns of the scene. Kim and Grauman [40] used a space-time Markov Random Field (MRF) graph to detect abnormal activities in video. Each node in the graph corresponds to a local region in the video frames, where the local motion is modelled using a mixture of probabilistic principal component analysers. Wu et al. [79] used a Lagrangian framework to extract particle trajectories. These particle trajectories are later used to model regular crowd motion. Deviations of new motion from the learnt model indicate an abnormal event.
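The cell-wise quantisation step might look like the sketch below. The flow field is assumed to be given as a nested list of (u, v) vectors, and the number of direction bins and the magnitude threshold are illustrative choices rather than values from the cited papers.

```python
import math

def direction_words(flow, cell=10, bins=8, min_mag=0.5):
    """Map each cell of a dense flow field to a histogram over quantised directions.

    flow: 2-D list (rows x cols) of (u, v) vectors.
    Returns {(cell_row, cell_col): [count per direction bin]}.
    """
    hist = {}
    for r, row in enumerate(flow):
        for c, (u, v) in enumerate(row):
            if math.hypot(u, v) < min_mag:
                continue  # ignore near-static pixels
            angle = math.atan2(v, u) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins
            key = (r // cell, c // cell)
            hist.setdefault(key, [0] * bins)[b] += 1
    return hist

# Synthetic 20x20 field: the left half and the right half move in different directions.
field = [[(1.0, 0.2) if c < 10 else (-0.2, 1.0) for c in range(20)]
         for r in range(20)]
words = direction_words(field)
```

Each (cell, direction bin) pair plays the role of a "word"; the histograms can then be passed to whichever clustering or graphical model is used downstream.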

2.2 Spatio-Temporal Gradient Feature

In addition to optical flow information, other features such as spatio-temporal gradients are also used to model the regular movement of a crowd [42, 50]. In [42], a coupled HMM is trained on the distribution of spatio-temporal motions to detect localised abnormalities in densely crowded scenes. Vijay et al. [52] combined motion information and appearance features to represent the local properties of a scene. The normality of a crowded scene is learned using a mixture of dynamic textures. Then, temporal and spatial abnormalities are detected separately by finding deviations from the normal pattern. Their method has been shown to outperform the state-of-the-art methods, but at a high computational cost. To address this limitation, Reddy et al. [61] proposed a simpler method using a set of similar features, including shape, size and texture, extracted from foreground pixels. The computational cost is reduced by removing background noise and considering each feature type individually. Compared to [52], the method proposed by Reddy et al. [61] achieved considerably better results.

2.3 Summary

To conclude the discussion on macroscopic modelling, a summary of the strengths and weaknesses of the various state-of-the-art implementations is provided in Table 2.

Table 2 Summarisation of the macroscopic modelling techniques

3 Microscopic Modelling

Microscopic analysis and modelling depends on the analysis of video trajectories of moving entities. This approach, in general, contains the following steps:

  1. detection of the moving targets present in the scene;

  2. tracking of the detected targets; and

  3. analysis of the trajectories to detect dominant flows, and to model typical motion patterns.

Researchers have used different detection and tracking algorithms to generate reliable trajectories. Tracking people in crowds can either be used as a means to improve crowd dynamics analysis, by mining trends out of the tracks (bottom-up approach to crowd analysis), or, conversely, tracking methods can use cues obtained from the analysis of crowd dynamics in order to improve accuracy (top-down approach). The complexity of tracking algorithms depends on the context and environment in which the tracking is performed. In the context of crowd video analysis, tracking individuals within a crowd introduces additional complexity due to the interactions and occlusions between people in the crowd. A number of tracking methods have been proposed to overcome the challenges encountered in a crowded scene. In this section, some popular human tracking methods in the context of crowd video analysis are discussed. The reader is referred to the survey by Yilmaz et al. [5] for a comprehensive review of various trackers. Figure 2 shows the different topics covered by this section.

Fig. 2 The topics for microscopic approach, in which a focus is put on individual tracking in crowds

3.1 The Particle Filter (PF) Framework

The most popular approach for tracking is the particle filter-based framework. The particle filtering framework was first introduced for visual tracking by Isard and Blake [29]. Initially, particle filter approaches were based only on colour cues and could track only a single target.
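For readers unfamiliar with the framework, the predict, weight and resample cycle is sketched below in its simplest (bootstrap) form. This is a toy under stated assumptions: the state is a 1-D position, the motion model is a random walk, and the likelihood is a Gaussian around a noisy position measurement, whereas the visual trackers discussed here would score particles with colour or other appearance cues. All parameter names are hypothetical.

```python
import math
import random

def particle_filter(observations, n=500, motion_std=1.0, obs_std=2.0, seed=0):
    """Bootstrap particle filter tracking a 1-D position from noisy observations."""
    rng = random.Random(seed)
    particles = [observations[0] + rng.gauss(0.0, obs_std) for _ in range(n)]
    estimates = []
    for z in observations:
        # Predict: diffuse particles with a random-walk motion model.
        particles = [x + rng.gauss(0.0, motion_std) for x in particles]
        # Update: weight each particle by the Gaussian likelihood of the observation.
        weights = [math.exp(-0.5 * ((z - x) / obs_std) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            weights = [1.0] * n  # degenerate case: fall back to uniform weights
            total = float(n)
        weights = [w / total for w in weights]
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # Resample: draw a new particle set in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n)
    return estimates

# A target moving one unit per frame.
estimates = particle_filter([float(t) for t in range(20)])
```

Because the motion model here has no velocity term, the estimate lags slightly behind a steadily moving target; richer dynamic models reduce that lag.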

3.1.1 Additional Cues for Improved PF

A particle filter implementation based on appearance using colour information only does not perform well when tracking more than one individual, especially when they wear similar clothing. In public demonstrations, sports matches and celebrations, people often look alike. Thus, a series of papers present alternatives to the plain ‘colour-only’ particle filter. Combinations include colour with contours, Harris or SIFT features [47, 59, 67, 78]; Histograms of Oriented Gradients (HOGs) along with colour information [69]; and Mean Shift and joint probabilities [10].

A completely different approach to improving tracking with particle filters is presented in [84], which mines the interdependencies between particles in order to improve the results. Also different is the tracker proposed in [28], which employs a particle filter framework whose state transition model is estimated by an optical-flow algorithm instead of being a pre-defined dynamic transition model.

There are also authors whose interest is in extending the particle filter to multiple cameras; in that case, particles are “shared” and “fused” among the views [57].

Others propose blob-based segmentation and tracking when no occlusions are present, and limit the use of particle filters to occlusion resolution [70, 86]. The limitation of these techniques seems clear: blobs are needed and used as the main cue, and blobs are not available in most crowded scenes, although these techniques can be useful in sparse crowds.

Silhouettes or contours can be a useful cue for action recognition, or people counting in crowds; obviously, in the case of densely crowded scenes, only partial contours can be extracted, although those can be quite useful (e.g., as in ‘Ω shape’-based methods). Since particle filter approaches work regardless of segmentation, reconstructing contours a posteriori to obtain shape cues might be of interest. Ma et al. [51] present this idea: Graph Cuts are applied to a particle filter method to obtain the silhouettes of tracked objects.

3.1.2 Alternative Cues for Tracking: Self-similarity

Shechtman and Irani [66] introduced the concept of self-similarity as a visual feature. Among its applications, they name object detection and recognition, action recognition, texture classification, data retrieval, tracking and image alignment. BenAbdelkader et al. [12] seem to be the first to use image self-similarity plots (ISSPs) for gait recognition; according to the authors, some works state that the ISSP of a moving person or object is a projection of its planar dynamics and, as such, should encode much of the gait information. Junejo et al. [34, 35] use a very similar descriptor for action recognition, with self-similarity matrices (SSMs) as descriptors of the action class. Dexter et al. [17] extend the SSM concept in order to apply it to the synchronisation of actions taken from multiple views. Rani and Arumugam [60] use it as a biometric signature in gait recognition, as in [12]. Walk et al. [74] introduced self-similarity as a feature for pedestrian detection, and Cai et al. [14] have used it for person re-identification across different cameras or moments; the authors create a colour codebook and obtain the spatial occurrence distributions of colour self-similarities. To the best of our knowledge, as of today, no works seem to use self-similarities as a feature for tracking, although Gu et al. state it could be used as an alternative to other local descriptors such as SIFT or SURF.

3.1.3 Multiple Target Tracking Using PF

This framework has been extended in a series of papers [2, 15, 20, 39, 58] for tracking multiple targets. For example, Okuma et al. [58] extend a particle framework by incorporating a cascaded AdaBoost algorithm for the detection and tracking of multiple hockey players in a video. The AdaBoost algorithm is used to generate detection hypotheses of hockey players. Once the detection hypotheses are available, each hockey player is modelled with an individual particle filter that forms a component of a mixture particle filter. Similarly, Ali and Dailey [2] combine an ‘AdaBoost cascade classifier’-based head detection algorithm and the particle filtering method for tracking multiple persons in high density crowds. The performance is further improved by a confirmation-by-classification method to estimate confidence in a tracked trajectory.

To conclude this subsection, a summarisation of the presented methods is shown in Table 3. Both single and multiple view methods are presented, as well as single and multiple target ones.

Table 3 Summarisation of the presented techniques

3.2 Handling Occlusions

Occlusions are one of the most important problems trackers need to face, since generalised models for them are not straightforward [44]. According to the survey in [82], occlusion can be classified into three categories: self-occlusion, which occurs while tracking articulated objects; inter-object occlusion (or dynamic occlusion [72]), which arises when two tracked objects occlude each other; and occlusion by the background (or scene occlusion [72]), which occurs when structures in the scene (e.g., tree branches or pillars) occlude the objects being tracked. Some approaches have already been presented in Sect. 3.1.1 [70, 86]. Yilmaz et al. [82] examine occlusion handling through the lens of the tracking technique in use. A series of tracker families is presented (point, ‘geometric model’-based and silhouette); each tracking technique is then classified according to whether or not it can handle occlusions and, if it can, whether these can be full or only partial. Following this idea, trackers that respond well when occlusions are present can be used for occlusion handling. In [85], the Kanade-Lucas-Tomasi (KLT) tracker is employed to resolve occlusions, while a particle filter is used as the main tracker. Similarly, a technique based on Mean-shift is used in [16].

Apart from exploiting the features of “occlusion-friendly” trackers, a series of dedicated occlusion handling techniques have also been devised. Wang et al. [77] present a good historical review of such methods, which rely on the object’s motion model and keep predicting the object’s location until it reappears. The authors state that serious long-term occlusions cannot be dealt with by this kind of technique, since observations cannot be obtained while the object is occluded for a long period of time. Vezzani et al. [72] propose what they call the non-visible regions model, which deals with partial and full occlusions, whether these are inter-object or due to the scene. The object model is updated in a pixel-wise fashion: the appearance is updated only for the visible pixels, and the probabilities associated with those are reinforced, while they remain unchanged for invisible pixels. Furthermore, in pixels with no correspondence due to changes in the shape of the object (called appearance occlusions), probabilities are smoothed. Wang et al. [77], on the other hand, propose a means of modelling the occluder; once it is modelled, when an object disappears due to occlusion, a search is performed around the occluder in order to find the occluded object as it reappears.
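The "keep predicting until the object reappears" strategy can be sketched with a constant-velocity model. The function name, the representation of missed detections as `None`, and the simple finite-difference velocity estimate are assumptions for illustration; they are not drawn from [77].

```python
def track_with_occlusion(detections, dt=1.0):
    """Follow a target with a constant-velocity model, coasting while it is occluded.

    detections: list of (x, y) tuples, or None for frames without an observation.
    """
    pos, vel, track = None, (0.0, 0.0), []
    for det in detections:
        if pos is None:
            pos = det  # stays None until the first detection arrives
        elif det is None:
            # Occluded: no observation, so coast on the motion model.
            pos = (pos[0] + vel[0] * dt, pos[1] + vel[1] * dt)
        else:
            # Visible: refresh the velocity estimate and snap to the detection.
            vel = ((det[0] - pos[0]) / dt, (det[1] - pos[1]) / dt)
            pos = det
        track.append(pos)
    return track

# The target is hidden for two frames and then reappears where predicted.
track = track_with_occlusion([(0, 0), (1, 0), (2, 0), None, None, (5, 0)])
```

The sketch also makes the stated limitation visible: the longer the run of missing detections, the further an error in the last velocity estimate is extrapolated.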

In [37], the authors present a series of monocular approaches to occlusion handling, but only to conclude that single-view systems are intrinsically unable to handle occlusions correctly. The authors of [21, 37] use multiple oblique-view cameras to handle occlusions appropriately, and devise a common plane reconstruction using communication among cameras. Approaches based on multiple views are designed to reduce the number of hidden regions. Unfortunately, in the case of existing static camera networks, this is not always possible due to the restrictions of their infrastructures, which were not initially devised for automated surveillance. Another approach to occlusion handling is to avoid occlusions in the first place. Occlusions can be reduced by placing the camera appropriately, as suggested by [82] (e.g., with a bird’s-eye view camera, no occlusions occur between the objects on the ground), but the problem of existing infrastructures persists.

Nevertheless, when dealing with occlusions in heavily crowded scenarios, full-body tracking is infeasible due to the continuous presence of partial occlusions, especially from side views [13]. Since existing cameras tend to be placed above people’s heads and tilted downwards towards the scene, some authors suggest it is reasonable to assume that heads and shoulders (often referred to as the Omega-shape [46]) will always be visible, and that occlusions among subjects’ heads are less frequent than among the rest of the body parts.

3.3 Improving Tracking Using Crowd-Level Cues

As stated in the introduction to this section (Sect. 3), tracking methods can use cues obtained from the analysis of crowd dynamics, in order to improve their accuracy, in a top-down approach. These higher-level cues can be either contextual or coming from the social interactions among the people in the crowd.

3.3.1 Higher-Level Contextual Information

Several studies have demonstrated that exploiting high-level contextual information improves the performance of human tracking significantly. Antonini et al. [9] use a discrete choice model (DCM) as a motion prior to predict human motion patterns and then fuse this model into a human tracker for improved performance. Similarly, Ali et al. [4] propose to exploit contextual information for tracking multiple people in a structured crowded scene. Assuming that all participants of the crowd are moving in one direction, Ali et al. learn the direction of motion as prior information based on floor fields. The authors have demonstrated that a higher-level constraint greatly increases the performance of the tracker. However, floor fields can be learned only when the scene has one dominant motion. As a result, the method proposed in [4] cannot be applied to unstructured crowded scenes, where the motion of a crowd appears to be random, with different participants moving in different directions over time. Some examples of unstructured crowded scenes include crowds at exhibitions, sporting events and railway stations. This shortcoming is addressed by Mikel et al. [63], who employ a correlated topic model for modelling random motions in an unstructured crowded scene. Similarly, Kratz and Nishino [43] employ the normal motion pattern, learnt from local motion in fixed-size cells, to predict the motion of tracked individuals in a crowded scene.

3.3.2 Social Interactions

Another interesting direction in multiple target tracking is to integrate the social interaction of targets into the tracking algorithm. This idea is motivated by the behaviour of targets in a crowd. In crowded scenarios, the behaviour of each individual target is influenced by the proximity and behaviour of other targets in the crowd. Several methods [8, 19, 39, 48, 80] integrate the social interactions among targets into the tracking algorithm. This direction has shown promising performance for tracking multiple targets in crowded scenes. An early example which models the social interaction of targets is the Markov Chain Monte Carlo-based (MCMC) particle filter [39]. This method models the social interactions of targets using a Markov Random Field and adds a motion prior in a joint particle filter. The traditional importance sampling step in the particle filter is replaced by an MCMC sampling step. French et al. [19] extended the method in [39] by adding social information to compute the velocity of particles. In [80], the authors formulated the tracking problem as the minimisation of an energy function. The energy function is defined based on both the social information and the physical constraints of the environment. Their preliminary results indicate that social information provides an important cue for tracking multiple targets in a complex scene. An overview of tracking algorithms that incorporate different high-level contextual information is illustrated in Fig. 3.

Fig. 3 An overview of different tracking algorithms that incorporate high-level contextual information

3.4 Tracking in Crowds from Multiple Views

Researchers have also explored the use of multiple cameras for tracking people under severe occlusion in a complex environment. Multiple camera tracking methods aim to expand the monitored area and provide complete information about persons of interest by gathering evidence from different camera views. Lee et al. [45] propose a multiple people tracking method for wide-area monitoring. An automated calibration method is introduced to find correspondences between distributed cameras. In their method, all camera views are calibrated to a global ground-plane view based on geometric constraints and tracking trajectories from each view. Another example in a similar context can be found in the papers by Khan and Shah [36, 38]. A planar homographic occupancy constraint that combines foreground likelihood information from different views is proposed for detection and occlusion resolution.
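The geometric core of these ground-plane methods is the projection of image points through a planar homography. The sketch below shows only that projection; the two matrices are hypothetical placeholders for homographies that would in practice come from calibration, and nothing here reproduces the specific constraints of [36, 38, 45].

```python
def to_ground_plane(H, point):
    """Map an image point (x, y) to ground-plane coordinates with a 3x3 homography H."""
    x, y = point
    X = H[0][0] * x + H[0][1] * y + H[0][2]
    Y = H[1][0] * x + H[1][1] * y + H[1][2]
    W = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(W) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return X / W, Y / W

# Hypothetical homographies for two cameras (placeholders for calibrated values).
H_cam1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H_cam2 = [[0.5, 0.0, 2.0], [0.0, 0.5, 1.0], [0.0, 0.0, 1.0]]

# Foot positions of the same person, seen in each view, agree once projected.
g1 = to_ground_plane(H_cam1, (4.0, 3.0))
g2 = to_ground_plane(H_cam2, (4.0, 4.0))
```

Agreement of projected foot positions across views is what allows evidence from several cameras to be accumulated at a single ground-plane location.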

Another use of multiple cameras is to track people in an environment covered by multiple cameras with overlapping views. Mittal and Davis [55] use pairs of stereo cameras and combine evidence gathered from multiple cameras for tracking people in a cluttered scene. Foreground regions from different camera views are projected back into a 3D space so that the endpoints of the matched regions yield 3D points belonging to people. Dockstader and Tekalp [18] employ a Bayesian network for fusing 2D position information acquired from different camera views to estimate the 3D position of the person of interest. Finally, a layer of Kalman filtering is used to update the position of people. A combination of static and pan-tilt-zoom (PTZ) cameras for multiple camera tracking is introduced in [65]. The static cameras provide a global view of the persons of interest, while the PTZ cameras are used for face recognition.

This brief overview of the research literature indicates that multiple camera tracking methods provide an interesting mechanism to handle severe occlusion and to monitor large areas in public spaces (as seen in Sect. 3.2). However, the advantages of multiple cameras come with additional issues such as camera calibration, matching information across camera views, automated camera switching and data fusion. These challenges are yet to be solved. On the other hand, integrating the social interaction among targets into the tracking algorithms has shown promising performance for tracking individual targets in a crowd.

4 Event Detection in Crowds

A series of surveys and reviews in this field [30, 32, 56, 62, 68, 73, 83] shows that there is great interest in this area. Detecting anomalies or outstanding events in crowds has attracted a lot of research effort. Automatic systems would reduce the burden of manual video supervision, which is infeasible in most cases given the enormous amount of data compared to the manpower available to process it [68, 73, 83]. Detection of anomalies in crowded scenes can be seen as a classification problem where only two classes are defined (i.e., “normal” versus “anomalous”) [68].
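One simple way to make the "normal" versus "anomalous" decision concrete, when only normal footage is available for training, is to model normal feature values and flag large deviations. The sketch below assumes hypothetical per-pedestrian features (speed and heading), an independent Gaussian per feature and a z-score threshold; none of these choices comes from the cited surveys.

```python
import math

def fit_normal_model(samples):
    """Fit an independent Gaussian (mean, std) per feature from normal-only data."""
    n, dims = len(samples), len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    stds = [math.sqrt(sum((s[d] - means[d]) ** 2 for s in samples) / n) or 1e-6
            for d in range(dims)]
    return means, stds

def is_anomalous(sample, model, threshold=3.0):
    """Flag a sample whose largest per-feature z-score exceeds the threshold."""
    means, stds = model
    return max(abs(x - m) / s for x, m, s in zip(sample, means, stds)) > threshold

# Hypothetical features: (speed in m/s, heading in radians) of walking pedestrians.
walking = [(1.2, 0.0), (1.4, 0.1), (1.3, -0.1), (1.1, 0.05), (1.5, -0.05)]
model = fit_normal_model(walking)

walker_flag = is_anomalous((1.3, 0.0), model)   # typical walker
runner_flag = is_anomalous((4.0, 0.0), model)   # someone running through the crowd
```

A person running through a walking crowd, the example used in the introduction, is flagged because the speed feature lies far outside the range seen during training.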

The survey by Sodemann et al. [68] analyses the works in the literature across five aspects:

  1. the target/s of interest (a person, a crowd);

  2. the definitions of what is anomalous, and the assumptions taken;

  3. the types of sensors involved, and the features used;

  4. the learning methods; and

  5. the modelling algorithms.

According to the authors, their survey is focused on the broader problem formulation and assumptions, rather than providing a review of specific pattern classification methods. Revathi and Kumar [62] provide a categorisation of anomalies according to the number of people and other objects involved. Three categories are defined: anomalies involving a single person with a single object, multiple people with multiple objects, and group behaviour.

Vishwakarma and Agrawal [73] analyse human action recognition in video surveillance in more general terms, although they present an interesting taxonomy for classifying the complexity of topic-related algorithms. From a completely different point of view, two other works review physics- and hydrodynamics-based techniques [32, 56] for anomaly and event detection in crowds. Moore et al. [56] present a review of techniques for crowd analysis that consider huge crowds as “fluids” or “liquids”, which are bound by a series of rules and forces (e.g., repulsion and attraction) that explain the interactions among the particles that make up that fluid.

Jo et al. [32] further explore other physics-based techniques, and classify the works according to the categorisation presented in [30], which presents various “domains”: the image space domain, based on the analysis at the pixel, texture or object levels; the sociological domain, which accounts for the social interactions or “crowd mentality”; the level of services, where different crowd conditions are provided; or the computer graphics domain, which deals with realistic crowd simulation. Jacques Junior et al. [30] also classify crowd event detection techniques as either object-based, in which individuals are tracked and these tracks are used to analyse the situations [25, 33, 76]; or holistic-based [7, 53, 71], where the crowd is considered as a whole, and events are detected by extracting the major crowd flows from the monitored scene.

5 Summary

This chapter presents a review and comparative study of various topics in the area of crowd video analysis. The advantages and disadvantages of the state-of-the-art methods related to video analytics in crowded scenes have been detailed.

Tracking individuals in a high-density crowd has been addressed in recent years, whereas earlier work tracked individuals in sparse or even ad-hoc scenarios. A major advance is the introduction of high-level crowd motion patterns as a prior into a general framework [4, 63]. However, tracking remains a challenging problem in computer vision. One major challenge for tracking in a crowded scene is inter-object occlusion due to the interactions of participants in a crowd. There remains a gap between the state of the art and robust tracking of people in a crowded scene. Most recent trackers for crowds use particle filters with different kinds of features; the use of self-similarity measures for this particular application can be of interest and deserves further research, given the results it has achieved in other computer vision fields.

In recent years there has been substantial progress towards understanding crowd behaviour and detecting abnormalities by modelling crowd motion patterns. However, these approaches capture the general movement of a crowd but do not accurately detect the details of individual movements. As a result, the current literature on understanding crowd motion is not ready to capture the motion pattern of an unstructured crowd scene, where the motion of the crowd appears to be random [63]. Future research in this area requires localised modelling of crowd motion to capture different behaviours in unstructured crowded scenes. On the other hand, the understanding and modelling of crowd behaviour remains immature despite the considerable advances in human activity analysis. Progress in this area requires further advances in modelling or representation of a crowd event and recognition of these events in a natural environment.