1 Introduction

1.1 Overview

Detecting human beings and identifying their actions in surveillance videos is gaining importance because of its wide variety of applications: detecting anomalous events, counting people in a dense crowd, human identification, traffic safety and surveillance, sports analysis, gender classification, human gait characterization, fall detection for elderly people, and so on.

New surveillance cameras are installed around the world every day, as webcams, for surveillance and other purposes. Most digital video surveillance systems depend on human observers to detect and identify specific activities in the video scenes. However, human capability to monitor simultaneous events in a video surveillance system is limited. Hence, automated human event analysis in video surveillance systems has become one of the most active and attractive research topics in computer vision and pattern recognition.

However, most existing multi-person tracking methods are still limited to special application scenarios. They require multi-camera input, scene specific knowledge, a static background, or depth information, or are not suitable for online processing.

Moreover, individual and group activities may occur in the same scene, which makes such situations much harder to represent and recognize. Grammar models have been widely used in complex visual event recognition in recent years. To apply grammar models to event recognition, low-level features are usually first extracted from the videos and then classified into a set of terminal symbols, i.e., visual event primitives. The proposed work detects and identifies both individual and group events, and thereby overcomes this drawback of existing systems; it is also based on a cognitive linguistic method that uses unsupervised learning.

Detecting humans is a difficult task from a machine vision perspective because of the wide variety of possible appearances caused by changing articulated pose, lighting, clothing and background; prior knowledge of these limitations can improve detection performance. The proposed system detects and captures motion information of moving targets for accurate object classification. Unsupervised classifiers are used as the learning method and the labels are known; features of instances such as kick, hug, or punch are extracted and the events are detected. The classified object is then used for high-level analysis.

1.2 Objectives

The main objective of this research work is to build a framework that recognizes small human groups and detects events in the video. This framework is used for automated small human group event detection in social or public environments, and it also helps to recognize wrongdoing in places such as railway stations, traffic, colleges, and offices.

1.3 Problem Statement

Detecting human beings and identifying their actions in surveillance videos is gaining importance because of its wide variety of applications: detecting anomalous events, counting people in a dense crowd, human identification, traffic safety and surveillance, sports analysis, gender classification, human gait characterization, fall detection, and so on.

The proposed work represents events involving both a single individual and multiple individuals; hence it overcomes the drawback of existing systems. It is based on a cognitive linguistic method that uses unsupervised learning.

The proposed system is able to identify events automatically in the video surveillance system. It thus reduces human interaction with the video surveillance system and raises alerts as events are detected.

1.4 Proposed Methodology

The images are given as input to the training database. The RGB images are preprocessed using mathematical morphology to reduce noise and then converted to grayscale. Features are extracted using the HOG descriptor and reduced using PCA. The result is stored in a file that is used to train the SVM classifier. The testing dataset, on the other hand, is converted to frames and preprocessed, and the absolute difference between frames is evaluated to distinguish the foreground from the background. A further morphological operation reduces noise, followed by feature extraction using HOG and PCA and classification using the SVM classifier. The key techniques used are:

  1. Preprocessing using morphological operations.

  2. Feature extraction and reduction using HOG and PCA.

  3. Classification using SVM.

1.5 Applications

Automated anomaly detection has a wide variety of applications and huge potential in the field of video surveillance. Even though video surveillance cameras are installed everywhere, the human resources available to monitor the footage are limited. An automated system helps to overcome the resulting human errors. Events such as trespassing can be flagged immediately when an automated system is in place.

Detecting non-human objects in unexpected places helps to improve security measures. Automated detection also helps with person counting in densely crowded places such as those shown in Fig. 1. An automated anomaly detection system may aid fall detection in the homes of the elderly. Traffic safety is a major application of anomaly detection: speeding vehicles or drivers breaking the law can be detected and reported immediately. Another growing field is sports analysis, where an automated system might alert the referee or judge to actions that may otherwise be overlooked.

Fig. 1. Analysis of different scenario through anomalous event detection.

2 Literature Survey

Zhaozhuo Xu [1] introduces a Human-Object Interaction model and establishes methods and systems to recognize dangerous events. In this approach, event understanding is based on identifying dangerous objects in the possible areas predicted by human body parts. The accuracy of dangerous human event understanding improves when human body part estimation is combined with object detection.

Dongping Zhang [2] presents an approach to identify group-level crowds and detect abnormal activities in them. It incorporates particle motion information, calculated using a set of sample images with long trajectories and other properties, to identify small human crowds in motion in foreground images. The science of human behaviour is studied and employed to detect normal and abnormal activity. Attributes such as orientation, velocity and crowd size are used to distinguish between normal and abnormal behaviour.

MyoThida [3] presents a review of crowd video analysis. Automated surveillance has become important in crowded places such as shopping malls, railway stations and airports, and providing intelligent solutions for these places is of high priority to computer researchers. The paper provides a thorough review of the existing automation techniques for analyzing complex and crowded scenes.

The merits and demerits of the various modern methods are discussed in detail. Tracking individuals in a crowd is a major topic. It is a highly complex task due to interactions with various other objects present in the crowd.

M. Sivarathinabala [4] proposes an intelligent video surveillance system, which can be remotely monitored and alerts the user in a situation that the system may interpret as an anomaly. The main focus is on monitoring a single person in situations such as a burglary.

A live video is captured and reduced to images, which then undergo preprocessing. Human behaviour analysis plays an important role in detecting anomalous human activity. This is done by comparing existing sample templates with the processed image. If a match is found, the image is stored in the system and an alert is sent by MMS, SMS or email, as specified by the user. The live video is then compressed and a key frame is specified so that the required part of the video can be retrieved directly.

The paper concludes by providing an automated surveillance method that not only identifies an anomaly but also triggers an alert to the user. It helps in retrieving the suspected video by holding key frame values and helps in extracting images of individuals before and after the incident.

Manoranjan Paul [5] throws light on the need for accurately detecting anomalies in videos and its applications in surveillance technologies. Detecting human beings and their actions accurately in a video has various applications such as person identification, fall detection for elderly people, event classification and gender classification.

The authors use the benchmarks set by a few existing datasets for comparison and for their assessment. An intelligent system can capture and detect moving objects in a video. In this study the authors focus on detecting human beings only. This in itself is a complex task because of the many attributes that vary from person to person and scene to scene, such as clothing, pose, lighting and background.

Detecting objects in a surveillance video is a challenging task due to the low resolution of the video. This paper discusses different methods of object identification and object classification. The various benchmarks are discussed and the applications of human detection in surveillance videos are reviewed.

C. Stauffer [6] proposed a computer vision algorithm for detecting and analyzing the motion of people in crowds. The algorithm divides the background into regions, tracks the crowds and analyses every movement of the people.

D. Ryan [7] develops a scene-independent approach that can count the number of people in a crowd. A scene-independent counting system can easily be deployed at different places. The counting is done using a global scaling factor that relates crowd size from one scene to another.

A further condition is providing the right heuristic ranking of the individuals, to avoid confusing them with one another. Robustness is then achieved by finding optimal trajectories over many frames while avoiding the combinatorial explosion that would result from dealing with all individuals simultaneously.

3 System Design

3.1 Overview

Figure 2 shows the system architecture, which is broken down into flowcharts in the later sections. The architecture consists of a training phase and a testing phase. In the training phase, the training video is added to the knowledge base after preprocessing, feature extraction and feature reduction. This dataset is classified as either a normal or an abnormal event using the SVM classifier. The testing video, which goes through the same operations, is classified as abnormal or normal by comparing it with the already classified sample frames of the training videos. The output labels each frame as either “Normal” or “Abnormal”.

Fig. 2. Overall flow chart of the proposed system

3.2 Preprocessing

Initially the given video is converted into frames, which are used for all further processing. During preprocessing, unnecessary noise in the frames is eliminated using morphological operations. The image needs to contain less noise in order to obtain a more accurate difference between the background and the foreground [7].
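The noise-removal step can be illustrated with a morphological opening (erosion followed by dilation) on a binary mask. The following is a minimal sketch, assuming a 3 × 3 square structuring element and NumPy arrays; the function names and the choice of opening are illustrative assumptions, since the text does not state which morphological operations or structuring element were used.

```python
import numpy as np

def _shifted_views(mask, pad_value=False):
    """Return the nine 3x3-neighbourhood views of a zero-padded binary mask."""
    padded = np.pad(mask, 1, mode="constant", constant_values=pad_value)
    h, w = mask.shape
    return [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]

def erode(mask):
    """Binary erosion: a pixel survives only if its whole 3x3 neighbourhood is set."""
    return np.logical_and.reduce(_shifted_views(mask.astype(bool)))

def dilate(mask):
    """Binary dilation: a pixel is set if any pixel in its 3x3 neighbourhood is set."""
    return np.logical_or.reduce(_shifted_views(mask.astype(bool)))

def opening(mask):
    """Erosion followed by dilation; removes blobs smaller than the structuring element."""
    return dilate(erode(mask))

# Toy example: a 3x3 foreground block plus one isolated noise pixel.
noisy = np.zeros((7, 7), dtype=bool)
noisy[1:4, 1:4] = True   # genuine foreground region
noisy[5, 5] = True       # single-pixel noise
cleaned = opening(noisy)
```

In this toy mask the isolated pixel is removed while the 3 × 3 block is preserved, which is the behaviour wanted before frame differencing.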

3.3 Feature Extraction

The main purpose of feature extraction is to extract the image components and separate the foreground from the background through HOG feature extraction, and then to reduce the obtained attributes using PCA. Because many similar attributes are reduced to a single attribute, the predictive model is more compact and the analysis is faster (Figs. 3, 4 and 5).

Fig. 3. Feature collection

Fig. 4. Feature extraction

Fig. 5. FeatureFile subroutine that performs HOG and PCA feature extraction/reduction

3.3.1 Histogram of Oriented Gradients

Histogram of oriented gradients is a feature descriptor used in image processing for detecting objects. This technique counts occurrences of gradient orientation in localized portions of an image. The HOG descriptor is most popularly used for detecting humans in images. The flow diagram of the HOG descriptor is given below (Fig. 6).

Fig. 6. HOG feature extraction
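The core idea of HOG, counting gradient orientations within localized cells, can be sketched as follows. This is a simplified illustration rather than the implementation used in this work: it assumes 8 × 8 pixel cells, 9 unsigned orientation bins, hard binning without interpolation, gradients computed per cell, and a single global L2 normalization instead of overlapping block normalization. The function names are hypothetical.

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    cell = cell.astype(float)
    gy, gx = np.gradient(cell)                      # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in [0, 180)
    bins = np.minimum((angle / (180.0 / n_bins)).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

def hog_descriptor(image, cell_size=8, n_bins=9):
    """Concatenate per-cell histograms and L2-normalize the resulting vector."""
    h, w = image.shape
    hists = [
        cell_histogram(image[r:r + cell_size, c:c + cell_size], n_bins)
        for r in range(0, h - cell_size + 1, cell_size)
        for c in range(0, w - cell_size + 1, cell_size)
    ]
    vec = np.concatenate(hists)
    return vec / (np.linalg.norm(vec) + 1e-9)

# Toy example: intensity increases left to right, so every gradient points along x.
ramp = np.tile(np.arange(16.0), (16, 1))
hist = cell_histogram(ramp[:8, :8])
descriptor = hog_descriptor(ramp)
```

For the horizontal ramp all gradient energy falls into the first orientation bin, and a 16 × 16 image yields four cells of nine bins each.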

3.3.2 Principal Component Analysis

PCA is mathematically defined [8] as an orthogonal linear transform that maps the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

In this project, PCA is applied to reduce the stored HOG orientation values. This improves performance and allows faster detection and analysis of objects in the frames.
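The definition above can be made concrete with a short sketch that computes the principal components from the singular value decomposition of the mean-centred data. It assumes the feature vectors are stacked as rows of a NumPy array; the function names and the toy data are illustrative and not taken from the original implementation.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the data mean, the top principal axes, and their explained variances."""
    mean = X.mean(axis=0)
    centred = X - mean
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(centred, full_matrices=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return mean, Vt[:n_components], variances[:n_components]

def pca_transform(X, mean, components):
    """Project data onto the retained principal axes."""
    return (X - mean) @ components.T

# Toy example: four 2-D points lying exactly on the line y = 2x.
X = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
mean, components, variances = pca_fit(X, n_components=1)
reduced = pca_transform(X, mean, components)
```

Because the toy points lie on a line, a single component captures all of the variance and the two-dimensional vectors reduce to one value each. The same projection would be applied to HOG vectors, with the number of retained components chosen by the user.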

3.4 Support Vector Machine Classifier

The Support Vector Machine (SVM) is one of the most widely used classification methods [11]. It has a wide variety of applications, such as text classification [8], facial expression recognition [10], gene analysis [7] and many others. The SVM is a technique for constructing a linear classifier that is based on a sound theoretical foundation [12].
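To illustrate what constructing a linear classifier means here, the sketch below trains a linear SVM by batch subgradient descent on the regularized hinge loss. This is an assumption-laden toy, not the training procedure used in this work: the optimizer, the hyperparameters `lam`, `lr` and `epochs`, and the two-dimensional data are all chosen only for illustration.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean(max(0, 1 - y * (X @ w + b))) by subgradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violated = margins < 1.0                      # samples inside or beyond the margin
        grad_w = lam * w - (y[violated][:, None] * X[violated]).sum(axis=0) / n
        grad_b = -y[violated].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    """Label +1 on one side of the separating hyperplane and -1 on the other."""
    return np.where(X @ w + b >= 0.0, 1.0, -1.0)

# Toy, linearly separable data: +1 could stand for "Abnormal" and -1 for "Normal".
X_train = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
                    [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]])
y_train = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, b = train_linear_svm(X_train, y_train)
predictions = predict(X_train, w, b)
```

In practice the rows of `X_train` would be the PCA-reduced HOG vectors, and a library implementation would normally be preferred over a hand-written optimizer.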

3.4.1 SVM Classification Using “Kernel Tricks”

Kernel techniques are used in SVMs to solve linearly inseparable problems by transforming the data into a high-dimensional space. However, training and testing on large data sets with kernels consumes more time. Hence, large data sets can be trained and tested using a linear SVM without kernels.

The experimental results show that the proposed method is beneficial for large-scale data sets. Hence the proposed method can be successfully applied to natural language processing (NLP) applications [6].

3.4.2 Radial Basis Kernel Function for SVM Classification of Images

The radial basis function (RBF) kernel is one of the most commonly used kernel functions in machine learning. It is used in various kernelized learning algorithms and, in particular, is commonly used in SVM classification [7].

Consider two samples \( {\mathbf{x}} \) and \( {\mathbf{x}}' \), which represent feature vectors in some input space. The RBF kernel is defined as

$$ K({\mathbf{x}},{\mathbf{x^{\prime}}}) = \exp \left( { - \frac{{\left\| {{\mathbf{x}} - {\mathbf{x^{\prime}}}} \right\|^{2} }}{{2\sigma^{2} }}} \right) $$

\( \left\| {{\mathbf{x}} - {\mathbf{x^{\prime}}}} \right\|^{2} \) indicates the squared Euclidean distance between the two feature vectors.

\( \sigma \) is a free parameter. An equivalent definition involves a parameter \( \gamma = \frac{1}{2\sigma^{2}} \):

$$ K({\mathbf{x}},{\mathbf{x^{\prime}}}) = \exp \left( { - \gamma \left\| {{\mathbf{x}} - {\mathbf{x^{\prime}}}} \right\|^{2} } \right) $$
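The equivalence of the two parameterizations can be checked numerically. The sketch below evaluates both forms of the kernel for a pair of example vectors; the function names and the numeric values are illustrative assumptions.

```python
import math

def squared_distance(x, x_prime):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, x_prime))

def rbf_kernel_sigma(x, x_prime, sigma):
    """RBF kernel in the sigma form: exp(-||x - x'||^2 / (2 * sigma^2))."""
    return math.exp(-squared_distance(x, x_prime) / (2.0 * sigma ** 2))

def rbf_kernel_gamma(x, x_prime, gamma):
    """RBF kernel in the gamma form: exp(-gamma * ||x - x'||^2)."""
    return math.exp(-gamma * squared_distance(x, x_prime))

x = [1.0, 2.0]
x_prime = [2.0, 4.0]
sigma = 1.5
gamma = 1.0 / (2.0 * sigma ** 2)   # the substitution that links the two forms

k_sigma = rbf_kernel_sigma(x, x_prime, sigma)
k_gamma = rbf_kernel_gamma(x, x_prime, gamma)
k_self = rbf_kernel_sigma(x, x, sigma)
```

Both forms return the same value, the kernel of a vector with itself is 1, and the value decays towards 0 as the two vectors move apart.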

As shown in Fig. 7, SVM training loads the extracted features stored in features.dat. The output frames loaded from output.dat are compared with the trained features stored in the knowledge base.

Fig. 7. SVM training

3.5 Data Flow

Figure 8 shows the data flow diagram of the proposed work, which consists of a testing phase and a training phase.

In the training phase, the training video is preprocessed and features are extracted and added to the knowledge base. This dataset is classified as either a normal or an abnormal event using the SVM classifier. In the testing phase, the video is likewise preprocessed and the features are extracted. The SVM classifier then labels each frame as either “Normal” or “Abnormal”.

Figure 8 shows the two phases of the data flow diagram. Initially, the image is read in RGB frame format. During preprocessing, the frames are converted to grayscale. The resulting frames are used for extracting features with the HOG and PCA methods. The SVM is trained on the extracted features, and the trained data is stored in the knowledge base and used by the SVM to classify new data.

Fig. 8. Data flow diagram

In the testing phase, the input video is broken down into frames and preprocessed using morphological operations. The foreground objects are then extracted by background subtraction, which is done by finding the absolute difference between frames. Feature extraction, using the HOG and PCA methods, then takes place on the noise-free images. These features are assigned to a class by the trained SVM, as explained earlier.

In the current project, the SVM classifies data as either Normal or Anomalous (Abnormal). The detected region and the recognition result are displayed along with the frame.
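Background subtraction by absolute frame difference can be sketched as below. The sketch assumes RGB frames stored as NumPy arrays, the common luminance weights 0.299, 0.587 and 0.114 for grayscale conversion, and an arbitrary threshold of 25 grey levels; none of these values is stated in the text, so they should be read as placeholders.

```python
import numpy as np

def to_gray(frame_rgb):
    """Convert an H x W x 3 RGB frame to a float grayscale image."""
    weights = np.array([0.299, 0.587, 0.114])
    # Casting to float first avoids uint8 wrap-around in later subtraction.
    return frame_rgb.astype(float) @ weights

def foreground_mask(reference_gray, current_gray, threshold=25.0):
    """Mark pixels whose absolute difference from the reference frame exceeds the threshold."""
    return np.abs(current_gray - reference_gray) > threshold

# Toy example: an empty background and a frame containing a bright 2x2 object.
background = np.zeros((6, 6, 3), dtype=np.uint8)
current = background.copy()
current[2:4, 2:4] = 200

mask = foreground_mask(to_gray(background), to_gray(current))
```

The resulting boolean mask marks only the four pixels of the moving object; in the full pipeline this mask would be cleaned with morphological operations before feature extraction.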

4 Result Analysis

4.1 Discussion

The dataset used in this project was shot with a Canon D750 camera at a 55 mm focal length. The video resolution was adjusted to 380 × 240 pixels at a frame rate of 15 frames per second. Three different scenarios are evaluated in the datasets. The first two datasets depict anomalous events such as punching and kicking, as well as normal events such as a handshake and a hug. The third dataset is a typical fall detection event of the kind that may occur in an old-age home.

4.2 Performance of Our System

When the system was tested with the datasets mentioned above, an error rate of 0.27 was obtained. The error rate was calculated using the formula:

$$ \text{Error Rate} = \frac{\text{No. of False Negatives}}{\text{No. of False Negatives} + \text{No. of True Positives}} $$

Here False Negatives refer to the anomalous events that were not identified. True Positives are the anomalous events that were identified correctly.
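The formula can be expressed as a small helper. Note that, as defined here, the quantity is the fraction of anomalous events that were missed, which is often called the false negative rate or miss rate. The counts in the example are purely hypothetical and are not the counts from the experiments.

```python
def error_rate(false_negatives, true_positives):
    """Fraction of anomalous events that were missed: FN / (FN + TP)."""
    total_anomalous = false_negatives + true_positives
    if total_anomalous == 0:
        raise ValueError("error rate is undefined when there are no anomalous events")
    return false_negatives / total_anomalous

# Hypothetical counts for illustration only: 3 missed and 9 detected anomalous events.
example_rate = error_rate(false_negatives=3, true_positives=9)
```

With these illustrative counts, 3 of 12 anomalous events are missed, giving an error rate of 0.25.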

Table 1 provides a detailed overview of the performance. The SVM classifier was able to classify most of the events correctly as normal or abnormal. The dataset was provided to the classifier as frames, with specifications for the classification process (Fig. 9).

Table 1. Comparison of error rate for the three datasets

Fig. 9. Performance comparisons for three datasets

Fig. 10. Performance comparisons without the challenging video.

4.3 Discussion of Result

Dataset 3 produced the highest number of false negatives because the movement and position of the person of interest relative to the camera made it difficult for the classifier to identify the event. Feature extraction was also not optimal because of the constant changes of the objects in the environment.

When this dataset was removed from the evaluation, the error rate fell to 0.21. This shows the importance of a static background for our system, which is also a drawback and a direction for future enhancement.

Table 2 depicts the performance when dataset 3 is removed.

Table 2. Comparison of error rate without the challenging video

The performance of the system, with and without the challenging video, is shown in Figs. 9 and 10. Figures 11, 12 and 13 describe the three datasets, and Figs. 14, 15, 16 and 17 show example recognition results.

Fig. 11. Dataset 1: shows sequence of normal and anomalous events

Fig. 12. Dataset 2: handshake, punch, kick, high five events

Fig. 13. Dataset 3: false negative event followed by fall detection event

Fig. 14. Normal event recognition as seen in dataset 1

Fig. 15. Anomalous (or abnormal) event recognition as seen in dataset 1

Fig. 16. False negative events in dataset 3

Fig. 17. Fall detection in dataset 3

4.4 Output

5 Conclusion

Automated human event analysis in video surveillance systems has become one of the most active and attractive research topics in computer vision and pattern recognition. Increasing computational power provides a favourable environment for improving existing systems.

The proposed work has provided satisfactory results, as expected. The dataset used in this implementation was recorded and designed for a static camera. An error rate of 0.27 was achieved when the system was tested with the given datasets, and a better error rate of 0.21 was achieved when a challenging video was removed from the dataset. The method can be further improved and implemented for real-time surveillance systems, so the study of anomalous event detection can grow further.