1 Introduction

Video surveillance is often understood as the process of analyzing video scenes for events of interest, such as violence. Human actions can be analyzed with the help of surveillance cameras that are monitored either manually or automatically. An intelligent video surveillance system aims to detect, track, and recognize objects of interest and then to analyze and interpret the activities in the scene, despite the substantial volume of video collected by surveillance cameras. Surveillance cameras are now installed widely throughout the private and public sectors, both for the safety of the public and because the hardware is available at reasonable prices. Since visual information is generally accessible in surveillance systems, we focus on strategies that use visual information [7]. An automatic surveillance system reduces the burden on security personnel who would otherwise have to monitor long videos. Violent event detection is a difficult action recognition task within computer vision. Motion patterns, facial expressions, and actions are used to detect unusual events in video scenes. Detecting terrorist attacks, bombs, fraud, loitering, slip-and-fall events, and many similar events are all action recognition problems [22]. Violent event detection remains hard to solve because such events are highly uncertain. Once the system has been trained on violent and non-violent events, it assigns a label to each test video to classify the event. The task becomes more challenging when normal events change drastically, and learning is made difficult by blurriness, variations in scale, complex backgrounds, occlusion, and illumination changes. In this work, we use the Local Optimal Oriented Pattern (LOOP) descriptor to detect violent events in video sequences.

The Contributions of the Paper are as Follows

  • The LOOP descriptor is used to extract salient features for detecting violent events.

  • A spatial-temporal post-processing approach is used to improve the accuracy of violent event detection.

  • To evaluate the efficiency of the proposed method, a five-fold cross-validation approach is used and the results are compared with state-of-the-art techniques.

The rest of the paper is structured as follows. Section 2 reviews previous research. The proposed texture-based descriptor is discussed in Sect. 3. Experimental results are described in Sect. 4. Finally, Sect. 5 concludes the paper.

2 Previous Works

In recent years, the research community has proposed a variety of algorithms based on handcrafted features [2, 5], deep learning features [27, 28, 30, 39], and classifiers [3, 18] to address the major issues of violent event detection [29, 32]. Quasim et al. [33] introduced the Histogram of Swarms (HOS) descriptor. The method uses the variance of Optical Flow (OF) to extract spatio-temporal information from the sequence of video frames. Ant Colony Optimization (ACO) is used to cluster moving objects and to separate salient from non-salient features; finally, the OF technique is used to extract prominent features for detecting normal and violent events. Febin et al. [12] presented a combination of Motion Boundary Scale Invariant Feature Transform (MoBSIFT) and a movement filter algorithm. The movement filter algorithm extracts temporal features of non-violent events and filters out normal events. Furthermore, the combination of motion boundary, optical flow, and SIFT features extracts prominent features to detect violent events. Esen et al. [36] used the Motion Co-Occurrence Feature (MCF) to detect abnormal events in video. The method uses a block matching algorithm to extract the direction and magnitude of motion, which are fed to a KNN classifier to categorize normal and abnormal events. Recently, Lohithashva et al. [23] introduced the integration of texture features to detect violent activity. The method extracts prominent texture features to detect suspicious activity. Song et al. [37] introduced the fusion of multi-temporal analysis and multi-temporal perceptron layers to detect unusual events. Zhang et al. [41] presented an entropy model to measure the distribution of enthalpy for abnormal event detection. The authors used the enthalpy model from a microscopic point of view to describe crowd energy information. Ryan et al. [35] proposed an optical flow and Gray Level Co-occurrence Matrix (GLCM) feature descriptor to detect abnormal events in video sequences. Lloyd et al. [21] proposed a GLCM texture feature descriptor to distinguish non-violent from violent activity. Pujol et al. [10] described events using a feature fusion technique based on local eccentricity, which combines the Fast Fourier Transform, the Radon transform, projection, and ellipse eccentricity. Deepak et al. [9] introduced the extraction of spatio-temporal information with a texture-based feature descriptor. The method extracts local geometric characteristics such as gradients and curvatures, which are basic space-time movement properties, to detect normal and abnormal events. Li et al. [20] introduced an OF-based feature descriptor to detect violent events in video scenes. They first use background subtraction to remove low variation and noise in the frame and then extract Histogram of Maximal Optical Flow Projection (HMOFP) features. Reconstruction cost (RC) is used to detect violent events in video scenes.

Imran et al. [16] introduced a deep learning method to detect violent events in surveillance video. MobileNet is used to extract spatio-temporal information from moving objects, after which the dominant features are passed to a gated recurrent unit (GRU) to detect suspicious events in the video scene. Hason et al. [14] extracted spatio-temporal information using a Spatiotemporal Encoder based on Bidirectional Convolutional Long Short-Term Memory (BCLSTM), a deep learning feature extraction technique, to detect unusual events in video sequences. Asad et al. [4] presented violent event detection based on spatio-temporal features from uniformly spaced sequential frames of a video. Multi-level features for two consecutive frames, obtained from the top and bottom layers of a convolutional neural network, are integrated using an optimized feature fusion strategy; finally, the features are fed to a Long Short-Term Memory (LSTM) network to distinguish between violent and non-violent events. Sabokrou et al. [11] introduced Fully Convolutional Neural Networks (FCNs) to detect and localize violent events in video sequences. Accatolli et al. [1] introduced a 3D-CNN to detect suspicious activity in video. The CNN architecture extracts salient features without any prior knowledge and feeds them to an SVM classifier to separate violent from non-violent events. Zhou et al. [36] applied a hybrid auto-encoder architecture to extract spatio-temporal features from crowds and to discriminate normal from abnormal events in video frames. Song et al. [38] introduced a modified 3D-CNN to detect aggressive incidents throughout a video. The method uses a uniform sampling scheme to reduce redundancy while preserving motion coherence, and the authors demonstrated the efficacy of the sampling method.

3 Proposed Methodology

This section gives an overview of the proposed approach. The LOOP descriptor extracts prominent texture features from the input video and feeds them to an SVM classifier to detect violent events. Figure 1 shows the workflow of violent event detection using the proposed LOOP descriptor. The following subsections describe each stage of the approach.

Fig. 1. The description of the proposed method

3.1 LOOP Feature Descriptor

LOOP [6] is a scale- and rotation-invariant texture-based feature descriptor. It was introduced to overcome the drawbacks of earlier binary descriptors and is an improvement on the Local Binary Pattern (LBP) and Local Directional Pattern (LDP) descriptors. Let \(p_{c}\) be the intensity of frame F at pixel \((a_{c}, b_{c})\), and let \(p_{n}\ (n = 0, 1, ..., 7)\) be the intensities of the pixels in the \(3\times 3\) neighborhood of \((a_{c}, b_{c})\), excluding the middle pixel \(p_{c}\). The eight Kirsch masks previously used for LDP [17] are oriented in the directions of these eight neighboring pixels \(p_{n}\ (n = 0, 1, ..., 7)\). Each mask therefore provides a separate measure of the strength of the intensity variation in its direction. The eight directional Kirsch masks are shown in Fig. 2.

Fig. 2. Kirsch eight directions mask

The eight responses of the Kirsch masks, \(k_{n}\), correspond to the pixels with intensities \(p_{n}\ (n = 0, 1, ..., 7)\). Each of these pixels is assigned an exponent \(e_{n}\) (a digit between 0 and 7) according to the rank of \(k_{n}\) among the outputs of the eight Kirsch masks.

$$\begin{aligned} LOOP(a_{c}, b_{c})=\sum _{n=0}^{7}s(p_{n}- p_{c})\cdot 2^{e_{n}} \end{aligned}$$
(1)
$$\begin{aligned} s(a)={\left\{ \begin{array}{ll}1 &{} if\ a \ge 0\\ 0 &{} otherwise\end{array}\right. } \end{aligned}$$
(2)

The LOOP value at pixel \((a_{c}, b_{c})\) is given by Eqs. (1) and (2), where s(a) thresholds the difference between each neighboring pixel intensity and the center intensity. Because the binary weights are determined by the ranks of the Kirsch responses rather than by a fixed ordering of the neighbors, rotation invariance is built into the main formulation of the descriptor. Finally, the occurrences of each LOOP value are counted over the cell to form a histogram. The descriptor thus yields a \(2^{8} = 256\)-dimensional feature vector for each frame.
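The computation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the neighbor ordering (starting at East, counter-clockwise), the use of signed Kirsch responses ranked in ascending order, the tie-breaking by index, and all function names are choices made here for illustration. A plain LBP code is included only to contrast its fixed positional weights with the rank-based weights of LOOP.

```python
# Offsets (row, column) of the eight neighbours in circular order,
# starting at East and moving counter-clockwise. The starting point is an
# arbitrary choice made for this sketch.
NEIGHBOURS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
              (0, -1), (1, -1), (1, 0), (1, 1)]


def neighbour_values(patch):
    """Intensities p_0..p_7 around the centre of a 3x3 patch."""
    return [patch[1 + dr][1 + dc] for dr, dc in NEIGHBOURS]


def kirsch_responses(p):
    """Kirsch response k_n for each neighbour direction.

    Every Kirsch mask weights the neighbour in its own direction and the two
    neighbours next to it by 5, and the remaining five neighbours by -3, so
    k_n = 5 * (three nearest) - 3 * (other five)
        = 8 * (three nearest) - 3 * sum(p).
    """
    total = sum(p)
    return [8 * (p[(n - 1) % 8] + p[n] + p[(n + 1) % 8]) - 3 * total
            for n in range(8)]


def loop_code(patch):
    """LOOP value of the centre pixel of a 3x3 patch (list of three rows)."""
    pc = patch[1][1]
    p = neighbour_values(patch)
    k = kirsch_responses(p)
    # e[n] is the rank of k[n] among the eight responses (0 = smallest).
    # Assumption: signed responses, ascending order, ties broken by index.
    e = [0] * 8
    for rank, n in enumerate(sorted(range(8), key=lambda i: k[i])):
        e[n] = rank
    return sum(2 ** e[n] for n in range(8) if p[n] >= pc)


def lbp_code(patch):
    """Plain LBP for comparison: the weight 2**n is fixed by position."""
    pc = patch[1][1]
    p = neighbour_values(patch)
    return sum(2 ** n for n in range(8) if p[n] >= pc)


def rotate90(patch):
    """Rotate a 3x3 patch by 90 degrees counter-clockwise."""
    return [[patch[c][2 - r] for c in range(3)] for r in range(3)]


def loop_histogram(frame):
    """256-bin histogram of LOOP values over the interior pixels of a frame."""
    hist = [0] * 256
    for r in range(1, len(frame) - 1):
        for c in range(1, len(frame[0]) - 1):
            patch = [row[c - 1:c + 2] for row in frame[r - 1:r + 2]]
            hist[loop_code(patch)] += 1
    return hist


patch = [[10, 50, 30],
         [80, 40, 20],
         [60, 90, 70]]
print(loop_code(patch), loop_code(rotate90(patch)))  # same value twice
print(lbp_code(patch), lbp_code(rotate90(patch)))    # two different values
```

For this toy patch the LOOP code is unchanged by the rotation, whereas the LBP code changes, because in LOOP the weight travels with the Kirsch response of each neighbor instead of being tied to a fixed position. A practical implementation would more likely operate on image arrays with a numerical library.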

3.2 Classification Based on Support Vector Machine (SVM)

SVM [8] is a binary classification approach that is widely used in classification and regression applications. SVM was initially introduced for classification and regression, and kernel methods were subsequently used to implement non-linear classification by mapping the input into a high-dimensional feature space. SVM seeks the decision boundary between violent and non-violent events that maximizes the margin, i.e., the distance from the separating hyperplane to the nearest feature vectors of each class. In a binary classification problem, data from two classes are considered. In our work, an SVM with the Gaussian kernel function is used to classify violent video scenes.
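To make the role of the Gaussian kernel concrete, the following minimal sketch evaluates the decision function of an already trained kernel SVM. The support vectors, coefficients, bias, and the kernel width `gamma` are hypothetical, hand-picked values for a two-dimensional toy problem; the paper does not report its SVM hyperparameters, and training itself (solving for the coefficients) is not shown.

```python
import math


def gaussian_kernel(x, y, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel SVM decision function: sum_i (alpha_i * y_i) * K(sv_i, x) + b."""
    return sum(coef * gaussian_kernel(sv, x, gamma)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias


def classify(x, support_vectors, dual_coefs, bias, gamma):
    """Return 1 (violent) if the decision value is non-negative, else 0."""
    value = decision_value(x, support_vectors, dual_coefs, bias, gamma)
    return 1 if value >= 0 else 0


# Hypothetical, hand-picked model on 2-D toy features (not trained on real data).
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [-1.0, 1.0]  # alpha_i * y_i: first vector non-violent, second violent
bias = 0.0
gamma = 1.0

print(classify([0.9, 1.1], support_vectors, dual_coefs, bias, gamma))   # near the violent vector
print(classify([0.1, -0.1], support_vectors, dual_coefs, bias, gamma))  # near the non-violent vector
```

In the setting of this paper, the inputs would be the 256-dimensional LOOP histograms rather than two-dimensional toy points.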

3.3 Post-processing

The post-processing technique [34] significantly increases the accuracy and reduces the false-positive rate. In this work, the post-processing step operates on a window of 30 frames, which significantly improves the detection performance.
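The text does not spell out the post-processing rule, so the following is only one plausible reading, given as a hedged sketch: per-frame predictions are smoothed in time by a majority vote over a sliding window of 30 frames, which suppresses isolated false positives. The function name, the centered window, and the majority rule are assumptions made here, and the sketch covers only the temporal part of the step.

```python
def smooth_predictions(frame_labels, window=30):
    """Majority vote over a sliding window centred on each frame.

    frame_labels: list of per-frame predictions, 0 (non-violent) or 1 (violent).
    Windows are truncated at the start and end of the sequence.
    """
    half = window // 2
    smoothed = []
    for i in range(len(frame_labels)):
        segment = frame_labels[max(0, i - half):i + half]
        votes = sum(segment)
        # Keep the violent label only if more than half the window agrees.
        smoothed.append(1 if 2 * votes > len(segment) else 0)
    return smoothed


# A single spurious "violent" frame inside a calm sequence is removed.
raw = [0] * 40 + [1] + [0] * 40
print(sum(smooth_predictions(raw)))
```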

4 Experiment Results and Discussion

This section presents the experimental study that evaluates the proposed violent event detection approach on two standard benchmark datasets. First, the datasets are described. Thereafter, the experimental settings are explained. Finally, the results obtained are compared with existing feature descriptors.

4.1 Violent Datasets

Experiments are conducted on the Hockey Fight (HF) dataset and the Violent-Flows (VF) dataset to demonstrate the effectiveness of the proposed method; both datasets exhibit complex backgrounds, illumination changes, blurriness, scale changes, and occlusion. The HF dataset comprises 1000 action videos from the National Hockey League (NHL) (500 fights and 500 no-fights) and was originally introduced to evaluate violent event detection methods [31]. Each fight clip shows a fight between two or a few hockey players. Each video clip is approximately 1.75 s long.

Fig. 3. Violent datasets sample frames. First row: Normal scenes, Second row: Fight scenes.

The Violent-Flows dataset contains 246 action videos (123 fights and 123 no-fights). Most of the videos show crowds involved in aggressive events inside football stadiums during matches. This dataset is used to assess the detection of violent events [15]. The violent videos depict agitated crowd situations, and each video is roughly 3.5 s long. Figure 3 shows sample frames of fight and no-fight scenes from the Hockey Fight and Violent-Flows datasets.

4.2 Experimental Setting

We use five-fold cross-validation and compare our experimental results with existing methods on the Hockey Fight and Violent-Flows datasets. Each dataset is partitioned into five divisions: four for training and one for testing. The procedure is repeated so that each division serves once as the test set, and the results are averaged over the five runs. Precision (P), Recall (R), F-measure (F), Accuracy (Acc), and Area Under Curve (AUC) are used as evaluation metrics. We employ an SVM classifier with a Gaussian kernel function to differentiate violent and non-violent events in the video sequences.
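The evaluation protocol can be illustrated with a short sketch. The fold assignment below (sample i goes to fold i mod 5) is an assumption made for brevity; the paper does not say how the divisions were drawn, and a real experiment would typically shuffle and stratify the videos first. The metric definitions are the standard ones for binary labels, with 1 denoting a violent video.

```python
def five_fold_splits(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) once per fold.

    Sample i is assigned to fold i % n_folds (an assumption of this sketch).
    """
    for fold in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == fold]
        train = [i for i in range(n_samples) if i % n_folds != fold]
        yield train, test


def evaluate(y_true, y_pred):
    """Precision, recall, F-measure and accuracy for binary labels (1 = violent)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"P": precision, "R": recall, "F": f_measure, "Acc": accuracy}


# With 1000 videos, every fold trains on 800 and tests on 200.
for train, test in five_fold_splits(1000):
    print(len(train), len(test))

# Toy predictions: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
print(evaluate([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0]))
```

In the full protocol, the classifier would be trained on `train`, evaluated on `test` with `evaluate`, and the five metric sets averaged.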

4.3 Result

In the experiment, we use the LOOP descriptor to detect unusual events in video sequences. The proposed method compares favorably with existing methods. The HF dataset ROC curves of the LOOP descriptor with the SVM classifier are compared with those of existing methods in Fig. 4. On the HF dataset, the method achieves a Precision of 94.48%, Recall of 94.09%, F-measure of 94.28%, Accuracy of 92.25%, and AUC of 95.11%, as listed in Table 1. The VF dataset ROC curves of the LOOP descriptor with the SVM classifier are compared with those of previous methods in Fig. 5. On the Violent-Flows dataset, the Precision, Recall, F-measure, Accuracy, and AUC are 95.64%, 93.38%, 95.17%, 91.54%, and 93.81%, respectively, as shown in Table 1. A comparative analysis of the proposed method on the HF and VF datasets is shown in Fig. 6. The proposed feature descriptor is able to detect violent events even in the presence of cluttered backgrounds, varied illumination, little motion, and scale changes.
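For readers unfamiliar with the AUC figures reported here, the sketch below shows one standard way to compute the area under the ROC curve from classifier scores: it equals the fraction of (violent, non-violent) pairs in which the violent sample receives the higher score, with ties counted as one half. The scores and labels are made-up toy values, not results from the paper, and the function name is chosen here for illustration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: classifier outputs, higher meaning "more likely violent".
    labels: ground truth, 1 (violent) or 0 (non-violent).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ranked correctly.
print(roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```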

Table 1. Performance evaluation metrics (in percent)

4.4 Discussion

Our proposed LOOP descriptor gives better results than Histograms of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), Local Ternary Pattern (LTP), Violent Flow (ViF), Oriented Violent Flow (OViF), ViF+OViF, Distribution of Magnitude Orientation Local Interest Frame (DiMOLIF), GHOG+GIST, LBP+GLCM, and Histogram of Optical flow Magnitude and Orientation (HOMO) on both the HF and VF datasets. The HOG, HOF, LTP, and ViF descriptors do not cope with orientation changes, and therefore these feature extraction methods fail to detect violent events reliably. The OViF feature extraction method extracts orientation features and performs well on the HF dataset but not on the VF dataset. To resolve this problem, the ViF+OViF feature extraction technique extracts both magnitude and orientation features to detect suspicious behavior and is superior to the individual ViF and OViF descriptors. The DiMOLIF descriptor extracts magnitude and orientation from the optical flow to detect violent events and gives substantially better results than ViF and OViF. The GHOG+GIST descriptor fuses global gradient and texture features. The GHOG descriptor performs poorly when the background is cluttered, and the GIST descriptor does not work well for violent crowd activity in video sequences. The LBP+GLCM descriptor fuses texture features to detect aggressive behavior. The main drawback of LBP is its arbitrarily defined set of binary weights, which depend on direction. The limitations of GLCM feature extraction are the high dimensionality of the matrix and the high correlation between the features. HOMO is based on multiple scaling factors applied to the magnitude and orientation variations of the optical flow. The LOOP descriptor is robust to illumination changes and is scale and rotation invariant.

Table 2. Performance comparison results on the Hockey Fight and Violent-Flows datasets

Fig. 4. Hockey Fight dataset ROC Curves of proposed feature descriptor with SVM classifier

Fig. 5. Violent-Flows dataset ROC Curves of proposed feature descriptor with SVM classifier

Fig. 6. Comparative analysis of the proposed method for Hockey Fight and Violent Flows Dataset

The experiments demonstrate the effectiveness of the proposed model. We compare our experimental results with existing methods on the HF and VF datasets, using the LOOP descriptor for violent event detection. As shown in Table 2, the proposed method compares favorably with existing methods. The proposed feature descriptor is able to detect violent events even in the presence of cluttered backgrounds, varied illumination, little motion, and scale changes. There are six attributes that need to be taken into account for suspicious event detection: magnitude, orientation, the spatial arrangement of the moving objects, the number of objects moving in the video scene, mass, and acceleration. Our proposed method relies on the scale and orientation of the apparent object motion, captured through LOOP features, to improve performance. We conclude that the proposed method performs well on both the Hockey Fight and Violent-Flows datasets.

5 Conclusion

Video monitoring is a mechanism for scrutinizing videos to recognize suspicious behavior. Human behavior can be examined with the help of surveillance video, either manually or automatically. The research community has struggled to develop an effective algorithm because of complex backgrounds, illumination, scale changes, etc. Experiments are conducted on the HF and VF datasets, and the experimental results show that the proposed method performs better than previous feature descriptors. In the future, we intend to conduct experiments on more complex videos and to optimize the proposed method to improve accuracy and reduce computation time.