1 Introduction

The human need for automated detection systems in personal, commercial, industrial, and military settings has driven the development of video analytics, which makes lives easier and enables us to keep pace with future technologies [1]. At the same time, it pushes us to analyze the challenges of automated video surveillance scenarios. Humans have an amazing capacity for decision-making but are notoriously poor at maintaining concentration. A variety of studies have shown that after 20 min of watching, up to 90% of the information shown on monitors will be missed. As closed-circuit television (CCTV) culture continues to grow, humans would be required to observe feeds from hundreds of cameras 24 × 7 [2]. This shows the need for an automatic system that analyzes and stores video from hundreds of cameras and other sensors, continuously detects events of interest, and allows browsing of the data through a sophisticated user interface. Such a system is known as video analytics [3, 4].

Recent research in computer vision places increasing emphasis on developing systems for monitoring and detecting humans. Such systems help users in personal, industrial, commercial, and military areas to innovate in video analysis, keep pace with future technologies, and address the challenges of automatic video surveillance. Video surveillance aims to detect, classify, and track objects over a sequence of images and to help a human operator understand and describe object behavior. These systems monitor sensitive areas such as airports, banks, parking lots, and country borders. The processing framework of an automated video surveillance system includes stages such as object detection, object classification, and object tracking. Almost every video surveillance system starts with motion detection, which aims at segmenting regions of interest corresponding to moving objects from the rest of the image. Subsequent processes such as object classification and tracking depend greatly on it. Significant fluctuations in the color, shape, and texture of a moving object make such objects difficult to handle. A frame of a video sequence consists of two groups of pixels: the first represents foreground objects and the second belongs to the background. Techniques such as frame differencing, adaptive median filtering, and background subtraction are used to extract objects from a stationary background [5]; the most popular and commonly used approach for detecting foreground objects is background subtraction. The important steps in a background subtraction algorithm are background modeling and foreground detection [6]. Background modeling produces a reference frame that represents a statistical description of the entire background scene. The background is modeled, typically from the first few frames of the video sequence, in order to extract the objects of interest from subsequent frames.
However, with a quasi-stationary background such as wavering trees, flags, or water, extracting the exact moving object is more challenging. In this situation, a single background model frame is not enough to accurately detect the moving object; instead, an adaptive background modeling technique is used for exact detection of objects against a dynamic background [7].

2 Object Detection Using Adaptive Gaussian Mixture Model

2.1 Basic Gaussian Mixture Model

A Gaussian mixture model (GMM) is a parametric probability density function represented as a weighted sum of K Gaussian component densities [8, 9]. It is given by the following equation:

$$P\left( x_{t} \right) = \sum\limits_{i = 1}^{K} \omega_{i,t} \, \eta \left( x_{t} ; \mu_{i,t} , \Sigma_{i,t} \right)$$
(1)
$$\sum\limits_{i = 1}^{K} \omega_{i,t} = 1$$
(2)

where \(x_t\) is a D-dimensional data vector, \(\omega_{i,t}\) is the weight of the ith Gaussian component at time t, K is the number of Gaussian distributions, \(\mu_{i,t}\) is the mean of the ith component at time t, and \(\Sigma_{i,t}\) is its covariance matrix. The entire GMM is specified by the mean vectors, covariance matrices, and mixture weights of all component densities. The mean of such a mixture is represented by the following equation:

$$\mu_{t} = \sum\limits_{i = 1}^{k} {\omega_{i,t} \mu_{i,t} }$$
(3)

There are several variants of the GMM; a common one constrains the covariance matrices to be diagonal. The choice of the number of components, and of a full versus diagonal covariance matrix, is often determined by the amount of data available for estimating the GMM parameters.
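As an illustration, the mixture density of Eq. (1) with diagonal covariances can be sketched in Python. This is a hypothetical helper for exposition only, not part of the paper's MATLAB implementation; all names are assumptions.

```python
import numpy as np

def gmm_density(x, weights, means, variances):
    """Evaluate a diagonal-covariance Gaussian mixture density at x.

    x         : (D,) data vector
    weights   : (K,) mixture weights summing to 1 (Eq. 2)
    means     : (K, D) component means
    variances : (K, D) per-dimension variances (diagonal covariances)
    """
    x = np.asarray(x, dtype=float)
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        mu, var = np.asarray(mu, float), np.asarray(var, float)
        # Normalization constant of a diagonal Gaussian
        norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))
        # Exponent of the Gaussian, summed over dimensions
        expo = np.exp(-0.5 * np.sum((x - mu) ** 2 / var))
        total += w * norm * expo          # weighted sum, Eq. (1)
    return total

# Example: two equally weighted 1-D components at 0 and 5
p = gmm_density([0.0], weights=[0.5, 0.5],
                means=[[0.0], [5.0]], variances=[[1.0], [1.0]])
```

The dominant term at x = 0 is the first component, so `p` is close to half the peak of a standard normal density.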

Background subtraction is a popular object detection technique because it is simple, computationally light, and easy to implement. It takes the difference between the current frame (It) and a reference frame, denoted (Bt−1). The difference image (Dt) is given by

$$D_{t} = \left| {B_{t - 1} - I_{t} } \right|$$
(4)

The foreground mask (Ft) is obtained by thresholding the difference image:

$$F_{t} = \begin{cases} 1, & D_{t} > {\text{Th}} \\ 0, & D_{t} \le {\text{Th}} \end{cases}$$
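Eqs. (4) and the thresholding rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's MATLAB code; the function name and threshold value are assumptions.

```python
import numpy as np

def foreground_mask(frame, background, th):
    """Difference image (Eq. 4) followed by thresholding to a binary mask."""
    # Cast to int so the subtraction cannot wrap around for uint8 images
    diff = np.abs(background.astype(int) - frame.astype(int))
    return (diff > th).astype(np.uint8)

bg    = np.array([[10, 10], [10, 10]], dtype=np.uint8)
frame = np.array([[10, 200], [12, 10]], dtype=np.uint8)
mask  = foreground_mask(frame, bg, th=25)
# Only the pixel that changed by more than the threshold is marked foreground
```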

2.2 GMM Model Initialization and Maintenance

For pixels generated by a stationary process, the EM algorithm is applicable; the K-means algorithm is a common alternative to EM [10]. Using the K-means approximation, every new pixel value Xt is checked against the existing K Gaussian distributions until a match is found. A match is defined by

$$\sqrt{ \left( X_{t} - \mu_{i,t} \right)^{T} \Sigma_{i,t}^{-1} \left( X_{t} - \mu_{i,t} \right) } < k$$
(5)

where k is a constant threshold, selected as 2.5 (i.e., a pixel matches a distribution when it lies within 2.5 standard deviations of it). If none of the K distributions matches the current pixel value, the least probable distribution is replaced by a new one with the current value as its mean, a high initial variance, and a low prior weight. The prior weights of the K distributions at time t are adjusted as follows:

$$\omega_{k,t} = (1 - \alpha )\,\omega_{k,t - 1} + \alpha \, M_{k,t}$$
(6)

where \(\alpha\) is the learning rate and Mk,t is 1 for the matched model and 0 for the others. After this approximation, the weights are renormalized. The \(\mu\) and \(\sigma\) parameters remain the same for unmatched distributions; the parameters of the distribution that matches the new observation are updated as follows:

$$\mu_{t} = (1 - \rho )\,\mu_{t - 1} + \rho \, X_{t}$$
(7)
$$\sigma_{t}^{2} = (1 - \rho )\,\sigma_{t - 1}^{2} + \rho \left( X_{t} - \mu_{t} \right)^{T} \left( X_{t} - \mu_{t} \right)$$
(8)

where

$$\rho = \alpha \, \eta \left( X_{t} \mid \mu_{k} ,\sigma_{k} \right)$$
(10)

One advantage of this technique is that when something new enters the scene, it does not destroy the previous background model; the model is simply updated.
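The match test of Eq. (5) and the updates of Eqs. (6)–(8) can be sketched for a single grayscale pixel. This is a simplified illustration, not the paper's implementation: it uses the common shortcut ρ ≈ α instead of the full Gaussian evaluation of Eq. (10), and the replacement variance and weight for unmatched pixels are assumed values.

```python
import numpy as np

def gmm_pixel_update(x, w, mu, var, alpha=0.01, k_th=2.5):
    """One Stauffer-Grimson-style update step for one grayscale pixel.

    w, mu, var : (K,) float arrays of weights, means, and variances.
    """
    sigma = np.sqrt(var)
    matched = np.abs(x - mu) < k_th * sigma       # Eq. (5), scalar form
    if matched.any():
        i = int(np.argmax(matched))               # first matching component
        M = np.zeros_like(w)
        M[i] = 1.0
        w = (1 - alpha) * w + alpha * M           # Eq. (6)
        rho = alpha                               # simplified stand-in for Eq. (10)
        mu[i] = (1 - rho) * mu[i] + rho * x       # Eq. (7)
        var[i] = (1 - rho) * var[i] + rho * (x - mu[i]) ** 2   # Eq. (8)
    else:
        # No match: replace the least probable component with the new value,
        # using an assumed high initial variance and low weight
        i = int(np.argmin(w))
        mu[i], var[i], w[i] = x, 15.0 ** 2, 0.05
    w = w / w.sum()                               # renormalize weights
    return w, mu, var

# Example: pixel value 102 matches the component centered at 100
w, mu, var = gmm_pixel_update(
    102.0,
    np.array([0.5, 0.5]),
    np.array([100.0, 200.0]),
    np.array([25.0, 25.0]),
)
```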

3 Object Detection Using Adaptive GMM

The system is implemented in two steps: GMM-based object detection and noise removal using morphological operations. The implementation uses MATLAB 2014 with the Computer Vision System Toolbox. Small detected regions whose area is less than that of a moving object, and which are not part of the foreground object, are removed by the noise removal algorithm. Finally, the output binary image is compared with a ground truth image for performance evaluation to determine accuracy.
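The morphological noise removal step can be illustrated with a binary opening (erosion followed by dilation), which removes specks smaller than the structuring element. A minimal NumPy sketch with a 3 × 3 structuring element follows; it stands in for the toolbox routines used in the paper and is not the authors' code.

```python
import numpy as np

def erode(mask):
    """3x3 binary erosion: a pixel stays 1 only if its whole neighborhood is 1."""
    p = np.pad(mask, 1)                       # zero-pad the border
    out = np.ones_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= p[1 + dy: 1 + dy + mask.shape[0],
                     1 + dx: 1 + dx + mask.shape[1]]
    return out

def dilate(mask):
    """3x3 binary dilation: a pixel becomes 1 if any neighbor is 1."""
    p = np.pad(mask, 1)
    out = np.zeros_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= p[1 + dy: 1 + dy + mask.shape[0],
                     1 + dx: 1 + dx + mask.shape[1]]
    return out

def open_mask(mask):
    """Opening = erosion then dilation; removes isolated noise pixels."""
    return dilate(erode(mask))

# A 3x3 foreground blob plus one isolated noise pixel
mask = np.zeros((6, 6), dtype=np.uint8)
mask[1:4, 1:4] = 1
mask[5, 5] = 1
opened = open_mask(mask)   # the blob survives, the speck is removed
```

In practice a library routine (e.g. `scipy.ndimage.binary_opening`) would replace these loops; the explicit version only shows what the operation does.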

Background modeling is adaptive so as to accommodate all changes occurring in the background scene. It is very sensitive to dynamic changes in the scene, which creates a consequent need to adapt the background to these variations. Research has progressed toward improving the robustness and accuracy of background subtraction under complex background conditions such as sudden and slow illumination changes. Common attributes of BS algorithms are the learning rate, the threshold, and the constant parameter K, which can be adjusted empirically to obtain the desired accuracy. However, the tuning process for these parameters has received little attention. Stauffer and Grimson [7] suggested that the selection of the learning rate and threshold value is the most important among all the parameters. Tuning them requires time-intensive repeated experimentation to achieve optimum results. Setting the parameters is very challenging because it requires an understanding of the background situation, and a common setting may not produce accurate results across different scenarios. All these aspects limit the effective use of background subtraction algorithms and demand improvement and extension of the original GMM.

In recent years, researchers have focused on developing innovative technology to improve the performance of intelligent video surveillance (IVS) in terms of accuracy, speed, and complexity. One direction is to design a novel approach for GMM parameter tuning based on extracting statistical features and mapping them to the GMM training parameters [11]. The learning rate is especially important because it determines how quickly the background model changes. A large amount of experimentation is required to set the learning rate for exact detection of the foreground object, so a system that tunes this parameter automatically is needed for satisfactory GMM performance.

GMM modeling is able to handle multimodal background scenes. The performance of GMM-based background subtraction is determined by pixel-wise comparison of the ground truth with the actual foreground mask. Performance is evaluated using primary metrics, namely true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), and secondary metrics such as sensitivity, accuracy, miss rate, recall, and precision. Precision reflects the rate of false detections, while recall reflects how completely true objects are detected. Precision and recall are the two key measures for evaluating a detection algorithm systematically and quantitatively [12, 13].

$${\text{Precision}}\left( \% \right) = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}}*100$$
(11)
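The pixel-wise evaluation described above, including the precision of Eq. (11) together with recall (TP/(TP + FN)) and accuracy, can be sketched as follows. This is an illustrative helper, not the paper's MATLAB evaluation code.

```python
import numpy as np

def detection_metrics(detected, truth):
    """Pixel-wise comparison of a detected foreground mask with ground truth.

    Both inputs are binary arrays; returns (precision, recall, accuracy) in %.
    """
    detected = detected.astype(bool)
    truth = truth.astype(bool)
    tp = np.sum(detected & truth)      # foreground correctly detected
    fp = np.sum(detected & ~truth)     # background flagged as foreground
    fn = np.sum(~detected & truth)     # foreground missed
    tn = np.sum(~detected & ~truth)    # background correctly rejected
    precision = 100.0 * tp / (tp + fp)             # Eq. (11)
    recall = 100.0 * tp / (tp + fn)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy

prec, rec, acc = detection_metrics(np.array([[1, 1], [0, 0]]),
                                   np.array([[1, 0], [1, 0]]))
```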

Our proposed system improves on the original GMM-based object detection system through tuning and adaptation of the important parameters: the number of components, the learning rate, and the threshold.

3.1 Video Database

The Wallflower database is an open-source database [9]. It includes seven video sequences with different critical background situations. The video frames are 160 × 120 pixels, sampled at 4 Hz. The dataset provider also supplies a ground truth image for each sequence and a text file describing all the video sequences. The ground truth is a binary image representing the foreground mask of a specific frame in the video sequence. Table 1 shows all test sequences along with their ground truth.

Table 1 Wallflower dataset of seven different video sequences along with ground truth

3.2 Experimental Setup

The main focus of the research is the appropriate selection of the GMM training parameters K, α, and T. The choice of K, the number of Gaussian components, is a function of the complexity of the background scene. If the background is simple and unimodal, K should be 1 or 2; for a complex multimodal background, K is chosen greater than 2 and less than 5 to improve detection accuracy. Various pairs of α and T are evaluated on the Wallflower dataset, and after extensive experimentation the best pair is identified based on performance analysis across the Wallflower videos [9]. Parameter initialization, training, and testing are the three main steps of the object detection process.

3.3 Parameter Initialization

The object detection system includes various GMM parameters such as the number of training frames, the initial variance, and the training parameters (K, α, and T). They are initialized as follows:

  • Number of training frames: 200 (given by the dataset provider),

  • Number of components: 4,

  • Initial variance: 0.006, and

  • Threshold: adjusted empirically (0.5, 0.6, 0.7, 0.8, 0.9).
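The initialization above can be sketched as a per-pixel parameter setup. This is a hypothetical Python rendering of the setup (the paper's implementation is in MATLAB); the frame size matches the Wallflower sequences, and seeding the means from random values stands in for seeding them from the training frames.

```python
import numpy as np

H, W = 120, 160        # Wallflower frame height and width
K = 4                  # number of Gaussian components per pixel
INIT_VARIANCE = 0.006  # for intensities normalized to [0, 1]
THRESHOLDS = (0.5, 0.6, 0.7, 0.8, 0.9)   # candidate values of T

# Per-pixel mixture parameters: equal prior weights, assumed random
# initial means (in practice seeded from the 200 training frames),
# and a common initial variance for every component.
weights = np.full((H, W, K), 1.0 / K)
means = np.random.rand(H, W, K)
variances = np.full((H, W, K), INIT_VARIANCE)
```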

4 Experimental Results

The GMM-based object detection system is evaluated using various settings of α and T for each sequence. After this experimentation, an appropriate setting of α and T for all videos is chosen based on the lowest total error. Performance metrics are calculated for each sequence by comparing the detected mask with the ground truth.

Results are as follows (Fig. 1):

Fig. 1
figure 1

Foreground mask obtained using GMM for different values of α and T

The GMM-based background subtraction technique gives the best overall detection performance at α = 0.001 and T = 0.9. These settings improve the accuracy of the foreground mask, which then closely matches the ground truth. The learning rate and threshold clearly have enough power to tune object detection performance. However, the setting with the best overall performance is unlikely to give the best result for every individual sequence; for some sequences, different parameter settings yield better individual performance. Performance analysis for the best α and T can be carried out at the pixel level. Empirically, selecting too high a threshold causes foreground objects to merge with the background, increasing false negatives and decreasing true positives: at T = 0.9, many foreground pixels are absorbed into the background. Likewise, the empirically selected learning rate α = 0.001 is too low for rapidly changing backgrounds, so misclassification is higher and accuracy is lower for sequences with sudden illumination changes. This experimentation also suggests that per-sequence settings of (α, T) may yield better performance than a single fixed selection (Table 2).

Table 2 Performance evaluation of proposed system on Wallflower dataset

5 Conclusion

The proposed research emphasizes proper tuning of the important GMM parameters, leading to improved accuracy of the GMM-based object detection system. The main GMM parameters are the number of mixture components (K), the learning rate (α), and the threshold (T). We have implemented two approaches for tuning these parameters: traditional empirical tuning and automated adaptive tuning based on background dynamics. The traditional empirical method uses different settings of α and T while K is kept constant at a high value for complex scenes. After a large number of experiments, an appropriate pair of α and T is selected based on the lowest performance error.

The proposed adaptive tuning method adapts α while keeping T and K constant at appropriate values. A unique EIR concept is used to extract the background dynamics of the current frame, and the learning rate is tuned depending on the EIR. This modified approach improves the results of the original GMM, which strongly underlines the value of learning rate adaptation. The performance of the GMM with these tuning methods is evaluated by comparing the foreground mask obtained from the GMM with the ground truth image in the database, using primary metrics (TP, TN, FP, and FN) as well as secondary metrics such as precision and accuracy. The proposed system's performance is compared with the traditional empirical method and other existing techniques. The research is implemented on the MATLAB 2014 platform, using functions from the MATLAB Computer Vision Toolbox.