Introduction

The detection and classification of moving objects in a video sequence is important for object tracking, activity recognition and video surveillance. The aim of any motion detection technique is to separate the foreground region containing the moving object from the background region of the video. The conventional method proposed in [1] used the optical flow technique for detecting moving objects in a real-time environment. The method reported good accuracy, but at the cost of high complexity and processing time. These limitations can be overcome with background subtraction and temporal differencing approaches. Background subtraction takes the input video sequence and detects the moving objects in a frame by computing the difference between each pixel of the current frame and the corresponding pixel of a background reference frame [2]. Usually the first frame is selected as the initial reference frame, and it is then updated periodically.

Temporal differencing, in contrast, computes the difference of pixel features between consecutive frames of the video.

The optical flow method takes a sequence of images and assigns to every pixel a 2D velocity vector. Based on the attributes of these velocity vectors, the moving objects are segmented and their edge shapes are detected.
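To make the contrast concrete, the following Python sketch (an illustration using OpenCV, not code from the paper; the threshold value is an arbitrary assumption) implements the two simpler schemes:

```python
import cv2

THRESH = 25  # arbitrary difference threshold, assumed for illustration

def background_subtraction(frame_gray, background_gray):
    """Foreground mask: |current frame - reference background| > threshold."""
    diff = cv2.absdiff(frame_gray, background_gray)
    _, mask = cv2.threshold(diff, THRESH, 255, cv2.THRESH_BINARY)
    return mask

def temporal_differencing(frame_gray, prev_frame_gray):
    """Foreground mask from the difference of two consecutive frames."""
    diff = cv2.absdiff(frame_gray, prev_frame_gray)
    _, mask = cv2.threshold(diff, THRESH, 255, cv2.THRESH_BINARY)
    return mask
```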

The proposed approach uses the background subtraction technique for detecting the moving objects and compares the performance of different classifiers. The developed system addresses many of the problems that arise when applying background subtraction techniques: sudden or gradual variation in lighting conditions, movement of multiple objects in the scene, noise introduced by a poor, low-quality visual source, and shadow regions cast by the foreground objects being misdetected as moving objects.

An example of a background reference frame and a frame with moving objects in a video is illustrated in Fig. 1a and b.

Fig. 1

a Static environmental frame; b dynamic environmental frame

The main objective of this work is to detect the moving objects in a video and classify them using different classification algorithms. The organization of the paper is as follows. An overview of the state of the art is given in Section 2. The proposed approach for object detection and classification is discussed in Section 3, and Section 4 reports the experimental results and validation. Finally, the work is summarized and the conclusion is given in Section 5.

Overview of the State of the Art

Several studies have investigated the automatic detection of moving objects in a video sequence [3,4,5,6,7,8,9,10]. Yang et al. [3] applied spatio-temporal modelling for segmenting and detecting the presence of moving objects in video sequences. Temporal image features were extracted from the background frame obtained from the video. To segment the moving objects, a dynamic background algorithm was applied, achieving a recall rate of 72%, a false positive rate of 27.73% and a false negative rate of 27.04%.

A methodology for detecting moving objects using the codebook algorithm was proposed by Li et al. [4]. Initially, the codebook algorithm was employed for segmenting the foreground and background. A three-frame difference was used to detect the foreground frame. To eliminate the cavities introduced by the three-frame difference, the authors applied LoG edge detection and component filling to optimize the foreground. Finally, a logical AND operation was applied to combine the foreground object obtained by the improved three-frame difference algorithm with that of the codebook algorithm.

A system for person segmentation, tracking and interpretation, named Pfinder, was developed by Wren et al. [5]. It models each pixel of the background with a simple Gaussian distribution, but estimating the Gaussian parameters using Expectation Maximization (EM) for each pixel is computationally expensive.

Bilodeau et al. [6] proposed an efficient approach for detecting moving objects. They applied modified local binary similarity patterns to segment the background from the input video sequences, and tested the effectiveness of their method on various real-time video sequences.

Elgammal et al. [7] introduced a non-parametric background model. This approach estimates the probability of observing a pixel grey level based on a sample of grey levels for each pixel. The authors employed colour information to suppress the shadows of the target object.

A training-free method for the detection of moving objects in a video sequence is presented by Zhang et al. [8]. For each frame of the sequence, the dense optical flow between the frame and its predecessor is computed. A novel clustering method is applied to each region of high optical flow, which helps to segment the different objects in motion. This approach achieved a recall rate of 87.2% with a precision of 93.5%.

Wu et al. [9] proposed a method that uses ratio images as the fundamental step for motion detection, smoothing out the effects of poor illumination. The threshold-selection problem, usually associated with the difference image, is shifted to the ratio image and addressed automatically by the authors using a histogram technique.

A non-parametric kernel density estimation approach for the detection and tracking of moving objects was proposed by Ianasi et al. [10]. The authors developed a fast and robust algorithm by combining multiresolution and recursive density estimation with mean-shift-based tracking.

D. Kollar et al. [11] introduced a model using a Kalman filter to model each background pixel based on the effects of the weather and the time of day on the intensity values, whereas Jabri et al. [12] used colour and edge information for modelling the background and for subtraction, fusing the intermediate results with confidence maps. This method does not produce good results when there is a sudden change or there are multiple moving objects in the scene.

Proposed Method

The purpose of the proposed approach is to detect moving objects in the input video sequences and classify them using different classification algorithms. The required input video sequences are collected from the CIPR database. Each input video sequence is first converted into frames, which are pre-processed to improve their quality and remove noise. The enhanced frames then undergo the background subtraction process, after which the feature vectors are extracted. The features obtained from the different video sequences are used to train and test three different classifiers. The overall schematic representation of the proposed approach is shown in Fig. 2.

Fig. 2

Schematic diagram of the proposed work

Databank Used

To evaluate the performance of the proposed work, the input video sequences required for processing are collected from the CIPR databank (http://www.cipr.rpi.edu/resources/sequences/qcif.html). This databank contains video sequences captured at various environmental locations by a high-resolution camera, covering both indoor and outdoor conditions. Out of the different sequences, the fountain, airport, meeting room and lobby sequences are chosen to validate and compare the performance of the proposed work.

Pre-Processing

The input video sequence obtained from the CIPR databank is first converted into frames. Each frame is then pre-processed to enhance its quality. The main purpose of pre-processing is to improve the accuracy of the work by removing noise. Each frame obtained from the video is in RGB form. This RGB frame is converted into HSI form and the I component alone is extracted for further processing, since noise mostly affects the I component rather than the hue and saturation components. The extracted I component is then passed through a median filter to remove the noise.

To improve the contrast, the output of the median filter is given to the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm [13].
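A minimal pre-processing sketch in Python/OpenCV is shown below. OpenCV offers no direct HSI conversion, so the V channel of HSV is used here as a stand-in for the I component; the median kernel size and CLAHE settings are assumptions, not values from the paper.

```python
import cv2

def preprocess_frame(frame_bgr, median_ksize=5, clip_limit=2.0):
    """Extract the intensity channel of a frame, denoise it, enhance contrast."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    intensity = hsv[:, :, 2]                             # I (here: V) component only
    denoised = cv2.medianBlur(intensity, median_ksize)   # median filter removes noise
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(8, 8))
    return clahe.apply(denoised)                         # CLAHE contrast enhancement [13]
```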

Background Subtraction

The next important step is the extraction of the foreground object from the background. The background extraction algorithm should be able to address the critical problems discussed in Section 1. In the proposed work, the extraction of the foreground region is accomplished by a combination of temporal image analysis and a background subtraction process. Initially, a temporal analysis is performed by comparing, at each time t, two consecutive frames, which provides an image IMt. This image IMt is used for the background subtraction process.

In the background subtraction process, the image IMt is compared with the reference background image IBt at each time t. To detect the foreground object, the Radiometric Similarity (RS) is calculated, given by

$$ \mathrm{RS}\left(I^{t}(x,y),\, I^{t-1}(x,y)\right)=\frac{m\left[W\left(I^{t}(x,y)\right)\,W\left(I^{t-1}(x,y)\right)\right]-m\left[W\left(I^{t}(x,y)\right)\right]\,m\left[W\left(I^{t-1}(x,y)\right)\right]}{v\left[W\left(I^{t}(x,y)\right)\right]\cdot v\left[W\left(I^{t-1}(x,y)\right)\right]} $$
(1)

where m[W] and v[W] represent the mean and variance of the intensity of the pixels present in the window W.
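The following sketch shows how Eq. (1) could be evaluated for a single pixel; the window size and the small epsilon added to the denominator are assumptions made for illustration.

```python
import numpy as np

def radiometric_similarity(img_t, img_t1, x, y, half=4):
    """Radiometric similarity of Eq. (1) for the window centred at (x, y)."""
    w_t = img_t[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    w_t1 = img_t1[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    cov = (w_t * w_t1).mean() - w_t.mean() * w_t1.mean()  # m[W_t W_t1] - m[W_t] m[W_t1]
    denom = w_t.var() * w_t1.var() + 1e-12                # v[W_t] . v[W_t1], guarded
    return cov / denom
```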

Feature Extraction

The objective of the feature extraction process is to represent a pixel in terms of quantifiable information that is useful for the classification process. In the proposed work, the following two sets of features were selected.

  i. Texture features using LBP: features based on texture are extracted using the Local Binary Pattern (LBP) algorithm.

  ii. Grey level features: five different features based on the grey level of the foreground object are extracted.

Texture Feature Using LBP

Motivated by the literature work in [14], 24 texture features are extracted using the Local Binary Pattern (LBP). LBP is one of the most powerful feature descriptors used in image processing and machine learning; compared to other texture descriptors, its computational complexity is very low.

The key idea of this algorithm is to assign a label to each pixel in the obtained foreground region. A local neighbourhood of the pixel is defined by a number of sample points P on a circle of radius r. The intensity value of the centre pixel is taken as a reference, and the neighbourhood pixels are thresholded against this reference value to form a binary pattern. Finally, the LBP label of each pixel is calculated by summing the thresholded neighbourhood values, each weighted by a power of two.

$$ \mathrm{F}_{\mathrm{LBP}}=\sum_{p=0}^{P-1} f\left(I_p-I_c\right)\,2^{p} $$
(2)
$$ f(x)=\begin{cases}1, & x\ge 0\\ 0, & x<0\end{cases} $$
(3)

where Ip and Ic are the intensity values of the neighbourhood pixel and the centre pixel respectively, and P is the number of samples on the circle of radius r.

Six statistical features, namely mean, standard deviation, median, entropy, skewness and kurtosis, are calculated from each LBP pattern. This procedure is performed for four different radii, r = 1, 2, 3 and 4, yielding a total of 24 features.
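A sketch of this 24-feature extraction, using scikit-image's LBP implementation, is given below; the choice of P = 8r sample points per radius and the histogram binning used for the entropy are assumptions.

```python
import numpy as np
from scipy.stats import skew, kurtosis, entropy
from skimage.feature import local_binary_pattern

def lbp_texture_features(gray, points_per_radius=8):
    """24 texture features: six statistics of the LBP map for r = 1..4."""
    feats = []
    for r in (1, 2, 3, 4):
        lbp = local_binary_pattern(gray, P=points_per_radius * r, R=r)
        values = lbp.ravel()
        hist, _ = np.histogram(values, bins=32, density=True)  # for the entropy
        feats += [values.mean(), values.std(), np.median(values),
                  entropy(hist + 1e-12), skew(values), kurtosis(values)]
    return np.array(feats)  # 4 radii x 6 statistics = 24 features
```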

Grey Level Features

The grey level of the foreground object provides further meaningful features for the classification of input sequences. Taking this information into account, a set of grey level features is derived from the foreground object [15]. Let Sx,y denote the set of coordinates in a w x w square window centred on the pixel (x,y). These features are given as follows.

$$ \mathrm{FI}_1\left(x,y\right)=\mathrm{HI}\left(x,y\right) $$
(4)
$$ \mathrm{FI}_2\left(x,y\right)=\mathrm{HI}\left(x,y\right)-\min_{(s,t)\in S_{x,y}}\left\{\mathrm{HI}\left(s,t\right)\right\} $$
(5)
$$ \mathrm{FI}_3\left(x,y\right)=\operatorname{std}_{(s,t)\in S_{x,y}}\left\{\mathrm{HI}\left(s,t\right)\right\} $$
(6)
$$ \mathrm{FI}_4\left(x,y\right)=\mathrm{HI}\left(x,y\right)-\operatorname{mean}_{(s,t)\in S_{x,y}}\left\{\mathrm{HI}\left(s,t\right)\right\} $$
(7)
$$ \mathrm{FI}_5\left(x,y\right)=\max_{(s,t)\in S_{x,y}}\left\{\mathrm{HI}\left(s,t\right)-\mathrm{HI}\left(x,y\right)\right\} $$
(8)
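These five features could be computed per pixel as in the following sketch, which assumes w = 9 for the window Sx,y:

```python
import numpy as np

def grey_level_features(hi, x, y, w=9):
    """Five grey-level features of Eqs. (4)-(8) for the pixel (x, y)."""
    half = w // 2
    win = hi[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    centre = float(hi[y, x])
    return np.array([
        centre,                 # FI1: grey level of the pixel itself
        centre - win.min(),     # FI2: difference from the window minimum
        win.std(),              # FI3: standard deviation within the window
        centre - win.mean(),    # FI4: difference from the window mean
        (win - centre).max(),   # FI5: maximum difference from the centre pixel
    ])
```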

Scene Classification

The extracted LBP and grey level features are combined to form a feature vector. To classify the input video sequences into different classes, these feature vectors are applied to a classifier algorithm. In this work, three different classifiers, namely SVM, PLS and PNN, are chosen and their performances are compared.

SVM Classifier

The feature vectors extracted from the moving objects of the input video sequences are applied to the SVM classifier [16]. This classifier minimizes the empirical risk and prevents the overfitting problem. The architecture of this classifier is shown in Fig. 3.

Fig. 3

Architecture of SVM

This classifier consists of three layers: an input layer, a hidden layer and an output layer. The classification is performed in two phases: (a) a training phase and (b) a testing phase.

In the training phase, the features extracted from the different sequences (fountain, airport, meeting room and lobby) are used to train the SVM classifier; nearly 60% of the extracted features are used for this process. In the testing phase, the remaining features are applied and tested for the classification process.

The RBF (Radial basis function) kernel used for the process is given by

$$ K_f\left(x,y\right)=\sum_{f=1}^{k}\beta_f\, e^{-\gamma_f\left(x_f-y_f\right)^{2}} $$
(9)

where γf is a kernel parameter and βf is a weighting coefficient.
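A sketch of the training/testing protocol with scikit-learn is shown below; it uses the library's standard single RBF kernel as a stand-in for the weighted kernel of Eq. (9), and the hyper-parameter values are assumptions.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_svm(features, labels):
    """Train on ~60% of the feature vectors, test on the remaining 40%."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, train_size=0.6, stratify=labels, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='scale', C=1.0))
    clf.fit(x_tr, y_tr)
    return clf, clf.score(x_te, y_te)   # classification accuracy on the test split
```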

PLS Classifier

The second classifier used to classify the input sequences is the Partial Least Square (PLS) classifier.

This classifier has low bias and high variance between the different classes. In this work, a linear regression PLS classifier with an adjustable threshold is employed [17,18,19]. The main reason for choosing this classifier is that it provides high accuracy and avoids the over-fitting problem. Usually, this classifier is formulated as

$$ \mathrm{A}=\mathrm{B}.\upbeta +\upvarepsilon $$
(10)

where A is the vector holding the classification metric, B is the extracted feature vector, β is the linear regression coefficient and ε is a residual vector.

The extracted feature vector B (Section 3.4) is applied to the PLS classifier for training, and an optimum linear regression coefficient is found. This optimum value is then used in the testing phase to classify the input sequences.
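A minimal two-class sketch of this procedure using scikit-learn's PLS regression is given below; the number of latent components and the 0.5 decision threshold are assumptions.

```python
from sklearn.cross_decomposition import PLSRegression

def train_pls(features_b, labels_a, n_components=5, threshold=0.5):
    """Fit A = B.beta + eps by PLS, then classify by thresholding the output."""
    pls = PLSRegression(n_components=n_components)
    pls.fit(features_b, labels_a)          # labels coded as 0/1 for the binary case
    def predict(x):
        return (pls.predict(x).ravel() > threshold).astype(int)
    return pls, predict
```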

PNN Classifiers

The next classifier experimented with for the classification process is the Probabilistic Neural Network (PNN). This classifier is a multilayer feed-forward neural network derived from the Bayesian network [20]. It consists of four layers, namely an input layer, a pattern layer, a summation layer and an output layer, as shown in Fig. 4.

Fig. 4

Architecture of PNN classifier

The step-by-step process of the PNN is given as follows.

  • Step 1: Feed the extracted input features into the input layer.

  • Step 2: Feed the trained features into the next layer, i.e. the pattern layer. The kernel used is given by

$$ F_{k,i}\left(x\right)=\frac{1}{\left(2\pi \sigma^{2}\right)}\, e^{-\frac{\left\Vert X-X_{ki}\right\Vert^{2}}{2\sigma^{2}}} $$
(11)

where Xki is the centre of the kernel and σ is the smoothing parameter, which determines the spread of the kernel function.

  • Step 3: The next step is to compare the input features with each group of patterns. Using this comparison, the summation layer computes the conditional probability as a combination of the previously computed densities

$$ G_k\left(X\right)=\sum_{i=1}^{M_k} W_{ki}\, F_{k,i}\left(x\right),\qquad k\in \left\{1,\dots,K\right\} $$
(12)

where \( W_{ki} \) represents the positive coefficients satisfying \( \sum_{i=1}^{M_k} W_{ki}=1 \).

  • Step 4: At each class output node, sum the contributions of all training frames whose features are similar to those of the input.

  • Step 5: Determine the maximum of all the summed functional values at the output nodes using the following equation.

$$ M\left(x\right)=\arg \underset{1\le k\le K}{\max }\, G_k\left(X\right) $$
(13)

The output layer gives the different classification results.
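The following sketch implements this four-layer computation of Eqs. (11)-(13) directly in NumPy, with uniform weights Wki = 1/Mk and an assumed smoothing parameter:

```python
import numpy as np

def pnn_classify(train_x, train_y, test_x, sigma=1.0):
    """Minimal PNN: every training vector is a Gaussian kernel centre."""
    classes = np.unique(train_y)
    preds = []
    for x in test_x:                                   # input layer
        g = []
        for k in classes:
            centres = train_x[train_y == k]            # pattern layer centres X_ki
            d2 = np.sum((centres - x) ** 2, axis=1)
            g.append(np.mean(np.exp(-d2 / (2 * sigma ** 2))))  # summation, Eq. (12)
        preds.append(classes[int(np.argmax(g))])       # output layer, Eq. (13)
    return np.array(preds)
```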

Results and Discussion

The performance of the proposed approach has been tested on a number of sequences obtained from both indoor and outdoor environments. In particular, the fountain, airport, meeting room and lobby sequences are chosen to validate and compare the performance. Indoor sequences with varying illumination are also considered.

The algorithm was developed using MATLAB (R2013a) on a Pentium IV 2.0 GHz processor. The quantitative performance of the proposed work is obtained by calculating the accuracy, Detection Rate (DR), False Alarm Rate (FAR) and computational time.

$$ \mathrm{Accuracy}=\frac{1}{N}\sum_{i=1}^{N} {TP}_i $$
(14)
$$ \mathrm{Detection}\ \mathrm{Rate}\ \left(\mathrm{DR}\right)\kern0.5em =\mathrm{TP}/\left(\mathrm{TP}+\mathrm{FN}\right) $$
(15)
$$ \mathrm{False}\ \mathrm{Alarm}\ \mathrm{Rate}\ \left(\mathrm{FAR}\right)=\mathrm{FP}/\left(\mathrm{TP}+\mathrm{FP}\right) $$
(16)

Where

TP = True Positive

FP = False Positive

FN = False Negative

True positive represents the number of detected pixels that correspond to moving objects. False positive denotes the number of detected pixels that do not correspond to a moving object, and false negative represents the pixels of moving objects that are not detected.
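For pixel-level masks, the DR and FAR of Eqs. (15) and (16) can be computed as in the following sketch (detected and ground-truth masks assumed to be boolean arrays):

```python
import numpy as np

def detection_metrics(detected, ground_truth):
    """Detection Rate and False Alarm Rate from boolean foreground masks."""
    tp = np.sum(detected & ground_truth)       # moving-object pixels correctly detected
    fp = np.sum(detected & ~ground_truth)      # detected pixels with no moving object
    fn = np.sum(~detected & ground_truth)      # moving-object pixels that were missed
    dr = tp / (tp + fn) if (tp + fn) else 0.0  # Eq. (15)
    far = fp / (tp + fp) if (tp + fp) else 0.0 # Eq. (16)
    return dr, far
```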

Table 1 compares the performance of the proposed work with other state-of-the-art works. The accuracy reported in Table 1 for the proposed work is the average accuracy over the different video sequences obtained using the PLS classifier. From this table, it is clear that this work achieves higher accuracy than the other works discussed in the literature.

Table 1 Comparison of the proposed method with other state-of-the-art methods

Table 2 shows the experimental results obtained by the SVM classifier on the different sequences.

Table 2 Experimental results obtained by the SVM Classifiers on input sequences

In Table 3, the performance of the proposed work using PLS classifier is shown.

Table 3 Experimental results obtained by the PLS Classifiers on input sequences

In Table 4, the experimental results obtained by the PNN classifier are shown.

Table 4 Experimental results obtained by the PNN classifiers on input sequences

Figure 5 compares the performance of the three classifiers in terms of their accuracy and computational time.

Fig. 5

Comparison of Performance of three different classifiers

Conclusions

In this work, a comparative analysis of different classifiers for the detection of moving objects is presented. This work can be very useful for object tracking, video surveillance, activity recognition and so on. A robust algorithm for the detection of moving objects in a video sequence is developed. The method takes the input video and first extracts the image sequences. These frames are pre-processed and the foreground object alone is segmented by the proposed background subtraction technique. Features based on LBP and grey levels are extracted and the different scenes are classified using three different classifiers.

The results obtained by the three classifiers in both indoor and outdoor environments show the robustness and reliability of the approach. One drawback of this work is that it cannot completely eliminate the shadows present in the scene, mainly because of the high contrast with the background.