
1 Introduction

Real-time object tracking is a critical task in computer vision, and many algorithms have been proposed to overcome the difficulties arising from noise, occlusion, clutter, pose variation, and changes in the foreground object and/or the background environment. Many different algorithms have been proposed for object tracking [1–3], including mean-shift tracking [4], optical flow, and feature matching. Each algorithm has strengths in certain environments and weaknesses in others. This work aims to combine several such algorithms as inputs or “measurements” to a single Kalman filter [5] for robust object tracking. The filter can favor the algorithm that is most applicable to the current environment by decreasing its measurement noise variance, and similarly ignore less suitable algorithms by increasing their measurement variances.

Two major components can be distinguished in a typical visual tracker. Target representation and localization is mostly a bottom-up process that also has to cope with changes in the appearance of the target. Filtering and data association is mostly a top-down process dealing with the dynamics of the tracked object, the learning of scene priors, and the evaluation of different hypotheses [4, 6, 7]. The way the two components are combined and weighted [8, 9] is application dependent and plays a decisive role in the robustness and efficiency of the tracker. In real-time applications, only a small percentage of the system resources can be allocated to tracking, the rest being required for the preprocessing stages or for high-level tasks such as recognition, trajectory interpretation, and reasoning. It is therefore desirable to keep the computational complexity of a tracker as low as possible.

The goal of this paper is to improve the similarity measure used for the target representation in the Kalman filter. We derive a similarity measure by combining the expected likelihood kernel [10–12] with the integral image [13] to compare the target with the estimated scale/shape regions in the frames of a video sequence. We analyze and compare our system with two methods. The first [7] is an efficient local-search framework for real-time tracking of complex non-rigid objects, in which the shape of the object is approximated by an ellipse and its appearance by histogram-based features derived from local image properties, combining mean shift and a Kalman filter. The second [4] also uses a Kalman filter, where the shape of the tracked object is approximated by an ellipse and the appearance within the ellipse is described by a histogram-based model. The obvious advantage of such a model is its simplicity and general applicability. Another advantage, which has made this observation model rather popular, is the existence of efficient local search schemes that find the image region whose histogram is most similar to the histogram of the tracked object. Experimental results show that the proposed approach has superior discriminative power and achieves good tracking performance.

The rest of the paper is organized as follows: Sect. 2 introduces the basic Kalman filter for object tracking. Section 3 presents the expected likelihood kernel. Section 4 presents the scale estimation. Section 5 describes the proposed approach, and Sect. 6 reports the experimental results. Section 7 concludes the paper.

2 Kalman Filter

The Kalman filter is a framework for predicting a process state, and using measurements to correct or “update” these predictions.

2.1 State Prediction

For each time step k, a Kalman filter first makes a prediction \(\hat{s}_{k}\) of the state at this time step:

$$\hat{s}_{k} = A \times s_{k - 1}$$
(1)

where \(s_{k - 1}\) is a vector representing the process state at time k − 1 and A is the process transition matrix. The Kalman filter concludes the state prediction step by projecting the estimate error covariance \(P_{k}^{ - }\) forward one time step:

$$P_{k}^{ - } = A \times P_{k - 1} \times A^{t} + W$$
(2)

where \(P_{k - 1}\) is a matrix representing error covariance in the state prediction at time k − 1, and W is the process noise covariance.

2.2 State Correction

After predicting the state \(\hat{s}_{k}\) (and its error covariance) at time k using the state prediction steps, the Kalman filter next uses measurements to “correct” its prediction during the measurement update steps. First, the Kalman filter computes a Kalman gain \(K_{k}\), which is later used to correct the state estimate \(\hat{s}_{k}\):

$$K_{k} = P_{k}^{ - } \times (P_{k}^{ - } + R_{k} )^{ - 1}$$
(3)

where \(R_{k}\) is the measurement noise covariance. Determining \(R_{k}\) for a set of measurements is often difficult; in our implementation we compute \(R_{k}\) dynamically from the state of the measurement algorithms. Using the Kalman gain \(K_{k}\) and the measurements \(z_{k}\) from time step k, we can update the state estimate:

$$\hat{s}_{k} = \hat{s}_{k} + K_{k} \times (z_{k} - \hat{s}_{k} )$$
(4)

Conventionally, the measurements \(z_{k}\) are derived from sensors. In our approach, the measurements \(z_{k}\) are instead the outputs of various tracking algorithms given the same input, one frame of a streaming video: the most likely x and y coordinates of the target object in this frame (taken as the first two dimensions of \(\hat{s}_{k}\)).

The final step of the Kalman filter iteration is to update the error covariance \(P_{k}^{ - }\) into \(P_{k}\):

$$P_{k} = (I - K_{k} ) \times P_{k}^{ - }$$
(5)

The updated error covariance will be significantly decreased if the measurements are accurate (some entries in \(R_{k}\) are low), or only slightly decreased if the measurements are noisy (all entries of \(R_{k}\) are high). For more details, see [5, 8, 9].
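For concreteness, the predict/correct cycle of Eqs. (1)–(5) can be written as the following minimal NumPy sketch. It assumes, as the equations above imply, that the measurement has the same dimension as the state; it is not the authors' MATLAB implementation, and the class and variable names are ours.

```python
import numpy as np

class SimpleKalman:
    """Minimal Kalman filter following Eqs. (1)-(5); the measurement matrix is the identity."""

    def __init__(self, A, W, s0, P0):
        self.A = A    # process transition matrix
        self.W = W    # process noise covariance
        self.s = s0   # current state estimate
        self.P = P0   # current error covariance

    def predict(self):
        self.s = self.A @ self.s                          # Eq. (1)
        self.P = self.A @ self.P @ self.A.T + self.W      # Eq. (2)
        return self.s

    def correct(self, z, R):
        # R may be recomputed every frame from the state of the measurement algorithms.
        K = self.P @ np.linalg.inv(self.P + R)            # Eq. (3)
        self.s = self.s + K @ (z - self.s)                # Eq. (4)
        self.P = (np.eye(len(self.s)) - K) @ self.P       # Eq. (5)
        return self.s
```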

3 Expected Likelihood Kernel

Let p and q be probability distributions on a space χ and ρ a positive constant. In this work, we use the probability product kernel \(K_{\rho } :\chi \times \chi \to \mathbb{R}\), defined on the space of normalized discrete distributions over the index set \(\Omega = \{1, \ldots, N\}\), as the similarity measure between two discrete distributions \(p \in \chi\) and \(q \in \chi\). The probability product kernel is defined as:

$$K_{\rho } (p,q) = \sum\limits_{k = 1}^{N} p(k)^{\rho } q(k)^{\rho }$$
(6)

It is easy to show that such a similarity measure is a valid kernel, since for any \(p_{1} ,p_{2} ,p_{3} , \ldots ,p_{n} \in \chi\) the Gram matrix K with elements \(K_{ij} = K_{\rho } (p_{i} ,p_{j} )\) is positive semi-definite:

$$\sum\limits_{i} \sum\limits_{j} \alpha_{i} \alpha_{j} K_{\rho } (p_{i} ,p_{j} ) = \sum\limits_{k} (\sum\limits_{i} \alpha_{i} p_{i} (k)^{\rho } )^{2} \ge 0$$
(7)

for \(\alpha_{1} ,\alpha_{2} ,\alpha_{3} , \ldots ,\alpha_{n} \in \mathbb{R}\). Different values of ρ correspond to different types of probability product kernels. For \(\rho = 1\), we have:

$$K_{1} (p,q) = \sum\limits_{k} p(k)q(k) = \mathbb{E}_{p} \left[ {q(k)} \right] = \mathbb{E}_{q} \left[ {p(k)} \right]$$
(8)

We call this the expected likelihood kernel, defined by \(K(p,q) = \sum\nolimits_{k} p(k)q(k)\). We denote the histogram of the tracking target T by \(h_{T}\) and the number of pixels inside T by \(|T|\), which also equals the sum over bins, \(|T| = \sum\nolimits_{k} h_{T} (k)\). Let q be the normalized version of \(h_{T}\), given by \(q = \frac{{h_{T} }}{|T|}\), so that q can be treated as a discrete distribution with \(\sum\nolimits_{k} q(k) = 1\). Let p be the normalized histogram obtained from the frames of the video sequence. The value of the k-th bin of \(h_{T}\) is obtained by counting the pixels that are mapped to the index k:

$$h_{T} (k) = \sum\limits_{x \in T} \delta \left[ {b(x) - k} \right]$$
(9)

where \(\delta \left[ t \right]\) is the Kronecker delta, with \(\delta \left[ t \right] = 1\) if t = 0, and \(\delta \left[ t \right] = 0\) otherwise. The mapping function \(b(x)\) maps a pixel x to its corresponding bin index. The computation of the expected likelihood kernel can be expressed as:

$$\begin{aligned} K(p,q) & = \sum\limits_{k} p(k)q(k) = \sum\limits_{k} p(k)\left( {\frac{1}{|T|}\sum\limits_{x \in T} \delta \left[ {b(x) - k} \right]} \right) \\ & = \frac{1}{|T|}\sum\limits_{x \in T} \sum\limits_{k} p(k)\delta \left[ {b(x) - k} \right] = \frac{1}{|T|}\sum\limits_{x \in T} p(b(x)) \\ \end{aligned}$$
(10)

Therefore, the expected likelihood kernel can be computed by summing the values \(p(b(x))\) over the candidate target region T. In the proposed algorithm, an integral image over these per-pixel values is used to compute the similarity measure between the target and the candidate regions of each frame of the video sequence, yielding a support map.
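As an illustration, Eq. (10) can be evaluated in constant time per candidate region with a summed-area table over the per-pixel values \(p(b(x))\). The Python/NumPy sketch below assumes 8-bit RGB pixels quantized into 16 × 16 × 16 bins (as in Sect. 6) and rectangular regions approximating the elliptical target; by the symmetry of \(K(p,q)\), the normalized model histogram is backprojected onto the frame and averaged over each candidate rectangle. The function names are ours, not the authors' implementation.

```python
import numpy as np

def rgb_bin_map(image, bins_per_channel=16):
    """b(x) of Eq. (9) for every pixel of an (H, W, 3) uint8 image:
    a single index in [0, bins_per_channel**3)."""
    q = image // (256 // bins_per_channel)      # per-channel index in [0, 15]
    return (q[..., 0].astype(int) * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]

def integral_image(values):
    """Summed-area table with a zero first row/column for O(1) rectangle sums."""
    ii = np.zeros((values.shape[0] + 1, values.shape[1] + 1))
    ii[1:, 1:] = values.cumsum(axis=0).cumsum(axis=1)
    return ii

def support_map_sums(frame, model_hist, bins_per_channel=16):
    """Integral image of the backprojection p(b(x)); rectangle sums give the
    numerator of Eq. (10) for every candidate region in O(1)."""
    backproj = model_hist[rgb_bin_map(frame, bins_per_channel)]
    return integral_image(backproj)

def expected_likelihood(ii, top, left, bottom, right):
    """Eq. (10): average of p(b(x)) over the rectangle [top:bottom, left:right)."""
    area = (bottom - top) * (right - left)
    total = ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]
    return total / area
```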

4 Adaptive Scale of Target Model

A target is represented by an ellipsoidal region in the image, which eliminates the influence of different target dimensions. Let \(x_{i}\) denote the pixel locations of the target model, where i ranges over all pixels that belong to the tracked object, and let θ be the location of the object's center in each frame of the video sequence. Suppose we are given an arbitrary shape S in an image, specified by a set of pixel locations \(x_{i}\), i.e., \(S = \{ x_{i} \}\). The original shape S is selected manually in the first frame. The covariance matrix can be used to approximate the shape of the object:

$$\theta = \frac{1}{{N_{s} }}\sum\limits_{{x_{i} \in S}} x_{i} ,\qquad V = \frac{1}{{N_{s} }}\sum\limits_{{x_{i} \in S}} (x_{i} - \theta )(x_{i} - \theta )^{t}$$
(11)

where \(N_{s}\) is the number of pixels that belong to the object of interest, and V describes an arbitrary elliptical region. We use the parametrization \(s = [\theta^{t} ,scale_{x} ,scale_{y} ,skew]^{t}\), where \(scale_{x}\) and \(scale_{y}\) are the scaling parameters and \(skew\) is the skew parameter, obtained from V using the unique Cholesky factorization:

$$V = \left[ {\begin{array}{*{20}c} {scale_{x} } & {skew} \\ 0 & {scale_{y} } \\ \end{array} } \right]^{t} \left[ {\begin{array}{*{20}c} {scale_{x} } & {skew} \\ 0 & {scale_{y} } \\ \end{array} } \right]$$
(12)

In the following we will write the state as \(s = (\theta ,V)\) to explicitly highlight the dependence on θ and V. Similarly, \(S(s)\) will denote the elliptical shape defined by s.
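Equations (11) and (12) translate directly into the following short sketch (Python/NumPy, with our own naming): the mean and covariance of the pixel set give θ and V, and the upper-triangular Cholesky factor of V gives the scale and skew parameters of the state s.

```python
import numpy as np

def shape_state(pixels):
    """Eqs. (11)-(12): center theta, covariance V and (scale_x, scale_y, skew)
    for a pixel set S given as an (N_s, 2) array of coordinates."""
    theta = pixels.mean(axis=0)                 # Eq. (11): center of the region
    d = pixels - theta
    V = d.T @ d / len(pixels)                   # Eq. (11): shape covariance
    # Eq. (12): V = U^t U with U upper triangular. np.linalg.cholesky returns
    # the lower factor L with V = L L^t, hence U = L^t.
    U = np.linalg.cholesky(V).T
    scale_x, skew, scale_y = U[0, 0], U[0, 1], U[1, 1]
    return theta, V, (scale_x, scale_y, skew)
```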

The appearance of an object is described by a set of k scalar features \(r_{1} , \ldots ,r_{k}\) that are extracted from the local area of an image I defined by \(S(s)\). We view each \(r_{k}\) as a “bin” of a histogram. Let Γ be the set of pixel values \(I(x_{i} )\), for example \(\Gamma = [0,255]^{3}\) for RGB images. We define a quantization function \(b:\Gamma \to [1, \ldots ,k]\) that associates each observed pixel value with a particular bin index k. The value \(r_{k}\) of the k-th bin is calculated from the elliptical image region \(S(s = (\theta ,V))\) using:

$$r_{k} (I,s) = |V|^{{\frac{\gamma }{2}}} \sum\limits_{{x_{i} \in S(s)}} N(x_{i} ;\theta ,V)\delta [b(I(x_{i} )) - k]$$
(13)

where δ is the Kronecker delta function. The kernel function N is chosen such that pixels in the middle of the object receive higher weights than pixels at its borders. A natural choice is the Gaussian kernel defined by:

$$N(x;\theta ,V) = \frac{1}{{|2\pi V|^{1/2} }}\exp \left( { - \frac{1}{2}(x - \theta )^{t} V^{ - 1} (x - \theta )} \right)$$
(14)

The prefactor \(|V|^{{\frac{\gamma }{2}}}\) in (13) compensates for the fact that in practice we use only the \(N_{s}\) pixels from a finite neighborhood of the kernel center. We disregard samples farther than 1.6-sigma, and it can be shown that one should use \(\gamma \approx 1.6\) in this case. The smooth kernel function suppresses the influence of the (arguably less reliable) pixels near the borders.
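A direct, unoptimized transcription of Eqs. (13) and (14) is sketched below (Python/NumPy; the function names are ours, not the authors' implementation).

```python
import numpy as np

def gaussian_weight(x, theta, V_inv, norm):
    """Eq. (14): 2-D Gaussian kernel value at pixel location x."""
    d = x - theta
    return norm * np.exp(-0.5 * d @ V_inv @ d)

def kernel_weighted_histogram(pixels, bin_idx, theta, V, n_bins, gamma=1.6):
    """Eq. (13): histogram r_k over the elliptical region S(s).
    `pixels` is an (N_s, 2) array of coordinates inside S(s) and `bin_idx`
    holds the corresponding bin indices b(I(x_i))."""
    V_inv = np.linalg.inv(V)
    norm = 1.0 / np.sqrt(np.linalg.det(2.0 * np.pi * V))
    r = np.zeros(n_bins)
    for x, k in zip(pixels, bin_idx):
        r[k] += gaussian_weight(x, theta, V_inv, norm)
    return np.linalg.det(V) ** (gamma / 2.0) * r   # prefactor |V|^(gamma/2)
```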

5 Proposed Approach

To ensure a well-organized workflow, we exploited the benefits of modular design in our approach, which is implemented in MATLAB. The goal of object tracking is to generate the trajectory of an object over time by discovering its exact position in every frame of the video sequence. We have implemented several tracking components (Kalman filter, expected likelihood kernel, adaptive scale) with different processing methods. The steps of the object tracking system are shown in Fig. 1.

Fig. 1 Basic block diagram for the proposed algorithm

The proposed approach for object tracking is composed of five blocks: Block Processing, Block Prediction, Block Tracking, Block Correction, and Block Result. The functions of these blocks are as follows:

Block Processing

In the processing block, we read the video sequence and convert it into frames, from which the color information and the tracking target are extracted.

Block Prediction

The prediction block evaluates how the state of the target will change by feeding it through the state prediction step of the Kalman filter: the time update equations project the current state and error covariance estimates forward in time to obtain the a priori estimate for the next time step.

Block Tracking

In this block we combine the expected likelihood kernel and the integral image to compute the similarity measure and the histograms of all possible target regions in the video sequence. Based on the predicted state, we estimate the shape and orientation of the tracked object; a sketch of this local search is given below.
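One simple way to realize this step, sketched below in Python with our own naming, is a grid search around the predicted center that scores every candidate with the similarity measure of Eq. (10); the scoring function is passed in as an argument, so the sketch only illustrates the search structure, not the authors' exact local search scheme.

```python
import numpy as np

def local_search(similarity, center, radius=10, step=2):
    """Score candidate centers on a grid around the Kalman prediction and keep
    the best one; `similarity(cx, cy)` is assumed to return K(p, q) of Eq. (10)
    for the candidate region centered at (cx, cy)."""
    cx0, cy0 = center
    best, best_score = (cx0, cy0), -np.inf
    for dx in range(-radius, radius + 1, step):
        for dy in range(-radius, radius + 1, step):
            score = similarity(cx0 + dx, cy0 + dy)
            if score > best_score:
                best, best_score = (cx0 + dx, cy0 + dy), score
    return best, best_score
```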

Block Correction

The correction block implements the measurement update equations, which provide the feedback: a new measurement is incorporated into the a priori estimate to obtain an improved a posteriori estimate. The time update equations can thus be thought of as predictor equations, while the measurement update equations act as corrector equations based on the tracking block.

Block Result

The trajectory of the object is drawn on the basis of its region properties, such as shape and centroid.

The algorithm of the proposed approach can be summarized as follows (a sketch of the main loop is given after the list):

1. Start the video sequence and select the target object in the first frame

2. Predict, using the state prediction of the Kalman filter, how the state of the target will change: the current state and error covariance estimates are projected forward to obtain the a priori estimate for the next time step, using Eqs. (1) and (2)

3. Calculate the similarity measure between the target model and the candidate regions, and estimate the shape and orientation of the tracked object using Eqs. (10) and (13)

4. Correct and update the a priori estimate to obtain an improved a posteriori estimate, using Eqs. (3)–(5) and the state of the similarity measure, which yields the new position of the object

5. Draw the trajectory as a line joining the positions stored in every frame, which shows the path of the selected moving object, and go to step 2 for the next frame
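The five steps fit together as in the following schematic loop (Python, our own naming). The candidate search and the measurement-noise estimation are passed in as callables, since the text specifies their role but not their exact form; the loop sketches the control flow rather than the authors' MATLAB code.

```python
def track(frames, kalman, find_candidate, measurement_noise):
    """Steps 1-5: predict (Eqs. (1)-(2)), locate the best candidate
    (Eqs. (10), (13)), correct (Eqs. (3)-(5)), and store the trajectory.
    `find_candidate(frame, s_pred)` returns the measurement z_k;
    `measurement_noise(frame, z)` returns R_k."""
    trajectory = []
    for frame in frames:                       # step 1 assumed done: target selected in frame 1
        s_pred = kalman.predict()              # step 2
        z = find_candidate(frame, s_pred)      # step 3
        R = measurement_noise(frame, z)        # step 4: dynamic R_k
        s_new = kalman.correct(z, R)           # step 4
        trajectory.append(s_new[:2])           # step 5: (x, y) of the trajectory
    return trajectory
```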

6 Experimental Results

To verify the efficiency of the proposed approach (PA), we compared our system with two existing algorithms, MKF [7] and MS [4]. The experimental results show that the PA achieves good estimation accuracy of the scale and orientation of the object in the video sequences. We used different sequences, each with its own characteristics, their common point being a single moving object, and we set up experiments to record the estimated width, height, trajectory, and orientation of the object. In this work, we selected the RGB color space as the feature space and quantized it into 16 × 16 × 16 bins for a fair comparison between the different algorithms. One synthetic video sequence and two real video sequences are used in the experiments:

We first use a synthetic ellipse sequence to verify the efficiency of the proposed approach, as shown in Fig. 2. The outer ellipses represent the target candidate regions, which are used to estimate the real targets, i.e., the inner ellipses. The experimental results show that the proposed approach reliably tracks the trajectory of the ellipse through scale and orientation changes, whereas the results of MKF and MS are poor because of the significant scale and orientation changes of the object.

Fig. 2 a Tracking results of the synthetic ellipse sequence by different tracking algorithms; frames 1, 16 and 77 are displayed. b Trajectory results of the synthetic ellipse video sequence by different tracking algorithms

The second video, an occlusion sequence, is more complex. As can be seen in Fig. 3, both the proposed approach and the MKF algorithm [7] can track the target over the whole sequence, while MS [4] does not follow the orientation change of the target and produces poor tracking results. The proposed approach, however, works much better in estimating the scale and orientation of the target, especially when occlusion occurs.

Fig. 3 a Tracking results of the occlusion sequence by different tracking algorithms; frames 1, 35 and 193 are displayed. b Trajectory results of the occlusion video sequence by different tracking algorithms

The last video is a player sequence in which the scale of the object increases gradually, as shown in Fig. 4. The experimental results show that the proposed approach estimates the scale changes and the trajectory of the target region more accurately than the MKF and MS algorithms.

Fig. 4 a Tracking results of the player sequence by different tracking algorithms; frames 1, 43 and 60 are displayed. b Trajectory results of the player video sequence by different tracking algorithms

Table 1 lists the average processing time of the different methods on the video sequences. Our proposed approach (PA) achieves a better average execution time than the MKF and MS algorithms.

Table 1 Average processing time of the different methods on the video sequences

The experimental results demonstrate that the proposed approach robustly tracks the trajectory of objects in different situations (scale variation, pose change, rotation, and occlusion), and that it achieves good real-time estimation accuracy of the scale and orientation of the target.

7 Conclusion

In this paper, an approach has been presented for tracking a single moving object in a video sequence using color information. The approach combines the Kalman filter with the expected likelihood kernel as a similarity measure, using an integral image to compute the histograms of all possible target regions in the video sequence. The proposed approach has been compared with state-of-the-art algorithms on a set of tracking sequences and outperforms them in processing speed. Extensive experiments were performed to test the proposed approach and validate its robustness to scale and orientation changes of the target in real time. The implemented system can be applied to any computer vision application involving moving object detection and tracking.