
1 Introduction

Robust long-term object tracking is a challenging task in computer vision. Many trackers have been proposed by different researchers [13] that employ different types of visual information and learned features to build the appearance models underlying the tracking, e.g., color histograms in Meanshift, multiple features in particle filters [4], Haar-like features in MIL [5], etc.

However, a single appearance model, even an online-updated model or a patch dictionary model, is not enough to represent the tracking target when it undergoes internal and external variations. Internal variations include pose changes, motion, shape deformation, illumination variation, etc.; external variations include background changes and occlusion by foreground objects. To tackle this problem, a set of appearance models is needed to describe the historical appearances of the tracking target.

Therefore, we propose a novel nonparametric statistical method that models the appearance of the target as a combination of multiple appearance models. Each model describes a typical appearance under a specific situation, and the models are clustered dynamically and without supervision within the Dirichlet Process Mixture Model (DPMM) [7] framework.

Fig. 1. The framework of the adaptive multiple appearance model tracking.

The framework of our system is shown in Fig. 1. Experimental results on several public datasets show that the AMAM tracking system is applicable to multi-camera systems as well as indoor and outdoor tracking scenarios, and that it outperforms several state-of-the-art trackers.

The rest of the paper is organized as follows. Section 2 overviews some of the related works. Section 3 describes the proposed AMAM algorithm. We present experimental validation on several public datasets in Sect. 4 and conclude the paper in Sect. 5.

2 Related Works

In the long-term tracking task, the biggest challenge is the drifting problem. To tackle it, the tolerance range of the appearance models needs to be enhanced. Ensemble tracking [9] and the Multiple Instance Learning boosting method (MIL) [5] use positive and negative samples of the tracking target to train classifiers. Semi-online boosting [10] uses both unlabeled and labeled candidate targets to train classifiers online. Fragment-based tracking [16], coupled with a voting map, can accurately track a partially occluded target. However, historical information is ignored when these classifiers or models are updated. Dictionary learning [15] represents the dynamic appearance of the target as a linear combination of dictionary elements and handles occlusion as a sparse noise component; however, spatial and temporal information are lost in the process. In our model, we build an appearance model set that keeps the spatial information of the tracking target, while the tracking system keeps the temporal information as well. At the same time, any efficient appearance model can be employed in this framework, including sparse coding, dictionary learning, and target descriptions learned by deep learning or other machine learning methods.

Fig. 2. The working procedure of the AMAM framework.

3 The Framework of Adaptive Multiple Appearance Model Tracking

In this section, we describe the common framework of the adaptive multiple appearance model. First, we present the Dirichlet Process Mixture Model (DPMM), which is employed to organize the adaptive appearance set. After that, we describe the tracking system based on the AMAM framework.

3.1 Dirichlet Process Mixture Model

The Dirichlet process (DP) is parameterized by a base distribution H, which has corresponding density \(h (\mathrm {\theta })\), and a positive scaling parameter \(\mathrm {\alpha } > 0\). Suppose we draw a random measure G from a DP and then independently draw N random variables \(\mathrm {\theta }_{n}\) from G; this can be described as follows:

$$\begin{aligned} G \sim DP\left( \alpha , H \right) , \qquad \theta _{n} \,\vert \, G \sim G, \quad n=1,\ldots ,N \end{aligned}$$
(1)

As shown by [8], given N independent observations \(\mathrm {\theta }_{i} \sim G\), the posterior distribution of G also follows a DP:

$$\begin{aligned} G \,\vert \, \theta _{1},\ldots ,\theta _{N} \sim DP\left( \alpha +N,\frac{1}{\alpha +N}\left( \alpha H+\sum \nolimits _{i=1}^N \delta _{\theta _{i}} \right) \right) \end{aligned}$$
(2)

where \(n_{1}, n_{2}, \ldots , n_{r}\) represent the numbers of observations falling in each of the partitions \(A_{1}, A_{2}, \ldots , A_{r}\) respectively, N is the total number of observations, and \(\mathrm {\delta _{\theta _{i}} }\) represents the delta function at the sample point \(\mathrm {\theta }_{i}\).
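In practice, drawing a new \(\theta \) from the posterior base measure in Eq. (2) amounts to the Polya urn (Chinese restaurant) scheme: with probability \(\alpha /(\alpha +N)\) a fresh value is drawn from H, otherwise one of the N observed values is reused. The following is only a minimal sketch of that draw; the function and the placeholder base-distribution sampler sample_from_H are hypothetical and serve to illustrate the mechanism.

```python
import random

def draw_theta_posterior(thetas, alpha, sample_from_H):
    """Draw one theta from the posterior DP base measure of Eq. (2)."""
    N = len(thetas)
    if random.random() < alpha / (alpha + N):
        return sample_from_H()      # fresh draw from the base distribution H
    return random.choice(thetas)    # reuse an observed value (a delta mass in Eq. (2))
```

For example, with a Gaussian base distribution one could call draw_theta_posterior(thetas, alpha=1.0, sample_from_H=lambda: random.gauss(0, 1)).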

3.2 Model Inference

Given N observations \(X= \{{X_{i}}\}^N_{i=1}\) \((X_{i} \in \mathbb {N}^d)\), each \(X_{i}=\left\{ {x_{j}}\right\} ^d_{j=1}\) represents a quantized d-dimensional HOG feature, where \(x_{j}\) is the quantized count of the \(j^{th}\) histogram bin, an integer. Let \(z_{i}\) indicate the cluster, or appearance model, associated with the \(i^{th}\) observation, which is represented by its quantized HOG feature. As shown in Fig. 1, we would like to infer the number of latent clusters, or distinct appearances, underlying those observations, together with their parameters \(\mathrm {\theta }_{k}\). Since exact computation of the posterior is infeasible, especially when the data size is large, we resort to a variant of MCMC, namely the collapsed Gibbs sampler [7], for faster approximate inference.
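The paper does not specify how the HOG descriptor is quantized into integer counts, so the following is only a plausible sketch: it computes a standard HOG descriptor (here via skimage, an assumption) and rounds its normalized entries onto a fixed number of levels; the levels parameter and the normalization scheme are likewise assumptions.

```python
import numpy as np
from skimage.feature import hog  # any HOG implementation would do; skimage is an assumption

def quantized_hog(patch, levels=32):
    """Turn a grayscale image patch into a d-dimensional integer count vector X_i."""
    descriptor = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                     cells_per_block=(2, 2), feature_vector=True)
    descriptor = descriptor / (descriptor.sum() + 1e-12)     # normalize to unit mass
    return np.round(descriptor * levels).astype(np.int64)    # integer bin counts in N^d
```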

We choose a multinomial distribution \(F(\mathrm {\theta })\) to describe the HOG features of the observations, and the cluster prior \(H(\mathrm {\lambda })\) is a Dirichlet distribution, which is conjugate to \(F(\mathrm {\theta })\). Given fixed cluster assignments \(z_{-i}\) for the other observations, the posterior distribution of \(z_{i}\) factors as follows:

$$\begin{aligned} p\left( z_{i} \vert z_{-i},X,\alpha ,\lambda \right) \mathrm { \propto }p\left( z_{i} \vert z_{-i},\alpha \right) p\left( X_{i} \vert X_{-i},z,\lambda \right) \end{aligned}$$
(3)

The prior \(p\left( z_{i} \vert z_{-i},\alpha \right) \) is given by the Chinese restaurant process (CRP).

$$\begin{aligned} p\left( z_{i} \vert z_{-i},\alpha \right) \mathrm { \sim }\frac{1}{\alpha +N-1}\left( \sum \limits _{k=1}^K {N_{k}^{-i}\delta \left( z_{i} ,k \right) } + \alpha \delta \left( z_{i} ,\bar{k} \right) \right) \end{aligned}$$
(4)

Here \(\bar{k}\) denotes one of the infinitely many unoccupied clusters, i.e. a new appearance, and \(N_{k}^{-i}\) is the total number of observations in cluster k, excluding observation i.
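As a minimal sketch (not taken from the paper), the CRP prior of Eq. (4) can be evaluated as follows: occupied clusters are weighted by their sizes \(N_{k}^{-i}\) and the new cluster \(\bar{k}\) by \(\alpha \).

```python
import numpy as np

def crp_prior(z_minus_i, alpha):
    """CRP prior of Eq. (4): probabilities of the K occupied clusters plus a new one."""
    z_minus_i = np.asarray(z_minus_i)
    clusters, counts = np.unique(z_minus_i, return_counts=True)   # labels and N_k^{-i}
    weights = np.append(counts.astype(float), alpha)              # last entry: new cluster k-bar
    return clusters, weights / (alpha + len(z_minus_i))           # normalizer 1 / (alpha + N - 1)
```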

Algorithm 1

For the K clusters to which \(z_{-i}\) assigns observations, the likelihood of Eq. (3) is shown as follows:

$$\begin{aligned} p\left( X_{i} \vert z_{i}=k,z_{-i},X_{-i},\lambda \right) =p\left( X_{i}\vert \left\{ X_{j} \vert { z}_{j}=k,j\ne i \right\} ,\lambda \right) \end{aligned}$$
(5)

Because the Dirichlet distribution \(H(\mathrm {\lambda })\) is conjugate to the multinomial distribution \(F (\theta )\), with \(\theta = (p_{1}, p_{2}, \ldots , p_{d})\) and \(\{{X_{i}}\}^N_{i=1} \sim Mult\left( p_{1}, p_{2}, \ldots , p_{d}\right) \), we obtain a closed-form predictive likelihood expression for each cluster, or appearance, k as follows:

$$\begin{aligned} p\left( {X}_{i} \vert z_{i}=k,z_{-i},X_{-i},\lambda \right) = \frac{\mathrm {\Gamma }\left( n+1 \right) }{\prod \limits _{j=1}^d {\mathrm {\Gamma }\left( X_{i}^{\left( j \right) }+1 \right) } }\frac{\mathrm {\Gamma }\left( \sum \nolimits _{j=1}^d \lambda _{j}^{'} \right) }{\prod \limits _{j=1}^d {\mathrm {\Gamma }\left( \lambda _{j}^{'} \right) } }\frac{\prod \limits _{j=1}^d {\mathrm {\Gamma }\left( X_{i}^{\left( j \right) }+ \lambda _{j}^{'} \right) } }{\mathrm {\Gamma }\left( n+ \sum \nolimits _{j=1}^d \lambda _{j}^{'} \right) } \end{aligned}$$
(6)

where \(\lambda ^{'}\) is the posterior of \(\lambda \) (i.e. \(\lambda \) updated with the summed counts of the observations currently assigned to cluster k), \(n = \sum \nolimits _{j=1}^d X_{i}^{(j)}\), and \(\Gamma \) is the gamma function. Similarly, new clusters \(\bar{k}\) are scored by the predictive likelihood implied by the prior hyperparameters \(\lambda \):

$$\begin{aligned} p\left( X_{i} \vert z_{i}=\bar{k},z_{-i},X_{-i},\lambda \right) =p\left( X_{i} \vert \lambda \right) \mathrm {= }\frac{\mathrm {\Gamma }\left( A \right) }{\mathrm {\Gamma }\left( n+A \right) }\prod \limits _{j=1}^d \frac{\mathrm {\Gamma }\left( X_{i}^{\left( j \right) }+\lambda _{j} \right) }{\mathrm {\Gamma }\left( \lambda _{j} \right) } \end{aligned}$$
(7)

where \(A = \sum \nolimits _{j} \lambda _{j}\) and, as above, \(n = \sum \nolimits _{j} X_{i}^{(j)}\) is the total count of observation \(X_{i}\). Combining these expressions, we obtain the collapsed Gibbs sampler of Algorithm 1.
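Algorithm 1 itself appears only as a figure in the original; the following is a minimal sketch of how one collapsed Gibbs sweep could combine the CRP prior (Eq. 4) with the Dirichlet-multinomial predictive likelihood (Eqs. 6 and 7), under the assumptions that X is an (N, d) integer count matrix, z an integer label vector, and lam the d-dimensional Dirichlet hyperparameter.

```python
import numpy as np
from scipy.special import gammaln

def log_predictive(x, cluster_sum, lam):
    """Log Dirichlet-multinomial predictive likelihood of Eqs. (6)-(7).

    x           : d-dim integer count vector for observation X_i
    cluster_sum : summed counts of the observations already in cluster k
                  (zeros for an empty / new cluster k-bar, giving Eq. (7))
    lam         : d-dim Dirichlet hyperparameter lambda
    """
    lam_post = lam + cluster_sum                        # posterior lambda'
    n = x.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()       # multinomial coefficient
            + gammaln(lam_post.sum()) - gammaln(lam_post).sum()
            + gammaln(x + lam_post).sum() - gammaln(n + lam_post.sum()))

def gibbs_sweep(X, z, alpha, lam):
    """One sweep of the collapsed Gibbs sampler: resample each z_i from Eq. (3)."""
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        z_i = z[mask]
        clusters = np.unique(z_i)
        log_p = []
        for k in clusters:                               # existing clusters: N_k^{-i} * predictive
            members = X[mask][z_i == k]
            log_p.append(np.log(len(members)) + log_predictive(X[i], members.sum(axis=0), lam))
        # new cluster k-bar: alpha * prior predictive (Eq. 7)
        log_p.append(np.log(alpha) + log_predictive(X[i], np.zeros_like(lam), lam))
        log_p = np.asarray(log_p)
        p = np.exp(log_p - log_p.max())
        p /= p.sum()
        choice = np.random.choice(len(p), p=p)
        z[i] = clusters[choice] if choice < len(clusters) else z.max() + 1
    return z
```

Starting from a single cluster (z = np.zeros(len(X), dtype=int)), the sweep is repeated until the assignments stabilize; the number of distinct labels in z is then the number of appearance models.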

Algorithm 2

3.3 AMAM Tracking

Given the observation set of the target \(X_{1:t} = [X_{1},\ldots ,X_{t}]\) up to time t, where each \(X_{t}\) represents a quantized HOG feature, the tracking result \(s_{t}\) can be determined by the Maximum A Posteriori (MAP) estimate, \(\hat{s_{t}} = \arg \max _{s_{t}} p \left( s_{t}|X_{1:t}\right) \), where \(p\left( s_{t}|X_{1:t}\right) \) is inferred recursively via Bayes' theorem as

$$\begin{aligned} p\left( s_{t} \vert X_{1:t} \right) \propto p\left( X_{t} \vert s_{t} \right) \int {p\left( s_{t} \vert s_{t-1} \right) p\left( s_{t-1} \vert X_{1:t-1} \right) ds_{t-1}} . \end{aligned}$$
(8)

Let \(s_{t} = [l_{x},l_{y},\theta ,s,\alpha ,\phi ]\), where \(l_{x},l_{y},\theta ,s,\alpha ,\phi \) denote the x, y translations, rotation angle, scale, aspect ratio, and skew, respectively. We apply the affine transformation with these six parameters to model the target motion between two consecutive frames. The state transition is formulated as \(p\left( s_{t}|s_{t-1}\right) = N\left( s_{t};s_{t-1},\Sigma \right) \), where \(\Sigma \) is the covariance matrix of the six affine parameters.
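As a sketch of this transition model (assuming a diagonal \(\Sigma \) and a particle-filter implementation, neither of which is spelled out in the text), candidate states for frame t can be drawn as follows:

```python
import numpy as np

def propagate_particles(particles, sigma):
    """Sample s_t ~ N(s_{t-1}, Sigma) for each particle.

    particles : (M, 6) array of states [lx, ly, theta, s, alpha, phi]
    sigma     : length-6 standard deviations (diagonal covariance assumed)
    """
    noise = np.random.randn(*particles.shape) * np.asarray(sigma)
    return particles + noise
```

Illustrative values such as sigma = [4, 4, 0.01, 0.01, 0.002, 0.001] (pixels for the translations, small perturbations for the other parameters) are assumptions, not values reported by the authors.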

The observation model \(p\left( X_{t}|s_{t}\right) \) denotes the likelihood of the observation \(X_{t}\) at state \(s_{t}\). The Noisy-OR (NOR) model is adopted to combine the appearance models:

$$\begin{aligned} p\left( X_{t} \vert s_{t} \right) =1- \prod \nolimits _k \left( 1-p\left( X_{t} \vert s_{t},H^{k} \right) \right) \end{aligned}$$
(9)

where \(H^{k}, k\in \left\{ 1,2,\ldots ,K\right\} \), represents the multiple appearance models learned by Algorithm 1.

The equation above has the desired property that if one of the appearance models has a high probability, the resulting probability will be high as well.
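A minimal sketch of Eq. (9) follows; how the per-model likelihoods \(p(X_{t}|s_{t},H^{k})\) are themselves computed is not detailed here, so they are passed in as given probabilities.

```python
import numpy as np

def noisy_or(per_model_likelihoods):
    """Noisy-OR fusion of Eq. (9): combine p(X_t | s_t, H^k) over the K appearance models."""
    p = np.asarray(per_model_likelihoods, dtype=float)
    return 1.0 - np.prod(1.0 - p)
```

For example, noisy_or([0.1, 0.9, 0.2]) is about 0.93: a single well-matching appearance model is enough to give the candidate a high overall likelihood, which is exactly the property described above.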

Algorithm 2 illustrates the basic flow of our algorithm.

4 Experiments

In our experiments, we employ 10 challenging public tracking datasets selected from [2] and use the same evaluation methods, center location error and success rate, to verify the performance of our algorithm. The proposed approach is compared with ten state-of-the-art tracking methods; Table 1 lists all the trackers we evaluate. We evaluate the proposed tracker against those methods using the source codes provided by the authors, and each tracker is run with adjusted parameters for a fair evaluation.

Table 1. Compared trackers and their representations in our experiment [2]

4.1 The AMAM Modeling

Figure 2 shows how the AMAM works. The small face images under the main frame show the appearance instances belonging to each appearance model and the historical instances collected during tracking. The red rectangle in the main frame is the tracking result based on the model highlighted in red, and the green one is the ground truth. The instances of each appearance model accumulate during long-term tracking, and the number of appearance models grows as the intra- and inter-cluster distances change, following the DPMM clustering of Algorithm 1.

Fig. 3. Three tracking video sequences with the tracking results from all trackers. The bounding boxes in red are our results.

4.2 Tracking System

Figure 3 shows the AMAM tracking results produced by Algorithm 2; bounding boxes in red are our results. We can see that our tracker follows the target well, while most of the other trackers drift.

In order to measure the tracking performance, we employ two traditional metrics: the center error (CE) and the coverage rate (CR).

Fig. 4. Tracking results compared with the 10 recent trackers listed in Table 1, measured by coverage rate (a) and center error (b) on the public datasets.

The center error is defined as the average Euclidean distance between the center locations of the tracked targets and the manually labeled ground truths, and it is used to compute the precision plot. In general, the overall performance on one sequence is summarized by the average center location error over all frames; however, when the tracker loses the target, its output location becomes essentially random, and the average error may no longer measure the tracking performance correctly [5]. Therefore, we use the precision plot to measure the overall tracking performance: it shows the percentage of frames whose estimated location is within a given threshold distance of the ground truth. Figure 4(b) shows the results in our experiment. Since smaller center error is better, our AMAM tracker performs well on these public testing videos.
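A minimal sketch of the precision plot computation follows; the threshold range is an assumption (the benchmark in [2] sweeps thresholds over a few tens of pixels).

```python
import numpy as np

def precision_plot(pred_centers, gt_centers, thresholds=np.arange(0, 51)):
    """Fraction of frames whose center location error is within each threshold (in pixels)."""
    errors = np.linalg.norm(np.asarray(pred_centers, float) - np.asarray(gt_centers, float), axis=1)
    return np.array([(errors <= t).mean() for t in thresholds])
```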

The coverage rate is defined as the bounding-box overlap rate between the tracking result and the ground truth; a higher score means the tracking result is closer to the ground truth. The per-frame score is \(score = \frac{area(ROI_{T}\cap ROI_{G})}{area(ROI_{T}\cup ROI_{G})}\), and the average score over a sequence is \(avrScore = \frac{\sum \limits ^{frameLength}_{1}score}{frameLength}\).
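These two formulas translate directly into code; the sketch below assumes axis-aligned (x, y, w, h) boxes, although the tracker itself estimates a full affine state.

```python
def coverage_rate(box_t, box_g):
    """Per-frame score: area(ROI_T intersect ROI_G) / area(ROI_T union ROI_G)."""
    xt, yt, wt, ht = box_t
    xg, yg, wg, hg = box_g
    iw = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))
    ih = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))
    inter = iw * ih
    union = wt * ht + wg * hg - inter
    return inter / union if union > 0 else 0.0

def average_coverage_rate(boxes_t, boxes_g):
    """avrScore: mean per-frame score over all frames of a sequence."""
    scores = [coverage_rate(t, g) for t, g in zip(boxes_t, boxes_g)]
    return sum(scores) / len(scores)
```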

In Table 2, we compare the performance of the trackers on each testing dataset, using the same testing results shown in Fig. 4. We select the best-performing tracker and the second-best tracker on each testing dataset based on both the CR and CE metrics, and we also calculate the differences between them to measure tracking accuracy. In Table 3, we further use the variance and the average CR to measure robustness and accuracy. Since the illuminations, backgrounds, and targets differ widely across sequences, a tracker that performs stably with low variance can be considered robust.

Table 2. Comparison of all trackers on all datasets by coverage rate (CR) and center error (CE).
Table 3. Comparison of trackers by variance and average coverage rate (ACR).

From Table 2, we find that our AMAM tracker performs best on 8 of the testing videos. On the remaining 2 testing videos, the differences between the best tracker and ours are less than 1.4% in CR and 1 pixel in CE.

From Table 3, the variance of our tracker over all videos is 0.002, far lower than that of the others both on average and individually. The ACR of our AMAM tracker over all testing videos is 19% higher than that of the others. This means our tracker is more robust and more accurate than the others.

5 Conclusion

This paper tackled the drifting problem in tracking and proposed an Adaptive Multiple Appearance Model framework for long-term robust tracking. We simply employed HOG features to build the basic appearance representation of the tracking target in our experiments, but any efficient representation of visual objects could be plugged into our algorithmic framework. Historical appearance descriptions are grouped in an unsupervised manner and modeled automatically by the Dirichlet Process Mixture Model, and the tracking result is selected from candidate targets predicted by trackers based on those appearance models, using voting and a confidence map. Experiments on several public datasets show that our tracker has low variance (less than 0.002) and high tracking performance (19% better than the other 10 trackers on average) when compared with state-of-the-art methods.