
1 Introduction

Robust long-term object tracking is a challenging task in computer vision. Many trackers have been proposed by different researchers [13] that employ different types of visual information and learned features to build the appearance models underlying the tracking, e.g., color histograms in Meanshift, multiple features in particle filters [4], Haar-like features in MIL [5], etc.

However, a single appearance model, even an online-updated model or a patch dictionary model, is not enough to represent the tracking target when it undergoes internal and external variations. Internal variations include pose changes, motion, shape deformation, illumination variation, etc.; external variations include background changes and occlusion by foreground objects. To tackle this problem, a set of appearance models is needed to describe the historical appearances of the tracking target.

Therefore, we propose a novel nonparametric statistical method that models the appearance of the target as a combination of multiple appearance models. Each model describes a typical appearance under a specific situation, and the models are clustered dynamically and without supervision within the Dirichlet Process Mixture Model (DPMM) [7] framework.

Fig. 1. The framework of the adaptive multiple appearance model tracking.

The framework of our system is shown in Fig. 1. Experimental results on several public datasets show that the AMAM tracking system is applicable to multi-camera systems as well as indoor and outdoor tracking scenarios, and that it outperforms several state-of-the-art trackers.

The rest of the paper is organized as follows. Section 2 overviews some of the related works. Section 3 describes the proposed AMAM algorithm. We present experimental validation on several public datasets in Sect. 4 and conclude the paper in Sect. 5.

2 Related Works

In the long-term tracking task, the biggest challenge is the drifting problem. To tackle it, the tolerance range of the appearance models needs to be enhanced. Ensemble tracking [9] and the Multiple Instance Learning boosting method (MIL) [5] use positive and negative samples of the tracking target to train classifiers. Semi-online boosting [10] uses both unlabeled and labeled candidate targets to train classifiers online. Fragment-based tracking [16], coupled with a voting map, can accurately track a partially occluded target. However, historical information is ignored when these classifiers or models are updated. Dictionary learning [15] represents the dynamic appearance of the target as a linear combination of dictionary elements and handles occlusion as a sparse noise component; however, spatial and temporal information are lost in the process. In our model, we build an appearance model set that keeps the spatial information of the tracking target, while the tracking system keeps the temporal information as well. At the same time, any efficient appearance model can be employed in this framework, including sparse coding, dictionary learning, and target descriptions learned by deep learning or other machine learning methods.

Fig. 2. The working procedure of the AMAM framework.

3 The Framework of Adaptive Multiple Appearance Model Tracking

In this section, we describe the common framework of the adaptive multiple appearance model. First, we present the Dirichlet Process Mixture Model (DPMM), which is employed to organize the adaptive appearance set. After that, we describe the tracking system based on the AMAM framework.

3.1 Dirichlet Process Mixture Model

The Dirichlet process (DP) is parameterized by a base distribution H, which has corresponding density \(h (\mathrm {\theta })\), and a positive scaling parameter \(\mathrm {\alpha } > 0\). Suppose we draw a random measure G from a DP and then independently draw N random variables \(\mathrm {\theta }_{n}\) from G; this can be described as follows:

$$\begin{aligned} G \sim DP\left( \alpha , H \right) , \qquad \theta _{n} \,\vert \, G \sim G, \quad n=1,\ldots ,N \end{aligned}$$
(1)

As shown by [8], given N independent observations \(\mathrm {\theta }_{i} \sim G\), the posterior distribution of G also follows a DP:

$$\begin{aligned} G \,\vert \, \theta _{1},\ldots ,\theta _{N} \sim DP\left( \alpha +N,\frac{1}{\alpha +N}\left( \alpha H+\sum \nolimits _{i=1}^N \delta _{\theta _{i}} \right) \right) \end{aligned}$$
(2)

where \(n_{1}, n_{2}, \ldots , n_{r}\) represent the numbers of observations falling in each of the partitions \(A_{1}, A_{2}, \ldots , A_{r}\) respectively, N is the total number of observations, and \(\mathrm {\delta _{\theta _{i}} }\) represents the delta function at the sample point \(\mathrm {\theta }_{i}\).
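In practice, drawing a new \(\theta \) from the posterior base measure in Eq. (2) amounts to the Polya urn (Chinese restaurant) scheme: with probability \(\alpha /(\alpha +N)\) a fresh value is drawn from H, otherwise one of the N observed values is reused. The following is only a minimal sketch of that draw; the function and the placeholder base-distribution sampler sample_from_H are hypothetical and serve to illustrate the mechanism.

```python
import random

def draw_theta_posterior(thetas, alpha, sample_from_H):
    """Draw one theta from the posterior DP base measure of Eq. (2)."""
    N = len(thetas)
    if random.random() < alpha / (alpha + N):
        return sample_from_H()      # fresh draw from the base distribution H
    return random.choice(thetas)    # reuse an observed value (a delta mass in Eq. (2))
```

For example, with a Gaussian base distribution one could call draw_theta_posterior(thetas, alpha=1.0, sample_from_H=lambda: random.gauss(0, 1)).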

3.2 Model Inference

Given N observations \(X= \{{X_{i}}\}^N_{i=1}\) \((X_{i} \in \mathbb {N}^d)\), each \(X_{i}=\left\{ {x_{j}}\right\} ^d_{j=1}\) represents a quantized d-dimensional HOG feature, where \(x_{j}\) is the quantized count of the \(j^{th}\) histogram bin, an integer. Let \(z_{i}\) indicate the cluster, or appearance model, associated with the \(i^{th}\) observation, which is represented by its quantized HOG feature. As shown in Fig. 1, we would like to infer the number of latent clusters, or distinct appearances, underlying those observations, together with their parameters \(\mathrm {\theta }_{k}\). Since exact computation of the posterior is infeasible, especially when the data size is large, we resort to a variant of MCMC, namely the collapsed Gibbs sampler [7], for faster approximate inference.
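The paper does not specify how the HOG descriptor is quantized into integer counts, so the following is only a plausible sketch: it computes a standard HOG descriptor (here via skimage, an assumption) and rounds its normalized entries onto a fixed number of levels; the levels parameter and the normalization scheme are likewise assumptions.

```python
import numpy as np
from skimage.feature import hog  # any HOG implementation would do; skimage is an assumption

def quantized_hog(patch, levels=32):
    """Turn a grayscale image patch into a d-dimensional integer count vector X_i."""
    descriptor = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                     cells_per_block=(2, 2), feature_vector=True)
    descriptor = descriptor / (descriptor.sum() + 1e-12)     # normalize to unit mass
    return np.round(descriptor * levels).astype(np.int64)    # integer bin counts in N^d
```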

We choose a multinomial distribution \(F(\mathrm {\theta })\) to describe the HOG features of the observations, and the cluster prior \(H(\mathrm {\lambda })\) is a Dirichlet distribution, which is conjugate to \(F(\mathrm {\theta })\). Given fixed cluster assignments \(z_{-i}\) for the other observations, the posterior distribution of \(z_{i}\) factors as follows:

$$\begin{aligned} p\left( z_{i} \vert z_{-i},X,\alpha ,\lambda \right) \mathrm { \propto }p\left( z_{i} \vert z_{-i},\alpha \right) p\left( X_{i} \vert X_{-i},z,\lambda \right) \end{aligned}$$
(3)

The prior \(p\left( z_{i} \vert z_{-i},\alpha \right) \) is given by the Chinese restaurant process (CRP).

$$\begin{aligned} p\left( z_{i} \vert z_{-i},\alpha \right) \mathrm { \sim }\frac{1}{\alpha +N-1}\left( \sum \limits _{k=1}^K {N_{k}^{-i}\delta \left( z_{i} ,k \right) } + \alpha \delta \left( z_{i} ,\bar{k} \right) \right) \end{aligned}$$
(4)

Here \(\bar{k}\) denotes one of the infinitely many unoccupied clusters, i.e. a new appearance, and \(N_{k}^{-i}\) is the total number of observations in cluster k, excluding observation i.
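As a minimal sketch (not taken from the paper), the CRP prior of Eq. (4) can be evaluated as follows: occupied clusters are weighted by their sizes \(N_{k}^{-i}\) and the new cluster \(\bar{k}\) by \(\alpha \).

```python
import numpy as np

def crp_prior(z_minus_i, alpha):
    """CRP prior of Eq. (4): probabilities of the K occupied clusters plus a new one."""
    z_minus_i = np.asarray(z_minus_i)
    clusters, counts = np.unique(z_minus_i, return_counts=True)   # labels and N_k^{-i}
    weights = np.append(counts.astype(float), alpha)              # last entry: new cluster k-bar
    return clusters, weights / (alpha + len(z_minus_i))           # normalizer 1 / (alpha + N - 1)
```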

Algorithm 1

For the K clusters to which \(z_{-i}\) assigns observations, the likelihood of Eq. (3) is shown as follows:

$$\begin{aligned} p\left( X_{i} \vert z_{i}=k,z_{-i},X_{-i},\lambda \right) =p\left( X_{i}\vert \left\{ X_{j} \vert { z}_{j}=k,j\ne i \right\} ,\lambda \right) \end{aligned}$$
(5)

Because the Dirichlet distribution \(H(\mathrm {\lambda })\) is conjugate to the multinomial distribution \(F (\theta )\), with \(\theta = (p_{1}, p_{2}, \ldots , p_{d})\) and \(\{{X_{i}}\}^N_{i=1} \sim Mult\left( p_{1}, p_{2}, \ldots , p_{d}\right) \), we obtain a closed-form predictive likelihood expression for each cluster, or appearance, k as follows:

$$\begin{aligned} p\left( {X}_{i} \vert z_{i}=k,z_{-i},X_{-i},\lambda \right) = \frac{\mathrm {\Gamma }\left( n+1 \right) }{\prod \limits _{j=1}^d {\mathrm {\Gamma }\left( X_{i}^{\left( j \right) }+1 \right) } }\frac{\mathrm {\Gamma }\left( \sum \nolimits _{j=1}^d \lambda _{j}^{'} \right) }{\prod \limits _{j=1}^d {\mathrm {\Gamma }\left( \lambda _{j}^{'} \right) } }\frac{\prod \limits _{j=1}^d {\mathrm {\Gamma }\left( X_{i}^{\left( j \right) }+ \lambda _{j}^{'} \right) } }{\mathrm {\Gamma }\left( n+ \sum \nolimits _{j=1}^d \lambda _{j}^{'} \right) } \end{aligned}$$
(6)

where \(\lambda ^{'}\) is the posterior of \(\lambda \) (i.e. \(\lambda \) updated with the summed counts of the observations currently assigned to cluster k), \(n = \sum \nolimits _{j=1}^d X_{i}^{(j)}\), and \(\Gamma \) is the gamma function. Similarly, new clusters \(\bar{k}\) are scored by the predictive likelihood implied by the prior hyperparameters \(\lambda \):

$$\begin{aligned} p\left( X_{i} \vert z_{i}=\bar{k},z_{-i},X_{-i},\lambda \right) =p\left( X_{i} \vert \lambda \right) \mathrm {= }\frac{\mathrm {\Gamma }\left( A \right) }{\mathrm {\Gamma }\left( n+A \right) }\prod \limits _{j=1}^d \frac{\mathrm {\Gamma }\left( X_{i}^{\left( j \right) }+\lambda _{j} \right) }{\mathrm {\Gamma }\left( \lambda _{j} \right) } \end{aligned}$$
(7)

where \(A = \sum \nolimits _{j} \lambda _{j}\) and, as above, \(n = \sum \nolimits _{j} X_{i}^{(j)}\) is the total count of observation \(X_{i}\). Combining these expressions, we obtain the collapsed Gibbs sampler of Algorithm 1.
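Algorithm 1 itself appears only as a figure in the original; the following is a minimal sketch of how one collapsed Gibbs sweep could combine the CRP prior (Eq. 4) with the Dirichlet-multinomial predictive likelihood (Eqs. 6 and 7), under the assumptions that X is an (N, d) integer count matrix, z an integer label vector, and lam the d-dimensional Dirichlet hyperparameter.

```python
import numpy as np
from scipy.special import gammaln

def log_predictive(x, cluster_sum, lam):
    """Log Dirichlet-multinomial predictive likelihood of Eqs. (6)-(7).

    x           : d-dim integer count vector for observation X_i
    cluster_sum : summed counts of the observations already in cluster k
                  (zeros for an empty / new cluster k-bar, giving Eq. (7))
    lam         : d-dim Dirichlet hyperparameter lambda
    """
    lam_post = lam + cluster_sum                        # posterior lambda'
    n = x.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()       # multinomial coefficient
            + gammaln(lam_post.sum()) - gammaln(lam_post).sum()
            + gammaln(x + lam_post).sum() - gammaln(n + lam_post.sum()))

def gibbs_sweep(X, z, alpha, lam):
    """One sweep of the collapsed Gibbs sampler: resample each z_i from Eq. (3)."""
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        z_i = z[mask]
        clusters = np.unique(z_i)
        log_p = []
        for k in clusters:                               # existing clusters: N_k^{-i} * predictive
            members = X[mask][z_i == k]
            log_p.append(np.log(len(members)) + log_predictive(X[i], members.sum(axis=0), lam))
        # new cluster k-bar: alpha * prior predictive (Eq. 7)
        log_p.append(np.log(alpha) + log_predictive(X[i], np.zeros_like(lam), lam))
        log_p = np.asarray(log_p)
        p = np.exp(log_p - log_p.max())
        p /= p.sum()
        choice = np.random.choice(len(p), p=p)
        z[i] = clusters[choice] if choice < len(clusters) else z.max() + 1
    return z
```

Starting from a single cluster (z = np.zeros(len(X), dtype=int)), the sweep is repeated until the assignments stabilize; the number of distinct labels in z is then the number of appearance models.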

Algorithm 2

3.3 AMAM Tracking

Given the observation set of the target \(X_{1:t} = [X_{1},\ldots ,X_{t}]\) up to time t, where each \(X_{t}\) represents a quantized HOG feature, the tracking result \(s_{t}\) can be determined by the Maximum A Posteriori (MAP) estimate, \(\hat{s_{t}} = \arg \max _{s_{t}} p \left( s_{t}|X_{1:t}\right) \), where \(p\left( s_{t}|X_{1:t}\right) \) is inferred recursively via Bayes' theorem as

$$\begin{aligned} p\left( s_{t} \vert X_{1:t} \right) \propto p\left( X_{t} \vert s_{t} \right) \int {p\left( s_{t} \vert s_{t-1} \right) p\left( s_{t-1} \vert X_{1:t-1} \right) ds_{t-1}} . \end{aligned}$$
(8)

Let \(s_{t} = [l_{x},l_{y},\theta ,s,\alpha ,\phi ]\), where \(l_{x},l_{y},\theta ,s,\alpha ,\phi \) denote the x, y translations, rotation angle, scale, aspect ratio, and skew, respectively. We apply the affine transformation with these six parameters to model the target motion between two consecutive frames. The state transition is formulated as \(p\left( s_{t}|s_{t-1}\right) = N\left( s_{t};s_{t-1},\Sigma \right) \), where \(\Sigma \) is the covariance matrix of the six affine parameters.
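As a sketch of this transition model (assuming a diagonal \(\Sigma \) and a particle-filter implementation, neither of which is spelled out in the text), candidate states for frame t can be drawn as follows:

```python
import numpy as np

def propagate_particles(particles, sigma):
    """Sample s_t ~ N(s_{t-1}, Sigma) for each particle.

    particles : (M, 6) array of states [lx, ly, theta, s, alpha, phi]
    sigma     : length-6 standard deviations (diagonal covariance assumed)
    """
    noise = np.random.randn(*particles.shape) * np.asarray(sigma)
    return particles + noise
```

Illustrative values such as sigma = [4, 4, 0.01, 0.01, 0.002, 0.001] (pixels for the translations, small perturbations for the other parameters) are assumptions, not values reported by the authors.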

The observation model \(p\left( X_{t}|s_{t}\right) \) denotes the likelihood of the observation \(X_{t}\) at state \(s_{t}\). The Noisy-OR (NOR) model is adopted to combine the appearance models:

$$\begin{aligned} p\left( X_{t} \vert s_{t} \right) =1- \prod \nolimits _k \left( 1-p\left( X_{t} \vert s_{t},H^{k} \right) \right) \end{aligned}$$
(9)

where \(H^{k}, k\in \left\{ 1,2,\ldots ,K\right\} \), represents the multiple appearance models learned by Algorithm 1.

The equation above has the desired property that if one of the appearance models has a high probability, the resulting probability will be high as well.
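A minimal sketch of Eq. (9) follows; how the per-model likelihoods \(p(X_{t}|s_{t},H^{k})\) are themselves computed is not detailed here, so they are passed in as given probabilities.

```python
import numpy as np

def noisy_or(per_model_likelihoods):
    """Noisy-OR fusion of Eq. (9): combine p(X_t | s_t, H^k) over the K appearance models."""
    p = np.asarray(per_model_likelihoods, dtype=float)
    return 1.0 - np.prod(1.0 - p)
```

For example, noisy_or([0.1, 0.9, 0.2]) is about 0.93: a single well-matching appearance model is enough to give the candidate a high overall likelihood, which is exactly the property described above.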

Algorithm 2 illustrates the basic flow of our algorithm.

4 Experiments

In our experiments, we employ 10 challenging public tracking datasets selected from [2] and use the same evaluation methods, center location error and success rate, to verify the performance of our algorithm. The proposed approach is compared with ten state-of-the-art tracking methods; Table 1 lists all the trackers we evaluate. We evaluate the proposed tracker against those methods using the source codes provided by the authors, and each tracker is run with adjusted parameters for a fair evaluation.

Table 1. Compared trackers and their representations in our experiment [2]

4.1 The AMAM Modeling

Figure 2 shows how the AMAM works. The small face images under the main frame show the appearance instances belonging to each appearance model and the historical instances collected during tracking. The red rectangle in the main frame is the tracking result based on the model highlighted in red, and the green one is the ground truth. The instances of each appearance model accumulate during long-term tracking, and the number of appearance models grows as the intra- and inter-cluster distances change, following the DPMM clustering of Algorithm 1.

Fig. 3. Three tracking video sequences with the tracking results from all trackers. The bounding boxes in red are our results.

4.2 Tracking System

Figure 3 shows the AMAM tracking results produced by Algorithm 2; bounding boxes in red are our results. We can see that our tracker follows the target well, while most of the other trackers drift.

In order to measure the tracking performance, we employ two traditional metrics: the center error (CE) and the coverage rate (CR).

Fig. 4. Tracking results compared with the 10 recent trackers listed in Table 1, measured by coverage rate (a) and center error (b) on the public datasets.

The center error is defined as the average Euclidean distance between the center locations of the tracked targets and the manually labeled ground truths, and it is used to compute the precision plot. In general, the overall performance on one sequence is summarized by the average center location error over all frames; however, when the tracker loses the target, its output location becomes essentially random, and the average error may no longer measure the tracking performance correctly [5]. Therefore, we use the precision plot to measure the overall tracking performance: it shows the percentage of frames whose estimated location is within a given threshold distance of the ground truth. Figure 4(b) shows the results in our experiment. Since smaller center error is better, our AMAM tracker performs well on these public testing videos.
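A minimal sketch of the precision plot computation follows; the threshold range is an assumption (the benchmark in [2] sweeps thresholds over a few tens of pixels).

```python
import numpy as np

def precision_plot(pred_centers, gt_centers, thresholds=np.arange(0, 51)):
    """Fraction of frames whose center location error is within each threshold (in pixels)."""
    errors = np.linalg.norm(np.asarray(pred_centers, float) - np.asarray(gt_centers, float), axis=1)
    return np.array([(errors <= t).mean() for t in thresholds])
```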

The coverage rate is defined as the bounding-box overlap rate between the tracking result and the ground truth; a higher score means the tracking result is closer to the ground truth. The per-frame score is \(score = \frac{area(ROI_{T}\cap ROI_{G})}{area(ROI_{T}\cup ROI_{G})}\), and the average score over a sequence is \(avrScore = \frac{\sum \limits ^{frameLength}_{1}score}{frameLength}\).
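These two formulas translate directly into code; the sketch below assumes axis-aligned (x, y, w, h) boxes, although the tracker itself estimates a full affine state.

```python
def coverage_rate(box_t, box_g):
    """Per-frame score: area(ROI_T intersect ROI_G) / area(ROI_T union ROI_G)."""
    xt, yt, wt, ht = box_t
    xg, yg, wg, hg = box_g
    iw = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))
    ih = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))
    inter = iw * ih
    union = wt * ht + wg * hg - inter
    return inter / union if union > 0 else 0.0

def average_coverage_rate(boxes_t, boxes_g):
    """avrScore: mean per-frame score over all frames of a sequence."""
    scores = [coverage_rate(t, g) for t, g in zip(boxes_t, boxes_g)]
    return sum(scores) / len(scores)
```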

In Table 2, we compare the performance of the trackers on each testing dataset, using the same testing results shown in Fig. 4. We select the best-performing tracker and the second-best tracker on each testing dataset based on both the CR and CE metrics, and we also calculate the differences between them to measure tracking accuracy. In Table 3, we further use the variance and the average CR to measure robustness and accuracy. Since the illuminations, backgrounds, and targets differ widely across sequences, a tracker that performs stably with low variance can be considered robust.

Table 2. Comparison of all trackers on all datasets by coverage rate (CR) and center error (CE).
Table 3. Comparison of trackers by variance and average coverage rate (ACR).

From Table 2, we find that our AMAM tracker performs best on 8 of the testing videos. On the remaining 2 testing videos, the differences between the best tracker and ours are less than 1.4% in CR and 1 pixel in CE.

From Table 3, the variance of our tracker over all videos is 0.002, far lower than that of the others both on average and individually. The ACR of our AMAM tracker over all testing videos is 19% higher than that of the others. This means our tracker is more robust and more accurate than the others.

5 Conclusion

This paper tackled the drifting problem in tracking and proposed an Adaptive Multiple Appearance Model framework for long-term robust tracking. We simply employed HOG features to build the basic appearance representation of the tracking target in our experiments, but any efficient representation of visual objects could be plugged into our algorithmic framework. Historical appearance descriptions are grouped in an unsupervised manner and modeled automatically by the Dirichlet Process Mixture Model, and the tracking result is selected from candidate targets predicted by trackers based on those appearance models, using voting and a confidence map. Experiments on several public datasets show that our tracker has low variance (less than 0.002) and high tracking performance (19% better than the other 10 trackers on average) when compared with state-of-the-art methods.