Abstract
Aiming at varied tracking environments, a weighted joint sparse representation based tracker is proposed. Specifically, each object template is weighted according to its similarity to each candidate. All candidates are then represented sparsely and jointly, and the sparse coefficients are used to compute the observation probabilities of the candidates. The candidate with the maximum observation probability is selected as the object. The objective function is solved by a modified accelerated proximal gradient (APG) algorithm. Experiments on several representative image sequences show that the proposed tracker outperforms competing trackers under illumination variation, occlusion, pose change and rotation.
1 Introduction
Visual tracking is a hot topic in computer vision and is widely applied in fields such as intelligent monitoring, car navigation, and advanced human-computer interaction. However, due to various internal and external factors, such as noise, occlusion, illumination and viewpoint variations, pose change and rotation, robust visual tracking remains a challenging task. Existing visual tracking methods can be categorized as generative or discriminative. Generative methods treat visual tracking as a matching problem. For example, both the Eigentracker [1] and the mean-shift tracker [2], proposed by Black and Comaniciu respectively, achieve real-time performance but lack the necessary updating of target templates. IVT [3], proposed by Ross, adapts to appearance change by incrementally learning a low-dimensional subspace. VTD [4], proposed by Kwon, extends the traditional particle filter tracker with multiple motion models and multiple appearance models. Discriminative methods regard tracking as a binary classification problem and learn and update a classifier from positive and negative samples. Representative methods include the following: Avidan [5] proposed an SVM-based tracker. Collins et al. [6] discriminate the target from the background by online feature selection. Grabner et al. [7] proposed an online semi-supervised boosting method to avoid drifting away from the target and improve tracking performance. Babenko et al. [8] avoid bias and drift by multiple-instance learning. Zhang et al. [9] formulate tracking as a binary classification problem in a compressed domain and obtain real-time performance.
Recently, sparse representation based visual tracking methods have become very popular. For instance, in [10–12], each candidate is represented sparsely as a linear combination of templates, with excellent tracking results. Because of the small sampling radius, candidates are similar to each other. The method in [13] exploits this similarity among candidates through joint sparse representation, improving the representation accuracy of the candidates and the robustness of the whole tracking system. However, the aforementioned sparse representation based trackers assume that different templates are equally likely to represent each candidate. According to [14], a template that is more similar to a candidate should have a larger probability of representing it.
In this paper, a weighted joint sparse representation based visual tracking method is proposed to robustly track the target in complex environments. Specifically, to represent a candidate, all templates are weighted according to their similarity to that candidate. Then all candidates are represented by joint sparse representation to reflect the similarity among them. The sparse coefficients are used to compute the observation probability of each candidate, and the candidate with the maximal observation probability is selected as the target. In this model, a template similar to a candidate has a larger possibility of representing it. To adapt to illumination change, a locally normalized feature, which is illumination invariant, is used for appearance representation. Experiments show that the proposed tracker adapts well to illumination variation, occlusion, pose change, rotation, and combinations of these conditions.
2 Weighted Joint Sparse Representation Based Tracker
2.1 Weighted Joint Sparse Representation
Let \( \text{T} = \left\{ {\text{T}_{1} ,\text{T}_{2} , \ldots ,\text{T}_{M} } \right\} \) and \( \left\{ {y_{1} ,y_{2} , \ldots ,y_{N} } \right\} \) denote the target template set and the candidates of the current frame, respectively. To locate the target accurately, Zhang et al. proposed a multi-task learning based tracking method, which takes the similarity among candidates into account and represents all candidates by solving the \( \ell_{2,1} \) mixed-norm regularized minimization problem [13]

\( \mathop {\hbox{min} }\limits_{c} \left\| {Y - \text{T}c} \right\|_{F}^{2} + \lambda \left\| c \right\|_{2,1} , \)  (1)

where \( Y = \left[ {y_{1} ,y_{2} , \ldots ,y_{N} } \right] \), \( \lambda \) is a regularization parameter, and \( c = \left[ {c_{1} ,c_{2} , \ldots ,c_{N} } \right] \) is the sparse representation coefficient matrix. Specifically, the \( i \)th column \( c_{i} \) of \( c \) holds the representation coefficients of the candidate \( y_{i} \), the \( j \)th row \( c^{j} \) holds the coefficients of all candidates corresponding to the target template \( \text{T}_{j} \), and \( c_{i}^{j} \) is the \( j \)th representation coefficient of the candidate \( y_{i} \), i.e., the coefficient on template \( \text{T}_{j} \). The regularization term is the \( \ell_{2,1} \) mixed norm, which induces a joint sparse representation of the candidates. By mining the similarity among candidates, [13] improves the representation accuracy and tracking robustness compared with other sparse representation based trackers [10–12]. However, all the aforementioned trackers [10–13] assume that every target template is equally likely to represent a candidate. According to [14], a target template more similar to a candidate should be more likely to be chosen to represent it. Therefore, in this paper, before representing a candidate, all target templates are weighted using their distance to the candidate; then, considering the similarity among candidates, all candidates are represented jointly.
Compared with (1), a weight matrix \( W = \left[ {w_{1} ,w_{2} , \ldots ,w_{N} } \right] \) is introduced into the problem:

\( \mathop {\hbox{min} }\limits_{c} \left\| {Y - \text{T}c} \right\|_{F}^{2} + \lambda \left\| {W \odot c} \right\|_{2,1} , \)  (2)

where \( w_{i}^{j} \) is the weight of the target template \( \text{T}_{j} \) for representing the candidate \( y_{i} \), computed from the distance between them; the term \( \left\| {y_{i} - \text{T}_{j} } \right\|_{2} \) inside the weight is the Euclidean distance between \( \text{T}_{j} \) and \( y_{i} \).
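This weighting step can be sketched as follows. The exponential form of the weight is an assumption: the text only states that the weight is computed from the Euclidean distance, so that a less similar (more distant) template receives a larger penalty and hence a smaller coefficient.

```python
import numpy as np

def template_weights(Y, T):
    """Compute the weight of each target template for each candidate.

    Y: (d, N) candidate feature matrix; T: (d, M) template feature matrix.
    Returns an (M, N) array whose (j, i) entry is w_i^j, the weight of
    template T_j for candidate y_i. The exp(distance) form is an assumed
    choice; the paper specifies only that the weight grows with the
    Euclidean distance ||y_i - T_j||_2.
    """
    # dists[j, i] = ||y_i - T_j||_2
    dists = np.linalg.norm(T[:, :, None] - Y[:, None, :], axis=0)
    # larger distance -> larger weight -> heavier penalty -> smaller coefficient
    return np.exp(dists)
```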
The sparse representation coefficients \( \left\{ {c_{i} } \right\}_{i = 1}^{N} \) of the candidates \( \left\{ {y_{i} } \right\}_{i = 1}^{N} \) are obtained by solving (2) and are used to calculate the reconstruction errors

\( r_{i} = \left\| {y_{i} - \text{T}c_{i} } \right\|_{2}^{2} , \quad i = 1, \ldots ,N. \)

\( r_{i} \) is then used to calculate the observation probability of the candidate \( y_{i} \),

\( p_{i} = \frac{1}{\mu }\exp \left( { - r_{i} } \right), \)

where \( \mu \) is a normalization constant. Among the given candidates \( \left\{ {y_{1} ,y_{2} , \ldots ,y_{N} } \right\} \), the candidate with the maximum observation probability is determined as the target.
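The selection step can be sketched as follows, assuming the reconstruction-error and exponential-likelihood forms above; the function and variable names are illustrative, and the constant mu does not affect the argmax.

```python
import numpy as np

def select_target(Y, T, C, mu=1.0):
    """Pick the candidate with the maximum observation probability.

    Y: (d, N) candidates, T: (d, M) templates, C: (M, N) sparse coefficients
    obtained from the weighted joint sparse representation. The exponential
    likelihood with normalizer mu follows the text; only the argmax matters.
    """
    # r_i = ||y_i - T c_i||_2^2 for every candidate at once
    residuals = np.linalg.norm(Y - T @ C, axis=0) ** 2
    probs = np.exp(-residuals) / mu
    return int(np.argmax(probs)), probs
```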
2.2 Solving the Objective Function
The key to the aforementioned weighted joint sparse representation is solving (2); the corresponding objective function is denoted as

\( f\left( c \right) = \left\| {Y - \text{T}c} \right\|_{F}^{2} + \lambda \left\| {W \odot c} \right\|_{2,1} . \)  (6)

To solve (6), an auxiliary function is constructed,

\( g\left( c \right) = \left\| {Y - \text{T}\left( {W^{ - 1} \odot c} \right)} \right\|_{F}^{2} + \lambda \left\| c \right\|_{2,1} , \)

where \( W^{ - 1} = \left[ {\left( {w_{1} } \right)^{ - 1} , \ldots ,\left( {w_{N} } \right)^{ - 1} } \right] \) with \( \left( {w_{i} } \right)^{ - 1} = \left[ {\frac{1}{{w_{i}^{1} }};\frac{1}{{w_{i}^{2} }}; \ldots ;\frac{1}{{w_{i}^{p + n} }}} \right] \), and \( \odot \) is the element-wise multiplication operator, i.e., multiplication is conducted between corresponding elements of two matrices or two vectors. Like (1), \( g\left( c \right) \) is a multi-task joint sparse representation problem and can be solved by a modified APG algorithm [15]. Suppose the optimal solutions of \( g\left( c \right) \) and \( f\left( c \right) \) are \( \hat{c}^{\prime} = \arg \mathop {\hbox{min} }\limits_{c} g\left( c \right) \) and \( \hat{c} = \arg \mathop {\hbox{min} }\limits_{c} f\left( c \right) \), respectively; then, similar to [13], they satisfy the correspondence

\( \hat{c} = W^{ - 1} \odot \hat{c}^{\prime} . \)
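A generic APG solver for the \( \ell_{2,1} \)-regularized problem can be sketched as follows. The step size, iteration count, and convergence details are illustrative rather than the paper's settings; the weighted problem is then recovered by the element-wise rescaling by \( W^{-1} \) described above. The proximal operator of the \( \ell_{2,1} \) norm shrinks each coefficient row toward zero.

```python
import numpy as np

def row_shrink(C, thr):
    """Proximal operator of thr * ||C||_{2,1}: shrink each row's l2 norm."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thr / np.maximum(norms, 1e-12))
    return C * scale

def apg_l21(D, Y, lam, n_iter=100):
    """Accelerated proximal gradient for min_C ||Y - D C||_F^2 + lam ||C||_{2,1}.

    A sketch of the standard APG scheme the paper modifies; D plays the role
    of the template matrix T, and Y stacks the candidates column-wise.
    """
    M, N = D.shape[1], Y.shape[1]
    # Lipschitz constant of the gradient of the smooth term
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12
    C, Z, t = np.zeros((M, N)), np.zeros((M, N)), 1.0
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ Z - Y)
        C_new = row_shrink(Z - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Z = C_new + ((t - 1.0) / t_new) * (C_new - C)  # Nesterov momentum step
        C, t = C_new, t_new
    return C
```

With a very large regularization weight every row is shrunk to zero, while with a negligible weight the solver approaches the least-squares solution.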
2.3 Template Updating
A simple template updating strategy is adopted. (1) The target of the first frame is added to the target template set and is never updated. (2) At each frame, determine whether the target appearance has changed greatly. If the amount of change exceeds a given threshold, the target template set is updated; otherwise it is left unchanged. Specifically, if the similarity between the target \( y_{i} \) and the target template \( \text{T}_{j} \) is smaller than a given threshold \( \tau \), then \( y_{i} \) is used to replace the target template \( \text{T}_{k} \), where \( k \) indexes the least similar template.
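A minimal sketch of this updating strategy follows. Both the cosine similarity measure and the least-similar replacement rule are assumptions; the text speaks only of a "similarity degree" compared against the threshold \( \tau \).

```python
import numpy as np

def update_templates(templates, target, tau=0.4):
    """Simple template update; templates[0] is the first-frame target and is
    never replaced. Cosine similarity and least-similar replacement are
    assumed choices, not specified by the text.
    """
    sims = [float(target @ t / (np.linalg.norm(target) * np.linalg.norm(t) + 1e-12))
            for t in templates]
    k = 1 + int(np.argmin(sims[1:]))  # least similar template, excluding the first
    if sims[k] < tau:                 # appearance changed enough -> replace it
        templates[k] = target.copy()
    return templates
```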
2.4 Tracking Algorithm
In this paper, the locally normalized feature is adopted for appearance representation, as shown in Fig. 1. First, the candidate or target template is divided into several regions. Second, the vector representation of each region is obtained by concatenating its columns sequentially and normalizing the result to unit \( \ell_{2} \) length. Finally, the normalized vectors are concatenated into one longer vector as the feature vector of the candidate or target template.
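This feature extraction can be sketched as follows (the column-wise vectorization order follows the text; the 32 × 32 patch size and 4 × 4 grid follow the experimental settings in Sect. 3).

```python
import numpy as np

def locally_normalized_feature(patch, grid=(4, 4)):
    """Divide a patch into grid regions, l2-normalize each region's
    column-wise vectorization, and concatenate into one feature vector."""
    H, W = patch.shape
    gh, gw = H // grid[0], W // grid[1]
    feats = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            region = patch[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw]
            v = region.flatten(order='F')        # concatenate the columns
            v = v / (np.linalg.norm(v) + 1e-12)  # normalize to l2 unit length
            feats.append(v)
    return np.concatenate(feats)
```

Because each of the 16 region vectors has unit norm, a 32 × 32 patch yields a 1024-dimensional feature whose overall norm is 4, and per-region brightness scaling cancels out, which is the source of the illumination invariance.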
The detailed algorithm of this paper is shown in Algorithm 1.
3 Experiments and Analyses
We implement the proposed tracker on a PC with an AMD Sempron X2 198 CPU (2.5 GHz) and 3 GB memory, using MATLAB 2011b, and compare it with four other trackers (MIL [8], CT [9], L1APG [10] and MTT [12]) in three different scenarios to validate its performance. All four compared trackers provide source code. The parameters are set as follows: the number \( M \) of target templates is 10, and the number \( N \) of candidates is 600. Each candidate or target template is scaled to 32 × 32 pixels and divided into 4 × 4 regions to extract the locally normalized feature. The regularization parameter \( \lambda \) is set to 0.0002, the number of iterations in the optimization of the objective function to 5, and the template updating threshold \( \tau \) to 0.4.
Scene 1: We select three videos (Singer1, Sylvester2008b and Car4) to validate performance under illumination variation, as shown in Fig. 2. The proposed tracker tracks robustly under various illumination conditions, for two reasons. First, the proposed method accounts for the fact that a target template similar to a candidate has a greater possibility of representing it, and thus achieves higher reconstruction accuracy. Second, the locally normalized feature, being illumination invariant, adapts well to illumination variation.
Scene 2: In the videos Dudek and Sylvester2008b, the tracked target undergoes pose change and rotation, as shown in Fig. 3. From the tracking results, the proposed tracker is more accurate than MTT, which is attributed to the weighting step before the joint sparse representation of candidates.
Scene 3: In the videos Caviar1, Caviar2, Occlusion1 and Dudek, the tracked target is occluded partially or severely, as shown in Fig. 4. The proposed tracker tracks reliably in various occlusion scenes because it gives the target templates similar to a candidate a greater possibility of representing it, and so obtains higher representation accuracy.
In practice, the tracked target suffers from combinations of all these conditions during tracking. The position error curves in Fig. 5 show that the proposed tracker can reliably track the target under various tracking conditions, and its average tracking error is the smallest among the five trackers, as shown in Table 1.
4 Conclusions
For complex tracking environments involving illumination change, occlusion, pose change, rotation, and so on, a weighted joint sparse representation based tracking method is proposed with four characteristics: (1) the similarity among candidates is captured by joint sparse representation; (2) the different similarity degrees between the target templates and a candidate are reflected by weighting the templates; (3) the objective function is solved by a modified APG algorithm; (4) a simple template updating strategy adapts to target appearance change.
References
Black, M.J., Jepson, A.D.: Eigentracking: robust matching and tracking of articulated objects using a view-based representation. Int. J. Comput. Vis. 26(1), 63–84 (1998)
Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25(5), 564–577 (2003)
Ross, D.A., Lim, J., Lin, R.S., Yang, M.H.: Incremental learning for robust visual tracking. Int. J. Comput. Vis. 77(1–3), 125–141 (2008)
Kwon, J., Lee, K.M.: Visual tracking decomposition. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1269–1276. IEEE (2010)
Avidan, S.: Support vector tracking. IEEE Trans. Pattern Anal. Mach. Intell. 26(8), 1064–1072 (2004)
Collins, R.T., Liu, Y., Leordeanu, M.: Online selection of discriminative tracking features. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1631–1643 (2005)
Grabner, H., Leistner, C., Bischof, H.: Semi-supervised on-line boosting for robust tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008)
Babenko, B., Yang, M.H., Belongie, S.: Robust object tracking with online multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1619–1632 (2011)
Zhang, K., Zhang, L., Yang, M.H.: Real-time compressive tracking. In: European Conference on Computer Vision, pp. 864–877 (2012)
Mei, X., Ling, H.: Robust visual tracking using ℓ1 minimization. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1436–1443. IEEE (2009)
Mei, X., Ling, H., Wu, Y., et al.: Minimum error bounded efficient ℓ1 tracker with occlusion detection. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1257–1264. IEEE (2011)
Zhang, S., Yao, H., Zhou, H., et al.: Robust visual tracking based on online learning sparse representation. Neurocomputing 100, 31–40 (2013)
Zhang, T., Ghanem, B., Liu, S., et al.: Robust visual tracking via structured multi-task sparse learning. Int. J. Comput. Vis. 101(2), 367–383 (2013)
Tang, X., Feng, G., Cai, J.: Weighted group sparse representation for undersampled face recognition. Neurocomputing 145, 402–415 (2014)
Yuan, X.T., Liu, X., Yan, S.: Visual classification with multitask joint sparse representation. IEEE Trans. Image Process. 21(10), 4349–4360 (2012)
Acknowledgment
This work is supported by Scientific Research Fund of Heilongjiang Provincial Education Department (NO: 12541238), Dr. Scientific Research Foundation of Harbin Normal University (KGB201216), the National Science Foundation of China (61173087), Heilongjiang Provincial University Engineering R&D Center of Machine Vision and Intelligent Detection, and Heilongjiang Provincial Education Department Key Laboratory of Intelligent Education and Information Engineering.
© 2015 Springer International Publishing Switzerland

Duan, X., Liu, J., Tang, X. (2015). Weighted Joint Sparse Representation Based Visual Tracking. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds.) Neural Information Processing. ICONIP 2015. LNCS, vol. 9491. Springer, Cham. https://doi.org/10.1007/978-3-319-26555-1_68