
1 Introduction

Visual tracking is a hot topic in computer vision and is widely applied in fields such as intelligent surveillance, vehicle navigation, and advanced human-computer interaction. However, owing to various internal and external factors, such as noise, occlusion, illumination and viewpoint variations, pose change, and rotation, robust visual tracking remains a challenging task. Existing visual tracking methods can be categorized as generative or discriminative. Generative methods treat visual tracking as a matching problem. For example, the Eigentracker [1] and the mean-shift tracker [2], proposed by Black and Comaniciu respectively, achieve the desired real-time performance but lack the necessary updating of target templates. IVT [3], proposed by Ross, adapts to appearance change by incrementally learning a low-dimensional subspace. VTD [4], proposed by Kwon, extends the traditional particle filter tracker with multiple motion models and multiple appearance models. Discriminative methods regard tracking as a binary classification problem and need to learn and update a classifier from sampled positive and negative examples. Representative methods include the following: Avidan et al. [5] proposed an SVM-based tracker. Collins et al. [6] discriminate the target from the background by online feature selection. Grabner et al. [7] proposed an online semi-supervised boosting method to avoid drifting away from the target and improve tracking performance. Babenko et al. [8] avoid bias and drift by multiple-instance learning. Zhang et al. [9] formulate tracking as a binary classification problem in the compressed domain and obtain real-time performance.

Recently, sparse representation based visual tracking methods have become very popular. For instance, in [10–12], each candidate is sparsely represented as a linear combination of templates, with excellent tracking results. Because of the small sampling radius, the candidates are similar to each other. The method in [13] exploits this similarity among candidates through joint sparse representation, improving the representation accuracy of the candidates and the robustness of the whole tracking system. However, the aforementioned sparse representation based tracking methods assume that all templates have the same probability of representing each candidate. According to [14], a template that is more similar to a candidate should have a larger probability of representing it.

In this paper, a weighted joint sparse representation based visual tracking method is proposed to robustly track the target in complex environments. Specifically, to represent a candidate, all templates are weighted according to their similarity to the candidate. Then, all candidates are represented by joint sparse representation to reflect the similarity among them. The sparse coefficients are used to compute the observation probability of the corresponding candidate, and the candidate with the maximal observation probability is determined to be the target. In this model, a template similar to a candidate has a larger probability of representing it. To adapt to illumination change, a locally normalized feature, which is illumination invariant, is used for appearance representation. Experiments show that the proposed tracker adapts well to illumination variation, occlusion, pose change, rotation, and combinations of these conditions.

2 Weighted Joint Sparse Representation Based Tracker

2.1 Weighted Joint Sparse Representation

Let \( \text{T} = \left\{ {\text{T}_{1} ,\text{T}_{2} , \ldots ,\text{T}_{M} } \right\} \) and \( \left\{ {y_{1} ,y_{2} , \ldots ,y_{N} } \right\} \) denote the target template set and the candidates of the current frame, respectively. To locate the target accurately, Zhang et al. proposed a multi-task learning based tracking method, which takes into account the similarity among candidates and represents all of them by solving the \( \ell_{2,1} \) mixed-norm regularized minimization problem [13].

$$ \hat{c} = \arg \mathop {\hbox{min} }\limits_{c} \frac{1}{2}\sum\limits_{i = 1}^{N} {\left\| {{\mathbf{y}}_{i} - {\mathbf{T}}c_{i} } \right\|_{2}^{2} } + \lambda \sum\limits_{j = 1}^{M} {\left\| {c^{j} } \right\|_{2} } $$
(1)

where

$$ c = \left[ {c_{1} ,c_{2} , \ldots ,c_{N} } \right] = \left[ {c^{1} ;c^{2} ; \ldots ;c^{M} } \right] = \left( {\begin{array}{*{20}c} {c_{1}^{1} } & \ldots & {c_{N}^{1} } \\ \vdots & \ddots & \vdots \\ {c_{1}^{M} } & \cdots & {c_{N}^{M} } \\ \end{array} } \right) $$

is the sparse representation coefficient matrix. Specifically, the \( i \)th column \( c_{i} \) of \( c \) contains the representation coefficients of the candidate \( {\mathbf{y}}_{i} \), the \( j \)th row \( c^{j} \) contains the coefficients of all candidates corresponding to the target template \( \text{T}_{j} \), and \( c_{i}^{j} \) is the \( j \)th representation coefficient of the candidate \( {\mathbf{y}}_{i} \), i.e., the one corresponding to the target template \( \text{T}_{j} \). The regularization term of (1) is the \( \ell_{2,1} \) mixed norm, which induces a joint sparse representation of the candidates. By exploiting the similarity among candidates, [13] improves the representation accuracy and the tracking robustness compared with other sparse representation based trackers [10–12]. However, all the aforementioned trackers [10–13] assume that all target templates have the same probability of representing a candidate. According to [14], a target template more similar to a candidate should have a larger probability of being chosen to represent it. Therefore, in this paper, before representing a candidate, all target templates are weighted by their distance to the candidate. Then, taking the similarity among candidates into account, all candidates are represented jointly:

$$ \hat{c} = \arg \mathop {\hbox{min} }\limits_{c} \frac{1}{2}\sum\limits_{i = 1}^{N} {\left\| {{\mathbf{y}}_{i} - {\mathbf{T}}c_{i} } \right\|_{2}^{2} + \lambda \sum\limits_{j = 1}^{M} {\left\| {w^{j} \odot c^{j} } \right\|_{2} } } $$
(2)

Compared with (1), the weight matrix

$$ w = \left[ {w_{1} ,w_{2} , \ldots ,w_{N} } \right] = \left[ {w^{1} ;w^{2} ; \ldots ;w^{M} } \right] = \left( {\begin{array}{*{20}c} {w_{1}^{1} } & \ldots & {w_{N}^{1} } \\ \vdots & \ddots & \vdots \\ {w_{1}^{M} } & \cdots & {w_{N}^{M} } \\ \end{array} } \right) $$

is introduced into (2), where \( w_{i}^{j} \) is the weight of the target template \( \text{T}_{j} \) for representing the candidate \( {\mathbf{y}}_{i} \), computed from the distance between them:

$$ w_{i}^{j} = \exp \left( {\frac{{\left\| {{\mathbf{y}}_{i} - \text{T}_{j} } \right\|_{2} }}{\sigma }} \right) $$

Here \( \left\| {{\mathbf{y}}_{i} - \text{T}_{j} } \right\|_{2} \) is the Euclidean distance between \( \text{T}_{j} \) and \( {\mathbf{y}}_{i} \). A template far from the candidate thus receives a large weight, which heavily penalizes its coefficient in (2), while a template similar to the candidate receives a small weight and is more likely to be selected.
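To make the weighting step concrete, the following Python sketch computes the weight matrix \( w \) for all template/candidate pairs; the matrix layouts (templates and candidates stored as columns) and the function name are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def template_weights(T, Y, sigma=1.0):
    """Weights w_i^j = exp(||y_i - T_j||_2 / sigma): the farther a
    template is from a candidate, the larger its weight, i.e. the
    heavier the penalty on that template's coefficient.

    T : (d, M) templates as columns; Y : (d, N) candidates as columns.
    Returns w with shape (M, N), matching the layout of c.
    """
    # dist[j, i] = Euclidean distance between template T_j and candidate y_i
    dist = np.linalg.norm(T[:, :, None] - Y[:, None, :], axis=0)
    return np.exp(dist / sigma)
```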

The sparse representation coefficients \( \left\{ {c_{i} } \right\}_{i = 1}^{N} \) of the candidates \( \left\{ {{\mathbf{y}}_{i} } \right\}_{i = 1}^{N} \) are obtained by solving (2) and are used to calculate the reconstruction errors \( \left\{ {r_{i} } \right\}_{i = 1}^{N} \) of the candidates:

$$ r_{i} = \left\| {{\mathbf{y}}_{i} - {\mathbf{T}}c_{i} } \right\|_{2} $$
(3)

\( r_{i} \) is then used to calculate the observation probability of the candidate \( {\mathbf{y}}_{i} \):

$$ P\left( {{\mathbf{y}}_{i} \left| o \right.} \right) = \mu \; \cdot \;\exp \left( { - r_{i} } \right) $$
(4)

where \( \mu \) is the normalization constant. Among the given candidates \( \left\{ {y_{1} ,y_{2} , \ldots ,y_{N} } \right\} \), the candidate \( y_{i} \) with the maximal observation probability,

$$ i = \arg \mathop {\hbox{max} }\limits_{j} P\left( {{\mathbf{y}}_{j} \left| o \right.} \right) $$
(5)

is determined as the target.
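A minimal sketch of Eqs. (3)–(5) follows, assuming the coefficient matrix \( C \) has already been obtained by solving (2); names and matrix layouts are illustrative.

```python
import numpy as np

def select_target(T, Y, C):
    """Evaluate Eqs. (3)-(5): reconstruction errors, observation
    probabilities, and the index of the candidate chosen as the target."""
    r = np.linalg.norm(Y - T @ C, axis=0)   # Eq. (3), one error per candidate
    p = np.exp(-r)
    p /= p.sum()                            # Eq. (4), mu = 1 / sum(exp(-r))
    return int(np.argmax(p)), p             # Eq. (5), maximal probability
```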

2.2 Solving the Objective Function

The key to the weighted joint sparse representation above is solving (2). The corresponding objective function is

$$ f(c) = \frac{1}{2}\sum\limits_{i = 1}^{N} {\left\| {{\mathbf{y}}_{i} - {\mathbf{T}}c_{i} } \right\|_{2}^{2} + \lambda \sum\limits_{j = 1}^{M} {\left\| {w^{j} \odot c^{j} } \right\|_{2} } } $$
(6)

To solve (6), an auxiliary function is constructed

$$ g(c) = \frac{1}{2}\sum\limits_{i = 1}^{N} {\left\| {{\mathbf{y}}_{i} - {\mathbf{T}}\left( {\left( {w_{i} } \right)^{ - 1} \odot c_{i} } \right)} \right\|_{2}^{2} + \lambda \sum\limits_{j = 1}^{M} {\left\| {c^{j} } \right\|_{2} } } $$
(7)

where \( \left( {w_{i} } \right)^{ - 1} = \left[ {\frac{1}{{w_{i}^{1} }};\frac{1}{{w_{i}^{2} }}; \ldots ;\frac{1}{{w_{i}^{M} }}} \right] \), and \( \odot \) is the element-wise multiplication operator, i.e., the multiplication is conducted between corresponding elements of two matrices or two vectors. Like (1), \( g\left( c \right) \) is a multi-task joint sparse representation problem and can be solved by a modified APG algorithm [15]. Let \( \hat{c}^{\prime} = \arg \mathop {\hbox{min} }\limits_{c} g\left( c \right) \) and \( \hat{c} = \arg \mathop {\hbox{min} }\limits_{c} f\left( c \right) \) denote the optimal solutions of \( g\left( c \right) \) and \( f\left( c \right) \), respectively. Then, similar to [13], the two are related by

$$ \hat{c} = w^{ - 1} \odot \hat{c}^{\prime} $$

where

$$ \begin{aligned} w^{ - 1} & = \left[ {\left( {w_{1} } \right)^{ - 1} ,\left( {w_{2} } \right)^{ - 1} , \ldots ,\left( {w_{N} } \right)^{ - 1} } \right] \\ & = \left[ {\left( {w^{1} } \right)^{ - 1} ;\left( {w^{2} } \right)^{ - 1} ; \ldots ;\left( {w^{M} } \right)^{ - 1} } \right] \\ & = \left( {\begin{array}{*{20}c} {\frac{1}{{w_{1}^{1} }}} & \ldots & {\frac{1}{{w_{N}^{1} }}} \\ \vdots & \ddots & \vdots \\ {\frac{1}{{w_{1}^{M} }}} & \cdots & {\frac{1}{{w_{N}^{M} }}} \\ \end{array} } \right) \\ \end{aligned} $$
(8)
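The following sketch illustrates this substitution: it minimizes \( g(c) \) with a FISTA-style accelerated proximal gradient loop (a stand-in for the modified APG of [15], whose exact details are not reproduced here) and then recovers \( \hat{c} \) via (8). The step-size bound and iteration scheme are our assumptions.

```python
import numpy as np

def prox_l21(C, t):
    """Row-wise group soft-thresholding: the proximal operator of
    t * sum_j ||C^j||_2 (rows of C correspond to templates)."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12)) * C

def solve_weighted_l21(T, Y, w, lam=2e-4, n_iter=5):
    """Solve Eq. (2) through the auxiliary problem (7): substitute
    c' = w (element-wise) c, minimize g(c') by accelerated proximal
    gradient, then recover c = w^{-1} (element-wise) c' as in (8)."""
    M, N = w.shape
    Winv = 1.0 / w                        # element-wise inverse of w
    # Step size from an upper bound on the Lipschitz constant:
    # ||T diag(1/w_i)||_2 <= ||T||_2 * max(1/w).
    L = (np.linalg.norm(T, 2) * Winv.max()) ** 2
    C = np.zeros((M, N)); Z = C.copy(); s = 1.0
    for _ in range(n_iter):
        # Gradient of the smooth term of g at Z (all columns at once)
        G = Winv * (T.T @ (T @ (Winv * Z) - Y))
        C_next = prox_l21(Z - G / L, lam / L)
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s * s)) / 2.0
        Z = C_next + ((s - 1.0) / s_next) * (C_next - C)
        C, s = C_next, s_next
    return Winv * C                       # Eq. (8): c-hat = w^{-1} ⊙ c-hat'
```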

2.3 Template Updating

A simple template updating strategy is adopted. (1) The target of the first frame is added to the target template set and is never updated. (2) Determine whether the target appearance of the current frame has changed greatly. If the amount of change exceeds a given threshold, the target template set is updated; otherwise, it is not. Specifically, if the similarity between the target \( y_{i} \) and its most similar target template \( \text{T}_{j} \) is smaller than a given threshold \( \tau \), then the target \( y_{i} \) replaces the target template \( \text{T}_{k} \), where

$$ \begin{aligned} i & = \arg \mathop {\hbox{max} }\limits_{j} P\left( {{\mathbf{y}}_{j} \left| o \right.} \right), \\ j & = \arg \mathop {\hbox{min} }\limits_{l} \left\| {\text{T}_{l} - \text{y}_{i} } \right\|_{2} , \\ k & = \arg \mathop {\hbox{max} }\limits_{l} \left\| {\text{T}_{l} - \text{y}_{i} } \right\|_{2} . \\ \end{aligned} $$
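A sketch of this updating rule follows. Since the paper does not specify the similarity measure, \( \exp(-\text{distance}) \), compared against \( \tau \), is used as an assumed stand-in.

```python
import numpy as np

def update_templates(T, y, tau=0.4):
    """Template update of Sect. 2.3: if the tracked target y is no longer
    similar enough to its closest template, replace the farthest template
    with y. Column 0 (the first-frame target) is never replaced. The
    similarity measure exp(-distance) is an assumed stand-in."""
    dists = np.linalg.norm(T - y[:, None], axis=0)
    j = int(np.argmin(dists))               # most similar template
    if np.exp(-dists[j]) < tau:             # appearance changed greatly
        k = int(np.argmax(dists[1:])) + 1   # farthest, skipping column 0
        T[:, k] = y
    return T
```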

2.4 Tracking Algorithm

In this paper, the locally normalized feature is adopted for appearance representation, as shown in Fig. 1. First, the candidate or target template is divided into several regions. Second, the vector representation of each region is obtained by concatenating its columns sequentially and normalizing the result to unit \( \ell_{2} \) length. Finally, the normalized vectors are concatenated into one longer vector, which serves as the feature vector of the candidate or target template.

Fig. 1. Appearance representation
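A minimal sketch of this feature extraction, assuming a 32 × 32 grayscale patch divided into a 4 × 4 grid (the settings used in Sect. 3):

```python
import numpy as np

def local_normalized_feature(patch, grid=(4, 4)):
    """Locally normalized feature of Sect. 2.4: split the patch into a
    grid of regions, vectorize each region column by column, normalize
    it to unit L2 length, and concatenate the results."""
    gh, gw = patch.shape[0] // grid[0], patch.shape[1] // grid[1]
    parts = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            region = patch[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw]
            v = region.flatten(order='F')          # concatenate the columns
            parts.append(v / max(np.linalg.norm(v), 1e-12))
    return np.concatenate(parts)
```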

The detailed tracking algorithm is summarized in Algorithm 1.
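Since Algorithm 1 itself is not reproduced here, the following sketch outlines one per-frame iteration using the routines above; `sample_candidates` and `crop_patch` are hypothetical helpers standing in for the candidate sampling and patch extraction steps, which the paper does not detail.

```python
import numpy as np

def track_frame(frame, T, prev_state, sigma=1.0, lam=2e-4, tau=0.4):
    """One per-frame iteration of the proposed tracker (sketch)."""
    states = sample_candidates(prev_state, n=600)     # hypothetical helper
    Y = np.column_stack([local_normalized_feature(crop_patch(frame, s))
                         for s in states])            # hypothetical crop_patch
    w = template_weights(T, Y, sigma)                 # Sect. 2.1 weighting
    C = solve_weighted_l21(T, Y, w, lam, n_iter=5)    # Eq. (2) via Eq. (7)
    i, _ = select_target(T, Y, C)                     # Eqs. (3)-(5)
    T = update_templates(T, Y[:, i], tau)             # Sect. 2.3
    return states[i], T
```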

3 Experiments and Analyses

We implement the proposed tracker in MATLAB 2011b on a PC with an AMD Sempron X2 198 CPU (2.5 GHz) and 3 GB of memory, and compare it with 4 other trackers (MIL [8], CT [9], L1APG [10] and MTT [12]) in 3 different scenes to validate its performance. It should be mentioned that all 4 compared trackers provide source code. The related parameters are set as follows: the number \( M \) of target templates is 10, and the number \( N \) of candidates is 600. Each candidate or target template is scaled to 32 × 32 pixels and divided into 4 × 4 regions to extract the locally normalized feature. The regularization parameter \( \lambda \) is set to 0.0002, the number of iterations in the optimization of the objective function is set to 5, and the template updating threshold \( \tau \) is set to 0.4.
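For reference, these settings can be collected in a single configuration mirroring the values stated above:

```python
# Parameter settings of Sect. 3, collected for reference.
PARAMS = dict(
    n_templates=10,       # M, number of target templates
    n_candidates=600,     # N, number of candidates per frame
    patch_size=(32, 32),  # pixels per candidate / template
    grid=(4, 4),          # regions for the locally normalized feature
    lam=2e-4,             # regularization parameter lambda
    n_iter=5,             # iterations for the objective optimization
    tau=0.4,              # template updating threshold
)
```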

Scene 1: We select 3 videos (Singer1, Sylvester2008b and Car4) to validate the performance under illumination variation, as shown in Fig. 2. It can be seen that the proposed tracker tracks robustly under various illumination conditions. The reasons are twofold. First, the proposed method accounts for the fact that a target template similar to the candidate has a greater probability of representing it, and thus achieves higher reconstruction accuracy. Second, the locally normalized feature, being illumination invariant, adapts well to illumination variation.

Fig. 2. Tracking results under illumination variation

Scene 2: In the videos Dudek and Sylvester2008b, the tracked target undergoes pose change and rotation, as shown in Fig. 3. The tracking results show that the proposed tracker is more accurate than MTT, which is attributed to the weighting step before the joint sparse representation of the candidates.

Fig. 3. Tracking results under pose change and rotation

Scene 3: In the videos Caviar1, Caviar2, Occlusion1 and Dudek, the tracked target is occluded partially or severely, as shown in Fig. 4. It can be concluded that the proposed tracker tracks reliably in various occlusion scenes. The reason is that the proposed tracker gives a target template similar to a candidate a greater probability of representing it, and thus obtains higher representation accuracy.

Fig. 4. Tracking results under occlusion

In practice, the tracked target is affected by a combination of conditions during tracking. The position error curves in Fig. 5 show that the proposed tracker tracks the target reliably under various tracking conditions. Compared with the other 4 trackers, the proposed tracker has the smallest average tracking error, as shown in Table 1.

Fig. 5. Position error curves

Table 1. Average tracking error (center point error, unit: pixel)

4 Conclusions

Aiming at complex tracking environments with illumination change, occlusion, pose change, rotation, and so on, a weighted joint sparse representation based tracking method is proposed, with 4 characteristics: (1) The similarity among candidates is captured by joint sparse representation. (2) Target templates are weighted to reflect their different degrees of similarity to each candidate. (3) The objective function is solved by a modified APG algorithm. (4) A simple template updating strategy is used to adapt to target appearance changes.