
1 Introduction

Visual tracking is a hot topic in computer vision and is widely applied in fields such as intelligent surveillance, vehicle navigation, and advanced human-computer interaction. However, owing to various internal and external factors, such as noise, occlusion, illumination and viewpoint variations, pose change, and rotation, robust visual tracking remains a challenging task. Existing visual tracking methods can be categorized as generative or discriminative. Generative methods treat visual tracking as a matching problem. For example, the Eigentracker [1] and the mean-shift tracker [2], proposed by Black and Comaniciu respectively, achieve the desired real-time performance but lack the necessary updating of target templates. IVT [3], proposed by Ross, adapts to appearance change by incrementally learning a low-dimensional subspace. VTD [4], proposed by Kwon, extends the traditional particle filter tracker with multiple motion models and multiple appearance models. Discriminative methods regard tracking as a binary classification problem and need to learn and update a classifier from sampled positive and negative examples. Representative methods include the following: Avidan et al. [5] proposed an SVM-based tracker. Collins et al. [6] discriminate the target from the background by online feature selection. Grabner et al. [7] proposed an online semi-supervised boosting method to avoid drifting away from the target and improve tracking performance. Babenko et al. [8] avoid bias and drift by multiple-instance learning. Zhang et al. [9] formulate tracking as a binary classification problem in the compressed domain and obtain real-time performance.

Recently, sparse representation based visual tracking methods have become very popular. For instance, in [10–12], each candidate is sparsely represented as a linear combination of templates, with excellent tracking results. Because of the small sampling radius, the candidates are similar to each other. The method in [13] exploits this similarity among candidates through joint sparse representation, improving the representation accuracy of the candidates and the robustness of the whole tracking system. However, the aforementioned sparse representation based tracking methods assume that all templates have the same probability of representing each candidate. According to [14], a template that is more similar to a candidate should have a larger probability of representing it.

In this paper, a weighted joint sparse representation based visual tracking method is proposed to robustly track the target in complex environments. Specifically, to represent a candidate, all templates are weighted according to their similarity to the candidate. Then, all candidates are represented by joint sparse representation to reflect the similarity among them. The sparse coefficients are used to compute the observation probability of the corresponding candidate, and the candidate with the maximal observation probability is determined to be the target. In this model, a template similar to a candidate has a larger probability of representing it. To adapt to illumination change, a locally normalized feature, which is illumination invariant, is used for appearance representation. Experiments show that the proposed tracker adapts well to illumination variation, occlusion, pose change, rotation, and combinations of these conditions.

2 Weighted Joint Sparse Representation Based Tracker

2.1 Weighted Joint Sparse Representation

Let \( \text{T} = \left\{ {\text{T}_{1} ,\text{T}_{2} , \ldots ,\text{T}_{M} } \right\} \) and \( \left\{ {y_{1} ,y_{2} , \ldots ,y_{N} } \right\} \) denote the target template set and the candidates of the current frame, respectively. To locate the target accurately, Zhang et al. proposed a multi-task learning based tracking method, which takes into account the similarity among candidates and represents all of them by solving the \( \ell_{2,1} \) mixed-norm regularized minimization problem [13].

$$ \hat{c} = \arg \mathop {\hbox{min} }\limits_{c} \frac{1}{2}\sum\limits_{i = 1}^{N} {\left\| {{\mathbf{y}}_{i} - {\mathbf{T}}c_{i} } \right\|_{2}^{2} } + \lambda \sum\limits_{j = 1}^{M} {\left\| {c^{j} } \right\|_{2} } $$
(1)

where

$$ c = \left[ {c_{1} ,c_{2} , \ldots ,c_{N} } \right] = \left[ {c^{1} ;c^{2} ; \ldots ;c^{M} } \right] = \left( {\begin{array}{*{20}c} {c_{1}^{1} } & \ldots & {c_{N}^{1} } \\ \vdots & \ddots & \vdots \\ {c_{1}^{M} } & \cdots & {c_{N}^{M} } \\ \end{array} } \right) $$

is the sparse representation coefficient matrix. Specifically, the \( i \)th column \( c_{i} \) of \( c \) contains the representation coefficients of the candidate \( {\mathbf{y}}_{i} \), the \( j \)th row \( c^{j} \) contains the coefficients of all candidates corresponding to the target template \( \text{T}_{j} \), and \( c_{i}^{j} \) is the \( j \)th representation coefficient of the candidate \( {\mathbf{y}}_{i} \), i.e., the one corresponding to the target template \( \text{T}_{j} \). The regularization term of (1) is the \( \ell_{2,1} \) mixed norm, which induces a joint sparse representation of the candidates. By exploiting the similarity among candidates, [13] improves the representation accuracy and the tracking robustness compared with other sparse representation based trackers [10–12]. However, all the aforementioned trackers [10–13] assume that all target templates have the same probability of representing a candidate. According to [14], a target template more similar to a candidate should have a larger probability of being chosen to represent it. Therefore, in this paper, before representing a candidate, all target templates are weighted by their distance to the candidate. Then, taking the similarity among candidates into account, all candidates are represented jointly:

$$ \hat{c} = \arg \mathop {\hbox{min} }\limits_{c} \frac{1}{2}\sum\limits_{i = 1}^{N} {\left\| {{\mathbf{y}}_{i} - {\mathbf{T}}c_{i} } \right\|_{2}^{2} + \lambda \sum\limits_{j = 1}^{M} {\left\| {w^{j} \odot c^{j} } \right\|_{2} } } $$
(2)

Compared with (1), the weight matrix

$$ w = \left[ {w_{1} ,w_{2} , \ldots ,w_{N} } \right] = \left[ {w^{1} ;w^{2} ; \ldots ;w^{M} } \right] = \left( {\begin{array}{*{20}c} {w_{1}^{1} } & \ldots & {w_{N}^{1} } \\ \vdots & \ddots & \vdots \\ {w_{1}^{M} } & \cdots & {w_{N}^{M} } \\ \end{array} } \right) $$

is introduced into (2), where \( w_{i}^{j} \) is the weight of the target template \( \text{T}_{j} \) for representing the candidate \( {\mathbf{y}}_{i} \), computed from the distance between them:

$$ w_{i}^{j} = \exp \left( {\frac{{\left\| {{\mathbf{y}}_{i} - \text{T}_{j} } \right\|_{2} }}{\sigma }} \right) $$

Here \( \left\| {{\mathbf{y}}_{i} - \text{T}_{j} } \right\|_{2} \) is the Euclidean distance between \( \text{T}_{j} \) and \( {\mathbf{y}}_{i} \). A template far from the candidate thus receives a large weight, which heavily penalizes its coefficient in (2), while a template similar to the candidate receives a small weight and is more likely to be selected.
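To make the weighting step concrete, the following Python sketch computes the weight matrix \( w \) for all template/candidate pairs; the matrix layouts (templates and candidates stored as columns) and the function name are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def template_weights(T, Y, sigma=1.0):
    """Weights w_i^j = exp(||y_i - T_j||_2 / sigma): the farther a
    template is from a candidate, the larger its weight, i.e. the
    heavier the penalty on that template's coefficient.

    T : (d, M) templates as columns; Y : (d, N) candidates as columns.
    Returns w with shape (M, N), matching the layout of c.
    """
    # dist[j, i] = Euclidean distance between template T_j and candidate y_i
    dist = np.linalg.norm(T[:, :, None] - Y[:, None, :], axis=0)
    return np.exp(dist / sigma)
```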

The sparse representation coefficients \( \left\{ {c_{i} } \right\}_{i = 1}^{N} \) of the candidates \( \left\{ {{\mathbf{y}}_{i} } \right\}_{i = 1}^{N} \) are obtained by solving (2) and are used to calculate the reconstruction errors \( \left\{ {r_{i} } \right\}_{i = 1}^{N} \) of the candidates:

$$ r_{i} = \left\| {{\mathbf{y}}_{i} - {\mathbf{T}}c_{i} } \right\|_{2} $$
(3)

\( r_{i} \) is then used to calculate the observation probability of the candidate \( {\mathbf{y}}_{i} \):

$$ P\left( {{\mathbf{y}}_{i} \left| o \right.} \right) = \mu \; \cdot \;\exp \left( { - r_{i} } \right) $$
(4)

where \( \mu \) is the normalization constant. Among the given candidates \( \left\{ {y_{1} ,y_{2} , \ldots ,y_{N} } \right\} \), the candidate \( y_{i} \) with the maximal observation probability,

$$ i = \arg \mathop {\hbox{max} }\limits_{j} P\left( {{\mathbf{y}}_{j} \left| o \right.} \right) $$
(5)

is determined as the target.
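A minimal sketch of Eqs. (3)–(5) follows, assuming the coefficient matrix \( C \) has already been obtained by solving (2); names and matrix layouts are illustrative.

```python
import numpy as np

def select_target(T, Y, C):
    """Evaluate Eqs. (3)-(5): reconstruction errors, observation
    probabilities, and the index of the candidate chosen as the target."""
    r = np.linalg.norm(Y - T @ C, axis=0)   # Eq. (3), one error per candidate
    p = np.exp(-r)
    p /= p.sum()                            # Eq. (4), mu = 1 / sum(exp(-r))
    return int(np.argmax(p)), p             # Eq. (5), maximal probability
```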

2.2 Solving the Objective Function

The key to the weighted joint sparse representation above is solving (2). The corresponding objective function is

$$ f(c) = \frac{1}{2}\sum\limits_{i = 1}^{N} {\left\| {{\mathbf{y}}_{i} - {\mathbf{T}}c_{i} } \right\|_{2}^{2} + \lambda \sum\limits_{j = 1}^{M} {\left\| {w^{j} \odot c^{j} } \right\|_{2} } } $$
(6)

To solve (6), an auxiliary function is constructed

$$ g(c) = \frac{1}{2}\sum\limits_{i = 1}^{N} {\left\| {{\mathbf{y}}_{i} - {\mathbf{T}}\left( {\left( {w_{i} } \right)^{ - 1} \odot c_{i} } \right)} \right\|_{2}^{2} + \lambda \sum\limits_{j = 1}^{M} {\left\| {c^{j} } \right\|_{2} } } $$
(7)

where \( \left( {w_{i} } \right)^{ - 1} = \left[ {\frac{1}{{w_{i}^{1} }};\frac{1}{{w_{i}^{2} }}; \ldots ;\frac{1}{{w_{i}^{M} }}} \right] \), and \( \odot \) is the element-wise multiplication operator, i.e., the multiplication is conducted between corresponding elements of two matrices or two vectors. Like (1), \( g\left( c \right) \) is a multi-task joint sparse representation problem and can be solved by a modified APG algorithm [15]. Let \( \hat{c}^{\prime} = \arg \mathop {\hbox{min} }\limits_{c} g\left( c \right) \) and \( \hat{c} = \arg \mathop {\hbox{min} }\limits_{c} f\left( c \right) \) denote the optimal solutions of \( g\left( c \right) \) and \( f\left( c \right) \), respectively. Then, similar to [13], the two are related by

$$ \hat{c} = w^{ - 1} \odot \hat{c}^{\prime} $$

where

$$ \begin{aligned} w^{ - 1} & = \left[ {\left( {w_{1} } \right)^{ - 1} ,\left( {w_{2} } \right)^{ - 1} , \ldots ,\left( {w_{N} } \right)^{ - 1} } \right] \\ & = \left[ {\left( {w^{1} } \right)^{ - 1} ;\left( {w^{2} } \right)^{ - 1} ; \ldots ;\left( {w^{M} } \right)^{ - 1} } \right] \\ & = \left( {\begin{array}{*{20}c} {\frac{1}{{w_{1}^{1} }}} & \ldots & {\frac{1}{{w_{N}^{1} }}} \\ \vdots & \ddots & \vdots \\ {\frac{1}{{w_{1}^{M} }}} & \cdots & {\frac{1}{{w_{N}^{M} }}} \\ \end{array} } \right) \\ \end{aligned} $$
(8)
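The following sketch illustrates this substitution: it minimizes \( g(c) \) with a FISTA-style accelerated proximal gradient loop (a stand-in for the modified APG of [15], whose exact details are not reproduced here) and then recovers \( \hat{c} \) via (8). The step-size bound and iteration scheme are our assumptions.

```python
import numpy as np

def prox_l21(C, t):
    """Row-wise group soft-thresholding: the proximal operator of
    t * sum_j ||C^j||_2 (rows of C correspond to templates)."""
    norms = np.linalg.norm(C, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12)) * C

def solve_weighted_l21(T, Y, w, lam=2e-4, n_iter=5):
    """Solve Eq. (2) through the auxiliary problem (7): substitute
    c' = w (element-wise) c, minimize g(c') by accelerated proximal
    gradient, then recover c = w^{-1} (element-wise) c' as in (8)."""
    M, N = w.shape
    Winv = 1.0 / w                        # element-wise inverse of w
    # Step size from an upper bound on the Lipschitz constant:
    # ||T diag(1/w_i)||_2 <= ||T||_2 * max(1/w).
    L = (np.linalg.norm(T, 2) * Winv.max()) ** 2
    C = np.zeros((M, N)); Z = C.copy(); s = 1.0
    for _ in range(n_iter):
        # Gradient of the smooth term of g at Z (all columns at once)
        G = Winv * (T.T @ (T @ (Winv * Z) - Y))
        C_next = prox_l21(Z - G / L, lam / L)
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s * s)) / 2.0
        Z = C_next + ((s - 1.0) / s_next) * (C_next - C)
        C, s = C_next, s_next
    return Winv * C                       # Eq. (8): c-hat = w^{-1} ⊙ c-hat'
```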

2.3 Template Updating

A simple template updating strategy is adopted. (1) The target of the first frame is added to the target template set and is never updated. (2) Determine whether the target appearance of the current frame has changed greatly. If the amount of change exceeds a given threshold, the target template set is updated; otherwise, it is not. Specifically, if the similarity between the target \( y_{i} \) and its most similar target template \( \text{T}_{j} \) is smaller than a given threshold \( \tau \), then the target \( y_{i} \) replaces the target template \( \text{T}_{k} \), where

$$ \begin{aligned} i & = \arg \mathop {\hbox{max} }\limits_{j} P\left( {{\mathbf{y}}_{j} \left| o \right.} \right), \\ j & = \arg \mathop {\hbox{min} }\limits_{l} \left\| {\text{T}_{l} - \text{y}_{i} } \right\|_{2} , \\ k & = \arg \mathop {\hbox{max} }\limits_{l} \left\| {\text{T}_{l} - \text{y}_{i} } \right\|_{2} . \\ \end{aligned} $$
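A sketch of this updating rule follows. Since the paper does not specify the similarity measure, \( \exp(-\text{distance}) \), compared against \( \tau \), is used as an assumed stand-in.

```python
import numpy as np

def update_templates(T, y, tau=0.4):
    """Template update of Sect. 2.3: if the tracked target y is no longer
    similar enough to its closest template, replace the farthest template
    with y. Column 0 (the first-frame target) is never replaced. The
    similarity measure exp(-distance) is an assumed stand-in."""
    dists = np.linalg.norm(T - y[:, None], axis=0)
    j = int(np.argmin(dists))               # most similar template
    if np.exp(-dists[j]) < tau:             # appearance changed greatly
        k = int(np.argmax(dists[1:])) + 1   # farthest, skipping column 0
        T[:, k] = y
    return T
```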

2.4 Tracking Algorithm

In this paper, the locally normalized feature is adopted for appearance representation, as shown in Fig. 1. First, the candidate or target template is divided into several regions. Second, the vector representation of each region is obtained by concatenating its columns sequentially and normalizing the result to unit \( \ell_{2} \) length. Finally, the normalized vectors are concatenated into one longer vector, which serves as the feature vector of the candidate or target template.

Fig. 1. Appearance representation
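A minimal sketch of this feature extraction, assuming a 32 × 32 grayscale patch divided into a 4 × 4 grid (the settings used in Sect. 3):

```python
import numpy as np

def local_normalized_feature(patch, grid=(4, 4)):
    """Locally normalized feature of Sect. 2.4: split the patch into a
    grid of regions, vectorize each region column by column, normalize
    it to unit L2 length, and concatenate the results."""
    gh, gw = patch.shape[0] // grid[0], patch.shape[1] // grid[1]
    parts = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            region = patch[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw]
            v = region.flatten(order='F')          # concatenate the columns
            parts.append(v / max(np.linalg.norm(v), 1e-12))
    return np.concatenate(parts)
```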

The detailed tracking algorithm is summarized in Algorithm 1.
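Since Algorithm 1 itself is not reproduced here, the following sketch outlines one per-frame iteration using the routines above; `sample_candidates` and `crop_patch` are hypothetical helpers standing in for the candidate sampling and patch extraction steps, which the paper does not detail.

```python
import numpy as np

def track_frame(frame, T, prev_state, sigma=1.0, lam=2e-4, tau=0.4):
    """One per-frame iteration of the proposed tracker (sketch)."""
    states = sample_candidates(prev_state, n=600)     # hypothetical helper
    Y = np.column_stack([local_normalized_feature(crop_patch(frame, s))
                         for s in states])            # hypothetical crop_patch
    w = template_weights(T, Y, sigma)                 # Sect. 2.1 weighting
    C = solve_weighted_l21(T, Y, w, lam, n_iter=5)    # Eq. (2) via Eq. (7)
    i, _ = select_target(T, Y, C)                     # Eqs. (3)-(5)
    T = update_templates(T, Y[:, i], tau)             # Sect. 2.3
    return states[i], T
```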

3 Experiments and Analyses

We implement the proposed tracker in MATLAB 2011b on a PC with an AMD Sempron X2 198 CPU (2.5 GHz) and 3 GB of memory, and compare it with 4 other trackers (MIL [8], CT [9], L1APG [10] and MTT [12]) in 3 different scenes to validate its performance. It should be mentioned that all 4 compared trackers provide source code. The related parameters are set as follows: the number \( M \) of target templates is 10, and the number \( N \) of candidates is 600. Each candidate or target template is scaled to 32 × 32 pixels and divided into 4 × 4 regions to extract the locally normalized feature. The regularization parameter \( \lambda \) is set to 0.0002, the number of iterations in the optimization of the objective function is set to 5, and the template updating threshold \( \tau \) is set to 0.4.
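For reference, these settings can be collected in a single configuration mirroring the values stated above:

```python
# Parameter settings of Sect. 3, collected for reference.
PARAMS = dict(
    n_templates=10,       # M, number of target templates
    n_candidates=600,     # N, number of candidates per frame
    patch_size=(32, 32),  # pixels per candidate / template
    grid=(4, 4),          # regions for the locally normalized feature
    lam=2e-4,             # regularization parameter lambda
    n_iter=5,             # iterations for the objective optimization
    tau=0.4,              # template updating threshold
)
```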

Scene 1: We select 3 videos (Singer1, Sylvester2008b and Car4) to validate the performance under illumination variation, as shown in Fig. 2. It can be seen that the proposed tracker tracks robustly under various illumination conditions. The reasons are twofold. First, the proposed method accounts for the fact that a target template similar to the candidate has a greater probability of representing it, and thus achieves higher reconstruction accuracy. Second, the locally normalized feature, being illumination invariant, adapts well to illumination variation.

Fig. 2. Tracking results under illumination variation

Scene 2: In the videos Dudek and Sylvester2008b, the tracked target undergoes pose change and rotation, as shown in Fig. 3. The tracking results show that the proposed tracker is more accurate than MTT, which is attributed to the weighting step before the joint sparse representation of the candidates.

Fig. 3. Tracking results under pose change and rotation

Scene 3: In the videos Caviar1, Caviar2, Occlusion1 and Dudek, the tracked target is occluded partially or severely, as shown in Fig. 4. It can be concluded that the proposed tracker tracks reliably in various occlusion scenes. The reason is that the proposed tracker gives a target template similar to a candidate a greater probability of representing it, and thus obtains higher representation accuracy.

Fig. 4. Tracking results under occlusion

In practice, the tracked target is affected by a combination of conditions during tracking. The position error curves in Fig. 5 show that the proposed tracker tracks the target reliably under various tracking conditions. Compared with the other 4 trackers, the proposed tracker has the smallest average tracking error, as shown in Table 1.

Fig. 5. Position error curves

Table 1. Average tracking error (center point error, unit: pixel)

4 Conclusions

Aiming at complex tracking environments with illumination change, occlusion, pose change, rotation, and so on, a weighted joint sparse representation based tracking method is proposed, with 4 characteristics: (1) The similarity among candidates is captured by joint sparse representation. (2) Target templates are weighted to reflect their different degrees of similarity to each candidate. (3) The objective function is solved by a modified APG algorithm. (4) A simple template updating strategy is used to adapt to target appearance changes.