1 Introduction

Visual tracking is one of the cardinal problems in computer vision. It has recently been used in many practical applications, ranging from robot navigation, intelligent surveillance, medical imaging and augmented reality to human–computer interaction. Nevertheless, state-of-the-art methods are still far from achieving performance comparable to human ability. Trackers have to deal with several difficulties such as background clutter, serious or even complete occlusion, varying viewpoints and large pose changes [23]. Over the past years, many kinds of methods have been proposed to deal with these difficulties, and a rich literature in this field is available.

Firstly, many methods formulate tracking in probabilistic terms. Early works use the Kalman filter [13, 26] to provide solutions. Because of the limitations of the linear Gaussian model, the particle filter [20, 28] emerged; it is now widely used as a tracking framework in many methods, since it can approximate an arbitrary observation model with a stochastically generated set of weighted samples. Another probabilistic approach, mean-shift [6, 25, 27], has also prospered in the past years; it approximates the probability density with samples and searches for the peak using a gradient ascent algorithm. Secondly, in offline tracking applications, global trajectory optimization [4, 12, 18] can be applied because all frames are available before the tracker starts; global optimization is expected to yield smoother and more stable trajectories than online methods. Thirdly, supervised discriminative methods based on classification [3, 7, 21] have been exploited to solve tracking problems; they separate the object from the background by training discriminative classifiers and can achieve superior results. In [2], the proposed tracker uses multiple instance learning (MIL) instead of traditional supervised learning to avoid the drifting problem, leading to a more robust tracker. Furthermore, tracking by detection has received more attention in recent years. In tracking-learning-detection (TLD) [15, 16], a detector runs simultaneously with an optical-flow tracker and can reinitialize it if it fails.

In this paper, we design a robust visual tracking method based on incremental PCA (IPCA) [9, 10] and sparse representation [5, 14, 17]; it belongs to the first group above (probabilistic models). Specifically, the method incrementally learns a low-dimensional subspace representation of the target appearance. Then, to find the target in a frame, the sparse representation of the target appearance in the previous frame is computed over the target templates (including the mean image) and the trivial templates. The resulting sparse coefficients are used as a mask in the computation of the candidates' distances to the subspace in the particle filter framework, which helps eliminate the effect of occlusion, noise, and other disturbances and thus select the right candidate. Additionally, the coefficients are used to modify the updating scheme of the IPCA so that it reflects changes of the target appearance more accurately.

The rest of the paper is organized as follows. In the next section, the related work is reviewed. In Sect. 3, we detail our tracking algorithm, and in Sect. 4, we show experimental results illustrating the superiority of our method compared with another three state-of-the-art trackers.

2 Related work

The works most closely related to this paper are [23] and [22]. In [23], the authors proposed a tracking method that incrementally learns a low-dimensional subspace representation, adapting online to changes in the appearance of the target. The update scheme for the sample mean and the eigenbasis makes the method robust when tracking target objects that undergo large changes in pose, scale and illumination.

The work of [22] can be viewed as an extension of [24], in which sparse representation is employed for visual tracking with the intuition that the appearance of a tracked object can be sparsely represented by its appearances in previous frames. Specifically, to find the target in a new frame, each target candidate in the particle filter framework is sparsely represented in the space spanned by target templates and trivial templates. The candidate with the smallest projection error under its sparse representation is then chosen as the tracked target. Mei and Ling [22] report good performance in coping with occlusion, noise and other challenging issues.

Our paper differs from [23] in that we design a new template updating scheme based on sparse representation and modify the distance computation formula (the observation model), which makes the IPCA tracker more robust. Also, when using sparse representation, we design a new set of target templates and make use of the coefficients obtained over the whole template set (target templates and trivial templates) to cope with challenging cases such as occlusion and other disturbances.

3 Tracking algorithm

3.1 Incremental PCA

In [23], the authors extended the Sequential Karhunen–Loeve (SKL) algorithm [19], presenting a new incremental PCA algorithm that correctly updates the eigenbasis as well as the mean. Given a set of training images \(\{{{TI}_{1}},{{TI}_{2}},\ldots ,{{TI}_{n}}\}\), one can obtain the eigenvectors \(U\) by computing the singular value decomposition \(U\varSigma {{V}^{\mathrm{T}}}\) of the centered data matrix \([({{TI}_{1}}-\bar{TI}),\ldots ,({{TI}_{n}}-\bar{TI})]\), where \(\bar{TI}\) denotes the mean of the \(n\) images. Then, when \(m\) new images \(\{{{TI}_{n+1}},\ldots ,{{TI}_{n+m}}\}\) arrive, SKL computes \({U}^{\prime }\) and \({\varSigma }^{\prime }\) from the SVD of the concatenation \([A\;B]\), where \(A\) is the \(d\times n\) data matrix composed of the \(n\) observation vectors and \(B\) is the newly arrived \(d\times m\) data matrix. Building on SKL, [23] makes the computation slightly faster: it avoids performing QR decomposition on the entire matrix \([U\varSigma \;B]\) and instead orthogonalizes only \((B-U{U}^{\mathrm{T}}B)\). In addition, Ross et al. [23] take into account the sample mean of the training data, which changes over time as new data arrive. The mean update adopts a forgetting factor so that older observations carry less modeling power.
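For concreteness, the following is a minimal numpy sketch of one incremental update step in the spirit of [19, 23]; the function name, the truncation handling and the exact placement of the forgetting factor are our own simplifications, not the authors' implementation.

```python
import numpy as np

def ipca_update(U, S, mu, n, B, f=0.95):
    """One simplified incremental PCA step in the spirit of [19, 23].

    U : (d, k) current eigenbasis      S : (k,) singular values
    mu: (d,)  current sample mean      n : effective number of past samples
    B : (d, m) new observations        f : forgetting factor
    """
    m = B.shape[1]
    mu_B = B.mean(axis=1)
    mu_new = (f * n * mu + m * mu_B) / (f * n + m)
    # Center the new data; the extra column corrects for the moving mean.
    B_hat = np.hstack([B - mu_B[:, None],
                       np.sqrt(n * m / (n + m)) * (mu - mu_B)[:, None]])
    # Orthogonalize only the component of B_hat lying outside span(U).
    proj = U @ (U.T @ B_hat)
    B_tilde, _ = np.linalg.qr(B_hat - proj)
    # Small square matrix whose SVD yields the updated basis and values.
    R = np.block([[f * np.diag(S), U.T @ B_hat],
                  [np.zeros((B_tilde.shape[1], len(S))),
                   B_tilde.T @ (B_hat - proj)]])
    U_r, S_new, _ = np.linalg.svd(R, full_matrices=False)
    U_new = np.hstack([U, B_tilde]) @ U_r
    # In practice U_new, S_new would be truncated to the top-k components.
    return U_new, S_new, mu_new, f * n + m
```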

In our paper, we use the algorithm briefly described above to construct the target appearance model, i.e., the subspace spanned by \(U\) and centered at the mean. In addition, as detailed in Sect. 3.2.3, we design a new scheme to select the data suitable for updating the model when challenging cases such as occlusion and noise occur.

3.2 Sparse representation of a tracking target

In [22], each candidate in the particle filter framework is sparsely represented in the space spanned by target templates and trivial templates. The target templates are composed of several previous target images (10 images in [22]) and are updated according to a weighting scheme, which makes the algorithm time-consuming. In our paper, we therefore redesign the target templates and sparsely decompose only one target image, i.e., the target image in the previous frame or the last intact one. We then exploit the structure of the sparse coefficients to cope with occlusion, noise, etc., to control the update of the templates, and finally to modify the observation model in the particle filter framework.

3.2.1 Target templates and trivial templates

As depicted in Fig. 1 (adapted from Fig. 1 of [24]), two target templates are used for sparse representation. One is the fixed target template (\(t_f\)), which is manually selected from the first frame and normalized to zero mean and unit norm; it makes the tracking more stable. The other is the dynamical target template (\(t_d\)), which is simply the sample mean of the IPCA and is updated dynamically to capture target changes as the tracker proceeds. When the tracker captures the correct and intact target in a frame, it is incorporated into the dynamical target template as follows:

$$\begin{aligned} {{t}_{d}}=\frac{fn}{fn+m}{{t}_{d}}^{0}+\frac{m}{fn+m}{\bar{V}_{m}}, \end{aligned}$$
(1)

where \({t_d}^{0}\) and \(t_d\) are the dynamical target templates before and after the update, respectively; \(n\) denotes the number of frames processed so far; \(m\) denotes the number of newly arrived frames, which is set to 1 in our tracker; and \(\bar{V}_m\) denotes the mean of the target images in the \(m\) frames. \(f\) is a forgetting factor that reduces the effect of previously tracked target images and keeps the template consistent with the changing target. Specifically, with the update \(n \leftarrow fn + m\), the effective number of observations reaches equilibrium at \(n = fn+m\), i.e., \(n = m/(1-f)\). So, when \(f = 0.95\) and \(m = 1\), a new observation is included at each update, and the effective size of the observation history approaches \(n = 20\).
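A minimal sketch of this template update (Eq. 1); the function name is ours, and the effective sample count is carried alongside the template:

```python
import numpy as np

def update_dynamical_template(t_d, V_bar_m, n, m=1, f=0.95):
    """Eq. (1): blend the old template t_d with the mean V_bar_m of m new images."""
    w = f * n + m
    t_d_new = (f * n / w) * t_d + (m / w) * V_bar_m
    return t_d_new, w  # w is the updated effective number of observations

# With f = 0.95 and m = 1, iterating n <- 0.95 n + 1 converges to n = 20,
# so the template effectively averages roughly the last 20 accepted targets.
```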

Fig. 1
figure 1

A target image is sparsely represented by the target templates and trivial templates. The target templates are composed of a fixed target template and a dynamical target template, which is updated conditionally during the tracking process. The trivial templates are composed of positive and negative trivial templates, which are used to compensate for the effect of occlusion and noise (from the sequence “car” in Sect. 4.2, Fig. 5)

Each trivial template \(i_i\) in Fig. 1 is a vector with only one nonzero entry, i.e., \(I=[{{i}_{1}},{{i}_{2}},\ldots ,{{i}_{l}}]\) is an identity matrix. Both the target templates and the trivial templates have the same fixed size (in our paper, object images are normalized to \(12\times 15\) before being incorporated into the dynamical template), and the tracked image is sparsely represented after being normalized to this size.

3.2.2 Formulation of sparse representation

A tracking target lies approximately in the low-dimensional subspace spanned by training images of the target or by historical target images extracted from the preceding frames. Given a target template set \(T=[{{t}_{d}},{{t}_{f}}] \in {{R}^{l\times 2}}(l\gg 2)\), containing the two target templates reshaped into one-dimensional vectors, a tracking result \(y\in {{R}^{l}}\) can be approximated as a linear combination of the two target templates:

$$\begin{aligned} {y\approx Ta={{a}_{1}}{{t}_{d}}+{{a}_{2}}{{t}_{f}}}. \end{aligned}$$
(2)

where \(a={{[{{a}_{1}},{{a}_{2}}]}^{T}}\in {{R}^{2}}\) is called a target coefficient vector.

In many visual tracking scenarios, target objects are corrupted by noise, partially occluded, or even disappear completely and reappear. Such disturbances corrupt the linear representation of the tracking target \(y\); to compensate for the error they cause, Eq. (2) is rewritten as

$$\begin{aligned} {y=Ta+\varepsilon }, \end{aligned}$$
(3)

where \(\varepsilon \) is an error vector that compensates for the pixel-value differences between the tracking target and the target templates. Following the scheme in [24], we use the trivial templates \(I=[{{i}_{1}},{{i}_{2}},\ldots ,{{i}_{l}}]\in {{R}^{l\times l}}\) to reformulate Eq. (3) as

$$\begin{aligned} {y=\left[ T,I \right] \left[ \begin{matrix} a \\ e \\ \end{matrix} \right] }, \end{aligned}$$
(4)

where a trivial template \({{i}_{i}}\in {{R}^{l}}\) is a vector with only one nonzero entry, \(l\) is the number of pixels in a template (the product of its width and height) and \(e={{[{{e}_{1}},{{e}_{2}},\ldots ,{{e}_{l}}]}^{T}}\in {{R}^{l}}\) is called a trivial coefficient vector.

Since the templates most similar to the tracking target are positively related to it, the work of [22] imposes nonnegativity constraints on the coefficients to help filter out clutter that resembles the tracked target in reversed intensity patterns. Consequently, model (4) is extended to

$$\begin{aligned} {y=[T,I,-I]\left[ \begin{matrix} a \\ {{e}^{+}} \\ {{e}^{-}} \\ \end{matrix} \right] =Bc,\quad \mathrm{s.t.}\quad c\ge 0 }, \end{aligned}$$
(5)

where \({{e}^{+}}\in {{R}^{l}},{{e}^{-}}\in {{R}^{l}}\) are called a positive trivial coefficient vector and a negative trivial coefficient vector, respectively, \(B\) is the matrix of template set and \(c\) is a nonnegative coefficient vector.

We solve model (5) as an \({{\ell }^{1}}\)-regularized least squares problem, which is known to typically yield sparse solutions [24]:

$$\begin{aligned} {\min ||Bc-{{y}_{0}}||_{2}^{2}+\lambda ||c|{{|}_{1}},\quad \mathrm{s.t.}\quad c\ge 0}, \end{aligned}$$
(6)

where \(||.|{{|}_{1}}\) and \(||.|{{|}_{2}}\) denote the \({{\ell }^{1}}\) and \({{\ell }^{2}}\) norms, respectively, and \({{y}_{0}}\) denotes the target image in the previous frame or the last intact one (if the target disappears and reappears later). Our implementation solves Eq. (6) via an interior-point method based on [17], which uses the preconditioned conjugate gradients (PCG) algorithm to compute the search direction. We denote the resulting coefficient vector by \({{c}_{0}}\).
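As an illustration only, the following sketch solves Eq. (6) with scikit-learn's nonnegative Lasso as an off-the-shelf stand-in for the interior-point PCG solver of [17]; the function name and the regularization weight are our assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_decompose(t_d, t_f, y0, lam=0.01):
    """Eq. (6): min ||B c - y0||_2^2 + lam * ||c||_1  s.t.  c >= 0."""
    l = y0.size
    # B = [T, I, -I] with T = [t_d, t_f] as columns, as in model (5).
    B = np.hstack([t_d[:, None], t_f[:, None], np.eye(l), -np.eye(l)])
    # sklearn minimizes (1/(2l))||y - Bc||^2 + alpha*||c||_1, so alpha = lam/(2l).
    model = Lasso(alpha=lam / (2 * l), positive=True,
                  fit_intercept=False, max_iter=5000)
    model.fit(B, y0)
    c0 = model.coef_
    a, e_pos, e_neg = c0[:2], c0[2:2 + l], c0[2 + l:]
    return a, e_pos, e_neg
```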

3.2.3 Sparse coefficients

If we represent the tracking result in the previous frame as a linear combination of the template set composed of both target templates and trivial templates, a good result clearly leads to a sparse coefficient vector, because it is similar to at least one of the target templates and the coefficients corresponding to the trivial templates (the trivial coefficients) tend to be zero. In contrast, a bad result, or a result seriously occluded by another object, often leads to a dense representation in which the trivial coefficients are almost all nonzero (see Fig. 2).

Fig. 2
figure 2

Left: good (in red) and bad (in green) target candidates. Middle: good candidate represented by templates. Right: bad candidate represented by templates

If a target image is noisy or occluded, the coefficients of the trivial templates compensate for the effect of this disturbance, so we can exploit their spatial information. In the middle column of Fig. 3, the nonzero coefficients indicate the part of the face that is occluded or noisy. The coefficient map acts like a mask that can be used to eliminate or alleviate the disturbance during the tracking process; in Sect. 3.3, we illustrate how the mask is used.

Fig. 3
figure 3

Coefficients of two object images depicted in 3D space: the first row comprises an intact face image and its sparse coefficients, and the second row comprises an occluded face and its coefficients, some of which are significant in the occluded area

Now we explain how to control the update of the IPCA model and of the dynamical target template (which is simply the sample mean in IPCA). As mentioned in Sect. 3.1, we only need to select, from the newly arrived images, those suitable for updating the model. Clearly, if the tracker drifts to a wrong object or to the background, or tracks the right target while it is seriously or even completely occluded, the images are not suitable for the update. From Figs. 2 and 3, we can see that the mask of an occluded or wrongly tracked object image is dense, while that of an intact image is sparse. So the number of mask elements approaching \(1\) can be used as an evaluation criterion: only if this number is below a threshold (0.5\(\times \) the size of the mask) is the object image used to update the model.
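A minimal sketch of this update gate; the cutoff used to decide that an element "approaches 1" is our illustrative assumption:

```python
import numpy as np

def should_update(e_pos, e_neg, near_one=0.5, tau=0.5):
    """Accept the tracked image for the IPCA/template update only if the
    trivial-coefficient mask is sparse enough (Sect. 3.2.3)."""
    mask = e_pos + e_neg                    # per-pixel trivial coefficients
    dense_count = np.sum(mask > near_one)   # elements approaching 1 (assumed cutoff)
    return dense_count < tau * mask.size    # threshold: 0.5 x size of the mask
```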

Finally, we argue in what follows that the dynamical target template used here is more effective and stable than the templates designed in [22].

$$\begin{aligned} p({I_t}|{X_t})&= {p_{{d_t}}}({I_t}|{X_t}){p_{{w_t}}}({I_t}|{X_t})\nonumber \\&= \mathrm{N}({I_t};\mu ,U{U^\mathrm{T}} + \varepsilon I)\mathrm{N}({I_t};\mu ,U{\varSigma ^{ - 2}}{U^\mathrm{T}}).\end{aligned}$$
(7)
$$\begin{aligned} p({I_t}|{X_t})&= \exp ( - ||({I_t} - \mu ).*{(i-c_0)}\nonumber \\&\quad - U{U^\mathrm{T}}(({I_t} - \mu ).*{(i-c_0)})|{|^2}). \end{aligned}$$
(8)

Our update scheme is clearly simpler than that of [22] and can tolerate several wrong tracking results with sparse coefficients. In a very complex background, there may exist an image block that is very similar to the target image, so it may be updated into the target template because its representation is sparse. This, however, is not fatal to the tracker, because the template integrates many target images from previous frames. In [22], a wrong template in the target template set very probably causes the tracker to lock onto it, and such a wrong template is hard to remove from the set under the proposed algorithm. The tracking results in Fig. 5 demonstrate this.

3.3 Particle filter and its observation model

In the particle filter, a number of particles are propagated to follow the moving object. In this work, the particle state at time \(t\) consists of the six parameters of an affine transformation \({{X}_{t}}=({{x}_{t}},{{y}_{t}},{{\theta }_{t}},{{s}_{t}},{{\alpha }_{t}},{{\phi }_{t}})\), where \(x_t, y_t , {\theta }_{t}, {s}_{t}, {\alpha }_{t}, {\phi }_{t}\) denote the \(x, y\) translation, rotation angle, scale, aspect ratio and skew direction at time \(t\). In the dynamical model of \(X_t\), each parameter is modeled independently by a Gaussian distribution around its counterpart in \(X_{t-1}\). The observation model is the key module: it assigns a similarity weight to each candidate, i.e., it accomplishes the task of matching the candidates.
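A minimal sketch of this dynamical model; the particle count and the per-parameter standard deviations are illustrative assumptions, not values from the paper:

```python
import numpy as np

def propagate_particles(X_prev, sigmas, n_particles=600):
    """Sample candidate affine states (x, y, theta, s, alpha, phi) by perturbing
    each parameter independently with Gaussian noise around X_prev."""
    return X_prev + np.random.randn(n_particles, 6) * sigmas

# Example (assumed noise levels): larger sigmas for translation, small for shape.
particles = propagate_particles(np.array([120., 80., 0., 1., 1., 0.]),
                                sigmas=np.array([4., 4., .02, .01, .002, .001]))
```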

Fig. 4
figure 4

The quantitative results (the center distance at intervals of five frames and the mean distance) for car [23], sylv [2], girl [2], faceocc2 [2], pets09_1 [8] and pets09_2 [8]

Fig. 5
figure 5

Picture results on the “car” sequence in Fig. 4. The first row is the result of IPCA, the second row is the result of the \({{\ell }^{1}}\)-tracker and the third row is our tracker. The \({{\ell }^{1}}\)-tracker (second row) loses the target because wrong templates are updated into its template set, whereas the IPCA tracker and our tracker succeed

Observation model: in [23], the probability of a candidate being generated from the subspace spanned by \(U\) and centered at \(\mu \) is inversely proportional to the distance \(d\) from the candidate to the reference point (\(\mu \)) of the subspace, which can be decomposed into the distance to the subspace, \(d_t\), and the distance within the subspace from the projected candidate to the subspace center, \(d_w\). The equation is formulated in (7), where \(I_t\) is an image patch predicted by \(X_t\). For simplicity, we only use \(d_t\) to measure \(p\), which is reasonable under the assumption detailed in subsection 3.2.3 of [23]. Our work below can also easily be extended to incorporate \(d_w\).

It can be shown [1] that the negative exponential distance from \(I_t\) to the subspace spanned by \(U\), i.e., \(\exp ( - ||({I_t} - \mu ) - U{U^\mathrm{T}}({I_t} - \mu )|{|^2})\), is proportional to \(p(I_t|X_t)\) as \(\varepsilon \rightarrow 0\). So the coefficient vector \(c_0\) (the mask) obtained in Sect. 3.2.2 can be used to modify this observation model, as formulated in Eqs. (7) and (8).

As detailed above, the mask is computed for the tracked target image in the previous frame. It can then be used to strengthen the matching process in the current frame, since the appearance change of the target between neighboring frames is not too large. The mask acts as a weight map that indicates the occluded or noisy fractions of each candidate. The formulation is displayed in (8), in which \((i-c_0)\) helps eliminate the effect of disturbances; \(i\) and .* denote a vector of all ones and the element-wise (Hadamard) product, respectively.
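A minimal numpy sketch of the masked likelihood of Eq. (8); the function name is ours, and c0_trivial stands for the trivial-coefficient part of \(c_0\):

```python
import numpy as np

def observation_likelihood(I_t, mu, U, c0_trivial):
    """Eq. (8): distance-to-subspace likelihood with the sparse-coefficient mask."""
    w = 1.0 - c0_trivial          # (i - c0): down-weight occluded/noisy pixels
    r = (I_t - mu) * w            # masked, mean-centered candidate patch (flattened)
    d = r - U @ (U.T @ r)         # residual component outside span(U)
    return np.exp(-np.sum(d ** 2))
```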

The experiments demonstrate the significant improvement of our tracker in coping with various challenging cases such as serious occlusion and noise. We also provide the code in the supplementary material; it can be run to verify the performance. We thank the authors of [23] and [22], whose code we built upon.

4 Experiments

We compare our algorithm with three state-of-the-art tracking methods on several well-chosen sequences. The methods for comparison are the IPCA tracker [23], the \({{\ell }^{1}}\)-minimization tracker [22] and the multiple instance learning (MIL) tracker [2]. All experiments were run on a computer with a 2.8 GHz P4 CPU and 2 GB of memory. The speeds of the trackers are listed in Table 1. Our tracker is almost as fast as IPCA because the time consumed by the sparse representation of just one target is negligible. In the following experiments, we adopt two evaluation methods to compare the performance of the four trackers: one is the center distance of the tracking rectangle to the ground truth, and the other is the percentage of frames tracked correctly. We also display picture results for some sequences at the end for qualitative comparison.

Table 1 The speeds of the four trackers under typical conditions (for indication only)

4.1 Evaluation methods

The two evaluation methods mentioned above are detailed below. The first is the center distance of the tracking rectangle to the ground truth, computed every five frames (as the ground truth is labeled at intervals of five frames); we also compute the mean of these distances to summarize the performance of each tracker. The second is the percentage of frames tracked correctly, for which we use the PASCAL challenge [11] object detection score. Given the detected bounding box \(\mathrm{ROI}_{D}\) and the ground truth bounding box \(\mathrm{ROI}_{\mathrm{GT}}\), the overlap score is evaluated as

$$\begin{aligned} \mathrm{score} = \frac{\mathrm{area}(\mathrm{ROI}_{D} \cap \mathrm{ROI}_{\mathrm{GT}})}{\mathrm{area}(\mathrm{ROI}_{D} \cup \mathrm{ROI}_{\mathrm{GT}})}. \end{aligned}$$
(9)

We count a frame as a true positive if the score exceeds 0.5.
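For reference, a minimal sketch of Eq. (9) for axis-aligned boxes given as (x, y, w, h); the function name is ours:

```python
def overlap_score(roi_d, roi_gt):
    """PASCAL overlap score (Eq. 9): intersection area over union area."""
    x1 = max(roi_d[0], roi_gt[0])
    y1 = max(roi_d[1], roi_gt[1])
    x2 = min(roi_d[0] + roi_d[2], roi_gt[0] + roi_gt[2])
    y2 = min(roi_d[1] + roi_d[3], roi_gt[1] + roi_gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = roi_d[2] * roi_d[3] + roi_gt[2] * roi_gt[3] - inter
    return inter / union if union > 0 else 0.0

# A frame is counted as correctly tracked when overlap_score(...) > 0.5.
```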

4.2 Performance of four trackers

We select seven sequences involving different kinds of challenging situations; the first six are publicly available, and the last is a video of a moving car taken on an urban road. In Fig. 4, we depict the quantitative results of the first six sequences. Table 2 indicates that our tracker achieves the best performance and can cope with serious occlusions caused by moving objects.

Table 2 The mean distance and the percentage of frames correctly tracked

Next, we depict picture results for some of the sequences and add a video shot on an urban road in which a car is seriously occluded by another one. In Figs. 5, 6, 7, 8 and 9, the three rows are the results of IPCA, the \({{\ell }^{1}}\)-tracker and our tracker, respectively. For the MIL tracker, we have only the coordinates of the tracking trajectories, not picture results.

Fig. 6
figure 6

Picture results on the “faceocc2” sequence in Fig. 4. The first row is the result of IPCA, the second row is the result of the \({{\ell }^{1}}\)-tracker and the third row is our tracker, which performs best. By using sparse representation, our tracker can eliminate the effect of the disturbance caused by the occluded area and locate the object accurately

Fig. 7
figure 7

Picture results on the “pets09_1” sequence in Fig. 4. A woman is walking against the stream of people and is sometimes seriously occluded. Only our tracker (the third row) succeeds, while the IPCA tracker (the first row) and the \({{\ell }^{1}}\)-tracker (the second row) fail to track the target

Fig. 8
figure 8

Picture results on the “pets09_2” sequence in Fig. 4. The first row is the result of IPCA, the second row is the result of the \({{\ell }^{1}}\)-tracker and the third row is our tracker. In the scene, a man is walking across a flow of people with serious occlusions and appearance changes. Only our tracker (the third row) succeeds: IPCA loses the man when serious occlusion occurs, and the \({{\ell }^{1}}\)-tracker locks onto a wrong object (its discriminative ability is weaker than IPCA's in this case)

Fig. 9
figure 9

Picture results of the video taken on an urban road, in which the tracked black car is seriously occluded by a white car and then reappears. The first row is the result of IPCA, the second row is the result of the \({{\ell }^{1}}\)-tracker and the third row is our tracker. Our tracker (the third row) can track the target car and re-find it after the occlusion, while the other two fail

Finally, we use the results of Fig. 5 to demonstrate the claim made at the end of Sect. 3.2.3, i.e., that our template updating scheme works better than that of the \({{\ell }^{1}}\)-tracker. We can see that the \({{\ell }^{1}}\)-tracker loses the object because wrong templates enter its template set, whereas our tracker succeeds because it can tolerate some inaccurate templates (with sparse coefficients) while definitely rejecting wrong templates (with dense coefficients), and consequently prevents drifting.

5 Conclusions and future work

In this paper, we propose a tracking method based on IPCA and sparse representation. The mask derived from the sparse representation is used to eliminate or alleviate the effect of noise, occlusion, etc., during tracking and to control the update of the IPCA appearance model, making it better able to capture appearance changes. Our tracker works stably in challenging situations such as cluttered background, serious or even complete occlusion and large appearance change. In the future, we plan to combine a detector with the tracker to make it more robust, especially for handling the case in which the object disappears in one place and reappears in another.