1 Introduction

As one of the popular branches of computer vision, visual object tracking has been widely used in various fields, such as military applications, traffic control, security systems and human-computer interaction, and it has developed rapidly thanks to the progress of deep learning in recent years. Single-target tracking can be described as follows: an arbitrary target is specified by a bounding box in the first frame, and the tracker predicts bounding boxes in subsequent frames. Although great progress has been made in the past decades, visual object tracking remains a challenging task owing to complicated and volatile interferences such as illumination variation, scale variation, partial or full occlusion, motion blur, background clutter and deformation.

Features that can effectively distinguish the target from its surrounding background play a significant role in visual tracking. Trackers [10, 12, 16, 17, 30, 31] based on hand-crafted features have addressed the above challenges to some extent and can run at high speed. However, hand-crafted features are usually designed for specific scenarios, so they become less accurate or robust in more complex scenarios, which can lead to tracking failure. Recently, many trackers [4, 21, 23, 24, 27, 28] based on deep learning have been proposed and have made great progress in terms of accuracy and robustness. These trackers all use features extracted by Convolutional Neural Networks (CNNs) to represent the tracking target.

Despite the advantages of CNNs, some thorny issues remain. For example, a large number of annotated samples is required for the supervised training of deep CNNs, which have millions of parameters, and the lack of training samples is especially severe in online visual object tracking, since only one annotated sample is provided in the first frame of each video. Moreover, computing features with CNNs is more expensive than computing hand-crafted features. Correlation filters, which turn convolution into element-wise multiplication in the Fourier domain to accelerate processing, have been widely used in tracking [5, 10, 24], but the desired outputs of the correlation filters in these trackers are either the same for all training samples or designed improperly, which weakens the correlation between the filters and the tracking targets.

In this paper, we divide tracking into two parts: target location estimation and target scale estimation, both implemented with correlation filters. In the training stage, we design the desired output of the correlation filters carefully to obtain superior filters. The contributions of this paper are summarized as follows: (i) we comprehensively analyze the diversity of features from different layers of a CNN as well as the differences between hand-crafted features and features extracted by CNNs, and we design two correlation filters to exploit these features effectively; (ii) we propose a novel tracker for single-target tracking based on these correlation filters which only needs to update the correlation filters dynamically instead of fine-tuning pre-trained deep models. We conducted extensive experiments on the OTB-15 benchmark [29], and the results demonstrate that our algorithm outperforms several state-of-the-art trackers.

The rest of this paper is organized as follows. Section 2 reviews work related to our algorithm. Section 3 introduces our algorithm in detail and Sect. 4 presents the experimental results of our evaluation of different trackers. Finally, Sect. 5 concludes our work.

2 Related Work

The use of features extracted by CNNs has shown great effectiveness for computer vision tasks in recent years, such as segmentation [8] and classification [20], bringing considerable improvements. However, the computational cost of extracting these features is much higher than that of hand-crafted features, and much research has been devoted to improving computational efficiency. Correlation filters, which have played a significant role in signal processing since the 1980s [15, 22] and allow many objective functions to be solved in the Fourier domain, are widely used in visual tracking to speed up trackers owing to their high computational efficiency.

In [6], Bolme et al. proposed a new type of correlation filter called Average of Synthetic Exact Filters (ASEF), which performed well in some specific tasks [6, 7]. However, a large number of samples is required to train ASEF. The following year, Bolme et al. modified ASEF and proposed the Minimum Output Sum of Squared Error (MOSSE) filter for tracking [5], which achieved remarkable performance at a high processing speed. Both ASEF and MOSSE are single-channel correlation filters. Henriques et al. [14] proposed an analytic model named KCF for datasets consisting of thousands of translated patches, using the concept of circulant matrices. For linear regression this model is equivalent to a correlation filter, but it is also suitable for non-linear regression. Moreover, KCF can be extended to a multi-channel correlation filter. The work in [19] also investigated multi-channel correlation filters, which makes it possible for correlation filters to be used more widely.

Danelljan et al. proposed a concise correlation-filter-based tracker called DSST [10], which inspired our research. The highlight of DSST is its approach to scale estimation. However, from our observation of DSST, we found that the desired outputs of its correlation filters are designed improperly, which is explained in detail in Sect. 3. The tracker HDT [24] exploits features from different layers of a CNN through a correlation filter to localize the tracking target, but HDT is limited to location estimation only, which leads to poor performance on video sequences with severe scale variations. Moreover, its desired correlation filter output is fixed from the first frame, which further worsens its performance.

3 Tracking Based on CNN and Correlation Filters

Here we describe our algorithm TCCF (Tracking based on CNN and Correlation Filters) in detail. Before that, we first introduce the features used for target location estimation and scale estimation.

3.1 Feature Selection

Hand-crafted features, HOG [9] features for example, do well in representing the texture and edges of the tracking target. As shown in Fig. 1, different targets are all described clearly by HOG features. The drawback of hand-crafted features is that they cannot effectively distinguish the tracking target from other objects of the same category (refer to the feature map at the intersection of the third row and second column).

Fig. 1. Feature maps for different tracking targets. From left to right: the first column shows the input images, the second shows the visualized HOG feature maps, and the remaining columns show feature maps extracted by VGG-16 from the conv2-2, conv3-3 and conv4-3 layers respectively; each feature map shown for a layer is the average over all of its channels.

Recently, deep CNN models [20, 25, 26] trained on ImageNet [11] have been widely used in many computer vision tasks and achieved great success. The features extracted by CNNs are more discriminative than hand-crafted features (refer to Fig. 1). Moreover, the features extracted by a CNN vary from layer to layer: as shown in Fig. 1, shallower layers capture generic information about the target, while deeper layers capture semantic information. Wang et al. also studied these differences between layers [27].

Our tracking algorithm is divided into two parts, target location estimation and target scale estimation, which are implemented independently. Since features extracted by CNNs separate the target from the background more effectively than hand-crafted features, and features from different layers are complementary, they are used by one correlation filter for location estimation. Once the location of the target is determined, hand-crafted (specifically, HOG) features are used by another correlation filter for scale estimation, since they represent the texture and edges of the target better than features extracted by CNNs.

3.2 Correlation Filters

The structure of our proposed method is shown in Fig. 2; online tracking is divided into two parts. A Location Correlation Filter (LCF) is used for location estimation, while a Scale Correlation Filter (SCF) is used for scale estimation. Both LCF and SCF are multi-channel correlation filters. Here, we introduce the multi-channel correlation filter used in our algorithm.

Fig. 2. The structure of TCCF.

Let \(x^{t}\), a multi-channel signal, denote the features extracted from the given training sample, \(y^{t}\) denote the desired output of the correlation filter and \(f^{t}\) denote the correlation filter we want to obtain. The upper-case variants are \(X^{t} = \mathcal {F}(x^{t})\), \(Y^{t} = \mathcal {F}(y^{t})\) and \(F^{t} = \mathcal {F}(f^{t})\), where \(\mathcal {F}(\cdot )\) denotes the Discrete Fourier Transform (DFT). \(y^{t}\) is pre-defined according to the specific problem being handled. The correlation filter \(f^{t}\) is an ensemble of C weak filters, where C is the number of channels. In the Fourier domain, \(F^{t}\) can be computed by minimizing:

$$\begin{aligned} F^{t} = arg \min _{F^{t}}||Y^t - \sum _{c = 1}^{c = C} F^{t}_{c} \odot X^{t}_{c}||^2 + \lambda \sum _{c=1}^{c=C}||F^{t}_{c}||^2 \end{aligned}$$
(1)

where the subscript c denotes the component in the \(c\)-th channel, the parameter \(\lambda \) in the second term is a regularizer, and the symbol \(\odot \) denotes element-wise multiplication. The solution to Eq. (1) is:

$$\begin{aligned} F^{t}_{c} = \frac{Y^t \odot {\bar{X}^{t}_{c}}}{\sum _{c=1}^{c=C}X^{t}_{c} \odot \bar{X}^{t}_{c}+ \lambda } \end{aligned}$$
(2)

where the division is performed element-wise and \(\bar{X}^t_c\) denotes the complex conjugate of \(X^t_c\). The first term in the denominator is the power spectrum of \(x^{t}\). From Eq. (2) we can see that once the training sample \(x^{t}\) and the regularizer \(\lambda \) are fixed, the filter is directly controlled by \(y^{t}\).

Given a test sample t, we first transform it into the Fourier domain to obtain T; the response of t is then computed by:

$$\begin{aligned} r = \mathcal {F}^{-1}(\sum _{c = 1}^{c = C}T_{c} \odot F^{t}_{c}) \end{aligned}$$
(3)

where \(\mathcal {F}^{-1}(\cdot )\) denotes the inverse DFT (IDFT).
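To make Eqs. (2) and (3) concrete, the following is a minimal NumPy sketch of training a multi-channel correlation filter and evaluating its response. The array layout, helper names and the value of \(\lambda \) are illustrative assumptions of ours, not the paper's MATLAB/Caffe implementation.

```python
import numpy as np

def train_mccf(x, y, lam=1e-2):
    """Train a multi-channel correlation filter in the Fourier domain (Eq. 2).

    x : (C, M, N) real-valued feature map of the training sample
    y : (M, N)    desired Gaussian-shaped output
    Returns the filter F of shape (C, M, N) in the Fourier domain.
    """
    X = np.fft.fft2(x, axes=(-2, -1))                    # per-channel DFT
    Y = np.fft.fft2(y)
    numerator = Y[None, :, :] * np.conj(X)               # Y elementwise conj(X_c) per channel
    denominator = np.sum(X * np.conj(X), axis=0) + lam   # power spectrum + lambda
    return numerator / denominator[None, :, :]

def response(F, z):
    """Response of a test feature map z of shape (C, M, N) under filter F (Eq. 3)."""
    Z = np.fft.fft2(z, axes=(-2, -1))
    return np.real(np.fft.ifft2(np.sum(Z * F, axis=0)))
```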

To simplify the proposed model and reduce the computational cost, we adopt an incremental update scheme as in [5, 10, 24], which uses only the current frame to partially update the previous correlation filters during online tracking. Given the \(t\)-th frame of a video sequence, let \(p^t\) and \(s^t\) denote the position and size of the target in this frame, as predicted by the tracker. \(F^t\) is updated as follows:

$$\begin{aligned} F^{t}_{c} = \frac{A^{t}_{c}}{B^{t}} = \frac{(1 - \eta ) A^{t-1}_{c} + \eta \hat{A}^{t}_{c}}{(1 - \eta ) B^{t-1} + \eta \hat{B}^{t}} \end{aligned}$$
(4)

where

$$\begin{aligned} \hat{F}^{t}_{c} = \frac{\hat{A}^{t}_{c}}{\hat{B}^{t}}=\frac{Y^{t} \odot {\bar{X}^{t}_{c}}}{\sum _{c=1}^{c=C}X^{t}_{c} \odot \bar{X}^{t}_{c}+ \lambda } \end{aligned}$$
(5)

and the parameter \(\eta \) is the learning rate of the correlation filters.
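The incremental update of Eqs. (4) and (5) can be kept cheap by storing the numerator and denominator separately and blending them with the learning rate. Below is a sketch under the same assumptions as the previous snippet; the per-channel numerator and shared denominator are our reading of Eqs. (4)-(5).

```python
import numpy as np

def update_filter(A_prev, B_prev, x, y, lam=1e-2, eta=0.00902):
    """Blend the previous filter numerator/denominator with the current frame (Eqs. 4-5).

    A_prev : (C, M, N) accumulated numerator from frame t-1
    B_prev : (M, N)    accumulated denominator from frame t-1
    x, y   : current training-sample features and desired output
    Returns the updated (A, B) and the resulting filter F_c = A_c / B.
    """
    X = np.fft.fft2(x, axes=(-2, -1))
    Y = np.fft.fft2(y)
    A_hat = Y[None, :, :] * np.conj(X)                  # numerator of the current frame
    B_hat = np.sum(X * np.conj(X), axis=0) + lam        # denominator of the current frame
    A = (1.0 - eta) * A_prev + eta * A_hat
    B = (1.0 - eta) * B_prev + eta * B_hat
    return A, B, A / B[None, :, :]
```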

Fig. 3. Left: average success plots of two trackers. Middle: average success plots of three trackers. Right: average success score as a function of the standard deviation of \(y^t_s\).

Location Correlation Filter: Since features extracted by a pre-trained CNN are used in LCF, \(x^{t}\) and \(f^{t}\) are three-dimensional, i.e. \(x^{t}, f^{t} \in \mathfrak {R}^{M\times N\times C}\). Let \(y^t_l \in \mathfrak {R}^{M \times N}\) denote the desired output of LCF; it is a 2-D Gaussian-shaped distribution determined by the mean \(\mu ^t_l\) and standard deviation \(\delta ^t_l\). Suppose features from K convolutional layers are used in our algorithm; then there are K independent correlation filters in LCF, which means:

$$\begin{aligned} \text {LCF} = \{F^{k,t}|k = 1,2,\ldots ,K \} \end{aligned}$$
(6)

Each \(F^{k,t}\) has a weight \(w^{k}\), with \(\sum _{k=1}^{k=K}w^{k} = 1\). The location of the target predicted by \(F^{k,t}\) is the coordinate \((m^k,n^k)\) of the maximum value in the response \(r^k\). The final target location is computed as:

$$\begin{aligned} (m,n) = \sum _{k=1}^{k=K}w^{k} \cdot (m^k,n^k) \end{aligned}$$
(7)

where the symbol \(\cdot \) denotes scalar multiplication. Once the final target location is predicted, there is a loss between \((m^k,n^k)\) and \((m,n)\), which reflects the stability of \(F^{k,t}\), and the weight \(w^{k}\) is updated according to this stability. Please refer to [24] for more details.
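As an illustration of Eq. (7), the fusion of the per-layer estimates can be written as below; the weight update from the per-layer losses follows HDT [24] and is omitted here, and the function name is ours.

```python
import numpy as np

def fuse_locations(responses, weights):
    """Weighted fusion of per-layer location estimates (Eq. 7).

    responses : list of K 2-D response maps r^k, one per convolutional layer
    weights   : length-K weights w^k summing to 1
    Returns the fused location (m, n) and the per-layer peaks (m^k, n^k).
    """
    peaks = np.array([np.unravel_index(np.argmax(r), r.shape) for r in responses],
                     dtype=float)                       # (m^k, n^k) for each layer
    m, n = (np.asarray(weights)[:, None] * peaks).sum(axis=0)
    return (m, n), peaks
```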

It should be noted that the mean of \(y^{t}_l\) is set to 0 and the standard deviation \(\delta ^t_l\) is proportional to the target size \(s^{t}\), i.e.:

$$\begin{aligned} \delta ^t_l \varpropto s^t \end{aligned}$$
(8)

which means the desired output \(y^t_l\) of the location correlation filter is controlled by \(\delta ^t_l\) and is dynamically updated to adapt to scale variations of the target. In HDT [24] and DSST [10], by contrast, the desired outputs of the correlation filters are fixed from the first frame of a video sequence, which has a negative impact on the performance of those trackers. To see why this matters, suppose we choose a reference system \(\phi \) in the image from the perspective of the tracking target, in which the target moves a distance D, and a reference system \(\phi '\) on the screen from the perspective of the observer, in which the same motion corresponds to a distance \(D'\). Location estimation is carried out in \(\phi '\), and for a fixed D, the larger \(s^t\) is, the larger \(D'\) becomes (and vice versa); hence location estimation depends on the size of the target.
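The dynamically updated label of Eq. (8) can be generated as follows. The proportionality constant and the use of a separate standard deviation per axis are our own illustrative assumptions; the paper only states that \(\delta ^t_l\) is proportional to \(s^t\).

```python
import numpy as np

def lcf_label(shape, target_size, kappa=0.1):
    """2-D Gaussian desired output y_l^t with sigma proportional to the target size (Eq. 8).

    shape       : (M, N) spatial size of the feature map
    target_size : (h, w) current target size measured in feature-map cells
    kappa       : hypothetical proportionality constant (not given in the paper)
    """
    M, N = shape
    h, w = target_size
    sigma_m, sigma_n = kappa * h, kappa * w
    mm, nn = np.meshgrid(np.arange(M) - M // 2, np.arange(N) - N // 2, indexing='ij')
    y = np.exp(-0.5 * ((mm / sigma_m) ** 2 + (nn / sigma_n) ** 2))
    return np.fft.ifftshift(y)    # put the peak at the origin, i.e. zero mean
```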

Fig. 4. Qualitative results of the proposed TCCF tracker and 9 other trackers on a subset of the OTB-15 benchmark. From left to right and top to bottom: Basketball, Biker, BlurOwl, CarDark, Bolt, Car1, RedTeam, Deer, Walking2, Human4, Singer2, Surfer. Two frames of each video are presented.

Scale Correlation Filter: To implement scale estimation, we pre-define a set of scale factors \(\{\alpha _l = \theta ^{\lceil \frac{L}{2}\rceil -l}|l = 1,2,\ldots ,L\}\), where \(\theta >1\) is the scale step. Given a training sample, we first extract L rectangles of interest of size \(\alpha _l \cdot s^t\), where \(s^{t}\) denotes the size of the target in this training sample. We then build a feature map \(M^t \in \mathfrak {R}^{C \times L}\) from these rectangles of interest, with each column of \(M^t\) corresponding to one rectangle. Let \(x^{t}_{c} \in \mathfrak {R}^{1\times L}\) denote the \(c\)-th row vector of \(M^t\) and \(y^t_s\) denote the desired output of SCF; then SCF can be obtained from Eq. (2). \(y^t_s\) is a 1-D Gaussian-shaped distribution with mean \(\mu ^{t}_s = 0\). The target size \(s'\) in a test sample is determined by:

$$\begin{aligned} s' = \alpha _i \cdot s^t \end{aligned}$$
(9)

where \(\alpha _{i}\) is the scale factor whose index i corresponds to the maximum value in the response r.
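The scale factor set and the arg-max pick of Eq. (9) are sketched below, using the values \(L = 33\) and \(\theta = 1.02\) reported in Sect. 4; the HOG feature extraction that fills each column of \(M^t\) is omitted, and the helper names are ours.

```python
import numpy as np

def scale_factors(L=33, theta=1.02):
    """Pre-defined scale factors alpha_l = theta^(ceil(L/2) - l), l = 1..L."""
    return theta ** (np.ceil(L / 2.0) - np.arange(1, L + 1))

def estimate_scale(scf_response, prev_size, L=33, theta=1.02):
    """Choose the new target size from the SCF response (Eq. 9).

    scf_response : length-L response r of the scale correlation filter
    prev_size    : (h, w) target size from the previous frame
    """
    alphas = scale_factors(L, theta)
    i = int(np.argmax(scf_response))
    return alphas[i] * np.asarray(prev_size, dtype=float)
```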

Inspired by the effectiveness of dynamically updating \(y^{t}_l\), we also kept \(y^{t}_s\) dynamically updated in the manner of Eq. (8), but the experimental results show that the dynamical update of \(y^{t}_s\) reduces the performance of the tracker, which is the opposite of what we expected.

Here we give an explanation. Unlike location estimation, which is carried out in \(\phi '\), scale estimation only has to find an optimal scale factor \(\alpha _i\), which is independent of both \(\phi \) and \(\phi '\). Since the scale variation between two consecutive frames is usually small, the probability of a severe scale change between consecutive frames is low; therefore \(y^{t}_s\) should be independent of the target size and related instead to the number of scale factors L:

$$\begin{aligned} \delta ^t_s \varpropto L \end{aligned}$$
(10)

Table 1. Average precision scores on different attributes: Illumination Variation (IV), Occlusion (OCC), Deformation (DEF), Out-of-Plane Rotation (OPR), Background Clutters (BC), Scale Variation (SV), Motion Blur (MB), Fast Motion (FM), Out-of-View (OV), Low Resolution (LR), In-Plane Rotation (IPR).

4 Experiments

The proposed algorithm is implemented in MATLAB with the Caffe framework [18] and runs at 3.5 fps on an Ubuntu 14.04.3 machine with a 3.0 GHz Intel i7-5960X CPU and an Nvidia GM2000 TITAN X GPU. VGG-16 is used as the pre-trained CNN in our experiments, and its last 6 convolutional layers are used to extract features. We use \(L = 33\) and \(\theta = 1.02\) for scale estimation, and the learning rate \(\eta \) is set to 0.00902.

We use the one-pass evaluation (OPE) metric on the first 50 video sequences of the OTB-15 benchmark [29] to evaluate the trackers. According to different challenging factors, such as illumination variation, occlusion and deformation, these video sequences are tagged with 11 attributes, which makes it possible to evaluate the trackers thoroughly.

Inspired by DSST [10], we first construct two naive trackers, TCCFn1 and TCCFn2, based on LCF to illustrate the effectiveness of the dynamical update of \(y^{t}_l\). In TCCFn1, \(y^{t}_l\) is fixed from the first frame, while in TCCFn2 it is dynamically updated according to Eq. (8). As shown on the left of Fig. 3, TCCFn2 obtains a 1.2% improvement, which demonstrates the effectiveness of dynamically updating \(y^{t}_l\). We also construct another tracker, TCCFn3, in which both \(y^t_l\) and \(y^t_s\) are dynamically updated. The success scores of TCCFn2 and TCCFn3 are shown in the middle of Fig. 3, from which we can see that the dynamical update of \(y^{t}_s\) reduces the performance of the tracker. To find the optimal \(y^{t}_s\) according to Eq. (10), we conduct extensive experiments with a variable-controlling method and obtain the curve shown on the right of Fig. 3, from which we find the optimal standard deviation of \(y^t_s\) and then construct the final TCCF tracker depicted in the middle of Fig. 3.

We compare the proposed TCCF tracker with ten other trackers: CSK [13], Frag [1], L1APG [2], Staple [3], DSST [10], KCF [14], FCNT [27], HDT [24], SiamFC [4] and STCT [28], and perform both qualitative and quantitative evaluation. Qualitative results are shown in Fig. 4, from which we can see that our approach efficiently handles challenging factors such as deformation, motion blur, scale variation and background clutter. Quantitative results are shown in Tables 1 and 2, where the trackers are compared on every attribute. In Table 1, all values are obtained at the threshold of 20 pixels; in Table 2, all values are computed using the AUC (Area Under Curve) metric. The first, second and third best trackers are highlighted. From Tables 1 and 2 we can see that TCCF performs well across different attributes, which demonstrates the effectiveness of our correlation filters.

Table 2. Average success scores on different attributes: Illumination Variation (IV), Occlusion (OCC), Deformation (DEF), Out-of-Plane Rotation (OPR), Background Clutters (BC), Scale Variation (SV), Motion Blur (MB), Fast Motion (FM), Out-of-View (OV), Low Resolution (LR), In-Plane Rotation (IPR).

Fig. 5. Average precision plots and success plots of different trackers tested over 50 video sequences. On the left, trackers are ranked according to the precision score at the threshold of 20 pixels. On the right, trackers are ranked according to the area under curve.

We also use the precision and success plots in Fig. 5 to evaluate all trackers. The precision plot reports the percentage of frames in which the distance between the ground-truth target center and the predicted center is within a given threshold. The success plot reports the percentage of frames in which the overlap ratio between the ground-truth and predicted bounding boxes exceeds a given threshold. Compared with DSST, TCCF improves the precision and success scores by 21.7% and 15.2% respectively. Compared with STCT, TCCF gains 2.6% and 0.6% in precision and success scores. Compared with HDT, although HDT is 0.8% better in precision score, TCCF is 5.9% better in success score. The plots in Fig. 5 show that our TCCF tracker achieves the best overall performance among the compared trackers.

5 Conclusion

In this paper, we proposed a novel algorithm for online visual object tracking based on a CNN and correlation filters (TCCF). The pre-trained VGG-16 [25] is the only CNN used in our algorithm and it is kept fixed during online tracking, so the algorithm only needs to update the correlation filters dynamically instead of fine-tuning pre-trained deep models, which keeps its structure simple and compact. TCCF consists of two separate components, location estimation and scale estimation, both implemented independently by correlation filters on different feature representations. The results of extensive experiments demonstrate that our algorithm outperforms several state-of-the-art trackers in terms of accuracy and robustness.