1 Introduction

As one of the popular branches of computer vision, visual object tracking has been widely used in various fields, such as military applications, traffic control, security systems and human-computer interaction, and it has developed rapidly thanks to the progress of deep learning in recent years. Single-target tracking can be described as follows: an arbitrary target is specified by a bounding box in the first frame, and the tracker predicts bounding boxes in subsequent frames. Although great progress has been made in the past decades, visual object tracking remains a challenging task owing to complicated and volatile interferences such as illumination variation, scale variation, partial or full occlusion, motion blur, background clutter and deformation.

Features that can effectively distinguish the target from its surrounding background play a significant role in visual tracking. Trackers [10, 12, 16, 17, 30, 31] based on hand-crafted features have addressed the above challenges to some extent and can run at high speed. However, hand-crafted features are usually designed for specific scenarios, so they become less accurate or robust in more complex scenarios, which can lead to tracking failure. Recently, many trackers [4, 21, 23, 24, 27, 28] based on deep learning have been proposed and have made great progress in terms of accuracy and robustness. These trackers all use features extracted by Convolutional Neural Networks (CNNs) to represent the tracking target.

Despite the advantages of CNNs, some thorny issues remain. For example, a large number of annotated samples is required for the supervised training of deep CNNs, which have millions of parameters, and the lack of training samples is especially severe in online visual object tracking, since only one annotated sample is provided in the first frame of each video. Moreover, computing features with CNNs is more expensive than computing hand-crafted features. Correlation filters, which turn convolution into element-wise multiplication in the Fourier domain to accelerate processing, have been widely used in tracking [5, 10, 24], but the desired outputs of the correlation filters in these trackers are either the same for all training samples or designed improperly, which weakens the correlation between the filters and the tracking targets.

In this paper, we divide tracking into two parts: target location estimation and target scale estimation, both implemented with correlation filters. In the training stage, we design the desired output of the correlation filters carefully to obtain superior filters. The contributions of this paper are summarized as follows: (i) we comprehensively analyze the diversity of features from different layers of a CNN as well as the differences between hand-crafted features and features extracted by CNNs, and we design two correlation filters to exploit these features effectively; (ii) we propose a novel tracker for single-target tracking based on these correlation filters which only needs to update the correlation filters dynamically instead of fine-tuning pre-trained deep models. We conducted extensive experiments on the OTB-15 benchmark [29], and the results demonstrate that our algorithm outperforms several state-of-the-art trackers.

The rest of this paper is organized as follows. Section 2 reviews work related to our algorithm. Section 3 introduces our algorithm in detail and Sect. 4 presents the experimental results of our evaluation of different trackers. Finally, Sect. 5 concludes our work.

2 Related Work

The use of features extracted by CNNs has shown great effectiveness for computer vision tasks in recent years, such as segmentation [8] and classification [20], bringing considerable improvements. However, the computational cost of extracting these features is much higher than that of hand-crafted features, and much research has been devoted to improving computational efficiency. Correlation filters, which have played a significant role in signal processing since the 1980s [15, 22] and allow many objective functions to be solved in the Fourier domain, are widely used in visual tracking to speed up trackers owing to their high computational efficiency.

In [6], Bolme et al. proposed a new type of correlation filter called Average of Synthetic Exact Filters (ASEF), which performed well in some specific tasks [6, 7]. However, a large number of samples is required to train ASEF. The following year, Bolme et al. modified ASEF and proposed the Minimum Output Sum of Squared Error (MOSSE) filter for tracking [5], which achieved remarkable performance at a high processing speed. Both ASEF and MOSSE are single-channel correlation filters. Henriques et al. [14] proposed an analytic model named KCF for datasets consisting of thousands of translated patches, using the concept of circulant matrices. For linear regression this model is equivalent to a correlation filter, but it is also suitable for non-linear regression. Moreover, KCF can be extended to a multi-channel correlation filter. The work in [19] also investigated multi-channel correlation filters, which makes it possible for correlation filters to be used more widely.

Danelljan et al. proposed a concise correlation-filter-based tracker called DSST [10], which inspired our research. The highlight of DSST is its approach to scale estimation. However, from our observation of DSST, we found that the desired outputs of its correlation filters are designed improperly, which is explained in detail in Sect. 3. The tracker HDT [24] exploits features from different layers of a CNN through a correlation filter to localize the tracking target, but HDT is limited to location estimation only, which leads to poor performance on video sequences with severe scale variations. Moreover, its desired correlation filter output is fixed from the first frame, which further worsens its performance.

3 Tracking Based on CNN and Correlation Filters

Here we describe our algorithm TCCF (Tracking based on CNN and Correlation Filters) in detail. Before that, we first introduce the features used for target location estimation and scale estimation.

3.1 Feature Selection

Hand-crafted features, HOG [9] features for example, do well in representing the texture and edges of the tracking target. As shown in Fig. 1, different targets are all described clearly by HOG features. The drawback of hand-crafted features is that they cannot effectively distinguish the tracking target from other objects of the same category (refer to the feature map at the intersection of the third row and second column).

Fig. 1. Feature maps for different tracking targets. From left to right: the first column shows the input images, the second shows the visualized HOG feature maps, and the remaining columns show feature maps extracted by VGG-16 from the conv2-2, conv3-3 and conv4-3 layers respectively; each feature map shown for a layer is the average over all of its channels.

Recently, deep CNN models [20, 25, 26] trained on ImageNet [11] have been widely used in many computer vision tasks and achieved great success. The features extracted by CNNs are more discriminative than hand-crafted features (refer to Fig. 1). Moreover, the features extracted by a CNN vary from layer to layer: as shown in Fig. 1, shallower layers capture generic information about the target, while deeper layers capture semantic information. Wang et al. also studied these differences between layers [27].

Our tracking algorithm is divided into two parts, target location estimation and target scale estimation, which are implemented independently. Since features extracted by CNNs separate the target from the background more effectively than hand-crafted features, and features from different layers are complementary, they are used by one correlation filter for location estimation. Once the location of the target is determined, hand-crafted (specifically, HOG) features are used by another correlation filter for scale estimation, since they represent the texture and edges of the target better than features extracted by CNNs.

3.2 Correlation Filters

The structure of our proposed method is shown in Fig. 2; online tracking is divided into two parts. A Location Correlation Filter (LCF) is used for location estimation, while a Scale Correlation Filter (SCF) is used for scale estimation. Both LCF and SCF are multi-channel correlation filters. Here, we introduce the multi-channel correlation filter used in our algorithm.

Fig. 2. The structure of TCCF.

Let \(x^{t}\), a multi-channel signal, denote the features extracted from the given training sample, \(y^{t}\) denote the desired output of the correlation filter and \(f^{t}\) denote the correlation filter we want to obtain. The upper-case variants are \(X^{t} = \mathcal {F}(x^{t})\), \(Y^{t} = \mathcal {F}(y^{t})\) and \(F^{t} = \mathcal {F}(f^{t})\), where \(\mathcal {F}(\cdot )\) denotes the Discrete Fourier Transform (DFT). \(y^{t}\) is pre-defined according to the specific problem being handled. The correlation filter \(f^{t}\) is an ensemble of C weak filters, where C is the number of channels. In the Fourier domain, \(F^{t}\) can be computed by minimizing:

$$\begin{aligned} F^{t} = arg \min _{F^{t}}||Y^t - \sum _{c = 1}^{c = C} F^{t}_{c} \odot X^{t}_{c}||^2 + \lambda \sum _{c=1}^{c=C}||F^{t}_{c}||^2 \end{aligned}$$
(1)

where the subscript c denotes the component in the \(c\)-th channel, the parameter \(\lambda \) in the second term is a regularizer, and the symbol \(\odot \) denotes element-wise multiplication. The solution to Eq. (1) is:

$$\begin{aligned} F^{t}_{c} = \frac{Y^t \odot {\bar{X}^{t}_{c}}}{\sum _{c=1}^{c=C}X^{t}_{c} \odot \bar{X}^{t}_{c}+ \lambda } \end{aligned}$$
(2)

where the division is performed element-wise and \(\bar{X}^t_c\) denotes the complex conjugate of \(X^t_c\). The first term in the denominator is the power spectrum of \(x^{t}\). From Eq. (2) we can see that once the training sample \(x^{t}\) and the regularizer \(\lambda \) are fixed, the filter is directly controlled by \(y^{t}\).

Given a test sample t, we first transform it into the Fourier domain to obtain T; the response of t is then computed by:

$$\begin{aligned} r = \mathcal {F}^{-1}(\sum _{c = 1}^{c = C}T_{c} \odot F^{t}_{c}) \end{aligned}$$
(3)

where \(\mathcal {F}^{-1}(\cdot )\) denotes the inverse DFT (IDFT).
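To make Eqs. (2) and (3) concrete, the following is a minimal NumPy sketch of training a multi-channel correlation filter and evaluating its response. The array layout, helper names and the value of \(\lambda \) are illustrative assumptions of ours, not the paper's MATLAB/Caffe implementation.

```python
import numpy as np

def train_mccf(x, y, lam=1e-2):
    """Train a multi-channel correlation filter in the Fourier domain (Eq. 2).

    x : (C, M, N) real-valued feature map of the training sample
    y : (M, N)    desired Gaussian-shaped output
    Returns the filter F of shape (C, M, N) in the Fourier domain.
    """
    X = np.fft.fft2(x, axes=(-2, -1))                    # per-channel DFT
    Y = np.fft.fft2(y)
    numerator = Y[None, :, :] * np.conj(X)               # Y elementwise conj(X_c) per channel
    denominator = np.sum(X * np.conj(X), axis=0) + lam   # power spectrum + lambda
    return numerator / denominator[None, :, :]

def response(F, z):
    """Response of a test feature map z of shape (C, M, N) under filter F (Eq. 3)."""
    Z = np.fft.fft2(z, axes=(-2, -1))
    return np.real(np.fft.ifft2(np.sum(Z * F, axis=0)))
```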

To simplify the proposed model and reduce the computational cost, we adopt an incremental update scheme as in [5, 10, 24], which uses only the current frame to partially update the previous correlation filters during online tracking. Given the \(t\)-th frame of a video sequence, let \(p^t\) and \(s^t\) denote the position and size of the target in this frame, as predicted by the tracker. \(F^t\) is updated as follows:

$$\begin{aligned} F^{t}_{c} = \frac{A^{t}_{c}}{B^{t}} = \frac{(1 - \eta ) A^{t-1}_{c} + \eta \hat{A}^{t}_{c}}{(1 - \eta ) B^{t-1} + \eta \hat{B}^{t}} \end{aligned}$$
(4)

where

$$\begin{aligned} \hat{F}^{t}_{c} = \frac{\hat{A}^{t}_{c}}{\hat{B}^{t}}=\frac{Y^{t} \odot {\bar{X}^{t}_{c}}}{\sum _{c=1}^{c=C}X^{t}_{c} \odot \bar{X}^{t}_{c}+ \lambda } \end{aligned}$$
(5)

and the parameter \(\eta \) is the learning rate of the correlation filters.
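The incremental update of Eqs. (4) and (5) can be kept cheap by storing the numerator and denominator separately and blending them with the learning rate. Below is a sketch under the same assumptions as the previous snippet; the per-channel numerator and shared denominator are our reading of Eqs. (4)-(5).

```python
import numpy as np

def update_filter(A_prev, B_prev, x, y, lam=1e-2, eta=0.00902):
    """Blend the previous filter numerator/denominator with the current frame (Eqs. 4-5).

    A_prev : (C, M, N) accumulated numerator from frame t-1
    B_prev : (M, N)    accumulated denominator from frame t-1
    x, y   : current training-sample features and desired output
    Returns the updated (A, B) and the resulting filter F_c = A_c / B.
    """
    X = np.fft.fft2(x, axes=(-2, -1))
    Y = np.fft.fft2(y)
    A_hat = Y[None, :, :] * np.conj(X)                  # numerator of the current frame
    B_hat = np.sum(X * np.conj(X), axis=0) + lam        # denominator of the current frame
    A = (1.0 - eta) * A_prev + eta * A_hat
    B = (1.0 - eta) * B_prev + eta * B_hat
    return A, B, A / B[None, :, :]
```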

Fig. 3. Left: average success plots of two trackers. Middle: average success plots of three trackers. Right: average success score as a function of the standard deviation of \(y^t_s\).

Location Correlation Filter: Since features extracted by a pre-trained CNN are used in LCF, \(x^{t}\) and \(f^{t}\) are three-dimensional, i.e. \(x^{t}, f^{t} \in \mathfrak {R}^{M\times N\times C}\). Let \(y^t_l \in \mathfrak {R}^{M \times N}\) denote the desired output of LCF; it is a 2-D Gaussian-shaped distribution determined by the mean \(\mu ^t_l\) and standard deviation \(\delta ^t_l\). Suppose features from K convolutional layers are used in our algorithm; then there are K independent correlation filters in LCF, which means:

$$\begin{aligned} \text {LCF} = \{F^{k,t}|k = 1,2,\ldots ,K \} \end{aligned}$$
(6)

Each \(F^{k,t}\) has a weight \(w^{k}\), with \(\sum _{k=1}^{k=K}w^{k} = 1\). The location of the target predicted by \(F^{k,t}\) is the coordinate \((m^k,n^k)\) of the maximum value in the response \(r^k\). The final target location is computed as:

$$\begin{aligned} (m,n) = \sum _{k=1}^{k=K}w^{k} \cdot (m^k,n^k) \end{aligned}$$
(7)

where the symbol \(\cdot \) denotes scalar multiplication. Once the final target location is predicted, there is a loss between \((m^k,n^k)\) and \((m,n)\), which reflects the stability of \(F^{k,t}\), and the weight \(w^{k}\) is updated according to this stability. Please refer to [24] for more details.
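As an illustration of Eq. (7), the fusion of the per-layer estimates can be written as below; the weight update from the per-layer losses follows HDT [24] and is omitted here, and the function name is ours.

```python
import numpy as np

def fuse_locations(responses, weights):
    """Weighted fusion of per-layer location estimates (Eq. 7).

    responses : list of K 2-D response maps r^k, one per convolutional layer
    weights   : length-K weights w^k summing to 1
    Returns the fused location (m, n) and the per-layer peaks (m^k, n^k).
    """
    peaks = np.array([np.unravel_index(np.argmax(r), r.shape) for r in responses],
                     dtype=float)                       # (m^k, n^k) for each layer
    m, n = (np.asarray(weights)[:, None] * peaks).sum(axis=0)
    return (m, n), peaks
```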

It should be noted that the mean of \(y^{t}_l\) is set to 0 and the standard deviation \(\delta ^t_l\) is proportional to the target size \(s^{t}\), i.e.:

$$\begin{aligned} \delta ^t_l \varpropto s^t \end{aligned}$$
(8)

which means the desired output \(y^t_l\) of the location correlation filter is controlled by \(\delta ^t_l\) and is dynamically updated to adapt to scale variations of the target. In HDT [24] and DSST [10], by contrast, the desired outputs of the correlation filters are fixed from the first frame of a video sequence, which has a negative impact on the performance of those trackers. To see why this matters, suppose we choose a reference system \(\phi \) in the image from the perspective of the tracking target, in which the target moves a distance D, and a reference system \(\phi '\) on the screen from the perspective of the observer, in which the same motion corresponds to a distance \(D'\). Location estimation is carried out in \(\phi '\), and for a fixed D, the larger \(s^t\) is, the larger \(D'\) becomes (and vice versa); hence location estimation depends on the size of the target.
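The dynamically updated label of Eq. (8) can be generated as follows. The proportionality constant and the use of a separate standard deviation per axis are our own illustrative assumptions; the paper only states that \(\delta ^t_l\) is proportional to \(s^t\).

```python
import numpy as np

def lcf_label(shape, target_size, kappa=0.1):
    """2-D Gaussian desired output y_l^t with sigma proportional to the target size (Eq. 8).

    shape       : (M, N) spatial size of the feature map
    target_size : (h, w) current target size measured in feature-map cells
    kappa       : hypothetical proportionality constant (not given in the paper)
    """
    M, N = shape
    h, w = target_size
    sigma_m, sigma_n = kappa * h, kappa * w
    mm, nn = np.meshgrid(np.arange(M) - M // 2, np.arange(N) - N // 2, indexing='ij')
    y = np.exp(-0.5 * ((mm / sigma_m) ** 2 + (nn / sigma_n) ** 2))
    return np.fft.ifftshift(y)    # put the peak at the origin, i.e. zero mean
```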

Fig. 4. Qualitative results of the proposed TCCF tracker and 9 other trackers on a subset of the OTB-15 benchmark. From left to right and top to bottom: Basketball, Biker, BlurOwl, CarDark, Bolt, Car1, RedTeam, Deer, Walking2, Human4, Singer2, Surfer. Two frames of each video are presented.

Scale Correlation Filter: To implement scale estimation, we pre-define a set of scale factors \(\{\alpha _l = \theta ^{\lceil \frac{L}{2}\rceil -l}|l = 1,2,\ldots ,L\}\), where \(\theta >1\) is the scale step. Given a training sample, we first extract L rectangles of interest of size \(\alpha _l \cdot s^t\), where \(s^{t}\) denotes the size of the target in this training sample. We then build a feature map \(M^t \in \mathfrak {R}^{C \times L}\) from these rectangles of interest, with each column of \(M^t\) corresponding to one rectangle. Let \(x^{t}_{c} \in \mathfrak {R}^{1\times L}\) denote the \(c\)-th row vector of \(M^t\) and \(y^t_s\) denote the desired output of SCF; then SCF can be obtained from Eq. (2). \(y^t_s\) is a 1-D Gaussian-shaped distribution with mean \(\mu ^{t}_s = 0\). The target size \(s'\) in a test sample is determined by:

$$\begin{aligned} s' = \alpha _i \cdot s^t \end{aligned}$$
(9)

where \(\alpha _{i}\) is the scale factor whose index i corresponds to the maximum value in the response r.
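The scale factor set and the arg-max pick of Eq. (9) are sketched below, using the values \(L = 33\) and \(\theta = 1.02\) reported in Sect. 4; the HOG feature extraction that fills each column of \(M^t\) is omitted, and the helper names are ours.

```python
import numpy as np

def scale_factors(L=33, theta=1.02):
    """Pre-defined scale factors alpha_l = theta^(ceil(L/2) - l), l = 1..L."""
    return theta ** (np.ceil(L / 2.0) - np.arange(1, L + 1))

def estimate_scale(scf_response, prev_size, L=33, theta=1.02):
    """Choose the new target size from the SCF response (Eq. 9).

    scf_response : length-L response r of the scale correlation filter
    prev_size    : (h, w) target size from the previous frame
    """
    alphas = scale_factors(L, theta)
    i = int(np.argmax(scf_response))
    return alphas[i] * np.asarray(prev_size, dtype=float)
```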

Inspired by the effectiveness of dynamically updating \(y^{t}_l\), we also kept \(y^{t}_s\) dynamically updated in the manner of Eq. (8), but the experimental results show that the dynamical update of \(y^{t}_s\) reduces the performance of the tracker, which is the opposite of what we expected.

Here we give an explanation. Unlike location estimation, which is carried out in \(\phi '\), scale estimation only has to find an optimal scale factor \(\alpha _i\), which is independent of both \(\phi \) and \(\phi '\). Since the scale variation between two consecutive frames is usually small, the probability of a severe scale change between consecutive frames is low; therefore \(y^{t}_s\) should be independent of the target size and related instead to the number of scale factors L:

$$\begin{aligned} \delta ^t_s \varpropto L \end{aligned}$$
(10)

Table 1. Average precision scores on different attributes: Illumination Variation (IV), Occlusion (OCC), Deformation (DEF), Out-of-Plane Rotation (OPR), Background Clutters (BC), Scale Variation (SV), Motion Blur (MB), Fast Motion (FM), Out-of-View (OV), Low Resolution (LR), In-Plane Rotation (IPR).

4 Experiments

The proposed algorithm is implemented in MATLAB with the Caffe framework [18] and runs at 3.5 fps on an Ubuntu 14.04.3 machine with a 3.0 GHz Intel i7-5960X CPU and an Nvidia GM2000 TITAN X GPU. VGG-16 is used as the pre-trained CNN in our experiments, and its last 6 convolutional layers are used to extract features. We use \(L = 33\) and \(\theta = 1.02\) for scale estimation, and the learning rate \(\eta \) is set to 0.00902.

We use the one-pass evaluation (OPE) metric on the first 50 video sequences of the OTB-15 benchmark [29] to evaluate the trackers. According to different challenging factors, such as illumination variation, occlusion and deformation, these video sequences are tagged with 11 attributes, which makes it possible to evaluate the trackers thoroughly.

Inspired by DSST [10], we first construct two naive trackers, TCCFn1 and TCCFn2, based on LCF to illustrate the effectiveness of the dynamical update of \(y^{t}_l\). In TCCFn1, \(y^{t}_l\) is fixed from the first frame, while in TCCFn2 it is dynamically updated according to Eq. (8). As shown on the left of Fig. 3, TCCFn2 obtains a 1.2% improvement, which demonstrates the effectiveness of dynamically updating \(y^{t}_l\). We also construct another tracker, TCCFn3, in which both \(y^t_l\) and \(y^t_s\) are dynamically updated. The success scores of TCCFn2 and TCCFn3 are shown in the middle of Fig. 3, from which we can see that the dynamical update of \(y^{t}_s\) reduces the performance of the tracker. To find the optimal \(y^{t}_s\) according to Eq. (10), we conduct extensive experiments with a variable-controlling method and obtain the curve shown on the right of Fig. 3, from which we find the optimal standard deviation of \(y^t_s\) and then construct the final TCCF tracker depicted in the middle of Fig. 3.

We compare the proposed TCCF tracker with ten other trackers: CSK [13], Frag [1], L1APG [2], Staple [3], DSST [10], KCF [14], FCNT [27], HDT [24], SiamFC [4] and STCT [28], and perform both qualitative and quantitative evaluation. Qualitative results are shown in Fig. 4, from which we can see that our approach efficiently handles challenging factors such as deformation, motion blur, scale variation and background clutter. Quantitative results are shown in Tables 1 and 2, where the trackers are compared on every attribute. In Table 1, all values are obtained at the threshold of 20 pixels; in Table 2, all values are computed using the AUC (Area Under Curve) metric. The first, second and third best trackers are highlighted. From Tables 1 and 2 we can see that TCCF performs well across different attributes, which demonstrates the effectiveness of our correlation filters.

Table 2. Average success scores on different attributes: Illumination Variation (IV), Occlusion (OCC), Deformation (DEF), Out-of-Plane Rotation (OPR), Background Clutters (BC), Scale Variation (SV), Motion Blur (MB), Fast Motion (FM), Out-of-View (OV), Low Resolution (LR), In-Plane Rotation (IPR).

Fig. 5. Average precision plots and success plots of different trackers tested over 50 video sequences. On the left, trackers are ranked according to the precision score at the threshold of 20 pixels. On the right, trackers are ranked according to the area under curve.

We also use the precision and success plots in Fig. 5 to evaluate all trackers. The precision plot reports the percentage of frames in which the distance between the ground-truth target center and the predicted center is within a given threshold. The success plot reports the percentage of frames in which the overlap ratio between the ground-truth and predicted bounding boxes exceeds a given threshold. Compared with DSST, TCCF improves the precision and success scores by 21.7% and 15.2% respectively. Compared with STCT, TCCF gains 2.6% and 0.6% in precision and success scores. Compared with HDT, although HDT is 0.8% better in precision score, TCCF is 5.9% better in success score. The plots in Fig. 5 show that our TCCF tracker achieves the best overall performance among the compared trackers.

5 Conclusion

In this paper, we proposed a novel algorithm for online visual object tracking based on a CNN and correlation filters (TCCF). The pre-trained VGG-16 [25] is the only CNN used in our algorithm and it is kept fixed during online tracking, so the algorithm only needs to update the correlation filters dynamically instead of fine-tuning pre-trained deep models, which keeps its structure simple and compact. TCCF consists of two separate components, location estimation and scale estimation, both implemented independently by correlation filters on different feature representations. The results of extensive experiments demonstrate that our algorithm outperforms several state-of-the-art trackers in terms of accuracy and robustness.