1 Introduction

Visual object tracking is one of the most fundamental and challenging research problems in computer vision, owing to its numerous applications in human-computer interaction, video surveillance, driverless vehicles, etc. Despite the enormous progress achieved over the past decades, object tracking remains challenging: designing a tracker that can handle critical situations (such as illumination variation, scale variation, deformation, occlusion, etc.) perfectly is still an open problem.

There are many different tracking frameworks that attempt to improve tracking performance in different ways. Sparse representation based trackers find the best candidate with minimal reconstruction error using target templates [3, 4, 26]. DCFs-based trackers approximate the dense sampling scheme by generating a circulant matrix, of which each row denotes a circular shift of a base sample [12, 14, 23, 24, 60]. Deep learning based trackers often use CNN features and neural network structures to improve tracking performance [34, 37, 45, 47, 56]. Saliency, used as an evaluation mechanism, has been introduced into detection and tracking tasks in recent years [27, 28, 49]. In [49], Wang et al. present a salience-based tracking method, which estimates object salience and environment salience of extracted visual features for robust visual tracking. For visual tracking, the appearance model is a crucial factor for object representation. Various feature descriptors with effective appearance models have been proposed in the literature [21, 22, 43, 48, 51]. Single feature descriptors have been widely used in appearance based visual tracking models [10, 20, 24, 42, 44] for their computational convenience. However, a single feature is easily disturbed by noise and cannot describe the appearance of the target clearly. Since different features provide complementary information [18, 40, 52, 63], it is desirable to combine multiple feature descriptors to improve visual tracking performance, which is the goal of this paper.

Recently, several visual tracking methods based on multiple feature fusion have been established. The well-known ensemble tracking [2] combines HOG and RGB features using the Adaboost algorithm. Ma et al. [41] use multiple feature fusion via weighted entropy for data-adaptive visual tracking. A multi-view correlation filter tracker has been proposed in [33] to enhance the robustness of the tracker. There are also multiple feature fusion methods under the semi-supervised learning framework [62] and the sparse representation-based framework [25]. Although these methods have achieved some success, all of them either suffer from a large computational cost or produce unsatisfactory tracking performance.

To relieve these problems, we propose a novel multiple features fused tracking method within a correlation filter framework. The fusion method uses a simple but effective adaptive weighted average of each feature, where the weight is adaptively determined by the maximum response value of each feature. By exploiting the complementarity among different features under different tracking scenarios, our model eliminates the disadvantage that a single feature is easily affected by noise, which enhances the ability to represent the appearance of a target. Based on the correlation filter framework, we obtain the central coordinate of the target by finding the maximum response value in the response map of each feature. Then, through an adaptive weighted average of the target center coordinates from each feature, the final position of the target is obtained. Meanwhile, the correlation filter framework provides a fast calculation mechanism, which increases the speed of our tracking method.

The main contributions of this paper are summarized below:

  • Based on the correlation filter framework, we propose a novel multiple feature fused model for visual object tracking. This model adaptively combines the advantages of different features and effectively handles the disadvantage of a single feature, which is susceptible to noise interference. Meanwhile, the correlation filter framework makes the multiple feature fusion operation efficient.

  • We present a simple but effective scale variation detection mechanism based on the difference in maximum response values between adjacent frames. This mechanism enhances the robustness of our tracker on sequences with scale variation.

  • On OTB evaluation benchmarks, our proposed algorithm achieves robust and promising tracking performance.

The rest of this paper is organized as follows. We first review related work in Section 2, and present an adaptive weighted multiple features fused tracker via the correlation filter framework in Section 3, which includes a brief introduction to the KCF tracker, the adaptive weighted multiple features fused model, and the scale evaluation mechanism. Section 4 describes the implementation details, the evaluation of our approach on comprehensive benchmark datasets, and the comparison with related and representative trackers. Finally, we give a brief conclusion of our work in Section 5.

2 Related works

As an extensive review of multi-feature learning and the correlation filter framework is beyond the scope of this paper, we review the works most related to our approach, namely multiple feature based trackers and correlation filter based trackers.

2.1 Multiple feature based trackers

To deal with the limitations of a single feature in visual object tracking, several multiple feature fusion based visual tracking methods have been established [15, 32, 33, 35, 46, 53, 54]. Galoogahi et al. [15] propose a multi-channel detector/filter in the frequency domain, which clearly improves the tracking performance. In [46], Tang et al. derive a multi-kernel correlation filter based tracker which fully exploits the invariance-discriminative power spectrums of various features to further improve the performance. Yin et al. [53] propose a generic likelihood map fusion framework to combine different features into a fused soft segmentation suitable for mean-shift tracking. Li et al. [33] give a multi-view correlation model to enhance the robustness of the tracker. Qi et al. [45] suggest a hierarchical CNN based tracking framework, which takes full advantage of different features and uses an adaptive Hedge method to combine these trackers into a stronger one. The work in [30] formulates the tracking problem with several basic observation and motion models corresponding to different features; the multiple basic models are constructed by SPCA and each of them is integrated with an interactive Markov Chain Monte Carlo scheme. These trackers achieve good or robust performance, but bring high computational costs.

2.2 Correlation filter based trackers

Correlation filters have been widely used in object detection, recognition, etc. Since Bolme et al. [7] proposed the MOSSE tracker, correlation filters have been studied as a robust and efficient approach to the visual object tracking problem. Most improvements over the MOSSE tracker include the incorporation of the kernel trick and HOG features [23, 59], color name features [5, 12], sparse representation [58], adaptive scale [10, 31, 35], and the integration of deep features [47, 56]. Henriques et al. [23] propose the CSK tracker, which provides good performance at high speed. In [24], the KCF method further improves the CSK tracker by using HOG features and the kernel trick to transform the non-linear regression problem into a linear one. In [58], Zhang et al. exploit the circulant structure property of the target template to improve sparse representation based trackers. Yuan et al. [55] suggest a particle filter re-detection model in the correlation filter framework, which effectively reduces the occurrence of target loss. In [11], a spatially regularized correlation filter method is proposed to learn the filters from training examples with large spatial supports. Some local patch or part based correlation filter trackers have also been developed [36, 38, 39] to improve the robustness of the trackers. Li et al. [36] introduce reliable patches to exploit local contexts for the tracking task. Liu et al. [38] propose a part based structural correlation filter to preserve the target structure for visual tracking. Although correlation filter based trackers obtain good performance on current benchmarks while remaining computationally efficient, a single feature has its limitations and is easily interfered with by noise, so it cannot locate the target accurately. In this paper, we propose an adaptive multi-feature fused tracker in the correlation filter framework. Compared with single feature correlation filter based trackers, our tracker exploits multiple features to enhance the robustness in dealing with various changes of the moving target and selects more discriminative features to ensure tracking accuracy.

3 The adaptive weighted multiple features fused tracking method

When only a single feature is used, tracking is easily disturbed by noise, so the tracking performance cannot reach the ideal state. To achieve better tracking performance, in this section we propose a novel multiple features fused tracker in the correlation filter framework.

A correlation filter based tracker uses the information of an image I and a filter w to obtain the center coordinate x(i,j) of the target. When the image representation is obtained from the m-th feature, the target center coordinates are denoted as xm(i,j). In general, according to the Bayes formula, we have:

$$ P(x|I)= \int{P(x|B)P(B|I)dB} \approx \sum\limits_{m=1}^{M} \omega_{m} P(x|B_{m}), $$
(1)

where M represents the number of features, ωm denotes the confidence in the characteristic likelihood distributions, ωm = P(Bm|I), and \(\sum \omega _{m}=1\).

3.1 Kernelized correlation filter tracker

The KCF [24] tracker is a representative tracking-by-detection method. It trains a classifier with all sub-windows of an image obtained by dense sampling. The kernel trick makes the data matrix of samples highly structured, and the fast Fourier transform allows the convolution of two images to be computed by an efficient element-wise product in the Fourier domain.

The KCF tracker uses a filter w, which is trained on an image patch x of M × N pixels with HOG features, to model the appearance of the target. The circular shifts xm,n, (m,n) ∈{0,1,...,M − 1}×{0,1,...,N − 1}, serve as training samples for the filter, with Gaussian function labels ym,n. By minimizing the error between the training samples xm,n and the regression labels ym,n, we obtain the filter w as:

$$ w = \arg\min_{w} {\sum\limits_{m,n}{|\left\langle \phi(x_{m,n}),w\right\rangle-y_{m,n}|^{2} + \lambda \lVert w \rVert^{2} }}, $$
(2)

where ϕ represents the mapping to kernel space, 〈.〉 denotes the inner product, and λ is a regularization parameter. Since the label ym,n is not binary, the filter w learned from the training samples contains the coefficients of Gaussian ridge regression.

Representing the filter in the dual space as \(w={\sum }_{m,n} \alpha _{m,n}\phi (x_{m,n})\) and solving the Gaussian ridge regression problem with the Fast Fourier Transform (FFT), the solution of (2) can be obtained as:

$$ \alpha=\mathcal{F}^{-1}\left( \frac{\mathcal{F}(y)}{\mathcal{F}(k^{x})+\lambda}\right), $$
(3)

where \(\mathcal {F}\) and \(\mathcal {F}^{-1}\) denote the FFT and its inverse transformation (IFFT), respectively. In the Fourier domain, the kernel correlation kx = κ(xm,n,x) is computed with a Gaussian kernel. The vector α contains all the αm,n coefficients.

In the tracking process, an image patch z with the same size of x is cropped out in the new frame. And then, the response score can be calculated by:

$$ f(z)=\mathcal{F}^{-1}(\mathcal{F}(k^{z}) \odot\mathcal{F}(\alpha)), $$
(4)

where f(z) denotes the response map of patch z, ⊙ denotes the element-wise product, \(k^{z}=\kappa (z_{m,n},\hat {x})\), and \(\hat {x}\) is the learned target appearance. Given f(z), we find the position of the maximum response value in the map and take this position as the center coordinate of the target. A new filter is then trained and the parameters are updated according to the current position. These steps are repeated so that the target is tracked through the entire sequence.
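To make the training and detection steps of (2)-(4) concrete, the following is a minimal NumPy sketch of a single-channel kernelized correlation filter with a Gaussian kernel. The kernel bandwidth, regularization value, and helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel_correlation(x, z, sigma=0.5):
    """Kernel correlation over all circular shifts, evaluated in the Fourier domain."""
    cross = np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(z))))
    d = (np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross) / x.size
    return np.exp(-np.maximum(d, 0.0) / (sigma ** 2))

def train_filter(x, y, lam=1e-4, sigma=0.5):
    """Closed-form solution of the ridge regression (2), i.e. eq. (3)."""
    kx = gaussian_kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(kx) + lam)  # alpha in the Fourier domain

def detect(alphaf, x_model, z, sigma=0.5):
    """Response map of eq. (4); the peak location gives the target center shift."""
    kz = gaussian_kernel_correlation(x_model, z, sigma)
    response = np.real(np.fft.ifft2(np.fft.fft2(kz) * alphaf))
    return response, np.unravel_index(np.argmax(response), response.shape)
```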

Although the KCF achieves fast and satisfactory performance thanks to the circulant data matrices and the efficient element-wise product, a single feature still has limitations in dealing with the various changes that occur in tracking sequences. In order to obtain robust performance, we propose a novel multiple feature fused model within the efficient KCF tracking framework.

3.2 Adaptive weighted multiple features fused model

A multiple features fused model should exploit the complementary information of different features. The selection of features and the fusion method directly affect the performance of the model. Features extracted for a correlation filter are well suited for fusion, because the maximum response value of each feature can be used to determine the coordinates of the target. A simple tracking example for the t-th frame is shown in Fig. 1.

Fig. 1

The tracking framework of our approach. For the t-th frame, different features are extracted from the original image within the preset search window (in our approach, we fuse HOG, CN, and gray features). With correlation filtering, we obtain the response maps of the three features. Different response maps show different maximum response values, because each feature has a different discriminative power in a given tracking scenario. After the three response maps are fused with the fusion model, a more accurate and discriminative response map is obtained, and the target can be located accurately

We propose to unify the different features under a probabilistic framework. For each feature t (where t = 1,2,...,k), its probability distribution is \(p^{t}_{ij}\) with \(\sum p^{t}_{ij}=1\), (i,j) ∈{1,2,...,M}×{1,2,...,N}, where M × N is the size of an image patch. By using the correlation filter framework for visual tracking, we obtain the coordinates (i,j) of the center of the target for each feature. We then combine these center coordinates from the response map of each feature to determine the final center coordinates of the target. For the sake of simplicity, we acquire the final coordinates by an adaptively weighted average of the per-feature coordinates [24]. After the multiple features are fused, the center coordinates of the target are given as follows:

$$ \begin{array}{@{}rcl@{}} &&p_{i}=\lambda_{1}*p_{i1}+\lambda_{2}*p_{i2}+...+\lambda_{n}*p_{in},\\ &&s.t. {\sum} \lambda_{j}=1, \end{array} $$
(5)

where pi denotes the center coordinates in the i-th frame, pij is the center coordinate given by the j-th feature in the i-th frame, and λj is the corresponding weighting factor of the j-th feature.

Typically, a good feature obtains relatively large response values in correlation filtering, so the quality of each feature is very important for determining the final target position. Based on this observation, we use the maximum response value of each feature in the correlation filter to adaptively acquire its corresponding weight for locating the target, and the corresponding weight is calculated as:

$$ \lambda_{j}=\frac{mR_{j}}{{\sum} mR_{j}}, $$
(6)

where mRj denotes the maximum response value of the j-th feature.

Since the weights obtained by simple weighted averaging can assign excessive positional weight to interfering features, we employ a simple penalty term \(\frac {1}{\eta * mR_{j}}\) to alleviate this problem:

$$ \begin{array}{@{}rcl@{}} \lambda^{\prime}_{j}&=&\frac{mR_{j}}{\frac{1}{\eta * mR_{j}} + {\sum} mR_{j}},\\ \lambda^{\prime\prime}_{j}&=&\frac{\lambda^{\prime}_{j}}{{\sum} \lambda^{\prime}_{j}}, \end{array} $$
(7)

where \(\lambda ^{\prime \prime }_{j}\) denotes the modified adaptive weighting factor of the j-th feature, and η denotes the penalty term coefficient. The purpose of the penalty term \(\frac {1}{\eta * mR_{j}}\) is to assign a large weight to a feature with a large response value and a small weight to a feature with a small response value.
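As a small illustration of (6)-(7), the following sketch computes the modified adaptive weights from the peak response values of the feature response maps. The function name and the example values are hypothetical; η = 25 follows the setting reported in Section 4.1.

```python
import numpy as np

def adaptive_weights(max_responses, eta=25.0):
    """Modified adaptive weights of eq. (7): a larger peak response yields a larger
    weight, and the penalty term 1/(eta*mR_j) further suppresses weak features."""
    mR = np.asarray(max_responses, dtype=float)
    lam_prime = mR / (1.0 / (eta * mR) + mR.sum())
    return lam_prime / lam_prime.sum()

# e.g. peak responses of the HOG, Color Names and gray response maps
print(adaptive_weights([0.62, 0.35, 0.20]))
```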

In this approach, we select features describing edge, color, and intensity, which correspond to HOG [24], Color Names [9], and the gray value. The HOG features are robust to illumination change and deformation, and obtain excellent results in human detection and tracking [19, 24]. Color Names and the gray value are robust to motion blur and give good results in image retrieval [9]. Given the t-th frame image and the correlation filters F, we obtain the center coordinates of the target for each feature: pc = Fc(I), ph = Fh(I), pg = Fg(I). After fusing the color feature and the HOG feature, the corresponding center coordinates are:

$$ p_{ch}=\lambda_{c}*p_{c}+\lambda_{h}*p_{h}, $$
(8)

where pch denotes the center coordinates obtained by the fusion of the color features and HOG features.

After adaptive weighting, the noise of a single feature is filtered out by the response of another feature, so that the original appearance of the target can be better represented. The other kinds of fusion are given correspondingly by: pcg = λc pc + λg pg, pgh = λg pg + λh ph. From (5) to (8), we obtain the objective function of the adaptive weighted multiple feature fusion model:

$$ p_{i}=\sum \frac{F_{ji}(I)mR_{ji}}{\left( \frac{1}{\eta * mR_{ji}} + {\sum} mR_{ji}\right) {\sum} \frac{mR_{ji}}{\frac{1}{\eta * mR_{ji}} + \sum mR_{ji}}}. $$
(9)

From the previous description, we can see that the multiple feature fusion algorithm only needs an adaptive weighted average of the selected features. Intuitively, the noise in the target center position estimated from each individual feature is filtered out well. By fusing the response maps of the corresponding features according to their maximum response values, we determine the final target position. In Fig. 1, it is apparent that the center point after fusion is more robust. Therefore, the model based on adaptive weighted multiple feature fusion can effectively improve the robustness of the algorithm.
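Putting (5)-(9) together, the sketch below fuses the per-feature center coordinates with the adaptive weights; the example center coordinates, peak responses, and names are illustrative assumptions.

```python
import numpy as np

def fuse_centers(centers, max_responses, eta=25.0):
    """Fused target center of eq. (5)/(9): a weighted average of the per-feature
    peak locations, with weights from the modified adaptive scheme of eq. (7)."""
    mR = np.asarray(max_responses, dtype=float)
    lam = mR / (1.0 / (eta * mR) + mR.sum())
    lam /= lam.sum()
    centers = np.asarray(centers, dtype=float)  # one (row, col) center per feature
    return lam @ centers

# e.g. peak locations of the HOG, CN and gray response maps and their peak values
print(fuse_centers([[120, 84], [118, 86], [123, 80]], [0.62, 0.35, 0.20]))
```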

3.3 Scale evaluation mechanism

Based on the correlation filter tracking framework, by finding the maximum value of the response map in each frame, we can only obtain the center position of the target. In visual tracking, scale change is one of the most common challenges and directly influences tracking performance. In this section, we give a simple but effective scale evaluation mechanism based on the multiple feature fused model.

For the correlation filter based framework, the initial target size is set as size1 = [h1,w1]. We then use the relationship between the maximum response values of the current t-th frame and the previous (t − 1)-th frame to determine the size of the target in the current frame. Without affecting the result, we simply compare the maximum response values of adjacent frames to determine the direction of the scale change. From the property of the Gaussian function, when other conditions remain constant, there is a negative correlation between the target size and the maximum response value. If the maximum response value of the current t-th frame is higher than that of the previous (t − 1)-th frame, the target size decreases. If it is lower, the target size increases. If it is the same, the target size remains unchanged. The size changes estimated from the three features are averaged to obtain the overall size change of the target, whose rate of change is denoted by c′. Thus, the size of the target in the t-th frame can be determined from the target size in the (t − 1)-th frame:

$$ \begin{array}{@{}rcl@{}} &&size_{t}=size_{t-1}*c^{\prime},\\ &&c^{\prime} = \frac{1}{3} * (c_{c} + c_{h} +c_{g} ), \end{array} $$
(10)

where sizet denotes the size of the target in the t-th frame, and cc, ch and cg denote the rates of change of the color feature, HOG feature and gray feature, respectively.

Since the scale change of the target between two adjacent frames is not too obvious, this simple scale ratio is used to update the target scale.
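A minimal sketch of the scale update in (10), assuming the negative correlation between peak response and target size described above; the discrete change ratios follow c = [0.98, 1, 1.02] from Section 4.1, while the function names and tolerance are illustrative.

```python
def scale_ratio(prev_peak, cur_peak, shrink=0.98, grow=1.02, tol=1e-3):
    """Per-feature change ratio: a higher peak than in the previous frame suggests
    a smaller target, a lower peak a larger one, an (almost) equal peak no change."""
    if cur_peak > prev_peak + tol:
        return shrink
    if cur_peak < prev_peak - tol:
        return grow
    return 1.0

def update_size(size_prev, prev_peaks, cur_peaks):
    """Eq. (10): size_t = size_{t-1} * c', with c' averaged over the three features."""
    ratios = [scale_ratio(p, c) for p, c in zip(prev_peaks, cur_peaks)]
    c_prime = sum(ratios) / len(ratios)
    return [d * c_prime for d in size_prev]

# e.g. previous target size [h, w] and the peak responses of the three features
print(update_size([96, 48], prev_peaks=[0.58, 0.33, 0.21], cur_peaks=[0.61, 0.30, 0.21]))
```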

4 Experiments

In order to evaluate the proposed tracker objectively and comprehensively, we test it on standard visual tracking benchmarks. First, we introduce the algorithm flow, the experimental environment, and the implementation details. Then, we give the details and criteria of the experimental evaluation. Finally, the performance of our tracker is validated mainly on the OTB2013/OTB50/OTB2015 [50, 51] benchmarks, which contain more than 100 test video sequences.

Our method is implemented in MATLAB and the experiments run on a PC with an Intel Core-i3-4170 CPU (3.70 GHz) and 8 GB RAM.

4.1 Implementation details

In this section, we describe the whole tracking process and the parameter settings. First, we extract different features from the given initial bounding box in the first frame and train the corresponding filters. Then, we run the tracker iteratively on each frame of the tracking sequence. In each iteration, we determine the appropriate scale and locate the center position of the target through the multiple features fused model, successively. Finally, we update the correlation filter models in a linear way. The whole process of our method is summarized in Algorithm 1.

Algorithm 1 The whole tracking process of the proposed method

The parameters are set as follows: the search window size is twice the target size, i.e., sz_window = 2 ∗ sz; the scale change ratios are set as c = [0.98, 1, 1.02], which depend on the scale change ratios of the different features; and the penalty term coefficient is set as η = 25. We use the same correlation filter parameters as in [24] and the same feature parameters as in [33].
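For reference, these settings can be collected into a simple configuration; this is only an illustrative sketch with hypothetical names, and the correlation filter parameters from [24] and feature parameters from [33] are not repeated here.

```python
# Illustrative configuration corresponding to Section 4.1 (hypothetical names).
MFFT_CONFIG = {
    "search_window_factor": 2,                 # sz_window = 2 * sz
    "scale_change_ratios": [0.98, 1.0, 1.02],  # c
    "penalty_coefficient_eta": 25,             # eta in eq. (7)
    "features": ["hog", "cn", "gray"],         # HOG, Color Names, gray value
}
```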

4.2 Evaluation criterion

We use central location error (CLE) and Pascal VOC overlap ratio (VOR) to evaluate the effectiveness of our proposed tracking algorithm [8, 13].

Central location error (CLE): the mean Euclidean distance between the target center coordinates determined by the algorithm and the manually annotated ground truth. The mean central location error over all frames of a sequence is used to evaluate the overall performance. To rank the tracking algorithms, a threshold of 20 pixels on the central location error is commonly used to measure the precision score. The Pascal VOC overlap ratio (VOR) is calculated as:

$$ VOR = \frac{Area\{B_{R} \cap B_{G} \}}{Area\{B_{R} \cup B_{G} \}}, $$
(11)

where BR denotes the bounding-box of the tracking result, BG denotes the real bounding-box of the tracking target.

Under the VOR criterion, frames whose VOR is larger than a threshold 𝜃 are counted as successful frames. The success plot shows the ratio of successful frames as the threshold varies from 0 to 1. The algorithms are then ranked by the area under the success rate curve (AUC).
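To illustrate (11) and the success-rate curve concretely, the sketch below computes the VOR of axis-aligned boxes given as [x, y, w, h], the precision at a 20-pixel center error, and the AUC of the success plot; the box format and function names are assumptions for illustration.

```python
import numpy as np

def vor(box_r, box_g):
    """Pascal VOC overlap ratio of eq. (11) for two [x, y, w, h] boxes."""
    x1, y1 = max(box_r[0], box_g[0]), max(box_r[1], box_g[1])
    x2 = min(box_r[0] + box_r[2], box_g[0] + box_g[2])
    y2 = min(box_r[1] + box_r[3], box_g[1] + box_g[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_r[2] * box_r[3] + box_g[2] * box_g[3] - inter
    return inter / union if union > 0 else 0.0

def center_error(box_r, box_g):
    """Euclidean distance between the two box centers (CLE for one frame)."""
    cr = (box_r[0] + box_r[2] / 2.0, box_r[1] + box_r[3] / 2.0)
    cg = (box_g[0] + box_g[2] / 2.0, box_g[1] + box_g[3] / 2.0)
    return float(np.hypot(cr[0] - cg[0], cr[1] - cg[1]))

def evaluate(result_boxes, gt_boxes, thresholds=np.linspace(0, 1, 101)):
    """Precision at 20 pixels and AUC of the success plot over one sequence."""
    errors = np.array([center_error(r, g) for r, g in zip(result_boxes, gt_boxes)])
    overlaps = np.array([vor(r, g) for r, g in zip(result_boxes, gt_boxes)])
    precision_20 = float((errors <= 20).mean())
    success = np.array([(overlaps > t).mean() for t in thresholds])
    return precision_20, float(success.mean())
```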

In order to evaluate the performance of the algorithm, we use the three evaluation protocols given by OTB2013 [51]: one-pass evaluation (OPE), temporal robustness evaluation (TRE), and spatial robustness evaluation (SRE). In OPE, the tracker is run only once on each test sequence, and the percentages of frames whose center error and overlap pass certain thresholds are reported. In TRE, the tracker is run 20 times with different start frames on each video sequence. In SRE, the tracker is run 12 times with different spatial perturbations of the initial bounding box.

4.3 Evaluation with OTB benchmarks

Datasets

The OTB2013 [51] dataset contains 51 different video sequences and categorizes them with 11 attributes. The OTB2015 [50] dataset is an extension of OTB2013 and contains 100 different video sequences. OTB50 is a collection of the 50 most challenging sequences selected from the OTB2015 dataset. The 11 attributes include: out-of-plane rotation (OR), in-plane rotation (IR), occlusion (OCC), scale variation (SV), illumination variation (IV), background cluster (BC), deformation (DEF), fast motion (FM), motion blur (MB), out of view (OOV) and low resolution (LR). Each attribute corresponds to a different subset of sequences.

Baseline evaluation

We compare our tracker with all 29 trackers in the OTB2013 benchmark, including: Frag [1], MIL [3], ASLA [26], TLD [29], Struck [17], L1APG [4], CSK [23], SCM [61], etc. In addition, five other representative trackers, DSST [10], KCF [24], MvCFT [33], SAMF [35] and KCF_MTSA [6], are also compared with our tracker. The KCF [24] tracker uses a kernelized correlation filter operating on HOG features. The DSST [10] tracker extends the MOSSE tracker with robust scale estimation and obtained the top rank in performance by outperforming 19 state-of-the-art trackers on OTB and 37 state-of-the-art trackers on VOT2014. The MvCFT [33] tracker builds a multi-view model on correlation filters under a unified probabilistic framework. The SAMF tracker and the KCF_MTSA tracker are two widely used multi-feature fusion based trackers. The codes and settings are all the same as in the widely adopted OTB2013 benchmark. The comparison results are shown in Fig. 2. Compared with the KCF, MvCFT and DSST trackers, the tracking performance of our proposed algorithm is significantly improved. Moreover, compared with the SAMF and KCF_MTSA trackers, our proposed tracker is very close to the best tracker in both the precision and success plots. This demonstrates the effectiveness of our multi-feature fusion model.

Fig. 2

Precision and success plots on OTB2013. a The precision plots; b The success plots. The numbers in the legend indicate the representative precision at 20 pixels for the precision plots, and the average area-under-curve scores for the success plots. For clarity, only the top 10 trackers are shown

Evaluation per attribute

The success and precision plots for each attribute are given in Figs. 3 and 4. As shown in Fig. 3, our tracker achieves the best performance on the attributes MB, DEF, IV, OCC and OPR, and close to the best performance on the other attributes. In Fig. 4, our tracker achieves the best performance on the attributes FM, BC, DEF, IV, OCC and OPR, and the second or third best performance on MB, IR, OOV, LR and SV. These advantages come from the multiple feature fused model. For scale variation, our result is very close to the best result (DSST), which mainly focuses on scale estimation, in both the success and precision plots. This also shows the effectiveness of our scale estimation mechanism. For low resolution, our algorithm combines multiple features, but each feature has poor characterization ability in this case, which causes the unsatisfactory result. Generally speaking, our proposed tracker achieves the best or close to the best results on almost all attributes.

Fig. 3

The precision plots of fast motion (FM), background cluster (BC), motion blur (MB), deformation (DEF), illumination variation (IV), in-plane rotation (IR), low resolution (LR), occlusion (OCC), out-of-plane rotation (OR), out of view (OOV) and scale variation (SV)

Fig. 4

The success plots of fast motion (FM), background cluster (BC), motion blur (MB), deformation (DEF), illumination variation (IV), in-plane rotation (IR), low resolution (LR), occlusion (OCC), out-of-plane rotation (OR), out of view (OOV) and scale variation (SV)

Robustness to initialization

In order to provide sufficient experimental evidence of the robustness of our tracker, we evaluate it with different spatial and temporal initializations using two robustness metrics: TRE and SRE. Fig. 5 shows the overall comparison on SRE and TRE. From Fig. 5b, we can see that our tracker achieves the second best performance in the AUC success plots, close to DSST and better than KCF. In the precision plots of Fig. 5a, our MFFT tracker obtains the best performance. From Fig. 5c, d, both the precision and the success plots show that our tracker achieves the best performance. These results demonstrate the robustness of our tracker to different temporal and spatial initializations. In summary, our MFFT tracker is effective and achieves promising results on the OTB2013 [51] benchmark.

Fig. 5

SRE and TRE precision and success plots on OTB2013. a The precision plots of SRE. b The success plots of SRE. c The precision plots of TRE. d The success plots of TRE. The numbers in the legend indicate the representative precision at 20 pixels for the precision plots, and the average area-under-curve scores for the success plots. For clarity, only the top 10 trackers are shown

Comparison to state-of-the-art trackers

To put the tracking performance into perspective, we compare our tracker with recent state-of-the-art trackers including: 1) deep learning based trackers: HDT [45], CNT [56], CFNet-conv1 [47]; 2) correlation filter based trackers: KCF [24], DSST [10], CSK [23], STC [57]; 3) multi-feature based trackers: MvCFT [33], SAMF [35], KCF_MTSA [6]; and 4) representative trackers: TGPR [16], SCM [61], Struck [17]. We analyze the performance of our MFFT tracker against nine other state-of-the-art tracking algorithms under different attributes on OTB2015 [50]. Table 1 shows the comparison results on these 11 attributes. From this table, we can see that in both distance precision rate (DPR) and overlap success rate (OSR), our MFFT tracker achieves the best or close to the best results under all 11 attributes.

Table 1 Average precision and success scores of our MFFT and KCF [24], DSST [10], MvCFT [33], CSK [23], CNT [56], TGPR [16], SCM [61], Struck [17], HDT [45] trackers on OTB2015 [50] dataset on 11 attributes including: background cluttered (BC), deformation (DEF), fast motion (FM), in-plane rotation (IPR), illumination variation (IV), low resolution (LR), motion blur (MB), occlusion (OCC), out-of-plane rotation (OPR), out-of-view (OV) and scale variation (SV)

In addition, we use the OTB2013/OTB50/OTB2015 datasets for a quantitative comparison of the distance precision rate (%) (DPR) at a threshold of 20 pixels and the overlap success rate (%) (OSR) at an overlap threshold of 0.5 in Table 2. From this table, we can see that our MFFT tracker achieves the best or close to the best tracking results. Compared with the correlation filter based trackers KCF [24], DSST [10] and CSK [23], and the representative trackers TGPR [16], SCM [61] and Struck [17], our tracker achieves better tracking performance. Compared with the multi-feature based trackers MvCFT [33], SAMF [35] and KCF_MTSA [6], our tracker achieves similar tracking performance. Even compared with the deep learning based trackers, the performance of our tracker is better than that of the CNT [56] tracker. These advantages are due to our adaptive weighted multi-feature fusion model. All the experimental results show that our MFFT tracker is comparable to other state-of-the-art trackers.

Table 2 Comparisons with state-of-the-art tracking methods include: STC [57], KCF [24], DSST [10], MvCFT [33], SAMF [35], KCF_MTSA [6], CFNet-conv1 [47], HDT [45], CNT [56], TGPR [16], SCM [61] and Struck [17] on OTB2013 [51], OTB50 and OTB2015 [50] datasets

Qualitative comparison

Our approach significantly improves the performance compared with single feature based trackers in some complex cases. Figure 6 shows a qualitative comparison of our approach with several existing methods on challenging tracking sequences. Whether the target scale changes (e.g., Car1 and Human6) or the target is occluded (e.g., Girl2 and Tiger2), our tracker gives better tracking results than the other trackers. Despite having no explicit illumination variation handling component, our tracker also performs favorably in cases with illumination variation (e.g., Human2).

Fig. 6

Qualitative comparison of our approach with state-of-the-art trackers on the Car1, Girl2, Human2, Human6, Human7 and Tiger2 videos. Our approach provides consistent results in challenging scenarios, such as occlusions, fast motion, illumination variation, background clutter and target rotations

5 Conclusion

In this paper, we propose a multiple feature fused tracker in the correlation filter framework that achieves promising performance on the OTB2013/OTB50/OTB2015 benchmarks. The multiple feature fused model applies different features to deal with the various changes of the target in tracking sequences. It adaptively exploits the complementary information between different features to handle the weakness of a single feature, which is easily susceptible to noise. The correlation filter provides an efficient fusion and tracking framework. In addition, we give a novel scale evaluation mechanism to handle scale changes of the moving target in tracking sequences. The experimental results under different attributes show the competitive performance of our tracker.