1 Introduction

Visual tracking plays an important role in a wide range of applications, including surveillance systems, driverless vehicles, robotics, and human–computer interaction. The task of object tracking is to estimate the states (positions and scales) of a target in subsequent frames, given its initial state in the first frame. Recent years have witnessed significant developments in visual tracking, with an enormous amount of research effort devoted to tasks such as short-term single-object tracking. However, many challenges remain, such as target deformation, rotation, scale variation, occlusion, and imbalanced training samples [3]. Furthermore, object tracking is usually only a single component in a complete machine vision system, so the real-time capability of tracking algorithms is of paramount importance for the whole pipeline to run online. Nevertheless, the current top-ranking trackers are mostly based on deep learning technology and are neither memory-efficient nor real-time capable.

Correlation filters have recently been introduced for visual tracking and have been shown to achieve high speed as well as robust performance. Because the learning of a correlation operator, formulated as a ridge regression problem, can be accelerated by the fast Fourier transform (FFT) in the frequency domain, correlation filter-based trackers (CFTs) can perform real-time tracking. In general, extracting powerful features is crucial for CFTs. Gradient [16], color [4, 12], and deep features [24, 32] extracted from convolutional neural networks (CNNs) are widely used in CFTs. However, how to best utilize different features jointly for real-time tracking remains an open issue. Another tough problem for most CFTs is that they cannot maintain tracking robustness in subsequent frames once model drift occurs. Model drift means that the object appearance model gradually drifts away from the object owing to the errors accumulated during online tracking. Existing works [17, 18, 20] have aimed to prevent model drift by modifying the training strategy rather than by improving the underlying model-based predictions themselves.

In this paper, we propose a robust correlation tracking method (RCT) that exploits feature fusion and a reliable response. The fused feature herein describes the gradient and color information conjunctively, in a more natural way than existing approaches [21, 22] that directly concatenate features together. The novel fused feature is then embedded into a correlation filter that is background-aware, in the sense that the filter is capable of learning from real negative examples densely extracted from the background. To alleviate the model drift issue, an adaptive optimization strategy is introduced to remove the untrusted part of the response map caused by deformation or other challenging factors, so as to improve the predictions by obtaining and manipulating a more reliable response map, leading to an enhanced tracking result. The flowchart of the proposed approach is shown in Fig. 1.

We evaluate the proposed tracker on the OTB [29, 30] and Temple-Color [23] datasets. The results demonstrate that our method attains a very competitive accuracy level in comparison with state-of-the-art trackers, while running at a real-time tracking speed of 26 fps on a standard desktop CPU.

The remainder of this paper is organized as follows. Section 2 introduces the related works, and Sect. 3 describes the details of the proposed approach. Experiments and analysis are presented in Sect. 4, and Sect. 5 concludes this study.

Fig. 1

Flowchart of the proposed approach. The operator \(\odot\) is the Hadamard (element-wise) product

2 Related works

In this section, we discuss the correlation filter-based tracking methods closely related to this work. For other visual tracking approaches, readers are referred to the comprehensive reviews [19, 26, 29].

2.1 Feature representation in correlation tracking

Bolme et al. [5] proposed one of the seminal correlation tracking methods based on the minimum output sum of squared errors (MOSSE), which can perform online tracking at an astonishing speed of \(\sim\)700 fps. In MOSSE, raw pixels are directly used for tracking; unfortunately, the noise carried by raw images severely limits its tracking performance. Over the years, gradient and color features have been successfully applied in CFTs. The kernelized correlation filter tracker (KCF) [16] employs the well-known histogram of oriented gradients (HOG) [8] feature to improve the accuracy of the tracker. Color features, such as color names (CN) [12] and the global color histogram [4], have also been investigated to reinforce color-video tracking for CFTs. Li et al. [22] proposed a scale-adaptive tracker with multiple features (SAMF), which fuses HOG and CN within the correlation tracking framework to further boost tracking performance. Following the success of deep learning on a wide range of visual-recognition tasks, a number of tracking methods based on deep features and correlation filters have been developed [24, 32]. For instance, Ma et al. [24] utilized hierarchical CNN features to exploit the semantic information of the target object with state-of-the-art performance. However, extracting CNN features from each frame and training or updating high-dimensional correlation filters is computationally expensive; for correlation tracking, such an approach therefore often leads to poor real-time performance.

2.2 Robustness to model drift

Model drift leads to inaccurate model-based predictions. To address this problem, Kalal et al. [18] proposed an approach that decomposes the overall task into the subtasks of tracking, learning and detection (TLD), where tracking and detection reinforce each other. Zhang et al. [31] introduced fuzzy logic [2] to alleviate model drift by formulating tracking as a fuzzy classification problem. Inspired by the KCF and TLD trackers, Li et al. [20] proposed a scale-adaptive kernelized correlation filter tracker, termed SKCF, which estimates an accurate scale and models the distribution of the correlation response with a Gaussian constraint during re-detection. However, the circularly shifted samples in such CFTs suffer from periodic repetitions at boundary positions, thereby leading to model drift and significantly degrading the tracking performance. Spatial regularization methods have since been suggested to alleviate these unwanted boundary effects. For example, using the alternating direction method of multipliers (ADMM) [6], Galoogahi et al. [14] resolved a constrained optimization problem for single-channel filters. Somewhat differently, the SRDCF formulation [10] allows correlation filters to be trained on a significantly larger set of negative training samples without corrupting the positive ones, where a spatial regularization component is introduced into the training process to penalize the correlation filter coefficients according to their spatial location. Recently, Varfolomieiev et al. [27] combined channel-independent calculation with spatial regularization to suppress the background component of the filter. Unlike previous CFTs, in which negative examples are restricted to circularly shifted patches, BACF [13] utilizes a correlation filter whose spatial size is much smaller than that of the training samples, so that real negative training examples densely extracted from the background can be utilized. To avoid drifting in real-time UAV tracking, Huang et al. [17] repressed aberrances during the training phase.

Compared with existing methods, our proposed tracker has several merits. First, while RCT may be viewed as an (improved) approximation to the work of [13] on multiple training samples, the filter works more efficiently owing to the use of a more reliable response map. Second, with the introduction of fused features, the RCT tracker can learn more robust representations than previous work, thereby leading to superior tracking performance.

3 Proposed approach

We aim to develop a robust tracking algorithm that is adaptive to significant appearance changes without being prone to drifting. We first propose a fused feature mechanism that describes the gradient and color information in an integrated way. Then, a background-aware correlation filter exploiting the fused features is designed to obtain a response map. Furthermore, a mask obtained from the values of this response map is multiplied with the original response map to form a more reliable response map, which helps alleviate possible model drift.

3.1 Multi-channel fused features

Inspired by the duplicity theory of vision [15], we construct a more natural feature representation that fuses different types of features. Instead of concatenating the color and gradient features directly, we first transform the original image patch into the HSV (Hue, Saturation, Value) color space, which reflects more closely how colors are organized and perceived in human vision. In this color space, brightness and colorfulness are absolute measures that describe the spectral distribution of the light entering the eye; benefiting from this, our fused feature is robust to illumination variation. Secondly, HOG gradient information is extracted from each channel of the HSV color space separately. Finally, all the HOG features are concatenated to form the proposed representation, a 93-dimensional feature map. Without loss of generality and for conciseness, we term the resultant feature descriptor a fused feature.
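For concreteness, the following sketch outlines this fused-feature construction. It is a minimal illustration, not the released implementation: fhog is a hypothetical helper that would return a 31-channel HOG map of a single-channel image at a given cell size, and OpenCV's HSV conversion is assumed for the color transform.

```python
import numpy as np
import cv2

def fused_feature(patch_bgr, cell_size=4, fhog=None):
    """Build the 93-channel fused feature of Sect. 3.1.

    patch_bgr : H x W x 3 uint8 image patch (OpenCV BGR order).
    fhog      : hypothetical callable(channel, cell_size) -> (H/cell, W/cell, 31)
                standing in for the HOG implementation actually used.
    """
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)                 # move to HSV color space
    maps = [fhog(hsv[:, :, c], cell_size) for c in range(3)]         # 31-dim HOG per channel
    return np.concatenate(maps, axis=2)                              # (H/cell, W/cell, 93)
```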

3.2 Correlation tracking through fused feature

In this subsection, we introduce our fused feature into the background-aware correlation filter [13] to construct a better correlation tracking framework. We utilize a correlation filter whose spatial size is smaller than that of the training examples to reduce the boundary effects. Denote by \(x_k \in {{\mathbb {R}}^\mathrm{{T}}}\) the k-th channel of the fused feature vector, and let \(y \in {{\mathbb {R}}^\mathrm{{T}}}\) be the desired correlation output corresponding to a given sample. A correlation filter w, each channel \(w_k\) of which has dimensionality D (where \(T \gg D\)), is then learned by solving the following minimization problem:

$$\begin{aligned} E(w) = \sum \limits _{j = 1}^T {||{y_j} - \sum \limits _{k = 1}^K {w_k^{\top }\mathrm{{P}}{x_k}[\varDelta {\tau _j}]} |{|^2} + \lambda \sum \limits _{k = 1}^K {||{w_k}||_2^2}}, \end{aligned}$$
(1)

where \(\lambda\) is a regularization parameter, \(\mathrm{{P}}\) is a binary cropping matrix, and \(\mathrm{{P}}{x_k}[\varDelta {\tau _j}]\) generates all circular shifts of size D from the entire image patch over the steps \(j = \left[ {0,\ldots ,T - 1} \right]\). The transpose operator \({^\top }\) applied to a complex vector or matrix gives the conjugate transpose.

Note that Eq. (1) can readily be transformed into the frequency domain in order to improve computational efficiency. We introduce \({{\hat{g}}} = {[{{\hat{g}}}_1^{\mathrm{T}}, \ldots ,{{\hat{g}}}_K^{\mathrm{T}}]^{\mathrm{T}}}\) as an auxiliary variable. The objective in the frequency domain can then be written as:

$$\begin{aligned} \begin{aligned} E(w,{{\hat{g}}})&= ||{{\hat{y}}} - {{\hat{X}}}{{\hat{g}}}||_2^2 + \lambda ||w||_2^2,\\ \mathrm { s.t. }\quad {{\hat{g}}}&= \sqrt{T} (\mathrm{{F}}{\mathrm{{P}}^{\top }} \otimes {\mathrm{{I}}_K})w \end{aligned} \end{aligned}$$
(2)

where \({{\hat{X}}} = {[\mathrm{{diag}}{({{{\hat{x}}}_1})^{\top }},\ldots ,\mathrm{{diag}}{({{{\hat{x}}}_K})^{\top }}]^{\top }}\), \(I_K\) is the \(K\times K\) identity matrix, and \(\otimes\) denotes the Kronecker product. In particular, \({{\hat{A}}}\) represents the FFT of a signal A, and \(\mathrm {F}\) is the orthonormal \(T\times T\) matrix of complex basis vectors that maps any T-dimensional vectorized signal to its Fourier domain.

By directly employing the augmented Lagrangian method (ALM) [13], we can solve Eq. (2) and obtain the required correlation filter \({{{\hat{g}}}^{(f - 1)}}\), where f is the current frame number.
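For intuition only, the sketch below illustrates why formulating the ridge problem in the frequency domain is cheap: when the cropping matrix P is dropped, the problem reduces to the standard multi-channel correlation-filter ridge regression of KCF/DCF-style trackers, whose frequency-domain solution is element-wise. The actual RCT/BACF filter is not obtained this way; it is computed with ALM/ADMM on Eq. (2) as in [13].

```python
import numpy as np

def ridge_filter_no_crop(x, y, lam=1e-3):
    """Closed-form CF training with the cropping matrix P dropped (illustration only).

    x : (H, W, K) fused feature map, y : (H, W) desired Gaussian response.
    Returns the frequency-domain filter of shape (H, W, K).
    """
    x_hat = np.fft.fft2(x, axes=(0, 1))
    y_hat = np.fft.fft2(y)
    denom = np.sum(np.conj(x_hat) * x_hat, axis=2) + lam   # shared per-frequency denominator
    return np.conj(x_hat) * y_hat[..., None] / denom[..., None]
```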

3.3 Object location by reliable response

Fig. 2

Object location by our reliable response map. We improve the predictions by manipulating the reliable response map which is obtained by merging a coarse-to-fine mask, leading to an enhanced tracking result

The response map \(r^{(f)}\) in frame f, which stores the response value of every pixel, can be computed by applying the filter \({{{\hat{g}}}^{(f - 1)}}\) updated in the previous frame:

$$\begin{aligned} r^{(f)} = {\mathcal {F}}^{-1}\left( {\sum \limits _{k = 1}^K {{{\hat{x}}}_k^{(f)}\odot {{{\hat{g}}}_k^{(f-1)}}}}\right) , \end{aligned}$$
(3)

where \(\odot\) denotes the Hadamard product, and \({\mathcal {F}}^{-1}\) is the inverse FFT (IFFT) transform.
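A minimal NumPy sketch of Eq. (3), assuming the fused features and the filter are already available as frequency-domain arrays of shape (H, W, K); the variable names are illustrative only.

```python
import numpy as np

def response_map(x_hat_f, g_hat_prev):
    """Eq. (3): correlate the current features with the filter from frame f-1.

    x_hat_f, g_hat_prev : (H, W, K) frequency-domain feature map and filter.
    """
    corr = np.sum(x_hat_f * g_hat_prev, axis=2)   # channel-wise Hadamard product, summed over K
    return np.real(np.fft.ifft2(corr))            # inverse FFT back to the spatial domain
```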

Owing to the challenges typically faced in real-world tracking, such as deformation and rotation, the similarity between the target and the modeling template may decrease, leading to a high risk of model drift or mislocalization. The response map r obtained by Eq. (3) can therefore only be regarded as an original (coarse) response. How, then, can the potentially misleading redundant information contained in it (e.g., responses to similar objects) be removed? As shown in Fig. 2, when noise exists, the position with the maximum value in the response map does not necessarily correspond to the real target; simply taking the position with the highest response as the target position is rather unreliable. Through a large number of experiments, we empirically find that the response peak of the real target typically changes gradually, while the response peak of a distracting object is usually very steep and abrupt. Accordingly, in order to exclude such anomalies, we first identify the target proposals that are associated with relatively high values in the response map. To achieve this, we exploit a threshold \(\alpha\) that divides the response map \(r^{(f)}\) into two parts: the pixels with a gray value greater than \(\alpha\) belong to the target proposal set A, and the remaining ones are attributed to the background part B. The numbers of pixels contained in the two parts are denoted by \(N_{A,\alpha }\) and \(N_{B,\alpha }\), respectively. We vary \(\alpha\) from 0 to 255 and, for each value, count \(N_{A,\alpha }\) and \(N_{B,\alpha }\) to calculate the ratio of the target proposals in the patch, denoted as \(Q_\alpha ^{(f)}\) with f being the frame index, such that:

$$\begin{aligned} Q_\alpha ^{(f)} = \frac{{{N_{A,\alpha }^{(f)}}}}{{{N_{A,\alpha }^{(f)}} + {N_{B,\alpha }^{(f)}}}}. \end{aligned}$$
(4)

Equation (4) is evaluated repeatedly until the difference between \(Q_\alpha ^{(f)}\) and \(Q^{(1)}\) (the initial ratio of the target area in the patch) is less than the error range threshold \(\beta\):

$$\begin{aligned} |Q_\alpha ^{(f)} - Q^{(1)}| < \beta . \end{aligned}$$
(5)

When Eq. (5) is satisfied, the gray value of the pixels in the set A is reset to 255, while the remaining pixels are set to 0. From this, a number of connected domains are obtained. Then, any connected domain whose pixel area is less than a fixed threshold \(\mu\) is deleted, forming the fine mask matrix \(\mathrm{{M}}^{(f)}\). By merging \(\mathrm{{M}}^{(f)}\) with the original response map \(r^{(f)}\), the reliable response map \({\tilde{r}}^{(f)}\) is obtained. Finally, the position with the maximum value in the reliable response map \({\tilde{r}}^{(f)}\) is taken as the target location.
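The sketch below summarizes this reliable-response construction under stated assumptions: the raw response is rescaled to gray values in [0, 255] before thresholding, "merging" the mask with the response map is implemented as a Hadamard product (consistent with Fig. 1), and keeping the whole map when no \(\alpha\) satisfies Eq. (5) is our fallback assumption rather than part of the published method.

```python
import numpy as np
import cv2

def reliable_response(r, q_init, beta=0.07, mu=105):
    """Sect. 3.3: build the fine mask M^(f) and the reliable response map.

    r      : raw response map from Eq. (3).
    q_init : Q^(1), the initial ratio of the target area in the patch.
    """
    gray = cv2.normalize(r, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    mask = np.full_like(gray, 255)                 # fallback: keep everything (our assumption)
    for alpha in range(256):                       # scan the threshold used in Eq. (4)
        q_alpha = float(np.mean(gray > alpha))     # N_A / (N_A + N_B)
        if abs(q_alpha - q_init) < beta:           # stopping rule of Eq. (5)
            mask = np.where(gray > alpha, 255, 0).astype(np.uint8)
            break
    # remove connected domains whose pixel area is below mu to obtain the fine mask
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    for i in range(1, n):                          # label 0 is the background
        if stats[i, cv2.CC_STAT_AREA] < mu:
            mask[labels == i] = 0
    r_tilde = r * (mask / 255.0)                   # merge the mask with the raw response (Hadamard)
    return r_tilde, np.unravel_index(np.argmax(r_tilde), r_tilde.shape)
```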

3.4 Model updating and scale estimation

To obtain a robust approximation, at frame f, we use an online updating strategy which is formulated as:

$$\begin{aligned} {{{\hat{x}}}_\mathrm{{model}}^{(f)}} = (1 - \eta ){{{\hat{x}}}_\mathrm{{model}}^{(f-1)}} + \eta {{{\hat{x}}}^{(f)}}, \end{aligned}$$
(6)

where \({{{\hat{x}}}_\mathrm{{model}}^{(f)}}\) and \({{{\hat{x}}}_\mathrm{{model}}^{(f-1)}}\) represent the newly updated template model and the old one, respectively, and \(\eta\) is the learning rate.

In order to adapt to changes in the scale of the target, the filter is applied to multiple resolutions of the search area [22]. This returns S correlation outputs at different scales, where S is the number of scales. The scale with the maximum correlation output is used to update the object location and the subsequent scale. Algorithm 1 recapitulates the whole method, and a short sketch of the update and scale-selection steps follows it.

Algorithm 1 (pseudo-code of the proposed RCT tracking procedure)
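A minimal sketch of the online update of Eq. (6) and of the scale selection, assuming the S per-scale reliable response maps have already been computed; the default \(\eta\) follows the value reported in Sect. 4.

```python
import numpy as np

def update_model(x_model_hat, x_hat, eta=0.013):
    """Eq. (6): linear interpolation between the old template and the new observation."""
    return (1.0 - eta) * x_model_hat + eta * x_hat

def select_scale(responses):
    """Return the index of the scale whose reliable response peak is highest.

    responses : list of S response maps, one per tested resolution of the search area.
    """
    return int(np.argmax([r.max() for r in responses]))
```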

4 Experimental results

In order to present an objective evaluation of the performance of the proposed approach, we examine our RCT tracker on three standard datasets: OTB50 [30], OTB100 [29], and Temple-Color128 (TC128) [23]. Both the general capability and the ability to handle special scenarios are tested. The experiments are performed in Matlab R2016b on an Intel i7 3.0 GHz CPU with 16 GB of RAM. In all experiments, we use the same parameter values for all image sequences. We employ HOG features with \(4 \times 4\) cells to obtain the fused feature. The regularization factor is empirically set to 0.001, and the number of scales is set to 5 with a scale step of 1.01. A 2D Gaussian function with a bandwidth of \(\sqrt{wh/16}\) is used to define the correlation output for an object of size [h, w]. The learning rate \(\eta\) of the correlation filter is 0.013. The pixel area threshold \(\mu\) is set to 105, and the error range \(\beta\) to 0.07.
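For reference, the parameter values stated above are collected below, together with a sketch of the 2D Gaussian correlation output; the parameter names are ours, and evaluating the label on the object-sized grid (rather than the feature-cell grid) is an assumption made for illustration.

```python
import numpy as np

# Parameter values as reported in Sect. 4 (names chosen for illustration).
PARAMS = dict(cell_size=4, reg_lambda=1e-3, num_scales=5, scale_step=1.01,
              eta=0.013, mu=105, beta=0.07)

def gaussian_label(h, w):
    """2D Gaussian correlation output for an object of size [h, w], bandwidth sqrt(wh/16)."""
    sigma = np.sqrt(w * h) / 4.0
    ys = np.arange(h) - (h - 1) / 2.0
    xs = np.arange(w) - (w - 1) / 2.0
    Y, X = np.meshgrid(ys, xs, indexing="ij")
    return np.exp(-(X ** 2 + Y ** 2) / (2.0 * sigma ** 2))
```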

We compare our tracker with a range of strong trackers: BACF [13], ECO_HC [9], ARCF [17], AutoTrack [21], KCC [28], Staple_CA [25], MCPF [32], SRDCFdecon [11], and BIT [7]. Different metrics may be used for evaluation depending on the preferred perspective, among which one-pass evaluation (OPE) is arguably the most common. OPE runs a tracker on each sequence once: it initializes the tracker using the ground-truth object state in the first frame, and reports the average precision or success rate over all subsequent results. Accordingly, OPE is also used herein to comparatively evaluate the present work. The center location error (CLE) is the Euclidean distance between the centers of the ground-truth and estimated bounding boxes. The overlap precision (OP) is computed as the fraction of frames in a sequence where the intersection-over-union (IOU) overlap between the ground-truth box and the tracker prediction exceeds a threshold, and the area-under-curve (AUC) score is the average of the success rates corresponding to the sampled OP thresholds. The trackers are ranked by the distance precision (DP) score at a CLE threshold of 20 pixels in the precision plots, and by the AUC score in the success plots.
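A minimal sketch of these metrics for a single sequence; sampling the OP thresholds at 21 evenly spaced points in [0, 1] follows common OTB practice and is an assumption rather than a detail stated in the text.

```python
import numpy as np

def center_location_error(gt, pred):
    """CLE: Euclidean distance between box centres; boxes are (x, y, w, h)."""
    cg = (gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0)
    cp = (pred[0] + pred[2] / 2.0, pred[1] + pred[3] / 2.0)
    return float(np.hypot(cg[0] - cp[0], cg[1] - cp[1]))

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[0] + a[2], b[0] + b[2]), min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def dp_and_auc(gt_boxes, pred_boxes):
    """DP at a 20-pixel CLE threshold and AUC over sampled IOU thresholds."""
    cle = np.array([center_location_error(g, p) for g, p in zip(gt_boxes, pred_boxes)])
    ious = np.array([iou(g, p) for g, p in zip(gt_boxes, pred_boxes)])
    dp = float(np.mean(cle <= 20))
    thresholds = np.linspace(0.0, 1.0, 21)                     # sampled OP thresholds
    auc = float(np.mean([np.mean(ious > t) for t in thresholds]))
    return dp, auc
```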

4.1 Evaluation on OTB datasets

Fig. 3

Results of the proposed tracker and other compared trackers on OTB dataset

We implement the one-pass experiment on the OTB benchmark datasets. Figure 3 shows the precision and success plots on OTB50 and OTB100, respectively; the DP and AUC scores of all compared trackers are shown in the legend. Overall, the RCT tracker performs well on these two evaluation metrics: it ranks first in the success plots and second in the precision plots among all competing algorithms. Our RCT approach employs the “background-aware” mechanism of the BACF tracker, but achieves a remarkable gain over this baseline. BIT is a tracker that extracts low-level biologically inspired features while imitating an advanced learning mechanism to combine generative and discriminative models for target location. Our RCT improves on BIT by 15.72% in terms of AUC score on OTB50, and by 14.52% on OTB100, which testifies to the benefit of the fused features embedded in the correlation tracking framework. ECO_HC is a well-known correlation filter-based tracking algorithm; our method also obtains better results than ECO_HC in terms of both DP and AUC scores. Note that the MCPF tracker outperforms RCT by 2.66% on OTB50 and by 2.65% on OTB100 in terms of DP score, yet its AUC scores are 1.99% and 1.62% lower than those of our tracker, respectively.

Table 1 AUC scores of the proposed tracker versus other state-of-the-art trackers

4.2 Evaluation on TC128 dataset

Fig. 4

Results of the proposed tracker and other compared trackers on the TC128 dataset

TC128 is a comprehensive color-video tracking benchmark. The results of the ten trackers on the 128 sequences are summarized in Fig. 4. As can be seen from these results, in both the precision and success plots our tracker obtains third place and performs reliably. Compared to the baseline BACF, RCT has a significant advantage of 4.48% in DP score and 4.07% in AUC score, which is attributed to the robust fused features and the reliable response. MCPF utilizes deep features extracted from a pre-trained convolutional neural network and obtains first place in terms of DP score. Owing to an adaptive decontamination of the training set and a conservative model update strategy, ECO_HC also performs better than RCT and ranks first in terms of AUC score on this dataset.

For a further overall comparison, Table 1 summarizes the AUC scores of all compared trackers from the experimental results on the three datasets. It shows that our RCT tracker achieves the highest average AUC score of 59.21%, outperforming all trackers based on handcrafted features and even MCPF, which utilizes deep features (and hence involves substantially more computation). Moreover, our approach uses only the simple BACF as its baseline; it is worth noting that ECO_HC could further enhance its performance within our framework.

4.3 Attribute-based performance

Fig. 5

Results of the proposed tracker and other compared trackers on the annotated challenging attributes

In the OTB datasets, all image sequences are annotated with 11 attributes which cover various challenging factors in visual tracking, including scale variation (SV), occlusion (OCC), illumination variation (IV), motion blur (MB), deformation (DEF), fast motion (FM), out-of-plane rotation (OPR), background clutter (BC), out-of-view (OV), in-plane rotation (IPR) and low resolution (LR). Figure 5 shows the results for six representative attributes (FM, IV, MB, OCC, OPR and SV) over the OTB50 benchmark in terms of AUC score, testifying to the strong attribute-wise performance of our RCT tracker. The proposed method performs robustly against the other state-of-the-art trackers in most challenging scenes.

Fig. 6

Tracking results of RCT in qualitative comparison with state-of-the-art algorithms

Figure 6 shows a qualitative comparison of our method with several state-of-the-art trackers, including MCPF, ECO_HC, BACF and ARCF, in challenging situations. The example frames are from the DragonBaby, Soccer and Rubix sequences, respectively. Our approach clearly performs well compared to the others. Sequences with fast motion (DragonBaby), illumination variation (Soccer), scale variation (Rubix), and in-plane and out-of-plane rotations (DragonBaby, Soccer, Rubix) are successfully handled by our method without model drift. Videos with motion blur (DragonBaby, Soccer) and occlusion (Soccer) also benefit from our reliable response strategy. It should be noted that, for the DragonBaby and Rubix sequences, only our RCT tracker keeps estimating both the position and scale of the target accurately. To sum up, the proposed tracking algorithm performs robustly in various tracking scenes and alleviates model drift effectively.

4.4 Real-time performance

Table 2 Tracking speed comparison over OTB50 benchmark
Table 3 Tracking performance comparison of the proposed tracker and its key components over OTB50 benchmark

In addition to robustness in challenging scenes, real-time performance is another essential requirement for online visual tracking. Table 2 presents the tracking speed comparison over the OTB50 benchmark in terms of average FPS. It can be seen that the ARCF, SRDCFdecon and MCPF trackers (<25 fps) cannot meet the real-time requirement: they generally need to solve a complicated model formulation or extract deep features through a time-consuming procedure, which may limit their use in many real-time applications. On an Intel Core i7-9700 CPU, our RCT tracker operates at a real-time speed of 26.45 fps without using multi-threading or a GPU; the tracking speed can be further improved by optimizing the code. Even so, RCT runs more than 50 times faster than MCPF, which operates on a high-end NVIDIA GTX 1080Ti GPU with a measured tracking speed of 0.51 fps.

To verify the real-time performance and the effectiveness of the key components of our tracker, we also report the average tracking speed and the DP/AUC scores over the OTB50 benchmark in Table 3. The notation is as follows: (1) ‘Baseline’ denotes the original BACF; (2) ‘Baseline+FF’ means the baseline tracker with our fused features; (3) ‘Baseline+RR’ stands for the baseline tracker with the designed scheme of reliable response; (4) ‘Baseline+FF+RR’ is our final tracker RCT. From Table 3, we can see that both modules operate efficiently without degrading the real-time performance of the baseline tracker, and that they contribute substantial improvements in tracking accuracy over the baseline method.

5 Conclusion

In this paper, we have proposed a real-time correlation filter-based tracking method that uses multi-channel fused features and reliable response maps. The correlation filter utilizing the multi-channel fused features yields a significant improvement in tracking performance when dealing with challenging factors such as illumination variation and rotation. We have also proposed a novel strategy for obtaining a more reliable response map and locating the target through it; this reduces the probability of incorrect localization under severe occlusion and motion blur, thereby suppressing model drift. Comparative experimental investigations have demonstrated, both quantitatively and qualitatively, that our approach achieves performance comparable to that of state-of-the-art tracking methods. In particular, the proposed approach achieves a significant improvement in overall tracking performance compared to the baseline BACF, while maintaining a real-time tracking speed of 26 fps. Our method still has shortcomings to be addressed; for example, the tracker cannot ensure stable tracking in the face of long-term occlusion. Future work will involve investigating more powerful, low-dimensional fused features and a more efficient tracking framework to deal with long-term occlusions in real-time applications. Besides, our method can be generalized to other areas of computer vision, such as human appellative [1].