1 Introduction

Visual object tracking has always been a critical task in computer vision and is the premise of many higher-level image processing tasks. Object tracking uses the target and background information of the initial video frame to predict the position and scale of the target in subsequent frames, and it is widely used in video surveillance, intelligent transportation, intelligent medical care, and other practical scenarios [1,2,3]. In recent years, object tracking algorithms have achieved outstanding tracking performance. However, it is still challenging to localize the target accurately in complex environments involving scale variation, illumination variation, and occlusion.

Fig. 1 The overall framework of the SCDCF tracker. SCDCF includes three processes: training, detection, and update, which are marked by blue, green, and purple boxes, respectively

Object tracking algorithms based on discriminative correlation filters (DCF) have received extensive attention due to their excellent tracking performance and high computational efficiency [4]. The DCF method collects samples via a circulant matrix and transforms the correlation operation in the time domain into point-wise multiplication in the frequency domain through the Fast Fourier Transform (FFT), which dramatically reduces the computational complexity and improves the speed of the algorithm. Most early correlation filter algorithms used handcrafted features such as the histogram of oriented gradients (HOG) and color names (CN) to represent targets, showing favorable tracking results and excellent computational efficiency and reaching the state of the art at that time [5,6,7]. Convolutional neural network (CNN) features have stronger representation power than handcrafted features, so many researchers have introduced multi-channel CNN features into the correlation filter framework [8,9,10].

It is indisputable that the success of recent DCF-based trackers is mainly due to the use of deep CNN features. As a result, researchers have proposed several methods to exploit the potential of deep features. Some trackers use principal component analysis (PCA) [8, 11] to reduce or compress the deep feature dimension but still face the high computational and memory costs required to extract deep features. Other algorithms improve tracking efficiency by using attention mechanisms [12] and assigning weights [13, 14] to deep feature channels. Nevertheless, the number of feature channels used by these algorithms is still large, and their computational efficiency needs to be improved. Many researchers have recently introduced saliency detection into the correlation filter tracking framework and developed advanced tracking algorithms with good results. However, most of these algorithms use image saliency information either to construct spatial or temporal regularization terms in the DCF model [15,16,17] to alleviate the boundary effect problem, or to reinforce learning of the target appearance [18, 19], without paying attention to the correlation between multi-channel deep features and target saliency information. Recent studies [20, 21] have shown that different feature channels have different characteristics and contributions during tracking, especially multi-channel deep features. Deep features may contain many interfering channels with information irrelevant or redundant to the target, and directly fusing all of the hundreds or thousands of deep feature channels may produce severe overfitting, which degrades tracking performance.

Based on the above discussion, this paper proposes a new saliency-aware channel selection discriminative correlation filter (SCDCF) for robust visual tracking. The overall framework of the proposed tracker is shown in Fig. 1. SCDCF includes three stages: training, detection, and update. Firstly, we obtain the multi-channel deep features containing the energy of the target saliency-aware region through the saliency detection and feature extraction processes. All channels are evaluated with the proposed saliency-aware average energy ratio (SAER) indicator to obtain effective feature channels that focus more on the target information. Channels are assigned different weights according to their importance, and the resulting features are used to train the filter, which reduces the filter dimension and improves its discrimination power. Then, the selected feature channels and the trained filter perform correlation operations to obtain a response map that locates the target. Finally, the proposed model updating mechanism performs adaptive updating to avoid model degradation. The ADMM [22] algorithm is used to accelerate the solution of the proposed SCDCF model.

The main work of this paper is summarized as follows:

  1. The saliency detection method is introduced to obtain the saliency information of the target. The feature energy of the saliency-aware region is calculated according to the target mask to highlight the target appearance and suppress the interference of background information within the bounding box during tracking.

  2. A new channel evaluation indicator is proposed to evaluate the importance of feature channels. Based on it, an adaptive channel selection mechanism is designed to select effective feature channels, reduce the feature dimension, and enhance the discrimination ability of the filter. According to the score, channel reliability is judged and different weights are assigned to improve the representation ability of the features.

  3. An adaptive model updating mechanism is designed to judge the reliability of the tracking results according to the fluctuation of the response map over recent frames, ensuring the accuracy of the appearance model and alleviating the problem of model degradation.

  4. The proposed tracker is evaluated on five public tracking datasets: OTB2013 [23], OTB2015 [24], TC128 [25], UAV123 [26], and VOT2018 [27]. Experimental results show that SCDCF is superior to many advanced trackers.

2 Related work

Early DCF algorithms mostly used handcrafted features such as color, texture, and edge features to represent targets. The MOSSE [28] tracker, which first introduced correlation filter theory into the field of object tracking, used only grayscale features to describe the target, and its tracking speed reached hundreds of frames per second. Subsequently, Henriques et al. [29] incorporated multi-channel HOG features into the correlation filter framework and improved the accuracy by mapping the linear space to a high-dimensional space through kernel functions. Danelljan et al. [30] extended the original RGB color space to 11 dimensions and trained correlation filters using color names (CN) features containing rich color information. Many subsequent trackers [5, 7] utilize complementary handcrafted features to describe the target and enhance the feature representation power. Recently, owing to the excellent computational efficiency and accuracy of handcrafted features, handcrafted-feature-based DCF trackers have shown significant advantages in aerial target scenarios and are widely used on unmanned aerial vehicle (UAV) platforms [15, 31, 32].

Deep features show strong representation ability thanks to the rapid development of neural networks, and many trackers use multi-channel convolutional features extracted by deep neural network models to represent targets. Ma et al. [10] used the VGG-19 network to extract multi-layer convolutional features of the target and achieved precise localization according to the characteristics of different feature layers. Danelljan et al. [9] proposed the DeepSRDCF tracker, which models the target by combining spatially regularized discriminative correlation filters (SRDCF [33]) with convolutional features. The C-COT [34] tracker used a deep neural network to extract features, obtained feature maps in the continuous spatial domain through interpolation operations, and applied Hessian matrices to achieve sub-pixel localization of the target position. Noting the interfering channels and running speed problems caused by multi-channel deep features, many tracking algorithms use attention mechanisms, feature compression, and other methods to alleviate them and achieve robust, fast tracking.

Saliency detection simulates the human visual attention mechanism to detect the most interesting and visually expressive regions in an image. It is widely used in visual tasks such as object detection, semantic segmentation, and image captioning. Many recent works have applied it to object tracking with good results. For example, some trackers introduce image saliency detection into the regularization term of the DCF formulation to alleviate the boundary effect problem. Exploiting the characteristics of aerial object tracking, Fu et al. [15] used a dual regularization strategy to construct a target saliency regularization model that achieves accurate real-time tracking of aerial objects. Feng et al. [16] integrated saliency information and target change information into a spatial weight map and proposed a dynamic saliency spatial regularization correlation filter method. Yang et al. [17] introduced two saliency information extraction methods into the regularization process and proposed co-saliency spatio-temporal regularized correlation filters. In addition, some researchers have used saliency information to highlight salient image regions and reinforce the learning of the target appearance [18, 19].

Although the tracking performance of the above DCF trackers has improved, the correlation between target saliency information and feature channel information is not considered. Therefore, we investigate the relationship between target saliency region information and multi-channel deep features and propose a new channel selection method based on image saliency information. By combining saliency detection with feature channel selection, we can accurately highlight the target region and suppress the interference of background information in the tracking box, reduce the feature channel dimension, and improve the discriminative ability of the filter.

3 Proposed method

In this section, we first briefly review the discriminative correlation filters and describe the saliency-aware detection mechanism used. Then we propose an adaptive channel selection method and our SCDCF model and use the ADMM method for optimization. Finally, we develop a new model update strategy.

3.1 Revisit of DCF

Given the initial target position in the first frame, the task of object tracking is to estimate the target position in subsequent frames. To locate the target in the \((t + 1)\)-th frame, DCF learns a multi-channel correlation filter from the training sample \(\left\{ {{X_t},Y} \right\}\) of the t-th frame, where \({{X}_{t}}\in {{\mathbb {R}}^{W\times H\times C}}\) is a C-channel feature map with width W and height H, and Y is the expected Gaussian-shaped response map. To obtain the multi-channel correlation filter, DCF formulates the objective as a regularized least squares problem:

$$\begin{aligned} \underset{{{F}_{t}}}{\mathop {{{{\tilde{F}}}_{t}}=\arg \min }}\,\left\| \sum \limits _{i=1}^{C}{F_{t}^{i}\otimes X_{t}^{i}}-Y \right\| _{2}^{2}+\lambda \sum \limits _{i=1}^{C}{\left\| F_{t}^{i} \right\| _{2}^{2}}, \end{aligned}$$
(1)

where \(\otimes\) denotes circular correlation operator, \(X_{t}^{i}\in {{\mathbb {R}}^{W\times H}}\) and \(F_{t}^{i}\in {{\mathbb {R}}^{W\times H}}\) represent the i-th channel of \({{X}_{t}}\) and \({{F}_{t}}\), \(\lambda \sum \nolimits _{i=1}^{C}{\left\| F_{t}^{i} \right\| _{2}^{2}}\) is a regularization term, and \(\lambda\) is the regularization parameter. The task can be transformed into the Fourier domain to derive the closed-form solution of Eq. 1 as follows:

$$\begin{aligned} \hat{F}_{t}^{i}=\frac{{{\left( \hat{X}_{t}^{i}\right) }^{*}}\odot \hat{Y}}{{{\left( \hat{X}_{t}^{i}\right) }^{*}}\odot \hat{X}_{t}^{i}+\lambda }, \end{aligned}$$
(2)

where \(\hat{\cdot }\) stands for the Discrete Fourier Transform (DFT), \(\cdot ^{*}\) indicates the complex conjugate operator, and \(\odot\) represents the element-wise product operator.

According to the feature vector \(Z\in {{\mathbb {R}}^{W\times H\times C}}\) extracted from the candidate image of the \((t + 1)\)-th frame, the response map \(R\in {{\mathbb {R}}^{W\times H}}\) can be obtained by the following equation:

$$\begin{aligned} R={{\mathcal {F}}^{-1}}\left( \sum \limits _{i=1}^{C}{{{{\hat{Z}}}^{i}}\odot {\hat{F}_{t}^{i}}}\right) , \end{aligned}$$
(3)

where \({{\mathcal {F}}^{-1}}\) denotes the inverse DFT. The target position in the \((t + 1)\)-th frame is determined by the peak position in the response map R.
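To make the training and detection steps above concrete, the following minimal numpy sketch implements the per-channel closed-form filter of Eq. 2 and the response computation of Eq. 3. The feature tensor, Gaussian label width, and regularization value are illustrative assumptions; feature extraction, cosine windowing, and sub-pixel refinement are omitted.

```python
# A minimal sketch of single-frame DCF training (Eq. 2) and detection (Eq. 3).
import numpy as np

def gaussian_label(w, h, sigma=2.0):
    """Expected Gaussian-shaped response Y centred on the target."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    y = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return np.fft.fftshift(y)            # peak moved to the origin

def train_dcf(X, Y, lam=1e-2):
    """Per-channel closed-form filters (Eq. 2). X: (H, W, C), Y: (H, W)."""
    X_hat = np.fft.fft2(X, axes=(0, 1))
    Y_hat = np.fft.fft2(Y)
    return (np.conj(X_hat) * Y_hat[..., None]) / (np.conj(X_hat) * X_hat + lam)

def detect(Z, F_hat):
    """Response map from the new frame's features Z (Eq. 3)."""
    Z_hat = np.fft.fft2(Z, axes=(0, 1))
    R = np.fft.ifft2((Z_hat * F_hat).sum(axis=2)).real
    dy, dx = np.unravel_index(R.argmax(), R.shape)   # peak = target shift
    return R, (dy, dx)

# toy usage with random "features"
H, W, C = 50, 50, 31
X = np.random.rand(H, W, C)
F_hat = train_dcf(X, gaussian_label(W, H))
R, shift = detect(X, F_hat)              # shift should be near (0, 0) here
```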

3.2 Background-aware correlation filter

The overall objective function of Background-Aware Correlation Filter (BACF) can be expressed as:

$$\begin{aligned} \underset{F}{\arg \min } \frac{1}{2}\left\| \sum _{i=1}^{C} X^{i} \otimes \left( P^ \top F^{i}\right) -Y\right\| _{2}^{2}+\frac{\lambda }{2} \sum _{i=1}^{C}\left\| F^{i}\right\| _{2}^{2} \end{aligned}$$
(4)

where \(X^i\in \mathbb {R}^T\) (T is the number of pixels of X), P is a binary matrix used to crop N (\(N \ll T\)) elements from the feature samples X, and \(P^ \top\) is the transpose of P.

The traditional correlation filter algorithm performs cyclic shifts on the positive sample extracted from the image to obtain negative samples for training the filter. It does not model the real background information, which may lead to boundary effects and model drift. The handcrafted-feature-based BACF instead uses a cropping matrix to crop negative samples from the real background, significantly improving sample quality and quantity. Unfortunately, BACF uses handcrafted features to represent the target and treats all spatial feature channels equally, so it cannot accurately capture appearance changes of the target. In addition, BACF expands the search area to cope with fast motion, which also introduces more background interference and limits further performance gains. Therefore, we introduce multi-channel deep features into the BACF framework to improve the accuracy of target appearance modeling and use saliency-aware detection and channel selection mechanisms to reduce the interference of background clutter during tracking.
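For illustration, the sketch below shows the role of the cropping matrix P in Eq. 4 on a one-dimensional signal, under assumed sizes: applying P keeps the N central samples of a T-dimensional sample, while \(P^\top\) embeds an N-dimensional filter back into the full T-dimensional support.

```python
# A sketch, under illustrative shapes, of the binary cropping operator P in Eq. 4.
import numpy as np

def crop(x, N):
    """P x : keep the N central elements of the T-dimensional sample x."""
    T = x.shape[0]
    start = (T - N) // 2
    return x[start:start + N]

def pad(f, T):
    """P^T f : embed the small filter f back into a T-dimensional signal."""
    N = f.shape[0]
    out = np.zeros(T, dtype=f.dtype)
    start = (T - N) // 2
    out[start:start + N] = f
    return out

x = np.arange(10.0)        # T = 10 background-augmented sample
f = crop(x, 4)             # N = 4 target-sized region
print(pad(f, 10))          # filter support limited to the target area
```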

3.3 Saliency-aware detection mechanism

The existing advanced feature channel selection methods [20, 35] filter channels according to the feature response within the rectangular target box. These methods improve the quality of the selected feature channels to a certain extent but still introduce some background information that interferes with filter learning. This paper aims to compute the energy of a region that better fits the target appearance contour and use it for channel selection. Therefore, we introduce saliency detection [36] and design a saliency-aware detection mechanism. As shown in Fig. 2, according to the target bounding box (red box), the region near the target is first selected as the saliency detection region (blue box). The blue box region is then processed to obtain a saliency map, and the target region mask is generated after threshold mapping. The generated mask can robustly segment the target from the surrounding region. Combined with the extracted search-area features, we obtain multi-channel features focused on the target saliency-aware region, which significantly shields the noise caused by background information inside the tracking box. We use this property of saliency detection to compute the target saliency region (as shown in Fig. 3a). From the feature map extracted from the search region (Fig. 3b), we obtain the saliency-aware region feature map (Fig. 3c) and the background region feature map (Fig. 3d) of each channel.
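A minimal sketch of this mask-generation step is given below, assuming the saliency map has already been produced by an external detector such as the one cited above; the normalization scheme and the threshold value are illustrative assumptions.

```python
# Threshold a saliency map into a binary target mask and split the search-region
# features into saliency-aware and background parts.
import numpy as np

def generate_mask(saliency_map, thresh=0.5):
    """Normalize the saliency map to [0, 1] and threshold it into a binary mask."""
    rng = saliency_map.max() - saliency_map.min() + 1e-12
    s = (saliency_map - saliency_map.min()) / rng
    return (s > thresh).astype(np.float32)

def split_features(X, mask):
    """X: (H, W, C) search-region features; mask resized beforehand to (H, W)."""
    m = mask[..., None]
    return X * m, X * (1.0 - m)        # saliency-aware / background feature maps
```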

Fig. 2 Visualization of the mask generation process. From top to bottom, the images denote the sample patch, saliency map, and generated mask. From left to right, the four sequences from the OTB2015 dataset are Jogging-1, Human2, MotorRolling, and Trans, respectively

Fig. 3 a Schematic diagram of the mask generation process. b Feature map of the search region. c Feature map of the saliency-aware region. d Feature map of the background region

3.4 Adaptive feature channel selection

Fig. 4 Visualization of feature maps for different channels. From top to bottom, the three sequences from the OTB2015 dataset are Girl2, Human3, and MotorRolling, respectively

The achievements of deep DCF trackers in recent years are largely attributed to the use of multi-channel convolutional features. However, because the number of training samples in visual tracking is limited, the deep networks used to extract convolutional features are usually pre-trained for other computer vision tasks; for example, VGGNet and MobileNet are trained on ImageNet [37]. When such networks, trained on general objects, are used to extract multi-channel features of a specific target, the hundreds of channels may contain many interfering channels that carry no target-area information or carry mostly background information, which harms the learning of the correlation filter. Figure 4 shows the difference between effective channels and interfering channels. Since the DCF tracker obtains the response map by extracting search-area features around the target position in the previous frame, the feature channels that benefit tracking should concentrate more energy in the target area and less in the rest of the search area.

Combining the analysis in Sect. 3.3, we propose a new feature channel evaluation indicator. As shown in Fig. 3c and 3d, we divide the feature map after saliency detection into the target saliency-aware region feature map and the background region feature map. The feature channel is evaluated by calculating the average energy ratio of these two parts. The proposed SAER indicator is defined as follows:

$$\begin{aligned} SAER(i)=\frac{{{E}_{O}}({{X}^{i}})}{{{E}_{B}}({{X}^{i}})},i=1,2,\cdots ,C, \end{aligned}$$
(5)

where \({{X}^{i}}\) denotes the ith channel of feature \(X\in {{\mathbb {R}}^{W\times H\times C}}\). We define \(E_O (X^i )\) as the average energy value of the target saliency-aware region O:

$$\begin{aligned} {{E}_{O}}({{X}^{i}})=\frac{\sum \nolimits _{(p,q)\in O}{V(p,q)}}{Area(O)}, \end{aligned}$$
(6)

where \(V(p,q)\) is defined as the feature energy value at position \((p,q)\), and Area(O) represents the area of region O. Similarly, \(E_B (X^i )\) is defined as the average energy value of the background region:

$$\begin{aligned} {{E}_{B}}({{X}^{i}})=\frac{\sum \nolimits _{(p,q)\in S}{V(p,q)}-\sum \nolimits _{(p,q)\in O}{V(p,q)}}{Area(S)-Area(O)}, \end{aligned}$$
(7)

where S denotes the search region. We judge the confidence of a feature channel according to its SAER score: a higher score indicates that the channel contains richer target information, while a lower score indicates that the channel contains more background interference. Therefore, we calculate the SAER scores of all channels and adaptively select the channels whose scores exceed a given threshold for filter learning, reducing the interference of ineffective feature channels.
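The following sketch illustrates how the SAER scores of Eqs. 5-7 and the threshold-based selection could be computed, assuming the feature energy \(V(p,q)\) is taken as the absolute feature value and the binary mask has been resized to the feature resolution; the threshold value matches the one reported in Sect. 4.1.

```python
# SAER scores (Eqs. 5-7) and adaptive channel selection.
import numpy as np

def saer_scores(X, mask, eps=1e-12):
    """X: (H, W, C) search-region features, mask: (H, W) binary target mask."""
    energy = np.abs(X)                                  # assumed V(p, q) per channel
    area_o = mask.sum() + eps                           # Area(O)
    area_b = mask.size - mask.sum() + eps               # Area(S) - Area(O)
    e_o = (energy * mask[..., None]).sum(axis=(0, 1)) / area_o        # Eq. 6
    e_b = (energy * (1 - mask)[..., None]).sum(axis=(0, 1)) / area_b  # Eq. 7
    return e_o / (e_b + eps)                                           # Eq. 5

def select_channels(X, mask, thresh=1.37):
    """Keep only channels whose SAER score exceeds the threshold."""
    scores = saer_scores(X, mask)
    keep = np.where(scores > thresh)[0]
    return X[..., keep], scores[keep]
```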

On the other hand, the channel attention mechanism has been widely used in computer vision tasks in recent years. It models feature channels to judge their importance and assigns greater weights to more important channels to enhance the discrimination of the filter. We therefore combine the idea of channel attention with the proposed saliency-aware channel selection mechanism, using SAER to judge channel importance and assigning different weights to the selected channels, which improves tracking efficiency and alleviates the shortcomings of channel attention. After saliency-aware channel selection, the effective features with higher discriminative power and the score sequence A containing the SAER score of each channel are obtained; the weight \(w^i\) of the i-th channel is then expressed as:

$$\begin{aligned} {{{w}^{i}}=1+\frac{1}{2}\times \frac{A(i)-\min (A)}{\max (A)-\min (A)}.} \end{aligned}$$
(8)
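Continuing the previous sketch, the channel weights of Eq. 8 can be obtained from the SAER score sequence A of the selected channels as follows (the small constant added to the denominator is a numerical-stability assumption).

```python
# Channel weights from the SAER scores of the selected channels (Eq. 8):
# min-max normalize the score sequence A and map it to weights in [1, 1.5].
import numpy as np

def channel_weights(A, eps=1e-12):
    A = np.asarray(A, dtype=np.float64)
    return 1.0 + 0.5 * (A - A.min()) / (A.max() - A.min() + eps)

# X_E, A = select_channels(X, mask)     # from the previous sketch
# X_S = X_E * channel_weights(A)        # weighted features used to train the filter
```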

3.5 Modeling and optimization of the SCDCF

Using the proposed feature channel selection method, we obtain the effective features \(X_E\) that focus more on the target information and the corresponding weight sequence w. The proposed SCDCF model can therefore be expressed as:

$$\begin{aligned} {\underset{F}{\arg \min } \frac{1}{2}\left\| \sum _{i=1}^{D} w^{i} X_{E}^{i} \otimes \left( P^{\top } F^{i}\right) -Y\right\| _{2}^{2}+\frac{\lambda }{2} \sum _{i=1}^{D}\left\| F^{i}\right\| _{2}^{2},} \end{aligned}$$
(9)

where \(X_E^i\) and \(w^i\) denote the i-th feature channel of the effective feature \(X_E\) and its weight. After channel selection, the number of channels is reduced from C to D.

To improve computational efficiency, we use \(X_S\) to denote the final features used to train the filter in the optimization process, that is, \(X_S^i=w^i \times X_E^i\), where \(X_S^i\) represents the i-th feature channel of \(X_S\). Converting Eq. 9 to the frequency domain yields:

$$\begin{aligned} \begin{aligned} \underset{F,\hat{G}}{\mathop {\arg \min }}\,\frac{1}{2}\left\| {{{\hat{X}}}_{S}}\odot \hat{G}-\hat{Y} \right\| _{2}^{2}+\frac{\lambda }{2}\left\| F \right\| _{2}^{2}&\\ s.t.,\text { }\hat{G}=\sqrt{T}H{{P}^{\top }}F,&\\ \end{aligned} \end{aligned}$$
(10)

where \(\hat{G}=[{{\hat{G}}^{1}},{{\hat{G}}^{2}},\ldots ,{{\hat{G}}^{D}}]\) is an auxiliary variable matrix, H is the orthonormal \(T\times T\) matrix of complex basis vectors for mapping any T-dimensional vector to the Fourier domain (e.g., \(\hat{g}=\sqrt{T}Hg\)). We employ the augmented Lagrangian method to optimize Eq. 10:

$$\begin{aligned} \begin{aligned} \mathcal {L}(F,{\hat{G}},{\hat{\vartheta }})=&\frac{1}{2}\left\| {{{\hat{X}}}_{S}}\odot {\hat{G}}-{\hat{Y}} \right\| _{2}^{2}+\frac{\lambda }{2}\left\| F \right\| _{2}^{2} \\&+{\hat{\vartheta }}{{({\hat{G}}-\sqrt{T}H{{P}^{\top }}F)}^{\top }} \\&+\frac{\eta }{2}\left\| {\hat{G}}-\sqrt{T}H{{P}^{\top }}F \right\| _{2}^{2}, \end{aligned} \end{aligned}$$
(11)

where \({\hat{\vartheta }}={{[{{\hat{\vartheta }}^{1}},{{\hat{\vartheta }}^{2}},\ldots ,{{\hat{\vartheta }}^{D}}]}^{\top }}\in {{\mathbb {R}}^{T\times D}}\) denotes the Lagrangian multiplier and \(\eta\) is the penalty parameter.

The ADMM algorithm can be applied to Eq. 11 to split it into three independent subproblems, each of which has a closed-form solution:

$$\begin{aligned} \left\{ \begin{aligned} {F^{l+1}} =&\mathop {\arg \min }\limits _{F}\ \frac{\lambda }{2}\left\| F \right\| _2^2+{\hat{\vartheta }}{\left({\hat{G}} - \sqrt{T} H{P^ \top }F\right)^ \top }\\ \quad&+\frac{\eta }{2}\left\| {{\hat{G}} - \sqrt{T} H{P^ \top }F} \right\| _2^2 \\ {{\hat{G}}^{l+1}} =&\mathop {\arg \min }\limits _{{\hat{G}}}\ \frac{1}{2}\left\| {{{{\hat{X}}}_S} \odot {\hat{G}} - {\hat{Y}}} \right\| _2^2+{\hat{\vartheta }}{\left({\hat{G}} - \sqrt{T} H{P^ \top }F\right)^ \top } \\ \quad&+\frac{\eta }{2}\left\| {{\hat{G}} - \sqrt{T} H{P^ \top }F} \right\| _2^2 \\ {\hat{\vartheta }}^{l + 1}&= {{\hat{\vartheta }}^l} + \eta \left({{{\hat{G}}}^{l + 1}} - {F^{l + 1}}\right) \end{aligned}\right. . \end{aligned}$$
(12)

Then, the individual subproblems are solved iteratively as follows:

Subproblem \(\varvec{F}\): The optimal solution can be easily obtained as follows:

$$\begin{aligned} {{F}_{opt}}=\frac{\vartheta +\eta G}{\eta +\lambda /T}, \end{aligned}$$
(13)

where \(G=\frac{1}{\sqrt{T}}H{{P}^{\top }}\hat{G}\) and \(\vartheta =\frac{1}{\sqrt{T}}H{{P}^{\top }}\hat{\vartheta }\). \({{F}_{opt}}\) is obtained via the Inverse Fast Fourier Transform (IFFT) of \(\hat{G}\) and \(\hat{\vartheta }\).

Subproblem \(\varvec{\hat{G}}\): For \(\hat{G}\), since each pixel is independent, it can be decomposed into T small subproblems. The closed solution can be obtained as follows:

$$\begin{aligned} \begin{aligned} \hat{G}{{(k)}_{opt}}=&\frac{1}{\eta }\left( \frac{1}{T}{{{\hat{X}}}_{S}}(k)\hat{Y}(k)+\eta \hat{F}(k)-{\hat{\vartheta }}(k) \right) \\&-\frac{{{{\hat{X}}}_{S}}(k)}{\eta b}\left( \frac{1}{T}{{{\hat{U}}}_{X}}(k)\hat{Y}(k)+\eta {{{\hat{U}}}_{F}}(k)-{{{\hat{U}}}_{\vartheta }}(k) \right) , \end{aligned} \end{aligned}$$
(14)

where \({{\hat{U}}_{X}}(k)={{\hat{X}}_{S}}{{(k)}^{\top }}{{\hat{X}}_{S}}(k)\), \({{\hat{U}}_{F}}(k)={{\hat{X}}_{S}}{{(k)}^{\top }}\hat{F}(k)\), \({{\hat{U}}_{\vartheta }}(k)={{\hat{X}}_{S}}{{(k)}^{\top }}{\hat{\vartheta }}(k)\) and \(b={{\hat{U}}_{X}}(k)+\eta T\).

Updating other variables: The Lagrange multiplier \({\hat{\vartheta }}\) and penalty parameter \(\eta\) are updated as:

$$\begin{aligned} \left\{ \begin{aligned}&{{{\hat{\vartheta }}}^{l+1}}={{{\hat{\vartheta }}}^{l}}+\eta ({{{\hat{G}}}^{l+1}}-{{F}^{l+1}}) \\&{{\eta }^{l+1}}=\min ({{\eta }_{\max }},\delta {{\eta }^{l}}) \\ \end{aligned} \right. , \end{aligned}$$
(15)

where l represents the number of iterations and \(\delta\) is the scale factor.
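The sketch below illustrates the order of the ADMM updates (Eqs. 13-15). It works entirely in the frequency domain and uses a simplified diagonal form of the \(\hat{G}\) subproblem, so it shows the alternating update structure rather than a faithful reimplementation of the full model with cropping; all shapes, initializations, and the regularization value are assumptions.

```python
# Structural sketch of the ADMM iterations used to solve the SCDCF model.
import numpy as np

def admm_step(X_hat, Y_hat, F_hat, G_hat, V_hat, lam, eta, T):
    # F subproblem (cf. Eq. 13): filter update from G and the multiplier V
    F_new = (V_hat + eta * G_hat) / (eta + lam / T)
    # G subproblem (simplified diagonal form of Eq. 14): per-pixel update
    num = np.conj(X_hat) * Y_hat[..., None] / T + eta * F_new - V_hat
    den = np.abs(X_hat) ** 2 + eta * T
    G_new = num / den
    # multiplier update (cf. Eq. 15)
    V_new = V_hat + eta * (G_new - F_new)
    return F_new, G_new, V_new

def optimize(X_hat, Y_hat, lam=1e-2, eta=1.0, eta_max=1e4, delta=10.0, iters=2):
    """Run a few ADMM iterations; two iterations are used in the paper."""
    F = np.zeros_like(X_hat)
    G = np.zeros_like(X_hat)
    V = np.zeros_like(X_hat)
    T = X_hat.shape[0] * X_hat.shape[1]
    for _ in range(iters):
        F, G, V = admm_step(X_hat, Y_hat, F, G, V, lam, eta, T)
        eta = min(eta_max, delta * eta)   # penalty parameter schedule (Eq. 15)
    return F
```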

3.6 Adaptive model update

The traditional DCF algorithm uses linear interpolation to update the filter in every frame. This per-frame updating strategy slowly learns the latest changes of the target by combining historical and current information. However, continuing to update the model in complicated situations such as severe occlusion may introduce a large amount of interference information that is detrimental to tracking and results in model drift. To address this, researchers have developed two confidence indicators, APCE (Average Peak to Correlation Energy) [38] and PSR (Peak-to-Sidelobe Ratio) [28], which analyze the similarity and peak intensity of the response map. Inspired by APCE and PSR, SCDCF uses the proposed RMF (response map fluctuation) metric to measure the fluctuation degree of the response map and sets the condition for model updating according to this feedback.

$$\begin{aligned} RMF=\frac{{{R}_{\max }}-{{R}_{\min }}}{\sqrt{\frac{1}{W\times H}\left( \sum \nolimits _{i,j}^{W,H}{{{\left( {{R}_{i,j}}-{{R}_{mean}} \right) }^{2}}} \right) }}, \end{aligned}$$
(16)

where \({{R}_{i,j}}\) is the response value at position (i, j), and \({{R}_{\max }}\), \({{R}_{\min }}\), and \({{R}_{mean}}\) are the maximum, minimum, and average values of the response map \(R\in {{\mathbb {R}}^{W\times H}}\). The average RMF value over the most recent n frames is then computed as \(RM{{F}_{mean}}=(1/n)\sum \nolimits _{k=1}^{n}{RM{{F}_{k}}}\), and whether to update is decided by comparing the current frame score \(RMF_t\) with \(RM{{F}_{mean}}\). The model update of SCDCF can therefore be expressed as:

$$\begin{aligned} \left\{ \begin{array}{*{20}{l}} \mathrm{{Update,}} &{} RM{F_t} > \varphi RM{F_{mean}}\\ \mathrm{{No update,}} &{} RM{F_t} \le \varphi RM{F_{mean}} \end{array} \right. , \end{aligned}$$
(17)

where \(\varphi\) is the ratio factor.
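A minimal sketch of the RMF measure (Eq. 16) and the update decision (Eq. 17) is given below; the history length n and ratio factor \(\varphi\) follow the values reported in Sect. 4.1.

```python
# RMF confidence measure and adaptive update decision.
import numpy as np

def rmf(R):
    """Response-map fluctuation of a response map R (Eq. 16)."""
    return (R.max() - R.min()) / (R.std() + 1e-12)   # std uses 1/(W*H), as in Eq. 16

def should_update(R, history, phi=0.65, n=5):
    """Compare the current RMF with the mean over the last n frames (Eq. 17)."""
    score = rmf(R)
    recent = history[-n:] if history else [score]
    update = score > phi * (sum(recent) / len(recent))
    history.append(score)
    return update
```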

4 Experimental results

Fig. 5 Qualitative evaluation results of the proposed tracker and other advanced trackers on 10 challenging sequences from the OTB2015 benchmark. From top to bottom and from left to right, the sequences are Bird1, Biker, Skiing, Matrix, Box, Dragonbaby, Ironman, Jump, MotorRolling, and Sylvester, respectively

4.1 Implementation details

Platform: The proposed tracker is implemented in MATLAB 2018a on a PC with an Intel(R) Xeon(R) Gold 6136 CPU at 3.00 GHz, 512 GB of RAM, and a single NVIDIA TITAN V GPU. The MatConvNet [39] toolbox is used to extract deep features from pre-trained CNN networks.


Parameters: To guarantee the fairness and objectivity of the evaluation, we follow some key parameters of the standard DCF methods [7, 11] to construct the tracker. For target localization, we use HOG features together with shallow-layer (conv3-4), middle-layer (conv4-3), and deep-layer (conv5-1) features of the VGG-16 network to represent the target. We set the learning rate \(\sigma\)=0.0135 and the SAER threshold in Sect. 3.4 to 1.37. For model optimization, we set the number of ADMM iterations l to 2, the penalty parameter \(\eta\) to 1, and \({\eta _{\max }}\) and \(\delta\) in Eq. 15 to \(10^4\) and 10, respectively. For model updating, we set the ratio factor \(\varphi\)=0.65 in Eq. 17 and, following [40], set the number of recent frames n=5. In addition, some parameters of the scale estimation follow the ASRCF [11] tracker, and the remaining parameters are consistent with the BACF [7] tracker.

4.2 Experiment datasets and evaluation metrics

We evaluate the effectiveness of the proposed tracking method on five public tracking datasets: OTB2013 [23], OTB2015 [24], TC128 [25], UAV123 [26], and VOT2018 [27]. For the OTB2013, OTB2015, TC128, and UAV123 datasets, we use the precision and success plots of the one-pass evaluation (OPE) protocol to measure tracker performance. The precision plot reports the proportion of video frames in which the distance between the bounding box predicted by the tracker and the manually labeled ground-truth bounding box is less than a given threshold. The success plot reports the proportion of video frames in which the overlap rate between the predicted and ground-truth bounding boxes exceeds a given threshold. We use the distance precision (DP) at a threshold of 20 pixels in the precision plot and the area under the curve (AUC) of the success plot to evaluate the trackers. The overlap precision (OP) is the success-plot score at an overlap threshold of 0.5. In addition, the center location error (CLE) measures the average Euclidean distance between the center positions of the predicted and ground-truth bounding boxes, and the tracker speed is reported in frames per second (FPS). For the VOT2018 dataset, we analyze tracker performance using three metrics: expected average overlap (EAO), Accuracy, and Robustness.
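For reference, the sketch below shows how the DP and AUC metrics described above could be computed from per-frame center errors and overlap (IoU) values; the input arrays and the number of overlap thresholds are assumptions.

```python
# OPE metrics: distance precision at 20 pixels and area under the success curve.
import numpy as np

def distance_precision(center_errors, thresh=20.0):
    """Fraction of frames whose center location error is below `thresh` pixels."""
    errors = np.asarray(center_errors)
    return float((errors <= thresh).mean())

def success_auc(overlaps, num_thresholds=101):
    """Area under the success plot: mean success rate over IoU thresholds in [0, 1]."""
    overlaps = np.asarray(overlaps)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    success = [(overlaps > t).mean() for t in thresholds]
    return float(np.mean(success))
```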

4.3 Qualitative evaluation

We select 10 representative sequences with different challenge attributes from the OTB2015 dataset for a qualitative evaluation of our tracker; the results are shown in Fig. 5. The comparison algorithms include DCF trackers based on deep features (i.e., DeepSTRCF [41], DeepSRDCF [9], C-COT [34], and HCF [10]) and DCF trackers based on handcrafted features (i.e., MCCT-H [40], ECO-HC [8], BACF [7], Staple [5], and KCF [29]). When the target is severely disturbed by the surrounding background (i.e., Box, Matrix), our approach performs well because the saliency-aware detection mechanism shields the background noise and makes the learned filter focus more on the target information. When the target is deformed or rotated (i.e., MotorRolling, Dragonbaby, Jump, Sylvester), SCDCF achieves accurate tracking over consecutive frames because the channel selection strategy removes a large number of redundant channels and the adaptive channel weighting effectively improves the representation power of the features used. Especially in the Jump sequence, which is particularly difficult to track, SCDCF is the only tracker among the compared algorithms that successfully follows the target. Similarly, because the adaptive update strategy avoids unnecessary model updates, SCDCF also succeeds when the target goes out of view (i.e., Bird1, Biker). In addition, our tracker performs well under fast motion and illumination variation (i.e., Skiing, Ironman). The qualitative evaluation results show that the proposed SCDCF method is superior to many advanced tracking algorithms in various complex situations.

Fig. 6 Precision and success plots of SCDCF and other state-of-the-art trackers on OTB2013 (first row) and OTB2015 (second row), with AUC and DP scores reported in the figure legend

Fig. 7 Precision and success plots of SCDCF and other trackers on TC128, with AUC and DP scores reported in the figure legend

Table 1 Performance comparison with other SOTA trackers on OTB2015
Table 2 A comparison of our SCDCF method with 16 advanced trackers on OTB2015
Table 3 A comparison of our SCDCF method with 9 advanced trackers on UAV123
Table 4 A comparison of our SCDCF method with 9 advanced trackers on VOT2018

4.4 Quantitative evaluation

OTB: Figure 6 shows the precision and success plots of our method and 16 other trackers on the OTB2013 and OTB2015 datasets, including handcrafted-feature-based DCF trackers (i.e., ECO-HC [8], LADCF-HC [42], BACF [7], CACF [6], SRDCF [33], SAMF [43], KCF [29]), deep-feature-based DCF trackers (i.e., MCCT [40], DeepSTRCF [41], C-COT [34], DeepSRDCF [9], MCPF [44], HDT [45], HCF [10]), and deep-learning-based trackers (i.e., SiamFC [46], SiamRPN [47]). Overall, our SCDCF tracker is superior to many advanced tracking algorithms in terms of DP and AUC scores. On OTB2013, our method ranks first with a DP of 94.2% and an AUC of 71.6%. On OTB2015, SCDCF achieves the highest DP and AUC scores of 92.2% and 68.3%, respectively, which are 4.2%/0.8% and 7.1%/4.8% higher than DeepSTRCF and DeepSRDCF, which use multi-channel deep features, and 6.1%/2.3% higher than LADCF-HC, the best performer among handcrafted-feature-based trackers. In addition, we compare SCDCF with 9 other deep-learning-based trackers, including LUDT+ [48], LUDT [48], PrDiMP-18 [49], ROAM [50], ROAM+ [50], ATOM [51], GradNet [52], DiSiamRPN [53], and DiMP-18 [54], on the OTB2015 dataset to present a more comprehensive evaluation. The results are shown in Table 1. Our SCDCF tracker achieves the best precision and success rate, outperforming recent SOTA trackers.

To analyze tracker performance in more detail, we evaluate the mean CLE, OP, and speed (fps) of SCDCF on OTB2015. Table 2 reports the comparison results of our method with 15 other trackers. In terms of OP, SCDCF achieves the best performance, with improvements of 0.6%/2.0% over the second and third places (i.e., MCCT and DeepSTRCF). In terms of mean CLE, SCDCF stays at 8.4 pixels, outperforming many state-of-the-art trackers. In terms of speed, the end-to-end Siamese-network-based trackers (i.e., SiamFC and SiamRPN) are faster, reaching 84.3 fps and 34.2 fps, respectively. Because of their offline learning, they are fast, but the target appearance model cannot be dynamically adjusted by analyzing the context, so their tracking results are worse than those of SCDCF, which learns online. Among the deep-feature-based correlation filter trackers, SCDCF is the fastest at 14.8 fps, 4.4 fps faster than the second-place HCF algorithm, because SCDCF uses channel selection to remove a large number of interfering feature channels and improve tracking efficiency. Overall, the tracking performance of our method is superior to the other advanced trackers.

Fig. 8 Precision and success plots of SCDCF and other trackers on UAV123, with AUC and DP scores reported in the figure legend

Fig. 9 Expected average overlap (EAO) ranking plot on the VOT2018 dataset

Fig. 10 Precision plots of SCDCF and 16 state-of-the-art trackers under 11 attributes on OTB2015. For completeness, we also show the overall results obtained by these trackers

Fig. 11 The 11 attribute-based DP (left) and AUC (right) scores of our tracker and other trackers on TC128

TC128: We compare the proposed tracker with 10 advanced trackers on TC128, including MCCT [40], DeepSRDCF [9], DeepSTRCF [41], C-COT [34], SRDCF [33], BACF [7], SAMF [43], Struck [55], DSST [56], and KCF [29]. The evaluation results are shown in Fig. 7, where our method achieves the best DP/AUC scores. In terms of DP, our tracker scores the highest at 80.5%, which is 0.4% and 1.6% better than the second and third places (i.e., MCCT and DeepSTRCF). In terms of AUC, SCDCF ranks first, outperforming C-COT and DeepSRDCF by 1.8% and 5.5%.

UAV123: UAV123 is one of the most popular datasets in the field of UAV object tracking. We compare SCDCF with 23 recent trackers on UAV123. As shown in Fig. 8, the overall performance of SCDCF is excellent. For more clarity, we also show the DP/AUC scores of the ten best-performing trackers in Table 3. SCDCF scores 77.0% on the DP metric, ranking first among the 23 trackers and leading the second and third places (i.e., ACSDCF and ECO) by 0.5% and 2.9%. On the AUC metric, SCDCF scores 52.0%, ranking third. These results show that SCDCF performs better than most compared trackers on UAV123 and is not inferior to the recent advanced trackers AutoTrack and ACSDCF, further validating the advantages of the proposed method.

VOT2018: To further evaluate the robustness and accuracy of the tracker, we also compare SCDCF with 9 advanced trackers on VOT2018, including ECO [8], MCCT [40], C-COT [34], CSRDCF [13], UpdateNet [57], SiamFC [46], DCFNet [58], Staple [5], and KCF [29]. We rank all algorithms according to their EAO scores, and the results are shown in Fig. 9. Table 4 reports the scores of all algorithms on the three metrics in detail. The EAO score of SCDCF reaches the highest value of 0.280, which is 1.3% higher than C-COT and 0.7% higher than MCCT. On Robustness, our method ranks second, behind only ECO. Although the deep-feature-based MCCT and ECO trackers also perform well on VOT2018, they use an enormous number of deep feature channels and run slowly, especially ECO. Therefore, compared with other advanced tracking algorithms, the overall performance of SCDCF remains the best.

4.5 Attribute-based evaluation

To fully evaluate the performance of the tracker in various complex scenarios, we perform an attribute-based evaluation of SCDCF on the OTB2015 and TC128 datasets. These attributes include occlusion (OCC), scale variation (SV), illumination variation (IV), background clutter (BC), fast motion (FM), motion blur (MB), low resolution (LR), deformation (DEF), out-of-view (OV), out-of-plane rotation (OPR), and in-plane rotation (IPR). Figure 10 shows the comparative results on OTB2015. SCDCF ranks first in DP for eight attributes: IV, FM, DEF, BC, SV, OV, OPR, and LR. In particular, for OV and LR it is 3.5% and 4.2% higher than the second and third places (i.e., C-COT and MCPF), and its DP scores on the remaining attributes are also among the best. Figure 11 shows the attribute analysis of SCDCF on TC128. In terms of DP, our method achieves the best results for SV, OCC, FM, and OPR and remains in the top three for the remaining seven challenges. In terms of AUC, the SCDCF tracker ranks first for SV and OCC, and second for six attributes: IV, FM, OPR, IPR, OV, and BC. The evaluation results show that SCDCF can better cope with target variations in a variety of complex scenarios by fully using the effective feature channels to represent the target during tracking.

4.6 Ablation study

Table 5 DP and AUC scores of various variants of the proposed SCDCF on OTB2013 and OTB2015 datasets

We further conduct ablation studies on the OTB2013 and OTB2015 datasets to evaluate the contribution of each component of the proposed SCDCF tracker; the evaluation results are shown in Table 5. 'MU' indicates the proposed adaptive model update strategy. 'SCS' stands for the saliency-aware channel selection strategy. 'CW' represents the channel weights assigned to the selected feature channels according to the SAER scores. The table shows that each component improves the performance of the tracker to a certain extent. In particular, after introducing saliency-aware channel selection, the DP/AUC scores of the tracker improve significantly on OTB2013 and OTB2015, by 3.0%/2.3% and 2.8%/1.8%, respectively. Using the SAER scores to assign channel weights also enhances the stability of the tracker. Compared to the baseline, SCDCF combines the strengths of all components to improve the AUC and DP metrics by 10.7%/7.7% and 10.0%/6.2% on OTB2013 and OTB2015, respectively.

Fig. 12 Visualization of failure cases. The blue box represents the saliency detection region and the red box denotes the object region

4.7 Discussion

Qualitative and quantitative experiments on several datasets have verified that the proposed saliency-aware channel selection can effectively improve the tracking accuracy of correlation filter algorithms. In most practical scenarios, the saliency-aware detection mechanism proposed in Sect. 3.3 produces masks that match the target appearance profile, as shown in Fig. 2. However, a few environments still affect its effectiveness, as shown in Fig. 12. In Fig. 12a, parts of the target are similar to the background, so the mask generated by saliency-aware detection has some missing regions. In Fig. 12b, the low brightness, low resolution, and low color discrimination of the image make the generated target mask incomplete. These results reveal the shortcomings of our method. To further improve tracking performance, we will explore more advanced saliency detection algorithms to alleviate these problems and study how to further integrate saliency information with the DCF model.

5 Conclusion

In this paper, we investigate the correlation between multi-channel deep features and target saliency-aware region information and propose a novel DCF-based tracking method with saliency-aware adaptive channel selection. By comparing the feature energy of the target saliency-aware region and the background region, the more discriminative, effective channels among the multi-dimensional convolutional features are selected, and high tracking accuracy can be achieved with a small number of feature channels. In addition, the proposed SAER indicator is used to determine the importance of channels and realize adaptive allocation of channel weights. We also introduce the ADMM method to optimize the proposed SCDCF model. Extensive experiments on five well-known datasets validate the effectiveness and robustness of the proposed method.