1 Introduction

Video tracking is an active research topic in computer vision. It plays an important role in many applications, such as robotics, surveillance and human–computer interaction, to name a few [5, 6, 8, 9, 10, 13, 16, 17, 18, 20, 21, 30, 37, 42, 47, 48]. In general, the tracking target is defined in the first frame of the video by an upright bounding box, and a tracker identifies the location of the object in the subsequent frames. Over the last decade, video tracking has advanced significantly, but many challenging problems remain to be solved, such as fast motion, camera shaking and occlusion.

Fig. 1

In a, the black box represents the target location in the previous frame. The conventional CF detects the object in the area defined by the magenta box. Apart from this area, we add four extra search regions around the original search center, shown in red, blue, green and yellow. In b, the red box represents one of the additional search regions; the black box and the magenta box are the same as those in a. The yellow dash-dot box is the implicit search region of TESR

Among recent tracking algorithms, the correlation filter (CF) tracker and the deep neural network-based tracker are the two mainstream methods. Tracking algorithms based on deep networks perform very well because they exploit large amounts of offline training data, but they often suffer in terms of tracking speed because of their slow online fine-tuning. CF tracking algorithms have received wide attention for their impressive performance in both speed and accuracy, and they also benefit from a very simple implementation. In contrast to neural network-based tracking methods, the correlation filter method is a lightweight tracker. Extensive research and various improvements of CF have been reported, but some open problems persist. A prominent one is the boundary effect, which is closely linked to the underlying assumption that the object signal is periodic. The intuitive way to solve this problem is to extend the search region of CF, but this introduces irrelevant background information which is not periodic and degrades the tracking performance. Kiani Galoogahi et al. [20] and Danelljan et al. [10] circumvent the background noise problem associated with the search region extension by adopting other approaches. The proposed method addresses this problem by implicitly extending the search region, overcoming the boundary effect of CF by means of a multiposition tracking solution.

In this paper, we propose a correlation tracking framework that implicitly extends the search region (TESR). Conventional CF methods take the target location in the previous frame as the search center in the current frame and detect the object in a search region larger than the object, which is delineated by the bounding box. The area surrounding the bounding box is padded by \(p*r\), where r is the radius of the object and p is the padding scale factor. Hence, the search region radius of conventional CF methods is \((1+p)*r\). The proposed TESR adopts a multiposition tracking method. As shown in Fig. 1, in addition to the search center at the location of the tracked target in the previous frame, we consider multiple extra search centers around the original one. For every added position, we also search for the object in a region of radius \((1+p)*r\). Since the distance between each added center and the original center is set to \(\alpha *r\), the total radius of our search region is \((1+p+\alpha )*r\). The search region is thus implicitly enlarged without introducing extra background noise.
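The radius arithmetic above can be checked with a small 1D sketch. The function name and the numeric values (r = 20, p = 1, \(\alpha = 0.8\)) are illustrative assumptions, and the search regions are reduced to intervals on a line; the sketch only verifies that the union of the shifted intervals has half-width \((1+p+\alpha )*r\).

```python
def implicit_search_radius(r, p, alpha):
    """Half-width of the union of the original and shifted 1D search intervals."""
    region_radius = (1 + p) * r             # radius searched around every center
    centers = [-alpha * r, 0.0, alpha * r]  # shifted and original centers (1D)
    lo = min(c - region_radius for c in centers)
    hi = max(c + region_radius for c in centers)
    return (hi - lo) / 2


# Illustrative values only: object radius 20 px, padding factor 1, alpha 0.8.
conventional = (1 + 1.0) * 20.0
implicit = implicit_search_radius(r=20.0, p=1.0, alpha=0.8)
```

With these values the conventional radius is 40 pixels, while the union of the shifted regions spans a radius of about 56 pixels, although each individual filter still sees a 40-pixel region.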

The main contributions of this work can be summarized as follows:

  • We propose a novel method to overcome the boundary effect of CF which implicitly extends the search region without compromising periodicity.

  • We adopt a two-stage tracking strategy: CF detection and SVM detection. Different from the traditional multiclassifier structure, the two classifiers use independent updating periods based on the model of ASMM [1].

  • To deal with the problem of drift, we apply the salient object detection score to regularize the output of the SVM classifier with promising results.

We show that TESR can obtain excellent results. The experiments demonstrate the superior performance of TESR as compared with the state-of-the-art trackers on the popular OTB [44, 45] and VOT [22, 23] benchmarks.

Fig. 2

A flowchart of the proposed tracking framework. In the first stage, the same CF is applied in five different search regions, i.e., the red, blue, green, yellow and magenta dotted boxes, to generate object candidates. The black box represents the object location in the previous frame. The white box represents the target that is predicted by our method. In the second stage, all candidates are evaluated using an SVM classifier. The candidate with the highest score is treated as the tracking target

2 Related works

Video tracking has been widely studied over the past decades. There is a considerable body of related literature, in which both CF trackers and deep network trackers are reported to have achieved competitive performance. We focus only on these two main types of trackers in this paper and refer the reader to [34, 44] for further details of visual tracking.

Correlation filter The recent upsurge of interest in correlation filters started with Bolme's seminal article [5], advocating a tracking method based on the minimum output sum of squared error (MOSSE). This is a high-speed tracker running at 600–700 FPS. Henriques [16] realized a discriminative correlation tracker based on dense sampling by using the property of the circulant matrix. His algorithm, named KCF, was improved in 2015 by adding a kernel method and multichannel features (e.g. HOG). SAMF [25] and DSST [9] improved the performance of KCF and MOSSE by endowing them with a scale estimation mechanism. To handle occlusion, [26, 27, 48] incorporated a common part-based technique into CF. Both [4, 36] proposed methods to handle the limitations of the assumption of isotropy of the CF response. It is worth mentioning that, to reduce the boundary effect, [20] proposed a new CF objective function by introducing a specific mask matrix, and SRDCF [10] added a spatial regularization component to the CF objective function. SRDCF achieved impressive performance at the cost of very low speed. The two methods have not fundamentally solved the problem of the boundary effect because they both circumvent the problem of extending the search region of CF. Unlike [20] and [10], we propose a multiposition search method, which implicitly extends the search region without injecting extra background noise. The proposed method solves the problem of the boundary effect elegantly. Related papers on CF also include [2, 5, 6, 9, 10, 12, 16, 17, 18, 19, 20, 25, 27, 32, 35, 36, 47, 50]. For further details of correlation tracking, please refer to [7].

Deep neural network-based tracking In many areas of computer vision, such as image classification, object detection and segmentation, deep neural network algorithms have achieved impressive success. Recently, neural network-based methods have also been used in the tracking field. There are two main types of deep neural network-based trackers. One is the online fine-tuning neural network method. For example, MDNet [31] pretrains a convolutional network to obtain a generic target representation and constructs a new network from the pretrained CNN by adding a new binary classification layer, which is updated online. The other is the offline training method. For example, CF2 [29] adopts a pretrained CNN to extract features in a CF tracking framework, where the neural network is only used for feature extraction. SiameseFC [3] is a fully convolutional Siamese network trained offline on the ILSVRC15 [33] dataset for object detection in video. The two-stream networks do not update online. Related papers on deep neural network-based tracking methods also include [15, 24, 38, 39, 40, 41, 49]. Although deep neural network-based tracking algorithms have superior performance, their tracking speed is limited because of the slow online updating. If these methods adopt a network trained completely offline, their accuracy often decreases. Our method is a lightweight correlation filter tracker. The proposed tracker consists of the SVM classifier and the CF, and we regard the two classifiers as long-term and short-term ones, respectively, rather than as multiple experts for simultaneous fusion. Moreover, the proposed method is a two-stage tracking framework, and the components of both stages can be replaced by an arbitrary CF and discriminative classifier.

Fig. 3

In a, for the sake of simplicity, the 1D pink signal represents the 2D image signal. In b, the assumption of periodicity distorts the pink signal in correlation tracking. The smaller the search window, the more severe the distortion of the target patch signal, shown as the red curve segment, which serves as input to the correlation filter. In c, when the search window size is enlarged, the target patch signal is undistorted

3 Proposed approach

3.1 Overview

TESR is a two-stage tracking framework. The first stage detects the object in five different search regions using the same correlation filter. In each search region, the CF outputs a response map, and TESR takes the position of the peak value in the response map as a candidate for the tracking object. The second stage tests the five candidates using an SVM classifier and takes the candidate with the highest score as the predicted target position. Figure 2 presents a visual representation of the overall tracking procedure.

In the CF detection stage, in addition to the tracked target location found in the previous frame, TESR takes multiple other positions around the original search center as base search points. In fact, they are the vertices of the bounding box. As discussed in Sect. 4, TESR exhibits the best performance when the number of added positions is set to four. Thus, TESR tracks the object five times using the same CF and obtains five response maps. Because we keep the search size of every CF constant during this process, no extra background noise is introduced. However, since the search region is extended as a whole, the strategy deals effectively with motion blur, camera shaking, occlusion, and so on.

In the second stage, we choose an SVM as the classifier to select the correct candidate. Other classifiers, such as a neural network, could also be used, but compared with a neural network, the SVM classifier does not need offline training, is faster, and does not exhibit degradation in performance. In this stage, TESR applies the method proposed in [43, 46] and regularizes the output of the SVM classifier by the salient object detection score to enhance its performance. Of the five candidates generated in the first stage, we take the one with the highest score as the final tracking object position.

The proposed tracking framework consists of two classifiers, both of which must be continuously updated. Similar to [18], we argue that the tracker should have both long-term and short-term memories. TESR updates the CF every frame to handle any fast appearance change of the object and updates the SVM more slowly, e.g. every 10–15 frames, so that it acts as a long-term memory. The support vectors in the SVM primarily capture the elementary features of the object.
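The two update rates can be illustrated with a minimal scheduling sketch. The function name, the 1-indexed frame counter and the choice of a fixed period are assumptions made for illustration; the sketch only records which model would be refreshed at each frame and does not perform any actual training.

```python
def update_plan(num_frames, svm_period=10):
    """Return, per frame, which models are refreshed (frames are 1-indexed)."""
    plan = []
    for t in range(1, num_frames + 1):
        models = ["cf"]               # short-term memory: refreshed every frame
        if t % svm_period == 0:
            models.append("svm")      # long-term memory: every svm_period frames
        plan.append(models)
    return plan


plan = update_plan(num_frames=30, svm_period=10)
svm_frames = [t for t, models in enumerate(plan, start=1) if "svm" in models]
```

Here `svm_frames` lists the frames at which the long-term model would be refreshed, while the CF entry appears in every frame of the plan.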

3.2 CF detection stage

For simplicity, we assume that \(\varvec{x}\in {\mathbb {R}^n}\) represents the tracking target. To take advantage of the property of the circulant matrix, we generate a matrix \(\varvec{X}{=}[\varvec{x}{,} \varvec{Px}{,} \varvec{P}^2\varvec{x}, {\ldots }{,} \varvec{P}^{n-1}\varvec{x}]\), where \(\varvec{P}\) is a permutation matrix denoting the cyclic shift of a vector, \(\varvec{Px}{=}[x_n{,} x_1{,} {\ldots }{,} x_{n{-}1}]\). In the ideal case, the response of CF is modelled as a 2D Gaussian map whose peak is the object location. The goal of CF is to learn a filter \(\varvec{w}\) that minimizes the cost of the following ridge regression problem:

$$\begin{aligned} {\min _{\varvec{w}}\{\left\| {\varvec{Xw-y}}\right\| ^2 +\lambda \left\| {\varvec{w}}\right\| ^2\}} \end{aligned}$$
(1)

where \(\varvec{y}\) is the ideal 2D Gaussian response and \(\lambda \) is the trade-off parameter.

We can solve Eq. 1 and its dual problem to obtain \(\varvec{w}=\varvec{X}^\mathrm {T}\varvec{\alpha }\), where \(\varvec{\alpha }=(\varvec{XX}^\mathrm {T} +\lambda \varvec{I})^{-1}\varvec{y}\) is the solution of the dual problem of Eq. 1. Using the relationship between circulant matrix and its Fourier transform, we can compute \(\varvec{w}\) in the frequency domain very fast as follows:

$$\begin{aligned} \hat{\varvec{w}}=\frac{\hat{\varvec{x}} \odot \hat{\varvec{y}}}{\hat{\varvec{x}}^*\odot \hat{\varvec{x}}+\lambda } \end{aligned}$$
(2)

where \(\hat{\varvec{x}}\) is the FFT of \(\varvec{x}\), \(\hat{\varvec{x}}^*\) is the complex conjugate of \(\hat{\varvec{x}}\), and \(\odot \) denotes the element-wise product.

In the object detection stage, we compute the response map \(f(\varvec{x})= \varvec{w}^\mathrm {T}\varvec{x}\) and take the peak point of the response as the current target location. The details of CF can be found in [17].
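To make Eq. 2 and the detection step concrete, the following is a minimal 1D sketch under several stated assumptions: a single-channel toy signal, a linear kernel, a naive \(O(n^2)\) DFT in place of an FFT so that only the standard library is needed, and a data matrix whose rows are the cyclic shifts \(\varvec{P}^i\varvec{x}\). All function names and the toy signal are hypothetical and are not taken from the KCF or TESR implementations.

```python
import cmath
import math


def dft(v):
    """Naive O(n^2) discrete Fourier transform (stands in for an FFT)."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(v_hat):
    """Inverse of dft()."""
    n = len(v_hat)
    return [sum(v_hat[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def train_filter(x, y, lam):
    """Eq. 2, element-wise: w_hat = (x_hat * y_hat) / (conj(x_hat) * x_hat + lambda)."""
    x_hat, y_hat = dft(x), dft(y)
    return [xk * yk / (xk.conjugate() * xk + lam) for xk, yk in zip(x_hat, y_hat)]


def response(w_hat, z):
    """Responses w^T (P^i z) for all cyclic shifts i, computed in the Fourier domain."""
    z_hat = dft(z)
    return [v.real for v in idft([zk.conjugate() * wk for zk, wk in zip(z_hat, w_hat)])]


n = 16
x = [0, 0, 1, 3, 5, 3, 1, 0, 0, 0, 2, 0, 0, 1, 0, 0]   # toy 1D "patch"
# Ideal response: a Gaussian (sigma = 1.5) centred on index 0, wrapped cyclically.
y = [math.exp(-min(t, n - t) ** 2 / (2 * 1.5 ** 2)) for t in range(n)]
w_hat = train_filter(x, y, lam=1e-2)

shift = 3
z = x[-shift:] + x[:-shift]           # the patch moved right by 3 samples (cyclically)
resp = response(w_hat, z)
peak = max(range(n), key=lambda i: resp[i])
estimated_shift = (-peak) % n         # the shift that re-aligns z with x is -peak
```

Under this convention the response peaks at the cyclic shift that maps the new patch back onto the training patch, so the displacement is read off as the negative of the peak index. Other implementations may place the conjugate differently and obtain the peak index with the opposite sign.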

As mentioned in Sect. 1, although CF has improved considerably in recent years, the boundary effect remains a major problem that is difficult to solve. As shown in Fig. 3, for simplicity, we use a 1D signal in place of the 2D image signal. The signal is distorted by the assumption of periodicity, as shown in Fig. 3b. If the search window is not large enough, the distorted target patch signal input to the correlation filter will cause the tracker to drift. The direct remedy is to enlarge the search window: as shown in Fig. 3c, the target patch signal is then no longer affected by the assumption of periodicity. However, this creates a dilemma. On the one hand, we need to expand the search region around the target to cope with fast motion or camera shaking in the video; otherwise, the object will partly move out of the search window and CF will fail to detect it. On the other hand, CF relies on the property of the circulant matrix to speed up detection and learning, and extending the search region injects a large amount of background noise, which causes the tracker to drift.

As shown in Fig. 1, TESR adds four other search positions, which implicitly extend the total search size without leading to the drift problem. TESR takes the four vertices of the bounding box (black box) as the additional correlation filter search centers, and all filters use the original search size. We can see that the search regions of the top-left CF (red box) and the top-right CF (blue box) cover the target, so TESR can ultimately track the object position correctly.

Let us assume that the size of the target is \(m*n\). The padding area around the target in conventional CF is p times the target size, i.e., \((pm)*(pn)\). Assume that the size of the search region of conventional CF is \(M*N\) such that \(M=(1+p)m\), \(N=(1+p)n\). We denote the image patch of the conventional search region as \(\varvec{x}_0=[x_1, x_2, \ldots , x_{M*N}]\). In the CF detection stage of TESR, \(\varvec{x}_0\) is shifted to the neighborhood of the vertices of the bounding box. We denote by \(\varvec{x}_i=\varvec{x}_0(a_i, b_i) (i=1, \ldots , 4)\) the added search region, which is shifted from \(\varvec{x}_0\) by \((a_i, b_i)\), with \(a_i\) and \(b_i\) given by

$$\begin{aligned} a_i = {\left\{ \begin{array}{ll} \alpha *m, &{} i=1,3\\ -\alpha *m, &{} i=2,4 \end{array}\right. } \end{aligned}$$
(3)
$$\begin{aligned} b_i = {\left\{ \begin{array}{ll} \alpha *n, &{} i=1,2\\ -\alpha *n, &{} i=3,4 \end{array}\right. } \end{aligned}$$
(4)

where \(\alpha \) is a scale parameter.
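Equations 3 and 4 translate directly into a few lines of code. The sketch below is illustrative only; the function name and the example target size are assumptions, and index 0 is used for the unshifted original search region \(\varvec{x}_0\).

```python
def search_offsets(m, n, alpha):
    """Offsets (a_i, b_i), i = 0..4, of the search regions per Eqs. 3 and 4."""
    offsets = [(0.0, 0.0)]                      # i = 0: original search center
    for i in range(1, 5):
        a = alpha * m if i in (1, 3) else -alpha * m   # Eq. 3
        b = alpha * n if i in (1, 2) else -alpha * n   # Eq. 4
        offsets.append((a, b))
    return offsets


# Illustrative target of size m = 40, n = 20 with alpha = 0.5.
offsets = search_offsets(m=40, n=20, alpha=0.5)
```

The four added offsets cover all sign combinations of \((\pm \alpha m, \pm \alpha n)\), i.e., one shifted region per diagonal direction.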

For each search region, TESR will track the object using the same CF. In the first stage of TESR, the response map of each CF is computed as follows:

$$\begin{aligned} \varvec{y}_i=f(\varvec{x}_i)=\varvec{w}^\mathrm {T}\varvec{x}_i \quad \forall i=0:4 \end{aligned}$$
(5)

For each \(\varvec{y}_i\), we find the location of its peak value and regard this position as a target candidate. TESR generates five candidates in this stage. As shown in Sect. 3.3, an SVM classifier selects the correct result of our tracking framework. Each CF still tracks the object in the search region with the size \(M*N\) or \((1+p)m *(1+p)n\), but the total search size of TESR is implicitly increased to be \((1+p+\alpha )m*(1+p+\alpha )n\). This approach of extending search region implicitly does not introduce any extra background noise.

To speed up TESR, we do not use the multiposition detection method described above in all frames. The response of the original CF, i.e., \(\varvec{y}_0\), is transformed to a probability by a Gaussian cumulative distribution function with a mean of 0 and a standard deviation of 1. If the peak value of the transformed response is less than a threshold s, TESR tracks the object using multiposition detection and applies the SVM classifier to select the best candidate. If the peak value is greater than s, we accept the target tracked by the original CF as the result of TESR.
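The gating rule can be sketched as follows. This is a minimal illustration that assumes the raw peak value of \(\varvec{y}_0\) is passed directly through the standard normal CDF; the text does not specify any further normalization of the response, and the function names are hypothetical. The default threshold of 0.6 is the value of s reported in Sect. 4.2.

```python
import math


def normal_cdf(v):
    """Cumulative distribution function of a Gaussian with mean 0 and STD 1."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def needs_multiposition(peak_value, s=0.6):
    """True if the CF peak is not confident enough and all five regions are searched."""
    return normal_cdf(peak_value) < s


low_confidence = needs_multiposition(0.1)    # CDF(0.1) is about 0.54, below s
high_confidence = needs_multiposition(1.0)   # CDF(1.0) is about 0.84, above s
```

Only frames with a weak peak pay for the four extra correlation passes and the SVM evaluation, which is what keeps the average cost close to that of the base tracker.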

3.3 SVM detection stage

In the first stage, the correlation filter is applied in five search regions and generates five target candidates, one per region. We could adopt a multiclassifier fusion method to define our tracker. But as shown in Fig. 2, some outputs of the correlation filter look like the target, while others are severely contaminated by noisy background pixels. If we simply adopted a multiclassifier fusion method, the tracker would easily start to drift. For this reason, we adopt an SVM to determine which candidate is the correct target position in the second detection stage. Several problems must still be considered: the two classifiers, i.e., CF and SVM, should adopt different feature extraction methods, and the SVM classifier should be able to detect and update rapidly. Furthermore, the two classifiers should be updated with different frequencies, so that one adapts to rapid changes in the appearance of the target and the other focuses on stable properties of the object.

The SVM classifier used in TESR is inspired by TVM [43, 46], which uses a fixed-size training set \(Q=\{\zeta _i=(\phi (\varvec{q}_i),\omega _i,s_i)\}_1^B\), where \(\phi (\varvec{q}_i)\) is the feature vector of an image patch \(\varvec{q}_i\), \(\omega _i\) is a binary label and \(s_i\) is the number of support vectors representing the decision boundary. Given new data \(L=\{\varvec{x}_i,y_i\}_1^J\), the goal of the TVM is to learn the weight \(\varvec{w}\) by minimizing the objective function:

$$\begin{aligned}&\min _{\varvec{w},b}\left\{ \frac{1}{2}\left\| \varvec{w}\right\| ^2+C \left\{ \sum _{i=1}^{B}\frac{s_i}{N_{\omega _i}}L_\mathrm{h}(\omega _i,\varvec{q}_i;\varvec{w})\right. \right. \nonumber \\&\left. \left. \quad +\sum _{i=1}^{J}\frac{1}{N_{y_i}}L_\mathrm{h}(y_i,\varvec{x}_i;\varvec{w}) \right\} \right\} \end{aligned}$$
(6)

where C is the slack parameter, \(L_\mathrm{h}\) is the hinge loss function.

By taking advantage of the twin prototypes, TVM can maintain a reasonable support vector budget and can be designed as a linear SVM classifier, so it can detect and adapt the decision boundary fast enough to meet the real-time requirement of visual tracking. For feature extraction, the current mainstream CF methods adopt multichannel features such as HOG [10, 17, 18, 20]. TESR uses the CIELab feature as the feature representation of TVM. Thus, the two features used in CF and TVM are complementary: HOG focuses on gradient information, while the CIELab feature focuses on color information. Similar to MEEM, to deal with illumination change, we also adopt the nonparametric local rank transform (LRT) [18]. As TVM is viewed as a long-term memory, its updating frequency is lower than that of CF. We find that TESR achieves the best performance when the SVM classifier is updated every 10–15 frames. By managing the updating frequency of the SVM classifier, we retain control over the tracking speed, as it is inversely proportional to the updating frame rate.

Fig. 4

In a, the five candidates generated in the first stage of TESR at the ninth frame of the bolt sequence are shown as red, blue, green, yellow and black boxes. In b–f, the five candidate patches are extracted from the original image

Fig. 5

The black box represents the target location in the previous frame, and the black point represents its center. The conventional CF detects the object in the area defined by the magenta box. The additional search regions are shown with red, blue, green and yellow boxes, where the value of p is 1, and the colored points represent the additional search centers. We set \(\alpha \) to 2.5, 2 and 0.8 in a–c, respectively. We can see that when \(\alpha \) is greater than \(1+p\), the additional search regions do not overlap

3.4 Salient feature detection

As TESR is an online learning method, its memory, i.e., the filter of the CF and the support vectors of the SVM, is bound to be contaminated in practice. To alleviate this problem, we adopt a salient feature detection regularization method. We observe that the tracking target is generally a salient object, which can be considered as a priori knowledge for visual tracking. In some cases, we can accurately determine the correct target position from several candidates even without any information extracted from the previous frames. The candidates shown in Fig. 4d–f are more like the tracking target because they contain a salient feature manifested by the whole object. We modify the discriminative method in Sect. 3.3 by regularizing the output of the SVM classifier using the a priori knowledge of the saliency of the object.

For the salient feature detection, we use the approach proposed in [28]. We assume that the area of the bounding box is R, and the search region of CF is \(R_\mathrm{search}\). The surrounding area of the bounding box is \(R_\mathrm{s}\), i.e., the area in \(R_\mathrm{search}\) besides R, \(R_\mathrm{s} = R_\mathrm{search}-R\). We adopt the Chi-square of the color histogram as the salient feature detection score as follows:

$$\begin{aligned} {\chi ^2(R,R_\mathrm{s})=\frac{1}{2}\sum _{i}\frac{(R^i-R_\mathrm{s}^i)^2}{R^i+R_\mathrm{s}^i}} \end{aligned}$$
(7)

where \(R^i\) and \(R_\mathrm{s}^i\) are the ith bins of the color histogram of R and \(R_\mathrm{s}\).
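Equation 7 can be computed as in the following sketch. It assumes that the two color histograms are already available as equal-length lists of non-negative bin values; how the histograms are quantized and normalized is not specified here, and bins that are empty in both histograms are skipped to avoid division by zero, which is an implementation choice of this sketch.

```python
def chi_square(hist_obj, hist_sur):
    """Chi-square distance of Eq. 7 between two histograms with matching bins."""
    total = 0.0
    for r_i, rs_i in zip(hist_obj, hist_sur):
        if r_i + rs_i > 0:                      # skip bins empty in both histograms
            total += (r_i - rs_i) ** 2 / (r_i + rs_i)
    return 0.5 * total


same = chi_square([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])      # identical color distributions
disjoint = chi_square([1.0, 0.0], [0.0, 1.0])            # no shared colors at all
```

For normalized histograms the score ranges from 0 (identical colors inside and around the box) to 1 (completely different colors), so a larger value indicates a more salient candidate.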

If the color information of the target candidate area differs significantly from that of the surrounding area, the candidate is more likely to contain a salient object. Accordingly, the scoring function of the SVM detection stage becomes:

$$\begin{aligned} f(\varvec{x})=\varvec{w}^\mathrm {T}\varvec{x}+\mu \chi ^2(R,R_\mathrm{s}) \end{aligned}$$
(8)

where \(\varvec{w}\) is the weight of the SVM classifier and \(\mu \) is a trade-off parameter.

The five candidates generated in CF detection stage are assessed by the criterion in Eq. 8 to determine the correct tracking target.
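The selection rule based on Eq. 8 can be sketched as below. The inputs are assumed to be precomputed: `svm_scores` holds \(\varvec{w}^\mathrm {T}\varvec{x}\) for each candidate and `saliency_scores` holds the corresponding \(\chi ^2(R,R_\mathrm{s})\) values. The function name and the example numbers are hypothetical; \(\mu = 0.7\) is the value reported in Sect. 4.2.

```python
def select_candidate(svm_scores, saliency_scores, mu=0.7):
    """Return the index of the candidate maximizing Eq. 8 and all combined scores."""
    combined = [f + mu * chi for f, chi in zip(svm_scores, saliency_scores)]
    best = max(range(len(combined)), key=lambda i: combined[i])
    return best, combined


# Hypothetical scores for three candidates.
best, combined = select_candidate(svm_scores=[0.2, 0.5, 0.4],
                                  saliency_scores=[0.1, 0.1, 0.9])
```

In this example the second candidate has the highest SVM score, but the third one wins after regularization because it stands out much more strongly from its surroundings.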

4 Experiments

4.1 Base tracker

TESR is a two-stage detection framework based on CF. The base CF tracker used in the first stage can be any CF-based tracker. We choose two CF trackers as our base trackers, KCF [17] and STAPLE [2], because both have high tracking speed and appealing performance. The discriminative classifier used in the second stage is TVM, as mentioned above, which we also choose for its efficiency and effectiveness. We refer to the two implementations of TESR as TESR\(_{\mathrm {KCF}}\) and TESR\(_{\mathrm {STAPLE}}\) in the following sections.

Like KCF, our TESR\(_{\mathrm {KCF}}\) does not adapt to scale changes, but similar to STAPLE, TESR\(_{\mathrm {STAPLE}}\) is robust to scale change. The motivation for choosing base CF trackers with different characteristics is to allow a fair comparison between our TESR versions and the existing approaches. We shall show that our two CF-based trackers TESR\(_{\mathrm {KCF}}\) and TESR\(_{\mathrm {STAPLE}}\) are not only better than KCF and STAPLE, but also deliver promising performance in comparison with the state-of-the-art trackers.

Fig. 6

Precision plots of our TESR by comparison with the state-of-the-art trackers on the OTB2013 a and OTB2015 b benchmark

Fig. 7

Success plots of our TESR by comparison with the state-of-the-art trackers on the OTB2013 a and OTB2015 b benchmark

4.2 Implementation details and parameters

In all our experiments, we use MATLAB on an Intel(R) Xeon(R) 3.30 GHz CPU with 8 GB RAM. For the base trackers KCF and STAPLE, we retain the original parameter settings. In the multiposition detection in CF, we set the threshold s of the response map peak value to 0.6 and the value of \(\alpha \) to 0.8. A larger value of \(\alpha \) gives a larger search region, but if \(\alpha \) is too large, the additional search regions will not overlap, which may result in the loss of the target. As shown in Fig. 5, the value of \(\alpha \) cannot be greater than \(1+p\), where p is the padding scale factor. We test values of \(\alpha \) from 0.1 to 2. With p equal to 1, the best value of \(\alpha \) is 0.8. We also test the impact of additional search centers in other positions and increase the number of positions used in the multiposition detection (the maximum number is set to 20). Interestingly, the performance is not sensitive to the number of search centers; the reason may be that more false candidates are generated as the number of additional search centers increases. Moreover, the speed decreases as more search centers are added.

For the TVM used in the second detection stage, we retain the original parameter settings of MEEM [46]. The trade-off parameter \(\mu \) determines the impact of the salient feature detection, as shown in Eq. 8. We test values from 0.01 to 10 and find that 0.7 performs best. The updating periods of CF and SVM are one frame and nine frames for TESR\(_{\mathrm {KCF}}\) and one frame and 12 frames for TESR\(_{\mathrm {STAPLE}}\), respectively. We use a sliding window sampling approach around the predicted location to obtain the training samples for updating TVM. The sampling region has the same size as the original CF search region, i.e., \((1+p)*r\), where r is the radius of the object and p is the padding scale factor. We only extract sample patches at the updating period, i.e., every 9 or 12 frames. In addition to these samples, we also retain the object candidates generated in CF detection at every frame as training samples. For all training sample patches, we calculate their overlap with the predicted object location. We regard the samples with an overlap greater than 90% as positive samples and the samples with an overlap less than 50% as negative samples. All other samples are discarded.
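The labeling rule for training samples can be written as a small helper. The sketch assumes that the overlap of each sample with the predicted location has already been computed as a value in [0, 1]; the function name and the use of `None` for discarded samples are illustrative choices.

```python
def label_by_overlap(overlap, pos_thr=0.9, neg_thr=0.5):
    """Return +1 (positive), -1 (negative) or None (discarded) for one sample."""
    if overlap > pos_thr:
        return 1
    if overlap < neg_thr:
        return -1
    return None                      # ambiguous samples are not used for training


labels = [label_by_overlap(o) for o in (0.95, 0.70, 0.30)]
```

Discarding the samples between the two thresholds keeps ambiguous, partially overlapping patches out of both classes.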

Table 1 A comparison of the DPE/OPE scores on sequences classified according to 11 challenging factors: IV (illumination variation), SV (scale variation), OCC (occlusion), DEF (deformation), MB (motion blur), FM (fast motion), IPR (in-plane rotation), OPR (out-of-plane rotation), OV (out-of-view), BC (background clutters) and LR (low resolution)

Fig. 8

Tracking results for challenging factors such as changing illumination and action

Fig. 9

Accuracy-Robustness (AR) rank plot on VOT2014 benchmark. Better trackers are closer to the top right corner

4.3 Evaluation

We run our TESR on two recent popular benchmarks, i.e., OTB [44, 45] and VOT [22, 23], and compare it to several state-of-the-art trackers. On the OTB benchmark, we use the source code provided by the original authors and run it ourselves to evaluate their performance. For the VOT benchmark, since the VOT challenge provides the results of all participating trackers, we use these reported results for comparison with our TESR.

Table 2 The raw accuracy and robustness scores on VOT2014 benchmark (over 25 sequences) of some top performers

4.3.1 OTB

We run our TESR\(_{\mathrm {KCF}}\) and TESR\(_{\mathrm {STAPLE}}\) on the OTB2013 [44] and OTB2015 [45] benchmarks. We also run other state-of-the-art trackers, including KCF [17], DSST [9], STAPLE [2], MEEM [46], SRDCF [10] and DLSSVM [32]. Two performance measures are used. The precision of a tracker on a sequence is based on the average per-frame location error between its predicted bounding box and the ground truth. The success rate is based on the average per-frame overlap between its predicted bounding box and the ground truth using the intersection-over-union (IOU) criterion \(S_t=\frac{|r_t\cap {r_{GT}}|}{|r_t\cup {r_{GT}}|}\), where \(r_t\) is the predicted bounding box, \(r_{GT}\) is the ground truth and \(|\cdot |\) denotes the area of a region.
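The two per-frame quantities underlying these measures can be sketched as follows. Boxes are assumed to be axis-aligned tuples `(x, y, w, h)` with `(x, y)` the top-left corner; this box format and the function names are assumptions of the sketch rather than part of the benchmark toolkits.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2), (ay + ah / 2) - (by + bh / 2))


pred, truth = (0, 0, 10, 10), (5, 0, 10, 10)
overlap = iou(pred, truth)            # intersection 50, union 150
error = center_error(pred, truth)     # centers are 5 pixels apart
```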

Figure 6 shows the precision plots, ranked at a location error threshold of 20 pixels. Figure 7 shows the success plots, i.e., the proportion of successful frames at IOU thresholds varying from 0 to 1. We use the area under the curve (AUC) as the representative success score. We can see that on both OTB2013 and OTB2015, our TESR\(_{\mathrm {KCF}}\) and TESR\(_{\mathrm {STAPLE}}\) achieve very good results in comparison with the base trackers KCF and STAPLE. TESR\(_{\mathrm {KCF}}\) is better by 11% in precision and 14% in success rate on OTB2013 and by 7% in precision and 7% in success rate on OTB2015. TESR\(_{\mathrm {STAPLE}}\) shows an improvement of 14% in precision and 12% in success rate on OTB2013 and of 10% in precision and 11% in success rate on OTB2015. TESR\(_{\mathrm {STAPLE}}\) is the best in terms of both evaluation measures. The precision achieved by TESR\(_{\mathrm {STAPLE}}\) is 88.9% and 86.1% on OTB2013 and OTB2015, respectively. The success rate achieved by TESR\(_{\mathrm {STAPLE}}\) is 67.3% and 64.4% on OTB2013 and OTB2015, respectively.
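The two summary scores can be sketched from per-frame errors and overlaps as follows. This is an approximation under stated assumptions: the AUC is taken as the mean success rate over 21 evenly spaced IOU thresholds in [0, 1], and a frame counts as successful when its overlap is strictly greater than the threshold. The official benchmark toolkit may differ in such details, and the function names are hypothetical.

```python
def precision_at(errors, threshold=20.0):
    """Fraction of frames whose center location error is within the threshold."""
    return sum(e <= threshold for e in errors) / len(errors)


def success_auc(overlaps, steps=20):
    """Mean success rate over evenly spaced IOU thresholds 0, 1/steps, ..., 1."""
    thresholds = [i / steps for i in range(steps + 1)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


# Hypothetical per-frame values for a very short sequence.
precision = precision_at([5.0, 25.0, 20.0])
auc = success_auc([0.8, 0.4])
```

Averaging over all thresholds rewards trackers that keep a high overlap throughout, rather than those that merely pass a single fixed threshold.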

Fig. 10

Accuracy-Robustness (AR) rank plot on VOT2016 benchmark. Better trackers are closer to the top right corner. Note that CCOT, the champion of the VOT2016 challenge, is shown as a yellow cross

Table 3 The raw accuracy and robustness scores on VOT2016 benchmark (over 60 sequences) of some top performers

Table 1 compares the precision and success rate scores of TESR and the state-of-the-art trackers on sequences classified according to various challenging factors. All the videos in OTB2013 are annotated with 11 different attributes, namely: illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out-of-view, background clutter and low resolution. Some tracking results under factors such as dramatic changes in lighting and action are also shown in Fig. 8. We can see that our TESR\(_{\mathrm {STAPLE}}\) outperforms existing trackers on all attributes in precision and on ten attributes in success rate. In particular, TESR\(_{\mathrm {STAPLE}}\) achieves relatively larger performance gains over the second best tracker when the target object experiences illumination variation, deformation, occlusion, out-of-view and background clutter. On the background clutter subset, TESR can identify the target thanks to the stable features of the object stored in the support vectors of the SVM classifier. The superior performance on the illumination variation and deformation subsets results from the adaptation of our CF store to fast appearance change. Since we adopt a large search region, TESR can easily re-capture the object after it starts drifting due to occlusion.

Fig. 11

a, b Sequence couple; c, d sequence jogging1; e, f sequence birds1. The red box represents the ground truth, and the white box represents the location predicted by TESR

4.3.2 VOT

We also run our TESR\(_{\mathrm {STAPLE}}\) on VOT2014 [23] and VOT2016 [22]. The VOT benchmark uses two highly interpretable, weakly correlated performance measures to analyze tracking behavior in reset-based experiments: accuracy (A) and robustness (R). Unlike OTB, the VOT methodology resets the tracker after it drifts off the target. Accuracy (A) is the average overlap between the predicted bounding boxes and the ground truth during successful tracking periods. Robustness (R) measures the number of tracker failures.
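
As a rough illustration of the two VOT measures, the sketch below derives them from the per-frame IOU values of a single sequence. It is a simplification under stated assumptions: a frame with zero overlap is counted as one failure, and the toolkit's re-initialization delay and any frames it ignores after a reset are not modeled. The function name is hypothetical.

```python
def accuracy_robustness(overlaps):
    """Return (accuracy, failures) for one sequence.

    overlaps: per-frame IOU values between prediction and ground truth.
    A zero-overlap frame counts as a failure (simplified reset rule);
    accuracy averages the overlap over the successfully tracked frames.
    """
    tracked = [o for o in overlaps if o > 0.0]
    failures = sum(1 for o in overlaps if o <= 0.0)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures
```

Note that under this rule a tracker that loses the target for one frame and then recovers by itself is still charged one failure, even if its accuracy on the remaining frames is high.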

To generate Fig. 9 and Table 2, we use the most recent version of the VOT toolkit, which can be downloaded from the VOT challenges Web site. As shown in Fig. 9, among all 38 trackers that participated in the VOT2014 challenge, our TESR\(_{\mathrm {STAPLE}}\) gets the best accuracy ranking. Table 2 presents the raw accuracy scores of some trackers from the VOT2014 challenge. Our TESR\(_{\mathrm {STAPLE}}\) achieves a 6% improvement in accuracy over the second-best trackers, KCF [17] and DSST [9].

The experimental results of the 70 trackers that participated in the VOT2016 challenge are publicly available. For simplicity, we compare TESR\(_{\mathrm {STAPLE}}\) only with the following state-of-the-art trackers: CCOT [11], DSST [9], KCF [17], SAMF [25], SCT4 [8], SRDCF [10], STAPLE [2] and STRUCK [14]. Note that CCOT is the champion tracker of the VOT2016 challenge. As shown in Fig. 10 and Table 3, TESR\(_{\mathrm {STAPLE}}\) also significantly outperforms the other trackers in the accuracy measure, and it ranks higher than CCOT in accuracy (Fig. 10).

Recall that the robustness measure defined for the benchmark is the number of failures during tracking. The VOT toolkit resets the tracker once the IOU between the predicted bounding box and the ground truth drops to zero. This evaluation protocol is not favorable to our TESR\(_{\mathrm {STAPLE}}\). Because TESR implicitly extends the search region, it can recapture the target after completely losing it. As shown in Fig. 11, in the sequences couple, jogging1 and birds1, TESR\(_{\mathrm {STAPLE}}\) recaptures the target in the 107th, 81st and 32nd frames, respectively, after its IOU has dropped to zero, and then continues tracking well in the following frames. This is an advantage of our method: we do not need any redetection technology. Nevertheless, our TESR is still a short-term tracker. Unfortunately, the robustness measure of the VOT benchmark counts these frames as failures, which increases the value of the robustness score. As a result, TESR\(_{\mathrm {STAPLE}}\) does not achieve a very high ranking in the robustness measure on VOT2014 and VOT2016.

Table 4 Comparing the impact of different components of TESR on OTB2013/2015 with the precision/success rate scores/fps

4.4 Impact of different components of TESR

To verify the effectiveness of the individual components of TESR, we conducted an ablation study on the OTB datasets by disabling different parts of TESR\(_{\mathrm {STAPLE}}\); the results are shown in Table 4. TESR is a fast tracker. In particular, if we remove the salient feature detection module, TESR reaches a speed of 19 fps with only a slight performance degradation.

To assess the impact of the saliency detection, we removed the salient feature detection module and denoted the resulting tracker as TESR\(_{\mathrm {NonSD}}\). On OTB2013 and OTB2015, its precision is 87.5% and 85.1%, and its success rate is 65.9% and 63.8%, respectively. These scores are slightly below those of the full TESR\(_{\mathrm {STAPLE}}\) (88.9% and 86.1% in precision, 67.3% and 64.4% in success rate), which shows that the salient feature detection improves the performance.

To assess the impact of the SVM detection, we removed it and instead selected the candidate with the largest of the five peak values as the object. The resulting tracker is denoted as TESR\(_{\mathrm {NonSVM}}\). On OTB2013 and OTB2015, its precision is 81.4% and 81.5%, and its success rate is 62.7% and 60.9%, respectively. This clearly shows that our two-stage strategy improves the performance. The performance drops because the peak value of the CF response does not always correspond to the target position.
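
The difference between the two selection rules can be illustrated with a toy example. This is only a sketch: the dictionary layout, the region names, and all numbers are hypothetical, and the "svm" field stands in for whatever score a trained SVM classifier would assign to each candidate.

```python
def select_by_peak(candidates):
    """Baseline rule: pick the candidate with the largest CF response peak."""
    return max(candidates, key=lambda c: c["peak"])


def select_by_classifier(candidates, score):
    """Two-stage rule: pick the candidate that the classifier scores highest."""
    return max(candidates, key=score)


# Hypothetical values for five search regions: a background region ("left")
# yields the sharpest CF peak, but the classifier score favors the target.
candidates = [
    {"region": "center", "peak": 0.41, "svm": 0.90},
    {"region": "up", "peak": 0.22, "svm": -0.40},
    {"region": "down", "peak": 0.18, "svm": -0.65},
    {"region": "left", "peak": 0.47, "svm": -0.10},
    {"region": "right", "peak": 0.25, "svm": -0.30},
]

by_peak = select_by_peak(candidates)
by_svm = select_by_classifier(candidates, lambda c: c["svm"])
```

In this constructed case the peak-based rule picks the "left" region while the classifier-based rule picks "center", which is the kind of disagreement the ablation result points to.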

Our TESR is not a simple combination of two classifiers, i.e., CF and SVM. First, our CF detection stage implicitly extends the search region in comparison with the conventional CF trackers. Second, unlike traditional multiclassifier structures, the two classifiers use independent update periods. The correlation filter is viewed as a short-term memory, adapting to rapid appearance change, while the SVM classifier is viewed as a long-term memory, capturing the stable properties of the object. These complementary roles are of paramount importance to our tracking framework. Note that if the SVM classifier is updated every frame, denoted as TESR\(_{\mathrm {1F}}\), the precision on OTB2013 and OTB2015 decreases to 83.3% and 78.5%, and the success rate decreases to 63.6% and 59.3%, respectively.
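
The idea of independent update periods can be sketched as a simple schedule. The period values below are assumptions chosen for illustration only (the CF updated every frame, the SVM every fifth frame); they are not the settings used in the experiments, and the function name is hypothetical.

```python
def update_schedule(num_frames, cf_period=1, svm_period=5):
    """Yield (frame, update_cf, update_svm) for frames 1..num_frames.

    cf_period and svm_period are illustrative assumptions: a short period
    lets the CF follow rapid appearance change, while a longer period keeps
    the SVM focused on the stable properties of the object.
    """
    for t in range(1, num_frames + 1):
        yield t, t % cf_period == 0, t % svm_period == 0


schedule = list(update_schedule(10))
svm_updates = [t for t, _, update_svm in schedule if update_svm]
```

Setting svm_period to 1 in this sketch corresponds to the every-frame variant discussed above, in which both classifiers are refreshed at the same rate.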

5 Conclusion

In this paper, we propose a correlation tracking framework that implicitly extends the search region (TESR) to deal with the problems of occlusion, motion blur and camera shaking in visual tracking. Our method is a two-stage detection solution. In the first stage, we reduce the boundary effect by searching at multiple positions, which implicitly extends the search region. In the second stage, we use an SVM classifier to choose the best of the candidates generated in the first stage. The experimental results demonstrate that our method exhibits superior performance in comparison with current state-of-the-art trackers on the OTB and VOT benchmarks.