
1 Introduction

Generic visual tracking is a crucial task in computer vision that aims to locate a specified object continuously throughout a video. Only limited information, usually the annotation of the first frame, is provided during visual object tracking. One unique characteristic of generic object tracking is that no prior knowledge about the object (e.g., the object class) or its surrounding environment is allowed [1]. The quality of target localization and scale estimation is the most influential factor in tracking performance.

Recently, localization and scale estimation have come to be treated as two subtasks of the tracking problem [2]. Before deep learning methods, trackers based on the Discriminative Correlation Filter (DCF) framework [3, 4] held a dominant position among tracking methods. Traditional correlation trackers suffer from inefficiency and low accuracy due to their inherent flaws. Deep learning has naturally been applied to computer vision tasks such as object detection, semantic segmentation and visual tracking. SiamFC [5] introduces the Siamese learning paradigm into visual object tracking, though it employs a brute-force multi-scale test that is inaccurate and inefficient [2]. The SiamRPN tracker family [6,7,8] then performs accurate and efficient target scale estimation by introducing the Region Proposal Network (RPN) [9]. However, the pre-defined anchor settings not only introduce ambiguous similarity, but also demand substantial prior knowledge about the target. SiamFC++ [10] adopts anchor-free regression and classification within the Siamese learning paradigm, but it still relies heavily on sufficient prior knowledge about the target. Motivated by the aforementioned analysis, we propose three suggestions on designing modern visual object trackers:

  • Balance between online learning and offline training: The breakthroughs in object detection provide a better way to replace multi-scale estimation in object tracking. For example, the RPN [9] structure achieves remarkable accuracy in SiamRPN [6]. Because the Siamese formulation does not provide a powerful discriminative model, we highly recommend that online learning be well designed. Correlation-based trackers [3, 4, 11, 12] are able to handle online model updates. However, model drift and insufficient training of the online model result in low accuracy.

  • Full utilization of multi-level deep convolutional features: The deep model should be trained for robustness, while the shallow model should emphasize accurate target localization [13]. We highly recommend that deep and shallow models be emphasized equally in order to achieve better robustness. Even though high-quality training data is crucial for the success of end-to-end representation learning [7], we argue that models designed for both deep and shallow features can reduce the burden of offline training.

  • Online searching strategies are highly recommended in the scale estimation branch: Both the RPN structure from Faster R-CNN [9] and the one-stage anchor-free detection from FCOS [14] output the coordinates of the target directly, without an online searching strategy. We strongly insist that such direct regression cannot handle severe appearance deformation and complex scenes. In our work, we choose the IoU-Net [15] prediction used in ATOM [2] as our scale estimation branch. It can perform an online searching strategy once the coarse location of the target is determined.

2 Related Work

Generic object tracking can be divided into two frameworks: the tracking framework and the detection framework. Generally, trackers in the tracking framework are mainly based on correlation filters (CF). MOSSE [4] proposes a CF tracker that learns a filter with minimum output sum of squared error for the target appearance and performs the computation in the Fourier domain. KCF [3] adopts ridge regression and circulant matrices to speed up computation in the Fourier domain. C-COT [16] converts feature maps of different resolutions into a continuous spatial domain to achieve better accuracy. The subsequent ECO [17] improves efficiency by removing redundant correlation filters.

The ATOM [2] tracker adopts IoU-Net [15] and online learning to classify the target and estimate its scale, combining online learning with offline training. ATOM achieves better robustness than Siamese-based trackers. However, it still lacks fusion of multi-level deep convolutional features, and its online learning is entirely independent of offline training, which leaves room for improvement. DiMP [18] combines online training and offline training.

SiamRPN and its succeeding works [3, 4, 11, 12] attach a Region Proposal Network after a Siamese network. They can regress bounding boxes directly thanks to extensive offline training. However, their robustness still suffers from the weak discriminative ability of Siamese-based detection networks. The pre-defined anchors of the Region Proposal Network (RPN) [9] also need to be well designed. Even though SiamFC++ [10] adopts an anchor-free style for bounding box regression, its performance still relies heavily on extensive offline training, and its robustness cannot be improved as much as its accuracy.

3 Proposed Method

Two meta-filters in the Geographical Sensitivity Pathway (GASP) are trained to gain more discriminative power between foreground and background. The Geometrically Sensitivity Pathway (GESP) focuses more on the appearance model of the object in order to estimate the scale accurately.

Fig. 1. Pipeline of the Dual Pathway Network. GASMF stands for the meta-filters in the Geographical Sensitivity Pathway. GESMF stands for the meta-filters in the Geometrically Sensitivity Pathway. GESAFM is the Appearance Fast Adaption Module in the Geometrically Sensitivity Pathway.

3.1 Dual Path Network

The whole pipeline of our tracker consists of two meta-filters and a Box Fast Adaption Module. Hierarchical feature representations are used for the two meta-filters in order to achieve better localization performance. Similar to the object segmentation in [19], the Box Fast Adaption Module can produce an accurate estimate of the object outline after the localization process (Fig. 1).

3.2 Multi-hierarchical Independent Discriminative Filters in Online Learning

Inspired by discriminative correlation filter (DCF) approaches, we formulate our learning objective based on the \(L^{2}\) classification error. Each sample \(x_{k}\) contains D feature channels \(x_{k}^{1}, x_{k}^{2}, \ldots , x_{k}^{D}\) extracted from the same image patch, where k is the index of the samples. Assume that \(f=\left\{ f_{d}\right\} _{d=1: D}\) is a set of D filters, one per feature channel. The correlation filter algorithm can be formulated as:

$$\begin{aligned} \underset{f}{\arg \min } \sum _{k=1}^{K}\left\| \phi \left( x_{k}, f\right) -y_{k}\right\| _{L^{2}}^{2}+\lambda \sum _{d=1}^{D}\left\| f_{d}\right\| _{L^{2}}^{2}\end{aligned}$$
(1)

where each \(x_{k}\) is a cyclically shifted training sample and \(y_{k}\) is its Gaussian response label. The filter response \(\phi \) is the sum of per-channel responses given in Eq. (2), and the optimization problem in Eq. (1) can be solved efficiently in the Fourier domain.

$$\begin{aligned} \phi \left( x_{k}, f\right) =\sum _{d=1}^{D} f_{d} * x_{k}^{d} \end{aligned}$$
(2)
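As an illustration of the Fourier-domain solution mentioned above, the following NumPy sketch solves Eqs. (1) and (2) in closed form for the simplest case of a single training sample (K = 1), treating * as circular convolution. It is a minimal sketch under these assumptions, not the optimization used in this work; the function names `gaussian_label`, `train_filter` and `respond`, the label width and the regularization value are hypothetical.

```python
import numpy as np


def gaussian_label(shape, center, sigma=2.0):
    """Gaussian response label y, peaked at center = (row, col), with circular wrap-around."""
    rows = np.arange(shape[0])[:, None]
    cols = np.arange(shape[1])[None, :]
    d_row = np.abs(rows - center[0])
    d_col = np.abs(cols - center[1])
    d_row = np.minimum(d_row, shape[0] - d_row)  # circular distance along rows
    d_col = np.minimum(d_col, shape[1] - d_col)  # circular distance along columns
    return np.exp(-(d_row ** 2 + d_col ** 2) / (2.0 * sigma ** 2))


def train_filter(x, y, lam=1e-2):
    """Closed-form minimizer of Eq. (1) for ONE sample x of shape (D, H, W).

    Assumption: * is circular convolution, so the problem decouples per frequency.
    Returns the filter in the Fourier domain, shape (D, H, W).
    """
    x_hat = np.fft.fft2(x, axes=(-2, -1))
    y_hat = np.fft.fft2(y)
    energy = np.sum(np.abs(x_hat) ** 2, axis=0)  # summed over the D channels
    return np.conj(x_hat) * y_hat / (energy + lam)


def respond(f_hat, z):
    """Response phi(z, f) of Eq. (2): per-channel circular convolution, summed over channels."""
    z_hat = np.fft.fft2(z, axes=(-2, -1))
    return np.real(np.fft.ifft2(np.sum(f_hat * z_hat, axis=0)))


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 32, 32))               # toy sample with D = 3 channels
y = gaussian_label((32, 32), center=(16, 16))
f_hat = train_filter(x, y)

# Shift the sample circularly by (+3, -5); the response peak follows the shift.
shifted = np.roll(x, shift=(3, -5), axis=(1, 2))
peak = np.unravel_index(np.argmax(respond(f_hat, shifted)), (32, 32))
```

Because circular convolution is shift-equivariant, the peak of the response on the shifted sample moves from the label center (16, 16) to (19, 11), which is how a correlation filter localizes the target.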

In our work, we combine online optimization with offline training. We therefore approximate the loss with a quadratic function and optimize it by backward propagation instead of the Fast Fourier Transform (FFT).

In this section, the discriminative learning loss is described in detail. The input to our model predictor D consists of a training set \(S_{\text{train}}=\left\{ \left( x_{j}, c_{j}\right) \right\} _{j=1}^{n}\) of deep feature maps \(x_{j} \in \mathcal {X}\) generated by the backbone network F, each paired with its target center coordinate \(c_{j}\). During online tracking, the correlation filter is optimized to generate a target model \(f=D\left( S_{\text{train}}\right) .\) The model f is defined as the filter weights of a convolutional layer. The maximum value of the model output should localize the center of the target.

$$\begin{aligned} L(f)=\frac{1}{\left| S_{\text{train}}\right| } \sum _{(x, c) \in S_{\text{train}}}\Vert r(x * f, c)\Vert ^{2}+\Vert \lambda f\Vert ^{2} \end{aligned}$$
(3)

Here, * denotes convolution and \(\lambda \) is a regularization factor. The function \(r(s, c)\) computes the residual at every spatial location based on the target confidence scores \(s=x * f\) and the ground-truth target center coordinate c. Traditional correlation filter trackers, as in Eq. (1), use \(r(s, c)=s-y_{c}\), i.e., they optimize the residuals between the response and the Gaussian target scores \(y_{c}\). Thus, the difference between the target and distractor responses usually represents the discriminative ability of the correlation filters. However, during online tracking, background noise and distractors are far more abundant than the target, resulting in an imbalance between positive and negative samples.

In order to learn a more discriminative filter, it is common to include a weight matrix in the learning loss. In our work, we employ a hinge-like loss in r, clipping the scores at zero as \(\max (0, s)\) in the background region. Thus, the filter focuses on hard negative distractors instead of easy negative samples. We believe that this contributes to a more discriminative filter and more efficient online optimization.

$$\begin{aligned} r(s, c)=v_{c} \cdot \left( m_{c} s+\left( 1-m_{c}\right) \max (0, s)-y_{c}\right) \end{aligned}$$
(4)

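The mask \(m_{c}\) modifies the spatial weight of the scores, taking values \(m_{c}(t) \in [0,1]\) at each spatial location \(t \in \mathbb {R}^{2}\): the scores are used unclipped where \(m_{c}\) is close to 1 and are clipped at zero where \(m_{c}\) is close to 0, i.e., in the background region. The factor \(v_{c}\) weights the residual at each spatial location.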
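The effect of Eq. (4) can be seen in a small element-wise NumPy sketch. The function name `hinge_residual` and the toy values are hypothetical; the mask and weight are fixed by hand here rather than learned.

```python
import numpy as np


def hinge_residual(s, y, m, v):
    """Eq. (4): r = v * (m * s + (1 - m) * max(0, s) - y), evaluated element-wise."""
    return v * (m * s + (1.0 - m) * np.maximum(0.0, s) - y)


# Toy 1-D example: location 0 is the target, locations 1-3 are background.
s = np.array([0.7, -0.4, 0.0, 0.6])   # confidence scores
y = np.array([1.0, 0.0, 0.0, 0.0])    # label
m = np.array([1.0, 0.0, 0.0, 0.0])    # mask: 1 on the target, 0 on the background
v = np.ones(4)                        # uniform spatial weight

r = hinge_residual(s, y, m, v)        # residuals: -0.3, 0.0, 0.0, 0.6
```

The negative background score at location 1 is clipped and contributes no residual (an easy negative), while the positive background score at location 3 is penalized in full (a hard negative distractor).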

In our work, we use convolutional layers D to generate the filter \(f=D\left( S_{\text{ train }}\right) \) by implicitly minimizing the error (3).

$$\begin{aligned} f^{(i+1)}=f^{(i)}-\alpha \nabla L\left( f^{(i)}\right) \end{aligned}$$
(5)

Instead of minimizing the error in Eq. (3) in the Fourier domain, we approximate it with a quadratic function and directly employ gradient descent with a step length \(\alpha \).

$$\begin{aligned} \begin{aligned} L(f) \approx \tilde{L}(f)=\,&\frac{1}{2}\left( f-f^{(i)}\right) ^{\mathrm {T}} Q^{(i)}\left( f-f^{(i)}\right) \\&+\,\left( f-f^{(i)}\right) ^{\mathrm {T}} \nabla L\left( f^{(i)}\right) +L\left( f^{(i)}\right) \end{aligned}\end{aligned}$$
(6)

Here, the filter variables f and \(f^{(i)}\) are treated as vectors and \(Q^{(i)}\) is a positive definite square matrix. Steepest descent is adopted in order to achieve fast convergence. Solving \(\frac{\mathrm {d}}{\mathrm {d} \alpha } \tilde{L}\left( f^{(i)}-\alpha \nabla L\left( f^{(i)}\right) \right) =0\) yields the step length \(\alpha \):

$$\begin{aligned} \alpha =\frac{\nabla L\left( f^{(i)}\right) ^{\mathrm {T}} \nabla L\left( f^{(i)}\right) }{\nabla L\left( f^{(i)}\right) ^{\mathrm {T}} Q^{(i)} \nabla L\left( f^{(i)}\right) } \end{aligned}$$
(7)

In this work, we set \(Q^{(i)}=\left( J^{(i)}\right) ^{\mathrm {T}} J^{(i)},\) where \(J^{(i)}\) is the Jacobian of the residuals at \(f^{(i)} .\) This choice of \(Q^{(i)}\), the Gauss-Newton approximation, captures second-order information about the residuals at \(f^{(i)}\) using only first-order derivatives, which contributes to fast and efficient convergence.
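The update of Eqs. (5) to (7) can be sketched on a toy problem with linear residuals \(r(f)=Af-b\) stacked with the regularization residual \(\lambda f\). This is a minimal NumPy sketch, assuming the loss is scaled as \(L=\frac{1}{2}\Vert r\Vert ^{2}\) so that \(\nabla L=J^{\mathrm {T}} r\) and \(Q=J^{\mathrm {T}} J\) are consistent with Eq. (6); the function name `steepest_descent` and the problem sizes are hypothetical, and the hinge-like residual of Eq. (4) is not included.

```python
import numpy as np


def steepest_descent(A, b, lam, f0, num_iters):
    """Gradient descent with the step length of Eq. (7) and Q = J^T J.

    Assumption: residuals are linear, r(f) = [A f - b; lam * f], and the
    loss is L(f) = 0.5 * ||r(f)||^2, so the Jacobian J is constant.
    """
    n = A.shape[1]
    J = np.vstack([A, lam * np.eye(n)])          # Jacobian of the stacked residuals
    target = np.concatenate([b, np.zeros(n)])
    f = f0.astype(float)
    losses = []
    for _ in range(num_iters):
        r = J @ f - target
        losses.append(0.5 * float(r @ r))
        grad = J.T @ r                           # gradient of L at f
        J_grad = J @ grad
        denom = float(J_grad @ J_grad)           # grad^T Q grad with Q = J^T J
        if denom == 0.0:                         # zero gradient: already optimal
            break
        alpha = float(grad @ grad) / denom       # Eq. (7)
        f = f - alpha * grad                     # Eq. (5)
    return f, losses


rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1

f_est, losses = steepest_descent(A, b, lam, np.zeros(5), num_iters=500)

# Reference: closed-form ridge solution of the same problem.
f_ref = np.linalg.solve(A.T @ A + lam ** 2 * np.eye(5), A.T @ b)
```

For linear residuals the quadratic model of Eq. (6) is exact, so each step is an exact line search along the negative gradient and the loss never increases. With nonlinear residuals, \(J\) would have to be re-evaluated at every iterate.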

Compared to traditional correlation filter (CF) algorithms, we treat the hierarchical features differently. Because shallow and deep features are both critical to localization and classification, we train an independent filter for each feature level. Decoupling the functions of the two filters benefits overall performance. Conventional CF algorithms with a single filter usually struggle to handle both classification and localization during online tracking, leading to model drift and insufficient online learning.

3.3 Filter Generations in Meta-learning Style

The motivation of our learning algorithm is that discriminative filters for similar visual objects in arbitrary backgrounds share a large portion of their weights. Filters for objects with the same high-level semantic information should be robust to appearance changes, motion blur, scale variations, etc. To extract useful shared filter weights for similar tracking scenes, we separate out scene-independent information through offline training (Fig. 2).

Fig. 2. Multi-hierarchical independent discriminative filters combined with online learning and offline training framework.


With these shared weights stored in the convolutional networks that generate the meta-filters, our online discriminative model for classification can be adapted quickly to specific objects. We introduce a network module called the filter generation network \(g_{\theta }\). It consists of two convolutional layers and a precise ROI pooling layer. During offline training, the training samples \(S_{\text{train}}=\left\{ \left( x_{j}, c_{j}\right) \right\} _{j=1}^{n}\), composed of several tracklets, are used to generate meta-filters by averaging the pooled feature maps. The test samples \(S_{\text{test}}=\left\{ \left( x_{j}, c_{j}\right) \right\} _{j=1}^{m}\) are then evaluated with the generated filters to optimize the filter generation network.
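The pooling-and-averaging step can be illustrated with a simplified NumPy sketch. It is only a rough stand-in: it replaces precise ROI pooling with integer-aligned average pooling, omits the convolutional layers of \(g_{\theta }\), and uses hypothetical names (`pool_region`, `initial_filter`), box format and feature sizes.

```python
import numpy as np


def pool_region(feat, box, out_size=4):
    """Average-pool the box region of a (C, H, W) feature map to (C, out_size, out_size).

    Assumption: box = (row0, col0, row1, col1) with integer, end-exclusive
    coordinates; precise ROI pooling would handle continuous coordinates.
    """
    row0, col0, row1, col1 = box
    region = feat[:, row0:row1, col0:col1]
    if region.shape[1] < out_size or region.shape[2] < out_size:
        raise ValueError("box must span at least out_size cells per side")
    row_edges = np.linspace(0, region.shape[1], out_size + 1).astype(int)
    col_edges = np.linspace(0, region.shape[2], out_size + 1).astype(int)
    pooled = np.empty((feat.shape[0], out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cell = region[:, row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            pooled[:, i, j] = cell.mean(axis=(1, 2))
    return pooled


def initial_filter(samples, out_size=4):
    """Average the pooled target features over all (feature map, box) training samples."""
    pooled = [pool_region(feat, box, out_size) for feat, box in samples]
    return np.mean(pooled, axis=0)


# Two toy training frames with constant features, so the result is easy to check.
feat_a = np.full((64, 18, 18), 2.5)
feat_b = np.full((64, 18, 18), 7.5)
meta_filter = initial_filter([(feat_a, (3, 4, 12, 15)), (feat_b, (0, 0, 8, 8))])
```

The result has the shape of a \(64 \times 4 \times 4\) convolution kernel, which can then be refined online by the gradient descent steps of Section 3.2.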

Details of our meta-filters in the Geographical Sensitivity Pathway (GASP) and Geometrically Sensitivity Pathway (GESP) are shown in Fig. 3. ResNet-50 Block3 features from different stages are passed to a convolutional block (Cls). Regions defined by the input bounding boxes are then pooled to a fixed size using Precise Pooling layers. After a further convolutional block, the filter weights are generated and applied as a convolutional kernel to the features of the search image. Online optimizers optimize the filter weights during online tracking, while offline optimizers learn the meta-knowledge of filter generation.

Fig. 3. Full architecture of our discriminative meta-filters. Pseudo-siamese network is not shown here for simplicity.

3.4 Appearance Fast Adaption Module

After the coarse spatial location of the target is determined, we need a subnetwork to obtain its accurate localization. In this work, we adopt an independent IoU-Net [15] with template feature modulation, trained to measure the differences between proposals and the ground truth. The full architecture is shown in Fig. 4.

Fig. 4. Full architecture of our Appearance Fast Adaption Module I (AFAM I) and Appearance Fast Adaption Module II (AFAM II).

The template features \(x_{0}\) and search-area features x are extracted by the modulation branch and the test branch, respectively. The bounding box annotation \(A_{0}\) serves as extra modulation information for generating the box confidence value. The modulation information \(c\left( x_{0}, A_{0}\right) \) is applied to the test branch as a convolution kernel. The feature representation of the search area \(z(x, A)\) has strong spatial correspondence with the search frame. Thus it can reflect the difference in spatial coordinates between the template and the test frame.

$$\begin{aligned} {\text {BoxConVal}}(A)=g\left( c\left( x_{0}, A_{0}\right) \cdot z(x, A)\right) \end{aligned}$$
(8)

During online tracking, we apply another online searching strategy that maximizes the confidence value through bounding box optimization. We use a Gaussian distribution around the previous target position to generate initial proposals. For each proposal, we obtain the confidence value through the Box Fast Adaption Module. By backpropagating to obtain the gradient of the confidence value, we directly optimize the side lengths and center position of the current proposal. Details are shown in Algorithm 2.
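The search loop described above can be sketched as gradient ascent on a confidence function over box parameters. This is a minimal NumPy sketch under strong assumptions: `toy_confidence` is a hypothetical smooth stand-in for the learned box confidence of Eq. (8), the gradient is taken by finite differences instead of backpropagation, and the box layout (center x, center y, width, height), the number of proposals, the step count and the step size are illustrative only.

```python
import numpy as np


def numeric_grad(conf_fn, box, eps=1e-3):
    """Central-difference gradient of the confidence with respect to the box parameters."""
    grad = np.zeros_like(box)
    for i in range(box.size):
        step = np.zeros_like(box)
        step[i] = eps
        grad[i] = (conf_fn(box + step) - conf_fn(box - step)) / (2.0 * eps)
    return grad


def refine_box(conf_fn, prev_box, num_proposals=8, sigma=2.0,
               num_steps=10, lr=50.0, seed=0):
    """Sample Gaussian proposals around prev_box, refine each by gradient ascent
    on the confidence, and return the most confident refined box."""
    rng = np.random.default_rng(seed)
    proposals = prev_box + sigma * rng.standard_normal((num_proposals, prev_box.size))
    proposals[0] = prev_box                      # always include the previous box itself
    best_box, best_conf = prev_box, conf_fn(prev_box)
    for box in proposals:
        for _ in range(num_steps):
            box = box + lr * numeric_grad(conf_fn, box)   # ascent step
        conf = conf_fn(box)
        if conf > best_conf:
            best_box, best_conf = box, conf
    return best_box, best_conf


def toy_confidence(box, target=(50.0, 40.0, 30.0, 20.0), scale=20.0):
    """Hypothetical confidence: highest when the box matches `target` exactly."""
    diff = box - np.asarray(target)
    return float(np.exp(-np.sum(diff ** 2) / (2.0 * scale ** 2)))


prev_box = np.array([46.0, 43.0, 34.0, 17.0])    # (cx, cy, w, h) from the previous frame
refined_box, refined_conf = refine_box(toy_confidence, prev_box)
```

In this toy setting the refined box is closer to the target box than the previous one and has a higher confidence; with a learned confidence network, the finite-difference gradient would be replaced by the gradient obtained through backpropagation.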

Appearance Fast Adaption Module II (AFAM II) provides pixel-level target information. We use the features extracted from ResNet-50. Given the first frame and its ground truth, we obtain a pseudo-mask for the target from AFAM II. The extra information from Appearance Fast Adaption Module I (AFAM I) and the pseudo-mask are then concatenated, and the refinement network outputs the final appearance estimation. Although AFAM II is pretrained on the training segmentation sequences from Youtube-VOS, it is not designed for the segmentation task. During training, we use bounding box labels as inputs to predict the target mask, so its output should be considered target appearance estimation, not instance segmentation.

4 Experiments

The meta-filters in the Geographical Sensitivity Pathway (GASMF) and the Box Fast Adaption Module in the Geometrically Sensitivity Pathway (GESBFAM) are first trained jointly, starting from ImageNet-pretrained weights. Because ImageNet-pretrained models target classification and may not be suitable for tracking, we first train GASMF and GESBFAM for 40 epochs on the training splits of the TrackingNet [20], LaSOT [21], GOT10k [1] and COCO [22] datasets to adapt the backbone to the tracking task. Then, we add the meta-filters in the Geometrically Sensitivity Pathway (GESMF) and train for another 30 epochs to obtain a more discriminative model. We train our model by sampling 26,000 frame pairs per epoch. We use ADAM [23] with a learning rate decay of 0.2 every 10th epoch. We use features extracted from the third block of ResNet-50. We set the kernel size of the meta-filters to \(64 \times 4 \times 4\). The Appearance Fast Adaption Module (AFAM) in the Geometrically Sensitivity Pathway is pre-trained on 3471 training segmentation sequences from Youtube-VOS [24].

4.1 Ablation Studies

We compared the performance of different combinations of ResNet-50 Block3 features from different stages. If adjacent layers are selected, more redundancy and interference are introduced into our tracking framework, causing performance degradation. As shown in Table 1, the best performance is achieved with layer3a and layer3e. When using two meta-filters, the EAO reaches 0.455, which demonstrates the effectiveness of two filters. The Box Fast Adaption Module improves accuracy considerably, from 0.597 to 0.652 (Table 1).

Table 1. VOT2018: comparison with different settings.
Table 2. VOT2018: comparison with state-of-the-art trackers. The top three results are highlighted in different font colors. Best viewed in color display.
Table 3. VOT2019 realtime: comparison with state-of-the-art trackers. The top three results are highlighted in different font colors. Best viewed in color display.
Table 4. GOT-10K: comparison with state-of-the-art trackers. The top three results are highlighted in different font colors. Best viewed in color display.

4.2 Results on Several Benchmarks

The VOT2018 [25] dataset consists of 60 test sequences. With no training dataset provided, VOT is among the most challenging tracking benchmarks, covering difficulties such as fast motion and occlusion. We tested our tracker on this benchmark and present the results in Tables 2 and 3. To the best of our knowledge, our EAO of 0.467 on VOT2018 (Kristan et al., 2018) and EAO of 0.334 on the VOT2019 benchmark set a new state of the art. Our tracker also runs at 30 FPS on an NVIDIA GeForce GTX 1080 Ti, which is still very competitive (Tables 4 and 5).

Table 5. OTB-15. The top three results are highlighted in different font colors. Best viewed in color display.

5 Conclusions

In this paper, we propose three suggestions on designing modern visual object trackers. We combine offline training and online learning of discriminative filters. Meta-learning is emphasized and successfully applied to object tracking. The meta-knowledge of filter generation for similar tracking scenes is learned through a convolutional network. Gradient descent optimization is carefully designed to adapt our filters to unseen objects efficiently. Moreover, a pseudo-Siamese network structure strengthens the discriminative ability of our meta-filters. Our tracker can perform online searching strategies to find the best object bounding box. The balance between online searching and offline training helps us achieve better results with fewer training resources.