Hierarchical Representations with Discriminative Meta-filters in Dual Path Network for Tracking

Xie, Fei; Wang, Ning; Yao, Yuncong; Yang, Wankou; Zhang, Kaihua; Liu, Bo

doi:10.1007/978-3-030-60639-8_26


Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12306)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)


Abstract

In visual tracking, accuracy and robustness are critical for achieving remarkable performance. In this paper, we propose a novel dual path network with discriminative meta-filters and hierarchical representations to address both. We first design a geographical sensitivity pathway (GASP) and a geometrically sensitivity pathway (GESP) for the two subtasks of target classification and scale estimation, respectively. GASP mainly includes powerful discriminative meta-filters that find the coarse location of the target, while GESP refines the regions of interest online and swiftly adapts the appearance model to the target. Then, a dual path network is developed in a combined online and offline framework. Specifically, the meta-filters are trained offline in order to gain meta-knowledge of similar tracking scenes. Finally, we present three suggestions on designing modern trackers. Extensive experiments on the VOT2018 dataset verify the superior performance of the proposed method compared with other state-of-the-art trackers, achieving an expected average overlap (EAO) of 0.467.

This work is supported in part by the National Major Project of China for New Generation of AI (No. 2018AAA0100400), in part by the Natural Science Foundation of China under Grant Nos. 61773117 and 61876088, and in part by the Primary Research & Development Plan of Jiangsu Province - Industry Prospects and Common Key Technologies under Grant No. BE2017157.

F. Xie is currently working toward the Master's degree in the School of Automation, Southeast University.


1 Introduction

Generic visual tracking is a crucial task in computer vision that aims at locating a specific object continuously throughout a video. Only limited information, usually the annotation in the first frame, is provided during visual object tracking. One unique characteristic of generic object tracking is that no prior knowledge about the object (e.g., the object class) or its surrounding environment is allowed [1]. The quality of target localization and scale estimation is the most influential factor in tracking performance.

Recently, localization and scale estimation have tended to be treated as two subtasks of the tracking problem [2]. Before deep learning methods, trackers based on the Discriminative Correlation Filter (DCF) framework [3, 4] dominated the field. Traditional correlation trackers suffer from inefficiency and low accuracy due to their inherent flaws. Deep learning has since been applied to many computer vision tasks, such as object detection, semantic segmentation and visual tracking. SiamFC [5] introduces the Siamese learning paradigm into visual object tracking, though it employs a brute-force multi-scale test which is inaccurate and inefficient [2]. Then, the SiamRPN tracker family [6,7,8] performs accurate and efficient target scale estimation by introducing the Region Proposal Network (RPN) [9]. However, the pre-defined anchor settings not only introduce ambiguous similarity, but also demand substantial prior knowledge about the target. SiamFC++ [10] adopts anchor-free regression and classification based on the Siamese learning paradigm, but it still relies heavily on sufficient prior knowledge about the target. Motivated by the aforementioned analysis, we propose three suggestions on designing modern visual object trackers:

Balance between online learning and offline training: The breakthroughs in object detection provide a better way to replace multi-scale estimation in object tracking. For example, the RPN [9] structure achieves remarkable accuracy in SiamRPN [6]. Because the Siamese formulation does not provide a powerful discriminative model, we highly recommend that online learning be well designed. Correlation-based trackers [3, 4, 11, 12] are able to update the model online. However, model drift and insufficient training of the online model result in low accuracy.
Full utilization of multi-level deep convolutional features: The deep model should be trained for robustness, while the shallow model should emphasize accurate target localization [13]. We highly recommend that deep and shallow models be emphasized equally in order to achieve better robustness. Even though high-quality training data is crucial for the success of end-to-end representation learning [7], we argue that models designed for both deep and shallow features can reduce the burden of offline training.
Online searching strategies are highly recommended in the scale estimation branch: Both the RPN structure from Faster R-CNN [9] and the one-stage anchor-free detection from FCOS [14] output the coordinates of the target directly, without an online searching strategy. We strongly insist that such direct regression cannot cope with severe appearance deformation and complex scenes. In our work, we choose the IoU-Net [15] prediction used in ATOM [2] as our scale estimation branch. It can perform an online searching strategy once the coarse location of the target is determined.

2 Related Work

Generic object tracking methods can be divided into two frameworks: the tracking framework and the detection framework. Generally, trackers in the tracking framework are mainly based on correlation filters (CF). MOSSE [4] proposes a CF tracker that learns a filter with minimum output sum of squared error for the target appearance and performs the computation in the Fourier domain. KCF [3] adopts ridge regression and circulant matrices to speed up computation in the Fourier domain. C-COT [16] converts feature maps of different resolutions into a continuous spatial domain to achieve better accuracy. The subsequent ECO [17] improves efficiency by removing redundant correlation filters.

The ATOM [2] tracker adopts IoU-Net [15] and online learning to classify the target and estimate its scale, thereby combining online learning and offline training. ATOM achieves better robustness than Siamese-based trackers. However, it still lacks multi-level deep convolutional feature fusion, and its online learning is totally independent of offline training, which leaves room for improvement. DiMP [18] combines online training and offline training together.

SiamRPN and its succeeding works [6, 7, 8] append a Region Proposal Network to a Siamese network. They gain direct bounding box regression ability thanks to extensive offline training. However, their robustness still suffers from the weak discriminative ability of Siamese-based detection networks. The pre-defined anchors of the Region Proposal Network (RPN) [9] also need to be well designed. Even though SiamFC++ [10] adopts an anchor-free style for bounding box regression, its performance still relies heavily on extensive offline training, and its robustness cannot be improved as much as its accuracy.

3 Proposed Method

Two meta-filters in the Geographical Sensitivity Pathway (GASP) are trained to have more discriminative power between foreground and background. The Geometrically Sensitivity Pathway (GESP) focuses more on the appearance model of the object in order to estimate the scale accurately.

3.1 Dual Path Network

The whole pipeline of our tracker consists of two meta-filters and a Box Fast Adaption Module. Hierarchical feature representations are used by the two meta-filters in order to achieve better localization performance. Similar to the object segmentation in [19], the Box Fast Adaption Module can produce an accurate estimate of the object outline after the localization process (Fig. 1).

3.2 Multi-hierarchical Independent Discriminative Filters in Online Learning

Inspired by discriminative correlation filter (DCF) approaches, we formulate our learning objective based on the $L^{2}$ classification error. Each sample $x_{k}$ contains D feature channels $x_{k}^{1}, x_{k}^{2}, \ldots , x_{k}^{D}$ extracted from the same image patch, where k is the index of the sample. Assume that $f=\left\{ f_{d}\right\} _{d=1: D}$ is a set of D channel filters. The correlation filter algorithm can be formulated as:

$$\begin{aligned} \underset{f}{\arg \min } \sum _{k=1}^{K}\left\| \phi \left( x_{k}, f\right) -y_{k}\right\| _{L^{2}}^{2}+\lambda \sum _{d=1}^{D}\left\| f_{d}\right\| _{L^{2}}^{2}\end{aligned}$$

(1)

where $x_{k}$ is the k-th cyclically shifted training sample, $y_{k}$ is its Gaussian response label, $\lambda $ is the regularization weight, and $\phi $ is the filter response defined in Eq. (2). The optimization problem in Eq. (1) can be solved efficiently in the Fourier domain.

$$\begin{aligned} \phi \left( x_{k}, f\right) =\sum _{d=1}^{D} f_{d} * x_{k}^{d} \end{aligned}$$

(2)
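As a rough illustration of why Eq. (1) is cheap to solve in the Fourier domain, the following minimal sketch handles a single feature channel (D = 1) on a 1-D signal with circular convolution. The function names, the toy signal and the naive DFT (used in place of an FFT to stay dependency-free) are all assumptions made for illustration and are not taken from the paper.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (a stand-in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(values[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def circular_convolve(x, f):
    """Response s[t] = sum_m f[m] * x[(t - m) mod n]."""
    n = len(x)
    return [sum(f[m] * x[(t - m) % n] for m in range(n)) for t in range(n)]


def ridge_filter(x, y, lam):
    """Closed-form minimizer of ||x * f - y||^2 + lam * ||f||^2 (one channel)."""
    x_hat, y_hat = dft(x), dft(y)
    # The objective decouples per frequency, so each bin is solved separately.
    f_hat = [xk.conjugate() * yk / (abs(xk) ** 2 + lam)
             for xk, yk in zip(x_hat, y_hat)]
    return [value.real for value in dft(f_hat, inverse=True)]


def gaussian_label(n, sigma):
    """Gaussian response label peaked at index 0, using circular distance."""
    return [math.exp(-min(t, n - t) ** 2 / (2 * sigma ** 2)) for t in range(n)]


# Toy example: hypothetical 8-sample feature signal and its desired response.
x = [1.0, 2.0, 0.0, -1.0, 0.5, 3.0, -2.0, 0.25]
y = gaussian_label(len(x), sigma=1.0)
f = ridge_filter(x, y, lam=1e-9)
response = circular_convolve(x, f)
```

With a very small regularization weight the learned filter reproduces the Gaussian label almost exactly on the training sample; larger values of `lam` trade this fit for a smaller filter norm.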

In our work, we combine online optimization with offline training. We therefore approximate the loss with a quadratic function and optimize it by backpropagation instead of the Fast Fourier Transform (FFT).

In this section, the discriminative learning loss is described in detail. The input to our model predictor D consists of a training set $S_{\text{ train } }=\left\{ \left( x_{j}\right) \right\} _{j=1}^{n}$ of deep feature maps $x_{j} \in \mathcal {X}$ generated by the backbone network F. During online tracking, the correlation filter is optimized to generate a target model $f=D\left( S_{\text{ train } }\right) .$ The model f is defined as the filter weights of a convolutional layer. The maximum of the model output should localize the center of the target.

$$\begin{aligned} L(f)=\frac{1}{\left| S_{\text{ train } }\right| } \sum _{(x) \in S_{\text{ train } }}\Vert r(x * f, c)\Vert ^{2}+\Vert \lambda f\Vert ^{2} \end{aligned}$$

(3)

Here, * denotes convolution and $\lambda $ is a regularization factor. The function r(s, c) computes the residual at every spatial location based on the target confidence scores $s=x * f$ and the ground-truth target center coordinate c. In Eq. (1), the residual is $r(s, c)=s-y_{c}$, i.e., traditional correlation filter trackers optimize the residual between the response and the Gaussian target scores. Thus, the difference between the target response and the distractor response usually represents the discriminative ability of the correlation filter. However, during online tracking, background noise and distractors are far more abundant than the target, resulting in an imbalance between positive and negative samples.

In order to learn a more discriminative filter, it is common to include a weight matrix in the learning loss. In our work, we employ a hinge-like loss in r, clipping the scores at zero as $\max (0, s)$ in the background region. Thus, the filter focuses on hard negative distractors instead of easy negative samples. We believe that this contributes to a more discriminative filter and more efficient online optimization.

$$\begin{aligned} r(s, c)=v_{c} \cdot \left( m_{c} s+\left( 1-m_{c}\right) \max (0, s)-y_{c}\right) \end{aligned}$$

(4)

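The mask $m_{c}$ modifies the spatial weight of the scores, taking values in the interval $m_{c}(t) \in [0,1]$ at each spatial location $t \in \mathbb {R}^{2}.$ Where $m_{c} \approx 1$ the plain difference $s-y_{c}$ is used, and where $m_{c} \approx 0$ the clipped score $\max (0, s)$ is used. The factor $v_{c}$ scales the residual at each location.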
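The effect of Eq. (4) can be seen in a small sketch that evaluates the residual location by location. The function names and the toy 2 × 2 score map are hypothetical; only the formula itself comes from the text.

```python
def hinge_residual(score, label, mask, weight=1.0):
    """Residual r = v * (m * s + (1 - m) * max(0, s) - y) at one location."""
    return weight * (mask * score + (1.0 - mask) * max(0.0, score) - label)


def residual_map(scores, labels, masks, weights):
    """Apply the residual independently at every location of 2-D lists."""
    return [
        [hinge_residual(s, y, m, v) for s, y, m, v in zip(*rows)]
        for rows in zip(scores, labels, masks, weights)
    ]


# Toy 2 x 2 map: only the bottom-left cell belongs to the target (mask = 1).
scores = [[-0.7, 0.3], [0.4, 0.0]]
labels = [[0.0, 0.0], [1.0, 0.0]]
masks = [[0.0, 0.0], [1.0, 0.0]]
weights = [[1.0, 1.0], [1.0, 1.0]]
residuals = residual_map(scores, labels, masks, weights)
```

In this toy map the background score of -0.7 yields a zero residual (an easy negative that no longer drives the update), the background score of 0.3 is penalized in full, and the target cell is penalized for falling short of its label.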

In our work, we use convolutional layers D to generate the filter $f=D\left( S_{\text{ train }}\right) $ by implicitly minimizing the error (3).

$$\begin{aligned} f^{(i+1)}=f^{(i)}-\alpha \nabla L\left( f^{(i)}\right) \end{aligned}$$

(5)

Instead of minimizing the error (3) in Fourier domain, we approximate the error with a quadratic function and directly employ gradient descent optimization using a step length $\alpha $.

$$\begin{aligned} \begin{aligned} L(f) \approx \tilde{L}(f)=\,&\frac{1}{2}\left( f-f^{(i)}\right) ^{\mathrm {T}} Q^{(i)}\left( f-f^{(i)}\right) \\&+\,\left( f-f^{(i)}\right) ^{\mathrm {T}} \nabla L\left( f^{(i)}\right) +L\left( f^{(i)}\right) \end{aligned}\end{aligned}$$

(6)

Here, the filter variables f and $f^{(i)}$ are treated as vectors and $Q^{(i)}$ is a positive definite square matrix. Steepest descent is adopted in order to achieve fast convergence. By solving $\frac{\mathrm {d}}{\mathrm {d} \alpha } \tilde{L}\left( f^{(i)}-\alpha \nabla L\left( f^{(i)}\right) \right) =0,$ we find the step length $\alpha $.

$$\begin{aligned} \alpha =\frac{\nabla L\left( f^{(i)}\right) ^{\mathrm {T}} \nabla L\left( f^{(i)}\right) }{\nabla L\left( f^{(i)}\right) ^{\mathrm {T}} Q^{(i)} \nabla L\left( f^{(i)}\right) } \end{aligned}$$

(7)
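For completeness, the step length can be recovered by substituting the update direction into the quadratic model of Eq. (6). Writing $g=\nabla L\left( f^{(i)}\right) $ as shorthand (a symbol introduced here only for brevity), the derivation reads:

$$\begin{aligned} \tilde{L}\left( f^{(i)}-\alpha g\right)&=\frac{1}{2} \alpha ^{2} g^{\mathrm {T}} Q^{(i)} g-\alpha g^{\mathrm {T}} g+L\left( f^{(i)}\right) , \\ \frac{\mathrm {d}}{\mathrm {d} \alpha } \tilde{L}\left( f^{(i)}-\alpha g\right)&=\alpha g^{\mathrm {T}} Q^{(i)} g-g^{\mathrm {T}} g=0 \quad \Longrightarrow \quad \alpha =\frac{g^{\mathrm {T}} g}{g^{\mathrm {T}} Q^{(i)} g} . \end{aligned}$$

Because $Q^{(i)}$ is positive definite, the denominator is positive for any nonzero gradient, so the stationary point is a minimum along the search direction.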

In this work, we set $Q^{(i)}=\left( J^{(i)}\right) ^{\mathrm {T}} J^{(i)},$ where $J^{(i)}$ is the Jacobian of the residuals at $f^{(i)} .$ This choice of the positive definite square matrix $Q^{(i)}$ brings second-order information about the residuals at $f^{(i)}$ into the gradient descent, which contributes to fast and efficient convergence.
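The update of Eq. (5) with the step length of Eq. (7) can be sketched on a toy problem. The sketch below assumes a linear residual r(f) = A f - b, so the Jacobian is simply A, and it scales the loss as 0.5·||r||² so that Q = JᵀJ matches Eq. (6) without extra constant factors. Both choices, as well as the matrix A, the vector b and all function names, are illustrative assumptions rather than the tracker's actual residual.

```python
def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss(A, b, f):
    """L(f) = 0.5 * ||A f - b||^2 (the 0.5 scaling is an assumption)."""
    r = [p - q for p, q in zip(matvec(A, f), b)]
    return 0.5 * dot(r, r)


def steepest_descent(A, b, f, num_iters):
    """Iterate f <- f - alpha * g with alpha = g^T g / (g^T J^T J g)."""
    J = A  # Jacobian of the linear residual r(f) = A f - b
    Jt = transpose(J)
    history = [loss(A, b, f)]
    for _ in range(num_iters):
        r = [p - q for p, q in zip(matvec(J, f), b)]
        g = matvec(Jt, r)       # gradient of 0.5 * ||r||^2
        Jg = matvec(J, g)
        denom = dot(Jg, Jg)     # g^T Q g with Q = J^T J
        if denom == 0.0:
            break               # zero gradient: already stationary
        alpha = dot(g, g) / denom
        f = [fi - alpha * gi for fi, gi in zip(f, g)]
        history.append(loss(A, b, f))
    return f, history


# Hypothetical 3 x 2 least-squares problem.
A = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [2.0, 1.0, 3.0]
f_opt, losses = steepest_descent(A, b, [0.0, 0.0], num_iters=50)
```

For this linear residual the quadratic model is exact, so every step is an exact line search and the loss never increases. With the hinge-like residual of Eq. (4), the Jacobian would have to be re-evaluated at every iterate.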

Compared to traditional correlation filter (CF) algorithms, we treat the hierarchical features differently. Because the shallow and deep features are both critical to localization and classification, we train an independent filter for each feature level. Decomposing the functions of the two filters is beneficial to the overall performance. A conventional CF algorithm with a single filter usually struggles to handle both classification and localization during online tracking, leading to model drift and insufficient online learning.

3.3 Filter Generations in Meta-learning Style

The motivation of our learning algorithm is that discriminative filters for similar visual objects in arbitrary backgrounds share a large portion of their weights. Filters for objects with the same high-level semantic information should be robust towards changes, motion blur, scale variations, etc. To extract useful shared filter weights in similar tracking scenes, we separate scene-independent information through offline training (Fig. 2).

With these shared weights stored in convolutional networks that generate meta-filters, our online discriminative model for classification can be adapted quickly to specific objects. We introduce a network module called the filter generation network $g_{\theta }$. It consists of two convolutional layers and a precise ROI pooling layer. During offline training, the training samples $S_{\text{ train } }=\left\{ \left( x_{j}, c_{j}\right) \right\} _{j=1}^{n}, $ composed of several tracklets, are used to generate meta-filters by averaging the pooled feature maps. Then, the generated filters are applied to the test samples $S_{\text{ test } }=\left\{ \left( x_{j}, c_{j}\right) \right\} _{j=1}^{m}$ in order to optimize the filter generation network.
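The averaging step described above can be sketched as follows. This is a simplified single-channel illustration: integer-bin average pooling stands in for precise ROI pooling, the convolutional layers of $g_{\theta }$ are omitted, and the function names and toy feature maps are hypothetical.

```python
def roi_average_pool(feature, box, out_size):
    """Average-pool the region box = (x0, y0, x1, y1) to out_size x out_size.

    Integer bins are used here as a simple stand-in for precise ROI pooling.
    """
    x0, y0, x1, y1 = box
    if x1 - x0 < out_size or y1 - y0 < out_size:
        raise ValueError("box must span at least out_size cells per side")
    xs = [x0 + (x1 - x0) * i // out_size for i in range(out_size + 1)]
    ys = [y0 + (y1 - y0) * i // out_size for i in range(out_size + 1)]
    pooled = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            cells = [feature[y][x]
                     for y in range(ys[r], ys[r + 1])
                     for x in range(xs[c], xs[c + 1])]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def initial_filter(samples, out_size):
    """Average the pooled target features over all (feature, box) samples."""
    pooled = [roi_average_pool(feat, box, out_size) for feat, box in samples]
    n = len(pooled)
    return [[sum(p[r][c] for p in pooled) / n for c in range(out_size)]
            for r in range(out_size)]


# Two toy 4 x 4 feature maps whose target box covers the whole map.
feat_a = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
feat_b = [[1.0] * 4 for _ in range(4)]
samples = [(feat_a, (0, 0, 4, 4)), (feat_b, (0, 0, 4, 4))]
meta_filter = initial_filter(samples, out_size=2)
```

The resulting `meta_filter` would serve only as the starting point; in the described pipeline it is subsequently refined by the online optimizer.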

Details of our meta-filters in the Geographical Sensitivity Pathway (GASP) and the Geometrically Sensitivity Pathway (GESP) are shown in Fig. 3. ResNet-50 Block3 features from different stages are passed to a convolutional block (Cls). Regions defined by the input bounding boxes are then pooled to a fixed size using Precise Pooling layers. After another convolutional block, the filter weights are generated and applied as a convolutional layer to the features of the search image. Online optimizers optimize the filter weights during online tracking, while offline optimizers learn meta-knowledge of filter generation.

3.4 Appearance Fast Adaption Module

After the coarse spatial location of the target is determined, we need a subnetwork to acquire the accurate localization of the target. In this work, we adopt an independent IoU-Net [15] with template feature modulation, trained to measure the difference between proposals and the ground truth. The full architecture is shown in Fig. 4.

The template features $x_{0}$ and search area features x are extracted by the modulation branch and the test branch, respectively. The bounding box annotation $A_{0}$ serves as extra modulation information for generating the box confidence value. The modulation information $c\left( x_{0}, A_{0}\right) $ is applied to the test branch as a convolution kernel. The feature representation of the search area z(x, A) has strong spatial correspondence with the search frame. Thus, it can reflect the spatial coordinate difference between the template and the test frame.

$$\begin{aligned} {\text {BoxConVal}}(A)=g\left( c\left( x_{0}, A_{0}\right) \cdot z(x, A)\right) \end{aligned}$$

(8)

During online tracking, we apply another online searching strategy to maximize the confidence value through bounding box optimization. We use a Gaussian distribution around the previous position of the target to generate initial proposals. For each proposal, we obtain the confidence value through the Box Fast Adaption Module. By backpropagating to obtain the gradient of the confidence value, we optimize the side lengths and center position of the current proposal directly. Details are shown in Algorithm 2.
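The online search can be illustrated with a minimal gradient-ascent sketch. It makes two loud assumptions: the learned confidence of Eq. (8) is replaced by the IoU with a known toy box, and backpropagation is replaced by central finite differences. The step-halving rule, the iteration counts and all names are hypothetical choices for this sketch, not details from the paper.

```python
def iou(box, target):
    """IoU of two boxes given as (cx, cy, w, h)."""
    def corners(b):
        cx, cy, w, h = b
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax0, ay0, ax1, ay1 = corners(box)
    bx0, by0, bx1, by1 = corners(target)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = box[2] * box[3] + target[2] * target[3] - inter
    return inter / union


def numerical_gradient(confidence, box, eps=1e-3):
    """Central finite differences (a stand-in for backpropagation)."""
    grad = []
    for i in range(4):
        upper = list(box)
        lower = list(box)
        upper[i] += eps
        lower[i] -= eps
        grad.append((confidence(upper) - confidence(lower)) / (2 * eps))
    return grad


def refine_box(confidence, box, num_iters=20, step=100.0):
    """Gradient ascent on (cx, cy, w, h); a step is halved until it improves."""
    box = list(box)
    best = confidence(box)
    for _ in range(num_iters):
        grad = numerical_gradient(confidence, box)
        trial_step = step
        for _attempt in range(12):
            trial = [b + trial_step * g for b, g in zip(box, grad)]
            trial[2] = max(trial[2], 1.0)  # keep width positive
            trial[3] = max(trial[3], 1.0)  # keep height positive
            value = confidence(trial)
            if value > best:
                box, best = trial, value
                break
            trial_step /= 2
    return box, best


# Toy setup: the "confidence" is the IoU with a hidden ground-truth box.
target = (50.0, 50.0, 40.0, 40.0)
start = (48.0, 52.0, 30.0, 50.0)
start_score = iou(start, target)
refined, refined_score = refine_box(lambda b: iou(b, target), start)
```

Because a step is accepted only when the confidence increases, the refined score can never fall below the score of the initial proposal. In a tracker this procedure would be run from several proposals and the best results combined.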

Appearance Fast Adaption Module II (AFAM II) provides pixel-level target information. We use the features extracted from ResNet-50. Given the first frame and its ground truth, we obtain a pseudo-mask for the target from AFAM II. Then the extra information from Appearance Fast Adaption Module I (AFAM I) and the pseudo-mask are concatenated. The refinement network outputs the final appearance estimation. Although AFAM II is pretrained on training segmentation sequences from Youtube-VOS, it is not designed for the segmentation task. During training, we use bounding box labels as inputs to predict the target mask. It should therefore be considered target appearance estimation, not instance segmentation.

4 Experiments

The meta-filters in the Geographical Sensitivity Pathway (GASMF) and the Box Fast Adaption Module in the Geometrically Sensitivity Pathway (GESBFAM) are first trained jointly, starting from ImageNet pretrained weights. Because ImageNet pretrained models target classification and may not be suitable for tracking, we first train GASMF and GESBFAM for 40 epochs on the training splits of the TrackingNet [20], LaSOT [21], GOT10k [1] and COCO [22] datasets to adapt the backbone to the tracking task. Then, we add the meta-filters in the Geometrically Sensitivity Pathway (GESMF) and train for another 30 epochs to obtain a more discriminative model. We train our model by sampling 26,000 frame pairs per epoch. We use ADAM [23] with a learning rate decay of 0.2 every 10th epoch. We use features extracted from the third block of ResNet-50. We set the kernel size of the meta-filters to 64 × 4 × 4. The Appearance Fast Adaption Module (AFAM) in the Geometrically Sensitivity Pathway is pretrained on 3471 training segmentation sequences from Youtube-VOS [24].
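One plausible reading of the learning rate schedule is a step decay in which the rate is multiplied by 0.2 after every 10 epochs. The sketch below encodes that reading for the first 40-epoch stage. The multiplicative interpretation, the base learning rate and the function name are assumptions, since the text does not state them.

```python
def step_decay_lr(base_lr, epoch, decay=0.2, every=10):
    """Learning rate after multiplying by `decay` once per `every` epochs."""
    return base_lr * decay ** (epoch // every)


BASE_LR = 1e-3  # hypothetical value; the base rate is not given in the text
schedule = [step_decay_lr(BASE_LR, epoch) for epoch in range(40)]
```

Under these assumptions the rate stays constant within each block of 10 epochs and shrinks by a factor of 5 between blocks. Whether the schedule restarts for the second 30-epoch stage is not specified.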

4.1 Ablation Studies

We compare the performance of different combinations of ResNet-50 Block3 features from different stages. If we select adjacent layers, more redundancy and interference are introduced into our tracking framework, causing performance degradation. According to Table 2, the best performance is achieved with layer3a and layer3e. When using two meta-filters, the EAO reaches 0.455, which demonstrates the effectiveness of two filters. The Box Fast Adaption Module improves accuracy considerably, from 0.597 to 0.652 (Table 1).

Table 1. VOT2018: comparison with different settings.

Table 2. VOT2018: comparison with state-of-the-art trackers. The top three results are highlighted in distinct font colors. Best viewed in color.

Table 3. VOT2019 realtime: comparison with state-of-the-art trackers. The top three results are highlighted in distinct font colors. Best viewed in color.

Table 4. GOT-10K: comparison with state-of-the-art trackers. The top three results are highlighted in distinct font colors. Best viewed in color.

4.2 Results on Several Benchmarks

The VOT2018 [25] dataset consists of 60 test sequences. With no training dataset provided, VOT is one of the most challenging tracking benchmarks, covering attributes such as fast motion and occlusion. We tested our tracker on this benchmark and present the results in Table 2 and Table 3. We achieve an EAO of 0.467 on VOT2018 (Kristan et al., 2018) and an EAO of 0.334 on the VOT2019 benchmark, which is, to the best of our knowledge, the new state-of-the-art performance. Our tracker also runs at 30 FPS on an Nvidia GeForce 1080Ti, which is still very competitive (Tables 4 and 5).

Table 5. OTB-15. The top three results are highlighted in distinct font colors. Best viewed in color.

5 Conclusions

In this paper, we propose three suggestions on designing modern visual object trackers. We combine offline training and online learning of discriminative filters. Meta-learning is emphasized and successfully applied to object tracking. The meta-knowledge of filter generation in similar tracking scenes is learned by a convolutional network. Gradient descent optimization is carefully designed to adapt our filters to unseen objects efficiently. Moreover, a pseudo-Siamese network structure strengthens the discriminative ability of our meta-filters. Our tracker can perform online searching strategies to find the best object bounding box. The balance between online searching and offline training helps us achieve better results with fewer training resources.

References

1. Huang, L., Zhao, X., Huang, K.: Got-10k: a large high-diversity benchmark for generic object tracking in the wild. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
2. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: Atom: accurate tracking by overlap maximization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4660–4669 (2019)
3. Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583–596 (2014)
4. Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M.: Visual object tracking using adaptive correlation filters. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2544–2550. IEEE (2010)
5. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S.: Fully-convolutional siamese networks for object tracking. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 850–865. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48881-3_56
6. Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with siamese region proposal network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8971–8980 (2018)
7. Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., Hu, W.: Distractor-aware siamese networks for visual object tracking. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 101–117 (2018)
8. Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J.: Siamrpn++: evolution of siamese visual tracking with very deep networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4282–4291 (2019)
9. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
10. Xu, Y., Wang, Z., Li, Z., Ye, Y., Yu, G.: Siamfc++: towards robust and accurate visual tracking with target estimation guidelines. arXiv preprint arXiv:1911.06188 (2019)
11. Sun, C., Wang, D., Lu, H., Yang, M.-H.: Correlation tracking via joint discrimination and reliability learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 489–497 (2018)
12. Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., Torr, P.H.S.: End-to-end representation learning for correlation filter based tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2805–2813 (2017)
13. Bhat, G., Johnander, J., Danelljan, M., Khan, F.S., Felsberg, M.: Unveiling the power of deep tracking. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 483–498 (2018)
14. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9627–9636 (2019)
15. Jiang, B., Luo, R., Mao, J., Xiao, T., Jiang, Y.: Acquisition of localization confidence for accurate object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–799 (2018)
16. Danelljan, M., Robinson, A., Shahbaz Khan, F., Felsberg, M.: Beyond correlation filters: Learning continuous convolution operators for visual tracking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) European Conference on Computer Vision, vol. 9909, pp. 472–488. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_29
17. Danelljan, M., Robinson, A., Shahbaz Khan, F., Felsberg, M.: Eco: efficient convolution operators for tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6638–6646 (2017)
18. Bhat, G., Danelljan, M., Van Gool, L., Timofte, R.: Learning discriminative model prediction for tracking. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 6182–6191 (2019)
19. Lukezic, A., Matas, J., Kristan, M.: D3s-a discriminative single shot segmentation tracker. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7133–7142 (2020)
20. Muller, M., Bibi, A., Giancola, S., Alsubaihi, S., Ghanem, B.: Trackingnet: a large-scale dataset and benchmark for object tracking in the wild. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 300–317 (2018)
21. Fan, H., et al.: Lasot: a high-quality benchmark for large-scale single object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5374–5383 (2019)
22. Lin, T.-Y., et al.: Microsoft coco: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) European Conference on Computer Vision, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
23. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
24. Xu, N., et al.: Youtube-vos: sequence-to-sequence video object segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 585–601 (2018)
25. Kristan, M., et al.: The seventh visual object tracking vot2019 challenge results. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)
26. Chen, Z., Zhong, B., Li, G., Zhang, S., Ji, R.: Siamese box adaptive network for visual tracking. arXiv preprint arXiv:2003.06761 (2020)
27. Wang, Q., Zhang, L., Bertinetto, L., Hu, W., Torr, P.H.S.: Fast online object tracking and segmentation: a unifying approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1328–1338 (2019)

Author information

Authors and Affiliations

School of Automation, Southeast University, Nanjing, 210096, China
Fei Xie, Yuncong Yao & Wankou Yang
School of Automation, Nanjing University of Information Science and Technology, Nanjing, 210044, China
Ning Wang & Kaihua Zhang
JD.com, Beijing, China
Bo Liu

Corresponding author

Correspondence to Wankou Yang.

Editor information

Editors and Affiliations

Peking University, Beijing, China
Yuxin Peng
Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Dalian University of Technology, Dalian, China
Huchuan Lu
Chinese Academy of Sciences, Beijing, China
Zhenan Sun
Chinese Academy of Sciences, Beijing, China
Chenglin Liu
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xilin Chen
Peking University, Beijing, China
Hongbin Zha
Nanjing University of Science and Technology, Nanjing, China
Jian Yang

Cite this paper

Xie, F., Wang, N., Yao, Y., Yang, W., Zhang, K., Liu, B. (2020). Hierarchical Representations with Discriminative Meta-filters in Dual Path Network for Tracking. In: Peng, Y., et al. Pattern Recognition and Computer Vision. PRCV 2020. Lecture Notes in Computer Science, vol 12306. Springer, Cham. https://doi.org/10.1007/978-3-030-60639-8_26

DOI: https://doi.org/10.1007/978-3-030-60639-8_26
Published: 15 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60638-1
Online ISBN: 978-3-030-60639-8
eBook Packages: Computer Science, Computer Science (R0)
