
1 Introduction

In recent years, deep learning [1] has been driving progress in visual tasks such as target tracking [2], image segmentation [3] and target detection [4]. We focus on target tracking, while also paying close attention to the other tasks, since they promote each other.

Trackers based on the Siamese network have attracted many researchers thanks to their balance of speed and accuracy. Tao et al. [5] propose Siamese instance search for tracking (SINT), which adopts a Siamese network structure and matches multi-scale candidate image patches against the target patch. Bertinetto et al. [6] then design the fully convolutional Siamese network SiamFC, which measures the similarity between the search image features and the target image features through correlation convolution, formulating target tracking as an image matching problem. SiamRPN [7] introduces the region proposal network (RPN) into the Siamese network and utilizes the anchor mechanism from object detection [8] to predict the size of the target; a bounding box regression branch and a classification branch are therefore added to SiamFC to discriminate the target and bound the target candidate region. DSiam [9] explores a dynamic Siamese network that learns object appearance changes and background suppression online, trained on continuous video frames. DaSiamRPN [10] uses detection datasets to expand the positive samples and hard negative samples, and designs a distractor-aware module to distinguish the real target from disturbances, which improves the generalization of the tracker.

SiamRPN implements target size prediction by introducing the RPN module, but several aspects remain to be improved.

Firstly, the regression branch in SiamRPN is optimized with the L1 norm loss, so the predicted bounding box is not accurate [11, 12].

Secondly, SiamRPN selects positive and negative samples by the Intersection over Union (IoU) between anchors and the Ground Truth (GT) bounding box, which leads to low discrimination among positive samples.

Finally, the classification branch is separate from the regression branch in the introduced RPN module, so the two branches may not lock onto the same candidate target patch at their respective optimal predictions.

In this paper, we propose a modified SiamRPN based on IoU. Under the SiamRPN framework, we introduce the IoU between the GT box and anchors into the loss function to refine the regression prediction box, define the IoU between the GT box and predicted boxes to weight positive samples so that they can be distinguished from one another, and use these IoU-weighted positive samples to establish a connection between the classification branch and the regression branch. Tracking experiments are carried out on the OTB2013 [12] and OTB2015 [13] test datasets to verify the feasibility and effectiveness of the proposed tracker.

The remainder of this paper is organized as follows. Section 2 discusses the principle of the Siamese network. A modified SiamRPN for visual tracking is proposed in Sect. 3. In Sect. 4, experiments and discussion are given. The final section presents conclusions as well as future work.

2 Siamese Network

The classic Siamese network used in the tracking task is shown in Fig. 1. It formulates target tracking as a matching problem between images.

Fig. 1 Siamese network

Siamese networks apply an identical transformation φ to both the exemplar image z and the candidate image x, and measure the similarity between their representations by a cross-correlation layer as follows.

$$ f(z, x) = \varphi(z) * \varphi(x) + b $$
(1)

where b is a bias applied at every location of the score map.

The similarity measure f(z, x) is learned to evaluate the similarity between the exemplar features and the candidate features, so as to obtain a similarity response score map that shows a high score if the two images depict the same object and a low score otherwise.
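
To make Eq. (1) concrete, the following is a minimal sketch of the cross-correlation layer in PyTorch (the platform used in Sect. 4). The function name and tensor shapes are illustrative assumptions, not released code.

```python
import torch
import torch.nn.functional as F

def similarity_map(phi_z: torch.Tensor, phi_x: torch.Tensor,
                   b: float = 0.0) -> torch.Tensor:
    """Eq. (1): f(z, x) = phi(z) * phi(x) + b.

    phi_z: (1, C, Hz, Wz) exemplar features, used as the convolution kernel.
    phi_x: (1, C, Hx, Wx) search features, used as the input signal.
    Returns a (1, 1, Hx-Hz+1, Wx-Wz+1) response score map.
    """
    return F.conv2d(phi_x, phi_z) + b

# Example: 6x6 exemplar features correlated over 22x22 search features
# yield a 17x17 score map.
score = similarity_map(torch.randn(1, 256, 6, 6),
                       torch.randn(1, 256, 22, 22))  # (1, 1, 17, 17)
```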

3 The Proposed Method

In this section, we propose the modified Siamese-RPN based on IoU, illustrated in Fig. 2. Under the SiamRPN framework, we introduce the IoU between the GT box and anchors into the loss function to refine the regression prediction box, define the IoU between the GT box and predicted boxes to weight positive samples so that they can be distinguished from one another, and use these IoU-weighted positive samples to establish a connection between the classification branch and the regression branch.

Fig. 2 A modified Siamese-RPN framework

We divide it into four parts: a Siamese feature extraction module, a region proposal module, bounding box regression, and foreground–background classification.

3.1 Siamese Feature Extraction Module

The Siamese feature extraction module maps images into a feature representation domain. As shown in the left block of Fig. 2, it consists of two branches. One is for the feature extraction of the exemplar image, which comes from a historical frame; we denote its input z and output φ(z). The other is for the search image, which comes from the current frame; we denote its input x and output φ(x).

They share the learnable network φ, which adopts the fully convolutional layers of AlexNet [14]. That is, the input z of size 127 × 127, obtained by center cropping, and the input x of size 255 × 255, obtained in the same manner, are fed into the Siamese module for feature extraction.
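
A sketch of the shared backbone φ, assuming the fully convolutional AlexNet variant commonly used by SiamFC/SiamRPN (no padding, no fully connected layers); the exact channel widths follow the common open-source configuration and are an assumption, not taken from the paper. Under these settings, a 127 × 127 exemplar maps to 6 × 6 × 256 features and a 255 × 255 search image to 22 × 22 × 256 features.

```python
import torch.nn as nn

class AlexNetBackbone(nn.Module):
    """Fully convolutional AlexNet-style feature extractor phi (assumed)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2), nn.BatchNorm2d(96),
            nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5), nn.BatchNorm2d(256),
            nn.ReLU(inplace=True), nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3), nn.BatchNorm2d(384),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3), nn.BatchNorm2d(384),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3),
        )

    def forward(self, img):
        # img: (N, 3, 127, 127) for z or (N, 3, 255, 255) for x.
        return self.features(img)
```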

3.2 Region Proposal Module

The region proposal module generates proposals for the tracking target. As shown in the middle block of Fig. 2, it has two Siamese convolution networks \(\varphi_{\text{cls}}\) and \(\varphi_{\text{reg}}\), used for the foreground–background classification branch and the target bounding box regression branch, respectively, each of which is matched with a supervision section.

To obtain features in the identical representation domain, the two Siamese convolution networks \(\varphi_{\text{cls}}\) and \(\varphi_{\text{reg}}\) are applied to the features φ(z) and φ(x), outputting \(\varphi_{\text{cls}}[\varphi(z)]\) and \(\varphi_{\text{cls}}[\varphi(x)]\) for classification, and \(\varphi_{\text{reg}}[\varphi(z)]\) and \(\varphi_{\text{reg}}[\varphi(x)]\) for regression.

Then we perform the cross-correlation on the classification branch and the regression branch as below

$$ \begin{gathered} p_{\text{cls}} = \varphi_{\text{cls}}[\varphi(z)] * \varphi_{\text{cls}}[\varphi(x)] \\ p_{\text{reg}} = \varphi_{\text{reg}}[\varphi(z)] * \varphi_{\text{reg}}[\varphi(x)] \end{gathered} $$
(2)

Here, \(\varphi_{\text{cls}}[\varphi(z)]\) and \(\varphi_{\text{reg}}[\varphi(z)]\) serve as convolution kernels, while \(\varphi_{\text{cls}}[\varphi(x)]\) and \(\varphi_{\text{reg}}[\varphi(x)]\) serve as the input signals in the cross-correlation layer.

The anchor mechanism is introduced into the tracking task. If there are k anchors, the classification prediction \(p_{\text{cls}}\) outputs 2k channels, and the regression prediction \(p_{\text{reg}}\) outputs 4k channels.
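
A sketch of how Eq. (2) can yield 2k classification and 4k regression channels, assuming the "up-channel" correlation of the original SiamRPN: the exemplar branch lifts its features to 2k·C and 4k·C channels, which then act as grouped convolution kernels over the search features. Layer names and kernel sizes are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    def __init__(self, in_ch: int = 256, k: int = 5):
        super().__init__()
        self.k = k
        self.cls_z = nn.Conv2d(in_ch, 2 * k * in_ch, kernel_size=3)
        self.cls_x = nn.Conv2d(in_ch, in_ch, kernel_size=3)
        self.reg_z = nn.Conv2d(in_ch, 4 * k * in_ch, kernel_size=3)
        self.reg_x = nn.Conv2d(in_ch, in_ch, kernel_size=3)

    def forward(self, feat_z, feat_x):
        # Exemplar features serve as kernels, search features as the signal.
        kz_cls = self.cls_z(feat_z)                      # (N, 2k*C, h, w)
        kz_reg = self.reg_z(feat_z)                      # (N, 4k*C, h, w)
        x_cls, x_reg = self.cls_x(feat_x), self.reg_x(feat_x)
        n, c = x_cls.size(0), x_cls.size(1)
        # Reshape kernels so each of the 2k (resp. 4k) outputs correlates
        # against all C channels of the search features, batched via groups.
        kz_cls = kz_cls.view(n * 2 * self.k, c, kz_cls.size(2), kz_cls.size(3))
        kz_reg = kz_reg.view(n * 4 * self.k, c, kz_reg.size(2), kz_reg.size(3))
        p_cls = F.conv2d(x_cls.reshape(1, n * c, *x_cls.shape[2:]), kz_cls, groups=n)
        p_reg = F.conv2d(x_reg.reshape(1, n * c, *x_reg.shape[2:]), kz_reg, groups=n)
        return (p_cls.view(n, 2 * self.k, *p_cls.shape[2:]),
                p_reg.view(n, 4 * self.k, *p_reg.shape[2:]))
```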

3.3 Loss Function

In this section, as shown in the right block of Fig. 2, we introduce IoU into the loss function and reformulate the loss functions of the regression branch and the classification branch, respectively, in the following subsections.

We apply the strategy from SiamRPN [7] to pick positive and negative training samples: in terms of the IoU between anchors and the ground truth box of the target, positive samples are defined as anchors with IoU > 0.6, and negative samples as anchors with IoU < 0.3. We allow at most 16 positive samples and 64 samples in total from one training pair, optimize the bounding box regression loss on the positive samples, and optimize the classification loss on all samples. We set 5 anchors with the same area and aspect ratios [0.33, 0.5, 1, 2, 3].
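
A minimal sketch of this sampling rule, assuming corner-format (x1, y1, x2, y2) anchor boxes and torchvision's box_iou; the random subsampling used to enforce the caps is our assumption.

```python
import torch
from torchvision.ops import box_iou

def select_samples(anchors, gt_box, max_pos=16, max_total=64):
    """Label anchors: IoU > 0.6 positive, IoU < 0.3 negative, rest ignored.

    anchors: (A, 4) boxes; gt_box: (4,) ground-truth box, both (x1, y1, x2, y2).
    """
    ious = box_iou(anchors, gt_box.unsqueeze(0)).squeeze(1)  # (A,)
    pos = torch.nonzero(ious > 0.6).flatten()
    neg = torch.nonzero(ious < 0.3).flatten()
    # Cap at 16 positives and 64 samples in total per training pair.
    pos = pos[torch.randperm(pos.numel())[:max_pos]]
    neg = neg[torch.randperm(neg.numel())[:max_total - pos.numel()]]
    return pos, neg, ious
```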

Regression Loss. Using only the L1 norm loss to optimize the bounding box regression, as in SiamRPN, is not very effective.

According to the surveys [11, 15], IoU loss is one of the most effective evaluation measures and is more accurate than the Ln norm losses for bounding box regression. However, IoU loss is highly nonlinear, has many degrees of freedom and contains multiple zero-gradient regions [16], so it is hard to optimize. Meanwhile, parameter imbalance exists in the RPN module [17], which makes optimizing the IoU loss of the RPN network even harder. We conjecture this is the main reason why SiamRPN does not directly use IoU loss.

Here, we develop the bounding box regression prediction loss based on the joint optimization of IoU loss and smooth L1 norm loss.

To overcome the difficulty of optimizing the IoU loss, we optimize the IoU loss only on the best positive sample and optimize the smooth L1 loss on the other positive samples. Note that the best positive sample is defined as the anchor with the maximum IoU.

At the same time, the best positive sample is located in the central region. The IoU loss would play a less important role in the training process if it were optimized only on the best positive sample. We illustrate the joint optimization of the IoU loss and smooth L1 norm loss in Fig. 2.

Starting from the input, the exemplar image z is obtained by center cropping. The search image x is obtained by cropping at a new center, shifted by a random number of pixels. The inputs z and x are then fed into the Siamese module and the RPN module to output the predictions. The loss of target bounding box regression based on IoU and smooth L1 is given as

$$ L_{\text{R}} = L_{\text{best}} + \sum\limits_{i \in \text{pos}} L_{\text{S-L}_1}\left( p_{\text{reg}}^{(i)} \right) $$
(3)

where pos denotes all positive samples except the best positive sample, and \(L_{\text{S-L}_1}\) is the smooth L1 loss, computed as in SiamRPN [7]. The loss on the best positive sample, \(L_{\text{best}}\), is formulated as

$$ L_{\text{best}} = 1 - I_{\text{IoU}}\left( b_{\text{reg}}^{(\text{best})}, \text{gt}_{\text{reg}} \right) + R_{\text{penalty}}\left( b_{\text{reg}}^{(\text{best})}, \text{gt}_{\text{reg}} \right) $$
(4)

where \(\text{gt}_{\text{reg}} = (x_{\text{gt}}, y_{\text{gt}}, w_{\text{gt}}, h_{\text{gt}})\) is the GT target bounding box and \(b_{\text{reg}}^{(\text{best})} = (x_{\text{b}}, y_{\text{b}}, w_{\text{b}}, h_{\text{b}})\) is the predicted target bounding box on the best positive sample. \(I_{\text{IoU}}(b_{\text{reg}}^{(\text{best})}, \text{gt}_{\text{reg}})\) is the IoU between \(\text{gt}_{\text{reg}}\) and \(b_{\text{reg}}^{(\text{best})}\). The penalty term \(R_{\text{penalty}}\) describes a constraint on the bounding box, and is calculated as in Ref. [18]

$$ R_{\text{penalty}}\left( b_{\text{reg}}, \text{gt}_{\text{reg}} \right) = \frac{\rho^{2}\left( b_{\text{reg}}, \text{gt}_{\text{reg}} \right)}{C^{2}} + \alpha \nu $$
(5)

where \(\rho(b_{\text{reg}}, \text{gt}_{\text{reg}})\) is the Euclidean distance between the centers of \(\text{gt}_{\text{reg}}\) and \(b_{\text{reg}}^{(\text{best})}\), and C is the diagonal length of the smallest box enclosing both boxes, as in Ref. [18]. The weight coefficient is \(\alpha = \frac{\nu}{\left(1 - I_{\text{IoU}}(b_{\text{reg}}, \text{gt}_{\text{reg}})\right) + \nu}\). The term ν measures the consistency of the length–width ratio between the ground truth box and the predicted box, and is computed by

$$ \nu = \frac{4}{\pi^{2}}\left( \arctan\frac{w_{\text{gt}}}{h_{\text{gt}}} - \arctan\frac{w_{\text{b}}}{h_{\text{b}}} \right)^{2} $$
(6)

From Eqs. (4) to (6), it can be seen that the IoU loss preserves target accuracy to the greatest extent in terms of intersection, length–width ratio and center distance.
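
Under the definitions in Eqs. (3)–(6), the joint regression loss can be sketched as below, with boxes in (cx, cy, w, h) format. The smooth-L1 part operates on the usual SiamRPN offset parameterization; preds_other and targets_other are placeholders for those offsets.

```python
import math
import torch
import torch.nn.functional as F

def ciou_best_loss(b_best, gt):
    """L_best = 1 - IoU + rho^2/C^2 + alpha*v, Eqs. (4)-(6); boxes (cx, cy, w, h)."""
    bx1, by1 = b_best[0] - b_best[2] / 2, b_best[1] - b_best[3] / 2
    bx2, by2 = b_best[0] + b_best[2] / 2, b_best[1] + b_best[3] / 2
    gx1, gy1 = gt[0] - gt[2] / 2, gt[1] - gt[3] / 2
    gx2, gy2 = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
    inter = ((torch.min(bx2, gx2) - torch.max(bx1, gx1)).clamp(min=0) *
             (torch.min(by2, gy2) - torch.max(by1, gy1)).clamp(min=0))
    iou = inter / (b_best[2] * b_best[3] + gt[2] * gt[3] - inter)
    # Squared center distance over squared diagonal of the enclosing box.
    cw = torch.max(bx2, gx2) - torch.min(bx1, gx1)
    ch = torch.max(by2, gy2) - torch.min(by1, gy1)
    rho2 = (b_best[0] - gt[0]) ** 2 + (b_best[1] - gt[1]) ** 2
    # Aspect-ratio consistency term v and its weight alpha, Eq. (6).
    v = (4 / math.pi ** 2) * (torch.atan(gt[2] / gt[3]) -
                              torch.atan(b_best[2] / b_best[3])) ** 2
    alpha = v / ((1 - iou) + v)
    return 1 - iou + rho2 / (cw ** 2 + ch ** 2) + alpha * v

def regression_loss(b_best, gt, preds_other, targets_other):
    """L_R = L_best + smooth-L1 over the remaining positives, Eq. (3)."""
    return ciou_best_loss(b_best, gt) + F.smooth_l1_loss(
        preds_other, targets_other, reduction="sum")
```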

Classification Loss. SiamRPN picks positive and negative samples based on the IoU between the GT box and anchors. There is only one target in each image in the single-target tracking task, so the positive samples all come from the same target. It is hard to determine which positive sample is closer to the true target when their IoUs are close to each other.

On the other hand, the regression branch is separate from the classification branch in SiamRPN, so the two branches may not lock onto the same candidate target patch at their respective optimal predictions.

In this paper, we define weight coefficients for the positive samples based on the IoU between the GT bounding box \(\text{gt}_{\text{reg}}\) and the predicted bounding boxes \(b_{\text{reg}}^{(\text{pos})}\) returned by the regression branch.

The weights are used to distinguish the sampled positive samples from each other. Consequently, the weighted positive samples bridge the classification prediction and the regression prediction. This helps overcome the inconsistency between the two branches by establishing a connection between them.

Then we formulate the classification loss on negative samples and weighted positive samples as below

$$ L_{\text{C}} = L_{\text{CP}} + L_{\text{CN}} $$
(7)

The classification loss on weighted positive samples is given by

$$ L_{\text{CP}} = \sum\limits_{i \in \text{pos}} L_{\text{CE}}\left( \eta_{\text{scale}} \cdot I_{\text{IoU}}\left( b_{\text{reg}}^{(i)}, \text{gt}_{\text{reg}} \right) \cdot p_{\text{cls}}^{(i)},\; \text{gt}_{\text{cls}}^{(i)} \right) $$
(8)

where \(L_{\text{CE}}(x, y)\) is the cross-entropy loss function, and \(\text{gt}_{\text{cls}}^{(i)}\) and \(p_{\text{cls}}^{(i)}\) are the ground truth label and predicted classification logits of the ith positive sample, respectively. \(I_{\text{IoU}}(b_{\text{reg}}^{(i)}, \text{gt}_{\text{reg}})\) is the weight coefficient for the ith positive sample.

The weights of all positive samples are scaled by a scalar \(\eta_{\text{scale}}\) to reduce the stochastic volatility of the regression prediction. Based on the IoUs and predictions, \(\eta_{\text{scale}}\) is defined as

$$ \eta_{\text{scale}} = \frac{\sum\nolimits_{i \in \text{pos}} p_{\text{cls}}^{(i)}}{\sum\nolimits_{i \in \text{pos}} I_{\text{IoU}}\left( b_{\text{reg}}^{(i)}, \text{gt}_{\text{reg}} \right) p_{\text{cls}}^{(i)}} $$
(9)

The classification loss on negative samples is given as

$$ L_{\text{CN}} = \sum\limits_{i \in \text{neg}} L_{\text{CE}}\left( p_{\text{cls}}^{(i)}, \text{gt}_{\text{cls}}^{(i)} \right) $$
(10)

where neg denotes negative samples.
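
A sketch of Eqs. (7)–(10), assuming cls_pos and cls_neg are (num, 2) classification logits and ious holds \(I_{\text{IoU}}(b_{\text{reg}}^{(i)}, \text{gt}_{\text{reg}})\) for each positive sample. Applying the scalar weight to the logits before the cross-entropy follows Eq. (8) literally; this reading is ours, as the per-channel treatment is not spelled out in the text.

```python
import torch
import torch.nn.functional as F

def classification_loss(cls_pos, cls_neg, ious):
    """L_C = L_CP + L_CN, Eqs. (7)-(10); interpretation of the weighting is assumed."""
    labels_pos = torch.ones(cls_pos.size(0), dtype=torch.long)
    labels_neg = torch.zeros(cls_neg.size(0), dtype=torch.long)
    # Eq. (9): scale factor keeping the total weighted score comparable to
    # the unweighted one, damping the volatility of the regression IoUs.
    eta = cls_pos.sum() / (ious.unsqueeze(1) * cls_pos).sum()
    weighted = eta * ious.unsqueeze(1) * cls_pos            # Eq. (8)
    l_cp = F.cross_entropy(weighted, labels_pos, reduction="sum")
    l_cn = F.cross_entropy(cls_neg, labels_neg, reduction="sum")  # Eq. (10)
    return l_cp + l_cn                                      # Eq. (7)
```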

Finally, the total loss function over the two branches is given as

$$ L_{\text{SUM}} = L_{\text{R}} + L_{\text{C}} $$
(11)

4 Experiments

In this section, we evaluate our proposed algorithm by conducting experiments on the benchmark datasets OTB2013 [12] and OTB2015 [13]. All tracking results are produced under identical conditions to ensure a fair comparison.

4.1 Parameter Settings and Implementation Details

Parameter settings. All experiments run on an Ubuntu 18.04, Python 3.6.12 and PyTorch 1.6.0 platform with an Intel Xeon Gold 5122 CPU, a GeForce RTX 2080Ti GPU and 16 GB of memory.

The parameters of the Siamese module and the RPN module are obtained by optimizing the loss function in Eq. (11) with Stochastic Gradient Descent (SGD). We train for 50 epochs with mini-batches of 32, with the learning rate decreased from \(10^{-2}\) to \(10^{-6}\) across the epochs.
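
A sketch of this schedule, assuming the learning rate decays log-uniformly from \(10^{-2}\) to \(10^{-6}\) over the 50 epochs, as in common SiamRPN implementations; momentum and weight decay values are assumptions, and the network, data loader and loss call are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the Siamese + RPN network, to keep the sketch runnable.
model = nn.Conv2d(3, 8, 3)

EPOCHS, START_LR, END_LR = 50, 1e-2, 1e-6
optimizer = torch.optim.SGD(model.parameters(), lr=START_LR,
                            momentum=0.9, weight_decay=5e-4)  # assumed values
# Per-epoch multiplicative factor giving a log-uniform decay 1e-2 -> 1e-6.
gamma = (END_LR / START_LR) ** (1.0 / EPOCHS)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for epoch in range(EPOCHS):
    # for z, x, targets in train_loader:            # mini-batches of 32 pairs
    #     loss = compute_total_loss(model, z, x, targets)   # L_SUM, Eq. (11)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```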

Implementation details. During the offline training phase, we train our proposed Siamese-IoU end-to-end on the GOT10K [19] and YouTube-BoundingBoxes [20] datasets. During the online tracking phase, there is no online adaptation, since we formulate the online tracker as a one-shot detector.

4.2 Quantitative Analysis

Our proposed tracker is evaluated and compared with other top trackers: SiamFC [6], SiamRPN [7], Staple [21], KCF [22], CSRDCF [23] and STRCF [24]. Here, SiamFC and SiamRPN are trained offline with the above parameter settings and implementation details, and tracked online with their default hyperparameters.

Evaluation criteria. (1) Precision: the ratio of the number of frames in which the Euclidean distance between the center of the predicted bounding box and the center of the ground truth is less than a given threshold τ (set to 20 pixels) to the total number of video frames. (2) Success rate: the ratio of the number of frames whose overlap (IoU) score is greater than a given threshold (set to 0.5) to the total number of video frames.
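
A minimal sketch of both criteria, assuming per-frame predicted and ground-truth boxes as NumPy arrays in (x, y, w, h) format; function names are illustrative.

```python
import numpy as np

def precision(pred, gt, tau=20.0):
    """Fraction of frames whose center error is below tau pixels."""
    c_pred = pred[:, :2] + pred[:, 2:] / 2
    c_gt = gt[:, :2] + gt[:, 2:] / 2
    dist = np.linalg.norm(c_pred - c_gt, axis=1)
    return (dist <= tau).mean()

def success_rate(pred, gt, tau=0.5):
    """Fraction of frames whose overlap (IoU) score exceeds tau."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return ((inter / union) > tau).mean()
```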

Results on OTB2013. The OTB2013 dataset contains 50 video clips. The performance is evaluated in terms of the success plot and precision plot. The tracking results on the OTB2013 test set are reported in Fig. 3. Our tracker achieves an average precision of 88.1% and a success rate of 63.4%, and is superior to SiamRPN, SiamFC, Staple, KCF, CSRDCF and STRCF. Compared with the top tracker SiamRPN, ours increases precision and success by 3.2% and 3.7%, respectively.

Fig. 3 Success plots and precision plots on OTB2013

Results on OTB2015. The OTB2015 dataset contains 100 video clips. The tracking results of the success plot and precision plot on the OTB2015 test set are illustrated in Fig. 4. Our tracker achieves an average precision of 84.3% and a success rate of 62.0%, and is superior to SiamRPN, SiamFC, Staple, KCF, CSRDCF and STRCF. Compared with the top tracker SiamRPN, ours increases precision and success by 1.5% and 1.8%, respectively.

Fig. 4 Success plots and precision plots on OTB2015

To sum up, our proposed tracker (SiamIoU) significantly outperforms SiamRPN, SiamFC and the other trackers in precision and success rate.

4.3 Qualitative Analysis

To intuitively evaluate and demonstrate our tracker, we visualize in Fig. 5 the tracking comparison with SiamRPN and SiamFC on the following challenging clips from OTB2013: Lemming, Shaking, Singer2 and Ironman. We give a brief qualitative analysis of the tracking visualization.

Fig. 5 Comparison of the tracking results of Ours with SiamRPN and SiamFC

The challenges of Background Clutter (BC) and Illumination Variation (IV) can be seen in the Shaking, Singer2 and Lemming sequences. Our tracker shows better robustness to IV than SiamRPN and SiamFC; for instance, in the frames of the Shaking clip where the flashlight fires, our tracker still bounds the target well, thanks to the introduced IoU refinement.

5 Conclusion

In this paper, we propose a modified Siamese region proposal network based on IoU. It is trained offline end-to-end on the GOT10K and YouTube-BoundingBoxes datasets with the proposed box refinement procedure. In the inference phase, our tracker is formulated as a local one-shot detector, and it outperforms SiamRPN and other trackers on the OTB2013 and OTB2015 datasets.