
1 Introduction

Visual object tracking consists of estimating the trajectory of an object along a continuous video sequence. Usually, only the first frame is annotated with a bounding box, which provides very limited information about the object to be tracked. In real situations, the target often undergoes complex transformations which cause its appearance to significantly change over time. Until recently, the majority of trackers tackled this challenge by constantly updating a classifier throughout the video [7, 9, 13]. In fact, when combined with deep network models, this strategy still produces the most accurate results on standard benchmarks [4, 23, 26]. However, updating the classifier online presents challenges of its own. Firstly, constantly updating a large model causes a significant drop in speed. Secondly, as the update depends on previous predictions, the classifier is prone to drift and contamination [6, 19, 31].

Lately, however, siamese networks have shown that compelling results can be achieved without updating the model [1, 10, 12]. Siamese trackers are trained on a large set of image pairs to learn a robust matching function that is able to re-identify the object even when its appearance changes significantly. Nonetheless, although they are usually fast, there is still a gap in accuracy when compared to the top-performing trackers.

In this paper, we show that this gap can be significantly decreased by collecting features containing different context and semantic levels from a deep network. Unlike traditional multi-layer tracking approaches which only exploit the different semantic levels [24, 28, 30], we extract features with multiple context levels by applying a crop on the feature maps, which we refer to as multi-context features. In the scope of this work, context refers to the amount of background that is included with the object. Figure 1b shows an example of an object with different context levels. Since the receptive field is different at each layer, cropping the maps allows each feature to collect information from different context sizes. We hypothesize, and show through experiments, that multi-context features are particularly suitable for siamese networks.

In the siamese formulation proposed in SiamFC [1], two images, an exemplar z and an instance x, are forwarded through two identical networks with shared weights, yielding features \(\varphi (z)\) and \(\varphi (x)\) respectively. When matching the features, \(\varphi (z)\) can be interpreted as a filter to be applied over \(\varphi (x)\) to produce a prediction. If we use multi-layer features, it is possible to obtain multiple filters \(\varphi _l(z)\) and \(\varphi _l(x)\). However, standard multi-layer features can only provide different global representations of the same image. On the other hand, as explained in Sect. 3.2, by considering multi-context features, filters from different layers can be more diverse, focusing on different regions of the image. As discussed in previous works [1, 10, 29], the amount of context can play a significant role in tracking performance, and our proposed tracker, SiamMCF, leverages it at multiple levels in a single pass.

The contributions of this paper are two-fold: (i) we propose a novel extension to the siamese formulation which leverages multiple context and semantic levels in a single forward pass, and (ii) we demonstrate that multi-context features provide a significant increase in performance when compared to standard multi-layer ones.

2 Related Work

2.1 Siamese Tracking

SINT [27] is one of the earliest siamese trackers to present truly compelling results. It consists of a siamese network trained for matching image patches. During tracking, a patch of the first frame is matched to patches collected around the previous position. Although its results are still among the best of the siamese trackers, it is much slower than other approaches. GOTURN [12], on the other hand, is able to track at 100 fps. It works by extracting deep features from two crops: one from the object and another from the area centered on the previous position. These features are concatenated and used to solve a regression problem that estimates the motion of the target relative to the previous frame. This high speed comes at a cost, however, as its results are not as accurate as those of other siamese approaches.

SiamFC [1] is one of the most balanced options, as it presents one of the best compromises between accuracy and speed. The SiamFC tracker employs a pair of AlexNets [18] with shared weights. A smaller exemplar image and a larger instance are forwarded to generate high-level features. By correlating the exemplar feature over every instance position, a spatial prediction map is obtained.

Several improvements [8, 10, 14, 29] were proposed over the initial SiamFC tracker. CFNet [29] proposes to include a trainable correlation filter layer on top of the siamese network. By introducing a differentiable solution to the deep correlation filter in the Fourier domain, the tracker can be efficiently trained end-to-end with gradient descent. DSiam [8] tackles the model updating problem in siamese networks. Two transformation terms are independently applied to both branches before the matching. The first term updates the model by encouraging it to be similar to the previous observation. The second one is used to suppress background activations in the current frame. EAST [14] proposes to speed up SiamFC by avoiding forwarding the images until the last layer. For this, reinforcement learning is applied to train a classifier that decides at which layer forwarding can be stopped while still retaining a discriminative representation of the given image. SA-Siam [10] leverages appearance and semantic features for tracking. This is done by using two networks, a SiamFC and an AlexNet trained for classification. The authors show that the features obtained from each network are complementary and that better results are obtained by combining their predictions.

2.2 Tracking with Multi-layer Representations

Multi-layer features have been applied to object tracking in different ways. Wang et al. [30] showed that different layers effectively produce complementary features for tracking. By leveraging information collected from different layers, tracking results were improved. Chi et al. [3] also exploited this property to obtain predictions from multiple layers. Some other methods [22, 28] have used deeper layers to first roughly estimate the target position and then project it to shallower layers. The rationale is that early layers provide finer-grained features, which can improve the detection accuracy. HDT [24] applies an adaptive hedge method to assign different confidence levels to each layer based on its previous results. C-COT [7] employs an implicit interpolation model to cast the feature maps into a continuous space. In this way, features from different layers with different sizes can be merged to train a correlation filter.

All of the previous approaches still use the whole feature maps for the predictions. Therefore, complementarity between layers is somewhat restricted, as all layers are global representations of the image and do not fully exploit information related to more localized patches. Some recent works [5, 21] have demonstrated that, by suppressing or masking the feature maps, more robust representations can be obtained. In this work, we propose to combine multi-layer features with spatially constrained maps to obtain multi-context features. SA-Siam [10] has exploited this property to some extent by concatenating features from two layers and then cropping. However, this was applied to consecutive layers of an AlexNet [18], which are neither very deep nor far apart from each other. SINT [27] adopts a strategy to extract multi-layer features which is similar to ours. However, there are some important differences to be pointed out. Firstly, we apply cropping on the exemplar branch, in which the object is known, whereas SINT uses ROI pooling on the instance branch, in which the real object position is uncertain. Secondly, ROI pooling in SINT performs a rescaling (into a \(7\times 7\) region) with a max-pooling directly in the lower-resolution feature space, while we rescale the input image before forwarding and always crop a region of the same size, which is less prone to the negative effects of discretization. We show through experiments that our approach obtains significantly better results than previous siamese approaches, including SINT and SiamFC-R [15], which also uses a very deep network as a backbone.

3 Our Approach

3.1 Siamese Tracker

In the standard SiamFC [1] formulation, two images are provided as inputs: the exemplar from the first frame z and the current tracking frame, the instance x. Let the prime symbol represent a crop operation over an image. The siamese network receives the cropped regions \(z'\) and \(x'\) which are then forwarded to produce the features \(\varphi (z')\) and \(\varphi (x')\) respectively. The feature \(\varphi (z')\) is then used as a correlation filter over \(\varphi (x')\), thus yielding a prediction map

$$\begin{aligned} g(x', z') = \varphi (x') \star \varphi (z'). \end{aligned}$$
(1)
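As a concrete illustration of Eq. (1), the following minimal NumPy sketch slides the exemplar feature over the instance feature and records the inner product at each position; the feature shapes are purely illustrative.

```python
import numpy as np

def xcorr(inst_feat, exemp_feat):
    """Cross-correlate an exemplar feature map over an instance feature map.

    inst_feat:  (H, W, C) instance features  phi(x')
    exemp_feat: (h, w, C) exemplar features  phi(z'), with h <= H and w <= W
    Returns an (H - h + 1, W - w + 1) prediction map g(x', z').
    """
    H, W, _ = inst_feat.shape
    h, w, _ = exemp_feat.shape
    out = np.empty((H - h + 1, W - w + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = inst_feat[i:i + h, j:j + w, :]
            out[i, j] = np.sum(window * exemp_feat)  # inner product = correlation score
    return out

# Illustrative shapes only.
g = xcorr(np.random.randn(31, 31, 64).astype(np.float32),
          np.random.randn(7, 7, 64).astype(np.float32))
print(g.shape)  # (25, 25)
```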

3.2 Siamese Tracker with Multi-context Features

For our SiamMCF tracker, we adapt the prediction map function to work at different layers and extract features with different contexts from each layer. Figure 1 illustrates our proposed approach. The context amount is controlled by cropping the feature map. Since the receptive field at each layer is different, as long as the crop sizes in different layers are not proportional to the receptive field changes, we are able to extract features that consider different areas of the input image. In particular, we can extract features with different contexts by cropping regions of the same size from all the layers. Figure 1b shows the effective region corresponding to crops at different layers.

Fig. 1. Illustration of our tracking framework. (a) Proposed network with multi-context features. (b) Receptive fields of different layers superposed over the image. Deeper layers encode larger contexts. Best viewed in colors.

Given a set of selected layers \(L = \{l\}\), prediction maps are estimated as

$$\begin{aligned} g_l(x', z') = \mathbbm {1} \gamma _l \odot (\varphi _l(x') \star \varphi _l'(z')) + \mathbbm {1} \beta _l, \end{aligned}$$
(2)

where \(\varphi _l(\cdot )\) represents the feature obtained by forwarding until layer l, and \(\odot \) indicates element-wise multiplication. We also learn normalization parameters \(\gamma _l\) and \(\beta _l\) to stabilize the magnitude of the predictions. Notice that we use the cropped filter \(\varphi _l'(z')\) to collect exemplars with different context sizes.
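To make the per-layer computation concrete, here is a hedged NumPy sketch of Eq. (2): the exemplar features are center-cropped to obtain the multi-context filter \(\varphi _l'(z')\), cross-correlated with the instance features, and then scaled by the learned normalization parameters. The \(7\times 7\) crop matches Sect. 4.2, but the remaining shapes and the explicit loops are only for illustration.

```python
import numpy as np

def center_crop(feat, size):
    """Crop a size x size window from the spatial center of an (H, W, C) map."""
    top = (feat.shape[0] - size) // 2
    left = (feat.shape[1] - size) // 2
    return feat[top:top + size, left:left + size, :]

def layer_prediction(inst_feat_l, exemp_feat_l, gamma_l, beta_l, crop_size=7):
    """Eq. (2): g_l(x', z') = gamma_l * (phi_l(x') cross-corr phi'_l(z')) + beta_l."""
    filt = center_crop(exemp_feat_l, crop_size)       # multi-context filter phi'_l(z')
    H, W, _ = inst_feat_l.shape
    h, w, _ = filt.shape
    score = np.empty((H - h + 1, W - w + 1), dtype=np.float32)
    for i in range(score.shape[0]):
        for j in range(score.shape[1]):
            score[i, j] = np.sum(inst_feat_l[i:i + h, j:j + w, :] * filt)
    return gamma_l * score + beta_l                   # learned per-layer scale and offset
```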

Since the backbone network in SiamFC is based on AlexNet [18], which is relatively shallow, extracting multi-level features does not provide very different representations. Therefore, we replace the backbone with a deeper network. In particular, we conduct experiments with a ResNet-50 [11]. The original ResNet, however, has a large output stride of 32, which is not ideal for the siamese formulation as both images are largely reduced. Therefore, we reduce the output stride to 8 by setting the convolution stride to 1 in blocks 2 and 3 of the ResNet, and by applying dilated convolutions [2].
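Our implementation uses TensorFlow (Sect. 4.2), but the backbone modification can be illustrated with an analogous torchvision sketch; whether torchvision's dilation placement exactly matches ours is an assumption.

```python
import torch
import torchvision

# ResNet-50 with the strides of the last two residual blocks replaced by
# dilated convolutions, reducing the output stride from 32 to 8.
backbone = torchvision.models.resnet50(
    replace_stride_with_dilation=[False, True, True])
# In practice, initialize from ImageNet-pretrained weights (Sect. 4.2).

x = torch.randn(1, 3, 248, 248)    # the 248 x 248 input size from Sect. 4.2
stem_and_blocks = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4)
print(stem_and_blocks(x).shape)    # torch.Size([1, 2048, 31, 31]); 248 / 8 = 31
```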

It is important to mention that the original SiamFC is based on the fully-convolutional formulation [1]. This formulation ensures that the output features generated by the network commute with translation. Therefore, if the exemplar image is a crop of a region of the instance, then the exemplar output features will also correspond to a region of instance features. In other words, the exemplar image can be found in the instance simply by looking for the region with the maximum similarity. One important caveat is that this formulation can only hold as long as the employed network does not use padding operations, which severely restricts the choice of available architectures.

A ResNet, however, is very deep and requires padding. In fact, the receptive field in the last layer is usually larger than the input image, which generates an asymmetry when processing images of different sizes (e.g. 127 and 255 for the exemplar and instance branches) and, in turn, breaks the fully-convolutional formulation. We hypothesize, and show through experiments, that the use of multi-context features alleviates this issue, by using images of the same size and by extracting cropped intermediate features which: (i) are comparable due to the same-size inputs, and (ii) also include features whose receptive fields are smaller than the input (earlier layers).

We further modify the network by adding residual adaptation modules on top of each of the |L| base layers from the backbone network. A residual adaptation module consists of an additional bottleneck residual unit [11] followed by a convolution. The residual unit has the same properties (number of channels, dilation rate, etc.) as the base ResNet layer it is connected to. The role of this module is to provide more capacity for the extracted features to adapt to the siamese matching at each layer and also to decrease the dimensionality for faster cross-correlation. We show experimentally that the addition of residual units for adaptation positively affects the results.
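A possible PyTorch sketch of one residual adaptation module is shown below: a bottleneck residual unit followed by a \(1\times 1\) convolution that reduces the dimensionality to 64 channels (the feature size reported in Sect. 4.2). The intermediate channel counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAdaptation(nn.Module):
    """One adaptation module: a bottleneck residual unit followed by a 1x1 conv."""

    def __init__(self, in_ch, mid_ch, out_ch=64, dilation=1):
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, in_ch, 1, bias=False),
            nn.BatchNorm2d(in_ch))
        # 1x1 conv reduces dimensionality for faster cross-correlation.
        self.reduce = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        x = F.relu(x + self.bottleneck(x))   # residual connection
        return self.reduce(x)

# e.g. on top of the last ResNet block (2048 channels; channel counts assumed):
adapt4 = ResidualAdaptation(in_ch=2048, mid_ch=512)
print(adapt4(torch.randn(1, 2048, 31, 31)).shape)  # torch.Size([1, 64, 31, 31])
```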

Final predictions are obtained by computing the average map:

$$\begin{aligned} g(x', z') = \frac{1}{|L|} \sum _{l \in L}{g_l(x', z')}. \end{aligned}$$
(3)
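In code, the fusion of Eq. (3) is simply the arithmetic mean of the per-layer maps produced by Eq. (2):

```python
def fuse(per_layer_maps):
    """Eq. (3): average the |L| per-layer prediction maps g_l(x', z')."""
    return sum(per_layer_maps) / len(per_layer_maps)
```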

3.3 Training

We compute an individual loss for each layer prediction \(g_l(x', z')\). Let i indicate the index of the element (pixel) in a map. Then the loss of each prediction is the weighted average of the logistic losses \(\ell _l\):

$$\begin{aligned} \mathcal {L}_l = \sum _i w(y_i) \ell _l(g_l(x_i', z_i'; \theta ), y_i), \end{aligned}$$
(4)

where \(w(y_i)\) is a weighting function applied on the labels \(y_i\) that compensates for the imbalance between positive and negative samples. This weighting function is defined as:

$$\begin{aligned} w(y_i) = \frac{0.5 y_i}{n_{\text {pos}}} + \frac{0.5 (1 - y_i)}{n_{\text {neg}}}, \end{aligned}$$
(5)

where \(n_{\text {pos}}\) and \(n_{\text {neg}}\) are the number of positive and negative samples respectively.
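A small NumPy sketch of the class-balanced weighting of Eq. (5) and the per-layer loss of Eq. (4) is given below; the mapping of the {0, 1} labels to {-1, +1} inside the logistic loss is an assumption.

```python
import numpy as np

def balanced_weights(y):
    """Eq. (5): weight positives and negatives so that each class contributes 0.5."""
    y = y.astype(np.float32)                      # labels in {0, 1}
    n_pos, n_neg = y.sum(), (1.0 - y).sum()
    return 0.5 * y / n_pos + 0.5 * (1.0 - y) / n_neg

def layer_loss(scores, y):
    """Eq. (4): weighted logistic loss over one (flattened) prediction map g_l."""
    sign = 2.0 * y - 1.0                          # map {0, 1} labels to {-1, +1}
    per_pixel = np.log1p(np.exp(-sign * scores))  # logistic loss per element
    return np.sum(balanced_weights(y) * per_pixel)
```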

The network is then trained with gradient descent to find the set of parameters \(\theta \) that minimizes the global loss:

$$\begin{aligned} \theta ^* = \mathop {\mathrm{arg\,min}}\limits _{\theta } \sum _{l \in L} \mathcal {L}_l(g_l(x_i', z_i'; \theta ), y_i) + \lambda \Vert \theta \Vert _2^2. \end{aligned}$$
(6)

4 Experimental Results

4.1 Datasets and Evaluation Protocols

We evaluate our tracker on two widely-adopted public datasets: the visual object tracking (VOT) benchmark and the online tracking benchmark (OTB).

VOT. Both VOT16 and VOT17 [15,16,17] are composed of 60 sequences annotated with rotated bounding boxes. The standard evaluation criterion is focused on short-term tracking, where trackers are reinitialized whenever their overlap (IoU) with the groundtruth drops to zero. Trackers are ranked mainly according to three measures: Expected Average Overlap (EAO), Accuracy, and Robustness. The benchmark also provides a normalized speed value (EFO), which can be used to compare tracker speeds while disregarding, to some extent, the influence of the hardware (we refer to [16] for more details about the metrics).

OTB. We use two versions of the OTB dataset: OTB13 [32] and OTB15 [33]. The former contains 51 objects to be tracked, while the latter is a superset of OTB13 with 100 objects. The trackers are evaluated by two measures: precision and success. Precision measures the distance between the center of the predicted bounding box and the center of the groundtruth box, while success measures the Intersection-over-Union (IoU) of the predicted boxes. We use OTB13 for our ablation experiments, while OTB15 is kept for comparing with state-of-the-art trackers.
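For reference, a hedged NumPy sketch of the two OTB measures follows; boxes are assumed to be in (x, y, w, h) format, and the 20-pixel and 0.5 thresholds correspond to single points of the precision and success curves.

```python
import numpy as np

def centers(boxes):
    """Box centers for (N, 4) boxes given as (x, y, w, h)."""
    return boxes[:, :2] + boxes[:, 2:] / 2.0

def precision(pred, gt, thresh=20.0):
    """Fraction of frames whose center error is below `thresh` pixels."""
    dist = np.linalg.norm(centers(pred) - centers(gt), axis=1)
    return np.mean(dist <= thresh)

def iou(pred, gt):
    """Per-frame IoU for (N, 4) boxes in (x, y, w, h) format."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def success(pred, gt, thresh=0.5):
    """Fraction of frames with IoU above `thresh` (one point of the success curve)."""
    return np.mean(iou(pred, gt) >= thresh)
```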

4.2 Implementation Details

Network. Our backbone network is a ResNet [11] with 50 layers. We initialize its weights from a model trained on ImageNet [25] classification. As mentioned before, we decrease the network output stride from 32 to 8. In order to keep the input size compatible with the stride, we resize the input images to \(248 \times 248\) pixels. In our formulation, both the exemplar and the instance images are of the same size and they include a large context, which is obtained by cropping an area 16 times larger than the object. The output features generated by the network have dimensions \(31 \times 31 \times 64\). For the multi-context features, we crop the central \(7 \times 7\) region from each of the feature maps \(\varphi _l(z')\). Our set of chosen layers L is composed of the outputs of each of the 4 residual blocks of the ResNet.

Training. During training, the weights of the ResNet are frozen and only the residual adaptation modules are trained. We briefly experimented with training the ResNet layers as well, but did not observe any noticeable improvement. The training follows the same protocol as SiamFC [1], learning to match pairs of images collected from the ImageNet VID challenge. This dataset contains around 4000 sequences divided into 30 categories, which accounts for more than one million frames. One important point to notice is that, since ResNets use padded convolutions, the training targets must not always be centered, as is done in SiamFC. Otherwise, the network learns a positional bias. Therefore, we augment the training set with random cropping, as well as color distortion, horizontal flipping, and small resizing perturbations. The weights are optimized using gradient descent with a momentum term of 0.9. The learning rate is decayed exponentially from \(10^{-3}\) to \(10^{-6}\). The network is trained for 50000 iterations with a mini-batch size of 8 pairs of images.
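A minimal sketch of the random-cropping augmentation mentioned above is given below: the search region is cropped around a randomly shifted center so that the target is not always placed in the middle, which prevents the positional bias. The offset range, crop interface, and border handling are assumptions.

```python
import numpy as np

def random_shift_crop(image, target_center, crop_size, max_shift=32):
    """Crop a search region whose center is randomly offset from the target center.

    Keeping the target off-center (by up to `max_shift` pixels) prevents the
    network from learning a positional bias when padded convolutions are used.
    """
    offset = np.random.randint(-max_shift, max_shift + 1, size=2)
    cy, cx = np.asarray(target_center) + offset
    half = crop_size // 2
    # Naive crop: a real implementation would pad when the window leaves the image
    # and shift the training label map by the same offset.
    return image[cy - half:cy + half, cx - half:cx + half]
```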

Testing. Tracking is conducted in the same manner as in SiamFC [1]. The matching is performed independently at each frame, and spatial consistency between frames is enforced by applying a Hann window over the prediction map. In order to obtain more precise predictions, we upsample the correlation output by a factor of 8 using bicubic interpolation. We handle scale changes by forwarding three images at different scales. For a fair comparison, all hyperparameters are kept the same as in SiamFC.
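The test-time post-processing can be sketched as follows: upsample the raw correlation map, blend it with a Hann window, and take the maximum; in practice this is repeated for the three scaled search regions and the best-scoring scale is kept. The window-influence value is illustrative, and scipy's cubic spline zoom stands in for bicubic interpolation.

```python
import numpy as np
from scipy.ndimage import zoom

def postprocess(score_map, window_influence=0.176, upsample=8):
    """Upsample a raw correlation map and apply a Hann window penalty.

    Returns the (row, col) of the best response in the upsampled map.
    """
    up = zoom(score_map, upsample, order=3)        # cubic upsampling (stand-in for bicubic)
    hann = np.outer(np.hanning(up.shape[0]), np.hanning(up.shape[1]))
    blended = (1 - window_influence) * up + window_influence * hann
    return np.unravel_index(np.argmax(blended), blended.shape)
```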

We implement our tracker using Python and Tensorflow 1.4. The experiments were conducted on a machine with an Intel Xeon E5 CPU and a GeForce GTX 1080Ti GPU. The average tracking speed during the experiments is around 20 frames per second. The code will be made available on http://github.com/hmorimitsu/siam-mcf.

4.3 Ablation Study

We verify the contribution of each of our design choices by evaluating the results of different configurations on the OTB13 dataset. Our main interest was to verify the impact of (i) replacing the AlexNet in SiamFC with a ResNet, (ii) using different layers of the ResNet for the matching, (iii) using large-context inputs with late feature cropping, and (iv) including residual adaptation modules. For the third test, when the large context and cropping are not used, we input a \(120 \times 120\) pixel exemplar image with a reduced context, corresponding to an area four times larger than the object, which is the same setting used in SiamFC. For the fourth test, if residual adaptation is not used, we add and train only a single convolutional layer on top of the ResNet outputs. Table 1 summarizes our results.

Table 1. Ablation results on the OTB13 dataset. L1–L4 indicate which levels' features are used for matching.

The last row corresponds to the result obtained by the baseline SiamFC. The results show that simply replacing the backbone with a ResNet actually generates worse results. This can be explained by the violation of the fully-convolutional formulation discussed in Sect. 3.2. Even when considering multi-layer features, the performance is only on par with the baseline. However, as illustrated by the results in the bottom part of the table, multi-context features obtained by feature cropping from multiple layers produce noticeably better results. In fact, even when the cropping is applied to individual layers, the results are already better than the baseline, and combining multiple layers yields significantly better performance. It is interesting to remark that, although using L4 by itself usually leads to worse results, removing it from the multi-context set actually generates slightly worse results. One reason is that, in sequences such as Ironman, MotorRolling, and Skating1, L4 is actually better than the other layers. We observe a similar behavior when comparing L123 with L1234, showing that the L4 predictions are beneficial to the model. Lastly, we observe that dropping the residual adaptation degrades the results, demonstrating its contribution.

We select the model with multi-context features and residual adaptation modules, which generated the best results, as our SiamMCF for the comparisons against state-of-the-art methods.

4.4 Comparison with the State-of-the-Art

We validate the performance of our tracker by comparing its results with some state-of-the-art trackers. We selected some of the currently best performing trackers in general, as well as other recent siamese proposals. We evaluate our results on three datasets: VOT16, VOT17, and OTB15.

VOT16. We compare our results using SiamMCF on VOT 2016 with the best contenders in the competition (C-COT, TCNN, SSAT, MLDF). We also include the results of other trackers, including SA-Siam [10] and SiamRPN [20], two recent siamese trackers, and SiamFC-R [15], a SiamFC modified to use ResNet as a backbone. The results are summarized in Table 2.

Table 2. Results on the VOT16 dataset. The arrows indicate whether higher or lower values are better.

We can see that SiamMCF outperforms all compared trackers, including the best tracker in the competition, C-COT, and the recent siamese methods SiamRPN and SA-Siam. On the other hand, we still cannot obtain better robustness than the methods using online updating, although we outperform all siamese entries. By analyzing the ranking results in Fig. 2, we see that occlusion is the main reason for the drop in performance. This result is understandable, as in such situations the tracker tends to produce higher activations in the surrounding area than in the occluded region, thus causing drift. It is important to notice that the contributions of SiamMCF are orthogonal to siamese updating strategies such as the one proposed by DSiam [8]. Therefore, it is possible that even better performance could be obtained by applying such updating strategies to our tracker.

Fig. 2. EAO ranking on VOT16 according to sequence attributes. Each row corresponds to an attribute. The horizontal axis shows the EAO for the corresponding attribute. Our SiamMCF obtains the best results most of the time.

We also compare our tracker in the unsupervised setting of the VOT benchmark. Unlike in the standard setting, trackers are not reinitialized after they drift away from the target. This evaluation focuses on longer-term tracking, as it more heavily penalizes trackers that are unable to recover from a temporary target loss. Figure 3 shows the precision and success plots of the One-Pass Evaluation (OPE) on the VOT16 dataset. We can see that SiamMCF also achieves state-of-the-art results in this test, being very close to the best method, SSAT.

Fig. 3. OPE results on the VOT16 dataset.

VOT17. As in the previous benchmark, we select the top trackers from the competition (LSART, CFWCR, CFCF, ECO) and siamese trackers (SiamDCF, SA-Siam, SiamFC). From the results in Table 3 we see that our tracker also performs very favorably on VOT17, being second only to LSART while retaining the highest accuracy. Once again we outperform all other siamese trackers.

Table 3. Results on the VOT17 dataset. The arrows indicate whether higher or lower values are better.

The unsupervised results shown in Fig. 4 are also encouraging. In this dataset, SiamMCF actually outperforms all other trackers when considering the Intersection over Union metric (success plots), while being close to the best method in terms of the center distance of the predictions (precision plots).

Fig. 4. OPE results on the VOT17 dataset.

OTB15. We further verify the results of SiamMCF on OTB15 (Fig. 5). We show comparative results against state-of-the-art trackers that use multi-layer features (ECO [4], C-COT [7], HDT [24]) and siamese networks (SINT+ [27], SiamFC [1], CFNet [29]). Once again we outperform the other siamese proposals while approaching the remaining state-of-the-art methods, which rely on online updating. It is worth noting that ECO adopts different hyperparameters for the OTB and VOT datasets, whereas we keep ours fixed for all evaluations. In particular, we kept the SiamFC parameters for a fair comparison, so a further improvement could possibly be obtained with a careful hyperparameter search.

Fig. 5. Results on the OTB15 dataset.

We also show the results on some more specific attributes in Fig. 6. Similarly to what was observed on VOT, videos containing occluded or out-of-view objects are responsible for the largest differences in performance. On the other hand, our tracker performs remarkably well on low-resolution videos. This seems to be a feature of trackers based on SiamFC, as both SiamFC itself and CFNet also perform relatively better in this type of sequence. Some qualitative results are displayed in Fig. 7.

Fig. 6. OTB15 success plots for different attributes.

Fig. 7. Qualitative results on sequences from OTB15 using the selected trackers.

We can see that SiamMCF is quite robust to diverse challenging situations, including lighting changes, rotation, and scale changes. From the results of the fourth sequence, we see that the use of deeper networks provides additional robustness to rotation, as both our method and HDT show good results. Nonetheless, we observe that our proposal is still more robust than HDT overall, correctly tracking the target in sequences 2 and 3. The qualitative results also confirm that our approach is more robust than the SiamFC baseline, as it works correctly in many sequences where SiamFC loses the target. We also verify that we are able to better handle some sequences in which the top performer ECO has difficulties.

The last two sequences present some failure cases for our tracker. When the target appearance changes significantly, or when fast motion and blur occur, we are sometimes still unable to keep tracking the target. Occlusion also presents difficulties, which may cause the tracker to drift away from the target.

5 Summary

This paper proposed an extension of SiamFC that exploits multi-context features for visual object tracking, which are obtained by cropping the feature maps of different layers. In this way, each layer not only contributes a different semantic level, but also focuses on regions of different sizes of the input image. We showed that by incorporating these features into a deep siamese network tracker we are able to obtain outstanding results in short-term tracking, outperforming almost all other methods on the newest VOT benchmarks. We also outperform the state-of-the-art siamese trackers on OTB while getting close to the most accurate methods. Even with the use of multi-context features and deep networks, our tracker remains faster than many of the top-performing methods, running at almost real-time speed.