
1 Introduction

A decade ago, the Visual Object Tracking (VOT) initiative was founded in response to the lack of standardised performance evaluation in visual object tracking. To facilitate the development of this highly active computer vision field, the first VOT2013 challenge [13] was organized in conjunction with ICCV2013. Encouraged by the strong interest of the emerging community, eight VOT challenges have been organized since, with the results presented at the accompanying workshops at major computer vision conferences: ECCV2014 (VOT2014 [14]), ICCV2015 (VOT2015 [12]), ECCV2016 (VOT2016 [10]), ICCV2017 (VOT2017 [9]), ECCV2018 (VOT2018 [8]), ICCV2019 (VOT2019 [6]), ECCV2020 (VOT2020 [7]), ICCV2021 (VOT2021 [11]). The VOT challenge is now the main annual tracking performance evaluation event in computer vision.

The primary mission of the VOT initiative has been the promotion of the development of general trackers for single-camera, single-target, model-free, causal tracking. For nearly a decade the VOT has thus been a community-driven forum for gradual development and in-situ testing of performance evaluation protocols, dataset development and exploration of the tracking challenges landscape. The VOT2013 [13] started with a single short-term tracking challenge, VOT-ST. In VOT2014 [14] the VOT-TIR challenge was added to explore tracking in thermal imagery. In VOT2017 [9] the real-time tracking challenge VOT-RT was established to promote tracking speed and computational efficiency in parallel with robustness. The long-term tracking challenge VOT-LT was introduced in VOT2018 [8], and a year later, in VOT2019 [6], the multi-modal (RGB+thermal and RGB+depth) tracking challenges VOT-RGBT and VOT-RGBD were added.

Particular attention has been put on the development of informative performance evaluation measures. Two basic, weakly correlated performance measures were introduced in VOT2013 [13] to evaluate the tracking accuracy and robustness of short-term trackers. A ranking-based methodology to identify the top performers was also proposed but was abandoned in VOT2015 [12] in favor of a more principled and interpretable combination of the primary scores in the form of the expected average overlap (EAO) score. For the first seven VOT challenges, the measures were calculated under a reset-based protocol, in which a tracker is reset upon drifting off the target. This protocol was replaced in VOT2020 [7] by the anchor-based evaluation protocol, which produces more stable performance evaluation results than related protocols, yet inherits the benefits of the reset-based protocol. Similarly, a performance evaluation protocol and measures tailored for long-term tracking have been developed [16] and were first applied in VOT2018 [8]. These measures have consistently shown good evaluation capabilities for long-term trackers.

Several datasets have been developed over the years. A dataset creation and maintenance protocol has been established for the main short-term tracking challenge to produce datasets which are sufficiently small for practical evaluation yet include a variety of challenging tracking situations for in-depth analysis. In VOT2017 [9], a sequestered dataset for identification of the short-term tracking challenge winner was introduced. This dataset has been refreshed along with the public versions over the years. In parallel, datasets specialized for long-term, RGB+thermal and RGB+depth tracking were constructed and gradually updated.

In most of the VOT challenges, the trackers are required to report the target position as an axis-aligned bounding box. While this is a reasonable target state encoding, the VOT short-term tracking challenge gradually explored more detailed pose encodings to raise the bar on tracking accuracy and expand the range of applications. Thus rotated bounding boxes were introduced in VOT2014 [14]. To reduce human annotation bias, VOT2016 [10] introduced fitting rotated bounding boxes to semi-automatically segmented objects in each frame. In VOT2020 [7] bounding boxes were abandoned and short-term trackers were required to provide a full target segmentation (the VOT-ST dataset was accordingly re-annotated to ensure high ground truth accuracy) – with this move, the VOT short-term tracking challenge started narrowing the gap between visual object tracking and the related field of video object segmentation. The remaining challenges (VOT-LT, VOT-RGBD, VOT-RGBT) maintain axis-aligned target annotation.

This paper presents the tenth edition of the VOT challenges – the VOT2022 challenge. After two years of virtual editions due to the global COVID-19 pandemic, the 10th anniversary edition of VOT was organized in a hybrid form with in-person and online attendance, in conjunction with the ECCV2022 Visual Object Tracking VOT2022 Workshop. In the following, we overview the challenge and participation requirements.

1.1 The VOT2022 Challenge

The evaluation toolkit and the datasets are provided by the VOT2022 organizers. The challenges opened in the first week of April and closed on May 3rd. The winners of individual challenges were identified in late June, but not publicly disclosed. The results were presented at the ECCV2022 VOT2022 workshop on 24th October. The VOT2022 challenge contained seven challenges:

  1. VOT-STs2022 challenge addressed short-term tracking by target segmentation in RGB images.

  2. VOT-STb2022 challenge addressed short-term tracking by bounding boxes in RGB images.

  3. VOT-RTs2022 challenge addressed the same class of trackers as VOT-STs2022, except that the trackers had to process the sequences in real-time.

  4. VOT-RTb2022 challenge addressed the same class of trackers as VOT-STb2022, except that the trackers had to process the sequences in real-time.

  5. VOT-LT2022 challenge addressed long-term tracking by bounding boxes in RGB images.

  6. VOT-RGBD2022 challenge addressed short-term tracking by bounding boxes in RGB+depth (RGBD) imagery.

  7. VOT-D2022 challenge addressed short-term tracking by bounding boxes in depth map images.

The authors participating in the challenge were required to integrate their tracker into the VOT2022 evaluation kit, which automatically performed a set of standardized experiments. The results were analyzed according to the VOT2022 evaluation methodology.

Participants were encouraged to submit their own new or previously published trackers as well as modified versions of third-party trackers. In the latter case, modifications had to be significant enough for acceptance. Participants were expected to submit a single set of results per tracker. If a participant coauthored several submissions with a similar design, only the top performer from this cluster was considered to compete in the final top-performer ranking and winner identification.

Each submission was accompanied by a short abstract describing the tracker, which was used for the short tracker descriptions in Appendix [5] – the authors were asked to provide a clear description useful to the readers of the VOT2022 results report. In addition, participants filled out a questionnaire on the VOT submission page to categorize their tracker according to various design properties. Authors were encouraged to submit their tracker integrated into a Singularity container provided by VOT, which allows result reproduction and aids potential further evaluation. The participants with sufficiently well-performing submissions who contributed to the text for this paper and agreed to make their tracker code publicly available from the VOT page (or upon request) were offered co-authorship of this results paper. The committee reserved the right to disqualify any tracker that, by their judgement, attempted to cheat the evaluation protocols or failed in the post-hoc evaluation.

Methods considered for prizes in the VOT2022 challenge were not allowed to be trained on certain datasets (OTB, VOT, ALOV, UAV123, NUSPRO, TempleColor and RGBT234), except for VOT-LT2022, where the VOT-LT2021 dataset was allowed. For GOT10k, a list of 1k prohibited sequences was created in VOT2019, while the remaining 9k+ sequences were allowed for learning. The reason was that part of the GOT10k was used in the VOT-ST2022 dataset.

The use of class labels specific to VOT was not allowed (i.e., identifying a target class in each sequence and applying pre-trained class-specific trackers was not allowed). The organizers of VOT2022 were allowed to participate in the challenge but were not eligible to win. Further details are available from the challenge homepage.

VOT2022 goes beyond previous challenges by updating the datasets in VOT-ST2022 and VOT-RT2022, introducing a training dataset as well as a sequestered dataset in the VOT-RGBD2022 challenge, introducing a depth-only tracking challenge VOT-D2022 and a new challenging VOT-LT2022 tracking dataset. The Python VOT evaluation toolkit was updated as well.

The remainder of this report is structured as follows. Section 2 describes the performance evaluation protocols, Sect. 3 describes the individual challenges, Sect. 4 overviews the results and conclusions are drawn in Sect. 5. Short descriptions of the tested trackers are available in Appendix [5].

2 Performance Evaluation Protocol

Since VOT2018, the VOT challenges adopt the following definitions from [16] to distinguish between short-term and long-term trackers:

  • Short-term tracker (\(\textrm{ST}_0\)). The target position is reported at each frame. The tracker does not implement target re-detection and does not explicitly detect occlusion.

  • Short-term tracker with conservative updating (\(\textrm{ST}_1\)). The target position is reported at each frame. Target re-detection is not implemented, but tracking robustness is increased by selectively updating the visual model depending on a tracking confidence estimation mechanism.

  • Pseudo long-term tracker (\(\textrm{LT}_0\)). The target position is not reported in frames when the target is predicted not visible. The tracker does not implement explicit target re-detection but uses an internal mechanism to identify and report tracking failure.

  • Re-detecting long-term tracker (\(\textrm{LT}_1\)). The target position is not reported in frames when the target is predicted not visible. The tracker detects tracking failure and implements explicit target re-detection.

Since the two classes of trackers make distinct assumptions on target presence, separate performance measures and evaluation protocols were designed in VOT to probe the tracking properties.

2.1 The Short-Term Evaluation Protocols

The short-term performance evaluation protocol entails initializing the tracker at several frames in the sequence, called the anchor points, which are spaced approximately 50 frames apart. The tracker is run from each anchor: forward for anchors in the first half of the sequence, and backward, toward the first frame, for anchors in the second half. Performance is evaluated by two basic measures: accuracy (A) and robustness (R).

Accuracy is the average overlap over the frames before tracking failure, averaged over all sub-sequences. Robustness is the percentage of successfully tracked sub-sequence frames, averaged over all sub-sequences. Tracking failure is defined as the frame at which the overlap between the ground truth and the predicted target position drops below 0.1 and does not increase above this value during the next 10 frames. This definition allows short-term failure recovery in short-term trackers. The primary performance measure is the expected average overlap (EAO), which is a principled combination of tracking accuracy and robustness. Please see [7] for further details on the VOT short-term tracking performance measures.
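To make the failure rule and the two basic measures concrete, the following minimal Python sketch (not the official VOT toolkit code; all function and variable names are illustrative) computes accuracy and robustness from lists of per-frame overlaps, one list per anchor-initialized sub-sequence.

```python
# Illustrative sketch of the short-term measures described above.
# `subsequences` is a list of per-frame overlap (IoU) lists, one per anchor run.

def find_failure_frame(overlaps, threshold=0.1, grace=10):
    """Index of the first failure frame, or None if tracking never fails.

    A failure is a frame where the overlap drops below `threshold` and does
    not rise above it again within the next `grace` frames."""
    for i, overlap in enumerate(overlaps):
        if overlap < threshold:
            lookahead = overlaps[i + 1:i + 1 + grace]
            if all(v < threshold for v in lookahead):
                return i
    return None

def accuracy_robustness(subsequences):
    """Accuracy: mean overlap before failure; robustness: fraction of tracked frames."""
    acc, rob = [], []
    for overlaps in subsequences:
        fail = find_failure_frame(overlaps)
        tracked = len(overlaps) if fail is None else fail
        if tracked > 0:
            acc.append(sum(overlaps[:tracked]) / tracked)
        rob.append(tracked / len(overlaps))
    return (sum(acc) / len(acc) if acc else 0.0), sum(rob) / len(rob)
```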

2.2 The Long-Term Evaluation Protocol

The long-term performance evaluation protocol follows the protocol proposed in [16] and entails initializing the tracker in the first frame of the sequence and running it until the end of the sequence. The tracker is required to report the target position in each frame along with a score that reflects the certainty that the target is present at that position. Performance is measured by two basic measures called the tracking precision (Pr) and the tracking recall (Re), while the overall performance is summarized by the tracking F-measure.

The performance measures depend on the target presence certainty threshold, thus the performance can be visualized by the tracking precision-recall and tracking F-measure plots obtained by computing these scores for all thresholds. The final values of Pr, Re and F-measure are obtained by selecting the certainty threshold that maximizes tracker-specific F-measure. This avoids all manually-set thresholds in the primary performance measures.
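The threshold sweep can be illustrated with the following minimal sketch; it follows the description above rather than the exact toolkit implementation of [16], and the data layout (per-frame tuples of overlap, certainty and target visibility) is an assumption made for the example.

```python
# Illustrative sketch of tracking precision (Pr), recall (Re) and F-measure.
# `frames` is a list of (overlap, certainty, target_visible) tuples.

def pr_re_f(frames, tau):
    """Pr: mean overlap over frames where the tracker reports the target
    (certainty >= tau). Re: mean overlap over frames where the target is
    visible, counting non-reported frames as zero overlap."""
    reported = [o for o, c, _ in frames if c >= tau]
    visible = [o if c >= tau else 0.0 for o, c, v in frames if v]
    pr = sum(reported) / len(reported) if reported else 0.0
    re = sum(visible) / len(visible) if visible else 0.0
    f = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f

def best_f_measure(frames):
    """Sweep all reported certainty values as thresholds and keep the best F."""
    thresholds = sorted({c for _, c, _ in frames})
    candidates = [pr_re_f(frames, tau) for tau in thresholds]
    return max(candidates, key=lambda scores: scores[2])
```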

3 Description of Individual Challenges

3.1 VOT-ST2022 Challenge Outline

This challenge addressed RGB tracking in a short-term tracking setup. The initial VOT challenges required target prediction in the form of bounding boxes, while a transition to a segmentation output requirement was made in VOT2020. Nevertheless, to support the still very active community developing bounding-box prediction trackers, the bounding-box challenge was re-introduced in VOT2022. The VOT-ST2022 thus ran two subchallenges: the main segmentation-based short-term tracking challenge VOT-STs2022, and the legacy bounding-box-based short-term tracking challenge VOT-STb2022.

The Dataset. Results of VOT2021 showed that the dataset was not saturated [11], thus the public dataset was only refreshed by the addition of two sequences which include new challenging scenarios not present in previous VOT datasets: (i) a transparent deforming object and (ii) a flat object with significant out-of-plane rotations (see Fig. 1). The sequestered dataset was updated with two sequences matching the public dataset extension.

Fig. 1. Two sequences with new challenging scenarios were added to the VOT-ST2022 public dataset. In the sequence ‘bubble’ the bubble has to be tracked, while in the sequence ‘tennis’ the racquet is the target object.

The new sequences were frame-by-frame semi-automatically segmented to provide the segmentation ground truth for the main VOT-STs2022 subchallenge. For the legacy VOT-STb2022 subchallenge, the target position was annotated in all sequences by fitting axis-aligned bounding boxes to the target segmentation masks. Per-frame visual attributes were semi-automatically assigned to the new sequences following the VOT attribute annotation protocol. In particular, each frame was annotated by the following visual attributes: (i) occlusion, (ii) illumination change, (iii) motion change, (iv) size change, (v) camera motion.
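As an illustration of how the legacy bounding-box ground truth can be derived from the segmentation masks, the sketch below fits the tightest axis-aligned box to a binary mask; the helper is hypothetical and the actual annotation tooling may differ.

```python
import numpy as np

def mask_to_axis_aligned_bbox(mask):
    """Tightest (x, y, width, height) box around a binary segmentation mask,
    or None if the mask is empty (target not visible in the frame)."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return float(x0), float(y0), float(x1 - x0 + 1), float(y1 - y0 + 1)
```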

Winner Identification Protocol. The VOT-STs2022 winner was identified as follows. Trackers were ranked according to the EAO measure on the public dataset. The top five ranked trackers were then re-run by the VOT2022 committee on the sequestered dataset. The top-ranked tracker on the sequestered dataset not submitted by the VOT2022 committee members is the winner. The same protocol was used to identify the winner of the legacy short-term challenge VOT-STb2022.

3.2 VOT-RT2022 Challenge Outline

This challenge addressed real-time RGB tracking in a short-term tracking setup. The dataset was the same as in the VOT-ST2022 challenge, but the evaluation protocol was modified to emphasize the real-time component of tracking performance. In particular, the VOT-RT2022 challenge requires predicting bounding boxes at a rate faster than or equal to the video frame rate. The toolkit sends images to the tracker via the Trax protocol [21] at 20fps. If the tracker does not respond in time, the last reported bounding box is assumed as the reported tracker output for the current frame (a zero-order hold dynamic model). The same performance evaluation protocol as in VOT-ST2022 is then applied. As in VOT-ST2022, two real-time subchallenges were considered: the main segmentation-based real-time subchallenge VOT-RTs2022 and the legacy bounding-box-based real-time subchallenge VOT-RTb2022.
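The zero-order hold behaviour can be pictured with the simplified simulation below. It is a sketch under stated assumptions (a `run_tracker` callable returning a box and its processing time; frames that arrive while the tracker is busy are never processed) and glosses over details of the actual Trax-based toolkit implementation.

```python
def zero_order_hold(run_tracker, frames, init_box, frame_period=1.0 / 20):
    """Credit each frame, arriving at 20 fps, with the newest prediction that
    is ready in time; late predictions only become the output once ready."""
    outputs, last_box = [], init_box
    ready_at, pending = 0.0, None
    for k, frame in enumerate(frames):
        now = k * frame_period
        if pending is not None and now >= ready_at:
            last_box, pending = pending, None      # a late result becomes usable now
        if pending is None and now >= ready_at:
            box, latency = run_tracker(frame)      # tracker is idle: process this frame
            ready_at = now + latency
            if latency <= frame_period:
                last_box = box                     # result ready within the frame budget
            else:
                pending = box                      # too slow: hold the previous box
        outputs.append(last_box)
    return outputs
```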

Winner Identification Protocol. All trackers are ranked on the public RGB short-term tracking dataset with respect to the EAO measure. The winner of the main VOT-RTs2022 subchallenge was identified as the top-ranked tracker not submitted by the VOT2022 committee members. The same methodology was applied to identify the winner of the VOT-RTb2022 challenge.

3.3 VOT-LT2022 Challenge Outline

Fig. 2. The new VOT-LT dataset - a frame selected from each sequence. Name and length (top), visual attributes (bottom left): (O) full occlusion, (V) out-of-view, (P) partial occlusion, (C) camera motion, (F) fast motion, (S) scale change, (A) aspect ratio change, (W) viewpoint change, (I) similar objects. The dataset is highly diverse in attributes and target types and contains many target disappearances.

This challenge addressed RGB tracking in a long-term tracking setup and is a continuation of the VOT-LT2021 challenge. We adopt the definitions from [16], which are used to position the trackers on the short-term/long-term spectrum. A long-term performance evaluation protocol and measures from Sect. 2.2 were used to evaluate tracking performance on VOT-LT2022. Compared to VOT-LT2021, a significant change is a new dataset described in the following.

The Dataset. The new VOT-LT dataset contains 50 challenging sequences of diverse objects (persons, cars, motorcycles, bicycles, boats, animals, etc.), carefully selected to obtain long sequences containing many target disappearances, with a total length of 168,282 frames. The LTB50 [16], which was used in VOT-LT2021, serves as the training set this year. The sequence resolution is 1280 \(\times \) 720. Each sequence contains on average 10 long-term target disappearances, each lasting on average 52 frames. An overview of the dataset is shown in Fig. 2.

The targets are annotated by axis-aligned bounding boxes. Sequences are annotated by the following visual attributes: (i) full occlusion, (ii) out-of-view, (iii) partial occlusion, (iv) camera motion, (v) fast motion, (vi) scale change, (vii) aspect ratio change, (viii) viewpoint change, (ix) similar objects. Note that this is a per-sequence, not per-frame annotation, and a sequence can be annotated by several attributes. Compared with LTB50, the new VOT-LT dataset is more challenging in terms of small objects, similar objects, fast motion, and full/partial occlusions.

Winner Identification Protocol. The VOT-LT2022 winner was identified as follows. Trackers were ranked according to the tracking F-score on the new LT dataset (no sequestered dataset available). The top-ranked tracker on the dataset not submitted by the VOT2022 committee members is the winner of the VOT-LT2022 challenge.

3.4 VOT-RGBD2022 Challenge Outline

The first RGBD (RGB and Depth) challenge was introduced in VOT2019, and the first two challenges were based on the same public dataset, CDTB [15], which consists of 80 sequences in which the target momentarily disappears or is fully occluded. In VOT2021, the CDTB dataset was replaced with new sequences captured with an Intel RealSense 415 RGBD camera, which provides spatially aligned RGB and depth frames. The 2021 dataset contained 80 public and 50 sequestered test sequences. The main motivation for the new dataset was to make it more challenging, in the sense that sometimes the depth cue is more informative and sometimes the RGB cue. Moreover, separate training and test sequences were provided to allow method fine-tuning with dataset-specific data. More details about the dataset and its properties can be found in [25]. The two major changes compared to the previous years’ RGBD tracks are that 1) the challenge is now a short-term (ST) tracking challenge and 2) the challenge is divided into RGBD and depth-only (D) tracks in order to better understand how much depth contributes to RGBD tracking, i.e., the complementarity of the two modalities.

The main motivation to switch from long-term to short-term evaluation is that in the long-term setting target disappearance played an important role: many of the proposed RGBD trackers used the depth channel to assist in occlusion detection, but otherwise ignored the cue. The two tracks, RGBD and D, now provide information about the complementary properties of color texture and depth. It is noteworthy that the RGBD and D challenges otherwise use exactly the same data.

The Dataset. Inspired by the recent work on depth-only tracking [26], we converted the long-term sequences from the CDTB dataset, used in the first two VOT-RGBD challenges, and from DepthTrack, used in the latest challenge, into short-term sequences. We converted all 80 sequences from CDTB and the 50 test sequences of DepthTrack. Since the DepthTrack training sequences were not used, they remain available for training learning-based trackers. The short-term sequences were manually checked and sequences with poor depth information or other errors were removed. Finally, 127 sequences were selected and published on the VOT web site. See Fig. 3 for example frames.

Fig. 3. Samples from the RGBD and D challenge sequences. The first two from the left are from the CDTB sequences and the next two from DepthTrack-test sequences.

VOT-D2022. The data for the VOT-D2022 challenge is exactly the same as for VOT-RGBD except that the RGB frames are removed.

Winner Identification Protocol. The VOT-RGBD2022 and VOT-D2022 winners were identified as follows. Trackers were ranked according to the EAO measure on the public dataset and the top-ranked tracker on the public dataset not submitted by the VOT2022 committee members is the winner. The same protocol was used to identify the winners of both the VOT-RGBD and VOT-D challenges.

4 The VOT2022 Challenge Results

This section summarizes the trackers submitted, the results analysis and the winner identification for each of the VOT2022 challenges. Due to the page limit, we provide the appendix with more detailed descriptions of the submitted trackers in the supplementary material [5]. For browsing convenience, we also compiled a version of the paper with the appendix included – please see the VOT2022 results page for this version.

4.1 The VOT-STs2022 Challenge Results

The VOT-STs2022 challenge tested 31 trackers, including the baselines contributed by the VOT committee. Each submission included the binaries or source code that allowed verification of the results if required. In the following, we briefly overview the entries and provide the references to original papers in the Appendix [5] where available.

Of the participating trackers, 13 trackers (42\(\%\)) were categorized as ST\(_0\), 14 trackers (45\(\%\)) as ST\(_1\), and 4 (13\(\%\)) as LT\(_0\). 81\(\%\) applied discriminative and 19\(\%\) applied generative models. Most trackers (81\(\%\)) used a holistic model, while 19\(\%\) of the participating trackers used part-based models. Most trackers (75\(\%\)) applied an equally probable displacement within a region centered at the current position, while the rest (25\(\%\)) used a random walk dynamic model. 42\(\%\) of the trackers localized the target in a single stage, while the rest applied several stages, typically involving approximate target localization followed by position refinement. Most of the trackers (84\(\%\)) use deep features. The majority of the submissions (72\(\%\)) localized the target by segmentation, while the rest reported a bounding box.

The trackers were based on various tracking principles. 11 trackers were based on classical or deep discriminative correlation filters (RTS, ATOM_AR, DiMP_AR, KYS_AR, PrDiMP_AR, CSRDCF, D3Sv2, SuperDiMP_AR, KCF, LWL, LWL-B2S), 2 trackers were based purely on Siamese correlation (SiamFC, SiamUSCMix), 14 trackers were based on transformers (DAMT, DAMTMask, DGformer, Linker, MixFormerM, MS_AOT, OSTrackSTS, SwinT, SRATransTS, TransLL, TransT, transt_ar, TransT_M, and TRASFUSTm), two were deformable-parts trackers (ANT and LGT), one was a mean-shift tracker (ASMS), and one was a video object segmentation method adapted to tracking (STM).

In summary, we observe a significant increase in a new class of trackers identified in VOT2021 – the transformers. In fact, 47% of the trackers now belong to this class, 41% apply discriminative correlation filters, while 6% apply classical Siamese correlation networks.

Results. The results are summarized in the AR-raw plots and EAO plots in Fig. 4 and in Table 9. The top ten trackers according to the primary EAO measure (Fig. 4) are MS_AOT, DAMTMask, MixFormerM, OSTrackSTS, Linker, SRATransTS, TransT_M, DGformer, TransLL and LWL-B2S. Nine of the top trackers apply transformers as the core tracking methodology and one applies deep DCFs. Seven apply two-stage target localization, meaning that they first localize the target by a bounding box and then segment the target within the bounding box with a separate network (two of these apply Alpha-Refine [24] – the winner of the VOT-RT2020 challenge). Three of the top 10 trackers are single-stage, meaning that they directly segment the target. Four of the trackers apply elements (or are extensions) of MixFormer [3], four extend TransT [2] and three apply ViT [4].

The top tracker on the public dataset according to EAO is MS_AOT, which is based on the recent transformer-based video object segmentation method AOT [28]. For normal-sized objects, the tracker acts as a single-stage segmentation method. For tiny objects, the tracker works in a two-stage regime in which the object is first localized by a bounding box using MixFormer [3] and then segmented by AOT.

The second-best tracker is DAMTMask, which is built on top of MixFormer [3] and SuperDiMP [1], and applies a two-stage target localization and segmentation approach. The target location is predicted by RepPoints [27] and a MixFormer-like head is implemented to predict the segmentation mask.

The third-best tracker is MixFormerM, a two-stage tracker which uses a new mixed attention module for simultaneous feature extraction and target information fusion.

The three top performers in EAO are among the top three performers in accuracy (A) and robustness (R) measures as well (Table 9). While these trackers are comparable in target localization accuracy, MS_AOT stands out by its remarkable robustness (Fig. 4).

Table 1. VOT-STs2022 tracking difficulty with respect to the following visual attributes: camera motion (CM), illumination change (IC), motion change (MC), occlusion (OC) and size change (SC).
Fig. 4. The VOT-STs2022 AR-raw plots generated by sequence pooling (left), the EAO curves (center) and the VOT-STs2022 expected average overlap graph with trackers ranked from right to left (right). The right-most tracker is the top performer according to the VOT-STs2022 expected average overlap values. The dashed horizontal line denotes the average performance of three state-of-the-art trackers published in 2021/2022 at major computer vision venues. These trackers are denoted by a gray circle in the bottom part of the graph. See Table 9 for the tracker labels.

Three of the tested trackers have been published in major computer vision journals and conferences in the last two years (2021/2022). These trackers are indicated in Fig. 4, along with their average performance (EAO = 0.504), which constitutes the VOT2022 state-of-the-art bound. Approximately 32% of the submissions exceed this bound.
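For clarity, the bound is simply the mean EAO of the recently published trackers and the reported percentage is the share of all submissions above it; a trivial sketch with placeholder inputs is given below.

```python
def sota_bound(published_eaos, all_eaos):
    """State-of-the-art bound and the fraction of submissions exceeding it."""
    bound = sum(published_eaos) / len(published_eaos)
    exceeding = sum(1 for eao in all_eaos if eao > bound) / len(all_eaos)
    return bound, exceeding
```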

The per-attribute robustness analysis is shown in Fig. 5 for individual trackers. The overall top performers remain at the top of per-attribute ranks as well. MS_AOT achieves top robustness in all attributes. According to the median failure over each attribute (Table 1) the most challenging attribute remains occlusion. The drop on this attribute is consistent for all trackers (Fig. 5).

Fig. 5. Robustness with respect to the visual attributes in the VOT-STs2022 challenge (left) and in the VOT-STb2022 challenge (right). See Table 9 and Table 10 for VOT-STs2022 and VOT-STb2022 tracker labels, respectively.

The VOT-STs2022 Challenge Winner.

The top five trackers from the baseline experiment (Table 9) were re-run on the sequestered dataset. Their scores obtained on the sequestered dataset are shown in Table 2. The top tracker according to the EAO is MS_AOT and is thus the VOT-STs2022 challenge winner.

Table 2. The top five trackers from Table 9 re-ranked on the VOT-STs2022 sequestered dataset.

4.2 The VOT-STb2022 Challenge Results

The VOT-STb2022 challenge tested 41 trackers, including the baselines contributed by the VOT committee. Each submission included the binaries or source code that allowed verification of the results if required. In the following, we briefly overview the entries and provide the references to original papers in the Appendix [5] where available. The trackers were based on various tracking principles. 13 trackers were based on classical or deep discriminative correlation filters (SuperFus, TCLCFcpp, KCF, D3Sv2, DiMP, ATOM, CSRDCF, SuperDiMP, PrDiMP, FSC2F, oceancycle, DeepTCLCF, KYS), 4 trackers were based purely on Siamese correlation (NfS, SiamUSCMix, SiamVGGpp, SiamFC), 19 trackers were based on transformers (TransT_M, TransT, ADOTstb, GOANET, DAMT, tomp, TransLL, APMT_MR, APMT_RT, DGformer, Linker_B, MixFormer, ViTCRT, MixFormerL, OSTrackSTB, SRATransT, vittrack, SwinTrack, SBT), one was ensemble-based (TRASFUST), one was based on meta-learning (ReptileFPN), one was a scale-adaptive mean-shift tracker (ASMS), and two were part-based generative trackers (ANT and LGT).

Results. The results are summarized in the AR-raw plots and EAO plots in Fig. 6, and in Table 10. The top ten trackers according to the primary EAO measure (Fig. 6) are DAMT, MixFormerL, OSTrackSTB, APMT_MR, MixFormer, APMT_RT, ADOTstb, SRATransT, Linker_B, TransT_M. As in the segmentation tracking challenge VOT-STs2022, all of the top ten trackers apply transformers. In fact, seven of the top trackers are modifications of segmentation-based counterparts ranked among the top ten on VOT-STs2022: MixFormerL, DAMT, OSTrackSTB, MixFormer, SRATransT, Linker, TransT.

All three top-ranked trackers on the public dataset according to EAO are counterparts of the top-ranked trackers on the main segmentation challenge VOT-STs2022. The two top performers, with equal EAO, are MixFormerL and DAMT. MixFormerL is a counterpart of the tracker ranked third on VOT-STs2022, while DAMT is a counterpart of the second-ranked tracker on VOT-STs2022. The two trackers excel in different tracking properties: DAMT is more robust than MixFormerL, while MixFormerL delivers more accurate target estimation than DAMT. The third-ranked tracker, OSTrackSTB, is a counterpart of the fourth-ranked tracker on VOT-STs2022.

Fig. 6. The VOT-STb2022 AR-raw plots generated by sequence pooling (left), the EAO curves (center) and the VOT-STb2022 expected average overlap graph with trackers ranked from right to left (right). The right-most tracker is the top performer according to the VOT-STb2022 expected average overlap values. The dashed horizontal line denotes the average performance of ten state-of-the-art trackers published in 2021/2022 at major computer vision venues. These trackers are denoted by a gray circle in the bottom part of the graph. See Table 10 for the tracker labels.

Seven of the tested trackers have been published in major computer vision journals and conferences in the last two years (2021/2022). These trackers are indicated in Fig. 6, along with their average performance (EAO=0.484), which constitutes the VOT2022 state-of-the-art bound. Approximately 43.9% of the submissions exceed this bound.

The per-attribute robustness analysis is shown in Fig. 5 for individual trackers. The overall top performers remain at the top of per-attribute ranks as well, however, none of the trackers consistently outperforms the rest in all attributes. According to the median failure over each attribute (Table 3) the most challenging attribute remains occlusion. The drop on this attribute is consistent for all trackers (Fig. 5).

Table 3. VOT-STb2022 tracking difficulty with respect to the following visual attributes: camera motion (CM), illumination change (IC), motion change (MC), occlusion (OC) and size change (SC).

The VOT-STb2022 Challenge Winner. Top trackers from the baseline experiment (Table 10) were re-run on the sequestered dataset. Since some of the top trackers were variations of the same tracker, the VOT committee selected only the top-performing variant as a representative to be run on the sequestered dataset. Note that there are several ways to specify the ground truth against which the predicted bounding boxes can be evaluated. The most straightforward way is to fit bounding boxes to the ground-truth masks (as done in the public evaluation). However, the most accurate ground-truth target location specification is actually the segmentation mask, with the predicted bounding box regarded as its parametric approximation. We thus inspected the tracker performance for winner identification under both the bounding-box and the segmentation-mask ground truth specifications.
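The two ground-truth specifications can be contrasted as follows: the same predicted box is scored either against the box fitted to the ground-truth mask (e.g., with the hypothetical helper sketched in Sect. 3.1) or against the mask itself, with the predicted box rasterised for the latter. The sketch below is illustrative rather than the toolkit implementation.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two (x, y, w, h) axis-aligned boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def box_mask_iou(box, mask):
    """IoU between a rasterised (x, y, w, h) box and a binary ground-truth mask."""
    x, y, w, h = (int(round(v)) for v in box)
    raster = np.zeros_like(mask, dtype=bool)
    raster[max(y, 0):y + h, max(x, 0):x + w] = True
    inter = np.logical_and(raster, mask > 0).sum()
    union = np.logical_or(raster, mask > 0).sum()
    return float(inter) / union if union > 0 else 0.0
```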

The scores using the bounding box ground truth are shown in Table 4, while the scores using the segmentation mask ground truth are shown in Table 5. We observe that the tracker ranks remain the same across the two ground truth specifications, except for the top two, which switch ranks. For this reason, both top performers are declared winners of the VOT-STb2022 challenge, each in its category. The winner of the VOT-STb2022 challenge in the bounding box ground truth category is OSTrackSTB, while the winner in the segmentation mask ground truth category is APMT_MR.

Table 4. The top five trackers from Table 10 re-ranked on the VOT-STb2022 sequestered dataset using the bounding box ground truth.
Table 5. The top five trackers from Table 10 re-ranked on the VOT-STb2022 sequestered dataset using the segmentation masks as ground truth.

4.3 The VOT-RTs2022 Challenge Results

The trackers that entered the VOT-STs2022 challenge were also run on the VOT-RTs2022 challenge. Thus the statistics of the submitted trackers were the same as in VOT-STs2022. For details please see Sect. 4.1.

Results. The EAO scores and AR-raw plots for the trackers participating in the VOT-RTs2022 challenge are shown in Fig. 7 and Table 9. The top ten segmentation-based real-time trackers are MS_AOT, OSTrackSTS, SRATransTS, TransT_M, DGformer, MixFormerM, TransLL, TransT, Linker and RTS.

Nine of the top ten trackers are based on transformers. Nine trackers are ranked among the top ten on the VOT-STs2022 challenge: MS_AOT, OSTrackSTS, SRATransTS, TransT_M, DGformer, MixFormerM, TransLL, Linker and RTS, while TransT is a variation of TransT_M. The top-ranked tracker on the real-time challenge according to EAO is MS_AOT, which is also the top performer on the VOT-STs2022 public dataset; the second-best is OSTrackSTS, which ranks fourth on VOT-STs2022; and the third is SRATransTS, which ranks seventh on VOT-STs2022. This indicates significant advancement in the field of visual object tracking since the inception of the VOT real-time challenges: the speed limitations that used to hold back robust trackers have been convincingly overcome by transformer-based designs.

Three of the tested trackers have been published in major computer vision journals and conferences in the last two years (2021/2022). These trackers are indicated in Fig. 7, along with their average performance (EAO = 0.422), which constitutes the VOT2022 state-of-the-art bound. Approximately 45.2% of the submissions exceed this bound.

Fig. 7. The VOT-RTs2022 AR plot (left), the EAO curves (center) and the EAO plot (right). The dashed horizontal line denotes the average performance of seven state-of-the-art trackers published in 2021/2022 at major computer vision venues. These trackers are denoted by a gray circle in the bottom part of the graph.

The VOT-RTs2022 Challenge Winner. According to the EAO results in Table 9, the top performer and the winner of the segmentation-based real-time tracking challenge VOT-RTs2022 is MS_AOT.

4.4 The VOT-RTb2022 Challenge Results

The trackers that entered the VOT-STb2022 challenge were also run on the VOT-RTb2022 challenge. Thus the statistics of the submitted trackers were the same as in VOT-STb2022. For details please see Sect. 4.2 and [5].

Results. The EAO scores and AR-raw plots for the trackers participating in the VOT-RTb2022 challenge are shown in Fig. 8 and Table 10. The top ten bounding-box-based real-time trackers are OSTrackSTB, APMT_RT, MixFormer, APMT_MR, SRATransT, DAMT, TransT_M, vittrack, SBT, TransT. All of these are based on transformers. Seven are among the top ten performers on the public dataset in VOT-STb2022: OSTrackSTB, APMT_RT, MixFormer, APMT_MR, SRATransT, DAMT and TransT_M. Thus, similarly to VOT-RTs2022, the results show that the performance of transformer-based trackers is minimally compromised, if at all, by the real-time constraint.

The top-performer according to the EAO on the public dataset is OSTrackSTB, which is based on the recent OSTrack [29] and uses a ViT [4] backbone. This tracker is ranked third on VOT-STb2022. The second and the third-best trackers on VOT-RTb2022 are APMT_RT and MixFormer, which are ranked fourth and fifth on VOT-STb2022.

Note that 7 of the tested trackers have been published in major computer vision journals and conferences in the last two years (2021/2022). These trackers are indicated in Fig. 8, along with their average performance (EAO=0.421), which constitutes the VOT2022 state-of-the-art bound. Approximately 53.7% of the submissions exceed this bound.

Fig. 8. The VOT-RTb2022 AR plot (left), the EAO curves (center) and the EAO plot (right). The dashed horizontal line denotes the average performance of ten state-of-the-art trackers published in 2021/2022 at major computer vision venues. These trackers are denoted by a gray circle in the bottom part of the graph.

The VOT-RTb2022 Challenge Winner.

According to the EAO results in Table 10, the top performer and the winner of the bounding-box-based real-time tracking challenge VOT-RTb2022 is OSTrackSTB.

4.5 The VOT-LT2022 Challenge Results

Trackers Submitted. The VOT-LT2022 challenge received 7 valid entries. The VOT2022 committee contributed additional trackers SuperDiMP and KeepTrack as baselines; thus 9 trackers were considered in the challenge. In the following, we briefly overview the entries and provide the references to original papers in [5] where available.

All participating trackers were categorized as LT\(_1\) according to the ST-LT taxonomy from Sect. 2, in that they implemented explicit target re-detection. All trackers were based on convolutional neural networks. Four trackers applied a Transformer architecture akin to STARK [23] for target localization (CoCoLoT, mixLT, mlpLT, and VITKT_M). In particular, VITKT_M is based purely on a Transformer backbone [20] for feature extraction. Four trackers applied the SuperDiMP structure [1] as their base tracker (ADiMPLT, mixLT, mlpLT, SuperDiMP). Three trackers selected KeepTrack [18] as their auxiliary tracker due to its robustness to distractors (CoCoLoT, VITKT_M, KeepTrack). One tracker builds on MixFormer [3] to design a long-term tracker focused on target recapture (HuntFormer). One tracker extends the D3Sv2 [17] short-term tracker with long-term capabilities (D3SLT). Four trackers combined different tracking methods and switched between them based on their tracking scores (CoCoLoT, D3SLT, mixLT, mlpLT, VITKT_M). Among them, two trackers use an online real-time MDNet-based [19] verifier to determine the tracking score (CoCoLoT, D3SLT).

Table 6. List of trackers that participated in the VOT-LT2022 challenge along with their performance scores (Pr, Re, F-score) and ST/LT categorization.

Results. The overall performance is summarized in Fig. 9 and Table 6. The top three performers are VITKT_M, mixLT and HuntFormer. VITKT_M obtains the highest F-score (0.617) in 2022, while last year's winner (mlpLT) obtains 0.565. Note that the average F-score of these trackers decreased by 11.4% compared to last year, which reflects the increased difficulty of the new VOT-LT dataset. All results are based on the submitted numbers, but these were verified by running the code multiple times. VITKT_M is composed of a Transformer-based tracker VitTrack, an auxiliary tracker KeepTrack and a motion module. Specifically, the master tracker VitTrack is a Transformer-based tracker composed of a backbone network, a corner prediction head and a classification head. In addition, a simple motion module predicts the current target state from the temporal trajectory. When the scores of both VitTrack and KeepTrack fall below a threshold and the target moves abnormally, the motion module is triggered to predict the current state.

The mixLT architecture is a progressive fusion of multiple trackers, mainly STARK [23] and SuperDiMP. Specifically, it first fuses the results of two trackers, STARK-ST50 and STARK-ST101. The states of the two trackers are then corrected based on the fusion results. SuperDiMP, controlled by a meta-updater, is introduced for further fusion between dissimilar trackers, in order to improve the robustness of long-term tracking. The final tracking result is determined according to the confidences of the trackers over several frames, and another tracker correction is performed.

Based on MixFormer, the HuntFormer tracker proposes an effective motion prediction model that provides a reliable search region in which to recapture the target. It also introduces a soft-threshold-based dynamic memory update model, which keeps a set of reliable target templates in memory that can be used to match the target position in the search region. The two modules cooperate with each other, which greatly improves the recapture ability of the tracker.

VITKT_M achieves the overall best F-score and significantly surpasses mixLT (by 1.7%) and MixFormer (by 1.9%). All of these methods are based on Transformers. Two similar trackers, VITKT_M and VITKT, were submitted by one team. The only difference is that VITKT is a more concise version of VITKT_M without the motion module. When the motion module is ablated (VITKT), the F-score decreases by 1.2%. Since VITKT is a minor variant of VITKT_M, we only keep VITKT_M in our ranking.

The VOT-LT2022 Challenge Winner. According to the F-score in Table 6, the top-performing tracker is VITKT_M, closely followed by mixLT and HuntFormer. Thus the winner of the VOT-LT2022 challenge is VITKT_M.

4.6 The VOT-RGBD2022 Challenge Results

Eight trackers were submitted to the 2022 RGBD challenge: DMTracker, keep_track, MixForRGBD, OSTrack, ProMix, SAMF, SBT_RGBD and SPT.

All trackers are based on the popular deep learning-based tracker architectures that have performed well in the previous years' VOT RGB challenges. The new deep architecture this year is MixFormer [3], which appears in multiple submissions (MixForRGBD, ProMix and SAMF). The main difference between the submitted trackers is how they fuse the two modalities, depth and RGB, and in their training procedures. Some teams submitted multiple trackers, but since their architectures are different they were all accepted.

Fig. 9. VOT-LT2022 challenge average tracking precision-recall curves (left) and the corresponding F-score curves (right). Tracker labels are sorted according to maximum of the F-score (see Table 6).

Results. The Expected Average Overlap (EAO), Accuracy (A) and Robustness (R) metrics of the submitted and a number of additional trackers are shown in Table 7. The two best performing trackers, MixForRGBD and SAMF, are distinctly better than the next ones. The six best performing trackers are this year's submissions, while the DepthTrack baseline, DeT_DiMP50_Max, is seventh. The two RGB trackers perform the worst, as expected.

Fig. 10. The VOT-RGBD2022 AR plot (left) and the EAO curves (right).

The VOT-RGBD2022 Challenge Winner. The results in Fig. 10 show that MixForRGBD and SAMF perform very similarly and are clearly better than the rest. Still, MixForRGBD obtains the best EAO score and is thus the winner of the VOT-RGBD2022 challenge.

Table 7. Results for the eight submitted VOT-RGBD2022 trackers. For comparison, the table also includes the results for the three best performing RGBD trackers from VOT2020 (ATCAIS) and VOT2021 (STARK_RGBD and DRefine), two strong baseline RGB trackers from the previous years (DiMP and ATOM) and the baseline RGBD tracker from the DepthTrack dataset (DeT_DiMP50_Max [25]).

4.7 The VOT-D2022 Challenge Results

The VOT-D2022 challenge uses the same 127 short-term tracking sequences as the above RGBD2022 challenge, but in the D (depth-only) challenge the trackers are provided only the depth map frames. This challenge was added to study how much RGB adds to the depth cue and what the complementary power of the two modalities is.

A total of six trackers were submitted to the depth-only challenge: CoDeT, MixFormerD, OSTrack_D, RSDiMP, SBT_Depth and UpDoT.

Not surprisingly, the D-only challenge attracted submissions from the same groups that also participated in the RGBD challenge. For example, CoDeT is a D-only version of DMTracker, MixFormerD of MixForRGBD, OSTrack_D of OSTrack, and SBT_Depth of SBT_RGBD. RSDiMP is from the same group as the SPT RGBD tracker, but the two architectures are different. The authors of CoDeT also submitted UpDoT, which corresponds to a standard DiMP trained with two different versions of depth data.

Table 8. Results for the six submitted VOT-D2022 trackers. For comparison, the table also includes the results for the recent depth-only tracker DOT [26] and an RGB DiMP that was trained with RGB but tested with colormap-converted depth images.
Fig. 11. The VOT-D2022 AR plot (left) and the EAO curves (right).

Table 9. Results for the VOT-STs2022 and VOT-RTs2022 challenges. Expected average overlap (EAO), accuracy and robustness are shown. For reference, a no-reset average overlap AO [22] is shown under Unsupervised.
Table 10. Results for the VOT-STb2022 and VOT-RTb2022 challenges. Expected average overlap (EAO), accuracy and robustness are shown. For reference, a no-reset average overlap AO [22] is shown under Unsupervised.

Results. The computed performance metrics for the D (depth-only) trackers are given in Table 8 and the corresponding graphs in Fig. 11. The results show that the depth-only variants of the best performing RGBD trackers also perform well in the D-only challenge (MixForRGBD \(\rightarrow \) MixFormerD and OSTrack \(\rightarrow \) OSTrack_D). The only dedicated depth-only tracker without an RGBD counterpart, RSDiMP, obtains the second-best EAO score. Overall the three best methods, MixFormerD, RSDiMP and OSTrack_D, perform almost on par and are distinctly better than the rest. Therefore, these three trackers are good starting points for understanding how to effectively use the depth channel in tracking.

Notably, there is a clear difference between the D-only and RGBD results on the same data (Table 7 vs. Table 8). This confirms that both modalities, D and RGB, are beneficial for object tracking. For example, the RGB DiMP in Table 7 is clearly better than the depth-only DiMP in Table 8 (EAO 0.534 vs. 0.336), but inferior to the best D-only tracker (MixFormerD, 0.600).

The VOT-D2022 Challenge Winner. The three best depth-only trackers, MixFormerD, RSDiMP and OSTrack_D, perform on par, but since MixFormerD obtains the best EAO score, it is selected as the winner.

5 Conclusions

Results of the VOT2022 challenge were presented. The challenge is composed of the following challenges focusing on various tracking aspects and domains: (i) the segmentation-based short-term RGB tracking challenge (VOT-STs2022), (ii) the legacy bounding-box-based short-term RGB tracking challenge (VOT-STb2022), (iii) the realtime counterpart of VOT-STs2022 (VOT-RTs2022), (iv) the realtime counterpart of VOT-STb2022 (VOT-RTb2022), (v) the VOT2022 long-term RGB tracking challenge (VOT-LT2022), (vi) the VOT2022 short-term RGB and depth (D) tracking challenge (VOT-RGBD2022) and its variation, (vii) the VOT2022 short-term depth-only tracking challenge (VOT-D2022).

In this VOT edition, new VOT-LT2022, VOT-RGBD2022 and VOT-D2022 datasets were introduced, a legacy bounding-box-based tracking challenge VOT-STb2022 was reintroduced, the VOT-ST2022 public and sequestered datasets were refreshed, and a training dataset has been introduced for VOT-LT2022.

A methodological shift, already indicated in VOT2021 [11], has become even more apparent this year. Nearly half of the trackers participating in the VOT-STs2022 challenge were based on transformers, approximately 40% were using discriminative correlation filters, while only a few were based on Siamese correlation trackers (a methodology highly popular in VOT2021). All of the top 9 trackers were based on transformers. Apart from being robust, these trackers are also very fast: 9 of the top VOT-STs2022 trackers are among the top trackers on the VOT-RTs2022 challenge. Variations of the segmentation trackers were submitted to the legacy bounding-box tracking challenge VOT-STb2022. Seven of the top ten trackers on VOT-STb2022 were modifications of trackers ranked among the top ten on VOT-STs2022. The winner of the VOT-STs2022 challenge is MS_AOT, while the winner of the VOT-STb2022 challenge in the bounding box ground truth category is OSTrackSTB and the winner in the segmentation mask ground truth category is APMT_MR. The winner of the VOT-RTs2022 challenge is MS_AOT and the winner of the VOT-RTb2022 challenge is OSTrackSTB.

The VOT-LT2022 challenge’s top-three performers all apply a Transformer-based tracker structure for short-term localization and long-term re-detection. Among all submitted trackers, the dominant methodologies are SuperDiMP [1], STARK [23], KeepTrack [18], and MixFormer [3]. The top performer and the winner of the VOT-LT2022 challenge is VITKT_M, which ensembles the results of VitTrack and KeepTrack. This tracker obtains a significantly better performance than the second-best tracker.

In the VOT-RGBD2022 and VOT-D2022 challenges, the same tracker architecture obtained the best results in all tracking metrics. There are two interesting points in this submission that possibly explain its success compared to others. First, the tracker is based on the recent Convolutional vision Transformer (CvT) model and, second, both the RGB and depth representations are learned from data. Since there are no depth-only tracking datasets sufficiently large for network training, the existing RGB datasets were converted to pseudo depth-map datasets using a monocular depth estimation method. These design choices turned out to be the winning ones this year, and therefore the same authors won the VOT-RGBD2022 and VOT-D2022 challenges with their two trackers adopting the same architecture, MixForRGBD and MixFormerD.

For the last decade, the primary objective of VOT has been to establish a platform for discussing tracking performance evaluation and to contribute to the tracking community verified annotated datasets, performance measures and evaluation toolkits. The VOT2022 was the tenth effort toward this goal, following the very successful VOT2013, VOT2014, VOT2015, VOT2016, VOT2017, VOT2018, VOT2019, VOT2020 and VOT2021. Since its beginning, the VOT has successfully identified modern milestone tracking methodologies at their inception, spanning discriminative correlation filters, Siamese trackers and, most recently, transformer-based architectures. By pushing the boundaries, presenting ever more challenging sequences and opening new challenges, the VOT has been successfully fulfilling its service to the community. The effort, however, is shared with the tracking community, which continually rises to the challenges and generates the fast pace of tracker architecture development. We thank the community for their collaboration and look forward to future developments in this exciting scientific field.