1 Introduction

Compared with weak AI designed to solve one specific task, artificial general intelligence (AGI) is expected to understand or learn any intellectual task that a human being can. Although there is still a large gap between this ambitious goal and today's intelligent algorithms, some recent works [19, 20, 48, 76] have begun to explore the possibility of building general vision models that address several vision tasks simultaneously.

Object tracking is one of the fundamental tasks in computer vision, which aims to build pixel-level or instance-level correspondence between frames and to output trajectories, typically in the form of boxes or masks. Over the years, according to different application scenarios, the object tracking problem has mainly been divided into four separate sub-tasks: Single Object Tracking (SOT) [17, 39], Multiple Object Tracking (MOT) [37, 75], Video Object Segmentation (VOS) [43], and Multi-Object Tracking and Segmentation (MOTS) [57, 75]. As a result, most tracking approaches are developed for only one or a subset of these sub-tasks. Despite being convenient for specific applications, this fragmented situation brings the following drawbacks: (1) Trackers may over-specialize on the characteristics of a specific sub-task and lack generalization ability. (2) Independent model designs cause redundant parameters. For example, recent deep-learning-based trackers usually adopt similar backbone architectures, but the separate design philosophy hinders the potential reuse of parameters. It is natural to ask: can all mainstream tracking tasks be solved by a unified model?

Although some works [33, 36, 58, 60, 66] attempt to unify SOT\( { \& }\)VOS or MOT\( { \& }\)MOTS by adding a mask branch to an existing box-level tracking system, there is still little progress towards the unification of SOT and MOT. Three main obstacles hinder this process. (1) The characteristics of the tracked objects differ. MOT usually tracks tens or even hundreds of instances of specific categories. In contrast, SOT needs to track one target given in the reference frame, no matter what class it belongs to. (2) SOT and MOT require different types of correspondence. SOT requires distinguishing the target from the background, whereas MOT needs to match the currently detected objects with previous trajectories. (3) Most SOT methods [3, 5, 9, 15, 27, 72] only take a small search region as the input to save computation and filter potential distractors. However, MOT algorithms [2, 8, 36, 63, 69, 79, 84] usually take the high-resolution full image as the input to detect instances as completely as possible.

To overcome these challenges, we propose two core designs: the target prior and the pixel-wise correspondence. To be specific, (1) the target prior is an additional input to the detection head and serves as the switch among the four tasks. For SOT\( { \& }\)VOS, the target prior is the propagated reference target map, enabling the head to focus on the tracked target. For MOT\( { \& }\)MOTS, by setting the target prior to zero, the head smoothly degenerates into the usual class-specific detection head. (2) The pixel-wise correspondence is the similarity between all pairs of points from the reference frame and the current frame. Both the SOT correspondence (\(\textbf{C}^{\textrm{SOT}}\in \mathbb {R}^{h'w'\times {hw}}\)) and the MOT correspondence (\(\textbf{C}^{\textrm{MOT}}\in \mathbb {R}^{M\times {N}}\)) are subsets of the pixel-wise correspondence (\(\textbf{C}_{\textrm{pix}}\in \mathbb {R}^{hw\times {hw}}\)). (3) With the help of the informative target prior and the accurate pixel-wise correspondence, the search-region design becomes unnecessary for SOT, leading to unified full-image inputs for SOT and MOT.

Towards the unification of object tracking, we propose Unicorn, a single network architecture to solve four tracking tasks. It takes the reference frame and the current frame as the inputs and produces their visual features by a weight-shared backbone. Then a feature interaction module is exploited to build pixel-wise correspondence between two frames. Based on the correspondence, a target prior is generated by propagating the reference target to the current frame. Finally, the target prior and the visual features are fused and sent to the detection head to get the tracked objects for all tasks.

With the unified network architecture, Unicorn can learn from various sources of tracking data and address the four tracking tasks with the same model parameters. Extensive experiments show that Unicorn performs on par with or better than task-specific counterparts on 8 challenging benchmarks from the four tracking tasks.

We summarize that our work has the following contributions:

  • For the first time, Unicorn accomplishes the great unification of the network architecture and the learning paradigm for four tracking tasks.

  • Unicorn bridges the gap among methods of four tracking tasks by the target prior and the pixel-wise correspondence.

  • Unicorn sets new state-of-the-art performance on 8 challenging tracking benchmarks with the same model parameters. This achievement will serve as a solid step towards the general vision model.

2 Related Work

2.1 Task-Specific Trackers

SOT typically specifies one tracked target with a bounding box on the first frame, then requires trackers to predict boxes for the tracked target in the following frames. Considering the uniqueness and the motion continuity of the tracked target, most SOT algorithms [3, 5, 9, 15, 27, 70, 72] track on a small search region rather than the whole image to reduce computation and to filter distractors. Although achieving great success in the SOT field, search-region-based trackers suffer from the following drawbacks: (1) Due to the limited visual field, it is difficult for these methods to recover from temporary tracking failure, especially in long-term tracking scenarios. (2) The speed of these methods drops drastically as the number of tracked instances increases. This inefficiency restricts the application of SOT trackers in scenarios such as MOT, where there are tens or hundreds of targets to track. To overcome the first problem, some works [23, 58] propose a global-detection-based tracking paradigm. However, these methods either require large modifications to the original detection architecture to integrate the target information or rely on complicated dynamic programming to pick the best tracklet. Besides, both GlobalTrack [23] and Siam R-CNN [58] are developed on the two-stage Faster R-CNN, whose detection pipeline is tedious and relies on hand-crafted anchors and ROI-Align. By contrast, in this work, we build our method on a one-stage, anchor-free detector [18]. Furthermore, we demonstrate that with only minimal changes to the original detector architecture, we can transform an object detector into a powerful SOT tracker.

Different from SOT, MOT does not have any given prior on the first frame. MOT trackers are required to find and associate all instances of specific classes by themselves. The mainstream methods [41, 49, 63, 79, 84] follow the tracking-by-detection paradigm. Specifically, an MOT system typically has two main components: an object detector and an association strategy. Commonly used detectors include Faster R-CNN [45], the YOLO series [18, 44], CenterNet [85], Sparse R-CNN [50], and Deformable DETR [88], etc. Popular association methods include IoU matching [4, 49], the Kalman Filter [4, 63, 79], ReID embeddings [41, 63, 65, 79], Transformers [36, 49, 77], or combinations of them [78]. Although some works [12, 87] introduce SOT trackers for the association, these SOT trackers [3, 14] are completely independent of the MOT networks, without any weight sharing. There is still a large gap between methods of SOT and MOT.

The goal of VOS is to predict masks for the tracked instances based on the high-quality mask annotations of the first frame. This field is now dominated by memory-network-based methods [10, 40, 74]. Although achieving great performance, these methods suffer from the following disadvantages: (1) The memory network brings huge time and space complexity, especially when dealing with high spatial resolutions and long sequences, which are quite common in SOT and MOT. Specifically, long-term tracking benchmarks [17, 55] in SOT usually have thousands of frames per sequence, more than 20x longer than DAVIS [43]. Meanwhile, the image size in MOT [75] can reach 720\(\,\times \,\)1280, while the image size of DAVIS is usually only 480\(\,\times \,\)854. (2) SOTA methods assume that there are always high-quality mask annotations on the first frame. However, high-quality masks demand expensive labor costs and are usually unavailable in real-world applications. To overcome this problem, some works [33, 58, 60] attempt to develop weakly-annotated VOS algorithms, which only require a box annotation on the first frame.

MOTS is highly related to MOT, changing the form of the outputs from boxes to the fine-grained representation of masks. MOTS benchmarks [57, 75] are typically built from the same scenarios as those of MOT [37, 75]. Besides, many MOTS methods are developed upon MOT trackers. Representative approaches include the 3D-convolution-based Track R-CNN [57] and Stem-Seg [1], the Transformer-based TrackFormer [36], the tracking-assisting-detection TraDes [66], and the prototype-based PCAN [26].

2.2 General Vision Models

Despite the great success of specialized models for diverse tasks, there is still a large gap between current AI and human-like, omnipotent artificial general intelligence (AGI). An important step towards this grand goal is to build a generalist model supporting a broad range of AI tasks. Recent pioneering works [19, 20, 48, 76] attempt to approach this goal from different perspectives. Specifically, MuST [19] introduces a multi-task self-training pipeline, which harnesses the knowledge in independent specialized teacher models to train a single general student model. INTERN [48] proposes a new learning paradigm, which learns with supervisory signals from multiple sources in multiple stages. The developed general vision model not only generalizes well to different tasks but also has lower requirements on downstream data. Florence [76] is a new computer vision foundation model, which expands the representations to different tasks along space, time, and modality. Florence has great transferability and achieves new SOTA results on a wide range of vision benchmarks. OMNIVORE [20] proposes a modality-agnostic model which can classify images, videos, and single-view 3D data using the same model parameters.

Fig. 1. Comparison between previous solutions and Unicorn.

2.3 Unification in Object Tracking

In the literature, some works [60, 62, 66] have attempted to design a unified framework supporting multiple tracking tasks. Specifically, SiamMask [60] is the first work to address SOT and VOS simultaneously. Similarly, TraDes [66] can solve both MOT and MOTS by introducing an extra mask head. Besides, UniTrack [62] proposes a high-level tracking framework, which consists of a shared appearance model and a series of unshared tracking heads. It demonstrates that different tracking tasks can share one appearance model for either propagation or association. However, the large discrepancy among tracking heads hinders it from exploiting a large amount of tracking data. Consequently, its performance lags far behind that of SOTA task-specific methods. Moreover, when used for MOT or MOTS, UniTrack requires extra, independent object detectors to provide observations. The extra object detector and the appearance model do not share the same backbone, bringing a heavy parameter burden. By contrast, Unicorn solves four tracking tasks with one unified network with the same parameters. Besides, Unicorn can learn powerful representations from a large amount of labeled tracking data, achieving superior performance on 8 challenging benchmarks. Figure 1 shows the comparison between task-specific methods and Unicorn.

2.4 Correspondence Learning

Learning accurate correspondence is the key to many vision tasks, such as optical flow [51], video object segmentation [25, 80], geometric matching [53, 54], etc. Dense correspondence is usually obtained by computing the correlation between the embedding maps of two frames. Most existing methods [25, 51, 80] obtain the embedding maps without considering the information exchange between the two images. This can lead to ambiguous or wrong matching when there are many similar patterns or instances in the input images. Although some works [53, 54] attempt to relieve this problem, they usually require complex optimization or uncertainty modeling. Different from local comparison, the Transformer [56] and its variants [88] exploit the attention mechanism to capture long-range dependencies within the input sequence. In this work, we demonstrate that these operations can help to learn precise correspondence in object tracking.

Fig. 2. Unicorn consists of three main components: (1) unified inputs and backbone, (2) unified embedding, and (3) unified head.

3 Approach

We propose a unified solution for object tracking, called Unicorn, which consists of three main components: unified inputs and backbone, unified embedding, and unified head. These three components are responsible for obtaining powerful visual representations, building precise correspondence, and detecting diverse tracked targets, respectively. The framework of Unicorn is illustrated in Fig. 2. Given the reference frame \(\textbf{I}_{\textrm{ref}}\), the current frame \(\textbf{I}_{\textrm{cur}}\), and the reference targets, Unicorn aims at predicting the states of the tracked targets on the current frame for all four tasks with a unified network.

3.1 Unified Inputs and Backbone

For efficiently localizing multiple potential targets, Unicorn takes the whole image (for both the reference frame and the current frame), instead of local search regions, as the input. This also endows Unicorn with high resistance to tracking failure and the ability to re-detect the tracked target after disappearance.

During feature extraction, the reference frame and the current frame are passed through a weight-sharing backbone to obtain feature pyramid representations (FPN) [30]. To maintain important details and reduce the computational burden when computing correspondence, we choose the feature map with stride 16 as the input of the following embedding module. The corresponding features from the reference and the current frame are termed \(\textbf{F}_{\textrm{ref}}\) and \(\textbf{F}_{\textrm{cur}}\), respectively.

3.2 Unified Embedding

The core task of object tracking is to build accurate correspondence between frames of a video. For SOT and VOS, pixel-wise correspondence propagates the user-provided target from the reference frame (usually the \(1^{st}\) frame) to the \(t^{th}\) frame, providing strong prior information for the final box or mask prediction. For MOT and MOTS, instance-level correspondence helps to associate the detected instances on the \(t^{th}\) frame with the existing trajectories on the reference frame (usually the \((t-1)^{th}\) frame).

In Unicorn, given the spatially flattened reference frame embedding \(\textbf{E}_{\textrm{ref}}\in \mathbb {R}^{hw\times {c}}\) and the current frame embedding \(\textbf{E}_{\textrm{cur}}\in \mathbb {R}^{hw\times {c}}\), the pixel-wise correspondence \(\textbf{C}_{\textrm{pix}}\in \mathbb {R}^{hw\times {hw}}\) is computed by matrix multiplication between them. For SOT\( { \& }\)VOS, which take the full image as the input, the correspondence is the pixel-wise correspondence itself. For MOT\( { \& }\)MOTS, assuming there are M trajectories on the reference frame and N detected instances on the current frame, the instance-level correspondence \(\textbf{C}_{\textrm{inst}}\in \mathbb {R}^{N\times {M}}\) is the matrix multiplication of the reference instance embedding \(\textbf{e}_{\textrm{ref}}\in \mathbb {R}^{M\times {c}}\) and the current instance embedding \(\textbf{e}_{\textrm{cur}}\in \mathbb {R}^{N\times {c}}\). The instance embedding \(\textbf{e}\) is extracted from the frame embedding \(\textbf{E}\) at the location of the instance center.

$$\begin{aligned} \begin{array}{l} \textbf{C}_{\textrm{pix}}=\textrm{softmax}(\textbf{E}_\mathrm {{cur}}{\textbf{E}_{\textrm{ref}}}^T)\\ \textbf{C}_{\textrm{inst}}=\textrm{softmax}(\textbf{e}_{\textrm{cur}}{\textbf{e}_{\textrm{ref}}}^T)\\ \end{array} \end{aligned}$$
(1)

It can be seen that the instance-level correspondence \(\textbf{C}_{\textrm{inst}}\) required by MOT and MOTS is a sub-matrix of the pixel-wise correspondence \(\textbf{C}_{\textrm{pix}}\). Hence, learning highly discriminative embeddings \(\{\textbf{E}_{\textrm{ref}}, \textbf{E}_{\textrm{cur}}\}\) is the key to building precise correspondence for all tracking tasks.
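As a concrete illustration, the following PyTorch-style sketch computes Eq. (1); the tensor shapes follow the definitions above, while the function names and the use of pre-computed center indices are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch of Eq. (1). E_ref, E_cur: flattened frame embeddings (hw, c);
# ref_centers, cur_centers: hypothetical index tensors with the flattened
# center locations of the M reference and N current instances.
import torch
import torch.nn.functional as F


def pixel_correspondence(E_cur, E_ref):
    # C_pix[i, k]: similarity of pixel i in the current frame to pixel k in
    # the reference frame, normalized over the reference dimension.
    return F.softmax(E_cur @ E_ref.T, dim=-1)                 # (hw, hw)


def instance_correspondence(E_cur, E_ref, cur_centers, ref_centers):
    # Instance embeddings are read out of the frame embeddings at the
    # instance centers, so C_inst is effectively a sub-matrix of C_pix.
    e_cur, e_ref = E_cur[cur_centers], E_ref[ref_centers]     # (N, c), (M, c)
    return F.softmax(e_cur @ e_ref.T, dim=-1)                 # (N, M)


# Shape check with random data: hw = 64 * 64, c = 128, N = 2, M = 3.
E_ref, E_cur = torch.randn(64 * 64, 128), torch.randn(64 * 64, 128)
C_pix = pixel_correspondence(E_cur, E_ref)
C_inst = instance_correspondence(E_cur, E_ref,
                                 torch.tensor([10, 200]),
                                 torch.tensor([11, 199, 300]))
```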

Feature Interaction. Due to its advantage in capturing long-range dependencies, the Transformer [56] is an intuitive choice for enhancing the original feature representations \(\{\textbf{F}_{\textrm{ref}}, \textbf{F}_{\textrm{cur}}\}\). However, it could lead to a huge memory cost when dealing with high-resolution feature maps, because the memory consumption grows quadratically with the length of the input sequence. To alleviate this problem, we replace the full attention with the more memory-efficient deformable attention [88]. For more accurate correspondence, the enhanced feature maps are upsampled by 2\(\times \) to obtain high-resolution embeddings at stride 8.

$$\begin{aligned} \{\textbf{E}_{\textrm{ref}}, \textbf{E}_{\textrm{cur}}\}=\textrm{Upsample}(\textrm{Attention}(\textbf{F}_{\textrm{ref}},\textbf{F}_{\textrm{cur}})) \end{aligned}$$
(2)
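To make the interaction step concrete, here is a simplified PyTorch sketch of Eq. (2). Standard full attention is used as a stand-in for the memory-efficient deformable attention [88] adopted in the paper (full attention is also evaluated in the ablations), and the module name and hyper-parameters are illustrative assumptions.

```python
# Simplified sketch of Eq. (2): joint attention over the two frames' stride-16
# features, followed by 2x upsampling to stride-8 embeddings. Full attention
# replaces the deformable attention used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureInteraction(nn.Module):
    def __init__(self, c=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)

    def forward(self, F_ref, F_cur):                     # both (c, h, w)
        c, h, w = F_cur.shape
        # Concatenate the flattened tokens of both frames so attention can
        # exchange information across frames.
        tokens = torch.cat([F_ref.flatten(1), F_cur.flatten(1)], dim=1).T
        tokens, _ = self.attn(tokens[None], tokens[None], tokens[None])
        ref, cur = tokens[0, :h * w], tokens[0, h * w:]  # (hw, c) each

        def up(x):  # reshape back to a map and upsample 2x
            return F.interpolate(x.T.reshape(1, c, h, w), scale_factor=2,
                                 mode="bilinear", align_corners=False)

        return up(ref), up(cur)                          # (1, c, 2h, 2w) each


interact = FeatureInteraction(c=256)
E_ref, E_cur = interact(torch.randn(256, 20, 20), torch.randn(256, 20, 20))
```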

Loss. An ideal embedding should work well for both propagation (SOT, VOS) and association (MOT, MOTS). For SOT\( { \& }\)VOS, although there is no human-annotated label for the dense correspondence between frames, the embedding can be supervised by the difference between the propagated result \(\mathbf {\widetilde{T}}_{\textrm{cur}}\) and the ground-truth target map \(\textbf{T}_{\textrm{cur}}\). Specifically, the target map \(\textbf{T}\) has shape \(hw\times {1}\); regions where the tracked target exists are equal to one and all other regions are equal to zero. During the propagation, the pixel-wise correspondence \(\textbf{C}_{\textrm{pix}}\) transforms the reference target map \(\textbf{T}_{\textrm{ref}}\) into the estimate of the current target map \(\mathbf {\widetilde{T}}_{\textrm{cur}}\).

$$\begin{aligned} \mathbf {\widetilde{T}}_{\textrm{cur}}(i,j) = \sum _k \textbf{C}_{\textrm{pix}}(i,k) \cdot \textbf{T}_{\textrm{ref}}(k,j) \end{aligned}$$
(3)

Besides, for MOT and MOTS, the instance-level correspondence can be learned with a standard contrastive learning paradigm. Specifically, assume that instance i from the current frame is matched with instance j from the reference frame; then the corresponding ground-truth matrix \(\textbf{G}\) should satisfy

$$\begin{aligned} \textbf{G}_{i,k}=\left\{ \begin{array}{cc}0&{}\ \ k\ne j\\ 1&{}\ \ k=j\\ \end{array}\right. \end{aligned}$$
(4)

Finally, the unified embedding can be optimized end-to-end by Dice Loss [38] for SOT\( { \& }\)VOS or Cross-Entropy Loss for MOT\( { \& }\)MOTS.

$$\begin{aligned} \textbf{L}_{\textrm{corr}}=\left\{ \begin{array}{cc}\textrm{Dice}(\widetilde{\textbf{T}}_{\textrm{cur}},\textbf{T}_{\textrm{cur}})&{}\mathrm {task\ in\ \{SOT,VOS\}}\\ \textrm{CrossEntropy}(\textbf{C}_{\textrm{inst}},\textbf{G})&{}\ \ \ \mathrm {task\ in\ \{MOT,MOTS\}}\\ \end{array}\right. \end{aligned}$$
(5)
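The correspondence supervision of Eqs. (3)-(5) can be written compactly as below; the Dice formulation and the use of a negative log-likelihood on the softmax-normalized \(\textbf{C}_{\textrm{inst}}\) are common choices assumed for this sketch, not necessarily the exact implementation.

```python
# Hedged sketch of Eqs. (3)-(5).
import torch
import torch.nn.functional as F


def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)


def corr_loss_sot_vos(C_pix, T_ref, T_cur):
    # Eq. (3): propagation is a matrix product, (hw, hw) x (hw, 1) -> (hw, 1).
    T_cur_pred = C_pix @ T_ref
    return dice_loss(T_cur_pred, T_cur)


def corr_loss_mot_mots(C_inst, gt_index):
    # gt_index[i] = j means detection i matches reference trajectory j, i.e.
    # row i of the one-hot matrix G in Eq. (4). C_inst is already
    # softmax-normalized, so a negative log-likelihood on its log suffices.
    return F.nll_loss(torch.log(C_inst + 1e-8), gt_index)
```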

3.3 Unified Head

To achieve the grand unification of object tracking, another important and challenging problem is designing a unified head for the four tracking tasks. Specifically, MOT needs to detect objects of specific categories, whereas SOT needs to detect any target given in the reference frame. To bridge this gap, Unicorn introduces an extra input (called the target prior) to the original detector head [18, 52]. Without any further modification, Unicorn can easily detect the various objects needed by the four tasks with this unified head. More details about the head architecture can be found in the supplementary materials.

Target Prior. As mentioned in Sect. 3.2, given the reference target map \(\textbf{T}_{\textrm{ref}}\), the propagated target map \(\mathbf {\widetilde{T}}_{\textrm{cur}}\) can provide strong prior information about the state of the tracked target. This motivates us to take it as a target prior when detecting targets for SOT\( { \& }\)VOS. To be compatible with the original input of the detection head, we first reshape it to \(h\times {w}\times {1}\) (i.e. \(\mathbf {\widetilde{T}}^{\textrm{reshape}}_{\textrm{cur}}\in \mathbb {R}^{h\times {w}\times {1}})\). Meanwhile, when dealing with MOT\( { \& }\)MOTS, we can simply set this prior to zero. Formally, the target prior \(\textbf{P}\) satisfies that

$$\begin{aligned} \textbf{P}=\left\{ \begin{array}{cc}\mathbf {\widetilde{T}}^{\textrm{reshape}}_{\textrm{cur}}&{}\mathrm {task\ in\ \{SOT,VOS\}}\\ \textbf{0}&{}\ \ \ \mathrm {task\ in\ \{MOT,MOTS\}}\\ \end{array}\right. \end{aligned}$$
(6)

Feature Fusion. The unified head takes the original FPN feature \(\textbf{F}\in \mathbb {R}^{h\times {w}\times {c}}\) and the target prior \(\textbf{P}\in \mathbb {R}^{h\times {w}\times {1}}\) as the inputs. Unicorn fuses these two inputs with a broadcast sum and passes the fused feature \(\textbf{F}^{'}\in \mathbb {R}^{h\times {w}\times {c}}\) to the original detection head. This fusion strategy has the following advantages. (1) The fused features are seamlessly compatible with the four tasks. Specifically, for MOT\( { \& }\)MOTS, the target prior is equal to zero, so the fused feature \(\textbf{F}^{'}\) degenerates back to the original FPN feature \(\textbf{F}\) to detect objects of specific classes. For SOT\( { \& }\)VOS, the target prior carries strong target information that enhances the original FPN feature and makes the network focus on the tracked target. (2) The architecture is simple, without introducing complex changes to the original detection head. Furthermore, the consistent architecture also enables Unicorn to fully exploit the pretrained weights of the original object detector.
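The switch of Eq. (6) and the broadcast-sum fusion only take a few lines; the sketch below assumes the shapes given in this section and uses a placeholder for the unchanged detection head.

```python
# Minimal sketch of Eq. (6) and the broadcast-sum fusion.
import torch


def fuse_with_target_prior(F_fpn, T_cur_pred=None):
    # F_fpn: (h, w, c) FPN feature; T_cur_pred: (hw, 1) propagated target map.
    h, w, c = F_fpn.shape
    if T_cur_pred is None:                    # MOT / MOTS: target prior is zero
        P = torch.zeros(h, w, 1)
    else:                                     # SOT / VOS: reshaped target map
        P = T_cur_pred.reshape(h, w, 1)
    return F_fpn + P                          # broadcast sum over channels


F_fpn = torch.randn(80, 80, 256)
fused_sot = fuse_with_target_prior(F_fpn, torch.rand(80 * 80, 1))
fused_mot = fuse_with_target_prior(F_fpn)     # identical to F_fpn
# outputs = detection_head(fused_sot)         # placeholder: the original head
```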

3.4 Training and Inference

Training. The whole training process is divided into two stages: SOT-MOT joint training and VOS-MOTS joint training. In the first stage, the network is optimized end-to-end with the correspondence loss and the detection loss using data from SOT\( { \& }\)MOT. In the second stage, a mask branch is added and optimized with the mask loss using data from VOS\( { \& }\)MOTS, with the other parameters fixed.

Inference. During the test phase, for SOT\( { \& }\)VOS, the reference target map is generated once on the first frame and kept fixed for the following frames. Unicorn directly picks the box or mask with the highest confidence score as the final tracking result, without any hyperparameter-sensitive post-processing such as the cosine window. Besides, Unicorn only needs to run the heavy backbone and the correspondence module once, running the lightweight head rather than the whole network N times, leading to higher efficiency. For MOT\( { \& }\)MOTS, Unicorn detects all objects of the given categories and simultaneously outputs the corresponding instance embeddings. The subsequent association is performed based on the embeddings for BDD100K and the motion model for MOT17, respectively.
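As an illustration of the association step for MOT\( { \& }\)MOTS, the sketch below matches detections to trajectories purely by embedding similarity with Hungarian matching; the threshold, the use of SciPy, and the omission of the motion model are simplifying assumptions for this example.

```python
# Illustrative embedding-based association (motion cues omitted).
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(traj_emb, det_emb, sim_thresh=0.5):
    """traj_emb: (M, c) trajectory embeddings; det_emb: (N, c) detection
    embeddings. Returns (detection_idx, trajectory_idx) matches."""
    traj = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    det = det_emb / np.linalg.norm(det_emb, axis=1, keepdims=True)
    sim = det @ traj.T                         # (N, M) cosine similarity
    rows, cols = linear_sum_assignment(-sim)   # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] > sim_thresh]
```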

4 Experiments

4.1 Implementation Details

When comparing with state-of-the-art methods, we choose ConvNeXt-Large [31] as the backbone. In the ablations, we report the results of our method with ConvNeXt-Tiny [31] and ResNet-50 [21] as the backbone. The input image size is \(800\times 1280\), and the shortest side ranges from 736 to 864 during multi-scale training. The model is trained on 16 NVIDIA Tesla A100 GPUs with a global batch size of 32. To avoid inaccurate statistics estimation, we replace all Batch Normalization [24] layers with Group Normalization [67]. The two training stages randomly sample data from SOT\( { \& }\)MOT datasets and VOS\( { \& }\)MOTS datasets, respectively. Each training stage consists of 15 epochs with 200,000 pairs of frames per epoch. The optimizer is AdamW [32] with a weight decay of \(5e^{-4}\) and a momentum of 0.9. The initial learning rate is \(2.5e^{-4}\) with a 1-epoch warm-up and a cosine annealing schedule. More details can be found in the supplementary materials. In Sects. 4.2, 4.3, 4.4 and 4.5, we compare Unicorn with task-specific counterparts on 8 tracking datasets. In each benchmark table, the best two results are highlighted. Unicorn uses the same model parameters across the four tasks.

Table 1. State-of-the-art comparison on LaSOT [17] and TrackingNet [39].

4.2 Evaluations on Single Object Tracking

We compare Unicorn with state-of-the-art SOT trackers on two popular and challenging benchmarks, LaSOT [17] and TrackingNet [39]. Both datasets evaluate tracking performance with the following measures: Success, precision (P), and normalized precision (\(P_{norm}\)). Higher is better for all of these measures.

LaSOT. LaSOT [17] is a large-scale long-term tracking benchmark, which contains 280 videos in the test set with an average length of 2448 frames. Table 1 shows that Unicorn achieves new state-of-the-art Success and Precision of 68.5% and 74.1% respectively. It is also worth noting that Unicorn surpasses the previous best global-detection-based tracker Siam R-CNN [58] by a large margin (68.5% vs 64.8%) with a much simpler network architecture and tracking strategy (directly picking the top-1 vs tracklet dynamic programming).

TrackingNet. TrackingNet [39] is a large-scale short-term tracking benchmark containing 511 videos in the test set. As reported in Table 1, Unicorn surpasses all previous methods with a Success of 83.0% and a Precision of 82.2%.

4.3 Evaluations on Multiple Object Tracking

We compare Unicorn with state-of-the-art MOT trackers on two challenging benchmarks: MOT17 [37] and BDD100K [75]. The common metrics include Multiple Object Tracking Accuracy (MOTA), Identity F1 Score (IDF1), False Positives (FP), False Negatives (FN), the percentage of Mostly Tracked Trajectories (MT) and Mostly Lost Trajectories (ML), and Identity Switches (IDS). Among them, MOTA is the primary metric, measuring the overall detection and tracking performance, while IDF1 measures the trajectory identity accuracy.

Table 2. State-of-the-art comparison on MOT17 [37] test set.
Table 3. State-of-the-art comparison on BDD100K [75] tracking validation set.

MOT17. MOT17 focuses on pedestrian tracking and includes 7 sequences in the training set and 7 in the test set. We compare Unicorn with previous methods under the private detection protocol on the MOT17 test set. Table 2 demonstrates that Unicorn achieves the best MOTA and IDF1, surpassing the previous SOTA method by 0.5% and 0.4% respectively.

BDD100K MOT. BDD100K is a large-scale dataset of visual driving scenes and requires tracking 8 categories of instances. To evaluate the average performance across the 8 classes, BDD100K additionally introduces two measures: mMOTA and mIDF1. Different from MOT17, BDD100K is annotated at only 5 FPS. The low frame rate poses difficulties for the motion models commonly used on MOT17. As shown in Table 3, Unicorn achieves the best performance, largely surpassing the previous SOTA method QDTrack [41] on the val set. Specifically, the improvement is up to 4.6% and 3.2% in terms of mMOTA and mIDF1 respectively.

Table 4. State-of-the-art comparison on the validation set of the DAVIS-2016 and the DAVIS-2017. OL: online learning, Memory: using an external memory bank.

4.4 Evaluations on Video Object Segmentation

We further evaluate the ability of Unicorn to perform VOS on DAVIS [43] 2016 and 2017. Both datasets evaluate methods with the region similarity \(\mathcal {J}\), the contour accuracy \(\mathcal {F}\), and the average of them \( \mathcal {J \& F}\).

DAVIS-16. DAVIS-16 includes 20 single-object videos in the validation set. Table 4 demonstrates that Unicorn achieves the best results among methods with bounding-box initialization, even surpassing RANet [64] and FRTM [46] with mask initialization. Meanwhile, Unicorn outperforms its multi-task counterpart SiamMask [60] by a large margin of 17.6% in terms of \( \mathcal {J \& F}\).

DAVIS-17. DAVIS-17 contains 30 videos in the validation set, and there can be multiple tracked targets in each sequence. As shown in Table 4, compared with the previous best box-initialized method Siam R-CNN [58], Unicorn achieves competitive results with a much simpler architecture. Specifically, Siam R-CNN [58] uses an extra Box2Seg network, which is completely independent of the box-based tracker without any weight sharing. In contrast, Unicorn can predict both boxes and masks with a unified head. Although there is still a gap between the performance of Unicorn and that of SOTA VOS methods with mask initialization, Unicorn can address four tracking tasks with the same model parameters, while HMMN [47] and STCN [10] can only be used for the VOS task.

Table 5. State-of-the-art comparison on the MOTS [57] test set.
Table 6. State-of-the-art comparison on the BDD100K MOTS validation set.

4.5 Evaluations on Multi-object Tracking and Segmentation

Finally, we evaluate the ability of Unicorn for MOTS on MOTS20 [57] and BDD100K MOTS [75]. The main evaluation metrics are sMOTSA and mMOTSA.

MOTS20 Challenge. MOTS20 has 4 sequences in the test set. As shown in Table 5, Unicorn achieves state-of-the-art performance, surpassing the second-best method PointTrackV2 [71] by a large margin of 3.3% on sMOTSA.

BDD100K MOTS Challenge. The BDD100K MOTS benchmark includes 37 sequences in the validation set. Table 6 demonstrates that Unicorn outperforms the previous best method PCAN [26] by a large margin (i.e. mMOTSA +2.2%, mAP +5.5%). Meanwhile, Unicorn does not use any complex design like a space-time memory or a prototypical network as in PCAN, resulting in a simpler pipeline.

4.6 Ablations and the Other Analysis

For the ablations, we choose Unicorn with ConvNeXt-Tiny [31] backbone as the baseline. The detailed results are demonstrated in Table 7.

Backbone. We implement a variant of Unicorn with ResNet-50 [21] as the backbone. Although the overall performance of this version is lower than the baseline, this variant still achieves superior performance on four tasks.

Table 7. Ablations and comparisons. Our baseline model is underlined.

Interaction. Besides the memory-efficient deformable attention [88], we compare the full attention [56] and a convolution operation, which does not exchange information between frames. Experiments show that deformable attention obtains better performance than the full attention while consuming much less memory. Moreover, the results of the convolution are lower than those of the baseline, showing the importance of interaction for accurate correspondence.

Fusion. Apart from the broadcast sum, we compare two other methods: concatenation and removing the target prior. The performance of SOT and VOS drops significantly after removing the target prior, demonstrating the importance of this design. Besides, the broadcast sum performs better than concatenation.

Single Task. We compare with training four independent models for different tasks. Experiments show that our unified model performs on-par with independently trained counterparts, while being much more parameter-efficient.

Speed. We develop a light-weight variant with a lower input resolution of 640\(\,\times \,\)1024. Experiments show that this real-time version not only achieves competitive performance but also runs in real-time at more than 20 FPS.

5 Conclusions

We propose Unicorn, a unified approach that addresses four tracking tasks using a single model with the same model parameters. For the first time, it achieves the unification of the network architecture and the learning paradigm for object tracking. Extensive experiments demonstrate that Unicorn performs on par with or better than task-specific counterparts on 8 challenging benchmarks. We hope that Unicorn can serve as a solid step towards the general vision model.