1 Introduction

Compared with weak AI designed to solve one specific task, artificial general intelligence (AGI) is expected to understand or learn any intellectual task that a human being can. Although there is still a large gap between this ambitious goal and today's intelligent algorithms, some recent works [19, 20, 48, 76] have begun to explore the possibility of building general vision models that address several vision tasks simultaneously.

Object tracking is one of the fundamental tasks in computer vision, which aims to build pixel-level or instance-level correspondence between frames and to output trajectories, typically in the form of boxes or masks. Over the years, according to different application scenarios, the object tracking problem has mainly been divided into four separate sub-tasks: Single Object Tracking (SOT) [17, 39], Multiple Object Tracking (MOT) [37, 75], Video Object Segmentation (VOS) [43], and Multi-Object Tracking and Segmentation (MOTS) [57, 75]. As a result, most tracking approaches are developed for only one or a subset of these sub-tasks. Despite being convenient for specific applications, this fragmented situation brings the following drawbacks: (1) Trackers may over-specialize on the characteristics of a specific sub-task and lack generalization ability. (2) Independent model designs cause redundant parameters. For example, recent deep-learning-based trackers usually adopt similar backbone architectures, but the separate design philosophy hinders the potential reuse of parameters. It is natural to ask: can all mainstream tracking tasks be solved by a unified model?

Although some works [33, 36, 58, 60, 66] attempt to unify SOT\( { \& }\)VOS or MOT\( { \& }\)MOTS by adding a mask branch to an existing box-level tracking system, there is still little progress towards the unification of SOT and MOT. Three main obstacles hinder this process. (1) The characteristics of the tracked objects differ. MOT usually tracks tens or even hundreds of instances of specific categories. In contrast, SOT needs to track one target given in the reference frame, no matter what class it belongs to. (2) SOT and MOT require different types of correspondence. SOT requires distinguishing the target from the background, whereas MOT needs to match the currently detected objects with previous trajectories. (3) Most SOT methods [3, 5, 9, 15, 27, 72] only take a small search region as the input to save computation and filter potential distractors. However, MOT algorithms [2, 8, 36, 63, 69, 79, 84] usually take the high-resolution full image as the input to detect instances as completely as possible.

To overcome these challenges, we propose two core designs: the target prior and the pixel-wise correspondence. To be specific, (1) the target prior is an additional input to the detection head and serves as the switch among the four tasks. For SOT\( { \& }\)VOS, the target prior is the propagated reference target map, enabling the head to focus on the tracked target. For MOT\( { \& }\)MOTS, by setting the target prior to zero, the head smoothly degenerates into the usual class-specific detection head. (2) The pixel-wise correspondence is the similarity between all pairs of points from the reference frame and the current frame. Both the SOT correspondence (\(\textbf{C}^{\textrm{SOT}}\in \mathbb {R}^{h'w'\times {hw}}\)) and the MOT correspondence (\(\textbf{C}^{\textrm{MOT}}\in \mathbb {R}^{M\times {N}}\)) are subsets of the pixel-wise correspondence (\(\textbf{C}_{\textrm{pix}}\in \mathbb {R}^{hw\times {hw}}\)). (3) With the help of the informative target prior and the accurate pixel-wise correspondence, the search-region design becomes unnecessary for SOT, leading to unified full-image inputs for SOT and MOT.

Towards the unification of object tracking, we propose Unicorn, a single network architecture to solve four tracking tasks. It takes the reference frame and the current frame as the inputs and produces their visual features by a weight-shared backbone. Then a feature interaction module is exploited to build pixel-wise correspondence between two frames. Based on the correspondence, a target prior is generated by propagating the reference target to the current frame. Finally, the target prior and the visual features are fused and sent to the detection head to get the tracked objects for all tasks.

With the unified network architecture, Unicorn can learn from various sources of tracking data and address the four tracking tasks with the same model parameters. Extensive experiments show that Unicorn performs on par with or better than task-specific counterparts on 8 challenging benchmarks from the four tracking tasks.

We summarize that our work has the following contributions:

  • For the first time, Unicorn accomplishes the great unification of the network architecture and the learning paradigm for four tracking tasks.

  • Unicorn bridges the gap among methods of four tracking tasks by the target prior and the pixel-wise correspondence.

  • Unicorn sets new state-of-the-art performance on 8 challenging tracking benchmarks with the same model parameters. This achievement will serve as a solid step towards the general vision model.

2 Related Work

2.1 Task-Specific Trackers

SOT typically specifies one tracked target with a bounding box on the first frame, then requires trackers to predict boxes for the tracked target in the following frames. Considering the uniqueness and the motion continuity of the tracked target, most SOT algorithms [3, 5, 9, 15, 27, 70, 72] track on a small search region rather than the whole image to reduce computation and to filter distractors. Although achieving great success in the SOT field, search-region-based trackers suffer from the following drawbacks: (1) Due to the limited visual field, it is difficult for these methods to recover from temporary tracking failure, especially in long-term tracking scenarios. (2) The speed of these methods drops drastically as the number of tracked instances increases. This inefficiency restricts the application of SOT trackers in scenarios such as MOT, where there are tens or hundreds of targets to track. To overcome the first problem, some works [23, 58] propose a global-detection-based tracking paradigm. However, these methods either require large modifications to the original detection architecture to integrate the target information or rely on complicated dynamic programming to pick the best tracklet. Besides, both GlobalTrack [23] and Siam R-CNN [58] are developed on the two-stage Faster R-CNN, whose detection pipeline is tedious and relies on hand-crafted anchors and ROI-Align. By contrast, in this work, we build our method on a one-stage, anchor-free detector [18]. Furthermore, we demonstrate that with only minimal changes to the original detector architecture, we can transform an object detector into a powerful SOT tracker.

Different from SOT, MOT does not have any given prior on the first frame. MOT trackers are required to find and associate all instances of specific classes by themselves. The mainstream methods [41, 49, 63, 79, 84] follow the tracking-by-detection paradigm. Specifically, an MOT system typically has two main components: an object detector and an association strategy. Commonly used detectors include Faster R-CNN [45], the YOLO series [18, 44], CenterNet [85], Sparse R-CNN [50], and Deformable DETR [88], etc. Popular association methods include IoU matching [4, 49], the Kalman Filter [4, 63, 79], ReID embeddings [41, 63, 65, 79], Transformers [36, 49, 77], or combinations of them [78]. Although some works [12, 87] introduce SOT trackers for the association, these SOT trackers [3, 14] are completely independent of the MOT networks, without any weight sharing. There is still a large gap between methods of SOT and MOT.

The goal of VOS is to predict masks for the tracked instances based on the high-quality mask annotations of the first frame. This field is now dominated by memory-network-based methods [10, 40, 74]. Although achieving great performance, these methods suffer from the following disadvantages: (1) The memory network brings huge time and space complexity, especially when dealing with high spatial resolutions and long sequences, which are quite common in SOT and MOT. Specifically, long-term tracking benchmarks [17, 55] in SOT usually have thousands of frames per sequence, more than 20x longer than DAVIS [43]. Meanwhile, the image size in MOT [75] can reach 720\(\,\times \,\)1280, while the image size of DAVIS is usually only 480\(\,\times \,\)854. (2) SOTA methods assume that there are always high-quality mask annotations on the first frame. However, high-quality masks demand expensive labor costs and are usually unavailable in real-world applications. To overcome this problem, some works [33, 58, 60] attempt to develop weakly-annotated VOS algorithms, which only require a box annotation on the first frame.

MOTS is highly related to MOT, changing the form of the outputs from boxes to the fine-grained representation of masks. MOTS benchmarks [57, 75] are typically built from the same scenarios as those of MOT [37, 75]. Besides, many MOTS methods are developed upon MOT trackers. Representative approaches include the 3D-convolution-based Track R-CNN [57] and Stem-Seg [1], the Transformer-based TrackFormer [36], the tracking-assisting-detection TraDes [66], and the prototype-based PCAN [26].

2.2 General Vision Models

Despite the great success of specialized models for diverse tasks, there is still a large gap between current AI and human-like, omnipotent artificial general intelligence (AGI). An important step towards this grand goal is to build a generalist model supporting a broad range of AI tasks. Recent pioneering works [19, 20, 48, 76] attempt to approach this goal from different perspectives. Specifically, MuST [19] introduces a multi-task self-training pipeline, which harnesses the knowledge in independent specialized teacher models to train a single general student model. INTERN [48] proposes a new learning paradigm, which learns with supervisory signals from multiple sources in multiple stages. The developed general vision model not only generalizes well to different tasks but also has lower requirements on downstream data. Florence [76] is a new computer vision foundation model, which expands the representations to different tasks along space, time, and modality. Florence has great transferability and achieves new SOTA results on a wide range of vision benchmarks. OMNIVORE [20] proposes a modality-agnostic model which can classify images, videos, and single-view 3D data using the same model parameters.

Fig. 1. Comparison between previous solutions and Unicorn.

2.3 Unification in Object Tracking

In the literature, some works [60, 62, 66] have attempted to design a unified framework supporting multiple tracking tasks. Specifically, SiamMask [60] is the first work to address SOT and VOS simultaneously. Similarly, TraDes [66] can solve both MOT and MOTS by introducing an extra mask head. Besides, UniTrack [62] proposes a high-level tracking framework, which consists of a shared appearance model and a series of unshared tracking heads. It demonstrates that different tracking tasks can share one appearance model for either propagation or association. However, the large discrepancy among tracking heads hinders it from exploiting a large amount of tracking data. Consequently, its performance lags far behind that of SOTA task-specific methods. Moreover, when used for MOT or MOTS, UniTrack requires extra, independent object detectors to provide observations. The extra object detector and the appearance model do not share the same backbone, bringing a heavy parameter burden. By contrast, Unicorn solves four tracking tasks with one unified network with the same parameters. Besides, Unicorn can learn powerful representations from a large amount of labeled tracking data, achieving superior performance on 8 challenging benchmarks. Figure 1 shows the comparison between task-specific methods and Unicorn.

2.4 Correspondence Learning

Learning accurate correspondence is the key to many vision tasks, such as optical flow [51], video object segmentation [25, 80], geometric matching [53, 54], etc. Dense correspondence is usually obtained by computing the correlation between the embedding maps of two frames. Most existing methods [25, 51, 80] obtain the embedding maps without considering the information exchange between the two images. This can lead to ambiguous or wrong matching when there are many similar patterns or instances in the input images. Although some works [53, 54] attempt to relieve this problem, they usually require complex optimization or uncertainty modeling. Different from local comparison, the Transformer [56] and its variants [88] exploit the attention mechanism to capture long-range dependencies within the input sequence. In this work, we demonstrate that these operations can help to learn precise correspondence in object tracking.

Fig. 2. Unicorn consists of three main components: (1) unified inputs and backbone, (2) unified embedding, and (3) unified head.

3 Approach

We propose a unified solution for object tracking, called Unicorn, which consists of three main components: unified inputs and backbone, unified embedding, and unified head. These three components are responsible for obtaining powerful visual representations, building precise correspondence, and detecting diverse tracked targets, respectively. The framework of Unicorn is illustrated in Fig. 2. Given the reference frame \(\textbf{I}_{\textrm{ref}}\), the current frame \(\textbf{I}_{\textrm{cur}}\), and the reference targets, Unicorn aims at predicting the states of the tracked targets on the current frame for all four tasks with a unified network.

3.1 Unified Inputs and Backbone

For efficiently localizing multiple potential targets, Unicorn takes the whole image (for both the reference frame and the current frame), instead of local search regions, as the input. This also endows Unicorn with high resistance to tracking failure and the ability to re-detect the tracked target after disappearance.

During feature extraction, the reference frame and the current frame are passed through a weight-sharing backbone to obtain feature pyramid representations (FPN) [30]. To maintain important details and reduce the computational burden when computing correspondence, we choose the feature map with stride 16 as the input of the following embedding module. The corresponding features from the reference and the current frame are termed \(\textbf{F}_{\textrm{ref}}\) and \(\textbf{F}_{\textrm{cur}}\), respectively.

3.2 Unified Embedding

The core task of object tracking is to build accurate correspondence between frames of a video. For SOT and VOS, pixel-wise correspondence propagates the user-provided target from the reference frame (usually the \(1^{st}\) frame) to the \(t^{th}\) frame, providing strong prior information for the final box or mask prediction. For MOT and MOTS, instance-level correspondence helps to associate the detected instances on the \(t^{th}\) frame with the existing trajectories on the reference frame (usually the \((t-1)^{th}\) frame).

In Unicorn, given the spatially flattened reference frame embedding \(\textbf{E}_{\textrm{ref}}\in \mathbb {R}^{hw\times {c}}\) and the current frame embedding \(\textbf{E}_{\textrm{cur}}\in \mathbb {R}^{hw\times {c}}\), the pixel-wise correspondence \(\textbf{C}_{\textrm{pix}}\in \mathbb {R}^{hw\times {hw}}\) is computed by matrix multiplication between them. For SOT\( { \& }\)VOS, which take the full image as the input, the correspondence is the pixel-wise correspondence itself. For MOT\( { \& }\)MOTS, assuming there are M trajectories on the reference frame and N detected instances on the current frame, the instance-level correspondence \(\textbf{C}_{\textrm{inst}}\in \mathbb {R}^{N\times {M}}\) is the matrix multiplication of the reference instance embedding \(\textbf{e}_{\textrm{ref}}\in \mathbb {R}^{M\times {c}}\) and the current instance embedding \(\textbf{e}_{\textrm{cur}}\in \mathbb {R}^{N\times {c}}\). The instance embedding \(\textbf{e}\) is extracted from the frame embedding \(\textbf{E}\) at the location of the instance center.

$$\begin{aligned} \begin{array}{l} \textbf{C}_{\textrm{pix}}=\textrm{softmax}(\textbf{E}_\mathrm {{cur}}{\textbf{E}_{\textrm{ref}}}^T)\\ \textbf{C}_{\textrm{inst}}=\textrm{softmax}(\textbf{e}_{\textrm{cur}}{\textbf{e}_{\textrm{ref}}}^T)\\ \end{array} \end{aligned}$$
(1)

It can be seen that the instance-level correspondence \(\textbf{C}_{\textrm{inst}}\) required by MOT and MOTS is a sub-matrix of the pixel-wise correspondence \(\textbf{C}_{\textrm{pix}}\). Hence, learning highly discriminative embeddings \(\{\textbf{E}_{\textrm{ref}}, \textbf{E}_{\textrm{cur}}\}\) is the key to building precise correspondence for all tracking tasks.
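As a concrete illustration, the following PyTorch-style sketch computes Eq. (1); the tensor shapes follow the definitions above, while the function names and the use of pre-computed center indices are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch of Eq. (1). E_ref, E_cur: flattened frame embeddings (hw, c);
# ref_centers, cur_centers: hypothetical index tensors with the flattened
# center locations of the M reference and N current instances.
import torch
import torch.nn.functional as F


def pixel_correspondence(E_cur, E_ref):
    # C_pix[i, k]: similarity of pixel i in the current frame to pixel k in
    # the reference frame, normalized over the reference dimension.
    return F.softmax(E_cur @ E_ref.T, dim=-1)                 # (hw, hw)


def instance_correspondence(E_cur, E_ref, cur_centers, ref_centers):
    # Instance embeddings are read out of the frame embeddings at the
    # instance centers, so C_inst is effectively a sub-matrix of C_pix.
    e_cur, e_ref = E_cur[cur_centers], E_ref[ref_centers]     # (N, c), (M, c)
    return F.softmax(e_cur @ e_ref.T, dim=-1)                 # (N, M)


# Shape check with random data: hw = 64 * 64, c = 128, N = 2, M = 3.
E_ref, E_cur = torch.randn(64 * 64, 128), torch.randn(64 * 64, 128)
C_pix = pixel_correspondence(E_cur, E_ref)
C_inst = instance_correspondence(E_cur, E_ref,
                                 torch.tensor([10, 200]),
                                 torch.tensor([11, 199, 300]))
```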

Feature Interaction. Due to its advantage in capturing long-range dependencies, the Transformer [56] is an intuitive choice for enhancing the original feature representations \(\{\textbf{F}_{\textrm{ref}}, \textbf{F}_{\textrm{cur}}\}\). However, it could lead to a huge memory cost when dealing with high-resolution feature maps, because the memory consumption grows quadratically with the length of the input sequence. To alleviate this problem, we replace the full attention with the more memory-efficient deformable attention [88]. For more accurate correspondence, the enhanced feature maps are upsampled by 2\(\times \) to obtain high-resolution embeddings at stride 8.

$$\begin{aligned} \{\textbf{E}_{\textrm{ref}}, \textbf{E}_{\textrm{cur}}\}=\textrm{Upsample}(\textrm{Attention}(\textbf{F}_{\textrm{ref}},\textbf{F}_{\textrm{cur}})) \end{aligned}$$
(2)
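To make the interaction step concrete, here is a simplified PyTorch sketch of Eq. (2). Standard full attention is used as a stand-in for the memory-efficient deformable attention [88] adopted in the paper (full attention is also evaluated in the ablations), and the module name and hyper-parameters are illustrative assumptions.

```python
# Simplified sketch of Eq. (2): joint attention over the two frames' stride-16
# features, followed by 2x upsampling to stride-8 embeddings. Full attention
# replaces the deformable attention used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureInteraction(nn.Module):
    def __init__(self, c=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)

    def forward(self, F_ref, F_cur):                     # both (c, h, w)
        c, h, w = F_cur.shape
        # Concatenate the flattened tokens of both frames so attention can
        # exchange information across frames.
        tokens = torch.cat([F_ref.flatten(1), F_cur.flatten(1)], dim=1).T
        tokens, _ = self.attn(tokens[None], tokens[None], tokens[None])
        ref, cur = tokens[0, :h * w], tokens[0, h * w:]  # (hw, c) each

        def up(x):  # reshape back to a map and upsample 2x
            return F.interpolate(x.T.reshape(1, c, h, w), scale_factor=2,
                                 mode="bilinear", align_corners=False)

        return up(ref), up(cur)                          # (1, c, 2h, 2w) each


interact = FeatureInteraction(c=256)
E_ref, E_cur = interact(torch.randn(256, 20, 20), torch.randn(256, 20, 20))
```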

Loss. An ideal embedding should work well for both propagation (SOT, VOS) and association (MOT, MOTS). For SOT\( { \& }\)VOS, although there is no human-annotated label for the dense correspondence between frames, the embedding can be supervised by the difference between the propagated result \(\mathbf {\widetilde{T}}_{\textrm{cur}}\) and the ground-truth target map \(\textbf{T}_{\textrm{cur}}\). Specifically, the target map \(\textbf{T}\) has shape \(hw\times {1}\); regions where the tracked target exists are equal to one and all other regions are equal to zero. During the propagation, the pixel-wise correspondence \(\textbf{C}_{\textrm{pix}}\) transforms the reference target map \(\textbf{T}_{\textrm{ref}}\) into the estimate of the current target map \(\mathbf {\widetilde{T}}_{\textrm{cur}}\).

$$\begin{aligned} \mathbf {\widetilde{T}}_{\textrm{cur}}(i,j) = \sum _k \textbf{C}_{\textrm{pix}}(i,k) \cdot \textbf{T}_{\textrm{ref}}(k,j) \end{aligned}$$
(3)

Besides, for MOT and MOTS, the instance-level correspondence can be learned with a standard contrastive learning paradigm. Specifically, assume that instance i from the current frame is matched with instance j from the reference frame; then the corresponding ground-truth matrix \(\textbf{G}\) should satisfy

$$\begin{aligned} \textbf{G}_{i,k}=\left\{ \begin{array}{cc}0&{}\ \ k\ne j\\ 1&{}\ \ k=j\\ \end{array}\right. \end{aligned}$$
(4)

Finally, the unified embedding can be optimized end-to-end by Dice Loss [38] for SOT\( { \& }\)VOS or Cross-Entropy Loss for MOT\( { \& }\)MOTS.

$$\begin{aligned} \textbf{L}_{\textrm{corr}}=\left\{ \begin{array}{cc}\textrm{Dice}(\widetilde{\textbf{T}}_{\textrm{cur}},\textbf{T}_{\textrm{cur}})&{}\mathrm {task\ in\ \{SOT,VOS\}}\\ \textrm{CrossEntropy}(\textbf{C}_{\textrm{inst}},\textbf{G})&{}\ \ \ \mathrm {task\ in\ \{MOT,MOTS\}}\\ \end{array}\right. \end{aligned}$$
(5)
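The correspondence supervision of Eqs. (3)-(5) can be written compactly as below; the Dice formulation and the use of a negative log-likelihood on the softmax-normalized \(\textbf{C}_{\textrm{inst}}\) are common choices assumed for this sketch, not necessarily the exact implementation.

```python
# Hedged sketch of Eqs. (3)-(5).
import torch
import torch.nn.functional as F


def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)


def corr_loss_sot_vos(C_pix, T_ref, T_cur):
    # Eq. (3): propagation is a matrix product, (hw, hw) x (hw, 1) -> (hw, 1).
    T_cur_pred = C_pix @ T_ref
    return dice_loss(T_cur_pred, T_cur)


def corr_loss_mot_mots(C_inst, gt_index):
    # gt_index[i] = j means detection i matches reference trajectory j, i.e.
    # row i of the one-hot matrix G in Eq. (4). C_inst is already
    # softmax-normalized, so a negative log-likelihood on its log suffices.
    return F.nll_loss(torch.log(C_inst + 1e-8), gt_index)
```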

3.3 Unified Head

To achieve the grand unification of object tracking, another important and challenging problem is designing a unified head for the four tracking tasks. Specifically, MOT needs to detect objects of specific categories, whereas SOT needs to detect any target given in the reference frame. To bridge this gap, Unicorn introduces an extra input (called the target prior) to the original detector head [18, 52]. Without any further modification, Unicorn can easily detect the various objects needed by the four tasks with this unified head. More details about the head architecture can be found in the supplementary materials.

Target Prior. As mentioned in Sect. 3.2, given the reference target map \(\textbf{T}_{\textrm{ref}}\), the propagated target map \(\mathbf {\widetilde{T}}_{\textrm{cur}}\) can provide strong prior information about the state of the tracked target. This motivates us to take it as a target prior when detecting targets for SOT\( { \& }\)VOS. To be compatible with the original input of the detection head, we first reshape it to \(h\times {w}\times {1}\) (i.e. \(\mathbf {\widetilde{T}}^{\textrm{reshape}}_{\textrm{cur}}\in \mathbb {R}^{h\times {w}\times {1}})\). Meanwhile, when dealing with MOT\( { \& }\)MOTS, we can simply set this prior to zero. Formally, the target prior \(\textbf{P}\) satisfies that

$$\begin{aligned} \textbf{P}=\left\{ \begin{array}{cc}\mathbf {\widetilde{T}}^{\textrm{reshape}}_{\textrm{cur}}&{}\mathrm {task\ in\ \{SOT,VOS\}}\\ \textbf{0}&{}\ \ \ \mathrm {task\ in\ \{MOT,MOTS\}}\\ \end{array}\right. \end{aligned}$$
(6)

Feature Fusion. The unified head takes the original FPN feature \(\textbf{F}\in \mathbb {R}^{h\times {w}\times {c}}\) and the target prior \(\textbf{P}\in \mathbb {R}^{h\times {w}\times {1}}\) as the inputs. Unicorn fuses these two inputs with a broadcast sum and passes the fused feature \(\textbf{F}^{'}\in \mathbb {R}^{h\times {w}\times {c}}\) to the original detection head. This fusion strategy has the following advantages. (1) The fused features are seamlessly compatible with the four tasks. Specifically, for MOT\( { \& }\)MOTS, the target prior is equal to zero, so the fused feature \(\textbf{F}^{'}\) degenerates back to the original FPN feature \(\textbf{F}\) to detect objects of specific classes. For SOT\( { \& }\)VOS, the target prior carries strong target information that enhances the original FPN feature and makes the network focus on the tracked target. (2) The architecture is simple, without introducing complex changes to the original detection head. Furthermore, the consistent architecture also enables Unicorn to fully exploit the pretrained weights of the original object detector.
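The switch of Eq. (6) and the broadcast-sum fusion only take a few lines; the sketch below assumes the shapes given in this section and uses a placeholder for the unchanged detection head.

```python
# Minimal sketch of Eq. (6) and the broadcast-sum fusion.
import torch


def fuse_with_target_prior(F_fpn, T_cur_pred=None):
    # F_fpn: (h, w, c) FPN feature; T_cur_pred: (hw, 1) propagated target map.
    h, w, c = F_fpn.shape
    if T_cur_pred is None:                    # MOT / MOTS: target prior is zero
        P = torch.zeros(h, w, 1)
    else:                                     # SOT / VOS: reshaped target map
        P = T_cur_pred.reshape(h, w, 1)
    return F_fpn + P                          # broadcast sum over channels


F_fpn = torch.randn(80, 80, 256)
fused_sot = fuse_with_target_prior(F_fpn, torch.rand(80 * 80, 1))
fused_mot = fuse_with_target_prior(F_fpn)     # identical to F_fpn
# outputs = detection_head(fused_sot)         # placeholder: the original head
```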

3.4 Training and Inference

Training. The whole training process is divided into two stages: SOT-MOT joint training and VOS-MOTS joint training. In the first stage, the network is optimized end-to-end with the correspondence loss and the detection loss using data from SOT\( { \& }\)MOT. In the second stage, a mask branch is added and optimized with the mask loss using data from VOS\( { \& }\)MOTS, with the other parameters fixed.

Inference. During the test phase, for SOT\( { \& }\)VOS, the reference target map is generated once on the first frame and kept fixed for the following frames. Unicorn directly picks the box or mask with the highest confidence score as the final tracking result, without any hyperparameter-sensitive post-processing such as the cosine window. Besides, Unicorn only needs to run the heavy backbone and the correspondence module once, running the lightweight head rather than the whole network N times, leading to higher efficiency. For MOT\( { \& }\)MOTS, Unicorn detects all objects of the given categories and simultaneously outputs the corresponding instance embeddings. The subsequent association is performed based on the embeddings for BDD100K and the motion model for MOT17, respectively.
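As an illustration of the association step for MOT\( { \& }\)MOTS, the sketch below matches detections to trajectories purely by embedding similarity with Hungarian matching; the threshold, the use of SciPy, and the omission of the motion model are simplifying assumptions for this example.

```python
# Illustrative embedding-based association (motion cues omitted).
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(traj_emb, det_emb, sim_thresh=0.5):
    """traj_emb: (M, c) trajectory embeddings; det_emb: (N, c) detection
    embeddings. Returns (detection_idx, trajectory_idx) matches."""
    traj = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    det = det_emb / np.linalg.norm(det_emb, axis=1, keepdims=True)
    sim = det @ traj.T                         # (N, M) cosine similarity
    rows, cols = linear_sum_assignment(-sim)   # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] > sim_thresh]
```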

4 Experiments

4.1 Implementation Details

When comparing with state-of-the-art methods, we choose ConvNeXt-Large [31] as the backbone. In the ablations, we report the results of our method with ConvNeXt-Tiny [31] and ResNet-50 [21] as the backbone. The input image size is \(800\times 1280\), and the shortest side ranges from 736 to 864 during multi-scale training. The model is trained on 16 NVIDIA Tesla A100 GPUs with a global batch size of 32. To avoid inaccurate statistics estimation, we replace all Batch Normalization [24] layers with Group Normalization [67]. The two training stages randomly sample data from SOT\( { \& }\)MOT datasets and VOS\( { \& }\)MOTS datasets, respectively. Each training stage consists of 15 epochs with 200,000 pairs of frames per epoch. The optimizer is AdamW [32] with a weight decay of \(5e^{-4}\) and a momentum of 0.9. The initial learning rate is \(2.5e^{-4}\) with a 1-epoch warm-up and a cosine annealing schedule. More details can be found in the supplementary materials. In Sects. 4.2, 4.3, 4.4 and 4.5, we compare Unicorn with task-specific counterparts on 8 tracking datasets. In each benchmark table, the best two results are highlighted. Unicorn uses the same model parameters across the four tasks.

Table 1. State-of-the-art comparison on LaSOT [17] and TrackingNet [39].

4.2 Evaluations on Single Object Tracking

We compare Unicorn with state-of-the-art SOT trackers on two popular and challenging benchmarks, LaSOT [17] and TrackingNet [39]. Both datasets evaluate tracking performance with the following measures: Success, precision (P), and normalized precision (\(P_{norm}\)). Higher is better for all of these measures.

LaSOT. LaSOT [17] is a large-scale long-term tracking benchmark, which contains 280 videos in the test set with an average length of 2448 frames. Table 1 shows that Unicorn achieves new state-of-the-art Success and Precision of 68.5% and 74.1% respectively. It is also worth noting that Unicorn surpasses the previous best global-detection-based tracker Siam R-CNN [58] by a large margin (68.5% vs 64.8%) with a much simpler network architecture and tracking strategy (directly picking the top-1 vs tracklet dynamic programming).

TrackingNet. TrackingNet [39] is a large-scale short-term tracking benchmark containing 511 videos in the test set. As reported in Table 1, Unicorn surpasses all previous methods with a Success of 83.0% and a Precision of 82.2%.

4.3 Evaluations on Multiple Object Tracking

We compare Unicorn with state-of-the-art MOT trackers on two challenging benchmarks: MOT17 [37] and BDD100K [75]. The common metrics include Multiple Object Tracking Accuracy (MOTA), Identity F1 Score (IDF1), False Positives (FP), False Negatives (FN), the percentage of Mostly Tracked Trajectories (MT) and Mostly Lost Trajectories (ML), and Identity Switches (IDS). Among them, MOTA is the primary metric, measuring the overall detection and tracking performance, while IDF1 measures the trajectory identity accuracy.

Table 2. State-of-the-art comparison on MOT17 [37] test set.
Table 3. State-of-the-art comparison on BDD100K [75] tracking validation set.

MOT17. MOT17 focuses on pedestrian tracking and includes 7 sequences in the training set and 7 in the test set. We compare Unicorn with previous methods under the private detection protocol on the MOT17 test set. Table 2 demonstrates that Unicorn achieves the best MOTA and IDF1, surpassing the previous SOTA method by 0.5% and 0.4% respectively.

BDD100K MOT. BDD100K is a large-scale dataset of visual driving scenes and requires tracking 8 categories of instances. To evaluate the average performance across the 8 classes, BDD100K additionally introduces two measures: mMOTA and mIDF1. Different from MOT17, BDD100K is annotated at only 5 FPS. The low frame rate poses difficulties for the motion models commonly used on MOT17. As shown in Table 3, Unicorn achieves the best performance, largely surpassing the previous SOTA method QDTrack [41] on the val set. Specifically, the improvement is up to 4.6% and 3.2% in terms of mMOTA and mIDF1 respectively.

Table 4. State-of-the-art comparison on the validation set of the DAVIS-2016 and the DAVIS-2017. OL: online learning, Memory: using an external memory bank.

4.4 Evaluations on Video Object Segmentation

We further evaluate the ability of Unicorn to perform VOS on DAVIS [43] 2016 and 2017. Both datasets evaluate methods with the region similarity \(\mathcal {J}\), the contour accuracy \(\mathcal {F}\), and the average of them \( \mathcal {J \& F}\).

DAVIS-16. DAVIS-16 includes 20 single-object videos in the validation set. Table 4 demonstrates that Unicorn achieves the best results among methods with bounding-box initialization, even surpassing RANet [64] and FRTM [46] with mask initialization. Meanwhile, Unicorn outperforms its multi-task counterpart SiamMask [60] by a large margin of 17.6% in terms of \( \mathcal {J \& F}\).

DAVIS-17. DAVIS-17 contains 30 videos in the validation set, and there can be multiple tracked targets in each sequence. As shown in Table 4, compared with the previous best box-initialized method Siam R-CNN [58], Unicorn achieves competitive results with a much simpler architecture. Specifically, Siam R-CNN [58] uses an extra Box2Seg network, which is completely independent of the box-based tracker without any weight sharing. In contrast, Unicorn can predict both boxes and masks with a unified head. Although there is still a gap between the performance of Unicorn and that of SOTA VOS methods with mask initialization, Unicorn can address four tracking tasks with the same model parameters, while HMMN [47] and STCN [10] can only be used for the VOS task.

Table 5. State-of-the-art comparison on the MOTS [57] test set.
Table 6. State-of-the-art comparison on the BDD100K MOTS validation set.

4.5 Evaluations on Multi-object Tracking and Segmentation

Finally, we evaluate the ability of Unicorn for MOTS on MOTS20 [57] and BDD100K MOTS [75]. The main evaluation metrics are sMOTSA and mMOTSA.

MOTS20 Challenge. MOTS20 has 4 sequences in the test set. As shown in Table 5, Unicorn achieves state-of-the-art performance, surpassing the second-best method PointTrackV2 [71] by a large margin of 3.3% on sMOTSA.

BDD100K MOTS Challenge. The BDD100K MOTS benchmark includes 37 sequences in the validation set. Table 6 demonstrates that Unicorn outperforms the previous best method PCAN [26] by a large margin (i.e. mMOTSA +2.2%, mAP +5.5%). Meanwhile, Unicorn does not use any complex design like a space-time memory or a prototypical network as in PCAN, resulting in a simpler pipeline.

4.6 Ablations and the Other Analysis

For the ablations, we choose Unicorn with ConvNeXt-Tiny [31] backbone as the baseline. The detailed results are demonstrated in Table 7.

Backbone. We implement a variant of Unicorn with ResNet-50 [21] as the backbone. Although the overall performance of this version is lower than the baseline, this variant still achieves superior performance on four tasks.

Table 7. Ablations and comparisons. Our baseline model is underlined.

Interaction. Besides the memory-efficient deformable attention [88], we compare the full attention [56] and a convolution operation, which does not exchange information between frames. Experiments show that deformable attention obtains better performance than the full attention while consuming much less memory. Moreover, the results of the convolution are lower than those of the baseline, showing the importance of interaction for accurate correspondence.

Fusion. Apart from the broadcast sum, we compare two other methods: concatenation and removing the target prior. The performance of SOT and VOS drops significantly after removing the target prior, demonstrating the importance of this design. Besides, the broadcast sum performs better than concatenation.

Single Task. We compare with training four independent models for different tasks. Experiments show that our unified model performs on-par with independently trained counterparts, while being much more parameter-efficient.

Speed. We develop a light-weight variant with a lower input resolution of 640\(\,\times \,\)1024. Experiments show that this real-time version not only achieves competitive performance but also runs in real-time at more than 20 FPS.

5 Conclusions

We propose Unicorn, a unified approach that addresses four tracking tasks using a single model with the same model parameters. For the first time, it achieves the unification of the network architecture and the learning paradigm for object tracking. Extensive experiments demonstrate that Unicorn performs on par with or better than task-specific counterparts on 8 challenging benchmarks. We hope that Unicorn can serve as a solid step towards the general vision model.