1 Introduction

A central goal of computer vision is to produce a general-purpose perception system that works robustly in the wild. Towards this ambitious goal, extending the current short category regime is one of the key milestones. As an initial effort in this direction, the large-scale image benchmark LVIS [16] was introduced and fostered significant progress in developing solid image domain solutions [24, 58, 70, 71, 76, 104]. Recently, the video benchmark TAO [12] has called for a shift from image to video, opening the new task of detecting and tracking large vocabulary objects.

Fig. 1. The Proposed Learning Framework for Large Vocabulary Tracker Training. While the current learning paradigm learns detection and tracking separately from LVIS and TAO (decoupled), our proposal takes all training data to learn detection and tracking jointly (unified). This is achieved through missing supervision hallucination.

With these new image and video datasets, LVIS and TAO, we are interested in building a strong large vocabulary video tracker. However, because annotating videos is even harder than annotating images in the large vocabulary regime, a significant gap in dataset scale and label vocabulary naturally exists between the two. Therefore, pre-training the model on images to learn large vocabularies and then fine-tuning on videos for seamless video domain adaptation is the standard learning protocol. Given this context, can the current advances in large-vocabulary detection and multi-object tracking be successfully unified into a single model? We see two main challenges for the successful marriage of the two streams: First, there are no tracking supervisions in LVIS. This leads to inconsistent learning of detection (with LVIS and TAO) and tracking (only with TAO), resulting in sub-optimal video feature representations. Second, detection supervisions in TAO are partialFootnote 1. Thus, catastrophic forgetting [45] is inevitable if one naively fine-tunes the LVIS tracker directly on TAO.

In this work, we present simple, effective, and generic methods for hallucinating missing supervisions in each dataset. Below, we describe the challenges and our solutions in turn.

First, how can we simulate tracking supervisions with only the images in LVIS? Given an image, our idea is to apply spatial jittering artifacts to mimic temporal changes in video and form a natural pair for tracking. Here, we present two new spatial jittering methods. The first is a strong zoom-in/out augmentation, whose large scale-jittering effect can effectively simulate the low sampling rate test-time inputs in large vocabulary tracking. It yields significant performance improvements over the conventional image affine augmentation [20, 48, 52, 66, 102, 105]. Our findings are also in line with recent work showing that large scale jittering is effective in image detection and segmentation [15]; here we examine this observation on video for robust large vocabulary tracker training. The second is a mosaicing augmentation [5, 94], which was originally proposed for object detection with enriched backgrounds. We extend this augmentation to combine foreground objects from different images in a class-balanced manner [16] and simulate the hard, dense test-time tracking scenarios suitable for “many object” trackers. We show that both are effective and complementary to each other.

Second, how can we fill in the missing detection supervisions in TAO? The TAO training data only partially spans the LVIS categories, and thus directly fine-tuning the LVIS tracker on TAO causes catastrophic forgetting of the absent categories. A straightforward way to avoid this issue is to learn only the tracking part of the model with TAO [51]. However, this hinders full model training and abandons all the TAO detection labels, limiting the overall performance. We instead approach this problem by combining self-training [60, 61, 63] with a teacher-student framework [21, 89]. In practice, the teacher and student are identical copies of the LVIS pre-trained model, and we freeze the weights of the teacher during training. The overall learning pipeline consists of two steps: First, given an input, we predict pseudo labels using the teacher model. The idea behind the pseudo labeling is to leverage the past knowledge acquired from LVIS and fill in the missing annotations in TAO. Second, using the augmented labels, we train the student model with both a distillation loss and the ordinary detection loss. Unlike the typical teacher-student schemes used in semi-supervised object detection studies [67, 91, 92], we introduce two new adaptations suitable for the large vocabulary learning setup. The first is using soft pseudo labels, i.e., distilling class logits directly, to engage all of the student’s classifier weights, rather than using the common one-hot (hard) pseudo labels. This is crucial, as standard hard pseudo labels tend to bias the distillation toward frequent class objects due to the inherent classifier calibration issue [11, 50]. The second is to use an MSE loss, which impacts all classifier weights equally [28, 72], rather than the standard KL-divergence loss [21]. We find that the choice of loss function is also very important for successful large vocabulary classifier distillation. Despite their simplicity, we empirically show that these adaptations greatly improve the distillation results. We also show that our proposal works well in the common vocabulary setup, e.g., COCO, and can be easily extended to new class learning scenarios, COCO \(\rightarrow \) YTVIS.

Combining all these proposals, unified learning of detection and tracking with both LVIS images and TAO videos becomes possible without forgetting any LVIS categories (see Fig. 1). Furthermore, we introduce a new regularization objective, the semantic consistency loss. It aims to prevent a common tracking failure in large vocabulary tracking, semantic flicker between similar classes. We study the efficacy of our final framework on the TAO benchmark and achieve new state-of-the-art results. Our extensive ablation studies confirm that the proposals are generic and effective.

2 Related Work

Large Vocabulary Recognition. Object categories in natural images follow a Zipfian distribution [44], and thus large vocabulary recognition is naturally tied to long-tailed recognition [17, 42, 86]. Based on this connection, many solid approaches have been introduced. Existing methods can be roughly categorized into data re-sampling and loss re-weighting. Data re-sampling methods sample rare classes more often to balance the long-tailed training distribution [8, 16]. Loss re-weighting methods adjust the loss of each data instance based on its label or statistics accumulated during training [22, 70, 71, 76]. Some approaches perform multi-staged training on top of these methods, first pre-training the model in a standard way and then fine-tuning with either data re-sampling or loss re-weighting [23, 24, 39, 58, 78, 79, 82, 98]. There are also new approaches based on data augmentation [15, 95, 97] or test-time calibration [50].

Apart from these previous efforts on images, we study the new video extension of the task [12]. We show that our proposal is generic and not tied to a specific method, data re-sampling or loss re-weighting, in successfully converting current large vocabulary detectors into large vocabulary trackers.

Multi-object Tracking. Most modern multi-object trackers [35] follow the tracking-by-detection paradigm [56]. An off-the-shelf object detector is first employed to localize all objects in each frame, and track association is then performed between adjacent frames. The main difference among existing methods lies in how they estimate the similarity between detected objects and previous tracks for the association. Representative cues include the Kalman filter [4, 84], optical flow [88], displacement regression [53, 105], and appearance similarity [3, 25, 34, 35, 43, 47, 51, 62, 68, 83, 93, 99, 100]. There are also efforts to jointly learn detection and tracking [13, 85, 101], recently with transformer-based architectures [46, 69, 96]. We note that all these methods focus on only a few object categories, such as people or vehicles, ignoring the vast majority of objects in the world.

Our work is an early attempt at extending the current short category regime of modern trackers [12, 41, 80, 107]. In this paper, we build our proposal upon the tracking-by-detection paradigm. We choose the state-of-the-art method QDTrack [51], which adopts Faster R-CNN [59] for detection and a lightweight embedding head for tracking. Tracking is learned through dense matching between quasi-dense samples on a pair of images and optimized with multiple-positive contrastive learning. Given the state-of-the-art large vocabulary detection [16, 70, 76] and multi-object tracking [51] methods, we primarily investigate the new challenges in developing a strong large vocabulary tracker.

Tracking Without Video Annotations. There is a line of recent research on self-supervised learning for tracking, using either unlabeled videos [32, 33, 38, 55, 73, 77, 81, 90] or images [14, 20, 48, 52, 66, 102, 105]. Our work belongs to the latter category. Applying a random affine augmentation to an image provides a spatially jittered version that mimics the temporal changes in video. By letting the model find the correspondence between the two images, meaningful tracking supervision can be provided [17, 48, 105]. Sio et al. [66] show that an image and any cropped region of it can produce a similar effect. Zheng et al. [102] extend this idea to incorporate only the foreground objects in the cropped regions for stable training.

In this work, we explore this general idea under the more specific large vocabulary tracking setting. First, we note that conventional motion cues are not applicable for large-vocabulary trackers, as the inputs are temporally distant (1 FPS) due to annotation difficulties and natural videos exhibit severe camera movements. This motivates us to make the tracker’s visual feature matching more discriminative. To this end, we present a strong zoom-in/out augmentation that not only simulates low sampling rate inputs but also includes a large scale-jittering effect [15], which is known to be effective for image-domain vision tasks. Second, we recast the image mosaicing augmentation [5], initially proposed for robust object detection with enriched backgrounds [94], to simulate test-time dense tracking in the large vocabulary setting. We show that both are complementary in providing discriminative tracking supervisions for this task.

Catastrophic Forgetting. The phenomenon wherein neural networks forget how to solve past tasks due to exposure to new tasks is known as catastrophic forgetting [45]. It occurs because the model weights that contain important information for the old task are over-written by information relevant to the new one. While catastrophic forgetting can occur in various scenarios, most existing efforts focus on the class-incremental learning setup, where new object categories are added phase-by-phase, in image classification [1, 9, 29, 36, 54, 57, 64, 87]. There are also a few approaches tackling incremental object detection [30, 65, 75, 103].

We target a different setup: transfer learning from images to videos without forgetting. Specifically, we aim to train the model on images covering the entire set of evaluation categories and then fine-tune it on videos, which only partially cover the evaluation categories, without forgetting. While many current video models [27, 49, 93] are trained in this way for generic feature learning, the label difference between images and videos has rarely been studied. We study this issue, as this is a practical setup for training large vocabulary trackers using both images and videos.

Fig. 2. Overview of the proposed learning framework. The colored objective functions are supervisions generated by our proposals. (Color figure online)

3 Proposed Method

We introduce a general learning framework that allows joint learning of detection and tracking from all training data, LVIS and TAO, for robust large vocabulary tracking. An overview of our pipeline is shown in Fig. 2. We first present how we can learn tracking from images through zoom-in/out and mosaicing augmentations in Sect. 3.1. We then describe how we avoid catastrophic forgetting when the fine-tuning videos cover a smaller label vocabulary than the pre-training images in Sect. 3.2. Finally, we present a new regularization loss term, the semantic consistency loss, for preventing semantic flicker in Sect. 3.3.

3.1 Learn to Track in LVIS

Our approach is straightforward. An original image and a transformed image with the spatial jittering artifacts can form a natural input pair for tracking. For the jittering artifacts, we present two new augmentations, zoom-in/out and mosaicing (see Fig. 2). Note that tracking annotations come for free as we know the exact transformation relationship between the images. We assign the same unique track-id to the same object in the transformed image.

Strong Zoom-in/Out Track. Due to the annotation difficulty of the large-vocabulary tracking dataset, the train and test time inputs are temporally sparse, i.e., sampled at a low frame rate, which renders conventional motion cues inapplicable and forces the tracker to rely on pure visual feature matching. To make the visual feature matching more discriminative, and to effectively simulate the test-time low sampling rate inputs, we present a strong zoom-in/out augmentation.

It is mainly composed of scaling and cropping operations, which vary the scale and position of the objects. Specifically, for an image \(\textrm{I}\), we generate an input pair, \(\mathrm {I_{t}}\) and \(\mathrm {I_{t+\tau }}\), by applying the \(\mathrm {scale\_and\_crop(\cdot )}\) function to the image twice. In practice, it scales the image by up to 2 times and crops it such that the original bounding boxes retain an IoU of at least 0.4, avoiding heavy object truncation and ensuring stable tracker training. Prior works either adopt a standard random affine transformation [17, 48, 105] or cropping without scaling [66, 102], which generally provide a weak scale-jittering effect. Instead, we focus on enlarging the scaling effect and show that our proposal significantly outperforms these baselines.
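Below is a minimal sketch of how such a strong zoom-in/out pair could be generated. The function name scale_and_crop, the retry loop, and the interpretation of the IoU constraint as intersection-over-box-area are our own assumptions for illustration, not the authors' implementation.

```python
import random
import numpy as np
import cv2  # OpenCV for resizing; any image library works

def visible_fraction(box, crop):
    """Fraction of a box (x1, y1, x2, y2) that remains inside the crop window."""
    x1, y1 = max(box[0], crop[0]), max(box[1], crop[1])
    x2, y2 = min(box[2], crop[2]), min(box[3], crop[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0

def scale_and_crop(image, boxes, scale_range=(1.0, 2.0), min_visible=0.4, out_hw=(640, 640)):
    """Strong zoom-in/out: scale by up to 2x, then crop a window that keeps every
    ground-truth box at least `min_visible` visible (re-sampling the crop otherwise)."""
    s = random.uniform(*scale_range)
    h, w = image.shape[:2]
    image = cv2.resize(image, (int(w * s), int(h * s)))
    boxes = boxes * s
    ch, cw = out_hw
    crop = (0, 0, min(cw, image.shape[1]), min(ch, image.shape[0]))  # fallback crop
    for _ in range(50):  # retry to avoid heavy object truncation
        x0 = random.randint(0, max(0, image.shape[1] - cw))
        y0 = random.randint(0, max(0, image.shape[0] - ch))
        cand = (x0, y0, x0 + cw, y0 + ch)
        if all(visible_fraction(b, cand) >= min_visible for b in boxes):
            crop = cand
            break
    patch = image[crop[1]:crop[3], crop[0]:crop[2]]
    shifted = boxes - np.array([crop[0], crop[1], crop[0], crop[1]])
    return patch, shifted

# Calling scale_and_crop twice on the same LVIS image yields the tracking pair
# (I_t, I_{t+tau}); corresponding boxes keep the same track id for free.
```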

Mosaicing Track. While the zoom-in/out augmentation is already effective in providing tracking supervisions, it is limited to tracking only a few objects due to the federated annotations of LVIS [16]. To resolve this issue, we propose combining multiple images and performing tracking with the increased number of foreground objects. We implement our idea by extending the image mosaicing augmentation [5], which stitches four random training images at certain ratios. While it was originally presented for object detection with enriched backgrounds [94], we recast it to simulate hard, dense tracking scenarios in large vocabulary tracking. In practice, four random images, \(\{\mathrm {I_{a}, I_{b}, I_{c}, I_{d}}\}\), are sampled with RFS (Repeat Factor Sampling) [16] to maintain class balance. Then, image stitching followed by a random affine (with large scale jittering within a range of 0.1 to 2) and a crop is applied. We summarize this procedure as \(\mathrm {mosaic(\cdot )}\). The tracking pair can then be obtained by applying the \(\mathrm {mosaic(\cdot )}\) function to the sampled images twice. However, we observe that the unnatural layout of mosaiced pairs introduces a train/test-time inconsistency. To this end, we propose to sample tracking input pairs from the two augmentations, zoom-in/out and mosaicing, with equal probability during training, which we empirically confirm works well in practice (see the sketch below).
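The sketch below illustrates one possible form of the mosaicing pair and the 50/50 mixed sampling; it reuses the scale_and_crop helper from the previous sketch. The mosaic layout, the omission of the random affine (we keep only the scale jitter), and the sampler interface are simplifying assumptions, not the exact procedure.

```python
import random
import numpy as np
import cv2

def mosaic(samples, tile=640, scale_range=(0.1, 2.0)):
    """Stitch four (image, boxes, track_ids) samples into a 2x2 canvas, then apply
    scale jittering in [0.1, 2.0]. The authors additionally apply a random affine
    and crop; we omit them here for brevity."""
    canvas = np.zeros((2 * tile, 2 * tile, 3), dtype=np.uint8)
    all_boxes, all_ids = [], []
    for k, (img, boxes, ids) in enumerate(samples):
        r, c = divmod(k, 2)
        sy, sx = tile / img.shape[0], tile / img.shape[1]
        canvas[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile] = cv2.resize(img, (tile, tile))
        all_boxes.append(boxes * [sx, sy, sx, sy] + np.array([c, r, c, r]) * tile)
        all_ids.extend(ids)
    s = random.uniform(*scale_range)
    canvas = cv2.resize(canvas, None, fx=s, fy=s)
    return canvas, np.concatenate(all_boxes) * s, all_ids

def sample_track_pair(dataset, rfs_sampler, index):
    """Mixed sampling: zoom-in/out pair or mosaic pair with equal probability."""
    if random.random() < 0.5:
        img, boxes, ids = dataset[index]
        return scale_and_crop(img, boxes) + (ids,), scale_and_crop(img, boxes) + (ids,)
    idxs = [index] + [next(rfs_sampler) for _ in range(3)]  # class-balanced (RFS) sampling
    samples = [dataset[i] for i in idxs]
    return mosaic(samples), mosaic(samples)                 # same four images, two jitters
```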

With our proposal, the model can receive tracking supervisions for all LVIS object categories. The tracking objective function is adopted from QDTrack [51] (see Fig. 2-top), and we call this model the LVIS-Tracker. While the model is trained only on the LVIS dataset, it already outperforms the previous state-of-the-art tracker (trained with the standard decoupled learning scheme) by a significant margin (see Table 2a).

3.2 Learn to Unforget in TAO

Due to the fundamental annotation difficulties in videos, image datasets are in general larger in scale and richer in taxonomy. Therefore, pre-training the model on images to acquire generic features and fine-tuning on videos for target domain adaptation has become a common protocol for obtaining satisfactory performance in various video tasks [27, 49, 93]. This also applies to training large vocabulary video trackers, where we first learn a large vocabulary from LVIS images and then adapt to the evaluation domain with TAO videos. However, as TAO only partially spans the full LVIS vocabulary, a naive transfer learning scheme results in catastrophic forgetting.

Here, our goal is to keep the ability to detect previously seen object categories while also adapting to learn from the new video labels. We mainly focus on catastrophic forgetting in the detector, as the tracking head is learned in a category-agnostic manner. We detail the proposal using the standard two-staged Faster R-CNN detector with an FPN backbone [40, 59]. Without loss of generality, the proposal can be extended to multi-staged architectures [6, 7, 10, 74], where we apply it to each RCNN head and average the losses. The main issue is the missing annotations for seen, known object categories during image-to-video transfer learning. Since they are not annotated, we can neither provide detection supervision for them nor prevent them from being treated as background. This perturbs the pre-trained classifier boundaries of both the RPN and the RCNN, leading to catastrophic forgetting. We remedy this issue with a pseudo-label guided teacher-student framework.

Our key idea is intuitive. The pre-trained model already has sufficient knowledge to detect the seen, known categories. Based on this fact, we first fill in the missing annotations by pseudo-labeling the input. We adopt the basic pseudo-labeling scheme with a confidence threshold of 0.3. Redundant pseudo labels that highly overlap with the current labels are filtered out with NMS. With these augmented labels, we 1) design a teacher-student network to provide (soft) supervisions, i.e., class logits, for preserving the past knowledge, and 2) update the incorrect background samples, i.e., negatives, in the RPN and RCNN to prevent seen objects from being treated as background (see Fig. 2-bottom). Using soft class logits is important for large vocabulary classifier distillation, as hard pseudo labels bias the operation toward frequent class objects. Moreover, we use an MSE loss instead of the Kullback-Leibler (KL) divergence loss [21] for logit matching. This is because the MSE loss treats all classes equally and thus allows rare classes with low probability to also be updated properly [72]. These two new adaptations lead to successful distillation of the large vocabulary classifier’s previous knowledge (see Table 2b).
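A minimal sketch of the pseudo-label augmentation step is given below, assuming a frozen teacher that returns per-box scores and class logits; the 0.3 score threshold follows the text, while the 0.5 overlap threshold and the interface are illustrative assumptions.

```python
import torch
from torchvision.ops import nms, box_iou

@torch.no_grad()
def augment_labels(teacher, image, gt_boxes, score_thr=0.3, overlap_thr=0.5):
    """Fill in missing TAO annotations with the frozen LVIS-pretrained teacher."""
    boxes, scores, logits = teacher(image)                 # assumed interface of the teacher
    keep = scores > score_thr                              # basic confidence thresholding
    boxes, scores, logits = boxes[keep], scores[keep], logits[keep]
    keep = nms(boxes, scores, overlap_thr)                 # drop redundant detections
    boxes, scores, logits = boxes[keep], scores[keep], logits[keep]
    if gt_boxes.numel() > 0:                               # drop boxes already annotated
        keep = box_iou(boxes, gt_boxes).max(dim=1).values < overlap_thr
        boxes, logits = boxes[keep], logits[keep]
    # Augmented boxes define distillation positives and forbid these regions from
    # being sampled as background; soft logits serve as distillation targets.
    return torch.cat([gt_boxes, boxes]), boxes, logits
```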

Teacher-Student Framework Setup. To effectively retain the previous knowledge, we design a teacher-student framework. We first make identical copies of the image pre-trained model, teacher (T) and student (S). The teacher model (T) is frozen to keep the previous knowledge and guide the student. The student model (S) adapts to the new domain with incoming video labels (via detection loss) and also mimics the teacher model to preserve the past information (via distillation loss). We detail the components in the following.

RPN Knowledge Distillation Loss. The RPN takes multi-level features from the ResNet feature pyramid [40]. In particular, each feature map is embedded through the convolution layer, followed by two separate layers, one for objectness classification and the other for proposal regression. We collect the outputs of both heads from the teacher and student to compute RPN distillation loss, which is defined as \( L ^\textrm{RPN}_\textrm{KD} = \frac{1}{ N _{cls}}{\sum _{i=1} L _{cls}(u_{i}, u_{i}^{*})} + \frac{1}{ N _{reg}}{\sum _{i=1} L _{reg}(v_{i}, v_{i}^{*})}. \) Here, i is the index of an anchor. \(u_{i}\) and \(u_{i}^{*}\) are the mean subtracted objectness logits obtained from the student and the teacher, respectively. \(v_{i}\) and \(v_{i}^{*}\) are four parameterized coordinates for the anchor refinement obtained from the student and teacher, respectively. \( L _{cls}\) and \( L _{reg}\) are MSE loss and smooth L1 loss, respectively. Here, we note that \( L _{reg}\) is only computed for the positive anchors that have an IoU larger than 0.7 with the augmented ground-truth boxes. \( N _{cls} (=256)\) and \( N _{reg}\) are the effective number of anchors for the normalization.

RCNN Knowledge Distillation Loss. We perform RoIAlign [18] on top-scoring proposals from RPN, extracting the region features from each feature pyramid level. Each region feature is embedded through two FC layers, one for classification and the other for bounding box regression. We collect the outputs of both heads from the teacher and student to compute RCNN distillation loss, which is defined as \( L ^\textrm{RCNN}_\textrm{KD} = \frac{1}{ M _{cls}}{\sum _{j=1} L _{cls}(p_{j}, p_{j}^{*})} + \frac{1}{ M _{reg}}{\sum _{j=1} L _{reg}(t_{j}, t_{j}^{*})}. \) Here, j is the index of a proposal. \(p_{j}\) and \(p_{j}^{*}\) are the mean subtracted classification logits obtained from the student and the teacher, respectively. \(t_{j}\) and \(t_{j}^{*}\) are four parameterized coordinates for the proposal refinement obtained from the student and teacher, respectively. \( L _{cls}\) and \( L _{reg}\) are MSE loss and smooth L1 loss, respectively. We only impose \( L _{reg}\) for the positive proposals that have an IoU larger than 0.5 with the augmented ground-truth boxes. \( M _{cls} (=512)\) and \( M _{reg}\) are the effective number of proposals for the normalization.
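The sketch below shows the distillation pattern shared by both heads: MSE between mean-subtracted logits for all samples, plus smooth L1 between box deltas for positives only. The tensor shapes, the positive mask, and the mean reduction (the paper normalizes by the effective sample counts \( N \)/\( M \)) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def head_kd_loss(student_logits, teacher_logits, student_deltas, teacher_deltas, pos_mask):
    """student_/teacher_logits: (N, C) objectness (RPN) or class (RCNN) logits;
    *_deltas: (N, 4) box refinements; pos_mask: (N,) bool, anchors/proposals with
    IoU > 0.7 (RPN) or > 0.5 (RCNN) against the augmented ground-truth boxes."""
    s = student_logits - student_logits.mean(dim=1, keepdim=True)  # mean-subtracted logits
    t = teacher_logits - teacher_logits.mean(dim=1, keepdim=True)
    cls_loss = F.mse_loss(s, t.detach())              # MSE touches every class equally
    if pos_mask.any():
        reg_loss = F.smooth_l1_loss(student_deltas[pos_mask],
                                    teacher_deltas[pos_mask].detach())
    else:
        reg_loss = student_deltas.sum() * 0.0         # keep the graph valid with no positives
    return cls_loss + reg_loss

# L_KD = head_kd_loss(...RPN tensors...) + head_kd_loss(...RCNN tensors...)
```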

Fig. 3. (a) Two standard image to video transfer learning setups. A naive transfer learning from images to videos typically leads to catastrophic forgetting due to the missing annotations in the video. We present a generic teacher-student scheme that works in both scenarios. (b) Two-step approach for the COCO \(\rightarrow \) YTVIS transfer learning setup. We first learn new object classifier weights with the pre-trained classifier as a fixed anchor and then fine-tune the whole classifier through the proposed teacher-student scheme. The red-dotted line along the circle in the set relationship figure indicates the training data used in each stage. The shape figures (e.g., square, triangle) and the separating line denote class instances and the associated classifier. (Color figure online)

Correcting Negatives in Computing the Detection Loss. We avoid sampling anchors or proposals that have significant IoU overlap with the augmented ground-truth boxes as background (\({>}\)0.7 for the RPN and \({>}\)0.5 for the RCNN). We note that positives are sampled only from the provided original ground-truth labels. This is because detectors, especially large vocabulary detectors, struggle to predict precise labels [11, 50] even though they are good at recalling objects. We verify this empirically in the experiments.
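A minimal sketch of the negative correction, assuming boolean sampling masks over candidate anchors/proposals; the function and argument names are placeholders.

```python
import torch
from torchvision.ops import box_iou

def correct_negatives(candidates, neg_mask, augmented_boxes, iou_thr):
    """Remove background (negative) candidates that overlap an augmented box above
    the head-specific threshold (0.7 for RPN anchors, 0.5 for RCNN proposals)."""
    if augmented_boxes.numel() == 0:
        return neg_mask
    max_iou = box_iou(candidates, augmented_boxes).max(dim=1).values
    return neg_mask & (max_iou <= iou_thr)   # seen objects no longer count as background
```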

Extension to Other Transfer Learning Setups. COCO to YTVIS is another important transfer learning setup (see Fig. 3-(a)). This is more challenging than LVIS to TAO, as the superset-subset relationship does not hold and new object categories must be learned. To deal with this new pattern, we take a two-step approach (see Fig. 3-(b)). First, we adapt the RCNN classifier of the pre-trained model, increasing the number of output channels to accommodate the newly added classes, and train on the videos that contain new object categories, \(\textrm{YTVIS} - \textrm{COCO}\). In practice, we freeze the original detector, keeping the past information intact, and update only the newly added weight matrices. The key idea is to use the original pre-trained weights as an anchor and update the newly added weights to be compatible with them. Second, after sufficient training of the new weights, we unfreeze the original detector and update all the weights on the remaining videos, \(\textrm{YTVIS} \cap \textrm{COCO}\), using the presented teacher-student scheme.
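The first step of the COCO \(\rightarrow \) YTVIS recipe can be sketched as widening the pre-trained RCNN classifier while freezing the original rows as an anchor. The handling below is generic (it ignores the detector's background column and the box regressor) and the names are assumptions.

```python
import torch
import torch.nn as nn

def expand_classifier(old_fc: nn.Linear, num_new_classes: int) -> nn.Linear:
    """Copy the pre-trained class weights into a wider classifier and freeze them,
    so only the newly added rows are trained in step one."""
    new_fc = nn.Linear(old_fc.in_features, old_fc.out_features + num_new_classes)
    with torch.no_grad():
        new_fc.weight[:old_fc.out_features].copy_(old_fc.weight)
        new_fc.bias[:old_fc.out_features].copy_(old_fc.bias)

    def freeze_old_rows(grad, n=old_fc.out_features):
        grad = grad.clone()
        grad[:n] = 0                      # anchor: pre-trained rows receive no updates
        return grad

    new_fc.weight.register_hook(freeze_old_rows)
    new_fc.bias.register_hook(freeze_old_rows)
    return new_fc
```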

3.3 Regularizing Semantic Flickering

One of the common tracking failures in large vocabulary tracking is due to semantic flicker between similar object categories [12]. To cope with this issue, we regularize the model during training with a new objective function, the semantic consistency loss. The proposal is motivated by the temporal consistency loss [2, 26, 31, 37], which enforces the outputs of the model for corresponding pixels (or patches) in video frames to be consistent. It is often used in video processing tasks to ensure temporal smoothness of the output at the pixel level. Our proposal extends this idea from pixels to instances; we enforce the class predictions of the same instance in two different frames to be equivalent. In practice, we forward the ground-truth bounding boxes of the same instance in two different frames through the RCNN head. The mean subtracted classification logits, p, are used for the consistency regularization as \( L _\textrm{Semcon} = |p^{t} - p^{t+\tau }|_{2}. \) Here, \(p^{t}\) and \(p^{t+\tau }\) denote the logits of the same instance in two different frames, \(I_{t}\) and \(I_{t+\tau }\).
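A minimal sketch of the semantic consistency loss, assuming the RCNN head returns class logits for the ground-truth boxes of the same \(K\) instances in both frames; the per-instance averaging is our choice.

```python
import torch

def semantic_consistency_loss(logits_t: torch.Tensor, logits_t_tau: torch.Tensor) -> torch.Tensor:
    """logits_t, logits_t_tau: (K, C) class logits of the same K instances in frames t and t+tau."""
    p_t = logits_t - logits_t.mean(dim=1, keepdim=True)          # mean-subtracted logits
    p_tau = logits_t_tau - logits_t_tau.mean(dim=1, keepdim=True)
    return (p_t - p_tau).norm(p=2, dim=1).mean()                 # L2 distance per instance
```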

3.4 Unified Learning

Within our proposed learning framework (see Fig. 2), we can train the whole video model, learning detection and tracking jointly, using all available image and video datasets. The final objective function can be summarized as

$$\begin{aligned} \begin{aligned} L = \lambda _\textrm{1} L _\textrm{Det} + \lambda _\textrm{2} L _\textrm{Track} + \lambda _\textrm{3} L _\textrm{KD} + \lambda _\textrm{4} L _\textrm{Semcon}, \end{aligned} \end{aligned}$$
(1)

which consists of four loss terms in total. The detection (\( L _\textrm{Det}\)) and tracking losses (\( L _\textrm{Track}\)) are adopted from [59] and [51]. Note that the \( L _\textrm{KD} (= L ^\textrm{RPN}_\textrm{KD} + L ^\textrm{RCNN}_\textrm{KD}) \) and \( L _\textrm{Semcon}\) are used only when fine-tuning on the videos.
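For completeness, Eq. (1) can be assembled as below; the \(\lambda \) values are placeholders (not taken from the paper), and the KD and consistency terms are only added during video fine-tuning, as stated above.

```python
def total_loss(l_det, l_track, l_kd_rpn=None, l_kd_rcnn=None, l_semcon=None,
               lambdas=(1.0, 1.0, 1.0, 1.0)):   # placeholder weights, not from the paper
    l1, l2, l3, l4 = lambdas
    loss = l1 * l_det + l2 * l_track
    if l_kd_rpn is not None:                    # video fine-tuning only
        loss = loss + l3 * (l_kd_rpn + l_kd_rcnn) + l4 * l_semcon
    return loss
```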

4 Experiments

In this section, we conduct extensive experiments to analyze our methods. We investigate the results mainly in two aspects, image-level prediction and cross-frame association, which are reflected in the BBox AP and Track AP [12], respectively. Considering the task difficulty, we mainly focus on Track AP50, Track AP75, and their average. For the TAO test set, we provide Track AP*, the full Track AP averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05. First, we study the impact of unified learning on the TAO dataset (Sect. 4.1): we consistently outperform the current decoupled learning paradigm by healthy margins across various models and push the state-of-the-art performance significantly. Second, to investigate the importance of the major components of our proposals, we provide ablation studies on the TAO validation set (Sect. 4.2). Lastly, we evaluate our teacher-student scheme on two representative image-video transfer learning scenarios, LVIS \(\rightarrow \) TAO and COCO \(\rightarrow \) YTVISFootnote 2 (Sect. 4.3). In the following, we provide experimental setups, evaluation protocols, and results for each section. More details are in the supplementary materials.

4.1 Main Results

Building on the state-of-the-art tracking-by-detection framework [51], we instantiate various large vocabulary trackers. Specifically, we consider two important detection architectures, two-staged (Faster R-CNN [59]) and multi-staged (CenterNet2 [106]), and three different long-tailed learning methods, Repeat Factor Sampling (RFS) [16], Equalization Loss V2 (EQLv2) [70], and Seesaw Loss [76]. All models use the same ResNet-101 [19] backbone with a feature pyramid [40], following previous works [12, 51]. Based on these baseline models, we compare our learning framework with the current standard learning protocol, decoupled learning. The comparison is shown in Table 1. We observe that our unified learning scheme consistently outperforms the current decoupled learning paradigm across various models, showing the strong generalizability of the proposal. With our method, we push the state-of-the-art performance significantly, achieving 21.6 and 20.1 Track AP50 on TAO-val and TAO-test, respectively.

Table 1. Our learning framework couples well with different model architectures and learning methods. All baseline scores are obtained after decoupled training, i.e., training the detector on LVIS and the tracker on TAO. FasterRCNN-RFS* is a re-implementation of the baseline in [51].

4.2 Ablation Studies

Impact of Image Spatial Jitterings. The results are presented in Table 2a. Compared to the standard affine transformation [17, 48, 105] or simple cropping without scaling [66, 102], the presented strong zoom-in/out and mosaicing provide large Track AP improvements. This indicates that both the low sampling rate input simulation (with large scale-jittering) and the dense tracking simulation (with mosaicing) enable more accurate large vocabulary object association at test time. To concretely investigate the scaling effect of zoom-in/out, we also provide its variant with small scale-jittering, Zoom-in/out*, and confirm that large scale-jittering [15] is indeed important for the performance. We notice that the mosaicing augmentation drops the Box AP. We conjecture this happens due to the train/test-time inconsistency of the input pairs. To this end, we propose forming a tracking pair from the two different augmentations with equal probability. We find that this mixed sampling strategy provides the best Track AP.

Table 2. (a) Zoom-in/out* and Zoom-in/out denote zoom-in/out augmentation with scaling range of [0.8, 1.25] and [0.1, 2.0], respectively. (b) Pseudo Labeled Training denotes the standard (hard) pseudo label-based training.

Impact of Teacher-Student Framework. In Table 2b, we study the impact of the key proposals in the teacher-student framework. As baselines, we provide the Naive-ft and Vanilla Teacher-Student schemes. Naive-ft indicates fine-tuning on TAO videos without any regularization against forgetting, which results in a significant performance drop. The Vanilla Teacher-Student scheme samples distillation targets only from the original ground-truth labels, and no negative correction is performed. While it preserves past knowledge to some extent, its performance is still worse than that of the LVIS-tracker. The vanilla scheme starts to improve over the LVIS-tracker only when our proposals are added. This implies that pseudo labeling is essential, and that 1) keeping the past knowledge of seen objects (by sampling distillation targets from the augmented labels) and 2) preventing seen objects from being treated as background (by correcting negatives using the augmented labels) are the keys to avoiding catastrophic forgetting.

One may wonder if the standard (hard) pseudo-labeling approach can directly preserve the previous knowledge, as typical teacher-student schemes do [67, 91, 92]. However, as shown in the results, we instead observe results inferior to the baseline. The large vocabulary classifier fundamentally suffers from the confidence calibration issue [11, 50], as it is trained on long-tailed, class-imbalanced data. This results in classifier bias: predictions are made mainly for frequent object categories, so rare objects are missed in the one-hot hard pseudo labels. In contrast, the (soft) pseudo labels essentially affect all classes. Furthermore, we suggest employing the MSE loss rather than the standard KL loss [21] as the distillation objective. As the MSE loss treats all classes equally, the gradient is not attenuated for rare classes. A recent study also reveals that the MSE loss offers better generalization capability than the KL loss due to the direct matching of logits [28].

Impact of Semantic Consistency Loss. Finally, we study the impact of the semantic consistency loss. It regularizes the model’s class logits of the same instance in different frames to be equivalent. In Table 2b, we observe a meaningful improvement in Track AP. This implies that regularizing semantic flicker is indeed effective for large vocabulary object tracking.

4.3 Image to Video Transfer Learning

Here, we evaluate our teacher-student scheme on two representative image to video transfer learning setups (see Fig. 3). In the LVIS \(\rightarrow \) TAO setup, we pre-train the FasterRCNN-RFS tracker on LVIS (with 482 categories) and fine-tune it on TAO (with 216 categories). We evaluate the model on TAO-val with the Track AP metric. In the COCO \(\rightarrow \) YTVIS setup, we pre-train Mask R-CNN [18] on COCO, transfer the weights to MaskTrack R-CNN [93], add new randomly initialized classifier weights to accommodate the newly added classes, and fine-tune on YTVIS. More details of the setup are in the supplementary materials. We evaluate the model on YTVIS-val with the Mask AP [93] metric. To quantitatively analyze whether the proposal properly preserves past knowledge and benefits from the new video labels, we report scores for OLD and NEW classes. Here, OLD indicates classes that appear only in the image pre-training stage, and NEW denotes classes that appear in the video fine-tuning stage. For each setup, we provide a naive fine-tuning baseline, which results in severe catastrophic forgetting. The results are summarized in Table 3 and Table 4.

Table 3. Teacher-student framework in LVIS \(\rightarrow \) TAO transfer learning setup. Evaluated on TAO-val.
Table 4. Teacher-Student framework in COCO \(\rightarrow \) YTVIS transfer learning setup. Evaluated on YTVIS-val.

LVIS \(\rightarrow \) TAO Transfer Learning. In this setup, all necessary vocabularies are already learned in the image pre-training stage. Therefore, we can avoid catastrophic forgetting by fine-tuning only the tracking part. However, as this updates the video model only partially, it leads to inconsistent video representations, and the performance in fact drops slightly from the baseline LVIS-tracker. In contrast, our method preserves the performance on OLD classes (preventing catastrophic forgetting) and significantly improves the NEW class performance (benefiting from labeled learning).

COCO \(\rightarrow \) YTVIS Transfer Learning. This setup is more challenging, as the model is required to achieve two goals simultaneously: learning new classes and preserving old ones. We decompose these goals and approach this setup in two steps, as described in Sect. 3.2. As shown in the results, the proposed two-step approach performs better than a direct application of the teacher-student scheme. The final performance is comparable with, and on NEW classes outperforms, the oracle setup that uses all the YTVIS training videos. This shows that our proposal is generic and effective for standard image to video transfer learning setups.

5 Conclusion

In this paper, we tackle the challenging problem of learning a large vocabulary video tracker. We present a simple learning framework that uses all LVIS images and TAO videos to jointly learn detection and tracking. Specifically, we first present two spatial jittering methods, strong zoom-in/out and mosaicing, which effectively simulate test-time large vocabulary object tracking and enable tracker training with LVIS. Second, we propose a generic teacher-student scheme that prevents catastrophic forgetting while fine-tuning image pre-trained models on videos. We show that the two new adaptations, soft labels combined with an MSE loss, are crucial for large vocabulary classifier distillation. We hope our learning framework serves as a baseline learning scheme for future large-vocabulary trackers.