Abstract
Video-Text Pre-training (VTP) aims to learn transferable representations for various downstream tasks from large-scale web videos. To date, almost all existing VTP methods are limited to retrieval-based downstream tasks, e.g., video retrieval, whereas their transfer potential on localization-based tasks, e.g., temporal grounding, is under-explored. In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks, and propose a novel Localization-oriented Video-Text Pre-training framework, dubbed LocVTP. Specifically, we perform fine-grained contrastive alignment as a complement to the coarse-grained one by a clip-word correspondence discovery scheme. To further enhance the temporal reasoning ability of the learned feature, we propose a context projection head and a temporal aware contrastive loss to perceive the contextual relationships. Extensive experiments on four downstream tasks across six datasets demonstrate that our LocVTP achieves state-of-the-art performance on both retrieval-based and localization-based tasks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimum model designs and training strategies. Code is available at https://github.com/mengcaopku/LocVTP.
M. Cao—Work done during an internship at Tencent AI Lab.
1 Introduction
Video-Text Pre-training (VTP) [4, 26, 30, 31, 39, 49, 54, 67] has attracted increasing attention with the aim of learning generic and transferable joint video-language (VL) representations. Compared to the conventional separate pre-training on each single modality, e.g., video features pre-trained on action recognition datasets (Kinetics [23], Sports-1M [22]), VTP has several advantages: 1) It leverages large-scale unlabeled narrated video data with automatically generated corresponding text data for video-text correspondence pre-training. 2) It tries to map different modality features into a shared latent space, which reduces the difficulty of cross-modal feature interaction. Thanks to these advantages, VTP has significantly improved the performance of many downstream VL tasks. For example, as illustrated in [16], the video retrieval performance using features pre-trained with the VTP method MIL-NCE [38] is much higher than that using separately pre-trained features (cf. Fig. 1a (left)).
Despite their encouraging performance, we find that most current VTP methods are applicable to limited downstream tasks, i.e., they focus on retrieval-based tasks which require video-level predictions, e.g., video retrieval [64], video captioning [45], and video question answering [21]. In contrast, there exists another mainstream family of localization-based tasks which expect more fine-grained clip-level or frame-level predictions, e.g., temporal grounding [15], action segmentation [51], action step localization [77] (cf. Fig. 1b). Unfortunately, through experiments, we find that current VTP methods generalize poorly to this type of downstream task. For example, on temporal grounding, even when pre-trained on the much larger dataset HowTo100M [39], the VTP method MIL-NCE still performs worse than the separately pre-trained counterpart (cf. Fig. 1a (right)).
In this paper, we argue that this poor transfer ability on localization-based tasks is due to the absence of two indispensable characteristics: 1) Fine-grained alignment: We contend that the alignment should be conducted at the more fine-grained clip-word level instead of the coarse-grained video-sentence level (see Footnote 1). As shown in the temporal grounding example in Fig. 2, a given query sentence may contain multiple actions (e.g., “hit the golf ball” (\(q^{s1}\)) and “bend down to pick up the ball” (\(q^{s2}\))). Thus, aligning each action (or word) to the corresponding clips (i.e., \(v^{t1}\) and \(v^{t2}\)) will help to obtain more detailed and accurate feature representations. 2) Temporal relation reasoning: We hope the clip features of a certain action can also perceive other actions in the same video. For example, for a typical golf video, action \(q^{s2}\) (“bend down to pick up the ball”) always occurs shortly after action \(q^{s1}\) (“hit the golf ball”). Thus, incorporating such temporal relationships into VTP can help to improve the temporal awareness of video features.
Based on these observations, we propose a novel video-text pre-training framework for localization tasks, dubbed LocVTP. By considering both above-mentioned characteristics, LocVTP achieves state-of-the-art performance not only on the widely studied retrieval-based tasks, but also on the less-focused localization-based tasks. Specifically, for fine-grained alignment, we extend the coarse-grained contrastive training with video-sentence alignment to a fine-grained one with clip-word alignment. Since there are no clip-word correspondence annotations in existing large-scale datasets, we utilize the latent space established by the coarse-grained contrastive learning to estimate the clip-word similarity, and then select the clip-word pairs with high similarities as positive samples. To further illustrate this, as shown in Fig. 2 (right), suppose \(\{\boldsymbol{v}^{t1}, \boldsymbol{q}^{s1}\}\) and \(\{\boldsymbol{v}^{t2}, \boldsymbol{q}^{s2}\}\) are two matched clip-word feature pairs. Semantic embeddings in each pair are mapped to be close to each other, i.e., \(\boldsymbol{v}^{t1} \leftrightarrow \boldsymbol{q}^{s1}\), \(\boldsymbol{v}^{t2} \leftrightarrow \boldsymbol{q}^{s2}\). For temporal relation reasoning, we propose a new pretext task called context warping. Here we use Fig. 2 (right) for illustration. Context warping is designed to generate a new temporally relevant clip feature \(\boldsymbol{z}^{t1}\), which imitates \(\boldsymbol{v}^{t1}\), conditioned on another clip \(\boldsymbol{v}^{t2}\) and the relative distance \(t2 - t1\) in time, i.e., \(\boldsymbol{z}^{t1}={\text {warp}}(\boldsymbol{v}^{t2}, t2 - t1)\). The predicted relevant clip feature \(\boldsymbol{z}^{t1}\) is enforced to maintain the original established cross-modal correspondence unchanged, i.e., \(\boldsymbol{z}^{t1} \leftrightarrow \boldsymbol{q}^{s1}\). In this manner, we simulate the contextual reasoning process and enhance the temporal awareness of video features.
We conduct extensive experiments on four downstream tasks (i.e., video retrieval, temporal grounding, action step localization, and action segmentation) across six datasets. The results on both retrieval-based and localization-based tasks demonstrate the superiority and the generalization ability of our LocVTP.
In summary, we make three contributions in this paper:
- We propose a localization-oriented video-text pre-training framework, LocVTP, which benefits both retrieval-based and the less-explored localization-based downstream tasks.
- We pinpoint two crucial designs in LocVTP, i.e., fine-grained video-text alignment and temporal relation reasoning.
- Experimental results show that our LocVTP significantly outperforms previous state-of-the-art methods when transferred to various downstream tasks.
2 Related Work
Video-Text Pre-training (VTP). With the release of the large-scale instructional dataset HowTo100M, VTP has spurred significant interest in the community. Overall, the mainstream methods can be broadly classified into two classes: 1) Generative methods: Several methods [11, 20, 28, 31, 34, 50, 55, 56] try to extend BERT [53] to the cross-modal domain, i.e., they accept both visual and textual tokens as input and perform the masked-token prediction task. 2) Discriminative methods: These methods [4, 26, 30, 41] learn representations by differentiating input samples using objectives such as the metric loss [19, 58] or contrastive loss [9, 18]. ClipBert [26] enables affordable pre-training from sparsely sampled frames. Frozen [4] adapts the recent ViT [13] as the visual encoder and can be flexibly trained on both image and video datasets. T2VLAD [56] and FCA [17] also perform fine-grained interactions between video clips and phrases. However, both of them resort to additional overhead, e.g., k-means clustering or a graph auto-encoder. In contrast, our LocVTP explicitly models the clip-word matching with a more lightweight similarity comparison.
Pre-training for Localization Tasks. Compared to the retrieval tasks [21, 45, 64], which require only video-level predictions, localization tasks [15, 51, 77] are essentially different since they need dense clip-level or frame-level predictions, and thus the pre-training for these tasks is more challenging. In the pure video domain, this gap has been noticed and several pre-training works [2, 65, 66, 73] tailored for action localization have been proposed. BSP [65] synthesizes temporal boundaries using existing action recognition datasets and conducts boundary type classification to generate localization-friendly features. TSP [2] trains video encoders to be temporally sensitive by predicting the foreground clip label and classifying whether a clip is inside or outside the action. As for the video-language domain, our LocVTP is the first pre-training framework designed for localization tasks. Besides, compared to TSP and BSP, which require label information for supervised pre-training, our LocVTP can directly learn from narrated videos.
3 Approach
3.1 Overview of LocVTP
An overview of LocVTP is illustrated in Fig. 3. We firstly feed the video and language modalities to their respective encoders \(f_{v}(\cdot )\) and \(f_{q}(\cdot )\) to obtain embedded features. We follow the sparse sampling spirit in [26] and sample T clips for each video, yielding the encoded video \(\boldsymbol{v}= \{\boldsymbol{v}^{t}\}_{t=1}^{T}\), where \(\boldsymbol{v}^{t}\in \mathbb {R}^{D}\) is the \(t^{th}\) clip feature and D is the feature dimension. The text embedding is represented as \(\boldsymbol{q}= \{\boldsymbol{q}^{s}\}_{s=1}^{S_{q}}\), where \(\boldsymbol{q}^{s}\in \mathbb {R}^{D}\) is the \(s^{th}\) word embedding and \(S_q\) is the word length of \(\boldsymbol{q}\).
Three types of contrastive methods are then performed to learn cross-modal features: 1) The coarse-grained contrastive loss builds the video-sentence level alignment; 2) A correspondence discovery strategy is proposed to build clip-word relations, based on which the fine-grained contrastive loss is applied; 3) Temporal aware contrastive loss with the context warping pretext task is proposed to encode temporal information into video representations.
3.2 Coarse-Grained Contrastive Learning
We firstly conduct contrastive alignment at the global video-sentence level. Specifically, to obtain the video and sentence level features, we average pool \(\boldsymbol{v}\) and \(\boldsymbol{q}\) along the temporal and word index dimension, respectively. The global features are represented as \(\overline{\boldsymbol{v}}\), \(\overline{\boldsymbol{q}}\in \mathbb {R}^{D}\). Then we formulate this video-sentence alignment into the contrastive framework [18] as follows:
where \(\overline{\boldsymbol{q}}_{i}, i \in [1, N]\), is the sentence feature for other samples within the batch. N denotes the batch size and \(\tau \) is the temperature parameter. The coarse-grained contrastive loss \(\mathcal {L}_{c}\) serves as a base loss to conduct video-sentence level constraint and induces a basic latent space where the detailed cross-modal matching is achieved. Though usually coarse and noisy, this latent space encodes a prior for fine-grained clip-word correspondence discovery. In Sect. 4.6, we design and analyze three potential ways to use this cross-modal matching prior.
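As a sketch, a form of Eq. (1) consistent with the definitions above can be written as follows. It assumes the standard InfoNCE formulation of [18], with the matched sentence feature \(\overline{\boldsymbol{q}}\) counted among the N in-batch candidates \(\overline{\boldsymbol{q}}_{i}\) and \((\cdot )\) denoting cosine similarity; whether the symmetric sentence-to-video term is also used is not specified here.

```latex
\mathcal{L}_{c} = -\log \frac{\exp\left(\overline{\boldsymbol{v}} \cdot \overline{\boldsymbol{q}} / \tau\right)}{\sum_{i=1}^{N} \exp\left(\overline{\boldsymbol{v}} \cdot \overline{\boldsymbol{q}}_{i} / \tau\right)}
```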
3.3 Fine-Grained Contrastive Learning
Beyond the coarse-grained video-sentence alignment, we propose to conduct contrastive learning in a fine-grained manner, i.e., clip-word matching. We contend that introducing such alignment learning into the pre-training stage could narrow down its gap with downstream localization tasks and calibrate the pre-trained feature to be more temporally aware.
Clip-Word Correspondence Discovery. Before performing fine-grained contrastive learning, we first need to estimate the clip-word correspondences from video-sentence pairs. Thanks to the priors well established by the coarse-grained contrastive learning, we compute the cosine similarities between the video clips and their corresponding caption words in the pre-built latent space and choose the most similar K words as the correspondence for each video clip. Note that we select multiple positive words rather than simply picking the one with the highest similarity, because individual words may have vague meanings while a sense-group (see Footnote 2) conveys more precise information (cf. Sect. 4.7).
Given the video sentence pair \(\left\{ \boldsymbol{v}, \boldsymbol{q}\right\} \), for the encoded \(t^{th}\) video clip \(\boldsymbol{v}^{t}\), we compute its cosine similarities with the \(s^{th}\) word embedding \(\boldsymbol{q}^{s}\) and apply the \({\text {topk}}\) operation to select the most matched K ones. Following [57], these K selected items are average pooled to form the final positive sample:
where \(\boldsymbol{q}_{+}^{t}\) is the final positive sample for \(\boldsymbol{v}^{t}\). \((\boldsymbol{u} {\cdot } \boldsymbol{v})=\boldsymbol{u}^{\top } \boldsymbol{v} /\Vert \boldsymbol{u}\Vert \Vert \boldsymbol{v}\Vert \) represents the cosine similarity between \(\ell _{2}\) normalized \(\boldsymbol{u}\) and \(\boldsymbol{v}\). This process can be efficiently performed for all the video clips using matrix operations.
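One reading of Eq. (2) that agrees with these definitions is \(\boldsymbol{q}_{+}^{t} = \frac{1}{K} \sum_{s \in \mathcal{S}^{t}} \boldsymbol{q}^{s}\), where \(\mathcal{S}^{t}\) (our notation, not from the text) holds the indices of the K words with the largest cosine similarity to \(\boldsymbol{v}^{t}\). The following pure-Python sketch illustrates this selection and pooling for a single clip; the function names and the toy vectors are hypothetical, and a real implementation would batch the computation with matrix operations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def discover_positive(clip, words, k):
    """Return (indices, pooled) for one clip.

    indices: the (at most) k word indices most similar to `clip`.
    pooled:  the average of those word embeddings, i.e. the positive sample.
    """
    sims = [cosine(clip, w) for w in words]
    # Rank word indices by descending similarity and keep the top k.
    top = sorted(range(len(words)), key=lambda s: sims[s], reverse=True)[:k]
    pooled = [sum(words[s][d] for s in top) / len(top) for d in range(len(clip))]
    return top, pooled

# Toy example: one clip and four word embeddings in a 2-D latent space.
clip = [1.0, 0.0]
words = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2], [-1.0, 0.0]]
top, q_pos = discover_positive(clip, words, k=2)
```

In this toy case the first and third words are selected, and the pooled positive lies between them.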
Fine-Grained Contrastive Loss. With the selected clip-word correspondence as positive pairs, we perform fine-grained representation learning following the cross-modal InfoNCE [18] loss (cf. Fig. 4a). The negative samples are taken from the other words within the batch. Therefore, the fine-grained contrastive loss is defined as follows.
where \(\boldsymbol{q}_{i}^{s}\) is the \(s^{th}\) word feature of the \(i^{th}\) sentence \(\boldsymbol{q}_{i}\).
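As a sketch, a form of Eq. (3) consistent with the description above is given below. It assumes that the loss is averaged over the T clips of a video and that the negatives range over all words of all N sentences in the batch, with \(S_{q_i}\) denoting the word length of \(\boldsymbol{q}_{i}\); these normalization and indexing choices are assumptions.

```latex
\mathcal{L}_{f} = -\frac{1}{T} \sum_{t=1}^{T} \log \frac{\exp\left(\boldsymbol{v}^{t} \cdot \boldsymbol{q}_{+}^{t} / \tau\right)}{\sum_{i=1}^{N} \sum_{s=1}^{S_{q_i}} \exp\left(\boldsymbol{v}^{t} \cdot \boldsymbol{q}_{i}^{s} / \tau\right)}
```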
3.4 Temporal Aware Contrastive Learning
Compared with the video-level retrieval task, which favors temporally invariant features [40, 42], the clip-level localization task [6, 7, 8, 33, 60, 61, 70, 71, 72] prefers temporally aware video embeddings. Specifically, correlated actions in the same video should perceive each other. This characteristic is, however, not embodied in the aforementioned contrastive learning.
Context Warping Head. To alleviate this, we set up a context-warping operation to enforce the video clip to perceive the context. For the video clip \(\boldsymbol{v}^{t}\) in a matched clip-word pair \(\{\boldsymbol{v}^{t}, \boldsymbol{q}^{t}_{+}\}\) (cf. Sect. 3.3), we warp its contextual video clip with \(\delta \) temporal distance, i.e., \(\boldsymbol{v}^{t+\delta }\), to “reconstruct” itself. To supervise this warping process, we set up a temporal aware contrastive loss to maintain the established correspondence. Specifically, we propose a context warping head \(g(\cdot )\) to instantiate this warping process, by taking the context clip feature \(\boldsymbol{v}^{t+\delta }\) and temporal distance \(\delta \) as input.
where \(\boldsymbol{z}^{t}\) is the warped feature. \(W \in \mathbb {R}^{(D+2) \times D}\) are the trainable weights. \(\delta \) is randomly sampled within the range of \([-\delta _{max}, \delta _{max}]\). \({\text {sgn}}(\cdot )\) is the sign function which returns 1 for positive values and \(-1\) for negative ones. Here \({\text {sgn}}(\delta )\) and \(|\delta |\) indicate the direction and distance of the temporal difference \(\delta \), respectively.
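Given these definitions, Eq. (4) can plausibly be read as \(\boldsymbol{z}^{t} = g(\boldsymbol{v}^{t+\delta }, \delta ) = \sigma (W^{\top }[\boldsymbol{v}^{t+\delta }; {\text {sgn}}(\delta ); |\delta |])\), i.e., the context feature is concatenated with the direction and distance of \(\delta \) and mapped back to D dimensions. The pure-Python sketch below follows this reading; the activation \(\sigma \) (a ReLU here), the rejection of \(\delta = 0\), and all names and toy weights are assumptions made for illustration only.

```python
def sgn(delta):
    """Sign of the temporal offset: 1 for positive, -1 for negative."""
    if delta == 0:
        # Assumption: a clip is never used as its own context.
        raise ValueError("delta must be non-zero")
    return 1.0 if delta > 0 else -1.0

def context_warp(v_ctx, delta, W, relu=True):
    """Warp the context clip feature v^{t+delta} towards the reference clip.

    v_ctx : list of D floats, the context clip feature.
    delta : non-zero integer temporal offset.
    W     : (D+2) x D weight matrix given as a list of rows.
    relu  : apply a ReLU to the output (an assumed choice of activation).
    """
    # Concatenate the feature with the direction and distance of the offset.
    x = list(v_ctx) + [sgn(delta), float(abs(delta))]
    if len(W) != len(x):
        raise ValueError("W must have D + 2 rows")
    out_dim = len(W[0])
    z = [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(out_dim)]
    return [max(0.0, value) for value in z] if relu else z

# Toy example with D = 2 and hand-picked (untrained) weights.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.0], [0.0, -1.0]]
z = context_warp([1.0, 2.0], delta=-3, W=W)
```

During pre-training, \(\delta \) would be drawn at random from \([-\delta _{max}, \delta _{max}]\) for each reference clip.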
Temporal Aware Contrastive Loss. Through the context warping head, the warped feature \(\boldsymbol{z}^{t}\) should mimic the reference feature \(\boldsymbol{v}^{t}\). Since \(\boldsymbol{v}^{t}\) has the clip-word alignment with \(\boldsymbol{q}^{t}_{+}\), such correspondence should be preserved between the warped feature \(\boldsymbol{z}^{t}\) and \(\boldsymbol{q}^{t}_{+}\) (cf. Fig. 4b).
This process enforces video features to learn the ability to reason temporally, thus leading to more localization-friendly video features.
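As a sketch, the temporal aware contrastive loss (Eq. (5)) can be written in the same InfoNCE form as the fine-grained loss of Sect. 3.3, with the clip feature \(\boldsymbol{v}^{t}\) replaced by the warped feature \(\boldsymbol{z}^{t}\). The averaging over T clips and the choice of all in-batch words as negatives (with \(S_{q_i}\) the word length of \(\boldsymbol{q}_{i}\)) are assumptions.

```latex
\mathcal{L}_{t} = -\frac{1}{T} \sum_{t=1}^{T} \log \frac{\exp\left(\boldsymbol{z}^{t} \cdot \boldsymbol{q}_{+}^{t} / \tau\right)}{\sum_{i=1}^{N} \sum_{s=1}^{S_{q_i}} \exp\left(\boldsymbol{z}^{t} \cdot \boldsymbol{q}_{i}^{s} / \tau\right)}
```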
Integrating the above constraints, our final loss function is as follows.
where \(\lambda _{c}\), \(\lambda _{f}\), and \(\lambda _{t}\) balance the focus on different constraints during training.
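Written out, the weighted combination described here (Eq. (6)) takes the following form, assuming a plain weighted sum of the three losses. With the settings reported in Sect. 4.1, the weights are \(\lambda _{c}=0.5\), \(\lambda _{f}=1\), and \(\lambda _{t}=1\).

```latex
\mathcal{L} = \lambda_{c} \mathcal{L}_{c} + \lambda_{f} \mathcal{L}_{f} + \lambda_{t} \mathcal{L}_{t}
```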
4 Experiments
4.1 Settings of Pre-training
Datasets. We pre-trained our model on three public datasets: 1) HowTo100M [39]. It consists of more than 1.2M videos accompanied by ASR-generated speech transcription. The provided transcription is used to create video-sentence pairs separated by each timestamp. 2) WebVid-2M [4]. It contains about 2.5M well-aligned web video-text pairs. 3) Google Conceptual Captions [48]. It contains 3.3M image and description pairs harvested from the web.
Encoders. Following [4, 54, 67], we adopted ViT-B/16 [13] with space-time attention [5] as the video encoder. The spatial attention weights in the transformer were initialized with ImageNet-21k pre-trained weights while the temporal attention weights were set to zero. We chose a lightweight DistilBERT [47] as the language encoder. Following [3, 4, 41, 52], the language encoder was initialized with the weights pre-trained on English Wikipedia and Toronto Book Corpus.
Implementation Details. For the video in each video-sentence pair, we sampled 8 clips of 16 frames equidistantly and fed them to the video encoder to obtain clip-level features. All frames were resized to \(224 \times 224\). For downstream transfer, we extracted video features with the well-trained model in a dense manner, i.e., every 16 consecutive frames were grouped to compute one clip feature.
Experiments were conducted on 64 V100 GPUs with a batch size of 256 and lasted for 200 epochs. We used Adam [32] with the initial learning rate \(10^{-4}\) as the optimizer. The learning rate decayed by 0.1 at the \(100^{th}\) and \(160^{th}\) epoch. Random flip, random crop, and color jitter for video data augmentation were included. The loss balance factors \(\lambda _{c}\), \(\lambda _{f}\), and \(\lambda _{t}\) were set to 0.5, 1, 1, respectively. The temperature factor \(\tau \) used in contrastive learning was set to 0.07 following [43, 59] and K in Eq.(2) was set to 3. Features in all three contrastive losses were \(\ell _{2}\)-normalized before computation (Table 1).
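The step schedule described above can be sketched as follows. This is a minimal illustration, assuming 0-based epoch indices and that the decay takes effect from the milestone epoch onward; the function name and defaults are hypothetical and only mirror the reported values (initial rate \(10^{-4}\), decay by 0.1 at epochs 100 and 160).

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(100, 160), gamma=0.1):
    """Step decay: scale base_lr by gamma once for every milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Learning rate at a few representative epochs of the 200-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 99, 100, 160, 199)}
```

Under these assumptions, the rate is \(10^{-4}\) for epochs 0 to 99, \(10^{-5}\) for epochs 100 to 159, and \(10^{-6}\) afterwards.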
4.2 Transfer Results on Video Retrieval
Datasets. We evaluate our LocVTP on the widely-used benchmark MSR-VTT dataset [64]. It is composed of 10K YouTube videos (9K for training and 1K for test). We report results on the train/test splits introduced in [69].
Results. 1) As shown in Table 1, we achieve state-of-the-art performance under both sets of pre-training data, i.e., HowTo100M and CC3M+WV2M. Specifically, when pre-trained on CC3M+WV2M, LocVTP outperforms Frozen [4] by an absolute lift of 4.8% on R@5. 2) It should be pointed out that, although using RGB data only, our LocVTP achieves better performance than the methods using multi-modal expert features including motion, face, and speech, e.g., MMT [14]. 3) The recent work CLIP [43] provides a stronger vision encoder and we also evaluate the performance based on it. CLIP’s weights greatly improve the performance of LocVTP, with R@5 reaching 72.8%, surpassing top-performing CLIP-based methods. 4) Our LocVTP also outperforms previous methods under the zero-shot setting, showing its generalization ability.
4.3 Transfer Results on Temporal Grounding
Settings. We validate the performance of pre-trained representations on temporal grounding, which aims to localize actions corresponding to the sentence from an untrimmed video. Specifically, we re-train the mainstream temporal grounding method 2D-TAN [74] (see Footnote 3) by only replacing the original input features with pre-trained ones. For ease of feature extraction, we choose representative VTP methods with publicly available code for comparison.
Datasets and Metrics. 1) ActivityNet Captions (ANet) [24]. It contains 20K untrimmed videos with 100K descriptions. By convention, we use 37,417 video-query pairs for training, 17,505 pairs for validation, and 17,031 pairs for testing. 2) Charades-STA [15]. Following the official split, 12,408 video-query pairs are used for training, and 3,720 pairs for testing. 3) TACoS [44]. It has 10,146 video-query pairs for training, 4,589 pairs for validation, and 4,083 pairs for testing.
Following prior works, we adopt “R@n, IoU@m” (abbreviated as \(R^m_n\)) as the metric. Specifically, \(R^m_n\) is defined as the percentage of queries for which at least one of the top-n retrieved moments has an IoU with the ground-truth moment larger than m.
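To make the metric concrete, the sketch below computes \(R^m_n\) for moments given as (start, end) pairs. It follows the definition above with a strict "larger than m" threshold; the function names and toy predictions are hypothetical, and some public evaluation scripts may use a non-strict threshold instead.

```python
def temporal_iou(pred, gt):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n_iou(predictions, ground_truths, n, m):
    """Percentage of queries with at least one top-n moment whose IoU exceeds m.

    predictions   : one ranked list of (start, end) moments per query.
    ground_truths : one (start, end) moment per query.
    """
    hits = 0
    for ranked, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) > m for p in ranked[:n]):
            hits += 1
    return 100.0 * hits / len(ground_truths)

# Two toy queries, each with two ranked predictions (times in seconds).
predictions = [[(0, 10), (5, 15)], [(0, 5), (20, 30)]]
ground_truths = [(4, 14), (0, 6)]
r1 = recall_at_n_iou(predictions, ground_truths, n=1, m=0.5)
r2 = recall_at_n_iou(predictions, ground_truths, n=2, m=0.5)
```

Here only the second query is matched by its top-1 prediction, while both queries are matched within their top-2 predictions.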
Results. 1) As shown in Table 2, even when trained on a much larger dataset, the current popular video-text pre-training frameworks achieve inferior performance compared to the separately pre-trained one. For example, Frozen [4] reaches 43.3% at \(R_1^{0.5}\) on ANet Captions, which is 1.1% absolute value lower than the separately pre-trained counterpart. 2) Whether pre-trained on HowTo100M or CC + WV, our LocVTP outperforms both video-text pre-training methods by a large margin on all three datasets. For example, pre-trained on HowTo100M, LocVTP surpasses the separately pre-trained method by 3.8% on \(R_1^{0.5}\) of ANet Captions. 3) For a fairer comparison, we sample a subset of HowTo100M with the same number of training samples as Kinetics [23] (300K training pairs), denoted as HT\(^\ddag \) in Table 2. Although using noisy ASR captions, the results demonstrate that under the same training data volume, our LocVTP still shows better performance compared to the separately pre-trained method. This indicates that our performance improvement is brought by the sound architecture design rather than just the use of the large-scale dataset.
4.4 Transfer Results on Action Step Localization
Settings. In action step localization, each video belongs to a task and is annotated with multiple action steps described in short natural language phrases. The goal is to align each frame with the correct step in text form. Following [35, 39, 68, 76], we take [77] as the downstream localization method. Specifically, we compute the similarity between each frame and the action step descriptions in feature space to find the optimal frame-wise order of action steps for a video.
Datasets and Metrics. We experiment on the instructional video dataset CrossTask [77], which includes 83 tasks and 4.7K videos. Each task is described with an ordered list of steps with manual natural language descriptions. We perform the same evaluation protocol as in [77] by reporting the average recall (CTR).
Results. Table 3 reports the action step localization performance on the CrossTask dataset. Our LocVTP pre-trained feature achieves state-of-the-art performance with CTR reaching 51.7%, surpassing the previous method VideoClip by 4.4%. Our competitive performance demonstrates that LocVTP features can effectively perceive detailed action steps.
4.5 Transfer Results on Action Segmentation
Settings. We assess our LocVTP on action segmentation, which aims to predict the action label of each video frame. It is a pure vision task without the use of the text encoder. Following [35, 68, 76], we encode the input video frames with the well-trained video encoder and apply a linear classifier upon the features to predict action labels.
Datasets and Metrics. We conduct experiments on the widely used COIN dataset [51] and the frame-wise accuracy (FA) is taken as the evaluation metric.
Results. As shown in Table 3, our LocVTP achieves state-of-the-art performance with FA reaching 72.9%. This further demonstrates the superiority of our feature in localization tasks even in the absence of language guidance.
4.6 Ablation Study on Training Objective
Training Strategy. Coarse-grained contrastive alignment loss \(\mathcal {L}_{c}\) provides a basic cross-modal matching prior and we introduce three potential ways to use it: 1) multi-stage training: first perform coarse-grained training and then use the trained model to initialize other stages. 2) warm-up training: decrease \(\lambda _{c}\) exponentially from 1 to 0 throughout the training process. 3) weighted training: set \(\lambda _{c}\) to a constant value. Here we set \(\lambda _{c} = 0.5\). As shown in Table 4a, we find the weighted training strategy achieves the best performance and warm-up training is slightly behind. Multi-stage training is the least effective one.
Loss Component. We present the loss component ablations in Table 4b. As shown, both fine-grained loss \(\mathcal {L}_{f}\) and temporal aware loss \(\mathcal {L}_{t}\) are crucial. For example, compared to the full version (exp.#1), removing \(\mathcal {L}_{f}\) and \(\mathcal {L}_{t}\) brings about 1.4% and 1.5% performance degradation on the \(R_1^{0.5}\) metric, respectively.
More Downstream Temporal Grounding Baselines. We take another temporal grounding method, CSMGAN [29], as the downstream baseline. As shown in Table 4c, our LocVTP pre-trained feature consistently benefits this more advanced baseline.
4.7 Ablations on Fine-grained Contrastive Loss (see Footnote 4)
Correspondence Discovery Strategies. We experiment with four potential strategies to extract cross-modal correspondences: 1) random: randomly select K words for each clip; 2) 2d-topk: select the most similar \(K \times T\) clip-word pairs; 3) word-topk: select the most similar K clips for each word; 4) clip-topk: select the most similar K words for each clip, namely the method illustrated in Sect. 3.3. As indicated in Table 5a, the random and 2d-topk matching strategies are the two worst options. The word-topk matching is also sub-optimal, which can be attributed to the possibility of introducing words without concrete meanings (e.g., articles or pronouns) into matched pairs.
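The three similarity-based strategies differ only in the axis along which the top-K selection is taken. The sketch below illustrates this on a \(T \times S_q\) clip-word similarity matrix stored as a list of rows; the function names and the toy matrix are hypothetical and serve only to show how the selected pairs differ.

```python
def clip_topk(sim, k):
    """For each clip (row), keep its k most similar words."""
    pairs = set()
    for t, row in enumerate(sim):
        best = sorted(range(len(row)), key=lambda s: row[s], reverse=True)[:k]
        pairs.update((t, s) for s in best)
    return pairs

def word_topk(sim, k):
    """For each word (column), keep its k most similar clips."""
    pairs = set()
    for s in range(len(sim[0])):
        best = sorted(range(len(sim)), key=lambda t: sim[t][s], reverse=True)[:k]
        pairs.update((t, s) for t in best)
    return pairs

def topk_2d(sim, k):
    """Keep the k * T most similar clip-word pairs over the whole matrix."""
    flat = [(t, s) for t in range(len(sim)) for s in range(len(sim[0]))]
    flat.sort(key=lambda p: sim[p[0]][p[1]], reverse=True)
    return set(flat[: k * len(sim)])

# Toy similarity matrix: 2 clips (rows) x 3 words (columns).
sim = [[0.9, 0.8, 0.1],
       [0.2, 0.3, 0.7]]
by_clip = clip_topk(sim, 1)
by_word = word_topk(sim, 1)
by_2d = topk_2d(sim, 1)
```

In this toy matrix, clip-topk assigns one word to every clip, word-topk forces every word into some pair, and 2d-topk leaves the second clip without any positive word.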
Number of Selected Pairs K. We further ablate the hyper-parameter K used in the clip-topk strategy. Table 5b shows that the performance saturates at \(K=3\) and slightly decreases for \(K=4\). We conjecture that this is because too few words convey only vague meanings, while too large a K value makes it hard to establish accurate correspondences.
4.8 Ablations on Temporal Aware Contrastive Loss (see Footnote 4)
Context Projection Head Components. In Eq. (4), the warped feature is generated based on both the direction \({\text {sgn}}(\delta )\) and distance \(|\delta |\). Here we investigate eliminating either of them to see the difference. We observe in Table 5c that removing either component decreases the performance, which indicates that both the direction and distance of bias \(\delta \) are crucial for feature warping.
Maximum Bias Distance \(\delta _{max}\). Here we ablate different values for \(\delta _{max}\). From Table 5d, we can see that \(\delta _{max} = 4\) achieves the best performance. This may be because a small bias makes the model unable to perceive enough context, while a large bias makes contextual reasoning too difficult.
Intra-modal vs. Cross-modal Constraint. In Sect. 3.4, given the matched clip-word pair \(\{\boldsymbol{v}^{t}, \boldsymbol{q}^{t}_{+}\}\) and the warped feature \(\boldsymbol{z}^{t}\), we enforce cross-modal supervision, i.e., \(\boldsymbol{z}^{t} \leftrightarrow \boldsymbol{q}^{t}_{+}\). Here, we instead apply the temporal aware contrastive loss \(\mathcal {L}_{t}\) in an intra-modal manner, which regards \(\boldsymbol{z}^{t}\) and \(\boldsymbol{v}^{t}\) as positive pairs, i.e., \(\boldsymbol{z}^{t} \leftrightarrow \boldsymbol{v}^{t}\). The results in Table 5e show that our adopted cross-modal mode outperforms the intra-modal one.
Temporal Sensitivity Analysis. As a sanity check, we devise two proxy tasks to evaluate the temporal sensitivity of pre-trained video features. As shown in Fig. 5a, n equidistantly sampled clips from one video are fed into the frozen video backbone to extract their corresponding features. Two linear classifiers are trained to perform two tasks: order prediction and distance estimation. The first task predicts the temporal index while the second one estimates the temporal distance of two clips. The results in Table 5f show that our LocVTP with temporal aware loss \(\mathcal {L}_{t}\) outperforms the variant without it as well as two typical VTP methods (i.e., UniVL and MIL-NCE), which shows that \(\mathcal {L}_{t}\) clearly contributes to the localization ability.
4.9 Visualization
Cross-modal Correspondence Visualizations. Figure 5b shows two frames (see Footnote 6) and their corresponding similarity scores with caption words. The top K highest scored words are marked in red (\(K=3\)). Frame #1 and frame #2 have similar appearance views yet correspond to different action processes. Our method pinpoints the subtle differences and accurately finds the most relevant words.
UMAP Visualizations. As shown in Fig. 6, we provide UMAP [37] visualizations for fused multi-modal features, which are generated by multiplying the extracted video feature by one query feature. With the temporal aware loss \(\mathcal {L}_{t}\), our LocVTP shows more separable distributions compared with LocVTP w/o \(\mathcal {L}_{t}\), manifesting that \(\mathcal {L}_{t}\) helps distinguish action-of-interest from background.
Similarity Distribution Visualizations. In Eq. (4), the context projection head warps the contextual clip \(\boldsymbol{v}^{t+\delta }\) to the reference one \(\boldsymbol{v}^{t}\). Here we collect 10K paired training samples and compute three sets of cosine similarities: reference similarity \((\boldsymbol{v}^{t}, \boldsymbol{q}_{+}^{t})\), bias similarity \((\boldsymbol{v}^{t+\delta }, \boldsymbol{q}_{+}^{t})\), and projection similarity \((\boldsymbol{z}^{t}, \boldsymbol{q}_{+}^{t})\). Figure 5c plots the histogram of these similarities. We can see that the distribution of projection similarity is close to that of reference similarity while far away from that of bias similarity. This demonstrates that our context projection head can effectively warp contextual features conditioned on the temporal information.
5 Conclusions
In this paper, we propose LocVTP, the first video-text pre-training framework for temporal localization tasks. Specifically, we apply cross-modal contrastive learning at both coarse-grained video-sentence and fine-grained clip-word levels. Besides, we propose a context warping pretext task and a temporal aware contrastive loss to enhance the temporal awareness of video features. Experimental results show that LocVTP achieves state-of-the-art performance when transferred to both retrieval-based and localization-based downstream tasks.
Notes
- 1.
- 2. A group or sequence of words conveying a particular meaning or idea in linguistics.
- 3. We choose 2D-TAN since it is relatively simple without too many dataset-specific parameters, which can fairly verify the effectiveness of pre-training features. Results on more advanced baselines are available in the supplementary material.
- 4. If not specified, all ablation studies are conducted on the downstream temporal grounding task on the ActivityNet Captions dataset. We use LocVTP pre-trained on HowTo100M with ImageNet initialization.
- 5. More visualizations are left in the supplementary materials.
- 6. Here we use “frame” to indicate the center frame of a video snippet.
Acknowledgement
This paper was partially supported by NSFC (No: 62176008) and Shenzhen Science & Technology Research Program (No: GXWD20201231165807007-20200814115301001).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cao, M., Yang, T., Weng, J., Zhang, C., Wang, J., Zou, Y. (2022). LocVTP: Video-Text Pre-training for Temporal Localization. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13686. Springer, Cham. https://doi.org/10.1007/978-3-031-19809-0_3
DOI: https://doi.org/10.1007/978-3-031-19809-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19808-3
Online ISBN: 978-3-031-19809-0
eBook Packages: Computer Science, Computer Science (R0)