1 Introduction

Fig. 1. (a) Video retrieval and temporal grounding performance using different pre-trained features. Sep.Pre. denotes separate pre-training, i.e., the video encoder is pre-trained with supervision on Kinetics [23] and the text encoder is taken from BERT [12]. MIL-NCE and our LocVTP are VTP methods pre-trained on HowTo100M [39]. For video retrieval, we use COOT [16] as the downstream method and evaluate on the YouCook2 [75] dataset with R@5. For temporal grounding, we take 2D-TAN [74] as the downstream method and evaluate on the ActivityNet Captions [24] dataset with R@1, IoU = 0.5. (b) Retrieval-based and localization-based downstream tasks. We take video retrieval and temporal grounding as typical examples, respectively. The former needs video-level classification while the latter requires clip-level or frame-level localization.

Video-Text Pre-training (VTP) [4, 26, 30, 31, 39, 49, 54, 67] has attracted increasing attention with the aim of learning generic and transferable joint video-language (VL) representations. Compared to the conventional separate pre-training of each single modality, e.g., video features pre-trained on action recognition datasets (Kinetics [23], Sports-1M [22]), VTP has several advantages: 1) It leverages large-scale unlabeled narrated video data with automatically generated text for video-text correspondence pre-training. 2) It maps the features of different modalities into a shared latent space, which reduces the difficulty of cross-modal feature interaction. Thanks to these advantages, VTP has significantly improved the performance of many downstream VL tasks. For example, as illustrated in [16], the video retrieval performance using features pre-trained with the VTP method MIL-NCE [38] is much higher than that using separately pre-trained features (cf. Fig. 1a (left)).

Despite their encouraging performance, we find that most current VTP methods are applicable to only a limited set of downstream tasks, i.e., they focus on retrieval-based tasks which require video-level predictions, e.g., video retrieval [64], video captioning [45], and video question answering [21]. In contrast, there exists another mainstream category of localization-based tasks which expect more fine-grained clip-level or frame-level predictions, e.g., temporal grounding [15], action segmentation [51], and action step localization [77] (cf. Fig. 1b). Unfortunately, through experiments, we find that current VTP methods generalize poorly to this type of downstream task. For example, on temporal grounding, even when pre-trained with the much larger HowTo100M [39] dataset, the VTP method MIL-NCE still performs worse than the separately pre-trained counterpart (cf. Fig. 1a (right)).

In this paper, we argue that this poor transferability to localization-based tasks is due to the absence of two indispensable characteristics: 1) Fine-grained alignment: We contend that the alignment should be conducted at the more fine-grained clip-word level instead of the coarse-grained video-sentence (see Footnote 1) level. As the temporal grounding example in Fig. 2 shows, a given query sentence may contain multiple actions (e.g., “hit the golf ball” (\(q^{s1}\)) and “bend down to pick up the ball” (\(q^{s2}\))). Thus, aligning each action (or its words) to the corresponding clips (i.e., \(v^{t1}\) and \(v^{t2}\)) helps obtain more detailed and accurate feature representations. 2) Temporal relation reasoning: We hope that the clip features of a certain action can also perceive other actions in the same video. For example, in a typical golf video, action \(q^{s2}\) (“bend down to pick up the ball”) always occurs shortly after action \(q^{s1}\) (“hit the golf ball”). Thus, incorporating such temporal relationships into VTP helps improve the temporal awareness of video features.

Fig. 2. Fine-grained video-text alignment: positive clip-word pairs are selected via cosine similarity and are then forced to be close to each other, i.e., \(\boldsymbol{v}^{t1} \leftrightarrow \boldsymbol{q}^{s1}\), \(\boldsymbol{v}^{t2} \leftrightarrow \boldsymbol{q}^{s2}\); Temporal relation reasoning: a context warping head reconstructs \(\boldsymbol{v}^{t1}\) conditioned on \(\boldsymbol{v}^{t2}\) and distance \(t2 - t1\) while maintaining the cross-modal alignment unchanged, i.e., \(\boldsymbol{z}^{t1} = {\text {warp}}(\boldsymbol{v}^{t2}, t2-t1) \leftrightarrow \boldsymbol{q}^{s1}\).

Based on these observations, we propose a novel video-text pre-training framework for localization tasks, dubbed LocVTP. By incorporating both above-mentioned characteristics, LocVTP achieves state-of-the-art performance not only on the widely studied retrieval-based tasks, but also on the less-explored localization-based tasks. Specifically, for fine-grained alignment, we extend the coarse-grained contrastive training with video-sentence alignment to a fine-grained one with clip-word alignment. Since there are no clip-word correspondence annotations in existing large-scale datasets, we utilize the latent space established by the coarse-grained contrastive learning to estimate clip-word similarities, and then select the clip-word pairs with high similarities as positive samples. To illustrate this, as shown in Fig. 2 (right), suppose \(\{\boldsymbol{v}^{t1}, \boldsymbol{q}^{s1}\}\) and \(\{\boldsymbol{v}^{t2}, \boldsymbol{q}^{s2}\}\) are two matched clip-word feature pairs. The semantic embeddings in each pair are mapped to be close to each other, i.e., \(\boldsymbol{v}^{t1} \leftrightarrow \boldsymbol{q}^{s1}\), \(\boldsymbol{v}^{t2} \leftrightarrow \boldsymbol{q}^{s2}\). For temporal relation reasoning, we propose a new pretext task called context warping, again using Fig. 2 (right) for illustration. Context warping generates a new temporally relevant clip feature \(\boldsymbol{z}^{t1}\), which imitates \(\boldsymbol{v}^{t1}\), conditioned on another clip \(\boldsymbol{v}^{t2}\) and the relative temporal distance \(t2 - t1\), i.e., \(\boldsymbol{z}^{t1}={\text {warp}}(\boldsymbol{v}^{t2}, t2 - t1)\). The predicted clip feature \(\boldsymbol{z}^{t1}\) is enforced to preserve the previously established cross-modal correspondence, i.e., \(\boldsymbol{z}^{t1} \leftrightarrow \boldsymbol{q}^{s1}\). In this manner, we simulate the contextual reasoning process and enhance the temporal awareness of video features.

We conduct extensive experiments on four downstream tasks (i.e., video retrieval, temporal grounding, action step localization, and action segmentation) across six datasets. The results on both retrieval-based and localization-based tasks demonstrate the superiority and the generalization ability of our LocVTP.

In summary, we make three contributions in this paper:

  • We propose a localization-oriented video-text pre-training framework, LocVTP, which benefits both retrieval-based and the less-explored localization-based downstream tasks.

  • We pinpoint two crucial designs in LocVTP, i.e., fine-grained video-text alignment and temporal relation reasoning.

  • Experimental results show that our LocVTP significantly outperforms previous state-of-the-art methods when transferred to various downstream tasks.

2 Related Work

Video-Text Pre-training (VTP). With the release of the large-scale instructional dataset HowTo100M, VTP has attracted significant interest in the community. Overall, the mainstream methods can be broadly divided into two categories: 1) Generative methods: Several methods [11, 20, 28, 31, 34, 50, 55, 56] extend BERT [53] to the cross-modal domain, i.e., they accept both visual and textual tokens as input and perform the masked-token prediction task. 2) Discriminative methods: These methods [4, 26, 30, 41] learn representations by differentiating input samples using objectives such as the metric loss [19, 58] or contrastive loss [9, 18]. ClipBert [26] enables affordable pre-training from sparsely sampled frames. Frozen [4] adapts the recent ViT [13] as the visual encoder and can be flexibly trained on both image and video datasets. T2VLAD [56] and FCA [17] also perform fine-grained interactions between video clips and phrases. However, both of them introduce additional overhead, e.g., k-means clustering or a graph auto-encoder. In contrast, our LocVTP explicitly models the clip-word matching in a more lightweight manner via similarity comparison.

Pre-training for Localization Tasks. Compared to retrieval tasks [21, 45, 64], which require only video-level predictions, localization tasks [15, 51, 77] are essentially different since they need dense clip-level or frame-level predictions, making pre-training for these tasks more challenging. In the pure video domain, this gap has been noticed and several pre-training works [2, 65, 66, 73] tailored for action localization have been proposed. BSP [65] synthesizes temporal boundaries using existing action recognition datasets and conducts boundary type classification to generate localization-friendly features. TSP [2] trains video encoders to be temporally sensitive by predicting the foreground clip label and classifying whether a clip is inside or outside the action. In the video-language domain, our LocVTP is the first pre-training framework designed for localization tasks. Besides, compared to TSP and BSP, which require label information for supervised pre-training, our LocVTP can learn directly from narrated videos.

3 Approach

3.1 Overview of LocVTP

An overview of LocVTP is illustrated in Fig. 3. We first feed the video and language modalities into their respective encoders \(f_{v}(\cdot )\) and \(f_{q}(\cdot )\) to obtain embedded features. We follow the sparse sampling strategy of [26] and sample T clips for each video, yielding the encoded video \(\boldsymbol{v}= \{\boldsymbol{v}^{t}\}_{t=1}^{T}\), where \(\boldsymbol{v}^{t}\in \mathbb {R}^{D}\) is the \(t^{th}\) clip feature and D is the feature dimension. The text embedding is represented as \(\boldsymbol{q}= \{\boldsymbol{q}^{s}\}_{s=1}^{S_{q}}\), where \(\boldsymbol{q}^{s}\in \mathbb {R}^{D}\) is the \(s^{th}\) word embedding and \(S_q\) is the number of words in \(\boldsymbol{q}\).
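
For concreteness, the tensor shapes used throughout this section can be sketched as below. This is a minimal PyTorch illustration with random placeholders standing in for the actual encoder outputs; the dimension values are toy numbers, not the ones used in our experiments.

```python
import torch

# Toy dimensions for illustration only (not the values used in experiments).
# N: batch size, T: clips per video, S_q: words per caption, D: feature dim.
N, T, S_q, D = 4, 8, 12, 256

# Placeholders standing in for the encoder outputs f_v(video) and f_q(text).
v = torch.randn(N, T, D)    # clip-level video features {v^t}, t = 1..T
q = torch.randn(N, S_q, D)  # word-level text features  {q^s}, s = 1..S_q
```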

Fig. 3. An overview of LocVTP. \(f_{v}(\cdot )\) and \(f_{q}(\cdot )\) are the video and language encoders, respectively. 1) The coarse-grained contrastive loss \(\mathcal {L}_{c}\) matches the global video and sentence representations \(\overline{\boldsymbol{v}}\) and \(\overline{\boldsymbol{q}}\). 2) The clip-word correspondence is first built by similarity computation, and then the fine-grained contrastive loss \(\mathcal {L}_{f}\) conducts detailed cross-modal alignment. Note that for clarity, we only present the correspondence discovery for the clip \(\boldsymbol{v}^{t}\). 3) A context warping head is employed to warp the contextual feature \(\boldsymbol{v}^{t+\delta }\), and a temporal aware contrastive loss \(\mathcal {L}_{t}\) is applied to the warped feature \(\boldsymbol{z}^{t}\).

Three contrastive objectives are then applied to learn cross-modal features: 1) The coarse-grained contrastive loss builds video-sentence level alignment; 2) A correspondence discovery strategy is proposed to build clip-word relations, on which the fine-grained contrastive loss is applied; 3) A temporal aware contrastive loss with the context warping pretext task is proposed to encode temporal information into the video representations.

3.2 Coarse-Grained Contrastive Learning

We first conduct contrastive alignment at the global video-sentence level. Specifically, to obtain the video- and sentence-level features, we average pool \(\boldsymbol{v}\) and \(\boldsymbol{q}\) along the temporal and word index dimensions, respectively. The global features are denoted as \(\overline{\boldsymbol{v}}\), \(\overline{\boldsymbol{q}}\in \mathbb {R}^{D}\). Then we formulate this video-sentence alignment within the contrastive framework [18] as follows:

$$\begin{aligned} \mathcal {L}_{c}=-\log \frac{\exp \left( \overline{\boldsymbol{v}} {\cdot } \overline{\boldsymbol{q}} / \tau \right) }{\sum _{i=1}^{N} \exp \left( \overline{\boldsymbol{v}} {\cdot } \overline{\boldsymbol{q}}_{i} / \tau \right) }, \end{aligned}$$
(1)

where \(\overline{\boldsymbol{q}}_{i}, i \in [1, N]\), is the sentence feature of the \(i^{th}\) sample within the batch, N denotes the batch size, and \(\tau \) is the temperature parameter. The coarse-grained contrastive loss \(\mathcal {L}_{c}\) serves as a base loss that imposes a video-sentence level constraint and induces a basic latent space in which detailed cross-modal matching can be carried out. Though usually coarse and noisy, this latent space encodes a prior for fine-grained clip-word correspondence discovery. In Sect. 4.6, we design and analyze three potential ways to use this cross-modal matching prior.
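
A minimal PyTorch sketch of Eq. (1) is given below. It assumes \(\ell_2\)-normalized average-pooled features (cf. Sect. 4.1) and implements only the video-to-sentence direction written in Eq. (1); the function name is illustrative, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def coarse_contrastive_loss(v, q, tau=0.07):
    """Video-sentence InfoNCE, a sketch of Eq. (1).

    v: (N, T, D) clip features; q: (N, S, D) word features.
    """
    v_bar = F.normalize(v.mean(dim=1), dim=-1)   # (N, D) global video features
    q_bar = F.normalize(q.mean(dim=1), dim=-1)   # (N, D) global sentence features
    logits = v_bar @ q_bar.t() / tau             # (N, N) pairwise cosine similarities
    targets = torch.arange(v.size(0), device=v.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)      # -log softmax ratio, averaged over the batch
```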

3.3 Fine-Grained Contrastive Learning

Beyond the coarse-grained video-sentence alignment, we propose to conduct contrastive learning in a fine-grained manner, i.e., clip-word matching. We contend that introducing such alignment learning into the pre-training stage narrows the gap with downstream localization tasks and calibrates the pre-trained features to be more temporally aware.

Clip-Word Correspondence Discovery. Before performing fine-grained contrastive learning, we first need to estimate the clip-word correspondences from video-sentence pairs. Thanks to the prior established by the coarse-grained contrastive learning, we compute the cosine similarities between the video clips and their corresponding caption words in the pre-built latent space and select the K most similar words as the correspondence for each video clip. Note that we select multiple positive words rather than simply picking the one with the highest similarity because individual words may have vague meanings, while a sense-group (see Footnote 2) conveys more precise information (cf. Sect. 4.7).

Given the video-sentence pair \(\left\{ \boldsymbol{v}, \boldsymbol{q}\right\} \), for the encoded \(t^{th}\) video clip \(\boldsymbol{v}^{t}\), we compute its cosine similarities with the \(s^{th}\) word embedding \(\boldsymbol{q}^{s}\) and apply the \({\text {topk}}\) operation to select the K best-matched words. Following [57], these K selected items are average pooled to form the final positive sample:

$$\begin{aligned} \boldsymbol{q}_{+}^{t}={\text {avgpool}} \Big (\underset{s\in [1, S_{q}]}{\arg {\text {topk}} } \left( \boldsymbol{v}^{t} {\cdot } \boldsymbol{q}^{s}\right) \Big ), \end{aligned}$$
(2)

where \(\boldsymbol{q}_{+}^{t}\) is the final positive sample for \(\boldsymbol{v}^{t}\). \((\boldsymbol{u} {\cdot } \boldsymbol{v})=\boldsymbol{u}^{\top } \boldsymbol{v} /\Vert \boldsymbol{u}\Vert \Vert \boldsymbol{v}\Vert \) represents the cosine similarity between \(\ell _{2}\) normalized \(\boldsymbol{u}\) and \(\boldsymbol{v}\). This process can be efficiently performed for all the video clips using matrix operations.
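
The correspondence discovery of Eq. (2) can be sketched as follows for a single video with the clip-topk strategy of Sect. 4.7 (function and variable names are ours, and K defaults to the value used in Sect. 4.1).

```python
import torch
import torch.nn.functional as F

def discover_positives(v, q, k=3):
    """Clip-word correspondence discovery, a sketch of Eq. (2).

    v: (T, D) clip features of one video; q: (S, D) word features of its caption.
    Returns q_plus: (T, D), the average of the K most similar words per clip.
    """
    sim = F.normalize(v, dim=-1) @ F.normalize(q, dim=-1).t()  # (T, S) cosine similarities
    topk_idx = sim.topk(k, dim=-1).indices                     # (T, K) best-matching word indices
    q_plus = q[topk_idx].mean(dim=1)                           # (T, K, D) -> (T, D) average pooling
    return q_plus
```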

Fine-Grained Contrastive Loss. With the selected clip-word correspondences as positive pairs, we perform fine-grained representation learning with the cross-modal InfoNCE [18] loss (cf. Fig. 4a). The negative samples are taken from the other words within the batch. Therefore, the fine-grained contrastive loss is defined as follows.

$$\begin{aligned} \mathcal {L}_{f}=\frac{1}{T}\sum _{t=1}^{T}-\log \frac{ \exp (\boldsymbol{v}^{t} {\cdot } \boldsymbol{q}_{+}^{t} / \tau )}{\sum _{i=1}^{N} \sum _{s=1}^{S_{q_{i}}} \exp \left( \boldsymbol{v}^{t} {\cdot } \boldsymbol{q}_{i}^{s} / \tau \right) }, \end{aligned}$$
(3)

where \(\boldsymbol{q}_{i}^{s}\) is the \(s^{th}\) word feature of the \(i^{th}\) sentence \(\boldsymbol{q}_{i}\).
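
Below is a hedged sketch of Eq. (3). The denominator pools all words of all sentences in the batch, as in the equation; for simplicity it assumes every caption in the batch has the same padded length S.

```python
import torch
import torch.nn.functional as F

def fine_grained_loss(v, q, q_plus, tau=0.07):
    """Fine-grained clip-word InfoNCE, a sketch of Eq. (3).

    v:      (N, T, D) clip features
    q:      (N, S, D) word features of all sentences in the batch (negatives)
    q_plus: (N, T, D) pooled positive word features from the discovery step
    """
    N, T, D = v.shape
    v_n = F.normalize(v, dim=-1)
    qp_n = F.normalize(q_plus, dim=-1)
    q_n = F.normalize(q, dim=-1).reshape(-1, D)           # (N*S, D) all words in the batch

    pos = (v_n * qp_n).sum(dim=-1) / tau                  # (N, T) numerator terms v^t . q_+^t
    neg = v_n.reshape(-1, D) @ q_n.t() / tau              # (N*T, N*S) denominator terms v^t . q_i^s
    log_denom = torch.logsumexp(neg, dim=-1).reshape(N, T)
    return (log_denom - pos).mean()                       # -log ratio, averaged over clips and batch
```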

3.4 Temporal Aware Contrastive Learning

Fig. 4. Illustrations of (a) the fine-grained contrastive loss and (b) the temporal aware contrastive loss. \(\boldsymbol{v}^{t}\) is the \(t^{th}\) clip of video \(\boldsymbol{v}\). \(\boldsymbol{q}_{+}^{t}\) is the pooled positive word feature. \(\boldsymbol{z}^{t}\) is the warped feature. We only present positive samples and omit negative ones.

Compared with the video-level retrieval task, which favors temporally invariant features [40, 42], the clip-level localization task [6,7,8, 33, 60, 61, 70,71,72] prefers temporally aware video embeddings. Specifically, correlated actions in the same video should perceive each other. However, this characteristic is not captured by the aforementioned contrastive objectives.

Context Warping Head. To alleviate this, we set up a context warping operation that encourages each video clip to perceive its context. For the video clip \(\boldsymbol{v}^{t}\) in a matched clip-word pair \(\{\boldsymbol{v}^{t}, \boldsymbol{q}^{t}_{+}\}\) (cf. Sect. 3.3), we warp its contextual video clip at temporal distance \(\delta \), i.e., \(\boldsymbol{v}^{t+\delta }\), to “reconstruct” \(\boldsymbol{v}^{t}\). To supervise this warping process, we set up a temporal aware contrastive loss that maintains the established correspondence. Specifically, we propose a context warping head \(g(\cdot )\) to instantiate this warping process, which takes the context clip feature \(\boldsymbol{v}^{t+\delta }\) and the temporal distance \(\delta \) as input:

$$\begin{aligned} \begin{aligned} \boldsymbol{z}^{t}&=g(\boldsymbol{v}^{t+\delta }; \delta ) \\&={\text {ReLU}}\left( W[\boldsymbol{v}^{t+\delta }, {\text {sgn}}(\delta ),|\delta |]\right) , \end{aligned} \end{aligned}$$
(4)

where \(\boldsymbol{z}^{t}\) is the warped feature and \(W \in \mathbb {R}^{(D+2) \times D}\) is the trainable weight matrix. \(\delta \) is randomly sampled within the range \([-\delta _{max}, \delta _{max}]\). \({\text {sgn}}(\cdot )\) is the sign function, which returns 1 for positive values and \(-1\) for negative ones. Here \({\text {sgn}}(\delta )\) and \(|\delta |\) indicate the direction and distance of the temporal offset \(\delta \), respectively.
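
A minimal sketch of the context warping head \(g(\cdot)\) in Eq. (4) is shown below. The module structure (a single bias-free linear layer followed by ReLU) follows the equation, while the class name and batching convention are our own assumptions.

```python
import torch
import torch.nn as nn

class ContextWarpingHead(nn.Module):
    """Sketch of g(.) in Eq. (4): warps v^{t+delta} toward the reference position t."""

    def __init__(self, dim):
        super().__init__()
        # W maps the (D+2)-dim concatenation [v^{t+delta}, sgn(delta), |delta|] to D dims.
        self.fc = nn.Linear(dim + 2, dim, bias=False)

    def forward(self, v_ctx, delta):
        # v_ctx: (B, D) contextual clip features; delta: (B,) signed temporal offsets.
        direction = torch.sign(delta).float().unsqueeze(-1)  # sgn(delta)
        distance = delta.abs().float().unsqueeze(-1)         # |delta|
        x = torch.cat([v_ctx, direction, distance], dim=-1)  # (B, D+2)
        return torch.relu(self.fc(x))                        # warped feature z^t
```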

Temporal Aware Contrastive Loss. Through the context warping head, the warped feature \(\boldsymbol{z}^{t}\) should mimic the reference feature \(\boldsymbol{v}^{t}\). Since \(\boldsymbol{v}^{t}\) is aligned with \(\boldsymbol{q}^{t}_{+}\) at the clip-word level, this correspondence should also be preserved between the warped feature \(\boldsymbol{z}^{t}\) and \(\boldsymbol{q}^{t}_{+}\) (cf. Fig. 4b):

$$\begin{aligned} \mathcal {L}_{t}=\frac{1}{T}\sum _{t=1}^{T}-\log \frac{ \exp (\boldsymbol{z}^{t} {\cdot } \boldsymbol{q}_{+}^{t} / \tau )}{\sum _{i=1}^{N} \sum _{s=1}^{S_{q_{i}}} \exp \left( \boldsymbol{z}^{t} {\cdot } \boldsymbol{q}_{i}^{s} / \tau \right) }. \end{aligned}$$
(5)

This process encourages video features to learn temporal reasoning, leading to more localization-friendly video representations.
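
Since Eq. (5) has the same form as Eq. (3) with the warped feature \(\boldsymbol{z}^{t}\) in place of \(\boldsymbol{v}^{t}\), it can be sketched by reusing the illustrative `fine_grained_loss` and `ContextWarpingHead` defined above (our own sketches, assumed to be in scope); the boundary handling of the sampled offset \(\delta\) is simplified here.

```python
import torch

def temporal_aware_loss(v, q, q_plus, warp_head, delta_max=4, tau=0.07):
    """Sketch of Eq. (5): apply the clip-word InfoNCE to warped features z^t."""
    N, T, D = v.shape
    delta = torch.randint(-delta_max, delta_max + 1, (N, T))   # sampled temporal offsets
    ctx_idx = (torch.arange(T) + delta).clamp(0, T - 1)        # index of the context clip v^{t+delta}
    delta = ctx_idx - torch.arange(T)                          # effective offset after clamping
    v_ctx = torch.gather(v, 1, ctx_idx.unsqueeze(-1).expand(-1, -1, D))
    z = warp_head(v_ctx.reshape(N * T, D), delta.reshape(-1))  # warped features z^t
    return fine_grained_loss(z.reshape(N, T, D), q, q_plus, tau)
```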

Integrating the above constraints, our final loss function is as follows.

$$\begin{aligned} \mathcal {L} = \lambda _{c}\mathcal {L}_{c} + \lambda _{f}\mathcal {L}_{f} + \lambda _{t}\mathcal {L}_{t}, \end{aligned}$$
(6)

where \(\lambda _{c}\), \(\lambda _{f}\), and \(\lambda _{t}\) balance the focus on different constraints during training.

4 Experiments

4.1 Settings of Pre-training

Datasets. We pre-trained our model on three public datasets: 1) HowTo100M [39]. It consists of more than 1.2M videos accompanied by ASR-generated speech transcriptions. The provided transcriptions are used to create video-sentence pairs segmented by timestamps. 2) WebVid-2M [4]. It contains about 2.5M well-aligned web video-text pairs. 3) Google Conceptual Captions [48]. It contains 3.3M image-description pairs harvested from the web.

Encoders. Following [4, 54, 67], we adopted ViT-B/16 [13] with space-time attention [5] as the video encoder. The spatial attention weights in the transformer were initialized with ImageNet-21k pre-trained weights while the temporal attention weights were set to zero. We chose a lightweight DistilBERT [47] as the language encoder. Following [3, 4, 41, 52], the language encoder was initialized with the weights pre-trained on English Wikipedia and Toronto Book Corpus.

Implementation Details. For the video in each video-sentence pair, we sampled 8 clips of 16 frames equidistantly and fed them to the video encoder to obtain clip-level features. All frames were resized to \(224 \times 224\). For downstream transfer, we extracted video features with the well-trained model in a dense manner, i.e., every 16 consecutive frames were grouped to compute one clip feature.

Experiments were conducted on 64 V100 GPUs with a batch size of 256 and lasted for 200 epochs. We used Adam [32] with an initial learning rate of \(10^{-4}\) as the optimizer. The learning rate decayed by 0.1 at the \(100^{th}\) and \(160^{th}\) epochs. Random flipping, random cropping, and color jittering were used for video data augmentation. The loss balance factors \(\lambda _{c}\), \(\lambda _{f}\), and \(\lambda _{t}\) were set to 0.5, 1, and 1, respectively. The temperature factor \(\tau \) used in contrastive learning was set to 0.07 following [43, 59], and K in Eq. (2) was set to 3. Features in all three contrastive losses were \(\ell _{2}\)-normalized before computation.
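
The optimization setup above can be summarized in a short configuration sketch (PyTorch, with `model` as a placeholder for the full LocVTP network; only the hyper-parameters stated in this section are reflected).

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder module standing in for the full LocVTP network

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Decay the learning rate by 0.1 at the 100th and 160th of 200 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 160], gamma=0.1)

lambda_c, lambda_f, lambda_t = 0.5, 1.0, 1.0   # loss balance factors in Eq. (6)
tau, K, delta_max = 0.07, 3, 4                 # temperature, top-K in Eq. (2), max bias (Sect. 4.8)

# Each training step combines the three contrastive losses on l2-normalized features:
#   loss = lambda_c * L_c + lambda_f * L_f + lambda_t * L_t
# followed by optimizer.step(); scheduler.step() is called once per epoch.
```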

4.2 Transfer Results on Video Retrieval

Table 1. Video retrieval performance on MSR-VTT. Vis Enc. Init.: datasets used for pre-training visual encoders. Methods using multi-modal features are marked in the table. COCO: COCO Captions [10]; VGen: Visual Genome [25]; CC3M: Conceptual Captions [48]; WV2M: WebVid-2M [4]; \(^\dag \) denotes a technical report available on arXiv.

Datasets. We evaluate our LocVTP on the widely used MSR-VTT benchmark [64]. It is composed of 10K YouTube videos (9K for training and 1K for testing). We report results on the train/test split introduced in [69].

Results. 1) As shown in Table 1, we achieve state-of-the-art performance under both pre-training data settings, i.e., HowTo100M and CC3M+WV2M. Specifically, when pre-trained on CC3M+WV2M, LocVTP outperforms Frozen [4] by an absolute gain of 4.8% on R@5. 2) It should be pointed out that although it uses RGB data only, our LocVTP achieves better performance than methods using multi-modal expert features including motion, face, and speech, e.g., MMT [14]. 3) The recent work CLIP [43] provides a stronger vision encoder, and we also evaluate the performance based on it. CLIP's weights greatly improve the performance of LocVTP, with R@5 reaching 72.8%, surpassing top-performing CLIP-based methods. 4) Our LocVTP also outperforms previous methods under the zero-shot setting, showing its generalization ability.

Table 2. Temporal grounding performance using pre-trained representations. Sep.Pre.: separate pre-training, i.e., the video encoder is pre-trained with supervision on Kinetics and the text encoder is taken from BERT. We retrain the temporal grounding method 2D-TAN [74] using the pre-trained features. HT: HowTo100M; CO: COCO Captions [10]; VG: Visual Genome [25]; CC: Conceptual Captions [48]; WV: WebVid-2M [4]; \(\ddag \): the subset of HowTo100M with the same training volume as Kinetics (300K pairs). Methods with \(^*\) are not open-sourced and are re-implemented by us. \(^\dag \) denotes a technical report available on arXiv.

4.3 Transfer Results on Temporal Grounding

Settings. We validate the performance of pre-trained representations on temporal grounding, which aims to localize the actions corresponding to a sentence in an untrimmed video. Specifically, we re-train the mainstream temporal grounding method 2D-TAN [74] (see Footnote 3) by only replacing the original input features with the pre-trained ones. For ease of feature extraction, we choose representative VTP methods with publicly available code for comparison.

Datasets and Metrics. 1) ActivityNet Captions (ANet) [24]. It contains 20K untrimmed videos with 100K descriptions. By convention, we use 37,417 video-query pairs for training, 17,505 pairs for validation, and 17,031 pairs for testing. 2) Charades-STA [15]. Following the official split, 12,408 video-query pairs are used for training, and 3,720 pairs for testing. 3) TACoS [44]. It has 10,146 video-query pairs for training, 4,589 pairs for validation, and 4,083 pairs for testing.

Following prior works, we adopt “R@n, IoU@m” (abbreviated as \(R^m_n\)) as the metric. Specifically, \(R^m_n\) is defined as the percentage of queries for which at least one of the top-n retrieved moments has an IoU with the ground-truth moment larger than m.
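
For clarity, the metric can be computed per query as sketched below (our own illustrative helper; the dataset-level score averages this indicator over all queries).

```python
import torch

def recall_at_n_iou(pred_moments, gt_moment, n=1, m=0.5):
    """R@n, IoU=m for a single query.

    pred_moments: (P, 2) ranked [start, end] predictions (best first, in seconds).
    gt_moment:    (2,)   ground-truth [start, end].
    Returns 1.0 if any of the top-n predictions has temporal IoU > m, else 0.0.
    """
    top = pred_moments[:n]
    inter = (torch.min(top[:, 1], gt_moment[1]) - torch.max(top[:, 0], gt_moment[0])).clamp(min=0)
    union = torch.max(top[:, 1], gt_moment[1]) - torch.min(top[:, 0], gt_moment[0])
    iou = inter / union.clamp(min=1e-6)
    return float((iou > m).any())
```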

Results. 1) As shown in Table 2, even when trained with a much larger dataset, the current popular video-text pre-training frameworks achieve inferior performance compared to the separately pre-trained one. For example, Frozen [4] reaches 43.3% at \(R_1^{0.5}\) on ANet Captions, which is 1.1% (absolute) lower than the separately pre-trained counterpart. 2) Whether pre-trained on HowTo100M or CC + WV, our LocVTP outperforms the compared pre-training methods by a large margin on all three datasets. For example, when pre-trained on HowTo100M, LocVTP surpasses the separately pre-trained method by 3.8% on \(R_1^{0.5}\) of ANet Captions. 3) For a fairer comparison, we sample a subset of HowTo100M with the same number of training samples as Kinetics [23] (300K training pairs), denoted as HT\(^\ddag \) in Table 2. Although it uses noisy ASR captions, the results demonstrate that under the same training data volume, our LocVTP still performs better than the separately pre-trained method. This shows that our performance improvement comes from the sound architecture design rather than merely the use of a large-scale dataset.

4.4 Transfer Results on Action Step Localization

Settings. In action step localization, each video belongs to a task and is annotated with multiple action steps described with short natural language phrases. The goal is to align each frame with the correct step in text form. Following [35, 39, 68, 76], we take [77] as the downstream localization method. Specifically, we compute the similarity between each frame and the action step descriptions in the feature space to find the optimal frame-wise order of action steps for a video.

Table 3. Comparison results of action step localization (CTR: average recall) and action segmentation (FA: frame-wise accuracy).

Datasets and Metrics. We experiment on the instructional video dataset CrossTask [77], which includes 83 tasks and 4.7K videos. Each task is described with an ordered list of steps with manual natural language descriptions. We perform the same evaluation protocol as in [77] by reporting the average recall (CTR).

Results. Table 3 reports the action step localization performance on the CrossTask dataset. Our LocVTP pre-trained feature achieves state-of-the-art performance with CTR reaching 51.7%, surpassing the previous method VideoCLIP by 4.4%. This competitive performance demonstrates that LocVTP features can effectively perceive detailed action steps.

4.5 Transfer Results on Action Segmentation

Settings. We assess our LocVTP on action segmentation, which aims to predict an action label for each video frame. It is a pure vision task that does not use the text encoder. Following [35, 68, 76], we encode the input video frames with the pre-trained video encoder and apply a linear classifier on top of the features to predict action labels.

Datasets and Metrics. We conduct experiments on the widely used COIN dataset [51] and the frame-wise accuracy (FA) is taken as the evaluation metric.

Results. As shown in Table 3, our LocVTP achieves state-of-the-art performance with FA reaching 72.9%. This further demonstrates the superiority of our feature in localization tasks even in the absence of language guidance.

4.6 Ablation Study on Training Objective

Training Strategy. The coarse-grained contrastive loss \(\mathcal {L}_{c}\) provides a basic cross-modal matching prior, and we investigate three potential ways to use it: 1) multi-stage training: first perform coarse-grained training and then use the trained model to initialize the subsequent stages; 2) warm-up training: decrease \(\lambda _{c}\) exponentially from 1 to 0 throughout the training process; 3) weighted training: set \(\lambda _{c}\) to a constant value; here we set \(\lambda _{c} = 0.5\). As shown in Table 4a, the weighted training strategy achieves the best performance, warm-up training is slightly behind, and multi-stage training is the least effective.

Loss Component. We present the loss component ablations in Table 4b. As shown, both the fine-grained loss \(\mathcal {L}_{f}\) and the temporal aware loss \(\mathcal {L}_{t}\) are crucial. For example, compared to the full version (exp. #1), removing \(\mathcal {L}_{f}\) or \(\mathcal {L}_{t}\) brings about 1.4% and 1.5% performance degradation on the \(R_1^{0.5}\) metric, respectively.

More Downstream Temporal Grounding Baselines. We take another temporal grounding method, CSMGAN [29], as the downstream baseline. As shown in Table 4c, our LocVTP pre-trained feature consistently benefits this more advanced baseline.

Table 4. Ablation studies of (a) training strategies; (b) loss components; (c) comparison results on the temporal grounding method CSMGAN [29]. Sep.Pre.: separate pre-training, i.e., the video encoder is pre-trained with supervision on Kinetics and the text encoder is taken from BERT.

4.7 Ablations on Fine-grained Contrastive Loss (see Footnote 4)

Correspondence Discovery Strategies. We experiment with four potential strategies to extract cross-modal correspondences: 1) random: randomly select K words for each clip; 2) 2d-topk: select the most similar \(K \times T\) clip-word pairs; 3) word-topk: select the most similar K clips for each word; 4) clip-topk: select the most similar K words for each clip, namely the method illustrated in Sect. 3.3. As indicated in Table 5a, the random and 2d-topk matching strategies are the two worst options. The word-topk matching is also sub-optimal, which can be attributed to the possibility of introducing words without concrete meanings (e.g., articles or pronouns) into the matched pairs.

Number of Selected Pairs K. We further ablate the hyper-parameter K used in the clip-topk strategy. Table 5b shows that the performance saturates at \(K=3\) and slightly decreases at \(K=4\). We conjecture that this is because too few words convey vague meanings, while too large a K makes it difficult to establish accurate correspondences.

4.8 Ablations on Temporal Aware Contrastive Loss (see Footnote 4)

Table 5. Ablation studies of (a) correspondence discovery strategies; (b) the number of selected pairs K; (c) the context projection head, where \({\text {sgn}}(\delta )\) and \(|\delta |\) denote the direction and distance; (d) the maximum bias distance; (e) intra-modal vs. cross-modal \(\mathcal {L}_{t}\); (f) linear localization accuracy, where \(Accu_o\) and \(Accu_d\) are the order and distance prediction accuracies.

Context Projection Head Components. In Eq. (4), the warped feature is generated based on both the direction \({\text {sgn}}(\delta )\) and the distance \(|\delta |\). Here we investigate eliminating either of them to see the difference. We observe in Table 5c that removing either component decreases the performance, which indicates that both the direction and the distance of the bias \(\delta \) are crucial for feature warping.

Maximum Bias Distance \(\delta _{max}\). Here we ablate different values of \(\delta _{max}\). From Table 5d, we can see that \(\delta _{max} = 4\) achieves the best performance. This may be because a small bias prevents the model from perceiving enough context, while a large bias makes contextual reasoning too difficult.

Intra-modal vs. Cross-modal Constraint. In Sect. 3.4, given the matched clip-word pair \(\{\boldsymbol{v}^{t}, \boldsymbol{q}^{t}_{+}\}\) and the warped feature \(\boldsymbol{z}^{t}\), we enforce cross-modal supervision, i.e., \(\boldsymbol{z}^{t} \leftrightarrow \boldsymbol{q}^{t}_{+}\). Here, we instead apply the temporal aware contrastive loss \(\mathcal {L}_{t}\) in an intra-modal manner, which regards \(\boldsymbol{z}^{t}\) and \(\boldsymbol{v}^{t}\) as positive pairs, i.e., \(\boldsymbol{z}^{t} \leftrightarrow \boldsymbol{v}^{t}\). The results in Table 5e show that our adopted cross-modal mode outperforms the intra-modal one.

Temporal Sensitivity Analysis. As a sanity check, we devise two proxy tasks to evaluate the temporal sensitivity of pre-trained video features. As shown in Fig. 5a, n equidistantly sampled clips from one video are fed into the frozen video backbone to extract their corresponding features. Two linear classifiers are then trained to perform two tasks: order prediction and distance estimation. The first task predicts the temporal index of a clip, while the second estimates the temporal distance between two clips. The results in Table 5f show that our LocVTP with the temporal aware loss \(\mathcal {L}_{t}\) outperforms the variant without it as well as two typical VTP methods (i.e., UniVL and MIL-NCE), which shows that \(\mathcal {L}_{t}\) clearly contributes to the localization ability.
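
As an illustration of these proxy tasks, the two probes can be set up as below. This is a sketch under our own assumptions: both tasks are cast as linear classification over frozen clip features, and the distance classes cover all signed offsets; the exact probe formulation is not specified here.

```python
import torch
import torch.nn as nn

n_clips, D = 8, 256                              # toy values: clips per video, feature dim

order_probe = nn.Linear(D, n_clips)              # task 1: predict the temporal index of a clip
dist_probe = nn.Linear(2 * D, 2 * n_clips - 1)   # task 2: predict the signed distance of two clips

clip_feat = torch.randn(32, D)                   # frozen backbone features of single clips
pair_feat = torch.randn(32, 2 * D)               # concatenated features of clip pairs

order_logits = order_probe(clip_feat)            # (32, n_clips), trained with cross-entropy
dist_logits = dist_probe(pair_feat)              # (32, 2*n_clips - 1), offsets -(T-1)..(T-1)
```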

4.9 Visualization

Cross-modal Correspondence Visualizations. Figure 5b shows two frames (see Footnote 4) and their corresponding similarity scores with the caption words. The top K highest-scored words are marked in red (\(K=3\)). Frame #1 and frame #2 have similar appearances yet correspond to different action processes. Our method pinpoints the subtle differences and accurately finds the most relevant words.

Fig. 5. (a) Linear localization evaluations, including order and distance prediction; (b) Cross-modal correspondence visualizations, where the top K responsive words are marked in red; (c) Gaussian distributions of the reference, bias, and projection similarities. (Color figure online)

Fig. 6. UMAP visualizations. Clips corresponding to the ground-truth caption are marked in a different color from the other clips. (Color figure online)

UMAP Visualizations. As shown in Fig. 6, we provide UMAP [37] visualizations of the fused multi-modal features, which are generated by multiplying the extracted video features by one query feature. With the temporal aware loss \(\mathcal {L}_{t}\), our LocVTP shows more separable distributions than LocVTP w/o \(\mathcal {L}_{t}\), demonstrating that \(\mathcal {L}_{t}\) helps distinguish the action of interest from the background.

Similarity Distribution Visualizations. In Eq. (4), the context projection head warps the contextual clip \(\boldsymbol{v}^{t+\delta }\) toward the reference clip \(\boldsymbol{v}^{t}\). Here we collect 10K paired training samples and compute three sets of cosine similarities: the reference similarity \((\boldsymbol{v}^{t}, \boldsymbol{q}_{+}^{t})\), the bias similarity \((\boldsymbol{v}^{t+\delta }, \boldsymbol{q}_{+}^{t})\), and the projection similarity \((\boldsymbol{z}^{t}, \boldsymbol{q}_{+}^{t})\). Figure 5c plots the histograms of these similarities. We can see that the distribution of the projection similarity is close to that of the reference similarity and far from that of the bias similarity. This demonstrates that our context projection head can effectively warp contextual features conditioned on the temporal information.

5 Conclusions

In this paper, we propose LocVTP, the first video-text pre-training framework for temporal localization tasks. Specifically, we apply cross-modal contrastive learning at both coarse-grained video-sentence and fine-grained clip-word levels. Besides, we propose a context warping pretext task and a temporal aware contrastive loss to enhance the temporal awareness of video features. Experimental results show that LocVTP achieves state-of-the-art performance when transferred to both retrieval-based and localization-based downstream tasks.