1 Introduction

Fig. 1. (a) Video retrieval and temporal grounding performance using different pre-trained features. Sep.Pre. denotes separate pre-training, i.e., the video encoder is pre-trained with supervision on Kinetics [23] and the text encoder is taken from BERT [12]. MIL-NCE and our LocVTP are VTP methods pre-trained on HowTo100M [39]. For video retrieval, we use COOT [16] as the downstream method and evaluate on the YouCook2 [75] dataset with R@5. For temporal grounding, we take 2D-TAN [74] as the downstream method and evaluate on the ActivityNet Captions [24] dataset with R@1, IoU = 0.5. (b) Retrieval-based and localization-based downstream tasks. We take video retrieval and temporal grounding as typical examples, respectively. The former needs video-level classification while the latter requires clip-level or frame-level localization.

Video-Text Pre-training (VTP) [4, 26, 30, 31, 39, 49, 54, 67] has attracted increasing attention with the aim of learning generic and transferable joint video-language (VL) representations. Compared to the conventional separate pre-training of each single modality, e.g., video features pre-trained on action recognition datasets (Kinetics [23], Sports-1M [22]), VTP has several advantages: 1) It leverages large-scale unlabeled narrated video data with automatically generated text for video-text correspondence pre-training. 2) It maps the features of different modalities into a shared latent space, which reduces the difficulty of cross-modal feature interaction. Thanks to these advantages, VTP has significantly improved the performance of many downstream VL tasks. For example, as illustrated in [16], the video retrieval performance using features pre-trained with the VTP method MIL-NCE [38] is much higher than that using separately pre-trained features (cf. Fig. 1a (left)).

Despite their encouraging performance, we find that most current VTP methods are applicable to only a limited set of downstream tasks, i.e., they focus on retrieval-based tasks which require video-level predictions, e.g., video retrieval [64], video captioning [45], and video question answering [21]. In contrast, there exists another mainstream category of localization-based tasks which expect more fine-grained clip-level or frame-level predictions, e.g., temporal grounding [15], action segmentation [51], and action step localization [77] (cf. Fig. 1b). Unfortunately, through experiments, we find that current VTP methods generalize poorly to this type of downstream task. For example, on temporal grounding, even when pre-trained with the much larger HowTo100M [39] dataset, the VTP method MIL-NCE still performs worse than the separately pre-trained counterpart (cf. Fig. 1a (right)).

In this paper, we argue that this poor transferability to localization-based tasks is due to the absence of two indispensable characteristics: 1) Fine-grained alignment: We contend that the alignment should be conducted at the more fine-grained clip-word level instead of the coarse-grained video-sentence (see Footnote 1) level. As the temporal grounding example in Fig. 2 shows, a given query sentence may contain multiple actions (e.g., “hit the golf ball” (\(q^{s1}\)) and “bend down to pick up the ball” (\(q^{s2}\))). Thus, aligning each action (or its words) to the corresponding clips (i.e., \(v^{t1}\) and \(v^{t2}\)) helps obtain more detailed and accurate feature representations. 2) Temporal relation reasoning: We hope that the clip features of a certain action can also perceive other actions in the same video. For example, in a typical golf video, action \(q^{s2}\) (“bend down to pick up the ball”) always occurs shortly after action \(q^{s1}\) (“hit the golf ball”). Thus, incorporating such temporal relationships into VTP helps improve the temporal awareness of video features.

Fig. 2. Fine-grained video-text alignment: positive clip-word pairs are selected via cosine similarity and are then forced to be close to each other, i.e., \(\boldsymbol{v}^{t1} \leftrightarrow \boldsymbol{q}^{s1}\), \(\boldsymbol{v}^{t2} \leftrightarrow \boldsymbol{q}^{s2}\); Temporal relation reasoning: a context warping head reconstructs \(\boldsymbol{v}^{t1}\) conditioned on \(\boldsymbol{v}^{t2}\) and distance \(t2 - t1\) while maintaining the cross-modal alignment unchanged, i.e., \(\boldsymbol{z}^{t1} = {\text {warp}}(\boldsymbol{v}^{t2}, t2-t1) \leftrightarrow \boldsymbol{q}^{s1}\).

Based on these observations, we propose a novel video-text pre-training framework for localization tasks, dubbed LocVTP. By incorporating both above-mentioned characteristics, LocVTP achieves state-of-the-art performance not only on the widely studied retrieval-based tasks, but also on the less-explored localization-based tasks. Specifically, for fine-grained alignment, we extend the coarse-grained contrastive training with video-sentence alignment to a fine-grained one with clip-word alignment. Since there are no clip-word correspondence annotations in existing large-scale datasets, we utilize the latent space established by the coarse-grained contrastive learning to estimate clip-word similarities, and then select the clip-word pairs with high similarities as positive samples. To illustrate this, as shown in Fig. 2 (right), suppose \(\{\boldsymbol{v}^{t1}, \boldsymbol{q}^{s1}\}\) and \(\{\boldsymbol{v}^{t2}, \boldsymbol{q}^{s2}\}\) are two matched clip-word feature pairs. The semantic embeddings in each pair are mapped to be close to each other, i.e., \(\boldsymbol{v}^{t1} \leftrightarrow \boldsymbol{q}^{s1}\), \(\boldsymbol{v}^{t2} \leftrightarrow \boldsymbol{q}^{s2}\). For temporal relation reasoning, we propose a new pretext task called context warping, again using Fig. 2 (right) for illustration. Context warping generates a new temporally relevant clip feature \(\boldsymbol{z}^{t1}\), which imitates \(\boldsymbol{v}^{t1}\), conditioned on another clip \(\boldsymbol{v}^{t2}\) and the relative temporal distance \(t2 - t1\), i.e., \(\boldsymbol{z}^{t1}={\text {warp}}(\boldsymbol{v}^{t2}, t2 - t1)\). The predicted clip feature \(\boldsymbol{z}^{t1}\) is enforced to preserve the previously established cross-modal correspondence, i.e., \(\boldsymbol{z}^{t1} \leftrightarrow \boldsymbol{q}^{s1}\). In this manner, we simulate the contextual reasoning process and enhance the temporal awareness of video features.

We conduct extensive experiments on four downstream tasks (i.e., video retrieval, temporal grounding, action step localization, and action segmentation) across six datasets. The results on both retrieval-based and localization-based tasks demonstrate the superiority and the generalization ability of our LocVTP.

In summary, we make three contributions in this paper:

  • We propose a localization-oriented video-text pre-training framework, LocVTP, which benefits both retrieval-based and the less-explored localization-based downstream tasks.

  • We pinpoint two crucial designs in LocVTP, i.e., fine-grained video-text alignment and temporal relation reasoning.

  • Experimental results show that our LocVTP significantly outperforms previous state-of-the-art methods when transferred to various downstream tasks.

2 Related Work

Video-Text Pre-training (VTP). With the release of the large-scale instructional dataset HowTo100M, VTP has attracted significant interest in the community. Overall, the mainstream methods can be broadly divided into two categories: 1) Generative methods: Several methods [11, 20, 28, 31, 34, 50, 55, 56] extend BERT [53] to the cross-modal domain, i.e., they accept both visual and textual tokens as input and perform the masked-token prediction task. 2) Discriminative methods: These methods [4, 26, 30, 41] learn representations by differentiating input samples using objectives such as the metric loss [19, 58] or contrastive loss [9, 18]. ClipBert [26] enables affordable pre-training from sparsely sampled frames. Frozen [4] adapts the recent ViT [13] as the visual encoder and can be flexibly trained on both image and video datasets. T2VLAD [56] and FCA [17] also perform fine-grained interactions between video clips and phrases. However, both of them introduce additional overhead, e.g., k-means clustering or a graph auto-encoder. In contrast, our LocVTP explicitly models the clip-word matching in a more lightweight manner via similarity comparison.

Pre-training for Localization Tasks. Compared to retrieval tasks [21, 45, 64], which require only video-level predictions, localization tasks [15, 51, 77] are essentially different since they need dense clip-level or frame-level predictions, making pre-training for these tasks more challenging. In the pure video domain, this gap has been noticed and several pre-training works [2, 65, 66, 73] tailored for action localization have been proposed. BSP [65] synthesizes temporal boundaries using existing action recognition datasets and conducts boundary type classification to generate localization-friendly features. TSP [2] trains video encoders to be temporally sensitive by predicting the foreground clip label and classifying whether a clip is inside or outside the action. In the video-language domain, our LocVTP is the first pre-training framework designed for localization tasks. Besides, compared to TSP and BSP, which require label information for supervised pre-training, our LocVTP can learn directly from narrated videos.

3 Approach

3.1 Overview of LocVTP

An overview of LocVTP is illustrated in Fig. 3. We first feed the video and language modalities into their respective encoders \(f_{v}(\cdot )\) and \(f_{q}(\cdot )\) to obtain embedded features. We follow the sparse sampling strategy of [26] and sample T clips for each video, yielding the encoded video \(\boldsymbol{v}= \{\boldsymbol{v}^{t}\}_{t=1}^{T}\), where \(\boldsymbol{v}^{t}\in \mathbb {R}^{D}\) is the \(t^{th}\) clip feature and D is the feature dimension. The text embedding is represented as \(\boldsymbol{q}= \{\boldsymbol{q}^{s}\}_{s=1}^{S_{q}}\), where \(\boldsymbol{q}^{s}\in \mathbb {R}^{D}\) is the \(s^{th}\) word embedding and \(S_q\) is the number of words in \(\boldsymbol{q}\).
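
For concreteness, the tensor shapes used throughout this section can be sketched as below. This is a minimal PyTorch illustration with random placeholders standing in for the actual encoder outputs; the dimension values are toy numbers, not the ones used in our experiments.

```python
import torch

# Toy dimensions for illustration only (not the values used in experiments).
# N: batch size, T: clips per video, S_q: words per caption, D: feature dim.
N, T, S_q, D = 4, 8, 12, 256

# Placeholders standing in for the encoder outputs f_v(video) and f_q(text).
v = torch.randn(N, T, D)    # clip-level video features {v^t}, t = 1..T
q = torch.randn(N, S_q, D)  # word-level text features  {q^s}, s = 1..S_q
```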

Fig. 3. An overview of LocVTP. \(f_{v}(\cdot )\) and \(f_{q}(\cdot )\) are the video and language encoders, respectively. 1) The coarse-grained contrastive loss \(\mathcal {L}_{c}\) matches the global video and sentence representations \(\overline{\boldsymbol{v}}\) and \(\overline{\boldsymbol{q}}\). 2) The clip-word correspondence is first built by similarity computation, and then the fine-grained contrastive loss \(\mathcal {L}_{f}\) conducts detailed cross-modal alignment. Note that for clarity, we only present the correspondence discovery for the clip \(\boldsymbol{v}^{t}\). 3) A context warping head is employed to warp the contextual feature \(\boldsymbol{v}^{t+\delta }\), and a temporal aware contrastive loss \(\mathcal {L}_{t}\) is applied to the warped feature \(\boldsymbol{z}^{t}\).

Three contrastive objectives are then applied to learn cross-modal features: 1) The coarse-grained contrastive loss builds video-sentence level alignment; 2) A correspondence discovery strategy is proposed to build clip-word relations, on which the fine-grained contrastive loss is applied; 3) A temporal aware contrastive loss with the context warping pretext task is proposed to encode temporal information into the video representations.

3.2 Coarse-Grained Contrastive Learning

We first conduct contrastive alignment at the global video-sentence level. Specifically, to obtain the video- and sentence-level features, we average pool \(\boldsymbol{v}\) and \(\boldsymbol{q}\) along the temporal and word index dimensions, respectively. The global features are denoted as \(\overline{\boldsymbol{v}}\), \(\overline{\boldsymbol{q}}\in \mathbb {R}^{D}\). Then we formulate this video-sentence alignment within the contrastive framework [18] as follows:

$$\begin{aligned} \mathcal {L}_{c}=-\log \frac{\exp \left( \overline{\boldsymbol{v}} {\cdot } \overline{\boldsymbol{q}} / \tau \right) }{\sum _{i=1}^{N} \exp \left( \overline{\boldsymbol{v}} {\cdot } \overline{\boldsymbol{q}}_{i} / \tau \right) }, \end{aligned}$$
(1)

where \(\overline{\boldsymbol{q}}_{i}, i \in [1, N]\), is the sentence feature of the \(i^{th}\) sample within the batch, N denotes the batch size, and \(\tau \) is the temperature parameter. The coarse-grained contrastive loss \(\mathcal {L}_{c}\) serves as a base loss that imposes a video-sentence level constraint and induces a basic latent space in which detailed cross-modal matching can be carried out. Though usually coarse and noisy, this latent space encodes a prior for fine-grained clip-word correspondence discovery. In Sect. 4.6, we design and analyze three potential ways to use this cross-modal matching prior.
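
A minimal PyTorch sketch of Eq. (1) is given below. It assumes \(\ell_2\)-normalized average-pooled features (cf. Sect. 4.1) and implements only the video-to-sentence direction written in Eq. (1); the function name is illustrative, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def coarse_contrastive_loss(v, q, tau=0.07):
    """Video-sentence InfoNCE, a sketch of Eq. (1).

    v: (N, T, D) clip features; q: (N, S, D) word features.
    """
    v_bar = F.normalize(v.mean(dim=1), dim=-1)   # (N, D) global video features
    q_bar = F.normalize(q.mean(dim=1), dim=-1)   # (N, D) global sentence features
    logits = v_bar @ q_bar.t() / tau             # (N, N) pairwise cosine similarities
    targets = torch.arange(v.size(0), device=v.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)      # -log softmax ratio, averaged over the batch
```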

3.3 Fine-Grained Contrastive Learning

Beyond the coarse-grained video-sentence alignment, we propose to conduct contrastive learning in a fine-grained manner, i.e., clip-word matching. We contend that introducing such alignment learning into the pre-training stage narrows the gap with downstream localization tasks and calibrates the pre-trained features to be more temporally aware.

Clip-Word Correspondence Discovery. Before performing fine-grained contrastive learning, we first need to estimate the clip-word correspondences from video-sentence pairs. Thanks to the prior established by the coarse-grained contrastive learning, we compute the cosine similarities between the video clips and their corresponding caption words in the pre-built latent space and select the K most similar words as the correspondence for each video clip. Note that we select multiple positive words rather than simply picking the one with the highest similarity because individual words may have vague meanings, while a sense-group (see Footnote 2) conveys more precise information (cf. Sect. 4.7).

Given the video-sentence pair \(\left\{ \boldsymbol{v}, \boldsymbol{q}\right\} \), for the encoded \(t^{th}\) video clip \(\boldsymbol{v}^{t}\), we compute its cosine similarities with the \(s^{th}\) word embedding \(\boldsymbol{q}^{s}\) and apply the \({\text {topk}}\) operation to select the K best-matched words. Following [57], these K selected items are average pooled to form the final positive sample:

$$\begin{aligned} \boldsymbol{q}_{+}^{t}={\text {avgpool}} \Big (\underset{s\in [1, S_{q}]}{\arg {\text {topk}} } \left( \boldsymbol{v}^{t} {\cdot } \boldsymbol{q}^{s}\right) \Big ), \end{aligned}$$
(2)

where \(\boldsymbol{q}_{+}^{t}\) is the final positive sample for \(\boldsymbol{v}^{t}\). \((\boldsymbol{u} {\cdot } \boldsymbol{v})=\boldsymbol{u}^{\top } \boldsymbol{v} /\Vert \boldsymbol{u}\Vert \Vert \boldsymbol{v}\Vert \) represents the cosine similarity between \(\ell _{2}\) normalized \(\boldsymbol{u}\) and \(\boldsymbol{v}\). This process can be efficiently performed for all the video clips using matrix operations.
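
The correspondence discovery of Eq. (2) can be sketched as follows for a single video with the clip-topk strategy of Sect. 4.7 (function and variable names are ours, and K defaults to the value used in Sect. 4.1).

```python
import torch
import torch.nn.functional as F

def discover_positives(v, q, k=3):
    """Clip-word correspondence discovery, a sketch of Eq. (2).

    v: (T, D) clip features of one video; q: (S, D) word features of its caption.
    Returns q_plus: (T, D), the average of the K most similar words per clip.
    """
    sim = F.normalize(v, dim=-1) @ F.normalize(q, dim=-1).t()  # (T, S) cosine similarities
    topk_idx = sim.topk(k, dim=-1).indices                     # (T, K) best-matching word indices
    q_plus = q[topk_idx].mean(dim=1)                           # (T, K, D) -> (T, D) average pooling
    return q_plus
```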

Fine-Grained Contrastive Loss. With the selected clip-word correspondences as positive pairs, we perform fine-grained representation learning with the cross-modal InfoNCE [18] loss (cf. Fig. 4a). The negative samples are taken from the other words within the batch. Therefore, the fine-grained contrastive loss is defined as follows.

$$\begin{aligned} \mathcal {L}_{f}=\frac{1}{T}\sum _{t=1}^{T}-\log \frac{ \exp (\boldsymbol{v}^{t} {\cdot } \boldsymbol{q}_{+}^{t} / \tau )}{\sum _{i=1}^{N} \sum _{s=1}^{S_{q_{i}}} \exp \left( \boldsymbol{v}^{t} {\cdot } \boldsymbol{q}_{i}^{s} / \tau \right) }, \end{aligned}$$
(3)

where \(\boldsymbol{q}_{i}^{s}\) is the \(s^{th}\) word feature of the \(i^{th}\) sentence \(\boldsymbol{q}_{i}\).
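
Below is a hedged sketch of Eq. (3). The denominator pools all words of all sentences in the batch, as in the equation; for simplicity it assumes every caption in the batch has the same padded length S.

```python
import torch
import torch.nn.functional as F

def fine_grained_loss(v, q, q_plus, tau=0.07):
    """Fine-grained clip-word InfoNCE, a sketch of Eq. (3).

    v:      (N, T, D) clip features
    q:      (N, S, D) word features of all sentences in the batch (negatives)
    q_plus: (N, T, D) pooled positive word features from the discovery step
    """
    N, T, D = v.shape
    v_n = F.normalize(v, dim=-1)
    qp_n = F.normalize(q_plus, dim=-1)
    q_n = F.normalize(q, dim=-1).reshape(-1, D)           # (N*S, D) all words in the batch

    pos = (v_n * qp_n).sum(dim=-1) / tau                  # (N, T) numerator terms v^t . q_+^t
    neg = v_n.reshape(-1, D) @ q_n.t() / tau              # (N*T, N*S) denominator terms v^t . q_i^s
    log_denom = torch.logsumexp(neg, dim=-1).reshape(N, T)
    return (log_denom - pos).mean()                       # -log ratio, averaged over clips and batch
```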

3.4 Temporal Aware Contrastive Learning

Fig. 4. Illustrations of (a) the fine-grained contrastive loss and (b) the temporal aware contrastive loss. \(\boldsymbol{v}^{t}\) is the \(t^{th}\) clip of video \(\boldsymbol{v}\). \(\boldsymbol{q}_{+}^{t}\) is the pooled positive word feature. \(\boldsymbol{z}^{t}\) is the warped feature. We only present positive samples and omit negative ones.

Compared with the video-level retrieval task, which favors temporally invariant features [40, 42], the clip-level localization task [6,7,8, 33, 60, 61, 70,71,72] prefers temporally aware video embeddings. Specifically, correlated actions in the same video should perceive each other. However, this characteristic is not captured by the aforementioned contrastive objectives.

Context Warping Head. To alleviate this, we set up a context warping operation that encourages each video clip to perceive its context. For the video clip \(\boldsymbol{v}^{t}\) in a matched clip-word pair \(\{\boldsymbol{v}^{t}, \boldsymbol{q}^{t}_{+}\}\) (cf. Sect. 3.3), we warp its contextual video clip at temporal distance \(\delta \), i.e., \(\boldsymbol{v}^{t+\delta }\), to “reconstruct” \(\boldsymbol{v}^{t}\). To supervise this warping process, we set up a temporal aware contrastive loss that maintains the established correspondence. Specifically, we propose a context warping head \(g(\cdot )\) to instantiate this warping process, which takes the context clip feature \(\boldsymbol{v}^{t+\delta }\) and the temporal distance \(\delta \) as input:

$$\begin{aligned} \begin{aligned} \boldsymbol{z}^{t}&=g(\boldsymbol{v}^{t+\delta }; \delta ) \\&={\text {ReLU}}\left( W[\boldsymbol{v}^{t+\delta }, {\text {sgn}}(\delta ),|\delta |]\right) , \end{aligned} \end{aligned}$$
(4)

where \(\boldsymbol{z}^{t}\) is the warped feature and \(W \in \mathbb {R}^{(D+2) \times D}\) is the trainable weight matrix. \(\delta \) is randomly sampled within the range \([-\delta _{max}, \delta _{max}]\). \({\text {sgn}}(\cdot )\) is the sign function, which returns 1 for positive values and \(-1\) for negative ones. Here \({\text {sgn}}(\delta )\) and \(|\delta |\) indicate the direction and distance of the temporal offset \(\delta \), respectively.
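
A minimal sketch of the context warping head \(g(\cdot)\) in Eq. (4) is shown below. The module structure (a single bias-free linear layer followed by ReLU) follows the equation, while the class name and batching convention are our own assumptions.

```python
import torch
import torch.nn as nn

class ContextWarpingHead(nn.Module):
    """Sketch of g(.) in Eq. (4): warps v^{t+delta} toward the reference position t."""

    def __init__(self, dim):
        super().__init__()
        # W maps the (D+2)-dim concatenation [v^{t+delta}, sgn(delta), |delta|] to D dims.
        self.fc = nn.Linear(dim + 2, dim, bias=False)

    def forward(self, v_ctx, delta):
        # v_ctx: (B, D) contextual clip features; delta: (B,) signed temporal offsets.
        direction = torch.sign(delta).float().unsqueeze(-1)  # sgn(delta)
        distance = delta.abs().float().unsqueeze(-1)         # |delta|
        x = torch.cat([v_ctx, direction, distance], dim=-1)  # (B, D+2)
        return torch.relu(self.fc(x))                        # warped feature z^t
```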

Temporal Aware Contrastive Loss. Through the context warping head, the warped feature \(\boldsymbol{z}^{t}\) should mimic the reference feature \(\boldsymbol{v}^{t}\). Since \(\boldsymbol{v}^{t}\) is aligned with \(\boldsymbol{q}^{t}_{+}\) at the clip-word level, this correspondence should also be preserved between the warped feature \(\boldsymbol{z}^{t}\) and \(\boldsymbol{q}^{t}_{+}\) (cf. Fig. 4b):

$$\begin{aligned} \mathcal {L}_{t}=\frac{1}{T}\sum _{t=1}^{T}-\log \frac{ \exp (\boldsymbol{z}^{t} {\cdot } \boldsymbol{q}_{+}^{t} / \tau )}{\sum _{i=1}^{N} \sum _{s=1}^{S_{q_{i}}} \exp \left( \boldsymbol{z}^{t} {\cdot } \boldsymbol{q}_{i}^{s} / \tau \right) }. \end{aligned}$$
(5)

This process encourages video features to learn temporal reasoning, leading to more localization-friendly video representations.
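
Since Eq. (5) has the same form as Eq. (3) with the warped feature \(\boldsymbol{z}^{t}\) in place of \(\boldsymbol{v}^{t}\), it can be sketched by reusing the illustrative `fine_grained_loss` and `ContextWarpingHead` defined above (our own sketches, assumed to be in scope); the boundary handling of the sampled offset \(\delta\) is simplified here.

```python
import torch

def temporal_aware_loss(v, q, q_plus, warp_head, delta_max=4, tau=0.07):
    """Sketch of Eq. (5): apply the clip-word InfoNCE to warped features z^t."""
    N, T, D = v.shape
    delta = torch.randint(-delta_max, delta_max + 1, (N, T))   # sampled temporal offsets
    ctx_idx = (torch.arange(T) + delta).clamp(0, T - 1)        # index of the context clip v^{t+delta}
    delta = ctx_idx - torch.arange(T)                          # effective offset after clamping
    v_ctx = torch.gather(v, 1, ctx_idx.unsqueeze(-1).expand(-1, -1, D))
    z = warp_head(v_ctx.reshape(N * T, D), delta.reshape(-1))  # warped features z^t
    return fine_grained_loss(z.reshape(N, T, D), q, q_plus, tau)
```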

Integrating the above constraints, our final loss function is as follows.

$$\begin{aligned} \mathcal {L} = \lambda _{c}\mathcal {L}_{c} + \lambda _{f}\mathcal {L}_{f} + \lambda _{t}\mathcal {L}_{t}, \end{aligned}$$
(6)

where \(\lambda _{c}\), \(\lambda _{f}\), and \(\lambda _{t}\) balance the focus on different constraints during training.

4 Experiments

4.1 Settings of Pre-training

Datasets. We pre-trained our model on three public datasets: 1) HowTo100M [39]. It consists of more than 1.2M videos accompanied by ASR-generated speech transcriptions. The provided transcriptions are used to create video-sentence pairs segmented by timestamps. 2) WebVid-2M [4]. It contains about 2.5M well-aligned web video-text pairs. 3) Google Conceptual Captions [48]. It contains 3.3M image-description pairs harvested from the web.

Encoders. Following [4, 54, 67], we adopted ViT-B/16 [13] with space-time attention [5] as the video encoder. The spatial attention weights in the transformer were initialized with ImageNet-21k pre-trained weights while the temporal attention weights were set to zero. We chose a lightweight DistilBERT [47] as the language encoder. Following [3, 4, 41, 52], the language encoder was initialized with the weights pre-trained on English Wikipedia and Toronto Book Corpus.

Implementation Details. For the video in each video-sentence pair, we sampled 8 clips of 16 frames equidistantly and fed them to the video encoder to obtain clip-level features. All frames were resized to \(224 \times 224\). For downstream transfer, we extracted video features with the well-trained model in a dense manner, i.e., every 16 consecutive frames were grouped to compute one clip feature.

Experiments were conducted on 64 V100 GPUs with a batch size of 256 and lasted for 200 epochs. We used Adam [32] with an initial learning rate of \(10^{-4}\) as the optimizer. The learning rate decayed by 0.1 at the \(100^{th}\) and \(160^{th}\) epochs. Random flipping, random cropping, and color jittering were used for video data augmentation. The loss balance factors \(\lambda _{c}\), \(\lambda _{f}\), and \(\lambda _{t}\) were set to 0.5, 1, and 1, respectively. The temperature factor \(\tau \) used in contrastive learning was set to 0.07 following [43, 59], and K in Eq. (2) was set to 3. Features in all three contrastive losses were \(\ell _{2}\)-normalized before computation.
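
The optimization setup above can be summarized in a short configuration sketch (PyTorch, with `model` as a placeholder for the full LocVTP network; only the hyper-parameters stated in this section are reflected).

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder module standing in for the full LocVTP network

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Decay the learning rate by 0.1 at the 100th and 160th of 200 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100, 160], gamma=0.1)

lambda_c, lambda_f, lambda_t = 0.5, 1.0, 1.0   # loss balance factors in Eq. (6)
tau, K, delta_max = 0.07, 3, 4                 # temperature, top-K in Eq. (2), max bias (Sect. 4.8)

# Each training step combines the three contrastive losses on l2-normalized features:
#   loss = lambda_c * L_c + lambda_f * L_f + lambda_t * L_t
# followed by optimizer.step(); scheduler.step() is called once per epoch.
```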

4.2 Transfer Results on Video Retrieval

Table 1. Video retrieval performance on MSR-VTT. Vis Enc. Init.: datasets used for pre-training visual encoders. Methods using multi-modal features are marked in the table. COCO: COCO Captions [10]; VGen: Visual Genome [25]; CC3M: Conceptual Captions [48]; WV2M: WebVid-2M [4]; \(^\dag \) denotes a technical report available on arXiv.

Datasets. We evaluate our LocVTP on the widely used MSR-VTT benchmark [64]. It is composed of 10K YouTube videos (9K for training and 1K for testing). We report results on the train/test split introduced in [69].

Results. 1) As shown in Table 1, we achieve state-of-the-art performance under both pre-training data settings, i.e., HowTo100M and CC3M+WV2M. Specifically, when pre-trained on CC3M+WV2M, LocVTP outperforms Frozen [4] by an absolute gain of 4.8% on R@5. 2) It should be pointed out that although it uses RGB data only, our LocVTP achieves better performance than methods using multi-modal expert features including motion, face, and speech, e.g., MMT [14]. 3) The recent work CLIP [43] provides a stronger vision encoder, and we also evaluate the performance based on it. CLIP's weights greatly improve the performance of LocVTP, with R@5 reaching 72.8%, surpassing top-performing CLIP-based methods. 4) Our LocVTP also outperforms previous methods under the zero-shot setting, showing its generalization ability.

Table 2. Temporal grounding performance using pre-trained representations. Sep.Pre.: separate pre-training, i.e., the video encoder is pre-trained with supervision on Kinetics and the text encoder is taken from BERT. We retrain the temporal grounding method 2D-TAN [74] using the pre-trained features. HT: HowTo100M; CO: COCO Captions [10]; VG: Visual Genome [25]; CC: Conceptual Captions [48]; WV: WebVid-2M [4]; \(\ddag \): the subset of HowTo100M with the same training volume as Kinetics (300K pairs). Methods with \(^*\) are not open-sourced and are re-implemented by us. \(^\dag \) denotes a technical report available on arXiv.

4.3 Transfer Results on Temporal Grounding

Settings. We validate the performance of pre-trained representations on temporal grounding, which aims to localize the actions corresponding to a sentence in an untrimmed video. Specifically, we re-train the mainstream temporal grounding method 2D-TAN [74] (see Footnote 3) by only replacing the original input features with the pre-trained ones. For ease of feature extraction, we choose representative VTP methods with publicly available code for comparison.

Datasets and Metrics. 1) ActivityNet Captions (ANet) [24]. It contains 20K untrimmed videos with 100K descriptions. By convention, we use 37,417 video-query pairs for training, 17,505 pairs for validation, and 17,031 pairs for testing. 2) Charades-STA [15]. Following the official split, 12,408 video-query pairs are used for training, and 3,720 pairs for testing. 3) TACoS [44]. It has 10,146 video-query pairs for training, 4,589 pairs for validation, and 4,083 pairs for testing.

Following prior works, we adopt “R@n, IoU@m” (abbreviated as \(R^m_n\)) as the metric. Specifically, \(R^m_n\) is defined as the percentage of queries for which at least one of the top-n retrieved moments has an IoU with the ground-truth moment larger than m.
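
For clarity, the metric can be computed per query as sketched below (our own illustrative helper; the dataset-level score averages this indicator over all queries).

```python
import torch

def recall_at_n_iou(pred_moments, gt_moment, n=1, m=0.5):
    """R@n, IoU=m for a single query.

    pred_moments: (P, 2) ranked [start, end] predictions (best first, in seconds).
    gt_moment:    (2,)   ground-truth [start, end].
    Returns 1.0 if any of the top-n predictions has temporal IoU > m, else 0.0.
    """
    top = pred_moments[:n]
    inter = (torch.min(top[:, 1], gt_moment[1]) - torch.max(top[:, 0], gt_moment[0])).clamp(min=0)
    union = torch.max(top[:, 1], gt_moment[1]) - torch.min(top[:, 0], gt_moment[0])
    iou = inter / union.clamp(min=1e-6)
    return float((iou > m).any())
```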

Results. 1) As shown in Table 2, even when trained with a much larger dataset, the current popular video-text pre-training frameworks achieve inferior performance compared to the separately pre-trained one. For example, Frozen [4] reaches 43.3% at \(R_1^{0.5}\) on ANet Captions, which is 1.1% (absolute) lower than the separately pre-trained counterpart. 2) Whether pre-trained on HowTo100M or CC + WV, our LocVTP outperforms the compared pre-training methods by a large margin on all three datasets. For example, when pre-trained on HowTo100M, LocVTP surpasses the separately pre-trained method by 3.8% on \(R_1^{0.5}\) of ANet Captions. 3) For a fairer comparison, we sample a subset of HowTo100M with the same number of training samples as Kinetics [23] (300K training pairs), denoted as HT\(^\ddag \) in Table 2. Although it uses noisy ASR captions, the results demonstrate that under the same training data volume, our LocVTP still performs better than the separately pre-trained method. This shows that our performance improvement comes from the sound architecture design rather than merely the use of a large-scale dataset.

4.4 Transfer Results on Action Step Localization

Settings. In action step localization, each video belongs to a task and is annotated with multiple action steps described with short natural language phrases. The goal is to align each frame with the correct step in text form. Following [35, 39, 68, 76], we take [77] as the downstream localization method. Specifically, we compute the similarity between each frame and the action step descriptions in the feature space to find the optimal frame-wise order of action steps for a video.

Table 3. Comparison results of action step localization (CTR: average recall) and action segmentation (FA: frame-wise accuracy).

Datasets and Metrics. We experiment on the instructional video dataset CrossTask [77], which includes 83 tasks and 4.7K videos. Each task is described with an ordered list of steps with manual natural language descriptions. We perform the same evaluation protocol as in [77] by reporting the average recall (CTR).

Results. Table 3 reports the action step localization performance on the CrossTask dataset. Our LocVTP pre-trained feature achieves state-of-the-art performance with CTR reaching 51.7%, surpassing the previous method VideoCLIP by 4.4%. This competitive performance demonstrates that LocVTP features can effectively perceive detailed action steps.

4.5 Transfer Results on Action Segmentation

Settings. We assess our LocVTP on action segmentation, which aims to predict an action label for each video frame. It is a pure vision task that does not use the text encoder. Following [35, 68, 76], we encode the input video frames with the pre-trained video encoder and apply a linear classifier on top of the features to predict action labels.

Datasets and Metrics. We conduct experiments on the widely used COIN dataset [51] and the frame-wise accuracy (FA) is taken as the evaluation metric.

Results. As shown in Table 3, our LocVTP achieves state-of-the-art performance with FA reaching 72.9%. This further demonstrates the superiority of our feature in localization tasks even in the absence of language guidance.

4.6 Ablation Study on Training Objective

Training Strategy. The coarse-grained contrastive loss \(\mathcal {L}_{c}\) provides a basic cross-modal matching prior, and we investigate three potential ways to use it: 1) multi-stage training: first perform coarse-grained training and then use the trained model to initialize the subsequent stages; 2) warm-up training: decrease \(\lambda _{c}\) exponentially from 1 to 0 throughout the training process; 3) weighted training: set \(\lambda _{c}\) to a constant value; here we set \(\lambda _{c} = 0.5\). As shown in Table 4a, the weighted training strategy achieves the best performance, warm-up training is slightly behind, and multi-stage training is the least effective.

Loss Component. We present the loss component ablations in Table 4b. As shown, both the fine-grained loss \(\mathcal {L}_{f}\) and the temporal aware loss \(\mathcal {L}_{t}\) are crucial. For example, compared to the full version (exp. #1), removing \(\mathcal {L}_{f}\) or \(\mathcal {L}_{t}\) brings about 1.4% and 1.5% performance degradation on the \(R_1^{0.5}\) metric, respectively.

More Downstream Temporal Grounding Baselines. We take another temporal grounding method, CSMGAN [29], as the downstream baseline. As shown in Table 4c, our LocVTP pre-trained feature consistently benefits this more advanced baseline.

Table 4. Ablation studies of (a) training strategies; (b) loss components; (c) comparison results on the temporal grounding method CSMGAN [29]. Sep.Pre.: separate pre-training, i.e., the video encoder is pre-trained with supervision on Kinetics and the text encoder is taken from BERT.

4.7 Ablations on Fine-grained Contrastive Loss (see Footnote 4)

Correspondence Discovery Strategies. We experiment with four potential strategies to extract cross-modal correspondences: 1) random: randomly select K words for each clip; 2) 2d-topk: select the most similar \(K \times T\) clip-word pairs; 3) word-topk: select the most similar K clips for each word; 4) clip-topk: select the most similar K words for each clip, namely the method illustrated in Sect. 3.3. As indicated in Table 5a, the random and 2d-topk matching strategies are the two worst options. The word-topk matching is also sub-optimal, which can be attributed to the possibility of introducing words without concrete meanings (e.g., articles or pronouns) into the matched pairs.

Number of Selected Pairs K. We further ablate the hyper-parameter K used in the clip-topk strategy. Table 5b shows that the performance saturates at \(K=3\) and slightly decreases at \(K=4\). We conjecture that this is because too few words convey vague meanings, while too large a K makes it difficult to establish accurate correspondences.

4.8 Ablations on Temporal Aware Contrastive Loss (see Footnote 4)

Table 5. Ablation studies of (a) correspondence discovery strategies; (b) the number of selected pairs K; (c) the context projection head, where \({\text {sgn}}(\delta )\) and \(|\delta |\) denote the direction and distance; (d) the maximum bias distance; (e) intra-modal vs. cross-modal \(\mathcal {L}_{t}\); (f) linear localization accuracy, where \(Accu_o\) and \(Accu_d\) are the order and distance prediction accuracies.

Context Projection Head Components. In Eq. (4), the warped feature is generated based on both the direction \({\text {sgn}}(\delta )\) and the distance \(|\delta |\). Here we investigate eliminating either of them to see the difference. We observe in Table 5c that removing either component decreases the performance, which indicates that both the direction and the distance of the bias \(\delta \) are crucial for feature warping.

Maximum Bias Distance \(\delta _{max}\). Here we ablate different values of \(\delta _{max}\). From Table 5d, we can see that \(\delta _{max} = 4\) achieves the best performance. This may be because a small bias prevents the model from perceiving enough context, while a large bias makes contextual reasoning too difficult.

Intra-modal vs. Cross-modal Constraint. In Sect. 3.4, given the matched clip-word pair \(\{\boldsymbol{v}^{t}, \boldsymbol{q}^{t}_{+}\}\) and the warped feature \(\boldsymbol{z}^{t}\), we enforce cross-modal supervision, i.e., \(\boldsymbol{z}^{t} \leftrightarrow \boldsymbol{q}^{t}_{+}\). Here, we instead apply the temporal aware contrastive loss \(\mathcal {L}_{t}\) in an intra-modal manner, which regards \(\boldsymbol{z}^{t}\) and \(\boldsymbol{v}^{t}\) as positive pairs, i.e., \(\boldsymbol{z}^{t} \leftrightarrow \boldsymbol{v}^{t}\). The results in Table 5e show that our adopted cross-modal mode outperforms the intra-modal one.

Temporal Sensitivity Analysis. As a sanity check, we devise two proxy tasks to evaluate the temporal sensitivity of pre-trained video features. As shown in Fig. 5a, n equidistantly sampled clips from one video are fed into the frozen video backbone to extract their corresponding features. Two linear classifiers are then trained to perform two tasks: order prediction and distance estimation. The first task predicts the temporal index of a clip, while the second estimates the temporal distance between two clips. The results in Table 5f show that our LocVTP with the temporal aware loss \(\mathcal {L}_{t}\) outperforms the variant without it as well as two typical VTP methods (i.e., UniVL and MIL-NCE), which shows that \(\mathcal {L}_{t}\) clearly contributes to the localization ability.
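
As an illustration of these proxy tasks, the two probes can be set up as below. This is a sketch under our own assumptions: both tasks are cast as linear classification over frozen clip features, and the distance classes cover all signed offsets; the exact probe formulation is not specified here.

```python
import torch
import torch.nn as nn

n_clips, D = 8, 256                              # toy values: clips per video, feature dim

order_probe = nn.Linear(D, n_clips)              # task 1: predict the temporal index of a clip
dist_probe = nn.Linear(2 * D, 2 * n_clips - 1)   # task 2: predict the signed distance of two clips

clip_feat = torch.randn(32, D)                   # frozen backbone features of single clips
pair_feat = torch.randn(32, 2 * D)               # concatenated features of clip pairs

order_logits = order_probe(clip_feat)            # (32, n_clips), trained with cross-entropy
dist_logits = dist_probe(pair_feat)              # (32, 2*n_clips - 1), offsets -(T-1)..(T-1)
```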

4.9 Visualization

Cross-modal Correspondence Visualizations. Figure 5b shows two frames (see Footnote 4) and their corresponding similarity scores with the caption words. The top K highest-scored words are marked in red (\(K=3\)). Frame #1 and frame #2 have similar appearances yet correspond to different action processes. Our method pinpoints the subtle differences and accurately finds the most relevant words.

Fig. 5. (a) Linear localization evaluations, including order and distance prediction; (b) Cross-modal correspondence visualizations, where the top K responsive words are marked in red; (c) Gaussian distributions of the reference, bias, and projection similarities. (Color figure online)

Fig. 6. UMAP visualizations. Clips corresponding to the ground-truth caption are marked in a different color from the other clips. (Color figure online)

UMAP Visualizations. As shown in Fig. 6, we provide UMAP [37] visualizations of the fused multi-modal features, which are generated by multiplying the extracted video features by one query feature. With the temporal aware loss \(\mathcal {L}_{t}\), our LocVTP shows more separable distributions than LocVTP w/o \(\mathcal {L}_{t}\), demonstrating that \(\mathcal {L}_{t}\) helps distinguish the action of interest from the background.

Similarity Distribution Visualizations. In Eq. (4), the context projection head warps the contextual clip \(\boldsymbol{v}^{t+\delta }\) toward the reference clip \(\boldsymbol{v}^{t}\). Here we collect 10K paired training samples and compute three sets of cosine similarities: the reference similarity \((\boldsymbol{v}^{t}, \boldsymbol{q}_{+}^{t})\), the bias similarity \((\boldsymbol{v}^{t+\delta }, \boldsymbol{q}_{+}^{t})\), and the projection similarity \((\boldsymbol{z}^{t}, \boldsymbol{q}_{+}^{t})\). Figure 5c plots the histograms of these similarities. We can see that the distribution of the projection similarity is close to that of the reference similarity and far from that of the bias similarity. This demonstrates that our context projection head can effectively warp contextual features conditioned on the temporal information.

5 Conclusions

In this paper, we propose LocVTP, the first video-text pre-training framework for temporal localization tasks. Specifically, we apply cross-modal contrastive learning at both coarse-grained video-sentence and fine-grained clip-word levels. Besides, we propose a context warping pretext task and a temporal aware contrastive loss to enhance the temporal awareness of video features. Experimental results show that LocVTP achieves state-of-the-art performance when transferred to both retrieval-based and localization-based downstream tasks.