1 Introduction

Video is one of the most popular forms of media due to its ability to capture dynamic events and its natural appeal to our visual and auditory senses. Online video platforms are playing a major role in promoting this form of media. However, the billions of hours of video available on such platforms are unusable if we cannot access them effectively, for example, by retrieving relevant content through queries.

In this paper, we tackle the tasks of caption-to-video and video-to-caption retrieval. In the first task of caption-to-video retrieval, we are given a query in the form of a caption (e.g., “How to build a house”) and the goal is to retrieve the videos best described by it (i.e., videos explaining how to build a house). In practice, given a test set of caption-video pairs, our aim is to provide, for each caption query, a ranking of all the video candidates such that the video associated with the caption query is ranked as high as possible. On the other hand, the task of video-to-caption retrieval focuses on finding among a collection of caption candidates the ones that best describe the query video.

A common approach for the retrieval problem is similarity learning  [29], where we learn a function of two elements (a query and a candidate) that best describes their similarity. All the candidates can then be ranked according to their similarity with the query. In order to perform this ranking, the captions as well as the videos are represented in a common multi-dimensional embedding space, wherein similarities can be computed as a dot product of their corresponding representations. The critical question here is how to learn accurate representations of both caption and video to base our similarity estimation on.

Fig. 1. When matching a text query with videos, the inherent cross-modal and temporal information in videos needs to be leveraged effectively, for example, with a video encoder that handles all the constituent modalities (appearance, audio, speech) jointly across the entire duration of the video. In this example, a video encoder will be able to distinguish between “someone walking to” and “someone walking away” only if it exploits the temporal information of events occurring in the video (red arrows). Also, in order to understand that a “motorbike failed to start”, it needs to use cross-modal information (e.g., absence of noise after someone tried to start the engine, orange arrow). (Color figure online)

The problem of learning text representations has been extensively studied, leading to various methods  [3, 7, 18, 25, 34], which can be used to encode captions. In contrast to these advances, learning effective video representations continues to be a challenge, and forms the focus of our work. This is in part due to the multimodal and temporal nature of video. Video data not only varies in terms of appearance, but also in motion, audio, overlaid text, speech, etc. Leveraging cross-modal relations is thus key to building effective video representations. As illustrated in Fig. 1, cues jointly extracted from all the constituent modalities are more informative than each modality handled independently. Hearing a motor sound right after seeing someone start a bike tells us that the running bike is the visible one and not one in the background. Another example is a video of “a crowd listening to a talk”: neither the “appearance” nor the “audio” modality can fully describe the scene on its own, but when processed together, higher-level semantics can be obtained.

Recent work on video retrieval does not fully exploit such cross-modal high-level semantics. Existing methods either ignore the multi-modal signal  [15], treat modalities separately  [16], or only use a gating mechanism to modulate certain modality dimensions  [14]. Another challenge in representing video is its temporality. Due to the difficulty of handling videos of variable duration, current approaches  [14, 16] discard long-term temporal information by aggregating descriptors extracted at different moments in the video. We argue that this temporal information can be important for the task of video retrieval. As shown in Fig. 1, a video of “someone walking to an object” and one of “someone walking away from an object” will have the same representation once pooled temporally; however, the movement of the person relative to the object is potentially important to the query.

We address the temporal and multi-modal challenges posed by video data by introducing our multi-modal transformer. It processes features extracted from different modalities at different moments in the video and aggregates them into a compact representation. Building on the transformer architecture  [25], our multi-modal transformer exploits the self-attention mechanism to gather valuable cross-modal and temporal cues about events occurring in a video. We integrate our multi-modal transformer in a cross-modal framework, as illustrated in Fig. 2, which leverages both captions and videos, and estimates their similarity.

Fig. 2. Our cross-modal framework for similarity estimation. We use our Multi-modal Transformer (MMT, right) to encode video, and BERT (left) for text.

Contributions. In this work, we make the following three contributions: (i) First, we introduce a novel video encoder architecture for retrieval: our multi-modal transformer effectively processes features from multiple modalities extracted at different times. (ii) We thoroughly investigate different architectures for language embedding, and show the superiority of the BERT model for the task of video retrieval. (iii) By leveraging our novel cross-modal framework, we outperform prior state of the art for the task of video retrieval on the MSRVTT  [30], ActivityNet  [12] and LSMDC  [21] datasets. It is also the winning solution in the CVPR 2020 Video Pentathlon Challenge  [4].

2 Related Work

We present previous work on language and video representation learning, as well as on visual-language retrieval.

Language Representations. Earlier work on language representations includes bag of words  [34] and Word2Vec  [18]. A limitation of these representations is that they do not capture the sequential nature of a sentence. The LSTM  [7] was one of the first successful deep learning models to handle this. More recently, the transformer  [25] architecture has shown impressive results for text representation by implementing a self-attention mechanism where each word (or wordpiece  [27]) of the sentence can attend to all the others. The transformer architecture, consisting of self-attention layers alternately stacked with fully-connected layers, forms the base of the popular language modeling network BERT  [3]. Burns et al.  [1] perform an analysis of the different word embeddings and language models (Word2Vec  [18], LSTM  [7], BERT  [3], etc.) used in vision-language tasks. They show that a pretrained and frozen BERT model  [3] performs relatively poorly compared to an LSTM or even a simpler average embedding model. In this work, we show that for video retrieval, a pretrained BERT outperforms other language models, but it needs to be finetuned.

Video Representations. With a two-stream network, Simonyan et al.  [22] used complementary information from still frames and motion between frames to perform action recognition in videos. Carreira et al.  [2] incorporated 3D convolutions in a two-stream network to better capture the temporal structure of the signal. S3D  [28] is an alternative approach, which replaces the expensive 3D spatio-temporal convolutions with separable 2D spatial and 1D temporal convolutions. More recently, transformer-based methods, which leverage BERT pretraining  [3], have been applied to S3D features in VideoBERT  [24] and CBT  [23]. While these works focus on visual signals, they do not study how to encode other multi-modal semantics, such as audio signals.

Visual-Language Retrieval. Harwath et al.  [5] perform image and audio-caption retrieval by embedding audio segments and image regions in the same space and requiring high similarity between each audio segment and its corresponding image region. The method presented in  [13] takes a similar approach for image-text retrieval by embedding image regions and words in a joint space. A high similarity is obtained for images that have matching words and image regions.

For videos, JSFusion  [31] estimates video-caption similarity through dense pairwise comparisons between each word of the caption and each frame of the video. In this work, we instead estimate both a video embedding and a caption embedding and then compute the similarity between them. Zhang et al.  [33] perform paragraph-to-video retrieval by assuming a hierarchical decomposition of the video and paragraph. Our method does not assume that the video can be decomposed into clips that align with sentences of the caption. A recent alternative is to create separate embedding spaces for different parts of speech (e.g., noun or verb)  [26]. In contrast to this method, we do not pre-process the sentences but encode them directly through BERT.

Another work  [17] leverages the large number of instructional videos in the HowTo100M dataset, but does not fully exploit their temporal relations. Our work instead relies on longer segments extracted from HowTo100M videos in order to learn temporal dependencies and address the problem of misalignment between speech and visual features. Mithun et al.  [19, 20] use three experts (Object, Activity and Place) to compute three corresponding text-video similarities. These experts, however, do not collaborate, as their respective similarities are simply summed. A related approach  [16] uses precomputed features from experts for text-to-video retrieval, where the overall similarity is obtained as a weighted sum of each expert’s similarity. A recent extension  [14] of this mixture-of-experts model uses a collaborative gating mechanism for modulating each expert feature according to the other experts. However, this collaborative gating mechanism only strengthens (or weakens) some dimensions of the input signal in a single step, and is therefore not able to capture high-level inter-modality information. Our multi-modal transformer overcomes this limitation by attending to all available modalities over multiple self-attention layers.

3 Methodology

Our overall method relies on learning a function s to compute the similarity between two elements: text and video, as shown in Fig. 2. We then rank all the videos (or captions) in the dataset, according to their similarity with the query caption (or video) in the case of text-to-video (or video-to-text) retrieval. In other words, given a dataset of n video-caption pairs \(\{(v_1,c_1), ..., (v_n,c_n)\}\), the goal of the learnt similarity function \(s(v_i,c_j)\), between video \(v_i\) and caption \(c_j\), is to provide a high value if \(i = j\), and a low one if \(i \ne j\). Estimating this similarity (described in Sect. 3.3) requires accurate representations for the video as well as the caption. Figure 2 shows the two parts focused on producing these representations (presented in Sects. 3.1 and 3.2 respectively) in our cross-modal framework.

3.1 Video Representation

The video-level representation is computed by our proposed multi-modal transformer (MMT). MMT follows the architecture of the transformer encoder presented in  [25]. It consists of stacked self-attention layers and fully-connected layers. MMT’s input \(\varOmega (v)\) is a set of embeddings, all of the same dimension \(d_{model}\). Each of them embeds the semantics of a feature, its modality, and the time in the video when the feature was extracted. This input is given by:

$$\begin{aligned} \varOmega (v) = F(v) + E(v) + T(v), \end{aligned}$$
(1)

In the following, we describe those three components.

Features F. In order to learn an effective representation from different modalities inherent in video data, we begin with video feature extractors called “experts”  [14, 16, 19, 31]. In contrast to previous methods, we learn a joint representation leveraging both cross-modal and long-term temporal relationships among the experts. We use N pretrained experts \(\{F^n\}_{n=1}^N\). Each expert is a model trained for a particular task that is then used to extract features from video. For a video v, each expert extracts a sequence \(F^n(v) = [F^n_1, ..., F^n_K]\) of K features.

The features extracted by our experts encode the semantics of the video. Each expert \(F^n\) outputs features in \(\mathbb {R}^{d_n}\). In order to project the different expert features into a common dimension \(d_{model}\), we learn N linear layers (one per expert) to project all the features into \(\mathbb {R}^{d_{model}}\).

A transformer encoder produces an embedding for each of its feature inputs, resulting in several embeddings for an expert. In order to obtain a unique embedding for each expert, we define an aggregated embedding \(F^n_{agg}\) that will collect and contextualize the expert’s information. We initialize this embedding with a max pooling aggregation of all the corresponding expert’s features as \(F^n_{agg} = maxpool(\{F^n_k\}_{k=1}^K)\). The sequence of input features to our video encoder then takes the form:

$$\begin{aligned} F(v) = [F^1_{agg}, F^1_1, ..., F^1_K, ..., F^N_{agg}, F^N_1, ..., F^N_K]. \end{aligned}$$
(2)
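
For illustration, a minimal PyTorch sketch of this step is given below; the expert names and feature dimensions are placeholders, not our actual configuration. Each expert gets its own linear projection to \(d_{model}\), and its aggregated embedding, initialised by max pooling, is prepended to its feature sequence.

```python
import torch
import torch.nn as nn

d_model = 512
expert_dims = {"motion": 1024, "audio": 128}          # hypothetical expert output sizes
proj = nn.ModuleDict({name: nn.Linear(d, d_model) for name, d in expert_dims.items()})

def build_feature_sequence(expert_feats):
    """expert_feats: dict mapping expert name -> tensor of shape (K, d_n)."""
    seq = []
    for name, feats in expert_feats.items():
        x = proj[name](feats)                         # project to (K, d_model)
        agg = x.max(dim=0).values                     # max-pool initialisation of F^n_agg
        seq.append(torch.cat([agg.unsqueeze(0), x]))  # [F^n_agg, F^n_1, ..., F^n_K]
    return torch.cat(seq, dim=0)                      # (N * (K + 1), d_model)

feats = {"motion": torch.randn(30, 1024), "audio": torch.randn(30, 128)}
F_v = build_feature_sequence(feats)                   # feature part of the MMT input
```
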
Fig. 3. Inputs to our multi-modal transformer. We combine feature semantics F, expert information E, and temporal cues T to form our video embeddings \(\varOmega (v)\), which are input to MMT.

Expert Embeddings E. In order to process cross-modality information, our MMT needs to identify which expert it is attending to. We learn N embeddings \(\{E_1, ..., E_N\}\) of dimension \(d_{model}\) to distinguish between embeddings of different experts. Thus, the sequence of expert embeddings to our video encoder takes the form:

$$\begin{aligned} E(v) = [E_1, E_1, ..., E_1, ..., E_N, E_N, ..., E_N]. \end{aligned}$$
(3)

Temporal Embeddings T. These provide our multi-modal transformer with information about the time in the video at which each feature was extracted. Considering videos of a maximum duration of \(t_{max}\) seconds, we learn \(D = |t_{max}|\) embeddings \(\{T_1, ..., T_D\}\) of dimension \(d_{model}\). Each expert feature extracted in the time range \([t,t+1)\) is temporally embedded with \(T_{t+1}\). For example, a feature extracted at 7.4 s in the video is encoded with the temporal embedding \(T_8\). We learn two additional temporal embeddings, \(T_{agg}\) and \(T_{unk}\), which encode aggregated features and features with unknown temporal information (for experts whose temporal information is unavailable), respectively. The sequence of temporal embeddings of our video encoder then takes the form:

$$\begin{aligned} T(v) = [T_{agg}, T_1, ..., T_D, ..., T_{agg}, T_1, ..., T_D]. \end{aligned}$$
(4)
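
A sketch of building E(v) and T(v) with learnable lookup tables follows; the index convention (index 0 for \(T_{agg}\), index D+1 for \(T_{unk}\)) and the shapes are chosen for illustration only. The outputs align position by position with F(v) from the previous sketch, so the three sequences can be summed as in (1).

```python
import torch
import torch.nn as nn

d_model, N, D = 512, 2, 30                   # N experts, D one-second temporal bins
expert_emb = nn.Embedding(N, d_model)        # E_1, ..., E_N
temporal_emb = nn.Embedding(D + 2, d_model)  # index 0: T_agg, 1..D: T_1..T_D, D+1: T_unk

def build_positional_embeddings(num_feats_per_expert, timestamps):
    """timestamps: per-expert list of feature extraction times, in seconds."""
    E_rows, T_rows = [], []
    for n in range(N):
        K = num_feats_per_expert[n]
        E_rows.append(expert_emb.weight[n].expand(K + 1, d_model))   # E_n repeated K+1 times
        t_idx = [0] + [min(int(t) + 1, D) for t in timestamps[n]]    # T_agg, then T_{t+1}
        T_rows.append(temporal_emb(torch.tensor(t_idx)))
    return torch.cat(E_rows), torch.cat(T_rows)

E_v, T_v = build_positional_embeddings([30, 30], [list(range(30))] * 2)
```
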

Multi-modal Transformer. The video embeddings \(\varOmega (v)\), defined in (1) as the sum of feature, expert and temporal embeddings, and illustrated in Fig. 3, are input to the transformer. They are given by: \(\varOmega (v) = F(v) + E(v) + T(v) = [\omega ^1_{agg}, \omega ^1_1, ..., \omega ^1_K, ..., \omega ^N_{agg}, \omega ^N_1, ..., \omega ^N_K].\) MMT contextualises its input \(\varOmega (v)\) and produces the video representation \(\varPsi _{agg}(v)\). As illustrated in Fig. 2, we only keep the aggregated embedding per expert. Thus, our video representation \(\varPsi _{agg}(v)\) consists of the output embeddings corresponding to the aggregated features, i.e.,

$$\begin{aligned} \varPsi _{agg}(v) = MMT(\varOmega (v)) = [\psi ^1_{agg}, ..., \psi ^N_{agg}]. \end{aligned}$$
(5)
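
A minimal sketch of this step, using a standard PyTorch TransformerEncoder with the layer sizes of Sect. 4.2 as a stand-in for MMT, is given below; only the outputs at the aggregated positions are kept as \(\varPsi _{agg}(v)\).

```python
import torch
import torch.nn as nn

d_model, n_layers, n_heads, K, N = 512, 4, 4, 30, 2
layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=3072,
                                   dropout=0.1, batch_first=True)
mmt = nn.TransformerEncoder(layer, n_layers)

omega = torch.randn(1, N * (K + 1), d_model)      # F(v) + E(v) + T(v), batch of one video
contextualised = mmt(omega)                       # same shape as the input
agg_positions = [n * (K + 1) for n in range(N)]   # index of each F^n_agg in the sequence
psi_agg = contextualised[:, agg_positions]        # (1, N, d_model): [psi^1_agg, ..., psi^N_agg]
```
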

The advantage of our MMT over the state-of-the-art collaborative gating mechanism  [14] is two-fold: First, the input embeddings are not simply modulated in a single step, but iteratively refined through several layers featuring multiple attention heads. Second, we do not limit our video encoder to a temporally aggregated feature for each expert, but instead provide all the extracted features, along with a temporal encoding describing at what moment of the video they were extracted. Thanks to its self-attention modules, each layer of our multi-modal transformer is able to attend to all its input embeddings, thus extracting the semantics of events occurring in the video across several modalities.

3.2 Caption Representation

We compute our caption representation \(\varPhi (c)\) in two stages: first, we obtain an embedding h(c) of the caption, and then project it with a function g into N different spaces, i.e., \(\varPhi = g \circ h\). For the embedding function h, we use a pretrained BERT model  [3]. Specifically, we extract our single caption embedding h(c) from the [CLS] output of BERT. In order to match the size of this caption representation with that of the video, we learn, for the function g, as many gated embedding modules  [16] as there are video experts. Our caption representation then consists of N embeddings, represented by \(\varPhi (c) = \{\phi ^i\}_{i=1}^N\).
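
A sketch of this caption branch is given below, assuming the HuggingFace transformers library for BERT and the gated embedding module formulation of  [16]; the layer sizes and example sentence are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

class GatedEmbeddingUnit(nn.Module):
    """Gated embedding module [16]: linear projection, context gating, L2 normalisation."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.gate = nn.Linear(d_out, d_out)

    def forward(self, x):
        x = self.fc(x)
        x = x * torch.sigmoid(self.gate(x))   # context gating
        return F.normalize(x, dim=-1)         # unit-norm caption embedding phi^i

N, d_model = 7, 512
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")
gates = nn.ModuleList([GatedEmbeddingUnit(768, d_model) for _ in range(N)])

tokens = tokenizer("a person in a red dress is singing", return_tensors="pt")
h_c = bert(**tokens).last_hidden_state[:, 0]   # [CLS] output, shape (1, 768)
phi = [g(h_c) for g in gates]                  # N caption embeddings, one per expert
```
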

3.3 Similarity Estimation

We compute our final video-caption similarity s, as a weighted sum of each expert i’s video-caption similarity \(\langle \phi ^i, \psi ^i_{agg} \rangle \). It is given by:

$$\begin{aligned} s(v,c) = \sum _{i = 1}^{N} w_i(c)\langle \phi ^i, \psi ^i_{agg} \rangle , \end{aligned}$$
(6)

where \(w_i(c)\) represents the weight for the ith expert. To obtain these mixture weights, we follow  [16] and process our caption representation h(c) through a linear layer and then perform a softmax operation, i.e.,

$$\begin{aligned} w_i(c) = \frac{e^{h(c)^{\top } a_{i}}}{\sum _{j=1}^{N} e^{h(c)^{\top } a_{j}}}, \end{aligned}$$
(7)

where \((a_1, ..., a_N)\) are the weights of the linear layer. The intuition behind using a weighted sum is that a caption may not describe all the inherent modalities in video uniformly. For example, in the case of a video with a person in a red dress singing opera, the caption “a person in a red dress” provides no information relevant for audio. On the contrary, the caption “someone is singing” should focus on the audio modality for computing similarity. Note that \(w_i(c), \phi ^i \ \text {and} \ \psi ^i_{agg}\) can all be precomputed offline for each caption and for each video, and therefore the retrieval operation only involves dot product operations.
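
A sketch of Eqs. (6) and (7) follows, assuming the caption and video embeddings have already been computed and stacked per expert; tensor shapes and variable names are illustrative.

```python
import torch
import torch.nn as nn

N, d_bert, d_model = 7, 768, 512
weight_layer = nn.Linear(d_bert, N)              # rows a_1, ..., a_N of Eq. (7)

def similarity(h_c, phi, psi_agg):
    """h_c: (B, d_bert); phi, psi_agg: (B, N, d_model), matched caption-video pairs."""
    w = torch.softmax(weight_layer(h_c), dim=-1)  # (B, N) mixture weights w_i(c)
    per_expert = (phi * psi_agg).sum(dim=-1)      # (B, N) dot products <phi^i, psi^i_agg>
    return (w * per_expert).sum(dim=-1)           # (B,) similarities s(v, c)
```
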

3.4 Training

We train our model with the bi-directional max-margin ranking loss  [10]:

$$\begin{aligned} \mathcal {L} = \frac{1}{B}\sum _{i=1}^{B} \sum _{j \ne i} \Big [ \max (0, s_{ij} - s_{ii} + m) + \max (0, s_{ji} - s_{ii} + m)\Big ], \end{aligned}$$
(8)

where B is the batch size, \(s_{ij} = s(v_{i},c_{j})\) is the similarity score between video \(v_i\) and caption \(c_j\), and m is the margin. This loss enforces the similarity of true video-caption pairs \(s_{ii}\) to be higher than the similarity of negative samples \(s_{ij}\) or \(s_{ji}\), for all \(i \ne j\), by at least m.
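
A sketch of this loss, assuming a precomputed \(B \times B\) similarity matrix whose diagonal holds the true video-caption pairs:

```python
import torch

def max_margin_ranking_loss(sim, margin=0.05):
    """sim[i, j] = s(v_i, c_j); diagonal entries are the positive pairs."""
    B = sim.size(0)
    diag = sim.diag().view(B, 1)
    cost_c = (margin + sim - diag).clamp(min=0)      # s_ij - s_ii + m
    cost_v = (margin + sim.t() - diag).clamp(min=0)  # s_ji - s_ii + m
    off_diag = ~torch.eye(B, dtype=torch.bool)       # keep only terms with j != i
    return (cost_c[off_diag].sum() + cost_v[off_diag].sum()) / B
```
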

4 Experiments

4.1 Datasets and Metrics

HowTo100M  [17]. It is composed of more than 1 million YouTube instructional videos, along with automatically-extracted speech transcriptions, which form the captions. These captions are naturally noisy, and often do not describe the visual content accurately or are temporally misaligned with it. We use this dataset only for pre-training.

MSRVTT  [30]. This dataset is composed of 10K YouTube videos, collected using 257 queries from a commercial video search engine. Each video is 10 to 30 s long, and is paired with 20 natural sentences describing it, obtained from Amazon Mechanical Turk workers. We use this dataset for training from scratch and also for fine-tuning. We report results on the train/test split introduced in  [31], which uses 9000 videos for training and 1000 for testing. We refer to this split as “1k-A”. We also report results on the train/test split in  [16], which we refer to as “1k-B”. Unless otherwise specified, our MSRVTT results are with “1k-A”.

ActivityNet Captions  [12]. It consists of 20K YouTube videos temporally annotated with sentence descriptions. We follow the approach of  [33], where all the descriptions of a video are concatenated to form a paragraph. The training set has 10009 videos. We evaluate our video-paragraph retrieval on the “val1” split (4917 videos). We use ActivityNet for training from scratch and fine-tuning.

LSMDC  [21]. It contains 118,081 short video clips (\(\sim \)4–5 s) extracted from 202 movies. Each clip is annotated with a caption, extracted from either the movie script or the audio description. The test set is composed of 1000 videos, from movies not present in the training set. We use LSMDC for training from scratch and also fine-tuning.

Metrics. We evaluate the performance of our model with standard retrieval metrics: recall at rank N (R@N, higher is better), median rank (MdR, lower is better) and mean rank (MnR, lower is better). For each metric, we report the mean and the standard deviation over experiments with 3 random seeds. In the main paper, we only report recall@5, median and mean ranks, and refer the reader to the supplementary material for additional metrics.
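
For reference, these metrics can be computed from a query-candidate similarity matrix as in the sketch below (assuming one ground-truth candidate per query, placed on the diagonal); this is a generic illustration rather than our exact evaluation code.

```python
import numpy as np

def retrieval_metrics(sim):
    """sim[i, j]: similarity of query i with candidate j; candidate i is the true match."""
    ranks = []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                    # candidates by decreasing similarity
        ranks.append(np.where(order == i)[0][0] + 1)   # 1-indexed rank of the true candidate
    ranks = np.array(ranks)
    return {"R@5": float(np.mean(ranks <= 5)),
            "MdR": float(np.median(ranks)),
            "MnR": float(np.mean(ranks))}
```
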

4.2 Implementation Details

Pre-trained Experts. Recall that our video encoder uses pre-trained expert models to extract features from each video modality. We use the following seven experts. Motion features are extracted from S3D  [28] trained on the Kinetics action recognition dataset. Audio features are extracted using a VGGish model  [6] trained on YT8M. Scene embeddings are extracted from DenseNet-161  [9] trained for image classification on the Places365 dataset  [35]. OCR features are obtained in three stages. Overlaid text is first detected using the pixel link text detection model. The detected boxes are then passed through a text recognition model trained on the Synth90K dataset. Finally, each character sequence is encoded with word2vec  [18] embeddings. Face features are extracted in two stages. An SSD face detector is used to extract bounding boxes, which are then passed through a ResNet50 trained for face classification on the VGGFace2 dataset. Speech transcripts are extracted using the Google Cloud Speech to Text API, with the language set to English. The detected words are then encoded with word2vec. Appearance features are extracted from the final global average pooling layer of SENet-154  [8] trained for classification on ImageNet. For scene, OCR, face, speech and appearance, we use the features publicly released by  [14], and compute the other features ourselves.

Training. For each dataset, we run a grid search on the corresponding validation set to estimate the hyperparameters. We use the Adam optimizer for all our experiments, and set the margin of the bidirectional max-margin ranking loss to 0.05. We also freeze our pre-trained expert models.

When pre-training on HowTo100M, we use a batch size of 64 video-caption pairs and an initial learning rate of 5e-5, which we decay by a multiplicative factor of 0.98 every 10K optimisation steps, and we train for 2 million steps. Given the long duration of most of the HowTo100M videos, we randomly sample 100 consecutive words in the caption, and keep 100 consecutive seconds of video data, closest in time to the selected words.

When training from scratch or finetuning on MSRVTT or LSMDC, we use a batch size of 32 video-caption pairs and an initial learning rate of 5e-5, which we decay by a multiplicative factor of 0.95 every 1K optimisation steps. We train for 50K steps. We use the same settings when training from scratch or finetuning on ActivityNet, except that the multiplicative factor is 0.90.
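
A sketch of this optimisation setup with standard PyTorch components, using the MSRVTT/LSMDC values above; the parameter list is a placeholder for the actual model weights.

```python
import torch

# Placeholder parameter list; in practice this would hold the MMT, BERT and projection weights.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.Adam(params, lr=5e-5)
# Multiply the learning rate by 0.95 every 1000 optimisation steps (MSRVTT/LSMDC setting).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.95)

for step in range(50_000):
    # ... compute the ranking loss and call loss.backward() here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```
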

To compute our caption representation h(c), we use the “BERT-base-cased” checkpoint of the BERT model and finetune it with a dropout probability of 10%. To compute our video representation \(\varPsi _{agg}(v)\), we use MMT with 4 layers and 4 attention heads, a dropout probability of 10%, a hidden size \(d_{model}\) of 512, and an intermediate size of 3072.

For datasets with short videos (MSRVTT and LSMDC), we use all 7 experts and limit the video input to 30 features per expert, and the BERT input to the first 30 wordpieces. For datasets containing longer videos (HowTo100M and ActivityNet), we only use the motion and audio experts, and limit our video input to 100 features per expert and our BERT input to the first 100 wordpieces. In cases where an expert is unavailable for a given video, e.g., no speech was detected, we set the aggregated feature \(F_{agg}^n\) to a zero vector, as sketched below. We refer the reader to the supplementary material for a study of the model complexity.
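
A minimal sketch (helper name is illustrative) of this missing-expert handling:

```python
import torch

max_feats = 30  # 30 features per expert for MSRVTT/LSMDC, 100 for HowTo100M/ActivityNet

def aggregated_feature(expert_feats, d_model=512):
    """expert_feats: (K, d_model) projected features, or None if the expert is unavailable."""
    if expert_feats is None or expert_feats.shape[0] == 0:
        return torch.zeros(d_model)             # zero-vector F^n_agg for a missing expert
    return expert_feats[:max_feats].max(dim=0).values   # max-pool initialisation otherwise
```
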

4.3 Ablation Studies and Comparisons

We will first show the advantage of pretraining our model on a large-scale, uncurated dataset. We will then perform ablations on the architecture used for our language and video encoders. Finally, we will present the relative importance of the pretrained experts used in this work, and compare with related methods.

Pretraining. Table 1 shows the advantage of pretraining on HowTo100M, before finetuning on the target dataset (MSRVTT in this case). We also evaluated the impact of pretraining on ActivityNet and LSMDC; see Table 5 and Table 6.

Table 1. Advantage of pretraining on HowTo100M before finetuning on MSRVTT, and impact of removing the stop words. Performance reported on MSRVTT.

Language Encoder. We evaluated several architectures for the caption representation, as shown in Table 2. Similar to the observation made in  [1], we obtain poor results from a frozen, pretrained BERT. Using the [CLS] output from a pretrained and frozen BERT model is in fact the worst result. We suppose this is because that output was not trained for caption representation, but for a very different task: next sentence prediction. Finetuning BERT greatly improves performance, giving the best result. We also compare with GrOVLE  [1] embeddings, frozen or finetuned, aggregated with a max-pooling operation or a 1-layer LSTM followed by a fully-connected layer. We show that pretrained BERT embeddings aggregated by a max-pooling operation perform better than GrOVLE embeddings processed by an LSTM (the best results from  [1] for the text-to-clip task).

Table 2. Comparison of different architectures for caption embedding when training from scratch on MSRVTT.

We also analysed the impact of removing stop words from the captions in Table 1. In a zero-shot setting (i.e., trained on HowTo100M and evaluated on MSRVTT without finetuning), removing the stop words helps generalisation by bridging the domain gap, since HowTo100M speech is very different from MSRVTT captions. This approach was adopted in  [15]. However, we observe that when finetuning, it is better to keep all the words, as they contribute to the semantics of the caption.

Video Encoder. We evaluated the influence of different architectures for computing video embeddings on the MSRVTT 1k-A test split.

Table 3. Ablation studies on the video encoder of our framework with MSRVTT. (a) Influence of the architecture and input. With max-pooled features as input, we compare our transformer architecture (MMT) with the variant not using an encoder (NONE) and the one with Collaborative Gating  [14] (COLL). We also show that MMT can attend to all extracted features, as detailed in the text. (b) Importance of initializing \(F_{agg}^n\) features. We compare zero-vector initialisation, mean pooling and max pooling of the expert features. (c) Influence of the size of the multi-modal transformer. We compare different values for number-of-layers \(\times \) number-of-attention-heads.

In Table 3a, we evaluate variants of our encoder architecture and its input. Similar to  [16], we experiment with directly computing the caption-video similarities on the max-pooled features of each expert, i.e., with no video encoder (NONE in the table). We compare this with the collaborative gating architecture (COLL)  [14] and our MMT variant using only the aggregated features as input. For the first two variants without MMT, we adopt the approach of  [16] to deal with missing modalities by re-weighting \(w_i(c)\). We also show the superior performance of our multi-modal transformer in contextualising the different modality embeddings compared to the collaborative gating approach. We argue that our MMT is able to extract cross-modal information in a multi-stage architecture, whereas collaborative gating is limited to modulating the input embeddings. Table 3a also highlights the advantage of providing MMT with all the extracted features, instead of only the aggregated ones. Temporally aggregating each expert’s features ignores information about multiple events occurring in the same video (see the last three rows). As shown by the influence of ordered and randomly shuffled features on the performance, MMT has the capacity to make sense of the relative ordering of events in a video.

Table 3b shows the importance of initialising the expert aggregation feature \(F_{agg}^n\). Since the output of our video encoder is extracted from the “agg” columns, it is important to initialise them with an appropriate representation of the experts’ features. The transformer being a residual network architecture, initialising the \(F_{agg}^n\) input embeddings with a zero vector leads to low performance. Initialising with a max pooling aggregation of each expert’s features performs better than mean pooling. Finally, we analyse the impact of the size of our multi-modal transformer model in Table 3c. A model with 4 layers and 4 attention heads outperforms both a smaller model (2 layers and 2 attention heads) and a larger one (8 layers and 8 attention heads).

Comparison of the Different Experts. In Fig. 4, we show an ablation study when training our model on MSRVTT using only one expert (left), using all experts but one (middle), or gradually adding experts by greedy search (right). When using only one expert, we note that the motion expert provides the best results. We attribute the poor performance of OCR, speech and face to the fact that they are absent from many videos, thus resulting in a zero vector input to our video encoder. While the scene expert shows decent performance when used alone, it does not contribute when used alongside the other experts, perhaps because the semantics it encodes are already captured by other experts such as appearance or motion. On the contrary, the audio expert alone does not provide good performance, but it contributes the most when used in conjunction with the others, most likely due to the complementary cues it provides compared to the other experts.

Fig. 4. MSRVTT performance (mean rank; lower is better) after training from scratch, when using only one expert (left), when using all experts but one (middle), and when gradually adding experts by greedy search (right).

Comparison to Prior State of the Art. We compare our method on three datasets: MSRVTT (Table 4), ActivityNet (Table 5) and LSMDC (Table 6). While MSRVTT and LSMDC contain short video-caption pairs (average video duration of 13 s for MSRVTT, one-sentence captions), ActivityNet contains much longer videos (several minutes), each captioned with multiple sentences. We consider the concatenation of all these sentences as the caption. Our method obtains state-of-the-art results on all three datasets. The gains obtained through MMT’s long-term temporal encoding are particularly noticeable on the long videos of ActivityNet.

Table 4. Retrieval performance on the MSRVTT dataset. 1k-A and 1k-B denote test sets of 1000 randomly sampled caption-video pairs used in  [31] and  [16], respectively.
Table 5. Retrieval performance on the ActivityNet dataset.
Table 6. Retrieval performance on the LSMDC dataset.

5 Summary

We introduced the multi-modal transformer, a transformer-based architecture capable of attending to features extracted at different moments and from different modalities in video. This leverages both temporal and cross-modal cues, which are crucial for accurate video representation. We incorporate this video encoder along with a caption encoder in a cross-modal framework to perform caption-video matching and obtain state-of-the-art results for video retrieval. As future work, we would like to improve temporal encoding for video and text.