1 Introduction

The goal of this work is to retrieve the matching text description given a video and, conversely, to retrieve the matching videos given a text description (Fig. 1). While several computer vision tasks (e.g., image classification [20, 23, 37], object detection [36, 46, 47]) are now reaching maturity, cross-modal retrieval between visual data and natural language descriptions remains a very challenging problem [35, 64] due to the gap and ambiguity between different modalities and the limited availability of training data. Some recent works [17, 25, 30, 38, 59] attempt to bridge this gap with cross-modal joint embeddings. By projecting data from multiple modalities into the same joint space, the similarity between the resulting points reflects the semantic closeness of their corresponding original inputs. In this work, we focus on learning joint video-text embedding models and on effectively combining different video cues to develop a robust video-text retrieval system.

The video-text retrieval task goes one step beyond the image-text retrieval task, which is a comparatively well-studied field. Most existing approaches for video-text retrieval are very similar to image-text retrieval methods by design and focus mainly on the modification of loss functions [12, 40, 41, 50, 61]. We observe that a simple adaptation of a state-of-the-art image-text embedding method [13], which mean-pools features from video frames, already generates better results than existing video-text retrieval approaches [12, 40]. However, such methods ignore much of the contextual information in video sequences, such as temporal activities or specific scene entities, and thus often retrieve only generic responses related to the appearance of static frames. In many cases, they fail to capture the information needed to answer questions that are important for effective retrieval, such as “What happened in the video?” or “Where did the video take place?” This greatly undermines the robustness of such systems; for instance, it is very difficult to distinguish a video captioned “a dog is barking” from another captioned “a dog is playing” based only on visual appearance (Fig. 2). Associating video motion content and the environmental scene provides supplementary cues in this scenario and improves the chance of correct prediction. Similarly, understanding a video described by “gunshot broke out at the concert” may require analyzing different visual (e.g., appearance, motion, environment) and audio cues simultaneously. On the other hand, many videos contain redundant or identical content, and hence an efficient video-text retrieval system should utilize the most distinctive cues in the content to resolve ambiguities in retrieval.

Fig. 1

Illustration of video-text retrieval task: given a text query, retrieve and rank videos from the database based on how well they depict the text, and vice versa

Fig. 2

Example frames from two videos and their associated captions, illustrating the significance of utilizing supplementary cues from videos to improve the chance of correct retrieval

While a system developed without considering most of the available cues in the video content is unlikely to be comprehensive, an inappropriate fusion of complementary features could adversely increase ambiguity and degrade performance. Additionally, existing hand-labeled video-text datasets are very small and very restrictive considering the amount of rich descriptions that a human can compose and the enormous diversity in the visual world. This makes it extremely difficult to train deep models to understand videos in general and to develop a successful video-text retrieval system. To address these issues, we analyze how to judiciously utilize different cues from videos. We propose a mixture-of-experts system tailored toward achieving high performance in the task of cross-modal video-text retrieval. We believe focusing on three major facets (i.e., concepts for Who, What, and Where) of videos is crucial for efficient retrieval performance. In this regard, our framework utilizes three salient features (i.e., object, action, place) from videos (extracted using pre-trained deep neural networks) for learning joint video-text embeddings and uses an ensemble approach to fuse them. Furthermore, we propose a modified pairwise ranking loss for the task that emphasizes hard negatives and the relative ranking of positive labels. Our approach shows significant performance improvement compared to previous approaches and baselines.

1.1 Contributions

The main contributions of this work can be summarized as follows:

  • The success of video-text retrieval depends on more robust video understanding. This paper studies how to achieve this goal by utilizing multimodal features from a video (different visual features and audio inputs). Our proposed framework combines action, object, place, text, and audio features through a fusion strategy for efficient retrieval.

  • We present a modified pairwise loss function to better learn the joint embedding, which emphasizes hard negatives and applies a weight-based penalty on the loss based on the relative ranking of the correct match in the retrieval.

  • We conduct extensive experiments and demonstrate a clear improvement over the state-of-the-art methods in the video-text retrieval task on the MSR-VTT dataset [60] and MSVD dataset [9].

This paper is an extended version of our work [35] with significantly more insights and detailed discussions of the proposed framework. The main extension in our pipeline is adding scene cues from videos, along with object and activity cues, for learning joint embeddings to develop a more comprehensive video-text retrieval system. The previous version utilized object-text and activity-text embeddings, which focused mainly on resolving ambiguities related to concepts for Who and What. We add a place-text embedding network to our framework to make it more robust, which helps resolve ambiguities arising from concepts for Where. Experiments show that this change results in a significant improvement over previous works on two benchmark datasets.

2 Related work

2.1 Image-text retrieval

Recently, there has been significant interest in learning robust visual-semantic embeddings for image-text retrieval [21, 26, 38, 57]. Based on a triplet of object, action, and scene, a method for projecting text and image to a joint space was proposed in early work [14]. Canonical correlation analysis (CCA) and several of its extensions, which focus on maximizing the correlation between the projections of the modalities, have been used in many previous works for learning joint embeddings for the cross-modal retrieval task [18, 19, 22, 44, 49, 62]. In [18], authors extended the classic two-view CCA approach with a third view coming from high-level semantics and proposed an unsupervised way to derive the third view by clustering the tags. In [44], authors proposed a method named MACC (multimedia aggregated correlated components) that aims to reduce the gap between cross-modal data in the joint space by embedding visual and textual features into a local context that reflects the data distribution in the joint space. An extension of CCA with deep neural networks, named deep CCA (DCCA), has also been utilized to learn joint embeddings [1, 62]; it learns two deep neural networks simultaneously to project the two views such that they are maximally correlated. While CCA-based methods are popular, they have been reported to be unstable and to incur a high memory cost due to the covariance matrix calculation with large amounts of data [32, 58]. Recently, several works have also leveraged adversarial learning to train joint image-text embeddings for cross-modal retrieval [10, 57].

Most recent works relating the text and image modalities are trained with a ranking loss [13, 17, 28, 39, 52, 58]. In [17], authors proposed a method for projecting words and visual content to a joint space utilizing a ranking loss that applies a penalty when a non-matching word is ranked higher than the matching one. A cross-modal image-text retrieval method has been presented in [28] that utilizes a triplet ranking loss to project image features and RNN-based sentence descriptions to a common latent space. Several image-text retrieval methods have adopted a similar approach with slight modifications in input feature representations [39], similarity score calculation [58], or loss function [13]. The VSEPP model [13] modified the pairwise ranking loss based on violations caused by hard negatives (i.e., the non-matching query closest to each training query) and has been shown to be effective in the retrieval task. For image-sentence matching, an LSTM-based network is presented in [24] that recurrently selects pairwise instances from image and sentence descriptions and aggregates local similarities. In [39], authors proposed a multimodal attention mechanism to attend to sentence fragments and image regions selectively for similarity calculation. Our method complements these works that learn a joint image-text embedding using a ranking loss (e.g., [13, 28, 52]). The proposed retrieval framework can be applied to most of these approaches for improved video-text retrieval performance.

2.2 Video hyperlinking

Video hyperlinking is also closely relevant to our work. Given an anchor video segment, the task is to retrieve and rank a list of target videos based on the likelihood of being relevant to the content of the anchor [2, 5]. Multimodal representations have been utilized widely in video hyperlinking approaches in recent years [2, 6, 56]. Most of these approaches rely heavily on multimodal autoencoders for jointly embedding multimodal data [8, 15, 55]. Bidirectional deep neural network (BiDNN)-based representations have also been shown to be very effective in video hyperlinking benchmarks [54, 56]. BiDNN is also a variation of the multimodal autoencoder, which performs multimodal fusion using a cross-modal translation with two interlocked deep neural networks [54, 55]. In terms of input data, video-text retrieval deals with the same multimodal input as video hyperlinking in many cases. However, the video-text retrieval task is more challenging than hyperlinking since it requires distinctively retrieving matching data from a different modality, which demands effective utilization of the correlations between cross-modal cues.

2.3 Video-text retrieval

Most relevant to our work are the methods that relate the video and language modalities. Two major tasks in computer vision related to connecting these two modalities are video-text retrieval and video captioning. In this work, we only focus on the retrieval task. Similar to image-text retrieval approaches, most video-text retrieval methods employ a shared subspace. In [61], authors vectorize each subject-verb-object triplet extracted from a given sentence using the word2vec model [34] and then aggregate the subject, verb, object (SVO) vectors into a sentence-level vector using an RNN. The video feature vector is obtained by mean pooling over frame-level features. Then, a joint embedding is trained using a least squares loss to project the sentence representation and the video representation into a joint space. Web image search results of the input text have been exploited by [40], which focused on word disambiguation. In [53], a stacked GRU is utilized to associate a sequence of video frames with a sequence of words. In [41], authors propose an LSTM with a visual-semantic embedding method that jointly minimizes a contextual loss to estimate relationships among the words in the sentence and a relevance loss to reflect the distance between video and sentence vectors in the shared space. A method named Word2VisualVec is proposed in [12] for the video-to-sentence matching task; it projects a vectorized sentence into the visual feature space using a mean squared loss. A shared space across image, text, and sound modalities is proposed in [4] utilizing a ranking loss, which can also be applied to the video-text retrieval task.

Utilizing multiple characteristics of video (e.g., activities, audio, locations, time) is evidently crucial for efficient retrieval [63]. In the closely related task of video captioning, dynamic information from video along with static appearance features has been shown to be very effective [45, 65]. However, most of the existing video-text retrieval approaches depend on one visual cue for retrieval. In contrast to the existing works, our approach focuses on effectively utilizing different visual cues and audio (if available) concurrently for more efficient retrieval.

2.4 Ensemble approaches

Our retrieval system is based on an ensemble framework [16, 42]. The ensemble approach has a strong psychological grounding in how decisions are made in many daily life situations [42]. Seeking the opinions of several experts, weighing them, and combining them to make an important decision is an innate human behavior. Ensemble methods hinge on the same idea and utilize multiple models to make an optimized decision; in our case, diverse cues are available from videos, and we utilize multiple expert models that focus on different cues independently to obtain a stronger prediction model. Moreover, ensemble-based systems have been reported to be very useful when dealing with a lack of adequate training data [42]. As diversity of the models is crucial for the success of ensemble frameworks [43], it is important in our case to choose a diverse set of video-text embeddings that are significantly different from one another.

3 Approach

In this section, we first provide an overview of our proposed framework (Sect. 3.1). Then, we describe the input feature representation for video and text (Sect. 3.2). Next, we describe the basic framework for learning visual-semantic embedding using pairwise ranking loss (Sect. 3.3). After that, we present our modification on the loss function which improves the basic framework to achieve better recall (Sect. 3.4). Finally, we present the proposed fusion step for video-text matching (Sect. 3.5).

Fig. 3

An overview of the proposed retrieval process. We propose to learn three joint video-text embedding networks as shown in Fig. 3. One model learns a joint space (object-text space) between text features and visual object features. Another model learns a joint space (activity-text space) between text feature and activity features. Similarly, there is a third model which learns a joint space (place-text space) between scene features and text features. Here, object-text space is the expert in solving ambiguity related to who is in the video, whereas activity-text space is the expert in retrieving what activity is happening and place-text space is the expert in solving ambiguity regarding locations in the video. Given a query sentence, we calculate the sentence’s similarity scores with each one of the videos in the entire dataset in all of the three embedding spaces and use a fusion of scores for the final retrieval result. Please see Sect. 3.1 for an overview and Sect. 3 for details

3.1 Overview of the proposed approach

In a typical cross-modal video-text retrieval system, an embedding network is learned to project video features and text features into the same joint space, and then retrieval is performed by searching for the nearest neighbor in the latent space. Since in this work we are looking at videos in general, detecting the most relevant information such as objects, activities, and places could be very conducive to higher performance. Therefore, along with developing algorithms to train better joint visual-semantic embedding models, it is also very important to develop strategies to effectively utilize the different cues available from videos for a more comprehensive retrieval system.

In this work, we propose to leverage the capability of neural networks to learn a deep representation first and fuse the video features in the latent spaces so that we can develop expert networks focusing on specific subtasks (e.g., detecting activities, detecting objects). For analyzing videos, we use a model trained to detect objects, a second model trained to detect activities, and a third model focusing on understanding the place. These heterogeneous features may not be used together directly by simple concatenation to train a successful video-text model as intra-modal characteristics are likely to be suppressed in such an approach. However, an ensemble of video-text models can be used, where a video-text embedding is trained on each of the video features independently. The final retrieval is performed by combining the individual decisions of several experts [42]. An overview of our proposed retrieval framework is shown in Fig. 3. We believe that such an ensemble approach will significantly reduce the chance of poor/wrong prediction.

We follow the network architecture proposed in [28], which learns the embedding model using a two-branch network trained on image-text pairs. One branch of this network takes the text feature as input, and the other branch takes a video feature. We propose a modified bidirectional pairwise ranking loss to train the embedding. Inspired by the success of the ranking loss proposed in [13] for the image-text retrieval task, we emphasize hard negatives. We also apply a weight-based penalty on the loss according to the relative ranking of the correct match in the retrieved result.

3.2 Input feature representation

3.2.1 Text feature

For encoding sentences, we use gated recurrent units (GRU) [11]. We set the dimensionality of the joint embedding space, D, to 1024. The dimension of the word embeddings that are input to the GRU is 300. Note that the word embedding model and the GRU are trained end-to-end in this work.
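Below is a minimal sketch of this sentence encoder in PyTorch (the framework used in our implementation). The class and variable names (e.g., TextEncoder, vocab_size) are illustrative rather than taken from the original code; only the dimensions (300-d word embeddings, 1024-d GRU hidden state) follow the text.

```python
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class TextEncoder(nn.Module):
    """GRU sentence encoder; embeddings and GRU are trained end-to-end."""
    def __init__(self, vocab_size, word_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, hidden_dim, batch_first=True)

    def forward(self, tokens, lengths):
        # tokens: (batch, max_len) word indices; lengths: true sentence lengths
        x = self.embed(tokens)
        packed = pack_padded_sequence(x, lengths, batch_first=True,
                                      enforce_sorted=False)
        _, h_n = self.gru(packed)
        # the final hidden state serves as the sentence representation,
        # which is later projected into the joint space (Sect. 3.3)
        return h_n.squeeze(0)        # (batch, hidden_dim)
```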

3.2.2 Object feature

For encoding image appearance, we adopt a deep convolutional neural network (CNN) pre-trained on the ImageNet dataset as the encoder. Specifically, we utilize the state-of-the-art 152-layer ResNet model, ResNet152 [20]. We extract image features directly from the penultimate fully connected layer. We first rescale the image to \(224\times 224\) and feed it into the CNN. The dimension of the image embedding is 2048.
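As an illustration, the following sketch extracts such a 2048-dimensional frame-level appearance feature with torchvision's pre-trained ResNet-152 by dropping its final classification layer; the exact preprocessing and frame sampling of the original implementation may differ, and the file name is hypothetical.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet152(pretrained=True)
resnet.fc = torch.nn.Identity()      # remove the classifier -> 2048-d pooled feature
resnet.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    frame = Image.open("frame.jpg").convert("RGB")     # a sampled video frame
    feat = resnet(preprocess(frame).unsqueeze(0))      # shape: (1, 2048)
# frame-level features can then be mean-pooled to obtain a video-level feature
```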

3.2.3 Activity feature

The ResNet CNN can efficiently capture visual concepts in static frames. However, an effective approach to learning temporal dynamics in videos was proposed in [7] by inflating a 2D CNN into a deep 3D CNN named I3D. We use the I3D model to encode activities in videos. In this work, we utilize the pre-trained RGB-I3D model and extract a 1024-dimensional feature using 16 consecutive video frames as input.

3.2.4 Place feature

For encoding a video feature focusing on the scene/place, we utilize a deep CNN pre-trained on the Places-365 dataset as the encoder [66]. Specifically, we utilize the 50-layer model ResNet50 [20]. We extract image features directly from the penultimate fully connected layer. We rescale the image to \(224\times 224\) and feed it into the CNN. The dimension of the image embedding is 2048.

3.2.5 Audio feature

We believe that by associating audio, we can obtain important cues about real-life events, which helps remove ambiguity in many cases. We extract the audio feature using the state-of-the-art SoundNet CNN [3], which provides a 1024-dimensional feature from the input raw audio waveform. Note that we only utilize the audio that is readily available with the videos.

3.3 Learning joint embedding

In this section, we describe the basic framework for learning joint embedding based on bidirectional ranking loss.

Given a video feature representation (i.e., appearance feature or activity feature or scene feature) \(\overline{v}\) (\(\overline{v} \in \mathbb {R}^V\)), the projection for a video feature on the joint space can be derived as \(v = W^{(v)}\overline{v}\) (\(v \in \mathbb {R}^D\)). In the same way, the projection of input text embedding \(\overline{t} (\overline{t} \in \mathbb {R}^T)\) to joint space is \(t = W^{(t)}\overline{t} (t \in \mathbb {R}^D)\). Here, \(W^{(v)} \in \mathbb {R}^{D \times V}\) is the transformation matrix that projects the video content into the joint embedding space, and D denotes the dimension of the joint space. Similarly, \(W^{(t)}\in \mathbb {R}^{D \times T}\) maps input sentence/caption embedding to the joint space. Given feature representation for words in a sentence, the sentence embedding \(\overline{t}\) is found from the hidden state of the GRU. Here, given the feature representation of both videos and corresponding text, the goal is to learn a joint embedding characterized by \(\theta \) (i.e., \(W^{(v)}\), \(W^{(t)}\), and GRU weights) such that the video content and semantic content are projected into the joint embedding space. We keep image encoder (e.g., pre-trained CNN) fixed in this work, as the video-text datasets are small in size.
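The following is a minimal sketch of these two projection branches and the similarity computation, assuming the projections \(W^{(v)}\) and \(W^{(t)}\) are implemented as bias-free linear layers and the outputs are L2-normalized so that a dot product yields cosine similarity. Names such as JointEmbedding are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Two projection branches: W^(v) for the video feature, W^(t) for the text feature."""
    def __init__(self, video_dim, text_dim, joint_dim=1024):
        super().__init__()
        self.W_v = nn.Linear(video_dim, joint_dim, bias=False)   # W^(v): R^V -> R^D
        self.W_t = nn.Linear(text_dim, joint_dim, bias=False)    # W^(t): R^T -> R^D

    def forward(self, video_feat, text_feat):
        v = F.normalize(self.W_v(video_feat), dim=1)   # L2-normalize so that a
        t = F.normalize(self.W_t(text_feat), dim=1)    # dot product = cosine similarity
        return v, t

def cosine_sim(v, t):
    # v: (n_videos, D), t: (n_texts, D); returns the S(v, t) similarity matrix
    return v @ t.t()
```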

In the embedding space, the similarity between a video and a text is expected to reflect the semantic closeness between videos and their corresponding texts. Many prior approaches have utilized a pairwise ranking loss for learning a joint embedding between visual and textual inputs. They minimize a hinge-based triplet ranking loss combining bidirectional ranking terms, which maximizes the similarity between a video embedding and its corresponding text embedding while minimizing the similarity to all other non-matching ones. The optimization problem can be written as

$$\begin{aligned}&\min \limits _{\theta } \ \sum _{v}\sum _{t^{-}}[\alpha -S(v,t)+S(v,t^{-})]_{+} \nonumber \\&\quad + \sum _{t}\sum _{v^{-}}[\alpha -S(t,v)+S(t,v^{-})]_{+}, \end{aligned}$$
(1)

where \([f]_{+} = \hbox {max}(0, f)\). \(t^{-}\) is a non-matching text embedding, and t is the matching text embedding for video embedding v; similarly, \(v^{-}\) is a non-matching video embedding for text embedding t. \(\alpha \) is the margin value for the pairwise ranking loss. The scoring function \(S(v,t)\) is defined as the similarity function that measures the similarity between videos and text in the joint embedding space. We use cosine similarity in this work, as it is easy to compute and has been shown to be very effective in learning joint embeddings [13, 28].

In the first term of Eq. (1), for each pair (v, t), the sum is taken over all non-matching text embeddings \(t^{-}\). It attempts to ensure that, for each visual feature, corresponding/matching text features are closer than non-matching ones in the joint space. Similarly, the second term attempts to ensure that text embeddings are closer in the joint space to their corresponding video embeddings than to non-matching video embeddings.
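For concreteness, a compact sketch of Eq. (1) is given below, under the common assumption that the loss is computed over a mini-batch whose similarity matrix has the matching (v, t) pairs on its diagonal; this is illustrative rather than the reference implementation.

```python
import torch

def pairwise_ranking_loss(sim, margin=0.2):
    """Sum-over-negatives bidirectional hinge loss of Eq. (1).

    sim: (n, n) cosine-similarity matrix for a mini-batch;
         sim[i, i] is the score of the matching (v_i, t_i) pair.
    """
    pos = sim.diag().view(-1, 1)                       # S(v_i, t_i)
    cost_t = (margin - pos + sim).clamp(min=0)         # compare v_i with every t_j
    cost_v = (margin - pos.t() + sim).clamp(min=0)     # compare t_j with every v_i
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t = cost_t.masked_fill(mask, 0)               # drop the matching pairs
    cost_v = cost_v.masked_fill(mask, 0)
    return cost_t.sum() + cost_v.sum()
```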

3.4 Proposed ranking loss

Recently, focusing on hard negatives has been shown to be effective in many embedding tasks [13, 33, 48]. Inspired by this, we focus on hard negatives (i.e., the negative video and text samples closest to a positive/matching (v, t) pair) instead of summing over all negatives in our formulation. For a positive/matching pair (v, t), the hardest negative samples can be identified using \(\hat{v} = \arg \max \nolimits _{v^{-}} S(t,v^{-})\) and \(\hat{t} = \arg \max \nolimits _{t^{-}} S(v,t^{-})\). The optimization problem can be rewritten as the following to focus on hard negatives:

$$\begin{aligned}&\min \limits _{\theta } \ \sum _{v} [\alpha -S(v,t)+S(v,\hat{t})]_{+} \nonumber \\&\quad + \sum _{t}[\alpha -S(t,v)+S(t,\hat{v})]_{+}. \end{aligned}$$
(2)

The loss in Eq. 2 is similar to the loss in Eq. 1, but it is specified in terms of the hardest negatives [13]. We start with the loss function in Eq. 2 and further modify it following the idea of weighted ranking [51] to weight the loss based on the relative ranking of positive labels.

$$\begin{aligned}&\min \limits _{\theta } \sum _{v} L(r_v)[\alpha -S(v,t)+S(v,\hat{t})]_{+} \nonumber \\&\quad + \sum _{t} L(r_t)[\alpha -S(t,v)+S(t,\hat{v})]_{+}, \end{aligned}$$
(3)

where L(.) is a weighting function for different ranks. For a video embedding v, \(r_v\) is the rank of the matching sentence t among all compared sentences. Similarly, for a text embedding t, \(r_t\) is the rank of the matching video embedding v among all compared videos in the batch. We define the weighting function as \(L(r)= (1 + \beta /(N-r+1))\), where N is the number of compared videos and \(\beta \) is the weighting factor. Figure 4 illustrates the significance of the proposed ranking loss.
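A minimal sketch of this loss (Eqs. 2 and 3 combined) over a mini-batch is shown below: the hardest negative in each direction is taken from within the batch, and each hinge term is weighted by \(L(r)= 1 + \beta /(N-r+1)\), where the rank r of the correct match is computed within the batch. The value of the weighting factor beta is an assumption for illustration; this is not the authors' reference code.

```python
import torch

def weighted_hard_negative_loss(sim, margin=0.2, beta=10.0):
    """Proposed loss of Eq. (3) over a mini-batch (beta value is illustrative).

    sim: (n, n) cosine-similarity matrix; sim[i, i] is the matching pair.
    """
    n = sim.size(0)
    pos = sim.diag().view(-1, 1)                               # S(v_i, t_i)
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)

    # hardest in-batch negatives for each direction
    hard_t = sim.masked_fill(mask, float('-inf')).max(dim=1).values      # S(v_i, t_hat)
    hard_v = sim.t().masked_fill(mask, float('-inf')).max(dim=1).values  # S(t_i, v_hat)

    # rank of the correct match within the batch (1 = ranked first)
    r_v = (sim >= pos).sum(dim=1).float()      # rank of t_i among texts for video v_i
    r_t = (sim.t() >= pos).sum(dim=1).float()  # rank of v_i among videos for text t_i
    L_v = 1.0 + beta / (n - r_v + 1.0)         # L(r) = 1 + beta / (N - r + 1)
    L_t = 1.0 + beta / (n - r_t + 1.0)

    loss_v = L_v * (margin - pos.squeeze(1) + hard_t).clamp(min=0)
    loss_t = L_t * (margin - pos.squeeze(1) + hard_v).clamp(min=0)
    return loss_v.sum() + loss_t.sum()
```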

Fig. 4

An example showing the significance of the proposed ranking loss. The idea is that if a large number of non-matching instances are ranked higher than the matching one given the current state of the model, then the model must be updated by a larger amount (b). However, the model needs to be updated by a smaller amount if the matching instance is already ranked higher than most non-matching ones (a)

In practice, it is very common to compare samples only within a mini-batch at each iteration rather than across the entire training set, for computational efficiency [25, 33, 48]. This is known as semi-hard negative mining [33, 48]. Moreover, selecting the hardest negatives in practice may often lead to a collapsed model, and semi-hard negative mining helps to mitigate this issue [33, 48]. We utilize a batch size of 128 in our experiments.

It is evident from Eq. 3 that the loss applies a weight-based penalty based on the relative ranking of the correct match in the retrieved result. If a positive match is ranked at the top of the list, L(.) assigns a small weight to the loss and does not increase it much. However, if a positive match is not ranked at the top, L(.) assigns a much larger weight to the loss, which ultimately pushes the positive matching pair toward the top of the ranking.

3.5 Matching and ranking

The video-text retrieval task focuses on returning, for each query video, a ranked list of the most likely text descriptions from a dataset, and vice versa. We believe that we need to understand three main aspects of each video: (1) Who: the salient objects of the video, (2) What: the actions and events in the video, and (3) Where: the place aspect of the video. To achieve this, we learn three expert joint video-text embedding spaces as shown in Fig. 3.

The object-text embedding space is the common space to which both appearance features and text features are mapped. Hence, this space can link videos and sentences focusing on the objects. The activity-text embedding space, on the other hand, focuses on linking videos and language descriptions that emphasize the events in the video. Action features and audio features both provide important cues for understanding different events in a video. We fuse action and audio features (if available) by concatenation and map the concatenated feature and the text feature into a common space, namely the activity-text space. If the audio feature is absent from the videos, we only use the action feature as the video representation for learning the activity-text space. The place-text embedding space is the common space to which visual features focusing on the scene/place aspect and text features are mapped. Hence, this space can link videos and sentences focusing on the entire scene. We utilize the same loss function described in Sect. 3.4 for training these embedding models.

At the time of retrieval, given a query sentence, we compute the similarity score of the query sentence with each of the videos in the dataset in the three video-text embedding spaces and use a fusion of similarity scores for the final ranking. Conversely, given a query video, we calculate its similarity scores with all the sentences in the dataset in the three embedding spaces and use a fusion of similarity scores for the final ranking. The fused score is computed as

$$\begin{aligned} S_{v-t}(v,t)=w_{1}S_{o-t} + w_{2}S_{a-t} + w_{3}S_{p-t}. \end{aligned}$$
(4)

A weighted sum allows a task to put more emphasis on one of the facets of the video (objects, actions, or scene) when necessary. In this work, we empirically found that putting comparatively higher importance on \(S_{o-t}\) (object-text) and \(S_{a-t}\) (activity-text), and slightly lower importance on \(S_{p-t}\) (place-text), works better on the evaluated datasets than weighting all three equally. We empirically choose \(w_{1}=1\), \(w_{2}=1\), and \(w_{3}=0.5\) in our experiments based on our evaluation on the validation set.
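A minimal sketch of this score fusion is shown below, assuming each expert space has already produced a full cosine-similarity matrix between the videos and the sentences; the function name fuse_and_rank is illustrative.

```python
import torch

def fuse_and_rank(S_ot, S_at, S_pt, w=(1.0, 1.0, 0.5)):
    """Fuse per-space similarities as in Eq. (4) and rank candidates."""
    # Each S_* is an (n_videos, n_texts) cosine-similarity matrix
    # computed in its own joint embedding space.
    S = w[0] * S_ot + w[1] * S_at + w[2] * S_pt
    # text-to-video retrieval: rank videos for each text query (columns)
    video_ranking = S.argsort(dim=0, descending=True)
    # video-to-text retrieval: rank texts for each video query (rows)
    text_ranking = S.argsort(dim=1, descending=True)
    return S, video_ranking, text_ranking
```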

4 Experiments

In this section, we first describe the datasets and evaluation metric (Sect. 4.1). Then, we describe the training details (Sect. 4.2). Next, we provide quantitative results on the MSR-VTT dataset (Sect. 4.3) and the MSVD dataset (Sect. 4.4) to show the effectiveness of our proposed framework. Finally, we present some qualitative examples analyzing our success and failure cases (Sect. 4.5).

4.1 Datasets and evaluation metric

We present experiments on two standard benchmark datasets: Microsoft Research Video to Text (MSR-VTT) dataset [60] and Microsoft Video Description (MSVD) dataset [9] to evaluate the performance of our proposed framework. We adopt rank-based metric for quantitative performance evaluation.

4.1.1 MSR-VTT

The MSR-VTT is a large-scale video description dataset. This dataset contains 10,000 video clips. The dataset is split into 6513 videos for training, 2990 videos for testing, and 497 videos for the validation set. Each video has 20 sentence descriptions. This is one of the largest video captioning datasets in terms of the quantity of sentences and the size of the vocabulary.

4.1.2 MSVD

The MSVD dataset contains 1970 YouTube clips, and each video is annotated with about 40 sentences. We use only the English descriptions. For a fair comparison, we used the same splits utilized in prior works [53], with 1200 videos for training, 100 videos for validation, and 670 videos for testing. The MSVD dataset is also used in [40] for video-text retrieval task, where they randomly chose 5 ground-truth sentences per video. We use the same setting when we compare with that approach.

Table 1 Video-to-text and text-to-video retrieval results on MSR-VTT dataset

4.1.3 Evaluation metric

We use the standard evaluation criteria used in most prior work on the image-text retrieval and video-text retrieval tasks [12, 28, 40]. We measure rank-based performance by R@K, median rank (MedR), and mean rank (MeanR). R@K (Recall at K) calculates the percentage of test samples for which the correct result is found among the top-K items retrieved for the query sample. We report results for R@1, R@5, and R@10. Median rank is the median rank of the ground-truth results in the ranking. Similarly, mean rank is the mean rank of all correct results.
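For reference, a hedged sketch of these metrics is given below; it assumes a single ground-truth match per query placed on the diagonal of the similarity matrix (with multiple captions per video, the rank of the best-ranked ground truth would be used instead).

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K, median rank, and mean rank from a similarity matrix."""
    # sim: (n_queries, n_candidates); sim[i, i] is the correct match
    order = np.argsort(-sim, axis=1)                     # best candidate first
    gt = np.arange(sim.shape[0])
    ranks = np.array([np.where(order[i] == gt[i])[0][0] + 1
                      for i in range(sim.shape[0])])     # 1-indexed ranks
    recalls = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    return recalls, np.median(ranks), ranks.mean()       # R@K, MedR, MeanR
```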

4.2 Training details

We used two Titan Xp GPUs for this work. We implemented the network in PyTorch following [13]. We start training with a learning rate of 0.002 and keep the learning rate fixed for 15 epochs. Then, the learning rate is lowered by a factor of 10, and training continues for another 15 epochs. We use a batch size of 128 in all the experiments. The embedding networks are trained using the ADAM optimizer [27]. Gradients are clipped when the L2 norm of the gradients for the entire layer exceeds 2. We tried different values for the margin \(\alpha \) in training and found that \(0.1\le \alpha \le 0.2\) works reasonably well. We empirically choose \(\alpha \) as 0.2. The embedding model was evaluated on the validation set after every epoch, and the model with the best sum of recalls on the validation set is chosen as the final model.
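A minimal sketch of this optimization setup is shown below. Global gradient-norm clipping is used here as an approximation of the per-layer clipping described above, and the helper names are illustrative.

```python
import torch

def make_optimizer(model, lr=0.002):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # keep the learning rate fixed for 15 epochs, then lower it by a factor of 10;
    # call scheduler.step() once per epoch
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[15], gamma=0.1)
    return optimizer, scheduler

def train_step(model, optimizer, loss):
    optimizer.zero_grad()
    loss.backward()
    # clip gradients at an L2 norm of 2 (global-norm clipping as an approximation)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
    optimizer.step()
```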

4.3 Results on MSR-VTT dataset

We report the results on the MSR-VTT dataset [60] in Table 1. We implement several baselines to analyze different components of the proposed approach. To understand the effect of different loss functions, features, feature concatenation, and the proposed fusion method, we divide the table into seven rows (1.1–1.7). In row 1.1, we report the results of applying two different variants of the pairwise ranking loss. VSE [28] is based on the basic triplet ranking loss similar to Eq. 1, and VSEPP [13] is based on the loss function that emphasizes hard negatives as shown in Eq. 2. Note that all other reported results in Table 1 are based on the modified pairwise ranking loss proposed in Eq. 3. In row 1.2, we provide the performance of different features in learning the embedding using the proposed loss. In row 1.3, we present results for the embedding learned from a feature vector that is a direct concatenation of different video features. In row 1.4, we provide the result when a shared representation among the image, text, and audio modalities is learned using the proposed loss following the idea in [4] and used for the video-text retrieval task. In row 1.5, we provide the result of the proposed approach when it employs two video-text joint embeddings for retrieval. In row 1.6, we provide the result of the proposed ensemble approach employing all three video-text joint embeddings for retrieval. Additionally, in row 1.7, we provide the result for the case where rank fusion is used in place of the proposed score fusion.

4.3.1 Loss function

To evaluate the performance of different ranking loss functions on this task, we can compare the results reported in rows 1.1 and 1.2. For a fair comparison, we consider only the results based on the object-text space from these two rows. We see that the VSEPP loss function and the proposed loss function perform significantly better than the traditional VSE loss function in R@1, R@5, and R@10. However, the VSE loss function performs better in terms of the mean rank. This behavior is expected from the characteristics of the loss functions. As higher R@1, R@5, and R@10 are more desirable for an efficient video-text retrieval system than a lower mean rank, our proposed loss function performs better than the other loss functions on this task. We observe a similar performance improvement using our loss function in the other video-text spaces as well.

4.3.2 Video features

We can compare the performance of different video features in learning the embedding using the proposed loss from row 1.2. We observe that the object feature and the activity feature from video perform reasonably well in learning a joint video-text space. The performance is very low when only the audio feature is used for learning the embedding. This is expected, as the natural sound associated with a video alone does not contain as much information as the visual content in most cases. However, utilizing audio along with the I3D feature as the activity feature provides a slight boost in performance, as shown in rows 1.3 and 1.4.

4.3.3 Feature concatenation for representing video

Rather than training multiple video-semantic spaces, one could argue for simply concatenating all the available video features and learning a single video-text space using this concatenated video feature [12, 60]. However, we observe from row 1.3 that integrating complementary features with such a static concatenation-based fusion strategy fails to utilize the full potential of the different video features for the task. Comparing rows 1.2 and 1.3, we observe that a concatenation of the object, activity, and audio features performs even worse in R@1 than utilizing only the object feature. Although we see some improvement in other evaluation metrics, the overall improvement is very limited. We believe that both the appearance feature and the action feature get suppressed in such a concatenation, as they focus on representing different entities of a video.

Table 2 Video-to-text retrieval results on MSVD dataset

4.3.4 Learning a shared space across image, text, and audio

Learning a shared space across the image, text, and sound modalities was proposed for the cross-modal retrieval task in [4]. Following this idea, we trained a shared space across the video, text, and sound modalities using the pairwise ranking loss by utilizing video-text and video-sound pairs. The result is reported in row 1.4. We observe that performance on the video-text retrieval task degrades after training such an aligned representation across three modalities. Training such a shared representation provides the flexibility to transfer across multiple modalities; nevertheless, we believe it is not tailored toward achieving high performance on a specific task. Moreover, aligning across three modalities is a computationally more difficult task and requires many more examples to train.

4.3.5 Proposed fusion

The best result in Table 1 is achieved by our proposed fusion approach, as shown in row 1.6. The proposed method achieves a 31.43% improvement in R@1 for text retrieval and a 25.86% improvement in R@1 for video retrieval compared to Ours(Object-text) in row 1.2, which is the best among the methods that use a single embedding space for the retrieval task. In row 1.5, Fusion[Object-text & Activity(I3D-Audio)-text] differs from Fusion[Object-text & Activity(I3D)-text] only in the feature used for learning the activity-text space. We see that utilizing audio in learning the embedding improves the result slightly. However, as the retrieval performance of the audio feature on its own is very low (shown in row 1.2), we did not utilize an audio-text space separately in the fusion, as we found it degraded the performance significantly.

Comparing rows 1.6, 1.5, and 1.2, we find that the ensemble approach with score fusion results in a significant improvement in performance. Although there is no guarantee that the combination of multiple models will perform better than every individual model in the ensemble in every single case, the ensemble consistently improves performance by a significant margin.

Table 3 Text-to-video retrieval results on MSVD dataset

4.3.6 Rank versus similarity score in fusion

We provide the retrieval result based on weighted rank aggregation of the three video-text spaces in row 1.7. Comparing rows 1.6 and 1.7 in Table 1, where rank fusion replaces score fusion, it is evident that the proposed score fusion approach shows consistent performance improvement over the rank fusion approach. Exploiting similarity scores to combine multiple pieces of evidence may be less effective than using rank values in some cases, as the score fusion approach weighs scores independently and does not consider overall performance in the weighting [31]. However, we empirically find that score fusion is more advantageous than rank fusion in our system in terms of retrieval effectiveness.

Fig. 5

Examples of 9 test videos from the MSVD dataset and the top 1 retrieved captions using a single video-text space and the fusion approach with our loss function. The value in brackets is the rank of the highest ranked ground-truth caption. Ground truth (GT) is a sample from the ground-truth captions. Among the approaches, object-text (ResNet152 as video feature) and activity-text (I3D as video feature) are systems where a single video-text space is used for retrieval. We also report results for the fusion system where three video-text spaces (object-text, activity-text, and place-text) are used for retrieval

4.4 Results on MSVD dataset

We report the results of video-to-text retrieval task on MSVD dataset [9] in Table 2 and the results for text-to-video retrieval in Table 3.

We compare our approach with the existing video-text retrieval approaches, CCA [49], ST [29], JMDV [61], LJRV [40], JMET [41], and W2VV [12]. For these approaches, we directly cite scores from respective papers when available. We report score for JMET from [12]. The score of CCA is reported from [61], and the score of ST is reported from [40]. If scores for multiple models are reported, we select the score of the best performing method from the paper.

We also implement and compare results with the state-of-the-art image-text embedding approaches VSE [28] and VSEPP [13] in the object-text (O-T) embedding space. Additionally, to show the impact of using only the proposed loss in retrieval, we also report results based on the activity-text (A-T) space and place-text (P-T) space in the tables. Our proposed fusion is named Ours-fusion(O-T, A-T, P-T) in Tables 2 and 3. The proposed fusion system utilizes the proposed loss and employs three video-text embedding spaces for calculating the similarity between video and text. As the audio is muted in this dataset, we train the activity-text space utilizing only the I3D feature from videos. We also report results for our fusion approach using any two of the three video-text spaces in the tables. Additionally, we report results of Rank-fusion(O-T, A-T, P-T), which uses rank in place of the similarity score when combining the retrieval results of the three video-text spaces in the fusion system.

Fig. 6

A snapshot of 9 test videos from the MSR-VTT dataset with success and failure cases, showing the top 1 retrieved caption for four approaches based on the proposed loss function and the rank of the highest ranked ground-truth caption inside the bracket. Among the approaches, object-text is trained using the ResNet feature as the video feature, and activity-text is trained using the concatenated I3D and audio feature as the video feature. We also report results for fusion approaches where three video-text spaces are used for retrieval. The fusion approaches use an object-text space trained with the ResNet feature and a place-text space trained with the ResNet50(Place) feature, while in the proposed fusion, the activity-text space is trained using the concatenated I3D and audio feature. Fusion (no audio) utilizes an activity-text space trained with only the I3D feature

From Tables 2 and 3, it is evident that our proposed approach performs significantly better than the existing ones. Utilizing the proposed loss consistently improves the result over previous state-of-the-art methods, with a minimum improvement of 10.38% over the best existing method, VSEPP(object-text), in video-to-text retrieval and 4.55% in text-to-video retrieval. This shows that our loss function is useful not only for learning each embedding independently but also within the proposed fusion. The result is improved further by the proposed fusion framework, which combines multiple video-text spaces in an ensemble approach when calculating the final ranking, with an improvement of 57.07% over the best existing method in video-to-text retrieval and 38.31% in text-to-video retrieval. Among the video-text spaces, the object-text and activity-text spaces show better retrieval performance than the place-text space, which indicates that the annotators focused more on object and activity aspects when annotating the videos. Similar to the results on the MSR-VTT dataset, we observe that the proposed score fusion approach consistently shows superior performance to the rank fusion approach in both video-to-text and text-to-video retrieval.

4.5 Qualitative results

We report the qualitative results on MSVD dataset in Fig. 5 and the results on MSR-VTT dataset in Fig. 6.

4.5.1 MSVD dataset

In Fig. 5, we show examples of a few test videos from the MSVD dataset and the top 1 retrieved captions for the proposed approach. We also show the retrieval result when only one of the embeddings is used for retrieval. Additionally, we report the rank of the highest ranked ground-truth caption in the figure. We can observe from the figure that, in most cases, utilizing cues from multiple video-text spaces helps to match the correct caption. We see from Fig. 5 that, among the 9 videos, the retrieval performance is improved or equally high recall is retained for 7 videos. Video 6 and video 9 show two failure cases, where utilizing multiple video-text spaces degrades the performance slightly compared to using the object-text space in video 6 and the activity-text space in video 9.

4.5.2 MSR-VTT dataset

Similar to Fig. 5, we show qualitative results for a few test videos from the MSR-VTT dataset in Fig. 6. Videos 1–6 in Fig. 6 show a few examples where utilizing cues from multiple video-text spaces helps to match the correct caption compared to using only one of the video-text spaces. Moreover, we see that the result improved after utilizing audio in learning the second video-text space (activity-text space). We observe this improvement for most of the videos, as also reflected in Table 1. Videos 7–9 show some failure cases for our fusion approach in Fig. 6. Video 7 shows a case where utilizing multiple video-text spaces for retrieval degrades the performance slightly compared to utilizing only one of the video-text spaces. For videos 8 and 9 in Fig. 6, we observe that the performance improves after fusion overall, but the performance is better when the audio is not used in learning the video-text space. On the other hand, videos 1–6 include cases where utilizing audio helped to improve the result.

4.6 Discussion

The experimental results are aligned with our rationale that utilizing multiple characteristics of a video is crucial for developing an efficient video-text retrieval system. Experiments also demonstrate that our proposed ranking loss function learns video-text embeddings better than the existing ones. However, we observe that the major improvement in experimental performance comes from our mixture-of-experts system, which utilizes evidence from three complementary video-text spaces for retrieval. Our mixture-of-experts video-text model may not outperform every single video-text model in the ensemble in every case, but it is evident from the experiments that our system significantly reduces the overall risk of making a particularly poor decision.

From the qualitative results, we observe that it cannot be claimed in general that one video feature is consistently better than the others for the task of video-text retrieval. It can easily be seen from the top 1 retrieved captions in Figs. 5 and 6 that the video-text embedding (object-text) learned using the object appearance feature (ResNet) as the video feature is significantly different from the joint embedding (activity-text) learned using the activity feature (I3D) as the video feature. The variation in the rank of the highest matching caption further strengthens this observation. The object-text space performs better than the activity-text space in retrieval for some videos; for other videos, the activity-text space achieves higher performance. However, it can be claimed that combining knowledge from multiple video-text embedding spaces consistently shows better performance than utilizing only one of them.

We observe from Fig. 6 that using audio is crucial in many cases where there is a deep semantic relation between the visual content and the audio (e.g., the audio is a third-person narration of the video, or the audio is music or a song), and it gives important cues for reducing description ambiguity (e.g., videos 2, 5, and 6 in Fig. 6). We also observe that performance degrades in some cases when audio is utilized in the system (e.g., video 8 in Fig. 6). We see an overall improvement in the quantitative results (Table 1), which also supports our idea of using audio. Since we did not exploit the structure of the audio or analyze the structural alignment between audio and video, it is difficult to determine whether audio is always helpful. For instance, audio can come from different sources (persons, animals, or objects) in a video, and it might shift attention away from the main subject. Moreover, the captions are provided mostly based on visual aspects, which makes the audio information very sparse. Hence, the overall improvement from using audio was limited.

5 Conclusions

In this paper, we study how to leverage diverse video features effectively for developing a robust cross-modal video-text retrieval system. Our proposed framework learns three expert video-text embedding models focusing on three salient video cues (i.e., object, activity, and place) and uses a combination of these models for high-quality prediction. A modified pairwise ranking loss function is also proposed for better learning the joint embeddings, which focuses on hard negatives and applies a weight-based penalty based on the relative ranking of the correct match. Extensive quantitative and qualitative evaluations on the MSVD and MSR-VTT datasets demonstrate that our framework performs significantly better than baselines and state-of-the-art systems. Moving forward, we would like to improve our system by developing more sophisticated algorithms to combine evidence from multiple joint spaces and to further analyze the role of associated audio in video-text retrieval.