
1 Introduction

Cross-modal information retrieval refers to the task where queries from one or more modalities (e.g., text, audio, etc.) are used to retrieve items from a different modality (e.g., images or videos). This paper focuses on text-video retrieval, a key sub-task of cross-modal retrieval, which aims to retrieve unlabeled videos using only textual descriptions as input. This supports real-life scenarios such as a user searching for a video they remember having viewed in the past, e.g., “I remember a video where a dog and a cat were lying down in front of a fireplace”, or searching for a video never seen before, again by expressing their information needs in natural language, e.g., “I would like to find a video where some kids are playing basketball in an open field”.

To perform text-video retrieval, the videos or video parts, along with the textual queries, typically need to be embedded into a joint latent feature space. Early approaches to this task [16, 26] annotated both modalities with a set of pre-defined visual concepts, and retrieval was performed by comparing these annotations. With the rise of deep neural networks (DNNs), the community turned to learned representations instead. Although various DNN architectures have been proposed to this end, their general strategy is the same: encode text and video into one or more joint latent feature spaces where text-video similarities can be calculated.

Fig. 1.

Illustration of training different network architectures for dealing with the text-video retrieval problem using video-caption pairs. In all illustrations, \(\mathcal {L}\) stands for the loss function. (a) All features are fed into one encoder per modality, (b) every textual feature is used as input to a different encoder (or to more than one encoder), while visual features are simply concatenated, and (c) the proposed T\(\times \)V approach, where various textual and visual features are selectively combined to create different joint spaces

State-of-the-art cross-modal video retrieval approaches utilize textual information by exploiting several textual features \(g_s(\cdot )\) – extracted with the help of already-trained deep networks or non-trainable extractors – and encoding them through one or more trainable textual encoders (which are trained end-to-end as part of the overall cross-modal network training). The simple but widely used Bag-of-Words (bow) feature is often combined with embedding-based features such as Word2vec [28] and Bert [7]. Typically, these features are used as input to a simple or more sophisticated trainable textual encoder \(f_s(\cdot )\), e.g., [8, 19], that encodes them into a single representation. Similarly to the textual information processing, trained image or video networks (e.g., a ResNet-152 trained on ImageNet) are used to extract feature vectors \(g_v(\cdot )\) from the video frames. Typically, these features are then concatenated and used as input to a trainable video encoder \(f_v(\cdot )\). Finally, the outputs of \(f_s(\cdot )\) and \(f_v(\cdot )\) (after a linear projection and a non-linear transformation) are embedded into a new joint space (Fig. 1a). Methods that follow this general methodology include [9, 14, 19].

In [20] a new approach of textual encoder assembly was proposed for exploiting diverse textual features. Instead of inputting all these features into a single textual encoder, an architecture was proposed where each textual feature is input to a different encoder (or to more than one encoder), resulting in multiple joint latent spaces. However, when it comes to the video content, its treatment in [20] is much simpler: several video features derived from trained networks are combined via vector concatenation, and individual fully connected layers embed them into a number of joint feature spaces. The cross-modal similarity, which serves as the loss function, is calculated by summing the individual similarities in each latent space. Figure 1b illustrates this architecture.

In terms of loss function, the majority of the proposed works, e.g., [9, 13, 19], utilize the improved marginal ranking loss introduced in [12]. This loss uses the hard-negative samples within a training batch to separate the positive samples from negative samples that lie close to them. In [6] the dual softmax loss, a modification of the symmetric cross-entropy loss, was introduced. It is based on the assumption that the optimal text-video similarity is reached when the diagonal of a constructed similarity matrix contains the highest scores. Thus, this loss takes into consideration the cross-direction similarities within a training batch and revises the predicted text-video similarities.

In this work, inspired by [20] where multiple textual encoders are introduced, we propose a new cross-modal network architecture to explore the combination of multiple and heterogeneous textual and visual features. We extend the textual information processing strategy of [20], with adaptations, to the visual information processing as well, and we propose a multiple latent space learning approach, as illustrated in Fig. 1c. Moreover, inspired by the dual softmax loss of [6], we examine our network’s performance when we introduce a dual softmax operation at the evaluation stage (unlike [6], which applies it during the network’s training) and use it to revise the inferred text-video similarity scores. The contributions of this work are the following:

  • We propose a new network architecture, named T\(\times \)V, to efficiently combine textual and visual features using multiple loss learning for the text-based video retrieval task.

  • We propose introducing a dual softmax operation at the retrieval stage for exploiting prior text-video similarities to revise the ones computed by the network.

2 Related Work

The general idea behind text-video retrieval is to project text and video information into comparable representations. Due to computational resource limitations, early approaches, e.g., [17, 24, 26], dealt with relatively small datasets and used pre-defined visual concepts as a stepping stone: videos and text were annotated with concepts, and text-video similarity was calculated by measuring the similarity between these annotations. With the advent of deep learning, the state of the art moved to concept-free methods. The currently dominant strategy is to encode both modalities into a joint latent feature space, where the text and video representations are compared.

In [10, 14], dual encoding networks were proposed: two similar sub-networks, one for the video stream and one for the text, encode them into a joint feature space. The dual-task network of [33] combines latent space encoding with concept representation: the first task encodes text and video into a joint latent space, while the second task encodes video and text as a set of visual concepts. In [19] several textual features were used to create multiple textual encoders, instead of feeding them into a single encoder; in this way, multiple joint text-video latent feature spaces could be learned, leading to more accurate retrieval results. In [31] the problem of understanding textual or visual content with multiple meanings is addressed by combining global and local features through multi-head attention. More recently, inspired by the human reading strategy, [11] proposed a two-branch approach to encode video representations: a preview branch captures the overview information of a video, while the intense-reading branch is designed to extract more in-depth information; moreover, the two branches interact, and the preview guides the intense-reading branch. As a general trend, various recent works on text-video retrieval, e.g., [4, 11, 20], have shown that the utilization of multiple textual features to create more than one video-text joint space leads to improved overall performance.

Recent approaches additionally go beyond the standard evaluation protocol (i.e., training the network using the training portion of a dataset and testing it on the testing portion), benefiting from pre-training on further large-scale video-text datasets. This procedure improves performance and yields transferable textual and visual representations. In [27], HowTo100M is introduced: a large-scale dataset of over 100M web video clips. Using this dataset to pre-train a baseline video retrieval network is shown in [27] to be beneficial. HiT [22] uses a transformer-based architecture to create a hierarchical cross-modal network with semantic-level and feature-level encoders; experiments in [22] with and without pre-training also show that the network’s performance increases with the pre-training step. BridgeFormer [15] introduces a module that is trained to answer textual questions, used as the pre-training step of a dual encoding network. Frozen [2], on the other hand, is based on a transformer architecture and does not use trained image DNNs as feature extractors; it did introduce, though, a large-scale video-text dataset (WebVid-2M) which was used for end-to-end pre-training of the network.

3 Proposed Approach

3.1 Overall Architecture

The text-video retrieval problem is formulated as follows: let \(V = \{v_1,v_2,\ldots ,v_T\}\) be a large set of T unlabeled video shots and s a free-text query. The goal of the task is, given the query s, to retrieve from V a ranked list with the most relevant video shots.

Our T\(\times \)V network consists of two key sub-networks, one for the textual and one for the visual stream. The textual sub-network inputs a free-text query and vectorizes it into M textual features \(g_S: \{g_{s}^{1}(\cdot ), g_{s}^{2}(\cdot ), \dotsc ,g_{s}^{M}(\cdot )\}\). These M features are used as input to a set of K carefully selected textual encoders \(f_S: \{f_{s}^{1}(\cdot ), f_{s}^{2}(\cdot ), \dotsc ,f_{s}^{K}(\cdot )\}\) that encode the input sentence. Each of these encoders can be either a trainable network or simply an identity function that just forwards its input. Similarly to the textual one, the visual sub-network inputs a video shot consisting of a sequence of N keyframes \({v} = \{I_1,I_2,\ldots ,I_N\}\). We use L trained DNNs to extract the initial frame representations \(g_V: \{g_{v}^{1}(\cdot ), g_{v}^{2}(\cdot ), \dotsc ,g_{v}^{L}(\cdot )\}\). To obtain video-shot-level representations we follow a mean-pooling strategy.

Subsequently, we create all possible textual encoding-visual feature pairs (\(f_{s}^{k}(s)\), \(g_{v}^{l}(v)\)), and a joint embedding space is created for each pair using two fully connected layers. Thus, \( K\times L\) different joint spaces are created. The objective of our network is to learn a similarity function \(similarity(s,v)\) that considers every individual similarity in each joint latent space, utilizing multi-loss-based training. Figure 1c illustrates our proposed method.

3.2 Multiple Space Learning

To encode the \((f_{s}^{k}(\cdot ),g_{v}^{l}(\cdot ))\) pair into its joint feature space, as shown in Fig. 1c, each part of the pair is linearly transformed by a fully connected (FC) layer. A non-linearity is applied to the FC output (not illustrated in Fig. 1c for brevity), for which the ReLU activation function is used, as follows:

$$\begin{aligned} \textbf{s}_k&= ReLU(FC(f_{s}^{k}(\cdot ))) \\ \textbf{v}_l&= ReLU(FC(g_{v}^{l}(\cdot ))) \end{aligned}$$

This transformation encodes the \((f_{s}^{k}(\cdot ),g_{v}^{l}(\cdot ))\) pair into its new joint feature space. The similarity function \(sim(\textbf{s}_k,\textbf{v}_l)\) calculates the similarity between the output of textual encoder k and video feature l in this joint feature space. The overall similarity between a video-sentence pair is calculated as follows:

$$\begin{aligned} similarity(s,v) = \sum _{k=1}^{K}\sum _{l=1}^{L} sim({\textbf{s}_k,\textbf{v}_l})\end{aligned}$$

where \(sim(\textbf{s}_k,\textbf{v}_l) = cosine\_similarity(\textbf{s}_k,\textbf{v}_l)\).
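To make the multiple-space computation above concrete, the following is a minimal PyTorch sketch of the \(K\times L\) joint-space projections and the summed cosine similarity. The module name, the joint-space dimensionality and the batched matrix form are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TxVSimilarity(nn.Module):
    """Sketch of the K x L joint spaces: one FC layer per modality for each
    (k, l) pair, a ReLU, and cosine similarities summed over all spaces."""
    def __init__(self, text_dims, video_dims, joint_dim=1024):
        super().__init__()
        self.text_fcs = nn.ModuleList([
            nn.ModuleList([nn.Linear(td, joint_dim) for _ in video_dims])
            for td in text_dims])
        self.video_fcs = nn.ModuleList([
            nn.ModuleList([nn.Linear(vd, joint_dim) for _ in text_dims])
            for vd in video_dims])

    def forward(self, text_feats, video_feats):
        # text_feats: list of K tensors, each (num_captions, text_dims[k])
        # video_feats: list of L tensors, each (num_videos, video_dims[l])
        sims = 0.0
        for k, t in enumerate(text_feats):
            for l, v in enumerate(video_feats):
                s_k = F.relu(self.text_fcs[k][l](t))    # textual embedding in space (k, l)
                v_l = F.relu(self.video_fcs[l][k](v))   # visual embedding in space (k, l)
                sims = sims + F.normalize(s_k, dim=-1) @ F.normalize(v_l, dim=-1).T
        return sims  # (num_captions, num_videos): similarity(s, v) summed over K*L spaces
```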

To train our network, similarly to [10, 14], we utilize the improved marginal ranking loss introduced in [12]. This loss emphasizes the hard-negative samples in order to learn to maximize the similarity between matching textual and video embeddings. At the training stage, given a sentence-video sample \((s,v)\) and a specific latent feature space \((k,l)\), the improved marginal ranking loss is defined as follows:

$$\begin{aligned} \mathcal {L}_{(k,l)}(s,v) =&\ max(0, \alpha + sim(\textbf{s}_{k},\textbf{v}_{l}^{'}) - sim(\textbf{s}_k, \textbf{v}_{l})) \\&+ max(0, \alpha + sim(\textbf{s}_{k}^{'}, \textbf{v}_{l}) - sim(\textbf{s}_k,\textbf{v}_{l})) \end{aligned}$$

where \(\textbf{v}_{l}^{'}\) and \(\textbf{s}_{k}^{'}\) are the hardest negatives of \(\textbf{s}_{k}\) and \(\textbf{v}_{l}\), respectively, and \( \alpha \) is a hyperparameter regulating the margin. The overall training loss is calculated as the sum of all \(K \times L\) individual loss values:

$$\begin{aligned} \mathcal {L}(s,v) = \sum _{k=1}^{K}\sum _{l=1}^{L}\mathcal {L}_{(k,l)}(s,v) \end{aligned}$$
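As an illustration, here is a minimal PyTorch sketch of this loss for a single joint space (k, l), assuming a batch similarity matrix whose diagonal holds the positive caption-video pairs; the reduction over the batch (a sum) is an assumption.

```python
import torch

def improved_marginal_ranking_loss(sim, alpha=0.2):
    """sim: (B, B) similarity matrix of one joint space (k, l) for a training batch;
    entry (i, j) compares caption i with video j, so the diagonal holds the positives."""
    B = sim.size(0)
    pos = sim.diag().view(B, 1)                                  # sim(s_k, v_l) of positives
    neg = sim.masked_fill(torch.eye(B, dtype=torch.bool, device=sim.device), float('-inf'))
    hardest_neg_video = neg.max(dim=1).values.view(B, 1)         # v'_l for each caption
    hardest_neg_caption = neg.max(dim=0).values.view(B, 1)       # s'_k for each video
    loss = torch.clamp(alpha + hardest_neg_video - pos, min=0) \
         + torch.clamp(alpha + hardest_neg_caption - pos, min=0)
    return loss.sum()                                            # reduced over the batch

# overall objective: sum of the per-space losses over all K*L joint spaces, e.g.
# total_loss = sum(improved_marginal_ranking_loss(sim_kl) for sim_kl in per_space_sims)
```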

3.3 Dual Softmax Inference

In [6], a new objective function based on two softmax operations was proposed. According to this, at the training stage the predicted text-video similarities are revised by calculating a so-called cross-direction similarity matrix and multiplying it with the predicted one. Specifically, within a training batch, let Q be the number of examined caption-video pairs. By computing the similarities between every caption and all videos, a similarity matrix \(\textbf{X} \in \mathcal {R}^{Q\times Q}\) is generated. Next, by applying two cross-dimension softmax operations (one column-wise and one row-wise), an updated similarity matrix \(\textbf{X}'\) is calculated and subsequently used as discussed in [6]. Directly applying this approach at the inference stage, though, would require all queries to be evaluated to be known a priori and evaluated simultaneously, since they would all be needed for calculating matrix \(\textbf{X}\), as illustrated in the left part of Fig. 2. This is not a realistic scenario, especially in real-world retrieval applications.

To deal with this issue and revise the inferred text-video similarities at the retrieval stage, we propose a dual softmax-based inference (DS\(_\mathrm{{inf}}\)), as illustrated in the right side of Fig. 2. We utilize a fixed set of C pre-defined background textual queries, which are independent of the evaluated dataset, and we calculate once their similarities with all D videos of the test set, forming the similarity matrix \(\textbf{X}^{ *}\in \mathcal {R}^{C\times D}\). For each individual evaluated query s, a similarity vector \(\textbf{y}(s) = [similarity(s,v_1), similarity(s,v_2), \ldots , similarity(s,v_D)]^T\) is calculated. A matrix \(\textbf{Z}(s) = concat(\textbf{y}(s);\textbf{X}^{ *})\) is then constructed, and a dual softmax operation revises the similarities as follows:

$$\begin{aligned} \textbf{Z}^{*}(s) = Softmax(\textbf{Z}(s),\ dim=0) \odot Softmax(\textbf{Z}(s), \ dim=1) \end{aligned}$$

where \( \odot \) denotes the Hadamard product. Finally, from matrix \(\textbf{Z}^{*}(s)\) we extract the revised similarity vector \(\textbf{y}^{*} = [{Z}^{*}_{0,1}, {Z}^{*}_{0,2},\ldots ,{Z}^{*}_{0,D}]\) (Fig. 3). This normalization procedure is meaningful when multiple positive video samples are expected to exist in the dataset for the evaluated query; by normalizing the inferred similarities we can then produce a better ranking list.
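A minimal sketch of DS\(_\mathrm{{inf}}\) for a single evaluated query follows; it assumes the C\(\times \)D background similarity matrix \(\textbf{X}^{*}\) has been precomputed with the same similarity function, and the function name is illustrative.

```python
import torch

def ds_inf(y, X_bg):
    """y:    (D,)   similarities of the evaluated query to the D test videos
       X_bg: (C, D) similarities of the C fixed background queries to the same videos"""
    Z = torch.cat([y.view(1, -1), X_bg], dim=0)                # query row on top: (C+1, D)
    Z_star = torch.softmax(Z, dim=0) * torch.softmax(Z, dim=1)  # dual softmax, Hadamard product
    return Z_star[0]                                           # revised similarity vector y*

# videos are then ranked by the revised similarities:
# ranking = ds_inf(y, X_bg).argsort(descending=True)
```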

Fig. 2.

Different approaches to update the similarity matrix using all the evaluated queries (left subfigure) and pre-defined background queries (right subfigure)

Fig. 3.

Illustration of the dual softmax-based inference (DS\(_\mathrm{{inf}}\)) approach for updating the similarities

3.4 Specifics of Textual Information Processing

In this section, we present the textual features and the textual encoders we examined in order to find the optimal textual encoder combination. Given a sentence s consisting of B words \(\{w_1, w_2, \ldots , w_B\}\), we utilize \(M=4\) different textual features. These features are used as input to the textual encoders; a minimal feature-extraction sketch is given right after the feature list below.

Textual features

  • Bag-of-Words (bow): We utilize Bag-of-Words to vectorize every sentence into a sparse vector representation expressing the occurrence frequency of every word from a pre-defined vocabulary.

  • Word2Vec (w2v): The Word2Vec model [28] is an established and well-performing word embedding model. W2v learns to embed words into word-level representation vectors \(\textbf{W}_{w2v}:\{\textbf{w}_1^{w2v}, \textbf{w}_2^{w2v}, \ldots , \textbf{w}_B^{w2v} \}\in \mathcal {R}^{B\times D_{w2v}}\). The overall sentence w2v embedding \(g_{s}^{w2v} \) is calculated as the mean pooling of the individual word embeddings.

  • Bert: Bert [7] provides contextual embeddings by considering the sequence of all words in a sentence, which means that the same word may have a different embedding when the sentence context differs. We utilize the BASE variant of bert, consisting of 12 encoders with 12 bidirectional self-attention heads. Similarly to [20], we calculate the bert sentence embedding \(g_{s}^{bert} \) by mean pooling the individual word embeddings.

  • Clip: The trained transformer-based text encoder of CLIP [30] is used as a textual feature extractor. Sentence s is fed to it as a sequence of words \(w_1,..., w_B\) and token embeddings \(\textbf{W}_{clip}:\{ \textbf{w}_{startoftext}^{clip}, \textbf{w}_1^{clip}, \ldots , \textbf{w}_B^{clip}, \textbf{w}_{endoftext}^{clip} \}\) are calculated. The last token embedding, \(\textbf{w}_{endoftext}^{clip}\in \mathcal {R}^{512}\), is used as our feature vector.
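As an illustration of how the clip and bert sentence features above could be obtained, here is a hedged sketch using the openai-clip and HuggingFace transformers packages. The checkpoints shown ("ViT-B/32", "bert-base-uncased") and the exact pooling details are assumptions for illustration, not the authors' extraction pipeline.

```python
import torch
import clip                                        # https://github.com/openai/CLIP
from transformers import BertModel, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
sentence = "a dog and a cat are lying in front of a fireplace"

# CLIP textual feature (512-d): sentence embedding from the CLIP text transformer
clip_model, _ = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    g_clip = clip_model.encode_text(clip.tokenize([sentence]).to(device))   # (1, 512)

# bert feature: mean pooling of the contextual word embeddings (BASE model, 768-d;
# the uncased checkpoint is an assumption)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").to(device).eval()
with torch.no_grad():
    out = bert(**tokenizer(sentence, return_tensors="pt").to(device))
    g_bert = out.last_hidden_state.mean(dim=1)                              # (1, 768)
```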

Fig. 4.

Illustration of the ATT encoder, inputting several textual features and producing three levels of encodings (bow, self-attention bi-gru and CNN outputs) which are concatenated and contribute to the final output of the encoder

Textual Encoders. The textual encoders input the extracted textual features (one at a time, or a combination of them) and output a new embedding. We experimented with combinations of the following:

  • \(f_{s}^{bow}\), \(f_{s}^{w2v}\), \(f_{s}^{bert}\), \(f_{s}^{clip}\): these encoders forward the corresponding features through an identity layer.

  • \(f_{s}^{w2v-bert}\): The concatenation of the w2v and bert features is forwarded through an identity layer.

  • \(f_{s}^{bi-gru}\): A self-attention bi-gru module, introduced in [14], is trained as part of the complete network architecture; it takes as input the w2v features for each word and their temporal order (i.e., not using the overall sentence w2v embedding, contrarily to \(f_{s}^{w2v}\) and \(f_{s}^{w2v-bert}\)).

  • Attention-based dual encoding network (ATT): The textual sub-network presented in [14] (illustrated in Fig. 4) is trained (similarly to \(f_{s}^{bi-gru}\), above), taking as input the bow, w2v and bert features (again, for each individual word rather than the mean-pooled sentence embeddings) and producing a vector in \(\mathcal {R}^{2048}\).

Among all the above possible textual encoders, we propose to combine in our network the \(f_{s}^{clip}\) and ATT ones, as experimentally verified in Sect. 4.

3.5 Specifics of Visual Information Processing

Similarly to the textual sub-network, we use several deep networks that have been trained for other visual tasks as frame feature extractors. Considering a video shot v, first we uniformly sample it with a fixed rate of 2 frames per second, resulting in a set of keyframes \(\{I_1,I_2,\ldots ,I_N\}\). Then, frame-level representations are obtained with the help of the feature extractors listed below, followed by mean pooling of the individual frame representations to get shot-level features.

Visual features

  • R_152: The first video feature extractor inputs an image frame into a ResNet-152 [18] network, trained on the ImageNet-11k dataset. The flattened output of the layer before the last fully connected layer of the network is used as a feature representation of every frame in \(\mathcal {R}^{2048}\).

  • Rx_101: The second feature extractor utilizes a ResNeXt-101 network, pre-trained by weakly supervised learning on web images and fine-tuned on ImageNet [25]. Similarly to R_152, Rx_101 inputs frames, and the frame representations in \(\mathcal {R}^{2048}\) are obtained as the flattened output of the layer before the last fully connected layer.

  • Clip: As the third video feature extractor we utilize a trained CLIP model (ViT-B/32) [30] to create frame-level representations in \(\mathcal {R}^{512}\).

As illustrated in Fig. 1c, we propose using these visual features, without introducing any trainable visual encoder, directly as input to a number of individual FC layers for learning the latent feature spaces. However, in order to examine more design choices, we also tested in our ablation experiments the introduction of a visual encoder, similarly to what we do for textual information processing. To this end, we utilized the visual sub-network of the attention-based dual encoding network of [14] (ATV). Following [14], we input all three aforementioned frame-level features to a single ATV encoder, which (similarly to the ATT one) was trained end-to-end as part of our overall network.
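The following sketch illustrates the shot-level feature extraction for the R_152 branch (keyframe sampling at 2 fps is assumed to have been done already). Note that torchvision only ships ImageNet-1k weights, whereas the paper's R_152 extractor is trained on ImageNet-11k, so the checkpoint used here is an approximation.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ResNet-152 with its classifier removed: the penultimate-layer output is the frame feature
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def shot_feature(keyframes):
    """keyframes: list of PIL images sampled from the shot at 2 frames per second."""
    batch = torch.stack([preprocess(im) for im in keyframes])   # (N, 3, 224, 224)
    with torch.no_grad():
        feats = backbone(batch).flatten(1)                      # (N, 2048) frame features
    return feats.mean(dim=0)                                    # (2048,) shot-level vector
```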

4 Experimental Results

4.1 Datasets and Experimental Setup

We evaluate our approach and report experimental results on three datasets: the two official TRECVID AVS datasets (i.e., IACC.3 and V3C1) [1] and MSR-VTT [34]. The AVS datasets are designed explicitly for text-based video retrieval evaluation; they include the definition of tens of textual queries as well as ground-truth associations of multiple positive samples with each query. The IACC.3 dataset consists of 335.944 test videos, and the V3C1 of 1.082.629 videos for testing (most of which are not associated with any of the textual queries). As evaluation measure we use the mean extended inferred average precision (MxinfAP), as proposed in [1] and typically done when working with these datasets. On the other hand, MSR-VTT targets primarily video captioning evaluation, but is also often used for evaluating text-video retrieval methods. It is made of 10.000 videos and each video is annotated with 20 different natural language captions (totaling 200.000 captions, which are generally considered to be unique); for retrieval evaluation experiments, given a caption the goal is to retrieve the single video that is ground-truth-annotated with it. For the MSR-VTT experiments, following the relevant literature, we use as evaluation measures the recall R@k, \(k=1,5,10\), the median rank (Medr) and mean average precision (mAP).

Table 1. Results and comparisons on the IACC.3 and V3C1 datasets. Bold/underline indicates the best-/second-best scores

Regarding the training/testing splits: for the evaluations on the AVS datasets, our cross-modal network (and any network of the literature, i.e. [14, 20], that we re-train for comparison) is trained using a combination of four other large-scale video captioning datasets: MSR-VTT [34], TGIF [21], ActivityNet [3] and Vatex [32]. For validation purposes, during training, we use the Video-to-Text Description dataset of TRECVID 2016. For testing, all sets of queries specified by NIST for IACC.3 (i.e., AVS16, AVS17 and AVS18) and V3C1 (i.e., AVS19, AVS20 and AVS21) are used. For the evaluations on the MSR-VTT dataset, we experimented with two versions of this dataset: MSR-VTT-full [34] and MSR-VTT-1k-A [36]. MSR-VTT-full consists of 6.513 videos for training, 497 for validation and 2.990 videos (thus, \(2.990\times 20\) video-caption pairs) for testing. On the other hand, MSR-VTT-1k-A contains 9.000 videos for training and 1.000 video-caption pairs for testing. For both MSR-VTT versions, we trained our network on the training portion of the dataset and report results on the corresponding testing portion.

Regarding the training conditions and parameters: to train the proposed network (and, again, also for re-training [14, 20]), we adopt the setup of [13], where six configurations of the same architecture with different training parameters are combined. Specifically, each model is trained using two optimizers, i.e., Adam and RMSprop, and three learning rates (\(1\times 10^{-4}\), \(5\times 10^{-5}\), \(1\times 10^{-5}\)). The final results for a given architecture are obtained by combining the six returned ranking lists of the individual configurations in a late fusion scheme, i.e., by averaging the six obtained ranks for each video. For training all configurations, we follow a learning rate decay technique: we reduce the learning rate by \(1\%\) per epoch, or by \(50\%\) if the validation performance does not improve for three epochs. The dropout rate is set to 0.2 to reduce overfitting. Also, following [12], the margin parameter of the loss function is set to \(\alpha =0.2\). All experiments were performed on a single computer equipped with an Nvidia RTX3090 GPU. Our models were implemented and trained using PyTorch 1.11.
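The rank-averaging late fusion step can be sketched as follows; the array shapes and names are illustrative assumptions.

```python
import numpy as np

def late_fusion_by_rank(similarity_lists):
    """similarity_lists: (num_configs, num_videos) similarities produced by the six
    trained configurations for a single query. Videos are re-ranked by the average
    of their per-configuration ranks (lower average rank = better)."""
    sims = np.asarray(similarity_lists)
    ranks = np.argsort(np.argsort(-sims, axis=1), axis=1)   # rank 0 = most similar video
    avg_rank = ranks.mean(axis=0)
    return np.argsort(avg_rank)                             # final ordering of video indices

# example with 6 configurations and 5 videos:
# final_order = late_fusion_by_rank(np.random.rand(6, 5))
```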

Table 2. Results and comparisons between different encoder strategies when using dual softmax-based inference (DS\(_\mathrm{{inf}}\)) similarity. Scores in bold/underline indicate the best-/second-best-performing strategy

4.2 Results and Comparisons

Table 1 presents the results of the proposed T\(\times \)V network, i.e. using three visual features (R_152, Rx_101, clip) followed by FC layers, and clip and ATT as textual encoders, on the IACC.3 and V3C1 datasets, and comparisons with state-of-the-art literature approaches. We compare our method with six methods; the presented results are taken from their original papers. Furthermore, as [4] has shown that the quality of the initial visual features is crucial for a method’s performance, we also present results of re-training the ATT-ATV [14] and SEA [20] networks using the same visual features and training datasets as in our experiments, using their publicly available code. Our proposed network outperforms the competitors on AVS16, AVS19 and AVS21. SEA-clip [4] achieves better results on AVS17, AVS18 and AVS20 by exploiting 3D CNN-based visual features. Comparing the mean performance on AVS16-AVS20 (since SEA-clip does not report results on AVS21), our network achieves a MxinfAP of 0.248 versus 0.240 for SEA-clip.

Table 3. Results and comparisons on the MSR-VTT full and 1k-A datasets. Methods marked with \(^*\) use an alternative training set of 7.010 video-caption samples for the 1k-A dataset, but still report results on the same test portion of 1k-A as all other methods. Bold/underline indicates the best-/second-best scores

In Table 2 we experiment with the dual softmax-based inference (DS\(_\mathrm{{inf}}\)) on IACC.3 and V3C1. We examine the impact of different background query strategies and compare with the proposed network without DS\(_\mathrm{{inf}}\). “DS\(_\mathrm{{inf}}\) on the set of evaluated queries” indicates that the operation was performed using the same year’s queries: for example, to retrieve videos for an AVS16 query, the remaining AVS16 queries are used as background queries. This improves the overall performance but, as already discussed, is not a realistic application scenario. To overcome this problem, we experiment with using a fixed set of background queries: 60 or 200 randomly selected captions extracted from the training datasets. By using these captions, performance improves compared to the proposed network without DS\(_\mathrm{{inf}}\) on every dataset and every year; the difference between using 60 and 200 captions is marginal. Finally, we try using as background queries all AVS queries defined for the same video dataset but not the same test-year as the examined query; for example, to evaluate each of the AVS16 queries, we use the AVS17 and AVS18 queries as background. This strategy achieves the second-best results among all examined ones, being marginally outperformed by the “DS\(_\mathrm{{inf}}\) on the set of evaluated queries” strategy; contrarily to the latter, though, it does not assume knowledge of all the evaluation queries beforehand. For a given query, the retrieval time when using DS\(_\mathrm{{inf}}\) increases (on average) from 0.4 s to 0.6 s on the IACC.3 dataset and from 1.1 s to 1.3 s on V3C1.

Table 4. Comparison of combinations of textual encoders and visual features on the IACC.3 and V3C1 datasets

In Table 3 we present results on the MSR-VTT full and 1k-A datasets and compare with literature methods. Our network outperforms most of them, even methods such as FROZEN [2] and HiT (pre-trained on HowTo100M) [22] that utilize a pre-training step on other large text-video datasets. Moreover, BridgeFormer [15], which uses a pre-training step on the WebVid-2M [2] dataset, marginally outperforms our network in terms of R@1, while our approach achieves better results on the remaining evaluation measures. Finally, we should note that experiments with DS\(_\mathrm{{inf}}\) on MSR-VTT (not shown in Table 3) lead to only marginal differences in relation to the results of the proposed T\(\times \)V network. This is expected because of the nature of MSR-VTT: given a caption, the goal is to retrieve the single video that is ground-truth-annotated with it; thus, re-ordering the entire ranking list by introducing the DS\(_\mathrm{{inf}}\) normalization of the caption-video similarities has limited impact.

4.3 Ablation Study

In this section we study the effectiveness of different combinations of textual encoders and visual features (or visual encoders). We report results using three visual encoding strategies: “feat. concat. + ATV” indicates the early fusion of the three visual features, which are then fed into the trainable ATV sub-network of [14] (as illustrated in the rightmost part of Fig. 1a), followed by the required FC layers that encode the ATV’s output into the corresponding joint feature spaces. Similarly, “feat. concat.” refers to the early fusion of the three visual features (as illustrated in the rightmost part of Fig. 1b), followed by the required FC layers. Finally, “only FCs” refers to the proposed strategy (Fig. 1c), where the visual features are individually and directly encoded into the joint spaces using only FC layers.

Table 5. Comparison of combinations of textual encoders and visual features on the full and 1k-A variations of the MSR-VTT dataset

In Table 4 we report the results on the AVS datasets when using different combinations of textual encoders together with the aforementioned three possible visual encoding strategies. Concerning the visual encoding strategies, the results indicate that the lowest performance is achieved by the “feat. concat. + ATV” strategy, regardless of the choice of textual encoders. When the early fusion of the trained models’ features (“feat. concat.”) is used instead of ATV, the performance consistently increases. The best results are achieved by forwarding the visual features independently through FC layers, as in the proposed approach. Regarding the combinations of textual encoders, we can see that the utilization of fewer but more powerful encoders (i.e., clip and ATT) leads to better results than using a multitude of possibly weak encoders as in [20], regardless of the employed visual encoding strategy. These evaluations show that the way the textual and visual features are combined has a significant impact on the obtained results.

In Table 5 we present the same ablation study for the full and 1k-A variations of the MSR-VTT dataset. On these datasets, we observe a different behavior concerning the visual encoding: while the utilization of a few powerful textual encoders (i.e., clip and ATT) continues to perform best, when it comes to the visual modality the “feat. concat. + ATV” strategy consistently performs best, regardless of the textual encoder choice. This finding, combined with the results of Table 4, shows that for similar yet different problems and datasets there is no universally optimal way of combining the visual features (and one can reasonably assume that this may also hold for the textual ones).

5 Conclusions

In this work, we presented a new network architecture for efficient text-to-video retrieval. We experimentally examined different combinations of visual and textual features, and concluded that selectively combining the textual features into fewer but more powerful textual encoders leads to improved results. Moreover, we showed how a fixed set of background queries extracted from large-scale captioning datasets can be used together with softmax operations at the inference stage to revise query-video similarities, leading to improved video retrieval. Extensive experiments and comparisons on different datasets document the value of our approach.