
1 Introduction

Since natural language texts convey richer content than keywords, video retrieval with natural language queries has received increasing attention. Typically, both texts and videos are projected into a latent space, but existing methods still have several limitations. First, most methods do not exploit sufficient inter-frame interactions. HGR [1] obtains video embeddings via a weighted sum and thus ignores richer inter-frame interactions. Second, many works extract features of various aspects but fuse them with simple operations. CE [5] fuses the results of multiple experts by average pooling and a gate unit. Third, most loss functions for video retrieval are not flexible enough. The hinge-based triplet ranking loss [7,8,9] treats all samples equally, ignoring the effect of different samples on optimization. Moreover, most loss functions either focus only on the hardest negative pair or average all negative pairs [10, 12]. The former makes the model sensitive to outliers, while the latter introduces a lot of redundancy.

To address the above limitations, we propose the Multi-Interaction Model (MIM). First, we propose a multi-scale inter-frame interactions module (MSIFI) to encode videos. It is implemented by a well-designed convolutional module that regards each frame feature as a channel and performs 1-D convolution along the feature axis. Through MSIFI, each element of the output embedding is derived from all elements of the input. Second, a fusion method is designed to thoroughly merge features from the MSIFI, bi-GRU, and global branches. It maps the outputs of MSIFI and bi-GRU into different subspaces, lets the features from all subspaces interact with each other, and then combines the result with global features via an adaptive gate unit. Third, we propose an improved loss function. It assigns a weight to each pair with non-linear functions whose values change with the similarity score. Pairs whose similarity scores are far from the optimum receive larger weights and converge faster. Moreover, an adaptive mining strategy is designed to retain informative samples with different weights. The main contributions of this work are as follows:

  • To fully exploit interactions among frames at multiple scales, we propose a novel MSIFI module. It utilizes a well-designed convolution operation to learn more accurate and significant information from multi-scale interactions.

  • We design a novel fusion module to merge different features. Through sufficient interactions among features from multiple latent subspaces, we integrate features of various aspects and get an accurate video representation.

  • Considering the influence of different samples on optimization, we propose an improved loss with a new mining strategy.

  • Extensive experiments on several datasets validate the effectiveness of MIM.

2 Related Work

Frame Aggregations for Video-Text Retrieval. HGR [1] decomposes videos to match with texts at different levels, and JSFusion [4] encodes all frames of a video together with the text and directly predicts the video-text similarity. Neither explores richer inter-frame interactions.

Fusion Methods for Video-Text Retrieval. Dual Encoding [9] concatenates the results of multiple encoders. Howto100m [6] aggregates different features by max pooling and concatenation. CE [5] aggregates various information with a gate unit and average pooling.

Loss Functions for Video-Text Retrieval. Most methods adopt the hinge-based triplet ranking loss [7,8,9] or the bi-directional max-margin ranking loss [5, 6]. However, they treat all samples equally. Circle loss [12] assigns weights to different pairs with a linear function, and Polynomial Loss [10] either considers only the hardest negative sample or averages all negative samples, which is not flexible enough.

Fig. 1.

The architecture of MIM. The video encoder has three branches and the text encoder contains a multi-dimensional attention module. The MSIFI module captures multi-scale interactions among frames, and the fusion module merges the features of the three branches. N is the number of video frames, and the feature dimension is kept unchanged by proper padding. Details are in Sect. 3. \(\textit{WS}\) denotes weighted sum and \(\odot \) is the Hadamard product.

3 Methodology

Given a video v and a text t, our model encodes them into fixed d-dimensional vectors in a common space. We use features extracted by pre-trained CNNs [19,20,21] and BERT [11]. As illustrated in Fig. 1, the video encoder has three branches, whose outputs are denoted as \(\phi _1(v)\), \(\phi _2(v)\), and \(\phi _3(v)\). They are then integrated into \(\zeta (v)\) by the fusion module. The text encoder processes text features with a multi-dimensional attention mechanism to obtain \(\psi (t)\).

3.1 Multi-scale Inter-frame Interactions (MSIFI) Branch

As shown in the upper-left part of Fig. 1, a video is projected into a matrix \(I\in \mathbb {R}^{N\times d}\) by pre-trained CNNs. I is the input of MSIFI and N is the number of frames. Specifically, each element of a frame feature corresponds to a channel of the last layer of the pre-trained CNNs. We treat each frame as one channel of MSIFI and perform the convolution along the feature axis, which effectively combines different channels of the pre-trained CNNs as the convolutional kernels slide. As the number of layers increases, the receptive field is enlarged until it completely covers I in the last layer. In this way, we achieve multi-scale inter-frame interactions and merge significant information from all frames. The outputs are aggregated into \(\phi _1(v)\in \mathbb {R}^{d}\) by max pooling, preserving the most informative features. Each element of \(\phi _1(v)\) is thus derived from interactions among all frames.
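To make the operation concrete, the following is a minimal PyTorch sketch of an MSIFI-style branch. The layer count and kernel sizes follow Sect. 4.1; it assumes frames are padded or sampled to a fixed number N so they can serve as convolution channels, and the output channel width and ReLU activations are our own choices rather than details prescribed by the paper.

```python
import torch
import torch.nn as nn

class MSIFI(nn.Module):
    """Minimal sketch of the multi-scale inter-frame interaction branch.

    Frames are treated as channels and 1-D convolutions slide along the
    feature axis; stacking layers enlarges the receptive field so that the
    last layer covers the whole feature matrix I.
    """

    def __init__(self, num_frames=32, kernel_sizes=(3, 5, 5, 7, 9)):
        super().__init__()
        layers, in_ch = [], num_frames
        for k in kernel_sizes:
            # 'same' padding keeps the feature dimension d unchanged.
            layers += [nn.Conv1d(in_ch, num_frames, kernel_size=k, padding=k // 2),
                       nn.ReLU(inplace=True)]
            in_ch = num_frames
        self.convs = nn.Sequential(*layers)

    def forward(self, frames):           # frames: (B, N, d)
        x = self.convs(frames)           # (B, N, d); each element mixes all frames
        return x.max(dim=1).values       # max pool over frames -> phi_1(v): (B, d)
```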

3.2 Temporal Branch and Global Branch

Since temporal information plays an important role in video encoding, we employ a bi-GRU network to capture it. The input is \(I\in \mathbb {R}^{N\times d}\) and the output is aggregated into \(\phi _2(v)\in \mathbb {R}^{d}\) by max pooling.

To obtain a more comprehensive video embedding, we also extract global features. Since frames in a video differ in significance, we assign weights to them accordingly. Each frame \(v_i\in \mathbb {R}^{d}\) is mapped to a score \(\tau _i\in \mathbb {R}\) by an FC layer. The global embedding of the video is the weighted sum of all frames:

$$\begin{aligned} \phi _3(v)=\sum ^{N}_{i=1}\gamma _{i}v_i ,\quad \gamma _{i}=\frac{\exp (\tau _{i})}{\sum ^{N}_{j=1}\exp (\tau _{j})}, \end{aligned}$$
(1)

where \(\gamma _{i}\in \mathbb {R}\) is the weight of the i-th frame and \(\phi _3(v)\) represents relatively primitive video information.
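As an illustration, a minimal PyTorch sketch of the temporal and global branches is given below. The bi-GRU hidden size of d/2 per direction (so that the concatenated output stays d-dimensional) is an assumption; the FC scoring layer and softmax weighting follow Eq. (1).

```python
import torch
import torch.nn as nn

class TemporalGlobalBranches(nn.Module):
    """Sketch of the bi-GRU temporal branch and the weighted-sum global branch."""

    def __init__(self, d=4096):
        super().__init__()
        self.bigru = nn.GRU(d, d // 2, batch_first=True, bidirectional=True)
        self.score = nn.Linear(d, 1)               # FC layer producing tau_i per frame

    def forward(self, frames):                     # frames: (B, N, d)
        # Temporal branch: bi-GRU states max-pooled over time -> phi_2(v).
        h, _ = self.bigru(frames)                  # (B, N, d)
        phi2 = h.max(dim=1).values                 # (B, d)

        # Global branch: softmax over per-frame scores, weighted sum -> phi_3(v), Eq. (1).
        tau = self.score(frames).squeeze(-1)       # (B, N)
        gamma = torch.softmax(tau, dim=1)          # (B, N)
        phi3 = (gamma.unsqueeze(-1) * frames).sum(dim=1)   # (B, d)
        return phi2, phi3
```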

Fig. 2.

Visualization of the attention of different videos to the K subspaces. Each row denotes the attention to one subspace, and every K rows correspond to one video. We set K = 3. Semantically similar videos have similar attention patterns. The content of the first two videos differs from that of the last two, so their attention patterns differ. This indicates that different subspaces represent different aspects of video features.

3.3 Fusion Module

To fuse the information from the three branches, we conduct another kind of interaction between \(\phi _1(v)\) and \(\phi _2(v)\) and then merge the result with \(\phi _3(v)\). As illustrated in the lower-right part of Fig. 1, we first map \(\phi _1(v)\) and \(\phi _2(v)\) into K subspaces respectively, denoted as \(\{\boldsymbol{h}^{(k)}\}\) and \(\{\boldsymbol{e}^{(k)}\}\), where k indexes the k-th subspace. Different subspaces represent different aspects of video features. Figure 2 shows the representations of several videos in the K subspaces. Semantically similar videos pay similar attention to certain subspaces, while unrelated videos depend on each subspace differently. The representations from all subspaces are then aggregated by weighted sum to obtain \(\mathbf {z_1}\in \mathbb {R}^{d}\) and \(\mathbf {z_2}\in \mathbb {R}^{d}\), which are fused into \(\xi (v)\) by the Hadamard product. The q-th element of \(\xi (v)\) is as follows, where \(\alpha ^{(i)}\in \mathbb {R}\) and \(\beta ^{(j)}\in \mathbb {R}\) are trainable parameters:

$$\begin{aligned} \mathbf {z_1}=\sum _{i=1}^{K}{\alpha ^{(i)}}\boldsymbol{h}^{(i)}, \quad \mathbf {z_2}=\sum _{j=1}^{K}\beta ^{(j)}\boldsymbol{e}^{(j)}, \quad \xi (v)_q=\sum _{i=1}^{K}\alpha ^{(i)}\boldsymbol{h}_q^{(i)}\sum _{j=1}^{K}\beta ^{(j)}\boldsymbol{e}_q^{(j)}. \end{aligned}$$
(2)

It can be seen that the representation from each subspace interacts with the representations from all subspaces of the other branch. As \(\phi _3(v)\) contains global information, an adaptive fusion gate is used to mix \(\xi (v)\) and \(\phi _3(v)\) into \(\zeta (v)\in \mathbb {R}^{d}\):

$$\begin{aligned} \zeta (v)=\mathbf {\lambda }\cdot \xi (v)+(1-\mathbf {\lambda })\cdot \phi _3(v), \quad \lambda =\sigma (\mathbf {FC_1}(\xi (v))), \end{aligned}$$
(3)

where \(\mathbf {\lambda }\in \mathbb {R}^{d}\) denotes the gating weight, \(\mathbf {FC_1}\) represents a fully connected layer, and \(\sigma \) is the sigmoid function.
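A compact sketch of the fusion module, assuming the subspace mappings are linear projections, might look as follows; the scalar subspace weights and the sigmoid fusion gate mirror Eqs. (2) and (3).

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch of the fusion module. Linear projections into the K subspaces
    are an assumption; the weighted sums, Hadamard product, and gate follow
    Eqs. (2) and (3)."""

    def __init__(self, d=4096, K=3):
        super().__init__()
        self.proj_h = nn.ModuleList([nn.Linear(d, d) for _ in range(K)])  # for phi_1
        self.proj_e = nn.ModuleList([nn.Linear(d, d) for _ in range(K)])  # for phi_2
        self.alpha = nn.Parameter(torch.ones(K))    # trainable alpha^(i)
        self.beta = nn.Parameter(torch.ones(K))     # trainable beta^(j)
        self.gate_fc = nn.Linear(d, d)              # FC_1 in Eq. (3)

    def forward(self, phi1, phi2, phi3):            # each: (B, d)
        z1 = sum(a * p(phi1) for a, p in zip(self.alpha, self.proj_h))
        z2 = sum(b * p(phi2) for b, p in zip(self.beta, self.proj_e))
        xi = z1 * z2                                # Hadamard product, Eq. (2)
        lam = torch.sigmoid(self.gate_fc(xi))       # gating weight, Eq. (3)
        return lam * xi + (1.0 - lam) * phi3        # zeta(v)
```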

Fig. 3.

The weight-function curves (left) and their derivative curves (right) for pairs in the loss function. Blue curves correspond to positive pairs and red curves to negative pairs.

3.4 Text Encoder with Multi-dimensional Attention

Inspired by MAGP [14], we believe that different dimensions attend to different properties, so we adopt the text encoder of MAGP. The difference is that we sum the outputs of every two adjacent BERT layers and concatenate the resulting six groups. The multi-dimensional attention module then obtains attention weights for every word and aggregates the word features into a vector \(\psi (t)\in \mathbb {R}^{d}\).
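A rough sketch of this grouping is given below, assuming a 12-layer BERT. The projection to dimension d and the per-dimension softmax over words are our simplifications of the MAGP-style multi-dimensional attention, not its exact formulation.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the text encoder: outputs of every two adjacent BERT layers
    are summed and the six resulting groups are concatenated. The projection
    and the per-dimension attention are simplifying assumptions."""

    def __init__(self, bert_dim=768, num_layers=12, d=4096):
        super().__init__()
        word_dim = bert_dim * num_layers // 2           # 6 groups concatenated
        self.proj = nn.Linear(word_dim, d)              # assumed projection to d
        self.attn = nn.Linear(d, d)                     # per-dimension word scores

    def forward(self, hidden_states):                   # list of 12 tensors (B, L, 768)
        groups = [hidden_states[i] + hidden_states[i + 1]
                  for i in range(0, len(hidden_states), 2)]
        words = self.proj(torch.cat(groups, dim=-1))    # (B, L, d)
        weights = torch.softmax(self.attn(words), dim=1)  # attention over words, per dimension
        return (weights * words).sum(dim=1)             # psi(t): (B, d)
```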

3.5 Video-Text Matching

The similarity score of \(\zeta (v)\) and \(\psi (t)\) is their cosine similarity: \( s_{i,j}=\frac{\zeta (v)_i^{T}\psi (t)_j}{||\zeta (v)_i||\,||\psi (t)_j||}.\) \(s_{i,i}\) is a positive pair and \(s_{i,j}\) with \(i \ne j\) is a negative pair. An adaptive mining strategy is used to retain informative pairs: we select and weight informative pairs while discarding the rest. All negative samples are sorted by similarity score, so that harder samples rank higher. We then keep the top \(\frac{U}{r}\) samples, assign them weights, and aggregate them to obtain the negative-pair representative \(s_{i,neg}\) for the i-th query, where r is a hyper-parameter and U is the batch size:

$$\begin{aligned} s_{i,neg}=\sum ^{\frac{U}{r}}_{j=1,j\ne i} \eta _{j} s_{ij},\quad \eta _{j}=\frac{\exp (s_{ij})}{\sum ^{\frac{U}{r}}_{k=1,k\ne i}\exp (s_{ik})}. \end{aligned}$$
(4)

Our loss function is as follows, where \(\mu _{n}\) and \(\mu _{p}\) are the weight functions of negative and positive pairs, \(\mathrm{\Delta }\) is the margin, \(\left[ \cdot \right] _{+}=\max (\cdot ,0)\), and a, \(b_0\) and \(b_1\) are hyper-parameters:

$$\begin{aligned} L=\log \left[ 1+\sum _{i=1}^{U}\sum _{q=1}^{U}\exp \!\left( \mu _{n}s_{i,neg}-\mu _{p}(s_{q,q}-\mathrm{\Delta })\right) \sum _{j=1}^{U}\sum _{k=1}^{U}\exp \!\left( \mu _{n}s_{neg,j}-\mu _{p}(s_{k,k}-\mathrm{\Delta })\right) \right] , \end{aligned}$$
(5)
$$\begin{aligned} \mu _{p}=\left[ a^{s_{i,i}-\mathrm{\Delta }}\right] _{+} ,\quad \mu _{n}=\left[ b_0^{s_{i,neg}-b_1}\right] _{+}, \end{aligned}$$
(6)
Table 1. Comparison with state-of-the-art methods on the MSR-VTT, TGIF and VATEX datasets.

The curves of the weight functions and their derivatives are shown in Fig. 3. Our loss function has the following property: when the similarity score of a pair is far from its optimum, the pair is more informative, and both the value and the derivative of its weight function are larger. Such a pair therefore receives a larger weight in the loss function and is updated at a faster pace, and vice versa.
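To illustrate the mining strategy and the pair weighting, here is a simplified, single-direction PyTorch sketch using the hyper-parameters reported in Sect. 4.1. The bidirectional sums of Eq. (5) are condensed into one log term, so this is an approximation rather than the exact loss.

```python
import torch

def mim_loss(sim, r=20, a=0.37, delta=0.8, b0=37.0, b1=0.5):
    """Single-direction sketch of adaptive mining (Eq. 4) and pair weighting (Eq. 6).

    `sim` is the (U, U) text-to-video similarity matrix with positives on the diagonal.
    """
    U = sim.size(0)
    pos = sim.diag()                                        # s_{i,i}

    # Adaptive mining: keep the top U/r hardest negatives per query.
    mask = torch.eye(U, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))
    topk = neg.topk(max(U // r, 1), dim=1).values           # hardest negatives
    eta = torch.softmax(topk, dim=1)                        # Eq. (4) weights
    s_neg = (eta * topk).sum(dim=1)                         # s_{i,neg}

    # Non-linear pair weights (Eq. 6).
    mu_p = torch.clamp(a ** (pos - delta), min=0.0)
    mu_n = torch.clamp(b0 ** (s_neg - b1), min=0.0)

    # Log-sum-exp over queries (condensed form of Eq. 5).
    return torch.log1p(torch.exp(mu_n * s_neg - mu_p * (pos - delta)).sum())
```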

4 Experiments

4.1 Experimental Settings

Datasets and Metrics. We conduct experiments on MSR-VTT [15], VATEX [16], and TGIF [17]. We use the official partition of MSR-VTT. For VATEX and TGIF, we follow the experimental setup of HGR [1]. Performance is evaluated with common retrieval metrics, namely R@K (Recall at rank K), MedR (Median Rank), MnR (Mean Rank), and rsum (the sum of all recall scores).

Implementation Details. For MSR-VTT, the visual features are extracted by ResNet-152 and ResNeXt-101 pre-trained on ImageNet [9]. For TGIF and VATEX, we use the pre-trained ResNet-152 visual features and the officially provided I3D [19] visual features, respectively. The MSIFI module has 5 convolutional layers with kernel sizes 3, 5, 5, 7, and 9. The number of subspaces K is 3 and the dimension d is 4096. For the loss function, we choose hyper-parameters by grid search and set \(r=20\), \(a=0.37\), \(\mathrm{\Delta }=0.8\), \(b_0=37\), and \(b_1=0.5\). The model is trained for 20 epochs using the Adam optimizer [18] with a batch size of 64 and a learning rate of 1e−4.
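For reference, these settings can be gathered into a small, purely illustrative configuration; the field names below are hypothetical and not taken from any released code.

```python
# Hypothetical configuration mirroring the settings in Sect. 4.1.
config = dict(
    feature_dim=4096,                  # common embedding dimension d
    msifi_kernels=(3, 5, 5, 7, 9),     # MSIFI convolutional kernel sizes
    num_subspaces=3,                   # K
    batch_size=64,
    epochs=20,
    lr=1e-4,                           # Adam optimizer
    loss=dict(r=20, a=0.37, delta=0.8, b0=37, b1=0.5),
)
```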

Table 2. Ablation studies on the MSR-VTT dataset.

4.2 Comparisons with State-of-the-Arts (SOTAs)

As shown in Table 1, MIM achieves the highest rsum on all datasets, demonstrating its advantages. Specifically, MIM outperforms MAGP; since they share the same text encoder, this proves that our video encoder is more effective. Because the features of VATEX are not frame-level features, it is difficult to implement inter-frame interactions as thoroughly as on MSR-VTT or TGIF, so our performance on VATEX degrades slightly. Nevertheless, our rsum is still the highest, proving the superiority of our fusion module and loss function.

4.3 Ablation Studies

We conduct ablation studies on MSR-VTT, and the results are displayed in Table 2.

Effectiveness of MSIFI. We remove MSIFI and also compare it with a Transformer [3]. To keep a similar number of parameters, we use a 1-layer Transformer with 4096 hidden dimensions and 8 attention heads. The results show that MSIFI is effective.

Effectiveness of Fusion Module. We replace the fusion module with a gate unit and with concatenation, respectively, and rsum decreases by 9.4 and 23.5. This proves that our fusion strategy integrates different features more effectively.

Effectiveness of Loss Function. We compare our loss function with other loss functions, and replace the mining strategy with hard mining and with an averaging operation. The results confirm the superiority of our loss function and mining strategy.

5 Conclusions

This paper introduces a multi-interaction model for video-text retrieval, with an MSIFI branch that captures multi-scale interactions among video frames and a fusion method that exploits the complementary information among different video features. Moreover, an improved loss function and a mining strategy are proposed. Extensive experiments demonstrate the effectiveness of our approach.