1 Introduction

With the rapid development of short video applications, video has become one of the most important media forms for people to obtain information, and video–text retrieval has attracted increasing attention. Owing to the success of Transformer [1] and Bert [2] in the natural language processing field, many transformer-based vision-language alignment models have been proposed for cross-modal retrieval [3,4,5,6,7]. Existing approaches can be roughly categorized into global embedding-based [3, 4, 8, 9] and fine-grained interaction-based methods [5, 7, 10,11,12]. Embedding-based methods rely on global contrastive alignment of videos and texts. The embedding learning for the two modalities can be decoupled, and the representations for test data can be pre-computed offline; thus, embedding-based methods are efficient at retrieval inference. However, these methods only model coarse cross-modal interaction via the similarity of the global representations of video and text.

To explore fine-grained interaction between heterogeneous data, many studies [7, 11, 13,14,15,16,17] have been proposed. Most of them [7, 11, 13,14,15] feed visual and linguistic elements (usually patches from an image and words from a sentence) simultaneously into a transformer-based network for cross-modal interaction learning, which allows visual and linguistic clues to be aligned and aggregated at a fine granularity. At retrieval inference, each video–text pair must be fed into the network to compute its relevance score. In addition, late interaction architectures first compute the similarities between video and text tokens and then take the summation [16] or mean [17] of the token-wise similarities as the relevance score of the video and text. These methods suffer from inferior efficiency at inference.

In this paper, we aim to achieve global embedding-based inference while equipping the feature embeddings with the characteristics of cross-modal fine-grained interaction. Inspired by knowledge distillation [18,19,20,21], we propose a cross-granularity self-distillation method that distills the token-wise fine-grained similarity of video and text into the coarse-grained similarity based on global embeddings. The fine-grained cross-modal similarity is regarded as a soft label to guide the learning of global embeddings, so the global features are effectively enforced to acquire the behavior of fine-grained interaction. In the retrieval stage, we use the global embeddings of video and text for similarity computation and ranking to achieve efficient retrieval.

Fig. 1

The framework of SPSD. We propose a similarity-preserving self-distillation method to align video and text. The outputs of different layers of the visual and text encoding modules are used to compute feature-level and semantic-level losses, respectively. To obtain the fine-grained similarity through token-wise interaction, we design a token screening module that selects important tokens for the video and text modalities. The fine-grained similarity is then distilled into the coarse-grained similarity based on the global embeddings of video and text. This operation is performed at both the feature level and the semantic level, forming a hierarchical cross-granularity self-distillation loss. In addition, a cross-layer self-distillation loss is proposed by distilling the semantic-level similarity into the feature-level similarity. The learned semantic representations of videos and texts are then used to compute distances and rank candidates for cross-modal retrieval

Owing to the attention allocation characteristics of different transformer layers, the features in different layers focus on different views [4, 7, 22,23,24,25]. For example, local syntax is encoded at the lower layers and longer-range semantics at the upper layers [26]. A recent vision-language learning method [4] explored hierarchical features by adding feature-level (first layer) and semantic-level (last layer) contrastive losses to learn transformer-based encoders. We argue that discriminating the binary relationship (similar or dissimilar) between cross-modal data in contrastive learning may be too strict and difficult for low-level features. To alleviate this problem and further exploit hierarchical features, we propose a cross-layer self-distillation method that regards the semantic-level similarity between video and text as a soft label and distills it into the cross-modal similarity based on low-level features. In this way, the model can learn similarity-oriented low-level features for cross-modal retrieval.

In this paper, we propose the similarity-preserving self-distillation (SPSD) method for video and text alignment with cross-granularity and cross-layer self-distillation. Figure 1 shows the framework. Two transformer-based encoding modules extract video and text features. The global embeddings of video and text are used to compute the coarse-grained similarity. Meanwhile, the token features of video and text are used to obtain the token-wise fine-grained similarity in a late-interaction manner. Specifically, we design a token screening module to adaptively select important tokens for fine-grained similarity computation. To exploit the hierarchical capacity of the transformer encoders, we perform cross-granularity self-distillation with both semantic-level and feature-level representations. The cross-granularity and cross-layer self-distillation losses are all based on KL divergence. Together with the self-distillation losses, we employ InfoNCE [27] to construct contrastive losses with hierarchical features for training the model.

The cross-granularity and cross-layer self-distillation both generate distillation signals through the network itself to help the video and text encoders learn better. They are applied only in the training stage, so they cause no additional computational overhead at retrieval inference. Experiments on three public datasets show the effectiveness of SPSD.

2 Related work

2.1 Cross-modal interaction learning

Existing approaches for cross-modal retrieval generally address fine-grained interaction between video and text in two ways: feeding video and text together into a single-stream network [7, 10,11,12,13,14,15, 28,29,30] or modeling the interplay with a dual-stream network [5, 6, 16, 17, 31,32,33,34,35]. Our method is based on a dual-stream network. SCAN [32] discovers latent alignments using both image regions and words in a sentence as context and infers image–text similarity. T2VLAD [31] aggregates multi-modal video sequences and text features with a set of shared semantic centers, and local cross-modal similarities are then computed between the video and text features within the same center. MMT [5] computes the video–caption similarity as a weighted sum of each expert's video–caption similarity. FILIP [17] achieves a cross-modal late interaction mechanism with token-wise maximum similarity between visual and textual tokens. In CRET [33], the text and video embeddings are aligned by learned transformer decoder centers. In the recent CMMT model [34], each raw video defines a pseudo-video class, and a cross-modal fine-grained classification task is conducted in which text queries are classified with pseudo-video class prototypes. X-pool [35] utilizes scaled dot-product attention for a text to attend to its most semantically similar frames, and an aggregated video representation is then generated conditioned on the text's attention weights over the frames. Jin et al. [36] used a coarse-fine-grained parallel attention model and a feature fusion module to learn effective video feature representations for the video–text retrieval task.

Different from all these methods, we make the similarity of global representations acquire the fine-grained interaction characteristic via self-distillation learning. The works most related to ours are FILIP [17] and MMT [5]. We adopt the same expert features as MMT and use the token-wise fine-grained similarity proposed by FILIP as the teacher in our cross-granularity self-distillation learning.

2.2 Hierarchical alignment

Many studies have investigated features from different levels of deep networks [22,23,24,25,26] for cross-modal alignment [4, 6, 7, 37, 38], since deep architectures learn representations that vary with network depth, from local syntax encoded at the lower layers to longer-range semantics at the upper layers. COOT [38] aligns representations at three levels: frame–word, clip–sentence and video–paragraph. TACo [6] constructs a hierarchical contrastive loss including token-level and sentence-level losses on the outputs of the individual video and text encoders before multi-modal fusion, and another sentence-level loss after the multi-modal fusion network. CrossCLR [37] also utilizes a two-level hierarchy of transformers, where the loss is applied at the clip–sentence level and the video–paragraph level. HiT [4] adds feature-level and semantic-level contrastive losses to learn transformer-based video and text encoders. Ji et al. [39] proposed a step-wise hierarchical alignment network (SHAN) that decomposes image–text matching into a multi-step cross-modal reasoning process including local-to-local alignment at the fragment level, and global-to-local and global-to-global alignment at the context level. Jiang et al. [40] explored multi-level cross-modal relationships among video–sentence, clip–phrase, and frame–word for text–video retrieval based on the pre-trained CLIP.

All these methods utilize features from different layers to construct cross-modal correlations for learning multi-modal encoders. Beyond such hierarchical correlation, we propose a cross-layer self-distillation scheme to take advantage of hierarchical features, i.e., the semantic-level similarity based on the last output of the transformer encoders is transferred to low-level feature learning.

2.3 Knowledge distillation

Knowledge distillation [41] was proposed to transfer the activations of individual example representations from a large teacher network to a small student network. Several studies have shown that transferring the mutual similarity instead of the actual representation is beneficial to student representation learning [19, 20, 42,43,44]. Park et al. [43] transferred relational information from teacher to student with distance-wise and angle-wise distillation losses. Tung et al. [19] guided the training of a student network such that input pairs that produce similar (dissimilar) activations in the teacher network also produce similar (dissimilar) activations in the student network. Zhu et al. [20] selected a neighbor example from the teacher space as an anchor and encouraged the anchor–student relation to be consistent with the anchor–teacher relation. Tian et al. [44] encouraged the teacher and student to map the same input to close representations and different inputs to distant representations. Li et al. [45] explored the merit of the student model at each time step to guide the training process of the teacher model.

Another line of work is self-knowledge distillation, which distills knowledge within the network itself [18, 21, 46]. Zhang et al. [18] distilled the classifier representations in the deeper portion of a CNN into the shallower ones. Hou et al. [46] exploited the activation-based CNN attention maps of the network's own layers as distillation targets for its lower layers. Ji et al. [21] introduced an auxiliary self-teacher network to enable the transfer of refined knowledge to the classifier network. Different from all these methods, our SPSD transfers the fine-grained similarity relationship between video and text to the coarse-grained similarity based on global features, which operates within a transformer layer, and transfers the similarity of high-level video and text features to that of low-level features, which operates across two transformer layers.

2.4 Others

Recently, many CLIP-pre-trained models and contrastive learning techniques have been studied for video–text retrieval. CLIP4Clip [47] transfers the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. CLIP-ViP [48] utilizes a video proxy mechanism to transfer the CLIP model to the video domain and introduces an omnisource cross-modal learning method to reduce the domain gap between pre-training data and downstream data. The cross-modal adapter [49] enables parameter-efficient fine-tuning with a few parameterization layers. Huang et al. [50] proposed a text–video cooperative prompt tuning model to efficiently tune the pre-trained CLIP for text–video cross-modal retrieval. Wang et al. [51] proposed a diversity-sensitive contrastive learning loss with adaptive negative-pair weighting to capture fine-grained discrepancies among negative pairs. Better pre-trained models and contrastive losses achieve better performance on cross-modal retrieval; nevertheless, they are not the focus of this paper. For the sake of fairness, we compare with models that, like MMT [5], use expert features for video, to validate our proposed cross-granularity and cross-layer self-distillation method.

3 Method

As shown in Fig. 1, our model has several key components: cross-granularity self-distillation, a token screening module and cross-layer self-distillation. We first introduce the preliminaries and then describe each of these components.

3.1 Preliminaries

3.1.1 Video encoding module

In our method, the video encoder is implemented as a stack of 4 self-attention layers and fully connected layers, following the transformer encoder architecture presented in [1, 5]. Inspired by recent work [4, 5, 9, 37], several expert features are first extracted for each video using pre-trained models, such as motion features from S3D trained on Kinetics, audio features from a VGGish model trained on YT8M and appearance features from SENet-154 trained on ImageNet. The input of the video encoder contains the expert features, embeddings of the expert type and embeddings of the time in the video when each feature was extracted [5]. The expert features are first projected to the same dimension of 512 by fully connected layers followed by \(L_2\) normalization.

For a video v, the n-th-type expert feature at time k is denoted as \(F_k^n\), where \(n\in [1, N]\) and \(N =7\) is the total number of expert types, as in [5]. The global feature for each expert type is obtained by max pooling over all time steps. The expert features are then arranged as,

$$\begin{aligned} F_v = [F_{agg}^1, F_1^1,\ldots , F_K^1,\ldots , F_{agg}^N, F_1^N,\ldots ,F_K^N]. \end{aligned}$$
(1)

To distinguish the different expert types and the time at which each feature was extracted, 512-dimensional embeddings of expert type and temporal information are learned as video encoder inputs. They are denoted as,

$$\begin{aligned} E_v&= [E^1, E^1,\ldots , E^1,\ldots , E^N, E^N,\ldots ,E^N], \end{aligned}$$
(2)
$$\begin{aligned} T_v&= [T_{\text {agg}}, T_1,\ldots , T_D,\ldots , T_{\text {agg}}, T_1,\ldots , T_D]. \end{aligned}$$
(3)

The summation of \(F_v\), \(E_v\) and \(T_v\) is fed into a 4-layer transformer-based video encoder to learn the video token representations.
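For concreteness, the following PyTorch-style sketch illustrates how such a video encoder input could be assembled. The module and argument names (e.g., VideoEncoder, expert_dims, max_time) are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of the video encoding module described above (assumed names/shapes).
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    def __init__(self, expert_dims, num_experts=7, max_time=30, d_model=512,
                 num_layers=4, num_heads=8):
        super().__init__()
        # Project each expert feature to a common 512-d space (followed by L2 normalization).
        self.proj = nn.ModuleList([nn.Linear(dim, d_model) for dim in expert_dims])
        # Learned expert-type embeddings E^n and temporal embeddings T (index 0 plays the role of T_agg).
        self.expert_emb = nn.Embedding(num_experts, d_model)
        self.time_emb = nn.Embedding(max_time + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, expert_feats):
        # expert_feats: list of N tensors, each of shape (batch, K, dim_n) for one expert type,
        # with K <= max_time time steps.
        tokens, type_ids, time_ids = [], [], []
        for n, feats in enumerate(expert_feats):
            f = nn.functional.normalize(self.proj[n](feats), dim=-1)   # (B, K, d_model)
            f_agg = f.max(dim=1, keepdim=True).values                  # max pooling over time
            seq = torch.cat([f_agg, f], dim=1)                         # [F_agg^n, F_1^n, ..., F_K^n]
            tokens.append(seq)
            B, L = seq.shape[:2]
            type_ids.append(torch.full((B, L), n, dtype=torch.long, device=feats.device))
            time_ids.append(torch.arange(L, device=feats.device).expand(B, L))
        F_v = torch.cat(tokens, dim=1)
        E_v = self.expert_emb(torch.cat(type_ids, dim=1))
        T_v = self.time_emb(torch.cat(time_ids, dim=1))
        return self.encoder(F_v + E_v + T_v)                           # video token representations
```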

3.1.2 Text encoding module

We employ a pre-trained Bert-base-uncased model [2] as the text encoder and fine-tune it. Each word in a text t is embedded into a vector, forming the token embeddings \(F_t\). [CLS] and [END] tokens are placed at the first and last positions. The text segment mask \(M_t\) indicates the segment id of the input sequence, which carries no information in our method since only one text is processed at a time. The text position embedding \(P_t\) encodes the index of each word in the text sequence. The final input to the text encoder is the sum of \(F_t\), \(M_t\) and \(P_t\), which is fed into the pre-trained Bert model to obtain the text token representations.
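A minimal sketch of this text encoding step, assuming the HuggingFace transformers library, could look as follows; the variable names are illustrative, and the first-layer and last-layer outputs correspond to the feature-level and semantic-level representations used later.

```python
# Minimal sketch of the text encoding module, assuming HuggingFace `transformers`.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")  # fine-tuned during training

texts = ["a dog is barking at women"]
batch = tokenizer(texts, padding=True, return_tensors="pt")
# BERT internally sums the token, segment (token_type) and position embeddings,
# corresponding to F_t + M_t + P_t above.
outputs = text_encoder(**batch, output_hidden_states=True)
text_tokens_last = outputs.last_hidden_state    # semantic-level token features (last layer)
text_tokens_first = outputs.hidden_states[1]    # output of the first transformer layer
```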

3.1.3 Token aggregation

For the video modality, we apply mean pooling over all output tokens of the video encoding module to obtain the global video representation and use a linear fully connected layer to project it to the same dimension \(d_{\text {rep}}\) as the text data.

For the text modality, the text representation is obtained by applying mean pooling over all word tokens output by the text encoding module. A linear fully connected layer then projects the text representation to the same dimension \(d_{\text {rep}}\) as the video data.

Then, a shared linear layer projects video and text representations into a d-dimensional common space. It should be noted that the video expert models are fixed, the parameters of the text encoding module are fine-tuned, and the other parameters are learned from scratch during training.
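The token aggregation and shared projection could be sketched as follows, assuming the dimensions \(d_{rep} = 512\) and \(d = 1024\) reported in Sect. 4; the module names are illustrative.

```python
# Minimal sketch of token aggregation and projection into the common space (assumed names).
import torch
import torch.nn as nn

d_rep, d = 512, 1024
video_proj = nn.Linear(512, d_rep)   # modality-specific projection for video tokens (d_model=512)
text_proj = nn.Linear(768, d_rep)    # BERT-base hidden size is 768
shared_proj = nn.Linear(d_rep, d)    # shared layer into the d-dimensional common space

def aggregate(tokens, proj):
    # tokens: (batch, num_tokens, dim) output of an encoding module
    g = tokens.mean(dim=1)            # mean pooling over tokens
    return shared_proj(proj(g))       # global embedding in the common space

# r_v = aggregate(video_tokens, video_proj); r_t = aggregate(text_tokens, text_proj)
```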

3.1.4 Contrastive loss

For the video–text retrieval task, the goal of our method is to obtain global visual and textual embeddings by learning the model parameters. We employ a contrastive loss [6, 27, 37] to make paired videos and texts similar and unmatched samples dissimilar. Given a mini-batch of N video–text pairs \(B=\{v_n, t_n\}_{n=1}^N\), where \(v_n\) and \(t_n\) are a video and its paired text description, we obtain their common-space embeddings from the video and text encoders as \(\{r_n^v\}_{n=1}^N\) and \(\{r_n^t\}_{n=1}^N\), respectively. All pairs \(\{v_i, t_j\}\) with \(i \ne j\) are regarded as negative pairs. The similarity matrix \(S\in R^{N\times N}\) over the mini-batch is computed by the inner product of the embeddings, that is,

$$\begin{aligned} S_{i,j}=(r_i^v)^T {r_j^t}, \end{aligned}$$
(4)

which is the similarity of the i-th video and j-th text. The contrastive loss InfoNCE [27] for video–text retrieval on a mini-batch is,

$$\begin{aligned} L_{c}^{v2t} = -\frac{1}{N} \sum _{i=1}^N \log {\frac{\exp (S_{i,i} / \tau )}{\sum _{j=1}^N \exp (S_{i, j} / \tau )}}, \end{aligned}$$
(5)

where \(\tau \) is a temperature hyper-parameter [52]. Similarly, the loss for text–video retrieval on a mini-batch is,

$$\begin{aligned} L_{c}^{t2v} = -\frac{1}{N} \sum _{j=1}^N \log {\frac{\exp (S_{j,j} / \tau )}{\sum _{i=1}^N \exp (S_{i, j} / \tau )}}. \end{aligned}$$
(6)

The two losses are combined as,

$$\begin{aligned} L_{c} = \frac{1}{2}(L_c^{v2t} + L_c^{t2v}). \end{aligned}$$
(7)

By optimizing the contrastive loss, the similarities between paired video and text embeddings in a mini-batch are maximized and those of unmatched sample embeddings are minimized.
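A minimal sketch of this symmetric InfoNCE loss (Eqs. (4)–(7)) is given below; tensor names are illustrative.

```python
# Minimal sketch of the symmetric contrastive loss in Eqs. (4)-(7).
import torch
import torch.nn.functional as F

def contrastive_loss(r_v, r_t, tau=0.07):
    # r_v, r_t: (N, d) global video / text embeddings of a mini-batch
    S = r_v @ r_t.t()                                   # Eq. (4): similarity matrix
    labels = torch.arange(S.size(0), device=S.device)   # matched pairs lie on the diagonal
    loss_v2t = F.cross_entropy(S / tau, labels)         # Eq. (5)
    loss_t2v = F.cross_entropy(S.t() / tau, labels)     # Eq. (6)
    return 0.5 * (loss_v2t + loss_t2v)                  # Eq. (7)
```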

3.2 Cross-granularity self-distillation

The output of every layer of the video and text transformer encoders consists of token features. For video, the output tokens carry the information of an expert feature at a certain time; for text, they carry the information of each word in the sentence. We employ token-wise late interaction [16, 17] to obtain the fine-grained cross-modal similarity values. To guarantee the effectiveness of late interaction, we propose a token screening module to select important tokens for video and text alignment. At the same time, the coarse-grained similarity between video and text is obtained from the inner product of the global embeddings, as in Eq. (4). In the training stage, the fine-grained similarity is used as a soft label to guide the coarse-grained similarity via knowledge distillation. In the retrieval stage, cross-modal matching depends only on the inner product of global embeddings. Through this cross-granularity self-distillation, fine-grained-aware global representations of video and text are learned for retrieval.

3.2.1 Token screening module

In [16, 17], all tokens participate in the fine-grained similarity computation, which is time-consuming and sensitive to noisy tokens. For example, given the text "a dog is barking at women," the three words "dog," "barking" and "women" are obviously critical to cross-modal matching and should therefore be focused on rather than the others. Insignificant tokens hinder the reliability of cross-modal alignment. Therefore, we propose token screening networks for video and text, respectively, to adaptively determine which tokens participate in cross-modal fine-grained interaction according to the token features from the video and text encoders.

Fig. 2

Illustration of the token screening module. The top-k token features are adaptively selected according to the input token features themselves

Figure 2 shows the structure of the proposed token screening module. We denote the output of a transformer encoder (before mean pooling) as \(X = \{x_i \vert i\in [1, n]\}\), which could be video or text tokens, where n is the number of tokens; the figure shows 5 tokens as an example. The n token vectors are successively fed into a linear layer, ReLU, another linear layer and a softmax layer to obtain n normalized probabilities, which are regarded as the tokens' importance scores. Denoting the token screening ratio as r, the number of selected tokens is \(k = \lfloor n\times r \rfloor \), where \(0 \le r \le 1\) and \(\lfloor \cdot \rfloor \) denotes the floor operation. The token screening module selects the k most important tokens according to these adaptive probabilities, and the selected token features are used to compute the fine-grained cross-modal similarity.
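A possible implementation of the token screening module is sketched below; the hidden dimension and module names are assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch of the token screening module in Fig. 2 (assumed names/sizes).
import torch
import torch.nn as nn

class TokenScreening(nn.Module):
    def __init__(self, dim, hidden_dim=256, ratio=0.5):
        super().__init__()
        # linear -> ReLU -> linear, followed by softmax over the tokens
        self.scorer = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))
        self.ratio = ratio

    def forward(self, X):
        # X: (batch, n, dim) token features from a transformer encoder
        probs = torch.softmax(self.scorer(X).squeeze(-1), dim=-1)   # importance per token
        k = max(1, int(X.size(1) * self.ratio))                     # k = floor(n * r)
        idx = probs.topk(k, dim=-1).indices                         # keep the top-k tokens
        return torch.gather(X, 1, idx.unsqueeze(-1).expand(-1, -1, X.size(-1)))
```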

3.2.2 Fine-grained similarity matrix

After the token screening module, the selected token features of video and text are respectively projected to the same dimension \(d_{rep}\) by a linear layer. They are then jointly projected into a d-dimensional common subspace by a shared linear layer, in the same way as the global embeddings. The token features of the i-th video and the j-th text are denoted as \(R_i^v \in R^{n_1\times d}\) and \(R_j^t \in R^{n_2\times d}\), respectively, where \(n_1\) and \(n_2\) are the numbers of tokens selected by the video and text token screening modules. Following [17], for a visual token \([R_i^v]_k\), its similarity with the text is the largest similarity between this token and all textual tokens \([R_j^t]_{r=1}^{n_2}\). The token-wise fine-grained similarity between the video and the text is the average over all video tokens. The similarity of the i-th video to the j-th text is then formulated as,

$$\begin{aligned} FS_{i,j}^{v2t} = \frac{1}{n_1} \sum _{k=1}^{n_1} {[R_i^v]_k}^T [R_j^t]_{m_k^v}, \end{aligned}$$
(8)

where \(m_k^v = \arg \max _{1 \le r \le n_2} {[R_i^v]_k}^T [R_j^t]_r\). Similarly, the similarity of the j-th text to the i-th video is,

$$\begin{aligned} FS_{i,j}^{t2v} = \frac{1}{n_2} \sum _{k=1}^{n_2} {[R_i^v]_{{m_k^t}}}^T [R_j^t]_k, \end{aligned}$$
(9)

where \(m_k^t = \arg \max _{1 \le r \le n_1}{[R_i^v]_r}^T [R_j^t]_k\). In this way, we obtain the fine-grained similarity matrices \(FS^{v2t} \in R^{N\times N}\) and \(FS^{t2v} \in R^{N\times N}\) for video–text and text–video retrieval over a batch of cross-modal samples. Note that in general \(FS^{v2t} \ne FS^{t2v}\).
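The token-wise similarities of Eqs. (8) and (9) can be computed in a batched manner as sketched below, following the FILIP-style late interaction; tensor names are illustrative.

```python
# Minimal sketch of the fine-grained similarity matrices in Eqs. (8)-(9).
import torch

def fine_grained_similarity(Rv, Rt):
    # Rv: (N, n1, d) selected video tokens; Rt: (N, n2, d) selected text tokens
    # token-token similarities for every video-text pair: (N, N, n1, n2)
    sim = torch.einsum('ikd,jrd->ijkr', Rv, Rt)
    FS_v2t = sim.max(dim=-1).values.mean(dim=-1)   # Eq. (8): max over text tokens, mean over video tokens
    FS_t2v = sim.max(dim=-2).values.mean(dim=-1)   # Eq. (9): max over video tokens, mean over text tokens
    return FS_v2t, FS_t2v                          # both (N, N); in general FS_v2t != FS_t2v
```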

3.2.3 Cross-granularity loss

In retrieval based on the token-wise interaction method [17], all token features need to be stored and the fine-grained similarity is computed as above. We want retrieval to be carried out with the inner product of single vectors while retaining the effectiveness of fine-grained cross-modal alignment. To this end, we propose a novel similarity-preserving self-distillation approach: the fine-grained similarity matrix is regarded as the teacher and the coarse-grained similarity matrix as the student, and knowledge is transferred from teacher to student by minimizing their difference under the KL divergence. Given two distributions \(P=\{p_i \mid i\in [1,m]\}\) and \(Q=\{q_i \mid i\in [1,m]\}\), the KL divergence is formulated as follows,

$$\begin{aligned} D_{KL}[P\vert \vert Q]=\sum _{i=1}^m p_i[\log (p_i)-\log (q_i)]. \end{aligned}$$
(10)

The student similarity matrix for a batch is S, as computed in Eq. (4). The teacher similarity matrices for video–text and text–video retrieval are \(FS^{v2t}\) and \(FS^{t2v}\), respectively, as given in Eqs. (8) and (9). The cross-granularity self-distillation loss for video–text retrieval is defined as,

$$\begin{aligned} L_{cg}^{v2t} = \frac{1}{N} \sum _{i=1}^N D_{KL}[s(FS^{v2t}_i / \tau ) \vert \vert s(S_i)], \end{aligned}$$
(11)

where \(FS^{v2t}_i\) and \(S_i\) are the i-th rows of the respective similarity matrices, i.e., the similarity values between the i-th video and all texts, \(\tau \) is the temperature scaling parameter, and \(s(\cdot )\) denotes the softmax operation used to normalize each row of a similarity matrix. Similarly, the distillation loss for text–video retrieval is defined as,

$$\begin{aligned} L_{cg}^{t2v} = \frac{1}{N} \sum _{i=1}^N D_{KL}[s(FS^{t2v}_i / \tau )\vert \vert s(S_{:,i})], \end{aligned}$$
(12)

where \(FS^{t2v}_i\) is the i-th row of the fine-grained similarity matrix and \(S_{:, i}\) is the i-th column of the coarse-grained similarity matrix, which contains the similarities between all videos and the i-th text. The whole cross-granularity self-distillation loss is then formulated as,

$$\begin{aligned} L_{cg} = L_{cg}^{v2t} + L_{cg}^{t2v}. \end{aligned}$$
(13)

By optimizing \(L_{cg}\), the student coarse-grained similarity is kept consistent with the teacher fine-grained similarity. Thus, the global embeddings used for coarse-grained video–text alignment can learn the fine-grained interaction through similarity-preserving self-distillation. Since the similarity computations at both granularities are based on representations from the same transformer layer, the cross-granularity loss is an intra-layer self-distillation.
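A minimal sketch of the cross-granularity self-distillation loss (Eqs. (11)–(13)) is given below. The temperature placement follows Eqs. (14)–(15), and detaching the teacher similarity is an implementation assumption rather than something stated in the equations.

```python
# Minimal sketch of the cross-granularity self-distillation loss in Eqs. (11)-(13).
import torch
import torch.nn.functional as F

def kl_distill(teacher_sim, student_sim, tau=0.07):
    # each row of teacher_sim / student_sim is a similarity distribution over the batch
    teacher = F.softmax(teacher_sim.detach() / tau, dim=-1)    # soft labels (assumed: no gradient to teacher)
    student = F.log_softmax(student_sim, dim=-1)
    return F.kl_div(student, teacher, reduction='batchmean')   # averages the KL over rows

def cross_granularity_loss(FS_v2t, FS_t2v, S, tau=0.07):
    loss_v2t = kl_distill(FS_v2t, S, tau)        # Eq. (11): rows of S (one video vs. all texts)
    loss_t2v = kl_distill(FS_t2v, S.t(), tau)    # Eq. (12): columns of S (all videos vs. one text)
    return loss_v2t + loss_t2v                   # Eq. (13)
```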

3.3 Cross-layer self-distillation

Different layers of a deep network usually focus on features with different degrees of abstraction [4, 6, 37, 38]. For example, low-level layers tend to encode local visual content and basic syntax, while high-level layers tend to capture more complex semantics and more abstract representations. In other words, high-level features are more appropriate for semantic tasks than low-level features, and it is too difficult for low-level features to satisfy strict pairwise judgments. In this paper, we propose cross-layer self-distillation to explore hierarchical features, using the semantic-layer similarity to provide soft labels for feature-layer alignment.

Specifically, we employ the last-layer representations of the video and text encoding modules to compute the teacher similarity, and the similarity of the first-layer representations serves as the student. The computation follows Eq. (4). For a mini-batch B, the teacher similarity matrix is denoted as \(S^h\) and the student similarity matrix as \(S^l\). The distillation loss based on the KL divergence for video–text retrieval is as follows,

$$\begin{aligned} L_{cl}^{v2t}=\frac{1}{N} \sum _{i=1}^N D_{KL} [s(S^h_i/ \tau ) \vert \vert s(S^l_i)], \end{aligned}$$
(14)

where \(S^h_i\) and \(S^l_i\) are the i-th rows of the respective similarity matrices, representing the similarities between the i-th video and all texts. The distillation loss for text–video retrieval is defined as,

$$\begin{aligned} L_{cl}^{t2v}=\frac{1}{N} \sum _{i=1}^N D_{KL} [s(S^h_{:,i}/ \tau ) \vert \vert s(S^l_{:,i})], \end{aligned}$$
(15)

where \(S^h_{:,i}\) and \(S^l_{:,i}\) are the i-th columns of the respective matrices, representing the similarities between all videos and the i-th text. The whole cross-layer self-distillation loss is then formulated as,

$$\begin{aligned} L_{cl} = L_{cl}^{v2t} + L_{cl}^{t2v}. \end{aligned}$$
(16)

By optimizing this loss, the student similarity matrix is kept consistent with the teacher. That is, the semantic-layer relationship provides soft labels (similarities) for low-level feature alignment, which makes the learned hierarchical features more suitable for video–text retrieval.
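The cross-layer loss (Eqs. (14)–(16)) has the same KL form and can reuse the kl_distill helper from the previous sketch; the matrices below are the last-layer and first-layer similarity matrices of a mini-batch.

```python
# Minimal sketch of the cross-layer self-distillation loss in Eqs. (14)-(16),
# reusing kl_distill from the cross-granularity sketch above.
def cross_layer_loss(S_high, S_low, tau=0.07):
    # S_high / S_low: (N, N) similarity matrices from the last / first encoder layers
    loss_v2t = kl_distill(S_high, S_low, tau)           # Eq. (14): rows (one video vs. all texts)
    loss_t2v = kl_distill(S_high.t(), S_low.t(), tau)   # Eq. (15): columns (all videos vs. one text)
    return loss_v2t + loss_t2v                          # Eq. (16)
```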

3.4 Objective function

Our objective function consists of three components: a feature-level loss, a semantic-level loss and a cross-layer loss. The third is the cross-layer self-distillation loss \(L_{cl}\) in Eq. (16). The feature-level loss includes two parts, the contrastive loss and the cross-granularity self-distillation loss, computed by Eqs. (7) and (13) on the first-layer outputs of the video and text encoders. Denoting them as \(L_{c}^f\) and \(L_{cg}^f\), the feature-level loss is,

$$\begin{aligned} L_f = L_{c}^f + \lambda L_{cg}^f. \end{aligned}$$
(17)

The semantic-level loss likewise includes two parts, the contrastive loss \(L_{c}^s\) and the cross-granularity self-distillation loss \(L_{cg}^s\), based on the last-layer outputs of the video and text encoders,

$$\begin{aligned} L_s = L_{c}^s + \lambda L_{cg}^s, \end{aligned}$$
(18)

where \(\lambda \) is the trade-off parameter between the contrastive loss and the cross-granularity self-distillation loss. The final objective function is,

$$\begin{aligned} L = L_s + \alpha L_f + \gamma L_{cl}, \end{aligned}$$
(19)

where \(\alpha \) and \(\gamma \) are the trade-off parameters for the feature-level and cross-layer losses relative to the semantic-level loss. By optimizing this loss, SPSD can fully exploit hierarchical features and fine-grained interactions between video and text tokens to align cross-modal data.
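Putting the pieces together, the overall objective could be assembled as follows; the intermediate tensors (e.g., rv_sem, S_feat) are placeholders for the feature-level and semantic-level embeddings and similarity matrices produced by the sketches above, and the hyper-parameter values follow the setting reported in Sect. 4.2.

```python
# Minimal sketch of the overall objective in Eqs. (17)-(19), reusing the loss
# helpers defined in the earlier sketches; all tensors below are placeholders.
lam, alpha, gamma = 30.0, 1.0, 10.0

L_f = contrastive_loss(rv_feat, rt_feat) + lam * cross_granularity_loss(FSf_v2t, FSf_t2v, S_feat)  # Eq. (17)
L_s = contrastive_loss(rv_sem, rt_sem) + lam * cross_granularity_loss(FSs_v2t, FSs_t2v, S_sem)     # Eq. (18)
L = L_s + alpha * L_f + gamma * cross_layer_loss(S_sem, S_feat)                                    # Eq. (19)
```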

4 Experiments

4.1 Datasets and settings

We compare SPSD with state-of-the-art methods on three datasets: MSRVTT [53], LSMDC [54] and ActivityNet Captions [55]. Ablation experiments are conducted on MSRVTT.

The MSRVTT dataset consists of 10,000 videos collected from YouTube with 257 queries. Each video is about 10–30 s long and has 20 manually annotated English sentence descriptions. Following [30], we divide the dataset into a training set of 9000 videos and a test set of 1000 videos.

The LSMDC dataset contains 118,081 short video clips truncated from 202 movies. Each clip is about 4–5 s long and is equipped with a text caption taken from the movie script or an audio description. The test set consists of 1000 videos from movies not present in the training set.

The ActivityNet Captions dataset consists of 20K YouTube videos temporally annotated with sentence descriptions. Following [4], all the descriptions of a video are concatenated to form a paragraph. The training set has 10,009 videos, and we evaluate video–paragraph retrieval on the "val1" split (4917 videos).

Table 1 Comparison with SOTA on MSRVTT (The bold font indicates the best results)

Evaluation metrics include R@1, R@5, R@10, R@50, MedR and Rsum. R@K is the percentage of test queries for which at least one relevant item is found among the top-K retrieved results. MedR measures the median rank of the correct items in the retrieved ranking lists. We also report the sum of all R@K values as Rsum to reflect the overall retrieval performance. Larger R@K and Rsum and smaller MedR indicate better retrieval performance.
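For reference, these metrics can be computed from a query-by-candidate similarity matrix as sketched below (one retrieval direction, ground truth assumed on the diagonal); variable names are illustrative.

```python
# Minimal sketch of computing R@K, MedR and Rsum for one retrieval direction.
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10, 50)):
    # sim: (num_queries, num_candidates); the ground-truth match of query i is candidate i
    order = np.argsort(-sim, axis=1)                     # candidates sorted by descending similarity
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1           # 1-based rank of the correct item
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    metrics["Rsum"] = sum(metrics[f"R@{k}"] for k in ks)
    return metrics
```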

In the training stage, the AdamW [56] optimizer is used with an initial learning rate of \(5\times 10^{-5}\) and a weight decay of 0.01. The learning rate is decayed by a multiplicative factor of 0.95 every epoch, and the network is trained for 60 epochs. The mini-batch size is fixed to 128. As for the hyper-parameters, the dimension of the video and text representations is \(d_{rep} = 512\), the dimension of the shared space for similarity computation is \(d = 1024\), and the temperature \(\tau \) is set to 0.07.
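A minimal sketch of this optimization setup in PyTorch is shown below; model, train_loader and compute_spsd_loss are placeholders for the SPSD network, the data loader and the loss in Eq. (19).

```python
# Minimal sketch of the training configuration described above (placeholder names).
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # decay 0.95 per epoch

for epoch in range(60):
    for batch in train_loader:               # mini-batch size 128
        loss = compute_spsd_loss(batch)      # overall objective of Eq. (19), hypothetical helper
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```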

4.2 Comparison with state-of-the-art methods

For a fair comparison, we compare with similar state-of-the-art methods that also fuse multiple expert features for video: JSFusion [30], CE [9], MMT [5], support-set [8], TACo [6], HiT [4], CrossCLR [37] and Jin [36]. CE and support-set perform retrieval based on global representations. JSFusion, MMT and Jin are fine-grained alignment methods for video and text. TACo, HiT and CrossCLR are hierarchical contrastive learning methods. The reported performances of these methods are taken from their papers.

The results on the MSRVTT dataset are shown in Table 1. SPSD achieves the best performance on all metrics except MedR on the text-to-video task, where it ranks second. Our method achieves \(Rsum = 327.6\%\), which is \(7.4\%\) higher than the second-best HiT with \(Rsum = 320.3\%\). In practical applications, people tend to pay more attention to the top retrieval results; the R@1 of SPSD is \(1.1\%\) and \(1.6\%\) higher than that of the second-best HiT on video-to-text and text-to-video retrieval, respectively. HiT [4] performs hierarchical cross-modal contrastive matching with global features at the feature level and the semantic level. In comparison, the token-wise fine-grained similarity and cross-layer interaction explored for self-distillation learning help our method outperform HiT on most evaluation metrics. Our method also outperforms the other global representation-based methods CE and support-set, the fine-grained alignment methods JSFusion, MMT and Jin, and the hierarchical contrastive learning method TACo. This further proves the effectiveness of aligning video and text with cross-granularity and cross-layer self-distillation losses.

Table 2 Comparison with SOTA on LSMDC (The bold font indicates the best results)
Table 3 Comparison with SOTA on ActivityNet (The bold font indicates the best results)
Table 4 The performances of cross-granularity self-distillation on MSRVTT (The bold font indicates the best results)

On the LSMDC dataset, we only conduct text-to-video retrieval since most compared methods only report results on this task. The performances are shown in Table 2. SPSD achieves \(R@1 = 15.3\%\), \(R@5 = 32.9\%\), \(R@10 = 43.4\%\) and \(MedR = 17.0\), all the best among the compared methods. The Rsum of SPSD is \(2.1\%\) higher than the second-best CrossCLR and \(4.8\%\) higher than the third-best HiT. HiT [4] performs hierarchical cross-modal contrastive matching with global features at the feature level and the semantic level; the token-wise fine-grained similarity and cross-layer interaction explored for self-distillation learning help our method outperform it. CrossCLR [37] utilizes a two-level hierarchy of transformers, where the loss is applied at the clip/sentence level and the video/paragraph level. Since fine-grained interaction and hierarchical features are explored in both CrossCLR and our method, cross-layer learning may be the reason why our model outperforms CrossCLR: with the cross-layer self-distillation loss, the low-level features are improved by the soft labels provided by the semantic-layer features. Our method also outperforms the global representation-based method CE and the fine-grained alignment methods JSFusion, MMT and Jin. This further proves the effectiveness of aligning video and text with cross-granularity and cross-layer self-distillation.

On the ActivityNet dataset, we report R@1, R@5, R@50 and MedR in Table 3. The R@1, R@5 and R@50 of our method are \(26.6\%\), \(59.9\%\) and \(97.0\%\) on video-to-text retrieval, respectively, which are much better than the others. The R@50 of our method is \(96.8\%\) for text-to-video retrieval, better than \(94.7\%\) of the second-best HiT, but its R@1 and R@5 are worse than those of HiT; further optimization of our method is needed for text-to-video retrieval on ActivityNet. The Rsum of our method is \(360.7\%\), better than \(354.7\%\) of the second-best support-set and \(340.9\%\) of the third-best MMT. In terms of Rsum, our method outperforms all methods that report both video-to-text and text-to-video retrieval.

The computational cost of our method's retrieval process depends on the embedding extraction for the query, the embedding dimension and the size of the database. The average retrieval time for a query, measured on CPU, is 0.1447 s on MSRVTT, 0.1992 s on LSMDC and 0.4428 s on ActivityNet. Note that we set the trade-off hyper-parameters of our model to \(\lambda =30, \gamma =10, \alpha =1\) on MSRVTT, LSMDC and ActivityNet for comparison with other methods, and the ratios of the video and text token screening modules are \(r_v = 0.75\) and \(r_t = 0.5\), respectively. This shows that the hyper-parameter setting of our model generalizes across datasets.

Table 5 The performances of token screening module on MSRVTT (The bold font indicates the best results)
Table 6 The performances of hierarchical contrastive loss on MSRVTT (The bold font indicates the best results)
Table 7 The performances of cross-layer self-distillation on MSRVTT (The bold font indicates the best results)
Table 8 The performances of SPSD with different parameter settings on MSRVTT (The bold font indicates the best results)

4.3 Ablation studies

4.3.1 Cross-granularity self-distillation

To evaluate the effectiveness of cross-granularity self-distillation, we adopt the loss in Eq. (18), which consists of the cross-granularity self-distillation and contrastive losses computed only on semantic-level features; the other hierarchical losses are ignored in this experiment, and the token screening ratio r is set to 1. We vary \(\lambda \), the trade-off parameter between the contrastive part and the cross-granularity self-distillation part of the loss; with \(\lambda =0\), only the contrastive loss is used. The results are shown in Table 4. The model with \(\lambda =300\) achieves the best overall performance of \(Rsum = 314.3\%\) over video-to-text and text-to-video retrieval, better than the model with \(\lambda =0\) (\(Rsum=307.9\%\)), and its \(MedR=3\) on video-to-text retrieval improves on the \(MedR=4\) obtained with \(\lambda =0\). This validates the effectiveness of the cross-granularity self-distillation loss: the global representations indeed obtain information from the fine-grained interaction of video and text tokens.

4.3.2 Token screening module

To validate the token screening module, which selects important tokens for computing the fine-grained similarity, we vary the screening ratios on top of the above experiment with \(\lambda \) set to 300. Since there are two token screening modules, one for video and one for text, both \(r_v\) and \(r_t\) are varied; the performances are shown in Table 5. \(r_v = 1, r_t = 1\) corresponds to using all tokens for similarity computation, which yields \(Rsum = 314.3\%\). The model with \(r_v = 0.75, r_t = 0.5\) obtains the best performance of \(Rsum = 318.5\%\). This validates that selecting the \(75\%\) most important visual tokens and \(50\%\) most important textual tokens is optimal for fine-grained similarity computation. In the following experiments, we fix \(r_v = 0.75\) and \(r_t = 0.5\).

4.3.3 Hierarchical contrastive loss

To validate hierarchical contrastive learning, we drop the two similarity-preserving self-distillation losses and vary the trade-off parameter \(\alpha \) between the semantic-level and feature-level contrastive losses. The results are shown in Table 6. When \(\alpha =0\), only the semantic-level contrastive loss is used, giving \(Rsum=307.9\%\). When \(\alpha =1\), the semantic level and feature level are weighted equally, and the model obtains \(Rsum=303.4\%\). The model with \(\alpha = 0.1\) achieves the best overall performance of \(Rsum=309.4\%\). This shows that both the low-level and high-level features of the encoders contribute to retrieval performance, but an improper weight hinders it; relative to the low-level features, the high-level features are more suitable for the retrieval task.

4.3.4 Cross-layer self-distillation

To further validate cross-layer self-distillation, we use the model without the cross-granularity loss and with \(\alpha =0.1\) for the low-level contrastive loss, and vary the hyper-parameter \(\gamma \) of the cross-layer self-distillation loss in Eq. (19). The results in Table 7 show that the model with \(\gamma =3\) obtains the best overall performance of \(Rsum=315.3\%\). Without cross-layer self-distillation, i.e., \(\gamma =0\), the model achieves \(Rsum=309.4\%\). This indicates that it is effective to construct the cross-layer self-distillation loss by using the high-level similarity as a soft label to guide the learning of the cross-modal similarity based on low-level transformer features.

4.3.5 The trade-off parameters

The three components of the overall objective function in Eq. (19) influence each other, so we conduct experiments on the trade-off hyper-parameters of the function. The results in Table 8 show that the model with \(\lambda =30, \gamma =10, \alpha =1\) achieves the best overall performance of \(Rsum = 327.6\%\) and the best \(MedR = 3.0\) on video-to-text retrieval. This confirms that the proposed two kinds of similarity-preserving self-distillation and the hierarchical losses are effective for cross-modal retrieval.

5 Conclusion

In this paper, we introduce a similarity-preserving self-distillation method for fine-grained video–text alignment and hierarchical feature learning. The proposed cross-granularity self-distillation enables the global representations from the video and text encoders to capture fine-grained cross-modal interaction. Cross-layer self-distillation shows that similarity learning based on low-level features benefits from the soft labels provided by the similarity of high-level features. The hierarchical losses, including the hierarchical cross-granularity self-distillation loss, the hierarchical contrastive loss and the cross-layer self-distillation loss, improve the performance of both video-to-text and text-to-video retrieval. Our method achieves outstanding performance on MSRVTT, LSMDC and ActivityNet.