1 Introduction

With the rapid development of mobile devices and the Internet, short videos are playing an increasingly important role in modern life. Consequently, Text-to-Video Retrieval (TVR), a typical multi-modal task, has drawn increasing attention [1,2,3,4,5,6]. The aim of this task is to rank videos (or texts) within a collection according to their relevance to a given text (or video) query, enabling users to retrieve their desired video content efficiently and precisely. Over the past few decades, with the ongoing advancement of deep learning, remarkable progress has been made in video retrieval [7,8,9,10,11,12,13]. However, due to the heterogeneity of the text and video modalities, how to reduce the modality gap and improve performance remains an open problem.

Fig. 1

Comparison of our proposed method with previous multi-grained contrastive methods. a The framework of previous methods (multi-grained retrieval), which uses the raw features of the encoders for fine-grained matching. b Our LSECA framework not only semantically enhances the original fine-grained features via LSE (local semantic enhancement) modules, but also takes into account the correspondence between the two modalities through the CA (cross aggregation) interaction module

To narrow this modality gap, a number of effective methods have emerged, among which those based on fine-tuning pre-trained models [14,15,16] have gained widespread attention. CLIP4Clip [17] is a notable example that leverages the robust semantic extraction capabilities of the pre-trained CLIP model [14] to align video with text in a shared feature space, enabling a direct comparison of video and text features. Compared to previous works [1, 3,4,5,6,7,8, 18,19,20], this method yields superior results. However, it focuses on global information while ignoring fine-grained information. To mitigate this problem, several excellent works [21,22,23,24,25] have been proposed that employ frame [26] and word features as local information (as shown in Fig. 1a). For example, X-CLIP [22] excavates local and global representations and leverages a cross-grained, coarse-grained, and fine-grained contrastive learning scheme to further improve retrieval performance. However, because the backbone model (e.g., CLIP [14]) is pre-trained on image-text pairs, the extracted raw frame and word features may not be well suited to the video retrieval task, resulting in suboptimal search results.

Beyond the above problems, the text-video retrieval task requires precise semantic alignment, yet the multi-modal features output by the raw encoders contain a great deal of redundant and noisy information, which poses a serious challenge for cross-modal matching. Specifically, for video, some of the sparsely sampled frames have high content overlap (i.e., redundant information), while a few frames have little semantic relevance to the supplied text description (i.e., noisy information). For the text modality, the given description contains function words (e.g., ‘the’, ‘in’, ‘on’, etc.) that carry less semantic information than entities and verbs. Although these words help in understanding the text, they have little impact on fine-grained matching and may even lead to suboptimal performance. Besides, a video usually corresponds to multiple descriptions in the retrieval datasets [27,28,29], and videos often display more content than text. Therefore, how to conditionally filter frame features and enhance the interaction between the two modalities needs to be addressed.

To solve the above problems, we develop a novel text-to-video retrieval model based on local semantic enhancement and cross aggregation, named LSECA. Previous approaches typically rely on additional multi-modal features [3] or more complex cross-granularity alignment [22] to achieve finer semantic matching. Instead, we use modality-specific information (i.e., the global video feature and keyword features) for local semantic enhancement. The proposed model augments fine-grained semantic representations and facilitates interaction between video and text (as shown in Fig. 1b). Specifically, for the video branch, we use the pooled feature as an anchor to improve the semantics of the fine-grained frame features. Besides, due to the uncertainty of frame content, we design an adapter-aware module to estimate the weight of each fine-grained representation. For the text branch, we first extract the keywords of the corresponding description with the KeyBert [30] model and introduce a fusion strategy between raw word and keyword representations to shift the focus toward semantic content. Local semantic enhancement alone is not enough; interaction between modalities is also required. Thus, from the coarse-grained perspective, we further propose the cross aggregation module, which fuses the frame-level features conditioned on the text to enable interaction between the two modalities. The above modules significantly improve the discrimination of video and text representations. To demonstrate the effectiveness of the proposed LSECA, we conduct extensive experiments on three mainstream text-video retrieval datasets, including MSRVTT [27], MSVD [28], and LSMDC [29]. The results illustrate that our proposed LSECA achieves significant improvements and outperforms several previous state-of-the-art methods.

The main contributions of this work are summarized as follows:

  • We propose a novel framework LSECA for text-video retrieval, which not only enhances the fine-grained video and text representations but also fully considers the interaction between two modalities.

  • For local semantic enhancement, we propose two effective strategies for video and text branches respectively. The cross aggregation module is introduced to achieve sufficient interaction between two modalities.

  • Extensive experiments on three text-video retrieval datasets demonstrate the effectiveness of our method. Our LSECA achieves state-of-the-art performance on MSRVTT (47.1%), MSVD (46.9%), and LSMDC (23.4%).

2 Related work

2.1 Text to video retrieval

With the promotion and popularization of short video applications, accurate video similarity search is becoming increasingly important. The text-video retrieval task aims to find the most relevant video given a text query. However, unlike text-image cross-modal tasks, text-video retrieval needs to consider temporal information, which makes the task more difficult. Some early approaches [3, 4, 6, 7, 11, 18, 31,32,33,34] extracted video and text features using convolutional neural networks or expert models. Although these approaches demonstrated favorable outcomes, their performance is still limited due to the end-to-end optimization issue. With the continuous development of pre-trained models (e.g., CLIP [14], ALIGN [15], CoCa [16], etc.), the paradigm [17] of end-to-end video retrieval by fine-tuning models directly on raw video (or text) has gained a lot of attention. Numerous notable works [17, 21, 22, 35,36,37] utilize the semantic extraction ability of CLIP, learned from 400M image-text pairs, to adapt it to the video retrieval task. CLIP4Clip [17] transfers the knowledge obtained from the CLIP model to video retrieval. By employing contrastive learning to compute similarity scores, it achieves good performance and establishes a strong baseline for future research. Based on CLIP4Clip, CLIP2Video [36] proposes a temporal difference block and a temporal alignment block to enhance the optimization of video and text representations. However, the above methods only use global features for contrastive learning, ignoring fine-grained semantic information and lacking interaction between the two modalities. Different from these approaches, we not only use fine-grained features to improve performance but also design the cross aggregation module to enhance the interaction between video and text.

Fig. 2

An overview of our LSECA for text-video retrieval. In LSECA, we first extract the keywords of the given text description via the KeyBert [30] model, and we design two different local semantic enhancement (LSE) schemes for text and video, respectively. Guided by the video representation and the keywords, we obtain fine-grained representations that are richer and more compact in semantic information. In addition, to enhance the interaction between the two modalities, we propose the cross aggregation (CA) module

2.2 Multi-grained representation learning

In recent years, there has been a proliferation of valuable studies [21, 22, 32, 33, 38,39,40] that employ multi-grained video and text representations to enhance retrieval performance. Concretely, for the text branch, the common approach [22] is to treat word embeddings as fine-grained features and the [CLS] token as the global feature. For the video branch, traditional methods [32, 33] utilize task-specific networks or experts to extract different types of features (e.g., object, action, scene, audio, etc.). However, these specific features cannot be well adapted to the retrieval task due to the end-to-end optimization issue. For example, T2VLAD [33] achieves better retrieval results by aligning local features through NetVLAD and aligning global features through aggregation. Owing to the rapid development of pre-trained image-language models, recent works extract frame features with an image encoder as the fine-grained representation of video. For instance, TS2-Net [21] proposes a token shift module to capture temporal movements and a token selection module to select the tokens that contribute most to fine-grained semantic information. X-CLIP [22] presents multi-grained contrastive learning to better utilize semantic information for improving retrieval performance.

The above methods utilize the output of the raw encoders for contrastive learning. However, their performance may be limited due to the heterogeneity between the video and image domains. In this paper, we utilize local semantic enhancement modules to improve the fine-grained video and text representations and design the cross aggregation module to enhance the interaction between the two modalities, resulting in a notable improvement in retrieval performance.

3 Method

In this section, we detail the proposed LSECA and provide specific details of the text-to-video retrieval task. Concretely, in the retrieval task, given a set of descriptions and the same number of video clips, our goal is to obtain a semantic similarity matrix for retrieving the videos. The architecture of LSECA is shown in Fig. 2. In Sect. 3.1, we first introduce the basic preliminaries, covering the extraction of text and video features and the notation for each feature. We then elaborate the details of the proposed Local Semantic Enhancement module in Sect. 3.2, which consists of two parts: the video branch in Sect. 3.2.1 and the text branch in Sect. 3.2.2. The Cross Aggregation module follows in Sect. 3.3. Finally, in Sect. 3.4, we describe the calculation of the multi-grained similarity and the objective function for optimization.

3.1 Preliminary

In general, a set of video-text pairs (\(\varvec{V}\), \(\varvec{T}\)) is given as the input data. For the video branch, we sample frames uniformly from each video, usually at a rate of 1 frame per second. We use the image encoder of CLIP [14], a vision transformer initialized with the public ViT-B/32 checkpoint, to process each frame. Specifically, the frame is first divided into multiple patches, and a [CLS] token and position tokens are added, which help the encoder better extract semantic information from the image. Finally, the [CLS] tokens from the last transformer layer are taken as the frame-level features \(\varvec{f}=\left\{ f_1, f_2, f_3,..., f_{N_f}\right\} \), where \(N_f\) is the number of frames in the video. For a description \(\varvec{t}\) \(\in \) \(\varvec{T}\), similar to the video side, the text encoder of CLIP, also a multi-layer transformer, is used to extract text features. We first split the given text description into a word sequence using the specific tokenizer [14]. Before being fed into the text encoder, the word sequence is padded with [BOS] and [EOS] tokens at the start and end of the sequence, respectively. Finally, the global textual feature \(t_{EOS}\) and word-level features \(\varvec{w}\)=\(\left\{ w_1, w_2, w_3,..., w_{N_w} \right\} \) are the outputs of the [EOS] token and the corresponding word tokens from the final layer of the textual transformer, where \(N_w\) is the length of the description.
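
To make the notation concrete, the following is a minimal sketch of this feature-extraction step in PyTorch. The `clip_image_encoder` and `clip_text_encoder` handles are hypothetical wrappers around the pre-trained CLIP towers; names and signatures are illustrative, not the released implementation (which follows CLIP4Clip [17]).

```python
import torch

def encode_video(clip_image_encoder, frames):
    # frames: (N_f, 3, 224, 224), uniformly sampled at 1 fps
    # Inside the encoder each frame is patchified, prepended with a [CLS] token,
    # and combined with position tokens; the last-layer [CLS] output is the frame feature.
    frame_feats = clip_image_encoder(frames)            # f = {f_1, ..., f_{N_f}}, shape (N_f, D)
    return frame_feats

def encode_text(clip_text_encoder, token_ids, eos_index):
    # token_ids: (N_w,) word sequence padded with [BOS]/[EOS]
    token_feats = clip_text_encoder(token_ids.unsqueeze(0)).squeeze(0)   # (N_w, D)
    t_eos = token_feats[eos_index]                       # global textual feature t_EOS
    return token_feats, t_eos                            # word-level features w and t_EOS
```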

3.2 Local semantic enhancement

3.2.1 Visual fine-grained representation

First, we need to obtain the feature of the entire video from the frame features obtained in Sect. 3.1. However, the frame-level features are extracted from separate frames without considering the temporal interaction among frames, so they only contain the spatial information of each single frame, whereas temporal context is essential for understanding the video content correctly. Therefore, we follow [17] and utilize a temporal transformer to model the temporal relationships between frames. Specifically, we add a position token to each frame feature before feeding it into the model, and the outputs of the temporal transformer are average-pooled to obtain the final video-level feature, which can be formulated as:

$$\begin{aligned} f_{i}^{'}= & {} TransEnc(f_i+p_i), \end{aligned}$$
(1)
$$\begin{aligned} \varvec{v}= & {} \frac{1}{N_f} \sum _{i}^{N_f} f_{i}^{'}, \end{aligned}$$
(2)

where \(\varvec{p}\) denotes the position tokens added to the frame features \(\varvec{f}\), \(N_{f}\) is the number of sampled frames, and \(\varvec{v}\) is the final video-level feature.
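
As a reference for Eqs. (1)–(2), the sketch below implements the temporal transformer with learned position tokens followed by mean pooling; the layer count, head count, and maximum frame number are assumptions rather than values specified above.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """Sketch of Eqs. (1)-(2): frame features plus position tokens pass through a
    temporal transformer; the outputs are mean-pooled into the video-level feature v."""
    def __init__(self, dim=512, n_layers=4, n_heads=8, max_frames=12):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(max_frames, dim))        # position tokens p_i
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.trans_enc = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, f):                  # f: (B, N_f, D) frame-level features
        f_prime = self.trans_enc(f + self.pos[: f.size(1)])          # f'_i = TransEnc(f_i + p_i)
        v = f_prime.mean(dim=1)                                      # v = mean over frames
        return f_prime, v
```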

Fig. 3

The details of the visual LSE module. We enhance the frame features by utilizing the pooled video feature and design an adapter-aware module to adjust the enhanced features

Earlier TVR works mainly focus on fine-grained and coarse-grained contrastive learning, computing similarity with the raw outputs of the CLIP encoders. However, these raw outputs may not be well suited to the video retrieval task due to the heterogeneity between video and image. To this end, we develop the visual semantic enhancement module in the proposed LSECA, which differs from prior approaches.

The video frames obtained by uniform sampling contain a lot of redundant information, which is detrimental to cross-modal matching. Therefore, we adjust the local frame features from a global perspective, thereby achieving semantic enhancement. As shown in Fig. 3, given the video-level feature \(\varvec{v}\) and frame-level features \(\varvec{f}^{'}\)=\(\left\{ f_{1}^{'}, f_{2}^{'}, f_{3}^{'},..., f_{N_f}^{'}\right\} \), we first concatenate the global video feature \(\varvec{v}\) with each frame-level feature \(f_{i}^{'}\), generating the input of the visual semantic enhancement module \(\hat{f_i} = [{\varvec{v}}, f_{i}^{'}]\). Moreover, we utilize an LSTM as the main part of the visual local semantic enhancement module to generate a sequence of global-guided frame embeddings \(\varvec{f}^{g}\) = \(\left\{ f_1^g, f_2^g, f_3^g,..., f_{N_f}^g\right\} \). In addition, the information across frames and video is only partially matched, and it is not appropriate to treat all frames equally. Thus, we propose the adapter-aware module to adjust the enhanced features and reduce their impact on the final similarity calculation. The whole process can be formulated as:

$$\begin{aligned} f_i^g = LSTM(\hat{f_i}) \cdot W_{a}, \end{aligned}$$
(3)

where \(f_i^g\) is the fine-grained feature after visual semantic enhancement and \(W_{a}\) denotes the weights estimated by the adapter-aware module, which adds soft labels to the video fine-grained features to filter out unnecessary frames by comparing each frame with its video context. To be specific, as shown in Fig. 3, the adapter-aware module consists of two linear FC layers, a self-attention layer, and a sigmoid activation layer for calculating the corresponding weights. The self-attention layer provides a global view of the fine-grained features, and the sigmoid layer generates smooth adaptive weights for these features.
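
The following sketch shows one possible realization of the visual LSE module and the adapter-aware weighting in Eq. (3); the hidden sizes, head count, and exact placement of the FC layers are assumptions based on the description of Fig. 3.

```python
import torch
import torch.nn as nn

class VisualLSE(nn.Module):
    """Sketch of the visual LSE module (Eq. 3): concatenate the global video feature
    with each frame feature, run an LSTM, then re-weight with adapter-aware weights."""
    def __init__(self, dim=512):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2 * dim, hidden_size=dim, batch_first=True)
        self.fc1 = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.fc2 = nn.Linear(dim, 1)

    def forward(self, f_prime, v):          # f_prime: (B, N_f, D), v: (B, D)
        # \hat{f}_i = [v, f'_i]: global video feature concatenated with every frame
        f_hat = torch.cat([v.unsqueeze(1).expand_as(f_prime), f_prime], dim=-1)
        f_g, _ = self.lstm(f_hat)                                    # global-guided frame embeddings
        # Adapter-aware weights W_a: soft labels that down-weight unnecessary frames
        h = self.fc1(f_g)
        h, _ = self.attn(h, h, h)                                    # global view over frames
        w_a = torch.sigmoid(self.fc2(h))                             # (B, N_f, 1) smooth weights
        return f_g * w_a                                             # Eq. (3)
```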

Fig. 4

The details of the textual LSE module and the cross aggregation module. The core of both modules is the cross-attention mechanism. Enhanced text features and text-guided video aggregation features are obtained under the guidance of keywords and text features, respectively

3.2.2 Textual fine-grained representation

Due to the difference between the two modalities, it is not advisable to adopt the same enhancement strategy as the video branch. In real video retrieval scenarios, we usually focus more on words carrying rich semantic information, such as entities, actions, and scenes. In light of this, we propose a local semantic enhancement strategy that relies on keyword guidance. Specifically, we utilize the KeyBert [30] model to extract keywords from the corresponding textual description; KeyBert is an effective and easy-to-use keyword extraction technique that leverages BERT embeddings to create keywords and keyphrases. The formulation can be represented as follows:

$$\begin{aligned} \varvec{w}^{'} = KeyEnc(KeyBert(\varvec{t})), \end{aligned}$$
(4)

where \(\varvec{w}^{'}\) denotes the keyword features and \(KeyEnc(\cdot )\) is the keyword encoder, a standard transformer encoder with 12 layers and 8 attention heads that has the same structure as the text encoder of CLIP [14]. With the exception of the final linear projection layer, the weight parameters are shared between the keyword and text encoders.
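
For reference, keyword extraction with the KeyBERT library looks roughly as follows; the caption string is illustrative, and relying on KeyBERT's default embedding model is an assumption rather than a detail stated above.

```python
from keybert import KeyBERT

# Offline keyword extraction for Eq. (4). top_n = 5 matches the ablation in Sect. 4.3.2.
kw_model = KeyBERT()
caption = "a man is playing guitar on the street"   # illustrative example caption
keywords = [kw for kw, _ in kw_model.extract_keywords(caption, top_n=5)]
# The extracted keywords are then tokenized and passed through KeyEnc (a 12-layer,
# 8-head transformer sharing weights with the CLIP text encoder except the final
# projection) to obtain the keyword features w'.
```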

For the textual local semantic enhancement module, we design a cross-attention strategy between the raw word features \(\varvec{w}\) = \(\left\{ w_1, w_2, w_3,..., w_{N_w} \right\} \) and the extracted keyword features \(\varvec{w}^{'}\) = \(\left\{ w_1^{'}, w_2^{'}, w_3^{'},..., w^{'}_{N_k} \right\} \), where \(N_k\) is the number of keywords in the description. As shown in Fig. 4a, the word features serve as the keys and values and the keyword features serve as the queries, which can be formulated as follows:

$$\begin{aligned} \varvec{w}^{k} = CrossAtten(\varvec{w} \cdot W_K, \varvec{w} \cdot W_V, \varvec{w}^{'} \cdot W_Q), \end{aligned}$$
(5)

where \(CrossAtten(\cdot )\) is the cross-attention mechanism, which dynamically assigns importance to different elements of the input features based on the relationship between the two feature sets, thus better capturing their interdependencies. \(W_Q\), \(W_K\) and \(W_V\) are trainable projection matrices, and \(\varvec{w}^{k}\) denotes the enhanced text fine-grained representations. In this way, the semantic content of the word features is enhanced through keyword guidance.
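
A minimal sketch of this keyword-guided cross-attention (Eq. 5) using PyTorch's built-in multi-head attention, which already bundles the \(W_Q\), \(W_K\), \(W_V\) projections; the head count is an assumption.

```python
import torch
import torch.nn as nn

class TextualLSE(nn.Module):
    """Sketch of Eq. (5): keyword features act as queries, raw word features as
    keys and values, so word semantics are re-weighted toward the keywords."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=n_heads, batch_first=True)

    def forward(self, w, w_kw):             # w: (B, N_w, D) words, w_kw: (B, N_k, D) keywords
        w_k, _ = self.cross_attn(query=w_kw, key=w, value=w)
        return w_k                           # enhanced text fine-grained representations
```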

3.3 Cross aggregation

In addition to local semantic enhancement, we also consider the interaction between the two modalities. Upon examination of the retrieval datasets, we find that a video typically corresponds to multiple descriptions. Besides, the video embedding is fixed in previous approaches, whereas the semantic focus of the different related texts varies. Therefore, merely relying on global video features is insufficient for effective cross-modal retrieval.

Inspired by [37], we use a similar cross aggregation module that lets the enhanced frame features \(\varvec{f}^{g}\) interact with the specific text feature \(\varvec{t}_{EOS}\). Specifically, as shown in Fig. 4b, the enhanced frame features are projected as keys and values by two different linear layers, and the query is the projection of the text feature \(\varvec{t}_{EOS}\). Finally, the output of the cross aggregation module is the text-guided aggregated video feature \(v_{ca}\), which can be formulated as follows:

$$\begin{aligned} \hat{\varvec{v}}= & {} CrossAtten(\varvec{f}^{g} \cdot W_{K}^{'}, \varvec{f}^{g} \cdot W_{V}^{'}, \varvec{t}_{EOS} \cdot W_{Q}^{'}), \end{aligned}$$
(6)
$$\begin{aligned} \varvec{v}_{ca}= & {} LN_{1}(LN_{2}(\hat{\varvec{v}}) + Dropout(\hat{\varvec{v}})), \end{aligned}$$
(7)

where \(W_{Q}^{'}\), \(W_{K}^{'}\) and \(W_{V}^{'}\) are trainable projection matrices. Similar to Sect. 3.2.2, \(LN_{1}(\cdot )\) and \(LN_{2}(\cdot )\) are LayerNorm layers. Besides, \(Dropout(\cdot )\) is a dropout layer, which not only makes training more stable but also reduces the risk of overfitting.
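
The cross aggregation step of Eqs. (6)–(7) can be sketched as follows; the dropout probability and head count are assumptions.

```python
import torch
import torch.nn as nn

class CrossAggregation(nn.Module):
    """Sketch of Eqs. (6)-(7): the sentence feature t_EOS queries the enhanced frame
    features, producing a text-conditioned video feature v_ca."""
    def __init__(self, dim=512, n_heads=8, p_drop=0.1):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, f_g, t_eos):          # f_g: (B, N_f, D), t_eos: (B, D)
        q = t_eos.unsqueeze(1)                                       # one query per caption
        v_hat, _ = self.cross_attn(query=q, key=f_g, value=f_g)      # Eq. (6)
        v_ca = self.ln1(self.ln2(v_hat) + self.dropout(v_hat))       # Eq. (7)
        return v_ca.squeeze(1)                                       # (B, D)
```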

3.4 Multi-grained similarity and objective function

After the above feature processing, we obtain the enhanced frame features \(\{f_{i}^g\}_{i=1}^{N_f}\), enhanced word features \(\{w_{i}^k\}_{i=1}^{N_{k}}\), aggregated video feature \({\varvec{v}}_{ca}\), and raw text feature \({\varvec{t}}_{EOS}\). For the coarse-grained similarity calculation, we directly use matrix multiplication between the video feature \(\varvec{v}_{ca}\) and the text feature \(t_{EOS}\), which can be represented as follows:

$$\begin{aligned} S_{coarse} = (\varvec{v}_{ca})^\top \cdot {\varvec{t}}_{EOS}. \end{aligned}$$
(8)

For the fine-grained similarity calculation, the fine-grained embeddings of the video are the enhanced frame features \(\varvec{f}^{g} = \{f_{i}^{g}\}_{i=1}^{N_f}\), where \(N_f\) is the number of sampled frames, and the fine-grained embeddings of the text are the enhanced text features \(\varvec{w}^k\)= \(\{w_{i}^k\}_{i=1}^{N_{k}}\). Following [41], we calculate a similarity matrix defined as \(A = [a_{i,j}] \in \mathbb {R}^{N_{f}\times N_{k}}\), where \(a_{i,j}\) is the cosine similarity between \(f_{i}^g\) and \(w_{j}^k\) and represents their fine-grained similarity score. Besides, we take the maximum value \(\underset{j}{\text {max}}\ a_{ij}\) of each row and \(\underset{i}{\text {max}}\ a_{ij}\) of each column as the score that each fine-grained feature contributes to the final similarity calculation. At the same time, we use the computed adaptive weights to pool the corresponding scores over all frames and words. Finally, it can be formulated as:

$$\begin{aligned} S_{fine} = \frac{1}{2}\left( \sum _{i=1}^{N_f}\omega _{f}^{i}\max _j a_{i,j} + \sum _{j=1}^{N_{k}}\omega _{t}^{j}\max _i a_{i,j}\right) , \end{aligned}$$
(9)

where \([\omega _f^1,\omega _f^2,...,\omega _f^{N_f}] = \Phi (f^g)\) and \([\omega _t^1,\omega _t^2,...,\omega _t^{N_{k}}] = \Psi (w^k)\) are the corresponding weights of the video frames and text words, which facilitate fine-grained cross-modal alignment. Specifically, \(\Phi (\cdot )\) and \(\Psi (\cdot )\) have the same structure, each consisting of an FC layer and a Softmax layer. The first term of the equation represents the video-to-text retrieval similarity and the second term the text-to-video similarity. Therefore, the final similarity score \(\varvec{S}\) of LSECA combines the multi-grained contrastive similarity scores, which can be represented as follows:

$$\begin{aligned} \varvec{S} = \alpha S_{coarse} + (1-\alpha )S_{fine}, \end{aligned}$$
(10)

where \(\alpha \) is the trade-off hyper-parameter of total similarity, \(S_{coarse}\) and \(S_{fine}\) are the global and local similarity, respectively.
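
Putting Eqs. (8)–(10) together, a sketch of the similarity computation for a single text-video pair might look like the following; the adaptive weights \(\omega_f\) and \(\omega_t\) are assumed to be precomputed by \(\Phi\) and \(\Psi\), and \(\alpha = 0.6\) is the best trade-off reported in Sect. 4.4.

```python
import torch
import torch.nn.functional as F

def multi_grained_similarity(v_ca, t_eos, f_g, w_k, w_f, w_t, alpha=0.6):
    """Sketch of Eqs. (8)-(10) for one text-video pair.
    v_ca, t_eos: (D,) aggregated video / global text features
    f_g: (N_f, D), w_k: (N_k, D) enhanced frame / word features
    w_f: (N_f,), w_t: (N_k,) softmax-normalized adaptive weights from Phi and Psi."""
    s_coarse = torch.dot(v_ca, t_eos)                                # Eq. (8)
    a = F.normalize(f_g, dim=-1) @ F.normalize(w_k, dim=-1).T        # cosine similarity matrix A
    s_fine = 0.5 * ((w_f * a.max(dim=1).values).sum()                # max over words per frame
                    + (w_t * a.max(dim=0).values).sum())             # max over frames per word, Eq. (9)
    return alpha * s_coarse + (1 - alpha) * s_fine                   # Eq. (10)
```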

Based on the above scheme, given a batch of B text-video pairs, LSECA calculates a B \(\times \) B similarity matrix during training. The InfoNCE loss based on this similarity matrix is used to optimize the whole LSECA, which can be formulated as:

$$\begin{aligned}{} & {} \mathcal {L}_{v2t} = - \frac{1}{B}\sum _{i=1}^{B}\log \frac{\exp (S_{v_i, t_i} / \tau )}{\sum _{j = 1}^{B} \exp (S_{v_i, t_j}/ \tau )}, \end{aligned}$$
(11)
$$\begin{aligned}{} & {} \mathcal {L}_{t2v} = - \frac{1}{B}\sum _{i=1}^{B}\log \frac{\exp (S_{v_i, t_i} / \tau )}{\sum _{j = 1}^{B} \exp (S_{v_j, t_i}/ \tau )}, \end{aligned}$$
(12)
$$\begin{aligned}{} & {} \mathcal {L} = (\mathcal {L}_{v2t} +\mathcal {L}_{t2v})/2, \end{aligned}$$
(13)

where B is the pre-set batch size and \(\tau \) is a temperature hyperparameter that makes the training process converge more rapidly. The loss function \(\mathcal {L}\) increases the similarity of positive pairs and decreases the similarity of negative pairs, thereby pulling relevant video-text representations together and pushing irrelevant ones apart during training.
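
A sketch of the symmetric InfoNCE objective in Eqs. (11)–(13), expressed with the standard cross-entropy form over the \(B \times B\) similarity matrix; \(\tau = 0.01\) follows the implementation details in Sect. 4.1.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(sim_matrix, tau=0.01):
    """Sketch of Eqs. (11)-(13). sim_matrix: (B, B) with S[i, j] = similarity of
    video i and text j; matched pairs lie on the diagonal."""
    logits = sim_matrix / tau
    labels = torch.arange(sim_matrix.size(0), device=sim_matrix.device)
    loss_v2t = F.cross_entropy(logits, labels)          # Eq. (11): softmax over texts per video
    loss_t2v = F.cross_entropy(logits.t(), labels)      # Eq. (12): softmax over videos per text
    return 0.5 * (loss_v2t + loss_t2v)                  # Eq. (13)
```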

4 Experiments

4.1 Experimental settings

We conduct experiments on three mainstream text-to-video retrieval datasets to demonstrate the effectiveness of our LSECA.

MSRVTT [27] contains about 10K YouTube video clips, each with 20 caption descriptions. The duration of each video clip in this collection varies between 10 and 40 s. Following the dataset splits from  [3], we train models with associated captions on the Training-9K set and report results on the test 1K-A set.

MSVD [28] contains 1,970 videos with 80K captions, about 40 captions per video on average. Videos tend to be 40 s or less in length. There are 1,200, 100, and 670 videos in the train, validation, and test sets, respectively. Training and inference on this dataset are in multi-sentence mode, which differs slightly from the other two datasets and can be found in the source code.

LSMDC [29] contains 118,081 videos and captions, which are extracted from 202 movies. The length of each video ranges from 2 to 30 s. We follow the split of [3] and there are 109,673, 7,408, and 1,000 videos in the train, validation, and test sets, respectively.

Evaluation Metrics To evaluate the performance of our proposed LSECA, we choose recall at rank K (R@K, higher is better), median rank (MdR, lower is better), and mean rank (MnR, lower is better) as retrieval performance metrics. To be specific, R@K refers to the percentage of queries for which the correct video appears among the top-K retrieved videos, i.e., the ability of the model to find the target video during retrieval. Referring to previous work [17], we use R@1, R@5 and R@10 as specific recall metrics; a higher R@K indicates better performance. Median rank (MdR) is the median rank of the ground truth in the retrieved list, and mean rank (MnR) is the corresponding mean rank; lower MdR and MnR indicate better performance. In addition, we report SumR (R@1+R@5+R@10) as a composite metric.
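
For clarity, these metrics can be computed from a query-gallery similarity matrix as in the following sketch, assuming the ground-truth match for query i is gallery item i (i.e., on the diagonal).

```python
import numpy as np

def retrieval_metrics(sim_matrix):
    """Sketch of R@K, MdR, and MnR from a (Q, G) similarity matrix with
    ground truth on the diagonal."""
    order = np.argsort(-sim_matrix, axis=1)                         # descending similarity per query
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(order))])
    return {
        "R@1": float((ranks <= 1).mean() * 100),
        "R@5": float((ranks <= 5).mean() * 100),
        "R@10": float((ranks <= 10).mean() * 100),
        "MdR": float(np.median(ranks)),
        "MnR": float(ranks.mean()),
    }
```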

Table 1 Retrieval performance comparison on the MSR-VTT 1K validation set

Implementation Details We conduct extensive experiments on 2 NVIDIA GeForce RTX 3090 24GB GPUs using the PyTorch library. Following previous work [17], we initialize the text encoder and video encoder with the public CLIP checkpoint (ViT-B/32). The frame sampling rate is 1 FPS. The text length is set to 32, the video length is set to 12 for all datasets, and the number of keywords is 5. The initial learning rate for the text and video encoders of CLIP is 1e-7, and the initial learning rate for the other modules is 1e-4. We decay the learning rate using a cosine schedule and use the Adam optimizer to optimize the whole model. We train the model for 5 epochs with the above settings and set the temperature \(\tau \) to 0.01. We conduct ablation, comparison, and qualitative experiments on the MSR-VTT dataset, which is the most popular and competitive among the three.
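
The two-learning-rate setup described above can be sketched as parameter groups with a cosine decay; `clip_params` and `new_params` are hypothetical parameter collections, and the plain `CosineAnnealingLR` scheduler shown here is a standard PyTorch choice that may differ from the exact schedule (e.g., with warmup) used in practice.

```python
import torch

def build_optimizer(clip_params, new_params, num_steps):
    # 1e-7 for the pre-trained CLIP text/video encoders, 1e-4 for newly added modules.
    optimizer = torch.optim.Adam(
        [{"params": clip_params, "lr": 1e-7},
         {"params": new_params, "lr": 1e-4}]
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
    return optimizer, scheduler
```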

4.2 Comparison with state-of-the-art

In this subsection, we compare the proposed LSECA with previous state-of-the-art (SOTA) works on the three datasets, namely MSRVTT, MSVD and LSMDC. The experimental results on these datasets are presented in Tables 1, 2 and 3. We can see that LSECA obtains significant improvements on all three datasets. Furthermore, Table 1 shows the retrieval results of our method and comparisons with other SOTA models on MSRVTT 1K. To be specific, compared to the baseline CLIP4Clip-seqTransf [17], LSECA obtains 47.1 R@1 (a 5.9% improvement) and higher performance in all other metrics (e.g., 74.9 R@5, a 5.0% improvement) in text-to-video retrieval with the ViT-B/32 checkpoint. Compared with other SOTA methods, we also obtain the highest SumR. Therefore, our LSECA improves significantly over the baseline [17] and also obtains competitive performance compared to other SOTA models. Tables 2 and 3 show the results on the MSVD and LSMDC datasets, respectively. Our LSECA also achieves good performance improvements compared to CLIP4Clip-seqTransf [17], which demonstrates the effectiveness and generalization ability of our proposed LSECA. The good performance of LSECA may be attributed to the following reasons:

  • We optimize the fine-grained features compared to some previous works [21, 22, 32, 35,36,37]. For the video side, we utilize the video representation to semantically enhance the frame features so as to filter out irrelevant information and make the corresponding local information more prominent. For the text side, we process the word features with keywords extracted by KeyBert [30] as anchors to reduce the impact of irrelevant words on retrieval performance.

  • We consider the uncertain matching problem between text and video, that is, a video usually corresponds to multiple text descriptions, and a single text can only correspond to a portion of the elements in the video. Thus we design the cross aggregation module to alleviate this problem and obtain good performance.

Table 2 Results of text-to-video retrieval on the MSVD
Table 3 Results of text-to-video retrieval on the LSMDC

4.3 Ablation study

In this section, we provide detailed ablation studies to further clarify the effects of each part of our design. The MSRVTT dataset is selected as the testbed; the results and analyses are as follows.

4.3.1 Ablation about components

Table 4 Component-wise evaluation of our framework on the MSR-VTT 1K validation set

To validate the effectiveness of each component, we conduct ablation experiments on the 1k-A test split of the MSR-VTT dataset. The results are shown in Table 4, from which we obtain several important observations. We first investigate the impact of the Visual Local Semantic Enhancement (VLSE) module. The global video embedding is utilized to assist the frame-level features in obtaining more semantic information. Similar to a synopsis, it adaptively enhances the semantic information of the sampled frame features, which can be associated with the entities, actions, backgrounds, and other information in the synopsis. From the experimental results in Table 4, it can be seen that our proposed enhancement module significantly improves retrieval performance. Furthermore, we conduct experiments to verify the impact of the Textual Local Semantic Enhancement (TLSE) module. In a real-world scenario, we tend to summarize the content of a video, but for text we are more inclined to extract its key points, owing to the heterogeneity between the two modalities. Therefore, we extract keywords to guide the fine-grained features of the text toward the semantic center. The experimental results in Table 4 fully support this design. In addition, enhancing both video and text features simultaneously achieves better retrieval performance. Finally, since videos represent more content than text, we also consider the interaction between the two modalities and propose the Cross Aggregation (CA) module based on the corresponding text to retrieve videos more accurately. The results show that our full model achieves the best performance, demonstrating that the three parts are beneficial for semantic enhancement and cross-modality interaction.

4.3.2 Effect of the number of keywords

The parameter k controls the number of keyword features \(\{w_1^{'}, w_2^{'}, w_3^{'},...,\)\( w^{'}_{N_k} \}\). We start with a small number and gradually increase it. As shown in Table 5, overall performance first improves and then decreases. On the one hand, too few keywords limit the ability to enhance fine-grained features; on the other hand, the guidance ability of the keywords decreases as their number increases. Based on Table 5, we set the number of keywords to k = 5 to achieve the best performance in practice.

Table 5 Ablation studies for the number of Keywords k on the MSR-VTT 1K validation set
Fig. 5

Illustration of four fusion strategies. “Sum" and “Concat" represent the combination schemes between fine-grained frame features and the global video feature. “LSTM" and “Transformer" are the two feature fusion architectures we apply. The effects of different combination schemes and fusion architectures on local semantic enhancement are analyzed experimentally

4.3.3 Effect of different visual local semantic enhancement strategy

As shown in Fig. 5, we design four fusion schemes for frame-level semantic enhancement. To investigate the effect of the four fusion structures, i.e., “Sum + Trans", “Concat + Trans", “Sum + LSTM" and “Concat + LSTM", on retrieval performance, we compare them through ablation experiments in Table 6. “Sum" means that each fine-grained frame feature is added to the global video feature to obtain the combined features. “Concat" denotes cascading the frame features with the video feature to obtain a longer feature whose dimension is 1024. “Trans" and “LSTM" denote the fusion network structures that fuse the combined features to achieve local semantic enhancement. From Table 6, we summarize the following observations: 1) The simple Sum approach obtains poorer retrieval results than Concat. This may be because our goal is to use the global feature as an anchor to guide semantic enhancement, yet the Sum operation mixes two features of the same dimension together, damaging the original semantic information and thus degrading retrieval performance compared to Concat. 2) We also find that using the LSTM for semantic enhancement achieves better retrieval results than the Transformer. Concatenating the features tends to increase the similarity among frame-level features, while the Transformer relies on key, query, and value projections for temporal interaction; as a result, its semantic enhancement effect does not align with our initial expectations and is not as effective as the LSTM. In summary, the experimental results show that a proper fusion strategy between video and frame features can yield better fine-grained representations.

Table 6 Ablation studies for the different visual local semantic enhancement strategies on the MSR-VTT 1K validation set
Table 7 Ablation studies for the adapter-aware module and the adaptive weights on the MSR-VTT 1K validation set

4.3.4 Effect of the adapter-aware module and the adaptive weights

In Table 7, we verify the impact of the adapter-aware module in the visual LSE module and the adaptive weights in Eq. 9. Overall performance decreases after removing either of them. Specifically, after removing the adapter-aware module, R@5 decreases from 74.9% to 72.4%, and without the adaptive weights, R@1 decreases by 0.6%. Therefore, these two parts are helpful for representation learning as well as cross-modal alignment.

Table 8 Ablation studies for the cross-attention module in textual LSE module on the MSR-VTT 1K validation set
Fig. 6

Effect of the trade-off hyper-parameter \(\alpha \) on the MSRVTT 1K validation set

Fig. 7

Our top-3 text-to-video retrieval visualization results on MSR-VTT. We also visualize results from other state-of-the-art methods (UATVR [46] and UCOFIA [42])

4.3.5 Effect of the cross-attention module in textual LSE module

As shown in Table 8, we further compare our method with other interaction methods. For the transformer variant, we cascade the word and keyword features, input them into a transformer, and take the output as the enhanced local text features. From the experimental results, we can see that R@1, R@5 and R@10 degrade to some extent. Our approach improves the representation of textual local features and obtains good retrieval performance by using the keyword features as queries and re-assigning semantics within the word features through the cross-attention mechanism.

4.4 Parameter sensitivity analysis

The hyper-parameter \(\alpha \) is used to trade off \(S_{coarse}\) and \(S_{fine}\) in Eq. 10. Intuitively, the matching scores of features at different granularities may contribute differently to the final retrieval. We therefore conduct experiments with \(\alpha \in [0.2, 0.8]\), as shown in Fig. 6, and observe that our proposed LSECA achieves the best retrieval performance when \(\alpha = 0.6\) is adopted.

4.5 Qualitative analysis

To visually validate the effectiveness of our proposed LSECA, we show a typical text-to-video retrieval example in Fig. 7 and compare it with UATVR [46] and UCOFIA [42]. Our model finds the correct video among similar videos based on keyword guidance. The similarity between the third video and the query computed by UATVR [46] is the highest, leading to an incorrect retrieval result. Although UCOFIA [42] retrieves the correct video, it does not distinguish hard negative pairs well. Local semantic enhancement makes it possible to find the key information in videos and text, and cross aggregation aids the process of information filtering. Thus, LSECA performs well in visual and textual content understanding and achieves good retrieval results.

5 Conclusion

In this paper, we have proposed a new framework, LSECA, which not only considers the interaction between the two modalities but also enhances the fine-grained video and text representations. To handle the heterogeneity between video and text, we have proposed different local semantic enhancement schemes, which utilize the global embedding of the video and the keywords of the text as anchors to guide fine-grained features toward salient semantic information. Moreover, we have designed the cross aggregation module for frame and text features, which achieves sufficient interaction between the two modalities. Experiments have shown that LSECA achieves significant improvements on three standard text-video retrieval datasets, verifying the effectiveness and generalization of our proposed method. The design of the semantic enhancement module for text embeddings is still somewhat simple, and keywords can contribute much more to the retrieval task; we will explore this direction in future work.