
1 Introduction

The rapid growth of video content has led people to increasingly rely on video for acquiring knowledge. However, because videos are typically lengthy, extracting knowledge from them can be time-consuming and tedious. Efficient methods for retrieving relevant information from videos are therefore important.

Visual Answer Localization (VAL) is an emerging task that addresses this issue. Its objective is to identify the video clips that can answer a user's question. VAL analyzes the visual and subtitle components of a video to identify segments that contain relevant information. Recently, a new task, temporal answer localization in Chinese medical instructional videos, was proposed. The datasets for this task were collected from high-quality Chinese medical instructional channels on YouTube and manually annotated by medical experts. In this paper, we explore the VAL task on this Chinese medical dataset, which is the shared task of CMIVQA Track 1.

The VAL task is challenging due to significant disparities between the visual and textual modalities [1]. Previous research has addressed related tasks such as video segment retrieval [2] and video question answering [3], but directly transferring these methods works poorly because of differences between the tasks [4]. Existing VAL methods typically employ a two-stream model to encode video and text separately and a cross encoder to align the two modalities [4, 5]; the cross-modal representations are then used to predict the relevant video clips. The effectiveness of these methods relies on pre-trained language models such as DeBERTa, yet there is a noticeable discrepancy between fine-tuning for the VAL task and the pre-training of these language models: pre-training uses masked language modeling, whereas downstream VAL fine-tuning predicts start and end positions.

We adopt language prompts to resolve this issue. Prompt-based learning is a novel paradigm in NLP that has achieved great success. In contrast to the conventional "pre-train, fine-tune" paradigm, which adapts pre-trained models to downstream tasks through objective engineering, the "pre-train, prompt, predict" paradigm reformulates downstream tasks, via a textual prompt, to resemble the masked language modeling objective optimized during pre-training. This aligns downstream tasks more closely with the pre-training task, enabling better retention of acquired knowledge. Notably, under low-resource conditions it surpasses the "pre-train, fine-tune" paradigm, and it has demonstrated promising results across various NLP tasks, including question answering and text classification.

In our work, we designed a prompt template for the VAL task and adopted the prompt learning approach. To better integrate the video and text modalities, we employ an asymmetric co-attention module to foster their mutual interaction. Comprehensive experiments demonstrate the effectiveness of the proposed methods, which achieved first place on the CMIVQA Track 1 leaderboard with a total score of 0.3891 on testB.

2 Related Work

2.1 Visual Answer Localization

Visual answer localization is an important task in cross-modal understanding [4, 5]. It involves identifying the video clips that correspond to a user's query [6]. Current VAL methods primarily employ sliding windows to generate multiple candidate segments and rank them by their similarity to the query. Alternatively, some methods adopt a scan-and-rank approach: they sample candidate segments via a sliding-window mechanism and integrate the query with each segment representation through a matrix operation. Other approaches directly predict the answer without segment proposals [7]. In the latest work [5, 8], the subtitles and query are fed into a pre-trained language model, and a cross encoder is then used to interact with the visual modality. In this paper, we use a prompt technique to improve the model's comprehension of the task by transforming VAL into an MLM task.

2.2 Prompt Based Learning

Prompt-based learning is an emerging strategy that enables pre-trained language models to adapt to new tasks without additional training, or by training only a small subset of parameters. Manual prompting involves creating an intuitive template based on human understanding. The early use of prompts in pre-trained models can be traced back to GPT-1/2 [9, 10]. These studies demonstrated that, with suitably designed prompts, language models (LMs) can achieve satisfactory zero-shot performance on various tasks, including sentiment classification and reading comprehension. Subsequent works [11,12,13] further explored the use of prompts to extract factual or commonsense knowledge from LMs. PET [14] is a semi-supervised training technique that rephrases the input in a completion format using a prompt to enhance the model's comprehension of the task. It then annotates an unsupervised corpus with multiple models, each tailored to a single prompt, and finally trains a classifier on the enlarged corpus. Our approach is inspired by the ideas introduced in PET.

3 Method

This section begins with the presentation of our data preprocessing approach, which aims to reduce noise in the model inputs. Subsequently, we will introduce our novel model architecture, emphasizing the significant components of prompt construction and cross-modal fusion. Ultimately, we will provide a detailed explanation of our loss design and training techniques.

3.1 Task Formalization

The Chinese Medical Instructional Video Question-Answering task aims to answer a medical or health-related question (Q) given a Chinese medical instructional video (V) and its corresponding set of subtitles \(S = \{T_i\}_{i=1}^r\), where r denotes the number of subtitle spans. The primary objective is to accurately determine the start and end timepoints of the answer \([\hat{V}_s, \hat{V}_e]\) within the video V.

This task aims to develop advanced algorithms and systems capable of comprehending questions posed about Chinese medical instructional videos and effectively retrieving the corresponding answers. It also provides a subtitle timeline table (STB) that maps each subtitle span in the subtitle set S to its corresponding timeline span. Acting as a look-up table, the STB maps the predicted frame-span timepoints \([\hat{V}_s, \hat{V}_e]\) to the target answer \([V_s, V_e]\). Formally, the task can be expressed as:

$$\begin{aligned} \left[ \hat{V}_s, \hat{V}_e\right] &= {\text {f}}(Q, V, S) \end{aligned}$$
(1)
$$\begin{aligned} \left[ V_{s}, V_{e}\right] &= {\text {STB}}(\left[ \hat{V}_s, \hat{V}_e\right] ) \end{aligned}$$
(2)
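The STB lookup in Eq. (2) amounts to a table lookup from subtitle-span indices to timepoints. The following minimal Python sketch illustrates one plausible realization; the exact data layout of the timeline table (a list of start/end second pairs per subtitle span) is our assumption, not part of the released task description.

```python
# Minimal sketch of the subtitle timeline table (STB) lookup described in Eq. (2).
# Assumption: each subtitle span i is stored with its (start_second, end_second)
# pair, so a predicted span index pair maps directly to answer timepoints.
from typing import List, Tuple

def stb_lookup(pred_span: Tuple[int, int],
               timeline: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Map predicted start/end subtitle-span indices to answer timepoints [V_s, V_e]."""
    s_idx, e_idx = pred_span
    v_s = timeline[s_idx][0]   # start time of the first subtitle span in the answer
    v_e = timeline[e_idx][1]   # end time of the last subtitle span in the answer
    return v_s, v_e

# Usage: timeline[i] = (start, end) seconds for subtitle span T_i.
timeline = [(0.0, 3.2), (3.2, 7.5), (7.5, 12.1), (12.1, 18.0)]
print(stb_lookup((1, 2), timeline))   # -> (3.2, 12.1)
```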

3.2 Data Preprocess

Before feeding the data into the model, we conducted a comprehensive data analysis. We found that the subtitle information provided in the dataset was incomplete: some videos lacked subtitles. Upon closer examination, this was primarily because these videos had no audio-aligned subtitles (soft subtitles).

However, the video frames themselves often contained rendered subtitles, referred to as hard subtitles. To address this issue, Optical Character Recognition (OCR) was employed to extract the subtitle text from the video frames and supplement the missing captions. For videos lacking both hard and soft subtitles, the corresponding subtitle was filled with a space. This procedure improves the integrity of the data.
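The OCR-based recovery of hard subtitles can be sketched as follows. The paper does not name a specific OCR tool, so PaddleOCR is used here only as an illustrative choice for Chinese text recognition, and frames are sampled with the ffmpeg command-line tool; both choices, and the result-parsing details, are assumptions.

```python
# Hedged sketch of the hard-subtitle recovery step. PaddleOCR and ffmpeg are
# illustrative assumptions; any Chinese OCR engine and frame sampler would do.
import subprocess, glob
from paddleocr import PaddleOCR

def extract_hard_subtitles(video_path: str, frame_dir: str, fps: float = 0.5):
    # Sample one frame every two seconds; hard subtitles change slowly.
    subprocess.run(["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
                    f"{frame_dir}/frame_%05d.jpg"], check=True)
    ocr = PaddleOCR(lang="ch")           # Chinese recognition model
    subtitles = []
    for frame in sorted(glob.glob(f"{frame_dir}/frame_*.jpg")):
        result = ocr.ocr(frame)          # detected text lines for this frame
        # Result parsing depends on the PaddleOCR version; here we simply keep
        # the recognized strings, or a single space if nothing was detected.
        lines = [item[1][0] for page in (result or []) for item in (page or [])]
        subtitles.append(" ".join(lines) if lines else " ")
    return subtitles
```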

3.3 Model Architecture

In this shared task, a recent method (MutualSL) [5] is employed as the baseline; it has demonstrated superior performance compared to other state-of-the-art (SOTA) approaches across various public VAL datasets. To further enhance VAL capability for Chinese medical videos, we extend the baseline by integrating prompts that activate the powerful language comprehension and representation capabilities that large-scale language models offer to downstream tasks. Additionally, asymmetric co-attention [15] is incorporated to improve the model's cross-modal interaction capability.

Fig. 1. The proposed cross-modal prompt model comprises separate feature extractors for video and text. Video features are enriched through the asymmetric co-attention module and the Visual Predictor, yielding the \([\hat{V}_s, \hat{V}_e]\) predictions. Text and video features are combined using the MLM head and a broadcast mechanism, yielding the \([\hat{T}_s, \hat{T}_e]\) predictions. The final outcome considers four losses, with \(\langle V_s, V_e \rangle\) the pseudo-label generated by text for video and \(\langle T_s, T_e \rangle\) the pseudo-label generated by video for text.

The structure of the model is depicted in Fig. 1. In the initial stage, separate feature extractors are used to obtain representations of the input text and the video frame sequence. The model then applies asymmetric co-attention to the video features and the text-query features. To facilitate cross-modal interaction, a broadcast mechanism combines the deep video features extracted by the asymmetric co-attention module with the text features extracted by DeBERTa-V2-large [16], so that the final fused features are text-based. Finally, the fused textual features and the visual features are each processed by their corresponding predictors to obtain the final result.

Visual Feature Extraction. In contrast to the baseline model, we use a separable 3D CNN (S3D) [17] pretrained on Kinetics-400 [18] instead of Two-Stream Inflated 3D ConvNets (I3D) [19] to extract video features, since S3D better integrates spatio-temporal features and generalizes better. Specifically, we first extract keyframes from video V using FFmpeg, and then S3D extracts video features from these frames:

$$\begin{aligned} \textbf{V} = \text {S3D}(V) \end{aligned}$$
(3)

Here, \(\textbf{V} \in \mathbb {R}^{k \times d}\), where d represents the feature dimension and k represents the length of the video feature sequence.
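A minimal sketch of this feature extraction step is shown below, assuming `s3d_backbone` is any pretrained S3D network (e.g., with Kinetics-400 weights) that maps a short clip tensor to a d-dimensional feature; the clip length and non-overlapping windowing scheme are illustrative assumptions rather than the authors' released pipeline.

```python
# Sketch of visual feature extraction (Eq. 3). `s3d_backbone` is assumed to map a
# clip of shape (1, 3, clip_len, H, W) to a feature of shape (1, d).
import torch

@torch.no_grad()
def extract_video_features(frames: torch.Tensor, s3d_backbone, clip_len: int = 16):
    """frames: (num_frames, 3, H, W) keyframes decoded beforehand (e.g., with ffmpeg).
    Returns V with shape (k, d), one feature per non-overlapping clip."""
    feats = []
    for start in range(0, frames.size(0) - clip_len + 1, clip_len):
        clip = frames[start:start + clip_len]            # (clip_len, 3, H, W)
        clip = clip.permute(1, 0, 2, 3).unsqueeze(0)     # (1, 3, clip_len, H, W)
        feats.append(s3d_backbone(clip).squeeze(0))      # (d,)
    return torch.stack(feats)                            # V in R^{k x d}
```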

Text Input Template. The main objective of the text encoder is to provide a high-quality representation of the question and subtitle information. We follow the baseline in using the pre-trained Chinese language model DeBERTa-V2-large as the text encoder. However, simply feeding in the subtitle information does not fully activate the large model's understanding of the language task, so we employ prompt-based techniques to reconstruct the input text.

We reconstruct the input by adding a prompt (P) before the question and inserting a [Mask] (M) token after each subtitle segment for prediction. The input template T is defined as:

$$\begin{aligned} &T = \{P, Q, T_1, M_1, T_2, M_2, ..., T_n, M_n\} \end{aligned}$$
(4)
$$\begin{aligned} &\textbf{T} = \text {Deberta-V2-large}(T) \end{aligned}$$
(5)

Here, \(\textbf{T} \in \mathbb {R}^{n \times d}\), where d represents the dimension and n represents the length of the text sequence. After extensive experiments, we arrived at the best-performing prompt template. Our final prompt P is "请根据视频和字幕判断问题对应的答案在哪个位置" ("Please determine, based on the video and subtitles, where the answer to the question is located").
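The template in Eq. (4) can be assembled as in the following hedged sketch using a Hugging Face tokenizer; the tokenizer/model name in the usage comment and the truncation length are assumptions for illustration.

```python
# Hedged sketch of the prompt-based input template in Eq. (4): the prompt P, the
# question Q, and each subtitle span T_i followed by a [MASK] token.
from transformers import AutoTokenizer

PROMPT = "请根据视频和字幕判断问题对应的答案在哪个位置"

def build_input(question: str, subtitles: list, tokenizer) -> dict:
    mask = tokenizer.mask_token
    # One [MASK] is appended after every subtitle span; its MLM prediction
    # ("始" / "末") marks whether the answer starts or ends at that span.
    body = "".join(f"{t}{mask}" for t in subtitles)
    text = f"{PROMPT}{question}{body}"
    return tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Usage (model name is an assumption):
# tokenizer = AutoTokenizer.from_pretrained("IDEA-CCNL/Erlangshen-DeBERTa-v2-710M-Chinese")
# inputs = build_input("如何进行心肺复苏?", ["首先确认环境安全", "然后进行胸外按压"], tokenizer)
```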

Cross-Modal Fusion. To improve the semantic representation of video features and capture interactions between visual and textual information, we employ asymmetric co-attention. This mechanism consists of three components: a self-attention (SA) layer, a cross-attention (CA) layer, and the Context-Query Concatenation (CQA).

In the self-attention layer, the video features \(\textbf{V}\) extracted by S3D are utilized to capture internal dependencies within the visual information. This process yields enhanced visual features, denoted as \(\textbf{V}^{SA}_{\text {visual}}\), and attention keys, represented as \(\textbf{K}^{SA}_{\text {visual}}\).

$$\begin{aligned} \textbf{V}^{\text {SA}}_{\text {visual}}, \textbf{K}^{\text {SA}}_{\text {visual}} = \text {LN}(\text {SA}(\textbf{V})) \end{aligned}$$
(6)

Next, we incorporate textual features \(\textbf{Q}_{\text {textual}}\) obtained from the prompt and question into the visual features. The cross-attention layer plays a crucial role in integrating these textual features with the visual features \(\textbf{V}^{SA}_{\text {visual}}\) and \(\textbf{K}^{SA}_{\text {visual}}\). This integration facilitates the fusion of information from both modalities, enabling a comprehensive understanding of the video content. The output of the cross-attention layer is represented as \(\textbf{V}^{CA}_{\text {visual}}\) and \(\textbf{K}^{CA}_{\text {visual}}\), capturing the cross-modal interactions and enriching the semantic representation of the visual features.

$$\begin{aligned} {\textbf{V}^{CA}_{\text {visual}}, \textbf{K}^{CA}_{\text {visual}}}={\text {LN}}({\text {CA}}(\textbf{V}^{SA}_{\text {visual}}, \textbf{K}^{SA}_{\text {visual}})) \end{aligned}$$
(7)

Finally, the outputs of the cross-attention layer, \(\textbf{V}^{CA}_{\text {visual}}\) and \(\textbf{K}^{CA}_{\text {visual}}\), along with the textual features \(\textbf{Q}_{\text {textual}}\), are concatenated and fed into the Context-Query Concatenation layer. This layer combines the contextual information from the video and the query, resulting in a text-aware video representation, \(\textbf{V}^{CQA}_{\text {visual}}\), that captures the interplay between visual and textual elements.

$$\begin{aligned} \textbf{V}^{CQA}_{\text {visual}}={\text {Conv1d}}({\text {Concat}}[{\textbf{Q}_{\text {textual}}, \textbf{V}^{CA}_{\text {visual}}, \textbf{K}^{CA}_{\text {visual}}}]) \end{aligned}$$
(8)

Regarding the textual modality, we employ global averaging to pool the visual features \(\textbf{V}^{CQA}_{\text {visual}}\), resulting in the representation \(\overline{V}^{CQA}_{\text {visual}}\). Finally, we combine \(\overline{V}^{CQA}_{\text {visual}}\) with \(\textbf{T}_{\text {Deberta}}\) extracted by the Deberta-v2 model through summation to obtain the ultimate output of the textual features \(\overline{\textbf{T}}\).

$$\begin{aligned} &\overline{V}^{CQA}_{\text {visual}}={\text {AvgPool}}(\textbf{V}^{CQA}_{\text {visual}})\end{aligned}$$
(9)
$$\begin{aligned} &\overline{\textbf{T}} = \{\overline{V}^{CQA}_{\text {visual}} + \textbf{T}^i_{Deberta}\}^n_{i=1} \end{aligned}$$
(10)
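The fusion described by Eqs. (6)-(10) can be sketched as a small PyTorch module. Hidden sizes, head counts, and the way the key stream of the attention layers is approximated here are our assumptions; the sketch only illustrates the SA, CA, CQA, and pool-and-broadcast flow, not the authors' exact implementation.

```python
# Simplified sketch of the asymmetric co-attention fusion (Eqs. 6-10).
import torch
import torch.nn as nn

class AsymmetricCoAttention(nn.Module):
    def __init__(self, d: int = 1024, heads: int = 8):
        super().__init__()
        self.sa = nn.MultiheadAttention(d, heads, batch_first=True)   # video self-attention
        self.ca = nn.MultiheadAttention(d, heads, batch_first=True)   # text->video cross-attention
        self.ln_sa, self.ln_ca = nn.LayerNorm(d), nn.LayerNorm(d)
        self.cqa = nn.Conv1d(3 * d, d, kernel_size=1)                  # Context-Query Concatenation

    def forward(self, video: torch.Tensor, text_query: torch.Tensor):
        # video: (B, k, d) S3D features; text_query: (B, m, d) prompt+question features
        v_sa, _ = self.sa(video, video, video)
        v_sa = self.ln_sa(v_sa)                                        # Eq. (6)
        v_ca, _ = self.ca(v_sa, text_query, text_query)
        v_ca = self.ln_ca(v_ca)                                        # Eq. (7)
        # Broadcast a pooled text query over the k video positions before concatenation.
        q = text_query.mean(dim=1, keepdim=True).expand(-1, video.size(1), -1)
        fused = torch.cat([q, v_ca, v_sa], dim=-1)                     # (B, k, 3d)
        v_cqa = self.cqa(fused.transpose(1, 2)).transpose(1, 2)        # Eq. (8), (B, k, d)
        v_bar = v_cqa.mean(dim=1, keepdim=True)                        # Eq. (9), AvgPool over k
        return v_cqa, v_bar   # v_bar is added to each text token to form T-bar (Eq. 10)
```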

Visual Predictor. We follow the Visual Predictor design of the baseline, which consists of separate start and end predictors. Each predictor comprises a unidirectional LSTM and an FNN. The \(\textbf{V}^{CQA}_{\text {visual}}\) features are fed into the LSTM, and a feed-forward layer then computes the predicted time-point logits {\(\mathbf {\hat{V}_{s}}\), \(\mathbf {\hat{V}_{e}}\)} for the start and end time points.

$$\begin{aligned} \mathbf {\hat{V}_{s}}&={\text {FNN}}(\text {LSTM}_{\text {start}}(\textbf{V}^{CQA}_{\text {visual}}))\end{aligned}$$
(11)
$$\begin{aligned} \mathbf {\hat{V}_{e}}&={\text {FNN}}(\text {LSTM}_{\text {end}}(\textbf{V}^{CQA}_{\text {visual}})) \end{aligned}$$
(12)
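A compact sketch of this predictor is given below; the layer widths are assumptions.

```python
# Sketch of the Visual Predictor (Eqs. 11-12): two separate unidirectional LSTMs,
# each followed by a feed-forward layer that emits one logit per video position.
import torch
import torch.nn as nn

class VisualPredictor(nn.Module):
    def __init__(self, d: int = 1024, hidden: int = 512):
        super().__init__()
        self.lstm_start = nn.LSTM(d, hidden, batch_first=True)
        self.lstm_end = nn.LSTM(d, hidden, batch_first=True)
        self.ffn_start = nn.Linear(hidden, 1)
        self.ffn_end = nn.Linear(hidden, 1)

    def forward(self, v_cqa: torch.Tensor):
        # v_cqa: (B, k, d) text-aware video features
        h_s, _ = self.lstm_start(v_cqa)
        h_e, _ = self.lstm_end(v_cqa)
        start_logits = self.ffn_start(h_s).squeeze(-1)   # (B, k) logits over start positions
        end_logits = self.ffn_end(h_e).squeeze(-1)       # (B, k) logits over end positions
        return start_logits, end_logits
```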

Prompt-Based Prediction. Figure 2 illustrates the "prompt, predict" paradigm. Our input template T contains n "[Mask]" tokens. We aim to predict the category words "始" (start) and "末" (end) from the textual prompt T, a process similar to masked language modeling in the pre-training stage. Let \(\textbf{T}_s\) denote the probability of the "始" token and \(\textbf{T}_e\) the probability of the "末" token at all mask positions. Additionally, \([\hat{T}_s,\hat{T}_e]\) denote the probabilities of the ground-truth positions being predicted as "始" and "末", respectively.

Fig. 2. Illustration of the "prompt, predict" paradigm.
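This readout can be sketched as follows: the MLM-head logits at the n [Mask] positions are restricted to the verbalizer tokens "始" and "末", yielding span-level start and end scores. Variable names and the gathering logic are illustrative assumptions.

```python
# Hedged sketch of the prompt-based readout over the [MASK] positions.
import torch

def prompt_span_logits(mlm_logits: torch.Tensor, mask_positions: torch.Tensor,
                       tokenizer):
    """mlm_logits: (B, L, vocab) output of the MLM head.
    mask_positions: (B, n) indices of the [MASK] token after each subtitle span."""
    start_id = tokenizer.convert_tokens_to_ids("始")
    end_id = tokenizer.convert_tokens_to_ids("末")
    # Gather the logits at the mask positions: (B, n, vocab)
    idx = mask_positions.unsqueeze(-1).expand(-1, -1, mlm_logits.size(-1))
    mask_logits = mlm_logits.gather(1, idx)
    t_s = mask_logits[..., start_id]   # (B, n) score that span i is the answer start
    t_e = mask_logits[..., end_id]     # (B, n) score that span i is the answer end
    return t_s, t_e
```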

Loss Function. To optimize the logits of the Visual Predictor and the prompt-based prediction, we use the cross-entropy (CE) loss. To enhance the model's robustness, we employ the subtitle timeline look-up table, which generates pseudo-labels \(\langle V_s, V_e \rangle \) for the video from the text predictions and \(\langle T_s, T_e \rangle \) for the text from the video predictions. Additionally, we introduce the R-Drop loss to further improve the model's robustness and generalization.

Finally, our loss function is defined as follows:

$$\begin{aligned} \mathcal {L}_{\text {total}} = \mathcal {L}_{\text {v}} + \mathcal {L}_{\text {t}} + \mathcal {L}_{\text {distill}\_\text {v}} + \mathcal {L}_{\text {distill}\_\text {t}} + \beta \times \mathcal {L}_{\text {Rdrop}} \end{aligned}$$
(13)

The loss terms are defined as follows: \(\mathcal {L}_{\text {v}}\) is the loss between the predicted video features and the true labels, \(\mathcal {L}_{\text {distill}\_\text {v}}\) is the loss between the predicted video features and the pseudo-labels, \(\mathcal {L}_{\text {t}}\) is the loss between the predicted text features and the true labels, \(\mathcal {L}_{\text {distill}\_\text {t}}\) is the loss between the predicted text features and the pseudo-labels, and \(\mathcal {L}_{\text {Rdrop}}\) is the R-Drop loss. \(\beta \) is the weight of the R-Drop loss.
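A minimal sketch of Eq. (13) is shown below; the R-Drop term is approximated by the symmetric KL divergence between two dropout-perturbed forward passes of the text branch, and the reduction details and label formats are assumptions. For brevity, the sketch shows only one of the start/end directions.

```python
# Minimal sketch of the total loss in Eq. (13).
import torch.nn.functional as F

def total_loss(v_logits, t_logits, v_labels, t_labels,
               v_pseudo, t_pseudo, t_logits_2, beta: float = 1.0):
    # v_logits / t_logits: (B, k) and (B, n) start (or end) logits for video and text
    loss_v = F.cross_entropy(v_logits, v_labels)            # video vs. true labels
    loss_t = F.cross_entropy(t_logits, t_labels)            # text vs. true labels
    loss_distill_v = F.cross_entropy(v_logits, v_pseudo)    # video vs. text-derived pseudo labels
    loss_distill_t = F.cross_entropy(t_logits, t_pseudo)    # text vs. video-derived pseudo labels
    # R-Drop: two dropout-perturbed passes over the text branch should agree.
    p, q = F.log_softmax(t_logits, -1), F.log_softmax(t_logits_2, -1)
    loss_rdrop = 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                        + F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return loss_v + loss_t + loss_distill_v + loss_distill_t + beta * loss_rdrop
```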

4 Experiments

4.1 Dataset and Metrics

NLPCC Shared Task 5 provides a dataset of 1,628 Chinese medical instructional videos with question-answer pairs annotated against video segments, divided into a training set (2,936 examples) and two test sets (491 and 510 examples). The dataset, sourced from Chinese medical channels on YouTube and annotated by medical experts, includes the videos, audio, and both types of Chinese subtitles. During data extraction, all text is converted to Simplified Chinese.

Performance is evaluated with two metrics, Intersection over Union (IoU) and mean IoU (mIoU) [20], treating video frame localization as a span prediction task. We report "\(R@n, IoU = \mu \)" and "mIoU", using \(n = 1\) and \(\mu \in \{0.3, 0.5, 0.7\}\) in our experiments.
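The metrics can be computed as in the following short sketch of temporal IoU, R@1 at several thresholds, and mIoU; the span representation as (start, end) seconds is an assumption.

```python
# Sketch of the evaluation metrics: temporal IoU between predicted and ground-truth
# answer spans, plus R@1 at IoU thresholds and mIoU over a result list.
def temporal_iou(pred, gold):
    """pred, gold: (start_sec, end_sec) tuples."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def evaluate(preds, golds, thresholds=(0.3, 0.5, 0.7)):
    ious = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    recall_at_1 = {mu: sum(i >= mu for i in ious) / len(ious) for mu in thresholds}
    return recall_at_1, sum(ious) / len(ious)   # ({mu: R@1}, mIoU)
```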

4.2 Experiment Details

We carried out a range of thorough experiments to verify the pipeline's effectiveness. All runs used consistent training strategies and dataset splits for fair comparison. Specifically, training used the AdamW optimizer with an initial learning rate of \(\text {8e-6}\) and 10% linear warmup. We split the officially provided annotated data into training and validation sets at a 0.9:0.1 ratio, keeping the splits identical across experiments. The best model selected on the validation set was then evaluated on the official testA set. Notably, training took only about 30 min per epoch.
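The optimizer and warmup schedule can be set up as in the sketch below, which mirrors the reported hyper-parameters; the total step count and the scheduler helper from the transformers library are assumptions, not the authors' released training script.

```python
# Illustrative training configuration: AdamW, lr 8e-6, 10% linear warmup.
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer(model, total_steps: int):
    optimizer = torch.optim.AdamW(model.parameters(), lr=8e-6)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),   # 10% linear warmup
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```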

4.3 Experimental Results and Analysis

In this study, we have evaluated the impact of text feature extraction, visual feature extraction, data preprocessing schemes, asymmetric co-attention model setting and prompt setting on the Visual Answer Localization task. The experimental results summarized in Table 1 and Table 2 provide insights into the performance of various methods, which can be analyzed in the following sections.

Table 1. Impact of text and visual feature extraction, and data preprocessing schemes on Visual Answer Localization task performance. Visual Feature setting is based on DeBERTa-v2-710M-Chinese; Data Preprocess setting is based on DeBERTa-v2-710M-Chinese and S3D.

Text Feature Setting. Among Chinese-MacBERT-large [21], Chinese-RoBERTa-large [22], and DeBERTa-v2-710M-Chinese [16], the DeBERTa-v2-710M-Chinese model achieves the best IoU and mIoU scores on both the Valid Set and the TestA Set, surpassing the baseline model. This demonstrates its superior effectiveness in text feature extraction for the Visual Answer Localization task.

Visual Feature Setting. When evaluating different visual feature extraction schemes, incorporating S3D with the DeBERTa-v2-710M model yields the highest mIoU on the Valid Set (45.36) and a consistent improvement on the TestA Set (41.18), exceeding the baseline by 1.2%. This demonstrates the suitability of S3D compared with S3DG and ResNet151, which scored below the baseline and are thus less effective for visual feature extraction.

Data Preprocess. Combining the soft and hard caption extraction schemes outperforms the baseline model in mIoU on the Valid Set (46.24) and the TestA Set (42.02), substantiating the benefit of audio-based and OCR-based caption extraction. Using the two techniques together yields the largest improvement, suggesting that leveraging both audio and visual information improves model performance on Visual Answer Localization.

Model Evaluation. Examining different model settings, adding the Asy-Co-Att mechanism yields a significant improvement across all IoU thresholds on both the validation and TestA sets compared with the baseline De-S3D-DP model, indicating that the mechanism effectively captures visual-textual interactions and refines the semantics of the video features. Adding R-Drop alone also improves on the base model, though not as much as Asy-Co-Att. Combining Asy-Co-Att and R-Drop attains the best performance, highlighting their complementary benefits.

Table 2. Impact of Model Settings and Prompt Configurations on De-S3D-DP for Visual Answer Localization. De-S3D-DP denotes the method which separately employs DeBERTa-v2-710M-Chinese and S3D models to extract textual and visual modality features, and optimizes the text through the use of both soft and hard subtitles. Asy-Co-Att refers to the asymmetric co-attention mechanism. Both (A & R) indicates that both the Asy-Co-Att and R-Drop methods are employed simultaneously.

Text Prompt Setting. We analyzed two prompt configuration schemes: \(\text {Prompt}_{1}\), which constructs text input without a [Mask] (M) token for predictions; and \(\text {Prompt}_{2}\), which includes the [Mask] (M) token for downstream prediction. \(\text {Prompt}_{2}\) performs better across all IoU thresholds and datasets. This consistency with the pretraining task seems to enhance the model’s Visual Answer Localization abilities.

In summary, our analysis indicates that an appropriate pre-trained language model (e.g., DeBERTa-v2-710M-Chinese) and data preprocessing techniques (e.g., combining soft and hard captions) can significantly improve performance on the Visual Answer Localization task. Moreover, building on the De-S3D-DP model, the results in Table 2 show that combining the Asy-Co-Att mechanism and the R-Drop method with the \(\text {Prompt}_{2}\) configuration yields the largest performance gains.

5 Conclusion

This research addresses the challenge of extracting relevant information from videos through Visual Answer Localization (VAL). We focused on the VAL task using the Chinese medical instructional video dataset from the CMIVQA Track 1 shared task. Existing methods from related tasks do not transfer smoothly to this setting. To overcome this, we adopted the prompt-based learning paradigm from Natural Language Processing (NLP), which reformulates downstream tasks to resemble the masked language modeling task used during pre-training. We designed a prompt template tailored to the VAL task and applied the prompt learning approach. Furthermore, we introduced an asymmetric co-attention module to strengthen the integration of video and text.

Our experiments demonstrate the effectiveness of these methods, which achieved first place on the CMIVQA Track 1 leaderboard with a total score of 0.3891 on testB. Prompt-based learning has been shown to outperform the traditional pre-train, fine-tune paradigm, especially under low-resource conditions. In conclusion, our research advances VAL techniques and offers practical solutions for extracting knowledge from videos.