1 Introduction

The dramatic increase in multi-modal data (including text, image, audio, and video) on the Internet makes research on multi-modal summarization necessary. Multi-modal summarization aims to generate a condensed summary that covers the salient information from inputs in one or more modalities [1, 2]. Different from traditional pure-text summaries, Zhu et al. [3] point out that summaries containing both text and images can effectively improve summary quality and increase user satisfaction. Information from different modalities is complementary and mutually verifiable, and utilizing multi-modal information helps the model better locate key content and generate better summaries. Intuitively, people can grasp key information more easily from multiple modalities than from text alone. This task is defined as multi-modal summarization with multi-modal outputs (MSMO). Figure 1 shows an example of this task, which takes text and images as input and generates a summary with two selected images.

Fig. 1

An example explaining the semantic alignment between images and paragraphs in a document. “...” means some content is omitted. Each image is aligned with the paragraph on its right

Recent single-modal summarization models typically employ an encoder–decoder framework with a transformer structure [4, 5]. Existing multi-modal models usually add separate encoders for different modalities to this single-modal encoder–decoder framework [2, 3, 6,7,8,9,10]; their widely used structures are shown in Fig. 2a, b. The representations of different modalities are obtained separately from single-modal encoders, which prevents the model from effectively capturing the interaction between them. Recently, some works have paid attention to enhancing image-text interaction [9, 11] by adding interactive modules or auxiliary tasks.

Fig. 2

Multi-modal summarization models with different encoder structures

However, previous works ignore paragraph-level vision-language semantic alignment, an example of which is shown in Fig. 1. Vision-language semantics refers to the meaning conveyed by vision and language, and alignment refers to establishing correspondence between vision and language that share the same meaning. In Fig. 1, the semantics of each paragraph corresponds closely to the image on its left, i.e., there is a semantic correspondence between images and paragraphs. If a model can reorder the images based on the semantic meaning of the paragraphs, this indicates that the model comprehends and aligns the semantic features of both the images and the text. Besides, vision-language joint encoding, which has been proven effective on many multi-modal natural language understanding (NLU) tasks (e.g., Visual Question Answering) [12,13,14,15,16], has not been well applied to multi-modal summarization.

To address these deficiencies, in this paper we propose the Vision-Language Summarization model ViL-Sum with a universal transformer-based encoder–decoder structure. The core of ViL-Sum is a joint multi-modal encoder with two well-designed auxiliary tasks, image reordering and image selection, which aim to guide the model to learn better vision-language representations and capture paragraph-level vision-language semantic alignment. Specifically, we use a backbone (e.g., ViT [17]) to convert images into visual token embeddings and concatenate them with document token embeddings as the input of the joint multi-modal encoder. The ViL-Sum structure with the joint multi-modal encoder is shown in Fig. 2c. To model paragraph-level vision-language semantic alignment, we propose a simple but effective image reordering task: it forces the model to reorder shuffled input images, which guides the model to learn the correspondence between paragraphs and images. To further enhance the vision-language representation, we also train ViL-Sum with an image selection task, which selects several summary-related images as part of the multi-modal summary. Following [9], we use image captions to construct pseudo-labels. Finally, we train ViL-Sum on the text summary generation, image selection, and image reordering tasks in a multi-task manner.

Experiments show that our ViL-Sum with multi-task training outperforms baselines by a wide margin, and further analysis demonstrates that the improvement comes precisely from joint modeling and multi-task training. However, image captions are not always available, so the image selection task does not generalize to all datasets. It is worth mentioning that even if we remove the image selection task, our proposed multi-modal encoder and the image reordering task still help the model beat all comparison models.

Our contributions can be summarized as follows:

  • We propose a novel Vision-Language Summarization (ViL-Sum) model, which can jointly encode images and text to capture their interrelation.

  • We propose two auxiliary tasks and employ multi-task learning to guide the model to learn the paragraph-level vision and language semantic alignment.

  • Our model outperforms all current state-of-the-art methods on most automatic and manual evaluation metrics. Further analysis shows that the improvement comes precisely from paragraph-level semantic alignment modeling and multi-task training.

2 Related work

2.1 Single-modal summarization

Recently, text summarization models have achieved remarkable performance with the development of pre-trained language models. Liu and Lapata [18] first applied the pre-trained language model BERT [19] to summarization tasks: they add several transformer layers as the decoder on top of the BERT encoder and train the two parts with different learning rates, outperforming all traditionally trained neural models. Pegasus [4] and BART [5] are two fully pre-trained models for summary generation with well-designed self-supervised tasks. Their appearance provided powerful base models for summarization and fundamentally changed the research paradigm of the task. Since then, more and more summarization works have focused on pre-trained language models, including supervised and unsupervised methods [20,21,22,23,24,25,26].

2.2 Vision-language representation

Large-scale Transformer-based [27] vision and language representation models [5, 19, 28] have achieved state-of-the-art results on many Natural Language Processing (NLP) tasks. They are first pre-trained on a large-scale corpus with self-supervised tasks and then fine-tuned on specific downstream tasks. Most existing vision and language pre-training (VLP) models [29,30,31,32,33,34] adopt two different encoders to model vision and language separately: they extract visual features with an object detection model and then combine the derived object-centric image representations with the text. Recently, large-scale vision and language representation learning has tried to jointly encode different modalities with the same encoder and has achieved promising improvements [12,13,14,15,16, 35,36,37]. This success proves that joint modeling of different modalities is practicable. Besides, multi-task learning is effective for vision-language representation: these works jointly train on diverse tasks to better investigate the relationships between vision and language [31, 38, 39].

2.3 Multi-modal summarization

Different from single-modal text summarization, multi-modal summarization aims to generate a condensed summary covering the primary information from multimedia data. One of the most significant characteristics of this task is that it is not only based on text information but can also exploit rich visual information from images, audio, and videos. Multi-modal summarization tasks can be divided into two types according to their outputs: single-modal output [1, 6, 7] and multi-modal output [3, 9, 11, 40]. Compared with single-modal output, a multi-modal output summary can increase user satisfaction; Zhu et al. [3] first propose a large-scale Multi-modal Summarization with Multi-modal Output (MSMO) dataset. To tackle the gap between training and testing in the MSMO task, Zhu et al. [9] propose two methods to obtain pseudo image labels and train the model with multi-modal optimization objectives. Zhang et al. [41] propose to integrate extractive and abstractive summaries and adopt knowledge distillation with a vision and language pre-training model. Zhang et al. [42] propose a location-aware approach to further leverage image location information. Jiang et al. [43] introduce a cross-modal alignment mechanism that exploits pseudo image captions to bridge the cross-modal semantic gap. Inspired by MSMO, Li et al. [44] propose the task of video-based Multi-modal Summarization with Multi-modal Output (VMSMO) and a Dual-Interaction-based Multi-modal Summarizer (DIMS) model, which includes a local conditional self-attention mechanism and a global attention mechanism to model and summarize multi-modal input.

However, previous works all obtain vision-language representations via separate encoders for different modalities, which has been proven weaker than joint representation in vision-language representation learning research [15, 16]. Besides, they ignore the specific paragraph-level semantic alignment between modalities. In this paper, we propose a novel vision-language summarization model, ViL-Sum, with a multi-task learning framework to tackle these issues.

3 Methodology

We show the main architecture of our ViL-Sum model in Fig. 3. First, we employ a backbone network as the image tokenizer to convert images into visual token embeddings (Fig. 3a). Then, the text embeddings and visual token embeddings are concatenated as the input of the main encoder–decoder framework (Fig. 3b). Finally, we train the ViL-Sum model in a multi-task manner. In the following sections, we first introduce the vision-language joint representation and then describe the details of multi-task learning.

Fig. 3

The overall framework of our proposed ViL-Sum model. a The detail of the ViT-based image tokenizer. b The encoder–decoder framework with multi-task learning

3.1 Vision-language joint representation

First of all, we formalize the input and output of our ViL-Sum as \((D, I)\) and \((S, I_S)\), where \(D=\{t_1, t_2, \ldots , t_T\}\) refers to the sequence of tokens from the input document, \(I=\{\text{img}_1, \text{img}_2, \ldots , \text{img}_M\}\) refers to the sequence of images from the input document, \(S=\{t_1, t_2, \ldots \}\) refers to the sequence of tokens of the gold text summary, and \(I_S = \{\text{img}_1, \text{img}_2, \ldots , \text{img}_K\}\) refers to the K selected images for the multi-modal summary.

3.1.1 Document embeddings

Each document is first converted into a sequence of tokens \(\{t_1, t_2, \ldots , t_T\}\), and then two special tokens “\(\langle s\rangle \)” and “\(\langle /s\rangle \)” are added to mark the start and end of the document. After that, we map each token into a vector representation \(E_D = \{e_{\text{start}}, e_1, \ldots , e_T, e_{\text{end}}\}\) with the text embedding layer.
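A minimal sketch of this step, assuming the HuggingFace BART tokenizer and embedding table (the exact tokenizer implementation is not fixed by the paper):

```python
# Sketch only: we assume the HuggingFace BART tokenizer, which adds the
# <s>/</s> boundary tokens automatically, and read E_D from the shared
# embedding table of BART-base.
import torch
from transformers import BartTokenizerFast, BartModel

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
bart = BartModel.from_pretrained("facebook/bart-base")

document = "First paragraph of the article. Second paragraph ..."
enc = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")

# E_D: one embedding per token, including the start/end special tokens
E_D = bart.get_input_embeddings()(enc["input_ids"])   # shape: (1, seq_len, 768)
```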

3.1.2 Image embeddings

Different from previous methods, which extract image features via existing object detection models, we employ ViT [17] as the backbone, which splits each image into several patches and then encodes them. The details of the image tokenizer are shown in Fig. 3a.

Firstly, we reshape each image \(\text{img} \in \mathbb {R}^{H\times W \times C}\) into a sequence of flattened 2D patches \(\{\text{img}^p\}_{p=1}^N\) with \(\text{img}^p \in \mathbb {R}^{P^2\cdot C}\), where \((H, W)\) is the resolution of the original image, C is the number of channels, \((P, P)\) is the resolution of each image patch, and \(N = HW/P^2\) is the resulting number of patches. This sequence of image patches \(\{\text{img}^p\}_{p=1}^N\) is the input of the image tokenizer.
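For concreteness, the following is a minimal sketch of this reshaping step (not taken from the paper; the patch size \(P=16\) matches the ViT-B/16 setting used in our experiments, and the actual implementation inside ViT may differ):

```python
import torch

def to_patches(img: torch.Tensor, P: int = 16) -> torch.Tensor:
    """Reshape an image of shape (C, H, W) into N = HW / P^2 flattened
    patches of dimension P*P*C, as described above (sketch only)."""
    C, H, W = img.shape
    patches = img.unfold(1, P, P).unfold(2, P, P)   # (C, H/P, W/P, P, P)
    patches = patches.permute(1, 2, 0, 3, 4)        # (H/P, W/P, C, P, P)
    return patches.reshape(-1, C * P * P)           # (N, P^2 * C)

img = torch.randn(3, 224, 224)
print(to_patches(img).shape)   # torch.Size([196, 768])
```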

Secondly, the patches are linearly projected to patch embeddings \(e_i^p=E\cdot \text{img}_i^p\), where \(E \in \mathbb {R}^{D\times (P^2\cdot C)}\). We also prepend a special token “[class]” with a learnable embedding \(e^0\). Then, position embeddings are added to the patch embeddings to form the input \(Z_0\) of the image encoder, which retains the positional information of the patches:

$$\begin{aligned} Z_0 = [e_i^0; e_i^1; \ldots ; e_i^N] + E_{\text{pos}} \end{aligned}$$
(1)

where \(Z_0, E_{\text{pos}} \in \mathbb {R}^{(N+1)\times D}\) and \(E_{\text{pos}}\) denotes the position embeddings.

Finally, we employ the pre-trained ViT with L encoder layers as the backbone to encode the patches of each image. This backbone can also be replaced by any other encoder (e.g., a linear projection layer).

$$\begin{aligned} Z_{\ell } = \texttt{EncoderLayer}(Z_{\ell -1}),\quad \ell =1,2,\ldots ,L \end{aligned}$$
(2)

We take the global max-pooling of the output vectors as the visual token embedding of image \(\text{img}_i\):

$$\begin{aligned} v_i = \texttt{MaxPooling}(Z_{L}) \end{aligned}$$
(3)

where \(v_i\in \mathbb {R}^{D}\). Through the image tokenizer, we convert the sequence of input images into a sequence of visual token embeddings \(E_v=\{v_i\}_{i=1}^{M}\).
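As a concrete illustration, the sketch below builds the image tokenizer around the HuggingFace ViT-B/16 checkpoint; the max-pooling over the last hidden states follows Eq. (3), while other details (e.g., image preprocessing) are assumptions rather than the paper's exact implementation:

```python
import torch
from transformers import ViTModel

vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

def image_tokenizer(pixel_values: torch.Tensor) -> torch.Tensor:
    """pixel_values: (M, 3, 224, 224) preprocessed images.
    Returns E_v: (M, D) visual token embeddings via global max-pooling (Eq. 3)."""
    Z_L = vit(pixel_values=pixel_values).last_hidden_state   # (M, N+1, D)
    return Z_L.max(dim=1).values                             # one vector per image

E_v = image_tokenizer(torch.randn(4, 3, 224, 224))           # shape: (4, 768)
```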

3.1.3 Multi-modal encoder

The input of the multi-modal encoder is the concatenation of the visual token embeddings \(E_v\) and the token embeddings \(E_D\). We formalize the input as \(H_0 = \{E_v; E_D\}\) and then encode the visual and text embeddings with 12 transformer blocks. Finally, we obtain the vision-language representation \(H_L\) from the last layer of this encoder.

$$\begin{aligned} H_L = \{h_{v_1},\ldots , h_{v_M}, h_{\text{start}}, h_{1}, \ldots , h_{\text{end}}\} \end{aligned}$$
(4)

The vision and language semantics interact with the self-attention mechanism of the transformer structure during the encoding process.
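A minimal sketch of the joint encoding step, assuming a BART-style encoder that accepts pre-computed input embeddings and that \(E_v\) has already been projected to the encoder hidden size; the encoder configuration (number of layers, hidden size) is the paper's choice, not fixed by this sketch:

```python
import torch
from transformers import BartModel

bart = BartModel.from_pretrained("facebook/bart-base")

def joint_encode(E_v: torch.Tensor, E_D: torch.Tensor) -> torch.Tensor:
    """E_v: (1, M, D) visual token embeddings, E_D: (1, T, D) text embeddings.
    Returns H_L: (1, M+T, D), the joint vision-language representation."""
    H_0 = torch.cat([E_v, E_D], dim=1)                        # H_0 = {E_v; E_D}
    return bart.encoder(inputs_embeds=H_0).last_hidden_state  # self-attention over both modalities

H_L = joint_encode(torch.randn(1, 4, 768), torch.randn(1, 32, 768))
```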

3.2 Visual-enhanced summary generation

The vision-language representation \(H_L\) from the multi-modal encoder contains multi-modal features of the input text and images. After encoding, we feed \(H_L\) into the decoder to generate a text summary. The objective of the summary generation task is to minimize the negative log-likelihood of the reference tokens y given the input document D and images I by updating the model parameters \(\theta \). The loss function of the summary generation task is as follows:

$$\begin{aligned} \mathcal L_{\theta }^{\text{GEN}} = - \sum _{j=1}^{|y|}\log P_{\theta }(y_j|y_{<j}, D, I) \end{aligned}$$
(5)

Different from single-modal summarization tasks, this optimization target also depends on the features of the input images I, which enhances the final summary generation.
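For clarity, Eq. (5) is the standard teacher-forced cross-entropy over the reference tokens; a minimal sketch, assuming the decoder logits have already been computed against \(H_L\) (the padding id 1 corresponds to BART's pad token):

```python
import torch
import torch.nn.functional as F

def generation_loss(logits: torch.Tensor, reference_ids: torch.Tensor,
                    pad_id: int = 1) -> torch.Tensor:
    """logits: (batch, |y|, vocab) decoder outputs; reference_ids: (batch, |y|).
    Cross-entropy = negative log-likelihood of the reference tokens (Eq. 5)."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        reference_ids.reshape(-1),
        ignore_index=pad_id,        # padding positions do not contribute
    )

# Tiny usage example with random tensors.
loss_gen = generation_loss(torch.randn(1, 20, 50265), torch.randint(0, 50265, (1, 20)))
```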

3.3 Image reordering

To align the paragraphs and images of the input, we introduce a simple yet effective task, image reordering, to guide the model to learn semantic alignment. Specifically, we shuffle the order of the input images and then force the ViL-Sum model to predict their original order with a classification head:

$$\begin{aligned} y_i=P(\text{pos}_i)=\texttt{softmax}(W\cdot h_{v_i} + b) \end{aligned}$$
(6)

where all input images share one classification head. To train this classification head, the model minimizes the following objective:

$$\begin{aligned} \mathcal L_{\theta }^{\text{IR}} = \frac{1}{M}\sum _{i=1}^M \sum _{c=1}^C -\hat{y}_{ic} \log y_{ic} \end{aligned}$$
(7)

where C is the number of categories, which depends on the number of input images. We set \(C=10\); if the number of input images is greater than 10, we keep only the first 10 images.
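A minimal sketch of the reordering head and loss (Eqs. 6-7); the hidden size and the way the shuffled positions are produced are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReorderHead(nn.Module):
    """Shared classification head that predicts each image's original position."""
    def __init__(self, hidden_size: int = 768, num_positions: int = 10):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_positions)

    def forward(self, h_v_shuffled: torch.Tensor, original_pos: torch.Tensor) -> torch.Tensor:
        """h_v_shuffled: (M, D) representations of the shuffled images;
        original_pos: (M,) index of each shuffled image in the original order."""
        logits = self.classifier(h_v_shuffled)           # Eq. (6) before softmax
        return F.cross_entropy(logits, original_pos)     # Eq. (7), averaged over M

# Usage sketch: shuffle the images and keep their original indices as labels.
h_v = torch.randn(6, 768)                                # M = 6 image representations
perm = torch.randperm(6)
loss_ir = ReorderHead()(h_v[perm], perm)
```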

3.4 Image selection

Following [9], we also train our ViL-Sum with a multi-modal output reference. To build pseudo image-selection labels for the training data, we use the similarity between each image caption and the gold summary to select the top-K images as labels \(\hat{y}\) (K is empirically set to 3). The similarity is the average of the ROUGE-1, ROUGE-2, and ROUGE-L scores. The probability of selecting each image is computed as follows:

$$\begin{aligned} y_i = P(\text{img}_i) = \sigma (W\cdot h_{v_i} + b) \end{aligned}$$
(8)

The loss function of the image selection task is as follows:

$$\begin{aligned} \mathcal L_{\theta }^{\text{IS}} = \frac{1}{M} \sum _{i=1}^M -[\hat{y}_i \log y_i +(1-\hat{y}_i) \log (1- y_i)] \end{aligned}$$
(9)
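A minimal sketch of the pseudo-label construction and the selection loss (Eqs. 8-9), assuming the `rouge-score` package for the ROUGE average; the paper specifies only the averaging of ROUGE-1/2/L and K = 3, so the remaining details are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def pseudo_labels(captions, gold_summary, k=3):
    """Mark the K captions most similar to the gold summary as positive."""
    sims = torch.tensor([
        sum(s.fmeasure for s in scorer.score(gold_summary, c).values()) / 3.0
        for c in captions
    ])
    y_hat = torch.zeros(len(captions))
    y_hat[sims.topk(min(k, len(captions))).indices] = 1.0
    return y_hat

selection_head = nn.Linear(768, 1)   # shared scoring head, Eq. (8) before the sigmoid

def selection_loss(h_v: torch.Tensor, y_hat: torch.Tensor) -> torch.Tensor:
    """h_v: (M, D) image representations from H_L; y_hat: (M,) pseudo-labels."""
    logits = selection_head(h_v).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, y_hat)   # Eq. (9)
```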

3.5 Enhanced by multi-task learning

We train our ViL-Sum with the text summary generation task and two well-designed auxiliary tasks in a multi-task manner, which enhances vision-language representation and paragraph-level semantic alignment. The details of these tasks have been introduced in the previous sections. Finally, ViL-Sum is trained on three tasks, summary generation, image selection, and image reordering, jointly by simultaneously minimizing three loss functions as follows:

$$\begin{aligned} \mathcal L_{\theta }^{\text{TOTAL}} = \mathcal L_{\theta }^{\text{GEN}} + \mathcal L_{\theta }^{\text{IS}} + \mathcal L_{\theta }^{\text{IR}} \end{aligned}$$
(10)

It is worth mentioning that image captions are not always available, so the image selection task does not generalize to all datasets. If we remove the image selection task, we can instead select images by measuring the similarity between the generated summary and the vector representations of the images. Our proposed multi-modal encoder and the image reordering task still help the model achieve excellent performance.
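Putting the pieces together, the objective in Eq. (10) is an unweighted sum of the three losses; a sketch of one training step, reusing the loss terms sketched in the previous subsections (names are illustrative):

```python
import torch

def training_step(loss_gen: torch.Tensor, loss_is: torch.Tensor,
                  loss_ir: torch.Tensor, optimizer: torch.optim.Optimizer) -> float:
    """One multi-task update (Eq. 10): L^TOTAL = L^GEN + L^IS + L^IR."""
    loss_total = loss_gen + loss_is + loss_ir
    optimizer.zero_grad()
    loss_total.backward()
    optimizer.step()
    return loss_total.item()
```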

4 Experimental setup

4.1 Dataset

We employ the MSMO dataset [3] to evaluate the effectiveness of our proposed ViL-Sum. The MSMO dataset is a large-scale dataset for the Multi-modal Summarization with Multi-modal Output task. Each example in the dataset is a triplet (document, images, summary) containing more than one image. The dataset consists of online news articles (723 tokens on average) paired with multiple image-caption pairs (6.58 images on average) and multi-sentence summaries (70 tokens on average). For the test data, humans annotate at most three images based on the text reference to produce a multi-modal reference. Detailed statistics of the MSMO dataset are shown in Table 1.

Table 1 Statistical information of MSMO dataset

4.2 Baseline models

We report results of existing multi-modal summarization methods (ATG, ATL, HAN, GR) [3], MMR [45], and MOF\(^{\text{RR}}_{\text{dec}}\) [9] on multiple metrics. We also report the result of PGC [46], which is a single-modal summarization model.

To prove the effectiveness of our proposed joint representation and multi-task learning, we mainly compare with the BART-base [5] model and a reproduced two-stream model, BART-cross, which has the same structure as MOF\(^{\text{RR}}_{\text{dec}}\) but replaces the GRU and VGG19 [47] with BART and ViT [17], respectively. For fairness, we mainly compare our model with BART-base and BART-cross because previous methods did not employ pre-trained models. The details of these models are as follows:

  • PGC is the BiGRU-based pointer-generator network that allows both copying words from the input text and generating words from a fixed vocabulary.

  • ATG is based on the PGC model. It fuses static visual features from VGG19 with text features after the BiGRU encoder. Besides, ATG selects final images by the visual-text attention weight.

  • ATL replaces the global image features of ATG with local features (multiple pooling features) and selects images by measuring the sum of the visual attention distribution over the local patch features of each image.

  • HAN is based on the ATL model and adds a hierarchical attention mechanism. This attention mechanism first attends to the image patches to get the intermediate vectors to represent images and then, attends to these vectors to get the visual context vector.

  • GR is an extractive method that employs LexRank [48] to rank the captions of images and selects images based on the rank score. Its text summary is generated by the PGC model.

  • MMR is a unified unsupervised graph-based framework for multi-modal summarization that can cover both single-modal output summarization and multi-modal output summarization. According to specific requirements, there are three models: generic multi-modal ranking, modal-dominated multimodal ranking, and non-redundant text-image multi-modal ranking. MMR\(^{*}\) is the corresponding MMR model that truncates the input text to 10 sentences.

  • MOF\(^{\text{RR}}_{\text{dec}}\) is based on ATG model. This model first constructs pseudo-labels of image selection for the final summary. Specifically, it employs the ROUGE score to measure the relevance of image caption and summary text.

  • UniMS is a unified multi-modal summarization framework that integrates extractive and abstractive summaries and adopts knowledge distillation to improve image selection.

  • LAMS investigates image locations for multi-modal summarization via a stack of multi-modal fusion blocks and formulates the high-order interactions among images and texts.

  • SITA proposes a novel coarse-to-fine image-text alignment mechanism to identify the most relevant sentence for each image and applies a cross-modal retrieval model to retrieve a reference caption for each image from the gold summary.

  • BART-base is a pre-trained seq2seq generation model, which has achieved promising results on many NLP generation tasks, especially text summarization. We use this model to confirm the contribution of visual features to summary generation.

  • BART-cross is a BART-based model with the same model structure as the previous ATG, ATL, HAN, GR, and MOF\(^{\text{RR}}_{\text{dec}}\). It first encodes images with ViT and then fuses them with the text representations from the BART encoder output via cross-attention, like the ATG model. This is our main comparison model.

For a fair comparison, we construct this BART-cross model to prove the effectiveness of the joint multi-modal encoder and multi-task training in our ViL-Sum, because ViL-Sum without multi-task training only changes the encoding mechanism from separate encoders to the joint multi-modal encoder.

4.3 Implementation details

We train our model for 10 epochs on 8 V100 GPUs using Adam [49] with \(\beta _1=0.9\), \(\beta _2=0.99\), and a batch size of 64. We also use a linear learning rate warm-up over 1000 steps. The weight decay is set to \(10^{-4}\). The model is initialized with ViT-B/16 and BART-base parameters. The maximum numbers of input images and tokens are 10 and 512, respectively. For the image tokenizer, we employ the same setting as ViT-B/16 in [17]. During testing, we generate the summary with a beam size of 3, and the minimum and maximum decoding lengths are set to 15 and 150, respectively.
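A sketch of this optimization setup in PyTorch; the learning rate and the total number of training steps are not stated in the text and are therefore placeholders:

```python
import torch
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

model = nn.Linear(768, 768)     # placeholder standing in for the ViL-Sum parameters
total_steps = 10_000            # hypothetical; depends on dataset size and batch size

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=3e-5,                    # learning rate is not stated in the text (assumption)
    betas=(0.9, 0.99),          # beta_1 = 0.9, beta_2 = 0.99 as above
    weight_decay=1e-4,          # weight decay of 10^-4
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=total_steps
)

# The decoding settings at test time (beam size 3, length 15-150) would map to e.g.:
# summary_ids = bart.generate(input_ids, num_beams=3, min_length=15, max_length=150)
```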

4.4 Evaluation metrics

We evaluate the pictorial summary with the MMAE metric [3].

MMAE consists of three sub-metrics: the ROUGE score (ROUGE-L), Image Precision (IP), and Image-Text Relevance (MAX\(_{\text{sim}}\)). The ROUGE [50] score measures the salience of the text in the generated summary and is widely used for evaluating summarization systems. Image precision measures the salience of the selected images and is computed as in Eq. (11).

$$\begin{aligned} \textbf{IP} = \frac{|\text{ref}_{\text{img}} \cap \text{rec}_{\text{img}}|}{|\text{rec}_{\text{img}}|} \end{aligned}$$
(11)

where \(\text{ref}_{\text{img}}\) and \(\text{rec}_{\text{img}}\) denote the reference images and the images recommended by MSMO systems, respectively. MAX\(_{\text{sim}}\) measures the relevance between the selected images and the generated text summary; it is computed by an image-text retrieval model [51] trained with a max-margin loss. Finally, Zhu et al. [3] obtain MMAE by linearly regressing the three metrics against human judgments; the weights for ROUGE-L, MAX\(_{\text{sim}}\), and IP are 1.641, 0.854, and 0.806, respectively, and the intercept is 1.978.
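For reference, Eq. (11) and the MMAE combination can be computed directly from the reported weights; the ROUGE-L and MAX\(_{\text{sim}}\) values are assumed to come from their respective sub-metrics, and the example numbers below are made up:

```python
def image_precision(ref_imgs: set, rec_imgs: set) -> float:
    """IP = |ref ∩ rec| / |rec|  (Eq. 11)."""
    return len(ref_imgs & rec_imgs) / len(rec_imgs)

def mmae(rouge_l: float, max_sim: float, ip: float) -> float:
    """Linear combination of the three sub-metrics fitted to human judgments in [3]."""
    return 1.641 * rouge_l + 0.854 * max_sim + 0.806 * ip + 1.978

# Example with hypothetical sub-metric values.
print(image_precision({"img1", "img2", "img3"}, {"img2", "img3", "img5"}))  # 0.666...
print(mmae(rouge_l=0.40, max_sim=0.60, ip=0.65))
```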

We report the results of ROUGE-1/2/L, MAX\(_{\text{sim}}\), IP, and MMAE of each model to comprehensively measure their performance. The results of our model are all the averages of three different checkpoints.

5 Results

5.1 Overall performance

The main results of all models are shown in Table 2. Previous baseline models in block 1 are based on the pointer network with BiGRU. Models in block 2 are our implementation based on the BART-base model. SEL means selection task and REO means reordering task. All reported results of ours are the average of 3 different checkpoints.

Table 2 The main results of all comparison models on different metrics

We can see that, compared with the baselines, our ViL-Sum gains significant improvements on most metrics, and ViL-Sum+selection+reordering achieves the best overall performance except on IP. SITA gains a notable improvement on the IP metric: it trains a cross-modal retrieval model to retrieve a reference caption for each image, providing a supervision signal that is very beneficial for image selection and image-text alignment. Nevertheless, our ViL-Sum still outperforms SITA on the other metrics. Compared with BART-cross, we can see that the joint representation and multi-task training both bring satisfactory improvements, which proves the effectiveness of our proposed methods. Interestingly, the introduction of image features hurts the performance of all single-modal summarization models, especially the BiGRU-based ones.

To further demonstrate that visual information indeed benefits text summary generation, we compare four groups of multi-modal models with the single-modal models on which they are based, as shown in Table 3. The single-modal models are above the dashed line, and the multi-modal models based on them are below it. In the first group, the multi-modal models (ATL and MOF\(^{\text{RR}}_{\text{dec}}\)) are not superior to the single-modal model (PGC). These works concluded that long documents already contain enough information and that too many images would introduce noise into summary generation. In contrast, the other three groups of results show that the multi-modal models achieve performance improvements over the single-modal models. The results indicate that introducing and exploiting image information effectively can improve text summary generation.

Table 3 Comparison of text summary results on ROUGE scores between multi-modal models and their single-modal models

5.2 Performance of joint representation

Firstly, we can see that ATG, ATL, HAN, and GR all hurt the ROUGE scores by simply introducing images as independent visual features. Through multi-modal objective optimization, MOF\(^{\text{RR}}_{\text{dec}}\) achieves a significant improvement on IP without decreasing the quality of the generated text summary. This shows that modeling vision and language information independently does not benefit text summary generation. The results of BART-cross, which also introduces images as independent features, likewise show lower ROUGE scores than BART-base, which confirms the previous conclusion.

In contrast to the previous results on ROUGE scores, our ViL-Sum with joint vision-language representation obtains better ROUGE scores, and both Image Precision (IP) and MAX\(_\text{sim}\) improve significantly. This demonstrates that using the joint multi-modal encoder to obtain vision-language representations is better than using separate encoders with cross-attention to fuse multi-modal features.

5.3 Performance of multi-task learning

ViL-Sum without multi-task learning already achieves good performance and is better than BART-cross. In this section, we analyze the influence of our proposed multi-task learning. From the results, we can see that the introduction of image selection and reordering brings a slight decrease in ROUGE scores. Meanwhile, the IP and MAX\(_{\text{sim}}\) scores increase significantly, which makes the overall MMAE score better than that of ViL-Sum without multi-task training.

We report the ablation study results of the two auxiliary tasks in the second block of Table 2. From the results, we can see that image selection and reordering both bring improvements in the IP and MAX\(_{\text{sim}}\) scores, and the combination of the two tasks pushes the overall MMAE score higher. The comparison of these models demonstrates that the introduction of multi-task learning indeed improves the vision-language representation and semantic alignment, which is reflected in the improvement of the multi-modal metrics IP, MAX\(_{\text{sim}}\), and MMAE.

6 Discussion

6.1 Human evaluation

We randomly sample 100 examples from the test set to conduct the human evaluation. The multi-modal summaries from the gold reference, BART-base, BART-cross, and our best ViL-Sum are evaluated by three human annotators. Each annotator scores each example on a rating scale from 1 (worst) to 5 (best). Table 4 shows the average scores from the three annotators (t-test, \(p<0.05\)). We can see that annotators tend to give higher scores to the multi-modal summaries from BART-cross and our ViL-Sum. In addition, our ViL-Sum outperforms the two strong baselines by a wide margin and is close to the references. It is noteworthy that our work shows a larger improvement in human evaluation than in the automatic metrics, which reflects the gap between human evaluation and automatic metrics.

Table 4 Results evaluated by human annotators

6.2 Impact of different numbers of images

Table 5 reports the experimental results of our model with different K (the number of summary-related images selected for the final summary). Since the gold reference in the test set contains three images, the consistency between training and test makes the model perform best when K is 3. Overall, our model is not very sensitive to K: with different K, ViL-Sum consistently achieves excellent performance, which shows that our method can identify the truly important images from multi-modal inputs. Besides, we conjecture that image selection on the MSMO dataset is relatively easy because the data comes from news articles.

Table 5 Results of ViL-Sum under different numbers K of images

6.3 Impact of different image tokenizers

To further evaluate the effectiveness of joint modeling and multi-task learning, we replace the backbone of the image tokenizer and observe the performance of ViL-Sum. We replace the ViT backbone with a linear layer and with an image tokenizer from Vision Transformer [52], both of which have far fewer parameters than the ViT backbone. Specifically, Linear is a simplified version of ViT that replaces the transformer image encoder with a simple linear layer to map images into visual token embeddings. Vision is an image tokenizer from Vision Transformer [52], which converts one image into several visual token embeddings. Table 6 reports their results. We can see that ViT indeed provides better visual features than the other two backbones. However, the performance does not drop sharply when the image tokenizer is replaced, which shows that our two proposed strategies are robust and that ViL-Sum is flexible with respect to the image tokenizer.

Table 6 Results of ViL-Sum with different image tokenizers

6.4 Case study and relevance visualization

We select one typical example from the test set and visualize the relevance of (1) summary sentences and selected images; (2) selected paragraphs and images; and (3) all tokens and images in Figs. 4 and 5. Each color block represents the cosine similarity between an image and a text object; darker colors indicate higher similarity. With our proposed methods, the model generates a high-quality summary with three related images, as shown in Fig. 4a. From the different relevance visualizations, we can see that our ViL-Sum effectively aligns the semantic representations of summary sentences and selected images, as shown in Fig. 4b, and that training with image reordering aligns the input images with the paragraphs, as shown in Fig. 4c. By comparing Fig. 4b, d, as well as Fig. 4c, e, we can observe that ViL-Sum surpasses BART-cross in paragraph-level vision-language semantic alignment, primarily attributable to the incorporation of multi-task learning.

Fig. 4

Example from the test set with the generated multi-modal summary. a The full example. b, d Heatmaps showing the relevance of the summary and the selected images. c, e Heatmaps showing the relevance of the selected paragraphs and images. Each color block represents the cosine similarity between an image and a text object. Darker colors indicate higher similarity (colour figure online)

Fig. 5

The heatmap shows the relevance of all input tokens and images. The darker color refers to higher similarity (colour figure online)

We also report the heatmap of all input tokens and images in Fig. 5, which is consistent with Fig. 4b, c. This case proves that the multi-task training really helps ViL-Sum learn reasonable relations between images and input paragraphs.

7 Conclusion

In this paper, we propose a novel Vision-Language Summarization (ViL-Sum) model, which enhances vision-language representation and paragraph-level semantic alignment through joint modeling and multi-task training. A multi-modal encoder jointly encodes images and text to capture their interrelation, and the image reordering and image selection tasks guide the model to learn paragraph-level vision-language semantic alignment. Our ViL-Sum achieves new state-of-the-art results on most automatic and manual evaluation metrics. Further analysis demonstrates that the improvement comes from the joint multi-modal encoder and multi-task training.

In the human evaluation, we observed a gap between human judgments and automatic metrics; introducing more appropriate evaluation metrics would contribute to the development of multi-modal summarization. Besides, we only use the MSMO dataset due to the lack of other datasets. Our proposed image reordering task is straightforward yet effective, and we will extend our method to more scenarios (e.g., vision-language pre-training models) and modalities (e.g., audio and video) in the future. Furthermore, we plan to generalize our method to other multi-modal tasks, such as multi-modal question answering.