
1 Introduction

Given the paramount societal role of biomedicine and related natural language processing (NLP) tasks [11,12,13,14,15, 38], aggregating information from multiple topic-related biomedical papers to help search, synthesize, and answer questions is of great interest [7]. Real-world applications require indexing, combining, and summarizing evidence from clinical trials against a research background to produce systematic literature reviews (SLRs) or answer medical inquiries. Consequently, we define such activities as context-aware multi-document summarization (CA-MDS) due to the presence of an input context (i.e., a background or question) that conditions the downstream summarization task (Fig. 1). In practice, biomedical articles usually span several thousand words of domain-specific jargon and complicated expressions, making them time- and labor-intensive to understand even for professionals. Thus, automation support for biomedical activities is practical and beneficial in facilitating knowledge acquisition.

Fig. 1. The overview of biomedical CA-MDS. In our experiments, we use the input context (i.e., the background or question) to retrieve the salient studies and aggregate them to generate the target (Input \(\rightarrow \) Output). For example, given a set of topic-related scientific papers, the goal is to select the ones most correlated with a user's input question to produce a single answer. (Color figure online)

CA-MDS solutions for biomedical applications should process all inputs without ignoring any details, reducing the risk of model hallucination, namely generating unfaithful outputs, often caused by training on targets containing facts not grounded in the source. Therefore, state-of-the-art (SOTA) models rely on sparse transformers [2], Fusion-in-Decoder strategies [18], and marginalization-based decoding [34]. However, such methods either (i) have high memory requirements that force input truncation for organizations operating in low-resource regimes [29,30,31, 33, 35], or (ii) lack end-to-end learning, reducing the potential of cooperating neural modules.

In this paper, we introduce Ramses, a retrieve-and-rank summarization approach trained via end-to-end learning to retrieve salient biomedical documents by their semantic meaning and synthesize them given an input context. Ramses comprises a biomedical bi-encoder and a generative aggregator. The bi-encoder reads all the documents, represents their semantics via embeddings, and retrieves and scores the salient documents related to an input context. Then, the aggregator is conditioned on the context along with these latent documents to decode the summary by marginalizing the token probability distribution weighted by the relevance scores.

We evaluated Ramses on two biomedical CA-MDS tasks: (i) producing SLRs on the Ms2 dataset [7] and (ii) answering frequently asked questions (FAQs) about Covid-19 on our proposed dataset FAQsumC19. In detail, we collected 514 Covid-19 FAQs with high-quality abstractive answers written by experts. We then augmented each instance with 30 supporting scientific papers containing the information needed to answer the question, producing 15,420 articles. In particular, FAQsumC19 has two essential features: (i) it includes abstractive answers authored by experts, unlike other related datasets that use extractive targets [41]; (ii) it is the first CA-MDS dataset for Covid-19, providing a crucial benchmark for producing multi-document summaries that answer Covid-19 questions with the support of up-to-date related biomedical papers.

We perform extensive experiments, showing that Ramses achieves new SOTA performance on the Ms2 dataset and outperforms previous solutions on FAQsumC19, where its inferred answers are also rated as higher quality by human experts.

Fig. 2. The overview of Ramses. The input is a biomedical context and multiple studies, which are encoded by two different BioBert models. Then, we compute the relevance score of each document conditioned on the context. Finally, the top-k most salient documents are concatenated with the context and given to Bart, which marginalizes their token probability distribution, weighted by the relevance scores, at decoding time.

2 Related Work

Semantic Neural Retriever Applications. The semantic representation capability exhibited by neural networks has catalyzed the emergence of groundbreaking neural methodologies in information retrieval [10]. First, the Bm25 algorithm has been surpassed by dense passage retrieval (Dpr) [21], a remarkable neural application that has since evolved into a fundamental element of numerous neural-driven retrieval solutions [42, 43]. These neural retrievers have been fused with language models to enrich and improve the input [23], producing models with increased efficiency and improved performance [3, 12]. Despite their promising results, the end-to-end application of these solutions in MDS remains unexplored.

NLP for Biomedical Documents. Much recent work in NLP has concentrated on the biomedical domain [28], including CA-MDS [7], which can decrease the burden on medical workers by highlighting and aggregating key points while reducing the amount of information to read. Previous contributions focused on the automatic generation of SLRs. In detail, cutting-edge solutions rely on three different neural architectures: (i) transformer-based models with linear complexity in the input size thanks to sparse attention [2], which concatenate the input context along with all documents in the cluster producing a single source sequence; (ii) quadratic transformers with Fusion-in-Decoder [18], which join the hidden states of documents after encoding them individually; (iii) marginalization-based decoding augmented by frozen retrievers [34], which first pinpoints salient documents w.r.t. a query and produces a single summary by summing the probability distribution of the inferred token for each document.

MDS Solutions in Other Domains. Flat approaches with MDS-specific pre-training [49] concatenate the sources into a single text, treating MDS as a single-input task. Hierarchical approaches exploit document relations to obtain semantically rich representations by leveraging graph-based methods [1] and multi-head pooling with interparagraph attention [19]. Marginalization-based approaches [17] apply marginalization to the token probability distribution at decoding time to produce a single output from many inputs. Two-stage approaches [25] adopt different strategies to rank sources before producing the summary. Unlike previous work, Ramses is trained end-to-end to retrieve relevant text from biomedical articles and marginalize the probability distribution of the latent extracted information at decoding time.

Covid-19 Datasets. With the appearance of Covid-19, thousands of articles have been published in a short time. To aid experts in accessing this knowledge, large organizations collected corpora such as Cord-19 [47] and LitCovid [6], encouraging the proposal of task-specific datasets. Covid-QA [27] studies question answering using annotated pairs extracted from 147 papers. Covid-Q [48] collects 16,690 questions about Covid-19, classifying them into 15 categories. [40] scraped over 40 trusted websites for Covid-19 FAQs, creating a collection of 2,100 questions. [45, 52] proposed two datasets for FAQ retrieval, where user queries are semantically paired with existing FAQs. However, none of these resources target abstractive answer generation from multiple supporting studies. FAQsumC19 fills this gap, introducing the first CA-MDS dataset to answer Covid-19 FAQs by summarizing multiple related studies.

Fine-grained comparisons with previous work are in Sect. 6.1.

3 Preliminary

We provide details for context-aware multi-document summarization (CA-MDS).

Definition. CA-MDS aims to compile a summary from a cluster of related articles given an input context, analogous to the query in query-focused summarization [46]. Yet, unlike answering FAQs, SLR generation does not involve questions; thus, we define the task we face as CA-MDS. The biomedical tasks we address in this work, such as SLR generation and FAQ answering, are CA-MDS tasks because they both have an input context (i.e., the research issue in SLRs and the human question in FAQs) and many topic-related documents from which to produce the output.

Problem Formulation. In the CA-MDS setting, we have \((c, \textbf{D}, y)\), where c is the input context, \(\textbf{D}\) is the cluster of topic-related documents, and y is the target generated from \(\textbf{D}\) given c. Formally, we want to predict y from \(\{c, d_1, \ldots , d_n \mid d_i \in \textbf{D}\}\).
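For concreteness, a single CA-MDS instance can be represented as a minimal record; the Python sketch below is illustrative only, and the field names are ours, not part of any dataset schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CAMDSInstance:
    """One CA-MDS example: an input context, a cluster of documents, and a target."""
    context: str          # c: the background statement (SLR) or the user question (FAQ)
    documents: List[str]  # D: topic-related documents, e.g., abstracts of biomedical studies
    target: str           # y: the reference summary or abstractive answer
```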

4 Method

The end-to-end learning of Ramses allows the cooperating modules to jointly retrieve and aggregate key information from multiple sources in one output (Fig. 2).

Given the context c and the documents \(\textbf{D}\), our method first generates relevance scores on \(\textbf{D}\) with a biomedical solution based on Dpr [20]:

$$\begin{aligned} p_{\beta , \theta }(d \in \textbf{D} | c) = (Enc_{\beta }(d) \oplus Enc_{\theta }(c)) \end{aligned}$$
(1)

where \(Enc_{\beta }\) and \(Enc_{\theta }\) are two different BioBert-base models trained to produce a dense representation of the documents and the context [39], respectively, \(\oplus \) is the inner product between them, and p(d|c) is the relevance score associated with the document d given c. Thus, our solution finds the top-k most relevant texts according to c. Then, given c and each \(d \in \) top-k, a Bart-base model [22] draws a distribution over each next output token for each d, before marginalizing:

$$\begin{aligned} p(y | c, \textbf{D}) = \prod _{z=1}^{N} \sum _{d \in \text {top-}k} p_\theta (d|c)\, p_\gamma (y_z | d^{'}, y_{1:z-1}) \end{aligned}$$
(2)

where \(d^{'} = [c, tok, d]\) is the concatenation of c and \(d \in \text {top-}k\) with a special text separator token (<doc-sep>) to make the model aware of the textual boundary, N is the target length, and \(p_\gamma (y_z | d^{'}, y_{1:z-1})\) is the probability of generating the target token \(y_z\) given \(d'\) and the previously generated tokens \(y_{1:z-1}\).
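To make Eqs. 1–2 concrete, the following PyTorch sketch shows one decoding step, assuming the relevance scores are normalized with a softmax over the top-k documents (one common way to obtain the weights \(p_\theta (d|c)\)); all names and tensor shapes are illustrative and do not reproduce the released implementation.

```python
import torch
import torch.nn.functional as F

def relevance_scores(doc_embs: torch.Tensor, ctx_emb: torch.Tensor) -> torch.Tensor:
    """Eq. 1: inner product between each document embedding Enc_beta(d) and Enc_theta(c)."""
    return doc_embs @ ctx_emb                        # shape: (num_docs,)

def marginal_next_token(per_doc_logits: torch.Tensor, topk_scores: torch.Tensor) -> torch.Tensor:
    """One step of Eq. 2: sum the per-document next-token distributions,
    each weighted by the (normalized) relevance of its retrieved document."""
    p_doc = F.softmax(topk_scores, dim=0)            # p(d | c) over the top-k documents
    p_tok = F.softmax(per_doc_logits, dim=-1)        # p(y_z | d', y_{1:z-1}), shape (k, vocab)
    return (p_doc.unsqueeze(-1) * p_tok).sum(dim=0)  # (vocab,) marginal token distribution
```

A greedy decoder, for instance, would pick the most likely token from the returned marginal distribution at each step.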

We train our Ramses model by minimizing the negative marginal log-likelihood of each target with the following loss function:

$$\begin{aligned} \mathcal {L} = -\sum _i \log p(y_i | c_i, \textbf{D}_i) \end{aligned}$$
(3)

End-to-End Learning. The model (Eq. 2) allows the gradient to backpropagate to all modules. For clarity, we rewrite the formula as a continuous function, as follows:

$$\begin{aligned} {\textsc {Ramses}} (\textbf{D}, c) & = & \sum _{(d_j, s_j)} B_\gamma ([c, tok, d_j]) \cdot s_j \end{aligned}$$
(4)
$$\begin{aligned} \text {top-}k(\textbf{D}, c) & = & [(d_1, s_1), \ldots , (d_k, s_k)] \end{aligned}$$
(5)
$$\begin{aligned} s_j & = & Enc_{\beta }(d_j) \oplus Enc_{\theta }(c) \end{aligned}$$
(6)

where \((d_j, s_j) \in \text {top-}k\) and \(B_\gamma \) is Bart.

The presence of \(s_j\) in Eq. 4 allows the gradient, computed by minimizing the objective function, to reach \(Enc_{\beta }\) and \(Enc_{\theta }\). For this reason, the document and context embeddings are adjusted during training to improve the generated summary, making all modules of our solution learn jointly in an end-to-end fashion.
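The sketch below illustrates how a single training step under Eq. 3 lets the summarization loss update both encoders: the scores \(s_j\) remain inside the computation graph, so backpropagation reaches \(Enc_{\beta }\) and \(Enc_{\theta }\). The helper per_doc_token_logprobs is hypothetical (it would run the generator on each \([c, tok, d_j]\) concatenation), and the softmax normalization of \(s_j\) is an assumption.

```python
import torch
import torch.nn.functional as F

def training_step(ctx_encoder, doc_encoder, generator, optimizer, batch, k):
    """One end-to-end step sketching Eqs. 2-6; names and batching details are illustrative."""
    ctx_emb = ctx_encoder(**batch["context"]).pooler_output.squeeze(0)   # Enc_theta(c)
    doc_embs = doc_encoder(**batch["documents"]).pooler_output           # Enc_beta(d_j), (n, h)

    scores = doc_embs @ ctx_emb                      # Eq. 6: s_j, kept differentiable
    topk = torch.topk(scores, k)                     # Eq. 5: top-k documents and their scores
    log_p_doc = F.log_softmax(topk.values, dim=0)    # assumed normalization of s_j into p(d|c)

    # Hypothetical helper: log p_gamma(y_z | [c, <doc-sep>, d_j], y_{1:z-1}),
    # returned as a tensor of shape (k, target_len, vocab_size).
    tok_logprobs = per_doc_token_logprobs(generator, batch, topk.indices)

    # Eq. 2 (token-level marginalization) followed by Eq. 3 (negative log-likelihood).
    marginal = torch.logsumexp(tok_logprobs + log_p_doc.view(-1, 1, 1), dim=0)  # (T, vocab)
    nll = -marginal.gather(-1, batch["target_ids"].unsqueeze(-1)).sum()

    nll.backward()            # gradients flow to the generator and, via s_j, to both encoders
    optimizer.step()
    optimizer.zero_grad()
    return nll.item()
```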

Table 1. The question-cluster pairs’ quality. Best values are bolded.

5 FAQsumC19 Dataset

We introduce a new dataset, FAQsumC19, containing 514 Covid-19-related FAQs with abstractive answers written by experts, each supported by 30 abstracts of scientific articles, for a total of 15,420 documents. We obtained all available question-answer pairs from the Covid-19 FAQ section of the WHO website. We then augmented each instance with 30 Covid-19 scientific articles strictly related to the question from the updated version of the Cord-19 dataset [47]. Specifically, we experimented with selecting the supporting articles using different information retrieval methods, such as a random baseline, Bm25 [44], and Sublimer [38]. We used the concatenation of the question and the answer to retrieve the 30 top-ranked documents by semantic similarity, creating a knowledge base to support answer generation. We finally split the dataset into 464 instances for training (\(\approx 90\%\)) and 50 for testing (\(\approx 10\%\)).

To assess the quality of the question-cluster pairs in our dataset, we computed the content coverage with ROUGE-1 precision [24] and BERTScore [51] of the question-answer concatenation w.r.t. each document in the cluster and calculated the average score. These metrics evaluate the lexical and semantic overlap between the question-answer pair and the supporting texts. Table 1 reveals that Sublimer achieves the best scores, as expected.
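As a simplified illustration of the coverage computation, the sketch below approximates ROUGE-1 precision with plain clipped unigram precision rather than the official scorer; the BERTScore part is analogous and omitted, and the function names are ours.

```python
from collections import Counter
from typing import List

def unigram_precision(reference: str, candidate: str) -> float:
    """Clipped unigram precision of `candidate` w.r.t. `reference` (a ROUGE-1-P approximation)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return overlap / total

def cluster_coverage(question: str, answer: str, documents: List[str]) -> float:
    """Average coverage of the question-answer concatenation over its document cluster."""
    qa = f"{question} {answer}"
    scores = [unigram_precision(doc, qa) for doc in documents]
    return sum(scores) / len(scores)
```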

6 Experiments

6.1 Experimental Setup

Datasets. Table 2 reports the statistics of the datasets used to test Ramses on different biomedical tasks. Ms2 [7] consists of 15,597 instances derived from the scientific literature. Each sample is composed of (i) the background statement, which describes the research issue (the context); (ii) the target statement, which is the summary to generate; and (iii) the studies, which are the abstracts of biomedical documents containing the information needed for the research issue. FAQsumC19 is our proposed dataset, which comprises 514 Covid-19 FAQs with abstractive answers written by experts, each supported by 30 abstracts of scientific papers.

Table 2. The datasets used for evaluation (FAQsumC19 is ours). Statistics include dataset size and the average (i) number of source (S) documents per instance, (ii) number of total words in S and target (T) texts, and (iii) S-T compression ratio of words [16].

Baselines. We compare Ramses with SOTA solutions: Bart-FiD [7], which is Bart with the Fusion-in-Decoder strategy [18], encodes all sources individually and combines their hidden states before decoding. Led-Gaq [7], which is Led [2] with global attention on the input query, concatenates all texts into a single input of up to 16,384 tokens. Damen [34], a retrieval-enhanced solution with marginalization-based decoding, discriminates important fragments of the cluster with a frozen Bert-base model and marginalizes their probability distribution during decoding. Primera [49], which is Led pre-trained with a multi-document summarization-specific objective, concatenates the texts with a special separator token up to 4096 tokens in size.

Evaluation Metrics. We use ROUGE-1/2/L [24] to assess fluency and informativeness. We also adopt \(\mathcal {R}\) [32] as an aggregated judgment that considers the variance of the ROUGE scores. Finally, we perform a qualitative analysis to compensate for the superficiality of automatic evaluation measures.

Implementation. We fine-tune the models using PyTorch and the HuggingFace library, setting the seed to 42 for reproducibility. Ramses is trained on an NVIDIA RTX 3090 GPU with 24 GB of memory from an internal cluster for 1 epoch with a learning rate of 3e-5 on Ms2 and for 3 epochs with a learning rate of 1e-5 on FAQsumC19. For decoding, we use beam search with 4 beams and the following min-max target sizes: 32–256 for Ms2 and 100–256 for FAQsumC19.
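For reference, the decoding setup maps onto standard HuggingFace generation arguments; the sketch below uses the public facebook/bart-base checkpoint as a stand-in, since the fine-tuned Ramses weights are not assumed to be available here, and the source string is purely illustrative.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Illustrative checkpoint; in practice this would be the fine-tuned aggregator.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

source = "research background ... <doc-sep> retrieved study abstract ..."
inputs = tokenizer(source, return_tensors="pt", truncation=True)

summary_ids = model.generate(
    **inputs,
    num_beams=4,     # beam search with 4 beams
    min_length=32,   # min-max target size: 32-256 for Ms2, 100-256 for FAQsumC19
    max_length=256,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```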

Table 3. Performance of models on the evaluation datasets. The best scores are in bold.
Table 4. ROUGE F1 scores (R-1, R-2, R-L) on Ms2 on evaluating Ramses with different generator checkpoints (B and L stand for base and large, respectively) and k documents retrieved at training time. Oom means “GPU out of memory exception.” The best results are bolded.

6.2 Results

Table 3 reports the performance of the models on the two evaluation datasets. Ramses yields better scores, suggesting that retrieve-and-rank end-to-end learning is more effective than prior SOTA approaches on both biomedical CA-MDS tasks.

The Impact of k. As our method relies on learning to select the top-k most relevant documents from the cluster, the value of k is crucial for model performance and GPU memory occupation. Therefore, we analyze the impact of k on model performance by experimenting with different numbers of documents to retrieve: 3, 6, 9, 12, 15, and 18. Table 4 reports a slight performance improvement as k increases until a threshold is reached (e.g., \(k=9\) for Bart-base), indicating that marginalizing over more documents helps produce better ROUGE scores. However, a high k (i.e., \(k\ge 12\)) can also increase information redundancy and contradiction, lowering the final performance. Table 4 also lists the results of using different single-document summarization models as the aggregator's checkpoint, such as Bart and Pegasus [50]. We notice that Bart-large achieves better ROUGE scores, although Pegasus is the largest model. However, as Bart-base achieved only slightly lower results despite noticeably fewer trainable parameters, we chose it for all experiments. We then tested the best Bart-base checkpoint, trained with \(k=9\), using different values of k at inference time on Ms2. Table 5 reports that the best performance was achieved with \(k=12\). Furthermore, Table 5 also shows the results on FAQsumC19 with different values of k at training time, revealing a trend similar to Ms2.

Memory Requirements. Figure 3 shows the memory occupation at training time of Ramses for each k. We notice that the memory occupation is linear w.r.t. k, indicating that our solution is not computationally expensive, even for large clusters.

Table 5. The results of Ramses on Ms2 by varying k at inference time and on FAQsumC19 by varying k at training time. The best scores are bolded.

6.3 Ablation Studies

Table 6 reports the ablation studies on Ms2 using Ramses with Bart-base and \(k=9\), keeping the same hyperparameter settings for all experiments.

Fig. 3. Ramses's GPU memory requirements by varying k at training time.

Table 6. The ablation studies on Ms2. We gradually remove each module of Ramses to show the performance drop. The best scores are in bold.

Excluding the input context from the input concatenation given to the generative aggregator (w/o context) leads to the most significant decrease in performance. Indeed, the context is the research question shared by all documents in the cluster, so it contains important information for the final summary.

Training a single model to encode both the context and the documents (w/o bi-encoder), namely using a shared BioBert model, decreases performance. Indeed, since the context and the documents serve two different purposes (i.e., we need the context to select context-related documents), two models are needed to specialize and differentiate the text representations. Despite the similarity of the two encoding tasks, they differ in two main ways: (i) the context is considerably shorter than the documents in the cluster, and (ii) the conceptual density is higher in the context than in the more verbose documents.

Removing the token separator <doc-sep> between the context and document (w/o token-sep) decreases performance. Indeed, this token is needed to make the model aware of the textual boundary between the context and the documents.

Using cosine similarity instead of the inner product (w/o inner-product) to score the documents against the input context achieves the worst results.

Freezing \(Enc_\beta \) (w/o trained-retrieval) decreases performance, highlighting the usefulness of end-to-end learning to allow the model to select more informative documents.

Swapping the positions of the context and the documents in the input concatenation (w/o context-first) decreases performance, indicating that placing the context first in the input helps the generative aggregator focus on how to join context-related information.

6.4 Human Evaluation

Considering the drawbacks of automatic metrics such as ROUGE [9], which is still the standard for evaluating text generation, we asked three domain experts holding master's degrees in medical and biological areas to qualitatively evaluate the answers inferred for the FAQs of the entire FAQsumC19 test set.

Instructions. We gave the evaluators a table, with each row containing the question and three possible answers in random order: (i) the “gold” answer from WHO, (ii) the prediction of Ramses, and (iii) the prediction of Led-Gaq (the second-best model on FAQsumC19 according to \(\mathcal {R}\)). Each expert was asked to order the answers according to how thoroughly they answered the question, focusing primarily on factuality. For fairness, we did not inform the evaluators about the answers' origins or the goal of the test. Overall, the experts completed the task in two days, reporting no difficulty in ordering the answers for the 50 test questions.

Results. The evaluation results, reported in Table 7, show that our method produces more informative abstractive answers to a given open question than a linear transformer with sparse attention. To be precise, the experts rated 76% of our solution's answers as better than those of Led, with 46% agreement between the annotators (i.e., all three evaluators agree 46% of the time). Furthermore, the evaluators also found that 7.33% of the answers inferred by Ramses are more informative than the “gold” answers from WHO. Nevertheless, model generations are still far from being as informative as the gold answers, indicating the limitations of current neural language models on FAQsumC19.

Table 7. The human evaluation on FAQsumC19 with inter-annotator agreement (IAA) using WHO ground-truth answers and the inferred ones by Ramses and Led-Gaq. A>B means how many times the generated answer from A was scored higher than B.

7 Conclusion

In this paper, we introduced Ramses, a retrieve-and-rank end-to-end learning solution for CA-MDS of biomedical studies. Ramses is designed to simultaneously acquire indexing capabilities and retrieve pertinent documents to generate comprehensive summaries. Through multiple experiments on two biomedical datasets (including our proposed FAQsumC19 to answer Covid-19 FAQs), we found that Ramses outperforms SOTA models. This finding suggests that the integrated retrieval mechanism significantly benefits the CA-MDS task. Yet, human assessments indicate that there is still notable room for improvement, motivating further research in pursuit of novel retrieval applications within the realm of biomedical multi-document summarization.

Future work can investigate and include multimodal [36, 37], cross-domain [8], and knowledge propagation [4, 5, 26] approaches.