Abstract
Document Text Summarization aims to create a short and condensed version from the original document, which transmits the main idea of the document in a few words. We formulated extractive multi-document text summarization as a combinatorial optimization problem. In which we used sentence features to select the most important content. We conduct experiments on Document Understanding Conference (DUC01) dataset using the ROUGE toolkit. Our experiments demonstrate that the proposed method contributes significant improvements over the state-of-the-art methods and heuristics.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
The growth of the Internet involves that documents spread swiftly. Thus, the users get engulfed in many documents, wondering where to access them. In this context, Document Text Summarization (DTS) appears as a viable solution because it aims to generate a condensed version of documents and convey relevant information to the reader. Therefore, users can save time through summaries instead of reading the whole set document to capture the main idea [1, 2]. Due to this situation, researchers in Natural Language Processing are focused on the text summarization task [1]. Optimization-based approaches have been gaining importance because of the excellent performance obtained due to these being effective to get an optimal solution for huge and varied spaces [3,4,5]. These helps recognize the appropriate sentences to include in a summary in the DTS context.
A domain that has been the object of study in state-of-the-art is news. The different news sources that report on a particular event contain common components that construct the main facts. Thus, DTS from multiple news articles is a valuable field of study since the number of online publications is overgrowing. This is essential to satisfy the information need of various users. For this reason, multiple datasets have been developed, such as DUC [6], TAC (Text Analysis Conference) [7], Multi-News [8], CNN [9], among others, to evaluate the effectiveness of state-of-the-art methods.
There are three approaches to generating text summaries in the literature: extractive, abstractive, and hybrid.
Extractive Text Summarization.
Proposed systems based on this approach create summaries by assigning weights to sentences according to linguistic and statistical features, then selecting the sentences with better weight by combining them. These methods generally contain two significant components: ranking and selection of sentences. In addition, extractive summarization methods ensure the generated summaries are semantically similar to the original documents [3, 10].
Abstractive Text Summarization.
This approach allows the proposed methods to create summaries using new corpus words and sentences. The processing of abstractive summarization is like the human generation of summaries. However, it requires sophisticated natural language understanding and generation techniques, such as paraphrasing and sentence fusion [10, 11].
Hybrid Text Summarization.
This approach combines the advantages of extractive and abstractive methods to process the input texts. The hybrid approach processes data in two steps: The first step is to reduce the input length of documents to create a selective summary. Afterward, the selective summary is used by an abstractive method to construct a final summary [3].
Depending on the number of documents, summarization can be classified into two tasks: Single-Document Text Summarization, which composes a summary from one document, and Multi-Document Text Summarization (MDTS), which produces a summary from a collection of documents about a particular topic [1, 3, 12].
We formulated MDTS as a combinatorial optimization problem, which we address through a Genetic Algorithm (GA). The GA does not require external resources, working in an unsupervised way. Moreover, we hypothesize is that both sentence position and coverage provide essential information to distinguish relevant sentences from documents to create news summaries. Additionally, we have tested the proposed method by generating summaries of 50, 100, and 200 words on the DUC01 dataset.
The rest of the paper is organized as follows: Sect. 2 presents the related work. Then, Sect. 3 describes the proposed summarization method. In Sect. 4, we show experimental results. Finally, the conclusions of this paper are drawn in Sect. 5.
2 Related Works
In the literature, the DTS has been tackled through many techniques, such as supervised-based methods convert the summarization task into supervised classification problem. Generally, these methods learn by training to classify sentences, indicating whether a sentence is included in the summary. State-of-the-art approaches usually use word embeddings for representing the contextual meaning of sentences. Nevertheless, proposed methods require a corpus manually staggered [3, 8]. On the other hand, unsupervised-based methods generally assign a score to each sentence of each document, describing the relevance of sentences in the text. Therefore, sentences with the highest values will be part of the extractive summary. [3, 5, 13]. In this approach, four steps have been identified to generate a summary: Term selection, term weighting, sentence weighting, and sentence selection [13]. For the last step, various textual features have been developed [13]. Some of them are presented in Table 1:
3 Proposed Method
In MDTS, the search space is more extensive than in Single DTS, making it more challenging to select the most important sentences. In this context, MDTS can be determined as an optimization problem. The documents from the collection are considered a set of sentences, and the aim is to choose an optimal subset from sentences under a length constraint. Previous works [15, 16] have proposed the GA as an alternative for the MDTS to select an optimal combinatorial subset of sentences, obtaining competitive results compared to other state-of-the-art alternatives. However, we intend to improve its performance. Therefore, we have sought to enhance the GA exploration by increasing the size of the population. The population size is one of the essential factors that affect performance [4, 17]. In general, small population sizes might lead to premature convergence and yield substandard solutions [18].
3.1 Pre-processing
In this step, the documents of each collection were ordered chronologically. Then, the sentences of documents were hierarchically ordered according to appearance in the text to create a meta-document, which contains all collection sentences. Afterward, the text of the meta-document was separated into sentences. Finally, a lexical analysis was applied to separate sentences into words [5].
3.2 Text Modeling
After preprocessing the text, it is necessary to model it. This stage aims to predict the probability of natural word sequences. The simplest and most successful form for text modeling is the n-gram, which is a text representation model that constructs contiguous subsequences of consecutive words from a given text [5].
3.3 Weighting and Selection of Sentences
Sentence weighting and selection of sentences usually worked together [13].While the first one assigns a degree of relevance for each sentence, the second one chooses the most appropriate sentences to generate extractive summaries. However, it involves a vast search space that requires to be addressed by optimization. In view of this, we propose the following GA to select the most important sentences:
Encoding:
The binary encoding was used, where each sentence of the meta-document represents a gene. The values 1 and 0 define if a sentence will be selected in the final summary [5, 13, 16].
Generation of Population:
The initial population was randomly generated. On the other hand, the population of the next generations is generated from the selection stage. The search process concludes when a termination criterion is met. Otherwise, a new generation will be produced, and the search process will continue [5, 13].
Size of the Initial Population:
The size of the population was determined according to the number of sentences from the meta-document [4, 5, 13].
Selection Operator:
The selection of individuals is performed through the roulette operator, which selects individuals of a population according to their fitness to choose individuals with a higher value. Each individual is assigned to a proportional part of the roulette according to its fitness in this operator. Finally, the selection of parents is performed, which are needed to create the next generation, and each selected individual is copied into the parent population [5].
Crossover Operator:
Crossover is a genetic operator that combines two parents to produce one or two descendants. The idea underlying crossover is that the new individual can be better than its parents if it takes the best characteristics of each parent. Crosses with priority over common genes: This crossover operator was designed to generate summaries, where each individual represents a selection of sentences. Of the selected parents, only one gene is randomly selected (with value 1) that will belong to the descendant to fulfill the number of words [4, 5].
Mutation Operator:
We used the flipping operator, which consist of changing the value of each gene, inverting from 1 to 0 or vice versa [5, 13]. First, the mutation is performed considering the genes with a value of 1 and later considering the genes with a value of 0. Afterward, it is verified that the established number of words is fulfilled. If it is not fulfilled, another gene with a value of 0 will be inverted, and this process will continue until the specified minimum number of words is satisfied.
The Fitness Function:
It was calculated by employing the concept of the slope of the line [4, 5, 16]. The slope defines the importance of sentences. The main idea is to consider the first sentence with the importance \({X}_{n}\), the second with the significance of \({X}_{n}-1\). In a text with \(n\) sentences, if the sentence \(i\) is selected for the summary, its relevance is defined as \(t(i-x)+x\), where \(x=1+ (n-1)/2\) and \(t\) is the slope to be discovered. The formula to calculate the importance of the sentence position is in Eq. 1:
where \(k\) is the number of selected sentences. On the other hand, the content coverage to retrieve all aspects from meta-document was calculated by the summation of the frequencies of the n-grams that the summary weighs. \((Precision\_Recall)\) was calculated via the sum of the frequencies of the n-grams considered in the original text divided by sum of the frequencies of the different n-grams of summary (see Eq. 2).
Finally, to obtain the value of the fitness function, the following formula was applied, which is multiplied by 1000 (see Eq. 3).
Stop Condition:
For this operator, we have used the number of generations as a stop condition.
4 Experimental and Results
4.1 Dataset
To empirically evaluate the results of the proposed method, we use the DUC01 dataset, is an open benchmark for generic automatic summarization evaluation, which is in the English language; it is composed of 309 documents split into 30 collections, which we tested with the lengths of 50, 100, and 200 words. We choose this dataset because the gold standards summaries provided in it were typed like an abstractive approach. It allowed us to measure how competitive the proposed extractive unsupervised method can be about summaries made using paraphrases, words, and sentences that do not belong to source documents.
4.2 Evaluation Measures
ROUGE (Recall-Oriented Understudy for Gisting Evaluation). It involves measures to automatically establish the quality of a summary created by a proposed method by contrasting it to other ideal summaries created by humans, called gold standard summaries [20]. These measures count the number of overlapping units such as n-gram, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans [21].
4.3 Parameter Selection
We perform tests with different parameters such as tournament and roulette selection operator, HUX crossover operator, crossover with priority on common genes, double inversion mutation, with different crossover and mutation probabilities, respectively. Also conducted our tests with varying population sizes; we multiplied the number of sentences of the meta-document from 2 and 15 to determine the best possible population size to improve the GA exploration. Per our empirical results, we conclude that good traits spread through the population for the different summaries lengths (50, 100, and 200 words) by multiplying the number of sentences from the meta-document by 9 and throughout 150 generations. Favoring the selection as parents of individuals with greater fitness value by roulette operator. Moreover, we tested n-grams of sizes from 1 to 5. According to our results, grams size 2 produces better sentence selection. In general, the parameters that produced the best results are shown in Table 2.
In [5] was realized an analysis of slope in, concluding when the slope value is negative the first sentences are more important. Contrariwise, if the slope value is 0, all sentences have the same importance. Due to this reason, in our experimentation, we have used tests with slope values from −0.1 to −1. To determine which slope value was best for each length, the best results are presented in Table 3.
As can be seen from the results obtained, when the summaries are created at a short length, the value of the slope that produced the best results is −0.1. According to [5, 16], this means that all the sentences of the meta-document have the same importance. While the size of the summary increases, the sentences that are considered important are found close to the beginning of the text. From the results obtained for the length of 100 words, the value of the slope was −0.6. While for summaries of 200 words, the value of the slope was −0.8. It means that the most important content is in the first sentences.
4.4 Description and Comparison of the State-of-the-Art Methods and Heuristics
To examine the performance of the proposed method was compared with state-of-the-art methods and heuristics. Supervised methods were not considered in the following analysis because the proposed method generates summaries from the information given in source documents, so it does not require external resources such as corpora, dictionaries, thesaurus, and lexicons. That is, it works in an unsupervised way.
CBA:
In [22] was proposed a clustering-based method for MDTS. K-means were used in clustering. To define the sentences that should be selected for the final summary. Moreover, the sequence in which it will appear. The clustering was ranked via a cosine similarity measure.
NeATS:
Lin and Hovy [23] proposed an Extractive MDTS system. The textual features such as term frequency, sentence position, stigma words, and a simplified version of Maximum Marginal Relevance were applied to choose filter content.
LexPageRank:
In this method, the importance of sentences was computed based on the idea of centrality in a graph representation of sentences. In this, the connectivity matrix is based on cosine similarity [24].
GA-1:
This method model MDTS like an optimization problem through GA[15].
Topline:
The authors calculated the upper bounds in this work, which is possible to achieve by state-of-the-art methods [10, 25].
Baseline-First:
It takes the first sentence from the document collection in chronological sequence until the target summary size is fulfilled [15].
Baseline-Random:
This randomly selects sentences to incorporate them as an extractive summary until the length is required [10, 15].
Baseline-First Document:
It includes the first 50, 100, and 200 words from the first document of a set of them until the target summary size is fulfilled [15].
Lead Baseline:
This takes the first 50, 100, and 200 words from the last document in the set, where documents are supposed to be chronologically prepared [15].
We have compared the obtained results of the proposed method to other state-of-the-art methods and heuristics. In the comparison, the values Rouge-1 and Rouge-2 are exposed. Also, there is a comparison of the level of advance between the state-of-the-art methods and heuristics. To compute the performance, we use the formula (see Eq. 4), based on the assumption that the performance of the Topline heuristic is 100% and Baseline-random is 0% [25].
Tables 4, 5, and 6 show this comparison using different summary lengths.
In the task where the summary length is 50 words (see Table 4), with the proposed method, the preceding results were improved by 12.7%, and the previous best result was the baseline-first document.
On the other hand, where the summary length is 100 words (see Table 5), the improvement is 6.08% with respect to what was considered the best result, which was LexPageRank method. As can be seen, in this length of summaries, there is a method whose performance, according to Eq. 4, is below the Baseline-random heuristic considered as the worst selection of sentences.
For the summary length is 200 words (see Table 6), the improvement was 4.01% more than the best method reported, which was GA-1. At this length, the heuristics have a better performance than in the 100 words task due to outperforming Baseline-Random, except Lead-Baseline, whose performance is even a negative value.
5 Conclusions
In this paper, we formalized the summarization of a set of documents as a combinatorial optimization problem. In particular, GA was introduced to satisfy the extraction of the most relevant content from a collection of documents by using textual features, such as coverage and sentence position. Moreover, we improve the performance by incrementing the population size to explore an optimal solution better. Finally, we perform different experiments on the available benchmark dataset DUC01 in the English language for the lengths of 50, 100, and 200 words. The results show that the method is competitive with state-of-the-art previously reported results. Also, the summaries produced by the proposed method have achieved high evaluation scores compared with abstract gold standard summaries without needing external data.
References
Gao, S., Chen, X., Ren, Z., Zhao, D., Yan, R.: From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information (2020)
Roul, R.K., Mehrotra, S., Pungaliya, Y., Sahoo, J.K.: A new automatic multi-document text summarization using topic modeling. In: Fahrnberger, G., Gopinathan, S., Parida, L. (eds.) ICDCIT 2019. LNCS, vol. 11319, pp. 212–221. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-05366-6_17
El-Kassas, W.S., Salama, C.R., Rafea, A.A., Mohamed, H.K.: Automatic text summarization: a comprehensive survey (2021). https://doi.org/10.1016/j.eswa.2020.113679
García-Hernández, R.A., Ledeneva, Y.: Single extractive text summarization based on a genetic algorithm. In: Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Rodríguez, J.S., di Baja, G.S. (eds.) MCPR 2013. LNCS, vol. 7914, pp. 374–383. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38989-4_38
Mendoza, G.A.M., Ledeneva, Y., García-Hernández, R.A.: Determining the importance of sentence position for automatic text summarization. J. Intell. Fuzzy Syst. 39, 2421–2431 (2020). https://doi.org/10.3233/JIFS-179902
Over, P., Dang, H.: DUC in context. Inf. Process. Manag. 43, 1506–1520 (2007). https://doi.org/10.1016/J.IPM.2007.01.019
NIST (National Institute of Standars and Technology: TAC 2008 Summarization Track. https://tac.nist.gov/2008/summarization/. Accessed 20 July 2020
Fabbri, A.R., Li, I., She, T., Li, S., Radev, D.R.: Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model (2019)
Lins, R.D., et al.: The CNN-Corpus: A large textual corpus for single-document extractive summarization. In: Proceedings of the ACM Symposium on Document Engineering, DocEng 2019, pp. 1–10. Association for Computing Machinery, Inc, New York, New York, USA (2019). https://doi.org/10.1145/3342558.3345388
Matias, G., Ledeneva, Y., García, R.: Detección de ideas principales y composición de resúmenes en inglés, español, portugués y ruso. 60 años de investigación. Alfaomega Grupo Editor, S.A. de C.V (2020)
Ma, C., Zhang, W.E., Guo, M., Wang, H., Sheng, Q.Z.: Multi-document Summarization via Deep Learning Techniques: A Survey (2020). https://doi.org/10.1145/nnnnnnn.nnnnnnn
Hou, S.-L., et al.: A survey of text summarization approaches based on deep learning. J. Comput. Sci. Technol. 36(3), 633–663 (2021). https://doi.org/10.1007/s11390-020-0207-x
Ledeneva, Y., García-Hernández, R.A.: Generación automática de resúmenes Retos, propuestas y experimentos. Universidad Autónoma del Estado de México (2017)
Vázquez, E., García-Hernández, R.A., Ledeneva, Y.: Sentence features relevance for extractive text summarization using genetic algorithms. J. Intell. Fuzzy Syst. 35, 353–365 (2018). https://doi.org/10.3233/JIFS-169594
Neri-Mendoza, V., Ledeneva, Y., García-Hernández, R.A.: Unsupervised extractive multi-document text summarization using a genetic algorithm. J. Intell. Fuzzy Syst. 39, 2397–2408 (2020). https://doi.org/10.3233/JIFS-179900
Neri Mendoza, V., Ledeneva, Y., García-Hernández, R.A.: Abstractive multi-document text summarization using a genetic algorithm. In: Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Olvera-López, J.A., Salas, J. (eds.) MCPR 2019. LNCS, vol. 11524, pp. 422–432. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-21077-9_39
Sastry, K., Goldberg, D., Kendall, G.: Chapter 4 Genetic Algorithms. (2005)
Du, K.L., Swamy, M.N.S.: Search and optimization by metaheuristics: techniques and algorithms inspired by nature (2016). https://doi.org/10.1007/978-3-319-41192-7
Borges, J.L.: La doctrina de los ciclos (2013)
Rojas-Simón, J., Ledeneva, Y., García-Hernández, R.A.: Evaluation of text summaries without human references based on the linear optimization of content metrics using a genetic algorithm. Expert Syst. Appl. 167, 113827 (2021). https://doi.org/10.1016/J.ESWA.2020.113827
Lin, C.-Y.: ROUGE: A Package for Automatic Evaluation of Summaries (2004)
Boros, E., Kantor, P.B., Neu, D.J.: A Clustering Based Approach to Creating Multi-Document Summaries (2001)
Lin, C.-Y., Hovy, E.: From single to multi-document summarization. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL 2002, p. 457 (2002). https://doi.org/10.3115/1073083.1073160
Wang, D., Zhu, S., Li, T., Gong, Y.: Multi-document summarization using sentence-based topic models. In: ACL and AFNLP, p. 297 (2010). https://doi.org/10.3115/1667583.1667675
Rojas Simón, J., Ledeneva, Y., García Hernández, R.A.: Calculating the Upper Bounds for Multi-Document Summarization using Genetic Algorithms. Comput. y Sist. 22 (2018) https://doi.org/10.13053/cys-22-1-2903
Author information
Authors and Affiliations
Corresponding authors
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Neri-Mendoza, V., Ledeneva, Y., García-Hernández, R.A., Hernández-Castañeda, Á. (2022). Multi-document Text Summarization Based on Genetic Algorithm and the Relevance of Sentence Features. In: Vergara-Villegas, O.O., Cruz-Sánchez, V.G., Sossa-Azuela, J.H., Carrasco-Ochoa, J.A., Martínez-Trinidad, J.F., Olvera-López, J.A. (eds) Pattern Recognition. MCPR 2022. Lecture Notes in Computer Science, vol 13264. Springer, Cham. https://doi.org/10.1007/978-3-031-07750-0_24
Download citation
DOI: https://doi.org/10.1007/978-3-031-07750-0_24
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-07749-4
Online ISBN: 978-3-031-07750-0
eBook Packages: Computer ScienceComputer Science (R0)