1 Introduction

Automatic Text Summarization (ATS) is considered one of the most critical tasks in Natural Language Processing (NLP) and remains an open problem. Over the last three decades, a great variety of advances has been presented at the Document Understanding Conferences (DUC) and Text Analysis Conferences (TAC) workshops,Footnote 1 organized by the National Institute of Standards and Technology (NIST). These workshops have focused on generating summaries in English. However, other organizations have reported several advances in the state of the art. One of the primary organizations in this area is the Interinstitutional Center for Computational LinguisticsFootnote 2 (NILC, abbreviation in Portuguese), which has produced several advances and resources for NLP in Portuguese.

Since 1993, researchers at NILC and elsewhere have developed several applications for Portuguese ATS. These include the use of supervised and unsupervised machine learning methods [1, 2], discursive knowledge models [3,4,5], identification of a “gist sentence” from the source documents to generate extractive summaries [6], Text Simplification (TS) [7], and complex networks and graph-based methods for text analysis [8,9,10,11]. In addition, some ATS systems generate extractive summaries through optimization-based methods [12,13,14,15]. Most of these works present ATS systems for Single-Document Summarization (SDS) and Multi-Document Summarization (MDS) using corpora such as TeMario and CSTNews, respectively [16,17,18].

In [19, 20] it is noted that the primary challenge of ATS is to generate extractive summaries as similar as possible to the summaries created by humans (gold-standard summaries). However, in several domains, the gold-standard summaries are made by substituting some terms (words and phrases) from the source documents. This is the case for the DUC, TeMario, and CSTNews corpora [16, 18, 21]. Consequently, the maximum achievable similarity will be less than 100%, and the upper bound will therefore be lower for any method. Determining the maximum similarity to the gold-standard summaries involves searching and evaluating a large number of possible sentence combinations from a document to generate the best extractive summary.

Currently, some heuristics are used to compare the performance of state-of-the-art methods and measure their level of advance. These heuristics are known as Baseline-first and Baseline-random, which reflect the standard and lower bounds, respectively [19]. In addition, the Topline heuristic has been introduced in recent works to reflect the upper bounds [22]. These heuristics have been used to calculate the significance of SDS and MDS methods [19, 20]. However, for Portuguese SDS no such significance analysis had been performed to compare the best state-of-the-art methods, because the Topline was unknown.

The use of optimization-based methods for SDS and MDS has proven to be a viable way to generate extractive summaries of superior performance; such methods include Genetic Algorithms (GAs) [13], and they can obtain extractive summaries closest to the human-written summaries. In this paper, a GA is used, with some parameter adjustments with respect to [19], to find the sentence combinations most similar to the sentences selected by humans in Portuguese, using evaluation measures of the ROUGE system. Moreover, different summary lengths and sentence segmentations were considered as constraints to calculate the upper bounds.

The rest of the paper is organized as follows: Sect. 2 presents related and previous works that have used exhaustive-search and optimization-based techniques to determine the best sentence combinations and calculate the significance of SDS and MDS methods. Section 3 describes the structure and development of the proposed GA. Section 4 presents the experimental configuration of the GA used to determine the Topline for the TeMario corpus, along with a significance analysis to identify the best state-of-the-art methods using the Baseline-first, Baseline-random, and Topline heuristics. Finally, Sect. 5 describes the conclusions and future work.

2 Related Works

Over the last two decades, many problems have been addressed around ATS (e.g., automatic evaluation of summaries [23, 24], sentence boundary detection, language analysis, etc.). However, few studies have tried to determine the best extractive summaries. Some related works use techniques based on exhaustive searches to represent the summaries made by humans [25, 26].

In the work of Ceylan [25], an exhaustive search-based method was presented to obtain the best combinations of sentences. Unlike Lin and Hovy [26], this method employs a probability density function to reduce the number of possible sentence combinations in different domains (literary, scientific, journalistic, and legal). Each sequence is evaluated with some metrics of the ROUGE system. A similar approach was followed in [27]. Nevertheless, the main problem of this method is that it processes only part of several source document subsets to keep them manageable [19, 20]; this strategy can therefore generate biased results.

In the work of [28], nine heuristic methods to prune and score the sentence combinations for SDS and MDS were presented. First, the redundant sentences are removed. Subsequently, the remaining sentences are fed into eight methods that score them against the gold-standard summaries, with the purpose of eliminating the low-scoring sentences. However, using several heuristics to determine the best combinations of sentences across different domains and inputs increases the computational cost of finding the best sentence combinations. Furthermore, for SDS only a single gold-standard summary was used, and for MDS only 533 of the 567 documents of DUC02 were used, generating more biased results.

Finally, a calculus of significance and upper bounds for SDS and MDS using GAs was presented in [19, 20]. Using three heuristics (Baseline-random, Baseline-first, and Topline) that represent the lower, standard, and upper bounds, the percentage of advance of several state-of-the-art methods for SDS and MDS was calculated, using DUC01 and DUC02 as test datasets. Unlike the previous works, all sentences were considered as candidates to construct the best extractive summaries and to calculate the upper bounds using GAs. In this paper, we propose the calculus of upper bounds in Portuguese using GAs to find the best combinations of sentences that can be generated from the single-document summaries of the TeMario corpus, and we rank the best SDS methods for Portuguese.

3 Calculating Upper Bounds

To calculate the upper bounds for SDS in Portuguese, we propose the use of the typical steps and procedures of the basic GA described in [29] to evaluate several combinations of sentences in an optimized search space. In this section, the main stages of the proposed GA for calculating the upper bounds are described.

Solution Representation.

In previous works [19, 20], the solution is represented by coding individuals according to the order in which sentences can appear in the extractive summary. Each individual \( X_{i} \) (a candidate for the best extractive summary) is represented as a vector of \( n \) positions \( [P_{1} , P_{2} , \ldots , P_{n} ] \), where each position holds a sentence \( \{ S_{1} , S_{2} , \ldots , S_{n} \} \) of the original document \( D \). To interpret a coding as an extractive summary, sentences are taken from the beginning of the vector until a word limit is reached.
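This encoding can be illustrated with a short sketch (hypothetical code, not the authors' implementation): an individual is a permutation of sentence indices, and the summary is read off from the front of the vector until the word limit is reached.

```python
def decode_individual(individual, sentences, word_limit):
    """Take sentences in chromosome order until the word limit is reached."""
    chosen, words = [], 0
    for idx in individual:
        length = len(sentences[idx].split())
        if words + length > word_limit:
            break
        chosen.append(idx)
        words += length
    # Present the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]

sentences = ["O gato dorme no sofá .",
             "A economia cresceu este ano .",
             "O governo anunciou novas medidas ."]
# Individual [2, 0, 1] proposes sentence 2 first, then 0, then 1.
print(decode_individual([2, 0, 1], sentences, word_limit=12))
```

With a 12-word limit, only the first two genes fit, so the summary contains sentences 2 and 0 (emitted in document order).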

Fitness Function.

The evaluation of individuals is an essential stage of the GA, where each candidate summary \( X_{i} \) is evaluated according to the F-measure score of the ROUGE system metrics [30]. The maximum F-measure score of a summary \( X_{k} \), obtained over \( g \) generations, determines the best combination of sentences found by the GA. This maximization is shown in Eq. (1), where \( n \) is the length of the n-grams used to evaluate candidate summaries.

$$ Max\left( {F\left( {X_{k} \left( g \right)} \right)} \right) = \frac{{\mathop \sum \nolimits_{{S \in S_{ref} }} \mathop \sum \nolimits_{{gram_{n} \in S}} Count_{match} (gram_{n} )}}{{\mathop \sum \nolimits_{{S \in S_{ref} }} \mathop \sum \nolimits_{{gram_{n} \in S}} Count(gram_{n} )}}, g = \{ 0, \ldots ,G\} $$
(1)

In this case, we optimize the ROUGE-1 and ROUGE-2 metrics (evaluation based on bag-of-words and bigrams, respectively), because these metrics have obtained the highest correlations with human judgments [30]. \( F \) is the F-measure score of the ROUGE system, and \( Count_{match} \left( {gram_{n} } \right) \) is the number of co-occurring n-grams between a candidate summary \( X_{i} \) and the gold-standard summary. If the candidate summary \( X_{k} (g) \) has the highest co-occurrence of n-grams in the whole population \( X_{i} \left( g \right) \), then it contains the best combination of sentences, because it retrieves the largest number of n-grams.
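The n-gram co-occurrence ratio in Eq. (1) can be sketched as follows. This is a minimal illustration of the counting scheme, not the official ROUGE toolkit; the function names are ours.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate_tokens, reference_tokens, n):
    """Clipped n-gram matches over the reference n-gram total (Eq. (1) style)."""
    cand, ref = ngrams(candidate_tokens, n), ngrams(reference_tokens, n)
    match = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return match / total if total else 0.0

ref = "o governo anunciou novas medidas".split()
cand = "o governo anunciou cortes".split()
print(rouge_n(cand, ref, 1))  # 3 of 5 reference unigrams matched -> 0.6
```

Candidate summaries with higher co-occurrence scores are fitter; the GA maximizes this value over the population.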

Initialization of Individuals.

The population of individuals (when \( g = 0 \)) is initialized with codings of random numbers that assign each sentence of the source document \( D = \{ S_{1} , S_{2} , \ldots , S_{n} \} \) to a position \( P_{i} \) of \( [P_{1} , P_{2} , \ldots , P_{n} ] \). Therefore, the first generation of individuals is given by Eq. (2), where \( a_{s} \) is an integer in \( \{ 1, 2, \ldots , n\} \) corresponding to the index of the selected sentence in document \( D \), \( c = 1, 2, \ldots ,N_{pop} \), \( s = 1,2, \ldots ,n \), and \( n \) is the number of sentences in the source document.

$$ X_{c} \left( 0 \right) = \left[ {X_{c,1} \left( 0 \right),X_{c,2} \left( 0 \right), \ldots ,X_{c,n} \left( 0 \right)} \right],X_{c,s} = a_{s} $$
(2)

Therefore, each sentence has the same probability of being included in an extractive summary, subject to the number \( W \) of requested words (see Eq. (3)).

$$ \sum\nolimits_{{S_{i} \in Summary}} {l_{i} \le W} $$
(3)

where \( l_{i} \) is the length of sentence \( S_{i} \) (measured in words) and \( W \) is the maximum number of words allowed in an extractive summary. In this case, we used a different number of words per document as a constraint, because the documents of TeMario (gold-standard summaries and source documents) involve different compression rates.
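The initialization described by Eq. (2) amounts to drawing random permutations of the sentence indices, so that every sentence has an equal chance of appearing early enough to enter the summary. A minimal sketch (our own helper, with an assumed `seed` parameter for reproducibility):

```python
import random

def init_population(n_sentences, pop_size, seed=None):
    """Generation g = 0: each individual is a random permutation of
    sentence indices 0..n_sentences-1 (Eq. (2))."""
    rng = random.Random(seed)
    return [rng.sample(range(n_sentences), n_sentences)
            for _ in range(pop_size)]

pop = init_population(n_sentences=10, pop_size=5, seed=42)
print(len(pop), len(pop[0]))  # 5 individuals, each with 10 genes
```

Decoding any of these individuals against the word limit \( W \) of Eq. (3) yields a valid candidate summary.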

Selection.

In the selection stage, we propose the use of two selection operators to obtain the best subset of individuals from each population. The first is the elitism operator, which selects a small subgroup of the fittest individuals from generation \( g \) to pass to the next generation (\( g + 1 \)). The remaining individuals of each generation are selected with the tournament selection operator. This operator generates several subsets of \( N_{Tor} \) randomly picked individuals and retrieves the individual with the best fitness value, as shown in Eq. (4), where \( X_{b} \left( g \right) \) is the individual with the best fitness value and \( F \) is the F-measure score of the ROUGE metric.

$$ X_{b} \left( g \right) = argmax\left( {F\left( {X_{1} \left( g \right)} \right), F\left( {X_{2} \left( g \right)} \right), \ldots , F\left( {X_{{N_{Tor} }} \left( g \right)} \right)} \right) $$
(4)

In summary, the selection stage uses the elitism operator to keep a percentage of the best individuals of each population, and the remaining individuals are obtained with the tournament selection operator, using samples of two randomly chosen individuals.
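The combination of elitism and tournament selection can be sketched as follows (hypothetical code under our own parameter names; `elite_frac` and `n_tor` are assumptions, not values from the paper):

```python
import random

def select(population, fitness, elite_frac=0.1, n_tor=2, rng=random):
    """Keep the elite_frac fittest individuals, then fill the rest of the
    new population with winners of size-n_tor tournaments (Eq. (4))."""
    ranked = sorted(population, key=fitness, reverse=True)
    n_elite = max(1, int(elite_frac * len(population)))
    selected = ranked[:n_elite]          # elitism
    while len(selected) < len(population):
        contestants = rng.sample(population, n_tor)
        selected.append(max(contestants, key=fitness))  # tournament winner
    return selected
```

With `n_tor=2` this matches the text's choice of tournaments over two randomly obtained individuals.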

Crossover.

For the crossover of individuals, we use the cycle crossover (CX) algorithm to interchange a subset of genes according to a start point (initial gene). Before the CX operator starts, a crossover probability \( P \) determines the subset of individuals that will perform the genetic exchange. Therefore, if \( b_{rand} \) (a random number) is between 0 and \( P \), the operator selects a starting point and performs the genetic exchange between parents \( X_{p1} \left( g \right) \) and \( X_{p2} \left( g \right) \) to generate an offspring \( Y_{i} \left( g \right) \); otherwise, the first parent \( \left( {X_{p1} \left( g \right)} \right) \) becomes \( Y_{i} \left( g \right) \). To produce the second offspring, the roles of \( X_{p1} \left( g \right) \) and \( X_{p2} \left( g \right) \) are exchanged.
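A minimal sketch of one CX offspring is shown below (our own illustration of the standard cycle crossover, assuming permutation-coded parents of equal length): the genes lying on the cycle that starts at `start` are copied from the first parent, and all remaining positions are filled from the second parent, so the result is always a valid permutation.

```python
def cycle_crossover(p1, p2, start=0):
    """One offspring of cycle crossover (CX) between permutations p1 and p2."""
    child = [None] * len(p1)
    i = start
    while child[i] is None:
        child[i] = p1[i]            # genes on the cycle come from p1
        i = p1.index(p2[i])         # follow the cycle: where p2's gene sits in p1
    # Positions outside the cycle are inherited from p2.
    return [p2[j] if g is None else g for j, g in enumerate(child)]

print(cycle_crossover([1, 2, 3, 4, 5], [3, 4, 1, 2, 5]))  # [1, 4, 3, 2, 5]
```

Swapping the roles of the two parents in the call produces the second offspring, as described above.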

Mutation.

In the mutation stage, we take the set of individuals \( Y_{i} \left( g \right) \) and generate individuals \( Z_{i} \left( g \right) \) by modifying some genes in each population. We use the insertion mutation operator, which randomly selects a pair of genes \( Y_{i,t} \left( g \right) \) and \( Y_{i,r} \left( g \right) \) of an individual and inserts the gene \( Y_{i,t} \left( g \right) \) at the position of gene \( Y_{i,r} \left( g \right) \), as shown in Eq. (5), where \( r \) indicates the insertion position, \( t \) the position of the gene to be moved, both elements of \( s = \{ 1, 2, \ldots , n\} \), and \( n \) is the number of sentences \( S_{i} \) in the source document \( D \).

$$ Z_{i,s} \left( g \right) = \left\{ {\begin{array}{*{20}l} {Y_{i,t} \left( g \right) = Y_{i,r} \left( g \right), Y_{i,t \pm 1} \left( g \right) = Y_{i,t} \left( g \right), \ldots , Y_{i,r} \left( g \right) = Y_{i,r \pm 1} \left( g \right); } \hfill & {if\;0 < rand \le P} \hfill \\ {Y_{i,s} \left( g \right)} \hfill & {otherwise} \hfill \\ \end{array} } \right. $$
(5)

Therefore, if \( rand \) (a random number) is between 0 and \( P \), the mutation is performed by the insertion operator; otherwise, the individual is not modified.
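The insertion operator of Eq. (5) can be sketched as follows (hypothetical code; the `prob` and `rng` parameters are our assumptions): a gene is removed from one random position and reinserted at another, shifting the genes in between, which preserves the permutation property.

```python
import random

def insertion_mutation(individual, prob, rng=random):
    """With probability prob, move the gene at random position t to
    random position r (insertion mutation, Eq. (5))."""
    child = list(individual)
    if rng.random() < prob:
        t, r = rng.sample(range(len(child)), 2)
        gene = child.pop(t)       # remove the selected gene
        child.insert(r, gene)     # reinsert it at the target position
    return child
```

Because genes are moved rather than overwritten, no sentence index is duplicated or lost.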

Replacement of Individuals.

Following the previous works [19, 20], the replacement step integrates the set of individuals generated by elitist selection (\( E\left( {g + 1} \right) \)) and the set of individuals \( Z_{i} (g) \) from the mutation stage to form the population of the next generation \( \left( {X_{i} \left( {g + 1} \right) = E\left( {g + 1} \right) \mathop \cup \nolimits Z_{i} \left( g \right)} \right) \).

Termination Criterion.

The termination criterion used to halt the GA iterations is the number \( G \) of generations, established as a stopping constraint. In the experimentation stage, 50 generations were used for each document of TeMario, because this was the best parameter found.

4 Experimental Results

In this section, we describe the TeMario corpus and the experiments performed to generate the best extractive summaries. Moreover, the performance of some state-of-the-art methods and heuristics is presented to determine which methods are most significant.

4.1 TeMario Corpus

TeMario (derived from “TExtos com suMÁRIOs”) is a corpus of 100 newspaper articles written in Brazilian Portuguese. 60 documents were published by the online Brazilian newspaper Folha de São Paulo, and 40 documents by the newspaper Jornal do Brasil [16]. The TeMario documents are distributed equally across five sections (Special, World, Opinion, International, and Politics).Footnote 3 Moreover, a gold-standard summary was generated for each document of TeMario by a professional writer. Table 1 shows some features of the TeMario corpus.

Table 1. TeMario corpus description [16].

Unlike the DUC datasets, TeMario was not created with specific constraints to support comparing the performance of state-of-the-art methods. One of the main problems derives from the lack of explicit identification of sentences or phrases for generating summaries, because no sentence labeling was provided. For this reason, we segment the sentences in three different ways.Footnote 4 The first segmentation divides the documents by paragraph; the second splits the source documents into sentences manually (Tagged); and the third divides the documents into sentences with an automatic sentence boundary detection tool (SENTER) developed by the same author as TeMario.Footnote 5 Table 2 shows the number of sentences in each segmentation.

Table 2. Number of sentences obtained from TeMario corpus using different segmentations.

According to the number of sentences obtained from the different segmentations (see Table 2), varying the segmentation of terms generates different sequences with which to construct extractive summaries, and the performance of SDS methods can therefore be affected. The division by paragraphs yields fewer units to combine. The Tagged and SENTER segmentations present a similar number of sentences; however, they capture different sequences of terms, since the Std indicator differs between segmentations.

4.2 Parameters of GA

With respect to the GA, different parameter settings were tested; the best parameters found are presented in Table 3. Unlike the previous works [19, 20], the Topline was calculated considering the different sentence segmentations described above. Moreover, the gold-standard summaries were written with different lengths, so a single length could not be used to determine the upper bounds. Considering this constraint, the Topline was calculated for different lengths: 1. Summaries with a compression rate of 30% (parameter proposed in [1, 10]). 2. Summaries with 100 words. 3. Summaries matching the word length of the gold-standard summaries.

Table 3. GA parameters to calculate the upper bounds of TeMario corpus.

4.3 State-of-the-Art Methods and Heuristics

In this paper, we determine the level of advance of state-of-the-art methods with respect to three heuristics (Baseline-first, Baseline-random, and Topline). The methods and heuristics considered in this comparison are the following:

  • Baseline-first: This heuristic uses the first sentences of the source text, up to a word limit, to generate an extractive summary. It has yielded good results in SDS and MDS [19]; however, it should be surpassed by state-of-the-art methods.

  • Baseline-random: This heuristic selects a random set of sentences, up to a word limit, to generate an extractive summary. It allows us to determine how significant the performance of the state-of-the-art methods is [29].

  • Topline: A heuristic that obtains the upper bounds for SDS and MDS that any state-of-the-art method can achieve, given the lack of agreement between evaluators [22].

  • GA-Summarization: The method presented in [13] uses a GA to generate language-independent extractive summaries. It evaluates the quality of each candidate summary considering three features: 1. Frequency of terms in the sentence. 2. Frequency of terms in the summary. 3. Importance of sentences according to their position in the source document.

  • GistSumm: The method presented in [6] uses a gist-sentence approach to generate extractive summaries. First, the “gist sentence” is identified through simple statistical measures. Then, it is used as a guideline to identify and select other sentences for the extractive summary. This method can generate extractive summaries in three different forms: 1. Intrasentential summarization (GistSumm-1). 2. Query-based summarization (GistSumm-2). 3. Average keywords ranking (GistSumm-3).

  • Shvoong: An online tool founded by Avi Shaked and Avner Avrahami to generate extractive summaries in 21 different languages,Footnote 6 including English, French, German, and Portuguese.

  • Open Text Summarizer (OTS): An open-source application to generate multilingual extractive summaries, available online.Footnote 7 This tool constructs extractive summaries by detecting the main ideas of the source document while reducing redundant information.

To compare the performance of the heuristics and the state-of-the-art methods described above, the evaluation based on the statistical co-occurrence of bag-of-words and bigrams (ROUGE-1 and ROUGE-2) from the ROUGE system was performed [30]. ROUGE-1 and ROUGE-2 use the ROUGE-N evaluation method, based on the statistical co-occurrence of terms between a candidate summary and the gold-standard summaries (see Eq. (6)).

$$ ROUGE - N = \frac{{\mathop \sum \nolimits_{{S \in Summ_{ref} }} \mathop \sum \nolimits_{{gram_{n} \in S}} Count_{match} (gram_{n} )}}{{\mathop \sum \nolimits_{{S \in Summ_{ref} }} \mathop \sum \nolimits_{{gram_{n} \in S}} Count(gram_{n} )}} $$
(6)

Tables 4, 5 and 6 show the average ROUGE-1 and ROUGE-2 (R1 and R2) scores of the Baseline-first, Baseline-random, and Topline heuristics for the different segmentations of TeMario. As we can see, the performance of Baseline-random and Topline was affected, because the sentences were selected under different criteria (paragraph, automatic, and manual form). On the other hand, the performance of Baseline-first was not significantly affected, because this heuristic only uses a word limit to construct extractive summaries. However, each sentence segmentation generated a higher number of words (some words were split), so the evaluation step produces small, non-significant variations. Moreover, the use of different compression rates (100 words, 30% of the source text, and the length of the gold-standard summaries) affects the performance of all heuristics, because the gold-standard summaries have different word lengths and must therefore be evaluated with varying compression rates.

Table 4. Results of heuristics considering the segmentation by paragraph.
Table 5. Results of heuristics considering the tagged segmentation.
Table 6. Results of heuristics considering the segmentation of SENTER.

To compare the state-of-the-art methods and heuristics described above, we generated summaries according to the human segmentation (Tagged) with a compression rate of 30% of the source documents (see Table 7). Table 7 shows that the GA-Summarization method (48.791) outperforms the other state-of-the-art methods in ROUGE-1, while the GistSumm-2 method (18.375) is the best state-of-the-art method in ROUGE-2. On the other hand, the Baseline-first heuristic outperforms all state-of-the-art methods in ROUGE-1 (48.986) and ROUGE-2 (18.948). Furthermore, some methods (GistSumm-3 and GistSumm-1) obtained worse performance than the Baseline-random heuristic.

Table 7. Results of the state-of-the-art methods and heuristics considering a tagged segmentation of sentences with a compression rate of 30%.

To unify the performance of the state-of-the-art methods in ROUGE-1 and ROUGE-2, Eq. (7) was used to rank the best ones according to the position of each method (see Table 7).

$$ Ran\left( {method} \right) = \sum\nolimits_{r = 1}^{6} {\frac{{\left( {6 - r + 1} \right)R_{r} }}{6}} $$
(7)

where \( R_{r} \) denotes the number of times the method occurs in the r-th rank position. The number 6 is the total number of methods involved in this comparison.
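As an illustration, Eq. (7) can be computed with a small helper (our own sketch; `rank_counts` is an assumed mapping from rank position to how often the method achieved it):

```python
def rank_score(rank_counts, n_methods=6):
    """Eq. (7): each occurrence at rank r contributes (n - r + 1) / n."""
    return sum((n_methods - r + 1) * count / n_methods
               for r, count in rank_counts.items())

# A method ranked 1st under one ROUGE metric and 2nd under the other:
print(round(rank_score({1: 1, 2: 1}), 3))  # 6/6 + 5/6 = 1.833
```

This value is consistent with the 1.833 scores reported for the top methods in Table 8.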

Table 8 shows the resulting rank of each state-of-the-art method. As we can see, the GA-Summarization (1.833) and GistSumm-2 (1.833) methods occupy the best positions in the ranking. However, GA-Summarization relies on language-independent features, while GistSumm-2 depends on language-specific features. On the other hand, the methods Shvoong, OTS, GistSumm-3, and GistSumm-1 hold the same positions across both ROUGE metrics.

Table 8. Ranking of the state-of-the-art methods.

5 Conclusions and Future Work

Several works have presented ATS methods to generate extractive summaries in Portuguese; however, the calculus of upper bounds was unknown. In this paper, a calculus of upper bounds for SDS in Portuguese was presented. Furthermore, it was possible to generate a general ranking of the state-of-the-art methods according to their positions.

In the process of calculating the upper bounds, it was necessary to use different sentence segmentations to obtain the best extractive summaries in Portuguese, because TeMario does not provide a specific delimitation of units for generating extractive summaries. Therefore, in this work we proposed three different segmentations of sequences (Paragraph, Tagged, and automatic segmentation of sentences) to generate extractive summaries from TeMario.

The length of the gold-standard summaries affects the performance of the lower and upper bounds (Baseline-random and Topline, respectively) (see Tables 4, 5 and 6), because these summaries were not written with a specific compression rate. The use of different sentence segmentations with different compression rates affects the performance of all state-of-the-art methods and heuristics; it is therefore necessary to consider these constraints when generating and evaluating summaries.

The performance of Baseline-first was not significantly affected by the sentence segmentation, because this heuristic simply takes the first words of the document, up to a word limit, to generate an extractive summary. Moreover, this heuristic performed better than all state-of-the-art methods (see Table 7).

Table 7 shows that the Baseline-first heuristic outperforms all state-of-the-art methods involved in this comparison; therefore, to generate summaries with better performance, we propose the use of other methods (or combinations of them) that can outperform this heuristic. Finally, we propose the generation and evaluation of summaries on TeMario considering the constraints mentioned above, to enable a comparison with respect to the upper and lower bounds.