
1 Introduction

The extensive use of the Internet has caused enormous growth in the amount of digital information. Currently, a great variety of users of online information services face a huge amount of unstructured digital information [9, 18]. Users access this information through queries, but precision is always an issue due to information overload. One way to address this issue is by generating summaries [6, 12].

The general process of summarization consists of rewriting a full text into a brief version [19]. Automatic Text Summarization (ATS) is a task of Natural Language Processing (NLP). ATS consists in selecting the most important units, which can be paragraphs, sentences, parts of sentences, or keywords, from a document or collection of documents using state-of-the-art methods or commercial systems.

ATS methods can be abstractive or extractive. In abstractive text summarization, the summary is composed by fusing and generating new text that describes the most important facts [10]. In extractive text summarization, sentences or other parts of a text are extracted and concatenated to compose a summary [14, 18].

Depending on the number of documents, summarization techniques can be classified into two tasks: Single-Document Text Summarization (SDTS) and Multi-Document Text Summarization (MDTS). The main goal of MDTS is to allow users to obtain, within a relatively short time, an overview of the topics and important information that exist in a collection of documents [1, 3, 22]. MDTS has gained interest since the mid-1990s [4], starting with the development of evaluation programs such as the Document Understanding Conferences (DUC) [23] and the Text Analysis Conferences (TAC) [26].

In this paper, we follow the methodology for building the final summary that considers all sentences of all documents [2, 18, 21, 29], and we propose a new MDTS method based on a Genetic Algorithm (GA).

The organization of the paper is as follows: Sect. 2 describes the proposed method, Sect. 3 presents the experimental configuration and results, and Sect. 4 presents the conclusions.

2 Proposed Method

2.1 Pre-processing

The proposed method consists of three steps. In the first step, the documents of the collection are chronologically ordered and the original text is adapted to the input format of the GA, where it is separated into sentences. Text pre-processing is then applied to the collection of documents: first, the text is divided into words separated by commas; then, tags are placed in the text to differentiate quantities, e-mail addresses, among others; finally, lexical analysis is carried out [8, 20].
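
A minimal pre-processing sketch, assuming simple regular-expression rules for sentence splitting and for tagging quantities and e-mail addresses; it only approximates the pipeline of [8, 20], and all names are illustrative.

```python
import re

def preprocess(document: str):
    """Split a document into sentences and return lists of word tokens."""
    # Naive punctuation-based sentence splitting (an assumption, not the exact rule of [8, 20]).
    sentences = re.split(r'(?<=[.!?])\s+', document.strip())
    processed = []
    for sent in sentences:
        # Replace e-mail addresses and quantities with placeholder tags.
        sent = re.sub(r'\b[\w.+-]+@[\w-]+\.[\w.]+\b', '<EMAIL>', sent)
        sent = re.sub(r'\b\d+([.,]\d+)?\b', '<NUM>', sent)
        # Lexical analysis approximated as lowercased word tokens.
        tokens = re.findall(r"[a-z<>]+", sent.lower())
        processed.append(tokens)
    return processed

# Example: preprocess("The company earned 3.5 million. Contact us at info@example.com.")
```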

2.2 Text Model

The goal of text modeling is to predict the probability of natural word sequences: it assigns high probability to word sequences that occur and low probability to word sequences that never occur. The simplest and most successful form of text modeling is the n-gram model, where an n-gram is defined as a subsequence of consecutive elements in a given sequence [15, 20].
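
A small sketch of n-gram extraction and frequency counting over tokenized sentences, as used later by the fitness function; function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n=2):
    """Return the n-grams (tuples of n consecutive tokens) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_frequencies(sentences, n=2):
    """Count n-gram frequencies over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        counts.update(ngrams(tokens, n))
    return counts
```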

2.3 Genetic Algorithm

The basic configuration of a GA is defined as follows [5]: the initial population is randomly generated, while the populations of subsequent generations are produced by some selection/reproduction procedure. The search process terminates when a termination criterion is met; otherwise, a new generation is produced and the search continues. The termination criterion can be a maximum number of generations or the convergence of the genotypes of the individuals. Genetic operators are constructed according to the problem to be solved, so the crossover operator has been applied to the generation of summaries.
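
The following generic GA skeleton illustrates this configuration; it is only a sketch, and the concrete operators (initialization, selection, crossover, mutation, fitness) are the ones described in the subsections below. The fitness argument is assumed to be a one-argument callable, e.g. a closure over the document statistics.

```python
def genetic_algorithm(sentences, summary_len, n_individuals, n_generations,
                      init, select, crossover, mutate, fitness):
    """Generic GA loop: evolve a population and return the best individual."""
    population = [init(sentences, summary_len) for _ in range(n_individuals)]
    for _ in range(n_generations):
        # Score the current population once per generation.
        scored = [(fitness(ind), ind) for ind in population]
        new_population = []
        while len(new_population) < n_individuals:
            p1, p2 = select(scored), select(scored)
            child = crossover(p1, p2, sentences, summary_len)
            new_population.append(mutate(child))
        population = new_population
    # Return the best individual of the last generation.
    return max(population, key=fitness)
```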

Encoding.

Binary encoding is used for each individual, where each sentence of the document constitutes a gene. The values 1 and 0 determine whether or not the sentence will appear in the final summary. The initial population is randomly generated [8, 20].
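
A sketch of the binary encoding and random initialization, assuming that an individual keeps adding randomly chosen sentences until the word budget of the summary is reached; this stopping detail is our assumption.

```python
import random

def random_individual(sentences, summary_len):
    """Create a random 0/1 gene list; 1 means the sentence enters the summary."""
    genes = [0] * len(sentences)
    words = 0
    for i in random.sample(range(len(sentences)), len(sentences)):
        if words >= summary_len:
            break
        genes[i] = 1
        words += len(sentences[i])
    return genes
```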

Selection Operator.

Roulette selection picks individuals from a population according to their aptitude and is intended to favor stronger individuals (those with a greater value of the fitness function) [20].
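
A minimal roulette-wheel selection sketch, assuming non-negative fitness values; the name roulette_select is illustrative.

```python
import random

def roulette_select(scored):
    """scored: list of (fitness, individual) pairs; sample proportionally to fitness."""
    total = sum(f for f, _ in scored)
    pick = random.uniform(0, total)
    acc = 0.0
    for f, ind in scored:
        acc += f
        if acc >= pick:
            return ind
    return scored[-1][1]  # Fallback in case of floating-point rounding.
```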

Crossover Operator.

This operator has been used in [8]. It was designed for ATS, where each individual represents a selection of sentences. The crossover process randomly selects, from the parents, only genes with a value of 1 and assigns this value to the new individual. Genes with a value of 1 in both parents are more likely to be chosen. To meet the length condition of the summary, every time a gene is selected to be part of the new individual, the number of words is counted [20].
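
A sketch of a crossover of this kind, under our reading of the description above: genes present in both parents receive double weight, and sentences are added while the word budget allows it. The weighting scheme is illustrative.

```python
import random

def crossover(parent1, parent2, sentences, summary_len):
    """Build a child from genes set to 1 in either parent, respecting the word budget."""
    child = [0] * len(parent1)
    candidates = [i for i in range(len(parent1)) if parent1[i] or parent2[i]]
    # Genes present in both parents are twice as likely to be chosen (assumption).
    weights = [2 if parent1[i] and parent2[i] else 1 for i in candidates]
    words = 0
    while candidates and words < summary_len:
        i = random.choices(candidates, weights=weights, k=1)[0]
        idx = candidates.index(i)
        candidates.pop(idx)
        weights.pop(idx)
        child[i] = 1
        words += len(sentences[i])
    return child
```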

Mutation Operator.

This operator performs the mutation according to a certain probability as described in [8, 20].
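
A minimal bit-flip mutation sketch; the probability value used here is only a placeholder, since the exact value is given in [8, 20] rather than in this paper.

```python
import random

def mutate(individual, p_mutation=0.01):
    """Flip each gene independently with probability p_mutation."""
    return [1 - g if random.random() < p_mutation else g for g in individual]
```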

Stop Condition.

The stop condition applied to terminate the GA is the maximum number of generations. For the execution of the GA, the number of words that the summary must have is also taken into account; in this case, lengths of 10, 50, 100 and 200 words were used.

The number of individuals and the number of generations are automatically calculated by the GA through Eqs. 1 and 2, respectively. The number of individuals is determined by the number of sentences that the document contains, by means of the following equation [20]:

$$ Number\;individuals = Number\;Sentences \times 2 $$
(1)

The number of generations is calculated through the following equation:

$$ Number\;Generations = 4 \times 15 \times Number\;Sentences $$
(2)
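
Eqs. 1 and 2 translate directly into a small helper, for instance:

```python
def ga_parameters(n_sentences):
    """GA population size and number of generations as a function of sentence count."""
    n_individuals = n_sentences * 2        # Eq. 1
    n_generations = 4 * 15 * n_sentences   # Eq. 2
    return n_individuals, n_generations
```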

Fitness Function.

The fitness function from [8, 20] is used. This fitness function evaluates two features: sentence position and coverage. The main idea is that if all the sentences (see Eq. 3) had the same importance, a line could be drawn through the points that make up those coordinates, as shown in Eq. 4.

$$ \{ X_{1} ,X_{2} , X_{3} , \ldots X_{n} \} $$
(3)
$$ \left\{ {\left( {X_{1} ,y} \right), \left( {X_{2} ,y} \right), \left( {X_{3} ,y} \right), \ldots \left( {X_{n} ,y} \right)} \right\} $$
(4)

The idea for assigning more importance to the first sentences is to consider the first sentence with importance \( X_{n} \), the second with importance \( X_{n} - 1 \), and so on.

Since the placement of the line indicates the importance, the midpoint of that line can be used to determine its slope, thus smoothing the importance of the sentences. This allows us to know how important a sentence is with respect to the following one. For this, the general equation of the slope of a line can be used.

For a text with \( n \) sentences, if sentence \( i \) is selected for the summary, then its relevance is defined as \( t(i - x) + x \), where \( x = 1 + (n - 1)/2 \) and \( t \) is the slope to be discovered. In order to normalize the measurement of the sentence position \( (Sentence\;Importance) \), the importance of the first \( k \) sentences is calculated, where \( k \) is the number of selected sentences. The formula to calculate the importance of the selected sentences, whose positions form the set \( C \), is then as follows:

$$ Sentence\;importance = \frac{\sum\nolimits_{i \in C} \left( t\left( i - x \right) + x \right)}{\sum\nolimits_{j = 1}^{k} \left( t\left( j - x \right) + x \right)},\quad x = 1 + \frac{n - 1}{2} $$
(5)
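
A sketch of the sentence-position score, assuming Eq. 5 as reconstructed above (numerator over the selected positions, denominator over the first \( k \) positions); the function name is illustrative.

```python
def sentence_importance(selected_positions, n, t):
    """selected_positions: 1-based positions of the selected sentences; n: total sentences; t: slope."""
    x = 1 + (n - 1) / 2
    k = len(selected_positions)
    # Relevance of the selected sentences (Eq. 5 numerator).
    numerator = sum(t * (i - x) + x for i in selected_positions)
    # Relevance of the first k sentences, used as the normalizer (Eq. 5 denominator).
    denominator = sum(t * (j - x) + x for j in range(1, k + 1))
    return numerator / denominator if denominator else 0.0
```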

However, this is not the only value by which the GA should be governed, since it would then try to obtain only the first sentences. It is also necessary to evaluate that the summary expresses different ideas, that is, that it is not repetitive, but at the same time contains important words \( (Precision{-}Recall) \). To measure both aspects, the fitness function sums the frequencies of the n-grams of the summary and weighs how significant the obtained n-grams are with respect to the original text, considering only the most frequent n-grams according to the minimum number of words. This weighting involves Precision and Recall. Precision is defined as the sum of the frequencies of the summary n-grams in the original text, expressed as follows:

$$ \sum {Original\;text\;frequency} $$
(6)

Recall is defined as the sum of the frequencies of the distinct n-grams of the summary:

$$ \sum {Frequency\;Summary} $$
(7)

Therefore, the formula for obtaining Precision-Recall is:

$$ Precision{-}Recall = \frac{{\sum {Original\;text\;frequency} }}{{\sum {Frequency\;Summary} }} $$
(8)

Finally, the value of the fitness function is obtained with the following formula, where the product is multiplied by 1000:

$$ FA = Precision{-}Recall\, *\,Sentence\;Importance\, *\,1000 $$
(9)
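
A sketch of how Eqs. 5-9 could be combined into a single fitness value, reusing the helper sketches from the previous subsections; this is our illustration of the computation, not the authors' exact implementation. `source_freqs` is assumed to be the n-gram Counter of the whole original text.

```python
def fitness(individual, sentences, source_freqs, n, t, ngram_order=2):
    """Combine Precision-Recall (Eq. 8) and Sentence Importance (Eq. 5) as in Eq. 9."""
    selected = [i + 1 for i, g in enumerate(individual) if g]   # 1-based positions
    summary_tokens = [tok for i in selected for tok in sentences[i - 1]]
    summary_freqs = ngram_frequencies([summary_tokens], n=ngram_order)
    # Eq. 6: frequencies of the summary n-grams in the original text.
    original_text_frequency = sum(source_freqs[g] for g in summary_freqs)
    # Eq. 7: frequencies of the distinct n-grams inside the summary.
    frequency_summary = sum(summary_freqs.values())
    precision_recall = original_text_frequency / frequency_summary if frequency_summary else 0.0
    return precision_recall * sentence_importance(selected, n, t) * 1000
```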

3 Experimental Results

We test the proposed MDTS method using the dataset provided by DUC [23]. We use ROUGE to evaluate the proposed method, since it is widely applied by DUC for performance evaluation [16]. It measures the performance of a summary by counting the unit overlaps between the candidate summary and two reference summaries.

Dataset.

In order to empirically evaluate the summarization results, the DUC02 dataset is used. This corpus contains 59 clusters of news documents; each cluster contains from 5 to 14 text documents. The summary lengths are 10, 50, 100, and 200 words [17].

Experiment Configuration.

In each experiment, we followed the standard sequence of steps explained in Sect. 2. Table 1 presents the best results obtained by the proposed method with different slope values, for four different summary lengths, and with the Roulette selection operator. The sentence selection parameter considered is k-best+first, which consists in selecting the first sentences of the text until the desired size of the summary is reached [13, 14].

Table 1. The best obtained results with several parameters of the proposed method.

In Sects. 3.1, 3.2 and 3.3, we describe the state-of-the-art methods, commercial systems and heuristics, and compare them to the best results obtained by the proposed method.

3.1 Description of the State-of-the-Art Methods

R2N2 [6]: Recursive Neural Networks (R2N2) were presented to rank sentences for MDTS. This is a supervised method that analyzes the syntactic structure of an input sentence and produces its parse tree. R2N2 is used to automatically learn ranking features over the parse tree, together with features such as POS tags, named entities, and sentence depth, among others. Two sentence selection methods were used: R2N2_G uses a greedy algorithm (G), which selects the most salient sentences under a similarity threshold; R2N2_ILP selects sentences with Integer Linear Programming (ILP), which is intended to find the global optimum. In addition to the previous selection methods, three support vector regression baselines were used in [6]: UR, unigram regression with G selection; SR, sentence regression with G selection; and U+SR, an ILP method that assigns weights to the words of the sentences as measured by the related regressions.

LexRank [29]: It computes sentence importance based on the concept of centrality in a graph representation of sentences. In this model, a connectivity matrix based on intra sentence cosine similarity is used.

WF (Word Frequency) [18]: The sentences that contain the most frequent words of the source document (without stop-words) are considered for the final summary. TE (Textual Entailment) [18]: It uses textual entailment in ATS, which has been considered a useful approach for obtaining a preliminary summary, in which the sentences kept are those not entailed by any other sentence of the document. TE + WF [18]: This method applies textual entailment recognition as a previous step to word frequency in the summarization process.

GS, Knapsack, ILP [21]: This work considers three global inference algorithms: the first is a greedy approximate method (GS), the second is a dynamic programming approach based on solutions to the knapsack problem (Knapsack), and the third is an exact algorithm that uses an Integer Linear Programming formulation of the problem (ILP).

MFS (Maximal Frequent Sequences) [11]: This method analyzes several options for language-independent features and the corresponding term weighting, based on units larger than one word.

3.2 Description of Heuristics

Topline [25]: It is a heuristic that obtains the maximum value that any state-of-the-art method can achieve given the lack of agreement between evaluators, since it selects sentences considering one or several gold-standard summaries.

Baseline-First [25]: It is a heuristic that takes the first sentence of the 1st, 2nd, 3rd, etc. document of the collection, in chronological sequence, until the target summary size is reached.

Baseline-Random [25]: It is a state-of-the-art heuristic that randomly selects sentences and presents them as an extractive summary to the user.

Baseline-First-Document: It is a heuristic that takes the first sentences of the first document of a collection until the target summary size is reached. This heuristic is proposed in this work.

Lead Baseline [18]: It is a heuristic that takes the first 10, 50, 100, and 200 words of the last document in the collection, where documents are assumed to be chronologically ordered.

3.3 Description of Systems

Svhoong Summarizer [12]: This system is available online. The text should be copied to the web page. The final summary is the text underlined in the same page.

Pertinence Summarize [12]: This system is available online. For each document, Pertinence automatically calculates percentages depending on the number of words in the document.

Tool4noobs Summarizer [12]: This system is available online and it uses three steps: extraction of text, identification of the key-words, and identification of sentences and generation of summary.

Copernic Summarizer [12]: This system was developed exclusively for the generation of automatic summaries. Version 2.1 was used on the Microsoft Windows operating system.

Microsoft Office Word Summarizer [12]: This system can be found in versions of Microsoft Office Word 2003 and Microsoft Office Word 2007.

3.4 Experimental Results (4 MDTS Tasks)

It is considered that no method should perform worse than randomly choosing sentences (baseline-random), so the advance of baseline-random is set to 0%. The best possible performance is called topline and is considered as 100% of advance. Using baseline-random and topline, it is possible to recalculate the F-measure results in order to see the advance with respect to the worst and the best results. Tables 2, 3, 4 and 5 present the F-measure results of ROUGE-1 and ROUGE-2 and the Advance.
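
One way to express this normalization (our interpretation; [25] does not state the formula explicitly in this section) is:

$$ Advance = \frac{F_{method} - F_{baseline\text{-}random}}{F_{topline} - F_{baseline\text{-}random}} \times 100\% $$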

Table 2. Comparison of the results to other methods for 10 words.
Table 3. Comparison of the results to other methods for 50 words.
Table 4. Comparison of the results to other methods for 100 words.
Table 5. Comparison of the results to other methods for 200 words.

10 Words Task:

The summary length of this task is 10 words. This task is rarely tested in the state-of-the-art because of the difficulty of describing the most important information with so few words. We calculated all the heuristics: topline, baseline-first, baseline-random, baseline-first-document and lead-baseline (see Table 2), where we show the results of all state-of-the-art methods and heuristics. The topline shows the enormous margin that exists between the best method and the best possible result that can be obtained; the difference is 62.09%. We hope that these experimental results serve as a reference for future works.

50 Words Task:

The summary length of this task is 50 words. Several state-of-the-art unsupervised methods and 5 heuristics are presented in Table 3 (3 heuristics are calculated in the state-of-the-art: topline, baseline-first and baseline-random, and 2 heuristics are calculated in this paper: baseline-first-document and lead-baseline). In Table 3, we see that the proposed method outperforms all the state-of-the-art methods and heuristics. The topline heuristic shows that an extensive margin exists between the best method and the best possible result that can be obtained; the difference is 68.79%.

100 Words Task:

The summary length of this task is 100 words. The 100-word summary length is the most tested task in the state-of-the-art. Table 4 includes several unsupervised and supervised methods, commercial tools, heuristics calculated in the state-of-the-art (topline, baseline-first and baseline-random), and heuristics calculated in this paper (baseline-first-document and lead baseline). In Table 4, we see that the proposed method outperforms the state-of-the-art methods, heuristics and commercial systems. However, one supervised method from [2] outperforms the proposed method; we attribute this to its usage of language-dependent features. As the topline result shows, an enormous margin exists between the best method and the best possible result that can be obtained; the difference is 55.47%. The summaries of the commercial systems were provided by the authors and were evaluated.

200 Words Task:

The summary length of this task is 200 words. Table 5 includes several unsupervised and supervised methods, 4 heuristics calculated in the state-of-the-art (topline, baseline-first, lead baseline and baseline-random), and 1 heuristic calculated in this paper (baseline-first-document). In Table 5, we see that the proposed method outperforms all the state-of-the-art methods and heuristics. The topline result shows that an extensive margin exists between the best method and the best possible result that can be obtained; the difference is 67.37%. We note that the MDTS task benefits from summaries of greater length, so our proposed method could find the most important ideas to compose the summary and obtained the best result in the task where a longer summary is required.

4 Conclusions

In this paper, we proposed a method for MDTS based on a GA. The fitness function was calculated considering sentence position and coverage. We proposed the binary coding representation and the selection, crossover and mutation operators. Four different tasks for each of the 59 collections of documents of the DUC02 dataset (specifically the MDTS task) were tested. We tested different configurations of the most used methodology for generating MDTS summaries. Moreover, different heuristics such as topline, baseline-first, baseline-random and lead-baseline were calculated. The proposed method for MDTS demonstrates an improvement over the state-of-the-art methods and heuristics.

As future work, we will use more language-independent features such as redundancy reduction, sentence length and similarity with the title [28]. Also, we will consider other text models such as sn-grams [27] and MFS [7], and other languages [24].