1 Introduction

The huge amount of information available electronically has increased demand for automatic text summarization systems. Text summarization is the process of automatically creating a compressed version of a given text that provides useful information [5, 6, 35, 40, 48]. Text summarization addresses both the problem of selecting the most important portions of text, and the problem of generating coherent summaries. There are two types of summarization: extractive and abstractive. Extractive summarization methods simplify the process by selecting a representative subset of the sentences in the original document. Abstractive summarization may compose new sentences that are not present in the original source. However, abstractive approaches require deep natural language processing techniques such as semantic representation, inference, and natural language generation, which have yet to reach a mature stage [40].

Research into automated summarization began in the 1950s [18]. Various attempts have shown that very complex techniques are required to produce human-quality text summaries, because the process encompasses discourse understanding, abstraction, and language generation [34]. Simpler approaches that extract representative text have therefore been explored, using statistical techniques and/or surface-level, domain-independent linguistic analyses. Within this context, summarization can be defined as the selection of a subset of sentences that is representative of the document’s content. This typically involves ranking the sentences in a document, so that we can select those with the highest scores and minimum overlap [4, 24]. Most recent work in summarization uses this paradigm.

The process of text summarization can be divided into three phases: analysis, transformation, and synthesis. In the analysis phase, the input text is processed and a few salient features are selected. In the transformation phase, the results of the analysis are transformed into a summary representation. Finally, the synthesis phase produces an appropriate summary from the summary representation, corresponding to the particular needs of the user. In the overall process, the compression rate is an important factor that influences the quality of the summary. It is defined as the ratio between the length of the summary and the length of the original text. A lower compression rate produces a more concise summary, but more information is lost; a higher compression rate produces a larger summary that contains more insignificant information. In practice, the quality of the summary is acceptable when the compression rate is 5–30 % [11, 16, 20, 38].

Early work on text summarization was limited by the lack of powerful computers and the difficulty of natural language processing (NLP), so research focused on surface-level cues such as sentence position and cue-phrases [7, 18]. During the 1970s, researchers began to apply artificial intelligence (AI) techniques [2, 21, 31, 41]. These AI methods exploited knowledge representations, such as frames or templates, to identify conceptual entities in a text and to extract relationships among entities using inference mechanisms. The major drawback is that limited definitions of frames or templates may lead to an incomplete analysis of conceptual entities. During the 1990s, information retrieval (IR) was applied to text summarization [1, 9, 10, 15, 16, 19, 30, 36, 39]. However, most of these IR techniques focused on a symbolic-level analysis, and did not take into account semantics such as synonymy, polysemy, and term dependency [15].

Automated multi-document summarization has drawn much attention in recent years. A multi-document summary is commonly used to provide a concise topic description for a cluster of documents, and to help a user quickly browse many documents. There is an inevitable overlap in the information content of different documents. Therefore, we need effective summarization methods to merge information stored in different documents, and if possible, contrast their differences [37].

Recently, there have been many investigations into text summarization [3, 8, 32]. In [14], the authors considered the evaluation of summarization using relevance prediction, and [33] investigated the ROUGEeval package; SUMMAC, NTCIR, and DUC were considered by [28], and [13] researched voted regression models. Other techniques included single and multiple-sentence compression using “parse and trim” and statistical noisy-channel approaches [42], and conditional random fields [23]. Also investigated were multi-document summarization [12, 37] and summarization for specific domains [17, 22, 29].

In this work, all of the documents in a given cluster are merged into one file. After redundancy removal, the sentences of each file are modeled as vectors of features extracted from the text. The summarization task can then be seen as a two-class classification problem, where a sentence is labeled as “correct” if it belongs to the extractive reference summary, or as “incorrect” otherwise. We give the “correct” class the value ‘1’ and the “incorrect” class the value ‘0’. In the testing mode, each sentence is given a continuous value between ‘0’ and ‘1’, so we can extract the appropriate number of sentences according to the compression rate. The trainable summarizer is expected to “learn” the patterns that lead to the summaries by identifying the feature values that are most correlated with the classes “correct” and “incorrect”. When a new cluster file is input into the system, the “learned” patterns are used to classify each sentence by giving it a score between ‘0’ and ‘1’. The highest-scoring sentences are then arranged chronologically to form the file summary, based on the compression rate.

This paper is organized as follows: Section 2 presents the different text feature parameters, Sect. 3 describes the proposed multi-document automatic summarization model, Sect. 4 shows the experimental results, and finally Sect. 5 presents our conclusions and future work.

2 Text features

We consider the following features when analyzing the text.

1. F1 = Word similarity among sentences.

This text feature measures the importance of a sentence based on how often its content appears in the other sentences of the document. It is simply the vocabulary overlap between this sentence and other sentences in the document. F1 is calculated as follows:

$$\begin{aligned} \mathit{Score}_{\mathrm{F}1}(S)=\frac{|\mathrm{Keywords~in}\ S \cap \mathrm{Keywords~in~other~sentences}|}{\max_i |\mathrm{Keywords~in}\ S_i \cap \mathrm{Keywords~in~other~sentences}|} \end{aligned}$$
(1)

where $S$ is the document sentence under consideration, and $S_i$ is sentence number $i$ in the document. Note that the denominator of Eq. (1) is used for normalization.
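To make the computation concrete, the following is a minimal sketch of Eq. (1), assuming each sentence has already been reduced to a set of keywords (stop words removed, light stemming applied); the toy sentences are illustrative.

```python
# A minimal sketch of Eq. (1): keyword overlap with the rest of the document,
# normalized by the largest overlap achieved by any sentence.
def f1_scores(sentence_keywords):
    """sentence_keywords: one keyword set per sentence of the document."""
    overlaps = []
    for i, kw in enumerate(sentence_keywords):
        others = set().union(*(s for j, s in enumerate(sentence_keywords) if j != i))
        overlaps.append(len(kw & others))
    denom = max(overlaps) or 1  # the normalizing denominator of Eq. (1)
    return [o / denom for o in overlaps]

# Example: the second sentence shares the most keywords with the others.
print(f1_scores([{"cat", "sat", "mat"}, {"cat", "dog"}, {"dog", "ran"}]))
```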

2. F2 = Word similarity among paragraphs.

This text feature is similar to F1, but compares the whole paragraph rather than individual sentences. F2 is calculated as follows:

$$\begin{aligned} \mathit{Score}_{\mathrm{F}2}(S)=\frac{|\mathrm{Keywords~in}\ P \cap \mathrm{Keywords~in~other~paragraphs}|}{\max_i |\mathrm{Keywords~in}\ P_i \cap \mathrm{Keywords~in~other~paragraphs}|} \end{aligned}$$
(2)

where $P$ is the document paragraph that contains the sentence $S$, and $P_i$ is paragraph number $i$.

3. F3 = Text format score.

In some documents, the importance of the sentence is indicated by expressing some of its words in a different text format, e.g., italic, bold, underlined or a larger font size.

F3 is calculated as follows:

$$\begin{aligned} \mathit{Score}_{\mathrm{F}3}(S)=\frac{\sum_{t_{sp}\in \mathrm{special~format~terms}}1}{\sum_{t \in \mathrm{sentence~terms}}1} \end{aligned}$$
(3)

where $t$ is a phrase or term in the sentence, and $t_{sp}$ is a special-format term in the sentence.

4. F4 = Cue-phrases.

Sentences that contain cue-phrases such as “in summary” and “in conclusion”, superlatives such as “the best” and “the most important”, and other indicator phrases such as “according to the study” and “hardly”, may be considered important.

F4 is calculated as follows:

$$\begin{aligned} \mathit{Score}_{\mathrm{F}4}(S)=\frac{\sum_{t_{cp}\in \mathrm{cue}{\scriptsize-}\mathrm{phrases}}1}{\sum_{t \in \mathrm{sentence~terms}}1} \end{aligned}$$
(4)

where $t$ is a phrase or term in the sentence, and $t_{cp}$ is a cue-phrase.

5. F5 = Summation of tfidf of the sentence terms.

The term frequency in the given document is simply the number of times a given term appears in that document. This count should be normalized to prevent a bias towards longer documents, and to give a measure of the importance of the term $t_i$ within the particular document $d_j$. Thus, the term frequency is defined as follows:

$$ \mathit{tf}_{i,j}=\frac{n_{i,j}}{\sum_k n_{k,j}}, $$
(5)

where $n_{i,j}$ is the number of occurrences of the considered term in document $d_j$, and the denominator is the total number of occurrences of all terms in document $d_j$. The inverse document frequency is a measure of the general importance of the term and is calculated as follows:

$$ \mathit{idf}_i=\log\frac{|D|}{|\{d:t_i\in d\}|}, $$
(6)

where $|D|$ is the total number of documents in the corpus, and $|\{d : t_i \in d\}|$ is the number of documents in which the term $t_i$ appears (that is, $n_{i,j} \neq 0$).

Then

$$ \mathit{tfidf}_{i,j} = \mathit{tf}_{i,j}\times \mathit{idf}_i, $$
(7)

and

$$ \mathit{Score}_{\mathrm{F}5}(S)=\frac{\sum_{t\in S} \mathit{tfidf}(t)}{\max_i \bigl(\sum_{t\in S_i} \mathit{tfidf}(t)\bigr)}. $$
(8)
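As an illustration, here is a minimal sketch of Eqs. (5)–(8), assuming each document and each sentence is a pre-tokenized list of terms; the corpus-wide document list stands in for the real collection.

```python
import math
from collections import Counter

# A minimal sketch of Eqs. (5)-(8): tf-idf per term, summed per sentence and
# normalized by the highest-scoring sentence of the document.
def f5_scores(documents, doc_index, sentences):
    """documents: list of term lists; sentences: term lists from documents[doc_index]."""
    counts = Counter(documents[doc_index])
    total_terms = sum(counts.values())          # denominator of Eq. (5)
    n_docs = len(documents)

    def tfidf(term):                            # Eq. (7) = Eq. (5) * Eq. (6)
        tf = counts[term] / total_terms
        df = sum(1 for d in documents if term in d)
        return tf * math.log(n_docs / df) if df else 0.0

    sums = [sum(tfidf(t) for t in s) for s in sentences]
    denom = max(sums) or 1.0                    # normalization of Eq. (8)
    return [v / denom for v in sums]
```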

6. F6 = Title feature.

A sentence is given a high score if it contains words that occur in the document title. F6 is given as follows:

$$ \mathit{Score}_{\mathrm{F}6}(S)=\frac{\# \mathrm{of~title~words~in}\ S}{\mathrm{title~length}}. $$
(9)

7. F7 = Sentence location feature.

The first and last sentences of the first and last paragraphs are commonly important, so they should be more likely to be included in the summary. F7 is calculated as follows:

$$ \mathit{Score}_{\mathrm{F}7}(S)= \left\{\begin{array}{l@{\quad}l} 1 & \mbox{for first or last sentence in the first}\\ & \mbox{or last paragraph},\\ 0 & \mbox{otherwise}. \end{array}\right. $$
(10)

8. F8 = Occurrence of a non-essential information feature.

Some words are indicators of non-essential information. These words are speech markers such as “because”, “furthermore”, and “additionally”, and typically occur at the beginning of a sentence. F8 is calculated as follows:

$$ \mathit{Score}_{\mathrm{F}8}(S)= \left\{\begin{array}{l@{\quad}l} 1 & \mbox{if the sentence contains at least one}\\ & \mbox{of the non-essential information terms},\\ 0 & \mbox{otherwise}. \end{array}\right. $$
(11)
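The two binary features can be computed directly from sentence positions and a marker list. The following is a minimal sketch of Eqs. (10) and (11); the NON_ESSENTIAL list is an illustrative stand-in, not the exact lexicon used in our experiments.

```python
# A minimal sketch of the binary features F7 (Eq. (10)) and F8 (Eq. (11)).
NON_ESSENTIAL = {"because", "furthermore", "additionally"}  # illustrative stand-in

def score_f7(sent_idx, n_sents_in_para, para_idx, n_paras):
    """1 for the first or last sentence of the first or last paragraph."""
    edge_para = para_idx in (0, n_paras - 1)
    edge_sent = sent_idx in (0, n_sents_in_para - 1)
    return 1 if (edge_para and edge_sent) else 0

def score_f8(sentence_terms):
    """1 if the sentence contains at least one non-essential marker."""
    return 1 if any(t.lower() in NON_ESSENTIAL for t in sentence_terms) else 0
```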

3 The proposed multi-document automatic summarization model

The proposed multi-document automatic summarization model has two modes of operation:

  1. Training mode, where features are extracted from the training data and used to train the maximum entropy model, naive-Bayes classifier, and support vector machine.

  2. Testing mode, where the features are calculated for the sentences in the test data. The sentences are ranked according to the sets of feature weights calculated during the training stage. We then construct a hybrid model, which is used to create the final sentence ranking. Summaries consist of the highest-ranking sentences.

3.1 Redundancy removal

A pre-processing step is necessary before the summarization process can take place. We have removed stop words and conducted some light stemming. The sentences are represented using a graph, so that we can detect and remove redundancies before applying the model-based ranking formulas. Each sentence is considered as a node in a directed graph. A link is established between two nodes if they have at least four consecutive words in common. The link weight is the ratio of the number of common words to the average length of the two sentences. For every parent node, the child nodes with a link weight greater than a particular threshold are excluded from the sentence ranking process. Hence, repeated and almost identical sentences are removed.
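A minimal sketch of this step follows, assuming sentences are pre-tokenized word lists; the threshold value is illustrative, not the one used in our experiments.

```python
# A minimal sketch of the redundancy-removal rules described above: link two
# sentences if they share at least four consecutive words, weight the link by
# (common words) / (average sentence length), and drop highly linked children.
def share_ngram(a, b, n=4):
    """True if token lists a and b share a run of n consecutive words."""
    grams = {tuple(a[i:i + n]) for i in range(len(a) - n + 1)}
    return any(tuple(b[i:i + n]) in grams for i in range(len(b) - n + 1))

def remove_redundant(sentences, threshold=0.7):  # threshold is illustrative
    kept = []
    for sent in sentences:                       # earlier sentences act as parents
        redundant = False
        for parent in kept:
            if share_ngram(parent, sent):
                weight = len(set(parent) & set(sent)) / ((len(parent) + len(sent)) / 2)
                redundant = weight > threshold
                if redundant:
                    break
        if not redundant:
            kept.append(sent)
    return kept
```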

3.2 Maximum entropy (ME)

The maximum entropy approach is appropriate for the task of sentence extraction. The maximum entropy principle encapsulates the approach of [26], whose author describes it as follows.

“In making inferences on the basis of partial information we must use that probability distribution which has maximum entropy subject to whatever is known. This is the only unbiased assignment we can make; to use any other would amount to an arbitrary assumption of information which by hypothesis we do not have.”

The first step of the maximum entropy approach is to extract features or important information, in this case worthy sentences, to constrain the model. The second step is to construct a model using these features. As these features do not fully determine the model, we choose the distribution that is as uniform as possible subject to the feature constraints; this is precisely the distribution with maximum entropy. In [27], Shannon defines entropy as a measure of the information content or uncertainty of an outcome: entropy is maximized when a distribution is uniform, i.e., when the situation is most uncertain.

The parametric form for a conditional maximum entropy model is as follows [25]:

$$\begin{aligned} & P(c|s)=\frac{1}{Z(s)}\exp\biggl(\sum _i \lambda_i f_i(c,s)\biggr) \end{aligned}$$
(12)
$$\begin{aligned} & Z(s)=\sum_c \exp\biggl(\sum _i \lambda_if_i(c,s)\biggr) \end{aligned}$$
(13)

where $c$ is one of two labels: one indicating that a sentence should be in the summary (correct) and another indicating that the sentence should not be in the summary (incorrect); $s$ is one sentence in a training set, linked to its originating document. This means that we can recover the position of any given sentence in any given document. In maximum entropy models, the training set is viewed in terms of a set of features. Each feature expresses some characteristic of the domain, as explained in Sect. 2. In Eqs. (12) and (13), $f_i(c,s)$ is a feature, and $\lambda_i$ is that feature’s weight. We assume that all the features have the same weight.

When classifying sentences using the maximum entropy method, we use the following equation:

$$ \mathit{label}(s)=\mathop{\mathrm{arg\,max}}_{c\in C}P(c|s), $$
(14)

where C is a set of labels (correct and incorrect).

We can write the unnormalized score as

$$ \mathit{label}(s)=\mathop{\mathrm{arg\,max}}_{c\in C}\exp \biggl(\sum_i \lambda_if_i(c,s) \biggr). $$
(15)

This maximum entropy classifier assumes a uniform prior. For a non-uniform prior, we can use

$$ \mathit{label}(s)=\mathop{\mathrm{arg\,max}}_{c\in C} F(c) \exp\biggl(\sum_i \lambda_if_i(c,s) \biggr), $$
(16)

where F(c) is a function equivalent to the prior when using the unnormalized classifier.

We classify each sentence using the above model.
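As a concrete illustration, the sketch below uses scikit-learn’s LogisticRegression, which for binary labels is a conditional maximum entropy model of the form of Eqs. (12)–(13): it learns the weights $\lambda_i$ and outputs $P(c|s)$. The random training data is a stand-in for the eight-feature vectors and 0/1 labels of Sect. 2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: 8 feature scores per sentence, binary correct/incorrect labels.
rng = np.random.default_rng(0)
X_train = rng.random((100, 8))
y_train = (X_train.sum(axis=1) > 4).astype(int)

# Binary logistic regression = conditional maximum entropy (Eqs. (12)-(13)).
me = LogisticRegression(max_iter=1000).fit(X_train, y_train)
ranking_scores = me.predict_proba(X_train)[:, 1]  # P(correct | s) per sentence
```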

3.3 Naive-Bayes classifier

In the naive-Bayes classifier [16], the classification function categorizes each sentence as worthy of extraction or not. Let $s$ be a particular sentence, $S$ the set of sentences that make up the summary, and $f_1,\ldots,f_8$ the text features described in Sect. 2. Assuming that the features are independent, we get

$$ P(s\in S| f_1, f_2,\ldots,f_8)=\frac{\prod_{i=1}^8 P(f_i|s\in S)\cdot P(s\in S)}{\prod_{i=1}^8P(f_i)}. $$
(17)

Since the denominator in Eq. (17) has the same value for all sentences, it can be dropped, giving the ranking score

$$ P(s\in S|f_1,f_2,\ldots,f_8)\propto\prod_{i=1}^8 P(f_i |s\in S)\cdot P(s\in S). $$
(18)

We assign each sentence a score using Eq. (18).
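For illustration, here is a minimal sketch of the scoring rule of Eq. (18), assuming the feature scores have been discretized to binary values so that the conditional probabilities can be estimated by counting on the training set; the probability tables are passed in rather than estimated here.

```python
# A minimal sketch of Eq. (18) with binarized features.
def naive_bayes_score(features, p_f_given_summary, prior_summary):
    """features: eight 0/1 values; p_f_given_summary[i][v] = P(f_i = v | s in S)."""
    score = prior_summary                     # P(s in S)
    for i, v in enumerate(features):
        score *= p_f_given_summary[i][v]      # product of P(f_i | s in S)
    return score                              # proportional to Eq. (18)
```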

3.4 Support vector machine classifier

Support vector machine (SVM) methods have often been found to provide good classification results [3]. The SVM approach tries to find the optimal separating hyperplane between two classes.

The kernel function used to implement the SVM technique is the sigmoid function:

$$ K(x_i,x_j)=\tanh\bigl(\gamma\, x_i^T x_j+r\bigr), $$
(19)

where $\gamma$ and $r$ are the kernel parameters; in our experiments, $\gamma=1$.
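The following minimal sketch uses scikit-learn’s SVC with the sigmoid kernel of Eq. (19) and $\gamma=1$ as stated above; the coef0 argument plays the role of $r$, and its value here, like the random data, is an assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_train = rng.random((100, 8))                 # stand-in feature vectors
y_train = (X_train.sum(axis=1) > 4).astype(int)

# Sigmoid kernel of Eq. (19): K(x_i, x_j) = tanh(gamma * x_i^T x_j + coef0).
svm = SVC(kernel="sigmoid", gamma=1.0, coef0=0.0, probability=True)
svm.fit(X_train, y_train)
svm_scores = svm.predict_proba(X_train)[:, 1]  # used to rank sentences
```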

3.5 Hybrid machine learning model

Consider a sentence, represented as a feature vector $X$, which is to be assigned one of $n$ possible classes $(C_1,\ldots,C_n)$. We have $n=2$ classes, because one class indicates that a sentence should be in the summary and the other indicates that it should not. Let $R$ be the number of classifiers. In this case, $R=3$, as we have the maximum entropy method, the naive-Bayes classifier, and the support vector machine classifier. The feature vector used by the $i$th classifier is $X_i$. Each class, $C_k$, is modeled by the probability $P(X_i|C_k)$, and its prior probability of occurrence is $P(C_k)$. The models under consideration are mutually exclusive, which implies that only one model is associated with each pattern. Using Bayesian theory, a given feature vector $X=(X_i, i=1,\ldots,R)$ is assigned to the class $C_j$ that has the maximum a posteriori probability, i.e.

$$ P(C_j|X_1,\ldots,X_R)=\max_k P(C_k|X_1,\ldots,X_R). $$
(20)

Let us rewrite the posterior probability, $P(C_k|X_1,\ldots,X_R)$, using Bayes’ theorem as follows:

$$ P(C_k|X_1,\ldots,X_R)=\frac{P(X_1,\ldots,X_R|C_k)P(C_k)}{P(X_1,\ldots,X_R)}. $$
(21)

The joint probability density $P(X_1,\ldots,X_R)$ can be expressed as

$$ P(X_1,\ldots,X_R)=\sum^n_{k=1} P(X_1,\ldots,X_R|C_k)P(C_k), $$
(22)

where $P(X_1,\ldots,X_R|C_k)$ represents the joint probability distribution extracted by the classifiers. Assume that the representations are conditionally statistically independent given the class. Then we can use

$$ P(X_1,\ldots,X_R|C_k)=\prod^R_{i=1} P(X_i|C_k). $$
(23)

Substituting Eqs. (23) and (22) into Eq. (21), we get

$$ P(C_k|X_1,\ldots,X_R)=\frac{P(C_k)\prod^R_{i=1}P(X_i|C_k)}{ \sum^n_{j=1}P(C_j)\prod^R_{i=1}P(X_i|C_j)}. $$
(24)

Combining Eqs. (24) and (20), we get the decision rule: the sentence $s$ is assigned the class $C_j$ if

$$ P(C_j)\prod^R_{i=1}P(X_i|C_j)=\max_{k=1,\ldots,n} P(C_k)\prod^R_{i=1} P(X_i|C_k). $$
(25)
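A minimal sketch of the decision rule follows. Since the three classifiers output posterior probabilities rather than the class-conditional likelihoods $P(X_i|C_k)$, the sketch multiplies posteriors, which under the uniform priors used here is proportional to the quantity maximized in Eq. (25); the probability values are illustrative toy numbers.

```python
import numpy as np

# Per-classifier class probabilities, shape (n_sentences, 2): columns are
# (incorrect, correct). Toy values for two sentences.
p_me  = np.array([[0.3, 0.7], [0.8, 0.2]])
p_nb  = np.array([[0.4, 0.6], [0.6, 0.4]])
p_svm = np.array([[0.2, 0.8], [0.7, 0.3]])
priors = np.array([0.5, 0.5])                    # P(C_k), assumed uniform

combined = priors * p_me * p_nb * p_svm          # P(C_k) * prod_i P(X_i | C_k)
combined /= combined.sum(axis=1, keepdims=True)  # normalize as in Eq. (24)
print(combined.argmax(axis=1))                   # Eq. (25): class per sentence
print(combined[:, 1])                            # P(correct), used for ranking
```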

4 Experimental results

4.1 The training and testing data

We have trained our algorithm using the 147 single documents of the DUC 2001 data set. We have extracted the eight text features and a reference summary from each document, and have used these feature parameters to train the models described in the previous section.

We have used multi-document extracts of 100-word summaries, generated for each of the 59 document clusters formed from the DUC 2002 data set. First, we merged all the documents of each cluster into one file, and all of the document titles of each cluster into one title. We extracted the eight text features from each file, and then used the models to summarize the text. We ranked the sentences based on the model output. We selected a set of the highest ranked sentences for each file, with a constraint of 100 words.

We used an intrinsic evaluation to judge the quality of a summary, based on the Recall-Oriented Understudy for Gisting Evaluation (ROUGE-1). The ROUGE scores have become the standard automatic method for evaluating the content of machine-generated summaries, and have been shown to be highly correlated with human evaluations. Formally, ROUGE-N (in our experiments, N=1) is an n-gram recall between a candidate summary and a set of reference summaries. ROUGE-N is computed as follows:

$$ \frac{\sum_{S\in\{\mathrm{Reference~Summaries}\}}\sum_{\mathit{gram}_n\in S} \mathit{Count}_{\mathrm{match}}(\mathit{gram}_n)}{ \sum_{S\in\{\mathrm{Reference~Summaries}\}}\sum_{\mathit{gram}_n\in S} \mathit{Count}(\mathit{gram}_n)}, $$
(26)

where $n$ stands for the length of the n-gram, $\mathit{gram}_n$, and $\mathit{Count}_{\mathrm{match}}(\mathit{gram}_n)$ is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries.
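For reference, here is a minimal sketch of Eq. (26) for N=1, assuming tokenized, lower-cased word lists; Count_match is implemented as the clipped n-gram count, i.e. the overlap cannot exceed the candidate’s own count of an n-gram.

```python
from collections import Counter

# A minimal sketch of ROUGE-N (Eq. (26)); n-grams are tuples of words.
def rouge_n(candidate, references, n=1):
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    match = total = 0
    for ref in references:
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match += sum(min(c, cand[g]) for g, c in ref_counts.items())  # Count_match
        total += sum(ref_counts.values())                             # Count
    return match / total if total else 0.0

print(rouge_n("the cat sat on the mat".split(),
              ["the cat was on the mat".split()]))  # 5/6 ~ 0.833
```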

4.2 Baseline approaches

4.2.1 The lead approach

We have extracted the first sentences of each document in a certain cluster to represent the cluster summary, subject to the 100-word constraint. Table 1 shows the average ROUGE-1 result for the 59 clusters of DUC 2002.

Table 1 Performance of all approaches, evaluated using ROUGE-1

4.2.2 The UnifiedRank, PositionRank, TwoStageRank and BasicRank approaches

In [43], the mutual influences between the single-document and multi-document summarization tasks are incorporated into a graph model, and the ranking scores of a sentence for the two tasks are obtained in a unified ranking process. The PositionRank approach improves the basic PageRank algorithm by using the position weight of a sentence as a prior score. The TwoStageRank approach computes the score of each sentence within a single document using the PositionRank method. It then computes the final score of each sentence within the document set by considering the document-level sentence score as the prior score in the improved PageRank algorithm. The BasicRank approach exploits the standard PageRank algorithm to rank sentences based on all sentence relationships in a document set. Table 1 shows the ROUGE-1 results for the 59 clusters of DUC 2002 based on these four methods.

4.2.3 The CLASSY guided summarization approach

In [46], data preparation took place before the algorithm was applied. The data preparation included sentence splitting, trimming, and categorization (do not use the sentence; use the sentence for statistics only; or consider the sentence for use in the summary). The words were stemmed, and no stop words were removed. The algorithm for sentence scoring has three parts:

  • The probability that a term will be included in a human-generated summary is estimated for each term. The sentence score is defined as the expected number of summary terms in the sentence divided by the sentence length.

  • A non-redundant subset of high-scoring sentences is selected using non-negative matrix factorization, as in [47].

  • Finally, a subset of these sentences is selected to achieve the 100-word summary using a branch-and-bound algorithm.

Table 1 shows the ROUGE-1 results for the CLASSY guided summarization approach.

4.3 The effect of each feature on the summarization performance

In this section, we have investigated the effect of each feature parameter on multi-document summarization using its score values. For instance, to investigate the effect of the first feature (word similarity among sentences) on the summarization performance, we have used $\mathit{Score}_{\mathrm{F}1}(S)$ (Eq. (1)) to rank the sentences in the clusters.

Table 1 shows the summarization average ROUGE-1 result associated with each text feature.

4.4 The results using the sum of all normalized feature parameters

In this section, we have used the summation of all normalized feature parameters associated with a sentence to calculate its score value. We have used the following formula:

$$\begin{aligned} \mathit{Score}(S)& =\mathit{Score}_{\mathrm{F}1}(S)+\mathit{Score}_{\mathrm{F}2}(S) +\mathit{Score}_{\mathrm{F}3}(S) \\ &\quad {}+\mathit{Score}_{\mathrm{F}4}(S) +\mathit{Score}_{\mathrm{F}5}(S)+\mathit{Score}_{\mathrm{F}6}(S) \\ &\quad {}+\alpha\mathit{Score}_{\mathrm{F}7}(S)+\alpha\mathit{Score}_{\mathrm{F}8}(S) \end{aligned}$$
(27)

where all the feature parameters are normalized, and $\alpha=(\mathit{Score}_{\mathrm{F}1}(S)+\mathit{Score}_{\mathrm{F}2}(S)+\mathit{Score}_{\mathrm{F}3}(S)+\mathit{Score}_{\mathrm{F}4}(S)+\mathit{Score}_{\mathrm{F}5}(S)+\mathit{Score}_{\mathrm{F}6}(S))/6$. Table 1 shows the summarization average ROUGE-1.
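A minimal sketch of Eq. (27), assuming `f` is the list of the eight normalized feature scores of a sentence in the order F1–F8:

```python
# A minimal sketch of Eq. (27): sum the six graded features and weight the two
# binary features (F7, F8) by alpha, the mean of the first six scores.
def combined_score(f):
    alpha = sum(f[:6]) / 6
    return sum(f[:6]) + alpha * (f[6] + f[7])
```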

4.5 The results of the maximum entropy method (ME)

The system has extracted features from the training data, which it used to train the ME model. We used the sentences of the testing data as inputs to the maximum entropy method as follows:

  1. Extract the features from the sentences in the file.

  2. Construct the feature vector.

  3. Use this feature vector as an input to the ME model.

  4. Save the output of the ME method for each sentence.

  5. Chronologically select the set of sentences based on the output of the ME model.

Table 1 shows the summarization average ROUGE-1.

4.6 The results of naive-Bayes classifier

Here, we have used the steps described in Sect. 4.5, with the naive-Bayes classifier replacing the maximum entropy model. Table 1 shows the summarization average ROUGE-1.

4.7 The results of the support vector machine classifier

Here, we have used the steps described in Sect. 4.5, with the support vector machine classifier replacing the maximum entropy model. Table 1 shows the summarization average ROUGE-1.

4.8 The results of the hybrid machine learning model

We have used Eq. (25), which is a hybrid of the maximum entropy, naive-Bayes and support vector machine methods, to achieve the results in Table 1.

4.9 The results of the hybrid machine learning model using the best feature set

Feature selection is a process that selects a subset of the original features [44, 45]. The following algorithm uses forward feature selection (selecting the best feature at each stage); a code sketch follows the list.

  1. Start with a single feature and analyze the performance.

  2. Repeat Step 1 until all features have been tried.

  3. Select the feature giving the best performance (the feature set now contains only one feature).

  4. Add one feature (from the rest of the available features) to the feature set, then analyze the performance.

  5. Select the feature set that provides the best performance.

  6. Repeat Steps 4 and 5 until all features have been analyzed or no further performance improvement is seen.
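The sketch below implements this loop, assuming `evaluate(feats)` is a placeholder that returns the average ROUGE-1 obtained when summarizing with the feature subset `feats` (a set of feature indices 1–8).

```python
# A minimal sketch of the forward feature selection algorithm above.
def forward_selection(all_features, evaluate):
    selected, best_score = set(), 0.0
    while selected != all_features:
        # Try adding each remaining feature and keep the best candidate.
        score, feat = max((evaluate(selected | {f}), f)
                          for f in all_features - selected)
        if score <= best_score:      # stop when no further improvement is seen
            break
        selected.add(feat)
        best_score = score
    return selected, best_score
```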

The feature set composed of features F1, F2, F5, and F6 gave reasonable results, as shown in Table 1.

4.10 Discussion

It is common for a file title to convey the main topic of its content. It is clear from Table 1 that the most important text feature for summarization is F6 (title feature), because it uses the vocabulary overlap between a sentence and the title. F5 (summation of tfidf of the sentence terms) also provided a good result. F1 and F2 were also effective, which is reasonable, as sentences that contain words that appear in other sentences of the document (F1) or in other paragraphs (F2) should be important. The lowest results are associated with F7 (sentence location feature) and F8 (occurrence of non-essential information feature). The results using the support vector machine classifier are better than those of the maximum entropy and naive-Bayes approaches, as shown in Table 1. The hybrid machine learning model produced the best results, as shown in Table 1. However, it does not significantly outperform state-of-the-art approaches to multi-document summarization. Adding other text features, such as positive keywords, negative keywords, bushy path, and aggregate similarity, might improve the results of the proposed method.

5 Conclusions and future work

In this paper, we have investigated the use of maximum entropy, naive-Bayes, and support vector machine models for multi-document automatic text summarization, and we have also investigated a hybrid machine learning model. We have applied our new approaches to the DUC 2002 data set. Our approaches use feature extraction criteria, which give researchers the opportunity to use many variations based on the language and text type; the text features we have used are language independent. The results we achieved are promising when compared with some existing techniques.

In the future, we will extend this approach to personalized single and multi-document summarization.