
1 Introduction

In today's data-driven world, more than 2.5 quintillion bytes of data are generated every day. Since this huge quantity of data requires a tremendous amount of processing, most of it is never analyzed, and we are still learning to make sense of it. A large fraction of this data is in the form of text documents, ranging from news articles to e-books. These documents are often verbose and demand long reading times that one typically wants to avoid; hence the need to summarize them. Moreover, to process such large amounts of data efficiently, summarization needs to be automated. The two most common methods for automatic text summarization are extractive, where important sentences and phrases are extracted from the original text without modifying the sentences themselves, and abstractive, which involves paraphrasing parts of the original source text [1]. Multi-document summarization is a technique that can generate summaries from a large volume of documents, and much active research has been carried out in this domain [2,3,4]. This paper uses the extractive approach for multi-document summarization. Given a corpus of documents, the proposed methodology produces a summary by extracting the most important sentences from the corpus using Multilayer ELM (ML-ELM), a classifier that is rapidly gaining recognition in the machine learning domain because it is easy to implement and can handle large volumes of data at high processing speed [5]. To generate summaries algorithmically, the classifier is trained on a dataset, and the trained model predicts the importance of each sentence in every document of the corpus. Identifying the feature set over which training occurs is therefore a crucial step. The proposed approach identifies nine important features that indicate whether a sentence is likely to be part of the summary. Experimental results on DUC and TAC datasets show that ML-ELM can outperform other traditional classifiers in the field of text summarization.

The remainder of the paper is organized as follows: Sect. 2 briefly discusses the basics of ML-ELM. The proposed approach is described in Sect. 3. Section 4 covers the experimental work, and Sect. 5 concludes the paper.

2 Basic Preliminaries

2.1 Extreme Learning Machine

Huang et al. [6] proposed a single-hidden-layer feedforward neural network called the Extreme Learning Machine (ELM). Several important characteristics of ELM, such as the absence of back propagation, extremely fast learning speed, and the ability to handle large datasets, have made it popular compared to other established classifiers.

Mathematically: consider N distinct examples \((x_i, y_i)\), where \(x_i=[x_{i1}, x_{i2},..., x_{in}]^T \in R^n \) and \(y_i = [y_{i1}, y_{i2},..., y_{im}]^T \in R^m\), such that \((x_i, y_i) \in R^n \times R^m\), \(i = 1, 2,..., N\). ELM has an activation function g(x) and L hidden nodes. For an input x, the ELM output function can be written as

$$\begin{aligned} y_j = \sum \limits _{i=1}^L \beta _i g(\mathbf w_i\cdot x_j+ b_i ) \end{aligned}$$
(1)

where \(j = 1,...,N\), and \(w_i\) and \(b_i\) are randomly generated hidden node parameters. \(w_i= [w_{i1}, w_{i2}, w_{i3} ,..., w_{in}]^T\) is the weight vector that connects the ‘n’ input nodes to the \(i^{th}\) hidden node, and \(b_i\) is the bias of the \(i^{th}\) hidden node. \( \beta \) is the weight vector that connects each hidden node to every output node and is represented as \( \beta = [\beta _1,...,\beta _L]^T\). Equation 1 in reduced matrix form can be written as \( H \beta = Y\), where Y and H are the output matrix and the hidden layer output matrix, respectively.
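Since \(w_i\) and \(b_i\) are fixed at random, training an ELM reduces to solving \(H \beta = Y\) once via a pseudo-inverse. The following minimal sketch illustrates Eq. 1 under assumptions of our own (tanh as g, NumPy, a plain rather than regularized least-squares solution); it is not the authors' implementation.

```python
import numpy as np

def elm_train(X, Y, L, seed=0):
    """X: (N, n) inputs, Y: (N, m) targets, L: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((L, X.shape[1]))  # random input weights w_i (never trained)
    b = rng.standard_normal(L)                # random hidden biases b_i
    H = np.tanh(X @ W.T + b)                  # hidden layer output matrix H
    beta = np.linalg.pinv(H) @ Y              # least-squares solution of H beta = Y
    return W, b, beta

def elm_predict(X, W, b, beta):
    # Eq. 1: y_j = sum_i beta_i * g(w_i . x_j + b_i)
    return np.tanh(X @ W.T + b) @ beta
```

Because nothing is iterated, the cost of training is dominated by a single pseudo-inverse of H, which is what gives ELM its speed.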

2.2 Multilayer ELM

Multilayer ELM (ML-ELM), an artificial neural network with multiple hidden layers, was proposed by Kasun et al. [7]; it retains all the properties of ELM since it combines ELM with the ELM autoencoder (ELM-AE). The parameters of ML-ELM are trained layer-wise using ELM-AE in an unsupervised manner. No iteration takes place during training, so this unsupervised training is very fast. The architectures of ELM-AE and ELM are almost identical, except that ELM is supervised in nature while ELM-AE is unsupervised and can be stacked and trained progressively. The stacked ELM-AEs learn how to represent the data: the first level produces a basic representation, the second level combines it into a higher-level representation, and so on. Using Eq. 2, ML-ELM transfers the data between the hidden layers. The general architecture of ML-ELM is shown in Fig. 1.

$$\begin{aligned} H_i= g\left(\beta _i^T H_{i-1}\right) \end{aligned}$$
(2)

where \(H_i\) is the \(i^{th}\) hidden layer output matrix and \(\beta _i\) is the output weight vector of the ELM-AE placed before the \(i^{th}\) hidden layer; for \(i=0\), \(H_0\) represents the input layer x. The regularized least squares technique is used to analytically calculate the output weights of the connection between the last hidden layer and the nodes of the output layer.

Fig. 1. Multilayer ELM and ELM Autoencoder
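The layer-wise stacking can be sketched as follows, again under assumptions of our own (tanh activation, row-major data, an unregularized pseudo-inverse in place of the regularized solution). Each ELM-AE is trained to reconstruct its own input, and its output weights \(\beta_i\) then transfer the data to the next layer per Eq. 2.

```python
import numpy as np

def elm_ae_beta(H_prev, L, rng):
    """One ELM-AE step: learn output weights that reconstruct H_prev
    from a random hidden mapping (unsupervised, no iteration)."""
    W = rng.standard_normal((L, H_prev.shape[1]))
    b = rng.standard_normal(L)
    H_tmp = np.tanh(H_prev @ W.T + b)
    # output weights mapping the hidden representation back to the input
    return np.linalg.pinv(H_tmp) @ H_prev     # shape (L, n)

def mlelm_transform(X, layer_sizes, seed=0):
    """Stack ELM-AEs; Eq. 2 in row form: H_i = g(H_{i-1} beta_i^T)."""
    rng = np.random.default_rng(seed)
    H = X
    for L in layer_sizes:
        beta = elm_ae_beta(H, L, rng)
        H = np.tanh(H @ beta.T)               # transfer data to the next layer
    return H
# The final supervised layer is then solved as in Eq. 1, i.e. a
# (regularized) least-squares fit between the last H and the labels.
```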

3 Proposed Approach

3.1 Identifying Important Features and Generating the Feature Vector

The proposed approach identifies nine important features of a sentence, which are described below. These nine feature values, along with the class label, form the feature vector for each sentence s of a document d belonging to the corpus C. A sketch of how such a feature vector can be computed follows the list.

1. Length of the sentence: This feature measures how large a given sentence is by computing its size, i.e. the number of terms it contains.

2. Weight of the sentence: Term Frequency (TF) measures the frequency of a term t in d, whereas Inverse Document Frequency (IDF) measures the importance of t in the entire corpus C. The TF-IDF value of a term is computed as \(TF\text{-}IDF_{t, d} = TF_{t, d} \times IDF_{t}\). The score of a sentence s is computed by summing the TF-IDF values of all the terms belonging to that sentence.

3. Density of the sentence: The density of a sentence s is the ratio between the total count of keywords in s and the total count of words in s, including stop-words. For each s \(\in \) d:

    i. extract the keywords (Footnote 1) from s and let the count of keywords be n;

    ii. compute the score as \(\frac{n}{N}\), where N is the number of words in s including stop-words.

4. Presence of named entities in the sentence: Named entities are terms, generally proper nouns, that belong to categories such as ‘names of people’, ‘organizations’, ‘locations’, ‘quantities’, etc. Sentences containing named entities are usually important because they tend to talk directly about an object. The Stanford NER Tagger (Footnote 2) is used to detect the presence of such terms; the score of a sentence is the number of named entities it contains.

5. Presence of cue-phrases in the sentence: In general, sentences that contain phrases like ‘in summary’, ‘our investigation’, ‘in conclusion’, ‘the paper describes’, and ones that highlight quality, such as ‘important’, ‘the best’, ‘hardly’, ‘significantly’, ‘in particular’, etc., are important because they tend to be a good source of significant information in a document. Sentences containing such cue-phrases are given a score of 1 and the rest are given 0. WordNet (Footnote 3) has been used to find such words/phrases.

6. Relative offset of the sentence: Sentences located at the beginning or towards the end of a document tend to be more important, as they carry relevant information such as definitions and conclusions. Such sentences receive a score of 1, otherwise 0.

7. Presence of title words in the sentence: A document is best described by sentences that contain some or all of the words in the title of that document, and hence such sentences are considered important. The score of sentence s is computed using Eq. 3.

    $$\begin{aligned} \text {score of}\,\, s = \frac{\text {number of words common to the title and } s}{\text {total number of words in the title}} \end{aligned}$$
    (3)
8. Presence of quoted text in the sentence: Sentences that have part of their text within quotation marks assert something specific, and that information is important. For each s \(\in \) d, if it contains words/phrases enclosed within “ ” (double quotation) marks, it receives a score of 1, otherwise 0.

9. Presence of upper-case words in the sentence: This feature checks for the presence of upper-case words/phrases in a sentence; such words/phrases are likely to refer to important acronyms, names, places, etc. Such sentences receive a score of 1, otherwise 0.
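For illustration, a simplified sketch of the feature extraction is given below. The stop-word list and cue phrases are tiny stand-ins for real lexicons, feature 4 is omitted because it requires an external NER tagger, and the exact tokenization and smoothing are our assumptions, not the authors' code.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are"}
CUE_PHRASES = ("in summary", "in conclusion", "significantly", "in particular")

def tokens(text):
    return re.findall(r"[A-Za-z0-9']+", text.lower())

def sentence_features(s, idx, doc_sents, title, corpus_docs):
    toks = tokens(s)
    keywords = [t for t in toks if t not in STOPWORDS]
    tf = Counter(keywords)
    n_docs = len(corpus_docs)
    # feature 2: sum of TF-IDF over the sentence's terms (smoothed IDF)
    weight = sum(
        tf[t] * math.log((1 + n_docs) / (1 + sum(t in tokens(d) for d in corpus_docs)))
        for t in tf
    )
    title_toks = set(tokens(title))
    return [
        len(toks),                                               # 1: length
        weight,                                                  # 2: weight
        len(keywords) / max(len(toks), 1),                       # 3: density
        int(any(c in s.lower() for c in CUE_PHRASES)),           # 5: cue phrase
        int(idx in (0, len(doc_sents) - 1)),                     # 6: relative offset
        len(title_toks & set(toks)) / max(len(title_toks), 1),   # 7: title words (Eq. 3)
        int('"' in s),                                           # 8: quoted text
        int(any(w.isupper() and len(w) > 1 for w in s.split())), # 9: upper-case words
    ]
```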

3.2 Deciding the Class Label for Each Sentence

The class label for each sentence is either important (+1) or unimportant (-1). The sentences of a document that are labeled as “important” constitute the summary of that document.
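As a hypothetical sketch of this step, any trained classifier returning +1/-1 labels over the feature vectors can be used to assemble the summary; the `model` object and its `predict` method are assumptions for illustration.

```python
def build_summary(model, doc_sents, feature_vecs):
    # +1 marks an important sentence, -1 an unimportant one
    labels = model.predict(feature_vecs)
    return " ".join(s for s, y in zip(doc_sents, labels) if y == +1)
```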

4 Experimental Analysis

For ELM and ML-ELM, the code is run with different numbers of hidden nodes (n) and hidden layers, and only the configuration that yields the best results is reported. Each DUC (Footnote 4) and TAC (Footnote 5) dataset contains four human-written gold summaries. For DUC-2006 and DUC-2007, ELM achieved the maximum F-measure with n = 175; for ML-ELM, the best results were obtained with n = 50 and n = 30, respectively, and 5 hidden layers. Tables 1 and 2 show the performance of different classifiers on these two DUC datasets. Similarly, for TAC-2008 and TAC-2009, ELM with n = 175 and ML-ELM with n = 100 and 5 hidden layers achieved the maximum F-measure. The results on these two TAC datasets are shown in Tables 3 and 4, respectively.

The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores (Footnote 6) of ML-ELM against the extractive gold summaries and the human-written gold summaries on both datasets are shown in Tables 5 and 6, respectively. ROUGE-1 and ROUGE-2 are used for unigram and bigram matching, which require recall computation. Another ROUGE score, ROUGE-SU4 [8], is also used for recall computation; it matches skip-bigrams with unigrams with a maximum gap length of 4. Human-written gold summaries typically use words that are not part of the original documents, since people paraphrase in their own way; evaluating against them therefore yields low ROUGE scores, as shown in Table 6. System-generated summaries, being extractive, strictly reuse the n-gram structure of the original documents, so there is very little n-gram overlap between them and the human-written gold summaries, and since ROUGE essentially relies on n-gram recall, the resulting scores are low. But if a set of extractive gold summaries is created from the human-written gold summaries and used alongside the system-generated summaries, ROUGE scores improve, as shown in Table 5, because there is sufficient n-gram recall between the two sets. From the tables, it is observed that the F-measure of ML-ELM (bold indicates the maximum) is better than that of the other established classifiers.
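For reference, ROUGE-N recall boils down to n-gram overlap counting. The sketch below is a simplification of our own (whitespace tokenization, no stemming or stop-word handling, unlike the official ROUGE toolkit) that shows why extractive candidates match extractive references more readily than paraphrased ones.

```python
from collections import Counter

def ngram_counts(toks, n):
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: matched n-grams divided by n-grams in the reference."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    matched = sum((cand & ref).values())   # clipped n-gram matches
    total = sum(ref.values())
    return matched / total if total else 0.0
```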

Table 1. DUC 2006 (train and test)
Table 2. DUC 2007 (train and test)
Table 3. TAC 2008 (train and test)
Table 4. TAC 2009 (train and test)
Table 5. ROUGE of ML-ELM (Extractive gold summary)
Table 6. ROUGE of ML-ELM (Human-written gold summary)

5 Conclusion

This paper discussed an automatic text summarization technique using ML-ELM. The proposed approach identifies nine important features of a sentence to form the feature vector. ML-ELM is used on the test datasets to find the sentences that are important for building the summary. For experimental purposes, different DUC and TAC datasets are used. ROUGE scores in terms of average F-measure are calculated for ML-ELM, and the results show that ML-ELM performs better than the other classifiers, demonstrating that deep learning has a strong influence on the summarization of textual data. This work can be extended by using the proposed technique to build a recommendation system for different electronic devices. Also, combining the feature space of ML-ELM with other classifiers may further improve the classification results.