
1 Introduction

In today's data-driven world, more than 2.5 quintillion bytes of data are generated every day. Since this huge quantity of data requires a tremendous amount of processing, most of it is never analyzed, and we are still learning to make sense of it. A large fraction of this data is in the form of text documents, ranging from news articles to e-books. These documents are often verbose and demand long reading times that one typically wants to avoid; hence the need to summarize them. Moreover, to process such large amounts of data efficiently, summarization needs to be automated. The two most common methods for automatic text summarization are extractive, where important sentences and phrases are extracted from the original text without modifying the sentences themselves, and abstractive, which involves paraphrasing parts of the original source text [1]. Multi-document summarization is a technique that can generate summaries from a large volume of documents, and much active research has been carried out in this domain [2,3,4]. This paper uses the extractive approach for multi-document summarization. Given a corpus of documents, the proposed methodology produces a summary by extracting the most important sentences from the corpus using Multilayer ELM (ML-ELM), a classifier that is rapidly gaining recognition in the machine learning domain because it is easy to implement and can handle large volumes of data at high processing speed [5]. To generate summaries algorithmically, the classifier is trained on a dataset, and the trained model predicts the importance of each sentence in every document of the corpus. Identifying the feature set over which training occurs is therefore a crucial step. The proposed approach identifies nine important features that indicate whether a sentence is likely to be part of the summary. Experimental results on DUC and TAC datasets show that ML-ELM can outperform other traditional classifiers in the field of text summarization.

The remainder of the paper is organized as follows: Sect. 2 briefly discusses the basics of ML-ELM. The proposed approach is described in Sect. 3. Section 4 covers the experimental work, and Sect. 5 concludes the paper.

2 Basic Preliminaries

2.1 Extreme Learning Machine

Huang et al. [6] proposed a single-hidden-layer feedforward neural network called the Extreme Learning Machine (ELM). Several important characteristics of ELM, such as the absence of back propagation, extremely fast learning speed, and the ability to handle large datasets, have made it popular compared to other established classifiers.

Mathematically: consider N distinct examples \((x_i, y_i)\), where \(x_i=[x_{i1}, x_{i2},..., x_{in}]^T \in R^n \) and \(y_i = [y_{i1}, y_{i2},..., y_{im}]^T \in R^m\), such that \((x_i, y_i) \in R^n \times R^m\), \(i = 1, 2,..., N\). ELM has an activation function g(x) and L hidden nodes. For an input x, the ELM output function can be written as

$$\begin{aligned} y_j = \sum \limits _{i=1}^L \beta _i g(\mathbf w_i\cdot x_j+ b_i ) \end{aligned}$$
(1)

where \(j = 1,...,N\), and \(w_i\) and \(b_i\) are randomly generated hidden node parameters. \(w_i= [w_{i1}, w_{i2}, w_{i3} ,..., w_{in}]^T\) is the weight vector that connects the ‘n’ input nodes to the \(i^{th}\) hidden node, and \(b_i\) is the bias of the \(i^{th}\) hidden node. \( \beta \) is the weight vector that connects each hidden node to every output node and is represented as \( \beta = [\beta _1,...,\beta _L]^T\). Equation 1 in reduced matrix form can be written as \( H \beta = Y\), where Y and H are the output matrix and the hidden layer output matrix, respectively.
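Since \(w_i\) and \(b_i\) are fixed at random, training an ELM reduces to solving \(H \beta = Y\) once via a pseudo-inverse. The following minimal sketch illustrates Eq. 1 under assumptions of our own (tanh as g, NumPy, a plain rather than regularized least-squares solution); it is not the authors' implementation.

```python
import numpy as np

def elm_train(X, Y, L, seed=0):
    """X: (N, n) inputs, Y: (N, m) targets, L: number of hidden nodes."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((L, X.shape[1]))  # random input weights w_i (never trained)
    b = rng.standard_normal(L)                # random hidden biases b_i
    H = np.tanh(X @ W.T + b)                  # hidden layer output matrix H
    beta = np.linalg.pinv(H) @ Y              # least-squares solution of H beta = Y
    return W, b, beta

def elm_predict(X, W, b, beta):
    # Eq. 1: y_j = sum_i beta_i * g(w_i . x_j + b_i)
    return np.tanh(X @ W.T + b) @ beta
```

Because nothing is iterated, the cost of training is dominated by a single pseudo-inverse of H, which is what gives ELM its speed.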

2.2 Multilayer ELM

Multilayer ELM (ML-ELM), an artificial neural network with multiple hidden layers, was proposed by Kasun et al. [7]; it retains all the properties of ELM since it combines ELM with the ELM autoencoder (ELM-AE). The parameters of ML-ELM are trained layer-wise using ELM-AE in an unsupervised manner. No iteration takes place during training, so this unsupervised training is very fast. The architectures of ELM-AE and ELM are almost identical, except that ELM is supervised in nature while ELM-AE is unsupervised and can be stacked and trained progressively. The stacked ELM-AEs learn how to represent the data: the first level produces a basic representation, the second level combines it into a higher-level representation, and so on. Using Eq. 2, ML-ELM transfers the data between the hidden layers. The general architecture of ML-ELM is shown in Fig. 1.

$$\begin{aligned} H_i= g\left(\beta _i^T H_{i-1}\right) \end{aligned}$$
(2)

where \(H_i\) is the \(i^{th}\) hidden layer output matrix and \(\beta _i\) is the output weight vector of the ELM-AE placed before the \(i^{th}\) hidden layer; for \(i=0\), \(H_0\) represents the input layer x. The regularized least squares technique is used to analytically calculate the output weights of the connection between the last hidden layer and the nodes of the output layer.

Fig. 1. Multilayer ELM and ELM Autoencoder
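The layer-wise stacking can be sketched as follows, again under assumptions of our own (tanh activation, row-major data, an unregularized pseudo-inverse in place of the regularized solution). Each ELM-AE is trained to reconstruct its own input, and its output weights \(\beta_i\) then transfer the data to the next layer per Eq. 2.

```python
import numpy as np

def elm_ae_beta(H_prev, L, rng):
    """One ELM-AE step: learn output weights that reconstruct H_prev
    from a random hidden mapping (unsupervised, no iteration)."""
    W = rng.standard_normal((L, H_prev.shape[1]))
    b = rng.standard_normal(L)
    H_tmp = np.tanh(H_prev @ W.T + b)
    # output weights mapping the hidden representation back to the input
    return np.linalg.pinv(H_tmp) @ H_prev     # shape (L, n)

def mlelm_transform(X, layer_sizes, seed=0):
    """Stack ELM-AEs; Eq. 2 in row form: H_i = g(H_{i-1} beta_i^T)."""
    rng = np.random.default_rng(seed)
    H = X
    for L in layer_sizes:
        beta = elm_ae_beta(H, L, rng)
        H = np.tanh(H @ beta.T)               # transfer data to the next layer
    return H
# The final supervised layer is then solved as in Eq. 1, i.e. a
# (regularized) least-squares fit between the last H and the labels.
```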

3 Proposed Approach

3.1 Identifying Important Features and Generating the Feature Vector

The proposed approach identifies nine important features of a sentence, which are described below. These nine feature values, along with the class label, form the feature vector for each sentence s of a document d belonging to the corpus C. A sketch of how such a feature vector can be computed follows the list.

1. Length of the sentence: This feature measures how large a given sentence is by computing its size, i.e. the number of terms it contains.

2. Weight of the sentence: Term Frequency (TF) measures the frequency of a term t in d, whereas Inverse Document Frequency (IDF) measures the importance of t in the entire corpus C. The TF-IDF value of a term is computed as \(TF\text{-}IDF_{t, d} = TF_{t, d} \times IDF_{t}\). The score of a sentence s is computed by summing the TF-IDF values of all the terms belonging to that sentence.

3. Density of the sentence: The density of a sentence s is the ratio between the total count of keywords in s and the total count of words in s, including stop-words. For each s \(\in \) d:

    i. extract the keywords (Footnote 1) from s and let the count of keywords be n;

    ii. compute the score as \(\frac{n}{N}\), where N is the number of words in s including stop-words.

4. Presence of named entities in the sentence: Named entities are terms, generally proper nouns, that belong to categories such as ‘names of people’, ‘organizations’, ‘locations’, ‘quantities’, etc. Sentences containing named entities are usually important because they tend to talk directly about an object. The Stanford NER Tagger (Footnote 2) is used to detect the presence of such terms; the score of a sentence is the number of named entities it contains.

5. Presence of cue-phrases in the sentence: In general, sentences that contain phrases like ‘in summary’, ‘our investigation’, ‘in conclusion’, ‘the paper describes’, and ones that highlight quality, such as ‘important’, ‘the best’, ‘hardly’, ‘significantly’, ‘in particular’, etc., are important because they tend to be a good source of significant information in a document. Sentences containing such cue-phrases are given a score of 1 and the rest are given 0. WordNet (Footnote 3) has been used to find such words/phrases.

6. Relative offset of the sentence: Sentences located at the beginning or towards the end of a document tend to be more important, as they carry relevant information such as definitions and conclusions. Such sentences receive a score of 1, otherwise 0.

7. Presence of title words in the sentence: A document is best described by sentences that contain some or all of the words in the title of that document, and hence such sentences are considered important. The score of sentence s is computed using Eq. 3.

    $$\begin{aligned} \text {score of}\,\, s = \frac{\text {number of words common to the title and } s}{\text {total number of words in the title}} \end{aligned}$$
    (3)
8. Presence of quoted text in the sentence: Sentences that have part of their text within quotation marks assert something specific, and that information is important. For each s \(\in \) d, if it contains words/phrases enclosed within “ ” (double quotation) marks, it receives a score of 1, otherwise 0.

9. Presence of upper-case words in the sentence: This feature checks for the presence of upper-case words/phrases in a sentence; such words/phrases are likely to refer to important acronyms, names, places, etc. Such sentences receive a score of 1, otherwise 0.
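For illustration, a simplified sketch of the feature extraction is given below. The stop-word list and cue phrases are tiny stand-ins for real lexicons, feature 4 is omitted because it requires an external NER tagger, and the exact tokenization and smoothing are our assumptions, not the authors' code.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are"}
CUE_PHRASES = ("in summary", "in conclusion", "significantly", "in particular")

def tokens(text):
    return re.findall(r"[A-Za-z0-9']+", text.lower())

def sentence_features(s, idx, doc_sents, title, corpus_docs):
    toks = tokens(s)
    keywords = [t for t in toks if t not in STOPWORDS]
    tf = Counter(keywords)
    n_docs = len(corpus_docs)
    # feature 2: sum of TF-IDF over the sentence's terms (smoothed IDF)
    weight = sum(
        tf[t] * math.log((1 + n_docs) / (1 + sum(t in tokens(d) for d in corpus_docs)))
        for t in tf
    )
    title_toks = set(tokens(title))
    return [
        len(toks),                                               # 1: length
        weight,                                                  # 2: weight
        len(keywords) / max(len(toks), 1),                       # 3: density
        int(any(c in s.lower() for c in CUE_PHRASES)),           # 5: cue phrase
        int(idx in (0, len(doc_sents) - 1)),                     # 6: relative offset
        len(title_toks & set(toks)) / max(len(title_toks), 1),   # 7: title words (Eq. 3)
        int('"' in s),                                           # 8: quoted text
        int(any(w.isupper() and len(w) > 1 for w in s.split())), # 9: upper-case words
    ]
```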

3.2 Deciding the Class Label for Each Sentence

The class label for each sentence is either important (+1) or unimportant (-1). The sentences of a document that are labeled as “important” constitute the summary of that document.
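As a hypothetical sketch of this step, any trained classifier returning +1/-1 labels over the feature vectors can be used to assemble the summary; the `model` object and its `predict` method are assumptions for illustration.

```python
def build_summary(model, doc_sents, feature_vecs):
    # +1 marks an important sentence, -1 an unimportant one
    labels = model.predict(feature_vecs)
    return " ".join(s for s, y in zip(doc_sents, labels) if y == +1)
```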

4 Experimental Analysis

For ELM and ML-ELM, the code is run with different numbers of hidden nodes (n) and hidden layers, and only the configuration that yields the best results is reported. Each DUC (Footnote 4) and TAC (Footnote 5) dataset contains four human-written gold summaries. For DUC-2006 and DUC-2007, ELM achieved the maximum F-measure with n = 175; for ML-ELM, the best results were obtained with n = 50 and n = 30, respectively, and 5 hidden layers. Tables 1 and 2 show the performance of different classifiers on these two DUC datasets. Similarly, for TAC-2008 and TAC-2009, ELM with n = 175 and ML-ELM with n = 100 and 5 hidden layers achieved the maximum F-measure. The results on these two TAC datasets are shown in Tables 3 and 4, respectively.

The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores (Footnote 6) of ML-ELM against the extractive gold summaries and the human-written gold summaries on both datasets are shown in Tables 5 and 6, respectively. ROUGE-1 and ROUGE-2 are used for unigram and bigram matching, which require recall computation. Another ROUGE score, ROUGE-SU4 [8], is also used for recall computation; it matches skip-bigrams with unigrams with a maximum gap length of 4. Human-written gold summaries typically use words that are not part of the original documents, since people paraphrase in their own way; evaluating against them therefore yields low ROUGE scores, as shown in Table 6. System-generated summaries, being extractive, strictly reuse the n-gram structure of the original documents, so there is very little n-gram overlap between them and the human-written gold summaries, and since ROUGE essentially relies on n-gram recall, the resulting scores are low. But if a set of extractive gold summaries is created from the human-written gold summaries and used alongside the system-generated summaries, ROUGE scores improve, as shown in Table 5, because there is sufficient n-gram recall between the two sets. From the tables, it is observed that the F-measure of ML-ELM (bold indicates the maximum) is better than that of the other established classifiers.
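For reference, ROUGE-N recall boils down to n-gram overlap counting. The sketch below is a simplification of our own (whitespace tokenization, no stemming or stop-word handling, unlike the official ROUGE toolkit) that shows why extractive candidates match extractive references more readily than paraphrased ones.

```python
from collections import Counter

def ngram_counts(toks, n):
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: matched n-grams divided by n-grams in the reference."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    matched = sum((cand & ref).values())   # clipped n-gram matches
    total = sum(ref.values())
    return matched / total if total else 0.0
```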

Table 1. DUC 2006 (train and test)
Table 2. DUC 2007 (train and test)
Table 3. TAC 2008 (train and test)
Table 4. TAC 2009 (train and test)
Table 5. ROUGE of ML-ELM (Extractive gold summary)
Table 6. ROUGE of ML-ELM (Human-written gold summary)

5 Conclusion

This paper discussed an automatic text summarization technique using ML-ELM. The proposed approach identifies nine important features of a sentence to form the feature vector. ML-ELM is used on the test datasets to find the sentences that are important for building the summary. For experimental purposes, different DUC and TAC datasets are used. ROUGE scores in terms of average F-measure are calculated for ML-ELM, and the results show that ML-ELM performs better than the other classifiers, demonstrating that deep learning has a strong influence on the summarization of textual data. This work can be extended by using the proposed technique to build a recommendation system for different electronic devices. Also, combining the feature space of ML-ELM with other classifiers may further improve the classification results.