1 Introduction

With the help of text summarization, a large text document can be shortened to create a summary that is readable and understandable by the user. The generated summary should convey the important information of the document as concisely as possible. Owing to the enormous amount of data available, text summarization has become an indispensable task for search engines to provide the best search results. Text summarization consists of three main phases: interpretation, transformation and generation [1]. During interpretation, a representation of the text document to be summarized is produced. The transformation phase converts the outcome of the interpretation phase into a summary representation, and the final summary is produced in the generation phase.

Text summarization methods [2] can be classified into the following types based on the kind of information and the application. Based on the approach to summary generation, there are two types of summarization [2, 3]: extractive and abstractive. In extractive summarization, sentences with high scores are extracted from the input document and added to the final summary without any changes. This is the easiest approach to implement, but it may result in a lengthy and inconsistent summary because it does not deal with semantics. In abstractive summarization, the sentences generated in the summary are semantically related; natural language processing is used to perform it. It is difficult to implement because it requires a human-like understanding of the document, but it produces a more readable summary than extractive summarization does. Based on the kind of details present in the final summary, summaries are divided into two types [4]: indicative and informative. Based on the number of documents given as input to the summarizer, summarization is classified into two types [5]: single-document and multi-document summarization. In single-document summarization, the automatic summarizer generates the final summary from a single input document; it involves less overhead. In multi-document summarization, the summarizer generates the final summary from more than one input document, for example by clustering the documents. It is more difficult to implement, and care must be taken to avoid redundancy in the final summary [6].

In single-document extractive summarization, a single document is taken as input and segmented into sentences. Feature extraction is performed on the tokenized sentences, and each sentence is scored as a linear combination of the weights of the extracted features. To generate a more readable and cohesive summary, the sentences of the document are clustered and then ranked. Sentences with relatively high scores are chosen to form the final summary. A good summary is one that conveys the complete information of the document in a few sentences. It is characterized by a high compression ratio and a high retention ratio; the retention ratio measures the amount of information retained in the generated summary and should be much larger than the compression ratio for an ideal summary.

2 Related work

Summarizing a single text document using sentence scoring is not a new idea; several researchers have worked on text summarization and its applications. For example, Nandhini and Balasundaram [7] presented a model in which the score of each sentence is calculated as a weighted combination of learner-dependent and text-dependent features, where the weights are determined using a genetic algorithm. A supervised learning algorithm is used to train a classifier to extract important and readable sentences based on the feature vectors. Later, Ferreira et al. [8] proposed a model in which sentences are clustered to categorize them into specified topics. The sentences are scored, and those with the highest scores are selected to form the cluster; the number of selected sentences depends on the summarization rate provided by the user. The same method was used again by Ferreira et al. [9] to experiment on news, blogs and articles by combining fifteen sentence scoring methods in different ways. Several researchers have also worked on extractive summarization; for instance, an extractive model of single-document text summarization based on an agglomerative nested clustering approach was proposed by Sharaff et al. [10].

In text summarization, evolutionary approaches have also been explored, such as genetic algorithms, particle swarm optimization and cat swarm optimization. Benjumea and Leon [11] developed a genetic clustering algorithm named SENCLUS, in which sentences are clustered to represent the closeness among text topics using a dedicated fitness function based on coverage and redundancy; the most relevant sentences from each topic are then selected to be part of the extractive summary. Multi-valued or two-valued logic suffers from imprecise values and ambiguity. To overcome this problem, fuzzy logic and WordNet synonyms are used in the model proposed by Yadav and Meena [12], which also addresses issues involving the semantics of the text.

Niu et al. [13] proposed OnSeS, a novel summarization method for short texts that uses word2vec and neural networks. It has three phases: clustering, ranking by building a graph using BM25, and generating the main point of each cluster using neural machine translation. Jeong et al. [14] proposed an integrated framework that learns using category information and summaries; it uses a language model to combine feature distributions and performs better than individual text summarization and classification. It is based on a POS tagger and simple statistical approaches, which makes it easy to implement and language independent. A multi-variant email classification model using clustering techniques has been developed to generate categorical terms [15], and a model for extractive summarization using a fuzzy-logic-based approach has been presented [16]. Rahman and Borah [17] reviewed various extractive approaches for query-based text summarization: unsupervised methods based on document graphs, features, etc., and supervised methods such as SVM, KNN, the Naïve Bayes classifier and neural networks. Yang et al. [18] used a Bayesian hierarchical topic model; they distinguish specific and general topics and indicate their relationship by examining the topic hierarchies. Instead of using term frequency or other traditional methods for keyword extraction, Thomas et al. [19] proposed extracting keywords automatically by training on the probability distribution of the data.

Babar and Patil [20] used Latent Semantic Analysis (LSA) to capture the semantic content of sentences; LSA comprises input matrix creation, singular value decomposition and sentence selection. A metaheuristics-based approach using an extra-trees classifier has been presented by Sharaff and Gupta to detect spam messages in email [21]. Wang et al. [22] presented a two-phase approach for long text summarization, named EA-LTS, consisting of an extraction phase and an abstraction phase. A query-based summarization designed to consider only a part of the input data, called "targeted summarization", has been proposed by Jiang et al. [23] to summarize a single aspect of a document, conversation or tweet.

Text summarization has several applications in different real-world problems. Generic summarization with user intervention was used to create a hierarchical summary of large heterogeneous data [24]; the authors provide open access to their dataset for performing hierarchical summarization on a given topic. Dernoncourt et al. [25] introduced a repository for experiments on abstractive summary creation and performed several experiments on the introduced dataset using artificial neural networks. To measure the aboutness of textual documents, a graph-based approach has been developed using the TextRank technique [26]. Several evolutionary algorithms have also been explored to generate summaries [27]; one such document summarization uses a cat swarm based optimization technique. The main contribution of this paper is a sentence-scoring-based mechanism to generate a summary of the most informative sentences in a document. A few recent related works and their comparison are shown in Table 1.

Table 1 Overview of related works

3 Proposed methodology

The proposed method of text summarization proceeds through the following stages, presented in Fig. 1: pre-processing, feature extraction, sentence scoring, clustering, cluster ranking and summary generation. The document is pre-processed (stop word removal, stemming, etc.) and features are extracted from its sentences. Sentences are scored based on the extracted features and then grouped into clusters using the K-means or hierarchical clustering algorithm. The clusters and the sentences in each cluster are ranked, and highly ranked sentences from each cluster, in proportion to its relative importance, are included in the final summary. Algorithm 1 presents the main components of the proposed model, which are discussed below.

Fig. 1 Overall methodology of summary generation

The pseudocode of the proposed method is outlined in Algorithm 1.


3.1 Pre-processing the input document

The text document may contain many words that occur very frequently but carry no importance, as well as symbols and white spaces. First, all unnecessary white spaces are removed. The document is then tokenized into sentences and the sentences into words, after which stop word removal and stemming are performed.

3.1.1 Sentence segmentation

In sentence segmentation, the given document is parsed to extract sentences from the input text. This is done by identifying sentence boundaries between words. In English, punctuation marks such as the full stop (.), question mark (?) and exclamation mark (!) act as sentence boundaries. As it involves boundary recognition, sentence segmentation is also known as sentence boundary detection or sentence boundary recognition.

3.1.2 Tokenization

In tokenization, the given sentences are broken down into words. This is done by identifying the spaces between words.

3.1.3 Stop words removal

Stop words are natural language words that occur very frequently but do not add any importance to the meaning of a sentence or document; they mostly serve to keep a sentence grammatically and syntactically correct. Stop words have to be removed because their presence may mislead the results. For example, if term frequency is calculated with stop words present, the stop words will receive higher scores than the relevant words.

3.1.4 Stemming

Stemming is the process of transforming derived words into their root words. A word can appear in different forms in the same document, such as amaze, amazed and amazing; all three come from the same root word 'amaze'. Stemming is performed to normalize the words in the document.
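As an illustration, the whole pre-processing stage (Sects. 3.1.1–3.1.4) could be implemented as in the following sketch. The paper does not name a toolkit; the use of NLTK, its punkt tokenizer and the Porter stemmer here is our assumption.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads of the tokenizer models and stop word list
# (newer NLTK versions may additionally require 'punkt_tab').
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess(document):
    """Segment into sentences, tokenize, remove stop words and stem."""
    sentences = sent_tokenize(document.strip())              # Sect. 3.1.1
    processed = []
    for sentence in sentences:
        words = word_tokenize(sentence)                      # Sect. 3.1.2
        words = [w.lower() for w in words if w.isalnum()]
        words = [w for w in words if w not in STOP_WORDS]    # Sect. 3.1.3
        words = [STEMMER.stem(w) for w in words]             # Sect. 3.1.4
        processed.append(words)
    return sentences, processed
```

The function returns both the original sentences (needed later for summary generation) and their processed token lists (used for feature extraction and similarity computation).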

3.2 Feature extraction

In feature extraction, different features of a sentence are computed from its properties. The features used in this model are the following:

3.2.1 Sentence length

The length of a sentence should be of medium size. A longer sentence may contain unnecessary information and occupies more space in the summary while providing less important information; in an efficient summary, the number of words should be as small as possible. Short sentences, on the other hand, may not provide much information. The sentence length score is computed as the ratio of the number of terms (words) in the sentence to the number of terms in the longest sentence.

3.2.2 Sentence position

Sentences at the start and end of the document are relatively more important, and the starting sentences of paragraphs provide more important information. The first and last sentences are given the highest score, the second sentence from the start and the second from the end are given the next highest score, and so on.

3.2.3 Similarity with the title

Sentences that contain the words of the title provide more relevant information about the document. The score for similarity with the title is computed as the ratio of the number of terms that match the title to the total number of terms in the title.

3.2.4 Proper nouns

The presence of proper nouns in a sentence makes it more important and relevant in the document. The score is computed as the ratio of the number of proper nouns in a sentence to the number of terms in that sentence.

3.2.5 Term weight

Term weight is calculated from the term frequencies of words. It is the ratio of the sum of the term frequencies of all words in a sentence to the maximum such sum over all sentences in the given document.

3.2.6 Dates and numerical data

Sentences that contain dates and numerical data provide important information about a document. The ratio of the number of dates and numerical values in a sentence to the number of terms in the sentence is taken as its score.
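The following sketch shows one way to compute the six feature scores of Sects. 3.2.1–3.2.6 on the tokenized sentences. The symmetric position formula, NLTK's pos_tag for proper nouns and the regex for dates/numbers are our illustrative choices, not specified in the paper.

```python
import re
import nltk

nltk.download('averaged_perceptron_tagger', quiet=True)  # for nltk.pos_tag

def sentence_length(words, longest):
    # Sect. 3.2.1: words in the sentence / words in the longest sentence
    return len(words) / longest

def sentence_position(i, n):
    # Sect. 3.2.2: highest at both ends of the document, decreasing inwards
    # (one symmetric choice; the paper gives no explicit formula)
    return max(i + 1, n - i) / n

def title_similarity(words, title_words):
    # Sect. 3.2.3: title terms occurring in the sentence / terms in the title
    return len(set(words) & set(title_words)) / len(set(title_words))

def proper_nouns(raw_words):
    # Sect. 3.2.4: proper nouns (NNP/NNPS tags) / terms in the sentence;
    # raw_words are the original (unstemmed, case-preserved) tokens
    tags = nltk.pos_tag(raw_words)
    return sum(1 for _, t in tags if t.startswith('NNP')) / len(raw_words)

def term_weight(words, tf, max_tf_sum):
    # Sect. 3.2.5: sum of term frequencies in the sentence, normalized by
    # the maximum sentence-level sum over the document, e.g.
    #   tf = collections.Counter(w for sent in processed for w in sent)
    #   max_tf_sum = max(sum(tf[w] for w in sent) for sent in processed)
    return sum(tf[w] for w in words) / max_tf_sum

def dates_and_numbers(raw_words):
    # Sect. 3.2.6: numeric tokens and simple date patterns / terms in the sentence
    numeric = re.compile(r'^\d[\d/.:-]*$')
    return sum(1 for w in raw_words if numeric.match(w)) / len(raw_words)
```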

3.3 Scoring the sentences

The scoring of sentences is performed using the features extracted above. Some or all of the above features are selected, and a weight is assigned to each feature to calculate the sentence score, which is the weighted linear combination of the extracted feature scores (Nandhini and Balasundaram [7]). For example, if sentence length, sentence position and term weight are considered, the score of a sentence is computed by the formulas in Eqs. 1 and 2:

$$S\left(i\right)= {\alpha }_{1}.{F}_{1}+{\alpha }_{2}.{F}_{2}+{\alpha }_{3}.{F}_{3}$$
(1)
$${\alpha }_{1}+{\alpha }_{2}+{\alpha }_{3}=1$$
(2)

where \({\alpha }_{1}\)= Weight of sentence length feature, \({F}_{1}\) = Sentence length feature score, \({\alpha }_{2}\)= Weight of sentence position feature, \({F}_{2}\) = Sentence position feature score, \({\alpha }_{3}\)= Weight of term weight feature, \({F}_{3}\) = Term weight feature score.

If n features are considered, the generalized formulas are given in Eqs. 3 and 4:

$$S\left(i\right)= {\sum\limits_{k=1}^{n}{\alpha }_{k}.{F}_{k}}$$
(3)
$${\sum\limits _{k=1}^{n}{\alpha }_{k}=1}$$
(4)

where \({\alpha }_{k}\)= Weight of feature k, \({F}_{k}\) = Score of feature k.
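Eqs. 3 and 4 translate directly into code; the feature scores and weight values in this example are placeholders, not the values used in the experiments.

```python
def score_sentence(feature_scores, weights):
    """Eq. 3: S(i) = sum_k alpha_k * F_k, subject to sum_k alpha_k = 1 (Eq. 4)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1 (Eq. 4)"
    return sum(a * f for a, f in zip(weights, feature_scores))

# Three-feature example as in Eq. 1 (length, position, term weight);
# both the feature scores and the weights below are illustrative.
s_i = score_sentence([0.4, 1.0, 0.7], weights=[0.2, 0.3, 0.5])
```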

3.4 Clustering of sentences

To generate an efficient summary, cohesion between sentences is as important as selecting the important sentences of the document. To achieve this cohesion, clustering of sentences is performed based on the cosine similarity between sentence pairs, which measures the angular distance between two sentences. It is calculated using the formula in Eq. 5:

$$\text{Cos}\left({S}_{i},{S}_{j}\right)= \frac{2\times \text{number of words that are in both } {S}_{i} \text{ and } {S}_{j}}{\text{number of words in } {S}_{i}\times \text{number of words in } {S}_{j}}$$
(5)
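Eq. 5 as code, operating on the processed word lists of two sentences (a sketch; both sentences are assumed non-empty):

```python
def similarity(s_i, s_j):
    """Eq. 5: 2 * |words common to s_i and s_j| / (|s_i| * |s_j|)."""
    common = len(set(s_i) & set(s_j))
    return 2 * common / (len(s_i) * len(s_j))
```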

Two clustering techniques used in the proposed model are:

3.4.1 K-means clustering

In K-means clustering, the number of clusters n to be formed is decided first. The n sentences with the highest scores are selected as the initial centroids of the clusters. Each sentence is then assigned to a cluster using the cosine similarity measure; the cluster a sentence belongs to is decided locally optimally. Next, the cohesion of each cluster is evaluated by measuring the distance between each sentence and its centroid. If it is below a threshold value, the cluster formed is good enough; otherwise, different centroids are chosen and the entire process is repeated for a fixed number of iterations.
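A minimal sketch of this procedure follows. Seeding with the highest-scoring sentences matches the description above, while the re-seeding rule inside the loop (pick the member most similar to the rest of its cluster) is our assumption, since the paper only says that different centroids are chosen.

```python
def similarity(s_i, s_j):  # Eq. 5, as in the sketch above
    common = len(set(s_i) & set(s_j))
    return 2 * common / (len(s_i) * len(s_j))

def kmeans_clusters(processed, scores, n, iterations=10):
    """processed: token lists per sentence; scores: their feature-based scores."""
    # Seed centroids with the indices of the n highest-scoring sentences.
    centroids = sorted(range(len(processed)),
                       key=lambda i: scores[i], reverse=True)[:n]
    clusters = [[] for _ in range(n)]
    for _ in range(iterations):
        clusters = [[] for _ in range(n)]
        for i, words in enumerate(processed):
            # Assign each sentence to its most similar centroid (Eq. 5).
            best = max(range(n),
                       key=lambda k: similarity(words, processed[centroids[k]]))
            clusters[best].append(i)
        # Re-seed each centroid with the member most similar to the others.
        for k, members in enumerate(clusters):
            if members:
                centroids[k] = max(members, key=lambda i: sum(
                    similarity(processed[i], processed[j]) for j in members))
    return clusters, centroids
```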

3.4.2 Hierarchical clustering

In hierarchical clustering, the title or the sentence with the highest score is chosen as the root. Then the n sentences most similar to the root are selected as its children; for each of these n children, the n sentences most similar to them are selected as their children, and so on. After the tree of the entire document is formed, a path from root to leaf carries the overall information of the document, and the sentences at each level provide cohesion between sentences. The child of a node at each level is selected in a greedy manner.
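A compact sketch of this greedy tree construction; the branching factor and depth are illustrative parameters, and ties are broken by index order.

```python
def build_tree(processed, scores, branch=2, depth=3):
    """Greedy tree: root = highest-scoring sentence; each node's children are
    the `branch` most similar not-yet-used sentences, expanded level by level."""
    def sim(a, b):  # Eq. 5
        return 2 * len(set(a) & set(b)) / (len(a) * len(b))

    root = max(range(len(processed)), key=lambda i: scores[i])
    tree, used, frontier = {root: []}, {root}, [root]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            # Rank unused sentences by similarity to the current node.
            candidates = sorted((i for i in range(len(processed)) if i not in used),
                                key=lambda i: sim(processed[node], processed[i]),
                                reverse=True)
            children = candidates[:branch]
            tree[node] = children
            for child in children:
                used.add(child)
                tree[child] = []
            next_frontier.extend(children)
        frontier = next_frontier
    return tree
```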

3.5 Cluster ranking

The clusters formed above are ranked, and sentences are selected from each cluster for the final summary in proportion to its rank. Ranking is done on the basis of sentence position in the document and sentence score. The sentence order of the original document is maintained in the final summary.

3.6 Sentence selection from clusters

The sentences in each cluster are ranked based on their similarity with the centroid sentence. The centroid sentence is given the highest rank in the cluster, followed by the sentence most similar to the centroid, and so on. Sentence selection from clusters is based on this ranking of sentences; more sentences are selected from a cluster that contains more high-scoring sentences. The cluster rank is calculated using the formula in Eq. 6:

$$R\left({C}_{k}\right)= \frac{\text{sum of scores of sentences in the cluster}}{\text{number of sentences in the cluster}}$$
(6)

where \({C}_{k}\) is a cluster.
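Eq. 6 in code, with a small made-up example (the scores and cluster memberships are illustrative):

```python
def cluster_rank(cluster, scores):
    """Eq. 6: average score of the sentences in the cluster."""
    return sum(scores[i] for i in cluster) / len(cluster)

# Six sentences in two clusters; the higher average score ranks first.
scores = [0.8, 0.3, 0.6, 0.9, 0.2, 0.5]
clusters = [[0, 3, 5], [1, 2, 4]]
ranked = sorted(clusters, key=lambda c: cluster_rank(c, scores), reverse=True)
print(ranked)  # [[0, 3, 5], [1, 2, 4]]
```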

3.7 Summary generation

After the sentences for the final summary are selected, they are placed in the order of the rank of the cluster they belong to. This generates a summary with a high compression ratio and little information loss.
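A sketch of this final step, keeping the original document order within each ranked cluster (the sentence indices in the example are illustrative):

```python
def generate_summary(sentences, selected_by_cluster):
    """Concatenate selected sentences cluster by cluster, in cluster-rank order."""
    summary = []
    for cluster in selected_by_cluster:  # clusters already sorted by rank
        # Within a cluster, keep the sentences in original document order.
        summary.extend(sentences[i] for i in sorted(cluster))
    return ' '.join(summary)

sentences = ["S0.", "S1.", "S2.", "S3.", "S4.", "S5."]
print(generate_summary(sentences, [[3, 0], [4]]))  # S0. S3. S4.
```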

4 Experimental analysis

The proposed model is tested on the DUC (Document Understanding Conference) dataset (https://duc.nist.gov/) [28]. The dataset consists of text documents to be summarized and reference summaries written by humans (abstractive summaries). Twenty documents are randomly taken from the DUC dataset and summaries are generated using the proposed model. The ROUGE-1 and ROUGE-2 metrics have been used for evaluation. ROUGE-N (Meena and Gopalani, 2014) is computed as shown in Eq. 7:

$$\text{ROUGE-N}= \frac{\sum_{S\in \text{ref summary}}\sum_{{\text{gram}}_{n}\in S}{\text{count}}_{\text{match}}\left({\text{gram}}_{n}\right)}{\sum_{S\in \text{ref summary}}\sum_{{\text{gram}}_{n}\in S}\text{count}\left({\text{gram}}_{n}\right)}$$
(7)
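A minimal sketch of Eq. 7 for a single reference summary; the clipped count (count_match) follows the standard ROUGE definition, and both texts are assumed to be pre-tokenized.

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rouge_n(candidate, reference, n):
    """Eq. 7: matched reference n-grams / total reference n-grams (recall)."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    matched = sum(min(count, cand[g]) for g, count in ref.items())
    return matched / sum(ref.values())

# Illustrative: ROUGE-1 of a candidate summary against a reference.
print(rouge_n("the cat sat on the mat".split(),
              "the cat lay on the mat".split(), 1))  # 5/6 ≈ 0.833
```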

The average precision, recall and F-measure of the hierarchical clustering technique using ROUGE-1 are 0.67, 0.49 and 0.56, respectively. Figure 2 shows the graphical representation of the precision, recall and F-measure of the summary generated by the hierarchical clustering technique using ROUGE-1.

Fig. 2 Precision, recall and F-measure of the summary generated using the hierarchical clustering technique (ROUGE-1)

The average precision, recall and F-measure of the hierarchical clustering technique using ROUGE-2 are 0.56, 0.40 and 0.47, respectively. Figure 3 shows the graphical representation of the precision, recall and F-measure of the summary generated by the hierarchical clustering technique using ROUGE-2.

Fig. 3 Precision, recall and F-measure of the summary generated using the hierarchical clustering technique (ROUGE-2)

The average precision, recall and F-measure of the K-means clustering technique using ROUGE-1 are 0.61, 0.45 and 0.52, respectively, as presented in Fig. 4.

Fig. 4 Precision, recall and F-measure of the summary generated using the K-means clustering technique (ROUGE-1)

The average precision, recall and F-measure of the K-means clustering technique using ROUGE-2 are 0.46, 0.35 and 0.40, respectively, as presented in Fig. 5.

Fig. 5 Precision, recall and F-measure of the summary generated using the K-means clustering technique (ROUGE-2)

5 Discussions on results

The experimental analysis shows that feature extraction with the hierarchical clustering technique gives a better summary than feature extraction with the K-means clustering technique. The average precision of the hierarchical clustering model is 0.67 and that of the K-means clustering model is 0.61 on the ROUGE-1 metric, while on ROUGE-2 they are 0.56 and 0.46, respectively. The precision of the model decreases as the n value of the n-gram increases, because extractive summaries are being compared with abstractive (human-written) summaries. The weights assigned to the feature scores of a sentence play an important role in generating the summary. In the proposed model, more weight is given to term weight, the number of proper nouns in the sentence and the sentence's similarity with the title than to the other features. The proper assignment of weights to features, together with the clustering techniques used in the proposed model, makes the resultant summary comparable with the human-generated abstractive summary.

6 Conclusion

Summarization using feature extraction from sentences and clustering techniques has been proposed. Feature extraction is used to calculate the scores of sentences; the score of a sentence is computed as a linear combination of the weighted feature scores. Clustering is performed to improve continuity among the sentences in the final summary. The quality of the generated summary is strongly affected by the weights assigned to the features. Of the two clustering techniques used, hierarchical clustering performs better than K-means clustering.

7 Open research

In future work, the proposed model could be extended to multi-document summarization and query-focused summarization. Sentence reduction or trimming techniques could be employed to generate summaries closer to human-written ones.