
1 Introduction

The need to summarize text is ubiquitous. In today's fast-paced life, everyone wants condensed information to save time. People often read the news by skimming only the headlines and the first few lines; a popular app that exploits this behaviour is "Inshorts", which delivers a news summary in just 60 words. Students want a summary of class notes the night before exams, as well as summaries of YouTube lectures that may be hours long. NLP has proven to be very useful in solving this problem. While extractive summarization is currently the more popular approach, abstractive summarization is picking up pace. With the ever-rising amount of data across the globe, and with data generation and consumption exploding at an exponential rate, the demand for automatic summary generation from text documents is growing considerably, since it reduces manual effort. Text summarization also finds application in various NLP tasks such as question answering and text classification, where summaries are generated as an intermediate step to reduce document length; this, in turn, leads to faster information retrieval. News summarization and headline generation are another important application, and most search engines use machine-generated headlines when displaying news articles in feeds.

The focus of this research is extractive text summarization: the aim is to implement a single-document summarizer that produces a quick, concise extractive summary of any textual document. The structure of the paper is as follows: Sect. 1 is the introduction, Sect. 2 presents the literature survey, the proposed methodology is given in Sect. 3, the results and discussion are in Sect. 4, and finally, the conclusion and future scope are in Sect. 5.

2 Literature Survey

In Madhuri and Ganesh Kumar [1], a statistical methodology for extractive text summarization of a single document is proposed and demonstrated. The approach extracts sentences of the input text in a concise fashion: weights are assigned to sentences in order to rank them, and the highly ranked sentences are extracted from the input text. The authors created a summarizer application that accepts text files as input. The input file is first pre-processed; the system then calculates the weighted frequency of each sentence by dividing the frequency of its keywords by the greatest keyword frequency. Finally, the summarizer extracts the sentences with high weighted frequencies and converts them to audio.

Krishnaveni and Balasundaram [2] proposed a strategy for improving summary coherence based on local scoring and ranking. The authors generate two types of summaries: heading-wise summaries and a main summary. Features are employed to score the sentences. The original text is separated heading by heading, and each heading is treated as a separate text.

In Naik and Gaonkar [3], a rule-based approach was proposed to extract features from sentences. The authors first pre-processed the input data and then retrieved keywords from the document, which were subsequently pruned based on a computed threshold. Feature values are then calculated for the document, and the authors define certain rules over these features. Finally, the sentences are sorted by sentence score to form an extractive summary.

Jafari and Shahabi [4] presented a fuzzy logic method that calculates and encodes feature values as feature attribute vectors for each text. The fuzzy system assigns each sentence a score between 0 and 1; input values for the membership functions are calculated, and the knowledge base contains the rules needed for summarization. Finally, the top-n scored sentences are selected to form the summary.

One of the best-known extractive text summarization algorithms was stated by Luhn [5]. It scores sentences based on how often words appear in the text and on the distance between relevant words, determined by the number of non-relevant words occurring among the relevant ones.

The TextRank algorithm, developed by Mihalcea and Tarau [6], is a graph-based ranking algorithm. It works in two steps: first, it calculates the similarity between pairs of sentences; second, it computes the overall significance of each sentence.

3 Proposed Methodology

3.1 Data Pre-processing

1. Sentence Segmentation: Segmentation breaks the entire text into sentences; each sentence is then stored in an array together with its position.

2. Tokenization: In this data pre-processing step, sentences are divided into tokens so that they can be used further for the feature extraction tasks.

3. Stop Word and Punctuation Removal: In this step, we remove punctuation as well as frequently occurring stop words such as "but", "the", "a", and so on. The proposed methodology for extractive text summarization is shown in Fig. 1, and a small code sketch of these pre-processing steps is given below the figure.

Fig. 1 Proposed methodology
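The three steps above can be sketched with NLTK roughly as follows; the function name and the specific resource downloads are illustrative assumptions rather than part of the proposed system.

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads of the NLTK resources assumed here.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))


def preprocess(text):
    """Split text into sentences, then into cleaned token lists."""
    # Step 1: sentence segmentation; the index i records the sentence position.
    sentences = sent_tokenize(text)

    # Steps 2 and 3: tokenization plus stop word and punctuation removal.
    cleaned = []
    for i, sentence in enumerate(sentences):
        tokens = [t.lower() for t in word_tokenize(sentence)]
        tokens = [t for t in tokens
                  if t not in STOP_WORDS and t not in string.punctuation]
        cleaned.append((i, sentence, tokens))
    return cleaned
```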

3.2 Feature Extraction

1. Sentence Position: The position of a sentence can indicate its importance for the summary; the sentences at the beginning and end of the document are usually the most significant, and the sentence score is calculated accordingly. In our case, the positional score is determined as follows:

$$ \mathrm{Sentence\_Position}=\left\{ \begin{array}{ll} 0, & \text{if it is the first or last sentence of the text} \\ \cos \big( (\mathrm{Sen\_pos}-\mathrm{min}) \times ((1/\mathrm{max})-\mathrm{min}) \big), & \text{otherwise} \end{array}\right. $$

where Sen_pos is the position of the sentence in the text, min = th × (total number of sentences in the text), max = th × 2 × (total number of sentences in the text), and th is a threshold taken as 0.2.

2. Sentence Length: This feature helps to filter out very short sentences, which typically do not convey much information.

$$ \mathrm{Sentence\_Length}=\left\{ \begin{array}{ll} 0, & \text{if the number of words is less than } 3 \\ \text{number of words in the sentence}, & \text{otherwise} \end{array}\right. $$
3. Proper Nouns: A proper noun refers to something with a particular identity, such as a name or a location. First, sentences are tagged with a POS tagger using the NLTK library; then the proper nouns in each tagged sentence are counted.

$$ \mathrm{proper\_noun}=\frac{\text{number of proper nouns in tagged } S_{i}}{\text{total length of tagged } S_{i}} $$

where \(S_{i} \) is the ith sentence, the number of proper nouns in tagged \(S_{i} \) is the count of proper nouns in the ith sentence, and the total length of tagged \(S_{i} \) is the number of words in the ith sentence.

4. Number of Numerals: Since numerals in a document represent facts, sentences containing specific figures are important. The numeral feature is calculated using the following formula.

$$ \mathrm{number\_of\_numerals}=\frac{\text{number of numerals in } S_{i}}{\text{total number of words in } S_{i}} $$
5. Number of Thematic Words: Thematic words are the ten most frequently occurring words in the document. The thematic-word feature is calculated as follows.

$$ \mathrm{thematic\_words}=\frac{\text{number of thematic words in } S_{i}}{\text{total number of thematic words}} $$
6. Centroid Similarity: The centroid sentence is defined as the sentence with the maximum TF-ISF score. The cosine similarity of each sentence to the centroid sentence is then computed.

$$ \mathrm{Centroid\_Similarity}=\mathrm{cosine\_similarity}(\mathrm{centroid},\ \mathrm{sentence}) $$
7. TF-ISF: TF-ISF stands for term frequency-inverse sentence frequency, which works analogously to TF-IDF but at the sentence level. TF-ISF is given as follows.

$$\begin{aligned} \mathrm{TF\_ISF}=\frac{\sum \log (\mathrm{isf})\times \mathrm{tf}}{\text{total words}} \end{aligned}$$
8. Named Entities: For this feature, we count the number of named entities in every sentence. Sentences that include references to named entities such as a corporation or a group of people are often required to understand a factual report.

Once the feature value of each sentence has been obtained, we generate the sentence feature matrix. The sentence feature matrix is a two-dimensional matrix, as shown in Fig. 2: \(S=(s_{1}, s_{2}, s_{3}, \ldots, s_{n})\), where \(s_{i}=(f_{1}, f_{2}, f_{3}, f_{4}, f_{5}, f_{6}, f_{7}, f_{8})\) is the feature vector of the ith sentence, \(i \le n\), and n is the total number of sentences in the document.

The sentence feature matrix produced by the preceding stages is shown in Fig. 2.

Fig. 2 Feature matrix for text summarization
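As a concrete illustration, the eight feature values described above can be computed and assembled into the n × 8 matrix S roughly as follows. This is a simplified sketch: the exact ISF normalisation, the numeral test, and the helper names are our assumptions, and the NLTK resources (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) are assumed to be installed.

```python
import math
from collections import Counter

import nltk
import numpy as np


def sentence_position(i, n, th=0.2):
    """Positional score as defined above (0 for the first and last sentence)."""
    if i == 0 or i == n - 1:
        return 0.0
    mn, mx = th * n, th * 2 * n
    return math.cos((i - mn) * ((1.0 / mx) - mn))


def sentence_length(tokens):
    return 0 if len(tokens) < 3 else len(tokens)


def proper_noun_ratio(sentence):
    tags = nltk.pos_tag(nltk.word_tokenize(sentence))
    proper = sum(1 for _, tag in tags if tag in ("NNP", "NNPS"))
    return proper / max(len(tags), 1)


def numeral_ratio(tokens):
    return sum(t.isdigit() for t in tokens) / max(len(tokens), 1)


def thematic_ratio(tokens, thematic_words):
    return sum(1 for t in tokens if t in thematic_words) / max(len(thematic_words), 1)


def tf_isf(tokens, all_token_lists):
    """Sum of tf x log(isf) over the sentence's words, normalised by its length."""
    n = len(all_token_lists)
    score = 0.0
    for word, tf in Counter(tokens).items():
        sf = sum(1 for toks in all_token_lists if word in toks)
        score += tf * math.log(n / max(sf, 1))
    return score / max(len(tokens), 1)


def cosine_similarity(tokens_a, tokens_b):
    va, vb = Counter(tokens_a), Counter(tokens_b)
    num = sum(va[w] * vb[w] for w in set(va) & set(vb))
    den = (math.sqrt(sum(c * c for c in va.values()))
           * math.sqrt(sum(c * c for c in vb.values())))
    return num / den if den else 0.0


def named_entity_count(sentence):
    # Requires the NLTK "maxent_ne_chunker" and "words" resources.
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    return sum(1 for node in tree if hasattr(node, "label"))


def build_feature_matrix(sentences, token_lists):
    """Assemble the n x 8 sentence feature matrix S."""
    n = len(sentences)
    freq = Counter(t for tokens in token_lists for t in tokens)
    thematic_words = {w for w, _ in freq.most_common(10)}
    scores = [tf_isf(tokens, token_lists) for tokens in token_lists]
    centroid = int(np.argmax(scores))  # sentence with the maximum TF-ISF score

    S = np.zeros((n, 8))
    for i, (sent, tokens) in enumerate(zip(sentences, token_lists)):
        S[i] = [
            sentence_position(i, n),
            sentence_length(tokens),
            proper_noun_ratio(sent),
            numeral_ratio(tokens),
            thematic_ratio(tokens, thematic_words),
            cosine_similarity(token_lists[centroid], tokens),
            scores[i],
            named_entity_count(sent),
        ]
    return S
```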

3.3 Restricted Boltzmann Machine

Our approach to extractive summarization uses a restricted Boltzmann machine (RBM), a stochastic, generative neural network that can learn internal representations through a probability distribution over its set of inputs [7]. RBMs are two-layered artificial neural networks: the first layer is called the visible or input layer (input nodes), and the other is called the hidden layer (hidden nodes). Every hidden node is connected to a bias node. The input nodes in the visible layer are not connected to one another, and, likewise, the hidden nodes in the hidden layer are not connected to one another. The network is called a restricted Boltzmann machine because of these restricted connections (Fig. 3).

Fig. 3 RBM model

To enhance the feature values, the feature matrix S is fed into an RBM with one hidden layer. Each sentence first passes through the hidden layer: its feature values are multiplied by randomly generated weights, and a single randomly generated bias value is added for all sentences. The results of these operations are fed into an activation function, which produces the node's output (Fig. 4).

Fig. 4 Visible layer to hidden layer

$$\begin{aligned} P(s_{i} )=\sigma \left( \sum _{j=1}^{n}s_{j} \times W_{ij} +b_{i} \right) \end{aligned}$$
$$\begin{aligned} \sigma (x)=\frac{1}{(1+e^{-x} )} \end{aligned}$$

where \(\sigma (x)\) is the sigmoid function, \(W_{ij} \) denotes the randomly generated weights, and \(b_{i} \) denotes the bias.
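In code, this visible-to-hidden pass amounts to a single matrix operation. The NumPy sketch below is illustrative only; the hidden-layer size and the stand-in feature matrix are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


S = rng.random((5, 8))                  # stand-in for the n x 8 feature matrix
n_visible, n_hidden = S.shape[1], 8     # hidden size chosen for illustration

W = rng.normal(0.0, 0.1, (n_visible, n_hidden))   # randomly generated weights
b_hidden = rng.normal(0.0, 0.1, n_hidden)         # bias added for every sentence

hidden_probs = sigmoid(S @ W + b_hidden)          # hidden activation for every sentence
```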

After this, the restricted Boltzmann machine learns to reconstruct the data by itself in an unsupervised manner. This is done by reversing the above process: the hidden layer becomes the input layer, with its activations as the new input. These activations are multiplied by the same weights associated with the visible-layer nodes, and the visible-layer bias is added at each visible node. The values obtained in this way are known as the reconstructions, which are then compared with the original input (Fig. 5).

Fig. 5 Hidden layer to visible layer

The probability of activation of a reconstructed visible unit \(s_{j}^{\prime } \) is stated as

$$\begin{aligned} P(s_{j}^{\prime } )=\sigma \left( \sum _{i=1}^{n}P(s_{i} )\times W_{ij} +b_{j} \right) \end{aligned}$$
$$\begin{aligned} \sigma (x)=\frac{1}{(1+e^{-x} )} \end{aligned}$$

where \(\sigma (x)\) is the sigmoid function, \(W_{ij} \) are the weights, and \(b_{j} \) is the visible-layer bias.

Thus, during training the hidden unit values are predicted first, and from them the new input values are predicted; this procedure is known as Gibbs sampling. The training loss is obtained by calculating the difference between the original input values \(s_{j} \) and the reconstructed values \(s_{j}^{\prime } \). This procedure is repeated for a number of epochs, and the weights are updated using contrastive divergence (CD), given by:

$$\begin{aligned} W_{ij(\mathrm{new})} &= W_{ij(\mathrm{old})} + \mathrm{Learning\_Rate} \times w^{\prime } \\ w^{\prime } &= s_{j} \otimes P(s_{i} ) - s_{j}^{\prime } \otimes P(s_{i}^{\prime } ) \end{aligned}$$

where \(w^{\prime } \) is the difference between the outer product of the activation probabilities with the original input values \(s_{j} \) and the outer product with the reconstructed input values \(s_{j}^{\prime } \); it represents the change in the weight values. After the final epoch, an enhanced feature matrix is obtained, with enhanced feature values for each sentence.
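Putting the forward pass, reconstruction, and weight update together, a minimal RBM trainer for the sentence feature matrix could look as follows. This is a sketch under the assumptions that one Gibbs step (CD-1) is used per epoch, that probabilities rather than sampled binary states are propagated, and that the enhanced feature matrix is read off as the final hidden-layer activations; the learning rate and epoch count are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm(S, n_hidden=8, epochs=50, learning_rate=0.1, seed=0):
    """Train an RBM on the sentence feature matrix S and return the
    enhanced feature matrix (hidden-layer activations)."""
    rng = np.random.default_rng(seed)
    n_sentences, n_visible = S.shape
    W = rng.normal(0.0, 0.1, (n_visible, n_hidden))
    b_hidden = np.zeros(n_hidden)
    b_visible = np.zeros(n_visible)

    for _ in range(epochs):
        # Visible -> hidden: activation probabilities for every sentence.
        h = sigmoid(S @ W + b_hidden)
        # Hidden -> visible: reconstruction s' of the original input.
        s_recon = sigmoid(h @ W.T + b_visible)
        # Hidden activations of the reconstruction (second half of the Gibbs step).
        h_recon = sigmoid(s_recon @ W + b_hidden)

        # Contrastive divergence: difference of the two outer products.
        w_delta = S.T @ h - s_recon.T @ h_recon
        W += learning_rate * w_delta / n_sentences
        b_hidden += learning_rate * (h - h_recon).mean(axis=0)
        b_visible += learning_rate * (S - s_recon).mean(axis=0)

        # Reconstruction error; can be monitored per epoch as the training loss.
        loss = np.mean((S - s_recon) ** 2)

    # Enhanced feature matrix: hidden representation of each sentence.
    return sigmoid(S @ W + b_hidden)
```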

3.4 Summary Generation

A list is created that contains the sum of all enhanced feature values for each sentence in the document. This sum is the sentence's score. The sentences are then sorted by score in decreasing order. The first sentence of the document is always included in the summary because it is considered the most essential one. Of the remaining sentences, the top 50% are included in the summary and arranged according to their original position in the text.
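A sketch of this selection step, assuming `sentences` holds the original sentences in order and `enhanced` is the enhanced feature matrix returned by the RBM:

```python
import numpy as np


def generate_summary(sentences, enhanced, ratio=0.5):
    """Select sentences by their summed enhanced feature values."""
    scores = np.asarray(enhanced).sum(axis=1)   # one score per sentence

    # The first sentence is always kept; rank the rest by score.
    rest = sorted(range(1, len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = {0} | set(rest[: int(ratio * len(rest))])

    # Restore the original order of the selected sentences.
    return " ".join(sentences[i] for i in sorted(keep))
```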

4 Result and Discussion

We used the BBC news dataset, which includes articles from several domains (technology, politics, business, entertainment, and sport) along with human-generated summaries for those articles. Ten documents from the BBC news dataset were used for evaluation. The proposed methodology is compared with other state-of-the-art extractive summarization algorithms: TextRank [6], LexRank [8], LSA [9], and Luhn [5]. The summaries for TextRank, LexRank, LSA, and Luhn were generated with the Sumy tool, which lets us import all of these algorithms directly; to create a summary, the user only has to pass the text document. Each document was summarized with the four different summarization techniques, extracting the top-ranked sentences to form summaries at 10%, 20%, 30%, 40%, and 50% of the document length. The purpose of experimenting with different percentage levels is to investigate the performance of the various approaches.
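For reference, the baseline summaries can be generated with Sumy along the following lines; the 30% ratio shown is just one of the levels listed above.

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.luhn import LuhnSummarizer


def baseline_summary(text, summarizer_cls, ratio=0.3):
    """Summarize `text` with one of the Sumy baseline algorithms."""
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    n_sentences = max(1, int(ratio * len(parser.document.sentences)))
    summarizer = summarizer_cls()
    return " ".join(str(s) for s in summarizer(parser.document, n_sentences))


# One baseline per algorithm compared in this section.
baselines = {
    "TextRank": TextRankSummarizer,
    "LexRank": LexRankSummarizer,
    "LSA": LsaSummarizer,
    "Luhn": LuhnSummarizer,
}
```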

Each system summary was evaluated against its corresponding reference summary. The ROUGE tool was used for the evaluation; ROUGE reports precision, recall, and F-measure. The results show that the proposed approach gives better results than all of the other techniques stated above (Tables 1, 2, and 3).

Table 1 Precision
Table 2 Recall
Table 3 F1 Score
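The ROUGE-1 precision, recall, and F-measure reported in Tables 1, 2, and 3 can be computed, for instance, with the `rouge` Python package (an assumption on tooling; the original ROUGE toolkit could equally be used):

```python
from rouge import Rouge


def rouge1_scores(system_summary, reference_summary):
    """Return ROUGE-1 precision, recall, and F1 for one summary pair."""
    scores = Rouge().get_scores(system_summary, reference_summary)[0]["rouge-1"]
    # 'p', 'r', and 'f' correspond to the precision, recall, and F1 tables.
    return scores["p"], scores["r"], scores["f"]
```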

5 Conclusion

We have developed an unsupervised methodology that uses an RBM to summarize a single document. The algorithm works separately on each input document, since every document is unique in itself; this is an advantage of the proposed methodology. Our method uses an RBM to generate an effective and efficient summary. We evaluated our model using the ROUGE-1 score and compared it with other techniques; the results show that our model gives better results than the compared techniques.

The proposed methodology could be extended to multiple-document summarization. It could also be applied to summarize documents in different languages.