
1 Introduction

At present, when everything is made fast and simple for the convenience of humans, a short version of an existing long text is a necessity.

People do not have the time and patience to consume information as a whole; it has to be delivered to them in a simple manner. Professionals who handle large reports face the same problem. It can be solved by text summarization, which is defined as follows:

Automatic text summarization is the process of compressing and extracting information effectively from input documents while still retaining its key content. [1]

A basic understanding of what a summary is, is required before moving on to text summarization. A summary is a condensed version of one or more texts that delivers the key ideas of the original and is written in a concise manner. Automatic text summarization aims to present the source text in a short form while preserving its semantics [2]. The goal of text summarization is to come up with methods that produce this summary efficiently and clearly. Two types of summarizers are used in our case: abstractive and extractive.

Extractive summarization creates a summary of the provided text by choosing a portion of its sentences: the most significant phrases and sentences are identified and selected using a score calculated from the words in the text. Abstractive summarization instead begins by analyzing the text document to develop an interpretation; from this interpretation the model generates a summary, conveying the essence of the text by paraphrasing parts of it [3].
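As a minimal illustration of the extractive idea, the sketch below scores sentences by the frequencies of the words they contain; this is illustrative only and is not the scoring used by the models compared later.

```python
# A minimal frequency-based extractive scorer (illustrative sketch only).
from collections import Counter
import re

def extractive_summary(text, num_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Score each sentence by the document frequencies of its words.
    scores = {s: sum(freq[w] for w in re.findall(r"\w+", s.lower()))
              for s in sentences}
    # Keep the top-scoring sentences, preserving their original order.
    top = set(sorted(sentences, key=scores.get, reverse=True)[:num_sentences])
    return " ".join(s for s in sentences if s in top)
```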

Text summarization is also classified into two types: indicative and informative. An indicative summary just provides the core concept of the text to the consumer; this form of summary is typically 5–10% of the length of the main portion of the text. Informative summary methods, on the other hand, give exact information about the primary text; an informative summary should be 20–30% as long as the main material [2].

At this stage, most of the automatic text summarization techniques mainly use machine learning or deep learning methodologies and models. Deep learning techniques were used for the first time in abstractive text summarization in 2015, with the proposed methodology based on the encoder-decoder architecture. Deep learning models have produced amazing results in these applications and have been widely used recently [4].

Our work compares two Transformer models, GPT2 and BERT, to find out which one is more efficient. At the end of our analysis using ROUGE metrics, it is confirmed that BERT gives better performance than GPT2.

2 Related Works

Subha Shini and Ambeth Kumar developed and deployed recurrent neural network (RNN)-based text summarization techniques. They address different approaches and the distinct datasets used to create text summaries, which reduce the time required to manually write summaries for lengthy texts while maintaining the meaning of the original [5].

Two LSA-based summarization techniques are described in [6]. The algorithms are assessed on Turkish and English documents and their results compared using ROUGE scores. Both systems perform equally well on collections of Turkish and English documents; however, one of them yields the highest scores.

The most recent extractive text summarization methods for several languages are surveyed in [7]. Excellent work has been done for several languages, such as Chinese and English; however, there is still no summarization system for the Bengali language, and it remains difficult to provide a summarization method that makes use of several feature categories.

In [8], the authors designed and implemented a text summarization system that uses the BERT algorithm. They introduced a document-level encoder based on BERT that expresses the document in segments and obtains a representation of each paragraph. They also used a two-staged fine-tuning approach to further improve the quality of the summaries that are generated.

In another study, text summarization was accomplished using a multi-layered attentional peephole convolutional LSTM (Long Short-Term Memory). The goal is to take a given lengthy text as input and construct an automatic summary of it. The authors tuned the settings of MAPCoL using central composite design (CCD) in conjunction with response surface methodology (RSM), which produces summaries with good accuracy [9].

Jesse Vig introduced an open-source tool for visualizing attention in the Transformer at various scales. The tool was demonstrated on GPT-2 and BERT, and three use cases were presented. In future work, a unified interface is intended to be created for traversing the tool’s three views [10].

A text summarization tool that uses an LSTM-CNN-based ATS framework (ATSDL) has been designed and implemented by some authors. The purpose is to construct new sentences by exploring finer-grained fragments than whole sentences. ATS is the task of forming summary sentences by combining facts from different sources into shorter content while preserving meaning [11].

In [12], the authors extend the most advanced model for abstractive sentence summarization to a recurrent neural network architecture. The model is a condensed form of the encoder-decoder structure used in machine translation. It was trained on the Gigaword corpus to produce headlines from the first line of each news article. On Gigaword data and the DUC-2004 challenge, it performs far better than the prior state of the art, despite the fact that it does not rely on any additional extraction features.

Yet another text summarization approach uses the Improved Attention Layer-assisted Recurrent Convolutional Neural Network (IA-RCNN) model. The model integrates the Sequence-to-Sequence (S2S) paradigm with an RCNN and was created for abstractive text summarization in a variety of text sources. The model suggested in that research is tested on various text sources, and its performance is adequate for real-time applications [13].

2.1 Text Summarization Using Deep Learning

In 2015, deep learning techniques were first used for abstractive text summarization, and a model based on the encoder-decoder architecture was proposed. Deep learning algorithms have become popular recently and have achieved excellent results in these applications.

Deep learning serves as a decision aid when analyzing difficult problems. It uses feature extraction at different levels of abstraction to mimic the capabilities of the human brain, and it is used in many NLP tasks because its multiple nonlinear layers of data processing facilitate learning multi-level hierarchical representations of data. Various deep learning models have been used for abstractive summarization, including RNNs, convolutional neural networks (CNNs), and sequence-to-sequence models. In this section, deep learning models are explored [27].

A large corpus is used to learn contextual representations of the language. The BERT language representation is one of the new word embedding model extensions. BERT benefits from both fine-tuning and feature-based approaches (depending on the specific task goals). In addition, Transformers can learn the relationships between word pairs by using self-attention to compute input and output representations [4].

2.2 Need for Text Summarization in Big Data Analytics

Big data is a technology for managing massive amounts of data: a technique for storing, distributing, and processing it. Because of technology improvements and Web 2.0, it is now possible to send data without specifying a tag. In particular, social media tools such as WhatsApp, Facebook, Instagram, Google, and others are extremely effective at transmitting large amounts of structured and unstructured data [22]. When you view a video on YouTube, you see advertisements relevant to your interests at random intervals [23]; this advertising is chosen for you based on your browsing history and preferences. The most recent developments on a given topic can be found via Twitter trends.

Furthermore, Google text search and voice-based search have simplified searching nowadays. Today, most individuals use mobile phones to conduct various transactions, and these phones are also used to monitor user locations. Various mobile applications are available, and these programs are installed based on the user’s preferences and requirements [24]. These mobile applications are highly beneficial for understanding users’ interests and how much time they spend. All of these scenarios point to extensive data creation, with data being generated in vast quantities at all times [25, 26]. As a result, a distinctive text summarization approach is necessary to better grasp the hidden knowledge or information concealed behind this data.

3 BERT

BERT (Bidirectional Encoder Representations from Transformers) is used to overcome the restrictions of RNNs (Recurrent Neural Networks), CNNs (Convolutional Neural Networks), and ANNs (Artificial Neural Networks). BERT models are pre-trained on big data sets, so no additional training on the dataset is required. BERT uses an efficient uniform architecture along with sentence transformation, and with this the best summarization results are observed. Unlike context-free embedding models, BERT is a contextual embedding model [14]. BERT builds on top of the transformer architecture, but its pre-training objective is distinct [15].

Example:

Consider these sentences:

  • Python is very dangerous. It can kill you.

  • I love Python. It’s easy to understand.

If a context-free word embedding were applied, it would assign the same vector to the word python in both sentences. This is not correct, because one python is a snake and the other is a programming language. BERT helps by keeping the embeddings contextual, as the sketch below illustrates.
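A short sketch of this behavior with the Huggingface transformers package; the bert-base-uncased checkpoint is an assumption, and it assumes “python” is a single wordpiece in the uncased vocabulary.

```python
# Sketch: BERT assigns different contextual vectors to "python"
# in the two sentences above (bert-base-uncased is an assumed checkpoint).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def python_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    # Locate the "python" token (assumed to be a single wordpiece).
    idx = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("python"))
    return hidden[idx]

v1 = python_vector("Python is very dangerous. It can kill you.")
v2 = python_vector("I love Python. It's easy to understand.")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1.0
```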

Advantages of the BERT model are as follows:

  1. Context – It keeps the context of the words.

  2. Word Ordering – It keeps the word ordering with the help of positional encoding (positional vectors).

  3. Embeddings – It has token embeddings, sentence embeddings, and positional embeddings; together these constitute the BERT embedding.

  4. Out of Vocabulary – It handles out-of-vocabulary words with the help of self-attention, taking care of keywords whether or not they are present in the vocabulary.

Applications of the BERT model are as follows:

  • Text Summarization

  • NER – Name Entity Recognition

  • Next Sentence Prediction

  • Long Text Classification

  • Sentiment Analysis

  • Question Answering

3.1 BERT Architecture

The goal of Bidirectional Encoder Representations from Transformers (BERT), which pre-trains deep bidirectional representations from unlabeled text, is to simultaneously condition both left and right contexts across all layers. As a result, the pre-trained BERT model can be enhanced with just one additional output layer to create cutting-edge models for a variety of tasks, such as question answering and language inference, without requiring large changes to the architecture for each task [16].

Given a document as a group of sentences (s1, s2, s3, ..., sn), each sentence has two possible outcomes, with xi ∈ {0, 1} indicating whether or not the sentence will be chosen. Because the pre-trained MLM (Masked Language Model) produces output vectors for tokens rather than sentences, and because BERT has only two segment labels, SA and SB, rather than one per sentence, embeddings are used to distinguish the various sentences. These embeddings are modified appropriately to produce the necessary summaries (see Fig. 1) [17].

Fig. 1

Process flow of BERT: each token in the input document receives a token embedding, an interval segment embedding to separate segments, and a position embedding, before passing through the BERT layers for summarization

3.2 Phases in Generating the Summary

3.2.1 Input Document

Sentences from the input source are encoded during this phase. Each sentence has a CLS tag before it and a SEP tag at the end. The CLS tag is used to aggregate the features of one or more sentences.
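A minimal sketch of this input format (plain string handling, for illustration only):

```python
# Sketch: wrap every sentence with a [CLS] tag in front and a [SEP] tag after.
def format_input(sentences):
    return " ".join(f"[CLS] {s} [SEP]" for s in sentences)

print(format_input(["I love books.", "I love sports."]))
# [CLS] I love books. [SEP] [CLS] I love sports. [SEP]
```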

3.2.2 Interval Segment Embedding

Sentences in the input document will be differentiated during this phase. Each sentence is assigned to one of the labels mentioned above.

Depending on the value of i, sentence si is assigned either EA or EB. Essentially, the standard is EA for even values of i and EB for odd values of i [18].
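A tiny sketch of this alternating assignment, with 0 standing for EA and 1 for EB:

```python
# Sketch: alternating interval segment ids, one per token of each sentence.
def interval_segment_ids(sentence_token_lengths):
    ids = []
    for i, length in enumerate(sentence_token_lengths):
        ids.extend([i % 2] * length)  # even sentence -> E_A (0), odd -> E_B (1)
    return ids

print(interval_segment_ids([5, 4, 6]))  # [0]*5 + [1]*4 + [0]*6
```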

3.2.3 Embedding

The representation of words in their corresponding vector forms is referred to as embedding. Google makes use of this feature of the BERT model to aid understanding, gaining access to various semantic functions such as understanding the purpose of the input document and constructing word-related models.

Our input text is subjected to three different types of embedding before being fed to the BERT model layer:

Token embeddings: The words are transformed into output vectors with fixed dimensions during this stage. The CLS and SEP tags are inserted at the beginning and end of the input sentences.

3.2.4 Segment Embeddings

Using binary coding, the different inputs are distinguished in this phase.

For example,

Sentence1 – “I love books”

and Sentence2 – “I love sports.”

After tokenization: [CLS], I, Love, Books, [SEP], I, Love, Sports

After the segment embedding operation: [0, 0, 0, 0, 0, 1, 1, 1], where Sentence1 = 0 and Sentence2 = 1
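The Huggingface tokenizer reproduces this example directly (a sketch; note that it also appends a closing [SEP], which receives segment id 1):

```python
# Sketch: segment (token type) ids for a two-sentence BERT input.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("I love books", "I love sports")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'i', 'love', 'books', '[SEP]', 'i', 'love', 'sports', '[SEP]']
print(enc["token_type_ids"])
# [0, 0, 0, 0, 0, 1, 1, 1, 1]
```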

3.2.5 Position Embeddings

The BERT model can handle input sequences of up to 512 tokens, and its output vector dimension is [512, 768]. In this phase, a word’s position in the input sentence is encoded, since the same word placed differently changes the context of the sentence and should not receive the same vector.
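As a sanity check, the [512, 768] shape quoted above can be read directly off a pre-trained checkpoint (a sketch assuming the Huggingface transformers package and bert-base-uncased):

```python
# Sketch: inspect the learned position embedding table of bert-base.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.embeddings.position_embeddings.weight.shape)
# torch.Size([512, 768]) -- 512 positions, 768-dimensional vectors
```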

3.2.6 Summarization

The self-attention layer is the primary distinction between recurrent neural networks and Bidirectional Encoder Representations from Transformers (BERT). The model aids in the representation of words and looks for connections between the words.

Simple Classifier: To predict the score Yi, a linear layer and sigmoid function are added on top of the BERT model. The sigmoid function maps the score to a probability, and a threshold on this probability determines the binary value Ŷi [18].

$$ \hat{Y}_{\textrm{i}} = \alpha \left( {W}_{\textrm{o}} {T}_{\textrm{i}} + {b}_{\textrm{o}} \right) $$
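A minimal PyTorch sketch of this classifier head (the 768-dimensional hidden size is assumed from bert-base):

```python
# Sketch: linear layer plus sigmoid over each sentence vector T_i from BERT.
import torch
import torch.nn as nn

class SimpleClassifier(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)  # W_o and b_o

    def forward(self, sentence_vectors):  # shape: (num_sentences, hidden_size)
        return torch.sigmoid(self.linear(sentence_vectors)).squeeze(-1)

scores = SimpleClassifier()(torch.randn(10, 768))  # one score per sentence
```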

3.2.7 Inter Sentence Transformer

The simple classifier is not utilized in this stage. The BERT model is improved by the addition of additional transformer layers, which also help it to better identify the key ideas in the input document [18].

$$ \tilde{h}^{l} = \textrm{LN}\left( {h}^{l-1} + \textrm{MHAtt}\left( {h}^{l-1} \right) \right) $$
$$ {h}^{l} = \textrm{LN}\left( \tilde{h}^{l} + \textrm{FFN}\left( \tilde{h}^{l} \right) \right) $$
$$ {h}^{0} = \textrm{PosEmb}(T) $$

where

  • T – Sentence output vector by BERT model.

  • PosEmb – Positional Embeddings

  • LN – Layer Normalization Function

  • MHAtt – Multi-head Attention Function

  • l – depth index of the stacked layer.

  • Sigmoid output layer – Ŷi = α(WoTi + bo).
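A sketch of one such inter-sentence transformer layer in PyTorch, following the equations above (model dimension, head count, and feed-forward size are assumptions):

```python
# Sketch: residual multi-head attention and feed-forward blocks,
# each followed by layer normalization, as in the equations above.
import torch
import torch.nn as nn

class InterSentenceLayer(nn.Module):
    def __init__(self, d_model=768, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h):  # h: (batch, num_sentences, d_model)
        h_tilde = self.ln1(h + self.attn(h, h, h)[0])  # LN(h + MHAtt(h))
        return self.ln2(h_tilde + self.ffn(h_tilde))   # LN(h~ + FFN(h~))
```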

4 GPT-2

The GPT tokenizer was used to tokenize the chosen files, although the context size of the GPT models limits the input to 512 tokens for GPT and 1024 tokens for GPT-2. Identical text files exist for BERT as well; 100 files containing the necessary tokens for training were selected from the dataset (Fig. 2).

Fig. 2

Process flow of GPT 2: transformer blocks with masked multi-self-attention feed linear layers for tasks such as text prediction, text classification, entailment, and multiple choice

Using the tokens that came before it, a language model makes a probabilistic prediction about the token that will come next in the sequence. It gains knowledge of the likelihood that a sentence or string of tokens will appear based on the text examples it has encountered during training. The following conditional probability can be used to represent it:

$$ \textrm{P}\left( {w}_1^{T} \right) = \prod_{t=1}^{T} \textrm{P}\left( {w}_t \mid {w}_1^{t-1} \right) $$

where wt is the tth token, and wij = (wi, wi+1, ..., wj−1, wj) denotes the sub-sequence of tokens from position i to position j.

The Transformer model variant known as GPT/GPT-2 keeps only the decoder portion of the Transformer network. These models function like conventional uni-directional language models thanks to multi-headed masked self-attention, which enables each position at time step t to attend only to the first t tokens. Unlike RNNs, they analyze tokens concurrently rather than sequentially, predicting tokens for all time steps simultaneously. These models can be represented by:

$$ {h}_0 = U {W}_{\textrm{e}} + {W}_{\textrm{p}} $$
$$ {h}_i = \textrm{transformer-block}\left( {h}_{i-1} \right) \quad \forall i \in \left[ 1, n \right] $$
$$ P(u) = \textrm{softmax}\left( {h}_{\textrm{n}} {W}_{\textrm{e}}^{\textrm{T}} \right) $$

where U = (u1, u2, ..., uk) are the tokens, n is the number of layers, We is the token embedding matrix, and Wp is the position embedding matrix, trained on the language model objective:

$$ L(U) = \sum_{i}^{k} \log P\left( {u}_i \mid {u}_{i-k}, \ldots, {u}_{i-1}; \beta \right) $$

where U = (u1, u2, ..., uk) are the tokens, k is the context size (the maximum number of tokens that can be processed at once by the model), and P is the conditional probability modeled by our language model with parameters β. Table 1 shows a sample of the test data set with title and content [19].

Table 1 Sample test data set
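With the Huggingface transformers package, this objective is exposed directly: passing the input ids as labels returns the average negative log-likelihood. A minimal sketch, assuming the distilgpt2 checkpoint used later in this work:

```python
# Sketch: causal LM loss L(U) for one sequence via Huggingface transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")

ids = tokenizer("Text summarization shortens long documents.",
                return_tensors="pt")["input_ids"]
with torch.no_grad():
    out = model(ids, labels=ids)  # labels are shifted internally
print(out.loss)  # average negative log-likelihood of the sequence
```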

Byte Pair Encoding is used by GPT2 to construct the vocabulary tokens, which means the tokens are typically word fragments. GPT-2 can predict the following token in a sequence because it was trained with a causal language modeling (CLM) objective. Using this feature, GPT-2 can produce syntactically coherent text. GPT-2 generates synthetic text samples in response to arbitrary input that primes the model, and it can adapt to the style and content of the conditioning text [20].
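Both behaviors can be sketched briefly (the checkpoint name and sampling parameters are assumptions):

```python
# Sketch: BPE word fragments and prompt-conditioned generation with GPT-2.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

print(tokenizer.tokenize("summarization"))  # sub-word fragments, not a word

ids = tokenizer("GM is recalling", return_tensors="pt")["input_ids"]
out = model.generate(ids, max_length=30, do_sample=True, top_k=50,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))  # continuation in the style of the prompt
```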

Due to the GPU resource limitation, the abstractive summarization model is a pre-trained distilled version of GPT-2. DistilGPT2 can accept token lengths of up to 1024 and includes 6 transformer decoder layers and 12 attention heads. The GPT-2 implementation from the Huggingface transformers package, written in PyTorch, is utilized [14].

GPT-2 performs comparably to supervised baselines on reading comprehension in a zero-shot setting. In some activities, like summarization, it performs the task qualitatively, but based on quantitative measurements it still performs at a rudimentary level. The zero-shot performance of GPT-2, while suggestive as a research finding, is still a long way from being useful in practical applications [19].

5 Experiment Setup

5.1 About the Dataset

The non-anonymized BBC dataset provided on Kaggle was utilized, which is designed for condensing news items into two to three sentences. The dataset includes thousands of text files with an average length of 15 lines, each accompanied by a predetermined summary. Using various measures, this summary is used to assess the precision of both models (GPT-2 and BERT). A few additional model-specific pre-processing steps were carried out before feeding this data to the models.

5.2 Training the Models

Some snippets relevant to training and summarization are provided here.

The snippet imports the Summarizer and TransformerSummarizer classes from the summarizer library, initializes a GPT2 model using TransformerSummarizer with the specified parameters, and instantiates a Summarizer for the BERT model.
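A hedged reconstruction is sketched below, based on the bert-extractive-summarizer package that provides these classes; the transformer model key and length parameter are assumptions, since the exact values in the screenshot are not recoverable.

```python
# Sketch reconstructing the screenshot; model key and min_length are assumed.
from summarizer import Summarizer, TransformerSummarizer

# GPT2-based summarizer via the generic transformer wrapper.
GPT2_model = TransformerSummarizer(transformer_type="GPT2",
                                   transformer_model_key="gpt2-medium")

# BERT-based summarizer (defaults to a pre-trained BERT checkpoint).
bert_model = Summarizer()

body = open("article.txt").read()  # one text file from the dataset
gpt2_summary = "".join(GPT2_model(body, min_length=60))
bert_summary = "".join(bert_model(body, min_length=60))
```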

5.3 Evaluation Metrics

The term ROUGE, which stands for “Recall-Oriented Understudy for Gisting Evaluation,” refers to a set of standards for rating texts that are produced automatically. It is typically employed to assess the effectiveness of a TS algorithm’s summary. Many metrics and scores have been proposed in the literature for the evaluation of text summarization results, but ROUGE is the most popular one. Because ROUGE operates identically on abstractive and extractive algorithms, it does not always produce meaningful distinctions: it ignores the semantic and grammatical precision of the system and human summaries in favor of the overlap of n-grams (expressed as a numeric value). In addition, averaging over numerous executions is typically preferable to a single one.

Writing a summary by hand is closer to an abstractive TS task than to an extractive one: after carefully reading and understanding one or more source texts, a person tries to express the ideas in new words and phrases, covering as many topics from the original text as possible [21].

The snippet imports the rouge_scorer module from the rouge-score library, initializes hypothesis and reference documents, creates a RougeScorer object, and calculates the ROUGE scores for each document pair.
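A hedged reconstruction follows; the metric list and example strings are assumptions.

```python
# Sketch reconstructing the screenshot; metrics and strings are assumed.
from rouge_score import rouge_scorer

hypothesis = "GM is recalling roughly 200,000 vehicles for safety reasons."
reference = ("Federal officials say General Motors is recalling roughly "
             "200,000 vehicles in the US for safety reasons.")

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(reference, hypothesis)
for name, s in scores.items():
    print(name, f"precision={s.precision:.2f}",
          f"recall={s.recall:.2f}", f"f-measure={s.fmeasure:.2f}")
```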

5.4 Summary Snippets

Text given as input:

“Safety alert as GM recalls cars.

According to federal officials, the world’s largest automaker, General Motors (GM), is recalling roughly 200,000 vehicles in the United States for safety reasons. According to the National Highway Traffic Safety Administration (NHTSA), the greatest recall comprises 155,465 trucks, vans, and sport utility vehicles (SUVs). This is due to potential brake system problems. The recall affects vehicles from the 2004 and 2005 model years, according to GM. The Chevrolet Avalanche, Express, Kodiak, Silverado, and Suburban, as well as the GMC Savana, Sierra, and Yukon, have possible flaws. According to the NHTSA, if the hood was open while driving, pieces from a pressure accumulator in the braking system may shatter, endangering passengers.

It warned that doing so might enable hydraulic fluid to escape, making it more difficult to brake or steer and perhaps resulting in an accident.

Additionally, GM is recalling 19,924 Pontiac Grand Prix cars, SRX SUVs, and Cadillac XLR coupes from the 2004 model year.

This is due to the fact that in severely cold weather the accelerator pedal could not function effectively, necessitating greater braking.

In addition, the automaker is recalling 17,815 2005-model Buick Raniers, Chevrolet Trailblazers, GMC Envoys, and Isuzu Ascenders due to improper windshield installation that could result in a crash.

However, GM emphasized that it was unaware of any injuries connected to the issues.

The recall was made public after GM said last month that it anticipated weaker profitability this year than in 2004.

The largest automaker in the world is struggling with sluggish US sales, losses in its European division, and now a product recall.

In January, General Motors warned that increased healthcare expenditures in North America and decreased earnings at its financial services unit would negatively affect the company’s performance in 2005.”

Reference summary provided with the dataset

According to GM, the recalled products include cars from the 2004 and 2005 model years.

The NHTSA warned that if the hood was open while driving normally, a pressure accumulator in the braking system might fracture and cause injuries to individuals.

Federal officials have announced that General Motors (GM), the largest automaker in the world, is recalling roughly 200,000 vehicles in the US for safety reasons.

The recall comes after GM said last month that it anticipated weaker profitability this year than in 2004.

In addition, the automaker is recalling 17,815 2005 model-year Buick Raniers, Chevrolet Trailblazers, GMC Envoys, and Isuzu Ascenders because the windshield is improperly installed and could fly out in an accident.

The greatest recall, according to the National Highway Traffic Safety Administration (NHTSA), affects 155,465 pickups, vans, and sports utility vehicles (SUVs).

Summary by BERT model

“Safety alert as GM recalls cars.

According to federal laws, the United States is recalling around 200,000 cars manufactured by General Motors (GM), the world’s largest carmaker.

According to the National Highway Traffic Safety Administration (NHTSA), 155,465 trucks, vans, and SUVs have been recalled (SUVs).

In addition, the company is recalling 17,815 Buick Raniers, Chevrolet Trailblazers, GMC Envoys, and Isuzu Ascenders from the 2005 model year because the windshield was mounted incorrectly and might fly out in a collision.”

Summary by GPT2 model

“Safety alert as GM recalls cars.

Federal officials have announced that General Motors (GM), the largest automaker in the world, is recalling roughly 200,000 vehicles in the US for safety reasons.

The automaker is also recalling 17,815 2005 model-year Buick Raniers, Chevrolet Trailblazers, GMC Envoys, and Isuzu Ascenders because the windshield is improperly installed and could fly out in an accident.

The recall comes after GM said last month that it anticipated weaker profitability this year than in 2004.”

6 Comparison of Results

The numerical values given by the ROUGE metrics for precision, recall, and f-measure are listed in Tables 2 and 3, where the difference between the values produced by the two models can be seen.

Table 2 Performance of GPT2
Table 3 Performance of BERT

Figures 3 and 4 show the precision, recall, and f-measure given by the ROUGE metrics. Comparing these bar graphs, it is found that BERT gives better results than GPT 2. These bar graphs represent two random summaries generated by the respective Transformer models.
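For reference, a minimal script of the kind that produces such bar graphs is sketched below; the recall and f-measure values are placeholders, and only the average precisions (0.38 and 0.40, reported in the conclusion) come from our results.

```python
# Sketch: plotting ROUGE precision/recall/f-measure per model.
# Recall and f-measure values below are placeholders, not measurements.
import matplotlib.pyplot as plt

metrics = ["precision", "recall", "f-measure"]
gpt2_scores = [0.38, 0.55, 0.45]  # precision from results; rest placeholders
bert_scores = [0.40, 0.60, 0.48]  # precision from results; rest placeholders

fig, axes = plt.subplots(1, 2, sharey=True)
for ax, scores, title in zip(axes, [gpt2_scores, bert_scores],
                             ["GPT2", "BERT"]):
    ax.bar(metrics, scores)
    ax.set_title(f"Performance of {title}")
plt.show()
```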

Fig. 3

Performance of GPT2: bar graph of precision, recall, and f-measure on a 0.0–0.6 scale; recall has the maximum value

Fig. 4

Performance of BERT: bar graph of precision, recall, and f-measure on a 0.0–1.0 scale; recall has the maximum value

Figure 5 can be used to visualize the trend in which both the models perform in terms of precision. Here it can be seen that the precision given by GPT2 is lower compared to the one given by BERT.

Fig. 5

Precision comparison of GPT2 and BERT: GPT2 peaks at (8, 0.40) and BERT at (4, 0.60)

From Fig. 6, the loss function of the BERT model seems to be comparatively better than GPT 2.

Fig. 6

Loss comparison of GPT2 and BERT: GPT2 peaks at (5, 0.65) and BERT at (8, 0.60)

7 Conclusion

From all the visualizations and analysis done using ROUGE metrics on the BERT and GPT 2 models, it is concluded that BERT gives better results than GPT 2. This conclusion has been reached by analyzing the numerical values obtained from the ROUGE metrics and the trends shown in the graphs. BERT gives an average precision of 0.40, whereas GPT2 gives 0.38. On the basis of this and the other factors measured by the metrics, the BERT model is concluded to be the better choice for summarization in Big Data Analytics.