Introduction

Patents are an important knowledge source and, therefore, their analysis has been considered a useful tool for research and for management development. Patents are one of the most effective ways to protect an invention today (Wang et al. 2019). One of the objectives of granting patents is to facilitate the dissemination of scientific knowledge (Ouellette 2017). However, finding information in these documents is becoming an increasingly complex task due to the large number of patents in datasets (Sjögren et al. 2018). These documents use complex language, with dense descriptive technical detail and idiosyncrasies related to the structure of the patent document and the length of its sentences. As a result, retrieving and analyzing these documents is time consuming and laborious (Codina-Filbà et al. 2017; Gomez 2019).

The efficient analysis of these documents allows for monitoring technological trends, defining business models, securing market share, decreasing the time to develop new products and reducing the possibility of patent infringement (Codina-Filbà et al. 2017; Kim and Lee 2015; Trappey et al. 2009). Camus and Brancaleon (2003) highlighted the importance of the information revealed by patent analysis, which exposes risks and opportunities and offers insight into business activities. However, in order to be useful for decision making, the information contained in a patent dataset must be presented in an understandable format (Madani and Weber 2016).

The information contained in patents is distributed in sections defined by the patent office. The formatting of a patent text is controlled by the laws and regulations of the country or patent authority in which the inventor applied for the patent. In general, patents have a title, an abstract, claims and a description. The abstract is characterized by complex syntactic constructs and a generic vocabulary. The claims section has a hierarchical structure, including independent and dependent claims. The independent claims present a general idea of the invention, whereas the dependent claims present more specific information about it. Each claim is composed of a single sentence, which leads to very long sentences of significant complexity. The description section is characterized by containing information that distinguishes the invention (Codina-Filbà et al. 2017; Mille and Wanner 2008).

In order to take advantage of patent knowledge, it is essential to organize information in an accessible and simple format and to name the groups provided by patent offices with sentences that truly represent them. Because these subgroups belong to a restricted knowledge domain, the naming task can be extremely laborious. In this context, it is necessary to look for techniques that facilitate this naming process and assist the specialists in their task.

This work uses summarization techniques as an approach to name patent groups. In the work presented by Souza et al. (2019), the best-performing extractive summarization methodology used the Latent Semantic Analysis (LSA) algorithm applied to patent abstracts. In this work, we compare the LSA algorithm with an abstractive summarization algorithm and evaluate whether the abstractive algorithm achieves better performance in the task of naming new patent groups.

This work is divided into six sections. The following section presents the theoretical background and related works. The "Proposed approach" section describes the abstractive summarization model, the dataset used and the methodological steps of the work. The "Experiments" and "Final considerations" sections show the results, their analysis and the final considerations.

Theoretical background

In general, there are two main approaches to automatic summarization: extractive and abstractive. Extractive summarization selects the main sections of the original text to generate a summary. The extractive summarization systems are usually based on the sentence/topic extraction technique and attempt to identify a set of sentences that is most important for the general understanding of a particular document. In order to identify these sentences, many approaches use keywords as a criterion for choosing the sentences and, thus, extract the sentences that have the highest number of keywords (Wang et al. 2011). Abstractive summarization tries to develop an understanding of the main sections of the text and, from an internal semantic representation, expresses the knowledge obtained in natural language. For this, it uses linguistic methods to interpret and describe the text, generating a summary with the main information of this text (Wang et al. 2011). Because it requires extensive processing of natural language, abstractive summarization is more complex than extractive and therefore less explored (Gambhir and Gupta 2017).

Abstractive methods can be divided into two categories, syntactic and semantic. Syntactic methods verify the grammatical structure of the text and use the information obtained to generate a concise representation of it. Semantic methods generate a summary of the text from its semantic representations, usually using ontologies. Approaches using semantic representations are considered more robust because abstractive summarization needs a thorough analysis of the text (Khan et al. 2015). However, current semantic analysis methods do not perform well either on relatively simple texts or on structurally more complex ones, which makes the summary generation task more challenging (Codina-Filbà et al. 2017).

This section is divided into five subsections in which the concepts related to this work are presented. "Seq2Seq model", "LSTM network" and "LSA algorithm" sections provide, respectively, a description of the Sequence to Sequence (Seq2Seq) model, the Long Short-Term Memory (LSTM) network, and the LSA extractive summarization algorithm. "Recall-oriented understudy for gisting evaluation" and "Analysis of semantic similarity" sections describe the evaluation metrics used.

Seq2Seq model

Seq2Seq was first introduced by Cho et al. (2014) and Sutskever et al. (2014). The architecture of the Seq2Seq model is divided into two parts, encoder and decoder. Each of these parts may be implemented by, for example, a Recurrent Neural Network (RNN). To perform the abstractive summarization task, a many-to-many Seq2Seq architecture is used, in which the encoder is an Artificial Neural Network (ANN) that receives a sequence of words from the text \(x = x_{1}, \ldots, x_{m}\) and produces the corresponding hidden states \(z = z_{1}, \ldots, z_{m}\). The decoder receives z as input and outputs a sequence \(h = h_{1}, \ldots, h_{t}\) (Zhang et al. 2019). To determine when the decoder will start generating summaries, a symbol representing the end of the input is used. After the first output \(h_{1}\) is generated, the decoder produces a new hidden state along with a word representation vector. Each generated word is used as input for the next word generation.
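To make the encoder-decoder flow concrete, the sketch below implements a minimal many-to-many Seq2Seq model in PyTorch. The framework, layer sizes and vocabulary size are our own illustrative choices, not those of the model evaluated later in this work.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder; sizes are illustrative, not the paper's model."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src, tgt):
        # The encoder reads the word sequence x_1..x_m and yields its final state z
        _, state = self.encoder(self.embed(src))
        # The decoder starts from z and produces h_1..h_t; during training the
        # target sequence (the shifted reference title) is fed in (teacher forcing)
        dec_out, _ = self.decoder(self.embed(tgt), state)
        return self.out(dec_out)  # per-step logits over the vocabulary

model = Seq2Seq(vocab_size=10000)
src = torch.randint(0, 10000, (2, 30))  # batch of 2 "abstracts", 30 tokens each
tgt = torch.randint(0, 10000, (2, 8))   # the corresponding 8-token "titles"
logits = model(src, tgt)                # shape: (2, 8, 10000)
```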

LSTM network

LSTM networks consist of a set of recurrently connected blocks that are unrolled over time (Greff et al. 2016). LSTM is a type of RNN. A standard RNN usually presents difficulties with long-term dependencies, so it does not perform well on tasks that need a broader context (Greff et al. 2016). One alternative for this type of task is the use of an LSTM. The LSTM network was introduced by Hochreiter and Schmidhuber (1997) and has since undergone several modifications. Currently, this network is mainly used in tasks that aim to solve sequential data learning problems (Greff et al. 2016), such as automatic text translation (Luong et al. 2015), automatic text summarization (Song et al. 2019), handwriting recognition (Paul et al. 2019), audio analysis (Bin et al. 2018) and video analysis (Abtahi et al. 2018), among others.

In Fig. 1, we have a simplified model of an LSTM network consisting of an input \(x_{t}\) and an output \(h_{t}\). The input \(x_{t}\) passes through several LSTM layers, and each cell contains a loop. The function of the loop is to allow information to persist in the network for a certain time. In tasks that use sequential data, it is often necessary to look back to correctly predict the next state. In a basic RNN, the amount of available context is smaller than in an LSTM network.

Fig. 1 LSTM network model scheme

Each of the LSTM blocks has one or more memory cells and multiplicative units: the input gate (\(i_{t}\)), the forget gate (\(f_{t}\)), the output gate (\(o_{t}\)) and the cell activation vector (\(C_{t}\)). Basically, the input to the cells is multiplied by the activation of the input gate, the output is multiplied by the activation of the output gate, and the previous cell values are multiplied by the forget gate (Graves and Schmidhuber 2005). The sigmoid layer \((\sigma )\) outputs numbers between zero and one, and the tanh layer creates a vector of new candidate values (\(C'_{t}\)). Figure 2 shows an internal diagram of a standard LSTM cell.

Fig. 2 LSTM detailed cell representation (Olah 2015)

To map the input sequence x to an output sequence h, calculations are performed iteratively from \(1 \rightarrow t\). In Fig. 2, each one of the paths presented in the LSTM cell is named. These paths are represented by the following equations:

$$\begin{aligned} f_{t} &= \sigma (W_{f} \times [h_{t-1}, x_{t}] + b_{f}), \end{aligned}$$
(1)
$$\begin{aligned} i_{t} &= \sigma (W_{i} \times [h_{t-1}, x_{t}] + b_{i}), \end{aligned}$$
(2)
$$\begin{aligned} C'_{t} &= \tanh (W_{C} \times [h_{t-1}, x_{t}] + b_{C}), \end{aligned}$$
(3)
$$\begin{aligned} C_{t} &= f_{t} \times C_{t-1} + i_{t} \times C'_{t}, \end{aligned}$$
(4)
$$\begin{aligned} o_{t} &= \sigma (W_{o} \times [h_{t-1}, x_{t}] + b_{o}), \end{aligned}$$
(5)
$$\begin{aligned} h_{t} &= o_{t} \times \tanh (C_{t}). \end{aligned}$$
(6)

Equation 1 has the role of deciding which information to forget. Equations 2 and 3 decide which information will be stored in the state of the cell. Equation 4 updates the state of the cell. Equations 5 and 6 decide which output will be produced. Table 1 defines the used variables.

Table 1 LSTM network model variables descriptions
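As a concrete illustration of Eqs. 1-6, the following numpy sketch performs a single LSTM cell step. The weight layout (one matrix per gate acting on the concatenation \([h_{t-1}, x_{t}]\)) and the toy dimensions are our own assumptions for the example.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following Eqs. (1)-(6); W maps [h_{t-1}, x_t] to each gate."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])       # Eq. (1): forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # Eq. (2): input gate
    C_cand = np.tanh(W["C"] @ z + b["C"])    # Eq. (3): candidate values C'_t
    C_t = f_t * C_prev + i_t * C_cand        # Eq. (4): cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])       # Eq. (5): output gate
    h_t = o_t * np.tanh(C_t)                 # Eq. (6): hidden state
    return h_t, C_t

# Toy dimensions: input size 4, hidden size 3, so [h, x] has length 7
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 7)) for k in "fiCo"}
b = {k: np.zeros(3) for k in "fiCo"}
h, C = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W, b)
```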

LSA algorithm

Deerwester et al. (1990) described LSA as a method for information retrieval. Later, Landauer et al. (1998) suggested using this method to find relationships between words. The main idea of the method is to reduce the number of dimensions, consequently reducing noise and emphasizing strong indirect relationships between entities.

In this work, LSA is used to generate a summary of a document. The method is based on the work of Dokun and Celebi (2015) and consists of an extractive summarization in which the algorithm extracts a single sentence from the document, identified as the sentence that best represents it. For this, the algorithm receives a preprocessed document as input and generates a sentence-term matrix, usually sparse, in which each column vector represents one sentence through the weighted frequencies of its terms.

From the semantic point of view, the Singular Value Decomposition (SVD) used by the algorithm derives the latent semantic structure of the document represented by the matrix, reflecting a breakdown of the original document into linearly independent base vectors, or concepts. Each term and sentence of the document is jointly indexed by these base vectors/concepts. Besides this, if a word combination pattern recurs in the document, this pattern will be represented by one of the singular vectors.

The magnitude of the corresponding singular value indicates the importance degree of this pattern within the document. Any sentences containing this word combination pattern will be projected along this singular vector, and the sentence that best represents this pattern will have the largest index value with this vector (Froud et al. 2013).
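A minimal sketch of this single-sentence extraction using scikit-learn is shown below. TF-IDF weighting and scoring sentences by their loading on the first singular vector are simplifying assumptions on our part; Dokun and Celebi (2015) may weight and select differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_best_sentence(sentences):
    """Pick the sentence with the largest projection onto the top latent concept."""
    X = TfidfVectorizer().fit_transform(sentences)  # sparse sentence-term matrix
    svd = TruncatedSVD(n_components=1, random_state=0)
    scores = svd.fit_transform(X)[:, 0]             # loading of each sentence on concept 1
    return sentences[abs(scores).argmax()]

doc = ["a barcode reader decodes optical patterns",
       "the reader includes an image sensor and a decoder",
       "ambient light is filtered before decoding"]
print(lsa_best_sentence(doc))
```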

Recall-oriented understudy for gisting evaluation

Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics used to evaluate automatic text summarization and machine translation results. These metrics determine the similarity between a summary generated by a computational model and a summary generated by humans. One of the metrics of this set is ROUGE-N, which measures the recall of n-grams between the candidate summaries and the reference summaries (Sanchez-Gomez et al. 2018). Thus, a ROUGE-1 score of 0.40 indicates that 40% of the unigrams in the reference summary were captured by the summary generated by the model. ROUGE-N is calculated according to Eq. 7, proposed by Lin (2004). The variables used are described in Table 2.

$$\begin{aligned} ROUGE{-}N = \frac{\sum _{C \in R} \sum _{gram_{n} \in C} Count_{match}(gram_{n})}{\sum _{C \in R} \sum _{gram_{n} \in C} Count(gram_{n})}. \end{aligned}$$
(7)
Table 2 ROUGE-N variables descriptions
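For illustration, Eq. 7 for a single reference summary reduces to the short function below, a minimal sketch assuming whitespace tokenization.

```python
from collections import Counter

def rouge_n(reference, candidate, n=1):
    """ROUGE-N recall (Eq. 7): matched n-grams over reference n-grams."""
    def ngrams(tokens, n):
        return Counter(zip(*[tokens[i:] for i in range(n)]))
    ref, cand = ngrams(reference.split(), n), ngrams(candidate.split(), n)
    match = sum(min(c, cand[g]) for g, c in ref.items())  # clipped match count
    return match / max(sum(ref.values()), 1)

print(rouge_n("optical barcode reading device", "barcode reading apparatus"))  # 0.5
```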

Another metric of this set is ROUGE-L, which evaluates the correspondence between two sentences based on the Longest Common Subsequence (LCS) they share (Sanchez-Gomez et al. 2018). This metric assumes that the higher the LCS value of two summaries R and C, the more similar they are. Therefore, ROUGE-L will be 1.0 when both sequences are equal, and 0.0 when LCS(R, C) is zero, indicating that there is no common subsequence between R and C. To calculate this value, we use Eqs. 8, 9, 10 and 11, proposed by Lin (2004). Equations 8, 9 and 11 represent, respectively, the Recall, Precision and F-Measure of the LCS between R and C. The variables used are described in Table 3.

The precision metric checks how many of the values predicted as positive are actually positive. The recall metric measures how many of the actually positive values were classified as positive. The F-measure combines the precision and recall values, indicating the overall quality of the model.

$$\begin{aligned} R_{lcs} &= \frac{LCS(R,C)}{m}, \end{aligned}$$
(8)
$$\begin{aligned} P_{lcs} &= \frac{LCS(R,C)}{n}, \end{aligned}$$
(9)
$$\begin{aligned} \beta &= \frac{P_{lcs}}{R_{lcs}}, \end{aligned}$$
(10)
$$\begin{aligned} ROUGE{-}L = F_{lcs} = \frac{(1 + \beta ^{2}) \times R_{lcs} \times P_{lcs}}{R_{lcs} + \beta ^{2} \times P_{lcs}}. \end{aligned}$$
(11)
Table 3 ROUGE-L variables descriptions
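Eqs. 8-11 translate directly into the sketch below, where the LCS length is computed by standard dynamic programming; whitespace tokenization is again an assumption.

```python
def lcs_len(a, b):
    """Length of the Longest Common Subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    """F-measure over the LCS (Eqs. 8-11), with beta = P_lcs / R_lcs (Eq. 10)."""
    r, c = reference.split(), candidate.split()
    lcs = lcs_len(r, c)
    if lcs == 0:
        return 0.0                                  # no common subsequence
    r_lcs, p_lcs = lcs / len(r), lcs / len(c)       # Eqs. (8) and (9)
    beta = p_lcs / r_lcs                            # Eq. (10)
    return (1 + beta**2) * r_lcs * p_lcs / (r_lcs + beta**2 * p_lcs)  # Eq. (11)

print(rouge_l("optical barcode reading device", "barcode reading device"))
```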

Analysis of semantic similarity

Semantic similarity is a measure of the similarity between sentences and texts, also referred to as semantic entities. This similarity is measured using the distance between terms based on their meaning or semantic content. The semantic similarity index between two semantic entities is a numerical estimate obtained from the semantic information of the entities' terms (Harispe et al. 2015).

In this work, the Semantic Similarity Estimator (SenSim) method proposed by Al-Natsheh et al. (2017) was used. This method consists of two phases: the first is the extraction of feature pairs and the second is the regression estimation. For the extraction of feature pairs, the algorithm uses attributes such as Part-of-Speech (PoS) tags, which categorize words with similar lexical properties, Named Entities (NE) such as people, organizations and places, and the representation of sentences as a Bag-of-Words (BoW) weighted by the TF-IDF algorithm. For the regression estimation, the Random Forest (RF) method is used, an ensemble method that constructs decision trees during training. The method takes two sentences and assigns them a score between 0 and 5, where a high score represents a large similarity between the sentences.
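The sketch below conveys the general idea with scikit-learn: sentence pairs are turned into feature vectors and a random forest regressor is fitted to gold similarity scores. It is a heavily simplified stand-in for SenSim, using only TF-IDF BoW features (no PoS or NE attributes) and toy training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training pairs with gold similarity scores on the 0-5 scale (illustrative)
pairs = [("a device reads barcodes", "an apparatus for barcode reading", 4.2),
         ("a device reads barcodes", "a method of brewing coffee", 0.3)]

vec = TfidfVectorizer().fit([s for p in pairs for s in p[:2]])

def features(s1, s2):
    """Pair features: element-wise difference and overlap of TF-IDF vectors."""
    v1 = vec.transform([s1]).toarray()[0]
    v2 = vec.transform([s2]).toarray()[0]
    return np.concatenate([np.abs(v1 - v2), v1 * v2])

X = np.array([features(s1, s2) for s1, s2, _ in pairs])
y = np.array([score for _, _, score in pairs])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(model.predict([features("barcode scanner device", "apparatus reading barcodes")]))
```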

Related works

Automatic text summarization aims to create a simple and descriptive summary of sections of the original text. Thus, the process identifies the significant aspects of one or more documents in order to represent them consistently (Allahyari et al. 2017). Abstractive summarization methods have still been less explored than extractive ones, as they require intense natural language processing (Gambhir and Gupta 2017). Works such as those published by Parmar et al. (2019), Zhang et al. (2019), Song et al. (2019) and Yao et al. (2018) provide a perspective on how the abstractive summarization task is currently explored.

Parmar et al. (2019) evaluate in their work the performance of a Seq2Seq model and a bidirectional LSTM network. The datasets used were CNN/Daily Mail and Amazon reviews. The Seq2Seq model was validated on both datasets and the LSTM model only on the Amazon reviews. Both models were evaluated using the ROUGE-1, ROUGE-2 and BLEU metrics. BLEU is a metric initially proposed for the evaluation of automatic text translation that uses a modified unigram precision. From the presented results, it was possible to verify that the Seq2Seq model using the Amazon reviews dataset obtained the best result on the BLEU metric, with a score of 26.25%, indicating the best accuracy among the three models under test.

Zhang et al. (2019) presented in their work a generative model of abstractive text summarization using a Convolutional Neural Network (CNN) and Seq2Seq. The proposed model had a copy mechanism for dealing with rare words and a hierarchical attention mechanism. According to the authors, the use of a CNN hierarchical structure was much more efficient than conventional RNN-based Seq2Seq models. The datasets used were GigaWord, DUC 2004 and CNN/Daily Mail. To evaluate the quality of the generated summaries, the ROUGE-1, ROUGE-2 and ROUGE-L metrics were used. According to the authors, the proposed model performed well in relation to the state of the art.

Song et al. (2019) proposed in their work an abstractive summarization model based on LSTM-CNN. The proposed model consists of three steps, which are text pre-processing, sentence extraction and text summary generation. The used dataset was CNN/Daily Mail. The generated model was evaluated using ROUGE-1 and ROUGE-2 metrics. According to the authors, the results exceeded the existing models in terms of semantic and syntactic structure, combining extractive and abstractive summarization, and obtained competitive results in the manual assessment of linguistic quality.

Yao et al. (2018) presented in their work an abstractive summarization method that used a dual encoding model. In the method presented by the authors, the primary encoder performed text encoding on a regular basis, while the secondary encoder modeled the importance of words in the text and generated a more accurate encoding of it. For final summary generation, the two encodings were combined to generate a more diverse summary. The datasets used for the experiments were CNN/Daily Mail and DUC 2004. To evaluate the generated summaries, the ROUGE-1, ROUGE-2 and ROUGE-L metrics were used. According to the authors, the proposed method presented a good result in relation to the state of the art.

By analyzing these works, it is possible to verify that most of them evaluate their results on news datasets, not exploring domains such as patents. Works such as those of Codina-Filbà et al. (2017), Mille and Wanner (2008) and Trappey et al. (2009) highlight the importance of generating automatic summaries of patent documents. Therefore, it is necessary to evaluate the performance of these algorithms in this domain of knowledge.

Proposed approach

This section is divided into three subsections. The first presents the abstractive summarization model used. The second presents a description of the dataset used in the experiments, while the third presents the methodological steps taken during the practical experiments.

Abstractive summarization model

The model used in this work consists of two LSTM network architectures. The encoder uses LSTM cells along with a stacked bidirectional dynamic RNN, represented in Fig. 3 by the dotted box named Encoder. In this model, several bidirectional RNN layers are stacked, and the combined outputs of the previous and subsequent layers are used as inputs to the next layer (Parmar et al. 2019). The bidirectional model has the advantage of being able to use both past and future contextual information. The decoder uses LSTM BasicDecoder cells associated with the Beam Search Decoder, and is represented in Fig. 3 by the dotted box named Decoder. Beam search is a technique for finding the best word combination for the output summary. According to Cohen and Beck (2019), this algorithm is one of the most commonly used in neural sequence models, as it performs non-greedy local searches that increase the chances of generating a sentence with a higher overall probability. Its use requires setting the parameter \(beam\_width\); in this work, \(beam\_width\) is 10. The higher the value of \(beam\_width\), the better the exploration of the search space and, therefore, the better the generated sentence should be; however, the computational cost is correspondingly higher.

Fig. 3 Abstractive summarization model approach
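The decoding strategy can be sketched in a framework-independent way. The toy beam search below keeps the \(beam\_width\) highest-scoring partial sequences by accumulated log-probability; the step function, vocabulary and stopping rule are illustrative assumptions, not the decoder actually used in the model.

```python
import numpy as np

def beam_search(step_fn, start_token, end_token, beam_width=10, max_len=15):
    """Keep the beam_width best partial sequences by total log-probability."""
    beams = [([start_token], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:          # finished beams are carried over
                candidates.append((seq, score))
                continue
            log_probs = step_fn(seq)          # model's next-token log-probabilities
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((seq + [int(tok)], score + log_probs[tok]))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy step function over a 5-token vocabulary (token 4 = end-of-sequence)
rng = np.random.default_rng(1)
fake_step = lambda seq: np.log(rng.dirichlet(np.ones(5)))
print(beam_search(fake_step, start_token=0, end_token=4, beam_width=10))
```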

In order for a sequence x to be used as network input, it must pass through an embedding layer. In this work, the embedding layer uses the GloVe unsupervised learning algorithm. This algorithm generates vector representations for words, combining the advantages of global matrix factorization and local context window techniques. Training is performed using word co-occurrence information from a given corpus, and the resulting representations show linear substructures of the word vector space (Pennington et al. 2014). The result is vector representations able to highlight the semantic structure of words, allowing the model to capture their meaning and the similarity between them.
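In practice, pre-trained GloVe vectors are loaded into an embedding matrix indexed by the model vocabulary, roughly as sketched below; the file name and the random fallback for out-of-vocabulary words are our own assumptions.

```python
import numpy as np

def load_glove(path, vocab, dim=300):
    """Build an embedding matrix from a GloVe text file (path is illustrative)."""
    matrix = np.random.normal(scale=0.1, size=(len(vocab), dim))  # fallback for OOV words
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in vocab:
                matrix[vocab[word]] = np.asarray(values, dtype=np.float32)
    return matrix

vocab = {"patent": 0, "barcode": 1, "reader": 2}
# embeddings = load_glove("glove.840B.300d.txt", vocab)  # file name assumed
```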

One of the problems with this model is that the ANN needs to compress the important sentence information into a fixed-length vector, called the context vector (Parmar et al. 2019). This compression can lead to the loss of important information, especially for long sentences. To solve this problem, we use the Bahdanau attention mechanism (Bahdanau et al. 2014), represented inside the dotted box named Attention Layer. In addition, to avoid overfitting and improve model performance, the Dropout technique is used. This technique randomly drops network units during training, along with their connections (Srivastava et al. 2014). For Dropout we use \(keep\_prob\) = 0.8, which means that 20% of the neurons can be dropped during training. Another problem presented by this network is that of exploding and vanishing gradients. An exploding gradient can occur when the gradient norm becomes too large, resulting in an unstable network. A vanishing gradient occurs when the gradient norm becomes too small, stopping the optimization process at a certain point. To avoid the former problem, the gradient clipping technique is used. This technique introduces a gradient threshold: gradient norms that exceed this threshold are rescaled to match it. The threshold value used is 5. The hyperparameters \(keep\_prob\), the clipping threshold, the number of LSTM layers and the dimensions of the word embeddings were defined empirically.
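Gradient clipping by norm amounts to a one-line rescaling, sketched below with the threshold of 5 used in this work; the numpy formulation is illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, threshold=5.0):
    """Rescale gradients whose global norm exceeds the threshold (5 in this work)."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

grads = [np.full(3, 4.0), np.full(2, 4.0)]  # global norm ~ 8.9
clipped = clip_by_global_norm(grads)        # rescaled so the norm equals 5
```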

Dataset

There are some classic datasets that are used for the automatic text summarization task. These include the CNN/Daily Mail, NYT, NEWSROOM, XSUM, ARXIV, PUBMED and Amazon Reviews datasets. Sharma et al. (2019) state that these datasets are not suitable for training abstractive summarization models, because most of the fragments used in the article abstracts also appear in the body of the text. The presence of the summary in the input text means that an abstractive summarization model does not have to generate a sentence, only extract it from the input text. However, the goal of abstractive summarization is to build a model that can understand the content of the text and, subsequently, generate one or more sentences able to describe it. Thus, using texts whose summary content is already contained in the input text limits the learning process of the algorithm and makes the resulting model behave more like an extractive than an abstractive summarization algorithm. Because of this, Sharma et al. (2019) propose the use of patent documents to train abstractive summarization models, specifically the description and abstract sections, which do not usually share text fragments.

Therefore, to conduct the experiments following Sharma's suggestion, a dataset was created composed of abstracts and titles of patent documents provided by the United States Patent and Trademark Office (USPTO). Abstracts were used as the model input and titles were used as the ground truth to compare against the model output. USPTO uses the Cooperative Patent Classification (CPC) system, which classifies patents into sections, classes, subclasses, groups and subgroups, as illustrated in Fig. 4. We chose to use titles and abstracts because the objective is to use the proposed approach to generate simple and descriptive sentences that are able to name patent subgroups consistently. As can be seen later, in Table 13, subgroup names are generally small.

Fig. 4 Hierarchical organization of the CPC system

Two main datasets were created. The first was used in the training and validation of the abstractive summarization process. The second was used to compare the abstractive summarization with the extractive summarization process used by Souza et al. (2019). To generate the first dataset, 7,000 documents were randomly selected from each of the nine sections of the CPC. In the CPC system, documents can belong to more than one subgroup, so it was necessary to remove duplicate documents. In the end, we obtained a dataset composed of 41,527 patent documents, divided into a training dataset with 33,221 documents and a validation dataset with 8,306 documents. The dataset has an average compression ratio of 22.55, which represents the ratio between the number of words in the abstracts and the number of words in the titles. Abstracts have an average of approximately 124 words and 6 sentences, and the titles have an average of approximately 8 words and 1 sentence.

Among the related works, Sharma et al. (2019) are the only ones that evaluate the performance of abstractive summarization models on patent datasets. However, the authors propose a patent dataset composed of patent titles, abstracts and descriptions, and evaluate performance using only abstracts and descriptions. Patent sections have different structural and language characteristics, which makes it impossible to compare the results of this work with those of Sharma et al. (2019). Besides this, the summaries generated in this work are more concise and use a smaller number of words.

To perform the naming task proposed in this work, we used a second dataset composed of four subgroups with the following CPC codes: G06K 7/1443, G06K 7/1447, G06K 7/1452 and G06K 7/1456. From now on, the subgroup codes will be represented only by their suffixes 43, 47, 52 and 56. The second dataset is composed of 733 patents. Table 4 shows the distribution of patents in each of the subgroups.

Table 4 Second dataset patents distribution

Methodological steps

Initially, all the datasets used were preprocessed. For all dataset files presented in the "Dataset" section, special characters were removed, all text was converted to lower case, punctuation was separated from the words, and periods were replaced by the # character.
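A minimal sketch of this preprocessing is shown below; the exact set of characters treated as "special" is an assumption, since the text does not enumerate it.

```python
import re

def preprocess(text):
    """Preprocessing applied to all dataset files, as described above."""
    text = text.lower()                              # lower-case everything
    text = re.sub(r"[^a-z0-9.,;:!?'\s]", " ", text)  # drop special characters (assumed set)
    text = re.sub(r"([.,;:!?])", r" \1 ", text)      # separate punctuation from words
    text = text.replace(".", "#")                    # replace periods with '#'
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("An optical reader, e.g. a scanner."))
```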

The methodology used to find the sentence that best describes a group using abstractive summarization can be divided into two phases. The first phase is divided into two steps, and Fig. 5 presents a diagram representing them. This phase was performed 30 times, initializing the LSTM network weights randomly each time. It is necessary to execute these algorithms 30 times because they are stochastic, which means that different executions of the same algorithm on the same input data may return different results. Thus, the final performance of these algorithms is given by the average performance over the 30 instances of their execution, ensuring statistical validation of the obtained performance.

Fig. 5 First phase of the abstractive summarization process

The first step consists of training the model using the first dataset described in the "Dataset" section. The model is trained with 2 layers with 150-dimensional hidden states and a pre-trained word vector model (840B tokens, 300-dimensional vectors), using the Adam optimizer. For training, patent abstracts were used as the network input and document titles as the outputs. In the second step, the validation of the model was performed for each of the 30 generated instances. To verify the quality of the generated outputs, which we call "summaries", we used the ROUGE-1, ROUGE-2 and ROUGE-L metrics, following the approach of Sharma et al. (2019). In this work, the network outputs were compared with the document titles. For each of the 30 instances, the results were obtained for the ROUGE-1, ROUGE-2 and ROUGE-L metrics. To calculate the average accuracy of the model, each metric was averaged over the 8,306 validation dataset records, resulting in 30 values for each metric. Afterwards, the average of each metric was calculated over the 30 instances. The instance selected for the tests is the one that received the best general average over the three metrics.

The second phase is divided into three steps and consists of using the abstractive summarization model generated in the first phase for the task of automatically naming patent groups. Figure 6 presents a sequence diagram of the steps in this phase.

Fig. 6 Second phase of the abstractive summarization process

The first step consists of summarizing the abstracts of each document using the model generated in the first phase. For each document, only one sentence is generated. The second step of the process analyzes the similarity between the generated sentences, using the method proposed by Al-Natsheh et al. (2017). The semantic similarity of each sentence in relation to the other sentences of the subgroup is calculated, creating a list of sentence pairs and their respective similarity scores. In the third step, the maximum metric is used to select the most representative sentence of each subgroup. The maximum metric, used by Souza et al. (2019), selects the sentence that most frequently presents the highest similarity score.
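One way to realize the maximum metric is sketched below: each sentence votes for its most similar peer, and the sentence gathering the most votes is selected. This reading of the metric and the toy word-overlap similarity (standing in for the SenSim score) are our own assumptions.

```python
from collections import Counter

def pick_by_maximum_metric(sentences, sim):
    """Maximum metric (one reading of it): for each sentence, find which other
    sentence it is most similar to; the sentence named 'most similar' most
    often is taken as the subgroup representative."""
    wins = Counter()
    for i, s in enumerate(sentences):
        scores = [(sim(s, t), j) for j, t in enumerate(sentences) if j != i]
        wins[max(scores)[1]] += 1          # credit the closest neighbour
    return sentences[wins.most_common(1)[0][0]]

# Toy similarity: shared-word count stands in for the SenSim 0-5 score
sim = lambda a, b: len(set(a.split()) & set(b.split()))
group = ["barcode reader with sensor", "barcode reader with lens", "coffee brewing method"]
print(pick_by_maximum_metric(group, sim))
```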

Finally, the validation of the entire experiment is performed using the second dataset, shown in Table 4, whose subgroup names had already been designated by specialists. The selected sentences are quantitatively evaluated by analyzing the semantic similarity between the name of the subgroup and the sentences chosen as the most representative of each subgroup. In addition, a qualitative analysis is performed against the name of the subgroup. The hypothesis is that if the selected sentence is semantically similar to the subgroup name, it will provide a meaningful description for the subgroup. From this analysis, it is possible to compare the results obtained using abstractive summarization with those of the LSA extractive summarization developed by Souza et al. (2019).

Experiments

Initially, for model training, we restrict the LSTM network input to 150 words and the output to 15 words. These values were chosen considering the average number of words in the abstracts and in the titles of the patent documents. During training, a dictionary with 48,083 words was generated. The model was trained using Google's Colab notebooks with a Tesla K80 GPU. Each instance lasted 14 h on average, running 100 epochs. The average training and inference times of the 30 instances of the model, for each abstract, are approximately 1.5171 seconds and 0.0083 seconds, respectively. On average, the loss function value is 0.1827, with a standard deviation of 0.4958. Figure 7a presents the histogram of the average distribution of the ROUGE-1 values; the average distributions of the ROUGE-2 and ROUGE-L values over the 30 instances are presented in Fig. 7b and c.

Fig. 7 Average distribution of ROUGE scores in 30 instances of execution

Table 5 presents the averages, with their respective standard deviations, of the ROUGE-1, ROUGE-2, and ROUGE-L metrics obtained over the 30 instances generated in this work. We chose to present only these three metrics because they are the ones used to evaluate the model performance in this work.

Table 5 ROUGE scores average

Table 6 reproduces the results of the ROUGE-1, ROUGE-2 and ROUGE-L metrics, in percentage, for each of the referred works. Based on the presented data, it can be seen that most of the results of those reports do not perform well. This clearly shows that abstractive summarization still needs major development, both for general discourse texts and for patent documents, which are characterized by structurally more complex texts. The general average of the metrics used to evaluate our model matches some results present in the literature, as shown in Table 6.

Table 6 ROUGE scores for the discussed works

To select the instance to be used in the second phase, the three metrics were first averaged for each instance, and these values were then averaged across instances. The global average was 32.01%, with a standard deviation of 2.18%. From this value, we selected the instance whose average value was closest to the global average. In Tables 7, 8, 9 and 10, some of the results obtained with the second dataset are presented. In each table, the first line is the patent abstract, the second the patent title, and the third the generated summary, which the proposed approach produces automatically as a label for each patent. Words that appear in both the document title and the generated summary are highlighted in bold. The patents were selected from four different CPC sections. The selected patents were identified as \(P_{1}\), \(P_{2}\), \(P_{3}\) and \(P_{4}\), each belonging, respectively, to sections B, E, G and F of the CPC system.

Table 7 Results of abstractive summarization for \(P_{1}\)
Table 8 Results of abstractive summarization for \(P_{2}\)
Table 9 Results of abstractive summarization for \(P_{3}\)
Table 10 Results of abstractive summarization for \(P_{4}\)

Table 11 shows the resulting metrics for patents \(P_{1}\), \(P_{2}\), \(P_{3}\) and \(P_{4}\). According to the presented results, we can verify that the model used has promising results, especially when compared to the examples of summaries generated by the abstractive models of other works, as presented in the "Related works" section.

Table 11 ROUGE scores for patents \(P_{1}\), \(P_{2}\), \(P_{3}\) and \(P_{4}\)

Overall, out of the 8,306 documents of the validation dataset, 543 simultaneously obtained the maximum value for the three metrics, such as the \(P_{1}\) patent shown in Table 7. Table 12 shows the number of patents in each percentage range for the ROUGE-1, ROUGE-2, and ROUGE-L metrics. By analyzing the characteristics of these texts, we concluded that most of them have closely related texts in the training dataset, considering that there are many documents related to the same topic. This shows that the generated model was able to identify this relationship. In some cases, it was noted, by qualitative comparison, that the generated summary had the same semantic content as the input but did not contain all the words of the reference summary; in these cases, the metrics rated it with a very low score. There are also cases in which the summaries differ in only a few words, such as the \(P_{3}\) patent shown in Table 9. In this case, the score was also severely penalized. Therefore, we conclude that the metrics used do not perform well for evaluating abstractive summaries because, unlike extractive summaries, which always contain the same words as the input texts, abstractive summaries have more freedom to generate sentences. This makes it possible to generate sentences that are semantically similar to the input text and consistent with its content, but which do not have exactly the same words, such as the \(P_{4}\) patent shown in Table 10.

Table 12 ROUGE scores distribution for 8,306 Patents

Moreover, by analyzing all the results obtained with the proposed approach, it was possible to see that, in many cases, we obtained significant results. The results usually presented in the literature come from models trained with larger datasets, general discourse texts and more intense training. Therefore, we believe that the results obtained in this work are promising, since we apply abstractive summarization to patent texts, which have more complex language and structure, as systematically described in the literature.

After the training, validation and analysis of the results obtained by the abstractive summarization model, we used the selected instance in the group naming task. A sentence was generated for each of the four subgroups presented in Table 4. The idea is that the generated sentence should be able to describe the content of each of the subgroups. The generated sentences are presented in the second column of Table 13, and the third column presents the names given by the specialists. The word cloud shown in Fig. 8 is a graphical representation that helps to evaluate the similarity between the original text and the summarized sentence obtained in the extractive summarization by Souza et al. (2019).

Table 13 Generated sentences using abstractive summarization
Fig. 8 Sentence extracted using LSA and metric maximum for the subgroup 43 (Souza et al. 2019)

By comparing the semantic similarity scores between the generated sentences and the subgroup names, it is possible to see that extractive summarization achieves superior results; only one of the results obtained with abstractive summarization presented a higher score than extractive summarization. However, the sentences obtained using extractive summarization are not able to name a group; rather, they provide a sentence that helps the specialist define the name of a group. On the other hand, the expectation is that abstractive summarization could name the group without the intervention of the specialist. Given this, it can be concluded that abstractive summarization has great potential for this task; however, the techniques still need to be improved. Table 14 presents the semantic similarity scores between the sentences generated by abstractive and extractive summarization and the subgroup names defined by the specialists. The best results are highlighted in bold. These scores vary between 0 and 5.

Table 14 Semantic similarity scores comparisons

Final considerations

The main contribution of this work is to propose an approach for the automatic generation of patent group names using summarization techniques. The performance of an abstractive summarization model was compared with that of an extractive summarization algorithm on the task of generating sentences capable of assisting specialists in naming new groups/subgroups. The experiments were performed using a modern abstractive summarization strategy, based on a Seq2Seq architecture and LSTM networks, applied to an area of interest to the academic and industrial communities. The task of generating abstractive summaries of patent documents is still little explored; therefore, we hope to contribute to the study of these techniques on patent datasets. Based on the experiments performed, it was possible to verify that the abstractive summarization model used has promising results for the patent domain. Although extractive summarization performed better than abstractive summarization in the group naming task, it was possible to identify advantages associated with the use of abstractive summarization. Therefore, proposals to continue this work include expanding the training dataset, training with a larger number of epochs, comparing the approach presented here with other variations, and analyzing other techniques to evaluate the performance of the models, such as validation by specialists.