1 Introduction

Innovation is widely recognized as a driver of change in everyday life, and patents, as one of the most important means of protecting innovation, are receiving growing attention. The number of patent applications increases year by year, so the worldwide body of patents keeps accumulating. Because patents contain rich technical, economic and legal information, patent analysis and mining has become an important research topic in the field of data mining. In a rapidly developing market economy, enterprises have to seize the high ground of technology to develop sustainably. The technology/effect matrix is a tool for patent analysis and mining. It helps enterprises find vacant technology areas and minefields, and provides important support for technological innovation and avoidance. In building a technology/effect matrix, the annotation of technologies and effects is an important step. At present, this annotation is mostly done manually, which requires heavy manual labor. Manual annotation is also subjective: different annotators may annotate the same patent in different ways, which can cause hidden problems for patent mining. This paper aims to solve these problems.

2 Related Work

In recent years, much research on patent analysis and mining has been carried out at home and abroad. [1] investigated multiple research questions related to patent documents, including patent retrieval, patent classification, and patent visualization. [2] used the OPTICS algorithm and k-nearest neighbor to implement clustering analysis of patent information. [3, 4] focused on forecasting vacant technology, using K-medoids or Bayesian methods. [5] gave a survey of different text clustering techniques for patent analysis. [6] used a self-organizing map (SOM) approach to cluster patents into different quality groups and used a support vector machine (SVM) to build a patent quality classification model. [7] studied the patent document classification problem using deep learning. [8] focused on keyword strategies for applying text mining to patent data and addressed four factors concerning keywords.

In the domain of the patent technology/effect matrix, there is some research, though not much [9,10,11,12,13,14,15]. Japanese scholars [9, 10] were the earliest to study the technology/effect matrix for Japanese- and English-language patents. [11] applied semantic role labeling to create a technology/effect matrix. [12] proposed a method for matrix structure construction based on feature degree and a lexical model. [13] proposed a method based on the conditional random field (CRF) model to recognize effect phrases.

In the authors' previous work on patent analysis and mining [14,15,16,17], we mainly focused on the removal of stop words in patents, intelligent recommendation of traditional Chinese medicine patents, and effect annotation. In the research on annotation, we found that each patent inventor has a preferred writing style; thus, using a co-training method, the extraction of effect statements was divided into chain extraction and keyword extraction, which iteratively annotate the effect statements in a patent abstract. However, this method is prone to misjudgment: statements that are closely related to each other but are not actually effect statements may be falsely deemed effect statements. In this paper, we make use of the distribution and morphological characteristics of patent effect statements and, to make the extraction algorithm more general rather than limited to a single patent inventor, propose a multi-features fused scoring algorithm for the automatic extraction of effect statements.

The rest of the paper is organized as follows. Section 3 analyzes and summarizes the characteristics of Chinese patent abstracts, including the distribution and morphological characteristics of effect statements. Section 4 describes the automatic annotation algorithm PaEffExtr in detail. Section 5 analyzes and explains the experimental results. Section 6 concludes the paper and discusses future work.

3 Characteristics of Patent Abstracts

Generally, a patent text consists of a title, an abstract, claims and a specification. The patent abstract is a summary of the whole content of the patent text. It is short, but contains the composition structure of the invention, the technologies used, design principles, functions, scope of application and other important information. Therefore, the patent abstract is the data source of many patent mining experiments. A patent abstract usually contains a description of the function and application scope of the invention, which is called the patent effect. The purpose of this paper is to automatically extract effect statements from Chinese patent abstracts.

In order to facilitate the following explanation, two definitions are given as follows:

Definition 1: patent effect statements

A collection of statements describing the function and application scope of the invention in the text of patent abstract, denoted as ES. From the perspective of linguistics, the elements in this collection are not necessarily close to each other in the abstract text.

Definition 2: patent effect clause

An element of the patent effect statements, denoted as ec. From the perspective of linguistics, a patent effect clause may be a single sentence or a clause in a long sentence.

So we can say that ES = {ec}.

After observing a large number of patent abstracts, we found that effect statements have two obvious characteristics.

(1) Distribution characteristic: in patent abstracts, the positions of effect clauses follow certain rules. In many cases effect clauses appear at the end of the abstract, in a few cases at the head of the abstract, and in rare cases in the middle of the abstract. Sometimes the effect clauses of a patent abstract are distributed over multiple places, but in many cases all the effect clauses appear consecutively.

(2) Morphological characteristic: because effect statements describe the function and the scope of application of the invention, effect clauses often contain specific clue words. These clue words may introduce an effect clause, and may also indicate which aspects have changed, what changes have been made, and so on.

According to different situations, we divide the clue words into the following categories.

(1) Leading word: a word used to introduce an effect clause. For example: “have”, “can”, “apply to”, “used to”, “make”, etc.

(2) Facet word: a word used to indicate which aspects have been changed by a patent invention. For example: “cost”, “performance”, “quality”, “efficiency”, etc.

(3) Changing word: a word that reveals what changes have been made by a patent invention. For example: “improve”, “simple”, “lower”, “avoid”, and so on.

(4) Degree word: a word used to indicate the extent of the change brought by a patent invention. For example: “significant”, “obvious”, etc.

4 Multi-features Fused Scoring Algorithm

As described above, effect clauses in patent abstracts have obvious distribution and morphological characteristics. Clauses that are at specific locations and contain clue words are more likely to be effect clauses than other clauses. Therefore, we design a multi-features fused scoring algorithm that scores each clause based on its location and on the clue words it contains, and chooses the clauses with high scores as effect clauses.

4.1 Calculation of Distribution Score

There is no mandatory requirement for the writing of patent abstracts, so patent applicants usually write according to their own habits and preferences. Through observation we found that in many cases the functions and application scope of a patent are located at the tail of the abstract, in a few cases at the head, and in rare cases in the middle; some abstracts contain no effect statement at all. In addition, the use of punctuation marks is also very arbitrary. Some applicants are accustomed to using periods to separate the patent structure, technology, design principle, function and application range, some tend to use semicolons, and some use only commas. In this paper, we use the comma, semicolon, period, etc. as delimiters to separate an abstract into clauses. We calculate the distribution score using the following method.

With regard to a patent abstract text T, we use C to represent the set of all its clauses. For the ith clause \( c_{i} \), its distribution score is calculated as:

$$ D_{i} = \begin{cases} \gamma_{1}, & \text{when } 0 < i < \frac{N}{3} \\ \gamma_{2}, & \text{when } \frac{N}{3} \le i < \frac{2N}{3} \\ \gamma_{3}, & \text{when } \frac{2N}{3} \le i \le N \end{cases} \qquad (\gamma_{1} + \gamma_{2} + \gamma_{3} = 1) $$
(1)

In Formula 1, \( \left| C \right| = N \), \( D_{i} \) represents the distribution score of \( c_{i} \). The abstract text is divided into 3 parts, each of which is given a single weight.
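
Clause splitting and Formula 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `split_clauses` and `distribution_score` are hypothetical, the delimiter set (ASCII and full-width comma, semicolon and period) is one reading of "comma, semicolon, period, etc.", and the default weights are simply one setting that satisfies \( \gamma_{1} + \gamma_{2} + \gamma_{3} = 1 \).

```python
import re

# Assumed delimiter set: ASCII and full-width comma, semicolon and period.
# Splitting on "." would also break decimal numbers; a real system needs more care.
CLAUSE_DELIMITERS = r"[,，;；.。]"


def split_clauses(abstract):
    """Split an abstract into non-empty, stripped clauses at the delimiters."""
    return [c.strip() for c in re.split(CLAUSE_DELIMITERS, abstract) if c.strip()]


def distribution_score(i, n, gammas=(0.2, 0.1, 0.7)):
    """Return D_i for the i-th clause (1-based) out of n clauses, per Formula 1."""
    if not 1 <= i <= n:
        raise ValueError("clause index i must satisfy 1 <= i <= n")
    g1, g2, g3 = gammas
    if 3 * i < n:        # 0 < i < N/3: head of the abstract
        return g1
    if 3 * i < 2 * n:    # N/3 <= i < 2N/3: middle of the abstract
        return g2
    return g3            # 2N/3 <= i <= N: tail of the abstract
```

The comparisons are written as `3 * i < n` rather than `i < n / 3` so that the thirds are decided with integer arithmetic only.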

4.2 Calculation of Morphological Score

Because clauses containing clue words are more likely to be effect clauses than others, we calculate the morphological score from whether a clause contains clue words and how many it contains.

The collection of clue words is the key to calculating the morphological score. Through observation, we found that clue words appear frequently in effect statements, so we use statistical methods to find them. By manually annotating the effect clauses of a certain number of patents, looking for high-frequency words, screening them manually, and repeating this for several rounds, we build a set of clue words called ClueWords.

For the ith clause \( c_{i} \) of a patent abstract text T, the morphological score is calculated as follows.

$$ M_{i} = \lambda k_{i} $$
(2)

In (2), \( k_{i} \) represents the number of clue words contained in \( c_{i} \), and \( \lambda \) is a weighting coefficient applied to this count.
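
Formula 2 can be sketched as follows. The sketch assumes that the clause has already been segmented into a list of tokens, that \( k_{i} \) counts every occurrence of a clue word (the text does not say whether repeated words are counted once), and that the small English clue word set is only a stand-in for ClueWords; `morphological_score` and `EXAMPLE_CLUE_WORDS` are hypothetical names.

```python
# Illustrative stand-in for the ClueWords set, using examples from Sect. 3.
EXAMPLE_CLUE_WORDS = {"can", "cost", "efficiency", "improve", "lower", "significant"}


def morphological_score(tokens, clue_words=EXAMPLE_CLUE_WORDS, lam=0.2):
    """Return M_i = lambda * k_i, where k_i counts clue-word occurrences."""
    k = sum(1 for token in tokens if token in clue_words)
    return lam * k
```

For example, the tokens `["the", "device", "can", "improve", "efficiency"]` contain three clue words from the example set, so the score is \( 3\lambda \).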

4.3 Algorithm PaEffExtr

In summary, for the ith clause in a patent abstract text T, the score is calculated as follows:

$$ Score_{i} = \alpha D_{i} + \beta M_{i} \qquad (\alpha + \beta = 1) $$
(3)

Here, \( \alpha \) and \( \beta \) are the weights of \( D_{i} \) and \( M_{i} \) respectively.
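
As a worked illustration, consider two hypothetical clauses scored with the parameter values used for the runtime experiment in Sect. 5.3 (\( \gamma_{1} \) = 0.2, \( \gamma_{3} \) = 0.7, \( \lambda \) = 0.2, \( \alpha \) = 0.3, \( \beta \) = 0.7, th = 0.35). The clauses are assumed for the example and are not taken from the data set. A clause in the tail of the abstract that contains two clue words receives

$$ Score = 0.3 \times 0.7 + 0.7 \times (0.2 \times 2) = 0.21 + 0.28 = 0.49 $$

whereas a clause in the head of the abstract with the same two clue words receives

$$ Score = 0.3 \times 0.2 + 0.7 \times (0.2 \times 2) = 0.06 + 0.28 = 0.34 $$

With th = 0.35, only the tail clause would be kept, which shows how the position of a clause can decide the outcome when the clue word counts are equal.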

The following is the algorithm of extracting effect statements from patent abstract.

figure a

In this algorithm, the abstract text is first separated into clauses at periods, commas, semicolons, colons, question marks, brackets and spaces, and the distribution score of each clause is calculated according to its position. Second, the words of each clause are segmented using NLPIR [18], the clue words are counted, and the morphological score is calculated. Third, the distribution score and the morphological score are fused into the total score. Finally, all clauses whose total score is higher than the threshold are put into the set of effect clauses.
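
The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Java implementation used in the experiments: the function name `pa_eff_extr` is hypothetical, whitespace tokenization stands in for NLPIR word segmentation (so spaces are not treated as clause delimiters here), and the default parameters are the values reported for the runtime experiment in Sect. 5.3.

```python
import re

# Assumed clause delimiters: period, comma, semicolon, colon, question mark and
# brackets, in ASCII and full-width forms. Spaces are left to the tokenizer.
DELIMITERS = r"[.。,，;；:：?？()（）]"


def pa_eff_extr(abstract, clue_words, gammas=(0.2, 0.1, 0.7), lam=0.2,
                alpha=0.3, beta=0.7, threshold=0.35, tokenize=str.split):
    """Return the clauses whose fused score (Formula 3) exceeds the threshold."""
    clauses = [c.strip() for c in re.split(DELIMITERS, abstract) if c.strip()]
    n = len(clauses)
    effect_clauses = []
    for i, clause in enumerate(clauses, start=1):
        # Step 1: distribution score from the clause position (Formula 1).
        if 3 * i < n:
            d = gammas[0]
        elif 3 * i < 2 * n:
            d = gammas[1]
        else:
            d = gammas[2]
        # Step 2: morphological score from the clue-word count (Formula 2).
        k = sum(1 for token in tokenize(clause) if token in clue_words)
        m = lam * k
        # Step 3: fuse both scores (Formula 3).
        score = alpha * d + beta * m
        # Step 4: keep clauses scoring strictly above the threshold.
        if score > threshold:
            effect_clauses.append(clause)
    return effect_clauses
```

A segmenter for Chinese text can be supplied through the `tokenize` parameter, which takes a clause and returns a list of words.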

The construction of the clue word set is the key point of the algorithm, since the quality of the set directly affects the accuracy of the algorithm. To ensure the completeness and usefulness of the clue word set, we collect clue words iteratively, as shown in Fig. 1. First, some patents are annotated manually, the high-frequency words are found and, after manual screening, the high-quality ones are kept as the initial set of clue words. Then the automatic extraction algorithm above is used to extract the effect clauses of more patents, the high-frequency words are found again and screened manually, and the new clue words are added to the set. This is repeated until the clue word set reaches a stable state.

Fig. 1. The construction process of the clue word set
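
The iterative collection of clue words can be sketched as a loop that stops when a round adds no new words. All names here are hypothetical: `extract` stands for any extraction function that takes an abstract and the current clue word set and returns effect clauses, and `screen` stands for the manual screening step, modeled as a callback that returns the accepted words. The cut-off `top_n` and the round limit `max_rounds` are assumptions, since the text does not state how many high-frequency words are reviewed or how many rounds were run.

```python
from collections import Counter


def frequent_words(clauses, top_n, tokenize=str.split):
    """Return up to top_n of the most frequent tokens over all clauses."""
    counts = Counter(token for clause in clauses for token in tokenize(clause))
    return [word for word, _ in counts.most_common(top_n)]


def build_clue_words(seed_clauses, abstracts, extract, screen,
                     top_n=20, max_rounds=10):
    """Grow a clue word set until a round of extraction adds nothing new."""
    # Initial set: frequent words of manually annotated clauses, after screening.
    clue_words = set(screen(frequent_words(seed_clauses, top_n)))
    for _ in range(max_rounds):
        # Extract effect clauses from more patents with the current set.
        extracted = [c for a in abstracts for c in extract(a, clue_words)]
        # Find frequent words again and screen them.
        accepted = set(screen(frequent_words(extracted, top_n)))
        new_words = accepted - clue_words
        if not new_words:  # stable state reached
            break
        clue_words |= new_words
    return clue_words
```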

4.4 Evaluation of Algorithm

In this paper, we compare the automatically annotated effect statements with those annotated manually, using two indicators, precision and recall, to evaluate the effectiveness of the algorithm.

Assume that, for the ith patent abstract, the set of effect clauses obtained by manual annotation is \( P_{i} \) and the set of effect clauses obtained by automatic annotation is \( Q_{i} \). Then the precision and recall of our algorithm on this patent abstract are computed as follows.

$$ Precision_{i} = \begin{cases} \frac{|P_{i} \cap Q_{i}|}{|Q_{i}|}, & \text{when } |P_{i}| > 0 \text{ and } |Q_{i}| > 0 \\ 1, & \text{when } |P_{i}| = 0 \text{ and } |Q_{i}| = 0 \\ 0, & \text{when } |P_{i}| > 0 \text{ and } |Q_{i}| = 0 \\ 0, & \text{when } |P_{i}| = 0 \text{ and } |Q_{i}| > 0 \end{cases} $$
(4)
$$ Recall_{i} = \begin{cases} \frac{|P_{i} \cap Q_{i}|}{|P_{i}|}, & \text{when } |P_{i}| > 0 \text{ and } |Q_{i}| > 0 \\ 1, & \text{when } |P_{i}| = 0 \\ 0, & \text{when } |P_{i}| > 0 \text{ and } |Q_{i}| = 0 \end{cases} $$
(5)

For N patent abstracts, the overall precision and recall of our algorithm are the averages of the per-abstract values.

$$ Precision = \frac{1}{N}\sum\limits_{i = 1}^{N} Precision_{i} $$
(6)
$$ Recall = \frac{1}{N}\sum\limits_{i = 1}^{N} Recall_{i} $$
(7)
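
Formulas 4 to 7 can be sketched as follows, including the special cases for empty annotation sets. The function names `precision_recall` and `average_precision_recall` are hypothetical, and clauses are compared by exact string equality, which is an assumption about how manual and automatic annotations are matched.

```python
def precision_recall(manual, auto):
    """Return (precision, recall) for one abstract, per Formulas 4 and 5."""
    p, q = set(manual), set(auto)
    if not p and not q:
        return 1.0, 1.0   # nothing to find and nothing extracted
    if not p:
        return 0.0, 1.0   # |P| = 0 and |Q| > 0
    if not q:
        return 0.0, 0.0   # |P| > 0 and |Q| = 0
    hits = len(p & q)
    return hits / len(q), hits / len(p)


def average_precision_recall(pairs):
    """Average the per-abstract values over all (manual, auto) pairs."""
    scores = [precision_recall(manual, auto) for manual, auto in pairs]
    if not scores:
        raise ValueError("at least one abstract is required")
    n = len(scores)
    return sum(s[0] for s in scores) / n, sum(s[1] for s in scores) / n
```

Because the per-abstract values are averaged with equal weight, an abstract with a single effect clause influences the result as much as one with many.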

5 Experiments

We implemented the algorithm in Java, using 50,000 patent abstracts from Chinese universities and research institutions as the data source. This section gives the experimental results.

5.1 Clue Words

Table 1 lists some of the clue words obtained after several rounds of running the algorithm and manual screening.

Table 1. Some clue words

5.2 Comparative Experiments

By setting different parameters and thresholds, the precision and recall of the algorithm are compared. Table 2 shows the evaluation results of 24 groups of experiments with different parameters. We can find some rules from these results.

Table 2. Evaluation results of 24 groups of experiments with different parameters

Rule 1: When \( \gamma_{1} \), \( \gamma_{2} \) and \( \gamma_{3} \) take the values 0.3, 0.2 and 0.5 respectively, the algorithm achieves better precision and recall, since effect clauses tend to be located at the tail or the head of the abstract text.

Rule 2: The differences brought by the weights \( \alpha \) and \( \beta \) are not obvious.

Rule 3: When the threshold is raised, the precision increases, but the recall decreases.

Rule 4: When \( \lambda \) changes from 0.2 to 0.1, the precision increases, but the recall decreases.

Rule 5: Groups 9, 23 and 24 have the best precision, and Group 15 has the best recall.

5.3 Runtime

Figure 2 shows the runtime of our algorithm on different numbers of patent texts when the parameters and threshold are fixed (\( \gamma_{1} \) = 0.2, \( \gamma_{2} \) = 0.1, \( \gamma_{3} \) = 0.7, \( \lambda \) = 0.2, \( \alpha \) = 0.3, \( \beta \) = 0.7, th = 0.35). It can be seen that the runtime grows approximately linearly with the number of patent texts.

Fig. 2. The runtime of the algorithm

6 Conclusion and Future Work

To reduce the burden on patent annotators, this paper presents an algorithm for the automatic extraction of effect statements from Chinese patent abstracts. The algorithm uses the distribution and morphological characteristics of effect statements, constructs a clue word set, and uses a scoring method to extract effect statements automatically. The algorithm is simple and direct, and achieves satisfactory experimental results. It could also be extended to the automatic annotation of other patent information, such as technical words and coordinative phrases.