1 Introduction

With the development of automated invention [1,2,3,4,5], Computer-Aided Invention (CAI) systems are being used more and more widely. Computer-Aided Invention is the search for innovative solutions using a computer; CAI systems automate the support of, and the search for, new technical solutions. The success of such systems in synthesizing new technical solutions depends directly on the completeness of their knowledge bases and of the ontologies of the subject areas. One of the serious problems of CAI systems, however, is updating the knowledge base, since this process is rather difficult [6].

Scientific documents, patent documents, and reference books can be the main sources of technical information for supplementing existing knowledge bases. Patent documents can be considered one of the main sources, since the number of patents in patent databases is quite large.

The worldwide patent database, which holds more than 20 million documents, can act as a source of information for the initial stages of designing new technical solutions. Such volumes of data require automated processing.

An ontology model is a convenient way to represent conceptualized knowledge about a subject area. Ontologies organize stored knowledge so that the data can be searched and analyzed. Since the array of patent documents contains a lot of information useful for extraction and analysis, such as claims, classifications, country of origin, and organization, ontologies make it possible to structure and link this information.

2 Analysis of the Patent Array

A patent document is a document issued by an authorized public authority confirming the exclusive right of the patent holder to an invention, utility model, or industrial design. Among the most useful parts for analysis are the patent claims, which are part of the specification of the patent document. The International Patent Classification (IPC) provides an internationally uniform classification of patent documents. This paper deals with patents belonging to classes H (electricity) and F (mechanical engineering).

In this study, two morphological features are distinguished as concepts of the ontologies of the subject areas “Technical functions” and “Implementation of technical objects”: the technical implementation and the “problem-solution” structure. The technical implementation determines the constructive composition of the invention, and the problem-solution structure expresses the problem solved by the technical implementation. The data source for the first feature is the device claims; for the second, it is the technical result item in the description section of the invention.

Technical implementations of objects can be represented using the SAO (Subject-Action-Object) model [7,8,9], while the problem-solution structure corresponds to an incomplete part of this model. Morphological features of technical objects from patent documents can be represented by certain syntactic constructions that can be used for the automated construction of ontologies.

The main methods for extracting concepts and relationships between concepts for building domain ontologies are dependency parsing and part-of-speech tagging [10,11,12,13,14,15]. The SAO model is used to represent the implementations of technical objects and technical functions. To extract concepts from the claims of a patent document, Stanza [16], the latest version of the Stanford NLP library, is used. An example of concept extraction is shown in Fig. 1.

Fig. 1 Example of fact extraction

3 Developed Methods for Extracting Information from the Patent Array

Patent descriptions of technical systems have the following features:

  • Descriptions of implementations of technical objects are contained in the claims;

  • The technical problem solved by the device (the device named in the patent title) is contained in the first paragraph of the patent summary.

Before parsing patent documents that contain descriptions of implementations of technical devices, the patent array, which is an XML file, must be preprocessed. Patents are filtered by classes H and F, which correspond to electricity and mechanical engineering, respectively.
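The filtering step can be sketched as follows. This is only an illustration: the element names (`patents`, `patent`, `ipc`), the `id` attribute, and the sample records are hypothetical, since the schema of the patent array is not given here, and the standard-library `xml.etree.ElementTree` module is used in place of lxml to keep the sketch dependency-free.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout and records: the real schema of the patent array is not specified.
SAMPLE = """<patents>
  <patent id="US1"><ipc>H02K 1/12</ipc><title>Electric motor</title></patent>
  <patent id="US2"><ipc>A61B 5/00</ipc><title>Medical probe</title></patent>
  <patent id="US3"><ipc>F02C 3/04</ipc><title>Gas turbine engine</title></patent>
</patents>"""

# H = electricity, F = mechanical engineering
TARGET_CLASSES = {"H", "F"}


def filter_by_ipc_class(xml_text, classes=TARGET_CLASSES):
    """Return the ids of patents whose IPC symbol starts with one of the given letters."""
    root = ET.fromstring(xml_text)
    kept = []
    for patent in root.iter("patent"):
        ipc = (patent.findtext("ipc") or "").strip()
        # The first character of an IPC symbol is the top-level letter (A to H).
        if ipc[:1].upper() in classes:
            kept.append(patent.get("id"))
    return kept


selected_ids = filter_by_ipc_class(SAMPLE)
```

In this sample, the patent classified under A is dropped and the other two are kept for further parsing.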

To search for and retrieve implementations of technical objects [17,18,19,20], the claims are analyzed. The first claim is the most general and contains the most complete description of the device, so it is the one that is analyzed.

3.1 Pre-segmentation Algorithm

The main idea of preparing the segments of the first claim is to “restore” complete sentences so that the Stanza parser can analyze them correctly. Example 1 shows a fragment of the first claim of an invention in its original form.

First, the main device is located in the left part of the claim. The claim begins with the main device, followed by the character sequence “comprising:”. To restore the segments, the left part is taken up to the “:” character, and the right part, which contains the enumeration, is split at each “;” character. A substring containing the main device of the claim is then prepended to each segment that represents an enumerated element.

The penultimate enumerated element is followed, after the “;”, by the conjunction “and”, which can complicate parsing. Therefore, in the first claim, the character combination “; and” is replaced with “; ”, after which the first occurrence of the word “ where ” is searched for. The claim is divided into two parts, before and after this occurrence. If “where” is absent, the whole claim is taken. Since there can be several occurrences of “where”, the part of the claim after the first “where” is split at each “where”, and leading and trailing whitespace is removed from each resulting segment.

Example 1. A fragment of the first claim of the invention

<claim-text> 1. A decoupled gas turbine engine comprising:

<claim-text> a low pressure compressor; </claim-text>

<claim-text> a high pressure compressor; </claim-text>

<claim-text> a second turning duct in fluid communication between the combustor and the high pressure turbine; </claim-text>

<claim-text> where the low pressure compressor and the low pressure turbine …

After preliminary segmentation, the first claim will have the form shown in Example 2.

Example 2. View of the first claim after preliminary segmentation

A decoupled gas turbine engine comprising a low pressure compressor.

A decoupled gas turbine engine comprising a high pressure compressor.
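The pre-segmentation steps described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: it assumes the XML tags have already been stripped, that the claim starts with a number such as “1.”, and the sample claim (including its “where” clauses) is invented for illustration.

```python
import re


def presegment(claim_text):
    """Split a first claim into stand-alone sentences plus trailing "where" clauses."""
    # Assumption: the claim starts with its number, e.g. "1. ", which is dropped.
    text = re.sub(r"^\s*\d+\.\s*", "", claim_text.strip())
    # The "and" before the last enumerated element can confuse the parser.
    text = text.replace("; and", "; ")
    # Separate the enumeration from the clauses introduced by " where ".
    body, found, tail = text.partition(" where ")
    where_clauses = []
    if found:
        where_clauses = [c.strip(" ;.") for c in tail.split(" where ")]
        where_clauses = [c for c in where_clauses if c]
    # Left part: main device plus "comprising"; right part: the enumeration.
    head, colon, enumeration = body.partition(":")
    if not colon:
        return [body.strip()], where_clauses
    sentences = []
    for element in enumeration.split(";"):
        element = element.strip(" .")
        if element:
            # Prepend the main device so that each element becomes a full sentence.
            sentences.append(f"{head.strip()} {element}.")
    return sentences, where_clauses


# Illustrative claim text; the "where" clauses are hypothetical.
claim = (
    "1. A decoupled gas turbine engine comprising: a low pressure compressor; "
    "a high pressure compressor; and a combustor; "
    "where the combustor is annular; where the compressors are coaxial."
)
sentences, where_clauses = presegment(claim)
```

Each returned sentence has the form shown in Example 2, and the “where” clauses are returned separately with surrounding whitespace and punctuation removed.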

3.2 SAO Extraction Algorithm

A global list of extracted SAOs stores the retrieved device components in the form of the SAO model. All SAOs are retrieved from each pre-segmented segment. The input segment is split into a sequence of tokens by the parser. Only segments whose tokens contain key verbs typical of descriptions of technical object implementations are processed. The key verbs are: comprise, consist, connect, include, attach, have. Extraction continues until no unprocessed key verbs remain in the segment. Figure 2 shows the algorithm for extracting implementations of technical objects from the claims.

Fig. 2 Algorithm for extracting realizations of technical objects from the claims

Dependency parsing and part-of-speech tagging are used to extract the technical implementation directly. The extraction algorithm assumes the presence of a potential key verb, for which a subject and an object must be found.
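The search for a subject and an object around a key verb can be illustrated on an already parsed sentence. The sketch below is an assumption-laden illustration rather than the algorithm of the figures: tokens are plain dictionaries with the field names that Stanza words expose (`id`, `text`, `lemma`, `upos`, `head`, `deprel`), the parse is written by hand in Universal Dependencies style, and the rules for choosing the subject (an `nsubj` child, or the noun modified by an `acl` verb), the object (`obj` or `obl`), and the phrase boundaries (`compound` and `amod` modifiers) are hypothetical choices.

```python
KEY_VERBS = {"comprise", "consist", "connect", "include", "attach", "have"}
MODIFIERS = {"compound", "amod"}  # assumption: relations kept inside a noun phrase


def tok(i, text, lemma, upos, head, deprel):
    """Build one token record with Stanza-like field names."""
    return {"id": i, "text": text, "lemma": lemma, "upos": upos, "head": head, "deprel": deprel}


def phrase(tokens, head_id):
    """Collect the head word and its nested modifiers, in sentence order."""
    ids, stack = {head_id}, [head_id]
    while stack:
        current = stack.pop()
        for t in tokens:
            if t["head"] == current and t["deprel"] in MODIFIERS and t["id"] not in ids:
                ids.add(t["id"])
                stack.append(t["id"])
    return " ".join(t["text"] for t in tokens if t["id"] in ids)


def extract_saos(tokens):
    """Return (subject, action, object) tuples for every key verb in the sentence."""
    saos = []
    for verb in tokens:
        if verb["upos"] != "VERB" or verb["lemma"] not in KEY_VERBS:
            continue
        children = [t for t in tokens if t["head"] == verb["id"]]
        subj = next((t["id"] for t in children if t["deprel"].startswith("nsubj")), None)
        # Assumption: in "X comprising Y" the verb is an acl modifier of X.
        if subj is None and verb["deprel"].startswith("acl"):
            subj = verb["head"]
        obj = next((t["id"] for t in children if t["deprel"] in ("obj", "obl")), None)
        if subj is not None and obj is not None:
            saos.append((phrase(tokens, subj), verb["lemma"], phrase(tokens, obj)))
    return saos


# Hand-written parse of the first sentence of Example 2 (not actual parser output).
sentence = [
    tok(1, "A", "a", "DET", 5, "det"),
    tok(2, "decoupled", "decoupled", "ADJ", 5, "amod"),
    tok(3, "gas", "gas", "NOUN", 4, "compound"),
    tok(4, "turbine", "turbine", "NOUN", 5, "compound"),
    tok(5, "engine", "engine", "NOUN", 0, "root"),
    tok(6, "comprising", "comprise", "VERB", 5, "acl"),
    tok(7, "a", "a", "DET", 10, "det"),
    tok(8, "low", "low", "ADJ", 9, "amod"),
    tok(9, "pressure", "pressure", "NOUN", 10, "compound"),
    tok(10, "compressor", "compressor", "NOUN", 6, "obj"),
    tok(11, ".", ".", "PUNCT", 5, "punct"),
]
saos = extract_saos(sentence)
```

Working on plain dictionaries keeps the extraction logic independent of the parser, so the same function could be applied to real parser output converted to this shape.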

Figure 3 shows a detailed algorithm for extracting a specific implementation of a technical object.

Fig. 3 Algorithm for extracting the implementation of a technical object

An example of extracting the implementation of technical objects is shown in Fig. 4.

Fig. 4 Example of the method implementation

To extract the technical functions and the problem to be solved, the system analyzes not the claims but the section of the patent titled “Technical Problem”. Figure 5 shows the algorithm for extracting the problem solved by the device and its technical functions.

Fig. 5 Algorithm for extracting the problem of the device and technical functions to be solved

4 The Ontology

Triplets are the main way of expressing information in ontologies. A triplet consists of three components: subject, predicate, and object. This model is well suited to storing the retrieved implementations of technical objects as SAOs, since each triplet then consists of a subject, an action, and an object.

Figure 6 shows the class diagram of the ontology of the subject areas “Technical functions” and “Implementation of technical objects”.

Fig. 6 Scheme of classes of the ontology of the subject areas “Technical functions” and “implementation of technical objects”

The following object properties were defined:

  • hasFunction: links a technical function to a component;

  • comprises: relates components of a device (the verb “comprise”);

  • connectedTo: relates connected device components (the verbs “connect”, “attach”);

  • consists: relates device components (the verbs “consist”, “include”);

  • parentFor: indicates a parent relationship between elements (the verb “have”);

  • partOf: indicates that a component belongs to the device of the patent document;

  • solutionFor: links a problem to the device that solves it;

  • connected_to: a connection between elements (the verbs “install”, “connect”, etc.).
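The correspondence between key verbs and object properties listed above can be expressed as a lookup table that turns an SAO into a triplet. This is a hedged sketch: the dictionary and the function `sao_to_triple` are hypothetical names, and because “connect” appears under both connectedTo and connected_to in the list, the sketch arbitrarily maps it to connectedTo.

```python
# Verb lemma -> object property, following the list of properties above.
# Assumption: "connect" is mapped to connectedTo, "install" to connected_to.
VERB_TO_PROPERTY = {
    "comprise": "comprises",
    "connect": "connectedTo",
    "attach": "connectedTo",
    "consist": "consists",
    "include": "consists",
    "have": "parentFor",
    "install": "connected_to",
}


def sao_to_triple(subject, action, obj):
    """Map an SAO to a (subject, property, object) triplet, or None if the verb is unmapped."""
    prop = VERB_TO_PROPERTY.get(action.lower())
    return (subject, prop, obj) if prop else None


triple = sao_to_triple("decoupled gas turbine engine", "comprise", "low pressure compressor")
unknown = sao_to_triple("engine", "rotate", "shaft")  # "rotate" is not a key verb
```

Returning `None` for unmapped verbs lets the caller skip SAOs that have no corresponding property instead of inventing one.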

Figure 7 shows the ontology replenishment algorithm.

Fig. 7 Algorithm for replenishing the ontology of the subject areas “Technical functions” and “implementation of technical objects”

The resulting ontology is exported to an OWL file, which can then be opened in Protégé for further work.

5 The Software

The automated system is implemented as a desktop application for Linux. Development was carried out on Ubuntu 18.04.4 in Python 3.6.9. The PyQt5 library was used to create the user interface. Natural language texts were analyzed with Stanza, the latest version of the Stanford NLP library. The extracted SAOs were stored in a MySQL database accessed through the PyMySQL library. XML files were parsed with the lxml library, and the Owlready2 library was used to work with ontologies.

The automated system makes it possible to load patent documents, extract technical functions and implementations of technical objects, display the extracted implementations of technical objects in a form, and build ontologies either for a user-selected patent or for all loaded patents for which technical functions and technical object implementations have been extracted. Figure 8 shows the ontology constructed for one patent document.

Fig. 8 The ontology for one patent document

In the computational experiment, patent documents were also analyzed manually; the number of SAOs retrieved for each patent and the time taken to parse each patent document were recorded. The extraction accuracy (P) was calculated using formula (1):

$$P = \frac{E}{N}, \tag{1}$$

where E is the number of SAOs correctly extracted by the system and N is the number of SAOs in the patent document.
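As a worked illustration of formula (1), the accuracy for a single patent can be computed as below. The function name and the counts are hypothetical and are not taken from Table 1.

```python
def extraction_accuracy(correct, total):
    """Accuracy P = E / N for one patent document."""
    if total <= 0:
        raise ValueError("the patent must contain at least one SAO")
    return correct / total


# Hypothetical counts: 15 of the 20 SAOs in a patent were extracted correctly.
accuracy = extraction_accuracy(15, 20)
```

With these illustrative counts, P equals 0.75, that is, 75%.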

In Table 1 you can see the results of the experiment.

Table 1 Results of the experiment

The average time for parsing one patent was 1.72316 s for the system and 46.6 s for an expert, so the system is roughly 27 times faster. Accuracy rates are above 70%.

6 Discussion

This work solved the general problem of information support for the synthesis of new technical solutions based on the analysis of USPTO patents.

The concepts of the subject-area ontology were the structural elements of a technical object (TO) and the relationships between them, as well as descriptions of the problems solved by the invention. The first claim of the patent document served as the main source of information. The unit of extraction was the SAO (Subject-Action-Object) semantic structure.

The main linguistic features of patent documents were identified. A method for preprocessing the patent array was formulated, and a separate auxiliary tool was developed for this preprocessing. An algorithm for extracting SAOs from the patent claims was devised. A method was developed for exporting SAOs extracted from English-language patents to a domain ontology.

The developed methods were tested on US patent documents.