1 Introduction

Semantic text analysis is currently one of the most important areas of computer science. It is widely used in fields such as trade (analysis of consumer preferences), political science and sociology (forecasting of election results, analysis of sociological data), and psychiatry (patient diagnostics). Despite its wide application, semantic text analysis remains one of the most complicated mathematical problems and consists of several stages of text processing.

In this paper, text analysis is considered as a tool for extracting structured physical knowledge in the form of physical effects (PE) [1]. Identifying descriptions of PE in the texts of scientific documents is an important task, because it replenishes the PE database that serves as an information basis for the development of new technical solutions.

At present, descriptions of physical effects are not identified effectively enough in the texts of physical documents. Identifying descriptions of physical effects, for example in the text of a patent, is necessary to check whether a newly submitted physical effect coincides with those in patents already stored in the database. If the descriptions of physical effects in the claimed patent coincide with descriptions in patents from the database, the patent application is refused.

A system for detecting descriptions of physical effects in texts has already been implemented on the basis of the “Semantix” semantic analyzer [2, 3]. Its disadvantage is that the semantic network is often constructed incorrectly, so individual elements of physical effects can be extracted erroneously.

The effectiveness of identifying descriptions of physical effects therefore needs to be increased. In this paper, we propose a technique for identifying descriptions of physical effects in document texts using semantic templates built on the basis of Tuzov’s ontology.

2 Analysis of Software Systems and Tools for the Semantic Analysis of Text

An analysis of the effectiveness of software systems for extracting facts is presented in [3]. As shown there, most existing systems are not flexible enough to be customized for a subject area. Therefore, a new approach is needed for extracting elements of the PE description from the texts of scientific documents.

The effectiveness of software systems for semantic analysis is analyzed in Tables 1 and 2, which list different software systems and tools for semantic analysis together with their effectiveness and semantic capabilities.

Table 1. Analysis of systems for extracting facts from text
Table 2. Semantic analyses systems

The following criteria for analyzing and evaluating systems that extract knowledge and facts from text sources were identified in the analysis of the subject area:

  • Algorithm for identifying target entities (pattern-based search or search based on neural network technologies).

  • System flexibility: the ability to configure system parameters, to add new entities to the system, etc. (these parameters can be fixed in the system or set by the user).

  • License (closed or open).

  • Accuracy of extraction of the target facts: the percentage of correctly identified semantic units (1: 85% and above, 2: 75–85%, 3: below 75%).

  • Completeness of extraction: the percentage of correctly extracted semantic units out of the total number of facts in the text (1: 50% and above, 2: below 50%).

As can be seen from Table 1, the existing systems (except IOFFE) are not suited to the subject area “Physics”, and for IOFFE the accuracy and completeness of extraction of the elements of the PE structure are not high enough.

Table 2 shows that these systems are oriented to the analysis of texts from any subject area and offer no advanced tuning for extracting specific semantic structures, which significantly limits their use for the task at hand. The most promising systems are those whose templates can be flexibly configured for the subject area, which improves the accuracy and completeness of retrieving the necessary information.

3 Semantic Text Analysis to Identify Descriptions of Physical Effect

3.1 The General Approach for PE Extracting

The general approach to PE extraction consists of the following stages:

  • Tokenization. At this stage, sentences are split into words and extra characters are removed.

  • Morphological analysis. At this stage, all lexemes obtained during tokenization undergo morphological analysis, which is performed using the TreeTagger library. TreeTagger is a language-independent tool for morphological annotation of texts, developed by Helmut Schmid at the Institute for Computational Linguistics of the University of Stuttgart. It has proved itself in text processing tasks for various languages (Russian, English, German, French, Italian, etc.).

The result of TreeTagger is a structure of three columns. The first column contains the original word obtained at the tokenization stage. The second column contains the morphological characteristics of the word in the sentence (gender, number, case). The third column contains the word in its initial form. If TreeTagger cannot find the initial form, the value <unknown> is written in the third column.

  • Lemmatization. At this stage, the words whose initial form was not recognized during morphological analysis are passed to the CstLemma application.

  • Extraction of structured physical information in the form of physical effects using semantic templates developed on the basis of Tuzov’s ontology.
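The hand-off between the morphological analysis and lemmatization stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tagger output is available as tab-separated text with the three columns described above, the sample words and tags are made up, and the names `Token`, `parse_tagger_output` and `needs_lemmatization` are hypothetical. The call to the external lemmatizer itself is not shown.

```python
from typing import List, NamedTuple

UNKNOWN = "<unknown>"  # marker written when the initial form was not found


class Token(NamedTuple):
    word: str   # original word form obtained at the tokenization stage
    tag: str    # morphological characteristics of the word
    lemma: str  # initial form, or "<unknown>"


def parse_tagger_output(text: str) -> List[Token]:
    """Parse three-column, tab-separated lines: word, tag, lemma."""
    tokens = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        tokens.append(Token(*parts))
    return tokens


def needs_lemmatization(tokens: List[Token]) -> List[str]:
    """Collect the words that must be passed to the fallback lemmatizer."""
    return [t.word for t in tokens if t.lemma == UNKNOWN]


# Illustrative input only; real tags depend on the tagger's parameter file.
sample = (
    "field\tNN\tfield\n"
    "magnetoresistive\tJJ\t<unknown>\n"
    "layers\tNNS\tlayer\n"
)
tokens = parse_tagger_output(sample)
unresolved = needs_lemmatization(tokens)
print(unresolved)  # ['magnetoresistive']
```

Keeping the unresolved words in a separate list means the second lemmatizer is invoked only for the tokens that actually need it.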

The ontology of the Russian language [2] is a formal description of the Russian language proposed by V.A. Tuzov, in which semantic roles are determined on the basis of semantic classes and morphological information. The approach is built on a semantic dictionary that describes more than one hundred thousand lexical units (words and phrases); each word is described as a semantic formula consisting of basic functions. The approach rests on the following provisions:

  • The language is the algebraic system $\{f_1, f_2, \ldots, f_n, M\}$, where $f_i$ are the basic functions of the language and $M$ is the language structure, which represents the set of basic concepts $m_1, \ldots, m_r$ and their hierarchy.

  • Any sentence of the language can be represented as a superposition of the basic functions $f_i$, through which the words of the language are also expressed, except for the basic concepts $m_j$ that belong to $M$. Thus, a sentence is a single superposition of functions and is treated as a function in the mathematical sense.

  • The grammar is related to the semantics of the language, which is based on the semantic dictionary of more than one hundred thousand words and phrases. The dictionary is divided into three levels: fundamental (1500 hierarchical classes and a set of basic functions); variable (23000 classes, which are described on the basis of the fundamental level and are its variations); and descriptive (words are described using words and concepts of the first two levels). Each word is described as a semantic formula consisting of basic functions.

The dictionary of basic concepts of the Russian language contains words that cannot be expressed through other, simpler concepts. It contains about 18,000 nouns that name physical and abstract objects, more than a thousand basic adjectives and about a thousand basic verbs, which were ultimately replaced by verbal nouns. The remaining words (more than 90,000) are derived, that is, their meaning is expressed as a superposition of basic functions and basic concepts. The whole set of concepts is divided into a hierarchical system of classes.

An example of a dictionary entry (semantic formula):

EXPOSURE $15142 (Caus_o (AGENT: SOMETHING $ 1 ~ Gen, Lab (OBJECT:! Acc, LOCATION: Prep)))

A formal description of the physical effect is presented in [9, 10]. A physical effect is an objective, naturally conditioned connection between two or more physical phenomena, each of which is characterized by a corresponding physical quantity. In terms of content, the PE is represented as a functional relation (arising from physical laws and their consequences) between two or more physical quantities.

To extract descriptions of physical effects, a model for representing structured physical information in a natural-language text has been developed:

$$ M_{PE} = \langle C, \, D, \, B, \, R_{C} , \, R_{B} \rangle , $$
(1)

where $C$ is the set of predicates (relations) characteristic of PE descriptions in the text, $c_i \in C$; $D$ is the set of semantic roles and cases of the predicate arguments; $D_i \subset D$ is the list of argument roles/cases consistent with the predicate $c_i$, $d_j \in D_i$; $B$ is the set of elements of the PE description (A, B, C), $B_k \in B$.

$$ \forall c_{i} \in C \;\; \exists d_{j} \in D_{i} : \; d_{j} \mathop{\longrightarrow}\limits^{def} B_{k} , $$

where $B_k \in$ {input (A), output (C), object (B)}; $def$ is the operator that associates the role/case $d_j$ of an argument of the predicate $c_i$ with a set of elements of the PE description.

$R_C$ is a relation on $C \times D$; the pair $(c_i, d_j) \in R_C$ uniquely identifies the element(s) of the PE description consistent with the predicate role/case $d_j$.

$R_B$ is a relation on $R_C \times B$; the pair $((c_i, d_j), B_k) \in R_B$ defines the set of software concepts corresponding to the element $B_k$ of the PE description.

Let us consider an example of the application of the model.

The text containing the description of the PE: “The impact of a magnetic field on the amperage in a conductive layer”.

Here $c_1$ = IMPACT,

D = {AGENT/Gen, OBJECT/Acc, LOCATION/Prep},

B = {InputPE, OutputPE, ObjectPE}.

For the role “Agent” of the conceptual relation $c_1$, which is performed by the relation member $m_1$: $z_1$ = {input of the PE}.

For the role “Object”, which is performed by the relation member $m_2$: $z_2$ = {output of the PE, object of the PE}.

For the role “Place”, which is performed by the relation member $m_3$: $z_3$ = {input of the PE, object of the PE}.

$B_1$ = InputPE = magnetic field (Impact (of what?, Gen), AGENT),

$B_2$ = OutputPE = amperage (Impact (what?, Acc), OBJECT),

$B_3$ = ObjectPE = conductive layer (Impact (in what?, Prep), LOCATION).
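The example above can be restated as a small lookup structure. The sketch below is only an illustration of the relation between (predicate, role/case) pairs and candidate elements of the PE description. The dictionary layout, the name `candidate_elements`, and the short case labels Gen, Acc and Prep (taken from the notation of the dictionary entry) are assumptions, not the data format of the described system.

```python
from typing import Dict, FrozenSet, Tuple

# Key in the spirit of R_C: (predicate c_i, role/case d_j).
RoleKey = Tuple[str, str]

# Candidate PE elements per key, mirroring z_1, z_2, z_3 of the example.
CANDIDATES: Dict[RoleKey, FrozenSet[str]] = {
    ("IMPACT", "AGENT/Gen"): frozenset({"InputPE"}),
    ("IMPACT", "OBJECT/Acc"): frozenset({"OutputPE", "ObjectPE"}),
    ("IMPACT", "LOCATION/Prep"): frozenset({"InputPE", "ObjectPE"}),
}


def candidate_elements(predicate: str, role_case: str) -> FrozenSet[str]:
    """Return the PE elements a role/case of the predicate may denote."""
    return CANDIDATES.get((predicate, role_case), frozenset())


# Arguments found in the example sentence, keyed by role/case.
arguments = {
    "AGENT/Gen": "magnetic field",
    "OBJECT/Acc": "amperage",
    "LOCATION/Prep": "conductive layer",
}

for role_case, phrase in arguments.items():
    options = sorted(candidate_elements("IMPACT", role_case))
    print(f"{phrase!r} ({role_case}) -> {options}")
```

Only the Agent role maps to a single element here; the other two roles have two candidates each, so an additional decision step is required before a phrase is assigned to an element.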

The dictionary of templates contains a set of keywords (predicates), each of which corresponds to its own set of links “Element of the description of the PE” - “Semantic role of the argument in the Tuzov ontology”.

Predicates are the basic verbs or verbal nouns that characterize descriptions of physical effects in the text, such as interaction, influence, impact, highlight, act, dependence, etc.

Let us apply the dictionary of templates to the same example.

The text containing the description of the PE: “The impact of the magnetic field on the amperage in the conductive layer”. Here the predicate is “IMPACT”; according to the template dictionary, the “Input of the PE” corresponds to the semantic role “Agent”, the “Output of the PE” to the role “Object”, and the “Object of the PE” to the role “Place”. We obtain the extracted elements of the PE description:

PE Input = magnetic field (Impact (AGENT)),

PE Output = amperage (Impact (OBJECT)),

PE Object = conductive layer (Impact (LOCATION)).

Based on the semantic roles in Tuzov’s ontology, the structure of the semantic template was defined:

<Predicate> <input of the PE in the required case> <preposition of the output of the PE> <output of the PE in the required case> <preposition of the PE object> <PE object in the required case>, where the predicate is a keyword selected according to the subject domain, namely a verb or verbal noun (interaction, influence, allocation, isolate, action, act, depend, dependency, etc.).

If the ontology contains no preposition or semantic role for the keyword, the word null is written in the template. If several prepositions and/or cases correspond to a single semantic role, they are listed with separators.
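One possible in-memory representation of such a template is sketched below. The source does not give the storage format, so the semicolon between slots, the `|` separator between alternative prepositions or cases, and the names `SemanticTemplate`, `split_slot` and `parse_template` are all assumptions made for illustration. Only the order of the six slots and the use of the word null follow the description above.

```python
from dataclasses import dataclass
from typing import List

SLOT_DELIMITER = ";"  # assumed delimiter between the six template slots
ALTERNATIVES = "|"    # assumed separator between alternative values
NULL = "null"         # written when the ontology has no value for a slot


def split_slot(value: str) -> List[str]:
    """Return the alternatives of one slot; an empty list stands for null."""
    value = value.strip()
    if value == NULL:
        return []
    return [v.strip() for v in value.split(ALTERNATIVES)]


@dataclass
class SemanticTemplate:
    predicate: str
    input_cases: List[str]
    output_prepositions: List[str]
    output_cases: List[str]
    object_prepositions: List[str]
    object_cases: List[str]


def parse_template(line: str) -> SemanticTemplate:
    """Parse 'predicate; input case; out prep; out case; obj prep; obj case'."""
    fields = [f.strip() for f in line.split(SLOT_DELIMITER)]
    if len(fields) != 6:
        raise ValueError(f"expected 6 slots, got {len(fields)}: {line!r}")
    predicate, *slots = fields
    return SemanticTemplate(predicate, *(split_slot(s) for s in slots))


# Hypothetical template line for an English rendering of the predicate.
template = parse_template("impact; gen; on; acc; in|within; prep")
print(template.object_prepositions)  # ['in', 'within']
```

Representing null as an empty list lets the matching code treat "no preposition" and "one of several prepositions" uniformly.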

On the basis of the analysis of the subject area “Physics”, the keywords (predicates) most frequently encountered in conjunction with unstructured descriptions of physical effects were identified. Such keywords are either verbs or other verbal parts of speech (noun, participle).

As a result, more than 100 keywords were chosen for the subject area.

As a result, more than 100 semantic templates for the subject area were developed.

The algorithm of semantic analysis for extracting the arguments Agent, Object and Place is shown in Fig. 1.

Fig. 1. Algorithm of semantic analysis for extracting the arguments Agent, Object and Place

Thus, the algorithm consists of the following basic steps:

  • Tokenization;

  • Morphological analysis;

  • Lemmatization;

  • Extraction of the arguments of the semantic roles Agent, Object and Place: nouns and the adjectives related to them. In addition, case dependents are extracted: a word in the genitive case is attached to the noun that corresponds to the semantic role.

3.2 Extraction of Structured Physical Information in the Form of Physical Effects and Removal of Semantic Ambiguity. Algorithm for Constructing a Match Template

The analysis of the subject domain revealed the phenomenon of semantic ambiguity. Semantic ambiguity arises when one semantic role in Tuzov’s ontology (“Agent”, “Object”, “Place”) can be matched to several elements of the description of the physical effect (“Input”, “Output”, “Object” of the physical effect).

For example, if we consider the following excerpt of a sentence containing a physical effect:

“IMPACT ON THE SEMICONDUCTOR”

The word “semiconductor” in this case is the object of the physical effect. The semantic role of this word in Tuzov’s ontology is the same: “Object”.

If we consider another excerpt of a sentence containing a physical effect:

“IMPACT ON THE MAGNETIC FIELD”

The phrase “magnetic field” in this case is the output of the physical effect. Its semantic role in Tuzov’s ontology is still “Object”.

Thus, depending on the context, one semantic role in Tuzov’s ontology can correspond to several elements of the description of a physical effect. This affects how correctly the elements of physical effects are determined by the developed semantic templates.

To remove this kind of semantic ambiguity, a separate program module was developed. It compares the existing descriptions of physical effects with the semantic roles described in the templates based on Tuzov’s ontology. The essence of the approach is to relate the semantic roles of Tuzov’s ontology, the elements of the PE description, and the specific words for which ambiguity can arise.

Based on the “Description of the PE” field of the database table (the descriptions of 1200 physical effects were analyzed) and on Tuzov’s ontology (predicates, cases and prepositions for semantic roles), the following correspondence is established:

<Semantic role> - <Element of the description of PE>.

Based on the statistics of this correspondence, a template for extracting the PE is formed for each predicate.

Thus, a set of words that correspond to different elements of the description of physical effects is defined. For example, for the predicate “impact” considered above, the word “semiconductor” corresponds to the object of the physical effect, and the phrase “magnetic field” corresponds to its output.
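Collecting such statistics can be sketched with simple counters. The observations below are invented for illustration and do not come from the PE database. The tuple layout and the name `build_correspondence` are hypothetical, and the sketch assumes that each observation has already been labelled with its predicate, semantic role and PE element.

```python
from collections import Counter, defaultdict

# Hypothetical labelled observations:
# (predicate, semantic role, argument, element of the PE description)
observations = [
    ("impact", "Object", "semiconductor", "ObjectPE"),
    ("impact", "Object", "magnetic field", "OutputPE"),
    ("impact", "Object", "amperage", "OutputPE"),
    ("impact", "Agent", "magnetic field", "InputPE"),
]


def build_correspondence(obs):
    """Count role -> element links and collect the words seen for each link."""
    role_stats = defaultdict(Counter)  # (predicate, role) -> Counter of elements
    word_lists = defaultdict(set)      # (predicate, role, element) -> words
    for predicate, role, argument, element in obs:
        role_stats[(predicate, role)][element] += 1
        word_lists[(predicate, role, element)].add(argument)
    return role_stats, word_lists


role_stats, word_lists = build_correspondence(observations)

# Most frequent element for the ambiguous role, usable as a default choice.
default_element = role_stats[("impact", "Object")].most_common(1)[0][0]
print(default_element)  # OutputPE
print(word_lists[("impact", "Object", "ObjectPE")])  # {'semiconductor'}
```

The counters give a default element per predicate and role, while the word sets play the part of the word lists used to override that default for specific arguments.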

The algorithm for constructing correspondences between the semantic roles of the Tuzov ontology and the elements of the description of the physical effect for predicates of the subject domain is shown in Fig. 2.

Fig. 2. The algorithm for constructing correspondences between the semantic roles of the Tuzov ontology and the elements of the description of the physical effect for predicates of the subject domain

The general algorithm for extracting physical effects is shown in Fig. 3.

Fig. 3. The algorithm for extracting elements of physical effects

Thus, the algorithm for extracting elements of physical effects reduces to finding all predicates of the subject domain and performing the following steps for each of them:

  • Search for the predicate in the text;

  • Finding the arguments that correspond to the semantic roles according to the template based on Tuzov’s ontology;

  • Mapping of the arguments to elements of the description of the physical effect on the basis of the match templates;

  • Elimination of semantic ambiguity by checking whether the argument is present in the thesauri.
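The last two steps can be sketched as follows, assuming that the predicate and its role arguments have already been found. The contents of `MATCH_TEMPLATES` and `THESAURI` are toy data, and all names as well as the fallback to the first candidate are assumptions for illustration rather than the behaviour of the described system.

```python
from typing import Dict, List, Optional, Set

# Hypothetical match templates: predicate -> role -> candidate PE elements,
# ordered from most to least frequent.
MATCH_TEMPLATES: Dict[str, Dict[str, List[str]]] = {
    "impact": {
        "Agent": ["InputPE"],
        "Object": ["OutputPE", "ObjectPE"],
        "Place": ["ObjectPE", "InputPE"],
    }
}

# Hypothetical thesauri: PE element -> known terms.
THESAURI: Dict[str, Set[str]] = {
    "InputPE": {"magnetic field", "light"},
    "OutputPE": {"magnetic field", "amperage"},
    "ObjectPE": {"semiconductor", "conductive layer"},
}


def resolve_element(predicate: str, role: str, argument: str) -> Optional[str]:
    """Pick the PE element for an argument, using thesauri to disambiguate."""
    candidates = MATCH_TEMPLATES.get(predicate, {}).get(role, [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    for element in candidates:
        if argument in THESAURI.get(element, set()):
            return element
    return candidates[0]  # assumed fallback: the most frequent candidate


def extract_pe(predicate: str, role_arguments: Dict[str, str]) -> Dict[str, str]:
    """Map the role arguments of one predicate to PE description elements."""
    result: Dict[str, str] = {}
    for role, argument in role_arguments.items():
        element = resolve_element(predicate, role, argument)
        if element is not None:
            result.setdefault(element, argument)  # keep the first match
    return result


extracted = extract_pe(
    "impact",
    {"Agent": "magnetic field", "Object": "amperage", "Place": "conductive layer"},
)
print(extracted)
```

With this toy data, the argument “semiconductor” in the role “Object” resolves to the PE object, while “magnetic field” in the same role resolves to the PE output, which mirrors the two excerpts discussed earlier in this section.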

4 Results

The following performance criteria were developed:

  • Accuracy of extraction of elements of PE description;

  • Completeness of extraction of elements of PE description;

  • F-measure.

The accuracy is the ratio of the number of correctly extracted elements of the PE description to the number of elements found in the text:

$$ P \, = \, N_{r} / \, N_{f} , $$
(2)

where $P$ is the accuracy of the PE extraction, $N_r$ is the number of correctly extracted elements of the PE description, and $N_f$ is the number of elements of the PE description found in the text.

The completeness of extraction of the PE description elements is a value, expressed in percent, that relates the number of elements of the PE description found in the text to the total number of elements of the PE description in the text:

$$ R \, = \, N_{f} / \, N, $$
(3)

where $R$ is the completeness of extraction of the elements of the PE description, $N_f$ is the number of elements of the PE description found in the text, and $N$ is the total number of elements of the PE description in the text.

The F-measure is calculated by formulas (4) and (5):

$$ F = \frac{{(\beta^{2} + 1)PR}}{{\beta^{2} P + R}}, $$
(4)
$$ \beta^{2} = \frac{1 - \alpha}{\alpha}, $$
(5)

where $\alpha \in (0, 1]$ is the weight given to accuracy; for $\alpha = 0.5$ (that is, $\beta = 1$), accuracy and completeness are weighted equally.
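Formulas (2)–(5) can be checked with a short calculation. The counts used below are invented purely to illustrate the arithmetic and are not results from the experiments. The function names are hypothetical, and the choice of 0.5 as the default weight in (5) is an assumption, since the text does not state which value was used.

```python
def precision(n_correct: int, n_found: int) -> float:
    """Formula (2): P = N_r / N_f."""
    return n_correct / n_found if n_found else 0.0


def recall(n_found: int, n_total: int) -> float:
    """Formula (3) as given in the text: R = N_f / N."""
    return n_found / n_total if n_total else 0.0


def f_measure(p: float, r: float, alpha: float = 0.5) -> float:
    """Formulas (4) and (5); alpha = 0.5 is an assumed default (beta = 1)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    beta_sq = (1 - alpha) / alpha
    denominator = beta_sq * p + r
    return (beta_sq + 1) * p * r / denominator if denominator else 0.0


# Hypothetical counts: 10 elements in the text, 8 found, 6 of them correct.
p = precision(n_correct=6, n_found=8)
r = recall(n_found=8, n_total=10)
f = f_measure(p, r)
print(p, r, round(f, 3))  # 0.75 0.8 0.774
```

With the weight set to 1, formula (4) reduces to the accuracy alone, which is a quick sanity check of the implementation.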

The method was tested on the database of physical effects developed at the CAD department of Volgograd State Technical University, from which 100 physical effects were selected. The tests were carried out on the descriptions of physical effects in the PE database, and the extraction results were compared with the fields “Input”, “Output” and “Object” of the physical effect.

The tests were also conducted on 31 patent documents for which the physical effects contained in the patent description were known in advance.

The results were compared with the results of the IOFFE program [11], which is based on the “Semantix” semantic analyzer.

The efficiency analysis showed that the developed system improves the extraction of elements of the description of physical effects by 4% in accuracy and by 7% in completeness (Tables 3 and 4).

Table 3. Analysis of efficiency using the PE database
Table 4. Analysis of the effectiveness using the patent array

Sample 1. Patent description: “The photoelectric conversion element may be a photodiode having a p-n junction or a pin junction, a phototransistor, or the like. When the incident light hits the semiconductor junction of the cell, this light leads to the appearance of the photoelectric effect, in which electric charges arise.” [14].

  • PE Input: “Light, any other electromagnetic radiation (energy - eV)”.

  • PE Output: “The electric charge (electron emission), (J)”.

  • PE Object: “Photoconductive material (photoconductor)”.

Elements of the physical effect extracted by the program:

  • PE Input: “Light”.

  • PE Output: “electric charge”.

  • PE Object: “phototransistor”.

Sample 2. Patent description: “In electrical circuits, any electric current produces a magnetic field and hence generates a total magnetic flux acting on the circuit”.

  • PE Input: “Electric current”.

  • PE Object: “Electrical circuit”.

  • PE Output: “Magnetic field”.

Elements of the physical effect extracted by the program:

  • PE Input: “Electric current”.

  • PE Object: “Electrical circuit”.

  • PE Output: “Magnetic field”.

5 Conclusion

The method described in this article increased the efficiency of extracting PE elements. A semantic analyzer based on the Tuzov ontology was created to increase the accuracy and completeness of extraction. The approach was tested on the PE database and on the patent array.

Keywords (predicates: verbs, verbal nouns, participles) that occur in sentences containing elements of the description of physical effects were identified, and the corresponding descriptions were selected in the ontology.

On the basis of Tuzov’s ontology, semantic templates were developed. Each template consists of a keyword (predicate), that is, a verb or verbal form characteristic of the given field, the cases of the semantic roles “Agent”, “Object” and “Place”, and the related prepositions. Templates of correspondence between the semantic roles of Tuzov’s ontology and the elements of the description of physical effects were then constructed. Together with the thesauri of the subject area, this makes it possible to extract elements of the description of physical effects from text documents.

The efficiency analysis consisted of measuring the accuracy and completeness of extracting elements of descriptions of physical effects from the PE description field of the database and from the texts of patent documents. The results were then compared with those of the IOFFE system, which is based on the Semantix semantic analyzer. In this way, the efficiency of the developed software was evaluated.

The efficiency analysis showed that the accuracy of extraction of elements of the description of physical effects increased by 11% for physical effects from the database and by 5% for physical effects from the texts of patent documents. The completeness of extraction increased by 6% for physical effects from the database and by 7% for physical effects from the texts of patent documents.