1 Introduction

Owing to the constant growth of information flows in interdisciplinary technical, economic and social intelligent information systems (IIS), the development of new ways to represent, formalize, systematize, integrate and search information from distributed sources is a relevant task today [1,2,3,4]. One of the main functions of a modern IIS is the semantic search of problem-oriented knowledge elements whose representation is distributed and therefore heterogeneous.

In this paper the term ‘semantic search’ denotes information search that compares information objects and estimates their similarity at the semantic level, i.e. with the use of knowledge. Existing mechanisms of semantic search [5,6,7] are based on methods and approaches for the ontological conceptualization of a knowledge subject area.

This paper deals with a method of search in knowledge bases, where metadata are formed on the basis of the corresponding subject area ontologies represented as a semantic net. Knowledge relevance is estimated by closeness according to an evaluation metric of similarity between the concepts included in the semantic meta-descriptions of ontology elements.

To calculate measures of semantic closeness and coherence between problem-oriented knowledge elements, the authors propose a combined model of semantic similarity estimation that involves a set of interpreted taxonomical and associative dependences between the meta-descriptions of knowledge elements represented in the ontology [8,9,10].

The algorithm of semantic similarity estimation is based on evolutionary procedures and genetic operators of optimum search, which allows us to exclude uninformative or insignificant descriptions of knowledge elements and to control the speed of learning by assigning a similarity threshold value [11,12,13].

2 Problem Statement and Subject Area Analysis

The absence of a ‘gold standard’ for semantic similarity measures is a well-known problem. Many researchers focus on developing methods of semantic similarity estimation and comparison for a wide range of information search problems [2, 7,8,9,10].

In this paper the following factors are proposed to define the selection of semantic similarity measures applied to the formal representation (profile) of the user’s query:

  • the selection of the criteria composing the similarity measure: taxonomical relations between concepts, i.e. characteristics of ontological structures (path length, hierarchy depth, etc.), and associative (horizontal) relations defining an asymmetrical semantic similarity measure;

  • the selection of the degree of importance of the criteria, i.e. the importance coefficients in the hybrid measure of computational semantics [14].

In [2, 10] the authors used an ontology to propose a method for creating a special type of metadata: meta-descriptions, which include sets of simple statements of the form ‘subject (s)–predicate (p)–object (o)’. These statements are referred to as triplets (t) and represent the main semantics of the described knowledge elements. Such meta-descriptions are important sources of information for search, and using them can significantly improve the functionality of search mechanisms. A similarity estimation is called a semantic similarity estimation if and only if it is determined on the basis of meta-descriptions and query semantics [9].

Thus, to determine the degree of semantic similarity between knowledge elements, it is proposed to introduce a measure of distance between their meta-descriptions. The measure combines several measures of distance between two vertices (concepts or attributes) along the shortest weighted path between ontology graph vertices [15,16,17].

It is assumed that all concepts to be compared are located in a united ontology and, thus, in a united taxonomy [15]. If the ontologies are separate, they should be united before the analysis [3].

In the problem statement the authors use the following description of ontology components: the ontology O is the sign system O = <C, E, R, T>, where C denotes a set of concepts (knowledge elements); E denotes a set of concept examples; T denotes a set of predicates (relation types); R denotes a set of relations that assign the following relation types between entities: taxonomical, attributive, quantitative, logical, etc.

Let us introduce the following rules and constraints:

  (1) On the basis of the subject area ontology O, a semantic meta-description \( m(c_i) = \{t_1, t_2, \ldots, t_{n(i)}\} \) is created for each knowledge element of \( C = \{c_i\} \), where \( n(i) \) is the number of triplets in the logical representation of the concept \( c_i \); \( t_i \) denotes RDF triplets in the form of tuples \( \langle s_i, p_i, o_i \rangle \), where \( s_i \) and \( o_i \) are included in the union of \( C_i \) and \( E_i \), and \( p_i \) is included in R.

  (2) Each query q created by the user from the set of queries Q consists of a set of triplets \( q = \{t_1, t_2, \ldots, t_{n(q)}\} \), where \( n(q) \) denotes the number of triplets included in the query q.
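The two rules above can be illustrated with a minimal Python sketch. It assumes that a triplet is a plain (subject, predicate, object) tuple and that both a meta-description and a query are sets of such triplets. The names `Triple`, `MetaDescription` and `make_meta`, and the example data, are hypothetical and do not come from the paper.

```python
from typing import FrozenSet, NamedTuple


class Triple(NamedTuple):
    """One 'subject-predicate-object' statement t = <s, p, o>."""
    subject: str
    predicate: str
    obj: str


# A meta-description m(c_i) and a query q are both modeled as sets of triplets.
MetaDescription = FrozenSet[Triple]


def make_meta(*triples: Triple) -> MetaDescription:
    """Build an immutable set of triplets."""
    return frozenset(triples)


# Hypothetical example data, not taken from the paper.
m_laptop = make_meta(
    Triple("Laptop", "is-a", "Computer"),
    Triple("Laptop", "hasPart", "Battery"),
)
query = make_meta(Triple("Laptop", "is-a", "Computer"))

# Triplets that the query shares with the meta-description.
shared = m_laptop & query
```

Modeling both sides as sets keeps n(i) and n(q) available as `len(...)` and makes exact triplet overlap a single set operation.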

The problem is to find a weight function w that determines the importance of any triplet \( t \in T \) (where T denotes the set of possible triplets) when describing a knowledge element \( c_i \) or a query q: \( 0 \le w(t, c_i) \le 1 \), where \( t \in T \), \( c_i \in C \), and \( 0 \le w(t, q) \le 1 \), where \( t \in T \), \( q \in Q \).

For each query q it is required to determine a subset RES of the set of knowledge elements C, the result set, which includes the concepts relevant to q. A concept \( c_i \) is considered relevant to the query q if and only if the semantic similarity estimation between them exceeds a certain threshold value of semantic closeness. To estimate the similarity between knowledge elements and the query, the authors propose to use their semantic meta-descriptions [10].
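The thresholding rule for the result set RES can be sketched as follows. The similarity function is passed in as a parameter; the Jaccard overlap used in the example is only a stand-in chosen for brevity, not the measure proposed in the paper. The names `result_set` and `jaccard`, the sample data and the threshold 0.3 are assumptions.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

TripleT = Tuple[str, str, str]
Meta = FrozenSet[TripleT]


def jaccard(a: Meta, b: Meta) -> float:
    """Stand-in similarity: share of common triplets (NOT the paper's measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def result_set(
    query: Meta,
    meta: Dict[str, Meta],
    sim: Callable[[Meta, Meta], float],
    threshold: float,
) -> List[str]:
    """Return concepts whose similarity to the query exceeds the threshold."""
    return [c for c, m in meta.items() if sim(query, m) > threshold]


# Hypothetical meta-descriptions keyed by concept name.
meta = {
    "c1": frozenset({("Laptop", "is-a", "Computer"), ("Laptop", "hasPart", "Battery")}),
    "c2": frozenset({("Tree", "is-a", "Plant")}),
}
q = frozenset({("Laptop", "is-a", "Computer")})
res = result_set(q, meta, jaccard, 0.3)
```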

3 The Combined Method of Semantic Similarity Estimation

A key step in solving the semantic search problem is the development of quantitative estimations of semantic similarity. Existing methods of computational semantics can be subdivided into several categories: measures based on hierarchical structures, i.e. methods of conceptual taxonomical closeness estimation that use different metrics to find the length of the shortest path between vertices of the subject area ontology graph [2, 16,17,18]; measures using non-hierarchical relations, i.e. methods of relational closeness estimation [5,6,7]; and measures using attribute values [8,9,10].

The main problem of most measures based on ontological structures is symmetry. Expert analysis shows that a similarity measure is not always symmetrical for either hierarchical or attributive relations [5, 6, 8,9,10]. A related problem is the estimation of semantic similarity between ontological elements that are not related hierarchically but have a concrete problem-specific (horizontal or associative) relation.

Thus, the most promising measures today are hybrid measures, which combine several methods that consider ontology structures and relation semantics. This allows us to calculate semantic similarity estimations between ontology elements (concepts, examples, and relations, i.e. predicates). These estimations are referred to as elementary estimations, and similarities between triplets are determined on their basis [2, 5]. The similarity estimations between triplets are then used to determine the similarity between meta-descriptions.

To determine the semantic similarity between the triplets of the query meta-descriptions \( M_q \) and the triplets of the concept set \( M_c \), let us introduce metrics of distance between ontology nodes based on the taxonomy and concept characteristics, and metrics of the density and information value of thematically related concepts. Then the modified similarity measure can be represented as follows:

$$ SIM\left( {M_{q} ,M_{c} } \right) = \sum\nolimits_{i = 1}^{n} {w(t, q)_{i} Sim^{i} (c_{1} ,c_{i} ),} $$
(1)

where \( Sim^{i} \) is a similarity measure based on a certain criterion, the weight \( w(t, q)_{i} \) determines the relative importance of the corresponding criterion for the query triplets, the weights sum to 1, and n denotes the number of criteria.
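As an illustration of formula (1), the sketch below combines per-criterion similarity values into a single score and checks that the weights sum to 1. The function name `combined_similarity` and the numeric values are assumptions chosen for the example.

```python
import math
from typing import Sequence


def combined_similarity(weights: Sequence[float], scores: Sequence[float]) -> float:
    """Weighted sum of per-criterion similarities, as in formula (1)."""
    if len(weights) != len(scores):
        raise ValueError("one weight is required per criterion")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical weights and per-criterion similarities for n = 3 criteria.
total = combined_similarity([0.5, 0.3, 0.2], [0.8, 0.5, 1.0])
```

With these values the terms are 0.4, 0.15 and 0.2, so the combined score is about 0.75.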

To calculate \( Sim^{i} \), let us introduce a modification of the asymmetrical similarity measure [5] that considers all types of semantic relations R appropriate for estimating the similarity of triplet components. In the suggested modification, graph edges are assigned weight coefficients that depend on the direction of traversal. This is based on the assumption that a child is more similar to its parent than the parent is to the child.

  1. For the relation ‘parent-child’ (is-a) two coefficients g and s are assigned, which represent similarity in the direction of generalization and of detailing, respectively.

  2. For the relation instanceOf (which connects concepts and concept examples) two parameters \( \delta ,\,\gamma \in [0,1] \) are assigned, which represent similarity between the example and the concept and between the concept and the example, respectively.

  3. The similarity coefficients assigned to the relations sameAs (synonyms) and invertOf (antonyms) equal 1 and −1 respectively.

  4. For other semantic relations \( r_i \) we assign the weight coefficient \( \omega \), which represents semantic similarity in accordance with these relations.

Let us consider \( D = \left\{ {c_{1} , \ldots , c_{n} } \right\} \) as the path between entities \( c_{1} \) and \( c_{n} \) (which can be concepts, examples or predicates). The path D has the following characteristics:

  1. s(D) is the number of edges in the detailing direction;

  2. g(D) is the number of edges in the generalization direction;

  3. ic(D) is the number of edges from the example to the concept;

  4. ci(D) is the number of edges from the concept to the example;

  5. inv(D) is the number of inverse relation edges;

  6. oth(D) is the number of other relation edges.

The estimation of similarity between entities \( c_{1} \) and \( c_{2} \) in terms of criterion i is determined over the paths between them by the following formula:

$$ Sim^{i} \left( c_{1} ,c_{2} \right) = \max_{j = 1, \ldots ,m} \left| ( - 1)^{inv(d_{j})} \, s^{s(d_{j})} \cdot g^{g(d_{j})} \cdot \delta^{ic(d_{j})} \cdot \gamma^{ci(d_{j})} \cdot \omega^{oth(d_{j})} \right|, $$
(2)

where \( d_{1}, \ldots, d_{m} \) denote the paths between vertices \( c_{1} \) and \( c_{2} \).
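Formula (2) can be sketched in Python under the assumption that each path is already summarized by its six edge counts. The classes `PathCounts` and `Coefficients`, the coefficient values, and the choice to return 0.0 when no path exists are illustrative assumptions, not part of the paper. Edges of type sameAs have the coefficient 1 and therefore do not change the product.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PathCounts:
    """Edge counts of one path d_j: s, g, ic, ci, inv, oth."""
    s: int = 0
    g: int = 0
    ic: int = 0
    ci: int = 0
    inv: int = 0
    oth: int = 0


@dataclass(frozen=True)
class Coefficients:
    """Direction-dependent edge weights from the list of relation types."""
    s: float
    g: float
    delta: float
    gamma: float
    omega: float


def path_score(d: PathCounts, k: Coefficients) -> float:
    """Absolute value of the signed product in formula (2) for one path."""
    sign = -1 if d.inv % 2 else 1
    value = (sign * k.s ** d.s * k.g ** d.g * k.delta ** d.ic
             * k.gamma ** d.ci * k.omega ** d.oth)
    return abs(value)  # the absolute value discards the sign of inverse edges


def sim(paths: Iterable[PathCounts], k: Coefficients) -> float:
    """Maximum score over all paths; 0.0 if there is no path (assumption)."""
    return max((path_score(d, k) for d in paths), default=0.0)


# Hypothetical coefficients; g > s encodes "a child is closer to its parent".
k = Coefficients(s=0.5, g=0.8, delta=0.9, gamma=0.6, omega=0.4)
best = sim([PathCounts(g=2), PathCounts(s=1, oth=1)], k)
```

Here the first path scores 0.8² ≈ 0.64 and the second 0.5 · 0.4 = 0.2, so the maximum is taken from the path of two generalization edges.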

To determine the density and information value of thematically related elements and their meta-descriptions, let us define the concept weight on the basis of the degree of occurrence. The weight of a query concept is assumed to depend on the number of related concepts in the meta-description \( m(c_i) \), which is represented by the triplets \( m(c_i) = \{t_1, t_2, \ldots, t_{n(i)}\} \), where \( n(i) \) is the number of triplets in the logical representation of the concept \( c_i \) [10].

$$ w\left( t, q \right) = 1 + \ln \left( \varphi_{t,c_{i}} \left( 1 + \sum\nolimits_{c_{i} \in C} \varphi_{t,c_{i}} \, SIM\left( c_{1} ,c_{i} \right) \right) \right), $$
(3)

where \( \varphi_{t,c_{i}} \) is the coefficient of the degree of occurrence of the triplet t of the query q in the meta-description \( m(c_i) \) (the coefficient is assigned in the algorithm), and \( SIM\left( {c_{1} ,c_{i} } \right) \) is the measure of semantic similarity between the meta-descriptions of concept vertices in C.
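A numeric reading of formula (3) is sketched below. Since the formula uses the same index for the outer coefficient and for the summation, the sketch takes the outer coefficient as a separate argument `phi_outer`; this interpretation, the function name `triple_weight` and the sample values are assumptions. The logarithm requires a positive argument, which the sketch checks explicitly.

```python
import math
from typing import Sequence


def triple_weight(
    phi_outer: float,
    phis: Sequence[float],
    sims: Sequence[float],
) -> float:
    """1 + ln(phi_outer * (1 + sum(phi_i * SIM_i))), following formula (3)."""
    if len(phis) != len(sims):
        raise ValueError("one occurrence coefficient is required per similarity")
    inner = 1.0 + sum(p * s for p, s in zip(phis, sims))
    argument = phi_outer * inner
    if argument <= 0:
        raise ValueError("the logarithm argument must be positive")
    return 1.0 + math.log(argument)


# With no related concepts and phi_outer = 1 the weight is 1 + ln(1) = 1.
base = triple_weight(1.0, [], [])

# Hypothetical occurrence coefficients and similarities for two concepts.
w = triple_weight(1.0, [1.0, 0.5], [0.64, 0.2])
```

The raw value can exceed 1, so a normalization step would be needed before using such weights where they are required to lie in [0, 1] or to sum to 1; the paper does not spell this step out.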

4 Genetic Algorithm of Semantic Similarity Estimation

To improve the effectiveness of semantic similarity estimation and to determine semantically prioritized knowledge objects for representation in the search model, the authors propose to use a genetic algorithm (GA), which allows us to find suboptimal solutions in polynomial time [19, 20]. A GA is a heuristic search algorithm used to solve optimization and modeling problems by means of random selection, combination and variation of the searched parameters, using mechanisms analogous to natural selection [20]. The generalized structure of genetic search is shown in Fig. 1.

Fig. 1. The generalized architecture of genetic search

To determine the optimal coefficient values, the authors defined the GA objective function as the maximization of the similarity estimation:

$$ F = \max \left( SIM\left( M_{q} ,M_{c} \right) \right). $$
(4)

The GA that calculates the model parameter values for semantic similarity estimation is shown in Fig. 2.

Fig. 2. The genetic algorithm of similarity estimation

The first step of the GA is to generate the initial parameters of the estimation model (population size and chromosome length) and to input the values of the weight coefficients and the probabilities of the crossover and mutation operators. Then the initial population is formed on the basis of the available learning data from the set \( C = \{c_i\} \), for which the semantic meta-descriptions \( m(c_i) = \{t_1, t_2, \ldots, t_{n(i)}\} \) were created. Each chromosome element (gene) is a triplet in the logical representation of the concept \( c_i \). To estimate the fitness of each chromosome, the authors propose to calculate the value of the objective function (4).

Chromosome selection is carried out deterministically, using an elitist strategy with partial substitution of the least fit chromosomes by the fittest ones so that the population size is preserved [19]. To generate a new set of individuals, crossover and mutation operators are applied to each pair of selected parent chromosomes with pre-assigned probabilities. Crossover is carried out at random with the probability Pc. The crossing point is chosen at random within the assigned interval.

The mutation procedure is applied to the child population obtained by crossover and changes a gene value to a number randomly selected from the interval [0, 1] with the probability Pm.
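The crossover and mutation operators described above can be sketched as follows. The sketch assumes that each gene carries a numeric value in [0, 1] (for example, a triplet weight), which is one possible reading of the mutation rule; single-point crossover and the function names `crossover` and `mutate` are also assumptions.

```python
import random
from typing import List, Tuple

Chromosome = List[float]


def crossover(
    p1: Chromosome, p2: Chromosome, pc: float, rng: random.Random
) -> Tuple[Chromosome, Chromosome]:
    """Single-point crossover applied with probability pc."""
    if len(p1) != len(p2):
        raise ValueError("parents must have the same length")
    if len(p1) < 2 or rng.random() >= pc:
        return list(p1), list(p2)  # no crossover: children copy the parents
    point = rng.randint(1, len(p1) - 1)  # crossing point inside the chromosome
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


def mutate(chromosome: Chromosome, pm: float, rng: random.Random) -> Chromosome:
    """Replace each gene by a random number from [0, 1) with probability pm."""
    return [rng.random() if rng.random() < pm else gene for gene in chromosome]


rng = random.Random(0)  # fixed seed so the example is repeatable
parent_a = [0.1, 0.2, 0.3, 0.4]
parent_b = [0.9, 0.8, 0.7, 0.6]
child_a, child_b = crossover(parent_a, parent_b, pc=1.0, rng=rng)
mutated = mutate(child_a, pm=0.25, rng=rng)
```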

The most promising solutions are selected on the basis of the probabilities Pv, calculated for each individual of the population by proportional selection [11,12,13]. After the fitness of each chromosome has been calculated using formula (4) and the best chromosome has been selected, it is necessary to decide whether to continue the evolutionary procedure and create the next generation or to end the learning procedure. The higher the objective function value, the higher the chromosome fitness. The GA stops under one of the following conditions:

  (1) if the function F has reached the expected value;

  (2) if the assigned number of iterations (generations) has not improved the already obtained value of F;

  (3) if the time allotted for the problem solution is up.
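The three stopping conditions can be combined into one predicate, sketched below. It assumes that the best value of F per generation is recorded in a list and that the time limit is given as a deadline on a monotonic clock; the function name `should_stop` and its parameters are hypothetical.

```python
import time
from typing import Optional, Sequence


def should_stop(
    best_history: Sequence[float],
    target: float,
    patience: int,
    deadline: float,
    now: Optional[float] = None,
) -> bool:
    """Check the three stopping conditions of the GA.

    best_history holds the best F of each generation so far.
    """
    if patience < 1:
        raise ValueError("patience must be at least 1")
    current_time = time.monotonic() if now is None else now
    out_of_time = current_time >= deadline              # condition (3)
    if not best_history:
        return out_of_time
    reached_target = best_history[-1] >= target         # condition (1)
    stalled = (                                          # condition (2)
        len(best_history) > patience
        and max(best_history[-patience:]) <= max(best_history[:-patience])
    )
    return reached_target or stalled or out_of_time


# Hypothetical run: three generations without improvement trigger a stop.
stop = should_stop([0.5, 0.5, 0.5, 0.5], target=1.0, patience=3, deadline=10.0, now=0.0)
```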

The GA can stop prematurely in the case of population degeneration, i.e. a reduction of chromosome diversity. The extreme form of degeneration is the condition in which all individuals have identical chromosomes [11].
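One simple way to monitor degeneration is to track the share of distinct chromosomes in the population. The diversity measure below and the names `diversity` and `is_degenerate` are assumptions made for illustration; the paper does not specify how diversity is measured.

```python
from typing import List, Sequence


def diversity(population: Sequence[Sequence[float]]) -> float:
    """Share of distinct chromosomes in the population (0.0 if it is empty)."""
    if not population:
        return 0.0
    distinct = {tuple(chromosome) for chromosome in population}
    return len(distinct) / len(population)


def is_degenerate(population: Sequence[Sequence[float]]) -> bool:
    """True when the population holds fewer than two distinct chromosomes."""
    return len({tuple(chromosome) for chromosome in population}) <= 1


# Hypothetical populations.
identical: List[List[float]] = [[0.1, 0.2], [0.1, 0.2]]
mixed: List[List[float]] = [[0.1, 0.2], [0.3, 0.4]]
```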

As a result of the process of artificial evolution, which includes selection, crossover and mutation, the quality of the solutions in the population gradually improves.
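The proportional selection step used in this evolutionary loop can be sketched with the standard library alone. The selection probability of an individual is taken as its fitness divided by the total fitness; the fallback to uniform probabilities when the total is not positive, and the function names, are assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def selection_probabilities(fitness: Sequence[float]) -> List[float]:
    """Pv for each individual: fitness divided by total fitness."""
    if not fitness:
        return []
    total = sum(fitness)
    if total <= 0:
        # Assumption: fall back to uniform selection when no fitness is positive.
        return [1.0 / len(fitness)] * len(fitness)
    return [f / total for f in fitness]


def proportional_select(
    population: Sequence[T], fitness: Sequence[float], k: int, rng: random.Random
) -> List[T]:
    """Draw k individuals with replacement, proportionally to fitness."""
    probabilities = selection_probabilities(fitness)
    return rng.choices(population, weights=probabilities, k=k)


probs = selection_probabilities([1.0, 3.0])
picked = proportional_select(["a", "b"], [0.0, 1.0], 5, random.Random(1))
```

An individual with zero fitness is never drawn in this scheme, which is one way the procedure can discard uninformative descriptions.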

5 Experimental Research

Computational experiments were carried out to study the effectiveness of the developed GA. To evaluate the developed algorithm, the authors made a comparative analysis with an algorithm based on the TF-IDF similarity measure. In the context of this measure each triplet is considered as an individual concept. To estimate the similarity, the method uses the cosine measure [5, 18] and the MKNN (Mutual KNN) algorithm, which estimates the similarity between nearest neighbors (concept triplets and relations) [16].
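The TF-IDF baseline with the cosine measure can be sketched as follows, treating each triplet as an individual term. The smoothed IDF variant used here is an assumption, since the paper does not state which TF-IDF weighting was applied; the function names and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence

Vector = Dict[Hashable, float]


def tfidf_vectors(docs: Sequence[Sequence[Hashable]]) -> List[Vector]:
    """TF-IDF vector per meta-description; each triplet is one term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors: List[Vector] = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (assumption): ln((1 + n) / (1 + df)) + 1.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical meta-descriptions as lists of triplets.
t1 = ("Laptop", "is-a", "Computer")
t2 = ("Laptop", "hasPart", "Battery")
t3 = ("Tree", "is-a", "Plant")
v0, v1, v2 = tfidf_vectors([[t1, t2], [t1], [t3]])
```

Meta-descriptions that share no triplet get a cosine of 0 under this baseline, because exact triplet matching ignores the ontology relations between their components.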

To study the effectiveness of the developed algorithm experimentally, the authors designed software that implements the iterative procedure of setting the parameters of the semantic similarity estimation method. The experimental results allowed us to determine the dependence of the algorithm execution time on the input parameters of the similarity model: n is the number of chromosomes in the population; a chromosome is represented as \( m(c_i) = \{t_1, t_2, \ldots, t_{n(i)}\} \), where \( n(i) \) is the number of triplets in the logical representation of the concept \( c_i \).

The dependences of the execution time of the developed algorithm and of the TF-IDF and MKNN algorithms on the amount of input data of the similarity estimation model are shown in Fig. 3. The time complexity of the developed algorithm is \( O(n^{2}) \).

Fig. 3. The dependences of algorithms time on number of input parameters

The authors carried out a set of experiments on the completeness and accuracy of relevant concept extraction performed by the MKNN algorithm and the genetic algorithm, using the two previously described similarity metrics and different numbers of ontology elements, i.e. triplets \( C_k \) represented by their meta-descriptions. The results show an almost linear growth of the number of obtained concepts with the parameter k for both algorithms (Fig. 4).

Fig. 4. Dependences of completeness and accuracy of concepts extraction

The GA extracts more relevant concepts in terms of completeness and accuracy with respect to the query. The reason is that the MKNN algorithm excludes pairs of elements that are not nearest neighbors [11].

During the GA search process, rules for comparing the triplets of ontology elements with the query triplets are used. The accuracy depends on the quality of the solutions obtained by the GA after each iteration, which weight the criteria that define the similarity of knowledge elements in the ontology.

The proposed combined method is an original mechanism of semantic search, which uses the GA to estimate the similarity between ontology elements on the basis of the semantic metadata that describe the user’s query.

6 Conclusion

To determine measures of semantic closeness and coherence of problem-oriented knowledge elements, the authors developed a combined model of semantic similarity estimation, which uses a set of interpreted taxonomical and associative dependences between meta-descriptions represented in ontologies. The algorithm of semantic similarity estimation is based on evolutionary procedures and genetic operators of optimum search, which allows us to exclude uninformative and insignificant descriptions of knowledge elements and to control the speed of learning by assigning a similarity threshold value.