Keywords

1 Introduction

Today, ontologies play an important role in many domains related to the semantic Web [1], information retrieval [2], knowledge engineering [3] and knowledge management [4]. Therefore, several researches and studies have been developed or are being done to cover this fertile area. These researches can be used in different approaches such as concepts creation, ontology design [5], classification [6], or segmentation [7]. The latter is useful for the processing of large ontologies, which is difficult to maintain, namely the addition, modification or deletion of large ontology parts.

Our work will focus on the measuring of the semantic similarity between concepts of ontology. This one is an important concept used in different areas of research. Jeffrey Hau, William Lee and John Darlington [8] use the semantic similarity to define compatibility between semantic web services [9, 10] annotated by OWL ontologies [11]. In [12] the authors present a method based on multiple information resources (lexical taxonomy, corpus…) to measure the semantic similarity between words. The similarity is also used in the correspondence between the shapes for example, the authors in [13] compute the similarity between outlines of 2D shapes by using a technique based on the extracting of the shapes contours which are represented by a set of points, then the authors describe each segment of this contours by a local and global features, these ones will be coded in string of symbols and stored into XML files On which the similarity calculation will be executed.

Also, several techniques are proposed to compute the semantic similarity between ontologies [14, 15]. Where, the authors, in [15] propose a new method to compute the semantic similarity which is based on three steps. In the first the authors compute the semantic similarity of nodes, and then they compute the semantic similarity of relations between these nodes, at last they combine these two results to form a unified value of semantic similarity.

There are two families of approaches to compute the semantic similarity between concepts:

  1. 1.

    A family based on computing the geometric distance between concepts to define their semantic similarity, where the less distance gives the more similarity [16].

  2. 2.

    A family based on degree of information sharing, more common information between two concepts means more similarity [8].

The principal idea of our method is defining the shortest path between any node of a graph (in the current paper the term graph is used to describe ontology) and the root node. Then, we base on these shortest paths and our formula for computing the rate of semantic similarity between the concepts of this graph.

This paper is organized as follows: in Sect. 2 we describe our method. The next section presents an experimental comparison with some other methods of similarity measuring, followed by a discussion of the changes made to the methodology. Finally, the Sect. 4 presents our conclusion.

2 Proposed Method

Our method is designed to compute the semantic similarity between two concepts that exist in the same hierarchy of ontology, where all their nodes are connected by “is-a” relations type. This method is summarized in the algorithm shown below in Fig. 1.

Fig. 1.
figure 1

A graphical representation of the proposed algorithm for similarity measuring of ontology concepts.

The first step in our method is reserved for weight allocation to the arcs which represent the relations between the nodes of the studied ontology (Sect. 2.1). Then, we calculate the shortest path from the node, which we want to measure its similarity to the root node (Sect. 2.2). Afterward, Sect. 2.3 is devoted to compute the semantic distance between the two concepts which we need to measure the similarity between them. Then, we use this semantic distance to define the rate of similarity between these two concepts (Sect. 2.4). Finally, in (Sect. 2.5) we present our global algorithm which summarizes our proposed method.

2.1 Weight Allocation

In the similarity measuring literature, several methods of weight allocation exist, we distinguish between those that affect the value of weight to the nodes like the methods proposed in [17, 18] and the others that allocate the value of weight to the relations (arcs) between nodes [16]. Our method is designed to compute the semantic similarity rate between two concepts of ontology whose all their nodes are connected by the relations of “is-a” type (inheritance type). Therefore, in this paper, we have adjusted one from the second type of weight allocation methods which allows allocating the weight W (m, n) computed using the formula 1 to all the arcs in the ontology.

$$ W({\text{m,}}\;{\text{n}}) = \left[ {\hbox{max} ({\text{depth}}({\text{m}})) + \frac{N^\circ (n)}{NTNodes(G) + 1} + 1} \right]^{ - 1} $$
(1)

Where, m and n represent two nodes directly connected, max (depth (m)) represents the maximum depth of the node m (the depth of the root node is equal to 0, 1 for the nodes directly connected to the root node and so on), NTNodes and N° (n) represent successively the total Number of nodes in graph G and the order number of the node n between their siblings. And this later (N° (n)) is an integer greater or equal to 0.

We consider the following ontology:

Fig. 2.
figure 2

An example of ontology.

The weights of arcs mentioned in the Fig. 2 are calculated by our formula (formula 1) of weight allocation. In this formula, we have taked max (depth (m)) for ensuring that the weight of the current arc is always less than the weights of their previous arcs. Therefore, the semantic similarity between two concepts more specific is greater than two concepts more generalist.

2.2 Shortest Path Defining

The shortest path is a famous problem in the science research domain. Therefore, several algorithms are proposed such as Dijkstra’s algorithm [19], Bellman-Ford algorithm [20], Floyd–Warshall algorithm [21], but each one of these is designed to define the shortest path under some criteria. In this paper, with our criteria which are: (1) the weight is strictly greater than zero. (2) The relations between the nodes are “is-a” type.

The best solution adapted to our case is Dijkstra’s algorithm with the use of the formulas (2) and (3). For this, we have adapted this one with some modifications that allow stopping the algorithm once the root node is visited, which is beneficial to us either at the performance or at the execution time. From our second criterion (The relations between the nodes are “is-a” type), we can deduce that all the arcs of the graph will be oriented to the root node. At the level of calculation of the shortest path, from a given node to the root node, only the nodes that can be considered a generalization of the current node will be visited and not all the nodes of the graph.

$$ W_{0} [m,n] = \left\{ \begin{aligned} 0\;if\;m = n \hfill \\ \infty \;m \ne n \hfill \\ \end{aligned} \right\} $$
(2)
$$ \begin{aligned} & For\;all\;0 \le k \le S - 1 \\ & W_{k + 1} [m,n] = { \hbox{min} }\left\{ {\begin{array}{*{20}l} {W_{k} [m,n]} \hfill \\ {W_{k + 1} [m,\text{x}] + G[\text{x},n]} \hfill \\ \end{array} } \right\} \\ \end{aligned} $$
(3)

m, n and x represent three nodes, S represents the set of all nodes of graph. The nodes x and n are directly connected and Wk [m, n] represents the weight of arc (path) [m, n] at iteration k,

In our adapted algorithm, for defining the shortest path, we use the formula 2 for initializing the nodes weight, where we give the value zero to the start node and infinity to all other nodes in the graph, and we rely on formula 3 to compute the weight of the shortest path which exists between the start node and the root node.

Our adapted algorithm is designed as follows (Fig. 3):

Fig. 3.
figure 3

Our adapted algorithm for shortest path defining.

This algorithm will define the shortest path between a given node in the ontology and the root node. In contrast to the Dijkstra’s algorithm, this algorithm will stop the execution once the root node is visited.

In addition, our algorithm allows not only defining the shortest path, but also defining the value of its weight. And we can’t define the shortest path without defining its weight.

For example, we consider the ontology presented in the Fig. 2. To define the shortest path for the node H. There are three paths from H to A (the root node): (H, E, FC, A), (H, G, F, D, FC, A) and (H, G, C, FC, A) (Table 1).

Table 1. An example to define the weight of shortest path.

We can easily observe that the shortest path between the node H and the root node is equal to 1.768.

From the Table 2, we can observe that:

Table 2. An example to define the shortest path.
figure a

Where, SPath (H, E, FC, A) represents the shortest path between the nodes H and A defined by our adapted algorithm.

2.3 Semantic Distance Computing

At this level, to compute the semantic distance between two concepts we use a new technique based on the shortest path calculated in the previous phase. This technique allows eliminating unnecessary parts in the shortest paths and only keeps the necessary parts to calculate the semantic distance between two concepts.

This technique is designed as follows:

We consider the colored segment in the figure below which represents a segment of the ontology mentioned in Fig. 2.

Fig. 4.
figure 4

A segment of the ontology mentioned previously.

Where, SPath1 and SParh2 represent the shortest paths between the nodes (H and B) and the root node. FC represents the first common node between these two shortest paths, if we come from the nodes in question towards the root node. And CSPath represents common sub shortest path between the FC node and the root node.

$$ {\text{SDis }}\left( {{\text{C1}},{\text{ C2}}} \right) = {\text{W }}\left[ {\text{SPath1}} \right] + {\text{W }}\left[ {\text{SPath2}} \right]{-}2*{\text{W }}\left[ {\text{CSPath}} \right] $$
(4)

Where, C1 and C2 represent two concepts (in our Fig. 4 C1 represents the concept (B, FC, A) and C2 represents the concept (H, E, FC, A)) and

$$ W \, \left[ {SPath_{i} } \right] = \sum\limits_{j = 1}^{k} {W_{j} [m,n]} $$
(5)

m and n represent two nodes directly connected in SPathi and k represents the set of arcs in SPathi.

The formula 4 allows calculating the semantic distance between any two concepts exist in the same graph.

2.4 Semantic Similarity Computing

There is an inverse relation between semantic similarity and semantic distance. Increasing of one among them decrease the other. We can categorize the output of the semantic similarity function in three categories:

  1. 1.

    The two concepts are the same.

  2. 2.

    Nothing in common between them.

  3. 3.

    There is a rate of semantic similarity between them.

Therefore, the function of semantic similarity should verify these three conditions:

  1. 1.

    \( \forall \;(\text{C}1,\text{C}2) \in G:0 \le SSim(\text{C}1,\text{C}2) \le 1 \)

  2. 2.

    \( \forall \;\text{C}1 \in G:SSim(\text{C}1,\text{C}1) = 1 \)

  3. 3.

    \( \begin{aligned} & \forall \;(\text{C}1,\text{C}2,C3) \in G:if\;SDis(\text{C}1,\text{C}2) > \text{SDis}(\text{C}1,\text{C}3)then \\ & \text{SSim}(\text{C}1,\text{C}2) < \text{SSim}(\text{C}1,\text{C}3) \\ \end{aligned} \)

Where, SSim represents the semantic similarity, SDis represents the semantic distance and (C1, C2 and C3) represent three concepts of graph G. Ci represents the set of nodes constituting the shortest path between a given node and the root node.

In this paper, we have used the function of semantic similarity computing proposed in [16].

$$ SSim(\text{C}1,C2) = \frac{1}{{\deg^{*} SDis(\text{C}1,\text{C}2) + 1}} $$
(6)

C1 and C2 represent two concepts and the parameter “deg” represents the impact degree of Semantic distance on semantic similarity, and it should be between 0 < deg ≤ 1 (the concrete value of “deg” is defined in the experience).

2.5 Global Algorithm

This section is devoted to our global algorithm that includes all the previous algorithms and techniques (Fig. 5).

Fig. 5.
figure 5

Our global algorithm.

3 Experiments

In this section, by consulting the WordNet, we have got a fragment of ontology hierarchy shown in Fig. 6, concerning the terms: Vehicle, truck, car, family car, sport car, luxury car and bus which we want to compare the similarity between them.

Fig. 6.
figure 6

A fragment of ontology hierarchy.

After computation of semantic similarity between these concepts by using our method (where, we have fixed the parameters deg to 0,4) and two others, we have obtained the results presented in the following Tables 3, 4, and 5.

Table 3. Application of the first method [22].
Table 4. Application of the second method [16].
Table 5. Application of our method.

In this experimental comparison, each node presents in the following tables such as Truck, Car and so on, represents a concept which is constituted from the set of all nodes of the shortest path which exists between this node and the root node. For example the node “SportCar” presents in these tables represents the concept (SportCar, Car, Vehicle).

By analyzing of the obtained results of semantic similarity, we observe that the first method [22] can only find the total similarity (otherwise, it can only discover the similarity between the same concepts). The second method [16] can find the semantic similarity between concepts but with a low score. Finally, our method can also calculate the semantic similarity between concepts but with a high score.

The second method [16] allows affecting the same value of the semantic similarity between the sibling concepts and their parent concept (for example, the semantic similarity between the concepts (Vehicle) and (Vehicle, Truck) is equal to 0.71 and the same value of semantic similarity between (Vehicle) and (Vehicle, Car)), but in the reality the rate of the semantic similarity between these concepts differs. For this reason, we have proposed our method which allows avoiding this type of problems by affecting the different values of semantic similarity for similar cases.

Also our method uses a simple function of weight allocation and bases on famous and robust algorithms with some adaptations to minimize the time of execution. Therefore, this one can calculate the semantic similarity in very short time even in industrial size ontologies.

4 Conclusion

In this paper, we have presented a new method to calculate the similarity between two concepts of ontology. Then, we have compared it against other methods that already exist. The obtained results are very interesting and prove the strength of our method.

We are interested in the future works to adjust this method to support the calculation of similarity between any two concepts that may exist in the same ontology or not and be related by any type of relations.