1 Introduction

With the enormous success of the Information Society and the World Wide Web, the amount of available information has significantly increased. In this context, computational text analysis has attracted great interest from the research community in order to enable a proper exploitation, management, classification and retrieval of textual data. In fact, considerable efforts have been made to standardize our understanding of various fields by means of ontologies, which allow us to model domains through sets of concepts and semantic relationships established between these concepts [24]. However, one of the most basic problems when aiming to interpret textual data or electronic documents is the assessment of semantic likeness between terms. According to Goldstone [21], psychological experiments have demonstrated that semantic likeness acts as a fundamental organizing principle by which human beings organize and classify objects.

Semantic similarity states how taxonomically near two terms are, because they share some aspects of their meaning (e.g., dogs and cats are similar to the extent that both are mammals). In contrast, the more general notion of semantic relatedness does not necessarily rely on a taxonomic relation (e.g., car and wheel, or pencil and paper); other non-taxonomic relationships (e.g., meronymy, antonymy, functionality, cause–effect) are also considered. For example, bronchitis and flu are similar because both are disorders of the respiratory system, whereas diuretics and hypertension are related in a non-taxonomic way (diuretics help in the treatment of hypertension); in this more general case, one talks about semantic relatedness. In both cases, the assessment is based on the evaluation of the semantic evidence observed in a knowledge source (such as ontologies or domain corpora). In other words, semantic similarity is understood as the degree of taxonomic proximity between terms. Similarity measures assign a numerical score that quantifies this proximity as a function of the semantic evidence observed in one or several knowledge sources [62]. Usually, these resources consist of taxonomies and more general ontologies, which provide a formal and machine-readable way to express a shared conceptualization by means of a unified terminology and semantic inter-relations from which semantic similarity can be assessed [68].

In information systems, semantic similarity plays an important role because it supports the identification of objects that are conceptually close but not identical [58]. It is a key feature in the development of semantic search technology [51]. It also facilitates the comparison of information resources in different types of knowledge domains [73, 82].

Many relevant applications depend directly on semantic similarity computation: information retrieval, to improve accuracy [1, 2, 9, 27]; ontology alignment, to discover mappings between ontology entities [34, 52] and to validate or repair such mappings [41]; question answering systems [79]; basic natural language processing tasks such as word sense disambiguation [48, 76]; recommender systems [7, 39]; information extraction [5, 65, 78]; multimedia content search [56]; semantic information integration [17, 33]; ontology learning, in which new terms related to existing concepts should be acquired from textual resources [60]; text clustering [77]; the biomedical domain [6, 13, 49, 61]; geographic information science [45, 46, 57, 73, 75]; cognitive science, where it has been applied to study human cognition, reasoning, categorization and differences in conceptualizations [22, 43, 81]; and the Semantic Web, when dealing with the automatic annotation of documents [10, 66]. Thus, semantic similarity is a fundamental part of the semantic processing task.

Ontology-based semantic similarity measures compare how similar the meanings of concepts are, according to the taxonomic evidence modeled in the ontology. The exploitation of multiple ontologies provides additional knowledge that can improve the similarity estimation and resolve situations in which terms are not represented in an individual ontology [2]. A plethora of measures have been proposed over the last decades. Although some context-independent semantic similarity measures have been proposed [31, 53, 54, 83], most measures were designed in an ad hoc manner and expressed on the basis of domain-specific or application-oriented formalisms [61]. Therefore, most of these approaches target a specific audience and fail to benefit other communities. As a result, a non-specialist can only interpret the large diversity of state-of-the-art proposals as an extensive list of measures, and the selection of an appropriate measure for a specific usage context is a challenging task [24].

Despite the large number of contributions related to ontology-based semantic similarity measures, the understanding of their foundations is limited. For a practitioner, some fundamental questions remain: Why does a measure work better than another one? How does one choose or design a measure? Is it possible to distinguish families of measures sharing specific properties? How can one identify the most appropriate measures according to particular criteria? Therefore, it is difficult to decide which measure should be used for a particular application or to compare results from different similarity theories [29].

In this paper, we propose an approach based on a network model that uses an algorithm that iteratively evaluates how close two concepts are, based on the semantics that an ontology explicitly expresses. Network models are employed in knowledge representation in the form of semantic networks. These structures are composed of nodes (concepts) and edges (relationships): nodes represent knowledge units such as objects, concepts or properties, while the edges linking nodes represent explicit relationships between them. Although this model of representation always has the same structure, network models may differ in whether they restrict the direction of relationships. This means that similarity measures based on the network model depend on the context [58] and describe the semantics of the ontology.

To sketch out our proposal, consider the following question: how far is the concept “mountain” from the concept “valley”? Possible answers could be numerical values such as 10, 2 or 3.5543. In general, a distance expresses the proximity between two objects, so conceptual distance is defined by the space that separates two concepts within a conceptualization. Mathematically, the distance between two points in Euclidean space is equal to the length of the line segment that joins those points. Thus, the computation of the distance between objects depends directly on the space in which they are located.

According to the literature, conceptual distance is related to semantic similarity based on a network model, which consists of graphs or conceptual representations such as semantic networks, hierarchies or ontologies [72]. The distance represents how similar or semantically related two concepts are [58]. Semantic similarity is a key issue in the semantic processing area and has a long tradition in cognitive science because it can be used for several purposes. Rada et al. [53] defined conceptual distance as the length of the shortest path that connects the concepts in a conceptualization, which represents the semantic similarity in is-a hierarchies. Thus, in this approach, similarity measures must have the resolvable property, which means that the representation must be rich enough that there is a path between every pair of concepts; one cannot compute the conceptual distance between concepts that are not connected. In fact, this measure is guided by two observations: the behavior of the conceptual distance and the proportionality of the conceptual distance between nodes in the hierarchy. The measure is then the minimum average path length over all pairwise combinations of nodes between two subsets of nodes. Moreover, similarity measures in the network model assume that each relationship contributes to the similarity judgment [72].

This work presents the DIS-C approach, which computes the conceptual distance between concepts of an ontology. The method is based on the fact that an ontology can be represented as a strongly connected graph. In the proposed approach, the topology of the graph defines the relationships between concepts; from them, weight values are assigned to each relation taking into consideration the proximity between concepts. Initially, the weight values can be defined by a domain expert or even set arbitrarily. In order to remove this arbitrariness, we propose a measure called generality, which describes how visible a concept is to any other concept of the ontology. The generality is computed from the incoming and outgoing relations of a concept. To optimize the weight values, an iterative refinement is performed, which allows the DIS-C algorithm to assess the distance automatically and to remove the subjectivity of user-defined weightings. The network model applicable to DIS-C allows any type of relationship, not only taxonomic ones (such as hierarchies, hyponymies and partonomies). Moreover, DIS-C in conjunction with the GEONTO-MET methodology [80] can be used to compute similarity in other knowledge representation models such as the feature-based model [58].

The rest of the paper is organized as follows. Section 2 presents the related work with respect to similarity approaches and their applications. Section 3 describes the theoretical foundation of conceptual distance under our perspective as well as a set of examples that were developed to illustrate our proposal. Section 4 presents the proposed algorithm and the results of a set of experiments that characterize its performance. Finally, Sect. 5 presents a discussion of our proposal in the context of previous and future works.

2 Related work

Many works have been developed in recent years, especially with the increasing interest in the Semantic Web. Ontologies have been of great interest for the semantic similarity research community because they offer a structured and unambiguous representation of knowledge in the form of conceptualizations interconnected by means of semantic pointers [64]. These structures can be exploited in order to assess the degree of semantic proximity or conceptual distance between terms. Depending on the theoretical foundations on which similarity computation is based, and on the way in which an ontology is processed and complemented with other sources, different approaches to measuring similarity can be identified [84]. Other approaches have been proposed to assess semantic similarity among concepts represented by words within lexicographic databases [4]. In this context, Li et al. [38] proposed a methodology to compute similarity between short sentences through semantic similarity. Basically, distance-based similarity methods aim at assessing a score between a pair of words by exploiting some information sources, whose applications are centered on search engines [8, 11] or on a well-defined semantic network such as WordNet or MeSH [44].

According to Pirró [51], many approaches to assess similarity have been proposed, which can be classified on the basis of the information source they exploit. Thus, different families of methods have been defined, taking into account the theoretical foundations and the way in which ontologies are analyzed in order to estimate similarity. Ontology-based approaches [53] assess semantic similarity by counting the number of nodes/edges separating two concepts within semantic networks. This kind of measure was mainly designed for semantic networks with taxonomic relationships; between two concepts or two sets of nodes, it computes the average of the minimal distances among the pairs of nodes belonging to the sets. Information content-based approaches assess the similarity between concepts by probabilistic models, as a function of the information content that both concepts have in common in a specific ontology [31, 40, 55]. In the past, information content was typically computed from concept distributions in tagged textual corpora. Nowadays, methods have been proposed for inferring the information content of concepts in an intrinsic manner from the knowledge structure modeled in an ontology [61, 63, 74, 85]. Hybrid approaches combine multiple information sources, and weights are used to adjust the contribution of each information source [37, 58, 70, 71]. Feature-based approaches estimate similarity according to the weighted sum of the amount of common and non-common features [39, 64]. By features, Sánchez et al. [68] usually considered taxonomic and non-taxonomic information modeled in an ontology, in addition to concept descriptions retrieved from dictionaries [50, 57]. Due to the additional semantic evidence exploited during the assessment, they potentially improve edge-counting approaches. However, non-taxonomic features are rarely considered, because they are seldom found in ontologies [15] and because integrating heterogeneous semantic evidence requires fine-tuning of weighting parameters [50]. Moreover, edge-counting approaches base the similarity assessment on the number of taxonomic links of the minimum path separating two concepts in a given ontology [35, 37, 53, 83]. Meng et al. [42] argued that all measures can be grouped into four classes: path length-based, information content-based, feature-based, and hybrid measures.

On the other hand, other works focus on ontology alignment techniques. Cross and Hu [14] described a semantic method to measure the similarity between concepts that exist in two different ontologies by means of the matchers of ontology alignment systems. These matchers belong to various categories depending on the context of the similarity measure, such as lexical, structural or extensional matchers. Other proposals combine context and similarity to achieve interoperability among different databases [32]. Methods for computing semantic similarity with multiple ontologies have also been proposed. Sánchez and Batet [62] defined a method to extend information content-based semantic similarity measures when multiple ontologies are available; it allows estimating the similarity when a term or a pair of terms is missing in a certain ontology but found in another one. Han et al. [23] presented the ADSS approach to determine semantic similarity among a set of entities from different ontologies. This approach takes into consideration the similarity between two entities and their similarity reflected in context, and the ranking score is defined as a function of some particular parameters. ADSS differs from other methods because it combines an efficient Tabu search algorithm with a multi-objective programming algorithm to improve precision.

Other ontology-based approaches have been defined to compute and assess similarity in the biomedical domain. For example, Batet et al. [6] proposed a similarity measure that can achieve a level of accuracy similar to corpus-based approaches while retaining the low computational complexity and lack of constraints of path-based measures. Like path-based measures, it exploits the geometric model of the ontology, so no pre-calculation or pre-processing is needed, which makes it more computationally efficient. Harispe et al. [24] presented a unifying framework that aims to improve the understanding of semantic measures, to highlight their equivalences and to propose bridges between their theoretical bases for the biomedical domain. Zadeh and Reformat [84] proposed a method for determining semantic similarity between concepts defined in an ontology. In contrast to other techniques that use the ontological definition of concepts for similarity assessment, this approach focuses on relations between concepts and their semantics. It is able to determine similarity not only at the definition/abstract level, but also to evaluate the similarity of concrete pieces of information that are instances of concepts. In addition, the method allows for context-aware similarity assessment, when only specific sets of relations, identified by the context, are taken into consideration. A new ontology-based measure relying on the exploitation of taxonomic features extracted from an ontology was proposed by Sánchez et al. [64]; it considers the similarity assessment and the way in which ontologies are exploited or complemented with other sources. The measure follows a principle similar to Tversky’s model [81], which considers that the similarity between two concepts can be computed as a function relying on taxonomic information. Likewise, Sánchez et al. [67] described how the problem of integrating heterogeneous knowledge sources can be tackled by means of simple terminological matching between ontological concepts. Sánchez et al. [68] aimed to improve such methods by analyzing the similarity between the modeled taxonomic knowledge and the structure of different ontologies by means of two methods. The first, relying on the principles of knowledge representation, considers the explicit knowledge modeled in the ontology to estimate the semantic overlap between taxonomic ancestors of different ontologies. The second exploits the net of semantic links and the structural similarities between several ontologies as an indication of implicit semantics.

Moreover, Saruladha et al. [69] presented a computational approach for assessing semantic similarity among concepts from different and independent ontologies, without constructing a shared ontology. That work explored the possibility of adapting existing single-ontology information content-based approaches and proposed methods for assessing semantic similarity among concepts from multiple ontologies. The approaches are corpus independent and correlate well with human judgments. Albertoni and De Martino [4] proposed a framework to assess semantic similarity among instances within an ontology. It aims to define a sensitive measure of semantic similarity that takes into account different hints hidden in the ontology definition and explicitly considers the application context. An ontology-based method for assessing similarity based on Formal Concept Analysis was proposed by Formica [18]. The method is intended to support ontology engineering in difficult activities that are becoming fundamental in the development of the Semantic Web, such as ontology merging and ontology mapping.

Other ontology-based approaches focus on similarity computation between two concepts from an ontology. Albacete et al. [3] proposed a similarity function based on five dimensions: sort, compositional, essential, restrictive and descriptive. The obtained similarity values are weighted and aggregated in order to obtain a global similarity measure. The proposal was evaluated using the WordNet knowledge base. Goldstone [20] proposed a method for measuring similarity in which subjects rearrange items (psychological similarity), so that their proximity on a computer screen is proportional to their similarity.

Thus, the most common ways of structuring knowledge are hierarchies and ontologies. To date, several general ontologies have been developed, such as WordNet [16], SUMO [47], PROTON [28], DOLCE [19], SNOMED-CT [25], the Gene Ontology [12] and Kaab [80], among others. These ontologies allow us to analyze knowledge by using a graph-based model, describing concepts and their relationships with nodes and edges.

The semantic similarity computation in graph-based models has been realized in different manners, for instance, by using graph theory techniques to compute similarity values [35, 53, 83]. These measures are used in hierarchies and taxonomies, due to the underlying knowledge structure that is considered when computing the similarity. The main problem of those approaches is their dependency on the homogeneity and coverage of the relations in the ontology. Ontologies like WordNet are good candidates for applying those measures, due to their homogeneous distribution of relations and their coverage across different domains [31]. In addition, Resnik [55] described a similarity measure based on the notion of information content. The similarity between two terms is estimated as the amount of information that they share within the conceptual representation; in a taxonomy, this information is represented by the Least Common Subsumer (LCS) of both terms. Multiple variations of this measure have been developed; for example, Resnik-like measures depend on two aspects: the way of computing information content and the organization of the subsumption hierarchy.

At this point, it is necessary to consider whether conceptual distance is adequate to measure the semantic similarity between concepts. In this work, we assume that two concepts can be conceptually near and yet semantically non-similar. For instance, lakes and reservoirs, or mountains and valleys, are involved in specific conceptualizations in which their conceptual distances are small, but their semantic similarities are low according to their meanings. Our approach does not try to measure semantic similarity; it measures conceptual distance, considering some ideas presented by Rada et al. [53] and Resnik [55]. For example, how similar are “credit card” and “food”? According to semantic similarity, the two concepts are weakly similar; but conceptually, it could be said that you can buy “food” by using a “credit card”. In fact, we consider that semantic similarity is different from conceptual distance: the latter tells us how strongly two concepts are related, while semantic similarity indicates how similar they are. As mentioned, conceptual distance can be used to compute semantic similarity. Some approaches presented above work with ontologies based on taxonomic relationships, which restricts their application. The DIS-C algorithm does not have this limitation and is applicable to any type of ontology. Furthermore, the algorithm is intended to operate without the need for someone to assign a value to each relationship.

3 Theoretical foundation of the DIS-C approach

In this work, the conceptual distance is defined by the space that separates two concepts within a specific conceptualization, which is represented by an ontology. A further assumption is that the conceptual distance reflects the difference between the information content that two concepts provide through their own particular definitions.

The proposed approach is applicable to any type of conceptualization and ontology, as well as to different conceptual structures such as hierarchies, taxonomies and semantic networks. The novelty of the proposed algorithm lies in assigning a distance value to each type of relation in the ontology and transforming the latter into a weighted directed graph (called the conceptual graph), in which each concept is a node and each relationship is a pair of edges (one for the direct and another for the inverse sense of the relation).

Once the conceptual graph is built, different techniques of graph theory are applied in order to process the underlying knowledge codified in the ontology. The natural step is to compute the shortest path in order to find the distance between concepts that are not directly related.

3.1 The basic algorithm

Let \(K(C,\mathfrak {R},R)\) be a conceptualization, where C is the set of concepts, \(\mathfrak {R}\) is the set of types of relations and R is the set of relationships in the conceptualization. Then, for each relation type \(\rho \in \mathfrak {R}\), the value of \(\delta ^{\rho }\) is set directly, depending on the type of relation, and \({\overline{\delta }}^{\rho }\) is set for the reverse of relation \(\rho \).

1. For each type of relation \(\rho \in \mathfrak {R}\), assign a conceptual distance (weight) to that relationship. This weighting is defined in each direction of the relationship. For example, given the relation “is” and the sentence “cat is an animal”, the distances from “cat” to “animal” and from “animal” to “cat” are set as follows: \(\text{ distance }(cat,animal)=0\) and \(\text{ distance }(animal,cat)=1\), or, using the proposed notation, \(\delta ^{\text{ is }}=0\) and \({\overline{\delta }}^{\text{ is }}=1\).

2. The graph \(G_{K}(V,A)\) is created for the conceptualization K. First, each concept \(c\in C\) is added as a vertex in the graph \(G_{K}\), which means that \(V=C\).

3. For each relationship \(a\rho b\in R\), where \(a,b\in C\) and \(\rho \in \mathfrak {R}\), add the edges \((a,b,\delta ^{\rho })\) and \((b,a,{\overline{\delta }}^{\rho })\) to the set A of edges.

4. The length of the shortest path between each pair of vertices is computed. As a result, the conceptual distance is disseminated to all concepts in the conceptualization K.

Algorithm 1 shows the basic procedure for computing conceptual distance.

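To make the procedure concrete, the following is a minimal Python sketch of the four steps above; the triple-based encoding of the conceptualization and the dictionary distance matrix are our own illustrative assumptions, not the paper's implementation, and the shortest-path step (step 4) uses Floyd-Warshall.

```python
from itertools import product

INF = float("inf")

def conceptual_distances(concepts, relations, delta, delta_rev):
    """Basic algorithm: build the conceptual graph (steps 2-3) and
    disseminate distances via all-pairs shortest paths (step 4).

    concepts  : iterable with the set of concepts C
    relations : triples (a, rho, b), one per relationship a rho b in R
    delta     : dict rho -> forward distance delta^rho   (step 1)
    delta_rev : dict rho -> reverse distance delta-bar^rho
    """
    V = list(concepts)
    d = {(x, y): 0.0 if x == y else INF for x, y in product(V, V)}
    for a, rho, b in relations:
        d[a, b] = min(d[a, b], delta[rho])       # direct edge
        d[b, a] = min(d[b, a], delta_rev[rho])   # inverse edge
    for k, i, j in product(V, V, V):             # Floyd-Warshall
        if d[i, k] + d[k, j] < d[i, j]:
            d[i, j] = d[i, k] + d[k, j]
    return d

# The "cat is an animal" example from step 1:
d = conceptual_distances(["cat", "animal"],
                         [("cat", "is", "animal")],
                         delta={"is": 0.0}, delta_rev={"is": 1.0})
print(d["cat", "animal"], d["animal", "cat"])    # 0.0 1.0
```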

3.1.1 Application of DIS-C in the GEONTO-MET approach

In Torres et al. [80], we presented a methodology for conceptualizing the geographic domain. This approach will be used as an application example of the DIS-C basic algorithm.

According to Algorithm 1, the following three steps are applied: (1) assign a weight to each type of relation, (2) create the graph, and (3) compute the table of shortest paths.

In GEONTO-MET, there are three axiomatic relations: “is”, “has” and “does”. The “is” relation is widely used in the literature and establishes a hierarchical relationship. For example, for the relationship “cat is an animal”, the distances from “cat” to “animal” and from “animal” to “cat” must be set; they are represented by \(distance(cat,animal)=0\) and \(distance(animal,cat)=1\). Thus, we propose that if \(a(\text{ is })b\in R\), then \(\delta ^{\text{ is }}(a,b)=0\) and \({\overline{\delta }}^{\text{ is }}(a,b)=1\).

The “has” relation defines properties, in this case the distance is inversely proportional to the number of concept occurrences. For example, if the “urban area” concept “has” “street of first order”, then the conceptual distance between the concepts “urban area” and “street” will be inversely proportional to the number of streets in the urban area. That is, if \(a(\text{ has })b\in R\), then \(\delta ^{\text{ has }}(a,b)=\frac{1}{o(p)}\) , where o(p) is the number of occurrences of the property \(p=a(\text{ has })b\) in R. On the other hand, the conceptual distance of “street” to “urban area” is likewise inversely proportional to the number of streets in the urban area and directly proportional to the total number of properties of the urban area (streets, buildings, parks, etc.). Formally, if \(a(\text{ has })b\in R\), then \({\overline{\delta }}^{\text{ has }}(a,b)=\frac{|P(a)|}{o(p)}\), where \(P(a)=\left\{ x\mid a(\text{ has })x\in R\right\} \) for any concept \(x\in C\) and o(p) is the number of property occurrences \(p=a(\text{ has })b\) in R.

Similarly, the “does” relation defines abilities; thus, the conceptual distance is defined in both directions of the relationship, inversely proportional to the number of times that an ability is referred to by a concept. Likewise, the inverse relationship is directly proportional to the total number of abilities of the concept.

In summary, the distance values for each type of relationship in GEONTO-MET are as follows (a code sketch after the list illustrates these rules):

1. If \(a(\text{ is })b\in R\)

   (a) \(\delta ^{\text{ is }}(a,b)=0\).

   (b) \({\overline{\delta }}^{\text{ is }}(a,b)=1\).

2. If \(a(\text{ has })b\in R\)

   (a) \(\delta ^{\text{ has }}(a,b)=\frac{1}{o(p)}\), where o(p) is the number of property occurrences \(p=a(\text{ has })b\) in R (this value is normally 1).

   (b) \({\overline{\delta }}^{\text{ has }}(a,b)=\frac{|P(a)|}{o(p)}\), where \(P(a)=\left\{ x\mid a(\text{ has })x\in R\right\} \) for any concept \(x\in C\) and o(p) is the number of property occurrences \(p=a(\text{ has })b\) in R.

3. If \(a(\text{ does })b\in R\)

   (a) \(\delta ^{\text{ does }}(a,b)=\frac{1}{o(h)}\), where o(h) is the number of ability occurrences \(h=a(\text{ does })b\) in R (this value is normally 1).

   (b) \({\overline{\delta }}^{\text{ does }}(a,b)=\frac{|H(a)|}{o(h)}\), where \(H(a)=\left\{ x\mid a(\text{ does })x\in R\right\} \) for any concept \(x\in C\) and o(h) is the number of ability occurrences \(h=a(\text{ does })b\) in R.
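The sketch below is our own reading of how these rules can be realized mechanically from the multiset of relationships; the triple encoding is hypothetical, o(p) and o(h) are read off the repetition counts of a triple, and |P(a)| and |H(a)| count distinct targets.

```python
from collections import Counter, defaultdict

def geonto_met_weights(relations):
    """Forward/reverse distances for GEONTO-MET's axiomatic relations.

    relations: list of triples (a, rho, b), rho in {"is", "has", "does"},
    possibly with repetitions (repeated triples give o(p), o(h) > 1).
    Returns dict (a, rho, b) -> (delta, delta-bar)."""
    occ = Counter(relations)        # o(p), o(h): occurrences of a triple in R
    targets = defaultdict(set)      # P(a), H(a): distinct targets per concept
    for a, rho, b in relations:
        targets[a, rho].add(b)
    weights = {}
    for (a, rho, b), o in occ.items():
        if rho == "is":
            weights[a, rho, b] = (0.0, 1.0)
        else:  # "has" and "does" share the same form
            weights[a, rho, b] = (1.0 / o, len(targets[a, rho]) / o)
    return weights
```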

Fig. 1 Example of an ontology, which was built by using the GEONTO-MET approach

Fig. 2 Resulting conceptual graph obtained from the ontology depicted in Fig. 1

As an example, the ontology depicted in Fig. 1 was developed by using the GEONTO-MET approach. Figure 2 shows the graph that was obtained by applying steps 2 and 3 of the basic algorithm. Finally, Table 1 presents the conceptual distances between all concepts.

Table 1 Result of applying the basic algorithm to the ontology of Fig. 1

3.2 Generality

Resnik [55] proposed that \(-\log p\left( c\right) \) describes the information content of a concept c, where p(c) is the probability that concept c appears in the definition of any concept; it is obtained by dividing the number of concepts that have c as an ascendant (including concept c itself) by the total number of concepts. This way of counting the amount of information makes sense because inheritance in taxonomies means that concepts “deep in the taxonomy” contain all the information of their ascendants plus their own. Therefore, it is logical to think that the amount of information is proportional to the “depth” in the taxonomy.

Analogous to Resnik’s proposal [55], generality is a way of describing the information content of a concept, but here we are not only dealing with taxonomies, but also with ontologies that may have multiple types of relationships at once (not only taxonomic ones). Thus, generality is proposed to characterize the information content of concepts in an ontology according to how related they are to each other. In addition, generality quantifies how a concept is connected with the entire ontology.

If concept \(x\in C\) is not related to other concepts (it does not use information from other concepts in its own definition), then little information is required to identify it, denoting that the concept is very abstract or general. Therefore, its conceptual distance to other concepts will be large (on average): if it is only related to a few concepts, the paths connecting it to most of the other concepts will be long. Conversely, more specific concepts are defined in terms of other, more general ones; so, if x is a very general concept, the other concepts will be close to x through their definitions. Therefore, if x is a very general concept, the average distance from the other concepts in the ontology to x will be small. In conclusion, the generality of a concept x is defined as the ratio between the information content that x requires from the other concepts for its definition and the sum of this value plus the information content that x contributes to the other concepts in the ontology.

Now, the information content of a concept x in an ontology is defined as \(-\log g\left( x\right) \), where g(x) is a function that indicates the generality of a concept (the probability that concept x is “present” in the definition of the other concepts). We propose that g(x) be defined as the ratio between the information provided by the other concepts to x and all the information related to x (the information provided to x by all other concepts plus the information provided by x to all other concepts); that is, the average conceptual distance from concept x to all other concepts divided by the sum of the average distances from x to all other concepts and from all other concepts to x. Thus, let \(K(C,\mathfrak {R},R)\) be a conceptualization, \(x,y\in C\) concepts of that conceptualization and \(\Delta _{K}\left( x,y\right) \) the conceptual distance from x to y; then, \(\forall x\in C\), the generality g(x) is defined as shown in Eq. 1.

$$\begin{aligned} g(x)=\frac{\frac{\sum \nolimits _{y\in C} \Delta _{K}(x,y)}{|\mathrm{C}|}}{\frac{\sum \nolimits _{y\in C}\Delta _{K}(x,y)}{|\mathrm{C}|}+\frac{\sum \nolimits _{y\in C}\Delta _{K}(y,x)}{|\mathrm{C}|}} =\frac{\sum \nolimits _{y\in C}\Delta _{K}(x,y)}{\sum \nolimits _{y\in C}\left( \Delta _{K}(x,y)+\Delta _{K}(y,x)\right) }. \end{aligned}$$
(1)
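Read as code, Eq. 1 is a short function over a conceptual-distance matrix; the sketch below assumes the dictionary-based matrix returned by the basic-algorithm sketch of Sect. 3.1.

```python
def generality(x, concepts, dist):
    """Eq. 1: g(x) = sum_y D(x,y) / sum_y (D(x,y) + D(y,x)),
    where dist[(u, v)] is the conceptual distance from u to v."""
    outgoing = sum(dist[x, y] for y in concepts)  # distances from x
    incoming = sum(dist[y, x] for y in concepts)  # distances to x
    return outgoing / (outgoing + incoming)
```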

3.3 Automatic weighting method

In order to apply the DIS-C algorithm, the conceptual weighting for each type of conceptual relationship and its inverse must be established. In this section, we introduce a method for the automatic computation of these conceptual weights. In general, there are no rules in the literature that give some notion of the desirable features of these values in a conceptualization. Most proposals are too specific, and their metrics are tailored to a particular methodology of conceptualization. However, we believe that it is possible to compute the conceptual distance values of each type of relationship in an ontology by using only its own structure, regardless of the type of ontology and the amount or type of relationships.

The idea of the algorithm consists of considering the ontology as a graph and computing the weight that each edge must have, taking into account the generality of each node (concept). But why do we have to calculate generality in order to determine the conceptual distance? Because we want to use the intention/semantics that the ontologist has given to the concepts. Presumably, more related concepts are more important in the domain that the ontology describes. So, the generality of a concept provides information about the relations in the conceptualization, and hence we use this information to determine the weight that each edge must have.

At this point, we face a circular dependency, because generality is based on the conceptual distance, and the conceptual distance is computed with the generality as part of its input. Therefore, as a starting point, we assume that all nodes/concepts are equally generic.

In addition, the topology of the ontology is another aspect to consider, because it “captures” the intention/semantics of the ontology. To take the topology into account, the in-degree and out-degree of each vertex are used. For computing the generality of concepts and the conceptual distances, the following foundations are proposed.

Given a conceptualization \(K(C,\mathfrak {R},R)\) as defined above, the directed graph \(G_{K}(V_{G},A_{G})\) is created by making each concept \(c\in C\) a node in the graph \(G_{K}\): \(V_{G}=C\). Now, for each relation \(a\rho b\in R\), where \(a,b \in C\), the edge \((a,b,\rho )\) is added to \(A_{G}\).

The next step is to iteratively generate, from \(G_{K}\), the weighted directed graph \(\Gamma _{K}^{j}(V_{\gamma }^{j},A_{\gamma }^{j})\). For this purpose, in the jth iteration we make \(V_{\gamma }^{j}=V_{G}\), \(A_{\gamma }^{j}=\emptyset \) and, for each edge \((a,b,\rho )\in A_{G}\), the edges \((a,b,\omega _{ab}^{j})\) and \((b,a,\omega _{ba}^{j})\) are added to \(\Gamma _{K}^{j}\), where \(\omega _{ab}^{j}\) is a weighted average estimating the conceptual distance from vertex a to vertex b at the jth iteration. These weights are calculated by Eq. 2,

$$\begin{aligned} \begin{array}{c} \omega _{ab}^{j}=p_{w}\left( g_{a}^{j-1}\omega _{a}^{o}+g_{b}^{j-1}\omega _{b}^{i}\right) +\left( 1-p_{w}\right) \left[ \delta ^{\rho }\right] ^{j-1}\\ \omega _{ba}^{j}=p_{w}\left( g_{b}^{j-1}\omega _{b}^{o}+g_{a}^{j-1}\omega _{a}^{i}\right) +\left( 1-p_{w}\right) \left[ {\bar{\delta }}^{\rho }\right] ^{j-1} \end{array}, \end{aligned}$$
(2)

where \(p_{w}\in \left[ 0,1\right] \) is a parameter that indicates how much importance is given to the new values and, consequently, how much importance is given to past values. Normally, \(p_{w}=\frac{1}{2}\), and \(g_{x}^{j}\) is the generality of vertex \(x\in V_{G}\) at the jth iteration (the value of \(g_{x}^{j}\) is calculated by using the graph \(\Gamma _{K}^{j}\)). We set \(\forall x\in V_{G},g_{x}^{0}=1\), i.e., the initial value of generality for all nodes is equal to 1. Furthermore, the terms \(\left[ \delta ^{\rho }\right] ^{j}\) and \(\left[ {\bar{\delta }}^{\rho }\right] ^{j}\) are the values of the conceptual distance of the relationship between a and b (forward and reverse, respectively), whose values are being sought. Initially, those distances are 0, i.e., \(\left[ \delta ^{\rho }\right] ^{0}=0\) and \(\left[ {\bar{\delta }}^{\rho }\right] ^{0}=0\) for any \(\rho \in \mathfrak {R}\).

Moreover, \(\omega _{x}^{i}\) is the cost of “getting” to vertex \(x\in V_{G}\), defined as the probability of not finding an edge arriving at vertex x, i.e., \(\omega _{x}^{i}=1-\frac{i_{x}}{i_{x}+o_{x}}\). Likewise, \(\omega _{x}^{o}\) is the cost of “leaving” vertex \(x\in V_{G}\), defined as the probability of not finding an edge leaving vertex x, i.e., \(\omega _{x}^{o}=1-\frac{o_{x}}{i_{x}+o_{x}}\), where \(i_{x}\) is the in-degree of vertex x and \(o_{x}\) is the out-degree of vertex x.

Figure 3 presents an example ontology, where concept a has two concepts related to it (b and c), so the in-degree value \(i_{a}=2\) (the number of relationships that “arrive” at concept a). In addition, concept a is not associated with any other concept in the ontology, so the out-degree \(o_{a}=0\) (no relationship “leaves” concept a).

Fig. 3 Ontology example to clarify the process of weighting relationships

We can set the cost of “getting” to concept a as \(\omega _{a}^{i}=1-\frac{i_{a}}{i_{a}+o_{a}}=1-\frac{2}{2+0}=0\), and the cost of “leaving” concept a as \(\omega _{a}^{o}=1-\frac{o_{a}}{i_{a}+o_{a}}=1-\frac{0}{2+0}=1\). Similarly, for concept b: \(i_{b}=1\) (one relationship “enters” b) and \(o_{b}=2\) (two relationships “leave” b); then, \(\omega _{b}^{i}=1-\frac{i_{b}}{i_{b}+o_{b}}=1-\frac{1}{1+2}=\frac{2}{3}\) and \(\omega _{b}^{o}=1-\frac{o_{b}}{i_{b}+o_{b}}=1-\frac{2}{1+2}=\frac{1}{3}\). Now, suppose that the first iteration is computed, i.e., \(j=1\); then the weight of the edge that goes from a to b is \(\omega _{ab}^{1}=p_{w}\left( g_{a}^{0}\omega _{a}^{o}+g_{b}^{0}\omega _{b}^{i}\right) +\left( 1-p_{w}\right) \left[ \delta ^{\rho }\right] ^{0}\); since \(p_{w}=\frac{1}{2}\), \(g_{a}^{0}=g_{b}^{0}=1\) and \(\left[ \delta ^{\text{ is }}\right] ^{0}=0\), then \(\omega _{ab}^{1}=\frac{1}{2}\left( \omega _{a}^{o}+\omega _{b}^{i}\right) =\frac{1}{2}\left( 1+\frac{2}{3}\right) =\frac{5}{6}\). Similarly, \(\omega _{ba}^{1}=\frac{1}{6}\).
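The arithmetic of this example can be checked with a few lines of Python (a sketch under the same assumptions: \(g^{0}=1\), \(p_{w}=\frac{1}{2}\) and \(\left[ \delta ^{\rho }\right] ^{0}=0\)):

```python
def io_costs(in_deg, out_deg):
    """Arrival and leaving costs of a vertex from its degrees."""
    total = in_deg + out_deg
    w_in = 1 - in_deg / total    # omega^i: cost of "getting" to x
    w_out = 1 - out_deg / total  # omega^o: cost of "leaving" x
    return w_in, w_out

def edge_weight(g_a, w_a_out, g_b, w_b_in, delta_prev, p_w=0.5):
    """Eq. 2: convex combination of the generality-scaled degree costs
    and the previous estimate of the relation distance."""
    return p_w * (g_a * w_a_out + g_b * w_b_in) + (1 - p_w) * delta_prev

w_a_in, w_a_out = io_costs(2, 0)   # concept a: i_a = 2, o_a = 0
w_b_in, w_b_out = io_costs(1, 2)   # concept b: i_b = 1, o_b = 2
print(edge_weight(1, w_a_out, 1, w_b_in, 0))   # 5/6, as in the text
print(edge_weight(1, w_b_out, 1, w_a_in, 0))   # 1/6
```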

The graph resulting from applying this process (Algorithm 2, line 5) to the ontology depicted in Fig. 3 is shown in Fig. 4.

Fig. 4 Conceptual graph obtained by applying the generality measure to the ontology shown in Fig. 3

Now, from the graph \(\Gamma _{K}^{j}\), the adjacency matrix \(M_{\Gamma _{K}^{j}}\) is built (see Eq. 3).

$$\begin{aligned} M_{\Gamma _{K}^{j}}=\left[ \begin{array}{ccccc} \omega _{aa}^{j} &{} \omega _{ab}^{j} &{} \omega _{ac}^{j} &{} \omega _{ad}^{j} &{} \omega _{ae}^{j}\\ \omega _{ba}^{j} &{} \omega _{bb}^{j} &{} \omega _{bc}^{j} &{} \omega _{bd}^{j} &{} \omega _{be}^{j}\\ \omega _{ca}^{j} &{} \omega _{cb}^{j} &{} \omega _{cc}^{j} &{} \omega _{cd}^{j} &{} \omega _{ce}^{j}\\ \omega _{da}^{j} &{} \omega _{db}^{j} &{} \omega _{dc}^{j} &{} \omega _{dd}^{j} &{} \omega _{de}^{j}\\ \omega _{ea}^{j} &{} \omega _{eb}^{j} &{} \omega _{ec}^{j} &{}\omega _{ed}^{j} &{} \omega _{ee}^{j} \end{array}\right] \end{aligned}$$
(3)

Meanwhile, \(\omega _{xy}^{j}=0\) if \(x=y\), and \(\omega _{xy}^{j}=\infty \) if there is no edge in \(\Gamma _{K}^{j}\) that goes from vertex x to vertex y. Then, following the same example, for \(j=1\) the matrix \(M_{\Gamma _{K}^{1}}\) shown in Eq. 4 is obtained (Algorithm 2, line 16).

$$\begin{aligned} M_{\Gamma _{K}^{1}}=\left[ \begin{array}{ccccc} 0 &{} \frac{5}{3} &{} \frac{4}{3} &{} \infty &{} \infty \\ \frac{1}{3} &{} 0 &{} \frac{2}{3} &{} \frac{5}{6} &{} \infty \\ \frac{2}{3} &{} \frac{4}{3} &{} 0 &{} \infty &{} \frac{5}{3}\\ \infty &{} \frac{7}{6} &{} \infty &{} 0 &{} \frac{3}{2}\\ \infty &{} \infty &{} \frac{1}{3} &{} \frac{1}{2} &{} 0 \end{array}\right] . \end{aligned}$$
(4)

The next step is to propagate these weights to the vertices that are not directly connected. The resulting matrix \(M_{\Gamma _{K}^{1}}\) is shown in Eq. 5 (Algorithm 2, line 22).

$$\begin{aligned} M_{\Gamma _{K}^{1}}=\left[ \begin{array}{ccccc} 0 &{} \frac{5}{3} &{} \frac{4}{3} &{} \frac{5}{2} &{} 3\\ \frac{1}{3} &{} 0 &{} \frac{2}{3} &{} \frac{5}{6} &{} \frac{7}{3}\\ \frac{2}{3} &{} \frac{4}{3} &{} 0 &{} \frac{13}{6} &{} \frac{5}{3}\\ \frac{3}{2} &{} \frac{7}{6} &{} \frac{11}{6} &{} 0 &{} \frac{3}{2}\\ 1 &{} \frac{5}{3} &{} \frac{1}{3} &{} \frac{1}{2} &{} 0 \end{array}\right] . \end{aligned}$$
(5)

With this adjacency matrix, the generality values for each vertex are calculated in the jth iteration by using Eq. 1 and considering \(\Delta _{K}=M_{\Gamma _{K}^{j}}\). In this example, the generality of vertex a is \(g_{a}^{1}=\frac{0+\frac{5}{3}+\frac{4}{3}+\frac{5}{2}+3}{\left( 0+\frac{5}{3}+\frac{4}{3}+\frac{5}{2}+3\right) +\left( 0+\frac{1}{3}+\frac{2}{3}+\frac{3}{2}+1\right) }=\frac{17/2}{17/2+7/2}=\frac{17}{24}\) (Algorithm 2, line 23).
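A sketch of this propagation and generality step (Algorithm 2, lines 22-23): Floyd-Warshall turns the matrix of Eq. 4 into that of Eq. 5, after which Eq. 1 gives the generality of each vertex. The float encoding of the fractions is ours.

```python
INF = float("inf")

def propagate(M):
    """Floyd-Warshall closure of the weighted adjacency matrix:
    spreads edge weights to vertex pairs not directly connected."""
    n = len(M)
    D = [row[:] for row in M]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

# Adjacency matrix of Eq. 4 (vertices a..e).
M1 = [[0, 5/3, 4/3, INF, INF],
      [1/3, 0, 2/3, 5/6, INF],
      [2/3, 4/3, 0, INF, 5/3],
      [INF, 7/6, INF, 0, 3/2],
      [INF, INF, 1/3, 1/2, 0]]
D1 = propagate(M1)                       # reproduces Eq. 5
row_a = sum(D1[0])                       # 17/2
col_a = sum(row[0] for row in D1)        # 7/2
g_a = row_a / (row_a + col_a)            # Eq. 1 -> 17/24
```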


In addition, the algorithm calculates a new value of the conceptual distance for each type of relationship in \(\mathfrak {R}\). This value is obtained as the average of the weights \(\omega ^{j}\) over the edges that share the same type of relationship, as shown in Eq. 6 (Algorithm 2, line 26).

$$\begin{aligned} \begin{array}{c} \left[ \delta ^{\rho }\right] ^{j}=\frac{\sum _{\left( a,b,\rho \right) \in \rho ^{*}}\omega _{ab}^{j}}{|\rho ^{*}|}\\ \left[ {\bar{\delta }}^{\rho }\right] ^{j}=\frac{\sum _{\left( a,b,\rho \right) \in \rho ^{*}}\omega _{ba}^{j}}{|\rho ^{*}|} \end{array}, \end{aligned}$$
(6)

where \(\rho ^{*}=\left\{ \left( a,b,\rho \right) \in A_{G}\right\} \) is the set of edges that represent relationship \(\rho \). The ontology presented in Fig. 1 was built with the GEONTO-MET approach [80]; thus, it has three types of relations: “is”, “has” and “does”. Following the same example, the conceptual distances for the “is” relation in its normal and reverse directions would be \(\left[ \delta ^{\text{ is }}\right] ^{1}=\frac{\frac{2}{3}+\frac{1}{3}+\frac{7}{6}+\frac{1}{3}}{4}=\frac{5}{8}\) and \(\left[ {\bar{\delta }}^{\text{ is }}\right] ^{1}=\frac{\frac{4}{3}+\frac{5}{3}+\frac{5}{6}+\frac{5}{3}}{4}=\frac{11}{8}\).

Fig. 5 The DIS-C graph obtained from the ontology depicted in Fig. 1

The process starts with \(j=1\) and increases the value of j by one until the condition of Eq. 7 is met, where \(\epsilon _{K}\) is the maximum-change threshold (Algorithm 2, line 31).

$$\begin{aligned} \frac{\sum \nolimits _{x\in V_{\gamma }}\left( g_{x}^{j}-g_{x}^{j-1}\right) ^{2}}{|V_{\gamma }|}\le \epsilon _{K} \end{aligned}$$
(7)
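Putting the pieces together, the following condensed sketch of the DIS-C iteration reuses INF and propagate from the earlier sketches. It assumes, as the approach itself requires, that every concept participates in at least one relationship (so the degree-based costs are defined) and that the conceptual graph is strongly connected (so all distances are finite).

```python
def dis_c(concepts, relations, p_w=0.5, eps=1e-4, max_iter=100):
    """Sketch of the DIS-C iteration (Algorithm 2). g = generality,
    delta/delta_rev = per-relation-type distances (Eq. 6),
    w_in/w_out = degree-based arrival/leaving costs."""
    V = list(concepts)
    idx = {c: i for i, c in enumerate(V)}
    rel_types = {rho for _, rho, _ in relations}
    in_deg = {c: 0 for c in V}
    out_deg = {c: 0 for c in V}
    for a, _, b in relations:
        out_deg[a] += 1
        in_deg[b] += 1
    w_in = {c: 1 - in_deg[c] / (in_deg[c] + out_deg[c]) for c in V}
    w_out = {c: 1 - out_deg[c] / (in_deg[c] + out_deg[c]) for c in V}
    g = {c: 1.0 for c in V}                  # g_x^0 = 1
    delta = {rho: 0.0 for rho in rel_types}  # [delta^rho]^0 = 0
    delta_rev = {rho: 0.0 for rho in rel_types}
    for _ in range(max_iter):
        # Eq. 2: per-edge weights, collected into an adjacency matrix.
        M = [[0.0 if i == j else INF for j in range(len(V))]
             for i in range(len(V))]
        w_fwd, w_bwd = {}, {}
        for a, rho, b in relations:
            wf = p_w * (g[a] * w_out[a] + g[b] * w_in[b]) \
                 + (1 - p_w) * delta[rho]
            wb = p_w * (g[b] * w_out[b] + g[a] * w_in[a]) \
                 + (1 - p_w) * delta_rev[rho]
            w_fwd[a, rho, b], w_bwd[a, rho, b] = wf, wb
            M[idx[a]][idx[b]] = min(M[idx[a]][idx[b]], wf)
            M[idx[b]][idx[a]] = min(M[idx[b]][idx[a]], wb)
        D = propagate(M)                     # Algorithm 2, line 22
        # Eq. 1: new generality values.
        g_new = {}
        for c in V:
            out_mass = sum(D[idx[c]])
            in_mass = sum(D[i][idx[c]] for i in range(len(V)))
            g_new[c] = out_mass / (out_mass + in_mass)
        # Eq. 6: per-relation-type distances as averages over edges.
        for rho in rel_types:
            edges = [t for t in w_fwd if t[1] == rho]
            delta[rho] = sum(w_fwd[t] for t in edges) / len(edges)
            delta_rev[rho] = sum(w_bwd[t] for t in edges) / len(edges)
        # Eq. 7: stop when the generalities stabilise.
        change = sum((g_new[c] - g[c]) ** 2 for c in V) / len(V)
        g = g_new
        if change <= eps:
            break
    return D, g, delta, delta_rev
```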

4 Experimental analysis

According to the example presented in Fig. 1, the results of applying the proposed algorithm are depicted in Fig. 5 and described in Table 2. In this case, it can be seen that the DIS-C graph provides finer-grained distances (see Fig. 5), even though the distance values are smaller than those produced by the basic algorithm (see Table 2).

4.1 DIS-C applied to ontologies

In this section, we present the results of a series of experiments aimed at demonstrating that DIS-C is a general procedure for computing conceptual distances whose results are consistent with more particular approaches tailored to specific ontologies such as hierarchies. We used the confusion theory (CONF) [36], the information content (IC) proposed by Resnik [55] and the distance measure (DIS) provided by Rada et al. [53] to evaluate the DIS-C algorithm. This comparison was performed with respect to the results presented by Levachkine and Guzmán-Arenas [36], where a hierarchy of living beings is proposed (see Fig. 6).

Fig. 6 Hierarchy of living beings (Levachkine and Guzmán-Arenas [36])

Tables 3, 4, 5 and 6 present the similarity values obtained for the proposed hierarchy by applying the aforementioned methods, including the DIS-C algorithm. Table 7 shows the correlation between the results obtained with the different approaches. As can be seen in this table, the results obtained by DIS-C are strongly correlated with the values of the other methods; in fact, DIS-C has the highest average correlation with respect to the others.

From these results, it can be observed that the correlation with CONF is very high: if the values obtained with DIS-C were rounded, about 80% of them would be identical to those of CONF. In other words, DIS-C provides greater accuracy in the estimation of the difference between two concepts while, at the same time, supporting the results of CONF.

Table 2 Values of conceptual distances with respect to the ontology depicted in Fig. 1
Table 3 The obtained results with CONF method
Table 4 Results using the Resnik’s measure
Table 5 Results using the Rada’s measure
Table 6 Results using the DIS-C algorithm
Table 7 Correlation of the DIS-C with other network-based methods
Table 8 Similarity of pairs of nouns proposed in Miller and Charles [44]

Another interesting aspect is that DIS-C is strongly correlated with CONF (95%) as well as with DIS (94%); however, the correlation between those two methods is not of the same order (78%). This suggests that DIS-C provides results that are congruent with both methods, i.e., a measure that is consistent with both views.

4.2 DIS-C applied to word similarity using WordNet

In order to test our algorithm on larger datasets, we compared our results to other similarity measures using WordNet.

Rubenstein and Goodenough [59] recorded synonymy judgments for 65 pairs of nouns: 51 judges assigned every pair a score between 0 and 4 indicating semantic similarity. Later, Miller and Charles [44] repeated the experiment, restricting themselves to 30 pairs of nouns selected from the previous list, divided equally among words with high, intermediate and low similarity.

In Jarmasz and Szpakowicz [30], the authors repeated both experiments and presented the results of six other similarity measures that rely on WordNet. The first WordNet measure used is edge counting; it serves as a baseline, as it is the simplest and most intuitive measure. The next measure, from Hirst and St-Onge [26], relies on the path length as well as on the number of changes of direction in the path; these changes are defined in terms of the WordNet semantic relations. Jiang and Conrath [31] proposed a combined approach based on edge counting enhanced by the node-based information content calculation proposed by Resnik [54]. Leacock and Chodorow [35] count the path length in nodes rather than links and adjust it to take into account the maximum depth of the taxonomy. Lin [40] calculates semantic similarity using a formula derived from information theory. Resnik [54] calculates the information content of the concept that subsumes the two compared concepts in the taxonomy. These similarity measures, as well as the similarities measured by our algorithm, appear in Table 8.

Table 9 presents the correlation coefficients between the human judgments (presented by Miller and Charles [44]) and the values achieved by the methods, including ours. As can be seen, our method attains the best correlation coefficient among all the methods. These results indicate that the conceptual distances computed by the DIS-C algorithm are consistent with human judgments.
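As an illustration of this evaluation protocol, the snippet below computes the Pearson correlation between paired score lists; the ratings shown and the distance-to-similarity conversion 1/(1+d) are hypothetical placeholders, not the paper's data or transformation.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical usage: human ratings (0-4) for word pairs versus
# similarities derived from DIS-C conceptual distances.
human = [3.92, 3.84, 0.42]              # Miller-Charles style ratings
distances = [0.10, 0.15, 2.50]          # DIS-C distances, same pairs
machine = [1 / (1 + d) for d in distances]
print(round(pearson(human, machine), 3))
```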

Table 9 Correlation between the human judgments and similarity methods

5 Conclusions

In this paper, a formal definition and an application of the conceptual distance measure have been presented. First, we argued that the term conceptual distance has often been used, and treated, as a synonym of semantic similarity. However, we discussed that this equivalence holds only for taxonomies, whose relations allow us to infer that if two concepts are close in the taxonomy, then those concepts are similar. This is not necessarily true for ontologies, where non-taxonomic relationships exist and the proximity of two conceptual entities does not mean that they are similar.

On the other hand, the conceptual distance calculation is based on the distances between directly related concepts, which are assigned a priori by the author of the ontology. The proposed algorithm for the propagation of conceptual distances establishes that each relationship must have an associated conceptual distance, both in the normal (direct) orientation of the relationship and in the reverse orientation. With this information, a strongly connected graph is created, in which each concept is a vertex and each relation is associated with two edges (one in the original direction and the other in the opposite direction of the relation). By using a shortest path algorithm, we propagate local distances to determine the distance between two concepts of the ontology that are not directly connected by a relation. As a case study, the conceptual distances between the concepts of an ontology developed with the GEONTO-MET approach were computed.

Moreover, an automatic computation of the conceptual distance, based on the topology of the ontology, is proposed. We introduced the generality metric, defined as the ratio between the information provided by a concept and the information received by the same concept. Thus, an algorithm called DIS-C is proposed; it is based on the topology and on successive approximations, which determine the generality values of each concept, taking into account the conceptual distance between any pair of concepts and the conceptual distance associated with each type of relationship in the ontology.

We presented a comparison of the results obtained by DIS-C with three other network-based methods (CONF, IC and DIS). According to the correlation of the results, DIS-C provides results that are consistent with the other methods, and it reaches the highest average correlation among the methods discussed above. Likewise, DIS-C is strongly correlated with approaches that do not correlate with each other. Although it has been compared only with other algorithms that use the network model for representing ontologies, we believe that this metric could be extended to other representations, such as the feature-based model, in which similarity can be expanded as a linear combination of the conceptual distances of the features that define the concepts.

We also presented a comparative analysis against methods for computing similarity in the context of WordNet. These experiments are based on the set of pairs of words originally proposed by Miller and Charles in 1991, for which a group of people evaluated the similarity of each pair. Again, the results obtained by DIS-C exhibit the highest correlation with the judgments of the group of people. These results indicate that DIS-C is able to capture the human notion of similarity.

Future work is oriented toward analyzing the performance and accuracy of the proposed measure with other ontologies and domains, such as SNOMED-CT, MeSH and the Gene Ontology. In addition, we are investigating how to incorporate hybrid techniques in order to provide a more cognitive measure that relies on human perception of the similarity between concepts.