1 Introduction

Semantic similarity between concepts is a common problem in many applications such as natural language processing, text categorization, text clustering, information retrieval, and word sense disambiguation [1, 9, 12, 36, 37, 55, 57]. However, judging the semantic similarity of different concepts is a routine yet deceptively complex task: to perform it, people need to draw on an immense amount of background knowledge about the concepts. Typical knowledge sources are search engines [15], topical directories such as the Open Directory Project [46], well-defined semantic networks such as WordNet [24, 43], more domain-dependent ontologies [67, 74] such as the Gene Ontology [17] and the biomedical ontologies MeSH and SNOMED CT [4, 69], Wikipedia [34, 37], and Linked Data [13, 56]. Many semantic similarity measures have been proposed in recent years; according to the concrete knowledge sources exploited and the way in which they are used, various families of measures can be distinguished [29, 71, 72]. Semantic similarity measures can be classified into four main categories [37, 47, 51]: (1) distance-based models that are based on the structural representation of the underlying context; (2) feature-based models that define concepts or entities as sets of features; (3) statistical methods that consider statistics derived from the underlying context; and (4) hybrid models that combine the three basic categories. Concretely, distance-based models, also referred to as edge-counting or path-based methods, define similarity as a function of the distance between concepts [51, 62]. Feature-based methods assume that concepts can be represented as sets of features and assess the similarity of concepts based on the commonalities among their feature sets: any increase in common features results in a higher similarity score and any decrease in shared features results in a lower one [51, 80]. Statistical similarity measures incorporate statistics derived from various aspects of the underlying domain into the similarity computation.

It is worth noting that all the measures mentioned above are specific computation methods that exploit particular knowledge sources such as WordNet [20], Wikipedia [48], or Linked Data [10], or particular mathematical tools such as information content (IC) [63], pointwise mutual information (PMI) [14], or latent semantic analysis (LSA) [19]. Furthermore, for the same kind of knowledge source, different computation approaches for semantic similarity need different contents of the knowledge source. For example, among Wikipedia-based similarity measures, IC-based measures need the category structure of Wikipedia, whereas feature-based methods need the articles (e.g., the redirect pages and hyperlinks) of Wikipedia. In fact, novel computation approaches for the semantic similarity of concepts can be proposed by exploiting new knowledge sources or mathematical tools. Clearly, there are some issues in existing research. Firstly, although there are many computation approaches for semantic similarity, there is no unified framework for these methods; therefore, in practical applications it is difficult for users to choose a computation method for the semantic similarity of concepts. Secondly, if two concepts A and B belong to two heterogeneous knowledge sources, the semantic similarity between A and B cannot be computed using existing methods. For example, if A ∈ WordNet, A ∉ DBpedia, B ∈ DBpedia, and B ∉ WordNet, existing approaches cannot compute the semantic similarity Sim(A, B). Of course, if the two concepts A and B belong to two homogeneous knowledge sources, such as two different domain ontologies built in the same language, the value of Sim(A, B) can be computed using existing methods such as [8, 71].

To fill these gaps, this paper presents an extensive study of the semantic similarity of concepts, from which a unified framework for semantic similarity computation is derived. It should be noted that Cross et al. [18] and Harispe et al. [32] have studied the unified-framework issue for semantic similarity measures. However, their work differs from the research in this paper: Cross et al. [18] and Harispe et al. [32] present a framework for unifying ontology-based semantic similarity measures, whereas we propose a unified framework for semantic similarity measures over multiple heterogeneous knowledge sources [68] such as WordNet [20], ontologies [77], Wikipedia [48], and Linked Data [10]. Based on our framework for the semantic similarity of concepts, we give some generic and flexible approaches to semantic similarity measures resulting from instantiations of the framework. The main contributions of this paper are as follows:

  • The semantic representation and a unified framework for semantic similarity computation of concepts are presented.

  • Some generic and flexible approaches to semantic similarity measures of concepts resulting from instantiations of the framework are provided.

  • Several new approaches to semantic similarity computation of concepts that existing methods cannot measure are proposed.

It is worth mentioning that semantic similarity measures can also be used in multimedia systems such as multimedia databases and retrieval, personalized electronic journals, multimedia encyclopedias, digital libraries, executive information systems, and multimedia documents. For example, in multimedia (e.g., image, audio, or video) retrieval with text annotations, we may use the semantic similarity of text to assist multimedia retrieval, where the computation of the semantic similarity of text can be implemented by exploiting semantic similarity measures of concepts. As another example, digital libraries and multimedia documents contain large amounts of image, audio, video, and text data; in a similar manner, we can also utilize the semantic similarity of text to assist the processing (e.g., retrieval, classification, recommendation, mining, and analysis) of digital libraries and multimedia documents. That is, semantic similarity measures of concepts are also relevant to multimedia systems.

The rest of the paper is organized as follows. In the next section, we briefly review related work on semantic similarity measures. Section 3 presents our unified framework for the semantic similarity computation of concepts, including the semantic representation of concepts and a framework for semantic similarity computation. In Section 4, we investigate several similarity measures resulting from instantiations of the framework. Section 5 presents the details of the experiments and the evaluation of our approaches. Finally, in Section 6, we draw our conclusions and present some perspectives for future research.

2 Related work

As a fundamental concept in theories of perception, behavior, social bonding, learning, and judgment, the notion of similarity has been extensively studied for several decades. Many researchers have endeavored to understand and represent the way humans judge the similarity of two or more objects [12, 27, 51, 53, 76, 80]. Semantic similarity reflects the relationship between the meanings of two concepts (words, entities, or terms), sentences (or short texts), or documents (or texts) [21, 31, 54, 59]. The literature on semantic similarity measures is very extensive; thus, we focus only on the measures that are evaluated in this work, that is, this section gives an overview of methods for measuring the semantic similarity of concepts.

As stated in Section 1, semantic similarity between concepts (semantic similarity for short) can be computed based on a set of factors derived typically from a knowledge representation model. Depending on the structure of the application context and its knowledge representation model, various similarity measures have been proposed and different families of methods can also be identified [51, 71]. These families are [37, 47, 51, 58]: (1) distance-based similarity measures; (2) feature-based similarity measures; (3) statistical similarity measures; and (4) hybrid similarity measures.

2.1 Distance-based similarity measures

Modern research in this area starts with the work presented by Rada et al. [62]. Concretely, Rada et al. propose to use the length of the shortest path between concepts as a measurement of distance. Formally, their definition of conceptual distance is as follows:

$$ Dist\left(A,B\right)=\mathrm{minimum}\ \mathrm{number}\ \mathrm{of}\ \mathrm{edges}\ \mathrm{separating}\ a\ \mathrm{and}\ b, $$

where A and B are the two concepts represented by the nodes a and b, respectively, in an is-a semantic net [24].

The distance measure is converted to a similarity measure by subtracting the path length from the maximum possible path length, which can be shown in the following equation:

$$ Sim\left(A,B\right)=2\times {Distance}_{max}- Dist\left(A,B\right), $$

where Distancemax is the maximum possible path length [24].

The work proposed by Rada et al. [62] opens up the family of edge-counting semantic measures and shows that conceptual distance (or similarity) between concepts in a semantic network is proportional to the length of the path that links them [38]. The ideas of Rada et al. are followed by other works such as Wu and Palmer [82], Leacock and Chodorow [39], Hirst and St-Onge [33], Li et al. [41], Pedersen et al. [56], and Garla and Brandt [25] which also propose similarity measures based on features derived from the length of shortest path between concepts. For example, the metric presented by Wu and Palmer [82] relies on the fact that in is-a hierarchies, concepts that are more distant from the root are more specific than the ones that are near the root. Formally, the conceptual similarity between concepts A and B is defined as follows:

$$ Sim\left(A,B\right)=\frac{2\times {N}_3}{N_1+{N}_2+2\times {N}_3}, $$

where N1 (N2) is the number of edges on the path from A (B) to LCS(A, B), N3 is the number of edges on the path from LCS(A, B) to the root, and LCS(A, B) denotes the least common subsumer (LCS) of concepts A and B [24].

Leacock and Chodorow [39] propose a non-linear adaptation of Rada’s distance to define the similarity measure:

$$ Sim\left(A,B\right)=-\log \left(\frac{Dist\left(A,B\right)}{2\times \mathit{\operatorname{Max}}\_ depth}\right), $$

where Max_depth is the longest of the shortest paths linking a concept to the root concept that subsumes all the others [32]. It should be noted that the non-linear adaptation here refers to a logarithmic function of Rada’s distance, whereas the adaptation in [6] refers to runtime/semantic adaptation and management of software to support source-code semantic flexibility.

Garla and Brandt [25] give a proposal for the normalization of the metric of Leacock and Chodorow to the unit interval as follows [3]:

$$ Sim\left(A,B\right)=1-\frac{\log \left( Dist\left(A,B\right)\right)}{\log \left(2\times \mathit{\operatorname{Max}}\_ depth\right)}. $$

Li et al. [41] introduce a family of ten different parametric similarity measures whose core idea is to break the overall similarity function down into a linear or nonlinear combination of base functions, where each base function relies on a different taxonomical feature such as the length of the shortest path between concepts or the depth of the lowest common ancestor [38]. One of the best measures among them is shown in the following equation:

$$ Sim\left(A,B\right)={e}^{-\alpha \ast Dist\left(A,B\right)}\cdotp \frac{e^{\beta h}-{e}^{-\beta h}}{e^{\beta h}+{e}^{-\beta h}}, $$

where Dist(A, B) is the number of edges separating A and B, h is the depth of LCS of A and B, α and β are parameters scaling the contribution of Dist(A, B) and h, α ≥ 0 and β > 0.
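To make the path-based family concrete, the short Python sketch below implements the Rada, Leacock–Chodorow, and Li et al. measures on a toy is-a hierarchy. The hierarchy, the α and β values, and the Distance_max/Max_depth stand-ins are illustrative assumptions, not part of the original proposals.

```python
# Illustrative sketch (not the original authors' code) of three path-based
# measures on a toy is-a hierarchy encoded as a child -> parent map.
import math

PARENT = {"animal": "entity", "plant": "entity",
          "dog": "animal", "cat": "animal", "rose": "plant"}

def ancestors(c):
    """Return [c, parent(c), ..., root]."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def dist(a, b):
    """Dist(A, B): number of edges on the shortest path through the LCS."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    common = set(anc_a) & set(anc_b)
    return min(anc_a.index(c) + anc_b.index(c) for c in common)

def depth(c):
    return len(ancestors(c)) - 1

MAX_DEPTH = max(depth(c) for c in PARENT)   # assumed stand-in for Max_depth
DISTANCE_MAX = 2 * MAX_DEPTH                # longest possible path in the toy tree

def sim_rada(a, b):
    return 2 * DISTANCE_MAX - dist(a, b)    # Sim = 2 * Distance_max - Dist(A, B)

def sim_leacock_chodorow(a, b):
    return -math.log(dist(a, b) / (2 * MAX_DEPTH))   # assumes Dist(A, B) > 0

def sim_li(a, b, alpha=0.2, beta=0.6):
    lcs = next(c for c in ancestors(a) if c in set(ancestors(b)))
    h = depth(lcs)                          # depth of the LCS
    # tanh(beta*h) equals (e^{beta h} - e^{-beta h}) / (e^{beta h} + e^{-beta h})
    return math.exp(-alpha * dist(a, b)) * math.tanh(beta * h)

print(sim_rada("dog", "cat"), sim_leacock_chodorow("dog", "cat"), sim_li("dog", "cat"))
```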

2.2 Feature-based similarity measures

Feature-based methods assume that concepts can be represented as sets of features. They assess the similarity of concepts based on the commonalities among their feature sets: any increase in common features among concepts results in a higher similarity score and any decrease in shared features results in lower levels of similarity [51, 80]. For discrete-valued vectors, similarity measures are inspired by the comparison of sets and their cardinalities. Some common set-inspired similarity measures for discrete-valued vectors include [45]:

$$ \mathrm{Jaccard}\ \mathrm{coefficient}\ Jaccard\left(A,B\right)=\frac{\mid A\cap B\mid }{\mid A\cup B\mid }, $$
$$ \mathrm{Dice}\ \mathrm{coefficient}\ Dice\left(A,B\right)=\frac{2\times \mid A\cap B\mid }{\left|A\right|+\mid B\mid }, $$
$$ \mathrm{Salton}\ \mathrm{Cosine}\ \mathrm{coefficient}\ SaltonCosine\left(A,B\right)=\frac{\mid A\cap B\mid }{\sqrt{\left|A\right|\times \mid B\mid }}, $$

where A and B denote the sets of features that correspond to concepts a and b.

The Tversky ratio model [80] is defined as a weighted variant over the complement of the symmetric difference between the feature sets of two concepts and considers the distinctive characteristics of each concept (the features of one concept which are not part of the other):

$$ Tversky\left(A,B\right)=\frac{\mid A\cap B\mid }{\left|A\cap B\right|+\alpha \left|A-B\right|+\beta \mid B-A\mid}\mathrm{for}\ \alpha, \beta >0, $$

where α and β represent the relative contribution of unique features of A and B in the similarity value, respectively. The α and β parameters can be used to reflect the symmetric or asymmetric nature of a given context: if α = β then Tversky(A, B) = Tversky(B, A) thus, the similarity comparison is symmetric, otherwise, it is asymmetric (i.e., Tversky(A, B) ≠ Tversky(B, A)) [51].
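As a small illustration, the sketch below implements the Jaccard, Dice, and Tversky coefficients directly over Python sets; the example feature sets are invented for demonstration, and α = β = 0.5 is just one symmetric setting.

```python
# Illustrative set-based feature similarities; the feature sets are made up.
def jaccard(A, B):
    return len(A & B) / len(A | B)

def dice(A, B):
    return 2 * len(A & B) / (len(A) + len(B))

def tversky(A, B, alpha=0.5, beta=0.5):
    common = len(A & B)
    return common / (common + alpha * len(A - B) + beta * len(B - A))

features_car = {"wheel", "engine", "door", "seat"}
features_bike = {"wheel", "seat", "pedal"}
print(jaccard(features_car, features_bike))   # 2 shared features out of 5 -> 0.4
print(dice(features_car, features_bike))
print(tversky(features_car, features_bike))   # symmetric because alpha == beta
```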

From the perspective of set theory, the meaning of the Tversky measure is clear and well founded. However, the feature sets associated with each concept cannot be derived directly from an ontology, which is a serious drawback for its practical implementation [38]. With the aim of bridging this gap in the Tversky measure, Sanchez et al. [73] introduce a feature-based dissimilarity measure which uses the taxonomical ancestors of the concepts as their feature sets:

$$ Dis\left(A,B\right)={\log}_2\left(1+\frac{\left|\varphi (A)-\varphi (B)\right|+\left|\varphi (B)-\varphi (A)\right|}{\left|\varphi (A)-\varphi (B)\right|+\left|\varphi (B)-\varphi (A)\right|+\left|\varphi (A)\cap \varphi (B)\right|}\right), $$

where φ(C) = {D ∈ AllCons | C ≤ D}, AllCons is the set of concepts of a given ontology, and ≤ is a binary relation (i.e., concept subsumption).

The definition of the set of features such as the set of synonyms (called synsets in WordNet), definitions (i.e., glosses, containing textual descriptions of word senses), and the set of subconcepts (or subclasses, subcategories) is crucial in feature-based measures.

The Rodriguez and Egenhofer measure [65] is computed as the weighted sum of the similarities between the synsets, features (e.g., meronyms, attributes, etc.), and semantic neighborhoods (those linked via semantic pointers) of two concepts A and B:

$$ Sim\left(A,B\right)=w\cdotp {S}_{synsets}\left(A,B\right)+u\cdotp {S}_{features}\left(A,B\right)+v\cdotp {S}_{neighborhoods}\left(A,B\right)\ \mathrm{for}\ w,u,v\ge 0. $$

Weights assigned to w, u, and v depend on the characteristics of the ontologies. Only common specification components can be used in a similarity assessment. Their respective weights add up to 1.0.

X-Similarity [58] relies on matching between synsets and term description sets. The term description sets contain words extracted by parsing term definitions (“glosses” in WordNet or “scope notes” in MeSH). Two terms are similar if their synsets, their description sets, or the synsets of the terms in their neighborhoods (e.g., more specific and more general terms) are lexically similar. The similarity function is expressed as follows:

$$ \kern1em Sim\left(A,B\right)=\left\{\begin{array}{c}1,\kern9.5em \mathrm{if}\ {S}_{synsets}\left(A,B\right)>0\\ {}\max \left\{{S}_{neighborhoods}\left(A,B\right),{S}_{descriptions}\left(A,B\right)\right\},\kern0.5em \mathrm{if}\ {S}_{synsets}\left(A,B\right)=0\end{array}\right.. $$

Jiang et al. [37] investigate some feature-based approaches to the semantic similarity assessment of concepts using Wikipedia and give the following framework for feature-based similarity using the synonym sets, gloss sets, anchor sets, and category sets of Wikipedia concepts:

$$ Sim\left(A,B\right)={S}_{concepts}\left({S}_{synonyms}\left({Synonyms}_A,{Synonyms}_B\right),{S}_{glosses}\left({Glosses}_A,{Glosses}_B\right),{S}_{anchors}\left({Anchors}_A,{Anchors}_B\right),{S}_{categories}\left({Categories}_A,{Categories}_B\right)\right). $$
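The sketch below shows one way such component-wise feature measures could be combined, in the spirit of the weighted sum of Rodriguez and Egenhofer and the framework of Jiang et al.; the component overlap function, the toy feature sets, and the weights are all illustrative assumptions.

```python
# A possible weighted combination of component feature similarities; the
# components, weights, and toy feature sets below are assumptions.
def overlap(A, B):
    return len(A & B) / len(A | B) if A | B else 0.0

def weighted_feature_sim(a, b, weights=(0.4, 0.3, 0.3)):
    w, u, v = weights                          # w + u + v = 1.0
    return (w * overlap(a["synsets"], b["synsets"])
            + u * overlap(a["features"], b["features"])
            + v * overlap(a["neighborhood"], b["neighborhood"]))

car = {"synsets": {"car", "auto"}, "features": {"wheel", "engine"},
       "neighborhood": {"vehicle", "truck"}}
bus = {"synsets": {"bus", "autobus"}, "features": {"wheel", "engine", "aisle"},
       "neighborhood": {"vehicle", "coach"}}
print(weighted_feature_sim(car, bus))
```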

2.3 Statistical similarity measures

Statistical similarity measures incorporate statistics derived from various aspects of the underlying domain into the similarity computation [51]. Several approaches use the popularity of terms in a document as a measure of their informativeness and use this as a basis for measuring the similarity [34, 38, 42, 51, 63, 64, 71, 72]. These approaches are also known as Information Content (IC)-based measures.

Resnik [63] proposes an IC-based method which is not sensitive to the problem of varying link distances. He assumes that the information shared by two concepts is indicated by the IC of the concept that subsumes them in a semantic net (e.g., WordNet) [24]:

$$ Sim\left(A,B\right)= IC\left( LCS\left(A,B\right)\right), $$

where IC(C) = −log(p(C)) and p(C) is the probability of encountering an instance of concept C in a given corpus (e.g. Brown Corpus).

Resnik’s metric has two problems: any pair of concepts (words) with the same LCS has the same semantic similarity, and the similarity between identical concepts (words) is not equal to one [24]. To correct these problems, Lin [42] and Jiang and Conrath [35] propose their own methods. Jiang and Conrath express their metric as follows [35, 38]:

$$ {\displaystyle \begin{array}{c} Distance\left(A,B\right)= IC(A)+ IC(B)-2\times IC\left( LCS\left(A,B\right)\right)\ \mathrm{and}\\ {} Sim\left(A,B\right)=1-\frac{Distance\left(A,B\right)}{2}.\end{array}} $$

Lin’s similarity function [42] is expressed as follows:

$$ Sim\left(A,B\right)=\frac{2\times IC\left( LCS\left(A,B\right)\right)}{IC(A)+ IC(B)}. $$
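For illustration, the sketch below computes the Resnik, Lin, and Jiang–Conrath similarities from precomputed IC values on a toy taxonomy; the IC numbers are invented (and kept in [0, 1], which the 1 − Distance/2 normalization above implicitly assumes) rather than estimated from a corpus.

```python
# Illustrative IC-based measures; the taxonomy and IC values are assumptions,
# with IC normalised to [0, 1].
IC = {"entity": 0.0, "animal": 0.35, "dog": 0.81, "cat": 0.78}
PARENT = {"animal": "entity", "dog": "animal", "cat": "animal"}

def lcs(a, b):
    """Least common subsumer: first ancestor of a that also subsumes b."""
    anc_b, node = set(), b
    while node:
        anc_b.add(node)
        node = PARENT.get(node)
    node = a
    while node not in anc_b:
        node = PARENT[node]
    return node

def sim_resnik(a, b):
    return IC[lcs(a, b)]

def sim_lin(a, b):
    return 2 * IC[lcs(a, b)] / (IC[a] + IC[b])

def sim_jiang_conrath(a, b):
    distance = IC[a] + IC[b] - 2 * IC[lcs(a, b)]
    return 1 - distance / 2

print(sim_resnik("dog", "cat"), sim_lin("dog", "cat"), sim_jiang_conrath("dog", "cat"))
```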

Recently, there has been much research on IC-based semantic similarity measures [4, 34, 51]. For example, Jiang et al. [34] present several new methods for computing the IC of a concept and the similarity between two concepts drawn from the Wikipedia category structure. Since the Wikipedia category structure is a graph, the semantic similarity between concepts can naturally be assessed by extending traditional information-theoretic (i.e., IC-based) approaches.

All IC-based similarity measures require an IC model, that is, a function that assigns an IC value to each concept [38]. Besides the corpus-based IC models [24, 35, 38, 42], some intrinsic IC models have been developed. The pioneering work is the intrinsic IC model of Seco et al. [75], and several new intrinsic IC models have also been proposed [28, 49, 70, 72]. For example, in a recent work, Sanchez et al. [72] propose estimating the IC value of a concept C from the ratio between the number of leaves of the taxonomical hierarchy under C (as a measure of C’s generality) and the number of taxonomical subsumers above C, including itself (as a measure of C’s concreteness). Formally,

$$ IC(C)=-\log \left(\frac{\frac{\mid leaves(C)\mid }{\mid subsumers(C)\mid }+1}{\mathit{\max}\_ leaves+1}\right), $$

where leaves(C) is the set of concepts found at the end of the taxonomical tree under concept C and subsumers(C) is the complete set of taxonomical ancestors of C including itself. The ratio is normalized by the least informative concept (i.e., the root of the taxonomy), for which the number of leaves is the total number of leaves in the taxonomy (max_leaves) and the number of subsumers including itself is 1. To produce values in the range [0, 1] (i.e., in the same range as the original probability) and avoid log(0) values, 1 is added to the numerator and denominator.
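A minimal sketch of this intrinsic IC model is given below; the toy taxonomy is an assumption, and a leaf concept is taken to count itself as its only leaf (one common implementation convention).

```python
# Illustrative intrinsic IC in the style of Sanchez et al.; toy taxonomy assumed.
import math

PARENT = {"animal": "entity", "plant": "entity",
          "dog": "animal", "cat": "animal", "rose": "plant"}

def subsumers(c):
    """Taxonomical ancestors of c, including c itself."""
    result = {c}
    while c in PARENT:
        c = PARENT[c]
        result.add(c)
    return result

def leaves(c):
    """Leaves found under c (c itself if it has no children)."""
    children = [x for x, p in PARENT.items() if p == c]
    if not children:
        return {c}
    return set().union(*(leaves(x) for x in children))

MAX_LEAVES = len(leaves("entity"))        # total number of leaves in the taxonomy

def ic(c):
    ratio = len(leaves(c)) / len(subsumers(c))
    return -math.log((ratio + 1) / (MAX_LEAVES + 1))

print(ic("entity"), ic("animal"), ic("dog"))   # IC grows from the root towards leaves
```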

Other approaches such as pointwise mutual information (PMI) [14] and vector-based methods such as latent semantic analysis (LSA) [19] and explicit semantic analysis (ESA) [23] can be classified as statistical semantic similarity measures as they use functions of term frequency for computing the similarity [51].

2.4 Hybrid similarity measures

A number of approaches can be classified as hybrid methods: they are based on combinations of some of the methods presented above. For example, Pirro [60] presents a similarity metric combining the feature-based and information-theoretic views of similarity. In particular, the proposed metric exploits the notion of intrinsic IC, which quantifies IC values by scrutinizing how concepts are arranged in an ontological structure. Meng et al. [50] introduce a variant of the Lin measure [42]; concretely, their similarity measure is a hybrid that combines the Lin IC-based measure with a power factor based on the shortest path length between concepts. In is-a taxonomies, intrinsic IC (IIC) [75] incorporates the number of subclasses of a concept for estimating its information content: the higher the number of subclasses of a term, the lower its informativeness [51]. IIC has also been combined with feature-based [60] and edge-counting methods [61, 78]. Gao et al. [24] propose an approach to calculating the semantic similarity between word pairs based on WordNet; specifically, they present an approach to semantic similarity measurement that is based on edge counting and IC theory.

3 A framework for semantic similarity computation

To compute the semantic similarity Sim(A, B) of two concepts A and B, we first need to obtain some related information, such as synonyms or taxonomy structures of A and B, from certain knowledge sources such as WordNet [20] or domain ontologies [77]. For example, if users want to evaluate Sim(A, B) using IC-based measures, they must have a taxonomy structure T (or two homogeneous taxonomy structures T1 and T2) such that A, B ∈ T (or A ∈ T1 and B ∈ T2). If A and B belong to two different heterogeneous knowledge sources, such as A ∈ WordNet and B ∈ DBpedia, Sim(A, B) cannot be computed using existing IC-based methods. Similarly, to compute Sim(A, B), distance-based measures and feature-based measures also need some related information about A and B; if this related information comes from different knowledge sources, existing distance-based or feature-based measures also cannot compute Sim(A, B). On the other hand, when we compute Sim(A, B), the more related information about A and B we have, the more accurate the resulting value of Sim(A, B) can be. Therefore, we need to gather as much related information about A and B as possible from different knowledge sources in order to better compute Sim(A, B). For instance, we can obtain the synonyms or taxonomy structures of A (or B) from WordNet [20], domain ontologies [77], Wikipedia [34, 37], DBpedia [11], or YAGO [79]. Obviously, we have to integrate this related information about A and B, which comes from different (heterogeneous) knowledge sources. To this end, we first present the notion of the semantic representation of concepts. We then give a framework for semantic similarity computation based on this semantic representation.

3.1 Semantic representation of concepts

How should we represent a concept for semantic similarity computation? Because the semantic information of a concept may come from multiple knowledge sources and, in particular, because new knowledge sources may emerge with the development of information technology, we need a flexible way to represent the semantic information of a concept. Let us consider an example.

  • Example 1. Consider a concept C1 = Artificial Intelligence. Clearly, from WordNet, Wikipedia, and DBpedia we know that C1 ∈ WordNet, C1 ∈ Wikipedia, and C1 ∈ DBpedia. From WordNet we know that the set of synonyms of C1 is synonyms(C1) = {AI}. From Wikipedia or DBpedia we have that the set of synonyms of C1 is synonyms(C1) = {AI, Machine Intelligence, Cognitive System, Computational Rationality, Soft AI, …}. Similarly, from WordNet we also know that C1 has a taxonomy structure (tree structure) TSWordNet(C1) (see Fig. 1), and from Wikipedia or DBpedia, respectively, C1 has a taxonomy structure (graph structure) TSWikipedia(C1) (see Fig. 2) or a knowledge network (graph structure) TSDBpedia(C1) (see Fig. 3). Of course, we can also get other semantic information such as glosses for C1 from WordNet, Wikipedia, DBpedia, or YAGO.

Fig. 1
figure 1

Taxonomy structure of Artificial Intelligence in WordNet

Fig. 2
figure 2

Taxonomy structure of Artificial Intelligence in Wikipedia

Fig. 3
figure 3

Knowledge network of Artificial Intelligence in DBpedia

Consider another concept C2 = Semantic Web. From WordNet, Wikipedia, and DBpedia we know that C2 ∉ WordNet, C2 ∈ Wikipedia, and C2 ∈ DBpedia. Clearly, we cannot obtain semantic information such as the synonyms or taxonomy structure of C2 from WordNet; however, this information can be obtained from Wikipedia or DBpedia.

Now we propose the definition of semantic representation of concepts.

  • Definition 1. Let con be a concept. The semantic representation of concept con is defined as follows:

$$ con=\left\langle {SI}_1(con),{SI}_2(con),\dots, {SI}_n(con)\right\rangle, $$

where the ith semantic information SIi(con) of con (1 ≤ i ≤ n) is as below:

$$ SIi(con)=\left\langle \left\langle {KS}_{i_1}:{value}_{i_1}\right\rangle, \left\langle {KS}_{i_2}:{value}_{i_2}\right\rangle, \dots, \left\langle {KS}_{i_m}:{value}_{i_m}\right\rangle \right\rangle, $$

where \( {KS}_{i_j} \) (1 ≤ j ≤ m) means the jth knowledge source of SIi(con), and \( {value}_{i_j} \) is the value of SIi(con) from \( {KS}_{i_j} \) in 〈\( {KS}_{i_j} \):\( {value}_{i_j} \)〉.

The semantic representation of concept con can be shown in Fig. 4.

Fig. 4
figure 4

Semantic representation of concepts

To understand Definition 1, let us see a simple example.

  • Example 2. From Example 1 we have the following:

$$ Artificial\ Intelligence=\left\langle glosses\left( Artificial\ Intelligence\right), synonyms\left( Artificial\ Intelligence\right),\dots, taxonomy\ structure\left( Artificial\ Intelligence\right)\right\rangle, $$

where glosses, synonyms, …, and taxonomy structure represent the titles of all semantic information of Artificial Intelligence, and

glosses(Artificial Intelligence) = 〈〈WordNet: the branch of computer science that deal with writing computer programs that can solve problems creatively, …〉,

…,

〈Wikipedia: Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, …〉〉,

synonyms(Artificial Intelligence) = 〈〈WordNet: {AI}〉, …, 〈Wikipedia: {AI, Machine Intelligence, Cognitive System, Computational Rationality, Soft AI, …}〉〉,

taxonomy structure(Artificial Intelligence) = 〈〈WordNet: TSWordNet〉, …, 〈Wikipedia: TSWikipedia〉〉.

  • Remark 1. The semantic representation of concepts in Definition 1 is a flexible representation mechanism. On the one hand, we do not fix the number and kinds of semantic information of a concept; that is, users may add different semantic information such as hyponyms (or sub-concepts), hypernyms (or super-concepts), categories, paths, or seealso relations to the semantic representation of concepts.

On the other hand, for any semantic information of concepts, we may obtain its value from multiple knowledge sources such as WordNet, domain ontologies (e.g., MeSH [44] or SNOMED CT [40]), Wikipedia, DBpedia, or YAGO. It is worth noting that the types of the values of different semantic information may differ; for instance, the values of synonyms, glosses, and taxonomy structure are of type set, string, and tree (graph), respectively. Clearly, for some semantic information, its values from multiple knowledge sources can be integrated (merged). For example, the values of synonyms from different knowledge sources can be combined by using the union operation of set theory, and the values of glosses from multiple knowledge sources may be merged by using string concatenation. We call such semantic information operable (denoted by ⊕, see Definition 2). Of course, some semantic information, such as taxonomy structure, is inoperable.

For the sake of convenience, we use string to represent the types of values of all semantic information. Our notation for the encoding of the value v of semantic information into its representation as a string is 〈v〉 such as 〈TSWordNet〉 and 〈TSWikipedia〉.

Definition 2. Let 〈SI1(con), SI2(con), …, SIn(con)〉 be the semantic representation of a concept con, where SIi(con)=〈〈\( {KS}_{i_1} \):\( {value}_{i_1} \)〉, 〈\( {KS}_{i_2} \):\( {value}_{i_2} \)〉, …, 〈\( {KS}_{i_m} \):\( {value}_{i_m} \)〉〉. If SIi(con) is operable, its values \( {value}_{i_1} \), \( {value}_{i_2} \), …, and \( {value}_{i_m} \) from \( {KS}_{i_1} \), \( {KS}_{i_2} \), …, and \( {KS}_{i_m} \), respectively, can be merged by the following operator:

valuei=\( {value}_{i_1} \)\( {value}_{i_2} \)⊕…⊕\( {value}_{i_m} \), where ⊕ denotes integration (or combination) operator of multiple values of same type such as ∪ for sets and + for strings.

SIi(con) is extended as follows:

SIi(con)=〈〈\( {KS}_{i_1} \):\( {value}_{i_1} \)〉, 〈\( {KS}_{i_2} \):\( {value}_{i_2} \)〉, …, 〈\( {KS}_{i_m} \):\( {value}_{i_m} \)〉, 〈\( {KS}_{i_1},{KS}_{i_2},\dots, {KS}_{i_m} \): valuei〉〉.

In fact, for any {〈\( {KS}_{i_s} \):\( {value}_{i_s} \)〉, …, 〈\( {KS}_{i_t} \):\( {value}_{i_t} \)〉}⊆{〈\( {KS}_{i_1} \):\( {value}_{i_1} \)〉, 〈\( {KS}_{i_2} \):\( {value}_{i_2} \)〉, …, 〈\( {KS}_{i_m} \): \( {value}_{i_m} \)〉}, we may have the following:

valuei’=\( {value}_{i_s} \)⊕…⊕\( {value}_{i_t} \),

SIi(con) can be extended as SIi(con)=〈〈\( {KS}_{i_1} \):\( {value}_{i_1} \)〉, …, 〈\( {KS}_{i_m} \):\( {value}_{i_m} \)〉, 〈\( {KS}_{i_s},\dots, {KS}_{i_t} \): valuei’〉〉.

  • Example 3. From Example 2 we know that the glosses of Artificial Intelligence can be merged as follows:

    〈WordNet, …, Wikipedia: “the branch of computer science that deal with writing computer programs that can solve problems creatively, …”+…+“Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, …”〉.
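As a concrete (and purely illustrative) rendering of Definitions 1 and 2, the sketch below stores a concept as a nested dictionary mapping each semantic-information item to its per-knowledge-source values, and folds the values of an operable item with a user-supplied ⊕ combiner; the abbreviated values are taken from Examples 1–3, and the dictionary layout itself is an assumption.

```python
# A possible encoding of the semantic representation of a concept (Definition 1)
# and of the merge operator for operable semantic information (Definition 2).
concept_ai = {
    "glosses": {                # operable: strings, merged by concatenation
        "WordNet":   "the branch of computer science that deal with writing computer programs ...",
        "Wikipedia": "Artificial intelligence (AI), sometimes called machine intelligence, ...",
    },
    "synonyms": {               # operable: sets, merged by union
        "WordNet":   {"AI"},
        "Wikipedia": {"AI", "Machine Intelligence", "Cognitive System"},
    },
    "taxonomy structure": {     # inoperable: one structure per knowledge source
        "WordNet":   "TS_WordNet",
        "Wikipedia": "TS_Wikipedia",
    },
}

def merge(per_source_values, combine):
    """Fold the per-source values of an operable SI with the given combiner."""
    values = iter(per_source_values.values())
    merged = next(values)
    for value in values:
        merged = combine(merged, value)
    return merged

all_synonyms = merge(concept_ai["synonyms"], lambda a, b: a | b)        # set union
all_glosses = merge(concept_ai["glosses"], lambda a, b: a + " + " + b)  # concatenation (the + of Example 3)
print(all_synonyms)
```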

3.2 A framework for semantic similarity computation

Given two concepts A and B, we first need to obtain their semantic information in order to compute the semantic similarity between them. Clearly, we can get this semantic information from the semantic representations 〈SI1(A), SI2(A), …, SIn(A)〉 and 〈SI1(B), SI2(B), …, SIn(B)〉 of A and B, respectively. Because A and B carry a lot of semantic information, we can design different similarity computation methods by using different semantic information. For example, feature-based measures need features such as glosses, synonyms, hyponyms (sub-concepts), hypernyms (super-concepts), or categories, whereas IC-based measures need a certain taxonomy structure (tree structure or graph structure). To unify these similarity measures (e.g., distance-based, feature-based, or IC-based measures) between two concepts, we need a framework for semantic similarity measures.

Definition 3. Let A = 〈SI1(A), SI2(A), …, SIn(A)〉 and B = 〈SI1(B), SI2(B), …, SIn(B)〉 be the semantic representations of two concepts, where SIi(A)=〈〈\( {KS}_{i_1} \):\( {value}_{i_1} \)〉, 〈\( {KS}_{i_2} \):\( {value}_{i_2} \)〉, …, 〈\( {KS}_{i_m} \):\( {value}_{i_m} \)〉〉 and SIi(B)=〈〈\( {KS}_{i_1} \):\( {value_{i_1}}^{\prime } \)〉, 〈\( {KS}_{i_2} \):\( {value_{i_2}}^{\prime } \)〉, …, 〈\( {KS}_{i_m} \):\( {value_{i_m}}^{\prime } \)〉〉. The semantic similarity between A and B, denoted Sim(A, B), is a function Sim: CON×CON → [0, 1] and is defined as follows:

$$ Sim\left(A,B\right)={Sim}_{concepts}\left({Sim}_{SI_1}\left({ESetSI}_1,{ESetSI}_1^{\prime}\right),{Sim}_{SI_2}\left({ESetSI}_2,{ESetSI}_2^{\prime}\right),\dots, {Sim}_{SI_n}\left({ESetSI}_n,{ESetSI}_n^{\prime}\right)\right), $$

where (1) \( {Sim}_{SI_i} \)(ESetSIi, ESetSIi′) (1 ≤ i ≤ n) is the similarity measure of the semantic information SIi(A) and SIi(B); concretely, \( {Sim}_{SI_i} \) is the function \( {Sim}_{SI_i} \): SetSIi × SetSIi′ → [ai, bi], where ai, bi ∈ R+ ∪ {0}, ai ≤ bi, and R+ ∪ {0} denotes the set of non-negative real numbers.

(2) Simconcepts is the function Simconcepts: [a1, b1] × … × [an, bn] → [0, 1].

(3) CON stands for the set of all concepts, and SetSIi and SetSIi′ denote the sets of all values of the semantic information SIi(A) and SIi(B), respectively; formally, SetSIi = {〈\( {value}_{i_1} \)〉∪〈\( {value}_{i_2} \)〉∪…∪〈\( {value}_{i_m} \)〉} and SetSIi′={〈\( {value_{i_1}}^{\prime } \)〉∪〈\( {value_{i_2}}^{\prime } \)〉∪…∪〈\( {value_{i_m}}^{\prime } \)〉}, ESetSIi ∈ SetSIi, and ESetSIi′ ∈ SetSIi′.

  • Example 4. Let A and B be two concepts, A = 〈glosses(A), synonyms(A), taxonomy(A)〉 and B = 〈glosses(B), synonyms(B), taxonomy(B)〉 be semantic representation of concepts A and B, where

    $$ {\displaystyle \begin{array}{c} glosses(A)=\left\langle \left\langle WordNet:{g}_{WordNet}(A)\right\rangle, \left\langle Wikipedia:{g}_{Wikipedia}(A)\right\rangle, \left\langle DBpedia:{g}_{DBpedia}(A)\right\rangle \right\rangle, \\ {} glosses(B)=\left\langle \left\langle WordNet:{g}_{WordNet}(B)\right\rangle, \left\langle Wikipedia:{g}_{Wikipedia}(B)\right\rangle, \left\langle DBpedia:{g}_{DBpedia}(B)\right\rangle \right\rangle, \\ {} synonyms(A)=\left\langle \left\langle WordNet:{s}_{WordNet}(A)\right\rangle, \left\langle Wikipedia:{s}_{Wikipedia}(A)\right\rangle, \left\langle DBpedia:{s}_{DBpedia}(A)\right\rangle \right\rangle, \\ {} synonyms(B)=\left\langle \left\langle WordNet:{s}_{WordNet}(B)\right\rangle, \left\langle Wikipedia:{s}_{Wikipedia}(B)\right\rangle, \left\langle DBpedia:{s}_{DBpedia}(B)\right\rangle \right\rangle, \\ {} taxonomy(A)=\left\langle \left\langle WordNet:{t}_{WordNet}(A)\right\rangle, \left\langle Wikipedia:{t}_{Wikipedia}(A)\right\rangle, \left\langle DBpedia:{t}_{DBpedia}(A)\right\rangle \right\rangle, and\\ {} taxonomy(B)=\left\langle \left\langle WordNet:{t}_{WordNet}(B)\right\rangle, \left\langle Wikipedia:{t}_{Wikipedia}(B)\right\rangle, \left\langle DBpedia:{t}_{DBpedia}(B)\right\rangle \right\rangle \end{array}} $$

By Definition 3, we have the following:

$$ Sim\left(A,B\right)={Sim}_{concepts}\left({Sim}_{glosses}\left({g}_{WordNet}(A)\cup {g}_{Wikipedia}(A)\cup {g}_{DBpedia}(A),{g}_{WordNet}(B)\cup {g}_{Wikipedia}(B)\cup {g}_{DBpedia}(B)\right),{Sim}_{synonyms}\left({s}_{WordNet}(A)\cup {s}_{Wikipedia}(A)\cup {s}_{DBpedia}(A),{s}_{WordNet}(B)\cup {s}_{Wikipedia}(B)\cup {s}_{DBpedia}(B)\right),{Sim}_{taxonomy}\left({t}_{WordNet}(A)\cup {t}_{Wikipedia}(A)\cup {t}_{DBpedia}(A),{t}_{WordNet}(B)\cup {t}_{Wikipedia}(B)\cup {t}_{DBpedia}(B)\right)\right). $$

From Definition 3 and Example 4 we know that the framework for semantic similarity measures is very generic. For any similarity function \( {Sim}_{SI_i} \): SetSIi × SetSIi′ → [ai, bi], there are many concrete implementation methods. Formally, for any {〈\( {value}_{i_s} \)〉, …, 〈\( {value}_{i_t} \)〉}⊆{〈\( {value}_{i_1} \)〉, 〈\( {value}_{i_2} \)〉, …, 〈\( {value}_{i_m} \)〉}, we can define a similarity function as follows from the perspective of knowledge sources:

$$ {Sim}_{SI_i}:\kern0.5em \left\{<{value}_{i_s}>\cup \dots \cup <{value}_{it}>\right\}\times \left\{<{value}_{i_s}\hbox{'}>\cup \dots \cup <{value}_{i_t}\hbox{'}>\right\}\to \left[{a}_i,{b}_i\right] $$

For example, in Example 4, part of the definition of the function Simglosses can be given as follows:

$$ {Sim}_{glosses}:{g}_{WordNet}(A)\times {g}_{WordNet}(B)\to \left[a,b\right],\kern0.5em {g}_{WordNet}(A)\times {g}_{Wikipedia}(B)\to \left[a,b\right],\mathrm{or}\ \left({g}_{WordNet}(A)\cup {g}_{Wikipedia}(A)\right)\times \left({g}_{WordNet}(B)\cup {g}_{Wikipedia}(B)\right)\to \left[a,b\right]. $$

From the perspective of mathematical tools of semantic similarity measures, we may use different mathematical tools such as IC [63], PMI [14], LSA [19], ESA [23], or Jaccard and Dice coefficients [45] for \( {sim}_{SI_i} \)(ESetSIi, ESetSIi′) (1 ≤ i ≤ n) in Definition 3. For instance, we can define Simglosses and Simsynonyms using ESA, Jaccard or Dice coefficients, and define Simtaxonomy using IC.

Lastly, the function Simconcepts in Definition 3 is also very flexible. Generally speaking, we may implement Simconcepts by introducing some simple functions such as max, min, or average.
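To make the framework tangible, the sketch below instantiates Definition 3 with two semantic-information items (glosses and synonyms), simple word-overlap and Jaccard functions as the per-SI similarities, and max as Simconcepts; every concrete choice here (the similarity functions, the aggregator, and the toy concept data) is an assumption used only for illustration, not the prescribed instantiation.

```python
# An illustrative instantiation of the framework of Definition 3.
def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def gloss_sim(g1, g2):
    return jaccard(set(g1.lower().split()), set(g2.lower().split()))

def sim(concept_a, concept_b, aggregate=max):
    """Sim(A, B) = Sim_concepts(Sim_glosses(...), Sim_synonyms(...))."""
    merged_gloss_a = " ".join(concept_a["glosses"].values())
    merged_gloss_b = " ".join(concept_b["glosses"].values())
    merged_syn_a = set().union(*concept_a["synonyms"].values())
    merged_syn_b = set().union(*concept_b["synonyms"].values())
    scores = [gloss_sim(merged_gloss_a, merged_gloss_b),
              jaccard(merged_syn_a, merged_syn_b)]
    return aggregate(scores)       # Sim_concepts: max, min, average, ...

A = {"glosses": {"WordNet": "intelligence demonstrated by machines"},
     "synonyms": {"WordNet": {"AI"}, "Wikipedia": {"AI", "machine intelligence"}}}
B = {"glosses": {"Wikipedia": "the simulation of human intelligence by machines"},
     "synonyms": {"DBpedia": {"machine intelligence"}}}
print(sim(A, B))
```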

Now we give the implementation method of the framework for semantic similarity measures.

figure a
  • Remark 2. In Algorithm 1, the sets {SI1, SI2, …, SIn} and {KS1, KS2, …, KSm} can be specified by users or experts. The value of SIi(A) and SIi(B) may be obtained from knowledge sources automatically. In fact, we may obtain the values of SIi(A) and SIi(B) offline. If we cannot get 〈KSj:\( {value}_{i_j} \)〉 (resp., 〈KSj:\( {value}_{i_j}^{\prime } \)〉) of SIi(A) (resp., SIi(B)), we may assign 〈KSj:\( {value}_{i_j} \)〉=ϕ (resp., 〈KSj:\( {value}_{i_j}^{\prime } \)〉=ϕ). In Step (5) of Algorithm 1, we can assign lots of similarity functions for each SIi∈{SI1, SI2, …, SIn} in theory. However, we can selectively set up similarity functions according to the complementarity of knowledge sources in practical applications.

For example, let us consider knowledge sources {WordNet, Wikipedia, MeSH}. It is well known that WordNet is a large lexical database, Wikipedia is a free online encyclopedia, and MeSH is a hierarchically-organized terminology for indexing and cataloging of biomedical information. Clearly, these are three complementary knowledge sources. If we only consider semantic information glosses and taxonomy (see Example 4), we may set up the following similarity functions:

$$ {Sim}_{glosses}:{glosses}_{WordNet}(A)\times {glosses}_{WordNet}(B)\to \left[a,b\right],{Sim}_{glosses}:{glosses}_{Wikipedia}(A)\times {glosses}_{Wikipedia}(B)\to \left[a,b\right],{Sim}_{glosses}:{glosses}_{WordNet}(A)\cup {glosses}_{Wikipedia}(A)\times {glosses}_{WordNet}(B)\cup {glosses}_{Wikipedia}(B)\to \left[a,b\right],{Sim}_{taxonomy}:{taxonomy}_{WordNet}(A)\times {taxonomy}_{WordNet}(B)\to \left[a,b\right],{Sim}_{taxonomy}:{taxonomy}_{MeSH}(A)\times {taxonomy}_{MeSH}(B)\to \left[a,b\right],\mathrm{and}\kern0.17em {Sim}_{taxonomy}:{taxonomy}_{Wikipedia}(A)\times {taxonomy}_{Wikipedia}(B)\to \left[a,b\right]. $$

If A ∈ Wikipedia, A ∉ WordNet, A ∉ MeSH, B ∈ MeSH, B ∉ WordNet, and B ∉ Wikipedia, then we can also give the following similarity function:

$$ {Sim}_{taxonomy}:{taxonomy}_{Wikipedia}(A)\times {taxonomy}_{MeSH}(B)\to \left[a,b\right]. $$

Obviously, all existing methods of similarity computation can be obtained by instantiating the framework (Definition 3); that is, all existing approaches to similarity measures (including distance-based measures, feature-based measures, statistical measures, and hybrid measures, see Section 2 for more details) can result from instantiations of the framework. Concretely, existing similarity measures consider only one knowledge source such as WordNet, Wikipedia, a domain ontology, or DBpedia; thus, in Step (5) of Algorithm 1 there is only one kind of similarity function for each SIi∈{SI1, SI2, …, SIn}. Clearly, in addition to the existing similarity computation methods, we can obtain many new similarity measures by instantiating the framework; in particular, we may obtain some new approaches to similarity measures that existing methods cannot deal with by introducing multiple knowledge sources.

4 Some approaches for measuring semantic similarity

In Section 3, our framework for the semantic similarity of concepts was proposed. In this section, we give some generic and flexible approaches to similarity measures by instantiating the framework. As stated in Section 3, all existing approaches can result from instantiations of our framework; the instantiation method is as follows:

figure b

In what follows, we present some new similarity measures that existing methods cannot deal with by instantiating the framework. Similarly to existing similarity measures, we also give three families of similarity measure methods: (1) IC-based similarity measures; (2) distance-based similarity measures; and (3) feature-based similarity measures. Based on these three similarity measure families, we will naturally get hybrid similarity measures.

4.1 IC-based measures under multiple knowledge sources

In the framework in Definition 3 or Algorithm 1, to implement IC-based similarity measures, we need one or multiple taxonomy structures (tree structures or graph structures). Suppose that A and B are two concepts, KS1, KS2, …, KSm are knowledge sources, and T1, T2, …, Tm are taxonomy structures in KS1, KS2, …, KSm, respectively.

If there exists a taxonomy structure Ti (1 ≤ i ≤ m) such that A, B ∈ Ti, it is easy to get the LCS (least common subsumer) of A and B in Ti. Furthermore, we can compute Sim(A, B) by using IC-based similarity measure methods (see Section 2.3). However, if there does not exist any taxonomy structure Ti (1 ≤ i ≤ m) such that A, B ∈ Ti, that is, for any taxonomy structure Ti (1 ≤ i ≤ m), either A ∈ Ti and B ∉ Ti, or A ∉ Ti and B ∈ Ti, how should we compute Sim(A, B) by using IC-based measures (or how can we find the LCS of A and B by using KS1, …, KSm)? To solve this problem, we propose some new IC-based similarity measures for concepts.

Without loss of generality, suppose that all knowledge sources that we consider are the set AllKS = {KS1, KS2, …, KSm}, and there exist some knowledge sources KSA = {KSk, KSk+1, …, KSl} ⊆ AllKS and KSB = {KSs, KSs+1, …, KSt} ⊆ AllKS such that for any KSi ∈ KSA and KSj ∈ KSB we have the following:

A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti, where T1, T2, …, Tm are the taxonomy structures of KS1, KS2, …, KSm, respectively.

Obviously, there is no LCS for A and B in Ti (or Tj), thus, we cannot compute Sim(A, B) only by considering Ti (or Tj). Now we give some methods for Sim(A, B) by considering both Ti and Tj.

  • Definition 4. Let T be a taxonomy structure and concept subsumption (<T) be a binary relation <T: CON×CON, where CON is the set of all concepts and A <T C means that A is a subconcept of C or C is a parent concept of A in T. A <T C iff C >T A, that is, A >T C means that A is a parent concept of C or C is a subconcept of A in T. A ≤T C iff A <T C or A = C (i.e., A and C are two identical concepts). A ≥T C iff A >T C or A = C. We define the sets of subconcepts, superconcepts, hyponyms, and hypernyms of a concept A ∈ CON w.r.t. T as follows:

$$ subconcepts\kern0.55mm \left(A,T\right)=\left\{C\in \kern0.55mm CON\kern0.55mm |\ C<{}_TA\right\}; superconcepts\left(A,T\right)=\left\{C\in CON|\ C>{}_TA\right\}; hyponyms\left(A,T\right)=\left\{C\in CON|\exists {C}_1,{C}_2,\dots, {C}_{n-1},{C}_n\in CON\wedge n\ge 2\wedge {C}_1=A\wedge {C}_n=C\wedge {C}_1>{}_T{C}_2\wedge \dots \wedge {C}_{n-1}>{}_T{C}_n\wedge {C}_1\ne {C}_2\ne \dots \ne {C}_{n-1}\ne {C}_n\right\}; hypernyms\left(A,T\right)=\left\{C\in CON|\exists {C}_1,{C}_2,\dots, {C}_{n-1},{C}_n\in CON\wedge n\ge 2\wedge {C}_1=A\wedge {C}_n=C\wedge {C}_1<{}_T{C}_2\wedge \dots \wedge {C}_{n-1}<{}_T{C}_n\wedge {C}_1\ne {C}_2\ne \dots \ne {C}_{n-1}\ne {C}_n\right\}. $$

Clearly, we have that subconcepts(A, T) ⊆ hyponyms(A, T) and superconcepts(A, T) ⊆ hypernyms(A, T).

  • Definition 5. Let A, B ∈ CON be two different concepts (i.e., A ≠ B) and T be a taxonomy structure. The set of walks between A and B w.r.t. T can be defined as follows:

walks(A, B, T) = {〈C1, C2, …, Cn〉 | C1, C2, …, Cn ∈ CON ∧ C1 = A ∧ Cn = B ∧ (∀1 ≤ i < n, Ci ∈ superconcepts(Ci+1, T)) ∧ C1 ≠ C2 ≠ … ≠ Cn-1 ≠ Cn}.

  • Definition 6. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The set of common ancestors of A and B w.r.t. Ti and Tj is defined as follows:

$$ CommonAnc\left(A,B,{T}_i,{T}_j\right)=\left\{C\in CON|\ C\in hypernyms\left(A,{T}_i\right)\wedge C\in hypernyms\left(B,{T}_j\right)\right\}. $$
  • Definition 7. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The set of GCS (Good Common Subsumers) of A and B w.r.t. Ti and Tj can be defined as follows:

    $$ GCS\ \left(A,B,{T}_i,{T}_j\right)=\left\{C\in CON|C\in CommonAnc\left(A,B,{T}_i,{T}_j\right)\wedge {p}_1\in \kern0.55em walks\ \left(C,A,{T}_i\right),{p}_2\in walks\left(C,B,{T}_j\right),|{p}_1|+|{p}_2|={\min}_{D\in CommonAnc\left(A,B,{T}_i,{T}_j\right),{p}^{\prime}\in walks\left(D,A,{T}_i\right),{p}^{\hbox{'}\hbox{'}}\in walks\left(D,B,{T}_j\right)}\left\{|p\hbox{'}|+|p^{{\prime\prime} }|\right\}\right\}, $$

where |p| is the length of walk p, i.e., if p = 〈c1, c2, …, cn + 1〉, then |p| = |〈c1, c2, …, cn + 1〉| = n.
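A possible implementation sketch of Definitions 6 and 7 follows: each taxonomy is encoded (as an assumption) as a child-to-parents map, the hypernyms of a concept are collected together with their shortest walk lengths, and the GCS set keeps the common ancestors that minimize the summed walk lengths to A and B.

```python
# Illustrative computation of CommonAnc (Definition 6) and GCS (Definition 7);
# taxonomies are assumed to be given as child -> set-of-parents maps.
def hypernyms_with_depth(concept, taxonomy):
    """Map every hypernym of `concept` to its shortest upward walk length."""
    depths, frontier = {}, {concept: 0}
    while frontier:
        next_frontier = {}
        for node, d in frontier.items():
            for parent in taxonomy.get(node, ()):
                if parent not in depths or depths[parent] > d + 1:
                    depths[parent] = d + 1
                    next_frontier[parent] = d + 1
        frontier = next_frontier
    return depths

def gcs(a, b, t_i, t_j):
    anc_a = hypernyms_with_depth(a, t_i)
    anc_b = hypernyms_with_depth(b, t_j)
    common = set(anc_a) & set(anc_b)               # CommonAnc(A, B, T_i, T_j)
    if not common:
        return set()
    best = min(anc_a[c] + anc_b[c] for c in common)
    return {c for c in common if anc_a[c] + anc_b[c] == best}

# toy taxonomies: A appears only in T_i, B only in T_j
T_i = {"artificial_intelligence": {"computer_science"}, "computer_science": {"science"}}
T_j = {"semantic_web": {"world_wide_web"}, "world_wide_web": {"computer_science"},
       "computer_science": {"science"}}
print(gcs("artificial_intelligence", "semantic_web", T_i, T_j))  # {'computer_science'}
```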

Based on the GCS for two concepts in two taxonomy structures (Definition 7), we can present some new IC-based measures under multiple knowledge sources by extending traditional IC-based similarity measures (see Section 2.3) [35, 41, 42, 63, 72]. To compute semantic similarity of two concepts A and B using IC-based measures, we firstly need to give some approaches to IC computation for concepts.

  • Definition 8. Let A ∈ CON be a concept and T be a taxonomy structure. The first IC of A w.r.t. T is defined as follows:

$$ I{C}_{fir}\left(A,T\right)=1-\frac{\log\ \left(| hyponyms\left(A,T\right)|+1\right)}{\log\ \left(|{CON}_T|\right)}, $$

where CONT denotes the set of all concepts in T.

In fact, ICfir(A, T) is an extension of the IC model of Seco et al. [75].

  • Definition 9. Let A ∈ CON be a concept and T be a taxonomy structure. The depth depth(A, T) of A in T is defined as follows:

$$ depth\left(A,T\right)=\max \left\{|p|\ |\ p\in walks\left( root(T),A,T\right)\right\},\mathrm{where}\ root(T)\ \mathrm{is}\ \mathrm{the}\ \mathrm{root}\ \mathrm{of}\ T $$
  • Definition 10. Let A ∈ CON be a concept and T be a taxonomy structure. The set of leaves of A in T is defined as follows:

$$ leaves\left(A,T\right)=\left\{C\in CON|\ C\in hyponyms\left(A,T\right)\wedge hyponyms\left(C,T\right)=\phi \right\} $$

Furthermore, we define the following:

$$ maxleaves(T)= leaves\left( root(T),T\right)\ \mathrm{and}\ maxdepth(T)=\max \left\{|p|\ |\ p\in walks\left( root(T),A,T\right),A\in maxleaves(T)\right\}. $$

By extending the IC definitions of Zhou et al. [83] and Sanchez et al. [72], we can propose the following approaches to IC computation.

  • Definition 11. Let A ∈ CON be a concept and T be a taxonomy structure. The second and third ICs of A w.r.t. T are defined as follows:

    $$ I{C}_{sec}\left(A,T\right)=\gamma \left(1-\frac{\log\ \left(| hyponyms\left(A,T\right)|+1\right)}{\log\ \left(|{CON}_T|\right)}\right)+\left(1-\upgamma \right)\left(\frac{\log\ \left( depth\left(A,T\right)+1\right)}{\log\ \left( maxdepth(T)\right)}\right),\kern0.5em {IC}_{thi}\ \left(A,T\right)=-\log \left(\frac{\frac{\mid leaves\left(A,T\right)\mid }{\mid hypernyms\left(A,T\right)\cup \left\{A\right\}\mid }+1}{\mid maxleaves(T)\mid +1}\right) $$

where γ is a tuning factor that adjusts the weight of the two features involved in the IC computation. We use γ = 0.5 by default.
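The sketch below gives one possible reading of the IC models of Definitions 8 and 11 on a toy child-to-parent taxonomy; the taxonomy itself and γ = 0.5 are assumptions, and set sizes are used for the hyponym and leaf counts.

```python
# Illustrative IC models IC_fir, IC_sec and IC_thi (Definitions 8 and 11);
# the toy taxonomy is an assumption.
import math

PARENT = {"animal": "entity", "plant": "entity",
          "dog": "animal", "cat": "animal", "rose": "plant"}
ALL_CONCEPTS = set(PARENT) | set(PARENT.values())

def hyponyms(c):
    direct = {x for x, p in PARENT.items() if p == c}
    result = set(direct)
    for child in direct:
        result |= hyponyms(child)
    return result

def hypernyms(c):
    result = set()
    while c in PARENT:
        c = PARENT[c]
        result.add(c)
    return result

def depth(c):
    return len(hypernyms(c))                 # walk length from the root down to c

def leaves(c):
    return {x for x in hyponyms(c) if not hyponyms(x)}

MAX_DEPTH = max(depth(c) for c in ALL_CONCEPTS)
MAX_LEAVES = len(leaves("entity"))

def ic_fir(c):
    return 1 - math.log(len(hyponyms(c)) + 1) / math.log(len(ALL_CONCEPTS))

def ic_sec(c, gamma=0.5):
    return (gamma * ic_fir(c)
            + (1 - gamma) * math.log(depth(c) + 1) / math.log(MAX_DEPTH))

def ic_thi(c):
    ratio = len(leaves(c)) / (len(hypernyms(c)) + 1)   # subsumers include c itself
    return -math.log((ratio + 1) / (MAX_LEAVES + 1))

print(ic_fir("dog"), ic_sec("animal"), ic_thi("dog"))
```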

Now we propose some new approaches to semantic similarity measures for concepts under multiple knowledge sources by using GCS (Definition 7) and IC (Definitions 8 and 11). It is worth noting that we can obtain lots of new IC-based measures by extending traditional IC-based similarity measures. In this paper we only extend some classical IC-based measures.

  • Definition 12. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The IC-based semantic similarity SimIC1ord between A and B w.r.t. Ti and Tj can be defined as:

$$ SimIC{1}_{ord}\left(A,B, Ti, Tj\right)={\max}_{C\in GCS\left(A,B,{T}_i,{T}_j\right)}\left\{\max \left\{{IC}_{ord}\left(C,{T}_i\right),{IC}_{ord}\left(C,{T}_j\right)\right\}\right\}, $$

where ICord = ICfir, ICsec, or ICthi. For example, if ICord = ICfir, SimIC1ord means SimIC1fir.

Clearly, SimIC1ord is an extension of Resnik’s metric [63].

By extending Lin’s metric [42], we can present another similarity measure for concepts.

  • Definition 13. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The IC-based semantic similarity SimIC2ord between A and B w.r.t. Ti and Tj can be defined as:

$$ SimIC{2}_{ord}\left(A,B,{T}_i,{T}_j\right)={\max}_{C\in GCS\left(A,B,{T}_i,{T}_j\right)}\left\{\frac{2\times \max \left\{{IC}_{ord}\left(C,{T}_i\right),{IC}_{ord}\left(C,{T}_j\right)\right\}}{IC_{ord}\left(A,{T}_i\right)+{IC}_{ord}\left(B,{T}_j\right)}\right\}, $$

where ICord = ICfir, ICsec, or ICthi.

Obviously, we can also define a similarity measure SimIC3ord by extending Jiang and Conrath’s metric [35].

  • Definition 14. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The IC-based semantic similarity SimIC3ord between A and B w.r.t. Ti and Tj can be defined as:

$$ SimIC{3}_{ord}\left(A,B,{T}_i,{T}_j\right)=1-\frac{Distance\left(A,B,{T}_i,{T}_j\right)}{2}, $$

where

$$ Distance\left(A,B,{T}_i,{T}_j\right)={IC}_{ord}\left(A,{T}_i\right)+{IC}_{ord}\left(B,{T}_j\right)-2\times {\max}_{C\in GCS\left(A,B,{T}_i,{T}_j\right)}\left\{\max \left\{{IC}_{ord}\left(C,{T}_i\right),{IC}_{ord}\left(C,{T}_j\right)\right\}\right\} $$

and ICord = ICfir, ICsec, or ICthi.

From Definitions 12-14 we know that SimIC1ord, SimIC2ord, and SimIC3ord are based on two knowledge sources. In fact, we need multiple knowledge sources in practical applications in order to obtain better results. Therefore, we have to give some similarity measures for multiple knowledge sources.

  • Definition 15. Let AllTS = {T1, T2, …, Tm} be all taxonomy structures, TSA = {Tk, Tk+1, …, Tl} ⊆ AllTS and TSB = {Ts, Ts+1, …, Tt} ⊆ AllTS. For any Ti ∈ TSA and Tj ∈ TSB, we have that A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The IC-based semantic similarity measures SimIC1Mord, SimIC2Mord, and SimIC3Mord between A and B w.r.t. multiple taxonomy structures TSA and TSB can be defined as:

    $$ {\displaystyle \begin{array}{c} SimIC1{M}_{ord}\left(A,B, TSA, TSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimIC{1}_{ord}\left(A,B,{T}_i,{T}_j\right)\right\},\\ {} SimIC2{M}_{ord}\left(A,B, TSA, TSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimIC{2}_{ord}\left(A,B,{T}_i,{T}_j\right)\right\},\mathrm{and}.\\ {} SimIC3{M}_{ord}\left(A,B, TSA, TSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimIC{3}_{ord}\left(A,B,{T}_i,{T}_j\right)\right\}.\end{array}} $$

The IC-based semantic similarity measure SimIC between A and B w.r.t. TSA and TSB and all baseline measures can be defined as:

$$ SimIC\left(A,B, TSA, TSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ Sim\mathrm{I}C{1}_{ord}\left(A,B,{T}_i,{T}_j\right), SimIC{2}_{ord}\left(A,B,{T}_i,{T}_j\right), SimIC{3}_{ord}\left(A,B,{T}_i,{T}_j\right)\right\}. $$

In order to compare the values of different similarities SimIC1ord, SimIC2ord, and SimIC3ord, we normalize the value of each similarity.
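A compact sketch of Definitions 12–15 is given below. It is parameterized over two assumed helper callables, an IC model ic(concept, taxonomy) (any of ICfir, ICsec, ICthi) and a gcs(a, b, t_i, t_j) function returning the GCS set of Definition 7, so that the three extended baselines and their multi-taxonomy maximization can be expressed uniformly; the signatures are assumptions made for this sketch.

```python
# Illustrative SimIC1/SimIC2/SimIC3 (Definitions 12-14) and their extension to
# multiple taxonomy structures (Definition 15); `ic` and `gcs` are assumed
# callables, e.g. the IC models and the GCS computation sketched earlier.
def sim_ic1(a, b, t_i, t_j, ic, gcs):
    # Resnik-style: IC of the most informative good common subsumer
    return max(max(ic(c, t_i), ic(c, t_j)) for c in gcs(a, b, t_i, t_j))

def sim_ic2(a, b, t_i, t_j, ic, gcs):
    # Lin-style: shared IC relative to the ICs of A and B (the denominator is
    # constant over C, so the maximisation reduces to SimIC1)
    return 2 * sim_ic1(a, b, t_i, t_j, ic, gcs) / (ic(a, t_i) + ic(b, t_j))

def sim_ic3(a, b, t_i, t_j, ic, gcs):
    # Jiang-Conrath-style: 1 - Distance/2, with the GCS playing the LCS role
    distance = ic(a, t_i) + ic(b, t_j) - 2 * sim_ic1(a, b, t_i, t_j, ic, gcs)
    return 1 - distance / 2

def sim_ic_multi(a, b, ts_a, ts_b, ic, gcs):
    # Definition 15: maximise over all admissible taxonomy pairs and baselines
    return max(measure(a, b, t_i, t_j, ic, gcs)
               for t_i in ts_a for t_j in ts_b
               for measure in (sim_ic1, sim_ic2, sim_ic3))
```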

  • Remark 3. In Definition 15, SimIC1Mord, SimIC2Mord, and SimIC3Mord are extensions of SimIC1ord, SimIC2ord, and SimIC3ord, respectively. That is, SimIC1ord, SimIC2ord, and SimIC3ord are based on two taxonomy structures, and SimIC1Mord, SimIC2Mord, and SimIC3Mord are based on multiple taxonomy structures.

On the other hand, if we give some new IC computation approaches (e.g., ICfou), SimIC1ord, SimIC2ord, and SimIC3ord can be expanded accordingly (e.g., SimIC1fou, SimIC2fou, SimIC3fou). Furthermore, SimIC1Mord, SimIC2Mord, and SimIC3Mord also can be expanded accordingly (e.g., SimIC1Mfou, SimIC2Mfou, SimIC3Mfou). Obviously, if we consider other baseline measures, we also can obtain some new similarity measures such as SimIC4ord and SimIC4Mord by instantiating our framework.

The similarity measure SimIC can be based on multiple taxonomy structures and baseline measures, clearly, it is easy to extend SimIC when we add new similarity measures for two or multiple taxonomy structures. For example, if a new measure SimIC4ord is provided, SimIC can be expanded as follows:

$$ SimIC\left(A,B, TSA, TSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimIC{1}_{ord}\left(A,B,{T}_i,{T}_j\right), SimIC{2}_{ord}\left(A,B,{T}_i,{T}_j\right), SimIC{3}_{ord}\left(A,B,{T}_i,{T}_j\right), SimIC{4}_{ord}\left(A,B,{T}_i,{T}_j\right)\right\}. $$

Lastly, it is worth noting that the condition of Definition 15 can be relaxed as follows:

Let AllTS = {T1, T2, …, Tm} be all taxonomy structures, TSA = {Tk, Tk+1, …, Tl} ⊆ AllTS and TSB = {Ts, Ts+1, …, Tt} ⊆ AllTS. For any Ti ∈ TSA and Tj ∈ TSB, we have that A ∈ Ti and B ∈ Tj.

If TSA ∩ TSB ≠ ϕ, traditional IC-based measures under one taxonomy structure are included in the framework of Definition 15. For example, if Tu ∈ TSA ∩ TSB, SimICNord(A, B, Tu, Tu) (N = 1, 2, 3) is based on one taxonomy structure.

The relationships among all definitions of IC-based measures under multiple knowledge sources are shown in Fig. 5.

Fig. 5
figure 5

The relationships among all definitions of IC-based measures

4.2 Distance-based measures under multiple knowledge sources

Similarly to IC-based measures under multiple knowledge sources (see Section 4.1), in the framework of Definition 3 or Algorithm 1 we also need one or multiple taxonomy structures (tree structures or graph structures) in order to implement distance-based similarity measures. Assume that A and B are two concepts, KS1, KS2, …, KSm are knowledge sources, and T1, T2, …, Tm are taxonomy structures in KS1, KS2, …, KSm, respectively. Clearly, if there exists a taxonomy structure Ti (1 ≤ i ≤ m) such that A, B ∈ Ti, it is easy to compute Sim(A, B) by using distance-based similarity measures (see Section 2.1). However, if there does not exist any taxonomy structure Ti (1 ≤ i ≤ m) such that A, B ∈ Ti, we need some new distance-based similarity measures.

Let all knowledge sources be the set AllKS = {KS1, KS2, …, KSm}. Suppose that KSA = {KSk, KSk+1, …, KSl} ⊆ AllKS and KSB = {KSs, KSs+1, …, KSt} ⊆ AllKS, and for any KSi ∈ KSA and KSj ∈ KSB we have that A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. Obviously, there is no path between A and B within Ti (or Tj) alone; thus, we cannot compute Sim(A, B) by considering only Ti (or Tj). Now we give some methods for Sim(A, B) by considering both Ti and Tj.

Assume that parts of Ti and Tj are shown in Fig. 6.

Fig. 6
figure 6

Taxonomy structures Ti and Tj

Obviously, if there exists a concept C such that C ∈ Ti, C ∈ Tj, C is a super-concept of A in Ti, and C is also a super-concept of B in Tj (see Fig. 6), we can find a path between A and B across Ti and Tj; formally, the path is made up of two paths A → C (the bold solid line in Ti) and B → C (the bold solid line in Tj), that is, there are four edges between A and B in this path.

Similarly, if there exists a concept D such that D ∈ Ti, D ∈ Tj, D is a sub-concept of A in Ti, and D is also a sub-concept of B in Tj (see Fig. 6), we may also find another path between A and B across Ti and Tj; formally, the path is made up of two paths A → D (the bold dotted line in Ti) and B → D (the bold dotted line in Tj), that is, there are five edges between A and B in this new path.

Furthermore, we can compute Sim(A, B) by making use of these paths. Obviously, a problem arises here: how do we obtain the common super-concept or common sub-concept that we need, such as C and D in Fig. 6? There may be multiple common super-concepts or common sub-concepts; for example, both C and E are super-concepts of A (resp., B) in Ti (resp., Tj), and both D and F are sub-concepts of A (resp., B) in Ti (resp., Tj) in Fig. 6. Clearly, we need to find the shortest path between concepts A and B in the two taxonomy structures. To get the shortest path, we first introduce some notions.

  • Definition 16. Let T be a taxonomy structure (directed graph) and concept reachability (→T) be a binary relation →T: CON×CON, where CON is the set of all concepts and A →T C means that there is an edge e from A to C, that is, e is associated with the ordered pair (A, C) in T. A ←T C iff C →T A, that is, A ←T C means that there is an edge associated with the ordered pair (C, A) in T. We define the set of relatedconcepts of a concept A ∈ CON w.r.t. T as follows:

relatedconcepts(A, T) = {C ∈ CON | ∃C1, C2, …, Cn-1, Cn ∈ CON ∧ n ≥ 2 ∧ C1 = A ∧ Cn = C ∧ ((C1 →T C2 ∧ … ∧ Cn-1 →T Cn) ∨ (C1 ←T C2 ∧ … ∧ Cn-1 ←T Cn)) ∧ C1 ≠ C2 ≠ … ≠ Cn-1 ≠ Cn}.

  • Definition 17. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The set of common concepts of A and B w.r.t. Ti and Tj is defined as follows:

$$ CommonCon\left(A,B,{T}_i,{T}_j\right)=\left\{C\in CON|\ C\in relatedconcepts\left(A,{T}_i\right)\wedge C\in relatedconcepts\left(B,{T}_j\right)\right\}. $$
  • Definition 18. Let A, B ∈ CON be two different concepts (i.e., A ≠ B) and T be a taxonomy structure. The set of paths between A and B w.r.t. T can be defined as follows:

$$ paths\left(A,B,T\right)=\left\{\left\langle {C}_1,{C}_2,\dots, {C}_n\right\rangle |\ {C}_1,{C}_2,\dots, {C}_n\in CON\wedge {C}_1=A\wedge {C}_n=B\wedge \left(\left(\forall 1\le i<n,{C}_i\to {}_T{C}_{i+1}\right)\vee \left(\forall 1\le i<n,{C}_i\leftarrow {}_T{C}_{i+1}\right)\right)\wedge {C}_1\ne {C}_2\ne \dots \ne {C}_{n-1}\ne {C}_n\right\}. $$

Now we can give the shortest and longest paths between concepts A and B in two taxonomy structures.

  • Definition 19. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The sets of shortest paths spaths and longest paths lpaths between A and B w.r.t. Ti and Tj can be defined as follows:

$$ spaths\left(A,B,{T}_i,{T}_j\right)=\left\{\left\langle {C}_1,{C}_2,\dots, {C}_n\right\rangle \mid {C}_1,{C}_2,\dots, {C}_n\in CON\wedge {C}_1=A\wedge {C}_n=B\wedge \exists C\in CommonCon\left(A,B,{T}_i,{T}_j\right),{p}_1\in paths\left({C}_1,C,{T}_i\right),{p}_2\in paths\left({C}_n,C,{T}_j\right),\left|{p}_1\right|+\left|{p}_2\right|={\min}_{D\in CommonCon\left(A,B,{T}_i,{T}_j\right),{p}^{\prime}\in paths\left(A,D,{T}_i\right),{p}^{\prime \prime}\in paths\left(B,D,{T}_j\right)}\left\{\left|{p}^{\prime}\right|+\left|{p}^{\prime \prime}\right|\right\}\right\}, $$

$$ lpaths\left(A,B,{T}_i,{T}_j\right)=\left\{\left\langle {C}_1,{C}_2,\dots, {C}_n\right\rangle \mid {C}_1,{C}_2,\dots, {C}_n\in CON\wedge {C}_1=A\wedge {C}_n=B\wedge \exists C\in CommonCon\left(A,B,{T}_i,{T}_j\right),{p}_1\in paths\left({C}_1,C,{T}_i\right),{p}_2\in paths\left({C}_n,C,{T}_j\right),\left|{p}_1\right|+\left|{p}_2\right|={\max}_{D\in CommonCon\left(A,B,{T}_i,{T}_j\right),{p}^{\prime}\in paths\left(A,D,{T}_i\right),{p}^{\prime \prime}\in paths\left(B,D,{T}_j\right)}\left\{\left|{p}^{\prime}\right|+\left|{p}^{\prime \prime}\right|\right\}\right\}. $$

Furthermore, we define the maximum path length (distance) w.r.t. Ti and Tj as follows:

maxdistance(Ti, Tj) = max{|p| | p ∈ lpaths(A, B, Ti, Tj), ∀A ∈ Ti, ∀B ∈ Tj},

where |p| is the length of path p, i.e., if p = 〈c1, c2, …, cn+1〉, then |p| = |〈c1, c2, …, cn+1〉| = n.
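To make Definitions 16–19 concrete, the following sketch computes related concepts, common concepts, and the length of a shortest cross-taxonomy path by breadth-first search. It assumes a hypothetical representation in which each taxonomy structure is a dict mapping every concept to the set of its direct super-concepts; the function and variable names are ours, not part of the formal framework.

```python
from collections import deque

def invert(parents):
    """Derive the child map (direct sub-concepts) from the parent map."""
    children = {}
    for concept, supers in parents.items():
        for sup in supers:
            children.setdefault(sup, set()).add(concept)
    return children

def monotone_distances(concept, adj):
    """BFS distances from `concept` following edges in one direction only,
    i.e. along the purely ascending or purely descending paths of Definition 18."""
    dist, queue = {concept: 0}, deque([concept])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def related_concepts(concept, parents):
    """relatedconcepts(A, T) of Definition 16: all ancestors and descendants of A."""
    up = monotone_distances(concept, parents)
    down = monotone_distances(concept, invert(parents))
    return (set(up) | set(down)) - {concept}

def shortest_cross_path_length(A, B, Ti, Tj):
    """Length of a path in spaths(A, B, Ti, Tj) (Definition 19): the minimum,
    over the common concepts of Definition 17, of the monotone distance from A
    in Ti plus the monotone distance from B in Tj."""
    def best_distances(concept, parents):
        up = monotone_distances(concept, parents)
        down = monotone_distances(concept, invert(parents))
        return {c: min(up.get(c, float("inf")), down.get(c, float("inf")))
                for c in set(up) | set(down)}
    dist_A = best_distances(A, Ti)
    dist_B = best_distances(B, Tj)
    common = (set(dist_A) & set(dist_B)) - {A, B}  # CommonCon(A, B, Ti, Tj)
    if not common:
        return None  # no common super- or sub-concept, hence no cross path
    return min(dist_A[c] + dist_B[c] for c in common)
```

For the situation of Fig. 6, the common super-concept C yields a length of four edges (two in each taxonomy) and the common sub-concept D a length of five, so the shortest cross-taxonomy path has length 4.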

Based on the shortest path between two concepts in two taxonomy structures (Definition 19), we can present some new distance-based measures under multiple knowledge sources by extending traditional distance-based similarity measures (see Section 2.1) [25, 39, 41, 62, 82].

  • Definition 20. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The distance-based semantic similarity SimDis1 between A and B w.r.t. Ti and Tj can be defined as:

$$ SimDis1\left(A,B,{T}_i,{T}_j\right)=2\times maxdistance\left({T}_i,{T}_j\right)-\mid p\mid, $$

where p ∈ spaths(A, B, Ti, Tj).

Clearly, SimDis1 is an extension of the metric of Rada et al. [62].
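Under the same hypothetical representation, SimDis1 reduces to a one-liner over the shortest cross-taxonomy path length computed by the helper above; maxdistance(Ti, Tj) is assumed to be precomputed and passed in.

```python
def sim_dis1(A, B, Ti, Tj, max_distance):
    """SimDis1 (Definition 20): 2 * maxdistance(Ti, Tj) - |p|, where p is a
    shortest path between A and B across Ti and Tj (Definition 19)."""
    p_len = shortest_cross_path_length(A, B, Ti, Tj)
    if p_len is None:
        return 0.0  # assumption: no connecting path is treated as zero similarity
    return 2 * max_distance - p_len
```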

  • Definition 21. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The distance-based semantic similarity SimDis2 between A and B w.r.t. Ti and Tj can be defined as:

$$ SimDis2\left(A,B,{T}_i,{T}_j\right)=\frac{2\times {N}_3\left(A,B,{T}_i,{T}_j\right)}{{N}_1\left(A,{T}_i\right)+{N}_2\left(B,{T}_j\right)+2\times {N}_3\left(A,B,{T}_i,{T}_j\right)}, $$

where

$$ {N}_1\left(A,{T}_i\right)=\max \left\{|p|\mid p\in walks\left(C,A,{T}_i\right),C\in GCS\left(A,B,{T}_i,{T}_j\right)\right\}, $$
$$ {N}_2\left(B,{T}_j\right)=\max \left\{|p|\mid p\in walks\left(C,B,{T}_j\right),C\in GCS\left(A,B,{T}_i,{T}_j\right)\right\}, $$
$$ {N}_3\left(A,B,{T}_i,{T}_j\right)=\max \left\{|p|\mid p\in walks\left( root\left({T}_i\right),C,{T}_i\right)\vee p\in walks\left( root\left({T}_j\right),C,{T}_j\right),C\in GCS\left(A,B,{T}_i,{T}_j\right)\right\}. $$

Similarly to Wu and Palmer's metric [82], SimDis2 (Definition 21) is based on is-a hierarchies, where walks and GCS are defined in Section 4.1 (Definitions 5 and 7). Obviously, SimDis2 is an extension of Wu and Palmer's metric [82].
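A sketch of SimDis2 under the assumption that walks and GCS from Section 4.1 (Definitions 5 and 7, not repeated here) are available as helpers returning, respectively, the walks between two concepts in a taxonomy and the set of greatest common subsumers of A and B; len(w) below stands for the walk length |w|.

```python
def sim_dis2(A, B, Ti, Tj, walks, gcs, root):
    """SimDis2 (Definition 21), an extension of the Wu and Palmer metric.
    `walks`, `gcs` and `root` are assumed helpers corresponding to
    Definitions 5 and 7 of Section 4.1."""
    subsumers = gcs(A, B, Ti, Tj)
    n1 = max(len(w) for C in subsumers for w in walks(C, A, Ti))
    n2 = max(len(w) for C in subsumers for w in walks(C, B, Tj))
    n3 = max(len(w) for C in subsumers
             for w in list(walks(root(Ti), C, Ti)) + list(walks(root(Tj), C, Tj)))
    return 2.0 * n3 / (n1 + n2 + 2.0 * n3)
```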

We can define the following similarity measure SimDis3 by extending Leacock and Chodorow's metric [39].

  • Definition 22. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The distance-based semantic similarity SimDis3 between A and B w.r.t. Ti and Tj can be defined as:

$$ SimDis3\left(A,B,{T}_i,{T}_j\right)=-\log \left(\frac{\mid p\mid }{2\times \max \left\{|{p}_1|,|{p}_2|\right\}}\right), $$

where p ∈ spaths(A, B, Ti, Tj), p1 ∈ lrpaths(Ti), p2 ∈ lrpaths(Tj),

$$ lrpaths\left({T}_i\right)=\left\{p\mid p\in paths\left( root\left({T}_i\right),C,{T}_i\right),C\in {T}_i,|p|={\max}_{D\in {T}_i}\left\{|{p}^{\prime}|\mid {p}^{\prime}\in paths\left( root\left({T}_i\right),D,{T}_i\right)\right\}\right\}, $$
$$ lrpaths\left({T}_j\right)=\left\{p\mid p\in paths\left( root\left({T}_j\right),C,{T}_j\right),C\in {T}_j,|p|={\max}_{D\in {T}_j}\left\{|{p}^{\prime}|\mid {p}^{\prime}\in paths\left( root\left({T}_j\right),D,{T}_j\right)\right\}\right\}. $$

Similarly to the metric of Garla and Brandt [25], we can also normalize SimDis3 to the unit interval as follows.

  • Definition 23. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The distance-based semantic similarity SimDis4 between A and B w.r.t. Ti and Tj can be defined as:

$$ SimDis4\left(A,B,{T}_i,{T}_j\right)=1-\frac{\log\ \left(|p|\right)}{\log\ \left(2\times \max \left\{|{p}_1|,|{p}_2|\right\}\right)}, $$

where p ∈ spaths(A, B, Ti, Tj), p1 ∈ lrpaths(Ti), and p2 ∈ lrpaths(Tj).
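A minimal sketch of SimDis3 and its normalized variant SimDis4, reusing the hypothetical helper above; depth_i and depth_j stand for the lengths |p1| and |p2| of the longest root-to-leaf paths of Ti and Tj (the lrpaths of Definition 22), assumed precomputed.

```python
import math

def sim_dis3(A, B, Ti, Tj, depth_i, depth_j):
    """SimDis3 (Definition 22): -log(|p| / (2 * max(|p1|, |p2|)))."""
    p_len = shortest_cross_path_length(A, B, Ti, Tj)  # assumes a cross path exists
    return -math.log(p_len / (2.0 * max(depth_i, depth_j)))

def sim_dis4(A, B, Ti, Tj, depth_i, depth_j):
    """SimDis4 (Definition 23): the variant of SimDis3 normalised to [0, 1]."""
    p_len = shortest_cross_path_length(A, B, Ti, Tj)  # assumes a cross path exists
    return 1.0 - math.log(p_len) / math.log(2.0 * max(depth_i, depth_j))
```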

Obviously, we can define a similarity measure SimDis5 by extending the metric of Li et al. [41].

  • Definition 24. Let Ti and Tj be two taxonomy structures, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The distance-based semantic similarity SimDis5 between A and B w.r.t. Ti and Tj can be defined as:

$$ SimDis5\left(A,B,{T}_i,{T}_j\right)={e}^{-\alpha \times \mid p\mid}\cdotp \frac{e^{\beta h}-{e}^{-\beta h}}{e^{\beta h}+{e}^{-\beta h}}, $$

where p ∈ spaths(A, B, Ti, Tj), h = max{|p′| | p′ ∈ walks(root(Ti), C, Ti) ∨ p′ ∈ walks(root(Tj), C, Tj), C ∈ GCS(A, B, Ti, Tj)}, α ≥ 0, and β > 0. In our experiments, we use the same optimal parameters as in [41], i.e., α = 0.2 and β = 0.6.
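A sketch of SimDis5, where h (the depth of the deepest greatest common subsumer, as defined above) is assumed to be precomputed; note that (e^(βh) − e^(−βh)) / (e^(βh) + e^(−βh)) is simply tanh(βh).

```python
import math

def sim_dis5(A, B, Ti, Tj, h, alpha=0.2, beta=0.6):
    """SimDis5 (Definition 24): e^(-alpha * |p|) * tanh(beta * h), with the
    parameters alpha = 0.2 and beta = 0.6 of Li et al. [41]."""
    p_len = shortest_cross_path_length(A, B, Ti, Tj)  # assumes a cross path exists
    return math.exp(-alpha * p_len) * math.tanh(beta * h)
```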

In Definitions 20–24, SimDis1, SimDis2, SimDis3, SimDis4, and SimDis5 are based on two knowledge sources. We now give the corresponding similarity measures for multiple knowledge sources.

  • Definition 25. Let AllTS = {T1, T2, …, Tm} be the set of all taxonomy structures, TSA = {Tk, Tk + 1,…, Tl} ⊆ AllTS and TSB = {Ts, Ts + 1,…, Tt} ⊆ AllTS. For any Ti ∈ TSA and Tj ∈ TSB, we have that A ∈ Ti, A ∉ Tj, B ∈ Tj, and B ∉ Ti. The distance-based semantic similarity measures SimDis1M, SimDis2M, SimDis3M, SimDis4M, and SimDis5M between A and B w.r.t. multiple taxonomy structures TSA and TSB can be defined as:

$$ SimDis1M\left(A,B, TSA, TSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimDis1\left(A,B,{T}_i,{T}_j\right)\right\}, $$
$$ SimDis2M\left(A,B, TSA, TSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimDis2\left(A,B,{T}_i,{T}_j\right)\right\}, $$
$$ SimDis3M\left(A,B, TSA, TSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimDis3\left(A,B,{T}_i,{T}_j\right)\right\}, $$
$$ SimDis4M\left(A,B, TSA, TSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimDis4\left(A,B,{T}_i,{T}_j\right)\right\}\mathrm{and} $$
$$ SimDis5M\left(A,B, TSA, TSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimDis5\left(A,B,{T}_i,{T}_j\right)\right\}. $$

The overall distance-based semantic similarity measure SimDis between A and B w.r.t. TSA and TSB, which combines all the above measures, can be defined as:

$$ SimDis\left(A,B, TSA, TSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\Big\{ SimDis1\left(A,B,{T}_i,{T}_j\right), SimDis2\left(A,B,{T}_i,{T}_j\right), SimDis3\left(A,B,{T}_i,{T}_j\right), $$
$$ SimDis4\left(A,B,{T}_i,{T}_j\right), SimDis5\left(A,B,{T}_i,{T}_j\right)\Big\}. $$

In order to compare the values of the different similarities SimDis1, SimDis2, SimDis3, SimDis4, and SimDis5, we also normalize the value of each similarity. Similarly to Definition 15 (see Remark 3), the distance-based measure SimDis is also a generic and flexible approach.
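The maximisation in Definition 25 and the overall SimDis measure can be sketched generically as follows; each pairwise measure is assumed to be already bound to any extra parameters it needs (for example via functools.partial) and to return normalised values.

```python
from functools import partial

def sim_m(A, B, TS_A, TS_B, pair_measure):
    """Definition 25 pattern: maximum of a pairwise measure over all taxonomy
    pairs (Ti, Tj) with Ti in TS_A and Tj in TS_B."""
    return max(pair_measure(A, B, Ti, Tj) for Ti in TS_A for Tj in TS_B)

def sim_dis(A, B, TS_A, TS_B, pair_measures):
    """The overall SimDis: maximum over all (normalised) pairwise measures."""
    return max(sim_m(A, B, TS_A, TS_B, m) for m in pair_measures)

# Hypothetical usage:
# sim_dis(A, B, TS_A, TS_B,
#         [partial(sim_dis1, max_distance=20), partial(sim_dis5, h=6)])
```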

The relationships among all definitions of distance-based measures under multiple knowledge sources are shown as Fig. 7.

Fig. 7 The relationships among all definitions of distance-based measures

4.3 Feature-based measures under multiple knowledge sources

Unlike IC-based or distance-based similarity measures, feature-based measures assess similarity between concepts as a function of their properties (i.e., features). Therefore, in the framework in Definition 3 or Algorithm 1, for each concept we need one or multiple knowledge sources in order to get its properties (i.e., features). Assume that A and B are two concepts, KS1, KS2, …, and KSm are knowledge sources. Clearly, if there exists a knowledge source KSi (1 ≤ i ≤ m) such that the features of A and B can be obtained from KSi, it is easy to compute Sim(A, B) by using traditional feature-based similarity measures (see Section 2.2). However, if there does not exist any knowledge source KSi (1 ≤ i ≤ m) that can provide the features of A and B at the same time, we need some new feature-based similarity measures.

Let the set of all knowledge sources be AllKS = {KS1, KS2, …, KSm}. Suppose that KSA = {KSk, KSk + 1,…, KSl} ⊆ AllKS and KSB = {KSs, KSs + 1,…, KSt} ⊆ AllKS, and that for any KSi ∈ KSA and KSj ∈ KSB we have A ∈ KSi, A ∉ KSj, B ∈ KSj, and B ∉ KSi. Obviously, we cannot compute Sim(A, B) by considering only KSi (or KSj). We now give some methods for Sim(A, B) that consider both KSi and KSj.

  • Definition 26. Let KSi and KSj be two knowledge sources, A, B ∈ CON be two different concepts (i.e., A ≠ B), A ∈ KSi, A ∉ KSj, B ∈ KSj, and B ∉ KSi. Assume that all features that we consider are {fea1, fea2, …, fean}, i.e., the semantic representation of A and B is as follows:

$$ A=\left\{{fea}_1(A),{fea}_2(A),\dots, {fea}_n(A)\right\}\ \mathrm{and}\ B=\left\{{fea}_1(B),{fea}_2(B),\dots, {fea}_n(B)\right\}, $$

where the value of feau(A) (resp., feau(B)) (1 ≤ u ≤ n), which comes from KSi (resp., KSj), is as follows:

feau(A) = 〈KSi: \( {value}_{i_u} \)〉 (resp., feau(B) = 〈KSj: \( {value}_{j_u} \)〉).

The feature-based semantic similarity framework SimFea between A and B w.r.t. KSi and KSj can be defined as:

$$ SimFea\left(A,B,K{S}_i,K{S}_j\right)=\max \left\{{Sim}_1\left({value}_{i_1},{value}_{j_1}\right),\dots, {Sim}_n\left({value}_{i_n},{value}_{j_n}\right)\right\} $$

In this paper, we only consider four kinds of features, i.e., glosses, synonyms, hyponyms (or sub-concepts), and hypernyms (or super-concepts). Thus, SimFea is instantiated as follows:

$$ SimFea\left(A,B,{KS}_i,{KS}_j\right)=\max \left\{{Sim}_{glosses}\left({glosses}_i(A),{glosses}_j(B)\right),{Sim}_{synonyms}\left({synonyms}_i(A),{synonyms}_j(B)\right),{Sim}_{hyponyms}\left({hyponyms}_i(A),{hyponyms}_j(B)\right),{Sim}_{hypernyms}\left({hypernyms}_i(A),{hypernyms}_j(B)\right)\right\}, $$

where A = {〈KSi: glossesi(A)〉, 〈KSi: synonymsi(A)〉, 〈KSi: hyponymsi(A)〉, 〈KSi: hypernymsi(A)〉} and B = {〈KSj: glossesj(B)〉, 〈KSj: synonymsj(B)〉, 〈KSj: hyponymsj(B)〉, 〈KSj: hypernymsj(B)〉}.

Simglosses, Simhyponyms, and Simhypernyms are defined using the Jaccard index, the Dice (Sørensen) coefficient, and Salton's cosine measure (as instantiated in SimFea1–SimFea3 below). Simsynonyms is defined as follows:

$$ {Sim}_{synonyms}\left({synonyms}_i(A),{synonyms}_j(B)\right)=\left\{\begin{array}{ll}1,& \mathrm{if}\ {synonyms}_i(A)\cap {synonyms}_j(B)\ne \varnothing, \\ 0,& \mathrm{if}\ {synonyms}_i(A)\cap {synonyms}_j(B)=\varnothing .\end{array}\right. $$

Therefore, we can define the following three kinds of feature-based semantic similarity between A and B w.r.t. KSi and KSj:

$$ SimFea1\left(A,B,{KS}_i,{KS}_j\right)=\max \left\{{Sim}_{synonyms}\left({synonyms}_i(A),{synonyms}_j(B)\right), Jaccard\left({glosses}_i(A),{glosses}_j(B)\right), Jaccard\left({hyponyms}_i(A),{hyponyms}_j(B)\right), Jaccard\left({hypernyms}_i(A),{hypernyms}_j(B)\right)\right\}, $$
$$ SimFea2\left(A,B,{KS}_i,{KS}_j\right)=\max \left\{{Sim}_{synonyms}\left({synonyms}_i(A),{synonyms}_j(B)\right), Dice\left({glosses}_i(A),{glosses}_j(B)\right), Dice\left({hyponyms}_i(A),{hyponyms}_j(B)\right), Dice\left({hypernyms}_i(A),{hypernyms}_j(B)\right)\right\}, $$
$$ SimFea3\left(A,B,{KS}_i,{KS}_j\right)=\max \left\{{Sim}_{synonyms}\left({synonyms}_i(A),{synonyms}_j(B)\right), SaltonCosine\left({glosses}_i(A),{glosses}_j(B)\right), SaltonCosine\left({hyponyms}_i(A),{hyponyms}_j(B)\right), SaltonCosine\left({hypernyms}_i(A),{hypernyms}_j(B)\right)\right\}, $$

where synonymsi(A), synonymsj(B), hyponymsi(A), hyponymsj(B), hypernymsi(A), and hypernymsj(B) are sets of concepts (or terms), and glossesi(A) and glossesj(B) are sets of words extracted by parsing the glosses of A and B, respectively.
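A sketch of the feature-based measures SimFea1–SimFea3 under the assumption that the features of each concept have already been extracted from its knowledge source into a dict of four sets (glosses, synonyms, hyponyms, hypernyms); the dict layout and helper names are ours.

```python
def jaccard(X, Y):
    return len(X & Y) / len(X | Y) if X | Y else 0.0

def dice(X, Y):
    return 2.0 * len(X & Y) / (len(X) + len(Y)) if X or Y else 0.0

def salton_cosine(X, Y):
    return len(X & Y) / ((len(X) * len(Y)) ** 0.5) if X and Y else 0.0

def sim_synonyms(syn_A, syn_B):
    """1 if the synonym sets share at least one element, otherwise 0."""
    return 1.0 if syn_A & syn_B else 0.0

def sim_fea(A_feats, B_feats, set_measure):
    """SimFea1/2/3 pattern: maximum of the synonym indicator and the chosen
    set measure applied to glosses, hyponyms and hypernyms."""
    return max(
        sim_synonyms(A_feats['synonyms'], B_feats['synonyms']),
        set_measure(A_feats['glosses'], B_feats['glosses']),
        set_measure(A_feats['hyponyms'], B_feats['hyponyms']),
        set_measure(A_feats['hypernyms'], B_feats['hypernyms']),
    )

# SimFea1, SimFea2 and SimFea3 then correspond to using jaccard, dice and
# salton_cosine as the set measure, respectively.
```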

In Definition 26, SimFea1, SimFea2, and SimFea3 are based on two knowledge sources. Now we give some similarity measures for multiple knowledge sources.

  • Definition 27. Let AllKS = {KS1, KS2, …, KSm} be the set of all knowledge sources, KSA = {KSk, KSk + 1,…, KSl} ⊆ AllKS and KSB = {KSs, KSs + 1,…, KSt} ⊆ AllKS. For any KSi ∈ KSA and KSj ∈ KSB, we have that A ∈ KSi, A ∉ KSj, B ∈ KSj, and B ∉ KSi. The feature-based semantic similarity measures SimFea1M, SimFea2M, SimFea3M, SimFea4M, SimFea5M, and SimFea6M between A and B w.r.t. multiple knowledge sources KSA and KSB can be defined as:

    $$ {\displaystyle \begin{array}{c} SimFea1M\left(A,B, KSA, KSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimFea1\left(A,B,{KS}_i,{KS}_j\right)\right\}\\ {} SimFea2M\left(A,B, KSA, KSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimFea2\left(A,B,{KS}_i,{KS}_j\right)\right\}\\ {}\begin{array}{c} SimFea3M\left(A,B, KSA, KSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimFea3\left(A,B,{KS}_i,{KS}_j\right)\right\}\\ {} SimFea4M\left(A,B, KSA, KSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimFea4\left(A,B,{KS}_i,{KS}_j\right)\right\}\\ {}\begin{array}{c} SimFea5M\left(A,B, KSA, KSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimFea5\left(A,B,{KS}_i,{KS}_j\right)\right\}\\ {} SimFea6M\left(A,B, KSA, KSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\left\{ SimFea6\left(A,B,{KS}_i,{KS}_j\right)\right\}\end{array}\end{array}\end{array}} $$

where SimFea4(A, B, KSi, KSj) = max{Simsynonyms(synonyms(A), synonyms(B)), Jaccard(glosses(A), glosses(B)), Jaccard(hyponyms(A), hyponyms(B)), Jaccard(hypernyms(A), hypernyms(B))},

SimFea5(A, B, KSi, KSj) = max{Simsynonyms(synonyms(A), synonyms(B)), Dice(glosses(A), glosses(B)), Dice(hyponyms(A), hyponyms(B)), Dice(hypernyms(A), hypernyms(B))},

SimFea6(A, B, KSi, KSj) = max{Simsynonyms(synonyms(A), synonyms(B)), SaltonCosine(glosses(A), glosses(B)), SaltonCosine(hyponyms(A), hyponyms(B)), SaltonCosine(hypernyms(A), hypernyms(B))},

glosses(A) = glossesk(A)∪…∪glossesl(A),

glosses(B) = glossess(B)∪…∪glossest(B),

synonyms(A) = synonymsk(A)∪…∪synonymsl(A),

synonyms(B) = synonymss(B)∪…∪synonymst(B),

hyponyms(A) = hyponymsk(A)∪…∪hyponymsl(A),

hyponyms(B) = hyponymss(B)∪…∪hyponymst(B),

hypernyms(A) = hypernymsk(A)∪…∪hypernymsl(A),

hypernyms(B) = hypernymss(B)∪…∪hypernymst(B).

The overall feature-based semantic similarity measure SimFea between A and B w.r.t. KSA and KSB, which combines all the above measures, can be defined as:

$$ SimFea\left(A,B, KSA, KSB\right)={\max}_{i=k}^l{\max}_{j=s}^t\Big\{ SimFea1\left(A,B,{KS}_i,{KS}_j\right), SimFea2\left(A,B,{KS}_i,{KS}_j\right), SimFea3\left(A,B,{KS}_i,{KS}_j\right), SimFea4\left(A,B,{KS}_i,{KS}_j\right), SimFea5\left(A,B,{KS}_i,{KS}_j\right), SimFea6\left(A,B,{KS}_i,{KS}_j\right)\Big\}. $$

In order to compare the values of different similarities SimFea1, SimFea2, SimFea3, SimFea4, SimFea5, and SimFea6, we also normalize the value of each similarity. Similarly to Definitions 15 and 25, the feature-based measure SimFea is also a generic and flexible approach.
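A sketch of the multi-source variants: for SimFea4M–SimFea6M the features of each concept are first unioned over all of its knowledge sources (Definition 27), after which the pairwise computation above is reused; the per-source feature dicts and their layout are again our assumption.

```python
FEATURE_KEYS = ('glosses', 'synonyms', 'hyponyms', 'hypernyms')

def union_features(per_source_feats):
    """Union the feature sets of a concept over all of its knowledge sources,
    e.g. glosses(A) = glosses_k(A) ∪ … ∪ glosses_l(A) in Definition 27."""
    merged = {key: set() for key in FEATURE_KEYS}
    for feats in per_source_feats:
        for key in FEATURE_KEYS:
            merged[key] |= feats.get(key, set())
    return merged

def sim_fea_m(A_sources, B_sources, set_measure):
    """SimFea4M/5M/6M pattern: apply the pairwise measure to the unioned
    feature sets; the maximisation over source pairs collapses because the
    unions no longer depend on a particular (KSi, KSj)."""
    return sim_fea(union_features(A_sources), union_features(B_sources), set_measure)
```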

The relationships among all definitions of feature-based measures under multiple knowledge sources are shown as Fig. 8.

Fig. 8 The relationships among all definitions of feature-based measures

So far, some generic and flexible approaches to similarity measures of concepts (including IC-based, distance-based, and feature-based measures) have been presented. As stated in Section 1, semantic similarity between concepts can be applied to many fields such as multimedia databases, multimedia encyclopedias, digital libraries, and multimedia documents. The application architecture is shown in Fig. 9.

Fig. 9 Application architecture of semantic similarity measures

5 Experiments and evaluation

In this section we discuss the evaluation problem of our similarity measures (see Section 4). Section 5.1 introduces some experimental datasets and evaluation metrics. Section 5.2 gives our experimental results. Lastly, in Section 5.3, we discuss and analyze the experimental results.

5.1 Experimental datasets and evaluation metrics

We collect several publicly available gold-standard benchmarks for evaluating concept semantic similarity, including the most commonly used conventional benchmarks and some recently updated ones. The benchmarks used in the experiments are described below.

  (1) The WS353 [22] benchmark contains 353 word pairs; 13 to 16 human subjects were asked to assign each pair a numerical similarity score between 0.0 and 10.0 (0 means totally unrelated and 10 means very closely related). In fact, this benchmark measures general relatedness rather than similarity because it considers other semantic relations (e.g., antonyms are considered similar).

  (2) The WordSim-353 [2] benchmark is a subset of WS353. WS353 is divided into two subsets: the first concerns relatedness, while the second focuses on similarity. We use only the second subset, named WordSim-353, in our experiments. It contains 203 word pairs and has been identified by its authors as suitable specifically for evaluating semantic similarity.

  (3) The R&G [66] benchmark is the first and most widely used benchmark containing human assessments of word similarity. It resulted from an experiment conducted in 1965 in which a group of 51 students (all native English speakers) assessed the similarity of 65 pairs of ordinary English nouns. The 51 subjects were asked to judge the similarity of meaning of each word pair on a scale from 0.0 (completely dissimilar) to 4.0 (highly synonymous). The benchmark focuses on semantic similarity and ignores any other semantic relationships between the words.

  (4) The M&C [52] benchmark contains 30 word pairs. It replicated the R&G experiment in 1991 using a subset of 30 noun pairs, whose similarity was judged by 38 human subjects.

  (5) The Jiang-1 [37] and Jiang-2 [34] benchmarks each contain 30 pairs of real-world Wikipedia concepts. The similarity of each concept pair was assessed by 10 students and 10 teachers on a scale from 0 (semantically unrelated) to 4 (highly synonymous). After a normalization process, each of the 30 concept pairs is rated with the average of the similarity values provided by the students and the teachers. These two benchmarks can therefore be used to evaluate the accuracy of our approaches, and we use them in this work.

Each benchmark described above contains a list of triples comprising two words and a similarity score denoting the word similarity judged by humans. Concretely, we select 203 word pairs from WordSim-353, 65 word pairs from R&G, 30 word pairs from M&C, 30 word pairs from Jiang-1, and 30 word pairs from Jiang-2 in our experiments.

It is well known that an objective evaluation of the accuracy of semantic similarity functions is difficult because the notion of similarity is subjective. Generally, similarity measures are evaluated by means of standard benchmarks of word pairs whose similarity has been assessed by a group of human experts [37]. However, in this paper we evaluate new approaches that measure similarity under multiple knowledge sources, a setting that existing similarity computation methods cannot deal with (traditional methods are generally based on one knowledge source). In particular, for any word pair (or concept pair) (A, B), A and B belong to different knowledge sources, whereas in traditional methods A and B belong to the same knowledge source. Therefore, comparing the proposed methods against standard benchmarks imposes some challenges and requires some modifications and adjustments in order to make such a comparison meaningful. The comparative experiments have been grouped into three parts.

Firstly, we evaluate our methods over five benchmarks, namely M&C, R&G, WordSim-353, Jiang-1, and Jiang-2, and two kinds of knowledge sources, namely Wikipedia and WordNet. To evaluate our methods objectively, for any concept pair (A, B) we require that the value of A comes from Wikipedia and the value of B comes from WordNet.

Secondly, we develop a benchmark Jiang-3 and then use it to evaluate the accuracy of our proposals. For comparison purposes, we select 30 pairs of real-world concepts extracted from some widely used knowledge sources, i.e., Wikipedia, WordNet, Medical Subject Headings (MeSH), Disease Ontology (DO), and Human Phenotype Ontology (HPO). Our benchmark Jiang-3 is shown in Table 1. The similarity of each concept pair was assessed by 10 students and 10 teachers in biomedical fields on a scale from 0 (semantically unrelated) to 4 (highly synonymous). After a normalization process, each of the 30 concept pairs is rated with the average of the similarity values provided by the students and the teachers. To evaluate our methods objectively, for any concept pair (A, B) we require that A ∈ Wikipedia, A ∈ WordNet, A ∉ MeSH, A ∉ DO, A ∉ HPO, B ∈ MeSH, B ∈ DO, B ∈ HPO, B ∉ Wikipedia, and B ∉ WordNet.

Table 1 Our benchmark Jiang-3

Lastly, in our benchmark Jiang-3 there are five kinds of knowledge sources, i.e., Wikipedia, WordNet, MeSH, DO, and HPO. Clearly, Wikipedia and WordNet are general-purpose knowledge sources, whereas MeSH, DO, and HPO are domain-dependent knowledge sources (biomedical ontologies). To evaluate the accuracy of our proposals in another setting, we build another two benchmarks, Jiang-4 and Jiang-5, by using the knowledge sources MeSH, DO, HPO, Gene Ontology (GO), and Ontology for Biomedical Investigations (OBI). In our benchmark Jiang-4 there are 30 pairs of real-world concepts extracted from three kinds of knowledge sources, i.e., MeSH, DO, and HPO. Jiang-4 is shown in Table 2. For any concept pair (A, B), we require that A ∈ MeSH, A ∈ HPO, A ∉ DO, B ∈ DO, B ∉ MeSH, and B ∉ HPO. In our benchmark Jiang-5 there are 30 pairs of real-world concepts extracted from three kinds of knowledge sources, i.e., MeSH, GO, and OBI. Jiang-5 is shown in Table 3. For any concept pair (C, D), we require that C ∈ MeSH, C ∉ GO, C ∉ OBI, D ∈ GO, D ∈ OBI, and D ∉ MeSH.

Table 2 Our benchmark Jiang-4
Table 3 Our benchmark Jiang-5

Different knowledge sources have different semantic information, such as concept taxonomies and distributions of instances over concepts. In this work we apply different combinations of knowledge sources to different benchmarks and express the semantics of concepts by integrating different semantic information. To further illustrate this, Fig. 10 describes the relations among the seven knowledge sources considered in our experiments and the eight benchmarks. The mark “1” on an arrow from a knowledge source to a benchmark indicates that the first concept in each pair of the benchmark is computed in the corresponding knowledge source. Similarly, the mark “2” indicates that the second concept in each pair is computed in the corresponding knowledge source. For example, the first concept in each pair of the Jiang-3 benchmark is computed on WordNet and Wikipedia, and the second concept is computed on HPO, DO, and MeSH.

Fig. 10 The relations among the seven knowledge sources (WordNet, Wikipedia, OBI, GO, MeSH, DO, and HPO) and the eight benchmarks (M&C, R&G, WordSim-353, Jiang-1, Jiang-2, Jiang-3, Jiang-4, and Jiang-5); the connections from knowledge sources to benchmarks show the components of each benchmark

The knowledge sources WordNet and Wikipedia are used in measuring the semantic similarities of concept pairs in the M&C, R&G, WordSim-353, Jiang-1, Jiang-2, and Jiang-3 benchmarks. WordNet organizes lexical information into meanings (senses) and synsets (sets of synonymous words in a specific context) [5]. Each synset has a gloss that defines the concept. Hypernymy is a relation that organizes noun synsets into a lexical inheritance taxonomy. In this taxonomy, a subordinate term inherits the basic features of its superordinate term and adds its own distinctive features to form its meaning. Wikipedia is a free, online, multilingual knowledge source that is collaboratively maintained by volunteers and known to have good coverage [30]. At the bottom of each Wikipedia page, all assigned categories are listed with links to the category pages. These categories are connected to form the Wikipedia Category Graph (WCG). Wikipedia categories and their relations do not have explicit semantics like WordNet. The Wikipedia categorization system does not form a taxonomy like the WordNet “is-a” taxonomy with a full subsumption hierarchy, but rather a thematically organized thesaurus. For example, Computer systems is categorized under the upper categories Technology systems (is-a) and Computer hardware (has-part).

The knowledge source MeSH is used in measuring the semantic similarities of concept pairs in the Jiang-3, Jiang-4, and Jiang-5 benchmarks. MeSH organizes biomedical concepts in a meaningful way with explicit semantic relations. It consists of single- and multi-word terms that are used to index and catalog the medical literature [16]. Among its relations [5], we use the MeSH “is-a” taxonomy. The knowledge sources DO and HPO are used in measuring the semantic similarities of concept pairs in the Jiang-3 and Jiang-4 benchmarks. The DO has been developed as a standardized ontology for human disease with the purpose of providing the biomedical community with consistent, reusable, and sustainable descriptions of human disease terms, phenotype characteristics, and related medical vocabulary disease concepts. The DO also semantically integrates disease and medical vocabularies through extensive cross-mapping of terms to the MeSH thesaurus. The HPO provides a structured vocabulary for phenotypic traits and their effects on commonly encountered human diseases [81]. Its aim is to offer a well-structured vocabulary for these traits so that they can be easily studied and searched in the medical sciences, raising awareness of the traits and of how they can damage a person's health and body organs. The HPO currently contains over 13,000 terms for traits and characteristics and over 156,000 annotations to hereditary diseases. Each term describes a phenotypic abnormality such as Atrial septal defect.

The knowledge sources GO and OBI are used in measuring semantic similarities of concept pairs in Jiang-5 benchmark. The GO provides an ontology to describe attributes of gene products in three non-overlapping domains of molecular biology [26]. It includes several of the world’s major repositories for plant, animal and microbial genomes. Within each ontology, terms have free text definitions and stable unique identifiers. The vocabularies are structured in a classification that supports “is-a” and “part-of” relationships. The OBI is an ontology that provides terms with precisely defined meanings to describe all aspects of how investigations in the biological and medical domains are conducted [7]. It imports parts of other biomedical ontologies such as GO, Chemical Entities of Biological Interest (ChEBI) and Phenotype Attribute and Trait Ontology (PATO) without altering their meanings. OBI is being used in a wide range of projects covering genomics, multi-omics, immunology, and catalogs of services.

The accuracy of the similarity computation is quantified by computing the correlation between the human judgments and the results provided by the computerized measures. This enables an objective evaluation of the different similarity computation methods. The correlation between two variables is the degree to which there is a relationship between them; it is usually expressed as a coefficient that measures the strength of the relationship. Our experiments use two measures of correlation: the Pearson correlation coefficient and the Spearman correlation coefficient. Pearson reflects the linear correlation between the measured results and the human judgments. Spearman compares the measured results with the human judgments based on their rankings.
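For reference, the correlation computation itself is straightforward; the sketch below uses SciPy and assumes that the human scores and the scores of one measure for a benchmark are given as equal-length lists.

```python
from scipy.stats import pearsonr, spearmanr

def evaluate(human_scores, measure_scores):
    """Pearson captures the linear correlation between a measure's scores and
    the human judgments; Spearman correlates their rankings instead."""
    pearson, _ = pearsonr(human_scores, measure_scores)
    spearman, _ = spearmanr(human_scores, measure_scores)
    return pearson, spearman

# Hypothetical usage on one benchmark:
# evaluate([3.92, 3.84, 0.42], [0.95, 0.88, 0.10])
```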

5.2 Experimental results

Regarding the environment of our evaluation, we use the Wikipedia dump released on April 20, 2018, WordNet 3.1, the 2018 release of MeSH, the GO release of May 31, 2018, the DO release of May 15, 2018, the HPO release of March 9, 2018, and the OBI release of April 29, 2016. We use JWPL (Java Wikipedia Library), Java (JDK 1.8), and MySQL to implement our algorithms, which measure similarity by the formulas given in Section 4. According to our statistics, there are 47,204 concepts in GO, 11,192 concepts in DO, 3,337 concepts in OBI, 13,544 concepts in HPO, 28,938 concepts in MeSH, 1,679,499 concepts in Wikipedia, and 147,479 concepts in WordNet.

Tables 4 and 5 report some results related to the development of the Jiang-3, Jiang-4, and Jiang-5 benchmarks. To evaluate the similarity of concepts that come from different knowledge sources, common concepts are the key factor in both the path-based and the IC-based approaches proposed in this paper. In fact, common concepts are the elements of the intersections of the corresponding concept sets of different knowledge sources. Accordingly, we list the numbers of elements in the intersections of the concept sets of the seven knowledge sources MeSH, DO, HPO, OBI, GO, Wikipedia, and WordNet in Table 4. We take different combinations of knowledge sources to build the benchmarks Jiang-3, Jiang-4, and Jiang-5, and list the numbers of concept pairs generated by the different combinations in Table 5. According to the numbers of pairs that have common ancestors or children and the numbers of pairs that perform well on all three types of approaches proposed in this paper, we adopt the last three division schemes and extract 50 concept pairs from each scheme to generate the Jiang-3, Jiang-4, and Jiang-5 benchmarks, respectively.

Table 4 Numbers of the concepts in the intersections of seven considered knowledge sources
Table 5 The details of the concept pairs in different combinations of knowledge sources

The second (M&C), third (R&G), fourth (WordSim-353), fifth (Jiang-1), sixth (Jiang-2), seventh (Jiang-3), eighth (Jiang-4), and ninth (Jiang-5) columns in Table 6 show the Pearson correlation coefficients of the different measures with human judgments.

Table 6 Results on Pearson correlation with human judgments of similarity measures

The second (M&C), third (R&G), fourth (WordSim-353), fifth (Jiang-1), sixth (Jiang-2), seventh (Jiang-3), eighth (Jiang-4), and ninth (Jiang-5) columns in Table 7 show the Spearman correlation coefficients of the different measures with human judgments.

Table 7 Results on Spearman correlation with human judgments of similarity measures

5.3 Discussion and analysis

Now we analyze and discuss the experimental results (see Tables 6 and 7) from four different aspects: (1) the influence of knowledge sources, (2) the influence of benchmarks, (3) the differences among the three kinds of measures (IC-based, distance-based, and feature-based measures), and (4) the performance of the three most generic and flexible measures: SimIC, SimDis, and SimFea.

5.3.1 Influence of knowledge sources

The results in Tables 6 and 7 show that most of the Pearson and Spearman correlation coefficients on the benchmarks M&C, R&G, Jiang-1, and Jiang-3 are better than those on the benchmarks WordSim-353, Jiang-2, Jiang-4, and Jiang-5. This indicates that domain-independent knowledge sources like Wikipedia and WordNet perform better in measuring similarities among both general and specialized concepts. The reason is that the semantic information of the concepts in M&C, R&G, Jiang-1, and Jiang-3 is computed on Wikipedia and WordNet, whereas Jiang-4 and Jiang-5 are computed on five biomedical knowledge sources. Since these are biomedical ontologies, their descriptions of the same word often differ from those of an encyclopedia. For example, the glosses of the same concept in HPO and WordNet differ from each other, and the semantic information in WordNet contains more features.

GO, DO, OBI, HPO, and MeSH are all domain-specific ontologies that describe concepts in a specialized, professional way, whereas Wikipedia and WordNet describe concepts in a more general way. This raises a problem: the features of the same concept extracted from different knowledge sources differ, and some features may even be empty. Since our methods compute the semantic similarity between concepts based on these features, this causes the differences in our results.

5.3.2 Influence of benchmarks

Eight benchmarks are used in our experiments. For the first five benchmarks, Tables 6 and 7 show that the Pearson and Spearman correlation coefficients on M&C, R&G, and Jiang-1 are relatively better than those on WordSim-353 and Jiang-2. For all concept pairs in these five benchmarks, we measure the semantic information of one concept of each pair on WordNet and that of the other concept on Wikipedia. M&C is a subset of R&G with re-labeled human judgments. All concepts in the three benchmarks M&C, R&G, and Jiang-1 are ordinary English nouns, so they are fully described both in lexical databases like WordNet and in encyclopedias like Wikipedia. The characteristics of these benchmarks (M&C, R&G, and Jiang-1) and knowledge sources (WordNet and Wikipedia) make the results on M&C, R&G, and Jiang-1 relatively good. Jiang-2 contains pairs of real-world Wikipedia concepts, and over half of them do not appear in the WordNet taxonomy structure; the correlation coefficients on the Jiang-2 benchmark do not exceed 0.5 in Tables 6 and 7. WordSim-353 is a dataset for measuring semantic relatedness between words (concepts), so the correlation coefficients for the semantic similarity task on it are not good when using WordNet and Wikipedia.

The Pearson correlation coefficients on Jiang-3 are a little better than those on Jiang-4 and Jiang-5, and the Spearman correlation coefficients are much better than those on both Jiang-4 and Jiang-5: the best Spearman correlation coefficient is higher than 0.98 on Jiang-3 but lower than 0.52 on both Jiang-4 and Jiang-5. The first reason may be that the diversity of the concept pairs in Jiang-3 is much higher, since Jiang-3 involves both biomedical ontologies and common knowledge sources, which can provide more semantic information. The second reason may be the lower completeness of the semantic information of the concepts in Jiang-4 and Jiang-5: the information contained in Jiang-4 and Jiang-5 is much more specialized, whereas that in Jiang-3 is much more extensive.

5.3.3 Influence of measures

Three kinds of measures, i.e., IC-based, distance-based, and feature-based measures, are proposed in this paper. For the IC-based measures, the measures SimIC1Mfir, SimIC2Mthi, SimIC3Mfir, and SimIC3Msec perform well on the Pearson results, while the other six measures (SimIC1Mthi, SimIC1Msec, SimIC2Mfir, SimIC2Msec, SimIC3Mthi, and SimIC) do not. Meanwhile, the measures SimIC1Mfir and SimIC2Mthi perform relatively better on the Spearman results than the other eight measures (SimIC1Msec, SimIC1Mthi, SimIC2Mfir, SimIC2Msec, SimIC3Mfir, SimIC3Msec, SimIC3Mthi, and SimIC). These four measures (SimIC1Mfir, SimIC2Mthi, SimIC3Mfir, and SimIC3Msec) involve the three approaches to IC computation and the three IC-based similarity measurement methods introduced in Section 4.1, which illustrates that all the IC-based measures are feasible if they are adopted appropriately. The measure SimIC3Msec outperforms the other measures with Pearson correlation coefficients of 0.805, 0.740, and 0.723 on M&C, R&G, and Jiang-1, respectively. For the Spearman results, the measure SimIC2Mthi outperforms the other measures with 0.644 on M&C and 0.763 on Jiang-3. These results confirm that statistical similarity measures such as IC-based measures are effective on multiple heterogeneous taxonomy structures.

For the distance-based measures, all measures obtain good correlation coefficients except SimDis1M. A major reason for the poor result of SimDis1M is that the depths of the knowledge sources differ greatly from each other: the same shortest path length considered in Definition 19 represents different similarity values in different knowledge sources. Figure 11 shows an example illustrating the different cases of the same shortest path length for two concept pairs (A, B) and (A′, B′). The shortest path between A and B in Ti and Tj has length 4, as does the shortest path between A′ and B′, but the pairs are not equally similar since the maximum depths of Ti and Tj are 10 and 5, respectively. Thus, path lengths in different taxonomy structures carry different semantic meanings. In contrast, the measure SimDis5M obtains Pearson correlation coefficients of 0.822, 0.738, 0.442, 0.699, 0.567, and 0.416 on the M&C, R&G, WordSim-353, Jiang-1, Jiang-4, and Jiang-5 benchmarks, respectively. These results show the feasibility of computing the semantic similarity of concepts from the distances among them on multiple taxonomy structures.

Fig. 11 Taxonomy structures Ti and Tj

Most of the correlation coefficients of the feature-based measures listed in Tables 6 and 7 are positive. The measures SimFea1M and SimFea2M obtain nearly the same correlation coefficients on all benchmarks, which indicates that the choice between the Jaccard and Dice set operations in the feature similarity computation influences the results only slightly. For SimFea3M, both the Pearson and the Spearman results are relatively lower than those of SimFea1M and SimFea2M. The measures SimFea4M, SimFea5M, and SimFea6M combine the features from multiple knowledge sources before computing similarities. However, comparing the performance of SimFea1M and SimFea4M on all benchmarks, we find a significant decrease from the former to the latter on M&C, R&G, and Jiang-1, but an increase on Jiang-3 and Jiang-5. Analogous decreases also appear in the measure pairs (SimFea2M, SimFea5M) and (SimFea3M, SimFea6M). This illustrates that computing similarity on aggregated features from multiple knowledge sources does not always perform better than considering the features of each knowledge source separately.

5.3.4 Our most generic and flexible approaches

The similarity computed by the measure SimIC depends entirely on the maximum similarity of the other nine IC-based measures (SimIC1Mfir, SimIC1Msec, SimIC1Mthi, SimIC2Mfir, SimIC2Msec, SimIC2Mthi, SimIC3Mfir, SimIC3Msec, and SimIC3Mthi). To reduce the deviations among different IC-based measures, we normalize the similarities of each measure before computing SimIC. However, the resulting correlation coefficients in Tables 6 and 7 are not good on any benchmark, which means it is improper to compare different IC-based measures by their similarity values and to set SimIC to the maximum similarity. The major reason may be that similarity values produced by different measures are not directly commensurable.

Similarly to SimIC discussed above, the composite measures SimDis and SimFea also have lower correlation coefficients than individual measures such as SimDis5M and SimFea2M. This shows that simply taking the maximum over different semantic similarity measures can hardly improve the performance of similarity computation.

6 Conclusion

The final goal of computerized similarity measures is to accurately mimic human judgments about semantic similarity. Similarity measures are currently used in many areas such as natural language processing, information retrieval, and word sense disambiguation. In this paper, some limitations of the existing similarity measures are identified (see Section 1): for example, there is no unified framework for existing methods, and existing approaches cannot compute the similarity of two concepts that come from two different knowledge sources. To tackle these problems, this paper presents an extensive study of the semantic similarity of concepts and a unified framework for semantic similarity computation. Based on our framework, we give some generic and flexible approaches to semantic similarity measures resulting from instantiations of the framework. In particular, by introducing multiple knowledge sources we obtain some new similarity measures that existing methods cannot provide. The evaluation, based on three widely used benchmarks and five benchmarks developed by ourselves, supports our intuitions with respect to human judgments. Some methods proposed in this paper have a good correlation with human judgments and constitute effective ways of determining the semantic similarity between concepts.

With the development of deep learning, semantic similarity measures can also be implemented by exploiting techniques such as long short-term memory (LSTM) networks and attention-based approaches combined with Word2Vec. As future work, we plan to further explore semantic similarity computation using deep learning technologies. In addition, we will theoretically and empirically investigate a unified framework for the semantic relatedness of concepts. It is also desirable to apply our similarity measures to text or short-text search tasks (semantic search for texts or short texts).