Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Semantic Web initiatives have fostered the development of large linked collections from different domains [11], as well as the collaborative definition of ontologies to semantically describe and annotate these data. Particularly, the biological and biomedical domain has been greatly benefited from these research movements, and a diversity of semantically annotated linked scientific datasets are publicly available, e.g., Chem2Bio2RDFFootnote 1, Bio2RDFFootnote 2, OpenPHACTSFootnote 3, and Linked Life DataFootnote 4. Further, expressive ontologies have been defined, e.g., the Gene Ontology (GO)Footnote 5, and they have been extensively accepted by the scientific community as standards to describe the concepts and relations, and to replace textual descriptions by controlled vocabulary terms from the ontologies. For example, GO terms are extensively used for capturing functional information of proteins and genes as indicated in the Gene Ontology Annotation (UniProt-GOA) databaseFootnote 6, and there are international initiatives to collaboratively annotate organisms, e.g., the Pseudomonas aeruginosa PAO1 genome Footnote 7.

Ontology-based annotations provide the basis to uncover novel and interesting patterns, e.g., to predict gene functions across organisms, drug-target interactions, or to suggest families of drugs that interact in the effectiveness of other drugs. Annotations are also used to determine relatedness between annotated concepts that could not be observed only using structural properties of the entities. In this direction, several annotation-based similarity measures have been defined [4, 12] and results of empirical evaluation studies suggest that considering ontology annotations can enhance the effectiveness of similarity measures [12, 14]. Nevertheless, although the great effort conducted by the biomedical and Semantic Web communities, state-of-the-art annotation-based similarity measures may not fully explote all the semantics encoded in the annotations, and imprecisely assign high values of similarity to dissimilar entities [3, 12].

Next, we illustrate the potential impact of semantics on the computation of relatedness. Figure 1 presents a taxonomy of relations (i.e., object properties) in the Gene Ontology (GO); negatively regulates (nr), positively regulates (pr), regulates (rg), is-a (sc), and part of (pf). These relations can refine a neighborhood-based similarity approach assuming that not only the neighbors of a concept influence in the similarity measure, but also the justifications that support the entailment of facts in the neighborhood. For example, even if the concepts A, B, C, and D have the same taxonomic properties, they should not be considered all equally identical, if they are related through the following relations or object properties: (i) A pf D; (ii) B nr D; and (iii) C pr D. Moreover, because nr and pr are more similar according to the object property hierarchy (See Fig. 1), both B and C must be more similar than A and B, or A and C. Additionally, existing annotation-based similarity measures do not take into account inferred facts or the justifications that support their entailment. However, considering the justifications of inferred facts may provide also insights of uncover properties required to accurately determine similarity of ontology-based annotated entities.

Fig. 1.
figure 1

GO taxonomy of object properties

We propose OnSim, a novel semantic similarity measure for ontology terms that is able to: (i) distinguish the object properties that relate ontology terms with facts in their neighborhoods; and (ii) consider inferred facts and the justifications that support their entailment.

We model OnSim as a 1-1 maximum weight bipartite matching of the neighborhoods of two ontology terms, as well as of the justifications conducted to infer facts in the neighborhoods. We extend the state-of-the-art annotation-based similarity measure AnnSim [12] with OnSim to analyze the impact of considering the semantics of the annotations. AnnSim was selected as the baseline of our evaluation because it has shown to effectively behave in a diversity of real-world datasets of genes and their GO annotations, clinical trials, and human disease benchmarks [12]. The Collaborative Evaluation of Semantic Similarity Measures (CESSM) Footnote 8 tool was used to evaluate the correlation of AnnSimOnSim with respect to domain-specific similarities considered as gold standards by the biomedical community: the ECC similarity [6], Pfam similarity [15], and Sequence Similarity SeqSim [20]. The evaluation was conducted on two collections of pairs of proteins published by the two available versions of the CESSM tool: the 2008 collection contains 13,430 pairs of proteins from UniProt-GOAFootnote 9, while the 2014 dataset comprises 22,302 pairs; annotations are from GO versions 2008 and 2014, respectively. Reported plots are produced by the CESSM tool, and reveal that AnnSimOnSim enhances the effectiveness of AnnSim by increasing the Pearson’s correlation coefficients with respect to the gold standard measures. Additionally, AnnSimOnSim is compared to eleven state-of-the-art semantic similarity measures, and it is able to outperform all these measures with respect to Pfam, while is competitive with the other two gold standard measures. Further improvements are observed in the CESSM 2014 collection, suggesting that high values of AnnSimOnSim may provide evidences of high quality annotations.

AnnSimOnSim is also used to determining relatedness among patients annotated with the Human Phenotype Ontology (HPO)Footnote 10. Patient data is produced and managed to remotely monitoring patients in the FI-STAR projectFootnote 11. FI-STAR detects anomalies in patient measurements and vital signs by exploiting semantics and Complex Event Processing (CEP) technologies. FI-STAR manages static and sensed data, as well as real-time predictions. Static data provide contextual information that improves the predictions of the system, and are represented as ontology-based annotations of the patients. Pair-wise values of AnnSimOnSim computed from static data are exploited by FI-STAR link prediction methods; the implemented hypothesis prediction establishes that patients with similar symptoms also suffer of similar diseases.

This paper is organized as follows: Sect. 2 provides a motivating example in the biomedical domain and Sect. 3 briefly describes preliminaries of our work. Section 4 presents the OnSim approach, and experimental results are reported in Sect. 5. Section 6 summarizes related research and Sect. 7 concludes.

2 Motivating Example

Figure 2 presents a portion of the neighborhoods of the GO terms adaptation of rhodopsin mediated signaling (GO:0016062), and deactivation of rhodopsin mediated signaling (GO:0016059). These terms are used to annotate entities from different collections. For example, in the UniProt-GOA datasetFootnote 12, they are used to annotate the proteins P10676 and P13217. These GO terms participate in different object properties; concretely, we observe in Fig. 2, that they occur in the object properties rg and nr, which are sub-properties of rg (Fig. 1). GO is described in OWL, which allows for representing logical axioms to describe the semantics of the object properties, e.g., include logical axioms to express transitivity or symmetry. Similarly to other biomedical ontologies, GO is continuously changing and therefore, these logical axioms may also change. In the GO version of 2008, rg is not associated with any logical axiom, while the GO 2014 version states that rg is transitive over pf. We focus on the 2008 version of GO in our motivating example, but we will see in our experimental results that more detailed definitions of logical axioms positively impact on the behavior of similarity measures. Figure 2 illustrates justifications of the inferred facts (GO:0016062 rg GO:0008150) and (GO:0016059 rg GO:0008150):

  1. 1.

    The first justification relies on: the axiom of Instantiation of SubClassOf ( sc ) over nr and the axiom of Instantiation of SubPropertyOf ( sp ) over rg. In Fig. 2, we observe that (GO:0016062 sc GO:0022401) and (GO:0022401 nr GO:0008150). Then, we can infer (GO:0016062 nr GO:0008150) by transitivity of the object property nr over sc. Finally, because nr is sub-property of rg, we can infer the fact (GO:0016062 rg GO:0008150).

  2. 2.

    This inference is justified by the axiom of Instantiation of SubClassOf ( sc ) over rg. In other way, every GO term inherits all the properties of its ancestors. The GO term GO:0050789 is an ancestor of GO:0016059, i.e., (GO:0016059 sc GO:0050789) and (GO:0050789 rg GO:0008150) hold; therefore, we infer the fact (GO:0016059 rg GO:0008150).

Existing ontology-based similarities mainly rely on taxonomic hierarchies of classes, and are not aware of these differences. For example, \(D_{tax}\) [1] and \(D_{ps}\) [13] are two taxonomic similarity measures that define similarity of two nodes in terms of the depth of the nodes to the root of class hierarchy, and the distance to their lowest common ancestor (LCA). \(D_{tax}\) and \(D_{ps}\) will assign relatively high values of similarities to GO:0016062 and GO:0016059, 0.625 and 0.55, respectively. Nevertheless, \(D_{tax}\) and \(D_{ps}\) ignore that both the neighborhoods of GO:0016062 and GO:0016059, and the justifications of their inferred facts are different. Therefore, \(D_{tax}\) and \(D_{ps}\) values may overestimate the real value of relatedness of these GO terms.

Fig. 2.
figure 2

Portion of the neighborhood from GO:0016062 and GO:0016059. Solid arrows represent stated object properties: negatively regulates (ng), regulates (rg), and is-a (sc). Dashed arrows represent inferred object properties.

3 Preliminaries

AnnSim [12] and \(D_{tax}\) [1] have exhibited effective behavior on different domains, e.g., real-world datasets of genes and their GO annotations, clinical trials, and human disease benchmarks. Thus, we rely on these measures to evaluate the effectiveness of OnSim.

Consider two entities \(e_1\) and \(e_2\) annotated with the set of ontology terms \(A_1\) and \(A_2\). Let \(BG=(A_1 \cup A_2, WE)\) be a weighted bipartite graph for set of terms \(A_1\) and \(A_2\), and \(MWBG=(A_1 \cup A_2, WE_r)\) be 1-1 maximum weight bipartite matching for BG. Intersection of sets \(A_1\) and \(A_2\) is assumed empty, i.e., in case the same ontology term t occurs in \(A_1\) and \(A_2\), both occurrences of t are seen as different terms during the construction of BG and MWBG. The annotation-based similarity AnnSim is defined as follows:

$$ AnnSim(e_1, e_2) = \frac{2*\sum _{(a_1,a_2) \in WE_r}Sim(a_1,a_2)}{|A_1| + |A_2|} $$

A 1-1 maximum weight bipartite matching [17], MWBG = ( \(A_1\cup A_2\), WE\(_r\) ) for a weighted bipartite graph BG = ( \(A_1\cup A_2\), WE), where edges are annotated with similarity Sim is as follows:

  • WE \(_r\) \(\subseteq \) WE, i.e., MWBG is a sub-graph of BG.

  • The sum of the weights of the edges in WE \(_r\) is maximized, i.e.,

    $$\begin{aligned} max \sum _{(a_1,a_2) \in { WE_r}} Sim(a_1,a_2) \end{aligned}$$
  • for each node in \(A_1\cup A_2\) there is only one incident edge in WE \({}_r\), i.e.,

    • \(\sum _{i=1}^{\mid A_1\mid } (a_i,a_j)=1, \forall j=1\dots \mid A_2\mid \)

    • \(\sum _{j=1}^{\mid A_2\mid } (a_i,a_j)=1, \forall i=1\dots \mid A_1\mid \)

\(Sim(a_1, a_2)\) is a generic similarity measure for ontology terms, but Palma et al. [12] reports on the benefits of using the taxonomic similarity \(D_{tax}\) [1]. \(D_{tax}\) computes taxonomic similarity values in terms of Lowest Common Ancestor. Given a directed graph G, the lowest common ancestor of two nodes x and y, is the node of greatest depth in G that is an ancestor of both x and y. Let d(xy) be the number of edges on the longest path between nodes x and y in a given ontology. Also let lca(xy) be the lowest common ancestor of nodes x and y, and root is the root of the class hierarchy.

$$D_{tax}(x, y) = 1- \frac{d(x, lca(x,y)) + d(y, lca(x,y))}{d(root, x) + d(root, y)}$$

4 OnSim: An Ontology Similarity Measure

OnSim is an ontology similarity measure that computes relatedness between ontology terms. OnSim not only relies on taxonomic hierarchies of the classes to decide relatedness, but also considers the neighborhoods of two terms, as well as the object properties that relate these terms with the facts in the neighborhoods and the justifications that support the entailment of these facts.

To illustrate the impact that considering additional knowledge may have on the computation of the similarity, consider the GO terms adaptation of rhodopsin mediated signaling (GO:0016062) and deactivation of rhodopsin mediated signaling (GO:0016059). As observed in Fig. 3(a) and 3(b), the neighborhoods of these terms are different, as well as the justifications that support the inference of these facts. Nevertheless, taxonomic similarity measures ignore this information and may assign relatively high values of similarity to these two terms. Contrary, OnSim detects that these two annotations are dissimilar in terms of the facts in the neighborhoods and their justifications, and assigns a lower similarity value, i.e., OnSim(GO:0016062,GO:0016059) is equal to 0.31.

Fig. 3.
figure 3

Neighborhoods of GO terms. Object properties in inferred facts are represented with Dashed Arrows. Object properties are represented in arrows of different colors

To represent neighborhoods and justifications, we define for each ontology term \(a_i\), a set \(R_{a_i}\) that represent the neighborhood of \(a_i\). Facts in the neighborhood are modeled as quadruples \(t=(a_i, a_j, r_{ij}, E_{ij})\), where \(r_{ij}\) is an object property such that there is an out-going link from \(a_i\) to \(a_j\) in the ontology, and \(E_{ij}\) is a set of the instantiations of the antecedents of the axioms used to infer the fact (\(a_i\) \(r_{ij}\) \(a_j\))Footnote 13. Thus, \(t_1\) = (GO:0016062,GO:0007165,rg, {(nr sp rg), (GO:0016062 nr GO:0007165), Ax.4}) is the quadruple that represents that the GO terms GO:0016062 and GO:0007165 are related through the object property rg (Fig. 3(b)). Further, \(t_1\) states the justification of this inferred fact; in this case axiom Ax.4 is applied, and the instantiation of the antecedent of Ax.4 is (GO:0016062 nr GO:0007165). We define a quadruple t, based on the OWL2 axioms applied in a given justification.

Definition 1

Given two ontology terms \(a_i\) and \(a_j\), and an object property \(r_{ij}\). A fact in the neighborhood of \(a_i\) establishing that \(a_i\) and \(a_j\) are related through \(r_{ij}\), i.e., (\(a_i\) \(r_{ij}\) \(a_j\)), is represented as a quadruple \(t=(a_i,a_j,r_{ij},E_{ij})\), where \(E_{ij}\) is a set of the instantiations of the antecedents of the axioms used to infer the fact (\(a_i\) \(r_{ij}\) \(a_j\)). Depending of the axioms used to inferred the fact (\(a_i\) \(r_{ij}\) \(a_j\)), the quadruple t is inductively defined as follows:

  1. 1.

    (Ax.1) Axiom of Symmetry Relation \(r_{ij}\):

    $$ \frac{ (a_i \; r_{ij} \; a_j)\; }{ (a_j \; r_{ij} \; a_i) } \; \Longrightarrow \; { t=(a_i,a_j,r_{ij},\{ (a_j \; r_{ij} \; a_i), Ax.1 \})} $$
  2. 2.

    (Ax.2) Axiom of Instantiation of SubClassOf (sc) over \(r_{ij}\):

    $$ \frac{(a_i\; sc \; a_z) \wedge (a_z\; r_{ij} \; a_j)}{ (a_i\; r_{ij} \; a_j)} \; \Longrightarrow \; { t=(a_i,a_j,r_{ij} ,\{ (a_i \; sc \; a_z), (a_z \; r_{ij} \; a_j), Ax.2\})} $$
  3. 3.

    (Ax.3) Axiom of Transitivity of SubClassOf (sc):

    $$ \frac{(a_i\; sc \; a_z) \wedge (a_z\; sc \; a_j)}{ (a_i\; sc \; a_j)} \; \Longrightarrow \; { t=(a_i,a_j,sc,\{ (a_i \; sc \; a_z), (a_z \; sc \; a_j), Ax.3\})} $$
  4. 4.

    (Ax.4) Axiom of Instantiation of SubPropertyOf (sp) over \(r_{ij}\):

    $$ \frac{(r_z\; sp \; r_{ij}) \wedge (a_i\; r_{z} \; a_j)}{ (a_i\; r_{ij} \; a_j)} \; \Longrightarrow \; { t=(a_i,a_j,r_{ij},\{ (r_z \; sp \; r_{ij}), (a_i \; r_z \; a_j), Ax.4\})} $$
  5. 5.

    (Ax.5) Axiom of Transitivity of SubPropertyOf (sp):

    $$ \frac{(a_i\; sp \; a_z) \wedge (a_z\; sp \; a_j)}{ (a_i\; sp \; a_j)} \; \Longrightarrow \; { t=(a_i,a_j,sp,\{ (a_i \; sp \; a_z), (a_z \; sp \; a_j), Ax.5\})} $$
  6. 6.

    (Ax.6) Axiom of Transitivity Relation \(r_{ij}\):

    $$ \frac{(a_i\; r_{ij} \; a_z) \wedge (a_z\; r_{ij} \; a_j)}{ (a_i\; r_{ij} \; a_j)} \; \Longrightarrow \; { t=(a_i,a_j,r_{ij},\{ (a_i \; r_{ij} \; a_z), (a_z \; r_{ij} \; a_j), Ax.6\})} $$
  7. 7.

    (Ax.7) Axiom of Transitivity of \(r_z\) over \(r_{ij}\):

    $$ \frac{(a_i \; r_z \; a_z) \wedge (a_z\; r_{ij} \; a_j)}{ (a_i\; r_{ij} \; a_j)} \; \Longrightarrow \; { t=(a_i\,a_j,r_{ij},\{ (a_i\ \; r_z \; a_z), (a_z \; r_{ij} \; a_j), Ax.7\})} $$

Inductive Case: If \(t_z=(a_z,a_k,r_{zk},E_{zk})\) is part of the neighborhood of \(a_z\), \(t_i=(a_i,a_j,r_{ij},E_{ij})\) is in the neighborhood of \(a_i\), and \((a_z\; r_{zk}\;a_k)\) \(\in \) \(E_{ij}\), then eliminate \(t_i\) from the neighborhood of \(a_i\) and add the quadruple \(t=(a_i,a_j,r_{ij},\overline{E}_{ij})\) to the neighborhood of \(a_i\), where \(\overline{E}_{ij}=(E_{ij} -\{(a_z\; r_{zk}\;a_k)\}) \cup E_{zk}\).

Let us consider the GO terms GO:0016062 and GO:0016059 in Fig. 4. The neighborhood of GO:0016062 represented by \(R_{GO:0016062}\), comprises 12 quadruples associated with GO:0016062; the quadruples \(t_{1.1}\) and \(t_{1.2}\) describe the facts (GO:0016062 rg GO:0007165) and (GO:0016062 rg GO:0008150), respectively.

  • \(t_{1.1}\) = (GO:0016062,GO:0007165,rg, {(nr sp rg), (GO:0016062 nr GO:0007165), Ax.4}).

  • \(t_{1.2}\) = (GO:0016062, GO:0008150,rg, {(nr sp rg), (GO:0016062 sc GO:0022401), (GO:0022401 nr GO:0008150), Ax.2, Ax.4}.

Note that the quadruple \(t_{1.2}\) represents the information of the justification of the fact (GO:0016062 rg GO:0008150), where more than one axiom support the inference, and the inductive definition of a quadruple (Definition 1) is applied to generate the quadruple, i.e., the justification is as follows:

$$\begin{aligned}&\qquad \mathrm{(GO{:}0016062} \, \mathtt{sc} \, \mathrm{GO{:}0022401)} \wedge \mathrm{(GO{:}0022401} \, \mathtt{nr} \,\mathrm{GO{:}0008150)} \\ \Rightarrow&\qquad {<} \mathrm{Ax.2,} \, \mathrm{(A} ~ \mathtt{sc} ~ \mathrm{B)} \wedge \mathrm{(B \, r \, C)} \Rightarrow \mathrm{(A \, r \, C)} > \\&\qquad (\mathtt{nr} ~ \mathtt{sp} ~ \mathtt{rg}) \wedge \mathrm{(GO{:}0016062} ~ \mathtt{nr} ~ \mathrm{GO{:}0008150)}\\ \Rightarrow&\qquad {<} \mathrm{Ax.4}, (r_i ~ \mathtt{sp} ~ r_j) \wedge \mathrm{(B} ~ r_i ~ \mathrm{C)} \Rightarrow \mathrm{(B} ~ r_j ~ \mathrm{C)} > \\&\qquad \mathrm{(GO{:}0016062} ~ \mathtt{rg} ~ \mathrm{GO{:}0008150)} \end{aligned}$$

Similarly, \(R_{GO:0016059}\) describes the neighborhood of GO:0016059 and comprises 14 quadruples. The quadruple \(t_{2.1}\) represents the fact (GO:0016059 rg GO:0008150):

  • \(t_{2.1}\) = (GO:0016059,GO:0008150,rg, {(GO:0016059 sc GO:0050789), (GO:0050789 rg GO:0008150), Ax.2}).

Given two quadruples, \(t_{1i}=(a_1,a_i, r_{1i},E_{1i})\) and \(t_{2j}=(a_2,a_j, r_{2j},E_{2j})\), the similarity of two quadruples \(Sim(t_{1i},t_{2j})\) is defined as the product triangular norm, TN, that combines the taxonomic similarity of \(t_{1i}\) and \(t_{2j}\) with the similarity of the sets \(E_{1i}\) and \(E_{2j}\) of justifications, \(Sim_{justifications}(E_{1i},E_{2j})\). An item \(it_i\) in a justification can be an axiom identifier, or an RDF triple (b\(_i\) \(p_i\) c\(_i\)) that denotes the instantiation of one of the antecedents of the axiom. For example, the justification of the quadruple \(t_{1.1}\) = (GO:0016062,GO:0007165,rg, {(nr sp rg), (GO:0016062 nr GO:0007165), Ax.4}) is a set that comprises three items; two items are RDF triples (nr sp rg) and (GO:0016062 nr GO:0007165), and the other item is the identifier of the applied axiom, i.e., Ax.4. The similarity of two justification items \(it_{i}=(b_i\; p_i \; c_i)\) and \(it_{j}=(b_j \; p_j \; c_j)\), named \(Sim_{justification}(it_{i}, it_{j})\), is defined as a product triangular norm that combines three taxonomic similarities: \(D_{tax}(b_i,b_j)\), \(D_{tax}(p_i, p_j)\), and \(D_{tax}(c_{i}, c_{j})\). Further, the similarity of the same axiom identifier is 1.0, while two different axioms are dissimilar, i.e., their similarity value is 0.0.

In our running example, if the taxonomic similarity is \(D_{tax}\) [1], the similarity of the justification items \(it_1\) = (GO:0016062 nr GO:0007165) and \(it_2\) = (GO:0050789 rg GO:0008150) is 0.12, where

  • \(D_{tax}\)(GO:0016062,GO:0050789) is 0.55;

  • \(D_{tax}\)(nr,rg) is 0.67;

  • \(D_{tax}\)(GO:0007165,GO:0008150) is 0.33;

  • \(Sim_{justification}(e_1,e_2)\) = 0.55 \(\times \) 0.67 \(\times \) 0.33.

Fig. 4.
figure 4

Comparison of the justifications of quadruples \(t_{1.1}\) and \(t_{2.1}\); axiom identifiers are omitted for legibility: (a) Bi-partite graph from the pair-wise comparison of the justifications; (b) 1-1 maximum weight bipartite matching produced by the BlossomIV solver [2]

Fig. 5.
figure 5

Comparison of \(R_{GO:0016062}\) and \(R_{GO:0016059}\): 1-1 maximum weight bipartite matching produced by the BlossomIV solver [2]; Dummy Quadruples are added by the solver to find a matching that maximizes the sum of the similarity values

Two justifications \(E_{1i}\) and \(E_{2j}\) are compared based on a similarity value. Formally, the similarity of two justifications is computed from a bi-partite graph that corresponds to the 1-1 maximum weight bipartite matching of the edges in the Cartesian product of \(E_{1i}\) \(\times \) \(E_{2j}\). Figure 4 presents the 1-1 maximum weight bipartite matching of the justification sets of \(t_{1.1}\) = (GO:0016062,GO:0007165,rg, {(nr sp rg), (GO:0016062 nr GO:0007165), Ax.4}) and \(t_{2.1}\) = (GO:0016059,GO:0008150,rg, {(GO:0050789 rg GO:0008150), (GO:0016059 sc GO:0050789), Ax.2}); axiom identifiers are omitted for legibility. We apply an exact solution to the problem of computing the 1-1 maximum weight bipartite matching from a bipartite graph using the BlossomIV solver [2]. Values of justification similarity are used to compute the 1-1 maximum weight bipartite matching, and the sum of this similarity is maximized in the best matching. The time complexity of computing the 1-1 maximum weight bipartite matching is \(O(m^4)\), where m is sum of the cardinalities of sets of justifications. Once the 1-1 maximum weight bipartite matching MWBM of \(E_{1i}\) \(\times \) \(E_{2j}\) is computed, the similarity of these justifications is calculated as follows.

$$\begin{aligned} Sim_{justifications}(E_{1i}, E_{2j})=\frac{\sum \limits _{(e_{i},e_{j})\in MWBM(E_{1i},E_{2j}) }Sim_{justifications}(e_{i},e_{j})}{Max(|E_{1i}|,|E_{2j}|)} \end{aligned}$$

Particularly, the \(Sim_{justifications}\) values for the 1-1 maximum weight bipartite matching of quadruples \(t_{1.1}\) and \(t_{2.1}\) in Fig. 4 is 0.06. Finally, we compute similarity \(OnSim(a_1,a_2)\) based on the knowledge represented in quadruples \(t_{1i}\) and \(t_{2j}\) in the sets \(R_1\) and \(R_2\) associated with the ontology terms \(a_1\) and \(a_2\), respectively. First, a graph GOS = (\(R_1\cup R_2\), EOS) is a labelled bi-partite graph comprised of the nodes in the sets \(R_1\) and \(R_2\), EOS \(\subseteq \) \(R_1 \times R_2\), and edges are annotated with the similarity of the quadruples. EOS corresponds to the 1-1 maximum weight bipartite matching of the edges in the Cartesian product of \(R_1 \times R_2\).

$$ OnSim(a_1, a_2) = TN\Bigl (D_{tax}(a_1, a_2),\frac{\sum \limits _{(t_{1i},t_{2j})\in { EOS}} Sim(t_{1i},t_{2j})}{Max(|R_1|,|R_2|)}\Bigr ) $$
  • TN is a product triangular norm;

  • \(R_1\) and \(R_2\) are the sets associated with \(a_1\) and \(a_2\), respectively;

  • EOS corresponds to the 1-1 maximum weight bipartite matching of the quadruples in the Cartesian product of \(R_1\) and \(R_2\) annotated with the similarity \(Sim(t_{1i},t_{2j})\);

  • quadruples \(t_{1i}=(a_1,a_i, r_{1i},E_{1i})\) and \(t_{2j}=(a_2,a_j, r_{2j},E_{2j})\) belong to EOS; and

  • \(Sim(t_{1i},t_{2j})\) is defined as a triangular norm TN Footnote 14 that combines similarity values of the justifications of \(r_{1i},r_{2j}\) with the taxonomic similarity of \(t_{1i}\) and \(t_{2j}\).

Figure 5 presents the 1-1 maximum weight bipartite matching found by the BlossomIV solver [2] for the GO terms GO:0016062 and GO:0016059. We can observe that two dummy nodes are added to ensure that the sum of the similarity values is maximized. OnSim is computed on top of this 1-1 maximum weight bipartite matching and combined with the taxonomic similarity value of \(D_{tax}\)(GO:0016062,GO:0016059); thus, OnSim(GO:0016062,GO:0016059) corresponds to 0.488 \(\times \) 0.625 = 0.31, which is lower than the values of \(D_{tax}\) and \(D_{ps}\) reported in Sect. 2.

5 Experimental Results

The goal of the study is to evaluate the impact of OnSim on existing annotation-based similarity measures. Our research hypothesis states that because OnSim considers the neighborhood of two ontology terms, the annotation-based similarity values of entities annotated with these terms are more accurate. We conducted an empirical study on the collections of proteins published at the Collaborative Evaluation of Semantic Similarity Measures (CESSM) portals of 2008Footnote 15 and 2014Footnote 16 using Hermit 1.3.8 as the OWL reasoner. The CESSM 2008 collection contains 13,430 pairs of proteins from UniProt with 1,039 distinct proteins, while the CESSM 2014 collection comprises 22,302 pairs with 1,559 distinct proteins. Both collections are annotated with 1,908 distinct terms from the August 2008 version of GO and 3,909 distinct terms from the December 2014 version, respectively. The class hierarchy of the 2008 GO version has a maximum depth of 15 levels, while the depth of the version of 2014 increases until 17 levels. Similarly, the number of axioms grows; the 2008 version has four object properties, and one of them is transitive (Ax.6); and the 2014 version has ten object properties, three are transitive (Ax.6), and five meet the ObjectPropertyChain (Ax.7). Annotations are from UniProt-GOA, and are separated into the GO hierarchies of biological process (BP), molecular function (MF), and cellular component (CC). CESSM computes the Pearson’s correlation coefficients with respect to three similarity gold standards: ECC similarity [6], Pfam similarity [15], and Sequence Similarity SeqSim [20]. The ECC similarity assigns values between 0 and 4 that measure the number of Enzyme Comparison (ECC) digits that are shared by two genes; high values of ECC indicate that both genes share several digits and are similar. The Pfam similarity (Pfam) of two genes corresponds to the Jaccard similarity as the ratio between the number of shared Pfam families and the total number of Pfam families of the two genes. Pfam similarity values are between 0.0 and 1.0. Finally, SeqSim produces normalized values of the Sequence Similarity measure of BLAST that measures the sequence alignment of two genes or proteins; SeqSim is one of the gold standard measures for gene sequence alignment.

Eleven semantic similarity measures are compared; these similarity measures extend Resnik’s(R) [16], Lin’s(L) [9], and Jiang and Conrath’s(J) [10] measures to consider GO annotations of the compared proteins, the information content (IC) of these annotations, and pairwise combinations of common ancestors. The average combination which is labeled A, considers the average of the ICs of pairs of common ancestors. Sevilla et al. [18] apply the corresponding measure, i.e., the Resnik’s [16], Lin’s [9], and Jiang and Conrath’s [10] measures, to the maximum value of IC of pairs of common ancestors; these combined measures are distinguished with the labeled M. Measures labelled with B are combined with the best-match average of the ICs of pairs of disjunctive common ancestors (DCA) proposed by Couto et al. [4]. Finally, the set-based measures simUI (UI) and simGIC (GI) [14] apply the Jaccard index to sets of annotations together with domain-specific information. We evaluate two versions of AnnSim on the two CESSM collections: AnnSimD\(_{tax}\) relies on \(D_{tax}\) to decide the relatedness of two annotations, while AnnSimOnSim uses OnSim.

Fig. 6.
figure 6

Results are produced by the CESSM tool for GO BP terms (versions 2008 and 2014). Average values for AnnSimD\(_{tax}\) and AnnSimOnSim. The similarity measures are: simUI (UI), simGIC (GI), Resnik’s Average (RA), Resnik’s Maximum (RM), Resnik’s Best-Match Average (RB), Lin’s Average (LA), Lin’s Maximum (LM), Lin’s Best-Match Average (LB), Jiang&Conrath’s Average (JA), Jiang&Conrath’s Maximum (JM), Jiang&Conrath’s Best-Match Average (JB)

Figure 6(a)–(d) report on the comparison of SeqSim with AnnSimD\(_{tax}\), AnnSimOnSim, and the GO based extensions of the Resnik’s [16], Lin’s [9], and Jiang and Conrath’s [10] measures. Annotations are restricted to GO Biological Process (BP) terms, the richer branch of GO in terms of axioms. Plots in Fig. 6(a) and 6(b) were generated on CESSM 2008, while Fig. 6(c) and 6(d) were returned by CESSM 2014. In almost all the cases, the studied similarity measures assign high similarity values to pairs of proteins that SeqSim also consider similar. Nevertheless, the problem is to precisely distinguish when two proteins are dissimilar. In the collections 2008 and 2014, simGIC (GI) [14] has the highest correlation with respect to SeqSim, 0.773 and 0.799, respectively. In addition to GO annotations of the proteins, GI additionally exploits information content of the GO annotations in conjunction with the most informative ancestors of these annotations. Thus, a more precise estimate of the relatedness of two proteins is computed, i.e., both GI and SeqSim assign low similarity values to a large number of pairs of proteins. AnnSimD\(_{tax}\) does not precisely distinguish dissimilar proteins in none of the collections, and the correlation with respect to SeqSim is 0.650 and 0.682. Contrary, AnnSimOnSim considerably enhances AnnSim, and exhibits a performance more similar to GI in dissimilar pairs of proteins, i.e., pairs of proteins with low SeqSim values; thus, the correlation with respect to SeqSim is 0.732 and 0.772. This improvement is the result of analyzing the neighborhoods of the GO terms that are compared during the computation of AnnSimOnSim, and corroborates our hypothesis that OnSim can positively impact on the effectiveness of annotation-based similarity measures. Another interesting issue to highlight is the impact that newer versions of GO and annotations may have on the behavior of semantic similarity measures. Although the CESSM 2014 tool only reports on eight similarities, clearly all of them behave better in the CESSM 2014 collection than in the CESSM 2008. This observation suggests an improvement in the quality of the GO taxonomy and axioms, as well as on the annotations of the proteins provided by UniProt-GOA. Providing thus, this type of studies, not only the possibility of evaluating the effectiveness of existing measures, but also of analyzing the quality of existing ontologies and annotations.

Table 1. The Pearson’s correlation coefficient between three gold standards and eleven similarity measures of CESSM. The top 5 correlations are highlighted in gray, and the highest correlation with respect to each gold standard is highlighted in bold.

Further, Table 1(a) and (b) report on the comparison of all the similarity measures with the gold standards: ECC, Pfam, and SeqSim on CESSM 2008 and 2014. Both tables report on Pearson’s correlation coefficients, where the top-5 values are highlighted in gray, and the highest correlation with respect to each of the baseline similarity measure is highlighted in bold. We can observe that both AnnSimD\(_{tax}\) and AnnSimOnSim are among the top-5 more correlated measures to SeqSim and Pfam in CESSM 2008. However, in the version of 2014, only AnnSimOnSim is kept among the top-5 measures. While AnnSimD\(_{tax}\) maintains its improvement in the correlation with SeqSim in the 2014 collection, it drops to the last position in terms of correlation. Similar to the results reported in Fig. 6(d), the enhanced effectiveness of AnnSimOnSim in this dataset suggests an improvement in the quality of the annotations and in the knowledge represented in GO. We hypothesize that most of changes in GO are related to axioms and object properties and not so much with the taxonomy. These characteristics of GO 2014 would explain the behavior of AnnSimD\(_{tax}\) in this dataset. AnnSimOnSim is competitive because, unlike other top-5 similarity measures, it is a generic similarity measure and is not tuned for GO.

6 Related Work

A diversity of similarity measures have been proposed in the literature to compute relatedness between a pair of entities. Each measure exploits some knowledge including paths of relations with other entities, taxonomic hierarchies of the classes, and semantic knowledge. Path- or structure-based similarity measures compute the relatedness of two entities according to the properties of the paths that connect them (e.g., PathSim [21] or HeteSim [19]), or the structure of the graph that includes the two entities (e.g., SimRank [7]). High values of path- and structure-based similarity indicate that the entities are connected with a large number of paths that meet certain conditions, or the neighborhoods of these entities are highly connected. Taxonomic-based similarity measures, as \(D_{ps}\) [13] and \(D_{tax}\) [1], are a subset of structure-based similarity measures. They decide relatedness in terms of the class hierarchy of the ontology and usually consider only the is-a relation. High values of taxonomic similarities indicate that the entities share deep common ancestors in the ontology. In the context of Biomedicine, domain-specific similarity measures have been defined to measure relatedness between scientific entities. Smith and Waterman [20], BLASTFootnote 17 and FASTAFootnote 18 identify sequence alignment in sequences of nucleotides or amino-acids. Furthermore, domain-specific similarity measures rely on knowledge encoded in specific taxonomies to compute the similarity of two entities. For example, the GO semantic similarity measures assign values between GO terms according to the similarity measures proposed by Resnik et al. [16], Lin et al. [9], and Jiang&Conrath [8]. Finally, Couto et al. [3] propose a classification of similarity measures according to the semantics they exploit: Terminological measures compute relatedness between two entities by considering similarity between the names of the classes to which these entities belong; structural approaches decide similarity depending on the relations and attributes of the classes; extensional measures assign similarity values based on the cardinality of the intersection of the instantiations of the classes; and the semantic-based approaches take into account axioms that formalize properties of ontology classes to decide relatedness of two entities [5]. OnSim considers both, the ontology structure and logic axioms. Therefore, according to Couto et al., OnSim is classified as a structural and semantic-based similarity measure.

7 Conclusions and Future Work

We have defined OnSim, a similarity measure that exploits the semantics of ontology terms, i.e., object properties and axioms, to accurately determining relatedness. We extended the annotation-based similarity AnnSim with OnSim and conducted an extensive empirical study on collections available at the CESSM websites. Experimental results reveal that AnnSimOnSim is able to enhance AnnSim effectiveness with respect to biomedical gold standard similarity measures: SeqSim, Pfam, and ECC. Observed results also suggest that AnnSimOnSim and the other similarity measures are positively impacted by the evolution of the Gene Ontology and protein annotations; providing thus, a potential new application of these measures for suggesting quality issues.

In the future, we plan to study the impact of OnSim on other similarity measures, e.g., Cosine or GI. Further, we will formally analyze the effects of ontology and annotation evolution on the effectiveness of similarity measures; we hypothesize that these results will provide insights to define higher quality ontologies and annotations.