1 Motivation

Ontology (network) evaluation plays a key role in ensuring the quality of ontology networks, and it is employed within various ontology engineering scenarios. The main scenario is that of ontology development, namely the process during which the ontology is built. The goal in this case is to assess the quality and correctness of the obtained ontology. The process of ontology development can be achieved through different methods, and the evaluation of the obtained ontology changes accordingly. For example, an ontology could be obtained through automatic extraction from representative data sources such as text (Cimiano and Völker 2005) or databases (Cerbah 2008). In this case, an important research question is how to evaluate ontology extraction algorithms with respect to the quality of the produced artifacts, as well as how to compare the various algorithms to each other. Ontology evaluation can often be used as a means to automatically assess the quality of the output of such algorithms.

Alternatively, the ontology development phase could also involve an ontology evolution activity where a base ontology is extended, either manually or through automatic means, in order to cover new domain terminology or to correspond to new application requirements (Chap. 11). In this case, the goal of ontology evaluation is to assess whether the new additions have affected the quality of the base ontology.

In addition to ontology development, another scenario where ontology evaluation plays an important role is that of ontology selection. With the recent advances in the area of the Semantic Web, in particular the proliferation of online available ontologies and semantic search engines such as Watson or Sindice, an increased number of applications are built by reusing external knowledge rather than building it from scratch (d’Aquin et al. 2008). Examples include cross-ontology question answering (Lopez et al. 2010), relation detection, ontology evolution (Zablith et al. 2010), or ontology matching (Sabou et al. 2008). For these applications, it is crucial to evaluate, often entirely automatically, the quality of the reused knowledge. Ontology evaluation here refers to the situation where existing ontologies are evaluated (and often ranked) in terms of selected criteria in order to select the most appropriate one for the task at hand.

A final usage scenario is the ontology modularization process, which leads to a network of interconnected ontology modules (Chap. 10) whose quality is iteratively assessed in order to decide whether the modularization has achieved the expected results.

In this chapter, we further explore ontology (network) evaluation by providing a definition (Sect. 9.2), methodological guidelines (Sect. 9.3), and concrete examples (Sect. 9.4).

2 Definitions and Filling Card

Ontology evaluation is defined as the activity of checking the technical quality of an ontology against a frame of reference (Suárez-Figueroa and Gómez-Pérez 2008). Intuitively, whenever an evaluation is performed for a certain ontology (or alignment) aspect (e.g., modeling correctness), the process is always guided by the evaluator’s understanding of what is better and what is worse. In some cases, these boundaries (which we refer to as the frame of reference) are clearly defined and tangible (e.g., a reference ontology, a reference alignment), but in other cases, they are weakly defined and may differ from one person to another, or even across evaluation sessions. The NeOn Glossary distinguishes two types of ontology evaluation depending on the frame of reference used:

  • Ontology validation is the ontology evaluation activity that compares the meaning of the ontology definitions against the intended model of the world that it aims to conceptualize (an intangible frame of reference). This activity answers the question: Are you producing the right ontology?

  • Ontology verification is the ontology evaluation activity which compares the ontology against the ontology specification document (ontology requirements and competency questions), thus ensuring that the ontology is built correctly (in compliance with the ontology specification). This activity answers the question: Are you producing the ontology in the right way?

The filling card shown in Fig. 9.1 provides a structured summary of the ontology (network) evaluation activity. Section 2.5 describes the main components of a filling card in more detail.

Fig. 9.1 Filling card for ontology (network) evaluation

3 Ontology Network Evaluation Workflow and Guidelines

In this section, we describe the NeOn methodological guidelines for carrying out the ontology network evaluation activity. Besides prescribing a methodology, our aim is also to provide a brief overview of the various evaluation methods and techniques that can be used in each step of the methodology.

We propose a component-based evaluation approach where each element of the network (e.g., ontologies and alignments between ontology pairs) is evaluated individually as a stand-alone component, and then the findings of these evaluations are combined (Fig. 9.2). An alternative to this approach would be to evaluate the entire network from the point of view of the users or the organization that will use the ontology network. Methodologically, this approach is similar to evaluating a stand-alone component using, for example, a task-based evaluation, and therefore, it is covered by Tasks 2 and 3 of the proposed workflow. Figure 9.2 shows the workflow and the tasks for carrying out the ontology network evaluation.

Fig. 9.2 Workflow and tasks for evaluating ontology networks

Task 1. Selecting individual components of the ontology network. As a first step, the ontology development team identifies the elements of the network that need to be evaluated, including individual ontologies (Maedche and Staab 2002; Burton-Jones et al. 2005; Alani et al. 2006; Fernandez et al. 2006), alignments between ontology pairs (Euzenat and Shvaiko 2007), ontology statements (Lopez et al. 2009), ontology relations, etc. Their decision should be based on two criteria: (1) which ontology network elements are critical for the overall network and (2) which of these elements can actually be evaluated. The latter means that there must exist some frame of reference against which these individual components can, at least in principle, be evaluated. As we discussed before, the frame of reference is not necessarily tangible, but can be some idea of the perfect model, or canon, defined by the human evaluator for the particular evaluation task. Examples of frames of reference are given in Task 3.

Task 2. Selecting an evaluation goal and approach. For evaluating individual ontologies, the team needs to decide the goal of the evaluation and select an appropriate evaluation approach (as summarized in Table 9.1). We distinguish the following evaluation goals:

Table 9.1 Evaluation goals, evaluation approaches, and relevant NeOn plugins

  • Domain coverage – Does the ontology cover a topic domain? The extent to which an ontology covers a considered domain is an important factor both during the development and the selection of an ontology. The evaluation approaches employed to achieve this goal compare the ontology to frames of reference such as a gold standard ontology (Maedche and Staab 2002) or data sets that are representative of the domain (user-defined terms (Alani et al. 2006; Fernandez et al. 2006), tag sets (Cantador et al. 2007), document corpus (Brewster et al. 2004), etc.).

  • Quality of the modeling in terms of the design and development process and in terms of the final result – Does the ontology development process comply with ontology modeling best practices/ODPs? Is the ontology modeled correctly? Applicable both for the ontology development (Lozano-Tello and Gómez-Pérez 2004) and selection scenarios (Burton-Jones et al. 2005; Tartir et al. 2005), this evaluation goal concerns the quality of the ontology, which can be assessed using a wide range of approaches focusing on logical correctness or on syntactic, structural, and semantic quality. Quality in terms of correctness, precision, and recall is an important goal when evaluating ontology alignments.

  • Suitability for an application/task – Is the ontology suitable to use for a specific application/task? (Porzel and Malaka 2004; Fernandez et al. 2009) Will it produce the expected results? (Strasunskas and Tomassen 2008) Different applications rely on different ontology (or alignment) characteristics. For example, for applications that use ontologies to support natural language processing tasks, domain coverage is often more important than logical correctness. As a result, measuring ontology (alignment) quality alone is not enough to predict how well the ontology (developed or selected) will support an application or a task. Task-based evaluations help assess suitability for a task or application rather than generic quality features.

  • Adoption and use – Has the ontology been reused (imported) as part of other ontologies? (Sindice, Watson) How did others rate the ontology? (Cantador et al. 2007; Cupboard) Understanding the extent of adoption of an ontology is of particular interest when selecting it, the assumption being that there is a direct correlation between the level of adoption and the quality of the ontology. Analyzing the degree of interlinking between an ontology and other ontologies (e.g., in terms of reused terms or ontology imports) as well as relying on social rating systems are two key approaches to achieve this goal; a minimal sketch of the first approach follows.
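
As an illustration only (not a NeOn tool), the following Python sketch estimates the adoption of an ontology by counting how often other ontologies in a corpus import it or reuse terms from its namespace. The corpus layout, the IRIs, and the adoption_score helper are hypothetical assumptions made for the example.

```python
# Minimal sketch: estimating adoption of an ontology by counting how often
# other ontologies in a corpus import it or reuse its terms.
# The corpus structure and IRIs below are hypothetical and simplified.

corpus = {
    "http://example.org/onto/fisheries": {
        "imports": ["http://www.w3.org/2004/02/skos/core"],
        "terms_used": ["http://www.w3.org/2004/02/skos/core#Concept"],
    },
    "http://example.org/onto/aquaculture": {
        "imports": ["http://example.org/onto/fisheries"],
        "terms_used": ["http://example.org/onto/fisheries#FishingArea"],
    },
}

def adoption_score(target_iri: str, corpus: dict) -> dict:
    """Count direct imports and term reuse of `target_iri` across the corpus."""
    importers = [o for o, d in corpus.items() if target_iri in d["imports"]]
    reusers = [o for o, d in corpus.items()
               if any(t.startswith(target_iri) for t in d["terms_used"])]
    return {"imported_by": len(importers), "terms_reused_by": len(set(reusers))}

print(adoption_score("http://example.org/onto/fisheries", corpus))
```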

Task 3. Identifying a frame of reference and evaluation metric. While in Task 2 the ontology development team decides on the key goal(s) of the evaluation and potential approaches, in Task 3, the team needs to select the concrete ingredients of the evaluation, consisting of:

  • A frame of reference – What are we comparing against? The frame of reference denotes a set of representative resources that sets a baseline value against which the ontology should be compared.

  • Evaluation metric(s) – How to measure the features of the ontology that will be compared? Example evaluation metrics are precision and recall, cost-based evaluation metrics, measures of similarity between an ontology (or a mapping) and a corpus (domain knowledge), and lexical metrics. Table 9.2 summarizes the main evaluation metrics presented in the literature.

    Table 9.2 Evaluation metrics used for various evaluation frameworks

As exemplified in Table 9.2, evaluation metrics are generally specific to each frame of reference. There are, however, some generic metrics, such as precision and recall, which can be adapted for use with various frames of reference.
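
As a minimal illustration of how the generic precision and recall metrics can be instantiated, the following Python sketch reduces both the evaluated ontology and a gold-standard frame of reference to sets of concept labels. The term sets and the precision_recall helper are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch of precision/recall for ontology evaluation: the frame of
# reference is a gold-standard set of concept labels, and the evaluated
# ontology is reduced to its concept labels. Term sets are illustrative only.

def precision_recall(extracted: set[str], gold: set[str]) -> tuple[float, float]:
    """Precision = |extracted ∩ gold| / |extracted|; recall = |extracted ∩ gold| / |gold|."""
    overlap = extracted & gold
    precision = len(overlap) / len(extracted) if extracted else 0.0
    recall = len(overlap) / len(gold) if gold else 0.0
    return precision, recall

extracted = {"fishing area", "division", "subdivision", "vessel"}
gold = {"fishing area", "division", "subdivision", "species"}
p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.75
```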

Similarly to Brank et al. (2005), we distinguish the following types of frames of reference:

  • Gold standard: The frame of reference is defined by a baseline ontology or some other kind of structured representation of the problem domain for which an appropriate ontology is needed. A gold standard is often used when the goal of the evaluation is domain coverage. For alignments, a reference alignment can play the role of a gold standard.

  • Application-based: The frame of reference consists of the set of “ideal” results that an application should return when plugging the “perfect” ontology (or alignment) into it. This frame of reference pertains to the assessment of the ontology’s (alignment’s) suitability for an application/task.

  • Data-driven: The frame of reference is a collection of unstructured or informal data (e.g., text), which represents the problem domain. Similarly to structured representations used as gold standards, unstructured data collections are also mostly used to support the evaluation of domain coverage.

  • Assessment by humans: The frame of reference is defined by human judgments that measure ontology features (or alignment characteristics) not recognizable by machines. Humans can (relatively) easily assess several ontology quality features which are not amenable to automatic processing. Human ratings also help to assess the level of adoption and use of the ontologies. Human-based ontology ratings are exploited to automatically select the most appropriate ontology according to previous users’ experiences (Cantador et al. 2007).

Additionally, based on the way in which human evaluators assess ontology quality features (by comparison with their mental model of the perfect canon for these features), we have identified three further, intangible frames of reference: ideal models of topologies, languages, and ontology-construction methodologies, which constitute the boundaries against which comparisons are made when performing the evaluations: (a) the ontology with the optimal topology, (b) the most powerful and expressive ontology language available, and (c) the perfect set of steps to follow and requirements to fulfill in order to obtain the best modeled ontology. All these canons, or ideal models of topologies, languages, and methodologies, are weakly defined, since they may vary across evaluations and across the evaluators who define them.

  • Topology-based: The frame of reference is defined by the minimum or maximum possible values of the topology evaluation metrics among ontologies within the network, or among ontology entities within the same ontology. Topology metrics automatically assess ontology quality features as well as adoption and use features by measuring the interlinking structure of ontologies across the network (Ding et al. 2005); a minimal sketch of such metrics is given after this list.

  • Language-based: The frame of reference is defined by the representational capabilities of the language used to construct the ontology.

  • Methodology-based: The frame of reference is defined by the different quality factors of the selected ontology-development methodology.
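
To make the topology-based frame of reference more concrete, here is a small Python sketch that computes two simple topology metrics (maximum taxonomy depth and average branching factor) over a class hierarchy given as child-to-parent edges. The taxonomy and the choice of metrics are illustrative assumptions rather than the metrics of any particular framework.

```python
# Minimal sketch (illustrative data only): simple topology metrics over a
# class hierarchy given as child -> parent edges. The frame of reference is
# then, e.g., the maximum or minimum value observed across the network.

from collections import defaultdict

subclass_of = {  # hypothetical taxonomy
    "MajorFishingArea": "FishingArea",
    "Subarea": "FishingArea",
    "Division": "Subarea",
    "Subdivision": "Division",
}

def depth(cls: str) -> int:
    """Number of subclass edges from `cls` up to a root class."""
    d = 0
    while cls in subclass_of:
        cls = subclass_of[cls]
        d += 1
    return d

children = defaultdict(list)
for child, parent in subclass_of.items():
    children[parent].append(child)

max_depth = max(depth(c) for c in subclass_of)
avg_branching = sum(len(v) for v in children.values()) / len(children)
print(max_depth, round(avg_branching, 2))  # 3 and 1.33
```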

Task 4. Applying the selected evaluation approach. Applying the selected evaluation approach requires a proper setup of the evaluation experiments, the implementation of software tools to compute the evaluation metrics, and/or engaging the human experts in evaluation sessions to collect their judgments. We advise ontology developers to refer to the relevant scientific publications cited in this chapter for example evaluation setups and best practices. Evaluation approaches that rely on human judgment (Guarino and Welty 2004; Lozano-Tello and Gómez-Pérez 2004) are generally more time consuming and sophisticated than those which compare numeric values derived by automatic measures (Sindice, Watson), although they often offer more valuable insight into the evaluation process. We advise using parallel evaluation with multiple human experts to account for cross-evaluator disagreements.

Task 5. Combining and presenting individual evaluation results. This task highlights the weakest spots in the ontology network by considering individual evaluation results and how they affect the rest of the network. The evaluation results derived for individual components are combined to reach a global understanding of the network’s quality. The final task is to present the results of the evaluation in an appropriate form for possible repair (corrections, additions), improvements, and future evolution of the ontology network.
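
As a minimal sketch only (this is not part of the NeOn Toolkit), the snippet below shows one way Task 5 could combine per-component scores into a network-level summary and flag the weakest spots. The component names, metric values, threshold, and plain averaging are hypothetical choices made for the example.

```python
# Minimal sketch of Task 5: combining per-component evaluation scores into a
# network-level summary and flagging weak spots. All data below is made up.

component_scores = {
    "fisheries ontology": {"coverage": 0.82, "modeling": 0.91},
    "species ontology": {"coverage": 0.64, "modeling": 0.88},
    "fisheries-species alignment": {"precision": 0.71, "recall": 0.58},
}

def summarize(scores: dict, threshold: float = 0.7) -> None:
    """Print the mean score per component and flag metrics below the threshold."""
    for component, metrics in scores.items():
        mean = sum(metrics.values()) / len(metrics)
        weak = [m for m, v in metrics.items() if v < threshold]
        flag = f"  <- weak: {', '.join(weak)}" if weak else ""
        print(f"{component}: mean={mean:.2f}{flag}")

summarize(component_scores)
```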

4 Examples of Ontology Evaluation

Since ontology network evaluation is not yet a widespread activity, in this section we present examples of various ontology evaluation studies and show how their stages map to the tasks prescribed by our guidelines. The examples cover all the key evaluation goals described in Task 2: domain coverage (Sect. 9.4.3), quality of modeling (Sects. 9.4.1 and 9.4.2), suitability for an application (Sects. 9.4.3 and 9.4.4), and adoption (Sect. 9.4.5).

4.1 Evaluation of an Individual Ontology

In this example, we describe the evaluation of YAGO (Suchanek et al. 2008), a large, lightweight, general-purpose ontology, automatically derived from Wikipedia and WordNet. YAGO has over 1.7 million entities (individuals and concepts) and 15 million facts (ground binary relations between entities). The relations include the taxonomic hierarchy as well as around 100 semantic relations between entities. YAGO’s evaluation follows the main tasks of our methodology.

[Task 2] Since the evaluation was performed in an ontology development scenario, the authors’ goal was to assess the quality of modeling of YAGO, namely its precision with respect to the data sets from which it has been derived. The approach was to evaluate this precision using human expert opinion.

[Task 3] To evaluate the precision of an ontology, its facts have to be compared to some ground truth. Since there is no computer-processable ground truth of suitable extent, the authors relied on manual evaluations against Wikipedia content, which thus served as the frame of reference.

[Task 4] During the evaluation, human judges rated randomly selected YAGO facts as “correct,” “incorrect,” or “don’t know.” Since common sense often does not suffice to judge the correctness of YAGO facts, a snippet of the corresponding Wikipedia page was also presented to the judges. Thus, the evaluation compared YAGO against the ground truth of Wikipedia (i.e., it did not deal with the problem of Wikipedia containing some false information). Thirteen judges evaluated a total of 5,200 facts (ground relations between YAGO entities).

[Task 5] The authors use a tabular format (Table 9.3) to present the evaluation results in decreasing order of the obtained precision (we only show the most and least precise relations). To make sure that the findings are significant, the Wilson confidence interval for α = 5% was computed. A confidence interval of 0% means that the facts have been evaluated exhaustively. The evaluation shows very high-quality results, as 74 relations have a precision of over 95%.

Table 9.3 Precision of some YAGO facts
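
For reference, the Wilson score interval mentioned above can be computed as in the following sketch (α = 5%, i.e., z ≈ 1.96). The counts used in the example call are made up and are not the published YAGO figures.

```python
# Sketch of the Wilson score interval used to qualify manually assessed
# precision values. Example counts are invented, not the published YAGO data.

from math import sqrt

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return (center, half_width) of the Wilson score interval for correct/total."""
    if total == 0:
        return 0.0, 0.0
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center, half_width

center, hw = wilson_interval(97, 100)  # hypothetical: 97 correct out of 100 judged
print(f"precision ≈ {center:.3f} ± {hw:.3f}")
```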

This tabular presentation helps identify the least precise relations and fosters the analysis of such cases. It can be concluded, for example, that a key source of errors is inconsistency in the underlying sources. For example, for the relation bornOnDate, most false facts stem from erroneous Wikipedia categories (e.g., persons born in 1802 are in the 1805 Births Wikipedia category). For facts with literals (such as hasHeight), many errors stem from a nonstandard format of the numbers (e.g., a height is considered to be 1.6 km just because the infobox says 1,632 m instead of 1.632 m). Occasionally, the data in Wikipedia was updated between the time of extraction and the time of the evaluation. This explains many errors for frequently changing properties such as hasGDPPPP and hasGini.

4.2 Pattern-Based Ontology Evaluation

In this section, we show how ontology design patterns, specifically content design patterns (CPs), are used to evaluate an ontology. The example does not cover the complete evaluation of the ontology, but presents one specific case where a CP assisted in finding potential problems and additionally suggested a solution. The example is set within the fishery domain, and the evaluated ontology is version 0.3 of the “fishing areas” ontology, modeling the division of water areas into divisions and subdivisions. An example is the FAO major fishing area 51, Western Indian Ocean, and its subareas numbered from 1 to 8, where 1 corresponds to the Red Sea and 2 to the Persian Gulf, but where the subdivisions of these subareas are only numerically identified.

[Task 2] The goal of the evaluation was assessing the quality of modeling, and the chosen approach was manual evaluation by an ontology pattern expert.

[Task 3] The expert used the pattern catalog available in the ontology design pattern portal as a “gold standard” of modeling to which the modeling solutions in the evaluated ontology were compared. CPs encode best practices for solving particular modeling problems; by cataloging these solutions, the pattern catalog can also be seen as a catalog of modeling issues.

[Task 4] The ontology used a locally defined, transitive “part-of” relation to model the division of subareas and further levels of divisions and subdivisions, thus using the same modeling approach as the “part-of” content pattern. This modeling solution, however, is not suitable for certain contexts because, when using reasoning, it is not possible to distinguish between the direct and the indirect subparts of an area. For instance, when the hierarchical partitioning of the areas needs to be reconstructed, e.g., for browsing the ontology in a graphical interface, or when answering “what are the divisions of the Red Sea?”, only the direct subareas of the Red Sea are of interest rather than all the inferable parts.

The “componency pattern” provides a modeling alternative using two inverse object properties: “hasComponent” and “isComponentOf.” These are nontransitive properties that can be used in combination with the “part-of pattern” to register both the general partitioning and the nontransitive notion of a “proper part,” i.e., a direct component of something. When using these two patterns as “gold standards” for modeling, the ontology evaluator can discover the potential problem of a missing nontransitive property to distinguish the different “levels” of area decomposition and propose an appropriate solution.
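
The distinction can be illustrated with a small Python sketch: with only a transitive part-of relation, reasoning over the closure loses the decomposition levels, whereas a nontransitive hasComponent relation keeps the direct subareas accessible. The area and subdivision names used below are placeholders, not the actual identifiers of the fisheries ontology.

```python
# Illustrative sketch (hypothetical area names): a nontransitive hasComponent
# relation preserves the direct decomposition levels, while the transitive
# closure (what part-of alone yields under reasoning) mixes all levels.

has_component = {  # direct, nontransitive decomposition
    "Western Indian Ocean": ["Red Sea", "Persian Gulf"],
    "Red Sea": ["Subdivision 1", "Subdivision 2"],
}

def all_parts(area: str) -> set[str]:
    """Transitive closure of hasComponent, i.e. what transitive part-of infers."""
    parts = set(has_component.get(area, []))
    for direct in list(parts):
        parts |= all_parts(direct)
    return parts

print(has_component["Red Sea"])                   # direct divisions only
print(sorted(all_parts("Western Indian Ocean")))  # everything, levels lost
```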

4.3 Multiple Evaluations of an Ontology

An example of how various types of evaluations shed light on different aspects of an ontology is provided in (Sabou et al. 2005). Similarly, when evaluating ontology networks, one needs to combine evaluation results for various network components. The authors of (Sabou et al. 2005) report on the multifaceted evaluation of an ontology that was automatically extracted from a corpus of textual web service descriptions in the bioinformatics domain. The various stages of this evaluation are graphically depicted in Fig. 9.3. The aim of the extracted ontology is to support the semantic description of web services. The myGrid project provided a good context to evaluate this ontology as a bioinformatics expert had previously built a gold standard ontology for describing the same set of web services. The domain expert relied on his domain knowledge to build the ontology rather than on the web service descriptions (the corpus), which were the main input for the automatic extraction algorithm. A part of the gold standard ontology, referred to as the application ontology, provides concepts for annotating web service descriptions in a form-based annotation tool and is subsequently used at web service discovery time to power the search.

Fig. 9.3 Overview of various evaluations of an ontology (Sabou et al. 2005)

[Task 2] In this ontology development scenario, the evaluations had several complementary goals. First, the authors aimed to assess whether the extracted ontology would be a good starting point for building an ontology and relied on an expert evaluation approach for this (shown as evaluation 2 in Fig. 9.3). Second, they wanted to evaluate domain coverage by comparison to the gold standard ontology (shown as evaluation 3 in Fig. 9.3). Third, the authors gained insight into how well the ontology would support an application by comparing it to the application ontology (evaluation 4 in Fig. 9.3).

[Task 3] The authors made use of the following frames of reference and metrics. For evaluation 2, the frame of reference consisted of the expert’s knowledge of the domain, as he was asked to review and rate the extracted concepts as correct, spurious, or new. A precision value was then computed as the ratio of the correct and new concepts to all extracted concepts. For evaluation 3, the authors used the gold standard ontology as a frame of reference and computed metrics such as lexical overlap (LO – the ratio of overlapping concepts), ontological improvement (OI – the ratio of new concepts that were not in the gold standard but were domain relevant), and ontological loss (OL – the ratio of gold standard concepts which were not extracted). For evaluation 4, the application ontology was used as a frame of reference and compared to the extracted ontology using the metrics defined for evaluation 3.
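
The following sketch shows one way to compute LO, OI, and OL as described above, assuming all three ratios are taken relative to the size of the gold standard. The concept sets and the “domain relevant” judgment are illustrative placeholders rather than data from the original study.

```python
# Sketch of the LO/OI/OL metrics, assuming all three ratios are relative to
# the gold standard size; concept sets are illustrative placeholders only.

def lo_oi_ol(extracted: set[str], gold: set[str], relevant_new: set[str]):
    """Lexical overlap, ontological improvement, ontological loss."""
    overlap = extracted & gold
    lost = gold - extracted
    lo = len(overlap) / len(gold)              # overlapping concepts
    oi = len(relevant_new - gold) / len(gold)  # new, domain-relevant concepts
    ol = len(lost) / len(gold)                 # gold concepts not extracted
    return lo, oi, ol

extracted = {"sequence", "alignment", "blast report", "fasta file"}
gold = {"sequence", "alignment", "protein", "gene"}
relevant_new = {"blast report", "fasta file"}  # judged relevant by the expert
print(lo_oi_ol(extracted, gold, relevant_new))  # (0.5, 0.5, 0.5)
```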

[Task 4] Task 4 consisted of the evaluation performed by the domain expert as well as the computation of the various ontology comparison metrics.

[Task 5] The authors sum up the results of the various evaluations in tabular form and perform a subsequent analysis of these results. For example, Table 9.4 sums up the results when assessing domain coverage and suitability for a task by comparing the extracted ontology to the gold standard and application ontologies. The results show that although the overlap with the gold standard is low (7%), the extracted ontology contains a significant number of new, domain-relevant concepts (56%) that were identified in the automatically analyzed corpus but missed by the domain expert, who relied exclusively on his domain knowledge. A detailed analysis of all the missed concepts when comparing to the gold standard ontology shows that 70.6% of these terms did not actually appear in the corpus (but could be acquired if the corpus were enlarged) and 19.8% referred to abstract concepts introduced by the domain expert to structure the ontology, which again were not in the corpus. It turns out that extraction algorithm–related issues account for only 10% of the missed concepts.

Table 9.4 Results for domain coverage and task fitness from (Sabou et al. 2005)

4.4 Task-Based Ontology Evaluation

The authors of (Strasunskas and Tomassen 2008) investigate which ontology features influence the web search task. In their study, they consider different types of search tasks (fact-finding, exploratory search, comprehensive search), identify the ontology features important for each task, and then introduce new evaluation metrics that measure these features (e.g., fact-finding fitness (FFF) and exploratory search task fitness (EXF)). Such metrics can support ontology selection for search. Their theoretical considerations are experimentally verified by correlating the values of the metrics for different ontology versions with the search performance obtained in the context of the WebOdIR web search application (Strasunskas and Tomassen 2008). Core to their study is therefore a task-based evaluation of ontologies.

[Task 2] The goal is to understand the suitability for a task, and the approach consists in exploiting ontologies to support web search and measuring the improvement in terms of search precision obtained in an experimental setting.

[Task 3] The frame of reference is defined by the performance scores obtained in a web search task with an original version of the ontology. The metrics used measure ontology features important for certain search tasks (e.g., FFF, EXF).

[Task 4] The experimental setup relies on two groups of users performing web search with WebOdIR in four different domains (two search tasks per domain, i.e., eight tasks in total). WebOdIR exploited a set of ontologies for one group and the extended version of the same ontologies for the second group. The performance score of the search task is computed and compared across the two versions of the ontologies as well as correlated with the computed values of the newly introduced metrics.
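
The correlation step can be illustrated with a short sketch using Pearson correlation from the Python standard library (Python 3.10+). The metric values and precision scores below are invented for illustration and do not reproduce the published results.

```python
# Sketch of the correlation step (hypothetical numbers): relate a task-fitness
# metric computed on several ontology versions to the search precision
# observed with each version. Pearson correlation requires Python 3.10+.

from statistics import correlation

fff_scores = [0.42, 0.48, 0.55, 0.61, 0.70]        # fact-finding fitness per version
search_precision = [0.51, 0.53, 0.60, 0.66, 0.72]  # observed search precision

print(f"Pearson r = {correlation(fff_scores, search_precision):.2f}")
```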

[Task 5] The authors present these correlations in both tabular and graphical form and draw conclusions on the influence of ontology features on various search tasks. For example, they found that more instances and object properties improve fact-finding, while the addition of disjoint and equivalent concepts is beneficial for exploratory and comprehensive search tasks.

4.5 Evaluating Ontology Adoption and Use

The work of Cantador and colleagues (Cantador et al. 2007) presents a tool for collaborative ontology evaluation and reuse (WebCORE) focused on evaluating domain coverage as well as adoption and use. The goal of this tool is to help experts and practitioners select the most appropriate ontologies from a repository. The tool has three main components. The first one helps the user to semiautomatically generate a gold standard representing the domain of interest. The second component evaluates the domain coverage of the ontologies by comparing them against the previously generated gold standard by means of lexical and taxonomical evaluation measures. The third component exploits previous users’ judgments of those ontologies to automatically recommend the best ones.

[Task 2] Two main evaluation goals are considered when selecting the optimal ontology: (a) the domain coverage and (b) the adoption and use of the ontology.

[Task 3] To evaluate domain coverage, the authors select a gold standard as a frame of reference. This gold standard is a representation of the domain of interest and is semiautomatically generated by the user with the support of the tool. To generate it, the user (a) introduces an initial set of terms or selects a textual source from which a set of terms representing the domain of interest can be extracted, (b) complements this set of terms by selecting additional terms from a ranked list, automatically generated by the system by considering previous user-generated gold standards, and (c) extends this set of terms by selecting suggested hypernym, hyponym, and synonym relations from WordNet. To evaluate the adoption and use of the ontologies, this work relies on an assessment-by-humans frame of reference. Users share their own experiences by evaluating the used ontologies according to five criteria: correctness, readability, flexibility, level of formality (highly informal, semi-informal, semiformal, and rigorously formal), and type of model (upper-level, core-ontology, domain-ontology, task-ontology, and application-ontology).

[Task 4] The tool evaluates the ontologies in two phases. First, the ontologies are evaluated according to their domain coverage by comparing them against the semiautomatically generated gold standard using lexical and taxonomical similarity measures. Second, the ontologies with sufficient domain coverage are assessed on their level of adoption and use with the help of a collaborative filtering algorithm (Adomavicius and Tuzhilin 2005) that explores the manual evaluations of the ontologies stored in the system. This algorithm takes into account not only previous users’ experiences (usage) but also the number of times the ontologies were selected (adoption).
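
As an illustration of the collaborative filtering step, the following sketch predicts a user’s rating for an ontology as a similarity-weighted average of other users’ ratings. The ratings, user names, and the cosine-similarity choice are assumptions made for the example and do not reproduce the WebCORE algorithm itself.

```python
# Minimal user-based collaborative-filtering sketch; all ratings are made up.

from math import sqrt

ratings = {  # user -> {ontology -> rating in [1, 5]}
    "ana":  {"onto_a": 5, "onto_b": 2, "onto_c": 4},
    "bob":  {"onto_a": 4, "onto_b": 1},
    "carl": {"onto_a": 2, "onto_b": 5, "onto_c": 1},
}

def similarity(u: dict, v: dict) -> float:
    """Cosine similarity over commonly rated ontologies."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[o] * v[o] for o in common)
    return num / (sqrt(sum(u[o] ** 2 for o in common)) *
                  sqrt(sum(v[o] ** 2 for o in common)))

def predict(user: str, onto: str) -> float:
    """Similarity-weighted average of other users' ratings for `onto`."""
    pairs = [(similarity(ratings[user], r), r[onto])
             for name, r in ratings.items() if name != user and onto in r]
    total = sum(s for s, _ in pairs)
    return sum(s * r for s, r in pairs) / total if total else 0.0

print(round(predict("bob", "onto_c"), 2))  # predicted rating for an unrated ontology
```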

[Task 5] The representation of the results differs for the two types of evaluations. For domain coverage, the tool presents a ranked list of ontologies including their individual scores for the lexical and taxonomical evaluation measures, as well as a combined evaluation score. After the adoption and usage evaluation, the list of ontologies is reranked, and the collaborative ontology evaluation score is added to the previous scores. In addition, the system allows the user to provide her own judgment of the ontology so that her assessment can be exploited for future ontology evaluations and selections.

5 Relevant NeOn Toolkit Plugins

Given the complexity of the ontology evaluation task in terms of the variety of approaches and metrics, the NeOn Toolkit does not provide an evaluation plugin per se. However, various plugins exist that can support different evaluation approaches. We provide a brief description of these plugins here.

The RaDON plugin supports the automatic detection of logical inconsistency and incoherence in an ontology or an ontology network. The plugin not only detects these modeling errors but can also repair them automatically or support the user in solving them manually. As such, RaDON can support users whose goal is to assess the quality of modeling in their ontology.

The XDTools plugin contains a suite of tools that support design pattern–based ontology development. One of the tools, XD Analyzer, provides suggestions and feedback to the user on how well good practices in ontology design have been followed according to the eXtreme Design (XD) method, flagging, for instance, missing labels and comments, isolated entities, and unused imported ontologies. Chapter 3 provides more information about the XD method. Similarly to RaDON, this plugin can also be used when checking the quality of modeling; however, the focus here is the quality of the domain conceptualization rather than logical correctness.
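
To give a flavor of such checks (this is not the XD Analyzer itself), the sketch below runs the three checks mentioned above over a deliberately simplified ontology representation. The entity names and the data layout are hypothetical.

```python
# Illustrative sketch of design-feedback checks (missing labels/comments,
# isolated entities) over a simplified ontology representation. Not XD Analyzer.

ontology = {
    "entities": {
        "FishingArea": {"label": "fishing area", "comment": "FAO water area"},
        "Subdivision": {"label": "subdivision", "comment": None},
        "Vessel": {"label": None, "comment": None},
    },
    "edges": [("Subdivision", "subClassOf", "FishingArea")],  # Vessel is isolated
}

def check(onto: dict) -> list[str]:
    """Return a list of human-readable issues found in the ontology."""
    issues = []
    linked = {s for s, _, _ in onto["edges"]} | {o for _, _, o in onto["edges"]}
    for name, ann in onto["entities"].items():
        if not ann.get("label"):
            issues.append(f"{name}: missing label")
        if not ann.get("comment"):
            issues.append(f"{name}: missing comment")
        if name not in linked:
            issues.append(f"{name}: isolated entity")
    return issues

print("\n".join(check(ontology)))
```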

The Watson for knowledge reuse plugin primarily supports knowledge reuse by allowing an ontology developer to search the Watson ontology search engine for relevant knowledge statements directly from within the NeOn Toolkit and then reuse those statements. The plugin also interfaces with the Cupboard ontology publication environment, which allows users to rate various characteristics of the ontologies that they reused (e.g., reusability, correctness, completeness, domain coverage, modeling style). Individual ratings are aggregated into an overall score and can support other people when reusing ontologies. This plugin thus supports the evaluation of ontologies in terms of their adoption and use, also providing reviews written by previous adopters.

6 Summary

Ontology evaluation is an important and complex ontology engineering activity. Its complexity stems both from its applicability in a variety of scenarios (Sect. 9.1) and from the abundance of existing approaches and metrics. In this chapter, we aimed at providing practitioners with the right balance of generic guidelines and specific techniques that they can use from the wide landscape of works in this area (Sect. 9.3). We hope that the five diverse evaluation examples in Sect. 9.4 will serve as useful material for exemplifying the proposed guidelines.

Although ontology networks contain both ontologies and their links in terms of alignments, we have mostly focused on ontology evaluation. Readers interested in ontology alignment evaluation should also consult Chap. 12. Finally, Chaps. 10 and 11 describe other ontology engineering activities that can benefit from ontology evaluation, namely ontology modularization and evolution.