1 Introduction

The hurdle-free publication of correct information enables consumers from the public and business sectors to solve particular tasks based on available data. However, it is not sufficient simply to make a collection of data available through the World Wide Web. Several other requirements have to be fulfilled before information from a certain knowledge domain becomes valuable and useful for a particular usage scenario. This involves both accessibility aspects in the data retrieval step and intrinsic demands on the data itself.

Data Quality (DQ) is a concept describing the appropriateness of a data set with respect to concrete use case requirements. The examined data set is of excellent quality if it conforms to all needs and is free of defects [10] (“fitness for use” [12]). Otherwise, if it does not meet the expectations, the quality of a data source is described as poor. Quality aspects are usage dependent in general: information from a data source can be of good quality for one intended use and totally inappropriate for another purpose (e.g., by lacking required information). This involves requirements on the data instance level, the schema level, and the service level [7].

The analysis of data quality issues is not new and dates back to the 1970s. In Information Science, it involves the formulation of required aspects in terms of quality metrics as indicators and the testing of data sets against these quality requirements. Commonly, quantitative measurements with a concrete numeric output are run in (semi-)automated processes, but qualitative analysis steps are possible as well. However, it is still controversial which quality metrics are of major interest and whether a basic set of general-purpose metrics makes sense at all. Excellent overviews of this topic were recently provided in publications by Zaveri [13], Hogan [6], and Flemming [3].

Furthermore, the comparison of quality metric measurements and of the overall quality assessment among multiple data sources, a series of points in time, or different quality checker tools is not trivial. Several propositions have already been made for exchanging quality measurement results. They mainly originate in the Semantic Web community [2, 4], resulting in a recommendation for a Data Quality Vocabulary (DQV) by the W3C's Data Quality Working Group.

We have adopted these previous contributions from other authors and used them in the context of the industrial Linked Enterprise Data Services (LEDS) growth-core project for a proof of concept in practice. As a result, we want to present the following contributions:

  • The realization of an up-to-date implementation of a DQ Assessment Component (SemQuire) for the general analysis of structured RDF data sources that returns a machine-readable DQV export of measurement results

  • A rating approach that maps each measurement value to a numeric quality assessment score for better interpretability

  • The brief discussion of implementation aspects for well-accepted quality metrics

The rest of the paper is structured in the following way: Section 2 contains a more detailed description of DQ metrics and provides an overview of the functional requirements for a Data Quality Assessment component. Section 3 presents the prototypical implementation of our SemQuire software component and a list of experiences gathered during the implementation process. Section 4 analyses the correctness of our implementation based on a concrete use case with measurement results. In Sect. 5, we mention recent publications of other authors in the quality assessment domain and contrast our work with existing alternative quality checkers. Finally, Sect. 6 sums up our results and contains a plan for future work.

2 Challenges in Measuring Quality Metrics

Our driving research question is whether the quality state of online published data sources can be monitored in an automated fashion and compared among different data sources, assessment tools, or points in time by means of a set of standard quality metrics and a mapping to a rating score.

The term quality in the context of data source analysis is diffuse and encompasses aspects that go beyond a simple syntactic validation or a correctness check for the absence of contradictions and errors in local data sets. Research in the past has already focused on this challenge and repeatedly investigated the different dimensions of quality. Publications like ISO/IEC 25012 provide a comprehensive overview and definitions for common and generally accepted metrics and try to classify and cluster the metrics in a more general scheme. We base our research on the data quality dimensions and their categorization identified in a systematic literature review by [13]. They suggest a classification of these metrics and corresponding indicators into four primary groups entitled Accessibility, Representational, Contextual, and Intrinsic quality aspects. The implementation of such a quality metric should be straightforward according to its unambiguous conceptual description in the corresponding literature.

Stakeholders with a potential interest in quality measurement results can be found both on the data publication and on the data consumption side. A data curator or service provider of a data portal is interested in publishing correct data in a useful way. Data consumers, on the contrary, are interested in finding data sources that best fit their current needs. As a consequence, measurements can be run by all stakeholder groups on all available resources and data service endpoints. These measurement results can then be published as metadata in a machine-readable format for further processing and comparison activities.

Table 1. DQ metrics implemented in SemQuire with recommended concept URIs

In order to do that, the analyzed data quality metrics should be stated in an unambiguous and referenceable fashion. The Data Quality Vocabulary (DQV) therefore introduces a set of properties to announce quality measurement results. To identify particular quality aspects, URIs are used as references. It is intentionally not the objective of the W3C working group “to define a normative list of dimensions and metrics”, thus they only state some basic examples. However, it is also mentioned that “relying on existing classifications and metrics increases interoperability”, which expresses a valuable intention for Open Data exchange. (A similar approach is followed in the Linked Data community to reference particular existing entities with URIs, e.g., in the DBpedia project, though it does not contain entries for abstract concepts such as data metrics yet.) We therefore propose a list of potential quality metrics together with a recommended URI in Table 1. Note that we currently do not focus on metrics of a limited application domain, metrics with already profound tool support, or metrics involving sophisticated data mining or AI methodologies.

We pose the following requirements on a software tool that should be capable of measuring the mentioned quality metrics:

  1. RQ1: It can be applied to data sets containing structured data in an RDF serialization format (unstructured or semi-structured data sources can be processed to some extent using document converters in advance)

  2. RQ2: Input data can be specified in a push (direct input, upload) and/or pull (fetch from URL, fetch from SPARQL endpoint) manner

  3. RQ3: Relevant metrics that should be measured can be selected in advance from a list of available implemented metrics

  4. RQ4: If metrics depend on or relate to each other, any dependencies should be resolved during calculation without remeasuring duplicate aspects

  5. RQ5: The measurement assignment as well as the metrics should be referenceable by using a persistent URI

  6. RQ6: A measurement report containing concrete measurement values should be generated after finishing all measurements

  7. RQ7: The measurement report should be exportable in a machine-readable format, preferably using DQV

  8. RQ8: Optionally, an overall quality assessment score should be calculated with ratings for each measurement result

  9. RQ9: Optionally, the current measurement should be comparable with other quality measurements

  10. RQ10: The software tool should provide a Web UI for human interaction and presentation as well as a service backend for automation purposes and bulk processing

Fig. 1. Activity diagram for a data quality assessment tool

A conceptual program flow for fulfilling these requirements is briefly depicted in Fig. 1.
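To make this flow more concrete, the following minimal Python sketch wires the steps together for the push and pull input paths using rdflib. All function and parameter names are illustrative assumptions of ours and do not reflect SemQuire's actual API:

```python
# Minimal sketch of the conceptual assessment flow from Fig. 1.
# Names are illustrative, not SemQuire's actual interface.
from rdflib import Graph

def assess(source: str, metrics: dict, from_url: bool = False) -> dict:
    graph = Graph()
    if from_url:
        graph.parse(source)                        # pull: fetch from URL (RQ2)
    else:
        graph.parse(data=source, format="turtle")  # push: direct input (RQ2)
    report = {}
    for name, measure in metrics.items():          # pre-selected metrics (RQ3)
        report[name] = measure(graph)              # concrete measurement values (RQ6)
    return report                                  # exportable, e.g., as DQV (RQ7)
```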

3 The SemQuire Approach

In the following, we present SemQuire, a practical engineering approach to the data quality assessment of structured data sources. SemQuire is a result of the German Linked Enterprise Data Services (LEDS) growth-core project. The primary objective of the LEDS project is to build a novel, future-proof technology platform that is capable of combining, extending, and enriching corporate data stores with external, openly available data. One of the most critical aspects in this concept is the (automated) assurance of certain quality requirements in the process of knowledge combination. Open Data services often expose an inhomogeneous variety of data structures, ranging from very detailed, conscientiously curated data collections with a very high number of corresponding properties down to data providers with only little information value.

The SemQuire application consists of four main components:

  • A Web GUI enabling human users to manually check particular data sets for quality issues, relying on Google's MDL front-end template library

  • A RESTful web service API for machine-to-machine interaction, currently implemented in NodeJS with TypeScript transpilation

  • A set of implemented metrics that is easily extensible, mainly based on rdflib and other Python libraries

  • A graph database, currently using Stardog, accessed via an industrial data middleware (eccenca DataPlatform)

The entire system architecture is depicted in Fig. 2 and deployed in a Docker container. In contrast to previously existing quality checker tools, SemQuire is, to the best of our knowledge, the first that allows the machine-readable export of all measurement results in DQV, follows a rating concept for all quality measurements, and calculates a comparable overall assessment score. The SemQuire component can be publicly accessed via https://goo.gl/nYv9sX for demonstration purposes. Figure 3 depicts screenshots of the SemQuire prototype.

Fig. 2. Components of the SemQuire quality assessment tool

Fig. 3. SemQuire WebUI screenshots
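To illustrate the machine-readable DQV export, the following rdflib sketch shows how a single measurement result can be announced with the vocabulary's terms. The dataset and metric URIs are placeholders of ours, not the recommended concept URIs from Table 1:

```python
# Sketch of a DQV export for one measurement result, using rdflib.
# Dataset and metric URIs are placeholders.
from rdflib import Graph, Literal, Namespace, URIRef, BNode
from rdflib.namespace import RDF, XSD

DQV = Namespace("http://www.w3.org/ns/dqv#")

g = Graph()
g.bind("dqv", DQV)
measurement = BNode()
g.add((measurement, RDF.type, DQV.QualityMeasurement))
g.add((measurement, DQV.computedOn, URIRef("http://example.org/dataset")))
g.add((measurement, DQV.isMeasurementOf, URIRef("http://example.org/metrics/Latency")))
g.add((measurement, DQV.value, Literal(0.85, datatype=XSD.double)))
print(g.serialize(format="turtle"))
```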

We implemented a set of common quality metrics from multiple quality groups (see Table 1) dealing with different views on a data source.

Metrics from the Accessibility group deal with technical data access aspects. Some of them are not applicable to data sets that are provided in a push manner to the system by the user (e.g., a file upload of a data dump or directly pasted data content), as they refer to remote URL or SPARQL endpoint concerns such as Latency, Scalability, Throughput, or SPARQLAccessibility. Others evaluate metadata contained in the document itself or in retrievable well-known access paths, such as License information, the Availability of a dump download, a Digital Signature, or appropriate ContentType information. Another dimension checks external URIs contained in the retrieved data set for dereferenceability. Especially the execution of the latter metrics can become time-consuming for large documents with a high number of URIs.
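A dereferenceability check could, for instance, be implemented along the following lines; the accepted content types and the timeout are our own assumptions and not necessarily SemQuire's exact choices:

```python
# Sketch of a URI dereferenceability check via HTTP content negotiation.
import requests

RDF_TYPES = ("text/turtle", "application/rdf+xml", "application/ld+json",
             "application/n-triples")

def is_dereferenceable(uri: str) -> bool:
    try:
        resp = requests.get(uri, headers={"Accept": ", ".join(RDF_TYPES)},
                            timeout=10, allow_redirects=True)
    except requests.RequestException:
        return False
    # a URI counts as dereferenceable if it resolves to an RDF representation
    content_type = resp.headers.get("Content-Type", "").split(";")[0].strip()
    return resp.status_code == 200 and content_type in RDF_TYPES
```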

A second group deals with representational aspects of the provided data. We implemented metrics that check whether the same data can be retrieved in different RDF serialization formats, whether well-known vocabularies are reused, and whether the usage of constructs like blank nodes or other prolix RDF features is avoided. The usage of ShortURIs might also be seen as an intrinsic aspect and is subject to discussion regarding the character length of a concept representation. From our experience during implementation, this can be use case and domain dependent. As other publications did not state a recommended explicit maximum length for a short URI, we used 80 characters as a general threshold.
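Two of these representational checks can be sketched with rdflib as follows; treating the ratios as the raw measurement values is our own simplification:

```python
# Sketch of two representational checks: the share of URIs within the
# 80-character threshold and the share of blank nodes among all nodes.
from rdflib import Graph, BNode, URIRef

def representational_ratios(g: Graph, max_len: int = 80):
    nodes = {term for triple in g for term in triple}
    uris = [n for n in nodes if isinstance(n, URIRef)]
    blanks = [n for n in nodes if isinstance(n, BNode)]
    short_uri_ratio = sum(len(str(u)) <= max_len for u in uris) / max(len(uris), 1)
    blank_node_ratio = len(blanks) / max(len(nodes), 1)
    return short_uri_ratio, blank_node_ratio
```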

Next, we were interested in analyzing general intrinsic quality aspects of openly accessible structured data. After checking the general validity with a respective validator, SemQuire converts the data internally into a uniform RDF/XML representation. Then, either traditional RDF validators can be applied or more sophisticated third-party tools such as RDFAlerts [5]. In order to check other intrinsic dimensions such as consistency, completeness, and conciseness metrics, it is first of all necessary to retrieve schema information on the ontologies used in the document. Dereferencing all namespaces used within a document is one possible, flexible automated approach; however, not all ontology description sites offer a machine-readable version of the vocabulary yet. Completeness checks pose another challenge for a quality checker by requiring additional background knowledge (a “gold standard”). Obviously, this is hard to achieve for certain application domains under an Open World Assumption for distributed data. Additionally, a comparison based on literal values is not practically useful across different languages or spellings. Instead, a completeness check based on entity URIs is more valuable. However, it also has to consider owl:sameAs relationships for similar concepts identified under different URI domain names. SemQuire checks all intrinsic metrics based on the data available from the current and linked documents.
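A URI-based completeness check of this kind could look as follows; note that the gold-standard set of entity URIs must be supplied as external background knowledge, and the owl:sameAs handling shown here is a simplified, non-transitive sketch:

```python
# Sketch of a URI-based population completeness check against a gold
# standard, taking owl:sameAs aliases in the data set into account.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

def population_completeness(g: Graph, gold_standard: set) -> float:
    present = {s for s in g.subjects() if isinstance(s, URIRef)}
    # treat owl:sameAs aliases as occurrences of their counterparts
    for s, o in g.subject_objects(OWL.sameAs):
        present.update({s, o})
    return len(present & gold_standard) / max(len(gold_standard), 1)
```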

In contrast, contextual metrics require an additional usage context for the concrete application scenario by the user. For some contextual dimensions such as timeliness or understandability, simple parameter inputs can be requested from the user, or meaningful standard values can even be applied statically. Checking relevancy needs a complex contextual input to satisfy the metric on a high level. Assessing trust needs either black- or whitelists, an authority, or likewise a complex contextual input. Provenance data can also serve as an input for some trust metrics. To circumvent a complex input, the PageRank approach can be used to provide an initialization for the relevance metric and the more detailed trust metric about content trust. Such an initialization will still not behave like a full-fledged trust network or description of relevance, but it gives the contextual metrics a kick-off in the right direction. Solving a contextual metric with crowd-sourcing does not seem to fit our approach, as each human brings in their own bias.
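One plausible way to realize such a PageRank initialization is sketched below, computing the rank over the URI-to-URI link structure of the data set with networkx; applying it at the level of individual RDF resources is our own assumption:

```python
# Sketch of a PageRank-based initialization for content trust/relevance,
# computed over the URI link structure of the data set.
import networkx as nx
from rdflib import Graph, URIRef

def trust_initialization(g: Graph) -> dict:
    linkgraph = nx.DiGraph()
    for s, _, o in g:
        if isinstance(s, URIRef) and isinstance(o, URIRef):
            linkgraph.add_edge(str(s), str(o))
    return nx.pagerank(linkgraph)  # higher rank -> higher initial trust
```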

For all metrics of interest, each measurement result value is then mapped to a rating score representing the fulfillment of the investigated aspect. It is a numeric value between 0.0 (not fulfilled at all) and 1.0 (perfect). All individual ratings are then linearly combined into an overall quality assessment score. Details can be found in [8].
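The following sketch illustrates the rating step; the concrete latency mapping and the equal default weights are illustrative assumptions on our part, while the actual rating functions are defined in [8]:

```python
# Sketch of the rating step: a metric-specific mapping into [0.0, 1.0]
# and the linear combination of all ratings into an overall score.

def rate_latency(seconds: float, worst: float = 5.0) -> float:
    # illustrative mapping: 0 s -> 1.0 (perfect), >= 5 s -> 0.0
    return max(0.0, 1.0 - seconds / worst)

def overall_score(ratings, weights=None) -> float:
    weights = weights or [1.0] * len(ratings)  # equal weights assumed
    return sum(r * w for r, w in zip(ratings, weights)) / sum(weights)
```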

4 Evaluation

To show the effectiveness of SemQuire, we conducted a case study and used a small set of real-world open data resources to solve a common task for evaluation purposes. In our example case, a user is interested in getting information on all existing movies of the James Bond film series. We chose three different linked open data source candidates, queried them with SemQuire, and later compared the results. The three selected providers were DBpedia, Wikidata, and LinkedMDB.

We therefore manually designed three different SPARQL CONSTRUCT queries to obtain with SemQuire all information about movies of the James Bond film series. The queries differ mainly in the vocabularies used by each data provider, but the semantics are always the same, as we search for all James Bond films and their outgoing relations and properties.
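For illustration, a plausible variant of the DBpedia query is sketched below, issued here via SPARQLWrapper; the exact query shape and the category URI are assumptions of ours, as the original queries are not listed in this section:

```python
# Sketch of a CONSTRUCT query against DBpedia for James Bond films.
from SPARQLWrapper import SPARQLWrapper, RDFXML

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>
CONSTRUCT { ?film ?p ?o }
WHERE {
  ?film dct:subject dbc:James_Bond_films .
  ?film ?p ?o .
}
""")
sparql.setReturnFormat(RDFXML)
graph = sparql.query().convert()   # returns an rdflib Graph
print(len(graph), "triples retrieved")
```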

Not all metrics offered by SemQuire are relevant for the test case, so we carefully selected only metrics that help in the assessment process of finding the most appropriate data source for solving the task. The metrics were chosen either by their importance for the test case or based on interesting differences in the results and ratings of SemQuire. Hence, we will show in the following the differences between the data provider candidates according to the scenario with respect to six metrics and the underlying data. The corresponding measurements' ratings are shown in Table 2. Additionally, we provide the number of returned triples (T#) as statistical meta information for better understanding. Two metrics' results are further shown in Fig. 4 to contrast results and ratings in SemQuire.

Population Completeness (PopulComp). Regarding the test case of gathering all James Bond films, it is important whether the endpoints really return all relevant movies, i.e., have a population completeness of 100%. Surprisingly, metric (43) shows that LinkedMDB returns only 48% of the films. A possible explanation is that LinkedMDB does not link all James Bond films to its category about James Bond films, which would explain this low percentage.

Serialization Format (SeriForm). As the test case does not explicitly specify how the data will be used, it can be very interesting for further processing to have the possibility of retrieving different serialization formats. Metric (52) measures in how many formats the data can be provided. SemQuire indicates that only DBpedia is able to provide more than one format, i.e., more than the standard RDF/XML, via content negotiation.

Various Languages (VarLang). Beyond the processing of the data, the data might also be shown to humans, and thus it can be important that various languages are included in the data set. As our queries do not filter on any language, metric (55) is able to check whether there are various languages in the underlying data. LinkedMDB is again behind the other two, as it provides the information in only one language.

URI Dereferenceability (URIDeref). The metric (13) about the dereferenceability of URIs is relevant for the evaluation, as it illustrates the importance of SemQuire's mapping approach from absolute values to normalized ratings. The results of this metric depict the count of all dereferenceable URIs within the data. All endpoints provide a different number of triples, and thus there are also differences in the results. In contrast, the ratings of this metric show that the difference between the three endpoints is hardly relevant, as they are good and close together for all of them. The rating is created with respect to the overall triple count of the data and is thus more significant than the raw results.

External Links (ExternL). With regard to an open world model, one endpoint is often not able to provide all information within its domain. Metric (06) checks whether the provided data includes links to external data outside the endpoint's domain. Interestingly, only DBpedia provides external links to other domains.

Low Latency (LowLat). A low latency per request to the endpoint can be important for time-critical tasks that frequently live-update their data. The test case does not necessarily require low latency, but metric (09) is still interesting for a general QoS rating of the endpoint and a possible extension of the test scenario. The results are quite different, but the rating again gives an idea of how good these results actually are.

Based on the results and ratings of the six metrics, a decision is required as to which endpoint should be used (a step that involved human interaction in the past). We use SemQuire's ability to combine the discussed measurements into an overall quality assessment value (Score). The resulting order is depicted in Table 2; the recommended endpoint to choose in our test case is consequently DBpedia.

Table 2. SemQuire’s ratings, score and T# for each endpoint
Fig. 4. URIDeref & LowLat results

Table 3. Comparison of quality assessment tools with respect to the requirements from Sect. 2

5 Related Work

Examples of vocabularies for describing data and service quality from the Semantic Web community are daQ [2], the DQM vocabulary [4], and the W3C draft for a Data Quality Vocabulary (DQV). Furthermore, several data quality checker implementations have existed in the past. They differ in various characteristics such as functionality, processable data formats, implementation language, user interface, and the manner of result output. Examples are Diachron [13], KBMetrics [11], LDSrcAss [3], Luzzu [2], RDFAlerts [5], Roomba OpenData Checker [1], Sieve [9], and SWIQA [4]. Some of them focused only on a limited use case or are no longer publicly available. Moreover, assessment results were often provided in different output formats and were not comparable to each other. For instance, the OpenData Checker calculated metrics from data quality indicators specifically for CKAN data stores and simply output them in percent. KBMetrics used a scoring system to make different data sources comparable. SWIQA calculated a quality score based on the percentage of instances that violate given data quality rules. Emphasis has therefore been placed on the requirement to make quality measurements comparable by semantic means. Table 3 contrasts all mentioned software tools based on the original usage requirements posed in Sect. 2. Currently, SemQuire is the only tool that satisfies all defined requirements.

6 Conclusion

In this paper, we presented SemQuire, a practical implementation of a quality assessment component that can be used as a toolkit to measure and assure the quality of open or enterprise data sources that expose information in a common RDF serialization format. SemQuire relies on the theoretical findings of previously published surveys on the most relevant quality metrics and implements 55 of the most common quality indicators. Additionally, we conducted a brief market overview and compared other existing tools with our component, with the result that there is currently, to the best of our knowledge, no other software component available that fulfills all requirements of interest.