1 Introduction

Training machine learning (ML) models [8], integrating heterogeneous data sources [5], and measuring data quality [3, 4] are exemplary tasks that involve more than one data source in an organization. To merge these data sources, a standardized description of the data sources and their data structures is required. This paper presents the Data Source Description Vocabulary (DSD)Footnote 1 version 4.0, which enables the standardized representation of data sources and their internal structure independently of the original type of source (e.g., database management system, comma-separated values (CSV) files).

We delimit DSD from related research in Sect. 2 and describe the details of the vocabulary in Sect. 3. Sect. 4 highlights the relevance of DSD by outlining its applications in practice. The vocabulary is evaluated against the FAIR (Findability, Accessibility, Interoperability, and Reuse [12]) principles in Sect. 5.

2 Related Work

The idea of developing a standardized representation for data sources of different types is not new. Atzeni et al. [1] present a metamodel that can represent (amongst others) relational data models, Entity-Relationship models, and object-oriented models. Candel et al. [2] propose “U-Schema”, a unified metamodel that is based on the Eclipse Modeling Framework (EMF)Footnote 2 and supports the most widely used NoSQL systems, as well as MySQL. The DSD vocabulary differs from such metamodels in that it is based on the Web Ontology Language (OWL)Footnote 3 for building ontologies that represent data sources.

The following OWL-based vocabularies for describing the metadata of data sources [13] have been recommended by the World Wide Web Consortium (W3C):

  • the Data Catalog Vocabulary (DCAT)Footnote 4, which provides terms for describing so-called “data sets” (i.e., data sources) and services to catalog them, and

  • the Vocabulary of Interlinked Datasets (VoID)Footnote 5, which is specifically tailored to describe metadata of Resource Description Framework (RDF) data sets.

In contrast to DSD, neither vocabulary covers the structure inside a data source. Some vocabularies do support the representation of the internal structure of a data source, such as CSV on the Web (CSVW)Footnote 6, which allows describing the structure of CSV files, or the RDF Data Cube VocabularyFootnote 7, which is suitable for multidimensional data. However, each of these vocabularies is dedicated to a specific data source type, while DSD is data source type independent. The Semantic Data Dictionary (SDD) has an objective similar to that of DSD, but supports only tabular data in its current state (support for Extensible Markup Language (XML) is planned) [10].

Despite sharing the acronym, the DSD vocabulary is also different from the DSD Schema Language [9], which is an XML schema language with higher expressiveness than the XML document type definition (DTD)Footnote 8 or XML Schema (XSD)Footnote 9.

In summary, no OWL-based vocabulary other than DSD can represent data sources and their internal structure independently of the data source type.

3 The Data Source Description Vocabulary (DSD)

Originally, Ehrlinger and Wöß published DSD in 2015 [5]. The vocabulary is based on OWL, RDF, and RDF Schema. The core idea of DSD is to provide a terminology for representing the structure of data sources independently of their type [5]. It can be used to represent different types of data sources (e.g., relational or graph databases, document stores) and their (internal) semantics.

Based on our experience in data modeling (Entity-Relationship (ER) models, Unified Modeling Language (UML), and ontologies) and on requirements raised by company partners (cf. applications of DSD in Sect. 4), we defined a set of terms (i.e., OWL classes, object properties, and data properties) for describing data sources. Figure 1 illustrates the classes and object properties defined in DSD. For simplicity, inverse object properties are not shown. An inverse object property in OWL expresses the same relationship as the original property, but with the direction reversed (i.e., subject and object are swapped). We distinguish between “essential” classes, which are necessary for describing a data source using DSD, and “optional” classes, which provide additional, non-essential features. Below, we describe each class, in order of importance.

Fig. 1. OWL classes and OWL object properties in the DSD vocabulary

Essentials

  • Data Source. A generic class for representing data sources. Example: A dsd:DataSource can represent structured data such as relational databases, semi-structured data like XML files, or NoSQL databases such as graph databases or wide-column stores.

  • Concept. A representation of a structural part of a data source. Example: A dsd:Concept can represent a table or a view of a relational database or a class in object-oriented structures.

  • Attribute. A dsd:Attribute describes a property of a dsd:Concept. DSD also provides OWL data properties to define certain attribute characteristics, such as nullable or unique. Example: If a dsd:Concept represents a relational table, its attributes correspond to the columns.

  • Association. A dsd:Association describes a relationship between two instances of dsd:Concept. There are three disjoint dsd:Association subclasses for aggregation, inheritance, and reference associations. For further details and also for object properties of the subclasses, we refer to [5].
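The essential classes can be illustrated with a minimal Python sketch that describes one relational table and flattens the description into subject-predicate-object triples. Only the class names dsd:DataSource, dsd:Concept, and dsd:Attribute come from the vocabulary as described above. The `ex:` property names (`ex:hasConcept`, `ex:hasAttribute`, `ex:isNullable`, `ex:isUnique`), the identifier scheme, and the example data are assumptions made for illustration and are not taken from DSD.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    nullable: bool = True
    unique: bool = False


@dataclass
class Concept:
    name: str
    attributes: list[Attribute] = field(default_factory=list)


@dataclass
class DataSource:
    name: str
    concepts: list[Concept] = field(default_factory=list)


def to_triples(source: DataSource) -> list[tuple[str, str, str]]:
    """Flatten a description into (subject, predicate, object) triples."""
    s_id = f"ex:{source.name}"
    triples = [(s_id, "rdf:type", "dsd:DataSource")]
    for concept in source.concepts:
        c_id = f"{s_id}.{concept.name}"
        triples.append((c_id, "rdf:type", "dsd:Concept"))
        # "ex:hasConcept" is a hypothetical property name, not a DSD term.
        triples.append((s_id, "ex:hasConcept", c_id))
        for attr in concept.attributes:
            a_id = f"{c_id}.{attr.name}"
            triples.append((a_id, "rdf:type", "dsd:Attribute"))
            # The three "ex:" properties below are hypothetical as well.
            triples.append((c_id, "ex:hasAttribute", a_id))
            triples.append((a_id, "ex:isNullable", str(attr.nullable).lower()))
            triples.append((a_id, "ex:isUnique", str(attr.unique).lower()))
    return triples


# A relational table "customer" with two columns, used as example data.
shop = DataSource(
    "shop",
    [Concept("customer", [Attribute("id", nullable=False, unique=True),
                          Attribute("email")])],
)
triples = to_triples(shop)
```

The same three dataclasses could describe an XML file or a graph database, with only the mapping from the original structure to concepts and attributes changing.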

Optionals

  • Schema. Instances of dsd:Schema create an optional hierarchy level between data sources (instances of dsd:DataSource) and concepts (instances of dsd:Concept). Schemas allow the grouping of concepts and are commonly used in enterprise databases.

  • Data Source Type. This class provides instances of the most common data source types, which can be assigned to instances of dsd:DataSource.

  • Primary Key and Foreign Key. Instances of these two classes are assigned to a dsd:Association or dsd:Concept and consist of one or more instances of dsd:Attribute (i.e., keys can be composite).
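The idea that a key consists of one or more attributes can be sketched in a few lines of Python. The class `Key`, its field names, and the example table are hypothetical and only illustrate the composite-key case. The sketch does not reproduce how DSD itself links keys to concepts or associations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    kind: str                    # "PrimaryKey" or "ForeignKey"
    assigned_to: str             # name of the concept or association that owns the key
    attributes: tuple[str, ...]  # one or more attribute names

    def __post_init__(self) -> None:
        if self.kind not in ("PrimaryKey", "ForeignKey"):
            raise ValueError(f"unknown key kind: {self.kind}")
        if not self.attributes:
            raise ValueError("a key consists of at least one attribute")

    @property
    def is_composite(self) -> bool:
        """A key is composite when it spans more than one attribute."""
        return len(self.attributes) > 1


# A composite primary key over two columns of a hypothetical table.
order_item_pk = Key("PrimaryKey", "OrderItem", ("order_id", "line_no"))
customer_pk = Key("PrimaryKey", "Customer", ("id",))
```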

4 Use Cases and Applications of DSD

In recent years, DSD has been used in various applications. This section discusses three areas where DSD can be useful for both researchers and practitioners.

Schema Matching and Schema Similarity. A key advantage of DSD is to make data sources and their schemas comparable. Thus, in [6], DSD was used to generate homogeneous representations of data source schemas, which could then be compared directly. The similarity of these schemas (i.e., their degree of overlap) was used as input for a metric to assess the schema quality [6].
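To make "degree of overlap" concrete, the following sketch compares two homogeneous schema representations using the Jaccard coefficient over their concept and attribute names. This is a deliberately simple stand-in chosen for illustration, not the similarity measure used in [6]. The dictionary layout and the example schemas are assumptions.

```python
def schema_elements(schema: dict[str, list[str]]) -> set[str]:
    """Flatten {concept: [attribute, ...]} into lower-cased 'concept.attribute' labels."""
    return {
        f"{concept.lower()}.{attribute.lower()}"
        for concept, attributes in schema.items()
        for attribute in attributes
    }


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of elements that occur in both sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical schemas that both describe customers.
crm = {"Customer": ["id", "name", "email"]}
erp = {"customer": ["id", "name", "phone"]}

similarity = jaccard(schema_elements(crm), schema_elements(erp))
```

Exact name matching is the weakest possible matcher. A realistic comparison would also have to account for synonyms, data types, and structural context.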

Metadata Management. The implementation of a corporate metadata management system (e.g., a data catalog) requires comparability of data source schemas of different types. For that purpose, we employed DSD to represent different data sources in a manufacturing company [11]. In this project, DSD was the basis for describing data sources and their internal structure, which can then be annotated with different kinds of metadata, e.g., access security metadata or the assignment of data responsibility roles.

Data Quality. In real-world scenarios, data quality assessment should be carried out on multiple (heterogeneous) data sources. Thus, the data quality tools QuaIIe [4] and DQ-MeeRKat [3], which aim to be data source type independent, implement connectorsFootnote 10 that map the original schema of a data source to a DSD representation (see Table 1 in [5]). After calculating different data quality metrics, the measurement results can be annotated to these representations.
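A connector of this kind can be sketched for SQLite using only the Python standard library. The sketch maps tables and views to concepts and columns to attributes. It is not the connector implementation of QuaIIe or DQ-MeeRKat, and the function name `describe_sqlite` as well as the dictionary keys (`datatype`, `nullable`) are assumptions made for illustration.

```python
import sqlite3


def describe_sqlite(conn: sqlite3.Connection, source_name: str) -> dict:
    """Map an SQLite schema to a DSD-like description (tables -> concepts, columns -> attributes)."""
    description = {"type": "dsd:DataSource", "name": source_name, "concepts": []}
    # List user-defined tables and views, skipping SQLite's internal tables.
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type IN ('table', 'view') AND substr(name, 1, 7) != 'sqlite_' "
        "ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        concept = {"type": "dsd:Concept", "name": table, "attributes": []}
        quoted = table.replace('"', '""')
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        columns = conn.execute(f'PRAGMA table_info("{quoted}")')
        for _cid, name, col_type, notnull, _default, _pk in columns:
            concept["attributes"].append(
                {
                    "type": "dsd:Attribute",
                    "name": name,
                    "datatype": col_type,
                    "nullable": not notnull,
                }
            )
        description["concepts"].append(concept)
    return description


# Usage with an in-memory database and a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT)")
description = describe_sqlite(conn, "shop")
```

A connector for another source type (e.g., CSV files) would produce the same output structure, which is what allows data quality metrics to be computed independently of the source type.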

5 Evaluation Against the FAIR Principles

The FAIR principles define a measurable set of guidelines to assess the FAIRness of a data asset [12] and are therefore well suited to evaluate the quality (i.e., findability, accessibility, interoperability, and reuse) of DSD. We conducted a two-fold evaluation: (1) an automated evaluation using FOOPS!Footnote 11 in Sect. 5.1 and (2) a manual evaluation against the FAIR principles as published online in Sect. 5.2.

5.1 Automatic Evaluation

For the automatic evaluation, we used the tool FOOPS! (Ontology Pitfall Scanner for FAIR) [7]. FOOPS! determines FAIRness by checking if Internationalized Resource Identifiers (IRIs) are resolvable and permanent, and if certain OWL properties (e.g., author, publication date, provenance information) are present.

In the automatic evaluation, DSD achieves a FAIRness score of 88%. FOOPS! does not assess DSD to be fully FAIR since it does not recognize some specific metadata. As an example, information on authors and contributors of DSD is included as instances of foaf:Person, but FOOPS! expects the presence of literal values.

5.2 Manual Evaluation

For each FAIR principleFootnote 12, we manually assessed whether it is fulfilled by DSD and justified the assessment, as shown in detail in Table 1. Overall, we consider DSD to be fully FAIR.

Table 1. Manual evaluation against the FAIR Principles.

6 Conclusion and Outlook on Future Work

Although the focus of DSD is on the description of data sources, previous versions contained additional elements, e.g., a class Stakeholder, which was used for modelling people and their permissions to data sources. In the newest version 4.0, we removed all capabilities that do not support the core idea of DSD and instead suggest reusing and combining other vocabularies to annotate a data source with different kinds of metadata. An example is the Data Quality Vocabulary (DQV)Footnote 13, which is specifically designed to represent data quality metadata. DSD 4.0 is the first version that includes a rich set of metadata as well as a permanent identifier, and thus fulfills the FAIR principles. Because we use DSD intensively in data quality tools (cf. [3, 4]), we will further investigate the integration of DSD with DQV in our ongoing research. We also encourage other research groups to investigate the integration of additional vocabularies for annotating DSD data sources with metadata, e.g., security or provenance metadata.

All links in this publication were last visited on June 1, 2023.