
1 Introduction

Over the last decade we have seen the World Wide Web being populated more and more by “machines”. The World Wide Web has evolved from its original form, a network of linked documents readable by humans, more and more into a Web of data and APIs. That is, nowadays, even if we interact with Web pages as humans, in most cases (i) the contents of Web pages are generated from databases in the backend, (ii) the Web content we see as humans contains annotations readable by machines, and even (iii) the way we interact with Web pages generates data (frighteningly, often without the users even being aware of it), which is collected and stored again in databases around the globe. It is therefore fair to say that the Web of Data has become a reality and – to some extent – even the vision of the Semantic Web. In fact, this vision of the Semantic Web has itself evolved over the decades, starting with Berners-Lee et al.’s seminal article in 2001 [13], which already envisioned the future Web as “federating particular knowledge bases and databases to perform anticipated tasks for humans and their agents”. Based on these ideas, a lot of effort and research has been devoted to the World Wide Web Consortium (W3C) Semantic Web activity,Footnote 1 which in 2013 was subsumed by – i.e., renamed to – the “Data Activity”.Footnote 2

In many respects, the Semantic Web has not evolved as expected, and its biggest success stories so far depend less on formal logics [37] than we may have expected, and more on the availability of data. A recent article by Bernstein et al. [14] takes a retrospective look at the community and summarizes successes such as the establishment of lightweight annotation vocabularies like Schema.org on Web pages, or the uptake by large companies such as Google, Yahoo!, Microsoft, and Facebook, which are developing large knowledge graphs – which, however, these companies so far mostly keep closed.

Thus, if Web researchers outside of these companies want to tap into the rich sources of data now available on the Web, they need to develop their own data workflows to find relevant and usable data. To their benefit, more and more Open Data is being published on the Web, that is, data made freely available, mostly by public institutions (Open Government Data), both for transparency reasons and with the goal of “fuelling” a data economy, as pushed both by the EU [29] and the G8 [72].

The present lecture notes may be viewed partially as an experience report and – hopefully – as a guide through the challenges arising when using (Open) Data from the Web. The authors have been involved over the past few years in several projects and publications around the topic of Open Data integration, monitoring, and processing. The main challenges we have come across in all these projects overlap to a large extent, and we therefore decided to present them in the present chapter:

  1. Where to find Open Data? (Sect. 2) Most Open Data nowadays can be found on so-called Open Data portals, that is, data catalogs, typically allowing API access and hosting dataset descriptions and links to actual data resources.

  2. “Low-level” data heterogeneity (Sect. 3) As we will see, most of the structured data provided as Open Data is not readily available as RDF or Linked Data – the preferred formats for semantic data access described in other chapters of this volume. Other formats are far more prevalent, and encoding issues additionally make it difficult to access those datasets.

  3. Licenses and Provenance (Sect. 4) Not all Open Data is really completely open, since most data on the Web comes with different licenses, terms, and conditions attached, so we will discuss how and whether these licenses can be interpreted by machines, and, respectively, how the provenance of different integrated data sources can be tracked.

  4. Quality issues (Sect. 5) A major challenge for data – also often related to its provenance – is quality; on the one hand the re-use of poor quality data is obviously not advisable, but on the other hand different applications might have different demands/definitions of quality.

  5. How to find data – Searchability? (Sect. 6) Last, but not least, we will look into current solutions for search in Open Data, which we pose as a major open research challenge: whereas crawling and keyword-based search over human-readable websites work well, this is not yet the case for structured data on the Web; we will discuss why and sketch some routes ahead.

Besides these main questions, we conclude in Sect. 7 with a summary of issues and open questions around integrating Open Data from the Web that are not covered explicitly herein, such as multi-linguality, temporal aspects (archiving, evolution, temporal querying), as well as how and whether OWL and RDFS reasoning on top of integrated Open Data could help.

2 Where to Find Web Data?

If we look for sources of openly available data that are widely discussed in the literature, we can mainly identify four starting points, which partially overlap:

  • User-created open databases

  • The Linked Open Data “Cloud”

  • Web crawls

  • Open Data Portals

User-created open databases are large amounts of data and databases that have been co-created, through efforts such as Wikipedia, by user communities distributed around the globe; the most important ones are listed as follows:

  • DBpedia [44] is a community effort that has created one of the biggest and most important cross-domain datasets in RDF [19], located at the focal point of the so-called Linked Open Data (LOD) cloud [6]. At its core is a set of declarative mappings extracting data from Wikipedia infoboxes and tables into RDF, and it is accessible through dumps as well as through an open query interface supporting the SPARQL [33] query language. DBpedia can therefore rightly be called one of the cornerstones of Semantic Web and Linked Data research, having been the subject and center of a large number of research papers over the past few years. Reported numbers vary, as DBpedia is modular and steadily growing with Wikipedia; e.g., in 2015 DBpedia contained overall more than 3 billion RDF statements,Footnote 3 of which the English DBpedia contributed 837 M statements (RDF triples). Those 837 M RDF triples alone amount to 4.7 GB when stored in the compressed RDF format HDT [30].Footnote 4 However, as we will see, there are many other, indeed far bigger, openly accessible data sources that yet remain to be integrated, and these are rather the focus of the present chapter.

  • Wikidata [74], a similar but conceptually different effort, was started in 2012 to bring order into the data items in Wikipedia, with the idea of building – instead of extracting data from semi-structured wiki pages – a database of data observations with fixed properties and datatypes, mainly in order to avoid extraction errors and to provide means to record provenance directly with the data; it likewise contains hundreds of millions of facts in the meantime. Exact numbers are hard to give, but [71] report some statistics as of 2015, when Freebase was included into Wikidata; we note that counting RDF triplesFootnote 5 is only partially useful, since the data representation of Wikidata is not directly comparable with that of DBpedia [35, 36].

  • OpenStreetMap, another example of an openly available database largely created by users, contains a vast amount of geographic features with the aim of providing an openly available and re-usable map; it currently comprises 739.7 GB of (uncompressed) data in OSM’s native XML format (still 33 GB compressed).Footnote 6

The Linked Open Data “Cloud” – already mentioned above – is a manually curated collection of datasets that are published openly on the Web, adhering to the so-called Linked Data principles, defined as follows [12] (cf. chapters of previous editions of the Reasoning Web book series for good overview articles; a minimal dereferencing sketch follows the list):

  • LDP1: use URIs as names for things;

  • LDP2: use HTTP URIs so those names can be dereferenced;

  • LDP3: return useful – herein we assume RDF – information upon dereferencing of those URIs; and

  • LDP4: include links using externally dereferenceable URIs.Footnote 7
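
To make LDP2 and LDP3 concrete, the following small Python sketch dereferences a DBpedia URI with HTTP content negotiation and counts the returned triples; the URI and endpoint are real, but the snippet is merely a minimal illustration and response formats and availability may vary:

    import requests
    from rdflib import Graph

    # LDP2/LDP3: dereference an HTTP URI and ask for an RDF serialization
    uri = "http://dbpedia.org/resource/Vienna"   # example resource
    resp = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=30)

    g = Graph()
    g.parse(data=resp.text, format="turtle")
    print(len(g), "triples returned for", uri)

    # LDP4: outgoing links are simply triples whose objects are external HTTP URIs
    external = {o for _, _, o in g
                if str(o).startswith("http") and "dbpedia.org" not in str(o)}
    print(len(external), "links to external URIs")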

The latest iteration of the LOD Cloud [1] contains – with DBpedia at its center – hundreds of datasets of equal or even larger size than DBpedia, documenting a significant growth of Linked Data over the past years. Still, while the LOD cloud and the “Web of Data” are often implicitly equated in the Semantic Web literature, there is a lot of structured data available on the Web that (a) uses RDF but is not linked to other datasets, or (b) is provided in popular formats other than RDF.

Running Web crawls is the only way to actually find and discover structured Web data at large; it is both resource-intensive and challenging in terms of respecting politeness rules when crawling. However, some Web crawls have been made openly available, such as the Common Crawl corpus, which contains “petabytes of data collected over the last 7 years”.Footnote 8 Indeed, this corpus has already been used to collect and analyse the availability (and quality) of structured data on the Web, e.g. in the Web Data Commons project [50, 51] (Table 1).

Open Data portals are collections or catalogs that index metadata and link to actual data resources; they have become popular over the past few years through various Open Government Data initiatives, but also in the private sector. Apart from the other sources mentioned so far, most of the data published openly is indexed in some kind of Open Data portal. We will therefore discuss these portals in more detail in the rest of this chapter.

Table 1. Top-10 portals, ordered by number of datasets.

Open Data portals

Most of the current “open” data forms part of datasets that are published on Open Data portals, which are basically catalogues similar to digital libraries (cf. Fig. 1): in such catalogues, a dataset aggregates a group of data files (referred to as resources or distributions) which are available for access or download in one or more formats (e.g., CSV, PDF, Microsoft Excel). Additionally, a dataset contains metadata (i.e., basic descriptive information in structured form) about these resources, e.g. authorship, provenance, or licensing information. Most of these portals rely on existing software frameworks, such as CKANFootnote 9 or Socrata,Footnote 10 that offer UI, search, and API functionalities.

Fig. 1. High-level structure of a data catalog.

CKAN is the most prominent portal software framework for publishing Open Data and is used by several governmental portals, including data.gov.uk and data.gov.

For example, the Humanitarian Data ExchangeFootnote 11 (see Fig. 2) is a portal operated by the United Nations. It aggregates and publishes data about the context in which a humanitarian crisis is occurring (e.g., damage assessments and geospatial data) and data about the people affected by the crisis. The datasets on this portal are described using several metadata fields, and the metadata description can be retrieved in JSON format via the portal’s Web API (cf. Fig. 2).

The metadata descriptions of these datasets provide download links for the actual content. For instance, the particular dataset description in Fig. 2 – a dataset reporting the amounts paid by refugees to facilitate their movement to Europe – holds a URL which refers to a table (a CSV file) containing the corresponding data, displayed in Table 2.
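
As a minimal sketch of this kind of programmatic access (the CKAN action API endpoints are standard, but the dataset identifier below is hypothetical), the metadata of a dataset on a CKAN-based portal such as the Humanitarian Data Exchange can be retrieved, and its first CSV resource downloaded, roughly as follows:

    import csv, io, requests

    PORTAL = "https://data.humdata.org"          # HDX runs on CKAN
    DATASET_ID = "refugee-movement-costs"        # hypothetical identifier

    # CKAN action API: package_show returns the dataset's metadata as JSON
    meta = requests.get(f"{PORTAL}/api/3/action/package_show",
                        params={"id": DATASET_ID}, timeout=30).json()["result"]
    print(meta["title"], "-", len(meta["resources"]), "resources")

    # follow the download link of the first CSV distribution
    csv_res = next(r for r in meta["resources"] if r.get("format", "").upper() == "CSV")
    table = list(csv.reader(io.StringIO(requests.get(csv_res["url"], timeout=30).text)))
    print("header:", table[0])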

Fig. 2. Example dataset description from the Humanitarian Data Exchange portal.

Table 2. The tabular content of the dataset in Fig. 2

3 Data Formats on the Web

When discussing the data available on the Web above, we already emphasized that – despite being the subject of a lot of research – RDF and Linked Data are not necessarily the prevalent formats for published data on the Web. An analysis of the datasets systematically catalogued in Open Data portals confirms this. Likewise, we will have to discuss metadata formats on these portals.

Data Formats on Open Data Portals. Table 3 shows the most frequently used formats and the number of unique resources, together with the number of portals in which they appear, adapted from [58], where we crawled and analysed metadata from 260 Open Data portals for cues to the data formats in which different datasets are provided. Note that these numbers are based on the available metadata information of the datasets, and the real numbers can be higher due to varying spellings, misspellings, and missing metadata. Therefore, these numbers should be considered a lower bound for the respective formats. Bold highlighted values indicate that the format is considered open as per the Open Definition [12]:Footnote 12 the Open Definition sets out several guidelines for which data formats are to be considered “open”, according to which we assessed openness using a list of compliant formats, cf. [58].

Table 3. Most frequent formats.

A surprising observation is that ~10% of all resources are published as PDF files. This is remarkable because, strictly speaking, PDF cannot be considered an Open Data format: while PDFs may contain structured data (e.g. in tables), there are no standard ways to extract such structured data from PDFs, or from general-purpose document formats in general. Therefore, PDFs can be considered neither machine-readable nor a suitable way of publishing Open Data. As we also see, RDF does not appear among the top-15 formats for Open Data publishing.Footnote 13 This underlines the previously stated hypothesis that – especially in the area of Open Government Data – openly available datasets on data portals are mostly not published as RDF or Linked Data.

Also, JSON does not appear among the top ten formats in terms of the number of published data resources on Open Data portals. Still, we include these formats in our discussion below, as

  • particularly JSON and RDF play a significant role in metadata descriptions,

  • JSON is the prevalent format for many Web APIs,

  • RDF, as we saw, is – apart from the Linked Data cloud – prevalent in Web pages and Web crawls, owing to its support as an annotation format by popular search engines.

In the following we introduce some of these popular, well-known data formats on the Web and categorize them by their structure, namely graph-based, tree-shaped, and tabular formats.

Fig. 3. RDF graph of the DCAT metadata mapping of Fig. 2.

3.1 Graph-Based Formats

RDF, a W3C recommendation since 2004 [41], “refurbished” in 2014 [19, 23], was originally conceived as a metadata model and language for describing resources on the Web. It has evolved (also through deployment) into a universal model and format to describe arbitrary relations between resources identified, typically, by URIs, such that they can be read and understood by machines.

RDF itself consists of statements in the form of subject–predicate–object triples. RDF triples can be displayed as graphs where the subjects and objects are nodes and the predicates are directed edges. RDF uses vocabularies to define the set of elements that can be used in an application. Vocabularies are similar to schemas for RDF datasets and can also define the domain and range of predicates. The graph in Fig. 3 represents the metadata description of the dataset in Fig. 2 in the DCAT (Data Catalog) vocabulary [48].Footnote 14

Several formats exist to serialize RDF data. Most prominent is RDF/XML, the XML serialization first introduced in the course of the 1999 W3C specification of the RDF data model, but there are also more readable/concise textual serialization formats such as the line-based N-Triples [21] and the “Terse RDF Triple Language” (Turtle) [10] syntax. More recently, in 2014, the W3C released the first recommendation for JSON-LD [68]. JSON-LD is an extension of the JSON format (see below) that mainly adds the ability to specify namespaces for identifiers and support for URIs (supporting Linked Data principles natively in JSON), which allows the serialization of RDF as JSON or, vice versa, the interpretation of JSON as RDF: conventional JSON parsers and databases can be used, and users of JSON-LD who are mainly interested in conventional JSON are not required to understand RDF and do not have to use the Linked Data additions.
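
As a brief illustration, the following sketch (with a made-up example dataset URI) builds a tiny graph with rdflib and emits it in three of these serializations:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS, RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    EX = Namespace("http://example.org/")           # hypothetical namespace

    g = Graph()
    g.bind("dcat", DCAT); g.bind("dct", DCTERMS)
    ds = EX["dataset/mediterranean-arrivals"]       # hypothetical dataset URI
    g.add((ds, RDF.type, DCAT.Dataset))
    g.add((ds, DCTERMS.title, Literal("Mediterranean arrivals")))
    g.add((ds, DCAT.distribution, EX["dataset/mediterranean-arrivals/csv"]))

    print(g.serialize(format="turtle"))    # Terse RDF Triple Language
    print(g.serialize(format="json-ld"))   # JSON-LD
    print(g.serialize(format="nt"))        # line-based N-Triples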

3.2 Tree-Shaped Formats

The JSON file format [18] is a so-called semi-structured file format, i.e., one where documents are loosely structured without a fixed schema (as, for example, data in relational databases has) as attribute–value pairs, where values can be primitive (strings, numbers, Booleans), arrays (sequences of values enclosed in square brackets ‘[’, ‘]’), or nested JSON objects (enclosed in curly braces ‘{’, ‘}’), thus – essentially – providing a serialization format for tree-shaped, nested structures. For an example of JSON we refer to Fig. 2.

Initially, the JSON format was mainly intended to transmit data between servers and Web applications, supported by Web services and APIs. In the context of Open Data we often find JSON as a format to describe metadata, but also to publish the actual data: raw tabular data can easily be transformed into semi-structured, tree-based formats like JSON,Footnote 15 and JSON is therefore often used as an alternative representation for accessing the data. On the other hand, JSON is the de facto standard for retrieving metadata from Open Data portals.
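
For instance, a plain table can be turned into a list of JSON objects in a few lines; the file name and columns in this sketch are made up:

    import csv, json

    # read a table and emit one JSON object per row, keyed by the header
    with open("arrivals.csv", newline="", encoding="utf-8") as f:   # hypothetical file
        rows = list(csv.DictReader(f))

    print(json.dumps(rows, indent=2))   # e.g. [{"route": "East Med Sea", "amount": "2500"}, ...]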

XML. For the sake of completeness, due to its long history, and also due to its still striking prevalence as a data exchange format of choice, we shall also mention some observations on XML. This prevalence is not really surprising, since many industry standards and tools export and deliver XML, which is then used as the output format of many legacy applications and is still popular for many Web APIs, e.g., in the area of geographical information systems (e.g. KML,Footnote 16 GML,Footnote 17 WFS,Footnote 18 etc.). Likewise, XML has a large number of associated standards around it, such as query, navigation, transformation, and schema languages like XQuery,Footnote 19 XPath,Footnote 20 XSLT,Footnote 21 and XML Schema,Footnote 22 which are still actively developed, supported by semi-structured database systems and other tools. XML by itself has been the subject of extensive research, for example in the fields of data exchange [4, Part III] or query languages [8]. Particularly in the context of the Semantic Web, there have also been proposals to combine XQuery with SPARQL, cf. for instance [15, 26] and references therein. The issue of interoperability between RDF and XML is indeed further discussed within the W3C in the recently started “RDF and XML Interoperability Community Group”;Footnote 23 see also [16] for a summary. So, whereas JSON probably has better support in terms of developer-friendliness and recent uptake, particularly through Web APIs, there is still a strong community with well-established standards behind XML technologies. For instance, schema languages and query languages for JSON exist as proposals, but their formal underpinning is still under discussion, cf. e.g. [17, 63]. Another approach would be to adopt, reuse, and extend XML technologies to work on JSON itself, as for instance proposed in [26]. On an abstract level, there is not much to argue about JSON and XML being just two syntactic variants for serializing arbitrary, tree-shaped data.

3.3 Tabular Data Formats

Last but not least, potentially driven by the fact that the vast majority of Open Data on the Web originates from relational databases or simply from spreadsheets, a large part of the Web of Open Data consists of tabular data. This is illustrated by the fact that two of the most prominent formats for publishing Open Data in Table 3 cover tabular data: CSV and XLS. Note in particular that both of these formats are present on more Open Data portals than, for instance, XML.

While XLS (the export format of Microsoft Excel) is obviously a proprietary format, CSV (comma-separated values) is a simple, open format with a standard specification allowing to serialize arbitrary tables as text (RFC 4180) [67]. However, as we have shown in a recent analysis [54], compliance with this standard across published CSVs is not consistent: in an Open Data corpus containing 200 k tabular resources with a total file size of 413 GB, we found that only 50% of the resources labelled as tabular in Open Data portals can be considered CSV files. In this work we also investigated the use of different delimiters, the presence of (multiple) header rows, and cases where single CSV files actually contain multiple tables as common problems.
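
A first, purely heuristic line of defence against such heterogeneity is dialect sniffing; the following sketch (with a made-up file name) guesses the delimiter and checks for a header row using Python's standard library:

    import csv

    with open("messy_table.csv", newline="", encoding="utf-8") as f:   # hypothetical file
        sample = f.read(64 * 1024)
        dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")  # guess the delimiter
        has_header = csv.Sniffer().has_header(sample)              # guess if a header row exists
        f.seek(0)
        rows = list(csv.reader(f, dialect))

    print("delimiter:", repr(dialect.delimiter), "header:", has_header, "rows:", len(rows))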

Last, but not least, as opposed to tabular data in relational databases, which typically adheres to a fixed schema and constraints, these constraints, datatype information, and other schema information are typically lost when the data is exported and re-published as CSV. This loss can partially be compensated by adding this information as additional metadata to the published tables; one particular format for such metadata has recently been standardized by the W3C [65]. For more details on the importance of metadata we refer also to Sect. 5 below.
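
A minimal sketch of such table metadata in the spirit of the W3C CSV on the Web vocabulary might look as follows (the file, column names, and datatypes are invented for illustration):

    import json

    # A minimal CSV-on-the-Web style metadata description (sketch; values are invented)
    csvw = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": "arrivals.csv",                           # hypothetical table
        "dialect": {"delimiter": ",", "header": True},
        "tableSchema": {"columns": [
            {"name": "route",  "titles": "route",  "datatype": "string"},
            {"name": "amount", "titles": "amount", "datatype": "integer"},
        ]},
    }
    print(json.dumps(csvw, indent=2))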

3.4 Data Formats – Summary

Overall, while data formats are often considered mere syntactic sugar, one should not underestimate the issues around conversions, script and parsing errors, stability of tools, etc., where often a significant amount of work is incurred. While in principle any data can be converted to a CSV, XML, or RDF serialization, one should keep in mind that a canonical, “dumb” serialization in RDF by itself does not “add” any “semantics”.

For instance, a naive RDF conversion (in Turtle syntax) of the CSV in Table 2 could look as follows in Fig. 4, but would obviously not make the data more “machine-readable” or easier to process.

Fig. 4. Naive conversion of tabular data into RDF.
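
Since Fig. 4 is not reproduced here, the following sketch indicates what such a naive, schema-agnostic conversion might look like, using a made-up ex: namespace, generic column properties c1–c4, and purely illustrative row values:

    import csv, io
    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    EX = Namespace("http://example.org/ns#")        # made-up namespace
    g = Graph(); g.bind("ex", EX)

    data = "East Med Sea,2500,2016,boat\nWest Med Sea,1000,2016,boat\n"  # illustrative rows only
    for i, row in enumerate(csv.reader(io.StringIO(data))):
        subj = URIRef(f"http://example.org/row/{i}")
        g.add((subj, RDF.type, EX.Row))
        for j, cell in enumerate(row, start=1):
            g.add((subj, EX[f"c{j}"], Literal(cell)))   # "dumb" column properties c1..c4

    print(g.serialize(format="turtle"))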

We leave coming up with a likewise naive (and probably useless) conversion to XML or JSON to the reader: the real intelligence in mapping such data lies in finding suitable ontologies to describe the properties representing columns c1 to c4, recognizing the datatypes of the column values, linking names such as “East Med Sea” to actual entities occurring in other datasets, etc. Still, in typical data processing workflows, more than 80% of the effort goes into such data conversion, pre-processing, and cleansing tasks.

Within the Semantic Web, or, to be more precise, within the closed scope of Linked Data, this problem and the steps involved have been discussed in depth in the literature [7, 60]. A partial instantiation of a platform that provides a cleansed and integrated version of the Web of Linked Data is presented by the LOD Laundromat project [11]: here, the authors make available a cleansed, unified store of Linked Data as an experimental platform for the whole Web of Linked Data, containing most of the datasets of the current LOD cloud. Querying this platform efficiently and investigating the properties of this subset of the Web of Data is the subject of active ongoing research, even though only linked RDF data has been considered: building such a platform for arbitrary Open Data on the Web, or even only for the data accumulated in Open Data portals, would demand a solution at a much larger scale, handling more tedious cleansing, data format conversion, and schema integration problems.

4 Licensing and Provenance of Data

Publishing data on the Web is more than just making it publicly accessible. When it comes to consuming publicly accessible data, it is crucial for data consumers to be able to assess the trustworthiness of the data, to be able to use it on a secure legal basis, and to know where the data comes from and how it has been pre-processed. As such, if data is to be published on the Web, appropriate metadata (e.g., describing the data’s provenance and licensing information) should be published alongside it, thus making published data as self-descriptive as possible (cf. [34]).

Table 4. Top-10 licenses.

4.1 Open Data Licensing in Practice

While metadata about the terms and conditions under which a dataset can be re-used is essential for its users, according to the Linked Open Data Cloud web page, less than 8% of the Linked Data datasets provide license information.Footnote 24

Within Open Data portals, the situation seems slightly better overall: more than 50% of the datasets monitored in Open Data portals by the Portalwatch project (see Sect. 5 below) announce some kind of license information somewhere in their metadata [58]. The most prevalent license keys used in Open Data portals [58] are listed in Table 4.

While most of the provided license definitions lack a machine-readable description that would allow automated compatibility checks of different licenses or the like, some are not even compliant with Open Definition conformant data licenses (cf. Table 5).

Table 5. Open definition conformant data licenses [40]

In order to circumvent these shortcomings, different RDF vocabularies have been introduced to formally describe licenses as well as provenance information of datasets, two of which (ODRL and PROV) we will briefly introduce in the next two subsections.

4.2 Making Licenses Machine-Readable

The Open Digital Rights Language (ODRL) [39] is a comprehensive policy expression language (representable with a corresponding RDF vocabulary) that has been demonstrated to be suitable for expressing fine-grained access restrictions, access policies, as well as licensing information for Linked Data, as shown in [20, 69].

An ODRL Policy is composed of a set of ODRL Rules and an ODRL Conflict Resolution Strategy, which is used by the enforcement mechanism to ensure that when conflicts among rules occur, a system either grants access, denies access or generates an error in a non-ambiguous manner.

An ODRL Rule either permits or prohibits the execution of a certain action on an asset (e.g. the data requested by the data consumer). The scope of such rules can be further refined by explicitly specifying the party/parties that the rule applies to (e.g. Alice is allowed to access some dataset), using constraints (e.g. access is allowed until a certain date) or in case of permission rules by defining duties (e.g. a payment of 10 euros is required).

Listing 1.1 demonstrates how ODRL can be used to represent the CreativeCommons license CC-BY 4.0.

Listing 1.1. ODRL representation of the CC-BY 4.0 license.
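
The listing itself is not reproduced here; the following sketch (using rdflib, a hypothetical policy URI, and a plausible but unverified selection of ODRL actions) indicates roughly how such a policy can be expressed – the actual encoding of CC-BY 4.0 in Listing 1.1 may differ:

    from rdflib import Graph

    # Sketch of an ODRL policy along the lines of Listing 1.1 (terms chosen for illustration)
    ttl = """
    @prefix odrl: <http://www.w3.org/ns/odrl/2/> .
    @prefix ex:   <http://example.org/> .

    ex:policy-ccby40 a odrl:Policy ;
        odrl:permission [
            a odrl:Permission ;
            odrl:target ex:someDataset ;                             # the licensed asset
            odrl:action odrl:distribute, odrl:reproduce, odrl:derive ;
            odrl:duty [ a odrl:Duty ; odrl:action odrl:attribute ]   # attribution duty
        ] .
    """
    g = Graph().parse(data=ttl, format="turtle")
    print(len(g), "triples in the example policy")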

Policy Conflict Resolution. A rule that permits or prohibits the execution of an action on an asset could potentially affect related actions on that same asset. Explicit relationships among actions in ODRL are defined using a subsumption hierarchy, which states that an action \(\alpha_1\) is a broader term for action \(\alpha_2\) and thus might influence its permission/prohibition (cf. Fig. 5). Implicit dependencies, on the other hand, indicate that the permission associated with an action \(\alpha_1\) requires another action \(\alpha_2\) to be permitted as well. Implicit dependencies can only be identified by interpreting the natural language descriptions of the respective ODRL actions (cf. Fig. 6). As such, when it comes to the enforcement of access policies defined in ODRL, a reasoning engine is needed that is capable of catering for both explicit and implicit dependencies between actions.

Fig. 5. Example of explicit dependencies in ODRL.

Fig. 6. Example of implicit dependencies in ODRL.

4.3 Tracking the Provenance of Data

In order to handle the unique challenges of diverse and unverified RDF data spread over datasets published at different URIs by different data publishers across the Web, the inclusion of a notion of provenance is necessary. The W3C PROV Working Group [49] was chartered to address these issues and developed an RDF vocabulary enabling the annotation of datasets with interchangeable provenance information. On a high level, PROV distinguishes between entities, agents, and activities (see Fig. 7). A prov:Entity can be all kinds of things, digital or not, which are created or modified. Activities are the processes which create or modify entities. A prov:Agent is something or someone who is responsible for a prov:Activity (and indirectly also for an entity).

Fig. 7. The core concepts of PROV (taken from [49]).

Listing 1.2 illustrates a PROV example (all other triples removed) of two observations, where observation ex:obs123 was derived from another observation ex:obs789 via an activity ex:activity456 on the 1st of January 2017 at 01:01. This derivation was executed according to the rule ex:rule937, with an agent ex:fred being responsible. This use of the PROV vocabulary models the tracking of source observations, a timestamp, the conversion rule, and the responsible agent (which could be a person or a software component). The PROV vocabulary could thus be used to annotate whole datasets, single observations (data points) within such datasets, or, respectively, any derivations and aggregations made from Open Data sources and re-published elsewhere.

Listing 1.2. PROV annotations tracking the derivation of an observation.
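
The listing is likewise not reproduced here; based on the description above, such provenance triples might look roughly as follows (the PROV-O terms are standard, the concrete modelling is our guess):

    from rdflib import Graph

    ttl = """
    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <http://example.org/> .

    ex:obs123 a prov:Entity ;
        prov:wasDerivedFrom ex:obs789 ;
        prov:wasGeneratedBy ex:activity456 .

    ex:activity456 a prov:Activity ;
        prov:used ex:obs789, ex:rule937 ;
        prov:wasAssociatedWith ex:fred ;
        prov:endedAtTime "2017-01-01T01:01:00"^^xsd:dateTime .

    ex:fred a prov:Agent .
    """
    g = Graph().parse(data=ttl, format="turtle")
    print(g.serialize(format="turtle"))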

5 Metadata Quality Issues and Vocabularies

The Open Data Portalwatch project [58] was originally set up as a framework for monitoring and quality assessment of (governmental) Open Data portals; see http://data.wu.ac.at/portalwatch. It monitors data from portals using the CKAN, Socrata, and OpenDataSoft software frameworks, as well as portals providing their metadata in the DCAT RDF vocabulary.

Currently, as of the second week of 2017, the framework monitors 261 portals, which describe in total about 854 k datasets with more than 2 million distributions, i.e., download URLs (cf. Table 6). As we monitor and crawl the metadata of these portals on a weekly basis, we can use the gathered insights in two ways to enrich the crawled metadata of these portals: (i) we publish and serve the integrated and homogenized metadata descriptions in a weekly, versioned manner, and (ii) we enrich these metadata descriptions with assessed quality measures along several dimensions. These dimensions and metrics are defined on top of the DCAT vocabulary, which allows us to treat and assess the content independently of a portal’s software and its own metadata schema.

Table 6. Monitored portals and datasets in Portalwatch

The quality assessment is performed along the following dimensions: (i) the existence dimension consists of metrics checking for important information, e.g., whether there is contact information in the metadata; (ii) the metrics of the conformance dimension check whether the available information adheres to a certain format, e.g., whether the contact information is a valid email address; (iii) the open data dimension’s metrics test whether the specified format and license information is suitable to classify a dataset as open. The formalization of all quality metrics currently assessed on the Portalwatch platform, and implementation details, can be found in [58].
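
As a toy illustration of such metrics (the metadata keys and the list of open formats are invented here; the actual Portalwatch metrics are formalized in [58]), checks of this kind boil down to simple predicates over the harvested metadata:

    import re

    def assess(dataset_meta):
        """Toy existence/conformance/open-data checks in the spirit of Portalwatch (sketch only)."""
        email = dataset_meta.get("contact_email", "")
        return {
            "existence_contact": bool(email),                                          # existence
            "conformance_email": bool(re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", email)),  # conformance
            "open_format": dataset_meta.get("format", "").upper() in {"CSV", "JSON", "XML"},
        }

    print(assess({"contact_email": "data@city.example", "format": "csv"}))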

5.1 Heterogeneous Metadata Descriptions

Different Open Data portals use different metadata keys to describe the datasets they host, mostly depending on the software framework the portal runs on: while the schemas for metadata descriptions on Socrata and OpenDataSoft portals are fixed and predefined (they use their own vocabulary and metadata keys), CKAN provides higher flexibility in terms of an own, per-portal metadata schema and vocabulary. Thus, overall, the metadata that can be gathered from Open Data portals shows a high degree of heterogeneity.

In order to provide the metadata in a standard vocabulary, there exists a CKAN-to-DCAT extension for the CKAN software that defines mappings from datasets and their resources to the corresponding DCAT classes dcat:Dataset and dcat:Distribution and offers them via the CKAN API. However, in general it cannot be assumed that this extension is deployed on all CKAN portals: we were able to retrieve the DCAT descriptions of datasets for 93 of the 149 active CKAN portals monitored by Portalwatch [59].

Also, the CKAN software allows portal providers to include additional metadata fields in the metadata schema. When retrieving the metadata description for a dataset via the CKAN API, these keys are included in the resulting JSON. However, it is neither guaranteed that the CKAN-to-DCAT conversion of the CKAN metadata contains these extra fields, nor that these extra fields, if exported, are available in a standardized way.

We analysed the metadata of 749 k datasets over all 149 CKAN portals and extracted a total of 3746 distinct extra metadata fields [59]. Table 7 lists the most frequently used fields, sorted by the number of portals they appear in; the most frequent, spatial, appears in 29 portals. Most of these cross-portal extra keys are generated by widely used CKAN extensions; the keys in Table 7 are all generated by the harvestingFootnote 25 and spatialFootnote 26 extensions.

We manually selected mappings for the most frequent extra keys if they were not already included in the mapping; the selected properties are listed in the “DCAT key” column of Table 7 and are included in the homogenized, re-exposed metadata descriptions, cf. Sect. 5.2. For cells marked “?”, we were not able to choose an appropriate DCAT core property.

Table 7. Most frequent extra keys

5.2 Homogenizing Metadata Using DCAT and Other Metadata Vocabularies

The W3C identified the issue of heterogeneous metadata schemas across data portals and proposed an RDF vocabulary to address it: the metadata standard DCAT [48] (Data Catalog Vocabulary) describes data catalogs and the corresponding datasets. It models datasets and their distributions (published data in different formats) and re-uses various existing vocabularies such as Dublin Core terms [75] and the SKOS [52] vocabulary.

The recent DCAT application profile for data portals in Europe (DCAT-AP)Footnote 27 extends the DCAT core vocabulary and aims at the integration of datasets from different European data portals. In its current version (v1.1) it extends the existing DCAT schema with a set of additional properties. DCAT-AP allows specifying the version and the period of time covered by a dataset. Further, it classifies certain predicates as “optional”, “recommended”, or “mandatory”. For instance, in DCAT-AP it is mandatory for a dcat:Distribution to hold a dcat:accessURL.
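
Such a mandatory-property check is easy to express as a SPARQL query over harvested metadata; the following sketch uses standard DCAT terms on invented example data:

    from rdflib import Graph

    ttl = """
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix ex:   <http://example.org/> .

    ex:ds1 a dcat:Dataset ; dcat:distribution ex:dist1, ex:dist2 .
    ex:dist1 a dcat:Distribution ; dcat:accessURL <http://example.org/data.csv> .
    ex:dist2 a dcat:Distribution .          # violates the DCAT-AP "mandatory" rule
    """
    g = Graph().parse(data=ttl, format="turtle")

    q = """
    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    SELECT ?dist WHERE {
        ?dist a dcat:Distribution .
        FILTER NOT EXISTS { ?dist dcat:accessURL ?url }
    }"""
    for row in g.query(q):
        print("missing dcat:accessURL:", row.dist)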

An earlier approach, from 2011, is the VoID vocabulary [3], published by the W3C as an Interest Group Note. VoID – the Vocabulary of Interlinked Datasets – is an RDF schema for describing metadata about linked datasets: it has been developed specifically for data in RDF representation and is therefore complementary to the DCAT model, but not fully suitable for modelling metadata on Open Data portals (which usually host resources in various formats) in general.

In 2011, Fürber and Hepp [32] proposed an ontology for data quality management that allows the formulation of data quality and cleansing rules, a classification of data quality problems, and the computation of data quality scores. The classes and properties of this ontology include concrete data quality dimensions (e.g., completeness and accuracy) and concrete data cleansing rules (such as whitespace removal), providing a total of about 50 classes and 50 properties. The ontology allows a detailed modelling of data quality management systems and might be partially applicable and useful for our system and data. However, in the Open Data Portalwatch we decided to follow the W3C Data on the Web Best Practices and use the more lightweight Data Quality Vocabulary for describing the quality assessment dimensions and steps.

More recently, in 2015, Assaf et al. [5] proposed HDL, a harmonized dataset model. HDL is mainly based on a set of frequent CKAN keys; on this basis, the authors define mappings from other metadata schemas, including Socrata, DCAT, and Schema.org.

Metadata mapping by the Open Data Portalwatch framework. In order to offer the datasets harvested in the Portalwatch project in a homogenized and standardised way, we implemented a system that re-exposes data extracted from Open Data portal APIs such as CKAN [59]: the output formats include a subset of W3C’s DCAT with extensions and Schema.org’s Dataset-oriented vocabulary.Footnote 28 We enrich the integrated metadata with the quality measurements of the Portalwatch framework, available as RDF data using the Data Quality VocabularyFootnote 29 (DQV). To further describe tabular data in our dataset corpus, we use simple heuristics to generate additional metadata using the vocabulary defined by the W3C CSV on the Web Working Group [65], which we likewise add to our enriched metadata. We use the PROV ontology (cf. Sect. 4.3) to record and annotate the provenance of our generated/published data (which is partially generated using heuristics). The example graph in Fig. 8 displays the generated data for the DCAT dataset, the quality measurements, the CSV metadata, and the provenance information.

Fig. 8. The mapped DCAT dataset is further enriched by three additional datasets (indicated by the bold edges): (i) each DCAT dataset is associated with a set of quality measurements; (ii) additional provenance information is available for the generated RDF graph; (iii) in case the corresponding distribution is a table, we generate CSV-specific metadata such as the delimiter and the column headers.
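
As an illustration of the DQV part of this enrichment, a single quality measurement attached to a dataset could look as in the following sketch (the measurement URI, metric, and value are invented; the dqv: terms are the standard ones):

    from rdflib import Graph

    ttl = """
    @prefix dqv:  <http://www.w3.org/ns/dqv#> .
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <http://example.org/> .

    ex:ds1 a dcat:Dataset ;
        dqv:hasQualityMeasurement ex:measurement1 .

    ex:measurement1 a dqv:QualityMeasurement ;
        dqv:isMeasurementOf ex:contactEmailConformance ;   # hypothetical metric
        dqv:computedOn ex:ds1 ;
        dqv:value "0.87"^^xsd:decimal .
    """
    print(len(Graph().parse(data=ttl, format="turtle")), "triples")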

6 Searchability and Semantic Annotation

The popular Open Data portal software frameworks (e.g., CKAN, Socrata) offer search interfaces and APIs. However, the APIs typically only allow search over the metadata descriptions of the datasets, i.e., the title, description, and tags, and therefore rely on complete and detailed meta-information. Hence, if a user wants to find data about a specific entity, such a search might not be successful. For instance, a search for data about “Vienna” on the Humanitarian Data Exchange portal yields no results, even though there are relevant datasets on the portal such as “World – Population of Capital Cities”.

6.1 Open Data Search: State of the Art

Overall, to the best of our knowledge, there is not much substantial research in the area of search and querying for Open Data. A straightforward approach to offering search over the data is to index the documents as text files in typical keyword search systems. Keyword search is already addressed and partially solved by full-text search indices, as they exist in search engines such as Google. However, these systems do not exploit the underlying structure of the datasets. For instance, a default full-text indexer considers a CSV table as a single document, and the cells get indexed as (unstructured) tokens. A search query for tables containing the terms “Vienna” and “Berlin” in the same column is not possible with such existing search systems. In order to enable such structured search over the content of tables, an alternative data model is required.

In a current table search prototypeFootnote 30 we enable these query use cases while utilizing existing state-of-the-art document-based search engines. We use the search engine ElasticsearchFootnote 31 and index the rows and columns of a table as separate documents, i.e., we add a new document for each column and for each row, containing all values of the respective row/column. By doing so we store each single cell twice in the search system. This particular data model makes it possible to define multi-keyword searches over rows and columns, for instance, queries for which the terms “Vienna” and “Berlin” appear within the same column.
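
A minimal sketch of this row/column document model, using the official Elasticsearch Python client (8.x) with a made-up index layout and toy table, could look as follows; the prototype's actual schema may differ:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    table = [["city", "country"], ["Vienna", "Austria"], ["Berlin", "Germany"]]

    header, rows = table[0], table[1:]
    for i, row in enumerate(rows):                       # one document per row
        es.index(index="csv_rows", document={"table": "t1", "row": i, "values": row})
    for j, name in enumerate(header):                    # one document per column
        es.index(index="csv_cols", document={"table": "t1", "column": name,
                                             "values": [r[j] for r in rows]})

    # "Vienna" and "Berlin" in the same column => both terms must match one column document
    hits = es.search(index="csv_cols", query={"bool": {"must": [
        {"match": {"values": "Vienna"}}, {"match": {"values": "Berlin"}}]}})
    print(hits["hits"]["total"])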

Recently, the Open Data Network projectFootnote 32 has addressed the searchability issue by providing a search and query answering framework on top of Socrata portals. The UI allows starting a search with a keyword and suggests matching datasets or already registered questions. However, the system relies on the existing Socrata portal ecosystem with its relevant data API.Footnote 33 This API allows programmatic access to the uploaded data and applying filters on columns and rows.

The core challenge for search and querying over tabular data is to process and build an index over a large corpus of heterogeneous tables. In 2016, we assessed the table heterogeneity of over 200 k Open Data CSV files [54]. We found that a typical Open Data CSV file is smaller than 100 kB (the biggest over 25 GB) and consists of 14 columns and 379 rows. An interesting observation was that ~50% of the inspected header values were composed in camel case, suggesting that the table was exported from a relational table. Regarding the data types, roughly half of the columns consist of numerical data types. As such, Open Data CSV tables have different numbers of columns and rows, and column values can belong to different data types. Some of the CSV files contain multiple tables, and the tables themselves can be non-well-formed, meaning that there exist multiple header rows or rows with values aggregated over the previous rows.

To the best of our knowledge, research on querying over thousands of heterogeneous tables is fairly sparse. One of the initial works towards search and query over tables was that of Das Sarma et al. in 2012 [25]. The authors propose a system that finds, for a given input table, a set of related Web tables. The approach relies on the assumption that tables have an “entity” column (e.g. the player column in a table about tennis players) and introduces relatedness metrics for tables (either for joining two tables or for appending one table to the other). The authors propose a set of high-level features for grouping tables in order to handle the large number of heterogeneous tables and to reduce the search space for a given input table. Eventually, the system itself returns tables which can either be joined with the input table (via the entity column) or be appended to the input table (adding new rows).

The idea of finding related tables is also closely related to the research on finding inclusion dependencies (INDs), that is, relations such as \(A \subseteq B\) between the value sets of two columns \(A\) and \(B\). A core application of these dependencies is the discovery of foreign key relations across tables, but they are also used in data integration scenarios [53], query optimization, and schema redesign [62]. The task of finding INDs gets harder with the number of tables and columns, and the scalable and efficient discovery of inclusion dependencies across several tables is a well-known challenge in database research [9, 43, 62]. The state-of-the-art research combines probabilistic and exact data structures to approximate the INDs in relational datasets; the algorithm guarantees to correctly find all INDs and only adds false positive INDs with low probability [42].
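
To make the notion concrete, a naive (quadratic, exact) IND check over a handful of small tables can be written directly over column value sets; scalable approaches replace these sets with probabilistic summaries. The tables below are toy examples:

    # Naive inclusion-dependency discovery: column A ⊆ column B over small tables (sketch).
    tables = {                                   # made-up toy tables
        "cities":   {"city": {"Vienna", "Berlin", "Paris"}},
        "capitals": {"capital": {"Vienna", "Berlin", "Paris", "Rome"}},
    }

    inds = []
    for ta, cols_a in tables.items():
        for ca, vals_a in cols_a.items():
            for tb, cols_b in tables.items():
                for cb, vals_b in cols_b.items():
                    if (ta, ca) != (tb, cb) and vals_a <= vals_b:   # subset test
                        inds.append(f"{ta}.{ca} ⊆ {tb}.{cb}")
    print(inds)    # ['cities.city ⊆ capitals.capital']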

Another promising direction is the work of Liu et al. from 2014, which investigates the fundamental differences between relational data and JSON data management [46]. The authors derive three architectural principles to facilitate schema-less development within traditional relational database management systems. The first principle is to store JSON as JSON in the RDBMS. The second is to use SQL as a set-oriented query language rather than (only) a structured query language. The third is to use the available partial schema-aware indexing methods, but also schema-agnostic indexing. While this work focuses on JSON and XML, it would be interesting to study and establish similar principles for tabular data and to investigate how they could be applied to and benefit search and querying.

Enabling search and querying over Open Data could benefit from many insights of the research around semantic search systems. Earlier semantic search systems such as Watson [24], Swoogle [27], or FalconS [22] provided search and simple querying over collections of RDF data. More advanced systems, such as SWSE [38] or Sindice.com [61], focused on indexing RDF documents at Web scale. SWSE is a scalable entity lookup system operating over integrated data, while Sindice.com provided keyword search and entity lookups using an inverted document index. Surprisingly, published research around semantic search has slowed down. However, the big search engine players on the market, such as Google or Bing, utilise semantic search approaches to provide search over their internal knowledge graphs.

6.2 Annotation, Labelling, and Integration of Tabular Data

Text-based search engines such as Elasticsearch, however, do not integrate any semantic information about the data sources and therefore do not enable search based on concepts, synonyms, or related content. For instance, to enable a search for the concept “population” over a set of resources (that do not contain the string “population”), the tables (and their columns, respectively) need to be labelled and annotated correctly.

There exists an extensive body of research in the Semantic Web community on the semantic annotation and linking of tabular data sources. The majority of these approaches [2, 28, 45, 55, 66, 70, 73, 76] assume well-formed relational tables and try to derive semantic labels for attributes in these structured data sources (such as columns in tables), which are used to (i) map the schema of the data source to ontologies or existing semantic models or (ii) categorize the content of a data source.

Given an existing knowledge base, these approaches try to discover concepts and named entities in the table, as well as relations among them, and link them to elements and properties in the knowledge base. This typically involves finding potential candidates from the knowledge base that match particular table components (e.g., column header, or cell content) and applying inference algorithms to decide the best mappings.

However, in typical Open Data portals many data sources exist where such textual descriptions (column headers or cell labels) are missing or cannot be mapped straightforwardly to known concepts or properties using linguistic approaches, particularly when tables contain many numerical columns for which no semantic mapping can be established in this manner. Indeed, a major part of the datasets published in Open Data portals comprises tabular data containing many numerical columns with missing or non-human-readable headers (organisational identifiers, sensor codes, internal abbreviations for attributes like “population count”, or geo-coding systems for areas instead of their names, e.g. for districts) [47].

Table 8. Header mapping of CSVs in open data portals

In [57] we verified this observation by inspecting 1200 tables collected from the European Open Data portal and the Austrian Government Open Data portal and attempting to map the header values using the BabelNet service (http://babelnet.org). Table 8 lists our findings; an interesting observation is that the AT portal has an average of 20 columns per table, with an average of 8 numerical columns, while the EU portal has larger tables with an average of 4 out of 20 columns being numerical. Regarding the descriptiveness of possible column headers, we observed that 28% of the tables have missing header rows. Eventually, we extracted headers from 7714 out of around 10 k numerical columns and used the BabelNet service to retrieve possible mappings. We received mappings to BabelNet concepts or instances for only 1472 columns, confirming our assumption that many headers in Open Data CSV files cannot easily be mapped semantically.

Therefore, in [57] we propose an approach to find and rank candidate semantic labels and context descriptions for a given bag of numerical values, i.e., the numerical data in a certain column. To this end, we apply hierarchical clustering over information taken from DBpedia to build a background knowledge graph of possible “semantic contexts” for bags of numerical values, over which we perform a nearest-neighbour search to rank the most likely candidates. We assign different labels/contexts with different confidence values, and in this way our approach could potentially be combined with the previously introduced textual labelling techniques for further label refinement.
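
The core idea can be sketched in a few lines: summarize each labelled bag of numbers by simple distribution features and rank candidate labels for an unlabelled column by nearest-neighbour distance. The features, labels, and values below are invented for illustration; the actual system uses a DBpedia-derived background graph and richer features:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def features(values):
        v = np.asarray(values, dtype=float)
        return [v.min(), v.max(), v.mean(), v.std()]   # toy distribution features

    # background "contexts": label -> example bag of numerical values (made up)
    background = {
        "population":    [1700000, 83000, 2900000, 550000],
        "year":          [1990, 2001, 2015, 2017],
        "temperature_c": [-5.0, 12.3, 25.7, 31.1],
    }
    labels = list(background)
    nn = NearestNeighbors(n_neighbors=2).fit([features(v) for v in background.values()])

    unlabeled_column = [640000, 1200000, 95000]        # column without a usable header
    dist, idx = nn.kneighbors([features(unlabeled_column)])
    print([(labels[i], round(d, 2)) for i, d in zip(idx[0], dist[0])])   # ranked label candidates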

7 Conclusions, Including Further Issues and Challenges

In this chapter we gave a rough overview of the still persisting challenges of integrating and finding data on the Web. We focused on Open Data and provided some starting points for finding the large amounts of structured data available nowadays, the processing of which still remains a major challenge: on the one hand, because Semantic Web standards such as RDF and OWL have not yet found wide adoption and there is still a large variety of formats in which structured data is published on the Web; on the other hand, because even the use of such standard formats alone would not alleviate the issue of findability of said data. Proper search and indexing techniques for structured data and its metadata need to be devised. Moreover, metadata needs to be self-descriptive, that is, it needs to describe not only what published datasets contain, but also how the data was generated (provenance) and under which terms it can be used (licenses). Overall, one could say that despite the increased availability of data on the Web, (i) there are still a number of challenges to be solved before we can call it a Semantic Web, and (ii) one often needs to be ready to manually pre-process and align data before automated reasoning techniques can be applied.

Projects such as the Open Data Portalwatch, a monitoring framework for Open Data portals worldwide, from which most of the insights presented in this paper were derived, are just a starting point in the direction of making this Web of Data machine-processable: there are a number of aspects that we did not cover herein, such as monitoring the evolution of datasets, archiving such evolving data, or querying Web data over time, cf. [31] for some initial research on this topic. Nor did we discuss attempts to reason over Web data “in the wild” using OWL and RDFS, which we had investigated on the narrower scope of Linked Data some years ago [64], but which will impose far more challenges when taking into account the vast amounts of data not yet linked to the so-called Linked Data cloud, but available through Open Data portals. Lastly, another major issue we did not discuss in depth is multi-linguality: the data (content) as well as the metadata associated with Open Data is published in different languages, and thereby a lot of “open” information is accessible only to speakers of the respective languages, let alone integrable by machines; still, recent progress in machine translation and multi-lingual Linked Data corpora like BabelNet [56] could contribute to solving this puzzle.

You will find further starting points in these directions in the present volume, as well as in previous editions of the Reasoning Web summer school. We hope these starting points serve as an inspiration for further research on making machines understand openly available data on the Web, thus bringing us closer to the original vision of the Semantic Web – an ongoing journey.