
1 Introduction

Over the last decade we have seen the World Wide Web being populated more and more by “machines”. The World Wide Web has evolved from its original form, a network of linked documents readable by humans, more and more into a Web of data and APIs. That is, nowadays, even if we interact with Web pages as humans, in most cases (i) the contents of Web pages are generated from databases in the backend, (ii) the Web content we see as humans contains annotations readable by machines, and even (iii) the way we interact with Web pages generates data (frighteningly, often without the users even being aware of it), which is collected and stored again in databases around the globe. It is therefore fair to say that the Web of Data has become a reality and – to some extent – even the vision of the Semantic Web. In fact, this vision of the Semantic Web has itself evolved over the decades, starting with Berners-Lee et al.’s seminal article in 2001 [13], which already envisioned the future Web as “federating particular knowledge bases and databases to perform anticipated tasks for humans and their agents”. Based on these ideas, a lot of effort and research has been devoted to the World Wide Web Consortium (W3C) Semantic Web activity,Footnote 1 which in 2013 was subsumed by – i.e., renamed to – the “Data Activity”.Footnote 2

In many respects, the Semantic Web has not evolved as expected, and its biggest success stories so far depend less on formal logics [37] than we may have expected, and more on the availability of data. A recent article by Bernstein et al. [14] takes a retrospective look at the community and summarizes successes such as the establishment of lightweight annotation vocabularies like Schema.org on Web pages, or the uptake by large companies such as Google, Yahoo!, Microsoft, and Facebook, which are developing large knowledge graphs – which, however, these companies so far mostly keep closed.

Thus, if Web researchers outside of these companies want to tap into the rich sources of data now available on the Web, they need to develop their own data workflows to find relevant and usable data. To their benefit, more and more Open Data is being published on the Web, that is, data made freely available, mostly by public institutions (Open Government Data), both for transparency reasons and with the goal of “fuelling” a data economy, as pushed both by the EU [29] and the G8 [72].

The present lecture notes may be viewed partially as an experience report and – hopefully – as a guide through the challenges arising when using (Open) Data from the Web. The authors have been involved over the past few years in several projects and publications around the topic of Open Data integration, monitoring, and processing. The main challenges we have come across in all these projects overlap to a large extent, and we therefore decided to present them in the present chapter:

  1. Where to find Open Data? (Sect. 2) Most Open Data nowadays can be found on so-called Open Data portals, that is, data catalogs, typically allowing API access and hosting dataset descriptions and links to actual data resources.

  2. “Low-level” data heterogeneity (Sect. 3) As we will see, most of the structured data provided as Open Data is not readily available as RDF or Linked Data – the preferred formats for semantic data access described in other chapters of this volume. Other formats are far more prevalent, and encoding issues additionally make it difficult to access those datasets.

  3. Licenses and Provenance (Sect. 4) Not all Open Data is really completely open, since most data on the Web comes with different licenses, terms, and conditions attached, so we will discuss how and whether these licenses can be interpreted by machines, and, respectively, how the provenance of different integrated data sources can be tracked.

  4. Quality issues (Sect. 5) A major challenge for data – also often related to its provenance – is quality; on the one hand the re-use of poor quality data is obviously not advisable, but on the other hand different applications might have different demands/definitions of quality.

  5. How to find data – Searchability? (Sect. 6) Last, but not least, we will look into current solutions for search in Open Data, which we pose as a major open research challenge: whereas crawling and keyword-based search over human-readable websites work well, this is not yet the case for structured data on the Web; we will discuss why and sketch some routes ahead.

Besides these main questions, we conclude in Sect. 7 with a summary of issues and open questions around integrating Open Data from the Web that are not covered explicitly herein, such as multi-linguality, temporal aspects (archiving, evolution, temporal querying), as well as how and whether OWL and RDFS reasoning on top of integrated Open Data could help.

2 Where to Find Web Data?

If we look for sources of openly available data that are widely discussed in the literature, we can mainly identify four starting points, which partially overlap:

  • User-created open databases

  • The Linked Open Data “Cloud”

  • Web crawls

  • Open Data Portals

User-created open databases are large amounts of data and databases that have been co-created, through efforts such as Wikipedia, by user communities distributed around the globe; the most important ones are listed as follows:

  • DBpedia [44] is a community effort that has created one of the biggest and most important cross-domain datasets in RDF [19], located at the focal point of the so-called Linked Open Data (LOD) cloud [6]. At its core is a set of declarative mappings extracting data from Wikipedia infoboxes and tables into RDF, and it is accessible through dumps as well as through an open query interface supporting the SPARQL [33] query language. DBpedia can therefore rightly be called one of the cornerstones of Semantic Web and Linked Data research, having been the subject and center of a large number of research papers over the past few years. Reported numbers vary, as DBpedia is modular and steadily growing with Wikipedia; e.g., in 2015 DBpedia contained overall more than 3 billion RDF statements,Footnote 3 of which the English DBpedia contributed 837 M statements (RDF triples). Those 837 M RDF triples alone amount to 4.7 GB when stored in the compressed RDF format HDT [30].Footnote 4 However, as we will see, there are many other, indeed far bigger, openly accessible data sources that yet remain to be integrated, and these are rather the focus of the present chapter.

  • Wikidata [74], a similar but conceptually different effort, was started in 2012 to bring order into the data items in Wikipedia, with the idea of building – instead of extracting data from semi-structured wiki pages – a database of data observations with fixed properties and datatypes, mainly in order to avoid extraction errors and to provide means to record provenance directly with the data; it likewise contains hundreds of millions of facts in the meantime. Exact numbers are hard to give, but [71] report some statistics as of 2015, when Freebase was included into Wikidata; we note that counting RDF triplesFootnote 5 is only partially useful, since the data representation of Wikidata is not directly comparable with that of DBpedia [35, 36].

  • OpenStreetMap, another example of an openly available database largely created by users, contains a vast amount of geographic features with the aim of providing an openly available and re-usable map; it currently comprises 739.7 GB of (uncompressed) data in OSM’s native XML format (still 33 GB compressed).Footnote 6

The Linked Open Data “Cloud” – already mentioned above – is a manually curated collection of datasets that are published openly on the Web, adhering to the so-called Linked Data principles, defined as follows [12] (cf. chapters of previous editions of the Reasoning Web book series for good overview articles; a minimal dereferencing sketch follows the list):

  • LDP1: use URIs as names for things;

  • LDP2: use HTTP URIs so those names can be dereferenced;

  • LDP3: return useful – herein we assume RDF – information upon dereferencing of those URIs; and

  • LDP4: include links using externally dereferenceable URIs.Footnote 7
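
To make LDP2 and LDP3 concrete, the following small Python sketch dereferences a DBpedia URI with HTTP content negotiation and counts the returned triples; the URI and endpoint are real, but the snippet is merely a minimal illustration and response formats and availability may vary:

    import requests
    from rdflib import Graph

    # LDP2/LDP3: dereference an HTTP URI and ask for an RDF serialization
    uri = "http://dbpedia.org/resource/Vienna"   # example resource
    resp = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=30)

    g = Graph()
    g.parse(data=resp.text, format="turtle")
    print(len(g), "triples returned for", uri)

    # LDP4: outgoing links are simply triples whose objects are external HTTP URIs
    external = {o for _, _, o in g
                if str(o).startswith("http") and "dbpedia.org" not in str(o)}
    print(len(external), "links to external URIs")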

The latest iteration of the LOD Cloud [1] contains – with DBpedia at its center – hundreds of datasets of equal or even larger size than DBpedia, documenting a significant growth of Linked Data over the past years. Still, while the LOD cloud and the “Web of Data” are often implicitly equated in the Semantic Web literature, there is a lot of structured data available on the Web that (a) uses RDF but is not linked to other datasets, or (b) is provided in popular formats other than RDF.

Running Web crawls is the only way to actually find and discover structured Web data at large; it is both resource-intensive and challenging in terms of respecting politeness rules when crawling. However, some Web crawls have been made openly available, such as the Common Crawl corpus, which contains “petabytes of data collected over the last 7 years”.Footnote 8 Indeed, this corpus has already been used to collect and analyse the availability (and quality) of structured data on the Web, e.g. in the Web Data Commons project [50, 51] (Table 1).

Open Data portals are collections or catalogs that index metadata and link to actual data resources; they have become popular over the past few years through various Open Government Data initiatives, but also in the private sector. Apart from the other sources mentioned so far, most of the data published openly is indexed in some kind of Open Data portal. We will therefore discuss these portals in more detail in the rest of this chapter.

Table 1. Top-10 portals, ordered by number of datasets.

Open Data portals

Most of the current “open” data forms part of datasets that are published on Open Data portals, which are basically catalogues similar to digital libraries (cf. Fig. 1): in such catalogues, a dataset aggregates a group of data files (referred to as resources or distributions) which are available for access or download in one or more formats (e.g., CSV, PDF, Microsoft Excel). Additionally, a dataset contains metadata (i.e., basic descriptive information in structured form) about these resources, e.g. authorship, provenance, or licensing information. Most of these portals rely on existing software frameworks, such as CKANFootnote 9 or Socrata,Footnote 10 that offer UI, search, and API functionalities.

Fig. 1. High-level structure of a data catalog.

CKAN is the most prominent portal software framework for publishing Open Data and is used by several governmental portals, including data.gov.uk and data.gov.

For example, the Humanitarian Data ExchangeFootnote 11 (see Fig. 2) is a portal operated by the United Nations. It aggregates and publishes data about the context in which a humanitarian crisis is occurring (e.g., damage assessments and geospatial data) and data about the people affected by the crisis. The datasets on this portal are described using several metadata fields, and the metadata description can be retrieved in JSON format via the portal’s Web API (cf. Fig. 2).

The metadata descriptions of these datasets provide download links for the actual content. For instance, the particular dataset description in Fig. 2 – a dataset reporting the amounts paid by refugees to facilitate their movement to Europe – holds a URL which refers to a table (a CSV file) containing the corresponding data, displayed in Table 2.
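
As a minimal sketch of this kind of programmatic access (the CKAN action API endpoints are standard, but the dataset identifier below is hypothetical), the metadata of a dataset on a CKAN-based portal such as the Humanitarian Data Exchange can be retrieved, and its first CSV resource downloaded, roughly as follows:

    import csv, io, requests

    PORTAL = "https://data.humdata.org"          # HDX runs on CKAN
    DATASET_ID = "refugee-movement-costs"        # hypothetical identifier

    # CKAN action API: package_show returns the dataset's metadata as JSON
    meta = requests.get(f"{PORTAL}/api/3/action/package_show",
                        params={"id": DATASET_ID}, timeout=30).json()["result"]
    print(meta["title"], "-", len(meta["resources"]), "resources")

    # follow the download link of the first CSV distribution
    csv_res = next(r for r in meta["resources"] if r.get("format", "").upper() == "CSV")
    table = list(csv.reader(io.StringIO(requests.get(csv_res["url"], timeout=30).text)))
    print("header:", table[0])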

Fig. 2. Example dataset description from the Humanitarian Data Exchange portal.

Table 2. The tabular content of the dataset in Fig. 2

3 Data Formats on the Web

When discussing the data available on the Web above, we already emphasized that – despite being the subject of a lot of research – RDF and Linked Data are not necessarily the prevalent formats for published data on the Web. An analysis of the datasets systematically catalogued in Open Data portals confirms this. Likewise, we will have to discuss metadata formats on these portals.

Data Formats on Open Data Portals. Table 3 shows the most frequently used formats and the number of unique resources, together with the number of portals in which they appear, adapted from [58], where we crawled and analysed metadata from 260 Open Data portals for cues to the data formats in which different datasets are provided. Note that these numbers are based on the available metadata information of the datasets, and the real numbers can be higher due to varying spellings, misspellings, and missing metadata. Therefore, these numbers should be considered a lower bound for the respective formats. Bold highlighted values indicate that the format is considered open as per the Open Definition [12]:Footnote 12 the Open Definition sets out several guidelines for which data formats are to be considered “open”, according to which we assessed openness using a list of compliant formats, cf. [58].

Table 3. Most frequent formats.

A surprising observation is that ~10% of all resources are published as PDF files. This is remarkable because, strictly speaking, PDF cannot be considered an Open Data format: while PDFs may contain structured data (e.g. in tables), there are no standard ways to extract such structured data from PDFs, or from general-purpose document formats in general. Therefore, PDFs can be considered neither machine-readable nor a suitable way of publishing Open Data. As we also see, RDF does not appear among the top-15 formats for Open Data publishing.Footnote 13 This underlines the previously stated hypothesis that – especially in the area of Open Government Data – openly available datasets on data portals are mostly not published as RDF or Linked Data.

Also, JSON does not appear among the top ten formats in terms of the number of published data resources on Open Data portals. Still, we include these formats in our discussion below, as

  • particularly JSON and RDF play a significant role in metadata descriptions,

  • JSON is the prevalent format for many Web APIs,

  • RDF, as we saw, is – apart from the Linked Data cloud – prevalent in Web pages and Web crawls, owing to its support as an annotation format by popular search engines.

In the following we introduce some of these popular, well-known data formats on the Web and categorize them by their structure, namely graph-based, tree-shaped, and tabular formats.

Fig. 3. RDF graph of the DCAT metadata mapping of Fig. 2.

3.1 Graph-Based Formats

RDF, a W3C recommendation since 2004 [41], “refurbished” in 2014 [19, 23], was originally conceived as a metadata model and language for describing resources on the Web. It has evolved (also through deployment) into a universal model and format to describe arbitrary relations between resources identified, typically, by URIs, such that they can be read and understood by machines.

RDF itself consists of statements in the form of subject–predicate–object triples. RDF triples can be displayed as graphs where the subjects and objects are nodes and the predicates are directed edges. RDF uses vocabularies to define the set of elements that can be used in an application. Vocabularies are similar to schemas for RDF datasets and can also define the domain and range of predicates. The graph in Fig. 3 represents the metadata description of the dataset in Fig. 2 in the DCAT (Data Catalog) vocabulary [48].Footnote 14

Several formats exist to serialize RDF data. Most prominent is RDF/XML, the XML serialization first introduced in the course of the 1999 W3C specification of the RDF data model, but there are also more readable/concise textual serialization formats such as the line-based N-Triples [21] and the “Terse RDF Triple Language” (Turtle) [10] syntax. More recently, in 2014, the W3C released the first recommendation for JSON-LD [68]. JSON-LD is an extension of the JSON format (see below) that mainly adds the ability to specify namespaces for identifiers and support for URIs (supporting Linked Data principles natively in JSON), which allows the serialization of RDF as JSON or, vice versa, the interpretation of JSON as RDF: conventional JSON parsers and databases can be used, and users of JSON-LD who are mainly interested in conventional JSON are not required to understand RDF and do not have to use the Linked Data additions.
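
As a brief illustration, the following sketch (with a made-up example dataset URI) builds a tiny graph with rdflib and emits it in three of these serializations:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import DCTERMS, RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")
    EX = Namespace("http://example.org/")           # hypothetical namespace

    g = Graph()
    g.bind("dcat", DCAT); g.bind("dct", DCTERMS)
    ds = EX["dataset/mediterranean-arrivals"]       # hypothetical dataset URI
    g.add((ds, RDF.type, DCAT.Dataset))
    g.add((ds, DCTERMS.title, Literal("Mediterranean arrivals")))
    g.add((ds, DCAT.distribution, EX["dataset/mediterranean-arrivals/csv"]))

    print(g.serialize(format="turtle"))    # Terse RDF Triple Language
    print(g.serialize(format="json-ld"))   # JSON-LD
    print(g.serialize(format="nt"))        # line-based N-Triples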

3.2 Tree-Shaped Formats

The JSON file format [18] is a so-called semi-structured file format, i.e., one where documents are loosely structured without a fixed schema (as, for example, data in relational databases has) as attribute–value pairs, where values can be primitive (strings, numbers, Booleans), arrays (sequences of values enclosed in square brackets ‘[’, ‘]’), or nested JSON objects (enclosed in curly braces ‘{’, ‘}’), thus – essentially – providing a serialization format for tree-shaped, nested structures. For an example of JSON we refer to Fig. 2.

Initially, the JSON format was mainly intended to transmit data between servers and Web applications, supported by Web services and APIs. In the context of Open Data we often find JSON as a format to describe metadata, but also to publish the actual data: raw tabular data can easily be transformed into semi-structured, tree-based formats like JSON,Footnote 15 and JSON is therefore often used as an alternative representation for accessing the data. On the other hand, JSON is the de facto standard for retrieving metadata from Open Data portals.
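
For instance, a plain table can be turned into a list of JSON objects in a few lines; the file name and columns in this sketch are made up:

    import csv, json

    # read a table and emit one JSON object per row, keyed by the header
    with open("arrivals.csv", newline="", encoding="utf-8") as f:   # hypothetical file
        rows = list(csv.DictReader(f))

    print(json.dumps(rows, indent=2))   # e.g. [{"route": "East Med Sea", "amount": "2500"}, ...]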

XML. For the sake of completeness, due to its long history, and also due to its still striking prevalence as a data exchange format of choice, we shall also mention some observations on XML. This prevalence is not really surprising, since many industry standards and tools export and deliver XML, which is then used as the output format of many legacy applications and is still popular for many Web APIs, e.g., in the area of geographical information systems (e.g. KML,Footnote 16 GML,Footnote 17 WFS,Footnote 18 etc.). Likewise, XML has a large number of associated standards around it, such as query, navigation, transformation, and schema languages like XQuery,Footnote 19 XPath,Footnote 20 XSLT,Footnote 21 and XML Schema,Footnote 22 which are still actively developed, supported by semi-structured database systems and other tools. XML by itself has been the subject of extensive research, for example in the fields of data exchange [4, Part III] or query languages [8]. Particularly in the context of the Semantic Web, there have also been proposals to combine XQuery with SPARQL, cf. for instance [15, 26] and references therein. The issue of interoperability between RDF and XML is indeed further discussed within the W3C in the recently started “RDF and XML Interoperability Community Group”;Footnote 23 see also [16] for a summary. So, whereas JSON probably has better support in terms of developer-friendliness and recent uptake, particularly through Web APIs, there is still a strong community with well-established standards behind XML technologies. For instance, schema languages and query languages for JSON exist as proposals, but their formal underpinning is still under discussion, cf. e.g. [17, 63]. Another approach would be to adopt, reuse, and extend XML technologies to work on JSON itself, as for instance proposed in [26]. On an abstract level, there is not much to argue about JSON and XML being just two syntactic variants for serializing arbitrary, tree-shaped data.

3.3 Tabular Data Formats

Last but not least, potentially driven by the fact that the vast majority of Open Data on the Web originates from relational databases or simply from spreadsheets, a large part of the Web of Open Data consists of tabular data. This is illustrated by the fact that two of the most prominent formats for publishing Open Data in Table 3 cover tabular data: CSV and XLS. Note in particular that both of these formats are present on more Open Data portals than, for instance, XML.

While XLS (the export format of Microsoft Excel) is obviously a proprietary format, CSV (comma-separated values) is a simple, open format with a standard specification allowing to serialize arbitrary tables as text (RFC 4180) [67]. However, as we have shown in a recent analysis [54], compliance with this standard across published CSVs is not consistent: in an Open Data corpus containing 200 k tabular resources with a total file size of 413 GB, we found that only 50% of the resources labelled as tabular in Open Data portals can be considered CSV files. In this work we also investigated the use of different delimiters, the presence of (multiple) header rows, and cases where single CSV files actually contain multiple tables as common problems.
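
A first, purely heuristic line of defence against such heterogeneity is dialect sniffing; the following sketch (with a made-up file name) guesses the delimiter and checks for a header row using Python's standard library:

    import csv

    with open("messy_table.csv", newline="", encoding="utf-8") as f:   # hypothetical file
        sample = f.read(64 * 1024)
        dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")  # guess the delimiter
        has_header = csv.Sniffer().has_header(sample)              # guess if a header row exists
        f.seek(0)
        rows = list(csv.reader(f, dialect))

    print("delimiter:", repr(dialect.delimiter), "header:", has_header, "rows:", len(rows))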

Last, but not least, as opposed to tabular data in relational databases, which typically adheres to a fixed schema and constraints, these constraints, datatype information, and other schema information are typically lost when the data is exported and re-published as CSV. This loss can partially be compensated by adding this information as additional metadata to the published tables; one particular format for such metadata has recently been standardized by the W3C [65]. For more details on the importance of metadata we refer also to Sect. 5 below.
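
A minimal sketch of such table metadata in the spirit of the W3C CSV on the Web vocabulary might look as follows (the file, column names, and datatypes are invented for illustration):

    import json

    # A minimal CSV-on-the-Web style metadata description (sketch; values are invented)
    csvw = {
        "@context": "http://www.w3.org/ns/csvw",
        "url": "arrivals.csv",                           # hypothetical table
        "dialect": {"delimiter": ",", "header": True},
        "tableSchema": {"columns": [
            {"name": "route",  "titles": "route",  "datatype": "string"},
            {"name": "amount", "titles": "amount", "datatype": "integer"},
        ]},
    }
    print(json.dumps(csvw, indent=2))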

3.4 Data Formats – Summary

Overall, while data formats are often considered mere syntactic sugar, one should not underestimate the issues around conversions, script and parsing errors, stability of tools, etc., where often a significant amount of work is incurred. While in principle any data can be converted to a CSV, XML, or RDF serialization, one should keep in mind that a canonical, “dumb” serialization in RDF by itself does not “add” any “semantics”.

For instance, a naive RDF conversion (in Turtle syntax) of the CSV in Table 2 could look as follows in Fig. 4, but would obviously not make the data more “machine-readable” or easier to process.

Fig. 4. Naive conversion of tabular data into RDF.
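
Since Fig. 4 is not reproduced here, the following sketch indicates what such a naive, schema-agnostic conversion might look like, using a made-up ex: namespace, generic column properties c1–c4, and purely illustrative row values:

    import csv, io
    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    EX = Namespace("http://example.org/ns#")        # made-up namespace
    g = Graph(); g.bind("ex", EX)

    data = "East Med Sea,2500,2016,boat\nWest Med Sea,1000,2016,boat\n"  # illustrative rows only
    for i, row in enumerate(csv.reader(io.StringIO(data))):
        subj = URIRef(f"http://example.org/row/{i}")
        g.add((subj, RDF.type, EX.Row))
        for j, cell in enumerate(row, start=1):
            g.add((subj, EX[f"c{j}"], Literal(cell)))   # "dumb" column properties c1..c4

    print(g.serialize(format="turtle"))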

We leave coming up with a likewise naive (and probably useless) conversion to XML or JSON to the reader: the real intelligence in mapping such data lies in finding suitable ontologies to describe the properties representing columns c1 to c4, recognizing the datatypes of the column values, linking names such as “East Med Sea” to actual entities occurring in other datasets, etc. Still, in typical data processing workflows, more than 80% of the effort goes into such data conversion, pre-processing, and cleansing tasks.

Within the Semantic Web, or, to be more precise, within the closed scope of Linked Data, this problem and the steps involved have been discussed in depth in the literature [7, 60]. A partial instantiation of a platform that provides a cleansed and integrated version of the Web of Linked Data is presented by the LOD Laundromat project [11]: here, the authors make available a cleansed, unified store of Linked Data as an experimental platform for the whole Web of Linked Data, containing most of the datasets of the current LOD cloud. Querying this platform efficiently and investigating the properties of this subset of the Web of Data is the subject of active ongoing research, even though only linked RDF data has been considered: building such a platform for arbitrary Open Data on the Web, or even only for the data accumulated in Open Data portals, would demand a solution at a much larger scale, handling more tedious cleansing, data format conversion, and schema integration problems.

4 Licensing and Provenance of Data

Publishing data on the Web is more than just making it publicly accessible. When it comes to consuming publicly accessible data, it is crucial for data consumers to be able to assess the trustworthiness of the data, to be able to use it on a secure legal basis, and to know where the data comes from and how it has been pre-processed. As such, if data is to be published on the Web, appropriate metadata (e.g., describing the data’s provenance and licensing information) should be published alongside it, thus making published data as self-descriptive as possible (cf. [34]).

Table 4. Top-10 licenses.

4.1 Open Data Licensing in Practice

While metadata about the terms and conditions under which a dataset can be re-used is essential for its users, according to the Linked Open Data Cloud web page, less than 8% of the Linked Data datasets provide license information.Footnote 24

Within Open Data portals, the situation seems slightly better overall: more than 50% of the datasets monitored in Open Data portals by the Portalwatch project (see Sect. 5 below) announce some kind of license information somewhere in their metadata [58]. The most prevalent license keys used in Open Data portals [58] are listed in Table 4.

While most of the provided license definitions lack a machine-readable description that would allow automated compatibility checks of different licenses or the like, some are not even compliant with Open Definition conformant data licenses (cf. Table 5).

Table 5. Open definition conformant data licenses [40]

In order to circumvent these shortcomings, different RDF vocabularies have been introduced to formally describe licenses as well as provenance information of datasets, two of which (ODRL and PROV) we will briefly introduce in the next two subsections.

4.2 Making Licenses Machine-Readable

The Open Digital Rights Language (ODRL) [39] is a comprehensive policy expression language (representable with a corresponding RDF vocabulary) that has been demonstrated to be suitable for expressing fine-grained access restrictions, access policies, as well as licensing information for Linked Data, as shown in [20, 69].

An ODRL Policy is composed of a set of ODRL Rules and an ODRL Conflict Resolution Strategy, which is used by the enforcement mechanism to ensure that when conflicts among rules occur, a system either grants access, denies access or generates an error in a non-ambiguous manner.

An ODRL Rule either permits or prohibits the execution of a certain action on an asset (e.g. the data requested by the data consumer). The scope of such rules can be further refined by explicitly specifying the party/parties that the rule applies to (e.g. Alice is allowed to access some dataset), using constraints (e.g. access is allowed until a certain date) or in case of permission rules by defining duties (e.g. a payment of 10 euros is required).

Listing 1.1 demonstrates how ODRL can be used to represent the CreativeCommons license CC-BY 4.0.

Listing 1.1. ODRL representation of the CC-BY 4.0 license.
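
The listing itself is not reproduced here; the following sketch (using rdflib, a hypothetical policy URI, and a plausible but unverified selection of ODRL actions) indicates roughly how such a policy can be expressed – the actual encoding of CC-BY 4.0 in Listing 1.1 may differ:

    from rdflib import Graph

    # Sketch of an ODRL policy along the lines of Listing 1.1 (terms chosen for illustration)
    ttl = """
    @prefix odrl: <http://www.w3.org/ns/odrl/2/> .
    @prefix ex:   <http://example.org/> .

    ex:policy-ccby40 a odrl:Policy ;
        odrl:permission [
            a odrl:Permission ;
            odrl:target ex:someDataset ;                             # the licensed asset
            odrl:action odrl:distribute, odrl:reproduce, odrl:derive ;
            odrl:duty [ a odrl:Duty ; odrl:action odrl:attribute ]   # attribution duty
        ] .
    """
    g = Graph().parse(data=ttl, format="turtle")
    print(len(g), "triples in the example policy")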

Policy Conflict Resolution. A rule that permits or prohibits the execution of an action on an asset could potentially affect related actions on that same asset. Explicit relationships among actions in ODRL are defined using a subsumption hierarchy, which states that an action \(\alpha_1\) is a broader term for action \(\alpha_2\) and thus might influence its permission/prohibition (cf. Fig. 5). Implicit dependencies, on the other hand, indicate that the permission associated with an action \(\alpha_1\) requires another action \(\alpha_2\) to be permitted as well. Implicit dependencies can only be identified by interpreting the natural language descriptions of the respective ODRL actions (cf. Fig. 6). As such, when it comes to the enforcement of access policies defined in ODRL, a reasoning engine is needed that is capable of catering for both explicit and implicit dependencies between actions.

Fig. 5. Example of explicit dependencies in ODRL.

Fig. 6. Example of implicit dependencies in ODRL.

4.3 Tracking the Provenance of Data

In order to handle the unique challenges of diverse and unverified RDF data spread over datasets published at different URIs by different data publishers across the Web, the inclusion of a notion of provenance is necessary. The W3C PROV Working Group [49] was chartered to address these issues and developed an RDF vocabulary enabling the annotation of datasets with interchangeable provenance information. On a high level, PROV distinguishes between entities, agents, and activities (see Fig. 7). A prov:Entity can be all kinds of things, digital or not, which are created or modified. Activities are the processes which create or modify entities. A prov:Agent is something or someone who is responsible for a prov:Activity (and indirectly also for an entity).

Fig. 7. The core concepts of PROV (taken from [49]).

Listing 1.2 illustrates a PROV example (all other triples removed) of two observations, where observation ex:obs123 was derived from another observation ex:obs789 via an activity ex:activity456 on the 1st of January 2017 at 01:01. This derivation was executed according to the rule ex:rule937, with an agent ex:fred being responsible. This use of the PROV vocabulary models the tracking of source observations, a timestamp, the conversion rule, and the responsible agent (which could be a person or a software component). The PROV vocabulary could thus be used to annotate whole datasets, single observations (data points) within such datasets, or, respectively, any derivations and aggregations made from Open Data sources and re-published elsewhere.

Listing 1.2. PROV annotations tracking the derivation of an observation.
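
The listing is likewise not reproduced here; based on the description above, such provenance triples might look roughly as follows (the PROV-O terms are standard, the concrete modelling is our guess):

    from rdflib import Graph

    ttl = """
    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <http://example.org/> .

    ex:obs123 a prov:Entity ;
        prov:wasDerivedFrom ex:obs789 ;
        prov:wasGeneratedBy ex:activity456 .

    ex:activity456 a prov:Activity ;
        prov:used ex:obs789, ex:rule937 ;
        prov:wasAssociatedWith ex:fred ;
        prov:endedAtTime "2017-01-01T01:01:00"^^xsd:dateTime .

    ex:fred a prov:Agent .
    """
    g = Graph().parse(data=ttl, format="turtle")
    print(g.serialize(format="turtle"))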

5 Metadata Quality Issues and Vocabularies

The Open Data Portalwatch project [58] was originally set up as a framework for monitoring and quality assessment of (governmental) Open Data portals; see http://data.wu.ac.at/portalwatch. It monitors data from portals using the CKAN, Socrata, and OpenDataSoft software frameworks, as well as portals providing their metadata in the DCAT RDF vocabulary.

Currently, as of the second week of 2017, the framework monitors 261 portals, which describe in total about 854 k datasets with more than 2 million distributions, i.e., download URLs (cf. Table 6). As we monitor and crawl the metadata of these portals on a weekly basis, we can use the gathered insights in two ways to enrich the crawled metadata of these portals: (i) we publish and serve the integrated and homogenized metadata descriptions in a weekly, versioned manner, and (ii) we enrich these metadata descriptions with assessed quality measures along several dimensions. These dimensions and metrics are defined on top of the DCAT vocabulary, which allows us to treat and assess the content independently of a portal’s software and its own metadata schema.

Table 6. Monitored portals and datasets in Portalwatch

The quality assessment is performed along the following dimensions: (i) the existence dimension consists of metrics checking for important information, e.g., whether there is contact information in the metadata; (ii) the metrics of the conformance dimension check whether the available information adheres to a certain format, e.g., whether the contact information is a valid email address; (iii) the open data dimension’s metrics test whether the specified format and license information is suitable to classify a dataset as open. The formalization of all quality metrics currently assessed on the Portalwatch platform, and implementation details, can be found in [58].
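
As a toy illustration of such metrics (the metadata keys and the list of open formats are invented here; the actual Portalwatch metrics are formalized in [58]), checks of this kind boil down to simple predicates over the harvested metadata:

    import re

    def assess(dataset_meta):
        """Toy existence/conformance/open-data checks in the spirit of Portalwatch (sketch only)."""
        email = dataset_meta.get("contact_email", "")
        return {
            "existence_contact": bool(email),                                          # existence
            "conformance_email": bool(re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", email)),  # conformance
            "open_format": dataset_meta.get("format", "").upper() in {"CSV", "JSON", "XML"},
        }

    print(assess({"contact_email": "data@city.example", "format": "csv"}))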

5.1 Heterogeneous Metadata Descriptions

Different Open Data portals use different metadata keys to describe the datasets they host, mostly depending on the software framework the portal runs on: while the schemas for metadata descriptions on Socrata and OpenDataSoft portals are fixed and predefined (they use their own vocabulary and metadata keys), CKAN provides higher flexibility in terms of an own, per-portal metadata schema and vocabulary. Thus, overall, the metadata that can be gathered from Open Data portals shows a high degree of heterogeneity.

In order to provide the metadata in a standard vocabulary, there exists a CKAN-to-DCAT extension for the CKAN software that defines mappings from datasets and their resources to the corresponding DCAT classes dcat:Dataset and dcat:Distribution and offers them via the CKAN API. However, in general it cannot be assumed that this extension is deployed on all CKAN portals: we were able to retrieve the DCAT descriptions of datasets for 93 of the 149 active CKAN portals monitored by Portalwatch [59].

Also, the CKAN software allows portal providers to include additional metadata fields in the metadata schema. When retrieving the metadata description for a dataset via the CKAN API, these keys are included in the resulting JSON. However, it is neither guaranteed that the CKAN-to-DCAT conversion of the CKAN metadata contains these extra fields, nor that these extra fields, if exported, are available in a standardized way.

We analysed the metadata of 749 k datasets over all 149 CKAN portals and extracted a total of 3746 distinct extra metadata fields [59]. Table 7 lists the most frequently used fields, sorted by the number of portals they appear in; the most frequent, spatial, appears in 29 portals. Most of these cross-portal extra keys are generated by widely used CKAN extensions; the keys in Table 7 are all generated by the harvestingFootnote 25 and spatialFootnote 26 extensions.

We manually selected mappings for the most frequent extra keys if they were not already included in the mapping; the selected properties are listed in the “DCAT key” column of Table 7 and are included in the homogenized, re-exposed metadata descriptions, cf. Sect. 5.2. For cells marked “?”, we were not able to choose an appropriate DCAT core property.

Table 7. Most frequent extra keys

5.2 Homogenizing Metadata Using DCAT and Other Metadata Vocabularies

The W3C identified the issue of heterogeneous metadata schemas across data portals and proposed an RDF vocabulary to address it: the metadata standard DCAT [48] (Data Catalog Vocabulary) describes data catalogs and the corresponding datasets. It models datasets and their distributions (published data in different formats) and re-uses various existing vocabularies such as Dublin Core terms [75] and the SKOS [52] vocabulary.

The recent DCAT application profile for data portals in Europe (DCAT-AP)Footnote 27 extends the DCAT core vocabulary and aims at the integration of datasets from different European data portals. In its current version (v1.1) it extends the existing DCAT schema with a set of additional properties. DCAT-AP allows specifying the version and the period of time covered by a dataset. Further, it classifies certain predicates as “optional”, “recommended”, or “mandatory”. For instance, in DCAT-AP it is mandatory for a dcat:Distribution to hold a dcat:accessURL.
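
Such a mandatory-property check is easy to express as a SPARQL query over harvested metadata; the following sketch uses standard DCAT terms on invented example data:

    from rdflib import Graph

    ttl = """
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix ex:   <http://example.org/> .

    ex:ds1 a dcat:Dataset ; dcat:distribution ex:dist1, ex:dist2 .
    ex:dist1 a dcat:Distribution ; dcat:accessURL <http://example.org/data.csv> .
    ex:dist2 a dcat:Distribution .          # violates the DCAT-AP "mandatory" rule
    """
    g = Graph().parse(data=ttl, format="turtle")

    q = """
    PREFIX dcat: <http://www.w3.org/ns/dcat#>
    SELECT ?dist WHERE {
        ?dist a dcat:Distribution .
        FILTER NOT EXISTS { ?dist dcat:accessURL ?url }
    }"""
    for row in g.query(q):
        print("missing dcat:accessURL:", row.dist)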

An earlier approach, from 2011, is the VoID vocabulary [3], published by the W3C as an Interest Group Note. VoID – the Vocabulary of Interlinked Datasets – is an RDF schema for describing metadata about linked datasets: it has been developed specifically for data in RDF representation and is therefore complementary to the DCAT model, but not fully suitable for modelling metadata on Open Data portals (which usually host resources in various formats) in general.

In 2011, Fürber and Hepp [32] proposed an ontology for data quality management that allows the formulation of data quality and cleansing rules, a classification of data quality problems, and the computation of data quality scores. The classes and properties of this ontology include concrete data quality dimensions (e.g., completeness and accuracy) and concrete data cleansing rules (such as whitespace removal), providing a total of about 50 classes and 50 properties. The ontology allows a detailed modelling of data quality management systems and might be partially applicable and useful for our system and data. However, in the Open Data Portalwatch we decided to follow the W3C Data on the Web Best Practices and use the more lightweight Data Quality Vocabulary for describing the quality assessment dimensions and steps.

More recently, in 2015, Assaf et al. [5] proposed HDL, a harmonized dataset model. HDL is mainly based on a set of frequent CKAN keys; on this basis, the authors define mappings from other metadata schemas, including Socrata, DCAT, and Schema.org.

Metadata mapping by the Open Data Portalwatch framework. In order to offer the datasets harvested in the Portalwatch project in a homogenized and standardised way, we implemented a system that re-exposes data extracted from Open Data portal APIs such as CKAN [59]: the output formats include a subset of W3C’s DCAT with extensions and Schema.org’s Dataset-oriented vocabulary.Footnote 28 We enrich the integrated metadata with the quality measurements of the Portalwatch framework, available as RDF data using the Data Quality VocabularyFootnote 29 (DQV). To further describe tabular data in our dataset corpus, we use simple heuristics to generate additional metadata using the vocabulary defined by the W3C CSV on the Web Working Group [65], which we likewise add to our enriched metadata. We use the PROV ontology (cf. Sect. 4.3) to record and annotate the provenance of our generated/published data (which is partially generated using heuristics). The example graph in Fig. 8 displays the generated data for the DCAT dataset, the quality measurements, the CSV metadata, and the provenance information.

Fig. 8. The mapped DCAT dataset is further enriched by three additional datasets (indicated by the bold edges): (i) each DCAT dataset is associated with a set of quality measurements; (ii) additional provenance information is available for the generated RDF graph; (iii) in case the corresponding distribution is a table, we generate CSV-specific metadata such as the delimiter and the column headers.
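
As an illustration of the DQV part of this enrichment, a single quality measurement attached to a dataset could look as in the following sketch (the measurement URI, metric, and value are invented; the dqv: terms are the standard ones):

    from rdflib import Graph

    ttl = """
    @prefix dqv:  <http://www.w3.org/ns/dqv#> .
    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <http://example.org/> .

    ex:ds1 a dcat:Dataset ;
        dqv:hasQualityMeasurement ex:measurement1 .

    ex:measurement1 a dqv:QualityMeasurement ;
        dqv:isMeasurementOf ex:contactEmailConformance ;   # hypothetical metric
        dqv:computedOn ex:ds1 ;
        dqv:value "0.87"^^xsd:decimal .
    """
    print(len(Graph().parse(data=ttl, format="turtle")), "triples")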

6 Searchability and Semantic Annotation

The popular Open Data portal software frameworks (e.g., CKAN, Socrata) offer search interfaces and APIs. However, the APIs typically only allow search over the metadata descriptions of the datasets, i.e., the title, description, and tags, and therefore rely on complete and detailed meta-information. Hence, if a user wants to find data about a specific entity, such a search might not be successful. For instance, a search for data about “Vienna” on the Humanitarian Data Exchange portal yields no results, even though there are relevant datasets on the portal such as “World – Population of Capital Cities”.

6.1 Open Data Search: State of the Art

Overall, to the best of our knowledge, there is not much substantial research in the area of search and querying for Open Data. A straightforward approach to offering search over the data is to index the documents as text files in typical keyword search systems. Keyword search is already addressed and partially solved by full-text search indices, as they exist in search engines such as Google. However, these systems do not exploit the underlying structure of the datasets. For instance, a default full-text indexer considers a CSV table as a single document, and the cells get indexed as (unstructured) tokens. A search query for tables containing the terms “Vienna” and “Berlin” in the same column is not possible with such existing search systems. In order to enable such structured search over the content of tables, an alternative data model is required.

In a current table search prototypeFootnote 30 we enable these query use cases while utilizing existing state-of-the-art document-based search engines. We use the search engine ElasticsearchFootnote 31 and index the rows and columns of a table as separate documents, i.e., we add a new document for each column and for each row, containing all values of the respective row/column. By doing so we store each single cell twice in the search system. This particular data model makes it possible to define multi-keyword searches over rows and columns, for instance, queries for which the terms “Vienna” and “Berlin” appear within the same column.
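
A minimal sketch of this row/column document model, using the official Elasticsearch Python client (8.x) with a made-up index layout and toy table, could look as follows; the prototype's actual schema may differ:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    table = [["city", "country"], ["Vienna", "Austria"], ["Berlin", "Germany"]]

    header, rows = table[0], table[1:]
    for i, row in enumerate(rows):                       # one document per row
        es.index(index="csv_rows", document={"table": "t1", "row": i, "values": row})
    for j, name in enumerate(header):                    # one document per column
        es.index(index="csv_cols", document={"table": "t1", "column": name,
                                             "values": [r[j] for r in rows]})

    # "Vienna" and "Berlin" in the same column => both terms must match one column document
    hits = es.search(index="csv_cols", query={"bool": {"must": [
        {"match": {"values": "Vienna"}}, {"match": {"values": "Berlin"}}]}})
    print(hits["hits"]["total"])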

Recently, the Open Data Network projectFootnote 32 has addressed the searchability issue by providing a search and query answering framework on top of Socrata portals. The UI allows starting a search with a keyword and suggests matching datasets or already registered questions. However, the system relies on the existing Socrata portal ecosystem with its relevant data API.Footnote 33 This API allows programmatic access to the uploaded data and applying filters on columns and rows.

The core challenge for search and querying over tabular data is to process and build an index over a large corpus of heterogeneous tables. In 2016, we assessed the table heterogeneity of over 200 k Open Data CSV files [54]. We found that a typical Open Data CSV file is smaller than 100 kB (the biggest over 25 GB) and consists of 14 columns and 379 rows. An interesting observation was that ~50% of the inspected header values were composed in camel case, suggesting that the table was exported from a relational table. Regarding the data types, roughly half of the columns consist of numerical data types. As such, Open Data CSV tables have different numbers of columns and rows, and column values can belong to different data types. Some of the CSV files contain multiple tables, and the tables themselves can be non-well-formed, meaning that there exist multiple header rows or rows with values aggregated over the previous rows.

To the best of our knowledge, research on querying over thousands of heterogeneous tables is fairly sparse. One of the initial works towards search and query over tables was that of Das Sarma et al. in 2012 [25]. The authors propose a system that finds, for a given input table, a set of related Web tables. The approach relies on the assumption that tables have an “entity” column (e.g. the player column in a table about tennis players) and introduces relatedness metrics for tables (either for joining two tables or for appending one table to the other). The authors propose a set of high-level features for grouping tables in order to handle the large number of heterogeneous tables and to reduce the search space for a given input table. Eventually, the system itself returns tables which can either be joined with the input table (via the entity column) or be appended to the input table (adding new rows).

The idea of finding related tables is also closely related to the research on finding inclusion dependencies (INDs), that is, relations such as \(A \subseteq B\) between the value sets of two columns \(A\) and \(B\). A core application of these dependencies is the discovery of foreign key relations across tables, but they are also used in data integration scenarios [53], query optimization, and schema redesign [62]. The task of finding INDs gets harder with the number of tables and columns, and the scalable and efficient discovery of inclusion dependencies across several tables is a well-known challenge in database research [9, 43, 62]. The state-of-the-art research combines probabilistic and exact data structures to approximate the INDs in relational datasets; the algorithm guarantees to correctly find all INDs and only adds false positive INDs with low probability [42].
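
To make the notion concrete, a naive (quadratic, exact) IND check over a handful of small tables can be written directly over column value sets; scalable approaches replace these sets with probabilistic summaries. The tables below are toy examples:

    # Naive inclusion-dependency discovery: column A ⊆ column B over small tables (sketch).
    tables = {                                   # made-up toy tables
        "cities":   {"city": {"Vienna", "Berlin", "Paris"}},
        "capitals": {"capital": {"Vienna", "Berlin", "Paris", "Rome"}},
    }

    inds = []
    for ta, cols_a in tables.items():
        for ca, vals_a in cols_a.items():
            for tb, cols_b in tables.items():
                for cb, vals_b in cols_b.items():
                    if (ta, ca) != (tb, cb) and vals_a <= vals_b:   # subset test
                        inds.append(f"{ta}.{ca} ⊆ {tb}.{cb}")
    print(inds)    # ['cities.city ⊆ capitals.capital']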

Another promising direction is the work of Liu et al. from 2014, which investigates the fundamental differences between relational data and JSON data management [46]. The authors derive three architectural principles to facilitate schema-less development within traditional relational database management systems. The first principle is to store JSON as JSON in the RDBMS. The second is to use SQL as a set-oriented query language rather than (only) a structured query language. The third is to use the available partial schema-aware indexing methods, but also schema-agnostic indexing. While this work focuses on JSON and XML, it would be interesting to study and establish similar principles for tabular data and to investigate how they could be applied to and benefit search and querying.

Enabling search and querying over Open Data could benefit from many insights of the research around semantic search systems. Earlier semantic search systems such as Watson [24], Swoogle [27], or FalconS [22] provided search and simple querying over collections of RDF data. More advanced systems, such as SWSE [38] or Sindice.com [61], focused on indexing RDF documents at Web scale. SWSE is a scalable entity lookup system operating over integrated data, while Sindice.com provided keyword search and entity lookups using an inverted document index. Surprisingly, published research around semantic search has slowed down. However, the big search engine players on the market, such as Google or Bing, utilise semantic search approaches to provide search over their internal knowledge graphs.

6.2 Annotation, Labelling, and Integration of Tabular Data

Text-based search engines such as Elasticsearch, however, do not integrate any semantic information about the data sources and therefore do not enable search based on concepts, synonyms, or related content. For instance, to enable a search for the concept “population” over a set of resources (that do not contain the string “population”), the tables (and their columns, respectively) need to be labelled and annotated correctly.

There exists an extensive body of research in the Semantic Web community on the semantic annotation and linking of tabular data sources. The majority of these approaches [2, 28, 45, 55, 66, 70, 73, 76] assume well-formed relational tables and try to derive semantic labels for attributes in these structured data sources (such as columns in tables), which are used to (i) map the schema of the data source to ontologies or existing semantic models or (ii) categorize the content of a data source.

Given an existing knowledge base, these approaches try to discover concepts and named entities in the table, as well as relations among them, and link them to elements and properties in the knowledge base. This typically involves finding potential candidates from the knowledge base that match particular table components (e.g., column header, or cell content) and applying inference algorithms to decide the best mappings.

However, in typical Open Data portals many data sources exist where such textual descriptions (column headers or cell labels) are missing or cannot be mapped straightforwardly to known concepts or properties using linguistic approaches, particularly when tables contain many numerical columns for which no semantic mapping can be established in this manner. Indeed, a major part of the datasets published in Open Data portals comprises tabular data containing many numerical columns with missing or non-human-readable headers (organisational identifiers, sensor codes, internal abbreviations for attributes like “population count”, or geo-coding systems for areas instead of their names, e.g. for districts) [47].

Table 8. Header mapping of CSVs in open data portals

In [57] we verified this observation by inspecting 1200 tables collected from the European Open Data portal and the Austrian Government Open Data portal and attempting to map the header values using the BabelNet service (http://babelnet.org). Table 8 lists our findings; an interesting observation is that the AT portal has an average of 20 columns per table, with an average of 8 numerical columns, while the EU portal has larger tables with an average of 4 out of 20 columns being numerical. Regarding the descriptiveness of possible column headers, we observed that 28% of the tables have missing header rows. Eventually, we extracted headers from 7714 out of around 10 k numerical columns and used the BabelNet service to retrieve possible mappings. We received mappings to BabelNet concepts or instances for only 1472 columns, confirming our assumption that many headers in Open Data CSV files cannot easily be mapped semantically.

Therefore, in [57] we propose an approach to find and rank candidate semantic labels and context descriptions for a given bag of numerical values, i.e., the numerical data in a certain column. To this end, we apply hierarchical clustering over information taken from DBpedia to build a background knowledge graph of possible “semantic contexts” for bags of numerical values, over which we perform a nearest-neighbour search to rank the most likely candidates. We assign different labels/contexts with different confidence values, and in this way our approach could potentially be combined with the previously introduced textual labelling techniques for further label refinement.
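
The core idea can be sketched in a few lines: summarize each labelled bag of numbers by simple distribution features and rank candidate labels for an unlabelled column by nearest-neighbour distance. The features, labels, and values below are invented for illustration; the actual system uses a DBpedia-derived background graph and richer features:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def features(values):
        v = np.asarray(values, dtype=float)
        return [v.min(), v.max(), v.mean(), v.std()]   # toy distribution features

    # background "contexts": label -> example bag of numerical values (made up)
    background = {
        "population":    [1700000, 83000, 2900000, 550000],
        "year":          [1990, 2001, 2015, 2017],
        "temperature_c": [-5.0, 12.3, 25.7, 31.1],
    }
    labels = list(background)
    nn = NearestNeighbors(n_neighbors=2).fit([features(v) for v in background.values()])

    unlabeled_column = [640000, 1200000, 95000]        # column without a usable header
    dist, idx = nn.kneighbors([features(unlabeled_column)])
    print([(labels[i], round(d, 2)) for i, d in zip(idx[0], dist[0])])   # ranked label candidates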

7 Conclusions, Including Further Issues and Challenges

In this chapter we gave a rough overview of the still persisting challenges of integrating and finding data on the Web. We focused on Open Data and provided some starting points for finding the large amounts of structured data available nowadays, the processing of which still remains a major challenge: on the one hand, because Semantic Web standards such as RDF and OWL have not yet found wide adoption and there is still a large variety of formats in which structured data is published on the Web; on the other hand, because even the use of such standard formats alone would not alleviate the issue of findability of said data. Proper search and indexing techniques for structured data and its metadata need to be devised. Moreover, metadata needs to be self-descriptive, that is, it needs to describe not only what published datasets contain, but also how the data was generated (provenance) and under which terms it can be used (licenses). Overall, one could say that despite the increased availability of data on the Web, (i) there are still a number of challenges to be solved before we can call it a Semantic Web, and (ii) one often needs to be ready to manually pre-process and align data before automated reasoning techniques can be applied.

Projects such as the Open Data Portalwatch, a monitoring framework for Open Data portals worldwide, from which most of the insights presented in this paper were derived, are just a starting point in the direction of making this Web of Data machine-processable: there are a number of aspects that we did not cover herein, such as monitoring the evolution of datasets, archiving such evolving data, or querying Web data over time, cf. [31] for some initial research on this topic. Nor did we discuss attempts to reason over Web data “in the wild” using OWL and RDFS, which we had investigated on the narrower scope of Linked Data some years ago [64], but which will impose far more challenges when taking into account the vast amounts of data not yet linked to the so-called Linked Data cloud, but available through Open Data portals. Lastly, another major issue we did not discuss in depth is multi-linguality: the data (content) as well as the metadata associated with Open Data is published in different languages, and thereby a lot of “open” information is accessible only to speakers of the respective languages, let alone integrable by machines; still, recent progress in machine translation and multi-lingual Linked Data corpora like BabelNet [56] could contribute to solving this puzzle.

You will find further starting points in these directions in the present volume, as well as in previous editions of the Reasoning Web summer school. We hope these starting points serve as an inspiration for further research on making machines understand openly available data on the Web, thus bringing us closer to the original vision of the Semantic Web – an ongoing journey.