1 Introduction

There is an increasing amount of data available. Some data are public (for example, on the Web) and everybody can exploit them, while others are accessible only to particular groups of people under licenses that constrain their exploitation and exploration (for example, the clinical history of patients in hospitals, industrial patents, etc.). Moreover, in the last decade there has been an increase in the number of structured data sources available, due to multiple reasons such as:

  • the development of Information and Communication Technologies (ICT), which provide an infrastructure to digitize, process and consume data;

  • the popularization of the Linked Data Web and the Internet of Things, which fostered a change of behavior in people, companies and administrations, many of which decided to publish structured data (some of them coming from different types of sensors) on the Web; and

  • new policies fostered by different organizations and governments (for example, the Organisation for Economic Co-operation and Development (OECD) declared that all publicly funded data should be available to everybody).

The main goal of research in the Keystone Action COST IC1302 is to manage large amounts of heterogeneous data, especially structured data, in order to provide users (people or software agents) with the data they require in an effective way and at minimum cost. Keystone is organized in four working groups: Representation of Structured Data Sources (WG1), Keyword-based Search (WG2), User Interaction and Keyword Query Interpretation (WG3), and Research Integration, Showcases, Benchmarks and Evaluations (WG4), as shown in Fig. 1. This chapter focuses on the research related to WG1, whose results influence WG2 and WG3, whereas WG4 focuses on the integration of the results of all working groups and how to exploit them.

Fig. 1. Working groups of the Keystone Action COST IC1302.

Independently of the kind of data considered, the consumption or reuse of structured data (taking into account its licenses and the applicable legislation) is still limited. Thus, in the Web of Linked Data, most users only consume and reuse well-known general-purpose reference data sources, such as Wikidata and DBpedia, despite the fact that there exist other domain-specific data sources that could be more appropriate for their purposes. We consider that the causes of this behavior are the difficulties in locating, identifying and exploiting suitable data sources, due to:

  • The noise and inconsistencies that appear in the generation, transfer and transformation of data: the longer the transfer chain and the more transformations applied, the more noise and inconsistencies are introduced.

  • The large number of technologies to store and index structured data sources that have appeared recently. There exist multiple alternatives, such as MongoDB, Cassandra, traditional relational databases, graph databases and different RDF stores, and it is difficult to decide which one is the most appropriate in a specific context, as there are no consensus guidelines and/or standards that help users select the most appropriate technology for a particular context.

  • The lack of knowledge and tools with reliable information about the nature of distributed third-party datasets, for instance with respect to their quality, dynamics, temporal coverage or the domains they address.

  • The lack of knowledge and tools to locate, in an efficient way, the data sources that are interesting for users with a specific purpose; ideally, in a fully automatic way even when the users do not ask for them explicitly (i.e., based on a push-based approach).

So, in this chapter, the research of different research groups and authors involved in Keystone WG1 is organized in the areas or categories of the pipeline (or data source value chain) in Fig. 2: Generation of Structured Data (WG1.A); Storing and Indexing of Structured Data (WG1.B); Characterization, Integration and Federation of Data Sources (WG1.C); and Selection and Retrieval of Data Sources (WG1.D).

Fig. 2. Pipeline considered to publish or select a specific data source.

Moreover, information about the different research groups that contribute to the research on which Keystone WG1 is focused is provided in Sect. 6, and the list of the authors that contributed to the elaboration of this survey chapter is provided in Sect. 7.

2 Generation of Structured Data (WG1.A)

The first question that arose when discussions about the generation of structured data took place in Keystone Working Group 1 was: where do structured data come from?, i.e., who or what generates structured data? The received feedback was organized in four overlapping groups: (1) from unstructured or semi-structured data sources (such as documents written in natural language and traditional HTML web pages), (2) from human users in a collaborative way, (3) from sensors and Internet of Things (IoT) devices, and (4) from other structured data sources. Moreover, discussions about how to publish or generate structured data are also relevant at this point (see Sect. 2.5).

2.1 From Unstructured or Semi-structured Data Sources

Since its creation in 2001, Wikipedia has become one of the most important sources of reliable information on the Web; currently, there are more than 280 active versions of Wikipedia in different languages. Wikipedia articles are typically split into two parts: (1) a body of unstructured text with details on the article subject and (2) an optional semi-structured box usually called infobox. A considerable number of projects have exploited infoboxes in order to create structured data, such as Google’s Knowledge Graph, Microsoft’s Satori, and DBpedia [4] (the most famous one nowadays). More recently, in 2012, Wikimedia Deutschland proposed the Wikidata project [55], whose main goal is to provide high-quality structured data, acquired and maintained collaboratively, to be directly used by Wikipedia to enrich its contents. DBpedia and Wikidata have become two important structured data sources in the current Linked Data Web: according to [42], DBpedia is the node with the second largest number of incoming links on the Linked Data Web, whereas Wikidata has been continuously increasing its popularity since its creation [56]. Consequently, a great number of data sources refer to them. A comparison between DBpedia and Wikidata and an analysis of their evolution have also been carried out [23] by considering different criteria, metrics and frameworks focused on the quality of structured data sources [25, 40, 58]. There also exist recent works based on a wide set of techniques (text mining, ontology alignment, entity linking [21], etc.), such as [31], whose main purpose is the discovery of relationships among the elements of wikis written in different natural languages.

A great effort has also been made to develop techniques to extract structured data from texts and documents, and several techniques, tools and frameworks to extract data from the Web and from texts have been developed. Some examples of these tools are: DBpedia Spotlight, Babelfy, different temporal taggers (e.g., SUTime and HeidelTime), the Stanford Named Entity Recognition (NER) system and Part-of-Speech (POS) tagging systems. On the other hand, there also exist other digital resources with a great potential from which structured data can be extracted, such as images, video, multidimensional arrays containing for example environmental data, data streams coming from sensors with a certain frequency, etc. Some works around these topics are [20, 54]. The former presents a tool to create structured data sources about meteorological issues, while the latter focuses on revealing new information about a virus by using Information Extraction (IE) techniques combined with existing genome sequence data. However, there are no widely adopted standard methods or techniques to answer the following questions: which features should be considered for images, video, and streaming data from sensors? How should a unique identifier be built for all data in these kinds of data sources? Should this kind of data be associated with a geographical position? Which granularity should be considered (country, GPS coordinates, region, ...)?
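As an illustration of this kind of tool, the following minimal sketch (not taken from the surveyed works) sends a short text to the public DBpedia Spotlight REST endpoint in order to obtain entity mentions linked to DBpedia resources; the endpoint URL, the parameters and the response fields shown are assumptions about the public API and may change over time.

```python
import requests

# Hypothetical input text; the public DBpedia Spotlight service is assumed
# to be reachable at api.dbpedia-spotlight.org (this may change over time).
text = "Wikipedia was created in 2001 and DBpedia extracts structured data from it."

response = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": text, "confidence": 0.5},
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()

# Each resource describes a mention of a DBpedia entity found in the text.
for resource in response.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])
```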

2.2 From Human Users in a Collaborative Way

In the context of Knowledge Representation and Artificial Intelligence, structured data sources are usually called Knowledge Bases. Thus, a Knowledge Base is considered a store of information or data that is available to draw on, or the underlying set of facts and rules of a certain domain stored in a specific format. Therefore, creating/generating a structured data source is quite similar to creating a Knowledge Base in these contexts.

Some authors consider that a Knowledge Base (KB) is composed of two main elements [37]: (1) a set of ontologies that establish the model of the data that the KB contains (this set of ontologies is also known as TBox or Terminological Box), and (2) data or instances that represent facts of the domain modeled by the TBox (this set of data or instances is also known as ABox or Assertional Box). There is a certain agreement on the definition of ontology; the most popular definition is “explicit specification of a conceptualization” [35]. However, there exist different approaches to create ontologies. Some groups and tools use bottom-up approaches starting from folksonomies [57], while other methodologies such as NeOn [53] follow a top-down approach that takes the knowledge of domain experts as starting point. On the other hand, to the best of our knowledge, there is no widely adopted tool or technique to populate a KB, i.e., there does not exist a well-known technique or tool to create the data of the ABox component. Nevertheless, some Extraction, Transformation and Load (ETL) systems have been adapted to populate specific KBs in certain domains. Besides, there are some emerging tools, oriented to non-technical users, that suggest attributes/properties and values to be filled in by users in order to populate a KB in a collaborative way [49]. Moreover, adapting recommender system techniques, such as collaborative, content-based, knowledge-based and context-aware recommender techniques, among others [51], could also be a possibility worth exploring. The main challenge of using recommender systems in this context is how to evaluate their performance, as there are no standard benchmarks or datasets; hence, a framework to generate synthetic data for the evaluation of context-aware recommender systems was recently created [16]. On the other hand, recent studies have focused on analyzing how data sources evolve over time, in particular how collaboratively performed data edits evolve [50].

Despite the fact that there is neither a widely adopted methodology to create the TBox of a KB nor a widely adopted system to populate the KB (to insert, update and delete data in the ABox component), there exists a widely adopted set of standard languages to create KBs and ontologies. The most popular languages to create ontologies are RDFS [11] and OWL [34], standardized by the World Wide Web Consortium (W3C), while the most popular language to populate ontologies is RDF [32].
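To make the TBox/ABox distinction and these languages concrete, the following minimal sketch (using the rdflib Python library and a hypothetical example.org namespace) describes a tiny terminological component with RDFS and populates it with one assertional fact encoded in RDF:

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")  # hypothetical namespace for the example
g = Graph()

# TBox (terminological component): a class and a property described with RDFS.
g.add((EX.Researcher, RDF.type, RDFS.Class))
g.add((EX.worksOn, RDF.type, RDF.Property))
g.add((EX.worksOn, RDFS.domain, EX.Researcher))

# ABox (assertional component): facts about a concrete individual.
g.add((EX.alice, RDF.type, EX.Researcher))
g.add((EX.alice, EX.worksOn, Literal("structured data sources")))

print(g.serialize(format="turtle"))
```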

2.3 From Sensors and IoT Devices

With the popularization of the Internet of Things, multiple devices provide digital data coming from different types of sensors (temperature, location, humidity, etc.) for different purposes: remote control, automatic control, monitoring of areas, etc. Independently of the type of sensor and the purpose of obtaining those data, different issues have arisen:

  • Which devices should process the data, and where should the data processing be performed: in the sensors themselves, in the infrastructure used to create the sensor network, or in servers where all data are collected and grouped? Different research groups have focused on answering these questions, and new trends such as Fog Computing or Edge Computing extend the Cloud Computing paradigm to the edge of the network [52].

  • When should data be transmitted from the sensors to the network: at fixed time intervals, or only when a relevant change happens (changes of values bigger than a certain threshold; a minimal sketch of such a threshold-based policy is shown after this list)? The proposal in [38] considers the transmission of only certain values together with a function to predict the new values; when the trend of the sequence of values changes, a new transmission of values and prediction functions is performed. Other works follow a pull approach, i.e., consulting the current values of the sensors when they are required, instead of generating and transmitting data from the sensors all the time (push approach).

  • Which type of data should be provided to the consumers: raw data or smart data? The purpose of smart data is to avoid overloading final users with raw data that are difficult to process and digest; on the other hand, when smart data are provided, several filters are usually applied to simplify the output, and some of these filters could remove raw data that are relevant for the final user.
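The following is a minimal, simplified sketch of the threshold-based (dead-band) transmission policy mentioned in the second item above; it is an illustrative assumption and not a description of the proposal in [38]:

```python
def filter_transmissions(readings, threshold):
    """Yield only the sensor readings that should be transmitted: the first one,
    and any reading whose absolute difference with the last transmitted value
    exceeds the threshold (dead-band policy)."""
    last_sent = None
    for value in readings:
        if last_sent is None or abs(value - last_sent) > threshold:
            last_sent = value
            yield value

# Example: temperature readings; only significant changes are transmitted.
readings = [20.0, 20.1, 20.2, 21.5, 21.6, 19.0]
print(list(filter_transmissions(readings, threshold=1.0)))
# -> [20.0, 21.5, 19.0]
```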

Another important issue that arises in the discussions about the totally connected world is how to exploit structured data while taking into account privacy and security issues. Moreover, industries demand methods to anonymize personal data in order to exploit them while guaranteeing the protection of their clients and workers and dealing with the right to be forgotten in an appropriate way. So, guidelines, techniques and tools that deal with these issues from an interdisciplinary point of view are required. Frameworks and directives about security and/or safety issues usually consider the following dimensions of security, represented as a triangle (since when priority is given to one of them, the other ones usually become weaker points): Availability, Confidentiality and Integrity of data. Besides, some frameworks consider other sub-dimensions of Integrity, such as Authenticity and Traceability (or the provenance of data, on which relevant works have recently been published [39, 45]). Finally, we would like to remark that works related to blockchains are also emerging to address security issues in environments with sensors [41].

Finally, notice that nowadays a great amount of data is also generated by the execution of different processes. These data are usually stored in logs of different types with a specific structure. Therefore, they can be analyzed to extract knowledge about the executed processes and to evaluate their performance. These issues are studied in a recent research area called Process Mining [1].

2.4 From Other Structured Data Sources

Nowadays, there exist a great number of secondary data sources, which obtain their contents from one or several other data sources, in contrast to primary data sources, which create their own contents from scratch. For example, the content of data warehouses for analytical purposes is usually generated from transactional systems over time. In this context, tools have become popular to facilitate: (a) the extraction of the data from the original systems, (b) the transformation of the data to integrate and clean them, and (c) the loading of the transformed data into the destination systems. Some examples of popular Extraction, Transformation and Load (ETL) systems are RapidMiner, Talend and Pentaho. All these tools provide mechanisms to deal with data curation. Nevertheless, this is a topic where there still exist open research issues, such as entity de-duplication, entity disambiguation and dealing with multilingual aspects.

Standards to translate relational databases into RDF-based semantic data sources, such as Direct Mapping [3] and R2RML [22], have also become popular in recent years. However, they have not been widely adopted yet. When this translation is performed, there are two alternative options: (1) materializing the RDF data by storing the same data in two different stores (one representation based on a relational approach, and another based on an RDF approach), and (2) using wrappers to create a virtual RDF model while maintaining the original relational database. The first option requires dealing with the problem of data redundancy (the same data stored using two different models and structures), while the second option usually requires more processing time. A more complex scenario arises when heterogeneous data sources, which use different models with different semantics, have to be dealt with.
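As an illustration of the materialization option, the following minimal sketch converts the rows of a hypothetical relational table into RDF triples in the spirit of the Direct Mapping, using Python's standard sqlite3 module and the rdflib library; it is a simplification and not an implementation of the W3C standards:

```python
import sqlite3
from rdflib import Graph, Literal, Namespace, RDF, URIRef

BASE = Namespace("http://example.org/db/")  # hypothetical base IRI

# A small in-memory relational database used only for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'Alice', 'Zaragoza')")
conn.commit()

g = Graph()
for row_id, name, city in conn.execute("SELECT id, name, city FROM person"):
    # One IRI per row (built from the table name and primary key), one triple per column.
    subject = URIRef(BASE + f"person/{row_id}")
    g.add((subject, RDF.type, BASE.Person))
    g.add((subject, BASE.name, Literal(name)))
    g.add((subject, BASE.city, Literal(city)))

print(g.serialize(format="turtle"))
```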

2.5 Methodologies, Standards and Good Practices to Publish and Consume Structured Data

The main principles to publish Linked Data were established by Tim Berners-Lee et al. in 2006. This proposal is based on: (1) using URIs as names for things/resources, (2) using HTTP dereferenceable URIs, so that people can look up data about the resources that they represent by means of a browser, and (3) including links to other URIs in order to discover more things. These principles were refined later by Bizer et al. [8]. Besides, the release of DBpedia fostered the creation of new RDF-based data sources on the Web, as it showed the steps required to develop and implement a linked data source. After that, a standard formal language to query that kind of data source was required; so, in 2008, the SPARQL Protocol and RDF Query Language (SPARQL) was released as a W3C Recommendation [28]. Later, in 2013, this standard was updated [33]. Moreover, web services, called SPARQL endpoints, that allow users to submit SPARQL queries to RDF data sources were also standardized. Unfortunately, a high percentage of users are not able to express their information needs as a SPARQL query, as doing so requires knowing: (1) the syntax of SPARQL, in order to build a syntactically correct query, and (2) the underlying data structure of the source, i.e., how the data are organized (its schema or intension) and its semantics, in order to build a semantically correct query and to express the information need properly.
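For illustration, the following minimal sketch submits a simple SPARQL query to a public SPARQL endpoint using the standard SPARQL Protocol over HTTP; the DBpedia endpoint URL and the query shown are only assumptions for the example, and any SPARQL endpoint could be used instead:

```python
import requests

ENDPOINT = "https://dbpedia.org/sparql"  # assumed public SPARQL endpoint

# A simple query: a few resources typed as dbo:City together with their English labels.
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?city ?label WHERE {
  ?city a dbo:City ;
        rdfs:label ?label .
  FILTER (lang(?label) = "en")
} LIMIT 5
"""

response = requests.get(
    ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
response.raise_for_status()

# The SPARQL results JSON format groups variable bindings under results/bindings.
for binding in response.json()["results"]["bindings"]:
    print(binding["city"]["value"], "-", binding["label"]["value"])
```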

In order to ease the automatic interpretation and processing of the content available in web pages, i.e., to make them understandable to machines and not only to humans (who can read them), semantic annotations were created. A semantic annotation is an annotation embedded in the HTML source of a web page that makes the semantic meaning of a certain piece of content (for example, a sequence of characters) explicit for a machine. The standard language to make semantic annotations is the W3C Recommendation RDFa (RDF in Attributes). This language was released as a Recommendation in 2012 [43] and later updated in 2015 [44]. Despite the fact that there exists a great amount of semantically annotated and linked data on the Web, most of the current content of the Web is not annotated. On the other hand, during the last decade, initiatives promoted by the main search engine companies (Google, Yandex, Bing, etc.) have created standard vocabularies (ontologies) to annotate web content (Schema.org [36] has been one of these successful initiatives). Moreover, these companies promote the use of annotations by ranking annotated web pages in the first positions of the result pages of the searches performed by their users.

In conclusion, there currently exist two main ways of consuming the Linked Data Web: (1) periodically crawling web pages with semantic annotations (such as RDFa annotations) in order to discover new data, and (2) querying SPARQL endpoints either to find out their structure or to obtain specific data.

3 Storing and Indexing of Structured Data (WG1.B)

In 1970, Edgar F. Codd defined the foundations of the relational database model to structure data within a database. This model has been widely used ever since, and its implementations satisfy the ACID properties (Atomicity, Consistency, Isolation and Durability). At the same time as the foundations of the relational model were being defined, Donald D. Chamberlin and Raymond F. Boyce developed a language called Specifying Queries As Relational Expressions (SQUARE) to query databases based on that model. The evolution of SQUARE was later, in 1974, called Structured English Query Language (SEQUEL). SEQUEL was oriented to non-expert users because it specified “what” data to retrieve instead of “how” to retrieve them (i.e., it was a declarative language instead of a procedural language). SEQUEL was renamed and standardized as the widely adopted Structured Query Language (SQL) in the middle of the 80’s and became the most used language to query data sources in the 80’s and the 90’s. Thus, although other types of models (e.g., deductive, pure object-oriented, etc.) were proposed, they did not achieve commercial success.
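To illustrate the declarative style, the following minimal sketch (using Python's standard sqlite3 module with a hypothetical table) only states what data to retrieve and leaves how to retrieve them to the database engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_reading (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sensor_reading VALUES (?, ?)",
    [("t1", 19.5), ("t1", 22.3), ("t2", 18.0)],
)

# Declarative: we state WHAT we want (average value per sensor above a threshold),
# not HOW to scan, group or aggregate the rows.
for sensor, avg_value in conn.execute(
    "SELECT sensor, AVG(value) FROM sensor_reading GROUP BY sensor HAVING AVG(value) > 19"
):
    print(sensor, round(avg_value, 2))  # -> t1 20.9
```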

With the explosion of the Web in the middle of the 90’s, the use of databases oriented to manage text documents increased. Moreover, there currently exists a wide range of data sources with different purposes, based on different models, which are generally classified as NoSQL (Not only SQL) databases. The most popular ones are the following:

  1. graph-oriented databases (where the Triple Stores for managing RDF fit),

  2. multivalued databases,

  3. object-oriented databases,

  4. columnar databases,

  5. key-value databases, and

  6. multi-model databases.

Most of these new types of models focus on satisfying a set of properties different from the ACID properties. Thus, NoSQL databases focus on the Basically Available, Soft-state, Eventually consistent (BASE) properties [10], which emerged when the Consistency, Availability and Partition-Tolerance (CAP) theorem became popular around 2000 [9]. Moreover, notice that the storage of these data sources can be distributed over a network. On the other hand, the federation of independent data sources is commonly required to tackle complex problems; for example, in order to create pollutant dispersion models for a city, it is necessary to obtain data from meteorological models, traffic models, geographical information systems, and the geometry of the buildings of the city.

Regarding the structures and indexes used to store structured data sources, the most popular ones are the following:

  1. balanced trees and B+ trees for relational databases,

  2. inverted indexes for databases oriented to store documents, and

  3. different formats based on text files for RDF, such as RDF Turtle [24], RDF N-Triples [5], RDF/XML, RDF/JSON, etc. (a small example appears after this list).
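As a small illustration of the text-based RDF formats mentioned in the last item, the following sketch (using the rdflib Python library and a hypothetical triple) serializes the same graph as Turtle and as N-Triples:

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()
g.add((EX.dbpedia, EX.incomingLinks, Literal(1000)))  # illustrative triple

# The same graph rendered in two of the text-based formats mentioned above.
print(g.serialize(format="turtle"))  # Turtle: prefixes plus a compact syntax
print(g.serialize(format="nt"))      # N-Triples: one full triple per line
```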

Moreover, several Keystone members have proposed different structures to index and store RDF in a binary way. The most popular proposed approaches are: Header-Dictionary-Triples (HDT) [29], HDT-MR (based on MapReduce) [30], RDFCSA (a compact RDF store based on compressed suffix arrays, a well-known self-index) [13] and K2-Triples (a compressed vertical partitioning for RDF) [2]. Moreover, works about versioning RDF, i.e., about the evolution of an RDF data source over time, have also been proposed. Some relevant works are: a compressed kd-tree for temporal graphs [18], a compressed suffix array for temporal graphs [12] and RDF-Archive [19].

Finally, notice that indexes or structures to improve the access to or storage of structured data sources can be classified by considering the following categories: (1) in-memory structures vs. on-disk structures; (2) compact structures vs. structures over plain data (generally text); and (3) self-indexing structures, where the index and the data are kept in a single in-memory data structure that allows indexed searches and recovering the original data.

4 Characterization, Integration and Federation of Data Sources (WG1.C)

At this point, the main questions about the characterization, integration and federation of data sources discussed in the context of the Semantic Web by Keystone Working Group 1 were: which metadata should be considered to describe an (RDF) data source?, how to evaluate the quality of a data source?, and how to integrate/federate heterogeneous data sources?

With respect to the first question, notice that there currently exist multiple standard languages and initiatives to describe the content of a data source, such as RDFS [11], OWL [34], VoID and DCAT (a minimal VoID description is sketched at the end of this section). Nevertheless, some Keystone members have recently been working on a survey to provide a comprehensive overview of RDF dataset profiling features, methods, tools and vocabularies [7]. With respect to the second question, a great amount of work has been done recently. Some works focus on defining methodologies and metrics, grouped in dimensions, to study the quality of data sources, such as [58]; while others focus on creating methods and tools to perform that evaluation efficiently. Some recent tools are: qSKOS, Skosify, Luzzu [25] and PoolParty. Finally, with respect to the third question, some systems developed by Keystone members in order to integrate/federate heterogeneous data sources are briefly described in the following:

  • MOMIS [14]. An open source tool, developed by the University of Modena and Reggio Emilia and the company DataRiver, to perform data integration from heterogeneous static data sources.

  • SOS-SM [47, 48]. A framework, developed by the University of Santiago de Compostela, whose aim is the semantic mediation between environmental observation datasets through OGC Sensor Observation Service interfaces. The framework combines a mediator/wrapper architecture with a Local As View approach for data integration, supported by a global model based on the Semantic Sensor Network ontology proposed by the W3C. General-purpose wrappers were also developed to incorporate vector-based datasets recorded in spatial relational databases and raster-based datasets accessed through Unidata NetCDF Subset services.

There also exist multiple initiatives and projects to exploit open data in the context of smart cities [46]. However, to the best of our knowledge, most of them focus on specific domains such as transportation, pollution, energy, points of interest for tourists, etc. On the other hand, KnowledgeManagement4City [6] is an ontology oriented to modeling smart city services. This ontology provides a unified view that facilitates the creation of any service for the city, as all services are managed in a uniform way.
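Coming back to the first question above (which metadata should describe a data source), the following minimal sketch builds a small VoID description of a hypothetical dataset with the rdflib Python library; the properties shown are only a tiny, illustrative subset of the VoID vocabulary:

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS, URIRef

VOID = Namespace("http://rdfs.org/ns/void#")
EX = Namespace("http://example.org/")  # hypothetical namespace for the described dataset

g = Graph()
dataset = EX.myDataset  # hypothetical dataset to be described

g.add((dataset, RDF.type, VOID.Dataset))
g.add((dataset, RDFS.label, Literal("Example sensor observations dataset")))
g.add((dataset, VOID.sparqlEndpoint, URIRef("http://example.org/sparql")))
g.add((dataset, VOID.triples, Literal(120000)))  # declared size of the dataset in triples

print(g.serialize(format="turtle"))
```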

5 Selection and Retrieval of Data Sources (WG1.D)

At this point, the two main questions discussed by Keystone Working Group 1 were: how to discover or recommend structured data sources?, and how to discover equivalent concepts, properties and instances in two different data sources? With respect to the first question, different research groups involved in WG1 of the Keystone Action COST IC1302 have published recent works about recommendation; some representative examples are [15, 17]. On the other hand, with respect to the second question, some relevant research papers have also been published recently [26, 27].

6 Composition of Working Group 1

According to the information on the website of the Keystone Action COST IC1302 (http://www.keystone-cost.eu/), Working Group 1 is composed of 162 members (41 females and 121 males) belonging to 28 research groups (see Tables 1 and 2). For more details about the host countries of the different researchers involved in the working group, see Table 3. Most members of the working group are currently active in research areas related to Working Group 1 of Keystone (Representation of Structured Data Sources). In more detail, the distribution of the people who have provided feedback for this chapter, by considering their host countries, is shown in Table 4.

Table 1. Research groups in Keystone Working Group 1 per country (part 1 of 2).
Table 2. Research groups in Keystone Working Group 1 per country (part 2 of 2).

When the papers collected to analyze the research results of Keystone WG1 are clustered by considering the research groups to which their authors belong, clusters showing collaborations on topics related to WG1 among the research groups participating in this working group are created (see Figs. 3 and 4). Notice that the research groups that have published the most joint papers with authors from other research groups are those whose researchers have been involved in the leadership of WG1 (the leaders of WG1 belong to the research groups represented by DE1 and ES4) or of the network (the chair of the Action belongs to the research group IT1, while the scientific coordinator of the Action belongs to the research group represented by IT2).

Table 3. Number of researchers in Keystone Working Group 1 per country.
Table 4. Number of researchers per country who provided feedback to create this chapter.

Finally, research groups were also categorized by considering the research topics of the papers that they provided and the steps of the data value chain defined in this chapter (WG1.A, WG1.B, WG1.C and WG1.D). Moreover, the category other was also considered (Fig. 5).

Fig. 3. Research groups clustered by considering joint papers related to WG1 in the last 4 years.

Fig. 4. Countries of the researchers from WG1 clustered by considering joint papers related to topics of WG1.

Fig. 5. Research groups of the Keystone Action COST IC1302 categorized according to the data value chain defined in this chapter.

7 Researchers Contributing to This Survey

We sincerely thank every member of the working group for the work done during the last four years. We especially thank those members who helped us to analyze the state of the art of the research related to Keystone WG1 and provided us with references to their research papers. These researchers are the following (in alphabetical order by surname): Prof. José F. Aldana, Prof. Nieves R. Brisaboa, Dr. Ioannis Anagnostopoulos, Dr. Ilaria Bartolini, Dr. Fernando Bobillo, Dr. John Breslin, Dr. Ana Cerdeira Pena, Dr. Elena Demidova, Dr. Stefan Dietze, Dr. Mauro Dragoni, D. Dudić, Dr. Pablo Fafalios, Prof. Gilles Falquet, Prof. Antonio Fariña Martínez, Dr. Javier D. Fernández, Catarina Ferreira da Silva, Dr. Francesco Guerra, Dr. Ramón Hermoso, Dr. Claudia Ifrim, Dr. Sergio Ilarri, Dr. Ekaterini Ioannou, Dr. Javier Lacasta, Dr. Susana Ladra, Dr. Martín López Nores, Dr. Mihai Lupu, Dr. Miguel A. Martínez, Dr. Javier Nogueras, Dr. Enn Õunapuu, V. Pajić, Dr. José Ramón Paramá Gabía, Dr. Laura Po, Prof. José Ramón Ríos Viqueira, Dr. Ma del Mar Roldán, Dr. Tarcísio Souza, Dr. Yannis Stavrakas, Dr. Velislava Stoykova, Prof. Vagan Terziyan, Dr. Raquel Trillo Lado, Dr. Genoveva Vargas, Prof. Yannis Velegrakis, and Dr. Manolis Wallace. Thus, feedback to create this chapter has been received from 12 different countries (see more details in Table 4).