
1 Introduction

The evolution of data environments towards the consumption of data from multiple sources, together with the growth in the schema size, complexity, dynamicity and decentralisation (SCoDD) of data [4, 7], increases the complexity of contemporary data management. The SCoDD trend emerges as a central data management concern in Big Data scenarios, where users and applications demand more complete data, produced by independent data sources under different semantic assumptions and contexts of use. This is the typical scenario for Semantic Web/Linked Data applications.

The evolution of databases in the direction of heterogeneous data environments strongly impacts the usability, semiotic and semantic assumptions behind existing data accessibility methods such as structured queries, keyword-based search and visual query systems. With schema-less databases containing potentially millions of dynamically changing attributes, it becomes infeasible for some users to become aware of the ‘schema’ or vocabulary in order to query the database. At this scale, the effort of understanding the schema in order to build a structured query can become prohibitive.

This Semantic Web Challenge focuses on catalyzing the development and evaluation of methods and tools which can help data consumers to query structured data without understanding the representation behind the data.

At the center of this discussion is the semantic gap between users and databases, which becomes more central as the scale and complexity of the data grows. Addressing this gap is a fundamental part of the Semantic Web vision.

Schema-agnostic query mechanisms aim at abstracting users from the representation of the data, supporting the automatic matching between queries and databases [1, 2, 5]. This challenge aims at emphasizing the role of schema-agnosticism as a key requirement for contemporary database management, by providing a test collection for evaluating flexible query and search systems over structured data in terms of their level of schema-agnosticism (i.e. their ability to map a query issued with the users’ terminology and structure to the dataset vocabulary). The challenge is instantiated in the context of Semantic Web datasets.

2 Schema-Agnostic Queries

Schema-agnostic queries can be defined as query approaches over structured databases which allow users to satisfy complex information needs without understanding the representation (schema) of the database. Similarly, [5] defines it as “search approaches, which do not require users to know the schema underlying the data”. Approaches such as keyword-based search over databases allow users to query databases without employing structured queries. However, as discussed by [5]: “From these points, users however have to do further navigation and exploration to address complex information needs. Unlike keyword search used on the Web, which focuses on simple needs, the keyword search elaborated here is used to obtain more complex results. Instead of a single set of resources, the goal is to compute complex sets of resources and their relations”.

The development of approaches to support natural language interfaces (NLI) over databases has aimed towards the goal of schema-agnostic queries. Complementarily, some approaches based on keyword search have targeted keyword-based queries which express more complex information needs. Other approaches have explored the construction of structured queries over databases where schema constraints can be relaxed. All these approaches (natural language, keyword-based search and structured queries) have targeted different degrees of sophistication in addressing the problem of supporting a flexible semantic matching between queries and data, which vary from the complete absence of the semantic concern to more principled semantic models.

While the demand for schema-agnosticism has been an implicit requirement across semantic search and natural language query systems over structured data, it has not been sufficiently individuated as a concept and as a necessary requirement for contemporary database management systems. Recent works have started to define and model the semantic aspects involved in schema-agnostic queries [1, 2, 5].

3 Challenge Description

The challenge aims at providing an evaluation test collection for schema-agnostic query mechanisms, focusing on Semantic Web scenarios. The large-schema and semantically heterogeneous nature of Semantic Web datasets makes schema-agnosticism a fundamental data management concern for this community.

The test collection supports the quantitative and qualitative evaluation of the degree of schema-agnosticism of different approaches. Since addressing schema-agnostic queries depends on semantic approaches which need to cope with different types of semantic matching between query and dataset, the test collection explores different categories of semantic phenomena involved in matching schema-agnostic queries. Each query is categorized according to its semantic mapping types. This categorization supports a fine-grained qualitative and quantitative interpretation of the evaluation results.

4 Evaluation Description

The challenge provides a gold standard with the correct answers for each schema-agnostic query. Queries are issued over DBpedia 3.10. A training dataset consisting of 30 queries is made available to the participants. In order to participate in the challenge, each system submitted its results in the format proposed by the challenge. The organizers then automatically calculated precision, recall and mean reciprocal rank for each query, together with the associated averages. Participants are also encouraged to report their query execution time, dataset semantic enrichment time, and user-interaction and disambiguation effort.
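The measures named above can be sketched as follows. This is a minimal illustration and not the organizers' evaluation script: the function names, the dictionary-based input layout, and the conventions for empty answer lists and unanswered queries are all assumptions made for the example.

```python
from typing import Dict, Hashable, Sequence, Set, Tuple


def precision_recall(returned: Sequence[Hashable], gold: Set[Hashable]) -> Tuple[float, float]:
    """Set-based precision and recall of one answer list against the gold answers."""
    returned_set = set(returned)
    hits = len(returned_set & gold)
    # Assumed convention: an empty answer list (or empty gold set) scores 0.
    precision = hits / len(returned_set) if returned_set else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def reciprocal_rank(returned: Sequence[Hashable], gold: Set[Hashable]) -> float:
    """1 / rank of the first correct answer, or 0 if none is correct."""
    for rank, answer in enumerate(returned, start=1):
        if answer in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, Sequence[Hashable]],
             gold_standard: Dict[str, Set[Hashable]]) -> dict:
    """Per-query scores plus their averages over all gold-standard queries."""
    per_query = {}
    for query_id, gold in gold_standard.items():
        # Assumed convention: a query the system did not answer counts as empty.
        returned = runs.get(query_id, [])
        p, r = precision_recall(returned, gold)
        per_query[query_id] = {"precision": p, "recall": r,
                               "rr": reciprocal_rank(returned, gold)}
    n = len(per_query)
    average = {m: sum(s[m] for s in per_query.values()) / n
               for m in ("precision", "recall", "rr")} if n else {}
    return {"per_query": per_query, "average": average}


# Toy data (hypothetical query ids and answers, not taken from the test set).
toy_gold = {"q1": {"a", "b"}, "q2": {"c"}}
toy_runs = {"q1": ["x", "a"]}
toy_report = evaluate(toy_runs, toy_gold)
```

The average of the per-query reciprocal ranks (`average["rr"]`) is the mean reciprocal rank over the query set.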

The challenge consists of addressing a set of 103 schema-agnostic queries over DBpedia 2014 (Footnote 1) and associated YAGO classes (Footnote 2). The training and test sets are available at the location given in Footnote 3.

The schema-agnostic queries were derived from the natural language questions in the Question Answering over Linked Data (QALD-4) test collection [6]. These natural language questions were manually converted to schema-agnostic queries, preserving their vocabulary and using a consistent set of conversion guidelines.

Two categories of schema-agnostic queries (tasks) are available: schema-agnostic SPARQL query and schema-agnostic keyword query. Participating systems can compete in one or in both categories.

4.1 Schema-Agnostic SPARQL Query

This task consists of schema-agnostic queries following the syntax of the SPARQL standard without namespace prefixes. The syntax and semantics of operators are maintained, while the terminology differs from the dataset vocabulary.

Example I:

figure a

which maps to the following SPARQL query in the dataset vocabulary:

figure b

Example II:

figure c

which maps to the following SPARQL query in the dataset vocabulary:

figure d
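To make the relation between a schema-agnostic SPARQL query and its dataset-vocabulary counterpart concrete, the sketch below rewrites query terms using a hand-written alignment table. It is only an illustration: the query, the `ALIGNMENTS` table and the `rewrite` function are hypothetical, are not a reproduction of the challenge figures or gold standard, and a real system would have to derive the alignments automatically rather than look them up.

```python
import re

# Hypothetical alignment table from user terminology to DBpedia URIs.
# The entries are illustrative assumptions, not gold-standard alignments.
ALIGNMENTS = {
    "BillClinton": "<http://dbpedia.org/resource/Bill_Clinton>",
    "hasDaughter": "<http://dbpedia.org/ontology/child>",
    "marriedTo": "<http://dbpedia.org/ontology/spouse>",
}

# An identifier that is not part of a SPARQL variable such as ?x.
TERM = re.compile(r"(?<![?\w])[A-Za-z_]\w*")


def rewrite(query: str, alignments: dict) -> str:
    """Replace every aligned term; keywords and variables are left untouched."""
    return TERM.sub(lambda m: alignments.get(m.group(0), m.group(0)), query)


# Prefix-free query in the user's own vocabulary (hypothetical example).
schema_agnostic = "SELECT ?y WHERE { BillClinton hasDaughter ?x . ?x marriedTo ?y . }"
dataset_query = rewrite(schema_agnostic, ALIGNMENTS)
```

The operators and triple structure are preserved and only the terms change, which is the property this task relies on. Queries whose structure differs between user and dataset would need more than term substitution.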

4.2 Schema-Agnostic Keyword Query

This task consists of schema-agnostic queries expressed as keyword queries. In this case the syntax and semantics of operators differ from those of SPARQL.

Example I: “Bill Clinton daughter married to”

Example II: “Books by William Goldman with more than 300 pages”

4.3 Returned Result

In order to participate in the challenge, systems submitted the results in the format proposed by the challenge. For queries which return a list of URIs (uri1, uri2) or values:

figure e

For queries of the type YES/NO:

figure f

Teams had 24 h after receiving the test query set to return their results.

5 Schema-Agnostic Mappings

In the test set, each schema-agnostic query contains a classification of the query-data alignments. For example:

figure g

In the alignment below, the schema-agnostic query term ‘calledAfter’ is associated with a predicate ‘(p)’ data type and maps to the predicate ‘shipNamesake’ in the dataset; the relationship between the two terms is classified as semantically related.

figure h

Alignments are classified into seven categories:

  • semantically related: If a query term and its associated database entity are semantically related. Example: languageOf in the query maps to spokenIn in the dataset.

  • semantically similar: If a query term and its associated database entity are semantically similar, i.e. they stand in a taxonomic relation. Example: wifeOf in the query maps to spouseOf in the dataset.

  • synonym: If a query term and its associated database entity are synonyms. Example: startDate in the query maps to beginDate in the dataset.

  • string similar: If a query term has a string similarity relationship to its associated database entity. Example: startDate in the query maps to beginDate in the dataset.

  • substring: If a query term is a substring of its associated database entity or vice-versa. Example: wifeOf in the query maps to wife in the dataset.

  • functional content: Consists of the mapping of function words (e.g. prepositions) in the query to other function words or content words in the dataset entity. Example: in in the query maps to location in the dataset.

  • abbreviation: If a query term is an abbreviation of its associated database entity or vice-versa. Example: extinct in the query maps to ‘EX’ in the dataset.
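The two purely lexical categories above (substring and string similar) can be approximated with simple string tests, as in the following sketch. The function names, the case-insensitive comparison, the use of Python's `difflib.SequenceMatcher` as the similarity measure and the 0.8 threshold are assumptions chosen for illustration; the challenge does not prescribe how a category is detected, and the semantic categories would need lexical or distributional resources that are outside this sketch.

```python
from difflib import SequenceMatcher
from typing import List


def is_substring(query_term: str, dataset_term: str) -> bool:
    """True if one term strictly contains the other (case-insensitive)."""
    q, d = query_term.lower(), dataset_term.lower()
    return q != d and (q in d or d in q)


def string_similarity(query_term: str, dataset_term: str) -> float:
    """Similarity ratio in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, query_term.lower(), dataset_term.lower()).ratio()


def lexical_categories(query_term: str, dataset_term: str,
                       threshold: float = 0.8) -> List[str]:
    """Lexical alignment categories that hold for a term pair.

    The threshold is an arbitrary illustrative value, not a challenge parameter.
    """
    categories = []
    if is_substring(query_term, dataset_term):
        categories.append("substring")
    if string_similarity(query_term, dataset_term) >= threshold:
        categories.append("string similar")
    return categories
```

A pair can satisfy more than one test at once, which is consistent with alignments being compositions of categories. For instance, `lexical_categories("wifeOf", "wife")` includes `"substring"`, while the semantically related pair `languageOf`/`spokenIn` matches neither lexical test.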

Other examples of alignments (including compositions of different categories) include:

figure i

Alignment terms are classified according to their data model types, with regard to the position within the triple (subject (s), predicate (p), object (o)) and entity type (instance (i), class (c), property (p), value (v)).
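One possible in-memory representation of a classified alignment is sketched below. The class and field names are hypothetical and do not reflect the file format used in the test set; only the triple positions (s, p, o), the entity types (i, c, p, v) and the example term pairs are taken from the text.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class Position(Enum):
    SUBJECT = "s"
    PREDICATE = "p"
    OBJECT = "o"


class EntityType(Enum):
    INSTANCE = "i"
    CLASS = "c"
    PROPERTY = "p"
    VALUE = "v"


@dataclass(frozen=True)
class Alignment:
    query_term: str
    dataset_term: str
    position: Position
    entity_type: EntityType
    categories: Tuple[str, ...]  # one or more alignment categories


def category_counts(alignments: Iterable[Alignment]) -> Counter:
    """Count how often each alignment category occurs in a set of alignments."""
    counts = Counter()
    for alignment in alignments:
        counts.update(alignment.categories)
    return counts


# Term pairs from the text; the position and entity type of the second
# pair are assumptions made for the example.
examples = [
    Alignment("calledAfter", "shipNamesake", Position.PREDICATE,
              EntityType.PROPERTY, ("semantically related",)),
    Alignment("wifeOf", "wife", Position.PREDICATE,
              EntityType.PROPERTY, ("substring",)),
]
counts = category_counts(examples)
```

Counting categories over the queries a system answers correctly is one way to obtain the fine-grained, per-category view of results that the categorization is meant to support.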

The alignment classifications are a simplification of the schema-agnostic alignments described in [1].

6 Results

Just one system competed officially in the SAQ-2015 Semantic Web Challenge: the UMBC_Ebiquity-SFQ system from the University of Maryland Baltimore County (Syed et al. [3]).

The results are described in Table 1:

Table 1. Evaluation of the participating system for the SAQ-2015 challenge.

7 Summary

The ability to abstract users from the specifics of the representation of the data, including its vocabulary and structural relations, is a fundamental functionality for large-scale and heterogeneous data. The Schema-agnostic Queries Semantic Web Challenge (SAQ-2015) aims at providing a test collection for supporting the development of schema-agnostic query mechanisms, i.e. query approaches which support automatically crossing the semantic gap between users and the data. The test collection provides a categorized set of schema-agnostic queries, covering a range of different alignments from string variations to different types of semantic relations. The performance of the participating system indicates that state-of-the-art systems are able to provide an initial solution for the problem. However, the initial results show that schema-agnostic queries are still a challenging problem and that there is room for major improvements.