1 Introduction

Much research has been done to combine the fields of Databases and Natural Language Processing in order to provide natural language interfaces to database systems [22]. While many works focus on the problem of deriving a structured query from a given natural language question or a set of keywords [10, 21, 27, 30], the problem of query verbalization – translating a structured query into natural language – is less explored. In this work we describe our approach to verbalizing SPARQL queries in order to create natural language expressions that are readable and understandable by the everyday user.

When a system generates SPARQL queries for a given natural language question or a set of keywords, the verbalized form of the generated query is helpful for users, since it allows them to check whether the intended question has been posed against the queried knowledge base and, if the query is executed and results are presented, to understand how the results have been retrieved. Therefore, verbalization of SPARQL queries may improve the user experience of any SPARQL query generating system, such as natural language-based question answering systems or keyword-based search systems.

In this paper we describe the current state of our SPARTIQULATION system,Footnote 1 which allows verbalization of a subset of SPARQL 1.1 SELECT queries in English.

The remainder of this paper is structured as follows. Section 2 gives an overview of the query verbalization approach in terms of the system architecture and the tasks that it performs. Section 3 presents the elements of our approach, Sect. 4 revisits existing work, and in Sect. 5 conclusions are drawn and an outlook is provided.

2 Query Verbalization Approach

2.1 Introduction

Our approach is inspired by the pipeline architecture for natural language generation (NLG) systems and the set of seven tasks performed by such systems as introduced by Reiter and Dale [19]. The input to such a system can be described by a four-tuple \((k, c, u, d)\) – where \(k\) is a knowledge source (not to be confused with the knowledge base a query is executed against), \(c\) is the communicative goal, \(u\) is the user model, and \(d\) is the discourse history. Since we neither perform user-specific verbalization nor operate in a dialog-based environment, we omit both the user model and the discourse history. The communicative goal is to communicate the meaning of a given SPARQL query \(q\). This goal can, however, be reached in multiple ways. Three basic types of linguistic expressions can be used: (i) a statement that describes the search results, where this description is based on the query only and not on the actual results returned by a SPARQL endpoint (e.g. Bavarian entertainers and where they are born), (ii) a question about the existence of knowledge of a specified or unspecified agent (e.g. Which Bavarian entertainers are known and where are they born?), and (iii) a command (e.g. Show me Bavarian entertainers and where they are born). Thus, the communicative goal can be reached in three modes: statement verbalization, question verbalization, or command verbalization. Since the only communicative goal is to communicate the meaning of a query to a user, and since both the user model and the discourse history are omitted, the input to our system reduces to a tuple \((k, m)\), where \(k\) is the SPARQL query and \(m \in \lbrace statement, question, command \rbrace \) is the mode.

2.2 Components and Tasks

In this section we present our approach along the seven tasks involved in NLG according to Reiter and Dale [19]. This work is the first step towards the verbalization of SPARQL queries. So far we have focused on document structuring, but not on lexicalization, aggregation, referring expression generation, linguistic realization, and structure realization. Note that the modes in which a communicative goal can be reached are relevant to the linguistic realization task only.

The pipeline architecture is depicted in Fig. 1. Within the Document Planner, the content determination process creates messages and the document structuring process combines them into a document plan (DP), which is the output of this component and the input to the Microplanner component. Within the Microplanner, the lexicalization, referring expression generation, and aggregation processes take place, resulting in a text specification (TS) made up of phrase specifications. The Surface Realizer then uses this text specification to create the output text.
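The data flow through these three components can be sketched as follows. This is an illustrative skeleton only: the function names and the dictionary-based representation of plans and text specifications are our assumptions, not the actual SPARTIQULATION implementation.

```python
# Illustrative sketch of the three-stage NLG pipeline from Fig. 1.
# The concrete data structures are placeholders (our assumption).

def document_planner(query: str, mode: str) -> dict:
    # Content determination + document structuring: produce a document
    # plan holding messages and the requested verbalization mode.
    return {"mode": mode, "messages": [f"message for: {query}"]}

def microplanner(document_plan: dict) -> dict:
    # Lexicalization, referring expression generation, aggregation:
    # turn messages into phrase specifications (placeholder transform).
    phrases = [m.upper() for m in document_plan["messages"]]
    return {"mode": document_plan["mode"], "phrases": phrases}

def surface_realizer(text_spec: dict) -> str:
    # Linguistic + structure realization: produce the output text.
    return " ".join(text_spec["phrases"])

def verbalize(query: str, mode: str = "statement") -> str:
    # The full pipeline chains the three components.
    return surface_realizer(microplanner(document_planner(query, mode)))
```

The point of the pipeline shape is that each stage consumes exactly the output of the previous one, so the stages can be developed and replaced independently.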

Fig. 1. Pipeline architecture of our NLG system

Content Determination is the task of deciding which information to communicate in the text. In the current implementation we decided not to leave this decision to the system. What is communicated is the meaning of the input query, without communicating which vocabularies are used to express the query. For example, if title occurs in the verbalization and is derived from the label of a property, then it is hidden from the user whether this label has been derived from http://purl.org/dc/elements/1.1/title or http://purl.org/rss/1.0/title.

Document Structuring is the task of constructing independently verbalizable messages from the input query and of deciding on their order and structure. These messages represent information such as that a variable is selected, the class to which the entities selected by the query belong, or the number to which the result set is limited. The output of this task is a document plan. Our approach to document structuring consists of the following elements:

  1. Query graph representation

  2. Main entity identification

  3. Query graph transformation

  4. Message types

  5. Document plan.

These are the main contributions of this work. We continue this section with an introduction of the remaining tasks; in Sect. 3, each element of our approach is presented in detail.

Lexicalization is the task of deciding which specific words to use for expressing the content. For each entity we dereference its URI and, if RDF data is returned, check whether an English label is provided using one of the \(36\) labeling properties defined in [6]. Otherwise, we derive a label from the URI’s local name. For properties, the \(7\) patterns introduced by Hewlett et al. [11] are used. For example, Hewlett et al. provide the following pattern:

  • (is) VP P

  • Examples: producedBy, isMadeFrom

  • Expansions: X is produced by Y, X is made from Y.

The local name producedBy of a property ex:producedBy is expanded to produced By and its constituents are part-of-speech tagged. The expansion rule given for this pattern declares that a triple ex:X ex:producedBy ex:Y can be verbalized as X is produced by Y.
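A simplified sketch of this local-name expansion is shown below. The regular expression and the fallback insertion of the copula is are our illustrative simplifications; the actual approach additionally part-of-speech tags the constituents before applying the patterns of Hewlett et al.

```python
import re

def expand_local_name(local_name: str) -> str:
    # Split a camelCase local name such as 'producedBy' into its
    # constituent words ('produced By').
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", local_name)
    return " ".join(words)

def verbalize_triple(subject: str, local_name: str, obj: str) -> str:
    # Pattern '(is) VP P': expand 'X producedBy Y' to 'X is produced by Y'.
    # If the local name does not already start with 'is', insert it.
    expanded = expand_local_name(local_name).lower()
    if not expanded.startswith("is "):
        expanded = "is " + expanded
    return f"{subject} {expanded} {obj}"
```

For example, `verbalize_triple("X", "producedBy", "Y")` yields "X is produced by Y", and `verbalize_triple("X", "isMadeFrom", "Y")` yields "X is made from Y", matching the expansions listed above.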

The main entityFootnote 2 is verbalized as things. If a constraint for the class of the main entity such as ?m rdf:type yago:AfricanCountries is given, then it can be verbalized as African countries.Footnote 3 If the query is limited to a single result using LIMIT 1 and no sort order is defined using ORDER BY, then it can be verbalized as An African country. Otherwise, if a sort order is defined such as ORDER BY DESC(?population), then it can be verbalized as The African country as in The African country with the highest population. Other variables are also verbalized as things unless a type is either explicitly given using rdf:type or implicitly given using rdfs:domain or rdfs:range. For example, this information is taken into account when verbalizing the query SELECT ?states ?uri WHERE { ?states dbo:capital ?uri .} as Populated places and their capitals. Here, the domain of the property dbo:capital is defined as dbpedia-owl:PopulatedPlace.
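The choice of surface form for the main entity can be sketched as a small decision function. The singularization rule below is a naive stand-in sufficient for the running example, not the system's actual morphology handling.

```python
def singularize(noun_phrase: str) -> str:
    # Very naive singularization, sufficient for the running example only.
    if noun_phrase.endswith("ies"):
        return noun_phrase[:-3] + "y"
    if noun_phrase.endswith("s"):
        return noun_phrase[:-1]
    return noun_phrase

def verbalize_main_entity(class_label=None, limit=None, order_by=None):
    # Follow the rules described above: 'things' without a class
    # constraint, the plural class label otherwise, and a singular form
    # with 'An'/'The' when the query is limited to a single result.
    if class_label is None:
        return "things"
    if limit == 1:
        singular = singularize(class_label)
        if order_by is not None:
            return "The " + singular          # e.g. 'The African country'
        first = singular[0].lower()
        return ("An " if first in "aeiou" else "A ") + singular
    return class_label                        # plural by default
```

For instance, with a class constraint, `LIMIT 1`, and an `ORDER BY` clause, the function yields "The African country"; without the sort order it yields "An African country".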

Referring Expression Generation is the task of deciding how to refer to an entity that is already introduced. Consider the following two example verbalizations:

  1. Albums of things named Last Christmas and where available their labels.

  2. Albums of things named Last Christmas and where available the labels of these albums.

At the beginning of the verbalizations the entities albums and things are introduced; at the end, labels are requested. In the first verbalization it is not clear whether the labels of the albums or the labels of the things are requested, whereas in the second verbalization it is clear that the labels of the albums are requested.

Aggregation is the task of deciding how to map structures created within the Document Planner onto linguistic structures such as sentences and paragraphs. For example, without aggregation a query such as SELECT ?m WHERE { ?m dbo:starring res:Julia_Roberts . ?m dbo:starring res:Richard_Gere . } would be verbalized as Things that are starring Julia Roberts and that are starring Richard Gere. With aggregation the result is more concise: Things that are starring Julia Roberts as well as Richard Gere.
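One way to sketch this kind of aggregation is to group sentence fragments by subject and property and join the objects. This is our illustrative simplification of the aggregation step, not the system's actual mechanism.

```python
from collections import defaultdict

def aggregate(fragments):
    # Merge fragments that share subject and predicate, joining their
    # objects with 'as well as', e.g. the two 'are starring' fragments
    # from the example above collapse into one relative clause.
    grouped = defaultdict(list)
    for subject, predicate, obj in fragments:
        grouped[(subject, predicate)].append(obj)
    sentences = []
    for (subject, predicate), objects in grouped.items():
        objs = " as well as ".join(objects)
        sentences.append(f"{subject} that {predicate} {objs}")
    return ". ".join(sentences)
```

Applied to the two fragments of the example query, this yields "Things that are starring Julia Roberts as well as Richard Gere".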

Linguistic Realization is the task of converting abstract representations of sentences into real text. Here the modes statement, question, and command are taken into account. As introduced in the next section, chunks of content of a SPARQL query are represented as messages given the list of message types (MT) from Fig. 5. For each of the message types (1)–(9) a rule is invoked that produces a sentence fragment. For example, for the MT \(MRVR_lL\) – which is an instance of the MT \(M(RV)^{{}*{}}R_lL\) – the rule article(lex(prop1)) + lex(prop1) + L produces, for the two triples ?uri dbpedia:producer ?producer and ?producer rdfs:label "Hal Roach", the text a producer named "Hal Roach". The function article chooses an appropriate article (a or an) depending on the lexicalization lex(prop1) of the property.
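Such a realization rule might be sketched as follows. The letter-based a/an heuristic is our simplification; a full implementation would have to consider vowel sounds rather than letters (compare "an hour", "a unicorn").

```python
def article(noun_phrase: str) -> str:
    # Naive 'a'/'an' choice based on the first letter only.
    return "an" if noun_phrase[0].lower() in "aeiou" else "a"

def realize_mrvrl(prop_lexicalization: str, label: str) -> str:
    # Sketch of the rule article(lex(prop1)) + lex(prop1) + L for the
    # message type MRVR_lL described above.
    return f'{article(prop_lexicalization)} {prop_lexicalization} named "{label}"'
```

For the two example triples, `realize_mrvrl("producer", "Hal Roach")` produces the fragment `a producer named "Hal Roach"`.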

Structure Realization is the task of adding markup such as HTML code to the generated text so that it can be interpreted by the presentation system, such as a web browser. Bontcheva [2] points out that hypertext usability studies [18] have shown that formatting is important since it improves readability. Indenting complex verbalizations, adding bullet points and changing the font size can help to communicate the meaning of a query.

3 Document Structuring

The elements of our approach can be summarized as follows. We transform textual SPARQL SELECT queries into a graphical representation – the query graph – which is suitable for performing traversal and transformation operations. Within a query graph we identify the main entity, which is the variable that is rendered as the subject of a verbalization. After the main entity is identified, the graph is transformed into a graph in which the main entity is the root. Then the graph is split into independently verbalizable parts called messages. We define a set of message types that allow us to represent a query graph using messages; message types are classified according to their role within the verbalization. Finally, we present the document plan, which orders the messages according to their classes and is the output of the Document Planner – the first component in our NLG pipeline.

Some observations in this section, namely regarding the main entity identification in Sect. 3.2 and the message types in Sect. 3.4, are based on a training set. This training set is derived from a corpus of SPARQL queries consisting of datasets from the QALD-1 challengeFootnote 4 and the ILD2012 challenge.Footnote 5 The full dataset contains \(263\) Footnote 6 SPARQL SELECT queries and associated manually created questions. In order to derive a training set we used 80 % of each dataset as training data – in total \(209\) SELECT queries.Footnote 7 Since our approach cannot yet handle all features of the SPARQL 1.1 standard, we had to exclude some queries: within this training set of \(209\) queries we excluded those using the UNION feature (\(20\) queries) and those that were not parsable (\(1\) query). The remaining subset – 188 queries in total – thus covers 90 % of the queries within the training set.

3.1 Query Graph Representation

We parse a SPARQL query into a query graph, since this allows for easier manipulation of the query compared to its textual representation. Each subject \(S\) and object \(O\) of a triple pattern \(<S, P, O>\) within the query is represented by a node in the query graph, and the nodes are connected by edges labeled with \(P\). Since \(<S, P, O>\) is a triple pattern and not an RDF triple, each element can be a variable. For every variable that appears within the query in subject or object position exactly one node is created, whereas for each non-variable subject or object a separate node is created per occurrence. Therefore multiple non-variable nodes with the same label may exist.
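These construction rules can be sketched as follows, with a plain node list and edge list standing in for whatever graph structure the actual implementation uses.

```python
def build_query_graph(triple_patterns):
    # Build a query graph from (subject, property, object) triple
    # patterns. Variables (names starting with '?') share one node;
    # non-variable terms get a fresh node per occurrence, so several
    # nodes may carry the same label.
    nodes, edges = [], []
    variable_nodes = {}

    def node_for(term):
        if term.startswith("?"):          # variable: reuse its node
            if term not in variable_nodes:
                variable_nodes[term] = len(nodes)
                nodes.append(term)
            return variable_nodes[term]
        nodes.append(term)                # resource/literal: new node
        return len(nodes) - 1

    for s, p, o in triple_patterns:
        edges.append((node_for(s), p, node_for(o)))
    return nodes, edges
```

For two triple patterns sharing the variable ?capital, the shared variable is represented by a single node, while repeated resources would each receive their own node.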

As an example, consider the SPARQL query in textual representation in Listing 1 and the visual representation of the query graph in Fig. 2. In this visual representation,Footnote 8 nodes are filled if they represent resources and unfilled if they represent variables. Nodes are labeled with the name of the resource, variable, literal or blank node, respectively. Labels of variables begin with a question mark; labels of variables that appear in the SELECT clause of a query are underlined; literal values are quoted. Edges are labeled with the name of the property and point from subject to object. Filters are attached to their respective variable(s), and parts of the graph that appear only within an OPTIONAL clause are marked as such.

Listing 1. Example SPARQL query (listing not reproduced)

Fig. 2. Example query graph

3.2 Main Entity Identification

We perform a transformation of the query graph, since it reduces the number of message types necessary to represent the information contained in the query graph, thus simplifying the verbalization process. This transformation is based on the observation that in most queries one variable can be identified that is rendered as the subject of a sentence. For example, when querying for mountains (bound to variable ?mountain) and their elevations (bound to variable ?elevation), ?mountain is verbalized as the subject of the verbalization mountains and their elevations. We refer to this variable as the main entity of a query. However, for some queries no such element exists. Consider for example the query SELECT * WHERE { ?a dbpedia:marriedTo ?b .}. Here a tuple is selected, and in a possible verbalization Tuples of married entities Footnote 9 no single variable is rendered as the subject. In order to identify the main entity we define Algorithm 1, which applies the ordered list of rules shown in Fig. 3. These rules propose the exclusion of members from a candidate set; we derived them by inspecting queries within the training set that have multiple candidates. The candidate set \(C\) for a given query is initialized with the variables that appear in the SELECT clauseFootnote 10 and the algorithm eliminates candidates step by step. \(Q\) denotes the set of triples within the WHERE clause of a query, \(R_t\) is the property rdf:type and \(R_l\) is a labeling property from the set of \(36\) labeling properties identified by [6]. The application of an exclusion rule \(R_i\) to a candidate set \(C\), denoted by \(R_i(C)\), results in the removal of the set \(E\) proposed by the exclusion rule.

We identified the ordered list of exclusion rules shown in Fig. 3. The numbers show how often each rule succeeded in reducing the candidate set for the \(188\) queries within our training set. In some cases (\(61\), \(32.45\,\%\)) no rule was applied since the candidate set contained only a single variable. If, given these rules, the algorithm does not manage to reduce the candidate set to a single variable (\(21\), 11.17 %), the first variable in lexicographic order is selected.
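The candidate-elimination loop of Algorithm 1 can be sketched like this. The rule shown is a hypothetical stand-in for Rule 1 (exclusion of variables that appear only within an OPTIONAL clause); the real rule set is the one given in Fig. 3.

```python
def identify_main_entity(select_vars, exclusion_rules):
    # Candidate elimination in the spirit of Algorithm 1: each rule maps
    # the candidate set to the subset it proposes to remove; rules are
    # applied in order until a single candidate remains. Falls back to
    # lexicographic order, as described above.
    candidates = set(select_vars)
    for rule in exclusion_rules:
        if len(candidates) == 1:
            break
        excluded = rule(candidates)
        if excluded and excluded < candidates:   # never empty the set
            candidates -= excluded
    return min(candidates)                       # lexicographic tie-break

# Hypothetical stand-in for Rule 1: exclude variables that appear only
# within an OPTIONAL clause (here hard-coded for the running example).
optional_only = {"string"}
rule1 = lambda candidates: candidates & optional_only
```

For the running example, `identify_main_entity({"uri", "string"}, [rule1])` returns "uri".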

Fig. 3. Exclusion rules

As an example, consider the SPARQL query in Listing 1, which is visually presented in Fig. 2. The candidate set is initialized as \(\lbrace uri, string\rbrace \). Rule 1 proposes to remove the variable string since it appears only within an OPTIONAL clause. Since the candidate set is thereby reduced to \(\lbrace uri\rbrace \), containing a single entity, this entity is the main entity.

Algorithm 1 (pseudocode not reproduced)

3.3 Query Graph Transformation

Algorithm 2 transforms a query graph into a graph in which the main entity is the root and all edges point away from the root. To this end, the algorithm maintains three sets of edges: edges that have already been processed (\(P\)), edges that need to be followed (\(F\)), and edges that need to be transformed (\(T\)), i.e. reversed. An edge is reversed by exchanging subject and object and by marking the property (\(p\)) as reversed (\(p^r\)).
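Assuming the query graph is connected and, once rooted, tree-shaped, the transformation can be sketched as a breadth-first traversal. This is our re-implementation of the idea, not the authors' Algorithm 2 verbatim; the `^r` suffix stands in for the reversal marker \(p^r\).

```python
from collections import deque

def root_at_main_entity(edges, main_entity):
    # Traverse edges breadth-first from the main entity. Edges already
    # pointing away from the visited part are kept; edges pointing
    # towards it are reversed, with the property marked as reversed.
    result, visited, queue = [], {main_entity}, deque([main_entity])
    remaining = list(edges)
    while queue:
        node = queue.popleft()
        for s, p, o in list(remaining):
            if s == node:                         # already points away
                result.append((s, p, o))
            elif o == node:                       # reverse the edge
                result.append((o, p + "^r", s))
            else:
                continue
            remaining.remove((s, p, o))
            other = o if s == node else s
            if other not in visited:
                visited.add(other)
                queue.append(other)
    return result
```

For the edge ?states dbo:capital ?uri with main entity ?uri, the sketch yields the reversed edge ?uri dbo:capital^r ?states, mirroring the reversal shown in Fig. 4.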

Algorithm 2 (pseudocode not reproduced)

The query graph shown in Fig. 2 is transformed into the query graph shown in Fig. 4 where the main entity – the variable uri – is highlighted. Compared to the graph before the transformation, the edge dbo:capital was reversed. Therefore this edge now points away from the main entity and is marked as being reversed by the minus in superscript.

Fig. 4. Example query graph after transformation

3.4 Message Types

We identified the set of \(14\) message types (MT), shown in Fig. 5, that allows us to represent the \(209\) queries from our training set. Here, \(M(RV)^{{}*{}}\) denotes a path beginning at the main entity via an arbitrary number of property-variable pairs. The first \(9\) MTs represent directed paths in the query graph: each directed path that begins at the main entity is represented by a message. \(R_l\) denotes a labeling property and \(R_t\) the property rdf:type. The MTs \(ORDERBY\), \(LIMIT\), \(OFFSET\) and \(HAVING\) represent the respective SPARQL features.

As an example, the SPARQL query in Listing 1 is represented using the \(7\) messages shown in Fig. 6. Note that due to the graph transformation the property dbo:capital is reversed, which is denoted by REV: 1 in message \(2\). This query can be verbalized as: English names of African countries having capitals which have a population of less than 1000000 and the English names of these capitals. Note that the plural form capitals instead of capital is used by default, since no information is available that a country has exactly one capital. The filter for English labels is stored within message \(6\), which represents the variable string, as lang: en.
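The decomposition of the transformed query graph into one message per directed path can be sketched as follows, assuming the rooted graph is tree-shaped (the property name ex:population below is a generic placeholder, not the property used in Listing 1).

```python
def paths_from_root(edges, root):
    # Enumerate maximal directed paths starting at the root (the main
    # entity). Each path corresponds to one path-shaped message of the
    # first nine message types described above.
    outgoing = {}
    for s, p, o in edges:
        outgoing.setdefault(s, []).append((p, o))
    paths = []

    def walk(node, path):
        if node not in outgoing:          # leaf: the path is maximal
            paths.append(path)
            return
        for p, o in outgoing[node]:
            walk(o, path + [(p, o)])

    walk(root, [])
    return paths
```

Each returned path is a list of (property, node) pairs; a path ending in a labeling property and a literal, for instance, would instantiate the MT \(M(RV)^{{}*{}}R_lL\).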

Fig. 5. Message types

Fig. 6. Messages representing the SPARQL query in Listing 1.

While this set of message types is sufficient for the given training set, which means that all queries can be represented using these message types, we extended this list with 7Footnote 11 more types in order to be prepared for queries such as SELECT ?s ?p ?o WHERE { ?s ?p ?o. } and SELECT ?p WHERE { ?s ?p ?o. } where instead of generating text, canned text is used, such as All triples in the database and Properties used in the database.

3.5 Document Plan

The document plan (DP), which is the output of the Document Planner and input to the Microplanner, contains the set of messages and defines their order. The verbalization consists of two parts. In the first part the main entity and its constraints are described, followed by a description of the requests (the variables besides the main entity that appear in the SELECT clause) and their constraints. In the second part, if available and not already communicated in the first part, the selection modifiers are verbalized. According to these \(3\) categories – abbreviated cons, req, and mod – we classify the message types (MT) as follows. The MTs \((1)\), \((2)\), \((4)\), \((6)\), \((7)\), and \((9)\) from Fig. 5 belong to the class cons, and the MTs \((3)\), \((5)\), and \((8)\) belong to the class req. MTs \((1)\), \((2)\), \((4)\), \((6)\), \((7)\) and \((9)\) may additionally belong to class req if they contain a variable besides the main entity that appears in the SELECT clause. MTs \((11)-(14)\) belong to the class mod. The VAR message is not classified since its only purpose is to store information about variables; variables are verbalized in the course of verbalizing other messages. For the example query in Listing 1, which is represented using the messages shown in Fig. 6, the messages M1, M2, and M3 are classified as cons, the message M1 is additionally classified as req, and no message is classified as mod.
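The classification rules above can be sketched as a small function. MT numbering follows Fig. 5; treating the unlisted VAR message as number 10 is our assumption.

```python
def classify_message(mt_number, has_selected_var_besides_main=False):
    # Assign a message to the classes cons/req/mod following the rules
    # described above. Returns the (possibly empty) set of classes.
    classes = set()
    if mt_number in {1, 2, 4, 6, 7, 9}:
        classes.add("cons")
        if has_selected_var_besides_main:
            classes.add("req")
    elif mt_number in {3, 5, 8}:
        classes.add("req")
    elif mt_number in range(11, 15):
        classes.add("mod")
    # MT (10), assumed here to be VAR, deliberately stays unclassified.
    return classes
```

The document plan then simply orders messages by class: cons first, then req, then mod.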

4 Related Work

While to the best of our knowledge no work has been published on the verbalization of SPARQL queries, related work comes from three areas: verbalization of RDF data [5, 15, 24, 25, 29], verbalization of OWL ontologies [1, 3, 4, 7–9, 11, 12, 14, 20, 23, 26, 28], and verbalization of SQL queries [13, 16, 17]. Although the first two fields provide techniques that we can apply to improve the lexicalization and aggregation tasks, such as the template-based approach presented in [5], the document structuring task, on which we focus here, is rarely explored. Compared to the SQL verbalization work by Minock [16, 17], which focuses on tuple relational queries, our problem of verbalizing SPARQL queries is different in that we strive for a generic approach that can be applied to any data source without being tied to a particular schema: in Minock's work, patterns need to be manually created to cover all possible combinations for each relation in the schema, whereas we define a set of message types that are schema-agnostic. Koutrika et al. [13] annotate query graphs with template labels and explore multiple graph traversal strategies. Moreover, they identify a main entity (the query subject), perform graph traversal starting from that entity, and distinguish between cons (subject qualifications) and req (information).

5 Conclusions and Outlook

For the task of verbalizing SPARQL queries we focused on a subset of the SPARQL 1.1 standard which covers 90 % of the queries in a corpus of 209 SPARQL SELECT queries. Evaluation will have to show how representative this corpus is of real-life queries and how good the verbalizations generated by our SPARTIQULATION system are. While in our architecture \(6\) tasks are needed to generate verbalizations, our main focus has been the task of document structuring, which we described in this work. In order to realize the full verbalization pipeline, the \(5\) other tasks need to be explored in future work. Since the current approach is mostly schema-agnostic – only terms from the RDF and RDFS vocabularies as well as a list of labeling properties from various vocabularies are considered – we believe that it is generic in the sense of being applicable to queries against RDF data sources using any vocabularies. However, in the future the task of lexicalization can be improved by taking schemas such as FOAF and OWL into account. FOAF is interesting since if an entity is a foaf:Person, it can be treated differently; for example, the person’s gender can be taken into account. OWL is interesting since if a property is known to be functional, then the singular form can be used instead of, as per default, the plural form.Footnote 12 Having message types designed for specific vocabularies allows the verbalization to be tailored to a specific use case and may lead to more concise verbalizations. In the current implementation, message types are hard-coded, thus limiting the flexibility of the approach. The possibility to load a set of message types into the system would allow the integration of automatically learned or application-specific message types.