1 Introduction

DBPedia [1, 2] is one of the central Linked Data resources and is of fundamental importance to the entire Linked Data ecosystem. DBPedia extracts structured information from Wikipedia (Footnote 1), the most popular collaboratively maintained encyclopedia on the web. A public DBPedia SPARQL endpoint (Footnote 2), representing its “core” data, is a large and heterogeneous resource with over 480 thousand classes and over 50 thousand properties, which makes it difficult to find and extract the relevant information. The existing means for DBPedia data querying and exploration involve textual SPARQL query formulation and some research prototypes that offer assisted query composition options, e.g., RDF Explorer [3]; these do not make effective use of the actual DBPedia schema information to support query creation by end-users.

There is a DBPedia ontology that consists of 769 classes and 1431 properties (as of July 2021); it can be fully or partially loaded into generic query environments, such as SPARKLIS [4] (based on natural language snippets), or Optique VQs [5, 6] or ViziQuer [7] (based on visual diagrammatic query presentation). The DBPedia ontology alone would, however, be insufficient for supporting the query building process, as it covers just a tiny fraction of the actual DBPedia data classes, and there are quite prominent classes and properties in the data set (e.g., the class foaf:Document, any class from the yago: namespace, or the property foaf:name) that are not present in the ontology.

We describe here services for assisting the composition of DBPedia data retrieval queries, running in real time and based on the full DBPedia data schema involving all its classes, all properties, and their relations (e.g., what properties are relevant for instances of what classes; both class-to-property and property-to-property relevance connections are considered). We apply the developed services to seeding and growing visual queries within the visual ViziQuer environment (cf. [7, 8]). The services can, however, also be made available for schema-based query code completion in different environments, including the ones for textual SPARQL query composition, e.g., YASGUI [9].

Due to the size of the data endpoint we pre-compute the class-to-property and property-to-property relevance mappings and then use the stored information to support the query creation. We limit pre-computation of the class-to-property mapping to sufficiently large classes, as most classes would have far fewer instances than connected properties (for smaller classes the on-the-fly completion approach is used).

The principal novel contributions of the paper are:

  • A method for auto-completing queries, based on the class-to-property and property-to-property connections, working over the actual DBPedia data schema in real time, and

  • A visual query environment for exploration and querying of a very large and heterogeneous dataset, such as DBPedia.

The paper’s supporting material, including a live server environment for visual queries over DBPedia, can be accessed from its support site http://viziquer.lumii.lv/dss/.

In what follows, Sect. 2 outlines the query completion task. The query completion solution architecture is described in Sect. 3. Section 4 describes the DBPedia schema extraction process to build up the data schema necessary for query completion. The visual query creation is described in Sect. 5. Section 6 concludes the paper.

2 Query Completion Task

A diagrammatic presentation of a query over RDF data is typically based on nodes and edges, where a node corresponds to a query variable or a resource (or a literal) and an edge, labelled by a property path, describes a link between the nodes. A UML-style query diagram (as in ViziQuer [7], Optique VQs [5] or LinDA [10]) would also provide an option (in some notations, a requirement) to specify the class information for a variable or a resource represented by the node. Furthermore, some links of the abstract query graph can be presented in the UML-style query notation as node attributes.

The presence of class information for a variable or a resource in a query, encouraged by the UML-style query presentation, can improve the query readability. Still, this does not preclude queries that have nodes with an empty class specification (cf. [8]).

Figure 1 shows example visual queries corresponding to some of the QALD-4 tasks (Footnotes 3 and 4), suitable for execution over the DBPedia SPARQL endpoint, in the ViziQuer notation (cf. [8] and [11] for the notation and tool explanation).

Fig. 1. Example visual queries. Each query is a connected graph with a main query node (orange round rectangle) and possibly linked connection classes. Each node corresponds to a variable (usually left implicit) or a resource and an optional class name. There can be selection and aggregation attributes in a node. The edges correspond to properties (paths) linking the node variables/resources. More advanced query constructs are described in [8].

From the auto-completion viewpoint a query can be viewed as a graph with nodes allowing entity specifications in the positions of classes and individuals and edges able to hold property names (Footnote 5).

The process of the visual query creation by an end-user starts with query initialization or query seeding and is followed by query expanding, or query growing (Footnote 6). Within each of these stages the query environment is expected to assist the end-user by offering the names from the entity vocabulary (involving classes, properties, possibly also individuals) that would make sense in the query position to be filled.

The simplest, context-free approach to entity name suggestion would provide the entities for positions in a query just by their type: a class, a data property, or an object property (or an individual). This approach can provide reasonable results if the user is ready to type in textual fragments of the entity name. The “most typical” names that can be offered to the user without any name fragment typing can still depend significantly on the context in which the entity is to be placed.

Another approach, followed e.g., by SPARKLIS [4] or RDF Explorer [3], would be to present only those extensions of a query that lead to a query with non-empty solutions (if taken together with the already existing query part). In the case of a large data endpoint, such as DBPedia, this would not be feasible, as even simple queries to the endpoint asking for all properties that are available for instances of a large class typically time out or have running times not suitable for on-the-fly execution.

We propose to take an in-between path by suggesting to the end-user the entity names that are compatible with some local fragment of the existing query (these are the entity names that make sense in their immediate context). We follow a complete approach in the sense that all names leading to an existing solution of the extended query need to be included in the suggestion set (possibly after the name fragment entry); names not leading to a solution can, however, sometimes be admitted as well.

In a schema-based query environment the main context element for a property name suggestion would be a class name. However, suggesting a class name in the context of a property, and suggesting a connected property in the context of an existing property, are also important: they support the property-centered modeling style and enable efficient auto-completion within a textual property path expression entry (Footnote 7). After a property name within such an expression, only its “follower” properties are to be suggested, along with inverses of those properties whose triples can have a common object with the last property from the already entered part of the property path.

3 Query Completion Principles

In what follows, we describe the principles of the query completion that can be shown to efficiently serve both the query seeding and context-aware query growing tasks for a SPARQL endpoint, such as DBPedia core, with more than 480 thousand classes and more than 50 thousand properties, offering the text-search, filtering and prioritization options over the target linked entity sets. The query completion method has been implemented within a data shape server (Footnote 8), also called schema server, featuring the example environments over the DBPedia core and other data sets.

3.1 Entity Mapping Types

The query completion on the data schema level is based on class-to-property and property-to-property relations, observing separately the outgoing and incoming properties for a class (Footnote 9), and “following”, “common subject” and “common object” modes for the property-property relations. The relations shall be navigable in both directions, so:

  • The class-to-property (outgoing) relation can be used to compute the outgoing properties for a class, and source classes for a property,

  • The class-to-property (incoming) relation can be used to compute the incoming properties for a class, and target classes for a property,

  • The “following” property-property relation can be used for computing “followers” and “precursors” of a property.

For each of the mappings it is important to have the list of suggested entities ordered so that the “most relevant” entities can be suggested first. To implement a context-aware relevance measure, we compute the triple pattern counts for each pair in the class-to-property and property-to-property relations; for the class-to-property (outgoing) relation also the counts of “data triple” patterns and “object triple” patterns are computed separately. An entity X is higher in the list of entities corresponding to Y, if the triple pattern count for the pair (X,Y) is higher (Footnote 10).
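The bidirectional, count-ordered relations described above can be illustrated by a small in-memory sketch. It is only an illustration of the idea, not the schema server's actual storage (which is a PostgreSQL database); the class name `CountedRelation`, its methods, and all counts below are invented for the example.

```python
from collections import defaultdict


class CountedRelation:
    """Relation between two entity kinds with a triple pattern count per pair.

    Hypothetical helper: it is navigable in both directions and returns
    linked entities ordered by descending count.
    """

    def __init__(self):
        self._forward = defaultdict(dict)   # left entity -> {right entity: count}
        self._backward = defaultdict(dict)  # right entity -> {left entity: count}

    def add(self, left, right, count):
        self._forward[left][right] = count
        self._backward[right][left] = count

    @staticmethod
    def _ranked(links, limit):
        # Highest count first; ties are broken by name for a stable order.
        ordered = sorted(links.items(), key=lambda kv: (-kv[1], kv[0]))
        return [name for name, _ in ordered[:limit]]

    def right_of(self, left, limit=30):
        return self._ranked(self._forward.get(left, {}), limit)

    def left_of(self, right, limit=30):
        return self._ranked(self._backward.get(right, {}), limit)


# Class-to-property (outgoing) relation with made-up counts.
outgoing = CountedRelation()
outgoing.add("dbo:Language", "dbo:spokenIn", 5000)
outgoing.add("dbo:Language", "rdfs:label", 9000)
outgoing.add("dbo:Person", "rdfs:label", 1700000)

print(outgoing.right_of("dbo:Language"))  # outgoing properties of a class
print(outgoing.left_of("rdfs:label"))     # source classes of a property
```

The same structure could hold the incoming class-to-property relation and each of the three property-property modes, one instance per relation.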

For query fragments involving an individual, the means shall be available for retrieving all classes the individual belongs to, all properties for which the individual is the subject (the properties “outgoing” from an individual) and for which the individual is the object (the properties “incoming” into the individual). We expect that the data SPARQL endpoint shall be able to answer queries of this type efficiently.
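The three lookups for an individual map directly onto simple SPARQL triple patterns. The following sketch builds such queries as strings; the function name and the `limit` parameter are assumptions made for the illustration, and the exact queries used by the schema server may differ.

```python
def individual_context_queries(individual_uri, limit=100):
    """Build SPARQL queries for the classes and the outgoing/incoming
    properties of one individual (hypothetical helper)."""
    iri = f"<{individual_uri}>"
    return {
        # Classes the individual belongs to.
        "classes": f"SELECT DISTINCT ?c WHERE {{ {iri} a ?c }} LIMIT {limit}",
        # Properties for which the individual is the subject.
        "outgoing": f"SELECT DISTINCT ?p WHERE {{ {iri} ?p ?o }} LIMIT {limit}",
        # Properties for which the individual is the object.
        "incoming": f"SELECT DISTINCT ?p WHERE {{ ?s ?p {iri} }} LIMIT {limit}",
    }


queries = individual_context_queries("http://dbpedia.org/resource/Latvia")
for name, text in queries.items():
    print(name, "->", text)
```

Each query binds a single constant in the subject or object position, which is the kind of pattern a triple store can typically answer from its indexes.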

A further query completion task is to compute the individuals belonging to a class or available in the context of a given property (the class-to-individual, property-to-individual (subject) and property-to-individual (object) mappings). Since these mappings may return large sets of results for an argument class or property (e.g., around 1.7 million instances of the dbo:Person class in DBPedia core), a text search by entity name fragment within the results is necessary. Such a search can be reasonably run over the SPARQL endpoint for classes with fewer than 100 000 instances. For larger classes the suggested approach in query creation would be to start by filling the individual position first (using some index for the individual lookup, e.g., DBPedia Lookup (Footnote 11)).
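A minimal sketch of the size-dependent individual search follows. The 100 000 instance limit is taken from the text; the function names, the choice to match the fragment against the individual's URI, and the instance count used for dbo:Language are assumptions for the illustration.

```python
def escape_literal(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def individual_search(class_uri, instance_count, fragment,
                      limit=30, max_class_size=100_000):
    """Return a SPARQL text-search query for individuals of a class,
    or None when the class is too large to be searched on the endpoint
    (the caller would then fall back to an external lookup index)."""
    if instance_count >= max_class_size:
        return None
    needle = escape_literal(fragment.lower())
    return (
        "SELECT DISTINCT ?i WHERE { "
        f"?i a <{class_uri}> . "
        f'FILTER(CONTAINS(LCASE(STR(?i)), "{needle}")) '
        f"}} LIMIT {limit}"
    )


# Instance counts below are illustrative, except the order of magnitude
# for dbo:Person, which the text reports as around 1.7 million.
small_query = individual_search("http://dbpedia.org/ontology/Language", 10_000, "Latv")
large_query = individual_search("http://dbpedia.org/ontology/Person", 1_700_000, "Latv")
print(small_query)
print(large_query)
```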

The solution that we propose can also provide linked entity (property, class, individual) suggestion from several initial entities; this is achieved (logically) by computing the linked entity lists independently for each initial entity and then intersecting (Footnote 12).
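The logical intersection step can be sketched as follows. How the combined list is ordered is not stated here, so keeping the ranking of the first context is purely an assumption of this example, as are the function name and the sample property lists.

```python
def suggest_from_contexts(linked_lists):
    """Intersect independently computed suggestion lists.

    Each input list holds the entities linked to one initial entity,
    already ordered by relevance. The result keeps the order of the
    first list (an assumption made for this sketch).
    """
    if not linked_lists:
        return []
    common = set(linked_lists[0])
    for other in linked_lists[1:]:
        common &= set(other)
    return [entity for entity in linked_lists[0] if entity in common]


# Properties linked to a class and properties sharing a subject with
# an already chosen property (illustrative values).
for_class = ["rdfs:label", "dbo:spokenIn", "dbo:languageFamily"]
for_property = ["dbo:spokenIn", "foaf:name", "rdfs:label"]
combined = suggest_from_contexts([for_class, for_property])
print(combined)
```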

3.2 Partial Class-to-Property Mapping Storage

Modern database technologies would allow storing the full class-to-property and property-to-property relations and serving them to the query environment (Footnote 13). Still, this may be considered ineffective for a heterogeneous data endpoint, such as DBPedia, where for about 95% of classes the number of class instances is lower than the number of properties that characterize these instances. Out of 483 748 classes in the DBPedia core there have been 93 321 classes (around 19%) with just a single instance; in this case only a single link from the class to an instance is available in the data. To record the relation of such a singleton class to the properties, all properties that the class instance exhibits would need to be recorded. Since an instance may belong to several classes, such full storage of the class-to-property mapping is considered superfluous.

Therefore, we propose to pre-compute and store the class-to-property relation just for a fraction of classes (we call them “large” classes), and to rely on the information retrieval from the data endpoint itself, if the class size falls below a certain threshold (Footnote 14); regarding the property-property relation, our current proposal is to store it in full.

The partial storing of the class-to-property relation does not impede the possibility to compute the linked property lists for a given class, since for the classes that are not “large”, these lists can be efficiently served by the data SPARQL endpoint (Footnote 15).
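The hybrid lookup can be sketched as a simple dispatch on the class size. All names below are hypothetical, the threshold value of 500 is only a placeholder (the actual threshold is not given in this section), and `run_query` stands for whatever function sends a SPARQL query to the data endpoint; a stub is used here so that the example runs without network access.

```python
def outgoing_property_query(class_uri, limit=30):
    """SPARQL query ranking the outgoing properties of a class by triple count."""
    return (
        "SELECT ?p (COUNT(*) AS ?cnt) WHERE { "
        f"?x a <{class_uri}> . ?x ?p ?o "
        "} GROUP BY ?p ORDER BY DESC(?cnt) "
        f"LIMIT {limit}"
    )


def outgoing_properties(class_uri, instance_count, stored, run_query, threshold):
    """Serve "large" classes from the stored schema, others from the endpoint."""
    if instance_count >= threshold:
        return stored.get(class_uri, [])
    return run_query(outgoing_property_query(class_uri))


# Stub standing in for a real endpoint call; it records the queries it gets.
calls = []

def fake_endpoint(sparql):
    calls.append(sparql)
    return ["rdfs:label"]


stored = {"http://dbpedia.org/ontology/Person": ["rdfs:label", "dbo:birthDate"]}
large = outgoing_properties("http://dbpedia.org/ontology/Person", 1_700_000,
                            stored, fake_endpoint, threshold=500)
small = outgoing_properties("http://example.org/TinyClass", 3,
                            stored, fake_endpoint, threshold=500)
print(large, small, len(calls))
```

Only the small class triggers an endpoint query, and that query touches few instances, which is why it can be expected to run quickly.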

The property-to-class direction of such a “partially stored” class-to-property relation becomes trickier, as, given a property, only the large classes are those that can be directly retrieved from the data schema. In order not to lose any relevant class name suggestions, we assign (and pre-compute) to any “small” class its representing superclass from the “large” classes set (we take the smallest of the large superclasses for the class). There turn out to be 154 small classes without a large superclass in the DBPedia endpoint (in accordance with the identified superclass information); the property links are to be pre-computed for these classes, to achieve complete class name suggestion lists.
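Choosing the representing superclass for a small class can be sketched as below. The sketch assumes that the full (transitive) set of superclasses is already available for each class; the function name, the threshold value and the toy class hierarchy are invented for the example.

```python
def representative_class(cls, superclasses, instance_count, threshold):
    """Return the smallest "large" superclass of a small class,
    or None if the class has no large superclass."""
    large = [s for s in superclasses.get(cls, ())
             if instance_count[s] >= threshold]
    if not large:
        return None
    # Smallest by instance count; ties are broken by name.
    return min(large, key=lambda s: (instance_count[s], s))


# Toy data: ex:RareBird is small, its superclasses differ in size.
instance_count = {
    "ex:RareBird": 4,
    "ex:Bird": 12_000,
    "ex:Animal": 250_000,
    "ex:Orphan": 2,
}
superclasses = {
    "ex:RareBird": {"ex:Bird", "ex:Animal"},
    "ex:Orphan": set(),
}

print(representative_class("ex:RareBird", superclasses, instance_count, threshold=500))
print(representative_class("ex:Orphan", superclasses, instance_count, threshold=500))
```

A class for which the function returns None corresponds to the small classes without a large superclass, whose property links have to be pre-computed directly.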

The effect of suggested extra small classes in the context of a property can be analyzed. We note that in the DBPedia core out of top 5000 largest properties just 50 would have more small classes than the large ones within the source top 30 class UI window; in the case of target classes the number would be 190; so, the potentially non-exact class name suggestions are not going to have a major impact on the user interface (lowering the large class threshold would lower also the extra suggestion ratio even further).

3.3 Schema Server Implementation and Experiments

The schema server is implemented as a REST API, responding to GET inquiries for (i) the list of known ontologies, (ii) the list of namespaces, (iii) the list of classes (possibly with a text filter) and (iv) the list of properties (possibly with a text filter), and to POST inquiries for computing a list of classes, properties, and individuals in a context. The POST inquiries can specify the query limit, a text filter, lists of allowed or excluded namespaces, a result ordering expression and the data endpoint URL. Furthermore, there is a query context element, involving a class name (except for class name completion), an individual URI (except for individual completion) and two lists of properties: the incoming and the outgoing ones. In the case of property completion, the context information sets can be created for both the subject and the object positions.

The parameters of the schema server operations allow tuning the entity suggestion list selection and presentation to the end user. They are used in the visual tool user interface customization, in applying specific namespace conditions, or featuring basic and full lists of properties in a context, as illustrated in Sect. 5.
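To make the shape of such a POST inquiry more concrete, the following sketch assembles a request body for property completion. It only mirrors the parameters listed above; every field name, the default values and the function `build_property_request` are hypothetical and should not be read as the schema server's actual API.

```python
import json


def build_property_request(endpoint_url, class_name=None, individual_uri=None,
                           incoming=(), outgoing=(), text_filter="", limit=30,
                           allowed_namespaces=(), excluded_namespaces=(),
                           order_by="count desc"):
    """Assemble a (hypothetical) request body for property suggestions."""
    return {
        "limit": limit,
        "filter": text_filter,
        "allowedNamespaces": list(allowed_namespaces),
        "excludedNamespaces": list(excluded_namespaces),
        "orderBy": order_by,
        "endpointUrl": endpoint_url,
        # Query context: the class, individual and neighbouring properties
        # of the node for which a property is being suggested.
        "context": {
            "className": class_name,
            "individualUri": individual_uri,
            "incomingProperties": list(incoming),
            "outgoingProperties": list(outgoing),
        },
    }


payload = build_property_request(
    "https://dbpedia.org/sparql",
    class_name="dbo:Language",
    outgoing=["dbo:spokenIn"],
    text_filter="name",
    excluded_namespaces=["dbp"],
)
body = json.dumps(payload, indent=2)
print(body)
```

For property completion with context on both ends, two such context elements (one for the subject and one for the object position) would be needed.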

A preliminary check of the schema server efficiency has found that the operations for suggesting classes and properties in a context perform reasonably, as shown in Table 1. For each of the link computation positions at least 10 source instances that can be expected to have the highest running times (e.g., the largest entities) are considered and the maximum of the found running times is listed.

The experiments with the schema server have been performed on a single-laptop (32 GB RAM) installation of the visual tool, with the PostgreSQL database over the local network and remote access to the public DBPedia endpoint (Footnote 16) as the data set; the query time is measured by the printouts from the schema server JavaScript code.

Table 1. Entity list suggestion timing estimates

We note that the queries for computing the entities in a multiple context do not tend to blow up the execution time, compared to the single-context inquiries.

4 Data Schema Retrieval

Some data endpoints may have an ontology that describes their data structure; however, the ontology may well not describe the actual data structure fully (e.g., including all classes, all properties and all their connections present in the data set) (Footnote 17). Therefore, we consider retrieving the data schema from the SPARQL endpoint itself (Footnote 18).

The extraction of small and medium-sized schemas can be performed by methods described in e.g., [12] and [13]. We outline here retrieving the DBPedia schema.

The DBPedia core schema retrieval has been done from a local copy, installed from the DBPedia Databus site (Footnote 19) (the copy of December 2020).

The basic data retrieval involves the following generic steps that can be followed on other endpoints, as well:

  1) Retrieve all classes (entities that have some instance), together with their instance count (Footnote 20).

  2) Retrieve all properties, together with their triple count, their object triple count (triples where the object is a URI) and their literal triple count.

  3) For classes deemed to be “large” (Footnote 21), compute the sets of their incoming and outgoing properties, with respective triple counts, including also the object triple count and the literal triple count for outgoing properties. For the classes where direct computation of properties does not give results (e.g., due to a query timeout), check the instance counts for all (class, property) pairs separately (Footnote 22).

  4) Retrieve the property-property relations, recording the situations when (a) one property can follow the other, (b) both properties can have a common subject, or (c) both properties can have a common object, together with the triple pattern counts.

  5) Pre-compute the property domain and range information, where possible (by checking whether the source/target class with the largest triple count for the property is its domain/range).

  6) Create the list of namespaces and link the classes and properties to them.

The following additional schema enrichment and tuning operations are performed, using the specifics of the DBPedia endpoint organization.

  7) Compute the display names for classes and properties to coincide with the entity local name, with some DBPedia-specific adjustments:

    a. If the local name ends in a long number (as some yago: namespace classes do), replace the number part by ‘..’, followed by the last 2–4 digits of the number (allowing to disambiguate the display names),

    b. If the local name contains ‘/’, surround it by [[ and ]],

    c. For the wikidata: namespace, fetch the class labels from wikidata (Footnote 23) and use the labels (enclosed in [[ and ]]) as display names.

  8) Note the sub-class relation (Footnote 24), to be used in the class tree presentation and in determining the “representative” large classes for small classes.

  9) Note the class equivalence relation, to allow the non-local classes to be “represented” by the local ones in the initial class list.

  10) For each “small” class, compute its smallest “large” super-class (for use in the property-to-class mapping to suggest also the “small” class names). Perform step (3) for “small” classes that do not have any “large” superclass.
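The counting queries behind the first retrieval steps (classes with instance counts, properties with triple counts, and the property-property relations) can be sketched as SPARQL templates. These are illustrative formulations only; the names are invented, and on an endpoint of DBPedia's size such queries would in practice have to be split (e.g., run per class or per property) to avoid timeouts.

```python
# Step 1: all classes with their instance counts.
CLASS_COUNTS = "SELECT ?c (COUNT(?x) AS ?cnt) WHERE { ?x a ?c } GROUP BY ?c"


def property_counts(kind="all"):
    """Step 2: triple counts per property; 'object' counts triples whose
    object is an IRI, 'literal' those whose object is a literal."""
    filters = {
        "all": "",
        "object": " FILTER(isIRI(?o))",
        "literal": " FILTER(isLiteral(?o))",
    }
    return ("SELECT ?p (COUNT(*) AS ?cnt) WHERE { ?s ?p ?o"
            + filters[kind] + " } GROUP BY ?p")


def property_pair_query(mode):
    """Step 4: pattern counts for pairs of properties in one of three modes."""
    patterns = {
        "following": "?x ?p1 ?y . ?y ?p2 ?z",       # p2 follows p1
        "common_subject": "?x ?p1 ?y . ?x ?p2 ?z",  # same subject
        "common_object": "?y ?p1 ?x . ?z ?p2 ?x",   # same object
    }
    return ("SELECT ?p1 ?p2 (COUNT(*) AS ?cnt) WHERE { "
            + patterns[mode] + " } GROUP BY ?p1 ?p2")


print(CLASS_COUNTS)
print(property_counts("object"))
print(property_pair_query("following"))
```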

The schema extraction process currently is semi-automated. It can be expected that after a full automation and some optimizations it would be able to complete within a couple of days. The process can be repeated for new DBPedia configurations and data releases. The database size on the PostgreSQL server (including the tables and indices) amounts to about 20 GB. The dump of the database for the currently analyzed DBPedia endpoint can be accessed from the paper’s supporting website.
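As an example of one automatable step, the display-name adjustments for long numeric suffixes and for local names containing ‘/’ can be sketched as follows. What counts as a “long” number (six or more digits here) and the default of keeping two digits are assumptions of this sketch, as are the function name and the sample local names.

```python
import re

# A trailing run of at least six digits is treated as a "long number"
# (the exact cut-off is an assumption of this sketch).
LONG_NUMBER = re.compile(r"^(.*?)(\d{6,})$")


def display_name(local_name, keep_digits=2):
    """Shorten a long numeric suffix to '..' plus its last digits and
    wrap names containing '/' in double square brackets."""
    match = LONG_NUMBER.match(local_name)
    if match:
        stem, number = match.groups()
        local_name = f"{stem}..{number[-keep_digits:]}"
    if "/" in local_name:
        local_name = f"[[{local_name}]]"
    return local_name


print(display_name("Person100007846"))
print(display_name("Person100007846", keep_digits=4))
print(display_name("some/local/name"))
print(display_name("Language"))
```

Increasing `keep_digits` (up to four, following the text) would be the way to disambiguate two classes whose shortened names collide.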

5 Visual Query Creation

To enable the creation of visual queries over DBPedia (cf. Fig. 1 in Sect. 2), the ViziQuer tool [7] has been connected to the data schema server and enriched by new features involving: (i) new shape of the class tree, (ii) means for query seeding by properties and individuals, and (iii) search-boxes for names in attribute and link dialogues and for classes in the node property pane.

Fig. 2. Schema tree examples in the visual query tool: top classes except from the yago: namespace, filtered classes, top properties.

The implementation of the tool allows also for endpoint-specific extensions to customize the tool appearance while working on specific data endpoints.

The created ViziQuer/DSS tool can be accessed from the paper’s supporting website.

We briefly explain the visual environment elements that enable the schema-supported query creation experience, relying on the schema server API (cf. Sect. 3).

For the query seeding there are tabs with class, property and individual selection. The class tab can show either the full list of classes, or the full list of classes without the dominating yago: namespace, or just the dbo: namespace classes (the top classes of the first two choices are in Fig. 2). The properties in their tab can be listed either in the basic mode (moving down the dbp: namespace properties and a few more “housekeeping” properties), or in the full mode (ordering just by the triple count, descending). The property search can be restricted to either data or object properties only (a property of “dual nature” would be present in both lists). Both the class and property lists are efficiently searchable. There is also an option to obtain a list of subclasses for a class. A double click on an item in any of the tabs initiates a new query from this element.

The main tools for query growing are the attribute and link addition dialogues, illustrated in Fig. 3, in the context of the dbo:Language class (cf. Fig. 1); both basic and full lists of attributes and links are illustrated. In the link list the principal (range or domain) class is added, if available in the data schema for the property; the lists are efficiently searchable, as well.

Fig. 3. Top attribute and link suggestions in the context of the dbo:Language class and the outgoing property spokenIn: top of basic and full attribute lists, top of basic and full link lists.

If a query has been started by a property or an individual, there is an option to fill in the class name (in the element’s property pane to the right of the diagram) from the class name suggestions created in the context of the selected node and its environment in the diagram. Figure 4 illustrates the class name suggestion in the context of an outgoing property dbo:spokenIn.

The created visual environment can be used both for exploration and querying of the data endpoint (DBPedia).

Fig. 4. Visual diagram after selection of the dbo:spokenIn property from the initial property list, and the following class name suggestion for its source class.

The exploration allows obtaining an overview of the classes and properties in the textual pane, together with their size; the subclass relation in the class tree is supported based on the subclass data retrieved from the data endpoint. The class and property lists can be filtered, thus allowing the user to reach any of the 480 K classes and 50 K properties. For each class and property its surrounding context is available (starting from the most important classes/properties), and queries over the data can be made from any point reached during the exploration phase (the exploration can be used to determine the entities for further query seeding).

Within the data querying options, the environment provides the visual querying benefits (demonstrated e.g., in [5] and [11]) for work with a data endpoint of principal importance and substantial size. The environment would allow creating all queries from e.g., the QALD-4 dataset; the end user experience with query creation would, however, need to be evaluated in future work.

6 Conclusions

We have described a method enabling auto-completion of queries based on actual class-to-property and property-to-property mappings for the DBPedia data endpoint with more than 480 thousand classes and more than 50 thousand properties, using a hybrid method that accesses both the stored data schema and the data endpoint itself.

The created data schema extraction process can be repeated over different versions of the DBPedia, as well as over other data endpoints, so creating query environments over the datasets that need to be explored or analyzed. The open-source code of the visual tool and the data schema server allows adding custom elements to the environment that are important for quality user interface creation over user-supplied data.

An interesting future task would be moving the schema data (currently stored on a PostgreSQL server) into an RDF triple store, to enable easier sharing of endpoint data schemas as resources themselves, processing of the schema data by means of visual queries, and their integration with other Linked Data resources. An issue to be addressed would be the efficiency of the schema-level queries over the data store; it can, however, be conjectured that a reasonable efficiency could be achieved. The technical replacement of the PostgreSQL server by an RDF triple store (and generating SPARQL queries instead of SQL ones) is not expected to be a major challenge, since the schema server architecture singles out the schema database querying module.