Keywords

1 Introduction

Considerable research has been done on effective algorithms for keyword search over data graphs (e.g., [3, 4, 7, 1012, 14, 17]). Usually, a data graph is obtained from RDB, XML or RDF by a rather simplistic transformation. In the case of RDB [3, 6, 12], tuples are nodes and foreign keys are edges. When the source is XML [11, 13], elements are nodes, and the edges reflect the document hierarchy and IDREF(S) attributes.

In many cases, the source data suffers from certain anomalies and some papers (e.g., [13, 15]) take necessary steps to fix those problems. For example, when citations are represented by XML elements, they should be converted to IDREF(S) attributes. As another example, instead of repeating the details of an author in each one of her papers, there should be a single element representing all the information about that author and all of her papers should reference that element. These are examples of necessary transformations on the source data. If they are not done, existing algorithms for keyword search over data graphs will not be able to generate meaningful answers.

Once a source data is ameliorated, it should be transformed into a graph. The literature hardly discusses how it should be done. In [3, 14], the source is an RDB and the naive approach mentioned earlier is used (i.e., tuples are nodes and foreign keys are edges). In [11, 20], the source data is XML and the simplistic transformation described at the beginning of this section is applied. In [2, 5, 12, 16, 18], they do not mention any details about the construction of data graphs. The lack of a thoughtful discussion in any of those papers is rather surprising, because the actual details of constructing a data graph have a profound effect on both the efficiency and the quality of keyword search, regardless of the specific algorithms and techniques that are used for generating answers and ranking them.

Construction of effective data graphs is not a simple task, since the following considerations should be taken into account. For efficiency, a data graph should be as small as possible. It does not matter much if nodes have large textual contents, but the number of nodes and edges is an important factor. However, lumping together various entities into a single node is not a good strategy for increasing efficiency, because answers to queries would lose their coherence.

The structure of a data graph should reflect succinctly the semantics of the data, or else answers (which are subtrees) would tend to be large, implying that finding them would take longer and grasping their meaning quickly would not be easy.

An effective engine for keyword search over data graphs must also use information-retrieval techniques. Those tend to perform better on large chunks of text, which is another reason against nodes with little content.

In this paper, we address the problem of how to construct data graphs in light of the above considerations. In Sect. 4, we develop a transformation for constructing data graphs from XML. In the full version of this paper [9], we also present a transformation for RDB. In addition, we show there that the format of the source data (i.e., RDB or XML) has a significant impact on the quality of the generated data graph. Moreover, XML is a better starting point than RDB. This is somewhat surprising given the extensive research that was done on designing relational database schemes.

As a conceptual guideline for constructing a good data graph, we use the OCP model [1], which was developed for supporting a graphical display of answers so that their meaning is easily understood. In Sect. 3, we explain why the OCP model is also useful as a general-purpose basis for constructing data graphs in a way that takes into account all the issues mentioned earlier.

In summary, our contributions are as follows. First, we enunciate the principles that should guide the construction of data graphs. Second, we develop transformations for doing so when the source data is XML or RDB (the latter is done in the full version [9]). These transformations are more elaborate than the simplistic approach that is usually applied. Third, the full version [9] shows how the format of the source data impacts the quality of the generated graphs. Moreover, it also explains why XML is a better starting point than RDB.

Our contributions are valid independently of a wide range of issues that are not addressed in this paper, such as the algorithm for generating answers and the method for ranking them. We only assume that an answer is a non-redundant subtree that includes all the keywords of the query. However, our results still hold even if answers are subgraphs, as sometimes done.

2 Preliminaries

2.1 The OCP Model

The object-connector-property (OCP) model for data graphs was developed in [1] to facilitate an effective GUI for presenting subtrees. (As explained in the next section, those subtrees are answers to keyword search over data graphs.) In the OCP model, objects are entities and connectors are relationships. We distinguish between two kinds of connectors: explicit and implicit. Objects and explicit connectors can have any number of properties. Two special properties are type and name.

Parts (a) and (b) of Fig. 1 show an object and a snippet of a data graph, respectively. An object is depicted as a rectangle with straight corners. The top line of the rectangle shows the name and type of the object. The former appears first (e.g., Ukraine) and the latter is inside parentheses (e.g., country). The other properties appear as pairs consisting of the property’s name and value, as shown in Fig. 1(a). Observe that properties can be nested; for example, the property percentage is nested inside ethnicgroup. Nesting is indicated in the figure by indentation.

An implicit connector is shown as a directed edge between two objects. Its meaning should be clear from the context. In Fig. 1(b), the implicit connector from Ukraine to Odeska means that the latter is a province in the former.

An explicit connector is depicted as a rectangle with rounded corners. It has at most one incoming edge from an object and any positive number of outgoing edges to some objects. An explicit connector has a type, but no name, and may also possess other properties. Figure 1(b) shows an explicit connector of type border from Ukraine to Russia that has the property length whose value is 1576 km.

Fig. 1.
figure 1

An object and a tiny snippet of a data graph (not all properties are shown)

Fig. 2.
figure 2

A tiny portion of Mondial

2.2 Answers to Keyword Search

We consider keyword search over a directed data graph G. (A data graph must be directed, because relationships among entities are not always symmetric.) A directed subtree t of G has a unique node r, called the root, such that there is exactly one directed path from r to each node of t.

A query Q over a data graph G is a set of keywords, namely, \(Q=\{k_1,\ldots ,k_n\}\). An answer to Q is a directed subtree t of G that contains all the keywords of Q and is nonredundant, in the sense that no proper subtree of t also contains all of them.

Fig. 3.
figure 3

Answers to queries

For example, consider Fig. 2, which shows a snippet of the data graph created from the XML version of the Mondial dataset,Footnote 1 according to the transformation of Sect. 4. To save space, only the name (but not the type) of each object is shown. The dashed edges should be ignored for the moment. The subtree in Fig. 3(a) is an answer to the query \(\{\texttt {Dnepr, Russia, Ukraine}\}\). There are additional answers to this query, but all of them have more than three nodes and at least one explicit connector.

For the query \(\{\texttt {Dnepr, Don}\}\), there is no answer (with only solid edges) saying that Dnepr and Don are rivers in Russia, although the data graph stores this fact. The reason is that the connectors (in the data graph of Fig. 2) have a symmetric semantics, but the solid edges representing them are in only one direction. The only exception is the connector border, which is already built into the graph in both directions (between Russia and Ukraine). In order not to miss answers, we add opposite edges when symmetric connectors do not already exist in both directions. Those are shown as dashed arrows. Now, there are quite a few answers to the query \(\{\texttt {Dnepr, Don}\}\) and Fig. 3(b)–(d) shows three of them. The first two of those say that Dnepr and Don are rivers in Russia. These two answers have the same meaning, because the relationship between a river and a country is represented twice: by an implicit connector and by the explicit connector located. The answer in Fig. 3(d) has a different meaning, namely, Dnepr and Don are rivers in Ukraine and Russia, respectively, and there is a border between these two countries.

To generate relevant answers early on, weights are assigned to the nodes and edges of a data graph. Existing algorithms (e.g., [3, 7, 8, 10] enumerate answers in an order that is likely to be correlated with the desired one. Developing an effective weighting scheme is highly important, but beyond the scope of this paper.

3 Advantages of the OCP Model

In this section, we discuss some of the advantages of the OCP model. In a naive approach of building a data graph, there is only one type of nodes (i.e., no distinction between objects and connectors). Moreover, sometimes there is even a separate node for each property. This approach suffers from three drawbacks. First, from the implementation’s point of view, this is inefficient in both time and space. That is, even if there is not much data, the number of nodes and edges is likely to be large. As a result, searching a data graph for answers would take longer (than the alternative described later in this section). In addition, if all the processing is done in main memory, the size of the data graph is more likely to become a limiting factor.

The second drawback of the naive approach is from the user’s point of view. A meaningful answer is likely to have quite a few nodes; hence, displaying it graphically in an easily understood manner is rather hard. Another problem is the following. The definition of an answer is intended to avoid redundant parts in order to cut down the search space. However, sometimes an answer must be augmented to make it clear to the user. For example, an answer cannot consist of just some property that contains the keywords of the query, without showing the context.

The third drawback pertains to ranking, which must take into account textual relevance (as well as some other factors). In the naive approach, many nodes have only a small amount of text, making it hard to determine their relevance to a given query.

In comparison to the naive approach, the OCP model dictates fat nodes. That is, an object or an explicit connector is represented by a node that contains all of its properties. Consequently, we get the following advantages. First, a data graph is not unduly large, which improves efficiency. Second, relevance is easier to determine, because all the text pertaining to an object or an explicit connector is in the same node. Third, the GUI of [1] is effective, because it does not clutter the screen with too many nodes or unnecessary stuff. In particular, the default presentation of an answer is condensed and only shows: types and names of objects; types of explicit connectors; and properties that match some keywords of the query. The user can optionally choose an expanded view in order to see all the properties of the displayed nodes, when additional information about the answer is needed. Since all the properties are stored in the nodes that are already shown, this can be done without any delay. Furthermore, the GUI of [1] visualizes the conceptual distinction between objects and connectors, which makes it much easier to quickly grasp the meaning of an answer.

4 Constructing Data Graphs

An XML document is a rooted hierarchy of elements. Each element can have any number of attributes. Three special types of attributes are ID, IDREF and IDREFS. An attribute of the first type has a value that uniquely identifies its element. The last two types serve as references to other elements. For an attribute defined (in the DTD) as IDREF, the value is a single ID (of the referenced element); and if an attribute is defined as IDREFS, its value is a set of IDs. In our terminology, a reference attribute is one defined as either IDREF or IDREFS. An attribute is plain if it is neither ID, IDREF nor IDREFS.

In XML lingo, an element has a name that appears in its tag (e.g., <city>). To avoid confusion, we call it the type of the element, because it corresponds to the notion of a type in the OCP model

In this section, we describe how to transform an XML document to a data graph. We assume that the document has a DTD and use it in the transformation. As we shall see, the DTD provides information that is essential to constructing the data graph. Conceivably, this information can also be gleaned from the document itself. However, if the document does not conform to a reasonable DTD, the resulting data graph (similarly to the document itself) is likely to be poorly designed. By only assuming that there is a DTD (as opposed to an XML schema), we make our transformation much more applicable to real-world XML documents.

In XML documents, redundancies commonly occur due to reference attributes. For example, a course may have an IDREF attribute teacher that points to an element of type teacher. Converting the attribute teacher to an explicit (rather than implicit) connector is redundant, because the type of the referenced element makes it clear what is the semantic meaning of the connector.

Formally, consider an attribute A that is defined as IDREF. A DTD does not impose any restriction on the type E of an element that can be referenced by the value of A. In a given XML document, A (i.e., its name) and E could be the same (e.g., teacher). If so, we say that A is an insignificantly named reference attribute. In the constructed data graph, the reference described by A can be represented by an implicit connector. If the opposite holds, namely, A and E are different, then we say that A is a significantly named reference attribute. In this case, the constructed data graph should retain A as the type of an explicit connector.

If attribute A is defined as IDREFS, then it is insignificantly named if all the IDs (in the value of A) are to elements of a type that has the same name as A; otherwise, it is significantly named.

Whether a reference attribute is significantly named depends on the given XML document (and not just on the DTD). It may change after some future updates. As a general rule, we propose the following. It is safe to assume that a reference attribute A is significantly named if there is no element of the DTD, such that its type is the same as A. In any other case, it is best to get some human confirmation before deciding that a reference attribute is insignificantly named.

Let \(E_1\) and \(E_2\) be element types. We say that \(E_2\) is a child element type of \(E_1\) if the DTD has a rule for \(E_1\) with \(E_2\) on its right side. In this case, \(E_1\) is a parent element type of \(E_2\).

Rudimentary rules for transforming an XML document to a data graph were given in [19]. However, they are applicable only to simple cases. Next, we describe a complete transformation that consists of two stages. We assume that prior to these two stages, both the DTD and the XML document are examined to determine for each reference attribute whether it is significantly named or not.

In the first stage, we analyze the DTD and classify element types as either objects, connectors or properties. This also induces a classification over the elements themselves. That is, when a type E is classified as an object, then so is every element of type E (and similarly when E is classified as a connector or a property). In the second stage, the classification is used to construct the data graph from the given XML document. The first stage starts by classifying all the element types E that satisfy one of the following base rules.

  1. 1.

    If E does not have any child element type and all of its attributes are plain, then E is a property.

  2. 2.

    If E has an ID attribute or a significantly named reference attribute, then it is an object.

  3. 3.

    If E has neither any child element type nor an ID attribute, but it does have some reference attributes and all of them are insignificantly named, then E is a connector.

As an example, consider the DTD of Fig. 4. Base Rule 2 implies that the element types country, province, river and confluence are objects, because the first three have an ID attribute and the fourth has a significantly named IDREFS attribute (i.e., rivers). No base rule applies to economy. By Base Rule 1, all the other element types are properties.

Next, we find all the element types that should be classified as properties by applying the following recursive rule. If (according to the DTD rules) element type E only has plain attributes and all of its child element types are already classified as properties, then so is E. It is easy to show that a repeated application of this recursive rule terminates with a unique result.

Continuing with the above example, a single application of the recursive rule shows that economy is a property, because all of its child elements have already been classified as such by Base Rule 1.

Now, we apply the following generalization of Base Rule 3. If E does not have an ID attribute, all of its child element types are classified as properties, and it has some reference attributes and all of them are insignificantly named, then E is a connector.

We end the first stage by classifying all the remaining element types as objects, and then the following observations hold. First, if an element type is classified as a property, then so are all of its descendants. Second, the classification (when combined with the construction of the data graph that is described below) ensures that a connector is always between two objects. Third, if an element type is classified as a connector, then it has some reference attributes and all of them are insignificantly named.

Fig. 4.
figure 4

DTD snippet of Mondial

Fig. 5.
figure 5

XML snippet of Mondial

In the second stage, we transform the XML document to a data graph. At first, we handle PCDATA as follows. If an element e (of the document) includes PCDATA as well as either sub-elements or attributes, then we should create a new attribute having an appropriate name (e.g., text) and make the PCDATA its value. This is not needed if e has neither sub-elements nor attributes, because in this case, e becomes (in the data graph constructed below) a non-nested property, such that the element type of e is the name of that property and the PCDATA is its value.

Now we construct the data graph as follows. For each element e, such that e is not classified as a property, we generate a node \(n_e\). This node is either an object or a connector (and hence an explicit one) according to the classification of e. The type of \(n_e\) is the same as that of e. If \(n_e\) is an object, we should choose one of its properties (which will be created by the rules below) as its name. As usual, we prefer a property (e.g., title) that describes the meaning of \(n_e\), even if it is not a unique identifier. For each \(n_e\), we create properties and add additional edges and nodes by applying the following six construction rules.

  1. 1.

    Every plain attribute of e is a property of \(n_e\).

  2. 2.

    For each child p of e, such that p is classified as a property, the subtree (of the given document) that is rooted at p becomes a property of \(n_e\). Note that this property is nested if p has either plain attributes or descendants of its own. Also observe that element types and attribute names appearing in p become names of properties nested in \(n_e\).

  3. 3.

    For each child o of e, such that o is classified as an object (hence, so is e), we add an edge from \(n_e\) to \(n_o\) (which is the node created for o).

  4. 4.

    For each child c of e, such that c is classified as a connector, we add an edge from \(n_e\) to \(n_c\). Observe that if such a c exists, then e is classified as an object and \(n_c\) is the node of the explicit connector corresponding to c.

  5. 5.

    For each reference attribute R of e, we create new connectors or add edges to existing ones, according to the following two cases. First, if R is insignificantly named, then for each object o that (the value of) R refers to, we add an edge from \(n_e\) to o. Note that this edge is an implicit connector if \(n_e\) is an object; otherwise, it is part of the explicit connector \(n_e\).

    The second case applies when R is significantly named. In this case, the classification rules imply that \(n_e\) is an object. We first create a node \(n_r\), such that its only incoming edge is from \(n_e\). This node represents an explicit connector that gets the name of attribute R as its type and has no properties. In addition, for each object o that (the value of) R refers to, we add an edge from \(n_r\) to o.

Figure 6 shows the data graph created from the XML document of Fig. 5 with the DTD of Fig. 4.

Fig. 6.
figure 6

A data graph constructed from the Mondial XML

We divide the original edges (i.e., those created by the above transformation) into two kinds. The hierarchical edges are those created by Construction Rule 3. They are implicit connectors that reflect the parent-child relationship between XML elements. The reference edges are the ones introduced by Construction Rule 5 (i.e., due to reference attributes). Construction Rule 4 creates edges due to the element hierarchy, but they enter nodes of explicit connectors; hence, we also refer to them as reference edges.

As explained in Sect. 2.2, we add opposite edges. However, our experience indicates that even if it is done just for the reference edges (i.e., no opposite edges are added for the hierarchical ones), we generally do not miss meaningful answers to queries. Furthermore, as we argue in the full version [9], a strategy that works well is to assign higher weights to opposite edges than to original ones. In this way, relevant answers are likely to be generated first without having too many duplicates early on.

5 Conclusions

We showed that the OCP model is an effective conceptual basis for constructing data graphs. Using it, we developed transformations for generating data graphs from XML and RDB (the latter is done in the full version of this paper [9]). These transformations are quite elaborate and provide much better results than the ad hoc methods that have been used in the literature thus far. In particular, the produced data graphs are better in terms of both efficiency (i.e., answers are generated more quickly) and effectiveness (i.e., the most relevant answers are produced early on).

It should be emphasized that the presented transformations are based on the principle of creating fat nodes (as explained in Sect. 3) and avoiding redundancies (e.g., due to insignificantly named references). Thus, they are applicable and useful (in most if not) all cases, regardless of how answers are generated or ranked.

In the full version of this paper [9], we show that XML is the preferred starting point for constructing data graphs. However, we need to better understand how to create XML documents that yield the best possible data graphs. Toward this end, we plan to develop appropriate design rules for XML documents.

An interesting topic for future work is to how to construct data graphs from XML documents without DTDs.