1 Introduction

Keyword queries enable inexperienced users to easily search databases, but the quality of the returned results depends heavily on the keywords used. For database users who have little knowledge of the database content or domain, it is often hard to formulate effective keyword queries with “good” terms. Consequently, queries are often underspecified or overspecified: large gaps exist between the query terms and the terms that would retrieve the desired results from the database. To bridge these gaps, query suggestion helps users express their information needs more precisely so that they can access the required information.

For semantic query suggestion there are two major types of query reformulation [3, 14]. The first type is specialization, in which a general or vague concept in the user query is replaced to narrow down the search result. For example, suppose that a user with little knowledge about mobile phones poses the keyword query [Mobile phone, Price] on a mobile phone database. If the database contains terms semantically related to Mobile phone, such as iPhone 6s and Samsung galaxy S6, we can replace the term Mobile phone with iPhone 6s or Samsung galaxy S6 and generate the query suggestion [iPhone 6s, Price] or [Samsung galaxy S6, Price]. The other type of query reformulation is parallel movement, in which the user’s topic of interest drifts to others with similar aspects. For example, assume that the user issues the query [Canon camera, Price]. If we know that Nikon camera is related to Canon camera, the query suggestion [Nikon camera, Price] can be offered to explore the price information of another camera brand. In this paper, both types of semantic query suggestion are studied using the knowledge of a domain taxonomy.

In response to a user query, existing query suggestion methods [17, 18] evaluate the relevance scores of candidate suggestions and return the top-k suggestions with the highest scores. Since the relevance scores are measured independently for each candidate, the returned suggestions are semantically relevant not only to the query but also to each other. This redundancy in the returned top-k suggestion set leads to uniform topics shared by most of the suggestions. Given a query from a random user, a set of suggestions with diverse topics is more likely to match the user's intent than a set of suggestions belonging to a single topic. For example, suppose that a user poses the query [Mobile phone, Price] on a mobile phone database. Since the system does not know which mobile operating system the user is interested in, the best strategy is to suggest mobile phones covering diverse operating systems, such as the IOS system and the Android system.

In this paper, we aim to diversify the query suggestion set with respect to topics while keeping the query suggestions relevant to the user query.

Providing diverse query suggestions raises two challenging issues. First, how do we measure the relevance, and especially the diversity, of a suggestion set? Second, given an overall function over suggestion sets that linearly combines the diversity and relevance functions, how do we find the optimal suggestion set that maximizes it? This paper solves these problems and makes the following contributions:

  1.

    To measure the quality of suggestion sets, we define a relevance function and a diversity function over suggestion sets. The diversity function takes both coverage and redundancy into account.

  2.

    We define an overall objective function that linearly combines the relevance and diversity functions. We then show that finding the optimal suggestion set maximizing this objective is a submodular function maximization problem subject to n matroid constraints, and propose a greedy \(O(\frac {1}{1+n})\)-approximation algorithm.

  3.

    We propose new metrics for evaluating the quality of query suggestion sets that take diversity into account. The experimental results validate the effectiveness of our method.

In the remainder of the paper, Section 2 introduces preliminary knowledge. Sections 3 through 6 present the framework of query suggestion and analyze the nature of our problem. Section 7 describes the algorithm for efficiently generating the query suggestion set. Section 8 presents our experiments. Section 9 reviews related work. Finally, Section 10 draws our conclusions.

2 Preliminary

Since users often have little domain knowledge, user queries are frequently underspecified or overspecified. In this paper, domain taxonomies are used to replace the terms in user queries with more precise ones in order to obtain high-quality results from databases. We first introduce the concept of a taxonomy.

Taxonomy information

The taxonomy is a DAG (directed acyclic graph). A node in the taxonomy is either a concept node or an instance node (a node without successors in the DAG). An arc (t 1, t 2) indicates that t 1 is a super concept of t 2; we write \(t_{2} \prec t_{1}\) to denote that concept t 1 directly subsumes t 2, and \(t_{2} \preceq t_{1}\) to denote that t 1 subsumes t 2 (directly or indirectly). A concept contains a set of instances and possibly a set of sub-concepts. Note that a concept or instance node may have multiple parent nodes (super concepts).

Obviously, concepts in a taxonomy are semantically related. Given two concepts t 1, t 2, we compute the similarity of two concepts by the method proposed in [20] with some modification to bound the similarity value within [0, 1]:

$$ sim(t_{1},t_{2}) = \frac{2\cdot \max_{t \in S(t_{1},t_{2})} [-\log p(t) ]}{-(\log p(t_{1})+\log p(t_{2}))} $$
(1)

where p(t) is the probability of encountering an instance of concept t, and S(t 1, t 2) is the set of concepts that subsume both t 1 and t 2 in the taxonomy. The similarity of two concepts is the information content they share (times 2) divided by the sum of the information content of the two concepts. It is easy to verify that \(sim(t_{1},t_{2}) \in [0, 1]\).
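As a minimal sketch, formula (1) can be computed directly from concept probabilities. The toy probabilities and concept names below are purely illustrative and are not taken from the paper's datasets.

```python
import math

def lin_similarity(p, t1, t2, subsumers):
    """Formula (1): twice the shared information content divided by the
    summed information content of t1 and t2. `p` maps each concept to its
    probability; `subsumers` is S(t1, t2), the common subsuming concepts."""
    shared = max(-math.log(p[t]) for t in subsumers)
    return 2.0 * shared / -(math.log(p[t1]) + math.log(p[t2]))

# Illustrative probabilities: the root concept subsumes everything, p = 1.0.
p = {"mobile phone": 1.0, "smart phone": 0.5,
     "iphone 6s": 0.1, "iphone 6s plus": 0.1}
s = lin_similarity(p, "iphone 6s", "iphone 6s plus",
                   {"mobile phone", "smart phone"})  # about 0.30
```

Here the shared information content is that of the most informative common subsumer (Smart phone), so the value stays within [0, 1] as the text claims.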

Recently, more and more taxonomy knowledge bases have been developed, such as YAGO [11] and CYC [15]. They cover many domain taxonomies, such as taxonomies of movies and the sciences.

3 Problem definition

A user (keyword) query q[w 1, w 2, ⋯ , w l ] consists of l keyword terms. With respect to query q, we aim to find the best suggestion set of k query suggestions. The “best” here is motivated by two considerations. First, the k query suggestions in the suggestion set should be highly relevant to user query q. Second, to maximize user satisfaction, the suggestion set should be diverse with respect to the topics that the user might be interested in. To achieve the diversification of the suggestion set while keeping it relevant to the user query, we introduce an objective function f over suggestion sets. For any suggestion set \(\mathcal {H}\) with k query suggestions, the objective function f on \(\mathcal {H}\) is:

$$ f(\mathcal{H}) = \lambda \cdot \textit{diversity} (\mathcal{H} )+(1-\lambda)\cdot \textit{relevance}(\mathcal{H},q) $$
(2)

The objective function f is a linear combination of relevance and diversity function by a tuning parameter λ ∈ [0, 1]. The diversity function \(diversity (\mathcal {H})\) and the relevance function \(relevance(\mathcal {H},q)\) will be defined in Section 5. The method to produce query suggestions will be presented in the next section.

Our aim is to find the suggestion set \(\mathcal {H}^{*}\) with k query suggestions, which has the highest f score.

4 Candidate query suggestions and topics

In this section, we present the method for generating candidate query suggestions from domain taxonomies and for identifying the potential topics that the query suggestions belong to.

4.1 Candidate query suggestions

Assume that there exists a domain taxonomy \(\mathcal {T}\) related to the content of the database. Given taxonomy \(\mathcal {T}\) and user query q[w 1, w 2, w 3, ⋯ , w l ], we reformulate the query by specialization and parallel movement. Specialization replaces a keyword term w i (a general or vague concept in taxonomy \(\mathcal {T}\)) in user query q with one of its instances in taxonomy \(\mathcal {T}\).

Definition 1

(Specialization movement): Given a keyword term w i (as a concept in \(\mathcal {T}\)) in query q, specialization replaces w i by sw i if sw i is an instance and \(sw_{i} \preceq w_{i}\) in taxonomy \(\mathcal {T}\).

For example, given a user query [mobile phone, price] and the taxonomy shown in Figure 1, term Mobile phone can be replaced by terms iPhone 6s and iPhone 6s plus which are instances of Mobile phone in the taxonomy.

Figure 1: Taxonomy of mobile phones

The other type of query reformulation is parallel movement, in which the terms (as instances in \(\mathcal {T}\)) in a user query are replaced by their sibling instances in \(\mathcal {T}\).

Definition 2

(Parallel movement): Given a keyword term w i (as an instance in \(\mathcal {T}\)) of query q, parallel movement replaces w i by instance sw i if there exists a concept t s.t. \(w_{i} \prec t\) and \(sw_{i} \prec t\) in taxonomy \(\mathcal {T}\).

For example, term iPhone 6s can be replaced by term iPhone 6s plus, which is one of the siblings of instance iPhone 6s in the taxonomy (i.e., they have a common direct super concept, IOS system).

Definition 3

(Term suggestion set): For each keyword term w i in user query q, we define sugg(w i ) to be the set of all terms that can be obtained from w i through specialization or parallel movement.

Definition 4

(Query suggestion): Given user query q[ w 1, w 2,⋯ , w l ], for each keyword term w i , sugg(w i ) is the term suggestion set of w i . We say that s[ sw 1, sw 2,⋯ , sw l ] is a query suggestion of user query q if ∀i, sw i ∈ sugg(w i ) or sw i = w i .

For example, given the taxonomy shown in Figure 1, [Samsung galaxy S6, Price] and [iPhone 6s, Price] are candidate query suggestions of user query [Mobile phone, Price].
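Definitions 1–4 can be sketched as follows. The `parents` map is a hypothetical fragment approximating the taxonomy of Figure 1; all node names are illustrative, not the paper's actual data structures.

```python
from itertools import product

# Hypothetical taxonomy fragment: `parents` maps a node to its direct
# super concepts; `instances` marks the instance (leaf) nodes.
parents = {
    "smart phone": {"mobile phone"},
    "ios system": {"mobile phone"},
    "android system": {"mobile phone"},
    "iphone 6s": {"ios system", "smart phone"},
    "iphone 6s plus": {"ios system", "smart phone"},
    "samsung galaxy s6": {"android system", "smart phone"},
}
instances = {"iphone 6s", "iphone 6s plus", "samsung galaxy s6"}

def subsumed_by(t):
    """All nodes subsumed (directly or indirectly) by concept t."""
    out, frontier = set(), [n for n, ps in parents.items() if t in ps]
    while frontier:
        n = frontier.pop()
        if n not in out:
            out.add(n)
            frontier += [m for m, ps in parents.items() if n in ps]
    return out

def sugg(w):
    """Term suggestion set (Definition 3): specialization for concepts,
    parallel movement (shared direct super concept) for instances."""
    if w in instances:
        return {s for s in instances if s != w and parents[s] & parents[w]}
    return {s for s in subsumed_by(w) if s in instances}

def query_suggestions(q):
    """All candidate query suggestions of q (Definition 4)."""
    return [list(c) for c in product(*[sorted(sugg(w) | {w}) for w in q])]

cands = query_suggestions(["mobile phone", "price"])
```

On this fragment, Mobile phone specializes to three instances while Price has no suggestions, yielding four candidates including the original query.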

We use \(\mathcal {S}\) to denote the set of all possible candidate query suggestions with respect to user query q. Table 1 describes the notations used throughout this paper.

Table 1 Notation used throughout this paper

4.2 Topics of query suggestions

A query suggestion could belong to some potential topics that the user is interested in. These topics can be identified according to the concepts in the taxonomy.

In our running example, the term iPhone 6s replaces the term Mobile phone in user query [Mobile phone, Price]. In the taxonomy, IOS system and Smart phones are two direct super concepts of iPhone 6s, which indicates that iPhone 6s belongs to the topics IOS system and Smart phones. In the more complicated case where multiple keyword terms of a candidate suggestion have been replaced, the topics it belongs to can be represented by combinations of concepts.

Definition 5

(Topic of a query suggestion): Given a suggestion s[ sw 1, sw 2,⋯, sw l ], for each term sw i of s, let T i be the set of concepts (in taxonomy \(\mathcal {T}\)) that directly subsume sw i . We define c = (t 1, t 2,⋯ , t l ), t i ∈ T i , to be a topic that suggestion s belongs to.

If query suggestion s belongs to topic c, we also say that topic c is fully supported by suggestion s.

For instance, query suggestion [iPhone 6s, Price] fully supports topics (IOS system, Price) and (Smart phone, Price).

Multiple topics could be fully supported by one query suggestion, and multiple suggestions could support one common topic. We use \(\mathcal {C}_{s}\) to denote the set of topics supported by suggestion s, and sup(c) to denote the set of query suggestions that fully support topic c.

Given user query q and the set \(\mathcal {S}\) of all query suggestions for q, the set of topics associated with q, denoted by \(\mathcal {C}\), is defined as:

$$\mathcal{C} = \bigcup_{s \in \mathcal{S}} \mathcal{C}_{s} $$

For any topic \(c \in \mathcal {C}=\{c_{1}, c_{2}, \cdots , c_{n} \}\), we have \(sup(c) \subseteq \mathcal {S}\) and sup(c 1) ∪ sup(c 2) ∪ ⋯ ∪ sup(c n ) = \(\mathcal {S}\). Note that sup(c i ) ∩ sup(c j ) is not necessarily ∅, i.e., the sets may overlap. Hence sup(c 1), sup(c 2), ⋯, sup(c n ) form a cover, rather than a partition, of the query suggestion set \(\mathcal {S}\).

Topics are combinations of concepts in the domain taxonomy and are thus semantically related to each other. A query suggestion can support a topic to some extent even if it does not fully support it. We use μ(s, c) to denote the belief that query suggestion s supports topic c. The belief function μ(s, c) is defined as:

$$ \mu(s,c) = \left\{ \begin{array}{l l} 1, &s \in sup(c) \\ \\ \max_{c_{k} \in \mathcal{C}_{s}}rel(c_{k}, c), &s \notin sup(c) \end{array} \right. $$
(3)

where \(rel(c_{k}, c)\) denotes the relevance between topics c k and c. Given two topics \(c_{1}=({t^{1}_{1}}, {t^{1}_{2}}, \cdots , {t^{1}_{l}})\) and \(c_{2}=({t^{2}_{1}}, {t^{2}_{2}}, \cdots , {t^{2}_{l}})\), we compute the relevance between them as:

$$ rel(c_{1}, c_{2}) = \prod\limits_{i=1}^{l}sim({t^{1}_{i}}, {t^{2}_{i}}) $$
(4)

where \(sim({t^{1}_{i}}, {t^{2}_{i}})\) is the similarity between the concepts \({t^{1}_{i}}\) and \({t^{2}_{i}}\), which can be computed by (1); independence between concepts is assumed here.

For topic c, if \(s \in sup(c)\), then μ(s, c) = 1; otherwise, μ(s, c) is computed as the maximal relevance between topic c and the topics in \(\mathcal {C}_{s}\).

Definition 6

Given a query suggestion s, we say that s highly supports topic c if μ(s, c) ≥ δ where δ is a threshold (which can be set by the systems or users). And we define \(\widetilde {sup}(c)\) as the set of query suggestions that highly support topic c.

Obviously, if δ = 1 then we have \(\widetilde {sup}(c) = sup(c)\).
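A sketch of the belief function of formula (3), with the concept similarities of formula (1) replaced by an illustrative lookup table (all similarity values, topic names, and the threshold are assumptions, not values from the paper):

```python
# Illustrative concept similarities standing in for formula (1).
sim = {("ios system", "ios system"): 1.0,
       ("ios system", "android system"): 0.6,
       ("price", "price"): 1.0}

def topic_rel(c1, c2):
    """Formula (4): product of pairwise concept similarities."""
    r = 1.0
    for a, b in zip(c1, c2):
        r *= sim.get((a, b), sim.get((b, a), 0.0))
    return r

def belief(topics_of_s, c):
    """mu(s, c): 1 if s fully supports c, otherwise the best relevance
    between c and any topic that s does fully support."""
    if c in topics_of_s:
        return 1.0
    return max(topic_rel(ck, c) for ck in topics_of_s)

def highly_supports(topics_of_s, c, delta=0.5):
    """Definition 6 with an assumed threshold delta."""
    return belief(topics_of_s, c) >= delta

ts = {("ios system", "price")}                 # topics of [iPhone 6s, Price]
mu = belief(ts, ("android system", "price"))   # 0.6 * 1.0 = 0.6
```

With δ = 0.5, the suggestion highly supports the Android topic even though it does not fully support it, matching the intent of Definition 6.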

5 Relevance and diversity function on suggestion sets

In this section we define the relevance and diversity function on suggestion sets.

5.1 Relevance function on suggestion set

Given a suggestion set \(\mathcal {H}\) with k query suggestions, if \(\mathcal {H}\) is relevant to user query q, the k query suggestions in \(\mathcal {H}\) should be semantically relevant to user query q. We measure the relevance function on suggestion set \(\mathcal {H}\) as:

$$ \textit{relevance}(\mathcal{H},q) = \frac{1}{k}\cdot \sum\limits_{s_{i} \in \mathcal{H}} rel(s_{i},q) $$
(5)

where \(rel(s_{i}, q)\) denotes the relevance score of query suggestion s i with respect to user query q. Given user query q[w 1, w 2,⋯ , w l ] with l keywords and query suggestion s[sw 1, sw 2,⋯ , sw l ], we compute \(rel(s, q)\) as:

$$ rel(s,q) = \prod\limits_{i=1}^{l} sim(sw_{i}, w_{i}) $$
(6)

where \(sim(sw_{i}, w_{i})\) is the similarity score between the concepts or instances w i and sw i in the taxonomy.

5.2 Diversity function on suggestion sets

Now we discuss an important issue: how do we measure the diversity over a query suggestion set? Intuitively, the diversity of a query suggestion set depends upon how suggestions in this set are different from each other in terms of their associated topics. Inspired by coverage-based approaches [1, 5, 19], we first consider the coverage of a suggestion set, which measures the extent to which topics are covered by the suggestion set.

Given user query q, we have a query suggestion set \(\mathcal {S}=\{ s_{1}, s_{2}, \cdots , s_{m} \}\) and a topic set \(\mathcal {C}=\{ c_{1}, c_{2}\), ⋯ , c n }. Since each c i in \(\mathcal {C}\) could be a potential topic that the user is interested in, we maximize the average belief that each topic is supported by at least one query suggestion in the suggestion set. For a query suggestion set \(\mathcal {H}\) with \(|\mathcal {H}|=k\), we define the coverage function as:

$$ \textit{coverage}(\mathcal{H}) = \frac{1}{|\mathcal{C}|} \sum\limits_{c_{j} \in \mathcal{C}} (1-\prod\limits_{s_{i} \in \mathcal{H}}\mu(\overline{s_{i}},c_{j})) $$
(7)

where \(\mu (\overline {s_{i}},c_{j})\) denotes the belief that suggestion s i does not support topic c j , with \(\mu (\overline {s_{i}},c_{j}) = 1-\mu (s_{i},c_{j})\). For a topic c and query suggestions s 1, \(s_{2} \in \mathcal {S}\), we assume that \(\mu (\overline {s_{1}},c)\) is independent of \(\mu (\overline {s_{2}},c)\). Thus \({\prod }_{s_{i} \in \mathcal {H}}\mu (\overline {s_{i}},c_{j})\) is the belief that topic c j is supported by none of the suggestions in \(\mathcal {H}\), and \(1-{\prod }_{s_{i} \in \mathcal {H}}\mu (\overline {s_{i}},c_{j})\) is the belief that topic c j is supported by at least one query suggestion in \(\mathcal {H}\).

However, coverage alone is insufficient for evaluating diversity. The redundancy of a suggestion set is another key factor: if one topic is over-supported by the suggestion set, diversity decreases.

In our example, the user query [Mobile phone, Price] is issued against a mobile phone database. Suppose one of its 3-element suggestion sets is {[iPhone 6s, Price], [iPhone 6s plus, Price], [Samsung galaxy S6, Price]}. These query suggestions support topics such as IOS system, Android system and Smart phone, but they are still not diverse enough, because they all support the topic Smart phone. With this suggestion set, users have no chance to see conventional (non-smart) mobile phones and their prices. Since conventional mobiles are normally much cheaper than smart ones, the suggestion set is likely to dissatisfy users on a tight budget, narrowing the users' horizons and hurting the diversity of the suggestions.

To maximize the diversity of the suggestion set and minimize user dissatisfaction, the suggestion set should not be dominated by a single topic. We say that a suggestion set \(\mathcal {H}\) is dominated by topic c if all suggestions in \(\mathcal {H}\) highly support c. The formal definition is:

Definition 7

(Domination): Given a suggestion set \(\mathcal {H}\), \(|\mathcal {H}|=k\), \(\mathcal {H}\) is dominated by topic c if \(|\mathcal {H} \cap \widetilde {sup}(c)|=k\).

To avoid this case, we set a budget of k − 1 for each topic: the number of query suggestions in suggestion set \(\mathcal {H}\) that highly support any one topic is at most k − 1. Namely, \(\forall c_{i} \in \mathcal {C}, |\mathcal {H} \cap \widetilde {sup}(c_{i})| \leq k-1\).

Thus, unlike prior work [1, 5, 19], we consider the diversity of a suggestion set from two aspects. On the one hand, we maximize the belief that each topic is supported by at least one query suggestion in the returned set \(\mathcal {H}\), the “at least” aspect. On the other hand, we constrain over-supported topics in \(\mathcal {H}\), the “at most” aspect. We then define the diversity function as:

$$\begin{array}{@{}rcl@{}} \textit{diversity}(\mathcal{H}) & = & \frac{1}{|\mathcal{C}|} \sum\limits_{c_{j} \in \mathcal{C}} \left(1-\prod\limits_{s_{i} \in \mathcal{H}}\mu(\overline{s_{i}},c_{j})\right), \\ && \text{if } \forall c_{i} \in \mathcal{C}, |\mathcal{H} \cap \widetilde{sup}(c_{i})| \leq k-1; \\ \textit{diversity}(\mathcal{H}) & = &-\infty, \text{ otherwise} \end{array} $$
(8)

If suggestion set \(\mathcal {H}\) satisfies the constraints, the diversity score of \(\mathcal {H}\) is the same as in formula (7). Otherwise, we define the diversity score of \(\mathcal {H}\) as −∞, which indicates that \(\mathcal {H}\) is an unfavorable suggestion set since it violates the constraints. In formula (8), we set a budget of k − 1 for each topic. Without loss of generality, a budget b i could be set for each topic c i more flexibly. For instance, we could set b i to an integer ranging from 1 to k − 1 in proportion to the size of \(\widetilde {sup}(c_{i})\): the more query suggestions support c i , the more elements of \(\widetilde {sup}(c_{i})\) are allowed to appear in suggestion set \(\mathcal {H}\). Of course, the upper bound is still k − 1.
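The two aspects of formulas (7) and (8) can be sketched as follows; the belief table and the threshold δ = 0.5 are illustrative assumptions.

```python
import math

def coverage(H, topics, mu):
    """Formula (7): average belief that each topic is supported by at
    least one suggestion in H, assuming independence across suggestions."""
    total = 0.0
    for c in topics:
        miss = 1.0                       # belief that no suggestion supports c
        for s in H:
            miss *= 1.0 - mu(s, c)
        total += 1.0 - miss
    return total / len(topics)

def diversity(H, topics, mu, delta=0.5):
    """Formula (8): coverage if no topic dominates H, -inf otherwise."""
    for c in topics:
        if sum(1 for s in H if mu(s, c) >= delta) == len(H):
            return -math.inf             # H is dominated by topic c
    return coverage(H, topics, mu)

# Illustrative belief table: both suggestions highly support topic c1.
table = {("s1", "c1"): 1.0, ("s1", "c2"): 0.2,
         ("s2", "c1"): 1.0, ("s2", "c2"): 0.9}
mu = lambda s, c: table[(s, c)]
d = diversity(["s1", "s2"], ["c1", "c2"], mu)   # -inf: dominated by c1
```

The set scores well on coverage alone (0.96 here) yet is rejected by the domination constraint, illustrating why coverage by itself is insufficient.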

6 Problem reformulation and finding the suggestion

We have defined the diversity and relevance functions on suggestion sets. Formula (2) can now be reformulated as:

$$\begin{array}{@{}rcl@{}} \mathcal{H}^{*} &=& \underset{\mathcal{H} \subseteq \mathcal{S},|\mathcal{H}| = k}{\mathrm{arg\,max}} \left\{\frac{\lambda}{|\mathcal{C}|} \cdot \sum\limits_{c_{j} \in \mathcal{C}} \left(1-\prod\limits_{s_{i} \in \mathcal{H}}\mu(\overline{s_{i}},c_{j})\right) \right. \\ &&\left. +\frac{(1-\lambda)}{k}\cdot \sum\limits_{s_{i} \in \mathcal{H}} rel(s_{i},q) \right\} \\ && \text{subject to: } \forall c_{i} \in \mathcal{C}, |\mathcal{H} \cap \widetilde{sup}(c_{i})| \leq k-1 \end{array} $$
(9)

The desired solution \(\mathcal {H}^{*}\) is the set of k query suggestions which satisfies the constraints and maximizes the objective function.

To find the solution, we look deeper into the nature of this problem. It is in fact a submodular function maximization problem subject to n matroid constraints, which is NP-hard. We first introduce the concept of a submodular function.

Definition 8

(Submodular): Let X be a finite ground set and f : \(2^{X} \to \mathbb {R}\). Then f is submodular if for all A, B ⊆ X,

$$f(A)+f(B) \geq f(A \cup B)+f(A \cap B). $$

An equivalent definition of submodularity is as follows: denote by \(f_{A}(i) = f(A + i) - f(A)\) the marginal value of i with respect to A. Then f is submodular if for all A ⊆ B ⊆ X and \(i \in X \setminus B\), \(f_{A}(i) \geq f_{B}(i)\).

Submodular set functions have the property of decreasing marginal gain as the size of the set increases, and they have been well studied in economics and combinatorial optimization. For our problem, we have the following theorem:

Theorem 1

The objective function in formula (9) is submodular and monotone (non-decreasing).

Proof

We first prove that the objective function f is monotone, namely, given two sets B ⊇ A, f(B) ≥ f(A). We have:

$$f(A) = \frac{\lambda}{|\mathcal{C}|} \cdot \sum\limits_{c_{j} \in \mathcal{C}} (1-\prod\limits_{s_{i} \in A}\mu(\overline{s_{i}},c_{j})) + \frac{(1-\lambda)}{k}\cdot \sum\limits_{s_{i} \in A} rel(s_{i},q)$$
$$f(B) = \frac{\lambda}{|\mathcal{C}|} \cdot \sum\limits_{c_{j} \in \mathcal{C}} (1-\prod\limits_{s_{i} \in B}\mu(\overline{s_{i}},c_{j})) + \frac{(1-\lambda)}{k}\cdot \sum\limits_{s_{i} \in B} rel(s_{i},q)$$

Since A ⊆ B,

$$\sum\limits_{c_{j} \in \mathcal{C}} (1-\prod\limits_{s_{i} \in B}\mu(\overline{s_{i}},c_{j})) \geq \sum\limits_{c_{j} \in \mathcal{C}} (1-\prod\limits_{s_{i} \in A}\mu(\overline{s_{i}},c_{j})),$$

and

$$\sum\limits_{s_{i} \in B} rel(s_{i},q) \geq \sum\limits_{s_{i} \in A} rel(s_{i},q) $$

Thus, we have f(B) ≥ f(A).

Now we prove that the objective function f is submodular. Denote by f A (s) = f(A + s) − f(A) the marginal gain of s with respect to set A. Given two sets A ⊆ B, we need to prove f A (s) ≥ f B (s). We have:

$$f(A) = \frac{\lambda}{|\mathcal{C}|} \cdot \sum\limits_{c_{j} \in \mathcal{C}} (1-\prod\limits_{s_{i} \in A}\mu(\overline{s_{i}},c_{j})) + \frac{(1-\lambda)}{k}\cdot \sum\limits_{s_{i} \in A} rel(s_{i},q)$$
$$f(A+s) = \frac{\lambda}{|\mathcal{C}|} \cdot \sum\limits_{c_{j} \in \mathcal{C}} (1-\prod\limits_{s_{i} \in A+s}\mu(\overline{s_{i}},c_{j})) + \frac{(1-\lambda)}{k}\cdot \sum\limits_{s_{i} \in A+s} rel(s_{i},q)$$
$$\begin{array}{@{}rcl@{}} f_{A}(s)&=& f(A+s)-f(A) \\ &=& \frac{\lambda}{|\mathcal{C}|} \cdot \sum\limits_{c_{j} \in \mathcal{C}} \prod\limits_{s_{i} \in A} \mu(\overline{s_{i}},c_{j}) \cdot (1-\mu(\overline{s},c_{j})) \\ && + \frac{1-\lambda}{k}\cdot rel(s,q) \end{array} $$

Similarly, we have:

$$\begin{array}{@{}rcl@{}} f_{B}(s)&=&\frac{\lambda}{|\mathcal{C}|} \cdot \sum\limits_{c_{j} \in \mathcal{C}} \prod\limits_{s_{i} \in B} \mu(\overline{s_{i}},c_{j}) \cdot (1-\mu(\overline{s},c_{j})) \\ && + \frac{1-\lambda}{k} \cdot rel(s,q) \end{array} $$

And since A ⊆ B, it holds that

$$\begin{array}{@{}rcl@{}} && \frac{\lambda}{|\mathcal{C}|} \sum\limits_{c_{j} \in \mathcal{C}} \prod\limits_{s_{i} \in B} \mu(\overline{s_{i}},c_{j}) \cdot (1-\mu(\overline{s},c_{j})) \\ && \leq \frac{\lambda}{|\mathcal{C}|} \sum\limits_{c_{j} \in \mathcal{C}} \prod\limits_{s_{i} \in A} \mu(\overline{s_{i}},c_{j}) \cdot (1-\mu(\overline{s},c_{j})) \end{array} $$

Eventually, we have f A (s) ≥ f B (s). □
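The diminishing-marginal-gain inequality established in the proof can also be checked numerically on a toy single-topic coverage instance (the belief values are illustrative assumptions):

```python
# Toy instance: one topic c, three suggestions with assumed beliefs mu(s, c).
mu = {"s1": 0.5, "s2": 0.4, "s3": 0.6}

def cov(H):
    """1 minus the belief that no suggestion in H supports topic c."""
    miss = 1.0
    for s in H:
        miss *= 1.0 - mu[s]
    return 1.0 - miss

A, B = ["s1"], ["s1", "s2"]          # A is a subset of B
gain_A = cov(A + ["s3"]) - cov(A)    # 0.80 - 0.50 = 0.30
gain_B = cov(B + ["s3"]) - cov(B)    # 0.88 - 0.70 = 0.18
assert gain_A >= gain_B              # f_A(s) >= f_B(s), as the proof shows
```

Adding s3 to the larger set B yields a smaller gain, exactly the submodular behavior the theorem asserts.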

Matroid constraints

In our problem, obtaining the desired suggestion set requires maximizing the objective function while satisfying the constraints. We now show that this is a submodular function maximization problem subject to n matroid constraints.

Matroids are combinatorial structures that generalize the notion of linear independence in matrices.

Definition 9

(Matroid): A matroid M = (N, \(\mathcal {I}\)) is a finite set N with a collection of sets \(\mathcal {I} \subseteq 2^{N}\), known as the independent sets, satisfying the following properties:

  • property 1: if \(A \in \mathcal {I}\) and B ⊆ A then \(B \in \mathcal {I}\).

  • property 2: if A,B \(\in \mathcal {I}\) and |B| > |A| then there exists an element b ∈ B ∖ A such that \(A \cup \{b\} \in \mathcal {I}\).

The sets in \(\mathcal {I}\) are called independent sets; by property 1, any subset of an independent set is independent. The union of all sets in \(\mathcal {I}\) is called the ground set. An independent set is called a basis if it is not a proper subset of another independent set.

Consider user query q with its query suggestion set \(\mathcal {S}=\{s_{1}, s_{2},\cdots ,s_{m}\}\) and associated topic set \(\mathcal {C}=\{c_{1}, c_{2},\cdots ,c_{n} \}\). For each topic c i , \(\widetilde {sup}(c_{i})\) is the set of query suggestions that highly support c i . We define the set \(\mathcal {I}_{i}\) as:

$$\mathcal{I}_{i} = \{ A \subseteq \mathcal{S}, |A|\leq k: |A \cap \widetilde{sup}(c_{i})| \leq k-1 \} $$

\(\mathcal {I}_{i}\) is a family of independent query suggestion sets that satisfy the constraint \(|A \cap \widetilde {sup}(c_{i})| \leq k-1\). We have the following theorem which can be easily proved.

Theorem 2

\(M_{i}=(\mathcal {S},\mathcal {I}_{i})\) is a matroid.

Since for \(\widetilde {sup}(c_{1})\), \(\widetilde {sup}(c_{2})\), ⋯, \(\widetilde {sup}(c_{n})\) there are n constraints, we define n matroids as follows:

$$\begin{array}{@{}rcl@{}} M_{1} &&=(\mathcal{S}, \mathcal{I}_{1}), \\ && \mathcal{I}_{1}= \{ A \subseteq \mathcal{S}, |A|\leq k: |A \cap \widetilde{sup}(c_{1})| \leq k-1 \} \\ M_{2} &&=(\mathcal{S}, \mathcal{I}_{2}), \\ && \mathcal{I}_{2}= \{ A \subseteq \mathcal{S}, |A|\leq k: |A \cap \widetilde{sup}(c_{2})| \leq k-1 \} \\ &&{\cdots} \\ M_{n} &&=(\mathcal{S}, \mathcal{I}_{n}), \\ && \mathcal{I}_{n}= \{ A \subseteq \mathcal{S}, |A|\leq k: |A \cap \widetilde{sup}(c_{n})| \leq k-1 \} \end{array} $$

M 1, M 2, ⋯, M n are matroids on the same ground set \(\mathcal {S}\). If \(\mathcal {H}\) is a feasible set, then \(\mathcal {H}\) must be a common independent set of all the \(\mathcal {I}_{i}\), namely \(\mathcal {H} \in \bigcap _{i=1}^{n}\mathcal {I}_{i}\).

Thus, our problem can be formulated as:

$$ \mathcal{H}^{*}=\mathrm{arg\,max}\{f(\mathcal{H}):\mathcal{H} \in \bigcap_{i=1}^{n}\mathcal{I}_{i} \} $$
(10)

where function f is submodular and monotone.

Our aim is to find the independent query suggestion set in \(\bigcap _{i=1}^{n}\mathcal {I}_{i}\) that maximizes the submodular and monotone function f. This is the problem of submodular function maximization subject to n matroid constraints. For this problem, Fisher et al. [9] show that finding the optimal set \(\mathcal {H}^{*}\) is NP-hard and that a greedy heuristic always produces a solution with approximation ratio \(O(\frac {1}{1+n})\). We employ this greedy strategy to design our approximation algorithm, described in Section 7.

7 Implementation of query suggestion algorithm

In this section, we present the algorithm for efficiently obtaining the suggestion set of k query suggestions. It adopts a greedy strategy to produce a near-optimal suggestion set with an approximation ratio of \(O(\frac {1}{1+n})\). The procedure is described in Algorithm 1.


In Algorithm 1, the suggestion set of k query suggestions is generated iteratively by the greedy strategy. In each iteration, it selects the query suggestion s* that maximizes the objective function such that \(\mathcal {H} \cup \{s^{*}\}\) is a common independent set in \(\bigcap _{i=1}^{n}\mathcal {I}_{i}\). This query suggestion is added to the suggestion set and deleted from the candidate set. The process repeats until k query suggestions are obtained. The k query suggestions in the suggestion set can then be ranked by their relevance scores, and a ranked list of k query suggestions is ultimately returned to the user.
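The greedy loop of Algorithm 1 can be sketched as follows. Here `f` and `independent` are placeholders for the objective of formula (9) and the matroid membership test; the toy objective and constraint in the usage example are purely illustrative.

```python
import math

def greedy_suggestions(cands, k, f, independent):
    """Greedy sketch of Algorithm 1: repeatedly add the candidate that
    maximizes objective f while H remains a common independent set
    (checked by `independent`), until k suggestions are chosen."""
    H, pool = [], list(cands)
    while len(H) < k and pool:
        best, best_score = None, -math.inf
        for s in pool:
            if independent(H + [s]):
                score = f(H + [s])
                if score > best_score:
                    best, best_score = s, score
        if best is None:
            break                    # no remaining candidate keeps H independent
        H.append(best)
        pool.remove(best)
    return H

# Toy run: f rewards covering distinct letters; the constraint forbids
# picking two strings that both contain 'a' (stand-in for a topic budget).
def f(H): return len(set("".join(H)))
def independent(H): return sum(1 for s in H if "a" in s) <= 1
picked = greedy_suggestions(["ab", "ac", "bd"], 2, f, independent)
```

After picking "ab", the candidate "ac" would violate the budget, so the greedy loop falls back to "bd", mirroring how the matroid constraints steer selection away from over-supported topics.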

Regarding time complexity, the relevance computation takes \(O(|\mathcal {S}|)\) and the diversity computation takes \(O(|\mathcal {S}| \cdot |\mathcal {C}|)\). Thus, the overall complexity is \(O(|\mathcal {S}| \cdot (|\mathcal {C}|+1))\).

8 Evaluation

The experiments are run on a Dell T7500 workstation with Intel Xeon X5650 2.67 GHz processors (12 cores in total), 24 GB of RAM, and a 1.4 TB hard disk. The algorithm described in this paper is implemented in Java 1.6.0.

Data sets

Two real-world datasets are used in our experiments: (1) IMDB (Internet Movie Database), which contains information about movies, actors and directors; and (2) DBLP, a dataset of computer science bibliography information.

Taxonomy knowledge

We extracted two taxonomies, for the movie and computer science domains, from YAGO (http://www.mpi-inf.mpg.de/yago-naga/yago), which consists of facts extracted from Wikipedia and integrated with the WordNet thesaurus. The details of the two taxonomies are shown in Table 2.

Table 2 Statistics of two taxonomies

Query load

For each dataset, we select abstract concepts or instances from the two taxonomies as keywords and randomly generate 30 queries. These queries are divided into three groups of 10 queries each, based on their number of candidate suggestions. Generally, queries with more candidate suggestions are more abstract. The query load information for the two datasets is shown in Figure 2. Queries in groups QG1 and QG2 have relatively few candidate suggestions, while queries in group QG3 have many more.

Figure 2: Query group information for the two datasets

Methods

We implemented our method, DivQSA (diverse query suggestion algorithm). For comparison, we also implemented MMR [4] as the baseline. MMR strives for diverse results by reducing redundancy: it selects the query suggestion that is relevant to the query and has minimal similarity to the previously selected suggestions. In our setting, we compute the similarity between two query suggestions s 1, s 2 as \(Sim(s_{1},s_{2}) = \frac {|topics(s_{1})\cap topics(s_{2})|}{|topics(s_{1})\cup topics(s_{2})|}\), where topics(s i ) is the set of topics highly supported by suggestion s i .
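The MMR baseline can be sketched as follows. The interpolation weight, relevance scores, and topic sets are illustrative assumptions; only the Jaccard topic similarity follows the formula given above.

```python
def jaccard(t1, t2):
    """Topic-overlap similarity Sim(s1, s2) used for the MMR baseline."""
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def mmr(cands, topics, rel, k, lam=0.5):
    """MMR sketch: greedily pick suggestions that are relevant to the
    query but least similar to the ones already selected."""
    H, pool = [], list(cands)
    while len(H) < k and pool:
        def score(s):
            penalty = max((jaccard(topics[s], topics[h]) for h in H),
                          default=0.0)
            return lam * rel[s] - (1 - lam) * penalty
        best = max(pool, key=score)
        H.append(best)
        pool.remove(best)
    return H

# Illustrative toy data: s1 and s2 share a topic, s3 does not.
topics = {"s1": {"ios"}, "s2": {"ios"}, "s3": {"android"}}
rel = {"s1": 0.9, "s2": 0.85, "s3": 0.8}
picked = mmr(["s1", "s2", "s3"], topics, rel, 2)
```

Even though s2 is more relevant than s3, MMR penalizes its full topic overlap with the already-selected s1 and picks s3 instead, which is the redundancy-reduction behavior described above.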

8.1 Varying parameter λ

The objective function f linearly combines the relevance and diversity functions through a tuning parameter λ ∈ [0, 1]. In this experiment, we investigate how the diversity and relevance scores of the suggestion sets change as λ varies from 0 to 1, with k = 10.

When λ = 1, the objective function relies only on the diversity factor; when λ = 0, it relies only on the relevance factor. To clearly show how the diversity and relevance scores change as λ varies, we report relative diversity and relative relevance scores, using the diversity score at λ = 1 and the relevance score at λ = 0 as reference scores. The relative score functions for query group QG i with respect to λ are defined as:

$$\textit{Relative}\_DIV(QG_{i})_{\lambda}=\frac{\textit{avg}.\textit{Diversity}(QG_{i})_{\lambda}}{\textit{avg}. \textit{Diversity}(QG_{i})_{\lambda=1}}$$
$$\textit{Relative}\_REL(QG_{i})_{\lambda}=\frac{\textit{avg}.\textit{Relevance}(QG_{i})_{\lambda}}{\textit{avg}. \textit{Relevance}(QG_{i})_{\lambda=0}}$$

where \(\textit{Relative}\_DIV(QG_{i})_{\lambda}\) is the average relative diversity score of the suggestion sets for query group \(QG_{i}\) with respect to λ, and \(\textit{Relative}\_REL(QG_{i})_{\lambda}\) is the corresponding average relative relevance score.

The \(\textit{Relative}\_DIV\) and \(\textit{Relative}\_REL\) scores for the three query groups with λ varying from 0 to 1 are shown in Figure 3. We are interested in the “sweet range” of λ: when λ falls in this range, both scores are high, which indicates that the suggestion set is of high quality. For query groups QG1 and QG2 (Figure 3a, b, d and e), when λ is in the range [0.3, 0.7], the \(\textit{Relative}\_DIV\) scores are high (> 0.9) while the \(\textit{Relative}\_REL\) scores also stay high (> 0.9). Thus, the sweet range for QG1 and QG2 is [0.3, 0.7]. For query group QG3, Figure 3c and f show that the sweet range is [0.5, 0.7] on both datasets. Queries in QG3 have many more candidate suggestions and topics. Since diversity is related to the topic coverage of the suggestion set, with a fixed λ and suggestion set size k, the topic coverage tends to be lower for user queries with more topics. Making the suggestion sets highly diverse for such queries therefore generally requires a larger λ, which is why the starting point of the sweet range for QG3 (whose queries have more topics) is larger.

Figure 3: Relative diversity and relevance scores of suggestions when varying λ from 0 to 1 with k = 10

8.2 Performance metrics

To properly measure the quality of the suggestion sets generated by our method, we need suitable metrics. Classic metrics such as NDCG (normalized discounted cumulative gain) and MRR (mean reciprocal rank) have been widely used in information retrieval for measuring search quality. However, these metrics focus only on the relevance of results and are not appropriate when diversity is taken into account. We therefore modify the classic metrics and propose new ones. We first introduce the classic NDCG and MRR metrics, and then discuss the new metrics we propose.

  • Classic metrics NDCG and MRR

Given a query Q and a ranked set of its query suggestions \(\mathcal {S}_{Q}\), the premise of discounted cumulative gain (DCG) is that highly relevant results appearing lower in a result list should be penalized: the relevance value is discounted logarithmically in proportion to the position of the result. The discounted cumulative gain at a rank threshold k is defined as:

$$DCG(\mathcal{S}_{Q},k) = \sum\limits_{j=1}^{k}\frac{2^{r(j)}-1}{log(1+j)}$$

where r(j) is the human judgment (0 = Bad, 1 = Fair, 2 = Good, 3 = Excellent) at rank position j. Let \(\mathcal {S}_{R}\) be the ideal ordering of all query suggestions of query Q, sorted in descending order of human judgment. The normalized discounted cumulative gain (NDCG) is computed as:

$$NDCG(\mathcal{S}_{Q},k) = \frac{DCG(\mathcal{S}_{Q},k)}{DCG(\mathcal{S}_{R},k)}$$

Given a query Q and its ranked set of query suggestions \(\mathcal {S}_{Q}\), the reciprocal rank (RR) of a query response is the inverse of the position of the first relevant result in \(\mathcal {S}_{Q}\). The mean reciprocal rank (MRR) is the average of the reciprocal ranks over a sample of queries.
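A compact implementation of these classic metrics (assuming, as is conventional, a base-2 logarithm in the discount) might look like:

```python
import math

def dcg(judgments, k):
    """DCG at cutoff k; judgments are graded 0 (Bad) .. 3 (Excellent),
    in ranked order."""
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(judgments[:k], start=1))

def ndcg(judgments, k):
    """NDCG: DCG of the given ranking over DCG of the ideal ranking."""
    ideal = dcg(sorted(judgments, reverse=True), k)
    return dcg(judgments, k) / ideal if ideal > 0 else 0.0

def mrr(ranked_relevance_lists):
    """Mean reciprocal rank: average inverse position of the first
    relevant result over a sample of queries (0 if none is relevant)."""
    def rr(flags):
        return next((1.0 / p for p, f in enumerate(flags, 1) if f), 0.0)
    return sum(rr(flags) for flags in ranked_relevance_lists) \
        / len(ranked_relevance_lists)
```

A ranking already in ideal order yields NDCG = 1, and a query whose first relevant suggestion is at position 2 contributes a reciprocal rank of 0.5 to the mean.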

  • Our proposed metrics, Div_NDCG and Div_MRR.

The ideal query suggestion set is not only relevant to the user query but also highly diverse with respect to topics. We modify the classic metrics by taking into account the evaluation of diversity, specifically, coverage and redundancy.

For each possible topic c, we label any query suggestion in \(\mathcal {S}_{Q}\) as “Bad” if it does not support topic c and compute its topic-dependent \(NDCG(\mathcal {S}_{Q},k|c)\). Meanwhile, we compute its redundancy factor as \(RD(\mathcal {S}_{Q},k|c) = \frac {1}{log(1+m)}\), where m is the number of suggestions in \(\mathcal {S}_{Q}\) that support topic c. The value of \(RD(\mathcal {S}_{Q},k|c)\) is reduced logarithmically proportional to m. Then, we obtain the average across all topics as:

$$Div\_NDCG(\mathcal{S}_{Q},k) = \frac{1}{|C|}\sum\limits_{c \in C}NDCG(\mathcal{S}_{Q},k|c)\cdot RD(\mathcal{S}_{Q},k|c)$$

The new metric \(Div\_NDCG(\mathcal {S}_{Q},k)\) given above not only keeps the properties of NDCG but also takes the diversity into account. Thus, it is applicable to our evaluation tasks. Similarly, we have the other new metric Div_MRR defined as:

$$Div\_MRR(\mathcal{S}_{Q},k) = \frac{1}{|C|}\sum\limits_{c \in C}MRR(\mathcal{S}_{Q},k|c)\cdot RD(\mathcal{S}_{Q},k|c)$$
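As a sketch of the computation (the helper names and data layout are illustrative, and the base-2 logarithm is an assumption; for simplicity the ideal ordering is taken over the ranked list itself rather than over all suggestions of Q), Div_NDCG averages the redundancy-discounted, topic-dependent NDCG over all topics:

```python
import math

def _dcg(judgments, k):
    # Base-2 logarithm assumed for the discount.
    return sum((2 ** r - 1) / math.log2(1 + j)
               for j, r in enumerate(judgments[:k], start=1))

def _ndcg(judgments, k):
    ideal = _dcg(sorted(judgments, reverse=True), k)
    return _dcg(judgments, k) / ideal if ideal > 0 else 0.0

def div_ndcg(ranked, topics, judgments, all_topics, k):
    """ranked: suggestion ids in ranked order; topics[s]: topic set of s;
    judgments[s]: 0..3 rating of s. For each topic c, suggestions not
    supporting c are relabeled Bad (0), the topic-dependent NDCG is
    discounted by RD = 1/log2(1+m) with m supporters of c, and the
    results are averaged over all topics."""
    total = 0.0
    for c in all_topics:
        per_topic = [judgments[s] if c in topics[s] else 0 for s in ranked]
        m = sum(1 for s in ranked[:k] if c in topics[s])
        if m == 0:
            continue  # uncovered topic contributes 0 to the average
        rd = 1.0 / math.log2(1 + m)
        total += _ndcg(per_topic, k) * rd
    return total / len(all_topics)
```

On a toy example, a set covering two topics scores higher than an equally relevant set in which both suggestions pile onto one topic, which is precisely the redundancy penalty the metric is designed to impose.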

8.3 Evaluation results of Div_NDCG and Div_MRR

In this experiment, we evaluate two variants of our method, the coverage-only DivQSA and the full DivQSA, against the baseline method MMR. MMR focuses on obtaining diverse results by reducing redundancy among them. The coverage-only variant considers only the coverage factor and does not take the redundancy factor into account; that is, for each topic there is no constraint on the maximal number of query suggestions supporting it. The full DivQSA takes both the coverage factor and the redundancy factor into account. We acquire the relevance of query suggestions from human subjects on the two datasets DBLP and IMDB; relevance is judged not only by a suggestion's relevance to the original query but also by its results in the database. The rank threshold k is set to 5, 10, and 15 in turn. Figure 4a and b show the Div_NDCG results on the two datasets. Both variants of our method outperform MMR significantly on both datasets. Since MMR focuses only on reducing redundancy among results, it cannot guarantee topic coverage; for many topics the topic-dependent NDCG scores are zero, which lowers MMR's overall scores. In contrast, both variants of DivQSA take the coverage factor into account and obtain higher scores. The full DivQSA additionally considers the redundancy factor, and thus obtains more diverse results and higher scores than the coverage-only variant. The Div_MRR results, shown in Figure 4c and d, are similar: both variants obtain significantly higher scores than MMR on both datasets.

Figure 4: Comparison of the two variants of our approach DivQSA with the baseline method MMR on the metrics Div_NDCG and Div_MRR

8.4 Case study

Table 3 shows a query [Database researchers, Database techniques] and its top-5 query suggestions generated by MMR and by our method DivQSA (with λ = 0.5). The suggestion set generated by MMR is less diverse: the database index techniques (R-tree and B-tree) appear three times, only three distinct database techniques appear among the top-5 suggestions, and the same researcher, Rudolf Bayer, appears twice. In contrast, the suggestion set generated by DivQSA is much more diverse, with five distinct database techniques and five distinct researchers. The suggestion set is thus of higher quality.

Table 3 A user query and its top-5 query suggestions by the baseline method MMR and our method DivQSA

8.5 Online running time

In this experiment, we study the online performance of our suggestion algorithm. Table 4 shows the results on the two datasets, with the solution set size k set to 10. Queries in group QG1 have relatively few candidate queries, while queries in group QG3 have more. The running time is within 2 seconds for each query group, which is satisfactory.

Table 4 On-line performance

9 Related work

Keyword query suggestion over databases

Conventional query suggestion methods over relational and XML databases [17, 18] suggest queries based only on maximizing relevance to the user query and the content of the database. In [18], the user's input query is rewritten into a more relevant query by token expansion; the proposed score function heuristically combines a spelling-error penalty with the TF/IDF scores of query segments. However, the score function finds the best correction for each keyword individually, ignoring the connectivity among the keywords, and only the relevance of suggestions is considered. Similarly, Lu et al. [17] propose a probabilistic framework for query suggestion over XML documents; they consider the connectivity among keywords when scoring suggestions, but do not take the diversity of suggestions into account. Unlike the studies in [17, 18], our framework suggests queries that maximize both relevance and diversity.

Query suggestion over web search

Query suggestion over web search engines often employs query logs and click-through data [7, 12]. Query-log-based techniques are suitable for systems with large user bases, such as web search engines with millions of users; for query suggestion over databases, however, the query logs are much smaller. Thus, in this paper we utilize external taxonomy knowledge to suggest queries.

Results diversification

Result diversification has been intensively studied in information retrieval and Web search, where the main goal is to obtain relevant yet mutually different documents as search results. Existing approaches can be categorized as novelty-based or coverage-based. Novelty-based approaches [4, 6, 21] compare documents to one another and select those that convey novel information. Coverage-based approaches [1, 5, 19] directly model the query aspects and seek a set of documents that maximally covers these aspects.

In the context of query suggestion over databases, the expected results are query suggestions instead of documents. Our approach represents the possible topics of the user query explicitly and seeks to maximize coverage of those topics, so it can be classified as coverage-based. However, purely coverage-based approaches ignore redundancy among the covered topics, which can allow a certain topic to be over-supported by query suggestions and degrade diversity. Unlike the purely coverage-based approaches [1, 5, 19], we therefore seek to avoid cases in which a topic is over-supported.

Unlike data-driven approaches, we use external taxonomies that provide domain knowledge for clustering and classifying the results, which leads to better performance.

Keyword search over databases

There are many approaches to keyword search over databases [2, 8, 10, 13, 16]. Among them, [2, 10, 13] focus on computing rooted trees as keyword search answers. BANKS [2] uses a backward search algorithm that expands backwards from the nodes containing keywords. BANKS-II [13] overcomes the drawbacks of BANKS by also searching forward from potential roots. BLINKS [10] introduces a bi-level index to speed up BANKS-II, which uses no index.

10 Conclusion

In this paper, we present a method that suggests relevant and diverse queries for a user query over databases. The objective function we define considers both the relevance factor and the diversity factor, and we show that finding an optimal suggestion set is a submodular function maximization problem subject to n matroid constraints. Our evaluation results demonstrate that the query suggestions obtained by our algorithm are better than those obtained by a suggestion method based purely on relevance ranking. In future work, we will extend the method to split long keyword queries into meaningful segments and obtain high-quality suggestions for them.