1 Introduction

The aim of semantic search is to deliver more relevant and focused responses, and in general an improved user experience, by understanding the searcher's intent and the context behind the query. Identifying entity mentions in text and subsequently linking them to the corresponding entries in a reference knowledge base (KB) is known as the task of entity linking. It can be performed on long texts (i.e., documents) or on very short texts such as web search queries; the latter is referred to as entity linking in queries (ELQ). It has been shown that leveraging entity annotations of queries is beneficial for various information retrieval tasks, including document retrieval [8, 31], entity retrieval [17, 28], and task understanding [32].

Entity linking has been extensively studied for long texts [25, 24, 7, 15, 21, 14]. Despite the large variety of approaches, two main components are present in all entity linking systems: (i) candidate entity ranking, i.e., identifying entities that can possibly be linked to a mention, and (ii) disambiguation, i.e., selecting the best entity (or none) for each detected mention. There is also a general consensus on the two main categories of features that are needed for effective entity linking: (i) contextual similarity between a candidate entity and the text surrounding the entity mention, and (ii) interdependence between all entity linking decisions in the text (extracted from the underlying KB). Previous studies [14, 4] have investigated these aspects in a unified framework and derived general recommendations for entity linking in documents. Entity linking in queries, however, has only recently started to draw attention [3, 16, 6], and such a systematic evaluation of the different components has not been conducted until now. With this study, we aim to fill that gap.

What is special about entity linking in queries? First, queries are short, noisy text fragments in which the ambiguity of a mention may not be resolvable because of the limited context. That is, a mention can possibly be linked to more than one entity (see Table 1 for examples). This is unlike entity linking in documents, where it is assumed that there is enough context for disambiguation. Second, ELQ is an online process that happens at query time, meaning that it must be performed under strict time constraints (in contrast with traditional entity linking, which is performed offline). The ideal solution is not necessarily the most effective one, but the one that represents the best trade-off between effectiveness and efficiency. Therefore, the same techniques that have been used for entity linking in documents may not be suitable for queries. We formulate the following two research questions:

  • RQ1. Given the response time requirements of an online setting, what is the relative importance of candidate entity ranking vs. disambiguation? In other words, if we are to allocate the available processing time between the two, which one would yield the highest gain?

  • RQ2. Given the limited context provided by queries, which group of features is needed the most for effective entity disambiguation: contextual similarity, interdependence between entities, or both?

To answer the above research questions, we set up a framework where different candidate entity ranking and disambiguation methods can be plugged in. For each of these components, we experiment with both unsupervised and supervised alternatives, resulting in a total of four different ELQ systems. Our candidate entity ranking and disambiguation methods draw on, and extend further, ideas from the existing literature. Supervised methods are expected to yield high effectiveness coupled with lower efficiency, while for unsupervised approaches it is the other way around. Our results reveal that it is more beneficial to use supervised learning for the candidate entity ranking step. If this step provides high-quality results, then disambiguation can be successfully tackled with a simple and elegant greedy algorithm. Moreover, our analysis shows that entity interdependencies provide little help for disambiguation. This is an interesting finding as it stands in contrast to the established postulation for entity linking in documents. Consequently, we identify a clearly preferred approach that uses supervised learning for candidate entity ranking and an unsupervised algorithm for disambiguation. Using the evaluation platform of the Entity Recognition and Disambiguation (ERD) challenge [3], we show that our preferred approach performs on a par with the current state of the art.

The main contribution of this paper is to present the first systematic investigation of the ELQ task by bringing together the latest entity linking techniques and practices in a unified framework. In addition, we develop a novel supervised approach for entity disambiguation in ELQ, which encompasses various textual and KB-based relatedness features. Finally, we make a best practice recommendation for ELQ and demonstrate that our recommended approach achieves state-of-the-art performance. The resources developed with this paper are made available at http://bit.ly/ecir2017-elq.

Table 1. Example queries with their linked entities. Each set represents an interpretation of the query; ambiguous queries have multiple interpretations (i.e., multiple table rows).

2 Related Work

Early work on entity linking relied on the contextual similarity between the document and the candidate referent entities [24, 7]. Milne and Witten [25] introduced the concepts of commonness and relatedness, which are generally regarded as two of the most important features for entity linking. In contrast to early systems that disambiguate one mention at a time, collective entity linking systems exploit the relatedness between entities and disambiguate all entity mentions in the text jointly [21, 15, 29, 19]. Since entity linking is a complex process, several attempts have been made to break it down into standard components and compare systems in a single framework [14, 4]. In particular, Hachey et al. [14] reimplemented three prominent entity linking systems in a single framework and found that much of the performance variation between these systems stems from the candidate entity ranking step (called searcher in their framework). We follow the final recommendation of their study and divide the entity linking task into two main steps, candidate entity ranking and disambiguation, to perform a systematic investigation of entity linking in queries.

Recognizing and disambiguating entities in short texts, such as tweets and search snippets, has only recently gained attention [11, 23, 13]. Entity linking in queries (ELQ) is particularly challenging because of the inherent ambiguity (see Table 1). Deepak et al. [9] addressed ELQ by assigning a single entity to a mention. The Entity Recognition and Disambiguation (ERD) [3] challenge framed ELQ as the task of finding multiple interpretations of the query, and this was followed in subsequent studies [16, 6, 18]. Hasibi et al. [16] proposed generative models for ranking and disambiguating entities. The SMAPH system [6], on the other hand, “piggybacks” on a web search engine to rank entities, and then disambiguates them using a supervised collective approach. We consider the key features of these previous studies in a single system in order to perform a comprehensive comparison of the two main ELQ components (candidate entity ranking and disambiguation) with respect to both efficiency and effectiveness. We, however, do not include the piggybacking technique as its reliance on an external search service would seriously hinder the efficiency of the entity linking process in our setup.

3 Entity Linking in Queries

The task of entity linking in queries (ELQ) is to identify, given an input query q, a set of entity linking interpretations \(I = \{E_1, \dots , E_m\}\), where each interpretation \(E_i = \{(m_1, e_1), \ldots , (m_k, e_k)\}\) consists of a set of mention-entity pairs. Mentions within \(E_i\) are non-overlapping and each mention \(m_j\) is linked to an entity \(e_j\) in a reference knowledge base. By way of illustration, the output of ELQ for the query “new york pizza manhattan” would be \(I = \{E_1, E_2\}\), where \(E_1 =\) {(new york pizza, New York-style pizza), (manhattan, Manhattan)} and \(E_2 =\) {(new york, New York), (manhattan, Manhattan)}. Following [3, 16], we restrict ourselves to detecting proper noun entities and do not link general concepts (e.g., “Pizza”).

We frame the ELQ problem as a sequence of the following two subtasks: candidate entity ranking (CER) and disambiguation. The first subtask takes the query q and outputs a ranked list of mention-entity pairs along with the corresponding scores. The second subtask takes this list as input and forms the set of entity linking interpretations I. For each subtask, we present two alternatives: unsupervised and supervised. The resulting four possible combinations are compared experimentally in Sect. 5.1.

3.1 Candidate Entity Ranking

This subtask is responsible for (i) identifying all possible entities that can be linked in the query and (ii) ranking them based on how likely they are to be link targets (in any interpretation of the query). The objective is to achieve both high recall and high precision at early ranks, as the top-ranked entity-mention pairs obtained here will be used directly in the subsequent disambiguation step. Using lexical matching of query n-grams against a rich dictionary of entity name variants allows for the identification of candidate entities with close to perfect recall [16]. We follow this approach to obtain a list of candidate entities together with their corresponding mentions in the query. Our focus of attention below is on ranking these candidate (m, e) pairs with respect to the query, i.e., estimating score(m, e, q).
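To make the candidate identification step concrete, the following minimal sketch (in Python) matches query n-grams against an in-memory surface form dictionary. The dictionary contents, variable names, and entity identifiers are illustrative assumptions, not the actual implementation.

```python
# Hypothetical surface form dictionary: surface form -> {entity: commonness}
SF_DICT = {
    "new york": {"New_York": 0.90, "New_York_City": 0.05},
    "new york pizza": {"New_York-style_pizza": 0.80},
    "manhattan": {"Manhattan": 0.70, "Manhattan_(cocktail)": 0.20},
}

def candidate_mention_entity_pairs(query):
    """Match every contiguous query n-gram against the surface form
    dictionary and return the candidate (mention, entity) pairs."""
    terms = query.lower().split()
    candidates = []
    for i in range(len(terms)):
        for j in range(i + 1, len(terms) + 1):
            mention = " ".join(terms[i:j])
            for entity in SF_DICT.get(mention, {}):
                candidates.append((mention, entity))
    return candidates

print(candidate_mention_entity_pairs("new york pizza manhattan"))
```

The resulting pairs are then scored with respect to the query, as described next.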

Unsupervised. For the unsupervised ranking approach, we take a state-of-the-art generative model, specifically, the MLMcg model proposed by Hasibi et al. [16]. This model considers both the likelihood of the given mention and the similarity between the query and the entity: \(score(m,e,q) = P(e|m)P(q|e)\), where P(e|m) is the probability of a mention being linked to an entity (a.k.a. commonness [22]), computed from the FACC collection [12]. The query likelihood P(q|e) is estimated using the query length normalized language model similarity [20]:

$$\begin{aligned} P(q|e) = \frac{\prod _{t \in q} P(t|\theta _e)^{P(t|q)}}{\prod _{t \in q} P(t|C)^{P(t|q)}}, \end{aligned}$$
(1)

where P(t|q) is the term's relative frequency in the query (i.e., n(t, q)/|q|). The entity and collection language models, \(P(t|\theta _e)\) and P(t|C), are computed using the Mixture of Language Models (MLM) approach [27].
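A minimal sketch of the MLMcg scorer under stated assumptions: commonness statistics and unigram language models are taken as precomputed dictionaries, and the entity model \(P(t|\theta _e)\), which is in fact an MLM mixture over fields, is treated here as a single term-to-probability dictionary for brevity. The smoothing constants are illustrative.

```python
import math

def query_likelihood(query_terms, entity_lm, collection_lm):
    """Length-normalized query likelihood of Eq. (1):
    prod_t P(t|theta_e)^P(t|q) / prod_t P(t|C)^P(t|q)."""
    n = len(query_terms)
    log_score = 0.0
    for t in set(query_terms):
        p_tq = query_terms.count(t) / n        # P(t|q), relative term frequency
        p_te = entity_lm.get(t, 1e-9)          # (smoothed) P(t|theta_e)
        p_tc = collection_lm.get(t, 1e-9)      # P(t|C)
        log_score += p_tq * (math.log(p_te) - math.log(p_tc))
    return math.exp(log_score)

def mlm_cg_score(mention, entity, query, commonness, entity_lm, collection_lm):
    """MLMcg: score(m, e, q) = P(e|m) * P(q|e)."""
    p_em = commonness.get((mention, entity), 0.0)   # commonness P(e|m)
    return p_em * query_likelihood(query.lower().split(), entity_lm, collection_lm)
```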

Table 2. Feature set used for ranking entities, categorized into mention (M), entity (E), mention-entity (ME), and query (Q) features.

Supervised. Our supervised approach employs learning-to-rank (LTR), where each (query, mention, entity) triple is described using a set of features. The ranking function is trained on a set of mention-entity pairs with binary labels, with positive labels denoting the correctly annotated entities for the given query. We use a total of 28 features from the literature [23, 6], which are summarized in Table 2.
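For illustration, the sketch below sets up a Random Forest used in a pointwise fashion as the LTR ranker, with the tree and feature settings described in Sect. 4.2. Feature extraction is omitted; the feature matrix and labels are random placeholders, so this only shows the ranker configuration, not the actual training pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X: one row of the 28 CER features per (query, mention, entity) triple;
# y: binary labels (1 = correctly annotated entity for the query).
X_train = np.random.rand(200, 28)          # placeholder feature matrix
y_train = np.random.randint(0, 2, 200)     # placeholder labels

ranker = RandomForestRegressor(
    n_estimators=1000,                     # number of trees (cf. Sect. 4.2)
    max_features=max(1, int(0.1 * 28)),    # ~10% of the feature set size
    random_state=42,
)
ranker.fit(X_train, y_train)

X_test = np.random.rand(10, 28)
scores = ranker.predict(X_test)            # higher score = stronger candidate
ranking = np.argsort(-scores)              # indices of candidates, best first
```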

3.2 Disambiguation

The disambiguation step is concerned with the formation of entity linking interpretations \(\{E_1,\ldots , E_m\}\). Similar to the previous step, we examine both unsupervised and supervised alternatives, by adapting existing methods from the literature. We further extend the supervised approach with novel elements.

Unsupervised. We employ the greedy algorithm introduced in [16], which forms interpretations in three consecutive steps: (i) pruning, (ii) containment mention filtering, and (iii) set generation. In the first step, the algorithm takes the ranked list of mention-entity pairs and discards those with a ranking score below the threshold \(\tau _s\). This threshold is a free parameter that controls the balance between precision and recall. The second step removes containment mentions (e.g., “kansas city mo” vs. “kansas city”) by keeping only the highest scoring one. Finally, interpretations are built iteratively by processing mention-entity pairs in decreasing order of score and adding each pair to the existing interpretation \(E_i\) with the smallest index i in which the mention does not overlap with any mention already in \(E_i\); if no such interpretation exists, a new interpretation is created.
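A compact sketch of the greedy algorithm. Token-level overlap and string containment tests are our simplifications of the span-based checks, and restricting containment filtering to mentions linked to the same entity is an assumption made so that alternative longer/shorter readings (e.g., “new york pizza” vs. “new york”) can still end up in different interpretations.

```python
def overlaps(m1, m2):
    """Simplified overlap test: mentions overlap if they share a query term."""
    return bool(set(m1.split()) & set(m2.split()))

def greedy_interpretations(ranked_pairs, tau_s):
    """Greedy set generation over (mention, entity, score) tuples,
    assumed to be sorted by descending score."""
    # (i) pruning: discard pairs scoring below the threshold tau_s
    pairs = [(m, e, s) for (m, e, s) in ranked_pairs if s >= tau_s]
    # (ii) containment mention filtering: keep only the highest-scoring of two
    # containing mentions (restricted here to pairs linked to the same entity)
    kept = []
    for m, e, s in pairs:
        if any(e == e2 and m != m2 and (m in m2 or m2 in m)
               for (m2, e2, _) in kept):
            continue
        kept.append((m, e, s))
    # (iii) set generation: add each pair to the first interpretation whose
    # mentions it does not overlap; otherwise open a new interpretation
    interpretations = []
    for m, e, _ in kept:
        for E in interpretations:
            if all(not overlaps(m, m2) for (m2, _) in E):
                E.append((m, e))
                break
        else:
            interpretations.append([(m, e)])
    return interpretations
```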

Supervised. The overall idea is to generate all possible interpretations from a ranked list of mention-entity pairs, and then employ a binary classifier to collectively select the most pertinent interpretations. Our approach is similar in spirit to the top performing contender in the ERD challenge [6], as they also select interpretations using a collective supervised approach. However, we consider only the top-K mention-entity pairs (obtained from the CER step) and generate all possible interpretations out of those, requiring that mentions within the same interpretation do not overlap with each other. The value of K is set empirically, and it largely depends on the effectiveness of the CER step: if CER has high precision, then K can be low, while less effective approaches can be compensated for with higher K values.
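The interpretation generation step can be sketched as follows: enumerate all non-empty subsets of the top-K mention-entity pairs and keep those whose mentions are pairwise non-overlapping. The token-based overlap test is again a simplification of the span-based check.

```python
from itertools import combinations

def _overlaps(m1, m2):
    """Simplified overlap test: mentions overlap if they share a query term."""
    return bool(set(m1.split()) & set(m2.split()))

def generate_interpretations(top_k_pairs):
    """Enumerate every non-empty subset of the top-K (mention, entity) pairs
    whose mentions are pairwise non-overlapping; each surviving subset is one
    candidate interpretation, later scored by the binary classifier."""
    interpretations = []
    for r in range(1, len(top_k_pairs) + 1):
        for subset in combinations(top_k_pairs, r):
            mentions = [m for (m, _) in subset]
            if all(not _overlaps(a, b) for a, b in combinations(mentions, 2)):
                interpretations.append(list(subset))
    return interpretations

pairs = [("new york pizza", "New_York-style_pizza"),
         ("new york", "New_York"),
         ("manhattan", "Manhattan")]
for E in generate_interpretations(pairs):
    print(E)
```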

Table 3. Feature set used in the supervised disambiguation approach. Type is either query dependent (QD) or query independent (QI).

Once the candidate sets are generated, each is represented by a feature vector. We devise two main families of features: (i) set-based features are computed for the entire interpretation set, and (ii) entity-based features are calculated for individual entities. Features in the first group are computed collectively on all entities of the set and measured as a single value, while the members of the second group need to be aggregated (we use min, max, avg as aggregators). It is worth noting that each interpretation typically consists of very few entities. Therefore, considering all entities for computing set-based features is feasible; it also captures more information than one could get from aggregated pair-wise similarity features. Table 3 summarizes our feature set.

We highlight two novel and important features. SetSim(E, q) measures the similarity between all entities in the interpretation E and the query q:

$$\begin{aligned} SetSim(E, q) = P(q|\theta _E) =&\frac{\prod _{t \in q} P(t|\theta _E)^{P(t|q)}}{\prod _{t \in q} P(t|C)^{P(t|q)}}. \end{aligned}$$
(2)

It is calculated similarly to Eq. (1), the main difference being that the probability of each term is estimated based on the interpretation’s language model \(P(t|\theta _E)\):

$$\begin{aligned} P(t|\theta _E) =&\sum _{e \in E}\sum _{f \in F} \mu _f P(t|\theta _{e_f}). \end{aligned}$$

In a similar vein, ContextSim(e, q) measures the similarity between the entity and the query context, where the query context is the “rest” of the query, i.e., the query without the mention \(m_e\) that corresponds to entity e. Formally:

$$\begin{aligned} ContextSim(e, q) = P(q-m_e|e), \end{aligned}$$
(3)

where \(P(q-m_e|e)\) is computed using Eq. (1).
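A sketch of these two features, reusing the query_likelihood helper from the CER sketch above. The per-entity, per-field language models and the mixture weights \(\mu_f\) are assumed to be precomputed dictionaries, and the smoothing constants are illustrative.

```python
import math

def set_similarity(interpretation_lms, field_weights, query_terms, collection_lm):
    """SetSim(E, q) of Eq. (2). interpretation_lms holds one entry per entity
    in E, each mapping field -> {term: P(t|theta_{e,f})}; field_weights holds
    the mixture weights mu_f."""
    n = len(query_terms)
    log_score = 0.0
    for t in set(query_terms):
        p_tq = query_terms.count(t) / n
        # P(t|theta_E) = sum_{e in E} sum_{f in F} mu_f * P(t|theta_{e,f})
        p_tE = sum(field_weights[f] * lm[f].get(t, 1e-9)
                   for lm in interpretation_lms for f in field_weights)
        p_tC = collection_lm.get(t, 1e-9)
        log_score += p_tq * (math.log(p_tE) - math.log(p_tC))
    return math.exp(log_score)

def context_similarity(entity_lm, collection_lm, query, mention):
    """ContextSim(e, q) of Eq. (3): Eq. (1) evaluated on the query with the
    entity's own mention removed (reuses query_likelihood from the CER sketch)."""
    context = [t for t in query.lower().split()
               if t not in mention.lower().split()]
    return query_likelihood(context, entity_lm, collection_lm)
```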

4 Experimental Setup

In this section we describe our data sources, settings of methods, and evaluation metrics.

4.1 Data

Knowledge base. We employ DBpedia 3.9 as our reference knowledge base and build an index of all entities that have both rdfs:label and dbo:comment predicates. The index includes the following set of fields: \(\mathcal {F} = \){title, content, rdfs:label, dbo:wikiPageWikiLink, rdfs:comment, dbo:abstract}, where title is the concatenation of rdfs:label, foaf:name and dbo:wikiPageRedirects predicates, and content holds the content of all predicates of the entity; the remaining fields correspond to individual predicates.

Surface form dictionary. To recognize candidate entities in queries, we employ a rich surface form dictionary, which maps surface forms to entities. We utilize the FACC entity-annotated web corpora [12] and include surface forms above a commonness threshold of 0.1 [16]. Additionally, we add DBpedia name variants as surface forms; i.e., entity names from rdfs:label, foaf:name, and dbo:wikiPageRedirects predicates [7, 11, 16]. We confine our dictionary to entities present in the Freebase snapshot of proper named entities, provided by the ERD challenge [3].
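A minimal sketch of how such a dictionary could be assembled from annotation counts and DBpedia name variants. The input formats and the commonness value assigned to DBpedia-only variants are illustrative assumptions, not specified by the construction above.

```python
from collections import defaultdict

def build_surface_form_dict(facc_counts, dbpedia_names, threshold=0.1):
    """Build a surface form -> {entity: commonness} dictionary.
    facc_counts:   {(surface_form, entity): link count} from annotated web corpora
    dbpedia_names: {entity: [name variants]} from rdfs:label, foaf:name, redirects
    FACC-derived surface forms are kept only if commonness >= threshold."""
    totals = defaultdict(int)
    for (sf, _), c in facc_counts.items():
        totals[sf] += c
    sf_dict = defaultdict(dict)
    for (sf, e), c in facc_counts.items():
        commonness = c / totals[sf]
        if commonness >= threshold:
            sf_dict[sf][e] = commonness
    for e, names in dbpedia_names.items():
        for name in names:
            # DBpedia name variants are added directly; the commonness value
            # assigned here (1.0) is an illustrative placeholder
            sf_dict[name.lower()].setdefault(e, 1.0)
    return dict(sf_dict)
```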

Test Collections. We evaluate our methods on two publicly available test collections: Y-ERD [16] and ERD-dev [3]. The former is based on the Yahoo Search Query Log to Entities (YSQLE) dataset and consists of 2,398 queries. All results on this collection are obtained by performing 5-fold cross validation. The ERD-dev collection contains 91 queries and is released as part of the ERD challenge [3]. We apply the models trained on the whole Y-ERD collection to the ERD-dev queries and report on the results. In addition, ERD provides an online evaluation platform based on a set of 500 queries (referred to as ERD-test); the corresponding annotations are not released. We use ERD-test to evaluate how our recommended system performs against the current state of the art.

4.2 Methods

Candidate Entity Ranking. For the unsupervised method (MLMcg), we follow [26] and use the title and content fields, with weights 0.2 and 0.8, respectively. For the supervised method (LTR), we employ the Random Forest (RF) [2] ranking algorithm and set the number of trees to 1000 and the maximum number of features to \(10\%\) of the size of the feature set [23]. We further include two baseline methods for reference comparison: (i) MLM is similar to MLMcg, but does not consider the commonness score, i.e., it is computed using Eq. (1); (ii) CMNS ranks entities based on the commonness score, while prioritizing longer mentions, and has been shown to be a strong baseline [23, 1, 16].

Disambiguation. The unsupervised disambiguation method (Greedy) involves a score threshold parameter, which is set (using a parameter sweep) depending on the CER method used: 20 for MLMcg and 0.3 for LTR. For the supervised disambiguation method (LTR), we set the number of top-ranked entities K to 5 (based on a parameter sweep) and use an RF classifier with settings similar to those of the supervised CER method. For baseline comparison, we consider the top-3 performing systems from the ERD challenge: SMAPH [6], NTUNLP [5], and Seznam [10].

4.3 Evaluation

As both precision and recall matter for the candidate entity ranking step, we evaluate our methods using Mean Average Precision (MAP), recall at rank 5 (R@5), and precision at rank 1 (P@1). When evaluating CER, we are only concerned with the ranking of entities; therefore, we consider each entity only once, with its highest scoring mention: \(score(e, q) = \max _{m \in q} score(m, e, q)\). For the disambiguation step, we measure the end-to-end performance using set-based metrics (precision, recall, and F-measure), according to the strict evaluation metrics in [16]. As for efficiency, we report the average processing time per query, measured in seconds. The experiments were conducted on a machine with an Intel Xeon E5 2.3GHz 12-core processor, running Ubuntu Linux v14.04. Statistical significance is tested using a two-tailed paired t-test. We mark improvements with \(^{\vartriangle }\) (\(p<0.05\)) or \(^{\blacktriangle }\) (\(p<0.01\)), deteriorations with \(^{\triangledown }\) (\(p<0.05\)) or \(^{\blacktriangledown }\) (\(p<0.01\)), and no significance with \(^{\circ }\).
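For illustration, below is a simplified per-query rendering of the strict set-based evaluation, where a system interpretation counts as correct only if it exactly matches a gold interpretation; the precise definitions (including the averaging over queries) follow [16].

```python
def strict_prf(system_sets, gold_sets):
    """Simplified strict evaluation for one query: system_sets and gold_sets
    are lists of interpretations, each a collection of (mention, entity) pairs."""
    sys = [frozenset(E) for E in system_sets]
    gold = [frozenset(E) for E in gold_sets]
    if not sys and not gold:
        return 1.0, 1.0, 1.0            # nothing to link, nothing linked
    matched = sum(1 for E in sys if E in gold)
    p = matched / len(sys) if sys else 0.0
    r = matched / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```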

5 Results and Analysis

In this section we report on our experimental results and answer our research questions.

5.1 Results

We start by evaluating the candidate entity ranking and disambiguation steps and then answer our first research question: “Given the response time requirements of an online setting, what is the relative importance of candidate entity ranking vs. disambiguation?”

Table 4. Candidate entity ranking results on the Y-ERD and ERD-dev datasets. Best scores for each metric are in boldface. Significance for line \(i>1\) is tested against lines \(1..i-1\).

Candidate Entity Ranking. Table 4 presents the results for CER on the Y-ERD and ERD-dev datasets. We find that commonness is a strong performer (in line with the findings of [1, 16]). Combining commonness with MLM in a generative model (MLMcg) delivers excellent performance, with MAP above 0.85 and R@5 around 0.9. The LTR approach brings further slight improvements, which are significant for Y-ERD. This means that both of our CER methods (MLMcg and LTR) are able to find the vast majority of the relevant entities and return them at the top ranks.

Disambiguation. Table 5 reports on the disambiguation results. We use the naming convention X-Y, where X refers to the CER method (MLMcg or LTR) and Y refers to the disambiguation method (Greedy or LTR) that is applied on top. Our observations are as follows. The MLMcg-Greedy approach is clearly the most efficient but also the least effective one. Learning is more expensive for disambiguation than for CER (see LTR-Greedy vs. MLMcg-LTR); yet, it is also clear from this comparison that more performance can be gained when learning is done for CER than when it is done for disambiguation. The most effective method is LTR-Greedy, outperforming the other approaches significantly on both test sets; it is also the second most efficient one. Interestingly, even though the MLMcg and LTR entity ranking methods perform equally well according to the CER evaluation (cf. Table 4), we observe a large difference in their performance when the Greedy disambiguation approach is applied on top of them. The reason is that the absolute scores produced by LTR are more meaningful than those of MLMcg (despite the query length normalization efforts for the latter; cf. Eq. (1)). This plays a direct role in Greedy disambiguation, where score thresholding is used. We note that the reported efficiency results are meant for comparison across the different approaches; for practical applications, further optimizations to our basic implementation would be needed (cf. [1]).

Table 5. End-to-end performance of ELQ systems on the Y-ERD and ERD-dev query sets. Significance for line \(i>1\) is tested against lines \(1..i-1\).
Table 6. ELQ results on the official ERD test platform.

Based on these results, LTR-Greedy is our overall recommendation. We compare this method against the top performers of the ERD challenge (using the official challenge platform); see Table 6. For this comparison, we additionally applied spell checking, as this is also handled in the top performing system (SMAPH-2) [6]. The results show that our LTR-Greedy approach performs on a par with the state-of-the-art systems. This is remarkable, considering the simplicity of the Greedy disambiguation algorithm compared to the considerably more complex solutions employed by others.

Answer to RQ1. Our results reveal that candidate entity ranking is of higher importance than disambiguation for ELQ. Hence, it is more beneficial to perform the (expensive) supervised learning early on in the pipeline for the seemingly easier CER step; disambiguation can then be tackled successfully with an unsupervised (greedy) algorithm. (Note that selecting the top ranked entity does not yield an immediate solution; as shown in [16], disambiguation is an indispensable step in ELQ.)

5.2 Feature Analysis

We now analyze the features used in our supervised methods and answer our second research question: “Given the limited context provided by queries, which group of features is needed the most for effective entity disambiguation?” For the sake of completeness, we also report feature importance for the CER step, even though that does not directly relate to the above RQ. Figure 1(a) shows the top features used in the LTR entity ranking approach in terms of Gini score. We observe that Matches, Commonness, and the various query similarity features play the main role in the entity ranking function. As for the supervised disambiguation step, which is our main focus here, we selected the top 15 features independently for the MLMcg-LTR and LTR-LTR methods; interestingly, we ended up with the exact same set of features. Figure 1(b) demonstrates that nearly all influential features are query dependent; the only query independent features are P and H, capturing the co-occurrence of entities in web corpora.

Answer to RQ2. We conclude that contextual similarity features are the most effective for entity disambiguation. This is based on two observations: (i) the unsupervised (Greedy) method takes only the entity ranking scores as input, which are computed based on the contextual similarity between entity and query; (ii) the supervised (LTR) method relies most on query-dependent features. This is an interesting finding, as it stands in contrast to the common postulation in entity linking in documents that interdependence between entities helps to better disambiguate them. Entity interdependence features (and, in general, collective disambiguation methods) are more helpful when sufficiently many entities are mentioned in the text; this is not the case for queries.

Fig. 1. Most important features used in the supervised approaches, sorted by Gini score: (Left) candidate entity ranking, (Right) disambiguation.

6 Conclusion

In this paper, we have performed the first systematic investigation of entity linking in queries (ELQ). We have developed a framework where different methods can be plugged in for two core components: candidate entity ranking and disambiguation. For each of these components, we have explored both unsupervised and supervised alternatives by employing and further extending state-of-the-art approaches. Our experiments have led to two important findings: (i) it is more rewarding to employ supervised learning for candidate entity ranking than for disambiguation, and (ii) entity interdependence features, which are the essence of collective disambiguation methods, have little benefit for ELQ. Overall, our findings not only reveal important insights, but also provide guidance as to where future research and development in ELQ should be focused.