1 Introduction

The explosion of intelligent software assistants such as Apple Siri, Microsoft Cortana, OK Google, and WolframAlpha has made query answering part of our daily lives: anyone connected to a digital tool is looking for answers to questions. These demands increase sharply after major events related to pandemics (COVID-19), tsunamis, terrorism, politics, sport, and entertainment. Query Answering Systems (QASs) are one of the key technologies responding to these demands. A QAS aims at providing a concise response to a natural language query (NLQ). The use of natural language has contributed to the popularization of QASs, and their development has become a topical issue. Searching Google Scholar, we found more than 15 surveys published over 2016–2020, mainly focusing on QASs in the age of the Web of Data. A survey published in 2017 [30] reports a surprising number of 62 QASs developed since 2010.

The Web of Data revives a characteristic exploited by the first generation of QASs, which considered corpora of structured data repositories mainly restricted to a closed domain. Baseball [21] is considered one of the first QASs; it answered questions about baseball games played in the USA. The explosion of Semantic Web technologies has helped publish a large amount of structured data on the Web in the form of Knowledge Graphs (KGs) [23], including Linked Open Data. One of the main objectives of QASs over KGs is to make this valuable data accessible and usable by end-users [44]. There is a variety of KGs covering several domains. In [5], a classification of KGs is given, where three main categories are distinguished: (1) generalist KGs such as DBpedia [3], FreeBase [9], and Google Knowledge Graph [56]; (2) specialized KGs associated with a specific field (Facebook Knowledge Graph, Amazon Knowledge Graph, and Central Banks); and (3) enterprise KGs such as Enterprise Knowledge Graph.Footnote 1 This type of data is well structured thanks to the Resource Description Framework (RDF) model (triple: \({<}subject,predicate,object{>}\)) and can be queried using the SPARQL language. In a KG, each node represents an entity (e.g. a person, a place, or a concept), and each edge label represents a relation linking entities (e.g. birthplace, which links two entities: a person and a place). Several QASs over KGs have been proposed [30]; examples include KEQA [33], Qanswer [14], GFMed [39] and SINA [53]. Developing a QAS is time-consuming, whatever the type of its corpus (structured data, semi-structured data, free textual documents, Web data, or multimodal repositories). Indeed, it comprises several complex components, and most new QASs are developed from scratch. A paper published at the WWW 2018 conference, summarizing the results of an EU H2020 project, raises the question of increasing the reuse of already available QAS components [55]. This suggests several directions to be taken into account by researchers when developing a new QAS: (i) promoting modularity in the development of a QAS; (ii) making each developed component available; and (iii) facilitating their reuse. Adopting these directions facilitates the comparison of existing systems and their reuse in teaching activities, in the spirit of Nachos, which comprises various modules implementing the functionalities of a basic operating system. With this motivation in mind, this work focuses on ambiguity handling, a crucial component in processing an NLQ over RDF KGs and a prerequisite for QAS accuracy: this accuracy is limited by the ability of NLQ processors to handle the ambiguity that such queries may contain [15, 52]. Ambiguity occurs when a user’s query contains words that have more than one meaning. For example, in the query “Which books were written by Jack London?”, the entity “Jack London” must be linked to the novelist “dbr:Jack_London”Footnote 2 rather than the boxer “dbr:Jack_London_(boxer)”. Ambiguity can be classified into four main categories [8]: lexical, syntactic, semantic, and pragmatic.

A lexical ambiguity arises when a word has more than one generally accepted meaning, as in homonymy (words with the same form but unrelated meanings) or polysemy (words with the same form referring to different but related meanings), the best-known case. A syntactic ambiguity arises from how the sentence is structured. A semantic ambiguity arises when a sentence has more than one interpretation, even if no lexical or syntactic ambiguity appears in it. A pragmatic ambiguity appears when a sentence can have more than one meaning in the same context. This paper addresses the polysemy of named entities. Indeed, several KG resources may have the same name, so polysemy arises when an entity in the NLQ can be linked to several entities in the KG. The process of selecting the correct meaning is referred to as Entity Linking or Named Entity Disambiguation (NED). NED consists of assigning named entities in a text to entity identifiers in a KG and generally comprises two phases [54]: Candidate generation and Disambiguation.

The first phase generates the candidate resources to which the entity can refer, while the second ranks and filters the candidates to select the best one for each detected entity. The NED task concerns both long and short texts. Several research efforts have focused on long texts [7, 19, 37, 40, 47, 49, 51, 60]. To choose the most relevant sense of a word, we generally refer to its context, i.e. the words surrounding the ambiguous word. This textual context provides information about the ambiguous word and plays an important role in disambiguation. Therefore, some approaches have analysed the entity’s context and computed a textual similarity score to remove ambiguity [40]. Other studies have measured the relationship between the entities in the input text to link them collectively to the corresponding resources. Recently, short texts, and particularly queries in QASs, have attracted attention because of their limited context [26, 46]. However, because of the limited information they provide, textual similarity alone is not a sufficient solution. Moreover, short texts generally contain only one entity, which makes approaches relying on inter-entity coherence inapplicable. Given the limited context provided by short texts, are semantic and syntactic features extracted from this context sufficient for entity disambiguation? What is the impact of each feature on the disambiguation process? What is the best combination of these features to reach the highest accuracy?

This paper extends our previous work [11], which focused only on the context of the named entity to link it to its corresponding resource in the KG. First, the user query was expanded by retrieving synonyms from WordNet in order to address the context shortness problem. Then, the similarity between the context of the entity and each candidate’s context was computed in order to select the best one. However, exploiting the context alone is not sufficient to reach high accuracy.

Consequently, in the present work, additional techniques are used to reinforce the semantic aspects. We use relational information to better capture the semantic relation between the entity and the candidates. Two aspects are taken into account: (a) the coherence between entities and (b) the exploitation of relations. The distance between the name of the entity and that of a candidate is also considered, as well as syntactic features. In this paper, a complete system named WeLink is built for NED in short texts in general and QAS queries in particular. A score-based disambiguation algorithm is presented, which uses semantic and syntactic metrics to rank the candidates. WeLink is implemented using two different methods of entity recognition: lexical entity recognition and n-gram. A component-based architecture is adopted to ensure the flexibility of the system. Experiments are conducted on five well-known datasets to demonstrate the effectiveness of WeLink, and the results obtained are very encouraging. Finally, WeLink is available as an open REST APIFootnote 3 and its source code is published on GitHub.Footnote 4

This paper is organized as follows: Sect. 2 overviews and analyses existing studies using relevant criteria. Section 3 introduces the fundamental concepts including KGs, QASs, and NED. Section 4 presents WeLink, our proposed approach for dealing with the NED problem in short texts. Section 5 details the implementation aspects of WeLink. Section 6 presents our intensive experiments comparing WeLink and state-of-the-art systems. Section 7 shows our perspectives and concludes the paper.

2 Related work

A variety of NED approaches and systems have been proposed over the years [54]. Many existing approaches link named entities in long texts (documents, news corpora, etc.) [2, 7, 19, 29, 34, 37, 40, 45, 47, 49, 51, 60]. These approaches generally remove ambiguity (i) by exploiting the text around the entity and calculating contextual similarity [40] and (ii) by assuming that the input document refers to coherent entities, observing all entities in the text and exploiting this coherence to perform collective entity linking [60]. One of the best-known systems for long texts is DBpedia Spotlight [40]. This system identifies the entity using a list of surface forms and then generates candidates from DBpedia. It then uses the surrounding context (paragraphs) to disambiguate the entity. To do this, Spotlight builds a Vector Space Model (VSM) representation of the candidate resources with tfidf weights and ranks them according to the cosine similarity between their context vectors and the text surrounding the entity. This approach was later integrated into a QAS [18].

Recently, short texts, and especially queries within QASs, have attracted more attention because of their limited context [1, 46]. To perform NED in queries, EARL [17] defines the context of the entity by observing the relations around it. The system implements two different strategies to solve the NED task. First, the NED task is formalized as an instance of the Generalized Travelling Salesman Problem (GTSP), which is solved using the approximate GTSP solver Lin–Kernighan–Helsgaun (LKH). Second, it uses machine learning to exploit the connection density between nodes for disambiguation. TAGME [20] is a well-known NED system for short texts. After detecting the anchors in the input text, the system performs disambiguation using a voting scheme that calculates a score for each anchor-candidate match. It then prunes the candidate annotations to filter out the least relevant ones. This approach is based on the relation between entities, but QAS queries usually contain only one entity; therefore, this solution may not be sufficient. Machine learning approaches such as EARL and TAGME depend on training data, and most existing training data are suited to long texts. Furthermore, the performance of these approaches decreases when the input text differs from the training domain. Falcon [50] jointly performs relation linking and entity linking for QAS questions over DBpedia. For a given question, it identifies the entity and generates a list of candidates. The system then ranks these resources by creating triples of candidate entities and relations and checking whether these triples exist in the KG. The strength of this approach is also its weakness: if the triple does not exist, neither the entity nor the relation is linked.

Hence, none of the systems mentioned extends the context of the named entity. The proposed approach enriches the user’s query to overcome the shortness problem and exploits the context similarity generally used in NED for long texts. In addition, it exploits both relations and entities in the NED task. It captures the correspondence between the expanded words in the query and the properties of a resource. For queries containing more than one entity, it also takes into account the coherence between entities. In the example “In which region of the United States is Georgia?”, the two entities “United States” and “Georgia” must be linked to the resources “dbr:United_States” and “dbr:Georgia_(U.S._state)”, respectively. Similarly, the “region” relation can be linked to the “dbo:region” property. We assume that these three aspects (context, relations, and entities) must be used together to achieve high accuracy in the NED task. Moreover, this work emphasizes the importance of syntactic aspects by prioritizing capitalized words and using entity length. Table 1 compares the main approaches available in the literature with our proposed approach.

Table 1 Comparison of the approaches available in the literature with the proposed approach

3 Background

This section overviews the fundamental concepts related to KGs, QASs, and NED in short texts over a KG. A single running example illustrates the definitions to facilitate their presentation and, especially, their interplay.

3.1 Knowledge graphs

Knowledge graphs have recently garnered significant attention from both industry and academia for capturing, representing, storing, and exploring structured knowledge. Several definitions of KGs have been proposed in the literature [31]. A recent consensual definition of a KG was proposed in [31] as follows:

Definition 1

A KG is viewed as a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent relations between these entities.

In general, the RDF model is used to represent these nodes and edges through its triples. A triple is the smallest unit of data in RDF. It models a single statement about resources with the structure \(\langle subject, predicate, object \rangle \), indicating that a relationship identified by the predicate (also known as a property) holds between the subject and the object, which depict Web resources (things, documents, concepts, numbers, strings, etc.).

Example 1

The statement “The author of The Law of Life is Jack London” can be represented by the triple \(\langle The\ Law\ of\ Life, has\_author, Jack\ London \rangle \).Footnote 5 This triple can be represented logically as a graph where two nodes (subject and object) are joined by a directed arc (predicate), as shown in Fig. 1a.

To query KGs and RDF datasets, the W3C defined SPARQL [25] as the standard query language for RDF. SPARQL allows expressing queries across diverse datasets. The simplest SPARQL queries are formed as a conjunction of triple patterns (known as Basic Graph Patterns, BGPs). Triple patterns are similar to RDF triples except that the subject, predicate, or object may be a variable.

Example 2

The query in Fig. 1b asks for the author of “The Law of Life” and his birthdate over the DBpedia KG. The result of this query is a subgraph of the queried graph(s) in which the variables are mapped to values from the matched subgraph; processing a SPARQL query can thus be viewed as a subgraph matching problem. The results of our query are the mappings \(?x \rightarrow \) <http://dbpedia.org/resource/Jack_London> and \(?y \rightarrow \) "1876-01-12".
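A minimal sketch of a query in the spirit of Fig. 1b is given below as a Python string; the identifiers dbr:The_Law_of_Life, dbo:author, and dbo:birthDate are assumptions, since the exact query is shown only graphically in the figure.

# Hedged reconstruction of a Fig. 1b-style query: two triple patterns (a BGP)
# sharing the variable ?x; the identifiers used here are assumptions.
QUERY_FIG_1B = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?x ?y WHERE {
  dbr:The_Law_of_Life dbo:author    ?x .   # who wrote "The Law of Life"?
  ?x                  dbo:birthDate ?y .   # and when was that author born?
}
"""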

Fig. 1 Examples for RDF and SPARQL

This variety and wealth of KGs motivate researchers and industry to develop intelligent services in several domains such as social networks [28], recommender systems [41], COVID-19 management [42], recruitment [27], and QASs [32]. In the next section, we introduce QASs and how they work.

3.2 Question answering systems (QASs)

A QAS allows a user to ask a question q composed of a set of words \(q =\{ w_{1}, w_{2},\ldots ,w_{|q|}\}\) and provides a concise answer to \(q\). A QAS aims at returning a specific answer to the user rather than a list of relevant documents. It transforms a question posed as an NLQ into a SPARQL query and extracts the answer by querying an information source, usually a KG. Each word \(w_{i}\) in the user question can correspond to a resource \(w_{i} \in S\), a property \(w_{i} \in P\), or an object \(w_{i} \in O\).

Example 3

Consider the question “Which books were written by Jack London?” over DBpedia KG. Listing 3 shows the SPARQL query that corresponds to this question.


Once the SPARQL query is executed over DBpedia, the QAS returns the books written by “Jack London” to the user, including “The Law of Life”.
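As an illustration of how such a query can be executed programmatically (the exact query in Listing 3 may differ slightly; dbo:author is an assumption), the following sketch runs it against the public DBpedia endpoint using the SPARQLWrapper client:

# Minimal sketch: retrieve the books written by Jack London from DBpedia.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?book WHERE { ?book dbo:author dbr:Jack_London . }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["book"]["value"])   # e.g. http://dbpedia.org/resource/The_Law_of_Life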

3.3 Named entity disambiguation

Named entities are defined as real-world objects that can be denoted by a proper name and associated with a type such as Person, Organization, or Place. A mention (also called an entity mention) is a span of text that refers to a named entity in a given text. A mention is often ambiguous because it can refer to different entities. The process of linking an entity mention to the corresponding KG resource is known as Named Entity Disambiguation or Entity Linking. Given a text containing a set of named entity mentions M \(=\{ m_{1}, m_{2},\ldots ,m_{n}\}\) and a KG with a set of resources R \(=\{ r_{1}, r_{2},\ldots ,r_{k}\}\), each mention \(m_i \in M\) has to be linked to a resource \(r_j \in R\).

Example 4

For our query “Which books were written by Jack London?” (cf. Fig. 2), the named entity M \(=\) {“Jack London”} has to be correctly identified and linked to the KG resource “dbr:Jack_London” (the novelist) rather than “dbr:Jack_London_(boxer)” (the boxer).

Fig. 2 An example of named entity ambiguity in a question

In some cases, the input texts do not contain named entities. For example, “Who has produced the most films?”. Thus, the NED system should not return any resource. In addition, the named entity may not have a corresponding KG resource. These cases are defined as unlinkable, and the system should return NIL [54].

4 The proposed approach

In this work, WeLink is proposed as an entity disambiguation approach. It consists of three modules (Fig. 3): Query analysis, Candidate generation, and Disambiguation. The Query analysis module makes the user’s query exploitable. It also allows the extraction of the entity mention and its features. The Candidate generation module queries the KG to select candidates for an entity mention. The Disambiguation module uses a scoring algorithm to select the most relevant KG resource.

Fig. 3 The WeLink approach

Example 5

Before detailing all the steps of our proposal, let us illustrate each step with the previous example “Which books were written by Jack London?”. The latter contains an entity mention that must be recognized and linked. WeLink works according to the following steps:

  1. Query analysis: WeLink starts by pre-processing the input text, where the named entity “Jack London” is recognized. In addition, NLP tasks are applied to the query to make it exploitable for query expansion to generate the entity context (see Sect. 4.1).

  2. Candidate generation: a set of candidates is generated using a SPARQL query over DBpedia exploiting the identified named entity “Jack London”.

  3. Disambiguation: a scoring algorithm is used to select the most relevant candidate. The algorithm assigns a weight according to semantic and syntactic features to rank the candidates. As a result, the resource “dbr:Jack_London” is selected.

4.1 Query analysis

Query analysis involves the use of NLP tasks to refine the input text and make it analysable. As shown in Fig. 4, the Query analysis phase is carried out by two separate tasks: Entity Recognition and Query Expansion.

Fig. 4 Query analysis

4.1.1 Entity recognition

The objective of the named entity recognition (NER) task is to identify named entities in a given text. This important step influences the disambiguation process: if an entity is incorrectly detected, it is unlikely to be correctly linked, and if the entity is not detected at all, the whole process stops [35]. In this work, two different methods are used and analysed to identify entities: Lexical Entity Recognition (LER) and n-grams. LER is mainly based on the lexical features of the input text: proper nouns are selected according to the part of speech of the words. However, this method does not recognize words that are not capitalized. The n-gram method extracts all possible token combinations. It can detect entities in the input text but generates several overlaps: the relevant entity can be detected, but the isolated words that make it up are also detected and can be linked to KG resources. In the previous example, the entity “Jack London” is identified, but “Jack” and “London” are also recognized. In addition, meaningless word combinations are generated, which increases processing time. To deal with this, first, we do not consider verbs, as they cannot be named entities. Second, we give priority to the longest word combination, so as to favour the longest token rather than the words that make it up. Third, so as not to penalize single words, we assign a score to capitalized words, as they usually denote proper nouns.
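The sketch below illustrates this n-gram filtering; the function name, the use of NLTK, and the scoring weights are illustrative assumptions rather than the exact WeLink code.

# Hypothetical sketch of the n-gram mention extraction described above.
# Requires the NLTK data packages 'punkt' and 'averaged_perceptron_tagger'.
from nltk import word_tokenize, pos_tag

def candidate_mentions(question, max_len=3):
    tagged = pos_tag(word_tokenize(question))
    # Discard verbs and punctuation: they cannot be part of a named entity.
    tagged = [(w, t) for w, t in tagged if t[0].isalpha() and not t.startswith("VB")]
    grams = []
    for n in range(max_len, 0, -1):                 # longest combinations first
        for i in range(len(tagged) - n + 1):
            words = [w for w, _ in tagged[i:i + n]]
            # Bonus for length and for capitalized words (likely proper nouns).
            score = n + sum(1 for w in words if w[:1].isupper())
            grams.append((" ".join(words), score))
    return sorted(grams, key=lambda g: -g[1])

# candidate_mentions("Which books were written by Jack London?") ranks
# "Jack London" above the isolated words "Jack" and "London".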

4.1.2 Query expansion

Query expansion is a well-known technique for improving the effectiveness of information retrieval. In general, queries are short and ambiguous. Expanding the user’s query consists of adding similar terms to the initial text to retrieve more relevant information [4] and better represent the user’s intention. Despite its advantages, expansion has been widely used in information retrieval but rarely in QASs [30]. Moreover, if the terms used by the user do not match those used in the KG, this leads to a lexical gap problem. Therefore, to expand the query and simultaneously reduce the lexical gap between the user’s words and the resource labels, we use WordNet [43], which has been widely used for query expansion [4]. To this end, the input text is first pre-processed and then expanded, as shown in Fig. 5.

  • Pre-processing

    A pipeline process is used to execute the following NLP techniques: first, contractions and punctuation are removed; then, the query is tokenized into words; next, stop words are removed. A POS tagger is applied, and tags are filtered to keep only nouns, verbs, and adjectives to get their synonyms later.

  • Expansion

    The obtained keywords are searched in WordNet for their synonyms. Synonymy is considered the main relation between words in WordNet, and synonyms are grouped into synsets with a brief description. We extract the synonyms of each word from the input text, together with the definition of the detected named entity, to compose its context [11] (see the sketch after this list).
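The following is a sketch of the pre-processing and expansion steps; the helper name and the exact NLTK calls are assumptions, and the real WeLink pipeline may differ (e.g. it also removes contractions and adds the definition of the detected named entity).

# Illustrative sketch of query expansion with WordNet. Requires the NLTK data
# packages 'punkt', 'stopwords', 'averaged_perceptron_tagger', and 'wordnet'.
import string
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords, wordnet

KEPT_TAGS = ("NN", "VB", "JJ")   # nouns, verbs, adjectives

def expand_query(question):
    # Pre-processing: strip punctuation, tokenize, remove stop words, POS-filter.
    text = question.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in word_tokenize(text) if t.lower() not in stopwords.words("english")]
    keywords = [w for w, tag in pos_tag(tokens) if tag.startswith(KEPT_TAGS)]
    # Expansion: add the WordNet synonyms of every kept keyword.
    expanded = {w.lower() for w in keywords}
    for word in keywords:
        for synset in wordnet.synsets(word):
            expanded.update(l.name().replace("_", " ") for l in synset.lemmas())
    return expanded

# expand_query("Which books were written by Jack London?") adds synonyms such
# as "volume" (for "books") and "compose" (for "written") to the context.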

Fig. 5 Query expansion

4.2 Candidate generation

Candidate generation consists of retrieving a set of KG resources that potentially match the entity mention. These candidates are then refined to retain the relevant resources during the Disambiguation (see Sect. 4.3). To accomplish this crucial step [24, 54] and retrieve the relevant KG resources, we follow the steps used in [16] to have a rich treatment of entity name variations:

  • The exact match between the mention and the resource title. The aim is to find a complete string match between the mention and the resource.

  • The partial match between the mention and the resource title. The aim is to identify the mentions included in the resource title.

  • Acronyms are used to retrieve resources that correspond to the first letters of a mention. For example, “US” refers to “United States”.

  • Alternative names allow the extraction of resources that have different names but refer to the same entity. It also includes synonyms, acronyms, and possible spelling mistakes.

In addition, we use entity types. In previous work, we already used entity types in two different ways:

  • Exploiting types in the candidate generation phase by including the type in the SPARQL query [10]. This reduces ambiguity and, in some cases, removes it completely.

  • The similarity between entity types and candidate types for filtering candidates [11], based on the assumption that an entity can be associated with several types. Thus, an entity may have multiple but related types. For example, London is a City, a Location, a Capital, etc.

In our analysis, we note that the second method is ineffective due to the lexical gap between entity and resource types. Therefore, we use the first method, which consists of using types during the Candidate generation phase. We restrict the entity types to a predefined list (e.g. Person, Organization, and Place) to reduce the search space and thus limit the number of candidates.

4.3 Disambiguation

A score is assigned to each generated candidate to rank the candidates according to the following features: context similarity, coherence between entities, relations exploitation, entity name distance, and syntactic features. With the context similarity, we compute the semantic similarity between the context of the mention and the context of each candidate. We then use two further semantic scores: coherence between entities and exploitation of relations. Coherence between entities captures the semantic relation between the entities in the query and the resources related to a candidate, while the exploitation of relations measures the similarity between the words of the query and the properties of the candidate. Finally, we calculate the distance between the mention name and the candidate name and use syntactic features.

4.3.1 Context similarity

The similarity between the entity and candidate contexts is the most intuitive way to address the ambiguity problem. The mention context is the textual information related to the entity mention (see Sect. 4.1.2), while the candidate context is the document associated with a candidate. In this work, the candidate context corresponds to the value of the “dbo:abstract” property, a short description extracted from the corresponding Wikipedia article, which allows a rich contextual representation. For example, the candidate “dbr:London” has the following abstract considered to be its context: “London is the capital and most populous city of England and the United Kingdom”.

The abstract’s length varies from one resource to another, and it is generally longer than the entity mention context, which is usually a single sentence. To provide a balance, we use only the first sentence of each abstract, which contains descriptive terms [12]. We use cosine similarity with a normalized version of tfidf (term frequency-inverse document frequency) to calculate the similarity between contexts; tfidf serves as a weighting factor for ranking candidate entities. The tfidf weight for a term t in context c is the product of two factors: the term frequency (tf) and the inverse document frequency (idf). As the contexts have different lengths, the measurements would have a high variance; therefore, we use a logarithmically scaled term frequency to normalize them [38]. The term frequency \(tf_{t,c}\) is the number of occurrences of term t in context c. The weighted term frequency is calculated as follows:

$$\begin{aligned} wf(t,c)= {\left\{ \begin{array}{ll} log(1+tf_{t,c}), &{} \text{ if } tf_{t,c}>0 \\ \text{0, } &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$
(1)

The inverse document frequency (idf) reflects the importance of a word in the collection: it down-weights frequent terms and gives more weight to rare ones.

$$\begin{aligned} idf(t,C)= {\left\{ \begin{array}{ll} log\frac{C}{df_t}, &{} \text{ if } df_t>0 \\ \text{0, } &{} \text{ otherwise } \end{array}\right. } \end{aligned}$$
(2)

where C is the total number of contexts and \(df_t\) is the number of contexts in which the term t appears.

Consequently, the tfidf weight for a term t in a context c of a collection of contexts C is defined as:

$$\begin{aligned} tfidf_{t,c}=wf(t,c) \times idf(t,C) \end{aligned}$$
(3)

We use a vector space model (VSM) to represent each context as a vector in a multidimensional space [40]. Therefore, we derive a context vector weighted by the normalized tfidf. Then, we use cosine similarity to measure the cosine of the angle \(\theta \) between each candidate context vector \(\overrightarrow{v}(c_c)\) and the vector of the entity mention context \(\overrightarrow{v}(c_m)\) as follows:

$$\begin{aligned} cosine(c_c,c_m) = \frac{\overrightarrow{v}(c_c) . \overrightarrow{v}(c_m)}{||\overrightarrow{v}(c_c)|| \ ||\overrightarrow{v}(c_m)||}. \end{aligned}$$
(4)
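The sketch below implements Eqs. (1)-(4) on plain token lists; the example contexts and values are invented for illustration and do not come from DBpedia.

# Minimal sketch of the context-similarity score of Eqs. (1)-(4).
import math

def wf(term, context):
    tf = context.count(term)                     # Eq. (1): log-scaled tf
    return math.log(1 + tf) if tf > 0 else 0.0

def idf(term, contexts):
    df = sum(1 for c in contexts if term in c)   # Eq. (2)
    return math.log(len(contexts) / df) if df > 0 else 0.0

def tfidf_vector(context, contexts, vocabulary):
    return [wf(t, context) * idf(t, contexts) for t in vocabulary]   # Eq. (3)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))       # Eq. (4)
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# The mention context comes from query expansion, the candidate context from
# the first sentence of dbo:abstract; the token lists below are invented.
mention_ctx   = "book books write written author jack london".split()
cand_novelist = "jack london was an american novelist and author of adventure books".split()
cand_boxer    = "jack london was a british boxer".split()
contexts = [mention_ctx, cand_novelist, cand_boxer]
vocab = sorted(set().union(*contexts))
sim = {name: cosine(tfidf_vector(mention_ctx, contexts, vocab),
                    tfidf_vector(ctx, contexts, vocab))
       for name, ctx in [("novelist", cand_novelist), ("boxer", cand_boxer)]}
# sim["novelist"] > sim["boxer"]: the novelist's abstract shares the
# discriminative terms "author" and "books" with the expanded query context.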

4.3.2 Coherence between entities

When the entity “dbr:Barack_Obama” appears in a text, it is more likely that the mention “Michelle” represents his wife “dbr:Michelle_Obama” as the two entities are semantically related [36] and tend to occur more frequently together. The input text largely refers to coherent entities from a specific topic, and this coherence is exploitable for collectively disambiguating entities that appear in the same text. Therefore, for each entity mention, the entities in the same document are considered important for its disambiguation [54]. Different methods have been used to measure the entity coherence, one of which is Jaccard similarity [Eq. (5)] [22].

$$\begin{aligned} jaccard(a, b)= \frac{|a \cap b|}{|a \cup b|} \end{aligned}$$
(5)

However, this study aims to determine the extent to which a candidate is similar to the entity mention, to capture the coherence between the entities in the query \(e_m\) and the resources related to a candidate \(e_c\). Coherence is therefore measured as follows:

$$\begin{aligned} coh(e_m, e_c)= \frac{|e_m \cap e_c|}{|e_m|} \end{aligned}$$
(6)

Due to the limited context provided by short texts, the measure will only be effective if the input text contains more than one named entity.

4.3.3 Relations exploitation

Relations exploitation involves measuring the similarity between query terms and the properties of each candidate. In the example “Give me the birth place of Frank Sinatra”, we seek to capture the relatedness between the term “birth place” and the property “dbo:birthPlace”. To do so, we consider \(\delta _m= rel(m_{i}, m_{i+1})\), the relation terms between two mentions (i.e. query terms that are not named entities), and \(\delta _c = rel(c_{i}, r_{i})\), the properties that link a candidate \(c_{i}\) to another resource \(r_{i}\). We exploit the same equation used previously [Eq. (6)] as follows:

$$\begin{aligned} coh(\delta _m, \delta _c)= \frac{|\delta _m \cap \delta _c|}{|\delta _m|}. \end{aligned}$$
(7)
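Both Eq. (6) and Eq. (7) are overlap scores normalized by the mention side; a small sketch, with illustrative item sets, follows.

# Overlap score used in Eqs. (6) and (7): the fraction of mention-side items
# (entities, or relation terms) that are covered by the candidate side.
def overlap(mention_items, candidate_items):
    mention_items, candidate_items = set(mention_items), set(candidate_items)
    if not mention_items:
        return 0.0
    return len(mention_items & candidate_items) / len(mention_items)

# Eq. (6): entities in the query vs. resources related to a candidate
# (the related-resource set below is invented for illustration).
coh_entities = overlap({"united states", "georgia"},
                       {"united states", "atlanta", "appalachian mountains"})   # 0.5
# Eq. (7): relation terms in the query vs. (normalized) candidate properties.
coh_relations = overlap({"region"}, {"region", "country", "capital"})            # 1.0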

4.3.4 Name distance

With name distance, the aim is to capture the dissimilarity between the entity mention \(name_m\) and the candidate label \(name_c\). To achieve this, we use the Levenshtein distance between the two strings, scaled by the length of the longest one [Eq. (8)]. The name distance score is subtracted from the total score to favour resources whose labels are similar to the named entity.

$$\begin{aligned} lev(name_m, name_c)= \frac{levenshtein(name_m, name_c)}{max(|name_m|,|name_c|)}. \end{aligned}$$
(8)
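A compact sketch of Eq. (8), using a pure-Python dynamic-programming Levenshtein distance (a dedicated library such as python-Levenshtein could be used instead):

def levenshtein(a, b):
    # Classic dynamic programming over two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def name_distance(name_m, name_c):
    # Eq. (8): Levenshtein distance scaled by the length of the longest string.
    if not name_m and not name_c:
        return 0.0
    return levenshtein(name_m, name_c) / max(len(name_m), len(name_c))

# name_distance("Jack London", "Jack London (boxer)") is about 0.42, while the
# exact match gives 0.0, so the boxer's label is penalized in the total score.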
Table 2 Scores description used for candidates ranking

4.3.5 Syntactic features

We assign a weight to the named entity mention to capture its syntactic features. As mentioned in Sect. 4.1.1, we prioritize the longest sequences of words, len(e), and capitalized words, cl(e). Finally, the overall disambiguation score \(\alpha \) is assigned to each candidate [Eq. (9)]. Table 2 summarizes the features used to rank the entities.

$$\begin{aligned} \alpha = cosine(c_c,c_m)+coh(e_m,e_c)+coh(\delta _m,\delta _c) + len(e) + cl(e) - lev(name_m,name_c) \end{aligned}$$
(9)

Algorithm 1 presents the WeLink procedure for ranking candidates. After retrieving all the candidates, the disambiguation score \(\alpha \) [see Eq. (9)] is computed for each one. Then, the candidates are ranked based on their \(\alpha \) score. Finally, the candidate with the highest score is selected as the target resource.

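A hypothetical outline of this ranking step is sketched below; the candidate dictionaries and their feature values are invented for illustration and are not the actual Algorithm 1 listing.

# Hypothetical outline of Algorithm 1: rank candidates by the score of Eq. (9).
def disambiguate(candidates):
    # Each candidate carries the per-feature scores summarized in Table 2.
    def alpha(c):   # Eq. (9)
        return (c["cosine"] + c["coh_entities"] + c["coh_relations"]
                + c["len"] + c["cl"] - c["lev"])
    if not candidates:
        return None          # unlinkable mention: NIL is returned
    return max(candidates, key=alpha)["uri"]

# Invented feature values for the mention "Jack London":
print(disambiguate([
    {"uri": "dbr:Jack_London", "cosine": 0.34, "coh_entities": 0.0,
     "coh_relations": 0.5, "len": 2, "cl": 2, "lev": 0.0},
    {"uri": "dbr:Jack_London_(boxer)", "cosine": 0.05, "coh_entities": 0.0,
     "coh_relations": 0.0, "len": 2, "cl": 2, "lev": 0.42},
]))   # -> dbr:Jack_London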

5 Implementation

The WeLink system is a Python web application available online (Fig. 6) and is publicly accessible via a REST API.Footnote 6 The source code is published on GitHub.Footnote 7

Fig. 6 WeLink web interface

Figure 7 illustrates the WeLink component diagram, including the three main components detailed above (Query analysis, Candidate generation, and Disambiguation). The objective is to upgrade, maintain, and improve each component separately without affecting the code of the overall system. This architecture ensures flexibility, allowing for easy adaptation [48] and the integration of future components, and thus for the extension of the system. WeLink is deployed as a web service; thus, regardless of the programming language and platform used, it can be integrated into any QAS or, more generally, any SOA (Service-Oriented Architecture), which increases flexibility [13].

Fig. 7 Component diagram of WeLink

The Candidate generation component communicates with DBpedia via VirtuosoFootnote 8 to perform SPARQL queries on a remote SPARQL endpoint. In Listing 2, we detail the SPARQL query executed to retrieve candidates from DBpedia. In this query, the entity and its normalized form are used. To normalize an entity, the first letter of each word is capitalized and underscores are added to link the words together (if the entity consists of several words). For example, the entity “jack london” is replaced by its normalized form “Jack_London”. To manage the name variations of an entity (see Sect. 4.2), we take the union of four different properties (rdfs:label, dbo:wikiPageDisambiguates, dbo:wikiPageRedirects, and foaf:name) to obtain resources that can refer to the entity. In addition, two filters are applied: first, a type filter using the property rdf:type to restrict the search space and limit the number of candidates; second, a language filter that restricts the labels of the exploited subjects, properties, and objects to English. Consequently, this SPARQL query returns the URIs of the retrieved resources (?y), as well as their labels (?na), abstracts (?s), properties (?props), and objects (?objs).

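The template below is a hedged approximation of the query described above (the exact Listing 2 may differ): MENTION stands for the capitalized mention text (e.g. "Jack London") and NORM for its normalized form (e.g. "Jack_London"), both substituted before execution, and the listed types are only examples.

# Approximate reconstruction of the candidate-generation query (assumption).
CANDIDATE_QUERY_TEMPLATE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?y ?na ?s ?props ?objs WHERE {
  { ?y rdfs:label "MENTION"@en }                          # exact label match
  UNION { ?d rdfs:label "MENTION"@en .
          ?d dbo:wikiPageDisambiguates ?y }               # disambiguation pages
  UNION { dbr:NORM dbo:wikiPageRedirects ?y }             # redirects
  UNION { ?y foaf:name "MENTION"@en }                     # alternative names
  ?y rdfs:label   ?na ;
     dbo:abstract ?s ;
     ?props       ?objs ;
     rdf:type     ?type .
  FILTER (?type IN (dbo:Person, dbo:Organisation, dbo:Place))   # type filter
  FILTER (lang(?na) = "en" && lang(?s) = "en")                  # English labels only
}
"""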

6 Experimentation

We compared the results of WeLink with those of EARL, TAGME, DBpedia Spotlight, and Falcon to evaluate the proposed approach and demonstrate its effectiveness.

6.1 Datasets

WeLink is evaluated on the following datasets that are publicly available:

  • QALD-7 [59], QALD-8 [58], QALD-9 [57]: The Question Answering over Linked DataFootnote 9 challenge provides datasets containing multilingual questions to benchmark natural language processing for QASs and also information retrieval.

  • TREC 2014 Microblog: The Text REtrieval Conference (TREC) provides test collections for evaluating text retrieval. We use the TREC 2014 Microblog track.Footnote 10 We manually annotated the search queries used to gather microblogs.

  • ERD14 [6]: the Entity Recognition and Disambiguation Challenge (ERD’14) aims to promote the recognition and the disambiguation of named entities in unstructured texts. The delivered dataset (ERD14) contains web search queries and their annotation.

The particularity of these datasets is the shortness of their queries. Table 3 details the total number of queries per dataset, the number of queries that contain named entities, the total number of entities in each dataset, and the average query length (in words).

Table 3 Dataset details, including the number of questions, questions containing NE, number of named entities per dataset

The experiments were conducted on a laptop machine running Windows 10 with an Intel Core i5-4300U vPro processor and 16 GB RAM.

6.2 Evaluation metrics

We report macro-Precision, macro-Recall, and macro-F-measure. For each query, we denote the gold-standard entities in the query as E \(=\{ e_{1}, e_{2},\ldots ,e_{n}\}\) and the entities returned by the system as \(\hat{E} =\{ \hat{e}_{1}, \hat{e}_{2},\ldots ,\hat{e}_{n}\}\).

$$\begin{aligned} P= & {} {\left\{ \begin{array}{ll} |E \cap \hat{E}| / |\hat{E}|, &{} \text{ if } |\hat{E}|>0 \\ 1, &{} \text{ if } E = \varnothing \hbox { and } \hat{E} = \varnothing \\ 0, &{} \text{ if } \hat{E} = \varnothing \hbox { and } E \ne \varnothing \\ \end{array}\right. }\\ R= & {} {\left\{ \begin{array}{ll} |E \cap \hat{E}| / |E|, &{} \text{ if } |E|>0 \\ 1, &{} \text{ if } E = \varnothing \hbox { and } \hat{E} = \varnothing \\ 0, &{} \text{ if } E = \varnothing \hbox { and } \hat{E} \ne \varnothing \\ \end{array}\right. } \end{aligned}$$

Finally, the F-measure is computed based on precision and recall as follows:

$$\begin{aligned} F = {\left\{ \begin{array}{ll} (2 \cdot P \cdot R) / (P + R), &{} \text{ if } P \ne 0 \hbox { and } R \ne 0 \\ 0, &{} \text{ otherwise } \\ \end{array}\right. } \end{aligned}$$
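A small sketch of these macro-averaged measures, following the convention that precision is computed over the system output and recall over the gold entities, is given below.

# Per-query precision/recall/F averaged over all queries (macro measures).
def prf(gold, system):
    if not gold and not system:
        p = r = 1.0
    else:
        p = len(gold & system) / len(system) if system else 0.0
        r = len(gold & system) / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p and r else 0.0
    return p, r, f

def macro_scores(queries):
    # queries: list of (gold_set, system_set) pairs, one per query.
    scores = [prf(g, s) for g, s in queries]
    return tuple(sum(col) / len(scores) for col in zip(*scores))

# e.g. macro_scores([({"dbr:Jack_London"}, {"dbr:Jack_London"}),
#                    (set(), set())])  ->  (1.0, 1.0, 1.0)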

6.3 Evaluation results

Tables 4, 5 and 6 present the results of WeLink compared to related work systems on QALD-7, QALD-8, and QALD-9, respectively. Table 7 presents the results on TREC 2014 Microblog and ERD14.

Table 4 Evaluation of WeLink against EARL, TAGME, DBpedia Spotlight, and Falcon on QALD-7
Table 5 Evaluation of WeLink against EARL, TAGME, DBpedia Spotlight, and Falcon on QALD-8
Table 6 Evaluation of WeLink against EARL, TAGME, DBpedia Spotlight, and Falcon on QALD-9

WeLink outperforms the related-work systems on QALD-7 (Test) and QALD-8 (Test), reaching F-measures of 0.600 and 0.626, respectively. WeLink also outperforms them on QALD-9 for both the Train and Test datasets, where it achieves an F-measure of 0.729 on Train and 0.706 on Test, along with higher recall and precision. On the TREC 2014 Microblog and ERD14 datasets, WeLink achieves a higher F-measure than the related-work systems, with an average improvement of 27% on TREC 2014 Microblog and 47% on ERD14. The WeLink results are reported with two different entity recognition methods: LER and n-gram. Although the results are consistently better with n-gram than with LER, the latter also gives good results: WeLink with LER achieves a better F-measure on QALD-9 (Train and Test) than the related-work systems.

Table 7 Evaluation of WeLink against EARL, TAGME, DBpedia Spotlight, and Falcon on TREC 2014 Microblog and ERD14

We observe that Falcon performs better on QALD-7 (Train) and QALD-8 (Train). However, WeLink performs quite comparably (\(-5\)% on QALD-7 (Train) and \(-1\)% on QALD-8 (Train)). First, one possible reason is that the entities in these datasets are highly ambiguous. For example, in the query “Which types of grapes grow in Oregon?”, our system returns the resource dbr:Oregon, while the dataset is annotated differently (dbr:Oregon_wine). Another example is the query “Who assassinated President McKinley?”, where WeLink returns the resource dbr:William_McKinley. Nevertheless, the dataset is annotated with dbc:Assassination_of_William_McKinley. The system results can be considered correct since the disambiguation task’s goal is to return the corresponding resource of an entity. However, the datasets are annotated with resources to answer the questions and are therefore better suited for QASs evaluation. We also noticed that a SPARQL query can be formulated differently, for example, “What other books have been written by the author of The Fault in Our Stars?”. This question is annotated in the QALD-8 dataset with the following SPARQL query “SELECT ?books WHERE \(\{\) ?books dbo:author dbr:John_Green_(author) \(\}\)”. This query contains the name of the author (John Green) which does not appear in the question and therefore cannot be exploited. This SPARQL query can be written differently: “SELECT ?books WHERE \(\{\) dbr:The_Fault_in_Our_Stars dbo:author ?y. ?books dbp:author ?y \(\}\)”. We judge that the latter query is more adequate because it contains the entity appearing in the question (dbr:The_Fault_in_Our_Stars). Based on that, we can affirm that the annotation of the datasets influences the evaluation of the NED systems.

Second, our approach prioritizes the exact match between the entity mention and the resource by using syntactic features. This is practical in many cases but disadvantages the right resource in others. In the previous example with the mention “Oregon”, the resource “dbr:Oregon_wine” is disadvantaged because “dbr:Oregon” is an exact match. We will focus on these cases in future work.

Third, the entity types used in the candidate generation step are limited and could be enriched. If a resource type is not in the list of types mentioned in the SPARQL query, the entity may not be correctly retrieved. This technique can be automated by integrating the recognized type (retrieved during entity recognition) into the SPARQL query. Since entity recognition is not the main focus of this paper, we will address this step in future work to overcome this limitation.

We also observe that TAGME (on the QALD-8 Train dataset) and DBpedia Spotlight (on TREC 2014 Microblog) have higher recall but lower precision and F-measure. These systems return numerous entities, many of which are irrelevant. In our case, the precision and recall of WeLink are balanced, because it returns few entities, which are correctly linked in most cases. Furthermore, even if WeLink has slightly poorer results on QALD-7 Train and QALD-8 Train, we judge that the overall improvement across all datasets is more important (see Table 8). We consider the F-measure of WeLink stable (0.663 on average), whereas other approaches obtain good results on some datasets but drop on others. As shown in Table 8, the average improvement over the related-work F-measures is 27% across all datasets, and it reaches up to 71%. These results indicate that WeLink successfully tackles the ambiguity problem in short texts.

Table 8 The improvement percentage of WeLink F-measure compared to the related-work systems

Figure 8 shows the impact of each metric, taken separately, on the F-measure over the QALD datasets (Train \(+\) Test). The WeLink results with the full disambiguation score \(\alpha \) are also illustrated. We observe that the exploitation of relations has a slightly higher impact on the disambiguation process than the other metrics. In addition, the distance between names has a small impact on the F-measure because it is not intended to be used alone but rather within the overall score, to penalize candidates whose names differ greatly from the entity mention.

Fig. 8 The impact of each metric on the F-measure over QALD-7, QALD-8, and QALD-9

Queries that do not contain any entity account for 20% of the datasets used, on average. The NIL threshold on the total similarity score is empirically set at 0.6. As a result, WeLink handles these queries correctly in 56% of the cases, which increases the F-measure by 10%.

6.4 Discussion

The analysis of the results shows that WeLink successfully addresses the problem of NED in short texts. High performance is obtained experimentally with the proposed algorithm, which exploits features capturing the semantics of an ambiguous entity; we therefore consider that WeLink overcomes the context shortness problem. After analysing the impact of the metrics used in the proposed disambiguation algorithm, we assert that these metrics do not reach a high F-measure separately; combined, however, they give good results. Furthermore, the exploitation of relations has more impact on the F-measure than the coherence between entities, which can be explained by the fact that queries usually contain only one named entity.

Comparing the two entity recognition approaches, the n-gram approach generally identifies all named entities in the input text. As a result, these entities are more likely to be correctly linked by the proposed disambiguation process; on the other hand, it has a higher execution time than the lexical entity recognition (LER) method. LER achieves a respectable F-measure but, in some cases, fails to identify the entity, which stops the disambiguation process. Thus, entity recognition is a crucial step for the NED task.

However, WeLink has certain limitations. One failure case is spelling mistakes, such as “Cheryl Teigs”, whose corresponding resource is “dbr:Cheryl_Tiegs”. WeLink also fails when the words in the question differ from the resource name, for example the word “oscar”, which refers to “dbr:Academy_Award”.

7 Conclusion

This paper addressed Named Entity Disambiguation (NED) in short texts in general and in Question Answering Systems (QASs) in particular. NED is one of the essential components in the development of open and modular QASs. The proposed approach combines semantic and syntactic features to overcome the context shortness problem. We designed WeLink, a Named Entity Disambiguation system for short texts, based on (i) exploiting context similarity by expanding the entity context using WordNet, (ii) exploiting the coherence between entities for queries that contain more than one entity, (iii) exploiting relations by comparing the relations between entities to candidate properties, (iv) the distance between the entity name and the resource name, and (v) the use of syntactic features. One of the most important characteristics of the proposed method is its ability to be deployed for open QASs.

Experiments were conducted on five datasets: QALD-7, QALD-8, QALD-9, TREC 2014 Microblog, and ERD14. WeLink outperforms the related-work systems and increases the F-measure by an average of 27%. In addition, we detailed the impact of each metric on the disambiguation process and concluded that relations exploitation has more impact on the F-measure than coherence between entities. Our proposal is fully implemented; its code is available at https://github.com/wissembrdj/welink and it is also accessible via a REST API.

Currently, we are working on improving the entity recognition stage, which impacts both the effectiveness and the efficiency of NED. We are also addressing the problems identified during the validation of WeLink, such as spelling mistakes. Finally, we would like to investigate multilingual entity linking and explore more Knowledge Graphs.