1 Introduction

Dealing with the satisfaction of users’ information needs, Information Retrieval (IR) is one of the most enduring, challenging and studied topics in the Computer Science research community. In its most basic form, an IR approach is composed of (a) a meaningful technique for representing data collections and users’ queries; (b) a similarity measure for computing the closeness of data and query representations; and (c) an algorithm able to efficiently rank the data collection according to a user’s query.

In this paper, we focus on an ad-hoc retrieval task, where the input query is expressed in the form of keywords and the data is a collection of documents. In this scenario, satisfying the users’ information needs mainly requires dealing with the ambiguity of the keywords in a query (e.g., terms can be polysemous and refer to different things in different contexts) and finding a measure for effectively ranking documents (usually composed of a large number of terms) according to users’ queries (usually composed of a number of terms that is some orders of magnitude smaller than the documents).

A large number of techniques have been developed for addressing this task, as reported in the related work section (techniques for representing data, for measuring the similarity of queries and documents, and for ranking the results). In this study, we first reproduce a recent experimental study [3] and then extend it by using a recently proposed term weighting method [9]. We explore whether the results provided by a classical bag-of-words IR approach can be improved through the use of semantic techniques applied to the queries and the document collection.

Fig. 1. A typical system architecture combining text and semantic keywords

The architecture of a typical text-semantic system is shown in Fig. 1. The idea is to introduce new semantic representations of the documents and queries. Each representation provides a ranked list of documents for each input query, which can be considered a “partial answer” to the query, reflecting the semantics of the representation adopted. A unified answer is then generated by merging all the partial answers. A weighting mechanism can be introduced, as usually done in these approaches, to give more importance to specific semantic representations.

The starting point of our work is [3], where an annotated query set and document collection were analyzed by taking into account four semantic layers, and the weights of document and query vectors were assigned according to the usual product of Term Frequency (tf) and Inverse Document Frequency (idf). We explore improvements to this approach in two directions: (1) experimenting with new semantic representations of the data; (2) experimenting with different measures for computing the closeness of documents and queries. Concerning the first direction, we start from a subset of the layers analyzed in KE4IR [3], taking into account only the classes and entities referenced in the data. The hypothesis here is that restricting to these concepts should reduce the noise generated by spurious information. Additionally, this set of annotations has been improved and extended in two ways. First, as in KE4IR, by relying on PIKES (Footnote 1) [4] for finding related classes and entities; we refer to this as the enriched set of entities and classes. Second, we experiment with a refinement and extension of the annotations that exploits the information about the entities and classes provided by DBpedia. In particular, for most terms DBpedia provides, through the abstract attribute, a short textual description. Applying AlchemyAPI (Footnote 2) to the abstracts of the classes allows us to retrieve the entities referenced in them, which are then used to extend the representation. Concerning the second direction, we experiment with the classical BM25 measure [14] and its recently introduced variant [9]. The results obtained from our experiments demonstrate that semantics and specific similarity measures can improve the quality of the results obtained.
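To make the DBpedia-based enrichment step concrete, the following minimal sketch shows how the abstract of a DBpedia resource could be retrieved so that an entity extractor (AlchemyAPI in our setting) can then be applied to it. The sketch assumes the SPARQLWrapper Python library and the public DBpedia endpoint; fetch_abstract and extract_entities are illustrative names, and the latter is only a placeholder for the external extraction service.

```python
# Hedged sketch: retrieve the dbo:abstract of a DBpedia resource. The helper
# names and the example resource IRI are illustrative, not part of the paper.
from SPARQLWrapper import SPARQLWrapper, JSON

def fetch_abstract(resource_iri, lang="en"):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        SELECT ?abstract WHERE {{
            <{resource_iri}> <http://dbpedia.org/ontology/abstract> ?abstract .
            FILTER (lang(?abstract) = "{lang}")
        }}
    """)
    bindings = sparql.query().convert()["results"]["bindings"]
    return bindings[0]["abstract"]["value"] if bindings else ""

# Example use (the extraction call is a placeholder for the AlchemyAPI service):
# abstract = fetch_abstract("http://dbpedia.org/resource/Alan_Turing")
# extra_concepts = extract_entities(abstract)
```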

Summarizing, the contributions of this paper are:

  • we reproduce the work in KE4IR [3];

  • we extend the work by introducing new semantic representations of data and queries;

  • we change the scoring function from tf·idf to BM25 [14] and its recently introduced variant [9].

The rest of the paper is structured as follows. Section 2 introduces some background and related work; Sect. 3 describes the variation of the BM25 scoring function adopted in the paper; and Sect. 4 shows the results of our experiments. Finally, Sect. 5 sketches out some conclusions and future work.

2 Related Work and Background

Several approaches have been proposed in the literature for providing “semantic annotations” describing unstructured documents. Depending on the specific goal motivating the approach, the related research community, the existence of a reference vocabulary/ontology of classes and entities, the output provided, and many other criteria, this task has been given different names, such as named entity recognition (NER), semantic annotation, knowledge and information extraction, or ontology population. Some interesting surveys on the topic are provided by Nadeau and Sekine [12], where approaches for named entity recognition are classified on the basis of their learning method and of the set of features adopted, and by Gangemi [5], where 14 tools for Information Extraction are analyzed and classified.

2.1 Probabilistic Relevance Framework

One of the fundamental ways of thinking about information retrieval is the Probabilistic Relevance Framework [14]. It is the so-called classical probabilistic model because it has its roots in the early work of Maron and Kuhns [11] in the early ’60s, and later in that of Van Rijsbergen [17] and Spärck Jones [7]. Its most conspicuous advocate is, however, Robertson [15].

As pointed out in Fig. 1, in information retrieval (the black boxes) documents and queries are transformed into some common representation and then compared. This comparison, while it may be mathematically very precise (e.g. comparing two vectors is well defined, and for any distance function we will have a deterministic output), is in reality unavoidably subject to the uncertainty of language. Mathematically, the only way we can quantify and work with uncertainty is through probabilities.

The Probabilistic Relevance Framework (PRF) ranks the documents by the estimated probability that a hidden random variable R takes one of two values (some authors use 1/0, others r/\(\bar{r}\) or even l/\(\bar{l}\), to denote relevance and non-relevance). Estimating this probability for information retrieval consists of fundamentally two steps:

  1. finding measurable statistics that we consider indicative of relevance (e.g. term frequency, collection frequency);

  2. combining these statistics to estimate the probability of a document’s relevance to the query.

The appeal of the PRF derives from the probability ranking principle, first publicly formulated by Robertson [15], but credited by him to private communication with W. Cooper of the University of California at Berkeley, and first hinted at by Maron and Kuhns:

“If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then overall effectiveness of the system to its users will be the best that is obtained on the basis of this data.”

The methods developed as a consequence of this principle, while often restricted to statistics that come out of the text, are not bound to this limitation. As the principle states, we can base this calculation on “whatever data has been made available to the system”. In the Web domain this freedom has been used to combine, for instance, Robertson’s BM25 Relevance Status Value with PageRank [14], thereby defining relevance as a combination of topical relevance and importance in a network.

2.2 Term Weighting

With respect to term weighting schemes, the last 60 years have essentially seen three families of methods: the initial heuristics of the vector space model, the various probabilistic models (including the Probabilistic Relevance Framework mentioned above, but also Language Modelling [13] and Divergence from Randomness [1]), and machine learning approaches [10].

Among them, the Probabilistic Relevance Framework (PRF) and Language Modelling have received most of the attention of the community. While they are conceptually different (most notably with respect to the probability spaces in which they operate), there is a strong relationship between them [8]. This is partially why in this study we only focus on the best known instantiation of the PRF, BM25, and its recently introduced variant BM25\(_\text {VA}\). The other reason is that language modelling methods are very sensitive to parameter tuning, and in our case the relatively small size of the test collection might make the experiments unreproducible in other environments. As Zhai [20] pointed out:

“This may be the reason why language models have not yet been able to outperform well-tuned full-fledged traditional methods consistently and convincingly in TREC evaluation.”

The Probabilistic Relevance Framework, as extensively described most recently by Robertson and Zaragoza [14], assumes that the term frequency of individual words is generated by a mixture of two Poisson distributions, one for documents in which the term is elite and one for those in which it is not (where by elite, Robertson denotes those terms that bear the meaning of the document). However, as the two Poisson distributions are in practice impossible to estimate accurately, the weight of each term is approximated by

$$ w_t=\log \frac{|D|-df_t+0.5}{df_t+0.5}\cdot \frac{tf_t}{k_1+tf_t} $$

Since BM25 does not use the cosine similarity (there are no vectors), a length normalisation is directly applied on the term frequency component. Thus, a score is computed for each document d and query q as follows:

$$\begin{aligned} S(q,d)=\sum _{t\in T_d\cap T_q}\frac{(k_3+1) tf _q}{k_3+ tf _q}\frac{(k_1+1)\overline{ tf _d}}{k_1+\overline{ tf _d}}\log {\frac{|D|+0.5}{df_t+0.5}} \end{aligned}$$
(1)

where

$$ \overline{ tf _d}=\frac{ tf _d}{B} \qquad B=(1-b)+b\frac{L_d}{avgdl} $$

where \( tf _q\) and \( tf _d\) are the term frequencies of a term in the query and in the document, \(T_q\) and \(T_d\) are the sets of unique terms in the query and in the document, |D| is the total number of documents in the collection, \(L_d\) is the length of document d (i.e. its number of tokens) and avgdl is the average length of a document in the collection.
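As a concrete illustration, the following minimal Python sketch computes the score of Eq. 1 for a single document-query pair, using the standard B normalisation. The function and variable names are ours, and the collection statistics (document frequencies, average document length) are assumed to be precomputed.

```python
import math
from collections import Counter

def bm25_score(query_tf, doc_tf, df, num_docs, doc_len, avgdl,
               k1=1.2, k3=0.0, b=0.75):
    """Sketch of Eq. 1 with the standard B length normalisation.

    query_tf, doc_tf : Counter mapping term -> frequency in query / document
    df               : dict mapping term -> document frequency in the collection
    """
    B = (1 - b) + b * (doc_len / avgdl)
    score = 0.0
    for t in query_tf.keys() & doc_tf.keys():          # t in T_d intersected with T_q
        tf_q = query_tf[t]
        tf_d = doc_tf[t] / B                           # normalised term frequency
        idf = math.log((num_docs + 0.5) / (df[t] + 0.5))
        score += ((k3 + 1) * tf_q / (k3 + tf_q)) \
               * ((k1 + 1) * tf_d / (k1 + tf_d)) * idf
    return score

# Example call on a toy collection of three documents:
# doc = Counter({"turing": 3, "machine": 2}); query = Counter({"turing": 1})
# bm25_score(query, doc, df={"turing": 2, "machine": 1}, num_docs=3,
#            doc_len=5, avgdl=6.0)
```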

Recently, Lipani et al. [9], based on the observation that the length normalisation is related to the average term frequency, introduced a variant of B that eliminates the free parameter b:

$$\begin{aligned} B_\text {VA}=\frac{ avgtf _d}{ mavgtf ^2}+(1- mavgtf ^{-1})\frac{L_d}{avgdl} \end{aligned}$$
(2)

where

$$ avgtf _d=\frac{1}{|T_d|}\sum _{t\in T_d} tf _d(t)=\frac{L_d}{|T_d|} \quad \text {and}\quad mavgtf =\frac{1}{|D|}\sum _{d\in D} avgtf _d $$

are the average term frequency and the mean average term frequency, respectively.
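A minimal sketch of this parameter-free normalisation follows; the collection statistic mavgtf is assumed to be precomputed over all documents, and the names are ours.

```python
def b_va(doc_len, num_unique_terms, avgdl, mavgtf):
    """Sketch of the B_VA normalisation of Eq. 2.

    doc_len          : L_d, the number of tokens in the document
    num_unique_terms : |T_d|, the number of distinct terms in the document
    avgdl            : average document length over the collection
    mavgtf           : mean average term frequency over the collection
    """
    avgtf_d = doc_len / num_unique_terms               # avgtf_d = L_d / |T_d|
    return avgtf_d / mavgtf ** 2 + (1 - 1 / mavgtf) * (doc_len / avgdl)

# mavgtf itself is just the collection mean of avgtf_d, e.g.
# mavgtf = sum(L_d / unique_terms_d for each document d) / |D|
```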

2.3 Concepts

In this paper we exploit classes and entities identified in and extracted from the documents to improve the information retrieval process. As in KE4IR [3], PIKES [4] has been adopted as the tool for extracting classes and entities, but our proposal is independent of the tool used. We chose the PIKES dataset since it has been constructed with the explicit purpose of experimentation on small collections that present the challenge of the vocabulary gap. This is not the first time that semantic techniques have been used to support the IR process. In particular, semantics can support the search task in all phases of the process: by providing possible interpretations and meanings of documents and queries, and by improving the matching and ranking of queries and documents. In this paper, we extend and improve the document and query representations by taking into account classes and entities extracted from them. A similar approach has been followed in other interesting proposals where, for example, a number of research groups worked on extending the Vector Space Model (VSM) by embedding additional types of information in it. The results obtained are not always convincing, in particular concerning the definition of a theoretically sound interpretation of the resulting vector space. Among the most interesting approaches are Tsatsaronis and Panagiotopoulou [16], where a technique exploiting WordNet’s semantic information is evaluated against three TREC collections, and Waitelonis et al. [18], where a generalized vector space model with taxonomic relationships is introduced. The latter approach was evaluated against Linked Data collections and obtained good performance.

Other approaches exploited semantic techniques for analyzing the dataset according to different perspectives: Gonzalo et al. [6] proposed to use a VSM where the index terms adopted are the synsets extracted from WordNet. Corcoglioniti et al. [3] proposed KE4IR, a system based on the combination of the knowledge provided by four semantic layers in conjunction with the usual bag-of-words representation of the data.

Finally, semantic techniques have frequently been adopted in keyword search approaches over structured data sources. Yu et al. [19] survey the main approaches in the area, and [2] exploits semantic techniques for integrating, querying and analyzing heterogeneous data sources.

3 Terms and Concepts

We now discuss how to combine the information provided by the terms and the concepts of the document in answering a query. We say “concepts” to refer uniformly to entities, classes, or a combination of the two. The probabilistic relevance framework does not restrict us to using terms, and therefore we can consider the use of concepts in the scoring value.

Let us therefore consider the representation of a document d as a combination of its terms and concepts:

$$ d=(\overbrace{ tf (t_1), tf (t_2), ..., tf (t_{|T|})}^\text {term frequencies},\overbrace{ ef (e_1), ef (e_2), ..., ef (e_{|E|})}^\text {concept frequencies}) $$

where T is the set of unique terms in the collection, E is the set of unique concepts in the collection, and \( tf (t_i)\) and \( ef (e_j)\) denote the (term) frequency of term \(t_i\) and the (concept) frequency of concept \(e_j\).

Even though d looks like a typical frequency vector, directly applying the reasoning behind BM25 to the new vector does not make sense because the terms and concepts do not share the same probability space. However, building on the BM25 weighting schemes in [9, 14], we can define a specific score, which makes the same assumptions as the traditional BM25 scoring method above:

$$\begin{aligned} S_E(q,d)=\sum _{e\in E_d\cap E_q}\frac{(k_3+1) ef _q}{k_3+ ef _q}\frac{(k_1+1)\overline{ ef _d}}{k_1+\overline{ ef _d}}\log {\frac{|D|+0.5}{df_e+0.5}} \end{aligned}$$
(3)

where

$$ \overline{ ef _d}=\frac{ ef _d}{B_e} \qquad B_e=(1-b)+b\frac{L^e_d}{avgdl_e} $$

where all the elements are the direct translations of the statistics based on terms into statistics based on concepts. Analogously, the parameter-free normalisation of Eq. 2 becomes

$$\begin{aligned} B^e_\text {VA}=\frac{ avgef _d}{ mavgef ^2}+(1- mavgef ^{-1})\frac{L^e_d}{avgdl_e} \end{aligned}$$
(4)

where

$$ avgef _d=\frac{1}{|E_d|}\sum _{e\in E_d} ef _d(e)=\frac{L^e_d}{|E_d|} \quad \text {and}\quad mavgef =\frac{1}{|D|}\sum _{d\in D} avgef _d $$

are the average concept frequency and the mean average concept frequency.

This new score component \(S_E(q,d)\) can be linearly combined with the term-based score component \(S(q,d)\) from Sect. 2:

$$\begin{aligned} \mathbf {S}(q,d)=S(q,d)+\lambda S_E(q,d) \end{aligned}$$
(5)

Note that we do not use here the more common linear combination \(\lambda \)/(\(1-\lambda \)), for two reasons. First, the addition above is calculated based on the extended vector representation of the document, as noted at the beginning of this section, and \(\lambda \) is introduced as a parameter to give more or less importance to this additional information. Second, it seems counter-intuitive to force the text and concept components to play against each other (i.e. increasing \(\lambda \) for one of them would automatically decrease \(1-\lambda \) for the other). There is no theoretical or practical reason to do so here, since we do not need our score to be constrained within a given interval.
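A minimal sketch of the combined scoring of Eq. 5 is given below, reusing the bm25_score function sketched for Eq. 1 on both the term and the concept statistics. The data structures and the keys of the stats dictionary are illustrative assumptions, not part of the paper.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Repr:
    """Extended representation of a query or document (terms + concepts)."""
    tf: Counter      # term frequencies
    ef: Counter      # concept frequencies
    len_t: int = 0   # L_d, number of tokens
    len_e: int = 0   # L^e_d, number of concept occurrences

def combined_score(query, doc, stats, lam=1.0):
    """Eq. 5: S(q,d) + lambda * S_E(q,d), with bm25_score as sketched for Eq. 1."""
    s_terms = bm25_score(query.tf, doc.tf, stats["df_terms"],
                         stats["num_docs"], doc.len_t, stats["avgdl_terms"])
    s_concepts = bm25_score(query.ef, doc.ef, stats["df_concepts"],
                            stats["num_docs"], doc.len_e, stats["avgdl_concepts"])
    return s_terms + lam * s_concepts
```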

4 Experimental Results

We performed 4 sets of experiments, namely:

  1. Using terms alone, comparing traditional BM25 (standard B) with the variation \(B_\text {VA}\) introduced in [9], as well as with the baseline in [3];

  2. Using terms (as in 1 above), after applying filtering based on concepts;

  3. Combining the rankings of terms and concepts as defined in Eq. 5; and

  4. Combining the rankings of terms and concepts as in 3, after applying filtering based on concepts.

The evaluation has been performed against the dataset developed in [18]. The dataset consists of 331 articles from the yovisto blog, each composed of 570 words on average. The articles are annotated (83 annotations per article on average). The queries have been inspired by the search log and have been manually annotated.

The metrics used are the same as those used by Corcoglioniti et al. [3], which we take as the baseline, namely: Precision at three different cut-offs (P@1, P@5, P@10), Normalised Discounted Cumulative Gain (NDCG) and Mean Average Precision (MAP). The latter two are computed both over the first ten retrieved documents and over the entire ranked list. In all subsequent plots, a solid horizontal line indicates the best performing KE4IR run as reported in [3], while a dotted line indicates the best text-only run as reported in the same paper.
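For reference, the precision-oriented metrics can be computed as in the following sketch; binary relevance is assumed for simplicity, while NDCG additionally uses graded relevance judgments and is omitted here. The function names are ours.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """P@k: fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """AP for a single query; MAP is the mean of AP over all queries."""
    hits, ap = 0, 0.0
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            hits += 1
            ap += hits / rank
    return ap / len(relevant_ids) if relevant_ids else 0.0
```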

4.1 Retrieval Using Terms Alone

In this set of experiments, we did not utilise any semantic information whatsoever. We performed retrieval using BM25, comparing the use of the standard B as in Eq. 1 with \(B_\text {VA}\) as defined in Eq. 2. We used the following parameters: \(k_1=1.2\), \(k_3=0\) and \(b=0.75\) (b is only used in the standard B). The values for \(k_1\) and b are commonly used values according to [9]. We set \(k_3\) to 0 to remove possible effects from variations in keyword frequencies within the query itself.

Figure 2 shows the results. One can note that on the whole, the use of standard B produces slightly better results than using \(B_\text {VA}\). As expected, the use of BM25 on text-only representations does not outperform the best KE4IR configuration described in [3] that uses semantic information. However, BM25 outperforms the text-only KE4IR system on all metrics except for P@1 and P@5.

Fig. 2. Results from text data alone, using \(B_\text {VA}\) (Eq. 2) and traditional BM25 (standard B)

4.2 Retrieval Using Terms and Filter on Concepts

The configuration used here is very similar to that described in Sect. 4.1. However, prior to comparing the documents with the query, the document list is filtered to remove those documents that do not contain the concepts present in the query. We compare the use of different sets of concepts, namely: (a) classes; (b) entities; (c) classes and entities; (d) classes extracted with AlchemyAPI from the abstract properties of the corresponding DBpedia entries; (e) entities extracted with AlchemyAPI from the abstract properties of the corresponding DBpedia entries; (f) classes and entities extracted from the abstract properties of the corresponding DBpedia entries; (g) classes extracted from the data enriched using PIKES; (h) entities extracted from the data enriched using PIKES; and (i) classes and entities extracted from the data enriched using PIKES.
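A minimal sketch of the concept filter applied before scoring is shown below. Whether a document must contain all of the query concepts or only at least one is a design choice; the strict all-concepts reading is used here, and all names are illustrative.

```python
def filter_by_concepts(candidate_doc_ids, query_concepts, doc_concepts):
    """Keep only documents containing every concept annotated in the query.

    candidate_doc_ids : iterable of document identifiers
    query_concepts    : set of concept identifiers extracted from the query
    doc_concepts      : dict mapping doc id -> set of concept identifiers
    """
    return [d for d in candidate_doc_ids
            if query_concepts <= doc_concepts.get(d, set())]

# A more permissive variant would keep documents sharing at least one concept:
#   if query_concepts & doc_concepts.get(d, set())
```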

As shown in the results (Fig. 3), most runs outperform the best KE4IR configuration on P@5 and P@10. The best P@5 score is obtained when using \(B_\text {VA}\) and filtering on concepts. On the other hand, the best P@10 is obtained when using standard B and applying filtering using concepts from the enriched classes and entities. Despite obtaining relatively high P@5 and P@10, this system scores lower than KE4IR on P@1, NDCG and MAP. This indicates that, despite tending to retrieve more relevant documents in the first 5 and 10 positions, its ranking tends to be worse than KE4IR’s.

Fig. 3. Results ranking by text data alone, using \(B_\text {VA}\) (Eq. 2) and traditional BM25 (standard B), but filtering on concepts

4.3 Retrieval Using Combined Ranking of Terms and Concepts

This experiment uses the document-query similarity function defined in Eq. 5, comparing the effect of standard B and \(B_{VA}\) with various values of \(\lambda \), ranging from 0.2 to 1.4 in steps of 0.2. Different runs are performed for the different sets of concepts listed in Sect. 4.2.

Fig. 4. Results combining text and concepts (Eq. 5), using \(B_\text {VA}\) and traditional BM25 (standard B)

Results (Fig. 4) indicate that this configuration performs worse than the previous set-up (textual retrieval with semantic filtering) as far as P@n is concerned. On the other hand, there are cases (using entity concepts, standard B, and \(\lambda =1.4\)) where the P@10 and MAP scores exceed the corresponding KE4IR scores. This may imply an improved ranking.

4.4 Retrieval Using Combined Ranking of Terms and Concepts, and Filter on Concepts

This configuration is similar to the setup described in Sect. 4.3, but with the addition of concept filtering (as described in Sect. 4.2). The results obtained here are overall better than those obtained when using textual and conceptual ranking without filtering. Configurations involving \(B_\text {VA}\) and entity concepts manage to produce a P@5 score (0.7375) that exceeds the P@5 of KE4IR and of all other runs. There are also various configurations that exceed the KE4IR P@10 score. However, the P@1, NDCG and MAP scores are generally lower than the ones obtained without concept filtering.

Fig. 5. Results combining text and concepts (Eq. 5), using \(B_\text {VA}\) and traditional BM25 (standard B), as well as filtering on concepts

4.5 Discussion

The experiments allow us to draw the following conclusions:

  • The best results were obtained on P@5 and P@10, significantly improving the current state of the art on the provided test collection.

  • By considering the top-heavy metrics (P@1 and MAP), the experiments show that it is extremely difficult to significantly improve on the existing results.

  • The increased performance in precision obtained by our technique does not correspond to an increase in the NDCG and MAP scores, meaning that a larger number of correct documents is associated with a worse ranking of them.

  • The main benefit from the adoption of concepts is thus related to the filtering of the documents. The results show that in most cases concepts introduce more noise than utility into the ranking (NDCG and MAP decrease as \(\lambda \) increases in Figs. 4 and 5). The small increases in performance for some configurations occur in settings where no filtering has been performed, thus confirming our findings.

  • Due to the small dataset and the small number of queries evaluated, the results cannot be generalized outside of this domain.

  • In this particular domain, the variation of BM25 introduced does not improve the scores.

5 Conclusion

We have extensively explored the use of concepts in an ad-hoc retrieval task, combining them with term-based relevance in three ways: as a filter, as a ranking contributor, and as both a filter and a ranking contributor. We have explored 9 different interpretations of ‘concepts’ and concluded that, with reference to the dataset studied, the main benefit we achieve by taking concepts into account concerns the reduction of the search space, where irrelevant documents are filtered out. On the other hand, only a small benefit (and in some cases a decrease in performance) is measured if we consider the ranking. Nevertheless, the problem needs more evaluation, with other, larger datasets, to draw general conclusions.