1 Introduction

Pseudo-relevance feedback (PRF) has proven effective for improving retrieval performance. The basic idea is to assume that a certain number of top-ranked documents are relevant and to select expansion terms from these documents to refine the query representation [18]. A fundamental question is whether the feedback information is truly relevant to the query. Cao et al. [4] show that the expansion process indeed adds more bad terms than good ones, and that the proportion of bad terms differs across collections. This means that the expansion terms contain noise and we cannot always trust the expansion information. Thus, we need to carefully balance the original query model and the expansion model derived from the feedback documents. If we over-trust the feedback information, retrieval performance can be harmed by the noise in the expansion model; if we under-trust it, we cannot take full advantage of the feedback information. Currently, the balance parameter used to combine the original query and the expansion terms derived from the feedback documents is typically tuned manually and set to a fixed value across queries for a specific collection. Because collections and queries differ, this parameter should be set differently. Recently, Lv and Zhai [10] present a learning approach to adaptively predict the optimal weight of the original query model for different queries and collections. They explore a number of features and combine them using a regression approach for the prediction. However, their features are mostly based on the original query and the feedback information, and do not sufficiently consider features of the candidate expansion terms and the collection.

It has long been recognized in information retrieval that the document collection has a great impact on the performance of a retrieval model [17]. In this paper, we propose and systematically investigate a set of collection-based features about queries, feedback documents and candidate terms, which are complementary to the features used by Lv and Zhai [10]. Specifically, three types of features are studied: (1) information amount of the query: we suppose that a query is more reliable when it carries more information; (2) reliability of the feedback documents; and (3) reliability of the candidate terms: we trust the feedback documents and candidate terms only when they are highly reliable. The proposed features are fed into a logistic regression model to predict the feedback parameter.

2 Related Work

Pseudo-relevance feedback has been implemented in different retrieval models, e.g., the vector space model, the probabilistic model, and the language model. In the vector space model [6], feedback is usually done with the Rocchio algorithm, which forms a new query vector by maximizing its similarity to relevant documents and minimizing its similarity to non-relevant documents. The feedback method in classical probabilistic models [3, 16] selects expansion terms primarily based on the Robertson/Sparck-Jones weight. In the language modeling approaches [9, 20], relevance feedback can be implemented by estimating a query language model or a relevance model from a set of feedback documents. All of these works use a fixed parameter to balance the original query and the feedback information.

Recently, Lv and Zhai [10] present a learning approach to adaptively predict the optimal balance parameter for each query and each collection. They leverage state-of-the-art language models for ranking documents and use logistic regression to optimize an important parameter inside the language modeling framework. Three heuristics are used to characterize the feedback balance parameter: the discrimination of the query, the discrimination of the feedback documents, and the divergence between the query and the feedback documents. These heuristics are then taken as a road map to explore a number of features, which are combined using the logistic regression model to predict the balance parameter. Their experiments show that the proposed adaptive relevance feedback is more robust and effective than regular fixed-parameter feedback. Nevertheless, there is still room for improvement when the training and testing sets are of different types. Our work uses a similar method, but adds features based on characteristics of the collection. Our experiments show that our method is competitive in improving retrieval performance in comparison with their approach.

3 Basic Formulation

The Relevance Model (RM) [21] is a representative and state-of-the-art approach for re-estimating query language models based on PRF [9]. We will carry out our study in the RM framework.

For a given query \(Q=(q_1, q_2,... , q_m)\), based on the corresponding PRF document set F (\(|F|=n\)), RM estimates an expanded query model [22]:

$$\begin{aligned} P(w|\theta _F)\propto \sum _{D \in F} P(w|\theta _D)P(\theta _D)\prod ^m _{i=1}P(q_i|\theta _D) \end{aligned}$$
(1)

where \(P(\theta _D)\) is a prior on documents and is often assumed to be uniform without any additional prior knowledge about the document D. Thus, the estimated relevance model is essentially a weighted combination of individual feedback document models with the query likelihood score of each document as the weight.

The estimated relevance model, \(P(w|\theta _F)\), can then be interpolated with the original query model \(\theta _Q\) to improve performance:

$$\begin{aligned} P(w|\theta ^\prime _Q) = \lambda P(w|\theta _Q)+ (1-\lambda ) P(w|\theta _F) \end{aligned}$$
(2)

where \(\lambda \) is a balance parameter to control the weight of the feedback information. The model in Eq. (2) is often referred to as RM3 [9]. When \(\lambda =1\), we only use the original query model (i.e., no feedback). If \(\lambda =0\), we ignore the original query and rely only on the feedback model.
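For concreteness, the following is a minimal Python sketch (not the implementation used in our experiments) of the RM1 estimate in Eq. (1) and the RM3 interpolation in Eq. (2), assuming the feedback documents are given as term-count dictionaries together with their query-likelihood scores:

```python
from collections import Counter

def estimate_relevance_model(feedback_docs, query_likelihoods):
    """RM1 (Eq. 1): weighted mix of feedback document models, using each
    document's query-likelihood score as its weight (uniform P(theta_D))."""
    p_w_f = Counter()
    for doc, ql in zip(feedback_docs, query_likelihoods):
        doc_len = sum(doc.values())
        for w, c in doc.items():
            p_w_f[w] += (c / doc_len) * ql      # P(w|theta_D) * prod_i P(q_i|theta_D)
    norm = sum(p_w_f.values())
    return {w: p / norm for w, p in p_w_f.items()}

def rm3_interpolate(query_model, feedback_model, lam):
    """RM3 (Eq. 2): lam * P(w|theta_Q) + (1 - lam) * P(w|theta_F)."""
    vocab = set(query_model) | set(feedback_model)
    return {w: lam * query_model.get(w, 0.0) + (1 - lam) * feedback_model.get(w, 0.0)
            for w in vocab}

# toy usage
docs = [Counter({"china": 3, "food": 2, "dumpling": 1}),
        Counter({"food": 4, "restaurant": 2})]
theta_f = estimate_relevance_model(docs, query_likelihoods=[0.7, 0.3])
theta_q = {"chinese": 0.5, "food": 0.5}
print(rm3_interpolate(theta_q, theta_f, lam=0.6))
```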

4 The Proposed Collection-Based Features

As mentioned above, because collections differ in document type, size and other characteristics, and queries differ in difficulty, the expansion terms selected from the feedback documents are not always good terms [4]. Accordingly, the balance parameter should be set differently for different collections and queries. In this section, we investigate three types of collection-based features about the query, the feedback documents and the candidate terms, for adaptively setting the balance parameter.

4.1 Information Amount of Query

Intuitively, if a query contains a sufficient amount of information about the search topic, the expansion terms may be less important and more weight should be given to the original query. As query performance is largely related to the information amount of the query, it is natural to borrow features that have been used in query performance prediction [5]. Going a step further, we also propose two extra features, namely mutual information and information entropy.

4.1.1. The Distribution of Information Amount in the Query Terms

In general, each query term t can be associated with an inverse document frequency (idf(t)) describing the information amount that the term carries. According to Pirkola and Järvelin [13], the difference in the discriminative power of query terms, which is reflected by the idf(t) values, could affect the retrieval effectiveness. Therefore, the distribution of idf(t) over the query terms, denoted DI, might be an intrinsic feature that affects the selection of the balance parameter. DI is represented as:

$$\begin{aligned} DI=\sigma _{idf} \end{aligned}$$
(3)

where \(\sigma _{idf}\) is the standard deviation of the idf values of the terms in Q. In our study, idf is defined as follows:

$$\begin{aligned} idf(t)=\frac{\log \frac{(N+0.5)}{N_t}}{\log (N+1)} \end{aligned}$$
(4)

where \(N_t\) is the number of documents containing the query term t, and N is the number of documents in the collection. The higher the DI score, the more dispersed the distribution of information amount over the query terms. In that case, more precise information needs to be brought in from the expansion terms, and thus more weight should be given to the feedback/expansion model.
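A minimal sketch of Eqs. (3) and (4), assuming per-term document frequencies are available; the standard-deviation variant (population vs. sample) is not specified above, so the population form is used here:

```python
import math
import statistics

def idf(df_t, N):
    """Eq. (4): normalized inverse document frequency."""
    return math.log((N + 0.5) / df_t) / math.log(N + 1)

def query_di(query_terms, doc_freqs, N):
    """Eq. (3): standard deviation of idf over the query terms."""
    values = [idf(doc_freqs[t], N) for t in query_terms]
    return statistics.pstdev(values)   # population std dev (one plausible choice)

# toy usage: 3 query terms in a 10,000-document collection
print(query_di(["chinese", "food", "recipe"],
               {"chinese": 800, "food": 5000, "recipe": 120}, N=10_000))
```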

4.1.2. Query Scope

The notion of query scope characterizes the generality of a query. For example, the query “Chinese food” is more general than “Chinese dumplings”, as the latter is about a particular Chinese food. The query scope was originally studied in [14], defined as a decay function of the number of documents containing at least one query term, and has been shown to be an important property of the query. Similarly, in this paper, we define the query scope as follows:

$$\begin{aligned} QS = -\log (\frac{n_Q}{N}) \end{aligned}$$
(5)

where \(n_Q\) is the number of documents containing at least one of the query terms, and N is the number of documents in the whole collection. A larger \(n_Q\) results in a lower query scope. A higher QS value indicates that the query conveys clearer information, so more weight should be given to the original query.
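A minimal sketch of Eq. (5), with toy counts:

```python
import math

def query_scope(n_q, N):
    """Eq. (5): QS = -log(n_Q / N)."""
    return -math.log(n_q / N)

# toy usage: 2,500 of 10,000 documents contain at least one query term
print(query_scope(2_500, 10_000))   # ~1.39; a broader query (larger n_Q) gets a lower QS
```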

4.1.3. Average Inverse Collection Term Frequency

According to Kwok [8], the inverse collection term frequency (ICTF) can be seen as an alternative to idf and is correlated with the quality of a query term. The average ICTF (AvICTF) is given by:

$$\begin{aligned} AvICTF=\frac{\log \prod _{q\in Q}\frac{|C|}{{tf}_q}}{|Q|} \end{aligned}$$
(6)

where \({tf}_q\) is the occurrence frequency of a query term in the collection, |C| is the number of tokens in the collection, and |Q| is the query length. AvICTF measures the overall discriminative power of the query terms. A higher AvICTF indicates that more weight may be needed for the original query, since the expansion terms may not bring much extra benefit.
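A minimal sketch of Eq. (6), with toy collection statistics:

```python
import math

def av_ictf(query_terms, coll_tf, coll_tokens):
    """Eq. (6): average inverse collection term frequency over the query terms."""
    log_prod = sum(math.log(coll_tokens / coll_tf[q]) for q in query_terms)
    return log_prod / len(query_terms)

# toy usage: a collection of 1,000,000 tokens
print(av_ictf(["chinese", "food"], {"chinese": 1_200, "food": 15_000},
              coll_tokens=1_000_000))
```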

4.1.4. Mutual Information Among Query Terms

Mutual information (MI) [12] is used to quantify how the terms in a query are associated with each other. MI measures the mutual dependence of two discrete random variables X and Y, and is defined as follows:

$$\begin{aligned} MI(Q)=I(X;Y)=\sum _{y\in Y}\sum _{x\in X} P(x,y)\log \frac{P(x,y)}{P(x)P(y)} \end{aligned}$$
(7)

where P(x, y) is the joint probability distribution function of X and Y, and P(x) and P(y) are the marginal probability distribution functions of X and Y, respectively. In our study, they are defined as follows:

$$\begin{aligned} \begin{array}{l} P(x,y)=df_{xy}/N\\ P(x)=df_x/N\\ P(y)=df_y/N \end{array} \end{aligned}$$
(8)

where x and y are two original query terms, \(df_{xy}\) is the number of documents in which x and y co-occur, N is the number of documents in the whole collection, and \(df_x\) and \(df_y\) are the document frequencies of x and y, respectively. A higher MI score means a stronger correlation among the query terms, and thus more coherent information carried by the original query. In turn, less weight can be given to the candidate expansion terms.
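A minimal sketch of Eqs. (7) and (8), under the reading that MI(Q) sums \(P(x,y)\log \frac{P(x,y)}{P(x)P(y)}\) over all pairs of distinct query terms; the document and co-occurrence frequencies are toy values:

```python
import math
from itertools import combinations

def mi_query(query_terms, df, df_pair, N):
    """Eqs. (7)-(8): MI(Q) as a sum over pairs of distinct query terms,
    with probabilities estimated from (co-)document frequencies."""
    mi = 0.0
    for x, y in combinations(query_terms, 2):
        p_xy = df_pair.get((x, y), df_pair.get((y, x), 0)) / N
        if p_xy == 0:
            continue                    # skip pairs that never co-occur
        p_x, p_y = df[x] / N, df[y] / N
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

# toy usage
print(mi_query(["chinese", "food"], {"chinese": 800, "food": 5000},
               {("chinese", "food"): 600}, N=10_000))
```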

4.1.5. Information Entropy of Query

We propose to analyze the term distribution in a query using information entropy [2]. In information theory, entropy measures the average amount of information contained in a message, thus characterizing the uncertainty of information. For a random variable X with n outcomes {\(x_1\), ..., \(x_n\)}, the widely used Shannon entropy, denoted by H(X), is defined as follows:

$$\begin{aligned} IE(Q)=H(X)=-\sum _{i=1}^n P(x_i)\log P(x_i) \end{aligned}$$
(9)

where \(P(x_i)\) is the probability mass function of outcome \(x_i\). In this study, it is calculated as follows:

$$\begin{aligned} P(x_i)=tf_i/Ntf \end{aligned}$$
(10)

where \(tf_i\) is the frequency of query term \(x_i\), and Ntf is the sum of all term frequencies in the collection. A higher IE score means greater uncertainty in the query, so more weight should be given to the candidate expansion terms.
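A minimal sketch of Eqs. (9) and (10), assuming \(tf_i\) is the term's frequency in the collection; all values are illustrative:

```python
import math

def ie_query(query_terms, coll_tf, total_tf):
    """Eqs. (9)-(10): Shannon entropy over the query terms, with P(x_i)
    estimated from collection term frequencies."""
    probs = [coll_tf[t] / total_tf for t in query_terms]
    return -sum(p * math.log(p) for p in probs if p > 0)

# toy usage: term frequencies in a 1,000,000-token collection
print(ie_query(["chinese", "food"], {"chinese": 1_200, "food": 15_000},
               total_tf=1_000_000))
```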

4.2 Reliability of Feedback Documents

We expect that the more reliable the feedback documents are, the more weight should be given to the expansion model derived from these documents.

4.2.1. Clarity of Feedback Documents

The clarity of feedback documents is defined as follows, as also used in [10].

$$\begin{aligned} CFD=\sum _{\omega \in F}p(\omega |\theta _F)\log \frac{p(\omega |\theta _F)}{p(\omega |C)} \end{aligned}$$
(11)

where F is the set of feedback documents, \(p(\omega |C)\) is the collection language model, and \(p(\omega |\theta _F)\) is estimated as \(p(\omega |\theta _F)=\frac{c(\omega ,F)}{\sum _\omega c(\omega ,F)}\). The higher the CFD value, the more reliable the feedback documents tend to be.
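A minimal sketch of Eq. (11), assuming the feedback documents are term-count dictionaries and the collection language model is given:

```python
import math
from collections import Counter

def clarity_of_feedback(feedback_docs, collection_model):
    """Eq. (11): KL divergence of the feedback language model from the
    collection language model."""
    counts = Counter()
    for doc in feedback_docs:
        counts.update(doc)
    total = sum(counts.values())
    cfd = 0.0
    for w, c in counts.items():
        p_w_f = c / total
        cfd += p_w_f * math.log(p_w_f / collection_model[w])
    return cfd

# toy usage
docs = [Counter({"china": 3, "food": 2}), Counter({"food": 4, "travel": 1})]
coll = {"china": 0.01, "food": 0.05, "travel": 0.02}
print(clarity_of_feedback(docs, coll))
```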

4.2.2. The Content of Least Frequent Terms in Feedback Documents

The least frequent terms (LFT) are terms that appear fewer than a certain number of times (e.g., 3 in our experiments) in the collection and contain non-alphabetical characters, such as “00”, “1”, “2d”. These terms usually have little practical significance. The content of LFT in the feedback documents is defined as:

$$\begin{aligned} LFT_F = \frac{N(LFT)}{|F|} \end{aligned}$$
(12)

where N(LFT) is the number of LFT terms in the feedback documents, and |F| is the total number of terms in the feedback documents. The higher the \(LFT_F\) value, the less reliable the feedback documents tend to be.
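A minimal sketch of Eq. (12), under one plausible reading in which an LFT is a term that is both rare in the collection and contains non-alphabetical characters; the inputs are toy values:

```python
def lft_ratio(terms, coll_tf, min_count=3):
    """Eq. (12): fraction of tokens that are 'least frequent terms'."""
    def is_lft(t):
        return coll_tf.get(t, 0) < min_count and not t.isalpha()
    n_lft = sum(1 for t in terms if is_lft(t))
    return n_lft / len(terms)

# toy usage: flattened token list of the feedback documents
tokens = ["china", "food", "2d", "00", "travel"]
print(lft_ratio(tokens, {"china": 900, "food": 5000, "2d": 1, "00": 2, "travel": 300}))
```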

4.3 Reliability of Candidate Terms

We expect that the more reliable the candidate expansion terms are, the more weight should be given to them when they are combined with the original query model.

4.3.1. Mutual Information Between Candidate Expansion Terms and Query

The definition of MI(C) is roughly the same as that of MI(Q), except that the variables in Eqs. (7) and (8) take different meanings. For MI(Q), X and Y both represent original query terms, whereas for MI(C), they represent the original query terms and the candidate expansion terms, respectively.

4.3.2. Information Entropy of Candidate Expansion Terms

Similar to the definition of IE(Q), IE(C) can be calculated using Eqs. (9) and (10), with \(x_i\) representing candidate terms.

4.3.3. The Content of LFT in Candidate Expansion Terms

This can be measured in the same way as for the feedback documents in Eq. (12); we define it as \(LFT_C = \frac{N(LFT)}{|C|}\), where N(LFT) is the number of LFT terms among the candidate expansion terms, and |C| is the total number of candidate terms.
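As a compact illustration of Sect. 4.3, the following self-contained sketch computes IE(C) and \(LFT_C\) on toy candidate terms; MI(C) follows the same pattern as the MI(Q) sketch above, with y ranging over candidate terms. All inputs are hypothetical values:

```python
import math

# hypothetical inputs: candidate expansion terms with collection statistics
candidates = ["dumpling", "noodle", "2d"]
coll_tf = {"dumpling": 40, "noodle": 90, "2d": 1}
total_tf = 1_000_000

# IE(C): Eqs. (9)-(10) applied to candidate terms instead of query terms
probs = [coll_tf[t] / total_tf for t in candidates]
ie_c = -sum(p * math.log(p) for p in probs if p > 0)

# LFT_C: share of candidate terms that are rare and non-alphabetical
lft_c = sum(1 for t in candidates
            if coll_tf.get(t, 0) < 3 and not t.isalpha()) / len(candidates)

# MI(C) would analogously reuse Eqs. (7)-(8), with x ranging over query terms
# and y over candidate terms (document co-occurrence frequencies required).
print(ie_c, lft_c)
```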

5 The Logistic Regression Model

Logistic regression is widely used in data mining and machine learning. We use a logistic regression model to combine our features and generate a score for predicting the balance parameter, with the output confined to values between 0 and 1. The method is the same as the one used in [10], defined as follows:

$$\begin{aligned} f(X)=\frac{1}{1+exp(-X)} \end{aligned}$$
(13)

where the variable \(X=\overline{\omega }\cdot \overline{x}\) is a linear combination of the features. Specifically, \(\overline{x}\) is a vector of numeric feature values and \(\overline{\omega }\) is a vector of weights indicating the relative importance of each feature. f(X) represents the probability of a particular outcome given the set of features.
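A minimal sketch of Eq. (13); the three feature values and weights below are made up for illustration:

```python
import math

def predict_lambda(features, weights, bias=0.0):
    """Eq. (13): logistic function over a weighted feature combination,
    mapping X = w . x (+ bias) to a value in (0, 1)."""
    x = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-x))

# toy usage with hypothetical weights for three features
print(predict_lambda(features=[0.12, 1.4, 3.2],
                     weights=[-3.6, -0.42, -0.02], bias=-0.24))
```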

6 Experiments and Results

6.1 Experimental Setup

We used five standard benchmarking collections in our experiments: AP8890 (AP), WSJ8792 (WSJ), ROBUST2004 (ROBUST), WT10G and SJM, which are different in size and genre. The WSJ, AP and SJM collections are relatively small and consist of news articles, science and technology reports and government documents, whereas WT10G is a larger Web collection. The details of these collections are shown in Table 1.

Table 1. Information of Collections

In all the experiments, we only used the title field of the TREC queries for retrieval, because it is closer to the queries used in real web search applications, and relevance feedback is expected to be most useful for short queries [19].

First, we used Indri, which is part of the Lemur Toolkit [11], to index the document collections. In the indexing process, all terms were stemmed using Porter's English stemmer [15], and stopwords from the standard InQuery stoplist [1] were removed. Then, for each query, we initially retrieved a document list using the language model with Dirichlet prior smoothing (which takes a hyper-parameter \(\mu \) to smooth the document language model and performs better than other smoothing methods for title queries), with the smoothing parameter fixed to 1500 for all queries. This is the baseline, denoted LM, for all pseudo-relevance feedback models in our experiments. After that, for each query, we used the expanded query model (Eq. (1)) to obtain the candidate expansion terms. In this step, we fixed the number of feedback documents to the top 30 and the number of candidate expansion terms to 100, following the settings in existing work.

To train the proposed adaptive relevance feedback, we first needed to obtain training data. Considering the reliability and authority of the training data, \(90\,\%\) of the queries were selected randomly for training, resulting in 41 out of queries 101-150, 45 out of queries 151-200, 89 out of queries 601-700, and 45 out of queries 501-550; the rest were taken as testing queries. In this way, we aimed to make the training data more diversified and the test results more general and reliable. In total, 262 queries of different types were taken as training data and 33 queries were taken as testing data.

For the traditional RM3 model, as discussed in Sect. 3, the balance parameter \(\lambda \) (Eq. (2)) was varied from 0 to 1 (in steps of 0.1) on each of the five collections to find the optimal \(\lambda \) (fixed for all queries in the same collection); we call this run RM3-Manual. For our adaptive relevance feedback, we chose the optimal \(\lambda \) for each query. All of the above procedures were identical for the training and testing data.

Finally, the effectiveness of the IR models on each collection was measured by the Mean Average Precision (MAP) [7] at the top 1000 retrieved documents.

Fig. 1. Sensitivity of the balance parameter (\(\lambda \)) for different queries on the AP8890 collection

6.2 Sensitivity of Balance Parameter

We investigated the sensitivity of the balance parameter on the AP8890 collection by varying \(\lambda \) from 0 to 1 for a number of AP8890 queries in relevance feedback experiments, as shown in Fig. 1. We observe that the setting of \(\lambda \) can affect retrieval performance significantly, and that the optimal parameter for different queries on the same collection can be quite different.

6.3 Correlation Between Features and the Optimal Balance Parameter

We measured the correlation between the features and the optimal balance parameter for each query in the training data using the commonly used Pearson and Spearman correlation coefficients. Based on Sect. 4, we obtained a \(262 \times 10\) query-feature matrix; each query has its own 10 feature values and its optimal \(\lambda \), which form the basis of the analysis. As shown in Table 2, DI, QS, IE(Q) and \(LFT_F\) are more correlated with the optimal feedback coefficient than the other features. This may indicate that the information amount of the query plays an important role in predicting the balance parameter.

Table 2. Pearson and Spearman correlation coefficients between features and the optimal \(\lambda \) on training data

6.4 Prediction Models and the Results

In this part, we trained three prediction models on our training data using three different sets of features, with the goodness of fit assessed by significance tests for the balance parameter: (1) all of the proposed features (ten); (2) only DI, QS, IE(Q) and \(LFT_F\); (3) five important features proposed in [10], including the clarity of queries, the feedback length, the clarity of feedback documents and the absolute divergence between queries and feedback documents (see [10] for more details). These models are referred to as “RM3-A”, “RM3-A2” and “RM3-B”, respectively. Given a new query, we can predict its feedback balance parameter directly using the formula \(f(X)=\frac{1}{1+exp(-X)}\) introduced in Sect. 5, where X for all ten features (X1) and for the four important features (X2) is shown below:

$$\begin{aligned} \begin{array}{l} X1 = -0.2444 - 3.6127*DI - 0.4249*QS - 0.0214*AvICTF \\ \qquad + 184.2403* MI(Q) - 44.8057*IE(Q) + 3.1588*CFD\\ \qquad + 0.6834*LFT_F - 0.7406* MI(C) + 0.5376*IE(C) - 4.7051*LFT_C \end{array} \end{aligned}$$
(14)
$$\begin{aligned} \begin{array}{l} X2 = -0.5594 - 5.5303*DI - 0.3347*QS - 43.2822*IE(Q) + 0.418*LFT_F \end{array} \end{aligned}$$
(15)

From the above formulas, we can see that, regarding the distribution of information amount in queries, DI and IE(Q) are negatively correlated with \(\lambda \), while MI(Q) shows a positive correlation. This is consistent with our expectation that more weight should be given to the original query when the query carries more information. Regarding the reliability of the feedback documents and expansion terms, \(LFT_F\), IE(C) and \(LFT_C\) should be positively correlated with \(\lambda \), while CFD and MI(C) should show a negative correlation. That means a high weight (\(1-\lambda \)) should be given to the candidate terms when the feedback information is more reliable. However, the different behaviors of \(LFT_F\) and \(LFT_C\) in Eq. (14) can be explained as a trade-off between “credibility” and “quantity of information”.
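As an illustration, plugging the coefficients of Eq. (15) into the logistic function of Eq. (13) yields the predicted \(\lambda \) for a new query directly; the feature values below are toy numbers, not measurements from our collections:

```python
import math

def lambda_from_x2(di, qs, ie_q, lft_f):
    """Predicted balance parameter using the X2 model in Eq. (15)."""
    x2 = -0.5594 - 5.5303 * di - 0.3347 * qs - 43.2822 * ie_q + 0.418 * lft_f
    return 1.0 / (1.0 + math.exp(-x2))

# toy feature values for one query
print(lambda_from_x2(di=0.05, qs=1.2, ie_q=0.004, lft_f=0.10))
```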

Table 3. Performance comparison of RM3-A and RM3-B on all testing data.
Table 4. Performance comparison of RM3-A and RM3-B on ROBUST2004.

The performance (MAP) of the baseline, RM3-A, RM3-A2, RM3-B and RM3-Manual is shown in Table 3. It shows that RM3-A, RM3-A2 and RM3-B all outperform LM, but compared with RM3-Manual, there is still room for improvement by further optimizing the feedback parameter. RM3-A is better than RM3-B on SJM, WSJ8792, ROBUST2004 and WT10G, and only fails on AP8890. RM3-A2 is better than RM3-B on AP8890. The result is encouraging: our method explicitly takes collection-based features into account, and the multiple types of training data make the prediction more robust across different types of collections. As for RM3-A and RM3-A2, the results indicate that using all features generally performs better than using only the most important features.

Furthermore, we show the per-query performance of the baseline, RM3-A, RM3-B, RM3-A2 and RM3-Manual on ROBUST2004 in Table 4. The results show that our method (RM3-A) is effective on a per-query basis compared with RM3-B.

7 Conclusions and Future Work

In this paper, we propose a series of collection-based features about the query, the feedback documents and the candidate expansion terms, and combine them using a logistic regression model to adapt the balance parameter of PRF for different queries and collections (RM3-A). The experiments show that our method outperforms a state-of-the-art method (RM3-B) when the training and test data are of very different types. This verifies our hypothesis that incorporating collection-sensitive features helps improve retrieval performance. On the other hand, there is still room for further improvement compared with manually setting the optimal balance parameter. In future work, we will keep improving our approach by investigating other collection-based features and analyzing the relationships between different features. We will also evaluate our method with different PRF methods and with different training and test data.