
1 Introduction

Ranking results is a challenge in every information retrieval system. Ranker algorithms, for example, produce a list of documents ordered by relevance. Many algorithms have been developed for document ranking, including vector space and probabilistic algorithms. Rankers are compared with each other according to their results in order to determine their similarity. Ranked lists are conventionally evaluated with measures such as NDCG, MAP, and ERR, which can also be applied for comparison. Similarity measures show how strongly a pair of rankers is correlated: a similarity measure expresses the agreement of two rankers as a value, typically in the range [0..1] or [−1..1]. Such a similarity value can be used for feature selection in systems that employ a variety of rankers, or to compare the results of two search engines. In this paper the first task is considered.

Various similarity methods have been introduced. Kendall's τ, the Pearson correlation coefficient, and the Spearman rank correlation coefficient are well-known correlation measures. Kendall's τ considers the pairwise ordering of objects, while Spearman's coefficient captures the linear correlation between two ranked lists. Studies in [1,2,3,4,5,6,7,8] identified problems with Kendall's τ and attempted to solve them. Yilmaz et al. [1] view the problem of Kendall's τ as ignoring the importance of top ranks. In [3] the problem with Spearman's footrule is discussed, and the relevance of elements and positional information is taken into consideration. In [1] a correlation measure was applied to determine the similarity between two search engines. We instead investigate feature similarity, which can be applied for feature selection.

In this paper, we propose a similarity method that measures the agreement between two ranked lists based on the difference in document positions across the two sorted lists. Moreover, we take into account the difference between the relevance degrees of the documents at each position. Experiments were conducted on two standard datasets from the Letor 3.0 and Letor 4.0 benchmarks, over all feature pairs. The empirical results show that our measure correlates well with established similarity measures.

This paper is organized as follows: Sect. 2 reviews related work on similarity measures; Sect. 3 presents our method and introduces the weighted similarity measure; Sect. 4 reports empirical results; and Sect. 5 concludes.

2 Related Work

In this section, common rank correlation metrics and the algorithms introduced to improve them are reviewed.

The aim of a similarity measure is to evaluate the correlation of two features that rank a list of documents. Several proposals have been made for rank correlation. Kendall's τ and the Pearson correlation coefficient are two widely used correlation measures that are also applied to feature selection in learning to rank.

Kendall's τ [9] measures the number of document pairs from the same list, sorted by two rankers, that receive the same preference order in both ranked lists, divided by the total number of document pairs. We apply a version of Kendall's τ (TAU) which, for a query q and two features f_i and f_j, is defined as follows:

$$ \mathrm{Tau}\left( f_{i}, f_{j} \right) = \frac{\#\left\{ \left( x_{s}, x_{t} \right) \in X_{q} \mid x_{s} <_{f_{i}} x_{t} \ \text{and} \ x_{s} <_{f_{j}} x_{t} \right\}}{\#\left\{ \left( x_{s}, x_{t} \right) \in X_{q} \right\}} $$
(1)

where the numerator is the number of document pairs related to query q that receive the same preference under both features f_i and f_j, and the denominator is the total number of document pairs associated with the query.
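
As a concrete illustration, the following minimal sketch computes Eq. (1) for one query. The function name and the score-vector interface are our own assumptions rather than part of the paper, and ties are ignored for simplicity.

```python
from itertools import combinations

def tau(scores_i, scores_j):
    """Eq. (1): fraction of document pairs of one query that features
    f_i and f_j place in the same preference order (ties ignored)."""
    pairs = list(combinations(range(len(scores_i)), 2))
    agree = sum(
        1 for s, t in pairs
        if (scores_i[s] < scores_i[t]) == (scores_j[s] < scores_j[t])
    )
    return agree / len(pairs)
```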

To measure similarity, the Pearson correlation coefficient (PCC) is defined as follows:

$$ \mathrm{PCC}\left( f_{i}, f_{j} \right) = \frac{\operatorname{cov}\left( f_{i}, f_{j} \right)}{\sqrt{\operatorname{var}\left( f_{i} \right) \cdot \operatorname{var}\left( f_{j} \right)}} $$
(2)

where \( \operatorname{cov}\left( f_{i}, f_{j} \right) = \sum\nolimits_{k = 1}^{n} \left( f_{i}^{\left( k \right)} - \overline{f_{i}} \right)\left( f_{j}^{\left( k \right)} - \overline{f_{j}} \right) \) is the covariance of the two features and \( \operatorname{var}\left( f_{i} \right) = \sum\nolimits_{k = 1}^{n} \left( f_{i}^{\left( k \right)} - \overline{f_{i}} \right)^{2} \) is the variance of a feature.
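
A minimal sketch of Eq. (2) follows. As in the definitions above, the usual 1/n factor of the covariance and variance is omitted, since it cancels in the ratio; the function name is ours.

```python
import math

def pcc(f_i, f_j):
    """Eq. (2): Pearson correlation of two feature score vectors;
    assumes neither feature is constant over the documents."""
    n = len(f_i)
    mean_i, mean_j = sum(f_i) / n, sum(f_j) / n
    cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(f_i, f_j))
    var_i = sum((a - mean_i) ** 2 for a in f_i)
    var_j = sum((b - mean_j) ** 2 for b in f_j)
    return cov / math.sqrt(var_i * var_j)
```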

The algorithms in [1,2,3] investigate problems with Kendall's τ in ranking. Yilmaz et al. [1] discuss the weakness of Kendall's τ in reflecting errors and precision at the top ranks. They proposed τAP, a variation of Kendall's τ based on average precision that gives more weight to errors at high rankings. τAP has attracted considerable interest [2, 3]: Stefani et al. [2] apply τAP in the Mallows model as the distribution function for the problem of learning probabilistic models for permutations, and Urbano et al. [3] apply statistical estimators of Kendall's τ and τAP to estimate the expected correlation of a test collection against the true ranking.

Carterette et al. [4] address the problem that Kendall's tau works regardless of the actual correlation between the measurements, proposing a rank-correlation-based distance between rankings. Kumar et al. [5] view the problem with Spearman's footrule and Kendall's tau as overlooking object relevance and positional information. They extend Spearman's footrule and Kendall's tau with element weights, position weights, and element similarities, and list five principles for a similarity metric: richness, simplicity, generalization, basic properties, and correlation with other metrics. Richness involves three concepts: element weights, position of elements, and diversity. Luchen et al. [6] propose a family of similarity measures based on maximized effectiveness difference, using effectiveness measures including MAP, NDCG, and ERR. Webber et al. [7] address the problem of indefinite rankings, in which the two lists do not contain the same items and share only some items. Gao et al. [8] propose a head-weighted measure that considers the top of the ranked list; their metric evaluates the gap between system scores and is affected by the gap at the top of the ranked lists. In the next section, our method, based on distance and relevance degree, is presented.

3 Our Method

We consider feature (ranker) similarity in the context of the feature selection problem.

3.1 Similarity Based on Distances

A similarity measure is applied to evaluate the difference between rankers' results. One way to evaluate the similarity between features (ranking methods) according to their ranked lists is to consider the distance between the two ranked lists. We also consider the relevance (importance) degree of each document in the ranked list. Figure 1 shows the weighted similarity measure pattern.

Fig. 1. Weighted similarity measure pattern

We adopt the definition of distance from Spearman's ρ and present a new similarity measure. The distance between the two instances of the same document in two ranked lists, produced by two features (rankers), is defined as the difference between its positions in the two lists. The following example demonstrates the distance of documents in two lists ranked over the same set with two different features.

Feature f_i: d1, d2, d3, d4, d5, d6, d7, d8, d9

Feature f_j: d9, d8, d5, d7, d6, d1, d3, d4, d2

Distance of d3: |position of d3 in list 1 − position of d3 in list 2| = |3 − 7| = 4

To define the similarity, the following normalization is used:

$$ \mathrm{sim}\left( f_{i}, f_{j} \right) = \frac{\max (d) - \sum d}{\max (d)} $$
(3)

where \( \max (d) \) is the maximum total distance, which for n instances equals \( \left\lfloor \frac{n^{2}}{2} \right\rfloor \). The above definition measures similarity according to the difference in positions, but the relevance of each document at each position also matters in ranking.
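
The computation of Eq. (3) can be sketched as follows, with rankings given as lists of document ids in rank order; the helper names are our own. On the nine-document example above, the term contributed by d3 is |3 − 7| = 4, and a fully reversed list yields similarity 0.

```python
def position_distance(rank_i, rank_j):
    """Sum over all documents of |position in rank_i - position in rank_j|."""
    pos_j = {doc: p for p, doc in enumerate(rank_j)}
    return sum(abs(p - pos_j[doc]) for p, doc in enumerate(rank_i))

def sim(rank_i, rank_j):
    """Eq. (3): positional similarity normalized to [0, 1]."""
    n = len(rank_i)
    max_d = n * n // 2  # floor(n^2 / 2), attained by a fully reversed ranking
    return (max_d - position_distance(rank_i, rank_j)) / max_d
```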

3.2 Similarity Based on Degree

We consider a second distance, which captures the weight difference between two ranked lists by taking into account the relevance of the document at each position.

For each position, the absolute difference between the relevance degrees in the two ranked lists is computed; this distance is therefore based on the degree of relevance. The following example illustrates the distance of degrees.

Document degrees: d1 = 0, d2 = 1, d3 = 2, d4 = 0, d5 = 2, d6 = 1, d7 = 3, d8 = 0

Feature f_i: d5, d2, d3, d4, d7, d1, d6, d8

Feature f_j: d1, d2, d3, d8, d5, d6, d7, d4

Distance of degrees: 2, 0, 0, 0, 1, 1, 2, 0

The sum of these distances shows the weighting difference between the two ranked lists. For two identical rankings this value is zero, and its maximum value is r · n, where r is the largest possible difference between relevance degrees and n is the number of positions. The degree distance \( \overline{d} \) is then normalized as follows to define a similarity weight:

$$ w\left( f_{i}, f_{j} \right) = \frac{\max (\overline{d}) - \sum \overline{d}}{\max (\overline{d})} + \alpha $$
(4)

where α is a small constant that prevents the weight from becoming zero; for identical rankings the fraction itself equals 1. Combining the degree weight with the positional similarity yields the weighted similarity measure w.sim:

$$ w.sim\left( f_{i}, f_{j} \right) = \left( \frac{\max (\overline{d}) - \sum \overline{d}}{\max (\overline{d})} + \alpha \right) \cdot \frac{\max (d) - \sum d}{\max (d)} $$
(5)
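
A sketch of Eqs. (4) and (5), reusing position_distance from the previous sketch. The relevance-degree mapping, the max_degree_gap parameter, and the default value of α are our assumptions; the paper does not fix α numerically.

```python
def degree_distance(rank_i, rank_j, degree):
    """Sum of per-position absolute differences of relevance degrees."""
    return sum(abs(degree[a] - degree[b]) for a, b in zip(rank_i, rank_j))

def w_sim(rank_i, rank_j, degree, max_degree_gap, alpha=0.01):
    """Eq. (5): weighted similarity = degree weight (Eq. 4) times the
    positional similarity (Eq. 3); alpha keeps the weight nonzero."""
    n = len(rank_i)
    max_d = n * n // 2                 # max positional distance, as in Eq. (3)
    max_dbar = max_degree_gap * n      # max total degree distance
    w = (max_dbar - degree_distance(rank_i, rank_j, degree)) / max_dbar + alpha
    return w * (max_d - position_distance(rank_i, rank_j)) / max_d
```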

4 Experiments

In this section, the datasets and evaluation measures are first introduced, and then the algorithm settings are presented.

4.1 Dataset

To investigate the proposed method, we conduct our experiments on two datasets from the Letor benchmarks. MQ2008 from Letor 4.0 has 784 queries in total, containing 15,211 document-query pairs, for which 3 relevance degrees are provided and 46 features have been extracted. The OHSUMED dataset from Letor 3.0 consists of 106 queries and 16,140 document-query pairs, with 45 features extracted for each pair and 3 relevance degrees supplied. For cross-validation, each dataset is partitioned into five folds, and each fold contains a training set, a validation set, and a test set.

To compare the proposed method with other correlation strategies, we applied two similarity methods: Kendall's τ, Eq. (1), and the Pearson correlation coefficient, Eq. (2).

To assess the proposed measure and show that similar features produce similar results under an accuracy measure, a comparison of the weighted similarity against MAP is necessary. The difficulty is that we correlate two ranked lists, neither of which is the ground truth. Therefore, we compare the weighted similarity against the features' similarity in MAP. For this evaluation we define the MAP similarity of two features as \( \mathrm{MAPSIM}(f_{i}, f_{j}) = 1 - \left| \mathrm{MAP}(f_{i}) - \mathrm{MAP}(f_{j}) \right| \).
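
For completeness, MAPSIM is straightforward to compute once the MAP of each single-feature ranking is known; the function name is ours.

```python
def map_sim(map_i, map_j):
    """MAPSIM(f_i, f_j) = 1 - |MAP(f_i) - MAP(f_j)|, a value in [0, 1]."""
    return 1.0 - abs(map_i - map_j)
```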

In the following, WSIM denotes the weighted similarity, PCC refers to the Pearson correlation coefficient, TAU denotes Kendall's τ, and MAP represents MAPSIM.

4.2 Experimental Results

In Fig. 2(A), the plot illustrates the weighted similarity against the Pearson correlation coefficient and Kendall's τ across all pairs of features for MQ2008. The plot shows a correlation between the weighted similarity and Kendall's τ; almost all data follow a similar pattern. The weighted similarity is partly correlated with the Pearson correlation coefficient.

Fig. 2. Weighted similarity vs. Pearson correlation coefficient and weighted similarity vs. Kendall's τ

In Fig. 2(B), the plot shows the results for OHSUMED, plotting the weighted similarity against Kendall's τ and the Pearson correlation coefficient across all pairs of features. The plot again shows that for this dataset the correlation between the weighted similarity and Kendall's τ is significant, while the weighted similarity and the Pearson correlation coefficient are partly correlated.

Figure 3 plots the MAP similarity of feature pairs against the weighted similarity of the corresponding pairs. According to plot A, there is a correlation between MAP similarity and the weighted similarity for the OHSUMED dataset, although the MAP similarity produces higher values than the weighted similarity. The linear relation between the two measures across all feature pairs is visible, which shows that features deemed similar by the proposed measure achieve similar accuracy according to MAP. A correlation between MAP similarity and the weighted similarity for the MQ2008 dataset is shown in plot B. Though this correlation is linear and less steep than that for OHSUMED, the dataset shows a clear relation between the two measures.

Fig. 3. (A) Weighted similarity vs. MAP similarity on the OHSUMED dataset. (B) Weighted similarity vs. MAP similarity on the MQ2008 dataset.

5 Conclusion

We introduced a new similarity measure that evaluates the rank correlation between two ranking features. We applied two established similarity methods to compare against the proposed measure. The empirical results showed that our method correlates with these methods and with MAP. The proposed measure also has further properties, including simplicity, and is weighted by document relevance. We conducted our experiments on document retrieval; future work will include applying the weighted similarity to other information retrieval applications.