1 Introduction

Document ranking is an essential component of information retrieval systems and web search engines. Recently, machine learning-based ranking techniques, referred to as “learning to rank,” have given rise to an active and growing research area, both in the information retrieval and machine learning communities (Cao et al. 2007; Freund et al. 2003; Joachims et al. 2009; Niu et al. 2012; Xu and Li 2007). A large number of learning to rank algorithms have been proposed, which incorporate more and more useful features, aiming to improve the performance of the ranking algorithms (Liu 2011). In a supervised setting, they first collect a set of training data, which includes a set of queries, each associated with a list of documents labeled by relevance degrees; with the training dataset, they train a ranking model that can order unseen documents according to their degree of relevance (Joachims et al. 2007). In this situation, dimension reduction inevitably becomes an important issue (Geng et al. 2007).

Firstly, dimension reduction can enhance the accuracy for many machine learning problems, including learning to rank. With dimension reduction techniques, a small set of more discriminative and less redundant features can be selected or generated for learning. Thus, better results could be achieved, as overfitting becomes less likely (Ng 2004). Also, the generalization ability of machine learning models could depend on the radius of training data points, which may decrease when the number of features decreases (Blum and Langley 1997; Geng et al. 2007; Weston et al. 2000; Wolf and Bileschi 2005).

Secondly, large number of features leads to high complexity in most learning to rank algorithms. Therefore, dimension reduction often leads to significant improvements in training and prediction efficiency, while maintaining, or having a limited negative impact on, accuracy. With accuracy being the primary metric, efficiency has also emerged as a crucial issue for evaluating learning to rank algorithms (Cao et al. 2007; Chapelle et al. 2011; Wang et al. 2015). Training datasets and ranking features continue to expand, so as to obtain more accurate models. Furthermore, as a consequence of the dynamic character of the Web, ranking models need to be re-learned repeatedly, and the interval between re-learning procedures decreases sharply (Liu 2011). With dimension reduction techniques, fewer features are used, resulting in more efficient training and prediction.

Generally, there are two types of dimension reduction algorithms: feature selection and feature extraction. The former aims to select a subset of the original features for learning, while the latter attempts to generate a small set of new features from the original features (Blum and Langley 1997; Motoda and Liu 2002; Wyse et al. 1980). Recently, feature selection for ranking has been investigated intensively (Geng et al. 2007; Gupta and Rosso 2012; Lai et al. 2013; Laporte et al. 2014; Naini and Altingövde 2014; Pan et al. 2009; Yu et al. 2009). To the best of our knowledge, the advantages of feature extraction have not yet been explored in learning to rank.

In this study, we address the feature extraction problem for learning to rank. In comparison with feature selection, the feature extraction problem has a much larger search space. For example, given n original features, feature selection selects a subset of features of size k (where \(k < n\)) for learning. Here, for a particular value of k, the search space of the problem contains \(\left( {\begin{array}{c}n\\ k\end{array}}\right)\) possible solutions. The full search space that can include any number of features (i.e., all values of k in range 1 to n), would lead to \(2^{n} -1\) solutions. In comparison, for linear feature extraction, each extracted feature is a linear combination of original n features. Since the coefficient associated with each original feature can be any real number, the search space becomes infinite. The search space of non-linear feature extraction would be even larger, as it also includes solutions involving non-linear combinations of features (e.g. polynomial combinations). Hence, with a larger search space, feature extraction has a greater possibility to achieve better performance than feature selection.

To address the problem of linear feature extraction for learning to rank, we propose LifeRank, a Linear feature extraction algorithm for Ranking. LifeRank regards each dataset for training, validation or test as a matrix, referred to as an original matrix, where each row vector represents a document with a set of features. With a given original matrix for training \(\mathbf {X}\), LifeRank attempts to discover a transformation matrix \(\mathbf {T}\), so that a new matrix (dataset) \(\mathbf {X}'\) can be generated as the product of the original matrix and a transformation matrix, i.e., \(\mathbf {X}'={\mathbf {XT}}\). Thus \(\mathbf {T}\) projects high-dimensional document vectors in \(\mathbf {X}\) into lower-dimensional ones in \(\mathbf {X}'\). Theoretically, there could be a very large number of possible transformation matrices, each leading to a new generated matrix. LifeRank attempts to discover a transformation matrix to transform the original matrix (dataset) into a low-rank one for dimension reduction, on which learning to rank algorithms can achieve optimum results in comparison with other dimension-reduced matrices.

Our problem formulation is similar to principal component analysis (PCA) (Jolliffe 2002), and thus our algorithm LifeRank can be understood from the perspective of PCA. PCA is one of the most popular dimension reduction techniques in machine learning. When PCA is performed using singular valued decomposition (SVD) (Lange 2010), the given matrix \(\mathbf {X}\) can be approximately decomposed into three low-rank matrices \(\mathbf {X}\approx \mathbf {P\Sigma Q^{\top }}\). Here, \(\mathbf {\Sigma }\) is composed of the singular values of \(\mathbf {X}\), \(\mathbf {P}\) and \(\mathbf {Q}\) are composed of the left and right singular vectors of \(\mathbf {X}\) respectively, and \(\mathbf {P}^{\top }\mathbf {P}=\mathbf {Q}^{\top }\mathbf {Q}=\mathbf {I}\) is equal to the identity matrix. Thus a new matrix \(\mathbf {X}'=\mathbf {P\Sigma } \approx \mathbf {XQ}\). However, it should be noted that while PCA calculates \(\mathbf {X}'\) as an approximation of \(\mathbf {X}\), in LifeRank \(\mathbf {X}\) is transformed to \(\mathbf {X}'\) using a transformation matrix.

In LifeRank, we formulate the learning to rank task by using a classical pairwise loss function. A pairwise loss function is used because such functions are fundamental, straightforward and intuitive for ranking. Besides, pairwise loss functions are consistent with the assumption that the labels of documents to rank lie in a rank-differentiable probability space (Lan et al. 2012), and they are upper bounds of measure-based ranking errors (Chen et al. 2009). In the generated matrix, the column vectors represent the features. Since optimization over orthogonal features is beneficial to many machine learning problems (Shalit and Chechik 2014; Shivanna and Bhattacharyya 2014), we utilize the Lagrange multipliers method (Arfken 2013; Bertsekas 1999) to impose orthonormality constraints on the column (feature) vectors of the transformed matrix, and then use gradient descent for optimization. With the transformation matrix \(\mathbf {T}\), the training, validation and test datasets can be directly generated with matrix product.

Note that (1) LifeRank generalizes feature selection algorithms for the learning to rank task. Feature selection can be regarded as optimizing a transformation matrix \(\mathbf {T}\) so that the column vectors of \(\mathbf {T}\) meet the orthonormality constraints and each element in \(\mathbf {T}\) can only be either 0 or 1. (2) Although some deep learning-based ranking algorithms (Severyn and Moschitti 2015) also aim to generate a set of features for ranking, our problem is completely different: we try to construct our features based on some predesigned ranking features like term frequency (TF) and inverse document frequency (IDF), which have been comprehensively used in conventional learning to rank algorithms like Ranking SVM (Cao et al. 2006; Joachims et al. 2009) and RankBoost (Freund et al. 2003). Deep learning-based algorithms, however, try to build features based on word-level features in a corpus that differ substantially from conventional ranking features.

Our main contributions are as follows: (1) We address the feature extraction problem for learning to rank. Feature extraction is a category of comprehensively used dimension reduction techniques in many machine learning problems for performance gains in accuracy and efficiency, but to the best of our knowledge, feature extraction and its advantages have not been explored in learning to rank yet. (2) We propose LifeRank, a linear feature extraction algorithm that generates datasets to be utilized by the learning to rank task. (3) We perform extensive experiments on benchmark datasets and present the performance gains of LifeRank in comparison with the state-of-the-art feature selection algorithms.

The remainder of the paper is organized as follows. Section 2 reviews related work; Sect. 3 defines the feature extraction problem for ranking; Sect. 4 proposes LifeRank, a gradient descent-based algorithm. Section 5 introduces our experimental setup. Section 6 reports the experimental results, and Sect. 7 concludes the paper.

2 Related work

We discuss three types of related work: learning to rank, feature selection for ranking, and feature extraction for ranking.

2.1 Learning to rank for information retrieval

Learning to rank has received increased attention from both the machine learning and information retrieval community. While there is a growing interest in online learning to rank (Schuth et al. 2016) and in counterfactual learning to rank from online data (Joachims et al. 2018), the bulk of the work on learning to rank concerns offline learning to rank, where explicit human annotations are used to label query, document pairs. Offline learning to rank is the focus of this paper. Given its effectiveness, many algorithms have been proposed, which mainly fall into three categories (Chapelle et al. 2011; Liu 2009): pointwise, pairwise, and listwise.

Pointwise approaches, such as Pranking (Crammer and Singer 2001), McRank (Li et al. 2007) and Subset Ranking (Cossock and Zhang 2008), view each document in the training dataset as a learning instance, and utilize a classification or regression technique to predict the relevance categories or numerical/ordinal relevance scores for unlabeled data. Pairwise approaches, such as Ranking SVM (Cao et al. 2006; Joachims et al. 2009), RankBoost (Freund et al. 2003), RankNet (Burges et al. 2005), FRank (Tsai et al. 2007), LambdaRank (Burges et al. 2007), and BoltzRank (Volkovs and Zemel 2009), regard a pair of documents as a learning instance, and try to learn a binary classifier that can predict the more relevant document to the given query from each pair of documents. Then the ranked lists of documents can be aggregated based on the pairwise preferences of the documents. Listwise approaches, such as ListNet (Cao et al. 2007), SVM-MAP (Yue et al. 2007), NDCGBoost (Valizadegan et al. 2009), take the entire ranked list of documents as a learning instance, and attempt to construct a ranking model that can directly predict the full rankings of the documents. Recently, some hybrid algorithms have been proposed, such as FocusedRank (Niu et al. 2012), MixRank (Busa-Fekete et al. 2013), targeting improvements in learning accuracy, efficiency, or both. More algorithms are surveyed in Chapelle et al. (2011), Liu (2009, 2011).

With the incorporation of more and more useful features for performance gains, dimension reduction inevitably becomes an important issue in the ranking problem (Geng et al. 2007). With effective dimension reduction techniques, not only the efficiency of the algorithms could be improved, but also accuracy could be enhanced as a result of using more discriminative features with less redundancy and noise. Furthermore, the generalization of the ranking model can also be increased as a result of using fewer features (Geng et al. 2007).

2.2 Feature selection for ranking

Recently, considerable efforts have been made on feature selection for ranking. Geng et al. (2007) present GAS, one of the first attempts to incorporate the importance and similarity of features for ranking. In particular, it evaluates the importance of features with ranking metrics like MAP (Baeza-Yates and Ribeiro-Neto 1999) and NDCG (Järvelin and Kekäläinen 2002), and estimates the similarity between features using agreement between rankings, e.g., with Kendall \(\tau\) correlation coefficient (Kendall 1948). Then it greedily selects a subset of features with maximum total importance scores and minimum total similarity scores. Metzler (2007) proposes a greedy feature selection algorithm to be used within the Markov random field model for information retrieval. The model automatically generates models that are more effective than, or as effective as, models created by carefully selecting the features manually. Pan et al. (2009) investigate a boosted regression trees-based feature selection algorithm. It evaluates the importance of the features based on boosted trees. Then it selects features by maximizing the discounted importance of the features, where the importance of each feature is discounted by feature similarity. Yu et al. (2009) propose RankWrapper and RankSelect, two feature weighting and selection algorithms for learning to rank. They utilize ranking distances of nearest data points in order to identify the key features for ranking, demonstrating significant efficiency gains in comparison with GAS.

Gupta and Rosso (2012) present a Kullback–Leibler (KL) divergence-based divergence metric, and select a subset of features for ranking based on features’ expected divergence over the relevance classes and the importance of features. Lai et al. (2013) propose a joint convex optimization formulation for minimizing ranking errors while simultaneously conducting feature selection. This optimization formulation provides a flexible framework in which various importance measures and similarity measures of the features can easily be incorporated. Naini and Altingövde (2014) adopt three greedy diversification strategies, maximal marginal relevance, MaxSum dispersion and modern portfolio theory, to the problem of feature selection for ranking. Laporte et al. (2014) propose a general framework for feature selection in learning to rank based on support vector machine (SVM); they investigate both classical convex regularizations (such as L1 and weighted L1) and non-convex regularization terms (such as log penalty, Minimax Concave Penalty (MCP) and Lp pseudo norm with \(p < 1\)). Furthermore, they provided an accelerated proximal approach for solving the convex problems and a re-weighted L1 scheme to address the non-convex regularizations.

All of these algorithms are meant to address feature selection for ranking. To the best of our knowledge, there is no work targeting feature extraction for ranking.

2.3 Feature extraction techniques

Feature extraction has been used extensively used in various machine learning scenarios for performance gains in terms of accuracy and efficiency. Given its effectiveness, many approaches have been proposed, which are either linear or non-linear algorithms.

The main linear technique for feature extraction is principal component analysis (PCA) (Jolliffe 2002), which performs a linear mapping of high-dimensional data into a lower-dimensional space in such a way that the variance of the data in the low-dimensional representation is maximized. Canonical-correlation analysis (CCA) (Hardoon et al. 2004) is another popular linear feature extraction algorithm, which attempts to discover linear combinations of the original features that have maximal correlation with each other. In addition, several probabilistic algorithms, including probabilistic PCA (Tipping and Bishop 1999), probabilistic CCA (Bach and Jordan 2005) and probabilistic partial CCA (Mukuta and Harada 2014), have been proposed, where a set of latent variables are introduced for probabilistically interpreting these models.

Non-linear feature extraction algorithms can combine the original features to generate a set of features in a non-linear way. For example, the locally linear embedding (LLE) method (Roweis and Saul 2000) learns the global structure of non-linear manifolds to yield low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs. Isomap (Tenenbaum et al. 2000) is capable of discovering the non-linear degrees of freedom that underly complex natural observations. It can efficiently compute a globally optimal solution and can be guaranteed to converge asymptotically to the true structure. Besides, some kernel techniques have been proposed to transform linear feature extraction algorithms into nonlinear ones. For example, kernel PCA (Schölkopf et al. 1998) is a non-linear form of principal component analysis (PCA), which can efficiently compute principal components in high-dimensional feature spaces through the use of integral operator kernel functions.

Although feature extraction techniques have been extensively investigated and shown to demonstrate promising performance gains, to the best of our knowledge, they have not been explored yet in the context of the ranking problem.

3 Problem statement

3.1 Learning to rank for information retrieval

Let \(\mathcal {X}\) be a collection of documents, each represented by a vector of feature values. In information retrieval systems, given a query q, a list of documents from \(\mathcal {X}\) is returned as search results, where the documents are ranked according to their estimated relevance to q. Given a query q, the ground truth, i.e., relevance judgments of documents with respect to q (produced by human experts) is defined as a function \(rel: \mathcal {X} \rightarrow \mathbb {N}_0\), where \(\mathbb {N}_0\) is the set of natural numbers (including 0).

Let \(f: \mathcal {X} \rightarrow \mathbb {R}\) be a ranking function assigning real valued relevance scores to documents. The goodness of ranking functions can be evaluated by a measure s, such as precision at n (P@n), mean average precision (\(MAP\)) (Baeza-Yates and Ribeiro-Neto 1999), or normalized discount cumulative gain (\(NDCG @n\)) (Järvelin and Kekäläinen 2002).

Definition 1

(Learning to rank) Given a training dataset \(\mathcal {X}\) and an evaluation measure s, the problem of learning to rank is to learn a ranking function f from \(\mathcal {X}\) such that s(f) is maximized.

3.2 Dimension reduction for ranking

In learning to rank, each dataset \(\mathcal {X}\) can be regarded as a document matrix \(\mathbf {X}_{m\times n}\) with m rows (documents) and n columns (features). In particular, \(\mathbf {x}_i\) is the i-th row of \(\mathbf {X}\), and \({\mathbf {x}_i}^{\top }\) is a n-dimensional (column) vector that represents a document with n features. Let \(g: \mathbb {R}^n \rightarrow \mathbb {R}^k\) (\(k\le n\)) be a mapping that projects an n-dimensional vector space into a k-dimensional space. Let \(L(\cdot )\) be the loss function for the learning to rank task. Our problem is to discover a mapping function g such that the obtained dataset \(\mathbf {X}' = g(\mathbf {X})\) minimizes the loss function.

Definition 2

(Dimension reduction for ranking) Let \(\mathbf {X}_{m \times n}\) be a document matrix with m columns and n rows, where each column \({\mathbf {x}_i}^{\top }\) is a n-dimensional vector, representing a document with n features. Let \(\mathcal {G}\) be the set of all possible mapping functions, where each element \(g: \mathbb {R}^n \rightarrow \mathbb {R}^k\) (\(k\le n\)) is used to project an n-dimensional vector space into a k-dimensional space. The dimension reduction for the learning to rank task tries to discover an optimum mapping function \(g^* \in \mathcal {G}\) such that:

$$\underset{g \in \mathcal {G}}{{\text {arg min}}}\, L(g(\mathbf {X})),$$
(1)

where \(L(\cdot )\) is the loss function for the learning to rank task. Then the new dataset can be generated with \(g^*(\mathbf {X})\).

In this paper, we consider linear feature extraction for learning to rank as it is the simplest and most straightforward feature extraction technique in machine learning. Here, each generated feature is a linear combination of the original features. It utilizes a transformation matrix \(\mathbf {T}\) to achieve the effectiveness of the mapping function, aiming to discover an optimal matrix \(\mathbf {T}\) such that the obtained dataset \(\mathbf {X}' = \mathbf {XT}\) results in a minimal value of the loss function.

The problem can be understood from the perspective of PCA (Jolliffe 2002). Using PCA, the given matrix \(\mathbf {X}\) can be approximately decomposed into three lower-rank matrices:

$$\mathbf {X}\approx \mathbf {P\Sigma Q^{\top }},$$
(2)

where \(\mathbf {\Sigma }\) is composed of the singular values of \(\mathbf {X}\), \(\mathbf {P}\) and \(\mathbf {Q}\) are composed of the left and right singular vectors of \(\mathbf {X}\) respectively, and \(\mathbf {P}^{\top }\mathbf {P}=\mathbf {Q}^{\top }\mathbf {Q}=\mathbf {I}\) (the identity matrix). Thus, a new matrix \(\mathbf {X}'\) can be generated as follows:

$$\mathbf {X}'=\mathbf {P\Sigma } \approx \mathbf {XQ}.$$
(3)

The role of the transformation matrix \(\mathbf {T}\) in LifeRank is very similar to the right singular matrix \(\mathbf {Q}\) in PCA, where \(\mathbf {Q}\) maps the document vectors to another space spanned by the columns of \(\mathbf {Q}\) before transforming them through \(\mathbf {\Sigma }\) and going back through \(\mathbf {P}\). Hence, in LifeRank we consider the orthonormality constraints of \(\mathbf {T}\) in our optimization process, i.e., \(\mathbf {T}^{\top }\mathbf {T}=\mathbf {I}\).

Definition 3

(Constrained linear feature extraction for ranking) Let \(\mathbf {X}_{m \times n}\) be a document matrix, where the transpose of each row, i.e., \({\mathbf {x}_i}^{\top } = \mathbf {d}_i\) is a n-dimensional vector, representing a document with n features. Linear feature extraction for ranking aims to optimize a transformation matrix \(\mathbf {T}_{n\times k}\) by solving the following optimization problem, so that a new document matrix \(\mathbf {X}'_{m\times k} = \mathbf {X}_{m \times n}\mathbf {T}_{n\times k}\) can be generated, where each document vector \(\mathbf {d}_i\) can be projected into k-dimensional vector \(\mathbf {d}'_i = \mathbf {T}^{\top } \mathbf {d}_i\):

$$\underset{\mathbf {T}}{{\text {arg min}}}\quad L(\mathbf {X}\mathbf {T})\text { such that }\mathbf {T}^{\top }\mathbf {T}=\mathbf {I},$$
(4)

where \(L(\cdot )\) is the loss function for the learning to rank task.

Based on the optimized mapping function g, the new dataset can be generated by taking the product of the original matrix and the transformation matrix, i.e., \(\mathbf {X}'=\mathbf {XT}\).

We have used the example of PCA to help us explain the mechanism of LifeRank. However, it should be noted that in PCA \(\mathbf {X}'\) is calculated as an approximation of \(\mathbf {X}\), whereas in LifeRank we generate a transformed representation of the initial matrix, in order to achieve a better ranking performance. Hence, unlike PCA, \(\mathbf {X}'\) as computed in Definition 3 is not an approximation of \(\mathbf {X}\), but a transformation.

4 The LifeRank algorithm

Given a high-dimensional dataset \(\mathcal {X}\), LifeRank generates a new low-dimensional dataset \(\mathcal {X}'\) in two phases. In the first phase, LifeRank first preprocesses the training dataset \(\mathcal {X}\) into an original matrix \(\mathbf {X}\). Then LifeRank optimizes the transformation matrix \(\mathbf {T}\) for \(\mathbf {X}\) according to the loss function in Eq. 4. In the second phase, LifeRank generates low-dimensional training, validation and test matrices with the projection of \(\mathbf {T}\). Then LifeRank constructs new datasets based on the low-dimensional data matrices.

4.1 Phase I: Generation of the transformation matrix

In this study, we utilize a classic pairwise learning to rank loss function to implement the function \(L(\cdot )\) in Definition 3. Pairwise loss functions are chosen because apart from being relatively simple and straightforward, they are also intuitive choices for ranking. Besides, with the assumption that the labels of documents to rank lie in a rank-differentiable probability space, pairwise loss functions are consistent (Lan et al. 2012) and provide upper bounds for measure-based ranking errors like NDCG (Chen et al. 2009). Thus, minimizing a pairwise loss function will maximize the ranking measures (Lan et al. 2012).

First of all, the training dataset \(\mathcal {X}\) is preprocessed into an original matrix \(\mathbf {X}\) and other information \(\mathbf {I}_X\) consisting of identities of the documents and queries, relevance labels, etc. Let \(D = \{\mathbf {d}_1, \mathbf {d}_2, \ldots ,\mathbf {d}_m\}\) be the set of columns (document vectors) in the matrix \(\mathbf {X}_{m \times n} ^{\top }\). We regard each pair of vectors \((\mathbf {d}_i, \mathbf {d}_j) \in D \times D\) as an instance, and the label \(y_{i,j} \in \{+1,-1\}\) indicates whether the relevance of the i-th document is higher or lower than the j-th document, corresponding to the given query. Let \(\{\mathbf {t}_1, \mathbf {t}_2, \ldots , \mathbf {t}_k\}\) be the column vectors of \(\mathbf {T}\). We try to discover a k-dimensional vector of weights \(\mathbf {w}\) such that:

$$\begin{aligned} \begin{aligned} {\text {arg min}}_{\mathbf {T},\mathbf {w},b}\,&\sum _{\forall (\mathbf {d}_i,\mathbf {d}_j), i\ne j} \log \left( 1+e^{-y_{i,j} \left( \mathbf {w}^{\top } \mathbf {T}^{\top } (\mathbf {d}_i-\mathbf {d}_j)+b\right) }\right) +\frac{\lambda }{2} ||\mathbf {w}||^2\\&\text {such that } \mathbf {t}_i^{\top }\mathbf {t}_j= {\left\{ \begin{array}{ll} 1, &{} i=j\\ 0, &{} i\ne j \end{array}\right. }, \forall i,j=1,2,\ldots ,k, \end{aligned} \end{aligned}$$
(5)

where the first part calculates the \(\log\) loss of the ranking accuracy, the second part is the l2 norm of the parameters for regularization, and \(\lambda\) is the coefficient of the regularization term for trade-off.

We optimize the constrained loss function based on the Lagrange multipliers method (Arfken 2013; Bertsekas 1999) in Eq. 5. Let

$$\begin{aligned} \begin{aligned} \mathcal {L}(\mathbf {T},\mathbf {w},b,\mathbf {A}) &=\sum _{\forall (\mathbf {d}_i,\mathbf {d}_j), i\ne j} \log \left( 1+e^{-y_{i,j} \left( \mathbf {w}^{\top } \mathbf {T}^{\top } (\mathbf {d}_i-\mathbf {d}_j)+b\right) }\right) + {}\\&\frac{\lambda }{2} ||\mathbf {w}||^2 + \sum _{i,j=1,\ldots , k \wedge i\ne j} \alpha _{i,j}\mathbf {t}_i^{\top }\mathbf {t}_j + \sum _{i=1}^k \alpha _{i,i}\left( 1-\mathbf {t}_i^{\top }\mathbf {t}_i\right) , \end{aligned} \end{aligned}$$
(6)

where \(\mathbf {A}\) is a matrix with k columns and k rows, and elements \(\alpha _{i,j}\). Then, the optimum \(\mathbf {T}\), \(\mathbf {w}\) and b for minimizing \(\mathcal {L}\) are the exact results of Eq. 5.

In Phase I, we utilize gradient descent to generate the training dataset and the transformation matrix. Initially, we assign all 1s to the vector \(\mathbf {w}_{k \times 1}\) so that all of the generated features in the ranking model have the same initial weight. We initialize the transformation matrix \(\mathbf {T}\) in a random manner, following work on matrix generalization problems like matrix factorization-based collaborative filtering (Koren et al. 2009). After initialization, the weight vector \(\mathbf {w}\) and the factorized matrix can be updated iteratively with gradient descent until reaching convergence or the maximum number of iterations with the given learning rate. The gradients of the function \(\mathcal {L}\) with respect to the variables are calculated as follows:

$$\begin{aligned} \begin{aligned} \nabla _{\mathbf {w}} \mathcal {L} &=\sum _{\forall (\mathbf {d}_i,\mathbf {d}_j), i\ne j} \frac{-y_{i,j}\mathbf {T}^{\top }\left( \mathbf {d}_i - \mathbf {d}_j\right) }{1+e^{y_{i,j}\left( \mathbf {w}^{\top }\mathbf {T}^{\top }\left( \mathbf {d}_i - \mathbf {d}_j\right) +b\right) }} + \lambda \mathbf {w}\\ \nabla _{\mathbf {t}_l} \mathcal {L}&=\sum _{\forall (\mathbf {d}_i,\mathbf {d}_j), i\ne j} \frac{-y_{i,j}w_l\left( \mathbf {d}_i - \mathbf {d}_j\right) }{1+e^{y_{i,j}\left( \mathbf {w}^{\top }\mathbf {T}^{\top }\left( \mathbf {d}_i - \mathbf {d}_j\right) +b\right) }}+{}\\&\sum _{i\ne l} (\alpha _{l,i}+\alpha _{i,l})\mathbf {t}_i - 2\alpha _{l,l}\mathbf {t}_l,\quad l=1,\ldots ,k\\ \frac{\partial \mathcal {L}}{\partial b}&=\sum _{\forall (\mathbf {d}_i,\mathbf {d}_j), i\ne j} \frac{-y_{i,j}}{1+e^{y_{i,j}\left( \mathbf {w}^{\top }\mathbf {T}^{\top }\left( \mathbf {d}_i - \mathbf {d}_j\right) +b\right) }}\\ \frac{\partial \mathcal {L}}{\partial \alpha _{i,j}} &={\left\{ \begin{array}{ll} \mathbf {t}_i^{\top }\mathbf {t}_j, &{} i\ne j\\ 1-\mathbf {t}_i^{\top }\mathbf {t}_j, &{} i=j \end{array}\right. } \quad \forall i,j=1,\ldots ,k, \end{aligned} \end{aligned}$$
(7)

where \(\mathbf {t}_1, \mathbf {t}_2, \ldots \mathbf {t}_n\) are the column vectors of \(\mathbf {T}\). Since gradient descent generally does not work with Lagrange multipliers, we use the basic differential multiplier method (BDMM) (Platt and Barr 1988) for optimization, where the sign inversion for \(\alpha\) in Eq. 8 makes the optimization stable. Given a learning rate \(\eta\), the update formulas of the gradient descent method are:

$$\begin{aligned} \begin{aligned} \mathbf {w}&\leftarrow \mathbf {w} - \eta \nabla _{\mathbf {w}} \mathcal {L} \\ \mathbf {t}_l&\leftarrow \mathbf {t}_l - \eta \nabla _{\mathbf {t}_l}\mathcal {L}, \text { for } l=1,\ldots ,k\\ b&\leftarrow b-\eta \frac{\partial \mathcal {L}}{\partial b}\\ \alpha _{i,j}&\leftarrow \alpha _{i,j}+\eta \frac{\partial \mathcal {L}}{\partial \alpha _{i,j}}, \text { for }i,j=1,\ldots ,k. \end{aligned} \end{aligned}$$
(8)

4.2 Phase II: Generation of low-rank datasets

In LifeRank, Phase II generates all of the datasets for learning to rank, including the training, validation and test datasets. According to Definition 3, for each original matrix \(\mathbf {X}\), the generated matrix \(\mathbf {X}'\) can be obtained as a product of the original dataset \(\mathbf {X}\) and the transformation matrix \(\mathbf {T}\), formally \(\mathbf {X}'=\mathbf {XT}\). Then, the new low-dimensional dataset \(\mathcal {X}'\) can be generated by integrating matrix \(\mathbf {X}'\) with other information \(\mathbf {I}_X\) that was filtered in the preprocessing step in Phase I.

4.3 Pseudocode

The pseudocode of LifeRank as a dimension reduction algorithm for ranking is summarized in Algorithm 1. Given the number of generated features k and a set of standard learning to rank datasets, including a training dataset \(\mathcal {X}\), a validation dataset \(\mathcal {V}\) and a test dataset \(\mathcal {E}\), LifeRank tries to output new low-dimensional datasets \(\mathcal {X}'\), \(\mathcal {V}'\) and \(\mathcal {E}'\) for training, validation and test, respectively, for the learning to rank procedure.

Algorithm 1 implements the two phases of LifeRank: (I) Lines 1–8 generate the transformation matrix \(\mathbf {T}\) based on the original training dataset \(\mathcal {X}\); (II) Using \(\mathbf {T}\), lines 9–10 generate the low-dimensional matrices for training \(\mathbf {X}'\), validation \(\mathbf {V}'\) and test \(\mathbf {E}'\). Then, line 11 constructs the low-dimensional training, validation and test datasets by directly integrating the low-rank matrices and their corresponding information filtered in the preprocessing steps in lines 1 and 9.

figure a

4.4 Discussion

In this section, we reveal a connection between the feature selection for ranking problem and the linear feature extraction for ranking problem. In particular, from the perspective of linear transformations of matrices, the feature selection for ranking problem can be defined as in Definition 4.

Definition 4

(Feature selection for ranking) Let \(\mathbf {X}_{m \times n}\) be a document matrix, where the transpose of each row \({\mathbf {x}_i}^{\top } = \mathbf {d}_i\) is an n-dimensional vector, representing a document with n features. Feature selection for ranking aims to optimize a transformation matrix \(\mathbf {T}_{n\times k}\) by solving the following optimization problem, so that a new document matrix \(\mathbf {X}'_{m\times k} = \mathbf {X}_{m \times n}\mathbf {T}_{n\times k}\) can be generated, where each n-dimensional document vector \(\mathbf {d}_i\) can be projected into a k-dimensional vector \(\mathbf {d}'_i = \mathbf {T}^{\top } \mathbf {d}_i\):

$$\underset{\mathbf {T}}{{\text {arg min}}} \, L(g(\mathbf {X}\mathbf {T})) \text { such that } {\left\{ \begin{array}{ll} \forall t_{i,j}\in \mathbf {T}: t_{i,j} = \{0,1\}\\ \mathbf {T}^{\top }\mathbf {T}=\mathbf {I}. \end{array}\right. }$$
(9)

Based on the optimized mapping function g, the new low-rank matrix can be generated as a product of the original matrix and the transformation matrix, i.e., \(\mathbf {X}'=\mathbf {XT}\).

The k columns of the transformation \(\mathbf {T}\) mentioned in Defintion 4 present the k iterations of the feature selection processes. The constraints in Eq. 9 guarantee that there is only one “1” in each column of the transformation matrix \(\mathbf {T}\) and the others are all “0,” indicating that each feature selection process only selects one feature. The second constraint \(\mathbf {T}^{\top }\mathbf {T}=\mathbf {I}\) guarantees that the position of the unique “1” in each column is different from other columns, which is the index of the selected feature in that step.

Since the elements in the transformation matrix \(\mathbf {T}\) can be any real numbers in Definition 3 while they are only either 0 or 1 in Definition 4, Definition 3 generalizes Definition 4, i.e., the problem of linear feature extraction for ranking generalizes the problem of feature selection for ranking. Because of this, linear feature extraction is expected to outperform or be at least as good as any feature selection technique. The linear feature extraction is expected to use more computational resources than feature selection, since former deals with the search space in real numbers and the latter with binary case. However, this computational overhead is the tradeoff for the higher performance expected to be achieved by the extracted features, when utilized for learning to rank.

5 Experimental setup

5.1 Research questions

We list the research questions that guide the remainder of the paper.

RQ1 :

What is the performance of LifeRank in generating low-dimensional datasets? Does LifeRank outperform state-of-the-art feature selection algorithms? (See Sect. 6.1)

RQ2 :

Can the importance and redundancy of the features generated by LifeRank outperform those selected by feature selection algorithms? (See Sect. 6.2)

RQ3 :

What is the effect of the orthonormality constraints of the transformation matrix in Eq. 4? Does it help enhance the performance of ranking predictions? (See Sect. 6.3.)

5.2 Datasets

In this study, we use the MQ2007 and MQ2008 datasets from LETOR 4.0 (Qin and Liu 2013) and OHSUMED from LETOR 3.0 (Qin et al. 2010) to evaluate our algorithm. The LETORFootnote 1 datasets are commonly used benchmarks in learning to rank. LETOR 4.0 is the latest version, which was released in July 2009. It uses the Gov2 web page collection (\({\sim }\)25M pages) and two query sets from the Million Query track of TREC 2007 and TREC 2008, which are referred to as MQ2007 and MQ2008. We use both MQ2007 and MQ2008 in our experiments. In MQ2007, there are about 1700 queries and about 70,000 query-document pairs, while MQ2008 has 800 queries and about 15,000 query-document pairs for training, validation and testing. In both datasets, each query-document pair has 46 features. We also use the OHSUMED dataset from LETOR 3.0, which was released in December 2008. OHSUMED is extracted from the online medical information database MEDLINE. It contains 106 queries and about 16,000 query-document pairs, where each query-document pair has 45 features.

In all the datasets that we use, relevance of documents with respect to queries is judged at three levels: 2 (definitely relevant), 1 (partially relevant), and 0 (not relevant). In our experiments, we use five-fold cross validation. In each fold, 60% queries are used for training, 20% for validation and and the remaining 20% for testing. The performance numbers reported are averaged over the five folds.

5.3 Baselines

LifeRank aims to generate low-dimensional datasets for ranking. In this paper, we utilize three baselines to evaluate the datasets generated by our algorithm:

Original datasets: :

We firstly use the original LETOR datasets as our first baseline, on which no selection or generation has been performed.

Datasets generated by GAS: :

GAS (Geng et al. 2007) incorporates importance and similarity information of the features into ranking. It greedily selects a subset of features by maximizing the total importance scores meanwhile minimizing the total similarity scores.

Datasets generated by FSMRank: :

FSMRank (Lai et al. 2013) trains a feature selection model with machine learning, which can select a subset of features meanwhile minimizing the ranking errors.

We then run Linear Regression (Lawson and Hanson 1995)-based learning to rank and RankSVMFootnote 2 (Joachims et al. 2009) to determine how well these datasets can address the ranking problem. The former makes pointwise predictions on the relevance of the documents by linear regression, which is implemented in the RankLib learning to rank toolkit.Footnote 3 The latter predicts pairwise ranking relation between each pair of documents directly by support vector machine (SVM). These are classical pointwise and pairwise learning to rank algorithms, respectively, with which we can clearly demonstrate the effects of dimension reduction.

Since LifeRank uses a linear approach for feature extraction, it is expected to show effectiveness mainly for linear learning-to-rank methods. This is the reason why we have chosen SVMRank and Linear Regression for experimentation.

5.4 Evaluation measures

5.4.1 Measures for ranking

We use two standard ranking accuracy metrics to evaluate the rankings generated by learning to rank algorithms: mean average precision (MAP) (Baeza-Yates and Ribeiro-Neto 1999) and normalized discount cumulative gain (NDCG@n) (Järvelin and Kekäläinen 2002).

Statistical significance of observed differences between the performance of two runs is tested using a two-tailed paired t-test and is denoted using (or ) for strong significance for \(\alpha =0.01\); or (or ) for weak significance for \(\alpha =0.05\).

5.4.2 Measures for features

We consider two metrics to evaluate the quality of the features: importance and redundancy.

The importance of each feature can be evaluated by the ranking performance when the feature is used as a ranking model to order the documents. In particular, we use NDCG@5 for evaluation. Since for calculating these measures, for some features larger values correspond to higher ranks while for others smaller values lead to higher ranks, we utilize the strategy in GAS (Geng et al. 2007) for evaluation: We order the documents twice in ascending and descending manners respectively, and take the larger score as the importance score of the features. Then we calculate the average importance of the features as the importance of the set of features \(F=\{f_1, f_2, \ldots , f_k\}\):

$$Imp(F)=\frac{1}{k}\sum _{i=1}^k \max \big \{eva(\mathcal {X}, f_i), eva(\mathcal {X}, -f_i)\big \},$$

where the function \(eva(\mathcal {X}, f_i)\) returns the evaluation results of the ranking model \(f_i\) on the dataset \(\mathcal {X}\).

The redundancy of features can be defined as the average similarity between each pair of features. In practice, we regard each feature as a ranking model to order the documents, and then calculate the similarity between each pair of features as the average similarity of their document rankings associated to different queries. Let Q be the set of queries in the given dataset, each associated with a set of documents for ranking. The redundancy of the features \(F=\{f_1, f_2, \ldots , f_k\}\) is calculated as follows:

$$Rdd(F) = \frac{2}{k(k-1)}\sum _{f_i, f_j \in F, i> j} \frac{1}{|Q|}\sum _{q\in Q} sim\left( \sigma ^{(q)}_i, \sigma ^{(q)}_j\right) ,$$

where \(\sigma ^{(q)}_i\) is the ranking of the document associated to the query q when the feature \(f_i\) is used as the ranking model to order the documents. In this paper, we take the absolute value of Kendall’s \(\tau\) correlation coefficient (Kendall 1948) as the similarity metric for rankings:

$$\tau \left( \sigma _i, \sigma _j\right) = \frac{N_c - N_d}{N_c + N_d},$$
(10)

where \(N_c\) and \(N_d\) are the numbers of the concordant pairs and discordant pairs respectively between rankings \(\sigma _i\) and \(\sigma _j\).

The range of \(\tau \left( \sigma _i, \sigma _j\right)\) is \([-1,1]\), where the sign indicates that the correlation between \(\sigma _i\) and \(\sigma _j\) is either positive or negative, and the absolute value indicates the strength of the correlation. Since positive and negative values should not neutralize and we only consider the strength of the correlations, we take the absolute value of Kendall’s \(\tau\) as the similarity metric in the definition of redundancy.

6 Experimental results

6.1 Performance on generated datasets

Tables 1, 2 and 3 list the results obtained in our experiments on the MQ2007, MQ2008 and OHSUMED datasets, respectively. They show the NDCG@1–10 and MAP scores for the RankSVM and Linear Regression learning to rank algorithms on 4 categories of datasets: the original datasets and 3 datasets generated by dimension reduction algorithms including GAS, FSMRank and our LifeRank. For each dimension reduction algorithm, we consider k generated features, with \(k= 5\), 10, 15, 20. The results for the original dataset in the tables are independent of the value of k, but are repeated nevertheless for ease of comparison. The values in bold represent the best performance among GAS, FSMSVM and LifeRank.

Table 1 Ranking performance on MQ2007 and selected/generated datasets
Table 2 Ranking performance on MQ2008 and selected/generated datasets
Table 3 Ranking performance on OHSUMED and selected/generated datasets

Overall, from the tables we can see that: (1) The performance of ranking algorithms can be maintained or slightly improved on the datasets generated by dimension reduction techniques. (2) The performance of the ranking algorithms on the datasets generated by LifeRank is higher than those generated by GAS and FSMRank in most cases. Let us now take a closer look.

6.1.1 Performance of RankSVM

For RankSVM, we can see that LifeRank clearly shows improvements over the original datasets for all the three benchmarks (MQ2007, MQ2008 and OHSUMED) in terms of NDCG@1–10 as well as MAP. The only exception is MQ2007 for \(k=5\), where the performance of LifeRank as well as the other generated datasets does not beat the original dataset. We can also see from the tables that LifeRank clearly outperforms other generated datasets (GAS and FSMSVM) on NDCG@1–10 for all the benchmarks and all values of k.

In terms of MAP, LifeRank outperforms the other generated datasets in most cases. The few exceptions include the case for MQ2007, when GAS has a higher MAP for \(k=5\). For MQ2008, FSMSVM attains slightly higher MAP score than LifeRank for \(k=10\) and \(k=20\), but these differences are not significant. Also, for OHSUMED when \(k=5\), the MAP score attained by LifeRank is lower than FSMSVM and GAS, but it is still an improvement over the original dataset.

6.1.2 Performance of linear regression

Also in the case of Linear Regression, for all three benchmarks (MQ2007, MQ2008 and OHSUMED) LifeRank clearly shows improvements over the original datasets in terms of NDCG@1–10 as well as MAP. The only exception is MQ2007 for \(k=5\), where the original dataset performs better than LifeRank as well as the other generated datasets.

On NDCG@1–10, for MQ2007 LifeRank gives the best performance for all values of k, except for \(k=20\), where GAS gives the best performance. For MQ2008, LifeRank gives the best performances for \(k=5\) and \(k=10\) on NDCG@1–10. However, for \(k=15\) and \(k=20\), there is mixed performance where all GAS, FSMSVM and LifeRank give best performances in certain cases. For, OHSUMED, LifeRank gives the best performance on NDCG@1–10 in most cases.

In terms of MAP, LifeRank gives the best performance for MQ2007 for \(k=5\) and \(k=10\), whereas for \(k=15\) and \(k=20\) the best performance is given by GAS. For MQ2008, LifeRank outperforms others for all values of k, except for \(k=15\) where the best performance is given by GAS. Moreover, for OHSUMED, FSMSVM outperforms the others for \(k=5\) and \(k=15\), while LifeRank gives the best performance for \(k =10\). In case of \(k=20\), there is a tie between LifeRank and FSMSVM.

6.1.3 Statistical significance overview

In Tables 1, 2 and 3, markups are provided to denote the statistical significance between LifeRank and the following baselines: original dataset, GAS and FSMSVM. It should be noted that the original dataset is independent of the values of k, but is repeated in the table to indicate statistical significance between it and datasets generated by LifeRank for different values of k.

It can be observed from Table 1 that for MQ2007 in the case of RankSVM, there is strong to weak significance between LifeRank and the baselines in most cases across the metrics, while there is no significance shown against original for \(k = 15\). Moreover, for \(k=5\), significance is shown against original and GAS in few cases. For Linear Regression, there is strong significance shown against FSMSVM for \(k=5\) and \(k=15\), though there is not much significance shown for \(k=10\) and \(k=20\). Also, weak significance is shown against GAS in few cases for \(k = 5\) and against original for \(k = 15\).

Table 2 for MQ2008 shows no statistically significant differences between LifeRank and FSMSVM for RankSVM. There is weak to strong statistical significance for LifeRank against original dataset for most cases and against GAS mainly for \(k = 5, 10\) and 15. For Linear Regression, LifeRank shows weak to strong statistical significance against original in most cases, GAS in no cases and FSMSVM in few cases. Moreover, Table 3 for OHSUMED shows statistical significance for RankSVM in many cases against the baselines, whereas there is statistical significance observed for Liner Regression for few cases. The comparative lack of statistical significance seen for MQ2008 and OHSUMED can most probably be attributed to the relatively small size of these datasets.

6.2 Quality of the generated features

Table 4 lists the quality scores of the features from four datasets: the original datasets and the three datasets generated by GAS, FSMRank and LifeRank, respectively, for \(k=\) 10.

Table 4 Performance of the generated features

From the table we see that: (1) GAS can significantly reduce the redundancy of the features. The redundancy of the features selected by GAS is the lowest among the four datasets. However, the importance of the features selected by GAS is also lowest and even slightly lower than that of the original datasets. (2) FSMRank can improve the importance of the features while reducing their redundancy, but the differences in terms redundancy are subtle. (3) LifeRank can sharply improve the importance of the features. The importance of the features generated by LifeRank is highest among the four datasets. Besides, the redundancy of the features can also be slightly reduced by LifeRank for MQ2007 but deteriorated for MQ2008 and OHSUMED. Worse redundancy for LifeRank in comparison with GAS and FSMSVM could be because of the reason that, while these baselines are feature selection methods, for LifeRank each extracted feature is a linear combination of the original features. Moreover, it can be observed that for the larger dataset MQ2007, redundancy for LifeRank is comparable to the baselines, and even better than FSMSVM. However, for smaller dataset MQ2008, the redundancy is worse than the baselines. For the smallest dataset OHSUMED, it is worse than the baselines by a greater difference.

6.3 Effect of the orthonormality constraints

To confirm that the orthonormality constraints used in LifeRank do indeed contribute to performance gains, we re-generated the datasets for the benchmarks MQ2007, MQ2008 and OHSUMED using LifeRank for \(k=\) 10, but this time without the incorporation of the constraints in its algorithm in Phase I (see Algorithm 1, line 1–8). Table 5 shows the comparison of performances of ranking algorithms, for datasets generated by LifeRank and LifeRank without orthonormality constraints (represented by LifeRankNO). Moreover, markups are presented in the table to denote to the statistical significance between LifeRank and LifeRankNO.

Table 5 Effect of orthonormality constraints on datasets for \(k = 10\). Statistical significance shown for LifeRank against LifeRankNO

We can see from the results in Table 5 that the datasets generated by LifeRank show significant improvements in performance over the datasets generated by LifeRankNO for both learning to rank algorithms, RankSVM and Linear Regression. Performance gains can be observed on all three benchmarks and across all performance measures (NDCG@1–10 and MAP). Hence, these results show that the usage of orthonormality constraints is beneficial in the LifeRank algorithm. Also, strong statistical significance between LifeRank and LifeRankNO can be observed for all three benchmarks for RankSVM as well as Linear Regression, across all performance measures, except for a small number of cases where weak or no statistical significance is seen.

7 Conclusion

In this paper, we have addressed the feature extraction problem for learning to rank, and have proposed LifeRank, a linear feature extraction algorithm for ranking. LifeRank regards each dataset for ranking as a matrix, referred to as the original matrix. We then optimize a transformation matrix by minimizing a classic pairwise learning to rank loss function, so that we can discover the optimal one that matches the ranking task. Then a new matrix (dataset) can be generated by the product of original matrix and transformation matrix. Extensive experiments on benchmark datasets show the performance gains of LifeRank in comparison with the state-of-the-art algorithms.

The performance of LifeRank has been evaluated for RankSVM and Linear Regression. In future work, its benefits for other learning to rank algorithms could be analysed. Moreover, nonlinear feature extraction techniques like some kernel tricks could be incorporated in LifeRank to further improve its performance. Besides, we plan to try more learning to rank loss functions like some state-of-the-art listwise loss functions for performance gains of our algorithm. In addition, we believe it would be interesting to establish theoretical results on dimension reduction for ranking, including feature extraction and feature selection-based algorithms, especially concerning retrieval performance.