1 Introduction

Reasoning with preferences has been recognized as a particularly promising research direction for artificial intelligence (AI) [15]. A preference can be considered as a relaxed constraint which, if necessary, can be violated to some degree. In fact, an important advantage of a preference-based problem solving paradigm is an increased flexibility, as nicely explained in [6]:

“Early work in AI focused on the notion of a goal – an explicit target that must be achieved – and this paradigm is still dominant in AI problem solving. But as application domains become more complex and realistic, it is apparent that the dichotomic notion of a goal, while adequate for certain puzzles, is too crude in general. The problem is that in many contemporary application domains the user has little knowledge about the set of possible solutions or feasible items, and what she typically seeks is the best that’s out there. But since the user does not know what is the best achievable plan or the best available document or product, she typically cannot characterize it or its properties specifically. As a result, she will end up either asking for an unachievable goal, getting no solution in response, or asking for too little, obtaining a solution that can be substantially improved.”

Drawing on past research on knowledge representation and reasoning, AI offers qualitative and symbolic methods for treating preferences that can reasonably complement traditional approaches that have been developed for quite a while in fields such as economic decision theory [37]. Needless to say, however, the acquisition of preferences is not always an easy task. Therefore, not only modeling languages and representation formalisms, but also methods for the automatic learning, discovery, and adaptation of preferences are needed. For example, computerized methods for discovering the preferences of individuals are useful in e-commerce and various other fields, where an increasing trend toward personalization of products and services can be recognized.

It is hence hardly surprising that methods for learning and predicting preferences in an automatic way are among the current research topics in disciplines such as machine learning, knowledge discovery, and recommender systems. Approaches relevant to this area range from approximating the utility function of a single agent on the basis of a question-answer process that is as effective as possible (often referred to as preference elicitation) to collaborative filtering, where a customer's preferences are estimated from the preferences of other customers. In fact, problems of preference learning can be formalized within various settings, depending, e.g., on the underlying type of preference model or the type of information provided as an input to the learning system.

Roughly speaking, preference learning is about inducing predictive preference models from empirical data. In the literature on choice and decision theory, two main approaches to modeling preferences can be found, namely in terms of utility functions and in terms of preference relations. From a machine learning point of view, these two approaches give rise to two kinds of learning problems: learning utility functions and learning preference relations. The latter deviates more strongly than the former from conventional problems such as classification and regression, as it involves the prediction of complex structures, such as rankings or partial order relations, rather than single values. Moreover, training input in preference learning will not, as is usually the case in supervised learning, be offered in the form of complete examples but may comprise more general types of information, such as relative preferences or different kinds of indirect feedback and implicit preference information.

This book tries to give a comprehensive overview of the state of the art in the field of preference learning. Some of its chapters are based on selected contributions to two successful workshops on this topic [29, 30], but the material is complemented with chapters that have been solicited explicitly for this book. Most notably, several survey chapters give a detailed account of ongoing research in various subfields of preference learning. Thus, we are confident that the book succeeds in giving a comprehensive survey of work on all aspects of this emerging research area.

In the remainder of this chapter, we shall briefly sketch some important branches of preference learning and, along the way, give pointers to the contributions in this volume. References to these contributions are indicated by capitalized author names, for example FÜRNKRANZ & HÜLLERMEIER.

2 Preference Learning Tasks

Among the problems in the realm of preference learning, the task of “learning to rank” has probably received the most attention in the machine learning literature in recent years. In fact, a number of different ranking problems have been introduced so far, though a commonly accepted terminology has not yet been established. In the following, we propose a unifying and hopefully clarifying terminology for the most important types of ranking problems, which will also serve as a guideline for organizing the chapters of the book. AIOLLI & SPERDUTI give an alternative unifying framework for learning to rank from preferences.

In general, a preference learning task consists of some set of items for which preferences are known, and the task is to learn a function that predicts preferences for a new set of items, or for the same set of items in a different context. Frequently, the predicted preference relation is required to form a total order, in which case we also speak of a ranking problem. In this book, we will frequently use the term “ranking” for categorizing different types of preference learning problems, but we note that the characterization mainly depends on the form of the training data and the required predictions, and not on the fact that a total order is predicted.

In the notation used in the remainder of this chapter (and throughout most of the book), our goal is to stick as much as possible to the terminology commonly used in supervised learning (classification), where a data object typically consists of an instance (the input, also called predictive or independent variable in statistics) and an associated class label (the output, also called target or dependent variable in statistics). The former is normally denoted by \(x\), and the corresponding instance space by \(\mathcal{X}\), while the output space is denoted by \(\mathcal{Y}\). Instances are often represented in the form of feature vectors, which means that \(x\) is a vector

$$x = (x_1, x_2, \ldots, x_m) \in \mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \ldots \times \mathcal{X}_m.$$

We distinguish three types of ranking problems, namely label ranking, instance ranking, and object ranking, which are described in more detail in the following.

2.1 Label Ranking

In label ranking, we assume to be given an instance space \(\mathcal{X}\) and a finite set of labels \(\mathcal{Y} = \{y_1, y_2, \ldots, y_k\}\). The goal is to learn a “label ranker” in the form of an \(\mathcal{X} \rightarrow S_{\mathcal{Y}}\) mapping, where the output space \(S_{\mathcal{Y}}\) is given by the set of all total orders (permutations) of the set of labels \(\mathcal{Y}\) (the notation follows the common notation \(S_k\) for the symmetric group of order \(k\)). Thus, label ranking can be seen as a generalization of conventional classification, where a complete ranking

$$y_{\pi_x^{-1}(1)} \succ_x y_{\pi_x^{-1}(2)} \succ_x \ldots \succ_x y_{\pi_x^{-1}(k)}$$

is associated with an instance \(x\) instead of only a single class label. Here, \(\pi_x\) is a permutation of \(\{1, 2, \ldots, k\}\) such that \(\pi_x(i)\) is the position of label \(y_i\) in the ranking associated with \(x\).
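To make this notation concrete, the following minimal Python sketch (with a hypothetical label set and permutation) illustrates how \(\pi_x\) and its inverse relate to the ranking:

```python
# Minimal sketch of the permutation notation: pi_x(i) is the position of
# label y_i in the ranking, so pi_x^{-1}(j) is the index of the label at
# position j. Labels and the permutation here are hypothetical examples.
labels = ["y1", "y2", "y3"]        # the label set Y, k = 3
pi = {1: 3, 2: 1, 3: 2}            # pi_x: y1 is ranked 3rd, y2 1st, y3 2nd

# invert pi_x to read off the ranking from top to bottom
pi_inv = {pos: i for i, pos in pi.items()}
ranking = [labels[pi_inv[pos] - 1] for pos in range(1, len(labels) + 1)]
print(" > ".join(ranking))         # y2 > y3 > y1
```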

The training data 𝒯 of a label ranker typically consist of a set of pairwise preferences of the form \(y_i \succ_x y_j\), suggesting that, for instance \(x\), \(y_i\) is preferred to \(y_j\). In other words, an “observation” consists of an instance \(x\) and an ordered pair of labels \((y_i, y_j)\). The label ranking problem is summarized in Fig. 1.

Fig. 1 Label ranking

This learning scenario has a large number of practical applications. For example, it is relevant for the prediction of every sort of ordering of a fixed set of elements, such as the preferential order of a fixed set of products (e.g., different types of holiday apartments) based on demographic properties of a person, or the ordering of a set of genes according to their expression level (as measured by microarray analysis) based on features of their phylogenetic profile [1]. Another application scenario is meta-learning, where the task is to rank learning algorithms according to their suitability for a new dataset, based on the characteristics of this dataset [7]. Finally, every preference statement in the well-known CP-nets approach [3], a qualitative graphical representation that reflects conditional dependence and independence of preferences under a ceteris paribus interpretation, formally corresponds to a label ranking.

In addition, it has been observed by several authors [25, 18, 14] that many conventional learning problems, such as classification and multilabel classification, may be formulated in terms of label preferences (a code sketch illustrating both reductions follows the list):

  • Classification: A single class label \(y_i\) is assigned to each example \(x_\ell\). This implicitly defines the set of preferences \(\{y_i \succ_{x_\ell} y_j \,\vert\, 1 \leq j \neq i \leq k\}\).

  • Multilabel classification: Each training example \(x_\ell\) is associated with a subset \(P_\ell \subseteq \mathcal{Y}\) of possible labels. This implicitly defines the set of preferences \(\{y_i \succ_{x_\ell} y_j \,\vert\, y_i \in P_\ell, y_j \in \mathcal{Y} \setminus P_\ell\}\).
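As an illustration of these reductions, the following Python sketch (with hypothetical labels) derives the implicit pairwise preferences from a class label and from a multilabel annotation:

```python
# Sketch: deriving implicit pairwise label preferences from conventional
# training data (hypothetical labels; the rules follow the list above).
labels = ["y1", "y2", "y3", "y4"]

def prefs_from_class(label):
    """Classification: the assigned label is preferred to all others."""
    return [(label, y) for y in labels if y != label]

def prefs_from_multilabel(P):
    """Multilabel: every relevant label is preferred to every irrelevant one."""
    return [(yi, yj) for yi in P for yj in labels if yj not in P]

print(prefs_from_class("y2"))               # y2 preferred to y1, y3, y4
print(prefs_from_multilabel(["y1", "y3"]))  # y1, y3 each preferred to y2, y4
```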

A general framework encompassing these and other learning problems can be found in the chapter by AIOLLI & SPERDUTI.

In each of the above scenarios, a ranking model \(f : \mathcal{X} \rightarrow S_{\mathcal{Y}}\) is learned from a subset of all possible pairwise preferences. A suitable projection may be applied to the ranking model (which outputs permutations) as a post-processing step, for example, a projection to the top rank in classification learning, where only this label is relevant.

To measure the predictive performance of a label ranker, a loss function on rankings is needed. In principle, any distance or correlation measure on rankings (permutations) can be used for that purpose, for example, the number of pairs of labels that are incorrectly ordered (i.e., the number of label pairs \(y_i\) and \(y_j\) such that \(y_i\) precedes \(y_j\) in the predicted ranking although \(y_j\) is actually preferred to \(y_i\)). Apart from this type of ranking loss, which compares a predicted ranking with a given target ranking, it is also possible to compare a predicted ranking with a single class label. For example, if this class label is the target one is looking for, then it makes sense to evaluate a predicted ranking by the position it assigns to the label; in [28], this type of error (measuring the distance of the assigned position from the top rank) is called the position error.
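Both types of measures are straightforward to compute. The following Python sketch (with hypothetical rankings, represented as lists of labels from top to bottom) counts incorrectly ordered label pairs and computes the position error:

```python
# Sketch of two evaluation measures mentioned above (hypothetical rankings,
# represented as lists of labels from the top position downward).
from itertools import combinations

def discordant_pairs(predicted, target):
    """Number of label pairs ordered differently in the two rankings."""
    pos_p = {y: i for i, y in enumerate(predicted)}
    pos_t = {y: i for i, y in enumerate(target)}
    return sum(
        1
        for a, b in combinations(target, 2)
        if (pos_p[a] - pos_p[b]) * (pos_t[a] - pos_t[b]) < 0
    )

def position_error(predicted, true_label):
    """Distance of the true label's predicted position from the top rank."""
    return predicted.index(true_label)  # 0 = correct top rank

print(discordant_pairs(["y1", "y2", "y3"], ["y2", "y1", "y3"]))  # 1
print(position_error(["y1", "y2", "y3"], "y2"))                  # 1
```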

A general survey of label ranking is given by VEMBU & GÄRTNER. Another discussion of label ranking and related problems is given by FÜRNKRANZ & HÜLLERMEIER; their chapter specifically focuses on approaches that are based on the idea of learning by pairwise comparison, i.e., on decomposing the original problem into a set of smaller binary classification problems. YU, WAN & LEE show how decision-tree learning algorithms such as CART can be adapted to tackle label ranking problems by extending the concept of purity to label ranking data. TSIVTSIVADZE et al. show how an approach for minimizing an approximation of a ranking loss function can be extended with a semi-supervised learning technique that tries to improve predictions by minimizing the disagreement of several ranking functions, which have been learned from different views of the training data.

2.2 Instance Ranking

This setting proceeds from ordinal classification, where an instance \(x \in \mathcal{X}\) belongs to one among a finite set of classes \(\mathcal{Y} = \{y_1, y_2, \ldots, y_k\}\) and, moreover, the classes have a natural order: \(y_1 < y_2 < \ldots < y_k\). The training data consist of a set 𝒯 of labeled instances. As an example, consider the assignment of submitted papers to categories reject, weak reject, weak accept, and accept.

In contrast to the classification setting, the goal is not to learn a classifier but a ranking function \(f(\cdot)\). Given a subset \(X \subset \mathcal{X}\) of instances as an input, the function produces a ranking of these instances as an output (typically by assigning a score to each instance and then sorting by scores).

For the case k = 2, this problem is well-known as the bipartite ranking problem. The case k > 2 has recently been termed k-partite [42] or multipartite ranking [19]. As an example, consider the task of a reviewer who has to rank the papers according to their quality, possibly though not necessarily with the goal of partitioning this ranking into the above four categories.

Thus, the goal of instance ranking – our proposal for a generic term covering bipartite and multipartite ranking – is to produce a ranking in which instances from higher classes precede those from lower classes; see Fig. 2 for a formalization of this task. Different types of accuracy measures have been proposed for predictions of this kind. Typically, they count the number of ranking errors, that is, the number of pairs \((x, x') \in X \times X\) such that \(x\) is ranked higher than \(x'\) even though the former belongs to a lower class than the latter. In the two-class case, this amounts to the well-known AUC, the area under the ROC curve [5], which is equivalent to the Wilcoxon–Mann–Whitney statistic [47, 38]. Its generalization to multiple (ordered) classes is known as the concordance index or C-index in statistics [22].
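In the bipartite case, the connection between ranking errors and the AUC is easy to verify numerically. The following Python sketch (with hypothetical scores and class labels) counts ranking errors and computes the AUC as the fraction of correctly ordered positive/negative pairs, in the spirit of the Wilcoxon–Mann–Whitney statistic:

```python
# Sketch: counting ranking errors, and the Wilcoxon-Mann-Whitney view of the
# AUC in the bipartite case (hypothetical scores; higher score = ranked higher).
def ranking_errors(scored):
    """scored: list of (score, class_index) with classes 0 < 1 < ... ."""
    errors = 0
    for i, (s1, c1) in enumerate(scored):
        for s2, c2 in scored[i + 1:]:
            if (s1 - s2) * (c1 - c2) < 0:  # ranked against the class order
                errors += 1
    return errors

def auc(scored):
    """Bipartite case: fraction of pos/neg pairs ordered correctly."""
    pos = [s for s, c in scored if c == 1]
    neg = [s for s, c in scored if c == 0]
    correct = sum(1 for p in pos for n in neg if p > n)
    return correct / (len(pos) * len(neg))

data = [(0.9, 1), (0.7, 0), (0.6, 1), (0.2, 0)]
print(ranking_errors(data), auc(data))  # 1 error, AUC = 0.75
```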

Fig. 2 Instance ranking

These measures and the multipartite ranking scenario are discussed in more detail by WAEGEMAN & DE BAETS. ZHANG et al. discuss different methods for employing rule learning algorithms for learning bipartite rankings. This scenario has been studied for decision-tree learning, but not yet for rule learning, where several additional problems have to be considered, such as how to combine estimates from overlapping rules into a single probability estimate.

2.3 Object Ranking

In the setting of object ranking, there is no supervision in the sense that no output or class label is associated with an object. The goal in object ranking is to learn a ranking function \(f(\cdot)\) which, given a subset \(Z\) of an underlying referential set \(\mathcal{Z}\) of objects as an input, produces a ranking of these objects as an output. Again, this is typically done by assigning a score to each object and then sorting by scores.

Objects \(z \in \mathcal{Z}\) are commonly though not necessarily described in terms of an attribute–value representation. As training information, an object ranker has access to exemplary rankings or pairwise preferences of the form \(z \succ z'\), suggesting that \(z\) should be ranked higher than \(z'\). This scenario, summarized in Fig. 3, is also known as “learning to order things” [12].

Fig. 3 Object ranking

As an example consider the problem of learning to rank query results of a search engine [33, 41]. The training information is provided implicitly by the user who clicks on some of the links in the query result and not on others. This information can be turned into binary preferences by assuming that the selected pages are preferred over nearby pages that are not clicked on [34].
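The following Python sketch shows one simple way such click data could be turned into pairwise preferences; the rule that a clicked result is preferred to all skipped results ranked above it is a simplification of the strategies discussed in [34], and the result identifiers are hypothetical:

```python
# Rough sketch of turning clicks into pairwise preferences, assuming (as a
# simplification of the strategies in [34]) that a clicked result is preferred
# to all results ranked above it that were not clicked. IDs are hypothetical.
def click_preferences(result_list, clicked):
    prefs = []
    for rank, doc in enumerate(result_list):
        if doc in clicked:
            prefs.extend(
                (doc, skipped)
                for skipped in result_list[:rank]
                if skipped not in clicked
            )
    return prefs

print(click_preferences(["d1", "d2", "d3", "d4"], {"d3"}))
# [('d3', 'd1'), ('d3', 'd2')]  -- d3 preferred to the skipped d1 and d2
```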

The performance of an object ranker can again be measured in terms of a distance function or correlation measure on rankings. In contrast to the setting of label ranking, however, the number of items to be ordered in the context of object ranking is typically much larger. Therefore, one often prefers measures that put more emphasis on the top of a ranking while paying less attention to the bottom [17]. In Web search, for example, people normally look at the top-10 results while ignoring the rest. Besides, the target is often not a “true” ranking but instead a single object or a subset of relevant objects, for example a set of documents relevant to a query. Evaluation measures especially tailored toward these types of requirements have been proposed in information retrieval. Typical examples include precision and recall as well as normalized discounted cumulative gain (NDCG) [32, 39].

An extensive survey of object ranking approaches is given by KAMISHIMA, KAZAWA & AKAHO. Subsequently, KAMISHIMA & AKAHO discuss dimensionality reduction methods for object ranking tasks, which preserve the preference information as much as possible. They assume a scenario (which they call supervised ordering) in which total orders for multiple subsets of objects are given, and the goal is to predict an ordering of the full set of objects. DEMBCZYŃSKI et al. compare different approaches for rule-based learning of object ranking functions, namely one utility-based approach and one approach that directly learns the binary preference predicate (cf. also Sect. 3). An application to learning to rank documents in biomedical information retrieval is described by ARENS.

3 Preference Learning Techniques

All three of the basic learning tasks discussed in the previous section can be tackled by very similar basic techniques. In agreement with the distinction between using utility functions and binary relations for modeling preferences, two general approaches to preference learning have been proposed in the literature: the first is based on the idea of learning to evaluate individual alternatives by means of a utility function (Sect. 3.1), while the second seeks to compare (pairs of) competing alternatives, that is, to learn one or more binary preference predicates (Sect. 3.2). Making sufficiently restrictive model assumptions about the structure of a preference relation, one can also try to use the data for identifying this structure (Sect. 3.3). Finally, local estimation techniques à la nearest neighbor can be used, which mostly leads to aggregating preferences in one way or another (Sect. 3.4).

3.1 Learning Utility Functions

As mentioned previously, an established approach to modeling preferences resorts to the concept of a utility function. Such a function assigns an abstract degree of utility to each alternative under consideration. From a machine learning point of view, an obvious problem is to learn utility functions from given training data. Depending on the underlying utility scale, which is typically either numerical or ordinal, the problem becomes one of regression learning or ordered classification. Both problems are well-known in machine learning. However, utility functions often involve special requirements and constraints that have to be taken into consideration, such as monotonicity in certain attributes (DEMBCZYŃSKI et al.).

Besides, as mentioned earlier, training data are not necessarily given in the form of input/output pairs, i.e., alternatives (instances) together with their utility degrees, but may also consist of qualitative feedback in the form of pairwise comparisons, stating that one alternative is preferred to another one and therefore has a higher utility degree. More generally, certain types of preference information can be formalized in terms of constraints on one or more underlying utility functions. This idea forms the basis of the general framework presented by AIOLLI & SPERDUTI. Sometimes, of course, training data are less generic and more application-specific. In collaborative filtering, for example, they simply consist of an incomplete set of product ratings given by a set of users (see DE GEMMIS et al. in this volume).

In the instance and object preferences scenario, a utility function is a mapping \(f : \mathcal{X} \rightarrow \mathbb{R}\) that assigns a utility degree \(f(x)\) to each instance (object) \(x\) and, hence, induces a complete order on \(\mathcal{X}\). In the label preferences scenario, a utility function \(f_i : \mathcal{X} \rightarrow \mathbb{R}\) is needed for each of the labels \(y_i\) (\(i = 1, \ldots, k\)); alternatively, the functions can be summarized into a single function \(f : \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}\) that maps instance/label tuples \((x, y)\) to real-valued scores (see AIOLLI & SPERDUTI). Here, \(f_i(x)\) is the utility assigned to alternative \(y_i\) by instance \(x\). To obtain a ranking for \(x\), the alternatives are sorted according to these utility scores, i.e., \(\succ_x\) is such that \(y_i \succ_x y_j \Rightarrow f_i(x) \geq f_j(x)\).
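The following Python sketch (with hypothetical, “already learned” weight vectors) illustrates this sorting step for linear utility functions \(f_i(x) = w_i \cdot x\):

```python
# Sketch: a label ranker induced by per-label linear utility functions
# f_i(x) = w_i . x (weights here are hypothetical, as if already learned).
import numpy as np

W = np.array([[0.5, -1.0],    # w_1 for label y1
              [1.0,  0.2],    # w_2 for label y2
              [-0.3, 0.8]])   # w_3 for label y3

def rank_labels(x):
    scores = W @ x                      # f_i(x) for i = 1..k
    order = np.argsort(-scores)         # sort labels by decreasing utility
    return [f"y{i + 1}" for i in order]

print(rank_labels(np.array([1.0, 0.5])))  # ['y2', 'y3', 'y1']
```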

In the setting of instance ranking, the training data consist of instances for which the sought utility scores are given. Thus, the learning problem can, in principle, be approached by means of classification or (ordinal) regression methods. As an important difference, however, note that the goal is not to maximize classification accuracy but ranking performance. Thus, conventional learning algorithms have to be adapted correspondingly. Approaches of this kind have, e.g., been proposed in [27, 33]. In this book, WAEGEMAN & DE BAETS discuss approaches that are based on the optimization of an extension of the binary AUC to a loss function for ordinal data.

In object and label ranking, training data typically originate from a kind of indirect supervision. Instead of the target scores of the utility function, the learner is given constraints on this function, derived from comparative preference information of the form “this object (or label) should have a higher utility score than that object (or label)”. Thus, the challenge for the learner is to find a function that is in agreement with these constraints as much as possible.

For object ranking approaches, this idea was first formalized by Tesauro under the name comparison training [44]. He proposed a symmetric neural network architecture that can be trained with representations of two states and a training signal that indicates which of the two states is preferable. The elegance of this approach comes from the property that one can replace the two symmetric components of the network with a single network, which can subsequently provide a real-valued evaluation of single states. Later work on learning utility functions from object preference data includes [46, 33, 24].

For the case of label ranking, a method for learning the functions \(f_i(\cdot)\) (\(i = 1, \ldots, k\)) has been proposed in the framework of constraint classification [25, 26]. Here, the authors proceed from linear utility functions and find a way to express a constraint of the form \(f_i(x) - f_j(x) > 0\) (suggesting that \(y_i \succ_x y_j\)) in the form of a binary classification example in a newly constructed, higher-dimensional space. In other words, the original label ranking problem is transformed into a single binary classification problem. This problem is solved by fitting a separating hyperplane, from which the linear utility functions (identified by corresponding weight vectors) can be reconstructed. An alternative approach, so-called log-linear models for label ranking, has been proposed in [14]. This approach is essentially equivalent to constraint classification, as it also amounts to learning linear utility functions for all labels. Algorithmically, however, the underlying optimization problem is approached in a different way, namely by means of a boosting-based algorithm that seeks to minimize a (generalized) ranking error in an iterative way. In this book, TSIVTSIVADZE et al. present an approach for learning a utility function for label ranking via minimization of a loss function that is based on a least-squares approximation of the ranking error.
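The following Python sketch illustrates the kind of transformation used in constraint classification for linear utility functions; the concrete embedding shown here (placing \(x\) at block \(i\) and \(-x\) at block \(j\) of a \(k \cdot m\)-dimensional vector) is one common way to realize the construction and should be read as an illustration rather than the exact formulation of [25, 26]:

```python
# Sketch of the constraint classification transformation for linear utility
# functions f_i(x) = w_i . x: the preference y_i >_x y_j becomes a point in a
# k*m-dimensional space that a single hyperplane must put on the positive side.
import numpy as np

def expand(x, i, j, k):
    """Embed x at block i and -x at block j of a k*m vector."""
    m = len(x)
    z = np.zeros(k * m)
    z[i * m:(i + 1) * m] = x
    z[j * m:(j + 1) * m] = -x
    return z

# One preference y_0 >_x y_2 over k = 3 labels yields a positive example z
# (and, by symmetry, -z as a negative example) for a binary classifier.
x = np.array([1.0, 2.0])
z = expand(x, 0, 2, k=3)
print(z)  # [ 1.  2.  0.  0. -1. -2.]

# After fitting a separating hyperplane with weight vector w (not shown),
# the utility weights are recovered as w_i = w[i*m:(i+1)*m].
```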

3.2 Learning Preference Relations

The key idea of this approach is to learn a binary preference relation that compares pairs of alternatives (e.g., objects or labels). The training of a model thus becomes simpler, mainly because comparative training information (suggesting that one alternative is better than another) can be used directly instead of being translated into constraints on a (latent) utility function. On the other hand, the prediction step may become more difficult, since a binary preference relation learned from data is not necessarily consistent in the sense of being transitive and, therefore, does not normally define a ranking in a unique way.

Binary preference relations can be turned into a ranking by finding a ranking that is maximally consistent with the corresponding pairwise preferences. The difficulty of this optimization problem depends on the concrete criterion, though many natural objectives (e.g., minimizing the number of object pairs whose ranks are in conflict with their pairwise preference) lead to NP-hard problems [12]. Fortunately, efficient techniques such as simple voting (known as the Borda count procedure in social choice theory) often deliver good approximations, sometimes even with provable guarantees [13]. Of course, one can also derive other, possibly more complex preference structures from a preference relation, for example weak instead of strict linear orders. In [45], a linear order with ties (indifference between two alternatives) is called a bucket order (a total order of “buckets”, where each bucket corresponds to an equivalence class), and a method is proposed to find an order of this type, which is maximally consistent with the data.
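The following Python sketch shows Borda-style voting over a given pairwise preference relation; the relation used as input is hypothetical and deliberately intransitive, to illustrate that voting still produces a ranking:

```python
# Sketch: Borda-style voting over a (possibly intransitive) pairwise
# preference relation. P[a][b] = 1 means a is predicted to beat b.
def borda_ranking(items, P):
    score = {a: sum(P[a][b] for b in items if b != a) for a in items}
    return sorted(items, key=lambda a: -score[a])

items = ["a", "b", "c"]
P = {"a": {"b": 1, "c": 0}, "b": {"a": 0, "c": 1}, "c": {"a": 1, "b": 0}}
print(borda_ranking(items, P))  # a cycle: each item wins once, so all scores
                                # tie and the stable sort keeps the input order
```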

For object ranking problems, the relational approach has been pursued in [12]. The authors propose to solve object ranking problems by learning a binary preference predicate \(\mathcal{Q}(x, x')\), which predicts whether \(x\) is preferred to \(x'\) or vice versa. This predicate is trained on the basis of exemplary preferences of the form \(x \succ x'\). A final ordering is found in a second phase by deriving (an approximation of) a ranking that is maximally consistent with these predictions. DEMBCZYŃSKI et al. discuss this setting for rule learning and propose to combine the predictions using the Net Flow score proposed in [4]. They also compare this setting with an alternative approach that directly learns a utility function based on the preferences and monotonicity constraints.

For label ranking problems, the pairwise approach has been introduced in [18, 31], where it is referred to as ranking by pairwise comparison. The key idea is to learn, for each pair of labels \((y_i, y_j)\), a binary predicate \(\mathcal{M}_{i,j}(x)\) that predicts whether \(y_i \succ_x y_j\) or \(y_j \succ_x y_i\) for an input \(x\). A label ranking is then derived from these pairwise preferences via weighted voting (generalized Borda counting).
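A minimal sketch of this pairwise scheme is given below; the tiny synthetic dataset is hypothetical, and logistic regression merely stands in for an arbitrary binary base learner:

```python
# Sketch of ranking by pairwise comparison: one binary model per label pair,
# then weighted voting. Data below is tiny and synthetic (hypothetical).
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

k = 3
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.8]])
# prefs maps an instance index to observed preferences (i, j): y_i >_x y_j
prefs = {0: [(0, 1), (0, 2)], 1: [(1, 0)], 2: [(1, 2)], 3: [(2, 1)]}

models = {}
for i, j in combinations(range(k), 2):
    Xp, yp = [], []
    for l, plist in prefs.items():
        for (a, b) in plist:
            if {a, b} == {i, j}:
                Xp.append(X[l])
                yp.append(1 if a == i else 0)  # 1 iff y_i preferred
    if len(set(yp)) == 2:                      # need both outcomes to fit
        models[(i, j)] = LogisticRegression().fit(np.array(Xp), yp)

def rank(x):
    score = np.zeros(k)
    for (i, j), m in models.items():
        p = m.predict_proba([x])[0, 1]         # estimated P(y_i >_x y_j)
        score[i] += p
        score[j] += 1 - p
    return np.argsort(-score)

print(rank(np.array([0.2, 0.9])))              # label indices, best first
```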

Pairwise learning techniques for instance ranking have been proposed in [19]. More specifically, two approaches were developed and compared in that paper, one which trains binary models \(\mathcal{M}_{i,j}\), one for each pair of labels \(y_i\) and \(y_j\), and another one that trains models \(\mathcal{M}_i\) (\(i = 1, \ldots, k-1\)) to separate classes \(y_1, \ldots, y_i\) from classes \(y_{i+1}, \ldots, y_k\). Given a new query instance \(x\), both approaches submit this instance to all models that have been learned and aggregate the corresponding predictions into an overall score. A set of instances is then ranked according to these scores.

An overview of work on learning binary preference relations for label and instance ranking is given by FÜRNKRANZ & HÜLLERMEIER.

3.3 Model-Based Preference Learning

Another approach to learning ranking functions is to proceed from specific model assumptions, that is, assumptions about the structure of the preference relations. This approach is less generic than the previous ones, as it strongly depends on the concrete assumptions made.

An example is the assumption that the target ranking of a set of objects described in terms of multiple attributes can be represented as a lexicographic order. YAMAN et al. address the learning of lexicographic orders in the context of object ranking. From a machine learning point of view, assumptions of the above type can be seen as an inductive bias restricting the hypothesis space. Provided the bias is correct, this is clearly an advantage, as it may simplify the learning problem. In the case of lexicographic orders, for example, a complete ranking of all objects is uniquely identified by a total order of the attributes plus a total order of each of the attribute domains. For example, suppose objects to be described in terms of (only) four binary attributes. Then there are \(2^4 = 16\) objects and hence \(16! \approx 2 \cdot 10^{13}\) rankings in total, but only \(2^4 \cdot 4! = 384\) of these rankings can be expressed in terms of a lexicographic order.
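The following Python sketch shows such a lexicographic ranker for four binary attributes; the attribute order and preferred values are hypothetical, and together they fully determine the ranking of all \(2^4 = 16\) objects:

```python
# Sketch: a lexicographic ranker over binary attributes. The hypothesis is
# fully specified by an attribute order and a preferred value per attribute
# (both hypothetical here).
from itertools import product

attr_order = [2, 0, 3, 1]          # most important attribute first
preferred = [1, 1, 0, 1]           # preferred value of each attribute

def lex_key(obj):
    # compare on the most important attribute first; preferred value wins
    return [0 if obj[a] == preferred[a] else 1 for a in attr_order]

objects = list(product([0, 1], repeat=4))      # all 2**4 = 16 objects
ranking = sorted(objects, key=lex_key)
print(ranking[0], ranking[-1])   # best: (1, 1, 0, 1); worst: (0, 0, 1, 0)
```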

Needless to say, the bias induced by the assumption of a lexicographic order is very strong and will rarely be justified in practical applications. In particular, preferences on individual attribute values will normally not be independent of each other. For example, red wine might be preferred as a beverage if the main dish is meat, while white wine might be preferred in the case of fish. As mentioned earlier, CP-nets [3] offer a language for expressing preferences on the values of single attributes and provide a suitable formalism for modeling dependencies of this type. A compact representation of a complex preference (partial order) relation is achieved by making use of conditional independence relations between such preferences (which are interpreted in terms of a ceteris paribus semantics), in much the same way as Bayesian networks reduce the complexity of probability models by exploiting conditional independence between random variables. The CP-net itself is a graphical representation of these (in)dependencies, and each node (corresponding to a single variable) is associated with a function that assigns a preference relation on the values of that attribute to each combination of values of the parent attributes. In this volume, CHEVALEYRE et al. discuss the learnability of CP-nets, both in a passive and an active learning scenario.

If a ranking function is defined implicitly via an underlying utility (scoring) function, the latter is normally also restricted by certain model assumptions. For example, the approaches outlined in Sect. 3.1 make use of linear functions to represent scores, although mostly for algorithmic reasons. There are other approaches in which the choice of the underlying utility function is more intrinsically motivated and addressed in a more explicit way. For example, GIESEN et al. describe an approach for conjoint analysis, also called multiattribute compositional models, which originated in mathematical psychology and is nowadays widely used in the social sciences and operations research. Proceeding from the description of objects in terms of a set of attributes, they assume that an underlying utility function can be decomposed into a linear sum of individual utility functions, one for each attribute. These utility functions can then be learned efficiently from the data.
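A minimal sketch of this additive decomposition is given below; with one-hot encoded attribute levels, the per-level part-worth utilities can be fitted by ordinary least squares (the data here are synthetic):

```python
# Sketch: fitting an additive utility model u(z) = sum_a u_a(z_a) by least
# squares, using one-hot encoding of categorical attributes (synthetic data).
import numpy as np

levels = [3, 2]                    # two attributes with 3 and 2 levels
def one_hot(obj):
    v = []
    for a, lev in enumerate(levels):
        e = [0.0] * lev
        e[obj[a]] = 1.0
        v.extend(e)
    return v

objects = [(0, 0), (1, 0), (2, 1), (0, 1)]
utilities = [1.0, 2.5, 0.5, 1.8]   # observed overall utilities

A = np.array([one_hot(o) for o in objects])
w, *_ = np.linalg.lstsq(A, np.array(utilities), rcond=None)
print(w)  # per-level part-worth utilities, one block per attribute
```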

As in the case of lexicographic orders, this model assumption is obviously quite restrictive. TORRA presents a complementary and more general approach: starting with a discussion of general properties that aggregation operators for attribute-based utility functions should fulfill, he surveys different approaches for learning a complex aggregation operator in the form of a nonadditive integral.

3.4 Local Aggregation of Preferences

Yet another alternative is to resort to the idea of local estimation techniques as prominently represented, for example, by the nearest neighbor estimation principle: considering the rankings observed in similar situations as representative, a ranking for the current situation is estimated on the basis of these “neighboring” rankings, typically using an averaging-like aggregation operator. This approach is in a sense orthogonal to the previous model-based one, as it is very flexible and typically comes with no specific model assumption (except the regularity assumption underlying the nearest neighbor inference principle).

For label ranking, the nearest neighbor (instance-based learning) approach was first used in [9, 10]. Roughly speaking, the idea is to identify the query's k nearest neighbors in the instance space \(\mathcal{X}\), and then to combine the corresponding rankings into a prediction using suitable aggregation techniques. In [11], this approach was developed in a theoretically more sophisticated way, realizing the aggregation step in the form of maximum likelihood estimation based on a statistical model for rank data. Moreover, this approach is also able to handle the more general case in which the rankings of the neighboring instances are only partially known.
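The following Python sketch illustrates the basic idea with a simple Borda-count aggregation step (synthetic data; this is not the maximum likelihood procedure of [11]):

```python
# Sketch: a nearest-neighbor label ranker that aggregates the neighbors'
# rankings by Borda counting (synthetic data; rankings list label indices
# from best to worst).
import numpy as np

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]])
R = [[0, 1, 2], [0, 2, 1], [2, 1, 0], [2, 0, 1]]   # training rankings

def knn_rank(x, k=2):
    idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    borda = np.zeros(3)
    for i in idx:
        for pos, label in enumerate(R[i]):
            borda[label] += len(R[i]) - 1 - pos    # points for high ranks
    return np.argsort(-borda)

print(knn_rank(np.array([0.05, 0.1])))  # neighbors agree: label 0 on top
```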

In the same paper, the authors propose to use this estimation principle for decision tree learning, too, namely for aggregating the label rankings associated with the instances grouped in a leaf node of the tree. Indeed, decision tree learning can also be seen as a kind of local learning method, namely as a piecewise constant approximation of the target function. In this book, YU, WAN & LEE propose a similar method: they grow a decision tree and propose two splitting measures for label ranking data. The rankings predicted at the leaves of the tree are derived by aggregating the rankings of all training examples arriving at the respective leaf.

Aggregation techniques are also used for other types of preference learning problems, including object ranking. For example, assume the rankings of several subsets \(X_i\) of a reference set \(\mathcal{X}\) to be given. The learning task is to combine these rankings into a complete ranking of all objects in \(\mathcal{X}\). A practical application of this setting occurs, e.g., in information retrieval, when different rankings of search results originating from different search engines should be combined into an overall ranking of all retrieved pages [16]. Among other things, the learning problem may involve the determination of suitable weights for the information sources (search engines), reflecting their performance or agreement with the preferences of the user [35]. Another example is the ranking of sports teams or players, where individual tournament results with varying numbers of participants have to be combined into an overall ranking [2], or where different rankings by different judges have to be aggregated into an overall ranking of the participants of a competition [23].

4 Applications of Preference Learning

Preference learning problems in general, and ranking problems in particular, arise quite naturally in many application areas. For example, a search engine should rank Web pages according to a user’s preferences. Likewise, cars can be ranked according to a customer’s preferences on characteristics that discriminate different models. Another example is the ranking of possible keywords according to their relevance for an article. A few more examples have already been given in this article, and many more could be found.

In particular in the field of information retrieval, ranking applications occur quite naturally. Two particularly interesting problems are learning to rank the results of a query to a search engine, and learning to rank possible recommendations for new products. We will briefly discuss research in these areas below.

4.1 Learning to Rank Search Results

A widely studied preference learning problem is the ranking of the retrieval results of a search engine. Roughly, the problem is the following: given a query \(q\) and a set of documents \(\mathcal{D}\), find a ranking of the documents in \(\mathcal{D}\) that corresponds to their relevance with respect to \(q\). This ranking is based on an unknown preference relation \(\succ_q\), which ought to be learned from user feedback on past rankings of retrieval results for different queries. An elaborate survey of current research in this area can be found in [36].

Current research focuses particularly on suitable ways of characterizing the queries that allow one to transfer the ranking from one query to another [20, 40]. In some sense, a query may be considered as a context for a ranking, and the task is to learn a function that allows one to transfer the ranking from one context to the other. Much of the research is conducted on the LETOR (LEarning TO Rank for information retrieval) collection, a package of datasets containing a wide variety of queries with user feedback in several domains.

Users can provide explicit feedback by labeling the retrieved pages with their degree of relevance. However, users are only willing to do this for a limited number of pages. It would be better if feedback could be collected in a way that is transparent to the user. RADLINSKI & JOACHIMS discuss a variety of techniques that allow one to collect user feedback implicitly via clicking behavior. Alternatively, ARENS proposes the use of active learning techniques, which help to minimize the burden on the user by a careful automatic selection of suitable training examples for the ranking algorithm. He illustrates his technique in an application that learns a ranking function for the PubMed search engine for the MEDLINE database of biomedical literature.

4.2 Recommender Systems

Nowadays, recommender systems [43] are frequently used by online stores to recommend products to their customers. Such systems typically store a data table of users versus products, which records the degree of preference of each user for each product. A customer can provide this preference degree explicitly by giving some sort of feedback (e.g., by assigning a rating to a movie) or implicitly (e.g., by buying the DVD of the movie). The elegant idea of collaborative filtering systems [21] is that recommendations can be based on user similarity, and that user similarity can in turn be defined by the similarity of the users' recommendations. Alternatively, recommender systems can also be based on item similarities, which are defined via the recommendations of the users that recommended the items in question. Yet other approaches try to learn models that capture the preference information contained in the matrix. A very good comparison of these approaches can be found in [8].
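A minimal sketch of user-based collaborative filtering is given below; the rating matrix is synthetic, zero denotes a missing rating, and cosine similarity over co-rated items stands in for the many similarity measures used in practice:

```python
# Sketch of user-based collaborative filtering: predict a missing rating as
# a similarity-weighted average of other users' ratings (synthetic matrix,
# 0 = unrated; cosine similarity computed over co-rated items).
import numpy as np

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 5, 4],
              [0, 1, 5, 4]], dtype=float)   # users x products

def predict(u, item):
    num = den = 0.0
    for v in range(len(R)):
        if v == u or R[v, item] == 0:
            continue
        mask = (R[u] > 0) & (R[v] > 0)      # items rated by both users
        if not mask.any():
            continue
        sim = R[u, mask] @ R[v, mask] / (
            np.linalg.norm(R[u, mask]) * np.linalg.norm(R[v, mask]))
        num += sim * R[v, item]
        den += abs(sim)
    return num / den if den else 0.0

print(round(predict(0, 2), 2))  # predicted rating of user 0 for product 2
```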

In this book, DE GEMMIS et al. give an extensive survey of recommender systems. Subsequently, KARATZOGLOU & WEIMER describe the use of matrix factorization methods for compressing the information contained in the user/product matrix into a product of two lower-dimensional matrices. The dimensions over which the product is computed may be viewed as hidden concepts, which can be used to categorize the interests of a user. Interestingly, very few (on the order of 10) such concepts are enough for a sufficiently accurate representation of large numbers of users and products. Finally, BELLOGIN et al. describe an approach that uses decision tree learning for identifying features of recommendation models that influence the quality of the predicted preference ranking.
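The following Python sketch illustrates the basic matrix factorization idea with a bare-bones stochastic gradient descent on the observed entries; the data, rank, learning rate, and iteration count are all hypothetical choices, not those of the chapter:

```python
# Sketch of low-rank matrix factorization for ratings, fit by stochastic
# gradient descent on the observed entries only (synthetic data; d = 2
# hidden concepts). A bare-bones version of the idea described above.
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 5, 4],
              [0, 1, 5, 4]], dtype=float)    # 0 = missing rating
n_users, n_items, d = *R.shape, 2
U = 0.1 * rng.standard_normal((n_users, d))  # user concept weights
V = 0.1 * rng.standard_normal((n_items, d))  # item concept weights

observed = [(u, i) for u in range(n_users) for i in range(n_items) if R[u, i]]
for _ in range(2000):
    u, i = observed[rng.integers(len(observed))]
    err = R[u, i] - U[u] @ V[i]
    U[u], V[i] = U[u] + 0.05 * err * V[i], V[i] + 0.05 * err * U[u]

print(np.round(U @ V.T, 1))   # reconstructed matrix, missing cells filled
```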

5 Conclusions

In this introductory chapter, we have tried to give an overview of different approaches to preference learning, categorized by the learning task (label, instance, or object ranking) and the learning technique (learning utility functions, learning binary preference relations, learning preference models having a specific structure, or using local estimation and preference aggregation methods). In principle, all task/technique combinations are conceivable and can be found in the literature. We also highlighted important application areas, in particular the ranking of search results and product recommendations.

Throughout the chapter, pointers to the remaining chapters of this book were given. These contributions could be characterized along a variety of dimensions, including all of those mentioned above. We have adopted a grouping that more or less corresponds to the sections of this survey. As the categorization is multi-dimensional, most chapters fit into several categories, and, in some cases, the choice was not entirely clear. This is also why we do not have separate parts on, for example, learning a utility function versus learning a binary preference relation: almost all approaches presented in this book follow one of these two approaches.