1 Introduction

In this paper, we establish a connection between analogical reasoning and kernel-based learning, two important subfields of artificial intelligence. Essentially, this connection rests on the observation that a specific formalization of analogical relationships, so-called analogical proportions [25, 29], defines a kernel function on pairs of objects. The relationship is established by means of generalized (fuzzy) equivalence relations as a bridging concept.

Analogical reasoning has a long tradition in artificial intelligence research, and various attempts at formalizing analogy-based inference can be found in the literature. In this regard, the aforementioned concept of analogical proportion is an especially appealing approach, which has already been used successfully in different problem domains, including classification [7], recommendation [19], preference completion [27], decision making [5], and solving IQ tests [4].

In spite of its popularity in AI in general, analogical reasoning has not been considered much in machine learning so far. Yet, analogical proportions have recently been used in the context of preference learning [1, 6], a branch of machine learning that has received increasing attention in recent years [15]. Roughly speaking, the goal in preference learning is to induce preference models from observational (or experimental) data that reveal, directly or indirectly, information about the preferences of an individual or a group of individuals; such models typically serve the purpose of predictive modeling, i.e., they are used to predict preferences in a new situation.

Frequently, the predicted preference relation is required to form a total order, in which case we also speak of a ranking problem. In fact, among the problems in the realm of preference learning, the task of “learning to rank” has probably received the most attention in the literature so far, and a number of different ranking problems have already been introduced. Based on the type of training data and the required predictions, [15] distinguish between the problems of object ranking [13, 21], label ranking [12, 17, 34], and instance ranking [16].

Building on [1], the focus of this paper is on the problem of object ranking. Given training data in the form of a set of exemplary rankings of subsets of objects, the goal in object ranking is to learn a ranking function that is able to predict the ranking of any new set of objects. Our contribution is a novel approach to this problem, namely a kernel-based implementation of analogy-based object ranking.

The rest of the paper is organized as follows. In the next section, we recall the setting of object ranking and formalize the corresponding learning problem. Section 3 outlines existing methods for the object ranking task, followed by Sect. 4 in which the connection between analogical reasoning and kernel-based learning is established. In Sect. 5, we introduce kernel-based analogical reasoning for the object ranking problem. Finally, we present an experimental evaluation of this approach in Sect. 6, prior to concluding the paper with a summary and an outline of future work.

2 Problem Formulation

Consider a reference set of objects, items, or choice alternatives \(\mathcal {X}\), and assume each item \(\varvec{x} \in \mathcal {X}\) to be described in terms of a feature vector; thus, an item is a vector \(\varvec{x} = (x_1, \ldots , x_d) \in \mathbb {R}^d\) and \(\mathcal {X} \subseteq \mathbb {R}^d\). The goal in object ranking is to learn a ranking function \(\rho \) that accepts any (query) subset \(Q = \{ \varvec{x}_1, \ldots , \varvec{x}_n \} \subseteq \mathcal {X}\) of \(n = |Q|\) items as input. As output, the function produces a ranking \(\pi \in \mathbb {S}_n\) of these items, where \(\mathbb {S}_n\) denotes the set of all permutations of length n, i.e., all bijective mappings \([n] \longrightarrow [n]\) (the symmetric group of degree n); \(\pi \) represents the total order

$$\begin{aligned} \varvec{x}_{\pi ^{-1}(1)} \succ \varvec{x}_{\pi ^{-1}(2)} \succ \ldots \succ \varvec{x}_{\pi ^{-1}(n)} , \end{aligned}$$
(1)

i.e., \(\pi ^{-1}(k)\) is the index of the item on position k, while \(\pi (k)\) is the position of the kth item \(\varvec{x}_k\) (\(\pi \) is often called a ranking and \(\pi ^{-1}\) an ordering). Formally, a ranking function is thus a mapping

$$\begin{aligned} \rho : \, \mathcal {Q}\longrightarrow \mathcal {R}, \end{aligned}$$
(2)

where \(\mathcal {Q}= 2^\mathcal {X} \setminus \{ \emptyset \} \) is the query space and \(\mathcal {R}= \bigcup _{n \in \mathbb {N}} \mathbb {S}_n\) the ranking space. The order relation “\(\succ \)” is typically (though not necessarily) interpreted in terms of preferences, i.e., \(\varvec{x} \succ \varvec{y}\) suggests that \(\varvec{x}\) is preferred to \(\varvec{y}\). A ranking function \(\rho \) is learned on a set of training data that consists of a set of rankings

$$\begin{aligned} \mathcal {D} = \big \{ (Q_1, \pi _1) , \ldots , (Q_M, \pi _M) \big \} , \end{aligned}$$
(3)

where each ranking \(\pi _\ell \) defines a total order of the set of objects \(Q_\ell \). Once a ranking function has been learned, it can be used for making predictions for new query sets Q. Such predictions are evaluated in terms of a suitable loss function or performance metric. A common choice is the (normalized) ranking loss, which counts the number of inversions between two rankings \(\pi \) and \(\pi '\):

$$\begin{aligned} d_{RL}(\pi , \pi ') = \frac{ \sum _{1 \le i , j \le n} \llbracket {\pi (i) < \pi (j)} \rrbracket \llbracket {\pi '(i) > \pi '(j)} \rrbracket }{n(n-1)/2} , \end{aligned}$$

where \(\llbracket \cdot \rrbracket \) is the indicator function. The ranking function (2) sought in object ranking is a complex mapping from the query to the ranking space. An important question, therefore, is how to represent a “ranking-valued” function of that kind, and how it can be learned efficiently.
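
To make the evaluation measure concrete, here is a minimal sketch of the normalized ranking loss in Python (the function name and the array encoding of rankings are our own choices):

```python
import numpy as np

def ranking_loss(pi, pi_prime):
    """Normalized ranking loss d_RL: fraction of discordant item pairs.

    pi, pi_prime: sequences of length n, where pi[i] is the position
    assigned to the i-th item (only the relative order matters).
    """
    pi, pi_prime = np.asarray(pi), np.asarray(pi_prime)
    n = len(pi)
    discordant = sum(1 for i in range(n) for j in range(n)
                     if pi[i] < pi[j] and pi_prime[i] > pi_prime[j])
    return discordant / (n * (n - 1) / 2)

# Reversing a ranking of three items yields the maximal loss of 1.
print(ranking_loss([1, 2, 3], [3, 2, 1]))  # 1.0
```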

3 Previous Work

Quite a number of approaches to object ranking and related learning-to-rank problems have already been proposed in the literature, most of them based on the idea of representing a ranking function via an underlying (latent) utility function. Depending on the type of training data used for learning such a function, a distinction is often made between so-called pointwise [21, 22], pairwise [10, 20], and listwise [11] approaches.

In the following, we give a brief overview of the analogy-based approach recently put forward in [1], which is most relevant for us. This approach essentially builds on the following inference pattern: If object \(\varvec{a}\) relates to object \(\varvec{b}\) as \(\varvec{c}\) relates to \(\varvec{d}\), and knowing that \(\varvec{a}\) is preferred to \(\varvec{b}\), we (hypothetically) infer that \(\varvec{c}\) is preferred to \(\varvec{d}\). This principle is formalized using the concept of analogical proportion [25]. For every quadruple of objects \(\varvec{a},\varvec{b},\varvec{c},\varvec{d}\), the latter provides a numerical degree to which these objects are in analogical relation to each other. To this end, such a degree is first determined for each attribute value (feature) separately, and these degrees are then combined into an overall degree of analogy.

Consider four values \(a, b, c, d\) from an attribute domain \(\mathbb {X}\). The quadruple \((a,b,c,d)\) is said to be in analogical proportion, denoted by \(a:b::c:d\), if “a relates to b as c relates to d”. A bit more formally, the degree of proportion can be expressed as

$$\begin{aligned} E \big ( \mathcal {R}(a,b) , \mathcal {R}(c,d) \big ) , \end{aligned}$$
(4)

where the relation E denotes the “as” part of the informal description. \(\mathcal {R}\) can be instantiated in different ways, depending on the underlying domain \(\mathbb {X}\).

In the case of Boolean variables, where \(\mathbb {X} = \{0,1\}\), there are \(2^4=16\) instantiations of the pattern \(a:b::c:d\), of which only the following 6 satisfy a set of axioms required to hold for analogical proportions: (0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 0, 1), (1, 0, 1, 0), (1, 1, 0, 0), (1, 1, 1, 1). This formalization captures the idea that a differs from b (in the sense of being “equally true”, “more true”, or “less true”, if the values 0 and 1 are interpreted as truth degrees) exactly as c differs from d, and vice versa. In the numerical case, assuming all attributes to be normalized to the unit interval [0, 1], the concept of analogical proportion can be generalized on the basis of generalized logical operators [8, 14]. In this case, the analogical proportion becomes a matter of degree, i.e., a quadruple \((a,b,c,d)\) can be in analogical proportion to some degree between 0 and 1. An example of such a proportion, with \(\mathcal {R}\) the arithmetic difference \(\mathcal {R}(a,b)=a-b\), is the following:

$$\begin{aligned} v(a,b,c,d) = 1- | (a-b) - (c-d)| , \end{aligned}$$
(5)

if \({\text {sign}}(a-b) = {\text {sign}}(c-d)\), and 0 otherwise. Note that this formalization indeed generalizes the Boolean case (where \(a,b,c,d \in \{0,1 \}\)).

To extend analogical proportions from individual values to complete feature vectors, the individual degrees of proportion can be combined using any suitable aggregation function, for example the arithmetic mean:

$$\begin{aligned} v(\varvec{a}, \varvec{b} , \varvec{c} , \varvec{d}) = \frac{1}{d} \sum _{i=1}^d v(a_i , b_i , c_i , d_i) . \end{aligned}$$
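
To illustrate, the following is a small sketch of how the attribute-wise proportion (5) and its mean aggregation could be computed (function names are ours; attribute values are assumed to be normalized to [0, 1]):

```python
import numpy as np

def v_scalar(a, b, c, d):
    """Degree of analogical proportion (5) for attribute values in [0, 1]."""
    if np.sign(a - b) != np.sign(c - d):
        return 0.0
    return 1.0 - abs((a - b) - (c - d))

def v_vector(a, b, c, d):
    """Mean aggregation of attribute-wise degrees for feature vectors."""
    return float(np.mean([v_scalar(ai, bi, ci, di)
                          for ai, bi, ci, di in zip(a, b, c, d)]))

# Boolean special case: (0, 1, 0, 1) is a perfect proportion, (0, 1, 1, 0) is not.
print(v_scalar(0, 1, 0, 1), v_scalar(0, 1, 1, 0))  # 1.0 0.0
```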

With a measure of analogical proportion at hand, the object ranking task is tackled as follows: Consider any pair of query objects \(\varvec{x}_i , \varvec{x}_j \in Q\). Every preference \(\varvec{z} \succ \varvec{z}'\) observed in the training data \(\mathcal {D}\), such that \((\varvec{z}, \varvec{z}', \varvec{x}_i , \varvec{x}_j)\) are in analogical proportion, suggests that \(\varvec{x}_i \succ \varvec{x}_j\). This principle is referred to as analogical transfer of preferences, because the observed preference for \(\varvec{z} , \varvec{z}'\) is (hypothetically) transferred to \(\varvec{x}_i, \varvec{x}_j\). Accumulating all pieces of evidence in favor of \(\varvec{x}_i \succ \varvec{x}_j\) and, vice versa, of the opposite preference \(\varvec{x}_j \succ \varvec{x}_i\), an overall degree \(p_{i,j}\) is derived for this pair of objects. The same is done for all other pairs in the query. Eventually, all these degrees are combined into an overall consensus ranking. We refer to [1] for a detailed description of this method, which is called “analogy-based learning to rank” (able2rank) by the authors.

As an aside, note that an analogy-based approach as outlined above appears to be specifically suitable for transfer learning. This is mainly because the relation \(\mathcal {R}\) is evaluated separately for “source objects” a and b on the one side and “target objects” c and d on the other side, but never between sources and targets. In principle, one could even think of using different specifications of \(\mathcal {R}\) for the source and the target.

4 Analogy and Kernels

The core idea of our proposal is based on the observation that an analogical proportion, by definition, defines a kind of similarity between the relations of pairs of objects: According to (4), the analogical proportion \(a:b::c:d\) holds if \(\mathcal {R}(a,b)\) is similar to \(\mathcal {R}(c,d)\). The notion of similarity plays an important role in machine learning in general, and in kernel-based machine learning in particular. In fact, kernel functions can typically be interpreted in terms of similarity. Thus, a kernel-based approach appears to be a natural way to incorporate analogical reasoning into machine learning.

More specifically, to establish a connection between kernel-based machine learning and analogical reasoning, we make use of generalized (fuzzy) equivalence relations as a bridging concept. Fuzzy equivalences are weakened forms of standard equivalence relations, and hence capture the notion of similarity. Formally, a fuzzy equivalence relation E on a set \(\mathcal {X}\) is a fuzzy subset of \(\mathcal {X} \times \mathcal {X}\), that is, a function \(E:\, \mathcal {X}^2 \longrightarrow [0,1]\), which is reflexive, symmetric, and \(\top \)-transitive:

  • \(E(x,x) = 1\) for all \(x \in \mathcal {X}\),

  • \(E(x,y)=E(y,x)\) for all \(x, y \in \mathcal {X}\),

  • \(\top (E(x,y), E(y,z)) \le E(x,z)\) for all \(x, y, z \in \mathcal {X}\),

where \(\top \) is a triangular norm (t-norm), that is, a generalized logical conjunction. In our case, the relation E in (4) will play the role of a fuzzy equivalence. The detour via fuzzy equivalences is motivated by the result of [26], who proved that certain types of fuzzy equivalence relations satisfy the properties of a kernel function. Before elaborating on this idea in more detail, we briefly recall some basic concepts of kernel-based machine learning as needed for this paper. For a thorough discussion of kernel methods, see for instance [32, 33].

4.1 Kernels

Let \(\mathcal {X}\) be a nonempty set. A function \(k: \mathcal {X} \times \mathcal {X} \longrightarrow \mathbb {R}\) is a positive semi-definite kernel on \(\mathcal {X}\) iff it is symmetric, i.e., \(k(x,y) = k(y,x)\) for all \(x,y \in \mathcal {X}\), and positive semi-definite, i.e.,

$$\begin{aligned} \sum _{i=1}^n \sum _{j=1}^n c_i c_j k(x_i,x_j) \ge 0 \end{aligned}$$

for arbitrary n, arbitrary instances \(x_1, \ldots , x_n \in \mathcal {X}\) and arbitrary \(c_1, \ldots , c_n \in \mathbb {R}\). Given a kernel k on \(\mathcal {X}\), an important theorem by [24] implies the existence of a (Hilbert) space \(\mathcal {H}\) and a map \(\phi :\, \mathcal {X} \longrightarrow \mathcal {H}\), such that

$$ k(x,y) = \langle \phi (x) , \phi (y) \rangle $$

for all \(x,y \in \mathcal {X}\). Thus, computing the kernel \(k(x,y)\) in the original space \(\mathcal {X}\) is equivalent to mapping x and y to \(\mathcal {H}\) first, using the linearization or feature map \(\phi \), and combining them in terms of the inner product in that space afterward. This connection between a nonlinear combination of instances in the original space \(\mathcal {X}\) and a linear combination in the induced feature space \(\mathcal {H}\) provides the basis for the so-called “kernel trick”, which offers a systematic way to design nonlinear extensions of methods for learning linear models. The kernel trick has been applied to various methods and has given rise to many state-of-the-art machine learning algorithms, including support vector machines, kernel principal component analysis, and kernel Fisher discriminant analysis, amongst others [30, 31].
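
As a toy illustration of this correspondence (our own example, not part of the original exposition): for the homogeneous polynomial kernel of degree 2 on \(\mathbb {R}^2\), the feature map \(\phi \) can be written down explicitly.

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = <x, y>^2."""
    return float(np.dot(x, y)) ** 2

def phi(x):
    """Explicit feature map into R^3 for two-dimensional inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([0.3, 0.8]), np.array([0.5, 0.1])
# k(x, y) equals the inner product of the mapped instances in feature space.
print(np.isclose(poly2_kernel(x, y), float(np.dot(phi(x), phi(y)))))  # True
```

For the analogy kernel considered next, such a feature map is only guaranteed to exist; the kernel itself is evaluated directly in the original space.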

4.2 Analogical Proportions as Kernels

Our focus in this paper is the analogical proportion (5), which is a map \(v:\, [0,1]^4 \longrightarrow [0,1]\). In this case, the relation \(\mathcal {R}\) is the simple arithmetic difference \(\mathcal {R}(a,b)=a-b\), and the similarity relation E is defined as \(E(u, v) = 1- |u-v|\) if u and v have the same sign, and \(E(u, v) = 0\) otherwise. As an aside, we note that, strictly speaking, E thus defined is not a fuzzy equivalence relation. This is due to the thresholding in the case where \(\text {sign} (a-b) \ne \text {sign} (c-d)\). Without this thresholding, E would be a \(\top _{\!\!\L }\)-equivalence, where \(\top _{\!\!\L }\) is the Łukasiewicz t-norm \((\alpha ,\beta ) \mapsto \max (\alpha +\beta -1,0)\). For modeling analogy, however, setting E to 0 in the case where b deviates positively from a while d deviates negatively from c (or vice versa) appears reasonable.
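
As an illustration of this point, one can check the three properties of a fuzzy equivalence numerically for \(E(u,v) = 1-|u-v|\) with the Łukasiewicz t-norm, restricting the arguments to the unit interval (a small sketch; the grid-based check is our own simplification, not a proof):

```python
import itertools
import numpy as np

def E(u, v):
    """Similarity relation E(u, v) = 1 - |u - v| on [0, 1]."""
    return 1.0 - abs(u - v)

def t_lukasiewicz(x, y):
    """Lukasiewicz t-norm."""
    return max(x + y - 1.0, 0.0)

grid = np.linspace(0.0, 1.0, 11)
assert all(E(x, x) == 1.0 for x in grid)                                      # reflexivity
assert all(E(x, y) == E(y, x) for x, y in itertools.product(grid, repeat=2))  # symmetry
assert all(t_lukasiewicz(E(x, y), E(y, z)) <= E(x, z) + 1e-12                 # transitivity
           for x, y, z in itertools.product(grid, repeat=3))
print("E is a Lukasiewicz-equivalence on the sampled grid")
```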

We reinterpret v as defined above as a kernel function \(k:\, [0,1]^2 \times [0,1]^2 \longrightarrow [0,1]\) on \(\mathcal {X} = [0,1]^2\), i.e., a kernel on pairs of pairs of objects, which essentially means equating k with E:

$$\begin{aligned} k(a,b,c,d) = 1- |(a-b) - (c-d)| \end{aligned}$$
(6)

if \(\text {sign} (a-b) = \text {sign} (c-d)\), and 0 otherwise. In what follows, we show that the “analogy kernel” (6) does indeed define a proper kernel function. The first property to be fulfilled, namely symmetry, is obvious. Thus, it remains to show that k is also positive semi-definite, which is done in Theorem 1 below. As a preparation, we first recall the following lemma, which is proved by [26] as part of his Theorem 11.

Lemma 1

Let \(\mu _1, \ldots , \mu _n \in [0,1]\), \(n \in \mathbb {N}\), and the matrix M be defined by

$$ M^{(n)}_{i,j} = ( 1- | \mu _i - \mu _j | ) . $$

Then M has a non-negative determinant.

Theorem 1

The function \(k:\, [-1,1]^2 \longrightarrow [0,1]\) defined as

$$\begin{aligned} k(u,v) = {\left\{ \begin{array}{ll} 1 - |u-v| &{} \text {if } \text {sign}(u)=\text {sign}(v), \\ 0, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$

is a valid kernel.

Proof

It is easy to see that k is symmetric. Thus, it remains to show that it is positive semi-definite. To this end, it suffices to show that all principal minors of every kernel matrix produced by k are non-negative. Thus, consider \(\alpha _1, \ldots , \alpha _n \in [-1,1]\), \(n \in \mathbb {N}\), and the matrix K defined as

$$\begin{aligned} K^{(n)}_{i,j} = {\left\{ \begin{array}{ll} 1 - |\alpha _i-\alpha _j| ,&{} \text {if } \text {sign}(\alpha _i)=\text {sign}(\alpha _j), \\ 0, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(7)

We need to show that

$$ \det \bigg ( K^{(m)}_{i,j} \bigg ) \ge 0 , $$

for all \(1 \le m \le n\). Since a simultaneous permutation of rows and columns does not change the determinant, we can assume (without loss of generality) that the values \(\alpha _i\) are sorted in non-increasing order, i.e., \(\alpha _1 \ge \alpha _2 \ge \cdots \ge \alpha _n\); in particular, note that the positive \(\alpha _i\) will then precede all the negative ones. Thus, the matrix K takes the form of a block-diagonal matrix

$$ K = \begin{pmatrix} A &{} 0 \\ 0 &{} B \\ \end{pmatrix} , $$

in which the submatrix A contains the values of K for which \(\alpha _i, \alpha _j \in [0,1]\), and B contains the values of K where \(\alpha _i, \alpha _j\) are negative. According to Lemma 1, \(\det (A) \ge 0\). Moreover, since \(1-|u-v| = 1-|(-u)-(-v)|\) for \(u,v \in [0,1]\), the same lemma can also be applied to the submatrix B, hence \(\det (B) \ge 0\). Finally, we can exploit that

$$ \det (K) = \det (A) \det (B). $$

Since both matrices A and B have non-negative determinant, it follows that \(\det (K) \ge 0\), which completes the proof.
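
The result can also be checked empirically: the eigenvalues of Gram matrices produced by the analogy kernel should be (numerically) non-negative. A minimal sketch, with the random sampling as our own choice:

```python
import numpy as np

def k_analogy(u, v):
    """Analogy kernel on arguments u, v in [-1, 1], cf. Theorem 1."""
    if np.sign(u) != np.sign(v):
        return 0.0
    return 1.0 - abs(u - v)

rng = np.random.default_rng(0)
alphas = rng.uniform(-1.0, 1.0, size=50)
K = np.array([[k_analogy(u, v) for v in alphas] for u in alphas])
# All eigenvalues of the (symmetric) Gram matrix should be non-negative.
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```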

The class of kernel functions is closed under various operations, including addition and multiplication by a positive constant. This allows us to extend the analogy kernel from individual variables to feature vectors using the arithmetic mean as an aggregation function:

$$\begin{aligned} k_A( \varvec{a}, \varvec{b} , \varvec{c}, \varvec{d}) = \frac{1}{d} \sum _{i=1}^d k(a_i , b_i, c_i, d_i) . \end{aligned}$$
(8)

Furthermore, to allow for incorporating a certain degree of non-linearity, we make use of a homogeneous polynomial kernel of degree 2,

$$\begin{aligned} k_A'( \varvec{a}, \varvec{b} , \varvec{c}, \varvec{d} ) = \big ( k_A( \varvec{a}, \varvec{b} , \varvec{c}, \varvec{d} ) \big )^2 , \end{aligned}$$
(9)

which is again a valid kernel.
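
A self-contained sketch of the vector-valued kernels (8) and (9) (the function names are ours; the scalar kernel takes the differences \(a_i - b_i\) and \(c_i - d_i\) as its two arguments):

```python
import numpy as np

def k_analogy(u, v):
    """Scalar analogy kernel on differences u = a - b and v = c - d."""
    if np.sign(u) != np.sign(v):
        return 0.0
    return 1.0 - abs(u - v)

def k_A(a, b, c, d):
    """Analogy kernel (8): mean of attribute-wise analogy kernels."""
    return float(np.mean([k_analogy(ai - bi, ci - di)
                          for ai, bi, ci, di in zip(a, b, c, d)]))

def k_A_poly2(a, b, c, d):
    """Homogeneous polynomial variant (9) of degree 2."""
    return k_A(a, b, c, d) ** 2

a, b = np.array([0.2, 0.9]), np.array([0.4, 0.5])
c, d = np.array([0.1, 0.8]), np.array([0.3, 0.3])
print(k_A(a, b, c, d), k_A_poly2(a, b, c, d))  # 0.95 0.9025
```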

5 Analogy-Kernel-Based Object Ranking

Recall that, in the setting of learning to rank, we are given a set of training data of the form

$$ \mathcal {D} = \big \{ (Q_1, \pi _1) , \ldots , (Q_M, \pi _M) \big \} \ , $$

where each \(\pi _\ell \) defines a ranking of the set of objects \(Q_\ell \). If \(\varvec{z}_i , \varvec{z}_j \in Q_\ell \) and \(\pi _\ell (i) < \pi _\ell (j)\), then \(\varvec{z}_i \succ \varvec{z}_j\) has been observed as a preference. Our approach to object ranking based on the analogy kernel, AnKer-rank, comprises two main steps:

  • First, for each pair of objects \(\varvec{x}_i , \varvec{x}_j \in Q\), a degree of preference \(p_{i,j} \in [0,1]\) is derived from \(\mathcal {D}\). If these degrees are normalized such that \(p_{i,j} + p_{j,i} = 1\), they define a reciprocal preference relation

    $$\begin{aligned} P= \Big ( p_{i,j} \Big )_{1 \le i \ne j \le n} . \end{aligned}$$
    (10)
  • Second, the preference relation P is turned into a ranking \(\pi \) using a suitable ranking procedure.

Both steps will be explained in more detail further below.

5.1 Prediction of Pairwise Preferences

The first step of our proposed approach, prediction of pairwise preferences, is based on a reduction to binary classification. To this end, training data \(\mathcal {D}_{bin}\) is constructed as follows: Consider any preference \(\varvec{x}_i \succ \varvec{x}_j\) that can be extracted from the original training data \(\mathcal {D}\), i.e., from any of the rankings \(\pi _m\), \(m \in [M]\). Then \(\varvec{z}_{i,j}=(\varvec{x}_i, \varvec{x}_j)\) is a positive example for the binary problem (with label \(y_{i,j}=+1\)), and \(\varvec{z}_{j,i}=(\varvec{x}_j, \varvec{x}_i)\) is a negative example (with label \(y_{j,i}=-1\)). Since these examples essentially carry the same information, we only add one of them to \(\mathcal {D}_{bin}\). To keep a balance between positive and negative examples, the choice is simply made by flipping a fair coin.
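
A minimal sketch of this construction (the representation of rankings and the function names are our own choices; \(\pi \) is encoded as an array of positions, consistent with Sect. 2):

```python
import numpy as np

rng = np.random.default_rng(1)

def build_binary_data(D):
    """Construct D_bin from training data D = [(Q, pi), ...], where Q is an
    (n, d) array of items and pi[i] is the position of item Q[i]
    (smaller position = higher preference)."""
    D_bin = []
    for Q, pi in D:
        n = len(Q)
        for i in range(n):
            for j in range(i + 1, n):
                winner, loser = (i, j) if pi[i] < pi[j] else (j, i)
                if rng.random() < 0.5:   # fair coin: keep the positive example ...
                    D_bin.append(((Q[winner], Q[loser]), +1))
                else:                    # ... or the equivalent negative example
                    D_bin.append(((Q[loser], Q[winner]), -1))
    return D_bin
```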

Note that, for any pair of instances \((\varvec{a}, \varvec{b})\) and \((\varvec{c}, \varvec{d})\) in \(\mathcal {D}_{bin}\), the analogy kernel (8) is well-defined, i.e., \(k_A(\varvec{a}, \varvec{b}, \varvec{c}, \varvec{d})\) can be computed. Therefore, a binary predictor \(h_{bin}\) can be trained on \(\mathcal {D}_{bin}\) using any kernel-based classification method. We assume \(h_{bin}\) to produce predictions in the unit interval [0, 1], which can be achieved, for example, by means of support vector machines with a suitable post-processing such as Platt-scaling [28].

Now, consider any pair of objects \(\varvec{x}_i, \varvec{x}_j\) from a new query \(Q=\{ \varvec{x}_1, \ldots , \varvec{x}_n \}\). Again, the analogy kernel can be applied to this pair and any example from \(\mathcal {D}_{bin}\), so that a (binary) prediction for the preference between \(\varvec{x}_i\) and \(\varvec{x}_j\) can be derived from \(h_{bin}\). More specifically, querying this model with \(\varvec{z}_{i,j}=(\varvec{x}_i, \varvec{x}_j)\) yields a degree of support \(q_{i,j}=h_{bin}(\varvec{z}_{i,j})\) in favor of \(\varvec{x}_i \succ \varvec{x}_j\), while querying it with \(\varvec{z}_{j,i}=(\varvec{x}_j, \varvec{x}_i)\) yields a degree of support \(q_{j,i}=h_{bin}(\varvec{z}_{j,i})\) in favor of \(\varvec{x}_j \succ \varvec{x}_i\). As already said, we assume both degrees to be normalized within the range [0, 1], and define \(p_{i,j} = (1+q_{i,j}-q_{j,i})/2\) as an estimate for the probability of the preference \(\varvec{x}_i \succ \varvec{x}_j\). This estimate constitutes one of the entries in the preference relation (10).
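
Putting the pieces together, the pairwise degrees \(p_{i,j}\) could be computed roughly as follows. This is a simplified sketch, not the authors' exact implementation; it uses scikit-learn's SVC with a precomputed Gram matrix of the analogy kernel (9) and Platt scaling for probabilistic outputs, and the inlined kernel mirrors the sketch from Sect. 4.2:

```python
import numpy as np
from sklearn.svm import SVC

def k_A_poly2(a, b, c, d):
    """Analogy kernel (9) on feature vectors (cf. the sketch in Sect. 4.2)."""
    ks = [0.0 if np.sign(u) != np.sign(v) else 1.0 - abs(u - v)
          for u, v in zip(a - b, c - d)]
    return float(np.mean(ks)) ** 2

def gram(pairs_row, pairs_col):
    """Gram matrix of the analogy kernel between two lists of object pairs."""
    return np.array([[k_A_poly2(a, b, c, d) for (c, d) in pairs_col]
                     for (a, b) in pairs_row])

def pairwise_preferences(D_bin, Q, C=1.0):
    """Train h_bin on D_bin = [((x, x'), y), ...] and fill relation (10)."""
    train_pairs = [pair for pair, _ in D_bin]
    y = np.array([label for _, label in D_bin])
    h_bin = SVC(C=C, kernel="precomputed", probability=True)  # Platt scaling
    h_bin.fit(gram(train_pairs, train_pairs), y)

    n = len(Q)
    P = np.full((n, n), 0.5)
    for i in range(n):
        for j in range(i + 1, n):
            q_ij = h_bin.predict_proba(gram([(Q[i], Q[j])], train_pairs))[0, 1]
            q_ji = h_bin.predict_proba(gram([(Q[j], Q[i])], train_pairs))[0, 1]
            P[i, j] = (1.0 + q_ij - q_ji) / 2.0
            P[j, i] = 1.0 - P[i, j]
    return P
```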

5.2 Rank Aggregation

To turn pairwise preferences into a total order, we make use of a rank aggregation method. More specifically, we apply the Bradley-Terry-Luce (BTL) model, which is well-known in the literature on discrete choice [9]. It starts from the parametric model

$$\begin{aligned} \mathbf {P}(\varvec{x}_i \succ \varvec{x}_j) = \frac{\theta _i}{\theta _i + \theta _j} , \end{aligned}$$
(11)

where \(\theta _i, \theta _j \in \mathbb {R}_+\) are parameters representing the (latent) utility \(U(\varvec{x}_i)\) and \(U(\varvec{x}_j)\) of \(\varvec{x}_i\) and \(\varvec{x}_j\), respectively. Thus, according to the BTL model, the probability of observing a preference in favor of a choice alternative \(\varvec{x}_i\), when compared to any other alternative, is proportional to \(\theta _i\).

Given the preference relation (10), i.e., the entries \(p_{i,j}\) informing about the class probability of \(\varvec{x}_i \succ \varvec{x}_j\), the parameter \(\theta = (\theta _1, \ldots , \theta _n)\) can be estimated by likelihood maximization:

$$ \hat{\theta } \in \arg \max _{\theta \in \mathbb {R}_+^{n} } \prod _{1 \le i \ne j \le n} \left( \dfrac{\theta _{i}}{\theta _{i} + \theta _{j}} \right) ^{p_{i,j}} . $$

Finally, the predicted ranking \(\pi \) is obtained by sorting the items \(\varvec{x}_i\) in descending order of their estimated (latent) utilities \(\hat{\theta }_i\).
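
A minimal sketch of this step (the unconstrained parametrization \(\theta _i = \exp (\beta _i)\) and the use of a generic optimizer are our own implementation choices):

```python
import numpy as np
from scipy.optimize import minimize

def btl_ranking(P):
    """Estimate BTL utilities from the preference relation P, cf. (10), and
    return the item indices sorted from most to least preferred."""
    n = P.shape[0]

    def neg_log_likelihood(beta):
        theta = np.exp(beta)  # theta_i = exp(beta_i) > 0
        return -sum(P[i, j] * np.log(theta[i] / (theta[i] + theta[j]))
                    for i in range(n) for j in range(n) if i != j)

    beta_hat = minimize(neg_log_likelihood, x0=np.zeros(n)).x
    return np.argsort(-beta_hat)  # predicted ordering: best item first

# Example: item 0 dominates item 1, which dominates item 2.
P = np.array([[0.5, 0.9, 0.8], [0.1, 0.5, 0.7], [0.2, 0.3, 0.5]])
print(btl_ranking(P))  # [0 1 2]
```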

We note that many other rank aggregation techniques have been proposed in the literature and could in principle be used as well; see e.g. [2]. However, since BTL seems to perform very well, we did not consider any other method.

6 Experiments

To study the practical performance of our proposed method, we conducted experiments on several real-world data sets, essentially using the same setup as [1]. As baselines to compare with, we considered able2rank [1], expected rank regression (ERR) [21, 22], Ranking SVM (with linear kernel) [20] and RankNet [10].

Table 1. Properties of data sets.

6.1 Data

We used the same data sets as [1], which are collected from various domains (e.g., sports, education, tourism) and comprise different types of features (e.g., numeric, binary, ordinal). Table 1 provides a summary of the characteristics of the data sets. For a detailed description of the data, we refer the reader to the source paper. In addition, we include the rankings of the teams that participated in the men’s FIFA World Cup 2014 and 2018 (32 instances) as well as the under-17 edition in 2017 (22 instances) with respect to “goals statistics”. This data (Footnote 1) comprises 7 numeric features such as MatchesPlayed, GoalsFor, GoalsScored, etc.

6.2 Experimental Setup

For the analogy-based methods, an important pre-processing step is the normalization of the attributes in the feature representation \(\varvec{x}=(x_1, \ldots , x_d)\), because these attributes are assumed to take values in [0, 1]. To this end, we simply apply a linear rescaling

$$ x_k' \leftarrow \dfrac{x_k -\min _k}{ \max _k - \min _k } , $$

where \(\min _k\) and \(\max _k\) denote, respectively, the smallest and largest value of the kth feature in the data. This transformation is applied to the training data as well as the test data when a new query Q is received. Since the data from a new query is normally sparse, it might be better to take the minimum and maximum over the entire data, training and test. Yet, this strategy is not recommendable in case the test data has a different distribution. In fact, analogical inference is especially interesting for transfer learning (and indeed, in our experiments, training and test data are sometimes from different subdomains). Therefore, we first conduct a Kolmogorov-Smirnov test [23] to check whether the two parts of the data are drawn from the same distribution. In case the null hypothesis is rejected (at a significance level of \(\alpha = 0.05\)), normalization is conducted on the test data alone. Otherwise, the training data is additionally taken into account.
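
A sketch of this normalization logic, using scipy's two-sample Kolmogorov-Smirnov test; applying the test feature-wise is our reading of the procedure:

```python
import numpy as np
from scipy.stats import ks_2samp

def min_max_normalize_test(X_train, X_test, alpha=0.05):
    """Rescale each feature of X_test to [0, 1]. If the KS test rejects the
    hypothesis that training and test values share a distribution, min/max
    are taken from the test data alone; otherwise both parts are pooled."""
    X_norm = np.empty_like(X_test, dtype=float)
    for k in range(X_test.shape[1]):
        _, p_value = ks_2samp(X_train[:, k], X_test[:, k])
        values = (X_test[:, k] if p_value < alpha
                  else np.concatenate([X_train[:, k], X_test[:, k]]))
        lo, hi = values.min(), values.max()
        X_norm[:, k] = (X_test[:, k] - lo) / (hi - lo) if hi > lo else 0.0
    return X_norm
```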

We also apply a standard normalization for the other baseline methods (ERR, Ranking SVM and RankNet), transforming each real-valued feature by standardization:

$$ x \leftarrow \dfrac{x-\mu }{\sigma } , $$

where \(\mu \) and \(\sigma \) denote the empirical mean and standard deviation, respectively. Like for the analogy-based methods, a hypothesis test is conducted to decide whether the test data should be normalized separately or together with the training data.

The analogy kernel (9) was used for AnKer-rank. We tuned the cost parameter C of the SVM in an (internal) 2-fold cross-validation (repeated 3 times) on the training data. The search for C is guided by an algorithm (Footnote 2) proposed by [18], which computes the entire regularization path for the two-class SVM classifier (i.e., all possible values of C for which the solution changes) at a cost that is a small (\(\sim \)3) multiple of the cost of fitting a single model. The following RankNet parameters are adjusted using grid search and internal cross-validation: the number of units in the hidden layer (32, 64, 128, 256), the batch size (8, 16, 32), and the optimizer learning rate (0.001, 0.01, 0.1). Since the data sets are relatively small, the network was restricted to a single hidden layer.

Table 2. Results in terms of loss \(d_{RL}\) (averaged over 20 runs) on the test data.

6.3 Results

In our experiments, predictions were produced for a certain part \(D_{test}\) of the data, using another part \(D_{train}\) as training data; an experiment of that kind is denoted by \(D_{train} \rightarrow D_{test}\) and is considered for all possible combinations within each domain. The average ranking loss together with the standard deviation of the conducted experiments (repeated 20 times) is summarized in Table 2, where the numbers in parentheses indicate the rank of the achieved score in the respective problem. Moreover, the table shows average ranks per problem domain.

As can be seen, the relative performance of the methods depends on the domain. In any case, our proposed approach is quite competitive in terms of predictive accuracy, and essentially on a par with able2rank and Ranking SVM, whereas ERR and RankNet show worse performance.

7 Conclusion and Future Work

This paper elaborates on the connection between kernel-based machine learning and analogical reasoning in the context of preference learning. Building on the observation that analogical proportions define a kind of similarity between the relations of pairs of objects, and that kernel functions can be interpreted in terms of similarity, we utilize generalized (fuzzy) equivalence relations as a bridging concept to show that a particular type of analogical proportion defines a valid kernel function. We introduce the analogy kernel and advocate a concrete kernel-based approach for the problem of object ranking. First experimental results on real-world data from various domains are quite promising and suggest that our approach is competitive with state-of-the-art methods for object ranking.

By making analogical inference amenable to kernel methods, our paper depicts a broad spectrum of directions for future work. In particular, we plan to study kernel properties of other analogical proportions proposed in the literature (e.g., geometric proportions [3]).

Besides, various extensions in the direction of kernel-based methods are conceivable and highly interesting from the point of view of analogical reasoning. This includes the use of kernel-based methods other than SVM, techniques such as multiple kernel learning, etc. Last but not least, other types of applications, whether in preference learning or beyond, are also of interest.