1 Introduction

In this paper, we establish a connection between analogical reasoning and kernel-based learning, two important subfields of artificial intelligence. Essentially, this connection rests on the observation that a specific formalization of analogical relationships, so-called analogical proportions [25, 29], defines a kernel function on pairs of objects. The relationship is established by means of generalized (fuzzy) equivalence relations as a bridging concept.

Analogical reasoning has a long tradition in artificial intelligence research, and various attempts at formalizing analogy-based inference can be found in the literature. In this regard, the aforementioned concept of analogical proportion is an especially appealing approach, which has already been used successfully in different problem domains, including classification [7], recommendation [19], preference completion [27], decision making [5], and solving IQ tests [4].

In spite of its popularity in AI in general, analogical reasoning has not been considered much in machine learning so far. Yet, analogical proportions have recently been used in the context of preference learning [1, 6], a branch of machine learning that has received increasing attention in recent years [15]. Roughly speaking, the goal in preference learning is to induce preference models from observational (or experimental) data that reveal, directly or indirectly, information about the preferences of an individual or a group of individuals; such models typically serve the purpose of predictive modeling, i.e., they are used to predict preferences in a new situation.

Frequently, the predicted preference relation is required to form a total order, in which case we also speak of a ranking problem. In fact, among the problems in the realm of preference learning, the task of “learning to rank” has probably received the most attention in the literature so far, and a number of different ranking problems have already been introduced. Based on the type of training data and the required predictions, [15] distinguish between the problems of object ranking [13, 21], label ranking [12, 17, 34], and instance ranking [16].

Building on [1], the focus of this paper is on the problem of object ranking. Given training data in the form of a set of exemplary rankings of subsets of objects, the goal in object ranking is to learn a ranking function that is able to predict the ranking of any new set of objects. Our contribution is a novel approach to this problem, namely a kernel-based implementation of analogy-based object ranking.

The rest of the paper is organized as follows. In the next section, we recall the setting of object ranking and formalize the corresponding learning problem. Section 3 outlines existing methods for the object ranking task, followed by Sect. 4 in which the connection between analogical reasoning and kernel-based learning is established. In Sect. 5, we introduce kernel-based analogical reasoning for the object ranking problem. Finally, we present an experimental evaluation of this approach in Sect. 6, prior to concluding the paper with a summary and an outline of future work.

2 Problem Formulation

Consider a reference set of objects, items, or choice alternatives \(\mathcal {X}\), and assume each item \(\varvec{x} \in \mathcal {X}\) to be described in terms of a feature vector; thus, an item is a vector \(\varvec{x} = (x_1, \ldots , x_d) \in \mathbb {R}^d\) and \(\mathcal {X} \subseteq \mathbb {R}^d\). The goal in object ranking is to learn a ranking function \(\rho \) that accepts any (query) subset \(Q = \{ \varvec{x}_1, \ldots , \varvec{x}_n \} \subseteq \mathcal {X}\) of \(n = |Q|\) items as input. As output, the function produces a ranking \(\pi \in \mathbb {S}_n\) of these items, where \(\mathbb {S}_n\) denotes the set of all permutations of length n, i.e., all bijective mappings \([n] \longrightarrow [n]\) (the symmetric group of degree n); \(\pi \) represents the total order

$$\begin{aligned} \varvec{x}_{\pi ^{-1}(1)} \succ \varvec{x}_{\pi ^{-1}(2)} \succ \ldots \succ \varvec{x}_{\pi ^{-1}(n)} , \end{aligned}$$
(1)

i.e., \(\pi ^{-1}(k)\) is the index of the item on position k, while \(\pi (k)\) is the position of the kth item \(\varvec{x}_k\) (\(\pi \) is often called a ranking and \(\pi ^{-1}\) an ordering). Formally, a ranking function is thus a mapping

$$\begin{aligned} \rho : \, \mathcal {Q}\longrightarrow \mathcal {R}, \end{aligned}$$
(2)

where \(\mathcal {Q}= 2^\mathcal {X} \setminus \{ \emptyset \} \) is the query space and \(\mathcal {R}= \bigcup _{n \in \mathbb {N}} \mathbb {S}_n\) the ranking space. The order relation “\(\succ \)” is typically (though not necessarily) interpreted in terms of preferences, i.e., \(\varvec{x} \succ \varvec{y}\) suggests that \(\varvec{x}\) is preferred to \(\varvec{y}\). A ranking function \(\rho \) is learned on a set of training data that consists of a set of rankings

$$\begin{aligned} \mathcal {D} = \big \{ (Q_1, \pi _1) , \ldots , (Q_M, \pi _M) \big \} , \end{aligned}$$
(3)

where each ranking \(\pi _\ell \) defines a total order of the set of objects \(Q_\ell \). Once a ranking function has been learned, it can be used for making predictions for new query sets Q. Such predictions are evaluated in terms of a suitable loss function or performance metric. A common choice is the (normalized) ranking loss, which counts the number of inversions between two rankings \(\pi \) and \(\pi '\):

$$\begin{aligned} d_{RL}(\pi , \pi ') = \frac{ \sum _{1 \le i , j \le n} \llbracket {\pi (i) < \pi (j)} \rrbracket \llbracket {\pi '(i) > \pi '(j)} \rrbracket }{n(n-1)/2} , \end{aligned}$$

where \(\llbracket \cdot \rrbracket \) is the indicator function. The ranking function (2) sought in object ranking is a complex mapping from the query to the ranking space. An important question, therefore, is how to represent a “ranking-valued” function of that kind, and how it can be learned efficiently.
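
To make the evaluation measure concrete, here is a minimal sketch of the normalized ranking loss in Python (the function name and the array encoding of rankings are our own choices):

```python
import numpy as np

def ranking_loss(pi, pi_prime):
    """Normalized ranking loss d_RL: fraction of discordant item pairs.

    pi, pi_prime: sequences of length n, where pi[i] is the position
    assigned to the i-th item (only the relative order matters).
    """
    pi, pi_prime = np.asarray(pi), np.asarray(pi_prime)
    n = len(pi)
    discordant = sum(1 for i in range(n) for j in range(n)
                     if pi[i] < pi[j] and pi_prime[i] > pi_prime[j])
    return discordant / (n * (n - 1) / 2)

# Reversing a ranking of three items yields the maximal loss of 1.
print(ranking_loss([1, 2, 3], [3, 2, 1]))  # 1.0
```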

3 Previous Work

Quite a number of approaches to object ranking and related learning-to-rank problems have already been proposed in the literature, most of them based on the idea of representing a ranking function via an underlying (latent) utility function. Depending on the type of training data used for learning such a function, a distinction is often made between so-called pointwise [21, 22], pairwise [10, 20], and listwise [11] approaches.

In the following, we give a brief overview of the analogy-based approach recently put forward in [1], which is most relevant for us. This approach essentially builds on the following inference pattern: If object \(\varvec{a}\) relates to object \(\varvec{b}\) as \(\varvec{c}\) relates to \(\varvec{d}\), and knowing that \(\varvec{a}\) is preferred to \(\varvec{b}\), we (hypothetically) infer that \(\varvec{c}\) is preferred to \(\varvec{d}\). This principle is formalized using the concept of analogical proportion [25]. For every quadruple of objects \(\varvec{a},\varvec{b},\varvec{c},\varvec{d}\), the latter provides a numerical degree to which these objects are in analogical relation to each other. To this end, such a degree is first determined for each attribute value (feature) separately, and these degrees are then combined into an overall degree of analogy.

Consider four values \(a, b, c, d\) from an attribute domain \(\mathbb {X}\). The quadruple \((a,b,c,d)\) is said to be in analogical proportion, denoted by \(a:b::c:d\), if “a relates to b as c relates to d”. A bit more formally, the degree of proportion can be expressed as

$$\begin{aligned} E \big ( \mathcal {R}(a,b) , \mathcal {R}(c,d) \big ) , \end{aligned}$$
(4)

where the relation E denotes the “as” part of the informal description. \(\mathcal {R}\) can be instantiated in different ways, depending on the underlying domain \(\mathbb {X}\).

In the case of Boolean variables, where \(\mathbb {X} = \{0,1\}\), there are \(2^4=16\) instantiations of the pattern \(a:b::c:d\), of which only the following 6 satisfy a set of axioms required to hold for analogical proportions: (0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 0, 1), (1, 0, 1, 0), (1, 1, 0, 0), (1, 1, 1, 1). This formalization captures the idea that a differs from b (in the sense of being “equally true”, “more true”, or “less true”, if the values 0 and 1 are interpreted as truth degrees) exactly as c differs from d, and vice versa. In the numerical case, assuming all attributes to be normalized to the unit interval [0, 1], the concept of analogical proportion can be generalized on the basis of generalized logical operators [8, 14]. In this case, the analogical proportion becomes a matter of degree, i.e., a quadruple \((a,b,c,d)\) can be in analogical proportion to some degree between 0 and 1. An example of such a proportion, with \(\mathcal {R}\) the arithmetic difference \(\mathcal {R}(a,b)=a-b\), is the following:

$$\begin{aligned} v(a,b,c,d) = 1- | (a-b) - (c-d)| , \end{aligned}$$
(5)

if \({\text {sign}}(a-b) = {\text {sign}}(c-d)\), and 0 otherwise. Note that this formalization indeed generalizes the Boolean case (where \(a,b,c,d \in \{0,1 \}\)).

To extend analogical proportions from individual values to complete feature vectors, the individual degrees of proportion can be combined using any suitable aggregation function, for example the arithmetic mean:

$$\begin{aligned} v(\varvec{a}, \varvec{b} , \varvec{c} , \varvec{d}) = \frac{1}{d} \sum _{i=1}^d v(a_i , b_i , c_i , d_i) . \end{aligned}$$
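
To illustrate, the following is a small sketch of how the attribute-wise proportion (5) and its mean aggregation could be computed (function names are ours; attribute values are assumed to be normalized to [0, 1]):

```python
import numpy as np

def v_scalar(a, b, c, d):
    """Degree of analogical proportion (5) for attribute values in [0, 1]."""
    if np.sign(a - b) != np.sign(c - d):
        return 0.0
    return 1.0 - abs((a - b) - (c - d))

def v_vector(a, b, c, d):
    """Mean aggregation of attribute-wise degrees for feature vectors."""
    return float(np.mean([v_scalar(ai, bi, ci, di)
                          for ai, bi, ci, di in zip(a, b, c, d)]))

# Boolean special case: (0, 1, 0, 1) is a perfect proportion, (0, 1, 1, 0) is not.
print(v_scalar(0, 1, 0, 1), v_scalar(0, 1, 1, 0))  # 1.0 0.0
```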

With a measure of analogical proportion at hand, the object ranking task is tackled as follows: Consider any pair of query objects \(\varvec{x}_i , \varvec{x}_j \in Q\). Every preference \(\varvec{z} \succ \varvec{z}'\) observed in the training data \(\mathcal {D}\), such that \((\varvec{z}, \varvec{z}', \varvec{x}_i , \varvec{x}_j)\) are in analogical proportion, suggests that \(\varvec{x}_i \succ \varvec{x}_j\). This principle is referred to as analogical transfer of preferences, because the observed preference for \(\varvec{z} , \varvec{z}'\) is (hypothetically) transferred to \(\varvec{x}_i, \varvec{x}_j\). Accumulating all pieces of evidence in favor of \(\varvec{x}_i \succ \varvec{x}_j\) and, vice versa, of the opposite preference \(\varvec{x}_j \succ \varvec{x}_i\), an overall degree \(p_{i,j}\) is derived for this pair of objects. The same is done for all other pairs in the query. Eventually, all these degrees are combined into an overall consensus ranking. We refer to [1] for a detailed description of this method, which is called “analogy-based learning to rank” (able2rank) by the authors.

As an aside, note that an analogy-based approach as outlined above appears to be specifically suitable for transfer learning. This is mainly because the relation \(\mathcal {R}\) is evaluated separately for “source objects” a and b on the one side and “target objects” c and d on the other side, but never between sources and targets. In principle, one could even think of using different specifications of \(\mathcal {R}\) for the source and the target.

4 Analogy and Kernels

The core idea of our proposal is based on the observation that an analogical proportion, by definition, defines a kind of similarity between the relations of pairs of objects: According to (4), the analogical proportion \(a:b::c:d\) holds if \(\mathcal {R}(a,b)\) is similar to \(\mathcal {R}(c,d)\). The notion of similarity plays an important role in machine learning in general, and in kernel-based machine learning in particular. In fact, kernel functions can typically be interpreted in terms of similarity. Thus, a kernel-based approach appears to be a natural way to incorporate analogical reasoning into machine learning.

More specifically, to establish a connection between kernel-based machine learning and analogical reasoning, we make use of generalized (fuzzy) equivalence relations as a bridging concept. Fuzzy equivalences are weakened forms of standard equivalence relations, and hence capture the notion of similarity. Formally, a fuzzy equivalence relation E on a set \(\mathcal {X}\) is a fuzzy subset of \(\mathcal {X} \times \mathcal {X}\), that is, a function \(E:\, \mathcal {X}^2 \longrightarrow [0,1]\), which is reflexive, symmetric, and \(\top \)-transitive:

  • \(E(x,x) = 1\) for all \(x \in \mathcal {X}\),

  • \(E(x,y)=E(y,x)\) for all \(x, y \in \mathcal {X}\),

  • \(\top (E(x,y), E(y,z)) \le E(x,z)\) for all \(x, y, z \in \mathcal {X}\),

where \(\top \) is a triangular norm (t-norm), that is, a generalized logical conjunction. In our case, the relation E in (4) will play the role of a fuzzy equivalence. The detour via fuzzy equivalences is motivated by the result of [26], who proved that certain types of fuzzy equivalence relations satisfy the properties of a kernel function. Before elaborating on this idea in more detail, we briefly recall some basic concepts of kernel-based machine learning as needed for this paper. For a thorough discussion of kernel methods, see for instance [32, 33].

4.1 Kernels

Let \(\mathcal {X}\) be a nonempty set. A function \(k: \mathcal {X} \times \mathcal {X} \longrightarrow \mathbb {R}\) is a positive semi-definite kernel on \(\mathcal {X}\) iff it is symmetric, i.e., \(k(x,y) = k(y,x)\) for all \(x,y \in \mathcal {X}\), and positive semi-definite, i.e.,

$$\begin{aligned} \sum _{i=1}^n \sum _{j=1}^n c_i c_j k(x_i,x_j) \ge 0 \end{aligned}$$

for arbitrary n, arbitrary instances \(x_1, \ldots , x_n \in \mathcal {X}\) and arbitrary \(c_1, \ldots , c_n \in \mathbb {R}\). Given a kernel k on \(\mathcal {X}\), an important theorem by [24] implies the existence of a (Hilbert) space \(\mathcal {H}\) and a map \(\phi :\, \mathcal {X} \longrightarrow \mathcal {H}\), such that

$$ k(x,y) = \langle \phi (x) , \phi (y) \rangle $$

for all \(x,y \in \mathcal {X}\). Thus, computing the kernel \(k(x,y)\) in the original space \(\mathcal {X}\) is equivalent to mapping x and y to \(\mathcal {H}\) first, using the linearization or feature map \(\phi \), and combining them in terms of the inner product in that space afterward. This connection between a nonlinear combination of instances in the original space \(\mathcal {X}\) and a linear combination in the induced feature space \(\mathcal {H}\) provides the basis for the so-called “kernel trick”, which offers a systematic way to design nonlinear extensions of methods for learning linear models. The kernel trick has been applied to various methods and has given rise to many state-of-the-art machine learning algorithms, including support vector machines, kernel principal component analysis, and kernel Fisher discriminant analysis, amongst others [30, 31].
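
As a toy illustration of this correspondence (our own example, not part of the original exposition): for the homogeneous polynomial kernel of degree 2 on \(\mathbb {R}^2\), the feature map \(\phi \) can be written down explicitly.

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = <x, y>^2."""
    return float(np.dot(x, y)) ** 2

def phi(x):
    """Explicit feature map into R^3 for two-dimensional inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([0.3, 0.8]), np.array([0.5, 0.1])
# k(x, y) equals the inner product of the mapped instances in feature space.
print(np.isclose(poly2_kernel(x, y), float(np.dot(phi(x), phi(y)))))  # True
```

For the analogy kernel considered next, such a feature map is only guaranteed to exist; the kernel itself is evaluated directly in the original space.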

4.2 Analogical Proportions as Kernels

Our focus in this paper is the analogical proportion (5), which is a map \(v:\, [0,1]^4 \longrightarrow [0,1]\). In this case, the relation \(\mathcal {R}\) is the simple arithmetic difference \(\mathcal {R}(a,b)=a-b\), and the similarity relation E is defined as \(E(u, v) = 1- |u-v|\) if u and v have the same sign, and \(E(u, v) = 0\) otherwise. As an aside, we note that, strictly speaking, E thus defined is not a fuzzy equivalence relation. This is due to the thresholding in the case where \(\text {sign} (a-b) \ne \text {sign} (c-d)\). Without this thresholding, E would be a \(\top _{\!\!\L }\)-equivalence, where \(\top _{\!\!\L }\) is the Łukasiewicz t-norm \((\alpha ,\beta ) \mapsto \max (\alpha +\beta -1,0)\). For modeling analogy, however, setting E to 0 in the case where b deviates positively from a while d deviates negatively from c (or vice versa) appears reasonable.
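
As an illustration of this point, one can check the three properties of a fuzzy equivalence numerically for \(E(u,v) = 1-|u-v|\) with the Łukasiewicz t-norm, restricting the arguments to the unit interval (a small sketch; the grid-based check is our own simplification, not a proof):

```python
import itertools
import numpy as np

def E(u, v):
    """Similarity relation E(u, v) = 1 - |u - v| on [0, 1]."""
    return 1.0 - abs(u - v)

def t_lukasiewicz(x, y):
    """Lukasiewicz t-norm."""
    return max(x + y - 1.0, 0.0)

grid = np.linspace(0.0, 1.0, 11)
assert all(E(x, x) == 1.0 for x in grid)                                      # reflexivity
assert all(E(x, y) == E(y, x) for x, y in itertools.product(grid, repeat=2))  # symmetry
assert all(t_lukasiewicz(E(x, y), E(y, z)) <= E(x, z) + 1e-12                 # transitivity
           for x, y, z in itertools.product(grid, repeat=3))
print("E is a Lukasiewicz-equivalence on the sampled grid")
```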

We reinterpret v as defined above as a kernel function \(k:\, [0,1]^2 \times [0,1]^2 \longrightarrow [0,1]\) on \(\mathcal {X} = [0,1]^2\), i.e., a kernel on pairs of pairs of objects, which essentially means equating k with E:

$$\begin{aligned} k(a,b,c,d) = 1- |(a-b) - (c-d)| \end{aligned}$$
(6)

if \(\text {sign} (a-b) = \text {sign} (c-d)\), and 0 otherwise. In what follows, we show that the “analogy kernel” (6) does indeed define a proper kernel function. The first property to be fulfilled, namely symmetry, is obvious. Thus, it remains to show that k is also positive semi-definite, which is done in Theorem 1 below. As a preparation, we first recall the following lemma, which is proved by [26] as part of his Theorem 11.

Lemma 1

Let \(\mu _1, \ldots , \mu _n \in [0,1]\), \(n \in \mathbb {N}\), and the matrix M be defined by

$$ M^{(n)}_{i,j} = ( 1- | \mu _i - \mu _j | ) . $$

Then M has a non-negative determinant.

Theorem 1

The function \(k:\, [-1,1]^2 \longrightarrow [0,1]\) defined as

$$\begin{aligned} k(u,v) = {\left\{ \begin{array}{ll} 1 - |u-v| &{} \text {if } \text {sign}(u)=\text {sign}(v), \\ 0, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$

is a valid kernel.

Proof

It is easy to see that k is symmetric. Thus, it remains to show that it is positive semi-definite. To this end, it suffices to show that all principal minors of every kernel matrix produced by k are non-negative. Thus, consider \(\alpha _1, \ldots , \alpha _n \in [-1,1]\), \(n \in \mathbb {N}\), and the matrix K defined as

$$\begin{aligned} K^{(n)}_{i,j} = {\left\{ \begin{array}{ll} 1 - |\alpha _i-\alpha _j| ,&{} \text {if } \text {sign}(\alpha _i)=\text {sign}(\alpha _j), \\ 0, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(7)

We need to show that

$$ \det \bigg ( K^{(m)}_{i,j} \bigg ) \ge 0 , $$

for all \(1 \le m \le n\). Since a simultaneous permutation of rows and columns does not change the determinant, we can assume (without loss of generality) that the values \(\alpha _i\) are sorted in non-increasing order, i.e., \(\alpha _1 \ge \alpha _2 \ge \cdots \ge \alpha _n\); in particular, note that the positive \(\alpha _i\) will then precede all the negative ones. Thus, the matrix K takes the form of a block-diagonal matrix

$$ K = \begin{pmatrix} A &{} 0 \\ 0 &{} B \\ \end{pmatrix} , $$

in which the submatrix A contains the values of K for which \(\alpha _i, \alpha _j \in [0,1]\), and B contains the values of K where \(\alpha _i, \alpha _j\) are negative. According to Lemma 1, \(\det (A) \ge 0\). Moreover, since \(1-|u-v| = 1-|(-u)-(-v)|\) for \(u,v \in [0,1]\), the same lemma can also be applied to the submatrix B, hence \(\det (B) \ge 0\). Finally, we can exploit that

$$ \det (K) = \det (A) \det (B). $$

Since both matrices A and B have non-negative determinant, it follows that \(\det (K) \ge 0\), which completes the proof.
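
The result can also be checked empirically: the eigenvalues of Gram matrices produced by the analogy kernel should be (numerically) non-negative. A minimal sketch, with the random sampling as our own choice:

```python
import numpy as np

def k_analogy(u, v):
    """Analogy kernel on arguments u, v in [-1, 1], cf. Theorem 1."""
    if np.sign(u) != np.sign(v):
        return 0.0
    return 1.0 - abs(u - v)

rng = np.random.default_rng(0)
alphas = rng.uniform(-1.0, 1.0, size=50)
K = np.array([[k_analogy(u, v) for v in alphas] for u in alphas])
# All eigenvalues of the (symmetric) Gram matrix should be non-negative.
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```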

The class of kernel functions is closed under various operations, including addition and multiplication by a positive constant. This allows us to extend the analogy kernel from individual variables to feature vectors using the arithmetic mean as an aggregation function:

$$\begin{aligned} k_A( \varvec{a}, \varvec{b} , \varvec{c}, \varvec{d}) = \frac{1}{d} \sum _{i=1}^d k(a_i , b_i, c_i, d_i) . \end{aligned}$$
(8)

Furthermore, to allow for incorporating a certain degree of non-linearity, we make use of a homogeneous polynomial kernel of degree 2,

$$\begin{aligned} k_A'( \varvec{a}, \varvec{b} , \varvec{c}, \varvec{d} ) = \big ( k_A( \varvec{a}, \varvec{b} , \varvec{c}, \varvec{d} ) \big )^2 , \end{aligned}$$
(9)

which is again a valid kernel.
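
A self-contained sketch of the vector-valued kernels (8) and (9) (the function names are ours; the scalar kernel takes the differences \(a_i - b_i\) and \(c_i - d_i\) as its two arguments):

```python
import numpy as np

def k_analogy(u, v):
    """Scalar analogy kernel on differences u = a - b and v = c - d."""
    if np.sign(u) != np.sign(v):
        return 0.0
    return 1.0 - abs(u - v)

def k_A(a, b, c, d):
    """Analogy kernel (8): mean of attribute-wise analogy kernels."""
    return float(np.mean([k_analogy(ai - bi, ci - di)
                          for ai, bi, ci, di in zip(a, b, c, d)]))

def k_A_poly2(a, b, c, d):
    """Homogeneous polynomial variant (9) of degree 2."""
    return k_A(a, b, c, d) ** 2

a, b = np.array([0.2, 0.9]), np.array([0.4, 0.5])
c, d = np.array([0.1, 0.8]), np.array([0.3, 0.3])
print(k_A(a, b, c, d), k_A_poly2(a, b, c, d))  # 0.95 0.9025
```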

5 Analogy-Kernel-Based Object Ranking

Recall that, in the setting of learning to rank, we are given a set of training data of the form

$$ \mathcal {D} = \big \{ (Q_1, \pi _1) , \ldots , (Q_M, \pi _M) \big \} \ , $$

where each \(\pi _\ell \) defines a ranking of the set of objects \(Q_\ell \). If \(\varvec{z}_i , \varvec{z}_j \in Q_\ell \) and \(\pi _\ell (i) < \pi _\ell (j)\), then \(\varvec{z}_i \succ \varvec{z}_j\) has been observed as a preference. Our approach to object ranking based on the analogy kernel, AnKer-rank, comprises two main steps:

  • First, for each pair of objects \(\varvec{x}_i , \varvec{x}_j \in Q\), a degree of preference \(p_{i,j} \in [0,1]\) is derived from \(\mathcal {D}\). If these degrees are normalized such that \(p_{i,j} + p_{j,i} = 1\), they define a reciprocal preference relation

    $$\begin{aligned} P= \Big ( p_{i,j} \Big )_{1 \le i \ne j \le n} . \end{aligned}$$
    (10)
  • Second, the preference relation P is turned into a ranking \(\pi \) using a suitable ranking procedure.

Both steps will be explained in more detail further below.

5.1 Prediction of Pairwise Preferences

The first step of our proposed approach, prediction of pairwise preferences, is based on a reduction to binary classification. To this end, training data \(\mathcal {D}_{bin}\) is constructed as follows: Consider any preference \(\varvec{x}_i \succ \varvec{x}_j\) that can be extracted from the original training data \(\mathcal {D}\), i.e., from any of the rankings \(\pi _m\), \(m \in [M]\). Then \(\varvec{z}_{i,j}=(\varvec{x}_i, \varvec{x}_j)\) is a positive example for the binary problem (with label \(y_{i,j}=+1\)), and \(\varvec{z}_{j,i}=(\varvec{x}_j, \varvec{x}_i)\) is a negative example (with label \(y_{j,i}=-1\)). Since these examples essentially carry the same information, we only add one of them to \(\mathcal {D}_{bin}\). To keep a balance between positive and negative examples, the choice is simply made by flipping a fair coin.
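
A minimal sketch of this construction (the representation of rankings and the function names are our own choices; \(\pi \) is encoded as an array of positions, consistent with Sect. 2):

```python
import numpy as np

rng = np.random.default_rng(1)

def build_binary_data(D):
    """Construct D_bin from training data D = [(Q, pi), ...], where Q is an
    (n, d) array of items and pi[i] is the position of item Q[i]
    (smaller position = higher preference)."""
    D_bin = []
    for Q, pi in D:
        n = len(Q)
        for i in range(n):
            for j in range(i + 1, n):
                winner, loser = (i, j) if pi[i] < pi[j] else (j, i)
                if rng.random() < 0.5:   # fair coin: keep the positive example ...
                    D_bin.append(((Q[winner], Q[loser]), +1))
                else:                    # ... or the equivalent negative example
                    D_bin.append(((Q[loser], Q[winner]), -1))
    return D_bin
```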

Note that, for any pair of instances \((\varvec{a}, \varvec{b})\) and \((\varvec{c}, \varvec{d})\) in \(\mathcal {D}_{bin}\), the analogy kernel (8) is well-defined, i.e., \(k_A(\varvec{a}, \varvec{b}, \varvec{c}, \varvec{d})\) can be computed. Therefore, a binary predictor \(h_{bin}\) can be trained on \(\mathcal {D}_{bin}\) using any kernel-based classification method. We assume \(h_{bin}\) to produce predictions in the unit interval [0, 1], which can be achieved, for example, by means of support vector machines with a suitable post-processing such as Platt-scaling [28].

Now, consider any pair of objects \(\varvec{x}_i, \varvec{x}_j\) from a new query \(Q=\{ \varvec{x}_1, \ldots , \varvec{x}_n \}\). Again, the analogy kernel can be applied to this pair and any example from \(\mathcal {D}_{bin}\), so that a (binary) prediction for the preference between \(\varvec{x}_i\) and \(\varvec{x}_j\) can be derived from \(h_{bin}\). More specifically, querying this model with \(\varvec{z}_{i,j}=(\varvec{x}_i, \varvec{x}_j)\) yields a degree of support \(q_{i,j}=h_{bin}(\varvec{z}_{i,j})\) in favor of \(\varvec{x}_i \succ \varvec{x}_j\), while querying it with \(\varvec{z}_{j,i}=(\varvec{x}_j, \varvec{x}_i)\) yields a degree of support \(q_{j,i}=h_{bin}(\varvec{z}_{j,i})\) in favor of \(\varvec{x}_j \succ \varvec{x}_i\). As already said, we assume both degrees to be normalized within the range [0, 1], and define \(p_{i,j} = (1+q_{i,j}-q_{j,i})/2\) as an estimate for the probability of the preference \(\varvec{x}_i \succ \varvec{x}_j\). This estimate constitutes one of the entries in the preference relation (10).
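
Putting the pieces together, the pairwise degrees \(p_{i,j}\) could be computed roughly as follows. This is a simplified sketch, not the authors' exact implementation; it uses scikit-learn's SVC with a precomputed Gram matrix of the analogy kernel (9) and Platt scaling for probabilistic outputs, and the inlined kernel mirrors the sketch from Sect. 4.2:

```python
import numpy as np
from sklearn.svm import SVC

def k_A_poly2(a, b, c, d):
    """Analogy kernel (9) on feature vectors (cf. the sketch in Sect. 4.2)."""
    ks = [0.0 if np.sign(u) != np.sign(v) else 1.0 - abs(u - v)
          for u, v in zip(a - b, c - d)]
    return float(np.mean(ks)) ** 2

def gram(pairs_row, pairs_col):
    """Gram matrix of the analogy kernel between two lists of object pairs."""
    return np.array([[k_A_poly2(a, b, c, d) for (c, d) in pairs_col]
                     for (a, b) in pairs_row])

def pairwise_preferences(D_bin, Q, C=1.0):
    """Train h_bin on D_bin = [((x, x'), y), ...] and fill relation (10)."""
    train_pairs = [pair for pair, _ in D_bin]
    y = np.array([label for _, label in D_bin])
    h_bin = SVC(C=C, kernel="precomputed", probability=True)  # Platt scaling
    h_bin.fit(gram(train_pairs, train_pairs), y)

    n = len(Q)
    P = np.full((n, n), 0.5)
    for i in range(n):
        for j in range(i + 1, n):
            q_ij = h_bin.predict_proba(gram([(Q[i], Q[j])], train_pairs))[0, 1]
            q_ji = h_bin.predict_proba(gram([(Q[j], Q[i])], train_pairs))[0, 1]
            P[i, j] = (1.0 + q_ij - q_ji) / 2.0
            P[j, i] = 1.0 - P[i, j]
    return P
```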

5.2 Rank Aggregation

To turn pairwise preferences into a total order, we make use of a rank aggregation method. More specifically, we apply the Bradley-Terry-Luce (BTL) model, which is well-known in the literature on discrete choice [9]. It starts from the parametric model

$$\begin{aligned} \mathbf {P}(\varvec{x}_i \succ \varvec{x}_j) = \frac{\theta _i}{\theta _i + \theta _j} , \end{aligned}$$
(11)

where \(\theta _i, \theta _j \in \mathbb {R}_+\) are parameters representing the (latent) utility \(U(\varvec{x}_i)\) and \(U(\varvec{x}_j)\) of \(\varvec{x}_i\) and \(\varvec{x}_j\), respectively. Thus, according to the BTL model, the probability of observing a preference in favor of a choice alternative \(\varvec{x}_i\), when compared to any other alternative, is proportional to \(\theta _i\).

Given the preference relation (10), i.e., the entries \(p_{i,j}\) informing about the class probability of \(\varvec{x}_i \succ \varvec{x}_j\), the parameter \(\theta = (\theta _1, \ldots , \theta _n)\) can be estimated by likelihood maximization:

$$ \hat{\theta } \in \arg \max _{\theta \in \mathbb {R}_+^{n} } \prod _{1 \le i \ne j \le n} \left( \dfrac{\theta _{i}}{\theta _{i} + \theta _{j}} \right) ^{p_{i,j}} . $$

Finally, the predicted ranking \(\pi \) is obtained by sorting the items \(\varvec{x}_i\) in descending order of their estimated (latent) utilities \(\hat{\theta }_i\).
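
A minimal sketch of this step (the unconstrained parametrization \(\theta _i = \exp (\beta _i)\) and the use of a generic optimizer are our own implementation choices):

```python
import numpy as np
from scipy.optimize import minimize

def btl_ranking(P):
    """Estimate BTL utilities from the preference relation P, cf. (10), and
    return the item indices sorted from most to least preferred."""
    n = P.shape[0]

    def neg_log_likelihood(beta):
        theta = np.exp(beta)  # theta_i = exp(beta_i) > 0
        return -sum(P[i, j] * np.log(theta[i] / (theta[i] + theta[j]))
                    for i in range(n) for j in range(n) if i != j)

    beta_hat = minimize(neg_log_likelihood, x0=np.zeros(n)).x
    return np.argsort(-beta_hat)  # predicted ordering: best item first

# Example: item 0 dominates item 1, which dominates item 2.
P = np.array([[0.5, 0.9, 0.8], [0.1, 0.5, 0.7], [0.2, 0.3, 0.5]])
print(btl_ranking(P))  # [0 1 2]
```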

We note that many other rank aggregation techniques have been proposed in the literature and could in principle be used as well; see e.g. [2]. However, since BTL seems to perform very well, we did not consider any other method.

6 Experiments

To study the practical performance of our proposed method, we conducted experiments on several real-world data sets, essentially using the same setup as [1]. As baselines to compare with, we considered able2rank [1], expected rank regression (ERR) [21, 22], Ranking SVM (with linear kernel) [20] and RankNet [10].

Table 1. Properties of data sets.

6.1 Data

We used the same data sets as [1], which are collected from various domains (e.g., sports, education, tourism) and comprise different types of features (e.g., numeric, binary, ordinal). Table 1 provides a summary of the characteristics of the data sets. For a detailed description of the data, we refer the reader to the source paper. In addition, we include the rankings of the teams that participated in the men’s FIFA World Cup 2014 and 2018 (32 instances) as well as the under-17 edition in 2017 (22 instances) with respect to “goals statistics”. This data (Footnote 1) comprises 7 numeric features such as MatchesPlayed, GoalsFor, GoalsScored, etc.

6.2 Experimental Setup

For the analogy-based methods, an important pre-processing step is the normalization of the attributes in the feature representation \(\varvec{x}=(x_1, \ldots , x_d)\), because these attributes are assumed to take values in [0, 1]. To this end, we simply apply a linear rescaling

$$ x_k' \leftarrow \dfrac{x_k -\min _k}{ \max _k - \min _k } , $$

where \(\min _k\) and \(\max _k\) denote, respectively, the smallest and largest value of the kth feature in the data. This transformation is applied to the training data as well as the test data when a new query Q is received. Since the data from a new query is normally sparse, it might be better to take the minimum and maximum over the entire data, training and test. Yet, this strategy is not recommendable in case the test data has a different distribution. In fact, analogical inference is especially interesting for transfer learning (and indeed, in our experiments, training and test data are sometimes from different subdomains). Therefore, we first conduct a Kolmogorov-Smirnov test [23] to check whether the two parts of the data are drawn from the same distribution. In case the null hypothesis is rejected (at a significance level of \(\alpha = 0.05\)), normalization is conducted on the test data alone. Otherwise, the training data is additionally taken into account.
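
A sketch of this normalization logic, using scipy's two-sample Kolmogorov-Smirnov test; applying the test feature-wise is our reading of the procedure:

```python
import numpy as np
from scipy.stats import ks_2samp

def min_max_normalize_test(X_train, X_test, alpha=0.05):
    """Rescale each feature of X_test to [0, 1]. If the KS test rejects the
    hypothesis that training and test values share a distribution, min/max
    are taken from the test data alone; otherwise both parts are pooled."""
    X_norm = np.empty_like(X_test, dtype=float)
    for k in range(X_test.shape[1]):
        _, p_value = ks_2samp(X_train[:, k], X_test[:, k])
        values = (X_test[:, k] if p_value < alpha
                  else np.concatenate([X_train[:, k], X_test[:, k]]))
        lo, hi = values.min(), values.max()
        X_norm[:, k] = (X_test[:, k] - lo) / (hi - lo) if hi > lo else 0.0
    return X_norm
```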

We also apply a standard normalization for the other baseline methods (ERR, Ranking SVM and RankNet), transforming each real-valued feature by standardization:

$$ x \leftarrow \dfrac{x-\mu }{\sigma } , $$

where \(\mu \) and \(\sigma \) denote the empirical mean and standard deviation, respectively. Like for the analogy-based methods, a hypothesis test is conducted to decide whether the test data should be normalized separately or together with the training data.

The analogy kernel (9) was used for AnKer-rank. We tuned the cost parameter C of the SVM in an (internal) 2-fold cross-validation (repeated 3 times) on the training data. The search for C is guided by an algorithm (Footnote 2) proposed by [18], which computes the entire regularization path for the two-class SVM classifier (i.e., all possible values of C for which the solution changes) at a cost that is a small (\(\sim \)3) multiple of the cost of fitting a single model. The following RankNet parameters are adjusted using grid search and internal cross-validation: the number of units in the hidden layer (32, 64, 128, 256), the batch size (8, 16, 32), and the optimizer learning rate (0.001, 0.01, 0.1). Since the data sets are relatively small, the network was restricted to a single hidden layer.

Table 2. Results in terms of loss \(d_{RL}\) (averaged over 20 runs) on the test data.

6.3 Results

In our experiments, predictions were produced for a certain part \(D_{test}\) of the data, using another part \(D_{train}\) as training data; an experiment of that kind is denoted by \(D_{train} \rightarrow D_{test}\) and is considered for all possible combinations within each domain. The average ranking loss together with the standard deviation of the conducted experiments (repeated 20 times) is summarized in Table 2, where the numbers in parentheses indicate the rank of the achieved score in the respective problem. Moreover, the table shows average ranks per problem domain.

As can be seen, the relative performance of the methods depends on the domain. In any case, our proposed approach is quite competitive in terms of predictive accuracy, and essentially on a par with able2rank and Ranking SVM, whereas ERR and RankNet show worse performance.

7 Conclusion and Future Work

This paper elaborates on the connection between kernel-based machine learning and analogical reasoning in the context of preference learning. Building on the observation that analogical proportions define a kind of similarity between the relations of pairs of objects, and that kernel functions can be interpreted in terms of similarity, we utilize generalized (fuzzy) equivalence relations as a bridging concept to show that a particular type of analogical proportion defines a valid kernel function. We introduce the analogy kernel and advocate a concrete kernel-based approach for the problem of object ranking. First experimental results on real-world data from various domains are quite promising and suggest that our approach is competitive with state-of-the-art methods for object ranking.

By making analogical inference amenable to kernel methods, our paper depicts a broad spectrum of directions for future work. In particular, we plan to study kernel properties of other analogical proportions proposed in the literature (e.g., geometric proportions [3]).

Besides, various extensions in the direction of kernel-based methods are conceivable and highly interesting from the point of view of analogical reasoning. This includes the use of kernel-based methods other than SVM, techniques such as multiple kernel learning, etc. Last but not least, other types of applications, whether in preference learning or beyond, are also of interest.