1 Introduction

This contribution summarizes a tutorial talk which was meant as a first introduction to distance and prototype based machine learning techniques. Accordingly, our intention is not to give a complete overview of the field or to review all relevant literature. The paper may serve as a starting point for the interested reader to explore this practically relevant framework and active area of research.

The inference of classification schemes from previous observations, i.e. from labelled example data, is one of the core issues in machine learning [14]. A large variety of real world problems can be formulated as classification tasks. Examples include handwritten character recognition, medical diagnoses based on clinical data, pixel-wise segmentation and other image processing tasks, or fault detection in technical systems based on sensor data, to name only a few.

Throughout this contribution we assume that observations are given in terms of real-valued feature vectors in \(N\) dimensions. In general, the structure of the data can be more complex and may require modified approaches, for instance the pseudo-Euclidean embedding of relational data. For this and other extensions of the concepts presented here, we refer the reader to [5, 6] and references therein.

A variety of frameworks and training algorithms have been developed for the learning from examples, i.e. the data driven adaptation of parameters in the chosen classification model. They range from classical statistics based methods like Discriminant Analysis to the application of Multilayer Perceptrons or the prominent Support Vector Machine [14].

A particularly transparent approach is that of distance or similarity based classification [2, 3, 5]. Here, observations are directly compared with reference data or prototypes which have been determined in a training process from available examples. The similarity or, more correctly, dissimilarity is quantified in terms of a suitable distance measure. The choice of appropriate measures is the focus of this contribution. Most of the concepts discussed here can be applied in a much broader context, including supervised regression or the unsupervised clustering of data [5]. Here, however, we limit the discussion to clear-cut classification problems and the use of prototype or reference data based classifiers.

In the next section we discuss two classical methods: the k-Nearest-Neighbor (kNN) approach [2, 3, 7] and Kohonen’s Learning Vector Quantization (LVQ) [8, 9] which – in their simplest versions – employ standard Euclidean distance. Mainly in terms of LVQ we discuss how to extend the framework to more general distance measures in Sect. 3.1. The use of divergences for the classification of histograms serves as one example. Section 4 presents the elegant framework of Relevance Learning Vector Quantization as an example for the use of adaptive distance measures. We conclude with a brief summary in Sect. 5.

2 Simple Classifiers Based on Euclidean Distances

When dealing with \(N\)-dimensional feature vectors, the use of Euclidean metrics for their pairwise comparison seems natural. In the following we discuss two classical methods which employ this measure in their simplest versions.

2.1 Nearest-Neighbor Classifiers

Arguably the simplest and by far most popular distance based scheme for vectorial data is the k-Nearest-Neighbor (kNN) classifier [2, 3, 7]. In this classical approach, a given set of \(P\) vectors in \(N\)-dim. feature space is stored together with labels which indicate their known assignment to one of the \(C\) classes:

$$\begin{aligned} \left\{ \, \mathbf {x}^\mu , y(\mathbf {x}^\mu ) = y^\mu \, \right\} _{\mu =1}^P \text{ where } \mathbf {x}^\mu \in \mathbb {R}^N \text{ and } y^\mu \in \{1,2,\ldots ,C\}. \end{aligned}$$
(1)

An arbitrary feature vector \(\mathbf {x}\) is classified according to the distances from the reference samples. In the most basic 1NN scheme, its (squared) Euclidean distance

$$\begin{aligned} d(\mathbf {x},\mathbf {x}^\mu ) = \left( \mathbf {x} - \mathbf {x}^\mu \right) ^2 = \sum _{j=1}^N \, \left( x_j - x_j^\mu \right) ^2 \end{aligned}$$
(2)

from all reference samples \(\mathbf {x}^\mu \) is computed and the data point is assigned to the class of the Nearest Neighbor:

$$\begin{aligned} y(\mathbf {x})= y^* = y(\mathbf {x}^{*}) \text{ with } \mathbf {x}^* = \mathop {\text {argmin}}\limits _{\mathbf {x}^\mu } \left\{ d(\mathbf {x},\mathbf {x}^\mu ) \right\} _{\mu =1}^P. \end{aligned}$$
(3)

In the more general kNN classifier the assignment, Eq. (3), is replaced by a voting among the \(k\) closest reference vectors.

The kNN classifier is straightforward to implement and requires no further analysis of the example data in a training phase. Furthermore, theoretical considerations show that kNN schemes can achieve close to Bayes optimal classification if \(k\) is selected appropriately [2, 3]. As a consequence, the kNN classifier continues to be applied in a variety of practical contexts and often serves as a baseline for comparison with other methods. Figure 1 (left panel) illustrates the 1NN classifier for an artificial three class data set in two dimensions. The prescription (3) results in a piece-wise linear tessellation of feature space.
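
As a minimal illustration, the following NumPy sketch implements the kNN rule of Eqs. (1)–(3) with a simple majority vote over the \(k\) closest reference samples. The function and variable names as well as the toy data are our own and are not taken from any implementation referenced in this paper.

```python
import numpy as np

def knn_classify(x, X_ref, y_ref, k=1):
    """Assign x to the majority class among its k nearest reference samples,
    using the squared Euclidean distance of Eq. (2)."""
    d = np.sum((X_ref - x) ** 2, axis=1)          # distances to all P references
    nearest = np.argsort(d)[:k]                   # indices of the k closest samples
    labels, counts = np.unique(y_ref[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority vote (1NN for k=1)

# toy example: three reference points, two classes
X_ref = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_ref = np.array([1, 1, 2])
print(knn_classify(np.array([0.2, 0.1]), X_ref, y_ref, k=1))   # -> 1
```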

Fig. 1. Illustration of simple classification schemes based on Euclidean distance. Both panels display the same three class data set and decision boundaries represented by solid lines. Left: Nearest-Neighbor classifier. Right: Nearest Prototype classification; prototypes are marked by larger symbols as indicated by the legend.

Two major drawbacks of the approach are evident:

  (I) For large data sets, the method involves considerable computational effort in the working phase. The naive implementation of (3) requires the evaluation of \(P\) distances and the identification of their minimum for each novel classification. Although clever search and sorting strategies can reduce the computational complexity [3], the basic problem persists for large data sets.

  (II) More importantly, class boundaries can become very complex since every example is taken into account on an equal footing. The system is highly sensitive to single, potentially mislabelled examples or outliers. This bears the risk of over-fitting, i.e. the classifier can become too specific to the example set, which may result in poor generalization performance with respect to novel data. The effect is clearly mitigated when \(k\) neighbors are taken into account; however, too large a \(k\) can yield overly smooth boundaries.

Both problems suggest reducing the number of reference examples. The representation of the data set by a condensed set of examples was already considered in [10]. A variety of improved selection schemes have been proposed which aim at retaining relevant information contained in the data set, see e.g. [11] and references therein.

2.2 Learning Vector Quantization

Here we consider approaches which compute class representatives without restricting them to be elements of the training set. Each class is represented by at least one vector in a set of \(M\) labeled prototypes:

$$\begin{aligned} \left\{ \, \mathbf {w}^{j}, \, c^j \, \right\} _{j=1}^M \text{ where } \mathbf {w}^j \in \mathbb {R}^N \text{ and } c^j \in \{1,2,\ldots C\}. \end{aligned}$$
(4)

Together with the Euclidean measure, the prototypes parameterize piece-wise linear class boundaries. Similar to the 1NN classifier, a Nearest Prototype Scheme (NPC) assigns an arbitrary feature vector to class

$$\begin{aligned} y(\mathbf {x}) = c^* \text{ where } c^*\ \text {is the label of}\ \mathbf {w}^* = \mathop {\text {argmin}}\limits _{\mathbf {w}^j} \left\{ d(\mathbf {x},\mathbf {w}^j) \right\} _{j=1}^M. \end{aligned}$$
(5)

The term winner is frequently used for the closest prototype \(\mathbf {w}^*\) with respect to data point \(\mathbf {x}\). More sophisticated voting rules, probabilistic or soft schemes can be devised, but here we limit the discussion to crisp classifiers.
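
A Nearest Prototype Classifier in the sense of Eq. (5) can be sketched in a few lines of NumPy; again, the names and the toy prototypes below are our own illustration and not part of any referenced implementation.

```python
import numpy as np

def npc_classify(x, W, c):
    """Nearest Prototype Classification, Eq. (5): return the label of the
    closest ('winner') prototype under squared Euclidean distance."""
    d = np.sum((W - x) ** 2, axis=1)    # distances to all M prototypes
    return c[np.argmin(d)]              # label of the winner w^*

# toy example: one prototype per class
W = np.array([[0.0, 0.0], [2.0, 2.0]])   # prototype positions
c = np.array([1, 2])                     # prototype labels
print(npc_classify(np.array([0.4, 0.3]), W, c))   # -> 1
```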

The right panel of Fig. 1 illustrates the NPC scheme. The resulting decision boundaries are obviously much smoother than those of the corresponding 1NN classifier (left panel). The NPC scheme is less sensitive to details of the data set which is reflected by the fact that it misclassifies some of the training examples. In comparison to the 1NN scheme, this should result in superior generalization behavior in the presence of noisy examples and outliers.

Arguably the most attractive feature of prototype-based schemes is their interpretability [12]. Prototypes are defined in the feature space of the observations and, hence, can be inspected by domain experts in the same way as the sample data. This is in contrast to Multilayer Perceptrons or other model parameterizations which are less transparent. Moreover, prototypes should be, in a sense, typical of their classes. Thus, the concept is complementary to, for instance, the Support Vector Machine approach [4], which puts emphasis on atypical samples close to the decision boundaries.

LVQ prototypes are determined from the example data by more or less sophisticated training procedures. A conceptually simple idea for their initialization is to compute the class-conditional means, which appears promising when classes are represented by single, more or less spherical clusters. In more realistic situations, LVQ prototypes could be initialized at random in feature space. More reasonably, their initial positions could be determined by means of class-wise unsupervised competitive learning [13] prior to the actual supervised training.

In the following we present two prominent prototype-based, iterative training schemes from the family of Learning Vector Quantization algorithms.

Kohonen’s LVQ1. Kohonen’s original Learning Vector Quantization algorithm [8, 9, 13], known as LVQ1, constitutes an intuitive, heuristic procedure for the computation of prototypes. It is reminiscent of competitive learning for the purpose of unsupervised Vector Quantization [2].

In LVQ1, single training examples are presented, for instance in randomized order. Upon presentation of example \(\{\mathbf {x},y\}\), the currently closest prototype \(\mathbf {w}^*\) is determined in analogy to Eq. (5). Only the winner is updated according to

$$\begin{aligned} \mathbf {w}^* \, \leftarrow \, \mathbf {w}^* + \eta \,\,\, \varPsi (c^*,y) \,\, \, (\mathbf {x}-\mathbf {w}^*) \text{ where } \varPsi (c,y) = \left\{ \begin{array}{ll} +1 &{} \text{ if } c=y \\ -1 &{} \text{ if } c\ne y.\\ \end{array} \right. \end{aligned}$$
(6)

Here, the learning rate \(\eta >0\) controls the step size. Note that Eq. (6) could be re-written as

$$\begin{aligned} \mathbf {w}^* \, \leftarrow \, \mathbf {w}^* - \eta \,\,\, \varPsi (c^*,y) \,\, \frac{\partial }{\partial \mathbf {w}^*} \, \left[ \frac{1}{2} \, (\mathbf {x}-\mathbf {w}^*)^2 \right] , \end{aligned}$$
(7)

formally. The Winner Takes All (WTA) prescription moves the prototype \(\mathbf {w}^*\) closer to or away from the feature vector if the class labels agree or disagree, respectively. As a consequence, the sample \(\mathbf {x}\) – or other feature vectors in its vicinity – will be classified correctly with higher probability after the update. Intuitively, after repeated presentation of the data set, prototypes approach positions which should be typical for the corresponding classes.
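
As a minimal sketch of LVQ1 training according to Eq. (6), the following loop performs one sweep through the data, assuming prototypes have already been initialized (e.g. near the class-conditional means as discussed above). The function name and the fixed learning rate are our own illustrative choices.

```python
import numpy as np

def lvq1_epoch(W, c, X, y, eta=0.05, rng=None):
    """One sweep of LVQ1, Eq. (6): only the winner is updated, attracted if the
    labels agree and repelled otherwise. W, c: prototypes (float array) and
    their labels; X, y: training examples and their labels."""
    if rng is None:
        rng = np.random.default_rng(0)
    for mu in rng.permutation(len(X)):          # randomized presentation order
        d = np.sum((W - X[mu]) ** 2, axis=1)    # squared Euclidean distances
        j = np.argmin(d)                        # index of the winner
        psi = 1.0 if c[j] == y[mu] else -1.0    # label-dependent sign Psi
        W[j] += eta * psi * (X[mu] - W[j])      # WTA update of the winner only
    return W
```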

A number of variations of the basic scheme have been suggested in the literature, aiming at better generalization ability or more stable convergence behavior, e.g. [9, 14–17]. Several modifications update more than one prototype at a time, e.g. LVQ2.1 or LVQ3, or employ adaptive learning rates as for instance the so-called Optimized LVQ (OLVQ) [9]. However, the essential features of LVQ1 – competitive learning and label-dependent updates – are present in all versions of LVQ.

Generalized Learning Vector Quantization. Cost function based approaches [14–17] have attracted particular attention. First of all, convergence properties can be studied analytically in terms of their objective function. Moreover, cost functions allow for systematic extensions of the training schemes, for instance by including adaptive hyperparameters in the optimization [18, 19].

Here we focus on the so–called Generalized Learning Vector Quantization (GLVQ) as introduced by Sato and Yamada [14, 15]. The suggested cost function is given by a sum over examples:

$$\begin{aligned} E = \sum _{\mu =1}^P \, e^\mu \text{ with } e^\mu = \varPhi \left( \frac{ d(\mathbf {w}^J, \mathbf {x}^\mu ) - d(\mathbf {w}^K, \mathbf {x}^\mu )}{ d(\mathbf {w}^J, \mathbf {x}^\mu ) + d(\mathbf {w}^K, \mathbf {x}^\mu )} \right) , \end{aligned}$$
(8)

where \(\mathbf {w}^J\) and \(\mathbf {w}^K\) denote the closest correct and closest incorrect prototype, respectively, for a particular example \(\left\{ \mathbf {x}^\mu ,y^\mu \right\} \). Formally,

$$\begin{aligned} \mathbf {w}^J&= \mathop {\text {argmin}}\limits _{\mathbf {w}^j} \left\{ d(\mathbf {x}^\mu ,\mathbf {w}^j) \, \mid c^j = y^\mu \right\} _{j=1}^M \nonumber \\ \mathbf {w}^K&= \mathop {\text {argmin}}\limits _{\mathbf {w}^j} \left\{ d(\mathbf {x}^\mu ,\mathbf {w}^j) \, \mid c^j \ne y^\mu \right\} _{j=1}^M. \end{aligned}$$
(9)

Popular choices for the monotonically increasing function \(\varPhi (z)\) in Eq. (8) are the identity \(\varPhi (z)=z\) or a sigmoidal function \(\varPhi (z)=1/[1+\exp (-\gamma \, z)]\), where \(\gamma >0\) controls the steepness at the origin [14, 20]. The argument obeys \(-1 \le z \le 1\), and negative values \(z<0\) indicate that the corresponding training example is correctly classified. Note that for large \(\gamma \) the cost function approximates the number of misclassified training data, while for small steepness the minimization of \(E\) corresponds to maximizing the margin-like quantities \(\left[ d(\mathbf {w}^K, \mathbf {x}^\mu ) - d(\mathbf {w}^J, \mathbf {x}^\mu )\right] \).

One possible strategy to optimize \(E\) for a given data set is stochastic gradient descent based on single example presentation [1, 2, 21, 22]. The update step for the winning prototypes \(\mathbf {w}^J,\mathbf {w}^K\), given a particular example \(\{\mathbf {x},y\}\), reads

$$\begin{aligned} \mathbf {w}^J&\leftarrow \mathbf {w}^J - \eta \, \frac{\partial }{\partial \mathbf {w}^J} \,\, \varPhi (e)= \mathbf {w}^J \, + \eta \, \, \varPhi ^\prime (e) \, \frac{4 d_K}{(d_J+d_K)^2} \, \left( \mathbf {x} - \mathbf {w}^J\right) \nonumber \\ \mathbf {w}^K&\leftarrow \mathbf {w}^K - \eta \, \frac{\partial }{\partial \mathbf {w}^K} \, \varPhi (e) = \mathbf {w}^K - \eta \, \, \varPhi ^\prime (e) \, \frac{4 d_J}{(d_J+d_K)^2} \, \left( \mathbf {x} - \mathbf {w}^K\right) \end{aligned}$$
(10)

where the abbreviation \(d_L = d(\mathbf {w}^L,\mathbf {x})\) with \(L \in \{J,K\}\) for the squared Euclidean distances has been introduced.
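
The following NumPy sketch performs a single GLVQ update according to Eqs. (8)–(10), assuming squared Euclidean distance, at least one prototype per class, and the sigmoidal \(\varPhi \) mentioned above. All names and the chosen hyperparameter values are purely illustrative.

```python
import numpy as np

def glvq_update(W, c, x, y, eta=0.01, gamma=2.0):
    """Single-example GLVQ step, Eqs. (8)-(10), with squared Euclidean distance
    and Phi(z) = 1 / (1 + exp(-gamma * z))."""
    d = np.sum((W - x) ** 2, axis=1)
    J = np.where(c == y, d, np.inf).argmin()     # closest correct prototype
    K = np.where(c != y, d, np.inf).argmin()     # closest incorrect prototype
    dJ, dK = d[J], d[K]
    z = (dJ - dK) / (dJ + dK)                    # relative distance difference
    phi_prime = gamma * np.exp(-gamma * z) / (1 + np.exp(-gamma * z)) ** 2
    W[J] += eta * phi_prime * 4 * dK / (dJ + dK) ** 2 * (x - W[J])   # attract
    W[K] -= eta * phi_prime * 4 * dJ / (dJ + dK) ** 2 * (x - W[K])   # repel
    return W
```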

Note that in contrast to GLVQ, LVQ1 cannot be interpreted as a stochastic gradient descent, although Eq. (7) involves the gradient of \(d(\mathbf {w}^*,\mathbf {x})\). Formal integration yields the function

$$ \frac{1}{2} \, \sum _{\mu =1}^P \, \varPsi (c^*,y^\mu ) \, \left( \mathbf {x}^\mu - \mathbf {w}^* \right) ^2 $$

which is not differentiable at class borders. Crossing the decision boundary, a different prototype becomes the winner and the sign of \(\varPsi \) changes discontinuously.

3 Extensions to General Distance Measures

Occasionally it is argued that all distance based methods are bound to fail in high-dimensional feature space due to the so-called curse of dimensionality and the related concentration of norms, see [23] for a general discussion thereof. The problem is evident in the context of, e.g., density estimation or histogram based techniques. However, we would like to emphasize that the argument does not necessarily carry over to the comparison of distances. Consider, for instance, the difference of two squared Euclidean distances

$$\begin{aligned} d(\mathbf {x},\mathbf {x}^a) - d(\mathbf {x},\mathbf {x}^b) = 2\, \mathbf {x} \cdot (\mathbf {x}^b - \mathbf {x}^a) + ( {\mathbf {x}^{a}} )^2 - ({\mathbf {x}^{b}})^2 \end{aligned}$$
(11)

which involves the projection of \(\mathbf {x}\) into the low-dimensional subspace spanned by reference vectors \(\mathbf {x}^a,\mathbf {x}^b\). The concentration of norms suggests, indeed, that the last two terms approximately cancel each other in high dimensions, while the first remains non-trivial. Moreover, in the context of LVQ, \(\mathbf {x}^a,\mathbf {x}^b\) in (11) are replaced by prototypes which have been determined as low noise representatives of the data set.
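
For concreteness, the small NumPy check below verifies the identity (11) numerically and prints the size of the norm difference relative to the squared norms themselves. The dimensionality and the Gaussian toy data are our own choices, not taken from the paper.

```python
import numpy as np

# Minimal numerical check of the identity (11) for random vectors.
rng = np.random.default_rng(0)
N = 1000
x, xa, xb = rng.normal(size=(3, N))

lhs = np.sum((x - xa) ** 2) - np.sum((x - xb) ** 2)
rhs = 2 * x @ (xb - xa) + xa @ xa - xb @ xb
print(np.isclose(lhs, rhs))                    # True: both sides agree
print(abs(xa @ xa - xb @ xb) / (xa @ xa))      # relative size of the norm difference
```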

Euclidean distance appears to be a natural measure and is by far the most popular choice in practice. However, one should be aware that other measures may be more suitable, depending on the nature of the data set at hand [24]. Both the kNN and the LVQ framework facilitate the use of alternative distance measures in a rather straightforward fashion as outlined in the following.

3.1 Example Metrics and More General Measures

Frequently, distances \(d(\mathbf {x},\mathbf {y})\) are required to satisfy the metric properties

$$\begin{aligned} d(\mathbf {x},\mathbf {y}) =0 \, \Leftrightarrow \, \mathbf {x}=\mathbf {y}, \quad d(\mathbf {x},\mathbf {y}) = d(\mathbf {y},\mathbf {x}), \quad d(\mathbf {x},\mathbf {z}) \le d(\mathbf {x},\mathbf {y}) + d(\mathbf {y},\mathbf {z}). \end{aligned}$$
(12)

However, in the prototype based or kNN classification of a query \(\mathbf {x}\), these conditions can be relaxed. For example, the NPC with prototypes \(\{\mathbf {w}^j\}\) is still well defined with a non-symmetric measure as long as only one of the two choices, \(d(\mathbf {x},\mathbf {w}^j)\) or \(d(\mathbf {w}^j,\mathbf {x})\), is used consistently. Distances between different prototypes or between two data points are never considered explicitly in the scheme.

A large variety of distance measures can be employed for classification tasks. Discretized data, for instance, can be compared by means of the Hamming distance or more general string metrics. Specific measures have been devised for functional data where the order of the observed features is relevant, see [25, 26] for examples.

In the following we outline how, quite generally, differentiable distance measures can be made use of in LVQ schemes. Then we briefly discuss three example families of measures which constitute important alternatives to the standard Euclidean choice.

Incorporation of Differentiable Distances in LVQ Schemes. In the working phase of kNN or prototype based classification, essentially any meaningful distance measure can be employed which is appropriate for the problem at hand. An important restriction applies, however, if gradient based training schemes like LVQ1 or GLVQ are used which require that the underlying distance is differentiable. Under this condition, a general LVQ1-like update can be written as

$$\begin{aligned} \mathbf {w}^* \, \leftarrow \, \mathbf {w}^* - \eta \,\,\, \varPsi (c^*,y) \,\, \frac{\partial }{\partial \mathbf {w}^*} \, d(\mathbf {w}^*,\mathbf {x}) \end{aligned}$$
(13)

in analogy with Eq. (7).
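
The update (13) can be implemented generically by passing the distance measure together with its gradient as a function; the sketch below is our own illustration, in which the squared Euclidean case is recovered as a special instance (constant factors can be absorbed into the learning rate).

```python
import numpy as np

def lvq1_general_step(W, c, x, y, grad_d, eta=0.05):
    """LVQ1-like update with a general differentiable distance, Eq. (13).
    grad_d(w, x) must return the pair (d(w, x), dd/dw) for the chosen measure."""
    d = np.array([grad_d(w, x)[0] for w in W])
    j = int(np.argmin(d))                       # winner
    psi = 1.0 if c[j] == y else -1.0            # label-dependent sign Psi
    _, g = grad_d(W[j], x)
    W[j] -= eta * psi * g                       # gradient step on d(w^*, x)
    return W

# example: plug in the squared Euclidean distance and its gradient
sq_euclid = lambda w, x: (np.sum((x - w) ** 2), 2 * (w - x))
```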

Similarly, the Euclidean distance in the GLVQ cost function (8) can be replaced by a more general, differentiable measure, yielding the update

$$\begin{aligned} \mathbf {w}^J&\leftarrow \mathbf {w}^J - \eta \, \varPhi ^\prime (e) \, \frac{2 d_K}{(d_J+d_K)^2} \, \frac{\partial }{\partial \mathbf {w}^J } \, \, d(\mathbf {w}^J,\mathbf {x} ) \nonumber \\ \mathbf {w}^K&\leftarrow \mathbf {w}^K + \eta \, \, \varPhi ^\prime (e) \, \frac{2 d_J}{(d_J+d_K)^2 } \, \frac{\partial }{\partial \mathbf {w}^K } \, d(\mathbf {w}^K,\mathbf {x}) \end{aligned}$$
(14)

where the winners and all other quantities are defined as in (10). In the following we highlight a few families of differentiable distance measures which can be incorporated into LVQ in a straightforward way.

Minkowski Distances. A prominent class of distances corresponds to the so-called Minkowski measures

$$\begin{aligned} d_p (\mathbf {x},\mathbf {y}) = \left( \sum _{j=1}^N \, \left| x_j - y_j \right| ^p \right) ^{1/p} \end{aligned}$$
(15)

with \(p>0\) which includes the standard Euclidean distance for \(p=2\) or the so–called Manhattan metric for \(p=1\). Note that (15) is a metric only for \(p\ge 1\), while it violates (12) for \(p<1\). However, in the latter case, \(\left( d_p (\mathbf {x},\mathbf {y}) \right) ^p\) becomes a metric [27]. Note that the Euclidean distance can be determined using the inner product

$$\begin{aligned} \textstyle \left\langle \mathbf {x},\mathbf {y} \right\rangle = \sum _{j=1}^N x_j \cdot y_j \end{aligned}$$
(16)

by computing

$$\begin{aligned} d_2 (\mathbf {x},\mathbf {y}) = \sqrt{ \left\langle \mathbf {x},\mathbf {x} \right\rangle - 2 \cdot \left\langle \mathbf {x},\mathbf {y} \right\rangle + \left\langle \mathbf {y},\mathbf {y} \right\rangle }. \end{aligned}$$
(17)

For \(p \ne 2\) and \(p \ge 1\), an analogous calculation can be done using semi-inner products [28, 29]. The use of Minkowski metrics with \(p\ne 2\) has proven advantageous in several practical applications, e.g. [30, 31], and can be combined with appropriate dimensionality reduction schemes such as principal component analysis (PCA) [32, 33]. Minkowski distances are either differentiable or can be replaced by differentiable approximations, see [27] and references therein. Figure 2 illustrates the influence of the parameter \(p\) in (15). It displays the unit circles in two dimensions corresponding to different Minkowski distances.
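
A direct NumPy implementation of Eq. (15) reads as follows; the loop reproduces the \(p\)-values of Fig. 2 for a single, arbitrarily chosen pair of points (our own toy example).

```python
import numpy as np

def minkowski(x, y, p=2.0):
    """Minkowski distance d_p of Eq. (15); a metric for p >= 1."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for p in (0.5, 1, 2, 10):
    print(p, minkowski(x, y, p))   # compare with the unit-circle shapes of Fig. 2
```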

Fig. 2. Unit circles corresponding to Minkowski metrics, Eq. (15), in two dimensions with, from left to right, \(p=0.5\), \(p=1\) (Manhattan), \(p=2\) (Euclidean), and \(p=10\).

Divergences. In many practical problems, properties of the data are represented by histograms. Prominent examples are the characterization of images by color histograms or the bag of words representation for texts. In other domains, intensity spectra or other non-negative and normalizable functional data represent the objects of interest [34]. A large variety of statistical divergences are tailored for the comparison of positive measures or probability densities. Arguably, the non-symmetric Kullback-Leibler divergence is the most prominent example [35]. Here we exemplify the concept in terms of the symmetric Cauchy-Schwarz divergence

$$\begin{aligned} d_{CS} (\mathbf {x},\mathbf {y}) = \frac{1}{2} \, \log \left[ \left\langle \mathbf {x},\mathbf {x} \right\rangle \cdot \left\langle \mathbf {y},\mathbf {y} \right\rangle \right] - \log \left[ \left\langle \mathbf {x},\mathbf {y} \right\rangle \right] \end{aligned}$$
(18)

which is obviously differentiable [36]. Figure 3 illustrates how \(d_{CS}\) differs from the standard Euclidean distance: Three normalized \(50\)-bin histograms are displayed which satisfy \( (\mathbf {x}^a-\mathbf {x}^b)^2 = (\mathbf {x}^c-\mathbf {x}^b)^2 \). However, according to the Cauchy-Schwarz measure, \(d_{CS}(\mathbf {x}^a,\mathbf {x}^b) \approx 1/2 \, d_{CS}(\mathbf {x}^c,\mathbf {x}^b)\), implying that the single peak \(\mathbf {x}^a\) is considered to be closer to the broad unimodal \(\mathbf {x}^b\) than the double peak histogram \(\mathbf {x}^c\).
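
The Cauchy-Schwarz divergence (18) amounts to a one-line computation for non-negative histograms with strictly positive overlap \(\left\langle \mathbf {x},\mathbf {y} \right\rangle \); the toy histograms in the sketch below are our own and are not those of Fig. 3.

```python
import numpy as np

def cauchy_schwarz_divergence(x, y):
    """Cauchy-Schwarz divergence of Eq. (18) for non-negative histograms."""
    return 0.5 * np.log((x @ x) * (y @ y)) - np.log(x @ y)

# two normalized 5-bin toy histograms
a = np.array([0.7, 0.2, 0.05, 0.03, 0.02])
b = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
print(cauchy_schwarz_divergence(a, b))
```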

The incorporation of symmetric and non-symmetric, differentiable divergences into GLVQ training and classification is introduced in [37]. As an application example, the detection of Mosaic Disease in Cassava plants based on various image histograms is discussed there.

Kernel Distances. Kernel distances [38] can also be incorporated in prototype based learning and classification approaches, see e.g. [39, 40]. The so-called kernel trick consists of an implicit, in general non-linear, mapping to a potentially infinite dimensional space. This mapping space is equipped with an inner product which can be calculated from original data in terms of a so–called kernel \(\kappa \left( \mathbf {x},\mathbf {y} \right) \) [41, 42]. The corresponding kernel distance is calculated as

$$\begin{aligned} d_{\kappa } (\mathbf {x},\mathbf {y}) = \sqrt{ \kappa \left( \mathbf {x},\mathbf {x} \right) - 2 \cdot \kappa \left( \mathbf {x},\mathbf {y} \right) + \kappa \left( \mathbf {y},\mathbf {y} \right) } \end{aligned}$$
(19)

in complete analogy to the inner product based Euclidean distance calculation (17). A famous example is the Gaussian kernel

$$\begin{aligned} \kappa _G \left( \mathbf {x},\mathbf {y} \right) = \exp \, \left( \frac{-\left( d_2\left( \mathbf {x},\mathbf {y} \right) \right) ^2}{2\sigma ^2} \right) \end{aligned}$$
(20)

with the kernel width \(\sigma \).

The application of kernel distances frequently translates complex, non-linearly separable classification tasks into easier, linearly separable problems [41], as demonstrated for, e.g., image based face recognition in [43]. For gradient based LVQ schemes, the kernel distance has to be differentiable, which in turn requires a differentiable kernel \(\kappa \left( \mathbf {x},\mathbf {y} \right) \) [44].
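
The kernel distance (19) with the Gaussian kernel (20) can be sketched as follows; function names and the example vectors are our own illustration.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel of Eq. (20) with kernel width sigma."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def kernel_distance(x, y, kernel=gaussian_kernel):
    """Kernel distance of Eq. (19), computed from kernel evaluations only."""
    return np.sqrt(kernel(x, x) - 2 * kernel(x, y) + kernel(y, y))

x, y = np.array([0.0, 0.0]), np.array([1.0, 2.0])
print(kernel_distance(x, y, lambda u, v: gaussian_kernel(u, v, sigma=2.0)))
```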

Fig. 3. Three normalized histograms \(\mathbf {x}^a,\mathbf {x}^b,\mathbf {x}^c\) with \(50\) bins each, compared pair-wise in terms of the Euclidean distance and the Cauchy-Schwarz divergence, cf. Eq. (18), as discussed in Sect. 3.1.

4 Adaptive Distances and Relevance LVQ

In an ideal situation, insight into the problem suggests the use of a specific, fixed distance measure. Very often, however, prior knowledge is limited and only a suitable parametric form of the distance can be specified. In Relevance Learning, a particularly elegant extension of LVQ, the corresponding parameters are adapted in the same data driven training process that identifies the prototypes.

4.1 Matrix Relevance LVQ

In the following we discuss Matrix Relevance LVQ as an extension of the basic Euclidean scheme [20]. An obvious problem of the standard measure is that all dimensions are taken into account on the same footing. First of all, some of the features may be very noisy and potentially corrupt the classifier. Furthermore, features can be correlated or scale very differently. Euclidean or other pre-defined measures are sensitive to rescaling and more general linear transformations of the features. Consequently, their naive use can be problematic in practice. Matrix Relevance LVQ in its simplest form addresses these problems by using a generalized quadratic distance of the form

$$\begin{aligned} d(\mathbf {x},\mathbf {w}) = \left( \mathbf {x}-\mathbf {w}\right) ^\top \, \varLambda \, \left( \mathbf {x}-\mathbf {w}\right) \text{ with } \varLambda = \varOmega ^\top \varOmega \text{ where } \varLambda ,\varOmega \in \mathbb {R}^{N\times N}. \end{aligned}$$
(21)

Here, the parameterization of \(\varLambda =\varOmega ^\top \varOmega \) as a square guarantees that the measure is positive semi-definite, i.e. \(d(\mathbf {x},\mathbf {w})\ge 0\) for all \(\mathbf {x},\mathbf {w}\).

The elements of the matrix \(\varOmega \) are considered adaptive quantities in the training process. The distance (21) is differentiable with respect to \(\mathbf {w}\) and \(\varOmega \):

$$\begin{aligned} \frac{\partial d(\mathbf {w},\mathbf {x})}{\partial \mathbf {w}} = 2\, \varOmega ^\top \varOmega \, (\mathbf {w}-\mathbf {x}), \qquad \frac{\partial d(\mathbf {w},\mathbf {x})}{\partial \varOmega } = 2\, \varOmega \, (\mathbf {w}-\mathbf {x})(\mathbf {w}-\mathbf {x})^\top \end{aligned}$$
(22)

which facilitates gradient based updates of prototypes and distance measure. In the corresponding extension of LVQ1-like updates, the WTA prototype update (13) is combined with

$$\begin{aligned} \varOmega \, \leftarrow \, \varOmega - \eta _\varOmega \,\,\, \varPsi (c^*,y) \,\, \frac{\partial }{\partial \varOmega } \, d(\mathbf {w}^*,\mathbf {x}). \end{aligned}$$
(23)

Generalized Matrix Relevance LVQ (GMLVQ) updates \(\varOmega \) according to

$$\begin{aligned} \varOmega \, \leftarrow \varOmega - \eta _\varOmega \, \varPhi ^\prime (e) \, \left( \frac{2 d_K}{(d_J+d_K)^2 } \frac{\partial \, d(\mathbf {w}^J,\mathbf {x}) }{\partial \varOmega } -\ \frac{2 d_J}{(d_J+d_K)^2 } \frac{\partial \, d(\mathbf {w}^K,\mathbf {x})}{\partial \varOmega } \right) \end{aligned}$$
(24)

together with the prototype updates (14). Both (23) and (24) can be followed by an explicit normalization which enforces \(\sum _{ij}\varOmega _{ij}^2=1\). The matrix learning rate \(\eta _\varOmega \) is frequently chosen smaller than that of the prototype updates. We refer the reader to [20, 45] for details, the full form of the updates, and a discussion of their variants.
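
A minimal single-example GMLVQ step, combining the prototype updates (14) with the matrix update (24) for \(\varPhi (z)=z\) and the gradients (22), could be sketched as follows. Constant factors can be absorbed into the learning rates; all names, the learning rate values, and the placement of the normalization step are our own illustrative choices and not the reference implementation of the toolbox [51].

```python
import numpy as np

def gmlvq_step(W, c, Omega, x, y, eta_w=0.01, eta_om=0.001):
    """One GMLVQ step with Phi(z) = z: prototype updates as in (14) and the
    Omega update (24), using the quadratic distance (21) and its gradients (22)."""
    Lam = Omega.T @ Omega
    diffs = W - x                                     # shape (M, N)
    d = np.einsum('mi,ij,mj->m', diffs, Lam, diffs)   # d(x, w^j) for all j
    J = np.where(c == y, d, np.inf).argmin()          # closest correct prototype
    K = np.where(c != y, d, np.inf).argmin()          # closest incorrect prototype
    dJ, dK = d[J], d[K]
    gJ, gK = 2 * dK / (dJ + dK) ** 2, 2 * dJ / (dJ + dK) ** 2
    # gradients of the distance w.r.t. prototypes and Omega, cf. Eq. (22)
    dwJ, dwK = 2 * Lam @ diffs[J], 2 * Lam @ diffs[K]
    dOmJ = 2 * Omega @ np.outer(diffs[J], diffs[J])
    dOmK = 2 * Omega @ np.outer(diffs[K], diffs[K])
    W[J] -= eta_w * gJ * dwJ                          # attract correct winner
    W[K] += eta_w * gK * dwK                          # repel incorrect winner
    Omega -= eta_om * (gJ * dOmJ - gK * dOmK)         # relevance matrix update (24)
    Omega /= np.sqrt(np.sum(Omega ** 2))              # enforce sum_ij Omega_ij^2 = 1
    return W, Omega
```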

Note that the above constitutes only the simplest version of matrix relevance learning. A number of non-trivial variations have been suggested, including the use of prototype- or class-specific localized matrices which yield piece-wise quadratic decision boundaries in feature space [20]. Rectangular matrices \(\varOmega \) can be employed in order to avoid the adaptation of \(\mathcal{O} (N^2)\) degrees of freedom for high-dimensional data sets [45]. They also facilitate the discriminative low-dimensional representation or visualization of labeled data sets [45, 46]. The restriction to diagonal matrices \(\varOmega \) and \(\varLambda \) reduces the scheme to a weighting of single features, which had been introduced earlier as RLVQ [47] and GRLVQ [48], respectively. Formally, the Euclidean LVQ versions are recovered by setting \(\varOmega \) proportional to the \(N\)-dimensional identity matrix.

Similar parameterized distance measures have been used in the context of various classification frameworks. For instance, the cost function based optimization of a quadratic distance (21) can be integrated in an extended kNN approach as introduced in [49], see also references therein. As another example we would like to mention Radial Basis Function networks [1] which, in combination with relevance learning, have recently been applied to problems of vital importance [50].

Fig. 4. Left: ROC curves as obtained by GLVQ (dashed), GRLVQ (dotted), and GMLVQ (solid line) with respect to the detection of malignant ACC, see Sect. 4.3. Right: Visualization of the data set, displaying projections onto the leading eigenvectors of \(\varLambda \). In addition to malignant ACC (triangles) and benign ACA (circles), healthy control data (crosses) are displayed. Prototypes for ACA and ACC are marked by filled symbols.

A Matlab toolbox, "Relevance and Matrix adaptation in Learning Vector Quantization", which includes GMLVQ and a number of variants, is available at the website [51].

4.2 Interpretation of the Relevance Matrix

It is instructive to note that the quadratic distance (21) can be rewritten as \(d(\mathbf {w},\mathbf {x})= \left[ \varOmega ( \mathbf {w}-\mathbf {x} )\right] ^2\), implying that plain Euclidean distance is applied after a linear transformation of feature vectors and prototypes. The transformation can account for the above mentioned problems of noisy or correlated features by assigning weights to single dimensions and pairs of features, respectively. Note that the diagonal element \(\varLambda _{jj} = \sum _{i} \varOmega _{ij}^2\) quantifies the total contribution of the original feature dimension \(j\) to the linear combinations \(\left[ \varOmega (\mathbf {w}-\mathbf {x})\right] _i\).

The direct interpretation of \(\varLambda _{jj}\) as the relevance of feature \(j\) for the classification is only justified if the different features are typically of the same order of magnitude. This can be achieved, for instance, by a z-score transformation in a preprocessing step, such that \(\sum _{\mu } x_j^\mu /P =0\) and \(\sum _{\mu } (x_j^\mu )^2/P=1\) for every feature \(j\). Additional measures have to be taken in the presence of strongly correlated or linearly dependent features, see [12] for a detailed discussion of the interpretation of \(\varLambda \) and related regularization techniques.

It is instructive to note that, given \(\varLambda \), a continuum of matrices \(\varOmega \) satisfies \(\varOmega ^\top \varOmega = \varLambda \). However, this does not pose a problem, since the ambiguity reflects invariances of the distance measure with respect to reflections or rotations of the data. For convenience, e.g. when comparing the results of different training processes, one can resort to a canonical representation of \(\varOmega \) in terms of the eigenvectors of \(\varLambda \), see [12] for a more detailed discussion.
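
As a sketch of the preprocessing and interpretation steps discussed above, the following helper functions (our own names, not part of the referenced toolbox) compute the z-score transformation, the diagonal relevances \(\varLambda _{jj}\), and the leading eigenvectors of \(\varLambda \) which can be used for discriminative visualization as in Sect. 4.3.

```python
import numpy as np

def zscore(X):
    """Z-score transformation: zero mean, unit variance per feature (Sect. 4.2).
    Assumes no feature is constant (non-zero standard deviation)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def relevances_and_projection(Omega, n_dims=2):
    """Diagonal relevances Lambda_jj = sum_i Omega_ij^2 and the n_dims leading
    eigenvectors of Lambda = Omega^T Omega (columns of the returned matrix)."""
    Lam = Omega.T @ Omega
    rel = np.diag(Lam)                        # per-feature relevance Lambda_jj
    eigval, eigvec = np.linalg.eigh(Lam)      # eigenvalues in ascending order
    proj = eigvec[:, ::-1][:, :n_dims]        # leading eigenvectors first
    return rel, proj

# usage sketch: rel, P = relevances_and_projection(Omega); X2d = zscore(X) @ P
```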

4.3 Example Application: Classification of Adrenal Tumors

We briefly illustrate the GMLVQ approach in terms of a recent medical application [52, 53]. Tumors of the adrenal gland occur in an estimated 1–2 % of the population and are mostly found incidentally. The non-invasive differentiation between malignant Adrenocortical Carcinoma (ACC) and benign Adenomas (ACA) constitutes a diagnostic challenge of great significance. To this end, a panel of 32 steroid biomarkers, produced by the adrenal gland, has been suggested in [52], where details are given. The 24-hour excretion of these steroids was analysed retrospectively for a number of example patients with confirmed diagnosis. Preprocessing and normalization steps are also detailed in [52, 53]. The available data set was analysed by means of GMLVQ in its simplest setting, employing one \(32\)-dim. prototype per class and an adaptive \(\varOmega \in \mathbb {R}^{32\times 32}\).

Standard validation procedures, for details see [52, 53], were used to demonstrate that the resulting classifier achieves very good sensitivity (true positive rate) and specificity (1 − false positive rate) with respect to the detection of malignancy. The obtained Receiver Operating Characteristic (ROC) curve [54] is shown in Fig. 4 (left panel). For comparison, the ROC is also displayed for simple GLVQ using the plain Euclidean distance in \(\mathbb {R}^{32}\) and for a system restricted to an adaptive, diagonal \(\varLambda \), which corresponds to GRLVQ [48]. Evidently, relevance learning and in particular the matrix scheme improve the performance significantly over the use of the naive Euclidean distance.

The resulting relevance matrix, see [53], shows that a few of the steroid markers play a dominant role in the classification as marked by large diagonal elements \(\varLambda _{jj}\). Based on these results, a reduced panel of \(9\) markers was proposed in [52]. This reduction facilitates an efficient technical realization of the method, while the performance is essentially retained. The method constitutes a promising tool for the sensitive and specific differential diagnosis of ACC in clinical practice [52].

An additional feature of matrix relevance learning becomes apparent in this application example. Typically, relevance matrices become low rank in the course of training. Theoretical considerations which support this general, empirical finding are presented in [55]. As a consequence, the dominating eigenvectors of the relevance matrix can be used for the discriminative visualization of the labelled examples. Figure 4 (right panel) displays the projections of all ACA and ACC data and the obtained prototypes on the first two eigenvectors of \( \varLambda \). In addition, healthy control data is displayed which was not explicitly used in the training process. The example demonstrates how the combination of prototype based and relevance learning can provide novel insight and facilitates fruitful discussions with the domain experts. For a similar application of GMLVQ in a different medical context, see [56].

5 Summary

This contribution provides only a first introduction to distance based classification schemes. To a large extent, the discussion is presented in terms of two classical approaches: the k-Nearest-Neighbor classifier and Kohonen’s Learning Vector Quantization. The latter requires a training phase which tunes the classifier according to available training data. Examples for heuristic and cost function based training prescriptions are given. Mainly in the context of LVQ the use of generalized dissimilarity measures is discussed, which go beyond the standard choice of Euclidean distance. Relevance Learning is presented as an extension of LVQ, which makes use of adaptive distances. Their data driven optimization can be integrated naturally in the LVQ training process. As an example, matrix relevance learning is briefly presented and illustrated in terms of an application example in the medical domain.

This article and the suggested references can merely serve as a starting point for the interested reader. It is far from giving a complete overview of this fascinating area of ongoing fundamental and application oriented research.