
1 Introduction

The development of novel technologies for biomedical research and clinical practice has led to an impressive increase in the amount and complexity of electronically available data. Large amounts of potentially high-dimensional data are available from different imaging platforms, genomics, proteomics and other omics techniques, or longitudinal studies of large patient cohorts. At the same time, there is a clear trend towards personalized medicine in complex diseases such as cancer or heart disorders.

As a consequence, an ever-increasing need for powerful automated data analysis is observed. Machine Learning can provide efficient tools for tasks including problems of unsupervised learning, e.g. in the context of clustering, and supervised learning for classification and diagnosis, regression, risk assessment or outcome prediction.

In biomedical and more general life science applications, it is particularly important that algorithms provide white box solutions. For instance, the criteria which determine the outcome of a particular diagnosis system or recommendation scheme should be transparent to the user. On the one hand, this increases the acceptance of automated systems among practitioners. In basic research, on the other hand, interpretable systems may provide novel insights into the nature of the problem at hand.

Prototype-based classifiers constitute a powerful family of tools for supervised data analysis. These systems are parameterized in terms of class-specific representatives in the original feature space and, therefore, facilitate direct interpretation of the classifiers. In addition, prototype-based systems can be further enhanced by the data-driven optimization of adaptive distance measures. The framework of relevance learning increases the flexibility of the approaches significantly and can provide important insights into the role of the considered features.

In Sect. 2, the basic concepts of prototype-based classification are introduced with emphasis on the framework of Learning Vector Quantization (LVQ). The use of standard and unconventional distances is briefly discussed before relevance learning is introduced in Sect. 2.4, with emphasis on the so-called Generalized Matrix Relevance LVQ (GMLVQ) presented in Sect. 2.5. Section 3 presents the application of GMLVQ in several relevant biomedical problems, before a brief summary is given in Sect. 4.

2 Distance-based Classification and Prototypes

Here, a brief review of distance based systems is provided. First, the concepts of Nearest Prototype Classifiers and Learning Vector Quantization (LVQ) are presented in Sects. 2.1 and 2.2. The presentation focusses on their relation to the classical Nearest Neighbor classifier. In Sect. 2.3 examples of non-standard distance measures are briefly discussed. Eventually, adaptive dissimilarities in the framework of relevance learning are introduced in Sect. 2.4.

2.1 Nearest Prototype Classifiers

Similarity based schemes constitute an important and successful framework for the supervised training of classifiers in machine learning [10, 14, 31, 55]. The basic idea of comparing observations with a set of reference data is at the core of the classical Nearest-Neighbor (NN) or, more generally, k-Nearest-Neighbor (kNN) scheme [14, 31, 55, 66]. This very popular approach is easy to implement and serves as an important baseline for the evaluation of alternative algorithms.

A given set of P feature vectors and associated class labels

$$\begin{aligned} \mathcal {D} \, = \, \left\{ \mathbf {x}^\mu , \, y^\mu \right\} _{\mu =1}^{P} ~~~ \mathrm{{with}}~~ \mathbf {x}^\mu \in \mathbb {R}^N ~~\mathrm{{and}}~~ y^\mu = y(\mathbf {x}^\mu ) \end{aligned}$$
(1)

is stored as a reference set. An arbitrary feature vector or query \(\mathbf {x}\) is then classified according to its similarity to the reference samples: The vector \(\mathbf {x}\) is assigned to the class of its Nearest Neighbor in \(\mathcal {D}\). Very frequently, the (squared) Euclidean distance \(d(\mathbf {x},\mathbf {x}^\mu )= (\mathbf {x}-\mathbf {x}^\mu )^2\) is employed for the comparison. The more general kNN classifier determines the majority class membership among the k closest samples. Figure 1(a) illustrates the concept in terms of the NN-classifier.

While kNN classification is very intuitive and does not require an explicit training phase, an essential drawback is obvious: For large data sets \(\mathcal {D}\), storage needs are significant and, moreover, computing and sorting all distances \(d(\mathbf {x},\mathbf {x}^\mu )\) becomes costly, even if sophisticated bookkeeping and sorting strategies are employed. Most importantly, NN or kNN classifiers tend to realize very complex decision boundaries which may be subject to over-fitting effects, because all reference samples are taken into account explicitly, cf. Fig. 1(a).
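
As a minimal illustration, the following Python/NumPy sketch implements the kNN rule described above. The array names (X_ref, y_ref) and the restriction to the squared Euclidean distance are illustrative choices, not part of the original formulation.

```python
import numpy as np

def knn_predict(x, X_ref, y_ref, k=1):
    """Assign x to the majority class among its k nearest reference samples,
    using the squared Euclidean distance; k=1 recovers the plain NN rule."""
    d = np.sum((X_ref - x) ** 2, axis=1)          # distances to all P reference vectors
    nearest = np.argsort(d)[:k]                   # indices of the k closest samples
    labels, counts = np.unique(y_ref[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority vote among the k neighbors
```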

Fig. 1. Illustration of Nearest-Neighbor classification (panel a) and Nearest-Prototype classification in LVQ (panel b). The same two-dimensional data set with three different classes (marked by squares, diamonds and pentagrams) is shown in both panels. Piecewise linear decision boundaries, based on Euclidean distance, are shown for the NN classifier in (a), while panel (b) corresponds to an NPC with prototypes marked by large symbols.

These particular difficulties of kNN schemes motivated the idea, put forward already in [30], to replace the complete set of exemplars by a few representative prototypes \(\mathbf {w}^k \in \mathbb {R}^N\) with \(k=1,2,\ldots ,K\). Learning Vector Quantization (LVQ) as a principled approach to the identification of suitable prototypes was suggested by Kohonen [35, 37]. The prototypes carry fixed labels \(y^k=y(\mathbf {w}^k)\) indicating which class they represent. Obviously, the LVQ system should comprise at least one prototype per class.

Originally, LVQ was motivated as an approximate realization of a Bayes classifier with the prototypes serving as a robust, simplified representation of class-conditional densities [35, 37, 67]. Ideally, prototypes constitute typical representatives of the classes, see [26] for a detailed discussion of this property. Recent reviews of prototype based systems in general and LVQ in particular can be found in [11, 41, 53, 67].

A Nearest Prototype Classifier (NPC) assigns any feature vector \(\mathbf {x}\) to the class \(y^{*}=y(\mathbf {w}^*)\) of the closest prototype \(\mathbf {w}^* (\mathbf {x})\), or \(\mathbf {w}^*\) for short, which satisfies

$$\begin{aligned} d(\mathbf {w}^*,\mathbf {x}) \le d(\mathbf {w}^j,\mathbf {x}) ~~ \mathrm{{for}}~~ j=1,2,\ldots K. \end{aligned}$$
(2)

Assuming that meaningful prototype positions have been determined from a given data set \(\mathcal {D}\), an NPC scheme based on Euclidean distance also implements piece-wise linear class boundaries. However, since usually \(K\ll P\), these are much smoother than in an NN or kNN scheme and the resulting classifier is less specific to the training data. Moreover, the NPC requires only the computation and ranking of K distances \(d(\mathbf {w}^j,\mathbf {x})\). Figure 1(b) illustrates the NPC scheme as parameterized by a few prototypes and employing Euclidean distance for the same data set as shown in panel (a).
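
A corresponding sketch of the NPC decision rule of Eq. (2) could look as follows; only the K prototype distances are evaluated, and function and variable names are again illustrative.

```python
import numpy as np

def npc_predict(x, W, y_proto):
    """Nearest Prototype Classifier, cf. Eq. (2): only the K prototype distances
    are computed and the label of the closest prototype w* is returned."""
    d = np.sum((W - x) ** 2, axis=1)   # squared Euclidean distances to all K prototypes
    return y_proto[np.argmin(d)]       # class y* = y(w*) of the winner
```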

In binary problems with classes A and B, a bias can be introduced by modifying the NPC scheme: A data point \(\mathbf {x}\) is assigned to class A if

$$\begin{aligned} d(\mathbf {w}^{A},\mathbf {x}) \le d(\mathbf {w}^{B},\mathbf {x}) + \varTheta \end{aligned}$$
(3)

and to class B, else. Here, \(\mathbf {w}^{A}\) and \(\mathbf {w}^{B}\) denote the closest prototypes carrying label A or B, respectively. The threshold \(\varTheta \) can be varied from large negative to large positive values, yielding the true positive rate (sensitivity) and the false positive rate (1-specificity) as functions of \(\varTheta \). Hence, the full Receiver Operating Characteristic (ROC) can be determined [22].
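
The following hypothetical helper illustrates how sweeping the threshold \(\varTheta \) of Eq. (3) yields the points of the ROC; it assumes that the distances to the closest A- and B-prototypes have already been computed for a set of labeled samples.

```python
import numpy as np

def roc_curve_npc(d_A, d_B, is_class_A, thetas):
    """Sweep the bias Theta of Eq. (3): a sample is assigned to class A
    whenever d_A <= d_B + Theta.  Returns false and true positive rates."""
    fpr, tpr = [], []
    for theta in thetas:
        pred_A = d_A <= d_B + theta
        tpr.append(np.sum(pred_A & is_class_A) / np.sum(is_class_A))    # sensitivity
        fpr.append(np.sum(pred_A & ~is_class_A) / np.sum(~is_class_A))  # 1 - specificity
    return np.array(fpr), np.array(tpr)
```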

2.2 Learning Vector Quantization

A variety of schemes have been suggested for the iterative identification of LVQ prototypes from a given dataset. Kohonen’s basic LVQ1 algorithm [35] already comprises the essential ingredients of most modifications which were suggested later. It is conceptually very similar to unsupervised competitive learning [14] but explicitly takes class membership information into account.

Upon presentation of a single feature vector \(\mathbf {x}^\mu \) with class label \(y^\mu = y(\mathbf {x}^\mu )\), the currently closest prototype, i.e. the so-called winner \(\mathbf {w}^*=\mathbf {w}^*(\mathbf {x}^\mu )\) is identified according to condition (2). The Winner-Takes-All (WTA) update of LVQ1 leaves all other prototypes unchanged:

$$\begin{aligned} \mathbf {w}^* \leftarrow \mathbf {w}^* \, + \, \eta _w \, \, \varPsi (y^*,y^\mu ) \,\, \left( \mathbf {x}^\mu - \mathbf {w}^* \right) ~~~ \mathrm{{with}}~~ \varPsi (y,\tilde{y}) = \left\{ \begin{array}{ll} +1 &{} \mathrm{{if}}~~ y=\tilde{y} \\ -1 &{} \mathrm{{else.}} \end{array} \right. \end{aligned}$$
(4)

Hence, the winning prototype is moved even closer to \(\mathbf {x}^\mu \) if both carry the same class label: \(y^*=y^\mu \Rightarrow \varPsi =+1\). If the prototype is meant to represent a different class, it is moved further away \((\varPsi =-1)\) from the feature vector. The learning rate \(\eta _w\) controls the step size of the prototype updates.

All examples in \(\mathcal {D}\) are presented repeatedly, for instance in random sequential order. A possible initialization is to set prototypes identical to randomly selected feature vectors from their class or close to the class-conditional means.
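
A possible realization of one LVQ1 training epoch with the WTA update of Eq. (4) is sketched below; the learning rate, the initialization and the in-place update of the prototype array are illustrative choices rather than a prescribed implementation.

```python
import numpy as np

def lvq1_epoch(X, y, W, y_proto, eta_w=0.01, rng=None):
    """One epoch of LVQ1: the Winner-Takes-All update of Eq. (4),
    with the training examples presented in random sequential order."""
    rng = np.random.default_rng() if rng is None else rng
    for mu in rng.permutation(len(X)):
        d = np.sum((W - X[mu]) ** 2, axis=1)        # distances to all prototypes
        j = int(np.argmin(d))                       # winner w*
        psi = 1.0 if y_proto[j] == y[mu] else -1.0  # attraction (+1) or repulsion (-1)
        W[j] += eta_w * psi * (X[mu] - W[j])        # WTA update, all others unchanged
    return W
```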

Several modifications of the basic scheme have been considered in the literature, aiming at better generalization ability or convergence properties, see [7, 36, 53] for examples and further references.

LVQ1 and many other modifications cannot be formulated as the optimization of a suitable objective function in a straightforward way [59]. However, several cost function based LVQ schemes have been proposed in the literature [58, 59, 67]. A popular example is the so-called Generalized Learning Vector Quantization (GLVQ) as introduced by Sato and Yamada [59]. The suggested cost function is given as a sum over all examples in \(\mathcal {D}\):

$$\begin{aligned} E \, = \, \sum _{\mu =1}^P \, \varPhi (e^\mu ) ~~~~ \mathrm{{with}}~~ e^\mu \, = \frac{ d(\mathbf {w}^J, \mathbf {x}^\mu ) - d(\mathbf {w}^K, \mathbf {x}^\mu )}{ d(\mathbf {w}^J, \mathbf {x}^\mu ) + d(\mathbf {w}^K, \mathbf {x}^\mu )}. \end{aligned}$$
(5)

For a given \(\mathbf {x}^\mu \), \(\mathbf {w}^J\) represents the closest correct prototype, i.e. the closest prototype carrying the label \(y(\mathbf {w}^J)=y^\mu \), while \(\mathbf {w}^K\) is the closest incorrect prototype with \(y(\mathbf {w}^K) \ne y^\mu \). A monotonically increasing function \(\varPhi (e^\mu )\) specifies the contribution of a given example in dependence of the respective distances \(d(\mathbf {w}^J, \mathbf {x}^\mu )\) and \(d(\mathbf {w}^K, \mathbf {x}^\mu )\). Frequent choices are the identity \(\varPhi (e^\mu )=e^\mu \) and the sigmoidal \(\varPhi (e^\mu )=1/[1+\exp (-\gamma \, e^\mu )]\), where \(\gamma >0\) controls the steepness [59]. Note that \(e^\mu \) in Eq. (5) satisfies \(-1 \le e^\mu \le 1\). The misclassification of a particular sample is indicated by \(e^\mu >0\), while negative \(e^\mu \) correspond to correctly classified training data. As a consequence, the cost function can be interpreted as approximating the number of misclassified samples for large \(\gamma \), i.e. for steep \(\varPhi \).

Since E is differentiable with respect to the prototype components, gradient based methods can be used to minimize the objective function for a given data set in the training phase. The popular stochastic gradient descent (SGD) is based on the repeated, random sequential presentation of single examples [14, 17, 31, 56].

The SGD updates of the correct and incorrect winner for a given example \(\{\mathbf {x},y(\mathbf {x}) \}\) read

$$\begin{aligned} \mathbf {w}^J\leftarrow & {} \mathbf {w}^J \, - \eta _w \, \frac{\partial }{\partial \mathbf {w}^J} \,\, \varPhi (e) \,\,\, = \, \mathbf {w}^J \, - \eta _w \, \, \varPhi ^\prime (e) \, \frac{2 d_K}{(d_J+d_K)^2} \,\, \frac{\partial d_J}{\partial \mathbf {w}^J}, \nonumber \\ \mathbf {w}^K\leftarrow & {} \mathbf {w}^K - \eta _w \, \frac{\partial }{\partial \mathbf {w}^K} \,\, \varPhi (e) \, = \, \mathbf {w}^K \, + \eta _w \, \, \varPhi ^\prime (e) \, \frac{2 d_J}{(d_J+d_K)^2} \,\, \frac{\partial d_K}{\partial \mathbf {w}^K} \end{aligned}$$
(6)

where the abbreviation \(d_L = d(\mathbf {w}^L,\mathbf {x})\) is used. For the squared Euclidean distance we have \(\partial d_L / \partial \mathbf {w}^L = -2\, (\mathbf {x}-\mathbf {w}^L)\). Hence, the displacement of the correct winner is along \(+(\mathbf {x}-\mathbf {w}^J)\) and the update of the incorrect winner is along \(-(\mathbf {x}-\mathbf {w}^K)\), very similar to the attraction and repulsion in LVQ1. However, in GLVQ, both winners are updated simultaneously.
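
For illustration, a single stochastic gradient step of GLVQ with squared Euclidean distance and sigmoidal \(\varPhi \), cf. Eqs. (5) and (6), might be realized as in the following sketch; the identification of \(\mathbf {w}^J\) and \(\mathbf {w}^K\) via a masked argmin and the default parameter values are illustrative choices.

```python
import numpy as np

def glvq_sgd_step(x, y, W, y_proto, eta_w=0.01, gamma=2.0):
    """Single SGD step of GLVQ with squared Euclidean distance, cf. Eqs. (5), (6),
    using the sigmoidal Phi(e) = 1 / (1 + exp(-gamma * e))."""
    d = np.sum((W - x) ** 2, axis=1)
    correct = (y_proto == y)
    J = int(np.argmin(np.where(correct, d, np.inf)))    # closest correct prototype w^J
    K = int(np.argmin(np.where(~correct, d, np.inf)))   # closest incorrect prototype w^K
    dJ, dK = d[J], d[K]
    e = (dJ - dK) / (dJ + dK)
    phi_prime = gamma * np.exp(-gamma * e) / (1.0 + np.exp(-gamma * e)) ** 2
    denom = (dJ + dK) ** 2
    # d d_L / d w^L = -2 (x - w^L) for the squared Euclidean distance
    W[J] += eta_w * phi_prime * (2 * dK / denom) * 2 * (x - W[J])
    W[K] -= eta_w * phi_prime * (2 * dJ / denom) * 2 * (x - W[K])
    return W
```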

Theoretical studies of stochastic gradient descent suggest the use of time-dependent learning rates \(\eta _w\) following suitable schedules in order to achieve convergent behavior of the training process, see [17, 56] for mathematical conditions and example schedules. Alternatively, automated procedures can be employed which adapt the learning rate in the course of training, see for instance [34, 65]. Methods for adaptive step size control have also been devised for batch gradient versions of GLVQ, employing the full gradient in each step, see e.g. [40, 54].

Alternative cost functions have been considered for the training of LVQ systems, see, for instance, [57, 58] for a likelihood based approach. Other objective functions focus on the generative aspect of LVQ [26], or aim at the optimization of the classifier’s ROC [68].

2.3 Alternative Distances

Although very popular, the use of the standard Euclidean distance is frequently not further justified. It can even lead to inferior performance compared with problem specific dissimilarity measures which might, for instance, take domain knowledge into account.

A large variety of meaningful measures can be considered to quantify the dissimilarity of N-dim. vectors. Here, we mention only briefly a few important alternatives to Euclidean metrics. A more detailed discussion and further examples can be found in [9, 11, 29], see also references therein.

The family of Minkowski distances of the form

$$\begin{aligned} d_p(\mathbf {x},\mathbf {y}) \, = \, \left[ \, \sum _{j=1}^N \, \left| x_j - y_j \right| ^p \right] ^{1/p} \end{aligned}$$
(7)

provides an important set of alternatives [39]. They fulfill metric properties (for \(p\ge 1\)) and Euclidean distance is recovered with \(p=2\). Employing Minkowski distances with \(p\ne 2\) has proven advantageous in several practical applications, see for instance [4, 25, 69].

A different class of more general measures is based on the observation that the Euclidean distance can be written as

$$\begin{aligned} d_2(\mathbf {x},\mathbf {y}) = \left[ (\mathbf {x} \cdot \mathbf {x}) - 2 \mathbf {x} \cdot \mathbf {y} + (\mathbf {y}\cdot \mathbf {y})\right] ^{1/2}. \end{aligned}$$
(8)

Replacing inner products of the form \(\mathbf {a}\cdot \mathbf {b} = \sum _j a_j b_j\) by a suitable kernel function \(\kappa (\mathbf {a},\mathbf {b})\), one obtains so-called kernelized distances [63, 64]. In analogy to the kernel-trick used in the Support Vector Machine [64], kernelized distances can be used to implicitly transform non-separable complex data to simpler problems in a higher-dimensional space, see [60] for a discussion in the context of GLVQ.

A very popular dissimilarity measure that takes statistical properties of the data into account explicitly, was suggested very early by Mahalanobis [42]. The point-wise version

$$\begin{aligned} d_M (\mathbf {x},\mathbf {y}) \, = \, \left[ (\mathbf {x}-\mathbf {y})^\top C^{-1} \, (\mathbf {x}-\mathbf {y}) \right] ^{1/2} \end{aligned}$$
(9)

employs the (empirical) covariance matrix C of the data set for the comparison of two particular feature vectors. The Mahalanobis distance is widely used in the context of the unsupervised and supervised analysis of given data sets, see [55] for a more detailed discussion.

As a last example we mention statistical divergences which can be used when observations are described in terms of densities or histograms. For instance, text can be characterized by word counts while color histograms are often used to summarize properties of images. In such cases, the comparison of sample data amounts to evaluating the dissimilarity of histograms. A variety of statistical divergences is suitable for this task [20]. The non-symmetric Kullback-Leibler divergence [55] constitutes a well-known measure for the comparison of densities. An example of a symmetric dissimilarity is the so-called Cauchy-Schwarz divergence [20]:

$$\begin{aligned} d_{CS} (\mathbf {x},\mathbf {y})= 1/2 \, \log \left[ (\mathbf {x}\cdot \mathbf {x} ) (\mathbf {y}\cdot \mathbf {y}) \right] - \log \left[ \mathbf {x}\cdot \mathbf {y}\right] . \end{aligned}$$
(10)

It can be interpreted as a special case of more general \(\gamma \)-divergences, see [20, 50].
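
Simple reference implementations of the dissimilarities of Eqs. (7), (9) and (10) are sketched below; they assume plain NumPy arrays and, for the Cauchy-Schwarz divergence, non-negative histogram-like vectors with non-vanishing inner products.

```python
import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance of Eq. (7); p=2 is the Euclidean, p=1 the Manhattan case."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, y, C_inv):
    """Mahalanobis distance of Eq. (9); C_inv is the inverse empirical covariance matrix."""
    diff = x - y
    return np.sqrt(diff @ C_inv @ diff)

def cauchy_schwarz(x, y):
    """Cauchy-Schwarz divergence of Eq. (10) for non-negative, histogram-like vectors."""
    return 0.5 * np.log(np.dot(x, x) * np.dot(y, y)) - np.log(np.dot(x, y))
```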

In LVQ, meaningful dissimilarities do not necessarily have to satisfy metric properties. Unlike the kNN approach, LVQ classification does not rely on the pair-wise comparison of data points. A non-symmetric measure \(d(\mathbf {w},\mathbf {x})\ne d(\mathbf {x},\mathbf {w}) \) can be employed for the comparison of prototypes and data points as long as one version is used consistently in the winner identification, the update steps, and the actual classification after training [50].

In cost function based GLVQ, cf. Eq. (5), it is straightforward to replace the squared Euclidean by more general, suitable differentiable measures \(d(\mathbf {w},\mathbf {x})\). Similarly, LVQ1-like updates can be devised by replacing the term \((\mathbf {w}-\mathbf {x})\) in Eq. (4) by \(1/2 \, \partial d(\mathbf {w},\mathbf {x})/\partial \mathbf {w}\). Obviously, the winner identification has to make use of the same distance measure in order to be consistent with the update.

It is also possible to extend gradient-based LVQ to non-differentiable distance measures like the Manhattan distance with \(p=1\) in Eq. (7), if differentiable approximations are available [39]. Furthermore, the concepts of LVQ can be transferred to more general settings, where data sets do not comprise real-valued feature vectors in an N-dimensional Euclidean space [41]. Methods for classification problems where only pair-wise dissimilarity information is available, can be found in [27, 52], for instance.

2.4 Adaptive Distances and Relevance Learning

The choice of a suitable distance measure constitutes a key step in the design of a prototype-based classifier. It usually requires domain knowledge and insight into the problem at hand. In this context, Relevance Learning constitutes a very elegant and powerful conceptual extension of distance based classification. The idea is to fix only the basic form of the dissimilarity a priori and to optimize its parameters in the training phase.

2.5 Generalized Matrix Relevance Learning

As an important example of this strategy we consider here the replacement of standard Euclidean distance by the more general quadratic form

$$\begin{aligned} d_\varLambda (\mathbf {x},\mathbf {w}) \, = \, \left( \mathbf {x}-\mathbf {w}\right) ^\top \, \varLambda \, \left( \mathbf {x}-\mathbf {w}\right) \, = \, \sum _{i,j=1}^N \, (x_i - w_i) \, \varLambda _{ij} \, (x_j-w_j). \end{aligned}$$
(11)

While the measure is formally reminiscent of the Mahalanobis distance defined in Eq. (9), it is important to note that \(\varLambda \) cannot be directly computed from the data. On the contrary, its elements are considered adaptive parameters in the training process, as outlined below.

Note that Euclidean distance is recovered by setting \(\varLambda \) proportional to the N-dim. identity matrix. A restriction to diagonal matrices \(\varLambda \) corresponds to the original formulation of relevance LVQ, which was introduced as RLVQ or GRLVQ in [16] and [28] respectively. There, each feature is weighted by a single adaptive factor in the distance measure.

Measures of the form (11) have been employed in various classification schemes [15, 32, 70, 71]. Here we focus on the so-called Generalized Matrix Relevance LVQ (GMLVQ), which was introduced and extended in [18, 61, 62]. Applications from the biomedical and other domains are discussed in Sect. 3.

As a minimal requirement, \(d_\varLambda (\mathbf {x},\mathbf {w}) \ge 0\) should hold true for all \(\mathbf {x},\mathbf {w}\in \mathbb {R}^N\). This can be guaranteed by assuming a re-parameterization of the form

$$\begin{aligned} \varLambda = \, \varOmega ^\top \varOmega , ~~\mathrm{{i.e.}}~~ d_\varLambda (\mathbf {x},\mathbf {w}) \, = \, \left[ \varOmega \, \left( \mathbf {x}-\mathbf {w} \right) \right] ^2 \end{aligned}$$
(12)

with the auxiliary matrix \(\varOmega \in \mathbb {R}^{M\times N}\). It also implies the symmetries \(\varLambda _{ij}=\varLambda _{ji}\) and \(d_\varLambda (\mathbf {x},\mathbf {w})=d_\varLambda (\mathbf {w},\mathbf {x})\). Frequently, a normalization \(\sum _{i}\varLambda _{ii} = \sum _{ij} \varOmega _{ij}^2 = 1\) is imposed in order to avoid numerical problems.

According to Eq. (12), \(d_{\varLambda }\) corresponds to conventional Euclidean distance after a linear transformation of all data and prototypes. The transformation matrix \(\varOmega \) can be \((M\times N)\)-dimensional, in general, where \(M<N\) corresponds to a low-dimensional intrinsic representation of the original feature vectors. Note that, even for \(M=N\), the matrix \(\varLambda \) can become singular and \(d_\varLambda \) is only a pseudo-metric in \(\mathbb {R}^N\): for instance, \(d_\varLambda (\mathbf {x},\mathbf {y})=0\) is possible for \( \mathbf {x}\ne \mathbf {y}\).
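
In code, the re-parameterized distance of Eq. (12) reduces to a squared Euclidean norm after applying \(\varOmega \), as in the following sketch; a rectangular \(\varOmega \) of shape \(M\times N\) is admissible.

```python
import numpy as np

def d_lambda(x, w, Omega):
    """Adaptive quadratic distance of Eqs. (11), (12): d = [Omega (x - w)]^2.
    Omega may be rectangular (M x N); Lambda = Omega^T Omega then has rank <= M."""
    z = Omega @ (x - w)     # linear transformation of the difference vector
    return float(z @ z)     # squared Euclidean norm in the transformed space
```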

In the training process, all elements of the matrix \(\varOmega \) are considered adaptive quantities. From Eq. (12) we obtain the derivatives

$$\begin{aligned} \frac{\partial d_\varLambda (\mathbf {w},\mathbf {x})}{\partial \mathbf {w}} \, = \, \varOmega ^\top \varOmega \, (\mathbf {w}-\mathbf {x}), \qquad \frac{\partial d_\varLambda (\mathbf {w},\mathbf {x})}{\partial \varOmega } \, = \, \varOmega \, (\mathbf {w}-\mathbf {x})(\mathbf {w}-\mathbf {x})^\top \end{aligned}$$
(13)

which can be used to construct heuristic updates along the lines of LVQ1 [8, 11, 41]. From the GLVQ cost function, cf. Eq. (5), one obtains the matrix update

$$\begin{aligned} \varOmega \, \leftarrow \varOmega - \eta _\varOmega \, \varPhi ^\prime (e) \, \left( \frac{2 d_\varLambda ^K}{(d_\varLambda ^J+d_\varLambda ^K)^2 } \frac{\partial \, d_\varLambda (\mathbf {w}^J,\mathbf {x}) }{\partial \varOmega } \, - \, \frac{2 d_\varLambda ^J}{(d_\varLambda ^J+d_\varLambda ^K)^2 } \frac{\partial \, d_\varLambda (\mathbf {w}^K,\mathbf {x})}{\partial \varOmega } \right) \end{aligned}$$
(14)

which can be followed by a normalization step achieving \(\sum _{ij}\varOmega _{ij}^2=1\). Prototypes are updated as given in Eq. (6) with the gradient terms replaced according to Eq. (13). The matrix learning rate is frequently chosen smaller than that of the prototype updates: \(\eta _\varOmega < \eta _w\); details can be found in [18, 61]. The matrix can be initialized by, for instance, drawing independent random elements or by setting it proportional to the N-dim. identity matrix for \(M=N\).
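
The following sketch combines Eqs. (6), (13) and (14) into a single stochastic GMLVQ step with identity \(\varPhi \); constant factors are absorbed into the learning rates and all names are illustrative, so this is a schematic outline under these assumptions rather than the reference implementation of [18, 61].

```python
import numpy as np

def gmlvq_sgd_step(x, y, W, y_proto, Omega, eta_w=0.01, eta_om=0.001):
    """Single stochastic GMLVQ step: prototype updates as in Eq. (6) with the
    gradients of Eq. (13) and the matrix update of Eq. (14), identity Phi(e) = e.
    Constant factors are absorbed into the learning rates eta_w and eta_om."""
    Lam = Omega.T @ Omega                                   # Lambda = Omega^T Omega
    diffs = W - x                                           # rows: (w^j - x)
    d = np.einsum('ij,jk,ik->i', diffs, Lam, diffs)         # d_Lambda(w^j, x) for all j
    correct = (y_proto == y)
    J = int(np.argmin(np.where(correct, d, np.inf)))        # closest correct prototype
    K = int(np.argmin(np.where(~correct, d, np.inf)))       # closest incorrect prototype
    dJ, dK = d[J], d[K]
    gJ = 2 * dK / (dJ + dK) ** 2                            # weight of the d_J gradient
    gK = 2 * dJ / (dJ + dK) ** 2                            # weight of the d_K gradient
    diffJ, diffK = W[J] - x, W[K] - x

    # prototype updates, Eq. (6) with  d d_Lambda / d w = Omega^T Omega (w - x)
    W[J] -= eta_w * gJ * (Lam @ diffJ)
    W[K] += eta_w * gK * (Lam @ diffK)

    # matrix update, Eq. (14) with  d d_Lambda / d Omega = Omega (w - x)(w - x)^T
    Omega -= eta_om * (gJ * (Omega @ np.outer(diffJ, diffJ))
                       - gK * (Omega @ np.outer(diffK, diffK)))
    Omega /= np.sqrt(np.sum(Omega ** 2))                    # normalization sum_ij Omega_ij^2 = 1
    return W, Omega
```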

In the measure (11), the diagonal elements of \(\varLambda \) quantify the weight of single features in the distance. After training, the inspection of the relevance matrix can provide valuable insights into the structure of the data set; examples are discussed in Sect. 3. Off-diagonal elements correspond to the contribution of pairs of features to \(d_\varLambda \) and their adaptation enables the system to cope with correlations and dependencies between the features. Note that, strictly speaking, this heuristic interpretation of \(\varLambda \) is only justified if all features are of the same order of magnitude. In any given data set, this can be achieved by applying a z-score transformation, yielding features with zero mean and unit variance. Alternatively, potentially different magnitudes of the features could be taken into account after training by rescaling the elements of \(\varLambda \) accordingly.

2.6 Related Schemes and Variants of GMLVQ

Adaptive distance measures of the form (11) have been considered in several realizations of distance based classifiers. For example, Weinberger et al. optimize a quadratic form in the context of nearest neighbor classification [70, 71]. An explicit construction of a relevance matrix from a given data set is suggested and discussed in [15], while the gradient based optimization of an alternative cost function is presented in [32].

Localized versions of the distance (11) have been considered in [18, 61, 71]. In GMLVQ, it is possible to assign an individual relevance matrix \(\varLambda ^j\) to each \(\mathbf {w}^j\) or to devise class-wise matrices. Details and the corresponding modified update rules can be found in [18, 61]. While this can enhance the classification performance significantly in complex problems, we restrict the discussion to the simplest case of one global measure of the form (11).

The GMLVQ algorithm displays an intrinsic tendency to yield singular relevance matrices which are dominated by a few eigenvectors corresponding to the leading eigenvalues. This effect has been observed empirically in real world applications and benchmark data sets, see [41, 61] for examples. Moreover, a mathematical investigation of stationarity conditions explains this typical property of GMLVQ systems [8]. Very often, the effect allows for an interpretable visualization of the labeled data set in terms of projections onto two or three leading eigenvectors [5, 41, 61].

An explicit rank control can be achieved by using a rectangular \((M\times N)\) matrix \(\varOmega \) in the re-parameterization (12), together with the incorporation of a penalty term for \(\mathrm{{rank}}(\varLambda )<M\) in the cost function [18, 62]. For \(M=2\) or 3, the approach can also be used for the discriminative visualization of labelled data sets [5].

An important alternative to the intrinsic dimension reduction provided by GMLVQ is the identification of a suitable linear projection in a pre-processing step. This can be advantageous, in particular for nominally very high-dimensional data as encountered in e.g. bioinformatics, or in situations where the number of training samples P is smaller than the dimension of the feature space. Assuming that a given projection of the form

$$\begin{aligned} \mathbf {x} \, \rightarrow \, \mathbf {y} \, = \, \varPsi \, \mathbf {x} ~~~~\mathrm{{and}}~~~~ \mathbf {w} \, \rightarrow \, \mathbf {v} \, = \, \varPsi \, \mathbf {w} ~~~~ \mathrm{{with}}~~ \varPsi \in \mathbb {R}^{M\times N} \end{aligned}$$
(15)

maps N-dim. feature vectors and prototypes to their M-dim. representations we can re-write the distance measure of the form (11) as

$$\begin{aligned} \left( \mathbf {x}-\mathbf {w}\right) ^\top \, \varLambda \, (\mathbf {x}-\mathbf {w}) \, = \, \left( \mathbf {x}-\mathbf {w}\right) ^\top \, \varPsi ^\top \, \widetilde{\varLambda } \,\, \varPsi \, (\mathbf {x}-\mathbf {w}) \, = \, \left( \mathbf {y}-\mathbf {v}\right) ^\top \,\, \widetilde{\varLambda } \, \, (\mathbf {y}-\mathbf {v}). \end{aligned}$$
(16)

Hence, training and classification can be formulated in the M-dimensional space, employing prototypes and an \(M\times M\) relevance matrix \(\widetilde{\varLambda }\). Moreover, the relation \(\varLambda = \varPsi ^\top \widetilde{\varLambda } \, \varPsi \) facilitates its interpretation in the original feature space.

This versatile framework makes it possible to combine GMLVQ with, for instance, Principal Component Analysis (PCA) [55] or other linear projection techniques. Furthermore, it can be applied to the classification of functional data, where the components of the feature vectors represent an ordered sequence of values rather than a collection of more or less independent quantities. This is the case in, for instance, time series data or spectra obtained from organic samples, see [43] for examples and further references. The coefficients of, for instance, a polynomial approximation of the observed data are typically obtained by a linear transformation of the form (15), where the rows of \(\varPsi \) represent the basis functions. Hence, training can be performed in the lower-dimensional coefficient space, while the resulting classifier is still interpretable in terms of the original features [43].
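
As a sketch of this combination, a PCA-based projection \(\varPsi \) and the back-transformation of the relevance matrix according to Eq. (16) might be realized as follows; the centering convention and the function names are illustrative assumptions.

```python
import numpy as np

def pca_projection(X, M):
    """Projection matrix Psi (M x N) of Eq. (15) built from the M leading principal
    components of the data matrix X (rows: samples); returns Psi and the projected data."""
    Xc = X - X.mean(axis=0)                       # center the features
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Psi = Vt[:M]                                  # rows: leading eigenvectors of the covariance
    return Psi, Xc @ Psi.T                        # M-dim. representations y = Psi x

def back_transform(Lam_tilde, Psi):
    """Relevance matrix in the original feature space, Lambda = Psi^T Lambda_tilde Psi (Eq. 16)."""
    return Psi.T @ Lam_tilde @ Psi
```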

3 Biomedical Applications of GMLVQ

In the following, selected bio-medical applications of the GMLVQ approach are highlighted. The example problems illustrate the flexibility of the approach: the direct analysis of relatively low-dim. data in steroid metabolomics (Sect. 3.1), the combination of relevance learning with dimension reduction for cytokine data (Sect. 3.2), and the application of GMLVQ to selected gene expression data in the context of tumor recurrence prediction (Sect. 3.3). Each example is discussed briefly with emphasis on the interpretability of the relevance matrix. Eventually, further applications of GMLVQ for biomedical and life science data are briefly mentioned in Sect. 3.4.

3.1 Steroid Metabolomics in Endocrinology

A variety of disorders can affect the human endocrine system. For instance, tumors of the adrenal glands are relatively frequent and often found incidentally [3, 21]. The adrenals produce a number of steroid hormones which regulate important body functions. The differential diagnosis of malignant Adrenocortical Carcinoma (ACC) vs. benign Adenoma (ACA) based on non-invasive methods constitutes a highly relevant diagnostic challenge [21]. In [3], Arlt et al. explore the possibility to detect malignancy on the basis of the patient’s steroid excretion pattern obtained from 24 h urine samples by means of gas chromatography/mass spectrometry (GC/MS).

Fig. 2. Detection of malignancy in adrenocortical tumors, see Sect. 3.1. Panel (a): Test set ROC as obtained in the randomized validation procedure by applying GLVQ with Euclidean distance (dash-dotted line), GRLVQ with diagonal \(\varLambda \) (dashed) and GMLVQ with a full relevance matrix (solid). Panel (b): Visualization of the data set based on the GMLVQ analysis in terms of the projection of steroid profiles on the leading eigenvectors of \(\varLambda \). Circles correspond to patients with benign ACA while triangles mark malignant ACC. Prototypes are marked by larger symbols. In addition, healthy controls (not used in the analysis) are displayed as crosses.

The analysis of data comprising the excretion of 32 steroids and steroid metabolites was presented in [3, 13]: A data set representing a study population of 102 ACA and 45 ACC samples was analysed by means of training a GMLVQ system with one prototype per class and a single, global relevance matrix \(\varLambda \). In a pre-processing step, excretion values were log-transformed and in every individual training process a z-score transformation was applied.

In order to estimate the classification performance with respect to novel data representing patients with unknown diagnosis, random splits of the data set were considered: about \(90\%\) of the samples were used for training, while \(10\%\) served as a validation set. Results were obtained on average over 1000 randomized splits, yielding, for instance, the threshold-averaged ROC [22], see Eq. (3).
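
The threshold-averaged ROC can be obtained by evaluating Eq. (3) on the same grid of thresholds in each validation run and averaging the resulting rates over runs, as in the following hypothetical sketch; the lists of per-run distances and labels are assumed to be available from the randomized splits.

```python
import numpy as np

def threshold_averaged_roc(dA_runs, dB_runs, label_runs, thetas):
    """Threshold-averaged ROC [22]: Eq. (3) is evaluated at the same thresholds
    in every validation run and the rates are averaged over runs.
    Each list entry holds the distances / boolean labels of one validation set."""
    n_runs = len(dA_runs)
    tpr = np.zeros((n_runs, len(thetas)))
    fpr = np.zeros_like(tpr)
    for r, (dA, dB, lab) in enumerate(zip(dA_runs, dB_runs, label_runs)):
        for t, theta in enumerate(thetas):
            pred_A = dA <= dB + theta
            tpr[r, t] = np.sum(pred_A & lab) / np.sum(lab)       # sensitivity
            fpr[r, t] = np.sum(pred_A & ~lab) / np.sum(~lab)     # 1 - specificity
    return fpr.mean(axis=0), tpr.mean(axis=0)
```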

A comparison of three scenarios provides evidence for the beneficial effect of relevance learning: When applying Euclidean GLVQ, the classifier achieves an ROC with an area under the curve of \(AUC \approx 0.87\), see Fig. 2(a). The consideration of an adaptive diagonal relevance matrix, corresponding to GRLVQ [28], yields an improved performance with \(AUC \approx 0.93\). The GMLVQ approach, cf. Sect. 2.5, with a fully adaptive relevance matrix achieves an AUC of about 0.97. In the latter case, a working point with equal sensitivity and specificity of 0.90 can be selected by proper choice of the threshold \(\varTheta \) in Eq. (3). As reported in [3], the GMLVQ system outperformed alternative classifiers of comparable complexity.

The resulting relevance matrix \(\varLambda \) turned out to be dominated by the leading eigenvector corresponding to its largest eigenvalue; subsequent eigenvalues are found to be significantly smaller. As discussed above, this property can be exploited for the discriminative visualization of the data set and prototypes, see Fig. 2(b). The figure displays, in addition, a set of feature vectors representing healthy controls, which were not explicitly considered in the training process. Reassuringly, control samples cluster close to the ACA prototype and appear clearly separated from the malignant ACC.

Fig. 3. Relevance of steroid markers in adrenal tumor classification, see Sect. 3.1 for details. Panel (a): Diagonal elements \(\varLambda _{ii}\) of the GMLVQ relevance matrix on average over the 1000 randomized training runs. Panel (b): Percentage of training runs in which a particular steroid appeared among the 9 most relevant markers.

By inspecting the relevance matrix of the trained system, further insight into the problem and data set can be achieved. Figure 3(a) displays the diagonal elements of \(\varLambda \) on average over 1000 randomized training runs. Subsets of markers can be identified, which are consistently rated as particularly important for the classification. For instance, markers 5, 6 and 19 appear significantly more relevant than all others, see [3] for a detailed discussion from the endocrinological perspective. There, the authors suggest a panel of nine leading steroids, which could serve as a reduced marker set in a practical realization of the diagnosis tool. Figure 3(b) displays the fraction of training runs in which a single marker is rated among the nine most relevant ones, providing further support for the selection of the subset [3]. Repeating the GMLVQ training for selected subsets of leading markers yielded slightly inferior performance compared to the full panel of 32 markers, with \(AUC \approx 0.96\) for nine steroids, and \(AUC \approx 0.94\) with 3 leading markers only, see [3] for details of the analysis.

The analysis of steroid metabolomics data by means of GMLVQ and related techniques is currently explored in the context of various disorders, see [24, 38, 44] for recent examples. In the context of adrenocortical tumors, the validation of the diagnostic approach in prospective studies and the development of efficient methods for the detection of post-operative recurrence are in the center of interest [19].

3.2 Cytokine Markers in Inflammatory Diseases

Rheumatoid Arthritis (RA) constitutes an important example of chronic inflammatory disease. It is the most common form of autoimmune arthritis with symptoms ranging from stiffness and swelling of joints to, in the long term, bone erosion and joint deformity.

Fig. 4. GMLVQ analysis of Rheumatoid Arthritis data, see Sect. 3.2 for details. Discrimination of resolving cases (class B) vs. patients with early RA (class C). Panel (a) shows the ROC \((AUC\approx 0.763)\) as obtained in the Leave-One-Out (from each class) validation. Panel (b) displays the diagonal elements of the back-transformed relevance matrix on average over the validation runs.

Cytokines play an important role in the regulation of inflammatory processes. Yeo et al. [72] investigated the role of 117 cytokines in early stages of RA. Their mRNA expression was determined by means of PCR techniques for four different patient groups: uninflamed healthy controls (group A, 9 samples), patients with joint inflammation that resolved within 18 months after symptom onset (group B, 9 samples), patients with early symptoms who went on to develop Rheumatoid Arthritis within this period of time (group C, 17 samples), and patients with an established diagnosis of RA (group D, 12 samples).

Note that the total number of samples is small compared to the dimension \(N=117\) of the feature vectors \(\mathbf {x}\) comprising log-transformed RNA expression values. Hence, standard PCA was applied to identify a suitable low-dimensional representation of the data. The analysis revealed that \(95\%\) of the variation in the data set was already explained by the 21 leading principal components. Attributing the remaining \(5\%\) mainly to noise, all cytokine expression data were represented in terms of \(M=21\)-dim. feature vectors, corresponding to the projections \(\mathbf {y}\) in Eq. (15).
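
A corresponding sketch for selecting the number of leading principal components that explain a given fraction of the total variance is shown below; the \(95\%\) threshold follows the description above, while the function and variable names are illustrative.

```python
import numpy as np

def n_components_for_variance(X, fraction=0.95):
    """Smallest number M of leading principal components whose cumulative
    explained variance reaches the given fraction of the total variance."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)        # singular values of the centered data
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(explained, fraction) + 1)
```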

GMLVQ was applied to two classification subproblems: The first addressed the discrimination of healthy controls (class A) and established RA patients (class D). While this problem does not constitute a diagnostic challenge at all, it served as a consistency check and revealed first insights into the role of cytokine markers. In the second setting, the much more difficult problem of discriminating early stage RA (class C) from resolving cases (class B) was considered.

The performances of the respective classifier systems were evaluated in a validation procedure by leaving out one sample from each class for testing and training on the remaining data. Results were reported on average over all possible test set configurations. Reassuringly, the validation set ROC obtained for the classification of A vs. D displayed almost error free performance with \(AUC\approx 0.996\). The expected greater difficulty of discriminating patient groups B and C was reflected in a lower AUC of approximately 0.763, see Fig. 4(a).

It is important to note that it was not the main aim of the investigation to propose a practical diagnosis tool for the early detection of Rheumatoid Arthritis. As much as an early diagnosis would be desirable, the limited size of the study population would not provide enough supporting evidence for such a suggestion. However, the GMLVQ analysis revealed important and surprising insights into the role of cytokines. Computing the back-transformed relevance matrix with respect to the original cytokine expression features along the lines of Eq. (16) makes possible an evaluation of their significance in the respective classification problem. Figure 4(b) displays the cytokine relevances as obtained in the discrimination of classes B and C. Two cytokines, CXCL4 and CXCL7, were identified as clearly dominating in terms of their discriminative power. A discussion of further relevant cytokines, also with respect to the differences between the two classification problems, can be found in [72].

The main result of the machine learning analysis triggered additional investigations by means of a direct inspection of synovial tissue samples. Careful studies employing staining techniques confirmed that CXCL4 and CXCL7 play an important role in the early stages of RA [72]. Significantly increased expression of CXCL4 and CXCL7 was confirmed in early RA patients compared with those with resolving arthritis or with clearly established disease. The study showed that the two cytokines co-localize, in particular, with extravascular macrophages in early stage Rheumatoid Arthritis. Implications for future research into the onset and progression of RA are also discussed in [72].

3.3 Recurrence Risk Prediction in Clear Cell Renal Cell Carcinoma

Mukherjee et al. [47] investigated the use of mRNA-Seq expression data to evaluate recurrence risk in clear cell Renal Cell Carcinoma (ccRCC). The corresponding data set is publicly available from The Cancer Genome Atlas (TCGA) repository [51] and is also hosted at the Broad Institute (http://gdac.broadinstitute.org). It comprises mRNA-Seq data (raw and RPKM normalized) for 20532 genes, accompanied by clinical data for survival and recurrences for 469 tumor samples. Preprocessing steps, including normalization, log-transformation, and median centering, are described in [47].

By means of an outlier analysis [1], a drastically reduced panel of 80 genes was identified for further use, see also [47] for a description of the method in this particular example. The panel consists of four different groups, each comprising 20 selected genes: In group (I), high expression can be correlated with low risk, i.e. late or no recurrence. In group (II), however, low expression is associated with low risk. Group (III) contains genes where high expression is correlated with a high risk for early recurrence, while in group (IV) low expression of the genes is an indication of high risk.

In [47], a risk index is presented, which is based on a voting scheme with respect to the 80 selected genes. Here, the focus is on the further analysis of the corresponding expression values using GMLVQ, also discussed in [47].

Fig. 5. Recurrence risk prediction in ccRCC, see Sect. 3.3 for details. Panel (a): Number of recurrences registered in the 469 patients vs. time in days. The vertical line marks a threshold of 24 months, before which 109 patients developed a recurrence. Panel (b): Diagonal entries of the relevance matrix with respect to the discrimination of low risk vs. high risk patients from the expression of the 80 selected genes.

Fig. 6. Recurrence risk prediction in ccRCC, see Sect. 3.3 for details. Panel (a): ROC for the classification of low-risk (no or late recurrence) vs. high risk (early recurrence) as obtained in the Leave-One-Out validation of the GMLVQ classifier trained on the subset of 216 patients, cf. Sect. 3.3. The circle marks the performance of the Nearest Prototype Classifier. Panel (b): Kaplan-Meier plot [33] showing recurrence free survival rates in the high-risk (lower curve) and low-risk (upper curve) group as classified by the GMLVQ system applied to all 469 samples. Time is given in days.

In order to define a meaningful classification problem, two extreme groups of patients were considered: group A, with poor prognosis/high risk, comprises 109 patients with recurrence within the first 24 months after the initial diagnosis. Group B corresponds to 107 patients with favorable prognosis/low risk, who did not develop tumor recurrence within 60 months after diagnosis. The frequency of recurrence times observed over five years in the complete set of 469 patients is shown in Fig. 5(a); the vertical line marks the threshold of two years after diagnosis.

A GMLVQ system with one 80-dim. prototype per class (A, B) and a global relevance matrix was trained on the subset of the 216 clear-cut cases in groups A and B. Leave-One-Out validation yielded the averaged ROC shown in Fig. 6(a) with \(AUC\approx 0.812\).

The diagonal elements of the averaged relevance matrix are displayed in Fig. 5(b). The results show that genes in groups (I) and (IV) seem to be particularly discriminative and suggest that a further reduction of the gene panel should be feasible [47].

In order to further evaluate the GMLVQ classifier, it was employed to assign all 469 samples in the data set to the groups of high risk or low risk patients, respectively. For the 216 cases with early recurrence (\({\le }24\) months) or no recurrence within 60 months, the Leave-One-Out prediction was used. For the remaining 253 patients, the GMLVQ classifier obtained from the 216 reference samples was used.

Figure 6(b) shows the resulting Kaplan-Meier plot [33]. It displays the recurrence free survival rate of the low risk (upper) and high risk (lower) groups according to the GMLVQ classification, corresponding to a pronounced discrimination of the groups with a log-rank p-value of \(1.2 \times 10^{-8}\).

In summary, the work presented in [47] shows that gene expression data enables an efficient risk assessment with respect to tumor recurrence. Further analysis, taking into account healthy cell samples as well, shows that the panel of genes is not only prognostic but also diagnostic [47].

3.4 Further Bio-medical and Life Science Applications

Apart from the studies discussed in the previous sections, variants of LVQ have been employed successfully in a variety of biomedical and life science applications. In the following, a few more examples are briefly mentioned and references are provided for the interested reader.

An LVQ1-like classifier was employed for the identification of exonic vs. intronic regions in the genome of C. elegans based on features derived from sequence data [4]. In this application, the use of the Manhattan distance in combination with heuristic relevance learning proved advantageous.

Simple LVQ1 with Euclidean distance measure was employed successfully in the inter-species prediction of protein phosphorylation in the sbv IMPROVER challenge [12]. There, the goal was to predict the effect of chemical stimuli on human lung cells, given information about the reaction of rodent cells under the same conditions.

The detection and discrimination of viral crop plant diseases, based on color and shape features derived from photographic images was studied in [50]. The authors applied divergence-based LVQ, cf. Sect. 2.3, for the comparison of feature histograms derived from Cassava plant leaf images. A comparison with alternative approaches, including GMLVQ is presented in [49].

The analysis of flow-cytometry data was considered in [6] in the context of the DREAM6/FlowCAP2 challenge [2], which addressed the detection of Acute Myeloid Leukemia (AML). For each subject, 31 markers were provided, including measures of cell size and intracellular granularity as well as 29 expression values of surface proteins for thousands of individual cells. Hand-crafted features were determined in terms of statistical moments over the entire cell population, yielding a 186-dim. representation for each patient. GMLVQ applied in this feature space yielded error-free prediction of AML in the test set [2, 6].

The detection and discrimination of different Parkinsonian syndromes was addressed in [45, 46]. Three-dimensional brain images obtained by fluorodeoxyglucose positron emission tomography (FDG-PET) comprise several hundred thousand voxels per subject, providing information about the local glucose metabolism. An appropriate dimension reduction by the Scaled Subprofile Model with Principal Component Analysis (SSM/PCA) yields a data set dependent, low-dimensional representation in terms of subject scores, see [45, 46] for further references. In comparison with Decision Trees and Support Vector Machines, the GMLVQ classifier displayed competitive or superior performance [46].

4 Concluding Remarks

This contribution merely serves as a starting point for studies into the application of prototype and distance based classification in the biomedical domain. It provides by no means a complete overview and focusses on the example framework of Generalized Matrix Relevance Learning Vector Quantization, which has been applied to a variety of life science datasets. The specific application examples were selected in order to demonstrate the flexibility of the approach and illustrate its interpretability.

A number of open questions and challenges deserve attention in future research – to name only a few examples: A better understanding of feature relevances should be obtained, for instance, by exploiting the approaches presented in [23]. Combined distance measures can be designed for the treatment of different sources of information in an integrative manner [48]. The analysis of functional data plays a role of increasing importance in the biomedical domain, see e.g. [43]. In general, the development of efficient methods for the analysis of biomedical data, which are at the same time powerful and transparent, constitutes a major challenge of great importance. Prototype based classifiers will continue to play a central role in this context.