1 Introduction

Classification of vectorial data is still a challenging topic. For example, classification of hyper-spectral vectors in remote sensing image analysis requires precise learning of classifier models for frequently overlapping or non-linear class distributions (Villmann et al. 2003). Discrimination of patient records in medicine may demand subtle differentiation of features for correct disease diagnosis (Schleif et al. 2009; Villmann 2002; Wutzler et al. 2009). Adaptive models from machine learning such as learning vector quantizers (LVQ, Kohonen 1997), support vector machines (SVMs, Schölkopf and Smola 2002) or multilayer perceptrons (MLP, Haykin 1994) promise alternatives to traditional multivariate data analysis approaches like linear or quadratic discriminant analysis (LDA/QDA) (Duda and Hart 1973; Sachs 1992), when these classical statistical methods do not deliver results with sufficient precision.

Both LVQs and SVMs are prototype-based classifiers. Under certain conditions, LVQ algorithms generate class-typical prototypes, whereas the prototypes resulting from SVM training determine the class borders and are called support vectors. These support vectors are always data points. They are identified by convex optimization providing a unique solution. Yet, LVQs as introduced by Kohonen (1986, 1990) realize intuitive learning based on the Hebbian principle. A cost function-based variant was proposed by Sato and Yamada and denoted as generalized LVQ (GLVQ, Sato and Yamada 1996). Here, optimization is realized by stochastic gradient descent learning such that an optimum is reached only with high probability, in contrast to the unique solution obtained from SVM optimization. Further, LVQs handle the prototypes in the data space, usually equipped with the Euclidean metric, such that they are easy to interpret. In contrast, SVMs implicitly map the data into the feature mapping space (FMS) associated with the used kernel. This FMS is high-dimensional, possibly infinite-dimensional, and the mapping is generally non-linear. These SVM properties frequently lead to a superior performance compared to other classification models (Schölkopf and Smola 2002). Yet, there are other powerful classifiers in play including decision trees, random forests and deep architectures (Bengio 2009). A good comparison can be found in Caruana et al. (2006).

However, we restrict ourselves to precise learning of the class borders as one important aspect of good classification performance. In this sense, SVMs still play an important role. Yet, the number of support vectors in an SVM, which can be taken as a measure of model complexity, may become large and cannot be explicitly controlled or determined in advance. The model complexity is only implicitly controlled by regularization and additional slack variables (Schölkopf and Smola 2002).

For LVQ approaches, in general, the number of prototypes has to be fixed before model adaptation, i.e. before training. This might be an advantage if restrictions on the model complexity apply, as may be the case in onboard technical systems with restricted memory, e.g. in robotics (Klingner et al. 2014). Yet, the basic LVQ scheme with a fixed number of prototypes can easily be combined with growing networks like the growing neural gas (GNG, Fritzke 1995), allowing a controlled increase of model complexity (Hammer et al. 2005a).

As mentioned above, support vectors detect the class borders such that SVMs maximize the class separation margin (Hastie et al. 2001) whereas GLVQ optimizes the hypothesis margin (Crammer et al. 2003). The border-sensitive behavior as well as the kernel feature mapping of SVMs contributes to their superior performance for many applications. Several attempts were made to integrate the kernel idea into GLVQ using approximation techniques in the related FMS (Qin and Suganthan 2004; Schleif et al. 2011).

In this paper, we deal with the other aspect: border-sensitive learning for LVQ models. In particular, we propose two different methods to establish class border sensitivity in GLVQ. Thereby, the aim of the investigation is not to show better performance of the new LVQ variants compared to SVM. Rather, we would like to show that these variants constitute, as a matter of principle, an alternative to border-sensitive SVMs if explicit control of model complexity is demanded. This focus is triggered by the observation that class separation is frequently favored over class description. In general, both aspects are difficult to combine (Hammer et al. 2014).

The first of these border-sensitive LVQ variants uses an additional penalty term in the cost function of GLVQ, explicitly forcing the prototypes to move closer to the class borders such that a better sensitivity is achieved. The second approach achieves the border sensitivity implicitly by a parameter control of the transfer function already contained in the standard GLVQ model. This latter strategy leads to an adaptation of the prototypes only for those data points which are close to the class borders and can be related to active learning schemes (Hasenjäger and Ritter 1998; Schleif et al. 2007). Hence, the prototypes learn only from data near the class borders and, therefore, are implicitly sensitized to the class decision boundaries. Yet, as pointed out in Hammer et al. (2014), border sensitivity does not automatically imply class-typical prototypes.

Both approaches are demonstrated for illustrative artificial data sets as well as for real-world data.

2 Generalized learning vector quantization (GLVQ)

In this section, we briefly recall the basic variants of LVQ according to Kohonen and to Sato and Yamada in order to fix notation and terminology. In particular, we assume a given training data set of vectors \(\mathbf {v}\in V\subseteq \mathbb {R}^{n}\). The cardinality of \(V\) is denoted as \(\#V\). The prototypes \(\mathbf {w}_{k}\in \mathbb {R}^{n}\) of the LVQ model for data representation are collected in the set \(W=\left\{ \mathbf {w}_{k}\in \mathbb {R}^{n},\, k=1,\ldots , M\right\} \). Each training vector \(\mathbf {v}\) belongs to a predefined class \(x\left( \mathbf {v}\right) \in \mathcal {C}=\left\{ 1,\ldots ,C\right\} \) out of \(C\) classes. The prototypes are labeled by \(y\left( \mathbf {w}_{k}\right) \in \mathcal {C}\) such that at least one prototype is responsible for each class. Further, we suppose a dissimilarity measure \(d\left( \mathbf {v},\mathbf {w}_{k}\right) \) in the data space, frequently but not necessarily chosen as the squared Euclidean distance.

2.1 Kohonen’s LVQ

LVQs as introduced by Kohonen (1992) are prototype-based classifiers with a predefined set of prototypes, which is optimized during learning and then serves as the reference set for classification. The optimization takes place by attraction and repulsion of the prototypes for presented training samples in compliance with a nearest prototype principle (NPP). According to this principle, let \(\mathbf {w}^{+}\) denote the nearest prototype for a given data sample (vector) \(\mathbf {v}\) with respect to the dissimilarity measure \(d\) for which, additionally, \(y\left( \mathbf {w}^{+}\right) =x\left( \mathbf {v}\right) \) is valid. Thus, \(\mathbf {w}^{+}\) is the best matching prototype with correct class label, shortly denoted as the best matching correct prototype. We define \(d^{+}\left( \mathbf {v}\right) =d\left( \mathbf {v},\mathbf {w}^{+}\right) \) as the respective dissimilarity degree. Analogously, \(\mathbf {w}^{-}\) is the best matching prototype with a class label \(y\left( \mathbf {w}^{-}\right) \) different from \(x\left( \mathbf {v}\right) \), i.e. the best matching incorrect prototype, and \(d^{-}\left( \mathbf {v}\right) =d\left( \mathbf {v},\mathbf {w}^{-}\right) \) is again the assigned dissimilarity degree, see Fig. 1.

Fig. 1

Illustration of the winner determination of \(\mathbf {w}^{+}\), the best matching correct prototype and the best matching incorrect prototype \(\mathbf {w}^{-}\) together with their distances \(d^{+}\left( \mathbf {v}\right) \) and \(d^{-}\left( \mathbf {v}\right) \), respectively. The overall best matching prototype here is \(\mathbf {w}^{*}=\mathbf {w}^{+}\)

Further, let

$$\begin{aligned} \mathbf {w}^{*}={\mathrm{argmin}}_{\mathbf {w}_{k}\in W}\left( d\left( \mathbf {v},\mathbf {w}_{k}\right) \right) \end{aligned}$$
(1)

be the overall best matching prototype (winner) without any label restriction, let \(d^{*}\left( \mathbf {v}\right) =d\left( \mathbf {v},\mathbf {w}^{*}\right) \) be the respective dissimilarity degree, and let \(y^{*}=y\left( \mathbf {w}^{*}\right) \) denote the class label of the winner. Hence, \(\mathbf {w}^{*}\in W^{*}=\left\{ \mathbf {w}^{+},\mathbf {w}^{-}\right\} \) holds, where \(W^{*}\) is denoted as the winner subset of the prototype set \(W\). With these notations, the basic learning scheme in LVQ can be formulated as:

$$\begin{aligned} \triangle \mathbf {w}^{*}=\varepsilon \cdot \varPsi \left( x\left( \mathbf {v}\right) ,y^{*}\right) \cdot \left( \mathbf {v}-\mathbf {w}^{*}\right) \end{aligned}$$
(2)

with \(0<\varepsilon \ll 1\) being the learning rate (Biehl et al. 2014). The label evaluation function

$$\begin{aligned} \varPsi \left( x\left( \mathbf {v}\right) ,y^{*}\right) =\delta _{x\left( \mathbf {v}\right) ,y^{*}}-\left( 1-\delta _{x\left( \mathbf {v}\right) ,y^{*}}\right) \end{aligned}$$
(3)

determines the direction of the vector shift \(\mathbf {v}-\mathbf {w}^{*}\) where \(\delta _{x\left( \mathbf {v}\right) ,y^{*}}\) is the Kronecker symbol, such that \(\delta _{x\left( \mathbf {v}\right) ,y^{*}}=1\) holds for \(x\left( \mathbf {v}\right) =y^{*}\) and zero elsewhere. This heuristic adaptation scheme leads to an approximation of a Bayes decision scheme (Kohonen 1997).
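As a concrete illustration of this update rule (our own sketch, not code from the original references; names such as `lvq1_step` are hypothetical), one LVQ1 learning step for a single sample combines the winner search (1) with the attraction/repulsion step (2)-(3):

```python
import numpy as np

def lvq1_step(v, x_label, W, y_labels, eps=0.01):
    """One LVQ1 step for sample v: attract or repel the overall winner w* (Eqs. 1-3)."""
    d = np.sum((W - v) ** 2, axis=1)                 # squared Euclidean distances to all prototypes
    k = np.argmin(d)                                 # overall winner w* (Eq. 1)
    psi = 1.0 if y_labels[k] == x_label else -1.0    # label evaluation Psi (Eq. 3)
    W[k] += eps * psi * (v - W[k])                   # attraction or repulsion (Eq. 2)
    return W
```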

An improved convergence behavior is obtained for slight modifications of this basic scheme regarding, for example, an adaptive learning rate or an update also for the second winning prototype

$$\begin{aligned} \mathbf {w}_{2\mathrm{nd}}^{*}={\mathrm{argmin}}_{\mathbf {w}_{k}\in W\setminus \left\{ \mathbf {w}^{*}\right\} }\left( d\left( \mathbf {v},\mathbf {w}_{k}\right) \right) \end{aligned}$$

in addition to the overall winner \(\mathbf {w}^{*}\). If \(\mathbf {w}^{*}\) and \(\mathbf {w}_{2\mathrm{nd}}^{*}\) constitute the winner subset \(W^{*}\), Kohonen suggested the window rule

$$\begin{aligned} \min \left( \frac{d^{+}\left( \mathbf {v}\right) }{d^{-}\left( \mathbf {v}\right) },\frac{d^{-}\left( \mathbf {v}\right) }{d^{+}\left( \mathbf {v}\right) }\right) \ge \frac{1-\omega }{1+\omega } \end{aligned}$$
(4)

in the variant LVQ2.1. Prototype adaptation only takes place if this relation is fulfilled for a predefined value \(0<\omega <1\) (Kohonen 1997), i.e. if the data sample \(\mathbf {v}\) falls into a window around the decision border. Yet, this rule does not work for very high-dimensional data as explained in Witoelar et al. (2010).
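For illustration, the window condition (4) can be checked as follows before an LVQ2.1 update is applied (a minimal sketch using the notation above; the function name is ours):

```python
def in_window(d_plus, d_minus, omega=0.3):
    """LVQ2.1 window rule (Eq. 4): True if the sample lies close enough to the decision border."""
    return min(d_plus / d_minus, d_minus / d_plus) >= (1.0 - omega) / (1.0 + omega)
```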

After training, the response \(y^{*}\) of the LVQ yields the predicted classification of a data sample. According to the winner determination (1) for each data sample, the prototype set \(W\) determines a partition of the data space into the so-called receptive fields defined as:

$$\begin{aligned} R\left( \mathbf {w}_{k}\right) =\left\{ \mathbf {v}\in V|\mathbf {w}_{k}=\mathbf {w}^{*}\right\} \end{aligned}$$
(5)

also known as Voronoi tessellation. The dual graph \(\mathcal {G}\), also denoted as the Delaunay or neighborhood graph, with the prototype indices taken as graph vertices, determines the class distributions via the class labels \(y\left( \mathbf {w}_{k}\right) \) and the adjacency matrix \(\mathbf {G}\) of \(\mathcal {G}\) with elements \(g_{ij}=1\) iff \(R\left( \mathbf {w}_{i}\right) \cap R\left( \mathbf {w}_{j}\right) \ne \emptyset \) and zero elsewhere. For given prototypes and data samples, the graph can be estimated using \(\mathbf {w}^{*}\) and \(\mathbf {w}_{2\mathrm{nd}}^{*}\) (Martinetz and Schulten 1994).
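Along the lines of Martinetz and Schulten (1994), the adjacency matrix can be estimated by connecting, for each data sample, the overall winner with the second winner; the following sketch (our own illustration, assuming the squared Euclidean distance) shows this competitive Hebbian estimation:

```python
import numpy as np

def neighborhood_graph(V, W):
    """Estimate the adjacency matrix G of the neighborhood (Delaunay) graph by
    connecting w* and w*_2nd for every data sample."""
    G = np.zeros((len(W), len(W)), dtype=int)
    for v in V:
        d = np.sum((W - v) ** 2, axis=1)
        k1, k2 = np.argsort(d)[:2]     # overall winner and second winner
        G[k1, k2] = G[k2, k1] = 1
    return G
```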

It turns out that the window rule (4) may destabilize the learning process and, therefore, it was suggested to apply this rule only for a few learning steps after usual LVQ1 training to improve the performance (Kohonen 1997). This unstable behavior can be prevented or at least reduced, if the window rule is only applied if the receptive fields \(R(\mathbf {w}^{+})\) and \(R(\mathbf {w}^{-})\) are neighbored, i.e. \(R(\mathbf {w}^{+})\cap R(\mathbf {w}^{-})\ne \emptyset \) (Kaden et al. 2014).

2.2 The basic GLVQ model

The aim of the generalized LVQ introduced by Sato and Yamada (1996) was to keep the basic principle of attraction and repulsion in prototype-based classification learning while overcoming the problems of the heuristic adaptation. In particular, stochastic gradient descent learning related to a well-defined cost function was identified as a powerful alternative. For this purpose, a classifier function

$$\begin{aligned} \mu \left( \mathbf {v}\right) =\frac{d^{+}\left( \mathbf {v}\right) -d^{-}\left( \mathbf {v}\right) }{d^{+}\left( \mathbf {v}\right) +d^{-}\left( \mathbf {v}\right) } \end{aligned}$$
(6)

is considered, where \(\mu \left( \mathbf {v}\right) \in \left[ -1,1\right] \) is valid and correct classification of a training sample \(\mathbf {v}\) corresponds to \(\mu \left( \mathbf {v}\right) \le 0\). Then, a cost function

$$\begin{aligned} E_{\mathrm{GLVQ}}\left( W\right) =\frac{1}{\#V}\sum _{\mathbf {v}\in V}f\left( \mu \left( \mathbf {v}\right) \right) \end{aligned}$$
(7)

is defined with a monotonically increasing transfer or squashing function \(f\). The squashing function is commonly chosen as a sigmoid function like

$$\begin{aligned} f\left( x\right) =\frac{1}{1+\exp \left( -x\right) } \end{aligned}$$
(8)

or as the identity function \(f\left( x\right) =id\left( x\right) =x\).

Learning in GLVQ takes place as stochastic gradient descent on \(E_{\mathrm{GLVQ}}\left( W\right) \). In particular, we have

$$\begin{aligned} \triangle \mathbf {w}^{+}\sim \xi ^{+}\left( \mathbf {v}\right) \cdot \frac{\partial d^{+}\left( \mathbf {v}\right) }{\partial \mathbf {w}^{+}}\quad \text { and }\quad \triangle \mathbf {w}^{-}\sim \xi ^{-}\left( \mathbf {v}\right) \cdot \frac{\partial d^{-}\left( \mathbf {v}\right) }{\partial \mathbf {w}^{-}} \end{aligned}$$
(9)

with the scaling factors

$$\begin{aligned} \xi ^{+}\left( \mathbf {v}\right) =f^{\prime }\left( \mu \left( \mathbf {v}\right) \right) \cdot \frac{2\cdot d^{-}\left( \mathbf {v}\right) }{\left( d^{+}\left( \mathbf {v}\right) +d^{-}\left( \mathbf {v}\right) \right) ^{2}} \end{aligned}$$

and

$$\begin{aligned} \xi ^{-}\left( \mathbf {v}\right) =-f^{\prime }\left( \mu \left( \mathbf {v}\right) \right) \cdot \frac{2\cdot d^{+}\left( \mathbf {v}\right) }{\left( d^{+}\left( \mathbf {v}\right) +d^{-}\left( \mathbf {v}\right) \right) ^{2}} \end{aligned}$$

For the squared Euclidean metric, we obtain a vector shift according to

$$\begin{aligned} \frac{\partial d^{\pm }\left( \mathbf {v}\right) }{\partial \mathbf {w}^{\pm }}=-2\left( \mathbf {v}-\mathbf {w}^{\pm }\right) \end{aligned}$$
(10)

for the prototypes.
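For the squared Euclidean distance, one stochastic gradient step of GLVQ can be sketched as follows (our own illustration with hypothetical names; `f_prime` is the derivative of the transfer function, here defaulting to the identity case \(f^{\prime }\equiv 1\)):

```python
import numpy as np

def glvq_step(v, x_label, W, y_labels, eps=0.01, f_prime=lambda mu: 1.0):
    """One stochastic gradient descent step of GLVQ (Eqs. 6, 9, 10), squared Euclidean distance.

    W is the prototype matrix (M, n); y_labels is a NumPy array of prototype labels.
    """
    d = np.sum((W - v) ** 2, axis=1)
    correct = (y_labels == x_label)
    kp = np.flatnonzero(correct)[np.argmin(d[correct])]     # best matching correct prototype w^+
    km = np.flatnonzero(~correct)[np.argmin(d[~correct])]   # best matching incorrect prototype w^-
    dp, dm = d[kp], d[km]
    mu = (dp - dm) / (dp + dm)                              # classifier function (Eq. 6)
    xi_p = f_prime(mu) * 2.0 * dm / (dp + dm) ** 2          # scaling factor xi^+
    xi_m = -f_prime(mu) * 2.0 * dp / (dp + dm) ** 2         # scaling factor xi^-
    # gradient descent with d d^{+/-}/dw = -2 (v - w): w^+ is attracted, w^- is repelled
    W[kp] += eps * xi_p * 2.0 * (v - W[kp])
    W[km] += eps * xi_m * 2.0 * (v - W[km])
    return W
```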

As shown in Crammer et al. (2003), GLVQ maximizes the hypothesis margin

$$\begin{aligned} M\left( \mathbf {v},x\left( \mathbf {v}\right) \right) = d^{-}\left( \mathbf {v}\right) -d^{+}\left( \mathbf {v}\right) \end{aligned}$$
(11)

which refers to the distance of the closest prototype labeled with a different class from \(\mathbf {v}\). Thus, it describes a ‘security’ of classification, i.e. it is related to the distance that the prototypes can be altered without changing the classification decision (Nova and Estévez 2013; Kaden et al. 2014). The hypothesis margin is associated with the generalization error bound independent of the data dimension but depending on the number of prototypes (Hammer et al. 2005b). Further, we remark that minimizing the cost function \(E_{\mathrm{GLVQ}}\left( W\right) \) from (7) approximates the minimization of the misclassification rate (Kaden et al. 2014).

2.3 GLVQ and non-Euclidean distances

Depending on the classification task, other (differentiable) dissimilarity measures than the Euclidean may be more appropriate (Hammer and Villmann 2002; Villmann and Haase 2011). Quadratic forms \(d_{\mathbf {\Lambda }}\left( \mathbf {v},\mathbf {w}\right) =\left( \mathbf {v}-\mathbf {w}\right) ^{\top }\mathbf {\Lambda }\left( \mathbf {v}-\mathbf {w}\right) \) are discussed in Bunte et al. (2012), and Schneider et al. (2009a, b, 2010). Here, the positive semi-definite matrix \(\mathbf {\Lambda }\) is decomposed into \(\mathbf {\Lambda }=\Omega ^{\top }\Omega \) with arbitrary matrices \(\Omega \in \mathbb {R}^{m\times D}\) which can be adapted during the training. For classification visualization, the parameter \(m\) has to be two or three; the full problem is obtained for \(m=D\). If data are matrices, distances based on Schatten norms or general matrix norms come into play (Horn and Johnson 2013; Schatten 1950), which show very good discriminative behavior (Gu et al. 2012). Alternatively, SVMs implicitly map the data and prototypes into a high-, possibly infinite-, dimensional function Hilbert space \(\mathcal {H}\) by a generally non-linear mapping \(\Phi :V\rightarrow \mathcal {I}_{\kappa _{\Phi }}\subseteq \mathcal {H}\) to achieve better class separability (Cristianini and Shawe-Taylor 2000; Shawe-Taylor and Cristianini 2004). This mapping corresponds uniquely in a canonical manner to positive-definite universal kernels \(\kappa _{\Phi }\left( \mathbf {v},\mathbf {w}\right) \) (PU-kernels), such that \(\left\langle \Phi \left( \mathbf {v}\right) ,\Phi \left( \mathbf {w}\right) \right\rangle _{\mathcal {H}}=\kappa _{\Phi }\left( \mathbf {v},\mathbf {w}\right) \) is valid for the inner product \(\left\langle \bullet ,\bullet \right\rangle _{\mathcal {H}}\) (Aronszajn 1950; Mercer 1909; Steinwart 2001) (Fig. 2). By means of this inner product, the metric \(d_{\mathcal {H}}\) is determined and for the image \(\mathcal {I}_{\kappa _{\Phi }}\) of the mapping \(\Phi \) the equality \(d_{\mathcal {H}}\left( \Phi \left( \mathbf {v}\right) ,\Phi \left( \mathbf {w}\right) \right) =d_{\kappa _{\Phi }}\left( \mathbf {v},\mathbf {w}\right) \) holds, where

$$\begin{aligned} d_{\kappa _{\Phi }}\left( \mathbf {v},\mathbf {w}\right) =\sqrt{\kappa _{\Phi }\left( \mathbf {v},\mathbf {v}\right) -2\kappa _{\Phi }\left( \mathbf {v},\mathbf {w}\right) +\kappa _{\Phi }\left( \mathbf {w},\mathbf {w}\right) } \end{aligned}$$
(12)

is the so-called kernel distance (Villmann et al. 2013). In the context of GLVQ, it is interesting to consider differentiable PU-kernels (DPU-kernels), for which the derivative \(\frac{\partial \kappa _{\Phi }\left( \mathbf {v},\mathbf {w}\right) }{\partial \mathbf {w}}\) exists. We can define an accompanying formal data transformation \(\Psi :V\longrightarrow \mathcal {V}\), where \(\mathcal {V}\) is the data space equipped with the kernel metric \(d_{\kappa _{\Phi }}\). The formal map \(\Psi \) is bijective, continuous and non-linear iff \(\Phi \) is (Steinwart 2001). Further, \(\mathcal {V}\) is isometric and isomorphic to \(\mathcal {I}_{\kappa _{\Phi }}\) and, hence, offers the same topological structure and richness as the image \(\mathcal {I}_{\kappa _{\Phi }}\) known from SVMs.
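As a small illustration (our own sketch, not from the paper), the kernel distance (12) can be computed for any chosen DPU-kernel; a Gaussian kernel is used here as one possible example. Replacing the squared Euclidean distance in a GLVQ implementation by this function yields the KGLVQ described below.

```python
import numpy as np

def gaussian_kernel(v, w, sigma=1.0):
    """One example of a differentiable, universal kernel kappa_Phi."""
    return np.exp(-np.sum((v - w) ** 2) / (2.0 * sigma ** 2))

def kernel_distance(v, w, kappa=gaussian_kernel):
    """Kernel distance d_kappa(v, w) according to Eq. (12)."""
    return np.sqrt(kappa(v, v) - 2.0 * kappa(v, w) + kappa(w, w))
```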

Fig. 2

Visualization of the relationship between the original data space \(V\), the kernel mapping \(\Phi \) and the kernel data space \(\mathcal {V}\). The metric space \(\mathcal {V}\) is isometric to the image \(\mathcal {I}_{\kappa _{\Phi }}\) of \(V\) under the kernel map \(\Phi \) under certain conditions, which itself uniquely corresponds to the kernel \(\kappa _{\Phi }\) in a canonical manner

The differentiability of the kernel ensures the applicability of the stochastic gradient learning of GLVQ in \(\mathcal {V}\), replacing the distance \(d\left( \mathbf {v},\mathbf {w}\right) \) from the data space \(V\) contained in the calculation of the classifier function \(\mu \left( \mathbf {v}\right) \) from (6) by the kernel distance \(d_{\kappa _{\Phi }}\left( \mathbf {v},\mathbf {w}\right) \) (Villmann et al. 2014). We denote this new data space \(\mathcal {V}\) as kernelized data space and the respective GLVQ in \(\mathcal {V}\) as kernel-GLVQ (KGLVQ). We remark that this approach does not require any approximation techniques as suggested in earlier kernelized GLVQ variants proposed in Qin and Suganthan (2004) and Schleif et al. (2011).

3 Class border sensitive learning in GLVQ

As we have seen in the previous section, KGLVQ is an extension of the usual GLVQ to kernel data spaces \(\mathcal {V}\). However, in general, the prototypes of KGLVQ as well as of GLVQ are not particularly sensitized to detect the class borders. This might be a disadvantage for KGLVQ compared to SVMs, if precise classification decisions are favored. In this section, we provide two possibilities to integrate class border sensitivity into GLVQ and, hence, also into KGLVQ. The first applies an additive attraction force for prototypes with different class responsibilities, such that the prototypes move closer to each other, which implicitly leads to an improved class border sensitivity. The second approach supposes a parametrized sigmoid transfer function \(f_{\theta }\left( \mu \right) \) in (7), where the parameter \(\theta \) controls the class border sensitivity via the so-called active sets. These active sets appear as subsets of the whole training data set containing only those data samples close to the class borders. It turns out that only the data contained in the active set contribute to the prototype learning and, hence, the prototypes become particularly sensitive to these data subsets.

3.1 Border sensitive learning in GLVQ by an additive penalty function

Penalizing an undesirable behavior of a learning system is a common strategy in machine learning. In this context, class border sensitivity learning by an additive penalty term for unsupervised fuzzy-c-means models was proposed in Villmann et al. (2012) and Yin et al. (2012). Here, we adopt these ideas for class border sensitive learning in GLVQ, i.e. we also consider a penalty strategy for classification learning (P-GLVQ). For this purpose, we extend the cost function of GLVQ (7) by an additive penalty term \(F_{\mathrm{neigh}}\left( W,V\right) \) such that

$$\begin{aligned} E_{\mathrm{P-GLVQ}}\left( W,\gamma \right) =\left( 1-\gamma \right) \cdot E_{\mathrm{GLVQ}}\left( W\right) +\gamma \cdot F_{\mathrm{neigh}}\left( W,V\right) \end{aligned}$$
(13)

is a convex sum. The parameter \(\gamma \in \left[ 0,1\right) \) is the sensitivity control parameter. The penalty term \(F_{\mathrm{neigh}}\left( W,V\right) \) is a neighborhood-attentive attraction force (NAAF)

$$\begin{aligned}&F_{\mathrm{neigh}}\left( W,V\right) \nonumber \\&\quad =\sum _{\mathbf {v}\in V}\sum _{k:\mathbf {w}_{k}\in W^{-}\left( \mathbf {v}\right) } h_{\sigma _{-}}^{NG}\left( k,\mathbf {w}^{+},W^{-}\left( \mathbf {v}\right) \right) d\left( \mathbf {w}^{+},\mathbf {w}_{k}\right) \end{aligned}$$
(14)

depending on the subset \(W^{-}\left( \mathbf {v}\right) \subset W\) of all prototypes with incorrect class labels for a given data vector \(\mathbf {v}\). The term \(d\left( \mathbf {w}^{+},\mathbf {w}_{k}\right) \) explicitly penalizes large distances between the best matching prototypes of the correct and incorrect class, i.e. large distances of these prototypes to the class borders.

This distance is weighted with the neighborhood function

$$\begin{aligned}&h_{\sigma _{-}}^{NG}\left( k,\mathbf {w}^{+},W^{-}\left( \mathbf {v}\right) \right) \nonumber \\&\quad =c_{\sigma _{-}}^{NG}\cdot \exp \left( -\frac{\left( rk_{k}\left( \mathbf {w}^{+},W^{-}\left( \mathbf {v}\right) \right) -1\right) ^{2}}{2\sigma _{-}^{2}}\right) \end{aligned}$$
(15)

determining a rank-neighborhood between the prototypes in \(W^{-}\left( \mathbf {v}\right) \) via the dissimilarity rank function

$$\begin{aligned} rk_{k}\left( \mathbf {w}^{+},W^{-}\left( \mathbf {v}\right) \right) =\sum _{\mathbf {w}_{l}\in W^{-}\left( \mathbf {v}\right) }H\left( d\left( \mathbf {w}^{+},\mathbf {w}_{k}\right) -d\left( \mathbf {w}^{+},\mathbf {w}_{l}\right) \right) \end{aligned}$$
(16)

known from Neural Gas (NG, Martinetz et al. 1993). Here, \(H\) is the Heaviside function

$$\begin{aligned} H\left( x\right) ={\left\{ \begin{array}{ll} 0 &{}\quad {\mathrm{if}}\ x\le 0\\ 1 &{}\quad {\mathrm{else}}. \end{array}\right. } \end{aligned}$$
(17)

The NAAF causes an additional gradient term

$$\begin{aligned} \frac{\partial F_{\mathrm{neigh}}\left( W,V\right) }{\partial \mathbf {w}_{j}}=h_{\sigma _{-}}^{NG}\left( j,\mathbf {w}^{+},W^{-}\left( \mathbf {v}\right) \right) \cdot \frac{\partial d\left( \mathbf {w}^{+},\mathbf {w}_{j}\right) }{\partial \mathbf {w}_{j}} \end{aligned}$$
(18)

for a given input vector \(\mathbf {v}\) and \(\mathbf {w}_{j}\in W^{-}\left( \mathbf {v}\right) \), i.e. all incorrect prototypes are gradually moved towards the correct best matching prototype \(\mathbf {w}^{+}\) according to their dissimilarity rank with respect to \(\mathbf {w}^{+}\), in the direction defined by the derivative \(\frac{\partial d\left( \mathbf {w}^{+},\mathbf {w}_{j}\right) }{\partial \mathbf {w}_{j}}\). For the squared Euclidean distance, this yields the usual vector shift, graded by the neighborhood function. In consequence, the closer a prototype is in rank to the correct winning prototype \(\mathbf {w}^{+}\), the stronger is the resulting neighborhood attraction, controlled by the neighborhood range \(\sigma _{-}>0\). Further, the weighting coefficient \(\gamma \) regulates the overall influence of border sensitive learning in this model. Because all prototypes in \(W^{-}\left( \mathbf {v}\right) \) belong to a class different from that of \(\mathbf {w}^{+}\), the introduced attraction force promotes prototype positions close to the decision borders between classes. Hence, an improved class border sensitivity is achieved implicitly.
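To make the additional force explicit, the following sketch (our own illustration with hypothetical function names, following Eqs. (14)-(18) literally and assuming the squared Euclidean distance) applies the gradient term (18) to all incorrectly labeled prototypes for one training sample; in P-GLVQ it would be executed in addition to the usual GLVQ update and weighted by \(\gamma \):

```python
import numpy as np

def naaf_update(v, x_label, W, y_labels, eps=0.01, gamma=0.5, sigma_minus=1.0):
    """Penalty-driven attraction of the incorrect prototypes towards w^+ (Eqs. 14-18)."""
    d = np.sum((W - v) ** 2, axis=1)
    correct = (y_labels == x_label)
    kp = np.flatnonzero(correct)[np.argmin(d[correct])]          # w^+
    wrong = np.flatnonzero(~correct)                             # indices of W^-(v)
    d_wplus = np.sum((W[wrong] - W[kp]) ** 2, axis=1)            # d(w^+, w_k) for all w_k in W^-(v)
    ranks = np.array([np.sum(dk > d_wplus) for dk in d_wplus])   # rank function (Eq. 16)
    h = np.exp(-((ranks - 1) ** 2) / (2.0 * sigma_minus ** 2))   # neighborhood weights (Eq. 15), constant omitted
    for hj, k in zip(h, wrong):
        # d d(w^+, w_k)/d w_k = -2 (w^+ - w_k); gradient descent moves w_k towards w^+
        W[k] += eps * gamma * hj * 2.0 * (W[kp] - W[k])
    return W
```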

On the other hand, two new parameters (\(\gamma \) and \(\sigma _{-}\)) have to be introduced to realize such a strategy, which requires additional control in applications. Moreover, the margin optimization property is lost because the penalty term \(F_{\mathrm{neigh}}\left( W,V\right) \) from (14) contributes to the overall costs (13). From this point of view, a more GLVQ-inherent approach is desirable.

3.2 Class border sensitive learning by parametrized transfer functions \(f_{\theta }\) in GLVQ

In this section, we provide an alternative approach for class border sensitive learning in GLVQ, which is more consistent with standard GLVQ than the previous penalty strategy. It uses a more inherent modification of standard GLVQ than P-GLVQ and, therefore, does not need an additional external force. For this purpose, we take up the idea presented in Strickert (2011) and Witoelar et al. (2010) that the shape of the transfer function \(f\) in (7) sensitively influences the decision certainty at the class borders. Therefore, we suppose \(f\) to be of the sigmoid type (8), the common alternative to the identity, already suggested in the very first presentations of the GLVQ approach (Sato and Tsukumo 1994; Sato and Yamada 1995). In particular, we can address the sensitivity behavior by introducing the parametrized variant

$$\begin{aligned} f_{\theta }\left( x\right) =\frac{1}{1+\exp \left( -\frac{x}{2\theta ^{2}}\right) } \end{aligned}$$
(19)

with the parameter \(\theta \) determining the slope of \(f_{\theta }\), see Fig. 3.

Fig. 3

Visualization of the parametrized sigmoid function \(f_{\theta }\left( x\right) \) depending on the slope parameter \(\theta \)

The derivative \(f_{\theta }^{\prime }\left( \mu \left( \mathbf {v}\right) \right) \) of the logistic function can be expressed in terms of the sigmoid function itself and reads as:

$$\begin{aligned} f_{\theta }^{\prime }\left( \mu \left( \mathbf {v}\right) \right) =\frac{f_{\theta }\left( \mu \left( \mathbf {v}\right) \right) }{2\theta ^{2}}\cdot \left( 1-f_{\theta }\left( \mu \left( \mathbf {v}\right) \right) \right) , \end{aligned}$$
(20)

which appears in the scaling factors \(\xi ^{\pm }\) occurring in the updates (9) for the winning prototypes \(\mathbf {w}^{\pm }\). Considering this derivative (see Fig. 4), we observe that \(\left| \xi ^{\pm }\right| \gg 0\) holds only for \(\left| \mu \left( \mathbf {v}\right) \right| \ll 1\), i.e. a significant prototype update only takes place for a small range of the classifier values \(\mu \) in (6). This range also depends on the slope parameter \(\theta \). Therefore, we define the active set of data contributing significantly to a prototype update during learning as the set

$$\begin{aligned} \hat{\Xi }=\left\{ \mathbf {v}\in V|\mu \left( \mathbf {v}\right) \in \left[ -\frac{1-\mu _{\theta }}{1+\mu _{\theta }},\frac{1-\mu _{\theta }}{1+\mu _{\theta }}\right] \right\} \end{aligned}$$
(21)

with \(\mu _{\theta }\) chosen such that \(f_{\theta }^{\prime }\left( \mu \left( \mathbf {v}\right) \right) \approx 0\) is valid for all \(\mathbf {v}\in \Xi =V\backslash \hat{\Xi }\). Thus, \(\left| \xi ^{\pm }\right| \approx 0\) is valid for all training samples not belonging to the active set. Hence, these data do not contribute to prototype learning; see Fig. 4.

Fig. 4

Left derivatives \(f_{\theta }^{\prime }\left( \mu \right) \) for several \(\theta \) values. Right visualization of the active set \(\hat{\Xi }\) (green circles) for an illustrative two-class example with one prototype per class. The prototypes are the big dots. Only data points belonging to the active set contribute significantly to prototype learning such that the prototypes become sensitive to these data (color figure online)

In contrast, data samples contained in the active set yield moderate prototype updates such that the prototypes become sensitized to them. Obviously, the active set \(\hat{\Xi }\) is distributed along the class decision boundaries, because only there \(f_{\theta }^{\prime }\left( \mu \right) \gg 0\) is valid. Therefore, the active set \(\hat{\Xi }\) is characterized by values \(\mu \left( \mathbf {v}\right) \approx 0\) for \(\mathbf {v}\in \hat{\Xi }\). Hence, this active set can be understood as another formulation of the window rule of LVQ2.1 given in (4), taking there \(\omega =\mu _{\theta }\) (Kohonen 1997). The learning of the parameter \(\theta \) in GLVQ was explicitly addressed in Witoelar et al. (2010); optimization for accuracy improvement was discussed in Strickert (2011).

These observations lead to the idea of controlling the border sensitivity of the GLVQ algorithm by the parameter \(\theta \), which obviously determines the width of the active set surrounding the class borders. Large values correspond to overall learning, whereas small \(\theta \) values define small stripes as active sets. In consequence, only these data contribute to the prototype updates. In other words, according to (21), the active set is crisp but the possibilities for control are smooth, such that we could speak of thresholded active sets \(\hat{\Xi }_{\theta }\). Therefore, border sensitivity leads to prototypes sensitized to those data points close to the class borders, depending on the control parameter \(\theta \). In this sense, the active set learning can be seen as a kind of attention-based learning (Hermann et al. 1994) or active learning (Hasenjäger and Ritter 1998; Hasenjäger et al. 1999; Schleif et al. 2007). We refer to this border sensitive approach as BS-GLVQ.
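The ingredients of BS-GLVQ can be summarized in a few lines (our own sketch, not code from the paper): the parametrized sigmoid (19), its derivative (20), and a membership test for the active set (21). Using `f_theta_prime` as the transfer-function derivative in a GLVQ gradient step and slowly decreasing `theta` during training yields the border-sensitive behavior described above.

```python
import numpy as np

def f_theta(x, theta):
    """Parametrized sigmoid transfer function (Eq. 19)."""
    return 1.0 / (1.0 + np.exp(-x / (2.0 * theta ** 2)))

def f_theta_prime(x, theta):
    """Derivative of f_theta (Eq. 20), expressed via f_theta itself."""
    fx = f_theta(x, theta)
    return fx * (1.0 - fx) / (2.0 * theta ** 2)

def in_active_set(mu, mu_theta):
    """Crisp active-set membership according to Eq. (21)."""
    bound = (1.0 - mu_theta) / (1.0 + mu_theta)
    return abs(mu) <= bound
```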

It should be explicitly mentioned at this point that BS-GLVQ is still optimizing the hypothesis margin \(M\left( \mathbf {v},x\left( \mathbf {v}\right) \right) \) from (11) because the transfer function is still monotonically increasing.

Last but not least, we emphasize another advantage of this border sensitive approach: For this methodology, we can state that in the limit \(\theta \searrow 0\) the respective cost function

$$\begin{aligned} E_{\mathrm{BS-GLVQ}}\left( W,\theta \right) =\frac{1}{\#V}\sum _{\mathbf {v}\in V}f_{\theta }\left( \mu \left( \mathbf {v}\right) \right) \end{aligned}$$
(22)

reflects the classification error (Kaden et al. 2014). This fact is based on the observation that in the limit \(\theta \rightarrow 0\) the sigmoid \(f_{\theta }\) from (19) becomes the Heaviside function \(H\!\left( x\right) \) (17) (Kaden et al. 2014).
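A quick numerical check of this limit (our own illustration with hypothetical classifier values): for decreasing \(\theta \), the mean of \(f_{\theta }\left( \mu \right) \) over a sample set approaches the fraction of misclassified samples, i.e. those with \(\mu >0\).

```python
import numpy as np

def f_theta(x, theta):
    return 1.0 / (1.0 + np.exp(-x / (2.0 * theta ** 2)))

mu = np.array([-0.8, -0.6, -0.1, 0.3])        # hypothetical classifier values mu(v); one misclassification
for theta in (1.0, 0.3, 0.05):
    cost = np.mean(f_theta(mu, theta))        # cost value (Eq. 22) for this sample set
    error_rate = np.mean(mu > 0)              # misclassification rate (0.25 here)
    print(theta, round(cost, 3), error_rate)  # cost approaches error_rate as theta -> 0
```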

4 Illustrative example and application

In this section, we demonstrate the desired properties of the border sensitive variants of GLVQ. For this purpose, we start with two-dimensional artificial data sets such that the results can be easily visualized. Thereafter, we move to more sophisticated real-world applications, one from image segmentation and one from medicine, namely neurology. For all experiments using a GLVQ variant, the prototypes were initialized randomly as data points of the respective classes.

4.1 Illustrative toy examples

The first two-dimensional artificial data set is a three-class problem. The data classes are uniformly distributed as in the Czech-flag, see Fig. 5.

Fig. 5

Border sensitive learning for the class distribution example ‘Czech-flag’: Obtained prototype positions for standard GLVQ (top), P-GLVQ (middle) and BS-GLVQ (bottom). The Euclidean distance was used

For each class, we generated \(1{,}000\) data points for training and \(1{,}000\) for testing. All GLVQ variants were trained in \(2{,}000\) epochs with constant learning rate \(\varepsilon =0.01\). The results are reported for the test data.

We compare both border sensitive GLVQ approaches, P-GLVQ and BS-GLVQ, with a standard GLVQ network. We refer to the variant with the identity transfer function \(f\left( \mu \right) =\mu \) as standard GLVQ. In case of the parametrized transfer function \(f_{\theta }\) (BS-GLVQ), we used the initial parameter \(\theta _{\mathrm{init}}=1.0\), decreased to \(\theta _{\mathrm{fin}}=0.1\) during learning. The balancing parameter \(\gamma \) for P-GLVQ was permanently set to \(\gamma =0.5\).

We observe that both border sensitive models place the prototypes closer to the class borders than standard GLVQ, see Fig. 5. Moreover, the classification accuracy is improved: P-GLVQ achieved \(91.1\,\%\) and the sigmoid variant BS-GLVQ yields \(97.2\,\%\), whereas standard GLVQ achieves only \(89.7\,\%\). Thus, the class border sensitive models detect the noisy class borders more accurately.

The second artificial two-dimensional data set, depicted in Fig. 6, is a non-linear two-class problem denoted as Palau flag.

Fig. 6

Palau flag data set

Again, we generated \(1{,}000\) data points per class for training and \(1{,}000\) for testing. As before, the learning rate was fixed to \(\varepsilon =0.01\) during \(2{,}000\) training epochs, and the results are reported for the test data. If standard GLVQ is applied with two prototypes, i.e. one prototype for each class, and the Euclidean distance is used as dissimilarity measure, a test accuracy of only \(67.1\,\%\) is obtained. If we switch in this standard GLVQ to a kernel distance \(d_{\kappa _{\Phi }}\left( \mathbf {v},\mathbf {w}\right) \) according to (12) with an exponential kernel

$$\begin{aligned} \kappa _{\Phi }\left( \mathbf {v},\mathbf {w}\right) =\exp \left( -\frac{\left\Vert \mathbf {v}-\mathbf {w}\right\Vert }{2\sigma ^{2}}\right) \end{aligned}$$
(23)

the performance is increased to \(95.5\,\%\). The kernel width was automatically adapted according to the gradient learning \(\frac{\partial E_{\mathrm{BS-GLVQ}}\left( W,\theta \right) }{\partial \sigma }\) as described in Villmann et al. (2014). We refer to this kernelized GLVQ as KGLVQ.

In the next step, we used \(2\) prototypes for the inner class and \(4\) for the surrounding class. Again, we started with the Euclidean variant. Standard GLVQ achieved \(97.1\,\%\) accuracy. Applying BS-GLVQ with a slowly, linearly decreasing \(\theta \) parameter down to \(\theta _{\mathrm{fin}}=0.16\), the performance is further increased to \(99.1\,\%\). Subsequently, we again applied KGLVQ as before. For standard KGLVQ, a slightly improved accuracy of \(97.4\,\%\) compared to standard GLVQ is obtained. The resulting prototypes and the incorrectly classified data points are visualized in Fig. 7.

Fig. 7

Application of KGLVQ with \(2+4\) prototypes to the Palau flag data set (top). Wrongly classified data points are marked by crosses. BS-KGLVQ achieves an improvement (bottom). Prototypes move closer to the class borders

However, the BS-GLVQ performance is not achieved. Incorporating now the border sensitive learning, i.e. applying BS-KGLVQ, BS-GLVQ is outperformed with a further improved accuracy of \(99.5\,\%\) for a final \(\theta \) parameter \(\theta _{\mathrm{fin}}=0.11\). Again, the prototypes move closer to the class borders to realize better sensitivity. In comparison to this BS-KGLVQ with the predefined number of only \(6\) prototypes, the SVM approach (LIB-SVM, vers. 3.17, Chang and Lin 2011) yields an accuracy of \(99.9\,\%\) but requires \(16\) support vectors serving as prototypes, see Fig. 8. Here, the kernel width was determined manually to be \(\sigma =0.7\) with regularization parameter \(C=500\) for best performance.

Fig. 8

Application of SVM to the Palau flag data set. Support vectors are marked by diamonds. BS-KGLVQ achieves an improvement (bottom). Prototypes move closer to the class borders compared to KGLVQ

4.2 Real-world datasets

4.2.1 Image segmentation data set

The image segmentation data set (ISDS) is from the UCI Repository (Blake and Merz 1998). It consists of \(2{,}310\) data vectors of dimension \(n=16\), containing feature values of instances drawn from a database of \(7\) outdoor images. Each instance is a \(3\times 3\) image region. The features describe structural and statistical properties of the instances like saturation, intensity, RGB values, hedge values and others. Each data vector is assigned the original outdoor image it was drawn from as its class label. Thus, the classification of these vectors is a \(7\)-class problem with \(300\) samples for each class.

The data were preprocessed by a \(z\)-score transformation, i.e. normalized to unit variance with the mean set to the zero vector. The data set was processed by tenfold cross-validation. In each fold, the smaller portion of \(10\,\%\) of the data was taken as training samples and the remaining data were used for testing. All reported results refer to the averaged test outcomes.
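A minimal sketch of this preprocessing and splitting scheme (our own reading of the description; the exact fold construction and stratification used by the authors are not specified):

```python
import numpy as np

def zscore(X):
    """z-score transformation: zero mean and unit variance per feature."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def reversed_tenfold(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs where each fold uses roughly 10% of the
    data for training and the remaining 90% for testing."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    for fold in np.array_split(idx, 10):
        yield fold, np.setdiff1d(idx, fold)
```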

The main concern of these simulations is to verify experimentally that the cost function value converges to the classification error when the border-sensitivity control parameter \(\theta \) is decreased.

We applied BS-GLVQ with two prototypes per class. The control parameter \(\theta \) was fixed to \(\theta _{\mathrm{ini}}=0.7\) during the first training phase. For this value, \(f_{\theta }\left( \mu \right) \) is quasi-linear in \(\mu \), i.e. BS-GLVQ behaves like standard GLVQ (quasi-linear activation). After this initial phase, the sensitivity parameter \(\theta \) is slowly decreased towards zero (Fig. 9).

Fig. 9

BS-GLVQ with Euclidean distance for the ISDS. Starting with a quasi-linear activation corresponding to \(\theta \approx 0.7\) (standard GLVQ), the sensitivity control parameter \(\theta \) is slowly decreased in steps of \(0.01\). Each level is trained during \(1{,}000\) epochs with constant learning rate \(\varepsilon =0.01\). The process is accompanied by a decreasing classification error (red square). For \(\theta \searrow 0\), the cost function value \(E_{\mathrm{BS-GLVQ}}(W,\theta )\) approaches the classification error. Curves were obtained from tenfold cross-validation

We observe that the classification error rate, i.e. the ratio of the number of misclassified test data to the overall number of test data, decreases from \(0.15\) without border sensitive learning to \(0.04\) if border sensitivity is considered. Further, for \(\theta \searrow 0\) the cost function value \(E_{\mathrm{BS-GLVQ}}\left( W,\theta \right) \) approaches the classification error, as predicted by the theory.

Replacing the Euclidean distance by the kernel distance with self-adapting kernel width, while keeping the remaining parameter settings unchanged, leads to slightly deteriorated classification error rates of \(0.16\) and \(0.05\), respectively.

4.2.2 Classification of Wilson’s disease based on electrophysiological data

The last data set concerns the detection of neurological manifestations of Wilson’s disease. Wilson’s disease is an autosomal-recessive disorder of copper metabolism, which causes disturbances of the liver function. This liver impairment leads to an accumulation of copper in the brain. In consequence, a reduced glucose metabolism is observed in several parts of the brain, in particular the basal ganglia; hepatic and extrapyramidal motor symptoms as well as movement disorders are apparent in patients suffering from Wilson’s disease (Barthel et al. 2001; Hermann et al. 2003, 2005). According to a clinical scheme suggested by Konovalov, patients can be divided into two main groups: patients with and without neurological manifestations, denoted as the neurological and the non-neurological group, respectively (Hermann et al. 2002). In addition to the hepatolenticular degeneration in Wilson’s disease, the sensory and extrapyramidal motor systems are also disturbed. The impairments of these nervous pathways can be detected by investigating the latencies of evoked potentials in different brain regions (Günther et al. 2011). Collecting several evoked potentials according to a pre-defined medical examination yields a so-called electrophysiological impairment profile (EIP) for the patient. Yet, it is not clear so far whether a precise classification of the EIPs according to their underlying neurological type is possible (Hermann et al. 2005).

The aim of the study was to find a hint, whether such a distinction can be made. For this purpose, a database containing \(M=122\) five-dimensional EIPs was provided, generated according to the conditions defined in Hermann et al. (2003). Classic discriminant analysis did not reveal any significant distinction (Hermann et al. 2005).

We applied both border sensitive GLVQ algorithms. P-GLVQ was trained with balancing parameter \(\gamma =0.5\). BS-KGLVQ was applied with several values of the \(\theta \) parameter. For all models, we used \(3{,}000\) training epochs with constant learning rate \(\varepsilon =0.01\) and \(6\) prototypes per class. For comparison, an SVM with radial basis function (rbf) kernel was trained. The kernel width \(\sigma _{\mathrm{rbf}}\) and the regularizing parameter \(C\) were manually tuned for best performance (\(\sigma _{\mathrm{rbf}}=0.015\), \(C=200\)). The data were preprocessed by a \(z\)-score transformation and the classification results were obtained by tenfold cross-validation. The respective results are depicted in Table 1.

Table 1 Accuracies and respective standard deviations for the Wilson’s disease classification for the applied classifier models obtained by tenfold cross-validation

BS-KGLVQ achieves drastically improved accuracies compared to standard KGLVQ, which corresponds approximately to BS-KGLVQ with \(\theta =\frac{1}{\sqrt{2}}\), see Table 1. With increasing sensitivity (decreasing \(\theta \) parameter), better accuracies are obtained. However, if the \(\theta \) value drops too low, the best performance of \(89.2\,\%\) test accuracy is lost again. This may be attributed to the fact that the active set \(\hat{\Xi }\) becomes too small for precise learning in those cases. For comparison, we also applied an SVM, which achieves a test accuracy of \(87.4\,\%\). Hence, without the border sensitivity feature, SVMs would be superior. Yet, incorporating border sensitive learning into GLVQ and KGLVQ, these variants outperform SVMs in this application. Further, we remark at this point that the model complexity of the SVM is at least three times larger (on average \(45.5\) support vectors) in comparison to the \(12\) prototypes used for the KGLVQ models.

Regarding the medical question, we have to state that although we achieved a quite high performance using BS-GLVQ, the obtained classification accuracies are not sufficiently high for a secure clinical discrimination. For this purpose, further investigations including an improved database and/or other dissimilarity measures are mandatory.

5 Conclusion and outlook

In this paper, we introduced two strategies for class border sensitive learning in GLVQ. The first adds a penalty term to the cost function to force class border sensitivity of the prototypes; the second uses a parameter control of the sigmoid transfer function, implicitly defining the so-called active sets as those subsets of the whole training data which alone contribute significantly to prototype adaptation. The latter approach takes up an observation about the certainty of class decision boundaries already investigated earlier but not utilized for improved learning so far. This methodology realizes a kind of attention-based or active learning as proposed earlier. The proposed strategies for border sensitive learning, together with a kernelized variant of GLVQ, offer a powerful alternative to SVMs. An advantage of the introduced approaches compared to SVM is the explicit control of the model complexity in GLVQ/KGLVQ, because the number of prototypes is chosen in advance for these models, whereas in SVMs the number of support vectors may become quite large for difficult classification tasks and cannot be explicitly controlled.

We applied and compared the new border sensitive GLVQ approaches on several data sets: artificial data sets were considered for illustration, whereas the real-world data sets offer more challenging difficulties to be surmounted. In particular, a medical data set of neurophysiological data of Wilson’s disease patients was investigated. Border sensitive KGLVQ variants achieve better results than SVMs with significantly lower model complexity. Further, the classification results indicate that a discrimination between the neurological and the non-neurological type of Wilson’s disease can be performed on the basis of electrophysiological impairment profiles. However, this hypothesis needs further (medical) investigations.