1 Introduction

The problem of characterizing data in terms of prototypes commonly arises in contexts such as clustering, latent component analysis, manifold identification, or classifier training [15, 16, 19, 20, 23]. Ideally, prototype identification should be computationally efficient and yield representative and interpretable results either for meaningful downstream processing or for assisting analysts in their decision making. In this paper, we discuss a two-stage approach based on kernel minimum enclosing balls and their characteristic functions that meets all these requirements.

Minimum enclosing balls (MEBs) are central to venerable techniques such as support vector clustering [3] or support vector data description [17]. More recently, balls have shown remarkable success in structuring deep representation learning [6, 7, 14]. Here, we revisit the kernel MEB problem and discuss how to solve it using simple recurrent neural networks. Drawing on recent work [1], which showed that recurrent neural networks can accomplish Frank-Wolfe optimization [9], we show how the Frank-Wolfe algorithm can be used to find MEBs and how this approach can be interpreted in terms of reservoir computing.

The solution to the kernel MEB problem consists of a set of support vectors that define the surface of a ball in a high dimensional feature space. Its interior, too, can be characterized in terms of a function of its support vectors. Local minima of this function coincide with representative and easily interpretable prototypes of the given data, and we show that, for balls computed using Gaussian kernels, these minima are naturally found via generalized mean shifts [4, 10]. Since the mean shift procedure, too, can be interpreted in terms of reservoir computing, the approach we present in this paper constitutes an entirely neurocomputing-based method for prototype extraction.

Fig. 1. A 2D data set, its Euclidean MEB, and several Gaussian kernel MEBs. Squares indicate which data points support the surface of the corresponding ball

2 Minimum Enclosing Balls in Data- and Feature Space

In order for our presentation to be self-contained, we begin with a brief review of the minimum enclosing ball (MEB) problem.

Given a data matrix \(\varvec{{X}} = [ \varvec{{x}}_1, \ldots , \varvec{{x}}_n ] \in {\mathbb {R}}^{m \times n}\), the minimum enclosing ball problem asks for the smallest Euclidean m-ball \({\mathcal {B}}(\varvec{{c}}, r)\) with center \(\varvec{{c}} \in {\mathbb {R}}^m\) and radius \(r \in {\mathbb {R}}\) that contains each of the given data points \(\varvec{{x}}_i\).

Understood as an inequality constrained convex minimization problem, the primal MEB problem is to solve

$$\begin{aligned} \begin{aligned} \varvec{{c}}_*, r_*&= \mathop {\mathrm{argmin}}\limits _{\varvec{{c}}, \, r}&r^2 \\&\quad \quad {\text {s.t.}}&\bigl ||\varvec{{x}}_i - \varvec{{c}} \bigr ||^2 - r^2 \le 0 \qquad i = 1, \ldots , n. \end{aligned} \end{aligned}$$
(1)

Evaluating the Lagrangian and Karush-Kuhn-Tucker conditions for (1) yields the corresponding dual MEB problem

$$\begin{aligned} \begin{aligned} \varvec{{\mu }}_* = \mathop {\mathrm{argmax}}\limits _{\varvec{{\mu }}} \;&\; \varvec{{\mu }}^\intercal \varvec{{z}} - \varvec{{\mu }}^\intercal \varvec{{X}}^\intercal \varvec{{X}} \, \varvec{{\mu }} \\ \quad {\text {s.t.}}\quad&\begin{aligned} \varvec{{\mu }}^\intercal \varvec{{1}}&= 1 \\ \varvec{{\mu }}&\succeq \varvec{{0}} \end{aligned} \end{aligned} \end{aligned}$$
(2)

where \(\varvec{{\mu }} \in {\mathbb {R}}^n\) is a vector of Lagrange multipliers, \(\varvec{{0}}, \varvec{{1}} \in {\mathbb {R}}^n\) denote vectors of all zeros and ones, and the entries of \(\varvec{{z}} \in {\mathbb {R}}^n\) are given by \(z_i = \varvec{{x}}_i^\intercal \varvec{{x}}_i\).

The Karush-Kuhn-Tucker conditions further reveal that, once (2) has been solved, the center and radius of the sought-after ball amount to

$$\begin{aligned} \varvec{{c}}_*&= \varvec{{X}} \, \varvec{{\mu }}_* \end{aligned}$$
(3)
$$\begin{aligned} r_*&= \sqrt{\varvec{{\mu }}_*^\intercal \varvec{{z}} - \varvec{{\mu }}_*^\intercal \varvec{{X}}^\intercal \varvec{{X}} \, \varvec{{\mu }}_*} . \end{aligned}$$
(4)
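
To make the mapping from a dual solution to the ball explicit, the following NumPy sketch recovers the center and radius according to (3) and (4). The function name is ours, and we simply assume that \(\varvec{{\mu }}_*\) has already been obtained, for instance from a generic quadratic programming solver.

```python
import numpy as np

def euclidean_meb_from_dual(X, mu):
    """Recover center (3) and radius (4) of the Euclidean MEB from a dual solution mu."""
    G = X.T @ X                         # Gramian with entries (X^T X)_ij = x_i^T x_j
    z = np.diag(G)                      # z_i = x_i^T x_i
    c = X @ mu                          # center c_* = X mu_*
    r = np.sqrt(mu @ z - mu @ G @ mu)   # radius r_*
    return c, r
```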

Note that the given data points enter the problem in (2) only in the form of inner products with other data points because \(\varvec{{X}}^\intercal \varvec{{X}}\) is an \(n \times n\) Gramian with entries \(( \varvec{{X}}^\intercal \varvec{{X}} )_{ij} = \varvec{{x}}_i^\intercal \varvec{{x}}_j\) and \(\varvec{{z}} = {\text {diag}} [\varvec{{X}}^\intercal \varvec{{X}}]\). The dual thus allows for invoking the kernel trick where inner products are replaced by non-linear kernel functions so as to implicitly solve the problem in a high dimensional feature space.

Hence, letting \(K : {\mathbb {R}}^m \times {\mathbb {R}}^m \rightarrow {\mathbb {R}}\) be a Mercer kernel, we introduce \(\varvec{{K}} \in {\mathbb {R}}^{n \times n}\) where \(K_{ij} = K(\varvec{{x}}_i, \varvec{{x}}_j)\) and \(\varvec{{k}} \in {\mathbb {R}}^n\) such that \(\varvec{{k}} = {\text {diag}}[\varvec{{K}}]\) and obtain the kernel MEB problem

$$\begin{aligned} \begin{aligned} \varvec{{\mu }}_* = \mathop {\mathrm{argmax}}\limits _{\varvec{{\mu }}} \;&\; \varvec{{\mu }}^\intercal \varvec{{k}} - \varvec{{\mu }}^\intercal \varvec{{K}} \, \varvec{{\mu }} \\ \quad {\text {s.t.}}\quad&\begin{aligned} \varvec{{\mu }}^\intercal \varvec{{1}}&= 1 \\ \varvec{{\mu }}&\succeq \varvec{{0}}. \end{aligned} \end{aligned} \end{aligned}$$
(5)

Once (5) has been solved, the radius of the minimum enclosing ball in feature space can be computed analogously to (4), namely

$$\begin{aligned} r_* = \sqrt{\varvec{{\mu }}_*^\intercal \varvec{{k}} - \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_*}. \end{aligned}$$
(6)

However, the center of the feature space ball cannot be computed similarly since (3) does not lend itself to the kernel trick. Nevertheless, computing

$$\begin{aligned} \varvec{{c}}_*^\intercal \varvec{{c}}_* = \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_* \end{aligned}$$
(7)

still allows for checking whether or not an arbitrary \(\varvec{{x}} \in {\mathbb {R}}^m\) resides within the kernel MEB of the given data. This is because the inequality \(||\varvec{{x}} - \varvec{{c}}_* ||^2 \le r_*^2\) can be rewritten as

$$\begin{aligned} K(\varvec{{x}}, \varvec{{x}}) - 2 \, \varvec{{\kappa }}^\intercal \varvec{{\mu }}_* + \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_* \le \varvec{{\mu }}_*^\intercal \varvec{{k}} - \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_* \end{aligned}$$
(8)

where \(\varvec{{\kappa }} \in {\mathbb {R}}^n\) in the second term on the left has entries \(\kappa _i = K(\varvec{{x}}, \varvec{{x}}_i)\).

Figure 1 compares the Euclidean minimum enclosing ball of a set of 2D data to kernel minimum enclosing balls computed using Gaussian kernels

$$\begin{aligned} K(\varvec{{x}}_i, \varvec{{x}}_j) = \exp \left( - \frac{||\varvec{{x}}_i - \varvec{{x}}_j ||^2}{2 \, \lambda ^2} \right) \end{aligned}$$
(9)

with different scale parameters \(\lambda \). In order to visualize the surfaces of the feature space balls in the original data space, we considered the function

$$\begin{aligned} f(\varvec{{x}}) = \sqrt{K(\varvec{{x}}, \varvec{{x}}) - 2 \, \varvec{{\kappa }}^\intercal \varvec{{\mu }}_* + \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_*} - \sqrt{\varvec{{\mu }}_*^\intercal \varvec{{k}} - \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_*} \end{aligned}$$
(10)

and highlighted the contour for which \(f(\varvec{{x}}) = 0\). Note that \(f(\varvec{{x}})\) can be seen as a characteristic function of the corresponding MEB \({\mathcal {B}}\), because \(f(\varvec{{x}}) \le 0 \Leftrightarrow \varvec{{x}} \in {\mathcal {B}}\) and \(f(\varvec{{x}}) > 0 \Leftrightarrow \varvec{{x}} \not \in {\mathcal {B}}\).
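
For illustration, the following NumPy sketch evaluates \(f(\varvec{{x}})\) for the Gaussian kernel (9), for which \(K(\varvec{{x}}, \varvec{{x}}) = 1\); the function names are ours, and the kernel matrix is recomputed on every call purely to keep the sketch short.

```python
import numpy as np

def gaussian_gram(X, lam):
    """Kernel matrix K and its diagonal k for the Gaussian kernel (9)."""
    sq = np.sum(X**2, axis=0)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X.T @ X) / (2.0 * lam**2))
    return K, np.diag(K).copy()

def characteristic_f(x, X, mu, lam):
    """Characteristic function f(x) of the Gaussian kernel MEB, cf. (10)."""
    K, k = gaussian_gram(X, lam)
    kappa = np.exp(-np.sum((X - x[:, None])**2, axis=0) / (2.0 * lam**2))  # kappa_i = K(x, x_i)
    K_mu = K @ mu
    dist = np.sqrt(1.0 - 2.0 * kappa @ mu + mu @ K_mu)   # feature space distance to the center
    radius = np.sqrt(mu @ k - mu @ K_mu)                 # radius r_* as in (6)
    return dist - radius
```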

Finally, we note that those data points \(\varvec{{x}}_i\) which support the surface of an MEB \({\mathcal {B}}\) in data- or in feature space are easily identified. This is because only if \(\varvec{{x}}_i\) resides on the surface of the ball will its Lagrange multiplier \(\mu _{i*}\) exceed zero; for points inside the ball the inequality constraints in (2) or (5) are inactive and their multipliers vanish. Below, we will refer to points whose multipliers exceed zero as the support vectors \(\varvec{{s}}_j\) of \({\mathcal {B}}\).

Algorithm 1. Frank-Wolfe optimization for the kernel MEB problem

3 Neural Computation of Minimum Enclosing Balls

Next, we discuss how the Frank-Wolfe algorithm [9] solves the kernel MEB problem and how this approach can be interpreted in terms of reservoir computing.

Observe that the kernelized dual Lagrangian \({\mathcal {D}}(\varvec{{\mu }}) = \varvec{{\mu }}^\intercal \varvec{{k}} - \varvec{{\mu }}^\intercal \varvec{{K}} \, \varvec{{\mu }}\) in (5) is concave so that \(-{\mathcal {D}}(\varvec{{\mu }}) = \varvec{{\mu }}^\intercal \varvec{{K}} \, \varvec{{\mu }} - \varvec{{\mu }}^\intercal \varvec{{k}}\) is convex. We may therefore rewrite the maximization problem in (5) in terms of a minimization problem

$$\begin{aligned} \varvec{{\mu }}_* = \mathop {\mathrm{argmin}}\limits _{\varvec{{\mu }} \in {\varDelta }^{n-1}} \; \varvec{{\mu }}^\intercal \varvec{{K}} \, \varvec{{\mu }} - \varvec{{\mu }}^\intercal \varvec{{k}} \end{aligned}$$
(11)

where we also exploited that the non-negativity and sum-to-one constraints in (5) require any feasible solution to reside in the standard simplex \({\varDelta }^{n-1} \subset {\mathbb {R}}^n\).

Written as in (11), our problem is clearly recognizable as an instance of a convex minimization problem over a compact convex set and we note that the Frank-Wolfe algorithm provides a simple iterative solver for this setting.

Algorithm 1 shows how it specializes to our context: Given an initial guess \(\varvec{{\mu }}_0\) for the solution, each iteration of the algorithm determines which \(\varvec{{\nu }}_t \in {\varDelta }^{n-1}\) minimizes the inner product \(\varvec{{\nu }}^\intercal \nabla {\mathcal {D}}(\varvec{{\mu }}_t)\) and applies a conditional gradient update \(\varvec{{\mu }}_{t+1} = \varvec{{\mu }}_t + \eta _t \, (\varvec{{\nu }}_t - \varvec{{\mu }}_t)\) where the step size \(\eta _t = \tfrac{2}{t+2} \in [0,1]\) decreases over time. This way, updates will never leave the feasible set and the efficiency of the algorithm stems from the fact that it turns a quadratic problem into a series of simple linear problems.
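
A minimal NumPy sketch of this iteration reads as follows; it already exploits the fact, made precise below, that the linear subproblem is solved by a vertex of the simplex, and it uses the gradient \(2 \, \varvec{{K}} \varvec{{\mu }} - \varvec{{k}}\) of the negated dual Lagrangian derived in the next paragraph. The uniform initialization and the fixed number of iterations are illustrative choices of ours.

```python
import numpy as np

def frank_wolfe_meb(K, k, T=1000):
    """Frank-Wolfe iteration for the dual kernel MEB problem (11)."""
    n = K.shape[0]
    mu = np.full(n, 1.0 / n)              # feasible initial guess on the simplex
    for t in range(T):
        grad = 2.0 * K @ mu - k           # gradient of the negated dual Lagrangian
        nu = np.zeros(n)
        nu[np.argmin(grad)] = 1.0         # minimizing vertex e_j of the simplex, cf. (12)
        eta = 2.0 / (t + 2.0)             # decreasing step size
        mu = (1.0 - eta) * mu + eta * nu  # conditional gradient update
    return mu
```

Given the resulting multipliers, the radius of the ball then follows from (6).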

Next, we build on recent work [1] and show how Frank-Wolfe optimization for the kernel MEB problem can be implemented by means of rather simple recurrent neural networks.

For the gradient of the negated dual Lagrangian \(-{\mathcal {D}}(\varvec{{\mu }})\), we simply have \(-\nabla {\mathcal {D}}(\varvec{{\mu }}) = 2 \, \varvec{{K}} \varvec{{\mu }} - \varvec{{k}}\) so that each iteration of the Frank-Wolfe algorithm has to compute

$$\begin{aligned} \varvec{{\nu }}_t&= \mathop {\mathrm{argmin}}\limits _{\varvec{{\nu }} \in {\varDelta }^{n-1}} \, \varvec{{\nu }}^\intercal \bigl [ 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr ]. \end{aligned}$$
(12)

The objective function in (12) is linear in \(\varvec{{\nu }}\) and needs to be minimized over a compact convex set. Since minima of a linear function over a compact convex set are necessarily attained at a vertex, the solution of (12) must coincide with a vertex of \({\varDelta }^{n-1}\). Since the vertices of the standard simplex in \({\mathbb {R}}^n\) correspond to the standard basis vectors \(\varvec{{e}}_j \in {\mathbb {R}}^n\), we can cast (12) as

$$\begin{aligned} \varvec{{\nu }}_t&= \mathop {\mathrm{argmin}}\limits _{\varvec{{e}}_j \in {\mathbb {R}}^n } \, \varvec{{e}}_j^\intercal \bigl [ 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr ] \approx \varvec{{g}}_\beta \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr ) \end{aligned}$$
(13)

where \(\varvec{{g}}_\beta (\varvec{{x}})\), introduced in the approximation on the right of (13), represents the vector-valued softmin operator. Its i-th component is given by

$$\begin{aligned} \bigl ( \varvec{{g}}_\beta (\varvec{{x}}) \bigr )_i = \frac{e^{-\beta x_i}}{\sum _j e^{-\beta x_j}} \end{aligned}$$
(14)

and we note that

$$\begin{aligned} \lim _{\beta \rightarrow \infty } \varvec{{g}}_\beta (\varvec{{x}}) = \mathop {\mathrm{argmin}}\limits _{\varvec{{e}}_j \in {\mathbb {R}}^n} \varvec{{e}}_j^\intercal \varvec{{x}} = \varvec{{e}}_i. \end{aligned}$$
(15)

Based on the relaxed optimization step in (13), we can therefore rewrite the Frank-Wolfe updates for our problem as

$$\begin{aligned} \varvec{{\mu }}_{t+1}&= \varvec{{\mu }}_t + \eta _t \, \bigl [ \varvec{{\nu }}_t - \varvec{{\mu }}_t \bigr ] \end{aligned}$$
(16)
$$\begin{aligned}&= (1-\eta _t) \, \varvec{{\mu }}_t + \eta _t \, \varvec{{\nu }}_t \end{aligned}$$
(17)
$$\begin{aligned}&\approx (1-\eta _t) \, \varvec{{\mu }}_t + \eta _t \, \varvec{{g}}_\beta \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr ). \end{aligned}$$
(18)

For an appropriately chosen softmin parameter \(\beta \), the non-linear dynamical system in (18) mimics the Frank-Wolfe algorithm up to arbitrary precision and can therefore solve the kernel MEB problem.
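
A sketch of these relaxed updates, with the softmin operator (14) implemented in a numerically stable way (shifting its argument by a constant leaves (14) unchanged), might look as follows; the values of \(\beta \) and the iteration count are illustrative choices of ours.

```python
import numpy as np

def softmin(x, beta):
    """Vector-valued softmin operator g_beta in (14)."""
    e = np.exp(-beta * (x - x.min()))    # shifting by min(x) leaves (14) unchanged
    return e / e.sum()

def neural_meb(K, k, beta=100.0, T=1000):
    """Reservoir-style relaxation (18) of the Frank-Wolfe updates."""
    n = K.shape[0]
    mu = np.full(n, 1.0 / n)
    for t in range(T):
        eta = 2.0 / (t + 2.0)            # step size, acting as the leaking rate
        mu = (1.0 - eta) * mu + eta * softmin(2.0 * K @ mu - k, beta)
    return mu
```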

From the point of view of neurocomputing, this is of interest because the system in (18) is algebraically equivalent to the equations that govern the internal dynamics of the simple recurrent architectures known as echo state networks [11]. In other words, we can think of this system in terms of a reservoir of n neurons whose synaptic connections are encoded in the matrix \(2 \, \varvec{{K}}\). The system evolves with fixed input \(\varvec{{k}}\), and its non-linear readout happens according to (6) and (7). The step size \(\eta _t\) assumes the role of the leaking rate of the reservoir. Since \(\eta _t\) decays towards zero, neural activities will stabilize and the system is guaranteed to approach a fixed point \(\varvec{{\mu }}_* = \lim _{t \rightarrow \infty } \varvec{{\mu }}_t\).

What is further worth noting about the reservoir governed by (18) is that its synaptic connections and constant input are determined by the training data for the problem under consideration. Understanding the MEB problem as a learning task, both could be seen as a form of short-term memory. At the beginning of a learning episode, data is loaded into this memory and used to determine support vectors. At the end of a learning episode, only those data points and activities required for decision making, i.e. those \(\varvec{{x}}_i\) and \(\mu _i\) for which \(\mu _i > 0\), need to be persisted in long-term memory in order to compute the characteristic function in (10).

Fig. 2. Two additional 2D data sets. Squares highlight support vectors \(\varvec{{s}}_j\) of Gaussian kernel MEBs; the color coding indicates the characteristic function \(f(\varvec{{x}})\) in (10). The panels on the right show local minima of \(f(\varvec{{x}})\). Points that minimize the function \(f(\varvec{{x}})\) can be understood as prototypes for the given data

4 Neural Reduction of Support Vectors to Prototypes

Figure 2 shows two more 2D data sets for which we computed Gaussian kernel MEBs. Squares highlight support vectors, the coloring indicates values of the characteristic function \(f(\varvec{{x}})\) in (10), and blue dots represent its local minima. Both examples illustrate that (i) the number of support vectors of a kernel MEB is typically smaller than the number of data points they are computed from, (ii) the number of local minima of the characteristic function is typically smaller than the number of support vectors, and (iii) points where the characteristic function attains a local minimum constitute characteristic prototypes for the given data. Curiously, however, we are not aware of any prior work in which minimizers of \(f(\varvec{{x}})\) have been considered as prototypes. Next, we therefore discuss a simple recurrent procedure for computing them.

Solving the kernel MEB problem yields a vector of Lagrange multipliers whose non-zero entries indicate support vectors of \({\mathcal {B}}\). As the multipliers of all other data points equal zero, the characteristic function in (10) can be evaluated using only the support vectors and their multipliers.

Hence, letting \(l \le n\) denote the number of support vectors of \({\mathcal {B}}\), we next collect all of the support vectors of \({\mathcal {B}}\) in a matrix \(\varvec{{S}} = [\varvec{{s}}_1, \ldots , \varvec{{s}}_l] \in {\mathbb {R}}^{m \times l}\) and consider a vector \(\varvec{{\sigma }} \in {\mathbb {R}}^{l}\) of their multipliers. Furthermore, introducing a reduced kernel matrix \(\varvec{{Q}} \in {\mathbb {R}}^{l \times l}\) where \(Q_{ij} = K(\varvec{{s}}_i, \varvec{{s}}_j)\) and a kernel vector \(\varvec{{q}} \in {\mathbb {R}}^l\) such that \(\varvec{{q}} = {\text {diag}}[\varvec{{Q}}]\) allows us to rewrite the characteristic function in (10) as

$$\begin{aligned} f(\varvec{{x}}) = \sqrt{K(\varvec{{x}}, \varvec{{x}}) - 2 \, \varvec{{\kappa }}^\intercal \varvec{{\sigma }} + \varvec{{\sigma }}^\intercal \varvec{{Q}} \, \varvec{{\sigma }}} - \sqrt{\varvec{{\sigma }}^\intercal \varvec{{q}} - \varvec{{\sigma }}^\intercal \varvec{{Q}} \, \varvec{{\sigma }}} = \sqrt{d(\varvec{{x}})} - r_* \end{aligned}$$
(19)

where the entries of \(\varvec{{\kappa }} \in {\mathbb {R}}^l\) now amount to \(\kappa _j = K(\varvec{{x}}, \varvec{{s}}_j)\) and where function \(d : {\mathbb {R}}^m \rightarrow {\mathbb {R}}\) computes the squared feature space distance between \(\varvec{{x}}\) and the center of \({\mathcal {B}}\).
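
In code, assembling \(\varvec{{S}}\), \(\varvec{{\sigma }}\), \(\varvec{{Q}}\), and \(\varvec{{q}}\) from a dual solution amounts to a simple restriction to the support indices; the threshold used below to decide which multipliers count as non-zero is our own choice, needed because numerical solvers drive the multipliers of interior points only approximately to zero.

```python
import numpy as np

def reduce_to_support(X, K, mu, tol=1e-8):
    """Restrict data, multipliers, and kernel matrix to the support vectors (mu_i > tol)."""
    idx = np.flatnonzero(mu > tol)
    S = X[:, idx]                        # S = [s_1, ..., s_l]
    sigma = mu[idx]                      # multipliers of the support vectors
    Q = K[np.ix_(idx, idx)]              # reduced kernel matrix Q
    q = np.diag(Q).copy()                # q = diag[Q]
    return S, sigma, Q, q
```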

Writing the characteristic function like this and observing that the square root is monotone, it is clear that the problem of estimating local minimizers of \(f(\varvec{{x}})\) is equivalent to the problem of estimating those \(\varvec{{x}}\) for which the gradient of \(d(\varvec{{x}})\) vanishes.

Assuming that \(K(\cdot , \cdot )\) is a Gaussian kernel such as in (9), we have \(K(\varvec{{x}}, \varvec{{x}}) = 1\) so that the gradient of \(d(\varvec{{x}})\) becomes

$$\begin{aligned} \nabla d(\varvec{{x}}) = \frac{2}{\lambda ^2} \sum _{j=1}^{l} \sigma _j \, K(\varvec{{x}}, \varvec{{s}}_j) \, \bigl [ \varvec{{x}} - \varvec{{s}}_j \bigr ]. \end{aligned}$$
(20)

Equating the right hand side to \(\varvec{{0}}\) provides

$$\begin{aligned} \varvec{{x}} = \frac{\sum _j \sigma _j \, K(\varvec{{x}}, \varvec{{s}}_j) \varvec{{s}}_j}{\sum _j \sigma _j \, K(\varvec{{x}}, \varvec{{s}}_j)} = \frac{\sum _j \sigma _j \, \kappa _j \varvec{{s}}_j}{\sum _j \sigma _j \, \kappa _j} = \frac{\varvec{{S}} \varvec{{{\varSigma }}} \varvec{{\kappa }}}{\varvec{{\kappa }}^\intercal \varvec{{\sigma }}} = \varvec{{S}} \varvec{{{\varSigma }}} \varvec{{D}}^{-1} \varvec{{\kappa }} \end{aligned}$$
(21)

where we introduced the diagonal matrices \(\varvec{{{\varSigma }}} = {\text {diag}}[\varvec{{\sigma }}]\) and \(\varvec{{D}} = {\text {diag}}[(\varvec{{\kappa }}^\intercal \varvec{{\sigma }}) \, \varvec{{1}}]\). This result says that local minima of \(f(\varvec{{x}})\) correspond to weighted means, that is, convex combinations, of the support vectors of \({\mathcal {B}}\).

We also recognize (21) as an extension of classical mean shift updates [4, 10] (where there are no scaling parameters \(\sigma _j\)). Hence, when started with \(\varvec{{x}}_0 \in {\mathbb {R}}^m\), the following process with step size \(\gamma _t \in [0,1]\) will find the nearest minimizer of the characteristic function

$$\begin{aligned} \varvec{{\kappa }}_t&= {\text {vec}} \bigl [ K(\varvec{{x}}_t, \varvec{{s}}_j)_j \bigr ] \end{aligned}$$
(22)
$$\begin{aligned} \varvec{{D}}_t&= {\text {diag}} \bigl [ (\varvec{{\kappa }}_t^\intercal \varvec{{\sigma }}) \, \varvec{{1}} \bigr ] \end{aligned}$$
(23)
$$\begin{aligned} \varvec{{x}}_{t+1}&= (1 - \gamma _t) \, \varvec{{x}}_t + \gamma _t \, \varvec{{S}} \varvec{{{\varSigma }}} \varvec{{D}}_t^{-1} \varvec{{\kappa }}_t. \end{aligned}$$
(24)

Looking at (24), we recognize these dynamics as yet another variant of the internal dynamics of a reservoir of neurons and note that, for \(\gamma _t = 1\), the updates in (24) become the mean shift updates in (21). In other words, mode seeking via mean shifts can be seen as yet another form of neurocomputing.

Letting \(\varvec{{x}}_0 \leftarrow \varvec{{s}}_j\) be a copy of one of the support vectors in \(\varvec{{S}}\) and starting mode seeking at this point will identify the minimizer closest to this support vector. Repeating this process for all the support vectors of \({\mathcal {B}}\) will thus collapse them into another, usually smaller, set of points that can be understood as prototypes of the given data. Collecting these in a matrix \(\varvec{{P}} \in {\mathbb {R}}^{m \times p}\) where \(p \le l \le n\) therefore provides a reduced representation for a variety of downstream processing steps.
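
The following sketch combines these steps: it runs the updates (22)-(24) with \(\gamma _t = 1\), i.e. plain generalized mean shift, from every support vector and merges the resulting fixed points into a prototype matrix; the fixed iteration count and the merging tolerance are illustrative choices of ours.

```python
import numpy as np

def meb_prototypes(S, sigma, lam, T=200, merge_tol=1e-3):
    """Collapse support vectors into prototypes via mean shift, cf. (21)-(24)."""
    prototypes = []
    for j in range(S.shape[1]):
        x = S[:, j].copy()                   # start mode seeking at a support vector
        for _ in range(T):
            kappa = np.exp(-np.sum((S - x[:, None])**2, axis=0) / (2.0 * lam**2))
            w = sigma * kappa                # weights sigma_j * kappa_j
            x = S @ (w / w.sum())            # weighted mean of the support vectors, cf. (21)
        if all(np.linalg.norm(x - p) >= merge_tol for p in prototypes):
            prototypes.append(x)             # keep only distinct local minimizers
    return np.column_stack(prototypes)       # P in R^{m x p} with p <= l
```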

Table 1. Sample mean and prototypes extracted from the CBCL face data

5 Practical Examples

In order to provide illustrative examples of the performance of our approach, we next present results obtained in experiments with three standard benchmark data sets: the MIT CBCL face database (Footnote 1) contains intensity images of different faces recorded under various illumination conditions, the well-known MNIST database [12] consists of intensity images of ten classes of handwritten digits, and the recently introduced MNIST-Fashion data [22] contains intensity images of fashion items, again sampled from ten classes.

For each experimental setting, we vectorized the designated training samples, which left us with a data matrix \(\varvec{{X}} \in {\mathbb {R}}^{361 \times 2429}\) for the CBCL data and matrices \(\varvec{{X}} \in {\mathbb {R}}^{361 \times 6000}\) for each class in the two MNIST data sets.

In each experiment, we computed the sample mean \(\bar{\varvec{{x}}} = \tfrac{1}{n} \varvec{{X}} \varvec{{1}}\) as a reference prototype and normalized the data in \(\varvec{{X}}\) to zero mean and unit variance before running our procedure. Scale parameters \(\lambda \) for the Gaussian MEB kernels were determined using the method in [8] and reused during mean shift computation; the activation function for neural MEB computation was set to \(\varvec{{g}}_\infty \).

A favorable property of MEB-based prototype identification is that it does not require manual specification of the number p of prototypes: minimum enclosing ball computation and mean shift on the resulting support vectors automatically identify appropriate numbers l and p of support vectors and prototypes. Hence, after having obtained p MEB-based prototypes for each data matrix, we also ran k-means clustering with \(k = p\) in order to provide an intuitive baseline for comparison.

Table 2. Sample mean and prototypes extracted from MNIST digit data

The rightmost column of Table 1 shows the \(p=29\) MEB-based prototypes we found for the CBCL face data; the center column of the table shows cluster prototypes resulting from k-means clustering for \(k=29\) and the single image in the leftmost column depicts the overall sample mean for comparison.

The cluster means in the center column represent average faces which are smoothed to an extent that makes it difficult to discern characteristic features. The MEB prototypes, on the other hand, show distinguishable and therefore interpretable visual characteristics. In other words, these prototypes reveal that the CBCL data contains pictures of faces of people of pale or dark complexion, of people wearing glasses, sporting mustaches, or having been photographed under varying illumination conditions.

Table 3. Sample mean and prototypes extracted from MNIST fashion data

What is further worth noting is that several of the MEB-based prototypes coincide with given images or, put differently, with actual data points. This phenomenon is known from latent factor models such as archetypal analysis [2, 5, 19] or CUR decompositions [13, 18, 21] and usually considered beneficial for interpretability [16]. The fact that we observe it here suggests that, for real world data, some of the support vectors of a kernel minimum enclosing ball themselves constitute minima of the corresponding characteristic function so that the above mean shift procedure will not reduce them any further. Since support vectors reside on the boundary of a given data set, this also explains the apparent variety among the MEB-based prototypes. While this also holds for prototypes extracted via archetypal analysis or CUR decompositions, the prototypes resulting from our approach do not exclusively coincide with extremal data points. In fact, some of them resemble the overall sample mean or the local means found via k-means. In contrast to archetypal analysis, CUR decompositions, or k-means clustering, we therefore observe that our MEB-based approach produces extremal and central prototypes simultaneously.

Tables 2 and 3 show examples of results obtained from the MNIST data sets. These are apparently analogous to the results we just discussed and therefore corroborate that our approach identifies prototypes that cover a wide variety of aspects of a data set.

6 Conclusion

The problem of extracting representative prototypes from a given set of data frequently arises in data clustering, latent component analysis, manifold identification, or classifier training. Methods for this purpose should scale well and yield meaningful and interpretable results so as to assist downstream processing or decision making by analysts. In this paper, we proposed a two-stage approach based on kernel minimum enclosing balls and their characteristic functions. Our approach can be computed efficiently, and empirical results suggest that it yields notably distinct and therefore interpretable prototypes. In contrast to established techniques for clustering or factor analysis, our method yields central and extremal prototypes alike.

From the point of view of neurocomputing, our approach is interesting in that it can be computed using simple recurrent neural networks. Building on recent work in [1], we showed that kernel minimum enclosing balls can be computed using architectures akin to those found in reservoir computing. We also showed that, if kernel minimum enclosing balls are determined w.r.t. Gaussian kernels, the problem of further reducing the support vectors of a ball naturally leads to a variant of the mean shift procedure, which can be understood as a form of recurrent neural computation, too.