1 Introduction

The notion of novelty discovery (or detection)  [6] can be described as a one-class classification problem (a.k.a. data domain description  [11]) that aims to learn characteristics of the analyzed dataset in order to separate novel data points. It finds applications in numerous scientific and engineering areas such as detecting fraudulent activity in financial applications or rare events in medical monitoring  [6]. Although reservoir computing based approaches  [4] have been applied to a variety of classification and regression problems, to the best of our knowledge, corresponding methods oriented towards one-class problems are scarce. The main contribution of this work is the use of Minimum Enclosing Balls  [1] for novelty discovery. Minimum Enclosing Balls (MEBs) belong to the class of unsupervised representation learning methods and can be used to extract important characteristics of the considered datasets  [1]. The main idea behind MEBs is to determine the smallest ball encapsulating the entire dataset in data or feature space. This ball can be found by formulating the problem as an inequality constrained convex minimization problem whose dual allows for invoking the kernel trick and can be solved using dynamical processes from reservoir computing  [1].

Our contribution combines the decisions of a set of Kernel Minimum Enclosing Balls (KMEBs) into a compound novelty score, which allows, for instance, a majority voting based detection, since decisions based on single balls can be limiting for novelty detection. In addition, our methodology can be easily implemented in neuromorphic architectures and is capable of dealing with nonlinear patterns due to kernelization  [1, 10]. Figure 1 shows an illustrative example of our idea of detecting novel data points (green diamond shaped points in Fig. 1a) given a dataset of normal data points (black round points in Fig. 1a); these points can be detected neither with Euclidean Minimum Enclosing Balls (see Fig. 1b) nor with probabilistic novelty detection such as the deviation from the sample mean  [6]. Instead, by considering the characteristic functions of multiple KMEBs (see the example in Fig. 1c) with differently scaled Gaussian kernels, we can detect all novel points even though the considered balls might not individually be capable of capturing them (compare the results of Fig. 1d to the others).

Fig. 1.

A conceptual example illustrating the idea of utilizing Kernel Minimum Enclosing Balls for novelty discovery. (a) shows the data (the inner ring), which is used to compute the ball, and the novel (green diamond) points. It is important to note that neither considering the deviation from the mean vector nor computing the Euclidean MEB, which is shown in (b), can in this case isolate the points inside the inner ring. (c) shows a heat-map of the characteristic function from Eq. 7, where the colors orange, white and blue respectively indicate positive, zero and negative values. (d–h) show the dataset and the novel points together with the decision boundaries for different Gaussian kernel scale values \(\lambda \). When used individually to detect the novel points, the balls in (d), (e) and (f–h) achieve recall values of 0.935, 0.995 and 1.000, respectively. We obtain 1.000 recall when majority-voting over the predictions of the balls (d–h) (i.e. by considering an ensemble of 5 KMEBs with evenly spaced \(\lambda \) values over [0.4, 0.6]). (Color figure online)

The remainder of the paper is organized as follows. To be self-contained, in Sect. 2 we formally define the notion of KMEBs, show how they can be computed following a process akin to the ones used in echo state networks and, finally, show how, once a ball has been computed, its support vectors can be used to characterize its interior. In Sect. 3 we introduce a new novelty scoring methodology based on these characteristic functions. In Sect. 4 we present empirical results on real world datasets to evaluate our approach and in Sect. 5 we conclude our work.

2 An Overview of Kernel Minimum Enclosing Balls

Given a set of m-dimensional data points \(\mathcal {X}=\{\varvec{{x}}_1, \ldots , \varvec{{x}}_n\}\) (for \(\varvec{{x}}_i\in \mathbb {R}^{m}\)) that are grouped into a column data matrix \(\varvec{{X}} = [ \varvec{{x}}_1, \ldots , \varvec{{x}}_n ] \in \mathbb {R}^{m \times n}\), we aim to find the smallest m-ball \(\mathcal {B}(\varvec{{c}}, r)\) containing all of the given data points in \(\mathcal {X}\), where \(\varvec{{c}} \in \mathbb {R}^m\) and \(r \in \mathbb {R}\) are respectively the center and the radius of \(\mathcal {B}\). Finding MEBs can be cast as the convex optimization problem

$$\begin{aligned} \begin{aligned} \varvec{{c}}_*, r_*&= \mathop {\text {argmin}}\limits _{\varvec{{c}}, \, r} r^2 \\&\quad \quad {\text {s.t.}}&\bigl \Vert \varvec{{x}}_i - \varvec{{c}} \bigr \Vert ^2 - r^2 \le 0 \qquad i \in [1, \ldots , n]. \end{aligned} \end{aligned}$$
(1)

Upon evaluating the Lagrangian and the KKT conditions, the negated dual of (1) allows for the kernel trick (as the data only occurs in the form of inner products  [1]) and can be written as the minimization problem

$$\begin{aligned} \begin{aligned} \varvec{{\mu }}_*= \mathop {\text {argmin}}\limits _{\varvec{{\mu }}} \;&\; \varvec{{\mu }}^\intercal \varvec{{K}} \, \varvec{{\mu }} - \varvec{{\mu }}^\intercal \varvec{{k}}\\ \quad {\text {s.t.}}\quad&\begin{aligned} \sum _{i=1}^{n} \mu _i = 1 \quad \wedge \quad \mu _j \ge 0 \; \; \forall \; \; j \in [1, \ldots , n], \end{aligned} \end{aligned} \end{aligned}$$
(2)

where \(\varvec{{K}} \in \mathbb {R}^{n \times n}\) is a kernel matrix, \(\varvec{{k}}\) contains its diagonal (i.e. \(\varvec{{k}} = \operatorname {diag}[\varvec{{K}}]\)) and \(\varvec{{\mu }} \in \mathbb {R}^n\) contains the Lagrange multipliers. The kernel matrix \(\varvec{{K}}\) in (2) is built by considering a Mercer kernel \(K : \mathbb {R}^m \times \mathbb {R}^m \rightarrow \mathbb {R}\) such that \(K_{ij} = K(\varvec{{x}}_i, \varvec{{x}}_j)\). An example kernel function, which we consider throughout this work, is the Gaussian kernel, defined for scale parameter \(\lambda \) as \(K(\varvec{{x}}_i, \varvec{{x}}_j) = \exp \left( - \frac{\Vert \varvec{{x}}_i - \varvec{{x}}_j \Vert ^2}{2 \, \lambda ^2} \right) \).
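
To make this construction concrete, the following NumPy sketch (the function name and implementation details are ours and merely illustrate the formulas above) builds the Gaussian kernel matrix \(\varvec{{K}}\) and its diagonal \(\varvec{{k}}\) from a column data matrix:

```python
import numpy as np

def gaussian_kernel_matrix(X, lam):
    """Gaussian kernel matrix with K_ij = exp(-||x_i - x_j||^2 / (2 * lam^2)).

    X is an (m x n) column data matrix as in the text; returns K and k = diag[K]."""
    sq_norms = np.sum(X ** 2, axis=0)                      # ||x_i||^2 for every column
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.T @ X
    K = np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lam ** 2))
    k = np.diag(K).copy()                                  # all ones for the Gaussian kernel
    return K, k
```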

Considering (2), we note that finding Kernel Minimum Enclosing Balls boils down to finding the optimal \(\varvec{{\mu }}\) that resides in the standard simplex \(\varDelta ^{n-1}\) and minimizes the convex function \(\mathcal {L}(\varvec{{\mu }})= \varvec{{\mu }}^\intercal \varvec{{K}} \, \varvec{{\mu }} - \varvec{{\mu }}^\intercal \varvec{{k}}\). Optimization problems of this kind can be easily solved iteratively using the Frank-Wolfe algorithm [3], which itself can be implemented as a recurrent neural network (see examples in [1, 2, 8, 10]). To this end, at each iteration t, the Frank-Wolfe algorithm evaluates the gradient of the negated dual Lagrangian \(\mathcal {L}(\varvec{{\mu }})\) from (2), which amounts to \(\nabla \mathcal {L}(\varvec{{\mu }}) = 2 \, \varvec{{K}} \varvec{{\mu }} - \varvec{{k}}\), and finds the vertex of \(\varDelta ^{n-1}\) for the update that minimizes

$$\begin{aligned} \varvec{{\nu }}_t&= \mathop {\text {argmin}}\limits _{\varvec{{v}}_j \in \mathbb {R}^n} \, \varvec{{v}}_j^\intercal \bigl [ 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr ] \approx \varvec{{g}}_\beta \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr ), \end{aligned}$$
(3)

where \(\varvec{{\nu }}_t \in \mathbb {R}^{n}\) represents the solution of the linear subproblem at iteration t, \(\varvec{{v}}_j\) is the jth standard basis vector \(\varvec{{v}}_j =[\delta _{j1},\delta _{j2},\dots ,\delta _{jn} ]^\intercal \) (here \(\delta _{ji}\) denotes the Kronecker delta) and, finally, \(\varvec{{g}}_\beta (\varvec{{x}})\) denotes the soft-min operator. This operator is a smooth approximation of \(\mathop {\text {argmin}}\nolimits _{\cdot }\), whose ith entry is defined as \(\bigl ( \varvec{{g}}_\beta (\varvec{{x}}) \bigr )_i = \frac{e^{-\beta x_i}}{\sum _j e^{-\beta x_j}}\) and which has the limit

$$\begin{aligned} \lim _{\beta \rightarrow \infty } \varvec{{g}}_\beta (\varvec{{x}}) = \mathop {\text {argmin}}\limits _{\varvec{{v}}_j \in \mathbb {R}^n} \varvec{{v}}_j^\intercal \varvec{{x}} = \varvec{{v}}_i. \end{aligned}$$
(4)

Given this, we can define the convergent iterative Frank-Wolfe updates [1] as

$$\begin{aligned} \varvec{{\mu }}_{t+1} \leftarrow (1-\eta _t) \, \varvec{{\mu }}_t + \eta _t \, \varvec{{g}}_\beta \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr ), \end{aligned}$$
(5)

where \(\eta _t \in [0,1]\) is a monotonically decreasing step size. Rearranging the rightmost expression in (5) as \(\varvec{{g}}_\beta \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr )= \varvec{{g}}_\beta \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t + \bar{\varvec{{K}}}\bar{\varvec{{1}}} \bigr )\), where \(\bar{\varvec{{K}}}=\operatorname {diag}(\varvec{{k}})\) and \(\bar{\varvec{{1}}}=[-1,\dots ,-1]^\intercal \) is the vector of \(-1\)s, allows us to interpret and implement these updates in terms of echo state networks [4]. That is, we can describe this machinery as a structurally constrained echo state network with the fixed input vector \(\bar{\varvec{{1}}}\) containing \(-1\)s, the input weight matrix \(\bar{\varvec{{K}}}\) and n reservoir neurons, where \(\varvec{{g}}_\beta (\cdot )\) and \(2 \, \varvec{{K}}\) are respectively the nonlinear activation function and the reservoir weight matrix, and \(\eta _t\) acts as a leaking rate for updating the Lagrange multipliers. Once optimal Lagrange multipliers have been found using the updates from (5), we can determine the kernelized radius and the squared magnitude of the center of the fitted ball \(\mathcal {B}\) respectively as \(r_* = \sqrt{\varvec{{\mu }}_*^\intercal \varvec{{k}} - \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_*} \) and \(\varvec{{c}}_*^\intercal \varvec{{c}}_* = \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_*\), which allows us to define a characteristic function for the interior of \(\mathcal {B}\)  [1]. Namely, using these equalities we can express the inequality \(\Vert \varvec{{x}} - \varvec{{c}}_* \Vert ^2 \le r_*^2\), which checks whether an arbitrary point \(\varvec{{x}} \in \mathbb {R}^m\) lies within the ball \(\mathcal {B}\), by considering

$$\begin{aligned} f(\varvec{{x}}) = \sqrt{K(\varvec{{x}}, \varvec{{x}}) - 2 \, \bar{\varvec{{k}}}^\intercal \varvec{{\mu }}_* + \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_*} - \sqrt{\varvec{{\mu }}_*^\intercal \varvec{{k}} - \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_*}, \end{aligned}$$
(6)

where \(\bar{\varvec{{k}}} \in \mathbb {R}^n\) is defined as \(\bar{k}_i = K(\varvec{{x}}, \varvec{{x}}_i)\)  [1]. That is, \(f(\varvec{{x}}) > 0\) holds if \(\varvec{{x}}\) is outside of the ball \(\mathcal {B}\), whereas \(f(\varvec{{x}}) \le 0\) holds when \(\varvec{{x}}\) is inside the ball \(\mathcal {B}\). Among the given data points, \(f(\varvec{{x}}_i) = 0\) holds only for the points with nonzero Lagrange multipliers, i.e. the support vectors of \(\mathcal {B}\), which can be defined as \(\mathcal {S}=\{\varvec{{x}}_i \; | \; i \in [1,\dots ,n] \; \wedge \; \mu _{i*}>0\}\). It is worth noting that we can simplify (6) by grouping the \(l \le n\) points in \(\mathcal {S}\) into a column data matrix \(\varvec{{S}} = [ \varvec{{s}}_1, \ldots , \varvec{{s}}_l ] \in \mathbb {R}^{m \times l}\), putting their corresponding multipliers in \(\varvec{{\sigma }} \in \mathbb {R}^{l}\), letting \(\varvec{{Q}} \in \mathbb {R}^{l \times l}\) be the kernel matrix of the support vectors (i.e. \(Q_{ij} = K(\varvec{{s}}_i, \varvec{{s}}_j)\)) and letting \(\varvec{{q}} \in \mathbb {R}^l\) contain its diagonal (i.e. \(\varvec{{q}} = \operatorname {diag}[\varvec{{Q}}]\)), which yields the simpler characteristic function

$$\begin{aligned} f(\varvec{{x}}) = \sqrt{K(\varvec{{x}}, \varvec{{x}}) - 2 \, \bar{\varvec{{k}}}^\intercal \varvec{{\sigma }} + \varvec{{\sigma }}^\intercal \varvec{{Q}} \, \varvec{{\sigma }}} - \sqrt{\varvec{{\sigma }}^\intercal \varvec{{q}} - \varvec{{\sigma }}^\intercal \varvec{{Q}} \, \varvec{{\sigma }}} \end{aligned}$$
(7)

where, as in (6), \(\bar{\varvec{{k}}} \in \mathbb {R}^l\) is evaluated as \(\bar{k}_j = K(\varvec{{x}}, \varvec{{s}}_j)\), and we note that the term \(\sqrt{\varvec{{\sigma }}^\intercal \varvec{{q}} - \varvec{{\sigma }}^\intercal \varvec{{Q}} \, \varvec{{\sigma }}}\) (which indeed amounts to \(r_*\)) does not depend on \(\varvec{{x}}\).
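
To illustrate the overall procedure, the following Python sketch (function names, the fixed iteration budget and the common step-size schedule \(\eta _t = 2/(t+2)\) are our assumptions rather than prescriptions from  [1]) implements the soft-min operator from (4), the Frank-Wolfe updates from (5) and the characteristic function from (7):

```python
import numpy as np

def softmin(x, beta):
    """Soft-min operator g_beta from (4); beta = inf gives a hard (one-hot) argmin."""
    if np.isinf(beta):
        e = np.zeros_like(x)
        e[np.argmin(x)] = 1.0
        return e
    z = np.exp(-beta * (x - x.min()))        # shift by min(x) for numerical stability
    return z / z.sum()

def fit_kmeb(K, beta=np.inf, n_iter=1000):
    """Frank-Wolfe updates (5) for the negated dual (2); returns multipliers and radius."""
    n = K.shape[0]
    k = np.diag(K)
    mu = np.full(n, 1.0 / n)                 # initialize at the barycenter of the simplex
    for t in range(1, n_iter + 1):
        grad = 2.0 * K @ mu - k              # gradient of the negated dual Lagrangian
        nu = softmin(grad, beta)             # (soft) vertex selection, cf. (3)
        eta = 2.0 / (t + 2.0)                # an assumed decreasing step-size schedule
        mu = (1.0 - eta) * mu + eta * nu     # update (5)
    r = np.sqrt(mu @ k - mu @ K @ mu)        # kernelized radius r_*
    return mu, r

def characteristic_function(x, S, sigma, kernel, r):
    """Characteristic function (7): > 0 outside the fitted ball, <= 0 inside.

    S is the (m x l) support-vector matrix, sigma their multipliers,
    kernel(a, b) a Mercer kernel on single points, r the kernelized radius."""
    l = S.shape[1]
    k_bar = np.array([kernel(x, S[:, j]) for j in range(l)])   # k_bar_j = K(x, s_j)
    Q = np.array([[kernel(S[:, i], S[:, j]) for j in range(l)] for i in range(l)])
    dist = np.sqrt(max(kernel(x, x) - 2.0 * k_bar @ sigma + sigma @ Q @ sigma, 0.0))
    return dist - r                          # r does not depend on x and can be precomputed
```

The multipliers and radius returned by fit_kmeb, together with the support vectors, are all that is needed to evaluate (7) for unseen points.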

3 An Ensemble Approach for Novelty Discovery

Having explained how KMEBs are defined, how they can be computed and how their interiors can be determined, we now turn our attention to novelty discovery by combining the characteristic functions of a set of balls. We note that the characteristic function from (7) for a given ball \(\mathcal {B}\) can be used to label the points outside of the ball as novel. In this case a query point \(\varvec{{x}}\) is considered novel if \(f(\varvec{{x}})>0\) and not novel if \(f(\varvec{{x}}) \le 0\). Although this approach can capture novel points, it might result in very restrictive or overly general decision boundaries, which respectively might cause every query point to be detected as novel or as not novel (see Fig. 1d for the latter case). Both problems, however, can be avoided if we generalize this approach by combining the decisions of multiple balls. One approach for such a combination is uniform voting [7]. That is, given a set of u KMEBs \(\mathcal {P}=\{\mathcal {B}_1, \ldots , \mathcal {B}_u\}\), each trained with a different setting, and letting \(f_i(\cdot )\) and \(\llbracket \cdot \rrbracket \) respectively denote the characteristic function from (7) for ball \(\mathcal {B}_i\) and the Iverson bracket, we can assign a novelty score to a query point \(\varvec{{x}}\) by evaluating \(z(\varvec{{x}}) = \sum _{i=1}^{u} \; \llbracket f_i(\varvec{{x}}) > 0 \rrbracket \) and, for instance, label \(\varvec{{x}}\) as novel if \(z(\varvec{{x}}) > u/2\) (i.e. \(\varvec{{x}}\) is outside of the majority of the balls in \(\mathcal {P}\) for an odd u) and as not novel if \(z(\varvec{{x}}) \le u/2\). In the next section, we will empirically evaluate this methodology on two conceptual examples using benchmarking datasets.
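
A minimal sketch of this voting rule, assuming that balls holds the fitted characteristic functions \(f_i\) (e.g. closures built from the sketch at the end of Sect. 2), might read:

```python
def novelty_score(x, balls):
    """Compound score z(x): number of balls whose characteristic function is positive at x."""
    return sum(1 for f in balls if f(x) > 0)

def is_novel(x, balls):
    """Uniform (majority) vote over an ensemble of KMEBs; an odd ensemble size avoids ties."""
    return novelty_score(x, balls) > len(balls) / 2.0
```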

Table 1. Novelty prediction results in terms of recall (RC), precision (PR), as well as the harmonic mean and geometric mean of both (respectively referred to as F1 and GM) for (a) the CBCL face and (b) the MNIST datasets, where the tasks are respectively to detect non-face images among face images and images of digit 0 among those of digit 1. We benchmarked methods that detect novelty by considering the deviation from the sample mean (MDEV), matrix factorization (MF), Euclidean MEBs (EMEB) and the ensemble of kernelized MEBs (EKMEB). The superior prediction results indicate that EKMEB can indeed be used for novelty discovery.

4 Empirical Results

We evaluated our method on the MNIST  [5] and CBCL-face (bit.ly/2KwOVV6) datasets. For the former we trained models on the digit 1 aiming to detect the 0s, whereas for the latter we learned balls on faces to detect non-face novel images. To evaluate the precision of the detections, we divided the training data into 90/10 splits and combined the latter split with the novel points, which resulted in training/evaluation datasets of cardinality 6067/6598 and 2186/4791 for the MNIST and CBCL-face datasets, respectively. We note that for both examples, we constructed ensembles of KMEBs (i.e. distinct \(\mathcal {P}\) sets) with the Gaussian kernel, whose scale values were evenly spaced over specified intervals (as in Fig. 1); here we considered \(u=5\) KMEBs with \(\lambda \) ranging over [40, 60]. We also normalized the datasets to have zero mean and unit variance and always considered \(\beta =\infty \) for the soft-min function (see (4)).
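
Under these settings, constructing such an ensemble might look like the following sketch, which reuses the hypothetical helpers gaussian_kernel_matrix, fit_kmeb and characteristic_function from the earlier sketches; the support-vector tolerance of \(10^{-8}\) is our choice and not prescribed by the method:

```python
import numpy as np
from functools import partial

def gaussian(a, b, lam):
    """Gaussian kernel on single points, K(a, b) = exp(-||a - b||^2 / (2 * lam^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * lam ** 2))

def build_ensemble(X, lambdas, tol=1e-8):
    """Fit one KMEB per Gaussian scale and return its characteristic function f_i."""
    balls = []
    for lam in lambdas:
        K, _ = gaussian_kernel_matrix(X, lam)          # sketch from Sect. 2
        mu, r = fit_kmeb(K, beta=np.inf)               # beta = inf, as in our experiments
        support = mu > tol                             # support vectors: nonzero multipliers
        S, sigma = X[:, support], mu[support]
        balls.append(partial(characteristic_function, S=S, sigma=sigma,
                             kernel=partial(gaussian, lam=lam), r=r))
    return balls

# Example usage (X_train is assumed to be the standardized (m x n) training matrix):
# balls = build_ensemble(X_train, np.linspace(40.0, 60.0, 5))   # u = 5 scales over [40, 60]
```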

In Table 1, we compare our method against thresholding the tested points by the maximum deviation from the sample mean vector  [6], Euclidean MEBs  [1] (where we consider points outside of the ball as novel) and matrix factorization (MF)  [8] based reconstruction to validate the use of kernel methods. For the first method, we label a point in the test set as novel if its Euclidean distance to the sample mean is larger than that of the furthest training point. For the last method, we factorize the training data matrix with \(k=50\) latent factors using the alternating least squares method  [9] and learn a threshold based on the worst reconstruction error (l2-norm); unseen points whose reconstruction error exceeds this threshold are considered novel. Table 1a and Table 1b respectively depict the prediction results for the MNIST and CBCL datasets, where we observe the superiority of ensemble KMEBs in detecting novel data points.
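
For illustration, a minimal reading of the MDEV baseline described above could be sketched as follows (the function name and the closure-based interface are ours):

```python
import numpy as np

def mdev_detector(X_train):
    """MDEV baseline: a test point is novel if it lies farther from the sample mean
    than the farthest training point does."""
    mean = X_train.mean(axis=1)                                      # X_train is (m x n)
    threshold = np.linalg.norm(X_train - mean[:, None], axis=0).max()
    return lambda x: np.linalg.norm(x - mean) > threshold
```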

5 Conclusion and Future Work

In this work, we introduced the idea of using ensembles of KMEBs for novelty discovery. We showed how such ensembles can be constructed and introduced a voting-based approach to detect novel data points. Our empirical evaluation yielded superior results over the use of mean deviation, Euclidean MEBs and matrix factorization approaches. Our future work involves studying different ball selection and novelty determination strategies and extending the scope of the applications. Another line of future work is related to physical implementations of our methodology and its use in resource-constrained devices for applications in industrial domains such as predictive maintenance.