Abstract
We introduce the idea of utilizing ensembles of Kernel Minimum Enclosing Balls to detect novel datapoints. To this end, we propose a novelty scoring methodology that is based on combining outcomes of the corresponding characteristic functions of a set of fitted balls. We empirically evaluate our model by presenting experiments on synthetic as well as real world datasets.
Supported by the Competence Center for Machine Learning Rhine Ruhr (ML2R), which is funded by the Federal Ministry of Education and Research of Germany (grant no. 01IS18038A).
1 Introduction
The notion of novelty discovery (or detection) [6] can be described as a one-class classification problem (a.k.a. data domain description [11]) that aims to learn characteristics of the analyzed dataset well enough to separate novel datapoints from it. It finds applications in numerous scientific and engineering areas, such as detecting fraudulent activity in finance or rare events in medical monitoring [6]. Although reservoir computing based approaches [4] have been proposed for a variety of classification and regression problems, to the best of our knowledge, corresponding methods that tackle one-class problems are scarce. The main contribution of this work is to utilize Minimum Enclosing Balls [1] for novelty discovery. Minimum Enclosing Balls (MEBs) fall into the class of unsupervised representation learning methods that can be used to extract important characteristics of the considered datasets [1]. The main idea behind MEBs is to determine the smallest ball that encloses the entire dataset in the data or feature space. This can be formulated as an inequality constrained convex minimization problem whose dual allows for invoking the kernel trick, and this dual can be solved using dynamical processes from reservoir computing [1].
Our contribution combines the decisions of a set of Kernel Minimum Enclosing Balls (KMEBs) into a compound novelty score, which allows, for instance, for majority voting based detection, since decisions based on a single ball can be limiting for novelty detection. In addition, our methodology can be easily implemented in neuromorphic architectures and is capable of dealing with nonlinear patterns due to kernelization [1, 10]. Figure 1 shows an illustrative example of our idea. Given a dataset of normal datapoints (black round points in Fig. 1a), the novel datapoints (green diamond shaped points in Fig. 1a) can be detected neither with Euclidean Minimum Enclosing Balls (as seen in Fig. 1b) nor with probabilistic novelty detection such as the deviation from the sample data mean [6]. Instead, by considering the characteristic functions of multiple KMEBs (see example in Fig. 1c) with differently scaled Gaussian kernels, we can detect all novel points that the considered balls might not individually be capable of capturing (compare the results of Fig. 1d to the others).
The remainder of the paper is organized as follows. So as to be self-contained, in Sect. 2 we formally define the notion of KMEBs, show how we can compute them following a process akin to the ones used in echo state networks, and finally show how, once computed, the support vectors of the balls can be used to characterize the interior of the fitted balls. Following that, in Sect. 3 we introduce a new novelty scoring methodology based on the characteristic functions. In Sect. 4 we present empirical results to evaluate our approach using real-world datasets, and in Sect. 5 we conclude our work.
2 An Overview of Kernel Minimum Enclosing Balls
Given a set of m-dimensional data points \(\mathcal {X}=\{\varvec{{x}}_1, \ldots , \varvec{{x}}_n\}\) (for \(\varvec{{x}}_i\in \mathbb {R}^{m}\)) that are grouped into a column data matrix \(\varvec{{X}} = [ \varvec{{x}}_1, \ldots , \varvec{{x}}_n ] \in \mathbb {R}^{m \times n}\), we aim to find the smallest m-ball \(\mathcal {B}(\varvec{{c}}, r)\) containing each of the given data points in \(\mathcal {X}\), where \(\varvec{{c}} \in \mathbb {R}^m\) and \(r \in \mathbb {R}\) are respectively the center and the radius of \(\mathcal {B}\). Finding MEBs can be cast as a convex optimization problem
\[ \min _{\varvec{{c}}, \, r} \; r^2 \quad \text {s.t.} \quad \Vert \varvec{{x}}_i - \varvec{{c}} \Vert ^2 - r^2 \le 0, \quad i = 1, \ldots , n. \tag{1} \]
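Upon evaluating the Lagrangian and the KKT conditions, the negated dual of (1) allows for the kernel trick (as the data only occur in the form of inner products [1]) and can be written as the minimization problem
\[ \varvec{{\mu }}_* = \mathop {\text {argmin}}\limits _{\varvec{{\mu }}} \; \varvec{{\mu }}^\intercal \varvec{{K}} \, \varvec{{\mu }} - \varvec{{\mu }}^\intercal \varvec{{k}} \quad \text {s.t.} \quad \varvec{{\mu }}^\intercal \varvec{{1}} = 1, \;\; \varvec{{\mu }} \succeq \varvec{{0}}, \tag{2} \]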
where \(\varvec{{K}} \in \mathbb {R}^{n \times n}\) is a kernel matrix, \(\varvec{{k}}\) contains its diagonal (i.e. \(\varvec{{k}} = \operatorname {diag}[\varvec{{K}}]\)) and \(\varvec{{\mu }} \in \mathbb {R}^n\) contains the Lagrange multipliers. The kernel matrix \(\varvec{{K}}\) in (2) is built by considering a Mercer kernel \(K : \mathbb {R}^m \times \mathbb {R}^m \rightarrow \mathbb {R}\) such that \(K_{ij} = K(\varvec{{x}}_i, \varvec{{x}}_j)\). An example kernel function that we considered throughout our work is the Gaussian kernel, which for a scale parameter \(\lambda \) is defined as \(K(\varvec{{x}}_i, \varvec{{x}}_j) = \exp \left( - \frac{\Vert \varvec{{x}}_i - \varvec{{x}}_j \Vert ^2}{2 \, \lambda ^2} \right) \).
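As a minimal sketch of these definitions, the kernel matrix \(\varvec{{K}}\) and its diagonal \(\varvec{{k}}\) for the Gaussian kernel could be built as follows. The function names and the toy data are assumptions made for illustration, and points are stored as rows rather than as columns of \(\varvec{{X}}\); only the Python standard library is used.

```python
import math

def gaussian_kernel(x, y, lam):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 * lam^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lam ** 2))

def kernel_matrix(points, lam):
    """n x n kernel matrix with entries K[i][j] = K(x_i, x_j)."""
    return [[gaussian_kernel(xi, xj, lam) for xj in points] for xi in points]

# Toy data (assumed for illustration): three points in the plane.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = kernel_matrix(X, lam=1.0)

# Diagonal of K; for the Gaussian kernel every entry equals 1.
k = [K[i][i] for i in range(len(X))]
```

Since \(K(\varvec{{x}}, \varvec{{x}}) = 1\) for the Gaussian kernel, the linear term \(\varvec{{\mu }}^\intercal \varvec{{k}}\) equals 1 for every \(\varvec{{\mu }}\) on the simplex.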
Considering (2), we note that finding Kernel Minimum Enclosing Balls boils down to finding the optimal \(\varvec{{\mu }}\), which resides in the standard simplex \(\varDelta ^{n-1}\) and minimizes the convex function \(\mathcal {L}(\varvec{{\mu }})= \varvec{{\mu }}^\intercal \varvec{{K}} \, \varvec{{\mu }} - \varvec{{\mu }}^\intercal \varvec{{k}}\). Optimization problems of this kind can be solved iteratively using the Frank-Wolfe algorithm [3], which itself can be implemented as a recurrent neural network (see examples from [1, 2, 8, 10]). To this end, at each iteration t, the Frank-Wolfe algorithm evaluates the gradient of the negated dual Lagrangian \(\mathcal {L}(\varvec{{\mu }})\) from (2), which amounts to \(\nabla \mathcal {L}(\varvec{{\mu }}) = 2 \, \varvec{{K}} \varvec{{\mu }} - \varvec{{k}}\), and finds the vertex of \(\varDelta ^{n-1}\) for the update, that is, the vertex that minimizes
\[ \mathop {\text {argmin}}\limits _{\varvec{{v}}_j \in \mathbb {R}^n} \; \varvec{{v}}_j^\intercal \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr ) \approx \varvec{{g}}_\beta \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr ), \tag{3} \]
where \(\varvec{{\mu }}_t \in \mathbb {R}^{n}\) represents the current solution at iteration t, \(\varvec{{v}}_j\) is the jth standard basis vector \(\varvec{{v}}_j =[\delta _{j1},\delta _{j2},\dots ,\delta _{jn} ]^\intercal \) (here \(\delta _{ji}\) represents the Kronecker delta) and, finally, \(\varvec{{g}}_\beta (\varvec{{x}})\) represents the soft-min operator. This operator is a smooth approximation of \(\mathop {\text {argmin}}\nolimits _{\cdot }\), whose ith entry is defined as \(\bigl ( \varvec{{g}}_\beta (\varvec{{x}}) \bigr )_i = \frac{e^{-\beta x_i}}{\sum _j e^{-\beta x_j}}\) and which has the limit
\[ \lim _{\beta \rightarrow \infty } \varvec{{g}}_\beta (\varvec{{x}}) = \mathop {\text {argmin}}\limits _{\varvec{{v}}_j \in \mathbb {R}^n} \; \varvec{{v}}_j^\intercal \varvec{{x}}. \tag{4} \]
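Given that, we can define the convergent iterative Frank-Wolfe updates [1] as
\[ \varvec{{\mu }}_{t+1} = \varvec{{\mu }}_t + \eta _t \bigl [ \varvec{{g}}_\beta \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr ) - \varvec{{\mu }}_t \bigr ] = (1 - \eta _t) \, \varvec{{\mu }}_t + \eta _t \, \varvec{{g}}_\beta \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr ), \tag{5} \]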
where \(\eta _t \in [0,1]\) is a monotonically decreasing step size. Rearranging the rightmost expression in (5) as \(\varvec{{g}}_\beta \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t - \varvec{{k}} \bigr )= \varvec{{g}}_\beta \bigl ( 2 \, \varvec{{K}} \varvec{{\mu }}_t + \bar{\varvec{{K}}}\bar{\varvec{{1}}} \bigr )\), where \(\bar{\varvec{{K}}}=\operatorname {diag}(\varvec{{k}})\) and \(\bar{\varvec{{1}}}\) is the vector of \(-1\)s defined as \(\bar{\varvec{{1}}}=[-1,\dots ,-1]^\intercal \), allows us to interpret and implement these updates in terms of echo state networks [4]. That is, we can describe this machinery as a structurally constrained echo state network, in which we have the fixed input vector \(\bar{\varvec{{1}}}\) containing \(-1\)s, the input weight matrix \(\bar{\varvec{{K}}}\), n reservoir neurons with \(\varvec{{g}}_\beta (\cdot )\) and \(2 \, \varvec{{K}}\) respectively being the nonlinear activation function and the reservoir weight matrix, and \(\eta _t\) acting as a leaking rate for updating the Lagrange multipliers. Once optimal Lagrange multipliers have been found using the updates from (5), we can determine the kernelized radius and the squared magnitude of the center of the fitted ball \(\mathcal {B}\) respectively as \(r_* = \sqrt{\varvec{{\mu }}_*^\intercal \varvec{{k}} - \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_*} \) and \(\varvec{{c}}_*^\intercal \varvec{{c}}_* = \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_*\), which will allow us to define a characteristic function defining the interior of \(\mathcal {B}\) [1]. Namely, using these equalities we can represent the inequality \(\Vert \varvec{{x}} - \varvec{{c}}_* \Vert ^2 \le r_*^2\) to check whether an arbitrary point \(\varvec{{x}} \in \mathbb {R}^m\) lies within the ball \(\mathcal {B}\) by considering
\[ f(\varvec{{x}}) = \sqrt{K(\varvec{{x}}, \varvec{{x}}) - 2 \, \bar{\varvec{{k}}}^\intercal \varvec{{\mu }}_* + \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_*} - \sqrt{\varvec{{\mu }}_*^\intercal \varvec{{k}} - \varvec{{\mu }}_*^\intercal \varvec{{K}} \, \varvec{{\mu }}_*}, \tag{6} \]
where \(\bar{\varvec{{k}}} \in \mathbb {R}^n\) is defined as \(\bar{k}_i = K(\varvec{{x}}, \varvec{{x}}_i)\) [1]. That is, \(f(\varvec{{x}}) > 0\) holds if \(\varvec{{x}}\) is outside of the ball \(\mathcal {B}\), whereas \(f(\varvec{{x}}) \le 0\) is the case when \(\varvec{{x}}\) is inside the ball \(\mathcal {B}\). However, \(f(\varvec{{x}}) = 0\) only holds for the points with nonzero Lagrange multipliers, which are the support vectors of \(\mathcal {B}\) and can be collected in the set \(\mathcal {S}=\{\varvec{{x}}_i \; | \; i \in \{1,\dots ,n\} \; \wedge \; \mu _{i*}>0\}\). It is worth noting that we can simplify (6) by grouping the \(l \le n\) points in \(\mathcal {S}\) into a column data matrix \(\varvec{{S}} = [ \varvec{{s}}_1, \ldots , \varvec{{s}}_l ] \in \mathbb {R}^{m \times l}\), putting their corresponding multipliers in \(\varvec{{\sigma }} \in \mathbb {R}^{l}\), letting \(\varvec{{Q}} \in \mathbb {R}^{l \times l}\) be the kernel matrix for the support vectors (i.e. \(Q_{ij} = K(\varvec{{s}}_i, \varvec{{s}}_j)\)) and \(\varvec{{q}} \in \mathbb {R}^l\) contain its diagonal (i.e. \(\varvec{{q}} = \operatorname {diag}[\varvec{{Q}}]\)), which yields a simpler characteristic function
\[ f(\varvec{{x}}) = \sqrt{K(\varvec{{x}}, \varvec{{x}}) - 2 \, \bar{\varvec{{k}}}^\intercal \varvec{{\sigma }} + \varvec{{\sigma }}^\intercal \varvec{{Q}} \, \varvec{{\sigma }}} - \sqrt{\varvec{{\sigma }}^\intercal \varvec{{q}} - \varvec{{\sigma }}^\intercal \varvec{{Q}} \, \varvec{{\sigma }}}, \tag{7} \]
where, as in (6), \(\bar{\varvec{{k}}} \in \mathbb {R}^l\) is evaluated as \(\bar{k}_j = K(\varvec{{x}}, \varvec{{s}}_j)\) and we note that the term \(\sqrt{\varvec{{\sigma }}^\intercal \varvec{{q}} - \varvec{{\sigma }}^\intercal \varvec{{Q}} \, \varvec{{\sigma }}}\) (which indeed amounts to \(r_*\)) does not depend on \(\varvec{{x}}\).
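The procedure described in this section can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than the implementation used for the experiments: the step size \(\eta _t = 2/(t+2)\), the iteration count, the function names and the toy data are all assumed here, and the soft-min is replaced by its \(\beta \rightarrow \infty \) limit, i.e. a hard arg-min over the gradient entries.

```python
import math

def gaussian_kernel(x, y, lam):
    """Gaussian kernel with scale parameter lam."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lam ** 2))

def fit_kmeb(points, lam, n_iter=2000):
    """Frank-Wolfe on the simplex for min mu^T K mu - mu^T k (beta = infinity)."""
    n = len(points)
    K = [[gaussian_kernel(p, q, lam) for q in points] for p in points]
    k = [K[i][i] for i in range(n)]
    mu = [1.0 / n] * n  # start at the barycenter of the simplex
    for t in range(n_iter):
        # Gradient of the negated dual: 2 K mu - k.
        grad = [2.0 * sum(K[i][j] * mu[j] for j in range(n)) - k[i]
                for i in range(n)]
        # Hard arg-min: index of the simplex vertex minimizing v_j^T grad.
        j_min = min(range(n), key=grad.__getitem__)
        eta = 2.0 / (t + 2.0)  # assumed decreasing step size in [0, 1]
        mu = [(1.0 - eta) * m for m in mu]
        mu[j_min] += eta
    quad = sum(mu[i] * K[i][j] * mu[j] for i in range(n) for j in range(n))
    lin = sum(m * ki for m, ki in zip(mu, k))
    radius = math.sqrt(max(lin - quad, 0.0))  # r = sqrt(mu^T k - mu^T K mu)
    return mu, quad, radius

def characteristic(x, points, mu, quad, radius, lam):
    """f(x): kernel distance of x to the ball center minus the radius."""
    cross = sum(m * gaussian_kernel(x, p, lam) for m, p in zip(mu, points))
    dist_sq = gaussian_kernel(x, x, lam) - 2.0 * cross + quad
    return math.sqrt(max(dist_sq, 0.0)) - radius

# Toy data (assumed for illustration): the corners of the unit square.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
mu, quad, radius = fit_kmeb(pts, lam=1.0)
inside = characteristic((0.5, 0.5), pts, mu, quad, radius, 1.0)     # negative
outside = characteristic((10.0, 10.0), pts, mu, quad, radius, 1.0)  # positive
```

Points whose multiplier stays at zero contribute nothing to the sums, so restricting `points` and `mu` to the entries with `mu[i] > 0` gives the support-vector form of the characteristic function without changing its values.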
3 An Ensemble Approach for Novelty Discovery
Having explained how KMEBs are defined and how they can be computed so that we can determine their interior, we now turn our attention to novelty discovery by combining the characteristic functions of a set of balls. We note that the characteristic function from (7) for a given ball \(\mathcal {B}\) can be used to label the points outside of the ball as novel. In this case a query point \(\varvec{{x}}\) is considered novel if \(f(\varvec{{x}})>0\) and not novel if \(f(\varvec{{x}}) \le 0\). Although this approach can capture novel points, a single ball might yield a decision boundary that is either too restrictive, so that every query point is labeled novel, or too general, so that no query point is labeled novel (see Fig. 1d for the latter case). Both problems, however, can be avoided if we generalize this approach by combining the decisions of multiple balls. One approach for such a combination can be based on uniform voting [7]. That is, given a set of u KMEBs \(\mathcal {P}=\{\mathcal {B}_1, \ldots , \mathcal {B}_u\}\), each trained with a different setting, and with \(f_i(\cdot )\) and \(\llbracket \cdot \rrbracket \) respectively indicating the characteristic function from (7) for ball \(\mathcal {B}_i\) and the Iverson bracket, we can assign the novelty score of a query point \(\varvec{{x}}\) by evaluating \(z(\varvec{{x}}) = \sum _{i=1}^{u} \; \llbracket f_i(\varvec{{x}}) > 0 \rrbracket \) and, for instance, label \(\varvec{{x}}\) to be novel if \(z(\varvec{{x}}) > u/2\) (i.e. \(\varvec{{x}}\) is outside of the majority of the balls in \(\mathcal {P}\) for an odd u) and not novel if \(z(\varvec{{x}}) < u/2\). In the next section, we will empirically evaluate this methodology to detect novelty by showing two conceptual examples on benchmark datasets.
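The voting scheme itself is independent of how the individual balls were fitted, which the following minimal sketch tries to make explicit. The function names are assumptions, and plain Euclidean balls around the origin stand in for the characteristic functions of fitted KMEBs purely to keep the example short.

```python
def novelty_score(x, characteristic_functions):
    """z(x): number of balls whose characteristic function places x outside."""
    return sum(1 for f in characteristic_functions if f(x) > 0)

def is_novel(x, characteristic_functions):
    """Majority vote: x is novel if it lies outside more than half of the balls."""
    u = len(characteristic_functions)
    return novelty_score(x, characteristic_functions) > u / 2

def make_ball(radius):
    """Stand-in characteristic function (assumed): Euclidean ball at the origin."""
    return lambda x: sum(v * v for v in x) ** 0.5 - radius

# An ensemble of u = 3 stand-in balls with radii 1, 2 and 3.
balls = [make_ball(r) for r in (1.0, 2.0, 3.0)]

score = novelty_score((2.5, 0.0), balls)  # outside the two smaller balls
label = is_novel((2.5, 0.0), balls)       # 2 of 3 votes, hence novel
```

Returning the raw score \(z(\varvec{{x}})\) alongside the label keeps other decision rules open, for example requiring that a point lies outside of every ball.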
4 Empirical Results
We evaluated our method on the MNIST [5] and CBCL-face (bit.ly/2KwOVV6) datasets. For the former we trained models on the digit 1 aiming to detect the 0s, whereas for the latter we learned balls on faces to detect non-face novel images. So as to evaluate the precision of the detections, we divided the training data into 90/10 splits and combined the latter split with the novel points, which resulted in training/evaluation datasets of cardinalities 6067/6598 and 2186/4791 for the MNIST and CBCL-face datasets, respectively. We note that for both examples, we constructed ensembles of KMEBs (i.e. distinct \(\mathcal {P}\) sets) with the Gaussian kernel, whose scale values, in our case, were evenly spaced over specified intervals (as in Fig. 1) by considering \(u=5\) KMEBs with \(\lambda \) ranging in [40, 60]. We also normalized the datasets to have zero mean and unit variance and always considered \(\beta =\infty \) for the softmin function (see (4)).
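The two preprocessing choices mentioned above can be sketched as follows. Several details are assumptions not stated in the text: that the scale interval includes both endpoints, that normalization is applied per feature, that its statistics are estimated on the training split only, and that constant features (zero variance) are left unscaled. The function names are hypothetical.

```python
import math

def scale_grid(low, high, u):
    """u evenly spaced scale values covering [low, high], endpoints included."""
    step = (high - low) / (u - 1)
    return [low + i * step for i in range(u)]

def fit_standardizer(rows):
    """Per-feature mean and standard deviation of the training rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((r[d] - means[d]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) or 1.0)  # constant feature: divide by 1
    return means, stds

def standardize(rows, means, stds):
    """Shift and scale rows with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

lambdas = scale_grid(40.0, 60.0, 5)  # one scale value per ball in the ensemble

# Toy training rows (assumed): the second feature is constant.
train = [[1.0, 5.0], [3.0, 5.0]]
means, stds = fit_standardizer(train)
train_std = standardize(train, means, stds)
```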
In Table 1, we compare our method against thresholding the tested points by the maximum deviation from the sample mean vector [6], Euclidean MEBs [1] (where we consider points outside of the ball as novel) and matrix factorization (MF) [8] based reconstruction to validate the use of kernel methods. For the first method, we label a point in the test set as novel if its Euclidean distance to the sample mean is larger than that of the training point furthest from the sample mean. For the last method, we factorize the data matrix with \(k=50\) latent factors using the alternating least squares method [9] and learn a threshold value based on the worst reconstruction error (l2-norm). Unseen points with reconstruction error exceeding this threshold are considered novel. Table 1a and Table 1b respectively depict the prediction results for the MNIST and CBCL datasets, where we observe that the ensemble KMEBs are superior at detecting novel datapoints.
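For concreteness, the mean deviation baseline described above might look as follows. This is a sketch with assumed function names and toy data; it requires Python 3.8 or later for `math.dist`.

```python
import math

def fit_mean_threshold(train):
    """Sample mean and the distance of the training point furthest from it."""
    n, dims = len(train), len(train[0])
    mean = [sum(x[d] for x in train) / n for d in range(dims)]
    threshold = max(math.dist(x, mean) for x in train)
    return mean, threshold

def is_novel_by_mean(x, mean, threshold):
    """Novel if x is further from the sample mean than any training point."""
    return math.dist(x, mean) > threshold

# Toy training data (assumed): the corners of a square with side length 2.
train = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, threshold = fit_mean_threshold(train)

near = is_novel_by_mean((1.0, 1.0), mean, threshold)  # at the mean, not novel
far = is_novel_by_mean((3.0, 3.0), mean, threshold)   # beyond the threshold
```

Because the rule amounts to a single Euclidean ball centered at the sample mean, it cannot flag novel points that fall between groups of normal points, which is the situation illustrated in Fig. 1.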
5 Conclusion and Future Work
In this work, we introduced the idea of using ensembles of KMEBs for novelty discovery. We showed how we can construct ensembles of KMEBs and introduced a voting-based approach to detect novel data points. Our empirical evaluation yielded superior results over the use of mean deviation, Euclidean MEBs and matrix factorization approaches. Our future work involves studying different ball selection as well as novelty determination strategies and extending the scope of the applications. Another line of future work is related to the physical implementation of our methodology in resource-constrained devices for applications in industrial domains such as predictive maintenance.
References
1. Bauckhage, C., Sifa, R., Dong, T.: Prototypes within minimum enclosing balls. In: Tetko, I.V., Kůrková, V., Karpov, P., Theis, F. (eds.) ICANN 2019. LNCS, vol. 11731, pp. 365–376. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30493-5_36
2. Bauckhage, C.: A neural network implementation of Frank-Wolfe optimization. In: Lintas, A., Rovetta, S., Verschure, P.F.M.J., Villa, A.E.P. (eds.) ICANN 2017. LNCS, vol. 10613, pp. 219–226. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68600-4_26
3. Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3, 95–110 (1956)
4. Jäger, H., Haas, H.: Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science 304(5667), 78–80 (2004)
5. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
6. Pimentel, M.A., Clifton, D.A., Clifton, L., Tarassenko, L.: A review of novelty detection. Sig. Process. 99, 215–249 (2014)
7. Rokach, L.: Ensemble methods for classifiers. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 957–980. Springer, Boston (2005). https://doi.org/10.1007/0-387-25465-X_45
8. Sifa, R.: An overview of Frank-Wolfe optimization for stochasticity constrained interpretable matrix and tensor factorization. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds.) ICANN 2018. LNCS, vol. 11140, pp. 369–379. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01421-6_36
9. Sifa, R.: Matrix and Tensor Factorization for Profiling Player Behavior. LeanPub, British Columbia (2019)
10. Sifa, R., Paurat, D., Trabold, D., Bauckhage, C.: Simple recurrent neural networks for support vector machine training. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds.) ICANN 2018. LNCS, vol. 11141, pp. 13–22. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01424-7_2
11. Tax, D.M., Duin, R.P.: Data domain description using support vectors. In: Proceedings of ESANN (1999)
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Sifa, R., Bauckhage, C. (2020). Novelty Discovery with Kernel Minimum Enclosing Balls. In: Kotsireas, I., Pardalos, P. (eds) Learning and Intelligent Optimization. LION 2020. Lecture Notes in Computer Science(), vol 12096. Springer, Cham. https://doi.org/10.1007/978-3-030-53552-0_37
DOI: https://doi.org/10.1007/978-3-030-53552-0_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-53551-3
Online ISBN: 978-3-030-53552-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)