1 Introduction

Cognition is an instinctive learning ability of human beings. Humans transfer feature information to the brain through perception; the brain then processes this feature information and memorizes it for the given objects. According to cognitive science, the human brain can be imitated but cannot be completely reproduced. Artificial intelligence is currently an important direction for the functional imitation of the human brain.

Neural computing and the family of neural networks (NN), which are based on the working mechanism of neurons, have achieved great success in many areas. More recently, statistical learning theory and support vector machines (SVM) have drawn extensive attention and shown attractive performance in various areas compared with NN, which implies that artificial intelligence can also be realized through advanced statistical computing theory.

It should be noted that, for both NN and SVM, the functional imitation of the human cognitive process for pattern classification can be explained as follows [1]. Given the training pairs (sample features, class indicator), we can train an NN or an SVM learning machine. The training process of these learning machines imitates the learning ability of human beings; for clarity, we call this process “cognizing”. The trained NN or SVM can then be used to test an unknown sample and determine the class indicator it belongs to. This testing process imitates the recognizing process of human beings; we call it “recognizing”.

From a mathematical point of view, both learning machines are based on hyperplane adjustment, and they obtain the optimal or sub-optimal hyperplane combinations after training. In an NN, each neuron acts as a hyperplane in the feature space. The feature space is divided into many partitions according to the selected training principle, and each partition is linked to a corresponding class, which accomplishes the “cognizing” process. Given an unknown sample, one only needs to detect the partition in which the sample is located and assign the corresponding class indicator, which accomplishes the “recognizing” process. Like NN, SVM is based on an optimal hyperplane; unlike NN, the standard SVM determines the hyperplane by solving a convex optimization problem. The two share the same “cognizing” and “recognizing” processes and differ only in their solving strategies.

If a totally unknown, novel sample arrives, neither SVM nor NN will recognize it as such; instead, they assign it the closest indicator among the learned classes [2, 3]. This is generally wrong, and it is the problem this paper addresses. The root cause of this phenomenon lies in the learning principle itself, which is based on feature space partition. Such a learning principle may amplify each class's region, especially when the training samples are few or incomplete, and it makes automatic detection of novel samples impossible. The question is therefore: how can the learning machine be made clever enough to identify novel samples automatically?

The rest of this paper is organized as follows. Section 2 introduces the basic concepts of the enclosing learning machine. Section 3 describes the proposed one-class description algorithm based on the MVEE cognitive learner, and shows how to build the corresponding cognitive learner in a kernel-defined feature space. Section 4 presents a novel two-class classification algorithm based on a single MVEE cognitive learner. Experimental results are presented in Sect. 5, and Sect. 6 contains some concluding remarks.

2 Enclosing machine learning

As is well known, human beings generally cognize things of one kind, and easily recognize completely unknown things of a novel kind. The question is, why not make the learning machine “cognize” and “recognize” like a human being? In other words, the learning machine should “cognize” the training samples of each class separately: each class is cognized, or described, by a cognitive learner that uses some kind of model to describe the class instead of partitioning the feature space, thereby imitating the “cognizing” process. The bounding, closed boundaries of the cognitive learners are scattered in the feature space. For an unknown sample, the cognitive recognizer detects whether the sample is located inside a cognitive learner's boundary, thereby imitating the “recognizing” process. If the sample is completely new (i.e., none of the trained cognitive learners contains it), it can in turn be described by a new cognitive learner, and the newly obtained learner can be added to the feature space without changing the others. We call this feedback process active self-learning. These are the basic concepts of enclosing machine learning; the basic flow is depicted in Fig. 1.

Fig. 1 Enclosing machine learning paradigm. The solid line denotes the cognizing process, the dashed line the recognizing process, and the dash-dotted line the active self-learning process

Now we can consider the definition of the cognitive learner. The cognitive learner should have at least the following features:

  a. regular and convenient to calculate,

  b. bounding with the minimum volume,

  c. convex bodies to guarantee optimality,

  d. fault tolerant to guarantee generalization performance.

Basic geometric shapes are perhaps the best choices, because they are all convex bodies and operations such as intersection, union, or complement of basic geometric shapes can be implemented easily. We therefore propose to use basic geometric shapes such as the sphere, box, or ellipsoid. The cognitive learner then uses these geometric shapes to enclose all the given samples with minimum volume in the feature space. This is the most important reason why we call this learning paradigm enclosing machine learning. We now define the cognitive learner and recognizer.

Definition 1 A cognitive learner is defined as the bounding, closed boundary of a minimum volume set which encloses all the given samples. The cognitive learner can be a sphere, an ellipsoid, or a combination of them. In this paper only the ellipsoid is investigated, for illustration.

Definition 2 A cognitive recognizer is defined as the point detection and assignment algorithm.

Figure 2 gives a geometric illustration of the differences between the enclosing machine learning and feature space partition learning paradigms. In enclosing machine learning, for the cognizing (or learning) process, each class is described by a cognitive class description learner; for the recognizing (or classification) process, we only need to check inside which bounding learner the testing sample is located. In feature space partition learning, for the learning process, every two classes are separated by a hyperplane (or another boundary form, such as a hypersphere); for the classification process, we need to check whether the sample is located on the left or right side of the hyperplane and then assign the corresponding class indicator. We can see that the feature space partition learning paradigm in fact amplifies the real distribution region of each class, whereas the enclosing machine learning paradigm obtains a more reasonable distribution region for each class.

Fig. 2 A geometric illustration of learning three-class samples via enclosing machine learning vs. the feature space partition learning paradigm. a For the depicted example, the cognitive learner is the bounding minimum volume ellipsoid, while the cognitive recognizer is the point location detection algorithm for the testing sample. b All three classes are separated by three hyperplanes; obviously, the distribution region of each class is amplified

3 MVEE cognitive learner for one class description

3.1 Preliminaries

Our concern is with covering n given points \({x_{i} \in {\Re}^{k} ,\,X: = [x_{1} \;x_{2} \; \ldots \;x_{n} ]}\) with an ellipsoid of minimum volume [4, 5]. To avoid trivialities, we also make the following assumption for the remainder of this paper, which guarantees that any ellipsoid containing \({X: = [x_{1} \;x_{2} \; \ldots \;x_{n} ]}\) has positive volume:

Assumption 1 The affine hull of \({X: = [x_{1} \;x_{2} \; \ldots \;x_{n} ]}\) spans \({{\Re}^{k}}\). Equivalently, \({\operatorname{rank} {\left[ {\begin{array}{*{20}c} {{A^{\rm T}}}\\ {{e^{\rm T}}} \end{array}} \right]} = k + 1},\) where e is a vector of ones and A denotes the n × k matrix whose rows are the given points.

Definition 3 For center \({c \in {\Re}^{k} }\) and shape matrix \({E \in S_{++}^{{k \times k}} },\) we define the ellipsoid

$${\varepsilon (E,c): = {\left\{{x \in {\Re}^{k} |(x - c)^{\rm T} E(x - c) \leqslant 1} \right\}} .}$$

where \({E \in S_{{++}}^{{k \times k}}}\) determines the shape and orientation of the ellipsoid. The lengths of the semi-axes are given by \({{\left[ {1/{\sqrt {\lambda_{1}}},1/{\sqrt {\lambda_{2}}}, \ldots, 1/{\sqrt {\lambda_{k}}}} \right]}},\) where \({{\left[ {\lambda_{1}, \lambda_{2}, \ldots, \lambda_{k}} \right]}}\) are the eigenvalues of the matrix E.
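To make Definition 3 concrete, the short sketch below builds a toy ellipsoid, reads its semi-axis lengths from the eigenvalues of E, and tests whether a point lies inside it; the array names are illustrative only.

```python
import numpy as np

# Toy shape matrix E (symmetric positive definite) and center c.
E = np.array([[2.0, 0.5],
              [0.5, 1.0]])
c = np.array([1.0, -1.0])

# Semi-axis lengths of {x : (x - c)^T E (x - c) <= 1} are 1/sqrt(lambda_i).
eigvals, eigvecs = np.linalg.eigh(E)
semi_axes = 1.0 / np.sqrt(eigvals)
print("semi-axis lengths:", semi_axes)

# Membership test for a point x.
x = np.array([1.2, -0.8])
inside = (x - c) @ E @ (x - c) <= 1.0
print("inside ellipsoid:", inside)
```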

Definition 4 For \({x_{i} \in {\Re}^{k},\,X: = [x_{1} \;x_{2} \; \ldots \;x_{n} ]},\) an MVEE cognitive learner is defined as the boundary of the minimum volume ellipsoid that encloses all the given samples.

Under Assumption 1, a natural formulation of the minimum volume enclosing ellipsoid can be obtained by solving the following convex minimization problem:

$${\begin{aligned}\,& {\mathop {\min}\limits_M}\quad - \ln \det M \\\,& s.t.\quad {\left({Mx_{i} - z} \right)}^{\rm T} {\left({Mx_{i} - z} \right)} \leqslant 1, \quad \forall i = 1,2, \ldots, n, \\\,& \quad \quad M \succ 0 \end{aligned} }$$
(1)

where \({M = {\sqrt E},\;z = {\sqrt E}\,c};\) the square root of a symmetric positive semidefinite matrix is defined as \({{\sqrt X} = V^{\rm T} {\sqrt {D_{{{\left[ {d_{{ii}}} \right]}}}}}V},\) where \({{\sqrt {D_{{{\left[ {d_{{ii}}} \right]}}}}}}\) is the element-wise square root of the diagonal matrix of eigenvalues.
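As an illustration of problem (1), the following sketch solves the primal MVEE for a toy point set with cvxpy (the paper itself uses Yalmip in Matlab, so the solver choice and all names here are assumptions).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
points = rng.standard_normal((30, 2))      # n = 30 toy samples in R^2
n, k = points.shape

M = cp.Variable((k, k), PSD=True)          # M = sqrt(E)
z = cp.Variable(k)                         # z = sqrt(E) c

# (M x_i - z)^T (M x_i - z) <= 1 for every sample; minimize -ln det M.
constraints = [cp.norm(M @ points[i] - z) <= 1 for i in range(n)]
prob = cp.Problem(cp.Minimize(-cp.log_det(M)), constraints)
prob.solve()

c = np.linalg.solve(M.value, z.value)      # recover center c = M^{-1} z
E = M.value.T @ M.value                    # shape matrix of eps(E, c)
print("center:", c)
```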

Definition 5 Decompose \({\tilde{M} \in {\Re}^{{{\left({k + 1} \right)} \times {\left({k + 1} \right)}}} = {\left({{\sum\nolimits_{i = 1}^n {\alpha_{i} \tilde{x}_{i}}} \tilde{x}^{\rm T}_{i}} \right)}^{{- 1}} = {\left({\begin{array}{*{20}c} {s}& {{v^{\rm T}}}\\ {v}& {F}\\ \end{array}} \right)},\, \tilde{x}_{i} = {\left({\begin{array}{*{20}c} {1}\\ {{x_{i}}} \\ \end{array}} \right)}},\) where \({v = - F \tilde{z},\,F \in {\Re}^{{k \times k}}, \,v \in {\Re}^{k}},\) s is a constant, and \({\delta = 1 - s + \tilde{z}^{\rm T} F \tilde{z}}.\) The linear transformation \(f: {{\Re}^{{k + 1}} \varepsilon {\left({\tilde{M},0} \right)} \to {\Re}^{k} \varepsilon {\left({M,z} \right)}}\) is defined as \({\left\{{\begin{array}{*{20}c} {{z = \tilde{z}}}\\ {{M = \delta^{{- 1}} F}} \end{array}} \right.}.\) The ellipsoid ɛ (E,c) can then be computed from \({\left\{{\begin{array}{*{20}c} {{E = {\left({\delta^{{- 1}} F} \right)}^{\rm T} {\left({\delta^{{- 1}} F} \right)}}}\\ {{c = - F^{{- 1}} v(E)^{{- \frac{1}{2}}}}} \end{array}} \right..}\)

Lemma 1 Minimizing the volume of the ellipsoid \({\varepsilon {\left({M,z} \right)}}\) in \({{\Re}^{k}}\) is equivalent to minimizing the volume of the ellipsoid \({\varepsilon {\left({\tilde{M},0} \right)}}\) in the augmented space \({{\Re}^{{k + 1}}},\) centered at the origin, using the linear transformation f [6].

According to Lemma 1 and Definition 5, Eq. (1) can be rewritten as:

$${\begin{aligned}\,& {\mathop {\min}\limits_{\tilde{M}}}\quad - \ln \det \tilde{M}\\ & s.t.\quad \tilde{x}^{\rm T}_{i} \tilde{M} \tilde{x}_{i} \leqslant 1, \quad \forall i = 1,2, \ldots, n,\\ & \quad \quad \tilde{M} \succ 0 \end{aligned} }$$
(2)

The dual form is:

$${\begin{aligned}\,& {\mathop {\max}\limits_{\alpha_{i}}}\quad \ln \det {\sum\limits_{i = 1}^n {\alpha_{i} \tilde{x}_{i} \tilde{x}^{\rm T}_{i}}} \\\,& s.t.\quad {\sum\limits_{i = 1}^n {\alpha_{i}}} = k + 1\\ & \quad \quad 0 \leqslant \alpha_{i} \leqslant 1,\quad \forall i = 1,2, \ldots, n\\ \end{aligned} }$$
(3)

where α is the dual variable.
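Problem (3) has the form of a D-optimal design problem. As an illustrative alternative to a general-purpose convex solver, the sketch below uses a Khachiyan-style coordinate-ascent update for the classical form of this dual (the weights u relate to α by α_i = (k + 1)u_i; the extra upper bound α_i ≤ 1 in (3) is not enforced here, and all names are ours).

```python
import numpy as np

def mvee_dual_weights(points, tol=1e-6, max_iter=1000):
    """Khachiyan-style ascent for the MVEE dual; returns alpha with sum = k + 1."""
    n, k = points.shape
    X_tilde = np.hstack([np.ones((n, 1)), points])   # augmented samples (1, x_i)
    u = np.full(n, 1.0 / n)                          # normalized dual weights
    for _ in range(max_iter):
        S = X_tilde.T @ (u[:, None] * X_tilde)       # sum_i u_i x~_i x~_i^T
        m = np.einsum('ij,jk,ik->i', X_tilde, np.linalg.inv(S), X_tilde)
        j = int(np.argmax(m))
        if m[j] - (k + 1) < tol:                     # optimality gap small enough
            break
        step = (m[j] - k - 1) / ((k + 1) * (m[j] - 1))
        u = (1 - step) * u
        u[j] += step
    return (k + 1) * u                               # alpha_i = (k + 1) u_i
```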

We name this one class description method OCMVEE for clarity, where OCMVEE stands for one class minimum volume enclosing ellipsoid.

According to the KKT conditions, we get \({ \tilde{M}^{{- 1}} = {\sum\nolimits_{i = 1}^n {\alpha_{i} \tilde{x}_{i} \tilde{x}^{\rm T}_{i}}}}.\) We define \({A:A_{{ii}} = a_{i} = {\sqrt {\alpha_{i}}} \geqslant 0,\,X = {\left({\tilde{x}_{1}, \tilde{x}_{2}, \ldots, \tilde{x}_{n}} \right)}}.\) Because \({{\sum\limits_{i = 1}^n {\alpha_{i} \tilde{x}_{i} \tilde{x}^{\rm T}_{i}}} = X^{\rm T} A^{2} X},\) we have \({\tilde{M}^{{- 1}} = {\left({AX} \right)}^{\rm T} {\left({AX} \right)} = X^{\rm T} A^{2} X}\) and \({{\left({AX} \right)}{\left({AX} \right)}^{\rm T} = AXX^{\rm T} A}.\) Using the singular value decomposition, we can write \({X^{\rm T} A^{2} X = P\Lambda P^{\rm T}, \,AXX^{\rm T} A = AKA = Q\Lambda Q^{\rm T} },\) and thus obtain the important relation \({P = X^{\rm T} AQ\Lambda^{-1/2}} \). From this eigenspectrum analysis, we can infer the following lemma.

Lemma 2 For \({ \tilde{M}^{{- 1}} = {\left({AX} \right)}^{\rm T} {\left({AX} \right)} = X^{\rm T} A^{2} X}\) and \({{\left({AX} \right)}{\left({AX} \right)}^{\rm T} = AXX^{\rm T} A},\) the following identities hold [7, 8]:

$${\begin{aligned}\,& \ln \det {\left({X^{\rm T} A^{2} X} \right)} = {\sum\limits_{i:\lambda_{i} \ne 0} {\ln {\left({\lambda_{i}} \right)}}} + {\left({k + 1 - \# {\left\{{\lambda_{i} \ne 0} \right\}}} \right)}\ln {\left(0 \right)} \\\,& \ln \det {\left({AXX^{\rm T} A} \right)} = {\sum\limits_{i:\lambda_{i} \ne 0} {\ln {\left({\lambda_{i}} \right)}}} + {\left({n - \# {\left\{{\lambda_{i} \ne 0} \right\}}} \right)}\ln {\left(0 \right)}\\ \end{aligned} }$$
(4)

where λ i are the nonzero eigenvalues. (Proof omitted.)
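The shared-spectrum fact behind Lemma 2 (X^T A^2 X and AXX^T A have the same nonzero eigenvalues) can be checked numerically as follows; the toy dimensions and names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k_plus_1 = 10, 4
X = rng.standard_normal((n, k_plus_1))         # rows are augmented samples
A = np.diag(np.sqrt(rng.random(n)))            # A_ii = sqrt(alpha_i)

e_small = np.linalg.eigvalsh(X.T @ A @ A @ X)  # (k+1) x (k+1) matrix
e_large = np.linalg.eigvalsh(A @ X @ X.T @ A)  # n x n matrix, extra zeros

# The nonzero spectra coincide (largest k+1 eigenvalues of the big matrix).
print(np.allclose(np.sort(e_small), np.sort(e_large)[-k_plus_1:]))
```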

3.2 Regularized self-adaptive MVEE cognitive learner

According to Lemma 2, zero eigenvalues may occur, so it is wise to introduce a regularization term to avoid this situation. It is natural to add a regularization term μI inside the ln det (•) objective function. From Lemma 2, we can easily conclude the following identities:

$${\begin{aligned}\,& \ln \det {\left({X^{\rm T} A^{2} X + \mu I} \right)} = {\sum\limits_{i:\lambda_{i} \ne 0} {\ln {\left({\lambda_{i} + \mu} \right)}}} + {\left({k + 1 - \# {\left\{{\lambda_{i} \ne 0} \right\}}} \right)}\ln {\left(\mu \right)} \\\,& \ln \det {\left({AXX^{\rm T} A + \mu I} \right)} = {\sum\limits_{i:\lambda_{i} \ne 0} {\ln {\left({\lambda_{i} + \mu} \right)}}} + {\left({n - \# {\left\{{\lambda_{i} \ne 0} \right\}}} \right)}\ln {\left(\mu \right)} \end{aligned} }$$
(5)

To realize this regularization, we can add an additional term \({\mu\;\operatorname{trace} {\left({\tilde{M}} \right)}},\) since \({\operatorname{trace} {\left({\tilde{M}} \right)} = {\sum\nolimits_{i = 1}^{k + 1} {\frac{1}{{\lambda_{i}}}}}},\) where λ i is an eigenvalue of \({\tilde{M}^{-1}}\). The primal regularized self-adaptive MVEE can then be written as:

$${\begin{aligned}\,& {\mathop {\min}\limits_{\tilde{M},\xi_{i}}}\quad - \ln \det \tilde{M} + \mu\;\operatorname{trace} {\left({\tilde{M}} \right)} + \frac{1}{n}{\sum\limits_{i = 1}^n {\xi_{i}}} + \nu \rho, \\\,& s.t.\quad \tilde{x}^{\rm T}_{i} \tilde{M} \tilde{x}_{i} \leqslant \rho + \xi_{i}, \quad \forall i = 1,2, \ldots, n, \\\,& \quad \quad \tilde{M} \succ 0,\xi_{i} \geqslant 0,\rho \geqslant 0, \quad \forall i = 1,2, \ldots, n \end{aligned} }$$
(6)

where ν ≥ 0 is a user-specified parameter that equals the fraction of objects outside the optimized ellipsoid, ρ is a variable that controls the volume according to ν, and ξ i are slack variables that tolerate the misclassified samples.

By introducing dual variables α i , β i , γ ≥ 0, the Lagrange dual form is:

$${\begin{aligned}\,& {\mathop {\max}\limits_{\alpha_{i}}}\quad \ln \det {\left({{\sum\limits_{i = 1}^n {\alpha_{i} \tilde{x}_{i} \tilde{x}^{\rm T}_{i}}} + \mu I} \right)} \\\,& s.t.\quad \;{\sum\limits_{i = 1}^n {\alpha_{i}}} = \nu, \\\,& \quad \quad \;0 \leqslant \alpha_{i} \leqslant \frac{1}{n}, \quad \forall i = 1,2, \ldots n \end{aligned} }$$
(7)
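For illustration, the dual (7) can be posed directly in a convex modeling tool; the minimal cvxpy sketch below uses toy data and assumed parameter values (the paper itself solves these programs with Yalmip in Matlab).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
points = rng.standard_normal((30, 2))
n, k = points.shape
X_tilde = np.hstack([np.ones((n, 1)), points])    # rows are (1, x_i)

mu, nu = 0.1, 0.5                                 # assumed regularizer / fraction
alpha = cp.Variable(n, nonneg=True)

# Objective of (7): ln det( sum_i alpha_i x~_i x~_i^T + mu I ).
S = X_tilde.T @ cp.diag(alpha) @ X_tilde + mu * np.eye(k + 1)
prob = cp.Problem(cp.Maximize(cp.log_det(S)),
                  [cp.sum(alpha) == nu, alpha <= 1.0 / n])
prob.solve()
print("optimal log det:", prob.value)
```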

ρ*, ξ* i can be determined using the following KKT conditions:

$${\left| {\begin{array}{*{20}c} {{\alpha^{*}_{i} {\left({\rho^{*} + \xi^{*}_{i} - \tilde{x}^{\rm T}_{i} U^{*} U^{{\rm *T}} \tilde{x}_{i}} \right)} = 0}}\\ {{\beta^{*}_{i} \xi^{*}_{i} = 0,{\left({\frac{1}{n} - \alpha^{*}_{i}} \right)}\xi^{*}_{i} = 0}}\\ {{\gamma^{*} \rho^{*} = 0}} \end{array}} \right.}$$
(8)

For a given sample \({ \tilde{x}_{i} },\) we only need to check whether it is located inside the MVEE: if it satisfies \({ \tilde{x}_{i}^{\rm T} \tilde{M} \tilde{x}_{i} \leqslant 1},\) the sample is inside the MVEE; otherwise it is outside.
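A minimal sketch of this point detection step, assuming M_tilde is the (k + 1) × (k + 1) shape matrix obtained from the learner above (the function name and interface are ours):

```python
import numpy as np

def inside_mvee(x, M_tilde):
    """Return True if the augmented sample (1, x) lies inside the MVEE boundary."""
    x_aug = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    return float(x_aug @ M_tilde @ x_aug) <= 1.0
```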

3.3 Kernel regularized self-adaptive MVEE cognitive learner

The matrices \({X^{\rm T} A^{2} X}\) and \({AXX^{\rm T} A}\) have the same nonzero eigenvalues. According to (5), we have the following identity:

$${\ln \det {\left({AXX^{\rm T} A + \mu I} \right)} = \ln \det {\left({X^{\rm T} A^{2} X + \mu I} \right)} + {\left({n - {\left({k + 1} \right)}} \right)}\ln {\left(\mu \right)}}$$
(9)

Later, we will explain how to use this important equation.

As for the inner product \({XX^{\rm T}},\) we can find a kernel K which satisfies the Mercer condition and then replace the inner product, i.e., \({XX^{\rm T} \to K}.\) We get \({AXX^{\rm T} A = AKA},\) and (9) can be rewritten as:

$${\ln \det (\hbox{AKA} + \mu I) = \ln \det (X^{\rm T} AAX + \mu I) + (n - k - 1)\ln (\mu)}$$
(10)

Hence we can optimize \({\ln \det (AKA + \mu I)}\) instead of \({\ln \det (X^{\rm T} A^{2} X + \mu I)}.\) The corresponding kernel-regularized self-adaptive MVEE can be written as:

$${\begin{aligned}\,& {\mathop {\max}\limits_{\alpha_{i}}}\quad \ln \det {\left({\hbox{AKA} + \mu I} \right)} \\\,& s.t.\quad \;{\sum\limits_{i = 1}^n {\alpha_{i}}} = \nu, \\\,& \quad \quad \;0 \leqslant \alpha_{i} \leqslant \frac{1}{n}, \quad \forall i = 1,2, \ldots n\\ \end{aligned} }$$
(11)

To connect (11) with the dual variable α, we define \({G \in {\Re}^{n \times n}}\) to be a Cholesky factor of K, i.e., \({K = GG^{\rm T}}.\) Then \({AKA = AGG^{\rm T} A}\) and \({G^{\rm T} A^{2} G = {\sum\nolimits_{i = 1}^n {\alpha_{i} g_{i} g^{\rm T}_{i}}}},\) where g i is the ith row of G written as a column vector. According to (9), \({AGG^{\rm T} A}\) and \({G^{\rm T} A^{2} G}\) have the same eigenvalues, so that \({\ln \det (AGG^{\rm T} A + \mu I) = \ln \det (G^{\rm T} A^{2} G + \mu I) = \ln \det {\left({{\sum\nolimits_{i = 1}^n {\alpha_{i} g_{i} g^{\rm T}_{i}}} + \mu I}\right)}}.\) Thus, we obtain the final dual kernel-regularized self-adaptive MVEE:

$${\begin{aligned}\,& {\mathop {\max}\limits_{\alpha_{i}}}\quad \ln \det {\left({{\sum\limits_{i = 1}^n {\alpha_{i} g_{i} g^{\rm T}_{i}}} + \mu I} \right)} \\\,& s.t.\quad \;{\sum\limits_{i = 1}^n {\alpha_{i}}} = \nu, \\\,& \quad \quad \;0 \leqslant \alpha_{i} \leqslant \frac{1}{n}, \quad \forall i = 1,2, \ldots n\\ \end{aligned} }$$
(12)
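The only structural change from (7) to (12) is that the augmented samples are replaced by the rows g_i of a Cholesky factor of the kernel matrix. A minimal construction sketch, with an assumed kernel callable and a small jitter added for numerical positive definiteness:

```python
import numpy as np

def cholesky_features(X_train, kernel, jitter=1e-10):
    """Rows g_i of G with K = G G^T, to be plugged into problem (12)."""
    K = kernel(X_train, X_train)                              # n x n Gram matrix
    G = np.linalg.cholesky(K + jitter * np.eye(K.shape[0]))   # lower-triangular factor
    return G                                                  # g_i = G[i, :]
```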

Equation (12) is convex and can be solved with state-of-the-art convex programming solvers such as Yalmip [9], which yields the kernel-regularized MVEE cognitive learner. We now consider the point detection of test samples. For the ellipsoid \({\varepsilon (\tilde{M},0)},\) the Mahalanobis distance is defined as \({d{\left({\tilde{x}, \tilde{M}} \right)} = \tilde{x}^{\rm T} \tilde{M} \tilde{x}, \tilde{x} \in R^{{k + 1}}}.\) Because of the matrix \({ \tilde{M}},\) the Mahalanobis distance \({d{\left({\tilde{x}, \tilde{M}} \right)}}\) cannot be expressed directly in kernel form, so we have to find another way to derive its kernel form.

Noting that \({X^{\rm T} A^{2} X = P\Lambda P^{\rm T}},\) we have \({{\sum\nolimits_{i = 1}^n {\alpha_{i} \tilde{x}_{i} \tilde{x}^{\rm T}_{i}}} + \mu I = P{\left({\Lambda + \mu I} \right)}P^{\rm T} + P^{\bot} {\left({\mu I} \right)}P^{\bot T}}\) with \({P^{\bot} = I - P},\) \({ \tilde{M} = {\left({{\sum\nolimits_{i = 1}^n {\alpha_{i} \tilde{x}_{i} \tilde{x}^{\rm T}_{i}}} + \mu I} \right)}^{{- 1}}}\) and \({P = X^{\rm T} AQ\Lambda^{{- \frac{1}{2}}} }.\) The Mahalanobis distance can then be rewritten as:

$${d{\left({\tilde{x}, \tilde{M}} \right)} = \frac{1}{\mu}k{\left({\tilde{x}, \tilde{x}} \right)} - \frac{1}{\mu}{\mathbf{k}}^{\rm T} AQ{\left({\Lambda + \mu I} \right)}^{{- 1}} Q^{\rm T} A{\mathbf{k}}}$$
(13)

where \({{\mathbf{k}} = {\left({k{\left({\tilde{x}_{1}, x} \right)},k{\left({\tilde{x}_{2}, x} \right)}, \ldots, k{\left({\tilde{x}_{n}, x} \right)}} \right)}^{\rm T}}\) and \({Q,\Lambda }\) can be determined via \({\hbox{AKA} = Q\Lambda Q^{\rm T} .}\)
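A minimal numpy sketch of evaluating Eq. (13), assuming the dual variables alpha, the regularizer mu, a kernel function kernel(X, Y) returning the Gram matrix, and the training samples X_train are available (all names are illustrative):

```python
import numpy as np

def kernel_mahalanobis(x, X_train, alpha, mu, kernel):
    """Kernel form of the Mahalanobis distance d(x, M_tilde) in Eq. (13)."""
    A = np.diag(np.sqrt(alpha))                  # A_ii = sqrt(alpha_i)
    K = kernel(X_train, X_train)                 # n x n Gram matrix
    lam, Q = np.linalg.eigh(A @ K @ A)           # AKA = Q Lambda Q^T
    k_vec = kernel(X_train, x[None, :]).ravel()  # k_i = k(x_i, x)
    k_xx = kernel(x[None, :], x[None, :])[0, 0]
    proj = Q.T @ (A @ k_vec)                     # Q^T A k
    return (k_xx - proj @ (proj / (lam + mu))) / mu
```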

4 MVEE cognitive learner for two class classification

We now present a novel two-class classification algorithm based on a single MVEE cognitive learner. The main idea is to use an MVEE to enclose all the samples of one class (m samples, y i = 1) while trying to exclude the samples of the other class (n samples, y i = − 1). For clarity, we name this method MVEEC (minimum volume enclosing ellipsoid for classification). The problem can be formulated as:

$${\begin{aligned}\,& {\mathop {\min}\limits_{\tilde{M},\xi_{i}, \xi_{j}}}\quad - \ln \det {\left({\tilde{M}} \right)} + \frac{1}{l}{\sum\limits_{i = 1}^l {\xi_{i}}} + \nu \rho \\\,& s.t.\quad y_{i} \tilde{x}^{\rm T}_{i} \tilde{M} \tilde{x}_{i} \leqslant y_{i} \rho + \xi_{i}, \quad \forall i = 1,2, \ldots, l \\\,& \quad \quad \tilde{M} \succ 0,\rho > 0,\xi_{i} \geqslant 0,y_{i} \in {\left\{{1, - 1} \right\}}, \quad \forall i = 1,2, \ldots, l \end{aligned} }$$
(14)

where l is the total number of samples, i.e., l = m + n.

According to the optimality conditions and KKT conditions, the Lagrange dual of (14) is:

$${\begin{aligned}\,& {\mathop {\max}\limits_\alpha}\quad \ln \det {\left({{\sum\limits_{i = 1}^l {y_{i} \alpha_{i} \tilde{x}_{i} \tilde{x}^{\rm T}_{i}}}} \right)}\\ & s.t.\quad \;\;{\sum\limits_{i = 1}^l {y_{i} \alpha_{i}}} = \nu, \\\,& \quad \;\quad \;0\;\, \leqslant \alpha_{i} \leqslant \frac{1}{l}, \quad \forall i = 1,2, \ldots, l \end{aligned} }$$
(15)

where ρ*, ξ* i can be computed via the following KKT conditions:

$${\left| {\begin{array}{*{20}c} {{\alpha^{*}_{i} {\left({y_{i} \rho^{*} + \xi^{*}_{i} - y_{i} \tilde{x}^{\rm T}_{i} U^{*} U^{* T} \tilde{x}_{i}} \right)} = 0,\forall {\left({\alpha^{*}_{i} = \frac{1}{l}} \right)}}}\\ {{{\left({\frac{1}{l} - \alpha^{*}_{i}} \right)}\xi^{*}_{i} = 0,\forall {\left({0 \leqslant \alpha^{*}_{i} < \frac{1}{l}} \right)}}} \end{array}} \right.}$$
(16)
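Once (15) is solved and ρ* is recovered from (16), a plausible decision rule implied by the constraints of (14) is to compare the ellipsoidal distance of a test sample with ρ*. The sketch below is our reading, not a rule stated explicitly above:

```python
import numpy as np

def mveec_predict(x, M_tilde, rho_star):
    """Assign +1 if the augmented sample lies inside the rho*-level ellipsoid."""
    x_aug = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    return 1 if float(x_aug @ M_tilde @ x_aug) <= rho_star else -1
```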

5 Experiments

This section investigates the enclosing learning machine on two well-known benchmark datasets: a ball bearing dataset for novelty detection [10] and the iris dataset for classification [11]. To obtain a quantitative assessment of the MVEE learners, one-class SVM (OCSVM) and NuSVM (a two-class classification algorithm) are adopted on the corresponding datasets for performance comparison under the same settings. That is, we compare the OCMVEE method with OCSVM on the novelty detection problem, and the MVEEC method with NuSVM on the two-class classification problem. We use libsvm [12] to implement OCSVM and NuSVM; the MVEE enclosing learning machines are programmed in Matlab via Yalmip. Both linear and RBF kernels are investigated for all methods. The optimal parameters of OCSVM, NuSVM, MVEE, and MVEEC are determined via fivefold cross-validation.
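As an illustration of the comparison protocol only (the paper uses libsvm and Matlab/Yalmip), the following sketch selects NuSVC parameters by fivefold cross-validation on the two nonlinearly separable iris classes using scikit-learn; the parameter grid is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import NuSVC

iris = load_iris()
mask = iris.target != 0                      # keep Versicolour (1) and Virginica (2)
X, y = iris.data[mask], iris.target[mask]

grid = GridSearchCV(NuSVC(kernel='rbf'),
                    param_grid={'nu': [0.1, 0.2, 0.3, 0.5],
                                'gamma': [0.01, 0.1, 1.0]},
                    cv=5)                    # fivefold cross-validation
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```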

5.1 Novelty detection

The data come from a real ball bearing of type 6204 (steel cage), with rotational frequency 24.5625 Hz (a tacho signal was used for the measurement), sampling frequency 16384 Hz, and minimum frequency 0.7 Hz; the data acquisition system was a B&K analyzer. Each file consists of 10–12 instances, and each instance consists of 2,048 samples of acceleration. After preprocessing with a discrete fast Fourier transform, each instance had 32 attributes. The dataset consists of five categories: normal data from new ball bearings and four types of abnormalities, i.e., fault 1 (outer race completely broken), fault 2 (broken cage with one loose element), fault 3 (damaged cage with four loose elements), and fault 4 (a badly worn ball bearing with no evident damage) (Table 1).

Table 1 Ball-bearing dataset novelty detection success rate

5.2 Pattern classification

The dataset contains three classes of 50 instances each, where the classes refer to Iris Setosa, Iris Versicolour, and Iris Virginica. Each instance has four attributes: sepal length, sepal width, petal length, and petal width. The Setosa class is linearly separable from the other two classes, Versicolour and Virginica. Since Versicolour and Virginica are not linearly separable, we report only this two-class classification problem (Table 2).

Table 2 Iris dataset total classification error rate

6 Conclusions

We have proposed a novel machine learning paradigm based on minimum volume enclosing shapes, called enclosing machine learning, and illustrated its concepts and algorithms using the minimum volume enclosing ellipsoid. We developed an MVEE one-class description algorithm and a two-class classification algorithm, and validated them on benchmark datasets. The results show that the proposed MVEE enclosing learning machines are comparable to, and in some cases better than, SVMs on the datasets studied.