1 Introduction

Data description addresses the problem of how to represent a group of data. In general, a description is constructed from a given set of target objects and is later used to predict whether incoming unknown objects belong to the same target group. The simplest form of the problem is known as one-class classification, where only one description is of concern. Such a problem can be used for outlier or novelty detection. One-class classification can be viewed as a two-class problem in which only data in one class, called targets, are easily obtained. Samples from the other class, however, may be very scarce because of either the difficulty or the high cost of data collection. These samples are considered outliers when they are undesirable, or novelties when they are desirable.

Support vector data description (SVDD), inspired by the rise of the support vector machine (SVM) [6], was proposed by Tax and Duin [24]. The method solves data description problems by fitting a spherical shape around the targets in a higher-dimensional space defined by a kernel function. Like SVM, the method can apply kernel tricks and, unlike some existing approaches [4], does not depend on estimating a probability density function of the target data.

The success of SVDD is witnessed by a number of derived works. For example, Tax and Duin themselves later extended SVDD so that outliers are also included in estimating the descriptive domain. The method, called SVDD with negative samples, or nSVDD [25], is the more complete form of SVDD, making it comparable to SVM. In other words, SVDD possesses a set of hyperspheres as its predefined hypothesis set instead of the hyperplanes in SVM. Other extensions of SVDD include the two-norm nSVDD by Mu and Nandi [16], who also proposed a scheme for multiclass classification. Their multiclass paradigm is to find a hypersphere containing the data in one class while excluding the others. The optimization is done for all classes so that each class has its own covering sphere, and the classification rule for a new test sample is determined by the distance from the sample to the closest ball. Huang [11] also proposed the two-class SVDD (TC-SVDD), which simultaneously tries to find two hyperspheres covering the two classes of samples.

The idea of SVDD is built around the minimum enclosing ball (MEB) problem, posed by Sylvester [23] more than 150 years ago, which seeks the smallest ball containing all training samples. SVDD may be considered a soft-margin version of MEB with kernel tricks. As a further step, it is tempting to develop an ellipsoidal version of SVDD, which will be called the “ellipsoidal support vector data description” (eSVDD). Intuitively, ellipsoids are the next simplest convex, smooth geometrical shape after planes and spheres. The popular and well-known Gaussian distribution is also associated with ellipsoidal contours.

A natural choice for a covering ellipsoid is the one with minimum volume containing the target data, known as the “minimum volume covering ellipsoid” (MVCE). In a larger picture, MVCE belongs to the class of ellipsoid inclusion problems [1]. Its potential applications span various fields; for example, it has been used to approximate stability regions in control theory [14], to represent a group of data in statistics, and even to play a part in obstacle collision detection [18], because its shape is considerably less conservative than that of a sphere.

One of the earliest interests in MVCE can be traced back to the famous Löwner–John ellipsoid [10], where the minimum volume ellipsoid containing a convex body is unique and is characterized only by the contact points between the convex body and the optimal ellipsoid. Since then, numerous studies have been devoted specifically to solving the MVCE problem, such as the works of Khachiyan [12], Sun and Freund [22], Kumar and Yıldırım [13], Todd and Yıldırım [29], and Ahipaşaoğlu [2]. In addition, the problem can be cast as a semidefinite programming problem; Toh [30] and Vandenberghe [32] proposed generic logarithm-determinant maximization solvers which can also be used to solve MVCE problems.

MVCE has also been applied to pattern classification problems. The earliest work was by Rosen [19], where the size of an ellipsoid was measured by the trace of a matrix, equivalent to the sum of squares of the ellipsoid's semiaxes. The application at the time was binary classification of vectorcardiograms with 33 samples in \(\mathbb {R}^3\). Other more recent works include [8, 21, 34].

However, perhaps because of its higher degree of nonlinearity and higher computational cost compared with a plane or a sphere, MVCE has not been widely applied to pattern classification. Nevertheless, considering the progress of SVM, it can be observed that SVM has become far more efficient to solve than when it was initially introduced. Therefore, working with ellipsoids remains interesting, and it may provide fruitful results over SVDD, as we will show later in this paper. In addition, MVCE problems are closely related to the D-optimal design problem, which has been an active area of research for decades. Results from the field of experimental design may help support the use of MVCE in pattern classification problems.

In this paper, the main objective is to design a learning machine whose predefined hypothesis set consists of ellipsoids. The proposed method will be called eSVDD since it is similar to SVDD but with more degrees of flexibility allowed by the use of ellipsoids. The proposed eSVDD is not merely the MVCE; it is equipped with additional functionalities, including soft margins, negative samples, and kernel methods. A soft margin is added to eSVDD to control the capacity of the hypothesis set. Incorporating negative samples may help better reject outliers. More complex descriptive boundaries are also enabled by kernel methods.

Nevertheless, solving eSVDD in feature space cannot be done by the usual kernel tricks as in SVDD, since its dual formulation is expressed in terms of outer products. There are studies by Dolia et al. [7, 8] that estimate the MVCE in a kernel-defined feature space, using spectral decomposition and singular value decomposition (SVD) to reformulate the problem in terms of inner products so that kernel tricks can be applied. Wei et al. [33, 34] applied a similar approach and proposed the concept of “enclosing machine learning” for data description. Unfortunately, their way of applying kernel tricks is too specific to the structure of the problem. As a result, it cannot readily utilize the various existing MVCE solvers. To overcome this problem, we propose to tackle the problem from the perspective of the “empirical feature map” [20, 35]. In other words, it is not necessary to rewrite the problem in terms of inner products; instead, kernel methods are applied by approximately estimating the images of the samples in the feature space.

In the next section, we describe how MVCE problems are formed as eSVDD. Then, the previous work on estimating MVCE in feature space with kernel tricks is explained in Sect. 3 as well as the proposed method based on empirical feature mapping. In Sect. 4, the performance of the proposed method will be evaluated against SVDD on one-class classification with some standard benchmark datasets, followed by the conclusion in Sect. 5.

2 Minimum volume covering ellipsoid

An ellipsoid can be represented in various ways [5]. Let \(\mathcal {E}(\mathbf {E}, \mathbf {d})\) denote an ellipsoid \(\mathcal {E}\) whose center is at \(\mathbf {d} \in \mathbb {R}^n\) with its shape described by \(\mathbf {E} \in \mathbb {S}^n_{++}\). An ellipsoid is a closed convex set defined by

$$\begin{aligned} \mathcal {E} (\mathbf {E}, \mathbf {d}) = \{ \mathbf {x} \in \mathbb {R}^n : (\mathbf {x} - \mathbf {d})^\mathrm{T} \mathbf {E} (\mathbf {x} - \mathbf {d}) \le 1 \}. \end{aligned}$$
(1)

Alternatively, by the change of variables with \(\mathbf {E} = \mathbf {M}^2\) and \(\mathbf {d} = \mathbf {M}^{-1} \mathbf {z}\), another form of the same covering ellipsoid can be expressed as

$$\begin{aligned} \mathcal {E} (\mathbf {M}^2, \mathbf {M}^{-1}\mathbf {z}) = \{ \mathbf {x} \in \mathbb {R}^n : \Vert \mathbf {M}\mathbf {x} - \mathbf {z} \Vert \le 1 \} \end{aligned}$$
(2)

with its volume equal to

$$\begin{aligned} \frac{\pi ^{\frac{n}{2}}}{\Gamma (\frac{n}{2}+1)} \frac{1}{\det \mathbf {M}} \end{aligned}$$
(3)

where \(\Gamma\) is the gamma function.
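For instance, for \(n = 2\) and an axis-aligned ellipse with semi-axes \(p\) and \(q\), we have \(\mathbf {M} = \mathrm {diag}(1/p, 1/q)\), so (3) gives

$$\begin{aligned} \frac{\pi ^{\frac{2}{2}}}{\Gamma (\frac{2}{2}+1)} \frac{1}{\det \mathbf {M}} = \pi p q, \end{aligned}$$

the familiar area of an ellipse.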

Given a set of samples \(\mathbf {x}_i \in \mathbb {R}^n\) where \(i \in \{1, \ldots , m\}\), the main interest is to find the minimum volume ellipsoid covering all the given points. However, in order to avoid a degenerate case where an ellipsoid has zero volume in a particular dimension, we first make the following assumption similar to [22]:

Assumption 1

The affine hull of the m given samples \(\mathbf {x}_1, \mathbf {x}_2, \ldots , \mathbf {x}_m\) spans \(\mathbb {R}^n\).

2.1 MVCE centered at the origin

According to Titterington [28], estimating the MVCE covering \(\{ \mathbf {x}_i \}_{i=1}^m \subset \mathbb {R}^n\) is equivalent to finding the MVCE centered at the origin covering the set of augmented samples \(\tilde{\mathbf {x}}_i = [\mathbf {x}_i^\mathrm{T}, \ 1 ]^\mathrm{T}\) for \(i = 1, 2, \ldots , m\). The mapping of the MVCE from the augmented space \(\mathbb {R}^{n+1}\) back to \(\mathbb {R}^n\) is straightforward, as described in [9, 34]. Therefore, it suffices to solve for the MVCE among the family of ellipsoids whose centers are fixed at the origin in \(\mathbb {R}^{n+1}\). Let this ellipsoid be described as \(\tilde{\mathcal {E}}(\tilde{\mathbf {E}}, \mathbf {0})\), where \(\tilde{\mathbf {E}} \in \mathbb {S}_{++}^{n+1}\) and \(\mathbf {0}\) is the zero vector of appropriate dimension. The minimum volume covering ellipsoid at the origin is the solution to the following problem:

$$\begin{array}{ll} \underset{\tilde{\mathbf {E}}}{\text {min}} & \log \det ( \tilde{\mathbf {E}}^{-1} ) \\ \text {s.t.} & \tilde{\mathbf {x}}_i^\mathrm{T} \tilde{\mathbf{E}} \tilde{\mathbf {x}}_i \le 1, \quad i=1,\ldots , m. \end{array}$$
(P1)

Let \(\mathbf {a} = [a_1, a_2, \ldots , a_m ]^\mathrm{T} \in \mathbb {R}^{m}\) be a vector of Lagrange multipliers. The Lagrangian is formed as

$$\begin{aligned} L(\tilde{\mathbf {E}}, \mathbf {a}) = \log \det ( \tilde{\mathbf {E}}^{-1} ) + \sum _{i = 1}^m a_i ( \tilde{\mathbf {x}}_i^\mathrm{T}\tilde{\mathbf {E}} \tilde{\mathbf {x}}_i - 1). \end{aligned}$$
(4)

Under the first-order necessary conditions for optimality, we obtain

$$\begin{aligned} \tilde{\mathbf {E}}^{-1} = \sum _{i=1}^{m} a_i \tilde{\mathbf {x}}_i\tilde{\mathbf {x}}_i^\mathrm{T} = \tilde{\mathbf {X}}\mathbf {A}\tilde{\mathbf {X}}^\mathrm{T} \end{aligned}$$
(5)

where \(\mathbf {A} = {{\mathrm{diag}}}(\mathbf {a})\) and \(\tilde{\mathbf {X}} = [\tilde{\mathbf {x}}_1, \tilde{\mathbf {x}}_2, \ldots , \tilde{\mathbf {x}}_m ]\). The Lagrangian (4) with the optimal \(\tilde{\mathbf {E}}^*\) results in

$$\begin{aligned} L(\tilde{\mathbf {E}}^{*}, \mathbf {a}) = -\log \det ( \tilde{\mathbf {E}}^{*} ) - \mathbf {e}^\mathrm{T} \mathbf {a} + n \end{aligned}$$
(6)

where \(\mathbf {e}\) is the vector of ones with the appropriate dimension. To make the objective value of the dual problem the same as the primal one, we set \(\mathbf {e}^\mathrm{T} \mathbf {a} = n\), leading to the dual problem:

$$\begin{array}{ll} \underset{\mathbf {a}}{\text {max}} & \log \det ( \tilde{\mathbf {X}}\mathbf {A}\tilde{\mathbf {X}}^\mathrm{T} ) \\ \text{s.t.} & \mathbf{e}^\mathrm{T} \mathbf {a} = n \\ & \mathbf {a} \ge \mathbf {0}. \end{array}$$
(D1)

(D1) can be solved by standard interior-point software with logarithm-determinant maximization support, such as SDPT3 version 4 [31]. Then, \(\tilde{\mathbf {E}}\) is computed according to (5). The method of projecting the ellipsoid \(\tilde{\mathcal {E}}(\tilde{\mathbf {E}}, \mathbf {0})\) to \(\mathcal {E}(\mathbf {E}, \mathbf {d})\) is described in [9, 34]. In addition, for a given sample \(\mathbf {x} \in \mathbb {R}^n\), it is easy to determine whether the point is covered by the ellipsoid by computing the squared distance from the sample to the origin in the augmented space as \(d^2 (\tilde{\mathbf {x}}, \mathbf {0}) = \tilde{\mathbf {x}}^\mathrm{T}\tilde{\mathbf {E}}\tilde{\mathbf {x}}\). If \(0 \le d \le 1\), then \(\mathbf {x}\) is inside or on the boundary of the ellipsoid; otherwise, \(\mathbf {x}\) is not covered by the ellipsoid.

2.2 MVCE with optimally selected center

The MVCE with an optimally selected center can be obtained as an extension of the one centered at the origin. Some previous works, such as [9] and [34], solved the problem in two steps: first solving for the MVCE in the augmented space and then projecting the resulting ellipsoid back into the original space. However, the process can be made more concise. In this section, we discuss the following formulation, adapted from [13], whose result combines the two steps.

Lemma 1

(MVCE) Given the ellipsoid equation of the form (2), the minimum volume \(\mathcal {E}(\mathbf {M}^2, \mathbf {M}^{-1}\mathbf {z})\) containing the given samples \(\{ \mathbf {x}_i \}_{i=1}^{m}\) is the solution to the following optimization problem in the dual form:

$$\begin{array}{ll} \underset{\mathbf {a}}{\text {max}} & \log \det ( \tilde{\mathbf {X}}\mathbf {A}\tilde{\mathbf {X}}^\mathrm{T} ) \\ \mathrm {s.t.} & \mathbf {e}^\mathrm{T}\mathbf {a} = n \\ & \mathbf {a} \ge \mathbf {0}. \end{array}$$
(D2)

where \(\tilde{\mathbf {X}} = [\mathbf {X}^\mathrm{T}, \mathbf {e} ]^\mathrm{T}\) and \(\mathbf {X} = [ \mathbf {x}_1, \mathbf {x}_2, \ldots , \mathbf {x}_m ]\). \(\mathbf {a} \in \mathbb {R}^m\) is the vector of Lagrange multipliers and \(\mathbf {A} = \mathrm {diag}(\mathbf {a})\). The optimal \(\mathcal {E}(\mathbf {M}^2, \mathbf {M}^{-1}\mathbf {z})\) is obtained from the first-order optimality conditions:

$$\begin{aligned} \mathbf {M}&= \left( \mathbf {X}\left[ \mathbf {A} - \frac{\mathbf {a} \mathbf {a}^\mathrm{T}}{\mathbf {e}^\mathrm{T}\mathbf {a}}\right] \mathbf {X}^\mathrm{T} \right) ^{-1/2} \\ \mathbf {z}&= \frac{\mathbf {M}\mathbf {X}\mathbf {a}}{\mathbf {e}^\mathrm{T}\mathbf {a}}. \\ \end{aligned}$$
(7)

Proof

Since the MVCE problem formulated with the constraint in the form of (1) is not a convex program, as shown by a counterexample in [9], we instead consider the following MVCE problem:

$$\begin{array}{ll} \underset{\mathbf {M}, \mathbf {z}}{\text {min}} & 2\log \det ( \mathbf {M}^{-1} ) \\ \mathrm {s.t. } & \Vert \mathbf {M}\mathbf {x}_i - \mathbf {z}\Vert \le 1, \quad i=1,\ldots , m. \end{array}$$
(P2)

It should be noted that the factor of two in the objective of (P2) is kept merely for mathematical convenience, in order to avoid multiplications and divisions by two later in the dual formulation. It follows that the Lagrangian of (P2) is

$$\begin{aligned} L(\mathbf {M}, \mathbf {z}, \mathbf {a}) = 2\log \det ( \mathbf {M}^{-1} ) + \sum _{i = 1}^m a_i ( (\mathbf {M}\mathbf {x}_i - \mathbf {z})^\mathrm{T} (\mathbf {M}\mathbf {x}_i - \mathbf {z}) - 1 ). \end{aligned}$$

Under the first-order necessary conditions for optimality, we have

$$\begin{aligned} 0&= \sum _{i = 1}^m a_i (\mathbf {z} - \mathbf {M} \mathbf {x}_i) \\ 0&= -2 \mathbf {M}^{-1} + \sum _{i = 1}^m a_i [(\mathbf {M}\mathbf {x}_i - \mathbf {z})\mathbf {x}_i^\mathrm{T} + \mathbf {x}_i (\mathbf {M} \mathbf {x}_i - \mathbf {z})^\mathrm{T} ] \end{aligned}$$

which lead to

$$\begin{aligned} \mathbf {z}&= \frac{\mathbf {M}\mathbf {X}\mathbf {a}}{\mathbf {e}^\mathrm{T}\mathbf {a}} \end{aligned}$$
(8)
$$\begin{aligned} \mathbf {M}^{-1}&= \frac{1}{2}( \mathbf {S}\mathbf {M} + \mathbf {M}\mathbf {S} ) \end{aligned}$$
(9)

where

$$\begin{aligned} \mathbf {S} = \mathbf {X}\left[ \mathbf {A} - \frac{\mathbf {a} \mathbf {a}^\mathrm{T}}{\mathbf {e}^\mathrm{T}\mathbf {a}}\right] \mathbf {X}^\mathrm{T}. \end{aligned}$$
(10)

According to the proofs by Sun and Freund [22] and by Zhang and Gao [36], (9) has the unique positive definite solution

$$\begin{aligned} \mathbf {M}= \mathbf {S}^{-1/2}. \end{aligned}$$
(11)

As a result, the dual problem associated with (P2) can be derived as

$$\begin{array}{ll}\underset{\mathbf {a}}{\text {max}}&\log \det \left( \mathbf {X}\left[ \mathbf {A} - \frac{\mathbf {a} \mathbf {a}^\mathrm{T}}{\mathbf {e}^\mathrm{T}\mathbf {a}}\right] \mathbf {X}^\mathrm{T} \right) - \mathbf {e}^\mathrm{T}\mathbf {a} + n \\\text {s.t.}&\mathbf {a} \ge \mathbf {0}. \end{array}$$
(12)

Similar to (6), we set \(\mathbf {e}^\mathrm{T}\mathbf {a} = n\) so that the primal and dual problems have the same objective value. In addition, using the Schur determinant identity together with the fact that \(\mathbf {a} = \mathbf {A} \mathbf {e}\), the quadratic term in \(\mathbf {a}\) appearing in (12) satisfies

$$\begin{aligned} \det \left( \mathbf {X}\left[ \mathbf {A} - \frac{\mathbf {a} \mathbf {a}^\mathrm{T}}{\mathbf {e}^\mathrm{T}\mathbf {a}}\right] \mathbf {X}^\mathrm{T} \right) = \frac{1}{\mathbf {e}^\mathrm{T}\mathbf {a}} \det \begin{bmatrix} \mathbf {X}\mathbf {A}\mathbf {X}^\mathrm{T}&\mathbf {X}\mathbf {A} \mathbf {e} \\ \mathbf {e}^\mathrm{T} \mathbf {A} \mathbf {X}^\mathrm{T}&\mathbf {e}^\mathrm{T}\mathbf {A}\mathbf {e} \end{bmatrix} = \frac{1}{\mathbf {e}^\mathrm{T}\mathbf {a}} \det ( \tilde{\mathbf {X}} \mathbf {A} \tilde{\mathbf {X}}^\mathrm{T} ). \end{aligned}$$

Since \(\mathbf {e}^\mathrm{T}\mathbf {a} = n\) is constant over the feasible set, maximizing (12) is equivalent to maximizing \(\log \det ( \tilde{\mathbf {X}} \mathbf {A} \tilde{\mathbf {X}}^\mathrm{T} )\). The result follows. \(\square\)

From Lemma 1, it is clearly seen that solving the MVCE with an optimally selected center in \(\mathbb {R}^n\) (D2) is equivalent to solving the MVCE centered at the origin (D1) in the augmented space \(\mathbb {R}^{n+1}\) with the augmented samples. The lemma also provides a direct interpretation of the solution as an ellipsoid in \(\mathbb {R}^n\), eliminating the step of projecting the ellipsoid from the augmented space back to the original one. It is also worth noting from (7) that if all the elements of \(\mathbf {a}\) are set to \(1/m\), then \(\mathbf {M}^{-2}\) and \(\mathbf {M}^{-1}\mathbf {z}\) become the sample covariance and the sample mean of the training samples. Lastly, the distance from the center of the ellipsoid to a given sample \(\mathbf {x}\) is defined by \(d(\mathbf {x}, \mathbf {z}) = \Vert \mathbf {M}\mathbf {x} - \mathbf {z}\Vert\). The sample is covered by the ellipsoid when \(0 \le d \le 1\); otherwise, it is considered an outlier.
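To make the construction concrete, the following minimal sketch solves (D2) and recovers \(\mathbf {M}\) and \(\mathbf {z}\) from (7). It is an illustration in Python with CVXPY on arbitrary toy data, not the MATLAB/YALMIP implementation used in Sect. 4, and all variable names are ours.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 2, 30
X = rng.standard_normal((n, m))             # samples stored column-wise
X_tilde = np.vstack([X, np.ones((1, m))])   # augmented samples [x_i^T, 1]^T

# Dual problem (D2): maximize log det(X~ A X~^T)  s.t.  e^T a = n,  a >= 0.
a = cp.Variable(m, nonneg=True)
problem = cp.Problem(cp.Maximize(cp.log_det(X_tilde @ cp.diag(a) @ X_tilde.T)),
                     [cp.sum(a) == n])
problem.solve(solver=cp.SCS)
a_opt = a.value

# Recover the ellipsoid E(M^2, M^{-1} z) from the optimality conditions (7).
S = X @ (np.diag(a_opt) - np.outer(a_opt, a_opt) / a_opt.sum()) @ X.T
w, V = np.linalg.eigh(S)
M = V @ np.diag(w ** -0.5) @ V.T            # M = S^{-1/2}
z = M @ X @ a_opt / a_opt.sum()

# All samples satisfy ||M x_i - z|| <= 1, with support vectors on the boundary.
d = np.linalg.norm(M @ X - z[:, None], axis=0)
print(d.max())                              # ~1 up to solver tolerance

# Setting every a_i = 1/m recovers the sample covariance and mean, as noted above.
u = np.full(m, 1.0 / m)
S_u = X @ (np.diag(u) - np.outer(u, u) / u.sum()) @ X.T
print(np.allclose(S_u, np.cov(X, bias=True)), np.allclose(X @ u, X.mean(axis=1)))
```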

Fig. 1 Examples of eSVDD by varying the parameter \(c_i = C\) for all samples

2.3 Ellipsoidal support vector data description

The idea of creating a soft-margin MVCE is inspired by other popular soft-margin learning machines such as SVM and SVDD. We found that Dolia et al. [7] and Wei et al. [34] also incorporated this idea in constructing their MVCEs. An MVCE problem with \(\ell _1\)-relaxation is formulated by introducing a slack variable \(\xi _i\), for \(i = 1, \ldots , m\), to allow possible misclassification as

$$\begin{array}{ll}\underset{\mathbf {M}, \mathbf {z}, \varvec{\xi }}{\text {min}} & 2\log \det ( \mathbf {M}^{-1} ) + \mathbf {c}^\mathrm{T}\varvec{\xi } \\ \mathrm {s.t.} & \Vert \mathbf {M}\mathbf {x}_i - \mathbf {z}\Vert ^2 \le 1 + \xi _i \\ & \xi _i \ge 0, \quad i=1,\ldots , m \end{array}$$
(P3)

where \(\varvec{\xi }\) and \(\mathbf {c}\) are the vectors of \(\xi _i\) and \(c_i\), respectively, for \(i = 1, 2, \ldots , m\), and each \(c_i > 0\) is given.

The introduced slack variables represent empirical errors, which are subject to minimization. Since the formulation follows the same concept as SVDD, it is called the “ellipsoidal support vector data description” (eSVDD).

It is also possible to incorporate known outliers, if they exist, into the formulation. Such a problem is still called one-class classification even though each training sample is labeled as either a target or an outlier. Knowledge of the presence of outliers offers a possibility to enhance the data descriptive boundary, and it is natural to exclude outliers from the inside of the ellipsoid. An MVCE formulation that explicitly excludes known outliers was briefly presented in [34] for two-class classification. However, our following formulation is slightly different.

Let each training sample be labeled as \(y_i = 1\) for targets or \(y_i = -1\) for outliers. The MVCE formulated with both soft margins and outliers is called “eSVDD with negative samples” (neSVDD), analogous to SVDD with negative samples (nSVDD). When there are no outliers, neSVDD reduces to eSVDD. The neSVDD problem is stated as

$$\begin{array}{ll}\underset{\mathbf {M}, \mathbf {z}, \varvec{\xi }}{\text {min}}&2\log \det ( \mathbf {M}^{-1} ) + \mathbf {c}^\mathrm{T} \varvec{\xi } \\ \mathrm {s.t. }&y_i \Vert \mathbf {M}\mathbf {x}_i - \mathbf {z}\Vert ^2 \le y_i + \xi _i \\&\xi _i \ge 0, \quad i=1,\ldots , m \end{array}$$
(P4)

where the corresponding dual problem is

$$\begin{array}{ll}\underset{\mathbf {a}}{\text {max}}&\log \det \tilde{\mathbf {X}}\mathbf {A}\mathbf {Y}\tilde{\mathbf {X}}^\mathrm{T} \\ \mathrm {s.t.}&\mathbf {y}^\mathrm{T}\mathbf {a} = n \\& \mathbf {0} \le \mathbf {a} \le \mathbf {c} \end{array}$$
(D4)

where \(\mathbf {y} = [y_1, y_2, \ldots , y_m]^\mathrm{T}\) and \(\mathbf {Y} = \mathrm {diag}({\mathbf {y}})\) with \(y_i \in \{1, -1\}\) for \(i = 1, 2, \ldots , m\). The optimal \(\mathcal {E}(\mathbf {M}^{2}, \mathbf {M}^{-1}\mathbf {z})\) is computed from

$$\begin{aligned} \mathbf {M}&= \left( \mathbf {X}\mathbf {Y}\left[ \mathbf {A}\mathbf {Y} - \frac{\mathbf {a}\mathbf {a}^\mathrm{T}}{\mathbf {y}^\mathrm{T}\mathbf {a}}\right] \mathbf {Y} \mathbf {X}^\mathrm{T} \right) ^{-1/2} \\ \mathbf {z}&= \frac{\mathbf {M}\mathbf {X}\mathbf {Y}\mathbf {a}}{\mathbf {y}^\mathrm{T}\mathbf {a}}. \\ \end{aligned}$$
(13)

Adding slack variables in the primal problem results in box constraints on the Lagrange multipliers in the dual form. Recall that the center \((\mathbf {d})\) of an ellipsoid is a linear combination of training samples weighted by the corresponding Lagrange multipliers. The ellipsoid's shape (\(\mathbf {E}\)) is likewise a linear combination of outer products of training samples weighted by the same Lagrange multipliers. When a sample has \(a_i = 0\), its presence does not affect the ellipsoid. The samples with \(a_i > 0\) are the so-called support vectors, and the shape and center of the optimal ellipsoid depend only on them.
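As an illustration of how (D4) can be set up in practice, the sketch below solves the dual with its box constraints and recovers the ellipsoid from (13). Again, this is a toy Python/CVXPY example with data, labels, and the value \(C = 1\) chosen arbitrarily, not the implementation used in the experiments; setting all \(y_i = 1\) reduces it to the soft-margin eSVDD dual.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n = 2
X = np.hstack([rng.standard_normal((n, 25)),        # 25 targets
               rng.standard_normal((n, 3)) + 3.0])  # 3 known outliers
y = np.hstack([np.ones(25), -np.ones(3)])
m = X.shape[1]
X_tilde = np.vstack([X, np.ones((1, m))])
C = 1.0                                              # c_i = C for all samples

# neSVDD dual (D4): maximize log det(X~ A Y X~^T)  s.t.  y^T a = n,  0 <= a <= c.
a = cp.Variable(m)
objective = cp.Maximize(cp.log_det(X_tilde @ cp.diag(cp.multiply(y, a)) @ X_tilde.T))
constraints = [y @ a == n, a >= 0, a <= C]
cp.Problem(objective, constraints).solve(solver=cp.SCS)

# Recover M and z from (13); note that Y[AY - aa^T/(y^T a)]Y = diag(ay) - ay ay^T/(y^T a)
# with ay_i = a_i y_i, because Y^2 = I.
ay = a.value * y
S = X @ (np.diag(ay) - np.outer(ay, ay) / (y @ a.value)) @ X.T
w, V = np.linalg.eigh(S)
M = V @ np.diag(w ** -0.5) @ V.T
z = M @ X @ ay / (y @ a.value)

# d <= 1 indicates coverage; soft margins may leave some targets outside or outliers inside.
d = np.linalg.norm(M @ X - z[:, None], axis=0)
print(np.round(d, 2))
```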

Fig. 2 Examples of neSVDD by varying the parameter \(c_i = C\) for all samples

Suppose \(\mathbf {c} = C\mathbf {e}\). The effect of the regularization parameter \(\mathbf {c}\) on eSVDD and neSVDD is shown in Figs. 1 and 2, respectively. Since the sum of all \(a_i\) must be n, the minimum possible value of C for eSVDD is \(\frac{n}{m}\). It can be seen that when C increases, the size of the ellipsoid also increases. In the case of eSVDD, the size of the ellipsoid is limited by that of the smallest ellipsoid containing all the samples, as in Fig. 1; increasing C beyond a certain value does not affect the shape of the ellipsoid. This differs from the case with negative samples shown in Fig. 2. The reason is that, for neSVDD, the constraint \(\mathbf {y}^\mathrm{T}\mathbf {a} = n\) can still be satisfied even for a very large C because of the subtraction between Lagrange multipliers, so a Lagrange multiplier can grow as large as C. In eSVDD, by contrast, the constraint \(\mathbf {e}^\mathrm{T}\mathbf {a} = n\) prevents any Lagrange multiplier from becoming too large, no matter how large C is set.

3 Flexible ellipsoidal support vector data description

Although ellipsoidal data descriptions are more flexible than spherical ones, they are still insufficient for describing complex data patterns. A general approach in kernel learning machines introduces kernel methods to enable more complicated descriptive boundaries by replacing an inner product \(\mathbf {x}_i^\mathrm{T}\mathbf {x}_j\) with a kernel function \(k(\mathbf {x}_i, \mathbf {x}_j)\) for all samples i and j. Intuitively, an inner product measures the similarity between two samples, and replacing it with a kernel function provides an alternative similarity measure. With an appropriate choice of kernel function, the samples are mapped into a space with better class separability [35].

In general, a kernel is a positive definite function \(k(\cdot , \cdot ) : \mathbb {R}^n \times \mathbb {R}^n \mapsto \mathbb {R}\) satisfying Mercer's conditions. These properties guarantee that a kernel can be factorized as an inner product between two vectors, i.e., \(k(\mathbf {x}_i, \mathbf {x}_j) = \mathbf {\varPhi }(\mathbf {x}_i)^\mathrm{T} \mathbf {\varPhi }(\mathbf {x}_j)\), where \(\mathbf {\varPhi } : \mathbb {R}^n \mapsto \mathbb {H}\) and \(\mathbb {H}\) is the feature space. In other words, a kernel also defines an inner product in \(\mathbb {H}\). In general, the mapping \(\mathbf {\varPhi }(\mathbf {x})\) is unknown, and it usually maps samples into a higher-dimensional space, possibly an infinite-dimensional one.

Since it is more concise to represent variables in a matrix form, let \(\mathbf {K} \in \mathbb {S}_{+}^{m}\) denote a positive semidefinite matrix, the (ij)th element of which has the value \(k(\mathbf {x}_i, \mathbf {x}_j)\). By denoting \(\mathbf {\Phi } = [ \mathbf {\varPhi }(\mathbf {x}_1), \mathbf {\varPhi }(\mathbf {x}_2), \ldots , \mathbf {\varPhi }(\mathbf {x}_m) ]\), we also have \(\mathbf {K} = \mathbf {\Phi }^\mathrm{T}\mathbf {\Phi }\).
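For instance, with the RBF kernel \(k(\mathbf {x}, \mathbf {y}) = e^{-\Vert \mathbf {x} - \mathbf {y}\Vert ^2/(2\sigma )}\) used later in the experiments (the convention of Fig. 3), the kernel matrix can be assembled as in the short sketch below; the function name and data are ours.

```python
import numpy as np

def rbf_kernel_matrix(X1, X2, sigma):
    """K[i, j] = exp(-||x_i - y_j||^2 / (2 sigma)) for columns x_i of X1 and y_j of X2."""
    sq = (np.sum(X1 ** 2, axis=0)[:, None]      # ||x_i||^2
          + np.sum(X2 ** 2, axis=0)[None, :]    # ||y_j||^2
          - 2.0 * X1.T @ X2)                    # -2 x_i^T y_j
    return np.exp(-sq / (2.0 * sigma))

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 5))                 # samples stored column-wise
K = rbf_kernel_matrix(X, X, sigma=0.5)          # m x m Gram matrix, K = Phi^T Phi
print(K.shape, np.allclose(K, K.T))             # (5, 5) True
```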

MVCE constructions in a higher-dimensional space using kernel methods are presented in the following two subsections. We first review the existing method [7, 8, 33, 34] which tries to explicitly rewrite the MVCE formulation in the form of inner products. Then, the proposed method based on “empirical feature maps” [20] is presented. The ellipsoid formulation (D2) will be extensively used throughout the section with \(\mathbf {X}\) replaced by \(\mathbf {\Phi }\). Note that \(\tilde{\mathbf {\Phi }}\) and \(\tilde{\mathbf {\varPhi }}\) represent the augmented versions of \(\mathbf {\Phi }\) and \(\mathbf {\varPhi }\), respectively.

3.1 Existing works

There exist only a few research studies concerning ellipsoid formulations with kernel tricks. To the best of our knowledge, such works include Dolia et al. [7, 8] and Wei et al. [33, 34]. Their methods are basically the same and can be summarized as follows:

First, rewrite the objective of (D1) by utilizing the fact that, for any matrices \(\mathbf {A}\) and \(\mathbf {B}\), \(\mathbf {A}\mathbf {B}\) and \(\mathbf {B}\mathbf {A}\) have the same nonzero eigenvalues. Together with the Cholesky decomposition of the kernel matrix in the augmented space, \(\tilde{\mathbf {K}} = \mathbf {K} + \mathbf {e}\mathbf {e}^\mathrm{T} = \mathbf {C}^\mathrm{T}\mathbf {C}\), one arrives at the conclusion \(\log \det (\tilde{\mathbf {\Phi }}\mathbf {A}\tilde{\mathbf {\Phi }}^\mathrm{T}) = \log \det (\mathbf {C}\mathbf {A}\mathbf {C}^\mathrm{T})\).

Second, find the definition of the distance in the augmented feature space, \(\tilde{\mathbf {\varPhi }}^\mathrm{T}(\mathbf {x}) \tilde{\mathbf {E}} \tilde{\mathbf {\varPhi }}(\mathbf {x})\) for a given sample \(\mathbf {x} \in \mathbb {R}^n\). This is accomplished by performing a truncated eigenvalue decomposition on the matrix \(\mathbf {A}\tilde{\mathbf {K}}\mathbf {A} = \mathbf {V}\mathbf {\Sigma }^\mathrm{T}\mathbf {\Sigma }\mathbf {V}^\mathrm{T}\), where \(\mathbf {V}\) and \(\mathbf {\Sigma }\) are the matrices of corresponding eigenvectors and singular values, respectively. The reason that the matrix \(\mathbf {A}\tilde{\mathbf {K}}\mathbf {A}\) is important here arises from the reduced singular value decomposition (SVD) of \(\mathbf {\Phi }\mathbf {A}^{\frac{1}{2}} = \mathbf {U}\mathbf {\Sigma }\mathbf {V}^\mathrm{T}\) which provides both \(\tilde{\mathbf {E}}^{-1} = \tilde{\mathbf {\Phi }}\mathbf {A}\tilde{\mathbf {\Phi }}^\mathrm{T} = \mathbf {U}\mathbf {\Sigma }\mathbf {\Sigma }^\mathrm{T}\mathbf {U}^\mathrm{T}\) and \(\mathbf {A}\mathbf {K}\mathbf {A} = \mathbf {V}\mathbf {\Sigma }^\mathrm{T}\mathbf {\Sigma }\mathbf {V}^\mathrm{T}\). Obtaining \(\mathbf {V}\) and \(\mathbf {\Sigma }\) helps estimate \(\mathbf {U} = \mathbf {\Phi }\mathbf {A}^{\frac{1}{2}}(\mathbf {\Sigma }\mathbf {V}^\mathrm{T} )^{+}\) where \((\cdot )^{+}\) is a pseudoinverse operator. As a result, the Mahalanobis distance can be roughly computed by \(\tilde{\mathbf {\varPhi }}^\mathrm{T} \tilde{\mathbf {E}} \tilde{\mathbf {\varPhi }} = \mathbf {\varPhi }^\mathrm{T} \mathbf {U} \left( \mathbf {\Sigma }\mathbf {\Sigma }^\mathrm{T} \right) ^{-1}\mathbf {U}^\mathrm{T} \mathbf {\varPhi }\), where the term \(\tilde{\mathbf {\Phi }}^\mathrm{T}\tilde{\mathbf {\varPhi }}\) is the vector of inner products defined by a kernel.

3.2 Proposed method

Although the aforementioned approach attains the objective of constructing the MVCE in the feature space, its formulation is specific to the factorization of the matrix formed from the Lagrange multipliers and the kernel matrix, making it dependent on the structure of the problem. In other words, the factorization of the matrix \(\mathbf {A}\tilde{\mathbf {K}}\mathbf {A}\) is required. Therefore, it is not readily usable if the formulation of the MVCE problem is altered. In addition, it is inconvenient to apply the approach to kernelize existing MVCE algorithms such as the DRN algorithm [22] and the WA-TY algorithm [29]. From our perspective, this versatility issue must be addressed: an MVCE algorithm should be able to apply kernel methods without the burdensome step of refactoring the problem in terms of inner products.

As a result, we propose a method, called “empirical feature mapping,” which explicitly defines a map from the input space to an empirical feature space \(\mathbb {H}_E\), a finite-dimensional Euclidean space. The term “empirical” indicates that the mapping is constructed from a given set of empirical measures and also distinguishes \(\mathbb {H}_E\) from the feature space \(\mathbb {H}\). In general, \(\mathbb {H}\) and \(\mathbb {H}_E\) are two different spaces, and \(\mathbb {H}\) may even be infinite-dimensional, as in the case of RBF kernels. Therefore, in order to create an empirical feature mapping into a finite-dimensional space in a meaningful way, for a given set of training samples, we define the mapping in Definition 1 such that the inner product in \(\mathbb {H}_{E}\) is equal to the one in \(\mathbb {H}\). This specific version of the empirical feature map is called the “kernel PCA map” [20] since it includes kernel whitening.

Definition 1

(Kernel PCA map) Given the training samples \(\mathbf {x}_1, \mathbf {x}_2, \ldots , \mathbf {x}_m \in \mathbb {R}^n\), the kernel PCA map from \(\mathbb {R}^n\) to \(\mathbb {H}_{E}\) is defined as

$$\begin{aligned} \mathbf {\varPhi }_{E} : \mathbf {x} \mapsto (\mathbf {\Omega }^{+})^\mathrm{T} \mathbf {k}(\mathbf {x}) \end{aligned}$$
(14)

where \(\mathbf {k}(\mathbf {x}) = [ k(\mathbf {x}, \mathbf {x}_1), k(\mathbf {x}, \mathbf {x}_2), \ldots , k(\mathbf {x}, \mathbf {x}_m) ]^\mathrm{T}\) and \(\mathbf {K} = \mathbf {\Omega }^\mathrm{T}\mathbf {\Omega }\) is a matrix factorization of the kernel matrix.

Corollary 1

The empirical feature space \(\mathbb {H}_E\) and the feature space \(\mathbb {H}\) possess the same inner product and Euclidean distance.

Proof

Let \(\mathbf {k}_i = \mathbf {k}(\mathbf {x}_i)\), \(\mathbf {\varPhi }_i = \mathbf {\varPhi }(\mathbf {x}_i)\), and \(\mathbf {\varPhi }_{E_i} = \mathbf {\varPhi }_E(\mathbf {x}_i)\). It follows that \(\mathbf {\varPhi }_{E_i}^\mathrm{T}\mathbf {\varPhi }_{E_j} = \mathbf {k}_i^\mathrm{T}\mathbf {\Omega }^{+}(\mathbf {\Omega }^{+})^\mathrm{T} \mathbf {k}_j = \mathbf {k}_i^\mathrm{T}\mathbf {K}^{+}\mathbf {k}_j = \mathbf {\varPhi }_i^\mathrm{T}\mathbf {\Phi }(\mathbf {\Phi }^\mathrm{T}\mathbf {\Phi })^{+}\mathbf {\Phi }^\mathrm{T}\mathbf {\varPhi }_j = \mathbf {\varPhi }_i^\mathrm{T}\mathbf {\varPhi }_j = k(\mathbf {x}_i, \mathbf {x}_j)\). The Euclidean distance in both spaces is also the same according to \(\Vert \mathbf {\varPhi }_i - \mathbf {\varPhi }_j \Vert ^2 = k(\mathbf {x}_i, \mathbf {x}_i) - 2 k(\mathbf {x}_i, \mathbf {x}_j) + k(\mathbf {x}_j, \mathbf {x}_j)\). \(\square\)

Although \(\mathbb {H}\) can be infinite-dimensional, kernel learning machines operate only in the subspace of \(\mathbb {H}\) spanned by the images of the given samples \(\{\mathbf {\varPhi }(\mathbf {x}_i)\}\) for \(i=1,2,\ldots ,m\). Since the inner product and the Euclidean distance in \(\mathbb {H}_{E}\) are the same as in \(\mathbb {H}\), as shown in Corollary 1, data separability is also the same in both spaces. In fact, \(\mathbb {H}\) is isomorphic to \(\mathbb {H}_{E}\) [35]. Therefore, it is tempting to work with \(\mathbb {H}_{E}\) instead of \(\mathbb {H}\), since the former is easier to access through the mapping in Definition 1.

Even though a kernel matrix satisfying Mercer's condition can always be factorized as \(\mathbf {K} = \mathbf {\Omega }^\mathrm{T} \mathbf {\Omega }\), the decomposition may not be unique; for example, the factorization can be obtained from the eigendecomposition or the LDL decomposition. For the eigendecomposition \(\mathbf {K} = \mathbf {V}\mathbf {\Lambda }\mathbf {V}^\mathrm{T}\), we have \(\mathbf {\Omega } = (\mathbf {V}\mathbf {\Lambda }^{\frac{1}{2}})^\mathrm{T}\), where \(\mathbf {\Lambda }\) is the diagonal matrix of eigenvalues corresponding to the eigenvector matrix \(\mathbf {V}\). Alternatively, for the LDL decomposition \(\mathbf {K} = \mathbf {L}\mathbf {D}\mathbf {L}^\mathrm{T}\), we have \(\mathbf {\Omega } = (\mathbf {L}\mathbf {D}^{\frac{1}{2}})^\mathrm{T}\), where \(\mathbf {L}\) is the lower triangular matrix whose diagonal elements are all ones and \(\mathbf {D}\) is a diagonal matrix. Both \(\mathbf {\Lambda }\) and \(\mathbf {D}\) are generally not of full rank since \(\mathbf {K}\) is positive semidefinite. Therefore, in this paper, we work with the reduced versions of the eigendecomposition and the LDL decomposition.
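The kernel PCA map of Definition 1 with the reduced eigendecomposition can then be realized as in the sketch below, which also verifies the property of Corollary 1 numerically; the RBF kernel, the truncation threshold, and the variable names are our own illustrative choices (the experiments in Sect. 4 use the LDL route instead).

```python
import numpy as np

def rbf(X1, X2, sigma=0.5):
    sq = np.sum(X1 ** 2, 0)[:, None] + np.sum(X2 ** 2, 0)[None, :] - 2.0 * X1.T @ X2
    return np.exp(-sq / (2.0 * sigma))

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 10))                     # training samples, column-wise
K = rbf(X, X)                                        # kernel matrix, K = Omega^T Omega

# Reduced eigendecomposition K = V Lambda V^T: keep only eigenvalues above a threshold.
lam, V = np.linalg.eigh(K)
keep = lam > 1e-10
lam_r, V_r = lam[keep], V[:, keep]
N = lam_r.size                                       # dimension of the empirical feature space

# Kernel PCA map (14): Phi_E(x) = (Omega^+)^T k(x) = Lambda_r^{-1/2} V_r^T k(x).
def phi_E(X_new):
    return np.diag(lam_r ** -0.5) @ V_r.T @ rbf(X, X_new)

# Corollary 1: inner products in H_E reproduce the kernel values on the training samples.
Phi_E = phi_E(X)                                     # N x m image of the training set
print(N, np.allclose(Phi_E.T @ Phi_E, K, atol=1e-8))
```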

Despite being discussed in [20, 35] as an approach for applying kernel methods to algorithms that cannot be explicitly formulated in terms of inner products, empirical feature mapping is still new to MVCE problems. It allows these problems to seamlessly utilize kernel methods. Furthermore, one benefit of using empirical feature maps is that each sample in the original space has an explicit image in the empirical feature space. The matrix \(\mathbf {E}\) and the vector \(\mathbf {d}\) describing the ellipsoid can also be computed, although they may not be unique depending on how \(\mathbf {K}\) is factorized. As a result, empirical feature mapping opens the possibility of performing kernel optimization to determine the most suitable kernel for a given dataset [35].

Fig. 3 Contour plots of neSVDD with the RBF kernel \(k(\mathbf {x}, \mathbf {y}) = e^{-\frac{\Vert \mathbf {x} - \mathbf {y}\Vert ^2}{2\sigma }}\) varying \(\sigma = 0.50\) (left) and 0.75 (right)

To show what a kernelized eSVDD looks like in data description, Fig. 3 demonstrates the effect of the RBF kernel on neSVDD, where five samples belong to the target class and one sample belongs to the outlier class. By varying only the kernel parameter \(\sigma\), it can be seen that the descriptive contours fit the samples more tightly as \(\sigma\) decreases, in contrast to the number of support vectors (not shown in the figure).
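Putting the pieces together, the decision value of a kernelized eSVDD or neSVDD at a test point is simply the distance of Sect. 2 evaluated on the mapped sample. The fragment below assumes the `phi_E` map from the previous sketch and an ellipsoid (\(\mathbf {M}\), \(\mathbf {z}\)) obtained by solving (D2) or (D4) on the mapped training data \(\mathbf {\Phi }_E\) in place of \(\mathbf {X}\); it is only an outline of how contours such as those in Fig. 3 would be evaluated.

```python
# Assumes: phi_E(.) from the kernel PCA map sketch above, and M (N x N), z (in R^N)
# obtained by solving (D2) or (D4) with the mapped training data Phi_E in place of X.
def decision_distance(X_new):
    # d(x, z) = ||M Phi_E(x) - z||; values <= 1 lie inside the descriptive boundary.
    return np.linalg.norm(M @ phi_E(X_new) - z[:, None], axis=0)
```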

Table 1 Datasets

4 Experimental results

In this section, we perform experiments on one-class classification using eSVDD on standard datasets to demonstrate its performance. The experiments compare the minimum ellipsoids, eSVDD and neSVDD, with their minimum spherical counterparts, i.e., SVDD and nSVDD. We used MATLAB with YALMIP [15] as an interface to the SDPT3 solver [31] for the logarithm-determinant optimization. The DDtools toolbox [27], designed specifically for one-class classification, was also used in the implementation. Since the goal is to compare eSVDD directly with SVDD, the datasets for the experiments were obtained from Tax [26], the author of DDtools, who provides standard datasets in PRTools format [17] with some minor cleanup such as filling in missing values. The list of 27 datasets used in the experiments is shown in Table 1, where \(m_\mathrm{t}\), \(m_\mathrm{o}\), and n denote the number of target samples, the number of outliers, and the number of features, respectively. These datasets are derived from multiclass datasets by assigning the group of interest as the target class and the rest as outliers.

In the experiments, the domains of the training data were scaled to one. Only the target classes were used for training SVDD and eSVDD. Since the RBF kernel is a popular choice among kernel functions and is also recommended for use with SVDD [25], we compare the ellipsoidal and spherical data description algorithms using the RBF kernel. The two hyperparameters in SVDD, nSVDD, eSVDD, and neSVDD, namely \(\mathbf {c}\) and \(\sigma\), were selected by grid search over 120 pairs of parameters. The predefined values of \(\sigma\) were 0.5, 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50. The box-constraint parameter \(\mathbf {c}\) was assumed to be the same for all training samples, i.e., \(\mathbf {c} = C\mathbf {e}\). In the case of SVDD, the search values of C were \(\frac{1}{mr}\) for \(r \in [0.1, 0.2, \ldots , 1.0]\).

For eSVDD, C was set to \(\frac{N}{mr}\) for \(r \in [0.1, 0.2, \ldots , 1.0]\), where N is the approximate dimension of the empirical feature space. Here, kernel matrices were factorized by LDL decomposition to obtain \(\mathbf {\Omega } = (\mathbf {L}\mathbf {D}^{\frac{1}{2}})^\mathrm{T}\) for the kernel PCA map. However, since \(\mathbf {D}\) does not always have full rank, its diagonal elements smaller than \(10^{-5}\) were truncated, and the dimension of the reduced \(\mathbf {D}\) is simply the dimension N of the empirical feature space. In addition, in order to avoid a degenerate case where Assumption 1 does not hold, a weighted identity matrix \(\gamma \mathbf {I}\) of appropriate dimension, with a small \(\gamma = 10^{-5}\), was added inside the \(\log \det\) function.

For each pair of \((\mathbf {c}, \sigma )\), tenfold cross-validation was performed. In order to avoid any bias from random reshuffles of the data, all four ellipsoidal and spherical SVDDs were trained and tested on the same data partitions. However, for the 9 out of 10 folds used for training nSVDD and neSVDD, the negative samples were limited to 20, instead of \(\frac{9 m_\mathrm{o}}{10}\), to reduce the training time. The best pair was the one that yielded the maximum mean area under the receiver operating characteristic (ROC) curve. The ROC curve used here is the plot of true-positive rates against false-positive rates, with the maximum area under the curve equal to one. The data descriptive boundaries were set up to output membership values so that the ROC curve could be constructed. For one sample, the membership value was defined as \(\mathrm{e}^{-\Vert d \Vert }\), where d is the distance to the center of the descriptive domain. After the best parameters were selected, the areas under the ROC curves were reported from tenfold cross-validation for 5 independent runs. The results are summarized in Table 2, where, for each dataset, the best value among all four algorithms is underlined. In addition, the table can be split in half to compare the performance between SVDD and eSVDD, or between nSVDD and neSVDD. The best value within each pair is highlighted in bold.
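For clarity, the scoring step can be summarized as in the brief sketch below, an illustration on synthetic distances rather than the DDtools/PRTools pipeline actually used: distances to the center of the descriptive domain are turned into membership values \(\mathrm{e}^{-d}\), from which the area under the ROC curve is computed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic distances to the center of the descriptive domain: targets tend to be closer.
d_targets = np.abs(rng.normal(0.6, 0.3, size=200))
d_outliers = np.abs(rng.normal(1.5, 0.5, size=50))

membership = np.exp(-np.concatenate([d_targets, d_outliers]))   # membership value e^{-d}
labels = np.concatenate([np.ones(200), np.zeros(50)])           # 1 = target, 0 = outlier

# The AUC is invariant to monotone transforms, so e^{-d} and -d give the same area.
print(roc_auc_score(labels, membership))
```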

Table 2 Comparison between the ellipsoidal and spherical SVDDs by the mean and standard deviation of the area under the ROC curve

From the table, we observe that the change from a spherical descriptive boundary to an ellipsoidal one does help improve the results, as seen in the comparisons of SVDD vs. eSVDD and nSVDD vs. neSVDD. In the experiment, eSVDD performed better than SVDD on 18 out of 26 datasets, and neSVDD also provided better results than nSVDD on 20 out of 26 datasets. However, since some of the results differ only slightly, we further illustrate the gaps between the results graphically in Fig. 4. From the figure, neglecting absolute area differences of less than 0.05 (or 5 %) as insignificant, the use of an ellipsoid instead of a sphere degraded the performance on only 4 and 1 datasets for SVDD vs. eSVDD (\(\times\)) and nSVDD vs. neSVDD (\(+\)), respectively.

In addition, the introduction of negative samples to SVDD and eSVDD also helped improve the results on many datasets, though we observed performance drops on some. For the spherical case, negative samples improved the area under the ROC curve for 19 datasets, with only one dataset (No. 9) showing a very slight decrease. For the ellipsoidal case, the improvement was not obvious in terms of the number of datasets: including negative samples improved the results on 12 datasets while degrading them on 13. Although this may suggest that the inclusion of negative samples in eSVDD worsens the results, those results are, in fact, averages over multiple runs of cross-validation, which could be misleading if compared directly digit by digit. Therefore, it is also interesting to look at the results from another perspective. Neglecting absolute changes in the area under the ROC curve of less than 5 %, we observe from Fig. 4 that the performance of eSVDD (\(\Box\)) did not significantly degrade as negative samples were introduced.

It is also worth noting that the decreases in performance after negative samples were included possibly occur because some negative samples are not useful for constructing a better descriptive boundary, since only 20 outlier samples were used. In addition, in the case of neSVDD, the decreases may be attributed to occasional numerical errors that occurred during the simulation. In some situations, the matrix \(\mathbf {M}\) in neSVDD is indefinite, so the optimization (D4) cannot be solved. To deal with this problem, a pair of cross-validation parameters was discarded whenever any particular training fold could not be trained. Although this seems to be a limitation, neSVDD is still favorable in terms of average performance.

Among all the algorithms, the number of datasets on which each algorithm performed best was 14, 10, 5, and 3 out of 27, for eSVDD, neSVDD, nSVDD, and SVDD, respectively, according to the underlined entries in Table 2. Overall, the spherical SVDDs performed best on 6 datasets (the first and third columns), while the ellipsoidal SVDDs did best on 22 datasets (the second and fourth columns). Therefore, it may be concluded that using ellipsoids can provide better results than using spheres.

Furthermore, it may be interesting to see how the negative samples influenced the results. Figure 5 compares the ratio of negative samples in each dataset against how the results changed from eSVDD to neSVDD when negative samples were added. From the plot, however, no obvious relationship can be observed. This is because the number of negative samples has no direct effect on the results; what matters is whether the locations of the samples provide meaningful information for constructing the descriptive boundary.

Although it was pointed out in [3] that the use of \(\ell _1\)-relaxation to formulate a soft-margin ellipsoid could result in unexpected outcomes, our experimental results showed promising performance in practice. We believe this is because the claim in [3] concerned only the appropriate selection of the parameter \(\mathbf {c}\) with no kernel method, which is not the case in our experiments. In fact, the ability to adjust the kernel parameter \(\sigma\) also plays an important role in improving the solution. It is worth noting that in this research we chose the best pair of \(\mathbf {c}\) and \(\sigma\) by grid search. From the experiments, we found that the results are not very sensitive to the values of \(\mathbf {c}\) and \(\sigma\), although they are more sensitive to \(\sigma\) than to \(\mathbf {c}\). The best method to determine these parameters is still an open research question, and a basic approach is generally brute force. A better method for determining the training parameters, or an analysis of how those parameters affect descriptive boundaries, is beyond the scope of this paper and is the subject of further study.

5 Conclusions

In this paper, the proposed ellipsoidal SVDD is formed using the same concept as SVDD: the size of the ellipsoid is limited by the minimum volume needed to cover a given set of data. The soft-margin ellipsoid is achieved by minimizing the one-norm empirical risk. The method also embraces kernel methods to construct an ellipsoid in a higher-dimensional space via empirical feature mapping. In the empirical feature space, which is a finite-dimensional Euclidean space, the Euclidean distances and inner products are the same as in the feature space. The experimental results on one-class classification with standard benchmark datasets showed that eSVDD and neSVDD provided better results than their spherical counterparts for most of the datasets. Therefore, the ellipsoidal SVDDs can be good alternatives to SVDD. For future work, a better approach to incorporating negative samples into eSVDD is required in order to improve numerical stability. A faster MVCE solver is also needed so that the proposed method can handle larger datasets. It is also interesting to see whether the ellipsoidal SVDDs perform well in multiclass classification.

Fig. 4 Differences of the area under the ROC curve between two algorithms summarized from Table 2

Fig. 5 Ratio of negative samples in each dataset compared against the performance changes in eSVDD after negative samples were included