1 Introduction

Classification problems arise extensively in human activities. People differentiate groups of objects by their features, but when there are too many features, redundant ones can confuse recognition rather than aid it. Advanced artificial intelligence techniques can be introduced into the classification process; even so, acquiring more feature information incurs extra costs in space and time. In such cases, feature selection can help simplify the feature set. Feature selection is frequently applied to classification problems with high-dimensional data, in which each feature denotes a separate dimension [12, 35]. For high-dimensional data, features fall into three categories: relevant, irrelevant and redundant [22]. Feature selection aims at obtaining a relevant feature subset by eliminating irrelevant and redundant features from the original data. Compared with using all features, feature selection can achieve similar or superior classification performance with lower computational complexity [2, 14, 27].

Dash and Liu divided the feature selection process into four fundamental steps: generation procedure, evaluation function, stopping criterion and validation procedure [6]. The first two are the key steps among the four. The generation procedure can be one of three types: complete, heuristic and random generation [17, 25]. Because heuristic generation leads to models that are fast to set up and simple to implement, it is often adopted, as in sequential backward selection (SBS) [23] and sequential forward selection (SFS) [20].

The evaluation function can be a measure of distance, information, dependency, consistency, or classifier error rate on the data, where the last measure is frequently used in SBS, SFS and their extensions. According to the evaluation function, two kinds of methods regularly appear in feature selection research: wrapper and filter methods [6, 21]. Approaches using the classification error rate as the evaluation function are referred to as wrapper methods. Alternatively, filter methods select features via statistics, for instance, the mean or variance of measurements. Generally speaking, wrapper methods achieve better classification performance than filter methods, because wrapper methods work with the aid of selectable classifiers whereas filter methods ignore classifiers when processing data. However, at some cost in classification accuracy, filter methods can quickly obtain the final feature subset through heuristic algorithms [33, 34, 37]. In addition, embedded methods have been proposed [5, 37]. In embedded methods, feature selection is incorporated as part of the training process, which considers the empirical risk on the given data, while the test process depends on the chosen classifier itself.

There have been some beneficial attempts to combine feature selection with the state-of-the-art learner, support vector machines (SVMs). Weston et al. proposed a wrapper method combining SVM with feature selection [32]. Owing to the good generalization performance of SVM and the outstanding experimental results, similar wrapper methods have been discussed [4, 16, 30]. SVM-RFE (recursive feature elimination) is one of the most significant SVM-based embedded methods [15, 31]. SVM-RFE behaves well in binary classification problems. To make SVM-RFE applicable to multi-class classification problems, multi-class SVM-RFE (MSVM-RFE) methods have been proposed in [36] and [28]. Zhou and Tuck considered the linear kernel case [36], while Shieh and Yang gave a nonlinear feature selection method [28]. In essence, MSVM-RFE treats a multi-class classification problem as multiple binary classification ones, where an SVM is used to solve each binary problem. In the linear case, the weight vectors obtained by the multiple SVMs are summed to give the feature weights. MSVM-RFE then removes the feature with the minimal weight coefficient, which is the least important feature. This process is repeated until all features are ranked. In the nonlinear case, the ranking criterion considers the difference between the dual objective with all remaining features and the dual objective with one remaining feature removed, which leads to a high computational complexity. In theory, nonlinear MSVM-RFE has a much higher computational complexity than linear MSVM-RFE even if both methods adopt the linear kernel and would produce the same feature ranking.

To address abnormal data detection problems, Jeong et al. applied support vector data description (SVDD) to feature selection and proposed two methods, SVDD-radius-recursive feature elimination (SVDD-RRFE) and SVDD-dual-objective-recursive feature elimination (SVDD-DRFE) [18]. Both algorithms build a compact SVDD model to select the required features using only one-class samples. The ranking criterion of SVDD-RRFE is based on the radius of the SVDD model, and that of SVDD-DRFE on the dual objective function. However, both SVDD-RRFE and SVDD-DRFE suffer from high computational complexity since both methods consider nonlinear kernels. Pursuing rapid feature selection for cancer classification using gene expression data, Cao et al. presented a multiple SVDD-RFE (MSVDD-RFE) method in the linear case [3]. MSVDD-RFE independently constructs multiple SVDD models and selects feature subsets according to the direction energy of the model centers. A final feature subset is obtained by merging these feature subsets. Experimental results in [3] show that MSVDD-RFE is much faster than both SVDD-RRFE and SVDD-DRFE. However, MSVDD-RFE cannot give a final feature ranking since it generates multiple feature rankings. Thus, we do not know which feature is the most important for the task at hand even if we have the final feature subset.

In this paper, two new fast feature selection methods based on the radius and the dual-objective ranking criteria are proposed for one-class, binary and multi-class classification problems, called fast multiple SVDD-RRFE (FMSVDD-RRFE) and fast multiple SVDD-DRFE (FMSVDD-DRFE). Compared to SVDD-RRFE and SVDD-DRFE, FMSVDD-RRFE and FMSVDD-DRFE can address not only one-class problems but also binary or multi-class classification problems. For one-class classification tasks, FMSVDD-RRFE produces the same feature ranking as SVDD-RRFE, while FMSVDD-DRFE produces the same result as SVDD-DRFE; however, FMSVDD-RRFE and FMSVDD-DRFE rank features faster. For binary or multi-class classification, the proposed methods require training two or more SVDD models, from which the ranking scores of all features are computed. The computational complexity of both FMSVDD-RRFE and FMSVDD-DRFE is similar to that of MSVDD-RFE proposed in [3].

This paper makes two contributions. First, we provide fast schemes for computing the radius and dual-objective ranking scores under the linear kernel, respectively. Second, based on these fast schemes, we extend SVDD-RRFE and SVDD-DRFE to multi-class classification (including binary classification) and develop FMSVDD-RRFE and FMSVDD-DRFE. The rest of the paper is organized as follows. Section 2 gives a brief review of SVDD and SVDD-based feature selection. The new algorithms are introduced in Section 3. Section 4 discusses experimental results on the UCI database and microarray datasets, and Section 5 provides summaries and conclusions.

2 Related work

This section briefly describes SVDD and three previous SVDD-based feature selection methods [3, 18]. We assume that the data show variance in all feature directions.

2.1 Support vector data description

Inspired by SVM, Tax and Duin put forward SVDD to detect novel data or outliers [29]. SVDD constructs a closed hypersphere boundary surrounding the target data, and outliers are rejected outside the boundary. In other words, the hypersphere model separates the normal (target) data from novel points. Not limited to one-class problems, SVDD has also been applied to multi-class classification problems [24].

Let \(\left \{{{{\textbf {x}}_{i}}|{{\textbf {x}}_{i}} \in {{\mathbb {R}}^{D}},i = 1, {\ldots } ,n} \right \}\) be the set of target training data, where D is the number of features and n is the number of target samples. The principle of SVDD dividing the target samples from others can be interpreted as the optimization of a convex quadratic programming:

$$ \begin{array}{cl} \min\limits_{R,\,{\textbf{a}},\,\xi} & {R^{2}} + C\sum\limits_{i = 1}^{n} {\xi_{i}}\\ {\mathrm{s.t.}} & \left\| {{\textbf{x}}_{i}} - {\textbf{a}} \right\|^{2} \le {R^{2}} + {\xi_{i}},\ {\xi_{i}} \ge 0,\ i = 1, {\cdots} ,n \end{array} $$
(1)

where R is the radius of the hypersphere, a is the center of the hypersphere, \(\xi_{i}\) is a slack variable, and C > 0 is the penalty factor which controls the trade-off between the volume of the model and the number of data points outside the model. An unseen sample \(\bar {\textbf {x}}\) is judged to be a novel point if it satisfies

$$ \left\| \bar{\textbf{x}} - {\textbf{a}} \right\|^{2} > {R^{2}}. $$
(2)

By introducing the Lagrange multiplier technique, we can derive the dual program of (1):

$$ \begin{array}{l} \max\limits_{\alpha} \sum\limits_{i = 1}^{n} {{\alpha_{i}}{\textbf{x}}_{i}^{T}{{\textbf{x}}_{i}}}-\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {{\alpha_{i}}{\alpha_{j}}{\textbf{x}}_{i}^{T}{{\textbf{x}}_{j}}}} \\ {\mathrm{s}}{\mathrm{.t}}{\mathrm{.}} \sum\limits_{i = 1}^{n} {{\alpha_{i}}} = 1,0 \le {\alpha_{i}} \le C,i = 1, {\cdots} ,n, \end{array} $$
(3)

where \(\alpha_{i}\) is a Lagrange multiplier. The hypersphere center a and radius R can be calculated from the \(\alpha_{i}\). Namely, we have

$$ {\textbf{a}} = \sum\limits_{i = 1}^{n} {{\alpha_{i}}{{\textbf{x}}_{i}}} $$
(4)

and

$$ \begin{array}{l} \quad{R^{2}}({\textbf{x}}_{sv}) = \left\| {{{\textbf{x}}_{sv}} - {\textbf{a}}} \right\|^{2} \end{array} $$
(5)

where \({\textbf{x}}_{sv}\) represents a support vector (SV) whose coefficient satisfies \(0 < \alpha_{sv} < C\).
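For concreteness, the following Python sketch recovers the hypersphere from a solved dual (3). It assumes the multipliers have already been obtained from a generic quadratic programming solver; the function and variable names are illustrative and not part of any reference implementation.

```python
import numpy as np

def svdd_model(X, alpha, C):
    """Recover the SVDD hypersphere from a solved dual (3) with the linear kernel.

    X     : (n, D) target training data
    alpha : (n,) optimal Lagrange multipliers (sum to 1, each in [0, C]), assumed given
    Returns the center a (eq. 4) and the squared radius R^2 (eq. 5).
    """
    a = alpha @ X                                   # center, eq. (4)
    eps = 1e-8
    sv_mask = (alpha > eps) & (alpha < C - eps)     # unbounded SVs: 0 < alpha_sv < C
    # R^2 is the squared distance from the center to any unbounded SV;
    # averaging over them improves numerical stability (assumes at least one exists)
    R2 = np.mean(np.sum((X[sv_mask] - a) ** 2, axis=1))
    return a, R2

def is_novel(x_new, a, R2):
    """Decision rule (2): reject x_new as a novel point if it lies outside the hypersphere."""
    return np.sum((x_new - a) ** 2) > R2
```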

2.2 SVDD-RRFE

SVDD discriminates between normal objects and outliers by establishing a boundary (hypersphere) that is as compact as possible, and the radius of the hypersphere measures how compact this boundary is. Thus, the size of the radius is very important in SVDD. SVDD-RRFE takes the radius R as its ranking criterion [18].

Let \(SV\) be the set of support vectors (SVs) and \(J_{r}\) be the average of \(R^{2}({\textbf{x}}_{sv})\) over all SVs. Then \(J_{r}\) can be defined as follows:

$$ \begin{array}{l} {J_{r}} = \sum\limits_{{{\textbf{x}}_{sv}} \in {{SV}}} {\frac{{{R^{2}}\left( {{{\textbf{x}}_{sv}}} \right)}}{{\left| {SV} \right|}}} \\ \quad\ = \frac{1}{{\left| {{{SV}}} \right|}}\sum\limits_{{{\textbf{x}}_{{{sv}}}} \in {{SV}}} {({\textbf{x}}_{{{sv}}}^{T}{{\textbf{x}}_{{{sv}}}} - 2\sum\limits_{i = 1}^{n} {{\alpha_{i}}{\textbf{x}}_{{{i}}}^{T}{{\textbf{x}}_{{{sv}}}}}} \\ \quad\quad\ + \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {{\alpha_{i}}{\alpha_{j}}{\textbf{x}}_{{{i}}}^{T}{{\textbf{x}}_{{{j}}}}}} ) \end{array} $$
(6)

Let \({{J_{r}^{k}}}\) be the average of \(R^{2}({\textbf {x}}^{k}_{sv})\), where \({\textbf {x}}^{k}_{sv}\) denotes \({\textbf{x}}_{sv}\) without feature k. Then,

$$ \begin{array}{l} {{J^{k}_{r}}} = \sum\limits_{{{\textbf{x}}_{sv}} \in {{SV}}} {\frac{{{R^{2}}\left( {{{\textbf{x}}^{k}_{sv}}} \right)}}{{\left| {SV} \right|}}} \\ \quad\ = \frac{1}{{\left| {{{SV}}} \right|}}\sum\limits_{{{\textbf{x}}_{{{sv}}}} \in {{SV}}} {(({\textbf{x}}^{k}_{{{sv}}})^{T}{{\textbf{x}}^{k}_{{{sv}}}} - 2\sum\limits_{i = 1}^{n} {{\alpha_{i}}{(\textbf{x}}^{k}_{{{i}}})^{T}{{\textbf{x}}^{k}_{{{sv}}}}}} \\ \quad\quad\ + \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {{\alpha_{i}}{\alpha_{j}}({\textbf{x}}^{k}_{{{i}}})^{T}{{\textbf{x}}^{k}_{{{j}}}}}}) \end{array} $$
(7)

SVDD-RRFE aims at removing feature p which makes the radius ranking score \((J_{r}-{J_{r}^{p}})\) minimal. Namely,

$$ p=\arg\min_{k=1,\cdots,D} ({J_{r}} - {{J^{k}_{r}}}) $$
(8)

If \({{J_{r}^{k}}}\) approaches \(J_{r}\), the difference between them is small, which means the effect of feature k is tiny. In this way, the feature with the smallest \((J_{r}-{J_{r}^{k}})\) should be eliminated.

2.3 SVDD-DRFE

Different from the radius ranking criterion, SVDD-DRFE takes the dual objective function as the ranking criterion [18]. Let \(J_{d}\) be the dual objective function of SVDD. Then,

$$ \begin{array}{l} {J_{d}} = \sum\limits_{i = 1}^{n} {\alpha {_{i}}{\textbf{x}}_{i}^{T}{\textbf{x}}{_{i}}}- \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\alpha {_{i}} \alpha {_{j}}{\textbf{x}}_{i}^{T}{\textbf{x}}{_{j}}}} \end{array} $$
(9)

Let \({J_{d}^{k}}\) be the dual objective function of SVDD without feature k. Then,

$$ \begin{array}{l} {{J_{d}^{k}}} = \sum\limits_{i = 1}^{n} {\alpha{_{i}}({\textbf{x}}^{k}_{i})^{T}{\textbf{x}}^{k}_{i}}- \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\alpha {_{i}} \alpha {_{j}}{(\textbf{x}}^{k}_{i})^{T}{\textbf{x}}^{k}_{j}}} \end{array} $$
(10)

To remove the worst feature k, SVDD-DRFE should take the smallest \((J_{d}-{J_{d}^{k}})\) as the dual-objective ranking score. However, the SVDD-DRFE proposed in [18] removes the feature k with the largest \((J_{d}-{J_{d}^{k}})\). In theory, if a feature is completely useless, it should not significantly change the dual objective; thus, \({J_{d}^{k}}\) should be very close to \(J_{d}\) in this situation. Therefore, the worst feature p should satisfy:

$$ \begin{array}{l} p = \arg \mathop {\min} \limits_{k=1,\cdots,D} (J_{d}-{J_{d}^{k}}) \end{array} $$
(11)

In this paper, we take (11) as the ranking criterion of SVDD-DRFE.

2.4 Multiple SVDD-RFE

MSVDD-RFE was proposed in [3]; it takes the center of the hypersphere as the ranking criterion. Let the center of an SVDD model be \({\textbf{a}} = [a_{1}, a_{2}, \cdots, a_{D}]^{T}\). In MSVDD-RFE, \(|a_{i}|\) indicates the average magnitude of the data in the i-th direction with respect to the origin, and \((a_{i})^{2}\) measures the distribution energy of the data in the i-th direction. The data appear compact in the i-th feature direction when \((a_{i})^{2}\) is small. For a multi-class classification problem, MSVDD-RFE requires training multiple hypersphere models: for each class, MSVDD-RFE trains an SVDD and ranks features. With the solution to (3) and (4), the center of the j-th SVDD model can be expressed as:

$$ \begin{array}{l} {{\textbf{a}}^{j}} = {\left[ {{a_{1}^{j}},{a_{2}^{j}}, {\cdots} {a_{D}^{j}}} \right]^{T}} \end{array} $$
(12)

where j = 1,2,⋯ ,c, and c is the number of classes. When processing the j-th class, MSVDD-RFE regards feature k with the smallest energy as the worst feature. Namely,

$$ \begin{array}{l} k = \arg \mathop {\min} \limits_{p=1,\cdots,D} {\left( {{a_{p}^{j}}} \right)^{2}}. \end{array} $$
(13)

In MSVDD-RFE, the number of retained features needs to be determined in advance. For each one-class problem, MSVDD-RFE obtains a retained or ranked feature subset, and a final feature subset is generated by combining these retained features. However, we cannot tell which feature is the most important in the final feature subset. In other words, the final feature subset is not a ranked one.
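The following sketch illustrates the center-energy criterion (13) described above. It is not the authors' implementation from [3]: for brevity the SVDD is not retrained between eliminations, so the per-class selection reduces to sorting features by center energy; the per-class dual solutions are assumed to be available from a solver, and all names are illustrative.

```python
import numpy as np

def msvdd_rfe_subset(class_data, class_alpha, n_keep):
    """Illustrative per-class center-energy selection in the spirit of MSVDD-RFE.

    class_data  : list of (n_j, D) arrays, one per class
    class_alpha : list of (n_j,) SVDD dual solutions, one per class (assumed solved)
    n_keep      : number of features retained per class (fixed in advance)
    """
    subsets = []
    for Xj, alphaj in zip(class_data, class_alpha):
        center = alphaj @ Xj               # center of the j-th SVDD, eqs. (4) and (12)
        energy = center ** 2               # (a_p^j)^2, the criterion in eq. (13)
        kept = np.argsort(-energy)[:n_keep]
        subsets.append(set(kept.tolist()))
    return set().union(*subsets)           # final (unranked) feature subset
```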

3 Proposed methods

As mentioned before, SVDD-based feature selection methods for one-class problems take the hypersphere radius and the dual objective of SVDD as ranking criteria, and show good performance on some datasets [18]. However, these methods suffer from high computational complexity because they consider kernel functions when computing ranking scores. MSVDD-RFE can select features quickly when only the linear kernel is applied [3], but it cannot give a final feature ranking.

To obtain a final feature ranking for classification tasks and improve the speed of computing ranking scores, this section proposes two fast methods based on the radius and the dual-objective criteria, respectively, in the case of the linear kernel.

3.1 Fast multiple SVDD-RRFE

Consider the radius ranking criterion. The size of the radius and the position of the center determine the hypersphere of SVDD, and hence the performance of SVDD. Thus, it is reasonable to take the radius as a ranking criterion. In the following, we first present a fast SVDD-RRFE (FSVDD-RRFE) method for computing radius ranking scores with the linear kernel and then propose FMSVDD-RRFE based on FSVDD-RRFE.

3.1.1 Fast SVDD-RRFE

Let the set of target training data be \(\left \{{{{\textbf {x}}_{i}}|{{\textbf {x}}_{i}} \in {{\mathbb {R}}^{D}},i = 1, {\ldots } ,n} \right \}\), where D is the number of features and n is the number of target samples. We list the computational complexity of SVDD-RRFE with the linear kernel in Table 1, where \(D^{\prime}\) is the size of the current feature subset and SV is the set of SVs in the current iteration. To remove the worst feature from the current feature subset, SVDD-RRFE requires calculating \(J_{r}\) in (6) and \({J_{r}^{k}}, k=1,\cdots , D^{\prime}\) in (7). The computational complexity of \(J_{r}\) is \(O(|SV|^{3} D^{\prime})\), and that of \({J_{r}^{k}}\) is \(O(|SV|^{3}(D^{\prime}-1))\). In total, the computational complexity of feature selection for SVDD-RRFE in the current iteration is \(O(|SV|^{3} D^{\prime 2})\).

Table 1 Comparison of computational complexity in an iteration

To reduce the computational complexity, we propose a theorem on calculating the radius ranking score. Let the ranking score be \(JR_{k}=J_{r}-{J_{r}^{k}}\).

Theorem 1

Given the center a of the hypersphere and the set of support vectors SV obtained by SVDD with the linear kernel, the radius ranking score can be computed as

$$ JR_{k}=\frac{{\sum}_{{\textbf{x}}_{sv}\in SV}\left( x_{(sv,k)}^{2}-2x_{(sv,k)}a_{k}+{a_{k}^{2}}\right)}{|SV|} $$
(14)

where \(x_{(sv,k)}\) is the k-th component of the support vector \({\textbf{x}}_{sv}\), and \(a_{k}\) is the k-th component of the center a.

The proof of Theorem 1 is given in Appendix A. According to Theorem 1, we can construct an accelerated algorithm for SVDD-RRFE. Moreover, the computational complexity of FSVDD-RRFE is \(O(|SV|D^{\prime})\) in an iteration, which is much smaller than that of the original SVDD-RRFE algorithm. Table 1 lists the comparison of computational complexity.
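A minimal sketch of the score in Theorem 1 follows; it assumes the support vectors and center of the current iteration are given, and exploits the fact that (14) is simply the mean squared deviation of the SVs from the center in each feature direction. Names are illustrative.

```python
import numpy as np

def radius_ranking_scores(X_sv, a):
    """Radius ranking scores JR_k of Theorem 1 / eq. (14) for all features at once.

    X_sv : (|SV|, D') support vectors restricted to the current feature subset
    a    : (D',) hypersphere center on the current feature subset
    Returns a (D',) vector; the feature with the smallest score is removed.
    """
    # JR_k = (1/|SV|) * sum_sv (x_(sv,k)^2 - 2 x_(sv,k) a_k + a_k^2)
    #      = mean over SVs of (x_(sv,k) - a_k)^2, i.e. eq. (14) expanded
    return np.mean((X_sv - a) ** 2, axis=0)
```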

3.1.2 FMSVDD-RRFE

Based on the fast algorithm proposed above, we discuss the extension of SVDD-RRFE to binary and multi-class classification.

For a c-class classification problem, assume that there is a set of training samples \(X = \left \{ {{{\textbf {x}}_{i}},{y_{i}}}\right \}_{i = 1}^{n}\), where \({\textbf {x}}_{i} \in \mathbb {R}^{D}\), \(y_{i} \in \{1,2,\cdots,c\}\) denotes the class label of \({\textbf{x}}_{i}\), and D and n are the number of features and samples, respectively.

Let \(X_{j}\) be the set of training samples in the j-th class. Then we have \(X = \cup _{j=1}^{c} {X_{j}}\). For the j-th class, we use \(X_{j}\) to train an SVDD and obtain the corresponding center \({\textbf{a}}^{j}\) and support vector set \(SV^{j}\). In total, there are c SVDD models. To find the worst feature by using all c SVDD models, we redesign the radius ranking score in (14) as follows:

$$ JR_{k}=\sum\limits_{j=1}^{c}\frac{{\sum}_{{\textbf{x}}_{sv}\in SV^{j}}\left( x_{(sv,k)}^{2}-2x_{(sv,k)}{a^{j}_{k}}+({a^{j}_{k}})^{2}\right)}{|SV^{j}|} $$
(15)

If \(JR_{p}\) is the smallest among \(JR_{k}, k = 1,\cdots,D^{\prime}\), then feature p should be removed from the current feature set. Thus, FMSVDD-RRFE takes all classes into account when removing the worst features. Algorithm 1 displays the process of eliminating features in FMSVDD-RRFE; an illustrative sketch of this loop is given below. Roughly speaking, FMSVDD-RRFE removes the feature with the smallest ranking score in each iteration. In doing so, we obtain a complete feature ranking at the end.

Algorithm 1 FMSVDD-RRFE
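The sketch below illustrates the elimination loop described in the text; it is not Algorithm 1 verbatim. `train_svdd` is a hypothetical helper that fits a linear-kernel SVDD (e.g., via a generic QP solver) and returns the center and support vectors; all names are illustrative.

```python
import numpy as np

def fmsvdd_rrfe_ranking(class_data, C, train_svdd):
    """Illustrative FMSVDD-RRFE elimination loop using the scores of eq. (15).

    class_data : list of (n_j, D) arrays, one per class
    C          : SVDD penalty factor
    train_svdd : hypothetical helper; given (X, C), returns (center a, support vectors X_sv)
                 of a linear-kernel SVDD restricted to the columns of X
    """
    D = class_data[0].shape[1]
    remaining = list(range(D))
    eliminated = []                            # least important features first
    while remaining:
        scores = np.zeros(len(remaining))
        for Xj in class_data:                  # accumulate eq. (15) over the c classes
            a, X_sv = train_svdd(Xj[:, remaining], C)
            scores += np.mean((X_sv - a) ** 2, axis=0)
        worst = int(np.argmin(scores))         # smallest JR_k = least important
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]                    # most important feature first
```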

3.2 Fast multiple SVDD-DRFE

Similar to the radius ranking criterion, the dual-objective ranking criterion is also reasonable, since finding an SVDD model requires maximizing the dual objective function. Here, we also provide an accelerated algorithm for computing the dual-objective ranking scores, and then present a fast multiple SVDD-DRFE method based on this accelerated algorithm.

3.2.1 Fast SVDD-DRFE

The computational complexity of SVDD-DRFE with the linear kernel is also given in Table 1. Similarly, SVDD-DRFE first needs to compute \(J_{d}\) in (9) and \({J_{d}^{k}}, k=1,\cdots , D^{\prime}\) in (10), and then finds the worst feature p according to the dual-objective ranking criterion (11). The computational complexity of \(J_{d}\) is \(O(|SV|^{2} D^{\prime})\), and that of \({J_{d}^{k}}\) is \(O(|SV|^{2}(D^{\prime}-1))\). In total, the computational complexity of feature selection for SVDD-DRFE in the current iteration is \(O(|SV|^{2} D^{\prime 2})\).

We also give a theorem on computing the dual-objective ranking score in the following. Let the dual-objective ranking score be \(JD_{k}=J_{d}-{J_{d}^{k}}\).

Theorem 2

Given the optimal solution \(\alpha_{i}, i = 1,\cdots,n\) to the dual objective (3) in the linear case, the dual-objective ranking score can be computed as

$$\begin{array}{@{}rcl@{}} JD_{k}=\sum\limits_{{\textbf{x}}_{sv}\in SV} \alpha_{sv} x^{2}_{(sv,k)}- \sum\limits_{{\textbf{x}}_{sv}\in SV} \sum\limits_{{\textbf{x}}_{sv^{\prime}}\in SV}\alpha_{sv}\alpha_{sv^{\prime}} x_{(sv,k)}x_{(sv^{\prime},k)} \end{array} $$
(20)

where SV is the set of support vectors, \(x_{(sv,k)}\) is the k-th component of the support vector \({\textbf{x}}_{sv}\), and \(\alpha_{sv}\), the coefficient corresponding to \({\textbf{x}}_{sv}\), is a component of the optimal solution.

The proof of Theorem 2 is given in Appendix B. According to Theorem 2, a fast SVDD-DRFE (FSVDD-DRFE) method can be constructed. Moreover, the computational complexity of FSVDD-DRFE is \(O(|SV|^{2} D^{\prime})\) in an iteration, which is also much smaller than that of SVDD-DRFE.
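A minimal sketch of the score in Theorem 2 follows, again assuming the support vectors and their coefficients are given. In this vectorized form the double sum in (20) factorizes into the square of a single weighted sum over the SVs; names are illustrative.

```python
import numpy as np

def dual_ranking_scores(X_sv, alpha_sv):
    """Dual-objective ranking scores JD_k of Theorem 2 / eq. (20) for all features at once.

    X_sv     : (|SV|, D') support vectors on the current feature subset
    alpha_sv : (|SV|,) their dual coefficients
    Returns a (D',) vector; the feature with the smallest score is removed.
    """
    # JD_k = sum_sv alpha_sv x_(sv,k)^2 - (sum_sv alpha_sv x_(sv,k))^2
    return alpha_sv @ (X_sv ** 2) - (alpha_sv @ X_sv) ** 2
```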

Table 1 compares the computational complexity of the four algorithms. Obviously, FSVDD-RRFE has the lowest complexity among the four, followed by FSVDD-DRFE. According to Table 1, SVDD-DRFE is faster than SVDD-RRFE, which implies that the speed-up obtained on SVDD-RRFE is larger than that on SVDD-DRFE; this is confirmed by the experiments later.

3.2.2 FMSVDD-DRFE

Based on the proposed accelerated algorithm, we develop a new fast feature selection algorithm for binary and multi-class classification problems.

For a c-class classification problem, FMSVDD-DRFE also requires training c SVDD models. Let \(X_{j}\) be the set of training samples in the j-th class. For the j-th class, we use \(X_{j}\) to train an SVDD and obtain the solution \({\alpha ^{j}_{i}},i=1,\cdots , n\). The support vector set is \(SV^{j}=\{{\textbf{x}}_{i}\,|\,{\alpha _{i}^{j}}>0,i=1,\cdots ,n\}\). To find the worst feature by using all c SVDD models, we rewrite the dual-objective ranking score in (20) as follows:

$$\begin{array}{@{}rcl@{}} JD_{k}&=&\sum\limits_{j=1}^{c} \sum\limits_{{\textbf{x}}_{sv}\in SV^{j}} \alpha^{j}_{sv} x^{2}_{(sv,k)}\\ &&- \sum\limits_{j=1}^{c} \sum\limits_{{\textbf{x}}_{sv}\in SV^{j}} \sum\limits_{{\textbf{x}}_{sv^{\prime}}\in SV^{j}}\alpha^{j}_{sv}\alpha^{j}_{sv^{\prime}} x_{(sv,k)}x_{(sv^{\prime},k)} \end{array} $$
(21)

If \(JD_{p}\) is the smallest among \(JD_{k}, k = 1,\cdots,D^{\prime}\), then feature p should be removed from the current feature subset. FMSVDD-DRFE considers all classes and removes the feature with the smallest ranking score \(JD_{k}\) in each iteration. Algorithm 2 shows the feature selection process of FMSVDD-DRFE, which finally gives a feature ranking; an illustrative sketch of the multi-class score is given below.

Algorithm 2 FMSVDD-DRFE
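The following sketch shows the multi-class score (21); the surrounding elimination loop is analogous to the FMSVDD-RRFE sketch above. The per-class support vectors and coefficients are assumed to come from SVDDs already trained on the current feature subset, and all names are illustrative.

```python
import numpy as np

def fmsvdd_drfe_scores(class_models):
    """Multi-class dual-objective scores of eq. (21).

    class_models : list of (X_sv, alpha_sv) pairs, one trained SVDD per class,
                   both restricted to the current feature subset
    """
    D_prime = class_models[0][0].shape[1]
    scores = np.zeros(D_prime)
    for X_sv, alpha_sv in class_models:
        # per-class JD_k of eq. (20), summed over the c classes as in eq. (21)
        scores += alpha_sv @ (X_sv ** 2) - (alpha_sv @ X_sv) ** 2
    return scores
```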

3.3 Connection to other SVDD-based feature selection methods

At present, three existing SVDD-based feature selection methods, SVDD-RRFE [18], SVDD-DRFE [18], and MSVDD-RFE [3], correspond to three ranking criteria: the radius, the dual objective and the center.

Similar to SVDD-RRFE, the proposed FMSVDD-RRFE uses the radius ranking criterion. Both FMSVDD-RRFE and SVDD-RRFE can process one-class classification problems, and in this case FMSVDD-RRFE produces the same ranking result as SVDD-RRFE according to Theorem 1. Most importantly, FMSVDD-RRFE is much faster than SVDD-RRFE when performing feature selection. Table 1 shows the computational complexity of the two methods when dealing with one-class classification problems.

Compared to SVDD-DRFE, the proposed FMSVDD-DRFE utilizes the same ranking criterion. When dealing with one-class tasks, FMSVDD-DRFE has the same results as and a much faster speed than SVDD-DRFE. The former can be validated by Theorem 2, and the latter can be observed from Table 1.

MSVDD-RFE uses the center ranking criterion, which differs from both FMSVDD-RRFE and FMSVDD-DRFE. The three multiple-SVDD methods can address one-class, binary and multi-class tasks, and all of them use the samples belonging to the same class to train an SVDD model. However, MSVDD-RFE produces a feature ranking only for each single class instead of for all classes, and thus generates multiple feature rankings. On the contrary, both FMSVDD-RRFE and FMSVDD-DRFE give a unique feature ranking for all classes. In addition, the three multiple-SVDD methods have a similar computational complexity, which consists of two parts: training c SVDD models for a c-class task, and computing ranking scores. The complexity of training the SVDD models is exactly the same for the three methods. The computational complexity of computing ranking scores in MSVDD-RFE is \(O(|SV|D^{\prime})\) in an iteration when considering one-class tasks, while those of FMSVDD-RRFE and FMSVDD-DRFE are \(O(|SV|D^{\prime})\) and \(O(|SV|^{2} D^{\prime})\), respectively. Generally, the SVs form only a small part of the training samples, and when \(D^{\prime}\) is large enough, the factor |SV| can be ignored.

4 Comparative performance analysis

To validate the performance of our proposed algorithms (FMSVDD-RRFE and FMSVDD-DRFE), we compare them with the other SVDD-based methods mentioned above. All numerical experiments are performed on a personal computer with a 3.4 GHz Intel Core processor and 4 GB of memory, running Windows 7 and Matlab R2013a.

4.1 Simulated datasets

The goal of this experiment is to validate that the accelerated algorithms improve the speed of feature ranking. We generate two simulated datasets, Dataset A and Dataset B. Dataset A contains only one-class data, and Dataset B consists of two-class data. In the following, we describe the experimental results on the two datasets, respectively.

4.1.1 Dataset A

We use Dataset A to validate that the proposed accelerated algorithms produce the same feature ranking as the original ones while ranking faster when applied to one-class tasks. For Dataset A, we first randomly generate two-dimensional data with a uniform distribution on the interval [0,1] and then add m noise features, each drawn from a Gaussian distribution with zero mean and variance 0.01, where m takes values in the set \(\{2^{1},2^{2},\cdots,2^{12}\}\). In other words, the number of features in Dataset A is m + 2. The number of training samples is 50.

The compared methods are FMSVDD-RRFE, FMSVDD-DRFE, SVDD-RRFE, and SVDD-DRFE. In fact, FMSVDD-RRFE reduces to FSVDD-RRFE, and FMSVDD-DRFE to FSVDD-DRFE, since only one SVDD model is needed. We set C = 0.5 for all four methods. We perform 10 runs for each m, and report the average feature ranking time in Fig. 1; note that a logarithmic (base 10) scale is used for the Y-axis. Naturally, all four methods spend more time ranking features as the number of features increases. The curves in Fig. 1 lead to the same conclusion as Table 1. When the number of features is much larger than the number of samples, SVDD-RRFE has the highest computational complexity, followed by SVDD-DRFE. For example, when \(m = 2^{10}\), the average ranking time of SVDD-RRFE is 637.75 seconds, SVDD-DRFE 432.27 seconds, FMSVDD-DRFE 8.11 seconds, and FMSVDD-RRFE 3.58 seconds. In this case, the proposed methods are two orders of magnitude faster than the original ones.

Fig. 1 Feature ranking time vs. feature number on Dataset A

Now, we examine the ranked features. Without loss of generality, let \(m = 2^{4}\). Then, Dataset A has 18 features, where the first two features are useful and the rest are noise. By randomly generating data, we perform 10 trials. The ranked feature indices in one trial are listed in Table 2; the results of the other nine trials are similar. In the ranking sequences obtained by the four methods, the first two features are correctly ranked at the front of the sequence. In addition, as expected, FMSVDD-RRFE has the same ranking as SVDD-RRFE, and FMSVDD-DRFE the same as SVDD-DRFE. In other words, our algorithms speed up both SVDD-RRFE and SVDD-DRFE without changing their performance.

Table 2 Feature ranking of different methods on Dataset A

4.1.2 Dataset B

We use Dataset B to validate that the proposed methods can obtain a better feature ranking for binary (or multi-class) classification problems than the original ones do. For Dataset B, we generate two-class samples in a 12-dimensional data space, whose class centers are located at \([0,\cdots,0]^{T}\) (class one) and \([1,\cdots,1]^{T}\) (class two), respectively. A similar dataset was tested in [16, 18]. The i-th feature of all samples is independently drawn from a Gaussian distribution with standard deviation \(0.2 \times 1.2^{i-1}\). For each class, 250 samples are randomly generated for training and 250 for test. The compared methods are FMSVDD-RRFE, FMSVDD-DRFE, SVDD-RRFE, and SVDD-DRFE, with C = 0.5 for all four methods. We treat class one as the normal data when applying SVDD-RRFE and SVDD-DRFE. In total, there are 500 training samples for FMSVDD-RRFE and FMSVDD-DRFE. For all four methods, the number of test samples is 500.

Feature rankings obtained by the four methods in one trial are listed in Table 3. In fact, the smaller the feature index is, the more important the feature in Dataset B is. From Table 3, we can see that feature six is ranked in third place by FMSVDD-DRFE, but in sixth place by SVDD-DRFE. Similarly, feature five is ranked in fifth place by FMSVDD-RRFE, but in sixth place by SVDD-RRFE. In other words, our methods can provide better feature rankings and result in better classification performance.

Table 3 Feature ranking of different methods on Dataset B

To compare the classification performance of these four methods, we take a support vector machine (SVM) with the linear kernel as the classifier and use the F1-measure to evaluate classification performance. In the SVM, the regularization parameter is set to 10, an empirical value. The F1-measure is a statistic that evaluates the accuracy of a model by combining precision and recall:

$$ F1 = \frac{1}{c}\sum\limits_{j=1}^{c} \frac{{2 \times Precision_{j}\times Recall_{j}}}{{Precision_{j} + Recall_{j}}} $$
(26)

where \(Precision_{j} = \frac {{TP_{j}}}{{TP_{j} + FP_{j}}}\) and \(Recall_{j} = \frac {{TP_{j}}}{{TP_{j} + FN_{j}}}\), \(TP_{j}\) is the number of class-j samples correctly classified as class j, \(FP_{j}\) is the number of samples from other classes misclassified as class j, and \(FN_{j}\) is the number of class-j samples misclassified into other classes.

The average classification performance over 10 runs obtained by SVM is shown in Fig. 2. Obviously, FMSVDD-DRFE is always the best, followed by FMSVDD-RRFE. These results lead to the conclusion that FMSVDD-RRFE and FMSVDD-DRFE are more effective than SVDD-RRFE and SVDD-DRFE on Dataset B, which is consistent with the results listed in Table 3.

Fig. 2 Average F1-measure vs. feature number on Dataset B

4.2 Microarray datasets

Microarray datasets record differential gene expression and contain a very large number of gene expression measurements [11], which entails a heavy computational burden. Feature selection methods seek the appropriate features to resolve this problem effectively. Four publicly available microarray datasets, summarized in Table 4, are used to validate the performance of our proposed methods: Small Round Blue Cell Tumor (SRBCT) of Khan et al. [19], Leukemia-ALLAML of Golub et al. [13], Central Nervous System (CNS) of Pomeroy et al. [26], and Lung Cancer of Bhattacharjee et al. [1].

Table 4 Description of four microarray datasets

In these datasets, all genes are expressed as numerical values at different measurement levels. Since these values vary in scale from gene to gene, we normalize each gene to the interval [0,1] so that all genes are measured on the same scale.
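As a small illustration, the following sketch performs per-gene min-max scaling. The paper does not specify whether test statistics are pooled, so the sketch takes the conservative convention of using training-set statistics only; names are illustrative.

```python
import numpy as np

def minmax_normalize(X_train, X_test):
    """Scale each gene (column) to [0, 1] using training-set minima and maxima."""
    lo = X_train.min(axis=0)
    rng = X_train.max(axis=0) - lo
    rng[rng == 0] = 1.0                         # guard against genes with a constant value
    scale = lambda X: np.clip((X - lo) / rng, 0.0, 1.0)
    return scale(X_train), scale(X_test)
```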

Four classification performance indices are used here, including F1-measure (26), recall, precision, and accuracy, which are respectively defined by

$$ Recall = \frac{1}{c}\sum\limits_{j=1}^{c} Recall_{j} $$
(27)
$$ Precision = \frac{1}{c}\sum\limits_{j=1}^{c} Precision_{j} $$
(28)

and

$$ Accuracy = \frac{1}{n^{\prime}}\sum\limits_{j=1}^{c} TP_{j} $$
(29)

where \(n^{\prime}\) is the number of test samples.
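For reference, the sketch below computes the four macro-averaged indices (26)-(29) from a confusion matrix; the zero-denominator convention (scoring such classes as zero) is our assumption, and names are illustrative.

```python
import numpy as np

def macro_metrics(conf):
    """Macro-averaged F1 (26), recall (27), precision (28) and accuracy (29).

    conf : (c, c) confusion matrix, conf[i, j] = number of true-class-i samples predicted as class j.
    """
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp                  # predicted as class j but belonging to another class
    fn = conf.sum(axis=1) - tp                  # class-j samples predicted as another class
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    f1 = np.divide(2 * precision * recall, precision + recall,
                   out=np.zeros_like(tp), where=(precision + recall) > 0)
    accuracy = tp.sum() / conf.sum()            # eq. (29): correct predictions over n'
    return f1.mean(), recall.mean(), precision.mean(), accuracy
```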

4.2.1 SRBCT

The SRBCT dataset contains 83 samples belonging to four classes: the Ewing family of tumors (EWS), neuroblastoma (NB), Burkitt lymphoma (BL) and rhabdomyosarcoma (RMS). Each sample has 2308 genes. The training set has 63 samples, including 23 EWS, 20 RMS, 12 NB and 8 BL. The test set contains 6 EWS, 6 RMS, 6 NB and 3 BL. This dataset can be downloaded from http://www.biolab.si/supp/bi-cancer/projections/info/SRBCT.htm.

We compare our methods with SVDD-RRFE [18], SVDD-DRFE [18], MSVM-RFE [36], and MSVDD-RFE [3] on this dataset. In these feature selection methods, the parameter C has a vital influence on the experimental results, so we use 5-fold cross validation to select it. The parameter C takes values in the set {0.2,0.5,1} for all SVDD-based methods and in {1,10,100} for SVM-based methods. Both SVDD-RRFE and SVDD-DRFE require only one-class samples; in the SRBCT dataset, the 23 training samples of class EWS are taken as the target class in our experiments.

After obtaining the feature ranking list, we consider the top 400 features in the list. We show the classification performance from the top 1 to the top 400 features on the SRBCT dataset in Fig. 3, where SVM with the linear kernel is the subsequent classifier. When the number of features is small (say less than 40), MSVM-RFE, MSVDD-RFE and FMSVDD-RRFE have a rather good performance compared to the remaining methods. SVDD-DRFE performs worst among the six methods. Since FMSVDD-DRFE is based on SVDD-DRFE, FMSVDD-DRFE is not as good either, but it is still much better than SVDD-DRFE. Similarly, FMSVDD-RRFE is based on SVDD-RRFE and is much better than SVDD-RRFE.

Fig. 3 Comparison of classification performance ((a) F1-measure, (b) recall, (c) precision, (d) accuracy) vs. top feature number on the SRBCT dataset

The best performance obtained by these methods is given in Table 5, where the number of features for each method is the one giving its best F1-measure. The other three performance indices (recall, precision and accuracy) are reported for the same selected number of features. All methods achieve good classification performance except SVDD-DRFE. FMSVDD-RRFE requires fewer genes than SVDD-RRFE to achieve perfect classification.

Table 5 Performance comparison on SRBCT

The feature ranking time on the training set is also listed in Table 5. We can see that MSVDD-RFE, FMSVDD-RRFE and FMSVDD-DRFE have a comparable ranking time, which supports the analysis in Section 3.3 that the three multiple-SVDD methods have similar computational complexity. MSVM-RFE is slower by comparison. In addition, both SVDD-RRFE and SVDD-DRFE are much slower than the other methods.

4.2.2 Leukemia-ALLAML

The Leukemia-ALLAML dataset has 72 samples belonging to two classes: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia). The training set consists of 38 bone marrow samples (27 ALL and 11 AML) with 7129 probes from 6817 human genes. In addition, 34 test samples are provided, with 20 ALL and 14 AML. This dataset can be downloaded from http://datam.i2r.a-star.edu.sg/datasets/krbd/Leukemia/ALLAML.html.

Since both SVDD-RRFE and SVDD-DRFE take a long time to rank features on the Leukemia-ALLAML dataset, we replace them by FSVDD-RRFE and FSVDD-DRFE, respectively. In theory, FSVDD-RRFE produces the same feature list as SVDD-RRFE, and FSVDD-DRFE the same as SVDD-DRFE, but FSVDD-RRFE and FSVDD-DRFE are much faster. For FSVDD-RRFE and FSVDD-DRFE, the 27 ALL samples are taken as the target samples. The parameter setting is the same as on the SRBCT dataset.

We report the best performance obtained by the six methods in Table 6. These results lead to similar conclusions as those on the SRBCT dataset. FMSVDD-RRFE achieves the best classification performance, as do MSVDD-RFE and MSVM-RFE. FMSVDD-RRFE is much better than FSVDD-RRFE, and FMSVDD-DRFE better than FSVDD-DRFE. Note that FSVDD-RRFE and FSVDD-DRFE have a faster ranking speed since the two methods deal with only one-class samples.

Table 6 Performance comparison on Leukemia-ALLAML

4.2.3 More datasets

The CNS dataset contains 60 patient samples: 21 survivors (alive after treatment) and 39 failures (succumbed to their disease). There are 7129 genes in the dataset. The training set consists of the first 10 survivors and 30 failures; the other 11 survivors and 9 failures are test samples. The CNS dataset can be downloaded from http://datam.i2r.a-star.edu.sg/datasets/krbd/NervousSystem/NervousSystem.html.

The Lung Cancer dataset has a total of 203 snap-frozen lung tumor and normal lung samples. The 203 specimens include 139 lung adenocarcinomas (labeled ADEN), 21 squamous cell lung carcinomas (labeled SQUA), 20 pulmonary carcinoids (labeled COID), 6 small-cell lung carcinomas (labeled SCLC) and 17 normal lung samples (labeled NORMAL). Each sample is described by 12600 genes. The first 20 ADEN, 10 SQUA, 10 COID, 4 SCLC, and 10 NORMAL samples are taken as the training samples, and the rest are used for test. This dataset can be downloaded from http://datam.i2r.a-star.edu.sg/datasets/krbd/LungCancer/LungCancer-Harvard1.html.

For the CNS dataset, the 30 failures are the target samples for FSVDD-RRFE and FSVDD-DRFE. For the Lung Cancer dataset, ADEN is selected as the target class. Since the four performance indices (F1-measure, recall, precision and accuracy) are consistent with each other according to Tables 5 and 6, we only give the best F1-measure in Table 7. First, we observe that FMSVDD-RRFE is equal to or better than FSVDD-RRFE, and FMSVDD-DRFE is always better than FSVDD-DRFE. On the CNS dataset, although FSVDD-DRFE is the worst, FMSVDD-DRFE achieves the best performance of 79.80% among the six methods; FMSVDD-RRFE, FSVDD-RRFE and MSVDD-RFE are next to FMSVDD-DRFE. On the Lung Cancer dataset, MSVDD-RFE achieves the best result of 95.18%, followed by FMSVDD-RRFE.

Table 7 Comparison of F1-measure (%) obtained by six methods

In a nutshell, FMSVDD-RRFE and FMSVDD-DRFE achieve both better classification performance and faster feature ranking than SVDD-RRFE and SVDD-DRFE, respectively, on the four microarray datasets. FMSVDD-RRFE has a classification performance comparable with MSVM-RFE and MSVDD-RFE on the four datasets, whereas FMSVDD-DRFE behaves well only on the CNS dataset.

4.3 UCI database

This section considers datasets where the number of samples is much greater than the number of features. We perform experiments on eight datasets from the UCI database [9]: the Breast, Wine, Wdbc, Vowel, Vehicle, Soy, Waveform, and Segment datasets. A description of these eight datasets is shown in Table 8. These datasets are normalized so that their features lie in the interval [0,1]. For each dataset, we randomly split it into two subsets for training and test, respectively: the training set contains 2/3 of the samples of each class, and the test set contains the remaining 1/3. The split process is repeated 10 times for each dataset, as sketched below.
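The stratified split described above can be realized, for example, as follows; the random seeds and names are illustrative choices, not taken from the paper.

```python
import numpy as np

def stratified_split(X, y, rng, train_frac=2/3):
    """One random stratified split: 2/3 of each class for training, 1/3 for test."""
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        cut = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(test_idx)

# repeated 10 times with different seeds, e.g.
# splits = [stratified_split(X, y, np.random.default_rng(s)) for s in range(10)]
```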

Table 8 Description on eight UCI Datasets

We also compare the proposed FMSVDD methods with the other four feature selection methods: SVDD-RRFE, SVDD-DRFE, MSVDD-RFE and MSVM-RFE. The parameter C takes values in the set {0.1,0.5,1} for all SVDD-based methods and in {1,10,100} for SVM-based methods. We use 5-fold cross validation on the training set to select the parameter C and the optimal number of features for all compared methods. Note that both SVDD-RRFE and SVDD-DRFE support only one-class samples; we choose the class with the largest number of samples as the target class, or choose it randomly when all classes have the same number of samples.

We report the average classification performance of these methods on the 10 test subsets in Tables 9, 10, 11 and 12, where SVM with the linear kernel is the subsequent classifier and the bold values are the best ones among the compared methods. From Tables 9-12, we can see that the four performance indices are consistent with each other, which also shows that the performance of the algorithms is stable even for imbalanced data; for example, the ratio of sample numbers between the two classes is almost 2 : 1 in the Breast dataset. It is obvious that FMSVDD-RRFE achieves the best F1-measure on five out of eight datasets. FMSVDD-RRFE outperforms both SVDD-RRFE and MSVM-RFE on seven out of eight datasets, and both SVDD-DRFE and MSVDD-RFE on all eight datasets. In addition, FMSVDD-DRFE is better than both SVDD-RRFE and MSVDD-RFE on seven out of eight datasets, than SVDD-DRFE on all eight datasets, and than MSVM-RFE on five out of eight datasets. FMSVDD-DRFE outperforms FMSVDD-RRFE only on the Vehicle dataset.

Table 9 Average accuracy of eight UCI Datasets
Table 10 Average recall of eight UCI Datasets
Table 11 Average precision of eight UCI Datasets
Table 12 Average F1-measure values of eight UCI Datasets

In summary, FMSVDD-RRFE and FMSVDD-DRFE are always better than SVDD-RRFE and SVDD-DRFE, respectively, when addressing binary or multi-class tasks where the number of samples is much greater than the number of features. The main reason is that the proposed methods incorporate information from all classes instead of from one class. Compared to the methods utilizing all class information (MSVDD-RFE and MSVM-RFE), FMSVDD-RRFE and FMSVDD-DRFE also show an advantage.

4.4 Statistical comparison over multiple datasets

In this subsection, we perform statistical tests over multiple datasets to compare the different algorithms. We conduct the Friedman test with the corresponding post-hoc tests; the Friedman test is a non-parametric equivalent of the repeated-measures analysis of variance (ANOVA) under the null hypothesis that all the algorithms are equivalent, so that their ranks should be equal [10]. If the test rejects the null hypothesis, i.e., the algorithms are not all equivalent, we proceed to a post-hoc test, the Bonferroni-Dunn test [8]. Two algorithms perform significantly differently if their average ranks differ by at least the critical difference

$$ CD = q_{\alpha}\sqrt{\frac{j(j+1)}{6T}} $$
(30)

where j is the number of algorithms, T is the number of datasets, the critical values \(q_{\alpha}\) can be found in [10], and the subscript α is the significance level. Following common practice, we let α = 0.1 with \(q_{0.10} = 2.326\) [7]. In our case, j = 6 and T = 12 (four microarray and eight UCI datasets), so CD = 1.7765.
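A quick check of this value from (30) with the stated constants:

```python
import math

# Critical difference of the Bonferroni-Dunn test, eq. (30),
# with j = 6 algorithms, T = 12 datasets and q_0.10 = 2.326
j, T, q = 6, 12, 2.326
cd = q * math.sqrt(j * (j + 1) / (6 * T))
print(round(cd, 4))   # 1.7765, matching the value used in the text
```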

Table 13 lists the mean ranks of the six feature selection algorithms: SVDD-RRFE, SVDD-DRFE, MSVDD-RFE, MSVM-RFE, FMSVDD-RRFE and FMSVDD-DRFE. We can see that FMSVDD-RRFE ranks first, followed by FMSVDD-DRFE. Table 14 shows the Friedman test results. According to Table 14, the differences between the mean rank of FMSVDD-RRFE and those of the other algorithms are greater than the critical difference 1.7765, except for FMSVDD-DRFE. Thus, FMSVDD-RRFE is significantly better than the other four methods. FMSVDD-DRFE, in turn, is significantly better only than SVDD-DRFE.

Table 13 The mean rank of six methods on 12 datasets
Table 14 Friedman tests with the corresponding post-hoc tests

5 Summaries and conclusions

Recent developments in feature selection have achieved outstanding experimental results with favorable time complexity, and progress on SVDD-based methods motivates further study. This paper develops two new fast feature selection methods based on multiple support vector data descriptions, called FMSVDD-RRFE and FMSVDD-DRFE. The proposed methods can address not only one-class classification tasks, but also binary and multi-class ones. When dealing with one-class problems and using the linear kernel, FMSVDD-RRFE is a fast version of SVDD-RRFE and FMSVDD-DRFE is a fast version of SVDD-DRFE; these facts are proved by Theorem 1 and Theorem 2, respectively. Extensive experiments are performed to validate the performance of the proposed methods. On eight UCI datasets, both FMSVDD-RRFE and FMSVDD-DRFE behave well. On four microarray datasets, the performance of FMSVDD-RRFE is comparable to that of the state-of-the-art feature selection methods MSVM-RFE and MSVDD-RFE. More importantly, the feature ranking time of SVDD-RRFE and SVDD-DRFE is greatly reduced by our proposed methods when addressing high-dimensional datasets, which is supported by the experimental results on the simulated Dataset A and the SRBCT dataset.

The proposed methods are improved versions of SVDD-based methods with the linear kernel. Thus, they can be used only with the linear kernel, which means that they cannot perform nonlinear feature selection. In the future, we will consider extensions to nonlinear kernels, such as the radial basis function (RBF) kernel.