1 Introduction

Classification is one of the most important tasks in machine learning. A great number of classifiers have been put forward in recent decades to tackle the various kinds of classification problems arising in the real world, such as support vector machines, decision trees, AdaBoost, artificial neural networks, and so on. Is there a classifier that performs significantly better than all others on most data sets? Several studies have investigated this question in depth. The No Free Lunch Theorem [1] tells us that no such classifier exists: if classifier \(\mathcal {A}_1\) outperforms \(\mathcal {A}_2\) on some data sets, then there must exist as many other data sets on which \(\mathcal {A}_2\) outperforms \(\mathcal {A}_1\). In [2], the authors analyzed the performance of three classifiers on several data sets and found that none was significantly better than the others. Furthermore, [3] conducted classification experiments with 179 classifiers on 121 data sets and showed that there is no universally optimal classifier. These results indicate that classifiers have preferences for different types of data sets. Which classifier(s), then, should be selected for a given classification problem?

One idea is to run cross validation over all possible classifiers to find the best one. However, this procedure is time-consuming. An efficient alternative is automatic classifier selection based on data set similarity [4,5,6,7,10], or classifier selection (CS) for short. We believe that the performances of classifiers on similar data sets should be close. Since data sets may vary in sample size, dimensionality, number of classes and attribute types, how to measure the similarity between data sets is a critical step of CS. The common approach is to extract a data set feature with a feature extraction function (also called meta-learning) and then compute the similarity between these features. There is an intrinsic relationship between classifier performance and the data set feature [9]. Therefore, the recommendation depends heavily on the effectiveness of the data set feature. Furthermore, the feature should be computable with low time complexity, which is a bottleneck of CS.

A number of data set features have been proposed. These features are extracted from different aspects of a data set: (i) statistics and information theory (SI) [7, 10]; (ii) model structure (MS) [5]; (iii) problem complexity (PC) [4]; (iv) landmarking (LM) [6]. In particular, PC and LM characterize the classification complexity of problems (hereafter, the complexity) using a set of geometrical metrics or basic classifiers. The complexity is expected to correlate strongly with the performances of classifiers [11]. In other words, the performances of classifiers on data sets of similar complexity should be close. Therefore, complexity plays a vital role in CS. However, the data set features extracted by PC and LM have two shortcomings: (i) they are time-consuming to compute; (ii) they have no theoretical connection with the performance of classifiers. PC and LM have been observed to perform poorly in several studies [5, 7], which suggests that they do not characterize the complexity accurately.

To remedy the aforementioned shortcomings of PC and LM, this paper uses a set of geometrical and statistical metrics to describe the complexity of a two-class data set and combines these metrics into a data set feature. We use the KNN classifier as the recommendation algorithm for CS. For multi-class classification problems, we split the problem into two-class problems using the one-vs-one strategy. Compared with PC and LM, our work improves on two aspects: computational efficiency and theoretical guarantees. Empirical results demonstrate the effectiveness and efficiency of our method.

The rest of the paper is structured as follows. We briefly review related work in Sect. 2. Section 3 presents our data set feature. The classifier selection algorithm is given in Sect. 4. Empirical investigations are discussed in Sect. 5 and conclusions are drawn in Sect. 6.

2 Related Work

The key problem of CS is feature extraction. To the best of our knowledge, there are four kinds of features.

Statistical Feature: This feature can be categorized into two kinds. The first kind describes the data set using a group of statistical and information-theoretic characteristics [10]. The second kind is based on summary statistics. Song [7] characterizes the data set structure by computing the frequencies of itemsets generated from binary data sets. A non-binary data set must first be transformed into a binary one, which is time-consuming when the attributes are continuous.

Problem Complexity Feature: Twelve measures were designed to describe the geometrical complexity of the decision boundary of two-class problems [11]. Cano [12] claimed that some of these measures have little connection with the performances of classifiers. Bernado [4] selected six of them to characterize a data set.

Landmarking Feature: This feature [6] uses the performances of a set of basic classifiers (called landmarkers) to describe the data set. Similar features indicate that the data sets may belong to the same performance subspace. The chosen landmarkers must be significantly different from one another.

Model Structure Feature: Statistical information about a model generated from the data set is collected as the feature. In this category, a decision tree is usually considered [5], from which a set of statistics is gathered, such as the maximum/minimum number of nodes, the length of the longest/shortest branches, and so on.

The aforementioned features are all of experimental origin, whereas a theoretical grounding would be more persuasive. Furthermore, these features are computationally expensive.

3 Proposed Feature

In this section, we first propose several metrics of complexity for CS. Then the theoretical connections between two of the metrics and the generalization errors of some classifiers are investigated. Finally, we present our data set feature and the similarity measurement criterion.

3.1 Metrics of Complexity

Consider a two-class data set \(\mathcal {D}= \{(\mathbf {x}_1,y_1),(\mathbf {x}_2,y_2),\ldots ,(\mathbf {x}_n,y_n) \}\) in an input space \(\mathcal {X}\), where \(\mathbf {x}_i\), \(i=1,2,\cdots ,n\), are data points and \(y_i \in \{1,-1\}\) is the binary class label. Let \(\mathbf {y}=[y_1,y_2,\cdots ,y_n]^\top \) denote the vector formed by the n labels. We use \(n_-\) and \(n_+\) to denote the number of samples labeled \(-1\) and 1, respectively. Note that \(n_-+n_+=n\).

For a given kernel function \(k(\mathbf {x},\mathbf {y}) = \langle \phi (\mathbf {x}),\phi (\mathbf {y})\rangle \), where \(\phi \) is a nonlinear mapping that maps \(\mathbf {x}\in \mathcal {X}\) into a reproducing kernel Hilbert space (RKHS) \(\mathcal {H}\), an \(n \times n\) kernel matrix \(\mathbf {K}\) is generated from \(\mathcal {D}\) as

$$K_{ij}=\langle \phi (\mathbf {x}_i), \phi (\mathbf {x}_j)\rangle , i,j=1,2,\cdots ,n.$$

\(\mathbf {K}\) is a symmetric positive semi-definite matrix that fully preserves the geometrical structure of \(\mathcal {D}\). Our five metrics of complexity are all computed from \(\mathbf {K}\).
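For concreteness, the following minimal NumPy sketch builds \(\mathbf {K}\) from a data matrix. The polynomial kernel with degree 3 matches the choice later used for \(F_{poly}\) in Sect. 5; the coefficients gamma and coef0 are illustrative defaults, not values fixed by the paper.

```python
import numpy as np

def kernel_matrix(X, kernel="poly", degree=3, gamma=1.0, coef0=1.0):
    """Build the n x n kernel matrix K from an n x d data matrix X (a sketch)."""
    if kernel == "poly":
        # polynomial kernel; degree=3 mirrors F_poly in Sect. 5, gamma/coef0 are illustrative
        return (gamma * (X @ X.T) + coef0) ** degree
    if kernel == "rbf":
        sq = np.sum(X ** 2, axis=1)
        return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    raise ValueError("unknown kernel")
```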

Kernel Alignment. This metric, which is known as centered kernel target alignment (KA) [13], is defined as

$$\begin{aligned} \text{ KA }(\mathbf {K}_c,\mathbf {yy^\top })=\frac{\langle \mathbf {K}_c,\mathbf {yy^\top }\rangle _F}{\sqrt{\langle \mathbf {K}_c,\mathbf {K}_c\rangle _F\langle \mathbf {yy^\top },\mathbf {yy^\top }\rangle _F}}, \end{aligned}$$
(1)

where \(\mathbf {K}_c\) is the centered kernel matrix of \(\mathbf {K}\), \(\langle \cdot ,\cdot \rangle _F\) denotes the Frobenius inner product and \(\mathbf {yy^\top }\) is called the target matrix. \(\mathrm{KA}\in [0,1]\) since \(\langle \mathbf {K}_c,\mathbf {yy^\top }\rangle _F\geqslant 0\).

The numerator of (1) can be expanded as

$$\begin{aligned} \begin{aligned} \langle \mathbf {K}_c,\mathbf {yy}^\top \rangle _F&= \mathbf {y}^\top \mathbf {K}_c\mathbf {y} = \sum _{i=1}^n\sum _{j=1}^n y_iy_j({K_c})_{ij}\\&= \sum _{y_i=y_j}({K_c})_{ij}-\sum _{y_i\ne y_j}({K_c})_{ij}. \end{aligned} \end{aligned}$$

Therefore, KA measures the difference between the within-class and between-class distances of the data set. A larger KA indicates that the corresponding data set is more separable. The most time-consuming steps of KA are the centering of \(\mathbf {K}\) and the computation of \(\langle \mathbf {K}_c,\mathbf {K}_c\rangle _F\), both of which take \(O(n^2)\) time.
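As a minimal sketch, KA in (1) can be computed with NumPy as follows, using the usual centering \(\mathbf {K}_c = \mathbf {H}\mathbf {K}\mathbf {H}\) with \(\mathbf {H} = \mathbf {I} - \frac{1}{n}\mathbf {1}\mathbf {1}^\top \); the function name is ours.

```python
import numpy as np

def kernel_alignment(K, y):
    """Centered kernel target alignment, Eq. (1) (a sketch)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    Kc = H @ K @ H                                # centered kernel matrix K_c
    num = y @ Kc @ y                              # <K_c, yy^T>_F
    den = np.linalg.norm(Kc, "fro") * np.linalg.norm(np.outer(y, y), "fro")
    return num / den
```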

Kernel Space-Based Separability. The centers of two classes in \(\mathcal {H}\) are calculated as

$$\begin{aligned} \begin{aligned} \phi _{-}&= \frac{1}{n_-}\sum _{y_i = -1}\phi (\mathbf {x}_i),\\ \phi _{+}&= \frac{1}{n_+}\sum _{y_i = 1}\phi (\mathbf {x}_i), \end{aligned} \end{aligned}$$
(2)

respectively. KS [14] is defined as

$$\begin{aligned} \text{ KS }(\mathbf {K},\mathbf {y})=\frac{std_-+std_+}{||\phi _--\phi _+||_2}, \end{aligned}$$
(3)

where

$$\begin{aligned} \begin{aligned} std_-&= \sqrt{\frac{\sum _{y_i = -1}\langle \phi (\mathbf {x}_i)-\phi _-,\mathbf {e}\rangle ^2}{n_--1}},\\ std_+&= \sqrt{\frac{\sum _{y_i = 1}\langle \phi (\mathbf {x}_i)-\phi _+,\mathbf {e}\rangle ^2}{n_+-1}}, \end{aligned} \end{aligned}$$
(4)

are the standard deviations of the two classes projected along the direction \(\mathbf {e}=\frac{\phi _--\phi _+}{||\phi _--\phi _+||_2}\), respectively, and \(||\cdot ||_2\) denotes the 2-norm of a vector.

\(\mathrm{KS}\in (0, +\infty ]\) describes the distribution of the samples along the direction \(\phi _--\phi _+\). A smaller KS means that the data set is more separable. KS requires \(O(n^2)\) time.
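Since every quantity in (2)-(4) is an inner product in \(\mathcal {H}\), KS can be evaluated from \(\mathbf {K}\) alone. The sketch below does so and also returns the projections onto \(\mathbf {e}\), which are reused for ROR below; it is an illustrative implementation, not the authors' code.

```python
import numpy as np

def kernel_space_separability(K, y):
    """KS metric, Eq. (3): class spread over class-center distance along e (a sketch)."""
    neg, pos = (y == -1), (y == 1)
    m_nn = K[np.ix_(neg, neg)].mean()
    m_pp = K[np.ix_(pos, pos)].mean()
    m_np = K[np.ix_(neg, pos)].mean()
    dist = np.sqrt(m_nn + m_pp - 2 * m_np)            # ||phi_- - phi_+||_2
    # projection of every phi(x_i) onto e = (phi_- - phi_+) / ||phi_- - phi_+||_2
    proj = (K[:, neg].mean(axis=1) - K[:, pos].mean(axis=1)) / dist
    std_neg = np.std(proj[neg], ddof=1)               # Eq. (4), class -1
    std_pos = np.std(proj[pos], ddof=1)               # Eq. (4), class +1
    return (std_neg + std_pos) / dist, proj
```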

Overlap Region. We propose a metric that computes the ratio of the overlapping region of the two classes to their total region along the direction \(\mathbf {e}\), denoted ROR. Suppose that the projected data of one class fall into \([a_1,b_1]\), where \(a_1, b_1\) are the minimum and maximum values of the projected data, and the other class falls into \([a_2,b_2]\). Let \(U = [a_1, b_1]\cap [a_2, b_2]\) and \(V=[a_1, b_1]\cup [a_2, b_2]\) be the intersection and union of these two intervals, respectively. ROR is defined as

$$\begin{aligned} \mathrm {ROR} = {\left\{ \begin{array}{ll} 0, &{} U=\emptyset , \\ \frac{\max (U)-\min (U)}{\max (V) - \min (V)}, &{} U\ne \emptyset , \end{array}\right. } \end{aligned}$$
(5)

where \(\min (\cdot )\) and \(\max (\cdot )\) are the minimum and maximum values of an interval, respectively, and \(\emptyset \) denotes the empty set. \(\mathrm{ROR}\in [0,1]\) since U is a subset of V. When the data set is linearly separable, ROR is expected to be zero; ROR increases when the data set is only nonlinearly separable, i.e., when the classes overlap along \(\mathbf {e}\). ROR also requires \(O(n^2)\) time.
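Given the projected values returned by the KS sketch above, ROR in (5) reduces to an interval-overlap computation; the following sketch assumes that input.

```python
import numpy as np

def overlap_region_ratio(proj, y):
    """ROR, Eq. (5): overlap of the two classes' projected ranges along e (a sketch)."""
    a1, b1 = proj[y == -1].min(), proj[y == -1].max()
    a2, b2 = proj[y == 1].min(), proj[y == 1].max()
    lo, hi = max(a1, a2), min(b1, b2)                 # intersection U = [lo, hi]
    if lo > hi:                                       # U is empty
        return 0.0
    return (hi - lo) / (max(b1, b2) - min(a1, a2))    # |U| / |V|
```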

Test of Equality of Means. Now we treat the kernel matrix \(\mathbf {K}\) as a similarity matrix. The following metric relies on the assumption that the similarity among within-class samples is higher than that among between-class samples. We first introduce two vectors extracted from \(\mathbf {K}\):

$$\begin{aligned} \begin{aligned}&\mathbf {k}_W = \{K_{ij}|i<j\wedge y_i=y_j\},\\&\mathbf {k}_B = \{K_{ij}|i<j\wedge y_i\ne y_j\}. \end{aligned} \end{aligned}$$
(6)

Let \(n_W = \frac{n_-(n_--1)}{2}+\frac{n_+(n_+-1)}{2}\) and \(n_B = n_-n_+\) denote the sizes of the vectors \(\mathbf {k}_W\) and \(\mathbf {k}_B\), respectively. \(\mathbf {k}_W\) collects the within-class similarities and \(\mathbf {k}_B\) collects the between-class similarities.

TEM [15] is defined as a variant of the t-test that evaluates the equality of the means of \(\mathbf {k}_W\) and \(\mathbf {k}_B\):

$$\begin{aligned} \text{ TEM }(\mathbf {K,y}) = \frac{1}{n}\Bigg |\frac{\bar{k}_W - \bar{k}_B}{\sqrt{\frac{\sigma ^2_W}{n_W}+\frac{\sigma ^2_B}{n_B}}}\Bigg |, \end{aligned}$$
(7)

where \(\bar{k}_W\) and \(\sigma ^2_W\) denote the mean and variance of \(\mathbf {k}_W\), and \(\bar{k}_B\) and \(\sigma ^2_B\) denote the mean and variance of \(\mathbf {k}_B\), respectively. TEM is very sensitive to the nonlinearity of the decision boundary: a larger TEM indicates that the data set is more likely to be linearly separable. We normalize TEM by the factor \(1/n\) to eliminate the influence of sample size. TEM uses only the upper-triangular elements of \(\mathbf {K}\) and requires \(O(n^2)\) time.
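A sketch of (6)-(7) with NumPy; the use of the sample variance (ddof=1) is our assumption, since the paper does not specify the variance estimator.

```python
import numpy as np

def test_equality_of_means(K, y):
    """TEM, Eq. (7): t-type statistic on within- vs. between-class similarities (a sketch)."""
    n = len(y)
    iu, ju = np.triu_indices(n, k=1)                  # upper-triangle pairs i < j
    vals, same = K[iu, ju], (y[iu] == y[ju])
    kW, kB = vals[same], vals[~same]                  # Eq. (6)
    t = (kW.mean() - kB.mean()) / np.sqrt(kW.var(ddof=1) / len(kW) + kB.var(ddof=1) / len(kB))
    return abs(t) / n                                 # normalized by 1/n
```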

Test of Equality of Variances. Let \(\mathbf {k}_{WB} = \mathbf {k}_W\cup \mathbf {k}_B\) be the union of \(\mathbf {k}_W\) and \(\mathbf {k}_B\). We define three new vectors as follows:

$$\begin{aligned} \begin{aligned}&\mathbf {z}_W = |\mathbf {k}_W - \tilde{\mathbf {k}}_W|,\\&\mathbf {z}_B = |\mathbf {k}_B - \tilde{\mathbf {k}}_B|,\\&\mathbf {z}_{WB} = |\mathbf {k}_{WB} - \tilde{\mathbf {k}}_{WB}|, \end{aligned} \end{aligned}$$
(8)

where \(|\cdot |\) represents the element-wise absolute value, and \(\tilde{\mathbf {k}}_W\), \(\tilde{\mathbf {k}}_B\) and \(\tilde{\mathbf {k}}_{WB}\) are the medians of \(\mathbf {k}_W\), \(\mathbf {k}_B\) and \(\mathbf {k}_{WB}\), respectively. TEV [15] is defined via the Brown-Forsythe test to measure the equality of the variances of \(\mathbf {k}_W\) and \(\mathbf {k}_B\),

$$\begin{aligned} \text{ TEV }(\mathbf {K,y}) = (1-\frac{2}{n})\frac{n_W(\bar{z}_W-\bar{z}_{WB})^2+n_B(\bar{z}_B-\bar{z}_{WB})^2}{\sum _{i=1}^{n_W}[(z_W)_i-\bar{z}_W]^2+\sum _{i=1}^{n_B}[(z_B)_i-\bar{z}_B]^2}, \end{aligned}$$
(9)

where \(\bar{z}_B\), \(\bar{z}_W\) and \(\bar{z}_{WB}\) are the means of the vectors \(\mathbf {z}_B\), \(\mathbf {z}_W\) and \(\mathbf {z}_{WB}\) respectively, and \((z_W)_i\) and \((z_B)_i\) denote the \(i\)th elements of \(\mathbf {z}_W\) and \(\mathbf {z}_B\). The idea behind TEV is that if \(\mathbf {k}_W\) and \(\mathbf {k}_B\) have the same variance, then the data set should be difficult to separate. A high value of TEV rejects the hypothesis of equal variances and indicates a compact within-class and mutually distant between-class distribution [15]. Here we also normalize TEV by the factor \(1/n\).

Like TEM, TEV requires \(O(n^2)\) time, plus an extra \(O(n^2)\) pass to find the medians.
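A sketch of (8)-(9); again this is an illustration of the formulas rather than the authors' implementation.

```python
import numpy as np

def test_equality_of_variances(K, y):
    """TEV, Eq. (9): Brown-Forsythe-type statistic on |k - median| deviations (a sketch)."""
    n = len(y)
    iu, ju = np.triu_indices(n, k=1)
    vals, same = K[iu, ju], (y[iu] == y[ju])
    kW, kB = vals[same], vals[~same]
    zW = np.abs(kW - np.median(kW))                   # Eq. (8)
    zB = np.abs(kB - np.median(kB))
    zWB = np.abs(vals - np.median(vals))              # k_WB is the union of k_W and k_B
    num = len(zW) * (zW.mean() - zWB.mean()) ** 2 + len(zB) * (zB.mean() - zWB.mean()) ** 2
    den = ((zW - zW.mean()) ** 2).sum() + ((zB - zB.mean()) ** 2).sum()
    return (1 - 2 / n) * num / den                    # normalization as in Eq. (9)
```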

3.2 Theoretical Analysis

We now theoretically investigate the relationship between the metrics KA and KS and generalization error.

Theorem 1

Let \(\mathrm {KA}\) be defined as in (1), and let \(R(h)=\mathrm{Pr}[yh(\mathbf {x})<0]\) be the error rate of the Parzen window predictor

$$\begin{aligned} h(\mathbf {x^\prime })=\frac{\text{ E }_\mathbf {x}[yk_c(\mathbf {x},\mathbf {x}^\prime )]}{\sqrt{\text{ E }[k^2_c]}} \end{aligned}$$
(10)

in binary classification. \(k_c\) is the centered kernel function and \(\text{ E }[\cdot ]\) is an expectation operator. Suppose that \(k(\mathbf {x,x})\leqslant S^2\) for all \(\mathbf {x}\). Then for any \(\delta >0\), the following inequality holds with probability at least \(1-\delta \):

$$\begin{aligned} R(h)\leqslant 1-\Bigg (\mathrm {KA}(\mathbf {K_c},\mathbf {yy^\top })-18\beta \Bigg [\frac{3}{n}+4\sqrt{\frac{\log \frac{6}{\delta }}{2n}}\Bigg ]\Bigg )\cdot \frac{1}{\varGamma }, \end{aligned}$$
(11)

where \(\varGamma =\max _{\mathbf {x}^\prime }\sqrt{\frac{\text{ E }_\mathbf {x}[k^2_c(\mathbf {x^\prime },\mathbf {x})]}{\text{ E }_{\mathbf {x},\mathbf {x^\prime }}[k^2_c(\mathbf {x^\prime },\mathbf {x})]}}\), \(\beta =\max (\frac{S^2}{\text{ E }[k^2_c]}, \frac{S^2}{\text{ E }[{k^\prime }^2_c]})\) and \(k^\prime (\mathbf {x}_i, \mathbf {x}_j)=y_iy_j\).

Proof

According to Theorem 12 in [13], we have

$$\begin{aligned} \mathrm {KA}(k_c,k^\prime _c)\geqslant \mathrm {KA}(\mathbf {K_c},\mathbf {yy^\top })-18\beta \Bigg [\frac{3}{n}+4\sqrt{\frac{\log \frac{6}{\delta }}{2n}}\Bigg ], \end{aligned}$$

where \(\mathrm {KA}(k_c,k^\prime _c) = \frac{\text{ E }[k_ck^\prime _c]}{\sqrt{\text{ E }[k^2_c]\text{ E }[{k^\prime _c}^2]}}\). Combining this with Theorem 13 in [13],

$$R(h)\leqslant 1-\mathrm {KA}(k_c,k^\prime _c)\cdot \frac{1}{\varGamma },$$

we obtain inequality (11) directly.

Theorem 2

([14]) Let \(\mathrm {KS}\) be defined as in (3). There exists a separating hyperplane

$$\begin{aligned} h(\mathbf {x})=\mathbf {e}\cdot \phi (\mathbf {x})-\mathbf {e}\cdot \frac{std_-\phi _++std_+\phi _-}{std_-+std_+}, \end{aligned}$$
(12)

such that the training error on data set \(\mathcal {D}\) is upper-bounded by

$$\begin{aligned} \mathrm {KSerr}=\frac{\mathrm {KS}(\mathbf {K},\mathbf {y})^2}{1+\mathrm {KS}(\mathbf {K},\mathbf {y})^2}. \end{aligned}$$
(13)

Theorem 1 tells us that if KA is high and \(\varGamma \) is not too large, then the upper bound on the generalization error of (10) on \(\mathcal {D}\) is small. Theorem 2 indicates that if KS is small, then the upper bound on the training error of (12) on \(\mathcal {D}\) is small, so we can expect a low generalization error [14].

3.3 Data Set Feature

Based on the above analysis, we define the data set feature as follows:

$$\begin{aligned} \mathbf {v}=\ [\text{ KA }, 1-\text{ KSerr }, 1-\text{ ROR }, \text{ TEM }, \text{ TEV }]. \end{aligned}$$
(14)

The computation of \(\mathbf {v}\) has a time complexity of \(O(n^2)\). KA, KS and ROR mainly characterize the distributions and the degree of overlap of the two classes from a geometrical point of view, while the statistical tests (TEM, TEV) characterize the nonlinearity of the decision boundary. Employing different kernel functions produces different features. We adopt the Euclidean distance as the similarity criterion:

$$\begin{aligned} \rho (\mathcal {D},\mathcal {D^\prime }):=||\mathbf {v}-\mathbf {v}^\prime ||_2=\sqrt{\sum _{i=1}^5(v_i-v^\prime _i)^2}. \end{aligned}$$
(15)

A smaller \(\rho (\mathcal {D},\mathcal {D^\prime })\) means that the similarity between the data sets \(\mathcal {D}\) and \(\mathcal {D^\prime }\) is higher.
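Putting the pieces together, the following sketch assembles \(\mathbf {v}\) of (14) and the distance \(\rho \) of (15); it assumes the metric functions sketched in Sect. 3.1, including the KSerr transform of (13).

```python
import numpy as np

def dataset_feature(K, y):
    """Five-dimensional data set feature v of Eq. (14) (a sketch)."""
    ka = kernel_alignment(K, y)
    ks, proj = kernel_space_separability(K, y)
    kserr = ks ** 2 / (1 + ks ** 2)                   # Eq. (13)
    ror = overlap_region_ratio(proj, y)
    tem = test_equality_of_means(K, y)
    tev = test_equality_of_variances(K, y)
    return np.array([ka, 1 - kserr, 1 - ror, tem, tev])

def dataset_distance(v1, v2):
    """Similarity criterion rho of Eq. (15): Euclidean distance between features."""
    return np.linalg.norm(v1 - v2)
```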

4 Classifier Selection

Suppose that the historical data sets \(\mathcal {D}_1,\ldots ,\mathcal {D}_m\) and the testing data set \(\mathcal {D}\) are two-class problems. Our CS algorithm is shown in Algorithm 1.

[Algorithm 1]

4.1 Recommendation Algorithm

In step 2 of Algorithm 1, we use the KNN classifier as \(\mathcal {A}_R\), where the data set similarity is the distance between data set features. Assuming \(\mathcal {D}_j, j= 1,2,\cdots ,K\), are the K most similar data sets to \(\mathcal {D}\), the recommended classifier is selected as follows: (i) for each \(\mathcal {D}_j\), we assign a rank to every candidate classifier according to its performance on that data set. The classifier with the best performance has rank 1, while the classifier with the worst performance has rank \(\ell \); classifiers with the same performance share the average rank. (ii) Let \(R_{i,j}, i = 1,2,\cdots ,\ell \), denote the rank of classifier \(\mathcal {A}_i\) on \(\mathcal {D}_j\); the rank of classifier \(\mathcal {A}_i\) on \(\mathcal {D}\) is then computed as

$$\begin{aligned} R_{i,\mathcal {D}} = \frac{1}{K}\sum _{\mathcal {D}_j\in N_c(\mathcal {D})}R_{i,j}, j = 1,2,\cdots ,K, \end{aligned}$$
(16)

where \(N_c(\mathcal {D})\) is the set containing the K most similar data sets to \(\mathcal {D}\). Finally, the classifier with the lowest average rank is recommended.
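A sketch of this recommendation step, assuming the historical accuracies are stored in an \(m \times \ell \) array; rankdata with average ties implements step (i), and the default of five neighbours is illustrative only.

```python
import numpy as np
from scipy.stats import rankdata

def recommend(v_new, hist_features, hist_accuracy, k_neighbors=5):
    """Average-rank recommendation over the K most similar historical data sets (a sketch).

    hist_features: (m, 5) feature vectors of the historical data sets
    hist_accuracy: (m, l) accuracies of the l candidate classifiers on each data set
    """
    dists = np.linalg.norm(hist_features - v_new, axis=1)   # Eq. (15) against every data set
    nearest = np.argsort(dists)[:k_neighbors]               # the K most similar data sets
    # rank 1 = best accuracy; ties receive the average rank, as in step (i)
    ranks = np.vstack([rankdata(-hist_accuracy[j]) for j in nearest])
    avg_rank = ranks.mean(axis=0)                           # Eq. (16)
    return int(np.argmin(avg_rank))                         # index of the recommended classifier
```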

4.2 Multi-class Classification Problem

Our feature is only suitable for two-class data sets. We handle multi-class problems as follows.

  • Step 1: Suppose that data set \(\mathcal {D}\) has c classes. We split \(\mathcal {D}\) into \(m=\frac{c(c-1)}{2}\) two-class problems using the one-vs-one strategy.

  • Step 2: For each sub-problem, we recommend one classifier using Algorithm 1.

  • Step 3: The final decision is made by a voting strategy over the sub-problems' predictions.

The merit of this method is that we can select the most suitable classifier for each sub-problem, which can yield higher classification accuracy than any single classifier.
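The following sketch illustrates the splitting-and-voting scheme; recommend_and_train is a hypothetical helper standing in for Algorithm 1 on each two-class sub-problem and is assumed to return a fitted binary classifier with a predict method.

```python
import numpy as np
from itertools import combinations

def one_vs_one_predict(X_train, y_train, X_test, recommend_and_train):
    """One-vs-one splitting with a per-pair recommended classifier and majority voting (a sketch)."""
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)))
    for a, b in combinations(range(len(classes)), 2):             # c(c-1)/2 sub-problems
        mask = np.isin(y_train, [classes[a], classes[b]])
        clf = recommend_and_train(X_train[mask], y_train[mask])   # hypothetical Algorithm 1 wrapper
        pred = clf.predict(X_test)
        for i, p in enumerate(pred):
            votes[i, a if p == classes[a] else b] += 1
    return classes[np.argmax(votes, axis=1)]                      # majority vote
```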

5 Experiments

We evaluate the proposed feature against three state-of-the-art features with respect to computational efficiency and recommendation performance.

Table 1. Summary of testing data sets in terms of attributes, sample size and classes.

5.1 Experimental Setup

Data Sets. We selected 67 classification problems from the UCI repository, comprising 49 historical data sets and 18 testing data sets (Table 1). Among the historical data sets, the multi-class data sets are split into two-class data sets using the one-vs-one technique, and those that are easy to classify or have severely unbalanced/small classes are deleted. In total, we have 84 two-class historical data sets. The attributes of all data sets are normalized into \([-1,1]\).

Candidate Classifiers. We employ 20 candidate classifiers, including KNN, LDA, logistic regression, SVM (linear, polynomial kernel, RBF kernel), naive Bayes, decision tree C4.5, random forest, Bagging (tree) and AdaBoost (tree). These classifiers are run with the MATLAB Statistics Toolbox, except for SVM, which uses the LIBSVM software.

The remaining classifiers are the nearest mean classifier, Fisher's least-squares linear discriminant, BP neural network, linear perceptron, Bayesian classifier, Gaussian mixture model, Parzen classifier, Parzen density classifier and radial basis neural network classifier, all adopted from the PrTools toolbox 5.0. We run all code in MATLAB 2017a on a Windows operating system with an Intel(R) Core(TM) i5-6500 CPU @ 3.20 GHz processor.

Comparative Classifiers. We evaluate 24 methods on the testing data sets: the 20 candidate classifiers and CS with the following 4 data set features:

  • statistical feature (\(F_s\)) [7];

  • problem complexity feature (\(F_{p}\)) [4];

  • landmarking feature (\(F_l\)) [6] with landmarkers KNN, C4.5, LR and NB;

  • our data set feature using the polynomial kernel (\(F_{poly}\)) with degree \(d = 3\).

The attributes of the 4 data set features are normalized into [0, 1]. \(F_s\), \(F_p\) and \(F_l\) adopt the CS framework in Algorithm 1. For each testing data set, \(10\%\) of the samples of each class are held out as testing samples and the rest are used for training (the testing data set in Algorithm 1). The classification model of the recommended classifier is trained on the training samples using 10-fold cross-validation. For fairness, we also evaluate the candidate classifiers on the multi-class testing data sets using the same splitting and voting strategy.

Performance Metrics. We employ classification accuracy (CA), average recommendation performance ratio (ARPR) [8] and non-parametric statistical tests [16] to evaluate the performance of the data set features.

5.2 Computational Efficiency

We collected the computation times of the 4 data set features on the 18 testing data sets (Fig. 1); the recorded time for each data set is the sum of the times of its sub-problems. From Fig. 1, we see that our feature is the fastest to compute, taking about 160 seconds over all data sets, whereas \(F_s\), \(F_p\) and \(F_l\) are unacceptably slow. Although \(F_s\) outperformed our feature on data sets 2, 3 and 9, these data sets have discrete variables; for continuous variables, the efficiency of \(F_s\) degrades rapidly. Therefore, our feature outperforms \(F_s\), \(F_p\) and \(F_l\) in terms of efficiency.

Fig. 1. Running times (s) of \(F_{poly}\), \(F_s\), \(F_p\) and \(F_l\) on the testing data sets. The total times are 160.11 s, 14662.85 s, 31602.10 s and 77109.74 s, respectively.

5.3 Performance Comparisons

In this section, we compare our \(F_{poly}\) with three state-of-the-art data set features, \(F_s\), \(F_p\) and \(F_l\), as well as with the 20 candidate classifiers. The comparisons of CA, ARPR and the statistical test are listed in Table 2. We observe that \(F_{poly}\) has the highest CA and ARPR.

Fig. 2. CA (%) of best, BC and \(F_{poly}\), where best denotes the CA of the best candidate classifier.

Table 2. CA (%) of \(F_{poly}\), \(F_s\), \(F_p\) and \(F_l\) on testing data sets. The first column shows the ID of data sets. The last row reports the average rank of each CS algorithm. \(\mathcal {A}_{best}\) and \(\mathcal {A}_{worst}\) indicate the best and worst CA of candidate classifiers.

To check the statistical differences between the methods, we calculated the average rank of each feature, shown in the last row of Table 2. \(F_{poly}\) has the lowest average rank, 1.36, followed by \(F_s\); \(F_p\) has the worst average rank. The Friedman statistic is distributed according to the F-distribution with \((4-1) = 3\) and \((4-1)\times (18-1) = 51\) degrees of freedom. Its value is 11.64, while the critical value of F(3, 51) at the 0.05 significance level is 2.79, so the null hypothesis is rejected. We then applied the Nemenyi test for pairwise comparisons; the critical difference is 1.11, which means that \(F_{poly}\) is significantly better than \(F_p\) and \(F_l\).
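For reference, a sketch of the statistics used here, following the Friedman (Iman-Davenport) and Nemenyi formulas in [16]; q_alpha = 2.569 is the Studentized-range constant for four methods at the 0.05 level, which reproduces the critical difference of 1.11 for N = 18.

```python
import numpy as np

def friedman_nemenyi(avg_ranks, N, q_alpha=2.569):
    """Iman-Davenport F statistic and Nemenyi critical difference (a sketch).

    avg_ranks: average rank of each of the k methods over the N testing data sets.
    """
    k = len(avg_ranks)
    chi2 = 12 * N / (k * (k + 1)) * (np.sum(np.square(avg_ranks)) - k * (k + 1) ** 2 / 4)
    f_stat = (N - 1) * chi2 / (N * (k - 1) - chi2)    # compared with F(k-1, (k-1)(N-1))
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))     # ~1.11 for k = 4, N = 18
    return f_stat, cd
```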

Finally, we compare the CA of \(F_{poly}\) with that of the best candidate classifier and of the Bayesian classifier (BC), which has the highest average CA among the 20 candidate classifiers, as shown in Fig. 2. The CA of \(F_{poly}\) is very close to that of the best candidate classifier except on data sets 7 and 18, and \(F_{poly}\) equals or exceeds the best candidate classifier on 11 data sets. \(F_{poly}\) matches or outperforms BC in 14 out of 18 cases, and on the 4 data sets where BC outperforms \(F_{poly}\), their CA values are very close.

6 Conclusion

The difficulty of CS stems mainly from measuring the similarity among data sets. Existing work addresses this problem by characterizing each data set with a feature and comparing the similarity of these features. In this paper, we proposed a new data set feature that describes the classification complexity of a data set. Unlike previous work, our feature has low computational complexity and theoretical support. We built a CS framework using the proposed feature. Experimental results show that our feature is effective and efficient: our method outperforms three state-of-the-art data set features, which means the proposed feature can help choose suitable classifiers for new classification problems.