1 Introduction

Real-world machine learning problems are often described by a large number of features, many of which negatively affect the learning performance. Feature selection aims to improve the quality of feature sets by selecting a small number of relevant features. Therefore, feature selection can reduce the dimensionality to avoid the “curse of dimensionality” [5], leading to better learning performance, a faster training process and simpler learned models. This work focuses on feature selection for classification [25].

Feature selection [6] is a challenging task because of its large search space. An exhaustive search guarantees an optimal feature subset but is impractical in most situations. Evolutionary computation (EC) has been widely applied to feature selection because of its potential global search ability, especially genetic algorithms (GAs) and particle swarm optimization (PSO). In comparison with GAs, PSO has fewer parameters and has been shown to be more efficient and effective in some areas [4]. In GAs, the crossover and mutation operators drive convergence towards the optimal solution, but without careful design these two operators may break up good groups of complementary features when solving feature selection problems. Therefore, PSO is used as the search method in this work.

Complex feature interactions also make feature selection a challenging task. A good fitness function, which measures the quality of feature subsets, should be able to capture feature interactions. According to the evaluation criterion, feature selection approaches can be divided into two categories: wrapper and filter approaches. In filters, feature subsets are evaluated based on the characteristics of the data, independently of any classification algorithm. However, most filter measures cope only with either continuous or discrete datasets. In addition, it is usually difficult for a filter measure to detect multi-way feature interactions. Nguyen et al. [20] compute multi-variate mutual information among a set of features, which yields promising results but at a high computation cost. Wrappers use a classification algorithm to evaluate feature subsets, which ensures that feature interactions are taken into account. Therefore, wrappers usually achieve better classification performance than filters. In this work, we use a wrapper approach to achieve feature selection.

In terms of computation cost, wrappers are usually more expensive than filters because they involve classification processes. This problem is alleviated in our previous work [21], where a surrogate model for PSO-based wrapper feature selection is proposed. In particular, an instance selection algorithm called DROP3 [23] is applied to select a small number of training instances, which form a surrogate training set. The surrogate set is used to quickly locate promising regions. A local search based on the surrogate training set is developed to use information from previous iterations to improve the current gbest. The results show that the proposed algorithm is less computationally intensive than using the original training set while still achieving similar or better classification performance. Although the initial design of surrogate training sets works well, several key factors need to be further investigated. For example, DROP3 usually requires a nicely distributed training set and may miss informative instances or retain noisy instances. Furthermore, the relationship between the surrogate training set and the whole training set is not fully investigated. We continue our previous work on surrogate models for feature selection to address the above issues.

Goal In this study, based on the previous work [21], we aim to improve and further investigate the surrogate model for feature selection, which is expected to increase the classification accuracy while still having a low computation cost. Particularly, a clustering algorithm is utilized instead of DROP3 to produce a better surrogate training set. Furthermore, the relationship between the surrogate and full training sets is also investigated, from which a dynamic surrogate model is proposed so that it can adapt to different datasets to select a small number of features with high discriminating ability. Specifically, we will investigate the following questions:

  • whether applying the clustering algorithm can improve the quality of the selected features compared with using DROP3,

  • whether a bigger surrogate training set leads to better evolved feature subsets; note that as the surrogate training set’s size increases, it becomes more similar to the original training set, and,

  • whether the proposed dynamic surrogate model can exploit characteristics of datasets to select suitable surrogate training sets, which helps to evolve better feature subsets in a short training time.

2 Background

2.1 Particle swarm optimization

Particle swarm optimization (PSO) [8] was proposed in 1995, inspired by the social behavior of bird flocking. The underlying principle of PSO is knowledge sharing between particles to guide the swarm towards optimal points. Each particle has its own position and velocity in the search space. The velocity of a particle is calculated based on its previous velocity (momentum), pbest, which is its own best position (cognitive), and gbest, which is the best position discovered by its neighbors including itself (social).
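For reference, the widely used continuous PSO velocity and position update, which SBPSO (described below) redefines for binary search spaces, is commonly written as:

$$\begin{aligned} v^{t+1}_{d} = w\, v^{t}_{d} + c_1 r_1 \left( pbest_{d} - x^{t}_{d}\right) + c_2 r_2 \left( gbest_{d} - x^{t}_{d}\right), \qquad x^{t+1}_{d} = x^{t}_{d} + v^{t+1}_{d} \end{aligned}$$

where w is the inertia weight, \(c_1\) and \(c_2\) are the cognitive and social acceleration coefficients, and \(r_1\), \(r_2\) are random numbers uniformly drawn from [0, 1].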

PSO was originally developed to optimize continuous problems. Although it has been extended to cope with binary problems [9], its performance is still limited in comparison with the continuous version [28]. Nguyen et al. [18] propose a binary PSO called sticky binary PSO (SBPSO). In SBPSO, the velocity is a probability vector determining the flipping probability of position entries. The momentum is redefined as the tendency to stick with the current position, known as the stickiness property (stk). The stk is linearly decreased until it reaches 0 or the corresponding entry is flipped. The stickiness property of the dth entry is updated by the following equation:

$$\begin{aligned} stk^{t+1}_d = {\left\{ \begin{array}{ll} 1, &\quad \text {if the bit is just flipped}\\ \max \left( stk^{t}_d - \displaystyle \frac{1}{ustkS},\, 0\right), &\quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

where t is the iteration index and ustkS is the number of iterations needed to reduce stk from 1 to 0.

Based on stk, the flipping probabilities and position entries of a particle are defined as in Eqs. (2) and (3).

$$\begin{aligned} p_{d} = i_s (1-stk_{d}) + i_p \, |pbest_{d}-x_{d}| + i_g \, |gbest_{d}-x_{d}| \end{aligned}$$
(2)

where \(i_s\), \(i_p\) and \(i_g\) are the importance of the stickiness, cognitive and social factors, respectively.

$$\begin{aligned} x_{d}^{t+1} = {\left\{ \begin{array}{ll} 1 - x_d^t, &\quad \text {if } rand() < p_d\\ x_d^t, &\quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(3)

Experimental results show that SBPSO is more efficient and evolves better solutions than standard BPSO and probability-based BPSO [29] on feature selection. Therefore, SBPSO is used as the search mechanism in this work.
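A minimal sketch of the per-particle SBPSO update implied by Eqs. (1)–(3) is given below; the variable names and the use of NumPy are ours, and the default parameter values follow those listed later in Sect. 4.2.

```python
import numpy as np

def sbpso_update(x, stk, pbest, gbest,
                 i_s=0.1154, i_p=0.4423, i_g=0.4423, ustkS=40, rng=None):
    """One SBPSO step for a single particle (sketch of Eqs. (1)-(3)).

    x, pbest, gbest : 0/1 vectors over the original features
    stk             : stickiness values in [0, 1], one per feature
    """
    rng = np.random.default_rng() if rng is None else rng

    # Eq. (2): flipping probability of each position entry
    p = i_s * (1 - stk) + i_p * np.abs(pbest - x) + i_g * np.abs(gbest - x)

    # Eq. (3): flip the entries whose random draw falls below the probability
    flip = rng.random(x.shape) < p
    x_new = np.where(flip, 1 - x, x)

    # Eq. (1): reset stickiness of flipped entries, linearly decay the others
    stk_new = np.where(flip, 1.0, np.maximum(stk - 1.0 / ustkS, 0.0))
    return x_new, stk_new
```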

2.2 Related work on feature selection

Feature selection is a difficult task because of its large search space and complex feature interactions. Although an exhaustive search guarantees the best feature subset, it is infeasible in most cases due to its extremely high computation cost. Several techniques have been developed to reduce the computation cost, such as greedy searches [11] and sequential searches [14, 22, 27], which consider only one feature in each iteration. As a result, they usually suffer from stagnation in local optima.

EC has been widely applied to feature selection because of its potential global search ability. GAs were the earliest EC algorithms used to achieve feature selection [7, 24] because of their natural representation. Besides GAs, genetic programming (GP) can simultaneously perform feature selection and build a classifier [15]. In addition, GP is suitable for some machine learning tasks, such as regression [2, 10]. Recently, PSO has gained more attention from the feature selection community [1, 30] because of its efficiency and effectiveness. However, since EC is a family of population-based optimization techniques, EC algorithms usually require a large number of evaluations. Therefore, EC-based feature selection algorithms are usually computationally intensive. In order to improve the efficiency, many filter measures are used in EC-based feature selection [3, 17, 20].

There have been few attempts to reduce the computation cost of wrapper EC-based feature selection. Nguyen et al. [19] improve the efficiency of PSO-based feature selection by shortening the length of particles. Although the computation time is reduced, the evaluation time remains largely the same as in standard PSO-based feature selection; the saving comes from the position-updating process and from the upper bound on the number of selected features. Wang and Liang [26] directly modify the fitness measure by splitting a training set into many subsets. From each subset, a number of features are selected and then all selected features are combined to form the final feature subset. However, it is possible that features selected from different subsets are redundant. In our previous work [21], we use an instance selection algorithm to form a surrogate training set, which has fewer instances than the original training set. Therefore, the computation cost is significantly reduced since the evaluation time is much shorter. In this work, we investigate the surrogate model further by improving its quality with a clustering algorithm. In addition, different static surrogate models with various numbers of instances and a proposed dynamic surrogate model are examined to analyze the effect of surrogate training sets.

3 Methodology

In this section, we firstly describe DROP3 and its limitations. We then propose how to use a clustering algorithm to address the limitations. We also propose a dynamic surrogate model for feature selection, which is expected to capture characteristics of datasets to evolve better feature subsets.

Fig. 1 DROP3 may remove informative instances

Fig. 2 DROP3 cannot remove noisy instances

3.1 DROP3 and its limitations

In our previous work [21], DROP3 is used to form surrogate training sets. In the first step, DROP3 removes all instances that are wrongly classified by their K nearest neighbors, which is expected to remove noisy instances. After that, all instances are sorted according to their closest distances to instances from other classes. Instances with larger distances are considered for removal earlier since they are likely to be far from the class boundary. An instance is removed if discarding it does not cause other instances that take it as one of their neighbors to be wrongly classified. The main idea of DROP3 is to preserve all instances on class boundaries and remove all inner instances, which requires the training set to be nicely distributed. Therefore, it is possible that DROP3 removes informative instances or retains noisy instances.

Let us consider the two examples given in Figs. 1 and 2, where there are two class labels (marked by red and green) and a KNN classification algorithm is used with \(\hbox {K} = 3\). In Fig. 1, the two green instances inside the dotted circle have the largest distances to instances from other classes, which means that they are considered for removal first. Removing the two instances does not affect any other green instances, since those are too far away. Therefore, DROP3 removes them even though they are on the class boundary. The consequence can be seen when classifying an unseen instance (marked by a question mark): if the two green instances are not removed, it is classified as a green instance, but removing them changes its class label. Thus DROP3 can remove informative instances.

Figure 2 gives an example where DROP3 is not able to remove noisy instances. In Fig. 2, there are three noisy red instances located inside the region of the green class. The distances from the three red instances to the green class are smaller than those of any other red instances, so according to DROP3 they are likely to be on the class boundary. In addition, removing one of the three red instances would cause the other two to be wrongly classified, so none of them is discarded by DROP3. It can be seen that even though DROP3 is designed specifically for KNN, it may result in a poor surrogate training set. In the next section, we will show how a clustering algorithm can address DROP3’s problems.
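To make the above description concrete, the following is a much-simplified sketch of the DROP3 steps as described in this section. It is an illustration rather than a faithful implementation: the associate-list bookkeeping of the original algorithm is replaced by a naive re-fit of KNN on the remaining instances.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def drop3_sketch(X, y, k=3):
    """Much-simplified illustration of the DROP3 steps described above."""
    n = len(X)

    # Step 1 (noise filter): drop instances misclassified by their k neighbours
    keep = []
    for i in range(n):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(np.delete(X, i, axis=0), np.delete(y, i))
        if knn.predict(X[i:i + 1])[0] == y[i]:
            keep.append(i)
    X, y = X[keep], y[keep]

    # Step 2: sort by distance to the nearest instance of another class,
    # largest distance first (instances far from the boundary are tried first)
    enemy_dist = [np.min(np.linalg.norm(X[y != y[i]] - X[i], axis=1))
                  for i in range(len(X))]
    order = np.argsort(enemy_dist)[::-1]

    # Step 3 (simplified): remove an instance if the remaining instances are
    # classified at least as well without it
    selected = set(range(len(X)))
    for i in order:
        rest = sorted(selected - {i})
        cur = sorted(selected)
        with_i = KNeighborsClassifier(k).fit(X[cur], y[cur]).predict(X[rest])
        without_i = KNeighborsClassifier(k).fit(X[rest], y[rest]).predict(X[rest])
        if np.sum(without_i == y[rest]) >= np.sum(with_i == y[rest]):
            selected = set(rest)
    return X[sorted(selected)], y[sorted(selected)]
```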

3.2 Clustering-based surrogate model

The problem with DROP3 is that it treats all instances from the same class as one group even though they may be far from each other, as shown in Figs. 1 and 2. In contrast, a clustering algorithm divides the instances of the same class into many clusters. Therefore, in Fig. 1, the two marked green instances are likely to be grouped into their own cluster, which ensures that the information they carry is preserved.

A representative is formed for each cluster, which contributes one instance to the surrogate training set. Therefore, the size of the surrogate training set is equal to the number of clusters. In this work, the centroid of a cluster, i.e. the instance closest to the cluster’s mean, is selected as the representative. The first reason for selecting the centroid is to ensure a fair comparison with DROP3, which also selects original instances to form the surrogate training set. The second is that using original instances preserves the relationships/interactions between features, while building a new instance from a cluster is more likely to construct new feature interactions which do not exist in test sets. Note that a cluster may not be pure, which means that it may contain instances from different classes. Therefore, only instances from the majority class, i.e. the class that contributes the largest number of instances in the cluster, are used to select the representative for the cluster. Hence in Fig. 2, the three noisy red instances along with their surrounding green instances are grouped in the same cluster, and the noisy instances are removed since the red class is the minority one in this cluster.

The question is which clustering algorithm should be used. K-means was proposed about 50 years ago and is widely used in clustering [13], so it may seem a good option. However, the main task of this work is to analyze how surrogate training sets with different sizes affect the performance of the selected feature subsets. Therefore, K-means would have to be run many times with different numbers of clusters, which is time consuming. Agglomerative clustering (AGG) [16] is a bottom-up hierarchical clustering algorithm in which each instance starts in its own cluster. When moving up the hierarchy, the two closest clusters are merged into one cluster, so a single run yields partitions of every possible size. An example of the agglomerative clustering algorithm is given in Fig. 3.

Note that although both DROP3 and AGG are deterministic algorithms, they have very different outputs and behaviors. DROP3 directly produces a unique surrogate training set for each dataset. On the other hand, AGG results in a set of possible clustering partitions, whose numbers of clusters (\(\#c\)) range from 1 to the total number of instances in the original training set (as shown in Fig. 3). Once \(\#c\) is decided, AGG produces only one unique clustering partition containing a unique set of clusters. Since only the centroid instance is selected from each cluster, the size of the surrogate set formed by AGG is equal to the number of clusters \(\#c\).
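A compact sketch of how such a surrogate training set can be built is shown below. It uses sklearn’s AgglomerativeClustering with its default linkage, which is an assumption since the linkage criterion is not specified in the text; the cluster mean is computed over all cluster members, while candidate representatives are restricted to the majority class.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def build_surrogate(X_train, y_train, n_clusters):
    """Build a surrogate training set of size n_clusters (sketch).

    Each cluster contributes its 'centroid instance': the instance of the
    cluster's majority class that is closest to the cluster mean.
    """
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X_train)
    reps = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        mean = X_train[members].mean(axis=0)
        # restrict candidates to the majority class, discarding likely noise
        classes, counts = np.unique(y_train[members], return_counts=True)
        cand = members[y_train[members] == classes[np.argmax(counts)]]
        # representative = existing instance closest to the cluster mean
        reps.append(cand[np.argmin(np.linalg.norm(X_train[cand] - mean, axis=1))])
    return X_train[reps], y_train[reps]
```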

3.3 Dynamic clustering-based surrogate model

In SBPSO-based feature selection, the position of each particle is a binary vector, in which each entry corresponds to an original feature. If the entry’s value is 1, the corresponding feature is selected. Otherwise, the feature is discarded. Therefore, each particle defines a feature subset which is evaluated according to the following fitness function:

$$\begin{aligned} fitness = \alpha \, Error + (1-\alpha ) \, \frac{\#selectedFeatures}{\#originalFeatures} \end{aligned}$$
(4)

where Error is the classification error rate, which can be measured on either the surrogate training set or the original training set, and \(\alpha \) controls the contributions of the two objectives. From now on, if Error is measured on the original training set, the fitness value is called the real fitness; if it is measured on the surrogate training set, it is called the surrogate fitness.
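A sketch of this fitness function is given below; smaller values are better. How the classification error on the chosen training set is estimated is not detailed here, so a simple cross-validation estimate with a KNN classifier (K = 5, as in the experiments) is assumed for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X, y, alpha=0.9, k=5):
    """Fitness of a feature subset, Eq. (4); smaller is better (sketch).

    mask : 0/1 vector over the original features (a particle position)
    X, y : the surrogate training set (surrogate fitness) or the full
           training set (real fitness)
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():                       # an empty subset gets the worst value
        return 1.0
    # error estimate on the chosen training set (cross-validation is an assumption)
    pred = cross_val_predict(KNeighborsClassifier(n_neighbors=k), X[:, mask], y, cv=5)
    error = np.mean(pred != y)
    return alpha * error + (1 - alpha) * mask.sum() / mask.size
```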

Fig. 3 Example of the agglomerative clustering algorithm

Since the surrogate training set is used to estimate possibly good regions, it is important that the surrogate fitness value is consistent with the real fitness value. For example, if a feature subset A is better than a feature subset B in terms of the surrogate fitness value, A should also be better than B when they are evaluated on the original training set. However, since feature subsets change during the evolutionary process, the consistency between the surrogate training set and the original training set may not be preserved. To address this problem, a dynamic clustering-based surrogate model is proposed. The task can be described as: “Given a pool of surrogate training sets, \(P = \{S_1,S_2,...,S_m\}\), which surrogate training set \(S_i\) should be used to evaluate feature subsets?”

In the initialization process, each particle is randomly initialized. After evaluating the particles using the original training set \(S_0\), the position \(x_{best}\) with the best real fitness value \(f_0\) is recorded and used to find the most suitable surrogate training set. Particularly, \(x_{best}\) is evaluated on the m surrogate training sets, which results in m surrogate fitness values \(\{f_1,f_2,...,f_m\}\). The surrogate training set with the smallest difference from the original training set, i.e. the smallest \(|f_i-f_0|\), is used to evaluate feature subsets in the following iterations. Hence, even in the initialization step, the surrogate training set is dynamically determined based on its consistency with the original training set.

In the first \(I_s\) iterations, feature subsets are evaluated by the surrogate training set, which is also dynamically updated to preserve the consistency with the original training set. However, updating the surrogate set too frequently makes it more difficult for PSO to adapt to changes in the fitness landscape. Therefore, the surrogate set is only updated when the real fitness value of gbest has not improved for a certain number of iterations (NIStep). The process of finding the most suitable surrogate training set is similar to the method used in the initialization process, except that \(x_{best}\) is replaced by gbest. After the surrogate process, i.e. the first \(I_s\) iterations, is finished, the original training set is used to evaluate the candidate solutions. The pseudo-code of the dynamic surrogate model is given in Algorithm 1.

Algorithm 1 Pseudo-code of the dynamic surrogate model
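The surrogate-selection step of the dynamic model (used at initialization with \(x_{best}\) and later with gbest) can be sketched as follows, reusing the fitness() sketch given after Eq. (4).

```python
def select_surrogate(best_mask, pool, X_full, y_full, alpha=0.9, k=5):
    """Return the index of the surrogate set most consistent with the full set.

    pool : list of (X_i, y_i) surrogate training sets, i.e. {S_1, ..., S_m}
    The chosen set minimizes |f_i - f_0|, where f_0 is the real fitness of
    the best position found so far.
    """
    f0 = fitness(best_mask, X_full, y_full, alpha, k)          # real fitness
    # surrogate whose fitness of best_mask differs least from the real one
    return min(range(len(pool)),
               key=lambda i: abs(fitness(best_mask, pool[i][0], pool[i][1],
                                         alpha, k) - f0))
```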

4 Experiment design

4.1 Datasets

The proposed methods are tested on 12 datasets chosen from the UCI machine learning repository [12]. The datasets are selected so that they have different numbers of features (#Fs), classes and instances, which can be seen in Table 1. Each dataset is divided into training and test sets containing 70% and 30% of the instances, respectively, with the class distribution roughly preserved.
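As an illustration of this protocol, a 70/30 stratified split can be produced as follows; the random seed is arbitrary and not taken from the paper.

```python
from sklearn.model_selection import train_test_split

# 70%/30% split with the class distribution roughly preserved (seed is arbitrary)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```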

Table 1 Datasets

In the experiments, the performances of different surrogate training sets are examined. Firstly, the DROP3 algorithm and the agglomerative clustering algorithm are compared. To ensure a fair comparison, the number of clusters in the clustering algorithm is equal to the number of instances selected by DROP3. Therefore, the two algorithms result in two surrogate training sets with the same size, which are called “DROP3” and “AGG”, respectively. Since it was shown in our previous work [21] that the surrogate model built by DROP3 was already better than an improved version of sequential feature selection search, AGG is not compared with sequential searches due mainly to the page limit.

Fig. 4 Evolutionary process of PSO on the Madelon dataset

However, DROP3 usually selects a very small number of instances. For example, on Arrhythmia, the surrogate set built by DROP3 contains only 26 instances, which is 10 times smaller than the total number of features. Therefore, we decide to examine surrogate training sets produced by the clustering algorithm with various numbers of instances. In particular, five surrogate training sets, whose sizes range from 10 to 50% of the original training set, are generated. The lower bound of 10% is to ensure that the surrogate training sets contain enough training instances. The upper bound is set to 50% so that the surrogate model can still significantly reduce the computation cost over using the original training set. The five surrogate training sets form a training set pool, from which the dynamic surrogate model picks the most suitable training set during the surrogate process. The percentages, i.e. “10%”, “20%”, “30%”, “40%” and “50%”, are used to name the methods with the corresponding surrogate training sets, while the dynamic surrogate model is called “Dynamic”.
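A sketch of building this pool, reusing the build_surrogate() sketch from Sect. 3.2, is given below; the cluster counts are simply 10–50% of the training set size.

```python
# Pool of static surrogate training sets used by the dynamic model (sketch)
pool = [build_surrogate(X_train, y_train, n_clusters=int(p * len(X_train)))
        for p in (0.1, 0.2, 0.3, 0.4, 0.5)]
```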

Table 2 Comparison of different \(I_s\) values against \(I_s = 75\)

4.2 Parameter settings

The feature subsets are evaluated using a KNN classification algorithm, where K is set to 5 to avoid issues caused by noisy instances while maintaining good efficiency. \(\alpha \) in Eq. (4) is set to 0.9 to ensure that the classification performance has a higher priority than the number of selected features. For SBPSO, \(i_s\), \(i_p\), \(i_g\), and ustkS are set to 0.1154, 0.4423, 0.4423, and 40, respectively, as suggested by the results in Nguyen et al. [18]. Five different values of \(I_s\) ranging from 0 to 100 are examined and the results show that 75 is the most suitable setting. Table 2 shows the results of statistical significance tests, which compare the value 75 against the other four values of \(I_s\) on four datasets with different numbers of features. “+”/“=”/“\({-}\)” means that 75 is significantly better than, similar to, or significantly worse than the other value, respectively. It can be seen that the value of 75 achieves similar or better performance than the three smaller values (0, 25, 50) while being less computationally intensive. In addition, the value 75 is significantly better than 100, i.e. using only the surrogate training set. The population size is set to the number of features and limited to 100. The maximum number of iterations is set to 100. NIStep is set to 5 as an indication that the algorithm might be trapped in a local optimum. An evolutionary process of PSO on the Madelon dataset is shown in Fig. 4. It can be seen that if the gbest’s fitness value (vertical axis) does not change for more than 5 iterations, it is very likely that it will not change in the following iterations.

Table 3 Comparison between DROP3 and agglomerative clustering algorithms
Table 4 Results of clustering-based surrogate models

5 Results and discussions

5.1 Effect of applying the clustering algorithm

Table 3 shows the comparison between applying DROP3 and the agglomerative clustering algorithm (AGG). In the table, “#Features” means the number of selected features, and “Training” and “Testing” represent the training and testing accuracies, respectively. Note that both DROP3 and AGG use the same number of instances to build the surrogate training sets. The two models are compared using the Wilcoxon signed rank test with the significance level set to 0.05. “\(\uparrow \)” or “\(\downarrow \)” means that AGG is significantly better or worse than DROP3, while “\(\circ \)” indicates that there is no significant difference between the two algorithms. In terms of training accuracy, AGG is significantly better than DROP3 on 4 datasets while being worse only on German. On the test sets, the feature subsets selected by AGG are never worse than the ones selected by DROP3. On 3 out of the 12 datasets, AGG’s accuracies are significantly higher than DROP3’s. In addition, the feature subsets selected by AGG are smaller than those of DROP3 on 6 datasets. On the other 4 datasets, the two algorithms select a similar number of features. The experimental results show that, given the same number of selected instances, AGG can retain more informative instances to form more consistent surrogate training sets, which results in better classification accuracies. Although AGG and DROP3 use surrogate training sets of the same size, AGG usually selects a smaller number of features. A possible reason is that AGG can remove outliers, so it selects only the features necessary to distinguish instances from different classes. DROP3 may select some noisy instances, which possibly requires additional features to correctly classify them.

5.2 Results of clustering-based surrogate models

The results of clustering-based surrogate models with different sizes of the surrogate set are shown in Table 4. The best classification accuracies on both training and test sets are marked in bold. In comparison with other clustering-based models, AGG can achieve the best performance on only 1 out of the 24 cases (including both training and testing accuracies). In terms of the number of selected features, AGG usually selects a smaller number of features than the other methods. The reason for this pattern is that AGG uses the smallest number of instances, so it does not need to select as many features as the other methods. This is an example of underfitting, where AGG does not have enough instances to select a sufficient number of informative features which are necessary for classifying unseen instances.

As can be seen in Table 4, depending on the characteristics of the datasets, the best accuracies are achieved by different sizes of surrogate training sets. In most cases, surrogate training sets ranging from 30 to 50% produce the best accuracies, since these training sets are more similar to the original ones. However, on 3 datasets, Australian, ImageSegmentation and WBCD, 10 and 20% achieve the best performance, which may indicate that these 3 datasets contain noisy instances and that small surrogate training sets help to eliminate these instances. An important pattern shown in Table 4 is the consistency between training and testing performance. Specifically, on 6 out of the 12 datasets, both the best training and the best testing accuracies are achieved by the same method. On the other datasets, although this exact consistency does not hold, the method with the best testing accuracy usually has the second best training performance. This pattern shows that, to some extent, using surrogate models can help to avoid overfitting.

In order to analyze the conditions under which a surrogate model can locate good search regions, the evolutionary process of the best run of each surrogate model (10–50%) is shown in Fig. 5. The horizontal axis shows the iterations and the vertical axis shows the real fitness of gbest at each iteration. Due to the space limitation, only 6 out of the 12 datasets are shown; the evolutionary processes on the other 6 datasets have similar patterns. Note that in the first 75 iterations, particles are evaluated by the surrogate set, which means that in terms of the surrogate fitness, a later gbest is never worse than an earlier gbest. However, in the figure, gbest is re-evaluated using the original training set, so a later gbest is not guaranteed to have a better real fitness value. Therefore, a less fluctuating evolutionary process indicates that the corresponding surrogate model is more consistent with the original training set. By collating Table 4 and Fig. 5, it can be seen that the method with the least fluctuating evolutionary process usually yields the best classification accuracy. For example, on the Arrhythmia dataset, the best training and testing accuracies are achieved by the 50% surrogate model, which has the least fluctuating evolutionary process.

Fig. 5 Real evolutionary processes on different sizes of surrogate training sets

5.3 Results of the dynamic surrogate model

As illustrated in Sect. 5.2, in order to achieve good classification performance, it is important to maintain the consistency between the surrogate and the original training sets. However, which surrogate model should be selected depends heavily on the dataset. Therefore, the dynamic surrogate model is designed with the expectation of selecting the most suitable surrogate training set during the evolutionary process. The results of the dynamic model are shown in the “Dyn” column of Table 4. A Wilcoxon signed rank test is used to compare the dynamic model with the fixed-size surrogate models. “\(\uparrow \)”, “\(\downarrow \)” or “\(\circ \)” shows that the dynamic surrogate model is significantly better than, significantly worse than, or not significantly different from the others.

On the training sets, except for German, the dynamic model (Dyn) achieves similar or better performance than the other methods. Specifically, Dyn is significantly better than AGG on 5 datasets. From 10 to 40%, the number of datasets on which Dyn is superior ranges from 5 down to 1. Similarly, on the test sets, except against the 50% surrogate model, Dyn is not worse than the other models on most datasets. On each dataset, the 7 models are sorted according to their accuracies, and their average ranks over all datasets are shown in Table 5. The smaller the rank, the better the method. It can be seen that on the training sets, Dyn is the second best and is only worse than 50%, which is understandable since the 50% surrogate training set is the most similar to the original training set. However, on the test sets, Dyn is only ranked at the 4th position, which is an indication of overfitting. A possible reason is the step of finding the most suitable surrogate training set (line 10 in Algorithm 1), which can be considered a local search. In terms of computation time, 50% is the only model which is worse than Dyn. Moreover, in comparison with AGG, Dyn is at most 2 times slower, and less than 1.5 times slower on 9 out of the 12 datasets. It was shown in our previous work [21] that the surrogate model built by DROP3 was already 3–4 times less computationally expensive than using the original training set. Given that AGG and DROP3 have the same computation cost, both the static (AGG) and the dynamic (Dyn) models reduce the computation cost over using the original training set.

The experimental results show that the dynamic model can adapt to different datasets to select a suitable surrogate training set, which results in similar or better performance than the other algorithms on most datasets. However, the dynamic model may suffer from overfitting, which results in less general feature subsets than the other algorithms.

Table 5 Average ranks of clustering-based surrogate models

6 Conclusions and future work

This work investigates the effect of surrogate models on feature selection. Firstly, the quality of the surrogate training set is improved by a clustering algorithm, which divides the original training set into many clusters. The surrogate training set is formed by selecting a centroid instance as the representative of each cluster. The experimental results show that, when selecting the same number of instances, the clustering-based surrogate model can maintain or improve the classification performance while selecting fewer features than the DROP3-based surrogate model on most datasets. In addition, various clustering-based surrogate models with different numbers of instances are examined. The results highlight the importance of selecting enough informative instances to avoid underfitting. It is also shown that, to some extent, using surrogate models can improve the generalization of the evolved feature subsets. In addition, it is necessary to maintain the consistency between the surrogate and the original training sets. To ensure this consistency, a dynamic surrogate model is proposed, which automatically selects the most suitable surrogate training set during the evolutionary process. The dynamic model can adapt to different datasets to consistently achieve similar or better training accuracies than the other static surrogate models.

Although the dynamic model achieves good results, there are issues which we will investigate in the future. For example, the dynamic model may suffer from overfitting, which means its testing accuracies are not as good as its training accuracies. In addition, the clustering-based models use the same pre-defined sizes on all datasets. It would be better if the pool of surrogate training sets were designed with respect to the characteristics of each dataset. However, this is not an easy task since it requires a deep understanding of each dataset. In the future, we will further investigate and develop surrogate models on larger datasets in terms of both the number of features and the number of instances.