1 Introduction

Machine learning algorithms such as the decision tree (DT), back-propagation network (BPN), and support vector machine (SVM) are very popular and can be applied to various areas. However, most machine learning algorithms suffer from parameter-setting and feature-selection problems (Han and Kamber 2006). Before these methods are applied to a problem, the parameter values must be set in advance to avoid constructing an over-fitting or under-fitting model. There are no clear rules for the “best” parameter settings and feature selection; in general, they are determined by a trial-and-error process, which may affect the classification accuracy of the model. A number of automated techniques have been proposed that search for “good” parameters and feature subsets. These techniques typically use a hill-climbing approach that starts with an initial value or feature subset; although such techniques obtain a result quickly, they may easily become trapped in a sub-optimal solution.

A DT can be used easily in numerous domains because it does not impose restrictions (e.g., that variables be independent or follow a normal distribution) that are imposed by other techniques such as discriminant analysis and regression (Berry and Linoff 2001). Moreover, a DT has the following benefits: (1) it is a simple method for representing knowledge; (2) it can handle nominal and categorical data, performs well, and learns relatively faster than other classification methods (Han and Kamber 2006); and (3) it provides information about the relevance of features for prediction purposes: the closer a feature is to the tree root, the greater its relevance for predicting the class of the data (Freitas 1998).

There are various DT algorithms, such as iterative dichotomiser 3 (ID3) (Quinlan 1986), classification and regression tree (CART) (Quinlan 1987), supervised learning in quest (SLIQ) (Han and Kamber 2006), scalable parallelizable induction of decision tree (SPRINT) (Sun et al. 2007), and the C4.5 algorithm (Sun et al. 2007). Four criteria (predictive accuracy, speed, robustness, and interpretability) are used to analyze the model created by a DT. The C4.5 algorithm, which satisfies these criteria, is the most popular DT algorithm (Han and Kamber 2006). However, before the C4.5 algorithm is applied to a problem, parameters such as the minimum case and the pruning confidence level must be set in advance, and they must be determined carefully to avoid over- or under-fitting. The minimum case controls whether the tree continues to grow in the tree-construction phase, and the pruning confidence level influences whether a node of the tree is deleted in the pruning phase. For example, if the minimum case is set to a small value, the tree may become very large with too many branches, some of which reflect anomalies due to noise or outliers. That is, the classification accuracy rate will be very good on the training data but poor on the testing (unseen) data; this situation is called over-fitting. On the other hand, if the minimum case is set to a large value, the tree may be very small and the classification accuracy rate on the training data may deteriorate; the classification accuracy rate on the testing data may even exceed that on the training data.
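To make the effect of the minimum-case parameter concrete, the following sketch (an illustration added here, not part of the original study) uses scikit-learn's CART-based DecisionTreeClassifier as a stand-in for C4.5: its min_samples_leaf parameter plays a role similar to the minimum case, and cost-complexity pruning (ccp_alpha) is the closest available analogue of confidence-level pruning. The dataset and parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A very small minimum leaf size tends to over-fit (large train/test gap);
# a very large one tends to under-fit (both accuracies drop).
for min_leaf in (1, 5, 50, 150):
    clf = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    clf.fit(X_tr, y_tr)
    print(min_leaf, round(clf.score(X_tr, y_tr), 3), round(clf.score(X_te, y_te), 3))
```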

Selecting the right set of features is a difficult problem when designing a good classifier. Typically, one does not know a priori which features are relevant for a particular classification task. One common approach is to collect as many features as possible prior to the learning and data-modeling phase. However, in most pattern classification problems, given a large set of potential features, identifying a small subset with which to classify the data objects is generally necessary. Data without feature selection may be redundant or noisy and may decrease classification efficiency. The primary benefits of feature selection are as follows: (1) computational cost and storage requirements are reduced; (2) the degradation of classification efficiency caused by irrelevant or redundant features in the training samples is overcome; (3) training and prediction times are reduced; and (4) data understanding and visualization are facilitated (Abe and Kudo 2005). In feature selection, whether each feature is useful must be determined, and the task of finding an optimal subset of features is inherently combinatorial; feature selection therefore becomes an optimization problem, and guaranteeing an optimal subset would require examining all possible subsets. This study presents a novel scatter search-based approach that provides the best parameter settings for the C4.5 algorithm and identifies a beneficial subset of features for different problems, such that the classification accuracy rate of the C4.5 algorithm is maximized.

The remainder of this paper is organized as follows. Section 2 reviews previous studies of DTs, feature selection, and scatter search. Section 3 describes the proposed SS + DT approach for determining optimal parameter settings for the C4.5 algorithm and identifying the most beneficial feature subset. Section 4 presents experimental results. Conclusions and future research directions are given in Sect. 5.

2 Literature review

2.1 Decision tree

Most DTs employ a top-down strategy that recursively partitions a dataset into smaller subdivisions; the partitioning is based on a set of tests applied at each branch of the tree. The tree-like structure is composed of a root node (formed from all the data), a set of internal nodes (splits), and a set of terminal nodes (leaves). Each internal node corresponds to a variable, and an arc to a child represents a possible value of that variable. A leaf represents the predicted value of the target variable given the values of the variables on the path from the root node.

The DT construction process has two principal phases: the growth phase and the pruning phase (Kim and Koehler 1995). During the growth phase, for a set of samples in partition S, a test feature X is selected to further partition the set into \(S_1, S_2, \ldots, S_L\), which are added to the decision tree as children of node S. Additionally, the node for S is labeled with test X, and the partitions \(S_1, S_2, \ldots, S_L\) are then recursively partitioned.
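To make the growth phase concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of the recursive partitioning just described. The splitting criterion is left abstract as a choose_test callable; in C4.5 it would be the gain ratio of Sect. 2.1, and min_cases stands for the minimum-case parameter.

```python
from collections import Counter

def grow(samples, features, choose_test, min_cases=2):
    """Recursively partition `samples` (a list of (feature_dict, label) pairs).
    `choose_test` selects the test feature X for the current node S."""
    labels = [y for _, y in samples]
    # Stop when the node is pure, too small (minimum case), or no features remain.
    if len(set(labels)) == 1 or len(samples) < min_cases or not features:
        return Counter(labels).most_common(1)[0][0]          # leaf: majority class
    x = choose_test(samples, features)                        # test feature for node S
    partitions = {}                                           # S_1, ..., S_L by outcome of X
    for row, y in samples:
        partitions.setdefault(row[x], []).append((row, y))
    remaining = [f for f in features if f != x]
    return {x: {value: grow(subset, remaining, choose_test, min_cases)
                for value, subset in partitions.items()}}
```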

The iterative dichotomiser 3 (ID3) algorithm (Quinlan 1986, 1987) and its successor, the C4.5 algorithm (Quinlan 1993), are the primary focus of research in the field of DT learning. During the growth phase, the central choice made by the ID3 algorithm is which feature to test at each node, selecting the feature that is most useful for classifying the examples. The C4.5 algorithm uses an information entropy evaluation function as the selection criterion (Quinlan 1993). The entropy evaluation function is calculated as follows.

Step 1: Calculate Info(S), the average information needed to identify the class of a case in the training set S.

$$ \text{Info}(S) = - \sum\limits_{i=1}^{k} \frac{\text{freq}(C_{i},S)}{|S|}\,\log_{2}\frac{\text{freq}(C_{i},S)}{|S|}, $$
(1)

where \( |S| \) is the number of cases in the training set, \( C_{i} \) is a class, \( i = 1, 2, \ldots, k \), k is the number of classes, and \( \text{freq}(C_{i},S) \) is the number of cases in S that belong to class \( C_{i} \).

Step 2: Calculate the expected information value, \( \text{Info}_{X}(S) \), after feature X is used to partition S.

$$ \text{Info}_{X}(S) = \sum\limits_{i=1}^{L} \frac{|S_{i}|}{|S|}\,\text{Info}(S_{i}), $$
(2)

where L is the number of outcomes of feature X, \( S_{i} \) is the subset of S corresponding to the ith outcome, and \( |S_{i}| \) is the number of cases in subset \( S_{i} \).

Step 3: Calculate the information gained after partitioning according to feature X.

$$ \text{Gain}(X) = \text{Info}(S) - \text{Info}_{X}(S) $$
(3)

Step 4: Calculate the partition information value, \( {\text{SplitInfo}}(X), \) acquired for S partitioned into L subsets.

$$ \text{SplitInfo}(X) = - \sum\limits_{i=1}^{L} \frac{|S_{i}|}{|S|}\,\log_{2}\frac{|S_{i}|}{|S|} $$
(4)

Step 5: Calculate the gain ratio of Gain(X) over SplitInfo(X).

$$ \text{GainRatio}(X) = \text{Gain}(X)/\text{SplitInfo}(X) $$
(5)

GainRatio(X) compensates for the weak point of Gain(X) by normalizing it with SplitInfo(X), the quantity of information generated by partitioning the training set on X. The feature with the highest GainRatio(X) is adopted as the decision tree root (and, recursively, as the splitting feature of each subtree). The gain ratio criterion is robust and results in small trees (Quinlan 1993). To avoid over-fitting, splitting can be stopped when a specified threshold (e.g., the minimum number of cases required for a split search) is reached (Osei-Bryson 2007); this threshold is the so-called minimum case.
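As an illustration only (the data and function names below are hypothetical, not taken from the paper), the following Python sketch computes Eqs. (1)–(5) for a single categorical feature.

```python
import math
from collections import Counter

def info(labels):
    """Eq. (1): entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Eqs. (2)-(5): information gain and gain ratio of one categorical feature."""
    n = len(labels)
    # Partition the class labels by the feature's outcomes.
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    info_x = sum(len(s) / n * info(s) for s in subsets.values())            # Eq. (2)
    gain = info(labels) - info_x                                            # Eq. (3)
    split_info = -sum(len(s) / n * math.log2(len(s) / n)
                      for s in subsets.values())                            # Eq. (4)
    return gain / split_info if split_info > 0 else 0.0                     # Eq. (5)

# Hypothetical toy data: one feature with outcomes 'a'/'b' and two classes.
x = ['a', 'a', 'b', 'b', 'b', 'a']
y = [1, 1, 2, 2, 1, 1]
print(gain_ratio(x, y))
```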

The aim of the pruning phase is to generalize the DT generated during the growth phase by producing a sub-tree that avoids over-fitting the training data. The actions taken in the pruning phase are often called post-pruning. The approach taken in C4.5 is based on a confidence level, which uses an estimated error rate to determine whether the tree built in the growth phase requires pruning at certain nodes. The probability of error cannot be determined exactly; however, it has a probability distribution that is generally summarized by a pair of confidence limits. For a given confidence level, the upper limit of this probability can be determined from the confidence limits of the binomial distribution. C4.5 then simply equates the estimated error rate at a leaf with this upper limit, based on the argument that the tree has been constructed to minimize the observed error rate (Quinlan 1993).
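For illustration only (this helper is my own sketch, not code from the paper), the following computes the upper confidence limit just described: given E observed errors among N training cases at a leaf and a confidence level CF, it finds, by bisection, the error probability whose binomial distribution would produce at most E errors with probability CF.

```python
from math import comb

def binom_cdf(e, n, p):
    """P(X <= e) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(e + 1))

def upper_error_limit(e, n, cf, tol=1e-7):
    """Upper confidence limit of the error probability: the p for which observing
    <= e errors in n cases has probability cf. Used as the pessimistic error
    estimate at a leaf."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid          # e or fewer errors still too likely; raise the limit
        else:
            hi = mid
    return hi

print(upper_error_limit(2, 20, 0.25))   # e.g., 2 errors in 20 cases at CF = 25%
```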

For constructing the DT model, the most difficult task is to obtain a good balance between accuracy and simplicity. Unfortunately, the appropriate minimum case (M) for a leaf and pruning confidence level (CF) vary for different problems. Generally, M should be set to a higher value when the data are noisy; on the other hand, CF should be set to a lower value when the test error rate of the pruned tree exceeds the estimated error rate. Determining these two parameters is an optimization problem (Quinlan 1993).

John (1994) observed that determining how to set the parameters is an important issue for C4.5 algorithms and investigated several cross-validation-based approaches (C4.5*, CVC4.5, and small CVC) to identify the best parameter values for the C4.5 algorithm. Moreover, Kohavi and John (1995) developed a best-first search algorithm that determines the parameter values of the C4.5 algorithm using the minimum estimated error. Many studies have extended the DT. Carvalho and Freitas (2002) utilized a genetic algorithm to discover small-disjunct rules and compared the results with those obtained by three versions of the C4.5 algorithm alone on eight public-domain datasets. Gray and Fan (2008) proposed a genetic algorithm approach for constructing DTs, called the randomly generated evolved tree (TARGET), which performs a better search of the tree model space than a greedy search algorithm. Aitkenhead (2008) created an evolutionary approach that increases DT flexibility using co-evolving competition between the tree and the training dataset. Orsenigo and Vercellis (2004) developed an algorithm for creating a DT in which multivariate splitting rules are based on a new concept of discrete support vector machines (LDSDTTS). These studies focused on tree construction and rule generation; however, they did not consider parameter settings and feature selection simultaneously.

2.2 Feature selection

The DT requires a dataset for model construction. A dataset can have many features; however, not all features are useful for classification. When a dataset has considerable noise and complex dimensionality, a DT may have difficulty learning the classification patterns. Although the C4.5 algorithm has a feature-selection strategy embedded in its learning procedure, this strategy is not optimal: correlated and irrelevant features may reduce the performance of the induced classifier (Perner and Apte 2004).

Feature selection can be defined as selecting the smallest subset of the original set of features that is necessary and sufficient for describing the target concept. Approaches to feature selection can be categorized into two models: the filter model and the wrapper model (Liu and Motoda 1998). Filter models utilize statistical approaches, such as factor analysis (FA), independent component analysis (ICA), principal component analysis (PCA), and discriminant analysis (DA), to investigate indirect performance measures, primarily based on distance and information measures, for feature selection. Sun et al. (2007) combined PCA with the C4.5 algorithm: PCA is used to reduce the number of features, and the C4.5 algorithm is then trained to generate a DT model for the diagnosis of rotating machinery. Last et al. (2001) presented an information-theoretic feature-selection algorithm to enhance the C4.5 algorithm; this algorithm finds a set of features by removing irrelevant and redundant ones. Perner and Apte (2004) combined the C4.5 algorithm with a contextual merit (CM) algorithm to select features and showed that the accuracy of the C4.5 classifier can be improved with an appropriate feature pre-selection phase for the learning algorithm. Although the filter model is fast, the resulting feature subset may not be optimal (Liu and Motoda 1998).

In the wrapper approach, feature subset selection is performed around an induction algorithm that is treated as a black box. The feature subset selection algorithm conducts a search for a good subset using the induction algorithm itself as part of the evaluation function. Some studies have proposed that when the objective is to minimize the classification error rate and the measurement cost is equal for all features, the classification accuracy rate of the classifier is the most appropriate measure. That is, a classifier should be constructed with the goal of achieving the highest possible classification accuracy rate, and the features used by that classifier should be selected as the optimal features. This is the so-called wrapper model, which uses a selection method to choose feature subsets and then evaluates each selection by the accuracy rate obtained by the classification algorithm. When relevant features are selected and noise is removed, the classification accuracy rate of the classifier can be improved.

Smith and Bull (2005) utilized genetic programming to pre-process data and then applied the C4.5 algorithm to ten well-known datasets from the UCI (University of California, Irvine) repository. López et al. (2006) developed three scatter search-based algorithms, the sequential scatter search with a greedy combination (SSS-GC), the sequential scatter search with a reduced greedy combination (SSS-RGC), and the parallel scatter search (PSS), to solve the feature-selection problem for three algorithms: the instance-based algorithm, the Naive Bayes algorithm, and the C4.5 algorithm. However, these algorithms do not consider parameter settings for the C4.5 algorithm; thus, the optimal solution may be excluded. Few studies have considered parameter settings and feature selection simultaneously for the C4.5 algorithm. Because irrelevant and redundant features exist in classification problems, when parameter settings and feature selection are not considered simultaneously, the optimal model may be excluded. Su and Shiue (2003) proposed a GA/DT approach to determine the optimal parameter values and feature subset for production control systems; however, the GA/DT approach was adopted only for a specific problem, so further comparisons cannot be made.

To illustrate the feature-selection problem in the DT algorithm, the example shown in Table 1 is used. This example has 32 instances and five variables, \(a_1\), \(a_2\), \(a_3\), \(a_4\), and \(a_5\), that can be used to predict the class. There are three classes, labeled 1, 2, and 3.

Table 1 Data values of example

If the default parameter settings of C4.5 (M = 2 and CF = 25%) are used and all five variables are fed to C4.5 (i.e., feature selection is not applied), the C4.5 algorithm uses four variables (\(a_1\), \(a_2\), \(a_4\), and \(a_5\)) to build the classification model. The classification accuracy rate is 87.5% (28/32), and the tree structure is shown in Fig. 1a. If a certain feature-selection method is applied first, only three variables, \(a_1\), \(a_3\), and \(a_5\), are needed to construct the classification model. The classification accuracy rate is 90.63% (29/32), and the tree structure is shown in Fig. 1b.

Fig. 1

Tree structure of the DT example (M = 2 and CF = 25%). a Without feature selection. b With feature selection

Figure 1 shows that the C4.5 algorithm may easily fall into a local optimum (a lower classification accuracy rate) because of the greedy search that it uses. If feature selection is not used before constructing the DT model, the root node is \(a_1\), the left nodes are \(a_4\) and \(a_2\), and the right node is \(a_5\). If feature selection is used before constructing the DT model, only three variables are needed to construct the C4.5 model, and a higher classification accuracy rate is obtained. The most important variable (the root node) is the same, but the second most important variable changes to \(a_3\) in the left node. Moreover, with feature selection, fewer variables are used to construct the classification model, which has higher classification accuracy and may have a simpler tree structure.

2.3 Scatter search

Introduced by Glover (1977), scatter search (SS) is a population-based approach that starts with a collection of reference solutions obtained by applying preliminary heuristic processes. In 1998, Glover published the SS template (Glover 1998), which presents an algorithmic description of the SS method and is considered a milestone in the SS literature; many different applications were subsequently developed, showing the potential of SS for solving various complicated optimization problems (Martí 2006). Scatter search is a powerful meta-heuristic that has been applied successfully to various applications; a sample list of these applications can be found in Laguna and Martí (2003). To the best of our knowledge, several studies have applied scatter search to machine learning algorithms. For example, Su et al. (2005) proposed a hybrid procedure combining neural networks and scatter search to optimize the continuous parameter design of a back-propagation neural network. López et al. (2006) developed three scatter search-based algorithms to solve the feature-selection problem. Rasha et al. (2006) proposed a scatter search-based automatic clustering approach to discover the number of clusters and the cluster centers without prior knowledge of the possible number of classes and without any initial partition. However, these studies did not apply SS to parameter determination and feature selection simultaneously.

Briefly, unlike genetic algorithms, an SS operates on a small set of solutions and makes only limited use of randomization as a proxy for diversification when searching for a globally optimal solution. Based on formulations initially proposed for combining decision rules and constraints, an SS uses strategies to combine solutions and create a balance between quality and diversification in the reference set.

Generally, the principal components in an SS can be described as follows.

(1) A diversification generation method generates a population of solutions that satisfy a critical level of diversity.

(2) An improvement method transforms a trial solution into an enhanced feasible trial solution.

(3) A reference set update method builds and maintains a reference set that is a collection of high-quality solutions and diverse solutions. The reference set is the basis for creating new combined solutions.

(4) A subset generation method is applied to the reference set and produces a subset of solutions as a basis for creating combined solutions.

(5) A solution combination method transforms a given subset of solutions produced by the subset generation method into one or more combined new solutions.

Since each of these methods can be implemented in a variety of ways and with different degrees of complexity, the SS procedure is very adaptable to different problems. Only four of the five components are strictly required in an SS; the improvement method, the only exception, is applied to generate high-quality solutions when they are not provided by the other components.

3 The proposed approach

This study presents a novel SS-based approach that provides the best parameter settings for the C4.5 algorithm and finds a beneficial subset of features for different problems, such that the classification accuracy rate of the C4.5 algorithm is maximized. How scatter search is applied to parameter determination and feature selection for DTs, the solution representation and objective function calculation, and the SS + DT procedure are discussed as follows.

3.1 Solution representation and objective function value

This study adopted an SS-based approach, called SS + DT, for parameter determination and feature selection in the C4.5 algorithm. For the C4.5 algorithm without feature selection, only the two parameter values M and CF were necessary. For the C4.5 algorithm with feature selection, if the dataset has n features, then n additional indicator variables had to be identified to determine which features were selected. The values of these n variables range from 0 to 1: if a variable value is ≤0.5, the corresponding feature is not chosen; conversely, if a variable value is >0.5, the corresponding feature is chosen. For example, if the dataset had four features, then, together with the two parameters required by the C4.5 algorithm, six variables were used, as shown in Fig. 2. That solution can be decoded as follows: M is 5, CF is 32%, and the selected features are 1, 2, and 4. The range of M and CF in the solution representation was 0–1, and the real values of M and CF were scaled to a specific range related to the input dataset.

Fig. 2

Solution representation of SS + DT

Although four criteria (predictive accuracy, speed, robustness, and interpretability) exist for evaluating the model created by a DT, the classification accuracy rate is the most commonly employed. Therefore, the classification accuracy rate was adopted as the objective function in this study.
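The following Python sketch (my own illustration, not the authors' code) shows how a solution vector of the form in Fig. 2 could be decoded and scored. The M and CF ranges follow those used in the experiments (M in 2–20, CF in 0.01–0.35), and scikit-learn's CART-based DecisionTreeClassifier is used as a stand-in for the C4.5 classifier; its min_samples_leaf loosely mimics the minimum case, and CF has no direct analogue here.

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

M_MIN, M_MAX = 2, 20          # ranges used in the experiments (Sect. 4)
CF_MIN, CF_MAX = 0.01, 0.35

def decode(solution, n_features):
    """Split a vector in [0, 1]^(2 + n_features) into (M, CF, feature mask)."""
    m = math.ceil(solution[0] * (M_MAX - M_MIN) + M_MIN)
    cf = solution[1] * (CF_MAX - CF_MIN) + CF_MIN
    mask = np.asarray(solution[2:2 + n_features]) > 0.5
    return m, cf, mask

def objective(solution, X, y):
    """Average 10-fold classification accuracy of a tree built from the decoded
    parameters and the selected features (CF is ignored by the CART stand-in)."""
    m, cf, mask = decode(solution, X.shape[1])
    if not mask.any():                      # at least one feature must be chosen
        return 0.0
    clf = DecisionTreeClassifier(min_samples_leaf=m, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=10).mean()
```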

3.2 Applying SS for C4.5 DT

The proposed SS + DT approach follows the steps of the SS template (Laguna and Martí 2003) and is described as follows.

(1) Diversification generation method: A population P of \(P_{\text{size}}\) solutions was generated randomly. Because all values in the solution representation range from 0 to 1, each value was generated uniformly from 0 to 1.

(2) Improvement method: The improvement method is optional in the SS template and was therefore not applied in this study.

(3) Reference set update method: The reference set, RefSet, collects both high-quality solutions and diverse solutions, which are used to generate new solutions by applying the combination method. The size of the reference set was \( b = b_{1} + b_{2} = |{\text{RefSet}}| \), where \(b_1\) was the number of high-quality solutions and \(b_2\) the number of diverse solutions. Construction of the initial reference set starts by selecting the \(b_1\) best solutions (those with the highest classification accuracy rates) from P; these solutions were added to RefSet and deleted from P. For each solution in P-RefSet, the minimum Euclidean distance to the solutions in RefSet was calculated, and the solution with the maximum of these minimum distances was selected. This solution was then added to RefSet and deleted from P, and the minimum distances were updated accordingly. This process was repeated \(b_2\) times, where \( b_{2} = b - b_{1} \). Thus, the resulting reference set had \(b_1\) high-quality solutions and \(b_2\) diverse solutions.

(4) Subset generation method: The size of the subsets was set to 2; that is, only subsets consisting of all pair-wise combinations of solutions in RefSet were considered. Therefore, at most \( b(b - 1)/2 \) subsets exist.

(5) Solution combination method: The method employed consists of finding linear combinations of reference solutions. Each combination of two reference solutions, denoted \( Y^{\prime} \) and \( Y^{\prime\prime} \), was used to create three trial solutions: (1) \( Y = Y^{\prime} - d \), (2) \( Y = Y^{\prime} + d \), and (3) \( Y = Y^{\prime\prime} + d \), where \( d = u(Y^{\prime\prime} - Y^{\prime})/2 \) and u is a random number between 0 and 1. A minimal code sketch of this combination step is given directly after this list.

3.3 Demonstration of SS + DT procedure

For demonstration purposes, the dataset in Sect. 2.2 was used, and the parameter values for SS + DT were set as follows: \(P_{\text{size}}\) = 8, b = 4, \(b_1\) = 2, \(b_2\) = 2, M ranges from 2 to 5, and CF ranges from 1 to 30%.

3.3.1 Diversification generation method

Because all variables in the solution range from 0 to 1, the \(P_{\text{size}}\) solutions were generated by drawing each variable uniformly from 0 to 1; the results are shown in Table 2.

Table 2 Diverse solutions

In this table, solution 1 represents M = 3 (0.295 × (5 − 2) + 2, rounded up to an integer), CF = 28% (0.921 × (30% − 1%) + 1%, rounded up to an integer), and the selection of features 1, 2, 3, and 5 (\(x_3\) > 0.5, \(x_4\) > 0.5, \(x_5\) > 0.5, \(x_7\) > 0.5) for creating the DT model. Meanwhile, solution 2 represents M = 3 (0.258 × (5 − 2) + 2), CF = 8% (0.233 × (30% − 1%) + 1%), and the selection of features 1, 3, and 4 (\(x_3\) > 0.5, \(x_5\) > 0.5, and \(x_6\) > 0.5) for creating the DT model. The other solutions can be interpreted in the same way. After each DT model was created, the classification accuracy rate of each solution was calculated; it is shown in the last column of Table 2.

3.3.2 Reference set update method

Table 3 shows the best \(b_1\) solutions in P, which were immediately added to RefSet. The first column of this table shows the solution number in P, followed by the variable values and the objective function value. Solutions \(X^8\) and \(X^1\) in P had the highest objective function values and became the first and second solutions in RefSet, respectively.

Table 3 High-quality solution in RefSet

We then calculated the minimum distance \( d_{\min}(X) \) between each solution X in P-RefSet and the solutions Y currently in RefSet; that is, \( d_{\min}(X) = \min_{Y \in {\text{RefSet}}} \{ d(X,Y)\} \), where \( d(X,Y) \) is the Euclidean distance between X and Y, \( d(X,Y) = \sqrt{\sum\nolimits_{i=1}^{n} (X_{i} - Y_{i})^{2}} \).

For example, the minimum distance between a solution in Table 2 (e.g., \(X^2\)) and the RefSet solutions in Table 3 (i.e., \(X^8\) and \(X^1\)) was calculated as follows:

$$ d_{\min}(X^{2}) = \min \{ d(X^{2},X^{8}), d(X^{2},X^{1})\} = \min (1.193, 1.289) = 1.193. $$

The maximum \( d_{\min} \) value among the solutions in P-RefSet corresponds to solution \(X^7\) in P \( (d_{\min}(X^{7}) = 1.300) \). We added this solution to RefSet, deleted it from P, and updated the \( d_{\min} \) values. The new maximum \( d_{\min} \) value of 1.135 corresponded to solution \(X^6\) in P; the diverse solutions added to RefSet are shown in Table 4.

Table 4 Diverse solutions in RefSet
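A minimal sketch (my own illustration, not the paper's code) of the diversity step just demonstrated: repeatedly move, from the remaining population into RefSet, the solution whose minimum Euclidean distance to the current RefSet is largest.

```python
import math

def euclidean(x, y):
    """d(X, Y): Euclidean distance between two solution vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def add_diverse_solutions(refset, population, b2):
    """Move the b2 most diverse solutions from `population` into `refset`
    using the max-min distance criterion."""
    refset, population = list(refset), list(population)
    for _ in range(b2):
        # d_min(X) = min over Y in RefSet of d(X, Y)
        d_min = [min(euclidean(x, y) for y in refset) for x in population]
        best = max(range(len(population)), key=lambda i: d_min[i])
        refset.append(population.pop(best))
    return refset, population
```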

3.3.3 Subset generation method

This method generates the subsets of reference solutions that are subjected to the combination method. Because the subset size was set to 2 in this study, there were at most \( b \times (b - 1)/2 = 4 \times (4 - 1)/2 = 6 \) subsets.

3.3.4 Combination method

Suppose solutions \(X^1\) and \(X^8\) were selected for use in the combination method. The three trial solutions were (1) \( Y = Y^{\prime} - d \), (2) \( Y = Y^{\prime} + d \), and (3) \( Y = Y^{\prime\prime} + d \), where \( d = u(Y^{\prime\prime} - Y^{\prime})/2 \) and u was a random number between 0 and 1. Note that if the value of a variable was lower than 0, it was set to 0, and if it was larger than 1, it was set to 1.

Suppose u was 0.543, \( Y^{\prime\prime} = X^{1} \), and \( Y^{\prime} = X^{8} \); the three trial solutions then obtained are shown in Table 5. The best solution in Table 5 is \(X^9\), with an objective function value of 90.6%.

Table 5 New trial solutions generated from the combination of solutions \(X^1\) and \(X^8\)

Using the subsets produced by the subset generation method, more combinations were obtained and used to create additional trial solutions. The search continued in a loop consisting of the combination method followed by the reference set update method; this loop terminated when the termination conditions were met.

3.4 System architecture of SS + DT

The SS-based approach for parameter determination and feature selection of a DT was constructed according to the following steps (a sketch of the overall search loop is given after the list).

(1) Input dataset and data pre-processing: After the dataset was input, the k-fold approach developed by Salzberg (1997) was applied with k = 10. Thus, the dataset was segmented into 10 portions, with each portion sharing the same proportion of each class. Nine portions were used in the training process, whereas the remaining portion was used in the testing process. Because the number of cases in each class was not a multiple of 10, the dataset could not be partitioned perfectly evenly; however, the ratio of the number of training cases to the number of testing cases was kept as close as possible to 9:1.

(2) Feature subset selection and determination of parameter values: Each solution generated by the SS encoded a selected subset of features and the parameter values. The selected features, parameter values, and training dataset were then used to build the DT classifier model. Each DT classifier model was created by calling the C4.5 algorithm provided by Quinlan (1993).

(3) Objective function value calculation: After the classification model was constructed, the objective function value was calculated; the higher the classification accuracy rate, the better the objective function value.

(4) Termination criteria: When the termination criteria were satisfied, the process ended; otherwise, the next iteration was run. The termination criteria used in this study were the maximum number of solutions evaluated, \(S_{\max}\), and the allowable number of successive non-improving solutions evaluated, \(N_{\text{non-improving}}\). That is, the SS procedure terminated when the number of solutions evaluated exceeded \(S_{\max}\) or the best objective function value had not improved over \(N_{\text{non-improving}}\) successive solutions.

(5) The SS process: In this step, the system generated further solutions using the SS methods described in Sect. 3.2.
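The following sketch (my own outline under stated assumptions, not the authors' implementation) ties these steps together into a single search loop with the two stopping rules \(S_{\max}\) and \(N_{\text{non-improving}}\). It is deliberately simplified: the reference set is updated by quality only, and `objective` is any function mapping a solution vector in [0, 1] to a classification accuracy rate, such as the decode-and-score helper sketched in Sect. 3.1.

```python
import itertools
import random

def scatter_search(objective, n_vars, p_size=30, b1=5, b2=5,
                   s_max=1500, n_non_improving=300):
    """Simplified SS + DT search loop with the paper's two stopping rules."""
    def euclid(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

    # Diversification generation: random population in [0, 1]^n_vars.
    pop = [[random.random() for _ in range(n_vars)] for _ in range(p_size)]
    scored = sorted(((objective(s), s) for s in pop), reverse=True)
    evals, stall, best_val = len(scored), 0, scored[0][0]

    # Initial reference set: b1 best solutions plus b2 diverse ones (max-min distance).
    refset, rest = scored[:b1], scored[b1:]
    for _ in range(b2):
        i = max(range(len(rest)),
                key=lambda j: min(euclid(rest[j][1], r[1]) for r in refset))
        refset.append(rest.pop(i))

    while evals < s_max and stall < n_non_improving:
        # Subset generation + solution combination over all pairs in RefSet.
        for (_, y1), (_, y2) in itertools.combinations(list(refset), 2):
            u = random.random()
            d = [u * (b - a) / 2 for a, b in zip(y1, y2)]
            for raw in ([a - di for a, di in zip(y1, d)],
                        [a + di for a, di in zip(y1, d)],
                        [b + di for b, di in zip(y2, d)]):
                trial = [min(1.0, max(0.0, v)) for v in raw]
                val = objective(trial)
                evals += 1
                stall = 0 if val > best_val else stall + 1
                best_val = max(best_val, val)
                # Simplified reference set update: keep the b1 + b2 best solutions seen.
                refset = sorted(refset + [(val, trial)], reverse=True)[:b1 + b2]
                if evals >= s_max or stall >= n_non_improving:
                    return best_val
    return best_val
```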

4 Experimental results

The proposed SS + DT approach was implemented in the C language under the Windows XP operating system on a personal computer with a Pentium IV 3.0 GHz CPU and 512 MB of RAM. To verify the effectiveness of the proposed SS + DT approach, 23 datasets from the UCI Machine Learning Repository (Hettich et al. 1998) were used. These datasets included several high-dimensional datasets (Anneal, Breast cancer new, Ionosphere structure, and Sonar), several large datasets (Adult, Segmentation, and Wave), and one dataset that is both high-dimensional and large (Connect). Table 6 shows the number of features, instances, and classes for these datasets. As the C4.5 algorithm can handle missing values, each missing value was replaced by a “?” and instances with missing values were retained during the experiments. The range of M is 2–20, whereas the range of CF is 0.01–0.35 (1–35%).

Table 6 Datasets from the UCI repository

After preliminary runs on a few datasets with several combinations of parameter settings for the case in which feature selection was not considered (i.e., only the M and CF values of the C4.5 algorithm needed to be searched), the SS parameter values \(P_{\text{size}}\), \(b_1\), \(b_2\), \(N_{\text{non-improving}}\), and \(S_{\max}\) were set to 30, 5, 5, 300, and 1,500, respectively. If the number of solutions evaluated exceeded 1,500, the proposed approach was terminated; likewise, if the best solution obtained had not improved over 300 successive solutions, the proposed approach was terminated.

Because tenfold cross-validation was utilized, the minimum case M, the confidence level CF, and the classification accuracy rate were obtained by executing the proposed SS + DT approach once for each fold. As the proposed SS + DT approach is non-deterministic, the solutions obtained may differ across runs on the same data. Thus, the proposed SS + DT approach was executed three times for each fold of each dataset to calculate the average classification accuracy rate; that is, the SS + DT approach was executed 30 (10 × 3) times for each dataset. The classification results obtained were then compared with those obtained by C4.5*, CVC 4.5, small CVC 4.5 (John 1994), and the C4.5 algorithm with default values (Table 7). Classification accuracy rates were cited from their original studies. Notably, only for Monk2 (66.95% with SS + DT versus 67.1% with C4.5* and 67.1% with small CVC 4.5) was the classification accuracy rate obtained by SS + DT not the best among the compared approaches; for all other datasets, the proposed SS + DT approach obtained the best classification accuracy rate.

Table 7 Comparison between various approaches without feature selection (%)

For feature selection, because the solution space was enlarged by the n indicator variables, \(S_{\max}\) was increased to 3,000 and the other parameters for SS + DT remained unchanged. The classification results obtained were compared with those obtained by the sequential scatter search with a greedy combination (C4.5 + SSS-GC) (López et al. 2006), a genetic algorithm (C4.5 + GA), and genetic programming (C4.5 + GAP) (Smith and Bull 2005) (Table 8).

Table 8 Comparison of other approaches with feature selection (%)

Classification accuracy rates for these three methods were taken from their original studies. Based on the classification accuracy rates obtained, the proposed SS + DT approach was the best for all datasets.

Furthermore, the classification results obtained by SS + DT with feature selection were compared with those obtained by other DT-based methods, including Hybrid C4.5/GA (Carvalho and Freitas 2002), TARGET (Gray and Fan 2008), the co-evolutionary DT (Aitkenhead 2008), and LDSDTTS (Orsenigo and Vercellis 2004) (Table 9). Their classification accuracy rates were also obtained from the original studies. Only the classification accuracy rates for Connect (73.30% with SS + DT versus 75.93% with Hybrid C4.5/GA) and Hepatitis (83.42% with SS + DT versus 84.97% with Hybrid C4.5/GA) obtained by SS + DT were not the best among the compared approaches. The other classification accuracy rates obtained by the proposed SS + DT approach were the best, demonstrating that the approach performs well on various problems. To sum up, only 3 of the 23 datasets had classification accuracy rates obtained by the proposed approach that were worse than those of other approaches.

Table 9 Comparison between the SS + DT approach with feature selection and other approaches (%)

Finally, to determine whether a significant difference existed between the proposed SS + DT approach with feature selection and that without feature selection, the classification results obtained by the two variants were compared (Table 10). Although the classification accuracy rate for Monk2 was reduced, the classification accuracy rates for all other datasets increased. Among the datasets whose classification accuracy rate increased, only one (Anneal) showed no significant difference; all other datasets had P < 0.05, indicating that a significant difference existed.

Table 10 Experimental results of SS + DT with/without feature selection on datasets

Table 10 shows that all average computation times were within an acceptable range, and that for some datasets the difference in computation cost with and without feature selection was quite small (Breast cancer new, Breast cancer original, Bupa liver, Car evaluation, CRX, Monk2, New thyroid, Pima Indians diabetes, Sonar, Voting, and Wine). The \(N_{\text{non-improving}}\) stopping criterion worked well; it balanced the computation cost of the SS + DT algorithm against the classification accuracy rate. The proposed approach can cope with high-dimensional datasets (Anneal, Breast cancer new, Ionosphere structure, and Sonar), large datasets (Adult, Segmentation, and Wave), and a dataset that is both high-dimensional and large (Connect). The results showed that the proposed approach performed well and that feature selection can enhance the classification accuracy rate and remove irrelevant or redundant features.

To check for over-fitting and under-fitting, the classification accuracy rates of SS + DT on both the training data and the testing data are shown in Table 11. Because there was no large difference between the training and testing results, the proposed approach does not appear to suffer from over-fitting or under-fitting. Moreover, Table 11 provides the average tree size (average number of nodes) for each dataset. Notably, even for the several datasets in which feature selection did not significantly improve the classification accuracy rate, SS + DT removed some irrelevant features and produced a smaller tree structure.

Table 11 Validation of the proposed scatter-based meta-heuristics for the training and testing result (%)

Table 11 shows that the proposed approach provides high classification accuracy rates on both the training data and the testing data, and that feature selection helps the C4.5 algorithm reduce, or at least maintain, the tree size while the classification accuracy rate increases or remains similar.

This study applied scatter search to the C4.5 algorithm to determine the parameter values and a beneficial feature subset simultaneously for different problems, in order to obtain a higher classification accuracy rate. The experimental results showed that scatter search is indeed beneficial for determining the parameter values and feature selection of C4.5; therefore, scatter search has the potential to be applied to other machine learning algorithms in future research.

5 Conclusion and future research

Machine learning algorithms such as DT, BPN, and SVM are very popular and can be applied to various areas. However, most machine learning algorithms suffer from parameter-setting and feature-selection problems. This study applied an SS-based approach to search for the best parameter settings for the C4.5 algorithm. Using high-dimensional datasets (Anneal, Breast cancer new, Ionosphere structure, and Sonar), large datasets (Adult, Segmentation, and Wave), and a dataset that is both high-dimensional and large (Connect), the proposed approach performed well. Moreover, Table 10 shows that the computation cost remained within an acceptable time, meaning that the proposed approach can solve high-dimensional, large, and combined high-dimensional and large problems without requiring large computation costs. Compared with previous studies, the proposed SS + DT approach shows good performance by obtaining higher classification accuracy rates. With feature selection, the proposed SS + DT approach effectively deletes some moderating or non-influential features while maintaining the same or a superior classification accuracy rate. Furthermore, the effects of the remaining features on classification can be examined in the future. The main contributions of the proposed approach include:

(1) The trial-and-error method traditionally used to determine the parameters of the C4.5 algorithm is time-consuming and cannot guarantee a better result. The proposed approach can be used for automatic parameter determination for the C4.5 algorithm.

(2) Feature selection helps the C4.5 algorithm reduce, or at least maintain, the tree size while the classification accuracy rate increases or remains similar compared with the C4.5 algorithm without feature selection.

(3) The experimental results showed that scatter search is indeed beneficial for determining the parameter values and feature selection of C4.5; therefore, scatter search has the potential to be applied to other machine learning algorithms.

More studies can be done in the future. First, as the proposed SS-based meta-heuristic is versatile, exploring its potential application to other data-mining techniques (such as BPN, SVM, and probabilistic graphical models) could improve the classification results. Second, the proposed SS + DT approach can be applied to other real-world problems to determine whether it can solve them effectively. Finally, an ensemble architecture can be used. The original concept of an ensemble is based on a committee machine, whose purpose is to integrate the opinions of many experts, rather than only one expert, to obtain a good classification result. Therefore, a multi-decision-tree model can be utilized in an ensemble architecture to further enhance the classification accuracy rate; this is currently being investigated by the authors of this study.