
1 Introduction

A feature selection method selects a small subset of highly predictive features from the original feature set. It usually yields better results because it reduces noise and distraction, and it requires less classifier training time than using the entire feature set. Feature selection approaches can be classified into three categories: wrapper, filter, and hybrid approaches.

Given a classification problem, a wrapper method incorporates the classifier itself in the feature evaluation process. To evaluate a candidate feature subset, a classification model is built and its performance is used to score the subset. Maroño et al. [11] propose a wrapper-based feature selection method using ANOVA decomposition and functional networks to calculate global sensitivity indices; features with high index values are selected. Zhuo et al. [19] use a genetic algorithm (GA) to optimize support vector machine (SVM) kernel parameters while selecting a feature subset, with classification accuracy serving as both the fitness function and the criterion for selecting features. The wrapper approach is expected to return a subset of features that yields high accuracy, since every candidate feature set is evaluated by the classifier used in the problem. However, because classification models are trained and tested many times, this approach becomes very time-consuming as data grow in dimensionality and number of instances, and in many cases it is inapplicable.

In a filter method, instead of performing classification as part of the feature selection process, a quality measure is used to evaluate each feature set. The filter-based approach comprises two important components: a selection algorithm and a criterion function. The selection algorithm creates candidate features while the criterion function selects features and evaluates feature subsets. The criterion function can be independent of the classification model, but it should be suitable for the problem. The filter-based approach generally takes less time than the wrapper approach since no classifier is trained and tested, and it is therefore preferable for real-world problems, especially those with large data sets. Many researchers find that it yields subsets with lower accuracy than the other two approaches; however, this is not always the case, since some criterion functions return subsets with equivalent or better performance.

Yu and Liu [15] use symmetrical uncertainty as the measure to select features that are relevant to the classes and not redundant with other selected features. Zhou et al. [18] propose a forward algorithm that selects features using conditional maximum entropy modeling to approximate the gain of each feature. Fleuret [4] uses conditional mutual information (CMI) as the criterion function to speed up the forward search process. Haindl et al. [6] propose a backward filter-based feature selection method based on mutual correlation, a similarity measure between two variables, to select features that are mutually uncorrelated.

The hybrid approach takes advantage of both the wrapper and the filter approaches: it applies a filter-based technique to select highly significant features and a wrapper-based technique to add candidate features and evaluate candidate sets. Zhang et al. [17] apply the RELIEFF algorithm to estimate the quality of attributes according to how well their values distinguish between instances that are close to each other, and apply a GA with classifier accuracy as the fitness function to search for an optimal feature subset. Somol et al. [13] present a hybrid floating search, named hSFFS, in which a filter criterion function is first applied to pre-filter features and generate a candidate set, and a wrapper criterion function is then applied to select the best feature from the candidate set; this is a wrapper-dominating hybrid method. Gan et al. [5] propose a filter-dominating alternative to hSFFS, in which a filter criterion is used to select the best feature from the unselected set and a wrapper criterion is then used to evaluate the feature subset.

Problems usually found in real-world applications involve mixtures of ambiguous and noisy data, which leads to inaccurate classification models. Fuzzy logic, a multi-valued logic that allows intermediate values to be defined between conventional crisp evaluations (e.g., true/false, yes/no), provides a simple way to draw conclusions from vague, ambiguous, imprecise, noisy, or missing input information [3]. Membership functions for fuzzy sets can be of any shape or type, such as triangular, trapezoidal, or Gaussian-shaped, as determined by experts in the domain over which the sets are defined.

This paper proposes a feature selection technique for classification that uses two criterion functions and feature fuzzification with irregular-shaped membership functions evolved by genetic algorithm and particle swarm optimization. The technique is evaluated on standard machine learning data sets of various sizes and complexities.

2 Proposed Feature Fuzzification

Irregular-shaped membership functions for every continuous attribute are evolved. Values of those attributes are fuzzified to create a suitable set of value ranges. All attributes are then fed into the filter-based feature selection algorithm which employs two criterion functions to generate the best set of predictive features.

The membership function (MF) shape determined in advance by experts may not be suitable for a specific problem at hand, especially one with a large and complex search space. We convert the wrapper-based hierarchical co-evolutionary algorithm by Huang et al. [7] for generating irregular-shaped membership functions (ISMFs) into a filter-based algorithm using two optimization techniques, genetic algorithm and particle swarm optimization, where a criterion function is used in order to improve efficiency. An MF shape is represented as one pivot point, left shoulder points, and right shoulder points, as depicted in Fig. 1.

Fig. 1. An irregular-shaped membership function
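To make the ISMF representation concrete, the sketch below (ours, not the authors' code) evaluates a membership function defined by a pivot point with membership 1 and sets of left and right shoulder points. It assumes piecewise-linear interpolation between the points; the exact shape semantics used in [7] may differ.

```python
import numpy as np

def ismf_membership(x, left_pts, pivot, right_pts):
    """Evaluate an irregular-shaped membership function (ISMF).

    The shape is assumed piecewise linear: membership rises through the
    left shoulder points to 1.0 at the pivot and falls through the right
    shoulder points.  Each shoulder point is a (value, membership) pair,
    listed in increasing order of value.
    """
    xs = [p[0] for p in left_pts] + [pivot] + [p[0] for p in right_pts]
    ys = [p[1] for p in left_pts] + [1.0] + [p[1] for p in right_pts]
    return float(np.interp(x, xs, ys, left=0.0, right=0.0))  # 0 outside the support

# Example: two left and two right shoulder points around a pivot at 5.0.
print(ismf_membership(4.0, [(1.0, 0.0), (3.0, 0.6)], 5.0,
                      [(6.5, 0.4), (8.0, 0.0)]))
```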

2.1 Membership Function Evolution by Genetic Algorithm

A genetic algorithm can be employed to create membership functions for continuous variables. An irregular-shaped MF is represented as one pivot point, left shoulder points, and right shoulder points, as shown in Fig. 2(a). Fuzzy partitions of each input variable are encoded as genetic segments and concatenated into one first-level (L1-level) chromosome for that variable. A chromosome in the second level (L2-level) consists of genes pointing to chromosomes for all variables in the L1-level; an L2-level gene contains the integer value of an index into the corresponding L1-level population. GA operations (crossover, mutation, and selection) change the coordinates of the points, which in turn changes the MF shapes. Constraints and repairing schemes are applied before decoding the genetic representation.

Fig. 2. Chromosome structure

The algorithm partitions and encodes possible solutions as populations at different levels, allowing for different kinds of chromosomes and genetic operations. A higher-level chromosome selects a set of lower-level chromosomes to form a solution. In this way, a highly complicated search task can be partitioned into several subtasks that are handled simultaneously and effectively. The structure of the chromosome is shown in Fig. 2(b).
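As an illustration of the two-level structure, the sketch below uses a hypothetical encoding: each L1-level chromosome holds the point coordinates of one candidate partition for one variable, and an L2-level chromosome holds one integer gene per variable indexing into that variable's L1 pool. The names and pool sizes are ours, for illustration only.

```python
import random

def random_l1_chromosome(lo, hi, n_points=7):
    """An L1-level chromosome: sorted point coordinates for one variable."""
    return sorted(random.uniform(lo, hi) for _ in range(n_points))

def decode(l2_chromosome, l1_pools):
    """Assemble a full solution: one coordinate list per input variable."""
    return [l1_pools[var][gene] for var, gene in enumerate(l2_chromosome)]

# Toy example: two variables with ranges [0, 1] and [0, 10], 20 L1 chromosomes each.
l1_pools = [[random_l1_chromosome(0, 1) for _ in range(20)],
            [random_l1_chromosome(0, 10) for _ in range(20)]]
l2 = [random.randrange(20) for _ in l1_pools]   # one index gene per variable
solution = decode(l2, l1_pools)
```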

2.2 Membership Function Evolution by Particle Swarm Optimization

Particle swarm optimization (PSO) [2] can be used to evolve optimal locations of the points on the ISMFs of a feature. PSO is a heuristic global optimization method based on swarm intelligence. Potential solutions, called particles, fly through the problem space by following the current optimum particles. Each particle keeps track of the coordinates in the problem space associated with the best solution (fitness) it has achieved so far; this value is called pbest. Another best value tracked by the particle swarm optimizer is the best value obtained so far by any particle in the neighborhood of a particle. In this research, a particle takes the entire population as its topological neighbors, so the best value is a global best, called gbest. At each time step, the velocity of each particle is changed (accelerated) toward its pbest and the gbest locations, with the acceleration weighted randomly toward each. The content of a PSO particle for generating ISMFs is shown in Fig. 3.

Fig. 3. Particle content

Evolution in PSO is the process of updating the particles' positions. A particle's position is updated as follows:

$$ x_{ij} (t + 1) = x_{ij} (t) + v_{ij} (t + 1) $$

and

$$ v_{ij}(t + 1) = w\,v_{ij}(t) + c_{1} r_{1j}(t)\left[ b_{ij}(t) - x_{ij}(t) \right] + c_{2} r_{2j}(t)\left[ \hat{b}(t) - x_{ij}(t) \right] $$

where \( x_{ij} \) is the position of the i-th particle in the j-th dimension, t denotes a discrete time step (iteration), w is the inertia weight, \( r_{1j} \) and \( r_{2j} \) are random numbers sampled uniformly from [0, 1], \( c_{1} \) and \( c_{2} \) are acceleration constants, \( \hat{b}(t) \) is the best position found so far by any particle (gbest), and \( b_{ij}(t) \) is the best position found so far by the i-th particle (pbest).
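The update equations translate directly into code; the sketch below applies them to a whole swarm at once. The parameter values shown are illustrative defaults, not the ones used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One PSO iteration: x, v, pbest have shape (n_particles, n_dims),
    gbest has shape (n_dims,)."""
    r1 = rng.random(x.shape)          # r_1j(t)
    r2 = rng.random(x.shape)          # r_2j(t)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v_new, v_new           # new positions and velocities
```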

To determine the values of the three important parameters needed by PSO, namely w, \( c_{1} \), and \( c_{2} \), Zhang et al. [16] construct a relationship between the dynamic process of particle swarm optimization and the transition process of a control system. This reduces the three parameters to a single percentage overshoot, which their experiments suggest should fall between 0.6 and 0.8. The percentage overshoot determines the values of w and c, with \( c = c_{1} = c_{2} \). Compared to other parameter setting strategies, this method leads to similar optimization results but faster convergence.

The opposition-based PSO technique [8] is used to initialize the particles so that they cover the search space well. The process can be described as follows (a minimal code sketch is given after the list):

1. Randomly initialize n particles.

2. Calculate opposite particles of first n particles.

3. Evaluate 2n particles from steps 1 and 2, and select the best n particles to be in the swarm of optimization process.
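The sketch below follows these three steps, assuming each dimension is bounded by lo and hi and that the opposite of a coordinate x is lo + hi − x; the fitness function shown is only a toy placeholder.

```python
import numpy as np

rng = np.random.default_rng(1)

def opposition_init(n, lo, hi, fitness):
    """Generate n random particles and their opposites, evaluate all 2n
    candidates, and keep the best n as the initial swarm."""
    x = rng.uniform(lo, hi, size=(n, len(lo)))
    x_opp = lo + hi - x                          # component-wise opposites
    pool = np.vstack([x, x_opp])
    scores = np.array([fitness(p) for p in pool])
    return pool[np.argsort(scores)[-n:]]         # higher fitness is better

# Toy usage: 30 particles in [0, 5]^2, fitness peaks at the point (2, 3).
lo, hi = np.array([0.0, 0.0]), np.array([5.0, 5.0])
swarm = opposition_init(30, lo, hi,
                        lambda p: -np.sum((p - np.array([2.0, 3.0])) ** 2))
```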

3 Proposed Feature Selection Process

We improve the filter-based sequential forward floating search algorithm [12] by employing two criterion functions with different characteristics that complement each other, and by introducing candidate sets that allow a more thorough search for features. Conditional mutual information (CMI) is employed as the first criterion function. It measures the dependency between two variables with respect to a class, conditional on the response of features already picked [4]. CMI selects features that maximize the mutual information with the target class while excluding information already captured by the selected features, thereby reducing redundancy. Instead of examining one feature at a time, the algorithm generates a candidate set of features that are suitable to be added to or removed from the selected subset; using candidate sets makes the search more thorough. The second criterion function then selects the feature to be added to or removed from the selected subset.

Input to the algorithm consists of the original feature set S, the first criterion function \( J_{1} \), and the second criterion function \( J_{2} \). Let D be the total number of original features, \( d_{sel} \) the number of selected features, and \( d_{cand} \) the number of features in a candidate set, where \( d_{cand} \ge 1 \). \( S_{sel} \) is the selected feature subset, \( S_{cand}^{-} \) is the candidate set in the backward step, and \( S_{cand}^{+} \) is the candidate set in the forward step.

In the forward step, unselected features are evaluated by the criterion function \( J_{1} \) and sorted in descending order. A candidate feature set is created as follows:

$$ S_{cand}^{+} = \left\{ x_{n} \;|\; x_{n} \in S\backslash S_{sel},\; n = 1 \ldots d_{cand},\; J_{1}(x_{1}) \ge J_{1}(x_{2}) \ge \cdots \ge J_{1}(x_{n}) \right\} $$

where \( J_{1} = \min I(Y; X_{n} | S_{sel}) \) and \( X_{n} \in S\backslash S_{sel} \).

As mentioned earlier, CMI is used as \( J_{1} \), and it can be calculated as follows:

$$ I(Y; X_{n} | X_{m}) = H(Y, X_{m}) - H(X_{m}) - H(Y, X_{n}, X_{m}) + H(X_{n}, X_{m}) $$

where \( I(Y; X_{n} | X_{m}) \) is the conditional mutual information between Y and \( X_{n} \) given \( X_{m} \), and H is an entropy function. For more information on how to compute CMI, see [4].
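For discrete (e.g., fuzzified) variables, the CMI formula above can be computed directly from empirical joint entropies, as in the following sketch (ours, for illustration only).

```python
import numpy as np
from collections import Counter

def entropy(*columns):
    """Empirical joint entropy H(X1, ..., Xk) of discrete variables, in bits."""
    counts = Counter(zip(*columns))
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def cmi(y, xn, xm):
    """I(Y; Xn | Xm) = H(Y,Xm) - H(Xm) - H(Y,Xn,Xm) + H(Xn,Xm)."""
    return entropy(y, xm) - entropy(xm) - entropy(y, xn, xm) + entropy(xn, xm)

# Toy example with three discrete variables.
y  = [0, 0, 1, 1, 1, 0]
x1 = [1, 1, 0, 0, 1, 1]
x2 = [0, 1, 0, 1, 0, 1]
print(cmi(y, x1, x2))
```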

The feature selected is the one that, when combined with the previously selected subset of size k, gives the best subset according to \( J_{2} \), forming a new selected subset of size k + 1. The algorithm then compares the new subset with the previously best subset of size k + 1 and retains the better one.

In the backward step, the feature to be removed should be the one that provides the least information about the target classes and whose information has already been captured by the other selected features. Therefore, \( J_{1} \) in the backward step is calculated as follows:

$$ J_{1} = \max I(Y; X_{n} | S_{sel}\backslash X_{n}) \quad \text{where } X_{n} \in S_{sel} $$

Selected features are evaluated by \( J_{1} \) and sorted in ascending order. A candidate set is generated as follows:

$$ S_{cand}^{-} = \left\{ x_{n} \;|\; x_{n} \in S_{sel},\; n = 1 \ldots d_{cand},\; J_{1}(x_{1}) \le J_{1}(x_{2}) \le \cdots \le J_{1}(x_{n}) \right\} $$

The feature to be removed is the one whose removal from the selected subset yields the best subset of k features according to \( J_{2} \). The algorithm compares the new subset with the previously best subset of size k and retains the better one. The exclusion step continues toward smaller subsets as long as the new subset is better; otherwise the algorithm returns to the inclusion step. The algorithm terminates when the selected subset size reaches \( d_{sel} + \Delta \).
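The overall control flow of the modified floating search can be sketched as follows. This is our simplified reading of the procedure, not the authors' implementation; the exact bookkeeping of candidate sets, tie handling, and termination may differ. Here J1_forward and J1_backward rank individual features as defined above, and J2 scores whole subsets.

```python
def select_features(S, J1_forward, J1_backward, J2, d_sel, d_cand, delta):
    """Filter-based floating search with two criterion functions (a sketch)."""
    S_sel, best = [], {}                     # best[k] = best subset of size k seen

    def remember(subset):
        k = len(subset)
        if k not in best or J2(subset) > J2(best[k]):
            best[k] = list(subset)

    while len(S_sel) < d_sel + delta:
        # Forward step: candidate set = d_cand best unselected features by J1;
        # J2 picks the one whose addition gives the best (k+1)-subset.
        unsel = [x for x in S if x not in S_sel]
        cand = sorted(unsel, key=lambda x: J1_forward(x, S_sel), reverse=True)[:d_cand]
        S_sel.append(max(cand, key=lambda x: J2(S_sel + [x])))
        remember(S_sel)
        S_sel = list(best[len(S_sel)])       # keep the better subset of this size

        # Backward step: candidate set = d_cand least useful selected features;
        # keep removing while the smaller subset improves on the best subset of
        # that size, otherwise return to the forward step.
        while len(S_sel) > 2:
            cand = sorted(S_sel, key=lambda x: J1_backward(x, S_sel))[:d_cand]
            x_rem = max(cand, key=lambda x: J2([f for f in S_sel if f != x]))
            smaller = [f for f in S_sel if f != x_rem]
            if len(smaller) not in best or J2(smaller) > J2(best[len(smaller)]):
                best[len(smaller)] = smaller
                S_sel = smaller
            else:
                break
    return best
```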

3.1 The Second Criterion Function

As part of the feature selection process, the role of the second criterion function (\( J_{2} \)) is to select a feature subset that maximizes inter-class distances and minimizes intra-class distances. Three effective measures are studied as candidates for \( J_{2} \).

Mutual Information (MI).

MI can be calculated as follows:

$$ I(Y;X_{n} ) = H(Y) + H(X_{n} ) - H(Y,X_{n} ) $$

where H is an entropy function, Y is a class attribute, and \( X_{n} \) is the feature to be selected.
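Using the same empirical entropy helper as in the CMI sketch above, MI for discrete variables can be computed as:

```python
def mutual_information(y, xn):
    """I(Y; Xn) = H(Y) + H(Xn) - H(Y, Xn), reusing the entropy() helper above."""
    return entropy(y) + entropy(xn) - entropy(y, xn)
```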

Jeffreys-Matusita Distance Bound to the Bayes Error (JMBH).

JMBH can be calculated as follows:

$$ J_{bh} = \sum_{i = 1}^{c} \sum_{j = 1}^{c} \sqrt{P(\omega_{i}) P(\omega_{j})}\, J_{ij}^{2} $$
$$ J_{ij} = \left[ 2\left(1 - e^{-B_{ij}}\right) \right]^{1/2} $$
$$ B_{ij} = \frac{1}{8}(m_{i} - m_{j})^{T} \left( \frac{\varSigma_{i} + \varSigma_{j}}{2} \right)^{-1} (m_{i} - m_{j}) + \frac{1}{2}\log \left[ \frac{\left| \frac{\varSigma_{i} + \varSigma_{j}}{2} \right|}{\sqrt{|\varSigma_{i}|\,|\varSigma_{j}|}} \right] $$

where \( m_{i}, m_{j} \) and \( \varSigma_{i}, \varSigma_{j} \) are the mean vectors and covariance matrices of classes \( \omega_{i} \) and \( \omega_{j} \), respectively, and \( P(\omega_{i}) \) is the prior probability of class \( \omega_{i} \).
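A direct transcription of these formulas might look like the following sketch, assuming the per-class means, covariance matrices, and prior probabilities have already been estimated from the data.

```python
import numpy as np

def bhattacharyya(m_i, m_j, S_i, S_j):
    """Bhattacharyya distance B_ij between two Gaussian class models."""
    S = (S_i + S_j) / 2.0
    diff = m_i - m_j
    term1 = diff @ np.linalg.solve(S, diff) / 8.0
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S_i) * np.linalg.det(S_j)))
    return term1 + term2

def jmbh(means, covs, priors):
    """J_bh: sum of sqrt(P_i * P_j) * J_ij^2 over all class pairs."""
    total = 0.0
    for i in range(len(means)):
        for j in range(len(means)):
            B = bhattacharyya(means[i], means[j], covs[i], covs[j])
            J_ij = np.sqrt(2.0 * (1.0 - np.exp(-B)))
            total += np.sqrt(priors[i] * priors[j]) * J_ij ** 2
    return total
```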

Mahalanobis Distance (MAHA).

MAHA can be calculated as follows:

$$ D_{M} (x) = \sqrt {(x - \mu )^{T} S^{ - 1} (x - \mu )} $$

where \( \mu \) is the mean vector, and S is the covariance matrix for a group.
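The corresponding computation is short with NumPy; the sketch below assumes x is a single sample and the group statistics are given.

```python
import numpy as np

def mahalanobis(x, mu, S):
    """Mahalanobis distance of sample x from a group with mean mu, covariance S."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

# Toy usage with a two-dimensional group.
print(mahalanobis([1.0, 1.0],
                  mu=[0.0, 0.0],
                  S=np.array([[2.0, 0.3], [0.3, 1.0]])))
```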

3.2 Classification Model

Classification and Regression Trees (CART) were introduced by Breiman et al. [1]. CART is based on the fundamental idea that each split should be selected so that the data in each descendant subset are purer than the data in the parent node. The node impurity is largest when all classes are equally mixed and smallest when the node contains only one class. CART produces binary splits and therefore binary trees, and uses the Gini impurity index as the attribute selection measure to build a decision tree. Let p(j|t) be the proportion of data in node t that belongs to the j-th class. The impurity of node t is given by \( i(t) = 1 - \sum\nolimits_{j} p^{2}(j|t) \). The decrease in impurity for a split is \( \Delta i(\delta, t) = i(t) - p_{L}\, i(t_{L}) - p_{R}\, i(t_{R}) \), where a splitting rule δ splits node t into two child nodes \( t_{L} \) and \( t_{R} \) receiving fractions \( p_{L} \) and \( p_{R} \) of the data. The split with the largest decrease in impurity is chosen for that particular node.
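The Gini impurity and the impurity decrease of a split can be computed as in the following sketch (not CART itself, just the two quantities defined above).

```python
import numpy as np

def gini(labels):
    """Gini impurity i(t) = 1 - sum_j p(j|t)^2 of the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def impurity_decrease(parent, left, right):
    """Delta i = i(t) - p_L * i(t_L) - p_R * i(t_R) for a candidate split."""
    p_l, p_r = len(left) / len(parent), len(right) / len(parent)
    return gini(parent) - p_l * gini(left) - p_r * gini(right)

# Toy example: a split that perfectly separates two equally frequent classes.
parent = np.array([0, 0, 0, 1, 1, 1])
print(impurity_decrease(parent, parent[:3], parent[3:]))   # 0.5
```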

4 Experimental Evaluation

The experiments use standard data sets from the UCI machine learning repository. For any data set without a separate test set, 10-fold cross validation is employed to measure performance. The stopping criterion for the genetic algorithm and PSO is set at 100 iterations.

4.1 Effectiveness of Feature Fuzzification

In this experiment, we study the effectiveness of the fuzzification process, using either the genetic algorithm or particle swarm optimization, against not using fuzzification at all, on different classification problems. An initial study showed that a swarm size of 30 particles gives the highest accuracy, so it is used in all experiments. The results are shown in Table 1. In almost all configurations, using fuzzification yields higher accuracy than not using it, so the fuzzification is useful. In addition, the configuration that gives the best results is fuzzification using GA with JMBH as the \( J_{2} \) function (referred to as Fuzzified GA+JMBH).

Table 1. Classification accuracy of applying and not applying feature fuzzification by genetic algorithm and particle swarm optimization, with three different \( J_{2} \) criterion functions

4.2 Performance of the Proposed Technique

Since fuzzification and JMBH are beneficial to the performance of the proposed feature selection technique, in this section we focus on the feature reduction abilities of fuzzification by genetic algorithm and particle swarm optimization. The results in Table 2 show that although GA yields higher accuracies in general, PSO tends to give better feature reduction rates.

Table 2. Accuracies and feature reduction abilities of fuzzification by GA and PSO

4.3 Comparisons with Other Research

Lastly, the proposed method (Fuzzified GA+JMBH) is compared against three recent studies on fuzzy-based feature selection, namely Jalali et al. [9], Vieira et al. [14], and Li and Wu [10], using the performance numbers reported in each paper. The results (Table 3) show that the proposed method outperforms [9] and [10] on all common data sets. Compared with [14], the proposed method gives higher accuracy on 3 out of 4 data sets. Thus, the proposed technique performs very well across different data sets and in comparison with other techniques.

Table 3. Results of (Fuzzified GA + JMBH) compared to other previous fuzzy-based research. Feature reduction percentages relative to the original feature sets are shown in parentheses.

5 Conclusion

As data sets grow in size and complexity, a feature selection technique is needed to select a small subset of highly predictive features from the entire set of features. Such a technique is expected to reduce noise and distraction, thus improving both the effectiveness and the efficiency of machine learning. This paper presents a new filter-based technique to select a minimal set of features for classification problems. The proposed technique employs fuzzification of the original features using irregular-shaped membership functions evolved by genetic algorithm and particle swarm optimization, together with a filter-based feature selection using two criterion functions, where the first function eliminates features with redundant effects and the second selects a feature subset that maximizes inter-class distances and minimizes intra-class distances. The technique is evaluated on standard UCI data sets and compared to recent fuzzy-based feature selection research. The results show that feature selection improves classification accuracy; that the use of evolutionary feature fuzzification and two criterion functions enhances the performance of feature selection; and that the best configuration uses the Jeffreys-Matusita Distance Bound to the Bayes Error as the second criterion function and a genetic algorithm to evolve the irregular-shaped fuzzy membership functions. In addition, the proposed technique performs well in comparison to previous research on common data sets.