
1 Introduction

In the present technology-oriented world, a huge amount of data is generated, intentionally or unintentionally, and stored with certain characteristics, mostly referred to as features. Not all features in a dataset are important, due to dependency, which implies that features depend on each other, and redundancy, which indicates unwanted and less important features. The process of selecting the most optimal subset of features is commonly termed feature selection (FS) and is often considered one of the most critical and challenging tasks in machine learning. The reduced feature set lowers the computational complexity and further helps in additional analysis. FS reduces the feature space from a very large space of \(n\) features to a smaller space of \(d\) features, where \(d < n\). Mathematically, a feature selection problem can be defined as follows: suppose a dataset D contains n features, and the objective is to select the optimal subset of relevant features from D. D is represented as follows:

$$D = \left\{ {f_{1}, f_{2}, f_{3}, \ldots, f_{n} } \right\}$$
(1)

where \(f_1, f_2, f_3, \ldots, f_n\) represent the features of the dataset. In FS, we extract a subset \(S = \left\{ {f_1, f_2, f_3, \ldots, f_d} \right\}\), where \(d < n\).

In general, FS methods are classified into three categories, namely filter, wrapper, and embedded (hybrid) approaches. Filter approaches use general characteristics of the data to select features and are independent of learning algorithms. Wrapper methods always include a learning algorithm, and features are selected according to its performance (increase or decrease). Wrapper methods have a high computational cost but provide more accurate results; filter approaches, on the other hand, have a low computational cost but are less reliable. Lastly, embedded methods combine filter and wrapper approaches, where FS is part of the training process carried out with a learning algorithm.

According to set theory, if a dataset has n features, then \({2}^{n}\) subsets are possible, and the task is to pick the best subset with which a machine learning model achieves the best accuracy. Selecting the best subset would require evaluating the performance of every subset with the model, which becomes impractical as n grows large. Various existing heuristics are only partially useful, as they suffer from premature convergence or exponential (or otherwise high) computational complexity. To address this issue, evolutionary or multi-objective optimization (MOO) based FS approaches have been extensively used to obtain the optimal subset of features while preserving the accuracy of the model [1, 5]. These approaches tend to be efficient, effective, and reliable. In practice, FS based on evolutionary or MOO approaches falls under the wrapper category. A small illustration of why exhaustive evaluation is infeasible is sketched below.
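The following minimal Python sketch (with a hypothetical feature count of 30) simply counts the candidate subsets to illustrate the combinatorial growth:

```python
from itertools import combinations

n = 30  # hypothetical number of features
# Every feature can be either kept or dropped, so there are 2^n subsets
# (2^n - 1 non-empty ones); each would require a full model evaluation.
print(2 ** n - 1)  # 1,073,741,823 candidate subsets for only 30 features

# Even restricting the search to subsets of exactly 3 features is sizeable:
print(sum(1 for _ in combinations(range(n), 3)))  # 4,060 subsets
```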

Mathematically, a multi-objective optimization problem (MOOP) can be formulated as follows:

$$\begin{gathered} \text{minimize}\;F\left( x \right) = \left[ {f_{1} \left( x \right),f_{2} \left( x \right), \ldots ,f_{m} \left( x \right)} \right]^{T} \hfill \\ \text{s.t.}\; x \in D, \hfill \\ \end{gathered}$$
(2)

where \(D\) is the decision space and \(x\in D\) is a decision variable. F(x) consists of \(m\) objective functions \(f_i : D \to O,\ i = 1,\ldots,m\), where \(O\) is the objective space. The objectives often trade off with each other: improvement in one objective may lead to degradation of another. A decision maker (DM) with expert domain knowledge implicitly chooses a solution that optimizes all objectives simultaneously as far as possible. The best trade-off solutions are called the Pareto optimal solutions. In this paper, the feature selection problem is formulated as a multi-objective optimization problem where a subset of features is selected by simultaneously optimizing more than one objective. Since there are multiple objectives, we propose to use the evolutionary algorithm (EA) Non-dominated Sorting Genetic Algorithm II (NSGA-II). Although many EAs are available in the literature and the selection of the "best" algorithm depends on the characteristics of each problem (no free lunch), NSGA-II is used in this paper due to its scale-up capability [17]. The two conflicting objectives considered in this paper are: i) the number of selected features, and ii) the classification accuracy. The main goal of this paper is to propose an efficient evolutionary multi-objective feature selection approach. As obtaining the maximum accuracy with the minimum number of features is the prime objective, we select the optimal features in two stages. In the first stage, subsets of features are randomly selected by the initialization step of NSGA-II. In the second stage, a mutual correlation coefficient technique is used to obtain an important and relevant subset of features.
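For concreteness, the following is a minimal sketch of the Pareto dominance test that underlies the notion of Pareto optimal solutions. All values are hypothetical, and both objectives are written as minimization, so accuracy would enter as an error term such as 1 − accuracy:

```python
def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b` (both minimized):
    a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Hypothetical solutions given as (number of selected features, classification error):
print(dominates((5, 0.08), (9, 0.10)))  # True: fewer features and lower error
print(dominates((5, 0.12), (9, 0.10)))  # False: the two solutions trade off
print(dominates((9, 0.10), (5, 0.12)))  # False: neither dominates the other
```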

This approach is a hybrid filter-wrapper evolutionary approach in which the mutual correlation coefficient technique is incorporated inside the considered EA. Using the mutual correlation coefficient technique, we calculate the pairwise correlation for all features and then compute the average correlation of each feature from the pairwise correlation matrix. We use a threshold value and, to reduce irrelevant and redundant features, we introduce a percentile concept that makes the correlation coefficient technique novel. Under this concept, we remove only a percentage of the features whose average correlation coefficient values are above the threshold value. Collectively, the major contributions of this paper are summarized as follows:

  1) A novel correlation coefficient filter method is proposed and incorporated into NSGA-II to obtain the optimal subset of features.

  2) This paper introduces a novel way to remove only redundant, highly correlated features based on a threshold value in the filter approach. A novel randomized parameter, in the form of a percentile value, decides the number of features to be removed based on the size of the dataset.

  3) The proposed approach makes it easy to handle highly correlated datasets.

The remainder of this paper is organized as follows: Sect. 2 reviews the related existing work in this field. Section 3 details the proposed feature selection approach together with the novel correlation coefficient technique. In Sect. 4, we present experimental results on different datasets and compare the proposed approach with traditional NSGA-II. In Sect. 5, we summarize the article and outline future research directions.

2 Literature Review

MOO approaches have attracted wide attention for solving FS problems in many applications, such as biomedical problems [2], text mining [3], and image analysis [4], where more than a thousand features may be present. Various feature selection techniques based on evolutionary algorithms (EAs) have been proposed in the literature; a survey of evolutionary feature selection techniques can be found in Xue et al. [5]. Binary genetic algorithms (GAs) are popular EAs for feature selection. They use an N-dimensional binary vector, where N is the number of features in the dataset and "1" and "0" indicate whether the corresponding feature is selected or not [6]. For more than 100 features, exhaustive search techniques for feature selection such as SFS (Sequential Forward Selection), SBS (Sequential Backward Selection), SFFS (Sequential Floating Forward Selection), and SFBS (Sequential Floating Backward Selection) become computationally infeasible. To address the problem of selecting feature subsets in large datasets, evolutionary computing (EC) algorithms have drawn the attention of researchers: they use a non-exhaustive search procedure that is computationally expensive but not computationally infeasible. To solve FS problems effectively, metaheuristic algorithms are coupled with wrapper methods that search the lower-dimensional dataset space by iteratively calling the learning algorithm [7]. Commonly used metaheuristic algorithms for single-objective FS problems are GA [7, 17], Genetic Programming (GP) [8], Particle Swarm Optimization (PSO) [9], Differential Evolution (DE) [4], and Ant Colony Optimization (ACO) [10]. Multi-objective EAs such as NSGA-II [11, 20], the multi-objective evolutionary algorithm based on decomposition (MOEA/D) [12], and multi-objective particle swarm optimization (MOPSO) [13] are mostly used for multi-objective FS problems. Hu et al. proposed a fuzzy multi-objective FS method with particle swarm optimization, called PSOMOFS, where a fuzzy dominance relation is developed to compare the goodness of candidate particles and the global leaders of particles are determined by a fuzzy crowding distance measure [14]. Xue et al. proposed PSO-based multi-objective FS algorithms where the feature selection problem is addressed by introducing non-dominated sorting, crowding distance, mutation, and dominance into PSO [15]. Chen et al. proposed an efficient ACO for image feature selection [16]: a graph is constructed in which each feature is a node, and a feature (node) is selected if an ant has visited that node. Hancer et al. [22] developed the first multi-objective artificial bee colony (MOABC) framework for feature selection in classification, where a new fuzzy mutual information-based criterion is proposed to evaluate the relevance of feature subsets. Khushaba et al. [18] proposed a novel feature selection algorithm combining DE with ACO, where DE searches for the optimal feature subset based on the solutions obtained by ACO.

3 Multi-objective Optimization Based Feature Selection (MOOFS)

Finding the optimal feature subset while maintaining the classification accuracy for both low- and high-dimensional datasets is the goal of our proposed approach. For this purpose, we propose a novel Multi-Objective Optimization based Feature Selection (MOOFS) method. A candidate solution in MOOFS is encoded as a vector of n bits, where each bit takes the value 1 or 0 and the number of selected features equals the number of 1s in the vector. Each vector is an individual in the population. For example, a binary solution vector X for a dataset with n features is represented as:

$$X = \left( {x_{1} ,x_{2} ,x_{3} ,x_{4} \ldots ,x_{n} } \right),x_{j} \in \left\{ {0, 1} \right\}$$
(3)

Then, the selected feature subset of X is:

$$X = \left( {x_{1} ,x_{2} ,x_{3} ,x_{4} \ldots ,x_{d} } \right)$$
(4)

where \(d < n\) and \(x_{j} = 1\) indicates that the corresponding jth feature is selected.
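As an illustration of this encoding, the following small sketch (with placeholder data) extracts the selected columns from a binary chromosome:

```python
import numpy as np

X = np.array([1, 0, 1, 1, 0, 0])        # Eq. (3): binary chromosome, x_j in {0, 1}
data = np.random.rand(10, 6)             # placeholder dataset: 10 samples, n = 6 features

selected = np.flatnonzero(X == 1)        # positions holding a 1 mark the selected features
reduced = data[:, selected]              # reduced dataset with d = 3 features, d < n
print(selected, reduced.shape)           # [0 2 3] (10, 3)
```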

In our proposed MOOFS approach, we use the binary NSGA-II algorithm with a mutual correlation coefficient technique: for every feature, we calculate the average absolute mutual correlation of that feature over the remaining k (= n − 1) features, where n is the total number of features. Since the NSGA-II algorithm randomly selects the initial population and offspring are generated using crossover and mutation, there is a high possibility of selecting unwanted and noisy features in each generation. To select relevant features efficiently, after NSGA-II generates the population of feature sets, we apply the mutual correlation coefficient technique (MCCT) before evaluating the selected features. MCCT removes redundant and less important features and yields the best subset of features. These feature subsets are then evaluated by the classification model to obtain the accuracy. To clearly explain the proposed MOOFS, we introduce a novel algorithm (Algorithm 1) and then describe the step-by-step procedure.

[Algorithm 1: MOOFS pseudocode]

In Step 1, a population of M individuals is randomly initialized in the binary space \(\left\{0, 1\right\}\). The length of each individual is determined by the number of features in the dataset. The core process of the MOOFS algorithm starts from Step 2, which is iterated until the stopping criterion is met or the maximum number of iterations is reached. In Step 2(a), the subset of features is identified by the positions of 1s in the chromosome. Because individuals are randomly generated, noisy and unwanted features may be included that can degrade the accuracy of the classifier; therefore, in Step 2(b), the mutual correlation coefficient technique (Algorithm 2) is applied to obtain a reduced subset of features. In Step 2(c), a k-nearest neighbors (KNN) classifier is used to evaluate the reduced subset of features with l-fold cross-validation (l = 10). KNN is one of the most widely used classifiers in evolutionary feature selection algorithms. The number of selected features and the accuracy obtained with the KNN classifier are the two objectives of the MOOFS algorithm, and the optimal features are selected by optimizing both simultaneously. The objective functions are described as follows:

  i) The number of selected features: the objective is to minimize the number of selected features:

    $$\min F_{1} \left( X \right) = \left| X \right|$$
    (5)

    where \(\left|X\right|\) denotes the cardinality of the selected subset of features.

  ii) Accuracy: maximizing the accuracy of the classification model represents higher classification performance. In this paper, the accuracy is calculated by the KNN classifier with the k-fold cross-validation method (k = 10), and the objective function is defined as follows:

    $$\max F_{2} \left( X \right) = \left( {\frac{1}{k}\sum\nolimits_{i = 1}^{k} {\frac{{N_{Cor} }}{{N_{All} }}} } \right) \times 100$$
    (6)

where \({N}_{Cor}\) denotes the number of correctly classified test samples and \({N}_{All}\) is the total number of test samples. A sketch of how these two objectives can be evaluated is given below.
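The following minimal sketch evaluates the two objectives of Eqs. (5) and (6) with scikit-learn; the number of KNN neighbours and the helper name `evaluate_subset` are our assumptions and are not fixed by the paper:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def evaluate_subset(chromosome, data, labels, n_splits=10):
    """Return (F1, F2) = (number of selected features, cross-validated accuracy in %)."""
    idx = np.flatnonzero(chromosome)
    if idx.size == 0:                          # guard: an empty subset cannot be classified
        return 0, 0.0
    knn = KNeighborsClassifier(n_neighbors=5)  # assumed neighbourhood size
    scores = cross_val_score(knn, data[:, idx], labels, cv=n_splits)  # 10-fold CV as in Eq. (6)
    return idx.size, scores.mean() * 100       # F1 = |X| (Eq. (5)), F2 = accuracy (Eq. (6))
```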

In Step 2(d), non-dominated sorting is applied to the population and the crowding distance is calculated; these are the key processes of the NSGA-II algorithm. In Steps 2(e, f), selection, crossover, and mutation are performed, and the output of crossover and mutation is used to generate the population for the next generation. Step 2 is repeated until the maximum number of iterations/generations is reached. Finally, in Step 3, the non-dominated Pareto optimal front (POF) is returned, where each point of the POF records the number of selected features and the accuracy.
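This loop can be reproduced, for example, with the pymoo library. The following is a rough sketch under the assumptions of pymoo ≥ 0.6 and the `evaluate_subset` helper from the previous sketch; it is not the authors' implementation:

```python
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.operators.sampling.rnd import BinaryRandomSampling
from pymoo.operators.crossover.pntx import TwoPointCrossover
from pymoo.operators.mutation.bitflip import BitflipMutation
from pymoo.optimize import minimize

class FeatureSelectionProblem(ElementwiseProblem):
    def __init__(self, data, labels):
        super().__init__(n_var=data.shape[1], n_obj=2, xl=0, xu=1)
        self.data, self.labels = data, labels

    def _evaluate(self, x, out, *args, **kwargs):
        mask = x.astype(bool)
        # In the full MOOFS approach, Algorithm 2 (the percentile correlation
        # coefficient filter) would further reduce `mask` before evaluation.
        n_feat, acc = evaluate_subset(mask, self.data, self.labels)
        out["F"] = [n_feat, -acc]        # pymoo minimizes, so accuracy is negated

algorithm = NSGA2(pop_size=50,                     # population size used in the paper
                  sampling=BinaryRandomSampling(),
                  crossover=TwoPointCrossover(),
                  mutation=BitflipMutation(),
                  eliminate_duplicates=True)
# res = minimize(FeatureSelectionProblem(data, labels), algorithm, ("n_gen", 500))
```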

To improve the performance of the classifier, important, linearly independent features need to be selected by removing noisy and unwanted features. This task is carried out by Algorithm 2 (Percentile_correlation_coefficient), which returns the reduced subset of features after removing constant and highly correlated features. The algorithm uses the basic correlation coefficient method with some conditional steps. The correlation coefficient measures the linear dependence between two features: two features are uncorrelated when their correlation coefficient is 0 and linearly dependent when it is +1 (positively correlated) or −1 (negatively correlated). Generally, the approach of Algorithm 2 may be referred to as a filter approach.
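A tiny numpy illustration of these three cases (the data are arbitrary):

```python
import numpy as np

x = np.arange(10, dtype=float)
y_pos = 2 * x + 1                                   # linearly dependent on x -> r = +1
y_neg = -3 * x + 5                                  # linearly dependent on x -> r = -1
y_rnd = np.random.default_rng(0).normal(size=10)    # unrelated to x -> r typically near 0

print(np.corrcoef(x, y_pos)[0, 1],   # 1.0
      np.corrcoef(x, y_neg)[0, 1],   # -1.0
      np.corrcoef(x, y_rnd)[0, 1])   # small value close to 0
```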

[Algorithm 2: Percentile_correlation_coefficient pseudocode]

Algorithm 2 shows the pseudocode of the percentile correlation coefficient algorithm. In Step 1, we initialize the percentile value p (p = 0.3), the number of features n of the input feature subset, and the threshold value (threshold = 0.9). Then, we compute the variable w from the values of p and n; w decides how many times Step 3 is iterated to remove features subject to the threshold condition. In Step 3(a), we define a variable k = n − 1, and in Step 3(b) the correlation coefficient matrix of the dataset D is computed according to Eq. (7):

$$r_{x,y} = \frac{\sum\nolimits_{i = 1}^{n} \left( {x_{i} - \overline{x}} \right)\left( {y_{i} - \overline{y}} \right)}{\sqrt{\sum\nolimits_{i = 1}^{n} \left( {x_{i} - \overline{x}} \right)^{2} \sum\nolimits_{i = 1}^{n} \left( {y_{i} - \overline{y}} \right)^{2}}}$$
(7)

Then, Step 3(c) computes the average absolute mutual correlation of each feature with the remaining (n − 1) features according to Eq. (8), which is given as follows:

$$r_{i} = \frac{1}{k}\sum\nolimits_{j = 1,\,j \ne i}^{n} {\left| {r_{{x_{i} ,x_{j} }} } \right|} ,\quad k = n - 1$$
(8)

where i indexes the feature under consideration, k = n − 1 is the number of remaining features, and j runs over all features except the ith.
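A minimal numpy sketch of Eqs. (7) and (8), computing the average absolute mutual correlation of every feature (the function name is ours):

```python
import numpy as np

def avg_abs_mutual_correlation(D):
    """Average absolute mutual correlation of each feature with the other n-1 features (Eq. (8))."""
    R = np.abs(np.corrcoef(D, rowvar=False))  # |r_{x_i, x_j}| for all feature pairs (Eq. (7))
    np.fill_diagonal(R, 0.0)                  # drop the j = i term from the sum
    return R.sum(axis=1) / (D.shape[1] - 1)   # divide by k = n - 1
```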

Fig. 1. Flowchart for multi-objective optimization based feature selection

Step 3(d) finds the largest average mutual correlation value and assigns it to the variable s. In each iteration, the feature with the largest average mutual correlation is removed if s is greater than the threshold value (Step 3(e)). Step 3 is repeated until the counter t reaches w; since Step 3 iterates at most w times, at most w features (and possibly none) are removed, subject to the condition in Step 3(e). The process continues until the stopping criterion is met and the best reduced subset of features is returned (Step 4). With this algorithm, we can handle highly correlated datasets in which the average absolute mutual correlation of every feature exceeds the threshold, which is not possible with the plain correlation coefficient technique. Figure 1 depicts the flowchart of the proposed Algorithms 1 and 2. A sketch of Algorithm 2 is given below.
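The following is a rough Python sketch of Algorithm 2 under our reading of the paper. In particular, the formula w = round(p · n) is our assumption, since the paper only states that w is derived from p and n; the sketch reuses the `avg_abs_mutual_correlation` helper from the previous sketch:

```python
import numpy as np

def percentile_correlation_filter(D, p=0.3, threshold=0.9):
    """Sketch of Algorithm 2 (Percentile_correlation_coefficient).

    Repeatedly drops the feature whose average absolute mutual correlation is
    largest, but only while that value exceeds `threshold`, and at most w times.
    """
    keep = list(range(D.shape[1]))          # indices of the currently retained features
    w = int(round(p * len(keep)))           # assumed: number of removal iterations from p and n
    for _ in range(w):
        avg_corr = avg_abs_mutual_correlation(D[:, keep])
        s = np.nanmax(avg_corr)             # Step 3(d): largest average mutual correlation
        if s <= threshold:                  # Step 3(e): stop removing once below the threshold
            break
        keep.pop(int(np.nanargmax(avg_corr)))   # remove the most redundant feature
    return keep                             # reduced subset of feature indices
```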

4 Experimental Results

The proposed MOOFS uses NSGA-II for feature selection. To validate the efficiency of the proposed approach, we have used several datasets of various dimensions and compared MOOFS with the traditional NSGA-II approach. This section is divided into two subsections: the first discusses the datasets used and the parameter settings of the MOO approaches, and the second presents the experimental results.

  i. Datasets and parameter settings

    The details of the datasets used in this paper are given in Table 1; they are taken from the UCI repository [19]. The number of features varies from 33 to 241, and the number of instances ranges from 351 to 4464. The maximum number of iterations in NSGA-II is 500 and the population size is 50. In the percentile_correlation_coefficient method, the threshold parameter is experimented with at 0.8 and 0.9 [21, 23].

Table 1. Datasets and their dimensions
  ii. Experimental results and discussion

    Our proposed evolutionary feature selection algorithm (MOOFS) is compared with the standard NSGA-II algorithm to validate its efficiency. The first set of experimental results is reported in Tables 2 and 3, which give the best accuracy value together with the corresponding number of features and thereby compare NSGA-II with the proposed MOOFS. Table 2 shows the accuracy and number of selected features when the threshold value is 0.8, whereas Table 3 shows the values for a threshold of 0.9.

Table 2. Best accuracy of proposed MOOFS (with 0.8 threshold) and NSGA-II

From Table 3, it is evident that the best values are obtained when the threshold value is chosen as 0.9. For the Ionosphere dataset, MOOFS achieves a higher accuracy of 94.865 with only six features, whereas NSGA-II returned nine features with a lower accuracy of 94.286. For the Sonar and HillValley datasets, the best accuracy is again achieved by MOOFS, namely 76.286 and 65.97 with seven and ten features, respectively; NSGA-II achieved lower accuracy with the same number of features.

Table 3. Best accuracy of proposed MOOFS (with 0.9 threshold) and NSGA-II
Fig. 2. POF of accuracy vs. number of features for the datasets: (a) Ionosphere, (b) Sonar, (c) Hill Valley, (d) Musk1, (e) Taundromd (blue: proposed MOOFS; red: NSGA-II)

A significant improvement is achieved for the Musk1 and Taundromd datasets, where the accuracy is markedly higher with a smaller number of features compared to the traditional NSGA-II approach; the improvement in accuracy is accompanied by a reduction of around 18% (Musk1) and 15% (Taundromd) in the number of features. The marginal improvement in accuracy for the HillValley dataset is due to the high correlation among its features. It is clearly observed that the proposed MOOFS algorithm shows better results compared to standard NSGA-II. Figure 2 shows the POFs for all the datasets. Figure 3(a-e) shows the accuracy convergence graphs of the five datasets, where the best accuracy of every iteration is plotted.

Fig. 3. Accuracy convergence graphs of the datasets: (a) Ionosphere, (b) Sonar, (c) Hill Valley, (d) Musk1, (e) Taundromd (blue: proposed MOOFS; red: NSGA-II)

The convergence graphs are obtained from five independent runs of 500 iterations of MOOFS and standard NSGA-II. Except for the HillValley and Musk1 datasets, MOOFS yields a higher accuracy convergence curve, and it also shows stable results after around 250 iterations. We have further compared the proposed MOOFS with other well-known machine learning classifiers using the basic correlation coefficient with threshold 0.9; the results are compiled in Table 4. Only one feature is selected for the Hill Valley dataset because this dataset is highly correlated. It is observed that, except for the Hill Valley dataset, the remaining datasets select a larger number of features: 33 for Ionosphere, 57 for Sonar, 106 for Musk1, and 154 for Tuandromd. Moreover, most of the classifiers achieved lower accuracy than our proposed approach. This trade-off of selecting an optimal feature subset with higher accuracy is efficiently achieved by the proposed MOOFS approach.

Table 4. Accuracy of different classifiers

5 Conclusion and Future Work

This paper introduces a novel feature selection approach formulated as a multi-objective optimization problem in which a subset of features is selected by simultaneously optimizing more than one objective. Due to its scale-up capability, NSGA-II is utilized, with the number of selected features and the classification accuracy as the two objectives. Several datasets of various dimensions are considered for the experimentation. The proposed approach, named Multi-Objective Optimization based Feature Selection (MOOFS), is an efficient evolutionary multi-objective feature selection approach with a correlation coefficient filter method. The experiments are conducted on different datasets, and the results show a significant improvement in accuracy together with a considerable reduction in the number of features. Overall, our approach can remove irrelevant and redundant features, which helps reduce training time as well as improve the performance of ML algorithms. One limitation of the proposed approach is the need to define the threshold value for reducing the number of features, which can be addressed in future work. Further, the capability of the proposed approach on big datasets shall be investigated. This work can also be extended to self-adapting parameter values, where the correlation structure of the dataset, which is dynamic in nature, can be identified automatically.