1 Introduction

Data analytics researchers need relevant, high-quality data drawn from ever-growing data stores. Feature selection reduces the dimensionality of a dataset by removing redundant, irrelevant, or noisy features, improving classification accuracy while minimizing the amount of data to be processed. As the number of records and features in data repositories grows, maintaining records for every individual becomes increasingly difficult. In recent times, data mining techniques have been used to discover novel and useful patterns from historical data, yet many research questions remain open. Classification frameworks generally give effective results when classifying such datasets. Large, high-dimensional datasets typically contain complex, error-prone information, and in such situations the classification algorithm plays a vital role.

The feature subset selection process determines a relevant subset of the original features. A method that ranks features according to their contribution to classifier performance is called a ranking-based feature selector. Within the KDD process, feature selection is an essential step: it improves the detection performance of classifiers, reduces the time taken to build the data mining model, and shrinks the initial feature set, since without quality features there can be no quality results from the classifiers.

FS methods are classified into filter and wrapper methods, and several FS methods in the machine learning literature build on this distinction. A wrapper method selects features based on the accuracy estimate of a classifier, whereas a filter method selects features independently of any accuracy estimate, using data characteristics such as relevancy or correlation measures. Filter-based approaches do not depend on a classifier, are usually faster and more scalable than wrapper-based methods, and have low computational complexity. Recently, a number of hybrid approaches have been proposed that combine filter and wrapper methods to achieve a good balance among the feature selection criteria.
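To make the filter/wrapper distinction concrete, the following is a minimal sketch contrasting the two styles. It uses scikit-learn rather than the WEKA/MATLAB tooling used later in this paper, and the dataset, classifier, and parameter choices are illustrative assumptions only.

```python
# Filter vs. wrapper feature selection, sketched with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import (SelectKBest, mutual_info_classif,
                                       SequentialFeatureSelector)
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature against the class label using a data
# characteristic (mutual information); no classifier is involved.
filter_sel = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print("filter picks:", filter_sel.get_support(indices=True))

# Wrapper: greedily add features that improve the cross-validated
# accuracy estimate of an actual classifier.
wrapper_sel = SequentialFeatureSelector(
    KNeighborsClassifier(), n_features_to_select=5, cv=5).fit(X, y)
print("wrapper picks:", wrapper_sel.get_support(indices=True))
```

As the sketch suggests, the filter runs once per feature and is cheap, while the wrapper retrains the classifier many times, which is why filters scale better and wrappers track classifier accuracy more closely.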

2 Related Works

In [1], the authors discuss the impact of noise in class labels by analyzing traditional mutual information-based filter feature selection algorithms, and propose a nearest neighbors-based entropy estimator to mitigate class label errors. In [2], the authors present an empirical study of many feature selection and classification algorithms on diverse biological datasets; the study identifies the combination of RFE with SVM and LR as the best-generalizing model for feature selection and classification. In [3], the authors propose a rule-based feature selection algorithm to improve the detection performance of a multiclass support vector machine. In [4], the authors apply a hybrid algorithm that combines a genetic algorithm with K-nearest neighbors to predict protein–water binding from X-ray crystallographic protein structure data. In [5], the authors propose a redundancy demoting (RD) approach that improves a feature ranking by demoting redundant features. For instance, feature selection based on an F-score measure has been applied to diagnosing erythemato-squamous disease [6]. In [7], the genetic algorithm (GA) is presented as a commonly used global search technique for optimization; it has been applied to the feature selection process in various applications and has proven to be a good tool [8,9,10,11,12]. A recommended way of addressing this problem is to combine a genetic algorithm with memetic (local search) operations [13, 14], which fine-tunes the search process and improves the accuracy and efficiency of the results generated by the genetic algorithm. Such evolutionary algorithms are variously called hybrid evolutionary algorithms (EAs), memetic algorithms (MAs), Lamarckian EAs, Baldwinian EAs, and local search or cultural genetic algorithms. They are used not only to converge to high-quality results but also to search more efficiently [13, 14]. Selecting only the minimum set of relevant features from the original feature set is the main challenge in feature selection. This work develops a 'symmetrical uncertainty and genetic algorithm (SU-GA)'-based feature selector, intended as a universal and novel feature selector. The goal of this study is to make accurate predictions with the least number of significant features. In this work, the features undergo memetic (genetic) evolution, with 'include' and 'remove' operations, to select the features. The examined results show that the proposed SU-GA-based classifier attains significant dimensionality reduction on datasets of various dimensionality from the UCI machine learning repository [15].

3 Proposed Method

3.1 Symmetrical Uncertainty

The symmetrical uncertainty (SU) between each feature and the target concept is used to identify the best features for classification: features with a larger SU value receive greater weight. SU measures the relationship between two variables X and Y based on information theory. It is calculated as follows.

$$ {\text{SU}}(X,Y) = 2 \frac{I(X,Y)}{H(X) + H(Y)} $$

Here, I(X, Y) is the mutual information between X and Y, and H(·) is the entropy function. The correction factor of 2 normalizes SU to the range [0, 1]. If the SU value is 1, knowing either variable completely predicts the other; if the SU value is 0, X and Y are independent.
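The following is a minimal sketch of computing SU(X, Y) for two discrete variables directly from the formula above, using the identity I(X, Y) = H(X) + H(Y) − H(X, Y). The function and variable names are illustrative.

```python
# Symmetrical uncertainty for discrete variables, following
# SU(X, Y) = 2 * I(X, Y) / (H(X) + H(Y)).
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy H of a discrete sequence, in bits."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mi = h_x + h_y - h_xy             # mutual information I(X, Y)
    return 2.0 * mi / (h_x + h_y) if (h_x + h_y) > 0 else 0.0

# A feature perfectly aligned with the class label gives SU = 1.
feature = [0, 0, 1, 1, 0, 1]
label   = [0, 0, 1, 1, 0, 1]
print(symmetrical_uncertainty(feature, label))   # -> 1.0
```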

3.2 Genetic Algorithm

The initial population of the genetic algorithm (GA) [16] is generated randomly; each individual chromosome encodes a feature subset. Each chromosome is a binary string, in which the value '1' ('0') indicates that the corresponding feature is selected (omitted). The objective function Obj_Fun for a subset is obtained from the fitness of the individual chromosome as follows:

$$ \begin{aligned} \text{Fitness}(c) &= \max\left(\text{Obj\_Fun}(\text{SF}_c)\right) \\ \text{Obj\_Fun}(\text{SF}_c) &= \alpha \cdot \frac{1}{\tau} + \text{RCF} \cdot \text{Recall} + \text{PCF} \cdot \text{Precision} \end{aligned} $$

where

τ: number of ones in SF_c

α: number of minimum features selected

RCF: recall credibility factor

PCF: precision credibility factor

Here SF_c denotes the selected feature subset encoded by chromosome c, and Obj_Fun(SF_c) estimates the contribution of that feature subset. If two chromosomes obtain the same fitness value, priority for survival is given to the one with the smaller number of selected features.
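The following is a minimal sketch of the binary chromosome encoding and the objective function above. The weights α, RCF, and PCF, the fixed recall/precision values, and the tie-breaking construction are simplified assumptions; in the full method, recall and precision come from a classifier trained on the selected features.

```python
# Chromosome encoding and Obj_Fun(SF_c), sketched with assumed weights.
import numpy as np

rng = np.random.default_rng(42)
N_FEATURES = 10
ALPHA, RCF, PCF = 1.0, 0.5, 0.5   # assumed weighting factors

def random_chromosome(n=N_FEATURES):
    """Binary string: 1 = feature selected, 0 = feature omitted."""
    return rng.integers(0, 2, size=n)

def obj_fun(chromosome, recall, precision):
    """Obj_Fun(SF_c) = alpha*(1/tau) + RCF*Recall + PCF*Precision."""
    tau = chromosome.sum()            # number of ones in SF_c
    if tau == 0:
        return 0.0                    # an empty subset contributes nothing
    return ALPHA * (1.0 / tau) + RCF * recall + PCF * precision

# Fixed recall/precision stand in for classifier feedback here.
c1, c2 = random_chromosome(), random_chromosome()
scores = [(obj_fun(c, 0.8, 0.75), -c.sum(), i)
          for i, c in enumerate((c1, c2))]
# Ties on fitness favor the chromosome with fewer selected features.
best = max(scores)
print("best chromosome index:", best[2])
```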

3.3 Proposed SU-GA Feature Selector

The proposed hybrid feature selection algorithm, a feature ranking and optimal feature selection (FR-OFS) method, combines the filter method with the wrapper method to attain an optimal subset of features. In the first phase, the algorithm selects a small set of relevant features by computing the SU values between the features and the target concept. In the second phase, GA is used to search for the optimal subset of the features retained in the first phase. The initial GA population is set based on the feature ranks obtained from SU: features with higher SU values have a higher probability of being selected, i.e., the corresponding bits are more likely to be set in a chromosome (a sketch of this SU-biased initialization follows the figure captions below). Each individual's fitness is evaluated using the fitness function defined above. Chromosomes may then be transformed by the crossover and mutation operators, both of which affect the fitness value. The procedure is repeated until acceptable results are obtained; the features chosen at the end of this search are called reliable features. The proposed SU-GA feature selector is shown in Figs. 1 and 2.

Fig. 1 Proposed SUGAFS method

Fig. 2 Hybrid feature selection algorithm
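The following is a minimal sketch of the SU-biased population initialization referenced above: each feature's SU value, rescaled, is used as the probability that its bit is set to 1 in an initial chromosome. The SU values, population size, and probability floor are illustrative assumptions.

```python
# SU-biased initial population: high-SU features are more likely selected.
import numpy as np

rng = np.random.default_rng(7)

def init_population(su_values, pop_size, floor=0.05):
    """Bit i of each chromosome is 1 with probability derived from SU_i."""
    su = np.asarray(su_values, dtype=float)
    probs = np.clip(su / su.max(), floor, 1.0)   # rescale; keep a small floor
    return (rng.random((pop_size, su.size)) < probs).astype(int)

su_values = [0.91, 0.60, 0.08, 0.45, 0.02]       # assumed phase-1 SU ranks
population = init_population(su_values, pop_size=6)
print(population)
```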

4 System Implementation and Experimental Results

The implementation of the proposed SU-GA feature selector is twofold. First, the SU value of each feature is computed and the features are ranked by SU value; features with low SU values are eliminated as irrelevant or redundant. Both the WEKA and MATLAB toolboxes are used to implement the proposed hybrid feature selector SU-GA. In the second phase, the genetic algorithm is applied to the feature subset selected by SU to find the optimal features without compromising classification accuracy. The final set of optimal features selected by the proposed hybrid feature selector SU-GA is tested against various benchmark classification algorithms from the literature, measuring classification accuracy, the number of reduced features, and the time taken to build a model, using 10-fold cross-validation as the test method.
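As a rough illustration of this evaluation protocol, the sketch below compares a classifier's 10-fold cross-validated accuracy and fitting time on the full feature set against a selected subset. The experiments in this paper use WEKA and MATLAB; scikit-learn is used here for illustration, and the dataset, classifier, and selected column indices are assumptions.

```python
# Evaluating a selected feature subset with 10-fold cross-validation.
import time
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier   # a J48-like learner

X, y = load_breast_cancer(return_X_y=True)
selected = [0, 3, 7, 20, 27]                      # assumed selector output

for name, features in [("all features", np.arange(X.shape[1])),
                       ("selected subset", np.array(selected))]:
    start = time.perf_counter()
    acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X[:, features], y, cv=10).mean()
    elapsed = time.perf_counter() - start
    print(f"{name}: accuracy={acc:.3f}, time={elapsed:.2f}s")
```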

4.1 Experimental Setup

In total, 10 benchmark datasets from the UCI machine learning repository are chosen for the experiments. The existing and proposed feature selectors are empirically evaluated in terms of the number of features selected by each method, the improvement in the detection performance of classifiers, and the time taken to build a model, on datasets including Soybean, Lung Cancer, Ionosphere, and Dermatology. A summary of these datasets, including the number of attributes and instances in each, is given in Table 1.

Table 1 Datasets for research

Different classification algorithms, namely NB, J48, SMO, and JRIP, are applied to the original feature sets as well as to the features selected by SU, GA, and the proposed SU-GA feature selector. The proposed method is implemented using the WEKA and MATLAB toolboxes. The features with higher SU values are identified through the ranker search algorithm available in WEKA. The best-ranked features from SU are given as input to the GA toolbox available in MATLAB, which selects only the optimal feature set by effectively eliminating both irrelevant and redundant features. Finally, various supervised classifiers are applied to both the original features and the optimal subset of features selected by the proposed method. The classification accuracy, the number of features selected by each method, and the time taken by the classifiers clearly show that the proposed method is superior to the other existing feature selectors in the literature. The results are presented in Tables 3 and 4.

4.2 Results and Discussions

In this experiment, three widely used evaluation measures, the number of features selected, classification accuracy, and processing time, are adopted to evaluate the proposed method. The proposed SUGAFS algorithm and two other feature selection algorithms, SU and GA, are implemented in WEKA and MATLAB. The three algorithms are tested and compared on ten distinct UCI datasets. The results obtained from the SU and GA feature selection methods and the proposed method are tabulated in Tables 2, 3, 4, 5, 6, 7, and 8, which report the number of features selected, the classification performance, and the time taken by the proposed method compared with the SU and GA methods. The experiment shows that the proposed method is more effective than the available feature selection methods.

Table 2 Features selected by different FS methods
Table 3 Classification accuracy by different methods on various datasets
Table 4 Classification accuracy by different methods on various datasets
Table 5 Average performance of different classification algorithms
Table 6 Computational time of different methods on various datasets
Table 7 Computational time of different methods on various datasets
Table 8 Processing time average by different FS methods

4.3 Feature Selection

FS is the process of deriving a subset of features from the original feature space. The proposed method was applied to all the datasets to select the relevant features and remove the irrelevant ones. Table 2 shows the features selected by SU, GA, and SU-GA. The results indicate that the proposed method selects the fewest features of the three methods on all ten datasets. Notably, it selects around ten percent of the attributes for the segment-challenge and vote datasets, and only around twenty percent of the features for the lung cancer and vehicle data. That classification performance improves even as the number of selected features decreases advocates the necessity of feature selection. Effective feature selection may therefore improve the accuracy and performance of learning algorithms (Fig. 3).

Fig. 3 Number of features selected on UCI datasets

4.4 Classification Performance

To evaluate how well the original features and the features selected by each selector (SU, GA, and SU-GA) improve the detection performance of various classifiers, an empirical evaluation was conducted: the classification algorithms were applied to the different feature sets, and the predictive results were observed. The features selected by the SU-GA feature selector improve the detection performance of all the classifiers. The other methods, however, select more features than the proposed method, which selects only about 30% of the features. It is worth noting that the difference in classification performance between GA and the proposed method is very small, while the proposed method offers better running time and fewer selected features. The proposed method is thus effective and efficient compared with the other available feature selection methods.

Table 5 shows the average learning accuracy. The proposed SU-GA feature selector improves the detection performance of the JRIP and decision table algorithms, and performs comparatively better than the other methods. The experiments show that when the attributes are reduced through SUGAFS, the classification accuracy may increase or remain comparable. The exploratory results illustrate that the classification accuracy on the chosen feature subset is superior to that of all the other existing strategies (Fig. 4).

Fig. 4 Performance of FS methods

4.5 Processing Time

In the third phase, several tests were carried out to assess the running time of the proposed method across all the datasets; the same experiments were conducted for the SU and GA feature selection methods. The running times of the three feature selection methods were then compared against the original datasets. The detailed results are provided in Tables 6, 7, and 8. The experimental results show that the proposed feature selection method drastically reduces the running time of the learning algorithms, and that its average running time improves on the processing time of the other FS methods significantly.

From the experiments, the following observations favor the proposed method:

1. The running time of all the classification algorithms is lower than with the other methods.

2. The GA feature selection method takes much more time than the other two feature selection methods because of its global search nature.

3. The average processing time of the proposed method is considerably lower than that of the other two feature selection methods (Figs. 5 and 6).

Fig. 5 Processing time average by different FS methods

Fig. 6 Processing time average by different FS methods

5 Conclusion

In this work, SU and GA are combined into a hybrid feature selector that eliminates both irrelevant and redundant features and selects only the most relevant features, improving data classification. The performance of the proposed feature selector was evaluated in terms of three quality measures, the number of selected features, the detection performance of classifiers, and the time taken to build the model, on datasets from the University of California Irvine (UCI) repository. The proposed SU-GA selector could become part of the standard toolbox of feature selection methods for effective data classification. The contributions of the proposed method can be stated as follows:

1. SU and GA are combined into the SU-GA hybrid feature selector, which selects only the most relevant features for supervised learning. The system aims to improve over existing work in three respects: reducing the feature set, improving the classification accuracy, and minimizing the running time needed to achieve the goal.

2. The proposed method reduces processing time significantly more than the other feature selection methods while selecting a minimal number of features. SU-GA also yields a higher classification accuracy rate on some datasets with the minimum number of selected features and minimum running time.

3. The proposed SU-GA features and learning paradigm are promising strategies to apply to any data classification problem.