1 Introduction

Improving the learning ability of machine learning algorithms remains a challenge in the field of pattern recognition. The problem becomes even more complex when the data contain a large number of features. This curse of dimensionality undermines the learning ability of a classifier, especially when the number of instances is smaller than the number of features [1]. The presence of irrelevant and redundant features in the data confuses learning algorithms and degrades classifier performance (also known as the Hughes phenomenon [2]). Other implications are over-fitting of the classifier and learning overheads caused by higher computational costs [3, 4]. The problem has become even more challenging with the recent boom in stored and streaming classification data. This tremendous increase in data demands more effective approaches to tackle the performance issues of machine learning algorithms. Therefore, many researchers are working on this active problem, and many efforts have been made [5,6,7,8] with the eventual goal of developing well-generalized and efficient machine learning algorithms from high dimensional offline or online data.

Feature Selection (FS) and Feature Weighting (FW) are two widely adopted approaches for improving performance and reducing the dimensionality of data. Feature selection is a combinatorial search problem in which each feature is either accepted or rejected. It is suitable only for data containing redundant and irrelevant features [9]. Feature weighting, in contrast, is a continuous search problem in which weights are assigned to features according to their relevance [10, 11]. FW approaches are a good choice when the relevancy of features varies across the data [12].

The methods for selecting relevant feature subsets are of three types: filter, wrapper, and hybrid. Filter methods inspect intrinsic properties of the data to evaluate and select a subset of features. Wrapper methods use heuristic search algorithms in which classifiers themselves act as the evaluator, and hybrid methods are based on a combination of the filter and wrapper methods. Feature weighting, on the other hand, is mostly based on wrapper methods [6, 11].

The filter methods suffer from lower classification accuracy, whereas the performance of wrapper methods directly depends upon the search ability of the optimization algorithm. However, these optimization algorithms also suffer from the curse of dimensionality. Therefore, reproducing the same classification performance with existing methods is still a difficult task when data have higher dimensions. Motivated by this, we introduce a hybrid method that combines feature selection and feature weighting. Specifically, in this paper, four hybrid strategies are proposed for improving classification performance on high dimensional data by combining filter and wrapper methods of the feature selection and feature weighting approaches. Furthermore, Ant Lion Optimization (ALO) with blend crossover for feature selection and feature weighting is also presented.

The rest of the paper is organized as follows. Related work is discussed in Section 2. The need for a hybrid method that combines feature selection and feature weighting, together with the proposed strategies, is discussed in Section 3. The results and discussion of the proposed method are provided in Section 4; the comparison of the work with other well-known methods, its limitations, and future scope are also included in that section. Finally, Section 5 concludes the work.

2 Related work

A plethora of successful efforts have been made to improve classification performance and to select the best subset of features from lower dimensional data. Numerous filter, wrapper, and hybrid methods have been suggested in the literature. However, only a few have addressed high dimensional data; their work is discussed below.

The RELIEF method [13] ranks features by assigning weights to them based on a relevancy criterion. Various variants have been suggested for further advancement of this method, including RELIEF-F [14], Iterative RELIEF [15], and Fuzzy-theoretic Margin-maximization RELIEF (FM-RELIEF) [16]. However, these algorithms are unable to identify features that are entirely redundant and get trapped in local optima while optimizing their objective function on higher dimensions. The same problem was observed for the Simba algorithm [17] when data have many irrelevant features, as pointed out by Sun et al. [18]. A local learning based feature selection method was proposed by Sun et al. [18] to find the best feature subsets from high dimensional data. Hall et al. [19] proposed correlation based feature selection, which is based on the idea of searching for a subset of features that are highly correlated with the class but uncorrelated with each other. Advancing the algorithm for higher dimensions, a fast correlation based filter (FCBF) was proposed by Yu and Liu [20] to select the subset of features by measuring the feature-class and feature-feature correlations.

Recently, clustering based filter methods have gained a lot of attention due to their success in reducing features from high dimensional datasets. DeSarbo et al. [21] applied k-means clustering to group features using weights assigned by the user. Huang et al. [22] automated the weight assignment in k-means clustering to group features; the weighted k-means clustering algorithm updates the weights of the features based on the current partition at each iteration. Domeniconi et al. [23] proposed a Locally Adaptive Clustering (LAC) algorithm for finding a relevant subset of features. However, Liping et al. [24] pointed out that the objective function of LAC is not differentiable since it is a maximum function, and they proved that replacing the largest average value with a constant leads to the same convergence of the objective function. An entropy-based weighted k-means clustering (EWKM) was proposed by Liping et al. [24] to cluster the group of features. In their approach, the weight of a feature is inversely proportional to the within-cluster variance of that feature. The work was further extended by Chen et al. [25], who introduced the FG-k-means clustering algorithm in which individual features, as well as feature groups, are weighted to find the best feature subset. Besides k-means clustering, Song et al. [26] recently used graph-theoretic clustering to find the best feature subset. They proposed a fast clustering-based feature selection (FAST) algorithm for high dimensional datasets. It is a two-step approach: in the first step, clusters of features are formed using a Minimum Spanning Tree (MST) based clustering method; in the second step, features strongly related to a class are selected from each cluster to form the final feature subset. Recently, Revanasiddappa and Harish [27] proposed an Intuitionistic Fuzzy Entropy (IFE) method for selecting a subset of features from text categorization data. Initially, the Intuitionistic Fuzzy C-Means (IFCM) clustering method is used to measure the intuitionistic membership values of the features. Then, the intuitionistic fuzzy entropy is calculated with a matching degree using these membership values. The features that have low entropy are selected for the classification of text documents.

Wrapper methods have been studied extensively and have been used for feature selection, feature weighting, and simultaneous feature selection and feature weighting. For feature selection, Liu et al. [28] proposed an improved approach that optimizes the feature subset and the kernel parameters simultaneously using modified Multi-Swarm Optimization. Ghamisi et al. [29] proposed a wrapper based approach for feature selection by integrating the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). Hancer et al. [30] used the Binary Artificial Bee Colony (BABC) optimization algorithm, whereas Hafez et al. [31] used the Sine Cosine Algorithm (SCA) for selecting features.

For feature weighting, Kelly and Davis [10] used the combination of a genetic algorithm and the k-NN classifier to improve accuracy. Paredes et al. [32] applied the gradient descent optimization algorithm to weight classes and features (class-dependent weighting), individual prototypes (prototype-dependent weighting), and a combination of both. Additionally, they considered Euclidean and Class-Dependent Mahalanobis (CDM) distance schemes for measuring distances in the k-NN classifier. AlSukker et al. [2] used the Differential Evolution (DE) optimization algorithm for feature weighting, neighbor weighting, class weighting, and hybrid (feature and class) weighting. Studies on individual feature weighting are very few, and it is usually combined with instance or feature selection.

Tahir et al. [33] proposed simultaneous selection and weighting of features using a Tabu search optimization algorithm. The scope for improvement in classification performance has motivated many researchers to optimize feature subsets and feature weights simultaneously. Barros et al. [34] introduced a new adaptive distance scheme to further improve the performance of the work of Tahir et al. [33]. Derrac et al. [35] combined instance selection, instance weighting, and feature weighting using a cooperative coevolution algorithm. Recently, Rodríguez et al. [6] studied 15 combinations of feature and instance selection and weighting. A framework using an evolutionary approach was presented that combines the selection and weighting of features as well as instances. The evolutionary approach used a binary cross-generational elitist selection genetic algorithm for selection and differential evolution for weighting.

Feature selection based hybrid methods studied for high dimensional data include the work of Chuang et al. [36], which combined correlation based feature selection and the Taguchi-Genetic algorithm for DNA microarray data. Derrac et al. [37] presented a hybrid method based on the combination of fuzzy rough set theory and a Genetic Algorithm (GA) for instance and feature selection. Recently, Apolloni [38] proposed two hybrid methods that combine Information Gain (IG) and Binary Differential Evolution (BDE) for high-dimensional microarray data.

3 Hybridization of feature selection and feature weighting

3.1 The need for hybrid method

Feature selection and feature weighting are two well-proven approaches for improving classification accuracy and reducing data dimensionality. Feature selection is meant for data that have redundant and irrelevant features only, whereas feature weighting suits data where features vary in relevance [12].

A classification problem is shown in Fig. 1 from the viewpoint of feature selection and feature weighting. Two features, petal length and petal width, from the Iris data are used to plot the classification accuracy of the 1-NN classifier. Figure 1a shows the accuracies for the individual features as well as their combination. The data has many globally optimal solutions, as depicted in Fig. 1b. It can be seen that feature selection, being a subset of feature weighting, is a relatively less computationally complex problem which gives fair accuracy. On the other hand, feature weighting promises good accuracy, but searching for optimal solutions is very time-consuming.

Filter methods can eliminate the irrelevant and redundant features based on information measures of the data itself, but these methods are independent of the machine learning algorithm [39]. Moreover, they do not address the problem of parameter optimality of the classifier. Therefore, an improvement in classification accuracy cannot be guaranteed for every machine learning algorithm. When higher classification accuracy is the primary objective, wrapper methods are more effective [9] since they can address the data and the parameters of the classifier simultaneously. However, these methods have two downsides: the first is the selection of an optimization algorithm that converges towards a global solution while avoiding local optima; the second is the execution time for evaluating the newly generated solutions. The execution cost depends upon the total number of instances as well as the dimensionality of the input data; data with many dimensions and instances require more processing time, and vice versa. Therefore, wrapper methods are computationally intensive, and their performance depends entirely upon the exploration capability of the optimization method. Such approaches perform well for data that have fewer features because searching lower dimensions is easier and requires less computational time. When data is high dimensional, the performance of wrapper methods deteriorates due to the implications of the curse of dimensionality for optimization algorithms [40]. Higher dimensions limit the ability of the optimization algorithms to search for a global solution because of the increase in search space, and hence they stagnate at local optima.

Fig. 1
figure 1

Visualization of feature selection and feature weighting problem considering two features of Iris data

The simultaneous feature selection and feature weighting methods show improvements in accuracy only on lower dimensional data as compared to individual feature selection and feature weighting. However, the dimensionality of the problem is doubled when feature selection and feature weighting are integrated into a single problem. This two-fold increase in dimensions makes the problem more computationally complex to solve, and optimization algorithms fail to find a global optimum, thereby resulting in lower classification accuracy.

3.2 The Hybrid method combining feature selection and feature weighting

The limitations of previous studies motivate us to develop a hybrid method that exploits the strengths of the feature selection and feature weighting approaches. In this paper, we present four strategies that combine various methods of feature selection and feature weighting. We consider feature selection and feature weighting as independent problems, which need to be optimized separately instead of being combined into a single problem. The idea of hybridizing feature selection and feature weighting is illustrated in Fig. 2 with the help of an example. Figure 2a shows the wrapper based simultaneous search vector of feature selection and feature weighting, which has a size of (2 × d + NN parameter), where d is the dimensionality of the data. Figure 2b and c show the hybridization of feature selection and feature weighting in the two possible orders, along with how an individually optimized search vector might appear. The initial search vector consists of the dimensions of the data along with the parameter of the NN classifier, while the subsequent search vector has fewer data dimensions in addition to the parameter of the classifier. The advantages of the proposed hybrid method are two-fold. First, the initial approach reduces the dimensions, either by selecting a subset of features or by searching for the weights, which helps the successive approach search for a better optimal solution in the resulting lower dimensional data. Second, different combinations of filter and wrapper methods can be tried to find the best performing hybrid method.

Fig. 2
figure 2

An example of the hybridization of feature selection and feature weighting. a Simultaneous search vector of feature selection and feature weighting as given in existing work. b Feature selection followed by feature weighting and c feature weighting followed by feature selection
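The reduction in search-space size discussed above can be illustrated with a minimal sketch (Python; the dimension values are purely hypothetical and only serve to contrast the simultaneous vector of Fig. 2a with the staged vectors of Fig. 2b and c):

```python
# Illustrative dimensions (hypothetical): d original features, d_reduced after stage 1.
d = 1024
d_reduced = 300

# Simultaneous FS + FW (Fig. 2a): one binary selection entry and one real-valued
# weight per feature, plus the k parameter of the NN classifier.
simultaneous_len = 2 * d + 1          # 2049 values optimized at once

# Hybrid strategies (Fig. 2b and c): each stage optimizes only d (or d_reduced)
# values plus the classifier parameter.
stage1_len = d + 1                    # 1025 values in the first search vector
stage2_len = d_reduced + 1            # 301 values in the second search vector

print(simultaneous_len, stage1_len, stage2_len)   # 2049 1025 301
```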

The feature selection and feature weighting approaches can be combined in two orderly ways. Figure 3 shows the possible combinations of feature selection and feature weighting using the filter and wrapper methods. In this paper, we suggest four hybrid strategies based on these two orders. The strategies are:

  1. Feature selection with filter method followed by feature weighting with wrapper method (FFS-WFW)

  2. Feature weighting with wrapper method followed by feature selection with filter method (WFW-FFS)

  3. Feature selection with wrapper method followed by feature weighting with wrapper method (WFS-WFW)

  4. Feature weighting with wrapper method followed by feature selection with wrapper method (WFW-WFS)

Fig. 3
figure 3

Proposed hybrid strategies for the high dimensional data

Although many filter and wrapper methods have been presented in the literature, in this paper we consider the clustering based FAST filter method proposed by Song et al. [26] for selecting relevant features. The method identifies irrelevant features and is more successful on high dimensional datasets than other filter based methods. The wrapper method that we use for feature selection as well as for feature weighting is the Ant Lion optimization algorithm with blend crossover. The crossover operation is used to improve the exploration and exploitation capabilities while searching for better solutions. The linear feature weighting method is employed in this work, in which the weights are multiplied directly with the features.
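As a concrete illustration, a minimal sketch of the first strategy (FFS-WFW) is given below (Python; `fast_select` and `alo_optimize` are hypothetical placeholders for the FAST filter of Section 3.3 and the modified ALO wrapper of Section 3.4, not the authors' implementation):

```python
def ffs_wfw(X, y, fast_select, alo_optimize):
    """FFS-WFW strategy: filter-based selection, then wrapper-based weighting.

    X : (n_samples, n_features) NumPy array, y : class labels.
    fast_select / alo_optimize stand in for the FAST filter (Section 3.3)
    and the modified ALO wrapper (Section 3.4).
    """
    # Stage 1: the FAST filter keeps only the relevant, non-redundant features.
    selected = fast_select(X, y)                  # list of feature indices
    X_reduced = X[:, selected]

    # Stage 2: ALO searches one weight per remaining feature plus k for k-NN,
    # maximizing classification accuracy on the reduced data.
    weights, k = alo_optimize(X_reduced, y)
    return selected, weights, k

# Usage (hypothetical stage implementations and data):
#   selected, weights, k = ffs_wfw(X, y, fast_select, alo_optimize)
#   X_transformed = X[:, selected] * weights      # linear feature weighting
```

The other three strategies differ only in which stage (filter or wrapper, selection or weighting) is executed first, so the same two-stage structure applies.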

3.3 FAST clustering method

FAST is a clustering-based feature selection algorithm for high-dimensional data. It is a two-step feature selection procedure which involves the elimination of irrelevant and redundant features. In the first step, the irrelevant features are discarded from the full set of features; then, in the second step, the redundant features are eliminated by selecting representative features from each cluster to obtain the final subset of features. The method is based on the mutual information between features or between a feature and the target class, where symmetric uncertainty (SU) is used as the measure of correlation. It is given as follows:

$$ S(X,Y)=\frac{2*IG(X|Y)}{H(X)+H(Y)} $$
(1)

where S denotes the symmetric uncertainty, H denotes the entropy and IG denotes the information gain. The entropy measure for the variable X is given as:

$$ H(X)=-\sum\limits_{x\in X}^{}p(x) \log_{2} p(x) $$
(2)

Similarly, the conditional entropy is measured for the variable X after observing the values of another variable Y as follows:

$$ H(X|Y)=-\sum\limits_{y\in Y}^{}p(y) \sum\limits_{x\in X}^{}p(x|y) \log_{2} p(x|y) $$
(3)

where p(x) denotes the prior probabilities for all values of X and p(x|y) denotes the posterior probabilities of X given the values of Y. The Information Gain measures the amount of decrease in the entropy of X which reveals the additional information about variable X as provided by Y. It is a symmetrical measure where gained information about X after observing Y is equal to the gained information about Y after observing X [20]. It is defined as:

$$ IG(X|Y)=H(X)-H(X|Y) $$
(4)

The value of the symmetric uncertainty lies within the range [0, 1], where the value 0 denotes that X and Y are independent variables and the value 1 denotes that either variable completely predicts the value of the other.
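A minimal sketch of these measures for discrete variables is given below (Python; it assumes the features have already been discretized, which is required by these entropy-based measures):

```python
import numpy as np
from collections import Counter

def entropy(x):
    """H(X) = -sum p(x) log2 p(x) over the observed values of x, as in (2)."""
    counts = np.array(list(Counter(x).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    """H(X|Y) = sum_y p(y) H(X | Y=y), as in (3)."""
    x, y = np.asarray(x), np.asarray(y)
    h = 0.0
    for v in np.unique(y):
        mask = (y == v)
        h += mask.mean() * entropy(x[mask])   # mask.mean() is p(y=v)
    return h

def symmetric_uncertainty(x, y):
    """SU(X,Y) = 2 * IG(X|Y) / (H(X) + H(Y)), as in (1); lies in [0, 1]."""
    ig = entropy(x) - conditional_entropy(x, y)   # information gain, (4)
    denom = entropy(x) + entropy(y)
    return 2.0 * ig / denom if denom > 0 else 0.0
```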

Suppose the d features in the data are represented as D = {F1,F2,...,Fd}. Initially, irrelevant features are removed by measuring the correlation between each feature Fi and the class C. If S(Fi,C) is greater than a specified threshold 𝜃, the feature is considered relevant; otherwise it is discarded from the full feature set to produce \(D^{\prime }=\{F^{\prime }_{1},F^{\prime }_{2},...,F^{\prime }_{c}|c \leq d\}\). Then, the redundant features are eliminated by constructing a complete weighted undirected graph G = (V, E), in which the features Fi act as vertices, each vertex carries the correlation S(Fi,C) between feature Fi and class C, and the weight of an edge is the correlation S(Fi,Fj) between features Fi and Fj. Then, Prim's algorithm is used to build the minimum spanning tree (MST) of the graph. Clusters are formed by partitioning the MST: an edge is removed when its weight is smaller than the feature-class correlations of both of its vertices. From each cluster, the representative feature with the highest correlation between feature and class is selected. This selection of one feature from each cluster produces the final feature subset.
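A condensed sketch of this procedure, reusing the `symmetric_uncertainty` function from the sketch above, is given below (Python with `networkx`; threshold handling and discretization are simplified relative to [26], so this is only an illustration of the steps, not the authors' implementation):

```python
import networkx as nx

def fast_select(X_disc, y, theta=0.05):
    """Sketch of FAST: irrelevance filtering, MST clustering, representatives."""
    d = X_disc.shape[1]
    # Step 1: drop irrelevant features (feature-class SU below threshold theta).
    su_fc = {i: symmetric_uncertainty(X_disc[:, i], y) for i in range(d)}
    relevant = [i for i in range(d) if su_fc[i] > theta]

    # Step 2: complete weighted graph over the relevant features,
    # edge weight = feature-feature SU; build the MST with Prim's algorithm.
    G = nx.Graph()
    G.add_nodes_from(relevant)
    for a in range(len(relevant)):
        for b in range(a + 1, len(relevant)):
            i, j = relevant[a], relevant[b]
            G.add_edge(i, j, weight=symmetric_uncertainty(X_disc[:, i], X_disc[:, j]))
    mst = nx.minimum_spanning_tree(G, algorithm="prim")

    # Partition the MST: remove edges whose feature-feature SU is smaller than
    # the feature-class SU of both endpoints (redundancy criterion).
    for i, j, w in list(mst.edges(data="weight")):
        if w < su_fc[i] and w < su_fc[j]:
            mst.remove_edge(i, j)

    # Each remaining connected component is a cluster; keep the feature with
    # the highest feature-class SU as its representative.
    return [max(cluster, key=lambda i: su_fc[i])
            for cluster in nx.connected_components(mst)]
```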

3.4 Ant lion optimization

The recently proposed Ant Lion Optimization algorithm by Mirjalili [41] is attracting researchers for solving optimization problems. It is a nature-inspired algorithm which imitates the hunting behavior of antlions. The algorithm has been applied successfully to many engineering problems due to its good exploration and exploitation capabilities [42,43,44]. It balances the search for optimal solutions through a single parameter which controls exploration as well as exploitation. Exploration is realized by employing random walks and the roulette wheel strategy for generating diverse solutions. Exploitation is achieved by adaptively shrinking the search area of the problem and by elitism. In this work, the ALO is modified with a crossover operator to improve its search capabilities.

In the optimization algorithm, the act of antlions catching their prey is simulated mathematically to find the optimal solution of the given problem. The hunting of the antlions (AL) is simulated by moving the ants (AN) over the search space, where antlions entrap them and become fitter. To model this idea, the ants are required to perform random walks in the locality of the antlions so that they can be captured and consumed by the antlions.

Initially, the positions of antlions are selected randomly within the search space of the problem. Then, the random walk of an ant (ANl) is performed around an antlion (ALj) which is defined as follows:

$$ W^{t}=\left[0, \sum\limits_{i = 1}^{t_{1}}2r-1, \sum\limits_{i = 1}^{t_{2}}2r-1, ... ,\sum\limits_{i = 1}^{t_{T}}2r-1\right] $$
(5)

where t denotes the current iteration, T is the maximum iterations and r is a random variable which is given as follows:

$$ r= \left\{\begin{array}{ll} 1 & \text{if } rand > 0.5\\ 0 & \text{otherwise} \end{array}\right. $$
(6)

where rand denotes a real value generated randomly and uniformly in the range [0, 1]. Thus, the random walks of the ants are the cumulative sums of this random variable up to t iterations. Further, min-max normalization is used to confine the random walks of the ants within the search boundaries of the problem as follows:

$$ {W^{t}_{i}}=\frac{\left( {W^{t}_{i}}-a_{i}\right)\left( {h^{t}_{i}}-{g^{t}_{i}}\right)}{b_{i}-a_{i}}+{g^{t}_{i}} $$
(7)

where ai denotes the lower bound and bi the upper bound of the random walk in the ith dimension, and \({g_{i}^{t}}\) and \({h_{i}^{t}}\) are the lower and upper bounds of the ith dimension at the tth iteration. At each iteration, the lower and upper bounds are updated to simulate the trapping of ants in the antlion's pit. For an ant (ANl) at the tth iteration, this is defined as follows:

$$ {g^{t}_{i}}=A{L^{t}_{j}}+{g^{t}_{i}},\quad \text{and} \quad {h^{t}_{i}}=A{L^{t}_{j}}+{h^{t}_{i}} $$
(8)

where \(A{L^{t}_{j}}\) represents the position of the selected jth antlion at the tth iteration around which the ants are trapped, and \({g^{t}_{i}}\) and \({h^{t}_{i}}\) represent the lower and upper bounds of the ith dimension for the lth ant at the tth iteration, respectively. The antlion catches its prey by sliding it downwards into the pit. This act is represented by adaptively reducing the radius of the ants' walks as follows:

$$ {g^{t}_{i}}=\frac{g_{i}}{I},\quad \text{and} \quad {h^{t}_{i}}=\frac{h_{i}}{I} $$
(9)

where gi is the lower bound of ith dimension for the given problem, hi is the upper bound of ith dimension for the given problem and I is the ratio which is defined as:

$$ I = 10^{w}\frac{t}{T} $$
(10)

where w is a parameter that adjusts the level of exploitation. The new positions of the ants are updated using elitism, a technique used in optimization algorithms to maintain the best solutions at each iteration. It involves an elite antlion (ALET) that has the fittest solution and an antlion (ALRW) selected through the roulette wheel strategy. The elite antlion affects the movement of all ants, whereas the selected antlion affects the movements of nearby ants only. In the original ALO, the new position of an ant is determined as the average of its random walks around the elite and the selected antlion. However, we discard the averaging operation and employ the blend crossover operation (BLX) [45] for determining the new positions of the ants. This crossover operation has been implemented in many evolutionary algorithms due to its success in the global exploration of solutions. The BLX operator is preferred over the averaging operation because new solutions are generated randomly within a uniformly extended search range around the random walks around the elite and the selected antlion. On the contrary, the averaging operation gives deterministic new solutions representing the mid-point of the ant's walks around the elite and the selected antlion. Let \(R^{t}_{l,j}\) denote the normalized random walk of the lth ant around the jth antlion at the tth iteration. Then, the BLX operation is given as follows:

$$ \begin{array}{l} x_{1}=min\left( R^{t}_{l,ET},R^{t}_{l,RW}\right)-\eta \cdot m \\ x_{2}=max\left( R^{t}_{l,ET},R^{t}_{l,RW}\right)+\eta \cdot m \end{array} $$
(11)

where m is given as \(|R^{t}_{l,ET}-R^{t}_{l,RW}|\) and η is a positive constant which controls the exploration and exploitation of the search space. A uniform random number is selected between x1 and x2 to determine the new position of the ant as follows:

$$Sn=rand(x_{1},x_{2}) $$

Figure 4 depicts the working of the BLX operator assuming that Rl,ET < Rl,RW. The extended search range for generating new solutions is shown, varying from Rl,ET − ηm to Rl,RW + ηm. The stretching of the range depends upon the parameter η. A new solution can be selected randomly within this extended search range, for instance ANBLX. The averaging operation is also depicted as ANAvg. The BLX operator helps to improve the exploration as well as the exploitation ability of the ALO. It helps to avoid local optima by generating, at each iteration, random solutions that have not been tried earlier.

Fig. 4
figure 4

Blend Crossover operator
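A minimal sketch of the BLX step described above is given below (Python; it operates element-wise on the two normalized random-walk position vectors and returns the candidate position Sn, following (11)):

```python
import numpy as np

def blx_position(walk_elite, walk_roulette, eta=0.5, rng=np.random):
    """Blend crossover (BLX) between the random walks around the elite and the
    roulette-wheel-selected antlion.

    walk_elite, walk_roulette : arrays R_{l,ET} and R_{l,RW} of equal length.
    eta : positive constant that stretches the sampling range, as in (11).
    Returns Sn drawn uniformly from [x1, x2], element-wise.
    """
    m = np.abs(walk_elite - walk_roulette)
    x1 = np.minimum(walk_elite, walk_roulette) - eta * m
    x2 = np.maximum(walk_elite, walk_roulette) + eta * m
    return rng.uniform(x1, x2)          # element-wise uniform sample in [x1, x2]
```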

In the case of feature weighting, which is a continuous search problem, \(A{N^{t}_{l}}=Sn\). For feature subset selection, we use an S-shaped transfer function to convert the real values into binary values for selecting or rejecting features. The output of the function is compared with a random number generated in the interval [0, 1] for the conversion, which is given as follows:

$$ P=\frac {1}{1+e^{-Sn}} \quad \text{and} \quad A{N^{t}_{l}}= \left\{\begin{array}{ll}1 & P>rand\\ 0 & \text{otherwise} \end{array}\right. $$
(12)

The fitness of the new solutions is determined, and the global best solution is updated if a better solution is obtained. This is equivalent to the consumption of prey by an antlion and the laying of a new trap for the next prey, which is defined as follows:

$$ AL^{t + 1}_{j}=A{N^{t}_{l}} \quad \text{if} \quad f\left( A{N^{t}_{l}}\right)>f\left( A{L^{t}_{j}}\right) $$
(13)

where f denotes the fitness function, which is the classification accuracy measured with the k-NN classifier and is given as follows:

$$ f=\frac{\text{number of correctly classified instances}}{\text{total number of instances}} $$
(14)
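A sketch of how a candidate solution is converted and evaluated is given below (Python with scikit-learn; the binary conversion follows (12) and the fitness follows (14), while the use of 10-fold cross-validation inside the fitness evaluation is an illustrative assumption, since the text does not fix the internal validation scheme of the wrapper):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def to_binary_mask(sn, rng=np.random):
    """S-shaped transfer function (12): map real-valued Sn to a 0/1 feature mask."""
    p = 1.0 / (1.0 + np.exp(-sn))
    return (p > rng.uniform(size=sn.shape)).astype(int)

def fitness(X, y, mask_or_weights, k, selection=True):
    """Fitness (14): k-NN classification accuracy on the transformed data."""
    if selection:
        cols = np.flatnonzero(mask_or_weights)
        if cols.size == 0:
            return 0.0                        # empty subsets get the worst fitness
        X_t = X[:, cols]
    else:
        X_t = X * mask_or_weights             # linear feature weighting
    clf = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(clf, X_t, y, cv=10).mean()
```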

4 Results and discussions

In this section, the experiments and the analysis of the above-discussed strategies are provided. We discuss the findings and also compare the results with similar work in the field to validate the proposed strategies.

4.1 Data and evaluation criteria

The experiments are performed on 12 high-dimensional datasets containing between 274 and 10,304 features. These datasets were obtained from the ASU feature selection [46] and the UCI machine learning [47] repositories. The benchmarking datasets belong to different application fields, and their details are provided in Table 1. The datasets are evaluated using a 10-fold cross-validation procedure. This procedure splits the original data randomly into 10 sets such that each class of the data has the same proportion in each set. The training data are obtained by selecting 9 sets, while the remaining set is used as testing data. The process is repeated 10 times so that each set is used as testing data once. The performance is measured with the Nearest Neighbor (NN) classifier in terms of classification accuracy, the percentage of selected features, and runtime. We use the k-nearest neighbor classifier due to its efficiency in many practical applications and because it requires only one parameter (i.e., k) to optimize. We implemented the proposed strategies on a system with an Intel Xeon CPU, 8 GB RAM, and the MATLAB 2016 development environment. Furthermore, we carried out 20 independent runs per dataset wherever a wrapper method is employed, to eliminate random discrepancies in the outcomes. This allows a better analysis of the results, leading to robust conclusions, in contrast to the FAST filter method, which gives deterministic results.
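For reference, the evaluation protocol described above can be sketched as follows (Python with scikit-learn; the fold count and NN classifier follow the text, while the random seed is arbitrary and stands in for the unstated shuffling of the folds):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate(X, y, k=1, seed=0):
    """Stratified 10-fold cross-validation accuracy of a k-NN classifier."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    accs = []
    for train_idx, test_idx in skf.split(X, y):
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(accs)
```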

Table 1 High dimensional datasets used in the study

From the above discussed existing works, five methods can be identified based on feature selection and/or feature weighting. The experiments are performed considering these five methods and the four proposed strategies to provide a proper comparative study. The goal is to imitate the current work in the field on the considered benchmarking datasets so as to deliver a fair assessment of the results when compared with the proposed strategies. In the paper, besides the full set of features (FULLSET), nine methods are evaluated, which include: feature selection using the filter and wrapper methods (denoted FFS and WFS, respectively), wrapper based feature weighting (WFW), the hybrid method for feature selection (FFS-WFS), wrapper based simultaneous feature selection and feature weighting (SFSFW), and the proposed strategies FFS-WFW, WFW-FFS, WFS-WFW, and WFW-WFS. We implemented these nine methods using only the FAST and ALO techniques to keep the baseline of the experiments uniform. It is worth noting that FFS denotes the FAST method, while the feature selection, feature weighting, hybrid feature selection, and simultaneous feature selection and feature weighting methods are inspired by recent works.

We also use the Extended Adjusted Ratio of Ratios (EARR) [48], a multi-criteria metric which unifies classification accuracy, the percentage of selected features, and runtime. The metric allows the user to specify the relative importance of classification accuracy, feature selection, and runtime when comparing different methods. Consider M datasets {A1,A2,...,AM} which are evaluated on a set of N methods {B1,B2,...,BN}. Let the classification accuracy, the percentage of selected features, and the runtime of method Bu on dataset Ai be denoted as \(ac{c^{i}_{u}}\), \(f{s^{i}_{u}}\), and \(r{t^{i}_{u}}\), respectively. Then, the EARR of Bu with respect to the remaining (N − 1) methods is given as follows:

$$ EARR^{A_{i}}_{B_{u}}=\frac{1}{N-1} \sum\limits_{v = 1\wedge v\ne u}^{N} EARR^{A_{i}}_{B_{u}, B_{v}} $$
(15)
$$ EARR^{A_{i}}_{B_{u}, B_{v}}=\frac{ac{c^{i}_{u}}/ac{c^{i}_{v}}}{1 + \beta \cdot \log_{10}(f{s^{i}_{u}}/f{s^{i}_{v}}) + \gamma \cdot \log_{10}(r{t^{i}_{u}}/r{t^{i}_{v}})} $$
(16)

where β and γ denote the relative importance that the user wants to give to feature selection and runtime over accuracy when comparing various methods. To determine whether Bu is better than, equal to, or worse than Bv on dataset Ai, \(EARR^{A_{i}}_{B_{u},B_{v}}\) is compared with \(EARR^{A_{i}}_{B_{v},B_{u}}\): a higher, equal, or lower value, respectively. In the experiments, the values of both β and γ are set to 0.1%, 1%, and 10%, representing the willingness of the user to trade accuracy for features as well as runtime in the proportions of 10:1, 1:1, and 1:10, respectively.
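A direct transcription of (15)-(16) is sketched below (Python; `acc`, `fs`, and `rt` are arrays holding the accuracy, percentage of selected features, and runtime of the N methods on one dataset, and β and γ are passed as fractions, e.g. 0.01 for the 1% scenario):

```python
import numpy as np

def earr(acc, fs, rt, beta=0.01, gamma=0.01):
    """EARR of each method against the others on one dataset, per (15)-(16)."""
    acc, fs, rt = map(np.asarray, (acc, fs, rt))
    n = len(acc)
    scores = np.zeros(n)
    for u in range(n):
        total = 0.0
        for v in range(n):
            if v == u:
                continue
            # Pairwise EARR of method u over method v, eq. (16).
            total += (acc[u] / acc[v]) / (
                1.0 + beta * np.log10(fs[u] / fs[v])
                + gamma * np.log10(rt[u] / rt[v]))
        scores[u] = total / (n - 1)               # average over the N-1 rivals, eq. (15)
    return scores
```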

For validation, the comparative analysis of the proposed strategies is also supported by statistical tests. We use the \({\chi ^{2}_{F}}\) Friedman test [49], a non-parametric approach which measures the average rank (ranki) of each method. The acceptance or rejection of the null hypothesis (i.e., that the performance of the methods is not statistically different) depends upon the p-value, which is estimated from the chi-square distribution. Additionally, the Holm post hoc test [9] is applied, taking the minimum ranked method as the control, to determine whether its performance is statistically different from the rest of the methods. Considering N methods, the test performs pairwise comparisons as \(z=(rank_{i}-rank_{j})/\sqrt {N(N + 1)/6M}\), where the method that has the minimum Friedman rank is taken as the control method. The p-value is calculated from the normal distribution using the value of z. The Holm step-down procedure compares the smallest p-value with α/(N − 1). If p is smaller, the hypothesis is rejected, and the next higher p-value is compared with α/(N − 2). If this hypothesis is also rejected, the next higher p-values are tried. The procedure continues until all p-values have been tried or a hypothesis is accepted.
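A sketch of the two tests is given below (Python with SciPy; `results` is an M×N array of the measured criterion for M datasets and N methods, larger values are assumed better, and the two-sided normal p-value is an illustrative choice since the text does not state the sidedness):

```python
import numpy as np
from scipy import stats

def friedman_holm(results, alpha=0.05):
    """Friedman test plus Holm post hoc comparisons against the best-ranked method.

    results : (M datasets x N methods) array; larger values are better.
    """
    M, N = results.shape
    # Friedman chi-square statistic and p-value over the N related samples.
    chi2, p_friedman = stats.friedmanchisquare(*[results[:, j] for j in range(N)])

    # Average Friedman ranks (rank 1 = best on each dataset, hence the negation).
    ranks = np.apply_along_axis(stats.rankdata, 1, -results).mean(axis=0)
    control = int(np.argmin(ranks))           # lowest average rank = control method

    # Pairwise z statistics and (two-sided) p-values against the control method.
    se = np.sqrt(N * (N + 1) / (6.0 * M))
    pvals = {j: 2 * stats.norm.sf(abs(ranks[j] - ranks[control]) / se)
             for j in range(N) if j != control}

    # Holm step-down procedure: smallest p vs alpha/(N-1), next vs alpha/(N-2), ...
    rejected = {}
    for i, (j, p) in enumerate(sorted(pvals.items(), key=lambda kv: kv[1])):
        if p < alpha / (N - 1 - i):
            rejected[j] = True                # significantly different from control
        else:
            break                             # stop at the first accepted hypothesis
    return chi2, p_friedman, ranks, control, rejected
```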

4.2 Parameter setting

The parameters of the clustering based FAST method are kept the same as in [26]. The parameter settings of the modified Ant Lion optimization algorithm for feature selection and feature weighting are presented in Table 2. The population size and the maximum number of iterations are set to 20 and 200, respectively [50], because searching the large solution space of high dimensional data requires more iterations to avoid stagnation. The value of w is kept the same as in [41], whereas the value of the blend crossover parameter is set to 0.5 [45]. The search range of the k neighborhood is set to [1, 11] experimentally.

Table 2 Parameters setting for ant lion optimization

4.3 Results and analysis

The results are evaluated and discussed based on classification accuracy, the percentage of selected features, and runtime.

4.3.1 Classification accuracy

Table 3 shows the classification accuracy of the nine methods on the high dimensional datasets along with the overall average accuracy. It also provides the accuracies when all features (FULLSET) are used for the evaluation of the datasets. Further, the standard deviation of the accuracies over the independent executions is also presented, except for the FULLSET and FFS methods due to their deterministic outcomes. The results show that the feature selection methods lag behind feature weighting in accuracy, which upholds the findings of Wettschereck et al. [12]. The maximum average accuracy is 83.15% for the feature selection approaches and 88.47% for the feature weighting approaches. Moreover, in the case of feature selection, the FAST filter method (72.99%) is unable to compete with its wrapper counterpart (83.15%) that uses Ant Lion optimization for selecting features. The overall best method is one of the proposed strategies, i.e., WFS-WFW, which achieves 85.64% accuracy. This method performs better than WFS, which is the second-best method. A comparison of WFS-WFW with the other proposed strategies, FFS-WFW, WFW-FFS, and WFW-WFS, shows accuracy gains of 7.93%, 14.06%, and 1.13%, respectively.

Table 3 Average classification accuracy (and standard deviation) of various methods on high dimensional data

We further perform the Friedman test followed by the Holm post hoc test to analyse the outcomes of the experiments. Figure 5a shows the Friedman ranks of the ten methods. The obtained p-value is 9.90E-11, which is less than the assumed significance level α = 0.05 and therefore shows significant differences in the classification accuracies of the methods. WFS-WFW attains the best rank, with a margin of 0.95 over the next-ranked method. Furthermore, taking the best method as the control method, the Holm post hoc test is applied to the rest of the methods, and the results are presented in Fig. 5b. It can be seen that the results of the best method are significantly better than those of the FULLSET, FFS, WFS, FFS-WFS, and WFW-FFS methods.

Fig. 5
figure 5

Statistical analysis for classification accuracy

4.3.2 Percentage of selected features

Table 4 outlines the percentage of selected features for all methods excluding FULLSET. The results show that feature selection outperforms feature weighting by selecting the fewest features. The results are expected because, in the feature selection approach, features are rejected completely even when their relevancy is low. The hybrid feature selection method reduces the most features, while ALO based feature weighting reduces only approximately 7% of the features. A large difference of more than 92% is observed when the best performing method (FFS-WFS) is compared with the worst performing method (WFW). Nonetheless, the FAST filter method lacks classification accuracy despite selecting the minimum number of features. The results also show the shortcoming of the feature weighting approach in not reducing features despite achieving good classification accuracy.

Table 4 Percentage of selected features by various methods

Further, we apply the Friedman test along with the post hoc Holm test to the results. Figure 6a shows the Friedman ranks of the nine methods. The p-value obtained for the test is 2.03E-15, which is less than the presumed significance level α = 0.05. Therefore, the percentages of selected features of the methods are significantly different. The overall best rank is obtained by the hybrid feature selection method, and this method acts as the control method in the Holm post hoc test to determine whether its performance is statistically different from the rest of the methods. The outcomes in Fig. 6b show that the FFS-WFS method is significantly better than WFS, WFW, SFSFW, WFS-WFW, and WFW-WFS.

Fig. 6
figure 6

Statistical analysis for percentage of selected features

4.3.3 Runtime

The runtimes after selecting a subset of features are shown in Table 5. The results are computed by evaluating the resultant data using the 10-fold cross-validation procedure over 20 independent executions and are reported in seconds. The outcomes show that the FFS-WFS method has the minimum execution time on the datasets, whereas the WFW method has the maximum execution time. FFS-WFS is 12 times faster than WFW, and this large difference in runtime could play a vital role when designing practical applications. Figure 7a shows the Friedman ranks of the nine methods. The obtained p-value for the test is 6.95E-15, which is less than the presumed significance level α = 0.05. Therefore, the runtimes of the methods are also significantly different. FFS-WFS attains the overall best rank, and therefore we use it as the control method in the Holm test. Figure 7b shows that the best method is significantly different from all methods except FFS and FFS-WFW.

Table 5 Runtime of the resultant datasets after selecting features with nine methods
Fig. 7
figure 7

Statistical analysis for runtime

Figure 8 outlines the comparison between accuracy and feature selection for the methods evaluated in this paper. The plot is shown in order of increasing average accuracy for a better understanding of the results. It can be seen that features selected through filter methods do not yield learning models as efficient as those of the wrapper methods. The filter methods remove a large number of features from the data, which results in a loss of reliable information as well. On the contrary, the wrapper methods keep the relevant features in the resultant data but are unable to eliminate most of the irrelevant and redundant features due to the limitations of optimization algorithms on high dimensional data. A sudden rise in the percentage of selected features along with accuracy can be observed. The proposed hybrid strategies, integrating the strengths of feature selection and feature weighting, yield more efficient learning models with fewer features (FFS-WFW and WFS-WFW).

Fig. 8
figure 8

Comparison between accuracy and percentage of the feature selected

From the outcomes of the above-discussed analysis approaches, we can conclude the following:

  1. No method is superior to the others in all three analysis approaches. However, the proposed strategies obtain the best ranks in classification accuracy and the second-best ranks in the percentage of selected features and runtime.

  2. No method is statistically significant in all three analysis approaches.

  3. The simultaneous feature selection and feature weighting method always lags behind in all three performance analysis approaches.

  4. The proposed strategies perform well in all three analysis approaches, but no single strategy emerges as superior.

Therefore, to determine the best method, we use the multi-criteria metric EARR for further analysis.

4.3.4 Analysis of results using multi-criteria metric

The outcomes of EARR are calculated by setting both β and γ to 0.1%, 1%, and 10%, as shown in Fig. 9. The EARR results show that the proposed strategies rank highest in all three scenarios, thereby confirming that an orderly combination of feature selection and feature weighting results in better performance than the existing work in the field. Furthermore, WFS-WFW emerges as the best choice in the 0.1% and 1% scenarios. The first scenario favors classification accuracy over feature selection and runtime, whereas the latter balances classification accuracy, feature selection, and runtime. In the 10% scenario, FFS-WFW obtains the highest rank, favoring feature selection as well as runtime over classification accuracy. Therefore, based on the observations of the EARR, we conclude that wrapper based FS followed by FW is the best choice when classification accuracy is the primary concern in high-dimensional datasets. In case runtime is the main concern, filter based FS followed by wrapper based FW is the optimal choice.

Fig. 9
figure 9

EARR calculated for three different scenarios

4.4 Comparison with other works

We also compare the proposed work with existing works based on classification accuracy, as shown in Table 6. Where 10-fold cross-validation is not used, the cross-validation method used for evaluating the models of these works is also indicated. The outcomes of the most accurate strategy (i.e., WFS-WFW) are compared with the other methods to determine the superiority of the proposed work. The well-known methods used for the comparison consist of filter (FRFS [51], KM-IG [52], SRFS [53], OSFSMI [54] and DCFS [55]), wrapper (CSO-kNN [56]), and hybrid (EFR-ESO [57], UFSMB-PSO [58] and IGIS [5]) feature selection methods, together with a wrapper based simultaneous feature selection and weighting method (DE-CHCGA [6]). Other included methods are MV-NNMF [59] and KNN-m0.5 [60], where the MV-NNMF method uses majority voting and non-negative matrix factorization to deal with high dimensional data, while the latter improves the performance of the NN classifier with a data-dependent dissimilarity measure. To provide a fair and unbiased comparison with the existing work, the accuracies of WFS-WFW are compared with those methods that have used 10-fold cross-validation on the NN classifier. The WFS-WFW strategy outperforms the other methods on 5 out of 8 datasets, excluding a tie case, as highlighted in the table. The main reason for the success of the proposed strategy is the orderly combination of feature selection and feature weighting. The earlier outcomes (Tables 3 and 4) indicate that feature selection yields less accurate models with fewer features, whereas feature weighting yields more accurate models but lacks dimensionality reduction. The WFS-WFW strategy, which combines both approaches, removes many unwanted features in the initial stage, which helps in the effective search of feature weights over the remaining data dimensions. This helps to achieve higher accuracy with fewer features as compared to other methods. Hence, combining the strengths of feature selection and feature weighting ensures the success of the proposed strategies. These outcomes confirm the validity of the work presented in the paper.

Table 6 Comparison of average accuracy with other methods

4.5 Limitations and future work

Feature weighting plays an important role in improving classification performance, but the concept of weighting the features does not generalize to all machine learning algorithms. The linear weight assignment method utilized in this study can be applied to various machine learning algorithms, including Naive Bayes (NB), Support Vector Machines (SVM), and Artificial Neural Networks (ANN). However, when working with tree structured learning algorithms, the weighting method should be selected carefully because some of these classifiers (ID3, CART, and Random Forests) are insensitive to monotone transformations of the features [61, 62]. Therefore, linear feature weighting cannot be applied, and only features that have zero weight values will affect the classification performance. Instead, feature weighting is employed in another manner for these classifiers, such as weighting the merit function [63] or the testing data only [64].

The work presented in the paper provides plenty of opportunities for developing new hybrid methods based on the feature selection and feature weighting approaches. Numerous filter and wrapper methods are available in the literature that will help researchers come up with new combinations of such hybrid methods. Furthermore, the choice of optimization algorithm is also a key aspect of wrapper methods for obtaining the best learning models. In the growing literature on optimization, several algorithms [65,66,67,68] are available which could enhance the performance even further. Moreover, the performance of these hybrid methods on other machine learning algorithms that are sensitive to feature weighting can be assessed.

Additionally, it would be interesting to investigate the performance of the proposed hybrid method on the problem of data streams. Data streams are a recent topic in machine learning that addresses real-world problems where data arrive continuously, causing an ever-growing dataset [69]. New instances arrive continuously, one by one or in batches, which makes the problem more complex than a static dataset. In this area, the balance between accuracy and computation cost is a prime concern because time is bounded by the arrival rate of instances and the learning algorithm needs to be updated constantly with new data. The high computational requirements therefore make wrapper methods infeasible for this problem, and Ramírez-Gallego et al. [70] found that no wrapper method had been proposed for this online setting. They further suggested that a combination of filter and wrapper methods of feature selection could be used to achieve better accuracy with lower computations. Since the filter feature selection-wrapper feature weighting strategy outperforms the hybrid feature selection method, it appears more suitable for the problem of data streams.

5 Conclusion

Feature selection has been explored more extensively than feature weighting, but feature weighting is more successful. However, feature weighting is the more computationally complex problem because the weights of the features are searched within a defined search space, whereas feature selection is a binarized version of feature weighting. In fact, feature selection is a subset of feature weighting, and therefore, theoretically, it will result in lower (average case) or equal (best case) performance compared with feature weighting.

The paper presents hybrid strategies using the feature selection and feature weighting approaches for high dimensional data with the NN classifier. Considering the FAST filter method and the Ant Lion optimization based wrapper method, four hybrid strategies are presented. The FAST method is used for feature selection only, whereas ALO is used for both feature selection and feature weighting. Moreover, we also used ALO for individual feature selection, individual feature weighting, hybrid feature selection, and simultaneous feature selection and feature weighting to reproduce the concepts of recent wrapper based methodologies. The datasets selected for the experiments represent classification problems from different areas of research to allow effective analysis and comparison with state of the art methodologies. We used the extended adjusted ratio of ratios metric to identify the best performing method, and the results are also analysed using statistical tests. The experiments with the nine methods show that the proposed hybrid strategies achieve better performance. The proposed strategies obtained the highest ranks in the three different scenarios of the multi-criteria metric. The work is further strengthened by the outcomes of the statistical tests in favor of the hybrid methods as well as by the accuracy based comparison with existing works. The hybrid wrapper feature selection-feature weighting method is best for applications where higher accuracy is the main concern, whereas the hybrid filter feature selection-wrapper feature weighting method is best for applications that need a low response time at the cost of some classification accuracy. In conclusion, feature weighting improves the learning models obtained from feature selection by either the filter or the wrapper method. Hence, the hybridization of feature selection and feature weighting is fruitful for achieving better classification accuracy with minimum features.