1 Introduction

Feature selection is one of the dimensionality reduction methods; it helps to choose the most relevant features before applying a learning algorithm [39]. The goal of this method is to obtain a subset of all available features that is more useful for the subsequent learning task. In other words, feature selection aims to eliminate irrelevant and redundant features that may introduce undesirable correlations into the subsequent (learning) process [26]. Feature selection approaches have brought significant improvements to both the learning performance and the computational efficiency of classification algorithms. Such improvements are particularly evident when feature selection approaches are wrapped around evolutionary computation (EC) techniques, which are known to be powerful global search methods [50]. The performance of a feature selection method is usually evaluated with a machine learning model. Deep learning (a sub-branch of machine learning) algorithms have become popular for the automatic recognition of digits and characters of different languages. Deep networks can be trained in a supervised mode requiring labels, or in an unsupervised way without the need for labels [5, 27]. Commonly used machine learning models include Naïve Bayes, KNN, C4.5, SVM, BPNN, RBF-NN, K-means, hierarchical clustering, and density-based clustering [38]. A good feature selection method should achieve high learning accuracy with low computational overhead (time and space complexity). Although there have been solid reviews on feature selection [45], they mainly focus on specific research fields within feature selection. Therefore, it is still worth comprehensively surveying recent advances in feature selection and discussing some future challenges.

Feature selection has been the subject of much research in supervised and unsupervised data engineering in recent decades [31], and existing methods for supervised classification are mainly divided into three categories: filter methods [22], wrapper methods [33], and embedded methods [54]. Wrapper methods use the predictive performance of a learning algorithm to guide the search for the feature subset with the best predictive performance. Since the number of candidate subsets may be extremely large and a new classifier must be trained for each updated subset, wrapper methods are usually computationally expensive, which makes them difficult to apply to datasets with a large number of instances or candidate features [13, 24, 51]. Embedded methods, despite being faster than wrapper methods, are still computationally heavy, and their feature selection results depend on the learning machine [13]. In contrast, filter feature selection methods score and rank the candidate features according to a certain criterion and extract one feature at a time to form a subset of a predefined size. They are also computationally cheaper and do not rely on a particular predictor [13].

Cluster analysis is an unsupervised learning method that groups a set of unlabeled objects into clusters such that the items within a cluster are more similar to one another than to the items in other clusters [52]. Clustering can help scientists analyze data and solve practical problems; thus, it has been widely used in disparate fields, including statistical analysis [6, 7], pattern recognition [30], information retrieval [17], and bioinformatics [16]. Traditional clustering algorithms can be divided into categories such as partitioning methods [11], hierarchical algorithms [32], density-based algorithms [9], intelligent traffic prediction [10], and graph-based methods [14].

Many feature selection methods use meta-heuristic, evolutionary, and swarm intelligence-based algorithms to avoid the growth in computational complexity on high-dimensional medical datasets. Meta-heuristic algorithms have been very successful in tackling many optimization problems, such as data mining, machine learning, engineering design, production tasks, and Feature Selection (FS) [48]. Meta-heuristic algorithms are general-purpose stochastic methods that can find a near-optimal solution within a reasonable time. Lately, various Swarm Intelligence (SI)-based meta-heuristics have been developed and have proved efficient in handling FS tasks in different fields [35, 41]. SI-based algorithms generally consist of a population of simple artificial agents. The concept is typically inspired by nature: each agent performs a simple task, but the local and partly random interactions between these agents lead to the emergence of “intelligent” global behavior that is unknown to the individual agents [39].

In the field of Machine Learning and Data Mining, many learning algorithms have been proposed that are primarily designed to handle discrete features. However, real-world data are often continuous in nature. Hence, discretization is a commonly used data preprocessing procedure that transforms continuous features into discrete ones [19]. It is the process of partitioning continuous variables into categories. Unfortunately, the number of ways to discretize a continuous attribute is infinite, and discretization is a potentially time-consuming bottleneck, since the number of possible discretizations is exponential in the number of interval threshold candidates within the domain. Discretization techniques are used by classification algorithms, genetic algorithms, and a wide range of other learning algorithms. The use of discrete values has a number of advantages:

  • Discrete features require less memory space.

  • Discrete features are often closer to a knowledge-level representation.

  • Data can be reduced and simplified through discretization, making them easier to understand, use, and explain.

  • Learning can be more accurate and faster using discrete features.

The focus of this paper is on proposing a hybrid filter feature selection method that combines the concepts of clustering, NMI, and IAMB, based on MRMR. Three feature clustering methods are used in this study, i.e., the Ward method, equal width discretization, and k-means. The selection of these clustering methods was based on the following criteria. The Ward method is robust to noise and biased towards globular clusters. Equal width discretization is straightforward and simple to apply; it works solely with the smallest and largest value of each feature and the number of samples. k-means is very fast compared to many other clustering algorithms. In addition, feature selection based on learning the IAMB does not depend on a specific classifier. Moreover, IAMB does not need to know the number of features prior to applying feature selection; it is also globally optimal and less prone to over-fitting than other existing methods. The contributions of the proposed method are summarized as follows:

  • To convert features with continuous values into the corresponding features with discrete values using the k-means, complete link, and equal width discretization clustering algorithms, and to find the number of clusters of each feature by normalized mutual information based on the labels.

  • To select the most informative features using the concept of Maximum Relevance Minimum Redundancy (MRMR) within a filter method via predefined thresholds.

  • To apply Incremental Association Markov Blanket Feature Selection (IAMB) to automatically select the features.

The rest of the paper is organized as follows. Section 2 reviews recent studies on feature selection approaches, including wrapper, embedded, and filter approaches. Section 3 presents the preliminaries. Section 4 explains the proposed methodology. Section 5 introduces the experimental data and setup and provides a summary of the experimental results and related discussions. Finally, Section 6 presents the conclusion and directions for future research.

2 Related work

In this section, recent studies conducted on the feature selection methods are reviewed. Sheng et al. introduced a niching memetic approach (NMA-CFS) to clustering and feature selection. In NMA-CFS, a type of variable-string length genetic algorithm was used as an optimizer, and a kind of within-cluster scatter was utilized as an evaluation criterion to find the correct number of clusters. NMA-CFS could find the exact number of clusters using a smaller number of features, but the datasets used in their experiments mostly involved a small number of clusters and features [46].

Lensen et al. put forward a comparative study of medoid-based and centroid-based encoding schemes on the PSO framework for clustering and feature selection. Although the medoid-based encoding scheme performed better than the centroid-based one, in most cases, it could not obtain the optimal number of clusters [28].

Lensen et al. proposed a three-stage PSO-based clustering and feature selection approach. In the first stage, an initial number of clusters is estimated using the Silhouette index. In the second stage, the evolutionary process is carried out with the help of this initial cluster number. Finally, the best solution, previously represented by the medoid-based encoding, is converted to a centroid-based encoding to find the optimal clusters. Like the previous approaches, this one also could not properly estimate the optimal number of clusters on most datasets [29].

Rostami et al. proposed a PSO-based multi-objective feature selection method involving three main steps. In the first step, the original features are represented as a graph model. In the second step, the centralities of all nodes in the graph are calculated, and finally, in the third step, an improved PSO-based search process is applied to perform the final feature selection. The novel approach employed in their study evaluates a feature subset by combining a feature separability index, similarity, and the feature subset size. Although the results on five medical datasets indicated that the proposed method improves on previous related methods in terms of efficiency and effectiveness, the method relies on several user-specified parameters whose values must be determined by the user. Given that the accuracy of the learning model must be calculated to evaluate each combination of parameter values, this approach is not applicable to situations where the construction of the learning model has a high computational complexity [39].

In another work, Rostami et al. proposed the Community Detection-based Genetic Algorithm for Feature Selection (CDGAFS), which operates in three stages. The similarities of the features are calculated in the first stage. The features are then grouped into clusters by community detection algorithms in the second stage. Finally, in the third stage, features are selected by a genetic algorithm with a new community-based repair operation. Although the proposed method offers higher efficiency and faster convergence than other feature selection methods, it also relies on several user-specified parameters whose values must be determined by the user. As a result, choosing the best values for the parameters is itself an optimization problem [42].

In another study, Rostami et al. proposed a novel pairwise constraints-based method for feature selection. In the proposed method, the similarity between pair constraints is calculated and an uncertainty region is created based on it. Then, in an iterative process, the most informative pairs are selected. The proposed method was compared to different supervised and unsupervised feature selection approaches, including LS, GCNC, FJUFS, FS, FAST, FJMI, and PCA. The reported findings indicated that, in most cases, their approach was more accurate and selected fewer features [40]. Alirezanejad et al. proposed two heuristic filter methods for gene selection, namely Xvariance and Mutual Congestion (MC). Xvariance relies on internal properties of the features such as variance and mean, while Mutual Congestion is based on the frequency of features. Experimental results showed that Mutual Congestion increased the accuracy of basic classifiers on subsequent datasets, whereas Xvariance achieved significant results on standard datasets. Comparisons with state-of-the-art methods confirmed that their methods performed better than the existing ones. However, even though Xvariance and Mutual Congestion achieved significant results, both methods selected the 10 best features of the dataset as the final feature subset; that is, Alirezanejad et al. used a fixed threshold in their work [8].

Abbasabadi et al. adopted an ensemble feature selection approach on the premise that combining a number of feature selection methods yields more reliable results than using just one. To perform ensemble feature selection, a combination method is needed to merge the feature rankings produced by the various algorithms into a single rank for each feature, and a threshold must be established to obtain a functional subset of features. Their study proposes Automatic Thresholding Feature Selection (ATFS), a three-step ensemble feature selection method. The first step is diversity generation, in which various rankers are applied to each dataset to produce different feature rankings. In the second step, fast non-dominated sorting is used to combine the output rankings of the individual selectors, giving the proposed ensemble its automatic thresholding capability. In the third step, feature sets are generated to create the best feature set. In addition, Sorted Label Interference (SLI), a new filter technique built on the interference between class labels, is suggested. Both SLI and ATFS can be applied to binary datasets [1].

Nematzadeh et al. introduced Whale-Mutual Congestion (WMC) as a hybrid filter feature selection method for medical datasets. It was shown that Mutual Congestion (MC) can predict class labels well. The authors also demonstrated how the whale algorithm could improve the Mutual Congestion performance by removing half of the trivial features. The accuracy, sensitivity, specificity, and MC of the proposed method were calculated on the top 10 subsets of features using majority voting. To further evaluate the proposed method, whale effectiveness analysis, box diagram analysis, and whale convergence analysis were performed. The results revealed that the proposed method can achieve superior results in most cases in terms of accuracy and subset size. Similar to the Xvariance and Mutual Congestion methods proposed by Alirezanejad et al., WMC had a fixed threshold 𝜏 = 10 and could not automatically select the final feature subset [34].

Abbasabadi et al. also proposed a hybrid method combining a proposed filter feature selection (SLI-γ) with a wrapper GA-based feature selection approach known as GArank&rand. In the first phase, SLI-γ was used to eliminate 99% of the unnecessary features. The solutions of the first phase were then optimized by GA using the most pertinent features ranked by SLI-γ. For the evaluation of the measurement criteria, the paper used 11 well-known datasets, including 4 standard datasets and 7 high-dimensional datasets. The experimental findings demonstrated that SLI-γ was able to outperform the other rankers considered in that study across all datasets. Additionally, SLI-γ had a large impact on the GA's performance in terms of classification precision and the number of features chosen. Furthermore, when 1% of the highest-ranked features were chosen for the GA population generation, the execution time of GArank&rand was significantly reduced [2].

Al-Batah et al. applied the Correlation-based Feature Selection (CFS) algorithm to reduce the dimensionality of the data and find a set of discriminatory genes. The Decision Table, JRip, and OneR classifiers were then employed for the classification process. They showed that feature selection by CFS improves not only the efficiency but also the accuracy of the classification process. However, the final feature subset selected by CFS still contained a comparatively large number of genes [4].

Sadeghian et al. first proposed the Information Gain Binary Butterfly Optimization Algorithm (IG-bBOA) to circumvent the constraints of S-bBOA. In addition to improving classification accuracy, IG-bBOA maximizes the mean of the mutual information between the features and the class labels. They then used IG-bBOA within the three-stage Ensemble Information Theory-based Binary Butterfly Optimization Algorithm (EIT-bBOA) to decrease the number of selected features. The first stage employs the Minimal Redundancy-Maximum New Classification Information (MRMNCI) feature selection technique to remove 80% of the unnecessary and redundant features. In the second stage, IG-bBOA is used to select the best subset of features. Finally, the ultimate feature subset is selected using a similarity-based ranking system [43].

Brankovic et al. introduced a novel feature selection approach that employs the distance correlation (dCor) as a criterion for evaluating the dependence of the class on a given feature subset (D2CORFS). The dCor index provides a reliable dependence measure among random vectors of arbitrary dimension, without any assumption on their distribution. The dCor index appears to be a particularly robust criterion with respect to the overfitting and redundancy issues that are common with multivariate filter methods. A distributed combinatorial optimization scheme was used to handle the severe asymmetry of microarray datasets by dividing the feature set into several feature bins and running the FS algorithm independently on each of them [12].

Many feature selection algorithms are founded on the mutual information and the three-dimensional mutual information (TDMI) between the features and the class label. The performance of feature selection can suffer when these algorithms do not take the TDMI among features into account. In light of this issue, Xiangyuan et al. studied feature selection based on the TDMI among features. Following the maximal relevance minimal redundancy criterion, the joint mutual information between the class label and the feature set is used to describe relevance, and the mutual information between feature sets is utilized to describe redundancy. The mutual information between feature sets, as well as that between the class label and the feature set, is then decomposed, and the TDMI among features is taken into account during the decomposition in order to produce an objective function. Finally, a feature selection algorithm based on conditional mutual information for maximal relevance with minimal redundancy (CMI-MRMR) was proposed [23].

A novel distance-based relevancy-redundancy measure was presented by Hallajian et al. They use an unsupervised method based on the mRMR criterion, as well as a supervised approach in which the features are ranked according to the separation between each pair of samples belonging to different classes in the feature vectors. To select the most pertinent feature subset, an ensemble of the proposed supervised and unsupervised methods was used. The effectiveness of the proposed feature selection methods was examined and compared using 24 distance measures drawn from five major families of distance functions. The highest-ranked features are chosen based on an empirically determined criterion. Three classifiers, Decision Tree, Support Vector Machine, and Naïve Bayes, were applied to biomedical datasets representing binary problems from the UCI data repository to assess the selected features [25].

Thejas et al. proposed a hybrid approach with two distinct stages, namely feature ranking and feature selection. The proposed method is independent of the number of class labels and uses Random Forest as the classifier. The main idea of the feature ranking stage is to cluster the features using mini-batch k-means, which performs clustering on batches of data. Each cluster is scored in [0, 1] using Normalized Mutual Information (NMI); the higher the score, the better the candidate feature is for classification. The feature ranking list is constructed by calculating the cluster scores separately for all features; the features are then sorted to create the ranking list. The feature selection stage comprises two approaches: Feature Inclusion and Least Ranked Feature Exclusion. Feature Inclusion exploits a process called Mini-Batch k-means Normalized Mutual Information Feature Inclusion (KNFI), while Least Ranked Feature Exclusion uses a process called Mini-Batch k-means Normalized Mutual Information least ranked Feature Exclusion (KNFE). KNFE achieves better results when there is little relationship among features, whereas KNFI obtains acceptable results on the majority of the datasets. There was no evidence that their proposed method could be applied to high-dimensional datasets [49]. Table 1 summarizes the reviewed works in tabular form.

Table 1 Outlining the reviewed feature selection methods

Unlike WOA-MC [34], Xvariance [8], and D-PSO Scaled [29], which obtained the final feature subset manually (with a pre-defined threshold), this paper proposes a hybrid filter method that automatically selects the optimal feature subset. Although the feature selection methods proposed by D-PSO Scaled [29] and KNFI-KNFE [49] achieved acceptable results, their applicability decreases when the number of features is considerably high (i.e., greater than the number of samples). This paper investigates feature selection in high-dimensional medical datasets with any number of class labels. The majority of the works in Table 1 produce a moderate selected subset length (MODE-CFS [26], PSO(4–2) [50], DHSTNet [6], NMA-CFS [46], Dynamic Medoid PSO [28], D-PSO Scaled [29], Xvariance [8], WOA-MC [34], IG-bBOA [43]) or a high selected subset length, at least on some datasets (MPSONC [39], MHBPSO1 [24], CDGAFS [42], ATFS [1], GArank&rand [2], CFS [4], CMI-MRMR [23], MRmMC [25], KNFI-KNFE [49]). The subset length obtained in this research is considerably smaller than those of the existing works.

The data are first clustered using k-means, complete link, and equal width discretization to generate diverse clusterings within an unsupervised approach. Next, the number of clusters of each feature is found by Normalized Mutual Information (NMI) based on the labels, and the clustering with the maximum NMI value is selected for each feature. Finally, the optimal set of features is selected automatically through Incremental Association Markov Blanket feature selection (IAMB).

3 Preliminaries

Considering that our method applies three clustering methods, namely the Ward method, k-means, and Equal Width Discretization, followed by NMI, we discuss these concepts below.

The objective of the Ward method, also known as the Minimum Variance Method (MVM), is to reduce the sum of squared errors among individual clusters. It measures the distance between a pair of clusters in two ways. First, it estimates the distance between two individual clusters (xi, xj), each containing a single data object, based on the squared Euclidean distance, as expressed in Eq. (1):

$$ d\left({x}_i,{x}_j\right)={\left\Vert {x}_i-{x}_j\right\Vert}^2 $$
(1)

Second, it calculates the distance between the merged cluster (xi ∪ xj) and a new cluster xk based on the Lance-Williams method, defined as follows:

$$ d\left({x}_{i,j},{x}_k\right)=\frac{N_i+{N}_k}{N_i+{N}_j+{N}_k}d\left({x}_i,{x}_k\right)+\frac{N_j+{N}_k}{N_i+{N}_j+{N}_k}d\left({x}_j,{x}_k\right)-\frac{N_k}{N_i+{N}_j+{N}_k}d\left({x}_i,{x}_j\right) $$
(2)

where xi,j denotes the newly merged cluster, and Ni, Nj, and Nk represent the sizes of the ith, jth, and kth clusters, respectively [47].

Agglomerative hierarchical methods include, but are not limited to, Ward's Link, Average Link, Complete Link, and Single Link. In this work, we focus on Ward's Link. For more information on clustering algorithms, we recommend Chapter 14 of Rencher's Multivariate Analysis. Ward's clustering algorithm optimizes the within-cluster sum of squares (WSS): the change in WSS is estimated for each pairwise combination of clusters, and the merge that provides the smallest contribution to the WSS measure is carried out. The Ward method is robust against outliers and provides even clusters across the dataset, making it a reliable algorithm for exploratory work such as the study in [20].
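As an illustration of how a single feature's continuous values could be grouped with Ward's linkage, the following is a minimal sketch using SciPy (not the authors' MATLAB implementation); the feature vector and the number of clusters are hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 1-D feature (e.g., one column of a gene-expression matrix).
rng = np.random.default_rng(0)
feature = np.concatenate([rng.normal(0.0, 0.3, 30), rng.normal(2.5, 0.3, 30)])

# Ward's linkage minimizes the increase in the within-cluster sum of squares
# at every merge (the Lance-Williams update of Eq. (2)).
Z = linkage(feature.reshape(-1, 1), method="ward")

# Cut the dendrogram into k clusters; in the proposed pipeline k would be
# tied to the dataset's number of class labels (assumed here to be 2).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```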

The k-means algorithm is a simple iterative clustering algorithm. Using a distance (typically the Euclidean distance) as the metric and given the number of clusters K, the k-means algorithm calculates the centroids and assigns the cluster members. Assuming an arbitrary dataset with n multi-dimensional data points, a predefined number of clusters K, and the Euclidean distance as the similarity index, k-means attempts to minimize the within-cluster sum of squares in Eq. (3) [56]:

$$ d=\sum \limits_{k=1}^{K}\sum \limits_{i=1}^n{\left\Vert {x}_i-{u}_k\right\Vert}^2 $$
(3)

where uk represents the kth centroid, and xi represents the ith point in the dataset. The centroid uk is obtained from the derivative expressed in Eq. (4):

$$ \frac{\partial d}{\partial {u}_k}=\frac{\partial }{\partial {u}_k}\sum \limits_{k=1}^{K}\sum \limits_{i=1}^n{\left({x}_i-{u}_k\right)}^2=\sum \limits_{i=1}^n\frac{\partial }{\partial {u}_k}{\left({x}_i-{u}_k\right)}^2=-2\sum \limits_{i=1}^n\left({x}_i-{u}_k\right) $$
(4)

Setting Eq. (4) to zero yields \( {u}_k=\frac{1}{n}\sum \limits_{i=1}^n{x}_i \).

The basic idea of the k-means algorithm is to randomly extract K data points from the sample set of n multi-dimensional data points as the centroids of the initial clusters. The distances of the remaining samples (data points) to these centroids are calculated, and the cluster members are assigned. New centroids are then derived from the resulting clusters, and the members of each cluster are updated. The k-means algorithm repeats this process until the centroids of the clusters remain unchanged. The result of the k-means algorithm directly depends on the initial centroids [37].
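A minimal k-means sketch on a single feature, assuming scikit-learn is available (the sample data and the choice of K are illustrative, not the paper's):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical continuous feature values for n samples.
rng = np.random.default_rng(1)
feature = np.concatenate([rng.normal(-1.0, 0.2, 25), rng.normal(1.0, 0.2, 25)])

# K is assumed here to match the number of class labels, as in the
# proposed pipeline; 2 is used purely for illustration.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
discrete_labels = km.fit_predict(feature.reshape(-1, 1))

print(km.cluster_centers_.ravel())  # the centroids u_k of Eq. (3)
print(discrete_labels[:10])         # the discretized version of the feature
```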

Algorithm 1.
figure a

Equal Width Discretization algorithm.
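Since Algorithm 1 is provided only as a figure, the following is a minimal sketch of equal width discretization under the assumption that the number of bins is given (e.g., tied to the number of class labels); the bin edges are derived solely from the minimum and maximum of the feature:

```python
import numpy as np

def equal_width_discretize(feature: np.ndarray, n_bins: int) -> np.ndarray:
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = feature.min(), feature.max()
    width = (hi - lo) / n_bins
    # np.floor maps each value to its interval index; clip keeps the
    # maximum value inside the last bin.
    bins = np.floor((feature - lo) / width).astype(int)
    return np.clip(bins, 0, n_bins - 1)

feature = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.05])
print(equal_width_discretize(feature, n_bins=3))  # [0 1 1 2 2 0]
```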

The mutual information (MI) is used to assess how dependent two random variables are. A larger MI indicates a stronger dependency between the two random variables X and Y; conversely, MI = 0 shows that the two random variables are completely independent. Let x ∈ X and y ∈ Y be two random variables, p(x, y) be the joint probability density function of X and Y, and p(x) and p(y) be the corresponding marginal probability density functions; then the mutual information between X and Y can be expressed as in Eq. (5) [44].

$$ MI\left(X,Y\right)=\sum \limits_{x\in X}\sum \limits_{y\in Y}p\left(x,y\right)\log \frac{p\left(x,y\right)}{p(x)p(y)} $$
(5)

Using Eq. (5) and the entropy H(X) =  − ∑x ∈ Xp(x)logp(x), the Normalized Mutual Information (NMI) is calculated using Eq. (6) as follows.

$$ NMI\left(X,Y\right)=\frac{MI\left(X,Y\right)}{\mathit{\min}\left\{H(X),H(Y)\right\}} $$
(6)
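A small sketch of Eqs. (5) and (6) for discrete variables, written in plain NumPy with natural-logarithm entropies (an illustrative implementation, not the authors' code):

```python
import numpy as np

def entropy(v: np.ndarray) -> float:
    """H(V) = -sum p(v) log p(v) for a discrete variable."""
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_information(x: np.ndarray, y: np.ndarray) -> float:
    """Eq. (5): MI(X, Y) over the empirical joint distribution."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

def nmi(x: np.ndarray, y: np.ndarray) -> float:
    """Eq. (6): MI normalized by the smaller marginal entropy."""
    return mutual_information(x, y) / min(entropy(x), entropy(y))

x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([0, 0, 1, 1, 1, 1])
print(round(nmi(x, y), 3))  # 1.0: y is fully determined by x
```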

In statistics, the quantile function plays a crucial role in describing probability distributions. It is indispensable in determining the location and spread of any given distribution, especially the median, which is resistant to extreme values or outliers. The quantile function is used extensively in the simulation of non-uniform random variables, and details of its use in modeling, statistical, reliability, and survival analysis can be found in [36]. A quantile defines a particular part of a data set; it determines how many values in a distribution lie above or below a certain limit. Special quantiles are the quartiles (quarters), quintiles (fifths), and percentiles (hundredths).

The probability density function and the cumulative distribution function of the chi-square distribution are given by Eqs. (7) and (8).

$$ f(x)=\frac{1}{2^{\frac{k}{2}}\Gamma \left(\frac{k}{2}\right)}{x}^{\frac{k}{2}-1}{e}^{-\frac{x}{2}},\quad k>0,\ x\in \left[0,+\infty \right) $$
(7)
$$ F\left(x,k\right)=\frac{\gamma \left(\frac{k}{2},\frac{x}{2}\right)}{\Gamma \left(\frac{k}{2}\right)}=P\left(\frac{k}{2},\frac{x}{2}\right) $$
(8)

where γ(·, ·) is the lower incomplete gamma function and P(·, ·) is the regularized gamma function.

The quantile (Q) approach was used to obtain the second-order nonlinear differential equation. Q is applied to distributions whose Cumulative Distribution Function (CDF) is monotonically increasing and absolutely continuous. The chi-square distribution is one such distribution, and its quantile function is expressed in Eq. (9):

$$ Q(p)={F}^{-1}(p) $$
(9)

where the function F−1(p) is the compositional inverse of the CDF. Suppose the Probability Density Function (PDF) f(x) is known and differentiable. The first-order quantile equation is obtained by differentiating Eq. (9), which gives Eq. (10):

$$ {Q}^{\prime }(p)=\frac{1}{F^{\prime}\left({F}^{-1}(p)\right)}=\frac{1}{f\left(Q(p)\right)} $$
(10)
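As a small numerical illustration of Eq. (9), the chi-square quantile can be evaluated with SciPy (the degrees of freedom and the probability level below are arbitrary):

```python
from scipy.stats import chi2

k = 4     # degrees of freedom (illustrative)
p = 0.95  # probability level

# Q(p) = F^{-1}(p): the chi-square quantile function of Eq. (9).
q = chi2.ppf(p, df=k)

# Sanity check: applying the CDF of Eq. (8) to the quantile recovers p.
print(q, chi2.cdf(q, df=k))  # ~9.488 and 0.95
```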

Maximum Relevance and Minimum Redundancy (MRMR) is an efficient variable selection method with confirmed successful results on biological datasets. MRMR finds the most informative features based on their correlation with the class label while keeping the redundancy among features to a minimum. In other words, in the MRMR method, each feature is ranked based not only on its relevance to the target variable, but also on its redundancy within the feature set [3].

Assuming fi is the ith feature and c is the class label, the Maximum-Relevance criterion selects the top m features most relevant to the class label according to Eq. (11).

$$ {\mathit{\max}}_S\frac{1}{\left|S\right|}{\sum}_{f_i\in S}I\left({f}_i;c\right) $$
(11)

To eliminate the redundancy among features, a Minimum-Redundancy criterion was defined in Eq. (12).

$$ {\mathit{\min}}_S\frac{1}{\left|S\right|}{\sum}_{f_i,{f}_j\in S}I\left({f}_i;{f}_j\right) $$
(12)

A sequential incremental algorithm is used to optimize Eqs. (11) and (12) simultaneously, as expressed in Eq. (13). Assuming G is the set of all features and Sm-1 is the set of m − 1 already-selected features, the task is to select the mth feature from G − Sm-1 so as to maximize the single-variable relevance minus the redundancy term in Eq. (13).

$$ {\mathit{\max}}_{f_i\in G-{S}_{m-1}}\left(I\left({f}_i;c\right)-\frac{1}{m-1}{\sum}_{f_j\in {S}_{m-1}}I\left({f}_i;{f}_j\right)\right) $$
(13)
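A sketch of the incremental selection in Eq. (13) on an already-discretized feature matrix, using scikit-learn's `mutual_info_score` for the MI terms (the toy data and the subset size m are illustrative):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_select(X_disc: np.ndarray, y: np.ndarray, m: int) -> list:
    """Greedy MRMR in the spirit of Eq. (13) on discretized features."""
    n_features = X_disc.shape[1]
    relevance = [mutual_info_score(X_disc[:, j], y) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]           # start with the most relevant
    while len(selected) < m:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info_score(X_disc[:, j], X_disc[:, s])
                                  for s in selected])
            score = relevance[j] - redundancy        # relevance minus redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Toy discretized data: 6 samples, 4 features, binary labels (illustrative).
X_disc = np.array([[0, 1, 0, 2], [0, 1, 1, 0], [1, 0, 0, 1],
                   [1, 0, 1, 2], [0, 1, 0, 0], [1, 0, 1, 1]])
y = np.array([0, 0, 1, 1, 0, 1])
print(mrmr_select(X_disc, y, m=2))
```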

4 Proposed method

The general steps of the proposed method (BC-NMIQ-IAMB) are shown in Fig. 1. In this method, a diversity of clusterings is first generated (using Equal Width Discretization, k-means, and Ward Link) for each feature of dataset D. Then, the Best Clustering (BC) for each feature is selected using the normalized mutual information with respect to the class labels (L). Next, the features are ranked based on the concept of MRMR, and the initial best features are selected using a predefined threshold, namely the square root of the number of features. Finally, the proposed BC-NMIQ is improved with IAMB. The measurement criteria were applied to BC-NMIQ-IAMB as the main proposed method and also to BC-NMIQ as the subsidiary proposed method. Table 2 shows the notations used in the paper.

Fig. 1
figure 1

Flowchart of feature gene selection and classification prediction

Table 2 Notations used in the paper and their descriptions

First, each feature of the dataset (Fi) is clustered using the k-means, Ward Link, and Equal Width Discretization clustering algorithms, with k set equal to the number of clusters of the main dataset so that each feature (Fi) receives the same number of clusters. Then, the clustering with the maximum NMI with respect to the response variable (L) is selected for each Fi to construct \( {C}_i^{\ast } \). Therefore, Fi with continuous feature values is changed into the corresponding clusters with discrete values and different diversities in \( {C}_i^{\ast } \), as shown in Fig. 2.

Fig. 2
figure 2

Converting continuous values to discrete values

Algorithm 2 first clusters the features with k-means, Ward Linkage, and Equal Width Discretization, and then selects the optimal clustering for each feature using NMI.

Algorithm 2.
figure b

Feature clustering.

The Best Clustering (BC) for each feature is selected using the maximum NMI, as in Eq. (14):

$$ {C}_i^{\ast }=\max {\left\{ NMI\left({C}_{ri},L\right)\right\}}_{r=1}^{p-1},\qquad i=1,2,\dots, m $$
(14)
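The selection of the best clustering per feature could be sketched as follows, combining the three discretizers above and scikit-learn's NMI with the minimum-entropy normalization of Eq. (6) (the data, the value of k, and the helper names are illustrative, not the authors' code):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import normalized_mutual_info_score as nmi_score

def candidate_clusterings(feature: np.ndarray, k: int):
    """Three discretizations of one feature: k-means, Ward link, equal width."""
    col = feature.reshape(-1, 1)
    yield KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(col)
    yield fcluster(linkage(col, method="ward"), t=k, criterion="maxclust")
    lo, hi = feature.min(), feature.max()
    yield np.clip(np.floor((feature - lo) / ((hi - lo) / k)).astype(int), 0, k - 1)

def best_clustering(feature: np.ndarray, labels: np.ndarray, k: int) -> np.ndarray:
    """Eq. (14): keep the candidate clustering with maximum NMI w.r.t. L."""
    return max(candidate_clusterings(feature, k),
               key=lambda c: nmi_score(labels, c, average_method="min"))

# Illustrative feature and class labels.
rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 20)
feature = np.where(labels == 0, rng.normal(0, 0.3, 40), rng.normal(2, 0.3, 40))
print(best_clustering(feature, labels, k=2))
```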

Next, using the concept of the normalized mutual information quantile (NMIQ), the most informative features are selected with a filter approach in Algorithm 3, which uses an incremental search method to find the optimal features. The normalized MI between fi and fs, NMI(fi; fs), is calculated using Eq. (15).

$$ NMI\left({f}_i;{f}_s\right)=\frac{I\left({f}_i;{f}_s\right)}{\mathit{\min}\left\{H\left({f}_i\right),H\left({f}_s\right)\right\}} $$
(15)

In this paper, we propose to use the average normalized MI as a measure of the redundancy between the ith feature and the subset of selected features S = {fs}, s = 1, …, |S|, as in Eq. (16).

$$ \frac{1}{\mid S\mid}\sum \limits_{f_s\in S} NMI\left({f}_i;{f}_s\right) $$
(16)

where ∣S∣ is the cardinality of the set S. Equation (16) is a kind of correlation measure that is symmetric and takes values in [0, 1]. A value of 0 indicates that the feature fi and the subset S of selected features are independent, whereas a value of 1 indicates that the feature fi is highly correlated with all features in the subset S.

The selection criterion used in NMIFS selects the feature that maximizes the measure G in Eq. (17).

$$ G\dot{=} NMI\left(C;{f}_i\right)-\frac{1}{\left|S\right|}\sum \limits_{f_s\in S} NMI\left({f}_i;{f}_s\right). $$
(17)

The second term on the right-hand side of Eq. (17) is an adaptive redundancy penalization term, which corresponds to the average normalized MI between the candidate feature and the set of already selected features [21].
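A compact sketch of the criterion in Eq. (17), assuming discretized features and the minimum-entropy normalization used in Eq. (15); only the scoring step is shown, since the incremental loop mirrors the MRMR sketch above:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi_score

def criterion_g(candidate, class_labels, selected):
    """G = NMI(C; f_i) - mean over f_s in S of NMI(f_i; f_s)   (Eq. (17))."""
    relevance = nmi_score(class_labels, candidate, average_method="min")
    if not selected:
        return relevance
    redundancy = np.mean([nmi_score(candidate, fs, average_method="min")
                          for fs in selected])
    return relevance - redundancy

# Illustrative discretized features and class labels.
y  = np.array([0, 0, 1, 1, 0, 1])
f1 = np.array([0, 0, 1, 1, 0, 1])   # perfectly relevant
f2 = np.array([0, 0, 1, 1, 1, 1])   # relevant but partly redundant with f1
print(criterion_g(f2, y, selected=[f1]))
```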

The selection criterion thus picks the feature that maximizes this measure. The complete procedure is given in Algorithm 3.

Algorithm 3.
figure c

BC-NMIQ.

Algorithm 3 shows how the feature subset (S) is incrementally selected using the BC-NMIQ algorithm. Finally, the Incremental Association Markov Blanket Feature Selection (IAMB) takes the subset of features (S) calculated by Algorithm 3 as input and generates the final feature subset automatically in Algorithm 4.

Algorithm 4.
figure d

BC-NMIQ-IAMB

Assuming a dataset with n instances and m features, Algorithms 2, 3, and 4 have time complexities of \( O\left({n}^{\frac{3}{2}}\times \log n\times m\right) \), \( O\left({m}^{\frac{3}{2}}\times n\log n\right) \), and \( O\left({m}^2\right) \), respectively.
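Since Algorithm 4 is given only as a figure, the following is a rough sketch of the IAMB idea (grow a candidate Markov blanket of the class variable, then shrink away false positives). It substitutes a fixed conditional-mutual-information threshold `eps` for the statistical conditional-independence test that IAMB normally uses, so it should be read as an approximation, not the authors' implementation:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def conditional_mi(x, t, z_cols):
    """I(X; T | Z) estimated on discrete data by conditioning on each z value."""
    if not z_cols:
        return mutual_info_score(x, t)
    z = np.array([" ".join(map(str, row)) for row in np.column_stack(z_cols)])
    cmi = 0.0
    for zv in np.unique(z):
        mask = z == zv
        cmi += mask.mean() * mutual_info_score(x[mask], t[mask])
    return cmi

def iamb(X_disc, y, eps=0.02):
    """Sketch of IAMB: growing phase, then shrinking phase."""
    mb = []
    changed = True
    while changed:                                   # growing phase
        changed = False
        scores = [(conditional_mi(X_disc[:, j], y, [X_disc[:, s] for s in mb]), j)
                  for j in range(X_disc.shape[1]) if j not in mb]
        if scores:
            best_score, best_j = max(scores)
            if best_score > eps:
                mb.append(best_j)
                changed = True
    for j in list(mb):                               # shrinking phase
        rest = [X_disc[:, s] for s in mb if s != j]
        if conditional_mi(X_disc[:, j], y, rest) <= eps:
            mb.remove(j)
    return mb

# Illustrative discretized data: only feature 0 carries the class signal.
rng = np.random.default_rng(3)
y = rng.integers(0, 2, 200)
X_disc = np.column_stack([y, rng.integers(0, 2, 200), rng.integers(0, 3, 200)])
print(iamb(X_disc, y))  # expected: [0]
```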

5 Experimental results

In this work, the hybrid filter method was used to increase the accuracy of the machine learning model. Ten benchmark medical datasets (including binary and multi-labeled) were used to evaluate the proposed method. The evaluations are in terms of the number of selected features and the classification accuracy. Prior to introducing the datasets in Section 5.2, the classifiers used to calculate the measurement criteria are described in Section 5.1. Finally, the implementation results and the comparative results are given in Section 5.3.

5.1 Experimental setup

The experiments in this study were carried out using a standard PC with Intel Core i5, CPU 2.7 GHz, and GeForce GTX 1080 GPU, with 8 GB memory. All the experiments were performed in MATLAB using the MATLAB–R2018 libraries.

In this paper, the performance evaluation was done by randomly partitioning the original datasets into training and test sets using stratified 10-fold cross-validation. The system was tuned in the validation phase so that the hyper-parameters, including the depth of the decision tree, the number of trees in the random forest, the distance in the Ward linkage, and the value of k in KNN, were cross-validated for each dataset. Cross-validation prevents the model from being too complex (possible over-fitting) or too simple (possible under-fitting).
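A minimal sketch of this evaluation protocol with scikit-learn; the classifier and its parameter grid are placeholders rather than the paper's exact settings:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate(X_selected: np.ndarray, y: np.ndarray) -> float:
    """Stratified 10-fold CV accuracy with a small hyper-parameter search."""
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    model = GridSearchCV(DecisionTreeClassifier(random_state=0),
                         param_grid={"max_depth": [2, 4, 8, None]},
                         cv=3)  # inner folds tune the tree depth
    scores = cross_val_score(model, X_selected, y, cv=outer, scoring="accuracy")
    return float(scores.mean())

# Illustrative data: 60 samples, 1 informative feature plus 5 noise features.
rng = np.random.default_rng(4)
y = np.repeat([0, 1], 30)
X = np.column_stack([y + rng.normal(0, 0.4, 60), rng.normal(size=(60, 5))])
print(round(evaluate(X, y), 3))
```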

Decision tree (DT)

Decision tree algorithms generally build a tree-like structure in a top-down manner. The leaves of the decision tree correspond to the classifications. DT can deal with both numerical and symbolic data.

Random Forest (RF)

RF is an ensemble of multiple decision trees that uses majority voting. This technique is widely applicable to pattern recognition in high-dimensional and complex problems. RF tries to decrease the high variance of decision trees.

K-nearest neighbors (KNN)

KNN is a machine learning algorithm with acceptable classification performance when the data are not easily separable. KNN scans through all past instances and looks for the K closest ones (data points), which are called the K nearest neighbors.

5.2 Experimental data

In this research, ten medical benchmark datasets were used to investigate the proposed method (BC-NMIQ-IAMB), as presented in Table 3. The information in Table 3 includes the number of genes, the number of instances, and the number of class labels. The total number of genes ranges from 2000 to 15,154.

Table 3 The public gene expression datasets with dataset names, number of samples (N), number of genes (M), number of classes C, and distribution of samples within the classes (Dist. Classes)

When analyzing the performance of the partitional approaches, the results in Table 4 show that k-means and Ward Linkage outperform the Equal Width Discretization algorithm, whose performance, especially in terms of external validation, may deteriorate remarkably on some datasets (e.g., CNS and Colon). Table 4 also shows that, on average, k-means with DT competitively achieves the best accuracy. Table 5 shows that BC-NMIQ-IAMB achieves the highest average accuracies with fewer selected features than BC-NMIQ and BC (recalling that BC uses the entire set of features). Table 6 compares BC-NMIQ-IAMB with four existing approaches, among which the method we propose is the most accurate.

Table 4 Comparison of the clustering accuracies achieved by the three methods on 10 datasets using three classifiers
Table 5 Comparison of the accuracies of the methods achieved on 10 datasets
Table 6 Comparison with four state-of-the-art methods

Figure 3 shows the Receiver Operating Characteristic (ROC) curves of the binary datasets; the ROC curve is one of the major metrics used to check the performance of classification models. In other words, the ROC curve demonstrates to what extent the classification model is able to distinguish between classes. An ROC curve with a greater Area Under the Curve (AUC) shows a better separability of the classes; the optimal situation is when the AUC equals 1. The ROC curves in Fig. 3 are based on the DT classifier. The ROC curve is plotted with the True Positive Rate (TPR), or recall, on the y-axis (\( Recall=\frac{TP}{FN+ TP} \)) and the False Positive Rate (FPR) on the x-axis (\( FPR=\frac{FP}{TN+ FP} \)). (The blue curves indicate cancer genes, and the red curves show normal genes.)
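For reference, a sketch of how such a curve and its AUC could be computed with scikit-learn; the labels and classifier scores below are purely illustrative:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative binary class labels and predicted scores from a classifier.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])

# FPR = FP/(TN+FP) on the x-axis, TPR = TP/(FN+TP) on the y-axis.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(roc_auc_score(y_true, y_score))  # area under the ROC curve
```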

Fig. 3
figure 3

ROC curve of binary datasets applying the DT classifier to BC-NMIQ-IAMB

The overall goal of this paper is to select a smaller number of genes while achieving similar or better classification accuracy than using all genes. The tests of the proposed BC-NMIQ-IAMB on ten datasets using the random forest classifier show that our algorithm is better than all the other compared algorithms in terms of classification accuracy and dimension reduction. In addition, under different evaluation measures, the proposed algorithm is found to be highly efficient, offering better solutions than the other algorithms considered in this paper, i.e., CFS [4], MC, WOA-MC [34], and D2CORFS [12]. The lower accuracies of CFS and MC compared to WOA-MC and BC-NMIQ-IAMB are predictable, because CFS and MC are simple filter methods (especially CFS, whose classification accuracy decreases when the number of features in the dataset grows). On the other hand, D2CORFS achieves higher accuracy than CFS, MC, and WOA-MC. The proposed hybrid method succeeds in reaching the best accuracy and finding the smallest subset of features, which is due to the high capability of BC-NMIQ-IAMB to model the complex problem of high-dimensional data. All in all, BC-NMIQ-IAMB increases the average accuracy over similar algorithms in this domain on binary and multi-class datasets, and this increase is considerable in comparison with CFS (on both binary and multi-class datasets) as well as MC and WOA-MC (on binary datasets).

6 Conclusion

This paper proposes a feature selection approach, the hybrid BC-NMIQ-IAMB method comprising two filter methods, namely BC-NMIQ and IAMB, to find the most informative genes for cancer classification and diagnosis. The main objective of the proposed approach is to find the smallest subset of biomarkers that can signify the disease efficiently. The contribution of the proposed method is that it can effectively model the interactions in complex systems and is applicable to the problem of feature selection on high-dimensional data. The proposed approach is evaluated on 10 popular microarray datasets and compared with some of the most recent approaches. The experimental results obtained on different datasets show that, despite the selection of a minimal subset of features, the selected genes have a high influence on separating the different classes. The results of the experiments demonstrate that the proposed approach achieves a high accuracy rate even with 2 or 4 highly informative genes. There is also much work to be done in the future, for example, investigating how to improve the efficiency of the algorithms; how to combine the findings of this paper with some of the current advanced feature selection algorithms, such as semi-supervised manifold learning methods and sparse classification algorithms; and how to apply the proposed algorithms to automatically determining the number of clusters and to feature clustering that incorporates clinical biological information.