1 Introduction

The data mining research community works on designing and improving techniques for data classification (Mashrgy et al. 2014; Al-Abdallah et al. 2017), pattern recognition (Ma and Xia 2017), and machine learning (Lee and Lee 2006; Doush and Sahar 2017; Sawalha and Doush 2012). Some data mining problems involve huge datasets with thousands of features. In many cases, only a subset of the features is actually needed, while the others are redundant, irrelevant, or noisy. Picking a subset of features that accurately represents the entire feature set can largely affect the performance of machine learning algorithms in respects such as time complexity and classification accuracy (Hu et al. 2006).

Feature selection (FS) is the process of choosing a relevant set of features from a large group of features to represent a record in a dataset. Feature selection is applied in many applications such as text classification (Forman 2003; Deng et al. 2019), text mining (Jing et al. 2002; Ravisankar et al. 2011), image recognition (Goltsev and Gritsenko 2012), image retrieval (Rashedi et al. 2013; Dubey et al. 2015), bioinformatics data mining (Saeys et al. 2007; Urbanowicz et al. 2018), and many others reported in (Bolón-Canedo and Alonso-Betanzos 2019).

There are three types of feature selection techniques based on the evaluation criteria: filter-based, wrapper-based, and embedded methods (Li et al. 2017). Firstly, filter-based feature selection methods give a score to each feature in the dataset using a statistical measure [e.g., information gain (Shang et al. 2013), the Chi-squared test (Liu and Setiono 1995), or ReliefF (Robnik-Šikonja and Kononenko 2003)]. The features are then ranked by their scores; features with higher ranks are kept, while those with lower ranks are removed. Secondly, wrapper-based feature selection methods use search algorithms (e.g., genetic algorithm or particle swarm optimization) to generate candidate subsets of features, and a classifier (e.g., k-nearest neighbor (Park and Kim 2015), naive Bayes (Bermejo et al. 2014), decision trees (Sindhu et al. 2012), etc.) is used to evaluate the quality of each chosen subset in terms of accuracy. Finally, the integration of a wrapper-based and a filter-based method is known as an embedded method, in which the search algorithm is embedded in the classifier, such as the k-nearest neighbor algorithm (kNN), and guides the classifier to pick features that achieve higher accuracy.
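To make the filter approach concrete, the sketch below scores features by information gain on a toy binary dataset and ranks them; the dataset and helper names are illustrative and not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(column, labels):
    """Entropy of the labels minus the weighted entropy after splitting on one feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [l for f, l in zip(column, labels) if f == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# Toy dataset: 4 instances, 3 binary features, binary class labels.
X = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 1],
     [0, 1, 0]]
y = [1, 1, 0, 0]

scores = [information_gain([row[j] for row in X], y) for j in range(3)]
ranking = sorted(range(3), key=lambda j: scores[j], reverse=True)  # higher score = kept
```

Here feature 0 perfectly separates the classes (maximal gain), while feature 2 is uninformative and would be the first candidate for removal.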

In the context of optimization, FS is considered a hard-to-solve combinatorial optimization problem (Gheyas and Smith 2010). The complexity of the FS problem comes from selecting the relevant set of features out of a vast number of possible subsets. For example, the power set of a set \({\mathcal {A}}\) of N features contains \(2^N-1\) non-empty subsets of features. Therefore, as the number of features increases, the number of candidate solutions grows exponentially. The picked set of features is evaluated using an objective function guided by the classification accuracy and the number of selected features. The FS solution is conventionally expressed as a binary array over the features.
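The binary encoding and the exponential growth of the search space can be sketched directly; the concrete values below are hypothetical, chosen only for illustration.

```python
# A candidate FS solution over N features: 1 = feature selected, 0 = dropped.
N = 5
solution = [1, 0, 1, 1, 0]
selected = [i for i, bit in enumerate(solution) if bit == 1]  # indices of kept features

# The number of non-empty feature subsets grows exponentially with N.
num_subsets = 2 ** N - 1
```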

The brute-force method can be used to solve the FS problem: all possible subsets are generated and evaluated, and the most relevant subset is identified (Lai et al. 2006). This approach cannot be used when the number of features is large. Heuristic algorithms can be utilized instead to obtain an acceptable subset for the FS problem (Zhong et al. 2001). This type of algorithm can efficiently find a subset of relevant features; however, the quality of the obtained subset is not necessarily guaranteed (Talbi 2009). Therefore, researchers use metaheuristic-based algorithms to find a near-optimal subset of relevant features in feasible time with high classification accuracy.

Metaheuristic-based algorithms can be used to solve different kinds of optimization problems using operators configured to efficiently explore and exploit candidate solutions, hoping to attain the best solution (Blum and Roli 2003). Metaheuristic-based algorithms can be classified into population-based and local search-based algorithms (Blum and Roli 2003). Population-based algorithms examine several search-space regions concurrently and improve them iteratively, aiming to obtain the optimal solution. Examples of population-based algorithms for FS include the genetic algorithm (Ghareb et al. 2016), differential evolution (Mlakar et al. 2017), ant lion optimizer (Emary et al. 2016), grey wolf optimization (Emary et al. 2016), ant colony optimization (Kabir et al. 2012), competitive swarm optimizer (Gu et al. 2018), firefly algorithm (Zhang et al. 2018; Al-Abdallah et al. 2017), grasshopper optimization algorithm (Mafarja et al. 2018b, 2019), bat algorithm (Mafarja et al. 2018b), whale optimization algorithm (Mafarja and Mirjalili 2018), dragonfly optimization (Mafarja et al. 2018a), crow search algorithm (Sayed et al. 2019), gravitational search algorithm (Taradeh et al. 2019), and harmony search algorithm (Moayedikia et al. 2017).

Local search-based algorithms, the focal point of this paper, consider one solution at a time, starting from an initial solution. This solution is modified repeatedly using an operator that visits neighboring solutions until a local optimum is found. A local search-based algorithm is capable of thoroughly investigating the region around the initial solution and finding the local optimum. Such algorithms have the limitation of not exploring multiple search-space regions concurrently; therefore, some random strategies are employed to empower the local search-based approach. In the literature, FS is tackled by several local search-based algorithms such as tabu search (Zhang and Sun 2002), GRASP (Bermejo et al. 2011), iterated local search (Marinaki and Marinakis 2015), variable neighborhood search (Marinaki and Marinakis 2015), and the stochastic local search method (Boughaci and Alkhawaldeh 2018).

Due to the complex nature of FS problems, most FS-based algorithms are either a modification of a metaheuristic algorithm or a hybridization of two or more metaheuristic algorithms. Examples of modified metaheuristic algorithms that are used to solve FS problems are binary ant lion optimizer using S-shaped function and V-shaped function (Emary et al. 2016), binary grey wolf optimization using crossover and sigmoidal function (Emary et al. 2016), binary dragonfly optimization using time-varying transfer functions (Mafarja et al. 2018a), and chaotic crow search algorithm (Sayed et al. 2019). Examples of hybrid metaheuristics are the hybridization of the ant colony optimization with the wrapper and filter approaches (Kabir et al. 2012), and the integration of the gravitational search algorithm with evolutionary crossover and mutation operators (Taradeh et al. 2019).

\(\beta\)-hill climbing is a local search-based algorithm recently introduced by Al-Betar (2017). It is an improved version of the hill-climbing algorithm with a \(\beta\) operator, governed by the \(\beta\) parameter, to diversify the search, as well as a neighboring operator, called \({\mathcal {N}}\), to intensify the search. The \(\beta\) operator empowers \(\beta\)-hill climbing to intelligently escape local optima by searching different regions while still digging deeply into promising regions of the search space. Due to its simplicity, the algorithm has been adopted for a broad range of optimization problems such as ECG and EEG signal denoising (Alyasseri et al. 2017a, b, 2018), generating substitution-boxes (Alzaidi et al. 2018), gene selection (Alomari et al. 2018b), the economic load dispatch problem (Al-Betar et al. 2018), mathematical optimization functions (Abed-alguni and Alkhateeb 2018; Abed-alguni and Klaib 2018), multiple-reservoir scheduling (Alsukni et al. 2017), the sudoku game (Al-Betar et al. 2017), text document clustering (Abualigah et al. 2017a, b), and cancer classification (Alomari et al. 2018a). \(\beta\)-hill climbing has produced successful outcomes for a broad range of optimization problems and pleasing results for many real-world problems, which suggests it can also be used to tackle the FS problem.

In this paper, a binary version of the \(\beta\)-hill climbing optimizer is developed to tackle the feature selection problem. The algorithm is evaluated using 22 UCI machine learning benchmark datasets that are widely used in the literature. The evaluation is discussed using four measurement criteria: the fitness function, the classification accuracy, the number of selected features, and the elapsed CPU time. The impact of the parameter settings (\(\beta\) and \({\mathcal {N}}\)) on the binary \(\beta\)-hill climbing optimizer is carefully analyzed and studied. Furthermore, the effects of using different transfer functions as well as different classifiers on the efficiency of the binary \(\beta\)-hill climbing are studied. The comparative evaluation is conducted against three local search methods using the same datasets on the four evaluation criteria. Interestingly, the proposed binary \(\beta\)-hill climbing optimizer outperforms the other local search methods on 16 out of 22 datasets. It also outperforms other comparative metaheuristic approaches on 7 out of 22 datasets and is very close to the best results on the remaining 15 datasets. The binary \(\beta\)-hill climbing optimizer can be considered an important addition to the body of knowledge in the machine learning and classification domain, as it produces very promising outcomes when compared against other methods.

The rest of the paper is organized as follows: in sect. 2, the procedural steps of the proposed binary \(\beta\)-hill climbing algorithm are provided. The parameter setting analysis and comparative evaluations are discussed in sect. 3. Finally, the conclusion is presented and possible future research directions are shown in sect. 4.

2 Binary \(\beta\)-hill climbing optimizer for feature selection

Metaheuristic-based methods can be local search-based or population-based. Most of the techniques applied to solve the feature selection problem are population-based methods such as swarm-based or evolutionary-based algorithms.

The \(\beta\)-hill climbing optimizer is an enhanced variant of the basic hill climbing algorithm. It is a local search-based method proposed in Al-Betar (2017) to escape getting stuck in local optima. The \(\beta\)-hill climbing optimizer can be initiated with a random or heuristic solution [say \({{\varvec{x}}}=(x_1,x_2, \ldots , x_n)\)]. In each step, the current solution can be improved using three operators: (i) the \({\mathcal {N}}\)-operator, controlled by the \({\mathcal {N}}\) parameter, to exploit a specific region of the search space; (ii) the \(\beta\)-operator, controlled by the \(\beta\) parameter, to explore the solution space; and (iii) the \({\mathcal {S}}\)-operator, which utilizes the principle of survival of the fittest. Normally, the search process of the \(\beta\)-hill climbing optimizer is halted once the maximum number of iterations (or maximum time) is reached.

The \(\beta\)-hill climbing optimizer can be applied to discrete or continuous search spaces. The variables of the feature selection problem take binary values (i.e., a feature is either selected or not). Therefore, a new operator called the \({\mathcal {T}}\)-operator (i.e., transfer operator) is proposed to map each variable to a binary value using a sigmoid (S-shaped) function. Consequently, the new version of the \(\beta\)-hill climbing optimizer is called the binary \(\beta\)-hill climbing optimizer. The algorithm pseudo-code is shown in Algorithm 1 and it is flow-charted in Fig. 1. The steps and the four operators are described as follows:

Fig. 1
figure 1

Binary \(\beta\)-hill climbing optimizer for feature selection

  • Step 1: Initialize binary \(\beta\)-hill climbing and feature selection parameters— The parameters of the binary \(\beta\)-hill climbing for feature selection are set in this step. The feature selection problem has a binary search space; therefore, the solution is modeled as a binary vector \({{\varvec{x}}}=(x_1,x_2,\ldots ,x_n)\) of n features, where \(x_i=1\) means that feature i is selected. The parameters of the binary \(\beta\)-hill climbing optimizer are \({\mathcal {N}}\) and \(\beta\). The parameter \({\mathcal {N}} \in [0,1]\) controls the neighboring operator (\({\mathcal {N}}\)-operator), which determines the adjustment bandwidth used to move the current solution to a neighboring solution. The \(\beta\) parameter controls the \(\beta\)-operator, which determines the intensity of exploration applied to the neighboring solution. The last parameter of the proposed binary \(\beta\)-hill climbing optimizer is the maximum number of iterations, \(\text {Max}\_\text {Itr}\).

  • Step 2: Construct the initial solution— The initial solution \({{\varvec{x}}}=(x_1,x_2,\ldots ,x_n)\) is randomly constructed from the binary domain as follows:

    $$\begin{aligned} x_{i}\leftarrow {\left\{ \begin{array}{ll} 1 &{} \qquad U[0,1] \ge 0.5\\ 0 &{} \qquad otherwise. \end{array}\right. } \end{aligned}$$

    where U[0, 1] generates a random number between 0 and 1. The initial solution is evaluated by applying the objective function formulated in Eq. (1) to the set of features it selects (Emary et al. 2016).

    $$\begin{aligned} f({{\varvec{x}}})=\alpha \gamma _{R}(D)+ \left( 1-\alpha \right) \frac{|R|}{|N|} \end{aligned}$$
    (1)

    where \(\gamma _{R}(D)\) is the classification error rate. In this study, the kNN classifier is used to find the classification error rate (Liao and Vemuri 2002). Note that |R| is the number of selected features, |N| is the total number of features, and \(\alpha \in [0,1]\) balances the weight of the classification error rate against the length of the feature subset.
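Steps 1–2 can be sketched as follows. The weight \(\alpha\)=0.99 and the error-rate value are illustrative assumptions only; in the paper, \(\gamma _{R}(D)\) is computed by the kNN classifier rather than passed in as a number.

```python
import random

def initial_solution(n):
    """Step 2: each bit is set to 1 when U[0,1] >= 0.5, otherwise 0."""
    return [1 if random.random() >= 0.5 else 0 for _ in range(n)]

def fitness(solution, error_rate, alpha=0.99):
    """Eq. (1): f(x) = alpha * gamma_R(D) + (1 - alpha) * |R| / |N| (minimized).

    `error_rate` stands in for the kNN classification error gamma_R(D), and
    alpha = 0.99 is an illustrative weight, not a value prescribed by the paper.
    """
    return alpha * error_rate + (1 - alpha) * sum(solution) / len(solution)

x = initial_solution(8)
f = fitness([1, 0, 1, 1, 0], error_rate=0.10)   # smaller is better
```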

  • Step 3: Improvement loop— The enhancement of the current feature selection solution (\({{\varvec{x}}}\)) is achieved by using four operators which are used to yield a neighboring solution (i.e., \({{\varvec{x}}}'\)).

    • \({\mathcal {N}}\)-operator— This operator is responsible for moving the present solution \({{\varvec{x}}}\) to a nearby solution \({{\varvec{x}}}'\) using Eq. (2). It is governed by the \({\mathcal {N}}\) parameter, where \({\mathcal {N}} \in [0,1]\), which determines the adjustment applied to the decision variables (features) in the current solution. A greater value of \({\mathcal {N}}\) yields a farther move toward the neighboring solution \({{\varvec{x}}}'\). The pseudo-code for the \({\mathcal {N}}\)-operator is shown in line 5 of Algorithm 1. Formally, let \(x_i\) hold the value \(v_i(k)\) at the \(k^{th}\) position; then the value of \(x'_i\) is assigned as follows:

      $$\begin{aligned} x'_i= x_{i, k} \pm {\mathcal {N}} \end{aligned}$$
      (2)

      where \(x_{i, k} \pm {\mathcal {N}}\) is the neighboring value of \(x_{i, k}\).

    • \(\beta\)-operator— This operator is utilized to increase the coverage of the search space. It applies the idea of uniform mutation to the present solution. As shown in Eq. (3), \(\forall i \in (1,2,\ldots ,n)\), the decision variable \(x_i\) is picked at random to be adjusted using the \(\beta\) parameter. This is pseudo-coded in lines 6 to 10 of Algorithm 1.

      $$\begin{aligned} x'_{i}\leftarrow {\left\{ \begin{array}{ll} x_{r} &{} \qquad U[0,1] \le \beta \\ x'_i &{} \qquad otherwise. \end{array}\right. } \end{aligned}$$
      (3)

      Note that the \(\beta\) parameter determines how often the uniform mutation is used. Also, \(x_{r}\) is a random value which is either 0 or 1.

    • \({\mathcal {T}}\)-operator— As aforementioned, the feature selection problem deals with binary values for the decision variables. Therefore, the sigmoid function (an S-shaped function, as shown in Fig. 2) is adapted to transform continuous values into binary ones. The sigmoid function (Kennedy and Eberhart 1997) is formulated in Eq. (4):

      $$\begin{aligned} T\left( x'_{i}\right) =\frac{1}{1+e^{-x'_{i}}} \quad \forall i=(1,2,\ldots ,n) \end{aligned}$$
      (4)

      The value of each decision variable in the neighboring solution is re-assigned a binary value using Eq. (5). Let r be a random number drawn uniformly from [0, 1] (i.e., \(r \in [0,1]\)); the value \(x'_i\) of feature i is re-assigned as follows:

      $$\begin{aligned} x'_i ={\left\{ \begin{array}{ll} 1 &{} \qquad r<T\left( x'_{i}\right) \\ 0 &{} \qquad otherwise. \end{array}\right. } \end{aligned}$$
      (5)
    • \({\mathcal {S}}\)-operator— The quality of the neighboring solution \({{\varvec{x}}}'\) is assessed by applying the objective function \(f({{\varvec{x}}}')\) formulated in Eq. (1). The neighboring solution \({{\varvec{x}}}'\) replaces the current one \({{\varvec{x}}}\) if it is better (i.e., \(f({{\varvec{x}}}')\le f({{\varvec{x}}})\)). The pseudo-code for the \({\mathcal {S}}\)-operator is presented in lines 20 to 22 of Algorithm 1.
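The four operators above can be combined into a single improvement iteration, sketched below. The exact ordering of the operators follows our reading of the description rather than Algorithm 1 itself, and the guard that keeps at least one feature selected is an added assumption; the toy fitness function merely stands in for Eq. (1).

```python
import math
import random

def improve_once(x, fitness, N=0.9, beta=0.5):
    """One iteration of binary beta-hill climbing (a sketch, not Algorithm 1 verbatim)."""
    neighbor = []
    for bit in x:
        v = bit + random.choice((-1, 1)) * N              # N-operator: move by +/- N
        if random.random() <= beta:                       # beta-operator: uniform mutation
            v = float(random.randint(0, 1))
        t = 1.0 / (1.0 + math.exp(-v))                    # T-operator: sigmoid, Eq. (4)
        neighbor.append(1 if random.random() < t else 0)  # re-binarize, Eq. (5)
    if sum(neighbor) == 0:                                # assumed guard: keep >= 1 feature
        neighbor[random.randrange(len(neighbor))] = 1
    # S-operator: accept the neighbor only if it is at least as good.
    return neighbor if fitness(neighbor) <= fitness(x) else x

# Toy fitness standing in for Eq. (1): fewer selected features is better.
f = lambda s: sum(s) / len(s)
x = [1, 1, 1, 1, 1]
for _ in range(200):
    x = improve_once(x, f)
```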

  • Step 4: Stop criterion— The proposed binary \(\beta\)-hill climbing is iterated until a stop criterion is reached. The stop criterion used in this study is the number of iterations \(\text {Max}\_\text {Itr}\) defined in Step 1.

  • Step 5: kNN classifier— The accuracy of the solution obtained by binary \(\beta\)-hill climbing is evaluated using a kNN classifier. kNN is an effective non-parametric method used for classification and regression. kNN starts by storing all the training data instances. After that, a pairwise computation is applied to calculate the similarity between each unseen instance and the training instances (Chen et al. 2009; Weinberger and Saul 2009), and the k closest instances are selected to determine the class label. This operation is repeated for all the unseen instances.

    The classification accuracy and error rate are computed using Eqs. (6) and (7). Classification accuracy is a statistical measure of the ability of the classifier to use the picked features to correctly label a given tuple with its class. It is computed using Eq. (6).

    $$\begin{aligned} Accuracy= \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
    (6)

    where TP (true positive) denotes correctly identifying the class using a precise set of features, and TN (true negative) denotes correctly identifying that the tuple does not belong to the class using a correct set of features. FP (false positive) denotes incorrectly identifying the tuple as belonging to the class, and FN (false negative) denotes incorrectly identifying it as not belonging to the class.

    The classification error rate, calculated in Eq. (7), is the percentage of instances that are incorrectly classified. It forms part of the formulated objective function.

    $$\begin{aligned} \gamma _{R}(D)= 1-Accuracy \end{aligned}$$
    (7)
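A minimal version of the kNN vote and the two measures above can be sketched as follows; the Euclidean distance metric, the helper names, and the toy inputs are illustrative choices, not taken from the paper.

```python
from collections import Counter

def knn_predict(train_X, train_y, instance, k=5):
    """Classify `instance` by majority vote among its k nearest training instances."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, instance)), label)
        for row, label in zip(train_X, train_y)
    )
    return Counter(label for _, label in dists[:k]).most_common(1)[0][0]

def accuracy(tp, tn, fp, fn):
    """Eq. (6): proportion of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    """Eq. (7): gamma_R(D) = 1 - Accuracy."""
    return 1 - accuracy(tp, tn, fp, fn)

pred = knn_predict([[0, 0], [0, 1], [5, 5], [6, 5]], [0, 0, 1, 1], [5, 6], k=3)
```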
Fig. 2
figure 2

S-shape function

figure a

2.1 Computational complexity of the proposed method

The time complexity of the proposed binary \(\beta\)-hill climbing algorithm is measured by analyzing the pseudo-code given in Algorithm 1 in terms of big-\({\mathcal {O}}\) notation. The binary \(\beta\)-hill climbing pseudo-code can be divided into three parts: (i) the initial phase (lines 1 to 3 in Algorithm 1); (ii) the improvement phase (lines 5 to 24 in Algorithm 1); and (iii) the classifier phase (line 25 in Algorithm 1).

The time complexity of the initial phase is \({\mathcal {O}}(n)\) for the construction of the initial solution. The f(x) calculation is based on the kNN classifier, which computes the classification error rate, and thus its complexity is \({\mathcal {O}}(n^2)\). Therefore, the time complexity of the initial phase is \({\mathcal {O}}(n^2)\).

The time complexity of the second phase (i.e., the improvement phase) relies upon the number of iterations (\(\text {Max}\_\text {Itr}\)) and the time required for the \({\mathcal {N}}\)-operator, \(\beta\)-operator, \({\mathcal {T}}\)-operator, and \({\mathcal {S}}\)-operator. The time complexity of the \({\mathcal {N}}\)-operator is \({\mathcal {O}}(1)\), while the complexity of the \(\beta\)-operator is \({\mathcal {O}}(n)\). Furthermore, the time complexity of the \({\mathcal {T}}\)-operator is also \({\mathcal {O}}(n)\), while the \({\mathcal {S}}\)-operator requires \({\mathcal {O}}(n^2)\). In brief, the time complexity of the second phase is \({\mathcal {O}}(\text {Max}\_\text {Itr} \cdot n^2)\).

The time complexity of the classifier phase is \({\mathcal {O}}(n^2)\), which is the time required to execute the kNN classifier. As a wrap-up, the time complexity required to execute the developed binary \(\beta\)-hill climbing is \({\mathcal {O}}(\text {Max}\_\text {Itr} \cdot n^2)\).

3 Experiments and results

A comprehensive experimental analysis is conducted in this section to investigate the efficiency of the proposed binary \(\beta\)HC algorithm when solving the feature selection problem. The experiments are organized as follows: (i) the effect of the two parameters of binary \(\beta\)HC (i.e., \({\mathcal {N}}\) and \(\beta\)) on the algorithm's performance is studied in sects. 3.2.1 and 3.2.2; (ii) the influence of different transfer functions on the efficiency of the proposed binary \(\beta\)HC algorithm is presented in sect. 3.2.3; (iii) the performance of the proposed binary \(\beta\)HC algorithm using different classifiers is summarized in sect. 3.2.4; (iv) the effect of training/testing against k-fold cross-validation models on the performance of the proposed algorithm is provided in sect. 3.2.5; and (v) the efficiency of the binary \(\beta\)HC algorithm is compared against other local search-based algorithms in sect. 3.3, against recent metaheuristic methods in sect. 3.4, and against a filter-based approach in sect. 3.5. It should be noted that the attributes of the datasets utilized in the algorithm assessment are summarized in sect. 3.1.

The binary \(\beta\)HC algorithm was implemented using MATLAB (R2014a) and tested on a laptop with a 2.80 GHz Intel Core i7 and 16 GB RAM running Microsoft Windows 10. In all the experiments, each dataset is split at random into two portions: training, which is 80% of the instances, and testing, which is the remaining 20%. This split is used as it has been widely adopted by several state-of-the-art methods (Mafarja et al. 2019; Alsaafin and Elnagar 2017; Li et al. 2011; Wieland and Pittore 2014).
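The random 80/20 split described above can be sketched as below; the helper name, the seed parameter, and the toy data are illustrative and not part of the paper's MATLAB implementation.

```python
import random

def train_test_split(instances, labels, train_frac=0.8, seed=None):
    """Shuffle the indices at random and split them 80/20 into training and testing."""
    rng = random.Random(seed)
    idx = list(range(len(instances)))
    rng.shuffle(idx)
    cut = int(len(idx) * train_frac)
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([instances[i] for i in train_idx], [labels[i] for i in train_idx],
            [instances[i] for i in test_idx], [labels[i] for i in test_idx])

X = [[i] for i in range(10)]
y = list(range(10))
train_X, train_y, test_X, test_y = train_test_split(X, y, seed=1)
```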

3.1 Dataset

The proposed binary \(\beta\)HC algorithm is evaluated using twenty-two datasets collected from the UCI data repository. A summary of the dataset characteristics is presented in Table 1, showing for each dataset the FS problem name, the number of features, and the number of instances.

Table 1 The characteristics of the datasets

3.2 Evaluation of the proposed binary \(\beta\)HC algorithm

A sensitivity analysis of the proposed binary \(\beta\)HC algorithm is performed to investigate the effect of different operators on the convergence of the algorithm. Twenty-three experimental scenarios are designed as shown in Table 2. In this table, the scenarios are divided into five groups as follows: firstly, five scenarios (Sen1–Sen5) are designed to study the influence of the \({\mathcal {N}}\) parameter on the performance of the binary \(\beta\)HC algorithm. The next five experimental scenarios (Sen6–Sen10) investigate the convergence behavior of the binary \(\beta\)HC algorithm by tuning the \(\beta\) parameter. The third group of experimental scenarios (Sen11–Sen18) is designed to investigate the influence of different transfer functions on the efficiency of the proposed binary \(\beta\)HC algorithm. The effect of the classifiers on the behavior of the developed binary \(\beta\)HC algorithm is investigated in the next three scenarios (Sen19–Sen21). Finally, the last two scenarios (i.e., Sen22 and Sen23) are designed to study the influence of data splitting techniques (i.e., training/testing against k-fold cross-validation) on the performance of the proposed \(\beta\)HC algorithm. It should be noted that 20 independent replications are conducted for each experimental scenario, and the maximum number of iterations is 500. As suggested in (Emary et al. 2016), the value of the k parameter in the kNN algorithm is set to 5. Note that the bold entries in all result tables refer to the best results obtained.

Table 2 Twenty three experimental scenarios to evaluate the sensitivity of binary \(\beta\)HC algorithm

3.2.1 Study the effect of the \({\mathcal {N}}\) parameter

The influence of the parameter \({\mathcal {N}}\) on the performance of the binary \(\beta\)HC algorithm is investigated in this section. Five experimental scenarios are designed with different settings of \({\mathcal {N}}\): Sen1 (\({\mathcal {N}}\)=0.005), Sen2 (\({\mathcal {N}}\)=0.05), Sen3 (\({\mathcal {N}}\)=0.1), Sen4 (\({\mathcal {N}}\)=0.5), and Sen5 (\({\mathcal {N}}\)=0.9). In general, a higher value of the \({\mathcal {N}}\) parameter leads to higher exploitation and makes the algorithm dig deeper into the searched region. Tables 3, 4, 5, and 6 provide the average (Avg) and the standard deviation (Stdv) of the results obtained by running Sen1 to Sen5 in terms of the classification accuracy, the fitness value, the number of selected features, and the elapsed CPU time. Note that the best results in these tables are highlighted in bold font.

Table 3 shows the behavior of the proposed binary \(\beta\)HC algorithm with different values of the parameter \({\mathcal {N}}\) in terms of classification accuracy, where the highest average values are the best. As shown in Table 3, two scenarios (i.e., Sen3 and Sen5) obtained the best results on 6 datasets, while the three other scenarios (i.e., Sen1, Sen2, and Sen4) obtained the best results on 4, 5, and 5 datasets, respectively. Based on these findings, it is not clear which scenario configuration is the most efficient; in other words, the parameter \({\mathcal {N}}\) has no real impact on the performance of binary \(\beta\)HC. In addition, the standard deviation values recorded in Table 3 show the robustness of the proposed algorithm: all five experimental scenarios have low standard deviation values on all datasets, with a better performance for Sen5.

Table 3 The classification accuracy results obtained by the binary \(\beta\)HC algorithm with various \({\mathcal {N}}\) values

Table 4 presents the influence of different values of the parameter \({\mathcal {N}}\) on the performance of the binary \(\beta\)HC algorithm in terms of fitness value. The best outcomes are highlighted in bold font (lowest is best). Clearly, Table 4 shows that the efficiency of the binary \(\beta\)HC algorithm in the five experimental scenarios (Sen1 to Sen5) is almost the same. The two experimental scenarios Sen3 and Sen4 achieved the best results on 6 datasets, while Sen1 achieved the best results on 5 datasets. In addition, Sen1 and Sen5 achieved the best results on 4 datasets. The lowest standard deviation values on all datasets occur in Sen1 to Sen5; these experimental scenarios obtain almost the same results over the 20 runs.

Table 4 The fitness values obtained by the binary \(\beta\)HC algorithm with various \({\mathcal {N}}\) values

The results of Sen1 to Sen5 in terms of the number of selected features are outlined in Table 5. Again, the lowest values (i.e., the best) are highlighted in bold font. The binary \(\beta\)HC algorithm using Sen5 outperforms the other four scenarios (Sen1–Sen4), as it achieves the best results on 7 datasets, while the other scenarios Sen1, Sen2, Sen3, and Sen4 obtain the best results on 3, 4, 6, and 2 datasets, respectively.

Table 5 The selected features obtained by the binary \(\beta\)HC algorithm with various \({\mathcal {N}}\) values

Similarly, the impact of different configurations of the parameter \({\mathcal {N}}\) on the efficiency of the proposed binary \(\beta\)HC algorithm in terms of the elapsed CPU time is recorded in Table 6. Clearly, higher values of the parameter \({\mathcal {N}}\) lead to lower CPU time. In other words, the two experimental scenarios Sen4 and Sen5 achieved the lowest CPU time on 8 datasets, while the other three scenarios Sen1, Sen2, and Sen3 obtained the lowest CPU time on 3, 1, and 2 datasets, respectively.

Table 6 The CPU time (in seconds) obtained by the binary \(\beta\)HC algorithm with various \({\mathcal {N}}\) values

In summary, the best results of the proposed \(\beta\)HC algorithm for most datasets in terms of the classification accuracy, the number of selected features, and the elapsed CPU time occur when \({\mathcal {N}}\) = 0.9. Furthermore, the performance of the proposed \(\beta\)HC algorithm using different configurations of the parameter \({\mathcal {N}}\) is almost the same on all datasets in terms of the fitness value. As a result, the parameter \({\mathcal {N}}\) is set to 0.9 in the following experiments.

3.2.2 Study the effect of parameter \(\beta\)

The impact of the \(\beta\) parameter on the performance of the binary \(\beta\)HC algorithm is studied in this section using five different settings (\(\beta\)=0, \(\beta\)=0.005, \(\beta\)=0.05, \(\beta\)=0.1, and \(\beta\)=0.5). Accordingly, five empirical scenarios (Sen6–Sen10) are devised. The value of the parameter \({\mathcal {N}}\) is set to 0.9 based on the experiment conducted in sect. 3.2.1. Generally speaking, a larger \(\beta\) value results in a greater exploration rate. Tables 7, 8, 9, and 10 present the mean and the standard deviation of the results of running the scenarios Sen6 to Sen10 in terms of the classification accuracy, the fitness value, the number of selected features, and the elapsed CPU time. The best outcomes obtained are highlighted in bold font.

Table 7 The classification accuracy results obtained by the binary \(\beta\)HC algorithm with various \(\beta\) values

Table 7 summarizes the experimental results of running Sen6 to Sen10 in terms of the classification accuracy. Table 7 shows that the efficiency of the binary \(\beta\)HC algorithm is enhanced by increasing the value of the parameter \(\beta\). In other words, Sen10 attained the best results on 10 out of 22 datasets, while the other scenarios Sen9, Sen8, and Sen7 achieved the best outcomes on 7, 3, and 2 datasets, respectively. However, the performance of Sen6 is the worst when compared with the other scenarios (i.e., Sen7, Sen8, and Sen10) because the parameter \(\beta\) is neglected in this scenario, and thus the source of exploration is not used. Furthermore, the standard deviation values are recorded in Table 7. It can be observed that Sen10 is more robust than the other scenarios on almost all the datasets, as it obtains the same results over the 20 runs.

Table 8 The fitness values obtained by the binary \(\beta\)HC algorithm with various \(\beta\) values

Table 8 records the results of examining the performance of the binary \(\beta\)HC algorithm with various values of the parameter \(\beta\) in terms of fitness values. Table 8 provides clear evidence that the efficiency of the proposed binary \(\beta\)HC algorithm is enhanced with a larger value of the parameter \(\beta\). This is because higher values of \(\beta\) lead to a higher rate of exploration, and thus help the search avoid getting stuck in local optima. Clearly, Sen10 outperforms the other four scenarios (Sen6–Sen9), as it achieves the best outcomes on 10 datasets. Furthermore, Sen9 achieved the best results on 8 datasets, while Sen8, Sen7, and Sen6 achieved the best results on 4, 3, and 1 datasets, respectively. Moreover, the performance of the proposed binary \(\beta\)HC algorithm using Sen10 is more robust than the other scenarios based on the standard deviation outcomes listed in Table 8.

Table 9 The selected features obtained by binary \(\beta\)HC algorithm with various \(\beta\) values

Table 9 illustrates the results of running Sen6 to Sen10 in terms of the number of features selected by the proposed algorithm. Apparently, Sen8 outperforms the other scenarios, as it obtains the finest outcomes on 9 datasets. Furthermore, Sen7 and Sen9 obtained the finest outcomes on 5 datasets each, while Sen10 achieved the finest outcomes on 2 datasets. Note that Sen6 did not obtain the best result for any of the datasets. This is because the parameter \(\beta\) value is zero, which makes the search process of the proposed algorithm get stuck in local optima.

Table 10 The CPU time (in seconds) obtained by binary \(\beta\)HC algorithm with various \(\beta\) values

Finally, the results of tuning the parameter \(\beta\) in the proposed binary \(\beta\)HC algorithm in terms of the elapsed CPU time are outlined in Table 10. According to the results in Table 10, Sen8 obtained the minimum CPU time on 8 datasets, while the other scenarios Sen6, Sen7, Sen9, and Sen10 achieved the minimum CPU time on 1, 4, 5, and 4 datasets, respectively.

In a nutshell, the proposed binary \(\beta\)HC algorithm with \(\beta\)=0.5 is superior to the other versions of the binary \(\beta\)HC algorithm in terms of the classification accuracy and the obtained fitness value. On the other hand, the proposed binary \(\beta\)HC algorithm with \(\beta\)=0.05 outperforms the other versions in terms of the number of selected features and the elapsed CPU time. As a result, based on the classification accuracy results, the value 0.5 is used to set the parameter \(\beta\) in the upcoming experiments.

3.2.3 Study the effect of the different transfer functions

The performance of the proposed binary \(\beta\)HC algorithm when using various transfer functions is studied in this section. Eight experimental scenarios (i.e., Sen11–Sen18) are tailored to eight different transfer functions (i.e., S1–S4 as different versions of the S-Shaped family, and V1–V4 as different versions of the V-Shaped family). These transfer functions are borrowed from Mafarja et al. (2018a). The value of the parameter \({\mathcal {N}}\) is set to 0.9 and the parameter \(\beta\) is set to 0.5 based on the previous experiments conducted in Sects. 3.2.1 and 3.2.2. Tables 11, 12, 13, and 14 illustrate the mean and the standard deviation of the results of running Sen11 to Sen18 in terms of the classification accuracy, the fitness value, the selected features, and the elapsed CPU time. The bold font is used to point out the best outcomes.
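As a rough illustration of how S-shaped and V-shaped transfer functions binarize a continuous solution component, the sketch below implements one representative of each family (the classic sigmoid for the S-shaped family and \(|\tanh|\) for the V-shaped family). The specific S1–S4/V1–V4 definitions used in the experiments follow Mafarja et al. (2018a) and may differ in their constants; the function names here are illustrative.

```python
import math
import random

def s_shaped(x):
    """Classic sigmoid (an S1-style shape): maps a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def v_shaped(x):
    """One common V-shaped choice: |tanh(x)|, bounded in [0, 1)."""
    return abs(math.tanh(x))

def s_binarize(x, rng=random.random):
    """S-shaped rule: set the bit to 1 with probability s_shaped(x)."""
    return 1 if rng() < s_shaped(x) else 0

def v_binarize(x, current_bit, rng=random.random):
    """V-shaped rule: flip the current bit with probability v_shaped(x)."""
    return 1 - current_bit if rng() < v_shaped(x) else current_bit
```

Note the difference in the usual update rules: S-shaped functions set the bit directly from the transfer probability, while V-shaped functions use that probability to decide whether to flip the current bit, which tends to preserve more of the incumbent solution.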

Table 11 shows the influence of using different transfer functions on the performance of the proposed binary \(\beta\)HC algorithm in terms of the classification accuracy. Table 11 demonstrates that the proposed algorithm using Sen12 outperforms the seven other scenarios (Sen11 and Sen13–Sen18), as it obtained the finest outcomes on 7 datasets. In addition, Sen11, Sen13, and Sen16 have the best results on 3 datasets each, while Sen14, Sen15, and Sen17 obtained the finest outcomes on 2 datasets. Finally, Sen18 achieved better results than the other scenarios on one dataset. Table 11 also shows the robustness of Sen11 to Sen18 based on the standard deviations of the results. It is not clear which experimental scenario to pick, as all scenarios achieved almost similar results.

Table 11 The classification accuracy results obtained by binary \(\beta\)HC algorithm with various transfer functions

Similarly, Table 12 demonstrates the efficiency of the proposed binary \(\beta\)HC algorithm in terms of the fitness value when utilizing various transfer functions. Clearly, Sen12 outperforms the other scenarios, as it obtained the finest outcomes on 6 datasets. In addition, Sen11, Sen13, and Sen16 achieved the best results on 3 datasets each, while Sen15, Sen17, and Sen18 obtained the best results on 2 datasets. Finally, Sen14 achieved the finest outcomes on one dataset. Based on the standard deviation results recorded in Table 12, it can be observed that Sen11 to Sen18 obtained almost similar standard deviations. This reflects the robustness of the proposed binary \(\beta\)HC algorithm in all cases.

Table 12 The fitness values obtained by binary \(\beta\)HC algorithm with various transfer functions

Table 13 summarizes the results of running the experimental scenarios Sen11 to Sen18 in terms of the selected features. Clearly, Sen11 to Sen14, which study different versions of the S-Shaped transfer function, did not obtain any of the best results. On the other hand, Sen16 achieved the best results on 7 datasets, while the scenarios Sen15, Sen17, and Sen18 obtained the best results on 5 datasets each. In conclusion, the performance of the proposed binary \(\beta\)HC algorithm in terms of the selected features using the V-Shaped transfer function achieved superior outcomes to the S-Shaped transfer function in all cases.

Table 13 The selected features obtained by binary \(\beta\)HC algorithm with various transfer functions

Finally, the results of running Sen11 to Sen18 in terms of the elapsed CPU time are summarized in Table 14. Apparently, Sen16 outperforms the other scenarios, as it obtained the minimum average CPU time on 9 datasets. Furthermore, Sen15 obtained the minimum average CPU time on 5 datasets, while Sen12 and Sen17 achieved the minimum average CPU time on 4 datasets each. Also, Sen14 and Sen18 obtained the minimum CPU time on one dataset each, whereas Sen11 and Sen13 did not obtain any of the best results in terms of CPU time.

Table 14 The CPU time (in seconds) obtained by binary \(\beta\)HC algorithm with various transfer functions

Based on the above discussions, the performance of the proposed algorithm using the S-Shaped transfer function is better than with the V-Shaped transfer function in terms of the classification accuracy and the fitness values. On the other hand, the performance of the proposed algorithm using the V-Shaped transfer function is superior to the S-Shaped transfer function in terms of the selected attributes and the CPU time. However, it is hard to decide which transfer function is better for the feature selection problem. We decided to use the S-Shaped (S3) transfer function in the subsequent experiments, as the performance of the proposed algorithm using S-Shaped (S3) is superior to the others in terms of classification accuracy.

3.2.4 Study the effect of the different classifiers

In this section, the influence of using three distinct classifiers (i.e., kNN, SVM, and decision tree (DT)) on the performance of the proposed binary \(\beta\)HC algorithm for the FS problem is studied. Three experimental scenarios, Sen19 to Sen21, are designed, where each scenario is used to study one of the three classifiers. The average (Avg) and the standard deviation (Stdv) of the classification accuracy, the fitness value, the selected features, and the elapsed CPU time are summarized in Tables 15, 16, 17, and 18. The bold font is used to point out the finest outcomes.
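To make concrete how a wrapper scenario scores a candidate feature subset with a classifier, the following is a minimal leave-one-out 1-NN accuracy sketch restricted to a binary feature mask. It is an illustrative stand-in for the kNN evaluation in Sen19 (the paper's k value and train/test protocol are not reproduced here), and the function name and data layout are assumptions.

```python
import math

def one_nn_accuracy(X, y, mask):
    """Leave-one-out accuracy of a 1-NN classifier restricted to the
    features where mask[j] == 1.

    X    : list of samples, each a list of feature values
    y    : list of class labels
    mask : binary feature-selection vector (the candidate solution)
    """
    idx = [j for j, m in enumerate(mask) if m]
    if not idx:          # empty subset: nothing to classify with
        return 0.0
    correct = 0
    for i, xi in enumerate(X):
        best_label, best_d = None, float("inf")
        for k, xk in enumerate(X):
            if k == i:   # leave the query point out of the reference set
                continue
            d = math.dist([xi[j] for j in idx], [xk[j] for j in idx])
            if d < best_d:
                best_d, best_label = d, y[k]
        correct += best_label == y[i]
    return correct / len(X)
```

In a wrapper loop, this accuracy (or its error rate) is what the search algorithm feeds into the fitness function for each candidate mask.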

Table 15 summarizes the experimental outcomes of investigating the effect of using three distinct classifiers on the performance of the proposed binary \(\beta\)HC algorithm in terms of the classification accuracy. The proposed algorithm using the kNN classifier (Sen19) outperforms the two other classifiers, as it realized the finest outcomes on 20 out of 22 datasets. Furthermore, the proposed algorithm using the SVM classifier achieved the best results on two datasets, while the proposed algorithm with the decision tree classifier did not obtain the best result on any dataset. Furthermore, the standard deviation results show that the proposed algorithm with the kNN classifier is more robust than the other versions, as it achieves almost the same results over 20 runs.

Table 15 The classification accuracy results obtained by binary \(\beta\)HC algorithm with different classifiers

The impact of using three different classifiers on the performance of the proposed binary \(\beta\)HC algorithm in terms of the fitness values is summarized in Table 16. As the table shows, it is hard to decide which classifier is better to integrate within the proposed algorithm for tackling the feature selection problem. This is because the proposed algorithm with the decision tree classifier outperforms the other two classifiers on eight datasets, while the performance with each of the two other classifiers is almost the same, as both were able to realize the finest outcomes on seven datasets.

Table 16 The fitness values obtained by binary \(\beta\)HC algorithm with different classifiers

Table 17 shows the results for Sen19 to Sen21 in terms of the selected features. Table 17 demonstrates that the results produced by Sen19 outperform those produced by the other scenarios (Sen20 and Sen21), as it obtained the minimum number of selected features on 17 out of 22 datasets. Furthermore, the experimental scenarios Sen20 and Sen21 obtained the minimum number of features on 2 and 3 datasets, respectively. This proves that the integration of the proposed binary \(\beta\)HC algorithm with the kNN classifier is better than with the other two classifiers for the feature selection problem in terms of the number of selected features.

Table 17 The selected attributes obtained by binary \(\beta\)HC algorithm with different classifiers

Finally, the elapsed CPU times for Sen19 to Sen21 are recorded in Table 18. In this table, we can see that Sen19 obtained the minimum CPU time on 19 out of 22 datasets, while Sen21 achieved the minimum CPU time on 3 datasets. In addition, it can be seen that Sen20 is slower than the other scenarios (i.e., Sen19 and Sen21), as it could not obtain the minimum CPU time on any dataset.

To wrap up, kNN as a classifier is capable of obtaining superior outcomes to the other two classifiers in almost all cases. As a result, the kNN classifier will be used in the following experiments.

Table 18 The CPU times (in seconds) obtained by binary \(\beta\)HC algorithm with different classifiers

3.2.5 The effect of training/testing against k-fold cross validation models

In this section, we investigate the effect of the data splitting technique (i.e., training/testing against k-fold cross-validation) on the convergence behaviour of the proposed method. The evaluation is conducted using kNN with training/testing (binary \(\beta\)HC(kNN-TT)) and using kNN with k-fold cross-validation (binary \(\beta\)HC(kNN-KF)). In k-fold cross-validation, we divide the data into complementary subsets, performing the analysis on one subset (i.e., the training set) and validating the outcomes on the other subset (i.e., the validation or testing set) (Delen et al. 2005). This technique has been used with feature selection methods by other researchers (Shao et al. 2013; Zhang et al. 2014). Note that the value \(k=10\) is used in the k-fold cross-validation method. Again, four main measurements are used in the comparison: classification accuracy, fitness function value, number of informative features, and CPU time. Each experiment is replicated over 10 runs and the average (Avg) and standard deviation (Stdv) are recorded. The best Avg is highlighted in bold (lowest is best for the fitness, feature-count, and CPU-time measures).
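A minimal sketch of how the fold indices for \(\beta\)HC(kNN-KF) can be generated. This produces contiguous, near-equal folds; the paper's actual protocol may shuffle or stratify the data first, which is omitted here.

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous, near-equal folds and
    yield (train, test) index lists, as in standard k-fold CV.

    Each sample appears in exactly one test fold; the remaining
    samples form that fold's training set.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

The model is then trained and scored k times, once per (train, test) pair, and the k accuracies are averaged; this is why the k-fold variant costs roughly k times the CPU time of a single train/test split, consistent with the CPU-time advantage reported for \(\beta\)HC(kNN-KF) being on the fitness side rather than raw speed per evaluation.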

As can be noticed from Table 19, binary \(\beta\)HC(kNN-TT) outperforms binary \(\beta\)HC(kNN-KF) on 17 out of 22 datasets in terms of classification accuracy, and on 18 out of 22 datasets in terms of the number of features reduced. However, binary \(\beta\)HC(kNN-KF) outperforms binary \(\beta\)HC(kNN-TT) on 16 out of 22 datasets in terms of fitness function values. Furthermore, it excels on all datasets in terms of the CPU time required.

In conclusion, binary \(\beta\)HC(kNN-TT) is a powerful algorithm in terms of classification accuracy and the number of features reduced. Therefore, it is recommended to use binary \(\beta\)HC(kNN-TT) for feature selection problems.

Table 19 The performance of binary \(\beta\)HC(kNN-TT) against binary \(\beta\)HC(kNN-KF)

3.3 Comparison with local search method

In order to investigate the effectiveness of the proposed binary \(\beta\)HC algorithm for the feature selection problem, the algorithm is compared in this section against other popular local search-based algorithms: hill climbing (HC), simulated annealing (SA), and variable neighborhood search (VNS). It should be noted that these algorithms are executed under the same conditions as the proposed algorithm in order to ensure fairness. Table 20 shows the parameter settings of these algorithms.

The comparison results of the proposed \(\beta\)HC algorithm against other local search-based algorithms are reported in Tables 21, 22, 23, and 24. These tables show the average (Avg) and the standard deviation (Stdv) achieved by each algorithm in terms of the classification accuracy, the fitness values, the selected features, and the elapsed CPU time, respectively. The bold font is used to point out the finest outcomes.

Table 20 The parameter settings of the local search-based algorithms

Table 21 shows that the proposed binary \(\beta\)HC outperforms the other comparative methods on 16 out of 22 datasets. The HC algorithm outperforms the other comparative algorithms on 3 datasets, and the VNS algorithm is superior to the other comparative algorithms on 3 datasets, while the SA algorithm did not achieve the best result on any dataset. Notably, the average accuracy of the proposed binary \(\beta\)HC algorithm reached 100\(\%\) on the M-of-n and Zoo datasets and 99.9\(\%\) on the Exactly dataset. This supports the claim that the proposed algorithm succeeds in reaching the proper balance between exploration and exploitation and thus avoids the local optima problem.

Table 21 The classification accuracy results of the binary \(\beta\)HC compared to other local search-based methods

Similarly, the experimental results obtained by the proposed binary \(\beta\)HC algorithm as well as the other comparative local search-based algorithms in terms of the fitness values are recorded in Table 22. This table shows that the \(\beta\)HC, HC, and VNS algorithms obtain the same average fitness value on the M-of-n dataset. Interestingly, the proposed binary \(\beta\)HC outperforms the other comparative methods on 13 out of 22 datasets. The HC and VNS algorithms obtained the finest outcomes on 4 datasets each, while the SA algorithm did not realize the best result on any dataset.

Table 22 The fitness values of the binary \(\beta\)HC compared to other local search-based methods

Table 23 demonstrates the average number of selected features obtained by the proposed binary \(\beta\)HC algorithm and the other comparative local search-based algorithms. The SA algorithm obtains the finest outcomes on 13 datasets, while the VNS algorithm achieves the finest outcomes on 9 datasets. The proposed binary \(\beta\)HC and HC algorithms did not realize the finest outcomes on any dataset.

Table 23 The selected features of the binary \(\beta\)HC compared to other local search-based methods

Finally, Table 24 demonstrates the average (Avg) and the standard deviation (Stdv) of the elapsed CPU times produced by the proposed \(\beta\)HC algorithm as well as the other comparative local search-based algorithms. The results summarized in Table 24 are in line with those recorded in Table 23: the SA algorithm achieved the finest outcomes on 13 datasets, and the VNS algorithm realized the finest outcomes on 9 datasets, while the proposed \(\beta\)HC and HC algorithms did not obtain any of the best results.

Table 24 The CPU time (in seconds) of the binary \(\beta\)HC compared to other local search-based methods

3.4 Comparison with other metaheuristics

The proposed binary \(\beta\)HC efficiency is verified by comparing it against ten other metaheuristics methods from the literature. These methods include: binary grasshopper optimization algorithm (BGOA) (Mafarja et al. 2018b), binary grey wolf optimizer (BGWO) (Mafarja et al. 2018b), binary gravitational search algorithm (BGSA) (Mafarja et al. 2018b), binary bat algorithm (BBA) (Mafarja et al. 2018b), binary salp swarm algorithm (BSSA) (Aljarah et al. 2018), hybrid gravitational search algorithm (HGSA) (Taradeh et al. 2019), whale optimization algorithm (WOA) (Mafarja and Mirjalili 2018), binary dragonfly optimization (BDA) (Mafarja et al. 2018a), genetic algorithm (GA) (Kashef and Nezamabadi-pour 2015), and particle swarm optimization (PSO) (Kashef and Nezamabadi-pour 2015).

The average classification accuracy achieved by all comparative methods is summarized in Table 25, with the finest outcomes highlighted in boldface. The results of the proposed binary \(\beta\)HC algorithm are collected from Tables 3 and 7. The results in Table 25 demonstrate that the BDA algorithm achieves the best performance, as it outperforms the other comparative methods on 14 datasets. Remarkably, the binary \(\beta\)HC algorithm is ranked second, as it outperforms the other comparative methods on 7 datasets. On the other hand, six of the comparative methods did not obtain the best result on any dataset.

Table 26 illustrates the average ranking of the proposed binary \(\beta\)HC algorithm against the comparative methods using the Friedman statistical test. Note that the average ranking of the comparative methods is computed using the results in Table 25. The significance level \(\alpha\) is set to 0.05 as suggested in García et al. (2010). Interestingly, the binary \(\beta\)HC algorithm outperforms the other comparative methods by obtaining the first rank.
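The average rankings behind a Friedman test of this kind are obtained by ranking the algorithms on each dataset (rank 1 = highest accuracy) and averaging the per-dataset ranks. A sketch with standard tie handling (tied scores share the mean of their positions); the data layout is hypothetical, not the paper's:

```python
def average_ranks(scores):
    """scores[d][a] = accuracy of algorithm a on dataset d (higher is
    better). Returns the Friedman-style mean rank of each algorithm,
    with tied scores sharing the average of their tied positions."""
    n_alg = len(scores[0])
    totals = [0.0] * n_alg
    for row in scores:
        order = sorted(range(n_alg), key=lambda a: -row[a])
        ranks = [0.0] * n_alg
        i = 0
        while i < n_alg:
            j = i
            # extend j over the run of algorithms tied with position i
            while j + 1 < n_alg and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1          # mean of the tied 1-based positions
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for a in range(n_alg):
            totals[a] += ranks[a]
    return [t / len(scores) for t in totals]
```

The algorithm with the smallest mean rank is the control method for the subsequent post-hoc (Holm/Hochberg) comparisons.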

Thereafter, the Holm and Hochberg post-hoc statistical tests are utilized to calculate the adjusted \(p\)-values between the first-ranked (i.e., control) method identified by the Friedman test and the other methods. Again, the proposed binary \(\beta\)HC algorithm is ranked first, as demonstrated in Table 26. Table 27 reveals that the performance of the binary \(\beta\)HC algorithm is statistically superior to 7 of the other comparative methods (i.e., BSSA, WOA, BGWO, BGSA, PSO, GA, and BBA) using the significance level \(\alpha\)/Order. Also, the statistical test reveals that there is no significant difference between the binary \(\beta\)HC algorithm and three of the comparative methods (BDA, BGOA, and HGSA).

Table 25 The classification accuracy results of the binary \(\beta\)HC compared to other metaheuristics
Table 26 Average rankings of the algorithms calculated using Friedman test
Table 27 Holm/Hochberg results between the binary \(\beta\)HC algorithm and other methods

3.5 Comparison with filter-based techniques

For further validation, the efficiency of the proposed binary \(\beta\)HC algorithm is compared against relevant filter-based techniques in terms of the average classification accuracy. The comparative outcomes are recorded in Table 28. Note that the filter-based outcomes are taken from Mafarja et al. (2018b), which uses similar configurations to the proposed method. The best outcomes are pointed out using bold font. The results of the binary \(\beta\)HC algorithm are extracted from Tables 3 and 7, while the comparative filter-based techniques are: correlation-based feature selection (CFS) from Hall and Smith (1999), fast correlation-based filter (FCBF) from Yu and Liu (2003), Fisher score (F-score) from Duda et al. (2012), information gain (IG) from Cover and Thomas (2012), and wavelet power spectrum (WPS) from Zhao and Liu (2007). Table 28 demonstrates that the binary \(\beta\)HC algorithm outperforms the other techniques on 20 out of 22 datasets. This proves that the binary \(\beta\)HC algorithm is capable of examining the search space efficiently and obtaining good results when compared to the other comparative methods.

Table 29 shows the average ranking of the proposed binary \(\beta\)HC algorithm and the five filter-based techniques using the Friedman test. These rankings are computed using the outcomes presented in Table 28. Again, the significance level \(\alpha\) is 0.05. Remarkably, the binary \(\beta\)HC algorithm is ranked first, while the F-score technique is ranked second. The FCBF, CFS, WPS, and IG techniques occupy the third to the last positions, respectively. Eventually, the Holm and Hochberg statistical tests are utilized to calculate the \(p\)-values between the control algorithm (i.e., the binary \(\beta\)HC algorithm) and the other comparative techniques. Interestingly, there are significant differences between the binary \(\beta\)HC algorithm and all five filter-based techniques, as demonstrated in Table 30.

Table 28 The accuracy results of the binary \(\beta\)HC compared to all filter-based methods
Table 29 Average rankings of the algorithms calculated using Friedman test
Table 30 Holm/Hochberg results between the binary \(\beta\)HC algorithm and filter-based techniques

After the extensive experiments conducted to prove the viability of the proposed method, we can conclude that the proposed method is a very efficient algorithm for feature selection problems. The proposed method works well when the value of the parameter \({\mathcal {N}}\) is large and that of \(\beta\) is small. This is because the \({\mathcal {N}}\) parameter determines to what extent the proposed method can make use of the accumulated search, while the \(\beta\) parameter acts like a mutation rate that determines the degree of randomization in the search.

Apparently, the proposed method competes very well against the other local search-based methods and the other advanced metaheuristic techniques, as borne out by the obtained results. Although the proposed method belongs to the family of local search-based algorithms, which are normally simple and easy to use, it outperforms the other methods on 7 out of 22 datasets of various sizes and complexities. This is because the proposed method is able to achieve the right balance between exploration through the \(\beta\) operator and exploitation through the \({\mathcal {N}}\) operator. Interestingly, the computational time required to achieve the final results for the proposed method is very small; local search-based algorithms conventionally require less computational time than population-based algorithms.

4 Conclusion and future works

In this paper, a new version of a local search-based algorithm, namely \(\beta\)-hill climbing, is proposed to solve the feature selection problem. The proposed algorithm is called the binary \(\beta\)-hill climbing optimizer. A new operator called the \({\mathcal {T}}\)-operator is added to the existing \(\beta\)-hill climbing operators (the \({\mathcal {N}}\)-operator, \(\beta\)-operator, and \({\mathcal {S}}\)-operator) to transfer the continuous values of the produced feature solution into binary values using the S-shaped strategy.
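The interplay of the four operators can be pictured as one iteration of the general scheme below. This is only a sketch under simplifying assumptions (not the paper's exact operator definitions or parameter names): the \({\mathcal {N}}\)-operator takes a small random walk, the \(\beta\)-operator randomly resets components with probability \(\beta\), the \({\mathcal {T}}\)-operator binarizes via a sigmoid, and the \({\mathcal {S}}\)-operator keeps the better solution.

```python
import math
import random

def beta_hc_step(x, fitness, bw=0.1, beta=0.5, rng=random):
    """One sketched iteration of a binary beta-hill-climbing step.

    x       : current continuous solution (list of floats)
    fitness : maps a binary vector to a value to MINIMISE
    bw      : neighbourhood bandwidth of the N-operator (assumed form)
    beta    : probability of the random-reset beta-operator
    """
    # N-operator: small random walk around the current solution (exploitation)
    cand = [xi + bw * (2 * rng.random() - 1) for xi in x]
    # beta-operator: with probability beta, reset a component (exploration)
    cand = [rng.uniform(-1, 1) if rng.random() < beta else ci for ci in cand]

    # T-operator: S-shaped transfer from continuous values to a bit mask
    def to_bits(v):
        return [1 if rng.random() < 1 / (1 + math.exp(-vi)) else 0 for vi in v]

    # S-operator: greedy selection, keep the better of current vs candidate
    if fitness(to_bits(cand)) <= fitness(to_bits(x)):
        return cand
    return x
```

Note that because the binarization is stochastic, a practical implementation would cache the binary mask and fitness of the incumbent rather than re-sampling it each iteration; the sketch omits this for brevity.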

To evaluate the proposed binary \(\beta\)-hill climbing optimizer, four measurements are used: fitness function, classification accuracy, number of relevant features, and elapsed CPU time. For the evaluation, 22 commonly used problem instances of various sizes and complexities are picked from the UCI datasets. Different evaluation experiments are conducted, including parameter configurations, the effect of the transfer functions, the effect of the classifier used, and comparative evaluations against other local search-based methods as well as other population-based algorithms using the same UCI datasets.

The influence of the main parameters (i.e., \(\beta\) and \({\mathcal {N}}\)) of the binary \(\beta\)-hill climbing optimizer on the algorithm's convergence behavior is studied. In conclusion, for the \(\beta\) operator, a higher value achieves superior results, which means that exploration is beneficial in the feature selection search space. Also, a larger \({\mathcal {N}}\) parameter value obtains better results in most cases. Furthermore, eight different transfer functions, comprising the S-Shaped and V-Shaped families, are experimented with. In summary, the outcomes produced by the proposed binary \(\beta\)HC with the S-Shaped transfer function almost always excel the results produced by the other transfer functions. Furthermore, three classifiers (i.e., kNN, SVM, and decision tree (DT)) are utilized to compute the classification accuracy. The kNN classifier is adopted for the proposed method since it delivers the best performance. However, the main limitation of the proposed method lies in its parameter configurations and trajectory-based search, which might result in getting stuck in local minima very quickly. Furthermore, the convergence behavior of the proposed method might vary from one feature selection problem to another due to the No-Free-Lunch theorem in optimization (Wolpert et al. 1997).

The proposed binary \(\beta\)-hill climbing is compared against 13 other comparative methods (3 local search-based algorithms and 10 metaheuristic algorithms) using the same datasets. The proposed binary \(\beta\)-hill climbing optimizer excels the other comparative local search-based approaches on 16 out of 22 datasets, which means that the proposed method is very competitive when compared to other local search-based approaches. On the other hand, the binary \(\beta\)-hill climbing outperforms the other comparative metaheuristic approaches on 7 out of 22 datasets. These results prove the effectiveness of the proposed binary \(\beta\)-hill climbing optimizer as an important addition to the body of knowledge in the machine learning and classification domain.

The Friedman statistical test at a significance level \(\alpha\) of 0.05 shows that the binary \(\beta\)HC algorithm outperforms the other comparative metaheuristic methods by obtaining the first rank. In addition, applying the Holm and Hochberg post-hoc statistical tests shows that the performance of the binary \(\beta\)HC algorithm is statistically better than 7 of the other comparative metaheuristic methods (i.e., BSSA, WOA, BGWO, BGSA, PSO, GA, and BBA) using the significance level \(\alpha\)/Order.

As the proposed binary \(\beta\)-hill climbing is shown to be very successful when used to solve the feature selection problem, we plan to address the following issues in the future:

  • Feature selection applications: other feature selection applications, such as gene selection, which might be more complex, can be tackled using the proposed binary \(\beta\)-hill climbing optimizer.

  • Adaptive \(\beta\)-HC: the adaptive version of the \(\beta\)-hill climbing optimizer can be utilized for feature selection applications to simplify the parameter adaptation.

  • Hybridization with population-based algorithms: hybridize the binary \(\beta\)-hill climbing optimizer with other swarm-based algorithms to empower the exploration capabilities of the algorithm.

  • Using other evaluation measures in the experiments: the performance of the proposed method can also be evaluated using other measures such as sensitivity, specificity, and area under the curve (AUC).

  • Experimenting with large-scale datasets: large-scale datasets can be experimented with to evaluate the efficiency and scalability of the proposed method.