Binary $$\beta$$ -hill climbing optimizer with S-shape transfer function for feature selection

Al-Betar, Mohammed Azmi; Hammouri, Abdelaziz I.; Awadallah, Mohammed A.; Abu Doush, Iyad

doi:10.1007/s12652-020-02484-z

Binary $\beta$-hill climbing optimizer with S-shape transfer function for feature selection

Original Research
Published: 28 August 2020

Volume 12, pages 7637–7665, (2021)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Journal of Ambient Intelligence and Humanized Computing Aims and scope Submit manuscript

Binary $\beta$-hill climbing optimizer with S-shape transfer function for feature selection

Download PDF

Mohammed Azmi Al-Betar^1,2,
Abdelaziz I. Hammouri³,
Mohammed A. Awadallah ORCID: orcid.org/0000-0002-7815-8946⁴ &
…
Iyad Abu Doush^5,6

466 Accesses
22 Citations
Explore all metrics

Abstract

Feature selection is an essential stage in many data mining and machine learning and applications that find the proper subset of features from a set of irrelevant, redundant, noisy and high dimensional data. This dimensional reduction is a vital task to increase classification accuracy and thus reduce the processing time. An optimization algorithm can be applied to tackle the feature selection problem. In this paper, a $\beta$-hill climbing optimizer is applied to solve the feature selection problem. $\beta$-hill climbing is recently introduced as a local-search based algorithm that can obtain pleasing solutions for different optimization problems. In order to tailor $\beta$-hill climbing for feature selection, it has to be adapted to work in a binary context. The S-shaped transfer function is used to transform the data into the binary representation. A set of 22 de facto benchmark real-world datasets are used to evaluate the proposed algorithm. The effect of the $\beta$-hill climbing parameters on the convergence rate is studied in terms of accuracy, the number of features, fitness values, and computational time. Furthermore, the proposed method is compared against three local search methods and ten metaheuristics methods. The obtained results show that the proposed binary $\beta$-hill climbing optimizer outperforms other comparative local search methods in terms of classification accuracy on 16 out of 22 datasets. Furthermore, it overcomes other comparative metaheuristics approaches in terms of classification accuracy in 7 out of 22 datasets. The obtained results prove the efficiency of the proposed binary $\beta$-hill climbing optimizer.

Binary Black Widow with Hill Climbing Algorithm for Feature Selection

Bi-objective feature selection in high-dimensional datasets using improved binary chimp optimization algorithm

Article 10 August 2024

Binary arithmetic optimization algorithm for feature selection

Article 17 May 2023

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Data mining research community works on designing and improving techniques for data classifications (Mashrgy et al. 2014; Al-Abdallah et al. 2017), pattern recognition (Ma and Xia 2017), and machine learning (Lee and Lee 2006; Doush and Sahar 2017; Sawalha and Doush 2012). Some data mining problems contain huge data with thousands of features. In many cases, only a set of proper features is needed while the others are redundant, irrelevant or noisy. Picking a subset of these features to accurately represent the entire set of features can largely affect the performance of machine learning algorithms in different properties such as time complexity and classification accuracy (Hu et al. 2006).

Feature selection (FS) is choosing a relevant set of features from a large group of features to represent a record in a dataset. The feature selection technique is applied in many applications like text classification (Forman 2003; Deng et al. 2019), text mining (Jing et al. 2002; Ravisankar et al. 2011), image recognition (Goltsev and Gritsenko 2012), image retrieval (Rashedi et al. 2013; Dubey et al. 2015), bioinformatic data mining (Saeys et al. 2007; Urbanowicz et al. 2018), and many others reported in (Bolón-Canedo and Alonso-Betanzos 2019).

There are three types of feature selection techniques based on the evaluation criteria: filter-based, wrapper-based, and embedded-based methods (Li et al. 2017). Firstly, filter-based feature selection methods give a score for each feature in the dataset using a statistical measure [e.g., information gain (Shang et al. 2013), Chi-squared test (Liu and Setiono 1995), or ReliefF (Robnik-Šikonja and Kononenko 2003)]. Then these features are ranked based on their score. As a result, the features with the higher ranking are kept, while the features with lower rank are removed. Secondly, wrapper-based feature selection methods use search algorithms (e.g., genetic algorithm or particle swarm optimization) to evaluate the generated subset of features. After completing the search process, one of the classifiers (e.g., k-nearest neighbor (Park and Kim 2015), naive bayes (Bermejo et al. 2014), decision trees (Sindhu et al. 2012), etc.) is used to evaluate the quality of the chosen subset of features in term of accuracy. Finally, the integration of a wrapper-based and a filter-based method is known as an embedded-based method in which the searching algorithm is embedded in the classifier such as the k-nearest neighbor algorithm (kNN). Then it guides the classifier to pick some features that can achieve higher accuracy.

In the context of optimization, the FS is considered not easy to solve combinatorial optimization problem (Gheyas and Smith 2010). The complexity of the FS problem comes from selecting the relevant set of features from a plenty possible subsets. For example, the power-set of the set ${\mathcal {A}}$ of size N features contains $2^N-1$ possible subsets of features. Therefore, as the number of features increase the number of solutions to look for the problem increases exponentially. The picked set of features is modeled using a function which is guided by the accuracy of the classification and the number of used feature. The FS solution is conventionally expressed as a binary array of the selected features.

The brute-force method can be used to solve the FS problem where all possible subsets generated and evaluated, then the relevant subset is identified (Lai et al. 2006). This type of algorithm cannot be used when we have a large number of features. Heuristic algorithms can be utilized to obtain the optimal set for the FS problem (Zhong et al. 2001). This type of algorithms can find efficiently the subset of relevant features. However, the quality of this acceptable subset is not necessarily guaranteed (Talbi 2009). Therefore, researchers use metaheuristic-based algorithms to find an optimal portion of relevant features in a feasible time with high classification accuracy.

Metaheuristic-based algorithms can be used to solve different kinds of optimization problems using self-learning operators that is configured with operators to efficiently explore and exploit possible solutions, hoping to attain the best solution (Blum and Roli 2003). Metaheuristic-based algorithms can be classified into population-based and local search-based algorithms (Blum and Roli 2003). Population-based algorithms examine several search space regions concurrently and improve them iteratively wishing to obtain the optimal solution. Examples of population-based algorithm for FS include genetic algorithm (Ghareb et al. 2016), differential evolution (Mlakar et al. 2017), ant lion optimizer (Emary et al. 2016), grey wolf optimization (Emary et al. 2016), ant colony optimization (Kabir et al. 2012), competitive swarm optimizer (Gu et al. 2018), firefly algorithm (Zhang et al. 2018; Al-Abdallah et al. 2017), grasshopper optimization algorithm (Mafarja et al. 2018b, 2019), bat algorithm (Mafarja et al. 2018b), whale optimization algorithm (Mafarja and Mirjalili 2018), dragonfly optimization (Mafarja et al. 2018a), crow search algorithm (Sayed et al. 2019), gravitational search algorithm (Taradeh et al. 2019), and harmony search algorithm (Moayedikia et al. 2017)

Local search-based algorithms, the focal point of this paper, consider one solution at a time. Let’s call this the initial solution. It will be modified repetitively using an operator which allow visiting nearby values until a peak local value is found. A local search-based algorithm is capable of thoroughly investigate a specific region of the initial solution and find the local optima. Such algorithms have a limitation of not exploring multi-search space regions concurrently. Therefore, some random strategies are employed to empower the local search-based approach. In the literature, FS is tackled by several local search-based algorithms such as tabu search (Zhang and Sun 2002), GRASP (Bermejo et al. 2011), iterated local search (Marinaki and Marinakis 2015), variable neighborhood search (Marinaki and Marinakis 2015), and stochastic local search method (Boughaci and Alkhawaldeh 2018).

Due to the complex nature of FS problems, most FS-based algorithms are either a modification of a metaheuristic algorithm or a hybridization of two or more metaheuristic algorithms. Examples of modified metaheuristic algorithms that are used to solve FS problems are binary ant lion optimizer using S-shaped function and V-shaped function (Emary et al. 2016), binary grey wolf optimization using crossover and sigmoidal function (Emary et al. 2016), binary dragonfly optimization using time-varying transfer functions (Mafarja et al. 2018a), and chaotic crow search algorithm (Sayed et al. 2019). Examples of hybrid metaheuristics are the hybridization of the ant colony optimization with the wrapper and filter approaches (Kabir et al. 2012), and the integration of the gravitational search algorithm with evolutionary crossover and mutation operators (Taradeh et al. 2019).

$\beta$-hill climbing is a local search-based algorithm that is recently introduced Al-Betar (2017). It is an improved version of the hill-climbing algorithm with a $\beta$ operator that is governed by the $\beta$ parameter to diversify the search as well as a neighboring operator called ${\mathcal {N}}$ to intensify the search. The $\beta$ operator empowers the $\beta$-hill climbing to intelligently escape the local optima by searching different regions and digging deeply into the regions of the search space. Due to its simplicity, the algorithm is adapted in a broad range optimization problems such as ECG and EEG signal denoising (Alyasseri et al. 2017a, b, 2018), generating substitution-boxes (Alzaidi et al. 2018), gene selection (Alomari et al. 2018b), economic load dispatch problem (Al-Betar et al. 2018), mathematical optimization functions (Abed-alguni and Alkhateeb 2018; Abed-alguni and Klaib 2018), multiple-reservoir scheduling (Alsukni et al. 2017), sudoku game (Al-Betar et al. 2017), text document clustering (Abualigah et al. 2017a, b), and cancer classification (Alomari et al. 2018a). $\beta$-hill climbing produces successful outcomes for a broad range of optimization problems. It produces pleasing results for many real-world problems, and such algorithm can be used to tackle the FS problem.

In this paper, a binary version of $\beta$-hill climbing optimizer is developed to tackle the problem of feature selection. The algorithm is evaluated using 22 UCI machine learning benchmark datasets that are used widely in the literature. The evaluation is discussed using four measurement criteria which are the fitness function, the classification accuracy, the number of relevant features, and the elapsed CPU time. The impact of the parameter settings ($\beta$ and ${\mathcal {N}}$) for the binary $\beta$-hill climbing optimizer is carefully analyzed and studied. Furthermore, the effect of using different transfer functions as well as the different classifiers on the efficiency of the binary $\beta$-hill climbing is studied. The comparative evaluation is conducted against three local search methods using the same datasets on the four evaluation criteria. Interestingly, the proposed binary $\beta$-hill climbing optimizer excel other comparative local search methods 16 out of 22 datasets. On the other hand, it overcomes other comparative metaheuristics methods approaches in 7 out of 22 datasets and very-closed to the best results for the remaining 15 datasets. The binary $\beta$-hill climbing optimizer can be considered as a very important addition to the body of knowledge in the machine learning and classifications domain as it produces very promising outcomes when compared against other methods.

The rest of the paper is organized as follows: in sect. 2, the procedural steps of the proposed binary $\beta$-hill climbing algorithm is provided. The parameter setting analysis and comparative evaluations of the work are discussed in sect. 3. Finally, the conclusion is presented and the possible future research directions are shown in sect. 4.

2 Binary $\beta$-hill climbing optimizer for feature selection

Metaheuristic-based methods can be local search-based or population-based. Most of the techniques applied to resolve the feature selection problem are population-based methods such as swarm-based algorithms or evolutionary-based algorithm.

$\beta$-hill climbing optimizer is an enhanced variant of the basic hill climbing algorithm. It is a local search-based method proposed in Al-Betar (2017) to escape from being stuck into a local optima. $\beta$-hill climbing optimizer can be initiated with a random or heuristic solution [say ${{\varvec{x}}}=(x_1,x_2, \ldots , x_n)$]. In each step, the current solution can be improved using three operators: (i) ${\mathcal {N}}$-operator which is controlled by ${\mathcal {N}}$ parameter to exploit a specific region on the search space, (ii) $\beta$-operator which is controlled by $\beta$ parameter to examine the solution space, and (iii) ${\mathcal {S}}$-operator that utilizes the principle of survival-of-the-fittest. Normally, the search process of $\beta$-hill climbing optimizer is halted once the maximum number of iterations/time is reached.

The $\beta$-hill climbing optimizer can be applied for discrete or continuous search spaces. The variables of the feature selection problem are assigned by binary values (i.e., being selected or not). Therefore, a new operator called ${\mathcal {T}}$-operator (i.e., transfer operator) is proposed to transfer the variable into binary using the sinusoidal function. Consequently, the new version of $\beta$-hill climbing optimizer is called binary $\beta$-hill climbing optimizer. The algorithm pseudo-code is shown in Algorithm 1 and it is flow-charted in Fig. 1. The steps and the four operators are described as follows:

Step 1: Initialize binary $\beta$-hill climbing and feature selection parameters— The parameters of the binary $\beta$-hill climbing for feature selection are set in this step. The problem of feature selection is known to have a binary search space. Therefore, the solution is modeled as a binary vector ${{\varvec{x}}}=(x_1,x_2,\ldots ,x_n)$ of n features. The value of $x_i=1$ means that the feature i is considered. The parameters of the binary $\beta$-Hill climbing optimizer are ${\mathcal {N}}$ and $\beta$. The parameter ${\mathcal {N}} \in [0,1]$ controls the neighboring operator (${\mathcal {N}}$-operator) which is responsible for determining the adjustment bandwidth to modify current solution to a neighboring solution. The $\beta$ parameter controls the $\beta$-operator which determines the intensity of utilizing exploration in the neighboring solution. The last parameter of the proposed binary $\beta$-hill climbing optimizer is the maximum number of iterations which is $Max_itr$.
Step 2: Construct the initial solution— The initial solution ${{\varvec{x}}}=(x_1,x_2,\ldots ,x_n)$ is randomly constructed from the binary domain as follows:
$$\begin{aligned} x_{i}\leftarrow {\left\{ \begin{array}{ll} 1 &{} \qquad U[0,1] \ge 0.5\\ 0 &{} \qquad otherwise. \end{array}\right. } \end{aligned}$$
where U[0, 1] generates a random number between 0 and 1. In order to evaluate the initial solution, the set of features in the current solution are evaluated using the objective function formulated in Eq. (1) (Emary et al. 2016).
$$\begin{aligned} f({{\varvec{x}}})=\alpha \gamma _{R}(D)+ \left( 1-\alpha \right) \frac{|R|}{|N|} \end{aligned}$$
(1)
where the classification error rate is expressed as $\gamma _{R}(D)$. In this study, the kNN classifier is used to find the classification error rate (Liao and Vemuri 2002). Note that |R| is how many features are selected, |N| is the count of the entire features, $\alpha$ refers to the role of classification rate and the length of feature subset, $\alpha \in [0,1]$ .
Step 3: Improvement loop— The enhancement of the current feature selection solution (${{\varvec{x}}}$) is achieved by using four operators which are used to yield a neighboring solution (i.e., ${{\varvec{x}}}'$).
- ${\mathcal {N}}$-operator— This operator is responsible for moving the present solution ${{\varvec{x}}}$ to the near by solution ${{\varvec{x}}}'$ using Eq. (2). This operator is governed by the likeliness of picking ${\mathcal {N}}$ parameter where ${\mathcal {N}} \in [0,1]$. The probability determines the adjustment of the decision variables (features) in the current solution. A greater value of ${\mathcal {N}}$ aid in a furthest movement from the neighboring solution ${{\varvec{x}}}'$. The pseudo-code for ${\mathcal {N}}$-operator is shown in line 5 of Algorithm 1. Formally, let $x_i$ be given the value of $v_i(k)$ of $k^{th}$ position, then the following present how to allocate the value of $x'_i$:
  $$\begin{aligned} x'_i= x_{i, k} \pm {\mathcal {N}} \end{aligned}$$
  (2)
  where $x_{i, k} \pm {\mathcal {N}}$ is the neighboring value of $x_{i, k}$.
- $\beta$-operator— This operator is utilized for increasing the regions covered from the search space. It utilizes the idea of invariable mutation to the present solution. As shown in Eq. (3), $\forall i \in (1,2,\ldots ,n)$, the $x_i$ decision variable is picked at random to be adjusted using the $\beta$ parameter. This is pseudo-coded in lines from 6 to 10 in Algorithm 1.
  $$\begin{aligned} x'_{i}\leftarrow {\left\{ \begin{array}{ll} x_{r} &{} \qquad U[0,1] \le \beta \\ x'_i &{} \qquad otherwise. \end{array}\right. } \end{aligned}$$
  (3)
  Note that the $\beta$ parameter determines how often the uniform mutation is used. Also, $x_{r}$ is a random value which is either 0 or 1.
- ${\mathcal {T}}$-operator— As aforementioned, the feature selection problem deals with a binary values for the decision variables. Therefore, the sinusoidal function (or S-shape function as shown in Fig. 2) is adapted to transform continuous solutions into binary. To elaborate, the Sigmoidal function (Kennedy and Eberhart 1997) is formulated in Eq. (4):
  $$\begin{aligned} T\left( x'_{i}\right) =\frac{1}{1+e^{-x'_{i}}} \quad \forall i=(1,2,\ldots ,n) \end{aligned}$$
  (4)
  The value of the decision variables in the neighboring solution is re-assigned a binary value using Eq. (5). Let r be function that generates at random a value bounded by 0 and 1 (i.e., $r \in [0,1]$), the value of $x'_i$ of the feature i will be re-assigned as follows:
  $$\begin{aligned} x'_i ={\left\{ \begin{array}{ll} 1 &{} \qquad r<T\left( x'_{i}\right) \\ 0 &{} \qquad Otherwise) \end{array}\right. } \end{aligned}$$
  (5)
- ${\mathcal {S}}$-operator— The quality of the neighboring solution ${{\varvec{x}}}'$ is assessed by applying the objective function $f({{\varvec{x}}}')$ which is formulated in 1. The neighboring solution ${{\varvec{x}}}'$ is interchanged by the current one ${{\varvec{x}}}$, if it is better (i.e., $f({{\varvec{x}}}')\le f({{\varvec{x}}})$). The pseudo-code for the ${\mathcal {S}}$-operator is presented in the lines from 20 to 22 of Algorithm 1.
Step 4: Stop criterion— The proposed binary $\beta$-hill climbing is iterated until a stop criterion is reached. The stop criterion used in this study is based on the number of iteration $\text {Max}\_\text {Itr}$ defined at Step 1.
Step 5: kNN classifier— The accuracy of the obtained solution by binary $\beta$-hill climbing is evaluated using a kNN classifier. kNN is an effective non-parametric method that is utilized for classification and regression. The kNN starts by storing all the training data instances. After that, a pairwise computation is applied to calculate the similarity between the training instances when compared against the unseen instances (Chen et al. 2009; Weinberger and Saul 2009). Then selecting the k-closest instances. This operation is done repetitively for all the unseen instances.

In order to compute the classification accuracy and error rate measurements, Eqs. (6) and (7) are used. Classification accuracy is a statistical measure which defines the ability of the classifier to correctly use the picked features to precisely label a given tuple into a class. It can be computed using Eq.(6).
$$\begin{aligned} Accuracy= \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(6)
where TP (true positive) denotes identifying correctly the class using a precise set of features. TN (true negative) denotes identifying correctly that it is not the class using a correct set of features. FP (false positive) denotes identifying incorrectly that it is the class. Finally, FN (false negative) denotes identifying incorrectly that it is not the class.

The classification error rate calculated in Eq.(7) is used to determine the percentage of features that are incorrectly assigned. It will be part of the formulated objective function.
$$\begin{aligned} \gamma _{R}(D)= 1-Accuracy \end{aligned}$$
(7)

2.1 Computational complexity of the proposed method

The time complexity required for the proposed binary $\beta$-hill climbing algorithm is measured by analyzing the pseudo-code given in Algorithm 1 in terms of big-${\mathcal {O}}$ notation. The binary $\beta$-hill climbing pseudo-code can be divided into three parts: (i) The initial phase (from line 1 to line 3 in Algorithm 1); (ii) The improvement phase (from line 5 to line 24 in Algorithm 1); and (iii)The classifier phase (the line 25 in Algorithm 1).

The time complexity of the initial phase is ${\mathcal {O}}(n)$ for the construction of the initial solution. The f(x) calculation is based on the kNN classifier which computes the classification error rate and thus its complexity is ${\mathcal {O}}(n^2)$. Therefore, the time complexity for the initial phase is ${\mathcal {O}}(n^2)$.

The time complexity of the second phase (i.e., improvement phase) rely upon number of iterations ($\text {Max}\_\text {Itr}$) and the time required for ${\mathcal {N}}$-operator, $\beta$-operator, ${\mathcal {T}}$-operator, and ${\mathcal {S}}$-operator. The time complexity of the ${\mathcal {N}}$-operator is ${\mathcal {O}}(1)$ while the complexity of the $\beta$-operator is ${\mathcal {O}}(n)$. Furthermore, the time complexity of the ${\mathcal {T}}$-operator is also ${\mathcal {O}}(n)$ while ${\mathcal {S}}$-operator requires ${\mathcal {O}}(n^2)$. In brief, the time complexity of the second phase is ${\mathcal {O}}(\text {Max}\_\text {Itr} \cdot n^2)$.

The time complexity of the classifier phase is ${\mathcal {O}}(n^2)$ which is the time required to execute the kNN classifier. as a wrap-up, the time complexity required to execute the developed binary $\beta$-hill climbing is ${\mathcal {O}}(\text {Max}\_\text {Itr} \cdot n^2)$.

3 Experiments and results

A comprehensive experimental analysis is conducted in this section to investigate the proposed binary $\beta$HC algorithm efficiency when solving the problem of feature selection. The experiments are divided as follows: (i) the effect of two parameters of binary $\beta$HC (i.e., ${\mathcal {N}}$ and $\beta$) on the algorithm performance is studied in sect. 3.2.1 and 3.2.2; (ii) the influence of different transfer functions on the efficiency of the proposed binary $\beta$HC algorithm is presented in sect. 3.2.3; (iii) the performance of the proposed binary $\beta$HC algorithm using different classifiers is summarized in sect. 3.2.4; (iv) The effect of training/testing against k-fold cross validation models on the performance of the proposed algorithm is provided in Sec.3.2.5 ; and (v) the efficiency of the binary $\beta$HC algorithm is compared against other local search-based algorithms in sect. 3.3, it is compared against recent metaheuristic methods in sect. 3.4, and it also compared against filter-based approach in sect. 3.5. It should be noted that the attributes of the datasets utilized in the algorithm assessment are summarized in sect. 3.1.

The binary $\beta$HC algorithm was implemented using MATLAB (R2014a) and tested on a laptop with 2.80 Intel Core i7 with 16 GB RAM. The operating system installed on the laptop is Microsoft Windows 10. In all the experiments each dataset is splitted at random into two portions: training which is 80% of the instances and testing which is the remaining 20%. This split is used as it has been widely adapted by several state-of-the-art methods (Mafarja et al. 2019; Alsaafin and Elnagar 2017; Li et al. 2011; Wieland and Pittore 2014)

3.1 Dataset

The proposed binary $\beta$HC algorithm is evaluated using twenty-two datasets collected from the UCI data repository. A brief of the datasets characteristics is presented in Table 1. The summary shows for each dataset the FS problem name, the number of features, and the instances count.

Table 1 The characteristics of the datasets

Binary \(\beta\)-hill climbing optimizer with S-shape transfer function for feature selection

Abstract

Similar content being viewed by others

Binary Black Widow with Hill Climbing Algorithm for Feature Selection

Bi-objective feature selection in high-dimensional datasets using improved binary chimp optimization algorithm

Binary arithmetic optimization algorithm for feature selection

Explore related subjects

1 Introduction

2 Binary \(\beta\)-hill climbing optimizer for feature selection

2.1 Computational complexity of the proposed method

3 Experiments and results

3.1 Dataset

3.2 Evaluation of the proposed binary \(\beta\)HC algorithm

3.2.1 Study the effect of the \({\mathcal {N}}\) parameter

3.2.2 Study the effect of parameter \(\beta\)

3.2.3 Study the effect of the different transfer functions

3.2.4 Study the effect of the different classifiers

3.2.5 The effect of training/testing against k-fold cross validation models

3.3 Comparison with local search method

3.4 Comparison with other metaheuristics

3.5 Comparison with filter-based techniques

4 Conclusion and future works

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation