1 Introduction

Feature selection (FS) plays an indispensable role in big data classification. Data mining is the process of discovering knowledge in big data in order to support decision making. The field of data mining is maturing, with a wealth of popular applications such as health studies (Marino et al. 2018), machine learning (Cai et al. 2018; Liu et al. 2017; Zhang et al. 2019), biological data (Pashaei and Aydin 2017; Yamada et al. 2018), financial data (Attigeri and Manohara Pai 2019; Jadhav et al. 2018), and medical diagnosis (Habib et al. 2020; Singh et al. 2016). The Internet of Things (IoT) is one of the most recent emerging technologies and is characterized by massive data production (Zhao and Dong 2018). In many IoT applications, processing the big data generated by the sensor nodes is difficult, because the nodes are constrained in energy, storage, computing power, and communication range (Wu et al. 2018). FS can reduce the dimensionality of the big data produced by IoT and discard the unwanted data, which facilitates data processing (Sun et al. 2018).

A challenging problem that arises in data mining is dealing with massive dimensions of data. Even the bliss of technology can become a curse with large dimension sizes of data. Data dimensionality may hamper the data mining process. Furthermore, it needs high computing costs of space and time. The traditional machine learning methods can’t deal with these vast datasets. The dataset comprises a collection of samples or instances representing information about a specific case. Each sample consists of a set of features. The dataset issue is not restricted to its huge dimension sizes, but also the involvement of irrelevant or redundant features. Additionally, an elevated amount of noise may accompany the gathered dataset, and the model may be complicated.

Feature selection is a pre-processing phase performed to choose the optimal subset of features (OSF), that is, the informative features that contribute to the output. It also seeks to remove any irrelevant or redundant features. Two features are considered redundant if their values are completely correlated, so that either one can take the role of the other. FS has many advantages, such as reducing the data dimensionality, saving computing resources, minimizing training time, simplifying the model, and maximizing the classification accuracy of the model.

Feature selection has been very successful in alleviating the burden of data dimensionality. However, finding the OSF is a challenging and costly task. Three key factors determine an FS model: the classifier, the evaluation criterion, and the search algorithm. Classification aims to assign each sample to a specific class. One of the most common evaluation criteria is classification accuracy. FS methods can be split into the filter scheme, the wrapper scheme, and the embedded scheme. The filter scheme is fast but less accurate, whereas the wrapper scheme is slower but achieves higher classification accuracy. The embedded scheme is preferred when dealing with a particular model.

Many search algorithms are designed to search for the OSF. For the exponential or exhaustive search, if we assume a dataset with n features, then there are \( 2^{n} \) candidate solutions to examine. The exponential search can guarantee the OSF, but it is computationally expensive. The random search is another approach that randomly picks the next subset of features. In the sequential search, one solution is selected among all successors of the current solution. While random search and sequential search aren’t computationally expensive, the optimal solution isn’t guaranteed. Recently, metaheuristics have emerged and have proved to be more effective in solving many problems (Wang and Chen 2020; Zhao et al. 2019; Xu and Chen 2014; Shen et al. 2016; Wang et al. 2017; Xu et al. 2019; Chen et al. 2020). Metaheuristics work by exploring the search space for more promising solutions and exploiting the best solution found so far. This balance allows metaheuristic algorithms to outperform the previous search algorithms without being computationally expensive. We summarize the symbols and abbreviations used in this paper in Table 1.

Table 1 The meaning of the used symbols

To better understand the FS problem, we assume a dataset A that contains a set of n samples or instances where \( i = 1, 2, \ldots ,n \). Each sample has a set of d features or attributes where \( j = 1,2, \ldots ,d \). \( A_{ij} \) represents the value of jth feature in the ith sample. The samples are distributed among a set of m classes and \( k = 1, 2, \ldots , m \). The instances of the same class have similar attributes, whereas the instances of different classes have dissimilar attributes. Figure 1 depicts a description of a dataset.

Fig. 1 Dataset description

The main objective of this paper is to develop an FS technique based on the Harris Hawks optimization (HHO) algorithm hybridized with the simulated annealing (SA) algorithm. To achieve this objective, we propose a hybrid and improved method, namely the hybrid Harris Hawks optimization algorithm with bitwise operations and simulated annealing (HHOBSA). Our contributions can be summarized as follows:

  • A binary version of the HHO algorithm is hybridized with bitwise operations and the SA algorithm (HHOBSA) to increase population diversity and obtain more promising solutions.

  • Many experiments and comparisons with the existing metaheuristics using 24 big datasets and 19 artificial datasets are done.

  • Many performance measures are employed, including fitness values, computational time, classification accuracy, and size of selected features.

  • Several statistical tests of significance like Wilcoxon signed ranks and paired-samples T-tests are conducted and superior results are revealed by the proposed algorithm HHOBSA.

We organize the remainder of the paper as follows. Section 2 offers an overview of the related work on FS. Section 3 describes the preliminaries used, including the Harris Hawks optimization algorithm, the SA algorithm, and the K-nearest neighbor method. Section 4 explains the suggested algorithm (HHOBSA). Section 5 presents numerical and comparison results. In Sect. 6, the conclusions are drawn and future directions are indicated.

2 Related work

In an attempt to identify the OSF, researchers have developed various metaheuristic algorithms, many of which are reviewed here. Recently, a real-valued grasshopper algorithm (Zakeri and Hokmabadi 2019) was proposed for tackling FS problems. The algorithm implemented a statistical measure called feature probability factors to substitute the duplicate features with the most promising ones during iterations. The algorithm needs to be further improved and compared with more algorithms on more datasets. Also, Mafarja et al. (2019b) suggested a binary grasshopper algorithm using sigmoid and V-shaped transfer functions. Besides the transfer functions, a mutation operator is used to enhance the exploration of the algorithm. The performance of the suggested algorithm still needs to be examined with other classifiers. The grasshopper algorithm (Mafarja et al. 2018a) has also been integrated with evolutionary population dynamics and choice operators. That algorithm is time-consuming compared to other algorithms. Furthermore, Aljarah et al. (2018) used the algorithm with the support vector machine, but large-scale datasets were not considered.

Nematzadeh et al. (2019) applied the whale algorithm together with mutual congestion as a frequency-based filter method. The mutual congestion can foresee the class labels effectively. Mafarja et al. (2019a) studied the effect of adding eight different transfer functions to the whale algorithm. The transfer functions belong to two families called S-shaped and V-shaped. The results showed that the V-shaped methods were more efficient than the S-shaped methods. Also, Hussien et al. (2019) implemented a binary version of the whale algorithm based on an S-shaped transfer function. Mafarja et al. (2018) employed the crossover and mutation operators to promote the exploitation property, while tournament selection supported the exploration property of the whale algorithm. It would be valuable to test the algorithm on high-dimensional datasets. Besides, Zheng et al. (2018) combined the maximum Pearson maximum distance with the improved whale algorithm. However, the performance of the algorithm degrades as the dataset dimension grows.

In Tu et al. (2019a), the authors modified a version of the grey wolf optimizer algorithm that divides the population into dominant and omega wolves. For the dominant wolves, the enhanced elite learning strategy is adopted. For the omega wolves, a hybrid grey wolf algorithm with a differential evolution strategy is integrated in addition to total-dimensional and one-dimensional selection strategies. Although the algorithm performs well, several control parameters can affect its performance. In another study, the same authors (Tu et al. 2019b) integrated the grey wolf algorithm with three distinct strategies to update the solutions: adaptable cooperative, global-best lead, and disperse foraging. One disadvantage of that algorithm is that its parameters must be trained in advance to choose the best setting, which can be time-consuming. Abdel-Basset et al. (2019) developed a mutation operator with two phases to boost the effectiveness of the grey wolf optimizer algorithm. Also, Mafarja et al. (2019c) hybridized the grey wolf algorithm and the whale algorithm.

De Souza et al. (2018) suggested the crow search algorithm for solving the FS problem. The V-shaped transfer function is used to suit the binary nature of the problem. Ten chaotic maps were introduced to the crow search algorithm to enhance the algorithm’s performance and convergence speed (Sayed et al. 2019). Hybridization makes it possible to exploit the benefits of two algorithms. Hence, in Arora et al. (2019), the grey wolf algorithm is hybridized with the crow search algorithm to overcome its limitations. In this context, Al-Tashi et al. (2019) developed a hybrid algorithm of grey wolf optimizer and particle swarm optimization. Also, Yan et al. (2019) combined the SA algorithm and a tournament selection strategy with the coral reefs optimization algorithm. A KNN classifier and tenfold cross-validation are used to assess the solutions.

Rajamohana and Umamaheswari (2018) suggested the integration of Particle Swarm Optimization (PSO) and the shuffled frog leaping algorithm to lessen the large dimensionality of the feature set and to help customers ignore fake reviews that come from spammers. A PSO variant (Too et al. 2019) used a random selection approach that chooses a random inertia weight scheme from various inertia weight schemes each time, which helps to overcome the LO. Another study by Mafarja et al. (2018a) investigated the impact of five updating strategies for the inertia weight. An improved version of the PSO algorithm (Jain et al. 2018) incorporated the correlation-based FS and a Naive-Bayes classifier with stratified tenfold cross-validation. Also, Chen et al. (2019) implemented the particle swarm optimization algorithm with a logistic map sequence for tuning the inertia weight. The algorithm adopted the spiral-shaped mechanism around the optimal solution. The main weakness of the algorithm is that the number of selected features remains large for many datasets.

The efficacy of the gravitational search algorithm has been enhanced by using mutation and crossover operators (Taradeh et al. 2019). Reducing data dimensionality in network intrusion systems is a great challenge, and in this regard, a firefly algorithm with a C4.5 classifier is proposed (Selvakumar and Muneeswaran 2019). The algorithm consumes much time in selecting the best subset of features. Also, to handle the dimensionality of cancer datasets, Sayed et al. (2019) proposed a nested genetic algorithm that involves two nested genetic algorithms (outer and inner) working on two distinct kinds of datasets: microarray gene expression and DNA methylation. Sayed et al. (2019) applied ten chaotic maps to the dragonfly algorithm to select the best subset of features from the drug bank database. Pourpanah et al. (2019) presented the brain storm optimization algorithm combined with the fuzzy ARTMAP model to learn the training samples. Thaher et al. (2020) provided a binary version of the Harris Hawks optimization algorithm that employs the S-shaped function. The water cycle optimization algorithm is incorporated with SA to detect spam emails (Al-Rawashdeh et al. 2019). In this regard, other algorithms have been developed to solve the spam email detection problem, such as the whale optimization algorithm (Saidala and Devarakonda 2017; Shuaib et al. 2019), particle swarm optimization (Faris et al. 2016), the harmony search algorithm (Gashti 2017), and the intelligent water drops algorithm (Singh 2019).

Nayak et al. (2019) suggested a binary differential evolution algorithm based on the individual entropy method, which used two-stage mutation and crossover operators. The algorithm does not measure the stability of FS based on an evolutionary algorithm or the correlation between the optimization objective and the classifier. Many authors have developed other metaheuristics for FS problems, such as the salp swarm algorithm (Faris et al. 2018; Ahmed et al. 2018; Ibrahim et al. 2018), ant lion optimizer algorithm (Mafarja and Mirjalili 2018), dragonfly algorithm (Mafarja et al. 2018c), cuckoo search algorithm (El Aziz and Hassanien 2018), multi-verse optimizer algorithm (Ewees et al. 2019), butterfly algorithm (Arora and Anand 2019), bat algorithm (Alam 2018), and ant colony optimization algorithm (Zhao et al. 2014).

Metaheuristic algorithms came to solve many of the latest and emerging problems and achieved overwhelming success. Therefore, many researchers were quick to solve FS problems using metaheuristic algorithms. However, we argue that previous literature suffers from specific weaknesses like:

  • Getting stuck in local optima (LO) and slow convergence.

  • The increase of the computational time.

  • Unfortunately, the performance of the algorithm may be affected by large dimensions of data.

  • Training the parameters of the algorithm in advance to choose the best setting is time-consuming.

To address most of the limitations found in the previous studies, we propose a hybrid method to search for the OSF. The SA is integrated with the Harris Hawks optimization algorithm to escape from LO, because SA can accept a worse solution with a certain probability. Moreover, the bitwise operations can increase the diversity in the population. Datasets of different and large dimension sizes are used to study the efficacy of the proposed algorithm.

3 Harris Hawks optimization algorithm

A novel nature-inspired algorithm was developed by Heidari et al. (2019) simulating the Harris Hawks behaviors called the Harris Hawks Optimization (HHO) algorithm. An incredible social behavior has been followed by the Harris Hawks to track and pounce on their prey. Searching for the prey, abrupt pounce, and multiple attacking ways perform the explorative and exploitative phases of the algorithm.

Harris Hawks are randomly distributed to locations where they wait for prey using two exploration approaches. The Hawks are the candidate solutions, and the best solution is the intended prey, or the one nearest to the optimum. In the first approach, Harris Hawks perch on a place taking into account other family members’ locations and the rabbit (prey). In the second approach, the Hawks wait on random tall trees. The two approaches are chosen with equal chance through a random number \( q \) and can be modeled as follows:

$$ x\left( {t + 1} \right) = \left\{ {\begin{array}{*{20}l} {x_{r} \left( t \right) - r_{1} \left| {x_{r} \left( t \right) - 2r_{2} x\left( t \right)} \right|} \hfill & {q \ge 0.5} \hfill \\ {x_{rabbit} \left( t \right) - x_{mean} \left( t \right) - r_{3} \left( {Lb + r_{4} \left( {Ub - Lb} \right)} \right)} \hfill & {q < 0.5} \hfill \\ \end{array} } \right. $$
(1)

where \( x(t) \) and \( x(t + 1) \) are the position vectors of the Hawks in the current and next iteration, respectively. \( x_{r}(t) \) is a random hawk selected from the population. \( x_{rabbit}(t) \) is the position of the rabbit. \( q \), \( r_{1} \), \( r_{2} \), \( r_{3} \), and \( r_{4} \) are randomly generated numbers. \( Lb \) and \( Ub \) are the lower and upper bounds used to produce random locations inside the Hawks’ home. \( x_{mean}(t) \) is the mean position of the Hawks in the population, which can be calculated as:

$$ x_{mean} \left( t \right) = \frac{1}{h}\mathop \sum \limits_{i = 1}^{h} x_{i} \left( t \right) $$
(2)

where \( x_{i} \left( t \right) \) is the ith position vector of each hawk in the population at iteration t and \( i = 1, \ldots ,h \). h is the number of Harris Hawks in the population. Based on the fleeing or escape energy E of the rabbit, the algorithm can change from the exploration to the exploitation stage as follows:

$$ E = 2E_{0} \left( {1 - \frac{t}{Max\_iter}} \right) $$
(3)

where \( E_{0} \) is the initial rabbit energy, which is randomly generated in [− 1, 1], and \( Max\_iter \) is the maximum number of iterations. The Hawks explore further regions to locate the rabbit when \( \left| E \right| \ge 1 \); otherwise, the exploitation stage occurs. A random number p models, with equal chance, whether the rabbit escapes successfully (\( p < 0.5 \)) or fails to escape (\( p \ge 0.5 \)). Also, depending on the rabbit energy, the Hawks will perform a soft (\( \left| E \right| \ge 0.5 \)) or hard (\( \left| E \right| < 0.5 \)) besiege. The soft besiege can be formulated as:

$$ x\left( {t + 1} \right) = \Delta x\left( t \right) - E\left| {J \times x_{rabbit} \left( t \right) - x\left( t \right)} \right| $$
(4)
$$ \Delta x\left( t \right) = x_{rabbit} \left( t \right) - x\left( t \right) $$
(5)
$$ J = 2\left( {1 - rand} \right) $$
(6)

\( \Delta x\left( t \right) \) is the difference between the hawk and rabbit positions. Random jump strength \( J \) of the rabbit is drawn using a random number \( rand \). On the other side, hard besiege can be formulated as:

$$ x\left( {t + 1} \right) = x_{rabbit} \left( t \right) - E\left| {\Delta x\left( t \right)} \right| $$
(7)
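As an illustrative sketch of Eqs. (1) to (7) in Python (plain lists, no external libraries), the functions below show the exploration move, the escape energy, and the two besiege moves. The function names, the scalar bounds `lb` and `ub`, and the injected random generator `rng` are assumptions made for illustration only. The hard besiege uses the rabbit position as its base point, as in Heidari et al. (2019).

```python
import random


def mean_position(hawks):
    """Eq. (2): component-wise mean of all hawk positions."""
    h = len(hawks)
    return [sum(column) / h for column in zip(*hawks)]


def escape_energy(t, max_iter, rng):
    """Eq. (3): E = 2 * E0 * (1 - t / max_iter), with E0 drawn from [-1, 1]."""
    e0 = rng.uniform(-1.0, 1.0)
    return 2.0 * e0 * (1.0 - t / max_iter)


def explore(x, hawks, rabbit, lb, ub, rng):
    """Eq. (1): perch relative to a random hawk, or relative to rabbit and mean."""
    q, r1, r2, r3, r4 = (rng.random() for _ in range(5))
    if q >= 0.5:
        x_r = rng.choice(hawks)  # random hawk from the population
        return [a - r1 * abs(a - 2.0 * r2 * b) for a, b in zip(x_r, x)]
    x_mean = mean_position(hawks)
    return [(rb - m) - r3 * (lb + r4 * (ub - lb))
            for rb, m in zip(rabbit, x_mean)]


def soft_besiege(x, rabbit, energy, rng):
    """Eqs. (4)-(6): delta_x - E * |J * x_rabbit - x| with J = 2 * (1 - rand)."""
    jump = 2.0 * (1.0 - rng.random())
    return [(rb - xi) - energy * abs(jump * rb - xi)
            for rb, xi in zip(rabbit, x)]


def hard_besiege(x, rabbit, energy):
    """Eq. (7): x_rabbit - E * |delta_x|, with delta_x = x_rabbit - x."""
    return [rb - energy * abs(rb - xi) for rb, xi in zip(rabbit, x)]
```

A caller would compute `escape_energy` once per hawk and iteration, use `explore` when \( \left| E \right| \ge 1 \), and otherwise choose between the besiege moves according to \( \left| E \right| \) and \( p \).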

When (\( \left| E \right| \ge 0.5 \)) and (\( p < 0.5 \)), soft besiege with progressive rapid dives is performed, as the rabbit can still successfully flee. The Hawks can choose the best possible dive. Lévy flight is used to mimic the leapfrog movements of the prey. To decide if the dive is good or not, the next move of the Hawks is estimated using:

$$ k = x_{rabbit} \left( t \right) - E\left| {J \times x_{rabbit} \left( t \right) - x\left( t \right)} \right| $$
(8)

If the previous dive isn’t useful, the Hawks will dive utilizing lévy flight \( L \) pattern as follows:

$$ z = k + s \times L\left( d \right) $$
(9)

where \( d \) is the problem dimension and s is a random vector of size \( d \). The Lévy flight can be calculated as in Yang (2010):

$$ Levy = \frac{u \times \sigma }{{\left| v \right|^{{\frac{1}{\beta }}} }} $$
(10)
$$ \sigma = \left( {\frac{{\varGamma \left( {1 +\upbeta} \right) \times \sin \left( {\frac{\pi \beta }{2}} \right)}}{{\varGamma \left( {\frac{1 + \beta }{2}} \right) \times\upbeta \times 2^{{\left( {\frac{\beta - 1}{2}} \right)}} }}} \right)^{{\frac{1}{\beta }}} $$
(11)

\( u \) and \( v \) are random numbers \( \in \)[0, 1]. \( \beta \) is a constant set to 1.5. The final soft besiege progressive rapid dives is updated using:

$$ x\left( {t + 1} \right) = \left\{ {\begin{array}{*{20}l} k \hfill & {if\,f\left( k \right) < f\left( {x\left( t \right)} \right)} \hfill \\ z \hfill & {if\,f\left( z \right) < f\left( {x\left( t \right)} \right)} \hfill \\ \end{array} } \right. $$
(12)

where \( k \) and \( z \) are calculated using Eqs. (8) and (9). Hard besiege with progressive rapid dives occurs when (\( \left| E \right| < 0.5 \)) and (\( p < 0.5 \)), as the rabbit does not have enough energy to flee. The position is again updated using Eq. (12), where \( z \) is calculated using Eq. (9) and \( k \) is updated using the following equation:

$$ k = x_{rabbit} \left( t \right) - E\left| {J \times x_{rabbit} \left( t \right) - x_{mean} \left( t \right)} \right|. $$
(13)
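The Lévy step of Eqs. (10) and (11) and the dive selection of Eqs. (8), (9), (12), and (13) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: \( u \) and \( v \) are drawn from a standard normal distribution (a common choice in Mantegna-style Lévy generators, whereas the text above only calls them random numbers), lower fitness is taken to be better, and the hawk keeps its position when neither dive improves it, a case that Eq. (12) leaves open. All function names are hypothetical.

```python
import math
import random

BETA = 1.5  # constant from the text


def levy_sigma(beta=BETA):
    """Eq. (11): scale factor sigma of the Levy step."""
    numerator = math.gamma(1.0 + beta) * math.sin(math.pi * beta / 2.0)
    denominator = (math.gamma((1.0 + beta) / 2.0) * beta
                   * 2.0 ** ((beta - 1.0) / 2.0))
    return (numerator / denominator) ** (1.0 / beta)


def levy_vector(d, rng, beta=BETA):
    """Eq. (10), evaluated independently for each of the d dimensions."""
    sigma = levy_sigma(beta)
    steps = []
    for _ in range(d):
        u = rng.gauss(0.0, 1.0)  # assumption: normal draws for u and v
        v = rng.gauss(0.0, 1.0)
        steps.append(u * sigma / abs(v) ** (1.0 / beta))
    return steps


def rapid_dive(x, base, rabbit, energy, fitness, rng):
    """Eqs. (8)/(13), (9), (12).

    base is x(t) for the soft variant (Eq. 8) and x_mean(t) for the
    hard variant (Eq. 13). Lower fitness is better.
    """
    jump = 2.0 * (1.0 - rng.random())
    k = [rb - energy * abs(jump * rb - b) for rb, b in zip(rabbit, base)]
    if fitness(k) < fitness(x):
        return k
    s = [rng.random() for _ in x]
    levy = levy_vector(len(x), rng)
    z = [ki + si * li for ki, si, li in zip(k, s, levy)]
    if fitness(z) < fitness(x):
        return z
    return x  # assumption: keep the current position if no dive helps
```

For \( \beta = 1.5 \), `levy_sigma()` evaluates to roughly 0.6966.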

Algorithm 1 presents the pseudocode of the standard HHO algorithm.

figure a

4 The proposed approach

Here, the proposed algorithm HHOBSA is explained in detail. FS is a binary problem in which a feature is set to one if it is selected and to zero otherwise. The Harris Hawks optimization algorithm is intended for continuous problems, which conflicts with the binary nature of the FS problem. Our proposed approach consists of two main stages: the application of the HHO algorithm with bitwise operations for FS, and the hybridization of the HHO algorithm with the SA algorithm. The second stage is motivated by the fact that, like many other metaheuristic algorithms, HHO can get stuck in LO; the SA tries to prevent this. The structure of HHOBSA is shown in Fig. 2.

Fig. 2 The structure of HHOBSA

4.1 Harris Hawks algorithm for feature selection

In this stage, several steps will be explained: initialization, transformation function, K-nearest neighbor, and evaluation. Besides, two bitwise operations are employed for improving the quality of the solution.

4.1.1 Initialization

An initial population of \( H \) Hawks or search agents is randomly generated in this step. Each Harris hawk in the population represents a possible solution. The solution is represented by a vector of dimension \( d \), where \( d \) is the number of features in the dataset. Each value in the vector is 1 or 0, indicating whether the corresponding feature is selected or not. Figure 3 depicts the binary representation of a possible Harris hawk solution for a dataset that contains eight attributes. Five of them are selected while the other three are not. The main aim of performing FS is to lessen the data dimensionality. Therefore, we need to choose some attributes and reject others.

Fig. 3 The binary representation of a possible Harris hawk solution

4.1.2 K-nearest neighbor

Several classifiers can be used to rate and assess the quality of solutions. In this study, the KNN classifier (Altman 1992) is used for the following reasons: it is simple to implement, it has only one parameter \( K \), which represents the number of neighbors, and it is beneficial in finding the best subset of attributes. The job of classification is to assign a sample to the class to which most of its \( K \) closest neighbors belong (see Fig. 4).

Fig. 4 KNN example

The main purpose of classification is to classify new samples that have not yet been labeled. However, the classifier must first be trained so that it learns the data’s peculiarities and the connection between the attribute values and the class label. In the real world, we can’t be sure whether our classifier is correctly trained or not. A common practice, therefore, is to keep part of the labeled data as a training dataset and the other part as a testing dataset. The classifier is then trained using the training dataset, while the testing dataset is held out to check that the classifier performs well on data not seen before. For each sample in the testing dataset, the \( K \) closest neighbors in the training dataset are determined using the Euclidean distance as follows:

$$ ED = \sqrt {\mathop \sum \limits_{j = 1}^{d} \left( {ftrain_{j} - ftest_{j} } \right)^{2} } $$
(14)

where \( ED \) is the Euclidean distance. \( d \) is the number of attributes in a given dataset and \( j = 1, \ldots ,d \). \( ftrain_{j} \) is the jth attribute of a sample in the training dataset. \( ftest_{j} \) is the jth attribute of a sample in the testing dataset. The classification accuracy is a metric that demonstrates how good the class label prediction of the classifier is. It is defined as the number of correctly classified instances divided by the total number of instances in the testing dataset. Conversely, the classification error rate is the number of incorrectly classified instances divided by the total number of instances in the testing dataset.
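The KNN classification and accuracy computation described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: it uses the standard Euclidean distance (with squared differences), majority voting among the \( K \) nearest training samples, and an optional binary `mask` argument (a hypothetical name) that restricts the distance to the selected features.

```python
import math
from collections import Counter


def euclidean(a, b, mask=None):
    """Euclidean distance over the features whose mask bit is 1."""
    if mask is None:
        mask = [1] * len(a)
    return math.sqrt(sum((x - y) ** 2
                         for x, y, m in zip(a, b, mask) if m))


def knn_predict(train_x, train_y, sample, k=5, mask=None):
    """Return the majority class among the k nearest training samples."""
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean(train_x[i], sample, mask))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, test_x, test_y, k=5, mask=None):
    """Fraction of test samples whose predicted label is correct."""
    correct = sum(knn_predict(train_x, train_y, s, k, mask) == y
                  for s, y in zip(test_x, test_y))
    return correct / len(test_y)
```

The classification error rate is then simply `1 - accuracy(...)`.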

4.1.3 Evaluation

To assess the quality of a solution, the classification accuracy rate calculated from the classifier KNN is used. The best solution is one that maximizes the classification accuracy rate. In Fig. 5, we assume that the two solutions have the same accuracy. Even though the two solutions have the same accuracy, there is one solution better than the other depending on the size of the features selected. But the proposed algorithm cannot distinguish between the two Harris Hawks solutions as the measure of assessment is based solely on the classification accuracy.

Fig. 5 Example of two Harris Hawks solutions with the same accuracy

Therefore, the evaluation measure was modified to take into account both the accuracy and the size of the selected features. Based on this, FS must satisfy two objectives: minimizing the size of selected attributes and maximizing the accuracy of a KNN classifier. The fitness function for assessing the Harris Hawks population will contain two contradictory goals: minimizing one while maximizing the other. The fitness function will be concerned with minimizing the classification error rate rather than the accuracy to minimize the two objectives. The fitness function is designed to balance the two objectives as follows:

$$ f = w_{1} \times \left( {1 - acc} \right) + w_{2} \times \frac{{\left| {selected\_f} \right|}}{\left| d \right|} $$
(15)
$$ w_{1} \in \left[ {0, 1} \right], w_{2} = 1 - w_{1} $$
(16)

where \( acc \) is the classification accuracy calculated from KNN and \( \left( {1 - acc} \right) \) is the classification error rate. \( \left| {selected\_f} \right| \) indicates the number of the selected features. \( \left| d \right| \) is the size of the features in a dataset. \( w_{1} \) and \( w_{2} \) are the weight parameters for each objective. The great precedence is given to minimizing the classification error (maximizing classification accuracy) rather than minimizing the size of selected attributes which means that \( w_{1} > w_{2} \).
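As a small worked illustration of Eqs. (15) and (16), the Python sketch below evaluates the fitness of a binary solution. The default weight `w1=0.99` is an assumed value chosen only to satisfy \( w_{1} > w_{2} \); the text does not state the value used in the experiments.

```python
def fitness(acc, mask, w1=0.99):
    """Eq. (15): weighted sum of error rate and selected-feature ratio.

    acc  : classification accuracy in [0, 1]
    mask : binary list, 1 = feature selected
    w1   : weight of the error rate (assumed value); w2 = 1 - w1 by Eq. (16)
    """
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    w2 = 1.0 - w1
    return w1 * (1.0 - acc) + w2 * sum(mask) / len(mask)


# Two solutions with the same accuracy but different subset sizes:
five_of_eight = fitness(0.9, [1, 1, 1, 1, 1, 0, 0, 0])
three_of_eight = fitness(0.9, [1, 1, 1, 0, 0, 0, 0, 0])
```

With equal accuracy, the solution that selects fewer features obtains the lower (better) fitness, which resolves the tie discussed around Fig. 5.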

4.1.4 Transformation function

The standard HHO is designed for continuous problems, and its search agents are real-valued. So, the HHO algorithm cannot be directly applied to the binary FS problem. A transformation function transforms a real-valued search agent into a binary one. Transformation functions can be categorized into S-shaped and V-shaped functions. The sigmoid belongs to the family of S-shaped functions (Mirjalili and Lewis 2013). The sigmoid maps a continuous position in a Harris hawk solution to a binary one using:

$$ S\left( {x_{j} } \right) = \frac{1}{{1 + e^{{ - x_{j} }} }},\quad x_{binary} = \left\{ {\begin{array}{*{20}l} 0 \hfill & {rand < S\left( {x_{j} } \right)} \hfill \\ 1 \hfill & {rand \ge S\left( {x_{j} } \right)} \hfill \\ \end{array} } \right. $$
(17)

Each value \( x_{j} \) in the real-valued solution vector is converted to a value \( x_{{s_{j} }} \) between 0 and 1. \( j = 1, \ldots ,d \), where \( d \) is the number of features. \( rand \) is a random number \( \in \left[ {0, 1} \right] \). Another transformation function, the hyperbolic tangent function, which belongs to the V-shaped family proposed in Rashedi et al. (2010), is also used:

$$ V\left( {x_{j} } \right) = \left| {tanh\left( {x_{j} } \right)} \right|,\quad x_{binary} = \left\{ {\begin{array}{*{20}l} {\neg x_{j} } \hfill & {rand < V\left( {x_{j} } \right)} \hfill \\ { x_{j} } \hfill & {rand \ge V\left( {x_{j} } \right)} \hfill \\ \end{array} } \right. $$
(18)

The value resulting from \( V\left( {x_{j} } \right) \) is still a continuous value between 0 and 1. So, it must be compared with \( rand \in \left[ {0, 1} \right] \) to decide whether the current binary value is flipped or kept.
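The V-shaped rule of Eq. (18) can be illustrated with the short Python sketch below. It assumes that each hawk keeps its current binary vector `bits` alongside the continuous position, so that a bit is flipped with probability \( V\left( {x_{j} } \right) \) and kept otherwise. The names `v_transfer`, `binarize_v`, and `bits` are hypothetical.

```python
import math
import random


def v_transfer(x):
    """V-shaped transfer function |tanh(x)|, always in [0, 1]."""
    return abs(math.tanh(x))


def binarize_v(position, bits, rng):
    """Eq. (18): flip bit j when rand < V(x_j), otherwise keep it."""
    return [1 - b if rng.random() < v_transfer(x) else b
            for x, b in zip(position, bits)]
```

A position component close to zero leaves the corresponding bit almost always unchanged, while a component of large magnitude (positive or negative) flips it with probability close to one.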

4.1.5 Bitwise operations

LO and low population diversity affect the performance of the HHO algorithm. HHO algorithm is developed using two bitwise operations to overcome the problems mentioned above: bitwise OR operation and bitwise AND operation. Firstly, a new solution is randomly generated. Then, a bitwise AND operation is performed between the new random solution and the best solution (rabbit position \( x_{rabbit} \)) obtained so far. The purpose of bitwise AND operation is to obtain the good features that are common in the best solution and the random solution. Secondly, the output solution from the AND operation and the new solution generated from the HHO algorithm are taken as input to the OR operation. The purpose of OR operation is to transfer the most informative features produced from AND operation to the newly generated solutions which raise the quality of solutions. The bitwise operations help the algorithm to flee from local optima. The random solution increases the population diversity. The quality of the solutions is ameliorated. Figure 6 shows the bitwise operations that are done during iterations.

Fig. 6 Bitwise operations
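The two bitwise operations described above can be sketched in Python as follows: a random binary solution is ANDed with the rabbit (best) solution, and the result is ORed with the candidate produced by HHO. The function name `bitwise_refine` and the uniform 0/1 sampling of the random solution are assumptions for illustration.

```python
import random


def bitwise_refine(candidate, rabbit, rng):
    """(random AND rabbit) OR candidate, applied bit by bit."""
    # Step 1: generate a new random binary solution.
    random_solution = [rng.randint(0, 1) for _ in rabbit]
    # Step 2: AND keeps only features shared by the random and best solutions.
    common = [r & b for r, b in zip(random_solution, rabbit)]
    # Step 3: OR transfers those shared features into the candidate.
    return [c | m for c, m in zip(candidate, common)]
```

By construction, the refined solution keeps every feature already selected in the candidate, and any feature it adds is one that is also selected in the rabbit solution.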

4.2 Hybridization of Harris Hawks optimization algorithm and simulated annealing

SA is a single-solution algorithm developed in Kirkpatrick et al. (1983) to simulate the annealing process of metals. Annealing is a physical process in which a metal is heated to a high temperature and then cooled down slowly. At first, the parameters of SA are initialized: the initial temperature \( T_{0} \), the final temperature \( T_{final} \), and the cooling rate \( \tau \). The initial temperature is the highest temperature, which is gradually reduced by the cooling rate until it reaches the final temperature. The algorithm begins with a randomly produced solution and relies on the incremental improvement of the current solution. During each iteration, a new solution neighboring the current solution is selected. The current solution is updated if the new neighboring solution is better. Furthermore, the best solution is updated if the neighboring solution is better than it. The algorithm stops when the final temperature is reached. The current temperature \( T \) is updated per iteration using:

$$ T = T*\tau ,\;0 < \tau < 1 $$
(19)

SA is a probabilistic algorithm that can accept a worse neighboring solution to be replaced with the current solution to overcome the LO. The chance of accepting a worse alternative relies on how much worse it is and how much the present temperature value is and can be defined as:

$$ rand \le exp\left( { \frac{ - \Delta }{T} } \right) $$
(20)

where \( \Delta \) determines the difference in fitness between the new fitness of the neighboring solution and the current fitness. T is the current temperature. \( exp \) is the exponential function and \( \left( {\frac{ - \Delta }{T}} \right) \) is the exponent to raise e to.
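The cooling schedule of Eq. (19) and the acceptance test can be sketched in Python as follows. The sketch follows the standard Metropolis rule for minimization, in which a worse neighbor is accepted when a uniform random number falls below \( exp\left( { - \Delta /T} \right) \), so that acceptance becomes less likely as the solution gets worse or the temperature drops. The function names and the parameter values in the example are hypothetical.

```python
import math


def accept(delta, temperature, r):
    """Decide whether to move to the neighboring solution.

    delta       : new fitness minus current fitness (minimization)
    temperature : current temperature T > 0
    r           : uniform random number in [0, 1)
    """
    if delta < 0:
        return True  # a better neighbor is always accepted
    return r < math.exp(-delta / temperature)


def cooling_schedule(t0, t_final, tau):
    """Temperatures visited by T = T * tau until T falls to t_final."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must satisfy 0 < tau < 1")
    temperatures = []
    t = t0
    while t > t_final:
        temperatures.append(t)
        t *= tau
    return temperatures


schedule = cooling_schedule(1.0, 0.1, 0.5)
```

With these example values the schedule visits the temperatures 1.0, 0.5, 0.25, and 0.125 before stopping.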

For further improvement, the SA is used to boost the performance of the HHO algorithm and prevent it from falling into local optima, because the SA algorithm always accepts a better solution and can also accept a worse solution with a probability that depends on how much worse it is and on the current temperature. After each HHO iteration finishes, the SA algorithm begins. Rather than starting with a randomly generated solution, SA starts with the rabbit location \( pos_{rabbit} \) generated by HHO. A mutation operation is used in SA to achieve incremental improvement in the current solution \( s_{current} \) during iterations. Algorithm 2 shows the pseudocode of the mutation operation, which generates a new solution based on the current solution. The mutation operation stores the indices of the selected attributes (the positions of the ones) in the current solution. Then, the mutation attempts to remove any redundant or irrelevant features from the selected features to improve accuracy. The mutation is performed with a probability \( MP \) to limit the computational time, particularly for large data dimensions. The mutation procedure enhances the efficacy of the SA and, at the same time, SA helps to escape from local optima.

Algorithm 2 The pseudocode of the mutation operation (figure b)
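The mutation described above can be sketched as follows. The pseudocode of Algorithm 2 is not reproduced in this text, so the sketch is an assumption built only from the prose: it collects the positions of the ones and drops each selected feature independently with probability \( MP \). The name `mutate` and the per-bit interpretation of \( MP \) are hypothetical, and whether a removal is kept only when the fitness improves is left to the SA acceptance step.

```python
import random


def mutate(solution, mp, rng=random):
    """Return a neighbour of a 0/1 feature mask.

    ASSUMPTION: every selected feature (a 1 bit) is removed independently
    with probability mp; unselected features are never switched on.
    """
    selected = [i for i, bit in enumerate(solution) if bit == 1]
    neighbour = list(solution)          # leave the current solution intact
    for i in selected:
        if rng.random() < mp:
            neighbour[i] = 0            # drop a possibly redundant feature
    return neighbour
```

Because the operator can only clear bits, every neighbor uses a subset of the currently selected features, which matches the stated goal of shrinking the subset.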

Moreover, the hybridization of the two algorithms (HHO and SA) supports the exploitation capability of the HHO algorithm. Finally, the pseudocode of the HHOBSA can be found in Algorithm 3.

Algorithm 3 The pseudocode of HHOBSA (figure c)

5 Results and discussion

In this section, we conduct several experimental studies to demonstrate the efficacy of HHOBSA. All the experiments and comparisons are run on a laptop with the following specifications: the operating system is Windows 10 Ultimate 64-bit, the processor is an Intel® Core™ i7-4810MQ CPU @ 2.20 GHz, and the RAM size is 16.0 GB. All the algorithms are implemented in the Java IDE 8.0 environment.

5.1 Dataset description

Twenty-four benchmark datasets were used in the experiments and comparisons to evaluate the performance of the proposed HHOBSA algorithm. The datasets diabetic, EEG-eye-state, fri_c0_1000_10, fri_c1_1000_10, kc1, and page blocks can be found at https://www.openml.org/search?type=data. The remaining datasets are taken from the UCI repository (Lichman 2013). All the datasets are selected according to three principles: diversity, size, and the area to which they belong. We focus on datasets characterized by a large number of instances, a large dimension size (number of attributes), or both. The number of instances ranges between 62 and 14,980, and the number of attributes ranges between 10 and 7129. Furthermore, the datasets belong to different areas such as biology, computer, financial, life, physical, and statistical. N/A means that the area of the given dataset is not available. A full description of the datasets is provided in Table 2.

Table 2 List of standard datasets

5.2 Parameter tuning

The performance of any algorithm is affected by its parameter values. In practice, parameter tuning requires a large number of experiments to explore the effect of each parameter on the proposed algorithm. Thus, the parameter values are set by trial and error or according to the recommendations of previous studies. The effectiveness of the proposed algorithm is compared with that of other existing algorithms. Each algorithm is evaluated over 20 independent runs, and the maximum number of iterations is set to 30 for all experiments. We set the number of hawks (search agents) to 5 because increasing the number of search agents does not significantly affect the results, whereas it increases the running time owing to the larger number of fitness evaluations per iteration. For this reason, five search agents are considered sufficient. Each dataset is split into 80% for training and the remaining 20% for testing, as suggested by Mafarja et al. (2018b), Rajamohana and Umamaheswari (2018), Too et al. (2019), Faris et al. (2018), and Mafarja and Mirjalili (2018). Before splitting, the instances of each dataset are shuffled with the same random seed for all the algorithms so that every algorithm sees the same ordering of instances. We use the KNN classifier with the Euclidean distance metric, which is popular in wrapper methods because it is simple to implement and has only one parameter, \( k \), to tune. Several experiments were conducted on randomly selected datasets with different \( k \) values (1, 2, 3, and 5). The best results were obtained when \( k \) is 5, a value that is also suggested by several previous studies (Mafarja et al. 2018; Arora and Anand 2019; Emary et al. 2016; Guha et al. 2020; Agrawal et al. 2020). For the mutation operation, increasing the mutation probability improves the performance of the algorithm but also increases the running time. Accordingly, the mutation probability is set to a small value of 0.01. The values of \( w_{1} \) and \( w_{2} \) are set to 0.01 and 0.99, respectively, as in Faris et al. (2018).
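To make the evaluation protocol concrete, the following is a minimal sketch of a wrapper fitness built from the settings above (seeded shuffle, 80/20 split, KNN with Euclidean distance and \( k = 5 \)). It is not the authors' Java implementation. The function names are hypothetical, and it is an assumption that \( w_{1} \) weights the fraction of selected features and \( w_{2} \) weights the classification error, since the fitness function itself is defined elsewhere in the paper.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def wrapper_fitness(mask, x, y, k=5, w1=0.01, w2=0.99, seed=0):
    """Fitness (to be minimised) of a 0/1 feature mask.

    ASSUMPTION: fitness = w1 * (selected / total features) + w2 * error,
    where error is the KNN error rate on the 20% hold-out split.
    """
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 1.0                      # hypothetical penalty: empty subset
    rows = list(range(len(x)))
    random.Random(seed).shuffle(rows)   # same ordering for every algorithm
    cut = int(0.8 * len(rows))          # 80% training, 20% testing
    reduced = [[row[j] for j in cols] for row in x]
    train_x = [reduced[r] for r in rows[:cut]]
    train_y = [y[r] for r in rows[:cut]]
    test_rows = rows[cut:]
    errors = sum(knn_predict(train_x, train_y, reduced[r], k) != y[r]
                 for r in test_rows)
    return w1 * len(cols) / len(mask) + w2 * errors / len(test_rows)
```

With \( w_{2} = 0.99 \), the error term dominates, so a smaller subset is preferred only when it costs little or no accuracy.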

The proposed algorithm HHOBSA is compared with several well-regarded approaches: the binary Whale Optimization Algorithm (bWOA) (Hussien et al. 2019), the Binary Grey Wolf Optimization Algorithm (BGWOA) (Emary et al. 2016), the Discrete Particle Swarm Optimization (DPSO) algorithm (Unler and Murat 2010), the Binary PSO (BPSO) algorithm (Majid et al. 2018), the Binary Multi-Verse Optimizer algorithm (BMVO) (Mirjalili et al. 2016), the Binary Flower Pollination Algorithm (BFPA) (Yang 2012), the Non-Linear PSO (NLPSO) algorithm (Mafarja et al. 2018), the Binary Bat Algorithm (BBA) (Mirjalili et al. 2014), the Binary Salp Swarm Algorithm with crossover based on a V-shaped function (BSSA_V4) (Faris et al. 2018), and the Binary Crow Search Algorithm (BCSA) (De Souza et al. 2018). We implemented these algorithms and applied them to our model for a fair comparison. The parameter settings of the algorithms are taken from their original papers, as suggested by their authors. Finally, the parameter values of HHOBSA are outlined in Table 3.

Table 3 The parameter setting

5.3 Performance measures

Several statistical measures are used to assess the performance of the algorithms, as described below.

5.3.1 The classification accuracy

This measure indicates how accurate the classifier is on the feature subset selected by the algorithm when the algorithm is run \( M \) times. The best classification accuracy can be calculated as:

$$ BestAcc = Max\,Acc_{i}^{*} $$
(21)

where \( Acc_{i}^{*} \) is the best value of classification accuracy achieved at run \( i \) by running the algorithm \( M \) times. \( i \) is the \( i^{th} \) run of the algorithm and \( i = 1, \ldots ,M \). The average classification accuracy can be computed as:

$$ AvgAcc = \frac{1}{M}\mathop \sum \limits_{i = 1}^{M} Acc_{i} $$
(22)

where \( Acc_{i} \) determines the final accuracy obtained at run \( i \) and \( i = 1, \ldots ,M \).

5.3.2 The selected features

This measure concerns the number of features selected in a solution. We consider two different measures here. The first is the Selected Features (\( SF \)), which is the number of selected features in the solution that has the best fitness value. The second is the Average Selected Features (\( ASF \)), which can be calculated as follows:

$$ ASF = \frac{1}{M}\mathop \sum \limits_{i = 1}^{M} \frac{{SF_{i} }}{d} $$
(23)

where \( SF_{i} \) is the number of selected features in the best solution obtained by the algorithm at run \( i \), and \( d \) is the total number of attributes or features in the given dataset.

5.3.3 The fitness value

Three performance measures are employed: the best fitness, the average fitness, and the worst fitness. The best fitness (\( BestF \)) is the minimum fitness value attained by running the algorithm \( M \) times.

$$ BestF = Min F_{i}^{*} $$
(24)

where \( F_{i}^{*} \) is the minimum fitness value attained at run \( i \) when the algorithm is run \( M \) times. The average fitness (\( AvgF \)) is the sum of all the fitness values attained by running the algorithm \( M \) times, divided by the number of runs \( M \). It can be calculated as follows:

$$ AvgF = \frac{1}{M}\mathop \sum \limits_{i = 1}^{M} F_{i} $$
(25)

where \( F_{i} \) is the final fitness value obtained at run \( i \). Moreover, the worst fitness (\( WorstF \)) is the maximum value of fitness obtained by running the algorithm \( M \) times and can be computed as:

$$ WorstF = Max F_{i}^{*} $$
(26)

5.3.4 The total average computational time

This indicator is the Total Average Time (\( TAT \)) taken by the algorithm to run all the datasets \( M \) times each and can be calculated as follows:

$$ TAT = \mathop \sum \limits_{j = 1}^{N} AT_{j} $$
(27)

$$ AT_{j} = \frac{1}{M}\mathop \sum \limits_{i = 1}^{M} time_{i} $$
(28)

where \( time_{i} \) is the time taken by run \( i \), \( AT_{j} \) is the average time taken by the algorithm over \( M \) runs on a given dataset \( j \), \( N \) is the number of datasets, and \( j = 1, \ldots ,N \).
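The measures of Eqs. (21) to (28) are simple aggregates over the \( M \) runs, as the following sketch shows. The function names and argument layout are hypothetical; the inputs are assumed to be plain lists with one entry per run, where the starred quantities are the best values reached within each run.

```python
from statistics import mean


def summarise_runs(final_acc, best_acc, final_fit, best_fit, n_selected, d):
    """Aggregate per-run results of one algorithm on one dataset.

    final_acc / final_fit : accuracy and fitness at the end of each run
    best_acc / best_fit   : best accuracy and minimum fitness within each run
    n_selected            : number of selected features in each run's best solution
    d                     : total number of features in the dataset
    """
    return {
        "BestAcc": max(best_acc),                   # Eq. (21)
        "AvgAcc": mean(final_acc),                  # Eq. (22)
        "ASF": mean(sf / d for sf in n_selected),   # Eq. (23)
        "BestF": min(best_fit),                     # Eq. (24)
        "AvgF": mean(final_fit),                    # Eq. (25)
        "WorstF": max(best_fit),                    # Eq. (26)
    }


def total_average_time(times_per_dataset):
    """Sum over datasets of the mean run time on each dataset (Eqs. 27-28)."""
    return sum(mean(times) for times in times_per_dataset)
```

Note that \( ASF \) as written in Eq. (23) is a ratio between 0 and 1, because each \( SF_{i} \) is divided by \( d \).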

5.4 Numerical results and discussion

5.4.1 Studying the effect of the transformation function

Many experiments are conducted to evaluate the HHOBSA algorithm. First, the impact of two different transformation functions on the original Harris Hawks optimization algorithm is studied. To this end, two binary versions of the HHO algorithm are developed using the S-shaped function (HHO-S) and the V-shaped function (HHO-V). Table 4 presents a comparison between the HHO-S and HHO-V algorithms. The statistical results cover the best, average, and worst fitness values of each algorithm, and bold values indicate the best results. Additionally, the average number of selected features \( ASF \) is provided because it measures the reduction in data dimensionality. As can be seen from the best, average, and worst fitness, HHO-V outperforms HHO-S in 17 out of 24 datasets. According to the ASF measure, HHO-V attains a lower ASF than HHO-S in 16 datasets. The average of each of these measures over the datasets is also recorded in the table. For the average of the best fitness results, HHO-V obtains a value of 0.105, while HHO-S obtains 0.107. A lower fitness value indicates a lower classification error and a smaller number of selected features. Moreover, HHO-V achieves the minimum average value of ASF over all the datasets with a value of 201.233. Based on this analysis, the V-shaped function obtains the best results, so it is adopted in the proposed algorithm. The V-shaped function performs better because it produces more changes between the zeros and ones in a given search agent than the S-shaped function does (Liu et al. 2016).
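The difference between the two families of transformation functions can be illustrated with a short sketch. The exact S-shaped and V-shaped functions used in HHO-S and HHO-V are defined elsewhere in the paper; the sigmoid and \( |tanh(x)| \) below are common choices and are assumptions here, as are the function names. An S-shaped rule sets each bit directly from the transfer probability, whereas a V-shaped rule uses it as the probability of flipping the previous bit.

```python
import math
import random


def s_shaped(x):
    """Sigmoid transfer function (assumed form), numerically stable."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def v_shaped(x):
    """|tanh(x)| transfer function (assumed form)."""
    return abs(math.tanh(x))


def binarise_s(position, rng=random):
    """S-shaped rule: bit becomes 1 with probability s_shaped(x)."""
    return [1 if rng.random() < s_shaped(x) else 0 for x in position]


def binarise_v(position, previous_bits, rng=random):
    """V-shaped rule: flip the previous bit with probability v_shaped(x)."""
    return [1 - b if rng.random() < v_shaped(x) else b
            for x, b in zip(position, previous_bits)]
```

Under the V-shaped rule, a continuous value near zero leaves the bit unchanged, and a large value of either sign makes a flip likely, so the update depends on the magnitude of the move rather than on its sign.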

Table 4 The fitness values and the average selected features of HHO-S and HHO-V

5.4.2 The assessment of the proposed algorithm

In this subsection, the convergence and the quality of the results of the three algorithms are deeply investigated. We are interested in studying the performance of three versions of HHO algorithms as follows:

  • Harris Hawks Optimization algorithm with V-shaped function (HHO-V).

  • Harris Hawks Optimization algorithm with Bitwise operations (HHOB).

  • Harris Hawks Optimization algorithm with Bitwise operations and Simulated Annealing (HHOBSA).

Table 5 compares the three algorithms in terms of four performance measures: the best accuracy, the average accuracy, the average fitness, and the number of selected features in the solution with the best fitness value. HHOBSA has the maximum best and average classification accuracy in 23 out of 24 datasets. In terms of the average fitness, HHOBSA achieves the best results in 21 out of 24 datasets compared to HHO-V and HHOB. Moreover, HHOBSA outperforms the other two algorithms in terms of the best accuracy and the number of selected features in most of the datasets. Here, we are interested in studying how adding the bitwise operations and integrating the SA into the HHO-V algorithm affect the performance.

Table 5 The results obtained by HHO, HHOB, and HHOBSA

The table also presents the total average of the average classification accuracy and of the average fitness values. HHOBSA is ranked first for both the classification accuracy and the fitness values over all the datasets, with values of 0.891 and 0.112, respectively. HHOB comes next with values of 0.879 and 0.124 for the classification accuracy and the fitness value, respectively. Similarly, over all the datasets, the total average of the number of selected features accompanying the best fitness value is the lowest for HHOBSA, with a value of 191.08. The main aim of the FS problem is to reduce the number of selected features while maximizing the classification accuracy. As a result, the data dimensionality, which is one of the biggest problems facing data mining, is reduced. These results confirm that integrating the bitwise operations and the SA gives HHOBSA its superiority.

Figure 7 provides a comparison based on the total average time \( TAT \). The total average time for solving all the datasets is 304.2 s, which is considered reasonable because we are dealing with big data whose dimension sizes reach thousands. From the above experiments, we can conclude that the bitwise operations and the SA have improved the exploitation and exploration capabilities of the algorithm. Therefore, the convergence and the quality of the results are noticeably improved in comparison with HHO-V and HHOB. In this regard, and according to the results, HHOBSA is adopted as the final proposed approach.

Fig. 7 Total average time over all the datasets

In Tables 6 and 7, a paired-samples t test is conducted to compare the best classification accuracy before the improvements (HHO-V) and after the improvements (HHOBSA) over all the datasets (N = 24). This test checks whether the difference between the two algorithms is statistically significant. The test reveals a significant difference in accuracy between HHO-V (mean = 0.8950, standard deviation = 0.089) and HHOBSA (mean = 0.9082, standard deviation = 0.085); t(23) = 5.123, p < 0.001. The Sig. (2-tailed) value is reported as 0.000, which is less than 0.05. It can therefore be concluded that there is a statistically significant difference of 0.0132 between the mean classification accuracies of the two algorithms, which suggests that the improvements lead to a clear increase in accuracy. Thus, we reject the null hypothesis that the mean classification accuracies of HHO-V and HHOBSA are equal.
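For readers who want to reproduce this kind of test, the paired-samples t statistic is the mean of the per-dataset differences divided by its standard error. The sketch below uses only the Python standard library; the function name `paired_t` and the four-value toy input in the usage note are hypothetical and are not the accuracies from Table 5.

```python
import math
from statistics import mean, stdev


def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d = after - before per pair
    and stdev is the sample standard deviation (n - 1 in the denominator).
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1
```

For example, `paired_t([0.80, 0.85, 0.90, 0.70], [0.82, 0.86, 0.93, 0.74])` gives a t statistic of about 3.873 with 3 degrees of freedom. For the 24 datasets here, the degrees of freedom are 23, for which the two-tailed critical value at the 0.05 level is about 2.069, so the reported t(23) = 5.123 lies well beyond it.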

Table 6 Paired samples statistics
Table 7 Paired samples test

5.4.3 Experiment using real-world data between HHOBSA and other algorithms

In this subsection, we compare the performance of the HHOBSA algorithm with various metaheuristic algorithms, namely bWOA, BGWOA, DPSO, BPSO, BMVO, BFPA, NLPSO, BBA, BSSA_V4, and BCSA. Table 8 records the best fitness values achieved by each algorithm. \( Full \) refers to the fitness value obtained when all the features of the dataset are selected; it helps us to measure the improvement made by each algorithm in the table. The results show that HHOBSA attains more promising solutions than the other algorithms and surpasses its peers in 22 out of 24 datasets. Furthermore, Fig. 8 presents a comparison in terms of the total average of the best fitness values over all the datasets. The average fitness value is 0.181 when all the features are selected for all the datasets. By inspecting the figure, it can be inferred that HHOBSA has the minimum value of 0.0895 and is ranked first, while BFPA is ranked second with a value of 0.1060. Moreover, BMVO is ranked last with a value of 0.127. The superior results obtained by HHOBSA stem from the hybridization with the SA, which boosts the convergence of HHOBSA.

Table 8 The results of best fitness values among algorithms
Fig. 8 Total average of the best fitness over all the datasets

To further demonstrate the superiority of the proposed algorithm, Table 9 compares HHOBSA with the other metaheuristics based on the average fitness. HHOBSA obtains the minimum average fitness values in 20 datasets, while NLPSO obtains the minimum average fitness values in only three datasets. For HHOBSA, the minimum average fitness value is recorded for the pendigits dataset with a value of 0.006, which none of the other metaheuristics reaches. Moreover, the maximum average fitness value of HHOBSA, 0.327, is recorded for the arrhythmia dataset. For the same dataset, NLPSO comes next with a value of 0.344. Figure 9 compares the algorithms in terms of the total average of the average fitness values shown in the table over all the datasets. HHOBSA is ranked first with a value of 0.112, and DPSO is ranked second with a value of 0.121. These results indicate that the convergence of HHOBSA is quicker and more stable than that of the other algorithms.

Table 9 Comparison among metaheuristics in terms of average fitness values
Fig. 9 Total average of the average fitness values for all the datasets

Reducing the number of features is an important objective that we seek to achieve while maximizing the classification accuracy. As seen above, HHOBSA outperforms the other algorithms in terms of the fitness values. We now compare the feature-reduction capability of HHOBSA with that of the other metaheuristics. The average selected features of each algorithm are reported in Table 10. The results show that HHOBSA attains the minimum average number of selected features \( ASF \) in eight datasets compared to its peers. BMVO comes next, achieving the minimum average number of selected features in six datasets. The ability of HHOBSA to reduce the number of selected features is due to the mutation operation used to generate new solutions during the iterations of the simulated annealing algorithm. The mutation operation is designed to remove redundant or irrelevant features from the selected features found in the best solution.

Table 10 Comparison between HHOBSA and other algorithms based on the average selected features

In Table 11, the standard deviation of the classification accuracy is provided. The smaller the standard deviation, the more stable the results of the algorithm are. The proposed algorithm outperforms its peers and is ranked first, as it obtains the minimum standard deviation in 12 datasets. The total average of the standard deviation is shown in Fig. 10. HHOBSA has the lowest value, 0.007, which indicates that HHOBSA converges consistently across runs. DPSO is ranked second, as it achieves the minimum standard deviation in 11 datasets, with a total average of 0.008. Accordingly, the results support the robustness of the proposed algorithm.

Table 11 The standard deviation of classification accuracy
Fig. 10 Total average of standard deviation for all the datasets

The Wilcoxon signed ranks test is a non-parametric statistical test used to determine whether the results of the proposed algorithm HHOBSA differ statistically from those of the other metaheuristics. In Table 12, a Wilcoxon signed ranks test between HHOBSA and each of the other algorithms is conducted at a significance level of α = 0.05 for all 24 datasets. The \( p \)-value expresses how significant the difference between the results of two algorithms is: the smaller the \( p \)-value (< 0.05), the stronger the evidence of a significant difference and of the outperformance of the proposed algorithm. The table shows that the performance of HHOBSA is better than that of the other algorithms in most of the datasets. HHOBSA shows a significant improvement (\( p \)-value < 0.05) over all the algorithms in 15 datasets: arrhythmia, clean1, dermatology, DNA, fri_c0_1000_10, fri_c1_1000_10, german, kc1, madelon, optdigits, satellite, semeion, spambase, spectEW, and waveform. Finally, we conclude that there is a statistically significant difference in performance between HHOBSA and the other algorithms, as the \( p \)-value < α.
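The mechanics of the test can be sketched as follows. This is a simplified illustration using only the standard library, not the procedure used to produce Table 12: the function name is hypothetical, zero differences are dropped, tied absolute differences receive average ranks, and the two-sided \( p \)-value comes from the normal approximation without tie or continuity corrections, which is only a rough guide for small samples.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W, p) for paired samples x and y.

    W is the smaller of the positive and negative rank sums; p is a
    two-sided p-value from the normal approximation (an assumption that
    is reasonable only when the number of non-zero differences is not tiny).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                      # average ranks for ties
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))              # two-sided normal tail
    return w, p
```

Here `x` and `y` would hold the accuracies of the two algorithms over the 20 runs on one dataset; a \( p \)-value below α = 0.05 is then read as a significant difference on that dataset.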

Table 12 p-values calculated for Wilcoxon signed ranks test of HHOBSA classification accuracy results versus other algorithms (bold values < 0.05)

Figures 11–22 depict the boxplots of the classification accuracy obtained by the algorithms on the first twelve datasets. Each boxplot shows the dispersion of the classification accuracy values of an algorithm on a given dataset over 20 runs. It allows the algorithms to be compared on five main statistical measures: the minimum accuracy, the maximum accuracy, the lower quartile, the upper quartile, and the median. The higher the box lies, the better the classification accuracy attained by the algorithm. Accordingly, HHOBSA has the maximum classification accuracy values in most of the datasets. The figures also confirm that HHOBSA has a higher median while keeping a smaller interquartile range in most of the datasets. This is considered evidence of the robustness and stability of the HHOBSA algorithm.
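The five statistics behind each box can be computed directly from the per-run accuracies. The following sketch uses the standard library `statistics.quantiles` function; the function name `five_number_summary` is hypothetical, and the choice of the "inclusive" quartile method is an assumption, since plotting tools differ in how they interpolate quartiles.

```python
from statistics import quantiles


def five_number_summary(accuracies):
    """Minimum, lower quartile, median, upper quartile, maximum, and IQR."""
    q1, median, q3 = quantiles(accuracies, n=4, method="inclusive")
    return {
        "min": min(accuracies),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(accuracies),
        "iqr": q3 - q1,      # a small IQR means stable results across runs
    }
```

A high median together with a small interquartile range is the pattern described above for HHOBSA.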

Fig. 11 The boxplot of arrhythmia

Fig. 12 The boxplot of clean1

Fig. 13 The boxplot of colon

Fig. 14 The boxplot of dermatology

Fig. 15 The boxplot of diabetic

Fig. 16 The boxplot of DNA

Fig. 17 The boxplot of Eeg-eye-state

Fig. 18 The boxplot of fri-c0-1000-10

Fig. 19 The boxplot of fri-c1-1000-10

Fig. 20 The boxplot of german

Fig. 21 The boxplot of kc1

Fig. 22 The boxplot of leukemia

5.4.4 Experiments using artificial datasets

Although we have tested the proposed algorithm on publicly available real-world datasets, we are also interested in testing it on artificial (synthetic) datasets. An artificial dataset can provide an accurate measure of the efficacy and effectiveness of the proposed algorithm in solving FS. In an artificial dataset, the optimal features are known in advance, so we can compare the features selected by the algorithm with the known optimal features. Artificial datasets also allow us to study the effects of different data dimensions, noise ratios, and sample sizes on the FS process. Based on this, we employ 19 artificial datasets, described in Table 13. The datasets 1–15 are taken from Liu and Motoda (2012) and Bolón-Canedo et al. (2013). We generate the datasets 17–19 based on a linear function as follows:

  1. Determine the number of samples (\( NS \)) and the number of features (\( NF \)), which can be calculated as:

    $$ NF = NR + NIR $$
    (29)

    where \( NR \) is the number of relevant features and \( NIR \) is the number of irrelevant features.

  2. Generate a random number \( x_{i} \in \left[ {0, \, 1} \right] \), where \( i = 1, \ldots , NR \). Generate the coefficients of \( x_{i} \) (\( \propto_{i} \)) such that:

    $$ \mathop \sum \limits_{i = 1}^{NR} \propto_{i} = 1 $$
    (30)

    The first coefficient for \( x_{0} \) is \( \propto_{0} = 0.01 \). To compute the remaining coefficients, we apply the following equation:

    $$ \propto_{i + 1} = \propto_{i} + \Delta \propto $$
    (31)

  3. Multiply \( x_{i} \times \propto_{i} \) to produce the relevant features.

  4. The class label is generated based on a linear function as follows:

    $$ c_{j} = \mathop \sum \limits_{i = 1}^{NR} \propto_{i} x_{i} + \mathop \sum \limits_{k = 1}^{NIR} \partial_{k} x_{k} ,\quad k = 1, \ldots ,NIR $$
    (32)

    \( c_{j} \) is the \( j^{th} \) class label and \( j = 1, \ldots , NS \). \( \partial_{k} \) is the coefficient of the \( k^{th} \) irrelevant feature; \( \partial_{k} = 0 \), so the irrelevant features cannot make any contribution to the class label.

  5. We convert the continuous value of \( c_{j} \) into a binary one (0 or 1) using:

    $$ c_{j} = \left\{ {\begin{array}{*{20}l} {0,} & {c_{j} < c_{sum} } \\ {1,} & {c_{j} \ge c_{sum} } \\ \end{array} } \right. $$
    (33)

    $$ c_{sum} = \mathop \sum \limits_{j = 1}^{NS} c_{j} $$
    (34)
Table 13 List of artificial datasets
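The generation steps listed above can be sketched in Python as follows. Several details are assumptions made only for this illustration and are flagged in the comments: the function name is hypothetical; the step \( \Delta \propto \) is not given in the text, so it is solved from Eq. (30) for an arithmetic progression starting at 0.01; the irrelevant features are drawn uniformly from [0, 1]; and the binarization threshold is taken as the mean of the \( c_{j} \) values rather than their raw sum, because with non-negative \( c_{j} \) the raw sum of Eq. (34) would label every sample 0 whenever there is more than one sample.

```python
import random


def make_linear_dataset(ns, nr, nir, alpha0=0.01, seed=0):
    """Generate ns samples with nr relevant and nir irrelevant features."""
    if nr < 2:
        raise ValueError("nr must be at least 2")
    rng = random.Random(seed)
    # ASSUMPTION: the common step is chosen so that the coefficients form an
    # arithmetic progression (Eq. 31) that sums to 1 (Eq. 30).
    step = 2 * (1 - nr * alpha0) / (nr * (nr - 1))
    alphas = [alpha0 + i * step for i in range(nr)]
    rows, scores = [], []
    for _ in range(ns):
        relevant = [a * rng.random() for a in alphas]    # step 3: x_i * coefficient
        irrelevant = [rng.random() for _ in range(nir)]  # ASSUMPTION: uniform noise
        rows.append(relevant + irrelevant)
        scores.append(sum(relevant))                     # Eq. 32, zero weight on noise
    # ASSUMPTION: threshold at the mean of the scores, not their raw sum.
    threshold = sum(scores) / ns
    labels = [1 if s >= threshold else 0 for s in scores]
    return rows, labels, alphas
```

With `nr=20` and `nir=5`, the call returns 25-dimensional samples in which only the first 20 columns influence the label, so a feature selector can be scored against a known ground truth.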

In our experiment, the number of relevant features \( NR \) is set to 20, whereas \( NIR \) is 5, 30, and 80 for the three datasets, respectively. We describe each dataset using the following terms:

  • The number of features.

  • The number of samples.

  • The number of classes.

  • The features type, which is divided into Relevant (R), Irrelevant (IR), and Redundant (Red) features.

  • The noise ratio.

Table 14 records the results obtained by the proposed algorithm and KNN on the nineteen artificial datasets. For HHOBSA, the number of iterations is set to 100, and twenty independent runs are used to assess it. We report the best, average, and worst accuracy as performance measures. Regarding the features, we provide the best set of relevant features selected by the algorithm across the runs, as well as the ASF. We compare the performance of HHOBSA with that of the KNN classifier, which is used to evaluate the accuracy on each dataset when all of its features are selected. As KNN is deterministic, it is not necessary to repeat the experiment and average the accuracy. We also record the time taken by both HHOBSA and KNN for each dataset. The results show that the proposed algorithm finds the optimal set of relevant features in 12 out of 19 datasets.

Table 14 The results of the artificial datasets obtained by HHOBSA and KNN

For the remaining datasets, the algorithm finds most of the relevant features. The ASF indicates that HHOBSA discards most of the irrelevant features. Comparing the accuracy obtained by HHOBSA and KNN, we can see significant improvements for all the datasets: even the worst accuracy obtained by HHOBSA is better than that obtained by KNN. In the CorrAl dataset, HHOBSA selects the four relevant features with an accuracy of 1.00 and discards the two irrelevant features. The increased dimension size in CorrAl-100 does not hinder the algorithm from finding the relevant features with an accuracy of 1.00. Although Monk3 contains 5% noise, HHOBSA is capable of choosing the appropriate features with a higher accuracy of 0.92. In the LED datasets, the accuracy decreases as the noise intensity increases. In LED100n20, HHOBSA raises the accuracy from 0.10 to 0.80 and selects the optimal set of relevant features. In AD25, AD50, and AD100, the increasing number of samples helps to improve the accuracy, especially in the average and worst cases. This experiment shows the superiority of the proposed algorithm in solving many artificial datasets in a reasonable time.

Finally, the efficacy of the proposed algorithm has been tested in comparison with ten other metaheuristic algorithms. Across all the previous experiments and statistical analyses, the results show the superiority of the proposed algorithm in tackling feature selection problems. HHOBSA can overcome the LO using the SA and the bitwise operations. The mutation operation is applied with a probability of 0.01 to limit the running time. HHOBSA is capable of achieving a high classification accuracy while minimizing the number of selected features. Also, the standard deviation of its classification accuracies on each dataset is smaller than that of the other algorithms, which suggests that the convergence towards the optimal solution is quick and consistent.

5.4.5 Complexity analysis of HHOBSA

The time complexity of HHOBSA depends on its structure, including the population size (n), the dimension (d), the number of iterations (t), the bitwise operations, and the SA algorithm. It can be calculated from the three main components of the algorithm: the initialization phase, the Harris hawks update with the bitwise operations, and the SA. With n Harris hawks, the complexity of the initialization phase is O(n). The complexity of the Harris hawks update and the bitwise operations is O(n × t × d) + O(n × t). Let M be the number of times the temperature is cooled from the initial temperature until it reaches the final temperature. Since the mutation operation takes O(d) for each mutated solution, the complexity of the SA is O(M × d). Finally, the overall complexity of HHOBSA is O(n) + O(n × t × d) + O(n × t) + O(M × d) ≈ O(n × t × d) + O(M × d).
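The number of cooling steps M can be related to the SA parameters through Eq. (19). After \( m \) updates the temperature is \( T_{0} \tau^{m} \), so, assuming that the loop stops as soon as the temperature is no longer above \( T_{final} \) (the exact stopping comparison is an assumption here), M is the smallest integer satisfying \( T_{0} \tau^{M} \le T_{final} \):

$$ M = \left\lceil {\frac{{\ln \left( {T_{final} /T_{0} } \right)}}{\ln \tau }} \right\rceil $$

Both logarithms are negative because \( T_{final} < T_{0} \) and \( 0 < \tau < 1 \), so the ratio is positive. As a purely illustrative example with hypothetical values \( T_{0} = 1 \), \( T_{final} = 0.01 \), and \( \tau = 0.9 \), the ratio is about 43.7, which gives M = 44 cooling steps.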

6 Conclusions and future work

In this study, a hybrid approach combining the Harris Hawks optimization algorithm and the SA algorithm, HHOBSA, is proposed for searching for the optimal subset of features based on a wrapper method. The proposed algorithm employs KNN because it is common, easy to implement, and has only one parameter to tune. HHOBSA is applied to 24 standard datasets and 19 artificial datasets whose dimension sizes reach up to thousands. In the FS problem, a feature is either selected or not, which makes the problem binary. Thus, a transformation function is implemented in the original HHO algorithm. Firstly, the effects of the V-shaped and S-shaped functions on the proposed algorithm are studied. The V-shaped function gives superior results and faster convergence toward the optimal solution during iterations than the S-shaped one. Secondly, to avoid being trapped in local optima, HHO is integrated with the SA. Thirdly, a mutation operation is developed to remove redundant or irrelevant features from the best solution found so far. The mutation operator is applied with a small probability of 0.01 to minimize the running time. Two bitwise operations are designed inside the proposed algorithm for further improvement and for increasing the diversity of the population. They can randomly transfer the most informative features from the best solution to the individuals of the population, which raises their quality. The performance of HHOBSA is examined in depth in comparison with other well-regarded metaheuristics. The fitness values, classification accuracy, standard deviation, and computational time are examined in detail to investigate the performance of HHOBSA. The results show the superiority of the proposed algorithm, which is attributed to its ability to balance exploration and exploitation, escape from local optima, increase the diversity of the population, and transfer good features to the individuals of the population, all in a reasonable time. The Wilcoxon signed ranks test and the paired-samples t test are conducted to demonstrate that there is a statistically significant difference between HHOBSA and the other metaheuristics.

For future work, the performance of the proposed algorithm needs to be tested with other classifiers, such as the support vector machine and the decision tree. Applying FS-based classification to IoT, spam email detection, medical diagnosis, and financial data is also a promising direction. To make the algorithm more effective in dealing with huge dimension sizes of data, we hope to develop a parallel version of HHOBSA that exploits the available computing resources and reduces the time burden.