Keywords

1 Introduction

It has become a challenge for researchers and developers to cope with the explosive growth of available data, the dimensions of which are expanding daily. Feature selection (FS) is a pre-processing step in most big data processing and machine learning applications, particularly in data mining applications. It is used to remove noisy and inconsequential features to determine a subset of features that best represent and portray the data, thus, boosting the quality of the data obtained.

Classical search approaches, such as random search and complete search have been used to solve FS problems [17]. While these methods ensure the optimal solution for small datasets, their execution is impractical for large datasets. FS is an NP-hard problem, it tries to determine the optimal subset. For example, if a dataset contains n features, then \(2^n\) solutions must be formulated and assessed, i.e., the problem complexity is \(O(\infty )\). It also requires an enormous amount of computational power and an excessive amount of time.

In the last few years, metaheuristic algorithms (MAs) have been identified as an ideal solution to FS problems, particularly in cases involving high-dimensional data. Researchers used the Simulated Annealing algorithm, Ant Colony Optimization algorithm, Particle Swarm Optimization algorithm, Genetic Algorithm, etc. to solve FS problems and have obtained valuable results. For example, see [1, 4, 5, 8, 11, 12, 18].

MAs are the most appropriate alternative method for addressing the limitations of lengthy, far-reaching searches that entail high computational cost. Despite some desirable results, however, most MAs are impeded by the limitations imposed by a local optimum and a disproportion between the explorative and exploitative scope of the algorithm. Moreover, each dataset has a different number of features, and no single method is the most appropriate for FS problems, i.e., one can still find room for improvements. These observations motivated this work to look for means to overcome the limitations described and develop a novel FS solution.

We selected a recent algorithm, the Black Widow Optimization algorithm (BWO) [9], to study FS problems due to its success in optimizing engineering design problems. BWO is a nature-inspired algorithm that mimics the black widow’s life cycle. It is inspired by the singular mating behaviour exhibited by the black widow spider, a process that includes an exclusive stage called cannibalism. The BWO approach is designed to deliver rapid convergence and to avoid local optima, and, because BWO maintains equilibrium between the exploration and exploitation stages [9], a property that most MAs applied to FS problems are lacking [14], the BWO is particularly appropriate for solving several kinds of optimization problems that involve a number of local optima.

This work is an extension of our previous effort in trying to find an effective solution to FS problems by applying the BWO algorithm. In our previous effort, we modified the BWO algorithm to solve feature selection problems and developed the Binary Black Widow Optimization (BBWO) algorithm [3].

Despite the competitive results of the BBWO, its performance can be further improved by enhancing the slow convergence caused by the use of a population of solutions and a lack of local exploitation. In this article, we describe an improved version of the BBWO. We combined the BBWO with the Hill-Climbing Algorithm (HCA). The newly developed algorithm, the BBWO-HCA, is tested using 28 UCI datasets and compared with six well-regarded algorithms in the domain. The algorithms are Binary particle swarm optimization (BPSO) [10], Binary multi-verse optimization algorithm (BMWO) [2], Binary grey wolf optimizer algorithm (BGWO) [1, 7], Binary moth-flame optimization algorithm (BMFO) [19], Binary whale optimization algorithm (BWOA) [11], and Binary bat algorithm (BBAT) [13].

The rest of this article is organized as follows: In Sect. 2, the proposed algorithm is presented. In Sect. 3, the experiment setup, test results, and result discussion are presented. Finally, in Sect. 4, the research is concluded and some future works are identified.

2 BBWO-HCA Feature Selection Algorithm

The results presented in [3] show that the BBWO produces impressive results and, in some cases, is competitive with the best-known algorithms. The results also reveal that the BBWO performance can be further improved by enhancing the slow convergence due to the use of a population of solutions and a lack of local exploitation. The BBWO-HCA aims to increase the exploitation process of the BBWO by incorporating it with a local metaheuristic algorithm based on the Hill-Climbing Algorithm (HCA). HCA is a well-known local search algorithm. It has been tested on various problems and has shown to be an effective and efficient method that can produce sound results [6]. The BBWO-HCA’s main steps, selection, procreation, and mutation, are described in the subsections below.

2.1 Solutions Representation

In the BBWO-HCA, each solution represents a single black widow. All possible solutions to all FS problems are envisioned in terms of the attributes of the black widow spider. In programming terms, this is equivalent to saying each spider is represented by a class and spider attributes are class instance variables, or each spider is an array and spider attributes are array values. The spider population is modeled as an \(N_{var}\) dimensional array, i.e., an array of spider objects, and the FS problem becomes an \(N_{var}\) dimensional optimization problem.

The BBWO-HCA algorithm uses binary values to represent a population of solutions \((N_{pop})\). In binary representation, a solution is shown by a one-dimensional array. The length of the array varies in accordance with the feature number of the original dataset. For example, if S features are contained in the dataset, the solution length is S. The cell value in the array will be ‘1’ or ‘0’. The value ‘1’ indicates that the corresponding feature is selected, whereas ‘0’ indicates that the feature is not selected. In general, when the number of features is \(N_f\) and the population size is \(\mid N_{pop}\mid \), the array size of the problem will be \(N_f\) \(\times \) \(\mid N_{pop} \mid \).

2.2 Initialization

The population of solutions offered by the BBWO-HCA is randomly generated by assigning a value of either “0” or “1” to each cell of the solution. The process begins by initializing the population size and the number of features. The algorithm then arbitrarily assigns either ‘0’ or ‘1’ by looping through each solution in the population. This process is repeated until all solutions in the population have been initialized.

2.3 Fitness Function and Evaluation

Each solution is evaluated according to a fitness function. The function employed is shown in Eq. 1. A similar function is used by [1, 15]. The k-nearest neighbour algorithm (KNN) [16] is used in the solution evaluation, i.e., the KNN classifier determines the accuracy of the solution.

$$\begin{aligned} f= \alpha \gamma _{R}(D) + \beta \frac{\mid R\mid }{\mid C\mid } \end{aligned}$$
(1)

In Eq. 1, \(\gamma _{R}(D)\) represents the classification error rate of the KNN classifier, \(\mid R \mid \) is the cardinality of the selected subset, \(\mid C \mid \) is the total number of the original features in the dataset, and \(\alpha , \beta \) are two weight parameters corresponding to the importance of classification quality and subset length, \(\alpha \in [0,1]\) and \(\beta = (1-\alpha )\). A similar approach is adapted by [7, 11, 19].

After initializing the population of solutions, we assign to each solution (widow) a fitness value, which represents the quality of the solution. The fitness value of each solution is calculated using the fitness function and is evaluated using the KNN classifier. This is because the BBWO-HCA is a wrapper-based FS approach.

2.4 Transformation Function

The positions of the search agents generated from the standard BWO are continuous values. This cannot be directly applied to our problem because it contradicts the binary nature of the FS on selection or non-selection (0 or 1).

The sigmoidal function in 2 and 3, which is considered a form of the transformation function, is used in our proposed method as a part of the reproduction process to convert any continuous values to binary equivalents. The performance of the transformation function has been investigated and adopted by many researchers, e.g., [1, 7, 19].

$$\begin{aligned} z_{s_{w}}= \frac{1}{1+e^{-z_{w}}} \end{aligned}$$
(2)
$$\begin{aligned} z_{binary}= {\left\{ \begin{array}{ll} 0, &{} \text {if}\ rand < z_{s_{w}}\\ 1, &{} \text {if}\ rand \ge z_{s_{w}}\\ \end{array}\right. } \end{aligned}$$
(3)

where each of \(z_{s_{w}}\) is a continuous value (feature) in the search agent for the S-shaped function, specifically in the solution w at dimension d (w = 1,...,d), and is a random number drawn from the uniform distribution \(\in \) [0,1]. The \(z_{binary}\) value can be 0 or 1 depending on the value of \( rand\) compared to the values of \(z_{s_{w}}\), where e is a mathematical constant known as Euler’s number.

2.5 Reproduction Process

BWO is inspired by Darwin’s natural selection theory, which is defined as generational descent accompanied by modification where species are subtly adjusted over time and new species arise as a result.

In the BBWO-BCA algorithm, the procreation process begins and parents (in pairs) are selected randomly to perform the procreating steps by mating to bring forth the new generation. An array known as alpha will be generated to complete further reproduction. Offspring \(c_1\) and \(c_2\) will be produced by taking \(\alpha \) with the following equation in which \(w_1\) and \(w_2\) are parents.

$$\begin{aligned} {\left\{ \begin{array}{ll} c_1= \alpha {\times } w_1+ (1-\alpha ) {\times }w_2 \\ c_2= \alpha {\times } w_2+(1-\alpha ) {\times } w_1 \\ \end{array}\right. } \end{aligned}$$
(4)

2.6 Cannibalism Process

The BWO includes an exclusive stage, cannibalism. Cannibalism can be classified into three kinds: sexual cannibalism where the husband gets eaten by the female black widow during or after mating, sibling-cannibalism where the weaker siblings are eaten by the stronger siblings, and mother cannibalism where the mother is eaten by her strongest child. The BBWO-HCA uses this concept of cannibalism and determines the weak or strong spiders by calculating and evaluating their fitness values. The best solutions (surviving spiders) from the reproduction process will be selected and stored in population two, i.e., \(pop_2\).

2.7 Mutation Process and New Population Generation

The procedure of mutations begins by randomly selecting a number of solutions (widows) from the \(pop_1\) which will be mutated individually. Two cells from each selected solution are randomly exchanged, and the new mutation solutions will be kept in \(pop_3\). The new generation can finally be generated as a combination of \(pop_2\) and \(pop_3\), which will then be evaluated to return the optimal solution \((W^{*})\) of values bearing the N dimension.

In the BBWO-HCA, the cannibalism rate \((C_R)\), the procreation rate \((P_r)\), and the mutation rate \((M_r)\) are used as parameters. The value of the \((C_R)\) is determined by the fitness values obtained by Eq. 1, and the \(P_r\) and \(M_r\) are identical to those of the standard BWO.

2.8 HCA Steps

The algorithm uses the best solution \((W^{*})\) of the BBWO as an initial solution for the HCA. The solution is modified by selecting one feature randomly and flipping the value of that feature, i.e., if the feature value is “0” it is changed to “1” (which indicates adding one feature), and if the value is “1” it is changed to “0” (which indicates deleting one feature). If the fitness value of the modified solution is improved, it will replace the old one, otherwise, it discards the new solution.

Next, the HCA iteration counter and BBWO best solution \((W^{*})\) are updated, and the stopping criteria of the BBWO is checked. If the BBWO stopping condition is met, i.e., the max iterations are reached, the algorithm stops and returns the best solution \((W^{*})\), otherwise, a new iteration for the BBWO starts.

The pseudocode for the BBWO is shown in Fig. 1 and the additional steps involved in implementing the HCA are shown in Fig. 2. Together, they form the pseudocode for the BBWO-HCA.

3 Experiment Setup and Results

28 well-known datasets from the University of California Irvine (UCI)Footnote 1 machine learning repository have been used to investigate the performance and strength of our proposed methods. The dataset is randomly split into 80% for the training set and 20% for the test set. These rates are widely accepted data partition rates. The datasets vary in the number of features and instances. Table 1 presents a brief description of the datasets. Each row in the table represents the number of features, objects, classes, and the domain to which each of these datasets belong.

Fig. 1.
figure 1

The Binary Black Widow Algorithm Pseudocode for FS

Fig. 2.
figure 2

BBWO with the Hill-climbing (BBWO-HCA) Algorithm

Table 1. Datasets description.

The performance of our proposed method, the BBWO-HCA, is compared with six well-respected binary FS algorithms: (BPSO [10], BMVO [2], BGWO [1, 7], BMFO [19], BWOA [11], BBAT [13]) based on the two evaluation criteria, classification accuracy and the number of features selected.

To ensure an impartial comparison and a correct evaluation between our proposed method and other FS algorithms, we re-implemented the six FS algorithms using the same parameters values as illustrated in Table 2 and the same transformation function as explained in Sect. 2.4. The algorithms are run independently multiple times and the average accuracy and the average number of features selected are reported.

3.1 BBWO-HCA vs. BBWO Results and Discussion 1

Table 3 shows the comparison between our two algorithms (BBWO-HCA and BBWO algorithms) based on the two evaluation criteria (the classification accuracy and feature selected). The best classification accuracy and the lower number of features selected are highlighted in bold.

The results show that the BBWO-HCA is more efficient than the BBWO in terms of maximizing classification accuracy. The BBWO-HCA outperforms the BBWO in 15 datasets and obtains the same results in 13 datasets in terms of classification accuracy. When considering the average accuracy for all datasets, the performance of the BBWO-HCA is better than the BBWO. This is shown in Fig. 3.

In terms of minimizing the total number of features selected, the results show that the BBWO-HCA obtains better results than the BBWO in 26 datasets, the same results in one dataset, and worse results in one dataset. The results also show that on average, the BBWO-HCA is more efficient than the BBWO in this regard. This is shown in Fig. 4.

Table 2. BBWO-HCA Parameters
Table 3. Compression between BBWO-HCA and BBWO
Fig. 3.
figure 3

Average of classification accuracy of BBWO-HCA vs. BBWO

Fig. 4.
figure 4

Average of feature selection of BBWO-HCA vs. BBWO

3.2 BBWO-HCA vs. Six FS Algorithms, Results and Discussion 2

We compared the BBWO-HCA results with six FS algorithms (BPSO, BMVO, BGWO, BMFO, BWOA, BBAT). The results are presented in Tables 4 and 5. The test results reveal that the BBWO-HCA outperforms all six algorithms unless an algorithm already reached the best possible solution, in which case the BBWO-HCA results are the same as the other algorithm. For example, the BBWO-HCA produced better results than the BPSO in 26 datasets and the same results in two datasets.

Table 4. Comparison BBWO-HCA with all algorithms based on the classification accuracy
Table 5. Comparison BBWO-HCA all algorithms based on the features selected

In to the number of features selected, the test results reveal that the BBWO-HCA outperforms all six FS algorithms for all 28 datasets tested, see Table 5. These results are depicted pictorially in Figs. 5 and 6, and they show that the BBWO-HCA is an effective algorithm for solving FS problems.

Fig. 5.
figure 5

Average number of classification accuracy of all algorithms

Fig. 6.
figure 6

Average number of features selected of all algorithms

4 Conclusion and Future Works

Recently, a novel algorithm, the Black Widow Algorithm (BWO), has been developed to solve optimization problems. BWO is derived from nature; it mimics the singular mating behaviour exhibited by the black widow spider.

Initially, we developed the BBWO algorithm based on the BWO for solving FS problems. In this work, we further improved the BBWO by combining it with the Hill-Climbing Algorithm. The newly developed algorithm, BBWO-HCA, is tested using 28 UCI datasets and compared with six well-regarded algorithms in the domain. The test results show that the BBWO-HCA outperforms the BBWO and almost all comparable algorithms for most datasets tested.

This work opened the door for further FS and optimization studies. Examples of such studies are:

  • The test results of the BBWO-HCA can be further analyzed to determine the impact of the dataset size, number of features in the dataset, number of instances, etc. on the performance of the algorithm.

  • Combining the BBWO with other algorithms and studying the outcomes of these new combinations are open future works.

  • The BBWO and BBWO-HCA can be applied to various other areas of study to solve many other real-world optimization problems such as text mining, clustering, image processing, and routing problems.