1 Introduction

Cheminformatics offers many strategies that can be used in drug design and discovery. Considerable effort is devoted to evaluating specific molecular properties over large numbers of chemical compounds [1]. The process of predicting molecular properties is closely related to the virtual screening of a chemical library for drugs. Commonly, a drug is an organic molecule that inhibits the function of proteins through biomolecular interactions [2]. Drug design is often described as the rational, inventive process of finding new drugs based on knowledge of a biological target [3]. Drug design and discovery involve lead structure optimization, quantitative structure–activity relationships (QSAR) [4] and docking of a ligand into a receptor [5]. Recently, machine learning (ML) techniques have been applied in the cheminformatics field for chemical descriptor selection, the prediction of compound activities and molecular properties [6], and drug design and discovery [7]. Moreover, the size of the search space grows exponentially with the number of features available in the dataset.

Feature selection (FS) is used in many critical fields such as classification, data mining and object recognition, where it helps eliminate obsolete and redundant features from datasets [8, 9]. It poses a real computational challenge, especially when working with high-dimensional datasets in classification problems [10, 11]. The aim of FS is to minimize the number of features, thereby shrinking the search space and allowing ML techniques to use only the most significant features that affect classification accuracy [12]. Swarm intelligence (SI) algorithms are among the most common methods used to solve FS problems [13]. SI algorithms are computational intelligence methods made up of a population of artificial agents and inspired by the social behavior of animals in the real world [14].

Heidari et al. [15] proposed the HHO algorithm, which mimics the cooperative hunting behavior of Harris hawks. The original HHO suffers from some important limitations: (1) its exploration and exploitation are smooth but unbalanced, so the global and local search processes are difficult to control, (2) it converges prematurely on highly multi-modal problems and (3) its exploitation strategy is insufficient, so the search agents may settle on local solutions. In this study, to overcome these limitations, a wrapper feature selection method termed HHOCM is proposed, which hybridizes HHO with crossover and mutation for selecting chemical descriptors and predicting chemical activities. The role of mutation and crossover is to generate new offspring, which helps find solutions by simulating the natural laws of inheritance and adaptation to the environment.

Experimental results revealed that the three versions of the proposed HHOCM algorithm are efficient alternatives for solving FS problems. Several comparisons are performed against seven well-established SI algorithms, namely the grey wolf optimizer (GWO) [16], original HHO [15], whale optimization algorithm (WOA) [17], salp swarm algorithm (SSA) [18], ant lion optimizer (ALO) [19], grasshopper optimization algorithm (GOA) [20] and dragonfly algorithm (DA) [21], using k-nearest neighbors (k-NN) as the classifier. Further, the Wilcoxon test is used to evaluate the performance of the proposed algorithms. The ROBLHHOCM and HHOCM algorithms achieve the best results compared with the competitor algorithms on most statistical and graphical measures (e.g., average and standard deviation of fitness, accuracy, number of selected features, convergence curves and boxplots). To sum up, the major contributions of this work are:

  1. Crossover and mutation evolutionary operators are used to enhance the performance of the HHO algorithm.

  2. An opposition-based learning operator is integrated into HHO.

  3. Three HHO-based versions, called the HHOCM, OBLHHOCM and ROBLHHOCM algorithms, are proposed to select chemical descriptors and predict chemical compound activities.

  4. A wrapper FS paradigm is modeled using the three versions.

  5. A series of experiments is carried out to demonstrate the superiority of these versions in choosing the best molecular descriptor subset with high classification accuracy.

The rest of this paper is organized as follows. Section 2 summarizes related work in the literature. Section 3 briefly presents the basics of the HHO algorithm, the k-nearest neighbors (k-NN) algorithm and the genetic operators. Section 4 explains the proposed modifications of the HHO algorithm. Experimental results are reported and discussed in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Related work

Generally, creating a medicinal molecule requires protein data bank databases in order to determine the crystal structure of the target protein [22]. Computer-aided drug design (CADD) is an efficient tool for identifying chemical compounds in drug design and discovery [2]. CADD methods extract compounds from databases such as PubChem using pharmacophore modeling tools and rely on the important concept of docking large libraries of small molecules [23]. CADD methods are also used to obtain the protein and the ligand. A good drug depends on good ligands: the PyMOL software [24] is used to separate the ligand from the protein, and the binding energy is calculated with the AutoDock software.

In the same context, QSAR is applied to describe the correlation between the structures of a set of molecules and their responses to targets; thus, it can be considered an alternative to CADD methods [25]. In general, a QSAR dataset consists of active and inactive molecules, which requires good molecular descriptors representing the molecular features responsible for the relevant activity. Drug design and discovery is one of the main tasks in cheminformatics and involves two phases: an encoding phase, in which the molecular graph (or connection table) is represented as a feature vector by calculating descriptors that capture three-dimensional information about the molecular structure, and a mapping phase, in which various ML models are built. Mapping between feature vectors and properties, i.e., discovering the underlying functions, is the major role of ML techniques in cheminformatics [12].

Several efforts have been made to select proper features in datasets. Three categories of FS are found in the literature [26]: filter-based [27], embedded-based [28] and wrapper-based [29]. SI-based FS includes several algorithms, such as an improved salp swarm algorithm using crossover [30], the whale optimization algorithm [31] and binary dragonfly optimization [32]. In [33], a filter-based FS method is introduced for the QSAR Biodegradation and other medical benchmarks. It combined Relief-F with differential evolution to select the most relevant features and achieved 85.4% classification accuracy while keeping only 16 of 41 molecular descriptors. Wrapper-based methods have attracted more attention due to the involvement of learning algorithms in the FS process; the selection of significant features is thus driven by the performance of the learning algorithm (e.g., the correct classification rate) [12]. A swarm-based wrapper FS algorithm is introduced in [34] for predicting chemical compound activities: SSA is applied to select the best subset of molecular descriptors of the Monoamine Oxidase (MAO) dataset, and SSA with a k-NN classifier obtained the highest accuracy of 87.35% while keeping only 783 molecular descriptors. Houssein et al. [35] proposed two classification approaches called HHO-SVM and HHO-kNN for drug design and discovery prediction. In [36], HHO is combined with cuckoo search for drug design and discovery in cheminformatics.

Another recently developed branch is FS based on multi-objective optimization algorithms for selecting molecular descriptors in QSAR, introduced through the molecular descriptors subsets selection software (MoDeSuS) for QSAR Biodegradation [37]. Two scenarios are proposed for selecting relevant molecular descriptors, known as aggregation-based and Pareto-based FS. In the first, a binary vector of m molecular descriptors is generated, where one-bits indicate that the corresponding molecular descriptors are selected and zero-bits indicate that they are ignored; the selected subset is then evaluated using an aggregation function that combines accuracy with the selection ratio. The second scenario (Pareto-based methods) employs two algorithms (i.e., the non-dominated sorting genetic algorithm (NSGA-II) and the strength Pareto evolutionary algorithm (SPEA2)) to optimize the accuracy and the selection ratio separately. MoDeSuS achieved high performance on the QSAR Biodegradation dataset with an accuracy of 84% and a selection ratio of 37%.

In [38], a biclustering-based method is proposed to reduce the number of molecular descriptors for predicting the biodegradation of chemical compounds. The biodegradation task is evaluated using three classifiers: random committee, neural network and random forest (RF). The experimental results showed that the best classifier was RF, which achieved 88.81% accuracy with only 19 molecular descriptors (MD) on the QSAR Biodegradation dataset. Recently, artificial intelligence has made remarkable progress, opening several horizons based on ML and deep learning for QSAR modeling [39, 40]. Putra et al. [41] combined an artificial neural network and a support vector machine for QSAR modeling and utilized principal component analysis (PCA) to reduce the dimensionality of the data; the performance was assessed on the QSAR Biodegradation dataset and a classification rate of 82% was achieved.

Hierarchical stochastic graphlet embedding (HSGE) [42] is introduced using different hierarchical configurations for treating molecular graph datasets; the approach achieved 95.71% accuracy on the MAO dataset. In the same context, the work of [43] presented a fusion of a classical neural architecture (the multi-layer perceptron) with a recent deep learning architecture, called CNN-MLP, for predicting chemical activities. In the CNN-MLP method, two models, DeepBioD+ and DeepBioD, are proposed for the QSAR Biodegradation dataset based on domain-specific feature engineering and representations learned from pattern samples; they achieved 90% and 87.5% accuracy, respectively. A similar work based on a pre-trained deep learning model called ChemNet was presented in [44] for predicting chemical activities and achieved 86.7% accuracy on the QSAR Biodegradation dataset. In [45], the diffusion-convolutional neural network (DCNN) was introduced for graph-structured data using a diffusion-based deep learning representation; the experiments showed that an accuracy of 75.14% can be achieved on the MAO dataset.

3 Preliminaries

This section introduces the necessary basics of the HHO algorithm, the k-NN classifier and the genetic operators.

3.1 Harris hawks optimization

HHO [15] is a recent SI algorithm inspired by the cooperative behaviors of Harris hawks in hunting and of prey in escaping. Harris hawks demonstrate a variety of chasing styles depending on the dynamic nature of circumstances and the escaping patterns of the prey. In this intelligent strategy, several Harris hawks cooperatively attack from different directions and simultaneously converge on a detected rabbit escaping outside cover, exhibiting different hunting strategies. The candidate solutions are the Harris hawks, and the intended prey is the best candidate solution (nearly the optimum) at each step. The HHO algorithm comprises three phases: exploration, transition from exploration to exploitation, and exploitation. The hunting (exploration) phase is modeled as:

$$\begin{aligned} \begin{array}{l} x_{t + 1}^i = \left\{ {\begin{array}{*{20}{c}} {{x_{rand}} - {\tau _1}\left| {{x_{rand}} - 2{\tau _2}x_t^i} \right| } &{} {if\ {\tau _5} \ge 0.5\ } \\ {\left( {{x_{Rabbit}} - \overline{{x_t}} } \right) - {\tau _3}\left| {l{b^j} + {\tau _4}\left( {u{b^j} - l{b^j}} \right) } \right| } &{} {else\ } \end{array}} \right. \\ t \in \left[ {1 \cdots T} \right] ,i \in \left[ {1 \cdots N} \right], \\ \end{array} \end{aligned}$$
(1)

where the current location of the \(i\)th hawk and its new location at iteration \(t + 1\) are represented by \(x_t^i\) and \(x_{t + 1}^i\), whereas \( x_{rand}\) and \(x_{Rabbit}\) are a randomly selected hawk location and the best solution (the target rabbit), respectively. The lower and upper bounds of the \(j\)th dimension are defined by \(lb^j\) and \(ub^j\), while \(\tau _1\) to \(\tau _5\) are random numbers in the interval [0, 1]. The average hawk position \( \overline{{x_t}}\) is defined as:

$$\begin{aligned} \overline{x_t}=\frac{1}{N}\sum _{i=1}^{N}x_{t}(i). \end{aligned}$$
(2)

In Eq. (1), the first scenario (\( {\tau _5} \ge 0.5\)) grants the hawks a chance to hunt randomly spread over the planned space, while the second scenario covers the case where the hawks hunt alongside family members close to a target. In the transition phase from exploration to exploitation, the prey attempts to escape capture, so its escaping energy \(E_n\) decreases gradually. The energy is defined by

$$\begin{aligned} E_n=2* E_{n0}*\left(1-\frac{t}{T}\right), \end{aligned}$$
(3)

where the initial energy \(E_{n0} = 2*rand -1\) is changed randomly inside \((-1,1)\) and T is the maximum number of iterations. HHO remains explorative as long as \(|E_n|\ge 1\), with hawks exploring global regions, and switches into exploitative mode when \(|E_n|<1\). R denotes the escaping probability of the target. The exploitation phase aims to avoid falling into local optima.
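To make the exploration mechanics concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3); the function and variable names are illustrative assumptions, not taken from a reference implementation.

```python
import numpy as np

def escaping_energy(t, T, rng):
    """Escaping energy E_n of the prey, Eq. (3); decays linearly over iterations."""
    En0 = 2 * rng.random() - 1               # initial energy, random in (-1, 1)
    return 2 * En0 * (1 - t / T)

def exploration_step(X, i, x_rabbit, lb, ub, rng):
    """Update hawk i during exploration (|E_n| >= 1), following Eq. (1)."""
    tau = rng.random(5)                      # tau_1 ... tau_5, uniform in [0, 1]
    if tau[4] >= 0.5:
        # Perch relative to a randomly selected hawk
        x_rand = X[rng.integers(X.shape[0])]
        return x_rand - tau[0] * np.abs(x_rand - 2 * tau[1] * X[i])
    # Perch relative to the rabbit and the mean hawk position of Eq. (2)
    x_mean = X.mean(axis=0)
    return (x_rabbit - x_mean) - tau[2] * np.abs(lb + tau[3] * (ub - lb))
```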

The first task-surrounding soft The soft surrounding can be formulated mathematically when \(R\ge \frac{1}{2}\) and the energy level satisfies \(|E_n|\ge \frac{1}{2}\) as:

$$\begin{aligned} \begin{array}{l} x_{t + 1}^i = \Delta x_t^i - E_n\left| {J{x_{Rabbit}} - x_t^i} \right| \\ \Delta x_t^i = {x_{Rabbit}} - x_t^i,\ J = 2\left( {1 - {\tau _6}} \right), \\ \end{array} \end{aligned}$$
(4)

where \(\Delta x_t^i\) is the difference between the best agent (i.e., the rabbit) and the current position of the \(i\)th hawk, J indicates the random jump strength of the prey, and \(\tau _6\) is a random number in the interval [0, 1].

The second task-surrounding hard When the energy level satisfies \(|E_n|< \frac{1}{2}\) and \(R\ge \frac{1}{2}\), the rabbit becomes exhausted and the possibility of escaping is low (escaping becomes hard) because its energy has decreased. This behavior can be modeled by

$$\begin{aligned} x_{t + 1}^i = {x_{Rabbit}} - E_n\left| {\Delta x_t^i} \right|. \end{aligned}$$
(5)
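Under the same naming assumptions as the previous sketch, the two besiege rules of Eqs. (4) and (5) reduce to a few lines:

```python
import numpy as np

def soft_besiege(x_i, x_rabbit, En, rng):
    """Soft surrounding, Eq. (4): |E_n| >= 0.5 and R >= 0.5."""
    J = 2 * (1 - rng.random())               # random jump strength J = 2(1 - tau_6)
    delta = x_rabbit - x_i                   # distance to the best agent
    return delta - En * np.abs(J * x_rabbit - x_i)

def hard_besiege(x_i, x_rabbit, En):
    """Hard surrounding, Eq. (5): |E_n| < 0.5 and R >= 0.5."""
    return x_rabbit - En * np.abs(x_rabbit - x_i)
```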

The third task-surrounding soft beside advanced rapid dives This task applies when the energy level is greater than \(\frac{1}{2}\) \((|E_n|> \frac{1}{2})\) and \(R < \frac{1}{2}\), where the rabbit still has sufficient strength to run away. Hence, the hawks perform progressive dives in order to take the best position for catching the prey. This behavior is modeled by integrating the Lévy flight function [46].

The position of the \(i\)th hawk is then updated using:

$$\begin{aligned} \begin{array}{l} {x_{t+1}^{i} =\left\{ \begin{array}{cc} {y} &{} {if\; fit\left( y\right)< fit\left( x_{t}^{i} \right) } \\ \\ {z} &{} {if\, fit\left( z\right) < fit\left( x_{t}^{i} \right) , } \end{array}\right. } \\ {y=x_{rabbit} -E_{n} \left| Jx_{rabbit} -x_{t}^{i} \right| ,} \\ { z=y+r_{v} \times Lv\left( D\right), } \end{array} \end{aligned}$$
(6)

where

$$\begin{aligned} Lv\left( D\right)= & {} 0.01\times \frac{rand\left( 1,D\right) \times \sigma }{\left| rand\left( 1,D\right) \right| ^{\frac{1}{\beta } } } , \end{aligned}$$
(7)
$$\begin{aligned} \sigma= & {} \left( \frac{\Gamma \left( 1+\beta \right) \times \sin \left( \frac{\pi \beta }{2} \right) }{\Gamma \left( \frac{1+\beta }{2} \right) \times \beta \times 2^{\left( \frac{\beta -1}{2} \right) } } \right) ^{{}^{\frac{1}{\beta } } }, \end{aligned}$$
(8)

where D is the dimensionality of the search space, \( r_{v}\) is a vector of D components generated randomly inside (0,1), Lv represents the Lévy flight function, \(\beta \) is a constant with default value \(\beta =1.5\) and fit denotes the fitness function computed by Eq. (16).
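The Lévy step of Eqs. (7)-(8) can be sketched as below. We follow the standard Mantegna procedure, in which the two random vectors are Gaussian; treating the paper's rand(1, D) draws as Gaussian is our assumption.

```python
import numpy as np
from math import gamma, pi, sin

def levy_flight(D, beta=1.5, rng=np.random.default_rng()):
    """Lévy flight step of Eqs. (7)-(8) with the default beta = 1.5."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.standard_normal(D) * sigma       # numerator draws, scaled by Eq. (8)
    v = rng.standard_normal(D)               # denominator draws
    return 0.01 * u / np.abs(v) ** (1 / beta)
```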

The fourth task-surrounding hard beside advanced rapid dives In this task, it is assumed that \(R < \frac{1}{2}\) and the energy level satisfies \(|E_n|<\frac{1}{2}\); the prey has too little energy to evade, and the hawks are close to completing successive dives to catch it. This process can be described by

$$\begin{aligned} \begin{array}{l} x_{t + 1}^i = \left\{ {\begin{array}{*{20}{c}} y &{} {if\ fit\left( y \right)< fit\left( {x_t^i} \right) \ } \\ z &{} {if\ fit\left( z \right) < fit\left( {x_t^i} \right) \ } \\ \end{array}} \right. ; \\ \ \ y = {x_{Rabbit}} - E_n\left| {J{x_{Rabbit}} - {\overline{x}} } \right| ; \\ \ \ z = y + r_{v} \times Lv\left( D \right). \\ \end{array} \end{aligned}$$
(9)

For illustration, the general flowchart of the HHO algorithm is shown in Fig. 1.

Fig. 1

Flowchart of the HHO algorithm

3.2 k-Nearest neighbors algorithm (k-NN)

The k-NN classifier belongs to the family of supervised machine learning methods and identifies a new pattern based on a statistical distance metric. It is considered a lazy learning model and can be used for both prediction tasks and classification problems [47]. Its advantages are that its output is easy to interpret and that it offers low computational cost and good efficiency. The classification process depends only on computing the Euclidean distance between the current test example and the examples of the training data; the k smallest distances are then selected to determine the label of the current test vector. The steps of k-NN are given in Algorithm 1.

Algorithm 1 Steps of the k-NN classifier
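Algorithm 1 is not reproduced here; as a rough stand-in, the following sketch implements the distance-and-vote procedure described above (illustrative only, not the authors' code):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=5):
    """Classify one test example by majority vote among its k nearest neighbors."""
    # Euclidean distance from the test example to every training example
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```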

3.3 Genetic operators

The use of evolutionary operators is widespread in several algorithms. Two foundational algorithms in particular, differential evolution and genetic algorithms, have explored genetic operators in depth. Here, we give a quick overview of the genetic operators (i.e., mutation, crossover and selection).

Mutation The results of tasks three and four of HHO and the target solution (\(x_{Rabbit}\)) are used to produce the mutation operation. For each component, a number between 0 and 1 is generated randomly. If this value is greater than or equal to the mutation rate (\(\zeta \)), the component of the target agent (\(x_{Rabbit}\)) is taken; otherwise, the component of the y or z vector is kept. The mutation operator is determined using:

$$\begin{aligned}&\begin{array}{l} {y_{Mut} =\left\{ \begin{array}{cc} {x_{Rabbit} } &{} {if\ rand_{1} \ge \zeta } \\ {y} &{} {else} \end{array}\right. } {and\ } \\ {z_{Mut} =\left\{ \begin{array}{cc} {x_{Rabbit} } &{} {if\ rand_{2} \ge \zeta } \\ {z} &{} {else} \end{array}\right. } \\ {Where:\ \ \left\{ \begin{array}{c} {\zeta =\frac{t}{T} ;} \\ {y=x_{rabbit} -E_{n} \left| Jx_{rabbit} -x_{t}^{i} \right| ;} \\ {z=y+r_{v} \times Lv\left( D\right) \ \ } \end{array}\right. \ \ \ \ \ \ \ } \end{array} \end{aligned}$$
(10)
$$\begin{aligned}&\begin{array}{l} {y_{Mut} =\left\{ \begin{array}{cc} {x_{Rabbit} } &{} {if\ \rho _{1} \ge \zeta } \\ {y} &{} {else} \end{array}\right. } {and\ } \\ {z_{Mut} =\left\{ \begin{array}{cc} {x_{Rabbit} } &{} {if\ \rho _{2} \ge \zeta } \\ {z} &{} {else} \end{array}\right. } \\ {Where:\ \ \left\{ \begin{array}{c} {\zeta =\frac{t}{T} ;} \\ {y=x_{rabbit} -E_{n} \left| Jx_{rabbit} -\overline{x} \right| ;} \\ {z=y+r_{v} \times Lv\left( D\right). \ \ } \end{array}\right. \ \ \ \ \ \ \ } \end{array} \end{aligned}$$
(11)

Crossover In order to produce more diversity, the crossover operator recombines two individuals. An intermediate crossover with a random number \(\tau \) is used to generate a new offspring \(w_{Cross}\) as

$$\begin{aligned} {w_{Cross}} = {y_{Mut}} + \left( { \tau } \right) * {(z_{Mut}-{y_{Mut}}).}\ \end{aligned}$$
(12)

This type of operator allows children to inherit more information from their parents compared to other types such as linear recombination.

Selection The type of selection used here is a greedy selection inspired by differential evolution. The offspring produced by the evolutionary functions (mutation & crossover) are assessed; then, the performance of each child is compared with that of the parent to select the best one. The parent remains in the population if its performance is higher. The greedy selection is defined by the following rule:

$$\begin{aligned} x_{t + 1}^i = \left\{ {\begin{array}{*{20}{c}} {{y_{Mut}}} &{} {if\ fit\left( {{y_{Mut}}} \right)< fit\left( {x_t^i} \right) } \\ {{z_{Mut}}} &{} {if\ fit\left( {{z_{Mut}}} \right)< fit\left( {x_t^i} \right) } \\ {{w_{Cross}}} &{} {if\ fit\left( {{w_{Cross}}} \right) < fit\left( {x_t^i} \right) } \\ \end{array}} \right.. \end{aligned}$$
(13)
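A compact sketch of the three operators of Eqs. (10)-(13) follows; zeta = t/T is the mutation rate, and reading Eq. (13) as "keep the first offspring that improves on the parent" is our interpretation of the rule.

```python
import numpy as np

def mutate(vec, x_rabbit, zeta, rng):
    """Eqs. (10)-(11): take the rabbit's component where rand >= zeta."""
    mask = rng.random(vec.size) >= zeta
    return np.where(mask, x_rabbit, vec)

def crossover(y_mut, z_mut, rng):
    """Eq. (12): intermediate recombination of the two mutant vectors."""
    tau = rng.random()
    return y_mut + tau * (z_mut - y_mut)

def greedy_select(x_parent, offspring, fit):
    """Eq. (13): the parent survives unless an offspring has lower fitness."""
    f_parent = fit(x_parent)
    for child in offspring:                  # order: y_mut, z_mut, w_cross
        if fit(child) < f_parent:
            return child
    return x_parent
```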

4 The proposed HHOCM algorithm

In this section, we give an alternative method for FS that combines HHO with genetic operators. Like other meta-heuristic algorithms, HHO tends to suffer from low diversity, entrapment in local optima and unbalanced exploitation ability [48, 49]. Although HHO has the advantages of acceptable convergence speed and a simple structure, it may fail to maintain the balance between exploration and exploitation and fall into a local optimum on some complex optimization problems [50].

Thus, the main contribution of the proposed HHOCM algorithm is the integration of genetic operators (mutation, crossover and selection) to solve the exploitation problem of the HHO algorithm. To this end, HHOCM tries to ensure more diversity through two main phases: an initialization phase and an updating phase. The framework of the proposed HHOCM algorithm for FS is given in Fig. 2.

Fig. 2

A general framework for the proposed HHOCM algorithm

4.1 Initialization phase

In this step, HHOCM generates N swarm agents in the first population, where each individual represents a subset of molecular descriptors (features) to be selected for evaluation. This step has a significant effect on the convergence and on the quality of the optimal solution. The population X is generated randomly as:

$$\begin{aligned} x_i^j=lb^j+\lambda ^j \times (ub^j-lb^j), \ i=1,2,\ldots ,N; \ j=1,2,\ldots ,D. \end{aligned}$$
(14)

The lower and upper bounds \(lb^j\) and \(ub^j\) for each candidate solution i are in the range [0, 1], and \(\lambda ^j\) is a random number \(\in [0, 1]\). To select a subset of molecular descriptors, an intermediate binary conversion step is necessary before fitness evaluation. Thus, each solution \(x^i\) undergoes a binary conversion (\(x^i_{bin}\)) using:

$$\begin{aligned} x^{i}_{bin}=\left\{ {\begin{array}{*{20}{c}} 1&{} \text {if} \ x^i>0.5\\ 0&{}\text {otherwise.}\\ \end{array}} \right. \end{aligned}$$
(15)

As an example, consider a solution \(x^i\) with ten molecular descriptors, where \(x^i=[0.6, 0.2, 0.9, 0.33, 0.15, 0.8, 0.2, 0.75, 0.1, 0.9]\). Applying Eq. (15) generates a binary vector \(x^i_{bin}\), where ones imply that the corresponding molecular descriptors are selected and zeros that they are not. This means that the first, third, sixth, eighth and tenth molecular descriptors in the original dataset are relevant and must be selected, while the others are irrelevant and must be eliminated. After determining the subset of selected molecular descriptors, the fitness function is calculated for each agent \(x^i_{bin}\) to determine the quality of these features. The fitness of the \(i\)th solution is defined by

$$\begin{aligned} fit_i=\upsilon _{1}\times Er_i+\upsilon _{2}\times \frac{d_i}{D}, \end{aligned}$$
(16)

where \(\upsilon _{1}=0.99\) and \(\upsilon _{2}=1-\upsilon _{1}\). The weight \(\upsilon _{1}\) is the equalizer parameter that balances the classification error rate \((Er_{i}=1-Acc_i)\) against the number of selected molecular descriptors \((d_i)\). In Eq. (16), D is the total number of molecular descriptors (MD) in the original dataset. The k-NN is utilized as the classifier in the FS cycle. As the classification strategy, hold-out is used, which assigns 80% of the data as the training set and the rest as testing samples. \(Er_i\) denotes the error rate on the test data computed by k-NN (Algorithm 1). The lowest fitness value among all agents determines the best prey \((x_{Rabbit})\).
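As a minimal sketch, the binary conversion of Eq. (15) and the wrapper fitness of Eq. (16) might look as follows; the snippet reuses the hypothetical knn_predict helper sketched in Sect. 3.2 and a simple 80/20 hold-out split.

```python
import numpy as np

def binarize(x):
    """Eq. (15): descriptor j is selected when x_j > 0.5."""
    return x > 0.5

def fitness(x, X_data, y_data, v1=0.99, rng=np.random.default_rng()):
    """Eq. (16): weighted sum of the k-NN error rate and the selection ratio."""
    mask = binarize(x)
    if not mask.any():                       # no descriptor selected: worst fitness
        return 1.0
    n = len(y_data)
    idx = rng.permutation(n)                 # hold-out: 80% training, 20% testing
    tr, te = idx[:int(0.8 * n)], idx[int(0.8 * n):]
    errors = sum(knn_predict(X_data[tr][:, mask], y_data[tr],
                             X_data[t][mask]) != y_data[t] for t in te)
    Er = errors / len(te)                    # test error rate, Er_i = 1 - Acc_i
    return v1 * Er + (1 - v1) * mask.sum() / x.size
```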

4.2 Updating phase

The process of updating solutions consists of first applying the exploration step, which performs a global search when the energy magnitude is at least one. After that, the transition from exploration to exploitation is applied. Then, the exploitation phase is employed, which contains four tasks: surrounding soft, surrounding hard, surrounding soft beside advanced rapid dives and surrounding hard beside advanced rapid dives, the latter two boosted by genetic operators. To improve the local search capability, HHOCM integrates the mutation operator in task three (surrounding soft beside advanced rapid dives) and task four (surrounding hard beside advanced rapid dives) of HHO using Eqs. (10) and (11), respectively. For more diversity, another genetic operator called crossover is introduced; it combines the two mutant vectors \(y_{Mut}\) and \(z_{Mut}\) to produce a new child \(w_{Cross}\), as described in Eq. (12). The fitness values of all offspring (\(y_{Mut}\), \(z_{Mut}\) and \(w_{Cross}\)) are compared via the selection operator of Eq. (13) to identify the best prey \(x_{Rabbit}\). The process is repeated until the termination condition is met; the stopping criterion is the maximum number of iterations, which allows the performance of the HHOCM algorithm to be evaluated. Then, the best solution \(x_{Rabbit}\) is returned and converted to determine the set of relevant features. In this regard, experiments are carried out 30 times independently to obtain accurate and precise results.
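For orientation, one HHOCM iteration can be condensed as below. This is a simplification of the flow in Fig. 2 assembled from the helper sketches of Sect. 3, not the authors' exact control flow; boundary cases and bookkeeping are elided.

```python
import numpy as np

def hhocm_iteration(X, x_rabbit, t, T, lb, ub, fit, rng):
    """One sketched update pass over all N hawks."""
    zeta = t / T                             # mutation rate of Eq. (10)
    D = X.shape[1]
    for i in range(X.shape[0]):
        En = escaping_energy(t, T, rng)
        R = rng.random()                     # escaping probability of the prey
        if abs(En) >= 1:                     # exploration, Eq. (1)
            X[i] = exploration_step(X, i, x_rabbit, lb, ub, rng)
        elif R >= 0.5:                       # tasks 1-2: besiege without dives
            X[i] = (soft_besiege(X[i], x_rabbit, En, rng) if abs(En) >= 0.5
                    else hard_besiege(X[i], x_rabbit, En))
        else:                                # tasks 3-4: rapid dives + CM operators
            J = 2 * (1 - rng.random())
            base = X[i] if abs(En) >= 0.5 else X.mean(axis=0)
            y = x_rabbit - En * np.abs(J * x_rabbit - base)
            z = y + rng.random(D) * levy_flight(D, rng=rng)
            y_mut = mutate(y, x_rabbit, zeta, rng)
            z_mut = mutate(z, x_rabbit, zeta, rng)
            w_cross = crossover(y_mut, z_mut, rng)
            X[i] = greedy_select(X[i], [y_mut, z_mut, w_cross], fit)
        X[i] = np.clip(X[i], lb, ub)         # keep solutions inside the bounds
        if fit(X[i]) < fit(x_rabbit):        # track the best prey
            x_rabbit = X[i].copy()
    return X, x_rabbit
```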

5 Experimental results and discussion

To validate the effectiveness of the proposed HHOCM algorithm, a number of experiments are conducted on two datasets widely used in the field of cheminformatics: QSAR Biodegradation and MAO. The first experiment studies the impact of the swarm size (N) and the maximum number of iterations (T) on the accuracy and the number of selected features. The second experiment determines the best value of the control parameter \(\beta \) used in the Lévy flight function. The third experiment compares the performance of the HHOCM algorithm and seven recent SI algorithms (i.e., HHO, GWO, WOA, SSA, DA, GOA and ALO) using the optimal control parameters obtained in the previous two experiments; each algorithm is executed 30 times with the same optimal values of N, T and \(\beta \). Further experiments apply the statistical Wilcoxon test as an assessment measure to verify the significance of the accuracy achieved by HHOCM and ROBLHHOCM against the other competitor algorithms. The last experiment compares the three versions, HHOCM, OBLHHOCM and ROBLHHOCM, with other works from the literature using the same parameter configuration on both the QSAR Biodegradation and MAO datasets. The runs are performed on a PC with an Intel Core i7-5500 CPU @ 2.40 GHz, 8 GB RAM, Windows 10 and Matlab 2016a.

5.1 Parameter settings

Parameter settings of the DA, ALO, GWO, WOA, SSA, GOA and HHO algorithms, as well as of the proposed versions of HHO, are listed in Table 1.

Table 1 Parameter settings of the SI algorithms

5.2 Performance measures

The performance of the proposed HHOCM algorithm is assessed based on several criteria including the average and standard deviation of fitness, accuracy, number of selected features, sensitivity, specificity and CPU time. Table 2 shows the confusion matrix from which performance metrics such as accuracy (Acc), sensitivity (Sn) and specificity (Sp) are derived; a compact code sketch of these metrics follows the list below.

Table 2 Confusion matrix
  • Average accuracy \((AVG_{Acc})\): Acc represents the proportion of correct correspondences between the labels of the sample data and the output of the classifier and is computed using:

    $$\begin{aligned} Acc = \frac{{Tp + Tn}}{{Tp + Fn + Fp + Tn}}. \end{aligned}$$
    (17)

    The number of runs is fixed \({N_r=30}\), so the average accuracy \(AVG_{Acc}\) is calculated as:

    $$\begin{aligned} AV{G_{Acc}} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {Acc_{best}^{(k)}}. \end{aligned}$$
    (18)
  • Average sensitivity \((AVG_{Sn})\): The sensitivity (Sn) assesses the rate of correctly predicted positive samples:

    $$\begin{aligned} Sn = \frac{{Tp}}{{Tp + Fn}}. \end{aligned}$$
    (19)

    The \(AVG_{Sn}\) is calculated from the best prey \((x_{Rabbit})\) using:

    $$\begin{aligned} AV{G_{Sn}} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {Sn_{best}^{(k)}}. \end{aligned}$$
    (20)
  • Average precision \((AVG_{Pr})\): The precision (Pr) indicates the proportion of predicted positives that are truly positive:

    $$\begin{aligned} Pr = \frac{{Tp}}{{Fp + Tp}}. \end{aligned}$$
    (21)

    The \(AVG_{Pr}\) is determined via the following equation:

    $$\begin{aligned} AV{G_{Pr}} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {Pr_{best}^{(k)}}. \end{aligned}$$
    (22)
  • Average fitness value \((AVG_{fit})\): The fitness value estimates the quality of each algorithm by combining the FS selection ratio with the classifier error rate as in Eq. (16). Its average is computed by

    $$\begin{aligned} AV{G_{fit}} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {fit_{best}^{(k)}}. \end{aligned}$$
    (23)
  • Average size of selected features \((AVG_{size})\): This metric gives the average number of selected (relevant) features. It is computed as:

    $$\begin{aligned} AVG_{size} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {d_{best}^{(k)}}, \end{aligned}$$
    (24)

    where \({d_{best}^{(k)}}\) is the cardinality of the selected feature set of the best agent in the \(k\)th execution.

  • Average CPU time \((AVG_{Time})\): It is the average of computation time for each method, that is:

    $$\begin{aligned} AV{G_{Time}} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {T_{best}^{(k)}}. \end{aligned}$$
    (25)
  • Standard deviation (Std): It reflects the stability of each algorithm across different executions and is calculated for all the measures described previously.
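As referenced above, here is a compact sketch of Eqs. (17)-(22) computed from the confusion-matrix counts of Table 2; the run-averaging line at the end assumes a hypothetical list best_counts of per-run (Tp, Tn, Fp, Fn) tuples.

```python
def metrics(Tp, Tn, Fp, Fn):
    """Accuracy (Eq. 17), sensitivity (Eq. 19) and precision (Eq. 21)."""
    acc = (Tp + Tn) / (Tp + Fn + Fp + Tn)
    sn = Tp / (Tp + Fn)
    pr = Tp / (Fp + Tp)
    return acc, sn, pr

# Averages over N_r = 30 runs, Eqs. (18)/(20)/(22), e.g.:
# avg_acc = sum(metrics(*c)[0] for c in best_counts) / len(best_counts)
```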

5.3 Datasets and preprocessing details

The Monoamine Oxidase (MAO) dataset concerns an enzyme that is distributed across most tissues and catalyzes the inactivation and oxidation of monoamine neurotransmitters. The data are taken from the publicly available GREYC chemistry dataset.Footnote 1 The molecules are converted from the MOA format to simplified molecular-input line entry system (SMILES) strings via the Open Babel software [51]; the molecular descriptors are then computed using E-Dragon [4]. The dataset contains 1665 features (MD) for 68 compounds divided into two classes.


The QSAR Biodegradation dataset has 41 attributes (molecular descriptors) for classifying 1055 chemical compounds. The data are used to discriminate between two chemical classes, comprising 356 readily biodegradable samples and 699 not readily biodegradable ones. In addition, the data are useful for QSAR development in order to determine the correlation between molecular biodegradation and chemical design. The dataset is available on the UCI web page.Footnote 2

The preprocessing stage of the datasets consists of three steps:

  1. Protein information, stored in the MOA chemical format, is converted to isomeric SMILES using the Open Babel software [51]. Features represent attributes whose values define the instances.

  2. Descriptors are calculated with the E-Dragon software [4], which computes different 2D and 3D representations of the molecules for the QSAR model. The descriptors are categorized into structural and physico-chemical types (weight and volume of the molecule, rotatable bonds, interatomic distances, atom types, molecular walk counts, electronegativity, atom distribution, aromatic and thawed characteristics).

  3. The correlation between chemical design and biological activity is expressed mathematically using QSAR, and the features can identify the instances. QSAR is used to seek the main characteristics of chemical compounds, as shown in Fig. 3. In addition, several ML techniques are exploited for structure–activity correlation analysis to predict the similarity of compounds in the presence of a given disease. Complex molecular compounds contain several features such as topological factors [52].

Fig. 3

Flowchart of the QSAR model

5.4 Sensitivity analysis

This initial test is conducted to analyze the sensitivity of HHOCM to some of its parameters, namely the swarm size (N), the number of iterations (T) and the \(\beta \) parameter. The sensitivity is assessed in three stages. First, we treat the effect of the swarm size and the maximum number of iterations on the accuracy \((AVG_{Acc})\) and number of selected features \((AVG_{size})\) obtained by HHO and HHOCM. Second, we study the influence of the \(\beta \) parameter used in the Lévy flight function on the same measures on the QSAR Biodegradation and MAO datasets. Third, we analyze the influence of initialization using opposition-based learning (OBL) and random OBL (ROBL) on the same measures on the two datasets. A short description of OBL [53, 54] and ROBL [55] follows, with a code sketch of both operators after the list:

  • OBL produces opposite solutions that enhance convergence and help jump out of local optima. This operator can be modeled mathematically using

    $$\begin{aligned} {x_i^j}^*=lb^j+ub^j-x_i^j, \ i=1,2,...,N; j=1,2...,D. \end{aligned}$$
    (26)
  • ROBL is a newer operator that allows the search space to be explored with more diversity. It can be formulated by

    $$\begin{aligned} {x_i^j}^*=lb^j+ub^j-r*x_i^j. \end{aligned}$$
    (27)

    where r is a random number \(\in [0, 1]\). The initial population is created by keeping, for each current solution \((x_i)\), the fitter of \(x_i\) and its opposite \((x_i^*)\).
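As referenced above, both initialization operators can be sketched as below; the fitter of each solution and its opposite is retained. Applying r component-wise in ROBL is our assumption, since the paper defines r as a single random number.

```python
import numpy as np

def obl(X, lb, ub):
    """Opposite population, Eq. (26)."""
    return lb + ub - X

def robl(X, lb, ub, rng):
    """Random opposite population, Eq. (27), with r drawn per component."""
    return lb + ub - rng.random(X.shape) * X

def opposed_init(X, lb, ub, fit, rng, use_robl=True):
    """Keep, for each row, the better of the solution and its opposite."""
    X_opp = robl(X, lb, ub, rng) if use_robl else obl(X, lb, ub)
    keep = np.array([fit(x) <= fit(xo) for x, xo in zip(X, X_opp)])
    return np.where(keep[:, None], X, X_opp)
```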

By inspecting the results of Tables 3, 4, 5 and 6, it can be seen that the optimal values of accuracy \((AVG_{Acc})\) and number of selected features \((AVG_{size})\) are obtained when the swarm size N is 10 and the maximum number of iterations T is 100 for both datasets using the basic HHO and HHOCM. The second stage treats the impact of \(\beta \) by varying its value from 0.5 to 2 while fixing the swarm size (N) and the maximum number of iterations (T) to the best values obtained in the first stage, namely 10 and 100, respectively. From Tables 7, 8, 9 and 10, it can be observed that the best values of accuracy and selected features are reached when the value of \(\beta \) in the Lévy flight function equals 1.5 for both datasets using HHO and HHOCM. Tables 11 and 12 highlight the impact of OBL and ROBL in the initialization step for the basic HHO and the proposed HHOCM on both datasets (QSAR Biodegradation and MAO). It can be clearly seen that ROBL enhanced the performance of HHO and HHOCM on both datasets. Additionally, ROBLHHOCM provides high performance in terms of average accuracy and size of selected features compared to HHO, OBLHHO, ROBLHHO, HHOCM and OBLHHOCM. The best obtained values of these control parameters are used for the rest of the experiments. The OBL-based initialization thus offers another angle of view, leading to the two new variants of HHOCM called OBLHHOCM and ROBLHHOCM.

Table 3 Impact of iterations number and swarm size on the accuracy and number of selected features for the MAO dataset using basic HHO
Table 4 Impact of iterations number and swarm size on the accuracy and number of selected features for the MAO dataset using HHOCM
Table 5 Impact of iterations number and swarm size on the accuracy and number of selected features for QSAR Biodegradation dataset using basic HHO
Table 6 Impact of iterations number and swarm size on the accuracy and number of selected features for the QSAR Biodegradation dataset using HHOCM
Table 7 Impact of the \({\beta }\) parameter on Lévy function for the MAO dataset using basic HHO
Table 8 Impact of the \({\beta }\) parameter on Lévy function for the MAO dataset using HHOCM
Table 9 Impact of the \({\beta }\) parameter on Lévy function for the QSAR Biodegradation dataset using basic HHO
Table 10 Impact of the \({\beta }\) parameter on Lévy function for the QSAR Biodegradation dataset using basic HHOCM
Table 11 Impact of initialization strategies for the MAO dataset using basic HHO and HHOCM
Table 12 Impact of initialization strategies for the Biodegradation dataset using basic HHO and HHOCM

5.5 Comparison of HHOCM with other SI algorithms

  • In terms of the average and standard deviation of fitness: Table 13 reports the mean fitness values obtained by the proposed HHOCM, OBLHHOCM and ROBLHHOCM algorithms and by recent SI algorithms. It can be clearly deduced that ROBLHHOCM outperforms all other competitor algorithms on both datasets. This performance can be attributed to two factors: the first is the use of evolutionary CM operators in HHO, while the second is the use of the OBL operator, especially random OBL, which enhances exploration and avoids convergence to local optima. Also, the OBLHHO algorithm takes the second rank in terms of average fitness on the QSAR Biodegradation dataset, while HHOCM is ranked third, which can be explained by the more diverse solutions generated by the CM operators. For the MAO dataset, the convergence curves of the three variants of HHOCM are shown in Fig. 4. The proposed ROBLHHOCM algorithm shows more stability on both datasets because its Std values are close to zero, which reflects a good balance between exploration and exploitation.

Table 13 The average fitness values of all competing optimizers
  • In terms of the average and standard deviation of accuracy and selected features: the performance of the three variants of HHOCM (ROBLHHOCM, OBLHHOCM and HHOCM), the three variants of HHO (ROBLHHO, OBLHHO and HHO) and the other swarm competitor algorithms in terms of accuracy and number of selected features is illustrated in Tables 14 and 15. It can be seen that ROBLHHOCM finds the most informative features, providing high accuracy on both datasets. It is important to highlight that ROBLHHOCM achieves a classification accuracy of 100% while keeping only four of 1665 features on the MAO dataset, which represents high-dimensional, low-instance data. Also, the three variants of HHOCM outperform the variants of HHO in terms of average correct classification rate and average size of selected features on both datasets. On the MAO dataset, the second rank is shared between OBLHHOCM and HHOCM in terms of average accuracy, while the second-best optimizer on the QSAR Biodegradation dataset is OBLHHOCM. In this regard, the three variants of HHOCM achieve high performance in terms of average accuracy and average size of selected features.

Table 14 The average classification accuracy of all competing optimizers
Table 15 The average size of selected features of all competing optimizers
  • In terms of the average and standard deviation of sensitivity and precision: a comparison of the sensitivity and precision of the three variants of HHOCM, the three variants of HHO and six SI algorithms is given in Tables 16 and 17. The performance of ROBLHHOCM in terms of sensitivity and precision remains much better than that of all other competitor algorithms. In terms of precision, a clear advantage is observed for the three variants of HHOCM, especially on the MAO dataset.

Table 16 The average sensitivity of the optimizers
Table 17 The average precision of the optimizers
Table 18 The average CPU time of the optimizers
  • In terms of the average and standard deviation of CPU time: the CPU time consumed by the three variants of HHOCM/HHO and the other algorithms is given in Table 18. From the listed results, it can be observed that WOA is very fast, especially on the MAO dataset where the number of patterns is small, while the three variants of HHOCM require more time as the number of samples increases. This behavior can be attributed to the two added genetic operators (CM) and the use of the OBL operator. For the QSAR dataset, SSA takes the lowest time due to its simple updating operator.

  • Wilcoxon rank-sum test: assessing the significance of the results obtained by the different algorithms requires a statistical test of the efficiency of the proposed ROBLHHOCM algorithm against HHOCM and the other SI algorithms, including DA, WOA, GOA, ALO, GWO, SSA and HHO. Table 19 shows the p-values of the Wilcoxon rank-sum test based on the accuracy metric. It can be concluded that the proposed ROBLHHOCM is clearly superior to the other SI algorithms: on both datasets, ROBLHHOCM obtained p-values below 1% against all SI algorithms except HHOCM. Thus, the proposed HHOCM and ROBLHHOCM algorithms are statistically significant compared to all other optimizers tested in this study. In addition, HHOCM and ROBLHHOCM provide the same performance on QSAR, while on the MAO dataset, HHOCM is statistically significantly different from ROBLHHOCM.

Table 19 Wilcoxon rank-sum test
  • Graphical analysis: Fig. 4 illustrates the convergence curves of the ROBLHHOCM and HHOCM algorithms against all other SI algorithms (HHO, GWO, GOA, WOA, SSA, DA and ALO), implemented and assessed under the same conditions (i.e., the same number of agents \((N=10)\) and the same number of iterations \((T=100)\)). It is clear that the HHOCM and ROBLHHOCM algorithms present fast convergence on both datasets. Also, the convergence of HHOCM is faster than that of ROBLHHOCM on the high-dimensional MAO dataset. Additionally, on the QSAR Biodegradation dataset, the convergence of ROBLHHOCM is faster than that of WOA, ALO, DA, GWO, GOA, SSA and HHOCM. Moreover, the convergence of the HHOCM and ROBLHHOCM algorithms shows that the optimal fitness values coincide with the optimal accuracy values. This phenomenon can be explained by the effective trade-off between exploration and exploitation due to the integration of genetic operators and the use of the random OBL operator. Figure 5 shows box plots of the accuracy achieved on both datasets by the competitor algorithms and the proposed variants of HHOCM. From this representation, one can read the first quartile \((Q{_1})\), third quartile \((Q{_3})\), maximum and minimum values; the red line inside the box indicates the median. It is important to emphasize that each box is obtained from 30 runs of each algorithm. Looking closely at Fig. 5, it can be concluded that the HHOCM and ROBLHHOCM algorithms have higher box plots on both datasets than the other SI algorithms. The third place is taken by ALO on the QSAR Biodegradation dataset, while WOA shows the third-highest box on the MAO dataset.

Fig. 4

Convergence curves of the HHOCM and ROBLHHOCM algorithms against other SI algorithms

Fig. 5

Boxplot of the HHOCM and ROBLHHOCM algorithms against other SI algorithms

5.6 Comparison of HHOCM variants with the existing algorithms

To prove the efficiency of HHOCM, OBLHHOCM and ROBLHHOCM, some results from the literature on the same datasets are reported in Table 20. Such a comparison is difficult because researchers use different parameter configurations, especially the population size (N) and the maximum number of iterations (T) of swarm algorithms. For instance, the work of [34] used a population size of eight solutions \((N=8)\) and a maximum of 200 iterations \((T=200)\). Also, a recent work by [35] employed other parameter values (i.e., population size \((N=30)\), maximum number of iterations \((T=100\), 500 and 1000)) and different classifiers (k-NN and SVM). To address this issue, additional experiments using the same parameter configurations are conducted to allow a fair comparison between HHOCM, OBLHHOCM, ROBLHHOCM and other swarm optimizers from the literature.

Table 20 Comparison with some existing algorithms

MAO dataset Table 20 presents the results of the three versions of the HHOCM algorithm and other competitor algorithms including SSA, MFO, PSO, GOA, SCA, DCNN and HSGE. According to these results, the competitor algorithms are clearly not as good as ROBLHHOCM or HHOCM: the SSA optimizer achieved 87.35% while keeping 783.55 molecular descriptors, whereas ROBLHHOCM obtained a higher classification rate of 100% while keeping only 4.3333 MD out of 1665 features in the case of \((N=10\) and \(T=100)\). Additionally, ROBLHHOCM achieved 100% accuracy with 4.6667 molecular descriptors in the case of \((N=30\) and \(T=100)\). In [35], HHO-k-NN under the same conditions achieved 96.9% accuracy for \((N=30\) and \(T=100)\), while the DCNN classifier is ranked third with 75.14% accuracy.


QSAR Biodegradation dataset The results listed in Table 20 for the QSAR Biodegradation dataset prove that the best optimizer is ROBLHHOCM, which achieved 90.84% accuracy in the case of \((N=10\) and \(T=100)\), followed by OBLHHOCM with 90.81%. Among the deep learning approaches, DeepBioD+ performed well, achieving 90%. In the case of \((N=30\) and \(T=100)\), an accuracy of 85.9% was achieved by HHO-k-NN in [35]. For approaches using classical ML methods such as ANN-SVM, only 82% of compound activities were recognized. It is important to highlight that the lowest number of molecular descriptors for the QSAR Biodegradation dataset is obtained by ROBLHHOCM, equal to 13.6667 in the case of \((N=10\) and \(T=100)\); HHOCM, MoDeSuS and OBLHHOCM selected around 15 MD. Thus, the proposed ROBLHHOCM shows powerful efficiency on both datasets, achieving high accuracy with a low number of molecular descriptors compared to the competitor algorithms reported in Table 20. This behavior can be attributed to the incorporation of genetic operators into HHO and to the random OBL operator, which clearly enhance the diversity of the population and the exploitation step. However, the proposed HHOCM, OBLHHOCM and ROBLHHOCM algorithms suffer from certain drawbacks: the computation time, and the fact that the subset of selected molecular descriptors changes from one execution to another, which may confuse users.

6 Conclusion

In the field of cheminformatics research, QSAR is an important model that predicts the biological activities and physicochemical properties of chemical compounds. QSAR presents a real challenge because the representation of chemical compounds requires many features (i.e., a high-dimensionality problem arises). FS based on SI algorithms has become an efficient solution for keeping the prominent features and removing irrelevant data. To tackle these challenges, this paper has proposed three hybrid wrapper FS algorithms, called HHOCM, OBLHHOCM and ROBLHHOCM, which combine HHO with genetic operators assisted by OBL strategies for selecting proper chemical descriptors. The introduced wrapper FS is based on the HHOCM variants and integrates the k-NN classifier, which provides accurate and fast classification. To evaluate the proposed variants of the HHOCM algorithm, two common chemical-information datasets, the MAO dataset and the QSAR Biodegradation dataset, are considered in the performance evaluation process. The quantitative results revealed that the proposed HHOCM, OBLHHOCM and ROBLHHOCM algorithms achieve significant performance compared to seven well-established SI algorithms, including the basic HHO, GWO, ALO, DA, WOA, GOA and SSA, on both datasets. Moreover, the proposed ROBLHHOCM algorithm outperformed the competitor algorithms in terms of average and standard deviation of fitness, accuracy, number of selected features, sensitivity and precision.

As future work, the three variants of HHOCM can be applied to multi-objective global optimization or to FS for high-dimensional, small-instance data in order to simultaneously increase the classification rate and decrease the attribute selection ratio. Another direction is to implement HHOCM in parallel in order to reduce the computation time.