1 Introduction

Cheminformatics offers many strategies that can be used in drug design and discovery. Considerable effort is devoted to evaluating specific molecular properties over large numbers of chemical compounds [1]. The process of predicting molecular properties is closely related to the virtual screening of a chemical library for drugs. Commonly, a drug is an organic molecule that inhibits the function of proteins through biomolecular interactions [2]. Drug design is often described as the rational, inventive process of finding new drugs based on knowledge of a biological target [3]. Drug design and discovery involve lead structure optimization, quantitative structure–activity relationships (QSAR) [4] and docking of a ligand into a receptor [5]. Recently, machine learning (ML) techniques have been applied in the cheminformatics field for chemical descriptor selection, the prediction of compound activities and molecular properties [6], and drug design and discovery [7]. Moreover, the size of the search space grows exponentially with the number of features available in the dataset.

Feature selection (FS) is used in many critical fields such as classification, data mining and object recognition, where it helps eliminate obsolete and redundant features from datasets [8, 9]. It poses a real computational challenge, especially when working with high-dimensional datasets in classification problems [10, 11]. The aim of FS is to minimize the number of features, thereby shrinking the search space and allowing ML techniques to use only the most significant features that affect classification accuracy [12]. Swarm intelligence (SI) algorithms are among the most common methods used to solve FS problems [13]. SI algorithms are computational intelligence methods made up of a population of artificial agents and inspired by the social behavior of animals in the real world [14].

Heidari et al. [15] proposed the HHO algorithm, which mimics the cooperative hunting behavior of Harris hawks. The original HHO suffers from some important limitations: (1) its exploration and exploitation are smooth but unbalanced, so the global and local search processes are difficult to control, (2) it converges prematurely on highly multi-modal problems and (3) its exploitation strategy is insufficient, so the search agents may settle on local solutions. In this study, to overcome these limitations, a wrapper feature selection method termed HHOCM is proposed, which hybridizes HHO with crossover and mutation for selecting chemical descriptors and predicting chemical activities. The role of mutation and crossover is to generate new offspring, which helps find solutions by simulating the natural laws of inheritance and adaptation to the environment.

Experimental results revealed that the three versions of the proposed HHOCM algorithm are efficient alternatives for solving FS problems. Several comparisons are performed against seven well-established SI algorithms, namely the grey wolf optimizer (GWO) [16], original HHO [15], whale optimization algorithm (WOA) [17], salp swarm algorithm (SSA) [18], ant lion optimizer (ALO) [19], grasshopper optimization algorithm (GOA) [20] and dragonfly algorithm (DA) [21], using k-nearest neighbors (k-NN) as the classifier. Further, the Wilcoxon test is used to evaluate the performance of the proposed algorithms. The ROBLHHOCM and HHOCM algorithms achieve the best results compared with the competitor algorithms on most statistical and graphical measures (e.g., average and standard deviation of fitness, accuracy, number of selected features, convergence curves and boxplots). To sum up, the major contributions of this work are:

  1. Crossover and mutation evolutionary operators are used to enhance the performance of the HHO algorithm.

  2. An opposition-based learning operator is integrated into HHO.

  3. Three HHO-based versions, called the HHOCM, OBLHHOCM and ROBLHHOCM algorithms, are proposed to select chemical descriptors and predict chemical compound activities.

  4. A wrapper FS paradigm is modeled using the three versions.

  5. A series of experiments is carried out to demonstrate the superiority of these versions in choosing the best molecular descriptor subset with high classification accuracy.

The rest of this paper is organized as follows. Section 2 summarizes related work in the literature. Section 3 briefly presents the basics of the HHO algorithm, the k-nearest neighbors (k-NN) algorithm and the genetic operators. Section 4 explains the proposed modifications of the HHO algorithm. Experimental results are reported and discussed in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Related work

Generally, creating a medicinal molecule requires protein data bank databases in order to determine the crystal structure of the target protein [22]. Computer-aided drug design (CADD) is an efficient tool for identifying chemical compounds in drug design and discovery [2]. CADD methods extract compounds from databases such as PubChem using pharmacophore modeling tools and rely on the important concept of docking large libraries of small molecules [23]. CADD methods are also used to obtain the protein and the ligand. A good drug depends on good ligands: the PyMOL software [24] is used to separate the ligand from the protein, and the binding energy is calculated with the AutoDock software.

In the same context, QSAR is applied to describe the correlation between the structures of a set of molecules and their responses to targets; thus, it can be considered an alternative to CADD methods [25]. In general, a QSAR dataset consists of active and inactive molecules, which requires good molecular descriptors representing the molecular features responsible for the relevant activity. Drug design and discovery is one of the main tasks in cheminformatics and involves two phases: an encoding phase, in which the molecular graph (or connection table) is represented as a feature vector by calculating descriptors that capture three-dimensional information about the molecular structure, and a mapping phase, in which various ML models are built. Mapping between feature vectors and properties, i.e., discovering the underlying functions, is the major role of ML techniques in cheminformatics [12].

Several efforts have been made to select proper features in datasets. Three categories of FS are found in the literature [26]: filter-based [27], embedded-based [28] and wrapper-based [29]. SI-based FS includes several algorithms, such as an improved salp swarm algorithm using crossover [30], the whale optimization algorithm [31] and binary dragonfly optimization [32]. In [33], a filter-based FS method is introduced for the QSAR Biodegradation and other medical benchmarks. It combined Relief-F with differential evolution to select the most relevant features and achieved 85.4% classification accuracy while keeping only 16 of 41 molecular descriptors. Wrapper-based methods have attracted more attention due to the involvement of learning algorithms in the FS process; the selection of significant features is thus driven by the performance of the learning algorithm (e.g., the correct classification rate) [12]. A swarm-based wrapper FS algorithm is introduced in [34] for predicting chemical compound activities: SSA is applied to select the best subset of molecular descriptors of the Monoamine Oxidase (MAO) dataset, and SSA with a k-NN classifier obtained the highest accuracy of 87.35% while keeping only 783 molecular descriptors. Houssein et al. [35] proposed two classification approaches called HHO-SVM and HHO-kNN for drug design and discovery prediction. In [36], HHO is combined with cuckoo search for drug design and discovery in cheminformatics.

Another recently developed branch is FS based on multi-objective optimization algorithms for selecting molecular descriptors in QSAR, introduced through the molecular descriptors subsets selection software (MoDeSuS) for QSAR Biodegradation [37]. Two scenarios are proposed for selecting relevant molecular descriptors, known as aggregation-based and Pareto-based FS. In the first, a binary vector of m molecular descriptors is generated, where one-bits indicate that the corresponding molecular descriptors are selected and zero-bits indicate that they are ignored; the selected subset is then evaluated using an aggregation function that combines accuracy with the selection ratio. The second scenario (Pareto-based methods) employs two algorithms (i.e., the non-dominated sorting genetic algorithm (NSGA-II) and the strength Pareto evolutionary algorithm (SPEA2)) to optimize the accuracy and the selection ratio separately. MoDeSuS achieved high performance on the QSAR Biodegradation dataset with an accuracy of 84% and a selection ratio of 37%.

In [38], a biclustering-based method is proposed to reduce the number of molecular descriptors for predicting the biodegradation of chemical compounds. The biodegradation task is evaluated using three classifiers: random committee, neural network and random forest (RF). The experimental results showed that the best classifier was RF, which achieved 88.81% accuracy with only 19 molecular descriptors (MD) on the QSAR Biodegradation dataset. Recently, artificial intelligence has made remarkable progress, opening several horizons based on ML and deep learning for QSAR modeling [39, 40]. Putra et al. [41] combined an artificial neural network and a support vector machine for QSAR modeling and utilized principal component analysis (PCA) to reduce the dimensionality of the data; the performance was assessed on the QSAR Biodegradation dataset and a classification rate of 82% was achieved.

Hierarchical stochastic graphlet embedding (HSGE) [42] is introduced using different hierarchical configurations for treating molecular graph datasets; the approach achieved 95.71% accuracy on the MAO dataset. In the same context, the work of [43] presented a fusion of a classical neural architecture (the multi-layer perceptron) with a recent deep learning architecture, called CNN-MLP, for predicting chemical activities. In the CNN-MLP method, two models, DeepBioD+ and DeepBioD, are proposed for the QSAR Biodegradation dataset based on domain-specific feature engineering and representations learned from pattern samples; they achieved 90% and 87.5% accuracy, respectively. A similar work based on a pre-trained deep learning model called ChemNet was presented in [44] for predicting chemical activities and achieved 86.7% accuracy on the QSAR Biodegradation dataset. In [45], the diffusion-convolutional neural network (DCNN) was introduced for graph-structured data using a diffusion-based deep learning representation; the experiments showed that an accuracy of 75.14% can be achieved on the MAO dataset.

3 Preliminaries

This section introduces the necessary basics of the HHO algorithm, the k-NN classifier and the genetic operators.

3.1 Harris hawks optimization

HHO [15] is a recent SI algorithm inspired by the cooperative behaviors of Harris hawks in hunting and of prey in escaping. Harris hawks demonstrate a variety of chasing styles depending on the dynamic nature of circumstances and the escaping patterns of the prey. In this intelligent strategy, several Harris hawks cooperatively attack from different directions and simultaneously converge on a detected rabbit escaping outside cover, exhibiting different hunting strategies. The candidate solutions are the Harris hawks, and the intended prey is the best candidate solution (nearly the optimum) at each step. The HHO algorithm comprises three phases: exploration, transition from exploration to exploitation, and exploitation. The hunting (exploration) phase is modeled as:

$$\begin{aligned} \begin{array}{l} x_{t + 1}^i = \left\{ {\begin{array}{*{20}{c}} {{x_{rand}} - {\tau _1}\left| {{x_{rand}} - 2{\tau _2}x_t^i} \right| } &{} {if\ {\tau _5} \ge 0.5\ } \\ {\left( {{x_{Rabbit}} - \overline{{x_t}} } \right) - {\tau _3}\left| {l{b^j} + {\tau _4}\left( {u{b^j} - l{b^j}} \right) } \right| } &{} {else\ } \end{array}} \right. \\ t \in \left[ {1 \cdots T} \right] ,i \in \left[ {1 \cdots N} \right], \\ \end{array} \end{aligned}$$
(1)

where the current location of the \(i\)th hawk and its new location at iteration \(t + 1\) are represented by \(x_t^i\) and \(x_{t + 1}^i\), whereas \( x_{rand}\) and \(x_{Rabbit}\) are a randomly selected hawk location and the best solution (the target rabbit), respectively. The lower and upper bounds of the \(j\)th dimension are defined by \(lb^j\) and \(ub^j\), while \(\tau _1\) to \(\tau _5\) are random numbers in the interval [0, 1]. The average hawk position \( \overline{{x_t}}\) is defined as:

$$\begin{aligned} \overline{x_t}=\frac{1}{N}\sum _{i=1}^{N}x_{t}(i). \end{aligned}$$
(2)

In Eq. (1), the first scenario (\( {\tau _5} \ge 0.5\)) grants the hawks a chance to hunt randomly spread over the planned space, while the second scenario covers the case where the hawks hunt alongside family members close to a target. In the transition phase from exploration to exploitation, the prey attempts to escape capture, so its escaping energy \(E_n\) decreases gradually. The energy is defined by

$$\begin{aligned} E_n=2* E_{n0}*\left(1-\frac{t}{T}\right), \end{aligned}$$
(3)

where the initial energy \(E_{n0} = 2*rand -1\) is changed randomly inside \((-1,1)\) and T is the maximum number of iterations. HHO remains explorative as long as \(|E_n|\ge 1\), with hawks exploring global regions, and switches into exploitative mode when \(|E_n|<1\). R denotes the escaping probability of the target. The exploitation phase aims to avoid falling into local optima.
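To make the exploration mechanics concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3); the function and variable names are illustrative assumptions, not taken from a reference implementation.

```python
import numpy as np

def escaping_energy(t, T, rng):
    """Escaping energy E_n of the prey, Eq. (3); decays linearly over iterations."""
    En0 = 2 * rng.random() - 1               # initial energy, random in (-1, 1)
    return 2 * En0 * (1 - t / T)

def exploration_step(X, i, x_rabbit, lb, ub, rng):
    """Update hawk i during exploration (|E_n| >= 1), following Eq. (1)."""
    tau = rng.random(5)                      # tau_1 ... tau_5, uniform in [0, 1]
    if tau[4] >= 0.5:
        # Perch relative to a randomly selected hawk
        x_rand = X[rng.integers(X.shape[0])]
        return x_rand - tau[0] * np.abs(x_rand - 2 * tau[1] * X[i])
    # Perch relative to the rabbit and the mean hawk position of Eq. (2)
    x_mean = X.mean(axis=0)
    return (x_rabbit - x_mean) - tau[2] * np.abs(lb + tau[3] * (ub - lb))
```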

The first task-surrounding soft The soft surrounding can be formulated mathematically when \(R\ge \frac{1}{2}\) and the energy level satisfies \(|E_n|\ge \frac{1}{2}\) as:

$$\begin{aligned} \begin{array}{l} x_{t + 1}^i = \Delta x_t^i - E_n\left| {J{x_{Rabbit}} - x_t^i} \right| \\ \Delta x_t^i = {x_{Rabbit}} - x_t^i,\ J = 2\left( {1 - {\tau _6}} \right), \\ \end{array} \end{aligned}$$
(4)

where \(\Delta x_t^i\) is the difference between the best agent (i.e., the rabbit) and the current position of the \(i\)th hawk, J indicates the random jump strength of the prey, and \(\tau _6\) is a random number in the interval [0, 1].

The second task-surrounding hard When the energy level satisfies \(|E_n|< \frac{1}{2}\) and \(R\ge \frac{1}{2}\), the rabbit becomes exhausted and the possibility of escaping is low (escaping becomes hard) because its energy has decreased. This behavior can be modeled by

$$\begin{aligned} x_{t + 1}^i = {x_{Rabbit}} - E_n\left| {\Delta x_t^i} \right|. \end{aligned}$$
(5)
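Under the same naming assumptions as the previous sketch, the two besiege rules of Eqs. (4) and (5) reduce to a few lines:

```python
import numpy as np

def soft_besiege(x_i, x_rabbit, En, rng):
    """Soft surrounding, Eq. (4): |E_n| >= 0.5 and R >= 0.5."""
    J = 2 * (1 - rng.random())               # random jump strength J = 2(1 - tau_6)
    delta = x_rabbit - x_i                   # distance to the best agent
    return delta - En * np.abs(J * x_rabbit - x_i)

def hard_besiege(x_i, x_rabbit, En):
    """Hard surrounding, Eq. (5): |E_n| < 0.5 and R >= 0.5."""
    return x_rabbit - En * np.abs(x_rabbit - x_i)
```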

The third task-surrounding soft beside advanced rapid dives This task applies when the energy level is greater than \(\frac{1}{2}\) \((|E_n|> \frac{1}{2})\) and \(R < \frac{1}{2}\), where the rabbit still has sufficient strength to run away. Hence, the hawks perform progressive dives in order to take the best position for catching the prey. This behavior is modeled by integrating the Lévy flight function [46].

The position of the \(i\)th hawk is then updated using:

$$\begin{aligned} \begin{array}{l} {x_{t+1}^{i} =\left\{ \begin{array}{cc} {y} &{} {if\; fit\left( y\right)< fit\left( x_{t}^{i} \right) } \\ \\ {z} &{} {if\, fit\left( z\right) < fit\left( x_{t}^{i} \right) , } \end{array}\right. } \\ {y=x_{rabbit} -E_{n} \left| Jx_{rabbit} -x_{t}^{i} \right| ,} \\ { z=y+r_{v} \times Lv\left( D\right), } \end{array} \end{aligned}$$
(6)

where

$$\begin{aligned} Lv\left( D\right)= & {} 0.01\times \frac{rand\left( 1,D\right) \times \sigma }{\left| rand\left( 1,D\right) \right| ^{\frac{1}{\beta } } } , \end{aligned}$$
(7)
$$\begin{aligned} \sigma= & {} \left( \frac{\Gamma \left( 1+\beta \right) \times \sin \left( \frac{\pi \beta }{2} \right) }{\Gamma \left( \frac{1+\beta }{2} \right) \times \beta \times 2^{\left( \frac{\beta -1}{2} \right) } } \right) ^{{}^{\frac{1}{\beta } } }, \end{aligned}$$
(8)

where D is the dimensionality of the search space, \( r_{v}\) is a vector of D components generated randomly inside (0,1), Lv represents the Lévy flight function, \(\beta \) is a constant with default value \(\beta =1.5\) and fit denotes the fitness function computed by Eq. (16).
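The Lévy step of Eqs. (7)-(8) can be sketched as below. We follow the standard Mantegna procedure, in which the two random vectors are Gaussian; treating the paper's rand(1, D) draws as Gaussian is our assumption.

```python
import numpy as np
from math import gamma, pi, sin

def levy_flight(D, beta=1.5, rng=np.random.default_rng()):
    """Lévy flight step of Eqs. (7)-(8) with the default beta = 1.5."""
    sigma = (gamma(1 + beta) * sin(pi * beta / 2)
             / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.standard_normal(D) * sigma       # numerator draws, scaled by Eq. (8)
    v = rng.standard_normal(D)               # denominator draws
    return 0.01 * u / np.abs(v) ** (1 / beta)
```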

The fourth task-surrounding hard beside advanced rapid dives In this task, it is assumed that \(R < \frac{1}{2}\) and the energy level satisfies \(|E_n|<\frac{1}{2}\); the prey has too little energy to evade, and the hawks are close to completing successive dives to catch it. This process can be described by

$$\begin{aligned} \begin{array}{l} x_{t + 1}^i = \left\{ {\begin{array}{*{20}{c}} y &{} {if\ fit\left( y \right)< fit\left( {x_t^i} \right) \ } \\ z &{} {if\ fit\left( z \right) < fit\left( {x_t^i} \right) \ } \\ \end{array}} \right. ; \\ \ \ y = {x_{Rabbit}} - E_n\left| {J{x_{Rabbit}} - {\overline{x}} } \right| ; \\ \ \ z = y + r_{v} \times Lv\left( D \right). \\ \end{array} \end{aligned}$$
(9)

For illustration, the general flowchart of the HHO algorithm is shown in Fig. 1.

Fig. 1

Flowchart of the HHO algorithm

3.2 k-Nearest neighbors algorithm (k-NN)

The k-NN classifier belongs to the family of supervised machine learning methods and identifies a new pattern based on a statistical distance metric. It is considered a lazy learning model and can be used for both prediction tasks and classification problems [47]. Its advantages are that its output is easy to interpret and that it offers low computational cost and good efficiency. The classification process depends only on computing the Euclidean distance between the current test example and the examples of the training data; the k smallest distances are then selected to determine the label of the current test vector. The steps of k-NN are given in Algorithm 1.

Algorithm 1 Steps of the k-NN classifier
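Algorithm 1 is not reproduced here; as a rough stand-in, the following sketch implements the distance-and-vote procedure described above (illustrative only, not the authors' code):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=5):
    """Classify one test example by majority vote among its k nearest neighbors."""
    # Euclidean distance from the test example to every training example
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```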

3.3 Genetic operators

The use of evolutionary operators is widespread in several algorithms. Two foundational algorithms in particular, differential evolution and genetic algorithms, have explored genetic operators in depth. Here, we give a quick overview of the genetic operators (i.e., mutation, crossover and selection).

Mutation The results of tasks three and four of HHO and the target solution (\(x_{Rabbit}\)) are used to produce the mutation operation. For each component, a number between 0 and 1 is generated randomly. If this value is greater than or equal to the mutation rate (\(\zeta \)), the component of the target agent (\(x_{Rabbit}\)) is taken; otherwise, the component of the y or z vector is kept. The mutation operator is determined using:

$$\begin{aligned}&\begin{array}{l} {y_{Mut} =\left\{ \begin{array}{cc} {x_{Rabbit} } &{} {if\ rand_{1} \ge \zeta } \\ {y} &{} {else} \end{array}\right. } {and\ } \\ {z_{Mut} =\left\{ \begin{array}{cc} {x_{Rabbit} } &{} {if\ rand_{2} \ge \zeta } \\ {z} &{} {else} \end{array}\right. } \\ {Where:\ \ \left\{ \begin{array}{c} {\zeta =\frac{t}{T} ;} \\ {y=x_{rabbit} -E_{n} \left| Jx_{rabbit} -x_{t}^{i} \right| ;} \\ {z=y+r_{v} \times Lv\left( D\right) \ \ } \end{array}\right. \ \ \ \ \ \ \ } \end{array} \end{aligned}$$
(10)
$$\begin{aligned}&\begin{array}{l} {y_{Mut} =\left\{ \begin{array}{cc} {x_{Rabbit} } &{} {if\ \rho _{1} \ge \zeta } \\ {y} &{} {else} \end{array}\right. } {and\ } \\ {z_{Mut} =\left\{ \begin{array}{cc} {x_{Rabbit} } &{} {if\ \rho _{2} \ge \zeta } \\ {z} &{} {else} \end{array}\right. } \\ {Where:\ \ \left\{ \begin{array}{c} {\zeta =\frac{t}{T} ;} \\ {y=x_{rabbit} -E_{n} \left| Jx_{rabbit} -\overline{x} \right| ;} \\ {z=y+r_{v} \times Lv\left( D\right). \ \ } \end{array}\right. \ \ \ \ \ \ \ } \end{array} \end{aligned}$$
(11)

Crossover In order to produce more diversity, the crossover operator recombines two individuals. An intermediate crossover with a random number \(\tau \) is used to generate a new offspring \(w_{Cross}\) as

$$\begin{aligned} {w_{Cross}} = {y_{Mut}} + \left( { \tau } \right) * {(z_{Mut}-{y_{Mut}}).}\ \end{aligned}$$
(12)

This type of operator allows children to inherit more information from their parents compared to other types such as linear recombination.

Selection The type of selection used here is a greedy selection inspired by differential evolution. The offspring produced by the evolutionary functions (mutation & crossover) are assessed; then, the performance of each child is compared with that of the parent to select the best one. The parent remains in the population if its performance is higher. The greedy selection is defined by the following rule:

$$\begin{aligned} x_{t + 1}^i = \left\{ {\begin{array}{*{20}{c}} {{y_{Mut}}} &{} {if\ fit\left( {{y_{Mut}}} \right)< fit\left( {x_t^i} \right) } \\ {{z_{Mut}}} &{} {if\ fit\left( {{z_{Mut}}} \right)< fit\left( {x_t^i} \right) } \\ {{w_{Cross}}} &{} {if\ fit\left( {{w_{Cross}}} \right) < fit\left( {x_t^i} \right) } \\ \end{array}} \right.. \end{aligned}$$
(13)
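A compact sketch of the three operators of Eqs. (10)-(13) follows; zeta = t/T is the mutation rate, and reading Eq. (13) as "keep the first offspring that improves on the parent" is our interpretation of the rule.

```python
import numpy as np

def mutate(vec, x_rabbit, zeta, rng):
    """Eqs. (10)-(11): take the rabbit's component where rand >= zeta."""
    mask = rng.random(vec.size) >= zeta
    return np.where(mask, x_rabbit, vec)

def crossover(y_mut, z_mut, rng):
    """Eq. (12): intermediate recombination of the two mutant vectors."""
    tau = rng.random()
    return y_mut + tau * (z_mut - y_mut)

def greedy_select(x_parent, offspring, fit):
    """Eq. (13): the parent survives unless an offspring has lower fitness."""
    f_parent = fit(x_parent)
    for child in offspring:                  # order: y_mut, z_mut, w_cross
        if fit(child) < f_parent:
            return child
    return x_parent
```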

4 The proposed HHOCM algorithm

In this section, we give an alternative method for FS that combines HHO with genetic operators. Like other meta-heuristic algorithms, HHO tends to suffer from low diversity, entrapment in local optima and unbalanced exploitation ability [48, 49]. Although HHO has the advantages of acceptable convergence speed and a simple structure, it may fail to maintain the balance between exploration and exploitation and fall into a local optimum on some complex optimization problems [50].

Thus, the main contribution of the proposed HHOCM algorithm is the integration of genetic operators (mutation, crossover and selection) to solve the exploitation problem of the HHO algorithm. To this end, HHOCM tries to ensure more diversity through two main phases: an initialization phase and an updating phase. The framework of the proposed HHOCM algorithm for FS is given in Fig. 2.

Fig. 2

A general framework for the proposed HHOCM algorithm

4.1 Initialization phase

In this step, HHOCM generates N swarm agents in the first population, where each individual represents a subset of molecular descriptors (features) to be selected for evaluation. This step has a significant effect on the convergence and on the quality of the optimal solution. The population X is generated randomly as:

$$\begin{aligned} x_i^j=lb^j+\lambda ^j \times (ub^j-lb^j), \ i=1,2,\ldots ,N; \ j=1,2,\ldots ,D. \end{aligned}$$
(14)

The lower and upper bounds \(lb^j\) and \(ub^j\) for each candidate solution i are in the range [0, 1], and \(\lambda ^j\) is a random number \(\in [0, 1]\). To select a subset of molecular descriptors, an intermediate binary conversion step is necessary before fitness evaluation. Thus, each solution \(x^i\) undergoes a binary conversion (\(x^i_{bin}\)) using:

$$\begin{aligned} x^{i}_{bin}=\left\{ {\begin{array}{*{20}{c}} 1&{} \text {if} \ x^i>0.5\\ 0&{}\text {otherwise.}\\ \end{array}} \right. \end{aligned}$$
(15)

As an example, consider a solution \(x^i\) with ten molecular descriptors, where \(x^i=[0.6, 0.2, 0.9, 0.33, 0.15, 0.8, 0.2, 0.75, 0.1, 0.9]\). Applying Eq. (15) generates a binary vector \(x^i_{bin}\), where ones imply that the corresponding molecular descriptors are selected and zeros that they are not. This means that the first, third, sixth, eighth and tenth molecular descriptors in the original dataset are relevant and must be selected, while the others are irrelevant and must be eliminated. After determining the subset of selected molecular descriptors, the fitness function is calculated for each agent \(x^i_{bin}\) to determine the quality of these features. The fitness of the \(i\)th solution is defined by

$$\begin{aligned} fit_i=\upsilon _{1}\times Er_i+\upsilon _{2}\times \frac{d_i}{D}, \end{aligned}$$
(16)

where \(\upsilon _{1}=0.99\) and \(\upsilon _{2}=1-\upsilon _{1}\). The weight \(\upsilon _{1}\) is the equalizer parameter that balances the classification error rate \((Er_{i}=1-Acc_i)\) against the number of selected molecular descriptors \((d_i)\). In Eq. (16), D is the total number of molecular descriptors (MD) in the original dataset. The k-NN is utilized as the classifier in the FS cycle. As the classification strategy, hold-out is used, which assigns 80% of the data as the training set and the rest as testing samples. \(Er_i\) denotes the error rate on the test data computed by k-NN (Algorithm 1). The lowest fitness value among all agents determines the best prey \((x_{Rabbit})\).
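As a minimal sketch, the binary conversion of Eq. (15) and the wrapper fitness of Eq. (16) might look as follows; the snippet reuses the hypothetical knn_predict helper sketched in Sect. 3.2 and a simple 80/20 hold-out split.

```python
import numpy as np

def binarize(x):
    """Eq. (15): descriptor j is selected when x_j > 0.5."""
    return x > 0.5

def fitness(x, X_data, y_data, v1=0.99, rng=np.random.default_rng()):
    """Eq. (16): weighted sum of the k-NN error rate and the selection ratio."""
    mask = binarize(x)
    if not mask.any():                       # no descriptor selected: worst fitness
        return 1.0
    n = len(y_data)
    idx = rng.permutation(n)                 # hold-out: 80% training, 20% testing
    tr, te = idx[:int(0.8 * n)], idx[int(0.8 * n):]
    errors = sum(knn_predict(X_data[tr][:, mask], y_data[tr],
                             X_data[t][mask]) != y_data[t] for t in te)
    Er = errors / len(te)                    # test error rate, Er_i = 1 - Acc_i
    return v1 * Er + (1 - v1) * mask.sum() / x.size
```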

4.2 Updating phase

The process of updating solutions consists of first applying the exploration step, which performs a global search when the energy magnitude is at least one. After that, the transition from exploration to exploitation is applied. Then, the exploitation phase is employed, which contains four tasks: surrounding soft, surrounding hard, surrounding soft beside advanced rapid dives and surrounding hard beside advanced rapid dives, the latter two boosted by genetic operators. To improve the local search capability, HHOCM integrates the mutation operator in task three (surrounding soft beside advanced rapid dives) and task four (surrounding hard beside advanced rapid dives) of HHO using Eqs. (10) and (11), respectively. For more diversity, another genetic operator called crossover is introduced; it combines the two mutant vectors \(y_{Mut}\) and \(z_{Mut}\) to produce a new child \(w_{Cross}\), as described in Eq. (12). The fitness values of all offspring (\(y_{Mut}\), \(z_{Mut}\) and \(w_{Cross}\)) are compared via the selection operator of Eq. (13) to identify the best prey \(x_{Rabbit}\). The process is repeated until the termination condition is met; the stopping criterion is the maximum number of iterations, which allows the performance of the HHOCM algorithm to be evaluated. Then, the best solution \(x_{Rabbit}\) is returned and converted to determine the set of relevant features. In this regard, experiments are carried out 30 times independently to obtain accurate and precise results.
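For orientation, one HHOCM iteration can be condensed as below. This is a simplification of the flow in Fig. 2 assembled from the helper sketches of Sect. 3, not the authors' exact control flow; boundary cases and bookkeeping are elided.

```python
import numpy as np

def hhocm_iteration(X, x_rabbit, t, T, lb, ub, fit, rng):
    """One sketched update pass over all N hawks."""
    zeta = t / T                             # mutation rate of Eq. (10)
    D = X.shape[1]
    for i in range(X.shape[0]):
        En = escaping_energy(t, T, rng)
        R = rng.random()                     # escaping probability of the prey
        if abs(En) >= 1:                     # exploration, Eq. (1)
            X[i] = exploration_step(X, i, x_rabbit, lb, ub, rng)
        elif R >= 0.5:                       # tasks 1-2: besiege without dives
            X[i] = (soft_besiege(X[i], x_rabbit, En, rng) if abs(En) >= 0.5
                    else hard_besiege(X[i], x_rabbit, En))
        else:                                # tasks 3-4: rapid dives + CM operators
            J = 2 * (1 - rng.random())
            base = X[i] if abs(En) >= 0.5 else X.mean(axis=0)
            y = x_rabbit - En * np.abs(J * x_rabbit - base)
            z = y + rng.random(D) * levy_flight(D, rng=rng)
            y_mut = mutate(y, x_rabbit, zeta, rng)
            z_mut = mutate(z, x_rabbit, zeta, rng)
            w_cross = crossover(y_mut, z_mut, rng)
            X[i] = greedy_select(X[i], [y_mut, z_mut, w_cross], fit)
        X[i] = np.clip(X[i], lb, ub)         # keep solutions inside the bounds
        if fit(X[i]) < fit(x_rabbit):        # track the best prey
            x_rabbit = X[i].copy()
    return X, x_rabbit
```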

5 Experimental results and discussion

To validate the effectiveness of the proposed HHOCM algorithm, a number of experiments are conducted on two datasets widely used in the field of cheminformatics: QSAR Biodegradation and MAO. The first experiment studies the impact of the swarm size (N) and the maximum number of iterations (T) on the accuracy and the number of selected features. The second experiment determines the best value of the control parameter \(\beta \) used in the Lévy flight function. The third experiment compares the performance of the HHOCM algorithm and seven recent SI algorithms (i.e., HHO, GWO, WOA, SSA, DA, GOA and ALO) using the optimal control parameters obtained in the previous two experiments; each algorithm is executed 30 times with the same optimal values of N, T and \(\beta \). Further experiments apply the statistical Wilcoxon test as an assessment measure to verify the significance of the accuracy achieved by HHOCM and ROBLHHOCM against the other competitor algorithms. The last experiment compares the three versions, HHOCM, OBLHHOCM and ROBLHHOCM, with other works from the literature using the same parameter configuration on both the QSAR Biodegradation and MAO datasets. The runs are performed on a PC with an Intel Core i7-5500 CPU @ 2.40 GHz, 8 GB RAM, Windows 10 and Matlab 2016a.

5.1 Parameter settings

Parameter settings of the DA, ALO, GWO, WOA, SSA, GOA and HHO algorithms, as well as of the proposed versions of HHO, are listed in Table 1.

Table 1 Parameter settings of the SI algorithms

5.2 Performance measures

The performance of the proposed HHOCM algorithm is assessed based on several criteria including the average and standard deviation of fitness, accuracy, number of selected features, sensitivity, specificity and CPU time. Table 2 shows the confusion matrix from which performance metrics such as accuracy (Acc), sensitivity (Sn) and specificity (Sp) are derived; a compact code sketch of these metrics follows the list below.

Table 2 Confusion matrix
  • Average accuracy \((AVG_{Acc})\): Acc represents the proportion of correct correspondences between the labels of the sample data and the output of the classifier and is computed using:

    $$\begin{aligned} Acc = \frac{{Tp + Tn}}{{Tp + Fn + Fp + Tn}}. \end{aligned}$$
    (17)

    The number of runs is fixed \({N_r=30}\), so the average accuracy \(AVG_{Acc}\) is calculated as:

    $$\begin{aligned} AV{G_{Acc}} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {Acc_{best}^{(k)}}. \end{aligned}$$
    (18)
  • Average sensitivity \((AVG_{Sn})\): The sensitivity (Sn) assesses the rate of correctly predicted positive samples:

    $$\begin{aligned} Sn = \frac{{Tp}}{{Tp + Fn}}. \end{aligned}$$
    (19)

    The \(AVG_{Sn}\) is calculated from the best prey \((x_{Rabbit})\) using:

    $$\begin{aligned} AV{G_{Sn}} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {Sn_{best}^{(k)}}. \end{aligned}$$
    (20)
  • Average precision \((AVG_{Pr})\): The precision (Pr) indicates the proportion of predicted positives that are truly positive:

    $$\begin{aligned} Pr = \frac{{Tp}}{{Fp + Tp}}. \end{aligned}$$
    (21)

    The \(AVG_{Pr}\) is determined via the following equation:

    $$\begin{aligned} AV{G_{Pr}} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {Pr_{best}^{(k)}}. \end{aligned}$$
    (22)
  • Average fitness value \((AVG_{fit})\): The fitness value estimates the quality of each algorithm by combining the FS selection ratio with the classifier error rate as in Eq. (16). Its average is computed by

    $$\begin{aligned} AV{G_{fit}} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {fit_{best}^{(k)}}. \end{aligned}$$
    (23)
  • Average size of selected features \((AVG_{size})\): This metric gives the average number of selected (relevant) features. It is computed as:

    $$\begin{aligned} AVG_{size} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {d_{best}^{(k)}}, \end{aligned}$$
    (24)

    where \({d_{best}^{(k)}}\) is the cardinality of the selected feature set of the best agent in the \(k\)th execution.

  • Average CPU time \((AVG_{Time})\): It is the average of computation time for each method, that is:

    $$\begin{aligned} AV{G_{Time}} = \frac{1}{{{N_r}}}\sum \limits _{k = 1}^{{N_r}} {T_{best}^{(k)}}. \end{aligned}$$
    (25)
  • Standard deviation (Std): It reflects the stability of each algorithm across different executions and is calculated for all the measures described previously.
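As referenced above, here is a compact sketch of Eqs. (17)-(22) computed from the confusion-matrix counts of Table 2; the run-averaging line at the end assumes a hypothetical list best_counts of per-run (Tp, Tn, Fp, Fn) tuples.

```python
def metrics(Tp, Tn, Fp, Fn):
    """Accuracy (Eq. 17), sensitivity (Eq. 19) and precision (Eq. 21)."""
    acc = (Tp + Tn) / (Tp + Fn + Fp + Tn)
    sn = Tp / (Tp + Fn)
    pr = Tp / (Fp + Tp)
    return acc, sn, pr

# Averages over N_r = 30 runs, Eqs. (18)/(20)/(22), e.g.:
# avg_acc = sum(metrics(*c)[0] for c in best_counts) / len(best_counts)
```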

5.3 Datasets and preprocessing details

The Monoamine Oxidase (MAO) dataset concerns an enzyme that is distributed across most tissues and catalyzes the inactivation and oxidation of monoamine neurotransmitters. The data are taken from the publicly available GREYC chemistry dataset.Footnote 1 The molecules are converted from the MOA format to simplified molecular-input line entry system (SMILES) strings via the Open Babel software [51]; the molecular descriptors are then computed using E-Dragon [4]. The dataset contains 1665 features (MD) for 68 compounds divided into two classes.


The QSAR Biodegradation dataset has 41 attributes (molecular descriptors) for classifying 1055 chemical compounds. The data are used to discriminate between two chemical classes, comprising 356 readily biodegradable samples and 699 not readily biodegradable ones. In addition, the data are useful for QSAR development in order to determine the correlation between molecular biodegradation and chemical design. The dataset is available on the UCI web page.Footnote 2

The preprocessing stage of the datasets consists of three steps:

  1. Protein information, stored in the MOA chemical format, is converted to isomeric SMILES using the Open Babel software [51]. Features represent attributes whose values define the instances.

  2. Descriptors are calculated with the E-Dragon software [4], which computes different 2D and 3D representations of the molecules for the QSAR model. The descriptors are categorized into structural and physico-chemical types (weight and volume of the molecule, rotatable bonds, interatomic distances, atom types, molecular walk counts, electronegativity, atom distribution, aromatic and thawed characteristics).

  3. The correlation between chemical design and biological activity is expressed mathematically using QSAR, and the features can identify the instances. QSAR is used to seek the main characteristics of chemical compounds, as shown in Fig. 3. In addition, several ML techniques are exploited for structure–activity correlation analysis to predict the similarity of compounds in the presence of a given disease. Complex molecular compounds contain several features such as topological factors [52].

Fig. 3

Flowchart of the QSAR model

5.4 Sensitivity analysis

This initial test is conducted to analyze the sensitivity of HHOCM to some of its parameters, namely the swarm size (N), the number of iterations (T) and the \(\beta \) parameter. The sensitivity is assessed in three stages. First, we treat the effect of the swarm size and the maximum number of iterations on the accuracy \((AVG_{Acc})\) and number of selected features \((AVG_{size})\) obtained by HHO and HHOCM. Second, we study the influence of the \(\beta \) parameter used in the Lévy flight function on the same measures on the QSAR Biodegradation and MAO datasets. Third, we analyze the influence of initialization using opposition-based learning (OBL) and random OBL (ROBL) on the same measures on the two datasets. A short description of OBL [53, 54] and ROBL [55] follows, with a code sketch of both operators after the list:

  • OBL produces opposite solutions that enhance convergence and help jump out of local optima. This operator can be modeled mathematically using

    $$\begin{aligned} {x_i^j}^*=lb^j+ub^j-x_i^j, \ i=1,2,...,N; j=1,2...,D. \end{aligned}$$
    (26)
  • ROBL is a newer operator that allows the search space to be explored with more diversity. It can be formulated by

    $$\begin{aligned} {x_i^j}^*=lb^j+ub^j-r*x_i^j. \end{aligned}$$
    (27)

    where r is a random number \(\in [0, 1]\). The initial population is created by keeping, for each current solution \((x_i)\), the fitter of \(x_i\) and its opposite \((x_i^*)\).
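As referenced above, both initialization operators can be sketched as below; the fitter of each solution and its opposite is retained. Applying r component-wise in ROBL is our assumption, since the paper defines r as a single random number.

```python
import numpy as np

def obl(X, lb, ub):
    """Opposite population, Eq. (26)."""
    return lb + ub - X

def robl(X, lb, ub, rng):
    """Random opposite population, Eq. (27), with r drawn per component."""
    return lb + ub - rng.random(X.shape) * X

def opposed_init(X, lb, ub, fit, rng, use_robl=True):
    """Keep, for each row, the better of the solution and its opposite."""
    X_opp = robl(X, lb, ub, rng) if use_robl else obl(X, lb, ub)
    keep = np.array([fit(x) <= fit(xo) for x, xo in zip(X, X_opp)])
    return np.where(keep[:, None], X, X_opp)
```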

By inspecting the results of Tables 3, 4, 5 and 6, it can be seen that the optimal values of accuracy \((AVG_{Acc})\) and number of selected features \((AVG_{size})\) are obtained when the swarm size N is 10 and the maximum number of iterations T is 100 for both datasets using the basic HHO and HHOCM. The second stage treats the impact of \(\beta \) by varying its value from 0.5 to 2 while fixing the swarm size (N) and the maximum number of iterations (T) to the best values obtained in the first stage, namely 10 and 100, respectively. From Tables 7, 8, 9 and 10, it can be observed that the best values of accuracy and selected features are reached when the value of \(\beta \) in the Lévy flight function equals 1.5 for both datasets using HHO and HHOCM. Tables 11 and 12 highlight the impact of OBL and ROBL in the initialization step for the basic HHO and the proposed HHOCM on both datasets (QSAR Biodegradation and MAO). It can be clearly seen that ROBL enhanced the performance of HHO and HHOCM on both datasets. Additionally, ROBLHHOCM provides high performance in terms of average accuracy and size of selected features compared to HHO, OBLHHO, ROBLHHO, HHOCM and OBLHHOCM. The best obtained values of these control parameters are used for the rest of the experiments. The OBL-based initialization thus offers another angle of view, leading to the two new variants of HHOCM called OBLHHOCM and ROBLHHOCM.

Table 3 Impact of iterations number and swarm size on the accuracy and number of selected features for the MAO dataset using basic HHO
Table 4 Impact of iterations number and swarm size on the accuracy and number of selected features for the MAO dataset using HHOCM
Table 5 Impact of iterations number and swarm size on the accuracy and number of selected features for QSAR Biodegradation dataset using basic HHO
Table 6 Impact of iterations number and swarm size on the accuracy and number of selected features for the QSAR Biodegradation dataset using HHOCM
Table 7 Impact of the \({\beta }\) parameter on Lévy function for the MAO dataset using basic HHO
Table 8 Impact of the \({\beta }\) parameter on Lévy function for the MAO dataset using HHOCM
Table 9 Impact of the \({\beta }\) parameter on Lévy function for the QSAR Biodegradation dataset using basic HHO
Table 10 Impact of the \({\beta }\) parameter on Lévy function for the QSAR Biodegradation dataset using basic HHOCM
Table 11 Impact of initialization strategies for the MAO dataset using basic HHO and HHOCM
Table 12 Impact of initialization strategies for the Biodegradation dataset using basic HHO and HHOCM

5.5 Comparison of HHOCM with other SI algorithms

  • In terms of the average and standard deviation of fitness: Table 13 reports the mean fitness values obtained by the proposed HHOCM, OBLHHOCM and ROBLHHOCM algorithms and by recent SI algorithms. It can be clearly deduced that ROBLHHOCM outperforms all other competitor algorithms on both datasets. This performance can be attributed to two factors: the first is the use of evolutionary CM operators in HHO, while the second is the use of the OBL operator, especially random OBL, which enhances exploration and avoids convergence to local optima. Also, the OBLHHO algorithm takes the second rank in terms of average fitness on the QSAR Biodegradation dataset, while HHOCM is ranked third, which can be explained by the more diverse solutions generated by the CM operators. For the MAO dataset, the convergence curves of the three variants of HHOCM are shown in Fig. 4. The proposed ROBLHHOCM algorithm shows more stability on both datasets because its Std values are close to zero, which reflects a good balance between exploration and exploitation.

Table 13 The average fitness values of all competing optimizers
  • In terms of the average and standard deviation of accuracy and selected features: the performance of the three variants of HHOCM (ROBLHHOCM, OBLHHOCM and HHOCM), the three variants of HHO (ROBLHHO, OBLHHO and HHO) and the other swarm competitor algorithms in terms of accuracy and number of selected features is illustrated in Tables 14 and 15. It can be seen that ROBLHHOCM finds the most informative features, providing high accuracy on both datasets. It is important to highlight that ROBLHHOCM achieves a classification accuracy of 100% while keeping only four of 1665 features on the MAO dataset, which represents high-dimensional, low-instance data. Also, the three variants of HHOCM outperform the variants of HHO in terms of average correct classification rate and average size of selected features on both datasets. On the MAO dataset, the second rank is shared between OBLHHOCM and HHOCM in terms of average accuracy, while the second-best optimizer on the QSAR Biodegradation dataset is OBLHHOCM. In this regard, the three variants of HHOCM achieve high performance in terms of average accuracy and average size of selected features.

Table 14 The average classification accuracy of all competing optimizers
Table 15 The average size of selected features of all competing optimizers
  • In terms of the average and standard deviation of sensitivity and precision: a comparison of the sensitivity and precision of the three variants of HHOCM, the three variants of HHO and six SI algorithms is given in Tables 16 and 17. The performance of ROBLHHOCM in terms of sensitivity and precision remains much better than that of all other competitor algorithms. In terms of precision, a clear advantage is observed for the three variants of HHOCM, especially on the MAO dataset.

Table 16 The average sensitivity of the optimizers
Table 17 The average precision of the optimizers
Table 18 The average CPU time of the optimizers
  • In terms of the average and standard deviation of CPU time: the CPU time consumed by the three variants of HHOCM/HHO and the other algorithms is given in Table 18. From the listed results, it can be observed that WOA is very fast, especially on the MAO dataset where the number of patterns is small, while the three variants of HHOCM require more time as the number of samples increases. This behavior can be attributed to the two added genetic operators (CM) and the use of the OBL operator. For the QSAR dataset, SSA takes the lowest time due to its simple updating operator.

  • Wilcoxon rank-sum test: assessing the significance of the results obtained by the different algorithms requires a statistical test of the efficiency of the proposed ROBLHHOCM algorithm against HHOCM and the other SI algorithms, including DA, WOA, GOA, ALO, GWO, SSA and HHO. Table 19 shows the p-values of the Wilcoxon rank-sum test based on the accuracy metric. It can be concluded that the proposed ROBLHHOCM is clearly superior to the other SI algorithms: on both datasets, ROBLHHOCM obtained p-values below 1% against all SI algorithms except HHOCM. Thus, the proposed HHOCM and ROBLHHOCM algorithms are statistically significant compared to all other optimizers tested in this study. In addition, HHOCM and ROBLHHOCM provide the same performance on QSAR, while on the MAO dataset, HHOCM is statistically significantly different from ROBLHHOCM.

Table 19 Wilcoxon rank-sum test
  • Graphical analysis: Fig. 4 illustrates the convergence curves of the ROBLHHOCM and HHOCM algorithms against all other SI algorithms (HHO, GWO, GOA, WOA, SSA, DA and ALO), implemented and assessed under the same conditions (i.e., the same number of agents \((N=10)\) and the same number of iterations \((T=100)\)). It is clear that the HHOCM and ROBLHHOCM algorithms present fast convergence on both datasets. Also, the convergence of HHOCM is faster than that of ROBLHHOCM on the high-dimensional MAO dataset. Additionally, on the QSAR Biodegradation dataset, the convergence of ROBLHHOCM is faster than that of WOA, ALO, DA, GWO, GOA, SSA and HHOCM. Moreover, the convergence of the HHOCM and ROBLHHOCM algorithms shows that the optimal fitness values coincide with the optimal accuracy values. This phenomenon can be explained by the effective trade-off between exploration and exploitation due to the integration of genetic operators and the use of the random OBL operator. Figure 5 shows box plots of the accuracy achieved on both datasets by the competitor algorithms and the proposed variants of HHOCM. From this representation, one can read the first quartile \((Q{_1})\), third quartile \((Q{_3})\), maximum and minimum values; the red line inside the box indicates the median. It is important to emphasize that each box is obtained from 30 runs of each algorithm. Looking closely at Fig. 5, it can be concluded that the HHOCM and ROBLHHOCM algorithms have higher box plots on both datasets than the other SI algorithms. The third place is taken by ALO on the QSAR Biodegradation dataset, while WOA shows the third-highest box on the MAO dataset.

Fig. 4

Convergence curves of the HHOCM and ROBLHHOCM algorithms against other SI algorithms

Fig. 5

Boxplot of the HHOCM and ROBLHHOCM algorithms against other SI algorithms

5.6 Comparison of HHOCM variants with the existing algorithms

To prove the efficiency of HHOCM, OBLHHOCM and ROBLHHOCM, some results from the literature on the same datasets are reported in Table 20. Such a comparison is difficult because researchers use different parameter configurations, especially the population size (N) and the maximum number of iterations (T) of swarm algorithms. For instance, the work of [34] used a population size of eight solutions \((N=8)\) and a maximum of 200 iterations \((T=200)\). Also, a recent work by [35] employed other parameter values (i.e., population size \((N=30)\), maximum number of iterations \((T=100\), 500 and 1000)) and different classifiers (k-NN and SVM). To address this issue, additional experiments using the same parameter configurations are conducted to allow a fair comparison between HHOCM, OBLHHOCM, ROBLHHOCM and other swarm optimizers from the literature.

Table 20 Comparison with some existing algorithms

MAO dataset Table 20 presents the results of the three versions of the HHOCM algorithm and other competitor algorithms including SSA, MFO, PSO, GOA, SCA, DCNN and HSGE. According to these results, the competitor algorithms are clearly not as good as ROBLHHOCM or HHOCM: the SSA optimizer achieved 87.35% while keeping 783.55 molecular descriptors, whereas ROBLHHOCM obtained a higher classification rate of 100% while keeping only 4.3333 MD out of 1665 features in the case of \((N=10\) and \(T=100)\). Additionally, ROBLHHOCM achieved 100% accuracy with 4.6667 molecular descriptors in the case of \((N=30\) and \(T=100)\). In [35], HHO-k-NN under the same conditions achieved 96.9% accuracy for \((N=30\) and \(T=100)\), while the DCNN classifier is ranked third with 75.14% accuracy.


QSAR Biodegradation dataset The results listed in Table 20 for the QSAR Biodegradation dataset prove that the best optimizer is ROBLHHOCM, which achieved 90.84% accuracy in the case of \((N=10\) and \(T=100)\), followed by OBLHHOCM with 90.81%. Among the deep learning approaches, DeepBioD+ performed well, achieving 90%. In the case of \((N=30\) and \(T=100)\), an accuracy of 85.9% was achieved by HHO-k-NN in [35]. For approaches using classical ML methods such as ANN-SVM, only 82% of compound activities were recognized. It is important to highlight that the lowest number of molecular descriptors for the QSAR Biodegradation dataset is obtained by ROBLHHOCM, equal to 13.6667 in the case of \((N=10\) and \(T=100)\); HHOCM, MoDeSuS and OBLHHOCM selected around 15 MD. Thus, the proposed ROBLHHOCM shows powerful efficiency on both datasets, achieving high accuracy with a low number of molecular descriptors compared to the competitor algorithms reported in Table 20. This behavior can be attributed to the incorporation of genetic operators into HHO and to the random OBL operator, which clearly enhance the diversity of the population and the exploitation step. However, the proposed HHOCM, OBLHHOCM and ROBLHHOCM algorithms suffer from certain drawbacks: the computation time, and the fact that the subset of selected molecular descriptors changes from one execution to another, which may confuse users.

6 Conclusion

In the field of cheminformatics research, QSAR is an important model that predicts the biological activities and physicochemical properties of chemical compounds. QSAR presents a real challenge because the representation of chemical compounds requires many features (i.e., a high-dimensionality problem arises). FS based on SI algorithms has become an efficient solution for keeping the prominent features and removing irrelevant data. To tackle these challenges, this paper has proposed three hybrid wrapper FS algorithms, called HHOCM, OBLHHOCM and ROBLHHOCM, which combine HHO with genetic operators assisted by OBL strategies for selecting proper chemical descriptors. The introduced wrapper FS is based on the HHOCM variants and integrates the k-NN classifier, which provides accurate and fast classification. To evaluate the proposed variants of the HHOCM algorithm, two common chemical-information datasets, the MAO dataset and the QSAR Biodegradation dataset, are considered in the performance evaluation process. The quantitative results revealed that the proposed HHOCM, OBLHHOCM and ROBLHHOCM algorithms achieve significant performance compared to seven well-established SI algorithms, including the basic HHO, GWO, ALO, DA, WOA, GOA and SSA, on both datasets. Moreover, the proposed ROBLHHOCM algorithm outperformed the competitor algorithms in terms of average and standard deviation of fitness, accuracy, number of selected features, sensitivity and precision.

As future work, the three variants of HHOCM can be applied to multi-objective global optimization or to FS for high-dimensional, small-instance data in order to simultaneously increase the classification rate and decrease the attribute selection ratio. Another direction is to implement HHOCM in parallel in order to reduce the computation time.