1 Introduction

The protein folding problem is one of the fundamental problems in computational biology. The simplified HP cubic lattice model plays an important role in protein structure prediction (PSP) and in the protein folding optimization (PFO) problem. The PSP problem asks how to predict the native structure of a protein from its amino acid sequence, while the PFO problem is the computational problem of simulating the protein folding process within protein structure prediction. The structure of a protein is the result of the so-called protein folding process, in which the final structure is obtained from the primary, unfolded chain of amino acids [1]. The primary structure consists of a sequence of amino acids interconnected by peptide bonds. The secondary structure refers to the coiling or folding of the primary sequence; two types of secondary structure are observed in proteins, the alpha (\(\alpha \)) helix and the beta (\(\beta \)) pleated sheet. The tertiary structure of a polypeptide or protein is the three-dimensional arrangement of the amino acids, organized as secondary structure, within a single polypeptide chain, whereas the quaternary structure is the arrangement of amino acids across multiple polypeptide chains [2]. In general, the length of the sequence varies from one protein to another. In particular, the functionality of a protein can be determined from its tertiary structure; this functionality affects physical fitness, illness, the aging process of the human body, and so on. Moreover, protein structure changes over time, which can produce infeasible (misfolded) structures in the human body, and many diseases can arise from this kind of wrong folding [3]. By determining the correct structure, it is possible to understand how a protein performs all these tasks, to identify diseases and to produce remedies for them. New drugs can be developed experimentally with the help of the information associated with the tertiary structure of the protein. Applications of protein folding optimization also include the improvement of protein functionality and the particular modeling of cells [4,5,6].

The PFO problem can be illustrated with a simple model such as the hydrophobic-polar (HP) model proposed by Dill [7]. Under this model, the PFO problem has been proven to be NP-complete [8]. In this model, amino acids are classified into two types, hydrophobic (H) and hydrophilic or polar (P). The main concern of the model is to maximize the number of hydrophobic–hydrophobic (H–H) interactions at unit distance [9, 10], because the maximum number of H–H interactions corresponds to the most stable protein structure. If S denotes the most stable structure of a protein, then the objective function \((F_{\mathrm{s}})\) according to this model can be defined for that protein as follows:

$$\begin{aligned} F_{\mathrm{s}}={\mathrm{Max}}(H{-}H) \end{aligned}$$
(1)

According to the laws of thermodynamics, the most stable structure is considered to hold the lowest energy value. If \(E_{\mathrm{s}}\) denotes the energy value of the structure, then the energy function can be written as

$$\begin{aligned} E_{\mathrm{s}}=(-1)*F_{\mathrm{s}} \end{aligned}$$
(2)

There are several variants of the hydrophobic-polar model: it can be realized on any of several lattices, such as the square, face-centered-cubic (FCC), general crystallographic, triangular and cubic lattices. In our proposed method, the cubic lattice model has been used to find the optimal solution of the PFO problem. A protein sequence on the cubic lattice is encoded as a binary string of length l, represented as \(s=\{s_1, s_2,\ldots , s_l\}\) where \(s_i\in \{H,P\}\) and \(i=1, 2,\ldots , l\) [9]. There are six directions in a cubic lattice: left (L), right (R), up (U), down (D), forward (F) and backward (B) [5]. Each amino acid of a particular structure is represented by its (x, y, z) coordinates in the cubic lattice as well as by a set of directions. A conformation is valid only when no lattice point is occupied by more than one amino acid [11]; if any overlapping occurs, the conformation is said to be invalid. If \(a_m\) and \(a_n\) are two adjacent amino acids and V is the set of all valid structures, then the energy function \((F_{\mathrm{v}})\) given in [5, 12] is as follows (Fig. 1):

$$\begin{aligned} F_{\mathrm{v}}=\sum \limits _{m=1}^l\sum \limits _{n=m+2}^l g(a_m,a_n)*h_{m,n} \end{aligned}$$
(3)

where

$$\begin{aligned} g(a_m,a_n)&= {\left\{ \begin{array}{ll} 1, &\quad {\text {if }}a_m, a_n{\text { are adjacent and not connected amino acids,}}\\ 0, &\quad {\text {otherwise,}} \end{array}\right. }\\ h_{m,n}&= {\left\{ \begin{array}{ll} -1, &\quad {\text {if }}a_m{\text { and }}a_n{\text { are hydrophobic amino acids,}}\\ 0,&\quad {\text {otherwise,}} \end{array}\right. } \end{aligned}$$

so that two hydrophobic amino acids that are adjacent to each other on lattice points are scored as \(-1\). The summation of all H–H contacts is the energy score of the conformation [5].
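To make Eq. (3) concrete, the following sketch shows one possible implementation of the energy calculation. The paper's implementation is in Java; this Python fragment and its helper names are ours and serve purely as an illustration of the contact-counting rule described above.

```python
# Illustrative sketch of Eq. (3); helper names are ours, not taken from the paper.
DIRS = {'L': (-1, 0, 0), 'R': (1, 0, 0), 'U': (0, 0, 1),
        'D': (0, 0, -1), 'F': (0, 1, 0), 'B': (0, -1, 0)}

def to_coords(directions):
    """Place the first monomer at the origin and follow the direction array."""
    coords = [(0, 0, 0)]
    for d in directions:
        x, y, z = coords[-1]
        dx, dy, dz = DIRS[d]
        coords.append((x + dx, y + dy, z + dz))
    return coords

def is_valid(coords):
    """A conformation is valid when no lattice point is occupied twice."""
    return len(set(coords)) == len(coords)

def energy(sequence, coords):
    """F_v: every pair of non-consecutive H monomers at unit distance scores -1."""
    e = 0
    for m in range(len(sequence)):
        for n in range(m + 2, len(sequence)):
            if sequence[m] == 'H' and sequence[n] == 'H':
                if sum(abs(a - b) for a, b in zip(coords[m], coords[n])) == 1:
                    e -= 1
    return e

# The 11-mer of Fig. 1; the directions below form a simple self-avoiding zigzag
# walk chosen only for illustration, not the conformation shown in the figure.
seq = "HHPHHPHPHPH"
dirs = "RFLFRFLFRF"
c = to_coords(dirs)
print(is_valid(c), energy(seq, c))
```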

Fig. 1 Energy value calculation. Here, the P monomers are shown in blue and the H monomers in green. The protein energy value \(F_{\mathrm{v}} = -4\) for sequence S = {H, H, P, H, H, P, H, P, H, P, H} in the cubic lattice (colour figure online)

In this paper, a chemical reaction optimization algorithm \(({\mathrm{CRO}}_{{\mathrm{PFO}}})\) is applied to the HP cubic lattice model to optimize tertiary protein folding. In real proteins, hydrophobic amino acids tend to lie in the core, while hydrophilic amino acids tend to move to the outer portion of the protein structure; the most stable structure therefore contains the maximum number of hydrophobic amino acids in the core. We have redesigned the basic operators of CRO for solving the PFO problem. In addition, we propose two extra mechanisms, hydrophobic and polar (H&P) compliance and evolution. The H&P compliance mechanism determines a center for each individual structure and the distance of each H and P monomer from that center. After determining these distances, it changes the direction of each monomer so that the H monomers are pulled toward positions near the core and the P monomers are folded to positions far from the core. In essence, this mechanism tries to imitate the behavior of real protein folding. The evolution mechanism tries to move each monomer of a protein in every possible direction, and a move is kept only when the new direction yields a better energy value than the existing one. These two advanced mechanisms improve the performance of the \({\mathrm{CRO}}_{{\mathrm{PFO}}}\) algorithm dramatically. Together with these mechanisms, the four operators of the CRO algorithm also improve the energy values of the solutions. While applying these operators and mechanisms, many invalid structures can be produced; an invalid structure is one that places two or more amino acids at the same position of the cubic lattice. If an invalid structure is created during the execution of the operators or mechanisms, a repair mechanism converts it into a valid one.

In recent years, the CRO algorithm has become well known among metaheuristic algorithms. It has been successfully applied to many familiar problems such as the population transition problem in peer-to-peer live streaming [13], the shortest common supersequence problem [14], standard continuous benchmark functions [15], the cognitive radio spectrum allocation problem [16], the network coding optimization problem [17], the probabilistic select grid scheduling problem [18], the stock portfolio selection problem [19] and artificial neural network training [20]. This track record inspired us to use the CRO algorithm to solve the PFO problem. The contributions of this paper are as follows: (1) the proposed algorithm for the PFO problem, (2) the evolution mechanism, which improves the energy value of each solution, and (3) the H&P compliance mechanism, which places the H monomers in the core positions and the P monomers in the outer portion and thereby gradually produces a stable structure. The algorithm has been examined on several sets of sequences and compared with the genetic algorithm with advanced mechanisms [5], which is the state of the art.

2 Related work

Since PFO is an NP-hard optimization problem, a small mistake in the prediction process can mislead the whole process. Algorithms that have already been applied to the PFO problem are reviewed below.

2.1 Particle swarm optimization

A discrete particle swarm optimization \(({\mathrm{DPSO}}_{{\mathrm{HP}}})\) algorithm was introduced in [21] to solve the protein folding optimization problem on both 2D and 3D lattices. The discrete PSO method was built on the possibility theory of set-based PSO (S-PSO). First, the authors represented the protein sequence in the HP model to avoid overlapping on the lattice. They introduced special velocity- and position-updating processes that form the overall framework of \({\mathrm{DPSO}}_{{\mathrm{HP}}}\). At the beginning of the construction phase, each particle chooses the two middle amino acids and places them in the two central cells of the lattice. After that, the particle randomly chooses amino acids to fold the left part or right part of the protein sequence. During path construction, new amino acids could not be placed on lattice points already occupied by old ones; as a result, the protein sometimes could not be folded any further during the construction phase, a state the authors called motionlessness. A path retrieval mechanism was used to turn such an infeasible solution into a feasible one: the whole sequence was divided into two parts at the two middle amino acids, and the method checked which side of the sequence was motionless. The results of this algorithm are good for small protein sequences, but for longer sequences the search is not efficient [22]. Another PSO algorithm was proposed for protein structure prediction by Mansour et al. [22]. The algorithm predicted the 3D protein structure with low energy. A velocity-updating function of the particle was introduced as the main operator of this algorithm; its task is to explore new areas of the search space to find optimal solutions.

2.2 Genetic algorithm

Khimasia et al. [23] proposed a genetic algorithm for the protein structure prediction problem, introducing a method based on a simple lattice for predicting the tertiary structure of a protein. The study drew two important conclusions: first, multi-point crossovers are required, and second, local perturbation is essential for any GA to be fully effective. König et al. [24] used systematic crossover as a search strategy by coupling the best individuals; they checked every possible crossover point of the best individuals and took two individuals into the next generation. The difference between two individuals was used as information for selecting the best individual. In 2004, Unger and Ron proposed another genetic algorithm for the protein structure prediction problem [25]. In that work, conformations were changed through mutation, in the form of conventional Monte Carlo steps, and through crossovers in which parts of the polypeptide chain were exchanged between conformations. Both genetic operators were repeated until a feasible structure was created. An upgraded genetic algorithm with a mutation mechanism was proposed by Lin and Su [12]; their algorithm was designed in accordance with the particle swarm optimization algorithm. Custódio et al. [26] developed a phenotype-based crowding structure in order to maintain useful diversity within the populations.

2.2.1 Genetic algorithm with advanced mechanism

In 2016, a stochastic, population-based GA was proposed by Bošković and Brest [5]. They divided the overall process into crowding, clustering, repair, local search and opposition-based mechanisms. The population P was generated by creating a group of individuals \(x_i\), \(i = 1,2,3,\ldots , popSize\) (where popSize is the population size), and each individual was encoded as a fixed-size array of absolute directions (its size corresponding to the length of the amino acid sequence): left (L), right (R), up (U), down (D), forward (F) and backward (B). The initial population P was selected uniformly at random and improved with local searches at the beginning of the evolutionary process. Subpopulations were generated from P by comparison with the nearest structures, where the nearest individual was determined by the Hamming distance between two individuals. For every generated subpopulation, the algorithm randomly applied one-point, two-point or three-point (multi-point) crossover, and the segment mutation randomly altered one, two or three consecutive directions. The mutation operation improved the quality of the protein structure and ensured the diversity of the population. Local search was used to improve convergence speed by trying to improve the quality of an individual structure, applying local movements over one or two successive monomers along the entire conformation where one of the consecutive monomers was hydrophobic (H). The repair mechanism constructed a feasible solution from an infeasible one using a backtracking algorithm. Within the population P, improved structures were then compared with the nearest structures, again determined by the Hamming distance. An opposition-based mechanism was used to generate good conformations from both ends (start and end) of the sequence and was terminated when the number of repair-method invocations grew beyond a predefined level. The algorithm obtained the best results in most cases and reduced the time needed to reach the best native energy.

2.3 Ant colony optimization

ACO follows the foraging behavior of real ants to solve optimization problems. An improved version of the ACO algorithm was proposed by Shmygelska et al. [27]; long-range moves and selective local search improved its performance, and the selective local search reduced the time complexity of the local search phase. Thilagavathi and Amudha proposed another ant colony optimization (ACO) algorithm [28]. In this work, the pheromone level of a trail was set to a constant value and updated at the end of the construction phase. The pheromone value was updated in two stages, a local update and a global update. In the global update, the pheromone level was reduced, which lowered the probability of reselecting the same paths and thereby diversified the search; the global update rule was also applied in order to deposit a large amount of pheromone on the path of the optimal solution. During the construction phase, the next move was selected based on the pheromone matrix values and heuristic information. The ant colony was initialized with five ants, and each ant performed the folding process and produced a different folding structure according to the energy functions. An initial population of only five ants can prove to be an obstacle to finding the best solution: the performance of this algorithm was appreciable only for smaller protein sequences, whereas for real or long protein sequences its performance was not satisfactory.

3 Chemical reaction optimization

Human beings are closely associated with nature, which is a very complex system that operates in its own way without any obstacle. The principles by which nature operates can serve as worthy principles for solving real-world problems; in computer science, these principles are called algorithms or, more precisely, nature-inspired algorithms [29]. CRO is one of the latest optimization algorithms, designed by Lam and Li [30]. Basically, it is a population-based metaheuristic algorithm that imitates the behavior of chemical reactions; it is a technique loosely coupled to chemical reactions [29]. The functioning of CRO rests on the two laws of thermodynamics. The first law states that energy can be neither created nor destroyed; it can only be transformed from one kind to another and transferred from one entity to another. In a chemical reaction, there are chemical substances and their surroundings. Every chemical substance has potential energy \((P_{\mathrm{E}})\) and kinetic energy \((K_{\mathrm{E}})\), and the energy of the surroundings is represented by the buffer energy. From the first law of thermodynamics, we can write

$$\begin{aligned} \sum \limits _{i=1}^{PopSize(t)}(PE_{i}(t)+KE_{i}(t))+buffer(t)=Z \end{aligned}$$
(4)

where \(PE_i(t)\) and \(KE_i(t)\) denote the potential and kinetic energy of the molecule i at time t, respectively, buffer(t) is the energy of the surrounding as well as the energy of the central buffer at time t and Z is a constant [29].

A chemical reaction can be either endothermic or exothermic, and the initial buffer size \((S_{{\mathrm{BO}}})\) characterizes these two types: if \(S_{{\mathrm{BO}}}>0\), the reaction is endothermic, and if \(S_{{\mathrm{BO}}}=0\), it is exothermic. The second law says that the entropy of a system always tends to increase over time. The energy stored in a molecule is referred to as potential energy; when it is converted into kinetic energy, the system becomes more disordered and entropy increases. When a chemical system becomes unstable with excessive energy, it undergoes chemical reactions to reach a stable state. When a chemical substance reaches a stable state, its potential energy is at a minimum. The algorithm is thus a step-wise search for minimum energy, that is, for the optimal solution [29].

4 Design of CRO for PFO problem

Chemical reaction optimization (CRO) is a population-based metaheuristic in which the number of molecules may differ from one iteration to the next. When an ineffective collision occurs, the number of molecules remains the same; when an effective collision occurs, the number of molecules may increase or decrease. In CRO, a molecule (M) represents one specific solution of the optimization problem. Essentially, CRO consists of three stages: initialization, iteration and the final stage. The initialization stage begins with the initialization of the different parameters of the CRO algorithm. The iteration stage consists of operations between molecules; the algorithm keeps iterating until a stopping criterion is met and then proceeds to the final stage. The final stage determines a globally optimal solution from the local optimal solutions using the objective function value and terminates the algorithm [31].

4.1 Population initialization and algorithmic description

The initial population is created by generating random directions. For the PFO problem, the population can be described as a set of direction arrays. If the length of an amino acid sequence is l, then the length of the direction array is \(l-1\), because the first position is fixed at the center of the solution structure. In the cubic lattice there are six directions, \(\{{\text {L, R, U, D, F, B}}\}\), where L: left, R: right, U: up, D: down, F: forward, B: backward. Each element of an array-based solution is chosen by generating a random number between 1 and 6 and taking the corresponding direction of the cubic lattice. The generation of a solution structure for the sequence \(\{{\text {H, H, P, H}}\}\) is shown in Fig. 2. In step 1, the first position is fixed and no direction has been assigned to it. In step 2, the upward direction has been selected randomly and assigned to the resulting array, and the same process continues up to step 4. After all direction values have been assigned, a complete structure is formed. Each newly generated solution is then checked against the existing ones: if the same solution already exists in the set, it is discarded; otherwise, it is accepted and added to the set as a new solution.
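The following sketch shows one way this initialization could look. The paper's implementation is in Java; the Python fragment and its function names are ours and only illustrate the duplicate-rejecting random initialization described above.

```python
import random

DIRECTIONS = ['L', 'R', 'U', 'D', 'F', 'B']   # encoded as 1-6 in the paper

def random_solution(seq_len):
    """A direction array of length l-1; the first monomer is fixed at the center."""
    return [random.choice(DIRECTIONS) for _ in range(seq_len - 1)]

def init_population(seq_len, pop_size):
    """Generate random direction arrays, discarding exact duplicates."""
    population, seen = [], set()
    while len(population) < pop_size:
        sol = tuple(random_solution(seq_len))
        if sol not in seen:               # identical solutions are rejected
            seen.add(sol)
            population.append(list(sol))
    return population

# e.g. the 4-mer {H, H, P, H} of Fig. 2 with a population of 10 direction arrays
pop = init_population(seq_len=4, pop_size=10)
```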

[Algorithm 1 pseudo-code]

Fig. 2 Population initialization

Table 1 Description of parameters of CRO algorithm

After the initialization process, a conditional loop executes until the stopping criterion is satisfied. The algorithm meets the stopping criterion when the molecule (solution) has completely lost its kinetic energy (initialKE); the kinetic energy is lost according to KELossRate, which is initialized at the very beginning. A threshold value (b) is generated uniformly at random between 0 and 1 to determine whether the reaction will be uni-molecular or inter-molecular. Every solution of the population is iterated over and modified by one of the four operators of the CRO algorithm, and the value of b decides which operator is used. If \(b > {\text {molCol}}\), a uni-molecular collision is selected; otherwise, an inter-molecular collision is selected. In the case of a uni-molecular collision, if the criterion of decomposition is met, decomposition is executed; otherwise, an on-wall ineffective collision is performed. In the case of an inter-molecular collision, if the criterion of synthesis is met, synthesis is performed; otherwise, an inter-molecular ineffective collision (IMIC) modifies the selected solutions. When a new minimum-energy solution is found, it is placed at the last position of the population array using a variable, newP.

The strategy of CRO depends entirely on the behavior of a chemical reaction. The first stage of CRO generates the initial population along with the parameters popSize, KELossRate, molCol, initialKE and the two thresholds \((\alpha {\text { and }} \beta )\); the parameters and their definitions are given in Table 1. In the second stage, one of the four elementary reactions occurs in every iteration. The threshold \(\alpha \) selects which uni-molecular reaction is triggered: if \(numHit-minHit \ge \alpha ,\) the decomposition reaction is triggered; otherwise, the on-wall ineffective collision is triggered. Similarly, the threshold \(\beta \) selects which inter-molecular reaction is triggered: if \(initialKE \le \beta ,\) the synthesis reaction is triggered; otherwise, the inter-molecular ineffective collision is triggered. At the end of each iteration, the termination criterion of the algorithm is checked. The algorithm terminates when the molecule (solution) has completely lost its kinetic energy, that is, when the initialKE of the molecule has fallen to 0; in each iteration, the molecule loses kinetic energy according to KELossRate. The final stage executes when the termination criterion of the iteration stage is satisfied; in it, the algorithm terminates and the best solution is reported. The pseudo-code of this procedure is given in Algorithm 1.
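The sketch below condenses the operator selection and kinetic-energy decay just described into Python (the original implementation is in Java). The field names num_hit and min_hit, the small numeric threshold used to treat the kinetic energy as "completely lost", and the concrete parameter values are our assumptions, not the paper's.

```python
import random

def select_operator(molecule, ke, b, mol_col, alpha, beta):
    """Decide which of the four CRO operators fires, following the rules above.
    A molecule is assumed to carry num_hit and min_hit counters (our naming)."""
    if b > mol_col:                                   # uni-molecular collision
        if molecule['num_hit'] - molecule['min_hit'] >= alpha:
            return 'decomposition'
        return 'on_wall_ineffective_collision'
    if ke <= beta:                                    # inter-molecular collision
        return 'synthesis'
    return 'inter_molecular_ineffective_collision'

# Minimal driver showing the KE decay and the stopping rule.
ke, ke_loss_rate = 100.0, 0.5
mol = {'num_hit': 0, 'min_hit': 0}
iteration = 0
while ke > 1e-6:                       # "completely lost" taken as numerically exhausted
    b = random.random()
    op = select_operator(mol, ke, b, mol_col=0.5, alpha=4, beta=10)
    mol['num_hit'] += 1                # every collision counts as a hit
    ke *= (1.0 - ke_loss_rate)         # kinetic energy lost per iteration (KELossRate)
    iteration += 1
print('stopped after', iteration, 'iterations; last operator:', op)
```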

4.2 Reaction operators

In the proposed algorithm, we have redesigned the four basic operators of CRO for the PFO problem. Along with these operators, two extra mechanisms have been designed. All of them are described below.

4.2.1 On-wall ineffective collision

This is a uni-molecular reaction that is invoked frequently in the initial iteration phase to update the solutions. The operator selects a solution and takes the length of its direction array. It then makes a copy of the solution and selects a random pointer within that length which specifies how many positions will be modified. For each selected position, another random number within the range 1–6 is generated, and the position is altered to the direction corresponding to that random value. When all the chosen positions have been updated, the new (updated) solution is returned as output. Figure 3 depicts the process: a solution is selected, positions 2 and 5 are chosen randomly, and the direction values at those positions are changed to their opposite directions, U (up) to D (down) and B (backward) to F (forward). The modified solution is the output of this process. Algorithm 2 gives the pseudo-code of the process.
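A possible reading of this operator in Python is given below (illustrative only). How the modified positions are chosen, and the fact that each one receives a freshly drawn direction, are our assumptions based on the description above; Fig. 3 shows the special case where the new direction happens to be the opposite one.

```python
import random

DIRECTIONS = ['L', 'R', 'U', 'D', 'F', 'B']

def on_wall_ineffective_collision(solution):
    """Copy the solution, pick how many positions to modify, then overwrite
    each picked position with a randomly drawn direction (1-6)."""
    new_sol = list(solution)
    how_many = random.randint(1, len(new_sol))               # random pointer within the length
    for pos in random.sample(range(len(new_sol)), how_many): # distinct positions (our choice)
        new_sol[pos] = random.choice(DIRECTIONS)
    return new_sol

print(on_wall_ineffective_collision(list("UUBRLF")))
```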

[Algorithm 2 pseudo-code]

Fig. 3 On-wall ineffective collision

4.2.2 Decomposition

The decomposition process breaks a selected solution into two new solutions. Since two new solutions are produced from one existing solution, in the PFO problem the new solutions may contain overlapping; each time such a solution is generated, it is repaired immediately. In this process, a solution is randomly selected and two duplicates of it are produced. A position is then randomly selected that divides each solution into two parts: in the first new solution, the first half is changed to the opposite directions, and in the second new solution, the second half is changed to the opposite directions. In Fig. 4, a pointer has been set randomly after position 3; the segment before the pointer is the first portion and the remaining part the second portion. According to this rule, the first portion of the first solution and the second portion of the second solution are selected, and the directions in the selected portions are changed to their opposites (U to D, L to R, F to B and vice versa). The remaining portions of the two new solutions stay the same as in their parent. The process generates two new solutions as output. Algorithm 3 shows the pseudo-code of the process.
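An illustrative Python sketch of this operator follows. The names and the exact handling of the cut point are ours; in the actual algorithm each child would afterwards be passed to the repair mechanism.

```python
import random

OPPOSITE = {'L': 'R', 'R': 'L', 'U': 'D', 'D': 'U', 'F': 'B', 'B': 'F'}

def decomposition(solution):
    """Split at a random point; child 1 flips the first part to the opposite
    directions, child 2 flips the second part."""
    cut = random.randint(1, len(solution) - 1)
    child1 = [OPPOSITE[d] for d in solution[:cut]] + list(solution[cut:])
    child2 = list(solution[:cut]) + [OPPOSITE[d] for d in solution[cut:]]
    return child1, child2

c1, c2 = decomposition(list("UULFBR"))
print(c1, c2)
```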

[Algorithm 3 pseudo-code]

Fig. 4 Decomposition

4.2.3 Synthesis

In this process, a new solution is produced from two existing solutions. First, two solutions are selected randomly and divided into parts; the number of parts is chosen randomly. The energy value is calculated for each part of each solution, and the corresponding parts of the two solutions are compared. The portion with the higher energy value is selected as a part of the new solution, so the new solution often has a higher energy value than the selected solutions. Figure 5 shows the synthesis process: a pointer is set randomly after position 4, which breaks each solution into two parts, and the energy value of each part is calculated. Comparing the parts, the first part of the second solution and the second part of the first solution are selected because of their higher energy values. The combined solution is then returned as output. Algorithm 4 depicts the pseudo-code of the synthesis operation.
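The fragment below sketches this operator for the single-cut case shown in Fig. 5 (Python, illustrative only). The segment-scoring rule, which counts only H–H contacts falling entirely inside a segment, and all helper names are our assumptions.

```python
import random

DIRS = {'L': (-1, 0, 0), 'R': (1, 0, 0), 'U': (0, 0, 1),
        'D': (0, 0, -1), 'F': (0, 1, 0), 'B': (0, -1, 0)}

def to_coords(directions):
    coords = [(0, 0, 0)]
    for d in directions:
        coords.append(tuple(c + s for c, s in zip(coords[-1], DIRS[d])))
    return coords

def part_score(sequence, directions, lo, hi):
    """Number of H-H contacts whose monomers both lie inside [lo, hi]."""
    coords = to_coords(directions)
    score = 0
    for m in range(lo, hi + 1):
        for n in range(m + 2, hi + 1):
            if sequence[m] == 'H' and sequence[n] == 'H':
                if sum(abs(a - b) for a, b in zip(coords[m], coords[n])) == 1:
                    score += 1
    return score

def synthesis(sequence, sol_a, sol_b):
    """Cut both parents at one random point and keep, for each side, the
    parent segment with the higher partial score (F_s is maximized)."""
    cut = random.randint(1, len(sol_a) - 1)
    last = len(sequence) - 1
    head = (sol_a[:cut]
            if part_score(sequence, sol_a, 0, cut) >= part_score(sequence, sol_b, 0, cut)
            else sol_b[:cut])
    tail = (sol_a[cut:]
            if part_score(sequence, sol_a, cut, last) >= part_score(sequence, sol_b, cut, last)
            else sol_b[cut:])
    return head + tail

child = synthesis("HHPHHP", list("RUFLD"), list("UUFFR"))
print(child)
```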

[Algorithm 4 pseudo-code]

Fig. 5 Synthesis

4.2.4 Inter-molecular ineffective collision

The nature of this process is to select two solutions that collide with each other. The collision produces two new solutions by changing some positions, so the structures of the two new solutions may differ greatly from their parents; how different they are depends on how many positions are changed. The process selects an increment value randomly, and each of the two solutions is processed in turn: starting from the first position, positions are selected according to the increment value, and the selected positions are changed to their opposite directions.

Fig. 6 Inter-molecular ineffective collision

A pictorial view of this process is shown in Fig. 6. First, 2 is chosen randomly as the increment value. Starting from the first position, every position given by the increment value is selected; these positions are colored deep blue and deep green for the first and second solutions of the figure, respectively. The selected positions are then changed to their opposite directions, such as U to D, L to R, F to B and vice versa. Algorithm 5 shows the pseudo-code of this process.
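A short Python sketch of this operator is given below (illustrative; the starting index and the way the increment is drawn are our assumptions based on the description and Fig. 6).

```python
import random

OPPOSITE = {'L': 'R', 'R': 'L', 'U': 'D', 'D': 'U', 'F': 'B', 'B': 'F'}

def inter_molecular_ineffective_collision(sol_a, sol_b):
    """Flip every step-th direction (starting from the first position) to its
    opposite in both colliding solutions."""
    step = random.randint(1, len(sol_a) - 1)          # randomly chosen increment value
    new_a, new_b = list(sol_a), list(sol_b)
    for sol in (new_a, new_b):
        for pos in range(0, len(sol), step):
            sol[pos] = OPPOSITE[sol[pos]]
    return new_a, new_b

a, b = inter_molecular_ineffective_collision(list("UUBRLF"), list("FFLDRU"))
print(a, b)
```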

[Algorithm 5 pseudo-code]

4.3 H&P compliance

The hydrophobic and polar (H&P) compliance mechanism determines the center of each individual structure and the distance of each H and P monomer from that center. It checks whether the hydrophobic (H) amino acids are in the core positions and the polar (P) amino acids are far from the core. If the H amino acids are not in the core positions, the mechanism restructures the conformation so that the hydrophobic amino acids lie in the core; similarly, it reshapes the conformation so that the polar amino acids lie in the part of the structure remote from the core. If \((P_{\mathrm{c}},Q_{\mathrm{c}},R_{\mathrm{c}})\) denotes the core position and \((P_{\mathrm{d}},Q_{\mathrm{d}},R_{\mathrm{d}})\) the position of the \(d{\mathrm{th}}\) amino acid, then the H&P compliance is calculated as follows:

$$\begin{aligned} HP_{\mathrm{s}}=\frac{\sum \nolimits _{d=1}^{length}(P_{\mathrm{c}}-P_{\mathrm{d}})^2 + (Q_{\mathrm{c}}-Q_{\mathrm{d}})^2 + (R_{\mathrm{c}}-R_{\mathrm{d}})^2}{N_{\mathrm{s}}} \end{aligned}$$
(5)

where \(N_{\mathrm{s}}=\) number of amino acids in a structure and

$$\begin{aligned} P_{\mathrm{c}}=(P_{\mathrm{max}}-P_{\mathrm{min}})/2\\ Q_{\mathrm{c}}=(Q_{\mathrm{max}}-Q_{\mathrm{min}})/2\\ R_{\mathrm{c}}=(R_{\mathrm{max}}-R_{\mathrm{min}})/2 \end{aligned}$$
Fig. 7 H&P compliance

\(HP_{\mathrm{s}}\) quantifies the closeness of the amino acids to the core position. After the distances have been determined, this mechanism changes the direction of each monomer so that the H monomers move to small distances from the core position and the P monomers are folded to large distances from the center of the specific structure. Essentially, this mechanism tries to mimic the behavior of real protein folding. The H&P compliance process is depicted in Fig. 7.

Here, the last monomer is a hydrophobic amino acid and it is located on the outer side. According to H&P compliance, H monomers should lie in the inner part and P monomers in the outer part. If we change the direction of the \(13{\mathrm{th}}\) monomer from right to left, all the H monomers end up in the inner part and the P monomers in the outer part, and the energy value increases from 3 to 4 due to this movement. Algorithm 6 gives the pseudo-code of the process.
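The following Python helper illustrates the geometric part of the mechanism: the center used in Eq. (5) and the per-monomer squared distances that decide which H monomers should be pulled inward. The function names and the simple decision rule built on top of these distances are ours, not the paper's.

```python
def centre_of(coords):
    """(P_c, Q_c, R_c) computed exactly as in the text."""
    xs, ys, zs = zip(*coords)
    return ((max(xs) - min(xs)) / 2.0,
            (max(ys) - min(ys)) / 2.0,
            (max(zs) - min(zs)) / 2.0)

def squared_distances(coords):
    """Squared distance of every monomer from the centre."""
    pc, qc, rc = centre_of(coords)
    return [(pc - x) ** 2 + (qc - y) ** 2 + (rc - z) ** 2 for x, y, z in coords]

def hp_compliance_score(sequence, coords):
    """HP_s of Eq. (5): mean squared distance of the monomers from the centre."""
    return sum(squared_distances(coords)) / len(sequence)

def misplaced_h_monomers(sequence, coords):
    """Indices of H monomers lying farther from the core than the average P
    monomer -- candidates whose directions the mechanism would change."""
    d = squared_distances(coords)
    p_dist = [d[i] for i, a in enumerate(sequence) if a == 'P'] or [0.0]
    avg_p = sum(p_dist) / len(p_dist)
    return [i for i, a in enumerate(sequence) if a == 'H' and d[i] > avg_p]

coords = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(hp_compliance_score("HHPH", coords), misplaced_h_monomers("HHPH", coords))
```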

[Algorithm 6 pseudo-code]

4.4 Evolution mechanism

The evolution mechanism tries to move each individual monomer of a protein in every possible direction. For each amino acid it inspects the empty lattice points of the cubic lattice, moves the amino acid to one of the empty positions and checks whether the energy value of the entire structure improves. If the energy value improves, the new structure is accepted; otherwise, it is rejected. Figure 8a gives a sample structure with 13 monomers whose energy value is 3 and which contains no overlapping positions. The main principle of the evolution mechanism is to improve each structure by moving its monomers in every possible direction, starting from the last monomer and gradually moving toward the first. For example, if we move the \(13{\mathrm{th}}\) monomer in every possible direction, the energy value does not increase, so the mechanism immediately switches to the previous monomer. If we then change the direction of the \(12{\mathrm{th}}\) monomer from right to left, the energy value increases from 3 to 5, as shown in Fig. 8b. This direction-changing process continues down to the first monomer, each time trying to move a monomer so as to improve the structure; the process is repeated for every individual monomer with respect to each preceding monomer. Algorithm 7 depicts the pseudo-code of this process.
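A compact Python reading of the mechanism is sketched below (illustrative only). It takes an evaluate callback that returns the energy of a direction array, or None when the structure overlaps; the callback, the strictly improving acceptance rule and the use of the \(F_{\mathrm{v}}\) sign convention (lower is better) are our assumptions.

```python
DIRECTIONS = ['L', 'R', 'U', 'D', 'F', 'B']

def evolution(solution, evaluate):
    """Walk from the last monomer back to the first; at every position try all
    six directions and keep a change only when `evaluate` reports a better
    (lower, in the F_v convention) energy for a valid structure.
    `solution` is assumed to be a valid (non-overlapping) conformation."""
    best = list(solution)
    best_e = evaluate(best)
    for pos in range(len(best) - 1, -1, -1):
        for d in DIRECTIONS:
            if d == best[pos]:
                continue
            trial = best[:pos] + [d] + best[pos + 1:]
            e = evaluate(trial)
            if e is not None and e < best_e:     # accept only improving, valid moves
                best, best_e = trial, e
    return best
```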

Fig. 8 Evolution mechanism

[Algorithm 7 pseudo-code]

4.5 Repair mechanism

During execution of the \({\mathrm{CRO}}_{{\mathrm{PFO}}}\) operators, many invalid structures may be created in which two or more monomers overlap at the same position; the number of overlapping positions can be one or more. Such structures cannot be accepted as solutions of the PFO problem, so we have designed a repair function that turns invalid structures into valid ones. The two major tasks of the repair mechanism are overlap checking and applying a backtracking algorithm to create a valid structure. The working process is as follows. A checking procedure scans a structure, starting from the last position and moving toward the first, to determine whether it contains any overlapping; if any lattice point contains more than one monomer, the procedure returns true. A backtracking algorithm is then applied to the protein structure to remove the overlapping. The backtracking algorithm sets a pointer to the monomer preceding the overlapping monomer, finds the free adjacent lattice points of that monomer and tries to move the overlapping monomer to one of them. If a free lattice point exists, the overlapping monomer is moved there; if not, the algorithm moves back to the next preceding monomer. This process continues until the structure becomes overlap free. In other words, when an overlap is found for a monomer, checking starts from its preceding position and proceeds toward the first position, and as soon as a free lattice point is found the overlapping monomer is moved there. The process is repeated for every individual monomer, with respect to all of its preceding monomers, until a valid structure is obtained. Algorithm 8 gives the pseudo-code of the process, and Fig. 9 depicts a pictorial view in which an invalid structure is repaired: starting the check from the last monomer, that monomer is found to overlap with another one, and after the overlap is removed according to the steps above, a valid structure is obtained, as shown on the right side of Fig. 9. The repair mechanism is invoked after the execution of any of the four operators and repairs any invalid structure.

Fig. 9 Repair mechanism
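For illustration, the sketch below rebuilds the walk monomer by monomer and resolves an occupied lattice point by trying the other directions of the preceding monomer. This is a simplified, greedy reading of the mechanism (all names are ours); the paper's Algorithm 8 additionally backtracks to earlier monomers when no free adjacent point is available.

```python
DIRS = {'L': (-1, 0, 0), 'R': (1, 0, 0), 'U': (0, 0, 1),
        'D': (0, 0, -1), 'F': (0, 1, 0), 'B': (0, -1, 0)}

def repair(directions):
    """Greedy repair sketch: follow the direction array from the centre and,
    whenever the next lattice point is already occupied, redirect the move to
    a free adjacent point of the preceding monomer."""
    coords = [(0, 0, 0)]
    occupied = {coords[0]}
    fixed = list(directions)
    for i, d in enumerate(fixed):
        candidates = [d] + [k for k in DIRS if k != d]   # prefer the original move
        for cand in candidates:
            nxt = tuple(c + s for c, s in zip(coords[-1], DIRS[cand]))
            if nxt not in occupied:
                fixed[i] = cand
                coords.append(nxt)
                occupied.add(nxt)
                break
        else:
            return None   # dead end: here the full backtracking step would kick in
    return fixed

print(repair(list("RULDR")))   # the fourth move would collide and gets redirected
```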

[Algorithm 8 pseudo-code]

5 Experimental result

We compared the \(E_{{\mathrm{best}}}\) and \(E_{{\mathrm{mean}}}\) values and the standard deviation of our proposed algorithm with the results of GAAM [5]. Here, the \(E_{{\mathrm{mean}}}\) value has been multiplied by \(-1\), so a higher \(E_{{\mathrm{mean}}}\) value indicates better efficiency. The proposed algorithm was tested on four different datasets, and the results are shown in Tables 4, 5, 6 and 7. In the tables, a sequence is labeled by its set number followed by its serial number within the set; for example, S1.1 denotes the first sequence of set 1.

Table 2 Test sequences

5.1 Experimental setup

The proposed CRO algorithm was described in Sect. 4.1. The experimental setup is as follows. The algorithm was implemented in the Java programming language and executed on an Intel Core i5 computer with a 2.60 GHz CPU and 4 GB RAM under the Windows operating system. The developed algorithm was tested on 4 datasets [5], which are shown in Table 2, and the values of the parameters used in the algorithm are shown in Table 3. The molCol value varies according to the length of the protein sequence, and the algorithm performs 200 iterations in one run.

$$\begin{aligned} molCol= {\left\{ \begin{array}{ll} 1, &\quad {\text {for length}} \le 36\\ 0.5, &\quad {\text {for length}} > 36 \end{array}\right. } \end{aligned}$$
(6)
Table 3 Parameters of CRO algorithm

5.2 Effect of control parameters

The effect of the control parameters was tested on the first dataset of sequences, because this set contains a large number of sequences of medium length. We set three values for each control parameter: \(popSize \in \{200, 500, 1000\}\) and \(molCol \in \{ 0.4, 0.5, 0.6\}\). Besides this, initialKE and KELossRate were set to 100 and 0.5 during the performance test. The default values were \(popSize = 500\), \(molCol = 0.5\), \(initialKE = 100\) and \(KELossRate = 0.5\), and we calculated \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) for all the values of the control parameters. Table 4 shows the results; 30 independent runs were performed to compute the values. From the table, it can be seen that the values change only slightly with the parameters. For \(popSize = 200\), the \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) values deviate slightly from the highest values, whereas for \(popSize = 1000\) all the \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) values attain the highest values. For \(molCol = 0.4\) and \(molCol = 0.6\), the \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) values also deviate slightly from the highest values. \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) are used as statistical measures; the \(E_{{\mathrm{std}}}\) value is calculated as follows:

Table 4 The effect of control parameters for dataset 1 with \(initialKE = 100\), \(KELossRate = 0.5\) and the number of runs was 30
$$\begin{aligned} E_{{\mathrm{std}}}=\sqrt{\frac{1}{N_{\mathrm{R}}}\sum _{i=1}^{N_{\mathrm{R}}}(E_i-E_{{\mathrm{mean}}})^2} \end{aligned}$$
(7)

where \(N_{\mathrm{R}}\) is the number of runs, which is set to 30, and \(E_i\) is the selected energy value of run i. \(E_{{\mathrm{mean}}}\) is the mean energy value over all runs and is calculated by the following equation:

$$\begin{aligned} E_{{\mathrm{mean}}} = \frac{\sum _{i=1}^{N_{\mathrm{R}}}E_{{\mathrm{best}}}}{N_{\mathrm{R}}} \end{aligned}$$
(8)

Here, \(E_{{\mathrm{best}}}\) is the best energy value obtained in each run.
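As a small worked illustration of Eqs. (7) and (8), the snippet below computes both statistics; the numeric values are made up and are not taken from the result tables.

```python
import math

def e_mean_and_std(best_energies):
    """Eqs. (7)-(8): mean and standard deviation of the best energy value
    obtained in each of the N_R independent runs."""
    n_r = len(best_energies)
    e_mean = sum(best_energies) / n_r
    e_std = math.sqrt(sum((e - e_mean) ** 2 for e in best_energies) / n_r)
    return e_mean, e_std

print(e_mean_and_std([32, 31, 33, 32, 32]))   # illustrative values only
```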

5.3 Comparison with existing algorithm

The state of the art for PFO in the HP cubic lattice model is GAAM [5], proposed in 2016 by Bošković and Brest. They compared their algorithm with an adaptive genetic algorithm with phenotypical crowding [26], a memetic algorithm with self-adaptive local search [32], an ant colony optimization algorithm [33], a multi-objective genetic algorithm with feasibility rules [34], a genetic algorithm with systematic crossover [24], an estimation of distribution algorithm [4] and a clustered memetic algorithm with local heuristics [35]; the performance of GAAM was the best among all the compared algorithms, so we compare our proposed algorithm with GAAM. The results are shown in Tables 5, 6 and 7. In the tables, a sequence is labeled by its set number followed by its serial number within the set, e.g., S1.1 denotes the first sequence of set 1. We computed \(E_{{\mathrm{best}}}\), \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) for dataset 1; the results are shown in the left part of Table 5, and all the sequences of this dataset are of length 48. \(E_{{\mathrm{mean}}}\) is the mean energy value over all independent runs, and \(E_{{\mathrm{std}}}\) is the standard deviation over all runs. From the table, it can be observed that \({\mathrm{CRO}}_{{\mathrm{PFO}}}\) shows better \(E_{{\mathrm{mean}}}\) values for all sequences; our algorithm obtained not only better \(E_{{\mathrm{mean}}}\) values but also better \(E_{{\mathrm{std}}}\) values than GAAM for all sequences. We also computed \(E_{{\mathrm{best}}}\), \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) for dataset 2; the results are shown in the right part of Table 5, and all the sequences of this dataset are of length 64. For the sequences of both datasets 1 and 2, we set \(initialKE = 100\), \(KELossRate = 0.5\) and \(molCol = 0.5\). The number of independent runs was 50 for each individual sequence, and \({\mathrm{CRO}}_{{\mathrm{PFO}}}\) performed 200 iterations in one run; in one full iteration the algorithm performs 500 evaluations (\(popSize=500\)). The calculation of the number of evaluations (NE) is as follows:

Table 5 Results of datasets 1 and 2 with \(initialKE = 100\), \(KELossRate = 0.5\), the number of runs was 50 and \({\mathrm{NE}}=4\times 10^5\) (whereas \({\mathrm{NE}}=4\times 10^6\) for GAAM [5])
  • Considering the worst case, in Algorithm 1, inside the popSize loop and the (if \(b>molCol\)) branch, the repair operator executes at most 2 times (whether \(b>molCol\) holds or not), in either lines 11–13 or lines 27–29 of Algorithm 1.

  • In each iteration inside the popSize loop, but outside the (if \(b>molCol\)) branch, the repair operator executes 2 times (lines 33–35 of Algorithm 1).

From these two points, repair executes at most 4 times per individual in each iteration; on average it is fewer. Considering the worst case, the number of evaluations in one run is \(200\times 500\times 4= 4\times 10^5\). From Table 5, it can be noticed that our proposed algorithm provides better \(E_{{\mathrm{best}}}\) values for the sequences S2.2, S2.4, S2.6 and S2.9, and for the other sequences it gives the same \(E_{{\mathrm{best}}}\) values as GAAM. For all sequences of this dataset, \({\mathrm{CRO}}_{{\mathrm{PFO}}}\) obtained better \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) values than GAAM. Our proposed algorithm also required fewer solution evaluations (NE) than GAAM [5].

Table 6 shows \(E_{{\mathrm{best}}}\), \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) for dataset 3, which contains sequences of length 20–64. Here, we used three different parameter settings: \(initialKE =100\), \(KELossRate = 0.5\), and molCol set to 1 for \(length \le 36\) and to 0.5 for \(length > 36\). Fifty independent runs were performed for each individual sequence. From this table, we can see that our proposed algorithm provides better \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) values for sequences S3.6 and S3.7, and for the other sequences it shows the same \(E_{{\mathrm{best}}}\) values as GAAM [5]. The values in boldface indicate the better performance of our algorithm.

Table 6 Results of dataset 3 with \(initialKE = 100\), \(KELossRate = 0.5\), the number of runs was 50 and \({\mathrm{NE}}=4\times 10^5\) (whereas \({\mathrm{NE}}=4\times 10^6\) for GAAM [5])

Table 7 shows \(E_{{\mathrm{best}}}\), \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) for dataset 4, which contains sequences of length 46–136. Here, two different parameter settings were used: for sequences of length 46 and 58 we set \(KELossRate = 0.5\), and for lengths 103 and 136 we set \(KELossRate = 0.05\); the molCol value was 0.5 for all sequences. Fifty independent runs were performed for each individual sequence. For all sequences of Table 7, \({\mathrm{CRO}}_{{\mathrm{PFO}}}\) obtained better \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) values; in particular, for the long sequences \({\mathrm{CRO}}_{{\mathrm{PFO}}}\) obtained better \(E_{{\mathrm{best}}}\), \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) values than GAAM.

Table 7 Results of dataset 4 with \(initialKE = 100\), \(KELossRate = 0.5\) (for S4.1 and S4.2), \(KELossRate = 0.05\) (for S4.3 and S4.4), the number of runs was 50, \({\mathrm{NE}}=4\times 10^5\) (for S4.1 and S4.2) and \({\mathrm{NE}}=4\times 10^6\) (for S4.3 and S4.4) (whereas \({\mathrm{NE}}=4\times 10^6\) (for S4.1–S4.3) and \({\mathrm{NE}}=4\times 10^7\) (for S4.4) for GAAM [5])

Table 8 shows \(E_{{\mathrm{best}}}\), \(E_{{\mathrm{mean}}}\) and \(E_{{\mathrm{std}}}\) for dataset 3, which contains sequences of length 20–64. The stopping condition was a runtime of 12 h for dataset 3, and our algorithm obtained the best results. From all of the tables, we can observe that the results of our proposed algorithm are better than those of GAAM, especially for the mean energy \((E_{{\mathrm{mean}}})\) and the standard deviation \((E_{{\mathrm{std}}})\) values, on all the datasets.

Table 8 Results of dataset 3 under runtime 12 h

5.4 Performance measure

In order to do a fairer comparison, we performed some more statistical tests over the datasets such as the diversity of the final population and the speed (number of sequence evaluations per second) of our proposed algorithm. Diversity measures the structural differences in each population.

Table 9 A performance measure of datasets 1 and 2 with \(initialKE = 100\), \(KELossRate = 0.5\) and the number of runs was 30
Table 10 A performance measure of dataset 3 with \(initialKE = 100\), \(KELossRate = 0.5\) and the number of runs was 30

The mean diversity of the final population is calculated by the following equation:

$$\begin{aligned} {\mathrm{div}} = \frac{2}{{\mathrm{TR}}*(p-1)*p}\sum _{i=1}^{{\mathrm{TR}}}\sum _{m=1}^{p}\sum _{n=m+1}^{p} D(a_m,a_n) \end{aligned}$$
(9)

Here, \(p=popSize\) and \(D(a_m,a_n)\) is the Hamming distance between solutions \(a_m\) and \(a_n\). The \(E_{{\mathrm{mean}}}\) value has been multiplied by \(-1\), so a higher \(E_{{\mathrm{mean}}}\) value indicates better efficiency. The statistical tests were performed on datasets 1, 2 and 3 to check the speed and the diversity of the population, where speed denotes the number of function evaluations per second. Tables 9 and 10 show the results.
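A direct Python transcription of Eq. (9) is shown below for reference; the toy population at the end is ours and serves only to demonstrate the call.

```python
def hamming(a, b):
    """Number of positions at which two direction arrays differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_diversity(final_populations):
    """Eq. (9): average pairwise Hamming distance over TR final populations."""
    tr = len(final_populations)
    p = len(final_populations[0])                 # popSize
    total = sum(hamming(pop[m], pop[n])
                for pop in final_populations
                for m in range(p)
                for n in range(m + 1, p))
    return 2.0 * total / (tr * (p - 1) * p)

pops = [[list("UUFF"), list("UDFF"), list("RRFF")]]   # TR = 1 run, popSize = 3
print(mean_diversity(pops))          # pairwise distances 1, 2, 2 -> 2*5/(1*2*3)
```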

5.5 Execution time comparison

In this subsection, the execution time comparison of datasets 1, 2 and 3 was performed. The sequences are of different lengths.

Fig. 10 Progress of energy value of dataset 1 with respect to time (s)

The default parameter settings were used for the sequences. The speed-up value indicates the factor by which the \({\mathrm{CRO}}_{{\mathrm{PFO}}}\) algorithm is faster. The speeds of the algorithm on the different datasets, relative to GAAM [5], are shown in Tables 11, 12 and 13.

Table 11 Mean time comparison of dataset 1 with \(initialKE = 100\), \(KELossRate = 0.5\) and the number of runs was 20
Table 12 Mean time comparison of dataset 2 with \(initialKE = 100\), \(KELossRate = 0.5\) and the number of runs was 20
Table 13 Mean time comparison of dataset 3 with \(initialKE = 100\), \(KELossRate = 0.5\) and the number of runs was 20

From the tables, it can be noticed that the highest speed-up value of our proposed \({\mathrm{CRO}}_{{\mathrm{PFO}}}\) algorithm is 7.39 and the lowest is 1.05. Here, the stopping condition was reaching the maximum target energy value or the initialKE dropping to 0. The speed-up values show that the proposed algorithm outperforms GAAM, with lower execution times and high speed-up values. From these tables, it can be observed that our proposed algorithm takes less time to evaluate the amino acid sequences than GAAM [5] and provides the highest energy values most of the time. In our developed algorithm, we work with the directions of the structures rather than with the structures themselves, because the final direction array provides sufficient information about a structure. As defined earlier, the input of the PFO problem is an amino acid sequence, a combination of H and P monomers. For example, HPHHPPHHHHPHHHPPHHPPHPHHHPHPHHPPHHPPPHPPPPPPPPHH is an input sequence of length 48. According to the \({\mathrm{CRO}}_{{\mathrm{PFO}}}\) algorithm, a random population of fixed size is created first; then the four CRO operators and the additional mechanisms are applied to this population, which is a set of direction arrays. After all the iterations and modifications, the best structure is selected among all the structures. The sequence of directions of the final structure is RUUULDDFDRUULUFDRRDLDLULDBLBRBUFFLFURUFFLBBBRDB, and the pictorial view for the PFO problem is given in Fig. 10a; here, the green monomers represent the hydrophobic amino acids and the blue monomers the polar amino acids. Another example is the sequence HHPHPHPHPHHHHPHPPPHPPPHPPPPHPPPHPPPHPHHHHPHPHPHPHH of length 50. After completing all the steps of our algorithm, the best structure is selected among all the structures; its sequence of directions is FLBLBURBDFRRBLURFLULFDLFRRBRFDBRDBLBLFLBLFFRFRRBL, and the pictorial view for this sequence is given in Fig. 11a.

Fig. 11 Progress of energy value of dataset 1 with respect to time (s)

Table 14 Performance analysis of datasets 1–4 with \(initialKE = 100\), \(KELossRate = 0.5\) and the number of runs was 30

5.6 Performance analysis

In Table 14, four types of results are shown to analyze the performance of our proposed algorithm. First, we calculated the energy values of the sequences with the H&P compliance mechanism but without the evolution mechanism. Second, the energy values were calculated with the evolution mechanism but without H&P compliance. After that, we determined the energy values of the sequences without either of the two mechanisms. It can be noticed that without these two mechanisms the algorithm cannot find the target best energy values, and in most of the cases the obtained energy values are more than 49

In Fig. 12a, b we show the convergence behavior of the algorithm over time. From the graphs, it can be noticed that the energy value of each sequence (S1.1–S1.10) increases over time; within each time interval, the energy value of each sequence increases steadily in most cases and finally reaches the best energy value. The convergence curves show that our proposed algorithm progresses with time and finds the best results.

Fig. 12 Progress of energy value of dataset 1 with respect to time (s)

6 Conclusions

Protein folding optimization (PFO) is a well-known NP-hard optimization problem, and in this work we have used the chemical reaction optimization (CRO) algorithm to solve it. CRO is a recent population-based metaheuristic that has already been applied successfully to many well-known optimization problems; it is a nature-inspired algorithm that mimics the behavior of chemical reactions. The main challenge of this algorithm is the formulation of the search space. In the PFO problem, the search space is massive and grows exponentially with the length of the protein, so selecting a limited number of structures from the huge number of possible choices is a very challenging task. We have redesigned the four basic operators of the CRO algorithm to solve the PFO problem and have additionally designed two extra mechanisms, evolution and H&P compliance, which raise the performance of the algorithm dramatically. Our algorithm also includes a repair mechanism that produces valid structures from invalid ones. The performance of the proposed algorithm is appreciable: we compared its results with the genetic algorithm with advanced mechanisms (GAAM), a state-of-the-art population-based algorithm, and the results clearly show that our algorithm performs better than GAAM.

A dedicated study of the CRO parameters may yield better results and lower execution times for this problem. Since there is no fixed rule for setting the parameters of CRO, finding the right values is a difficult task, and further experiments and studies on parameter settings may give better results for the PFO problem.