
1 Introduction

Proteins are macromolecules that perform important biological functions in every living organism once they reach their three-dimensional conformation. These macromolecules are composed of a unique amino acid sequence, known as the primary structure, which drives the protein to fold into a three-dimensional shape [2]. Currently, the methods used to determine the tertiary structure of an existing protein are nuclear magnetic resonance and X-ray crystallography [9]. Although these methods can determine the native conformation of a given protein, they are too expensive [1].

Different representations were created to solve the protein structure prediction (PSP) problem. Prediction based only on the amino acid sequence is called Ab Initio prediction, and it is one of the most challenging problems in bioinformatics because of its complexity, even for small proteins [10]. The high complexity associated with this problem is due to the huge number of plausible shapes that a protein can assume. Hence, the PSP problem is classified as NP-complete [11].

Due to the limitations of exact algorithms in solving this class of problems, meta-heuristics became a viable way to explore the search space and find possible conformations in a reasonable time. In recent literature, different Evolutionary Computing (EC) algorithms have been used to solve the PSP problem in atomic representations, as in [2, 6, 9].

The standard Differential Evolution (DE) algorithm is a population-based EC algorithm and was chosen to solve the PSP problem in the present work. DE is considered a good algorithm for continuous optimization problems [14]. However, it is known to lose population diversity very quickly, which increases the chance of getting stuck in a local optimum when it is applied to a highly multimodal problem like the PSP. With diversity control strategies it is possible to slow down the convergence, aiming to escape from local optima [4]. In this work, two simple diversification strategies are used with the DE algorithm: the Generation Gap and the Gaussian Perturbation.

Four different approaches were used to solve the PSP problem. One of them is the standard Best/1/Bin DE algorithm, while the others are called Generation Gap (GG), Gaussian Perturbation (GP) and the combination of both (GG-GP). To verify the efficiency of the proposed approaches, the genotypic diversity is analysed. This analysis makes it possible to verify the behaviour of the algorithms and the impact of these strategies on the results obtained. Furthermore, the results are compared with recent literature that used the same representation model and the same energy function.

The next sections are organized as follows. In Sect. 2 the PSP problem is described and related works are discussed. Section 3 explains the DE approaches used to solve the PSP problem. Section 4 presents the results obtained in our test cases. Finally, Sect. 5 contains the conclusions of this work and some future directions.

2 Protein Structure Prediction

Proteins are made of amino acid chains, where each amino acid is composed of an amino group (\(\text {H}_3\text {N}^{+}\)), a carboxyl group (COO\(^{-}\)) and a hydrogen atom attached to a central carbon (C\(_\alpha \)) [2]. Each amino acid also has a side chain attached to the C\(_\alpha \), which distinguishes each of the 20 different amino acids known in nature.

A protein can be described at four well-defined structural levels. The primary structure is formed by a linear sequence of amino acids. The secondary structure represents the local structure found in the backbone conformation. The tertiary structure is the protein’s final conformation (including the amino acid side chains) and determines its biological function. The quaternary structure represents interactions among proteins to accomplish specific functions.

To evaluate whether a protein is near its native state, Anfinsen’s thermodynamic hypothesis states that the native three-dimensional shape of a protein has the lowest free energy. In the present work, the energy is obtained with the CHARMM force field [3], one of the most popular energy functions [10], shown in Eq. 1.

$$\begin{aligned} E_{total} ={}& \sum _{bonds}K_b(b-b_0)^2 + \sum _{UB}K_{UB}(S-S_0)^2 + \sum _{angle}K_{\theta }(\theta - \theta _0)^2 \\ &+ \sum _{dihedrals}K_\chi (1+\cos (\eta -\delta )) + \sum _{impropers}K_{imp}(\varphi - \varphi _0)^2 \\ &+ \sum _{nonbond}\left( \epsilon \left[ \left( \frac{R_{min_{ij}}}{R_{ij}}\right) ^{12} - \left( \frac{R_{min_{ij}}}{R_{ij}}\right) ^{6} \right] + \frac{q_iq_j}{\epsilon _1r_{ij}}\right) \end{aligned}$$
(1)

Table 1. \(\chi \) angles for each amino acid.

Table 2. DSSP 8-class classification.

In Eq. 1, \(E_{total}\) is the total energy value; bonds measures the energy of the bond stretching between two atoms; Urey-Bradley (UB) represents the interactions between pairs of atoms separated by two bonds; angle is the sum over all bond angles in the structure; dihedrals is the energy associated with the torsion angles; impropers is associated with deformations of improper torsion angles; nonbond is related to the Van der Waals and Charge-Charge energies. Van der Waals is the energy from attraction and repulsion between nonbonded atoms. Charge-Charge varies according to the distance between atoms.

Different atomic representations of proteins have emerged, with different levels of abstraction. Some commonly used ones are: (a) all-atom three-dimensional coordinates; (b) all-heavy-atom coordinates; (c) backbone atom coordinates + side-chain centroids; (d) \(C_{\alpha }\) coordinates; (e) backbone and side-chain torsion angles.

This work employs the backbone and side-chain torsion angles model, in which each residue has a defined number of torsion angles to be optimized. Each amino acid has three backbone angles (\(\phi \), \(\psi \), and \(\omega \)) and a particular number of side-chain angles (\(\chi _{i}\)), as shown in Table 1. A backbone classification was employed to identify the secondary structure and the recommended angle bounds for each type of structure. These bounds are shown in Table 2 and are based on the full DSSP 8-class classification [12].

2.1 Related Works

Some related works apply bio-inspired algorithms, Ab Initio prediction and the CHARMM energy function. In [9], a GA was applied with two different approaches for diversity control: the random immigrants technique, which replaces a percentage of the individuals in the population with new randomly generated individuals, and an extension called simplified self-organizing random immigrants with dynamic replacement rate. In [13], a bacterial foraging optimization algorithm (BFOA) was applied to the 1PLW protein using the CHARMM energy function. Although most related works tried to improve diversity, in [17] a GA was combined with Hill Climbing in a parallel grid environment, improving the exploitation capacity.

In [15], NSGA-II was employed to solve the PSP problem using island models. A modified PAES algorithm with immune-inspired operators (cloning and hypermutation) is proposed in [7, 8]. Another multi-objective approach, using the DE algorithm, was proposed in [18]. In that work, DE was made adaptive by varying the mutation mechanism through a technique known as probability matching, which associates a selection probability with each DE version that can execute the mutation process. If the mutation process is successful, then the chance of that version being selected again increases. This is the approach that found the lowest energy values using CHARMM for the 1PLW, 1ZDD and 1CRN proteins. The multi-objective formulation in these works splits the CHARMM energy function into two terms: bonded and non-bonded.

3 Methods

In this work each individual is formed by a set of angles representing the amino acids. An individual is structured as a vector whose size changes according to the number of amino acids in each protein. Figure 1 illustrates the structure of an individual for the 1PLW protein, which has 5 amino acids. Note that the \(\omega \) angle is not in our representation because its value is always set to 180\(^{\circ }\).

Fig. 1. Graphical representation of 1PLW individual.
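
The vector encoding described above can be sketched as follows. This is only an illustration under stated assumptions: the per-residue layout `[phi, psi, chi_1..chi_k]`, the uniform initialization in [-180, 180] degrees, the function names and the three-residue example are all hypothetical. The actual side-chain angle counts come from Table 1 and the recommended bounds from Table 2.

```python
import random

def individual_length(chi_counts):
    """Number of angles in an individual: phi and psi per residue plus its chi angles.
    The omega angle is left out because it is fixed at 180 degrees."""
    return sum(2 + n_chi for n_chi in chi_counts)

def residue_slices(chi_counts):
    """Index range (start, end) of each residue's angles inside the flat vector."""
    slices, start = [], 0
    for n_chi in chi_counts:
        end = start + 2 + n_chi
        slices.append((start, end))
        start = end
    return slices

def random_individual(chi_counts, rng, low=-180.0, high=180.0):
    """Flat vector of angles in degrees, drawn uniformly (an assumed initialization)."""
    return [rng.uniform(low, high) for _ in range(individual_length(chi_counts))]

# Hypothetical three-residue chain with 1, 0 and 4 side-chain angles.
chi_counts = [1, 0, 4]
individual = random_individual(chi_counts, random.Random(0))
```

Keeping the individual as one flat vector lets the DE operators treat every angle uniformly, while `residue_slices` recovers which positions belong to which residue.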

The standard DE algorithm is population-based: at each iteration, an offspring is generated by a mutation operator and replaces its parent in the current population if it achieves a better solution. To modify this routine and improve the diversity of the population, we used the generation gap mechanism [16]. This mechanism is commonly used in EC: only a fraction of the population is replaced by the offspring, according to a parameter G that varies between 0 and 1. The maintained individuals are selected at random from the current population. This model is called generation gap DE (\(\text {DE}_{GG}\)).
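
A minimal sketch of the generation gap replacement is given below. It assumes that offspring `i` competes for slot `i` of the population and that the selected slots take their offspring unconditionally. The text does not state whether the usual better-than-parent test is still applied, so that choice, like the function name, is an assumption.

```python
import random

def generation_gap_replace(population, offspring, G, rng):
    """Build the next population: a fraction G of the slots, chosen at random,
    takes the offspring; the remaining slots keep their current individual."""
    n = len(population)
    n_replace = round(G * n)
    replaced = set(rng.sample(range(n), n_replace))
    return [offspring[i] if i in replaced else population[i] for i in range(n)]

# Toy example with labels instead of angle vectors.
current = ["parent"] * 10
children = ["child"] * 10
next_population = generation_gap_replace(current, children, G=0.8, rng=random.Random(1))
```

With G = 1 every slot is replaced, while smaller values keep more of the current individuals and therefore more of the current diversity.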

To create new individuals, the DE algorithm uses a mutation mechanism that combines values from different individuals. There are different approaches to mutation in DE and, in this work, the \(\text {DE}_{Best/1/Bin}\) version was selected. This standard version of DE always combines the best individual in the population with two other random individuals. The DE\(_{Best/1/Bin}\) approach creates a new individual using \(\mathbf {w} = \mathbf {x}_{best} + F \cdot (\mathbf {x}_{rand1} - \mathbf {x}_{rand2})\), where \(\mathbf {w}\) represents the newly generated individual and F is the mutation factor, a parameter that needs to be set.
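
The Best/1/Bin step can be sketched as below, following the textbook form of the operator: a mutant built from the best individual and two random ones, followed by binomial crossover with the target. The function and parameter names are illustrative, and details such as excluding only the target from the random picks are assumptions rather than the implementation used in this work.

```python
import random

def best_1_bin_trial(population, energies, target_idx, F, CR, rng):
    """Create one trial vector with Best/1 mutation and binomial crossover."""
    n, dim = len(population), len(population[0])
    # Lower energy is better, so the best individual has the minimum energy.
    best = population[min(range(n), key=energies.__getitem__)]
    # Two distinct random individuals, both different from the target.
    r1, r2 = rng.sample([i for i in range(n) if i != target_idx], 2)
    mutant = [best[k] + F * (population[r1][k] - population[r2][k]) for k in range(dim)]
    # Binomial crossover: position j_rand always comes from the mutant.
    j_rand = rng.randrange(dim)
    target = population[target_idx]
    return [mutant[k] if (rng.random() < CR or k == j_rand) else target[k]
            for k in range(dim)]

population = [[0, 0, 0], [10, 10, 10], [10, 10, 10], [10, 10, 10]]
energies = [5.0, 1.0, 2.0, 3.0]
trial = best_1_bin_trial(population, energies, target_idx=0, F=0.5, CR=1.0,
                         rng=random.Random(0))
```

With a crossover factor of 1, every position of the trial is taken from the mutant, so the trial and the mutant coincide.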

Another modification was made to the mutation operator of the standard DE\(_{Best/1/Bin}\) algorithm: a gaussian perturbation is applied to the two randomly selected individuals. In the gaussian perturbation, \(\mathbf {x}_{rand1}\) and \(\mathbf {x}_{rand2}\) are taken as the means, and the standard deviation is set to a value between 0 and 1. This model is called gaussian perturbation DE (DE\(_{GP}\)).
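
One possible reading of this operator is sketched below: each component of the two random individuals is redrawn from a normal distribution centred on its current value before the difference is taken. Applying the noise independently per component, and the names `gaussian_perturb` and `gp_mutant`, are assumptions made for illustration.

```python
import random

def gaussian_perturb(vector, sigma, rng):
    """Sample every component from a normal distribution whose mean is the
    original value and whose standard deviation is sigma."""
    return [rng.gauss(x, sigma) for x in vector]

def gp_mutant(best, x_rand1, x_rand2, F, sigma, rng):
    """Best/1 mutant built from perturbed copies of the two random individuals."""
    p1 = gaussian_perturb(x_rand1, sigma, rng)
    p2 = gaussian_perturb(x_rand2, sigma, rng)
    return [b + F * (a - c) for b, a, c in zip(best, p1, p2)]

rng = random.Random(42)
mutant = gp_mutant([1.0, 1.0], [4.0, 2.0], [2.0, 4.0], F=0.5, sigma=0.1, rng=rng)
```

Setting `sigma` to 0 recovers the unperturbed mutant, which makes the perturbation easy to switch off when comparing variants.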

To verify how these approaches impact the diversity of solutions during the optimization process, this work uses a genotypic diversity measure for continuous domains [5]. Equation 2 shows how the genotypic diversity is calculated.

$$\begin{aligned} GDM = \frac{\displaystyle \sum _{i=1}^{N-1} \ln \left( 1 + \displaystyle \min _{j \in [i+1,N]}\frac{1}{D} \sqrt{\displaystyle \sum _{k=1}^D (x_{i,k} - x_{j,k})^2} \right) }{NMDF} \end{aligned}$$
(2)

where D is the size of the solution vector, N is the population size and x is an individual (or solution vector). NMDF is a normalization factor that corresponds to the maximum diversity value found so far. The genotypic diversity starts at 1, which is its maximum value, and a value of 0 corresponds to full convergence of the population. With this measure it is possible to track the diversity level at each iteration, which is important for verifying whether the algorithm is getting trapped in a local optimum and, consequently, converging prematurely.
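
Eq. 2 can be computed as in the sketch below, which keeps NMDF as the largest unnormalized value seen so far. The class and function names are illustrative, and returning 0 when every population seen so far has fully converged is an assumption added to avoid a division by zero.

```python
import math

def raw_diversity(population):
    """Numerator of Eq. 2: for each individual i, the Euclidean distance to its
    nearest neighbour among individuals i+1..N, scaled by 1/D, inside ln(1 + .)."""
    n, dim = len(population), len(population[0])
    total = 0.0
    for i in range(n - 1):
        nearest = min(math.dist(population[i], population[j])
                      for j in range(i + 1, n)) / dim
        total += math.log(1.0 + nearest)
    return total

class DiversityTracker:
    """Keeps NMDF, the maximum raw diversity observed so far, and normalizes by it."""

    def __init__(self):
        self.nmdf = 0.0

    def gdm(self, population):
        raw = raw_diversity(population)
        self.nmdf = max(self.nmdf, raw)
        return raw / self.nmdf if self.nmdf > 0 else 0.0

tracker = DiversityTracker()
initial = tracker.gdm([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
converged = tracker.gdm([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]])
```

Calling `gdm` once per generation on the same tracker yields a curve that starts at 1 for the first population and drops towards 0 as the individuals collapse onto each other.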

Besides the function evaluation given by CHARMM and the genotypic diversity measure given by Eq. 2, another important metric is considered in this work: the root mean square deviation (RMSD). The RMSD is a measure given in Å (angstrom) that compares the atomic distances between two protein structures and verifies whether the final predicted conformation reached the native conformation. An RMSD near 0 means that the predicted protein is very similar to the native protein.
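
For reference, the RMSD between two sets of paired atomic coordinates can be sketched as follows. The sketch assumes that the two structures list the same atoms in the same order and have already been superimposed. It performs no alignment step, and the function name is illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between paired 3D coordinates, in the same
    unit as the input (angstrom if the coordinates are in angstrom)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("both structures need the same, non-zero number of atoms")
    squared = sum((a - b) ** 2
                  for p, q in zip(coords_a, coords_b)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(coords_a))

native = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
predicted = [(3.0, 4.0, 0.0), (1.5, 0.0, 0.0)]
deviation = rmsd(native, predicted)
```

In this toy example one atom is displaced by 5 Å and the other by 0 Å, so the squared deviations average to 12.5 Å² before the square root is taken.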

4 Experiments, Results and Analysis

In the current work three different proteins were used as problem instances: 1PLW, 1ZDD and 1CRN. The smallest protein used is Met-Enkephalin (1PLW), with only 5 amino acids and 22 angles to be optimized, and without any well-defined secondary structure. The 1ZDD protein has two well-defined \(\alpha \)-helices and contains 34 amino acids, with 179 angles to be optimized. The biggest protein used in this work, 1CRN, has 46 amino acids and 191 angles to be optimized. The 1CRN protein has two well-defined \(\alpha \)-helices and two \(\beta \)-sheets as secondary structures.

The experiments were conducted with four different algorithm configurations: the standard DE\(_{Best/1/Bin}\), the standard DE with the generation gap mechanism (DE\(_{GG}\)), the standard DE with gaussian perturbation (DE\(_{GP}\)) and DE\(_{GG-GP}\), which combines the standard DE with both generation gap and gaussian perturbation. This work also compares the results obtained with other works found in the literature. All approaches use the atomic representation and the CHARMM energy function calculated with the Tinker Molecular Dynamics Package.

The DE parameters used in this work are those recommended by [18]: a population size of 100 individuals, a mutation factor (F) of 0.5, a crossover factor of 1 and 500,000 function evaluations. The parameter G, which controls the generation gap, was empirically set to 0.8, and the parameter GSD, which defines the standard deviation of the gaussian perturbation, was empirically set to 0.1.
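
As a rough check on this budget, and assuming that each generation spends one energy evaluation per individual (an assumption, since the accounting of evaluations is not spelled out), the number of generations per run works out to

$$\begin{aligned} \frac{500{,}000 \text { evaluations}}{100 \text { evaluations per generation}} = 5{,}000 \text { generations} \end{aligned}$$

which agrees with the 5,000 generations at which all algorithms end in the convergence analysis of this section.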

For each protein and each approach, 10 runs were performed. Table 3 contains the results obtained for the 1PLW, 1ZDD and 1CRN proteins.

Table 3. Results obtained.

The first column indicates the protein and the second column identifies the algorithm. Column 3 gives the minimum energy found over all runs and column 4 the RMSD\(_{\alpha }\) of the conformation with that minimum energy. Finally, column 5 gives the average minimum energy with the standard deviation. All DE approaches were implemented in C++ and run on an Intel Core i7 with 8 GB of RAM.

For Met-Enkephalin (1PLW), all four DE approaches obtained similar results. Among them, DE\(_{GG-GP}\) reached \(-35.82\) kcal mol\(^{-1}\) with an RMSD\(_{\alpha }\) of 1.98 Å. These values are competitive with the state-of-the-art ADEMO/D algorithm proposed in [18]. The lowest RMSD\(_\alpha \) was reached by NSGA-II [15], with 1.26 Å.

For the 1ZDD protein, the results showed significant differences among the four DE approaches. DE\(_{GP}\) and DE\(_{GG-GP}\) reached better minimum energy values than the standard DE\(_{Best/1/Bin}\) and DE\(_{GG}\). DE\(_{GP}\) reached \(-1,216.40\) kcal mol\(^{-1}\) with an RMSD of 2.36 Å, making it competitive with the state-of-the-art algorithms found in the literature, ADEMO/D [18] and NSGA-II [15].

Analysing the results for the 1CRN protein, the best approach was DE\(_{GP}\), with an energy value of 166.83 kcal mol\(^{-1}\). Among the four DE approaches developed in this work, DE\(_{GP}\) was better than DE\(_{Best/1/Bin}\) in both the energy value and the standard deviation, showing that the gaussian perturbation improved the results obtained. Compared with ADEMO/D [18], our approach achieved a lower energy, although its RMSD\(_{\alpha }\) was higher than the results found in the literature.

Figure 2 shows both the energy and the genotypic diversity convergence over the generations for all developed DE approaches. Note that all approaches converged very quickly for the 1PLW protein: the energy function stabilized at generation 1,000. However, the genotypic diversity plot shows that there is still diversity in the population after generation 1,000 for DE\(_{GG-GP}\), while for DE\(_{Best/1/Bin}\) the diversity ended earlier, leaving no possibility of creating different individuals. Because of the small size of this protein, DE\(_{Best/1/Bin}\) and DE\(_{GG-GP}\) obtained very similar values even though the latter uses a diversity control mechanism.

Analysing the convergence for the 1ZDD protein, the energies of DE\(_{Best/1/Bin}\) and DE\(_{GG}\) stabilize around generations 2,500 and 3,000, respectively, while the energies of DE\(_{GP}\) and DE\(_{GG-GP}\) are still decreasing at generation 5,000, when all algorithms end. This behaviour is related to the diversity in the population. Note that the approaches that converged earlier obtained worse energy values and lost their diversity prematurely. In contrast, DE\(_{GP}\) and DE\(_{GG-GP}\), which reached better energy results, still have diversity left in the last generation.

The convergence analysis made for the 1ZDD protein also applies to the 1CRN protein, where the approaches with diversity maintenance routines achieved better results by avoiding premature convergence.

Overall, the diversification techniques helped the standard DE to obtain lower average energy values, mainly for the two biggest instances (1ZDD and 1CRN). The GDM index showed that maintaining genotypic diversity is an important factor that needs to be considered in the PSP problem.

Fig. 2. Energy (left) and genotypic diversity (right) for each sequence.

5 Conclusions and Future Research

This work applied four different DE approaches to solve some PSP problem instances using the torsion angles model and the CHARMM energy function. Two diversification strategies were used in order to avoid premature convergence: the generation gap and the gaussian perturbation.

Although there are many works in the literature addressing the PSP problem, none of them, to the best of our knowledge, analysed the diversity of solutions during the optimization process. In this work, the genotypic diversity was analysed using the GDM index. With this index it was possible to relate the genotypic diversity to the energy convergence, verifying that the versions that maintained the diversity obtained better results in all three proteins.

Both genotypic diversification strategies increased the population’s diversity, but the gaussian perturbation always obtained the best energy values in comparison with the standard DE and the generation gap version. All four algorithms were also compared with state-of-the-art algorithms found in the literature that used CHARMM as the energy function. The DE\(_{GP}\) version proved to be competitive in all three proteins, 1PLW, 1ZDD and 1CRN, despite being a much simpler approach than the works found in the literature.

Also, it was verified that, when the algorithm ends, the diversity for the bigger proteins is about \(40\%\). This indicates that exploitation routines, such as local search algorithms, could be used to exploit this remaining diversity, aiming to reach better energy values. Another direction for future research is the use of GPUs for energy minimization, possibly granting higher speedups compared with CPU approaches and allowing larger numbers of function evaluations and longer runs.

As our approaches are much simpler than those in the literature and reached competitive results, it is possible to invoke the well-known Occam’s razor: when two competing theories make exactly the same predictions, the simpler one is better.