Introduction

Proteins are large biological molecules or macromolecules consisting of amino acid residues that are connected by peptide bonds [1]. Understanding the functional mechanism of proteins, known as the building blocks of life, is essential in biological sciences and presents an enormous impact on enzyme engineering pathology, medicine, and pharmaceutics [2]. The folded structures of proteins in a cell provide the information necessary for studying the functions of proteins at the molecular level [3, 4].

Various experimental techniques have been developed to identify protein structures [5]. Up to February 25, 2013, approximately 81,700 protein structures had been deposited in the Protein Data Bank (PDB) database; most of these structures were determined using X-ray diffraction (88.8 %) and nuclear magnetic resonance spectroscopy (10.5 %) [6]. However, strict laboratory requirements and heavy operating burdens have limited the applications of the experimental techniques, making it difficult to keep track of deposited structures in the same pace of primary structures (i.e., amino-acid sequences) recovery [2, 3, 7]. In 2011, the number of experimentally determined secondary/tertiary structures was twofold lower than the number of undetermined protein structures [8]. Hence, computation-based techniques have been used to optimize protein structures or to simulate protein folding processes.

Protein folding in computational biology involves calculating protein conformation by arranging a sequence of basic structural elements. Christian Anfinsen’s theory, which states that all information needed to predict the native structure of a protein is encoded in its primary sequence, has largely contributed to the development in this research area [9, 10]. Researchers [11, 12] have used this theory as a basis to analyze native structures with minimal free energy. However, solving such a problem remains challenging for complicated protein models [13]. A key issue here is to what extent the trivial mechanisms in protein folding should be omitted [14]. Therefore, models with reduced complexity (e.g., [1518]) have been developed by utilizing parts of the protein properties to avoid the large computational cost associated with an all-atom model.

In the present study, an off-lattice model introduced by Stillinger et al. [18] is utilized to simplify the protein folding scheme. All the amino acid residues in a sequence can be classified as hydrophobic or hydrophilic on the basis of the K-D method [19]. These “binarized” residues are connected by unbendable but free-to-rotate chemical bonds. Considering potentials among particles, we utilize a predefined criterion to calculate the free energy of a given conformation. Assuming that the native conformation of a protein is the most stable structure with minimal free energy, we can simulate the protein folding process through minimizing the free energy criterion.

The original protein folding scheme can be transformed into a numerical optimization problem by using the off-lattice model. Considering the NP-hard property of such optimization problems [20], researchers have developed various optimization techniques. For example, Kim et al. [13] proposed a conformational space annealing optimizer, Zhang et al. [21] a genetic tabu search algorithm, Chen and Huang [22] a heuristic algorithm, Wang et al. [23] a chaotic artificial bee colony algorithm, and Liang an annealing contour Monte Carlo algorithm, among many others [2444]. However, previous research studies have focused overly on developing optimization algorithms, leaving the original protein structure folding mission ignored. In fact, large numbers of new optimization methods are being proposed every day, which outperform the relatively old ones. Therefore, the impact of an optimizer is commonly not anticipated to increase with time in the long run. In this sense, we believe that research efforts should be exerted on the concerned biological problem so as to make long-lasting contributions.

Previous publications in this broad research area suffer from three typical drawbacks from the authors’ viewpoint. First is the lack of algorithm validation. Only the best-so-far protein structure solutions were reported without sufficient numerical experiments or theoretical analyses to support the proposed optimization algorithm. Second is that artificial protein sequences (e.g., Fibonacci sequences) or real sequences that are relatively short are considered in simulations. At this point, ref. [27] indicated that real sequence formations are far more realistic than those artificial ones, thus simulations on the Fibonacci sequences cannot fully verify the efficacy of the utilized optimization algorithm. Third is the lack of insights into the biological problem: the protein folding mission is regarded as a general numerical optimization problem and few in-depth analysis or discussions about the obtained results are attached. For the assistance of understanding, a list of the aforementioned references is shown in Fig. 1. In addition to the three common drawbacks mentioned above, the obtained solution vectors were seldom made public among the previous publications, rendering their calculated solutions impossible to validate.

Fig. 1
figure 1

Comprehensive review on the common drawbacks of publications regarding protein structure optimization using off-lattice models

The present study aims to overcome the typical limitations mentioned above. A balance-evolution artificial bee colony (BE-ABC) algorithm [31, 45] is applied for protein structure optimization. Compared with the conventional ABC algorithm and other state-of-the-art variants, BE-ABC algorithm utilizes parameters/variables in the algorithm framework as guidelines to adaptively adjust search intensity. Moreover, an overall degradation procedure is used to avoid premature convergence. A benchmark case set for protein structure optimization that focuses only on real protein sequences is systematically formulated. In addition to reporting the best-so-far optimization results, we also strictly compare the short-term convergence performance of BE-ABC algorithm to those of some state-of-the-art ABC variants. Also, innovative insights into the protein folding problem are presented. The details of our optimized best-so-far structures are released for the convenience of validation.

The remainder of this paper is organized as follows: the next section introduces the 3D off-lattice protein model, then the principle of BE-ABC algorithm is briefly presented. Results are presented next followed by further discussions and conclusions.

Three-dimensional off-lattice protein model

The off-lattice protein model was initially developed to consider 2D folding problems and was extended to deal with 3D scenarios where additional torsional energy contributions of each bond are taken into account [14]. In the present study, a 3D off-lattice model is used to fold proteins in a relatively flexible but nontrivial extent.

The remainder of this section concerns how a structural protein can be uniquely determined by a set of bond/torsional angle parameters and how the corresponding free-energy value can be calculated.

Structure formulation

The off-lattice model is a coarse-grained one which holds the viewpoint that the main driving forces to form protein structures are the hydrophobic interactions among amino acid residues [18]. Thus, all 20 types of amino acid residues are classified as hydrophobic residues (amino acids I, V, L, P, C, M, A, and G represented by letter “A”) and hydrophilic residues (amino acids D, E, F, H, K, N, Q, R, S, T, W, and Y represented by letter “B”) [19, 25, 27]. All the AB-binarized amino acid residues are sequentially connected by unit-length chemical bonds to form a structural chain in the 3D space. This subsection introduces how the unique structure of a chain with N residues can be specified by (2N − 5) angle parameters [θ 1, …, θ N − 2, β 1, …, β N − 3]1 × (2N − 5), which include (N − 2) bond angles and (N − 3) torsional angles.

Locations of amino acid residues in the space are determined as follows [23]:

$$ \left({x}_i,{y}_i,{z}_i\right) = \left\{\begin{array}{c}\hfill \left(0,0,0\right)\kern7.25em i=1\hfill \\ {}\hfill\ \left(0,1,0\right)\kern7.25em i=2\hfill \\ {}\hfill \left( \cos \left({\theta}_1\right), \sin \left({\theta}_1\right)+1,0\right)\kern1.5em i=3\hfill \\ {}\hfill \left(\begin{array}{l}{x}_{i-1}+ \cos \left({\theta}_{i-2}\right)\cdot \cos \left({\beta}_{i-2}\right),\\ {}{y}_{i-1}+ \sin \left({\theta}_{i-2}\right)\cdot \cos \left({\beta}_{i-2}\right),\\ {}{z}_{i-1}+ \sin \left({\beta}_{i-2}\right)\end{array}\right)\ 4\le i\le N\hfill \end{array}\right.. $$
(1)

For normalization, the first three amino acid particles are defined in the plane z = 0. The locations of subsequent particles (i.e., when i ≥ 4) can then be calculated recursively. Specifically, the location of the ith particle is based on that of the (i − 1)th particle (i ≥ 4). Figure 2 shows how the location of a protein sequence AABA is determined by the angle vector [θ 1, θ 2, β 1].

Fig. 2
figure 2

Schematic of 3D off-lattice protein model for sequence AABA (N = 4): a Illustration of how [θ 1, θ 2] affects protein structure with β 1 = 0; b Illustration of how β 1 effects protein structure while [θ 1, θ 2] is temporarily fixed

Free energy computation

Once a unique structure can be specified by a set of angular parameters, the corresponding free energy of the protein structure can be calculated.

In the 3D off-lattice model, the free energy function Energy can be defined as:

$$ Energy\left(\left[{\theta}_1,\dots, {\theta}_{N-2},{\beta}_1,\dots, {\beta}_{N-3}\right]\right)={\displaystyle \sum_{i=1}^{N-2}\frac{1- \cos \left({\theta}_i\right)}{4}}+4{\displaystyle \sum_{i=1}^{N-2}{\displaystyle \sum_{j=i+2}^N\left[{r_{ij}}^{-12}-C\left({\xi}_i,{\xi}_j\right){r_{ij}}^{-6}\right]}}, $$
(2)

where ξi reflects the binary property of the ith particle in the sequence, r ij denotes the distance between residues i and j in the 3D space, and C(ξi, ξj) represents the interaction between these two particles. If particle i is hydrophilic, then ξ i  = 1; otherwise, ξ i  = − 1. r ij and C(ξi, ξj) can be defined as follows:

$$ {r}_{ij}=\sqrt{{\left({x}_i-{x}_j\right)}^2+{\left({y}_i-{y}_j\right)}^2+{\left({z}_i-{z}_j\right)}^2}. $$
(3)
$$ C\left({\xi}_i,{\xi}_j\right)=\frac{1}{8}\left(1+{\xi}_i+{\xi}_j+5{\xi}_i{\xi}_j\right). $$
(4)

The free energy in Eq. (2) consists of two items that respectively represent intermolecular interactions among residues and intermolecular interactions between the peptide chain and surrounding solvent molecules. The first sum in Eq. (2) runs over the (N − 2) angles θ i  ∈ (−180, 180] of successive bond vectors. This item represents the bending energy and the coupling is ferromagnetic, i.e., it penalizes energy to bend the chain [31]. The second item partially competes with the bending barrier depending on the distance between nonadjacent particles along the chain [14].

One angle vector [θ 1, …, θ N − 2, β 1, …, β N − 3] represents one candidate structure and Energy([θ 1, …, θ N − 2, β 1, …, β N − 3]) is the corresponding free energy value, with θ i , β j  ∈ (−180, 180] for any 1 ≤ i ≤ N − 2, 1 ≤ j ≤ N − 3.

In this way, an protein folding mission is transformed into a constrained numerical optimization problem. Given a protein sequence, we aim to find the optimal angle vector associated with the minimal free energy value, i.e.,

$$ {\left[{\theta}_1,\dots, {\theta}_{N-2},{\beta}_1,\dots, {\beta}_{N-3}\right]}^{\ast }= \arg \left(\underset{-{180}^{\circ }<{\theta}_i,{\beta}_j<{180}^{\circ }}{ \min}\left( Energy\left(\left[{\theta}_1,\dots, {\theta}_{N-2},{\beta}_1,\dots, {\beta}_{N-3}\right]\right)\right)\right),1\le i\le N-2,\ 1\le j\le N-3. $$

The following section introduces BE-ABC algorithm to search for that optimal solution.

BE-ABC algorithm

Consistent with Anfinsen’s theories of protein structure stability when free energies of native protein structures are at their lowest, this section focuses on how BE-ABC algorithm searches for the solution with minimum free energy.

Motivations

The conventional ABC algorithm is a well-known swarm intelligence optimization method inspired by the foraging behavior of bees [46]. ABC algorithm is featured by the implementation of both local exploitation and global exploration procedures during the optimization process [47]. However, local search accuracy of the ABC algorithm is not satisfactory [48], causing large numbers of remedies proposed for that imperfection. Most of the proposed ABC variants have considered adjusting the local exploitation equations to achieve good search intensity. To that end, three ways have been prevailingly utilized: (i) to adopt a new principle from the outside world (e.g., [23, 49, 50]), (ii) to integrate ABC with other metaheuristics and (iii) to adjust the local search intensity in a self-adaptive manner (e.g., [7, 45, 51]). Commonly it is difficult to intuitively judge whether one ABC variant outperforms another unless statistically significant results based on comparative numerical experiments are available. Generally speaking, the absence of in-depth mathematical analyses may have made this research area as prosperous as it is. In contrast with the first two ways of modification, self-adaptive ABC variants (i.e., methods based on the third way) are intuitively understandable (because no complicated outside-world principles are involved) and remain the original framework of the conventional ABC algorithm. On the other hand, one may notice that the effect to execute a local/global search may be different during different periods of the entire optimization process, making it sensible to strike an intensity balance between a global search and a local one.

Adjusting the search intensity adaptively is considered in BE-ABC algorithm. Specifically, when evolutions are made with ease around one employed bee, the subsequent search accuracy is intensified to expect further evolutions there. On the contrary, when the search efficiency is bad there, the subsequent search should be executed in a relatively wide scale. Through this, it is possible that a global search appears local and a local exploitation appears global. Then there is no longer any clear gap between the global and local searches during the evolutions. Since a balance between the global and local searches is ideally expected, we name this proposal “balance evolution”.

In more detail, BE-ABC algorithm is different from the conventional ABC algorithm in two aspects. First, adaptively changed multipliers are added in the global and global search equations to manipulate the search intensity. Those multipliers are determined on the basis of the optimization efficiency in a current iteration. Second, the conventional re-initialization procedure is improved by an overall degradation strategy, which aims to re-initialize employed bees simultaneously during one iteration (instead of re-initializing only one in each iteration). In BE-ABC algorithm, the optimization efficiency is evaluated through the unsuccessful trail attempts. In the next subsection, the principle of BE-ABC algorithm is presented.

Algorithm principle

BE-ABC algorithm involves three types of bees: unemployed, employed, and onlooker. Unemployed bees search in a fully random manner in space while employed bees form a cooperative group to search in a global manner. Onlooker bees follow “qualified” employed bees to exploit locally. A qualified employed bee is one that finds a high-quality nectar source [31]. The location of a nectar source resembles a candidate solution to the concerned free energy minimization problem and the nectar quantity is evaluated by the function Energy(⋅). Considering that the search space (wherein all nectar sources exist) is not infinite, we impose bounds on the solution vectors: let X = [θ 1, …, θ N − 2, β 1, …, β N − 3]1 × (2N − 5) represent a general candidate solution, then Lb ≤ X ≤ Ub, where Ub and Lb denote the upper/lower bound. In this current study, we have Ub = (180, …, 180)1 × (2N − 5) and Lb = (−180, …, − 180)1 × (2N − 5). To simplify the presentation, in the remaining of this paper, we refer to X as (X 1, X 2, ⋯, X Dim), Ub as (U 1, U 2, ⋯, U Dim) and Lb as (L 1, L 2, ⋯, L Dim) when Dim = 2N − 5.

A bee colony consists of SN bees, one half of which are unemployed bees and the other half onlooker bees. Before the optimization starts, those \( \raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right. \) unemployed bees are randomly initialized in the search space. The following equation shows how the jth element of the ith employed bee’s location X i is generated:

$$ \begin{array}{l}{X}_i^j\leftarrow {L}^j+ rand\left(0,1\right)\cdot \left({U}^j-{L}^j\right),\kern0.5em \\ {}\kern6.25em i=1,2,\dots, \raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right.,\ j=1,2,\dots, Dim,\end{array} $$
(5)

where rand(0, 1) denotes a random number (range of 0 to 1) obeying uniform distribution. After the initialization procedure, these unemployed bees directly become employed bees. Then an iterative optimization process gets started.

In each cycle of iteration, \( \raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right. \) employed bees try new positions within the search space. The new positions are generated by sharing locations with each other. For instance, the ith employed bee X i is assumed at (X 1 i , X 2 i , ⋯, X Dim i ) and the to-be-computed search location X i is denoted by (X ∗ 1 i , X ∗ 2 i , ⋯, X ∗ Dim i ). First, as many as trial(i) out of all Dim dimensions in X i are randomly selected. trial(i) is an integral scalar (range of 1 to Dim) associated with the ith employed bee, which is formally introduced below. Second, the element of each selected dimension is mutated through a crossover procedure. The following equation shows how the ith employed bee takes usage of the randomly chosen kth employed bee’s location to update the jth dimension:

$$ \begin{array}{l}{X^{\ast}}_i^j\leftarrow {X}_i^j+ rand\left(-1,1\right)\cdot \left({X}_k^j-{X}_i^j\right)\cdot \frac{trial(i)}{trial(i)+ trial(k)},\ \\ {}\kern16em k\in \left\{1,2,\dots, \raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right.\right\},\ k\ne i.\end{array} $$
(6)

When X i is available, Energy(X i ) is immediately compared with Energy(X i ). When Energy(X i ) < Energy(X i ), the ith employed bee flies to the new location X , i.e., X i is updated as X i . In addition, trial(i) should be reset to 1. When Energy(X i ) ≥ Energy(X i ), the ith employed bee remains at X i but trial(i) adds one. Based on the principle by which the principle trial(i) is calculated, it is notable that trial(i) records the consecutive unsuccessful trial attempts of the ith employed bee.

When all the employed bees have updated their locations, onlooker bees act accordingly. A roulette selection procedure guides those \( \raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right. \) onlooker bees to choose qualified employed bees to search around. A probability index P is calculated according to Eqs. (7) and (8) to reflect the search quality of the employed bees [46].

$$ P(i)=\frac{fitness(i)}{{\displaystyle {\sum}_{j=1}^{\raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right.} fitness(j)}},\ i=1,2,\dots, \raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right., $$
(7)
$$ fitness(i)=\left\{\begin{array}{ll}\frac{1}{1+ Energy\left({\mathbf{X}}_i\right)}\hfill & \kern2.75em \mathrm{if}\ Energy\left({\mathbf{X}}_i\right)\ \ge 0\hfill \\ {}1- Energy\left({\mathbf{X}}_i\right)\hfill & \kern2.75em \mathrm{if}\ Energy\left({\mathbf{X}}_i\right) < 0\hfill \end{array}.\right. $$
(8)

Let’s take the ith onlooker bee for example to see how its search location can be generated. In order to find an employed bee to follow around, a comparison is made between rand(0, 1) and P(1) first. When P(1) ≥ rand(0, 1), the ith onlooker bee will search around the first employed bee; otherwise, an additional comparison between rand(0, 1) (another random number this time) and P(2) is performed. If all the P(j) \( \left(j=1,2,\dots, \raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right.\right) \) happen to be smaller than rand(0, 1), the process is repeated from the beginning P(1) again until one P(j) that satisfies P(j) ≥ rand(0, 1) is finally found. That finally found jth employed bee is chosen by the ith onlooker bee to search around.

Contrary to the employed-bee phase, the crossover and mutation process involves only one (randomly chosen) dimension of the onlooker bee’s solution vector. The following equation shows how the ith onlooker bee utilizes the mth employed bee to generate a new search location around the jth in the kth dimension:

$$ \begin{array}{l}{Y}_j^k\leftarrow {X}_j^k+ rand\left(-1,1\right)\cdot \left({X}_m^k-{X}_j^k\right)\cdot \frac{trial(j)}{trial(j)+ trial(m)},\ \\ {}\kern12.25em m\in \left\{1,2,\dots, \raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right.\right\},\ m\ne j.\end{array} $$
(9)

When the location of the onlooker bee Y i  = (X 1 j , …, X k − 1 j , Y k j , X k + 1 j , …, X Dim j ) is determined, Energy(X j ) and Energy(Y i ) are compared. When Energy(Y i ) < Energy(X j ), the jth employed bee updates its current location X j as Y i . Also, trial(j) is reset to 1. When Energy(X j ) < Energy(Y i ), trial(j) adds one. When all the onlooker bees have tried around their chosen employed bees, a re-initialization phase at the end of each iteration is followed.

Two things should be accomplished in the re-initialization phase. First, any trial(i) that exceeds Dim is set to Dim. Second, a criterion (i.e., \( \frac{2}{SN}{\displaystyle {\sum}_{i=1}^{\raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right.} trial(i)} \)) is proposed to reflect the overall optimization efficiency and is compared with α odr  ⋅ Dim, where α odr  ∈ (0, 1) is a user-specified threshold. When \( {\alpha}_{odr}\cdot Dim<\frac{2}{SN}{\displaystyle {\sum}_{i=1}^{\raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right.} trial(i)} \), the entire employed bee swarm is considered not working efficiently to a degree of α odr . Then, (100α odr ) % randomly chosen employed bees directly become unemployed bees, which will turn to employed bees at the beginning of the next iteration. The corresponding trial indices are reset to 1. When \( {\alpha}_{odr}\cdot Dim\ge \frac{2}{SN}{\displaystyle {\sum}_{i=1}^{\raisebox{1ex}{$SN$}\!\left/ \!\raisebox{-1ex}{$2$}\right.} trial(i)} \), the current iteration cycle is directly terminated.

When the iteration number reaches a user-specified threshold MCN, the entire optimization process is accomplished. A flowchart of BE-ABC algorithm is depicted in Fig. 3.

Fig. 3
figure 3

Flowchart of BE-ABC algorithm

Results

We systematically design a series of simulations on real protein sequences (Table 1) that originated from the PDB database. All simulations involved in this section were carried out in a Matlab R2011b environment and executed on an Intel Core 2 Duo CPU with 2 GB RAM running at 2.53 GHz under Windows XP. In this work, α odr is set to 0.9.

Table 1 Details of 13 benchmark amino acid sequences

In this section, first, the convergence performance of BE-ABC algorithm is evaluated by comparing it to several ABC variants in short-term numerical experiments. Second, our best-so-far solutions optimized by BE-ABC algorithm are compared to the ones published in previous works. Third, the dynamic process of protein folding is preliminarily investigated through case studies.

Comparison of optimization performance among ABC variants

The convergence performance of BE-ABC algorithm is evaluated on all 13 real sequences. In order to show its advantage, BE-ABC algorithm was compared to the conventional ABC algorithm and its state-of-the-art variants, namely an improved ABC (I-ABC) [51], Gbest-guided ABC (Gbest-ABC) [48], chaotic ABC (C-ABC) [23], and internal feedback ABC (IF-ABC) [7]. In each of the 13 cases, the mentioned algorithms were tested repeatedly for 30 times under the condition that SN = 40. MCN was set to 5000 for relatively short sequences and 20,000 for sequences with more than 85 amino acids. Figures 4, 5, and 6 depict the evolutionary curves in three representative cases (which represent short, medium and long sequences respectively). Results of all these cases are collected in Table 2.

Fig. 4
figure 4

Convergence curves of BE-ABC and other ABC variants when tested on case 1CB3 (Dim = 21, SN = 40, and MCN = 5000)

Fig. 5
figure 5

Convergence curves of BE-ABC and other ABC variants when tested on case 1TZ5 (Dim = 69, SN = 40, and MCN = 5000)

Fig. 6
figure 6

Convergence curves of BE-ABC and other ABC variants when tested on case 2EWH (Dim = 191, SN = 40, and MCN = 20000)

Table 2 Comparative optimization results (average, standard deviations, and time consumptions) obtained using BE-ABC and other ABC variants

Comparison of structures optimized by BE-ABC algorithm with previous literature

This subsection compares the best-so-far structures optimized by BE-ABC algorithm with the structures reported in previous publications. Table 3 lists the lowest free energy values for the concerned 13 sequences obtained using different optimization techniques. Figure 7 illustrates the structures obtained using our BE-ABC algorithm. Details of the optimized structures are provided in Table 4.

Table 3 Comparative results of the lowest free energy values for 13 sequences achieved by an improved particle swarm optimization (I-PSO) algorithm [25], a hybrid optimization that combines PSO, genetic algorithm (GA), and Tabu search (TS) named PGATS [35], a multiple-populations GA-PSO (MPGPSO) algorithm [37], ABC algorithm [36], GA-based TS (GATS) algorithm [21, 27], and our BE-ABC algorithm.
Fig. 7
figure 7

Best-so-far conformation produced by BE-ABC algorithm for a 1CB3; b 1BXL; c 1EDP; d 2H3S; e 2KGU; f 1TZ4; g 1TZ5; h 1AGT; i 1CRN; j 1HVV; k 1GK4; (l) 1PCH and (m) 2EWH

Folding process investigation

During the optimization process wherein BE-ABC searches for the lowest free energy, lower and lower values are sequentially obtained. Therefore, how the protein structure “migrates” during the optimization process can be observed, which may reveal critical insights into the protein folding process that occurs in nature.

Simulations were conducted on sequences 1BXL and 1AGT as examples. The results are depicted in Figs. 8 and 9.

Fig. 8
figure 8

Folding simulation on case 1BXL

Fig. 9
figure 9

Folding simulation on case 1AGT

Discussion

In this section, the simulation results shown in the preceding section are further analyzed.

Performance of BE-ABC algorithm

Through a comprehensive set of short-term numerical experiments, the optimization performance of BE-ABC algorithm is compared to other state-of-the-art ABC variants. The collected data in Table 2 indicate that BE-ABC algorithm outperforms the other ABC variants.

As depicted in Figures 4, 5, and 6, the evolution processes of BE-ABC algorithm do not remain the best among all six ABC variants in some early iterations. The origin of the problem might come from the following aspects of the algorithm. First, intuitively speaking, global search should be enhanced in the early iterations when evolutions are easy to make. During that period, elements in the trial vector are relatively small, yielding the global search scale of BE-ABC algorithm generally smaller than that of the conventional ABC algorithm under the same circumstance. In a simple instance with trial(i) = trial(k) = 1 regarding Eq. (6), the search scale determined by BE-ABC algorithm shrinks to half of that determined by the conventional ABC algorithm. When trial(k) is larger, the shrink becomes severer, which is not good for a global search. Thus the evolution processes based on BE-ABC algorithm commonly lag at the very beginning. Due to the same reason, when making an evolution is not that easy, the benefit of utilizing gradually emerges and that is why BE-ABC algorithm usually catches up with its competitors in the later iterations. The second aspect concerns the overall degradation strategy. Unlike that in the conventional ABC algorithm, BE-ABC algorithm requires that part of the employed bees re-initialized all at once, which happens under some certain circumstance, which is not likely to happen in the early iterations. In contrast, ABC algorithm re-initializes the employed bees one after another when they are judged as inefficient. In other words, the re-initialization procedure in BE-ABC algorithm seldom works during the early iterations. This may cause differences among the evolution curves. However, re-initializing the employed bees one after another (in the conventional ABC algorithm) is not efficacious when evolutions are difficult to make or when SN is set large. The long-lasting convergence performance (based on efficient re-initialization ability) of BE-ABC algorithm is clearly illustrated, especially in Fig. 6.

Moreover, Table 2 shows that BE-ABC algorithm usually consumes less time than IF-ABC and C-ABC algorithms do. This is because BE-ABC algorithm neither adds complicated outside-world strategy nor changes the conventional algorithm framework.

Performance of the optimized best-so-far structures

The best solutions obtained through BE-ABC are compared with those presented in previous literature. The blanks in Table 3 indicate that more cases than those in previous studies have been considered in this present study. In fact, the comparative studies listed in Table 3 constitute all the relevant studies that can be found, to the best of our ability. In nearly all of the cases, BE-ABC algorithm clearly outperforms its competitors.

Here, curious readers may ask how the best solutions were computed. It is a “mystery” how the best solutions in most of the previous publications were computed. In fact, there is not a unified standard that all the publications had followed to generate the best solutions. In our work, we terminated the optimization process when there had been no evolution for up to 5,000,000 consecutive iterations.

Although in some cases our obtained free-energy values are slightly lower than previously published ones, the optimized structures are completely different. The best-so-far structures reported in refs. [21] and [27] for cases 1CB3, 1BXL, and 1EDP are depicted in Fig. 10. Also in that figure, an index ΔE is utilized to reflect the error between their reported free energy and ours; another root of mean square error index RMSE (defined in Eq. (10)) measures the difference between two concerned solution vectors.

Fig. 10
figure 10

Best-so-far structures optimized by GATS algorithm: a for case 1CB3; b for case 1BXL; c for case 1EDP. In each case, ΔE denotes the error between free energies obtained by GATS and BE-ABC. The RMSE index reflects how close two structures optimized by BE-ABC and GATS are

$$ RMSE\left(\mathbf{X},\mathbf{Y}\right)=\sqrt{\frac{1}{Dim}\left({\displaystyle {\sum}_{i=1}^{Dim}{\left({X}_i-{Y}_i\right)}^2}\right)}, $$
(10)

where X = [X1, X 2, …, X Dim ] and Y = [Y1, Y 2, …, Y Dim ].

Commonly in a natural protein structure, the hydrophobic residues tend to form a core with a minimum surface area that encounters water molecules. Thus such a hydrophobic core tends to be surrounded by hydrophilic residues so as to prevent encountering water molecules outside. In Fig. 7 (where the light-colored particles reflect hydrophobic residues while those dark-colored ones reflect hydrophilic residues), that characteristic appearing in nature can be reflected in nearly all of the simulated cases.

Performance of off-lattice model

In this research study, we have also investigated the protein folding process which is expressed on the basis of the off-lattice model. Assuming that the structure optimization process is the natural process for a protein structure to be folded, we observe that all the hydrophobic residues gradually assemble to become one central core surrounded by hydrophilic residues in cases 1BXL and 1AGT.

Although the off-lattice model roughly classifies all the amino acids in a sequence into two groups, some critical characteristics during the natural folding process can be observed in our artificial optimization process, possibly providing an initial guess for protein folding simulation with more complicated models.

Conclusions

In this paper, we have investigated protein folding optimization on the basis of a coarse-grained mode (i.e., the 3D off-lattice model) and an improved ABC algorithm. The underlying contributions of this paper are summarized as follows.

First, this work presents an application of a relatively new metaheuristic optimization technique. The optimization performance of BE-ABC algorithm is comprehensively evaluated by comparing it with several state-of-the-art ABC variants.

Second, 13 real protein sequences, rather than artificial sequences, are chosen for optimization. To the best of our knowledge, none of the previous works have conducted simulations in such a comprehensive manner. The simulated protein sequences and obtained optimization results form a benchmark case set, which can be used as a standard set by future researchers.

Last, the protein folding process is investigated on the basis of the off-lattice model through a viewpoint of evolution. Previous studies used their concerned optimization techniques to solve a numerical problem but quite few of those studies further investigated the efficiency of the off-lattice model. Simulation results in this study indicate that the off-lattice model provides a rough estimation of the folding process of proteins.

Investigations about how the optimized results benefit the area of molecular biology should be made through future studies. Effects are also needed to measure how close the optimized structures are to the existing ones in the PDB database and to reveal more details about the protein folding process.