1 Background

Three-dimensional structure of a protein determines its chemical and physical properties. The premier methods of determining the three-dimensional (3D) structure of a protein include Cryo-TEM, X-ray crystallography and NMR-spectroscopy. However, these experimental methods are comparatively expensive and time consuming. When genome sequencing becomes much faster and cheaper, the gap between the accumulated protein sequences and the available protein 3D structures have become larger. Therefore, there is a high demand for applying computational methods in predicting protein 3D structures solely from amino acid sequences. The Anfison’s dogma states that, at least for small globular proteins, the native structure is determined only by the protein’s amino acid sequence [1]. This dogma is the basis for most computational techniques used for predicting structure for globular proteins. The dogma amounts to saying that, the native structure is a unique, stable and minimum of the free energy at the environmental condition. However, due to the complexity of free energy surface of protein folding, protein structure prediction is still a very challenging problem in computational biology.

A variety of mathematical models have been developed to prediction protein structure from de novo. As free energy surface of protein folding are very complicated, some simplified models have been proposed. In those simplified models the atomic details are sacrificed, however it preserves important features in searching low-energy conformations of protein, and therefore provides insight into protein folding process. One of the most extensively studied model is Dill’s hydrophobic-hydrophilic (HP) model [2], in which each amino acid of a protein can be one of the two types: H (hydrophobic, i.e. nopolar) or P (hydrophilic, i.e. polar). There are basically two types of HP models: HP lattice models and HP off-lattice models. In the lattice models, the conformation of the model molecular is defined within a certain type of lattice, which means each residue can only move between adjacent lattice points. In the off-lattice models, the angle between adjacent residues can be any arbitrary value. A popular model is the AB off-lattice model first proposed by Stillinger et al. in [3], in which A stands for hydrophobic (non-polar) residues and B for hydrophilic (polar) residues. In this model, the distances between two adjacent residues are held fixed, and potential energy is contributed from each angle between successive bonds. The off-lattice model can better represent real proteins. However, its potential energy function is highly non-linear and contains a huge number of local minimum points, making protein structure prediction a typical NP-hard problem. To improve global convergence, a variety of global optimization algorithms have been put forward, including colony optimization [4], genetic-annealing algorithm [5, 6], Particle Swarm Optimization (PSO) algorithm [7, 8] and so on.

In previous work using off-lattice models and optimization algorithms, the contributions from all residues of a protein to the potential energy function are calculated simultaneously. In another word, each position of the tested protein is equally weighted in prediction. In this way, with the increase of protein lengths, the computational cost and complexity will increase exponentially. The natural protein synthesis process suggests an alternative strategy to make the protein structure prediction. The ribosome functions as a fully automatic, extremely efficient assembling machine for making proteins, and a polypeptide chain can only be synthesized by one ribosome, from its N-terminal to its C-terminal. It has been found out that the folding process of the nascent chain starts early during the synthesis process, and the amount of the ordered structure increases with the elongation of the peptide chain [9, 10]. Such a co-translational strategy has been exploited in protein folding study with lattice models [11, 12]. In the current work, co-translational strategy has been applied on the 3D off-lattice model with the Quantum-behaved Particle Swarm Optimization (QPSO) algorithm for protein tertiary structure prediction. In this strategy, protein structure is predicted from its N-terminal, and residue is added one by one. To avoid the newly formed trapped in local minimum, we have also added turbulence.

If these assumptions holds true, it will significantly reduce the difficulty of the problem for protein folding prediction in mathematics and has a great significance in biology.

2 Methods

2.1 The AB off-lattice model

The AB off-lattice model was first proposed by Stillinger et al. in [3]. The model can work in two dimensions or three dimensions. In the 2D AB off-lattice model, the bond distances between adjacent amino acids are held fixed to 1, and there is the lowest protein bending energy folding state for a protein from each bond angle θi between successive bonds. The potential energy function ϕ can concisely expresses the spatial structure of the set of amino acids, it can be expressed as:

$$\phi = \sum\limits_{i = 2}^{n - 1} {V_{1} (\theta_{i} )} + \sum\limits_{i = 1}^{n - 2} {\sum\limits_{j = i + 2}^{n} {V_{2} \left( {r_{ij} ,\xi_{i} ,\xi_{j} } \right)} }$$
(1)

Where:

$$V_{1} (\theta_{i} ) = \frac{1}{4}\left( {1 - \cos \theta_{i} } \right)$$
(2)
$$V_{2} (r_{ij} ,\xi_{i} ,\xi_{j} ) = 4\left[ {r_{ij}^{ - 12} - C(\xi_{i} ,\xi_{j} )r_{ij}^{ - 6} } \right]$$
(3)
$$C(\xi_{i} ,\xi_{j} ) = \frac{1}{8}\left( {1 + \xi_{i} + \xi_{j} + 5\xi_{i} \xi_{j} } \right)$$
(4)

Here V1 is the bending potential energy of protein backbone and V2 is the gravitational potential energy between nonadjacent amino acids; rij is the Euler Distance between monomer i and j, ξi is +1 if the ith amino acid is labeled with A, while ξi is − 1 if ith amino acid is labeled with B; C(ξiξj) is + 1, + 0.5 and − 0.5 for AA, BB, and AB pairs respectively, to weigh the attraction among these pairs.

Without changing the energy function, Irback et al. [13] and Hsu et al. [14] proposed the 3D AB off-lattice model based on the 2D version. Recently, energy landscape paving minimizer method [15] and conformational space annealing method [16] have been put forward and have achieved better results in three dimensions.

2.2 Developing a new 3D off-lattice model

In this work, we have proposed a new 3D Off-lattice model, which can take good advantage of the Particle Swarm Optimization (PSO) algorithm to reduce the computation time and easily get the 3D coordinate of each amino acid. In this model, we use horizontal angle αi and vertical angle βi to derive the coordinates of the ith residue. We put the first residue at the Origin Point (0, 0, 0), the second residue at the X axis with the coordinate (1, 0, 0), and the X–Y plane was defined by the first three residues (Fig. 1a). The beginning of vector \({\vec{\text{r}}}_{\text{i}}\) of the ith residue is placed at the end of vector \({\vec{\text{r}}}_{\text{i - 1}}\) of the previous residue. The vector \({\vec{\text{r}}}_{\text{i}}\) is represented by the horizontal bond angle αi and vertical angle βi, in which αi is the angle between the X axis and the projection of the vector \({\vec{\text{r}}}_{\text{i}}\) on the X–Y plane, and βi is the angle between the Z axis and the projection of the vector \({\vec{\text{r}}}_{\text{i}}\) on the X–Y plane (Fig. 1b).

Fig. 1
figure 1

Definition of coordinate of each amino acid. a Shows the definition of X–Y plane, b Shows the definition of αi, βi and θi

We can use horizontal angle αi and vertical angle βi to calculate the coordinates of the ith residue:

$$\left. {\begin{array}{*{20}l} {x_{i} = x_{i - 1} + \cos \alpha_{i} \cdot\cos \beta_{i} } \hfill \\ {y_{i} = y_{i - 1} + \sin \alpha_{i} \cdot\cos \beta_{i} } \hfill \\ {z_{i} = z_{i - 1} + \sin \beta_{i} } \hfill \\ \end{array} } \right\}$$
(5)

The euclidean Distance rij between the ith and the jth amino acids are calculated from the coordinates of the two residues:

$$r_{ij} = \sqrt {\left( {x_{i} - x_{j} } \right)^{2} + \left( {y_{i} - y_{j} } \right)^{2} + \left( {z_{i} - z_{j} } \right)^{2} }$$
(6)

Therefore, the angle θi between vector \({\vec{\text{r}}}_{\text{i + 1}}\) and vector \({\vec{\text{r}}}_{\text{i}}\) can be represented by the αi and βi angles:

$$\theta_{i} = \arccos \left( {\cos \left\langle {\vec{r}_{i} ,\vec{r}_{i - 1} } \right\rangle } \right) = \arccos \left( \begin{aligned} \cos \beta_{i} \cdot \cos \alpha_{i} \cdot \cos \beta_{i - 1} \cdot \cos \alpha_{i - 1} \hfill \\ + \cos \beta_{i} \cdot \sin \alpha_{i} \cdot \cos \beta_{i - 1} \cdot \sin \alpha_{i - 1} \hfill \\ + \sin \beta_{i} \cdot \sin \beta_{i - 1} \hfill \\ \end{aligned} \right)$$
(7)

2.3 The QPSO algorithm

The Particle Swarm Optimization (PSO) algorithm is a population based stochastic optimization algorithm originally proposed by Kennedy and Eberhart [17]. The basic PSO algorithm works by S particles in the N-dimensional search-space by Rn. At each generation t, each particle has a position marked Xi(t) = (Xi1(t), Xi2(t), …, XiN(t)) and a velocity marked Vi(t) = (Vi1(t), Vi2(t), …, ViN(t)). These particles move according to the following equation:

$$V_{ij} (t + 1) = wV_{ij} (t) + c_{1} r_{1} [P_{i} (t) - X_{ij} (t)] + c_{2} r_{2} [G_{j} (t) - X_{ij} (t)]$$
(8)
$$X_{ij} (t + 1) = X_{ij} (t) + V_{ij} (t + 1)$$
(9)

where c1 and c2 are acceleration coefficients representing the cognitive behavior and social behavior; w is coefficient of inertia; r1 and r2 are uniformly distributed random numbers in the interval [0, 1]; Pi(t) is the best previous position of the ith particle and is called personal best position; G(t)is the best solution by any particle in the swarm and is called global best position; Vij(t) is the current velocity; Vij(t + 1) is the new velocity; Xij(t) is the current position; Xij(t + 1) is the new position.

A variety of improved PSO algorithms have been proposed. The Quantum-behaved Particle Swarm Optimization algorithm (QPSO) is a PSO algorithm incorporating properties of quantum mechanics, and this algorithm was proposed by Mikki and Kishk [18]. QPSO is a strong global convergent algorithm and could be an effective means for the optimization problem of complex systems [7, 19]. In the QPSO algorithm, the particles move according to the following equation:

$$X_{ij} (t + 1) = p_{ij} (t) \pm \alpha \left| {C_{j} (t) - X_{ij} (t)} \right| \cdot \ln \left( {\frac{1}{u}} \right)\;\;\;u \in U(0,1)$$
(10)

C(t) is the mean of personal best positions among the particles:

$$C(t) = \frac{1}{M}\sum\limits_{i = 1}^{M} {P_{i} (t)}$$
(11)

The new personal best position of Pi(t + 1) is

$$P_{i} (t + 1) = \varphi P_{i} (t) + (1 - \varphi )G(t)\;\;\;\varphi \in U(0,1)$$
(12)

Where the φ and u are uniformly distributed random numbers in the interval [0, 1], and α is contraction–expansion coefficient.

2.4 Modify the QPSO algorithm for co-translational folding strategy

In this work, we have modified the QPSO algorithm to predict protein structure following the co-translational folding process. We call this modified QPSO algorithm c-QPSO where ‘c’ stands for the co-translational strategy. To do this, we postulate that the newly added residue can choose their own space position to make the newly formed three-dimensional structure achieve lowest free energy, and the addition of the new residue will not drastically disturb the preformed low-energy conformation of the previous residues.

This algorithm works by n particles in the 3d-dimension, in which d is the number of amino acids for a protein. Considering the minimal problem, given the ith particle, let Xi = (ai1, ai2, …, aid, bi1, bi2, …, bid) be its current unit vector and Pi = (pixpiypiz) be its coordinates. Following the co-translational strategy, d is initially set to 3 and will increase according to the iteration number, and Xi+1(t) is defined as follows:

$$X_{ij} (t + 1) = \left\{ {\begin{array}{*{20}l} {p_{ij} (t) \pm \alpha \left| {C_{j} (t) - X_{ij} (t)} \right|\cdot\ln \left( {\frac{1}{u}} \right)} \hfill & {j \in [n,n - 1, \ldots ,n - f + 1,r],u \in U(0,1)} \hfill \\ {X_{ij} (t)} \hfill & {j \notin [n,n - 1, \ldots ,n - f + 1,r]} \hfill \\ \end{array} } \right.$$
(13)

where f is the degree of freedom and represents the new amino acids; r is the random number in the range between 1 and n, and this random number is designed to introduce turbulence to the lowest energy conformation of the previous residues when a new residue is added.

The process of the algorithm is described as following:

  • Step 1: Initialize all of the particles as 6 dimensions, for the ith particle, set Xi,Pi1,Pi2 to(0, 0, 0, 0, 0, 0), (0, 0, 0)and (1, 0, 0) respectively.

  • Step 2: Update each particle’s horizontal angle αid and vertical angle βid randomly in the range of [− π, π].

  • Step 3: Compute the energy function value Ei of each particle, where i = 1, 2, …, n

  • Step 4: Update each particle’s vector;

  • Step 5: Renew global minimum unit vector and compute the global minimum energy;

  • Step 6: repeat step 2, 3, 4 until accuracy requirement of energy function or iteration limit is met;

  • Step 7: Compute 3D coordinate of each residue and output the global minimum energy value.

In this work, for all test sequences, the population size is set to 1000. Maximum iteration is set to 500. All the experiments were run for 50 times and the lowest protein folding energies was calculated. The simulation was run with Delphi 7 on personal computer with the Intel i7-5600 CPU, 4.00 GB RAM, installed with Windows 10 system.

3 Results

3.1 The lowest energy conformations with amino acid sequence length increasing

To verify the second hypothesis, we experimented on 1AGT from the PDB database. The information about its sequence is as follows:

>1AGT:

GVPINVSCTGSPQCIKPCKDQGMRFGKCMNRKCHCTPK

Figure 2 depicts the lowest energy conformations in the 2D off-lattice model obtained by our c-QPSO algorithm, In which black spots and white spots represent hydrophobic A monomers and hydrophilic B monomers, Fig. 3 shows the global lowest energy value decreases gradually with amino acid sequence length increasing. The process indicates that the AB model in two dimensions with 1AGT sequences displays the important feature, the lowest energy protein structure have small changes when a new acid joined.

Fig. 2
figure 2

The lowest energy conformations for the 1AGT sequence with amino acid sequence length increasing predicted by c-QPSO algorithm

Fig. 3
figure 3

The lowest energy values for the 1AGT sequence with different length predicted by c-QPSO algorithm

3.2 Find the best degree of freedom for the c-QPSO algorithm using Fibonacci sequences

To find out the best degree of freedom for the c-QPSO algorithm, we calculated the lowest free energy at different degree of freedom on the four Fibonacci sequences with size 13, 21, 34 and 55. Fibonacci sequences are widely used in protein folding prediction, which are defined recursively Si+1 = Si−1 ∗ Si, where ‘*’ is the connection operator. For example, if S0 = a, S1 = b, then S2 = ab, S3 = bab and so on. In this study we have tested the following four Fibonacci sequences (S0, S1, S2 and S3) studied in the previous work [17], where A denotes hydrophobic residues and B denotes hydrophilic residues.

S0 (13aa): ABBABBABABBAB

S1 (21aa): BABABBABABBABBABABBAB

S2 (34aa): ABBABBABABBABBABABBABABBABBABABBAB

S3 (55aa): BABABBABABBABBABABBABABBABBABABBABBABABBABABBABBABABBAB

We tested on the degree of freedom from 1 to 10, and the Table 1 shows the lowest energies for the four sequences at different freedom degrees. Our calculations have shown that basically the lowest energies occurs when the degree of freedom is 2. Therefore, we choose the number 2 as the degree of freedom of the c-QPSO algorithm for the future predictions. Table 2 lists the lowest energies obtained by PSO, LPSO, ACMC, ELP algorithms for comparison. It can be seen that our results are sufficiently better than the results determined by the other algorithms. For sequence with size 13, 21 and 34, our result is roughly twice times than that of ELP and 1.5 times for sequence with size 55. It can also be seen that c-QPSO algorithm takes significantly less time than PSO method. Figure 4 shows the Protein spatial structure of lowest energy for the four sequences.

Table 1 The relationship between degree of freedom and the lowest energy
Table 2 Lists the lowest energies obtained by other algorithms
Fig. 4
figure 4

Protein spatial structure of lowest energy for the Fibonacci sequences with length 13, 21, 34 and 55 predicted by c-QPSO algorithm. a n = 13; b n = 21; c n = 34; d n = 55. The red balls and grey balls represent hydrophobic A monomers and hydrophilic B monomers

3.3 Compare to other methods using Fibonacci sequences

The same four Fibonacci sequences (S0, S1, S2 and S3) have been used to compare the performance of our c-QPSO algorithm to traditional algorithms including the PSO [8], LPSO [8], ACMC [20], ELP [13] that do not apply the co-translational strategy. It can be seen that for each of the four sequences the lowest free energy predicted by the c-QPSO algorithm is much lower than that determined by the other algorithms (Table 2). When the c-QPSO and traditional PSO algorithms are compared, we can see clearly that applying the co-translational strategy also drastically decrease the computational time (Table 3).

Table 3 Lowest energies of real protein sequence

3.4 Predict on real protein sequences

We then predict on the following four real protein sequences from the PDB database:

>2KGU

GYCAEKGIRCDDIHCCTGLKCKCNASGNCVCRKK

>1CRN

TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

>2KAP

KEACDWLRATGFPQYAQLYEDFLFPIDISLVKREHDFLDRDAIEALCRRLNTLNKCAVMK

>1PCH

AKFSAIITDKVGLHARPASVLAKEASKFSSNITIIANEKQGNLKSIMNVMAMAIKTGTEITIQADGNDADQAIQAIKQTMIDTALIQG

In these sequences, D, E, F, H, K, N, Q, R, S, T, W and Y are treated as hydrophilic amino acids while I, V, L, P, C, M, A and G are hydrophobic ones according to the K-D method. We can see that the structures predicted by c-QPSO can significantly reduce the protein folding energy compared with other methods (Table 4). Figure 5a shows the progression of the lowest free energy at different extrusion lengths for each of the four proteins, and Fig. 5b show the predicted backbone structures of each protein, with the corresponding backbone structures from PDB database shown in Fig. 5c. From these figures, we can see that the structures predicted by c-QPSO can approximately simulate the real protein to some extent.

Table 4 Lowest energies of real protein sequence
Fig. 5
figure 5

Predict on four real protein sequences by c-QPSO algorithm

4 Conclusion

In this paper we proposed two hypotheses to explain protein folding, then we propose a improved 3D Off-lattice model for predicting the protein folding structure, this model is beneficial for Quantum-behaved Particle Swarm Optimization algorithm (QPSO) to reduce the computation time and easily get the 3D coordinate of each amino acid. This method converges fast and has strong global search capability. It is the biggest advantage to reduce the protein folding energy. So we combine QPSO and protein folding structure and propose a new algorithm called c-QPSO. Experiments show that c-QPSO has better performance than basic PSO so it is a feasible solution for this problem.