1 Introduction

Proteins are required for growth and development in the human body. A protein, also called a polypeptide, is composed of a chain of amino acids, which are linked together by peptide bonds to form the protein backbone. The amino acid is the fundamental unit of a protein. There are twenty kinds of standard amino acids, and each kind is distinguished by its R group. Each protein has its own specific function and unique structure, and it can cooperate with other proteins to achieve a required functionality. Because protein structure and protein function are closely related, most biologists study protein function by predicting protein tertiary structures from amino acid sequences or other related information.

There are two main types of approaches to protein structure prediction. The first is the experimental method, including X-ray diffraction and nuclear magnetic resonance (NMR) [10, 14]. The other is the computational method, including homology modeling [5], fold recognition [6], and ab initio prediction [9]. The former requires considerable time and cost, while the latter does not. This motivates us to adopt the computational method to predict the 3D protein structure.

The all-atom protein backbone reconstruction problem (PBRP) is to predict the 3D coordinates of all atoms (N, C and O atoms) on the protein backbone from a protein sequence and the 3D coordinates of its \(\alpha \)-carbons (C\(_{\alpha }\)). Several related studies fall into this category, such as SABBAC [11], Wang’s method [17], Chang’s method [2], BBQ [4], Chen’s method [3], Wu’s method [18] and PD2 [12]. For a complete survey of the methods (or software) for protein backbone prediction, one can refer to the thesis written by Yuan [19].

In order to improve the accuracy of protein structure prediction, we propose a new method to refine the 3D coordinates of all backbone atoms by means of support vector regression (SVR) [15, 16]. Our prediction targets are the differences between the N and O coordinates predicted by PD2 and the real 3D coordinates in the Protein Data Bank (PDB), and the differences between the C coordinates predicted by BBQ and the real 3D coordinates in PDB. Our training features are generated from the predicted results of PD2 and BBQ.

The experimental datasets range from CASP7 to CASP11, where CASP stands for the Critical Assessment of Protein Structure Prediction [13]. We perform a 5-fold cross validation experiment for performance evaluation. In each fold, one CASP dataset is held out for testing and the remaining CASP datasets are used for training. Coordinates in the training datasets are first predicted by PD2 and BBQ. These predicted coordinates are then compared with the real coordinates to produce the differences, which are the learning objective for the SVR. In the feature selection stage, we analyze the correlations between the objective value and the available features, and then select the most representative features. To ease the training process, we partition the amino acids into twenty groups (twenty datasets) and predict the differences with an individual SVR for each group. Finally, we combine our predicted differences with the N and O coordinates predicted by PD2 and the C coordinates predicted by BBQ to produce our predicted coordinates of N, C and O. The performance is evaluated by the RMSD values. The experimental results show that our predictions yield about \(8\,\%\) improvement over the results predicted by PD2, which is the most accurate previous predictor for the problem [19].

The rest of this paper is organized as follows. In Sect. 2, we will introduce experimental datasets, root-mean-square deviation (RMSD) and features used in this paper. In Sect. 3, we will describe our proposed method in detail. In Sect. 4, we will present our experimental results. Finally, in Sect. 5, the conclusion will be given.

2 Preliminaries

2.1 Datasets and Performance Evaluation

Critical Assessment of Protein Structure Prediction (CASP) [13] is an international competition held every two years since 1994. The main goal of CASP is to evaluate the capability of methods for identifying the three-dimensional structure of a protein from its amino acid sequence. In order to assess the performance of a method, CASP examines the predicted 3D structures in many different ways, such as the accuracy of a model and the accuracy of a quaternary structure. Because our research also focuses on 3D structure prediction, we use the CASP datasets to perform our experiments.

Root-mean-square deviation (RMSD) [7, 8] is a measure used in molecular modeling that computes the average distance between the predicted values and the ground truth. It is defined as

$$\begin{aligned} RMSD=\sqrt{\frac{1}{l}\sum _{i=1}^{l}(X_{i}^{A}-X_{i}^{B})^{2}}, \end{aligned}$$
(1)

where \(X_{i}^{A}\) and \(X_{i}^{B}\) denote the coordinates of the ith atom on the backbone in the proteins A and B, respectively, and l denotes the length of the proteins. Generally, a lower RMSD indicates a higher similarity, which means that the predicted coordinates are close to the real ones. RMSD has been widely used in structural biology. In this paper, we also use RMSD to evaluate the quality of the prediction models.
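As a minimal sketch of Eq. 1, assuming that each \(X_{i}\) is an (x, y, z) point and that \((X_{i}^{A}-X_{i}^{B})^{2}\) means the squared Euclidean distance between the two points, the RMSD can be computed as follows. The function name `rmsd` and the sample coordinates are illustrative only and do not come from the experiments.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points (Eq. 1)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared Euclidean distances between corresponding atoms.
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))

# Made-up coordinates: each atom is displaced by exactly one unit.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
actual = [(0.0, 0.0, 1.0), (1.0, 0.0, -1.0)]
print(rmsd(predicted, actual))
```

Identical inputs give an RMSD of zero, which is the lower bound of the measure.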

2.2 Feature Generation and Feature Selection

This subsection describes the features we use to build the SVR models. All features are extracted within a fragment, as illustrated in Fig. 1. That is, once the prediction target, such as the C atom, is determined, we define a window around this atom. This atom and its surrounding atoms constitute a fragment. The features required by SVR are calculated as follows.

Fig. 1. The fragment of \(L_{C}=6\) consecutive atoms on the protein backbone, where the C atom is assigned to be the fragment center.

  • Coordinate: The predicted x, y and z coordinates of N, C, O and N\(_{next}\) atoms are obtained from BBQ and PD2, denoted as \(N_x(BBQ)\), \(N_y(BBQ)\), \(C_z(PD2)\), etc. The real coordinates of the two C\(_{\alpha }\) atoms are also used as features. Thus, there are 30 features in total.

  • Coordinate difference: The coordinate differences are calculated from the predicted x, y and z coordinates of N, C, O and N\(_{next}\) atoms. Only the difference of each corresponding pair of coordinates is computed, such as \(N_x(PD2)-N_x(BBQ)\), \(N_y(PD2)-N_y(BBQ)\), etc. Thus, 12 features are obtained.

  • Euclidean distance: The Euclidean distance measures the distance between the two predicted positions, obtained from BBQ and PD2, of the same atom. It is calculated as follows.

    $$\begin{aligned} d(x_{p_{i}}, x_{b_{i}}, y_{p_{i}}, y_{b_{i}}, z_{p_{i}}, z_{b_{i}}) = \sqrt{\left( x_{p_{i}}- x_{b_{i}} \right) ^{2}+\left( y_{p_{i}}- y_{b_{i}} \right) ^{2}+\left( z_{p_{i}}- z_{b_{i}} \right) ^{2}} \end{aligned}$$
    (2)

    In Eq. 2, \(x_{p_{i}}\), \(y_{p_{i}}\), \(z_{p_{i}}\) are the 3D coordinates of the ith atom on the protein backbone predicted by PD2 and \(x_{b_{i}}\), \(y_{b_{i}}\), \(z_{b_{i}}\) are the 3D coordinates of the ith atom on the protein backbone predicted by BBQ.

  • Bond length: Two adjacent atoms form a bond on the protein backbone. There are five bond lengths in a fragment, including N-C\(_{\alpha }\), C\(_{\alpha }\)-C, C=O, C-N\(_{next}\), and N\(_{next}\)-C\(_{\alpha _{next}}\). The bond length is calculated in terms of the Euclidean distance. Because bond lengths associated with BBQ and PD2 are calculated separately, it follows that 10 features are obtained.

  • Bond length difference: Each bond length difference is derived from the two corresponding bond lengths predicted by the two methods.

  • Bond angle: Since three atoms form an angle, the bond angle can be obtained by the law of cosines. Consequently, for a fragment with \(L_C= 6\), we can get five different angles, including N-C\(_{\alpha }\)-C, C\(_{\alpha }\)-C=O, C\(_{\alpha }\)-C-N\(_{next}\), O=C-N\(_{next}\), and C-N\(_{next}\)-C\(_{\alpha _{next}}\).

  • Bond angle difference: Each bond angle difference is calculated from the two corresponding bond angles predicted by the two methods.

  • Torsion angle: The torsion angle is computed from four consecutive atoms on the main chain. In addition to \({\phi }\)(C\(_{prev}\)-N-C\(_{\alpha }\)-C), \({\psi }\)(N-C\(_{\alpha }\)-C-N\(_{next}\)) and \({\omega }\)(C\(_{\alpha } \)-C-N\(_{next}\)-C\(_{\alpha _{next}}\)), we also include other torsion angles in our fragment, namely N-C\(_{\alpha }\)-C=O, O=C-N\(_{next}\)-C\(_{\alpha _{next}}\), C\(_{\alpha }\)-C=O-N\(_{next}\), C\(_{\alpha }\)-C-N\(_{next}\)=O (planes of C\(_{\alpha }\)-C-N\(_{next}\) and C-N\(_{next}\)=O), N\(_{next}\)-C-C\(_{\alpha }\)=O and C-N\(_{next}\)-C\(_{\alpha _{next}}\)-C\(_{next}\). Since the torsion angles are computed for both PD2 and BBQ, 18 features are obtained.

  • Torsion angle difference: Each torsion angle difference is computed from the two corresponding torsion angles obtained by PD2 and BBQ.
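The geometric quantities in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: points are (x, y, z) tuples, angles are reported in degrees (the unit is not specified in the text), the torsion sign follows the common convention in which a clockwise rotation viewed along the central bond is positive, and all function names are hypothetical.

```python
import math

def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def _dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def distance(p, q):
    """Euclidean distance between two 3D points (Eq. 2); also a bond length."""
    d = _sub(p, q)
    return math.sqrt(_dot(d, d))

def bond_angle(a, b, c):
    """Angle a-b-c at the middle atom b, in degrees, by the law of cosines."""
    ab, bc, ac = distance(a, b), distance(b, c), distance(a, c)
    cos_b = (ab ** 2 + bc ** 2 - ac ** 2) / (2.0 * ab * bc)
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_b))))

def torsion_angle(p0, p1, p2, p3):
    """Signed torsion angle defined by four atoms, in degrees."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)  # normals of the two planes
    b1_len = math.sqrt(_dot(b1, b1))
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b1) / b1_len
    return math.degrees(math.atan2(y, x))

# Made-up positions of the same O atom as predicted by the two methods.
o_pd2, o_bbq = (1.0, 2.0, 3.0), (1.0, 2.0, 5.0)
print(distance(o_pd2, o_bbq))
```

A difference feature is then the value computed from the PD2 coordinates minus the value computed from the BBQ coordinates, for example `bond_angle(...)` on the PD2 atoms minus `bond_angle(...)` on the BBQ atoms.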

So far, we have defined nine kinds of features. Table 1 shows the feature names and their sizes.

Table 1. The names and sizes of all feature subsets.
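As a rough consistency check, the nine subset sizes can be summed. Only the sizes 30 (coordinate), 12 (coordinate difference), 10 (bond length) and 18 (torsion angle) are stated explicitly above. The remaining sizes are assumptions inferred from the descriptions: 4 Euclidean distances (one per N, C, O and N\(_{next}\) atom), 5 bond length differences, 10 bond angles (five angles for each of the two methods), 5 bond angle differences and 9 torsion angle differences. Under these assumptions, in the order in which the subsets are described,

$$\begin{aligned} 30+12+4+10+5+10+5+18+9=103, \end{aligned}$$

which agrees with the 103 feature values used in Step 1 of the algorithm in Sect. 3.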

The above feature extraction method is performed around the C atom in a fragment-by-fragment manner, as shown in Fig. 1. We assign the C atom as the fragment center because this arrangement is most suitable for predicting both C and O atoms. Nevertheless, one may wonder whether this is also suitable when the N atom serves as the prediction target. Therefore, we assign the N atom as the fragment center and perform another experiment. The experimental results (not shown in this paper) indicate that the prediction accuracies of N atoms with the C-center window and with the N-center window differ only slightly. Thus, we still use the C-center window for predicting N atoms here.

Since the performance of the models depends heavily on the selected features, we have to consider which features are relevant to the coordinate prediction. In order to identify important features, we calculate Pearson’s correlation coefficient between the objective value and each feature value. For a given feature, its correlation coefficient with the objective variable is given in Eq. 3.

$$\begin{aligned} r =\frac{\sum _{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum _{i=1}^{n}(x_{i}-\bar{x})^{2}}\sqrt{\sum _{i=1}^{n}(y_{i}-\bar{y})^{2}}}, \end{aligned}$$
(3)

where n denotes the number of data elements, \(x_{i}\) denotes the ith element of the feature values (\(x_{1}\), \(x_{2}\), ..., \(x_{n}\)), \(y_{i}\) denotes the ith element of the objective values (\(y_{1}\), \(y_2\), ..., \(y_{n}\)), and \(\bar{x}\) and \(\bar{y}\) represent the means of x and y, respectively.
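Eq. 3 translates directly into code. The following is a minimal sketch using only the standard library; the function name `pearson_r` and the toy data are illustrative, and the p-value used later for thresholding is not computed here.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences (Eq. 3)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: product of the square roots of the summed squared deviations.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (math.sqrt(ss_x) * math.sqrt(ss_y))

# Toy data: a feature that is perfectly linearly related to the objective.
feature = [1.0, 2.0, 3.0, 4.0]
objective = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(feature, objective))
```

The coefficient lies in the range from -1 to 1, with values near zero indicating little linear relationship between the feature and the objective value.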

3 The Coordinate Difference Prediction Method

In order to improve the predicted results, we adopt SVR to predict the x-difference, y-difference and z-difference of each of N, C and O atoms on the backbone of a target protein. Then, these differences are combined with the predicted results of BBQ and PD2 to yield our predicted coordinates. Our coordinate difference prediction procedure is described as follows.

  • Algorithm: The Coordinate Difference Prediction Method.

  • Input: 1. One training set T, containing the predicted coordinates obtained from PD2 and BBQ and the real coordinates in PDB.

    2. One target protein, containing the predicted coordinates of PD2 and BBQ, along with real C\(_{\alpha }\) coordinates.

  • Output: The predicted coordinates of N, C and O atoms along the target protein backbone.

  • Step 1 (Extract features): Partition the residues of all proteins in T into 20 groups, corresponding to 20 types of standard amino acids. Calculate the 103 feature values associated with each kind of residue, defined in the previous section.

  • Step 2 (Perform correlation analysis): For each of the nine objective values (\(O_x\)-difference, \(O_y\)-difference, \(O_z\)-difference, \(N_x\)-difference, etc.) in T, calculate Pearson’s correlation coefficient between each feature value and the objective value. Since the p-value indicates the statistical significance of the correlation coefficient, we adopt a thresholding method to identify significant features.

  • Step 3 (Predict the difference by SVR): For each combination of objective value and amino acid group, we use the selected features to train an SVR model. Thus, 180 models (20 kinds of residues, 3 kinds of atoms, 3 coordinates) are obtained. Then, these models are invoked to perform the prediction for the target protein based on the residue and atom types.

  • Step 4 (Combine the predicted difference with PD2/BBQ): Combine the predicted differences with their corresponding predicted positions obtained by BBQ and PD2 to generate the final coordinates.

  • Step 5 (Merge all residues together): Bring the predicted coordinates of all residues together to reconstruct the 3D positions of all atoms (N, C and O atoms) on the target protein backbone.
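The steps above can be sketched as follows. This is not the authors' implementation: the row layout and function names are hypothetical, the difference is assumed to be "real minus predicted" (the sign is not specified in the text), and a trivial `MeanRegressor` stands in for the SVR so that the sketch runs without external libraries. Each row already carries the baseline coordinate, taken from PD2 for N and O atoms and from BBQ for C atoms.

```python
from collections import defaultdict

class MeanRegressor:
    """Stand-in for an SVR model: always predicts the mean training target."""
    def fit(self, features, targets):
        self.mean = sum(targets) / len(targets)
        return self

    def predict(self, features):
        return [self.mean for _ in features]

def train_models(training_rows, make_model=MeanRegressor):
    """Train one model per (residue type, atom, axis) key (Steps 1 and 3).

    Each row is (residue, atom, axis, feature_vector, baseline, real).
    """
    grouped = defaultdict(lambda: ([], []))
    for residue, atom, axis, features, baseline, real in training_rows:
        xs, ys = grouped[(residue, atom, axis)]
        xs.append(features)
        ys.append(real - baseline)  # assumed sign: real minus predicted
    return {key: make_model().fit(xs, ys) for key, (xs, ys) in grouped.items()}

def refine(models, target_rows):
    """Add the predicted difference to each baseline coordinate (Steps 3 to 5).

    Each row is (residue, atom, axis, feature_vector, baseline).
    """
    refined = []
    for residue, atom, axis, features, baseline in target_rows:
        model = models.get((residue, atom, axis))
        # Fall back to the unrefined baseline if no model was trained.
        diff = model.predict([features])[0] if model is not None else 0.0
        refined.append(baseline + diff)
    return refined

# Made-up training rows for a single (residue, atom, axis) combination.
train_rows = [
    ("ALA", "O", "x", [0.1], 1.0, 1.5),
    ("ALA", "O", "x", [0.2], 2.0, 2.3),
]
models = train_models(train_rows)
print(refine(models, [("ALA", "O", "x", [0.15], 3.0),
                      ("GLY", "N", "y", [0.0], 5.0)]))
```

With all 20 residue types, 3 atoms and 3 axes present in the training set, `train_models` would return the 180 models mentioned in Step 3. Feature selection (Step 2) is assumed to have been applied to the feature vectors beforehand.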

The flow chart is shown in Fig. 2.

Fig. 2. The flow chart for predicting the 3D structure of a protein.

4 Experimental Results

For evaluating the performance of our method, we adopt CASP7, CASP8, CASP9, CASP10 and CASP11 as the experimental datasets, which contain 65, 52, 63, 39 and 55 proteins, respectively. We use only the information of chain A of proteins to carry out the experiments. If there is no chain A, the next chain is used. All features are scaled into the range of \([a, b] = [-1, 1]\) by Eq. 4.

$$\begin{aligned} \frac{{x_{i}}-{x_{min}}}{{x_{max}}-{x_{min}}}(b-a)+a, \end{aligned}$$
(4)

where \(x_{i}\) denotes the value of a certain feature of the ith training data element, \(x_{max}\) is the maximum value of the feature, \(x_{min}\) is the minimum value, and a and b are the lower and upper bounds of the range, respectively.
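A minimal sketch of Eq. 4 is given below. The function name `min_max_scale` is illustrative, and the handling of a constant feature (where Eq. 4 would divide by zero) is an assumption added here, not something specified in the text.

```python
def min_max_scale(values, a=-1.0, b=1.0):
    """Linearly scale one feature column into the range [a, b] (Eq. 4)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Eq. 4 is undefined for a constant feature; mapping it to the
        # midpoint of the range is an arbitrary choice made for this sketch.
        return [(a + b) / 2.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) * (b - a) + a for x in values]

# Toy feature column: the minimum maps to a and the maximum maps to b.
print(min_max_scale([0.0, 5.0, 10.0]))
```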

In the 5-fold cross validation, each CASP dataset is selected as the testing dataset exactly once. Once a CASP dataset is chosen for testing, the remaining CASP datasets serve as the training datasets. For example, if we select CASP7 as the testing dataset, then the remaining ones, from CASP8 to CASP11, are used as the training dataset. The testing procedure is performed for each CASP dataset.
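The fold construction can be sketched as a leave-one-dataset-out split. The protein counts are those given above; the names `CASP_SIZES` and `leave_one_dataset_out` are illustrative.

```python
# Number of proteins per dataset, as stated in the text.
CASP_SIZES = {"CASP7": 65, "CASP8": 52, "CASP9": 63, "CASP10": 39, "CASP11": 55}

def leave_one_dataset_out(dataset_names):
    """Yield (training datasets, testing dataset) pairs, one per dataset."""
    for test_name in dataset_names:
        train_names = [name for name in dataset_names if name != test_name]
        yield train_names, test_name

for train_names, test_name in leave_one_dataset_out(list(CASP_SIZES)):
    n_train = sum(CASP_SIZES[name] for name in train_names)
    print(test_name, "tested with", n_train, "training proteins")
```

Unlike a random 5-fold split, the folds here have unequal sizes because each fold corresponds to one complete CASP dataset.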

As mentioned in the previous section, in our algorithm, Step 1 generates the feature candidates and Step 2 calculates correlation coefficients of these features and the objective values. From the correlation analysis, we find that some of the features are indeed significant.

Figure 3 shows the correlations between the 103 features and the \(O_x\)-differences of all 20 amino acids obtained from CASP8 to CASP11. For each feature, we count the number of amino acid types for which the correlation is significant, that is, for which the p-value (the complement of the confidence level) is less than 0.05. The value of 0.05 is a widely adopted standard cut-off. It denotes that the test shows strong evidence against the null hypothesis that there is no correlation between the objective value and the feature value. Note that the maximum count of significances is 20, because there are 20 types of amino acids. Here, a feature is considered significant, and is selected for training, if one of the following criteria is satisfied.

  1. The feature has correlation value greater than or equal to 0.15 and the number of its significance counts is greater than or equal to 15.

  2. The feature has correlation value greater than or equal to 0.4 and the number of its significance counts is greater than or equal to 10.
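The two criteria can be expressed as a simple predicate. In this sketch, each feature is summarized by one correlation value and one significance count. How the single correlation value is aggregated over the 20 amino acid groups, and whether its absolute value is taken, is not specified in the text, so the inputs are assumed to be precomputed. The feature indices and numbers below are made up for illustration.

```python
def is_significant(correlation, significance_count):
    """Return True if a feature meets either of the two selection criteria."""
    criterion_1 = correlation >= 0.15 and significance_count >= 15
    criterion_2 = correlation >= 0.4 and significance_count >= 10
    return criterion_1 or criterion_2

def select_features(stats):
    """stats maps a feature index to (correlation, significance_count)."""
    return sorted(
        index for index, (corr, count) in stats.items()
        if is_significant(corr, count)
    )

# Hypothetical statistics for five features (not taken from the experiments).
example_stats = {
    1: (0.45, 20),  # meets both criteria
    2: (0.50, 12),  # meets only criterion 2
    3: (0.20, 16),  # meets only criterion 1
    4: (0.30, 12),  # correlation too low for 2, count too low for 1
    5: (0.10, 20),  # correlation too low for both
}
print(select_features(example_stats))
```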

Fig. 3. The correlations of 103 features and \(O_x\)-differences of 20 amino acids in the training set consisting of CASP8 to CASP11. (Color figure online)

Fig. 4. The counts of significant occurrences in various features with respect to \(O_x\)-differences in the training set consisting of CASP8 to CASP11.

Furthermore, Fig. 4 illustrates the count of significances of the 103 features with respect to \(O_x\)-differences obtained from CASP8 to CASP11. According to our criteria, we can extract three significant features (indices 34, 37 and 40), which are the predicted differences \(C_x(PD2)- C_x(BBQ)\), \(O_x(PD2) - O_x(BBQ)\) and \(N_{next,x}(PD2) - N_{next,x}(BBQ)\), respectively. That is to say, these features highly correlate with our target \(O_x\)-differences. Similar results are revealed for y-coordinates and z-coordinates. All significant features over all CASPs are listed in Table 2.

Next, we use these selected features to train the SVR models. Here we adopt LibSVM [1] to perform our experiments. We use the RBF (radial basis function) kernel and leave the three SVR parameters (the cost C, the tube width \(\epsilon \) and the kernel hyperparameter \(\gamma \)) at their default values.

The performance of our method is evaluated by RMSD. Table 3 shows the RMSDs of individual N, C and O atoms for BBQ, PD2 and our method under the 5-fold cross validation. The improvement of O prediction over PD2 is about 7.7 %. The improvement of C prediction over PD2 is about 13.2 %, which combines the gain of BBQ over PD2 (about 6.4 %) and the gain of our method over BBQ (about 7.8 %). Table 4 shows the overall RMSDs of BBQ, PD2 and our method. Our method yields 8.03 % improvement over PD2.
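The text does not spell out how the improvement percentages are computed. A common definition, assumed here, is the relative reduction in RMSD with respect to the baseline predictor. The function name and the numbers below are illustrative only and are not taken from Tables 3 and 4.

```python
def improvement_percent(baseline_rmsd, new_rmsd):
    """Relative RMSD reduction over a baseline, in percent (assumed definition)."""
    if baseline_rmsd <= 0:
        raise ValueError("baseline RMSD must be positive")
    return (baseline_rmsd - new_rmsd) / baseline_rmsd * 100.0

# Made-up values: reducing an RMSD of 2.0 to 1.5 is a 25 % improvement.
print(improvement_percent(2.0, 1.5))
```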

Table 2. The significant features selected in the nine objective values of the 5-fold cross validation experiments. The bold underlined one means that the feature is chosen in all training sets.

Table 3. The average RMSDs of individual atoms (N, C and O) of BBQ, PD2 and our method in CASP7, CASP8, CASP9, CASP10 and CASP11 datasets. Here, the percentage inside parentheses means the improvement over PD2.

Table 4. The RMSDs of all atoms of BBQ, PD2 and our method for CASP7, CASP8, CASP9, CASP10 and CASP11 datasets.

5 Conclusion

In the past decades, considerable effort has been devoted to the study of the protein backbone reconstruction problem. Methods such as BBQ and PD2 have already been successfully developed and applied to the problem. Since these methods have their individual strengths and weaknesses, the prediction accuracy can be improved by taking advantage of their strengths.

In this paper, we propose an algorithm to refine the 3D coordinates of all atoms on a protein backbone with SVR. The objective values of our prediction are the differences between the predicted coordinates and the real ones. We first define a set of feature candidates extracted from the predicted coordinates of BBQ and PD2. It is well known that feature relevance is a key factor affecting the prediction performance. Thus, we perform the correlation analysis to identify significant features. The experimental datasets range from CASP7 to CASP11. As the experimental results show, the three most significant features for predicting the \(O_x\)-differences and \(C_x\)-differences are the differences between the x-coordinates predicted by PD2 and BBQ for the C, O and N\(_{next}\) atoms. Similar results are obtained for the y-coordinates and z-coordinates. In summary, our method yields about 8 % improvement in RMSD over PD2, which is the best previous predictor for this problem.