Introduction

Predicting the three-dimensional structure of a protein based on its amino acid sequence is important for understanding protein function in several biological activities. The Critical Assessment of protein Structure Prediction (CASP) is the current gold standard for predictive protein modeling and classifies the prediction methods into three main approaches [1]: homology modeling [2, 3], threading [46], and ab initio predictions [79]. All three approaches generate a large number of candidate models, which are then assessed for quality.

Homology modeling and threading share many similarities and are known as Template-Based Modeling (TBM). These predictive methods use a protein structure library to identify a “best fit” target sequence to an unknown protein. In contrast, ab initio modeling conducts a conformational search under the guidance of a designed energy function. This energy function first calculates the energy of the searched conformation according to a certain force field (e.g., amber). If the calculated energy is low, the conformation is conserved; if the calculated energy is high, the conformation is abandoned. This procedure usually generates a number of possible conformations (structure decoys), and final models are selected among these [79].

Determining a predictive structure by TBM consists of four steps. First, known structures (templates) are searched for sequences closely related to the target sequence. Second, the target sequence is aligned to the template structure. These two steps are critical for TBM [10].Third, the structural framework is built by copying the structure of the template protein in the aligned regions or by satisfying the spatial restraints from templates (for example, the distance between two atoms cannot be less than 0.1 nm). Finally, the unaligned loop regions are constructed and side-chain atoms are added.

The quality of the generated protein model is dependent on the selected template [11, 12]. Usually, the final choice of template is determined by the statistical significance of the alignment score between the query (target protein) and the template sequences [13]. There are four categories that determine a good match between structure and sequence: (1) a significant sequence match with significant structural match, (2) a significant sequence match with non-significant structural match, (3) a non-significant sequence match with significant structural match, and (4) a non-significant sequence match with non-significant structural match [14]. This suggests the fact that a significant sequence match (e.g., small E-value) does not necessarily result in structures similarities; on the other hand, non-significant sequence alignments might result in similarities in the protein structure. These facts tell us that the alignment score may not correlate with the chosen template’s quality, especially for a distant homolog and, therefore, it is important to identify reliable scoring functions to determine the quality [15].

The disconnect between alignment score and template quality suggests that the current protein modeling methods may not necessary select the best model when searching a template library. To test this hypothesis, we searched the protein “d1bmta1” (SCOP database, http://scop.mrc-lmb.cam.ac.uk/scop/) against a non-redundant database in which one protein was selected for each family in the SCOP database. The MaxSub score (determined using Eq. 1) from the best model was 0.603, while the score for the best alignment was only 0.173 (Li J and Fang H, unpublished observations). MaxSub is defined as:

$${\text{MaxSub}} = \frac{{\sum\nolimits_{i = 1}^{n} {\frac{1}{{1 + \left( {\frac{{d_{i} }}{d}} \right)^{2} }}} }}{q}$$
(1)

and the full description of the formula is provided in [14].

The above data suggest that identifying the model closest to the native conformation (i.e., its MaxSub score) will result in an accurate protein structure prediction. This should be the case for both the template-free approach [16, 17] and the template-alignment approach [18, 19]. There have been several studies that targeted the development of algorithms to address this issue (i.e., estimating the quality of the predicted model) and these have been summarized by Benkert et al. [20, 21].

QMEAN [20], which uses a composite score consisting of a linear combination of six parameters, is one of the best prediction methods currently in use. The six parameters are C_β interaction energy (C_beta), all-atom pairwise energy (PE), solvation energy (SE), torsion angle energy (TAE), secondary structure agreement (SSA), and solvent accessibility agreement (SAA). A higher quality model can be derived from the comprehensive evaluation of the existing score parameters in QMEAN when compared to prediction methods using simply an alignment score or its derived statistics (i.e., E value only).

In this study, we used QMEAN to assess the quality of the protein models generated by the method in our lab [22]. Since the six QMEAN parameters represent different features of a protein, we also used the individual parameter scores to separately assess the quality of the models. Interestingly, we found that SSA has a higher probability of predicting the quality of the protein model than the other five parameters.

Materials and methods

Datasets

Based on SCOP-1.67, we built a non-redundant template database composed of globular proteins. Each template protein in the database derived from different protein families present in the SCOP database. We selected one protein from each family; since there are 2160 families in SCOP-1.67, the database was composed of 2160 template proteins. Due to a large difference between any two selected proteins (generally, the sequence identities are <30 %) and the lack of redundancy, the database constructed was defined as non-redundant. One-hundred-thirty-eight query proteins distantly related to the template database were selected from distinct families for analysis. Moreover, the 138 proteins selected belong to the four classes of globular proteins (i.e., all α, all β, α/β, α + β structures) [23].

Alignment and assessment of the models

Each query protein was aligned with the above template database on the basis of the BLOSUM62 transformation matrix with the dynamic program method; models were then generated according to the alignment. First, each query protein sequence was aligned with the sequence of the templated protein. The sequence identity was calculated as the number of identical residues (between the target sequence and the templated protein sequence) divided by the number of total residues of the shorter of two sequences.

Afterward, the backbone structure of the template protein, obtained after alignment with the target protein, was copied as the predicted structure of target protein. Since there are 2160 proteins in template database and, in theory, each alignment between the query sequence and template protein sequence can generate one model, 2160 models were generated. The alignment quality was calculated using the TMscore algorithm (http://bioinformatics.buffalo.edu/TMscore) [6], which compared the predicted model structure with its experimental structure and then produced MaxSub value.

Details of the TMscore algorithm are described in Li and Fang [22]. The alignment between template protein sequence and target protein sequence includes aligned region and gap (unaligned) region. The backbone structure of the target protein is achieved by copying the backbone structure of template protein into corresponding aligned regions. If a loop of the template protein is exactly in the alignment region (the sequence alignment region between the template protein and the predicted protein), that loop of the template protein could be the loop structure of the predicted protein.

Here, MaxSub is used to estimate the quality of the model because it is independent of the protein size. The quality of each model in this study was also assessed using QMEAN (http://swissmodel.expasy.org/qmean/cgi/index.cgi). For each model, the MaxSub score, the QMEAN score, and the scores from each of the QMEAN’s six parameters were collected.

Results

Relationship between MAXSUB and C_beta, PE, SE, TAE, SSA, and SAA parameters; QMEAN and alignment score

Figure 1 shows the relationship between MaxSub and eight parameters (C_beta, PE, SE, TAE, SSA, SAA, QMEAN, and alignment score) for the randomly selected protein “d1bmta1”. We determined that SSA had the strongest correlation with MAXSUB compared to the other parameters (Fig. 1e). Table 1 summarizes the results from the analysis of the eight parameters using a t test. The t value of SSA was significantly higher than that of the other seven parameters tested. These results indicate that SSA significantly correlates with MaxSub. Moreover, as shown in Table 1, the alignment score was not significantly different than that according to QMEAN. However, the correlation between these two parameters (alignment score and QMEAN) was significantly higher than of the other five QMEAN parameters (C_beta, PE, SE, TAE, and SAA). From the correlation coefficient results, SSA was significantly higher than the others.

Fig. 1
figure 1

Relationship between 8 parameters and MAXSUB for protein d1bmta1. a C_beta interaction energy (C_beta); b all-atom pairwise energy (PE); c solvation energy (SE); d torsion angle energy (TAE); e secondary structure agreement (SSA); f solvent accessibility agreement (SAA); g QMEAN; h alignment score (Score)

Table 1 t test value between the eight parameters evaluated

MaxSub score of selected models

We then identified the near-native model for each target protein based on the MaxSub score of selected models. We investigated the performance of the aforementioned eight parameters using three criteria: the sum of MaxSub, the number of correct predictions [21], and the t value (Table 2). Table 2 shows that the sum of MaxSub scores based on SSA is the highest; followed by C_beta, PE, and QMEAN. Next, we investigated the number of correct models, which required a MaxSub score >0.1 [21, 24]. From higher to lower score, the parameters can be listed as SSA, PE, TAE, C_beta, and QMEAN. Using the t-test, we found the t-values for quality of those models selected by SSA compared with those selected by SE (2.473), TAE (2.905), SAA (4.488), QMEAN (2.215), and alignment score (3.050) were all larger than t (137, 0.005). This indicates that SSA performs best in identifying the near-native protein model when compared to SE, TAE, SAA, and QMEAN. The t value between QMEAN and SAA is 2.944, indicating that QMEAN is significantly better than SAA in identifying the near-native structure. Finally, the t-value between alignment score and C_beta (−2.059) and PE (−2.005) indicates that the performance of C_beta and PE is superior to the alignment score. Moreover, the t-value between alignment score and SAA was 2.132, which was larger than t (137, p = 0.05), indicating that the alignment score is not better than SAA in assessing the quality of the predicted model. Hence, SSA is a best parameter to estimate the quality of a predicted model.

Table 2 Performance comparison of identification of near native models between the eight parameters

Discussion

TBM generates the model for the alignment of the query sequence with the template protein sequence on the basis of the score matrix or the position specific score matrix (i.e., profile). In general, the number of models obtained is equivalent to the number of alignments. The best models are selected according to their alignment score or E value [26].

Unlike TBM, a successful ab initio modeling procedure could help answering the basic questions of how and why a protein adopts a specific structure out of many possibilities. Of note, most current template-free modeling methods still rely on template-based methods by using local sequence matching of fragments present in the database [25]. Similar to TBM, the ab initio prediction usually produces a considerable amount of candidate models. To identify the high quality models among the large number of candidate models is very important, and requires accurate evaluation of model quality.

The secondary structure of a protein is very close to its tertiary structure (i.e., its native structure). A number of researchers have produced hypotheses in order to understand the relationship between secondary structure prediction and tertiary structure. Rost [26] summarized the areas of concern for secondary structure prediction as: regions likely to undergo structural change, proteins that need to be classified based on secondary structure predictions from genome analysis, and predictions from the sequence to the 2D and 3D structures. Ortiz et al. [27] successfully used secondary structure predictions as one component of their 3D structure prediction method. Eyrich et al. [28] minimized the energy of arranging the predicted rigid secondary structure segments. Lomize et al. [29] also started protein structure predictions from secondary structure segments. Chen et al. [30] suggested using secondary structure predictions to reduce the complexity of molecular dynamics simulations. Samudrala et al. [31] combined secondary structure-based simplified presentations with a particular lattice simulation, attempting to enumerate all possible folds [31, 32].

The work summarized above demonstrates that predicting the protein secondary structure is essential for the prediction of the protein tertiary structure (topology). In this study, our results revealed that SSA is better suited for evaluating the predictive model quality (Fig. 1e). This algorithm may play an indispensable role in protein structure prediction. However, there is some bias in our experiments, which may overestimate the performance of SSA. One source of bias is the specified class of protein secondary structure (helix, sheet, and coil). Helix and sheet are considered regular structures; however, the protein fold consists mainly of coil. If the coil structural detail is predicted well, then measurement of SSA will exhibit better performance and identify a high quality model. Therefore, the drawback of using SSA is the requirement of structural details of the coil region.

Conclusions

Our analysis shows that SSA is the best parameter in assessing the quality of structural models, followed by C_beta, PE, and QMEAN. SSA is an essential parameter when choosing a predictive protein model template.