Introduction

The prediction of protein structure and function from amino acid sequences is one of the most important problems in molecular biology. The field of protein structure prediction began even before the first protein structures were solved (Pauling et al. 1951). Secondary structure is the basis of the spatial structure of a protein and is formed at an early stage of protein folding. Therefore, it is reasonable to study protein secondary structure as the first and most important step of three-dimensional structure prediction. Early methods for secondary-structure prediction were based on single-residue statistics (Chou and Fasman 1974; Garnier et al. 1978) and provided generally poor accuracy. A significant improvement was then made with the PHD method (Rost and Sander 1993), which used evolutionary information from multiple sequence alignments in a three-level neural network.

Current prediction methods commonly employ machine learning techniques, such as neural networks (Jones 1999; Petersen et al. 2000; Dor and Zhou 2007), hidden Markov models (Karplus et al. 1998; Lin et al. 2005), and support vector machines (Ward et al. 2003; Guo et al. 2004), and have achieved Q3 accuracies between 75 and 80%. Moreover, the accuracy can be further improved if structure-based sequence alignments between highly homologous proteins are included as part of the prediction process (Montgomerie et al. 2006). Along with secondary-structure prediction, several methods have been developed to predict protein structural class, protein-protein interactions, and signal peptides (Chou 1995; Chou and Maggiora 1998; Chou and Zhang 1994; Cai et al. 2000; Chou and Cai 2006; Xiao et al. 2006; Zhang et al. 2007; Ding et al. 2007; Chou and Shen 2007c; Shen and Chou 2007b). The methodology developed for predicting protein structural classification (Chou 2005a) has also greatly stimulated the prediction of other attributes of proteins (Cedano et al. 1997; Chen et al. 2007; Chou 2001, 2005b; Chou and Elrod 1999; Chou and Shen 2007a, b; Guo et al. 2006b; Liu et al. 2005; Shen and Chou 2007a; Wang et al. 2004, 2005; Wen et al. 2006; Zhang et al. 2006a).

The empirical prediction of protein secondary structure essentially follows two approaches: one is direct sequence and structure comparison between highly homologous proteins, mapping the structures of known homologues onto the query protein’s sequence; the other is based on the search for and exploitation of sequence-structure patterns and on the development of an algorithm for structure classification. The latter approach is more important for understanding the laws governing sequence-structure relations in proteins and for predicting the secondary structure of proteins with low homology, and it is the one we concentrate on in this article. We introduce a novel method, tetra-peptide-based increment of diversity with quadratic discriminant analysis (TPIDQD for short), for protein secondary-structure prediction. TPIDQD is based on tetra-peptide signals (called tetra-peptide structural words). The tetra-peptide is the structural unit of the α-helix in the sense that the hydrogen bonds in the helix connect the J-th and (J + 4)-th residues, and such units play a crucial role in the formation of the regular structure. It has been estimated that 60–70% of tetra-peptides encode specific structures (Rackovsky 1993); these tetra-peptides can be regarded as a protein folding code. To test the method and facilitate comparison with previous studies, TPIDQD is evaluated on the CB513 dataset (Cuff and Barton 1999), on which it yields higher accuracy than several published algorithms. The success of the TPIDQD algorithm demonstrates the importance of tetra-peptide structural words in protein structure prediction.

Materials and methods

Dataset and definition of protein secondary structure

We have selected the non-homologous CB513 dataset (Cuff and Barton 1999), with sequence identity of less than 25%, for secondary-structure prediction. The automatic assignment of secondary structure to experimentally determined 3D structures is usually performed using DSSP (Kabsch and Sander 1983), STRIDE (Frishman and Argos 1995), or DEFINE (Richards and Kundrot 1988). The DSSP assignments divide secondary structure into eight categories: H (α-helix), E (extended β-strand), G (3₁₀-helix), I (π-helix), B (isolated β-strand), T (turn), S (bend), and “_” (coil). Reducing the eight classes to the three classes of helix (H), sheet (E), and coil (C) is an important step in the encoding of structure data (Rost and Sander 2000). Two popular reduction methods are (1) H, G, and I to H; E to E; all others to C (Rost and Sander 1993), and (2) H and G to H; B and E to E; all others to C (Hua and Sun 2001). In CB513, method (2) yields many discrete states, such as CEC, CEH, and HEC, that method (1) does not. We therefore adopt reduction method (1) in this study.
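The eight-to-three state reduction by method (1) can be sketched as a simple mapping (a minimal illustration; the DSSP letters follow the assignments listed above):

```python
# Reduce DSSP eight-state assignments to three states using method (1):
# H, G, I -> H (helix); E -> E (sheet); all others (B, T, S, "_") -> C (coil).
REDUCTION_1 = {"H": "H", "G": "H", "I": "H", "E": "E"}

def reduce_to_three_states(dssp_string: str) -> str:
    """Map a string of DSSP codes to the three-state alphabet H/E/C."""
    return "".join(REDUCTION_1.get(code, "C") for code in dssp_string)
```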

Definition of tetra-peptide structural words

Sliding a window of four residues along all protein sequences, one obtains the total occurrence frequency of a given tetra-peptide in the CB513 dataset, denoted by N. In our statistics, only tetra-peptides with N ≥ 2 are considered. Although a particular tetra-peptide with N ≥ 2 may not occur in a given protein, we found that each protein in the CB513 dataset contains enough tetra-peptide signals with N ≥ 2. Suppose the tetra-peptide occurs in structure j for n_j times, where j = αααα, ββββ, cccc, ααcc, ββcc, ccαα, ccββ, αααc, βββc, αccc, βccc, cccα, cccβ, cααα, cβββ. If its occurrence in structure j is a stochastic event, then the probability of the tetra-peptide occurring in this structure n_j or more times is
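The sliding-window counting described above can be sketched as follows (a minimal illustration with a toy sequence list; in the actual statistics the window slides along every chain of CB513):

```python
from collections import Counter

def count_tetrapeptides(sequences):
    """Slide a 4-residue window along each sequence and count every tetra-peptide."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 3):
            counts[seq[i:i + 4]] += 1
    return counts

# Only tetra-peptides observed at least twice (N >= 2) are kept as word candidates.
def frequent_tetrapeptides(counts, min_n=2):
    return {tp: n for tp, n in counts.items() if n >= min_n}
```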

$$ 1 - CL_{j} = \sum_{n \ge n_{j}} \frac{N!}{n!(N - n)!}\, p_{j}^{n} (1 - p_{j})^{N - n} $$
(1)

where the sum in Eq. 1 is taken from n_j to N. In Eq. 1, p_j is defined as the relative frequency of structure j in the database, namely p_j = m_j/M, where M is the total occurrence frequency of all tetra-peptides in the dataset and m_j their occurrence frequency in structure j. The details of the parameters M, m_j, and p_j in the CB513 database are listed in Appendix 1. If the right-hand side of Eq. 1 is a small quantity, the occurrence of the tetra-peptide in structure j for n_j times should not be random; the confidence level of this statement is CL_j. For example, if the frequency n_j of a tetra-peptide occurring in the j-th structure satisfies Eq. 1 with CL_j ≥ 95%, then we say that the tetra-peptide is a j-type structural word at the 95% confidence level.
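The tail probability in Eq. 1 is a binomial upper tail and can be evaluated directly (a sketch; `math.comb` gives the binomial coefficient N!/(n!(N − n)!)):

```python
from math import comb

def binomial_upper_tail(N: int, n_j: int, p_j: float) -> float:
    """Probability that the tetra-peptide occurs in structure j n_j or more
    times out of N occurrences, i.e. 1 - CL_j of Eq. 1."""
    return sum(comb(N, n) * p_j**n * (1 - p_j)**(N - n) for n in range(n_j, N + 1))

def is_structural_word(N: int, n_j: int, p_j: float, cl: float = 0.95) -> bool:
    """A tetra-peptide is a j-type structural word if CL_j >= cl."""
    return 1.0 - binomial_upper_tail(N, n_j, p_j) >= cl
```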

Diversity and increment of diversity (ID)

Let n_i be the absolute frequency of the i-th category of some feature variable; there are t categories corresponding to a t-dimensional space (called the category space). Set \( X:\{ n_{i} \mid i = 1,\ldots,t \} \) as the source of diversity. The measure of diversity (Laxton 1978; Li and Lu 2001) as a function of the source X is defined by

$$ D(X) = N\log N - \sum_{i = 1}^{t} n_{i}\log n_{i} $$
(2)
$$ \left( N = \sum_{i = 1}^{t} n_{i} \right). $$

In general, for two sources of diversity in the same space of t dimensions, X:{n_1, n_2, …, n_t} and Y:{m_1, m_2, …, m_t}, the increment of diversity (Xu 1999) is defined by

$$ ID(X,Y) = D(X + Y) - D(X) - D(Y) $$
(3)
$$ \left( M = \sum_{i = 1}^{t} m_{i},\; N = \sum_{i = 1}^{t} n_{i} \right) $$

where ID(X, Y) is a function of the two sources and D(X + Y) is the measure of diversity of the mixed source \( X + Y:\{ n_{1} + m_{1}, \ldots, n_{t} + m_{t} \} \). It can be proved that the increment of diversity satisfies \( 0 \le ID(X,Y) \le (M + N)\log (M + N) - M\log M - N\log N \).

Given a problem of classification of sequence X, we shall compare X with a standard set (called standard source), which consists of samples with known properties. The standard source of diversity S is defined by

$$ D(S) = D(m_{1}, m_{2}, \ldots, m_{t}) = M\log M - \sum_{i = 1}^{t} m_{i}\log m_{i} $$
(4)
$$ \left( M = \sum_{i} m_{i} \right) $$

where m_i is the sum of the frequencies of the i-th category of the feature variables over all samples in the standard set. The increment of diversity ID(X, S) between X and S is again given by Eq. 3. The increment of diversity of two sources is essentially a measure of their similarity: the smaller ID(X, S), the higher the similarity between X and S.
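The diversity measure (Eq. 2) and the increment of diversity (Eq. 3) can be computed directly from the frequency vectors (a minimal sketch; natural logarithms are used and 0·log 0 is taken as 0, with nonempty sources assumed):

```python
from math import log

def diversity(counts):
    """Measure of diversity D(X) = N log N - sum_i n_i log n_i (Eq. 2)."""
    total = sum(counts)
    return total * log(total) - sum(n * log(n) for n in counts if n > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) (Eq. 3), where X + Y is the
    mixed source with component-wise summed frequencies."""
    mixed = [a + b for a, b in zip(x, y)]
    return diversity(mixed) - diversity(x) - diversity(y)
```

Identical frequency distributions give ID = 0, and ID grows toward its upper bound as the two sources concentrate on disjoint categories.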

Quadratic discriminant analysis (QD)

For a sequence X to be classified, the r increments of diversity between the source X and the standard sources form an r-dimensional vector, called R. These r increments of diversity are integrated using the quadratic discriminant function (Zhang 1997; Zhang and Luo 2003). Here we generalize the quadratic discriminant formulation to the classification of multiple groups. Consider X classified into n groups (ω_1, ω_2, …, ω_n). The discriminant function between group i and group j is defined by

$$ \xi_{ij} = \ln p(\omega_{i}|x) - \ln p(\omega_{j}|x). $$
(5)

According to Bayes’ theorem, we can deduce the following equation for the two-group case (see Appendix 2)

$$ \xi_{ij} = \ln\frac{p_{i}}{p_{j}} - \frac{\delta_{i} - \delta_{j}}{2} - \frac{1}{2}\ln\frac{|\Sigma_{i}|}{|\Sigma_{j}|}. $$
(6)

The result can be generalized to n-groups directly. Set

$$ \eta_{v} = \ln p_{v} - \frac{\delta_{v}}{2} - \frac{1}{2}\ln|\Sigma_{v}| $$
(7)
$$ \delta_{v} = (R - \mu_{v})^{T}\,\Sigma_{v}^{-1}\,(R - \mu_{v}) \quad (v = 1,\ldots,n) $$
(8)

where p_v denotes the prior probability of group v (proportional to the number of samples in group v; the additive constant cancels in ξ_ij), μ_v denotes the increments of diversity averaged over group v, \( |\Sigma_{v}| \) is the determinant of the covariance matrix \( \Sigma_{v} \), and δ_v is the squared Mahalanobis distance between R and μ_v with respect to \( \Sigma_{v} \) (note that μ_v and \( \Sigma_{v} \) are calculated on the training set).

From Eqs. 6 and 7, we have

$$ \xi_{ij} = \eta_{i} - \eta_{j} \quad (i,j = 1,\ldots,n). $$
(9)

It can easily be proved that \( p(\omega_{k}|X) \) is the maximum of \( p(\omega_{v}|X) \) if η_k is the maximum among η_v (v = 1,…,n). We therefore predict that X belongs to group k.
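The decision rule of Eqs. 7–9 can be sketched with NumPy (an illustration with made-up group statistics; in TPIDQD, R is the vector of increments of diversity and the priors, means, and covariances come from the training set):

```python
import numpy as np

def eta(R, prior, mu, sigma):
    """Discriminant score eta_v of Eq. 7 for one group."""
    diff = R - mu
    delta = diff @ np.linalg.inv(sigma) @ diff        # Eq. 8: Mahalanobis distance
    return np.log(prior) - delta / 2 - np.log(np.linalg.det(sigma)) / 2

def classify(R, priors, mus, sigmas):
    """Assign R to the group with the maximum eta_v (Eqs. 7 and 9)."""
    scores = [eta(R, p, m, s) for p, m, s in zip(priors, mus, sigmas)]
    return int(np.argmax(scores))
```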

Correction to IDQD prediction

The correction consists of two steps. (1) Structure fluctuations are removed. A structure fluctuation occurs when the predicted structure of a residue differs from that of its left and right neighbors while the two neighbors are predicted to have the same structure; the structure of the central residue of such a tri-peptide is corrected to match that of its neighbors. (2) Structure boundaries are corrected using tetra-peptide boundary words after the removal of fluctuations. There are four kinds of boundary words: α\c-type boundary words (including the subtypes “αααc,” “ααcc,” and “αccc”); β\c-type boundary words (“βββc,” “ββcc,” and “βccc”); c\α-type boundary words (“cccα,” “ccαα,” and “cααα”); and c\β-type boundary words (“cccβ,” “ccββ,” and “cβββ”). If the tetra-peptides at a predicted structural boundary, for example an α\c boundary, are in full accordance with the three boundary words of the “αααc,” “ααcc,” and “αccc” subtypes, the prediction scores 3; if they accord with only two or one of the three subtypes, the score is 2 or 1; if there is no accordance at all, the score is 0. Next, we shift the boundary to the left or right by one or two residues and score the prediction in each case. Finally, comparing the five cases, we choose the one with the maximum score as the final predicted structural boundary. If several cases obtain the same score, we always choose the one with the least shift.
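Step (1), the removal of structure fluctuations, can be sketched as a single smoothing pass over the predicted three-state string (a minimal illustration; the pass is sequential, so earlier corrections can feed into later positions):

```python
def remove_fluctuations(prediction: str) -> str:
    """If a residue's predicted state differs from both neighbors and the two
    neighbors agree, replace it with the neighbors' state (step 1)."""
    states = list(prediction)
    for i in range(1, len(states) - 1):
        if states[i - 1] == states[i + 1] != states[i]:
            states[i] = states[i - 1]
    return "".join(states)
```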

Results

Tetra-peptide structural words at different confidence levels in CB513 dataset

Three kinds of tetra-peptide structural words and 12 kinds of tetra-peptide boundary words at different confidence levels have been deduced using Eq. 1. Generally, structural words with a high confidence level give more accurate results; however, the number of such words is too small to provide enough information. Conversely, the serious overlap between different types of structural words at low confidence levels leads to ambiguity in prediction. Therefore, choosing structural words at an appropriate confidence level yields a prediction with higher accuracy. We have utilized alpha and beta words at the 80% confidence level and coil words at the 55% confidence level; there is little overlap among these three types of structural words. We used coil words at the lower confidence level to ensure a high enough number of structural words, since coil is itself an irregular structure. Similarly, we selected boundary words at the 55% confidence level in the correction program of the algorithm. The numbers of tetra-peptide structural words at different confidence levels are summarized in Table 1, and the details of the tetra-peptide structural words used in our TPIDQD algorithm can be found in the “Electronic Supplementary Material 1–7”.

Table 1 The number of tetra-peptide words of 15 structural types

The secondary-structure prediction of the central residue for a 21-residue fragment

We have predicted the structure of the central residue of a 21-residue fragment in the middle of a protein sequence by using the TPIDQD method (namely, for central residues located between the 11th site from the N′ terminus and the 11th site from the C′ terminus). A set of 21-residue fragments was obtained by sliding a window of 21 residues along all protein sequences; the fragments compose set A, B, or C according to the structure classes of their central residues. Each 21-residue fragment is divided equally into three segments, denoted the left (L), middle (M), and right (R) segments. By calculating the frequencies of alpha words, beta words, and coil words occurring in the L, M, and R segments of all samples in set A and using Eq. 4, we obtain nine standard sources of diversity (a nine-dimensional vector). Similarly, nine standard sources of diversity can be deduced for set B and for set C. For a 21-residue fragment X to be predicted, we obtain nine increments of diversity between the source X and the standard sources of each of the sets A, B, and C by using Eq. 3. Then we calculate δ_v and \( \Sigma_{v} \) (v = A, B, C) in the training set. Following the quadratic discriminant function and using Eq. 7, we find the maximum among η_A, η_B, and η_C and predict that X belongs to the corresponding structure class. Similarly, we have predicted the structures of the first (last) 10 residues at the N′ (C′) end of the chain by using 21-residue fragments (note that several blanks must be appended at the terminus to obtain a 21-residue fragment). Since IDQD is a method of statistical prediction, fluctuations inevitably exist; in particular, residues at the boundary of a secondary structure may easily be misidentified. Therefore, we have introduced the correction program to the IDQD prediction as described in the “Materials and methods” section.
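The construction of the nine feature counts for one 21-residue fragment can be sketched as follows (an illustration; the three word sets are assumed inputs deduced via Eq. 1, and each count feeds one increment of diversity against the corresponding standard source):

```python
def segment_word_counts(fragment, alpha_words, beta_words, coil_words):
    """Split a 21-residue fragment into left/middle/right 7-residue segments
    and count alpha, beta, and coil word hits in each, giving 9 features."""
    assert len(fragment) == 21
    segments = [fragment[0:7], fragment[7:14], fragment[14:21]]
    features = []
    for seg in segments:
        tetras = [seg[i:i + 4] for i in range(len(seg) - 3)]
        for words in (alpha_words, beta_words, coil_words):
            features.append(sum(1 for t in tetras if t in words))
    return features
```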

The independent-dataset test, the k-fold cross-validation test, and the jackknife test have often been used for examining the accuracy of a prediction method. The jackknife test is deemed the most rigorous and objective (Chou and Shen 2007b, 2008; Chou and Zhang 1995) and is widely and increasingly used to test the power of various statistical prediction methods (Cao et al. 2006; Chen et al. 2006a, b, 2007; Chen and Li 2007; Diao et al. 2007, 2008; Du and Li 2006; Fang et al. 2008; Gao and Wang 2006; Gao et al. 2005; Guo et al. 2006a, b; Huang and Li 2004; Jahandideh et al. 2007; Kedarisetti et al. 2006; Li and Li 2008; Lin and Li 2007a, b; Mondal et al. 2006; Mundra et al. 2007; Pugalenthi et al. 2007; Shi et al. 2007; Sun and Huang 2006; Tan et al. 2007; Wen et al. 2006; Zhang et al. 2006a; Zhang and Ding 2007; Zhang et al. 2006b; Zhou 1998; Zhou and Assa-Munt 2001; Zhou and Doctor 2003; Zhou et al. 2007). However, considering the longer time needed for the jackknife test, and because the goal of our paper is to introduce a new method for structure prediction and compare it with published results, we adopt threefold cross-validation to evaluate the prediction quality. We divided the training set randomly into three parts, two of which were used for training and the remaining one for testing; the process was repeated three times. All results (IDQD predictions and corrections in the middle of all chains, at the N′ and C′ terminals, and for full protein chains) are listed in Table 2.

Table 2 Prediction accuracy for protein secondary structure

The secondary-structure prediction of central residues for different fragments

In the TPIDQD algorithm, we constructed sources of diversity and calculated the measure of diversity; one increment of diversity corresponds to one feature variable. In the above calculation, nine feature variables were assumed, each defined on a segment of seven residues (corresponding to a 21-residue fragment). By the same approach, we examined the effect of different segment lengths, namely segments of five residues (corresponding to a 15-residue fragment), nine residues (a 27-residue fragment), and 11 residues (a 33-residue fragment), on the accuracy of secondary-structure prediction, and found that the best prediction is obtained with the seven-residue segment. The results are listed in Table 3.

Table 3 The secondary-structure prediction for different length segments

The influence of long-range amino acid interactions on the secondary-structure formation of proteins is a complex problem (Kihara 2005; Tsai and Nussinov 2005). In our method, however, long-range sequence information can easily be taken into account by increasing the fragment length. A fragment length of 21 residues was assumed in the previous sections, so long-range information has already been partially considered. To consider longer-range sequence information, as an example, we tested a 49-residue fragment (seven segments of seven residues instead of the three segments above) and defined 21 increments of diversity as the feature variables (instead of the nine variables above). We find that the Q3 scores are several points higher than those given in Table 2. Therefore, the TPIDQD algorithm is capable of capturing the influence of long-range residue interactions and of greatly improving secondary-structure prediction through the introduction of longer fragments.

Discussion

In the TPIDQD algorithm, a high enough number of tetra-peptide structural words is a key factor for the success of the algorithm. In defining structural words, we only consider tetra-peptides with N ≥ 2 in the dataset, since a single occurrence of a tetra-peptide in a given secondary structure may be random. However, the probability of its occurring two or more times at random in the same secondary structure is very small: the probability of random occurrence twice is about 1/81, and three times about (1/81)^2. Therefore a tetra-peptide word with a high confidence level under this constraint (N ≥ 2) should really be a code word characteristic of a certain structure. However, we find many tetra-peptides occurring only once in the CB513 dataset. If the condition N ≥ 2 is not imposed, one obtains more tetra-peptide structural words from Eq. 1. When we use the three kinds of tetra-peptide words (under the condition N ≥ 1) as the diversity sources to predict the structure of the same 21-residue fragments, the Q3 score is even higher than that of the prediction based on N ≥ 2 tetra-peptide words (Q3 values of 82.75 and 78.30% were achieved in the direct test and the threefold cross-validation test, respectively, under the condition N ≥ 1). The reason is that more structural words at higher confidence levels are obtained under the condition N ≥ 1, and this makes the prediction more accurate.

Although a tetra-peptide structural word occurring only once in a dataset may be accidental, and we have ignored such words in the N ≥ 2 model, these results show that a large number of structural words at a high confidence level is an important factor for improving the performance of prediction. Note that the total number of possible tetra-peptides is 20^4 = 160,000, and only a portion of them are candidates for code words useful in structure determination, so there exists an upper limit on the number of tetra-peptide structural words. We have utilized 25,513 tetra-peptide structural words (N ≥ 2) based on the CB513 dataset in this paper. With the rapid expansion of protein structure datasets, more tetra-peptide words with a higher confidence level will become obtainable, making the prediction more accurate.

The prediction of protein secondary structure is a typical classification problem in bioinformatics. We obtained an overall Q3 score of 79.19% for secondary-structure prediction using the TPIDQD algorithm. The accuracy can be further improved by taking long-range sequence information (fragments of more than 21 residues) into account in prediction. The prediction capability of the TPIDQD method is comparable with that of other currently published top algorithms. On the same CB513 dataset, for example, the dual-layer SVM algorithm attained an overall Q3 score of 75.2% (Guo et al. 2004), SPINE yielded a tenfold cross-validated accuracy of Q3 = 76.77% (Dor and Zhou 2007), and the dynamic programming algorithm achieved an overall Q3 score of 66% (Sadeghi et al. 2005). These results show that tetra-peptide signals can indeed reflect the relationship between a protein's amino acid sequence and its secondary structure, indicating the importance of tetra-peptide structural words in protein structure prediction.