Introduction

Transmembrane proteins represent about 25% of proteins coded by genomes (Rost et al. 1996; Jones 1998; Wallin and von Heijne 1998; Krogh et al. 2001; Arai et al. 2003; Ahram et al. 2006). They support essential biological functions as receptors, transporters or channels (White et al. 2001) and are embedded in the lipid membrane, which constitutes a very specific neighboring environment. Due to this specificity, obtaining experimental 3D structures of transmembrane proteins is still very difficult (White 2004, 2009; Newstead et al. 2008). Thus, the total number of transmembrane proteins in the Protein DataBank (Berman et al. 2000) is limited, comprising ~1% of available structures (Tusnady et al. 2005a; von Heijne 2006). Known structures show that they fall into two major classes. In the first one, proteins are composed of a series of transmembrane helices (White and von Heijne 2005; von Heijne 2006; Lacapere et al. 2007), e.g., the well-known rhodopsin (Palczewski et al. 2000), while in the second one, they are composed of a succession of β-strands, namely the outer membrane proteins (OMPs). The latter are specific to the outer membranes of bacteria, mitochondria and chloroplasts (White and Wimley 1999; Gromiha and Suwa 2006). In the present study, we only focus on α-helical transmembrane proteins, i.e., proteins with transmembrane α-helices spanning the structures (TMPα) (Oberai et al. 2006; Arinaminpathy et al. 2009).

Many methods have been developed to predict the localization of transmembrane regions or helix orientation (Tusnady and Simon 2001; Nugent and Jones 2009), ranging from simple statistical methods using one sequence (Taylor et al. 1994) to complex hidden Markov models using evolutionary information (Tusnady and Simon 1998; Krogh et al. 2001; Martelli et al. 2003; Zhou and Zhou 2003; Kall et al. 2004, 2005; Viklund and Elofsson 2004; Bagos et al. 2006), and leading to the prediction of structural models (Vaidehi et al. 2002; Becker et al. 2004; Shacham et al. 2004; Fleishman and Ben-Tal 2006; Yarov-Yarovoy et al. 2006; Zhang et al. 2006). As the number of available structures is limited, some prediction methods used annotated sequences and not 3D information. They were significantly biased (Moller et al. 2001; Chen and Rost 2002a, b) and often overestimated their prediction rates (Chen et al. 2002). Many studies focused on the analysis and conservation of amino acid properties in the helices with regard to the lipid or the aqueous phases (Stevens and Arkin 1999; Beuming and Weinstein 2004). Moreover, transmembrane helices are rarely perfectly regular. For instance, kinks in helices are known to play important biological roles (Ubarretxena-Belandia and Engelman 2001; Krishnamurthy et al. 2009) and are well conserved (Faham et al. 2004; Yohannan et al. 2004a, b; Rosenhouse-Dantsker and Logothetis 2006; Kauko et al. 2008). In the same way, some specific sequence patterns could also be characterized (Riek et al. 2001; Rigoutsos et al. 2003).

Fundamentally, an important common issue for TMPα is the precise localization of helical segments spanning the membrane from high (Zucic and Juretic 2004; Tusnady et al. 2005b; Lomize et al. 2006a, b) or intermediate resolution structures (Enosh et al. 2004). Indeed, the assignment of a regular secondary structure is not a trivial task; various criteria can be used to locate the α-helix and β-sheet (Pauling and Corey 1951a, b). Hence, numerous secondary structure assignment methods (SSAMs) based on energetic, geometrical and/or angular criteria exist (Thomas et al. 2001; Majumdar et al. 2005; Taylor et al. 2005; Hosseini et al. 2008). The most popular approach, DSSP (Kabsch and Sander 1983), is based on the identification of hydrogen bond patterns from the protein geometry and an electrostatic model. Newer approaches have extended the principles defined in DSSP, e.g., SECSTR, which is dedicated to improving the detection of 3₁₀ and π-helices (Fodje and Al-Karadaghi 2002), and STRIDE, which also takes into account dihedral angles (Frishman and Argos 1995). In contrast, the DEFINE method (Richards and Kundrot 1988) uses only Cα positions: it computes an inter-Cα distance matrix and compares it with matrices produced by ideal repetitive secondary structures. The KAKSI assignment uses both inter-Cα distance and dihedral angle criteria (Martin et al. 2005). SEGNO also uses the Φ and Ψ dihedral angles, coupled with other angles, to assign secondary structures (Cubellis et al. 2005a, b). PSEA assigns the repetitive secondary structures from the Cα positions alone, using distance and angle criteria (Labesse et al. 1997). XTLSSTR uses all the backbone atoms to compute two angles and three distances (King and Johnson 1999). PCURVE generates a global peptide axis using an extended least-squares minimization procedure (Sklenar et al. 1989). The development of so many approaches reflects the specific limitations of each method and the particular interests of their authors. Detailed descriptions of the various SSAMs can be found in reviews (Benros et al. 2007; Offmann et al. 2007) and in a research article (Tyagi et al. 2009a).

As a consequence, these different assignment methods have generated specific problems. For example, the very classical and widely used DSSP can generate very long helices, which can be classified as linear, curved or kinked (Kumar and Bansal 1998; Bansal et al. 2000). That was one of the motivations of the KAKSI methodology to define linear helices instead of long kinked helices (Martin et al. 2005). Moreover, the disagreement between different SSAMs is not negligible for globular proteins, with only 80% agreement between two distinct methods (Colloc’h et al. 1993; Dupuis et al. 2004; Fourrier et al. 2004; Martin et al. 2005; Tyagi et al. 2009a). Most methods agree on the nature and the number of secondary structures, but disagree on the limits of the secondary structure elements. This could modify the sequence–structure relationship and consequently the data used for prediction.

In this work, we analyzed the differences between secondary structure assignments on TMPα. The consequences of the disagreements on sequence–structure relationships and on secondary structure predictions were studied. Nine different SSAMs have been used. Moreover, we also analyzed the interest of protein blocks, a structural alphabet designed to analyze and predict protein structures (de Brevern et al. 2000, 2007; de Brevern 2005; Tyagi et al. 2009a). This study is based on a protein databank already published to benchmark prediction methods (Zhou and Zhou 2003; Viklund and Elofsson 2004). However, an updated version has been built to take into account novel protein structures. The specific assignment of this databank was also evaluated.

Materials and methods

Data sets

The benchmark set of proteins is the Zhou and Zhou data set (Zhou and Zhou 2003). It is composed of 73 proteins (http://www.smbs.buffalo.edu/phys_bio/service.htm). From the original data set, we have selected only the proteins having at least one transmembrane helix and kept only X-ray crystallographic structures. Each chain was carefully examined with geometric criteria (mainly bond lengths) to avoid bias from zones with missing density. If the bond lengths were larger than standard values, we considered that the chain was probably disrupted. We also compared the primary sequence given by the SEQRES field in the PDB file with the sequence deduced from the ATOM fields, i.e., the sequence with Cartesian coordinates. In case of difference, we looked at the structure to trace missing residues. If the residues were really missing, the chain was separated into two parts. Concerning long extremities, we considered that N- and C-terminal regions longer than 20 residues present some particularities that could bias the results. Consequently, we chose to eliminate these regions to focus on transmembrane domains and kept only a few residues of these regions. A limit of 20 residues allowed keeping intact all loop regions between TM domains. We thus selected 56 proteins (available at http://www.dsimb.inserm.fr/~debrevern/S2_TMalpha/).

A novel updated data set has also been built. For this purpose, all transmembrane protein structures were downloaded from Stephen White’s Web site (http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html) (White 2009), PDBTM (Tusnady et al. 2004, 2005a) and OPM (Lomize et al. 2006b). More than 2,200 protein chains were selected. X-ray structures with a correct resolution and sharing less than 25% sequence identity with the set previously used were kept; they correspond to 375 protein chains. A new clustering of this restricted data set defined 51 sequence clusters sharing less than 25% sequence identity. One representative protein was chosen for each sequence cluster and carefully examined with the criteria described above. The updated databank thus comprises 107 proteins and is 2.5 times bigger than the previous one. Indeed, the newly selected proteins are longer due to the improvement in transmembrane protein crystallization (Sarkar et al. 2008; Newby et al. 2009).

Protein blocks

Protein blocks correspond to a set of 16 local prototypes of five residues length based on a (Φ, Ψ) dihedral angle description (de Brevern et al. 2000; de Brevern 2005). They are labeled from a to p (cf. Figure 1 of Tyagi et al. 2009b). They were obtained by an unsupervised classifier similar to Kohonen maps (Kohonen 1982, 2001) and Hidden Markov models (Rabiner 1989). The PBs m and d can be roughly described as prototypes for core α-helices and core β-strands, respectively. PBs a through c primarily represent β-strand N-caps, and PBs e and f, C-caps; PBs g through j are specific to coils, PBs k and l to α-helix N-caps, and PBs n through p to C-caps. This structural alphabet allows a good approximation of local protein 3D structures (de Brevern 2005). PBs have been studied only on globular proteins.

Secondary structure assignments

We used nine distinct software packages: DSSP (Kabsch and Sander 1983) (CMBI version 2000), STRIDE (Frishman and Argos 1995), SECSTR (Fodje and Al-Karadaghi 2002) (version 0.2.3-1), XTLSSTR (King and Johnson 1999), PSEA (Labesse et al. 1997) (version 2.0), DEFINE (Richards and Kundrot 1988) (version 2.0), P-CURVE (Sklenar et al. 1989) (version 3.1), KAKSI (Martin et al. 2005) (version 1.0.1) and SEGNO (version 3.1) (Cubellis et al. 2005b). PBs (de Brevern et al. 2000) were assigned using in-house software (available at http://www.dsimb.inserm.fr/~debrevern/DOWN/LECT/) that follows assignment rules similar to those of the PBE Web server (http://bioinformatics.univ-reunion.fr/PBE/) (Tyagi et al. 2006a, b). DSSP, STRIDE, SECSTR, XTLSSTR and SEGNO give more than three states, so we reduced them: α-helix contains α, 3₁₀ and π-helices, β-strand contains only the β-sheets, and coil contains everything else (β-bridges, turns, bends, polyproline II and coil). Default settings were used. The curvature of helices was analyzed with the dedicated software HELANAL (Bansal et al. 2000). It takes as input a PDB file and a description of helix boundaries. It calculates local axes for every four residues. The geometry of a helix is determined by the angles between axes and the goodness of fit of the helix trace with a circle or a line. Helices are then classified as kinked (K), linear (L) or curved (C). HELANAL can leave a helix unclassified if its geometry is ambivalent. The minimum length for a helix to be analyzed is nine residues. For the PB approach, residues assigned to PB m were considered helical, while all other PBs were associated with the coil state.
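As an illustration, the reduction to helix, strand and coil states can be sketched in Python as follows. This is a minimal sketch, not the scripts used in this study: the one-letter codes are assumed to be those of a DSSP-style output (H, G, I for α, 3₁₀ and π-helices, E for β-strand), and all function names are hypothetical. The last function applies the PB rule stated above, in which only PB m is counted as helical.

```python
# Assumed DSSP-style one-letter codes; other programs use their own alphabets.
HELIX_CODES = {"H", "G", "I"}   # alpha, 3-10 and pi helices
STRAND_CODES = {"E"}            # extended strand belonging to a beta-sheet


def reduce_to_three_states(assignment: str) -> str:
    """Map a per-residue assignment string onto H (helix), E (strand), C (coil)."""
    reduced = []
    for code in assignment:
        if code in HELIX_CODES:
            reduced.append("H")
        elif code in STRAND_CODES:
            reduced.append("E")
        else:
            reduced.append("C")  # bridges, turns, bends, blanks and anything else
    return "".join(reduced)


def pb_to_helix_states(pb_sequence: str) -> str:
    """Two-state encoding of a protein block sequence: PB m is helix, the rest coil."""
    return "".join("H" if pb == "m" else "C" for pb in pb_sequence)


def helix_segments(states: str) -> list:
    """Return (start, end) index pairs, end exclusive, for each run of H."""
    segments = []
    start = None
    for i, state in enumerate(states):
        if state == "H" and start is None:
            start = i                      # a helix opens here
        elif state != "H" and start is not None:
            segments.append((start, i))    # the helix closed at the previous residue
            start = None
    if start is not None:                  # helix running to the end of the chain
        segments.append((start, len(states)))
    return segments
```

The segment list gives direct access to helix lengths (end minus start), which is the quantity compared between assignment methods in the Results.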

Segment overlap

The necessity for a structurally meaningful measure of secondary structure prediction accuracy has been pointed out by numerous authors (Rost et al. 1994). The segment overlap (SOV) provides this kind of measure as it takes into account the type and position of secondary structure segments rather than a per-residue assignment of conformational state. It is more related to the natural variation of segment boundaries among families of homologous proteins and should be sensitive to the ambiguity in the position of segment ends due to differences in secondary structure classification approaches.

The SOV measure assesses the quality of the overlap between repetitive structures (Rost et al. 1994). In our case, as SOV is not a symmetric measure, we have fixed one SSAM as the reference to compute SOV, with its modified definition (Zemla et al. 1999):

$$ \text{sov}(i) = \frac{1}{N(i)} \sum_{S(i)} \left[ \frac{\text{minov}(s_1, s_2) + \delta(s_1, s_2)}{\text{maxov}(s_1, s_2)} \times \text{len}(s_1) \right] \times 100 $$
$$ N(i) = \sum_{S(i)} \text{len}(s_1) + \sum_{S'(i)} \text{len}(s_1) $$

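with \( s_1 \) and \( s_2 \) the two compared segments (\( s_1 \) belonging to the reference assignment), maxov(\( s_1, s_2 \)) the length of the total extent for which either of the segments \( s_1 \) or \( s_2 \) has a residue in the α-helix state, minov(\( s_1, s_2 \)) the length of the actual overlap between \( s_1 \) and \( s_2 \), len(\( s_1 \)) the length of the reference segment and δ a parameter that allows some tolerance on the positions of the segment ends. The first sums run over the pairs of overlapping segments; the second term of N(i) runs over the reference segments that have no overlapping counterpart.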

$$ \delta(s_1, s_2) = \min \left\{ \begin{array}{l} \text{maxov}(s_1, s_2) - \text{minov}(s_1, s_2) \\ \text{minov}(s_1, s_2) \\ \text{len}(s_1)/2 \\ \text{len}(s_2)/2 \end{array} \right\}. $$
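The formulas above can be sketched in Python for a single state (here the helix). This is a minimal sketch under stated assumptions rather than the program used in this study: assignments are given as equal-length strings, len/2 is used as written above without integer truncation, the function names are hypothetical, and the function returns None when the reference contains no helix.

```python
def runs(states, target="H"):
    """(start, end) pairs, end exclusive, for each run of `target` in `states`."""
    found, start = [], None
    for i, state in enumerate(states):
        if state == target and start is None:
            start = i
        elif state != target and start is not None:
            found.append((start, i))
            start = None
    if start is not None:
        found.append((start, len(states)))
    return found


def sov(reference, other, target="H"):
    """Segment overlap (in %) of `other` against `reference` for one state."""
    if len(reference) != len(other):
        raise ValueError("assignments must have the same length")
    total = 0.0   # sum over overlapping pairs of the bracketed term
    norm = 0      # N(i)
    other_runs = runs(other, target)
    for r_start, r_end in runs(reference, target):
        len1 = r_end - r_start
        overlapped = False
        for o_start, o_end in other_runs:
            minov = min(r_end, o_end) - max(r_start, o_start)
            if minov <= 0:
                continue                      # the two segments do not overlap
            overlapped = True
            len2 = o_end - o_start
            maxov = max(r_end, o_end) - min(r_start, o_start)
            delta = min(maxov - minov, minov, len1 / 2, len2 / 2)
            total += (minov + delta) / maxov * len1
            norm += len1                      # len(s1) counted once per pair
        if not overlapped:
            norm += len1                      # reference segment with no counterpart
    if norm == 0:
        return None                           # no reference segment in this state
    return 100.0 * total / norm
```

With this sketch, a helix shifted by a single residue still scores 100, for example `sov("CCHHHHHHCC", "CCCHHHHHHC")`, because δ absorbs the small displacement of the ends; a per-residue measure would penalize the same shift.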

Agreement rate

To compare two distinct secondary structure assignment methods, we used an agreement rate, which is the proportion of residues associated with the same state (α-helix, β-strand and coil). It is classically noted C 3 (Fourrier et al. 2004; Tyagi et al. 2009a). Here, as we only focus on helices, we compute the C 2, i.e., β-strand and coil are merged into one state.
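A minimal Python sketch of this two-state agreement rate is given below. It assumes that the two assignments are equal-length strings in which helical residues are coded H; the function name is hypothetical.

```python
def agreement_rate_c2(assignment_a, assignment_b, helix="H"):
    """Percentage of residues given the same two-state label (helix / non-helix)."""
    if len(assignment_a) != len(assignment_b):
        raise ValueError("assignments must have the same length")
    if not assignment_a:
        raise ValueError("assignments must not be empty")
    same = sum(
        (a == helix) == (b == helix)      # strand and coil are merged into non-helix
        for a, b in zip(assignment_a, assignment_b)
    )
    return 100.0 * same / len(assignment_a)
```

For instance, `agreement_rate_c2("HHHHCCEE", "HHHCCCCC")` gives 87.5: the two strand residues count as agreements because both methods label them non-helical.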

Z score

The amino acid occurrences for each state have been normalized into a Z score (as in de Brevern et al. 2000, 2002, Etchebest et al. 2005, Tyagi et al. 2009a):

$$ Z(n_{i,j} ) = {\frac{{n_{i,j}^{\text{obs}} - n_{i,j}^{\text{th}} }}{{\sqrt {n_{i,j}^{\text{th}} } }}} $$

with \( n_{i,j}^{\text{obs}} \) the observed number of occurrences of amino acid i in position j for a given state and \( n_{i,j}^{\text{th}} \) the expected number, computed as the product of the total number of occurrences in position j and the frequency of amino acid i in the entire databank. Positive (respectively negative) Z scores correspond to over-represented (respectively under-represented) amino acids; threshold values of 4.42 and 1.96 were chosen (probability less than \( 10^{-5} \) and \( 5 \times 10^{-2} \), respectively).
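The computation can be sketched as follows; the function and argument names are hypothetical, and the counts in the usage note are invented solely to illustrate the arithmetic.

```python
import math


def z_score(n_obs, n_position_total, background_frequency):
    """Z score of one amino acid at one window position for a given state.

    n_obs: observed occurrences of the amino acid at this position
    n_position_total: total occurrences (all amino acids) at this position
    background_frequency: frequency of the amino acid in the whole databank
    """
    n_th = n_position_total * background_frequency   # expected occurrences
    if n_th <= 0:
        raise ValueError("expected count must be positive")
    return (n_obs - n_th) / math.sqrt(n_th)
```

With 1,000 occurrences at a position and a background frequency of 0.10, the expected count is 100; observing 150 occurrences then gives Z = (150 − 100)/√100 = 5.0, above the 4.42 threshold.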

Asymmetric Kullback–Leibler measure

The Kullback–Leibler measure or relative entropy (Kullback and Leibler 1951), denoted by KLd, is a measure of conformity between two amino acid distributions, i.e., the amino acid distribution observed in a given position j and the reference amino acid distribution in the protein set (DB). The relative entropy \( \text{KLd}(j|T_x) \) in the site j for the state \( T_x \) is expressed as:

$$ \text{KLd}(j|T_x) = \sum_{i=1}^{20} P(aa_j = i|T_x) \, \ln \left( \frac{P(aa_j = i|T_x)}{P(aa_j = i|\text{DB})} \right) $$

where \( P(aa_j = i|T_x) \) is the probability of observing the amino acid i in position j \((j = -w, \ldots, 0, \ldots ,+w)\) of the sequence window given a state \( T_x \), and \( P(aa_j = i|\text{DB}) \) the probability of observing the same amino acid in the databank (named DB). Thus, it allows one to detect the “informative” positions in terms of amino acids for a given protein block (de Brevern et al. 2000; Etchebest et al. 2005).
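A minimal sketch of this measure for one window position is shown below. The distributions are assumed to be dictionaries mapping amino acid letters to probabilities, terms with a zero observed probability are taken as contributing nothing (the usual convention), and the function name is hypothetical.

```python
import math


def kld(position_probs, background_probs):
    """Relative entropy (natural log) of a position-specific amino acid
    distribution with respect to the background distribution of the databank."""
    total = 0.0
    for amino_acid, p in position_probs.items():
        if p > 0.0:   # a zero observed probability contributes nothing
            total += p * math.log(p / background_probs[amino_acid])
    return total
```

The value is zero when the position follows the background distribution exactly and grows as the position becomes more specific. A background probability of zero for an observed amino acid would raise an error in this sketch; this cannot occur when the background is computed on a databank that includes the observed windows.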

Prediction

In a strategy of structure prediction from sequence (de Brevern et al. 2000; Etchebest et al. 2005; Elofsson and von Heijne 2007), we must compute, for a given sequence window \(S_{aa} = \{aa_{-w},\ldots, aa_{0},\ldots, aa_{+w}\}\), the probability of observing a given state \( T_x \), i.e., \( P(T_x|S_{aa}) \). For this purpose, each state (helix and non-helix) is associated with an occurrence matrix of dimension l × 20 centered upon the state, with l = 2w + 1 (in this study, w = 7). Bayes' theorem expresses this a posteriori probability \( P(T_x|S_{aa}) \) in terms of the conditional probability \( P(S_{aa}|T_x) \), which is deduced from the occurrence matrix; this leads to the odds score \( R_x \):

$$ R_x = \prod_{j=-w}^{+w} \frac{P(aa_j = i|T_x)}{P(aa_j = i|\text{DB})}. $$

The highest score Rx corresponds to the most probable state (de Brevern et al. 2000). Q tot value is the total number of true predicted states over the total number of predicted residues. Q pred is the percentage of correct prediction of helical residues (or probability of correct prediction) and Q obs is the percentage of observed helical residues that are correctly predicted (or percentage of coverage).
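The scoring rule and the three quality measures can be sketched in Python as follows. This is a minimal sketch, not the original implementation: the score is accumulated as a sum of logarithms, which ranks the states exactly as the product \( R_x \) does while avoiding numerical underflow; all probabilities are assumed to be strictly positive (for example after adding pseudocounts, an assumption not stated in the text); and all names are hypothetical.

```python
import math


def log_odds_score(window, state_probs, background):
    """ln(R_x): sum over window positions of ln[P(aa_j | T_x) / P(aa_j | DB)].

    window: string of 2w + 1 amino acids
    state_probs: list of 2w + 1 dicts, one amino acid distribution per position
    background: dict of amino acid frequencies in the databank
    """
    score = 0.0
    for j, amino_acid in enumerate(window):
        score += math.log(state_probs[j][amino_acid] / background[amino_acid])
    return score


def predict_state(window, matrices, background):
    """Return the state whose matrix gives the highest score for this window."""
    return max(
        matrices,
        key=lambda state: log_odds_score(window, matrices[state], background),
    )


def q_measures(observed, predicted, helix="H"):
    """Return (Q_tot, Q_pred, Q_obs) in percent for two-state strings."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty assignments of equal length")
    pairs = list(zip(observed, predicted))
    correct = sum((o == helix) == (p == helix) for o, p in pairs)
    true_helix = sum(o == helix and p == helix for o, p in pairs)
    n_observed = observed.count(helix)
    n_predicted = predicted.count(helix)
    q_tot = 100.0 * correct / len(observed)
    # Q_pred and Q_obs are undefined when no helix is predicted / observed.
    q_pred = 100.0 * true_helix / n_predicted if n_predicted else None
    q_obs = 100.0 * true_helix / n_observed if n_observed else None
    return q_tot, q_pred, q_obs
```

As a usage example, `q_measures("HHHHCCCC", "HHCCCCCC")` returns `(75.0, 100.0, 50.0)`: every predicted helical residue is correct, but only half of the observed helical residues are recovered.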

Results

Analysis of repetitive secondary structures

The protein databank used is a benchmark created by Zhou and Zhou (2003) to assess their prediction method THUMBD. It has been used for the assessment of the PRODIV-TMHMM prediction method (Viklund and Elofsson 2004). From the 73 original proteins, 56 proteins were selected. Among the 17 proteins excluded, 10 were composed of multiple NMR models, 2 had only Cα atoms and 4 were obtained with a good crystallographic resolution, but the transmembrane region was missing, i.e., only the extracellular domain is available. For the remaining protein, the PDB ID and sequence could not be found in the PDB or in another database. Figure 1 shows two examples of the excluded proteins. Figure 1a and b focus on the membrane fd coat protein [PDB code 1FDM (Almeida and Opella 1997)]. By using multidimensional solution NMR experiments on micelle samples, the authors determined that an amphipathic α-helix and a hydrophobic α-helix are approximately perpendicular. Figure 1a shows the superimposition of the 20 different structural models using PyMol software (DeLano 2002). Figure 1b gives the distribution of helical residue propensities along the protein sequence. This figure underlines the difficulty in defining precisely the helical regions of the transmembrane domain. Figure 1c shows the HLA-B27 protein, a class I histocompatibility antigen [HLA-B*2705, PDB code 1HSA (Madden et al. 1992)], which possesses a single transmembrane segment. However, this segment was not crystallized and so no precise assignment could be done [predicted positions can be found on Uniprot (Leinonen et al. 2004; UniProt_Consortium 2010)]. So, both were excluded.

Fig. 1

Example of excluded proteins. a NMR models of membrane fd coat protein [PDB code 1FDM (Almeida and Opella 1997)]. b Protein HLA-B27 [PDB code 1HSA (Madden et al. 1992)] with putative transmembrane position

We have encoded the protein structures in terms of secondary structure assignment with different secondary structure assignment methods (SSAMs), in terms of protein blocks (PBs), and also checked the assignment defined by Zhou and Zhou (namely ZZ) to assess their prediction method (Zhou and Zhou 2003). The comparison of secondary structure frequencies does not show a high divergence between the methods; the frequencies of α-helix residues for the SSAMs range from 49 to 55%, while they are 52% for PBs and 45% for ZZ. Nonetheless, the distributions of helix lengths are clearly distinct. Two main clusters of average helix lengths can be noticed. The first one is associated with long helices (>21 residues) and comprises P-CURVE (21.6 residues), DEFINE (23.2 residues) and ZZ (26.1 residues); the ZZ assignment is thus associated with the longest helices. The second cluster is composed of short helices and comprises all the other SSAMs; DSSP and the PB assignment have the shortest helices on average (14.7 residues and 13.1 residues, respectively). Thus, we already observe strong discrepancies between the helix assignments.

To compare two SSAMs, an agreement rate, noted C 2, is computed; it corresponds to the percentage of residues associated with the same state (helix or not). Table 1 gives the comparison of SSAMs. Figure 2 gives a projection of this information done with a Sammon map (Sammon 1969). It allows a simple representation of the differences of C 2 values (see Figure 2 of Tyagi et al. 2009a for a similar approach performed on globular proteins). Only one cluster of SSAMs, grouping highly similar assignments and located in the circle at the middle of the figure, can be observed. The methods involved are all based on hydrogen bond assignment, i.e., DSSP, STRIDE and SECSTR, and have C 2 values among themselves better than 94%. No other cluster can be defined. These three SSAMs have C 2 values ranging from 87 to 90% with PCURVE, PSEA, KAKSI, SEGNO and XTLSSTR. These last five have C 2 values ranging from 86 to 89% (data not shown on the figure for clarity). Among all the automatic SSAMs, only DEFINE leads to a very distinct assignment, given that its C 2 values are on average ~63%. These results are also in accordance with C 3 values observed for globular proteins (Tyagi et al. 2009a). The two other methods that have specificities are PBs and ZZ; the C 2 values of PBs are ~85% and those of ZZ are lower, ranging from 81 to 83%. In the same way, the SOV was computed. In our case, it corresponds to the overlap of the helical structures of the different SSAMs with the helical regions defined by DSSP (taken coarsely as the reference as it is the most widely used SSAM, see supplementary material 1). Our analysis of the results took into account the potential differences in helix length, e.g., between DSSP and PCURVE. SOV and C 2 values highlighted similar behaviors. In the following, we have discarded DEFINE, as it does not allow a correct description of the protein topology.

Table 1 Confusion matrix
Fig. 2

Sammon map of C2 correspondence of SSAMs. The C 2 distances have been used to build a Sammon map (Sammon 1969) using R software (Ihaka and Gentleman 1996). Some values are given to help the interpretation of the data (see Table 1 for all the values)

Figures 3 and 4 show an example of multiple secondary structure assignments of well-known bacteriorhodopsin [PDB code: 2BRD (Grigorieff et al. 1996)]. In Fig. 4, the prediction with THUMBD is given as an illustration. In Fig. 3, the helices are colored red and connecting regions in green. For the other SSAMs, we showed, with orange balls, the residues assigned as part of a helix by other SSAMs and not by DSSP. Inversely, blue balls represent residues assigned by DSSP as helical and not by the concerned SSAM. This figure underlines two characteristics also found in other proteins of the databank: the discrepancies between SSAMs are mainly found in the extracellular regions of the transmembrane proteins. For instance, the N-cap of the first helix starts at residue 10 for DSSP and SECSTR, 8 for STRIDE, 9 for PSEA and SEGNO, 7 for PCURVE, and 11 for XTLSSTR. The C-cap is found at position 32 for DSSP, STRIDE, SECSTR and KAKSI and diverges by only one position for PSEA, PCURVE and XTLSSTR.

Fig. 3

Three-dimensional structure of the bacteriorhodopsin (Grigorieff et al. 1996) assigned by different SSAMs. a DSSP, b STRIDE, c SECSTR, d SEGNO, e KAKSI, f ZZ, g PSEA, h XTLSSTR, i PCURVE and j the protein blocks. Visualization was done with PyMol software (DeLano 2002). The helices are in red and the loops in green. Residues assigned by DSSP as helical, but not by other SSAMs, are represented as blue balls. The opposite case is represented by orange balls (color figure online)

Fig. 4

The structure of bacteriorhodopsin (Grigorieff et al. 1996) assigned by different SSAMs. The amino acid sequence of bacteriorhodopsin with numbering corresponding to the PDB files is given; H corresponds to a helical state and C to a non-helical state (see “Materials and methods”). See also Fig. 3 for visualization

The analysis of long helices (≥9 residues) with HELANAL software did not show a specific tendency in comparison to globular proteins (Martin et al. 2005). Half of the transmembrane helices (50%) are curved, kinked helices represent 29% and only a few (8%) are linear. The remaining helices are left unclassified by HELANAL.

Sequence–structure relationship

We analyzed the amino acid propensities within helices, coil, N and C-caps of helices (see Table 2 and supplementary material 2):

Table 2 Amino acid over- and under-representations
  1. Concerning the N-cap of α-helices (see supplementary material 2a), we find a characteristic series of over-represented amino acids, [NDGS]0 followed by [PW]1 and [EW]2 (the numbers correspond to the positions: 0 for the last residue in the coil and 1 for the first helical residue). Thus, it is mainly composed of branched polar residues, tryptophan, which is well known to be found at the membrane interface (von Heijne and Gavel 1988; de Planque et al. 1999; Fleishman et al. 2006), and amino acids that could be helix breakers (e.g., P). Transmembrane segments are in majority deformed helices, i.e., curved and kinked (79%). These series are found for DSSP, STRIDE, SECSTR, PCURVE, PSEA and SEGNO, shifted by +1 residue for KAKSI and XTLSSTR and by −2 for the protein blocks. These strong over-representations, i.e., Z score values higher than 4.4, are limited and localized to the central region of transition from coil to helix. The under-representations are also limited; we can notice in position 0 the under-representation of hydrophobic residues, e.g., alanine and valine. We can also note that with the ZZ assignment, these positions are associated with the lowest informativeness in terms of Kullback–Leibler values and also of Z scores (only one strong over-representation was observed).

  2. Regarding the helices (see supplementary material 2b), only classical propensities are found, with over-representation of aliphatic residues (leucine, valine and isoleucine), aromatic residues (tryptophan and phenylalanine) and hydrophobic alanine, while under-representation concerns polar negatively charged aspartate and glutamate, polar positively charged arginine and lysine, small polar serine and amino acids that could be helix breakers, i.e., proline, glycine and asparagine. None of the SSAMs leads to new amino acid specificities with respect to the literature (Fleishman et al. 2006). We can notice that, contrary to the previous case, the ZZ assignment is the most informative one. This last observation is coherent with the fact that it has the longest helices, so the capping regions play a less important role in the estimation. The data for the coil state are not presented because they are exactly opposed to the amino acid distributions for the helix state.

  3. C-caps of α-helices (see supplementary material 2b) are the least informative regions. A simple amino acid series [NG]1 [P]2 [P]3 can be found and is characteristic of the coil part. The distinction between helical and coil regions is clear for most of the SSAMs, with over-representation of aliphatic residues, e.g., leucine, in the helical part and over-representation of breaker residues, e.g., proline, in the coil part. Only KAKSI is clearly shifted by −1 residue. Interestingly, the polar residue glutamine, which is more often found under-represented in the helices, is over-represented in the last position of helices of STRIDE and SECSTR. Aspartate is also found at position −3 for DSSP and STRIDE. Thus, some amino acids can be found as potential signals of helix ends.

Prediction

The influence of SSAMs on prediction has been assessed by using a simple statistical approach based on Bayes’ rule (de Brevern et al. 2000). It allows an easy evaluation of the predictive power of each assignment. To ensure a correct balance between the proteins used in the training and in the validation steps, the proteins were randomly distributed between the two sets: the training set represents 2/3 of the proteins and the validation set the remaining 1/3. Two occurrence matrices were computed, one for the helical residues and another for the non-helical ones. Each residue in a protein is represented by a sequence fragment of 15 residues centered on it. Then the prediction is performed and assessed; this strategy is repeated 100 times independently, similarly to Tyagi et al. (2009b). This approach gives two series of values, the average ones and the best ones (see Table 3). With the exception of DEFINE (prediction rate, Q tot, ~69% at best), all the SSAMs enable prediction rates better than 78%. Differences between average (of the 100 simulations) and best values are within a fair range of [1.6, 3.2%].
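The resampling protocol can be sketched as follows. This is a minimal sketch, not the original procedure: the split is made at the protein level, as described above, whereas the seed, the rounding of the training set size and the function name are assumptions introduced for the illustration.

```python
import random


def repeated_splits(protein_ids, n_repeats=100, train_fraction=2 / 3, seed=0):
    """Yield (training, validation) lists of protein identifiers.

    Each repeat draws an independent random partition: about 2/3 of the
    proteins for training, the remaining ones for validation.
    """
    rng = random.Random(seed)             # fixed seed only for reproducibility
    ids = list(protein_ids)
    n_train = round(len(ids) * train_fraction)
    for _ in range(n_repeats):
        shuffled = ids[:]                 # keep the original order untouched
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]
```

For the 56-protein set, each repeat gives 37 training and 19 validation proteins. The occurrence matrices are then built from the training proteins only, the validation proteins are predicted, and the average and best values are taken over the repeats.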

Table 3 Prediction of transmembrane proteins

Thus, secondary structure prediction rates using only a single sequence are within a range of 78.26–80.95% for the SSAMs. A structural alphabet (PB) approach gives a slightly better prediction (81.46%). Surprisingly, the secondary structure assignment used for the benchmark set, ZZ, gives a prediction rate of 86.27%. This result is striking as it corresponds to a difference of 5% with the best SSAM, i.e., STRIDE, and 6.4% with DSSP, the most classical one. This higher value is also associated with a good Matthews correlation coefficient (MCC) of 0.73, more than 0.1 point better than the best MCC value obtained with the other assignments. In the same way, Q obs and Q pred values have been computed; they correspond, respectively, to the percentage of helical residues correctly predicted among all the true helical residues (sensitivity) and to the percentage of helical residues correctly predicted among all the predicted helical residues (positive predictive value). Thus, the behavior of ZZ is mainly due to a lower number of helix residues; therefore, it gives the best Q obs value (or percentage of coverage), i.e., 93.7%, but a low Q pred value (or probability of correct prediction), i.e., 70.7%. In fact, it predicts 10% fewer helical residues than other approaches, while its helix frequency is only 5% lower.

Interestingly, the design of a consensus approach to improve the prediction (using DSSP as the standard) does not give any significant improvement and, in many cases, any combination of multiple SSAM prediction methods shows a decrease of the Q tot value.

In the same way, C 2 values have been computed for the predictions. C 2 values for “prediction” are better than C 2 “assignment” values in every case (see supplementary data 3). It is entirely consistent with the analysis of sequence–structure relationships (see “Sequence–structure relationship”) that shows limited differences between SSAMs. Hence, the predictions converge more to the same definition of helical and non-helical regions than the structure definition. Only ZZ does not show any important improvement emphasizing its specific definition.

As a last point, we examined the influence of the databank. Indeed, the databank, although used as a benchmark by other authors, was rather old. Moreover, the number of available structures has recently markedly increased. The databank has been updated with novel high-quality non-redundant protein structures (see “Materials and methods”). The updated protein databank is 2.5 times bigger than the original one. As previously done, prediction was applied to this updated databank (see supplementary material 4). One hundred independent simulations were performed for DSSP, STRIDE and PBs, and the average and best prediction rates were analyzed. On average, very few differences can be found for MCC, Q obs and Q pred. Q tot values slightly decrease, whereas standard deviations slightly increase.

This last point is underlined by the results obtained from the best prediction simulation. The MCCs increase by 0.03–0.06, while all Q tot values increase: by 1.8% for DSSP, 1.1% for STRIDE and 1.6% for PBs, i.e., a value of 83.1%. Hence, the good results of this approach are improved with a larger data set. However, we were not able to test the ZZ assignment because it could not be performed on new protein structures.

Discussion

This study focuses on the precise localization of helices. We used only X-ray 3D structures (Ikeda et al. 2003). Thus, from the original data set, some proteins have been excluded. As expected, SSAMs diverged as much for transmembrane proteins as for globular ones (C 2 values ~88%). PBs, which are characterized by shorter helix lengths, are a bit more distant with C 2 values ~85%, while ZZ has a clearly distinct assignment with C 2 values ~82% and 20% fewer residues associated with the helices than other SSAMs. DEFINE remains an outlier, as it was for the globular proteins (Fourrier et al. 2004). We can notice that DSSP is associated with short helices, a behavior that is opposite to the one observed with globular proteins (Martin et al. 2005). Hence, DSSP gives more breaks in transmembrane helices than other related approaches. Concerning the helix breaks, a fine analysis of some examples shows that they cannot be attributed solely to the assignment method used, but are true disruptions of the secondary structure. Moreover, we often observed proline at the break position or in the close neighborhood. The role of these proline residues needs to be further investigated using multiple sequence alignments to check the conservation of this position. This could give clues on the structural and/or functional role of this residue in the protein.

Precise analysis of the curvature of helices does not show significant differences between the classical SSAMs, i.e., DSSP, STRIDE, SECSTR, PCURVE, PSEA, KAKSI, SEGNO and XTLSSTR. The percentage of linear helices remains low (<10%), while the curved helices still represent more than half of the helices. Only for PCURVE do we observe a slight increase of kinked helices, due to the fact that its helices are longer.

Analysis of the amino acid repartition shows that differences in terms of assignment have no consequence on the sequence–structure relationships for helices, helix termini or coil states. This corroborates equivalent analyses done on globular proteins (Tyagi et al. 2009a, b). The most diverging SSAM is again ZZ, which is characterized by poorly informative helix extremities, but is the most informative for the helix core. Nonetheless, all the different SSAMs describe propensities that support well the TM tendency scale defined by Zhao and London (2006). Indeed, residues associated with a positive value for this scale are over-represented in helix (and under-represented in coil). In the same way, the most under-represented residues in helix (and over-represented in coil) are associated with strong negative values. Future studies will deal more deeply with the comparative analysis of such features.
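Over- and under-representation of a residue in helix can be quantified in several ways. The sketch below uses a simple frequency ratio (frequency of the residue among helix positions divided by its overall frequency), so that values above 1 indicate over-representation and values below 1 under-representation. This ratio is an illustrative choice and is not necessarily the statistic used in this study; the function name and the `H`/`C` encoding are assumptions.

```python
from collections import Counter


def helix_propensity(sequence, states, helix="H"):
    """Ratio of each residue's frequency in helix to its overall frequency."""
    if len(sequence) != len(states) or not sequence:
        raise ValueError("sequence and states must be non-empty and of equal length")
    total = Counter(sequence)
    in_helix = Counter(aa for aa, s in zip(sequence, states) if s == helix)
    n_helix = sum(in_helix.values())
    if n_helix == 0:
        raise ValueError("no helix residue in the assignment")
    n = len(sequence)
    return {aa: (in_helix[aa] / n_helix) / (total[aa] / n) for aa in total}


# Toy example: leucine only in helix, lysine only in coil.
print(helix_propensity("LLLKKL", "HHHCCH"))
```

On real data, such ratios would be accumulated over the whole databank for each SSAM before being compared with an external scale.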

Prediction of the automatic SSAMs gives very homogeneous prediction rates, with the notable exception of ZZ assignment, which exceeds the best prediction by 5%. Viklund and Elofsson assessed the prediction rates of THUMBUP and of their own method, PRODIV-TMHMM (Viklund and Elofsson 2004), and reported Q tot values of 84 and 88%, respectively. Both methods have been trained with the ZZ data set and are based on hidden Markov models with evolutionary information. Here, the simple Bayesian approach using only one sequence gives a prediction rate 2% better than THUMBUP and 2% lower than PRODIV-TMHMM. These two methods were dedicated to protein topology prediction. Nonetheless, the results of such a simple approach are quite good. Moreover, it is a robust approach, as we have shown that it is not sensitive to sequence identity level (Tyagi et al. 2009b). This work also emphasizes the importance of a precise definition of the assignment. We therefore clearly support the approach of Cuthbertson et al. (2005), who compared numerous prediction methods in a very rigorous way. They defined TM helices within membrane protein structures using DSSP. They considered the full extent of each TM helix, including residues that may reside outside the (presumed) limits of the lipid bilayer. They adopted this approach because any attempt to define simply the bilayer-spanning element of a TM helix is contingent on the model used to assign it. Indeed, the absence of lipid molecules from the majority of crystals of membrane proteins prevents any experimental delimitation. In this case, we can note that our Bayesian prediction gives a prediction rate of 79.9% for the original data set and 81.6% with the updated data set, thus 3–4 and 1.5–2.5% less than the best (and rigorously) evaluated prediction methods (Cuthbertson et al. 2005).

To go further, we analyzed, on the original data set, the predictions performed by PSI-PRED (Jones 1999) and MINNOU (Cao et al. 2006). The first one is specialized in the prediction of globular proteins, while the second is dedicated to TMPα. MINNOU has a published prediction rate 9% higher than our approach, a coherent result with regard to the classification method and information used (Cao et al. 2006). However, on our data set, the PSI-PRED prediction rate equals 82.5%, while that of MINNOU is slightly lower at 81.8%. Both are much lower than THUMBD. Interestingly, only 82.8% of the residues have been predicted similarly by PSI-PRED and MINNOU. This confusion decreases with ZZ assignment and ZZ prediction (THUMBD); MINNOU has a C 2 of 71.0% with ZZ assignment and only 60.0% with the prediction. Part of this result is due (1) to the databank itself, which had a significant influence, and (2) to the absence of long protein extremities (composed only of coil residues, which are always well predicted). The prediction rate decreases by 7% if long N and C termini are not taken into account.

Conclusions

This research shows that SSAMs differ in assignment even for transmembrane proteins; this is coherent with previous remarks and research on related subjects (Fourrier et al. 2004; Tusnady et al. 2004; Tyagi et al. 2009a). These divergences have no significant repercussion on sequence–structure relationships. Nonetheless, with a nonautomatic assignment as in the work of ZZ, a major and impressive difference is observed and can be related to the previous remarks by Moller et al. (2001). This study also clearly highlights the influence of the assignment and its potential consequences on the way prediction is assessed. Moreover, we tested a more complex learning approach with a neural agent that also used occurrence matrices. This approach does not greatly increase the prediction rate (1% on average for each method). In the same way, the use of a consensus approach does not provide significant gain, contrary to other approaches that use multiple distinct prediction methods (Ikeda et al. 2002; Nilsson et al. 2002) or different SSAMs to describe the protein structure (Cuff and Barton 1999). This work also emphasizes the importance of an independent assessment of state-of-the-art approaches, such as the TMH Benchmark performed in the Rost Lab (Kernytsky and Rost 2003). Methods that employ evolutionary information are mainly more accurate than methods based on information derived from a single sequence (Cuthbertson et al. 2005). However, we show here that single sequence methods give quite impressive results compared to more complex approaches. We can also notice that the obtained Q tot values are superior to those of PSI-PRED on TMPα, as evaluated by Cao et al. (2006). As the number of structures used in prediction research varies from 73 (Cao et al. 2006) to 265 (Amirova et al. 2007), while others used data sets based on experimental evidence of the protein topology (Jones 2007; Roy Choudhury and Novic 2009), the comparison between methods is not straightforward.
A curated structural benchmark could be a valuable tool for the scientific community, with clear description of the purpose and definition of the different states to be predicted (Moller et al. 2000). It will not change the quality of the prediction rates that are high (Cuthbertson et al. 2005), but could clarify the difficulty of comparison.

It was already shown years ago that many prediction methods were biased when using prediction of TMPα rather than structural information (Moller et al. 2001; Chen et al. 2002). Hence, this lack of consensus has implications for the conception of pertinent structural models (Law et al. 2005; Elofsson and von Heijne 2007). More than ten tools are nowadays available for defining the number and the limits of the TM segments, and all of them exhibit rather comparable success rates (Shen and Chou 2008; Rangwala et al. 2009). The relevance of prediction tools, well tried on soluble proteins, is however far from being proved for TM proteins. For instance, the extension of the Rosetta approach to TM proteins (Yarov-Yarovoy et al. 2006), despite its interest, requires some specific evaluation criteria for assessing its generalization. The TM segments may not be considered as simple helical stretches; their structure requires a more accurate description (Bernsel et al. 2008). This may be obtained with the help of a structural alphabet (Offmann et al. 2007; Joseph et al. 2010), as it has been used for defining the DARC structural model (de Brevern et al. 2005, 2009; de Brevern 2009). The results described herein are quite important for molecular modeling of transmembrane proteins (de Graaf and Rognan 2009; Mornon et al. 2009), which are major medical drug targets (Jacoby et al. 2006; Lacapere et al. 2007; Landry and Gies 2008; Arinaminpathy et al. 2009), and for improving protein topology prediction approaches (Harrington and Ben-Tal 2009; Klammer et al. 2009; Nugent and Jones 2009).