Influence of assignment on the prediction of transmembrane helices in protein structures

Pylouster, Jean; Bornot, Aurélie; Etchebest, Catherine; de Brevern, Alexandre G.

doi:10.1007/s00726-010-0559-6


Original Article
Published: 28 March 2010

Volume 39, pages 1241–1254, (2010)

Amino Acids


Jean Pylouster^1,2,
Aurélie Bornot^1,3,
Catherine Etchebest^1,3 &
…
Alexandre G. de Brevern^1,3


Abstract

α-Helical transmembrane proteins (TMP_α) are composed of a series of helices embedded in the lipid bilayer. Due to technical difficulties, few 3D structures are available. Therefore, the design of structural models of TMP_α is of major interest. We study the secondary structures of TMP_α by analyzing the influence of secondary structure assignment methods (SSAMs). For this purpose, a published and updated benchmark databank of TMP_α is used and nine SSAMs are evaluated. The analysis of the results points to significant differences in secondary structure assignment depending on the method used. Pairwise comparisons between SSAMs led to more than 10% of disagreement. Helical regions corresponding to transmembrane zones are often correctly characterized. The study of the sequence–structure relationship shows very limited differences with regard to the structural disagreement. Secondary structure prediction based on Bayes’ rule and using only a single sequence gives correct prediction rates ranging from 78 to 81%. A structural alphabet approach gives a slightly better prediction, i.e., only 2% less than the best equivalent approach, whereas the prediction rate with a very different assignment exceeds 86%. This last result highlights the importance of the correct assignment choice to evaluate the prediction assessment.


Introduction

Transmembrane proteins represent about 25% of the proteins encoded by genomes (Rost et al. 1996; Jones 1998; Wallin and von Heijne 1998; Krogh et al. 2001; Arai et al. 2003; Ahram et al. 2006). They support essential biological functions as receptors, transporters or channels (White et al. 2001) and are embedded in the lipid membrane, which constitutes a very specific neighboring environment. Due to this specificity, obtaining experimental 3D transmembrane structures is still very difficult (White 2004, 2009; Newstead et al. 2008). Thus, the total number of transmembrane proteins in the Protein DataBank (Berman et al. 2000) is limited, comprising ~1% of available structures (Tusnady et al. 2005a; von Heijne 2006). Known structures show that they fall into two major classes. In the first one, proteins are composed of a series of transmembrane helices (White and von Heijne 2005; von Heijne 2006; Lacapere et al. 2007), e.g., the well-known rhodopsin (Palczewski et al. 2000), while in the second one, they are composed of a succession of β-strands forming a β-sheet, namely the outer membrane proteins (OMPs). The latter are specific to the outer membranes of bacteria, mitochondria and chloroplasts (White and Wimley 1999; Gromiha and Suwa 2006). In the present study, we only focus on α-helical transmembrane proteins, i.e., proteins with transmembrane α-helices spanning the structures (TMP_α) (Oberai et al. 2006; Arinaminpathy et al. 2009).

Many methods have been applied to predict the localization of transmembrane regions or helix orientation (Tusnady and Simon 2001; Nugent and Jones 2009), ranging from simple statistical methods using a single sequence (Taylor et al. 1994) to complex hidden Markov models using evolutionary information (Tusnady and Simon 1998; Krogh et al. 2001; Martelli et al. 2003; Zhou and Zhou 2003; Kall et al. 2004, 2005; Viklund and Elofsson 2004; Bagos et al. 2006) and leading to the prediction of structural models (Vaidehi et al. 2002; Becker et al. 2004; Shacham et al. 2004; Fleishman and Ben-Tal 2006; Yarov-Yarovoy et al. 2006; Zhang et al. 2006). As the number of available structures is limited, some prediction methods used annotated sequences and not 3D information. They were significantly biased (Moller et al. 2001; Chen and Rost 2002a, b) and often overestimated their prediction rates (Chen et al. 2002). Many studies focused on the analysis and conservation of amino acid properties in the helices with regard to the lipid or the aqueous phases (Stevens and Arkin 1999; Beuming and Weinstein 2004). Moreover, transmembrane helices are rarely perfectly regular. For instance, kinks in helices are known to play important biological roles (Ubarretxena-Belandia and Engelman 2001; Krishnamurthy et al. 2009) and are well conserved (Faham et al. 2004; Yohannan et al. 2004a, b; Rosenhouse-Dantsker and Logothetis 2006; Kauko et al. 2008). In the same way, some specific sequence patterns could also be characterized (Riek et al. 2001; Rigoutsos et al. 2003).

Fundamentally, an important common issue for TMP_α is the precise localization of helical segments spanning the membrane from high (Zucic and Juretic 2004; Tusnady et al. 2005b; Lomize et al. 2006a, b) or intermediate resolution structures (Enosh et al. 2004). Indeed, the assignment of a regular secondary structure is not a trivial task; various criteria can be used to locate the α-helix and β-sheet (Pauling and Corey 1951a, b). Hence, numerous secondary structure assignment methods (SSAMs) based on energetic, geometrical and/or angular criteria exist (Thomas et al. 2001; Majumdar et al. 2005; Taylor et al. 2005; Hosseini et al. 2008). The most popular approach, DSSP (Kabsch and Sander 1983), is based on the identification of hydrogen bond patterns from the protein geometry and an electrostatic model. Other approaches have extended the principles defined in DSSP, e.g., SECSTR, which is dedicated to improving the detection of 3₁₀ and π-helices (Fodje and Al-Karadaghi 2002), and STRIDE, which also takes into account dihedral angles (Frishman and Argos 1995). In contrast, the DEFINE method (Richards and Kundrot 1988) uses only C_α positions. It computes an inter-C_α distance matrix and compares it with matrices produced by ideal repetitive secondary structures. The KAKSI assignment uses both inter-C_α distance and dihedral angle criteria (Martin et al. 2005). SEGNO also uses the Φ and Ψ dihedral angles, coupled with other angles, to assign secondary structures (Cubellis et al. 2005a, b). PSEA assigns the repetitive secondary structures from the C_α positions alone, using distance and angle criteria (Labesse et al. 1997). XTLSSTR uses all the backbone atoms to compute two angles and three distances (King and Johnson 1999). PCURVE generates a global peptide axis using an extended least-squares minimization procedure (Sklenar et al. 1989). So many approaches have been developed because each has its own specific limits and because the authors had various specific interests. Precise descriptions of the various SSAMs can be found in reviews (Benros et al. 2007; Offmann et al. 2007) and in a research article (Tyagi et al. 2009a).

As a consequence, these different assignment methods have generated specific problems. For example, the very classical and widely used DSSP can generate very long helices, which can be classified as linear, curved or kinked (Kumar and Bansal 1998; Bansal et al. 2000). That was one of the motivations of the KAKSI methodology, which defines linear helices instead of long kinked helices (Martin et al. 2005). Moreover, the disagreement between different SSAMs is not negligible for globular proteins, with only 80% of agreement between two distinct methods (Colloc’h et al. 1993; Dupuis et al. 2004; Fourrier et al. 2004; Martin et al. 2005; Tyagi et al. 2009a). Most methods agree on the nature and the number of secondary structures, but disagree on the limits of the secondary structure elements. This could modify the sequence–structure relationship and consequently the data used for prediction.

In this work, we analyzed the differences between secondary structure assignments on TMP_α. The consequences of the disagreements on sequence–structure relationships and on secondary structure predictions were studied. Nine different SSAMs have been used. Moreover, we also analyzed the interest of protein blocks, a structural alphabet designed to analyze and predict protein structures (de Brevern et al. 2000, 2007; de Brevern 2005; Tyagi et al. 2009a). This study is based on a protein databank already published to benchmark prediction methods (Zhou and Zhou 2003; Viklund and Elofsson 2004). However, an updated version has been built to take into account novel protein structures. The specific assignment of this databank was also evaluated.

Materials and methods

Data sets

The benchmark set of proteins is the Zhou and Zhou data set (Zhou and Zhou 2003). It is composed of 73 proteins (http://www.smbs.buffalo.edu/phys_bio/service.htm). From the original data set, we have selected only the proteins having at least one transmembrane helix and kept only X-ray crystallographic structures. Each chain was carefully examined with geometric criteria (mainly bond lengths) to avoid bias from zones with missing density. If the bond lengths were larger than the commonly observed values, we considered that the chain was probably disrupted. We also compared the primary sequence given by the SEQRES field in the PDB file with the sequence deduced from the ATOM fields, i.e., the sequence with Cartesian coordinates. In case of difference, we inspected the structure to trace the missing residues. If the residues were really missing, the chain was separated into two parts. Concerning long extremities, we considered that N-terminal and C-terminal regions longer than 20 residues present some particularities that could bias the results. Consequently, we chose to eliminate these regions to focus on transmembrane domains and kept only a few of their residues. A limit of 20 residues allowed keeping intact all loop regions between TM domains. We thus selected 56 proteins (available at http://www.dsimb.inserm.fr/~debrevern/S2_TMalpha/). A novel updated data set has been built. For this purpose, all transmembrane protein structures were downloaded from Stephen White’s Web site (http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html) (White 2009), PDBTM (Tusnady et al. 2004, 2005a) and OPM (Lomize et al. 2006b). More than 2,200 protein chains were selected. X-ray structures with a correct resolution and sharing less than 25% sequence identity with the set previously used were kept; they correspond to 375 protein chains. A new clustering of this restricted data set defines 51 sequence clusters sharing less than 25% of sequence identity. One representative protein was chosen for each sequence cluster and carefully examined with the criteria described above. The updated databank thus comprises 107 proteins and is 2.5 times bigger than the previous one. Indeed, the newly selected proteins are longer due to the improvement in transmembrane protein crystallization (Sarkar et al. 2008; Newby et al. 2009).

Protein blocks

Protein blocks (PBs) correspond to a set of 16 local prototypes, each five residues in length, based on a (Φ, Ψ) dihedral angle description (de Brevern et al. 2000; de Brevern 2005). They are labeled from a to p (cf. Figure 1 of Tyagi et al. 2009b). They were obtained by an unsupervised classifier similar to Kohonen maps (Kohonen 1982, 2001) and Hidden Markov models (Rabiner 1989). The PBs m and d can be roughly described as prototypes for core α-helices and core β-strands, respectively. PBs a through c primarily represent β-strand N-caps, and PBs e and f, β-strand C-caps; PBs g through j are specific to coils, PBs k and l to α-helix N-caps, and PBs n through p to α-helix C-caps. This structural alphabet allows a good approximation of local protein 3D structures (de Brevern 2005). So far, PBs have been studied only on globular proteins.

Secondary structure assignments

We used nine distinct software programs: DSSP (Kabsch and Sander 1983) (CMBI version 2000), STRIDE (Frishman and Argos 1995), SECSTR (Fodje and Al-Karadaghi 2002) (version 0.2.3-1), XTLSSTR (King and Johnson 1999), PSEA (Labesse et al. 1997) (version 2.0), DEFINE (Richards and Kundrot 1988) (version 2.0), P-CURVE (Sklenar et al. 1989) (version 3.1), KAKSI (Martin et al. 2005) (version 1.0.1) and SEGNO (version 3.1) (Cubellis et al. 2005b). PBs (de Brevern et al. 2000) were assigned using in-house software (available at http://www.dsimb.inserm.fr/~debrevern/DOWN/LECT/) that follows assignment rules similar to those of the PBE Web server (http://bioinformatics.univ-reunion.fr/PBE/) (Tyagi et al. 2006a, b). DSSP, STRIDE, SECSTR, XTLSSTR and SEGNO give more than three states, so we reduced them: α-helix contains α, 3₁₀ and π-helices, β-strand contains only the β-sheets, and coil contains everything else (β-bridges, turns, bends, polyproline II and coil). Default settings were used. The curvature of helices was analyzed with the dedicated software HELANAL (Bansal et al. 2000). It takes as input a PDB file and a description of helix boundaries. It calculates local axes for every four residues. The geometry of a helix is determined by the angles between axes and the goodness of fit of the helix trace with a circle or a line. Helices are then classified as kinked (K), linear (L) or curved (C). HELANAL can leave a helix unclassified if its geometry is ambivalent. The minimum length for a helix to be analyzed is nine residues. For the PB approach, residues assigned to PB m are considered helical, while all other PBs are associated with the coil state.
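The state reduction described above can be illustrated with a minimal Python sketch. It assumes DSSP-style one-letter codes (H, G, I for α, 3₁₀ and π-helices, E for strands in a β-sheet, and B, T, S or blank for the rest); the function name is hypothetical, and the other programs use their own alphabets, which would need their own mapping tables.

```python
# Assumed DSSP-style codes; helix-like codes collapse to "H", sheet strands to "E".
DSSP_TO_THREE = {
    "H": "H",  # alpha-helix
    "G": "H",  # 3-10 helix
    "I": "H",  # pi-helix
    "E": "E",  # extended strand belonging to a beta-sheet
}


def reduce_to_three_states(assignment):
    """Map every code that is not listed above (B, T, S, blank, ...) to coil 'C'."""
    return "".join(DSSP_TO_THREE.get(code, "C") for code in assignment)


if __name__ == "__main__":
    print(reduce_to_three_states("  TTHHHHGGGSEEEB"))
```

Keeping the mapping in a dictionary makes the convention explicit and easy to swap for another program's alphabet.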

Segment overlap

The necessity for a structurally meaningful measure of secondary structure prediction accuracy has been pointed out by numerous authors (Rost et al. 1994). The segment overlap (SOV) provides this kind of measure as it takes into account the type and position of secondary structure segments rather than a per-residue assignment of conformational state. It is more related to the natural variation of segment boundaries among families of homologous proteins and should be sensitive to the ambiguity in the position of segment ends due to differences in secondary structure classification approaches.

The SOV measure assesses the quality of the overlap between repetitive structures (Rost et al. 1994). As SOV is not a symmetric measure, we fixed one SSAM as the reference and computed SOV with its modified definition (Zemla et al. 1999):

$$ \text{sov}(i) = \frac{1}{N(i)} \sum_{S(i)} \left[ \frac{\text{minov}(s_{1},s_{2}) + \delta(s_{1},s_{2})}{\text{maxov}(s_{1},s_{2})} \times \text{len}(s_{1}) \right] \times 100 $$

$$ N(i) = \sum_{S(i)} \text{len}(s_{1}) + \sum_{S'(i)} \text{len}(s_{1}) $$

with s₁ and s₂ the segments in state i of the reference assignment and of the compared assignment, respectively, S(i) the set of overlapping pairs (s₁, s₂), S′(i) the set of reference segments s₁ that overlap no segment s₂, maxov(s₁, s₂) the length of the total extent for which either of the segments s₁ or s₂ has a residue in the α-helix state, minov(s₁, s₂) the length of the actual overlap of s₁ and s₂, len(s₁) the length of the reference segment and δ a parameter that gives some tolerance to the overlap at the boundaries of repetitive structures:

$$ \delta(s_{1},s_{2}) = \min \left\{ \begin{array}{c} \text{maxov}(s_{1},s_{2}) - \text{minov}(s_{1},s_{2}) \\ \text{minov}(s_{1},s_{2}) \\ \text{len}(s_{1})/2 ;\ \text{len}(s_{2})/2 \end{array} \right\}. $$
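As an illustration of the definition above, the following Python sketch computes the SOV for the helix state between a reference assignment and a second one. It is a minimal reading of the formulas as written in this section (in particular, len/2 is used without rounding); the function names and the state letter "H" are assumptions, not the authors' implementation.

```python
def segments(states, target="H"):
    """Return (start, end) pairs, inclusive, of consecutive runs of `target`."""
    found, start = [], None
    for i, state in enumerate(states):
        if state == target and start is None:
            start = i
        elif state != target and start is not None:
            found.append((start, i - 1))
            start = None
    if start is not None:
        found.append((start, len(states) - 1))
    return found


def sov_helix(reference, other, target="H"):
    """Segment overlap (in %) of `other` against `reference` for one state."""
    total, normalisation = 0.0, 0
    for b1, e1 in segments(reference, target):
        len1 = e1 - b1 + 1
        overlapping = [(b2, e2) for b2, e2 in segments(other, target)
                       if b2 <= e1 and e2 >= b1]
        if not overlapping:
            normalisation += len1          # segment counted in S'(i)
            continue
        for b2, e2 in overlapping:         # pairs counted in S(i)
            len2 = e2 - b2 + 1
            minov = min(e1, e2) - max(b1, b2) + 1
            maxov = max(e1, e2) - min(b1, b2) + 1
            delta = min(maxov - minov, minov, len1 / 2, len2 / 2)
            total += (minov + delta) / maxov * len1
            normalisation += len1
    if normalisation == 0:
        raise ValueError("the reference contains no segment in the target state")
    return 100.0 * total / normalisation


if __name__ == "__main__":
    print(sov_helix("CHHHHHHHHC", "CCHHHHHHCC"))
```

In this example the two helices differ by one residue at each end; the δ term absorbs that boundary shift, which is why SOV is less sensitive to helix ends than a per-residue score.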

Agreement rate

To compare two distinct secondary structure assignment methods, we used an agreement rate, which is the proportion of residues associated with the same state (α-helix, β-strand and coil). It is classically noted C ₃ (Fourrier et al. 2004; Tyagi et al. 2009a). Here, as we only focus on helices, we compute the C ₂, i.e., β-strand and coil are merged into one state.
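A minimal Python sketch of the C₂ agreement rate follows. It assumes that both assignments are given as equal-length strings over a reduced alphabet in which "H" denotes helix; the function name is hypothetical.

```python
def agreement_rate_c2(assignment_a, assignment_b, helix="H"):
    """Percentage of residues given the same helix / non-helix state by both methods."""
    if len(assignment_a) != len(assignment_b):
        raise ValueError("assignments must describe the same residues")
    if not assignment_a:
        raise ValueError("assignments are empty")
    same = sum((a == helix) == (b == helix)
               for a, b in zip(assignment_a, assignment_b))
    return 100.0 * same / len(assignment_a)


if __name__ == "__main__":
    # Strand "E" and coil "C" are merged, so only the fourth residue disagrees.
    print(agreement_rate_c2("HHHHCCEE", "HHHCCCCC"))
```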

Z score

The amino acid occurrences for each state have been normalized into a Z score (as in de Brevern et al. 2000, 2002, Etchebest et al. 2005, Tyagi et al. 2009a):

$$ Z(n_{i,j} ) = {\frac{{n_{i,j}^{\text{obs}} - n_{i,j}^{\text{th}} }}{{\sqrt {n_{i,j}^{\text{th}} } }}} $$

with $ n_{i,j}^{\text{obs}} $ the observed number of occurrences of amino acid i in position j for a given state and $ n_{i,j}^{\text{th}} $ the expected number. $ n_{i,j}^{\text{th}} $ is equal to the product of the number of occurrences in position j and the frequency of amino acid i in the entire databank. Positive (respectively negative) Z scores correspond to over-represented (respectively under-represented) amino acids; threshold values of 4.42 and 1.96 were chosen (probability less than 10⁻⁵ and 5 × 10⁻², respectively).
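The normalization can be written in a few lines of Python. This is only a sketch of the formula above; the function and argument names are hypothetical, and the counts in the example are invented for illustration.

```python
import math


def z_score(n_obs, n_position, background_frequency):
    """Z score of one amino acid at one window position for a given state.

    n_obs: observed count of the amino acid at this position
    n_position: total number of occurrences at this position
    background_frequency: frequency of the amino acid in the entire databank
    """
    n_expected = n_position * background_frequency
    return (n_obs - n_expected) / math.sqrt(n_expected)


if __name__ == "__main__":
    # Illustrative numbers only: 30 observed where 200 * 0.1 = 20 are expected.
    print(round(z_score(30, 200, 0.1), 2))
```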

Asymmetric Kullback–Leibler measure

The Kullback–Leibler measure or relative entropy (Kullback and Leibler 1951), denoted by KLd, is a measure of conformity between two amino acid distributions, i.e., the amino acid distribution observed in a given position j and the reference amino acid distribution in the protein set (DB). The relative entropy KLd(j|T_x) in the site j for the state T_x is expressed as:

$$ \text{KLd}(j|T_{x}) = \sum_{i = 1}^{20} P(aa_{j} = i|T_{x}) \cdot \ln \left( \frac{P(aa_{j} = i|T_{x})}{P(aa_{j} = i|\text{DB})} \right) $$

where P(aa_j = i|T_x) is the probability of observing the amino acid i in position j $(j = -w, \ldots, 0, \ldots ,+w)$ of the sequence window given a state T_x, and P(aa_j = i|DB) the probability of observing the same amino acid in the databank (named DB). Thus, it allows one to detect the “informative” positions in terms of amino acids for a given protein block (de Brevern et al. 2000; Etchebest et al. 2005).
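The relative entropy at one window position can be sketched as follows in Python. The dictionaries map amino acids to probabilities; their names are hypothetical, and terms with a zero observed probability are taken as contributing zero, which is the usual convention and an assumption here.

```python
import math


def kld(position_probabilities, background_probabilities):
    """Relative entropy between a position-specific and a background distribution."""
    total = 0.0
    for amino_acid, p in position_probabilities.items():
        if p > 0.0:  # a term with p = 0 is taken as 0
            total += p * math.log(p / background_probabilities[amino_acid])
    return total


if __name__ == "__main__":
    # Two-letter toy alphabet for readability; real distributions have 20 entries.
    background = {"L": 0.5, "D": 0.5}
    print(kld({"L": 0.5, "D": 0.5}, background))   # identical distributions
    print(kld({"L": 1.0, "D": 0.0}, background))   # fully specific position
```

A position whose distribution equals the background carries no information, while a position dominated by one amino acid gives the largest value.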

Prediction

In a strategy of structure prediction from sequence (de Brevern et al. 2000; Etchebest et al. 2005; Elofsson and von Heijne 2007), we must compute, for a given sequence window $S_{aa} = \{aa_{-w},\ldots, aa_{0},\ldots, aa_{+w}\}$, the probability of observing a given state T_x, i.e., P(T_x|S_aa). For this purpose, each state T (helix and non-helix) is associated with an occurrence matrix of dimension l × 20 centered upon the state, with l = 2w + 1 (in this study, w = 7). Using Bayes’ theorem to compute this a posteriori probability P(T_x|S_aa) from the a priori probability P(S_aa|T_x), deduced from the occurrence matrix, makes it possible to define the odds score R_x:

$$ R_{x} = \prod_{j = -w}^{+w} \frac{P(aa_{j} = i|T_{x})}{P(aa_{j} = i|\text{DB})}. $$

The highest score R_x corresponds to the most probable state (de Brevern et al. 2000). The Q_tot value is the number of correctly predicted states over the total number of predicted residues. Q_pred is the percentage of correct prediction of helical residues (or probability of correct prediction) and Q_obs is the percentage of observed helical residues that are correctly predicted (or percentage of coverage).
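The scoring scheme can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the pseudocount, the use of logarithms (ln R_x instead of R_x, to avoid numerical underflow), the handling of residues without a complete window, and all function names are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
W = 7  # half-window, so that l = 2 * W + 1 = 15 as in the study


def windows(sequence, w=W):
    """Yield (center, window) for every residue that has a complete window."""
    for center in range(w, len(sequence) - w):
        yield center, sequence[center - w:center + w + 1]


def train(sequences, states, w=W, pseudocount=1.0):
    """Build, for each state, one table per window position of ln[P(aa|T_x)/P(aa|DB)]."""
    background, counts = Counter(), {}
    for sequence, assignment in zip(sequences, states):
        background.update(sequence)
        for center, window in windows(sequence, w):
            columns = counts.setdefault(
                assignment[center], [Counter() for _ in range(2 * w + 1)])
            for j, amino_acid in enumerate(window):
                columns[j][amino_acid] += 1
    n_db = sum(background[a] for a in AMINO_ACIDS) + 20 * pseudocount
    log_bg = {a: math.log((background[a] + pseudocount) / n_db) for a in AMINO_ACIDS}
    model = {}
    for state, columns in counts.items():
        model[state] = []
        for column in columns:
            n_col = sum(column[a] for a in AMINO_ACIDS) + 20 * pseudocount
            model[state].append(
                {a: math.log((column[a] + pseudocount) / n_col) - log_bg[a]
                 for a in AMINO_ACIDS})
    return model


def predict(sequence, model, w=W):
    """Return {center: state}, choosing the state with the highest ln R_x."""
    predicted = {}
    for center, window in windows(sequence, w):
        scores = {state: sum(columns[j].get(aa, 0.0) for j, aa in enumerate(window))
                  for state, columns in model.items()}
        predicted[center] = max(scores, key=scores.get)
    return predicted


if __name__ == "__main__":
    # Toy data with a 3-residue window (w = 1), for illustration only.
    toy_model = train(["LLLLLLDDDDDD"], ["HHHHHHCCCCCC"], w=1)
    print(predict("LLLLL", toy_model, w=1))
```

Summing logarithms is equivalent to taking the product in the definition of R_x, so the ranking of the states is unchanged.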

Results

Analysis of repetitive secondary structures

The protein databank used is a benchmark created by Zhou and Zhou (2003) to assess their prediction method THUMBD. It has been used for the assessment of the PRODIV-TMHMM prediction method (Viklund and Elofsson 2004). From the 73 original proteins, 56 proteins were selected. Among the 17 proteins excluded, 10 were composed of multiple NMR models, 2 had only C_α atoms and 4 were obtained with a good crystallographic resolution, but the transmembrane region was missing, i.e., only the extracellular domain is available. For the remaining protein, the PDB ID and sequence could not be found in the PDB or another database. Figure 1 shows two examples of the excluded proteins. Figure 1a and b focus on the membrane fd coat protein [PDB code 1FDM (Almeida and Opella 1997)]. By using multidimensional solution NMR experiments on micelle samples, the authors determined that an amphipathic α-helix and a hydrophobic α-helix are approximately perpendicular. Figure 1a shows the superimposition of the 20 different structural models using PyMol software (DeLano 2002). Figure 1b gives the distribution of helical residue propensities along the protein sequence. This figure underlines the difficulty in defining precisely the helical regions of the transmembrane domain. Figure 1c shows the HLA-B27 protein, a class I histocompatibility antigen [HLA-B*2705, PDB code 1HSA (Madden et al. 1992)], which possesses a single transmembrane segment. However, this segment was not crystallized and so no precise assignment could be done [predicted positions can be found on Uniprot (Leinonen et al. 2004; UniProt_Consortium 2010)]. So, both were excluded.

We have encoded the protein structures in terms of secondary structure assignment with different secondary structure assignment methods (SSAMs) and in terms of protein blocks (PBs), and also checked the assignment defined by Zhou and Zhou (namely ZZ) to assess their prediction method (Zhou and Zhou 2003). The comparison of secondary structure frequencies does not show a high divergence between the methods; the frequencies of α-helix residues for the SSAMs range from 49 to 55%, while they are 52% for PBs and 45% for ZZ. Nonetheless, the distributions of helix lengths are clearly distinct. We can notice two main clusters of helix lengths. The first one is associated with long helices (>21 residues) and comprises P-CURVE (21.6 residues), DEFINE (23.2 residues) and ZZ (26.1 residues), the ZZ assignment giving the longest helices. The second cluster is composed of short helices and comprises all the other SSAMs; we can note that the PB and DSSP assignments have the shortest helices on average (13.1 residues and 14.7 residues, respectively). Thus, we already observe strong discrepancies between the helix assignments.
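The two descriptors compared above, helix frequency and average helix length, can be derived from a reduced assignment string as in the following Python sketch. The function names and the state letter "H" are assumptions made for illustration.

```python
def helix_lengths(states, helix="H"):
    """Lengths of the consecutive runs of helical residues."""
    lengths, run = [], 0
    for state in states:
        if state == helix:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths


def helix_summary(states, helix="H"):
    """Return (percentage of helical residues, mean helix length)."""
    lengths = helix_lengths(states, helix)
    frequency = 100.0 * sum(lengths) / len(states)
    mean_length = sum(lengths) / len(lengths) if lengths else 0.0
    return frequency, mean_length


if __name__ == "__main__":
    print(helix_summary("CHHHCCHHHHHC"))
```

Two methods can give similar helix frequencies and still differ in mean helix length if one of them splits long helices into shorter ones.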

To compare two SSAMs, an agreement rate, noted C₂, is computed; it corresponds to the percentage of residues associated with the same state (helix or not). Table 1 gives the comparison of SSAMs. Figure 2 gives a projection of this information done with a Sammon map (Sammon 1969). It allows a simple representation of the differences of C₂ values (see Figure 2 of Tyagi et al. 2009a for a similar approach performed on globular proteins). Only one cluster of SSAMs with highly similar assignments can be observed, located in the circle at the middle of the figure. The methods involved are all based on hydrogen bond assignment, i.e., DSSP, STRIDE and SECSTR, and have C₂ values among themselves better than 94%. No other cluster can be defined. These three SSAMs have C₂ values ranging from 87 to 90% with PCURVE, PSEA, KAKSI, SEGNO and XTLSSTR. These last five have C₂ values ranging from 86 to 89% (data not shown on the figure for clarity). Among all the automatic SSAMs, only DEFINE leads to a very distinct assignment, with C₂ values of ~63% on average. These results are also in accordance with C₃ values observed for globular proteins (Tyagi et al. 2009a). The two other methods with specificities are PBs and ZZ; the C₂ values of PBs are ~85% and those of ZZ are lower, ranging from 81 to 83%. In the same way, the SOV was computed. In our case, it corresponds to the overlap of the helical structures of the different SSAMs with the helical regions defined by DSSP (taken coarsely as the reference as it is the most widely used SSAM, see supplementary material 1). Our analysis of the results took into account the potential differences between helix lengths, e.g., between DSSP and PCURVE. SOV and C₂ values highlighted similar behaviors. In the following, we have discarded DEFINE, as it does not allow a correct description of the protein topology.

Table 1 Confusion matrix

Full size table

Figures 3 and 4 show an example of multiple secondary structure assignments of well-known bacteriorhodopsin [PDB code: 2BRD (Grigorieff et al. 1996)]. In Fig. 4, the prediction with THUMBD is given as an illustration. In Fig. 3, the helices are colored red and connecting regions in green. For the other SSAMs, we showed, with orange balls, the residues assigned as part of a helix by other SSAMs and not by DSSP. Inversely, blue balls represent residues assigned by DSSP as helical and not by the concerned SSAM. This figure underlines two characteristics also found in other proteins of the databank: the discrepancies between SSAMs are mainly found in the extracellular regions of the transmembrane proteins. For instance, the N-cap of the first helix starts at residue 10 for DSSP and SECSTR, 8 for STRIDE, 9 for PSEA and SEGNO, 7 for PCURVE, and 11 for XTLSSTR. The C-cap is found at position 32 for DSSP, STRIDE, SECSTR and KAKSI and diverges by only one position for PSEA, PCURVE and XTLSSTR.

The analysis of long helices (≥9 residues) with HELANAL software did not show a specific tendency in comparison to globular proteins (Martin et al. 2005). Half of the transmembrane helices (50%) are curved, 29% are kinked and only a few (8%) are linear. The remaining helices are left unclassified by HELANAL.

Sequence–structure relationship

We analyzed the amino acid propensities within helices, coil, N and C-caps of helices (see Table 2 and supplementary material 2):

Table 2 Amino acid over- and under-representations

Full size table

1. Concerning the N-cap of α-helices (see supplementary material 2a), we find a series of characteristic over-represented amino acids [NDGS]₀ followed by [PW]₁ and [EW]₂ (the subscripts correspond to the positions, 0 for the last residue in the coil and 1 for the first helical residue). Thus, it is mainly composed of branched polar residues, tryptophan residues, well known to be found at the membrane interface (von Heijne and Gavel 1988; de Planque et al. 1999; Fleishman et al. 2006), and amino acids that could be helix breakers (e.g., P). Transmembrane segments are in majority deformed helices, i.e., curved and kinked (79%). These series are found for DSSP, STRIDE, SECSTR, PCURVE, PSEA and SEGNO, shifted by +1 residue for KAKSI and XTLSSTR and by −2 for the protein blocks. These strong over-representations, i.e., Z score values higher than 4.4, are limited and localized to the central region of transition from coil to helix. The under-representations are also limited; we can notice, in position 0, the under-representation of hydrophobic residues, e.g., alanine and valine. We can also note that with the ZZ assignment, these positions are associated with the lowest informativeness in terms of Kullback–Leibler values and also of Z scores (only one strong over-representation was observed).
2. Regarding the helices (see supplementary material 2b), only classical propensities are found, with over-representation of aliphatic residues (leucine, valine and isoleucine), aromatic residues (tryptophan and phenylalanine) and hydrophobic alanine, while under-representation concerns polar negatively charged aspartate and glutamate, polar positively charged arginine and lysine, small polar serine and amino acids that could be helix breakers (proline, glycine and asparagine). None of the SSAMs leads to amino acid specificities that are new with respect to the literature (Fleishman et al. 2006). We can notice that, contrary to the previous case, the ZZ assignment is the most informative one. This last observation is consistent with the fact that it has the longest helices, so that the capping regions play a less important role in the estimation. The data for the coil state are not presented because they are exactly opposed to the amino acid distributions for the helix state.
3. C-caps of α-helices (see supplementary material 2b) are the least informative regions. A simple amino acid series [NG]₁ [P]₂ [P]₃ can be found, which is characteristic of the coil part. The distinction between helical and coil regions is clear for most of the SSAMs, with over-representation of aliphatic residues, e.g., leucine, in the helical part and over-representation of breaker residues, e.g., proline, in the coil part. Only KAKSI is clearly shifted by −1 residue. Interestingly, the polar residue glutamine, which is more often found under-represented in the helices, is over-represented in the last position of the helices of STRIDE and SECSTR. Aspartate is also found at position −3 for DSSP and STRIDE. Thus, some amino acids can be found as potential signals of helix ends.

Prediction

The influence of SSAMs on prediction has been assessed by using a simple statistical approach based on Bayes’ rule (de Brevern et al. 2000). It allows an easy evaluation of the predictive power of each assignment. To ensure a correct balance between the proteins used in the training and in the validation steps, the sets were selected at random: the training set represents 2/3 of the proteins and the validation set the remaining 1/3. Two occurrence matrices were computed, one for the helical residues and another for the non-helical ones. Each residue in the proteins is represented by a 15-residue-long sequence fragment centered on it. Then the prediction is performed and assessed; this procedure is repeated 100 times independently, as in Tyagi et al. (2009b). This approach gives two series of values, the average ones and the best ones (see Table 3). With the exception of DEFINE (prediction rate, Q_tot, ~69% at best), all the SSAMs enable prediction rates better than 78%. Differences between average (of the 100 simulations) and best values are within a fair range of 1.6 to 3.2%.
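The repeated random split can be sketched as below in Python. The `evaluate` callback is hypothetical: it stands for any function that trains on the first list of proteins, predicts the second and returns a score such as Q_tot. The rounding of the 2/3 cut and the fixed seed are assumptions made for reproducibility of the sketch.

```python
import random


def repeated_holdout(protein_ids, evaluate, n_runs=100, train_fraction=2 / 3, seed=0):
    """Run `evaluate(train_ids, validation_ids)` on independent random splits.

    Returns (average score, best score) over the runs.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = list(protein_ids)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        scores.append(evaluate(shuffled[:cut], shuffled[cut:]))
    return sum(scores) / len(scores), max(scores)


if __name__ == "__main__":
    # Dummy evaluation returning the validation set size, for illustration only.
    average, best = repeated_holdout(range(56), lambda train, valid: len(valid))
    print(average, best)
```

Splitting by protein rather than by residue keeps all windows of a given protein on the same side of the split.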

Table 3 Prediction of transmembrane proteins

Full size table

Thus, secondary structure prediction rates using only a single sequence are within a range of 78.26–80.95% for the SSAMs. A structural alphabet (PB) approach gives a slightly better prediction (81.46%). Surprisingly, the secondary structure assignment used for the benchmark set, ZZ, gives a prediction rate of 86.27%. This last result is striking as it corresponds to a difference of 5% with the best SSAM, i.e., STRIDE, and 6.4% with DSSP, the most classical one. This higher value is also associated with a good MCC value equal to 0.73, more than 0.1 point better than the best MCC value of the other approaches. In the same way, Q_obs and Q_pred values have been computed; they correspond, respectively, to the percentage of helical residues correctly predicted for all the true helical residues (sensitivity) and to the percentage of helical residues correctly predicted for all the predicted helical residues (positive predictive value). Thus, the behavior of ZZ is mainly due to a lower number of helix residues; therefore, it gives the best Q_obs value (or percentage of coverage), i.e., 93.7%, but a low Q_pred value (or probability of correct prediction), i.e., 70.7%. In fact, it predicts 10% less helix than other approaches, while its helix frequency is only 5% lower.
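The four quality measures used here follow from the two-state confusion counts, as in this Python sketch. The function name and the state letter "H" are assumptions; the MCC is the standard Matthews correlation coefficient.

```python
import math


def helix_prediction_scores(observed, predicted, helix="H"):
    """Return Q_tot, Q_obs, Q_pred (in %) and MCC for a two-state prediction."""
    tp = fp = fn = tn = 0
    for obs, pred in zip(observed, predicted):
        if obs == helix and pred == helix:
            tp += 1
        elif obs == helix:
            fn += 1
        elif pred == helix:
            fp += 1
        else:
            tn += 1
    n = tp + fp + fn + tn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Q_tot": 100.0 * (tp + tn) / n,
        # sensitivity: correctly predicted helix over observed helix
        "Q_obs": 100.0 * tp / (tp + fn) if tp + fn else float("nan"),
        # positive predictive value: correctly predicted helix over predicted helix
        "Q_pred": 100.0 * tp / (tp + fp) if tp + fp else float("nan"),
        "MCC": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


if __name__ == "__main__":
    print(helix_prediction_scores("HHHHCCCC", "HHHCHCCC"))
```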

Interestingly, the design of a consensus approach to improve the prediction (using DSSP as the standard) does not give any significant improvement and, in many cases, any combination of multiple SSAM prediction methods shows a decrease of the Q _tot value.

In the same way, C ₂ values have been computed for the predictions. C ₂ values for “prediction” are better than C ₂ “assignment” values in every case (see supplementary data 3). It is entirely consistent with the analysis of sequence–structure relationships (see “Sequence–structure relationship”) that shows limited differences between SSAMs. Hence, the predictions converge more to the same definition of helical and non-helical regions than the structure definition. Only ZZ does not show any important improvement emphasizing its specific definition.

As a last point, we examined the influence of the databank. Indeed, the databank, although used as a benchmark by other authors, was rather old. Moreover, the number of available structures has recently increased markedly. The databank has been updated with novel high-quality non-redundant protein structures (see “Materials and methods”). The updated databank is 2.5 times bigger than the original one. As previously done, prediction was applied to this updated databank (see supplementary material 4). One hundred independent simulations were performed for DSSP, STRIDE and PBs, and the average and best prediction rates were analyzed. On average, very few differences can be found for MCC, Q_obs and Q_pred. Q_tot values slightly decrease, whereas standard deviations slightly increase.

This last point is underlined by the results obtained from the best prediction simulation. The MCCs increase by 0.03–0.06, while all Q_tot values increase: by 1.8% for DSSP, 1.1% for STRIDE and 1.6% for PBs, the latter reaching a value of 83.1%. Hence, the good results of this approach are improved with a larger data set. However, we were not able to test ZZ assignment because it could not be performed on new protein structures.

Discussion

This study focuses on the precise localization of helices. We used only X-ray 3D structures (Ikeda et al. 2003). Thus, from the original data set, some proteins have been excluded. As expected, SSAMs diverged as much for transmembrane proteins as for globular ones (C₂ values ~88%). PBs, which are characterized by shorter helix lengths, are a bit more distant with C₂ values ~85%, while ZZ has a clearly distinct assignment, with C₂ values ~82% and 20% fewer residues associated with the helices than other SSAMs. DEFINE remains an outlier, as it was for the globular proteins (Fourrier et al. 2004). We can notice that DSSP is associated with short helices, a behavior that is opposite to the one observed with globular proteins (Martin et al. 2005). Hence, DSSP gives more breaks in transmembrane helices than other related approaches. Concerning the helix breaks, a fine analysis of some examples shows that they cannot be attributed solely to the assignment method used, but are true disruptions of the secondary structure. Moreover, we often observed proline at the break position or in the close neighborhood. The role of these proline residues needs to be further investigated using multiple sequence alignments to check the conservation of this position. This could give clues on the structural and/or functional role of this residue in the protein.

Precise analysis of the curvature of helices does not show significant differences between the classical SSAMs, i.e., DSSP, STRIDE, SECSTR, PCURVE, PSEA, KAKSI, SEGNO and XTLSSTR. The percentage of linear helices remains low (<10%), while the curved helices still represent more than half of the helices. Only for PCURVE do we observe a slight increase of kinked helices, due to the fact that its helices are longer.

Analysis of the amino acid distribution shows that differences in terms of assignment have no consequence on the sequence–structure relationships for helices, helix termini or coil states. It corroborates equivalent analyses done on globular proteins (Tyagi et al. 2009a, b). The most diverging SSAM is again ZZ, characterized by poorly informative helix extremities, but the most informative helix core. Nonetheless, all the different SSAMs describe propensities that support well the TM tendency scale defined by Zhao and London (2006). Indeed, residues associated with a positive value for this scale are over-represented in helix (and under-represented in coil). In the same way, the most under-represented residues in helix (and over-represented in coil) are associated with strong negative values. Future studies will deal more deeply with the comparative analysis of such features.

Prediction based on the automatic SSAMs gives very homogeneous prediction rates, with the notable exception of ZZ assignment, which exceeds the best of them by 5%. Viklund and Elofsson (2004) have assessed the prediction rates of THUMBUP and of their own method, PRODIV-TMHMM, and obtained Q_tot values of 84 and 88%, respectively. Both methods have been trained with the ZZ data set and are based on hidden Markov models with evolutionary information. Here, the simple Bayesian approach using only one sequence gives a 2% better prediction rate than THUMBUP and 2% less than PRODIV-TMHMM. These two methods were dedicated to protein topology prediction. Nonetheless, the results of such a simple approach are quite good. Moreover, it is a robust approach as we have shown that it is not sensitive to sequence identity level (Tyagi et al. 2009b). This work also emphasizes the importance of a precise definition of the assignment. So, we clearly support the approach of Cuthbertson et al. (2005), who compared numerous prediction methods in a very rigorous way. They defined TM helices within membrane protein structures using DSSP. They considered the full extent of each TM helix, including residues that may reside outside the (presumed) limits of the lipid bilayer. They adopted this approach because any attempt to define simply the bilayer-spanning element of a TM helix is contingent on the model used to assign it. Indeed, the absence of lipid molecules from the majority of crystals of membrane proteins prevents any experimental delimitation. In this case, we can note that our Bayesian prediction gives a prediction rate of 79.9% for the original data set and 81.6% with the updated data set, thus 3–4 and 1.5–2.5% less than the best (and rigorously) evaluated prediction methods (Cuthbertson et al. 2005).

To go further, we analyzed the original data set with predictions performed by PSI-PRED (Jones 1999) and MINNOU (Cao et al. 2006). The first one is specialized in the prediction of globular proteins, while the second is dedicated to TMP_α. MINNOU has a published prediction rate 9% higher than our approach, a coherent result with regard to the classification method and information used (Cao et al. 2006). However, on our data set, the PSI-PRED prediction rate equals 82.5%, while that of MINNOU is slightly lower at 81.8%. Both are much lower than that of THUMBUP. Interestingly, only 82.8% of the residues have been predicted similarly by PSI-PRED and MINNOU. The agreement decreases further with ZZ assignment and ZZ prediction (THUMBUP); MINNOU has a C₂ of 71.0% with ZZ assignment and only 60.0% with the prediction. Part of this result is due to (1) the databank itself, which had a significant influence, and (2) the absence of long protein extremities (composed only of coil residues, which are always well predicted). The prediction rate decreases by 7% if long N and C termini are not taken into account.

Conclusions

This research shows that SSAMs differ in assignment even for transmembrane proteins; this is coherent with previous remarks and research on related subjects (Fourrier et al. 2004; Tusnady et al. 2004; Tyagi et al. 2009a). These divergences have no significant repercussion on sequence–structure relationships. Nonetheless, with a nonautomatic assignment as in the work of ZZ, a major and impressive difference is observed and can be related to the previous remarks by Moller et al. (2001). This study also clearly highlights the influence of the assignment and its potential consequences on the way prediction is assessed. Moreover, we tested a more complex learning approach with a neural agent that also used occurrence matrices. This approach does not greatly increase the prediction rate (1% on average for each method). In the same way, the use of a consensus approach does not provide significant gain, contrary to other approaches that use multiple distinct prediction methods (Ikeda et al. 2002; Nilsson et al. 2002) or different SSAMs to describe the protein structure (Cuff and Barton 1999). This work also emphasizes the importance of an independent assessment of state-of-the-art approaches, such as the TMH Benchmark performed in the Rost Lab (Kernytsky and Rost 2003). Methods that employ evolutionary information are mainly more accurate than methods based on information derived from a single sequence (Cuthbertson et al. 2005). However, we show here that single sequence methods give quite impressive results compared to more complex approaches. We can also notice that the obtained Q_tot values are superior to those of PSI-PRED on TMP_α, as evaluated by Cao et al. (2006). As the number of structures used in prediction research varies from 73 (Cao et al. 2006) to 265 (Amirova et al. 2007), while others used data sets based on experimental evidence giving the protein topology (Jones 2007; Roy Choudhury and Novic 2009), the comparison between methods is not straightforward.
A curated structural benchmark could be a valuable tool for the scientific community, with clear description of the purpose and definition of the different states to be predicted (Moller et al. 2000). It will not change the quality of the prediction rates that are high (Cuthbertson et al. 2005), but could clarify the difficulty of comparison.

It was already shown years ago that many prediction methods were biased when using prediction of TMP_α rather than structural information (Moller et al. 2001; Chen et al. 2002). Hence, this lack of consensus has implications for the design of pertinent structural models (Law et al. 2005; Elofsson and von Heijne 2007). More than ten tools are now available for defining the number and the limits of the TM segments, and all of them exhibit rather comparable success rates (Shen and Chou 2008; Rangwala et al. 2009). The relevance of prediction tools, well tried on soluble proteins, is however far from being proved for TM proteins. For instance, the extension of the Rosetta approach to TM proteins (Yarov-Yarovoy et al. 2006), despite its interest, requires specific evaluation criteria for assessing its generalization. The TM segments may not be considered as simple helical stretches; their structure requires a more accurate description (Bernsel et al. 2008). This may be obtained with the help of a structural alphabet (Offmann et al. 2007; Joseph et al. 2010), as has been used for defining the DARC structural model (de Brevern et al. 2005, 2009; de Brevern 2009). The results described herein are quite important for molecular modeling of transmembrane proteins (de Graaf and Rognan 2009; Mornon et al. 2009), which are major medical drug targets (Jacoby et al. 2006; Lacapere et al. 2007; Landry and Gies 2008; Arinaminpathy et al. 2009), and for improving protein topology prediction approaches (Harrington and Ben-Tal 2009; Klammer et al. 2009; Nugent and Jones 2009).

Abbreviations

PDB: Protein Data Bank
SSAM: Secondary structure assignment method
DSSP: Dictionary secondary structure protein
TMP_α: α-Helical transmembrane proteins

References

Ahram M, Litou ZI, Fang R, Al-Tawallbeh G (2006) Estimation of membrane proteins in the human proteome. In Silico Biol 6:379–386
Almeida FC, Opella SJ (1997) fd coat protein structure in membrane environments: structural dynamics of the loop between the hydrophobic trans-membrane helix and the amphipathic in-plane helix. J Mol Biol 270:481–495
Amirova SR, Milchevsky JV, Filatov IV, Esipova NG, Tumanyan VG (2007) Study and prediction of secondary structure for membrane proteins. J Biomol Struct Dyn 24:421–428
Arai M, Ikeda M, Shimizu T (2003) Comprehensive analysis of transmembrane topologies in prokaryotic genomes. Gene 304:77–86
Arinaminpathy Y, Khurana E, Engelman DM, Gerstein MB (2009) Computational analysis of membrane proteins: the largest class of drug targets. Drug Discov Today 14:1130–1135
Bagos PG, Liakopoulos TD, Hamodrakas SJ (2006) Algorithms for incorporating prior topological information in HMMs: application to transmembrane proteins. BMC Bioinformatics 7:189
Bansal M, Kumar S, Velavan R (2000) HELANAL: a program to characterize helix geometry in proteins. J Biomol Struct Dyn 17:811–819
Becker OM, Marantz Y, Shacham S, Inbal B, Heifetz A, Kalid O, Bar-Haim S, Warshaviak D, Fichman M, Noiman S (2004) G protein-coupled receptors: in silico drug discovery in 3D. Proc Natl Acad Sci USA 101:11304–11309
Benros C, Martin J, Tyagi M, de Brevern AG (2007) Description of the local protein structure. I. Classical approaches. In: de Brevern AG (ed) Recent advances in structural bioinformatics. Research Signpost, Trivandrum, pp 1–33
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242
Bernsel A, Viklund H, Falk J, Lindahl E, von Heijne G, Elofsson A (2008) Prediction of membrane-protein topology from first principles. Proc Natl Acad Sci USA 105:7177–7181
Beuming T, Weinstein H (2004) A knowledge-based scale for the analysis and prediction of buried and exposed faces of transmembrane domain proteins. Bioinformatics 20:1822–1835
Cao B, Porollo A, Adamczak R, Jarrell M, Meller J (2006) Enhanced recognition of protein transmembrane domains with prediction-based structural profiles. Bioinformatics 22:303–309
Chen CP, Rost B (2002a) Long membrane helices and short loops predicted less accurately. Protein Sci 11:2766–2773
Chen CP, Rost B (2002b) State-of-the-art in membrane protein prediction. Appl Bioinformatics 1:21–35
Chen CP, Kernytsky A, Rost B (2002) Transmembrane helix predictions revisited. Protein Sci 11:2774–2791
Colloc’h N, Etchebest C, Thoreau E, Henrissat B, Mornon JP (1993) Comparison of three algorithms for the assignment of secondary structure in proteins: the advantages of a consensus assignment. Protein Eng 6:377–382
Cubellis MV, Caillez F, Blundell TL, Lovell SC (2005a) Properties of polyproline II, a secondary structure element implicated in protein–protein interactions. Proteins 58:880–892
Cubellis MV, Cailliez F, Lovell SC (2005b) Secondary structure assignment that accurately reflects physical and evolutionary characteristics. BMC Bioinformatics 6(Suppl 4):S8
Cuff JA, Barton GJ (1999) Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins 34:508–519
Cuthbertson JM, Doyle DA, Sansom MS (2005) Transmembrane helix prediction: a comparative evaluation and analysis. Protein Eng Des Sel 18:295–308
de Brevern AG (2005) New assessment of protein blocks. In Silico Biol 5:283–289
de Brevern AG (2009) New opportunities to fight against infectious diseases and to identify pertinent drug targets with novel methodologies. Infect Disord Drug Targets 9:246–247
de Brevern AG, Etchebest C, Hazout S (2000) Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins 41:271–287
de Brevern AG, Valadie H, Hazout S, Etchebest C (2002) Extension of a local backbone description using a structural alphabet: a new approach to the sequence–structure relationship. Protein Sci 11:2871–2886
de Brevern AG, Wong H, Tournamille C, Colin Y, Le Van Kim C, Etchebest C (2005) A structural model of a seven-transmembrane helix receptor: the Duffy antigen/receptor for chemokine (DARC). Biochim Biophys Acta 1724:288–306
de Brevern AG, Etchebest C, Benros C, Hazout S (2007) “Pinning strategy”: a novel approach for predicting the backbone structure in terms of protein blocks from sequence. J Biosci 32:51–70
de Brevern AG, Autin L, Colin Y, Bertrand O, Etchebest C (2009) In silico studies on DARC. Infect Disord Drug Targets 9:289–303
de Graaf C, Rognan D (2009) Customizing G Protein-coupled receptor models for structure-based virtual screening. Curr Pharm Des 15:4026–4048
de Planque MR, Kruijtzer JA, Liskamp RM, Marsh D, Greathouse DV, Koeppe RE 2nd, de Kruijff B, Killian JA (1999) Different membrane anchoring positions of tryptophan and lysine in synthetic transmembrane alpha-helical peptides. J Biol Chem 274:20839–20846
DeLano WLT (2002) The PyMOL molecular graphics system DeLano Scientific, San Carlos. http://www.pymol.org
Dupuis F, Sadoc JF, Mornon JP (2004) Protein secondary structure assignment through Voronoi tessellation. Proteins 55:519–528
Elofsson A, von Heijne G (2007) Membrane protein structure: prediction vs reality. Annu Rev Biochem 76:125–140
Enosh A, Fleishman SJ, Ben-Tal N, Halperin D (2004) Assigning transmembrane segments to helices in intermediate-resolution structures. Bioinformatics 20(Suppl 1):I122–I129
Etchebest C, Benros C, Hazout S, de Brevern AG (2005) A structural alphabet for local protein structures: Improved prediction methods. Proteins 59:810–827
Faham S, Yang D, Bare E, Yohannan S, Whitelegge JP, Bowie JU (2004) Side-chain contributions to membrane protein structure and stability. J Mol Biol 335:297–305
Fleishman SJ, Ben-Tal N (2006) Progress in structure prediction of alpha-helical membrane proteins. Curr Opin Struct Biol 16:496–504
Fleishman SJ, Unger VM, Ben-Tal N (2006) Transmembrane protein structures without X-rays. Trends Biochem Sci 31:106–113
Fodje MN, Al-Karadaghi S (2002) Occurrence, conformational features and amino acid propensities for the pi-helix. Protein Eng 15:353–358
Fourrier L, Benros C, de Brevern AG (2004) Use of a structural alphabet for analysis of short loops connecting repetitive structures. BMC Bioinformatics 5:58
Frishman D, Argos P (1995) Knowledge-based protein secondary structure assignment. Proteins 23:566–579
Grigorieff N, Ceska TA, Downing KH, Baldwin JM, Henderson R (1996) Electron-crystallographic refinement of the structure of bacteriorhodopsin. J Mol Biol 259:393–421
Gromiha MM, Suwa M (2006) Discrimination of outer membrane proteins using machine learning algorithms. Proteins 63:1031–1037
Harrington SE, Ben-Tal N (2009) Structural determinants of transmembrane helical proteins. Structure 17:1092–1103
Hosseini S, Sadeghi M, Pezeshk H, Eslahchi C, Habibi M (2008) PROSIGN: a method for protein secondary structure assignment based on three-dimensional coordinates of consecutive C(alpha) atoms. Comput Biol Chem 32:406–411
Ihaka R, Gentleman R (1996) R: a language for data analysis and graphics. J Comput Graph Stat 5:299–314
Ikeda M, Arai M, Lao DM, Shimizu T (2002) Transmembrane topology prediction methods: a re-assessment and improvement by a consensus method using a dataset of experimentally-characterized transmembrane topologies. In Silico Biol 2:19–33
Ikeda M, Arai M, Okuno T, Shimizu T (2003) TMPDB: a database of experimentally-characterized transmembrane topologies. Nucleic Acids Res 31:406–409
Jacoby E, Bouhelal R, Gerspacher M, Seuwen K (2006) The 7 TM G-protein-coupled receptor target family. Chem Med Chem 1:761–782
Jones DT (1998) Do transmembrane protein superfolds exist? FEBS Lett 423:281–285
Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202
Jones DT (2007) Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics 23:538–544
Joseph AP, Bornot A, de Brevern AG (2010) Local structure alphabets. In: Rangwala H, Karypis G (eds) Protein structure prediction. Wiley, London (in press)
Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637
Kall L, Krogh A, Sonnhammer EL (2004) A combined transmembrane topology and signal peptide prediction method. J Mol Biol 338:1027–1036
Kall L, Krogh A, Sonnhammer EL (2005) An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics 21(Suppl 1):i251–i257
Kauko A, Illergard K, Elofsson A (2008) Coils in the membrane core are conserved and functionally important. J Mol Biol 380:170–180
Kernytsky A, Rost B (2003) Static benchmarking of membrane helix predictions. Nucleic Acids Res 31:3642–3644
King SM, Johnson WC (1999) Assigning secondary structure from protein coordinate data. Proteins 35:313–320
Klammer M, Messina DN, Schmitt T, Sonnhammer EL (2009) MetaTM—a consensus method for transmembrane protein topology prediction. BMC Bioinformatics 10:314
Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43:59–69
Kohonen T (2001) Self-organizing maps, 3rd edn. Springer, Berlin, p 501
Krishnamurthy H, Piscitelli CL, Gouaux E (2009) Unlocking the molecular secrets of sodium-coupled transporters. Nature 459:347–355
Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305:567–580
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:79–86
Kumar S, Bansal M (1998) Geometrical and sequence characteristics of alpha-helices in globular proteins. Biophys J 75:1935–1944
Labesse G, Colloc’h N, Pothier J, Mornon JP (1997) P-SEA: a new efficient assignment of secondary structure from C alpha trace of proteins. Comput Appl Biosci 13:291–295
Lacapere JJ, Pebay-Peyroula E, Neumann JM, Etchebest C (2007) Determining membrane protein structures: still a challenge!. Trends Biochem Sci 32:259–270
Landry Y, Gies JP (2008) Drugs and their molecular targets: an updated overview. Fundam Clin Pharmacol 22:1–18
Law RJ, Capener C, Baaden M, Bond PJ, Campbell J, Patargias G, Arinaminpathy Y, Sansom MS (2005) Membrane protein structure quality in molecular dynamics simulation. J Mol Graph Model 24:157–165
Leinonen R, Diez FG, Binns D, Fleischmann W, Lopez R, Apweiler R (2004) UniProt archive. Bioinformatics 20:3236–3237
Lomize AL, Pogozheva ID, Lomize MA, Mosberg HI (2006a) Positioning of proteins in membranes: a computational approach. Protein Sci 15:1318–1333
Lomize MA, Lomize AL, Pogozheva ID, Mosberg HI (2006b) OPM: orientations of proteins in membranes database. Bioinformatics 22:623–625
Madden DR, Gorga JC, Strominger JL, Wiley DC (1992) The three-dimensional structure of HLA-B27 at 2.1 A resolution suggests a general mechanism for tight peptide binding to MHC. Cell 70:1035–1048
Majumdar I, Krishna SS, Grishin NV (2005) PALSSE: a program to delineate linear secondary structural elements from protein structures. BMC Bioinformatics 6:202
Martelli PL, Fariselli P, Casadio R (2003) An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins. Bioinformatics 19(Suppl 1):i205–i211
Martin J, Letellier G, Marin A, Taly J-F, de Brevern AG, Gibrat JF (2005) Protein secondary structure assignment revisited: a detailed analysis of different assignment methods. BMC Struct Biol 5:17
Matthews B (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405:442–451
Moller S, Kriventseva EV, Apweiler R (2000) A collection of well characterised integral membrane proteins. Bioinformatics 16:1159–1160
Moller S, Croning MD, Apweiler R (2001) Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 17:646–653
Mornon JP, Lehn P, Callebaut I (2009) Molecular models of the open and closed states of the whole human CFTR protein. Cell Mol Life Sci 66:3469–3486
Newby ZE, O’Connell JD 3rd, Gruswitz F, Hays FA, Harries WE, Harwood IM, Ho JD, Lee JK, Savage DF, Miercke LJ et al (2009) A general protocol for the crystallization of membrane proteins for X-ray structural investigation. Nat Protoc 4:619–637
Newstead S, Ferrandon S, Iwata S (2008) Rationalizing alpha-helical membrane protein crystallization. Protein Sci 17:466–472
Nilsson J, Persson B, Von Heijne G (2002) Prediction of partial membrane protein topologies using a consensus approach. Protein Sci 11:2974–2980
Nugent T, Jones DT (2009) Transmembrane protein topology prediction using support vector machines. BMC Bioinformatics 10:159
Oberai A, Ihm Y, Kim S, Bowie JU (2006) A limited universe of membrane protein families and folds. Protein Sci 15:1723–1734
Offmann B, Tyagi M, de Brevern AG (2007) Local protein structures. Curr Bioinform 3:165–202
Palczewski K, Kumasaka T, Hori T, Behnke CA, Motoshima H, Fox BA, Le Trong I, Teller DC, Okada T, Stenkamp RE et al (2000) Crystal structure of rhodopsin: a G protein-coupled receptor. Science 289:739–745
Pauling L, Corey RB (1951a) Atomic coordinates and structure factors for two helical configurations of polypeptide chains. Proc Natl Acad Sci USA 37:235–240
Pauling L, Corey RB (1951b) The pleated sheet, a new layer configuration of polypeptide chains. Proc Natl Acad Sci USA 37:251–256
Rabiner LR (1989) A tutorial on hidden Markov models and selected application in speech recognition. Proc IEEE 77:257–286
Rangwala H, Kauffman C, Karypis G (2009) svmPRAT: SVM-based protein residue annotation toolkit. BMC Bioinformatics 10:439
Richards FM, Kundrot CE (1988) Identification of structural motifs from protein coordinate data: secondary structure and first-level supersecondary structure. Proteins 3:71–84
Riek RP, Rigoutsos I, Novotny J, Graham RM (2001) Non-alpha-helical elements modulate polytopic membrane protein architecture. J Mol Biol 306:349–362
Rigoutsos I, Riek P, Graham RM, Novotny J (2003) Structural details (kinks and non-alpha conformations) in transmembrane helices are intrahelically determined and can be predicted by sequence pattern descriptors. Nucleic Acids Res 31:4625–4631
Rosenhouse-Dantsker A, Logothetis DE (2006) New roles for a key glycine and its neighboring residue in potassium channel gating. Biophys J 91:2860–2873
Rost B, Sander C, Schneider R (1994) Redefining the goals of protein secondary structure prediction. J Mol Biol 235:13–26
Rost B, Fariselli P, Casadio R (1996) Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 5:1704–1718
Roy Choudhury A, Novic M (2009) Data-driven model for the prediction of protein transmembrane regions. SAR QSAR Environ Res 20:741–754
Sammon JW Jr (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput 18:401–409
Sarkar CA, Dodevski I, Kenig M, Dudli S, Mohr A, Hermans E, Pluckthun A (2008) Directed evolution of a G protein-coupled receptor for expression, stability, and binding selectivity. Proc Natl Acad Sci USA 105:14808–14813
Shacham S, Marantz Y, Bar-Haim S, Kalid O, Warshaviak D, Avisar N, Inbal B, Heifetz A, Fichman M, Topf M et al (2004) PREDICT modeling and in-silico screening for G-protein coupled receptors. Proteins 57:51–86
Shen H, Chou JJ (2008) MemBrain: improving the accuracy of predicting transmembrane helices. PLoS One 3:e2399
Sklenar H, Etchebest C, Lavery R (1989) Describing protein structure: a general algorithm yielding complete helicoidal parameters and a unique overall axis. Proteins 6:46–60
Stevens TJ, Arkin IT (1999) Are membrane proteins “inside-out” proteins? Proteins 36:135–143
Taylor WR, Jones DT, Green NM (1994) A method for alpha-helical integral membrane protein fold prediction. Proteins 18:281–294
Taylor T, Rivera M, Wilson G, Vaisman II (2005) New method for protein secondary structure assignment based on a simple topological descriptor. Proteins 60:513–524
Thomas A, Bouffioux O, Geeurickx D, Brasseur R (2001) Pex, analytical tools for PDB files I. GF-Pex: basic file to describe a protein. Proteins 43:28–36
Tusnady GE, Simon I (1998) Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol 283:489–506
Tusnady GE, Simon I (2001) The HMMTOP transmembrane topology prediction server. Bioinformatics 17:849–850
Tusnady GE, Dosztanyi Z, Simon I (2004) Transmembrane proteins in the Protein Data Bank: identification and classification. Bioinformatics 20:2964–2972
Tusnady GE, Dosztanyi Z, Simon I (2005a) PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank. Nucleic Acids Res 33:D275–D278
Tusnady GE, Dosztanyi Z, Simon I (2005b) TMDET: web server for detecting transmembrane regions of proteins by using their 3D coordinates. Bioinformatics 21:1276–1277
Tyagi M, Gowri VS, Srinivasan N, de Brevern AG, Offmann B (2006a) A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications. Proteins 65:32–39
Tyagi M, Sharma P, Swamy CS, Cadet F, Srinivasan N, de Brevern AG, Offmann B (2006b) Protein Block Expert (PBE): a web-based protein structure analysis server using a structural alphabet. Nucleic Acids Res 34:W119–W123
Tyagi M, Bornot A, Offmann B, de Brevern AG (2009a) Analysis of loop boundaries using different local structure assignment methods. Protein Sci 18:1869–1881
Tyagi M, Bornot A, Offmann B, de Brevern AG (2009b) Protein short loop prediction in terms of a structural alphabet. Comput Biol Chem 33:329–333
Ubarretxena-Belandia I, Engelman DM (2001) Helical membrane proteins: diversity of functions in the context of simple architecture. Curr Opin Struct Biol 11:370–376
UniProt_Consortium (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res 38:D142–D148
Vaidehi N, Floriano WB, Trabanino R, Hall SE, Freddolino P, Choi EJ, Zamanakos G, Goddard WA 3rd (2002) Prediction of structure and function of G protein-coupled receptors. Proc Natl Acad Sci USA 99:12622–12627
Viklund H, Elofsson A (2004) Best alpha-helical transmembrane protein topology predictions are achieved using hidden Markov models and evolutionary information. Protein Sci 13:1908–1917
von Heijne G (2006) Membrane-protein topology. Nat Rev Mol Cell Biol 7:909–918
von Heijne G, Gavel Y (1988) Topogenic signals in integral membrane proteins. Eur J Biochem 174:671–678
Wallin E, von Heijne G (1998) Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 7:1029–1038
White SH (2004) The progress of membrane protein structure determination. Protein Sci 13:1948–1949
White SH (2009) Biophysical dissection of membrane proteins. Nature 459:344–346
White SH, von Heijne G (2005) Transmembrane helices before, during, and after insertion. Curr Opin Struct Biol 15:378–386
White SH, Wimley WC (1999) Membrane protein folding and stability: physical principles. Annu Rev Biophys Biomol Struct 28:319–365
White SH, Ladokhin AS, Jayasinghe S, Hristova K (2001) How membranes shape protein structure. J Biol Chem 276:32395–32398
Yarov-Yarovoy V, Schonbrun J, Baker D (2006) Multipass membrane protein structure prediction using Rosetta. Proteins 62:1010–1025
Yohannan S, Faham S, Yang D, Whitelegge JP, Bowie JU (2004a) The evolution of transmembrane helix kinks and the structural diversity of G protein-coupled receptors. Proc Natl Acad Sci USA 101:959–963
Yohannan S, Yang D, Faham S, Boulting G, Whitelegge J, Bowie JU (2004b) Proline substitutions are not easily accommodated in a membrane protein. J Mol Biol 341:1–6
Zemla A, Venclovas C, Fidelis K, Rost B (1999) A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins 34:220–223
Zhang Y, Devries ME, Skolnick J (2006) Structure modeling of all identified G protein-coupled receptors in the human genome. PLoS Comput Biol 2:e13
Zhao G, London E (2006) An amino acid “transmembrane tendency” scale that approaches the theoretical limit to accuracy for prediction of transmembrane helices: relationship to biological hydrophobicity. Protein Sci 15:1987–2001
Zhou H, Zhou Y (2003) Predicting the topology of transmembrane helical proteins using mean burial propensity and a hidden-Markov-model-based method. Protein Sci 12:1547–1555
Zucic D, Juretic D (2004) Precise annotation of transmembrane segments with Garlic—a free molecular visualization program. Croatica Chemica Acta 77:397–401

Acknowledgments

The authors would like to thank the reviewers for their comments, which helped improve the manuscript. They also thank Aurélie Urbain for her help in designing the new updated databank. This work was supported by grants from the Ministère de la Recherche, Université Paris Diderot-Paris 7, the National Institute for Blood Transfusion (INTS) and the National Institute for Health and Medical Research (INSERM). AB had a grant from the Ministère de la Recherche. AdB was also supported by an Indo-French collaborative grant (CEFIPRA grant number 3903-E).

Author information

Authors and Affiliations

INSERM UMR-S 726, Equipe de Bioinformatique Génomique et Moléculaire (EBGM), DSIMB, Université Paris Diderot-Paris 7, case 7113, 2, place Jussieu, 75251, Paris, France
Jean Pylouster, Aurélie Bornot, Catherine Etchebest & Alexandre G. de Brevern
Régulation et dynamique des génomes, Laboratoire de Biophysique, MNHN, UMR CNRS INSERM 5153, 43 rue Cuvier, 75231, Paris Cedex 05, France
Jean Pylouster
INSERM UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Paris Diderot-Paris 7, Institut National de Transfusion Sanguine (INTS), 6, rue Alexandre Cabanel, 75739, Paris Cedex 15, France
Aurélie Bornot, Catherine Etchebest & Alexandre G. de Brevern

Corresponding author

Correspondence to Alexandre G. de Brevern.

Electronic supplementary material

The online version of this article includes four supplementary files (DOC format: 30 kb, 62 kb, 27 kb and 46 kb).

About this article

Cite this article

Pylouster, J., Bornot, A., Etchebest, C. et al. Influence of assignment on the prediction of transmembrane helices in protein structures. Amino Acids 39, 1241–1254 (2010). https://doi.org/10.1007/s00726-010-0559-6

Received: 22 October 2009
Accepted: 08 March 2010
Published: 28 March 2010
Issue Date: November 2010
DOI: https://doi.org/10.1007/s00726-010-0559-6

Introduction