1 Introduction

Transmembrane (TM) proteins are involved in a wide range of essential biological processes including cell signalling, transport of membrane-impermeable molecules, cell-cell communication, cell recognition, cell adhesion and biogenesis of the bacterial outer membrane. Many are also prime drug targets, with approximately 60% of all drugs currently on the market targeting membrane proteins (Hopkins and Groom 2002). Despite recent progress in TM protein structure determination, the experimental difficulties associated with obtaining crystals that diffract to high resolution mean that TM proteins are severely under-represented in structural databases, making up only 1% of known structures in the PDB (White 2004), of which only about 500 are unique. TM proteins, which have both hydrophobic and hydrophilic regions on their surfaces, are much more difficult to isolate than water-soluble proteins, as the native membrane surrounding the protein must be disrupted and replaced with detergent molecules without causing any denaturation. Given the biological and pharmacological importance of TM proteins, an understanding of their structure and topology (the total number of TM helices, their boundaries and in/out orientation relative to the membrane) is essential for functional analysis and directing further experimental work. In the absence of vital structural data, bioinformatics strategies thus turn to sequence-based prediction methods.

2 Membrane Protein Structural Classes

TM proteins can be classified into two basic types: α-helical and β-barrel proteins. α-helical membrane proteins form the major category of TM proteins and are present in all types of biological membranes, including bacterial outer membranes. They consist of one or more α-helices, each of which contains a stretch of hydrophobic amino acids, embedded in the membrane and linked to subsequent helices by extra-membranous loop regions. It is thought that such proteins may have up to 20 TM helices, allowing a diverse range of differing topologies. Loop regions are known to contain substructures including re-entrant loops (short α-helices that enter and exit the membrane on the same side), as well as amphipathic helices that lie parallel to the membrane plane, and globular domains. β-barrel TM proteins (TMBs) mainly consist of transmembrane β-strands that form a closed barrel in the membrane. Analysis of solved β-barrel 3D structures shows that these proteins can consist of 8–26 β-strands arranged in an anti-parallel manner in the bacterial outer membrane. Some TMBs also have large plug domains and outer loops that can interact with the barrel region to control substrate transport.

2.1 α-Helical Bundles

α-helical TM proteins can be further divided into a number of subtypes based on their topology. Type I and II membrane proteins consist of a single TM α-helix, type III have multiple membrane-spanning helices, while type IV membrane proteins have multiple domains which form an assembly that spans the membrane multiple times. Type I membrane proteins are attached to the membrane with an anchor sequence targeting their amino terminus to the endoplasmic reticulum lumen, with the carboxy terminus exposed to the cytoplasmic side. These proteins are further divided into two subtypes. Type Ia proteins, which constitute most eukaryotic membrane proteins, contain cleavable signal sequences, while type Ib proteins do not. Type II membrane proteins are similar to type I in that they span the membrane only once, but their orientation is reversed; they have their amino terminus on the cytoplasmic side of the cell and the carboxy terminus on the exterior. Type III membrane proteins, which include G protein coupled receptors (e.g. PDB code 1gzm), consist of multiple TM helices and are also divided into two subtypes. Type IIIa proteins have a cleavable signal sequence, while type IIIb proteins do not, but do have their amino terminus exposed to the extracellular side of the membrane. In type IV membrane proteins, the domains that form the membrane-spanning assembly may reside on a single polypeptide chain but are often composed of more than one. Examples include Photosystem I, which comprises nine unique chains (1jb0).

2.2 Transmembrane β-Barrels

TMBs can be divided into two main categories depending on whether the barrel pore is formed from a single chain, or via a homo-oligomeric complex, with each chain contributing 2–4 strands. All known bacterial transmembrane β-barrels consist of anti-parallel β-strands that traverse the outer membrane in a regular manner (Fig. 5.1). Residues on a transmembrane β-strand follow a strict dyad repeat such that alternate side-chains face the lipids and the barrel pore, respectively. The lipid-facing residues are mostly hydrophobic, but the pore-facing residues can be a mixture of both polar and hydrophobic amino acids. Moreover, transmembrane β-strands generally have fewer residues than transmembrane α-helices and have a less prominent hydrophobic profile. Residues on adjacent β-strands are hydrogen bonded to each other such that alternate residues on strand S1 form an N–O and an O–N bond with residues in-register on strand S2, where S1 and S2 are adjacent strands. Solved 3D structures of bacterial TMBs have 8 to 26 β-strands, while the only known eukaryotic TMB structure, the mitochondrial voltage dependent anion channel (VDAC), has 19 strands, where the first and the last strand are parallel to each other. TMBs have long extra-cellular loops that generally protrude away from the barrel pore region but can interact with the barrel domain, and short inner loops. Additionally, a few TMBs have plug domains (Fig. 5.1) that sit inside the barrel and participate in gating and signaling (Ferguson et al. 2002). It is generally estimated that TMBs account for 2–3% of the genes in bacteria, but there is scope for improvement in accurately determining the number of yet unknown TMB families.

Fig. 5.1 Top and front views of a diffusion porin (PDB code 3prn) and outer membrane iron transporter FecA (PDB code 1kmp). Both proteins have long outer loops. The large plug domain of FecA (orange) sits in the barrel and facilitates substrate transport and allosteric transitions.

Multi-chain TMBs mainly fall into one of four known superfamilies—(a) the pore-forming toxins (PDB codes 3w9t, 3o44, 4h56, 3b07, 7ahl) that are secreted by pathogenic bacteria such as Staphylococcus aureus, Clostridium perfringens and Vibrio cholerae, (b) outer membrane efflux proteins (PDB codes 4mt4, 4mt0, 2xmn, 3pik, 1wp1, 1yc9, 1ek9) that are used by bacteria to expel a wide range of molecules including antibacterial drugs thereby increasing multi-drug resistance, (c) mycobacterial porins (PDB code 1uun) in Mycobacteria that can be used to transport drugs through an otherwise low-permeability outer membrane environment that renders them resistant to many antibiotics, and (d) trimeric autotransporters (PDB codes 2lme, 2gr7) such as the Hia autotransporter of Haemophilus influenzae that belongs to the largest family of virulence proteins mediating bacterial adhesion, invasion and spread to host cells. Sequence-based analysis methods to identify protein sequences that belong to those families, and therefore estimate the number of multi-chain TMB families, are currently lacking. Additionally, better computational methods for their topology prediction and 3D assembly need to be developed to increase our understanding of their assembly mechanism and function.

3 Databases

There now exist a number of databases that serve as repositories for the sequences and structures of both α-helical and β-barrel TM proteins (Table 5.1). OPM (Lomize et al. 2006b, 2011), PDBTM (Tusnady et al. 2004, 2005a; Kozma et al. 2013), CGDB (Chetwynd et al. 2008) and the mpstruc database (http://blanco.biomol.uci.edu/mpstruc/) all contain TM proteins of known structure determined using X-ray and electron diffraction, nuclear magnetic resonance and cryo-electron microscopy. OPM, PDBTM and CGDB additionally contain orientation predictions of the protein relative to the membrane based on water-lipid transfer energy minimisation (Lomize et al. 2006a), hydrophobicity/structural feature analysis (Tusnady et al. 2005b) and coarse-grained molecular dynamics simulations (Sansom et al. 2008), respectively, while MemProtMD (http://sbcb.bioch.ox.ac.uk/memprotmd/) contains orientations calculated using a knowledge-based statistical potential (Nugent and Jones 2013). TOPDB (Tusnady et al. 2008; Dobson et al. 2015a) and MPtopo (Jayasinghe et al. 2001) include topology data that have been experimentally validated using low-resolution techniques such as gene fusion, antibody and mutagenesis studies. Other TM protein databases tend to focus on specific families such as voltage-gated potassium channels, including VKCDB (Li and Gallin 2004; Gallin and Boutet 2011) and KDB (http://sbcb.bioch.ox.ac.uk/kdb/), while others such as the Transporter Classification Database (Saier et al. 2006, 2009, 2014) focus on particular structural or functional classes.

Table 5.1 Transmembrane protein databases

For TMBs, TMBB-DB (Freeman and Wimley 2012), TMBETA-GENOME (Gromiha et al. 2007) and OMPdb (Tsirigos et al. 2011) provide an exhaustive list of putative TMBs predicted using computational methods. In addition, HHomp (Remmert et al. 2009) provides a list of putative TMBs found by comprehensive, transitive homology search. As with all bioinformatics databases, care should be taken to ensure that a given resource is frequently updated. The rate at which new sequences and structures are deposited in GenBank and the PDB [and occasionally retracted e.g. (Chang et al. 2006)] results in significant manual annotation for database administrators, and much evidence suggests that this workload often exceeds the amount of time an administrator is willing to commit.

4 Multiple Sequence Alignments

As with globular proteins, multiple sequence alignments play an important role in TM protein structure prediction. Homologous sequences identified via database searches can be used to construct sequence profiles which can significantly enhance TM topology prediction accuracy (Henricson et al. 2005; Jones 2007), while recent co-evolution-based approaches (Jones et al. 2012, 2015) are dependent on high-quality alignments to infer residue-residue contacts which can be used for de novo modelling (Nugent and Jones 2012).

Conventional pair-wise alignment methods return possible matches based on a scoring function that relies on amino acid substitution matrices such as PAM (Dayhoff and Schwartz 1978) or BLOSUM (Henikoff and Henikoff 1992). Such matrices are derived from globular protein alignments, and as amino acid composition, hydrophobicity and conservation patterns differ between globular and TM proteins (Jones et al. 1994a), they are in principle unsuitable for TM protein alignment. A number of TM-specific substitution matrices have therefore been developed, which take into account such differences. For example, the JTT TM matrix (Jones et al. 1994b) was based on the observation that polar residues in TM proteins are highly conserved, while hydrophobic residues are more interchangeable. Other matrices include SLIM (Muller et al. 2001), which was reported to have the highest accuracy for detecting remote homologues in a manually curated GPCR dataset, and PHAT (Ng et al. 2000), which has been shown to outperform JTT, especially in database searching.

More recently, a number of methods have been developed to improve actual TM protein alignment. HMAP (Tang et al. 2003) showed that alignment accuracy could be improved significantly using a profile-profile based approach incorporating structural information. STAM (Shafrir and Guy 2004) implemented higher penalties for insertions/deletions in TM segments compared to loop regions, with combinations of different substitution matrices, to produce alignments resulting in more accurate homology models. PRALINETM (Pirovano et al. 2008), which integrates state-of-the-art sequence prediction techniques with membrane-specific substitution matrices, was shown to outperform standard multiple alignment techniques such as ClustalW (Thompson et al. 1994) and MUSCLE (Edgar 2004) when tested on the TM alignment benchmark set within BAliBASE (Bahr et al. 2001). AlignMe (Stamm et al. 2014, 2013; Khafizov et al. 2010), which uses secondary structure matching combined with evolutionary information, also demonstrated high quality alignments when tested on BAliBASE. It was noted that accuracy was generally lower when transmembrane topology predictions were also included, although the inclusion of this information may still be useful in cases of extremely distantly related proteins for which sequence information is less informative. PSI-Coffee, a modification of the T-Coffee method (Chang et al. 2012; Notredame et al. 2000), employs a homology extension technique that can be used to reveal and use specific conservation patterns found within transmembrane proteins, such as amphiphilic α-helices, resulting in significant improvements to the accuracy of alignments. Hill and co-workers constructed substitution tables for different environments within membrane proteins, demonstrating that, in the 10–25% sequence identity range, alignments could be improved by an average of 28 correctly aligned residues compared with alignments made using default substitution tables, leading to improved structural models (Hill and Deane 2013; Hill et al. 2011).

For TMBs, Jimenez-Morales and Liang (2011) have estimated the evolutionary pattern of residue substitutions, which can be useful for improved sequence alignment of TMBs, while Yan et al. (2011) have shown the utility of secondary structure element alignment for the identification of putative TMBs. Additionally, a structure-based alignment method for TMBs that uses TMB-specific topology features has been shown to improve alignment (Wang et al. 2013).

5 Transmembrane Protein Topology Prediction

The under-representation of TM proteins in structural databases makes their study extremely difficult. As a result, tools to analyse TM proteins have historically focused on sequence-based topology prediction: identifying the total number of TM helices, their boundaries, and in/out orientation relative to the membrane. Experimental approaches for determining TM topology include glycosylation analysis, insertion tags, antibody studies and fusion protein constructs; however, such studies are time consuming, often yield conflicting results (Mao et al. 2003; Kyttala et al. 2004; Ratajczak et al. 2014), and also risk upsetting the natural topology by altering the protein sequence. Theoretical prediction methods therefore provide an important strategy for furthering our understanding of these biologically and pharmacologically important proteins.

5.1 Early α-Helical Topology Prediction Approaches

Early topology prediction methods were based on physicochemical observations of TM proteins. Even before the arrival of the first crystal structures, stretches of hydrophobic residues long enough to span the lipid bilayer were identified as TM spanning α-helices. Prediction methods by Kyte and Doolittle (1982) and Engelman et al. (1986), and later by Wimley and White (1996), relied on experimentally determined hydropathy indices to create a hydropathy plot for a protein. This involved sliding a window of 19–21 residues along the sequence and averaging the scores within it, with peaks in the plot (regions of high hydrophobicity) corresponding to the locations of TM helices. With more sequences came the discovery that aromatic Trp and Tyr residues tend to cluster near the ends of the transmembrane segments (Wallin et al. 1997), possibly acting as physical buffers to stabilise TM helices within the lipid bilayer. Later studies identified the appearance of sequence motifs, such as the GxxxG motif (Senes et al. 2000), within TM helices, and also periodic patterns implicated in helix-helix packing and 3D structure (Samatey et al. 1995). However, perhaps the most important realisation was that positively charged residues tend to cluster on cytoplasmic loops, the ‘positive-inside’ rule of Gunnar von Heijne (von Heijne 1992). Combined with hydrophobicity-based prediction of TM helices, this led to early topology prediction methods such as TopPred (Claros and von Heijne 1994).
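The sliding-window hydropathy analysis and the ‘positive-inside’ rule described above can be illustrated with a minimal Python sketch. The hydropathy values are those of Kyte and Doolittle (1982); the function names, the threshold of 1.6 and the way overlapping windows are merged are illustrative assumptions and do not reproduce TopPred or any other published tool.

```python
# Kyte-Doolittle hydropathy index (Kyte and Doolittle 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window, indexed by window start."""
    scores = [KD[aa] for aa in seq]  # raises KeyError on non-standard residues
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def candidate_tm_segments(seq, window=19, threshold=1.6):
    """Merge overlapping windows whose mean hydropathy reaches the threshold.

    Returns 0-based, half-open (start, end) intervals. The threshold is an
    assumed cut-off chosen for illustration only.
    """
    segments = []
    for start, value in enumerate(hydropathy_profile(seq, window)):
        if value < threshold:
            continue
        end = start + window
        if segments and start <= segments[-1][1]:
            segments[-1] = (segments[-1][0], end)  # extend the last segment
        else:
            segments.append((start, end))
    return segments


def n_terminus_side(seq, segments):
    """Apply the positive-inside rule: the side with more Lys/Arg is 'in'."""
    loops, previous_end = [], 0
    for start, end in segments:
        loops.append(seq[previous_end:start])
        previous_end = end
    loops.append(seq[previous_end:])
    positives = [sum(aa in "KR" for aa in loop) for loop in loops]
    same_side = sum(positives[0::2])   # loops on the N-terminal side
    other_side = sum(positives[1::2])  # loops on the opposite side
    return "in" if same_side >= other_side else "out"
```

Because every window that reaches the threshold is merged, the reported segments tend to extend beyond the true helix ends; practical methods refine the boundaries in a separate step.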

5.2 Machine Learning Approaches for α-Helical Topology Prediction

Despite their early success, these methods based on hydrophobicity analysis combined with the ‘positive-inside’ rule have since been superseded by machine learning approaches which offer substantially higher prediction accuracy due to their probabilistic formulation (Table 5.2). Hidden Markov models (HMMs) were among the first supervised learning algorithms to be applied to TM topology prediction, with both TMHMM (Krogh et al. 2001) and HMMTOP (Tusnady and Simon 1998) proving highly successful. TMHMM implemented a cyclic model with seven states for a TM helix, while HMMTOP used HMMs to distinguish between five structural states [helix core, inside loop, outside loop, helix caps (C and N) and globular domains]. These states were connected by transition probabilities before dynamic programming was used to match a sequence against a model with the most probable topology. HMMTOP also allowed constrained predictions to be made, where specific residues could be fixed to a topological location based on experimental data, as did other methods such as HMM-TM (Bagos et al. 2006). Later HMM-based predictors include PRODIV-TMHMM and PolyPhobius, both of which made use of evolutionary information from homologs resulting in substantially increased performance (Viklund and Elofsson 2004; Kall et al. 2005).
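The decoding step, in which dynamic programming finds the most probable state path through the model, can be sketched with a toy HMM and the Viterbi algorithm. The four-state architecture, the three-letter residue alphabet and all probabilities below are invented for illustration; they are far simpler than the TMHMM or HMMTOP models and are not taken from either method.

```python
import math

# Toy topology grammar: inside loop -> helix (in to out) -> outside loop
# -> helix (out to in) -> inside loop. All numbers are assumptions.
STATES = ["i", "Mio", "o", "Moi"]
START = {"i": 0.5, "o": 0.5}
TRANS = {
    "i": {"i": 0.9, "Mio": 0.1},
    "Mio": {"Mio": 0.9, "o": 0.1},
    "o": {"o": 0.9, "Moi": 0.1},
    "Moi": {"Moi": 0.9, "i": 0.1},
}
# Emissions over residue classes: h = hydrophobic, + = Lys/Arg, x = other.
EMIT = {
    "i": {"h": 0.2, "+": 0.4, "x": 0.4},    # positive-inside bias
    "o": {"h": 0.2, "+": 0.1, "x": 0.7},
    "Mio": {"h": 0.8, "+": 0.02, "x": 0.18},
    "Moi": {"h": 0.8, "+": 0.02, "x": 0.18},
}


def residue_class(aa):
    if aa in "AILMFVWC":
        return "h"
    if aa in "KR":
        return "+"
    return "x"


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for a non-empty sequence."""
    obs = [residue_class(aa) for aa in seq]
    score = {s: log(START.get(s, 0.0)) + log(EMIT[s][obs[0]]) for s in STATES}
    pointers = []
    for symbol in obs[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best predecessor of state s at this position.
            prev = max(STATES,
                       key=lambda p: score[p] + log(TRANS[p].get(s, 0.0)))
            new_score[s] = (score[prev] + log(TRANS[prev].get(s, 0.0))
                            + log(EMIT[s][symbol]))
            pointer[s] = prev
        score = new_score
        pointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(pointers):
        state = pointer[state]
        path.append(state)
    return path[::-1]
```

In this toy model the only difference between the two loop states is their preference for Lys and Arg, so the orientation of a predicted helix is decided by where the positive charges lie.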

Table 5.2 Topology prediction methods for α-helical transmembrane proteins

Neural networks (NNs) were employed by early methods including PHDhtm (Rost et al. 1996) and MEMSAT3 (Jones 2007). PHDhtm used multiple sequence alignments to perform a consensus prediction of TM helices by combining two NNs. The first was a ‘sequence-to-structure’ network, which represented the structural propensity of the central residue in a window. A ‘structure-to-structure’ network then smoothed these propensities to predict TM helices, before the ‘positive-inside’ rule was applied to produce an overall topology. MEMSAT3 uses a neural network and dynamic programming not only to predict TM helices, but also to score the topology and to identify possible signal peptides. Additional evolutionary information provided by multiple sequence alignments led to prediction accuracies increasing to as much as 80%. OCTOPUS (Viklund and Elofsson 2008) used a novel combination of hidden Markov models and artificial neural networks to further increase performance.

Later, Support Vector Machines (SVMs) gained in popularity and were successfully applied to TM protein topology prediction (Yuan et al. 2004; Lo et al. 2006, 2008). When used with non-linear kernel functions, SVMs are capable of learning complex relationships among the amino acids within the window on which they are trained, particularly when provided with evolutionary information, and are also more resilient to the problem of over-training compared to other machine learning methods. MEMSAT-SVM (Nugent and Jones 2009), an extension of MEMSAT3, used multiple SVM models to classify sequence into one of four states [TM helix, inside or outside loop, re-entrant helix, or signal peptide] before calculating the most likely topologies using dynamic programming, while a further SVM was used to discriminate between globular and TM proteins. Although multiclass SVMs do exist, their performance is typically poorer than binary SVMs since in many cases no single mathematical function exists to separate all classes of data from one another.

More recently, other machine learning algorithms have been applied to TM helix and topology prediction including dynamic Bayesian networks (Reynolds et al. 2008), random forests (Hayat and Khan 2013), self-organizing maps (Deng 2006) and deep learning (Qi et al. 2012). A selection of machine learning-based predictors can be found in Table 5.2.

5.3 Signal Peptides and Re-entrant Helices

One significant challenge faced by topology predictors is the discrimination between TM helices and other highly hydrophobic structural features. These include targeting motifs such as signal peptides and signal anchors, amphipathic helices, and re-entrant helices (membrane-penetrating helices that enter and exit the membrane on the same side), which are common in many ion channel families (Fig. 5.2). The similarity between such features and the hydrophobic profile of a TM helix frequently leads to crossover between the different types of predictions. Should these elements be predicted as TM helices, the ensuing topology prediction is likely to be severely disrupted. Some prediction methods, such as SignalP (Petersen et al. 2011; Bendtsen et al. 2004) and TargetP (Emanuelsson et al. 2007), are effective in identifying signal peptides in TM proteins, and may be used as a pre-filter prior to analysis using a TM topology predictor. Phobius (Kall et al. 2004) used an HMM to successfully address the problem of signal peptides in TM protein topology prediction, while PolyPhobius (Kall et al. 2005) further increased accuracy by including homology information. Other methods such as MEMSAT-SVM, OCTOPUS and SPOCTOPUS (Viklund et al. 2008) have also attempted to incorporate identification of re-entrant regions and signal peptides into TM topology prediction, but there is significant room for improvement. The problem, particularly regarding re-entrant helices, is the lack of reliable data with which to train machine learning-based methods.

Fig. 5.2 Potassium channel KcsA (PDB code 1R3J). Each monomer of the homo-tetrameric complex consists of two TM helices and one re-entrant helix (orange), which surrounds the central pore and is involved in channel gating.

5.4 Consensus Approaches for α-Helical Topology Prediction

A number of methods successfully combine multiple machine learning approaches; for example, ENSEMBLE (Martelli et al. 2003) uses an NN and two HMMs, while OCTOPUS uses two sets of four NNs and one HMM. However, perhaps the best overall methods are those which adopt a consensus approach by combining the results of several predictors to yield more reliable results. Early consensus predictors such as BPROMPT (Taylor et al. 2003) combined the outputs of five different predictors to produce an overall topology using a Bayesian belief network, while Nilsson et al. (2002) used a simple majority-vote approach to return the best topology from their five predictors. The PONGO server (Amico et al. 2006) returns the results of five high-scoring methods in a graphical format for direct comparison. More recently, MetaTM (Klammer et al. 2009), which is based on SVM models, combines the results of six TM topology predictors and two signal peptide predictors. TOPCONS (Tsirigos et al. 2015; Bernsel et al. 2009) combines a number of topology predictions into one consensus prediction, while also quantifying the reliability of the prediction based on the level of agreement between the underlying methods, both at the protein level and at the level of individual TM regions (Fig. 5.3). Results indicate an overall increase in performance of 4% compared to the currently available best-scoring methods. CCTOP (Dobson et al. 2015b) makes use of ten different topology prediction methods, while also incorporating topology information from existing experimental and computational resources such as the PDBTM, TOPDB and TOPDOM databases, using an HMM. In most cases, but particularly for proteins whose topology is not straightforward, using a consensus-based method is highly advisable.
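The idea of combining predictors and using their level of agreement as a reliability estimate can be sketched as a per-residue majority vote. This is a deliberately simplified illustration: the function name and the topology string encoding (i, M, o) are assumptions, and none of the published consensus methods above is implemented this way.

```python
from collections import Counter


def consensus_topology(predictions):
    """Per-residue majority vote over equal-length topology strings.

    Each string labels every residue as 'i' (inside), 'M' (membrane) or
    'o' (outside). Returns the consensus string and, for each residue,
    the fraction of predictors that agree with the consensus label.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    labels, agreement = [], []
    for column in zip(*predictions):
        label, votes = Counter(column).most_common(1)[0]
        labels.append(label)
        agreement.append(votes / len(predictions))
    return "".join(labels), agreement
```

Residues where the agreement falls well below 1.0 typically sit at helix boundaries, which is where individual predictors disagree most. A per-residue vote does not by itself guarantee a valid topology, so published methods add a further step that enforces the topology grammar.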

Fig. 5.3 Consensus topology prediction by TOPCONS (Tsirigos et al. 2015; Bernsel et al. 2009). The results from a number of individual predictors are combined to produce the TOPCONS prediction.

5.5 Transmembrane β-Barrel Topology Prediction

Topology prediction of TMBs entails the estimation of the number and the location of TM β-strands. Traditional methods based on a sliding-window hydrophobicity profile are not sufficiently accurate, most likely due to the shorter length and less prominent hydrophobic nature of the TM β-strands. This problem is further complicated by the presence of other β-sheet rich regions in full protein sequences, such as the pre-barrel region (seen, for example, in the EstA autotransporter protein; PDB code 3kvn) and large plug domains that reside inside the barrel (as seen in the FecA protein; PDB code 1kmp). Additionally, the absence of long stretches of hydrophobic residues makes it harder to distinguish TM β-strands from β-sheets in globular proteins. One strategy to predict the topology of TMBs relies on first predicting whether the query sequence is a TMB or not (Table 5.3) and then using dedicated computational methods to predict the topology of sequences that are predicted to be TMBs. This can potentially improve the accuracy of computational methods that are based on learning from data points available from known 3D structures. BOCTOPUS in combination with PSORTb (Imai et al. 2013), which is a bacterial subcellular localization tool, can be used to identify putative TMBs. The idea here is that proteins for which topology prediction methods predict at least 8 strands, and whose predicted subcellular localization is ‘outer-membrane’, are potential TMBs. BETAWARE (Savojardo et al. 2013a) is a machine learning-based tool that predicts whether a protein is a TMB using N-to-1 network encoding and then predicts the topology using a constrained grammar. Other methods employ a combination of secondary structure features, hydrophobicity, amino acid composition and empirical scores to identify putative TMBs. In general, TMB topology prediction methods can be classified as empirical, machine learning and consensus-based. A few of these methods are discussed below (Table 5.4).
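The filtering rule described above (at least 8 predicted strands together with a predicted outer-membrane localization) is simple enough to state as code. The sketch below assumes that the topology prediction is available as a string in which 'M' marks strand residues, and that the localization is a plain label; both encodings and the label "OuterMembrane" are hypothetical and do not reflect the actual output formats of BOCTOPUS or PSORTb.

```python
from itertools import groupby


def count_strands(topology, strand_label="M"):
    """Count contiguous runs of the strand label in a topology string."""
    return sum(1 for label, _ in groupby(topology) if label == strand_label)


def is_putative_tmb(topology, localization, min_strands=8):
    """Flag a protein as a putative TMB under the two-criterion rule.

    `localization` is an assumed label; "OuterMembrane" is a placeholder
    for whatever the localization tool reports for the outer membrane.
    """
    return (localization == "OuterMembrane"
            and count_strands(topology) >= min_strands)
```

For example, a prediction with eight strand runs and an outer-membrane label passes the filter, whereas the same topology with a cytoplasmic label, or an outer-membrane protein with only six predicted strands, does not.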

Table 5.3 Computational methods for identifying transmembrane β-barrels
Table 5.4 Topology prediction methods for transmembrane β-barrels

5.6 Empirical Approaches for β-Barrel Topology Prediction

Traditionally, features based on knowledge gained from 3D structures, such as hydrophobicity analysis over a sliding window, amino acid distribution, and the length of TM β-strands and outer/inner loops, have been used for the topology prediction of TMBs (Schirmer and Cowan 1993; Gromiha et al. 1997; Gromiha and Ponnuswamy 1993; Diederichs et al. 1998). Wimley et al. (2002) combined features such as the hydrophobicity profile, amino acid composition, known variation in the length of inner loops and the abundance of residues facing the lipids or the barrel pore to formulate a computational score to predict TM stretches and also identify putative TMBs. The distribution of amino acids on a transmembrane β-strand along the membrane normal and the occurrence of the dyad-repeat pattern were employed by Jackups and Liang (2005) to improve the location of predicted strands and estimate the strand registration such that the maximum number of hydrogen bonds was satisfied between two adjacent β-strands.
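As a rough illustration of how a dyad-repeat pattern might be detected in sequence, the sketch below scores a candidate segment by how well it matches an alternating hydrophobic/non-hydrophobic pattern. The residue set, the function name and the scoring scheme are assumptions made for this example; it is not the method of Wimley et al. (2002) or of Jackups and Liang (2005), and it ignores the fact that pore-facing residues can themselves be hydrophobic.

```python
# Residues treated as hydrophobic for this illustration (an assumption).
HYDROPHOBIC = set("AVILMFWYC")


def dyad_score(segment):
    """Best fraction of residues matching an alternating pattern.

    Phase 0 expects hydrophobic residues at even positions and other
    residues at odd positions; phase 1 is the complementary pattern.
    """
    phase0_matches = sum(
        (aa in HYDROPHOBIC) == (i % 2 == 0) for i, aa in enumerate(segment)
    )
    # Every position that fails phase 0 matches phase 1, and vice versa.
    best = max(phase0_matches, len(segment) - phase0_matches)
    return best / len(segment)
```

A perfectly alternating segment scores 1.0, while a uniformly hydrophobic stretch, which is more typical of a TM α-helix, scores 0.5.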

5.7 Machine Learning Approaches for β-Barrel Topology Prediction

Machine learning-based methods for the topology prediction of TMBs are typically trained on a dataset of labeled data points extracted from known 3D structures. Rost and Sander (1993) showed early on that the use of information obtained from multiple sequence alignments yields higher prediction accuracy compared to using features from a single sequence alone. SVMs, neural networks and hidden Markov models have all been used for TMB topology prediction (Table 5.4). The use of a sequence profile-based HMM for the identification and topology prediction of TMBs was first introduced by Martelli et al. (2002). PROFtmb (Bigelow and Rost 2006) and PRED-TMBB (Bagos et al. 2004) used a similar approach, where an HMM is used to predict strand, inner-loop and outer-loop states using a sequence profile. The HMM architecture employed in these methods was chosen such that it resembled a pair of strands (up and down), a self-loop representing the long outer loops that connect the two strands on the extracellular side, and a self-loop representing the inner loops on the opposite side. The number of states representing the β-strand region was chosen to account for the variation in the length of these elements that form TMBs.

Recently, two-stage predictors such as BOCTOPUS (Hayat and Elofsson 2012a) and tobmodel (Hayat and Elofsson 2012b) have been implemented. These methods employ SVMs in the first stage to predict the local preference of each residue to form an outer-loop, inner-loop or membrane strand region. The output of this stage is then fed to an HMM that predicts the overall topology. Another approach, BETAWARE (Savojardo et al. 2013a), consists of two methods: first, an N-to-1 Extreme Learning Machine algorithm is used for the identification of TMBs, and then a Grammatical-Restrained Hidden Conditional Random Field approach is used to predict the topology. In contrast to other methods, transFold (Waldispühl et al. 2006) does not require a training set but uses a grammar to predict the β-strands and inter-strand residue contacts. Most of these topology prediction methods can also be used for distinguishing TMBs from non-TMBs.

5.8 Consensus Approaches for β-Barrel Topology Prediction

To our knowledge, conBBPRED (Bagos et al. 2005) is the only consensus method available for TMB topology prediction. conBBPRED assigns a per-residue score by averaging over the contributions of each individual predictor, followed by a dynamic programming step to obtain the overall topology. On a dataset of 20 proteins, conBBPRED increases the accuracy of predicted topologies by 15% (Bagos et al. 2005). With larger datasets and more topology predictors becoming available, it will be interesting to see if consensus topology prediction methods for TMBs show improved accuracy over single methods.

6 3D Structure Prediction

As with globular proteins, 3D structure prediction of TM proteins can be dealt with via two main approaches, homology modelling and de novo modelling, covered in Chaps. 1 and 4 of this book.

6.1 Homology Modelling of α-Helical Transmembrane Proteins

Homology modelling involves the use of a related template structure in order to build a 3D model of a target protein. The method is based on the observation that protein structure is conserved more highly than amino acid sequence, hence even proteins that have diverged significantly in sequence but still share detectable similarity may also share common structural properties, and in particular, the overall fold. When a suitable template is available, predicting TM protein structure by homology modelling can be highly effective, especially when tools specifically designed for modelling TM proteins are used. Compared to globular proteins, lower sequence conservation is required for fold preservation in transmembrane regions, so it may even be possible to generate useful 3D models with templates that share as little as 20% sequence identity to the target, although the paucity of high resolution membrane protein structures will still limit the number of families that such methods are applicable to (Olivella et al. 2013).

A homology modelling protocol can be subdivided into a number of key steps which can each be performed iteratively to improve the quality of the final model: template selection, target-template alignment, model construction, and model quality assessment (Marti-Renom et al. 2000; Sanchez and Sali 1997). Aside from SWISS-MODEL (Peitsch 1996; Biasini et al. 2014), which has a 7TM/GPCR interface, few TM protein-specific homology modelling methods exist. MEDELLER (Kelm et al. 2010) adapts the steps in structure prediction to take into account the differences between the physical environments of globular and TM proteins. The method is optimized to build a highly reliable core structure shared by the template and target proteins by first calculating membrane insertion using iMembrane (Kelm et al. 2009), which is used to guide target-template alignment by MP-T (Hill and Deane 2013). The core is gradually extended using a specialized membrane-specific substitution score, before loops are completed using the loop modelling protocols FREAD (Choi and Deane 2010) and Modeller (Marti-Renom et al. 2000). Results show that MEDELLER produces accurate core models, achieving a core model accuracy of 1.97 Å RMSD versus 2.57 Å for Modeller. The Memoir modelling pipeline now provides a fully automated web server that applies this protocol to both α-helical and β-barrel TM proteins (Ebejer et al. 2013).
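For readers unfamiliar with the RMSD measure quoted above, the following sketch computes the root-mean-square deviation between two equal-length sets of atomic coordinates, for example the Cα atoms of a model and of the experimental structure. It assumes that the two structures have already been superposed and that atoms are listed in corresponding order; the function name is illustrative, and the optimal superposition step used in practice is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) coordinates.

    Assumes both structures are already superposed and that the i-th
    atom in coords_a corresponds to the i-th atom in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

With coordinates in Å, the result is also in Å, so lower values indicate a model that lies closer to the reference structure.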

Chen and co-workers developed a method specifically to deal with the issue of building homology models from very distantly related homologues exhibiting distinct loop and TM helix conformations (Chen et al. 2014). The approach is based on efficient sampling of alternative TM helix structures in order to reconstruct both TM core and loop regions from distant structural homologues, resulting in high quality models that were top-ranked when stringently validated in two blind predictions (Kufareva et al. 2011; Michino et al. 2009). Since the method requires only a single distant homologue, they estimate that around 60% of human membrane proteins can be reliably modelled using their approach, allowing the generation of 3D models for a large and diverse fraction of structurally uncharacterized TM proteins.

A number of tools also exist to model specific regions of TM proteins. These include TM loop regions, which have been shown to differ significantly from loop regions in globular proteins. Kelm and co-workers showed that it is possible to accurately predict the structure of TM loops using a database of small TM protein loop fragments (0.8–1.6 Å). Their findings show that while many globular protein fragments have similar shapes to their TM counterparts, their sequences are often very different, although they do not appear to differ in their substitution patterns. Their method is implemented as a modification to FREAD (Kelm et al. 2014). Modelling of TM kinks has also attracted considerable attention, as kinks have been observed to play important functional and structural roles in TM proteins (Yohannan et al. 2004). Tools to model TM kinks include MC-HELAN, a Monte Carlo-based algorithm that determines helical axes alongside the positions and angles of helical kinks (Langelaan et al. 2010); HELANAL-Plus (Kumar and Bansal 2012), a web server for analysis of helix geometry in TM protein structures; and TMKink, a neural network predictor that identifies over two-thirds of all bends with high sensitivity and specificity (Meruelo et al. 2011).

6.2 Homology Modelling of Transmembrane β-Barrel Proteins

For transmembrane β-barrel proteins, HHomp (Remmert et al. 2009) can be used to identify remote homologues with a known 3D structure that can act as templates for 3D modelling. Standard application of MEDELLER or MODELLER can then be used to generate all-atom homology models (Kelm et al. 2010; Marti-Renom et al. 2000). The TMBpro method (Randall et al. 2008) uses a combination of machine-learning methods to predict the location of β-strands and inter-strand contacts, and then selects templates from TMBs with known 3D structure by matching the number of β-strands. However, as stated above, a key limitation of such an approach is that it is limited to protein sequences for which a reliable template can be found. Additionally, for transmembrane β-barrels, where identification of novel families is still an open issue, such an approach might miss reliable templates.

6.3 De Novo Modelling of α-Helical Transmembrane Proteins

De novo modelling, or ab initio modelling, involves the construction of a 3D model in the absence of any tertiary structural data relating to the target protein. As with homology modelling, most methods address globular proteins although recently a number of methods have emerged specifically to deal with TM proteins including FILM (Pellegrini-Calace et al. 2003), RosettaMembrane (Barth et al. 2007, 2009) and BCL::MP-fold (Weiner et al. 2013) (Table 5.5).

Table 5.5 3D modelling tools for α-helical transmembrane proteins

FILM (Folding In Lipid Membranes) is a modification of the globular protein structure prediction method FRAGFOLD (Jones and McGuffin 2003; Jones 1997). FRAGFOLD employs simulated annealing in order to perform a conformational search using high-resolution super-secondary structural fragments to assemble the tertiary fold, guided by a statistical function that includes pairwise, solvation, steric and hydrogen bonding energy terms. FILM added a knowledge-based membrane potential term to the FRAGFOLD energy function, derived from the statistical analysis of a data set of 640 transmembrane helices whose topologies had been determined experimentally. The relative frequencies of each amino acid at fixed distances from the membrane centre were assessed, allowing the membrane potential term to be calculated by transforming these values using the inverse Boltzmann equation. Results indicated that it was possible to predict both the topology and conformation of small proteins at a reasonable level of accuracy, although attaining the level of compactness observed in larger TM helix bundles was challenging, since TM helix bundles are usually not optimally compact despite neighboring helices being closely packed together. Further modification to FILM allowed progress to be made in the prediction of larger TM helix bundles by incorporating another term accounting for lipid exposure into the energy function. This allowed models of seven TM helix bacteriorhodopsin and rhodopsin to be generated to within 6–7 Å root mean square deviation (rmsd) of the native structure (Hurwitz et al. 2006).
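The inverse Boltzmann transformation described above can be sketched as follows. This is a minimal illustration rather than the FILM implementation: the reference state (the overall frequency of each amino acid across all distance bins), the pseudocount, the use of reduced units (kT = 1) and the toy counts are all assumptions made for this example.

```python
import math


def inverse_boltzmann_potential(counts, kT=1.0, pseudocount=1.0):
    """Convert residue counts per distance bin into a knowledge-based potential.

    counts: dict mapping amino acid -> list of observed counts, one entry per
            bin of distance from the membrane centre (hypothetical layout).
    Returns E[aa][bin] = -kT * ln( P(aa | bin) / P(aa) ).
    """
    n_bins = len(next(iter(counts.values())))
    # Pseudocounts avoid log(0) for unobserved residue/bin combinations.
    smoothed = {aa: [c + pseudocount for c in row] for aa, row in counts.items()}
    bin_totals = [sum(row[b] for row in smoothed.values()) for b in range(n_bins)]
    grand_total = sum(bin_totals)

    potential = {}
    for aa, row in smoothed.items():
        background = sum(row) / grand_total  # assumed reference state
        potential[aa] = [
            -kT * math.log((row[b] / bin_totals[b]) / background)
            for b in range(n_bins)
        ]
    return potential


if __name__ == "__main__":
    # Toy data: bin 0 = membrane core, bin 1 = outside the core.
    toy_counts = {"L": [90, 10], "K": [10, 90]}
    for aa, energies in inverse_boltzmann_potential(toy_counts).items():
        print(aa, [round(e, 3) for e in energies])
```

With these toy counts the hydrophobic residue receives a favourable (negative) energy in the membrane core and the charged residue an unfavourable one, which is the qualitative behaviour such a membrane potential term is intended to capture.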

RosettaMembrane is also a modification of a globular protein structure prediction method, Rosetta (Rohl et al. 2004; Simons et al. 1999), which, like FRAGFOLD, assembles folds from fragments of known structures using simulated annealing or parallel tempering, an effective algorithm for overcoming slow convergence in low-temperature protein simulations. RosettaMembrane added terms to the Rosetta energy function that describe intra-protein and protein-solvent interactions in the anisotropic membrane environment, treating hydrogen bonds explicitly and membrane protein/lipid interactions implicitly. The method describes interactions between protein residues at atomic detail while applying continuum solvent models to the water, hydrophobic core, and lipid head group regions of the membrane. Results suggest that the model captures the essential physical properties that govern the solvation and stability of membrane proteins, allowing the structures of 12 small TM protein domains (<150 residues) to be predicted successfully to a resolution of <2.5 Å (129), comparing favourably with predictions obtained on small water-soluble protein domains. More recently, the method was extended to incorporate distance constraints into the predictions to direct helix-helix interactions, the constraints being derived from either experimental data or sequence-based predictions (Fuchs et al. 2009; Lo et al. 2009; Nugent et al. 2011; Nugent and Jones 2010). This allowed larger (90–300 residues) structures with more complicated topologies to be successfully modelled to within 4 Å rmsd in the best four cases, with results indicating that a single constraint was sometimes sufficient to enrich the population of near-native models.

A recent method, BCL::MP-fold (Weiner et al. 2013), a modification of BCL::Fold (Karakas et al. 2012), generates models within a static membrane object by evaluating conformations using a knowledge-based energy potential. This potential takes into account the unique properties of the apolar membrane in the amino acid environment potential, as well as an increased radius of gyration along the membrane normal. Three additional terms are introduced: the first describes the preferential orientation of secondary structure elements with respect to the membrane, the second penalises connections between two neighbouring TM helices that would require passage through the membrane, and the third assesses the agreement of residue placement in TM regions with predictions from sequence. Additionally, a symmetry folding mode allows for the prediction of obligate homo-multimeric TM complexes. A benchmark test using 40 TM protein 3D structures demonstrated that the method is able to predict the correct topology in 34 cases, suggesting the approach can successfully predict protein topology without the need for large multiple sequence alignments, homologous template structures, or experimental restraints.

6.4 De Novo Modelling of Transmembrane β-Barrels

The topological arrangement of β-strands in transmembrane β-barrels is regular and can be exploited to generate 3D models of TMBs based on an idealized geometry (Naveed et al. 2012; Hayat and Elofsson 2012b). Existing methods based on idealized geometry approximate the diameter of a TMB from its number of strands. Additionally, the 3D coordinates of Cα atoms along β-strands, and their placement with respect to the in-register Cα atom, can be determined using a theoretical description (Chou et al. 1990; Murzin et al. 1994a, b). Tobmodel uses these regular structural features to generate idealized Cα coordinates for TMBs (Hayat and Elofsson 2012b). Another method, 3d-SpoT, uses an empirical scoring function derived from frequencies of lipid-facing and pore-facing residues in known TMB structures to find the optimal strand registration and then uses a geometric model of intertwined coils to generate 3D models (Naveed et al. 2012) (Table 5.6).
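A minimal sketch of such an idealized-geometry calculation is given below. It assumes the relations usually attributed to Murzin et al. (1994a), in which the strand tilt α and barrel radius R follow from the number of strands n and the shear number S as tan α = Sa/(nb) and R = b/(2 sin(π/n) cos α), with a = 3.3 Å (Cα spacing along a strand) and b = 4.4 Å (spacing between adjacent strands). The function name and the example inputs are illustrative, and the sketch is not the Tobmodel or 3d-SpoT implementation.

```python
import math


def barrel_geometry(n_strands, shear, a=3.3, b=4.4):
    """Idealized β-barrel radius (Å) and strand tilt (degrees).

    n_strands: number of β-strands in the barrel.
    shear:     shear number S (an extra input assumed for this sketch).
    a, b:      assumed Cα spacing along a strand and between strands (Å).
    """
    if n_strands < 3:
        raise ValueError("a closed barrel needs at least three strands")
    tilt = math.atan2(shear * a, n_strands * b)  # tan(alpha) = S*a / (n*b)
    radius = b / (2.0 * math.sin(math.pi / n_strands) * math.cos(tilt))
    return radius, math.degrees(tilt)


if __name__ == "__main__":
    # Illustrative inputs: a small 8-stranded barrel with shear number 10.
    radius, tilt_deg = barrel_geometry(8, 10)
    print(f"radius = {radius:.2f} A, diameter = {2 * radius:.2f} A, tilt = {tilt_deg:.1f} deg")
```

Because the radius grows with both n and S, methods that estimate the diameter from the strand count alone implicitly assume a typical shear number for each barrel size.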

Table 5.6 3D modelling tools for transmembrane β-barrels

6.5 Covariation-Based Approaches

Until recently, the use of knowledge-based potentials derived from the statistical analysis of known protein structures was the standard approach for de novo structure prediction. Over the last five years, the field has seen dramatic progress as new methods have emerged that are capable of accurately inferring residue-residue contacts from large multiple sequence alignments (MSAs), allowing 3D structures to be computed directly from sequence data. Two key factors have led to this revolution: firstly, the rapid growth in the size of sequence databases, which has resulted in the number of sequences available for a typical protein family increasing by orders of magnitude (Sadowski and Taylor 2013), and secondly, the application of advanced statistical methods to this sequence data, allowing the detection of true correlated mutations between sites in MSAs. The main idea behind correlated mutations is that residues that are proximal in 3D space are more likely to impose constraints on each other, which should lead to a correlation in their substitution patterns in the MSA. Mutation of either residue might disrupt the stability of the contact, which is likely to have an impact on the stability of the overall fold. Subsequent mutation of one or both residues to a more physicochemically complementary pairing may increase the likelihood of the contact being maintained; therefore residue pairs that form contacts are often seen to covary. It is this property that modern contact prediction methods seek to exploit.
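The notion of covarying alignment columns can be illustrated with mutual information, one of the simplest pairwise covariation measures. This is a sketch under stated assumptions: mutual information is a local statistic and is not the global statistical approach used by the modern methods, the toy alignment is invented, gaps would be treated as an ordinary symbol, and no sequence weighting or finite-sample correction is applied.

```python
import math
from collections import Counter


def column_mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an MSA.

    msa: list of equal-length aligned sequences (strings).
    """
    n = len(msa)
    freq_i = Counter(seq[i] for seq in msa)
    freq_j = Counter(seq[j] for seq in msa)
    freq_ij = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


if __name__ == "__main__":
    # Toy alignment: columns 0 and 1 always change together,
    # while column 2 varies independently of both.
    toy_msa = ["AKL", "AKV", "DEL", "DEV"]
    print(column_mutual_information(toy_msa, 0, 1))
    print(column_mutual_information(toy_msa, 0, 2))
```

In the toy alignment the first pair of columns carries the maximum possible signal for two equally frequent states, whereas the second pair carries none.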

A number of different methods have been developed for predicting contacts from sequence data based on the recognition of these residue covariation patterns. Until now, the major obstacle to achieving performance useful for structure prediction has been dealing with indirect coupling effects: should direct contacts exist between sites A–B and A–C, an apparent interaction may appear between B–C even though no direct contact exists. The approach of Lapedes et al. (1999) dealt with this so-called chaining problem by applying a maximum entropy approach, but at a high computational cost. The Direct Coupling Analysis (DCA) method reduced the problem to one of maximum entropy inference, applying a heuristic message passing approach to determine the solution of the contact weights (Weigt et al. 2009). This allowed the approach of Lapedes et al. to be put to practical use, with prediction accuracy achieving sufficient quality to be useful in structure prediction (Taylor and Sadowski 2011). PSICOV is based on sparse inverse covariance estimation (Jones et al. 2012). It applies the graphical lasso method (Friedman et al. 2008) to estimate the inverse of the covariance matrix, which is calculated from the MSA, whilst also constraining the solution to be sparse. The inverse covariance matrix, also known as the precision matrix, gives the correlation between any two sites in the MSA, conditional on observations at all other sites. This global statistical model was able to predict contacts with an accuracy approaching 80%, even for long-range contacts (those separated by >23 residues in the sequence), which is sufficient to identify the native fold for medium-sized (<200 residue) globular proteins, provided that enough aligned sequences are available. A more recent method, plmDCA (Ekeberg et al. 2013), uses a pseudo-likelihood approach applied to the Potts model. This has been shown to significantly outperform existing DCA-based approaches, while consensus approaches such as PconsC (Skwark et al. 2013) and MetaPSICOV (Jones et al. 2015) further improve performance.
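The way in which an inverse covariance matrix separates direct from indirect couplings can be shown with a three-site toy example. The sketch below is illustrative only: the covariance values are invented so that site A is coupled to both B and C, the matrix is inverted exactly rather than estimated with the graphical lasso, and real methods work on much larger matrices built from amino acid frequencies in the MSA.

```python
from fractions import Fraction
import math


def invert(matrix):
    """Exact Gauss-Jordan inversion of a small square matrix of Fractions."""
    n = len(matrix)
    aug = [[Fraction(x) for x in row] + [Fraction(int(r == c)) for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlation(precision, i, j):
    """Correlation between sites i and j conditional on all other sites."""
    return float(-precision[i][j]) / math.sqrt(float(precision[i][i] * precision[j][j]))


# Hypothetical covariance matrix for sites A, B, C (in that order).
F = Fraction
COVARIANCE = [
    [F(1),    F(1, 2), F(1, 2)],
    [F(1, 2), F(3, 4), F(1, 4)],
    [F(1, 2), F(1, 4), F(3, 4)],
]

if __name__ == "__main__":
    precision = invert(COVARIANCE)
    print("cov(B, C)       =", COVARIANCE[1][2])
    print("precision(B, C) =", precision[1][2])
    print("partial corr(A, B) =", partial_correlation(precision, 0, 1))
    print("partial corr(B, C) =", partial_correlation(precision, 1, 2))
```

Sites B and C have a non-zero covariance, yet the corresponding precision matrix entry vanishes, indicating that their apparent coupling is fully explained by their shared coupling to A.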

6.6 Evolutionary Covariation-Based Methods for De Novo Modelling of α-Helical Membrane Proteins

The performance of these methods has led to the development of a number of de novo structure prediction methods capable of generating accurate models for even large domains, guided primarily by predicted contacts. EVfold_membrane (Hopf et al. 2012) incorporates predicted transmembrane topology into the EVfold protocol (Marks et al. 2011), which uses DCA in combination with the CNS molecular dynamics software suite to generate 3D models. A web server to de novo fold proteins using the EVfold protocol with DCA and plmDCA has also been implemented (Sheridan et al. 2015). It was shown to be capable of generating accurate models within the top-10 ranked structures for fifteen targets ranging in size from 50 to 260 residues to within 2.7–4.8 Å rmsd of their native structures over at least two-thirds of the protein length. The latest version of FILM, FILM3, replaces the statistical potential with a single scoring function based on predicted contacts and their estimated probabilities (Nugent and Jones 2012). Using contacts predicted by PSICOV, results indicate that models with TM-scores >0.5 could be generated for 25 out of 28 membrane protein targets with complex topologies and an average length over 300 residues (Fig. 5.4). In the most remarkable case, it was possible to build a model for all 514 residues of cytochrome c oxidase polypeptide I with a TM-score >0.75. As encouraging as these results are, data suggest that even with perfect distance constraints, folding methods are unable to generate models within 2 Å rmsd of the native structure, suggesting that protein refinement protocols will play an increasingly important role in generating higher accuracy models.

Fig. 5.4 Model of CASP 11 free modelling target T0836 (right), a 5-helix TM protein. Predicted contacts were generated using MetaPSICOV (Jones et al. 2015), enabling a model to be produced using the FILM3 protocol (Nugent and Jones 2012), resulting in a TM-score of 0.60 (Kosciolek and Jones 2015). The native structure is on the left.

6.7 Evolutionary Covariation-Based Methods for Transmembrane β-Barrel Structure Prediction

Transmembrane β-barrels have a uniform β-strand topological pattern, where alternate strands traverse from the inside to the outside and vice versa, and additionally, anti-parallel β-strands have a unique hydrogen-bonding pattern. These structural features can be exploited to enhance the accuracy of predicting residue pairs in contact between two adjacent β-strands. Further, they can also be used to estimate the registration (the relative position of two strands with respect to each other) of two adjacent β-strands. This has been shown to be useful for 3D modelling of TMBs (Hayat and Elofsson 2012b; Naveed et al. 2012; Randall et al. 2008). Additionally, Hayat et al. (2015) have implemented a simple strand-shift algorithm to identify the correct registration of TM β-strands in TMBs: adjacent strands are shifted up and down relative to each other to find the position that gives the highest sum of evolutionary couplings (ECs) between paired residues. This hybrid algorithm, which combines empirical knowledge about TM β-strands with evolutionary covariation analysis-based contact prediction, improves the prediction accuracy of inter-strand residue contacts. The predicted inter-strand constraints can then be used to identify the underlying hydrogen-bonding network, and the resulting interactions are used as distance constraints to de novo fold large TMBs using a tool called EVfold_bb (Hayat et al. 2015). The EVfold_bb method can correctly predict the 3D structure with an average TM-score of 0.54 for the top-ranking models. EVfold_bb can also identify the correct inter-strand registration with an accuracy of 44% (in generated models), which is an improvement over Tobmodel (18%), which does not use ECs to guide the search for the optimal strand registration. Moreover, the generated models are not restricted to idealized geometries and do not require a template. Most interestingly, EVfold_bb can also identify and model 3D interactions between the barrel and the large plug domain in the FecA protein (TM-score 0.68). The plug domain sits in the TM barrel domain and participates in gating and signalling (Noinaj et al. 2012).
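The strand-shift idea can be sketched as follows. This is a simplified illustration under stated assumptions and not the EVfold_bb implementation: the function name, the dictionary layout for EC scores, the shift range and the toy scores are hypothetical, and no normalisation for the number of paired residues is applied.

```python
def best_strand_shift(strand_a, strand_b, ec_scores, max_shift=3):
    """Find the registration of two adjacent anti-parallel strands.

    strand_a, strand_b: residue indices of each strand in sequence order,
                        with strand_a preceding strand_b.
    ec_scores:          dict mapping (i, j) with i < j to an EC score
                        (hypothetical layout; missing pairs score 0).
    Returns (shift, summed_score) for the best-scoring shift.
    """
    reversed_b = strand_b[::-1]  # anti-parallel: a runs up, b runs down
    best_shift, best_score = None, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        total = 0.0
        for k, i in enumerate(strand_a):
            m = k + shift
            if 0 <= m < len(reversed_b):
                j = reversed_b[m]
                total += ec_scores.get((min(i, j), max(i, j)), 0.0)
        if total > best_score:  # ties keep the first (most negative) shift
            best_shift, best_score = shift, total
    return best_shift, best_score


if __name__ == "__main__":
    strand_a = [10, 11, 12, 13, 14]
    strand_b = [18, 19, 20, 21, 22]
    # Toy EC scores favouring the pairing 10-21, 11-20, 12-19, 13-18.
    toy_ecs = {(10, 21): 0.9, (11, 20): 0.8, (12, 19): 0.7, (13, 18): 0.6}
    print(best_strand_shift(strand_a, strand_b, toy_ecs))
```

A shift of zero pairs the first residue of one strand with the last residue of the other; in the toy example the couplings are only recovered when the second strand is displaced by one position.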

Furthermore, methods specifically intended to improve prediction of β-sheet contacts in both globular and membrane proteins have also been developed. These methods can be broadly divided into two groups based on their use of ECs. BetaPro (Cheng and Baldi 2005) and MLN-2S (Lippi and Frasconi 2009) use neural networks and Markov logic networks, respectively, to predict β-sheet contacts. Maximum entropy-based correlated mutation measures (CMM) (Burkoff et al. 2013), Bcov (Savojardo et al. 2013b), bbcontacts (Andreani and Söding 2015) and MetaPSICOV (Jones et al. 2015) all use evolutionary covariation. These methods also apply an additional layer of machine-learning techniques, such as deep learning or HMMs, to predicted evolutionary couplings to increase the accuracy of predicted residue-residue contacts in β-sheets. In future, methods that combine the general principles of anti-parallel β-strands with machine-learning based methods that employ predicted contacts should be able to improve the applicability of these techniques to TMBs.

7 Future Directions

Substantial progress has been made in the field of membrane protein structure prediction over recent years. Methods for the detection of remote homologues have drastically improved, making it possible to generate template-based models for a larger number of protein families. Advances in techniques for predicting pairwise residue contacts have made it possible to generate de novo 3D models of large membrane proteins. However, these techniques are only applicable to protein families with large multiple sequence alignments. It is anticipated that as more sequencing data become available, it will be possible to build 3D models of as yet structurally uncharacterized TM protein families based on predicted contacts. Future challenges lie in further improving these contact prediction methods by optimizing the multiple sequence alignments, the generation of fragment libraries, the statistical inference methods used, and the tools employed to build 3D models.