Abstract
A class of secondary structure prediction algorithms use the information from the statistics of the residue pairs found in secondary structural elements. Because the protein folding process is dominated by backbone hydrogen bonding, an approach based on backbone hydrogen-bonded residue pairings would improve the predicting capabilities of these class algorithms. The reliability of the prediction algorithms depends on the quality of the statistics, therefore, of the data set. In this study, it was aimed to determine the propensities of the backbone hydrogen-bonded residue pairings for secondary structural elements of α-helix and β-sheet in globular proteins using a new and comprehensive data set created from the peptides deposited in Worldwide Protein Data Bank. A master data set including 4882 globular peptide chains with resolution better than 2.5 Å, sequence identity smaller than 25% and length of no shorter than 100 residues were created. Separate data sub sets also were created for helix and sheet structures from master set and each sub set includes 4594 and 4483 chains, respectively. Backbone hydrogen-bonded residue pairings in helices and sheets were detected and the propensities of them were represented as odds ratios (observed/[random or expected]) in matrices. Propensities assigned by this study to the residue pairings in secondary structural elements (as helix, overall strands, parallel strands and antiparallel strands) differ from the previous studies by 19 to 34%. These dissimilarities are important and they would cause further improvements in secondary structure prediction algorithms.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
1 Introduction
Studies on protein functions mostly require tertiary structure of the protein. Due to the technical limitations, tertiary structure of the many proteins could not be determined by experimental methods such as X-ray diffraction, NMR spectroscopy, cryo electron microscopy or be determined in poor quality. In such cases, computational methods (homology/comparative, threading or ab initio modelling) are valuable approaches to obtain the tertiary structure. In existence of a known structure similar to the query sequence as a template, tertiary structure of an unknown protein chain could be modelled with a great success using homology modeling [1]. The huge number of protein structures in Worldwide Protein Data Bank (wwPDB) [2] is an important factor in this success [3]. If the similar template sequence is not available, de novo or ab initio based prediction methods [4,5,6,7,8] are the main alternative approaches to the homology modeling. De novo prediction methods are mainly based on Anfinsen’s thermodynamic hypothesis, which states that the Gibbs free energy of the conformation of a native protein in physiological condition is lowest [9]. Therefore, the main goal of the de novo methods is to find out the global free-energy minimum in conformational energy landscape [10]. However, there are so many local minima in vast conformational energy landscape [11] and it requires enormous amount of time to search the global free-energy minimum among them. Therefore, a qualified starting conformation which corresponding to neighborhood of global minimum in energy landscape and which leading the algorithms to the nearest local minimum is extremely important to overcome this intrinsic limitation [12]. Two of Critical Assessments of Methods of Protein Structure Prediction (CASPs), CASP12 and CASP13 [13, 14], have reported great improvements in de novo or template free modeling. Despite these successes, template free modeling requires further improvement, especially, for longer chains.
The qualified starting conformation can be constructed using secondary structure prediction methods (SSPMs) [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]. The main goal of SSPMs is to identify the secondary structural elements of the protein in peptide sequence: helices [32], sheets [33] and coils. SSPMs are classified in many ways depending on their approaches to the problem [28, 34]. Some statistical methods or studies those investigating the spatial aspects of β-strands mainly are based on the occurrence of amino acid pairings in α-helices and/or in partner strands of β-sheets in the chain [17, 35,36,37,38,39,40,41,42,43]. However, because the backbone hydrogen bonding is the dominant factor of the protein folding process, as proposed by Rose et al. [44], the SSPMs based on the frequencies of the hydrogen-bonded amino acid pairings are more reasonable candidates for constructing the qualified starting conformation for de novo methods. The success of latter class SSPMs directly depends on the reliability of the information gathered from statistics of the backbone hydrogen-bonded amino acid pairings of secondary structural elements.
In this study, all backbone hydrogen-bonded residue pairings in α-helices/β-sheets of globular proteins in appropriate data sub sets were determined. The master data set was prepared from the chains deposited in Worldwide Protein Data Bank according to the criteria stated in Sect. 2. The master data set includes 4882 globular, non-homolog protein chains. Two data sub sets also created for helix and sheet structures from master data set. Using the residue pairing frequencies, the propensities of hydrogen-bonded residue pairings in secondary structural elements were calculated as odds ratios.
The helix/sheet propensities of residue pairings were studied by many researchers to some extent in this context [17, 35,36,37,38,39,40,41,42,43, 45, 46]. However, this study differs from those in both size of the data set and protein type homogeneity. Membrane proteins, fibrous proteins, immunoglobulins, proteins related to extremophile organisms and homolog chains/domains were excluded from the master data set to attain this homogeneity.
Some findings of this study on propensities of the residue pairs are not in consistent with findings of the previous studies. As discussed in Sect. 10 in details, these inconsistencies could be important and make valuable contributions to the secondary structure prediction algorithms.
2 Material and Methods
2.1 Protein Data Sets
150,037 protein structure files in pdb format were downloaded from ftp site of Worldwide Protein Data Bank [47, 48]. The protein structure files those do not include any peptide or secondary structural elements (α-helix or β-sheet) and those do not meet those criteria were excluded from the data set:
Resolution value : ≤ 2.00 Å
Free R-value : ≤ 0.250
R-value : ≤ 0.200 (if Free R-Value not available)
Sequence length : ≥ 100 residues.
Membrane proteins, fibrous proteins, immunoglobulins, and proteins related to extremophile organisms were removed from the data set. Membrane proteins and extremophile organisms were determined according to the lists (see Supp_MembraneProteins.pdf and Supp_ExtremophileOrganisms.pdf, respectively) prepared using the data provided by Stephen White Laboratory at UC Irvine [49] and Wikipedia [50], respectively. Structure files including keywords of membrane, transmembrane, immunoglobulin, collagen, fibroin, keratin, fibrous, keratous in COMPND, SOURCE, HEADER, KEYWDS and TITLE record types of their PDB files also were removed. A match ratio higher than 90% between keywords and target word accepted as perfect match. Because proteins are classified according to their type as globular, membrane, fibrous and non-globular in SCOP2 database [51], remaining chains were checked against the list (see Supp_SCOP2.pdf) prepared from SCOP2 web site [52], including membrane, fibrous and non-globular chains; no match found.
2.2 Pairwise Alignment
Amino acid sequences of peptide chains in remaining PDB files were extracted using information from SEQRES entry of the PDB files and identical sequences were removed. If identical sequences are from different PDB files, the sequence has a better resolution left. After the removal of protein tags, remaining 18,384 chain sequences were aligned against to each other, as all possible pairs, using pairwise alignment algorithms in two stages. In first stage, global pairwise alignments were completed using Needleman and Wunsch algorithm [53] in order to detect the homolog chains. In second stage, local pairwise alignments were completed using Smith and Waterman algorithm [54] in order to detect the homolog domains in sequence pairs. In case of an identity value higher than 25%, the longer sequence in length was kept and the other sequence was excluded. The alignment parameters for both of algorithms are below:
Open gap penalty : 10
Extension gap penalty : 1
Substitution matrix : BLOSUM62 [55, 56]
At the end of the alignments, 4882 chains in 4782 PDB files left (see Supp_MasterDataSet_Chains.pdf and Supp_MasterDataSet_PDBFiles.pdf, respectively).
2.3 Hydrogen Bond Detection
Two different data sub sets were created for each secondary structural element (α-helix, β-sheet) from 4882 chains. Each chain member of the data sub set for helices includes at least one helix as secondary structural element. The same is true for the data sub set for sheet. The residue names, their sequence numbers and boundaries of helices/strands were obtained from HELIX and SHEET entries of the PDB files and residues only within boundaries were involved in hydrogen bond calculations. Residues those modified (information from MODRES entry), those have link with the other atoms (information from LINK entry) and those have missing backbone atoms (C, CA, N, O) (PDB file convention used for representing atoms) were discarded. Because of the resolution limitations of the experimental methods, hydrogen atoms rarely are found in PDB files. The coordinates of missing hydrogen atoms bound to the backbone nitrogen atom (NH) were determined according to the geometrical properties of the peptide bond (Fig. 1). The double-line joining C and O atoms of the nth residue was regarded parallel to the line joining N and H atoms of the (n+1)th residue and the bond length of N–H was accepted as 1.00 Å.
Baker and Hubbard’s hydrogen bonding criteria were used to detect the hydrogen bonds [57]. The COH angle is defined as the angle between the lines passing through the C=O and O···H atoms of the mth and nth residues, respectively. Likewise, the NHO angle is the angle between the lines passing through the O···H and H–N atoms of the mth and nth residues, respectively (Fig. 2). Because of the limited number of peptide chains in Baker and Hubbard’s study and the existence of peptide chains with worse resolution value in this study, slightly relaxed criteria were chosen for detecting hydrogen bonds.
2.3.1 In α-Helices
The helix data sub set includes 4594 chains (see Supp_DataSubSet_Helix.pdf) and each chain includes at least one α-helix. Peptide chains including unusual helices in length (that is, comprise more than 40 residues) were removed from the data sub set to avoid the involvement of the fibrous or extremophile related peptides. The hydrogen bond between the O atom of the nth residue and the HN atom of the (n + 4)th residue in α-helices were traced using those criteria for the bond length of the O···H and COH/NHO angles (Fig. 2).
Bond length : 2.000 ± 0.400 Å
COH angle : 150.0 ± 25.0°
NHO angle : 155.0 ± 25.0°
Any amino acid pair in sequential order of n and (n+4) in helical segment, listed in the HELIX entry of the PDB file, satisfying these criteria accepted as a backbone hydrogen-bonded residue pairing in α-helix. Because the proline residue cannot be a hydrogen bond donor, in such cases, i.e. XXX:PRO (XXX represents any residue), the hydrogen bond calculations were skipped.
2.3.2 In β-Sheets
The data sub set for sheets includes 4483 chains (see Supp_DataSubSet_Sheet.pdf) and each chain includes at least one β-sheet structure. Because of the conformational strains, the residues in partner strands are not aligned one-to-one despite it is depicted in textbooks as it is. The position of the residues in strands may shift a few residue back and forth and a bulb may occur in the strand. Therefore, the backbone hydrogen bonds between O and NH atoms in partner strands were traced by considering all the probable residue matches between the partner strands and those satisfied those criteria were accepted as a backbone hydrogen-bonded residue pairing in β-sheet.
Bond length : 2.000 ± 0.400 Å
COH angle : 150.0 ± 25.0°
NHO angle : 160.0 ± 25.0°
-
Depending on the orientation of the strand, these pairings were grouped as parallel, antiparallel and overall. Overall group includes all parallel and antiparallel pairings.
2.4 Odds ratios
If a hydrogen bond was determined between the O and HN atoms of the main chain of two different residues (sequential order of residues for helices and sheets were described in Sects. 2.3.1 and 2.3.2, respectively), these residues were counted as an amino acid pairing. Because of the topology of the antiparallel strands of the sheet, an amino acid pairing may have two hydrogen bonds between their O and HN atoms. In such case, this pairing was counted up twice. Relative abundance or odds ratios of amino acid pairings were calculated as the ratio of observed occurrence to random (or expected) occurrence in peptide chain and were represented by MH[i, j] and MSO, SP, SA[i, j] matrices for helices and sheets, respectively (H:helix, SO:sheet-overall, SP:sheet-parallel, SA:sheet-antiparallel). The data used to calculate the odds ratios were represented by AH, S[i] and FH, SO, SP, SA[i, j] matrices. The residue location within the pairing in the strands of the sheet is not preferential, that is, XXX1:XXX2 and XXX2:XXX1 residue pairings are regarded as identical for sheet in context of matrices. Therefore, MSO, SP, SA[i, j] and FSO, SP, SA[i, j] matrices are symmetric with respect to the main diagonal but, MH[i, j] and FH[i, j] matrices are non-symmetric.
Definitions
NAAPairs_H,SO,SP,SA = Total number of amino acid pairings detected in helices (H), in overall strands (SO), in parallel strands (SP) and in antiparallel strands (SA), respectively.
NAA_H, S = Total number of amino acids in chains in helix data sub set and in sheet data sub set, respectively.
AH, S[i] = Matrix representing the number of each amino acids in helix data sub set and in sheet data sub set, respectively.
FH, SO, SP, SA[i, j] = Matrix representing the number of each amino acid pairings detected in helix, in overall strands, in parallel strands and in antiparallel strands, respectively.
Po_H, SO, SP, SA(i, j) = Probability of observed occurrence of amino acid pairing i and j in helix, in overall strands, in parallel strands and in antiparallel strands, respectively.
Pr_H, SO, SP, SA(i, j) = Probability of random occurrence of amino acid pairing i and j in helix, in overall strands, in parallel strands and in antiparallel strands, respectively.
MH, SO, SP, SA[i, j] = Matrix representing the odds ratio of each amino acid pairings in helices, in overall strands, in parallel strands and in antiparallel strands, respectively.
2.5 Single Amino Acid Propensities
Single amino acid propensities to helix and strand were determined using MH[i, j] and MSO[i, j] matrices, respectively. Single amino acid propensities were calculated by normalizing the sum of the values of the cells including the same residue in the matrix (e.g. for ALA residue, all ALA:XXX cell values in MH[i, j] matrix or ALA:XXX and XXX:ALA cell values in MSO[i, j] matrix were summed) according to the normalization condition. Normalization condition is the sum of whole cell values in the related matrix.
Pairwise alignments, chain data extraction from PDB files and calculations for hydrogen bond detection and matrices were done using programs written by author in QB64 v1.2 [58].
3 Results
3.1 Amino Acid Pairing Propensities in α-Helices
MH[i, j] matrix for α-helices is shown in Fig. 3 (see Supp_AH_and_FH_Matrices.pdf for AH[i] and FH[i, j] matrices). Odds ratios of homopairs corresponding to diagonal of the MH[i, j] matrix are shaded in gray. An odds ratio higher than unity implies a higher abundance than expected. Therefore, it reflects the propensity of the pair in helices. 212 of the 400 amino acid pairs have an odds ratio greater than unity and 10 pairs of them have an odds ratio value greater than 2.000. The latter pairs are ALA:ALA, GLU:ARG, ARG:GLU, GLU:GLN, GLN:GLU, GLU:LYS, LYS:GLU, LEU:LEU, MET:LEU and MET:MET. The pairs including ALA, except ALA:ASN, ALA:ASP, ALA:PRO and ALA:THR, have an odds ratio greater than unity. Also, most of the XXX:[GLN, MET, ARG, LEU] pairs have the tendency to exist in helices. Odds ratio of MET:MET, 2.878, is the highest value in the matrix. In contrary, PRO:XXX, XXX:PRO, GLY:XXX, XXX:GLY, XXX:SER and XXX:THR residues, except PRO:ALA, PRO:ARG and GLY:ALA, have smaller odds ratios than unity. 53 of them have a value smaller than 0.500. Because PRO residue cannot act as a donor in hydrogen bonding, scores for XXX:PRO pairs are zero.
There are limited number of studies on α-helical segment in proteins using (n, n+4) pairing [17, 22, 35, 36, 59]. Studies by Gibrat et al. [22], Frishman et al. [17] and Periti et al. [59] include small number of peptides in their data sets and study by Fonseca et al. [36] deals only with residue pairs at the N- and C-termini of the helical segments. Therefore, comparing the findings of this study to these ones would not be conclusive.
However, scope of this study is similar to the one by de Sousa et al. [35] and a meaningful comparison could be obtained. Propensities of homopairs proposed by this study, which correspond to main diagonal of matrix MH[i,j], coincide with ones represented in matrix of “Table 1: Global propensities for the (i, i + 4) pairing.” by de Sousa et al., except CYS:CYS and TYR:TYR pairs. While this study gives a helical tendency to CYS:CYS and TYR:TYR homopairs by assigning matrix scores of 1.800 and 1.074, respectively, they look neutral and non-helical in the global propensities matrix by de Sousa et al. [35], respectively. There are also 75 heteropair dissimilarities between these two propensity matrices (Here, the word of “dissimilarity”, implies that propensity score of a residue pair from one study is greater than unity while the corresponding score from other study is smaller than unity or vice versa. Likewise, “similarity”, implies that both of propensity scores from different studies are greater or smaller than unity). All these 77 dissimilarities are represented in Fig. 4. The degree of dissimilarity for some pairs, such as VAL:TYR and TYR:TYR, is so small but for some pairs it is not negligible. Because the total number of dissimilarities in propensities of residue pairings for α-helices in these two studies corresponds to 19% of the pairings, these dissimilarities could be crucial when assigning helical secondary structure to primary peptide structure. Therefore, the propensity matrix proposed by this study for α-helical structure could be valuable for secondary structure prediction algorithms.
Last issue of this comparison on matrices is about XXX:PRO residue pairs. Study by de Sousa et al. [35] determines the (n, n + 4) pairings by just considering the position of the residues in helical region, not using hydrogen bonding information. Therefore, in their matrix, “Table 1: Global propensities for the (i, i + 4) pairing.”[35], they have scores greater than zero for XXX:PRO residue pairs. But, because this study is mainly based on the assumption proposed by Rose et al. [44], residue pairings were determined by taking into account the presence of backbone hydrogen bond between the pairs in sequential order of (n, n + 4). Residues at position (n + 4) are hydrogen bond donors and because proline cannot act as a hydrogen bond donor, XXX:PRO residue pairs in MH[i, j] matrix cannot have a backbone hydrogen bond. Therefore the scores of XXX:PRO pairs in MH[i, j] matrix are zero. This important difference between the matrices could be worth consideration, especially when using secondary structure prediction algorithms based on residue pairings.
3.2 Amino Acid Pairing Propensities in β-Sheets
Residue pairings in β-sheets were grouped as parallel and antiparallel depending on the orientation of the strand or as overall without noticing the orientation. MSO[i, j], MSP[i, j] and MSA[i, j] matrices represent propensities of pairs and are shown in Figs. 5, 6 and 7, respectively (see Supp_AS_and_FSO_Matrices.pdf, Supp_AS_and_FSP_Matrices.pdf and Supp_AS_and_FSA_Matrices.pdf for AS[i]/FSO[i, j], AS[i]/FSP[i, j] and AS[i]/FSA[i, j] matrices, respectively). Because there is no preferential order for the position of the residues in the peptide sequence for sheet structure (that is, ALA:XXX and XXX:ALA pairings are identical in sense of probability calculations in sheet), MSO[i, j], MSP[i, j] and MSA[i, j] matrices are symmetric with respect to the diagonal.
β-sheet propensities of pairs for each matrices are summarized in Table 1 by showing the number of pairs in the group (i.e. ALA:XXX) those have a score greater than unity and those have a score smaller than unity. Because matrices are symmetric, ALA:XXX represents both ALA:XXX and XXX:ALA pairs and so on. MSO[i, j], and MSA[i, j] matrices almost have the same tendency profile in general. In MSO[i, j] matrix, [ILE, TYR, VAL]:XXX pairs, in MSP[i, j] matrix, [ILE, VAL]:XXX pairs and in MSA[i, j] matrix, [ILE, TYR, VAL]:XXX pairs have a tendency for corresponding β-strands. In contrary, [ASN, ASP, GLN, GLU, LYS, PRO, SER]:XXX pairs in MSO[i, j] matrix, [ARG, ASN, ASP, GLN, GLU, GLY, LYS, PRO, SER]:XXX pairs in MSP[i, j] matrix and [ASN, ASP, GLU, PRO]:XXX pairs in MSA[i, j] matrix mostly avoid from hydrogen bonding in corresponding β-strands. Due to the limited hydrogen bonding capacity of proline, PRO:XXX pair scores are extremely low.
The remarkable pairing groups are ARG:XXX, GLN:XXX, LYS:XXX, THR:XXX, TRP:XXX and TYR:XXX in parallel and antiparallel strands. While ARG:XXX, GLN:XXX and LYS:XXX pairs are rarely found in parallel strand, THR:XXX, TRP:XXX and TYR:XXX pairs are mainly found in antiparallel strands. Besides those, some specific pairs such as HIS:HIS, SER:SER, THR:THR, TRP:CYS, ILE:ASN also have opposite tendencies for parallel and antiparallel strands. These distinctions in pairing propensities could provide valuable information for making a discrimination between parallel and antiparallel strands when using secondary structure prediction algorithms.
Propensities of amino acid pairings in β-sheet structure were studied by many researchers [17, 37,38,39,40,41,42,43]. In the study by Fooks et al. [37], the every residue pairing has one hydrogen bonded residue and one non-hydrogen bonded residue and data on antiparallel pairings are not available. The study by Hutchinson et al. [38] also has such an approach to the pairings in antiparallel strand. In the study by Frishman et al. [17], the criteria for X-ray resolution of peptides in the data set is slightly high, the number of peptides in the data set is low and also propensities of residues are not available. Due to these limitations, findings of this study could not be assessed in the viewpoint of these studies. The study by Wouters et al. [40] on antiparallel strands includes a score matrix for hydrogen bonded pairs. At the first glance, the different scores given to ASP:ASP, ILE:ILE, TYR:TYR and VAL:VAL homopairs by two studies deserve interest. While MSA[i, j] matrix assigns a score for ASP:ASP pair as low as 0.255, it has a tendency for sheet structure according to the Wouters et al. ILE:ILE, TYR:TYR and VAL:VAL homopairs have a 0 score in their study, but they have higher scores in MSA[i, j] matrix. Despite ASP:LYS and THR:ASN pairs are being the high scoring pairs in the study of Wouters et al. these pairs have scores smaller than unity in MSA[i, j] matrix. There are more inconsistencies like these ones between these two matrices assigning opposite propensity for the same pair.
In study by Kim et al. [39], favoured and unfavoured pairs in parallel and antiparallel strands are given in “Tables 4–7”. According to these tables, the numbers of residues those are favoured in parallel strands, unfavoured in parallel strands, favoured in antiparallel strands and unfavoured in antiparallel strands are 42, 40, 63, and 67, respectively. Of these, only 12, 12, 42, and 45 are overlapped in MSP[i, j] and MSA[i, j] matrices.
Despite the lack of discrimination between hydrogen-bonded and non-hydrogen-bonded pairings in the study by Zhang et al. [41], the findings of this study were compared with the ones by them because the data sets of both studies are similar in context of the size and criteria (see for comparison results Supp_ComparisonResultsforSheet.pdf). Because the two other studies by Zhang et al. [42, 43] have inadequate criteria for their data sets, findings of these two studies were not used. Within 210 amino acid pairs, of 66 (31%), 44 (21%) and 72 (34%) pairs have opposite propensity for overall, parallel and antiparallel strands, respectively.
Amino acid pairing propensities to helix and to sheet structures (represented in Fig. 3 and in Fig. 5 as MH[i, j] and MSO[i, j] matrices, respectively) were combined into a single color-coded matrix as in Fig. 8.
3.3 Assessment of the Backbone Hydrogen Bonding Assumption
This study is mainly based on the unproven assumption by Rose et al. [44] which states that energetics of the backbone hydrogen bonding is the dominant factor of the protein folding process. Therefore, if other dominating factors rather than backbone hydrogen bonding are discovered or this assumption is collapsed, the reliability of the findings of this study would reduce partially or completely. In case of the existence of other dominating factors, it is expected that validity of the findings would depend on the weight of backbone hydrogen bonding within the overall factors. But, in case of collapse of backbone hydrogen bonding assumption, the results of this study would become invalid and any consistency between findings of this study and related literature would be accidental.
In a study by Chemmama et al. [60], propensities of amino acid pairings in protein secondary structure were determined using molecular dynamics (MD) simulation. This methodological approach makes their findings free of any single dominant interaction. Therefore, comparing of findings of this study with the ones of Chemmama et al. [60] could be informative, at least to some extent, to assess the reliability of the backbone hydrogen bonding assumption.
Single amino acid propensities were compared using Fig. 9 of this manuscript and Fig. 2 from manuscript of Chemmama et al. [60]. Only propensities to helix and sheet were compared, to coil not included. If an amino acid has same relative propensities to secondary structure in both of these figures, findings for this residue were accepted as in agreement. According to the comparison, 13 of 20 residues (ALA, VAL, LEU, ILE, MET, TRP, THR, ASN, GLN, ASP, GLU, LYS, and HIS) have the same relative tendencies to the secondary structural elements.
This high percentage (65%) in agreement supports the reliability of the backbone hydrogen bonding assumption but, two aspects on methodology of the manuscripts must be taken into account. First, Chemmama et al. [60] used just hexapeptides, which are extremely shorter than an average protein chain. Therefore, in context of protein folding, all potential interactions from distant residues for MD simulation have been ignored. Second, in this study, for propensities to helix, amino acid pairs in a sequential order of (n, n + 4) were traced, and for propensities to sheet, there is no preferential sequential order for residue pairings. But in study of Chemmama et al. [60] only adjacent residue pairs were used.
4 Conclusion
In this study, propensities of amino acid pairings in α-helix and β-sheet structure of globular proteins were determined as odds ratios represented by matrices. Because the reliability of the results mainly depends on the quality of the data set, despite the previous studies on this issue, author has created a new, comprehensive data set using all peptides deposited in Worldwide Protein Data Bank. Only globular protein chains were included to data set by removing membrane, fibrous, immunoglobulins and extremophile related proteins.
To increase the quality of the data set, both homolog chains and homolog domains in the chains were detected using global and local pairwise alignment algorithms, respectively and were removed from the data set. Because alignment algorithms are heuristic algorithms and alignment parameters has been determined empirically, there is no way to determine the homolog chains or domains as absolutely. Despite this minor drawback, the data set of this study is one of the qualified data set available in the related literature.
Comparison of the findings of this study with the previous studies shows that propensities proposed by this and the other studies for the same residue pairing may differ. The number of such residue pairings corresponds to 19–34% of the all pairings in each secondary structure element. Therefore, findings of this study could provide valuable information to secondary structure prediction algorithms based on hydrogen-bonded residue pairings when predicting secondary structural elements of the peptide.
Change history
01 February 2020
In the original version of this article, under the Introduction section in paragraph starting "Some findings of this study..." the “Sect. 10” should be changed to “Sect. 3”.
01 February 2020
In the original version of this article, under the Introduction section in paragraph starting "Some findings of this study..." the ���Sect.��10��� should be changed to ���Sect.��3���.
References
Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234(3):779–815
Berman H, Henrick K, Nakamura H (2003) Announcing the worldwide Protein Data Bank. Nat Struct Biol 10(12):980
Zhang Y, Skolnick J (2005) The protein structure prediction problem could be solved using the current PDB library. Proc Natl Acad Sci USA 102(4):1029–1034
Bonneau R, Tsai J, Ruczinski I, Chivian D, Rohl C, Strauss CE et al (2001) Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins Suppl 5:119–126
Bystroff C, Thorsson V, Baker D (2000) HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J Mol Biol 301(1):173–190
Levitt M, Warshel A (1975) Computer simulation of protein folding. Nature 253(5494):694–698
Osguthorpe DJ (1999) Improved ab initio predictions with a simplified, flexible geometry model. Proteins Suppl 3:186–193
Simons KT, Bonneau R, Ruczinski I, Baker D (1999) Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins Suppl 3:171–176
Anfinsen CB (1973) Principles that govern the folding of protein chains. Science 181(4096):223–230
Bonneau R, Baker D (2001) Ab initio protein structure prediction: progress and prospects. Annu Rev Biophys Biomol Struct 30:173–189
Scheraga HA (1971) Theoretical and experimental studies of conformations of polypeptides. Chem Rev 71(2):195–217
Burgess AW, Ponnuswamy PK, Scheraga HA (1974) Analysis of conformations of amino acid residues and prediction of backbone topography in proteins. Israel J Chem 12(1–2):239–86
Moult J, Fidelis K, Kryshtafovych A, Schwede T, Tramontano A (2018) Critical assessment of methods of protein structure prediction (CASP)-Round XII. Proteins 86(Suppl 1):7–15
Kryshtafovych A, Schwede T, Topf M, Fidelis K, Moult J (2019) Critical assessment of methods of protein structure prediction (CASP)—round XIII. Proteins 87(12):1011–1020
Deleage G, Roux B (1987) An algorithm for protein secondary structure prediction based on class prediction. Protein Eng 1(4):289–294
Frishman D, Argos P (1995) Knowledge-based protein secondary structure assignment. Proteins 23(4):566–579
Frishman D, Argos P (1996) Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng 9(2):133–142
Garnier J, Gibrat JF, Robson B (1996) GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol 266:540–553
Garnier J, Osguthorpe DJ, Robson B (1978) Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J Mol Biol 120(1):97–120
Geourjon C, Deleage G (1994) SOPM: a self-optimized method for protein secondary structure prediction. Protein Eng 7(2):157–164
Geourjon C, Deleage G (1995) SOPMA: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments. Comput Appl Biosci 11(6):681–684
Gibrat JF, Garnier J, Robson B (1987) Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. J Mol Biol 198(3):425–443
Guermeur Y, Geourjon C, Gallinari P, Deleage G (1999) Improved performance in protein secondary structure prediction by inhomogeneous score combination. Bioinformatics 15(5):413–421
King RD, Sternberg MJ (1996) Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Sci 5(11):2298–2310
Levin JM (1997) Exploring the limits of nearest neighbour secondary structure prediction. Protein Eng 10(7):771–776
Levin JM, Garnier J (1988) Improvements in a secondary structure prediction method based on a search for local sequence homologies and its use as a model building tool. Biochim Biophys Acta 955(3):283–295
Levin JM, Robson B, Garnier J (1986) An algorithm for secondary structure determination in proteins based on sequence similarity. FEBS Lett 205(2):303–308
Rost B, Sander C (1993) Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 232(2):584–599
Rost B, Sander C (1994) Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 19(1):55–72
Chou PY, Fasman GD (1974) Prediction of protein conformation. Biochemistry 13(2):222–245
Chou PY, Fasman GD (1974) Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins. Biochemistry 13(2):211–222
Pauling L, Corey RB, Branson HR (1951) The structure of proteins; two hydrogen-bonded helical configurations of the polypeptide chain. Proc Natl Acad Sci USA 37(4):205–211
Pauling L, Corey RB (1951) The pleated sheet, a new layer configuration of polypeptide chains. Proc Natl Acad Sci USA 37(5):251–256
Deleage G, Blanchet C, Geourjon C (1997) Protein structure prediction. Implications for the biologist. Biochimie 79(11):681–686
de Sousa MM, Munteanu CR, Pazos A, Fonseca NA, Camacho R, Magalhaes AL (2011) Amino acid pair- and triplet-wise groupings in the interior of alpha-helical segments in proteins. J Theor Biol 271(1):136–144
Fonseca NA, Camacho R, Magalhaes AL (2008) Amino acid pairing at the N- and C-termini of helical segments in proteins. Proteins 70(1):188–196
Fooks HM, Martin AC, Woolfson DN, Sessions RB, Hutchinson EG (2006) Amino acid pairing preferences in parallel beta-sheets in proteins. J Mol Biol 356(1):32–44
Hutchinson EG, Sessions RB, Thornton JM, Woolfson DN (1998) Determinants of strand register in antiparallel beta-sheets of proteins. Protein Sci 7(11):2287–2300
Kim SB, Tsui KL, Borodovsky M (2006) Multiple testing in large-scale contingency tables: inferring patterns of pair-wise amino acid association in beta-sheets. Int J Bioinform Res Appl 2(2):193–217
Wouters MA, Curmi PM (1995) An analysis of side chain interactions and pair correlations within antiparallel beta-sheets: the differences between backbone hydrogen-bonded and non-hydrogen-bonded residue pairs. Proteins 22(2):119–131
Zhang N, Duan G, Gao S, Ruan J, Zhang T (2010) Prediction of the parallel/antiparallel orientation of beta-strands using amino acid pairing preferences and support vector machines. J Theor Biol 263(3):360–368
Zhang N, Ruan J, Duan G, Gao S, Zhang T (2009) The interstrand amino acid pairs play a significant role in determining the parallel or antiparallel orientation of beta-strands. Biochem Biophys Res Commun 386(3):537–543
Zhang N, Ruan J, Wu J, Zhang T (2007) SHEETSPAIR: a database of amino acid pairs in protein sheet structures. Data Sci J 6:S589–S595
Rose GD, Fleming PJ, Banavar JR, Maritan A (2006) A backbone-based theory of protein folding. Proc Natl Acad Sci USA 103(45):16623–16633
Lifson S, Sander C (1980) Specific recognition in the tertiary structure of beta-sheets of proteins. J Mol Biol 139(4):627–639
Petersen SB, Neves-Petersen MT, Henriksen SB, Mortensen RJ, Geertz-Hansen HM (2012) Scale-free behaviour of amino acid pair interactions in folded proteins. PLoS ONE 7(7):e41322
ww PDBc (2019) Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res 47(D1):D520–D528
Worldwide Protein Data Bank. FTP site. http://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/. Accessed 16 Apr 2019
Stephen White laboratory at UC Irvine. Membrane Proteins of Known 3D Structure. https://blanco.biomol.uci.edu/mpstruc/ Accessed 16 Apr 2019
Wikipedia The Free Encyclopedia. Extremophile. https://en.wikipedia.org/wiki/Extremophile Accessed 2 May 2019
Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG (2014) SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res 42(Database issue):D310–D314
MRC Laboratory of Molecular Biology. Structural Classification of Proteins 2. https://scop2.mrc-lmb.cam.ac.uk/pt-index.html. Accessed 11 Oct 2019.
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147(1):195–197
Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 89(22):10915–10919
NCBI National Center for Biotechnology Information. BLOSUM Matrices. ftp://ncbi.nih.gov/blast/matrices/. Accessed 2 May 2019.
Baker EN, Hubbard RE (1984) Hydrogen bonding in globular proteins. Progr Biophys Mol Biol 44(2):97–179
QB64. https://www.portal.qb64.org/ . Accessed 21 Oct 2019.
Periti PF, Quagliarotti G, Liquori AM (1967) Recognition of alpha-helical segments in proteins of known primary structure. J Mol Biol 24(2):313–322
Chemmama IE, Chapagain PP, Gerstman BS (2015) Pairwise amino acid secondary structural propensities. Phys Rev E 91(4):042709
Funding
N/A
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The author declares that he has no conflict of interest.
Ethical Approval
This article does not contain any studies with human participants or animals performed by the author.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Nacar, C. Propensities of Amino Acid Pairings in Secondary Structure of Globular Proteins. Protein J 39, 21–32 (2020). https://doi.org/10.1007/s10930-020-09880-6
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10930-020-09880-6