1 Introduction

Studies of protein function mostly require the tertiary structure of the protein. Owing to technical limitations, the tertiary structures of many proteins either cannot be determined by experimental methods such as X-ray diffraction, NMR spectroscopy or cryo-electron microscopy, or can be determined only at poor quality. In such cases, computational methods (homology/comparative, threading or ab initio modelling) are valuable approaches for obtaining the tertiary structure. When a known structure similar to the query sequence is available as a template, the tertiary structure of an unknown protein chain can be modelled with great success using homology modeling [1]. The huge number of protein structures in the Worldwide Protein Data Bank (wwPDB) [2] is an important factor in this success [3]. If no similar template sequence is available, de novo or ab initio prediction methods [4,5,6,7,8] are the main alternatives to homology modeling. De novo prediction methods are mainly based on Anfinsen’s thermodynamic hypothesis, which states that the native conformation of a protein under physiological conditions is the one with the lowest Gibbs free energy [9]. Therefore, the main goal of de novo methods is to find the global free-energy minimum in the conformational energy landscape [10]. However, the vast conformational energy landscape contains very many local minima [11], and searching for the global free-energy minimum among them requires an enormous amount of time. Therefore, a qualified starting conformation, one that lies in the neighborhood of the global minimum in the energy landscape and leads the algorithms to the nearest local minimum, is extremely important for overcoming this intrinsic limitation [12]. Two of the Critical Assessments of Methods of Protein Structure Prediction (CASPs), CASP12 and CASP13 [13, 14], reported great improvements in de novo or template-free modeling. Despite these successes, template-free modeling requires further improvement, especially for longer chains.

The qualified starting conformation can be constructed using secondary structure prediction methods (SSPMs) [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]. The main goal of SSPMs is to identify the secondary structural elements of the protein in the peptide sequence: helices [32], sheets [33] and coils. SSPMs are classified in many ways depending on their approach to the problem [28, 34]. Some statistical methods, and studies investigating the spatial aspects of β-strands, are based mainly on the occurrence of amino acid pairings in α-helices and/or in partner strands of β-sheets in the chain [17, 35,36,37,38,39,40,41,42,43]. However, because backbone hydrogen bonding is the dominant factor in the protein folding process, as proposed by Rose et al. [44], SSPMs based on the frequencies of hydrogen-bonded amino acid pairings are more reasonable candidates for constructing the qualified starting conformation for de novo methods. The success of this latter class of SSPMs depends directly on the reliability of the statistics gathered on backbone hydrogen-bonded amino acid pairings in secondary structural elements.

In this study, all backbone hydrogen-bonded residue pairings in the α-helices and β-sheets of globular proteins in appropriate data sub sets were determined. The master data set was prepared from the chains deposited in the Worldwide Protein Data Bank according to the criteria stated in Sect. 2. The master data set includes 4882 globular, non-homolog protein chains. Two data sub sets, one for helix and one for sheet structures, were also created from the master data set. Using the residue pairing frequencies, the propensities of hydrogen-bonded residue pairings in secondary structural elements were calculated as odds ratios.

The helix/sheet propensities of residue pairings have been studied to some extent by many researchers [17, 35,36,37,38,39,40,41,42,43, 45, 46]. However, this study differs from those in both the size of the data set and the homogeneity of protein type. To attain this homogeneity, membrane proteins, fibrous proteins, immunoglobulins, proteins related to extremophile organisms and homolog chains/domains were excluded from the master data set.

Some findings of this study on the propensities of residue pairs are not consistent with the findings of previous studies. As discussed in detail in Sect. 3, these inconsistencies could be important and could make valuable contributions to secondary structure prediction algorithms.

2 Material and Methods

2.1 Protein Data Sets

A total of 150,037 protein structure files in PDB format were downloaded from the FTP site of the Worldwide Protein Data Bank [47, 48]. Structure files that do not include any peptide or secondary structural element (α-helix or β-sheet), and those that do not meet the following criteria, were excluded from the data set:

Resolution value : ≤ 2.00 Å

Free R-value : ≤ 0.250

R-value : ≤ 0.200 (if Free R-Value not available)

Sequence length : ≥ 100 residues.
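The quality criteria listed above can be expressed as a small predicate. The following is a minimal sketch in Python, not the author's QB64 program; the function name and argument names are hypothetical, and rejecting entries for which neither R-value is available is an assumption that the text does not state.

```python
from typing import Optional


def passes_quality_filter(resolution: float,
                          r_free: Optional[float],
                          r_value: Optional[float],
                          length: int) -> bool:
    """Return True if an entry meets the thresholds listed in Sect. 2.1."""
    # Resolution must be <= 2.00 A and the chain must have >= 100 residues.
    if resolution > 2.00 or length < 100:
        return False
    # Free R-value takes precedence when it is available.
    if r_free is not None:
        return r_free <= 0.250
    # Free R-value unavailable: fall back to the R-value criterion.
    # Assumption: an entry with neither value reported is rejected.
    return r_value is not None and r_value <= 0.200
```

For example, `passes_quality_filter(1.5, None, 0.18, 100)` accepts an entry through the R-value fallback, while the same entry with an R-value of 0.21 is rejected.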

Membrane proteins, fibrous proteins, immunoglobulins, and proteins related to extremophile organisms were removed from the data set. Membrane proteins and extremophile organisms were identified according to lists (see Supp_MembraneProteins.pdf and Supp_ExtremophileOrganisms.pdf, respectively) prepared using the data provided by the Stephen White Laboratory at UC Irvine [49] and Wikipedia [50], respectively. Structure files containing the keywords membrane, transmembrane, immunoglobulin, collagen, fibroin, keratin, fibrous or keratous in the COMPND, SOURCE, HEADER, KEYWDS and TITLE record types of their PDB files were also removed. A match ratio higher than 90% between a keyword and the target word was accepted as a perfect match. Because proteins are classified according to their type as globular, membrane, fibrous and non-globular in the SCOP2 database [51], the remaining chains were checked against a list (see Supp_SCOP2.pdf) of membrane, fibrous and non-globular chains prepared from the SCOP2 web site [52]; no match was found.

2.2 Pairwise Alignment

Amino acid sequences of the peptide chains in the remaining PDB files were extracted from the SEQRES entries of the PDB files, and identical sequences were removed. If identical sequences came from different PDB files, the one with the better resolution was kept. After the removal of protein tags, the remaining 18,384 chain sequences were aligned against each other, in all possible pairs, using pairwise alignment algorithms in two stages. In the first stage, global pairwise alignments were performed using the Needleman and Wunsch algorithm [53] in order to detect homolog chains. In the second stage, local pairwise alignments were performed using the Smith and Waterman algorithm [54] in order to detect homolog domains in sequence pairs. When the identity value was higher than 25%, the longer sequence was kept and the other sequence was excluded. The alignment parameters for both algorithms are given below:

Open gap penalty : 10

Extension gap penalty : 1

Substitution matrix : BLOSUM62 [55, 56]

At the end of the alignments, 4882 chains in 4782 PDB files remained (see Supp_MasterDataSet_Chains.pdf and Supp_MasterDataSet_PDBFiles.pdf, respectively).
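The rule "keep the longer sequence when identity exceeds 25%" can be sketched as a greedy filter. This is a minimal Python sketch under stated assumptions, not the author's program: processing sequences from longest to shortest is an assumed ordering, `remove_homologs` is a hypothetical name, and `toy_identity` is only a gap-free placeholder standing in for the percent identity reported by the Needleman and Wunsch or Smith and Waterman alignments.

```python
from typing import Callable, List


def remove_homologs(seqs: List[str],
                    identity: Callable[[str, str], float],
                    threshold: float = 25.0) -> List[str]:
    """Greedy filter: a sequence is kept only if its identity to every
    already-kept (longer or equally long) sequence is <= threshold percent."""
    kept: List[str] = []
    # Assumption: longest sequences are considered first, so the longer
    # member of any homolog pair is the one that survives.
    for s in sorted(seqs, key=len, reverse=True):
        if all(identity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept


def toy_identity(a: str, b: str) -> float:
    """Placeholder identity: percent of matching positions over the shorter
    sequence, without gaps. A real run would use alignment-based identity."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return 100.0 * sum(x == y for x, y in zip(a, b)) / n
```

With the toy measure, `remove_homologs(["ACDEFGHIKL", "ACDEFGHI", "WWWWYYYY"], toy_identity)` drops the shorter of the two related sequences and keeps the unrelated one.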

2.3 Hydrogen Bond Detection

Two data sub sets, one for each secondary structural element (α-helix and β-sheet), were created from the 4882 chains. Each chain in the helix data sub set includes at least one helix as a secondary structural element; likewise, each chain in the sheet data sub set includes at least one sheet. The residue names, their sequence numbers and the boundaries of helices/strands were obtained from the HELIX and SHEET entries of the PDB files, and only residues within these boundaries were included in the hydrogen bond calculations. Residues that are modified (information from the MODRES entry), that are linked to other atoms (information from the LINK entry) or that have missing backbone atoms (C, CA, N, O; atom names follow the PDB file convention) were discarded. Because of the resolution limitations of the experimental methods, hydrogen atoms are rarely found in PDB files. The coordinates of the missing hydrogen atoms bound to the backbone nitrogen atom (NH) were determined according to the geometrical properties of the peptide bond (Fig. 1). The line joining the C and O atoms (the C=O double bond) of the nth residue was regarded as parallel to the line joining the N and H atoms of the (n+1)th residue, and the N–H bond length was taken as 1.00 Å.

Fig. 1 Geometry of the peptide bond. All atoms are coplanar and the line joining the O and C atoms is parallel to the line joining the N and H atoms. The length of the bond between the N and H atoms is approximately 1 Å.
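The hydrogen placement described above can be sketched in a few lines of Python. This is an illustration rather than the author's code; the function name is hypothetical, and the direction chosen for the N–H bond (pointing from O towards C of the preceding residue, as in a trans peptide where N–H points away from C=O) is an assumption consistent with the parallel-lines rule.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def place_backbone_h(c_prev: Vec, o_prev: Vec, n: Vec, bond: float = 1.00) -> Vec:
    """Estimate the amide H position of residue n+1 from C and O of residue n.

    The N->H direction is taken parallel to the O->C direction of the
    preceding residue, and the N-H bond length is `bond` angstroms.
    """
    # Direction vector from O to C of the preceding residue.
    d = tuple(c - o for c, o in zip(c_prev, o_prev))
    norm = math.sqrt(sum(x * x for x in d))
    if norm == 0.0:
        raise ValueError("C and O coordinates coincide")
    # Step `bond` angstroms from N along the unit direction.
    return tuple(ni + bond * x / norm for ni, x in zip(n, d))
```

Residues without a preceding residue in the same chain (and proline, which has no amide hydrogen) would need to be skipped by the caller.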

Baker and Hubbard’s hydrogen bonding criteria were used to detect the hydrogen bonds [57]. The COH angle is defined as the angle between the lines passing through the C=O and O···H atoms of the mth and nth residues, respectively. Likewise, the NHO angle is the angle between the lines passing through the O···H and H–N atoms of the mth and nth residues, respectively (Fig. 2). Because of the limited number of peptide chains in Baker and Hubbard’s study and the existence of peptide chains with worse resolution value in this study, slightly relaxed criteria were chosen for detecting hydrogen bonds.

Fig. 2 Depiction of the COH and NHO angles. The hydrogen bond is represented by the dotted line between the O atom of the mth residue and the H atom of the nth residue.

2.3.1 In α-Helices

The helix data sub set includes 4594 chains (see Supp_DataSubSet_Helix.pdf), and each chain includes at least one α-helix. Peptide chains containing unusually long helices (that is, helices comprising more than 40 residues) were removed from the data sub set to avoid including fibrous or extremophile-related peptides. The hydrogen bonds between the O atom of the nth residue and the HN atom of the (n + 4)th residue in α-helices were traced using the following criteria for the O···H bond length and the COH/NHO angles (Fig. 2):

Bond length : 2.000 ± 0.400 Å

COH angle : 150.0 ± 25.0°

NHO angle : 155.0 ± 25.0°

Any amino acid pair in the sequential order n and (n+4) within a helical segment listed in the HELIX entry of the PDB file that satisfies these criteria was accepted as a backbone hydrogen-bonded residue pairing in an α-helix. Because the proline residue cannot be a hydrogen bond donor, the hydrogen bond calculations were skipped for XXX:PRO pairs (XXX represents any residue).
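The distance and angle test can be sketched as follows. This Python sketch is an interpretation, not the author's implementation: it assumes the COH angle is measured at the O atom (between O→C and O→H) and the NHO angle at the H atom (between H→N and H→O), and the function names are hypothetical. The `nho_center` parameter defaults to the helix value of 155.0°; the sheet criteria in Sect. 2.3.2 use 160.0°.

```python
import math
from typing import Sequence


def _angle(a: Sequence[float], vertex: Sequence[float], b: Sequence[float]) -> float:
    """Angle in degrees at `vertex` between the directions to `a` and `b`."""
    v1 = [p - q for p, q in zip(a, vertex)]
    v2 = [p - q for p, q in zip(b, vertex)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_angle))


def is_backbone_hbond(c, o, h, n, nho_center: float = 155.0) -> bool:
    """Apply the relaxed criteria: O...H = 2.000 +/- 0.400 A,
    COH = 150.0 +/- 25.0 deg, NHO = nho_center +/- 25.0 deg."""
    dist = math.dist(o, h)   # O...H distance
    coh = _angle(c, o, h)    # angle at O (assumed definition)
    nho = _angle(n, h, o)    # angle at H (assumed definition)
    return (abs(dist - 2.000) <= 0.400
            and abs(coh - 150.0) <= 25.0
            and abs(nho - nho_center) <= 25.0)
```

For a helix, the function would be called with C and O of residue n and with N and the reconstructed H of residue n + 4.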

2.3.2 In β-Sheets

The data sub set for sheets includes 4483 chains (see Supp_DataSubSet_Sheet.pdf), and each chain includes at least one β-sheet structure. Because of conformational strain, the residues in partner strands are not aligned one-to-one, although they are depicted that way in textbooks. The position of the residues in a strand may shift a few residues back and forth, and a bulge may occur in the strand. Therefore, the backbone hydrogen bonds between O and NH atoms in partner strands were traced by considering all probable residue matches between the partner strands, and those satisfying the following criteria were accepted as backbone hydrogen-bonded residue pairings in a β-sheet:

Bond length : 2.000 ± 0.400 Å

COH angle : 150.0 ± 25.0°

NHO angle : 160.0 ± 25.0°

Depending on the orientation of the strands, these pairings were grouped as parallel, antiparallel and overall. The overall group includes all parallel and antiparallel pairings.

2.4 Odds ratios

If a hydrogen bond was detected between the O and HN atoms of the main chain of two different residues (the sequential order of residues for helices and sheets is described in Sects. 2.3.1 and 2.3.2, respectively), these residues were counted as an amino acid pairing. Because of the topology of the antiparallel strands of a sheet, an amino acid pairing may have two hydrogen bonds between its O and HN atoms. In such a case, the pairing was counted twice. Relative abundances, or odds ratios, of amino acid pairings were calculated as the ratio of observed occurrence to random (or expected) occurrence in the peptide chain and are represented by the MH[i, j] and MSO, SP, SA[i, j] matrices for helices and sheets, respectively (H: helix, SO: sheet-overall, SP: sheet-parallel, SA: sheet-antiparallel). The data used to calculate the odds ratios are represented by the AH, S[i] and FH, SO, SP, SA[i, j] matrices. The location of a residue within a pairing in the strands of a sheet is not preferential; that is, XXX1:XXX2 and XXX2:XXX1 residue pairings are regarded as identical for sheets in the context of the matrices. Therefore, the MSO, SP, SA[i, j] and FSO, SP, SA[i, j] matrices are symmetric with respect to the main diagonal, whereas the MH[i, j] and FH[i, j] matrices are non-symmetric.
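The counting rule (ordered pairs for helices, unordered pairs for sheets, one count per hydrogen bond) can be illustrated with a short Python sketch. The function name and the input format, one `(acceptor, donor)` tuple of residue names per detected hydrogen bond, are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_pairings(bonds: Iterable[Tuple[str, str]], symmetric: bool) -> Counter:
    """Count residue pairings, one count per detected hydrogen bond.

    symmetric=False keeps (acceptor, donor) order, as for the helix F matrix.
    symmetric=True treats XXX1:XXX2 and XXX2:XXX1 as the same pairing, as for
    the sheet F matrices.
    """
    counts: Counter = Counter()
    for acceptor, donor in bonds:
        if symmetric:
            key = tuple(sorted((acceptor, donor)))
        else:
            key = (acceptor, donor)
        # A pairing joined by two hydrogen bonds appears twice in `bonds`
        # and is therefore counted twice.
        counts[key] += 1
    return counts
```

In the sheet case, the two hydrogen bonds of an antiparallel VAL/ILE pairing, given as `("VAL", "ILE")` and `("ILE", "VAL")`, both land in the single cell `("ILE", "VAL")`.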

Definitions

NAAPairs_H,SO,SP,SA = Total number of amino acid pairings detected in helices (H), in overall strands (SO), in parallel strands (SP) and in antiparallel strands (SA), respectively.

NAA_H, S = Total number of amino acids in chains in helix data sub set and in sheet data sub set, respectively.

AH, S[i] = Matrix representing the number of each amino acid in the helix data sub set and in the sheet data sub set, respectively.

FH, SO, SP, SA[i, j] = Matrix representing the number of each amino acid pairing detected in helices, in overall strands, in parallel strands and in antiparallel strands, respectively.

Po_H, SO, SP, SA(i, j) = Probability of observed occurrence of amino acid pairing i and j in helix, in overall strands, in parallel strands and in antiparallel strands, respectively.

Pr_H, SO, SP, SA(i, j) = Probability of random occurrence of amino acid pairing i and j in helix, in overall strands, in parallel strands and in antiparallel strands, respectively.

MH, SO, SP, SA[i, j] = Matrix representing the odds ratio of each amino acid pairing in helices, in overall strands, in parallel strands and in antiparallel strands, respectively.

$$M_{H,SO,SP,SA}[i,j]=\frac{P_{o\_H,SO,SP,SA}(i,j)}{P_{r\_H,SO,SP,SA}(i,j)}$$
$$P_{o\_H,SO,SP,SA}(i,j)=\frac{F_{H,SO,SP,SA}[i,j]}{N_{AAPairs\_H,SO,SP,SA}}\quad (i\ge j\ \text{for strands})$$
$$P_{r\_H}(i,j)=\frac{A_{H}[i]\,A_{H}[j]}{N_{AA\_H}\left(N_{AA\_H}-1\right)}$$
$$P_{r\_SO,SP,SA}(i,j)=\frac{A_{S}[i]\left(A_{S}[i]-1\right)}{N_{AA\_S}\left(N_{AA\_S}-1\right)}\quad (\text{if } i=j)$$
$$P_{r\_SO,SP,SA}(i,j)=\frac{2\,A_{S}[i]\,A_{S}[j]}{N_{AA\_S}\left(N_{AA\_S}-1\right)}\quad (\text{if } i\ne j)$$
$$\sum P_{r\_H,SO,SP,SA}=1\quad \text{and}\quad \sum P_{o\_H,SO,SP,SA}=1$$
$$N_{AAPairs\_SO}=N_{AAPairs\_SP}+N_{AAPairs\_SA}$$
$$F_{SO}[i,j]=F_{SP}[i,j]+F_{SA}[i,j]$$
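The odds-ratio definitions above translate directly into code. The following Python sketch is illustrative only; the function names and the dictionary-based data layout (pair counts `F` keyed by residue-name tuples, residue counts `A` keyed by residue name) are assumptions, and the sheet version expects each unordered pair to be stored once.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def helix_odds(F: Dict[Pair, int], A: Dict[str, int]) -> Dict[Pair, float]:
    """M_H[i, j] = P_o_H(i, j) / P_r_H(i, j) for ordered (n, n+4) pairs."""
    n_pairs = sum(F.values())   # N_AAPairs_H
    n_aa = sum(A.values())      # N_AA_H
    M: Dict[Pair, float] = {}
    for (i, j), f in F.items():
        p_obs = f / n_pairs
        p_rand = A[i] * A[j] / (n_aa * (n_aa - 1))
        M[(i, j)] = p_obs / p_rand
    return M


def sheet_odds(F: Dict[Pair, int], A: Dict[str, int]) -> Dict[Pair, float]:
    """M_S[i, j] for unordered pairs (each pair stored once in F)."""
    n_pairs = sum(F.values())   # N_AAPairs_SO, _SP or _SA
    n_aa = sum(A.values())      # N_AA_S
    M: Dict[Pair, float] = {}
    for (i, j), f in F.items():
        p_obs = f / n_pairs
        if i == j:
            p_rand = A[i] * (A[i] - 1) / (n_aa * (n_aa - 1))
        else:
            p_rand = 2 * A[i] * A[j] / (n_aa * (n_aa - 1))
        M[(i, j)] = p_obs / p_rand
    return M
```

As a toy check with made-up counts `A = {"ALA": 6, "GLY": 4}` and sheet pair counts `{("ALA", "ALA"): 3, ("ALA", "GLY"): 1}`, the ALA:ALA odds ratio is 0.75 / (30/90) = 2.25.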

2.5 Single Amino Acid Propensities

Single amino acid propensities for helix and strand were determined using the MH[i, j] and MSO[i, j] matrices, respectively. For each residue, the values of all cells that include that residue were summed (e.g. for the ALA residue, all ALA:XXX cell values in the MH[i, j] matrix, or all ALA:XXX and XXX:ALA cell values in the MSO[i, j] matrix), and this sum was normalized by the sum of all cell values in the related matrix.
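The normalization can be sketched as follows in Python. This is one reading of the description, not the author's code: the function name is hypothetical, matrices are assumed to be dictionaries keyed by residue-name pairs, and for sheets each unordered pair is assumed to be stored once so that the diagonal cell is counted a single time.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def single_propensities(M: Dict[Pair, float], both_orders: bool = False) -> Dict[str, float]:
    """Sum the odds ratios of cells containing each residue and divide by
    the sum of all cells in the matrix.

    both_orders=False: sum only r:XXX cells (helix reading).
    both_orders=True:  sum r:XXX and XXX:r cells (sheet reading).
    """
    total = sum(M.values())
    residues = sorted({r for pair in M for r in pair})
    result: Dict[str, float] = {}
    for r in residues:
        if both_orders:
            s = sum(v for (i, j), v in M.items() if r in (i, j))
        else:
            s = sum(v for (i, j), v in M.items() if i == r)
        result[r] = s / total
    return result
```

With `both_orders=True` the resulting values need not sum to one, because each off-diagonal cell contributes to two residues.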

Pairwise alignments, chain data extraction from PDB files, and the calculations for hydrogen bond detection and the matrices were carried out using programs written by the author in QB64 v1.2 [58].

3 Results

3.1 Amino Acid Pairing Propensities in α-Helices

The MH[i, j] matrix for α-helices is shown in Fig. 3 (see Supp_AH_and_FH_Matrices.pdf for the AH[i] and FH[i, j] matrices). Odds ratios of homopairs, which correspond to the diagonal of the MH[i, j] matrix, are shaded in gray. An odds ratio higher than unity implies a higher abundance than expected and therefore reflects the propensity of the pair for helices. Of the 400 amino acid pairs, 212 have an odds ratio greater than unity, and 10 of these have an odds ratio greater than 2.000: ALA:ALA, GLU:ARG, ARG:GLU, GLU:GLN, GLN:GLU, GLU:LYS, LYS:GLU, LEU:LEU, MET:LEU and MET:MET. The pairs including ALA, except ALA:ASN, ALA:ASP, ALA:PRO and ALA:THR, have an odds ratio greater than unity. Also, most of the XXX:[GLN, MET, ARG, LEU] pairs tend to occur in helices. The odds ratio of MET:MET, 2.878, is the highest value in the matrix. In contrast, PRO:XXX, XXX:PRO, GLY:XXX, XXX:GLY, XXX:SER and XXX:THR pairs, except PRO:ALA, PRO:ARG and GLY:ALA, have odds ratios smaller than unity; 53 of them have a value smaller than 0.500. Because the PRO residue cannot act as a donor in hydrogen bonding, the scores for XXX:PRO pairs are zero.

Fig. 3 The MH[i, j] matrix represents the odds ratios of amino acid pairings (n, n+4) in helices as [observed]/[random]. A value greater than unity implies a tendency of the residue pairing towards helical structure, and a higher value corresponds to a higher tendency. Odds ratios of homopairs are shaded in gray.

There are a limited number of studies on α-helical segments in proteins that use (n, n+4) pairing [17, 22, 35, 36, 59]. The studies by Gibrat et al. [22], Frishman et al. [17] and Periti et al. [59] include a small number of peptides in their data sets, and the study by Fonseca et al. [36] deals only with residue pairs at the N- and C-termini of the helical segments. Therefore, comparing the findings of this study with those studies would not be conclusive.

However, the scope of this study is similar to that of the study by de Sousa et al. [35], so a meaningful comparison can be made. The propensities of homopairs proposed by this study, which correspond to the main diagonal of the MH[i, j] matrix, coincide with those in the matrix of “Table 1: Global propensities for the (i, i + 4) pairing.” by de Sousa et al., except for the CYS:CYS and TYR:TYR pairs. While this study assigns a helical tendency to the CYS:CYS and TYR:TYR homopairs, with matrix scores of 1.800 and 1.074, respectively, these pairs appear neutral and non-helical, respectively, in the global propensities matrix by de Sousa et al. [35]. There are also 75 heteropair dissimilarities between these two propensity matrices. (Here, “dissimilarity” implies that the propensity score of a residue pair from one study is greater than unity while the corresponding score from the other study is smaller than unity, or vice versa. Likewise, “similarity” implies that the propensity scores from both studies are either greater than or smaller than unity.) All 77 dissimilarities are shown in Fig. 4. The degree of dissimilarity is very small for some pairs, such as VAL:TYR and TYR:TYR, but it is not negligible for others. Because the total number of dissimilarities in the propensities of residue pairings for α-helices between these two studies corresponds to 19% of the pairings, these dissimilarities could be crucial when assigning helical secondary structure to a primary peptide structure. Therefore, the propensity matrix proposed by this study for α-helical structure could be valuable for secondary structure prediction algorithms.
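The definition of “dissimilarity” used in this comparison can be made concrete with a small Python sketch. The function name is hypothetical, matrices are assumed to be dictionaries keyed by residue-name pairs, and treating a score of exactly unity as neither similar nor dissimilar is an assumption; the second matrix in the usage note contains made-up values, not data from de Sousa et al.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def opposite_propensity_pairs(m1: Dict[Pair, float],
                              m2: Dict[Pair, float],
                              neutral: float = 1.0) -> List[Pair]:
    """Return pairs whose scores lie on opposite sides of `neutral`
    in the two matrices (one above unity, the other below)."""
    shared = m1.keys() & m2.keys()
    # The product of the two deviations is negative only when the scores
    # fall on opposite sides of the neutral value.
    return sorted(p for p in shared
                  if (m1[p] - neutral) * (m2[p] - neutral) < 0)
```

For instance, with `m1 = {("CYS", "CYS"): 1.8, ("ALA", "ALA"): 2.1}` and an invented `m2 = {("CYS", "CYS"): 0.9, ("ALA", "ALA"): 1.5}`, only CYS:CYS is returned. The reported 77 dissimilar pairs out of 400 correspond to 77/400 ≈ 19%.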

Fig. 4 Comparison of two propensity matrices for helix structure (the MH[i, j] matrix and the matrix from the study by de Sousa et al. [35]; see the text), represented in shades of blue and red. Shades of blue represent similar propensity, whereas shades of red represent opposite propensity. Color shades are graded as H (high), M (moderate) and L (low) (Color figure online)

The last point in this comparison of the matrices concerns XXX:PRO residue pairs. The study by de Sousa et al. [35] determines the (n, n + 4) pairings by considering only the position of the residues in the helical region, without using hydrogen bonding information. Therefore, their matrix, “Table 1: Global propensities for the (i, i + 4) pairing.” [35], contains scores greater than zero for XXX:PRO residue pairs. In contrast, because this study is mainly based on the assumption proposed by Rose et al. [44], residue pairings were determined by taking into account the presence of a backbone hydrogen bond between residues in the sequential order (n, n + 4). Residues at position (n + 4) are hydrogen bond donors, and because proline cannot act as a hydrogen bond donor, XXX:PRO residue pairs cannot have a backbone hydrogen bond. Therefore, the scores of XXX:PRO pairs in the MH[i, j] matrix are zero. This important difference between the matrices is worth considering, especially when using secondary structure prediction algorithms based on residue pairings.

3.2 Amino Acid Pairing Propensities in β-Sheets

Residue pairings in β-sheets were grouped as parallel or antiparallel depending on the orientation of the strands, or as overall without regard to orientation. The MSO[i, j], MSP[i, j] and MSA[i, j] matrices represent the propensities of pairs and are shown in Figs. 5, 6 and 7, respectively (see Supp_AS_and_FSO_Matrices.pdf, Supp_AS_and_FSP_Matrices.pdf and Supp_AS_and_FSA_Matrices.pdf for the AS[i]/FSO[i, j], AS[i]/FSP[i, j] and AS[i]/FSA[i, j] matrices, respectively). Because there is no preferential order for the position of the residues in the peptide sequence for sheet structure (that is, ALA:XXX and XXX:ALA pairings are identical for the probability calculations in sheets), the MSO[i, j], MSP[i, j] and MSA[i, j] matrices are symmetric with respect to the diagonal.

Fig. 5 The MSO[i, j] matrix represents the odds ratios of amino acid pairings in overall strands as [observed]/[random]. A value greater than unity implies a tendency of the residue pairing towards sheet structure, and a higher value corresponds to a higher tendency. Propensity matrices of sheet strands are symmetric with respect to the diagonal (see the text). Odds ratios of homopairs are shaded in gray.

Fig. 6 The MSP[i, j] matrix represents the odds ratios of amino acid pairings in parallel strands as [observed]/[random]. A value greater than unity implies a tendency of the residue pairing towards sheet structure, and a higher value corresponds to a higher tendency. Propensity matrices of sheet strands are symmetric with respect to the diagonal (see the text). Odds ratios of homopairs are shaded in gray.

Fig. 7 The MSA[i, j] matrix represents the odds ratios of amino acid pairings in antiparallel strands as [observed]/[random]. A value greater than unity implies a tendency of the residue pairing towards sheet structure, and a higher value corresponds to a higher tendency. Propensity matrices of sheet strands are symmetric with respect to the diagonal (see the text). Odds ratios of homopairs are shaded in gray.

The β-sheet propensities of pairs for each matrix are summarized in Table 1, which shows, for each group (e.g. ALA:XXX), the number of pairs that have a score greater than unity and the number that have a score smaller than unity. Because the matrices are symmetric, ALA:XXX represents both ALA:XXX and XXX:ALA pairs, and so on. The MSO[i, j] and MSA[i, j] matrices have almost the same tendency profile in general. [ILE, TYR, VAL]:XXX pairs in the MSO[i, j] matrix, [ILE, VAL]:XXX pairs in the MSP[i, j] matrix and [ILE, TYR, VAL]:XXX pairs in the MSA[i, j] matrix have a tendency for the corresponding β-strands. In contrast, [ASN, ASP, GLN, GLU, LYS, PRO, SER]:XXX pairs in the MSO[i, j] matrix, [ARG, ASN, ASP, GLN, GLU, GLY, LYS, PRO, SER]:XXX pairs in the MSP[i, j] matrix and [ASN, ASP, GLU, PRO]:XXX pairs in the MSA[i, j] matrix mostly avoid hydrogen bonding in the corresponding β-strands. Owing to the limited hydrogen bonding capacity of proline, PRO:XXX pair scores are extremely low.

Table 1 Distribution of properties of the residue pairs in β-sheet

The remarkable pairing groups in parallel and antiparallel strands are ARG:XXX, GLN:XXX, LYS:XXX, THR:XXX, TRP:XXX and TYR:XXX. While ARG:XXX, GLN:XXX and LYS:XXX pairs are rarely found in parallel strands, THR:XXX, TRP:XXX and TYR:XXX pairs are mainly found in antiparallel strands. In addition, some specific pairs such as HIS:HIS, SER:SER, THR:THR, TRP:CYS and ILE:ASN have opposite tendencies for parallel and antiparallel strands. These distinctions in pairing propensities could provide valuable information for discriminating between parallel and antiparallel strands in secondary structure prediction algorithms.

Propensities of amino acid pairings in β-sheet structure have been studied by many researchers [17, 37,38,39,40,41,42,43]. In the study by Fooks et al. [37], every residue pairing has one hydrogen-bonded residue and one non-hydrogen-bonded residue, and data on antiparallel pairings are not available. The study by Hutchinson et al. [38] takes a similar approach to pairings in antiparallel strands. In the study by Frishman et al. [17], the X-ray resolution cutoff for peptides in the data set is slightly high, the number of peptides in the data set is low, and the propensities of residues are not available. Because of these limitations, the findings of this study could not be assessed against those studies. The study by Wouters et al. [40] on antiparallel strands includes a score matrix for hydrogen-bonded pairs. At first glance, the different scores given to the ASP:ASP, ILE:ILE, TYR:TYR and VAL:VAL homopairs by the two studies deserve attention. While the MSA[i, j] matrix assigns the ASP:ASP pair a score as low as 0.255, this pair has a tendency for sheet structure according to Wouters et al. The ILE:ILE, TYR:TYR and VAL:VAL homopairs have a score of 0 in their study, but they have higher scores in the MSA[i, j] matrix. Although ASP:LYS and THR:ASN are high-scoring pairs in the study of Wouters et al., these pairs have scores smaller than unity in the MSA[i, j] matrix. There are further inconsistencies of this kind, in which the two matrices assign opposite propensities to the same pair.

In the study by Kim et al. [39], favoured and unfavoured pairs in parallel and antiparallel strands are given in “Tables 4–7”. According to these tables, the numbers of residue pairs that are favoured in parallel strands, unfavoured in parallel strands, favoured in antiparallel strands and unfavoured in antiparallel strands are 42, 40, 63 and 67, respectively. Of these, only 12, 12, 42 and 45, respectively, agree with the MSP[i, j] and MSA[i, j] matrices.

Although the study by Zhang et al. [41] does not discriminate between hydrogen-bonded and non-hydrogen-bonded pairings, the findings of this study were compared with theirs because the data sets of the two studies are similar in size and criteria (see Supp_ComparisonResultsforSheet.pdf for the comparison results). Because the two other studies by Zhang et al. [42, 43] have inadequate criteria for their data sets, their findings were not used. Of the 210 amino acid pairs, 66 (31%), 44 (21%) and 72 (34%) have opposite propensities for overall, parallel and antiparallel strands, respectively.

Amino acid pairing propensities for helix and sheet structures (represented in Figs. 3 and 5 as the MH[i, j] and MSO[i, j] matrices, respectively) were combined into a single color-coded matrix, shown in Fig. 8.

Fig. 8 This combined matrix represents the amino acid pairing propensities for helix and sheet structures in a single matrix using shades of blue, red and purple. Shades of blue represent a propensity for helix, shades of red represent a propensity for sheet, and shades of purple represent a propensity for both helix and sheet structures. Therefore, for the same pairing, blue implies that the cell value in the MH[i, j] matrix is greater than unity and the cell value in the MSO[i, j] matrix is smaller than unity; red implies that the cell value in the MH[i, j] matrix is smaller than unity and the cell value in the MSO[i, j] matrix is greater than unity; purple implies that the cell values in both the MH[i, j] and MSO[i, j] matrices are greater than unity; gray implies that the cell values in both the MH[i, j] and MSO[i, j] matrices are smaller than unity. Color shades are graded as H (high), M (moderate) and L (low) (Color figure online)
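The color assignment described in the caption of Fig. 8 amounts to a two-way comparison against unity. A minimal Python sketch follows; the function name is hypothetical, a score exactly equal to unity is treated as "not greater than unity" (an assumption, since the caption does not cover that case), and the H/M/L shade grading is omitted because its thresholds are not given in the text.

```python
def classify_pair(m_helix: float, m_sheet: float) -> str:
    """Map the helix and overall-sheet odds ratios of one residue pair
    to the base color used in the combined matrix."""
    helix = m_helix > 1.0   # propensity for helix
    sheet = m_sheet > 1.0   # propensity for sheet
    if helix and sheet:
        return "purple"
    if helix:
        return "blue"
    if sheet:
        return "red"
    return "gray"
```

For example, `classify_pair(2.878, 0.9)` returns `"blue"`, where 2.878 is the MET:MET helix score reported above and 0.9 is an invented sheet score used only for illustration.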

3.3 Assessment of the Backbone Hydrogen Bonding Assumption

This study is mainly based on the unproven assumption by Rose et al. [44] that the energetics of backbone hydrogen bonding is the dominant factor in the protein folding process. Therefore, if dominant factors other than backbone hydrogen bonding are discovered, or if this assumption is refuted, the reliability of the findings of this study would be reduced partially or completely. If other dominant factors exist, the validity of the findings would be expected to depend on the weight of backbone hydrogen bonding among the overall factors. If the backbone hydrogen bonding assumption is refuted, however, the results of this study would become invalid, and any consistency between the findings of this study and the related literature would be accidental.

In a study by Chemmama et al. [60], propensities of amino acid pairings in protein secondary structure were determined using molecular dynamics (MD) simulation. This methodological approach makes their findings independent of any single dominant interaction. Therefore, comparing the findings of this study with those of Chemmama et al. [60] could be informative, at least to some extent, for assessing the reliability of the backbone hydrogen bonding assumption.

Single amino acid propensities were compared using Fig. 9 of this manuscript and Fig. 2 of the manuscript by Chemmama et al. [60]. Only the propensities for helix and sheet were compared; propensities for coil were not included. If an amino acid has the same relative propensities for secondary structure in both figures, the findings for this residue were regarded as being in agreement. According to the comparison, 13 of the 20 residues (ALA, VAL, LEU, ILE, MET, TRP, THR, ASN, GLN, ASP, GLU, LYS and HIS) have the same relative tendencies for the secondary structural elements.

Fig. 9 Single amino acid propensities to helix and sheet structures (see the text)

This high level of agreement (65%) supports the reliability of the backbone hydrogen bonding assumption, but two methodological aspects of the manuscripts must be taken into account. First, Chemmama et al. [60] used only hexapeptides, which are far shorter than an average protein chain. Therefore, in the context of protein folding, all potential interactions with distant residues were ignored in the MD simulation. Second, in this study, amino acid pairs in the sequential order (n, n + 4) were traced for the helix propensities, and no preferential sequential order was imposed on residue pairings for the sheet propensities, whereas in the study by Chemmama et al. [60] only adjacent residue pairs were used.

4 Conclusion

In this study, the propensities of amino acid pairings in the α-helix and β-sheet structures of globular proteins were determined as odds ratios represented by matrices. Because the reliability of the results depends mainly on the quality of the data set, and despite the previous studies on this issue, the author created a new, comprehensive data set using all peptides deposited in the Worldwide Protein Data Bank. Only globular protein chains were included in the data set; membrane proteins, fibrous proteins, immunoglobulins and extremophile-related proteins were removed.

To increase the quality of the data set, both homolog chains and homolog domains in the chains were detected using global and local pairwise alignment algorithms, respectively, and were removed from the data set. Because the alignment parameters and the identity threshold have been determined empirically, homolog chains or domains cannot be identified with absolute certainty. Despite this minor drawback, the data set of this study is one of the qualified data sets available in the related literature.

Comparison of the findings of this study with previous studies shows that the propensities proposed by this and other studies for the same residue pairing may differ. The number of such residue pairings corresponds to 19–34% of all pairings in each secondary structure element. Therefore, the findings of this study could provide valuable information for secondary structure prediction algorithms that are based on hydrogen-bonded residue pairings when predicting the secondary structural elements of a peptide.