1 Introduction

Secondary structure prediction methods (SSPMs) [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18], a class of in silico methods, attempt to identify the secondary structural elements of a protein (in the three-state case, α-helix, β-sheet and random coil) from its primary structure (i.e., the amino acid sequence of the protein). This approach is based on the assumption proposed by Haber and Anfinsen [19] that all the information required for folding is stored in the amino acid sequence. SSPMs have a wide range of applications, from 3D structure prediction methods and bioinformatics tools [20] to protein engineering and drug design [21]. The five new methods developed every year since 2010 [22], the increasing number of publications per year [22] and the nearly 300 published algorithms [23] also show how active the research field is.

Improvement in SSPMs is mainly measured by prediction accuracy. Existing SSPMs have achieved 84% accuracy [22, 24], mainly by incorporating neural networks and deep learning into prediction algorithms [24]. Another factor in this achievement is the growing number of 3D structures deposited in protein databases [25]. The achievable accuracy limit of SSPMs for the three-state secondary structure (α-helix, β-sheet and random coil) of proteins was proposed as 88% by Rost [25], but it was updated to 92% by Ho et al. [23]. Despite the great success in prediction accuracy, there are therefore still at least 8 percentage points to gain. This gain could only be achieved by gathering the information stored in the primary structure of the peptide. This information is partly based on the propensities of the amino acids for secondary structural elements. Therefore, studies on the propensities of residues would provide important information about peptides with regard to both secondary structure prediction and structural features.

Many theoretical and experimental studies have shown that some amino acids, either singly or in groups (pairings, di-, tri-, tetrapeptides, etc.), have an evident tendency toward helix structure [6, 26,27,28,29,30,31,32,33,34,35,36,37,38,39,40]. Among these, residue pairings that have backbone hydrogen bonds deserve special interest. Hydrogen bonding (HB) is one of the major factors in the protein folding process [30, 31, 34, 36, 41, 42], and backbone hydrogen bonding is the dominant one [32, 43]. Therefore, it can be concluded from all these studies that the backbone hydrogen bond between the (n):(n + 4) residue pairing in helices is the major driving force of protein folding. Despite the studies and findings on residue pairing tendencies, almost nothing is known about how these tendencies vary with helix length. This is the main problem addressed by this study.

This study aims to contribute both to the improvement of SSPMs (with regard to their accuracy limit) and to the understanding of the structural characteristics of globular helices (with regard to helix stability) by retrieving valuable information from the primary structure. This information takes the form of variations in (n):(n + 4) pairing propensities with helix length. Although this kind of information is critical for propensity-based SSPMs and has the potential to improve the accuracy limit, the number of studies on this issue is very limited. Because of the small protein data set used [37] or the use of single amino acids instead of residue pairings [44], these few studies do not fully cover the issue. Considering the intrinsic role of backbone HB in the formation and stability of α-helices in globular proteins, only (n):(n + 4) core residue pairings of α-helices with backbone hydrogen bonds were selected for this study. The length of the helices ranges from 13 to 26 residues. It is shown that as helix length increases, the propensities of the ALA:GLY and GLY:GLU pairings for α-helix in globular proteins increase, whereas those of ALA:ALA and ALA:VAL decrease. The frequencies of the ILE:ALA, LEU:ALA, LEU:GLN, LEU:GLU, LEU:LEU, MET:ILE and VAL:LEU pairings do not vary with helix length. While 25 residue pairings show regular variation only in a narrow length range (i.e., over a sub-range of the full 13-to-26 residue range), the remaining pairings have no prominent propensity for α-helix.

2 Materials and Methods

2.1 Protein Dataset

The protein structure data set was obtained from Nacar [40]. The data set includes 4594 globular peptide chains from the Protein Data Bank [45]. Each chain is no shorter than 100 residues and has at least one helix. The resolution of each structure is better than 2.00 Å and the sequence identity is below 25%. The total numbers of helices and of pairings in each length group (13 to 26 residues) are listed in Table 1.

Table 1 Distribution of helices according to their length

2.2 Residue Pairings

The (n):(n + 4) amino acid pairings of helices were determined from the findings by Nacar [40] according to the following criterion: each residue of a pairing must have two backbone hydrogen bonds. This criterion is mainly based on the fact that the hydrogen bond network is a determining factor for both the protein folding process and helix stability. This factor needs to be equalized for all pairings when determining their propensities, so that differences in hydrogen bonding do not mask the true propensities of the pairings. Therefore, each residue of a pairing must have the same number of hydrogen bonds. Hydrogen bonds were determined according to the HB criteria identified by Baker and Hubbard [46]. If a pairing did not satisfy this criterion, it was excluded from the study even though it lies within the helix boundaries specified in the PDB file. Because the residues in the N- and C-capping regions (the first and last four residues of the helix, respectively [47]) may affect the stability of the helix [48], only residue pairings in the core of helices were included, and pairings that contain any amino acid from the N- or C-capping sites were discarded. Because it lacks a free –NH group, a proline residue can only have a backbone hydrogen bond with the (n + 4) residue. So, pairings including a proline residue (that is, PRO:PRO, PRO:XXX and XXX:PRO pairings) were also excluded. Therefore, each residue of an accepted (n):(n + 4) pairing has two backbone hydrogen bonds: (n) with the (n − 4) and (n + 4) residues, and (n + 4) with the (n) and (n + 8) residues. All accepted pairings are homogeneous in this respect. Because of this restriction, a helix with a length of 13 residues can contain only one pairing, a helix with a length of 14 residues two pairings, and so forth.
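The positional part of these selection rules can be sketched in code. The following is a minimal Python illustration, not the procedure actually used in the study: the function name `core_pairings` and the input format (a list of three-letter residue codes for one helix) are hypothetical, and the Baker and Hubbard hydrogen bond test is deliberately left out because it requires atomic coordinates. The sketch only shows how excluding the four-residue capping regions and proline leaves one candidate pairing in a 13-residue helix and two in a 14-residue helix.

```python
from typing import List, Tuple

CAP = 4     # capping region: first and last four residues of the helix
OFFSET = 4  # spacing of an (n):(n + 4) pairing


def core_pairings(helix: List[str]) -> List[Tuple[str, str]]:
    """Return ordered (n):(n + 4) pairings with both residues in the helix core.

    Pairings that contain proline are dropped. The backbone hydrogen bond
    check is NOT performed here (assumption: it is applied separately).
    """
    core_start = CAP             # index of the first core residue (0-based)
    core_end = len(helix) - CAP  # one past the index of the last core residue
    pairs = []
    for n in range(core_start, core_end - OFFSET):
        first, second = helix[n], helix[n + OFFSET]
        if "PRO" in (first, second):
            continue  # PRO:XXX, XXX:PRO and PRO:PRO are excluded
        pairs.append((first, second))
    return pairs


# Hypothetical 13-residue helix: only positions 5 and 9 (1-based) form a core pairing.
example = ["SER", "GLU", "GLU", "LEU", "ALA", "LYS", "LEU",
           "ILE", "GLY", "ARG", "LEU", "LYS", "ASN"]
print(core_pairings(example))
```

Under these assumptions a helix of L residues yields at most L − 12 candidate pairings, which is consistent with 13 residues being the shortest usable helix length.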

2.3 Limits of Helix Length

As discussed in Sect. 2.2, the smallest helix length that can include at least one pairing satisfying the HB criterion is 13 residues. So, the lower helix length limit was set at 13 residues. Since the number of helices longer than 26 residues in the data set is very limited and long helices are mostly found in membrane and fibrous proteins, the upper helix length limit was set at 26 residues. Therefore, helix length was limited to 13 to 26 residues.

2.4 Frequencies of the Residue Pairings

The frequency of each pairing was calculated as the ratio of the number of occurrences of that pairing in helices of a specified length (L) to the total number of all pairings in helices of the same length, expressed as a percentage (Eq. 1).

$${f}_{{XXX}_{1}:{XXX}_{2}\left(L\right)}= \frac{{N}_{{XXX}_{1}:{XXX}_{2}\left(L\right)}}{\sum {N}_{pairings\left(L\right)}} \times 100,$$
(1)
where
$${f}_{{XXX}_{1}:{XXX}_{2}\left(L\right)}=\text{frequency of the } {XXX}_{1}:{XXX}_{2} \text{ pairing in helices with a length of } L \text{ residues},$$
$${N}_{{XXX}_{1}:{XXX}_{2}\left(L\right)}=\text{number of } {XXX}_{1}:{XXX}_{2} \text{ pairings in helices with a length of } L \text{ residues},$$
$$\sum {N}_{pairings\left(L\right)}=\text{total number of all pairings in helices with a length of } L \text{ residues}.$$

The order of the residues in a pairing matters; that is, the XXX1:XXX2 and XXX2:XXX1 pairings are not identical. A backbone HB between paired residues in helices requires an –NH group in the (n + 4) residue, but proline lacks a free –NH group. Because of this restriction, frequencies of the XXX:PRO residue pairings were not calculated.
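As a worked illustration of Eq. 1, the sketch below computes pairing frequencies as percentages for one helix length group. It is a minimal Python example under stated assumptions: the function name `pairing_frequencies` and the toy input are hypothetical, and the input is assumed to be the already filtered list of accepted pairings pooled over all helices of length L. Pairings are kept as ordered tuples, so ALA:GLY and GLY:ALA are counted separately.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def pairing_frequencies(pairs: Iterable[Pair]) -> Dict[Pair, float]:
    """Frequency (%) of each ordered pairing among all pairings of one length group (Eq. 1)."""
    counts = Counter(pairs)       # N_{XXX1:XXX2(L)} for every observed pairing
    total = sum(counts.values())  # sum of N_{pairings(L)}
    if total == 0:
        return {}
    return {pair: 100.0 * n / total for pair, n in counts.items()}


# Toy data (not from the study): four accepted pairings from helices of one length.
toy = [("ALA", "GLY"), ("GLY", "ALA"), ("ALA", "GLY"), ("LEU", "LEU")]
for pair, freq in pairing_frequencies(toy).items():
    print(f"{pair[0]}:{pair[1]} {freq:.1f}%")
```

Repeating the calculation for each length group from 13 to 26 residues would give the frequency-versus-length series from which a trend can be read.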

2.5 Frequencies of the Amino Acids

The frequency of each amino acid in helices of a certain length was calculated as the ratio of the number of occurrences of that amino acid in helices of that length to the total number of all amino acids in helices of the same length, expressed as a percentage. Amino acid frequencies were calculated in two different ways. In the first (labeled fnonHB), the amino acid set contains all residues within the helix boundaries specified in the PDB files, disregarding the backbone hydrogen bonding (HB) criterion (Eq. 2); in the second (labeled fHB), it contains only the residues in these helices that satisfy the backbone HB criterion (Eq. 3).

$${f}_{nonHB(L)}=\frac{{N}_{XXX(L)}}{\sum {N}_{amino\_acids(L)}} \times 100,$$
(2)
where all residues within the helix boundaries are counted and
$${f}_{nonHB(L)}=\text{frequency of the } XXX \text{ amino acid in helices with a length of } L \text{ residues},$$
$${N}_{XXX(L)}=\text{number of } XXX \text{ amino acids in helices with a length of } L \text{ residues},$$
$$\sum {N}_{amino\_acids(L)}=\text{total number of all amino acids in helices with a length of } L \text{ residues},$$
$${f}_{HB(L)}=\frac{{N}_{XXX(L)}}{\sum {N}_{amino\_acids(L)}} \times 100,$$
(3)
where only residues satisfying the backbone HB criterion are counted and
$${f}_{HB(L)}=\text{frequency of the } XXX \text{ amino acid in helices with a length of } L \text{ residues},$$
$${N}_{XXX(L)}=\text{number of } XXX \text{ amino acids in helices with a length of } L \text{ residues},$$
$$\sum {N}_{amino\_acids(L)}=\text{total number of all amino acids in helices with a length of } L \text{ residues}.$$
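The difference between Eqs. 2 and 3 lies only in which residues are counted, which a short sketch makes explicit. This is a minimal Python illustration under assumptions: the function name `amino_acid_frequencies` and the input format (one `(residue, satisfies_hb)` tuple per helix residue of a given length group, with the HB flag computed elsewhere) are hypothetical and are not taken from the study.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def _percent(counts: Counter) -> Dict[str, float]:
    """Convert raw counts to percentages of the total count."""
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()} if total else {}


def amino_acid_frequencies(
    residues: Iterable[Tuple[str, bool]]
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (f_nonHB, f_HB) in percent for one helix length group.

    f_nonHB counts every residue inside the helix boundaries (Eq. 2);
    f_HB counts only residues flagged as satisfying the HB criterion (Eq. 3).
    """
    residues = list(residues)
    all_counts = Counter(name for name, _ in residues)
    hb_counts = Counter(name for name, ok in residues if ok)
    return _percent(all_counts), _percent(hb_counts)


# Toy data (not from the study): proline never satisfies the criterion.
toy_residues = [("ALA", True), ("ALA", False), ("PRO", False), ("LEU", True)]
f_nonhb, f_hb = amino_acid_frequencies(toy_residues)
print(f_nonhb)
print(f_hb)
```

In this toy input proline contributes to fnonHB but is absent from fHB, which mirrors why proline is the expected exception when the two tables are compared.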

2.6 Trend Lines in Pairing Figures

Trend lines in the pairing figures (Figs. 1, 2, 3, 4, 5) were drawn with Microsoft Excel (Microsoft Office Professional Plus 2016) using simple linear regression based on least-squares estimation. The parameters of the trend lines [slope, y-intercept, coefficient of determination (R2) and standard error of estimate (SEE)] were calculated using the same software and are given in Table 2.
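For readers who want to reproduce the trend line parameters outside a spreadsheet, the four quantities can be computed directly from the least-squares formulas. The sketch below is a plain Python stand-in for the Excel calculation, not the authors' workflow: the function name `trend_line` and the toy data are hypothetical, and SEE is assumed to use the usual n − 2 degrees of freedom for simple linear regression.

```python
import math
from typing import Sequence, Tuple


def trend_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (slope, intercept, R^2, SEE) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three (x, y) points of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    see = math.sqrt(ss_res / (n - 2))  # assumption: n - 2 degrees of freedom
    return slope, intercept, r2, see


# Toy data (not from the study): helix lengths and pairing frequencies in percent.
lengths = [13, 14, 15, 16]
freqs = [1.0, 3.0, 2.0, 4.0]
slope, intercept, r2, see = trend_line(lengths, freqs)
print(f"slope={slope:.3f} intercept={intercept:.3f} R2={r2:.3f} SEE={see:.3f}")
```

A positive slope with a reasonably high R2 would correspond to an increasing propensity with helix length, and a slope near zero to a steady one.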

Fig. 1 The propensities of the ALA:GLY (a) and GLY:GLU (b) pairings increase with helix length. The increasing trends are shown as red dotted lines fitted by simple linear regression, and standard errors are shown as error bars. See Table 2 for the parameters of the trend lines (Color figure online)

Fig. 2 The propensities of the ALA:ALA (a) and ALA:VAL (b) pairings decrease with helix length. The decreasing trends are shown as red dotted lines fitted by simple linear regression, and standard errors are shown as error bars. Some error bars may not be visible because they are smaller than the points. See Table 2 for the parameters of the trend lines (Color figure online)

Fig. 3 The propensities of the ILE:ALA, LEU:ALA, LEU:GLN, LEU:GLU, LEU:LEU, MET:ILE, and VAL:LEU pairings remain roughly constant with helix length, with some exceptions. For instance, the frequencies at a helix length of 13 residues for the ILE:ALA (a), LEU:LEU (e) and VAL:LEU (g) pairings are higher than those at the remaining lengths. Trends are shown as red dotted lines fitted by simple linear regression, and standard errors are shown as error bars. Some error bars may not be visible because they are smaller than the points. See Table 2 for the parameters of the trend lines (Color figure online)

Fig. 4 The GLY:VAL (a) and HIS:MET (b) pairings are shown as two examples of propensities that increase in a limited range of helix length. Likewise, the ASP:ILE (c) and VAL:PHE (d) pairings are shown as two examples of decreasing propensities. The narrow length ranges of the local propensities of the GLY:VAL, HIS:MET, ASP:ILE and VAL:PHE pairings are 13–22, 13–21, 16–21 and 17–23 residues, respectively. See Supplement_3-25 Pairings Propensities.docx and Table 5 for the remaining pairings. Trends are shown as red dotted lines fitted by simple linear regression, and standard errors are shown as error bars. Some error bars may not be visible because they are smaller than the points. See Table 2 for the parameters of the trend lines (Color figure online)

Fig. 5 The finding of this study on the propensity of the HIS:ARG pairing (a) is partially consistent with “The Distribution of Ion Pairs with Helix Length” (b), which was prepared using Table VI from the article by Sundaralingam et al. Because the individual ion pairs were not shown separately by Sundaralingam et al., this comparison is not completely valid

Table 2 Parameters of trend lines in pairing figures

3 Results

3.1 Frequencies of the Amino Acids

The amino acid frequencies calculated as fnonHB and fHB are given in Tables 3 and 4, respectively. As seen in Tables 3 and 4, the variation of the amino acid frequencies with helix length is negligible, except for proline. This finding implies that the variations in the propensities of residue pairings in helices are independent of the residue frequencies.

Table 3 The calculated amino acid frequencies for helices with lengths of 13 to 26 residues (hydrogen bond criterion not applied)
Table 4 The calculated amino acid frequencies for helices with lengths of 13 to 26 residues (hydrogen bond criterion applied)

3.2 Propensities of the Residue Pairings in Helices with Different Length

The frequencies of all 400 residue pairings (except pairings that include PRO) in helices of different lengths, the errors in the frequencies, and the calculations of the trend line parameters are given in Supplement_1-Pairing Frequencies, Errors, SEE.xslx. The variations of these frequencies with length (that is, the propensities of the pairings) are shown in Supplement_2-All Pairing Propensities.docx as graphs with linear regression lines. The standard errors of the pairing frequencies in each length group are shown in the figures as error bars. Although frequencies of pairings including a proline residue were determined, these results were not evaluated for pairing propensities because they do not satisfy the double backbone hydrogen bond requirement described in Sect. 2.2. The propensities of the ALA:GLY (Fig. 1a) and GLY:GLU (Fig. 1b) pairings for α-helix in globular proteins increase with increasing helix length, while the propensities of the ALA:ALA (Fig. 2a) and ALA:VAL (Fig. 2b) pairings decrease.

The steep decreases at a helix length of 26 residues for both the ALA:GLY (Fig. 1a) and GLY:GLU (Fig. 1b) pairings resemble the variations seen in other pairings such as HIS:MET (at a length of 24 residues, Fig. 4b) or HIS:ARG (at lengths of 14 and 23 residues, Fig. 5a). Despite this similarity, these decreases require further explanation because they occur at the upper length limit. The propensities of these pairings may vary in longer helices. However, because this study limits the maximum helix length to 26 residues, this discrepancy can only be resolved by further studies that include longer helices.

The propensities of the ILE:ALA, LEU:ALA, LEU:GLN, LEU:GLU, LEU:LEU, MET:ILE and VAL:LEU (Fig. 3a–g) pairings were considered steady despite some relatively small discrepancies in the frequencies at certain lengths. These discrepancies occur at lengths of 13, 13, 14, 24, 13, 19, and 13 residues for the ILE:ALA (Fig. 3a), LEU:ALA (Fig. 3b), LEU:GLN (Fig. 3c), LEU:GLU (Fig. 3d), LEU:LEU (Fig. 3e), MET:ILE (Fig. 3f) and VAL:LEU (Fig. 3g) pairings, respectively.

There are also regularities in narrow length ranges for 25 pairings (listed in Table 5), corresponding to local propensities in the related graphs (see Supplement_3-25 Pairings Propensities.docx). Four of them, two increasing (the GLY:VAL and HIS:MET pairings) and two decreasing (the ASP:ILE and VAL:PHE pairings), are shown in Fig. 4a–d, respectively. The remaining residue pairings show no prominent propensity or frequency variation.

Table 5 The 25 residue pairings that have local propensities in a narrow length range

4 Discussion

In this study, how the propensities of (n):(n + 4) amino acid pairings in α-helices of globular proteins vary with helix length was investigated using a comprehensive and high-quality protein data set [40]. In statistical studies, the size of the data set directly affects the quality of the findings. Therefore, a data set that is representative of all globular helices was chosen for this study. Only residue pairings satisfying the following criterion were accepted: each residue i of a pairing must have two backbone hydrogen bonds, one with the (i − 4)th residue and one with the (i + 4)th residue. This criterion was set to make all pairings equivalent in terms of the number of hydrogen bonds, and it is mainly based on the “backbone-based theory of protein folding” of Rose et al. [43]. This theory states that backbone HB is the dominant driving force of protein folding. Since α-helices have many backbone hydrogen bonds, their structural features should be closely related to these bonds. Therefore, a better understanding of the relationship between the structural characteristics of helices and backbone HB would be valuable with regard to both SSPMs and protein stability.

The number of studies on the relation between residue pairing propensities and helix length is very limited. Some studies have shown that the propensities of single amino acids vary with helix length [44, 49] or with location in the helix [33]. In one of the rare studies on pairings, Sundaralingam et al. [37] investigated how the distribution of amino acids in proteins varies with helix length using only charged residues from 47 globular proteins. These residues were grouped as Ion Pairs (oppositely charged residues) and Like Pairs (residues with the same charge). Their study included neither all possible pairings (only Ion Pairs) nor an HB criterion. When their findings from Table VI on the distribution of (n):(n + 4) ion pairs with helix length are plotted (Fig. 5b), the observed frequencies are seen to decrease with length. According to the findings of this study, charged residue pairings show no variation in propensity for α-helix, except for HIS:ARG. The propensity of the HIS:ARG pairing for helices in the length range of 16 to 20 residues decreases with length, and this finding is consistent with the finding by Sundaralingam et al. [37] shown in their Table VI (Fig. 5a, b). However, this comparison is not entirely valid, as Sundaralingam et al. [37] did not show their findings for each Ion Pair separately.

In another study, by Wang et al. using 1430 peptides [44], it was shown that the propensities of the PRO and TRP amino acids for helix vary with helix length. The propensities of residue dyads, i.e. adjacent residues (n:n + 1/n − 1:n pairs), for helix structure were also analyzed, and the propensities of many dyads were shown to vary with helix length. Despite the large protein data set used, that study did not include (n):(n + 4) pairings. Moreover, because helix length was classified only roughly as short, middle and long, variations in propensities cannot be assessed precisely. Therefore, the findings of Wang et al. [44] and of this study are not comparable. Since the helix length in this study was limited to a minimum of 13 residues, studies on short helices were not included in the discussion.

The findings of this study address two issues: secondary structure prediction and helix stability.

4.1 Secondary Structure Prediction

Accurate secondary structure prediction is an important step in predicting the tertiary structure of a protein using ab initio prediction methods. One of the most important problems of SSPMs is that helix boundaries cannot be determined precisely. Helix residues are classified as N-capping, C-capping and core residues. Since the N- and C-capping residues are the first and last four residues of the helix, respectively, the boundaries of an α-helix could be determined precisely if the core residues were predicted exactly. Because this study includes only core residues, its findings would be valuable in predicting the core region of the helix. Amino acid pairings whose propensities vary with helix length could be used as helix core markers depending on the helix length. Therefore, these residue pairings, identified from information in the primary structure of the peptide, would improve the accuracy limits of SSPMs. With this progress, ab initio prediction methods would also improve. Since the minimum helix length was limited to 13 residues, the findings of this study are not applicable to the prediction of short helices.

4.2 Helix Stability

Although the thermodynamics of α-helix formation has been well known for many years [50, 51], it is not clear how the factors that determine helix stability [30, 34, 36, 39, 52, 53] change with helix length. It is thought that helix stability increases with length and that this is mainly related to the hydrogen bond network [54,55,56,57,58,59]. In this context, it could be proposed that the variation of the propensities of (n):(n + 4) amino acid pairings with length is related to helix stability. Therefore, the variations in the propensities of the amino acid pairings presented in Sect. 3 can be associated with helix stability. So, it could be concluded that the ALA:GLY and GLY:GLU pairings, whose propensities increase with length, would increase helix stability. Likewise, the ALA:ALA and ALA:VAL pairings, whose propensities decrease with length, would affect helix stability negatively. The seven pairings whose propensities remain constant could be considered neutral in this sense. The other 25 residue pairings, which show a definite propensity over a more restricted length range, could be evaluated similarly. Although there are studies in the literature on single amino acids, especially on polyalanine peptides, there are no studies that overlap with the findings obtained in this study. It should also be noted that there were no short helices in this study, and therefore its findings on structural features do not cover all α-helices.

An improved understanding of helix stability would be very useful, especially in protein engineering, de novo protein design and protein folding. Some specific amino acid pairings may be preferred or avoided in order to obtain the desired degree of stability. Considering that information on the variation of amino acid propensities with helix length is very limited, the findings of this study could make important contributions to the field in this context.

5 Conclusion

Even though there are many studies on the propensities of amino acids for α-helix, the number of studies on the variation of propensities with length is very limited, especially those including (n):(n + 4) pairings. Besides that, the findings of these rare studies are not conclusive because of drawbacks such as an insufficient data set or poor pairing criteria. In this study, the variations in the propensities of residue pairings with helix length were investigated using a comprehensive data set and a rigorous biophysical criterion based on backbone HB. The findings show that as helix length increases, the propensities of the ALA:GLY and GLY:GLU pairings increase, whereas those of ALA:ALA and ALA:VAL decrease. The frequencies of the ILE:ALA, LEU:ALA, LEU:GLN, LEU:GLU, LEU:LEU, MET:ILE and VAL:LEU pairings do not vary with helix length. In addition, 25 residue pairings show regular variation in a narrow length range, while the remaining pairings have no prominent propensity for α-helix. These pairings, except the last group, could be used as additional parameters to specifically predict the core region of the α-helix. Therefore, these findings may advance SSPMs with regard to their accuracy limit. The other contribution of these findings to the field could be in helix stability. Since length is one of the factors of helix stability, parameters related to length would be useful in evaluating stability. This issue is especially important in de novo protein design and protein engineering. However, the findings of this study do not cover shorter helices because the helix length was restricted to 13 to 26 residues.