Introduction

According to Anfinsen’s theory, the primary sequence of a protein as an ordered string of amino acids contains all the information required for it to gain its final functional three-dimensional structure (Anfinsen 1973). In the 45 years since this argument was made, important questions have been raised concerning the origin and identity of protein fold information (Dill and MacCallum 2012). These questions resulted in significant efforts toward elucidation of the information hidden in protein sequences. Statistical analysis of databases containing protein sequences indicates that the 20 naturally occurring amino acids do not occur with equal frequency (Rani et al. 1995), while in other studies the relative frequency of each amino acid in a group of similar proteins has been determined (Schwartz et al. 2001). However, a single residue in a sequence carries limited information, and the context of any residue, e.g., its neighbors, can play a crucial role in its structural and/or functional properties (Fu et al. 2014). Hence, finding any regularity in protein sequences, including in dipeptides, is of great importance, but in spite of much argument this need remains unaddressed (Hermans 2011). The frequency of motifs in proteins was first investigated in the context of protein primary structure sourced from whole-protein sequence databases (Unger and Sussman 1993; Aitken 1999), while Vonderviszt et al. analyzed the frequency of dipeptides in the sequences of known proteins (Vonderviszt et al. 1986). However, the total data set in protein databases was limited at that time, and their input data contained only the primary sequences of proteins with no reference to secondary structures. It has been reported that some dipeptides may play a critical role in intracellular protein stability (Guruprasad et al. 1990), and Reddy et al. analyzed some representative dipeptides and found that stabilizing and destabilizing dipeptides have different patterns of interactions (Reddy 1996).
Furthermore, analyzing three-dimensional structure databases has revealed that Cys residue oxidation is affected by neighboring residues (Fiser et al. 1992). By encoding dipeptide features and selecting a subset of dipeptide compositions, Nakariyakul et al. developed an interaction predictor tool and reported that selected dipeptide features have important roles in the specificity of protein domain interactions (Nakariyakul and Liu 2011). In other studies, it was found by statistical database analysis of the four major structural classes of protein including all-alpha, all-beta, alpha/beta and alpha + beta proteins that the propensities of each amino acid for the secondary structure are related to the structural class of the protein overall (Costantini et al. 2006; Ismail and Chowdhury 2010), but this analysis concerned the propensities of single amino acids rather than dipeptides. The outputs of such studies on the single amino acids led to several important assumptions in protein science that formed the basis for applications such as substitution matrices (Henikoff and Henikoff 1992) and structural prediction algorithms (Lim 1974). However, in the majority of these applications, the direct effect of neighbor residues was ignored. For instance, in the construction of substitution matrices based on multiple sequence alignment of protein superfamilies, the identity of only a single amino acid in an alignment file is considered despite the fact that it seems the conservation of a single residue may be affected by adjacent amino acids (Anishetty et al. 2002; Betancourt and Skolnick 2004).

For the reasons discussed, it is important to identify regular patterns of di- and/or tri-peptides (motifs), which are specific for a group of protein families and may have similar structural and functional consequences to each other. To shift the concept of the neighbor effect from sequence-based information to include a three-dimensional structural element, we first investigated the frequency of different mono- and dipeptides in defined structural classes of proteins including all-alpha, all-beta, alpha + beta and alpha/beta proteins. We found that the frequency of dipeptides is not the same in different structural classes. Additionally, we found that in structurally similar proteins some dipeptides are not randomly distributed, and the first or second position of these motifs is occupied by specific amino acids. We conclude that the microenvironment of an amino acid can be considered as an evolutionary driving force in dictating the structural properties of a protein, which leads to directional selection of amino acids for structural and functional purposes.

Materials and methods

Data

All structures were selected from the Protein Data Bank (Berman 2000) using the advanced search menu. All selected structures had been solved by X-ray crystallography at a resolution better than 3.0 Å. Structures containing ligands and those sharing more than 30% sequence identity were omitted. Structural classes were filtered in the search menu using both the ScopTree and CathTree options. Based on these criteria, we found 499, 587, 626 and 670 structures for the all-alpha, all-beta, alpha + beta and alpha/beta protein classes, respectively, at the end of 2015. Among them, 125 structures were sampled randomly for each structural class. Any structure with unusual, unknown or missing amino acids was discarded. Thus, our final data set consists of 400 protein structures containing 152,474 residues. All structures were converted from PDB to DSSP file format using the Linux-based mkDSSP program (Kabsch and Sander 1983; Joosten et al. 2011). They were then analyzed with the PARS software (Fathinavid et al. http://www.znu.ac.ir/members/newpage/702) to calculate the frequency of each of the 20 residues (or monopeptides) and 400 dipeptides. The output was further analyzed in MS-Excel. All analyses were performed separately for each structural class as well as for the total data set.
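
The internals of the PARS software are not described here, so the following is only a minimal sketch of the counting step. It assumes that dipeptides are counted as overlapping adjacent residue pairs and that chains are available as one-letter strings; the function name and the toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def count_mono_and_dipeptides(sequences):
    """Count single residues and overlapping adjacent pairs (assumed scheme)."""
    mono = Counter()
    di = Counter()
    for seq in sequences:
        # Mirror the filtering rule in the text: skip chains with
        # unusual or unknown residues.
        if any(res not in AMINO_ACIDS for res in seq):
            continue
        mono.update(seq)
        # Residues k and k+1 form one dipeptide, so a chain of length L
        # contributes L - 1 dipeptides.
        di.update(seq[k:k + 2] for k in range(len(seq) - 1))
    return mono, di


# Hypothetical toy sequences, not taken from the data set of this study.
mono, di = count_mono_and_dipeptides(["ACCA", "HHP", "AXC"])
print(mono["C"], di["CC"], di["HH"])
```

The third toy sequence contains a non-standard symbol and is therefore skipped, which is why it contributes nothing to either counter.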

Normalized frequency distribution

Based on the results of the PARS software, the total number of monopeptides and dipeptides for each structural class as well as for the total data set was calculated. As proposed by Vonderviszt et al., the normalized frequency distribution for any dipeptide ($S_{ij}$) formed by the ith and jth monopeptides in the first and second positions, respectively, was calculated by the following equation (Vonderviszt et al. 1986):

$$S_{ij} = \frac{O_{ij}}{E_{ij}} \tag{1}$$

where $O_{ij}$ and $E_{ij}$ are the observed and expected numbers of occurrences of dipeptide ij in the data set, respectively. The values of $E_{ij}$ for each dipeptide in each respective data set (total or any structural class) were calculated by Eq. 2:

$$E_{ij} = P_{i} \times P_{j} \times N \tag{2}$$

Here, $P_{i}$ and $P_{j}$ are the relative frequencies of the individual amino acids in the first and second positions of a given dipeptide, and N is the total number of dipeptides in the corresponding data set. The values of $P_{i}$ and $P_{j}$ are provided in different columns of Table 1.
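
Equations 1 and 2 can be illustrated with a short sketch. It assumes that the relative frequencies are taken from the overall residue composition of the data set (as in Table 1); the function name and the two-letter toy alphabet with its counts are hypothetical and serve only to show the arithmetic.

```python
def normalized_frequencies(mono, di):
    """Return {dipeptide: S_ij}, with S_ij = O_ij / E_ij and E_ij = P_i * P_j * N."""
    total_residues = sum(mono.values())
    n_dipeptides = sum(di.values())  # N in Eq. 2
    s = {}
    for i in mono:
        for j in mono:
            p_i = mono[i] / total_residues  # relative frequency of residue i
            p_j = mono[j] / total_residues  # relative frequency of residue j
            expected = p_i * p_j * n_dipeptides  # E_ij (Eq. 2)
            observed = di.get(i + j, 0)          # O_ij
            s[i + j] = observed / expected       # S_ij (Eq. 1)
    return s


# Hypothetical counts for a two-letter alphabet, for illustration only.
mono = {"A": 60, "C": 40}
di = {"AA": 30, "AC": 30, "CA": 29, "CC": 10}
s = normalized_frequencies(mono, di)
for pair in sorted(s):
    print(pair, round(s[pair], 2))
```

In this toy case the expected count for CC is 0.4 × 0.4 × 99 = 15.84, so ten observed occurrences give a ratio of about 0.63.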

Table 1 Relative abundance (%) of monopeptides in the total data set and different structural classes

Based on these criteria, $S_{ij}$ = 1.0 corresponds to a completely random association of an ij pair in the primary sequence, while values at least 1.5-fold above unity (≥ 1.5) or 1.5-fold below unity (≤ 0.67) are regarded as non-random distributions indicating preferential association and avoidance in the primary structure, respectively. Accordingly, values ≥ 1.5 and ≤ 0.67 are shown in red and blue font, respectively.
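
As a minimal sketch, the cut-offs stated above (1.5 for association and 0.67 for avoidance) can be applied as a simple three-way classification. The function name and the labels are hypothetical.

```python
def classify(s_ij, high=1.5, low=0.67):
    """Label a normalized frequency using the cut-offs given in the text."""
    if s_ij >= high:
        return "association"  # occurs much more often than expected
    if s_ij <= low:
        return "avoidance"    # occurs much less often than expected
    return "near-random"


for value in (0.13, 1.0, 3.97):
    print(value, classify(value))
```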

Cross-correlation coefficient

Cross-correlation coefficients were extracted from the $S_{ij}$ matrices by determining the correlation row-wise and column-wise for the ith and jth positions, respectively.

The cross-correlation coefficient can be used to reveal similarities among the preferred sequential environments of various amino acids and also to determine the tendency of each residue to localize in the first or second position of an ij pair. In our correlation coefficient analysis, to guard against statistical fluctuations, only coefficients with P values less than 0.01 were considered statistically significant. Values of ±1.0 correspond to perfect correlation between two number series, while 0.0 means no correlation.
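
The row-wise and column-wise comparison can be sketched as follows. This assumes the coefficient is the ordinary Pearson correlation, which the text does not state explicitly; the function names and the 3 × 3 toy matrix are hypothetical, and the significance filter (P < 0.01) is not implemented here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant number series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def position_correlations(matrix):
    """matrix[i][j] holds S_ij. Rows compare residues in the first (i) position,
    columns compare residues in the second (j) position."""
    size = len(matrix)
    columns = [list(col) for col in zip(*matrix)]
    first = [[pearson(matrix[a], matrix[b]) for b in range(size)] for a in range(size)]
    second = [[pearson(columns[a], columns[b]) for b in range(size)] for a in range(size)]
    return first, second


# Hypothetical 3 x 3 matrix; a real S_ij matrix would be 20 x 20.
toy = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 2.0, 1.0],
]
first, second = position_correlations(toy)
print(first[0][1], first[0][2], second[0][1])
```

In the toy matrix, rows 0 and 1 have proportional profiles, rows 0 and 2 have opposite profiles, and columns 0 and 1 are uncorrelated, which covers the three reference cases for a correlation coefficient.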

Results and discussion

The sampling and choice of data set are very important because proteins that belong to the same family may have similar evolutionarily conserved dipeptides, which could bias our analysis. Hence, stringent search criteria were used so that the selected proteins share at most 30% sequence identity. It is also notable that several proteins may contain highly homologous domains or repetitive sequences, leading to the problem of redundancy. We minimized this effect by using the largest available data set. It should be noted that the observed and expected frequencies ($O_{ij}$ and $E_{ij}$) are not listed, and only the $S_{ij}$ and correlation coefficient values are addressed directly here.

The relative frequencies of all 20 monopeptides in the total data set and for all structural classes are provided in Table 1. These data show high correlation (CC = 0.97) with the result of the study by Xia and Xie in which more than 7343 protein sequences were analyzed (Xia and Xie 2002). The data in Table 1 indicate that the frequency distribution of all amino acids for different structural classes of proteins is not the same. Furthermore, our calculated and expected frequencies of dipeptides show good correlation (CC = 0.94 and 0.96) with the report provided by Shen et al. (2006).

In the next step of the analysis, the normalized frequency distribution of dipeptides ($S_{ij}$) was calculated for the total data set as well as for the all-alpha, all-beta, alpha + beta and alpha/beta structural classes (Fig. 1). Quantitative data are provided as Tables 2–6 in the supplementary material. The $S_{ij}$ values range from 0.13 (indicating avoidance) to 3.97 (indicating favorable association). Since $S_{ij}$ is the ratio of observed to expected values of dipeptides, $S_{ij}$ ≥ 1.5 indicates that a given dipeptide occurs at least 1.5× as often as expected and is taken as the boundary for a high tendency for association. Similarly, values of $S_{ij}$ ≤ 0.67 are taken as a measure of avoidance. These critical values are shown in the red- and blue-colored spectrum in Fig. 1 and in the corresponding tables in the supplementary data.

Fig. 1 Graphical representation of the normalized frequency distribution matrix, $S_{ij}$, of dipeptide fragments for the all-alpha (a), all-beta (b), alpha + beta (c) and alpha/beta (d) classes and our total data set (e). In each panel, the first position of a dipeptide (i-position) is shown on the horizontal axis and the second position (j-position) on the vertical axis. To make differences between specific cells easier to find, the numerical values are also provided in supplementary material 1.

As shown in Fig. 1a, in the all-alpha structural class there are 19 dipeptides with extremely high $S_{ij}$ values, including Cys-Cys, Met-Cys, Trp-Cys, His-Phe, Cys-His, His-His, Arg-His, His-Met, Ser-Met, Met-Asn, His-Pro, Pro-Pro, Trp-Pro, Tyr-Pro, Cys-Arg, Ala-Trp, Ser-Trp, Trp-Trp and Asp-Tyr, while 29 dipeptides, including Cys-Ala, His-Cys, Asn-Cys, His-Asp, Met-Gly, His-His, Asn-His, Thr-His, Trp-His, Cys-Ile, Trp-Ile, His-Lys, Pro-Lys, Cys-Leu, Cys-Met, Gly-Met, Pro-Met, Ala-Asn, Ala-Pro, Glu-Pro, His-Arg, Arg-Ser, Cys-Trp, Ile-Trp, Leu-Trp, Met-Trp, Tyr-Trp, Ala-Tyr and Trp-Tyr, have extremely low $S_{ij}$ values. These data together demonstrate that Cys, His and Trp are the most selective amino acids in their sequential association: 12 Cys-containing, 14 His-containing and 13 Trp-containing dipeptides are characterized by extremely high or low $S_{ij}$ values. In contrast, Gln and Val appear to be virtually neutral, showing a nearly random association with other amino acids in the all-alpha class. Other amino acids have a moderate tendency to be selective. Additionally, there are four Ala-containing dipeptides with extremely low $S_{ij}$ values and only one with an extremely high $S_{ij}$ value. This means that the selectivity of Ala is toward avoidance rather than association. It was also found that Ala is more selective when it localizes in the first position of an ij pair.
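
The per-residue tallies quoted for the all-alpha class can be reproduced directly from the two lists given in the text. The sketch below is only a bookkeeping aid with hypothetical variable and function names; it counts each list separately, so His-His, which is printed in both lists, contributes once to each.

```python
# Dipeptides listed in the text for the all-alpha class (Fig. 1a).
HIGH = (
    "Cys-Cys Met-Cys Trp-Cys His-Phe Cys-His His-His Arg-His His-Met Ser-Met "
    "Met-Asn His-Pro Pro-Pro Trp-Pro Tyr-Pro Cys-Arg Ala-Trp Ser-Trp Trp-Trp Asp-Tyr"
).split()
LOW = (
    "Cys-Ala His-Cys Asn-Cys His-Asp Met-Gly His-His Asn-His Thr-His Trp-His "
    "Cys-Ile Trp-Ile His-Lys Pro-Lys Cys-Leu Cys-Met Gly-Met Pro-Met Ala-Asn "
    "Ala-Pro Glu-Pro His-Arg Arg-Ser Cys-Trp Ile-Trp Leu-Trp Met-Trp Tyr-Trp "
    "Ala-Tyr Trp-Tyr"
).split()


def count_containing(residue, pairs):
    """Number of listed dipeptides that contain the residue in either position."""
    return sum(residue in pair.split("-") for pair in pairs)


def count_first_position(residue, pairs):
    """Number of listed dipeptides with the residue in the first (i) position."""
    return sum(pair.split("-")[0] == residue for pair in pairs)


totals = {res: count_containing(res, HIGH) + count_containing(res, LOW)
          for res in ("Cys", "His", "Trp", "Gln", "Val")}
print(totals)  # {'Cys': 12, 'His': 14, 'Trp': 13, 'Gln': 0, 'Val': 0}
print(count_containing("Ala", HIGH), count_containing("Ala", LOW))  # 1 4
print(count_first_position("Ala", HIGH + LOW))  # 4
```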

The $S_{ij}$ values for the all-beta structural class are provided in Fig. 1b, showing that 11 dipeptides have extremely high $S_{ij}$ values (Asp-Cys, His-His, Trp-His, Met-Met, Trp-Asn, Cys-Pro, Cys-Arg, His-Pro, Cys-Thr, Cys-Trp, Tyr-Tyr), while 16 dipeptides have extremely low $S_{ij}$ values (Cys-Ala, Met-Cys, Pro-Cys, Cys-Glu, Thr-Glu, Trp-Phe, Cys-Gly, Phe-His, Lys-His, Phe-Ile, Met-Ile, His-Met, Gln-Asn, Trp-Pro, Cys-Val, Met-Trp). These data demonstrate that Cys is relatively selective in its sequential association; 11 Cys-containing dipeptides are characterized by extremely high or low $S_{ij}$ values. Leu and Ser appear to be virtually neutral, while others have a moderate tendency to be selective. As for Ala in the all-alpha structures, Cys is more selective when located in the ith position of a dipeptide in the all-beta structural class.

Analyzing the data for the alpha + beta structural class (Fig. 1c) shows that there are 11 dipeptides with extremely high $S_{ij}$ values (Ala-Cys, Phe-Cys, Gly-Cys, Ser-Cys, Trp-Cys, Asp-Phe, Cys-His, His-His, Trp-His, Thr-Pro and Gln-Gln) and 31 dipeptides with extremely low $S_{ij}$ values (Cys-Glu, Cys-Leu, Asp-Ser, Glu-Cys, Glu-Met, Glu-Gln, Phe-Met, Gly-Pro, His-Cys, His-Glu, His-Lys, His-Asn, Ile-Trp, Lys-Cys, Lys-Trp, Leu-Phe, Leu-Trp, Met-Cys, Met-Phe, Met-Gln, Met-Val, Pro-Lys, Gln-Cys, Gln-Pro, Arg-Cys, Val-Cys, Val-Gln, Trp-Phe, Trp-Gly, Trp-Met and Tyr-Trp). In this structural class, Cys is the most highly selective residue; in total, nine and six Cys-containing dipeptides have extremely low and high $S_{ij}$ values, respectively. We also found that five, four and seven dipeptides contain Glu, Lys and Met, respectively, with extremely low $S_{ij}$ values. However, these residues have no significant $S_{ij}$ values for association. So, the selectivity of these amino acids is toward avoidance of pairing with other amino acids in the alpha + beta structural class.

In Fig. 1d, the $S_{ij}$ values for the alpha/beta structural class are shown. According to these data, there are 14 dipeptides with high $S_{ij}$ values (His-Cys, Trp-Cys, His-Phe, Asn-Phe, His-His, His-Met, Met-Met, His-Pro, Trp-Gln, Asn-Trp, Ser-Trp, Thr-Trp, Trp-Trp and Tyr-Trp) and 30 dipeptides with low $S_{ij}$ values (His-Trp, Ile-Trp, Leu-Trp, Met-Trp, Gln-Trp, Val-Trp, Cys-Ala, His-Ala, Glu-Cys, Met-Cys, Val-Cys, His-Asp, His-Glu, Lys-Phe, Met-Phe, Trp-Gly, Asp-His, Met-His, Met-Ile, Tyr-Ile, Cys-Ile, His-Lys, Trp-Lys, Cys-Met, Ile-Met, Tyr-Met, Glu-Pro, Gln-Arg, Cys-Thr and Gln-Thr), revealing the selectivity of Trp, His and Met. Indeed, 15 Trp-containing, 13 His-containing and 11 Met-containing dipeptides have extremely high or low $S_{ij}$ values. These data also show that His prefers to locate in the ith position, while the preference of Trp is for the jth position.

Interestingly, in the total data set (Fig. 1e), only five dipeptides (Cys-Cys, Cys-Trp, His-His, His-Trp and His-Pro) have extremely high $S_{ij}$ values, while six (Cys-Ala, Glu-Cys, Trp-Gly, His-Lys, Ile-Trp and Met-Trp) have extremely low $S_{ij}$ values.

The mean values of $S_{ij}$ for homo-dipeptides in the all-alpha, all-beta, alpha + beta and alpha/beta classes and the total data set were 1.29, 1.08, 1.04, 1.22 and 1.21, respectively. This finding indicates that homo-dipeptides have a nearly random distribution. Nevertheless, some of them, including His-His, Pro-Pro, Gln-Gln, Met-Met and Cys-Cys, show some degree of frequency significance, which is in good agreement with the available data (Xia and Xie 2002). In contrast to our results, Xia and Xie reported that asymmetry between dipeptides is not significant, that is, the frequency of ij is nearly equal to that of ji, whereas, as can be seen in Fig. 1, nearly all dipeptides show asymmetry in their amino acid positions.
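
Both quantities discussed here, the mean over homo-dipeptides and the ij versus ji asymmetry, are simple summaries of the normalized frequency matrix. The sketch below is an illustration under stated assumptions: the ratio used as an asymmetry measure is a hypothetical choice (the text does not define one), and the toy values are invented.

```python
def homo_dipeptide_mean(s):
    """Mean of the diagonal entries (i == j) of a matrix stored as {(i, j): value}."""
    diagonal = [value for (i, j), value in s.items() if i == j]
    return sum(diagonal) / len(diagonal)


def asymmetry_ratios(s):
    """Ratio S_ij / S_ji for each unordered pair; 1.0 would mean a symmetric pair."""
    return {(i, j): s[(i, j)] / s[(j, i)] for (i, j) in s if i < j}


# Hypothetical values for a two-residue alphabet, for illustration only.
toy = {("A", "A"): 1.2, ("A", "C"): 0.5, ("C", "A"): 1.5, ("C", "C"): 1.0}
print(round(homo_dipeptide_mean(toy), 2))
print({pair: round(ratio, 2) for pair, ratio in asymmetry_ratios(toy).items()})
```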

Comparing our results with other work, particularly that of Vonderviszt et al. (1986), shows that cysteine is a specific amino acid in its selectivity for pairing with other amino acids. However, we observed similar behavior for other residues in the context of protein structural classes. Since cysteine is observed as a special residue in association or avoidance propensity, it appears that this observation may be related to its oxidation state in the structures of proteins, which needs a separate detailed structural study.

The above-mentioned results indicate that some amino acids have $S_{ij}$ values representing their occurrence far from randomness and that they are sensitive to pairing with or avoidance of other amino acids. We also show that positioning in the first or second position of a dipeptide may act as a determinant structural factor for the selection of a given amino acid. The preferences of residues for the first or second positions will be further discussed below. It was also found that dipeptides with association or avoidance far from random distribution are not the same for different structural classes. This indicates that dipeptide selectivity is determined mainly by structural factors rather than simply primary sequence. Since the differences of these structural classes originate from their secondary structures, it may be concluded that the effective parameters for different secondary structures play critical roles in this selectivity.

Other factors in our data are related to the difference in the number of avoided and associated dipeptides. While in the total data set this difference is not significant, using structural class as input data, the number of avoided dipeptides increases compared with associated ones. This fact demonstrates that the unique identity of a single residue is reflected in its pairing characteristics.

For a better understanding of the first step in our analysis, we extended the study by calculating the correlation coefficient between each row- and column-wise pair of the $S_{ij}$ matrices, as provided in Fig. 2 and Tables 7–11 in the supplementary data. The point of this analysis was to determine the similarity of the different residues localizing in the first or second positions of a given dipeptide. The row- and column-wise correlation coefficients were used to determine how similar the different residues in the first and second positions of a dipeptide were, respectively. Colored font values in the corresponding tables indicate significant low or high correlation between two amino acids that might be substituted for each other, meaning significant dissimilarity or similarity between two given amino acids. Note that the data in Fig. 2 should be analyzed by considering the values above and below the diagonal line. In this analysis, the values above the diagonal line refer to the ith position and those in the lower part to the jth position of a dipeptide.

Fig. 2 Graphical representation of the correlation coefficients of dipeptide fragments for the all-alpha (a), all-beta (b), alpha + beta (c) and alpha/beta (d) classes and our total data set (e). Data are analyzed by considering the values in the upper and lower parts of the diagonal line. The values in the upper part refer to the ith position and those in the lower part to the jth position of a dipeptide. To make differences between specific cells easier to find, the numerical values are also provided in supplementary material 2.

The upper part of the data in Fig. 2a for the all-alpha structural class shows that the Leu & Asp, Met & Asp, Trp & Asp and Ser & Leu residue pairs have significantly low correlation coefficients, which means that substitution of these amino acids for each other in the ith position of a dipeptide is avoided. On the other hand, Arg & Cys has a significantly high correlation coefficient, meaning a similarity between these two amino acids for localizing in the ith position.

Examining the values of the column-wise correlation coefficient (below the diagonal line in Fig. 2a) shows that Ile & Ala, Leu & Ser and Ala & Tyr have significantly low correlation coefficients, indicating dissimilarity within each of these pairs for positioning in the jth position, while Pro & Ala, Phe & Val and Met & Val have significantly high correlation coefficients, meaning similarity within each of these pairs for localizing in the jth position of a dipeptide.

A similar procedure was also used for analyzing the other structural classes. Figure 2b contains the correlation coefficient values for the all-beta structural class and shows significant dissimilarity for Ile & Asp, Leu & Asp, Pro & Asp, Trp & Glu, Asn & Phe, Arg & Gly, Ser & Gly, Gln & Leu and Trp & Val. Furthermore, significant similarity for Lys & Glu, Met & Glu, Asn & His, Leu & Ile, Trp & Pro, Thr & Ser and Trp & Ser was observed for localizing in the ith position. A significant dissimilarity for Glu & Cys, Gln & Gly and Thr & Leu together with significant similarity for the Lys & Glu pair in the jth position was also observed.

For the alpha + beta structural class (Fig. 2c), dissimilarity was observed for the Cys & Ala, Phe & Asp and Met & Asp pairs in the ith position and for Asn & Leu, Arg & Ala, Ser & Gln, Ser & Lys and Ser & Met in the jth position. Likewise, similarity for Asn & Cys and Trp & Phe can be seen in the jth position.

In Fig. 2d the correlation coefficient values are provided for the alpha/beta structural class, showing significant dissimilarity for Gly & Ala, Val & Cys, Met & Asp, Phe & Glu, Thr & Gln, Ser & Phe and Val & Glu and significant similarity for Glu & Leu to localize in the ith position. On the other hand, these data show a significant dissimilarity for Arg & Leu, Ser & Met, Thr & Leu, Val & Ser together with Trp & Val and significant similarity for Trp & Ser for positioning at the jth position.

Previous reports emphasized that some residues in helices (known as helix formers) tend to be similar and can be substituted with each other (Xia and Xie 2002) but this insight is not confirmed by our results.

Although we examined the frequency of dipeptides in different structural classes of proteins, each structural class has a different content of secondary structural elements, and more studies, including determining the similarity index for any dipeptide in the context of every secondary structure, are needed. Generally, for both the ith and jth positions, the number of dissimilar amino acids is significantly greater than that of similar ones. As mentioned above, this fact may originate from the unique properties of amino acids, which lead to more sensitivity in selecting their neighbors.

Unexpectedly, it can be seen that a number of different amino acids have a similar behavior in localizing at the same position of a dipeptide, and they can be substituted with each other. As we know, amino acids are classified based on physico-chemical properties such as hydrophobicity, polarity, size and so on. Our data indicate that upon pairing of amino acids, the characteristics of the individual amino acids matter less than those of the pair such that pairs with quite different physico-chemical properties can confer similar features on equivalent positions in protein structures.

It thus seems that the role of each residue in the context of the secondary structure is not the same when considered alone and in pairs. With respect to the local steric interactions in dipeptides based on their side-chain dihedral angle distributions (Jacobson et al. 2002), their $S_{ij}$ values could be studied further to determine how they correlate with the allowed conformations of the dipeptides using, e.g., hard sphere models (Zhou et al. 2012, 2014).

The significance of this work lies in its analysis of the frequency of dipeptides in every structural class of proteins. We find that determining the tendency of different dipeptides to be found in different defined secondary structural elements could help researchers gain an improved understanding of the information stored in protein sequences.