Introduction

Some positions within a protein’s amino acid sequence are critical to the function of the protein and cannot be replaced without significant deleterious consequences (Suckow et al. 1996; Smith and Raines 2006). Conservation analysis has been shown to be a reliable method for identifying such functionally important amino acids. Residues involved in ligand binding (Magliery and Regan 2005; Liang et al. 2006), protein–protein interactions (Caffrey et al. 2004; Guharoy and Chakrabarti 2005; Mintseris and Weng 2005), and structure stabilization (Karlin and Brocchieri 1996) have all been associated with conserved positions in homologous protein sequences. Non-synonymous substitutions occur most often in areas of less functional and structural importance. Approximately 44% of the >300 amino acids of the lac repressor gene in prokaryotes were found to be tolerant to substitution such that non-synonymous substitutions in these positions did not result in a measurable change in the activity of the resulting peptide (Suckow et al. 1996). A similar study of ribonucleases found that less than 5% of 124 positions were completely intolerant of substitutions while approximately 80% were tolerant to substitution (Smith and Raines 2006).

The energy cost of producing an individual amino acid ranges from 9.5 (glutamate) to 75.5 (tryptophan) high-energy phosphate bonds (~P) (Craig and Weber 1998; Wagner 2005). The majority of amino acid cost studies have examined single cell organisms. Escherichia coli and Bacillus subtilis have both been shown to exhibit metabolic efficiency in that they preferentially utilize amino acids with low average biosynthetic costs in genes with high levels of expression (Akashi and Gojobori 2002). No single subset of amino acids or functional category of proteins was found to be responsible for this trend (Akashi and Gojobori 2002; Heizer et al. 2006). A subsequent study that included four additional prokaryotes (two parasitic organisms, one thermophilic organism, and a photoautotroph) similarly found a negative correlation between expressivity and average amino acid biosynthetic cost (Heizer et al. 2006). Indications of metabolic efficiency have even been found in a unicellular eukaryote (Saccharomyces cerevisiae) despite the difficulties in identifying the consequences of weak selective forces in organisms with relatively small effective population sizes (Raiford et al. 2008).

Functional constraint could make amino acid biosynthetic cost a secondary consideration in the evolution of a protein’s amino acid sequence. However, functionally constrained positions within proteins appear to be much less common than those where alternative amino acids can be tolerated (Suckow et al. 1996). If there is a selective advantage to be gained by utilization of less costly amino acids, then one might expect the effect to be more readily observed at positions that are under the least functional and structural constraint. To assess this relationship, amino acid sequence information from 9,119 homologous proteins in human, dog, mouse, and cow were evaluated to determine the extent to which amino acid biosynthetic cost is correlated with evolutionary conservation. The presence or absence of CpG islands (Aerts et al. 2004) upstream of the coding genes was also used as an indicator of broad/high expressivity. Proteins that are broadly expressed and tissue/condition-specific were analyzed separately to determine the extent to which there was a difference in average amino acid biosynthetic cost of these classes.

Materials and Methods

Homolog Acquisition

Homologous protein-coding loci (10,954) common to human (Homo sapiens), mouse (Mus musculus), dog (Canis lupus familiaris), and cow (Bos taurus) were obtained using homologene (NCBI 2009a). The nucleotide (CDS) and protein sequences associated with these genes were then downloaded from NCBI. If a homologous sequence was shorter than 100 amino acids in length in any of the four organisms, then the locus was removed (206 out of 10,954) to minimize length biases and possible sampling effects (Eyre-Walker 1996).

Sequence Acquisition and Alignment

Multiple alignments of the proteins were created utilizing COBALT (NCBI 2009b). Any alignments, where at least one protein was composed of more than 30% gaps, were removed from the study (1,611 out of 10,748). In addition, any proteins that could not be aligned using the default settings of COBALT were discarded (18 out of 9,137) leaving 9,119 proteins of the original 10,954 for analysis. Locus names, weighted amino acid biosynthetic costs, conservation, and the number of CpG dinucleotides in an 800 bp window centered on the transcriptional start site of human genes can be found at http://birg.cs.wright.edu/heizer11.

Individual amino acid biosynthetic costs used in this article were taken from those calculated by Wagner (2005). The amino acid biosynthetic costs calculated in this article are based on amino acid biosynthesis in E. coli. Similar analyses of other prokaryotes (Akashi and Gojobori 2002; Heizer et al. 2006) and yeast (Raiford et al. 2008) suggest that the near universality of the amino acid biosynthetic pathways makes these costs suitable for a wide variety of organisms. Even auxotrophic amino acids tend to be utilized in a fashion consistent with their synthetic cost (Doug Raiford, personal communication). For each position in the alignment, a weighted amino acid cost (weighted cost) was calculated. This was done by dividing the number of times an amino acid was used by four (the total number of amino acids in each position). This weight was then multiplied by the cost of the amino acid. This results in the biosynthetic cost of an amino acid that is used more frequently in a position having a larger effect on the calculated weighted cost of that position. A geometric mean of the weights (with the root being the number of different amino acids in a position) was then calculated to determine average weighted cost in single positions. Usage of a geometric mean generally reduces the effect of outliers on the calculated average weighted costs. If a position has three low-cost amino acids and a single high-cost amino acid, then usage of a simple mean would result in the high-cost amino acid having a larger effect. As the biosynthetic “cost” of a gap is difficult to assign, positions in alignments with one or more gaps were disregarded.

Positions in the alignment were placed into one of five classes based on the number of different amino acids present in that position in the four-deep alignment. Class I positions are those where all the four amino acids are the same. Class IIa positions have two different amino acids, one of which is used three times while the other is used once. Class IIb positions also have two different amino acids, each of which is used two times. Class III and Class IV positions have three and four different amino acids, respectively. The number of times each of the 20 amino acids was utilized in each specific class was also tallied.

The number of substitutions that result in Class IIa, IIb, and III overlaps from a parsimony perspective when examined using an unrooted tree that groups human with mouse and cow with dog (Miller et al. 2007; Prasad et al. 2008). Class IIa can only result from a single substitution while Class III can result only from two substitutions. Class IIb positions were resolved from a parsimony perspective into those resulting from either one or two substitutions in both the original and simulated datasets.

Proteins were rank ordered according to degree of conservation (percent of total amino acid positions falling into Class I). Class I is perfectly conserved; thus, determining the percent of Class I positions gives an estimate of the overall conservation for a protein. The deciles with the highest and the lowest percents of amino acids in Class I were analyzed to determine the average weighted cost in single positions in each class. The first and last deciles were chosen to compare the most conserved proteins to the least conserved. Average weighted cost of the amino acids (average weighted amino acid cost) for each of the 9,119 proteins were calculated by summing the weighted costs for each position within the protein’s four-deep alignment and dividing by the number of positions in the protein’s alignment in which there were no gaps. Proteins were rank ordered by average amino acid biosynthetic cost and then placed in 20 bins each containing 5% of the total number of proteins for visualization purposes. All the analyses were performed on unbinned data.

Nucleotide alignments for determining synonymous and non-synonymous substitution rates were generated using the protein alignment as a guide with a custom script. SNAP (Nei and Gojobori 1986) was utilized to determine the substitution rate associated with those alignments.

CpG Content and Protein Biosynthetic Cost

Regions flanking the transcriptional start site (±400 nucleotides) of 4,643 of the 9,119 human genes being evaluated were also available on BioMart (Ensembl 2010). The number of CpG dinucleotides within these 800 nucleotides was counted, and the average amino acid biosynthetic cost (including the first methionine but not the stop codons) for each of these 4,643 human proteins was determined. Average amino acid biosynthetic cost for these human proteins was calculated by summing the cost for each of the amino acids in the protein, as calculated by Wagner (2005), and dividing by the length of the protein. GC3 content was also determined by examining the third codon position of the 9,119 human genes.

Statistical analyses

For each of the five conservation classes, a Spearman rank correlation (Spearman 1904) was calculated between the amino acid usage frequency and the calculated amino acid biosynthetic cost (Wagner 2005). Spearman rank correlations were also calculated between the amino acid usage frequency and PAM100 diagonal values (i.e., relative mutability). PAM100 was chosen because of the high average similarity of the protein sequences due to their sharing a common ancestor at the time of the mammalian radiation between 100 and 80 million years ago (Hallstrom and Janke 2010). Average weighted costs in single positions were determined for all the five classes of positions within the four-deep alignments. Significant differences in the average weighted costs in single positions among these classes were identified using (two-sample, unpaired, unequal variance) T-tests. The Spearman rank correlation coefficient was also calculated between average amino acid biosynthetic cost for a protein and CpG content; GC3 content; ds; dn; and ds/dn of the gene encoding that protein. The threshold of significance for all tests was α = 0.05.

In order to ensure that observed changes in average weighted cost in single positions were due to functional constraint rather than sampling artifacts, a bootstrap method was devised to generate the baseline cost distribution for different degrees of conservation. One thousand human genes from the data set of 9,119 genes were chosen at random. For each of these 1,000 genes, three mutants were created where 20% of the amino acids had been substituted using EMBOSS 3 (Rice et al. 2000). The original protein and the altered versions were then aligned and the correlation analyses between the conservation class and the weighted cost were performed upon this test set of proteins. Differences in average weighted costs in single positions between the artificial data set and the original data were evaluated using (two-sample, unpaired, unequal variance) T-tests.

Results

Amino Acid Usage and Cost in Different Conservation Classes

Amino acids with the highest biosynthetic costs were least commonly used regardless of how well conserved a position was within the four-deep alignment [i.e., statistically significant negative correlations, P < 0.05, between amino acid cost (Wagner 2005) and amino acid usage were found in all classes] (Table 1). Amino acid usage and PAM100 diagonal values were also significantly negatively correlated for each of the five classes. The weakest correlation between amino acid usage and cost (−0.42) was seen in Class IIa, with correlations in the other classes ranging from −0.50 (Class IIb) to −0.55 (Class I and Class III). Similarly, the weakest correlation between amino acid usage and PAM100 (−0.67) was seen in Class I. Correlations in the other classes ranged from −0.71 (Class IIa) to −0.73 (Class III).

Table 1 Amino acid usage in positions based on variability

Comparison of the invariable to the variable positions shows that variable positions utilize less biosynthetically expensive amino acids than would have occurred by chance alone while invariable positions utilize more costly amino acids. A positive (r s = 0.38; P < 0.05) correlation is seen between the average weighted cost across all positions within a single protein and the degree of conservation (Fig. 1).

Fig. 1
figure 1

Fraction of positions within a protein falling into Class I (perfectly conserved across all the four mammalian orders) versus the average weighted cost across all the positions within a single protein (average of the weighted costs of the individual amino acids across an entire protein). Each data point represents the average for a bin [ranked by average weighted cost across all the positions within a single protein (~P)] containing 1/20 of the 9,119 proteins analyzed in this study. Error bars represent the standard error of the means

Class IV had an average weighted cost in single positions of 5.36 high-energy phosphate bonds for the amino acids used in this group, whereas Class I had an average weighted cost in single positions of 25.93 ~P. However, it is important to note that distributions of possible average weighted cost in single positions among the five conservation classes are not equivalent. The weighted cost for Class I can range between 9.50 (if all positions were glutamate) and 75.50 high-energy phosphate bonds (if all positions were tryptophan). Class IV positions have a more limited range between 3.01 (glutamine, glutamate, alanine, and glycine) and 14.17 ~P (tryptophan, phenylalanine, tyrosine, and isoleucine). In order to determine the extent to which this sampling constraint affects weighted cost measurements, 1,000 proteins were selected at random to establish a null distribution. Each of these proteins was mutated three times, resulting in a four-deep alignment in which each mutant differed from the original in approximately 20% of their positions. The resulting 1,000 four-deep alignments were found to yield an average weighted cost in single positions for Class I of 25.54 ~P (Table 2). The average weighted cost in single positions for Class I in the original data (25.93 ~P) is significantly higher (P < 0.05) than the value obtained for the simulated data set. In each of the other classes significantly lower (P < 0.05) average weighted costs in single positions are observed in the actual data than in the simulated data set. The average number of positions in each class was similar for the original data (Class I: 454.6, IIa: 11.1, IIb: 71.1, III: 21, and IV: 3) and the synthetic data (Class I: 386.5, IIa: 10.8, IIb: 161.3, III: 28.3, and IV: 1.8).

Table 2 Average amino acid biosynthetic cost in each class

In order to determine whether the observed trends in amino acid usage were the same regardless of overall gene conservation, the genes were separated into deciles based upon the fraction of their amino acids observed in Class I positions. The average weighted cost in single positions for each of the five conservation classes was determined for the decile containing the most conserved genes (average fraction of Class I positions of proteins in the top decile 97.29 ± 1.46%) and the decile comprising the least conserved genes (average fraction of Class I positions of proteins in the last decile 42.37 ± 8.46%). The results seen are very similar to those found when all the genes are considered (Table 2).

Protein Cost and Substitution Rate

A statistically significant negative correlation (−0.21; P < 0.05) was seen between synonymous substitution rate and non-synonymous substitution rate and average weighted cost across all positions within a single protein (Table 3). A significant positive correlation (0.40; P < 0.05) between average weighted cost across all positions within a single protein and the ratio of synonymous to non-synonymous substitutions was also observed (Table 3).

Table 3 Correlation between cost and substitution measures and GC3 content

Division of Class IIa, IIb, and III into categories based on the number of substitutions from a parsimony perspective where human and mouse were grouped and dog and cow were grouped on an unrooted tree resulted in an average weighted cost of 14.95 ~P for sites requiring a minimum of a single mutation (n = 700,317; σ = 6.95) and 10.56 ~P for sites requiring a minimum of two mutations (n = 237,928; σ = 5.06). These values are significantly (P < 0.05) less than the values found for the simulated data set [15.76 ~P for sites requiring a minimum of a single mutation (n = 164,864; σ = 7.11) and 11.16 ~P for sites requiring a minimum of two mutations (n = 35,460; σ = 5.06)].

Human CpG Content

The regions flanking the human transcriptional start site (TSS) (for the 4,643 human genes where 400 bp of the upstream and downstream sequence was available) were found to have a bi-modal distribution in terms of the number of CpG dinucleotides they contain. This distribution effectively allows the division of human genes into two classes: genes with high CpG content (presumably those that are more highly/broadly expressed) and genes with low CpG content (presumably those that are less highly expressed) (Aerts et al. 2004). Genes with an intermediate number of CpGs in their upstream sequence (genes having between 21 and 36 CpGs) were removed from consideration. The remaining 3,359 human genes with greater than 36 CpGs within the 800 bp around the TSS were considered to be associated with CpG islands, and the 910 genes with less than 21 CpGs were considered not to be proximal to a CpG island. The average amino acid biosynthetic cost of the proteins for the population of CpG island-associated human genes (25.2 high-energy phosphate bonds) was found to be significantly different from the average cost of those that were not CpG island-associated (25.8 high-energy phosphate bonds; P = 5.7E−18).

The 4,643 human genes were rank-ordered by ascending average amino acid biosynthetic cost (Table 4). In the first quintile (containing the human genes with the lowest average amino acid biosynthetic costs) the majority of the genes (82.90%) were found to be associated with CpG islands whereas in the fifth quintile only 66.90% of the genes were associated with CpG islands. A weak but significant correlation of −0.14 (P < 0.05) was found between number of CpGs and average amino acid biosynthetic cost.

Table 4 Genes in each quintile when ranked by ascending average amino acid biosynthetic cost

Discussion

The results of selection favoring the utilization of amino acids with comparatively low biosynthetic costs has been found in all domains of life (Swire 2007). Usage of biosynthetically inexpensive amino acids in proteins allows organisms to reduce energy expenditure, possibly resulting in greater fitness, particularly during times when energy is not readily available. This trend is demonstrated by the general avoidance of amino acids with relatively high biosynthetic costs within homologous proteins in all the four mammalian orders included in this study regardless of how variable or well conserved a position was within a four-way alignment (Table 1).

Despite its universality, metabolic efficiency is a relatively weak selective force that is easily obscured by other influences, such as functional constraint (Raiford et al. 2008), which can vary substantially across the length of a protein (Pond and Frost 2005). Negative correlations between amino acid usage and amino acid biosynthetic cost in each class are consistent with a tendency for mammals to use preferentially biosynthetically inexpensive amino acids (Table 2).

Approximately 81% of the amino acids examined in this study are invariant in the four mammalian orders for which sequence information is available. Exhaustive point mutagenesis studies (Suckow et al. 1996; Smith and Raines 2006) suggest that it is unlikely that such a large fraction of the amino acid positions are critically important to the function of the protein. Rather, a substantial majority of the positions in Class I simply have not changed since mammals last shared a common ancestor approximately 100–80 million years ago (Hallstrom and Janke 2010). A simulated set of 1,000 proteins gives insights into the average weighted cost in single positions of amino acid biosynthesis in each class that would occur by chance alone (e.g., absent functional constraint). Table 2 shows that the average weighted costs in single positions at Class IIa, IIb, III, and IV positions within these mammalian proteins are significantly less than that would have occurred by chance alone. While amino acid utilization in Class I positions is inversely related to weighted cost (r s = −0.55; P < 0.05), this class is the only one where the observed average weighted cost in single positions (25.93 ~P) is significantly greater than the expected average weighted cost in single positions (25.54 ~P) in a fashion that suggests that functional constraint overrides cost considerations for these positions. When the positions in the alignment are divided into groups based on the minimum numbers of substitutions necessary to result in the observed amino acids, a similar significantly lower average amino acid biosynthetic cost is seen at sites where a minimum of two mutations are required relative to those where a minimum of one mutation is required.

The use of weighted cost helps us mitigate the effect of single (or a few) high-cost amino acids in the alignments. This can be seen as follows: consider an alignment position with the four amino acids (three glutamates and one tryptophan) costing 9.50 and 75.50 high-energy phosphate bonds, respectively. The arithmetic mean of the costs is 26.00, while the unweighted geometric mean is 26.78. The weighted geometric mean, however, is 11.60. The use of the weighted mean helps us reduce the effect of the outlier (in this case, tryptophan) upon the calculated mean cost. The use of weighted cost has no effect on invariant positions. In an invariant position, the weight given will be 1 and the first root will be taken resulting in a weighted cost equal to the cost calculated according to Wagner (2005).

The analysis of the first and last decile demonstrates a higher average weighted cost in single positions in each class in the last decile than the corresponding class in the first decile (Table 2). However, proteins in the first decile are more expensive to produce overall because of their containing nearly 50% more Class I positions.

While a trend toward the usage of less biosynthetically expensive amino acids is seen in each class, a positive correlation is seen between the average weighted cost across all positions within a single protein and the degree of conservation. The rate of synonymous and non-synonymous substitutions was found to correlate negatively with average weighted cost across all positions within a single protein (Table 3). The magnitude of the correlation was greater for non-synonymous mutations (−0.43) than synonymous mutations (−0.21). As proteins that have a high average weighted cost across all positions within a single protein tend to also be highly conserved, it is expected that they would also have fewer mutations, and that these mutations would be mostly synonymous. Amino acid usage frequency within proteins is also significantly correlated with PAM100 diagonal values for each of the 20 amino acids (Table 1). PAM matrices describe how frequently an amino acid is replaced by other amino acids (Dayhoff 1978). Tryptophan, with a PAM100 diagonal value (1—relative mutability) of 12, is much less likely than alanine, with a PAM100 diagonal value of 4, to be replaced with a different amino acid in a homologous protein. As a result, the negative correlation between amino acid usage and PAM diagonal values is to be expected since amino acids with high PAM diagonal values are less likely to be found in a variable position within the four-deep alignments used in this study. The fact that PAM diagonal values and amino acid biosynthetic costs are highly correlated suggests that metabolic efficiency, as well as functional constraint, may be an important factor contributing to the values seen in PAM matrices.

Studies in metabolic efficiency suggest that pressure to reduce average amino acid biosynthetic cost would be the greatest in genes that have the highest levels of expression (Heizer et al. 2006; Raiford et al. 2008). Human genes appear to have two overlapping populations with respect to the presence of CpG islands near their transcriptional start sites (Fig. 2). The average amino acid biosynthetic cost of the 3,360 human genes that are associated with CpG islands (25.2 ~P; SD = 1.80) is subtly but significantly (P = 5.7E−18) lower than that for the 910 genes without CpG islands (25.8 ~P; SD = 1.92). Quintile analysis reveals a shift from CpG-containing proteins to non-CpG containing ones as average amino acid biosynthetic cost increases (Table 4). Previous studies have shown that CpG islands tend to be found in the GC-richest regions of mammalian genomes (Varriale and Bernardi 2010) and that GC-rich regions also tend to have high GC3-contents and high concentrations of housekeeping genes (Lercher et al. 2003; Lawson and Zhang 2008), and that GC-rich codons tend to code for amino acids with lower biosynthetic costs than GC-poor codons (Seligmann 2003). This analysis suggests that localization of the highly expressed genes within GC-rich compartments of genomes results in a subtle but statistically significant reduction in the energy devoted to amino acid biosynthesis.

Fig. 2
figure 2

Distribution of genes containing a specific number of CpG dinucleotides in their upstream sequences. Number of CpG dinucleotides is the number of CG dinucleotides in a window of 800 bp centered on the transcriptional start site. Number of human genes is the number of human genes with the specific number of CpG dinucleotides

Many forces shape a genome. Contrary to reasonable intuitive expectations, even the relatively weak selective force of metabolic efficiency has resulted in significant changes in the amino acid composition of complex multicellular organisms such as the four mammals in this study. Knowledge of cost selection and where it acts may improve our ability to discover functionally important amino acids. Amino acids with relatively high amino acid biosynthetic costs are more likely to be functionally important and, as a result, to be evolutionarily conserved. Protein alignment algorithms might benefit by placing a premium on aligning the relatively rare amino acids that also have high biosynthetic costs.