Introduction

In silico methods capable of indicating the metabolism and lifestyle of an organism and explaining its evolution in the wild would be invaluable tools, especially for organisms for which these characteristics are still unknown. High expression rates of certain genes during exponential growth, or of enzymes involved in fundamental metabolic activities like glycolysis, have often been reported in the literature for fast growers and have been studied, since the 1980s, through the notion of the Codon Adaptation Index (CAI) (Sharp and Li 1987). For fast-growing organisms like Escherichia coli and Saccharomyces cerevisiae, but also Caenorhabditis elegans and Drosophila melanogaster, genes have been ranked by CAI values and correlated with gene expression levels and functional activity (Sharp and Li 1987; Sharp et al. 1986, 1988; Shields and Sharp 1987; Stenico et al. 1994). Carbone et al. (2003) showed that the CAI can be defined purely by sequence analysis, that it can be calculated for genomes of organisms whose biological activities are not known, and that the resulting ranking of genes correlates well with the dominant codon bias of the organism. This bias need not be related to translational efficiency, as it is for fast growers, but can be of any other nature (such as compositional, as for AT- and GC-skew bias, or physical, e.g., strand bias). Exploiting the fact that CAI is a universal measure for analyzing dominant codon bias of any origin in organisms which are not necessarily translationally biased, one can numerically evaluate the strength of different codon biases for an organism and classify genes according to dominant biases (Carbone et al. 2004).
For translationally biased organisms, where gene ranking based on CAI has an immediate biological interpretation, one can shift the paradigm of the 1980s from individual genes to groups of enzymes and their metabolic activity, and approach in a systematic way the classification of metabolic pathways, distinguishing those which are essential for the life cycle of the organism from those which are rarely activated but employed to survive under specific environmental conditions. Namely, through the new notion of the Relative Pathway Index, we rank metabolic pathways and correlate them with the importance of their metabolic function for the life of the organism through evolution. The analysis characterizes well both the metabolic pathways that are constitutive in the everyday life of the organism and those that need to be activated rapidly under specific conditions. It also characterizes the strong evolutionary pressures which favored a given metabolism of an organism in the wild.

Several energy metabolism pathways turn out to be constituted by enzymes with high codon bias in translationally biased organisms known to be driven by very different physiologies: for instance, the highly biased codon composition of glycolytic enzymes is strikingly evident in fast-growing aerobic bacteria and in S. cerevisiae, and the key role of photosynthetic pathways for Synechocystis (Mrázek et al. 2001) or of methane metabolism for Methanosarcina acetivorans can be identified by the high codon bias of their enzymes. Important metabolic pathways can be detected for organisms which are not necessarily fast growing and whose genomes might have a homogeneous codon composition even though the signal is weaker. For instance, a highly biased codon composition of glycolytic enzymes is present in Mycobacterium tuberculosis H37Rv (mtbRv) whose genome displays only a weak form of translation bias (Carbone et al. 2004), and more surprisingly in Helicobacter pylori (hpy), a genome of rather homogeneous codon composition (Lafay et al. 2000), where glycolytic enzymes are biased above average. This finding suggests that translational selection affected hpy.

In general, given an organism or groups of organisms whose genomes have been affected by some (weak or strong) form of translational selection and have a similar lifestyle, we show that genetic coding is tuned, favoring specific pathways and disfavoring others. These coding features hold across organisms and detecting them allows us to predict or confirm the lifestyle of an organism. This is done with the help of the new concept of the Comparative Pathway Index, which enables organism comparisons based on their metabolic activities. Several examples, supported by experimental evidence, have been chosen to illustrate how well the method works in view of its application to subsequent analysis of poorly studied organisms.

To validate our statistical analysis, we consider the metabolic activities of S. cerevisiae reported by transcriptomic data under sufficiently different biological conditions and compare them with the classification of metabolic pathways obtained by sequence analysis. Because changes in gene expression over sufficiently long periods of time and variations of biological conditions might be complex and might involve the integration of many kinds of information on the nutritional and metabolic state of the cell, we consider mRNA abundance collected during the S. cerevisiae cell cycle under diauxic shift (deRisi et al. 1997), in which glucose quantities decrease in the medium during the cell cycle and induce the yeast to move from fermentation to aerobic respiration, and we analyze the yeast metabolic network through these data. A classification of metabolic pathways based on transcriptomic data and on the new concept of the Evolutionary Pathway Index is proposed. The comparison between this metabolic classification and the one obtained through sequence analysis suggests that the coding sequence of enzymes involved in specific metabolic functions contains information on the physiological responses of an organism during evolutionarily favored conditions. Namely, we argue that in S. cerevisiae the biasing of genes involved in the fermentative state corresponds to a strong evolutionary pressure favoring this fermentative metabolic state of yeast in the wild (Wagner 2000). The high transcription of enzymes involved in highly active pathways during fermentation (and possibly not during aerobic respiration) correlates particularly well with the high CAI value of these enzymes. This observation is also supported by the available transcriptomic data on seripauperin proteins (Holstege et al. 1998; James et al. 2003; Viswanathan et al. 1994) and on Hem13 proteins (Amillet et al. 1995).

Our results open a way of exploring evolutionary pressure and natural selection for organisms grown in the wild, understanding the need for different levels of gene regulation in unicellular organisms, and, hopefully, predicting the metabolic activities of translationally biased organisms, as well as suggesting the best conditions for growth in the laboratory. Similar questions have been addressed by Akashi and Gojobori (2002), who report that metabolic efficiency is related to amino acid composition for the genomes of E. coli and B. subtilis.

Materials and Methods

Organisms and Genomes

Metabolic pathways were analyzed for the translationally biased organisms listed in Table 1 plus Pyrococcus abyssi, Synechocystis, Methanobacterium thermoautotrophicum, Methanosarcina acetivorans, Thermosynechococcus elongatus, and Chlorobium tepidum. The metabolic pathways of these last six translationally biased organisms are much less well known than those of the organisms listed in Table 1. The analysis was also done for Helicobacter pylori (hpy) and Mycobacterium tuberculosis H37Rv (mtbRv), the latter genome displaying a weak form of translational bias. All genomes have been completely sequenced. Genomes along with gene annotation were retrieved from the genomes directory of GenBank via FTP. All coding sequences (CDS) were considered, including those annotated as hypothetical and those predicted by computational methods only. An enzyme refers to a protein participating in a metabolic reaction, not a complex.

Table 1 List of organisms considered for metabolic pathway comparison analysis

Fast and Slow Growth

An organism is fast growing if it has a doubling time of at most a couple of hours. Translational bias is usually detected in genomes of fast growers. Weak forms of translational bias can be detected for organisms that are not growing fast, for instance, M. tuberculosis H37Rv, which has a doubling time of 20 h, or M. thermoautotrophicum (Carbone et al. 2004).

Metabolic Networks

For all organisms in Table 1, H. pylori, and M. tuberculosis H37Rv, we used the BioCyc software distribution, which includes the EcoCyc and MetaCyc databases (Karp et al. 2000, 2002), version 7. The pathway/genome databases for B. subtilis, S. cerevisiae, M. tuberculosis, and H. influenzae were created by DoubleTwist Inc. The numbers of pathways and enzymatic reactions in Table 1 come from the BioCyc database. In BioCyc, pathways are organized into several functional classes, used in Figs. 1 and 3. We exclude from our analysis fragmentary pathways with no assigned function in the BioCyc network. The methane metabolism networks for organisms in Table 1 and Methanosarcina acetivorans and the photosynthetic networks (Calvin cycle, photosystems I and II) for Synechocystis are taken from the KEGG database at http://www.genome.ad.jp/kegg/pathway.html.

Figure 1

Metabolic pathways occurring in four or more organisms among those listed in Table 1, Mycobacterium tuberculosis H37Rv (mtbRv), and Helicobacter pylori (hpy). For each organism, the RPI value associated with a pathway, if it exists, is indicated by the corresponding ranking color. Each row corresponds to a pathway denoted by its BioCyc name and functional classification. Pathways are ordered from top to bottom by increasing CPI.

Translational Bias

Translational selection refers to the benefit of an increased translational output for a fixed investment in the translational machinery (ribosomes, tRNA, elongation factors, etc.) if only a subset of codons (and their corresponding tRNAs) is used preferentially. Since the benefit of using a particular codon depends on how often it is translated, the strength of translational selection, and hence the degree of codon bias, is expected to vary with the expression level of a gene within an organism. This pattern has been confirmed for the organisms in Table 1 and others. Mutational bias (e.g., an excess or deficit of G+C content compared to A+T content) might obscure translational selection, which can appear in strong or weak forms (see below).

Codon Bias, Codon Weights, and Codon Adaptation Index

Sharp (Sharp and Li 1987) formulated the hypothesis that for each genome sequence G, there is a set S of coding sequences, constituting roughly 1% of the genes in G, which are representative of the dominating codon bias in G. This bias can be described by listing a set of codon weights calculated on genes in S as follows: given an amino acid j, its synonymous codons might have different frequencies in S; if \( x_{i,j} \) is the number of times that the codon i for the amino acid j occurs in S, then one associates to i a weight \( w_{i,j} \) relative to its sibling of maximal frequency \( y_j \) in S,

$$ w_{i,j} = x_{i,j}/y_j $$
(1)

Such weights describe codon preferences in G, and Sharp used them successfully to correlate gene expression levels with translational codon bias in fast-growing organisms. This is done by computing the CAI (Sharp and Li 1987) for all genes, \( {\rm{CAI}}(g) = \left( {\prod_{k = 1}^{L}} w_k \right)^{1/L} \), where g is a gene, \( w_k \) is the weight of the kth codon in g, L is the number of codons in g, and S consists of genes coding for proteins known to be highly expressed, such as ribosomal and glycolytic proteins in fast growers. Genes are then ranked by CAI values. Genes ranking the highest are the most biased and those ranking the lowest are the least affected by selective bias. For fast growers, genes with high CAI values turn out to be the most expressed and translationally biased (Sharp and Li 1987).
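As a minimal illustration of Eq. (1) and of the CAI formula, the following Python sketch computes codon weights from a reference set S and the geometric mean of the weights along a gene. It is not the CAIJava program: the function names, the two-amino-acid codon table, and the treatment of codons absent from the table (skipped) or never observed in S (weight 0, hence CAI 0) are simplifying assumptions made here for brevity.

```python
import math
from collections import Counter

# Toy synonymous-codon table (assumption: two amino acids only, for illustration).
SYNONYMS = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
}

def codons(seq):
    """Split a coding sequence into consecutive triplets, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def codon_weights(reference_genes, synonyms=SYNONYMS):
    """w_ij = x_ij / y_j, with counts x_ij taken over the reference set S (Eq. 1)."""
    counts = Counter(c for gene in reference_genes for c in codons(gene))
    weights = {}
    for family in synonyms.values():
        y = max(counts[c] for c in family)  # count of the most frequent sibling codon
        if y > 0:
            for c in family:
                weights[c] = counts[c] / y
    return weights

def cai(gene, weights):
    """Geometric mean of the weights of the codons of a gene.

    Codons without a weight are skipped (assumption of this sketch).
    """
    ws = [weights[c] for c in codons(gene) if c in weights]
    if not ws or min(ws) == 0:
        return 0.0
    return math.exp(sum(math.log(w) for w in ws) / len(ws))
```

For instance, with the single reference sequence `"AAAAAAAAGGAA"`, the codon AAG receives weight 0.5, so a gene made of one AAA and one AAG has a CAI equal to the square root of 0.5.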

More generally, CAI correlates with any kind of dominating bias in genomes (like GC content, preference for codons with G or C at the third nucleotide position, a leading strand richer in G+T than a lagging strand), not just with translational bias (Carbone et al. 2004). This is shown by observing that the set of most biased genes S (consisting of 1% of the genes in G, where the size of S is suggested by Sharp’s original work) can be automatically computed by a purely statistical analysis of the collection of all genes, which is not based on biological knowledge of the organism. In the case of fast growers, S indeed is found to consist of highly expressed genes. The lack of reliance on biological knowledge allows computation of weights for organisms of unknown lifestyle. Codon weights, reference set S, and CAI values are calculated with the program CAIJava written by the authors, which uses parsers of GenBank flat files from the Biojava (http://www.biojava.org) programming package. A description of the algorithm and a validation of the approach are reported by Carbone et al. (2004). The program CAIJava is available at http://www.ihes.fr/~carbone/data.htm.

Detection of Weak and Strong Forms of Translational Bias

In Carbone et al. (2004), two numerical criteria were introduced to detect translational bias. The ribosomal criterion defines the z-score \( \left( {\rm CAI}(r) - \mu \right)/ \sigma \) for each gene of a ribosomal protein r, where the mean μ and standard deviation σ are calculated for the CAI distribution over all CDS; this allows us to define the average \( \bar {\rm z}_{\rm {Rib}} \) of z-scores for ribosomal proteins and to say that an organism characterized by translational bias is expected to have a high \( \bar {\rm z}_{\rm {Rib}} \), i.e., \( \bar {\rm z}_{\rm {Rib}} > 1 \). The strength criterion computes codon weights, as in (1), based on all genes in the genome G (\( w_k\left( G \right) \)) and on the genes in the set S (\( w_k \)) and expects the difference between \( w_k \left( G \right) \) and \( w_k \) to be large for translationally biased genomes, i.e., \( \sum_{k = 1}^{64} \left( {w_k\left( G \right) - w_k} \right) /2 > 8 \). This sum is an indicator of the number of amino acids having different preferred codons in the entire genome and in the set of most biased genes; the threshold 8 has been empirically calculated on known translationally biased organisms (Carbone et al. 2004). The combination of the two criteria distinguishes genomes that are strongly translationally biased, that is, those satisfying both criteria, from those that are weakly so, that is, those that satisfy only the ribosomal criterion. Notice that our numerical criteria provide quantitative values ranging within a continuous interval and that, based on these values, one can identify strong, weak, and absent forms of bias as well as finer classifications.
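The two criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and label strings are hypothetical, the population standard deviation is used (the text does not say which estimator was applied), and the strength sum is taken literally as printed above, with signed differences.

```python
from statistics import mean, pstdev

def ribosomal_z(cai_by_gene, ribosomal_genes):
    """Average z-score of ribosomal protein genes against the CAI distribution of all CDS."""
    values = list(cai_by_gene.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("CAI values are all identical; z-scores are undefined")
    return mean((cai_by_gene[g] - mu) / sigma for g in ribosomal_genes)

def strength(weights_genome, weights_s):
    """Half the sum over codons of w_k(G) - w_k, following the formula as printed."""
    return sum(weights_genome[c] - weights_s[c] for c in weights_s) / 2

def classify(z_rib, s, z_min=1.0, s_min=8.0):
    """Strong bias needs both criteria; weak bias needs only the ribosomal one."""
    if z_rib <= z_min:
        return "none detected"
    return "strong" if s > s_min else "weak"
```

The thresholds 1 and 8 are the ones quoted in the text; they are exposed as parameters because the text notes that the underlying values are continuous and allow finer classifications.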

Codon Bias Signatures

Given a genome sequence, weak and strong tendencies toward content bias, translational bias, and strand bias can be identified (Carbone et al. 2004). The collection of strong codon biases affecting CDS coding is referred to as the codon bias signature of the genome. Signatures for the genomes in Table 1 have been analyzed by Carbone et al. (2004). Codon signatures are indicators of the evolutionary pressure organisms undergo.

Ranking of Pathways in the Metabolic Network of an Organism

For each pathway P of the metabolic network M of an organism, we compute the mean of the CAI values of the enzymes involved in P. If an enzyme is involved in more than one reaction in P, its CAI value is counted with multiplicity. Some enzymes in a pathway might contribute no CAI value because they have not been identified in the genome. This numerical index of pathways is called the Pathway Index (PI), and we denote by PI(P) the mean value for a pathway P. This index allows us to define a ranking of pathways involved in metabolism.

There is a significant difference in PI values between pathways (see Table 1). S. cerevisiae is the organism in Table 1 that displays the most variation of PI values among pathways, with \( \mu_M = 0.31 \) and \( \sigma_M = 0.12 \), where \( \mu_M \) and \( \sigma_M \) are the mean and the standard deviation of the distribution of PI(P) values for P in M, and with minimum and maximum PI values of 0.15 and 0.73, for the methionine biosynthesis from homoserine pathway and the glycolysis pathway, respectively. This large spread among PI values justifies the use of the average CAI of genes involved in a pathway as a biologically relevant measure. In the sequel, we consider the interval \( \left[ \mu_M - \sigma_M,\mu_M + \sigma_M \right] \), and intuitively, we say that PI(P) values \( \ge \mu_M + \sigma_M \) are “high” and those \( \le \mu_M - \sigma_M \) are “low.”
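The Pathway Index and the "high"/"low" labels lend themselves to a short sketch. The data layout below (one gene identifier per reaction, so that an enzyme catalyzing two reactions appears twice) and the function names are assumptions made for illustration; the population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev

def pathway_index(reactions, cai_by_gene):
    """Mean CAI over the reactions of a pathway.

    reactions: one gene identifier per reaction (multiplicity is kept).
    Enzymes not identified in the genome (absent from cai_by_gene) contribute nothing.
    """
    values = [cai_by_gene[g] for g in reactions if g in cai_by_gene]
    return mean(values) if values else None

def label_pathways(pi_by_pathway):
    """Label each pathway 'high', 'low', or 'medium' relative to mu_M +/- sigma_M."""
    values = list(pi_by_pathway.values())
    mu, sigma = mean(values), pstdev(values)
    labels = {}
    for name, pi in pi_by_pathway.items():
        if pi >= mu + sigma:
            labels[name] = "high"
        elif pi <= mu - sigma:
            labels[name] = "low"
        else:
            labels[name] = "medium"
    return labels
```

Returning `None` for a pathway with no identified enzyme keeps such pathways out of the mean and standard deviation, provided the caller filters them before calling `label_pathways`.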

Clustering of Genes Along the Genome and Pathway Index

The pathway index is independent of the localization of genes along the genome. To see this, we consider the enzyme positions along the genomic sequence, and we allow up to three genes coding for proteins that do not belong to the pathway to intercalate between two genes coding for enzymes in the pathway.

We say that two enzymes are clustered if they are located in this pattern along the genomic sequence. The results are displayed in Table 1. Most pathways consist of enzymes which are not clustered in the same genomic region. On average, only 7–8% of enzymatic reactions are involved in “clustered” pathways. In particular, in E. coli this value drops to 3%.
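The clustering rule (at most three non-pathway genes between two pathway genes) can be sketched as follows. The representation of the genome as an ordered list of gene identifiers is an assumption of this sketch, which also ignores strand and the circularity of bacterial chromosomes; `clustered_pairs` is a hypothetical name.

```python
def clustered_pairs(gene_order, pathway_genes, max_gap=3):
    """Return consecutive pathway genes separated by at most max_gap other genes.

    gene_order: gene identifiers in their order along the genome.
    pathway_genes: set of identifiers of genes coding for enzymes of one pathway.
    """
    positions = [i for i, g in enumerate(gene_order) if g in pathway_genes]
    pairs = []
    for a, b in zip(positions, positions[1:]):
        # b - a - 1 genes lie strictly between the two pathway genes,
        # and none of them belongs to the pathway.
        if b - a - 1 <= max_gap:
            pairs.append((gene_order[a], gene_order[b]))
    return pairs
```

Only neighbouring pathway genes are compared, so every gene counted in a gap is by construction outside the pathway.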

Ranking of Metabolic Networks and Network Comparison Across Organisms

To compare the distribution of codon bias in metabolic pathways across organisms, we define the Relative Pathway Index as \( RPI(P) = \left( PI(P) - \mu_M \right) / \sigma_M \), where the mean and the standard deviation are taken over all pathways P in the metabolic network M of an organism. This measure is used to compare organisms with respect to common metabolic pathways. After this normalization of PI values, we can define the Comparative Pathway Index CPI(P) as the average of the RPI(P) values over all organisms sharing a pathway P, which allows the ranking of pathways to be compared over multiple organisms.
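A minimal sketch of the two indices, assuming PI values are already available as a mapping from pathway name to value for each organism. The function names and the nested-dictionary layout are hypothetical, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def rpi(pi_by_pathway):
    """RPI(P) = (PI(P) - mu_M) / sigma_M over the pathways of one organism."""
    values = list(pi_by_pathway.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("all PI values are identical; RPI is undefined")
    return {p: (v - mu) / sigma for p, v in pi_by_pathway.items()}

def cpi(rpi_by_organism):
    """CPI(P): average RPI(P) over the organisms whose network contains P."""
    pathways = set()
    for organism_rpi in rpi_by_organism.values():
        pathways.update(organism_rpi)
    return {
        p: mean(r[p] for r in rpi_by_organism.values() if p in r)
        for p in pathways
    }
```

Because each organism is normalized by its own mean and standard deviation before averaging, an organism with a narrow CAI distribution contributes to CPI on the same scale as one with a wide distribution.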

For visual representation and comparison (see Fig. 1), we recall that an RPI(P) value of +1 corresponds to a pathway with a bias one σ above the mean for the organism, and a value of −1, one σ below. So the interval [−1,1] is mapped into a continuous range of colors, going gradually from violet, blue, and green (lower values) to yellow, orange, and red (higher values) and pathways are assigned the corresponding color. Pathways with RPI(P) values falling outside the interval take the values of the closer extremes, −1 (violet), +1 (red). The mapping provides a good spread of colors for a suitable reading of metabolic differences, especially for CAI homogeneous organisms, independent of the width of the underlying statistical distribution.
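The clamping and color mapping described above can be sketched as follows. The exact palette used for the figures is not given in the text, so the linear hue ramp from violet to red in HSV space is purely an assumption, as are the function names.

```python
import colorsys

def clamp_rpi(rpi):
    """Values outside [-1, 1] take the value of the closer extreme."""
    return max(-1.0, min(1.0, rpi))

def rpi_to_rgb(rpi):
    """Map an RPI value to an RGB triple, violet (low) through green to red (high)."""
    t = (clamp_rpi(rpi) + 1.0) / 2.0   # 0 at RPI = -1, 1 at RPI = +1
    hue = 0.75 * (1.0 - t)             # assumption: hue 0.75 is violet, 0.0 is red
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)
```

Mapping the fixed interval [−1, 1] rather than the observed minimum and maximum is what makes the colors comparable between organisms.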

Transcriptomic Data and Expression Levels

Transcriptomic data for S. cerevisiae are taken from the study reported by Holstege et al. (1998) and based on Affymetrix GeneChip high-density oligonucleotide array (HDA) technology (downloaded from http://www.wi.mit.edu/young/expression.html in 1999, and available at http://www.ihes.fr/~carbone/data.htm). It concerns a set of 4849 genes whose expression profiles are determined by growing yeast cultures to midlog phase. Expression levels are defined as the numbers of copies of a given mRNA per yeast cell. The arrays can detect as few as 0.1 mRNA molecule/cell; the dynamic range over which detection is accurate is approximately 0.1–100 mRNA molecules/cell. The yeast genome is covered by four HDAs and each gene is represented on the HDA by 20- to 25-mer oligos that match the sequence of the message (perfect match oligos) and 20 oligos that are identical but differ by one base (mismatch oligos). Expression levels are calculated by subtracting the signal of a mismatch from its perfect match partner and averaging the difference for each oligo pair for a given gene. The average difference value is a measure of the expression level of that gene used in Fig. 2.
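The average difference calculation amounts to a mean of paired differences. The sketch below assumes the perfect-match and mismatch signals of one gene are given as two lists in the same probe order; the function name is hypothetical and no Affymetrix-specific scaling or outlier handling is attempted.

```python
def average_difference(perfect_match, mismatch):
    """Mean of (perfect match - mismatch) over the oligo pairs of one gene."""
    if len(perfect_match) != len(mismatch) or not perfect_match:
        raise ValueError("expected two non-empty signal lists of equal length")
    diffs = [pm - mm for pm, mm in zip(perfect_match, mismatch)]
    return sum(diffs) / len(diffs)
```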

Figure 2

Transcriptional levels of S. cerevisiae genes (in log scale) plotted against CAI values; a group of outliers displaying high CAI values and low transcriptional levels is shown as light gray squares.

Transcriptomic data on the S. cerevisiae cell cycle under diauxic shift are taken from deRisi et al. (1997). Expression levels are computed as absolute fluorescence intensity minus background. Because of the considerable variation across microarray experiments, the expression levels shown do not translate into absolute mRNA concentrations and are informative only when taken in relation to other genes. Despite the differences in technology and definitions of expression levels, transcriptomic data collected at different time points by deRisi et al. (1997), when plotted against CAI values (not shown), display shapes similar to Fig. 2, constructed from Holstege et al. (1998). Compare with the slope of the distributions of pathways in Fig. 3.

Figure 3

S. cerevisiae cell cycle during diauxic shift. Top: Plots corresponding to the activity of 86 metabolic pathways during time points 0, 1, 2 (left) and 4, 5, 6 (right). Each point in a plot is a metabolic pathway P represented by its maximum \( {\rm PI}_{\rm T}(P) \) value (y-axis), calculated over three time points, and its PI(P) value (x-axis); slopes of the distribution of metabolic pathways go down from the anaerobic state (fermentation), left, to the aerobic state (respiration), right. (The \( r^2 \) values of the least-squares fits are 0.782 and 0.583, respectively.) Bottom: \( {\rm RPI}_{\rm T} \) values of metabolic pathways during time points 0–6 of the diauxic cycle; EPI values for time points 0, 1, 2 (EPI012), 4, 5, 6 (EPI456), and all (EPI); and RPI(P) values (CAI). Pathways (cited by name, functional classification, and number of distinct enzymes involved) are ordered by EPI.

Transcriptomic Data as Indices of Metabolic Networks

A transcriptional pathway index for a metabolic network can be calculated from mRNA abundance levels obtained through microarray analysis. We define the \( {\rm PI}_{\rm T}(P) \) of a pathway P as the average of the expression levels of the enzymes involved in P at a given environmental condition (analogous to our definition of an index based on CAI values). Again, enzymes are counted with multiplicity and omitted where data are lacking.

Normalization of \( {\rm PI}_{\rm T} \) values is done as for PIs above, by defining \( {\rm {RPI}}_{\rm T}(P) = \left( {\rm {PI}}_{\rm T}(P) - \mu_T \right)/ \sigma_T \), where \( \mu_T \) and \( \sigma_T \) are the mean and standard deviation of the distribution of \( {\rm PI}_{\rm T}(P) \) values for P in M at a given time point. In the visual representation (Fig. 3), the interval [−1, +1] is mapped into a continuous range of colors, going gradually from violet to red, as above. Pathways P with \( {\rm RPI}_{\rm T}(P) \) values falling outside the interval take the values of the closer extremes (violet or red).

For a given organism and a set of microarray data experiments obtained under different biological conditions, we define the Evolutionary Pathway Index (EPI) to be a numerical ranking of pathways P, ordering them by the maximum among the \( {\rm PI}_{\rm T}(P) \) values within all conditions in the set. Intuitively, the maximum value obtained over a sufficiently large set of transcriptomic data corresponding to different biological states should provide information on those pathways and enzymes that are rarely highly translated but that on certain occasions need to be expressed in large quantities and/or rapidly.
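The transcriptional index (PIT in the text, `pi_t` below) and the EPI ranking can be sketched as follows. The data layout (a mapping from condition to per-gene expression levels, and one gene identifier per reaction) and the function names are assumptions made for illustration.

```python
from statistics import mean

def pi_t(reactions, expression):
    """Average expression level over the reactions of a pathway under one condition.

    Enzymes are counted with multiplicity and omitted where data are lacking.
    """
    values = [expression[g] for g in reactions if g in expression]
    return mean(values) if values else None

def epi_ranking(pathways, expression_by_condition):
    """Rank pathways by their maximum transcriptional index over all conditions."""
    best = {}
    for name, reactions in pathways.items():
        scores = [pi_t(reactions, expr) for expr in expression_by_condition.values()]
        scores = [s for s in scores if s is not None]
        if scores:
            best[name] = max(scores)
    # Highest maximum first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Taking the maximum rather than the mean over conditions is what lets a pathway that is strongly induced at a single time point rank highly.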

An evolutionarily meaningful set of experiments is a set of transcriptomic data which presents a sufficiently large variation of gene expression levels under sufficiently large changes in biological conditions. The transcriptomic data on the diauxic shift of deRisi et al. (1997) present such variation.

Metabolic Activities Derived from CAI Values

Genes with high CAI values in fast-growing organisms are commonly interpreted as those which are involved in maintaining this growth rate or in enzymatic activities with a rapid response (Gouy and Gautier 1982; Sharp and Li 1987; Sharp et al. 1988; Médigue et al. 1991; Shields and Sharp 1987; Carbone et al. 2004). Genes with such properties include those coding for ribosomal proteins, heat shock proteins, antioxidant proteins, and proteins involved in the respiratory chain, structural genes (for eukaryotes), and genes coding for enzymes involved in the metabolic pathways of amino acid biosynthesis. Functional groups of genes which have low CAI values include transcription factors, genes involved in pattern formation (for eukaryotes) and cell cycle progression, clock genes, genes involved in the metabolic pathways of nucleotide biosynthesis, carbohydrates, lipids, and secondary metabolites, and also genes involved in degradation, such as the ubiquitin–proteasome pathway, or in apoptosis. When the organism is not growing fast, high-CAI-value genes are often correlated with a dominant bias of other origins, like GC3 content or strand bias (Carbone et al. 2003).

A more careful analysis reveals that for fast growers, there exists a large gap between the mean (μ) of the CAI distribution over all CDSs and the average CAI value for ribosomal proteins (\( \mu_R \)); for E. coli, for instance, \( \mu_R = 0.60 \) and μ = 0.30, with a standard deviation σ = 0.10. For many species that do not grow fast, the gap between μ and \( \mu_R \) is smaller, but ribosomal proteins still display a relatively high CAI, that is, \( \mu_R > \mu + \sigma \). Mycoplasma pulmonis, for instance, has \( \mu_R = 0.7 \), μ = 0.57, and σ = 0.05. This organism has a dominant codon bias with no apparent translational origin but, rather, a dominant AT3 bias (Carbone et al. 2004); nevertheless, the codon composition of its ribosomal proteins is largely affected by AT3 bias.

For those translationally biased organisms for which not much is known of the metabolic activity of their enzymes, CAI analysis of specific genes and comparison with the coding of ribosomal proteins suggest a criterion to infer lifestyle: genes whose CAI value is close to \( \mu_R \), that is, \( \ge \mu_R - \sigma \), can be treated as indicators of important metabolic activities for the organism. (Notice that this bound implicitly considers the “strength” of the translational bias of the genome, indicated by how large the interval \( \mu_R - \mu \) is.) We discuss this hypothesis for ferredoxin metabolism, photosynthesis, and methanogenesis.
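The criterion reduces to a single comparison, sketched below with the PI values, ribosomal means, and standard deviations reported in the following subsections for Synechocystis and M. acetivorans. The function name and the small tolerance (added so that a value printed as exactly equal to the bound is not rejected by floating-point rounding) are assumptions of this sketch.

```python
def is_lifestyle_indicator(value, mu_r, sigma, tol=1e-9):
    """True if a CAI or PI value lies at or above mu_R - sigma."""
    return value >= mu_r - sigma - tol

# Synechocystis: photosystem I, photosystem II, Calvin cycle (mu_R = 0.60, sigma = 0.07).
synechocystis = [is_lifestyle_indicator(pi, 0.60, 0.07) for pi in (0.62, 0.58, 0.62)]

# M. acetivorans: methane metabolism network (mu_R = 0.63, sigma = 0.06).
acetivorans = is_lifestyle_indicator(0.57, 0.63, 0.06)
```

Both checks pass; the M. acetivorans value sits exactly on the bound 0.63 − 0.06 = 0.57, which is why the comparison is made with a tolerance.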

Ferredoxin in P. abyssi

The archaeon P. abyssi has been computationally classified as a translationally biased organism (Carbone et al. 2004), with μ = 0.45, σ = 0.07, and \( \mu_R = 0.59 \). Among its most biased genes, besides ribosomal proteins and elongation factors, we find ferredoxin (CAI(fdxA) = 0.71), ferredoxin oxidoreductase (CAI(for) = 0.63), and keto-valine–ferredoxin oxidoreductase γ-chain (CAI(PAB1470) = 0.62). Ferredoxin appears to be the major metabolic electron carrier in pyrococci (Cohen et al. 2003). After reduction during peptide or sugar degradation, it is mainly reoxidized by a membrane-bound hydrogenase (Silva et al. 2000), potentially generating membrane potential. In addition, ferredoxin has been suggested to be reoxidized by ferredoxin–NADP oxidoreductase (Silva et al. 2000; Schut et al. 2001). Moreover, NADPH can also be oxidized via the conversion of pyruvate to alanine, via glutamate dehydrogenase (CAI(PAB0391) = 0.74) and alanine aminotransferase (CAI(PAB1810) = 0.55) (Ward et al. 2000). The high bias of all genes involved in ferredoxin metabolism hints at the importance of this pathway for the organism.

Photosynthesis Pathways: Synechocystis

The cyanobacterium Synechocystis is classified as translationally biased (Mrázek et al. 2001) and its most biased genes are ribosomal proteins, phycocyanin (CAI(cpcB) = 0.79, CAI(cpcA) = 0.78), allophycocyanin (CAI(apcB) = 0.78, CAI(apcA) = 0.72), photosystem II proteins (CAI(psbA2) = CAI(psbA3) = 0.77, CAI(psbI) = 0.70), fructose-1,6-bisphosphate aldolase (CAI(cbbA) = 0.71), and ferredoxin (CAI(petF) = 0.76) (Carbone et al. 2004). The presence of proteins involved in photosynthesis within the most biased genes is a good indicator of the known photosynthetic activity and lifestyle of Synechocystis. In fact, the metabolic networks of photosystem I, photosystem II, and the Calvin cycle have PI = 0.62, 0.58, and 0.62, respectively, and with \( \mu_R = 0.60 \), μ = 0.50, and σ = 0.07, a photosynthetic preference of Synechocystis is confirmed by PI(P) \( \ge \mu_R - \sigma \) for all three pathways P above.

Methane Metabolism: Methanosarcina acetivorans

The archaeon M. acetivorans has been computationally classified as a translationally biased organism (Carbone et al. 2004), with \( \mu_R = 0.63 \), μ = 0.50, and σ = 0.06. Besides ribosomal proteins, methanol-5-hydroxybenzimidazolylcobamide comethyltransferase (CAI(mtaB1) = 0.83, CAI(mtaB2) = 0.75, CAI(mtaC1) = 0.71, CAI(mtaB3) = 0.68), methyl coenzyme M reductase (CAI(MA4546) = 0.79, CAI(MA4547) = 0.76, CAI(MA4550) = 0.73), and methylcobamide methyltransferase isozyme M (CAI(mtaA) = 0.72) are among the most biased genes. In particular, PI(Met) = 0.57 \( \ge \mu_R - \sigma \), where Met is the methane metabolism network. As shown in Table 2, no organism in Table 1 has PI(Met) within one standard deviation σ of \( \mu_R \), while M. acetivorans does. From this and the high CAI values of proteins involved in methane metabolism, one can infer the unusual environmental conditions in which this archaeon lives.

Table 2 PI of methane metabolism pathway (PI[Met]) in some translationally biased organisms

Metabolic Activities and RPI Values: Analysis of Metabolic Networks Across Species

For translationally biased organisms whose metabolic network has been partially reconstructed, CAI values of enzymes might be profitably used as indicators of preferential metabolic pathways and of lifestyle. Using the notion of Relative Pathway Index (RPI; defined in Materials and Methods), we observe that pathways can be grouped into classes with low, medium, and high RPI. As shown in Fig. 1 (and the same conclusions hold for pathways shared by fewer than four organisms), there are pathways that display the same bias across organisms: shared high RPI values (red and orange squares) suggest that a metabolic activity is favored and that it has a possibly constitutive regime; shared low RPI values (violet and blue squares) suggest that a metabolic activity is likely not involved in chains of rapid enzymatic responses. Pathways with low RPI values are essentially cofactors–coenzymes involved in vitamin biosynthesis which are known not to be usually produced with a high efficiency, and pathways with high RPI values are mainly involved in energy metabolism. There is a third group of pathways that display mixed RPI values across species, and from this set of pathways one might expect to infer differences in lifestyle. These pathways are mainly involved in central metabolism and amino acid synthesis and degradation. A thorough organism comparison can be realized by following the ranking described by the metabolic maps in Fig. 1. Here we consider a few specific pathways which are highly biased in one organism, but not in others, and argue, based on experimental evidence, in favor of their importance for the life of the organism. Below, a pathway is associated with the numbering listed in Fig. 1.

Glycolytic Pathway

For all translationally biased organisms in Table 1, the glycolytic pathway is expected to be especially favored because of fast growth. It has been observed that most genes with the highest CAI values are involved in this pathway (Sharp and Li 1987), and accordingly, we find that the RPI value of the pathway (4) is one of the highest as displayed in Fig. 1.

Glutamate Biosynthesis for A. tumefaciens

The genome of A. tumefaciens contains seven glutamine synthetase genes of three distinct types (with CAI(glnA I) = 0.77, CAI(glnA II) = 0.74, CAI(glnA III) = 0.57), and the presence of multiple copies of these genes seems to be related to the observation that this bacterium requires high concentrations of glutamate for optimal growth (Wood et al. 2001). The RPI value of the glutamate biosynthesis pathway (29) in A. tumefaciens is high; note that for all other organisms in Table 1 this pathway has a much lower RPI.

Metabolic Differences Between S. cerevisiae and Aerobic Bacteria

A number of pathways involved in amino acid biosynthesis (homoserine [43], valine [27], and cysteine [82]) and intermediate metabolism (mannose and GDP-mannose [60]) are highly biased in S. cerevisiae but not in most translationally biased aerobic bacteria in Table 1. The highly biased mannose and GDP-mannose metabolic pathway confirms the experimental evidence that S. cerevisiae can produce ethanol from glucose and mannose if the concentrations of sugars are high or when the yeast is grown under anaerobic conditions (Ratledge 1991), and this finding agrees with the statistical evidence reported in the next section, that the genome of S. cerevisiae has undergone strong selective pressure favoring a predominantly fermentative metabolism.

Also, the pyruvate dehydrogenase pathway (6) and the removal of superoxide radicals pathway (9) are highly biased in aerobic bacteria in Table 1 but not in the S. cerevisiae genome. In this respect, note that aerobic conditions lead to the generation of acetyl-CoA, and pyruvate dehydrogenase and its coenzymes play a decisive part in this reaction. Also, since oxygen and its derivatives are toxic and can lethally damage certain cellular components, aerobic organisms (which depend on oxygen as the terminal electron acceptor in respiration) have evolved protective enzyme systems, and this agrees with the removal of superoxide radicals pathway being highly biased. On the other hand, the absence of bias for these two pathways in S. cerevisiae supports the hypothesis of selective pressure favoring fermentation in this organism (see below).

L-Serine Degradation in E. coli

L-serine is a preferred growth factor for E. coli, and experimental evidence for this is reported in several studies (Pizer and Potochny 1964). The high RPI value of the L-serine degradation pathway (75) is consistent with the sensitivity of E. coli to L-serine and with its potential toxicity to the cell (Newman and Walker 1982).

Ammonia Assimilation Pathway in E. coli

The enteric bacterium E. coli (like many other organisms) has two primary pathways of glutamate synthesis, the glutamate dehydrogenase (GDH) pathway and the glutamine synthetase–glutamate synthase (GOGAT) pathway. GDH plays a role in glutamate synthesis when E. coli is under energy (and carbon) restriction but not under ammonium or phosphate restriction, and the GOGAT pathway is responsible for glutamate synthesis when energy is plentiful or when the ammonium or phosphate concentration becomes low. In an energy-rich (glucose-containing), nitrogen-poor environment, glutamine synthetase and glutamate synthase form the ammonia assimilatory cycle GOGAT, which is ATP-dependent and essential for nitrogen-limited growth and for steady-state growth with some sources of nitrogen. We found that the RPI value of GOGAT is high (with CAI(glnA) = 0.60, CAI(gltD) = 0.49, and CAI(gltB) = 0.42), and this is in agreement with the essential role of the pathway. In particular, RPI(GOGAT) is higher than RPI(GDH) (with CAI(gdhA) = 0.41), and this supports the experimental evidence that GOGAT plays at least two roles for which GDH cannot substitute: it can fix ammonium into organic molecules (glutamine, thence glutamate and other compounds) when the external concentration of ammonium is low, and it reduces the concentration of glutamine when that concentration becomes high (Reitzer 1986). Note also that a strain deficient in glutamate dehydrogenase has no observable growth phenotype under usual growth conditions (Reitzer 1986) but that GDH appears to be important during energy-limited growth because its use lowers the cost of biosynthesis (Helling 1994).
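
As a rough illustration of the comparison above, the following sketch ranks the two routes using the CAI values quoted in the text. It assumes, purely for illustration, that a pathway can be summarized by the arithmetic mean of the CAI values of its enzymes; the actual RPI is the quantity defined in Materials and Methods and may differ. The function name `mean_cai` is hypothetical.

```python
from statistics import mean

# CAI values quoted in the text for E. coli enzymes.
cai = {"glnA": 0.60, "gltD": 0.49, "gltB": 0.42, "gdhA": 0.41}

# Enzymes of the two glutamate synthesis routes, as named in the text.
pathways = {"GOGAT": ["glnA", "gltD", "gltB"], "GDH": ["gdhA"]}


def mean_cai(genes):
    """ASSUMPTION: summarize a pathway by the plain mean CAI of its enzymes.

    This is a stand-in for illustration, not the RPI of Materials and Methods.
    """
    return mean(cai[g] for g in genes)


scores = {name: mean_cai(genes) for name, genes in pathways.items()}

# Pathway names ordered from the highest to the lowest illustrative score.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Under this simplified summary GOGAT ranks above GDH, in line with the ordering of RPI values reported above.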

Evolution of Metabolic Networks and Transcriptomic Data in S. cerevisiae

A high CAI value is a good indicator not only of the high level at which a protein might be constitutively translated but also of the “possibility” for a protein to be highly expressed under very specific biological conditions. The next two examples suggest that codon bias is more informative than expression data for enzymes that are important in the wild but may not be highly expressed in laboratory experiments.

Seripauperine Proteins in Yeast

We consider the set of outlier ORFs (see Fig. 2 and also Carbone et al. [2003]) in the correlation plot between S. cerevisiae transcriptomic data (Holstege et al. 1998) and CAI values (Carbone et al. 2003). These outliers correspond to the gene family of seripauperines (the majority of unknown ORFs within the set are homologous to PAU1), which are small proteins (100 aa) showing strong sequence identity to serine-rich proteins but lacking the serine-rich domains. These genes are highly induced during fermentation conditions or long-lasting anaerobism (experimental evidence is reported on brewery yeast by James et al. [2003]), while their expression is difficult to detect under normal conditions (Viswanathan et al. 1994).

Hem13 Proteins in Yeast

Hem13 is an enzyme involved in the heme biosynthetic pathway of S. cerevisiae. It is an oxidase with oxygen as a substrate (Zitomer and Lowry 1992). It displays a quite high CAI value (0.47), compared to μ = 0.16 and σ = 0.12 for all S. cerevisiae genes, and it is highly translated under anaerobic conditions (Amillet et al. 1995) but not otherwise. Several facultative aerobic organisms contain a second enzyme catalyzing the same reaction as Hem13 but where the role of oxygen is replaced by molybdenum cofactor and S-adenosylmethionine. Since S. cerevisiae does not contain such an alternative enzyme, it is plausible that during anaerobism (or micro-aerobism) S. cerevisiae produces a large quantity of Hem13 to efficiently use the little oxygen present in the medium and to induce the heme necessary for its metabolism.
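
To make "quite high" concrete, the CAI of Hem13 can be expressed as a number of standard deviations above the genome-wide mean, using only the three values quoted above. The helper name `standard_score` is ours and is not part of the original analysis.

```python
def standard_score(value, mean, sd):
    """Distance of `value` from `mean`, in units of the standard deviation `sd`."""
    return (value - mean) / sd


# Values quoted in the text: CAI(Hem13) = 0.47, and mu = 0.16, sigma = 0.12
# for all S. cerevisiae genes.
hem13_z = standard_score(0.47, mean=0.16, sd=0.12)
```

The result is about 2.6, so Hem13 lies well beyond μ + 2σ of the genome-wide CAI distribution.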

These examples suggest that we look at the drastic changes in the expression of genes involved in fundamental cellular processes which are detectable along the switch from anaerobic growth (fermentation) to aerobic respiration upon depletion of glucose in the cell cycle of S. cerevisiae. This cycle is known as the diauxic shift; it has been documented by Johnston and Carlson (1992) and studied through DNA microarrays by DeRisi et al. (1997). Through this global view of the way the cell adapts to a changing environment, and by comparing transcriptomic levels to PI values of metabolic networks, we aim to detect those pathways that might be important for cell viability but whose relevance might be difficult to observe in laboratory experiments. For this, we analyze thoroughly the transcriptomic data of DeRisi et al. (1997) collected along seven time points (0...6) at successive 2-h intervals during the diauxic shift. We look at the slopes of the regression lines for PIT values in terms of PI values of metabolic pathways and notice that the change in transcriptomic values taking place from fermentation (corresponding to time points 0, 1, 2) to aerobic respiration (corresponding to time points 4, 5, 6) globally affects the entire set of metabolic pathways involved in the cell cycle. The regression lines associated with the successive time points 0...6, corresponding to the progressive depletion of glucose in the medium, display gradually decreasing slopes (not shown). In particular, pathways with a low PI value (that is, PI(P) < 0.25) on time points 0, 1, 2 suddenly increase their PIT on time points 4, 5, 6, and accordingly, pathways with a medium–high PI value (0.25 < PI(P) < 0.45) decrease in PIT. (Figure 3 summarizes this observation; it shows average PI(P) and PIT(P) values computed over time points 0, 1, 2 and 4, 5, 6 and decreasing slopes for the corresponding regression lines.) The list of affected pathways is given in Fig. 3.
Below, we refer to pathways with the numbering listed in Fig. 3.
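
The slope comparison described above can be sketched as follows. This is a minimal illustration with made-up numbers, not the data of DeRisi et al. (1997): for each time point, PIT is regressed on PI by ordinary least squares and the slopes are then compared across time points. The function name `ols_slope` and all values are hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares regression line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# HYPOTHETICAL PI values for four pathways (sequence-based, fixed over time).
pi_values = [0.15, 0.20, 0.30, 0.40]

# HYPOTHETICAL PIT values of the same pathways at three time points.
pit_by_time = {
    0: [0.10, 0.20, 0.40, 0.60],  # fermentation
    3: [0.20, 0.25, 0.35, 0.45],  # transition
    6: [0.30, 0.30, 0.32, 0.34],  # respiration
}

slopes = {t: ols_slope(pi_values, pit) for t, pit in pit_by_time.items()}

# True when the slope drops at every successive time point.
times = sorted(slopes)
is_decreasing = all(slopes[a] > slopes[b] for a, b in zip(times, times[1:]))
```

With these illustrative numbers, low-PI pathways gain PIT over time while high-PI pathways lose it, so the regression line flattens from one time point to the next, which is the pattern reported for the diauxic shift.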

It has been suggested that the high slope associated with the fermentative state (i.e., a strong correlation between CAI values and expression levels) might indicate a strong evolutionary pressure that favored a predominantly fermentative metabolism of yeast in the wild (Wagner 2000). Moving the organism to an unfavored state is then reflected by the drop in slope as less often used genes (with correspondingly lower CAI values) are moved to higher expression levels. Our analysis confirms that pathways which are highly active during fermentation are also highly biased and leads us to conclude that the coding sequences of enzymes involved in specific metabolic functions (indexed by PI values) describe the physiological responses of yeast precisely during fermentation (indexed by PIT). The examples of seripauperine and Hem13 proteins discussed above, presenting strong codon bias (i.e., large CAI values) and large RNA abundance during fermentation, sustain this hypothesis. Other metabolic pathways playing a crucial role in fermentation are detectable from the analysis of the diauxic cycle. We see that pathways with a high RPI value (red and orange squares in the CAI line in Fig. 3, bottom) and at least two enzymes participating in the metabolic reaction have a high EPI value during fermentation (i.e., at time points 0, 1, 2; see red and orange squares in the EPI012 line in Fig. 3). They are glycolysis (3), gluconeogenesis (4), glycine cleavage (5), glutamine biosynthesis (7), sorbitol (1) and mannitol (2) degradation, and the non-oxidative branch of the pentose phosphate pathway (9). These are also highly expressed during aerobic respiration, while mannose, GDP-mannose (34), and valine biosynthesis (16), methyl-donor molecule biosynthesis (23), and cysteine biosynthesis (49) are highly expressed only during fermentation. 
There are three pathways where this claim is not verified by the diauxic cycle dataset (i.e., RPI is high but EPI012 is low): homoserine biosynthesis (38), gluconate utilization (59), and the oxidative branch of the pentose phosphate pathway (64). There is experimental evidence, however, of the involvement of the oxidative pentose phosphate pathway and of the gluconate utilization pathway during fermentation. It has been reported (Middelhoven et al. 2000) that under anaerobic conditions, the yeast Saccharomyces bulderi rapidly ferments δ-gluconolactone to ethanol and carbon dioxide. In particular, levels of the pentose phosphate pathway enzymes were 10-fold higher in δ-gluconolactone-grown anaerobic cultures than in glucose-grown cultures (van Dijken et al. 2002). Also, growth of S. cerevisiae on δ-gluconolactone was found to be associated with a specific coordinate induction of the synthesis of two enzymes of the oxidative pentose phosphate pathway, 6-phosphogluconate dehydrogenase and 6-phosphogluconolactonase, together with that of gluconokinase (Sinha and Maitra 1992). For the homoserine biosynthesis pathway we could not find experimental evidence of its involvement in fermentation, and we claim that there is some time point during fermentation when genes belonging to this pathway are highly expressed.

Discussion

New measures aiming to numerically study biological and evolutionary questions and to index metabolic pathways across translationally biased organisms with respect to genetic coding are introduced. The approach seems to apply profitably also to genomes displaying weak forms of translational bias. Even though the interval between μ and μ\(_R\) is much less pronounced in genomes that present weak forms of translational bias, CAI analysis and RPI analysis help to determine metabolic preferences and to reveal interesting information on lifestyle. We discuss two examples.

The Case of M. thermoautotrophicum

The archaeon M. thermoautotrophicum has been computationally classified as having a weak form of translational bias (Carbone et al. 2004), with μ = 0.51, σ = 0.07, and μ\(_R\) = 0.59. Besides ribosomal proteins, tungsten formylmethanofuran dehydrogenase and methyl reductase dehydrogenase, that is, genes involved in methane metabolism, are among the most biased genes in Methanobacterium.

The Case of M. tuberculosis H37Rv

The virulent strain M. tuberculosis H37Rv has a doubling time of about 20 h and its genome presents a weak form of translational bias (Carbone et al. 2004), with μ\(_R\) = 0.57, μ = 0.50, and σ = 0.07. The metabolic map of M. tuberculosis H37Rv is displayed in Fig. 1. Among networks with a high RPI value, a key example is provided by the biotin biosynthesis pathway (106 in Fig. 1), which turns out to have a very high RPI for Mycobacteria (note that the clinical isolate M. tuberculosis CDC also displays a very high RPI value for biotin biosynthesis; not shown) but a very low RPI for all other organisms we studied. M. tuberculosis has a lipid-rich cell envelope which contributes to virulence and antibiotic resistance. Acyl-coenzyme A carboxylase, which catalyzes the first committed step of lipid biosynthesis, consists in mycobacteria of two subunits, one of which is indeed biotinylated. Genes encoding a biotinylated protein have been cloned and sequenced and the presence of biotin-binding sites has been experimentally demonstrated (Norman et al. 1994). Several other metabolic pathways have been ranked high in RPI for M. tuberculosis H37Rv despite their low RPI values for other species. Experiments demonstrated that the chorismate biosynthesis pathway (62) is essential for the viability of M. tuberculosis H37Rv (Parish and Stoker 2002) and that the asparagine degradation pathway (66), pyridoxal 5-phosphate biosynthesis (52), valine degradation (not shown in Fig. 1), and leucine biosynthesis (31) are also essential pathways (Sassetti et al. 2003). Finally, the ppGpp metabolic pathway (81) is essential for long-term survival of mycobacteria under starvation conditions (Primm et al. 2000). All these pathways of M. tuberculosis have high RPI values. Note that the serine biosynthesis pathway, which has been detected to be essential by Sassetti et al. (2003), has been found to be very highly ranked in the M. tuberculosis CDC strain (not shown).

In genomes where codon usage has a strictly mutational origin (i.e., strand bias or compositional bias), one expects highly ranked pathways to have a different base composition with no associated biological meaning. However, there are genomes that might present weak tendencies toward translational bias which can be identified by our approach but cannot be detected with classical statistical methods such as hidden Markov chains and multivariate statistical methods (Perrière and Thioulouse 2002; Nicolas et al. 2002; Lafay et al. 2000) or with numerical criteria based on ribosomal analysis (Carbone et al. 2004).

The Case of H. pylori

H. pylori is a microaerophilic, Gram-negative bacterium, whose infection is associated with type B gastritis and peptic ulcer disease and is a risk factor for gastric carcinomas in humans. H. pylori is a slow grower. It has been cultured on diverse agar-based media, resulting in 2 to 4 days of growth at 37°C. Its genome is known to display no sharp dominant codon bias but, rather, a homogeneous codon composition (Lafay et al. 2000), with μ = 0.55, σ = 0.12, and μ\(_R\) = 0.59. Despite this fact, we observe that the pathway for glycolysis (4) has a high RPI in H. pylori, and this indicates that some translational bias in this organism is nevertheless present. In this respect, note that glucose appears to be the only carbohydrate utilized by this bacterium (Tomb et al. 1997) and that rapid growth with a doubling time of about 50 min has been reported (Andersen et al. 1997). Our analysis also indicates that the thioredoxin pathway (40) has a rather high RPI value (see Fig. 1), and this is experimentally confirmed by the finding that H. pylori has a thioredoxin-dependent peroxiredoxin system playing a critical role in the defense against oxygen toxicity that is essential for survival and growth, even in microaerophilic environments (Baker et al. 2001). Another pathway ranked highly by RPI is riboflavin biosynthesis (48), which turns out to play a crucial role in ferric iron reduction and iron acquisition by H. pylori (Worst et al. 1998). In fact, like other pathogenic bacteria, H. pylori encounters an iron-limiting environment when it attempts to colonize or invade a mammalian host.

To summarize, we proposed that the lifestyle of an organism can be deduced for translationally biased organisms by analyzing the set of most biased genes in the genome, and we observed that weakly translationally biased genomes also carry information on lifestyle. The case of H. pylori suggests that this rule might be extended to a much broader class of organisms. In support of this hypothesis, we found that the most biased genes of the cyanobacterium Thermosynechococcus elongatus have photosynthetic functions. Among the top 24 CAI-ranked genes there are 6 photosystem I and II proteins, 1 phycobilisome small core linker polypeptide, 2 phycocyanin subunits, 2 allophycocyanin subunits, 2 cytochrome b6 proteins, and 2 ribosomal proteins. There is no evidence, either experimental or computational (Carbone et al. 2004), that Thermosynechococcus elongatus is translationally biased (in particular, no compositional bias has been detected), but its most biased genes strongly reflect the lifestyle of this organism. Similarly, Chlorobium tepidum, a Gram-negative bacterium of the green sulfur phylum, is an obligate anaerobic photolithoautotroph and lives in high-sulfide hot springs where anoxic layers containing reduced sulfur compounds are exposed to light. It is a thermophile, growing optimally at 48°C. There is no evidence, either experimental or computational (Carbone et al. 2004), that C. tepidum is translationally biased, but we note that its top 20 CAI-ranked genes contain 2 hydrogenase/sulfur reductases, 3 sulfite reductase subunits, 2 iron–sulfur cluster binding proteins, 1 ferredoxin protein, and 1 ribosomal protein. These are proteins that characterize well the living conditions of these bacteria. Note that the strong GC3 bias of this genome does not prevent the genomic coding from revealing the crucial role of certain enzymes.

These observations lead us to conclude that there is still much to understand about the role of codon bias in genomes and that refined quantitative measures of codon bias which are able to differentiate bias strengths are needed to investigate in silico a new range of biological possibilities for larger classes of organisms.

Dominant Codon Bias and tRNA Charging Pathways

tRNA charging pathways have been described for E. coli, A. tumefaciens, and V. cholerae among the organisms listed in Table 1. For these organisms most amino acids have a tRNA charging pathway P with PI(P) \( \ge \mu + \sigma \). Namely, E. coli, A. tumefaciens, and V. cholerae have 20, 16, and 18 pathways P with PI(P) \( \ge \mu + \sigma \), respectively. This finding leads to the hypothesis that all translationally biased genomes present high bias on tRNA charging pathways. We also verified that in H. pylori, 19 tRNA charging pathways P display RPI(P) ≥ μ and 12 of them have RPI(P) ≥ μ\(_R\). This last observation suggests a need for efficiency of this process which is independent of the strength of translational bias. Note also that, for E. coli, A. tumefaciens, and V. cholerae, we separated two sets of genes, one containing the 2% of most biased genes and the other containing all other genes. We then applied linear discriminant analysis and observed that the amino acid separation coefficients do not display a significant correlation between the organism-preferred codons and the specific codons of the tRNA charging complexes displaying a high RPI. The availability of further metabolic information on genomes will allow us to verify the hypothesis for a class of organisms displaying strong and weak forms of translational bias.
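
The counting criterion used above, PI(P) \( \ge \mu + \sigma \), can be sketched in a few lines. The pathway names and all numerical values below are hypothetical placeholders chosen for illustration; they are not taken from any of the genomes discussed, and `pathways_above` is our own helper name.

```python
def pathways_above(pathway_pi, mu, sigma):
    """Names of pathways whose PI is at least mu + sigma, in alphabetical order."""
    threshold = mu + sigma
    return sorted(name for name, value in pathway_pi.items() if value >= threshold)


# HYPOTHETICAL PI values for three tRNA charging pathways.
example_pi = {
    "Ala-tRNA charging": 0.52,
    "Gly-tRNA charging": 0.31,
    "Leu-tRNA charging": 0.45,
}

# HYPOTHETICAL genome-wide mean and standard deviation of CAI.
selected = pathways_above(example_pi, mu=0.30, sigma=0.10)
```

Applied to the real PI values of an organism, the length of the returned list corresponds to the per-organism counts reported above.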

PI, PI\(_T\), and “Logically” Differentiated Regulation for Unicellular Organisms

Constitutive high expression of rarely used but genomically favored (i.e., with high PI) pathways is expected to be avoided by specific patterns of regulation, and the existence of such patterns can be detected by combining sequence analysis and transcriptomic data: the four logical combinations coming from “high” and “low” PIT and PI values lead us to envisage at least four different kinds of enzyme regulation. Given a set of experiments, it is reasonable to distinguish the regulation of enzymes in a pathway P which has “high” PI(P) and “low” PIT(P) for all (or most) time point experiments from the regulation of a pathway P′ where PIT(P′) and PI(P′) are always “high.” Enzymes participating in P′ are likely to be constitutively regulated, while it is plausible that enzymes participating in P are regulated under specific biological conditions (as in the case of seripauperine or Hem13 proteins discussed above). The number of “logically” foreseeable regulation modes becomes larger if we consider the frequency at which high and low PIT(P) values, for a pathway P, occur in experiments. This logic-based diversity characterizing metabolic pathways challenges the understanding of the underlying regulatory mechanisms. The interplaying role of statistical analysis and transcriptomic data turns out to be essential here.
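
The four logical combinations can be made explicit with a small sketch. The cutoffs separating "high" from "low" and the fraction of experiments standing for "all (or most)" are assumptions introduced here for illustration; the text does not fix them. Only the two combinations interpreted in the paragraph above receive a label.

```python
def regulation_class(pi, pit_series, pi_cut, pit_cut, min_fraction=0.5):
    """Return (PI level, PIT level) for one pathway.

    ASSUMPTIONS: `pi_cut` and `pit_cut` are user-chosen cutoffs, and PIT is
    called "high" when it reaches `pit_cut` in at least `min_fraction` of the
    time-point experiments.
    """
    pi_level = "high" if pi >= pi_cut else "low"
    fraction_high = sum(v >= pit_cut for v in pit_series) / len(pit_series)
    pit_level = "high" if fraction_high >= min_fraction else "low"
    return pi_level, pit_level


# Interpretations suggested in the text for two of the four combinations.
INTERPRETATION = {
    ("high", "high"): "likely constitutively regulated",
    ("high", "low"): "plausibly regulated under specific conditions",
}

# HYPOTHETICAL pathway: strongly biased, but weakly expressed at all time points.
levels = regulation_class(0.50, [0.10, 0.10, 0.20], pi_cut=0.40, pit_cut=0.40)
label = INTERPRETATION.get(levels, "not characterized in the text")
```

Raising or lowering `min_fraction` is one simple way to account for how often high PIT values occur across experiments, which enlarges the number of distinguishable regulation modes as noted above.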

Caveats to the Analysis Based on Transcriptomic Data

A major concern with our analysis based on microarray data is the noisiness of transcriptomic data and the limited correlation between mRNA and protein expression levels. However, the amount of noise is smaller and the correlation higher for highly expressed genes (Gygi et al. 1999), those of most interest in our analysis.