Introduction

Whipple’s disease is a rare multisystemic bacterial infection, involving the gastrointestinal tract as well as various other organs (Bentley et al. 2003). Symptoms include chronic diarrhea, intestinal bleeding, abdominal pain, weight loss, fatigue, and weakness (Ratnaike 2000; Dutly and Altwegg 2001). Cardiac or neurological problems, not always associated with digestive symptoms, have also been observed (Slowik and Szczudlik 2002). Patients are reported to have subtle immune defects and there are evidences that such altered immune functions play an important role in the manifestation of the disease (Ring et al. 2003).

The causative agent of Whipple’s disease is an intracellular pathogenic actinobacteria Tropheryma whipplei, which localizes in the intestinal lamina propria of the host (Bentley et al. 2003). The bacterium is resistant to cultivation in vitro and seems to depend heavily on human cell-associated factors for growth. The complete genome sequences of two different strains of this human pathogen, T. whipplei TW08/27 and T. whipplei Twist, reveal several atypical characteristics of the organism (Bentley et al. 2003; Raoult et al. 2003). First, their genome sizes are small (<1 Mb) and their G + C contents (∼46%) are low, compared to other actinomycete species, which have, in general, genome sizes ranging from 1 million to 8 million bp and high G + C contents (Casjens 1998). Second, the genomes bear all the traits of strictly host-adapted organisms, including pronounced deficiencies in energy metabolism and a lack of key biosynthetic pathways, indicating that like many other obligatory intracellular bacteria (Bentley et al. 2003; Raoult et al. 2003), T. whipplei also might have suffered gradual decay and loss of genes. Third, despite its small genome size, the bacterium devotes a large amount of coding capacity to biosynthesis of surface-associated features. It codes for a family of large surface proteins, some of which are associated with large amounts of noncoding repetitive DNA, unlike anything previously seen in bacterial genome sequences, and the genome exhibits a great deal of genetic variability, including phase variation and mostly directed toward the changes in cell surface proteins, indicating that immune evasion and host interaction play an important role in the lifestyle of this persistent bacterial pathogen (Bentley et al. 2003; Raoult et al. 2003). Such atypical characteristics of T. whipplei point toward the possible existence of special selection forces operative at the genome and/or proteome levels, unveiling of which calls for an in-depth analysis of the trends in codon and amino acid usage in the organism.

Synonymous codon usage in most of the unicellular organisms is primarily governed by directional mutational bias and translational selection (Sharp and Li 1987a; Pan et al. 1998; Zavala et al. 2002; Rocha 2004), though several other factors, such as context dependence (Berg and Silva 1997), replicational-transcriptional selection (McInerney 1998; Lafay et al. 1999; Romero et al. 2000; Das et al. 2005a), protein hydropathy (de Miranda et al. 2000), and ecological niches (Singer and Hickey 2003),may also have significant influences. However, in some bacterial genomes with an extremely high/low G + C- content (Ohkubo et al. 1987; Ohama et al. 1990; Wright and Bibb 1992; Andersson and Sharp 1996), no impact of translational selection is apparent on synonymous codon preferences and there are organisms like Helicobacter pylori where the codon usage is apparently influenced neither by simple mutational bias nor by translational selection (Lafay et al. 2000). The amino acid usage in microbial organisms may also be influenced by several factors like hydrophobicity, expressivity, and aromaticity of the respective proteins (Lobry and Gautier 1994) and cost minimization and conservation of GC-rich amino acids in highly expressed gene products etc. (Seligmann 2003; Das et al. 2005a, b; Chanda et al. 2005).

In some species, the mutational bias is markedly strand-specific (McInerney 1998; Lafay et al. 1999; Romero et al. 2000; Das et al. 2005a), and such organisms, in general, exhibit significantly different synonymous and/or nonsynonymous codon usage patterns in genes transcribed on the leading and lagging strands of replication. Replicational-transcriptional selection is a major source of variation in codon usage in those species—replicational selection is responsible for the higher number of genes on the leading strand and transcriptional selection for the enrichment of highly expressed and/or essential genes on that strand (Lopez and Philippe 2001; Das et al. 2005a; Rocha and Danchin 2003). In Bacillus subtilis, 78% of highly expressed genes and 94% of essential genes are preferentially positioned in the leading strand (Rocha and Danchin 2003). Interestingly enough, the organisms with strong strand-specific asymmetries studied so far are characterized by massive genome reduction and overall A + T richness, irrespective of their taxonomic positions. Examples include the spirochaetes B. burgdorferi and T. pallidum (McInerney 1998; Lafay et al. 1999), Chlamydia trachomatis among the Chlamydiae (Romero et al. 2000), the γ proteobacterium Buchnera aphidicola (Rispe et al. 2004), and the α proteobacteria Bartonella (Das et al. 2005a). However, not all A + T-rich reduced genomes exhibit pronounced compositional asymmetries in leading and lagging strands of replication. H. pylori (Lafay et al. 2000), Wigglesworthia (Herbeck et al. 2003), and Wolbachia (Wu et al. 2004) are the prime examples. The facts that T. whipplei is the only known reduced genome species (<1 Mb) within the Actinobacteria and that the organization of the TW08/27 and Twist genomes differs by an inversion of a large chromosomal segment, symmetrical to the origin of replication (Raoult et al. 2003), have made it more intriguing to examine whether the reduction in genome size and/or the intracellular lifestyles of T. whipplei could exert any effect on its codon/amino acid preferences. Multivariate analyses carried out in the present study indicate that the codon and amino acid usage in this human pathogen might be a consequence of a complex balance among replicational-transcriptional selection, translational control, and other physicochemical properties of the gene products. The study, apart from providing an insight into the underlying selection pressures operative at the gene/protein level of T. whipplei, may also offer a better understanding of evolution of this host-adapted microorganism.

Materials and Methods

Actinobacterial Genome Sequences

All protein-coding sequences of the T. whipplei TW08/27 and Twist genomes were extracted from NCBI GenBank. Protein-coding sequences of 15 other actinobacterial species, complete genome sequences of which are available in NCBI GenBank (Version 145.0), were also extracted (listed in Table 1) for comparative analyses. In order to reduce sampling errors, the annotated genes with fewer than 100 codons were excluded from the analysis. The presumed duplicates, genes for transposase and integrase, and genes with internal stop codons and untranslatable codons were also excluded. Finally, 729 sequences for T. whipplei TW08/27 and 734 sequences for T. whipplei Twist were selected for analysis.

Table 1 Genome characteristics and correlations of GC3 and GT3 content with first two axes of COA on RSCU values of genes in 17 completely sequenced actinobacterial genomes

Determination of Leading and Lagging Strand Genes

The genes present in the leading and lagging strand were isolated on the basis of the reported location of oriC in T. whipplei Twist by Raoult et al. (2003). Based on the change in AT-skew signal using Oriloc (available at http://www.pbil.univ-lyon1.fr/software/Oriloc) (Frank and Lobry 2000) and the conservation of the dnaA-dnaN-recF gene cluster, the oriC region in T. whipplei TW08/27 is assumed to be located at 0 kB. The terminus is assumed at the second inflexion in AT-skew and thus allowed us to locate each coding sequence present in either the leading or the lagging strands of replication.

Multivariate Analyses of Relative Frequencies of Synonymous Codons and Amino Acids

To identify the major factors shaping variation in relative synonymous codon usage (RSCU) and relative amino acid usage (RAAU) among T. whipplei genes, we applied correspondence analyses (COA) using CODONW 1.4.2 (written by John Peden and available at www.molbiol.ox.ac.uk/win95.codonW.zip). For COA on RSCU values, the data were plotted in a multidimensional space of 59 axes (excluding AUG, UGG, and three stop codons) and then the most prominent axes contributing to the codon usage variation among the genes were determined. Similarly, in COA on RAAU, the variables were the frequencies of the 20 amino acids, which sum up to 1 for each gene.

Indexes Used to Identify the Trends of Variation in Codon and Amino Acid Usage

GC1+2 (G + C content at first and second codon positions), GC3S, and GT3S (frequency of G + C and G + T, respectively, at synonymous third codon positions) were calculated for each coding sequences in both strains of T. whipplei. Indexes like the total number of occurrences of each codon, RSCU (Sharp and Li 1987a), codon adaptive index (CAI; Sharp and Li 1987a), RAAU, average hydrophobicity (Gravy score) (Kyte and Doolittle 1982), aromaticity (Lobry and Gautier 1994), and average size/complexity quotient (Dufton 1997) of encoded proteins were also calculated to find out the factors influencing codon and amino acid usage. The CAI was calculated for all leading strand genes with respect to the RSCU values of the genes for ribosomal proteins (≥100 aa) present in that strand.

Prediction of Secondary Structure and Protein Disorder

Prediction of protein secondary structure was performed using the GOR IV algorithm (Garnier et al. 1996) from Expasy proteomics server and the disordered regions within proteins were predicted using GlobPlot (available at http://www.globplot.embl.de).

Estimation of Synonymous and Nonsynonymous Substitution Patterns in Highly and Lowly Expressed Genes

Orthologous genes of T. whipplei TW08/27 and T. whipplei Twist were obtained from NCBI GenePlot by filtering the symmetrical best hits of protein homologues and sequence alignments were performed using ClustalW. To examine the nucleotide substitution patterns, we estimated pairwise synonymous divergences (dS) as well as nonsynonymous divergences (dN) between the orthologues using the MEGA program (version 2.1), as described by Nei and Gojobori (1986).

Statistical Methods

In order to detect the significant differences between the two classes of genes (i.e., leading vs. lagging strand genes and highly vs. lowly expressed genes), if any, codon and amino acid abundances were compared by chi-square in 2 × 2 contingency tables having 1 degree of freedom. For each codon, the first and second row of the contingency table represented, respectively, the number of occurrences of the codon being analyzed and the total number of alternate synonyms for the corresponding residue in the two classes under comparison. For amino acid, rows in the contingency table were the count of a particular amino acid residue and the total count of the remaining residues. Linear regression analysis was used to find out the significance of association between the positions of sequences along major axes of COA and biological variables using STATISTICA (Version 6.0).

Results and Discussion

Asymmetrical Mutational Bias, Coupled with Replicational-Transcriptional Selection on Synonymous Codon Usage

Figures 1a and b show the position of the genes on the plane defined by the first and second major axes generated by COA on RSCU values of coding sequences in the TW08/27 and Twist genomes, respectively. The first principal axis accounts for 9.23% and 9.19%, while the second axis accounts for 5.15% and 5.07% of the total variability for TW08/27 and Twist genomes, respectively. The low percentage of variance explained by the first and second major axes indicates that no single axis accounts for much of the total variability in codon usage of T. whipplei.

Figure 1
figure 1

Position of genes along the first two principal axes generated by COA on RSCU values of all genes from the (a) T. whipplei TW08/27 and (b) T. whipplei Twist genomes represented in the upper panel, and plots of the first two axes of COA on RSCU values of the genes located in the leading strand of the (c) TW08/27 and (d) Twist genomes are in the lower panel. The filled quadrangle and open quadrangle in a and b represent genes transcribed in the leading and the lagging strands of replication, respectively, whereas in c and d, the open and filled circles represent, respectively, the potential highly expressed genes and the other genes transcribed from the leading strand only.

Interestingly enough, in each plot, the genes transcribing in the leading and the lagging strands of replication are segregated in two discrete clusters with little overlap along axis 1. Similar scatter plots with two distinct clusters of points were observed earlier in case of genomes with pronounced strand-specific mutational bias (McInerney 1998; Lafay et al. 1999; Das et al. 2005a). When the positions of all synonymous codons are plotted on the plane defined by the first and second axes of COA on RSCU, most of the U- and G- ending codons are found to be distributed on the negative side of axis 1, whereas A- and C-ending codons are mostly located on the positive side (plot not shown). The codons CUG (Leu), GGG (Gly), AGU (Ser), CGC (Arg), GGC (Gly), GCC (Ala), and ACC (Thr) are the exceptions. It is also observed that the triplets occupying two extremes on the first axis are more represented by the NNA and NNU codons than by the NNC and NNG codons, suggesting that the relative contributions of the NNA and NNU codons to imparting the strand specificity in codon usage patterns are greater than those of NNC and NNG. Thus, in both species under study, the primary cause of variation in synonymous codon usage is whether a gene is transcribed from the leading strand or from the lagging strand of replication. To our knowledge, this is the first report that the strong strand-specific bias is the prime source behind the synonymous codon usage in any actinobacteria. Furthermore, in both strains of Tropheryma, the number of predicted coding sequences is significantly higher in the leading strand (73.8% for the TW08/27 and 74.1% for the Twist genome) than in the lagging strand. Although in T. whipplei complete information on essential genes is not available, it is important to note that the distribution of potential highly expressed genes (i.e., ribosomal proteins, transcription and translation processing factors, etc.) is also significantly skewed, most of them (>70%) being transcribed from the leading strand. All these observations indicate that replicational-transcriptional selection coupled with asymmetrical mutational bias might have been the primary source of intragenomic variations in codon usage pattern in T. whipplei.

In order to examine whether the strand-specific codon usage is a general characteristic of actinobacteria or whether it is a specific feature of T. whipplei only, synonymous codon usage patterns in all actinobacterial genomes sequenced so far were analyzed in the present study. Table 1 represents the comparison of codon usage patterns (as determined by COA on RSCU values of genes) of 17 completely sequenced actinobacteria. In all organisms except the two strains of T. whipplei and three species of Corynebacterium, the position of each gene along the first axis exhibits a much higher correlation with the respective GC3 content than with GT3, so that the genes characterized by the highest and the lowest GC3 contents lie on the two extremes of that axis, and for most of these actinobacterial genomes, the second major axis exhibits a strong correlation with the GT3 content of the genes. In C. diphtheriae, C. glutamicum, and C. efficiens, strong correlations of axis 1 are observed with GT3 as well as with GC3 content of the genes. The correlation of axis 1 with GT3 is stronger in C. diphtheriae (average GC content, 53.5%) and C. glutamicum (average GC content, 53.8%) than that in C. efficiens (GC content, 63.1%). This result is consistent with the GC skew profiles of the respective genomes (Nishio et al. 2003). On the other hand, in T. whipplei, the axis 1 exhibits a strong correlation only with GT3 (r = −0.82 and −0.81 at p < 0.0001 for the TW08/27 and Twist genomes, respectively). The genes representing the highest GT3 content and displaying the most negative values along the first principal axis are present in the leading strand of replication (Figs. 1a and b). Thus the influence of strand-specific base compositional asymmetry on synonymous codon usage seems to be more pronounced in T. whipplei compared to other actinobacteria under study.

Table 2 shows the cumulative codon usage patterns in the genes located in the leading and lagging strands of the TW08/27 and Twist genomes. The chi-square test on occurrences of different codons in two replicating strands showed that there are 22 G-ending/U-ending codons, usages of which significantly increase (p < 0.01) in the leading strand genes, whereas usages of 24 and 25 C-ending/A-ending codons are significantly higher in the lagging strand of the TW08/27 and Twist genomes, respectively. The rest of the synonymous codons are used almost equally by both replicating strands. Unlike B. burgdorferi (McInerney 1998), not all these equally used codons are the rare codons. For example, CCG (Pro) and ACC (Thr), though used with relatively high frequencies, are almost equally used by the genes of two strands (Table 2). Although the TW08/27 and Twist genomes share >99% of identity at the nucleotide sequence level, it is noticeable that the genome organization differs by an approximately symmetrical inversion of a large chromosomal segment (Raoult et al. 2003). Due to such symmetrical inversion encompassing the origin or terminus, the locations of the genes do not change with respect to the leading and lagging strands (Eisen et al. 2000; Mackiewicz et al. 2001). The genomic coordinates of inversion region in the Twist genome are 182333–713982 bp. The Twist coordinates 182333 and 713982 bp correspond to the coordinates 712566 and 181874 bp, respectively, in the TW08/27 genome (Raoult et al. 2003). The inverted region in T. whipplei encompasses the replication terminus, which is located almost at the center of this region. Hence in T. whipplei, after inversion the location of the gene is conserved with respect to the two replicating strands, and as a consequence the trends in synonymous codon usage remain unaltered.

Table 2 Codon usage of the leading and lagging strand genes and preferred codons in highly expressed sequences for the T. whipplei TW08/27 and T. whipplei Twist genomes

It is worth mentioning at this point that depending on the relative kinetics of genomic rearrangements and the mutational bias, the intensities of strand asymmetries vary from organism to organism and high genome plasticity, in general, diminishes the base skews induced by asymmetric mutational biases in two replicating strands (Lopez and Philippe 2001). The genome of T. whipplei is characterized by a high degree of plasticity (Bentley et al. 2003; Raoult et al. 2003). The existence of a pronounced strand asymmetry in T. whipplei genome, therefore, contradicts the hypothesis made by Lopez and Philippe (2001). Probably in T. whipplei, the strand-specific asymmetric nature of the mutational bias has had enough time to conciliate, in spite of their high genome plasticity.

Evidence for Translational Selection in T. whipplei

In order to examine the possible effect of translational selection, if any, on codon selection by the highly expressed genes of T. whipplei, we have performed a COA on RSCU values of the genes transcribed from the leading strand of replication only (as most of the highly expressed genes (>70%) are located in leading strand). The potential highly expressed genes (i.e., genes encoding ribosomal proteins, transcription translation processing factors, heat shock proteins, etc.) that are transcribed from the leading strand mostly clustered at one extreme of axis 1, which represents about 8% of the total variance for both strains of T. whipplei (Figs. 1c and d). Therefore, the position of genes along the first axis of COA may be related to expressivity, though the influence of gene expressivity on synonymous codon selection is not expected to be strong enough, as axis 1 described a rather small amount of total variation. Consistent with this, when the pooled data from either extreme of the first axis are compared, it is found that there are 23 codons for TW08/27 and 25 codons for Twist genomes, usages of which are significantly higher (p < 0.05) among the potential highly expressed genes (Table 2). Among these preferred codons in potential highly expressed genes, 21 codons for the TW08/27 and 23 codons for the Twist genomes are G and U ending. However, in the quartets of Arg and Gly, C-ending triplets (CGC for Arg and GGC for Gly) are also significantly overrepresented in highly expressed genes. All these observations suggest that apart from the replication-transcriptional selection, a residual codon bias exists in highly expressed genes of T. whipplei, possibly due to translational selection.

In order to ascertain the influence of the residual selection putatively operating at the level of translation on synonymous codon bias in highly expressed genes, we used two different approaches. First, we calculated CAI for all leading strand genes and examined whether there exists any correlation of CAI values of the genes with their positions along the first principal axis of COA on RSCU values of the leading strand genes. The first axis exhibits a strong significant correlation (r = 0.79 and −0.77 at p < 0.0001 for the TW08/27 and Twist genomes, respectively) with the CAI values of genes. The genes displaying the higher values of CAI are the sequences with putative high expression levels, including genes coding for ribosomal proteins, subunits of RNA polymerase, heat shock proteins, etc. This suggests that there is a preferential use of a particular subset of synonymous codons by the highly expressed genes in T. whipplei.

The second approach is to compare the estimated dS values between 50 putative highly and 50 putative lowly expressed orthologous genes between the TW08/27 and the Twist genomes taken from the pooled data of the two extremes of axis 1 generated by COA on RSCU for leading strand genes of both organisms. Although these two genomes are very closely related, the genes representing significantly low values of dS are the potential highly expressed genes (i.e., ribosomal proteins, transcription and translation processing factors, heat shock proteins, etc.). The mean dS value for high-expression genes is 0.0049, which is significantly lower (t-test, p < 10−4) than that of low-expression genes (mean dS = 0.0058), indicating that highly expressed genes have lower divergence at synonymous sites than lowly expressed genes of these two T. whipplei genomes, since their separation from the last common ancestor. This, in turn, is consistent with the hypothesis that the nucleotide selection force acting at the synonymous sites is stronger for highly expressed genes than that for lowly expressed ones, as reported previously for enterobacteria (Sharp and Li 1987b) and Bartonella (Das et al. 2005a). Therefore, it may be concluded that apart from the strand-specific mutational bias, a selection, putatively operating at the level of translation, also might have influenced the synonymous codon choice in genes of these two strains of T. whipplei.

Recent studies on complete microbial genomes indicated a close association among synonymous codon usage bias, tRNA abundance, number of rRNA operons, generation time, and genome size (Rocha 2004; dos Reis et al. 2004; Sharp et al. 2005). Evidence for apparently little or no translational selection has been reported for most of the species with reduced genomes examined so far, such as B. burgdorferi (McInerney 1998), B. aphidicola (Rispe et al. 2004), H. pylori (Lafay et al. 2000), T. pallidum (Lafay et al. 1999), and Wigglesworthia (Herbeck et al. 2003). Most endoparasitic and endosymbiotic bacteria are characterized by the presence of only one or two rRNA operons and a small number of tRNA genes, a long generation time, and a reduced genome and overall AT richness. Some of these bacteria such as B. burgdorferi (McInerney 1998), B. aphidicola Bpi strain (Rispe et al. 2004), T. pallidum (Lafay et al. 1999), C. trachomatis (Romero et al. 2000), and two species of Bartonella (Das et al. 2005a) exhibit strong base compositional asymmetries like T. whipplei. The genomes of the TW08/27 and Twist strains of T. whipplei have undergone a massive reduction in size as well as a decrease in overall GC content compared to their free-living actinobacterial relatives. Consistent with the characteristics of intracellular bacteria (Sharp et al. 2005; Rocha 2004), the in vitro doubling time of T. whipplei is remarkably longest among bacteria (Bentley et al. 2003), suggesting its slow growth rate. Both genomes contain only one rRNA operon and a limited number of tRNA genes (51 for the TW08/27 and 50 for the Twist genome). It is worth mentioning at this point that in spite of the relatively high GC content and large genome size of Mycobacterium species, the weakly selected codon usage bias found was probably due to the lower number of rRNA operons and tRNA genes (Sharp et al. 2005). Evidence for a weak translational selection in T. whipplei genomes is, therefore, consistent with the earlier observations on the organisms adapted in intracellular lifestyle.

Major Sources of Variation in Proteome Composition

To investigate whether the DNA strand specific mutational biases have any impact on amino acid usage in T. whipplei gene products, we performed COA on relative amino acid usage (RAAU) of the encoded proteins. Figure 2 represents the position of each gene on the plane defined by the first two major axes of COA on RAAU of T. whipplei gene products. There is no clear segregation of the proteins encoded by the leading and lagging strands of the genomes under study. As G and T bases are prevalent in the leading strand, while C and A are overrepresented in the lagging strand, the amino acids encoded by either G + U-rich (such as Phe, Cys, Trp, Gly, and Val) or A + C-rich (such as Lys, Asn, Pro, Gln, His, and Thr) codons are expected to be prevalent in respective strands. When the cumulative amino acid usages of encoded gene products in two strands are compared separately, among these 11 residues, only 2 (Phe and Val) are significantly overrepresented (p < 0.001) in the leading strand and 3 residues (Lys, Asn, and Thr) are more abundant (p < 0.001) in the lagging strand. All other residues are used evenly by the genes of the two strands, indicating that the influence of the strand-specific mutational bias on the amino acid composition of T. whipplei gene products is not as pronounced as on their synonymous codon usage.

Figure 2
figure 2

Position of each gene along the first and second axes (a, b) and first and third axes (c, d) generated by COA on RAAU of encoded gene products for T. whipplei TW08/27 and T. whipplei Twist. The filled quadrangle and open quadrangle represent genes transcribed in the leading and lagging strands of replication, respectively. The large and small clusters of genes encoding membrane-associated proteins are marked by dashed-line ovals (in a and b). The open triangles (in c and d) represent potentially highly expressed genes.

In both strains of T. whipplei, the first three axes generated by COA on amino acid usage explain about 42% of the total variability. The first and second major axes are strongly correlated both with hydrophobicity and with the aromaticity of encoded proteins (Table 3), implying hydropathy and aromaticity to be the major factors for amino acid variation in T. whipplei proteins. The first axis, therefore, separates the genes encoding proteins having high values of mean hydropathy and mean aromaticity on the left and the others on the right side of that axis (Fig. 2). Furthermore, in the TW08/27 genome low-hydrophobic and low-aromatic proteins display positive values on axis 2, whereas in the Twist genome such proteins are represented by negative values on the same axis (Fig. 2).

Table 3 Major trends in amino acid usage, as revealed by COA on amino acid frequencies of the encoded proteins

It is worth noting that there are two distinct clusters of proteins near the left end of axis 1 (Fig. 2). A careful investigation reveals that the small cluster contained the genes for membrane-associated proteins including WiSP family members and a few hypothetical proteins, whereas genes coding for integral membrane proteins, several transporters, subunits of cytochrome c, translocase protein TatC, etc., are present in the large cluster. Although the genes present in both clusters are mainly membrane-associated proteins, the amino acid usage profiles exhibited by them are quite distinct from one another. Membrane proteins are known to have very different amino acid compositions (Lobry and Gautier 1994; Pascal et al. 2005). The products of the genes present in the large cluster are significantly overrepresented (p < 0.001) by Phe, Leu, Cys, Trp, Arg, Ile, Met, and Ala, whereas Ser, Tyr, Pro, Gln, Thr, Asn, Lys, and Asp present in significantly higher amounts (p < 0.001) in the proteins encoded by small cluster genes (Table 4).

Table 4 Amino acid usage and mean value of potential secondary structures of membrane-associated proteins in the large and small clusters for the T. whipplei TW08/27 and T. whipplei Twist genomes

These two groups of proteins also differ in their predicted secondary structures. For the members of the small cluster, most of which are WiSP family proteins, propensities for formation of ß-sheet structures are much higher than that for formation of α-helices, while the proteins encoded by the large cluster genes have higher propensities for formation of α-helix structures than ß-sheets (Table 4). Proteins of these two clusters also differ in content of potentially disordered structures. Disordered regions in proteins can be predicted by the lack of regular secondary structures and increased flexibility in the polypeptide chain, whereas ordered regions (often termed globular) typically contain regular secondary structures packed into a compact globule (Linding et al. 2003; Fuxreiter et al. 2004). As the probable coil-forming regions are significantly higher in the small cluster proteins in both strains of T. whipplei (Table 4), disordered structures are more commonly found in the proteins of the small cluster compared to those comprising the large cluster (Fig. 3). Recent investigations have indicated that disordered structures are usually more favored by proteins involved in regulatory functions and binding of various ligands (Fuxreiter et al. 2004; Fink 2005). Therefore, it may be presumed that the proteins in the small cluster, which might play important roles in interactions with the host and/or immune evasion (Bentley et al. 2003; Raoult et al. 2003), would be overrepresented by disordered structures. Members of the other cluster, containing fewer disordered regions and exhibiting higher propensities for α-helical regions, are primarily involved in transport and other membrane-associated processes. Most of the variations in cell surface proteins might be potentially important for T. whipplei to evade the host’s immune response and probably evolved during the course of parasitic adaptation. Compositions of the membrane-associated proteins of these two clusters are, therefore, distinguished from each other in a fine coordination among function, structure, and stability.

Figure 3
figure 3

GlobPlot of the membrane-associated proteins encoded by genes taken from large and small clusters generated by COA on RAAU for the T. whipplei TW08/27 genome. Plots in the left and right panels represent large and small clusters, respectively. Black color indicates disordered regions (lack of regular secondary structures) and gray color indicates ordered (globular) regions that typically contain regular secondary structures.

Another factor that influences the variations in amino acid usage is gene expressivity, as indicated by the presence of most of the potential highly expressed genes near the positive extreme of the axis 3 (Figs. 2c and d). Recently it was reported that a decrease in the average size/complexity quotient of encoded proteins with a potentially high expression level points toward a decrease in biosynthetic cost of highly expressed gene products (Das et al. 2005c). The strong negative correlation between the third major axis and the average size/complexity quotient of encoded proteins (Table 3) suggests that the highly expressed genes in T. whipplei have a tendency to avoid the heavier residues including the aromatic ones. This supports the cost minimization hypothesis (Seligmann 2003), which claims that highly expressed genes tend to use small and energetically less expensive amino acids in their encoded proteins. Considering the fact that both strains of T. whipplei are obligatory intracellular parasites, which lack the enzymes required for amino acid biosynthesis and obtain all amino acid residues by salvage from the host (Bentley et al. 2003; Raoult et al. 2003), it is surprising to see that they follow the cost minimization hypothesis. However, recently Peixoto et al. (2004) mentioned that parasitic organisms may follow the cost minimization hypothesis as a strategy for genome-level adaptation to the host environment, as utilization of the less expensive and smaller residues by the highly expressed genes can minimize the energy exhaustion of the host and help them thereby to maintain the sustained infection, having the least chance of elimination by the host.

Relation Among Gene Expression, Protein Conservation, and Amino Acid Usage

To understand the possible effect of gene expression on amino acid usage, we compared the average RAAU profiles of putative highly and lowly expressed gene products of T. whipplei (Fig. 4). Depending on the GC content of the nonsynonymous codon positions of the respective codons, the amino acid residues may be classified into three groups: (i) GC-rich (Gly, Ala, Arg, and Pro), (ii) AU-rich (Phe, Tyr, Met, Ile, Asn, and Lys), and (iii) unbiased amino acids (the rest) (Foster et al. 1997). On average, the codons, which are AU-rich at nonsynonymous sites, tend to encode relatively expensive amino acids (Akashi and Gojobori 2002). It is interesting to note that in both strains of T. whipplei, except for Pro, all residues encoded by GC-rich codons are present in significantly higher amounts (p < 0.001) in highly expressed gene products, while usages of the residues encoded by all AU-rich codons, except Lys, are significantly higher (p < 0.001) in lowly expressed gene products (Fig. 4). Increased usage of GC-rich or decreased usage of AU-rich codons in highly expressed genes of T. whipplei is supported by the significant positive correlation of GC1+2 with the coordinates of axis 3 of COA on RAAU (Table 3). This observation is in accordance with the studies on other intracellular bacteria, such as Buchnera, Blochmannia floridanus, and Wigglesworthia, exhibiting higher usage GC-rich amino acids in high-expression genes (Schaber et al. 2005).

Figure 4
figure 4

Difference in relative amino acid usage of highly and lowly expressed genes of T. whipplei TW08/27 (filled bars) and T. whipplei Twist (open bars). The differences were derived as DHL = [(Freq.H/Freq.L) − 1] × 100}.

Both the genomes under study have a relatively low G + C-content (∼46%) compared to other actinomycete species, and hence, a question arises: Why do the highly expressed genes of these species exhibit higher usage of the residues encoded by G + C-rich codons? This might have happened for either of the following two reasons: (i) the highly expressed genes of T. whipplei are more conserved at the amino acid level than their lowly expressed counterparts, and hence, they have retained a GC-richer composition which is closer to their putative ancestral state, or (ii) all genes are undergoing substitutions at a comparable rate irrespective of their level of expression, but due to some functional advantages, the highly expressed genes exhibit a positive selection in favor of the residues encoded by GC-rich codons. In order to examine which of these two possibilities is more likely to be true, we compared the estimated pairwise nonsynonymous divergences (dN) between all orthologues of putative highly and lowly expressed genes preidentified from the two extremes of axis 1 generated by COA on RSCU of the leading strand genes of T. whipplei TW08/27 and T. whipplei Twist. The mean dN for highly expressed genes (0.021) is significantly lower (t-test, p < 10−4) than that for lowly expressed genes (dN = 0.057), suggesting that the amino acids of highly expressed genes are more conserved in T. whipplei. A similar inference was drawn previously in several organisms like E. coli, S. typhimurium, and P. falciparum (Sharp 1991; Chanda et al. 2005). It may be inferred therefore that in T. whipplei, the overall protein conservation in highly expressed genes is much higher and it also influences the discrimination in amino acid usage between highly and lowly expressed genes. The enhanced usage of the residues encoded by GC-rich codons in the highly expressed gene-products might have been possible due to more conservation at the amino acid level in highly expressed genes than the low-expression genes and hence, they retain GC-rich compositions which are closer to their putative ancestral state.

In summary, the present study reveals that the patterns of synonymous codon and amino acid usage in T. whipplei are the result of several factors. As observed earlier in many other species characterized by genome reduction and overall A + T-richness (McInerney 1998; Romero et al. 2000; Das et al. 2005a), in T. whipplei also, replicational-transcriptional selection coupled with asymmetric mutational bias is the primary cause of intragenomic variations in codon usage pattern. There is a residual intrastrand bias in synonymous codon usage by the highly expressed genes, possibly due to the presence of translational selection. However, the influence of strand-specific mutational pressure is not as pronounced at the level of amino acid usage. Hydrophobicity and aromaticity seem to be the major sources of variation, both having a nearly equal influence on amino acid usage. In spite of the intracellular lifestyle, the amino acid preferences in highly expressed gene products of T. whipplei follow the cost minimization hypothesis. Another interesting finding is that the products of the highly expressed genes prefer to use the residues encoded by GC-rich codons, although the T. whipplei genome, on average, has only a 46% G + C content. The analysis presented here indicates that this might be due to greater conservation of a relatively GC-rich ancestral state in the highly expressed genes. Even the energetic cost of amino acid residues (Akashi and Gojobori 2002) might play a significant role in retaining the GC-rich residues in highly expressed genes. The study also sheds light on the diverse compositional and structural characteristics of two groups of membrane-associated proteins that might play distinct roles in host interactions.