Background

Investigations at the sequence level showed that the genomes of multicellular eukaryotes are compartmentalized in mosaics of isochores that belong to a small number of families characterized by different GC levels and dinucleotide frequencies [1-6]. These findings confirmed and extended previous investigations (originally using density gradient ultracentrifugation [7-9]) carried out by our laboratory over many years (see [6] for a review).

The results available so far support the idea of isochores being a “fundamental level of genome organization” [10] not only in vertebrates but also in the other multicellular eukaryotes analyzed. Indeed, as established by our previous work, not only gene distribution, but also chromatin structure, short sequence frequencies, DNA methylation, gene expression, replication timing and recombination are the main structural and functional properties associated with isochore families of all multicellular eukaryotes explored so far. We also proposed that the large conservation of GC levels and dinucleotide frequencies of isochore families reflects the conservation of chromatin structures, whereas the conservation of isochore size might be due to the role played by isochores in chromosome structure and replication [2, 11]. These results stress the interest of understanding the structure and the evolution of compositional patterns in unicellular eukaryotes.

Some early results indicated that the nuclear genome of Euglena gracilis and the macro-nuclear genome of Tetrahymena pyriformis were remarkably homogeneous in base composition, while the nuclear genome of Saccharomyces cerevisiae showed a slight heterogeneity [8]. Later work based on sequenced yeast chromosomes showed that some of them consist of alternating large domains of GC-rich and GC-poor DNA [12-14], generally correlating with a variation in gene density. More recent work showed that in yeast GC-rich and GC-poor isochores are different in chromatin conformation, histone modification and transcription; more precisely, GC-rich isochores have a more extended chromatin conformation, different levels of histone acetylation and more highly expressed GC-rich genes [15].

In the case of Plasmodium falciparum, the unicellular parasite responsible for the most virulent and widespread form of human malaria, a striking feature is that it hosts the GC-poorest (19.4% GC) nuclear genome known so far [16, 17]. In Plasmodium cynomolgi, a compositional compartmentalization was demonstrated in the nuclear DNA, which consists of DNA segments likely to average 100 kb [18].

The DNAs from both Trypanosoma brucei and Trypanosoma equiperdum (two closely related trypanosomes [19, 20]) showed a bimodal distribution characterized by two major peaks banding at 1.702-1.703 and 1.707-1.708 g/cm³ in CsCl density gradients and representing 1/3 and 2/3 of total DNA, respectively; a number of minor components were also detected, corresponding to satellite DNAs and possibly to ribosomal DNA [21].

In conclusion, the results on yeast, Plasmodia and Trypanosomes indicated that a compositional compartmentalization was present not only in the genomes of metazoans and plants, but also in those of unicellular eukaryotes. These findings encouraged us to extend our investigations to other unicellular eukaryotes.

Other important aspects, indicative of a wide genomic diversity, are worth mentioning: 1) The range of genome sizes of unicellular eukaryotes (8.7 Mb to 357 Mb, a 41-fold range; [22]) is even broader than that of metazoans (from 94.4 Mb to 3000 Mb, a 32-fold range, neglecting cases of polyploidy; [4, 5]). 2) The range of average GC levels of the genomes of unicellular eukaryotes is as broad as that of prokaryotes [23, 24]. 3) The chromatin structure of unicellular eukaryotes may be organized differently from that of multicellular eukaryotes. For example, Saccharomyces cerevisiae lacks histone H1; similarly, although Trypanosomes have histone H1, this protein is quite divergent and chromatin does not reach high levels of compaction during mitosis. 4) The environmental conditions under which unicellular eukaryotes live are much more diverse than those of vertebrates and also of invertebrates. 5) Unicellular eukaryotes lack the very complex regulatory system involved in the developmental process of multicellular eukaryotes.

All these considerations prompted us to tackle the analysis of compositional organization in unicellular eukaryotes. Here we approached these problems by studying the genomes of representative species from all the so-called “supergroups” of unicellular eukaryotes.

Results

In this work we studied the compositional organization in representative species from all the so-called eukaryotic “supergroups” (see also Additional file 1: Table S1 and refs. [25, 26]). In Additional file 2: Figure S1 we report the phylogenetic distribution of the unicellular species analyzed in the present work [24].

Green and red algae (Ostreococcus tauri and Cyanidioschyzon merolae, respectively) represent Plantae. The supergroup Amoebozoa is represented by the slime mold Dictyostelium discoideum. In the supergroup Chromalveolata we analysed species from the four main groups. Two diatoms (Thalassiosira pseudonana and Phaeodactylum tricornutum) represent the Stramenopiles. For the Apicomplexans (which include parasitic species in mammals) we analysed the human pathogen Toxoplasma gondii and the malarial parasites Plasmodium berghei, Plasmodium chabaudi, Plasmodium knowlesi, Plasmodium falciparum and Plasmodium vivax. The Cryptophyta group is represented by Guillardia theta, while for the last group, Ciliates, the analysis was only partial due to the fragmented genome assembly that is available for this group. The Excavata supergroup is represented by two Kinetoplastids (Trypanosoma brucei and Trypanosoma cruzi), while in the Fornicata group the species analysed was Giardia lamblia, although in this case too the analysis was only partial due to the incompleteness of the assembled genome. Finally, the supergroup Opisthokonta (which also includes animals) is represented here by several unicellular fungi: Saccharomyces cerevisiae, Candida glabrata, Ashbya gossypii and Cryptococcus neoformans.

The different groups of organisms studied here exhibit a diversity of genome compositional patterns that range from very weak to very strong compartmentalization (see Table 1).

Table 1 Average GC (A) and relative amounts (B) in percentage of components from unicellular eukaryotes

The results obtained in this work indicate that unicellular eukaryotes encompass a wide range of situations in terms of genomic composition and heterogeneity. In the first group, namely Algae, the green alga Ostreococcus tauri and the red alga Cyanidioschyzon merolae showed very GC-rich genomes (see Figure 1). In the first case, DNA was centered at 59-60% GC, in the second at 55-56% GC, with a smaller component at 52-53% GC. Both diatoms analyzed, Thalassiosira pseudonana and Phaeodactylum tricornutum, showed GC-rich genomes consisting of components that were centered at 47% and 49% GC, respectively (Figure 1).

Figure 1

Distribution by weight of DNA segments according to GC levels in the green alga O. tauri, in the red alga C. merolae and in the diatoms T. pseudonana and P. tricornutum.

The genomes of fungi exhibited very different GC ranges (see Figure 2). Indeed, Saccharomyces cerevisiae and Candida glabrata showed GC-poor genomes, essentially consisting of DNA components centered at 38-39% GC, which were accompanied in the case of C. glabrata by a minor component ranging from 34% to 38% GC and also by a very minor GC-richer component in the 42-46% GC range. In contrast, the other two fungi analyzed showed GC-richer genomes: Ashbya gossypii comprised two GC-rich components, the first one centered at about 52-53% GC, the second one centered at 55% GC, whereas Cryptococcus neoformans exhibited one component centered at 48-49% GC.

Figure 2

Distribution by weight of DNA segments according to GC levels in the fungi S. cerevisiae, C. glabrata, A. gossypii and C. neoformans.

Protists are an exceptionally diverse group from a phylogenetic viewpoint. Indeed, the genome-wide distances and times of divergence between two protozoan groups are many times larger than those of the most divergent metazoans. In this work we have studied species that are representative of all major groups, including two well-known groups of human parasites, Trypanosomatids and Plasmodia. As far as the first of these two groups is concerned, it is interesting to note that Trypanosoma brucei and Trypanosoma cruzi exhibited GC-rich genomes (Figure 3). In particular, the genome of the first one was essentially formed by a component centered at 48% GC and by minor GC-poorer ones; the second one showed two main components, the first centered at 48% GC and the second, smaller one at 54% GC.

Figure 3

Distribution by weight of DNA segments in the protists T. cruzi, T. brucei, T. gondii and D. discoideum.

As far as the second group is concerned (see Figure 4), the situation is more striking: the malaria parasite Plasmodium vivax exhibits a genome covering a broad compositional spectrum (28%-55% GC) with two major components centered at about 44% and 49% GC, whereas, in exceedingly sharp contrast, Plasmodium chabaudi, Plasmodium berghei and Plasmodium falciparum, which have genome sizes very close to that of P. vivax, showed very GC-poor genomes with single major components centered at 24%, 22% and 19.4% GC, respectively. Only the P. falciparum genome showed some minor components ranging from 20% to 32% GC. Plasmodium knowlesi exhibited a genome pattern intermediate between those of P. falciparum and P. vivax, with two major components centered at about 39% and 43% GC as well as a smaller component at 35% GC. All the DNA components from unicellular genomes were grouped in families according to their GC levels, as reported in Table 1.

Figure 4

Distribution by weight of DNA segments in the protists P. vivax, P. knowlesi, P. chabaudi, P. berghei and P. falciparum.

The genome of the parasitic protist Toxoplasma gondii consisted of one major component centered at 52% GC and a smaller component at 55% GC, whereas that of the amoeba Dictyostelium discoideum showed one major component centered at 28% GC (Figure 3).

Unfortunately, only contigs/scaffolds were available for the genomes of the unicellular eukaryotes listed in Table 2 (see Additional file 2: Figure S1). In these cases, we analysed the contigs/scaffolds larger than 100 kb, which represented a large percentage of the available sequences, as shown in Additional file 3: Table S2. Several of these genomes covered some missing taxa (at the group classification level, see Additional file 2: Figure S1), such as ciliates, while others belong to taxa for which a complete analysis was done in other species from the same group (like Stramenopiles). These genomes covered a wide GC spectrum, ranging from the very GC-poor genome of Tetrahymena thermophila to the very GC-rich genome of P. sojae (as reported in Figure 5).
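The size filter described above can be illustrated with a minimal sketch. It assumes that scaffold lengths are already available as a list of integers in base pairs; the function name `large_scaffold_summary` and the returned field names are hypothetical and are not taken from the scripts used in this work.

```python
def large_scaffold_summary(lengths_bp, min_len=100_000):
    """Summarize how much of an assembly lies in scaffolds larger than min_len.

    lengths_bp: scaffold/contig lengths in base pairs (illustrative input).
    min_len:    size threshold in bp; 100 kb is the cutoff quoted in the text.
    """
    total = sum(lengths_bp)
    # Keep only scaffolds strictly larger than the threshold.
    kept = [n for n in lengths_bp if n > min_len]
    kept_total = sum(kept)
    percent = 100.0 * kept_total / total if total else 0.0
    return {
        "n_kept": len(kept),
        "kept_bp": kept_total,
        "total_bp": total,
        "percent_kept": percent,
    }


if __name__ == "__main__":
    # Toy assembly with three scaffolds, one of them below the cutoff.
    print(large_scaffold_summary([50_000, 150_000, 300_000]))
```

The percentage returned is the fraction of the assembled length retained for analysis, which is the kind of figure a table such as Table 2 would summarize.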

Table 2 GC content, number of contigs/scaffolds and their total lengths in megabases (Mb), and lengths of scaffolds < 100 kb and > 100 kb with their percentages of the total length

Figure 5

The amounts of DNA in megabases (Mb) for contigs/scaffolds of unicellular genomes listed in Table 2 pooled in bins of 0.5% GC.

The extreme contrast between the compositional patterns of P. vivax and P. falciparum prompted us to analyze (using about 4,000 orthologous genes) the compositional distributions of GC, GC1, GC2 and GC3, as well as the correlations between the GC levels of the three codon positions. The first analysis (Figure 6) showed, as expected, a strong shift of the distributions towards lower values from P. vivax to P. falciparum, reaching a complete absence of overlap in the case of GC3. The second analysis (Figure 7) showed a very significant correlation (coefficient 0.50-0.51) for GC1 vs. GC2, as expected from the universal correlation of D’Onofrio and Bernardi [27]. In contrast, the correlations of GC1 and GC2 with GC3 were weaker in P. vivax (0.39 and 0.22, respectively) and very weak or absent in P. falciparum (0.03 and 0.12, respectively), a result likely to be linked to the extremely low values and narrow distribution of GC3.
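For readers unfamiliar with the quantities GC, GC1, GC2 and GC3, the following minimal sketch shows how they may be computed for a single coding sequence. It is an illustration under stated assumptions, not the script used in this work: the function name `codon_position_gc` is hypothetical, the sequence is assumed to be in frame and to contain at least one codon, and the stop codon is counted as given.

```python
def codon_position_gc(cds):
    """Return GC percentages overall and at each of the three codon positions.

    cds: an in-frame coding sequence (string of A/C/G/T), at least one codon long.
    """
    s = cds.upper()
    # Drop any trailing bases that do not form a complete codon.
    s = s[: len(s) - len(s) % 3]
    result = {}
    for pos in range(3):
        # Every third base, starting at codon position pos + 1.
        bases = s[pos::3]
        result[f"GC{pos + 1}"] = 100.0 * sum(b in "GC" for b in bases) / len(bases)
    result["GC"] = 100.0 * sum(b in "GC" for b in s) / len(s)
    return result


if __name__ == "__main__":
    print(codon_position_gc("ATGGCCTAA"))
```

Applying such a function to each gene of an orthologous set yields the per-gene values whose distributions and pairwise correlations are discussed above.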

Figure 6

The histograms show the distributions of GC, GC1, GC2 and GC3 for the coding sequences of a set of about 4,000 orthologous genes for P. vivax and P. falciparum.

Figure 7

Scatterplots of GC, GC1, GC2 and GC3 among themselves for orthologous genes for P. vivax and P. falciparum. The orthogonal regression equations, the correlation coefficient (R) and the number of genes (N) are reported. The main diagonal is indicated by a broken line.
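As an illustration of the quantities reported in the scatterplots, the sketch below fits an orthogonal regression line (minimizing perpendicular distances, with equal error variance assumed on both axes) and computes the Pearson correlation coefficient. This is the standard closed-form solution and is offered only as an assumption about what "orthogonal regression" denotes here; the function name `orthogonal_regression` is hypothetical.

```python
import math


def orthogonal_regression(x, y):
    """Return (slope, intercept, r) for paired values x and y.

    slope and intercept describe the line minimizing perpendicular
    distances to the points; r is the Pearson correlation coefficient.
    Assumes x and y both vary (non-zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxy == 0:
        raise ValueError("slope is undefined when the covariance is zero")
    # Closed-form slope of the orthogonal (total least squares) fit.
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


if __name__ == "__main__":
    # Points lying exactly on y = 2x + 1.
    print(orthogonal_regression([1, 2, 3, 4], [3, 5, 7, 9]))
```

Unlike ordinary least squares, this fit treats the two axes symmetrically, which is a natural choice when both variables (for example GC1 and GC3) are measured quantities.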

Discussion

The results just reported clearly show that the genomes of unicellular eukaryotes range from narrow compositional distributions, as in the case of O. tauri, T. pseudonana, C. neoformans, P. falciparum, P. berghei and P. chabaudi, to more heterogeneous patterns, such as those of S. cerevisiae and T. brucei, while in several other species, such as P. knowlesi, P. vivax and T. cruzi, the heterogeneity is remarkable. These observations deserve some general comments (in addition to those already made in the preceding section).

Several findings are very striking when compared with both vertebrate and invertebrate genomes. Even if the number of genomes is admittedly modest, a first observation is that free-living unicellular organisms generally show narrower compositional distributions with only minor additional components (S. cerevisiae and A. gossypii, the latter showing, however, a slightly wider compositional range; 52%-55% GC). This narrow distribution is centered, however, on very different GC levels, which range from 38%-40% GC for the two yeasts to almost 60% GC for the green alga O. tauri. Obviously, it would be interesting to correlate these very different compositions with environmental factors. This seems, however, to be possible only for C. merolae, in which case the high GC level (55% GC) might be related to the hot acid springs (45°C; pH 2.0) of its habitat. This idea is supported by our previous findings that high GC levels are correlated with high body temperatures in vertebrates and with high optimal growth temperatures in bacteria (see [6] for a review). Interestingly, protein divergence between Galdieria sulphuraria, which lives, like C. merolae, in hot springs, and Galdieria phlegrea, which lives in a less extreme habitat (i.e. moderate pH and temperature), is similar to that between human and medaka [28].

In contrast, parasitic unicellular organisms show some striking features, namely that within the same genus one species may have a wide compositional distribution (this is the case of T. cruzi and of P. vivax) and other ones have a very narrow distribution (P. falciparum, P. berghei and P. chabaudi). These results are highly suggestive of compositional adaptation. Needless to say, it would be of great interest to identify the causes for such adaptations, especially since recent results [29] reported a lack of synteny among Apicomplexa due to genome rearrangements.

The compositional compartmentalization of some genomes of unicellular eukaryotes is possibly linked to a different chromatin structure and different regulation of gene expression. The results of Table 1 also show something of great potential interest, namely that, apart from the extreme cases of P. falciparum and O. tauri, the GC values for the single or multiple DNA components are very close to those previously found for the isochore families of vertebrates and invertebrates. This might be a coincidence, but might also be linked to specific features of chromatin structures. Needless to say, it would also be very interesting to consider whether genes characterized by specific functions are differentially distributed in the two major families exhibited by T. brucei and P. vivax, respectively.

At this point, it is worthwhile mentioning that an intrachromosomal compositional heterogeneity was also found in prokaryotic genomes [30]. In fact, while most prokaryotic species tested are compositionally homogeneous, a minority are rather heterogeneous in composition; this heterogeneity appears, however, to be associated with recent lateral transfers.

Conclusions

Previous results on the genomes from a small number of unicellular eukaryotes provided the first indication that a compositional compartmentalization was present not only in the genomes of multicellular eukaryotes, but also in those of some protozoa. The findings presented here revealed that situations of compositional compartmentalization covering a very broad range were generally present in unicellular eukaryotes. Even if the sample of organisms investigated is admittedly modest, this point is clearly demonstrated. This distinguishes eukaryotes, which always show compartmentalized genomes, from prokaryotes, in which compositional heterogeneity is exceedingly rare and possibly always associated with recent lateral transfers.

The results presented here, together with previous observations (like those already mentioned for the budding yeast), lead us to suggest that genome compartmentalization is a very general feature of all eukaryotes. Different levels of compartmentalization are probably linked with increasing regulatory complexity and/or other functional requirements to which organisms are bound. This idea is in line with a more general notion in biology concerning the role of compartmentalization as a fundamental way to organize structure and function at all levels, from the organ level down to the cellular and genome level.

We consider two additional conclusions as preliminary; if confirmed by investigations on a larger sample, however, they would be of very great interest. The first one concerns the differences found between free-living and parasitic unicellular eukaryotes. The second one concerns the fact that the GC levels found in unicellular eukaryotes are very close (with two exceptions) to those of isochore families from multicellular eukaryotes. Indeed, the first point suggests compositional adaptation of the genomes of parasitic unicellular organisms, the second a correlation with chromatin structure.

Methods

Genome and gene sequences: the resources

The sequences of unicellular genomes as well as those of the genes analyzed in this study were downloaded from different websites (see Additional file 3: Table S2). Partial, putative, synthetic construct, predicted, not experimental, hypothetical protein, rRNA, tRNA, ribosomal and mitochondrial genes were eliminated, and the cleanup program [31], which rids nucleotide sequence databases of redundancies, was then applied. For the remaining genes, a script implemented by us was used to identify the coding sequences beginning with a start codon and ending with a stop codon. The coordinates of genes on the chromosomes were retrieved from the website used for downloading the chromosomes.
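The check for complete coding sequences can be sketched as follows. This is not the script used in this work; it is a minimal illustration that assumes the standard genetic code, treats only ATG as a start codon by default, and additionally requires the length to be a multiple of three. These three choices, and the name `is_complete_cds`, are assumptions.

```python
# Stop codons of the standard genetic code (an assumption; some protists
# use variant codes in which one or more of these encode amino acids).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(seq, start_codons=("ATG",)):
    """Return True if seq begins with a start codon and ends with a stop codon."""
    s = seq.upper()
    return (
        len(s) >= 6              # at least a start and a stop codon
        and len(s) % 3 == 0      # in-frame length (extra assumption)
        and s[:3] in start_codons
        and s[-3:] in STOP_CODONS
    )


if __name__ == "__main__":
    candidates = ["ATGGCCTAA", "ATGGCCTA", "GCCGCCTAA"]
    print([c for c in candidates if is_complete_cds(c)])
```

For organisms with a variant genetic code, the set of stop codons would need to be adjusted before applying such a filter.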

Compositional patterns: methodology and nomenclature

The entire chromosomal sequences of the finished genome assembly were partitioned into non-overlapping windows, and their GC levels were calculated using the program draw_chromosome_gc.pl [32, 33]. The general methodology used to map DNA segments on unicellular genomes was that described for the isochore maps of vertebrate [1] and invertebrate genomes [4, 5]. It should be stressed that this methodology tends to overestimate compositionally homogeneous regions, because the standard deviation tends to decrease with increasing size of the regions. Because of the small chromosome sizes of several unicellular genomes under analysis, we used a non-overlapping window of 25 kb, a size suitable for all the unicellular genomes. The GC levels of compositionally nearly homogeneous DNA segments were calculated using a script implemented by us. The sequences of contigs/scaffolds for unicellular genomes reported in Table 2 were downloaded from Ensembl Genome Browser (http://protists.ensembl.org/).
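The windowing step can be illustrated with a minimal sketch. It does not reproduce draw_chromosome_gc.pl; the function name `window_gc`, the decision to drop a trailing partial window, and the exclusion of ambiguous bases from the denominator are all assumptions made for illustration.

```python
def window_gc(sequence, window=25_000):
    """Return the GC percentage of each full non-overlapping window.

    sequence: chromosome sequence as a string.
    window:   window size in bp (25 kb is the size quoted in the text).
    A trailing partial window is dropped (assumption).
    """
    seq = sequence.upper()
    levels = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        acgt = gc + chunk.count("A") + chunk.count("T")
        # Windows made only of ambiguous bases (e.g. N) carry no signal.
        levels.append(100.0 * gc / acgt if acgt else None)
    return levels


if __name__ == "__main__":
    # Toy example with a 4-bp window instead of 25 kb.
    print(window_gc("GGCCAATTGCAT", window=4))
```

Consecutive windows with similar GC levels can then be merged into compositionally nearly homogeneous segments, whatever merging criterion is adopted.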

In order to demonstrate that the different compositional patterns found were not an artifact due to the small window used (25 kb), we analyzed two unicellular genomes showing a strong compositional heterogeneity using two different non-overlapping windows. Additional file 4: Figure S2A-B displays the compositional profiles of T. brucei and P. vivax at windows of 25 kb and 100 kb. The results clearly demonstrate that the levels of heterogeneity at 25 kb were barely larger than at 100 kb.
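One simple way to compare heterogeneity at two window sizes is to compute the standard deviation of the window GC levels at each size. The sketch below does this on a toy sequence; using the population standard deviation as the heterogeneity measure is an assumption, as are the function names `gc_levels` and `heterogeneity`.

```python
from statistics import pstdev


def gc_levels(sequence, window):
    """GC percentage of each full non-overlapping window (partial tail dropped)."""
    seq = sequence.upper()
    out = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        out.append(100.0 * (chunk.count("G") + chunk.count("C")) / window)
    return out


def heterogeneity(sequence, window):
    """Population standard deviation of window GC levels (assumed measure)."""
    return pstdev(gc_levels(sequence, window))


if __name__ == "__main__":
    # Toy sequence: alternating 100-bp blocks of pure G and pure A.
    toy = ("G" * 100 + "A" * 100) * 4
    for size in (100, 200):
        print(size, heterogeneity(toy, size))
```

In this deliberately extreme toy case the alternation is fully resolved by the small window and completely averaged out by the large one. Real genomes with domains much longer than both window sizes would instead give similar values at the two sizes.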

Additional file 5: Tables S3-S20 report the coordinates, sizes and GC levels of the segments identified in the genomes. When these segments were pooled in bins of 0.5% GC, families of segments were found according to their average GC levels. Table 1 reports the average GC levels and the relative amounts from these families. For the sake of comparison, Table 1 also shows the average GC levels calculated for the different isochore families of vertebrates [4] and invertebrates [5].
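The pooling of segments into 0.5% GC bins can be sketched as follows, assuming each segment is given as a (size in bp, GC percentage) pair. The function name `pool_by_gc` and the convention of labelling each bin by its lower edge are hypothetical choices for illustration.

```python
import math
from collections import defaultdict


def pool_by_gc(segments, bin_width=0.5):
    """Sum segment sizes within GC bins of the given width.

    segments: iterable of (size_bp, gc_percent) pairs.
    Returns a dict mapping the lower edge of each bin to the total bp in it.
    """
    pooled = defaultdict(int)
    for size_bp, gc in segments:
        # Lower edge of the bin containing this GC level.
        edge = math.floor(gc / bin_width) * bin_width
        pooled[edge] += size_bp
    return dict(sorted(pooled.items()))


if __name__ == "__main__":
    toy_segments = [(25_000, 48.2), (50_000, 48.4), (25_000, 48.7)]
    pooled = pool_by_gc(toy_segments)
    total = sum(pooled.values())
    for edge, bp in pooled.items():
        print(f"{edge:.1f}% GC: {bp} bp ({100.0 * bp / total:.1f}% of total)")
```

Weighting by segment size rather than by segment count gives the distribution by amount of DNA, from which peaks corresponding to families of segments can be read.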

As far as the name of each DNA segment is concerned we used a convention in which the first number in the name represents the chromosome number, the following two letters are the initials of the scientific name of the species under consideration, and the last number identifies the fragment.
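A minimal sketch of this naming convention is given below. The exact formatting (no separators, capitalized genus initial followed by a lower-case species initial) is an assumption, since only the order of the three parts is stated in the text; the function name `segment_name` is hypothetical.

```python
def segment_name(chromosome, species, fragment):
    """Build a segment name: chromosome number, species initials, fragment number.

    species: binomial scientific name, e.g. "Plasmodium vivax".
    The concatenation without separators is an assumed format.
    """
    genus, epithet = species.split()[:2]
    initials = genus[0].upper() + epithet[0].lower()
    return f"{chromosome}{initials}{fragment}"


if __name__ == "__main__":
    print(segment_name(3, "Plasmodium vivax", 12))
```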