Gene Organization Features in A/T-Rich Organisms

Szafranski, Karol; Lehmann, Rüdiger; Parra, Genis; Guigo, Roderic; Glöckner, Gernot

doi:10.1007/s00239-004-0201-2

Published: January 2005

Volume 60, pages 90–98, (2005)

Journal of Molecular Evolution




Abstract

Several species have genomes in which the four nucleotides are not equally represented (Glöckner 2000). Interestingly, shifts to very high A/T or G/C levels can occur in several distinct branches of the tree of life. The underlying reasons for these shifts therefore may be of different origin. Now entire chromosome sequences from two different A/T-rich genomes, Dictyostelium discoideum and Plasmodium falciparum, are available (Bowman et al. 1999; Gardner et al. 2002; Glöckner et al. 2002). This gives us the opportunity to investigate how a high A/T content may influence the signals that are the landmarks for gene specification. We found that, in contrast with most known metazoan and plant genomes, splice signals contain little information other than the canonical GT–AG dinucleotides. Intron lengths in A/T-rich organisms, on the other hand, are comparable to those of other lower eukaryotes. Intergenic regions show, dependent on the orientation of adjacent genes, a size pattern with a ratio of 1 (3′–3′) to 2 (3′–5′) to 3 (5′–5′). Overall, gene organization patterns seem not to be influenced by the A/T bias. Surprisingly, the slightly higher A/T content of the P. falciparum genome compared to that of D. discoideum (80.1 versus 77.4%) is not achieved by increased A/T richness in intergenic regions. Instead, both the shift of the nucleotide usage in coding regions to A/T-rich codons and the longer intergenic regions make an equal contribution to the higher A/T content in this organism.


Introduction

To date several genomes in diverse evolutionary branches have been sequenced and many more will follow. Yet compared to their richness of diversity, lower eukaryotes are underrepresented in this selection. Thus, we are only at the dawn of the understanding of genomic organization in this large group of organisms. Here we attempt to describe common genomic features of lower eukaryotes with comparably high A/T contents. The different theories on why nucleotide biases in genomes occur were reviewed by Glöckner (2000).

The extraction of the genomic information content of the cell, in particular, the localization and structure of protein coding genes, is a challenge since no simple rules exist for this purpose. Even though the mechanisms by which gene structures are specified in eukaryotic genomes appear to be quite general, each genome seems to have its singular signature for gene recognition that was and is shaped by selective pressures during evolution (Gentles and Karlin 2001). This implies that gene structures and the transcription and translation apparatus in each species are coadapted so as to enable appropriate transcription and translation of each gene in the cell. Deviations from a given signature may pinpoint sequence domains and genes that are introduced into a given genome from a foreign source (Karlin 2001). Thus, analysis of gene signatures of a species’ genome not only provides tools for the most accurate prediction of gene structures but also may shed light on the common mechanisms underlying gene specification and its evolution. In addition, such signatures may be useful to resolve conflicting phylogenies and to track down horizontal gene transfers, genetic drift, and other evolutionary events.

Despite the sometimes obvious differences between foreign genes and the original genes of an organism, the cellular machinery is able to transcribe, translate, and use these genes seemingly without difficulties. Only in highly biased genomes are artificially introduced genes with no adaptation to the host not expressed (Fuhrmann et al. 1999). Computed probabilities for the existence and structure of a gene rely on the similarity of the gene structure to be detected to well-defined gene structures of a given organism. This statistical approach entails the prediction of false positives and the inability to detect some true genes. The ability to correctly use all the information content of a genome despite deviations of particular genes from the mean signatures distinguishes the cellular recognition of genes from gene detection by bioinformatic means.

Gene detection in prokaryote genomes is comparably easy. Here the open reading frames (ORFs) framed by appropriate translation start and stop signals very often equal the genes themselves. Yet overlapping genes and the potential of small ORFs to be meaningful for the cell are obstacles to a straightforward strategy for gene detection in prokaryotes. Thus, even in the simplest systems not all true genes can be detected and not all predicted genes are biologically meaningful. Therefore, gene prediction can only lead to an approximation of gene content and genome complexity.
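The idea that a prokaryotic gene often coincides with an ORF bounded by start and stop signals can be illustrated with a short Python sketch. This is only a toy scanner and not one of the programs used in this study: the function name `find_orfs`, the restriction to the forward strand, the use of ATG as the only start codon, and the default length threshold are all assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=40):
    """Return (start, end) pairs (0-based, end exclusive) of ATG...stop ORFs.

    Only the three forward-strand frames are scanned; a real gene finder
    would also scan the reverse complement and allow alternative starts.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                # count codons from ATG up to (not including) the stop
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

# Toy sequence: ATG, three GCT codons, then a TAA stop, in frame 2
toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "G"
print(find_orfs(toy, min_codons=2))
```

The sketch also shows why the approach is only an approximation: short true genes fall below any length threshold, and overlapping genes in other frames or on the other strand need separate handling.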

Eukaryote genomes pose even more problems for gene detection. Here a gene is not only one block of information from translation start to stop but may be distributed over several exons separated by introns that can span several kilobases of genomic sequence. In addition, in contrast to the densely gene packed prokaryotic genomes, eukaryotic genes are separated from each other by large intergenic regions. These characteristics of eukaryote genes make it even more difficult to define correct gene structures than in prokaryotes. To overcome these limitations, eukaryotic gene prediction programs attempt to integrate the different species-specific gene signatures into likely candidate gene models (Hooper et al. 2000; Makarov 2002; Mathe et al. 2002; Rogic et al. 2001; Zhang 2002).

Here we focus on Dictyostelium discoideum gene features and gene predictions. This organism belongs to the amoebozoa, a branch believed to have diverged early in evolution, perhaps before the separation of animals and plants (Baldauf and Doolittle 1997; Baldauf et al. 2000). This species therefore may be best suited for the identification of ancient genomic features. Moreover, it has a compact genome that makes it easier to study gene features than in large genomes, where these characteristics may be hidden in large regions of, from our current point of view, seemingly senseless DNA. The genome examined is unusual with respect to nucleotide distribution since it is shifted toward a very high A/T content of nearly 80%. Why a genome is forced to develop such extreme nucleotide disparities is still unclear.

Extreme values of such biases in the same direction in different organisms may affect gene signatures in similar ways. To address this question we compare the results obtained with D. discoideum to those with Plasmodium falciparum, the causative agent of malaria. This organism also has a very A/T-rich genome (Gardner et al. 2002) that apparently evolved independently. This independent acquisition of the A/T bias is underlined by the deep branching of the organisms (Plasmodium, Alveolata > Apicomplexa > Haemosporida > Plasmodium; and Dictyostelium, Mycetozoa > Dictyosteliida > Dictyostelium) in the eukaryote phylogeny. Since we now have large amounts of A/T-rich sequences from these two highly biased genomes, we can identify common features of genes in these organisms. We also discuss whether nucleotide biases may have a measurable impact on the evolution of gene features.

Materials and Methods

Gene Prediction Programs

The following gene prediction programs were used: geneid (http://www1.imim.es/geneid.html), GlimmerM (http://www.tigr.org/tdb/glimmerm/glmr_form.html) trained for prediction in P. falciparum, and GenScan (http://genes.mit.edu/GENSCAN.html) trained for A. thaliana gene predictions. To measure accuracy we used the statistics described by Burset and Guigo (1996) and Reese et al. (2000).

Data Sets

To obtain a data set for the training of geneid, we extracted full-length entries of Dictyostelium discoideum genes from the GenBank database. Only gene structures derived from nuclear genes were included; genes of mitochondrial origin were excluded. Furthermore, each gene we included in the data set was unique, i.e., it did not share more than 80% identity on the nucleotide level with other genes in the data set. The final Dictyostelium discoideum training set contains 140 genes, 38 corresponding to single-exon genes and 102 multiexonic genes containing 180 introns. The genes that were predicted on chromosome 2 after training of geneid with the above set were then used for further analysis.

The P. falciparum data set was constructed from the annotated sequence of chromosome 3. This annotation was carried out by the Sanger Institute group using manual curation of the output of different software packages (Hall et al. 2002). The sequence corresponding to this chromosome contained 234 genes, 98 corresponding to single-exon genes, and 136 multiexonic genes containing 379 introns.

Splicing and Translational Signals

Donor and acceptor sites were extracted from the D. discoideum chromosome 2 data set. The information content of each position around the signal was computed as by Burge et al. (1999). The positions taken for each site were −2 to +5 for donors, −15 to −1 for acceptors, and −6 to +5 for the translation start site (0 being the first position downstream of the junction between coding and noncoding sequence and −1 being the first position upstream). Background probabilities were computed at the same positions around every occurrence of the minimal conserved nucleotides for each site (GT for donors, AG for acceptors, and ATG for start sites) in the training set sequences that was not annotated as a functional site. The final score for each site was computed as a log-likelihood ratio of the frequencies at the real annotated sites versus the frequencies of the bases around the nonannotated minimal sites (assuming that the nonannotated sites in these sequences correspond to the distribution of the nonreal sites). Compositional properties were computed as by Guigo (1999) and Parra et al. (2000).
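The log-likelihood scoring described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the geneid implementation: the function names, the pseudocount of 1, and the toy donor windows (8 bases, positions −2 to +5) are hypothetical and chosen only to make the example runnable.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_frequencies(sites, pseudocount=1.0):
    """Per-position base frequencies for equal-length, aligned site windows."""
    total = len(sites) + pseudocount * len(BASES)
    freqs = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs

def log_odds_matrix(real_sites, background_sites):
    """log2(real frequency / background frequency) for each position and base."""
    real = position_frequencies(real_sites)
    back = position_frequencies(background_sites)
    return [{b: math.log2(r[b] / q[b]) for b in BASES} for r, q in zip(real, back)]

def score_site(matrix, candidate):
    """Sum of per-position log-odds: the log-likelihood ratio of one window."""
    return sum(column[base] for column, base in zip(matrix, candidate))

def relative_information(real_sites, background_sites):
    """Per-position relative entropy (bits) of real sites versus background."""
    real = position_frequencies(real_sites)
    back = position_frequencies(background_sites)
    return [sum(r[b] * math.log2(r[b] / q[b]) for b in BASES)
            for r, q in zip(real, back)]

# Hypothetical donor windows; GT occupies indices 2 and 3 in every window
real_donors = ["AGGTAAGT", "AAGTAAGA", "TAGTAAGT"]
pseudo_donors = ["TTGTTTTA", "ATGTATAT", "AAGTTTAA"]

matrix = log_odds_matrix(real_donors, pseudo_donors)
print(score_site(matrix, "AGGTAAGT"))   # resembles the real sites
print(score_site(matrix, "TTGTTTAT"))   # resembles the background sites
print(relative_information(real_donors, pseudo_donors))
```

Because the background windows are themselves centered on GT, the canonical dinucleotide positions contribute nothing to the score; only the flanking positions discriminate real from nonreal sites, which is the quantity of interest in the comparison with metazoan and plant splice signals.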

Evaluation of available prediction programs for their ability to predict correct D. discoideum gene structures was performed using the training set. All data sets used in this study with geneid and the parameters inferred are available at http://genome.imim.es/datasets/D.discoideum/ and http://genome.imim.es/datasets/P.falciparum/.

Results

Gene Prediction in D. discoideum

Investigation of determinants of gene structures requires an experimentally confirmed data set consisting of as many complete gene structures as possible. For D. discoideum we obtained a data set of 140 genes after filtering for a number of features (see Materials and Methods).

With this data set we examined the performance of three gene prediction programs without prior training: geneid (Blanco et al. 2003), GenScan (Burge and Karlin 1997) with Arabidopsis thaliana parameters, and a version of GlimmerM (Salzberg et al. 1999) designed for gene prediction in Plasmodium falciparum. We observed that the three programs worked similarly well on this data set, with GlimmerM being more sensitive than geneid, and geneid more sensitive than GenScan (supplementary material).

Next, we tested a version of geneid specifically trained for gene prediction in D. discoideum. This version showed great accuracy, with 99% of the coding nucleotides correctly predicted and 97% of the nucleotides predicted as coding actually being coding (that is, at the nucleotide level, sensitivity was 0.99 and specificity 0.97). With the trained version of geneid, even small annotated exons missed by the other programs could be detected (Fig. 1). The accuracy of the geneid predictions in D. discoideum is clearly superior to that in human (here the sensitivity as measured in human chromosome 22 is 0.8 and the specificity 0.63; data not shown).
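The nucleotide-level measures quoted above follow the definitions given in the text: sensitivity is the fraction of annotated coding nucleotides that are predicted as coding, and specificity is the fraction of predicted coding nucleotides that are annotated as coding. A minimal Python sketch is given below; the function names and the toy exon coordinates are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
def coding_positions(exons):
    """Expand (start, end) exon intervals (1-based, inclusive) into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}

def nucleotide_accuracy(annotated_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level.

    sensitivity = TP / (TP + FN): annotated coding bases that were predicted
    specificity = TP / (TP + FP): predicted coding bases that are annotated
    """
    annotated = coding_positions(annotated_exons)
    predicted = coding_positions(predicted_exons)
    tp = len(annotated & predicted)
    fn = len(annotated - predicted)
    fp = len(predicted - annotated)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity

# Toy example: a 100-base annotated exon and a prediction shifted slightly
sn, sp = nucleotide_accuracy([(1, 100)], [(2, 103)])
print(round(sn, 2), round(sp, 2))
```

Note that specificity in this sense is what is often called precision elsewhere; it does not involve true negatives.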

Predicted Genes

On chromosome 2 of D. discoideum geneid predicted 2799 genes (> 40 amino acids [aa]) (Glöckner et al. 2002). Since the statistical probability to be a true gene increases with the length of the predicted gene, the low threshold chosen may have led to a slight overestimation of gene numbers. Of the predicted genes 109 (3.7%) were incomplete at the 5′ end and 56 (1.9%) were incomplete at the 3′ end. Since EST data for D. discoideum are scarce, especially for the 5′ end, we could only determine the real gene structure for a minor part of the predicted incomplete genes. In many cases the lack of completeness at the 5′ end may be due to very short portions of coding sequence on the first exons. Incompleteness at the 3′ end results from artificial prediction of a splice donor site shortly before the real stop codon. Generally, the analysis of the incomplete predicted genes showed that the bordering sequences exhibit a very high A/T content in which no obvious coding potential could be detected. Thus, we conclude that only very small portions of the coding region of these partially predicted genes are missed.

Exon–Intron Structure, Splice Signals, and Intergenic Regions

On average the introns of the predicted genes from chromosome 2 comprise 177 bases. Interestingly, this is a significantly higher value than that of the training set, with 132 bases on average. D. discoideum genes also have a relatively small number of introns per gene, comparable to that of Schizosaccharomyces pombe but more than that of Saccharomyces cerevisiae (Deutsch and Long 1999; Kriventseva and Gelfand 1999; Wood et al. 2002). The average number of introns per gene is 1.28. The maximum number of introns in one gene was 15; overall 48 genes (1.7%) have more than 4 introns.

In contrast to most known metazoan and plant genomes, D. discoideum splice signals carry little information other than the GT–AG canonical dinucleotides (Fig. 2). Only the A nucleotide in position +2 of the donor site (+0 being the first position in the intron, occupied by the canonical G) and a G at position +4 appear to be clearly overrepresented with respect to the expected values.

Due to the high number of genes in a relatively small genome, the intergenic regions are short. Here we count the entire space between putative translated gene loci, from the start or stop signal of one gene to that of the next, as intergenic, not taking into account the untranslated but transcribed regions (UTRs) of a gene. On chromosome 2 these “intercoding” regions (ICRs) have an average length of 786 bases. A similar average size was found for the ICRs of chromosome 1. We therefore think that these values are valid for the whole genome. We found that the average size of ICRs differs dependent on the orientation of the adjacent coding regions (Table 1). The translated portions of the genes with neighboring 5′ ends are separated from each other by three times more space than translated portions of genes with neighboring 3′ ends. We then tried to estimate the length of the untranscribed intergenic regions using available EST data and a set of manually annotated genes. Of the 36 genes analyzed, 18 were randomly chosen, and 18 belong to the rasGEF family of proteins (supplementary information). We found that the 36 genes tested have a 3′UTR of 74 bases on average from the end of the coding region to the poly(A) tail, with a standard deviation of 58 bases. Since ESTs of D. discoideum do not represent the 5′ ends of the genes very well, we did not analyze the 5′UTR lengths.

Table 1 Comparison of intercoding regions (ICRs) between different species

Comparison to Organization in Other Species

To analyze the influence of nucleotide biases on gene properties we compared the results obtained with D. discoideum to a second A/T-rich genome, that of Plasmodium falciparum. For this purpose we used the previously annotated chromosome 3 (Bowman et al. 1999). We found the same restriction of the splice signal to the canonical GT–AG motifs with some weak preferences for certain nucleotides near the canonical motifs (supplementary material). This may be partially caused by the extreme A/T values. The mean intron length of 209 bases and the intron A/T content of 87% are in the same range as in D. discoideum.

Figure 3 shows the G/C distribution in introns and coding regions in D. discoideum and P. falciparum. Interestingly, in coding regions of D. discoideum the G/C values follow a Gaussian distribution, whereas the P. falciparum distribution is characterized by a long extension toward higher G/C values.

The gene density in P. falciparum is much lower than that of D. discoideum. Here the mean distance between the coding regions is around three times the distance in S. pombe and D. discoideum. Since the mean gene and intron lengths differ only slightly between both organisms, only different mean lengths of ICRs are causative for the different observed gene densities. We found that as in D. discoideum the average size of ICRs differs dependent on the orientation of the adjacent coding regions (Table 1).

Interestingly, for both organisms the A/T content of ICRs is the same, at 86%. In contrast, the coding sequences (CDS) in P. falciparum exhibit a more extreme bias toward the usage of codons with A and T nucleotides (data not shown). This preference results in a lower G/C content in CDS in P. falciparum (24%) than in D. discoideum (27%).

We examined the ICRs of two further organisms, Arabidopsis thaliana and S. cerevisiae. The plant species exhibits a gene density comparable to that of P. falciparum, at 4.4 kb/gene (The A. thaliana Genome Initiative, 2000), whereas S. cerevisiae constitutes an extreme, at 1.8 kb/gene (Goffeau et al. 1996). The A/T content of both species does not reach the extremes seen in D. discoideum or P. falciparum. For our analysis of ICRs we used A. thaliana chromosome 4 and S. cerevisiae chromosome 4. As can be seen in Table 1, the ICR length depends on the gene orientation in all genomes irrespective of the gene density. The ratio of the ICR lengths can be roughly given as 1 (3′/3′):2 (3′/5′):3 (5′/5′) for all organisms but S. cerevisiae, if we take the median instead of the mean values into account (Table 1).
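The orientation-dependent ICR measurement can be sketched in Python as follows. This is an illustrative reconstruction, not the script used for Table 1: the gene tuple layout `(cds_start, cds_end, strand)`, the function names, and the toy coordinates are assumptions, and overlapping coding regions are simply skipped.

```python
from statistics import median

def intercoding_regions(genes):
    """Group ICR lengths by the orientation of the flanking coding regions.

    genes: list of (cds_start, cds_end, strand) on one chromosome, with
    1-based inclusive coordinates and strand '+' or '-'.
    """
    ordered = sorted(genes)
    icrs = {"3'-3'": [], "3'-5'": [], "5'-5'": []}
    for (_, end1, strand1), (start2, _, strand2) in zip(ordered, ordered[1:]):
        gap = start2 - end1 - 1
        if gap < 0:
            continue  # overlapping coding regions have no ICR
        if strand1 == strand2:
            key = "3'-5'"   # tandem genes: a 3' end faces a 5' end
        elif strand1 == "+":
            key = "3'-3'"   # convergent genes: tail to tail
        else:
            key = "5'-5'"   # divergent genes: head to head
        icrs[key].append(gap)
    return icrs

def median_ratio(icrs):
    """Median ICR length per orientation, relative to the 3'-3' class."""
    base = median(icrs["3'-3'"])
    return {key: median(lengths) / base for key, lengths in icrs.items() if lengths}

# Toy chromosome with one ICR of each class (hypothetical coordinates)
toy_genes = [(1, 1000, "+"), (1101, 2000, "-"), (2301, 3000, "+"), (3201, 4000, "+")]
print(median_ratio(intercoding_regions(toy_genes)))
```

Using the median rather than the mean makes the ratio less sensitive to a few very long ICRs, which is the reason given in the text for preferring it.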

Discussion

Since research on Dictyostelium discoideum focuses mainly on cytoskeleton, signal transduction, and development, the described genes and therefore the gene features of the training set for this organism may not represent the average gene of this organism. Previous studies are limited by the preselection of certain gene families (Rivero 2002). Future gene predictions and genome analyses have to include also genes from the primary metabolism and other functional groups. This way, gene prediction will become more accurate and efforts at manual curation of the predicted gene data set will be minimized. Despite these limitations the analysis presented here gives some insight into the gene organization of lower eukaryotes with a high A/T bias.

A bias toward a preferred use of A and T nucleotides can be found in several organisms belonging to well-separated branches in the tree of life (Glöckner 2000). These organisms evolved these more or less pronounced biases in the same direction independently. This genomic bias also shows local effects in that it overlays the gene features that are common to all organisms. The D. discoideum genome exhibits one of the most pronounced biases, with an A/T content of more than 78%. This value is only surpassed by that of P. falciparum, with an A/T nucleotide content of 80%.

Despite the clear similarities in gene specification and gene organization between the two A/T-rich genomes, gene signatures still appear to be quite specific. Table 2 shows the accuracy on P. falciparum sequences of the version of geneid trained on D. discoideum sequences, and vice versa. Surprisingly, predictions using the D. discoideum version of geneid on P. falciparum sequences were substantially worse than predictions using the P. falciparum version of geneid on D. discoideum sequences. This could be attributed to the more Gaussian distribution of G/C content in D. discoideum, indicating more homogeneous gene features in this species than in P. falciparum. A/T richness alone thus may not be a key feature that influences gene structure signals.

Table 2 Performance of GeneID in A/T-rich organisms

Cells are also able to transcribe and translate foreign genes if they do not have a very different base composition as can be seen from cross-species complementation. Evolutionary constraints therefore led to a well-conserved transcription/splicing/translation machinery. This conservation should enable a general approach for gene detection in unknown sequences. Yet currently available software tools have to be individually trained for a given organism for optimal performance. This analysis shows that, even if extreme compositional biases are shared between organisms, training for the specific organism is indispensable. Thus, gene recognition in the cell is different from bioinformatic approaches and may use more widespread, not yet detected signals.

Like all other lower eukaryotes for which sequence data are known, D. discoideum has only short introns. Surprisingly, the mean intron length of the training set was 132 bases, much shorter than that of all chromosome 2 genes (177 bases). This difference is in part due to predicted long (>1000 bases) introns in cases where geneid was not able to resolve the correct gene structure. Manual inspection of all 74 introns larger than 1000 bases revealed that at least 90% of them are artifacts. These artificial introns are the main source of the observed difference between intron lengths of the training set genes and those of the chromosome 2 genes.

Although introns are small and rare in the D. discoideum genome, they could slightly contribute to the overall A/T bias of the genome. The canonical GT–AG motif at the donor and acceptor sites of introns is conserved with only small deviations throughout the tree of life. In D. discoideum the requirements for proper splicing seem to be almost entirely restricted to this common motif. Besides these common sites, only weak preferences for A nucleotides downstream of the donor site could be detected in D. discoideum. The further positions are only slightly favored by a (A/T)GT motif. A similar picture was obtained when analyzing P. falciparum splice sites (supplementary material). This is in contrast to metazoa and plants, where additional nucleotides strongly contribute to a consensus motif. The splice apparatus therefore has to be able to correctly detect and process these signals in spite of the relative weakness of the signals compared to other organisms. Possibly the difference in A/T content between intron sequences and coding sequences, as can be deduced from the different nucleotide compositions, is used by the cell as an additional signal. Because introns lack special requirements concerning sequence motifs, the entire intron sequence is able to contribute to the high mean A/T content in the genome of 78%.

For S. pombe as well as for S. cerevisiae it was shown that there is a strong bias of intron positions toward the 5′ end of the genes (Sakurai et al. 2002). This is also the case for the introns of D. discoideum. The mechanism by which the preferred 5′ localization of introns is achieved is not yet entirely understood (Fink 1987; Mourier and Jeffares 2003; Wood et al. 2002). Considering the early branching of the mycetozoa, to which D. discoideum belongs (Baldauf and Doolittle 1997), this mechanism may have been established relatively early in evolution.

ICRs in D. discoideum are relatively short. The length of these regions depends also on the orientation of the adjacent genes (Table 1). On average half of the space between tail-to-tail-oriented coding regions is occupied by the 3′UTRs of the genes, as could be identified using available EST data. Thus, the real mean distance between transcribed units in the tail-to-tail orientation is only 150 bases. The situation for 5′ ends remains unclear since too few confirmed full-length mRNA sequences are available. But analyses from other organisms suggest that 5′UTRs are shorter than 3′UTRs (Jareborg et al. 1999). If this is also true for D. discoideum genes, the function of the additional space of 300 bases/gene required for genes in the head-to-head orientation cannot be attributed to long 5′UTRs.

The orientation-dependent characteristic of intergenic spacer regions was also described for S. pombe (Wood et al. 2002), an organism with the same high gene density as D. discoideum. The respective values for both organisms are in the same range; thus the contribution of gene orientation to the size of intergenic regions is independent of nucleotide biases. Interestingly, in S. cerevisiae, which has the highest gene density among eukaryotes measured so far, the size of the ICRs is practically not correlated with the orientation of adjacent genes. This may be because such a high gene density reduces the ICRs to a minimum.

P. falciparum has a considerably lower gene density than D. discoideum or S. pombe, but the CDS and introns have a comparable size. Thus, longer ICRs, which occupy around three times as much space in P. falciparum as in D. discoideum, mainly cause the lower gene density. For the ICRs we also found a gene orientation-dependent size (Table 1). Interestingly, for both organisms the ratios for the orientations 3′–3′, 3′–5′, and 5′–5′ follow the rule 1 to 2 to 3. The same holds true for the A. thaliana ICRs, if we calculate the median of the ICR sizes. Thus, for gene densities between 2.5 and 4.8 kb/gene there seems to exist a rule for proper gene spacing. The deviation of the mean values from this rule may reflect the more relaxed gene spacing in the A. thaliana genome. In fact, for the human genome with its very large ICRs, we could not detect such a rule (data not shown). S. cerevisiae as the other extreme shows compressed ICRs and also does not follow this rule.

If we view a genome as concatenated blocks of gene regions, we can conclude that the 5′ region down to the start codon of such a block occupies two times as much space as the 3′ region from the stop to the end, irrespective of the absolute length. Analysis of further compact genomes, preferably of diverse protists, is required to determine whether the 1:2:3 size distribution plays an important role in genome organization or is an intrinsic property of a gene block.

To achieve a higher A/T content than the D. discoideum genome, P. falciparum would have had several possibilities: (i) extension of intergenic regions, (ii) elevated A/T values in distinct sections of the genome, and (iii) higher A/T values all over the genome. The 3% higher overall A/T content of P. falciparum compared to D. discoideum can be attributed to the first and second possibilities, since CDS show a higher A/T content, and the intergenic regions are three times as long as in D. discoideum, but show the same mean A/T content of 86%. Even if the ICRs of D. discoideum were adjusted to match the length of ICRs in P. falciparum, the A/T content of this genome would be only 78.8%. Thus, about half of the higher A/T bias in the P. falciparum genome can be attributed to the longer intergenic regions; the other half is caused by the nucleotide bias in CDS. It is not easy to understand why, in the case of P. falciparum, a contribution of CDS nucleotide bias to the overall genome bias was preferred over elongation or achievement of higher A/T values for intergenic regions. Due to the need to retain functionality, coding sequences are commonly subject to stronger evolutionary pressures than intergenic regions. If natural selection were the cause for nucleotide biases in genomes, the organism would tend to minimize the costs of this shift. Therefore, one could speculate that a compact genome was preferable to extended intergenic regions. Additionally, the value of 86% A/T might be the upper threshold which can be reached over several kilobases of DNA.
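The reasoning about how longer ICRs raise the overall A/T content rests on a length-weighted average over genome compartments. The Python sketch below illustrates that arithmetic only. The A/T fractions are taken from the text (86% for ICRs; 73% for D. discoideum CDS, from the 27% G/C content), but the compartment lengths are hypothetical, introns are ignored, and the function name is an assumption, so the result is not expected to reproduce the 78.8% figure.

```python
def genome_at_content(compartments):
    """Length-weighted mean A/T fraction.

    compartments: list of (length_in_bases, at_fraction) pairs, one per
    genome compartment (e.g., coding sequence, intercoding regions).
    """
    total_length = sum(length for length, _ in compartments)
    return sum(length * at for length, at in compartments) / total_length

# Hypothetical per-gene lengths; A/T fractions as quoted in the text
cds_length, cds_at = 3000, 0.73
icr_length, icr_at = 800, 0.86

compact = genome_at_content([(cds_length, cds_at), (icr_length, icr_at)])
extended = genome_at_content([(cds_length, cds_at), (3 * icr_length, icr_at)])
print(f"compact ICRs: {compact:.3f}, threefold longer ICRs: {extended:.3f}")
```

Lengthening the A/T-rich compartment pulls the genome-wide average toward 86% without changing the composition of any compartment, which is the first of the three possibilities listed above.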

Conclusion

Gene recognition patterns seem to be conserved over large phylogenetic distances, but they also appear to have some taxon-specific component. Moreover, gene structure properties are only slightly influenced by nucleotide biases in a genome. Intron and intergenic size distributions show no differences in A/T-rich versus normal genomes. Furthermore, space requirements before and after genes seem to follow a strict rule which is only weakened in large or very compact genomes. Thus, A/T richness is only one additional feature of a genome that is superimposed on regular features found in all genomes.

References

InstitutionalAuthorNameThe Arabidopsis thaliana Genome Initiative (2000) ArticleTitleAnalysis of the genome sequence of the flowering plant Arabidopsis thaliana Nature 408 796–815
Google Scholar
SL Baldauf WF Doolittle (1997) ArticleTitleOrigin and evolution of the slime molds (Mycetozoa) Proc Natl Acad Sci USA 94 12007–12012 Occurrence Handle10.1073/pnas.94.22.12007 Occurrence Handle1:CAS:528:DyaK2sXntFSmsb8%3D Occurrence Handle9342353
Article CAS PubMed Google Scholar
SL Baldauf AJ Roger I Wenk-Siefert WF Doolittle (2000) ArticleTitleA kingdom-level phylogeny of eukaryotes based on combined protein data Science 290 972–977 Occurrence Handle10.1126/science.290.5493.972 Occurrence Handle1:CAS:528:DC%2BD3cXnvVWksL0%3D Occurrence Handle11062127
Article CAS PubMed Google Scholar
E Blanco G Parra R Guigo (2003) Using GeneID to identify genes A Baxevanis (Eds) Current protocols in bioinformatics John Wiley & Sons New York
Google Scholar
S Bowman D Lawson D Basham et al. (1999) ArticleTitleThe complete nucleotide sequence of chromosome 3 of Plasmodium falciparum Nature 400 532–538
Google Scholar
C Burge S Karlin (1997) ArticleTitlePrediction of complete gene structures in human genomic DNA J Mol Biol 268 78–94 Occurrence Handle10.1006/jmbi.1997.0951 Occurrence Handle1:CAS:528:DyaK2sXjtlSqtL4%3D Occurrence Handle9149143
Article CAS PubMed Google Scholar
CB Burge T Tuschl PS Sharp (1999) Splicing precursors to mRNAs by the splicesomes RF Gesteland TR Cech JF Atkins (Eds) The RNA world Cold Springer Harbor Laboratory Press Cold Spring Harbor, NY 525–560
Google Scholar
M Burset R Guigo (1996) ArticleTitleEvaluation of gene structure prediction programs Genomics 34 353–367
Google Scholar
M Deutsch M Long (1999) ArticleTitleIntron-exon structures of eukaryotic model organisms Nucleic Acids Res 27 3219–3228
Google Scholar
GR Fink (1987) ArticleTitlePseudogenes in yeast? Cell 49 5–6
Google Scholar
M Fuhrmann W Oertel P Hegemann (1999) ArticleTitleA synthetic gene coding for the green fluorescent protein (GFP) is a versatile reporter in Chlamydomonas reinhardtii Plant J 19 353–361
Google Scholar
MJ Gardner, N Hall, E Fung, O White, M Berriman, RW Hyman, JM Carlton, A Pain, KE Nelson, S Bowman, IT Paulsen, K James, JA Eisen, K Rutherford, SL Salzberg, A Craig, S Kyes, MS Chan, V Nene, SJ Shallom, B Suh, J Peterson, S Angiuoli, M Pertea, J Allen, J Selengut, D Haft, MW Mather, AB Vaidya, DM Martin, AH Fairlamb, MJ Fraunholz, DS Roos, SA Ralph, GI McFadden, LM Cummings, GM Subramanian, C Mungall, JC Venter, DJ Carucci, SL Hoffman, C Newbold, RW Davis, CM Fraser, B Barrell (2002) Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 419:498–511. doi:10.1038/nature01097 PMID:12368864
AJ Gentles, S Karlin (2001) Genome-scale compositional comparisons in eukaryotes. Genome Res 11:540–564
G Glöckner (2000) Large scale sequencing and analysis of AT rich eukaryote genomes. Curr Genomics 1:289–299
G Glöckner, L Eichinger, K Szafranski, JA Pachebat, AT Bankier, PH Dear, R Lehmann, C Baumgart, G Parra, JF Abril, R Guigo, K Kumpf, DGS Consortium, E Cox, MA Quail, M Platzer, A Rosenthal, AA Noegel (2002) Sequence and analysis of chromosome 2 of Dictyostelium discoideum. Nature 418:79–85
A Goffeau, BG Barrell, H Bussey, RW Davis, B Dujon, H Feldmann, F Galibert, JD Hoheisel, C Jacq, M Johnston, EJ Louis, HW Mewes, Y Murakami, P Philippsen, H Tettelin, SG Oliver (1996) Life with 6000 genes. Science 274:563–567
R Guigo (1999) DNA composition, codon usage and exon prediction. In: M Bishop (Eds) Genetic databases. Academic Press, San Diego, CA, pp 53–80
N Hall, A Pain, M Berriman, C Churcher, B Harris, D Harris, K Mungall, S Bowman, R Atkin, S Baker, A Barron, K Brooks, CO Buckee, C Burrows, I Cherevach, C Chillingworth, T Chillingworth, Z Christodoulou, L Clark, R Clark, C Corton, A Cronin, R Davies, P Davis, P Dear, F Dearden, J Doggett, T Feltwell, A Goble, I Goodhead, R Gwilliam, N Hamlin, Z Hance, D Harper, H Hauser, T Hornsby, S Holroyd, P Horrocks, S Humphray, K Jagels, KD James, D Johnson, A Kerhornou, A Knights, B Konfortov, S Kyes, N Larke, D Lawson, N Lennard, A Line, M Maddison, J McLean, P Mooney, S Moule, L Murphy, K Oliver, D Ormond, C Price, MA Quail, E Rabbinowitsch, MA Rajandream, S Rutter, KM Rutherford, M Sanders, M Simmonds, K Seeger, S Sharp, R Smith, R Squares, S Squares, K Stevens, K Taylor, A Tivey, L Unwin, S Whitehead, J Woodward, JE Sulston, A Craig, C Newbold, BG Barrell (2002) Sequence of Plasmodium falciparum chromosomes 1, 3–9 and 13. Nature 419:527–531
PM Hooper, H Zhang, DS Wishart (2000) Prediction of genetic structure in eukaryotic DNA using reference point logistic regression and sequence alignment. Bioinformatics 16:425–438
N Jareborg, E Birney, R Durbin (1999) Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res 9:815–824. doi:10.1101/gr.9.9.815 PMID:10508839
S Karlin (2001) Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol 9:335–343. doi:10.1016/S0966-842X(01)02079-0 PMID:11435108
EV Kriventseva, MS Gelfand (1999) Statistical analysis of the exon-intron structure of higher and lower eukaryote genes. J Biomol Struct Dyn 17:281–288
V Makarov (2002) Computer programs for eukaryotic gene prediction. Brief Bioinform 3:195–199
C Mathe, MF Sagot, T Schiex, P Rouze (2002) Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 30:4103–4117
T Mourier, DC Jeffares (2003) Eukaryotic intron loss. Science 300:1393
G Parra, E Blanco, R Guigo (2000) GeneID in Drosophila. Genome Res 10:511–515
MG Reese, G Hartzell, NL Harris, U Ohler, JF Abril, SE Lewis (2000) Genome annotation assessment in Drosophila melanogaster. Genome Res 10:483–501
F Rivero (2002) mRNA processing in Dictyostelium: Sequence requirements for termination and splicing. Protist 153:169–176
S Rogic, AK Mackworth, FB Ouellette (2001) Evaluation of gene-finding programs on mammalian sequences. Genome Res 11:817–832. doi:10.1101/gr.147901 PMID:11337477
A Sakurai, S Fujimori, H Kochiwa, S Kitamura-Abe, T Washio, R Saito, P Carninci, Y Hayashizaki, M Tomita (2002) On biased distribution of introns in various eukaryotes. Gene 300:89–95
SL Salzberg, M Pertea, AL Delcher, MJ Gardner, H Tettelin (1999) Interpolated Markov models for eukaryotic gene finding. Genomics 59:24–31
V Wood, R Gwilliam, MA Rajandream et al. (2002) The genome sequence of Schizosaccharomyces pombe. Nature 415:871–880. doi:10.1038/nature724
MQ Zhang (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3:698–709

Acknowledgment

We thank Matthias Platzer for carefully revising the manuscript.

Author information

Authors and Affiliations

Department of Genome Analysis, Institute for Molecular Biotechnology (IMB) Jena, Beutenbergstr. 11, Jena, D-07745, Germany
Karol Szafranski, Rüdiger Lehmann & Gernot Glöckner
Institut Municipal d’Investigació Mèdica, Universitat Pompeu Fabra, Centre de Regulació Genòmica, Grup de Recerca en Informatica Biomedica (GRIB), Pg. Maritim de la Barceloneta 37-49, Barcelona, E-08003, Spain
Genis Parra & Roderic Guigo

Corresponding author

Correspondence to Gernot Glöckner.

Additional information

Reviewing Editor: Dr. Manyuan Long

About this article

Cite this article

Szafranski, K., Lehmann, R., Parra, G. et al. Gene Organization Features in A/T-Rich Organisms. J Mol Evol 60, 90–98 (2005). https://doi.org/10.1007/s00239-004-0201-2

Received: 19 December 2003
Accepted: 18 August 2004
Issue Date: January 2005
DOI: https://doi.org/10.1007/s00239-004-0201-2

Introduction